Deep Learning is a powerful technology but you might want to try some “shallow” approaches before you dive in.

Neural networks are made up of neurones and synapses

It’s unquestionable that over the last decade, deep learning has changed the machine learning landscape for the better. Deep Neural Networks (DNNs), first popularised by Yann LeCun, Yoshua Bengio and Geoffrey Hinton, are a family of machine learning models that are capable of learning to see and categorise objects, predict stock market trends, understand written text and even play video games.

<div>
</div>

<div>
  <h3>
    Buzzwords like “LSTM” and “GAN” sound very cool but are they the right fit for purpose for your business problem?
  </h3>
</div>

<div>
  Neural Networks are (very loosely) modelled on the human brain: a series of <a href="https://en.wikipedia.org/wiki/Neuron">neurones</a> that pass signals to each other through synapses. Given recent news about deep learning and AI, you’d be forgiven for thinking that Deep Learning can do anything and everything and make humans all but obsolete. However, there are still lots of things it can’t master. Buzzwords like “LSTM” and “GAN” sound very cool but are they the right fit for purpose for your business problem?
</div>

<div>
</div>

<h2>
  Why is Training Data Important for Deep Learning?
</h2>

<div>
  Neural Networks learn by backpropagation: an iterative process whereby the system makes a prediction, gets feedback about how wrong it was and adjusts its internal model to reduce that error. With enough examples, the system learns to give the correct answer. It&#8217;s actually very similar to how children learn. Over time, an infant will learn to associate noises with sights. The more a parent says &#8220;Mummy&#8221; or &#8220;Daddy&#8221;, the more the child&#8217;s brain learns that these are important words. If the child points at their father and says &#8220;Mummy&#8221;, they will likely be corrected &#8211; &#8220;no, that&#8217;s your Daddy&#8221; &#8211; and gradually they start to learn the correct association.
</div>
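
The predict, get feedback, adjust loop can be sketched in a few lines of Python. This is a minimal illustration rather than a real deep network: it trains a single sigmoid neurone by gradient descent, which is the one-layer special case of backpropagation. The toy dataset (the logical OR of two inputs), the function names and the learning rate are all assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    """The neurone's current guess for one example."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neurone(examples, epochs=2000, learning_rate=0.5, seed=0):
    """Repeatedly predict, measure the error and nudge the weights."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.5, 0.5) for _ in examples[0][0]]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            # 1. Make a prediction with the current internal model.
            prediction = predict(weights, bias, inputs)
            # 2. Feedback: how far off was the prediction?
            error = prediction - target
            # 3. Adjust each weight slightly in the direction that reduces the error.
            for i, x in enumerate(inputs):
                weights[i] -= learning_rate * error * x
            bias -= learning_rate * error
    return weights, bias


# Toy training data (hypothetical): inputs paired with the desired output.
OR_EXAMPLES = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

if __name__ == "__main__":
    weights, bias = train_neurone(OR_EXAMPLES)
    for inputs, target in OR_EXAMPLES:
        print(inputs, "target:", target, "prediction:", round(predict(weights, bias, inputs), 3))
```

Even this four-example problem takes thousands of small corrections before the predictions settle, which hints at why networks with millions of weights need so much more data and time.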

<div>
</div>

<div>
  <p>
    <figure id="attachment_320" aria-describedby="caption-attachment-320" style="width: 218px" class="wp-caption alignright"><img loading="lazy" class="wp-image-320 size-medium" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/papapishu-Baby-boy-sitting.png?resize=218%2C300&#038;ssl=1" alt="" width="218" height="300" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/papapishu-Baby-boy-sitting.png?resize=218%2C300&ssl=1 218w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/papapishu-Baby-boy-sitting.png?resize=768%2C1059&ssl=1 768w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/papapishu-Baby-boy-sitting.png?resize=743%2C1024&ssl=1 743w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/papapishu-Baby-boy-sitting.png?w=1741&ssl=1 1741w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/papapishu-Baby-boy-sitting.png?w=1320&ssl=1 1320w" sizes="(max-width: 218px) 100vw, 218px" data-recalc-dims="1" /><figcaption id="caption-attachment-320" class="wp-caption-text">machines learn by example &#8211; just like babies.</figcaption></figure>
  </p>
</div>

<div>
</div>

<div>
  The thing about neural network back-propagation (and human learning) is that it takes time and lots of experience. Imagine if humans took everything we heard as the absolute truth the first time we heard it. We’d be a race stuck with some terrible, incorrect opinions and assumptions OR we’d flip-flop between different points of view as they are presented to us. We learn by sampling experiences from many different sources and trying to generalise across them. Babies need to hear “mummy” and “daddy” hundreds or even thousands of times before it starts to sink in that the noisy signals that their ears are receiving have some higher significance. It takes the average human 10-15 months to learn to walk and 18-36 months to learn to talk. We don’t just “pick things up” after one exposure to a concept; it takes our brains time to connect the dots and to understand correlations. The same is true of deep neural networks. This “thirst” for data and its associated drawbacks, like the need for huge amounts of compute power, can make deep learning a sub-optimal solution in a number of cases.
</div>

<div>
</div>

<div>
  But never fear! Classical machine learning approaches like <a href="https://en.wikipedia.org/wiki/Support_vector_machine">SVM</a>, <a href="https://en.wikipedia.org/wiki/Random_forest">Random Decision Forest</a> or <a href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier">Bayesian Classifiers</a> may have fallen out of vogue but they often present a viable and appropriate solution in cases where “deep learning” won’t work.
</div>

<div>
</div>

<h2>
  <b>Deep Learning or Classical Learning?</b>
</h2>

<div>
  Deciding whether to use deep learning ultimately comes down to a trade-off between how much data and compute power you can get your hands on vs how much time your engineers have to spend on the problem and how well they understand the problem.
</div>

<h3>
  <span>Most data scientists will prefer to KISS than to charge in with a deep learning model. </span>
</h3>

<div>
</div>

<div>
  Here are 3 rules of thumb for deciding whether to use deep learning or not. You should consider classical models if at least one of these is true:
</div>

<div>
</div>

<ol>
  <li>
    <div>
      You have an experienced data science team who understand feature engineering and the data they’re being asked to model or at the very least can get hold of people that understand the data.
    </div>
  </li>
  
  <li>
    <div>
      You don’t have access to GPUs and large amounts of compute power, or hardware and computing power are at a premium.
    </div>
  </li>
  
  <li>
    <div>
      You don’t have lots of data (i.e. you have 100 or 1000 examples rather than 100k or 1 million).
    </div>
  </li>
</ol>

<div>
  Before we dive into those, I also have a rule zero: KISS &#8211; keep it simple, stupid. Most data scientists will prefer to KISS than to charge in with a deep learning model. If you start with a classical model and don’t get the performance that you need, a neural network could be a great secondary avenue. <a href="https://developers.google.com/machine-learning/guides/rules-of-ml/#ml_phase_i_your_first_pipeline">Google’s data science community holds the same point of view.</a> If you are considering DNNs then now’s the time to consider our other rules of thumb.
</div>
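
Rule zero and the three rules of thumb can be written down as a small decision helper. This is only a sketch of the reasoning above, not a formal method: the `Project` fields, the function names and especially the `SMALL_DATASET_THRESHOLD` cut-off are assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Assumed cut-off: the rules contrast "100 or 1000" examples with
# "100k or 1 million", so the exact boundary here is an illustrative guess.
SMALL_DATASET_THRESHOLD = 100_000


@dataclass
class Project:
    has_feature_engineering_expertise: bool  # rule 1
    has_gpu_access: bool                     # rule 2
    n_labelled_examples: int                 # rule 3


def reasons_to_stay_classical(project):
    """Return every rule of thumb that applies; one is enough."""
    reasons = []
    if project.has_feature_engineering_expertise:
        reasons.append("the team understands the data and feature engineering")
    if not project.has_gpu_access:
        reasons.append("compute power is scarce or expensive")
    if project.n_labelled_examples < SMALL_DATASET_THRESHOLD:
        reasons.append("the dataset is small")
    return reasons


def recommend(project, baseline_meets_target=None):
    """Rule zero first (KISS), then the three rules of thumb."""
    if baseline_meets_target is None:
        # Nothing has been tried yet, so start simple.
        return "start with a classical baseline"
    if baseline_meets_target:
        return "keep the classical model"
    # The baseline fell short, so now the other rules come into play.
    if reasons_to_stay_classical(project):
        return "keep improving the classical model"
    return "try a deep neural network"


if __name__ == "__main__":
    startup = Project(has_feature_engineering_expertise=True,
                      has_gpu_access=False,
                      n_labelled_examples=800)
    print(recommend(startup))
    print(recommend(startup, baseline_meets_target=False))
    print(reasons_to_stay_classical(startup))
```

In practice each answer is a judgment call rather than a boolean, but writing the rules out this way makes the "at least one of these is true" logic explicit.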

<div>
</div>


<h2>
  <b>1: Data Scientists, Feature Engineering and Understanding the data</b>
</h2>

<div>
  Features are fundamental properties of data. Think of an apple: features include colour e.g. red/green/brown (eww), size and sweetness (Granny Smiths are tart, Golden Delicious are sweet). In traditional machine learning, a huge amount of manual work is invested in feature engineering. The data scientist needs to understand a) the problem the model is trying to solve, b) the data being fed into the system and c) the attributes or ‘features’ of that data that are relevant for solving the problem. For example, a model that predicts house prices may need to know how many bedrooms the house has and the year it was built but may not care which way around the toilet roll holder was installed in the bathroom. The data scientist needs to tweak the data that the model receives, turning features on and off in order to generate the most accurate results.
</div>
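
To make "turning features on and off" concrete, here is a small sketch using the house price example. The record, the field names, the derived `age_years` feature and the reference year are all hypothetical; a real project would feed the resulting vectors into a model and compare accuracy for each feature set.

```python
# Assumed "current" year, used only to derive a house's age.
REFERENCE_YEAR = 2018

# A hypothetical raw record describing one house.
HOUSE = {
    "bedrooms": 3,
    "year_built": 1995,
    "toilet_roll_over": True,  # almost certainly irrelevant to price
}


def engineer_features(record):
    """Add hand-crafted features that might be more useful than the raw data."""
    features = dict(record)
    features["age_years"] = REFERENCE_YEAR - record["year_built"]
    return features


def to_feature_vector(record, enabled_features):
    """Keep only the features that are switched on, as numbers a model can use."""
    features = engineer_features(record)
    return [float(features[name]) for name in enabled_features]


if __name__ == "__main__":
    # Two candidate feature sets the data scientist might compare.
    print(to_feature_vector(HOUSE, ["bedrooms", "age_years"]))
    print(to_feature_vector(HOUSE, ["bedrooms", "year_built", "toilet_roll_over"]))
```

Choosing which names go into `enabled_features` is exactly the manual judgment that a deep learning model tries to replace by learning features for itself.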

<h3>
  <span>&#8230;a deep learning model may be able to learn features of the data that data scientists can&#8217;t but if a hand-engineered model gets you to 90% accuracy, is the extra data gathering and compute power worth it&#8230;?</span>
</h3>

<div>
</div>

<div>
  Traditionally, feature engineering is a very manual process that requires experienced data scientists who understand the data and can make good inferences about how the model might react to data changes. Even with experienced data scientists, this activity can be more of an art than a science and is often very time-consuming. It is also really important that the data scientist understands the classifier’s purpose and has good intuitions about which parts of the input might have a bigger effect on model accuracy. If the data scientist doesn’t have this information then they typically work very closely with a domain expert. For example, Filament doesn’t employ <a href="https://www.filament.ai/case-studies/improving-airport-operations-data-science/">aerospace logistics specialists</a> or <a href="https://www.filament.ai/case-studies/revolutionising-deal-origination-machine-learning/">private equity investors</a> but we were able to create useful models for our clients by working collaboratively with their teams of experts. For some problems it is possible to guess the most effective features and <a href="https://www.filament.ai/ai-suite/engine/">use software to automatically tune the model iteratively.</a>
</div>

<div>
</div>

<div>
  <p>
    <figure id="attachment_322" aria-describedby="caption-attachment-322" style="width: 300px" class="wp-caption alignright"><img loading="lazy" class="size-medium wp-image-322" src="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/settings.png?resize=300%2C215&#038;ssl=1" alt="" width="300" height="215" srcset="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/settings.png?resize=300%2C215&ssl=1 300w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/settings.png?resize=768%2C549&ssl=1 768w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/settings.png?resize=1024%2C733&ssl=1 1024w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/settings.png?w=1320&ssl=1 1320w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/settings.png?w=1980&ssl=1 1980w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" /><figcaption id="caption-attachment-322" class="wp-caption-text">feature engineering is traditionally a very manual process</figcaption></figure>
  </p>
  
  <p>
    Conversely, one of the most exciting things about “deep learning” is that these models are able to learn complex features for themselves over time. Just like a human brain slowly assigns meaning to the seemingly random photons that hit our retinas, deep networks are able to receive series of pixels from images and slowly learn which patterns of pixels are interesting or predictive. The caveat is that automatically deriving these features requires huge volumes of data to learn from (see point 3). Ultimately a deep learning model may be able to implicitly learn features of the data that human data scientists are unable to isolate but if a classical, hand-engineered model gets you to 90% accuracy, is the extra data gathering and compute power worth it for that 5-7% boost?
  </p>
</div>

<div>
</div>

<h2>
  <b>2: Compute Power Requirements</b>
</h2>

<div>
  Deep learning models usually consist of a vast number of neurones and synapses connected together in layers stacked on top of each other (hence the ‘deep’). The more neurones there are, the more connections there are between them and the more calculations the neural network has to make during training and usage. Classical models are typically orders of magnitude simpler and thus much faster to train and use. DNNs are often so complex and resource intensive that they <a href="https://brainsteam.co.uk/2018/05/13/gpus-are-not-just-for-images-any-more/">require special hardware</a> in order to train and run.
</div>
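
A quick back-of-the-envelope calculation shows how fast the number of connections grows. The sketch below counts the weights and biases in a fully connected network; the layer sizes (a 28x28 pixel image as input, ten output classes, two hidden layers of 512 neurones) are assumptions picked for illustration, and real architectures vary widely.

```python
def dense_parameter_count(layer_sizes):
    """Count weights plus biases in a fully connected network.

    Each pair of adjacent layers contributes n_in * n_out weights
    (the synapses) and n_out biases (one per receiving neurone).
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical sizes: 784 inputs (28x28 pixels) and 10 output classes.
LINEAR_MODEL = [784, 10]                # no hidden layers, like a classical linear classifier
SMALL_DEEP_MODEL = [784, 512, 512, 10]  # two modest hidden layers

if __name__ == "__main__":
    print("linear model parameters:", dense_parameter_count(LINEAR_MODEL))
    print("small deep model parameters:", dense_parameter_count(SMALL_DEEP_MODEL))
```

Every one of those parameters is touched on each prediction and updated on each training step, so even this small network does far more arithmetic per example than the linear model.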

<div>
</div>

<div>
  <p>
    <figure id="attachment_274" aria-describedby="caption-attachment-274" style="width: 300px" class="wp-caption alignleft"><img loading="lazy" class="wp-image-274 size-medium" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/Video-card.png?resize=300%2C218&#038;ssl=1" alt="" width="300" height="218" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/Video-card.png?resize=300%2C218&ssl=1 300w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/Video-card.png?resize=768%2C557&ssl=1 768w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/Video-card.png?resize=1024%2C743&ssl=1 1024w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/Video-card.png?w=1320&ssl=1 1320w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/Video-card.png?w=1980&ssl=1 1980w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" /><figcaption id="caption-attachment-274" class="wp-caption-text">Deep Neural Networks tend to rely on GPUs for their computational requirements</figcaption></figure>
  </p>
  
  <p>
    It often makes sense to prefer simpler models where compute resource is at a premium or simply not available and where classical models give “good enough” accuracy, for example in an edge computing environment in a factory or in an anti-fraud solution at a retail bank where millions of transactions must be examined in real time. It would be either impossible or obscenely expensive to run a complex deep learning model on millions of data records in real time, and it might not be practical to install a cluster of whirring servers into your working environment. On the other hand, if accuracy is what you need and you have lots of data then maybe it’s time to buy those GPUs&#8230;
  </p>
</div>

<div>
</div>

<h2>
  3: Lack of data and difficulty gathering data
</h2>

<div>
  One of the biggest challenges in supervised machine learning is gathering training data. In order to train a classification or regression model (deep neural network or otherwise) we need to have loads of examples of inputs and their desired matching outputs. For example “Here’s a picture of a cat, when you see it I want you to say ‘cat’”. <a href="https://brainsteam.co.uk/2016/03/29/cognitive-quality-assurance-an-introduction/">I’ve previously written about some best practices for gathering these kinds of datasets</a>. For a classical machine learning model you need to collect a few hundred or thousand examples of representative, consistent data points.
</div>
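
To show what "inputs and their desired matching outputs" look like in practice, here is a sketch of a tiny Naive Bayes text classifier, one of the classical models mentioned earlier, written in plain Python. The four training sentences are invented for the example and far too few for real use; the function names are also assumptions.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for text, label in labelled_examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary


def classify(model, text):
    """Pick the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: each input is paired with its desired output.
TRAINING_DATA = [
    ("great service and friendly staff", "positive"),
    ("really great product", "positive"),
    ("terrible service and rude staff", "negative"),
    ("really terrible product", "negative"),
]

if __name__ == "__main__":
    model = train_naive_bayes(TRAINING_DATA)
    print(classify(model, "great staff"))
    print(classify(model, "terrible product"))
```

Training here is just counting, which is why a model like this can be fitted in moments on a laptop with a modest labelled dataset.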

<div>
</div>

<div>
  To train a deep learning model from scratch you need a lot, lot more. We’re not talking hundreds or even thousands. Academic state-of-the-art image recognition models like <a href="https://en.wikipedia.org/wiki/AlexNet">AlexNet</a> are typically trained on millions of example images. NLP models for <a href="https://en.wikipedia.org/wiki/Sentiment_analysis">sentiment analysis</a> and chatbots rely on <a href="https://en.wikipedia.org/wiki/Word2vec">word vectors</a> trained on the entirety of Wikipedia or on Google News’ corpus of news articles, which runs to roughly 100 billion words. Thankfully word2vec is an unsupervised algorithm so you don’t need to manually annotate those billions of words, but you do still need to label documents for downstream tasks like sentiment analysis or topic classification. These are the sorts of datasets that only digital behemoths like Google and Facebook, who collect millions of documents per day over many years, are able to build and curate.
</div>

<div>
</div>

<div>
  <a href="https://www.kaggle.com/c/word2vec-nlp-tutorial#part-4-comparing-deep-and-non-deep-learning-methods">Recent benchmarks</a> show that manually feature-engineered “classical” machine learning models like those mentioned above sometimes outperform deep learning systems where datasets are relatively small. In other cases, DNNs offer a <a href="https://www.kdnuggets.com/2018/07/overview-benchmark-deep-learning-models-text-classification.html">slight uplift in performance</a> of the order of a few percent.
</div>


<h2>
  <b>Conclusion</b>
</h2>

<div>
  Deploying a machine learning product is a complex and multi-faceted problem with many trade-offs and decisions to be made. Deep Learning and DNNs are a very exciting family of technologies that truly are revolutionising the world around us but they’re not always the best approach to a machine learning problem. You should always consider the complexity of the problem you’re trying to solve, the amount of data you have, the human expertise and the compute power you have access to. If a simpler model works well then go with it and potentially plan to swap in a more complex deep learning model when you have enough data to make it worthwhile. Don’t use Deep Learning, Recurrent Networks, LSTMs, Convolutional Networks or GANs because they’re cool. Use them because simple methods didn’t work. Use them because manual feature engineering isn’t giving optimal results. Use them because even though your simple SVM model has been producing great results for the last 10 years, you think that the 10 million rows of data that you’ve collected could potentially feed a more powerful model that will increase performance by 30%.
</div>