Spark best practices

We have been running Spark for a while now at Mozilla and this post is a summary of things we have learned about tuning and debugging Spark jobs.

Spark execution model

Spark’s simplicity makes it all too easy to ignore its execution model and still manage to write jobs that eventually complete. With larger datasets, an understanding of what happens under the hood becomes critical to reduce run-time and avoid out-of-memory errors.

Let’s start by taking our good old word-count friend as a starting example:

rdd = sc.textFile("input.txt")\
        .flatMap(lambda line: line.split())\
        .map(lambda word: (word, 1))\
        .reduceByKey(lambda x, y: x + y, 3)\
        .collect()

RDD operations are compiled into a Directed Acyclic Graph (DAG) of RDD objects, where each RDD points to the parent it depends on:

DAG
Figure 1

At shuffle boundaries, the DAG is partitioned into so-called stages that are going to be executed in order, as shown in figure 2. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.

stages
Figure 2

To organize data for the shuffle, Spark generates sets of tasks – map tasks to organize the data and reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations. Operations within a stage are pipelined into tasks that can run in parallel, as shown in figure 3.

tasks
Figure 3

Stages, tasks and shuffle writes and reads are concrete concepts that can be monitored from the Spark UI. The UI can be accessed from the driver node on port 4040, as shown in figure 4.

shell
Figure 4

Best practices

Spark UI

Running Spark jobs without the Spark UI is like flying blind. The UI allows you to monitor and inspect the execution of jobs. To access it remotely, a SOCKS proxy is needed, as the UI also links to the worker nodes.

Using a proxy management tool like FoxyProxy allows you to automatically filter URLs based on text patterns and to limit the proxy settings to domains that match a set of rules. The browser add-on automatically handles turning the proxy on and off when you switch between viewing websites hosted on the master node and those on the Internet.

Assuming that you launched your Spark cluster with the EMR service on AWS, type the following command to create a proxy:

ssh -i ~/mykeypair.pem -N -D 8157 hadoop@ec2-...-compute-1.amazonaws.com

Finally, import the following configuration into FoxyProxy:

<?xml version="1.0" encoding="UTF-8"?>
<foxyproxy>
  <proxies>
    <proxy name="emr-socks-proxy" notes="" fromSubscription="false" enabled="true" mode="manual" selectedTabIndex="2" lastresort="false" animatedIcons="true" includeInCycle="true" color="#0055E5" proxyDNS="true" noInternalIPs="false" autoconfMode="pac" clearCacheBeforeUse="false" disableCache="false" clearCookiesBeforeUse="false" rejectCookies="false">
      <matches>
        <match enabled="true" name="*ec2*.amazonaws.com*" pattern="*ec2*.amazonaws.com*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
        <match enabled="true" name="*ec2*.compute*" pattern="*ec2*.compute*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
        <match enabled="true" name="10.*" pattern="http://10.*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
        <match enabled="true" name="*10*.amazonaws.com*" pattern="*10*.amazonaws.com*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
        <match enabled="true" name="*10*.compute*" pattern="*10*.compute*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
        <match enabled="true" name="*localhost*" pattern="*localhost*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
      </matches>
      <manualconf host="localhost" port="8157" socksversion="5" isSocks="true" username="" password="" domain="" />
    </proxy>
  </proxies>
</foxyproxy>

Once the proxy is enabled you can open the Spark UI by visiting localhost:4040.

Use the right level of parallelism

Clusters will not be fully utilized unless the level of parallelism for each operation is high enough. Spark automatically sets the number of partitions of an input file according to its size; for distributed shuffles, such as groupByKey and reduceByKey, it uses the largest parent RDD’s number of partitions. You can pass the level of parallelism as a second argument to an operation. In general, 2-3 tasks per CPU core in your cluster are recommended. That said, having tasks that are too small is also not advisable, as there is some overhead paid to schedule and run a task.

As a rule of thumb, tasks should take at least 100 ms to execute; you can ensure that this is the case by monitoring the task execution latency from the Spark UI. If your tasks take considerably longer than that, keep increasing the level of parallelism, by say a factor of 1.5, until performance stops improving.
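For instance, a minimal sketch of what tuning the parallelism might look like for the word-count job above; the partition count of 48 is purely illustrative and should be derived from the number of cores in your cluster:

rdd = sc.textFile("input.txt")

counts = rdd.flatMap(lambda line: line.split())\
            .map(lambda word: (word, 1))\
            .reduceByKey(lambda x, y: x + y, 48)  # e.g. ~3 tasks per core on a 16-core cluster

# Alternatively, repartition an existing RDD before a heavy operation.
repartitioned = rdd.repartition(48)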

Reduce working set size

Sometimes, you will get terrible performance or out of memory errors, because the working set of one of your tasks, such as one of the reduce tasks in groupByKey, was too large. Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc) build a hash table within each task to perform the grouping, which can often be large.

Even though those tables spill to disk, getting to the point where the tables need to be spilled increases the memory pressure on the executor, incurring the additional overhead of disk I/O and increased garbage collection. If you are using pyspark, the memory pressure will also increase the chance of Python running out of memory.

The simplest fix here is to increase the level of parallelism, so that each task’s input set is smaller.

Avoid groupByKey for associative operations

Both reduceByKey and groupByKey can be used for the same purposes but reduceByKey works much better on a large dataset. That’s because Spark knows it can combine output with a common key on each partition before shuffling the data.

In reduce tasks, key-value pairs are kept in a hash table that can spill to disk, as mentioned in “Reduce working set size“. However, the hash table flushes out the data to disk one key at a time. If a single key has more values than can fit in memory, an out of memory exception occurs. Pre-combining the keys on the mappers before the shuffle operation can drastically reduce the memory pressure and the amount of data shuffled over the network.
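To make the difference concrete, here is a hedged sketch of a word count written both ways; pairs is assumed to be an RDD of (word, 1) tuples:

# groupByKey ships every single (word, 1) pair across the network before summing:
counts = pairs.groupByKey().mapValues(sum)

# reduceByKey pre-combines values within each partition before the shuffle:
counts = pairs.reduceByKey(lambda x, y: x + y)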

Avoid reduceByKey when the input and output value types are different

Consider the job of creating a set of strings for each key:

rdd.map(lambda p: (p[0], {p[1]}))\
   .reduceByKey(lambda x, y: x | y)\
   .collect()

Note how the input values are strings and the output values are sets. The map operation creates lots of temporary small objects. A better way to handle this scenario is to use aggregateByKey:

def seq_op(xs, x):
    xs.add(x)
    return xs

def comb_op(xs, ys):
    return xs | ys

rdd.aggregateByKey(set(), seq_op, comb_op).collect()

Avoid the flatMap-join-groupBy pattern

When two datasets are already grouped by key and you want to join them and keep them grouped, you can just use cogroup. That avoids all the overhead associated with unpacking and repacking the groups.
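A minimal sketch of what that can look like, with hypothetical ratings and visits pair RDDs keyed by user id:

ratings = sc.parallelize([("alice", 4), ("bob", 5)])
visits  = sc.parallelize([("alice", "home"), ("alice", "about")])

# cogroup performs a single shuffle and keeps both sides grouped by key.
grouped = ratings.cogroup(visits)
result = grouped.mapValues(lambda vs: (list(vs[0]), list(vs[1]))).collect()
# e.g. [('alice', ([4], ['home', 'about'])), ('bob', ([5], []))]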

Python memory overhead

The spark.executor.memory option, which determines the amount of memory to use per executor process, is JVM specific. If you are using pyspark you can’t set that option to be equal to the total amount of memory available to an executor node as the JVM might eventually use all the available memory leaving nothing behind for Python.
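As a rough sketch, assuming an executor node with 16 GB of RAM, one might explicitly leave headroom for the Python workers; the numbers are purely illustrative, and spark.python.worker.memory only caps the aggregation buffer of each Python worker before it spills:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.memory", "10g")       # JVM heap only, not the whole node
        .set("spark.python.worker.memory", "2g"))  # per-Python-worker limit before spilling
sc = SparkContext(conf=conf)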

Use broadcast variables

Using the broadcast functionality available in SparkContext can greatly reduce the size of each serialized task, and the cost of launching a job over a cluster. If your tasks use any large object from the driver program, like a static lookup table, consider turning it into a broadcast variable.
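A minimal sketch, assuming a hypothetical lookup table and an RDD of OS keys:

# Without broadcast, the table would be serialized into every task closure.
os_names = {"WINNT": "Windows", "Darwin": "Mac OS X", "Linux": "Linux"}  # hypothetical table
os_names_bc = sc.broadcast(os_names)

labelled = rdd.map(lambda os_key: os_names_bc.value.get(os_key, "Other"))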

Cache judiciously

Just because you can cache an RDD in memory doesn’t mean you should blindly do so. Depending on how many times the dataset is accessed and the amount of work involved in recomputing it, recomputation can be faster than paying the price of the increased memory pressure.

It should go without saying that if you only read a dataset once there is no point in caching it; it will actually make your job slower. The size of cached datasets can be seen from the Spark UI.
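When a dataset is re-used, a sketch of the typical pattern looks like this; expensive_transformation is just a placeholder for whatever work you want to avoid repeating:

cached = rdd.map(expensive_transformation).cache()

cached.count()    # the first action materializes the cache
cached.collect()  # subsequent actions reuse the cached partitions

cached.unpersist()  # release the memory once the dataset is no longer needed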

Don’t collect large RDDs

When a collect operation is issued on an RDD, the dataset is copied to the driver, i.e. the master node. A memory exception will be thrown if the dataset is too large to fit in memory; take or takeSample can be used to retrieve only a capped number of elements instead.
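For example, a short sketch of how to inspect a large RDD without pulling it all to the driver:

preview = rdd.take(100)                        # first 100 elements only
sample  = rdd.takeSample(False, 100, seed=42)  # 100 random elements, without replacement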

Minimize amount of data shuffled

A shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. As illustrated in figure 3, each reducer in the second stage has to pull data across the network from all the mappers.

As of Spark 1.3, the intermediate files produced by the shuffle are not cleaned up from Spark’s temporary storage until Spark is stopped, which means that long-running Spark jobs may consume all available disk space. This is done so that the shuffle output doesn’t have to be re-computed when it is re-used.

Know the standard library

Avoid re-implementing existing functionality as it’s guaranteed to be slower.

Use dataframes

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a pandas DataFrame in Python.

props = get_pings_properties(pings,
                             ["environment/system/os/name",
                              "payload/simpleMeasurements/firstPaint",
                              "payload/histograms/GC_MS"],
                             only_median=True)

frame = sqlContext.createDataFrame(props.map(lambda x: Row(**x)))
frame.groupBy("environment/system/os/name").count().show()

yields:

environment/system/os/name   count
Darwin                        2368
Linux                         2237
Windows_NT                  105223

Before any computation on a DataFrame starts, the Catalyst optimizer compiles the operations that were used to build the DataFrame into a physical plan for execution. Since the optimizer generates JVM bytecode for execution, pyspark users will experience the same high performance as Scala users.

A glance at unified FHR/Telemetry

Lots is changing in Telemetry land. If you do occasionally run data analyses with our Spark infrastructure you might want to keep reading.

Background

The Telemetry and FHR collection systems on desktop are in the process of being unified. Both systems will be sending their data through a common data pipeline which has some features of both the current Telemetry pipeline as well the Cloud Services one that we use to ingest server logs.

The goals of the unification are to:

  • avoid measuring the same metric in multiple systems on the client side;
  • reduce the latency from the time a measurement occurs until it can be analyzed on the server;
  • increase the accuracy of measurements so that they can be better correlated with factors in the user environment such as the specific build, enabled add-ons, and other hardware or software characteristics;
  • use a common data pipeline for client telemetry and service log data.

The unified pipeline is currently sending data for Nightly, Aurora and Beta. The classic FHR and Telemetry pipelines are going to keep sending data at least until the new unified pipeline has been fully validated. The plan is to land this feature in the 40 Release. We’ll also continue to respect existing user preferences. If the user has opted out of FHR or Telemetry, we’ll continue to respect that for the equivalent data sets. Similarly, the opt-out and opt-in defaults will remain the same for equivalent data sets.

Data format

A Telemetry ping, stored as a JSON object on the client, encapsulates the data sent to our backend. The main differences between the new unified Telemetry ping format (v4) and the classic Telemetry one (v2) are that:

  • multiple ping types are supported beyond the classic saved-session ping, like the main ping;
  • pings have a common top-level section which contains basic information shared between types, like build-id and channel;
  • pings have an optional environment field which consists of data that is expected to be characteristic of performance and other behavior.

From an analysis point of view, the most important addition is the main ping which includes the very same histograms and other performance and diagnostic data as the v2 saved-session pings. Unlike in “classic” Telemetry though, there can be multiple main pings during a single session. A main ping is triggered by different scenarios, which are documented by the reason field:

  • aborted-session: periodically saved to disk and deleted at shutdown – if a previous aborted session ping is found at startup it gets sent to our backend;
  • environment-change: generated when the environment changes;
  • shutdown: triggered when the browser session ends;
  • daily: a session split triggered at 24-hour intervals at local midnight; this is needed to make sure we keep receiving data also from clients that have very long sessions.

Data access through Spark

Once you connect to a Spark enabled IPython notebook launched from our self-service dashboard, you will be prompted with a new tutorial based on the v4 dataset. The v4 data is fetched through the get_pings function by passing “v4” as the schema parameter. The following parameters are valid for the new data format:

  • app: an application name, e.g.: “Firefox”;
  • channel: a channel name, e.g.: “nightly”;
  • version: the application version, e.g.: “40.0a1”;
  • build_id: a build id or a range of build ids, e.g.: “20150601000000” or (“20150601000000”, “20150610999999”);
  • submission_date: a submission date or a range of submission dates, e.g.: “20150601” or (“20150601”, “20150610”);
  • doc_type: ping type, e.g.: “main”; set to “saved_session” by default;
  • fraction: the fraction of pings to return; set to 1.0 by default.
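Putting the parameters together, a call might look like the following sketch; the values are purely illustrative:

from moztelemetry.spark import get_pings

pings = get_pings(sc,
                  schema="v4",
                  app="Firefox",
                  channel="nightly",
                  version="40.0a1",
                  build_id=("20150601000000", "20150610999999"),
                  submission_date=("20150601", "20150610"),
                  doc_type="main",
                  fraction=0.1)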

Once you have an RDD, you can further filter the pings down by reason. There is also a new experimental API that returns the history of submissions for a subset of profiles, which can be used for longitudinal analyses.

Confidence intervals and hypothesis tests for engineers

I wrote a small parallel library for Python that implements the permutation test and the bias-corrected version of the bootstrap, which gives non-statisticians the ability to exploit confidence intervals and hypothesis tests for arbitrary statistics at the expense of some CPU cycles.

Modern hardware allows us to understand and compute statistics in ways that were not possible when the field was born. Resampling methods let us quantify uncertainty with fewer assumptions and greater accuracy, at a higher computational cost; this is a paradigm shift in the mindset of modern statistics.

Confidence intervals are based on the idea of the sampling distribution of a statistic, that is, the distribution of the values of the statistic over all possible samples of the same size. Given such a sampling distribution, it’s easy to build a confidence interval. As we usually have a single sample, statisticians devised formulas to compute a confidence interval assuming that the sampling distribution has a certain well-known shape.

As a concrete example, the sampling distribution of the sample mean is approximately normal, which follows from the central limit theorem. If you are looking for the sampling distribution of, say, the trimmed mean or the median, things are considerably harder. Exotic formulas do exist in some cases, but the bootstrap provides a computational way of approximating the sampling distribution without any assumption on its shape and spread.

The bootstrap of a statistic draws thousands of resamples with replacement from the original sample and computes the distribution of the statistic over those resamples. This distribution approximates the shape, spread and bias of the real sampling distribution, but it is centered at the statistic of the original sample in the best case and can be affected by a considerable bias in the worst case. There are techniques, though, to remove the bias from the bootstrap distribution.

bootstrap

If the sample is a good approximation of the population, the bootstrap method will provide a good approximation of the sampling distribution. As a rule of thumb, you should have at least 50 independent data points and use at least 1000 bootstrap samples before applying the method. Also, trying to apply the bootstrap to a very weird statistic that depends on only a few values of the sample, like the maximum, is a recipe for disaster.
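As a minimal sketch of the idea, here is a plain percentile bootstrap for the median in Python; note that, unlike the library mentioned above, it is neither parallel nor bias-corrected:

import numpy as np

def bootstrap_ci(sample, statistic=np.median, n_boot=1000, alpha=0.05):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    sample = np.asarray(sample)
    estimates = np.array([statistic(np.random.choice(sample, size=len(sample), replace=True))
                          for _ in range(n_boot)])
    return (np.percentile(estimates, 100 * alpha / 2),
            np.percentile(estimates, 100 * (1 - alpha / 2)))

# Example: 95% confidence interval for the median of a skewed sample.
data = np.random.lognormal(mean=0.0, sigma=1.0, size=200)
print(bootstrap_ci(data))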

I wrote in the past about the permutation test and how I used it to implement a hypothesis test for Telemetry histograms, so I am not going to reiterate its core ideas here. What’s important to understand though is that it assumes that the observations are exchangeable under the null hypothesis. This implies that the observations viewed individually must be identically distributed.

Dashboards made simple with Spark and Plotly

In my previous post about our new Spark infrastructure, I went into the details of how to launch a Spark cluster on AWS to perform custom analyses on Telemetry data. Sometimes though one needs to rerun an analysis recurrently over a certain timeframe, usually to feed data into dashboards of various kinds. We are going to roll out a new feature that allows users to upload an IPython notebook to the self-serve data analysis dashboard and run it on a scheduled basis. The notebook will be executed periodically with the chosen frequency and the result will be made available as an updated IPython notebook.

To schedule a Spark job:

  1. Visit the analysis provisioning dashboard at telemetry-dash.mozilla.org and sign in using Persona with an @mozilla.com email address.
  2. Click “Schedule a Spark Job”.
  3. Enter some details:
    • The “Job Name” field should be a short descriptive name, like “chromehangs analysis”.
    • Upload your IPython notebook containing the analysis.
    • Set the number of workers of the cluster in the “Cluster Size” field.
    • Set a schedule frequency using the remaining fields.

Once a new scheduled job is created it will appear in the top listing of the scheduling dashboard. When the job is run its result will be made available as an IPython notebook visible by clicking on the “View Data” entry of your job.

As I briefly mentioned at the beginning, periodic jobs are typically used to feed data to dashboards. Writing dashboards for a custom job isn’t very pleasant, and in the past I wrote some simple tools to help with that. It turns out though that, thanks to IPython, one doesn’t necessarily need to write a dashboard from scratch but can simply re-use the notebook as the dashboard itself! I mean, why not? That might not be good enough for management-facing dashboards but acceptable for ones aimed at engineers.

In fact with IPython we are not limited at all to matplotlib’s static charts. Thanks to Plotly, it’s easy enough to generate interactive plots which allow you to:

  • Check the x and y coordinates of every point on the plot by hovering with the cursor.
  • Zoom in on the plot and resize lines, points and axes by clicking and dragging the cursor over a region.
  • Pan by holding the shift key while clicking and dragging.
  • Zoom back out to the original view by double clicking on the plot.

Plotly comes with its own API, but if you already have a matplotlib-based chart then it’s trivial to convert it to an interactive plot. As a concrete example, I updated my Spark Hello World example with a plotly chart.

import matplotlib.pyplot as plt
import plotly.plotly as py  # plotly's matplotlib conversion helper, module layout as of plotly 1.x

fig = plt.figure(figsize=(18, 7))
frame["WINNT"].plot(kind="hist", bins=50)  # frame is the pandas DataFrame from the Hello World analysis
plt.title("startup distribution for Windows")
plt.ylabel("count")
plt.xlabel("log(firstPaint)")
py.iplot_mpl(fig, strip_style=True)  # the single extra line that makes the plot interactive

As you can see, just a single extra line of code is needed for the conversion.

startup_distribution_for_windows

As WordPress doesn’t support iframes, you are going to have to click on the image and follow the link to see the interactive plot in action.

Add-on hangs: a brief summary

I have blogged about clustering BHR hangs before. This post is dedicated to the add-on side of things.

In a way BHR could be seen as a distributed profiler: each user runs a local profiler that samples the stack of a thread only when that thread is hanging for over N milliseconds. Then the stacks are sent to our backend infrastructure through a Telemetry submission.

Imagine profiling Firefox for an hour. At the end of the hour you would like to determine the impact of one of your installed add-ons on the browsing experience. That’s not simple at all: the add-on might have been in use only for a fraction of your session, say 20 seconds, yet in that fraction it might have slowed down the browser significantly. Since you are collecting hangs for the whole session, that signal might eventually be dominated by noise. This means that in most cases, add-on hangs are not going to look like a big deal once aggregated.

I aggregated the stacks and partitioned them into browser-specific and add-on-specific ones. For a given add-on Foo, I computed for each session the ratio of hangs of Foo over the total number of hangs of both Firefox and Foo, and finally averaged those ratios. This final number gives an idea of the average proportion of hangs due to Foo that a user of that add-on can expect to find in their session.
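A hedged sketch of that aggregation with Spark; the input RDD and its layout are hypothetical, namely (session_id, (foo_hangs, firefox_hangs)) pairs for sessions that have Foo installed:

def hang_ratio(counts):
    foo_hangs, firefox_hangs = counts
    total = foo_hangs + firefox_hangs
    return float(foo_hangs) / total if total else 0.0

# sessions: RDD of (session_id, (foo_hangs, firefox_hangs)) pairs (hypothetical)
average_ratio = sessions.map(lambda kv: hang_ratio(kv[1])).mean()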

That’s not the whole story though: one can imagine scenarios where an add-on triggers an asynchronous workload in the browser, like garbage collection, which will not be attributed to the add-on itself. In an admittedly less common, but still plausible, scenario, an add-on could improve the performance of the browser, reducing the number of browser-specific hangs while increasing the ratio. In general I tend to see the final number as a lower bound; even though it’s not precise, it can help identify bottlenecks.

Among the most popular add-ons, the ones that have a high ratio are:

  1. Lastpass (12%)
  2. Hola Better Internet (11%)
  3. Avira Antivirus (10%)
  4. noscript (9%)
  5. Adblock (9%), note this is not Adblock Plus

LastPass, for instance, has a ratio of hangs of about 12%, which means that if you had just LastPass installed and no other add-on, on average about 12% of the hangs you would experience would likely be due to LastPass. That’s a lot, and the main culprit seems to be the logic that scans for input fields when the document changes dynamically. That said, I am glad I can’t see any popular add-on with a shockingly high ratio of hangs.

The numbers shouldn’t be used to mark an add-on as bad or good; they are based on a fallible heuristic and they are meant to give us the tools to prioritize which add-ons we should keep an eye on.

Next-gen Data Analysis Framework for Telemetry

The easier it is to get answers, the more questions will be asked

In that spirit, Mark Reid and I have been working for a while now on a new analysis infrastructure to make it as easy as possible for engineers to get answers to data-related questions.

Our shiny new analysis infrastructure is based primarily on IPython and Spark. I blogged about Spark before and even gave a short tutorial on it at our last workweek in Portland (slides and tutorial); IPython might be something you are not familiar with unless you have a background in science. In a nutshell it’s a browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.

IPython
An IPython notebook in all its glory

The combination of IPython and Spark allows you to write data analyses interactively from a browser and seamlessly parallelize them over multiple machines, thanks to a rich API with over 80 distributed operators! It’s a huge leap forward in terms of productivity compared to traditional batch-oriented map-reduce frameworks. An IPython notebook contains both the code and the product of the execution of that code, like plots. Once executed, a notebook can simply be serialized and uploaded to Github. Then, thanks to nbviewer, it can be visualized and shared among colleagues.

In fact, the issue with sharing just the end product of an analysis is that it’s all too easy for bugs to creep in or to make wrong assumptions. If your end result is a plot, how do you test it? How do you know that what you are looking at does actually reflect the truth? Having the code side by side with its evaluation allows more people to inspect it and streamlines the review process.

This is what you need to do to start your IPython backed Spark cluster with access to Telemetry data:

  1. Visit the analysis provisioning dashboard at telemetry-dash.mozilla.org and sign in using Persona with an @mozilla.com email address.
  2. Click “Launch an ad-hoc Spark cluster”.
  3. Enter some details:
    • The “Cluster Name” field should be a short descriptive name, like “chromehangs analysis”.
    • Set the number of workers for the cluster. Please keep in mind to use resources sparingly; use a single worker to write and debug your job.
    • Upload your SSH public key.
  4. Click “Submit”.
  5. A cluster will be launched on AWS preconfigured with Spark, IPython and some handy data analysis libraries like pandas and matplotlib.

Once the cluster is ready, you can tunnel IPython through SSH by following the instructions on the dashboard, e.g.:

ssh -i my-private-key -L 8888:localhost:8888 hadoop@ec2-54-70-129-221.us-west-2.compute.amazonaws.com

Finally, you can launch IPython in Firefox by visiting http://localhost:8888.

Now what? Glad you asked. In your notebook listing you will see a Hello World notebook. It’s a very simple analysis that produces the distribution of startup times faceted by operating system for a small fraction of Telemetry submissions; let’s quickly review it here.

We start by importing a telemetry utility to fetch pings and some commonly needed libraries for analysis: a json parser, pandas and matplotlib.

import ujson as json
import matplotlib.pyplot as plt
import pandas as pd
from moztelemetry.spark import get_pings

To execute a block of code in IPython, also known as a cell, press Shift-Enter. While a cell is being executed, a gray circle will appear in the upper right border of the notebook. When the circle is full, your code is being executed by the IPython kernel; when only the border of the circle is visible, the kernel is idle and waiting for commands.

Spark exploits parallelism across all cores of your cluster. To see the degree of parallelism you have at your disposal, simply evaluate:


sc.defaultParallelism

Now, let’s fetch a set of telemetry submissions and load it in a RDD using the get_pings utility function from the moztelemetry library:

pings = get_pings(sc,
                  appName="Firefox",
                  channel="nightly",
                  version="*",
                  buildid="*",
                  submission_date="20141208",
                  fraction=0.1)

That’s pretty much self documenting. The fraction parameter, which defaults to 1, selects a random subset of the selected submissions. This comes in handy when you first write your analysis and don’t need to load lots of data to test and debug it.

Note that both the buildid and submission_date parameters also accept a tuple specifying, inclusively, a range of dates, e.g.:


pings = get_pings(sc,
                  appName="Firefox",
                  channel="nightly",
                  version="*",
                  buildid=("20141201", "20141202"),
                  submission_date=("20141202", ""20141208"))

Let’s do something with those pings. Since we are interested in the distribution of the startup time of Firefox faceted by operating system, let’s extract the needed fields from our submissions:

def extract(ping):
    ping = json.loads(ping)
    os = ping["info"]["OS"]
    startup = ping["simpleMeasurements"].get("firstPaint", -1)
    return (os, startup)

cached = pings.map(lambda ping: extract(ping)).filter(lambda p: p[1] > 0).cache()

As the Python API closely matches the Scala one, I suggest having a look at my older Spark tutorial if you are not familiar with Spark. Another good resource is the set of hands-on exercises from AMP Camp 4.

Now, let’s collect the results back and stuff them into a pandas DataFrame. This is a very common pattern: once you reduce your dataset to a manageable size with Spark, you collect it back on your driver (aka the master machine) and finalize your analysis with statistical tests, plots and whatnot.

grouped = cached.groupByKey().collectAsMap()

from numpy import log  # log comes from numpy; it was missing from the imports at the top of the notebook
frame = pd.DataFrame({x: log(pd.Series(list(y))) for x, y in grouped.items()})
frame.boxplot()
plt.ylabel("log(firstPaint)")
plt.show()

startup_hello
Startup distribution by OS

Finally, you can save the notebook, upload it to Github or Bugzilla and visualize it on nbviewer; it’s that simple. Here is the nbviewer powered Hello World notebook. I warmly suggest that you open a bug report on Bugzilla for your custom Telemetry analysis and ask me or Vladan Djeric to review it. Mozilla has been doing code reviews for years and with good reason; why should data analyses be different?

Congrats, you just completed your first Spark analysis with IPython! If you need any help with your custom job feel free to drop me a line in #telemetry.

Deep Neural Nets: a brief recap

Neural nets have experienced a surge in interest in the past few years. This post is a summary of the subject from my non-academic point of view.

In the 80s, feed-forward neural networks with a single hidden layer became pretty popular when a simple and efficient algorithm was engineered to train them: the backpropagation algorithm.

Feed_forward_neural_net
Feed-forward neural network (source: Wikipedia)

As it’s usual with machine learning algorithms, a naive implementation can be written very concisely without much effort. In fact, all one needs to understand to train a network is the chain rule.

A neural network is nothing more than a hierarchy of composed function applications. Take a net with a single neuron f and loss function L

L(f(\mathbf{x}, \mathbf{w}), y).

If you know how to compute the partial derivative of the loss function L you desire to minimize with respect to its input f(\mathbf{x}, \mathbf{w}), i.e.

\frac{\partial L}{\partial f} ,

and the derivative of f with respect to its weights \mathbf{w}, i.e.

\frac{\partial f}{\partial \mathbf{w}},

you can determine the derivative of L with respect to the weights \mathbf{w} using the chain rule:

\frac{\partial L}{\partial \mathbf{w}} = \frac{\partial L}{\partial f}\frac{\partial f}{\partial \mathbf{w}}

That’s all you need to apply gradient descent and iteratively update the parameters \mathbf{w}:

\mathbf{w} = \mathbf{w} - \alpha \frac{\partial L}{\partial f}\frac{\partial f}{\partial \mathbf{w}}, where \alpha is the learning rate,

in order to reach a solution that minimizes the loss function. It’s easy to imagine applying this procedure recursively on a hierarchy of composed functions.
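As a toy illustration of the procedure, not the post’s implementation, here is a single linear neuron f(x, w) = w·x trained with a squared loss using exactly the chain-rule update above:

import numpy as np

def f(x, w):
    return np.dot(w, x)  # a single linear neuron

def train(X, y, alpha=0.01, epochs=200):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            dL_df = f(x_i, w) - y_i     # dL/df for the squared loss L = 0.5 * (f - y)^2
            df_dw = x_i                 # df/dw for a linear neuron
            w -= alpha * dL_df * df_dw  # chain rule + gradient descent step
    return w

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([5.0, 4.0, 9.0])  # generated by w = (1, 2)
print(train(X, y))             # converges towards [1. 2.]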

Unfortunately though, training deep networks, i.e. networks with more than one hidden layer, is not a simple task. There are several reasons why the backpropagation algorithm can fail to find a good solution, e.g. vanishing gradients and overfitting come to mind, and this has been explored in depth elsewhere. Interest in neural nets faded away when researchers realized this.

Then, in 2006/2007, researchers showed that by using unlabeled data it’s possible to train a deep neural net using a greedy approach. The main idea is based on a network of neurons able to reconstruct the original input with the smallest amount of error, similarly to what Principal Component Analysis does. Such a network is called an Autoencoder and is composed of three layers:

  • the input layer of dimension d
  • a hidden layer of dimension d' with d' < d (a simplifying assumption)
  • the output layer of dimension d
Autoencoder
Autoencoder (source: UFLDL tutorial)

The hidden layer learns a lower-dimensional representation of the input which allows it to reconstruct the original signal in the output layer with the minimum error possible. Once an Autoencoder is trained, the representation learned by the hidden layer can be used as input for another Autoencoder. The process is repeated, forming a so-called stacked Autoencoder. Effectively, each Autoencoder learns a set of new features from the ones learned by the previous Autoencoder. To give a concrete example, the first Autoencoder might learn to detect the edges of a picture, while the second one learns contours, and so on.

Once you have trained a stacked Autoencoder, you can initialize the weights of a deep network with N hidden layers using the weights of the N hidden layers of the stacked Autoencoder (pre-training). From here on you can use the neural net just like any ordinary one and train it using labeled data (fine-tuning). Using unlabeled data to pre-train a deep neural net was a big thing back in 2006/2007. This older talk by Geoff Hinton was truly inspiring, but things have changed since then.

In 2012, Hinton et al. showed that it is possible to train a deep convolutional neural net to classify images without any sort of pre-training, and beat traditional Computer Vision approaches at the same time. Since then, unsupervised pre-training has mostly stopped being researched in universities but, nevertheless, it was the catalyst that led to more funding and ultimately to where we are now.

A state of the art deep convolutional neural network for image classification is based on a handful of powerful ingredients:

  • many hidden layers (or it wouldn’t be a deep net);
  • convolutional layers followed by pooling layers early on;
  • rectified linear units instead of the classic sigmoid activation function, to learn faster;
  • dropout to approximate the average result of many nets, to reduce overfitting.

The devil is in the detail but the basic concepts and ideas are easy to grasp and a simple implementation can be written in a couple of afternoons.

The real difficulty is more of an engineering one: how do you write the most efficient code? If I whetted your appetite and you are curious to have a look at a serious implementation, you should check out cuda-convnet, an extremely well written and efficient multi-GPU convolutional neural net.