When I moved to London in November I immediately noticed the sheer number of conversations in Italian on the streets. Coming from San Francisco, where hearing my language on the streets was rarer than a lunar eclipse, I was quite surprised. I was obviously expecting to hear and see more Italians, but it definitely felt like there were many more of my fellow citizens here than the last time I visited London, many years ago.
I took the matter into my own hands and had a look at the various data freely available. Since I am mostly interested in the younger generation, instead of using the census, which is compiled every 10 years, I looked at the National Insurance Number (NIN) registration data, which is compiled yearly. If you want to work in the UK you need a NIN, and I would expect younger people in particular to request one, though I don’t have the data to back this assumption up.
The following map shows the number of registrations by borough in 2013. As you can see, Tower Hamlets, Brent and Haringey have a particularly high number of registrations.
I wondered if that had always been the case, so I faceted the map by year:
There are some interesting observations:
the number of registrations is going up over the years;
while Kensington had the highest number of registrations until 2008 or so, from 2009 onwards there is a clear shift in the distribution of registrations by borough.
The counts are based on the recorded address at the time of scan for that reporting period, but some of the individuals may have subsequently moved out of the area or back abroad. In other words, take these visualizations with a grain of salt. There clearly is a pattern here, but more data would be required to understand what caused the shift by borough.
Predicting time series can be very interesting, and not only for quants. Any server that logs metrics like the number of submissions or requests over time generates a so-called time series, or signal. An interesting time series I had the chance to play with some time ago is the one generated by Telemetry’s submissions. This is how it looks for the Nightly channel over the past 60 days:
It’s immediately evident to the human eye when and where there was a drop in submissions in the past couple of months (bugs!). An interesting problem is to automatically identify a drop in submissions as soon as it happens while keeping the number of “false alarms” to a minimum. It might seem rather trivial at first, but given that the distribution is quite dispersed, mostly because of daily oscillations, an outlier detection method based on the standard deviation is doomed to fail. Using the median absolute deviation is more robust, but still not good enough to avoid false positives.
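To make the comparison concrete, here is a sketch of outlier detection with the median absolute deviation on a synthetic hourly series; the data and the 3.5 threshold are illustrative, not Telemetry’s actual numbers:

```python
import numpy as np

def mad_outliers(series, threshold=3.5):
    """Flag points whose modified z-score, based on the median
    absolute deviation (MAD), exceeds the threshold."""
    median = np.median(series)
    mad = np.median(np.abs(series - median))
    # 0.6745 makes the score comparable to a standard deviation
    # for normally distributed data.
    modified_z = 0.6745 * (series - median) / mad
    return np.abs(modified_z) > threshold

# A synthetic series with daily oscillations and one sudden drop.
rng = np.random.default_rng(0)
hours = np.arange(24 * 14)
signal = 1000 + 200 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 30, hours.size)
signal[24 * 7] = 100  # a sudden drop on day 7
flags = mad_outliers(signal)
```

The sudden drop is flagged, but a gradual decline spread over several days would stay well inside the MAD band, which is exactly the failure mode at issue here.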
The periodic patterns might not be immediately visible from the raw data plot, but once we connect the dots the daily and weekly patterns appear in all their beauty:
The method I came up with to catch drops does the following:
It retrieves the distributions of the last 10 days from the current data point
It performs a series of Mann-Whitney tests to compare the last 24h to the distributions of the previous 9 days
If the distributions are statistically different for at least 5 days with the current daily one having a lower mean, then we have a drop
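The steps above can be sketched in Python with SciPy’s Mann-Whitney U test. The window sizes follow the description; the significance level and the synthetic data are assumptions of mine:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def detect_drop(hourly_counts, min_days=5, alpha=0.001):
    """Report a drop when the last 24h differ significantly (with a
    lower mean) from at least `min_days` of the previous 9 days.
    `alpha` is an illustrative significance level."""
    days = np.asarray(hourly_counts)[-240:].reshape(10, 24)  # last 10 days
    current = days[-1]
    votes = 0
    for past in days[:-1]:
        _, p = mannwhitneyu(current, past, alternative="two-sided")
        if p < alpha and current.mean() < past.mean():
            votes += 1
    return votes >= min_days

# Synthetic check: steady submissions vs. a 50% drop in the last day.
rng = np.random.default_rng(42)
normal = rng.normal(1000, 50, 240)
dropped = normal.copy()
dropped[-24:] *= 0.5
no_drop, drop = detect_drop(normal), detect_drop(dropped)
```

Requiring agreement across several past days is what makes the method robust to the daily oscillations: a single noisy day cannot trigger an alarm on its own.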
The algorithm requires a certain amount of history to make good predictions, which is why it detected the first drop on the left only after several days. As expected, though, it was able to detect the second drop without any false positives. Sudden drops are easy to detect with a robust outlier detection method, but slow drops, as we have experienced in the past, can go unnoticed if you just look for outliers.
Another interesting approach is to use time series analysis to decompose the series into its seasonal (periodic), trend and noise components. A simple classical decomposition by moving average yields the following series:
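For the curious, a hand-rolled version of such a classical decomposition might look like the following; it assumes a series of hourly counts with a single daily period, which is a simplification of the real signal:

```python
import numpy as np

def decompose(series, period=24):
    """Classical additive decomposition: a centered moving average for
    the trend, the average detrended value per hour of the day for the
    seasonal component, and whatever is left as the residual."""
    n = series.size
    # Centered moving average for an even period: a (period + 1)-wide
    # window with half weight on its two endpoints.
    kernel = np.ones(period + 1)
    kernel[0] = kernel[-1] = 0.5
    kernel /= period
    half = period // 2
    trend = np.full(n, np.nan)  # undefined at the edges
    trend[half:n - half] = np.convolve(series, kernel, mode="valid")
    detrended = series - trend
    # Seasonal profile: mean detrended value per position in the period,
    # ignoring the NaN edges, centered around zero.
    profile = np.nanmean(detrended.reshape(-1, period), axis=0)
    profile -= profile.mean()
    seasonal = np.tile(profile, n // period)
    residual = series - trend - seasonal
    return trend, seasonal, residual

# Sanity check on a series that is exactly trend + daily seasonality.
hours = np.arange(24 * 20)
series = 0.5 * hours + 100 * np.sin(2 * np.pi * hours / 24)
trend, seasonal, residual = decompose(series)
```

For this noiseless toy series the residual vanishes in the interior; on real data it would contain the noise and, hopefully, the drops.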
This simple algorithm was able to remove most of the periodic pattern; the trend is still affected, though, by the weekly signal and the drops. It turns out that newer methods are able to decompose time series with multiple periodic patterns, or seasonalities. One algorithm I particularly like is the so-called TBATS method, an advanced exponential smoothing model:
That’s pretty impressive! The TBATS algorithm was able to identify and remove the daily and weekly frequencies from our signal; what remains is basically the trend and some random noise. Now that we have such a clean signal we could try to apply statistical quality control to our time series, i.e. use a set of rules to identify drops. The rules look at the historical mean of a series of datapoints and, based on the standard deviation, help judge whether a new set of points is experiencing a mean shift (drop) or not.
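As an illustration, a single such rule can be sketched in a few lines; the 3-sigma threshold is a common control-chart convention, but this is a simplified stand-in for a full rule set:

```python
import numpy as np

def mean_shift(history, recent, sigma=3):
    """Flag a downward mean shift when the recent window's mean falls
    more than `sigma` standard errors below the historical mean."""
    mu = history.mean()
    stderr = history.std(ddof=1) / np.sqrt(recent.size)
    return recent.mean() < mu - sigma * stderr

# A deterministic check: an in-control day vs. the same day shifted down.
history = 1000 + 10 * np.sin(np.arange(500))
in_control = mean_shift(history, history[:24])
shifted = mean_shift(history, history[:24] - 50)
```

This kind of rule only makes sense on the deseasonalized signal: applied to the raw series, the daily oscillations alone would trip it constantly.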
Given a decomposition of a time series, we can also use it to predict future datapoints. This can be useful for a variety of reasons beyond detecting drops. To get an idea of how well we can predict future submissions, let’s take a clean subset of our data, from day 20 to day 40, and try to predict Telemetry’s submissions for the next 5 days, comparing the prediction to the actual data:
That’s pretty neat: we can immediately see that there is an outlier in the actual data and that the prediction is very close to the real observations.
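To get a feel for how a decomposition turns into a prediction, here is a toy seasonal forecast that fits a straight line for the trend and reuses the average periodic profile; it’s a far cry from TBATS, and the data is synthetic:

```python
import numpy as np

def seasonal_forecast(series, horizon, period=24):
    """Fit a straight line for the trend, average one period's profile
    for the seasonality, and extrapolate both into the future."""
    t = np.arange(series.size)
    slope, intercept = np.polyfit(t, series, 1)
    detrended = series - (slope * t + intercept)
    profile = detrended.reshape(-1, period).mean(axis=0)
    future = np.arange(series.size, series.size + horizon)
    return slope * future + intercept + profile[future % period]

# Predict two days ahead of a synthetic trend + daily cycle.
t = np.arange(24 * 20)
series = 2 * t + 50 * np.sin(2 * np.pi * t / 24)
prediction = seasonal_forecast(series, 48)
```

Comparing such a forecast against the incoming data gives yet another way to flag drops: a day whose observations fall consistently below the predicted band is suspicious.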
I wonder if there are other methods used to detect alterations in time series, so feel free to drop me a line with a pointer if you happen to have a suggestion.