Strata + Hadoop World @ London

The Strata conference in London has just ended and I came back with the feeling that in a few years the conference might as well be called Strata+Spark World. A quick look at the schedule reveals the sheer number of Spark-related talks. Most people I have spoken to have played with it but are not using it in production yet, mostly because over the years they have developed a considerable amount of tooling around Hadoop MapReduce that makes it hard to jump ship. That said, it’s hard to miss where the industry is moving: Spark is currently the most active Apache project and it’s experiencing exponential growth.

Patrick Wendell spoke on Wednesday about what’s coming in the next versions of Spark. For us at Mozilla, there are some welcome improvements in sight, like metrics visualization for Spark Streaming and various performance improvements in terms of memory and CPU usage. We are not using Spark Streaming at the moment, as we have our very own stream processing system, Heka, but we might consider it for a few particular use cases.

Lots of work is going into Spark DataFrames, which one might think of as a distributed equivalent of a pandas dataframe. The current API isn’t stable yet, but it looks very promising. When leveraging the DataFrame API, Python jobs can be as fast as Scala ones thanks to a unified physical execution layer.
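To give a flavor of the DSL, here is a minimal sketch of what a DataFrame job might look like from Python. The column names and data are made up for illustration, and the API shown is the Spark 1.3-era one:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="dataframe-sketch")
sqlContext = SQLContext(sc)

# Build a DataFrame from an RDD of simple scalar records
# (hypothetical channel/value pairs).
rows = sc.parallelize([("nightly", 12.5), ("beta", 7.1), ("nightly", 3.3)])
df = sqlContext.createDataFrame(rows, ["channel", "value"])

# These expressions compile down to the same physical plan as their
# Scala equivalent, so the Python job pays no extra overhead.
df.groupBy("channel").avg("value").show()
```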

I feel that DataFrames work well with simple scalar types, unlike our Telemetry histograms, and when most of the operations needed are available through the DSL. For that reason we are probably not going to adopt the DataFrame API for the time being and will keep using the RDD API instead. That said, we could exploit DataFrames in the future by representing each Telemetry histogram with a single average value.
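For illustration, a hypothetical way to collapse a histogram into such a scalar could look like the sketch below; the bucket layout and counts are invented, not our actual Telemetry format:

```python
# Hypothetical sketch: reduce a histogram (bucket -> count) to a single
# median estimate so it fits a scalar DataFrame column.
def approx_median(histogram):
    """Return the bucket that contains the middle sample."""
    total = sum(histogram.values())
    running = 0
    for bucket in sorted(histogram):
        running += histogram[bucket]
        if running * 2 >= total:
            return bucket

# Made-up histogram with four buckets and 100 samples overall.
print(approx_median({0: 10, 10: 50, 50: 30, 100: 10}))  # -> 10
```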

There was a nice talk about Elasticsearch by Costin Leau on Thursday, with a hands-on demo. I have been thinking for a while about using Kibana, a visualization platform for Elasticsearch, for Telemetry, and seeing it in action convinced me to give it a shot.

The main issue with unleashing Elasticsearch on our dataset is that we rolled our own histogram representation, which is not going to play ball with Elasticsearch. That said, as the goal of this dashboard would be to give an overview of our data, nothing is stopping us from representing histograms with a single point estimate, like the median. The dashboard could empower anyone in the organization to slice, dice and visualize Telemetry data without writing a single line of code, while Spark would remain the tool for non-trivial analyses and our existing dashboard would keep displaying the full distributions for a predetermined set of aggregations.
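As a rough sketch of what feeding flattened submissions to Elasticsearch might look like with the official Python client, assuming histograms have already been reduced to a median estimate; all index and field names here are hypothetical:

```python
from elasticsearch import Elasticsearch  # official elasticsearch-py client

es = Elasticsearch()  # assumes a node reachable on localhost:9200

# One document per submission, with histograms collapsed to a single
# estimate so Kibana can facet on plain scalar fields.
doc = {
    "channel": "nightly",
    "os": "Windows_NT",
    "GC_MS_median": 42,          # single estimate instead of a full histogram
    "submission_date": "2015-05-07",
}
es.index(index="telemetry", doc_type="submission", body=doc)
```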