When people ask me what I do at Mozilla and I tell them I work with data, more often than not their reaction is something like “Wait, what? What kind of data could Mozilla possibly have?” But let’s start from the beginning of my journey. Even though I have worked on data systems since my first internship, I didn’t start my career at Mozilla officially as a Data Engineer. I was hired to track, investigate, and fix performance problems in the Firefox browser; an analyst of sorts, given my background in optimizing large scientific applications for the ATLAS experiment at the LHC.
The thing with applications for which you have some control over how they are used is that you can “just” profile your known use cases and go from there. Profiling a browser is an entirely different story. Every user might stress the browser in different ways and you can’t just sample stacks and indiscriminately collect data from your users without privacy and performance consequences. If that wasn’t enough, there are as many different hardware/software configurations as there are users. Even with the same configuration, a different driver version can be make a huge difference. This is where Telemetry comes in, an opt-in data acquisition system that collects general performance metrics like garbage collection timings and hardware information, like the GPU model, across our user-base. In case you are wondering what data I am talking about, why don’t you have a look? You are just one click away from about:telemetry, a page that shows the data that Telemetry collects.
Imagine receiving Terabytes of data with millions of payloads per day, each containing thousands of complex hierarchical measurements, and being tasked to make sense of that data, to find patterns in it, how would you do it? Step 1 is getting the data, how hard can it possibly be, right? In order to do that, I had to fire a script to launch an EC2 instance to which I had to ssh into and from which I could run a custom, non-distributed, MapReduce (MR) system written in Python. This was a couple of years ago; you might be wondering why anyone would use a custom MR system instead of something like Hadoop back then. It turns out that it was simple enough to support the use cases the performance team had at the time and when something went wrong it was easy to debug. Trust me, debugging a full-fledged distributed system can be painful.
The system was very simple, the data was partitioned on S3 according to a set of dimensions, like submission date, and the MR framework allowed the user to define the which values to use for each dimension. My first assignment was to get back the ratio of users that had a solid state disk. Sounds simple enough, yet it took me a while to get the data I wanted out of the system. The debug cycle was very time consuming; I had to edit my job, try to run it and if something went wrong start all over again. An analysis for a single day could take hours to complete and, as it’s often the case with data, if something that was supposed to be a string turned out be an integer because of data corruption, and I failed to handle that case, the whole thing exploded.
There weren’t many people running Telemetry analyses back then; it was just too painful. In time, as more metrics were added and I tried to answer more questions with data, I decided to step back for a moment and try something new. I played around with different systems like Pig, mrjob, and Spark. Though Spark has a rich API with far more operators than map & reduce, what ultimately sold me was the fact that it had a read–eval–print loop. If your last command failed, you could just retry it without having to rerun the whole job from the beginning! I initially used a Scala API to run my analyses but I soon started missing some of the Python libraries I grew used to for analysis, like pandas, scipy and matplotlib; the Scala counterpart just weren’t on par with it. I wanted to bring others on board and convincing someone to write analyses in a familiar language was a lot easier. So I built a tiny wrapper around Spark’s Python API which lets user run queries on our data from a Jupyter notebook and in time the number of people using Telemetry for their analytics needs grew.
I deliberately ignored many design problems that had to be solved. What storage format should be used? How should the data be partitioned? How should the cluster(s) be provisioned; EMR, ECS or something else entirely? How should recurrent jobs be scheduled?
Telemetry is one of the most complex data sources we have but there are many others, like the server logs from a variety of services. Others were solving similar problems as the ones I faced but with different approaches, so we joined forces. Fast-forward a couple of years and Telemetry has changed a lot. A Heka pipeline pre-processes and validates the Telemetry data before dumping it on S3. Scala jobs convert the raw JSON data in more manageable derived Parquet datasets that can then be read by Spark with Python, Scala or R. The scope of Telemetry has expanded to drive not only performance needs but business ones as well. Derived datasets can be queried with SQL through Presto and re:dash allows us to build and share dashboard among peers. Jobs can be scheduled and monitored with Airflow. Telemetry based analyses are being run at all level of the organization: no longer are few people responsible to answer every imaginable data related question: there are interns running statistically rigorous analyses on A/B experiments and Senior Managers building retention plots for their products. Analyses are being shared as Jupyter notebooks and peer-reviewed on Bugzilla. It’s every engineer’s responsibility to know how to measure how well its own feature or product is performing.
There is this popular meme about how Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it. The truth is that there are only so many companies out there that can claim a user base with millions of users. This is one of the thing I love the most about Mozilla, we aren’t big as the giants but we aren’t small either. As an engineer, I get to solve challenging problems with the agility, flexibility and freedom of a startup. Our engineers work on interesting projects and there is a lot of room for trying new things, disrupting the status quo and pushing our comfort zone.
What’s the life of a Data engineer at Mozilla nowadays? How do we spend our time and what are our responsibilities?
Building the data pipeline
We build, maintain, monitor and deploy our data infrastructure and make sure that it works reliably and efficiently. We strive to use the smallest number of technologies that can satisfy our needs and work well in concert.
What kind of data format (e.g. Avro, Parquet) and compression codec (e.g. Snappy, LZO) should be used for the data? Should the data be stored on HDFS or S3? How should it be partitioned and bucketed? Where should the metadata (e.g. partition columns) be stored? Should the data be handled in real-time or in batch? Should views be recomputed incrementally or from scratch? Does summary data need to be accurate or can probabilistic data structures be used? Those are just some of the many decisions a data engineer is expected to take.
We contribute to internal projects. Heka for example, a stream processing system, is used in a multitude of ways: from loading and parsing log files to performing real time analysis, graphing, and anomaly detection.
At Mozilla, we love open source. When we built our stream processing system, we open sourced it. When different data systems like Spark or Presto need some custom glue to make them play ball, we extend them to fit our needs. Everything we do is out in the open; many of our discussions happen either on Bugzilla and on public mailing lists. Whatever we learn we share with the outside world.
Supporting other teams
Engineers, managers, and executives alike come to us with questions that they would like to answer with data but don’t know yet quite how to. We try to understand which are the best metrics that can answer their questions, assuming they can be answered in the first place. If the metrics required are missing then we add them to our pipeline. This could mean adding instrumentation to Firefox or, if the desired metrics can be derived from simpler ones that are available in the raw data, extending our derived datasets.
Once the data is queryable, we teach our users how to access it with e.g. Spark, Presto or Heka, depending on where the dataset is stored and what their requirements are. We learn from common patterns and inefficiencies by supporting our users’ analysis needs. For example, if a considerable amount queries require to group a dataset in a certain way, we might opt to build a pre-grouped version of the dataset to improve the performance.
We strive to review many of our users’ analyses. It’s far too easy to make mistakes when analyzing data since the final deliverables, like plots, can look reasonable even in the presence of subtle bugs.
Doing data science
We build alerting mechanisms that monitor our data to warn us when something is off. From representing time-series with hierarchical statistical models to using SAX for anomaly detection, the fun never stops.
We create dashboards for general use cases, like the display of daily aggregates partitioned by a set of dimensions or to track engagement ratios of products & features.
We validate hypotheses with A/B tests. For example, one of the main features we are going to ship this year, Electrolysis, requires a clear understanding of its performance impact before shipping it. We designed and ran A/B tests to compare with statistically sound methods a vast variety of performance metrics with and without Electrolysis enabled.
There is also no shortage of analyses that are more exploratory in nature. For example, we recently noticed a small spike in crashes whenever a new release is shipped. Is it a measurement artifact due to crashes that happened during the update? Or maybe there is something off that causes Firefox to crash after a new version is installed, and if so what is it? We don’t have an answer to this conundrum yet.
The simpler it is to get answers, the more questions will be asked.
This has been my motto for a while. Being data driven doesn’t mean having a siloed team of wizards and engineers that poke at data. Being data driven means that everyone in the company knows how to access the data they need in order to answer the questions they have; that’s the ultimate goal of a data engineering team. We are getting there but there is still a lot to do. If you have read so far, are passionate about data and have relevant skills, check out our jobs page.