Telemetry meets SQL

In an effort to ease data access at Mozilla we started providing SQL access to our Parquet datasets through Presto. The benefit of SQL is that it’s a commonly understood language among engineers and non-engineers alike and is generally easy to pick up.

Even though we use Redshift for some of our data,  there are datasets that store complex hierarchical data which would require unnatural transformations to fit in a typical SQL store that doesn’t support the flexibility of a nested data model (think of structs, arrays & maps) such as the one provided by Parquet. Furthermore, we were looking to use a single store for our data, i.e. Parquet files on S3, which both Spark and a SQL engine could access directly.

Presto provides the best of both worlds: SQL over Parquet. Presto is an open-source distributed SQL query engine for running interactive analytic queries against various data sources. It allows querying different sources such as Hive and Cassandra, relational databases or even proprietary data stores and a single query can combine data from multiple sources.

Apache Drill offers similar features without requiring up-front schema knowledge, which is a big advantage over Presto given how painful a schema migration can be at times. Overall Drill feels less mature than Presto though and is not supported yet by Amazon EMR unlike Presto, which makes deployment & maintance more involved. For those reasons we picked Presto as our SQL engine.

Presto supports a rich set of data types which map to Parquet ones:

Arrays
The [] operator is used to access an element of an array:

SELECT my_array[1] AS first_element

Maps
The [] operator is used to retrieve the value corresponding to a given key from a map:

SELECT name_to_age_map['Bob'] AS bob_age

Structs
Fields within a structure are accessed with the field reference operator .:

SELECT my_column.my_field

Unnesting maps and structs
The UNNEST clause is used to expand an array or a map into a relation. Arrays are expanded into a single column, and maps are expanded into two columns (key, value). UNNEST can also be used with multiple arguments, in which case they are expanded into multiple columns, with as many rows as the highest cardinality argument:


SELECT numbers, animals, n, a
FROM (
  VALUES
    (ARRAY[2, 5], ARRAY['dog', 'cat', 'bird']),
    (ARRAY[7, 8, 9], ARRAY['cow', 'pig'])
) AS x (numbers, animals)
CROSS JOIN UNNEST(numbers, animals) AS t (n, a);

which yields:

  numbers  |     animals      |  n   |  a
-----------+------------------+------+------
 [2, 5]    | [dog, cat, bird] |    2 | dog
 [2, 5]    | [dog, cat, bird] |    5 | cat
 [2, 5]    | [dog, cat, bird] | NULL | bird
 [7, 8, 9] | [cow, pig]       |    7 | cow
 [7, 8, 9] | [cow, pig]       |    8 | pig
 [7, 8, 9] | [cow, pig]       |    9 | NULL

The longitudinal view is currently accessible through Presto.  As a concrete example, this is how one could count the number of Telemetry fragments over a time range:


SELECT d, count(*)
FROM (
  SELECT substr(ssd, 1, 11) AS d
  FROM longitudinal
  CROSS JOIN UNNEST(subsession_start_date) AS s(ssd)
)
WHERE d >= '2015-11-15' AND d < '2016-03-15'
GROUP BY d
ORDER BY d ASC

Re:dash
Re:dash allows to query Presto directly from Firefox. After a query is run a table is displayed with the result. Queries can be saved and optionally scheduled to run periodically at a given time.

table
Query editor

Different kinds of plots (e.g. bar charts, line charts, boxplots, …) can be built over a table which are updated every time the query is re-run.

graph
Number of Telemetry fragments over time

Custom dashboards can be built that link to tables and plots. Users can visualize the dashboards and optionally access the SQL code that powers them and fork it.

dash
A simple dashboard

Mozilla’s re:dash instance can be accessed at sql.telemetry.mozilla.org.

Longitudinal studies with Telemetry

Unified Telemetry (UT) replaced both Telemetry and FHR at the end of last year. FHR has been historically used to answer longitudinal questions, such as churn, while Telemetry has mainly been used for performance studies.

In UT-land, multiple self-contained submissions are generated for a profile over time in contrast to FHR for which submissions contained all historical records. With the new format, a typical longitudinal query on the raw data requires conceptually a big group-by on the profile ID over all submissions to recreate the history for each profile. To avoid the expensive grouping operation we are providing a longitudinal batch view of our Telemetry dataset.

The longitudinal view is logically organized as a table where rows represent profiles and columns the various metrics (e.g. startup time). Each field of the table contains a list of chronologically sorted values, one per Telemetry submission received for that profile. Even though each profile could have been represented as an array of structures with Parquet, ultimately we decided to represent it as a structure of arrays to deal with some inefficiencies in reading nested data from Spark.

The dataset is periodically regenerated from scratch, which allows us to apply non backward compatible changes to the schema and not worry about merging procedures.

The  dataset can be accessed from Mozilla’s Spark clusters as a DataFrame:


frame = sqlContext.sql(SELECT * FROM longitudinal)

The view contains several thousand measures, which include all histograms and a large part of our scalar metrics stored in the various sections of Telemetry submissions.

frame.printSchema()
root
 |-- profile_id: string
 |-- normalized_channel: string
 |-- submission_date: array
 |    |-- element: string
 |-- build: array
 |    |-- element: struct
 |    |    |-- application_id: string
 |    |    |-- application_name: string
 |    |    |-- architecture: string
 |    |    |-- architectures_in_binary: string
 |    |    |-- build_id: string
 |    |    |-- version: string
 |    |    |-- vendor: string
 |    |    |-- platform_version: string
 |    |    |-- xpcom_abi: string
 |    |    |-- hotfix_version: string
...

A Jupyter notebook with example queries that use the longitudinal dataset is available on Spark clusters launched from a.t.m.o.