In an effort to ease data access at Mozilla we started providing SQL access to our Parquet datasets through Presto. The benefit of SQL is that it’s a commonly understood language among engineers and non-engineers alike and is generally easy to pick up.
Even though we use Redshift for some of our data, there are datasets that store complex hierarchical data which would require unnatural transformations to fit in a typical SQL store that doesn’t support the flexibility of a nested data model (think of structs, arrays & maps) such as the one provided by Parquet. Furthermore, we were looking to use a single store for our data, i.e. Parquet files on S3, which both Spark and a SQL engine could access directly.
Presto provides the best of both worlds: SQL over Parquet. Presto is an open-source distributed SQL query engine for running interactive analytic queries against various data sources. It allows querying different sources such as Hive and Cassandra, relational databases or even proprietary data stores and a single query can combine data from multiple sources.
Apache Drill offers similar features without requiring up-front schema knowledge, which is a big advantage over Presto given how painful a schema migration can be at times. Overall Drill feels less mature than Presto though and is not supported yet by Amazon EMR unlike Presto, which makes deployment & maintance more involved. For those reasons we picked Presto as our SQL engine.
Presto supports a rich set of data types which map to Parquet ones:
The  operator is used to access an element of an array:
SELECT my_array AS first_element
The  operator is used to retrieve the value corresponding to a given key from a map:
SELECT name_to_age_map['Bob'] AS bob_age
Fields within a structure are accessed with the field reference operator .:
Unnesting maps and structs
The UNNEST clause is used to expand an array or a map into a relation. Arrays are expanded into a single column, and maps are expanded into two columns (key, value). UNNEST can also be used with multiple arguments, in which case they are expanded into multiple columns, with as many rows as the highest cardinality argument:
SELECT numbers, animals, n, a FROM ( VALUES (ARRAY[2, 5], ARRAY['dog', 'cat', 'bird']), (ARRAY[7, 8, 9], ARRAY['cow', 'pig']) ) AS x (numbers, animals) CROSS JOIN UNNEST(numbers, animals) AS t (n, a);
numbers | animals | n | a -----------+------------------+------+------ [2, 5] | [dog, cat, bird] | 2 | dog [2, 5] | [dog, cat, bird] | 5 | cat [2, 5] | [dog, cat, bird] | NULL | bird [7, 8, 9] | [cow, pig] | 7 | cow [7, 8, 9] | [cow, pig] | 8 | pig [7, 8, 9] | [cow, pig] | 9 | NULL
The longitudinal view is currently accessible through Presto. As a concrete example, this is how one could count the number of Telemetry fragments over a time range:
SELECT d, count(*) FROM ( SELECT substr(ssd, 1, 11) AS d FROM longitudinal CROSS JOIN UNNEST(subsession_start_date) AS s(ssd) ) WHERE d >= '2015-11-15' AND d < '2016-03-15' GROUP BY d ORDER BY d ASC
Re:dash allows to query Presto directly from Firefox. After a query is run a table is displayed with the result. Queries can be saved and optionally scheduled to run periodically at a given time.
Different kinds of plots (e.g. bar charts, line charts, boxplots, …) can be built over a table which are updated every time the query is re-run.
Custom dashboards can be built that link to tables and plots. Users can visualize the dashboards and optionally access the SQL code that powers them and fork it.
Mozilla’s re:dash instance can be accessed at sql.telemetry.mozilla.org.