Lots is changing in Telemetry land. If you do occasionally run data analyses with our Spark infrastructure you might want to keep reading.
The Telemetry and FHR collection systems on desktop are in the process of being unified. Both systems will be sending their data through a common data pipeline which has some features of both the current Telemetry pipeline as well as the Cloud Services one that we use to ingest server logs.
The goals of the unification are to:
- avoid measuring the same metric in multiple systems on the client side;
- reduce the latency from the time a measurement occurs until it can be analyzed on the server;
- increase the accuracy of measurements so that they can be better correlated with factors in the user environment such as the specific build, enabled add-ons, and other hardware or software characteristics;
- use a common data pipeline for client telemetry and service log data.
The unified pipeline is currently sending data for Nightly, Aurora and Beta. The classic FHR and Telemetry pipelines will keep sending data at least until the new unified pipeline has been fully validated. The plan is to land this feature in the 40 release. We’ll also continue to respect existing user preferences: if the user has opted out of FHR or Telemetry, we’ll continue to respect that for the equivalent data sets. Similarly, the opt-out and opt-in defaults will remain the same for equivalent data sets.
A Telemetry ping, stored as JSON object on the client, encapsulates the data sent to our backend. The main differences between the new unified Telemetry ping format (v4) and the classic Telemetry one (v2) are that:
- multiple ping types are supported beyond the classic saved-session ping, like the main ping;
- pings have a common top-level section which contains basic information shared between ping types, like the build ID and channel;
- pings have an optional environment field, which contains data expected to be characteristic of performance and other behavior.
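To make the structure concrete, here is a hand-written, trimmed-down sketch of what a v4 ping might look like. The field names shown are illustrative assumptions, not an exhaustive or authoritative schema:

```python
# Illustrative sketch of a v4 ping's top-level structure (trimmed down;
# the field names are examples, not the full or authoritative schema).
example_ping = {
    "type": "main",                      # ping type, e.g. "main" or "saved-session"
    "version": 4,                        # unified ping format version
    "creationDate": "2015-06-01T12:00:00.000Z",
    "application": {                     # common top-level information
        "buildId": "20150601000000",
        "channel": "nightly",
    },
    "environment": {},                   # optional environment data
    "payload": {
        "info": {"reason": "shutdown"},  # why this main ping was generated
        "histograms": {},
    },
}
```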
From an analysis point of view, the most important addition is the main ping which includes the very same histograms and other performance and diagnostic data as the v2 saved-session pings. Unlike in “classic” Telemetry though, there can be multiple main pings during a single session. A main ping is triggered by different scenarios, which are documented by the reason field:
- aborted-session: periodically saved to disk and deleted at shutdown – if a previous aborted session ping is found at startup it gets sent to our backend;
- environment-change: generated when the environment changes;
- shutdown: triggered when the browser session ends;
- daily: a session split triggered at 24-hour intervals at local midnight; this is needed to make sure we keep receiving data from clients that have very long sessions.
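Once you have a batch of main pings in hand, splitting them by reason is a simple filter on that field. The sketch below assumes the reason lives under payload/info, and uses minimal hand-written dicts rather than real v4 pings:

```python
# Sketch: partition main pings by their reason field. The ping dicts here
# are minimal hand-written examples; real v4 pings carry much more data.
def filter_by_reason(pings, reason):
    """Keep only pings whose payload/info/reason matches `reason`."""
    return [p for p in pings
            if p.get("payload", {}).get("info", {}).get("reason") == reason]

pings = [
    {"payload": {"info": {"reason": "shutdown"}}},
    {"payload": {"info": {"reason": "daily"}}},
    {"payload": {"info": {"reason": "shutdown"}}},
]
print(len(filter_by_reason(pings, "shutdown")))  # 2
```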
Data access through Spark
Once you connect to a Spark-enabled IPython notebook launched from our self-service dashboard, you will be prompted with a new tutorial based on the v4 dataset. The v4 data is fetched through the get_pings function by passing “v4” as the schema parameter. The following parameters are valid for the new data format:
- app: an application name, e.g.: “Firefox”;
- channel: a channel name, e.g.: “nightly”;
- version: the application version, e.g.: “40.0a1”;
- build_id: a build ID or a range of build IDs, e.g.: “20150601000000” or (“20150601000000”, “20150610999999”);
- submission_date: a submission date or a range of submission dates, e.g.: “20150601” or (“20150601”, “20150610”);
- doc_type: the ping type, e.g.: “main”; set to “saved_session” by default;
- fraction: the fraction of pings to return; set to 1.0 by default.
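Putting the parameters together, a fetch might look like the sketch below. Note that sc (the SparkContext) and get_pings are provided by the notebook environment; the wrapper function and the specific parameter values are illustrative, not a prescribed recipe:

```python
# Sketch: fetching v4 main pings with get_pings. `sc` and `get_pings` come
# from the Spark notebook environment; the parameter values are examples.
def fetch_main_pings(sc, get_pings):
    return get_pings(sc,
                     app="Firefox",
                     channel="nightly",
                     version="40.0a1",
                     build_id=("20150601000000", "20150610999999"),
                     submission_date=("20150601", "20150610"),
                     doc_type="main",
                     fraction=0.1,
                     schema="v4")  # select the new unified data format
```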
Once you have an RDD, you can further filter the pings down by reason. There is also a new experimental API that returns the history of submissions for a subset of profiles, which can be used for longitudinal analyses.