Main-thread IO Analysis

<edit>

  • The goal of this post is to show that main-thread IO matters and not to “shame” this or that add-on, or platform subsystem. There is only so much I can do alone by filing bugs and fixing some of them. By divulging knowledge about the off-main thread IO API and why it matters, my hope is that Firefox and add-ons developers alike will try to exploit it the next time an opportunity arises. I changed the title of this post accordingly to reflect my intentions.
  • Please note that just because some code performs IO on the main-thread, it doesn’t necessarily mean that it can be avoided. This has to be evaluated case by case.

</edit>

Recently a patch landed in Firefox that allows telemetry to collect aggregated main-thread disk IO timing per filename. Each filename comes with its accumulated time spent and the number of operations performed for open(), read(), write(), fsync() and stat(). Yoric wrote an interesting article some time ago on why main-thread IO can be a serious performance issue.

I collected a day worth of Nightly data using Telemetry’s map-reduce framework and filtered out all those filenames that were not common enough (< 10%) among the submissions. What follows are some interesting questions we could answer thanks to the newly collected data.

  • How common are the various operations?
    The most common operation is stat() followed by open(), all other operations are very uncommon with a third quartile of 0.
  • Is there a correlation between the number of operations and the time spent doing IO on a file?
    There is an extremely weak correlation which is what one would expect. Generally, what really makes a difference is the amount of bytes read and written for instance, not the number of actual operations performed.
  • Does having a SSD disk lower the time spent doing IO on a file?
    A Mann-Whitney test confirms that the distributions of time conditioned on “having a SSD” and “not having a SSD” seem to be indeed statistically different. On average having an SSD disk lowers the time by a factor of 2.7.

The above facts apply to all submissions, but what we are really interested in is discerning those files for which the accumulated time is particularly high. An interesting exercise is to compare some distributions of the top 100 files (in terms of time) against the distributions of the general population to determine if bad timings are correlated with older machines:

  • Does the distribution of Windows versions differ between the population and the top 100?
    Fisher’s test confirms that the distributions do not differ (with high probability), i.e. the OS version doesn’t seem to influence the timings.
  • Does the distribution of CPU architectures (32 vs 64 bit) differ between the population and the top 100?
    Fisher’s test confirms that the distributions do not seem to differ.
  • Does the distribution of the disk type differ between the population and the top 100?
    The distributions do not seem to differ according to Fisher’s test.

Can we deduce something from all this? It seems that files that behave badly do so regardless of the machine’s configuration. That is, removing the offending operations performing IO in the main-thread will likely benefit all users.

Which brings us to the more interesting question of how we should prioritize the IO operations that have to be moved off the main-thread. I am not a big fan of looking only at aggregates of data since outliers turn out to be precious more often than not. Let’s have a look at the distrubtion of the top 100 outliers.

Distribution of the top outliers.
1. Distribution of the top outliers.

Whoops, looks like httpDataUsage.dat, place-sqlite-wal, omni.ja and prefs.js behave particularly bad quite often. How bad is bad? For instance, one entry of httpDataUsage.dat performs about 100 seconds of IO operations (38 stat & 38 open), i.e. ~ 1.3 seconds per operation!

Now that we have a feeling of the outliers, let’s have a look at the aggregated picture. Say we are interested in those files most users hit (> 90%) and rank them by their third quartile:

Top 20 files most users hit
2. Top 20 files most users hit.

In general the total time spent doing IO is about 2 to 3 orders of magnitude lower than for the outliers.

But what happened to the add-ons? It turns out that the initial filtering we applied on our dataset removed most of them. Let’s rectify the situation by taking a dataset with the single hottest entry for each file, in terms of time, and aggregate the filenames by extension:

Top extensions with slow file operations.
3. Top extensions with slow file operations.

To keep track of some of the statistics presented here, mreid and myself prepared a  dashboard. Time to file some bugs!

11 thoughts on “Main-thread IO Analysis

  1. Can we please not have a repeat of the last automated addon performance measurement fiasco and just contact addon developers with suggestions on how to fix issues, rather than jumping straight to a “hall of shame”? Expecting addon developers to all know all the intricacies of how different APIs the need to use perform under various circumstances is kinda absurd.

    1. Don’t take the title so seriously, that was not my intention :) Without any sort of measurements how would you even know which authors to contact? Imho keeping that sort of plots behind closed doors and limiting ourselves to just contacting few developers doesn’t have the same positive impact as to publish them openly.

      1. I’m afraid your intention doesn’t matter if what you write ends up being quoted with less context on slashdot or c|net or something. That doesn’t have much positive impact on anything.

        If this stuff is important why aren’t these problems within Firefox fixed already? From my reading of these plots, Firefox developers should be the first targets for shaming, in particular whoever introduced that httpdatausage tracking, which I think is quite recent?

        If addons shouldn’t do this, then why are the APIs available to them in the first place?

      2. First, this post is not only about add-ons, it’s about main-thread IO. As it happens, add-ons perform main-thread IO as our platform code does. Just because there is some IO though it doesn’t mean that it’s necessarily avoidable so that has to be evaluated case by case.

        The asynchronous OS.File API is relatively new and it takes time to migrate code. It wouldn’t be feasible to migrate most of our code to the new API in one shot so it’s natural to profile and optimize IO where it matters.

        The goal of this post is to show that main-thread IO matters. There is only so much I can do alone by filing bugs and fixing some of them. But if Firefox and add-ons developers alike know about our off-main thread IO API and why it matters, next time they have to write some IO code they will make the right choice and everybody wins.

  2. So if I’m understanding the filtering correctly, addon-caused main-thread I/O is so dwarfed by Firefox-caused main-thread I/O that it only showed up when you look at only addons? And therefore focusing on addons (shaming or whatnot) is silly, at least at this point in time?

    Given what I know of what GreaseMonkey does, I don’t think Firefox provides the necessary APIs for it be off-main-thread.

    1. Filtering removes all files that don’t appear in the majority of Telemetry’s submissions on the nightly channel. Since most users have a different combination of extensions, add-on files don’t show up as often as the platform ones. Fixing FF issues first has clearly the priority, since those issues are the most common ones.

      You raised another very important point, which is that just because there is some main-thread IO, it doesn’t necessarily mean that it’s avoidable (e.g. prefs.js). This has to be investigated case by case.

  3. Hi, a friendly comment:

    You’ve made this great tool, written up this detailed analysis, even presented your results as attractive charts, and clearly you mean for your work to improve the state of play in the Mozillaverse.

    However, my reaction as I started reading this post was the same as bob’s, and my pattern-recognizer immediately served up a memory of one of the most useful add-ons evar, appearing at the top of a list of “bad add-ons”. I just think you need to be aware that you’re very close to stepping on tender toes here when that’s certainly not what you meant. Maybe you want to use more neutral wording, to avoid misinterpretation.

    Keep in mind that Greasemonkey and its creators are valued members of the Mozilla community. Greasemonkey’s been around almost as long as there’s been a Firefox, and with all the changes in Firefox between then and now and the at-times-Herculean effort of keeping up with those changes, it’s still here, still serving users, still supporting the Mozilla mission.

    1. I have the uttermost respect for add-ons developers which make Firefox the great product that it is. My choice of wording for the title was bad and didn’t reflect my intentions at all so I changed it accordingly. I really appreciated your feedback, thanks!

  4. Using scientific notation for the extensions graph is awful (but not for the main graph).

    It’s very hard for anyone to do anything remotely useful with such a graph with an incompatible scale.

    Include a reference point from the main graph. If using extensions.ini would cause all of the extensions I/O to be so tiny that it wouldn’t be visible in the graph, then pick the #100 or #1000 (or #10000) file from Firefox, name it, indicate its rank and include it in both graphs. In the extensions graph, graph it at the top of the graph.

    1. The top outliers for Firefox and the add-ons have a very similar behavior, in terms of order of magnitude, as hinted by my comment about httpDataUsage.dat on Figure 1. To make it more clear I added a reference bar in Figure 3.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s