Firefox Main-thread IO

  • The goal of this post is to show that main-thread IO matters, not to “shame” this or that add-on or platform subsystem. There is only so much I can do alone by filing bugs and fixing some of them. By spreading knowledge about the off-main-thread IO API and why it matters, my hope is that Firefox and add-on developers alike will take advantage of it the next time an opportunity arises. I changed the title of this post accordingly to reflect my intentions.
  • Please note that just because some code performs IO on the main thread, it doesn’t necessarily mean that it can be avoided. This has to be evaluated case by case.

Recently a patch landed in Firefox that allows Telemetry to collect aggregated main-thread disk IO timings per filename. Each filename comes with its accumulated time spent and the number of operations performed for open(), read(), write(), fsync() and stat(). Yoric wrote an interesting article some time ago on why main-thread IO can be a serious performance issue.

I collected a day’s worth of Nightly data using Telemetry’s map-reduce framework and filtered out all the filenames that were not common enough among the submissions (appearing in fewer than 10% of them). What follows are some interesting questions we could answer thanks to the newly collected data.
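
The filtering step boils down to something like the following sketch, assuming the map-reduce output has been loaded into a pandas DataFrame with one row per submission and filename (the column names are illustrative, not the actual job’s schema):

import pandas as pd

def drop_rare_files(df, min_fraction=0.10):
    """Keep only filenames seen in at least `min_fraction` of all submissions."""
    n_submissions = df["submission_id"].nunique()
    per_file = df.groupby("filename")["submission_id"].nunique()
    common = per_file[per_file >= min_fraction * n_submissions].index
    return df[df["filename"].isin(common)]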

  • How common are the various operations?
    The most common operation is stat(), followed by open(); all other operations are very uncommon, with a third quartile of 0.
  • Is there a correlation between the number of operations and the time spent doing IO on a file?
    There is an extremely weak correlation, which is what one would expect. What generally makes a difference is, for instance, the number of bytes read and written, not the number of operations performed.
  • Does having a SSD disk lower the time spent doing IO on a file?
    A Mann-Whitney test confirms that the distributions of time conditioned on “having an SSD” and “not having an SSD” are indeed statistically different (see the sketch after this list). On average, having an SSD lowers the time by a factor of 2.7.
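
As for the SSD question, the comparison boils down to something like the following sketch; it assumes the accumulated times have already been split into two sequences according to the “has an SSD” flag, and the names are made up:

from scipy.stats import mannwhitneyu

def ssd_makes_a_difference(times_ssd, times_no_ssd, alpha=0.05):
    """Return True if the two timing distributions differ significantly.

    Mann-Whitney is used rather than a t-test because the IO timings are
    heavily skewed, so we don't want to assume normality.
    """
    _, p_value = mannwhitneyu(times_ssd, times_no_ssd, alternative="two-sided")
    return p_value < alpha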

The above facts apply to all submissions, but what we are really interested in is discerning those files for which the accumulated time is particularly high. An interesting exercise is to compare some distributions of the top 100 files (in terms of time) against the distributions of the general population to determine if bad timings are correlated with older machines:

  • Does the distribution of Windows versions differ between the population and the top 100?
    Fisher’s test confirms that the distributions do not differ (with high probability), i.e. the OS version doesn’t seem to influence the timings.
  • Does the distribution of CPU architectures (32 vs 64 bit) differ between the population and the top 100?
    Fisher’s test confirms that the distributions do not seem to differ (a sketch of this test follows the list).
  • Does the distribution of the disk type differ between the population and the top 100?
    The distributions do not seem to differ according to Fisher’s test.
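
Each of these checks is a contingency-table comparison along the following lines, shown here for the 32- vs 64-bit split since scipy’s fisher_exact handles 2×2 tables (the Windows-version comparison, which has more than two categories, would need an r×c variant such as R’s fisher.test):

from scipy.stats import fisher_exact

def arch_split_differs(top_32, top_64, rest_32, rest_64, alpha=0.05):
    """Return True if the 32/64-bit split differs between the top 100 and the rest."""
    _, p_value = fisher_exact([[top_32, top_64],
                               [rest_32, rest_64]])
    return p_value < alpha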

Can we deduce something from all this? It seems that files that behave badly do so regardless of the machine’s configuration. That is, moving the offending IO operations off the main thread will likely benefit all users.

Which brings us to the more interesting question of how we should prioritize the IO operations that have to be moved off the main thread. I am not a big fan of looking only at aggregates of data, since outliers turn out to be precious more often than not. Let’s have a look at the distribution of the top 100 outliers.

1. Distribution of the top outliers.

Whoops, looks like httpDataUsage.dat, places.sqlite-wal, omni.ja and prefs.js behave particularly badly quite often. How bad is bad? For instance, one entry for httpDataUsage.dat accumulates about 100 seconds of IO over 38 stat and 38 open calls, i.e. ~1.3 seconds per operation!

Now that we have a feel for the outliers, let’s have a look at the aggregated picture. Say we are interested in the files that most users hit (> 90%) and rank them by their third quartile:

2. Top 20 files most users hit.

In general the total time spent doing IO is about 2 to 3 orders of magnitude lower than for the outliers.
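
For reference, a ranking like the one above can be computed along these lines, reusing the hypothetical per-submission DataFrame from the earlier sketch (with the accumulated time assumed to live in a total_time_ms column):

def rank_common_files(df, min_fraction=0.90, top_n=20):
    """Rank files hit by > `min_fraction` of users by the third quartile of their IO time."""
    n_submissions = df["submission_id"].nunique()
    per_file = df.groupby("filename")["total_time_ms"]
    hit_rate = per_file.size() / n_submissions   # fraction of submissions touching the file
    third_quartile = per_file.quantile(0.75)
    return third_quartile[hit_rate > min_fraction].sort_values(ascending=False).head(top_n)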

But what happened to the add-ons? It turns out that the initial filtering we applied to our dataset removed most of them. Let’s rectify the situation by taking a dataset with the single hottest entry for each file, in terms of time, and aggregating the filenames by extension:

3. Top extensions with slow file operations.
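
The aggregation itself is straightforward on the same hypothetical DataFrame: keep the single hottest entry per filename and sum the times per extension (again just a sketch):

import os

def slowest_extensions(df, top_n=10):
    """Aggregate the hottest per-file entry by file extension."""
    hottest = df.groupby("filename")["total_time_ms"].max()
    extensions = hottest.index.map(lambda name: os.path.splitext(name)[1] or "<none>")
    return hottest.groupby(extensions).sum().sort_values(ascending=False).head(top_n)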

To keep track of some of the statistics presented here, mreid and I prepared a dashboard. Time to file some bugs!

Idle wakeups are evil

I have been spending some quality time with gdb tracking down timers that fire while Firefox is idle, which turn out to be the main culprit for FF’s energy consumption. Wakeups are triggered either because a timer fires or because a process sleeping on a synchronisation variable is woken up. In fact the former can be reduced to the latter, since the timer thread, which executes the timer callbacks, also blocks on a synchronisation variable.

Tracking down wakeups is nothing new and has been done in the past. Automating it, on the other hand, isn’t trivial without heavily modifying the source code. I tried automating the process through a gdb script, but it kept crashing randomly, and when it didn’t, FF got significantly slower, so I gave up on that. There are some alternatives though; e.g. on Linux the perf subsystem can be used to track the following events:

  • the context switches (sched:sched_switch),
  • wakeups (sched:sched_wakeup),
  • and the C-states of the CPU (power:cpu_idle).

By keeping track of the CPU C-states we know when the CPU is sleeping. The wakeup probe tells us when a specific thread is woken up and, by correlating it with the previous probe, it gives us the culprit that caused the CPU C-state to change. Finally, the context switch probe allows us to retrieve a user-space callgraph of the thread that was woken up.

That’s an interesting approach, and powertop does something similar, but it turns out that to investigate wakeups we don’t really need that degree of accuracy. Instead of considering only the context switches caused by wakeups correlated with C-state changes, we can just consider all context switches. Some context switches will not be associated with C-state changes, but this heuristic is still good enough as a first approximation during an idle workload. On Linux the following command can be used to collect callgraphs of context switches (it samples every sched_switch event with -c 1, records user-space callgraphs with -g and attaches to the running Firefox process with -p):

perf record -e sched:sched_switch -R -c 1 -g -p $(pgrep firefox)

Looking at the callgraphs collected this way, most of the time a context switch happens we are either in the BackgroundHangManager, in the Timer thread or in nsSocketTransportService. Not all timers are created equal though, and we would really like to know which timers are firing. Unfortunately, since the specific timer callback is executed only after the context switch, there isn’t a simple way to get that sort of information through perf.

The dynamic linker comes to our rescue: by using dladdr we can retrieve the mangled name of a timer callback handler, which we can then demangle and print with __cxa_demangle. This clearly works only if the callbacks are exported in the dynamic symbol table, which they are by default on OS X, whereas on Linux we have to relink the codebase with the -export-dynamic linker flag.
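
For the curious, the two calls can be exercised from any language with a C FFI. The following is a small, self-contained sketch (Python with ctypes, assuming Linux with glibc and libstdc++) that maps a code address back to a demangled name, using std::terminate from libstdc++ as a stand-in for a timer callback; it only illustrates dladdr and __cxa_demangle, not how the instrumentation is wired up inside Firefox:

# Illustration only: resolve a code address to its demangled C++ symbol name
# with dladdr() and __cxa_demangle(). Assumes Linux with glibc and libstdc++.
import ctypes

libc = ctypes.CDLL(None)                    # dladdr() and free() from the C runtime
libcxx = ctypes.CDLL("libstdc++.so.6")      # provides __cxa_demangle()

class Dl_info(ctypes.Structure):
    _fields_ = [("dli_fname", ctypes.c_char_p),
                ("dli_fbase", ctypes.c_void_p),
                ("dli_sname", ctypes.c_char_p),
                ("dli_saddr", ctypes.c_void_p)]

def symbol_at(address):
    """Return the demangled name of the dynamic symbol covering `address`."""
    info = Dl_info()
    if not libc.dladdr(ctypes.c_void_p(address), ctypes.byref(info)) or not info.dli_sname:
        return None                         # address not covered by an exported symbol
    demangle = libcxx.__cxa_demangle
    demangle.restype = ctypes.c_void_p      # returns a malloc'ed buffer
    status = ctypes.c_int()
    buf = demangle(info.dli_sname, None, None, ctypes.byref(status))
    if status.value != 0 or not buf:
        return info.dli_sname.decode()      # not a mangled C++ name, return as-is
    name = ctypes.string_at(buf).decode()
    libc.free(ctypes.c_void_p(buf))
    return name

# Use std::terminate (_ZSt9terminatev) as a stand-in for a timer callback.
address = ctypes.cast(libcxx._ZSt9terminatev, ctypes.c_void_p).value
print(symbol_at(address))                   # prints "std::terminate()"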