Clustering Firefox hangs

Jim Chen recently implemented a system to collect stacktraces of threads running for more than 500ms. A summary of the aggregated data is displayed in a nice dashboard in which the top N aggregated stacks are shown according to different filters.

I have looked at a different way to group the frames that would help us identify the culprits of main-thread hangs, aka jank. The problem with aggregating stackframes and looking at the top N is that there is a very long tail of stacks that are not considered. It might very well be that by ignoring the tail we are missing out on some important patterns.

So I tried different clustering techniques until I settled on the very simple solution of aggregating the traces by their last frame. Why the last frame? When I used k-means to cluster the traces, I noticed that in many of the more interesting clusters the algorithm found, most stacks had the last frame in common, e.g.:

  • Startup::XRE_Main, (chrome script), Timer::Fire, nsRefreshDriver::Tick, PresShell::Flush, PresShell::DoReflow
  • Startup::XRE_Main, Timer::Fire, nsRefreshDriver::Tick, PresShell::Flush, PresShell::DoReflow
  • Startup::XRE_Main, EventDispatcher::Dispatch, (content script), PresShell::Flush, PresShell::DoReflow

Aggregating by the last frame yields clusters that contain enough stacktraces to be considered interesting and that are likely to explain the most common issues our users experience.
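The aggregation itself is trivial. A minimal sketch of the idea, using hypothetical sample stacks (outermost frame first, as in the examples above):

```python
from collections import Counter

# Hypothetical hang stacktraces, outermost frame first.
stacks = [
    ("Startup::XRE_Main", "Timer::Fire", "nsRefreshDriver::Tick",
     "PresShell::Flush", "PresShell::DoReflow"),
    ("Startup::XRE_Main", "EventDispatcher::Dispatch", "(content script)",
     "PresShell::Flush", "PresShell::DoReflow"),
    ("Startup::XRE_Main", "nsCycleCollector::collectSlice"),
]

# Cluster by the last (innermost) frame and rank clusters by size.
clusters = Counter(stack[-1] for stack in stacks)
total = sum(clusters.values())
top = [(frame, count / total) for frame, count in clusters.most_common(10)]
```

Each cluster's share of the total is what yields the percentages in the ranking below.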

Currently on Aurora, the top 10 meaningful offending main-thread frames are in order of importance:

  1. PresShell::DoReflow accounts for 5% of all stacks
  2. nsCycleCollector::collectSlice accounts for 4.5% of all stacks
  3. nsJSContext::GarbageCollectNow accounts for 3% of all stacks
  4. IPDL::PPluginInstance::SendPBrowserStreamConstructor accounts for 3% of all stacks
  5. (chrome script) accounts for 3% of all stacks
  6. filterStorage.js (Adblock Plus?) accounts for 2.7% of all stacks
  7. nsStyleSet::FileRules accounts for 2.7% of all stacks
  8. IPDL::PPluginInstance::SendNPP_Destroy accounts for 2% of all stacks
  9. IPDL::PPluginScriptableObject::SendHasProperty accounts for 2% of all stacks
  10. IPDL::PPluginScriptableObject::SendInvoke accounts for 1.7% of all stacks

Even without showing sample stacks for each cluster, there is some useful information here. The elephants in the room are clearly plugins; or should I say Flash? But just how much do “plugins” hurt our responsiveness? In total, plugin-related traces account for about 15% of all hangs. It also seems that the median duration of a plugin hang is no different from a non-plugin one, i.e. between 1 and 2 seconds.
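The plugin share and the median-duration comparison can be sketched like this, assuming hypothetical per-hang records of (last frame, duration in seconds) and treating IPDL plugin protocol frames as the marker for plugin-related hangs:

```python
import statistics

# Hypothetical per-hang records: (last frame, duration in seconds).
hangs = [
    ("IPDL::PPluginInstance::SendNPP_Destroy", 1.4),
    ("IPDL::PPluginScriptableObject::SendInvoke", 1.8),
    ("PresShell::DoReflow", 1.2),
    ("nsCycleCollector::collectSlice", 1.6),
]

def is_plugin(frame):
    # Assumption: IPDL plugin protocol frames identify plugin-related hangs.
    return frame.startswith("IPDL::PPlugin")

plugin_durations = [d for f, d in hangs if is_plugin(f)]
other_durations = [d for f, d in hangs if not is_plugin(f)]

plugin_share = len(plugin_durations) / len(hangs)
plugin_median = statistics.median(plugin_durations)
other_median = statistics.median(other_durations)
```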

But just how often does a hang occur during a session? Let’s have a look:

[Figure: distribution of the number of hangs per session]

The median number of hangs per session is 5; the mean is less informative, as large outliers skew the data. Also noteworthy is that the median duration of a session is about 16 minutes.
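This is why the summary above reports the median rather than the mean. A small sketch with made-up per-session counts shows how a single heavy session drags the mean away from the typical value:

```python
import statistics

# Hypothetical hang counts per session; one heavy outlier session.
hangs_per_session = [2, 3, 5, 5, 6, 7, 120]

median_hangs = statistics.median(hangs_per_session)  # robust to outliers
mean_hangs = statistics.mean(hangs_per_session)      # pulled up by the outlier
```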

As one would expect, the number of hangs tends to increase with the duration of a session:

[Figure: number of hangs vs. session uptime]

The analysis was run on a week’s worth of data for Aurora (over 50M stackframes), and similar results were obtained for previous weeks.

There is some work in progress to improve the status quo. Aaron Klotz’s formidable async plugin initialization is going to eliminate frame 4, and he might tackle frame 8 in the future. Furthermore, a recent improvement in cycle collection is hopefully going to reduce the impact of frame 2.
