Fighting regressions with Benchmarks in CI

Chris Craik
Published in Android Developers
Oct 17, 2019

We released the first Benchmark library alpha at I/O 2019, and have been improving it since to help you measure performance accurately while optimizing your Android code. Jetpack Benchmarks are standard JUnit instrumentation tests that run on an Android device, and use a rule provided by the library to perform the measuring and reporting:
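A benchmark looks something like the following minimal sketch, with an illustrative workload and package names from the current androidx.benchmark JUnit4 artifact:

```kotlin
import androidx.benchmark.junit4.BenchmarkRule
import androidx.benchmark.junit4.measureRepeated
import androidx.test.ext.junit.runners.AndroidJUnit4
import org.junit.Rule
import org.junit.Test
import org.junit.runner.RunWith

@RunWith(AndroidJUnit4::class)
class SampleBenchmark {
    // The rule performs warmup, looping, and result reporting.
    @get:Rule
    val benchmarkRule = BenchmarkRule()

    @Test
    fun stringBuilding() {
        benchmarkRule.measureRepeated {
            // The code you want to measure goes here; this trivial
            // workload is just a placeholder.
            StringBuilder().apply { repeat(100) { append(it) } }.toString()
        }
    }
}
```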

Sample project on Github at android/performance-samples.
Sample Android Studio output, running multiple benchmarks.

The library handles warmup, detects configuration problems, and measures the performance of your code all through its JUnit Rule API.

That’s all great for benchmarking at your desk, but much of the value of benchmarks comes from detecting regressions in Continuous Integration. How do you handle benchmark data in CI?

Benchmarks vs Correctness Tests

Even when you have thousands of correctness tests, it’s easy to put them all on a dashboard by collapsing information. Below is the one we use for Jetpack.

Nothing particularly special, but it uses two common tricks to reduce the visual load. First, it collapses the list of thousands of tests by package and by class. Then, by default, it hides packages with no failures in them. The test results of dozens of libraries, nearly 20 thousand tests in total, are easily shown in a few lines of text. Correctness test dashboards scale pretty nicely!

But how about benchmarks? Benchmarks don’t output a simple pass/fail; they output a scalar value per test. That means we can’t simply collapse the passing results. Let’s just look at a graph of the data, and maybe we can discern patterns visually. After all, you’re likely to have far fewer benchmarks than correctness tests…

That’s a lot of visual noise. Even with just hundreds of results instead of thousands, this isn’t a useful way to look at data. Benchmarks that haven’t changed take up as much visible space as real regressions, so we really need to filter this down.

Simple Approach to Regression Detection

We can start with something simple, to try to get back to the pass/fail world of correctness tests. We could define a failure to be any benchmark that regresses by more than some percentage threshold between two runs. This falls over in practice though, due to variance.
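In code, that naive check is just a percentage comparison between the previous and current result (a sketch; the function name and the 5% value are illustrative):

```kotlin
// Naive regression check: flag any benchmark whose time increased by more
// than a fixed percentage since the previous run. Illustrative only.
const val REGRESSION_THRESHOLD_PERCENT = 5.0

fun isNaiveRegression(previousNs: Double, currentNs: Double): Boolean {
    val percentChange = (currentNs - previousNs) / previousNs * 100.0
    return percentChange > REGRESSION_THRESHOLD_PERCENT
}
```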

Benchmarks with View inflation are prone to higher variance, but still provide useful data.

As much as we try to produce stable, consistent numbers in benchmarks, variance can still be high, depending on the workload and device you’re running on. For example, we find tests that inflate Views to be significantly less stable than other CPU workload benchmarks. A single percentage threshold won’t work well for every test, and we don’t want to put the burden of assigning per-test thresholds (or baselines) on the benchmark author, since that’s cumbersome to maintain over time and doesn’t scale well as the number of benchmarks grows.

Variance can also come in the form of infrequent suite-wide spikes, where some condition of the device being tested yields abnormally slow results for several benchmarks in a row. While we can fix some of these (for instance, by preventing runs when cores are disabled due to low battery), it’s difficult to prevent them entirely.

recyclerview, ads-identifier, and room benchmarks, all spiking during one run — we DO NOT want to report this as a regression

The takeaway is that we can’t just look at results from Build N vs N-1 to find a regression — we need more context to make that decision.

Step-Fitting, a Scalable Solution

The approach we use in Jetpack CI is step-fitting, provided by the Skia Perf application.

The idea is that we look for step functions in benchmark data. If we inspect each benchmark’s sequence of results, we can look for “steps” up or down, as signals that a particular build changed the performance of a benchmark. We want to look at several data points though, to ensure what we see is a consistent pattern across multiple results, not a fluke:

Context reveals a large regression to actually be an unstable benchmark

How do we check for such a step? We look at multiple results before and after a change:

Then we compute the significance of the regression:
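What follows is a sketch of the idea in Kotlin; the real implementation is the Skia Perf step-fit source linked at the end of this post, and its exact error formula (and therefore the scale of a sensible THRESHOLD) differs somewhat:

```kotlin
import kotlin.math.abs
import kotlin.math.sqrt

const val WIDTH = 5        // results to consider on each side of a build
const val THRESHOLD = 25.0 // sensitivity: how large a step must be, relative to noise

// Root-mean-square deviation of a window of results from its own mean,
// i.e. the "error" we measure on one side of a candidate change.
fun error(window: List<Double>): Double {
    val mean = window.average()
    return sqrt(window.sumOf { (it - mean) * (it - mean) } / window.size)
}

// Significance of a step in the middle of `results`, which holds
// 2 * WIDTH values: WIDTH before the candidate build and WIDTH after.
fun stepFit(results: List<Double>): Double {
    require(results.size == 2 * WIDTH)
    val before = results.take(WIDTH)
    val after = results.takeLast(WIDTH)
    // Weigh the difference of the averages by the measured noise, so a
    // stable benchmark can flag a small step while a noisy one cannot.
    val noise = (error(before) + error(after)).coerceAtLeast(1e-9)
    return abs(before.average() - after.average()) / noise
}

fun isSignificant(results: List<Double>): Boolean = stepFit(results) > THRESHOLD
```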

This works by detecting error both before and after the change landed, and weighing the difference in the averages based on that error. The less variance a benchmark has, the more confidence we have in detecting a small regression. This lets us run microbenchmarks with nanosecond precision in the same system alongside big (for mobile) database benchmarks with much higher variance.

You can also try it for yourself! Run the algorithm against data from a WorkManager benchmark running in our CI. It will output links to one build with a regression, and to its subsequent fix (click on ‘View Changes’ to see the commits inside). Those match up to the regressions and improvements that a person would see when the data is graphed:

All the minor noise in the graph is ignored, based on our configuration of the algorithm. There are two parameters you can experiment with above to control when it fires:

  1. WIDTH — how many results to consider before and after a commit
  2. THRESHOLD — how severe a regression must be to show up on your dashboard

Higher width increases resistance to inconsistency, but can make it harder to find regressions in a result that changes frequently — we use a width of 5, currently. Threshold is a general sensitivity control — we currently use 25. Lower it to find more regressions, but you may also see more false positives.

To set this up for yourself in CI:

  1. Write some benchmarks!
  2. Run them in CI on real devices, preferably with sustained performance support
  3. Collect the output metrics from the JSON reports
  4. Every time a result is ready, look at the last 2 * WIDTH results
  5. If there’s a regression or improvement, fire an alert (email, issue, or whatever works for you) to investigate the performance of WIDTH builds ago (see the sketch below)
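Steps 4 and 5 don’t take much code. Here’s a sketch that reuses WIDTH and isSignificant from the step-fit sketch above, and assumes you’ve already parsed a median time per build out of each benchmark’s JSON report (the Alert type and fireAlert callback are placeholders for your own CI plumbing):

```kotlin
// Placeholder alert type for your CI plumbing; not part of any library.
data class Alert(val benchmark: String, val suspectBuildId: String)

fun checkBenchmark(
    benchmark: String,
    history: List<Pair<String, Double>>, // (buildId, median time in ns), oldest first
    fireAlert: (Alert) -> Unit
) {
    if (history.size < 2 * WIDTH) return // not enough context yet
    val window = history.takeLast(2 * WIDTH)
    if (isSignificant(window.map { it.second })) {
        // The step sits between the two halves of the window, so the build
        // from WIDTH results ago is the one to investigate.
        fireAlert(Alert(benchmark, suspectBuildId = window[WIDTH].first))
    }
}
```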

Presubmit

What about presubmit? It’s way easier to catch regressions if you don’t let them into the build!

Running benchmarks in presubmit can be a great way to prevent regressions entirely, but first remember: benchmarks are like flaky tests, which need infrastructure like the above algorithm to work around instability.

For presubmit tests, which can interrupt the workflow of submitting a patch, you’ll want especially high confidence in the regression detection you use.

The step fit algorithm above is needed because single runs of benchmarks don’t give us sufficient confidence on their own. Again though, we can capture more data to gain that confidence — simply run multiple times with and without the change to check if the patch introduces a regression.

As long as you’re comfortable with the increased resource cost of running your benchmarks several times per change to gain that confidence, presubmit can work well!
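One way to do that is to reuse the same significance math from above, treating the baseline runs as the “before” window and the patched runs as the “after” window. A sketch, assuming WIDTH runs of each (how you collect those runs is up to your CI):

```kotlin
// Presubmit check sketch: compare runs without the patch against runs with it,
// using the same step-fit significance test. Illustrative only.
fun patchRegresses(
    runsWithoutPatch: List<Double>, // e.g. medians from cached postsubmit runs
    runsWithPatch: List<Double>     // e.g. medians measured with the patch applied
): Boolean {
    require(runsWithoutPatch.size == WIDTH && runsWithPatch.size == WIDTH)
    val significant = isSignificant(runsWithoutPatch + runsWithPatch)
    val slower = runsWithPatch.average() > runsWithoutPatch.average()
    return significant && slower
}
```

As the recommendations below note, treat a positive result as input to code review rather than a hard block on submission.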

Full disclosure — we don’t use Benchmarks in presubmit currently in Jetpack, but if you want to, here are my recommendations:

  • Run your benchmarks 5+ times, both with and without the patch (the latter can often be cached, or gotten from postsubmit results)
  • Consider skipping especially slow benchmarks
  • Don’t block submitting a patch based on results — just consider the results during code review. Regressions sometimes happen as part of improving the codebase!
  • Consider that the previous results may not exist. Newly added benchmarks can’t be checked in presubmit!

Conclusion

Jetpack Benchmark provides an easy way to get accurate performance metrics out of an Android device. Combined with the step fit algorithm above, it lets you work around instability and detect performance regressions before they hit users — just as we do in Jetpack CI.

Notes on where to start:

  • Capture your key scrolling interfaces in benchmarks
  • Add performance tests for key library interactions, and expensive CPU work
  • Treat improvements just like regressions — they’re worth investigating!

Additional Reading

If you’d like to learn more, I gave a talk about Benchmarks in CI with @itsdustinlam at Android Developer Summit 2019.

To learn more about how Jetpack Benchmark works, see our Google I/O talk. Benchmark results from Jetpack libraries are on androidx-perf.skia.org.

We use the Skia Perf application to track the performance of the AndroidX libraries. You can see the actual source of the step fit algorithm described here, as it runs in our CI. If you’re interested in learning more, Joe Gregorio has written another blog post about their more advanced k-means clustering detection algorithm, explaining the specific problems and solutions the Skia project has developed, designed to scale with its many configurations (OS and OS version, CPU/GPU chip/driver variant, compiler, etc.).
