Tackling Flaky Capybara Tests

Posted Mar 15, 2016 by Jordon Dornbos

At Handshake we rely heavily on automated testing to protect against regressions and to give us confidence and speed when writing and deploying code. Some of the technologies we use to test our code include RSpec, Capybara, FactoryGirl, and occasionally Selenium. As our application grew and we moved towards a component-based architecture, we saw an increase in flaky tests. Flaky tests are harmful because they erode the trust you have in your test suite: once that trust is gone, you might shrug off a failure as just another flake when in fact something is genuinely broken. Since we rely so heavily on automated testing and didn't want to lose this trust, we began putting tools and practices in place to keep tests from being flaky. These are the steps we've taken to reduce flakiness in our test suite. Your results may vary.

Prevention

The first step we took revolved around preventing flaky tests from happening in the first place. This included building knowledge around how to write great tests and providing helper functions to avoid common pitfalls. Building knowledge meant everyone documenting best practices so that we could share the lessons we'd learned. One simple technique we've used is overriding the default wait time for have_content. By default Capybara waits up to two seconds before timing out, but for some long-running actions increasing this to four seconds, for example, gives the action enough time to finish. Increasing the wait time is much better than adding something like sleep 4 to your tests, because sleep 4 will always take four seconds, whereas a four-second wait takes at most four seconds and returns as soon as the content appears.
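
One way to do this (an illustration, not necessarily how our own helpers are written) is to pass a longer wait to the matcher:

# Polls for up to four seconds instead of Capybara's two-second default,
# but returns as soon as the expected content appears.
expect(page).to have_content('Request approved', wait: 4)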

To avoid common pitfalls we've built numerous helper functions that make writing great tests easier. A common pitfall when using a component-based architecture is not waiting properly for an AJAX event to finish. With Capybara you can write expect(page).to have_content('foo'), which will keep checking the page for the content “foo” for up to two seconds by default. This provides a loosely coupled way of waiting for AJAX events to finish, which is sufficient for most cases, but there are some scenarios where checking for content changes alone won't guarantee an AJAX call has finished.

An example of this in Handshake is a dropdown option for approving an employer. When you click approve, we send an AJAX request to our servers and mark the employer as approved in the UI. If the approval isn't successful, we show an error message to the user and revert the UI back to showing the employer as pending. In this scenario, checking expect(page).to have_content('approved') wouldn't guarantee that the action completed successfully. All AJAX requests in Handshake are wrapped in a custom class (so we can show a loading bar for all actions), so we've written the following helper method to ensure the action has completed:

require 'timeout'

def wait_for_handshake_ajax_to_finish(wait_time = Capybara.default_max_wait_time)
  # Record how many workers (i.e. AJAX actions) have been created so far.
  starting_workers_created = page.evaluate_script('Handshake.operations.workers_created()')

  # Perform the action that kicks off the AJAX request.
  yield

  Timeout.timeout(wait_time) do
    loop do
      workers_created = page.evaluate_script('Handshake.operations.workers_created()')
      active_workers = page.evaluate_script('Handshake.operations.workers()')

      # Done once at least one new worker has been created (the action
      # started) and no workers remain active (the action finished).
      break if workers_created > starting_workers_created && active_workers.zero?
    end
  end
end
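
Used in a spec, it looks something like this (the link text and assertion are illustrative):

# Wrap the action that triggers the AJAX request in the helper's block.
wait_for_handshake_ajax_to_finish do
  click_link 'Approve'
end
expect(page).to have_content('approved')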

This helper records the starting count of workers (i.e. actions), yields to a block where the action is performed, and then loops until the worker count has changed (meaning the action was started) and all running workers have finished (meaning the action completed). At first we implemented a way to wait for AJAX events to complete (similar to the method described here), but we were running into race conditions where the action would complete before we started waiting, which would result in a timeout error. Recording the starting worker count and waiting for it to change helped us avoid this race condition.
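
To make the race concrete, that first approach looked roughly like this (a reconstruction using the same Handshake.operations API, not the exact code we ran):

# Race-prone version: wait for an AJAX worker to appear, then for all
# workers to finish. If the request completes before this helper runs,
# no active worker is ever observed and the first loop times out.
def wait_for_handshake_ajax_naive(wait_time = Capybara.default_max_wait_time)
  Timeout.timeout(wait_time) do
    sleep 0.01 until page.evaluate_script('Handshake.operations.workers()') > 0
    sleep 0.01 until page.evaluate_script('Handshake.operations.workers()').zero?
  end
end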

Randomization

The next step we took to help avoid flaky tests was to randomize some of the fields we use in our tests, as well as randomizing the order in which tests run. The two main fields we randomize are the id field and the time_zone field (on records that have a time zone). Randomizing ids helps us catch places where we might be looking up records using an id from the wrong table. Randomizing time zones helps us ensure that our site behaves properly no matter what time zone a user is in. This especially comes in handy for testing our appointment scheduling system, ensuring, for example, that a student studying abroad in Beijing can schedule an appointment with an advisor back in the US.
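
Our factories aren't shown here, but a minimal FactoryGirl sketch of the idea might look like the following (the :employer factory and its columns are illustrative):

FactoryGirl.define do
  factory :employer do
    # Start each table's ids at a random offset so a lookup that
    # accidentally uses an id from the wrong table fails loudly.
    sequence(:id, rand(10_000..99_999))
    # Pick a random time zone for every record created.
    time_zone { ActiveSupport::TimeZone.all.sample.name }
  end
end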

Randomizing the order in which tests run helps ensure that tests aren't inadvertently relying on each other. At first we saw a rise in build failures due to tests affecting each other, but little by little we fixed these flakes, which improved the overall quality of our tests. In particular, randomizing the order helps surface issues where objects aren't cleaned up properly after a test finishes, Elasticsearch indexes aren't reset, or caches aren't invalidated. In general, our test suite is structured so that it does this cleanup automatically and each test starts with a clean slate, but there are some special cases that rely on the developer to do this themselves.
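
The ordering side of this is just the standard RSpec configuration (roughly what a generated spec_helper ships with):

RSpec.configure do |config|
  # Run specs in a random order to surface order dependencies; the
  # printed seed lets you reproduce a failing order with --seed.
  config.order = :random
  Kernel.srand config.seed
end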

Internal Tools

The last step we took was building an internal tool to monitor our test suite. We run our test suite on CircleCI, and this internal tool processes the data CircleCI sends us after every build finishes. With this data we do a few different things. Since we have CircleCI split our test suite up and run it in parallel, code coverage is reported per build container, so the first thing we do is merge the coverage data from each container to get our overall code coverage percentage. This lets us set goals and track our progress over time. We also track test failures over time, which shows us how our test quality is trending and helps us quickly determine whether a test has a history of failing. When the tool detects that a test is flaky, it creates an internal incident, assigns it to the author of the commit, and alerts them via Slack. This helps us ensure that flaky tests are fixed as soon as possible, so we can continue to have confidence in our test suite.
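
The detection logic lives in that internal tool, but the core idea can be sketched with a simple heuristic (hypothetical code, not Handshake's actual implementation; the result objects and their fields are assumptions):

# A spec that has both passed and failed while the underlying code was
# unchanged (same commit) is likely flaky.
def flaky_spec?(results_for_spec)
  results_for_spec.group_by(&:commit_sha).any? do |_sha, runs|
    statuses = runs.map(&:status).uniq
    statuses.include?('passed') && statuses.include?('failed')
  end
end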

[Image: Spec Analyzer, a mock screenshot of our internal test analyzer]

Conclusion

By randomizing our tests, building a tool to monitor and report flaky tests, and adding helper functions that make flaky tests easier to avoid, we've seen an overall rise in the quality and reliability of our test suite. In fact, our build success rate increased by 38% over the past six months. We see far fewer flaky tests now, and when they do come up, they're handled quickly. This allows our development team to keep trusting our test suite so we can continue iterating quickly and confidently while avoiding regressions along the way.
