Making your specs bulletproof. Tips and best practices for identifying and fixing bad tests

Posted Nov 30, 2016 by Maria Verba

During my first couple of months at Handshake, I primarily focused on learning the product and keeping our team's productivity high by maintaining the health of our CI build. Our builds were generally good, but from time to time we ran into seemingly random failures that I took on fixing. While doing that, I noticed that, despite our shared effort to write fast, deterministic tests, some builds were failing due to flaky tests.

Flaky tests are bad news not just for your builds, but also for your engineering culture. They stall progress and force a context switch to rerun the broken build or container; they make you angry (“It just passed 10 minutes ago, why is it failing now??”); and worst of all, they undermine your trust in tests, which can lead you to dismiss the test results and miss an actual issue. And since you're more likely to ignore your test results, flaky tests tend to spiral out of control quickly: one occasional flake can turn into a consistently red build. That's why I believe flaky tests cause more damage than having no tests at all, and in this post I would like to explore some of the lessons and solutions I've found in my journey to stabilize our builds.

I can bet you that no developer ever writes and commits a flaky test on purpose. Tests are reviewed and run several times before they're merged, yet flaky ones still pop up occasionally. Once a flaky test is part of a stable build, it's really hard to diagnose where it came from and who introduced it, because the build it fails on can be far removed from the build that first introduced the failure. Because of that, our number one priority quickly became the ability to identify a bad test before it hits master. To achieve that, we introduced one of the most impactful changes to our test infrastructure: a flag to rerun a particular test on demand a specified number of times. We called this flag “repeat_run”; it's a simple RSpec extension that reruns the test as many times as specified (via either an environment variable or an option passed to an individual test). Here's an example of how we use it:

it "can create a job", repeat_run: 5 do
  # ...
end

Then, run your test as normal:

rspec spec/request/jobs_spec.rb

The results become:

Jobs
  employer users
    can create a job
    can create a job
    can create a job
    can create a job
    can create a job

Finished in 56.29 seconds (files took 11.11 seconds to load)
1 example, 0 failures
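
For reference, here is a minimal sketch of how such a flag could be wired up with an RSpec around hook. This is not our exact extension: the hook, the metadata key, and the REPEAT_RUN environment variable are illustrative, and a production version would also need to reset example state between runs (much like rspec-retry does).

# spec/support/repeat_run.rb (illustrative sketch, not our exact extension)
RSpec.configure do |config|
  config.around(:each) do |example|
    # Prefer the per-example option, fall back to an environment variable, default to a single run
    count = (example.metadata[:repeat_run] || ENV["REPEAT_RUN"] || 1).to_i
    count = 1 if count < 1
    count.times { example.run }
  end
end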

The addition of this flag has helped us exercise our specs to ensure that they are stable from run to run and don't have any external dependencies we're not aware of, which ultimately helps us detect non-deterministic specs before committing them.

From that point on, we had confidence that newly written specs would be good. We still had to figure out what to do with our existing suite of specs, though. Do we rerun all of them with the repeat_run flag, or is there a better way of spotting the problematic tests? We soon found a neat trick. If you're using the rspec-retry gem, as we do, pay special attention to its results. If a test is being retried but passes on the second (or third) run, that is the definition of a non-deterministic test. And considering that you're also wasting valuable build time on those reruns, fixing such specs is a double win.
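
To make those retries easy to spot in your build output, rspec-retry can be configured to log each attempt and the failure that triggered it. A sketch of that configuration (the exact options available may depend on the version of the gem you're running):

# spec/spec_helper.rb
require "rspec/retry"

RSpec.configure do |config|
  config.verbose_retry = true                 # print a message whenever an example is retried
  config.display_try_failure_messages = true  # show the failure that caused each retry
end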

When I first learned about the rspec-retry gem, I was pretty much in love: such an easy fix for a red build! As I spent more time identifying and preventing flaky specs, my enthusiasm gradually faded, because I realized that rspec-retry only masks the symptoms and doesn't help solve the underlying issue. In certain cases, it even caused more trouble than it helped. One time, while debugging a strange failure in a model spec, I found that rspec-retry was adding to my confusion by showing a different error message than the one in the original failure. The reason was that retry worked really well on our request specs, but on controller or model specs it did not refresh the state of the test after the first run. Imagine the following situation: the test fails with some error, rspec-retry reruns the spec, and because the test conditions are in a different state than expected, it fails with a completely different error; that second error is the only one you see and use to debug your test. You can imagine what a nightmare debugging that was! Once I removed the retry, I got a meaningful error message that helped me figure out the cause of the failure.
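
When debugging a failure like this, it helps to temporarily turn retries off for the spec in question so the original error can surface. With rspec-retry, that can be done per example through its retry metadata (the example description below is just for illustration):

# Run this example only once so rspec-retry can't mask the original error
it "sets the posting expiration date", retry: 1 do
  # ...
end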

Once we had identified some of the flakiest specs in our suite, it was time to fix them. The hardest types of tests to debug were model and controller specs. To be fair, flaky model or controller specs are very rare (unless you're doing something completely wrong when writing them), because they don't involve any JavaScript and test small parts of the code that should not vary greatly. Still, we ran into a couple of non-deterministic results in these specs, and whenever that happened, it was almost always caused by timezone differences. Due to the nature of our multi-user app, we work with timezones in the code quite a bit. All of the times are displayed properly in the app, because we wrap every controller action in the current user's timezone. In the specs, however, everything runs in UTC, so a helper like 12.days.from_now can land in a different timezone than the one used in the controller. For example, consider the following spec:

expect(posting.expiration_date).to be_within(1.minute).of(12.days.from_now)

The reason this spec would occasionally fail is that when the user's timezone falls on a different day than UTC, the spec's 12.days.from_now ends up offset by one day. That doesn't happen all the time, so the spec's result depends on the time of day at which it runs. To fix it, we need to make sure we're comparing these dates in the same timezone during the test run:

expect(posting.expiration_date).to be_within(1.minute).of(DateTime.now.in_time_zone(user.time_zone) + 12.days)

Furthermore, we can extract DateTime.now.in_time_zone(user.time_zone) into a separate helper:

def user_adjusted_date_time(user)
  DateTime.now.in_time_zone(user.time_zone)
end

Now, we can rewrite our expectation as

expect(posting.expiration_date).to be_within(1.minute).of(user_adjusted_date_time(user) + 12.days)

Alternatively, for more complex objects, it is sometimes easier to wrap the object's creation in the user's timezone:

adjusted_schedule =
  Time.use_zone(user.time_zone) do
    create(
      :interview_schedule,
      user: user,
      start_time: 2.months.ago,
      end_time: 2.months.ago + 30.minutes,
      start_date: 2.months.from_now,
      end_date: 3.months.from_now
    )
  end

Then, you can safely compare adjusted_schedule.start_time without having to change it inside the expect statement.
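
For illustration, the comparison can then look like any other expectation (the attribute and tolerance below are just a sketch based on the factory above):

# start_time was set to 2.months.ago at creation, so a plain comparison works
expect(adjusted_schedule.start_time).to be_within(1.minute).of(2.months.ago)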

Another interesting example of an irregular failure came from a seemingly simple test. Imagine that you have a “Join Event” button that, once clicked, reloads the page and changes the button text to “Registered”. Logically, you want to verify that after you click the button, it changes from “Join Event” to “Registered”, so you might write your test as follows:

expect(page).to have_content 'Join Event' # Button appears on the page
click_on "Join Event"
expect(page).to have_no_content 'Join Event' # Join Event disappears

This looks good, except that have_no_content waits for “Join Event” to not be found on the page and passes as soon as it is not found. This means that if the page hasn't completely loaded yet (or is simply broken), “Join Event” will not be present, and the assertion passes right away. So the test can pass even when the page is completely broken, producing a false negative, and it doesn't actually test what you wanted to test. To avoid situations like this, we had to make sure to assert on an element that indicates the page (or at least the element under test) has fully loaded:

expect(page).to have_content 'Join Event'
click_on "Join Event"
expect(page).to have_content 'Registered'
expect(page).to have_no_content 'Join Event'

Similarly, you risk running into a flaky situation if you do the following:

expect(page).to have_content 'Join Event'
click_on "Join Event"
expect(page).to have_no_content 'Join Event'
expect(page).to have_content 'Registered'

If the page has not fully loaded, line 3 might pass, whereas line 4 fails because the page hasn't loaded yet and the content “Registered” hasn't appeared at that point. Since the time it takes to load the page can vary from run to run, depending on the network, the compute power of your CI host, and other factors, the test won't fail consistently, which creates non-deterministic results. You might reason that, since have_no_content only waits until the element is not present and passes as soon as it isn't found, the test should work if you change it to this:

expect(page).to have_content 'Join Event'
click_on "Join Event"
expect(page).to_not have_content 'Join Event'
expect(page).to have_content 'Registered'

Yes, that would work in this situation and produce a correct, deterministic result. However, assuming your code doesn't have a bug, to_not have_content will always wait the full 2 seconds (the default Capybara maximum wait time) before evaluating to true and moving on to the next assertion, whereas in the earlier solution the expect(page).to have_content 'Registered' assertion waits only as long as it takes for the text to appear, up to that same maximum. The performance trade-off might not seem huge, but as you add more and more tests, those seconds add up quickly and start costing you real time and money.
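
For reference, that maximum wait is controlled by a single Capybara setting; 2 seconds is the library default, and raising it makes every negated to_not have_content assertion proportionally more expensive (on older Capybara versions the setting is named default_wait_time):

# spec/support/capybara.rb
Capybara.default_max_wait_time = 2  # how long Capybara keeps retrying matchers and finders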

Interestingly, arranging expectations in the correct order and adding more checks that the page has finished loading was also the solution for the sporadic “Deadlock” errors (PG::Error: ERROR: deadlock detected) we were seeing as a result of our use of the DatabaseCleaner gem. We use this gem to clean up persisted data at the end of each test so that tests don't depend on each other's leftover state. These failures proved to be very frustrating and hard to debug, because they were difficult to reproduce locally and a visual inspection of the tests did not necessarily reveal anything “wrong”. In fact, I spent some time trying to tweak our DatabaseCleaner configuration to prevent these errors before I realized the easiest and quickest solution was right in front of my eyes all along. All I needed to do was look at the end of the test and make sure no AJAX calls were still active when the test finished. Once I added assertions that the page was fully loaded and that there were no more active AJAX requests at the end of the test, the errors disappeared. I believe it also makes for a stronger test, because it better simulates a real user, who generally waits for the entire page to load before taking any further action. The key lesson from this scenario for me was to take a closer look at how the test is written, what exactly it's testing, and what happens in the application after the final assertion has finished, before trying to meddle with the configuration of a third-party gem.
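
Our exact helper isn't important here, but a common pattern for that kind of end-of-test guard looks roughly like the sketch below. It assumes the app makes its AJAX calls through jQuery, and the module and method names are illustrative:

# spec/support/wait_for_ajax.rb (sketch; assumes jQuery is on the page)
require "timeout"

module WaitForAjax
  def wait_for_ajax
    Timeout.timeout(Capybara.default_max_wait_time) do
      # jQuery.active is the number of in-flight AJAX requests
      loop until page.evaluate_script("jQuery.active").zero?
    end
  end
end

RSpec.configure do |config|
  config.include WaitForAjax, type: :feature
end

Calling a helper like this, together with a final assertion on content that only appears once the page has settled, leaves DatabaseCleaner with a quiet database to clean.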

In conclusion, I would like to note that, even though I am proud of how much more stable our test build has become over the past couple of months, there is still room for growth and improvement in our test infrastructure. For example, I would love to fully eliminate the need for the rspec-retry gem, break down large files into smaller, better organized ones, and identify and fix performance bottlenecks to increase the speed of our tests. For that and more, stay tuned for our next blog posts. I hope you have learned something new and can apply some of these tips to improve the stability of your own suite.