Editor’s note: This article was originally published in October 2016, and was re-posted in February 2021.
Handshake ships new code multiple times a day, continuously. This fast deployment pace allows us to build and maintain features quickly and safely, but also pushes infrastructure to new limits. One of these limits is the test suite. When doing continuous delivery, a fast test suite becomes paramount. As our CI continued to slow with new features and CI checks, we started to look into alternatives to our hosted CI solution, and we recently finished the full switch to Buildkite for our CI pipelines.
Slow test suites
Handshake has grown quite a bit over the past few years. We've seen our test suite go from 5 minutes on one container, to 15 minutes, back to 5 minutes on many containers, to upwards of 25 minutes on 6 containers at its slowest. Although 25 minutes may not seem like a lot, it can feel like an eternity to an engineer in the flow. Those 25 minutes per change also add up quickly, to nearly an hour of time-to-production: 25 minutes for the feature branch, and 25 more after merging into master. If we wanted to continue to move quickly, we knew we needed to address the problem and cut our test suite time drastically.
Why so slow?
When evaluating our test suite time there were a few obvious bottlenecks:
- Because we use some custom libraries not provided by our previous hosted CI, we had to install them on each build. Oftentimes these could be cached, but not always, and as a result our fixed up-front cost for any build was 7 minutes of installing libraries. These custom libraries ranged from new Ruby versions that were not yet supported by the provider to an upgraded Qt for capybara-webkit, which we needed to fix other issues such as flaky tests.
- The rest of the build process took around 18 minutes, most of that time being spent running the actual tests.
This meant that there were two ways to improve our build time. First, we wanted to cut the upfront cost as much as possible. We hoped to have a container with exactly the libraries relevant to us that we could control at a more granular level and re-use for every build. We also knew that by using more containers (parallelization) we could continue to cut down the build time; more containers = faster tests.
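The two levers above can be put into a rough back-of-the-envelope model. The sketch below is only an illustration, not how we measured anything: it assumes the test work splits perfectly evenly across containers, and it derives a total of 108 container-minutes of test work from the 18 minutes spent on each of 6 containers. The function name and the even-split assumption are ours for this example; the 16 containers and roughly 1.5 minutes of setup are the figures reported later in this article.

```python
def estimated_build_minutes(setup_minutes, total_test_minutes, containers):
    """Rough wall-clock estimate for one build.

    Assumes every container pays the same fixed setup cost and then
    runs an equal share of the total test work (an idealization).
    """
    if containers < 1:
        raise ValueError("containers must be at least 1")
    return setup_minutes + total_test_minutes / containers


# Assumption: 18 minutes of tests on each of 6 containers is about
# 108 container-minutes of total test work.
TOTAL_TEST_MINUTES = 18 * 6

if __name__ == "__main__":
    # Old setup: 7 minutes of library installs, 6 containers.
    print(estimated_build_minutes(7, TOTAL_TEST_MINUTES, 6))     # 25.0
    # New setup: about 1.5 minutes of fixed cost, 16 containers.
    print(estimated_build_minutes(1.5, TOTAL_TEST_MINUTES, 16))  # 8.25
```

Under these assumptions the model lands close to the 25-minute and 8-minute builds described in this article, though real builds also depend on instance speed and on how evenly the tests actually split.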
When we initially found Buildkite, it was unclear if they provided enough to be a viable solution. Buildkite's approach to CI is "bring your own compute". This means that although they provide the UI for viewing your tests, seeing tests in progress, editing your pipelines, and managing user accounts, you provide the compute and infrastructure to actually run the tests. At first this feels like a steep task, but with great open source tooling provided by Buildkite, such as the elastic-ci-stack, it's not nearly as difficult as it seems. We'll get more into this later.
We quickly started experimenting with Buildkite. Smaller services could be configured and green within a few hours, and we had complete control over our CI environment. As we started moving more apps to Buildkite, we quickly began to discover many of the benefits.
Fast compute, please
Exact docker images
Next, we took advantage of the docker-compose functionality built into the beta build for Buildkite. This plugin allows us to build the exact Docker image we need for our tests and re-use that image later when running the actual tests. By using dedicated instances for building the Docker images, we can take advantage of Docker's caching functionality. This means that for a normal build, the upfront cost for our Docker images is merely the time it takes to upload the changed code to Docker Hub and download it again, plus a few service initializations such as the Postgres schema load. In most cases, our test containers start running tests in about 1.5 minutes, including Docker image upload/download, starting services like Postgres, schema load, and precompiling all assets. The majority of that time (roughly one minute) is spent uploading changed layers to Docker Hub, a time we hope to cut down even further.
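To see why caching makes the upfront cost depend mostly on what changed, here is a deliberately simplified model of layer reuse. It is a sketch for intuition only, not Docker's implementation: it assumes a layer can be reused only when it and every layer before it are unchanged since the previous build. The layer descriptions and fingerprints are hypothetical and do not reflect our actual Dockerfile.

```python
def layers_to_rebuild(previous, current):
    """Return the layers of `current` that cannot come from cache.

    Simplified model: the first layer that differs from the previous
    build invalidates itself and every layer after it.
    """
    for index, (old, new) in enumerate(zip(previous, current)):
        if old != new:
            return current[index:]
    # Every shared layer matched; only layers beyond the old build are new.
    return current[len(previous):]


# Hypothetical layers as (description, content fingerprint) pairs.
PREVIOUS = [
    ("base image", "a1"),
    ("system libraries", "b2"),
    ("gem install", "c3"),
    ("application code", "d4"),
]

if __name__ == "__main__":
    # Typical change: only application code differs, so one layer is rebuilt.
    code_change = PREVIOUS[:3] + [("application code", "d5")]
    print(layers_to_rebuild(PREVIOUS, code_change))

    # A dependency change invalidates that layer and everything after it.
    gem_change = PREVIOUS[:2] + [("gem install", "c9"), ("application code", "d4")]
    print(layers_to_rebuild(PREVIOUS, gem_change))
```

In this model, putting slow and rarely changing steps early and frequently changing code last keeps most builds down to rebuilding and uploading a single layer.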
All the containers!
Suddenly we were living in a new type of world. Instead of each container spending 7 minutes before it can run any tests, our test-running containers spend merely 30 seconds before they are productive. This means that containers cycle in and out much faster, and even with roughly the same number of containers as on our old provider, we can parallelize our Buildkite builds much more without the queue backing up. Previously we were running only 6 containers per build, and the queue would back up on a regular basis. With Buildkite we are running 16 containers in parallel for our RSpec tests, plus 3 other containers for some additional CI, and our queue rarely backs up because tests run quickly and move on!
What does all of this mean for our build times? Our builds now finish in around 8 minutes! This has had drastic positive benefits:
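For readers wondering how one test suite gets spread over 16 parallel jobs, one common approach is to give each job a slice of the spec files. The sketch below is an assumption for illustration, not a description of our setup: it uses a simple round-robin split over sorted file names, and the spec file names are made up. The `BUILDKITE_PARALLEL_JOB` and `BUILDKITE_PARALLEL_JOB_COUNT` environment variables are the ones Buildkite sets on parallel jobs.

```python
import os


def files_for_this_job(spec_files, job_index, job_count):
    """Return the round-robin slice of spec files for one parallel job.

    Sorting first ensures every job computes the same ordering, so the
    slices are disjoint and together cover every file.
    """
    if job_count < 1 or not 0 <= job_index < job_count:
        raise ValueError("job_index must satisfy 0 <= job_index < job_count")
    return sorted(spec_files)[job_index::job_count]


if __name__ == "__main__":
    # Fall back to a single job when run outside of a parallel Buildkite step.
    job_index = int(os.environ.get("BUILDKITE_PARALLEL_JOB", "0"))
    job_count = int(os.environ.get("BUILDKITE_PARALLEL_JOB_COUNT", "1"))

    # Hypothetical spec files; a real script would discover them on disk.
    specs = [f"spec/models/example_{n:02d}_spec.rb" for n in range(40)]

    for path in files_for_this_job(specs, job_index, job_count):
        print(path)
```

A round-robin split ignores how long each file takes to run, so suites with a few very slow files usually benefit from splitting by recorded timings instead.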
- No longer do engineers have to context switch when waiting for CI
- Hotfixes can get out considerably faster, usually in around 20 minutes if running both feature branch CI and master CI.
We've also seen benefits through using Buildkite's more expressive and customizable pipeline structure. We can get feedback on other parts of CI, such as brakeman and bundle-audit, in around 2 minutes.
Tips for migration
Are you thinking about giving Buildkite a try? Here are a few things that worked well for us.
- Use elastic-ci-stack: It gives you a highly scalable, easy to manage CI infrastructure in minutes.
- Use the docker-compose plugin provided by the beta build.
- Make sure you use quality EC2 instances. Not only do they speed up your builds, but they also make them more reliable.
- Get the whole team on board. Switching CIs is no small task, so make sure your new CI has clearly reached the point where it is better than your previous one. If coworkers are still using the old CI while you're moving towards the new one, there's probably a reason.
- Use spot instances for cheaper EC2 instances, but make sure to bid high enough so builds don't lose their instances.
Switching to Buildkite has been an exciting and productive process. Although running your own build infrastructure comes with a maintenance cost, the benefits across the rest of your team can be huge. Being able to keep cutting build time by simply adding more compute is powerful. Looking forward, we plan to continue to cut down the fixed cost by reducing Docker image size (if possible) and optimizing our Docker layer structure.