Nightly tests

Overview

Every night, a set of tests runs as part of the TeamCity project Nightlies. These tests have a few common characteristics:

  • They set up a temporary CockroachDB cluster and run load against it.
  • Their runtime is too long for them to be included in CI.

All nightly tests, except for Jepsen, use Terraform to create and destroy their temporary cluster. It may be wise to remove Terraform in the future, given the cognitive overhead of using a tool that provides much more functionality than we need.

Manually Running Nightly Tests

  • The simplest way to run a nightly test is to go to the Nightlies project, find the test you want to run, and click the Run button.
  • To specify flags for cockroach for a single test run, click the ... button next to the Run button for a test. Then, go to the Parameters tab and specify a value for env.COCKROACH_EXTRA_FLAGS.
  • To launch a test locally, use the appropriate build/teamcity-*.sh script. For many nightlies, this is build/teamcity-nightly-acceptance.sh. See the comments at the top of that script for setup steps.

Test Entry Points

TeamCity jobs execute various bash scripts that, in turn, run the relevant tests. These files are named build/teamcity-*.sh. Key files include:

Key Source Files

Allocator tests

The allocator tests stress the replica allocator under load. At a high level, they do the following:

  1. Create a temporary cluster.
  2. Restore tarballs of test data (TPC-H data sets with various scale factors) onto each node in the cluster.
  3. Add new nodes to the cluster. The only current exception to this is the "steady 6 nodes" test.
  4. Start load generators.
  5. Wait until the replica allocators reach equilibrium (no replicas added/removed in the last N minutes).
  6. Check that the standard deviation of range counts is lower than the threshold (set to 5% of the mean range count). The test passes only if this happens before TESTTIMEOUT elapses.
  7. Destroy the temporary cluster.
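
The checks in steps 5 and 6 can be sketched as follows. This is a minimal illustration, not the test's actual code: the function names, the use of population standard deviation, and the representation of allocator activity as a list of event timestamps are all assumptions made for the example.

```python
import statistics

# From the text: the threshold is 5% of the mean range count.
STDDEV_THRESHOLD_FRACTION = 0.05


def at_equilibrium(event_times, now, quiet_minutes):
    """Hypothetical step 5 check.

    event_times holds the timestamps (in seconds) of replica add/remove
    events. Equilibrium means none of them fall in the last quiet_minutes.
    """
    cutoff = now - quiet_minutes * 60
    return all(t < cutoff for t in event_times)


def is_balanced(range_counts, threshold_fraction=STDDEV_THRESHOLD_FRACTION):
    """Hypothetical step 6 check.

    range_counts holds one range count per node. Population standard
    deviation is an assumption; the text does not say which variant is used.
    """
    if not range_counts:
        raise ValueError("need at least one node")
    mean = statistics.fmean(range_counts)
    stddev = statistics.pstdev(range_counts)
    return stddev < threshold_fraction * mean


if __name__ == "__main__":
    counts = [100, 101, 99, 100, 102, 98]
    print("balanced:", is_balanced(counts))
    print("equilibrium:", at_equilibrium([10, 20], now=1000, quiet_minutes=5))
```

Because the threshold scales with the mean, the same 5% setting applies to data sets of any scale factor.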

Continuous load tests

These are straightforward tests that set up test clusters and run load against them. They pass if TESTTIMEOUT elapses with no crashes and no periods with 0 QPS.
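
The pass condition above can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, and it assumes QPS is sampled once per period over the whole TESTTIMEOUT window and that crash detection is reported separately as a boolean.

```python
def first_zero_qps_period(qps_samples):
    """Return the index of the first period with 0 QPS, or None if there is none."""
    for index, qps in enumerate(qps_samples):
        if qps <= 0:
            return index
    return None


def load_test_passed(qps_samples, crashed):
    """Pass only if no node crashed and every sampled period served queries."""
    return not crashed and first_zero_qps_period(qps_samples) is None


if __name__ == "__main__":
    samples = [1200.0, 1150.5, 0.0, 1180.2]
    print("first stalled period:", first_zero_qps_period(samples))
    print("passed:", load_test_passed(samples, crashed=False))
```

Returning the index of the first stalled period, rather than only a boolean, makes it easier to locate the stall in the logs.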

Gotchas

  • Take care when upgrading Terraform. Various backward-incompatible changes have been introduced over time (e.g. around terraform init).
  • The cloud provider Terraform uses for the temporary clusters is independent of the cloud provider used by TeamCity agents. For example, at the time of writing, TeamCity agents run on GCE, and most Terraform clusters run on Azure.
  • Azure-based tests can take a long time to iterate on. Azure VM startup and destruction times (4-5 minutes) are much longer than those on GCE (~1 minute).
  • Core dumps aren't enabled for cockroach.
  • Pressing Ctrl-C at certain times will leak cloud resources. Fortunately, there is a nightly script to clean up resource leaks.
  • Terraform is not needed for our relatively simple needs. It'd be helpful to replace it with roachprod and roachperf.