How to investigate a Go test failure

Here is a list of recent Go test failures, in case you want to check out a few. CI jobs file issues when a test fails on a release branch, and an owning team is determined from TEAMS.yaml and the CODEOWNERS file. That team is then mentioned (to notify it) and the issue is placed in the triage column of the team’s GitHub project.

Another source of failures is PR builds themselves. In both cases, your investigation will lead you to a TeamCity build.

Note: At the time of writing, a big shake-up of where things are found is imminent as we switch CI over to Bazel. Things will shift around, but the general strategies will remain valid.

Test failures come in different categories and with different degrees of ergonomics. We try to expose the output of the failing test to TeamCity, but there are various scenarios in which this doesn’t work (and even when it does, it often isn’t enough), so let’s focus on a case in which it does not and discuss the artifacts of a test run. Not all test failures provide the same artifacts. We will discuss a few, but they are all similar in that the challenge is to find the output of the failing test (or package).

The output of the `go test` command is in an artifact, either

  • failures.txt (attempts to capture only the failing parts of the test run, which is a good thing to look at first - the entire output of the test run is massive) or the full output in raw[...].json.txt.gz / full_output.txt.

  • stress.txt (for failures that were discovered under nightly stress); here only the full-package output is available (but it’s only a single package, not all tests we have).
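The raw[...].json.txt.gz artifact contains `go test -json` (test2json) output, one JSON object per line, so the failure events can be filtered out with plain grep. A sketch (the three sample lines and the test names TestFoo/TestBar are made up for illustration; the real artifact’s exact filename varies per build, so decompress it first with gunzip):

```shell
# Illustration: three test2json events like those found in the raw artifact
# (after gunzip'ing it).
cat > raw.json.txt <<'EOF'
{"Time":"...","Action":"run","Package":"p","Test":"TestFoo"}
{"Time":"...","Action":"fail","Package":"p","Test":"TestFoo","Elapsed":0.1}
{"Time":"...","Action":"pass","Package":"p","Test":"TestBar","Elapsed":0.2}
EOF

# Keep only the failure events. encoding/json emits no spaces around ":",
# so the fixed string below matches the real test2json output.
grep '"Action":"fail"' raw.json.txt
```

This prints one line per failed test (and one per failed package), which is a quick way to get the list of names to search for in the full output.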

Additionally, many tests avoid having the in-memory server log to the default test output (this is very noisy and can hide the actual failure messages). Instead, the logs can be found in a directory whose name contains the test name.
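Since the directory name contains the test name, find locates it quickly. A sketch (the artifacts directory and the test name TestImportPgDump are hypothetical placeholders; substitute the failing test’s name and the directory you extracted the artifacts into):

```shell
# Illustration: a per-test log directory named after a hypothetical test.
mkdir -p artifacts/TestImportPgDump/logs

# Locate the log directories for that test by name.
find artifacts -type d -name '*TestImportPgDump*'
```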

Package-level failures

The Go testing framework does not execute testing code in a sandbox. It is thus possible for a test to crash the entire test run, and these failures are typically more difficult to find. You will likely need to look through the full output to identify the problem. Here are a few things to look out for:

  • Data race: the output contains the string WARNING: DATA RACE. This applies only to builds that have the race detector enabled. When the race detector finds a race, it makes the package exit nonzero, which translates into an opaque test failure. Luckily, this one is easy to grep for.

  • Package crash: this is more difficult. Search the full output for --- FAIL lines for a package (not a test) and read upwards from there, hopefully discovering that the runtime aborted the process and produced a goroutine dump. Common messages are “test timed out after X” or “panic: X”, which you can grep for directly.

  • Killed by the OOM killer: this one is tough. The best symptom is the absence of symptoms. Sometimes you will see a “Signal: killed” in the output, but this is not guaranteed.
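The markers above can be checked in a single pass over the full output. A sketch (the sample excerpt is invented for illustration; run the grep against the real full_output.txt artifact):

```shell
# Illustration: a package-crash excerpt resembling what full_output.txt
# might contain after a test timeout.
cat > full_output.txt <<'EOF'
=== RUN   TestFoo
panic: test timed out after 45m0s
goroutine 12345 [running]:
FAIL    github.com/cockroachdb/cockroach/pkg/foo        2700.123s
EOF

# One case-insensitive pass for the common crash markers, with line numbers.
grep -niE 'WARNING: DATA RACE|--- FAIL|panic:|test timed out after|signal: killed' full_output.txt
```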

Investigating a failure in a remote execution environment

We use EngFlow to execute some tests. See https://cockroachlabs.atlassian.net/wiki/spaces/devinf/pages/3217850384 and https://cockroachlabs.atlassian.net/wiki/spaces/devinf/pages/3141107902 .

Reproducing a test failure

One common mistake is forgetting to unskip the test you’re stressing (if it is skipped to begin with, as it may be; the test-infra community service team does that to flaky tests). It happens to everyone. Just keep in mind that this is a thing that happens, and if the number of iterations seems to fly up very quickly, check whether it’s currently happening to you.

Below we describe how to stress tests “manually” (i.e. on your machine or a gceworker). As a quick first thing to try, we also recommend letting CI stress the test for you:

https://www.loom.com/share/b672200008164b42b2763477ceea70da

  • Update [July 31, 2023]: make sure you specify env.TARGET instead of env.PKG, otherwise the build will fail.

stress{,race}

Run ./dev test --stress $PKG --filter "^$TEST", ideally on a gceworker (to avoid clogging your workstation).

If this doesn’t yield a reproduction in due time, you could try under race (add --race flag) or adjust the --stress-args (see bin/stress --help, possibly after ./dev build stress):

./dev test --stress $PKG --filter $TEST --stress-args '-maxtime 15m -p 12'

Other Notes:

  • To print the full logs of the failed test trial, add --show-logs to your test command. Run ./dev test --help to see this option. (The CRDB logs are saved to a temp dir by default, but the path to that dir is currently broken.)

  • If you’re trying to reproduce an instance of a slow or hanging test, add a per-trial timeout (i.e. fail if a test trial takes longer than, say, 5 minutes) by passing --test-args='-test.timeout 5m' to your test command. test-args are passed directly to the go test binary, which is executed for every trial in stress; --test-args can therefore be treated as ‘per-trial’ args.

  • Build tags are passed like this (note the -- args separator):

    ./dev test --stress $PKG --filter $TEST -- --define gotags=bazel,gss,deadlock
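Putting the notes above together, a typical reproduction attempt for a slow or hanging test combines the per-trial timeout with full logs (a sketch; the flags are the ones described in the bullets above):

```shell
# Stress the test with full logs, failing any trial that exceeds 5 minutes.
./dev test --stress $PKG --filter $TEST --show-logs --test-args='-test.timeout 5m'
```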

 

roachprod-stress{,race}

When a gceworker won’t do, you can farm out the stressing to a number of roachprod machines. First create a cluster:

roachprod create $USER-stress -n 20 --gce-machine-type=n1-standard-8 --local-ssd=false

Then invoke

make roachprod-stress CLUSTER=$USER-stress PKG=... TESTS=... [STRESSFLAGS=...]

Once you know how to reproduce a failure, you can think about bisecting it or reproducing it with additional logging.
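Once the reproduction is cheap, git bisect run can automate the search for the offending commit, since it treats a nonzero exit code as “bad”. A sketch (the good/bad SHAs are placeholders, and the stress invocation is the one from above; size -maxtime so that a passing run is a trustworthy “good” verdict):

```shell
# <bad-sha> is a commit known to fail, <good-sha> one known to pass.
git bisect start <bad-sha> <good-sha>
git bisect run ./dev test --stress $PKG --filter $TEST --stress-args '-maxtime 15m'
git bisect reset
```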

Copyright (C) Cockroach Labs.
Attention: This documentation is provided on an "as is" basis, without warranties or conditions of any kind, either express or implied, including, without limitation, any warranties or conditions of title, non-infringement, merchantability, or fitness for a particular purpose.