Debugging with Delve

Note: Delve once had a reputation for panicking at random and falling short of what one expects from a debugger. A couple of years have passed since then, and it is now possible to attach to and debug running cockroach clusters, which makes it a powerful tool to have in your arsenal.

Delve is an extremely useful tool for debugging Go programs. It can attach to a running process given only the process ID. Once attached, you can switch goroutines, jump to a particular stack frame, and inspect state. What follows is a quick tutorial on how to get Delve up and running, attach to a running cockroach process, and look at some state.

This tutorial assumes you have a running cockroach process.

Debugging A Process

Step 1: Get delve (OSX, Linux):

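Installing with the go tool should be the only thing you need to do. On Go 1.16 and later the command is go install github.com/go-delve/delve/cmd/dlv@latest, since go get no longer builds and installs binaries in current Go releases.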

Step 2: Attach to process:

Get the process ID of the cockroach node you want to attach to and run:
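A minimal sketch of this step, assuming a single local process named exactly cockroach and that pgrep is available. attach_cockroach is a hypothetical helper name, not something Delve or cockroach provides:

```shell
# attach_cockroach: hypothetical helper that attaches dlv to a local cockroach node.
attach_cockroach() {
  # -x matches the process name exactly; -o picks the oldest match if there are several.
  pid="$(pgrep -o -x cockroach)" || { echo "no cockroach process found" >&2; return 1; }
  # dlv attach pauses the target process until you continue or detach.
  dlv attach "$pid"
}
```

Call attach_cockroach from a shell, or run dlv attach <pid> directly if you already know the PID. Depending on the system, attaching may require elevated privileges.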

Note that you must have the source code checked out at the same SHA as the binary you are debugging; otherwise the source locations Delve shows will not match what is actually running.

Step 3: Do what you came to do:

Now that you are attached, the process is paused. You can jump around and look at things. For example, you can list the currently running goroutines and their IDs:
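At the (dlv) prompt, the listing comes from the goroutines command. Output is omitted here because it depends entirely on the process being debugged:

```
(dlv) goroutines
```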

Once you have an interesting goroutine ID (e.g. from stuck goroutine stacks), you can jump to a goroutine:
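A sketch of the switch, with 1234 standing in for whatever goroutine ID you picked (a placeholder, not a real ID):

```
(dlv) goroutine 1234
```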

Check its trace:
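The stack of the selected goroutine is printed by bt, an alias of stack:

```
(dlv) bt
```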

Jump to an interesting frame:
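Frames are addressed by the index shown in the trace. The 3 below is a placeholder index:

```
(dlv) frame 3
```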

And print out some state (here n is the variable name for the *createStatsNode):
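With the frame selected, print (or its alias p) evaluates an expression in that frame:

```
(dlv) print n
```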

There are a lot of things you can do with dlv, some of which you can start exploring by running help while in dlv. If something is unclear, someone on the #engineering channel will be willing to help. Hopefully these steps can at least get you started in using a powerful debugging tool to cut down the amount of time spent wondering what's going wrong with a cluster.

One useful trick is that you can dive into any structures you have a pointer to. On most architectures (i.e. in practice nearly always) the receiver of a method in a stack trace corresponds to the first memory address listed for that frame. For example, to print the innards of a *Replica:
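A sketch of the cast, assuming the Replica type lives in a package named storage in your checkout (the package qualifier must match your source tree) and using a made-up address. Substitute the first address shown for the frame you are interested in:

```
(dlv) print *(*storage.Replica)(0xc4201a2000)
```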

This can be scripted to extract information from a running process with an interruption that is (hopefully) small enough to not disturb the situation.

Debugging Tests

A common place to drop into a debugger is right in the middle of a unit test, which lets you iterate quickly in a TDD fashion. Delve has a dedicated subcommand for starting a debugging session against tests: dlv test. After running it, you can break at the start of a specific unit test by issuing break <TestName>. If you don't know the name of the test (or don't want to leave dlv to find it), you can issue the funcs command with a regular expression to list functions with matching names. For example, here is a snippet of dropping into a dlv test session and breaking on a test in the github.com/cockroachlabs/managed-service/pkg/auditlog directory:
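A sketch of such a session, assuming it is run from the repository root so that the package path resolves. <TestName> and the Test.* pattern are placeholders, and output is omitted:

```
$ dlv test ./pkg/auditlog
(dlv) funcs Test.*
(dlv) break <TestName>
(dlv) continue
```

The continue command runs the test binary until the breakpoint is hit.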

Below is another example in which tests for the intrusion server are being run. Note the two syntaxes for creating breakpoints.
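Two location forms that break accepts are a function name and a file:line pair. A sketch of both, where every name in angle brackets is a placeholder rather than a real name from the intrusion server tests:

```
(dlv) break <TestName>
(dlv) break <file>_test.go:<line>
```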

Getting a Linux Delve binary onto a roachprod node

The instructions in the Delve repo require Go to be installed, which the roachprod nodes do not have. The easiest approach is to build the Delve binary on your development machine. Skip the cloning if you already have the repo:
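A minimal sketch of the build, assuming the upstream repository at github.com/go-delve/delve and a linux/amd64 roachprod node. build_linux_dlv is a hypothetical helper name, and the target architecture is an assumption you should adjust:

```shell
# build_linux_dlv: hypothetical helper that produces delve/dlv for a Linux node.
build_linux_dlv() {
  # Skip the clone if a delve checkout already exists in the current directory.
  [ -d delve ] || git clone https://github.com/go-delve/delve.git || return 1
  # Cross-compile for the node; linux/amd64 is an assumption about its platform.
  (cd delve && GOOS=linux GOARCH=amd64 go build -o dlv ./cmd/dlv)
}
```

Running build_linux_dlv leaves the binary at delve/dlv relative to the directory you started in.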

And then copy that to the roachprod node you're interested in debugging:
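A sketch of the copy, assuming the binary was built at delve/dlv and that roachprod put takes a cluster:node spec followed by source and destination paths. copy_dlv_to_node is a hypothetical helper name:

```shell
# copy_dlv_to_node: hypothetical helper that uploads the Linux dlv binary.
copy_dlv_to_node() {
  # $1: roachprod cluster name, $2: node number (both supplied by the caller).
  roachprod put "$1:$2" delve/dlv dlv
}
```

For example, copy_dlv_to_node <cluster> 1 would place dlv in the home directory on node 1 of that cluster.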