Debugging with Delve

Note: In the past, Delve has garnered a reputation for randomly panicking and not living up to the expectations of a debugger. However, a couple of years have passed since then, and it is now possible to attach to and debug running cockroach clusters, making it a powerful tool to have in your arsenal.

Delve is an extremely useful tool to debug go programs. It has the ability to attach to a running process by simply specifying the process ID. Once attached to the process, you can do things such as switch goroutines, jump to a certain stack frame, and inspect state. What follows is a quick tutorial on how to get delve up and running, attach to a running cockroach process, and look at some state.

This tutorial assumes you have a running cockroach process.

Debugging A Process

Step 1: Get delve (OSX, Linux):

Running go get should be the only thing you need to do:

$ go get -u github.com/go-delve/delve/cmd/dlv

Step 2: Attach to process:

Get the process ID of the cockroach node you want to attach to and run:

$ dlv attach <pid>

Note that you must have the source code at the same sha as the binary you are debugging checked out for any debugging to make sense.

Step 3: Do what you came to do:

Now you attached to the process, the process is paused. You can jump around and look at things. For example, you can list the currently running goroutines and their IDs:

(dlv) goroutines
[...]
Goroutine 2239 - User: ./pkg/sql/create_stats.go:222 github.com/cockroachdb/cockroach/pkg/sql.(*createStatsNode).startJob (0x58d576a)
[...]

Once you have an interesting goroutine ID (e.g. from stuck goroutine stacks), you can jump to a goroutine:

(dlv) goroutine 2239
Switched from 0 to 2239 (thread 5888500)

Check its trace:

(dlv) bt
0  0x00000000040308eb in runtime.gopark
   at /usr/local/Cellar/go/1.11.5/libexec/src/runtime/proc.go:303
1  0x0000000004030993 in runtime.goparkunlock
   at /usr/local/Cellar/go/1.11.5/libexec/src/runtime/proc.go:308
2  0x00000000040082ae in runtime.chanrecv
   at /usr/local/Cellar/go/1.11.5/libexec/src/runtime/chan.go:520
3  0x0000000004007f9b in runtime.chanrecv1
   at /usr/local/Cellar/go/1.11.5/libexec/src/runtime/chan.go:402
4  0x00000000058d576a in github.com/cockroachdb/cockroach/pkg/sql.(*createStatsNode).startJob
   at ./pkg/sql/create_stats.go:222
5  0x0000000005a3f0c3 in github.com/cockroachdb/cockroach/pkg/sql.(*createStatsNode).startExec.func1
   at ./pkg/sql/create_stats.go:76
6  0x000000000405f281 in runtime.goexit
   at /usr/local/Cellar/go/1.11.5/libexec/src/runtime/asm_amd64.s:1333

Jump to an interesting frame:

(dlv) frame 4
> runtime.gopark() /usr/local/Cellar/go/1.11.5/libexec/src/runtime/proc.go:303 (PC: 0x40308eb)
Warning: debugging optimized function
Frame 4: ./pkg/sql/create_stats.go:222 (PC: 58d576a)
   217:			Progress: jobspb.CreateStatsProgress{},
   218:		})
   219:		if err != nil {
   220:			return err
   221:		}
=> 222:		return <-errCh
   223:	}
   224:
   225:	// maxNonIndexCols is the maximum number of non-index columns that we will use
   226:	// when choosing a default set of column statistics.
   227:	const maxNonIndexCols = 100

And print out some state (here n is the variable name for the *createStatsNode):

(dlv) p n
*github.com/cockroachdb/cockroach/pkg/sql.createStatsNode {
	CreateStats: github.com/cockroachdb/cockroach/pkg/sql/sem/tree.CreateStats {
		Name: "__auto__",
		ColumnNames: github.com/cockroachdb/cockroach/pkg/sql/sem/tree.NameList len: 0, cap: 0, nil,
		Table: github.com/cockroachdb/cockroach/pkg/sql/sem/tree.TableExpr(*github.com/cockroachdb/cockroach/pkg/sql/sem/tree.TableRef) ...,
		AsOf: (*"github.com/cockroachdb/cockroach/pkg/sql/sem/tree.AsOfClause")(0xc007d8eb78),},
	p: *github.com/cockroachdb/cockroach/pkg/sql.planner {
		txn: *(*"github.com/cockroachdb/cockroach/pkg/internal/client.Txn")(0xc00b56c240),
		stmt: *(*"github.com/cockroachdb/cockroach/pkg/sql.Statement")(0xc007d3ab60),
		semaCtx: (*"github.com/cockroachdb/cockroach/pkg/sql/sem/tree.SemaContext")(0xc00b85b460),
[...]
	run: github.com/cockroachdb/cockroach/pkg/sql.createStatsRun {resultsCh: chan github.com/cockroachdb/cockroach/pkg/sql/sem/tree.Datums 0/0, errCh: chan error 0/0},}

There are a lot of things you can do with dlv, some of which you can start exploring by running help while in dlv. If something is unclear, someone on the #engineering channel will be willing to help. Hopefully these steps can at least get you started in using a powerful debugging tool to cut down the amount of time spent wondering what's going wrong with a cluster.

One useful trick is that you can dive into any structures you have a pointer to. On most architectures (i.e. in practice nearly always) the receiver of a method in a stack trace corresponds to the first memory address listed for that frame. For example, to print the innards of a *Replica:

$ dlv attach 24490
Type 'help' for list of commands.
(dlv) print (*storage.Replica)(0xc01fed5800).mu.quotaReleaseQueue
[]int64 len: 1, cap: 1, [977]
(dlv) print (*storage.Replica)(0xc01fed5800).mu.proposalQuotaBaseIndex
23
(dlv) print (*storage.Replica)(0xc01fed5800).RangeID
1514
(dlv) exit
Would you like to kill the process? [Y/n] n

This can be scripted to extract information from a running process with an interruption that is (hopefully) small enough to not disturb the situation.

Debugging Tests

A common place to drop into a debugger is right in the middle of a unit test. This allows one to quickly iterate in a TDD fashion. With Delve there is a unique sub-command for starting a debugging session with tests running: dlv test. After executing this command, you break at the start of a specific unit test by issuing a break <TestName> statement. Finally, if you don't know the name of the test (or don't want to stop dlv to find it), you can issue the funcs command with a regex expression to list the tests of similar names. For example, here is a snippet of dropping into a dlv test session and breaking on a test in the github.com/cockroachlabs/managed-service/pkg/auditlog directory:

$ pwd
/home/roach/go/src/github.com/cockroachlabs/managed-service/pkg/auditlog
$ dlv test              
Type 'help' for list of commands.
(dlv) funcs auditlog.Test*
github.com/cockroachlabs/managed-service/pkg/auditlog.TestAuditLogMiddleware_LogSuccessfulQueryWithSensitiveDataToScrub
github.com/cockroachlabs/managed-service/pkg/auditlog.TestAuditLogMiddleware_LogSuccessfulQueryWithSensitiveDataToScrub.func1
github.com/cockroachlabs/managed-service/pkg/auditlog.TestAuditLogMiddleware_LogSuccessfulQueryWithSensitiveDataToScrub.func2
github.com/cockroachlabs/managed-service/pkg/auditlog.TestAuditLogMiddleware_LogSuccessfulQueryWithSensitiveDataToScrub.func3
github.com/cockroachlabs/managed-service/pkg/auditlog.TestLogFailedQuery
github.com/cockroachlabs/managed-service/pkg/auditlog.TestLogFailedQuery.func1
github.com/cockroachlabs/managed-service/pkg/auditlog.TestLogFailedQuery.func2
github.com/cockroachlabs/managed-service/pkg/auditlog.TestLogFailedQuery.func2.1
github.com/cockroachlabs/managed-service/pkg/auditlog.TestRecordHttpRequest
github.com/cockroachlabs/managed-service/pkg/auditlog.TestRecordHttpRespAndErr
github.com/cockroachlabs/managed-service/pkg/auditlog.TestRecordRequestAndRecordRespAndErr
github.com/cockroachlabs/managed-service/pkg/auditlog.TestScrubNestedField
github.com/cockroachlabs/managed-service/pkg/auditlog.TestScrubSensitiveFields
(dlv) break TestLogFailedQuery
Breakpoint 1 set at 0x1ce06bb for github.com/cockroachlabs/managed-service/pkg/auditlog.TestLogFailedQuery() ./middleware_test.go:83
(dlv) c
> github.com/cockroachlabs/managed-service/pkg/auditlog.TestLogFailedQuery() ./middleware_test.go:83 (hits goroutine(78):1 total:1) (PC: 0x1ce06bb)
    78:                 require.Equal(t, nulls.NewString("{}"), entry.Response) // default is null so this confirms the second write succeeded
    79:                 require.Equal(t, nulls.NewString(""), entry.Error)
    80:         })
    81: }
    82:
=>  83: func TestLogFailedQuery(t *testing.T) {
    84:         conn, err := pop.Connect("test")
    85:         require.NoError(t, err)
    86:
    87:         // Simulate authenticated request
    88:         om, cluster := models.CreateRandomOrgUserAndCluster(t, conn)
(dlv) 

Below is another example where tests for the intrusion server are being run. Note the two syntaxes for creating break points.

$ pwd
/home/roach/go/src/github.com/cockroachlabs/managed-service                                                                                                                                                /0.0s
$ dlv test ./intrusion/server
Type 'help' for list of commands.
(dlv) break TestUseHttpAuditLogMiddlewareWithAuditLogFailure
Breakpoint 1 set at 0x6b50ffb for github.com/cockroachlabs/managed-service/intrusion/server.TestUseHttpAuditLogMiddlewareWithAuditLogFailure() ./intrusion/server/middleware_test.go:134
(dlv) break ./intrusion/server/middleware_test.go:117
Breakpoint 2 set at 0x6b8cc46 for github.com/cockroachlabs/managed-service/intrusion/server.TestUseHttpAuditLogMiddlewareWithSuccess.func2() ./intrusion/server/middleware_test.go:117
(dlv) 

Getting a Linux Delve binary onto a roachprod node

The instructions on the delve repo require go to be installed, which is something that the roachprod nodes do not have. The easiest way is to build the delve binary on your development machine. Skip the cloning if you already have the repo:

$ git clone https://github.com/go-delve/delve.git $GOPATH/src/github.com/go-delve/delve
$ cd $GOPATH/src/github.com/go-delve/delve/cmd/dlv
# Build for linux-amd64.
$ GOOS=linux GOARCH=amd64 go build

And then copy that to the roachprod node you're interested in debugging:

$ roachprod put $CLUSTER:<node-id> $GOPATH/src/github.com/go-delve/delve/cmd/dlv/dlv