go tool pprof is a profiling tool built into Go. It is useful for understanding hotspots in your code. One can profile a running cockroach binary, or a specific go benchmark.

Basics

This blog post will get you started. This blog post also covers mutex profiles, which are relatively newer (and thus not covered in the first blog post).

There are broadly speaking, three types of profiles that one can examine: cpuprofiles, memory profiles, and mutex profiles. CPU profiles show you where CPU is being spent while your program is running. Memory profiles (which have 4 subtypes) broadly show you where memory is being used. Mutex profiles show you where code is waiting on a contended mutex.

No profile will give you a smoking gun, pointing you to a particular spot that, when fixed, will make your program magically faster. Instead, the different profiles give you visibility from different angles into a high-dimensional space. You need to think careful about what you need to measure, and formulate your own hypotheses, and use the appropriate profiles to test your hypotheses.

Pulling profiles from a running Cockroach server.

A running cockroach node can directly serve a pprof-compatible profile. The Admin UI debug page (http://localhost:8080/#/debug, replacing localhost with the URI of a remote cluster). Do note on a multi-node cluster this gives you the profile from the node you talk to, which may or may not be the node(s) you care about. You can get them directly with go tool pprof ./cockroach http://localhost:8080/debug/pprof/heap?debug=1 (the URLs for the various profile types are listed on the main debug page).

Profiles benefit from optionally specifying the binary that is being profiled. This binary must be exact: if you take a binary from a remote linux server, and are profiling on your local OS X machine, you should copy over the Linux binary as well. If you also have the exact same revision checked out in your $GOPATH, you will see accurate source code lines, which can be very useful. Do note that the ./cockroach binary must be the same one locally as the one running on the cluster at the URL, so if you point it at a remote linux machine, first copy over the linux binary. Be careful about this because pprof won't give you an error message, it will just happily show you incorrect symbols and you will be scratching your head about why some completely unconnected innocent function is taking up 80% of your CPU.

Pulling a profile from a go benchmark

Go benchmarks are nice because they let you isolate a very specific code path, as opposed to wading through the entire running Cockroach system. If you have a benchmark Foo in package bar, you can run:

make bench PKG=./pkg/bar BENCHES=BenchmarkFoo TESTFLAGS=-cpuprofile=/tmp/cpu.out.

Then you can run go tool pprof /tmp/cpu.out to jump into a CPU profile (or whatever other profile you are looking for) of that benchmark run. There is also a -memprofile flag for allocations.

Which profile to use when?

The following profiles exist:

CPU profiles

shows you the functions that use the most CPU time. Strictly speaking, it periodically samples the callstack of the running program, and populates a graph of function calls, with percentages indicating what fraction of the samples were "in" each of the functions.

Unlike all other profile types, CPU profiles sample over a window of time instead of taking a point-in-time snapshot. The default profiling duration is 30 seconds, but it is possible to adjust this by adding the seconds=n query param to the debug URL.

Heap profile

There are four kinds of heap profiles for allocations made by Go code (for allocations made by C++, see "jemalloc profile" below):

-inuse_space      Display in-use memory size
-inuse_objects    Display in-use object counts
-alloc_space      Display allocated memory size
-alloc_objects    Display allocated object counts

inuse profiles count only those memory allocations that occur during the running of the profile generation, alloc profiles count from the start of the program start. space measures total size of allocation, objects counts the total number of objects regardless of size.

In general you probably care to start with inuse over alloc since CockroachDB uses a lot of memory by design. Lots of small allocations are likely what you're looking to find and minimize (this is inuse_objects), especially if a CPU profile shows you that a large chunk of your CPU time is taken up by the garbage collector running, or runtime.growSlice or one of the other allocators of memory. Of course, if you are profiling a benchmark run that precisely tracks your specific code path from program start, alloc can be more accurate.

Mutex profile

you set a probability (which you can set via ENV variable COCKROACH_MUTEX_PROFILE_RATE if you're pulling directly off a Cockroach binary, 0 and 1 are special: 0 disables the mutex profile and 1 captures 100% of mutex contention events. For other values, the profiler will sample on average 1/X events.) and then every time a function hits a mutex, with that probability it records how long it waited in that mutex. This lets you find places in your code where goroutines are waiting. Of course, many times goroutines are waiting by design, but not always, so this one is tricky to use.

Block profile

like a mutex profile, but on all blocking events (mutexes, condition variables, select statements).

Goroutine profile

shows a snapshot of stack traces for all goroutines running at a given point in time.

Jemalloc profile

We use jemalloc as our c++ memory allocator, which includes a related profiling tool. Use this if you're seeing an unexpected amount of "cgo allocations" in the runtime.go log messages (remember that the rocksdb block cache is expected to use a lot of memory here, up to the limit set on the command line). It's a little harder to use than its go counterpart. You must start the server with COCKROACH_MEMPROF_INTERVAL=60s MALLOC_CONF=prof:true and profiles will be written to the log directory every minute. Then you can interpret the profiles with jeprof, which can be found in $GOPATH/native/$SOMETHING/jemalloc/bin. Benjamin Darnell is the local jemalloc expert.

Useful `pprof` commands

pprof can be frustrating to use and navigate, due to the cryptic error messages, and the arbitrary choice of command or variable for each feature. The full list is available at help and eventually you will use a lot more than the ones listed here, but this will get you started:

web: the most useful tool: fire up your browser with an SVG image of the current view of the profiler. The variables listed below (the kinds that are x=y) manipulate the state of pprof, and subsequent calls to web redraw a new image, with the new view.
pdf: like web, but generate a PDF instead of an SVG. Useful for uploading damning evidence to GitHub, which does not allow SVG uploads.
top<N> shows you the top N (by whatever metric you are currently profiling) functions. Some people use this, but I find it confusing and prefer to look at the web view.
list: outputs annotated source for functions matching regexp. This is great for seeing where in a function resources are being used.
weblist: like list, but opens up in a browser for easier viewing.
call_tree. By default, pprof draws a directed acyclic graph. This can get a little messy to understand, particularly if you care exactly when a particular hot node in the graph is getting called, and it has multiple callers. call_tree draws a tree (so child nodes from different callers are separated into different subtrees).
focus=foo: focus on functions named foo (i.e. draw a root node for each function that matches the regexp foo).
nodefraction=xyz: by default, pprof tries to intelligently "prune" the tree to only show a subset of nodes so t the graph is manageable. But sometimes this is too aggressive, and you want to relax it/make your browser suffer. Do note that on each SVG/PDF, the legend box shows the current edgefraction (under "Dropped nodes (cum <=xyzs)").
cum: set pprof into "cumulative" mode. By default pprof starts in flat mode, which shows how much time was spent in that function, whereas cum shows how much time was spent in that function including all children. Note that cum/flat just sets pprof into cumulative/flat mode, which affects subsequent calls to top10 and web, etc.
edgefraction, nodecount: like nodefraction, different ways to prune the graph.

peek foo: shows info about time spent in callers of foo

What's this `cgocall` that takes up all my CPU?

CockroachDB uses RocksDB. RocksDB is written in C++, and thus all calls to RocksDB go through CGo, Go's foreign function interface for calling external code written in C. pprof in its default mode does not cross this barrier, and thus all the CGo calls show up as a single block cgocall. This is not helpful if, for instance, you are benchmarking how to use RocksDB more intelligently.

To open this black box, you have to

Run Cockroach/Your benchmark on linux
follow the instructions here

to make sure your binary is running with the appropriate C++ symbols compiled in that pprof can use to decipher beyond the CGo barrier.

I'm confused! This is overwhelming!

I was too when I started. Here are some initial tips:

Start with a cpuprofile. It's the simplest profile.
Use web a lot. Browsing around on a single pprof graph is easier than trying to use the command line.
Run the program/benchmark with different settings (e.g. pprofing against Cockroach running against a load generator at one setting versus another setting), looking at the graphs in each case, rather than trying to navigate pprof over the command line. Try to just take in how the graphs look different.
Understand that profiling is difficult because so many hours have gone into looking at profiles. The low hanging fruit has mostly been picked! If you're looking for things to optimize, chances are you will not find it in existing code without some deeper digging. Talk to other engineers if you're looking for juicy trails to follow.

CockroachDB

Profiling CockroachDB

Basics

Pulling profiles from a running Cockroach server.

Pulling a profile from a go benchmark

Which profile to use when?

CPU profiles

Heap profile

Mutex profile

Block profile

Goroutine profile

Jemalloc profile

Useful `pprof` commands

What's this `cgocall` that takes up all my CPU?

I'm confused! This is overwhelming!

See also

Basics

Pulling profiles from a running Cockroach server.

Pulling a profile from a go benchmark

Which profile to use when?

CPU profiles

Heap profile

Mutex profile

Block profile

Goroutine profile

Jemalloc profile

Useful pprof commands

What's this cgocall that takes up all my CPU?

I'm confused! This is overwhelming!

See also

Useful `pprof` commands

What's this `cgocall` that takes up all my CPU?