Tracing logs with Jaeger and Zipkin

This guide will help you set up Zipkin-format tracing, collected and displayed by Jaeger, on a single-node CockroachDB cluster running on your laptop.

You may have been confused at various points about all the contexts we pass around. What is the point of having a ctx as the first argument of every single function? Why are people always upset in PR reviews when you forget to pass one correctly? Why are there sometimes multiple contexts and it's completely non-obvious which one you should use? I used to feel like these were just magic incantations that I had to guess until the trolls were mollified. That was a frustrating time in my life. If you feel this way, then this guide on tracing will give you the necessary context to understand why it matters.

First, make sure docker is working on your system. Second, run the following incantation:

docker run -d -e COLLECTOR_ZIPKIN_HOST_PORT=9411 -p5775:5775/udp -p6831:6831/udp -p6832:6832/udp \
  -p5778:5778 -p16686:16686 -p14268:14268 -p9411:9411 jaegertracing/all-in-one:latest

This sets up a tracing collector running in a docker container, waiting for traces to be sent to it on port 9411 and storing all traces in memory.
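Before going further, it can save some head-scratching to confirm that the container is actually listening. The following is a minimal sketch, assuming bash (it relies on bash's /dev/tcp pseudo-device); port_open is a hypothetical helper name, not something Docker or Jaeger provides. It probes the Zipkin collector port (9411) and the Jaeger UI port (16686) from the command above.

```shell
# Hypothetical helper: succeed if a TCP connection to localhost:$1 can be opened.
# Relies on bash's /dev/tcp pseudo-device, so run this with bash, not plain sh.
port_open() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# 9411 is the Zipkin-compatible collector, 16686 is the Jaeger UI.
for port in 9411 16686; do
  if port_open "$port"; then
    echo "port $port: open"
  else
    echo "port $port: closed (is the container running?)"
  fi
done
```

If either port shows as closed, docker ps should tell you whether the container started at all.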

Now, in a SQL shell connected to the CockroachDB node running on your machine, run SET CLUSTER SETTING trace.zipkin.collector='localhost:9411'; to tell Cockroach to send its traces over to the Zipkin collector.
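If you would rather not open an interactive SQL shell, the same statement can be sent from the command line. This is a sketch under a couple of assumptions: the cockroach binary is on your PATH, and the node was started in insecure mode on the default port (on a secure cluster you would pass --certs-dir instead of --insecure). The second statement just reads the setting back so you can see that it took.

```shell
# Address of the Zipkin-compatible collector started by the docker command above.
COLLECTOR="localhost:9411"
SQL="SET CLUSTER SETTING trace.zipkin.collector = '${COLLECTOR}';"

if command -v cockroach >/dev/null 2>&1; then
  # Assumption: an insecure single-node cluster listening on the default port.
  cockroach sql --insecure \
    -e "$SQL" \
    -e "SHOW CLUSTER SETTING trace.zipkin.collector;"
else
  echo "cockroach not found on PATH; run this statement in a SQL shell instead:"
  echo "$SQL"
fi
```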

Now, in your browser, go to localhost:16686. Hopefully, you see an angry gopher. Under the 'service' dropdown, select 'cockroach'. If you do not see 'cockroach', this means you misconfigured some ports, or you have some firewall running. Essentially, the traces are not getting to the docker container, or you didn't set the cluster setting right.
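The same check can be done without the browser. The sketch below asks the Jaeger query service which services it has seen. Note the assumptions: /api/services is the internal HTTP endpoint the Jaeger UI itself calls, not a documented stable API, so it may change between Jaeger versions; has_service is a hypothetical helper that only greps the response for the quoted service name.

```shell
JAEGER_UI="http://localhost:16686"

# Hypothetical helper: succeed if the JSON on stdin mentions the quoted service name.
has_service() {
  grep -q "\"$1\""
}

# Assumption: /api/services is the endpoint backing the UI's 'service' dropdown.
if curl -fsS --max-time 5 "$JAEGER_UI/api/services" | has_service cockroach; then
  echo "cockroach is reporting traces"
else
  echo "no cockroach service yet: check the cluster setting and port 9411"
fi
```

A service only shows up after at least one trace has arrived, so run a query or two against the cluster before concluding that something is broken.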

The in-memory storage works well for small use cases, but can result in OOMs when collecting many traces over a long period of time. Luckily, Jaeger also supports disk-backed local storage using Badger (not Pebble; we'll give them a pass on this, for now). To use this, start Jaeger by running the following adjusted Docker command:

docker run -d -e COLLECTOR_ZIPKIN_HOST_PORT=9411 -e SPAN_STORAGE_TYPE=badger \
  -e BADGER_EPHEMERAL=false -e BADGER_DIRECTORY_VALUE=/badger/data -e BADGER_DIRECTORY_KEY=/badger/key \
  -v /mnt/data1/jaeger:/badger -p5775:5775/udp -p6831:6831/udp -p6832:6832/udp \
  -p5778:5778 -p16686:16686 -p14268:14268 -p9411:9411 jaegertracing/all-in-one:latest

Play around looking for some traces. A few things:

  • Instead of wading through log messages in an unstructured fashion, you now see the logs arranged in a nice tree that follows how the contexts were passed around. This is great! The tree also crosses machine boundaries, so you don't have to look at three different flat .log files trying to sync up events. This greatly speeds up your debugging.

Once you understand the value of distributed tracing, the next step is making this happen in a distributed fashion. This is fairly straightforward on your laptop. I recommend using roachprod local to set up a local 3-node cluster, since you can then see distributed events across nodes without any heavy lifting: no munging around with network ports, and no checking that firewalls are open.
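For reference, the roachprod steps might look roughly like the sketch below. It assumes roachprod is on your PATH and that a cockroach binary sits in the current directory; exact flags can differ between roachprod versions, so treat this as a starting point rather than a recipe. Because every node of a local cluster runs on your laptop, the same localhost:9411 collector address works for all of them.

```shell
CLUSTER="local"
NODES=3
SET_COLLECTOR="SET CLUSTER SETTING trace.zipkin.collector = 'localhost:9411';"

if ! command -v roachprod >/dev/null 2>&1; then
  echo "roachprod not found on PATH; skipping"
else
  # Assumption: ./cockroach is the binary you want to run on every node.
  roachprod create "$CLUSTER" -n "$NODES" &&
    roachprod put "$CLUSTER" ./cockroach cockroach &&
    roachprod start "$CLUSTER" &&
    # Cluster settings apply cluster-wide, so setting it through node 1 is enough.
    roachprod sql "$CLUSTER:1" -- -e "$SET_COLLECTOR"
fi
```

When you are done, roachprod destroy local tears the cluster down again.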