This guide will help you set up Zipkin tracing on a single node cockroach cluster running on your laptop.

You may have been confused at various points about all the contexts we pass around. What is the point of having a ctx as the first argument of every single function? Why are people always upset in PR reviews when you forget to pass one correctly? Why are there sometimes multiple contexts and it's completely non-obvious which one you should use? I used to feel like these were just magic incantations that I had to guess until the trolls were mollified. That was a frustrating time in my life. If you feel this way, then this guide on tracing will give you the necessary context to understand why it matters.

CockroachDB has extensive verbose logging and distributed tracing instrumentation built-in. One way in which this instrumentation is useful is through 3rd party trace collectors like Jaeger and Zipkin. CRDB can be instructed to trace everything it does and to send all the traces to a collector. Enabling tracing also activates all the log messages, at all verbosity levels, as traces include the log messages printed in the respective trace context.

Note that enabling full tracing is expensive both in terms of CPU usage and memory footprint, and is not suitable for high-throughput production environments.

There are several options for routing traces to a 3rd party collector, listed below. All of these are enabled by the fact that CRDB's Tracer can be configured to tee everything to the OpenTelemetry tracer, with OpenTelemetry being quickly embraced as the lingua franca of all observability tools.

  1. Output traces to a collector that speaks the OTLP protocol. For example, Lightstep supports this, as do special builds of Jaeger. This can be enabled with the trace.opentelemetry.collector cluster setting.
  2. Output traces to the OpenTelemetry Collector, which can in turn route them to a lot of other tools. The OTEL Collector is a canonical collector, speaking the OTLP protocol, that can buffer traces and perform some processing on them before exporting them to every tool in the universe (including Jaeger, Zipkin and other OTLP tools). This is again enabled with the trace.opentelemetry.collector cluster setting.
  3. Output traces to Jaeger or Zipkin using their native protocols. This is implemented by using the Jaeger and Zipkin dedicated "exporters" from the otel SDK. Enabling the Jaeger exporter is done through the trace.jaeger.agent cluster setting. Enabling the Zipkin exporter is done through the trace.zipkin.collector cluster setting.
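As a rough sketch of how the three options map onto cluster settings, the snippet below prints the statement for one option. The helper name setting_sql and the endpoint values are assumptions made up for illustration (4317 is the usual OTLP gRPC port, 9411 the usual Zipkin port, 6831 the Jaeger agent port used later in this guide); substitute wherever your collector actually listens.

```shell
# Assumed endpoints; override them through the environment if yours differ.
OTLP_ADDR="${OTLP_ADDR:-localhost:4317}"       # options 1 and 2 (OTLP)
JAEGER_AGENT="${JAEGER_AGENT:-localhost:6831}" # option 3, Jaeger
ZIPKIN_ADDR="${ZIPKIN_ADDR:-localhost:9411}"   # option 3, Zipkin

# setting_sql is a hypothetical helper: it only prints the SQL statement
# for the chosen option and does not talk to the cluster.
setting_sql() {
  case "$1" in
    otlp)   printf "SET CLUSTER SETTING trace.opentelemetry.collector = '%s';\n" "$OTLP_ADDR" ;;
    jaeger) printf "SET CLUSTER SETTING trace.jaeger.agent = '%s';\n" "$JAEGER_AGENT" ;;
    zipkin) printf "SET CLUSTER SETTING trace.zipkin.collector = '%s';\n" "$ZIPKIN_ADDR" ;;
    *)      echo "unknown option: $1" >&2; return 1 ;;
  esac
}

setting_sql jaeger
```

Paste the printed statement into a SQL shell on any node to apply it.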


When playing around and wanting to look at some traces, the simplest thing to do is to use Jaeger or Zipkin. Jaeger has a better UI, so we'll use that as an example. To run a Jaeger instance locally in a container, make sure Docker is running on your system and then run the following incantation:

docker run -d --name jaeger -p 6831:6831/udp -p 16686:16686 jaegertracing/all-in-one:latest

This runs the latest version of Jaeger, and forwards two ports to the container. 6831 is the trace ingestion port, 16686 is the UI port. By default, Jaeger will store all received traces in memory.

Now let's run CRDB and generate some traces. To see distributed traces in all their glory, the simplest thing is to use roachprod local. Create a cluster with:

roachprod create local -n 3
roachprod put local cockroach
roachprod start local

To enable trace generation do:
roachprod sql local:1
SET CLUSTER SETTING trace.jaeger.agent='localhost:6831';

Or even simpler, you can start the cluster with
roachprod start local --env=COCKROACH_JAEGER=localhost
and then you don't need to set the cluster setting.

Now go to http://localhost:16686 in your browser. Hopefully, you see an angry gopher. Select the CockroachDB service under the 'service' dropdown, and you should see traces streaming in. If the service is not listed, the traces are not getting to the Docker container: you misconfigured some ports, you have a firewall running, or you didn't set the cluster setting right.
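If the service does not show up, a few quick checks can narrow down where the traces are getting lost. The following is only a sketch: the function names are invented for this example, and it assumes the container is named jaeger and the UI is forwarded to port 16686 as in the docker command above. Nothing runs until you call one of the functions.

```shell
# 1. Is the container running, and are 6831/udp and 16686 published?
check_jaeger_container() {
  docker ps --filter "name=jaeger" --format '{{.Names}}: {{.Ports}}'
}

# 2. Does the Jaeger UI answer on the forwarded port?
check_jaeger_ui() {
  curl -sf -o /dev/null "http://localhost:16686/" && echo "UI reachable"
}

# 3. Prints a statement to paste into `roachprod sql local:1`
#    to confirm what the cluster setting is actually set to.
show_setting_sql() {
  echo "SHOW CLUSTER SETTING trace.jaeger.agent;"
}
```

Running the three checks in order separates a container problem from a port or firewall problem from a cluster setting problem.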


Jaeger's memory storage works well for small use cases, but can result in OOMs when collecting many traces over a long period of time. Luckily, Jaeger also supports disk-backed local storage using Badger (not Pebble; we'll give them a pass on this, for now). To use this, start Jaeger by running the following adjusted Docker command:

docker run -d --name jaeger \
  -e SPAN_STORAGE_TYPE=badger -e BADGER_EPHEMERAL=false \
  -e BADGER_DIRECTORY_VALUE=/badger/data -e BADGER_DIRECTORY_KEY=/badger/key \
  -v /mnt/data1/jaeger:/badger \
  -p 6831:6831/udp -p 16686:16686 jaegertracing/all-in-one:latest



Play around looking for some traces. A few things:

  • Instead of wading through log messages in an unstructured fashion, the logs are now graphed in a nice tree format based on how the contexts were passed around. This is great! The tree also crosses machine boundaries, so you don't have to look at three different flat .log files trying to sync up events. This greatly speeds up your debugging.
Once you understand the value of distributed tracing, the next step is making this happen in a distributed fashion. This is fairly straightforward on your laptop; I recommend using roachprod local to set up a local 3-node cluster, as you can then see distributed events across nodes without any heavy lifting such as munging network ports and making sure firewalls are open.


An older version of this guide instructed readers to run Jaeger with the COLLECTOR_ZIPKIN_HOST_PORT=9411 environment variable set. This variable is no longer needed when using the trace.jaeger.agent setting. It asked Jaeger to accept the Zipkin protocol, back when we didn't have native support for the Jaeger protocol.