Error concepts and handling in CockroachDB

Error concepts

Programmers do not usually like thinking about errors. Introductory programming assignments are typically silent about error handling, or at best dismissive. For many applications, the “best practice” for error handling is “exit the program as soon as an error is encountered”.

In contrast, in a distributed system, especially in CockroachDB, error definitions and error handling are a critical aspect of product quality.

Here are some of the important things that we care about:

  • Errors should not cause running servers (e.g. a database node) to terminate immediately. Customers would consider this an unacceptable defect. Correct and deliberate error handling is a core part of product quality and stability.

  • Users will read the text of error messages, but they cannot be assumed to understand the source code. If an error message is confusing, users will bring confused questions to our tech support. If an error message is misleading, users will ask the wrong questions. Error messages should be clear and accurate, and avoid referring to source code internals.

  • Any error visible to one user will likely be visible to dozens, if not thousands, of users eventually. We want our users to understand on their own what they should do about an error, so they do not need to reach out to technical support. For this, we want our error messages to be self-explanatory and to include hint annotations. We also make specific error codes (e.g. SQLSTATE) part of our public, documented API for using CockroachDB.

  • Errors are part of the API and thus error situations should be exercised in unit tests.

  • Errors make their way to log files and crash reports and can contain user-provided data. We take care to separate customer-confidential data from non-confidential data in log files and crash reports, so we need to distinguish sensitive data inside error objects too.

Basic usage in Go

See the sub-page “Error handling basics”, which is also included in the overall Go style guide.

In a nutshell:

  • We prefer the use of the CockroachDB errors library at github.com/cockroachdb/errors. This is a superset of Go's own errors and pkg/errors.

  • Use errors.Wrap to add context to an error.

  • Handle type assertion failures gracefully with an error, instead of letting Go generate a panic.

  • Avoid panics generally, except in an init function or in a package that uses a disciplined panic-based error handling protocol (and converts panics to errors).
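
The points above can be illustrated with a minimal sketch. It uses only the Go standard library so that it runs on its own: fmt.Errorf with %w stands in for errors.Wrap, and the helper names (parsePort, asString, safely) are hypothetical and exist only for this illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// parsePort adds context to a lower-level error instead of returning it bare.
// fmt.Errorf with %w is a standard-library stand-in for errors.Wrap here.
func parsePort(s string) (int, error) {
	n, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("parsing port: %w", err)
	}
	return n, nil
}

// asString reports a failed type assertion as an error rather than a panic.
func asString(v any) (string, error) {
	s, ok := v.(string)
	if !ok {
		return "", fmt.Errorf("expected string, got %T", v)
	}
	return s, nil
}

// safely runs f and converts a panic raised inside it into an ordinary error,
// as a package with a panic-based protocol would do at its boundary.
func safely(f func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered panic: %v", r)
		}
	}()
	f()
	return nil
}

func main() {
	_, err := parsePort("80x")
	fmt.Println(err)
	// The original cause stays inspectable through the added context.
	fmt.Println(errors.Is(err, strconv.ErrSyntax))

	_, err = asString(42)
	fmt.Println(err)

	fmt.Println(safely(func() { panic("boom") }))
}
```

The caller of safely sees a regular error value, so a failure inside f cannot take down the surrounding server.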

Errors and stability

Here are the general rules about how errors are allowed to impact the lifecycle of a network service:

Situation: Error due to user input in request/query; or computational error in the query language
Examples:
  • SELECT invalid;
  • SELECT 1/0;
  • HTTP: request for object that does not exist
What to do:
  • Return a regular error response to the client.
  • SQL: use a SQLSTATE code (see section below)
  • HTTP: use an appropriate HTTP code
Stop client session? No
Send crash report to telemetry? No
Stop process? No

Situation: Server detects unexpected condition scoped to a single client query. The situation does not correspond to a candidate future feature extension.
Examples:
  • Unreachable code was reached
  • Precondition does not hold while processing client-specific input
What to do:
  • return errors.AssertionFailedf(…) (or NewAssertionFailureWithWrappedErrf)
Stop client session? No
Send crash report to telemetry? Automatic for assertion failures
Stop process? No

Situation: Server detects unexpected condition scoped to a single client query. The situation is a candidate future feature extension.
Examples:
  • Client passes a combination of parameters that is not yet supported.
  • A complex condition arrives in the default or else clause of the code. At the time the code was written, that condition was thought to be impossible, but someone comes with a counter-example that makes sense.
What to do:
  • Find a related issue or file a new one. Make an error with unimplemented.NewWithIssue(…) and refer to the issue. Mark the issue with labels docs-known-limitation and X-anchored-telemetry.
Stop client session? No
Send crash report to telemetry? No (although all unimplemented errors get their own, non-crash telemetry automatically too)
Stop process? No

Situation: Server detects unexpected invalid state scoped to the client session
Examples:
  • Unreachable code was reached
  • Precondition does not hold while processing internal session-bound state
What to do:
  • Propagate assertion error to client, see above. (Wrap existing errors)
  • If the error pertains to an admin-only feature, call log.Warningf
Stop client session? Yes or make it read-only
Send crash report to telemetry? Automatic for assertion failures
Stop process? No

Situation: Server detects unexpected invalid state with uncertain scope on a read-only path or a path guaranteed not to persist data
Examples:
  • Shared subsystem returns an unexpected error
  • Data returned from disk does not conform to the expected type
  • A read-then-write operation reads invalid data from disk.
What to do:
  • Propagate assertion error to client, see above. (Wrap existing errors)
  • Also call log.Errorf
  • Ensure no data is persisted after the error is detected
Stop client session? Yes or make it read-only
Send crash report to telemetry? Automatic for assertion failures
Stop process? No

Situation: Server detects unexpected invalid state on a path that might persist data in storage
Examples:
  • The post-conditions during a data persistence operation fail
  • A write operation to a data persistence output fails in a way that doesn’t allow the write to be cancelled (e.g. corruption detected in KV storage, or a write error on a critical log sink).
What to do:
  • Call log.Fatalf
Stop client session? Automatic by log.Fatal
Send crash report to telemetry? Automatic by log.Fatal
Stop process? Automatic by log.Fatal
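
As a rough illustration of the first and fourth situations, the sketch below shows a session loop that keeps serving after a query-scoped error and only ends the session when session-bound state is found to be invalid. Everything in it is hypothetical (runQuery, serveSession, errSessionState, the "CORRUPT" input); it uses only the Go standard library and is not how CockroachDB's session code is written.

```go
package main

import (
	"errors"
	"fmt"
)

// errSessionState is a hypothetical marker for invalid session-bound state.
var errSessionState = errors.New("invalid session state")

// runQuery is a stand-in for query execution.
func runQuery(q string) (string, error) {
	switch q {
	case "SELECT 1/0":
		// Computational error in the query: a regular error response.
		return "", errors.New("division by zero")
	case "CORRUPT":
		// Simulates a broken precondition on session-bound state.
		return "", fmt.Errorf("checking session cache: %w", errSessionState)
	default:
		return "ok", nil
	}
}

// serveSession returns the responses sent to the client and whether the
// server closed the session. The process itself keeps running in all cases.
func serveSession(queries []string) (responses []string, closed bool) {
	for _, q := range queries {
		res, err := runQuery(q)
		switch {
		case err == nil:
			responses = append(responses, res)
		case errors.Is(err, errSessionState):
			// Session-scoped invalid state: report it, then stop the session.
			responses = append(responses, "ERROR: "+err.Error())
			return responses, true
		default:
			// Query-scoped error: report it and keep the session open.
			responses = append(responses, "ERROR: "+err.Error())
		}
	}
	return responses, false
}

func main() {
	fmt.Println(serveSession([]string{"SELECT 1/0", "SELECT 1"}))
	fmt.Println(serveSession([]string{"CORRUPT", "SELECT 1"}))
}
```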

Large strings inside error payloads

Be careful not to include arbitrarily large strings inside error payloads.

This can cause excessive memory consumption (even a server crash) and incomplete/truncated crash reports.

  • A copy of the SQL syntax input by the SQL client is usually OK.

  • Placeholder values or the body of COPY statements can be trickier.

  • Be especially careful with data loaded from storage.

  • Be careful of data generated from SQL built-in functions or subqueries.

When in doubt, only include a prefix up to a maximum length. Use a special character (e.g. the Unicode ellipsis “…”) to indicate that truncation happened.
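
A minimal sketch of such a truncation helper follows. The name truncate and the choice to count runes rather than bytes are assumptions made for this illustration; counting runes avoids cutting a multi-byte UTF-8 character in half.

```go
package main

import "fmt"

// truncate returns s unchanged if it has at most maxRunes runes. Otherwise it
// returns the first maxRunes runes followed by a Unicode ellipsis.
func truncate(s string, maxRunes int) string {
	if maxRunes < 0 {
		maxRunes = 0
	}
	n := 0
	// Ranging over a string yields the byte offset of each rune start.
	for i := range s {
		if n == maxRunes {
			return s[:i] + "…"
		}
		n++
	}
	return s
}

func main() {
	fmt.Println(truncate("short value", 32))
	fmt.Println(truncate("a very long value loaded from storage", 11))
}
```

The helper would be applied to a value before it is passed to an error constructor, so that the error payload stays bounded regardless of the input size.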

Errors and performance

We work under the assumption that errors are important, yet uncommon.

There are two sides of this “uncommon” coin:

  • Error handling does not need to be optimized for performance. For example, we tolerate a moderate amount of string processing and heap allocations to construct error objects.

  • Error objects should not be constructed on the common path. Only construct errors when needed.

For example:

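Bad (here, result stands for any result type):

    func myFunc(x int) (*result, error) {
        // The error object is constructed on every call, even when unused.
        maybeErr := errors.New("hello")
        if x > 10 {
            return nil, maybeErr
        }
        return &result{}, nil
    }

Good:

    func myFunc(x int) (*result, error) {
        if x > 10 {
            // The error object is only constructed on the error path.
            return nil, errors.New("hello")
        }
        return &result{}, nil
    }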

or alternatively, when the error will be tested elsewhere:

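    // maybeErr is constructed once and can be tested for by callers.
    var maybeErr = errors.New("hello")

    func myFunc(x int) (*result, error) {
        if x > 10 {
            // Attach a stack trace at the point where the error is returned.
            return nil, errors.WithStack(maybeErr)
        }
        return &result{}, nil
    }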

Error messages, hints and codes

Error objects are structured. We use different parts of an error object for different purposes. Care should be taken to not stuff text/data intended for one field into another.

Field

What it’s for

Example

Message (mandatory)

Gives the human user a summary of what happened.

  • The message is for the human user: tell what happened in prose.

  • It’s a summary. Keep it short (yet clear and accurate).

  • The message is about what happened up to the point the error occurred. It should be descriptive about the past / user input.

  • The error message is likely to be embedded in textual contexts that assume a single-line string:

    • Do not start the message with a capital letter or end it with a period.

    • Avoid newline characters.

  • Be open to feedback from users and documentation writers about how to improve the text of the message.

  • There is a single message per error object: composite errors concatenate their messages.

SQLSTATE (highly recommended)

A 5-character code meant to inform automation about what happened and what it can do about the error.

  • Try to use the same code as PostgreSQL in an equivalent situation.

  • Only be creative if PostgreSQL has no equivalent or related situation.

  • When you are creative, be mindful that the SQLSTATE codes are organized in categories indicated by the first 2 characters. Use the proper category for your error.

  • Use SQL logic tests to verify that the proper SQLSTATE is returned in known situations.

  • We have special codes:

    • XX000 - internal error; code automatically derived for assertion failures, also triggers a crash report in telemetry when the error flows back to the client.

    • XXUUU - automatically chosen when the error does not announce its own SQLSTATE. We should reduce occurrences of XXUUU over time; each occurrence a user encounters is a signal that our error handling should choose a better code.

    • XXA00 - txn committed but schema change failed. The transaction did commit but a schema change op failed. Manual intervention is likely needed.

    • See pgcode/codes.go for more.

  • PostgreSQL has special codes which are equally special in CockroachDB:

    • 40001: serialization error. The transaction did not commit and can be retried.

    • 40003: statement completion unknown. The transaction may or may not have committed and may or may not be retried. Manual intervention is likely needed.

A SQLSTATE can be set when an error is constructed, or added to an existing error.

Hint (optional, recommended)

Tells the human user about what they can do to resolve the error.

  • The hints are for the human user: write them in prose.

  • Make recommendations about what the user can change to observe a different outcome.

  • Hints are presented to the user in paragraphs:

    • Each hint payload can be multi-line.

    • Use full sentences, with a capital at the beginning and a period at the end.

    • There can be multiple hint payloads. They typically appear under each other.

Detail (optional)

Tells the human user about the details of what happened.

  • The detail field is for the human user: write it in prose.

  • It’s also about what happened in the past.

  • Details are presented to the user in paragraphs:

    • Each detail payload can be multi-line.

    • Use full sentences, with a capital at the beginning and a period at the end.

    • There can be multiple detail payloads. They typically appear under each other.
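
To make the separation between fields concrete, here is a minimal sketch built on the Go standard library only. The wrapper types and helpers (withCode, withHint, newWithCode, addCode, addHint, codeOf, hintsOf) are hypothetical and written for this illustration; they are not the API of the CockroachDB errors library. The sketch keeps the message, the SQLSTATE and the hints in separate places, and falls back to XXUUU when no code was announced.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// withCode decorates an error with a 5-character SQLSTATE.
type withCode struct {
	error
	code string
}

func (w *withCode) Unwrap() error { return w.error }

// withHint decorates an error with one hint paragraph.
type withHint struct {
	error
	hint string
}

func (w *withHint) Unwrap() error { return w.error }

// newWithCode creates an error that announces its SQLSTATE from the start.
func newWithCode(code, msg string) error {
	return &withCode{error: errors.New(msg), code: code}
}

// addCode attaches a SQLSTATE to an existing error.
func addCode(err error, code string) error { return &withCode{error: err, code: code} }

// addHint attaches a hint without changing the message.
func addHint(err error, hint string) error { return &withHint{error: err, hint: hint} }

// codeOf returns the outermost SQLSTATE, or XXUUU if none was announced.
func codeOf(err error) string {
	var w *withCode
	if errors.As(err, &w) {
		return w.code
	}
	return "XXUUU"
}

// hintsOf collects all hint payloads, outermost first.
func hintsOf(err error) []string {
	var hints []string
	for e := err; e != nil; e = errors.Unwrap(e) {
		if w, ok := e.(*withHint); ok {
			hints = append(hints, w.hint)
		}
	}
	return hints
}

// errDisk is an existing error that does not announce a SQLSTATE.
var errDisk = errors.New("disk unavailable")

func main() {
	err := addHint(newWithCode("22012", "division by zero"),
		"Check that the divisor is non-zero before dividing.")
	fmt.Println("message: ", err)
	fmt.Println("SQLSTATE:", codeOf(err))
	fmt.Println("hints:   ", strings.Join(hintsOf(err), "\n"))
	fmt.Println("SQLSTATE without announced code:", codeOf(errDisk))
}
```

Because the hint lives in its own wrapper, the single-line message stays short and the hint can follow the full-sentence convention without affecting it.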

Errors as API

What does it mean that “errors are part of the documented API”?

  • Whether an error can occur for given input situations is documented.

    • If an API is documented not to return an error, then users can consider CockroachDB defective if an error is returned.

  • The set of possible errors is documented for these input situations.

    • If an API returns an error that was not documented as possible, then users can consider CockroachDB (or its documentation) defective.

  • What to do when a given error occurs is documented.

    • If an API returns an error with no clear “next steps”, then users can consider CockroachDB (or its documentation) defective.

There is a careful balance to maintain: users want to have more guarantees, but each guarantee comes with an engineering burden.

Here is how we manage the amount of engineering work:

  • We neither guarantee nor document the specific text of error messages, hints and details as part of our error API.

    • We emphasize “an error can occur” as the guarantee, not “this specific error will occur”.

    • Specific guarantees are expressed over the SQLSTATE values. These are unit tested.

    • Conversely, engineers are free to improve / extend / modify messages, hints and details without approval by the documentation and product team.

    • In some cases (this is a legacy case, which we strive to avoid nowadays), the guarantee includes a keyword at the first position in the message. For example “restart_transaction”.

  • Mention when new SQLSTATE values are introduced, or when a single error case has been broken down into multiple alternatives, inside a release note in the commit message.

Checking errors, errors in unit tests

  • Messages should not be considered stable:

    • inside Go code, use errors.Is, errors.As and errors.HasType / HasInterface, not .Error() == “…” or strings.Contains(…Error(), “…”)

    • in SQL logic tests, use regular expressions that only match the “important” part of a message

  • In unit tests:

    • Check SQLSTATE values using SQL logic tests (error pgcode ….)

    • In Go unit tests, use testutils.IsError()
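
The following sketch shows why identity-based checks are preferred over message matching. It uses the standard library's errors.Is and errors.As, which the CockroachDB errors library is described above as a superset of; the names errNotFound, rangeError, lookup and checkRange are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a sentinel error that callers can test for.
var errNotFound = errors.New("key not found")

// rangeError is a structured error carrying data about the failure.
type rangeError struct{ lo, hi, got int }

func (e *rangeError) Error() string {
	return fmt.Sprintf("value %d outside [%d, %d]", e.got, e.lo, e.hi)
}

// lookup always fails in this sketch, wrapping the sentinel with context.
func lookup(key string) error {
	return fmt.Errorf("looking up %q: %w", key, errNotFound)
}

// checkRange returns a wrapped *rangeError for out-of-range input.
func checkRange(v int) error {
	if v < 0 || v > 100 {
		return fmt.Errorf("validating input: %w", &rangeError{lo: 0, hi: 100, got: v})
	}
	return nil
}

// isNotFound tests error identity, independent of the message text.
func isNotFound(err error) bool { return errors.Is(err, errNotFound) }

// asRangeError extracts the structured error from anywhere in the chain.
func asRangeError(err error) (*rangeError, bool) {
	var re *rangeError
	ok := errors.As(err, &re)
	return re, ok
}

func main() {
	err := lookup("users/42")
	// Rewording the message later does not break this check.
	fmt.Println(isNotFound(err))

	if re, ok := asRangeError(checkRange(500)); ok {
		fmt.Println("out of range by", re.got-re.hi)
	}
}
```

A test written against isNotFound keeps passing when an engineer improves the wording of the message, which is exactly the freedom the guidelines above intend to preserve.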

Sensitive data inside error objects

Many error objects are copied into logs, crash reports or other artifacts that are then communicated to CRL Tech Support automatically.

To preserve the confidentiality of our customer data, we are careful to isolate user-provided data from strings that are fixed inside CockroachDB. We call this “redactability”.

See the page https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/1824817806 for more details.

General concepts:

  • When something is potentially sensitive / confidential, we call it “unsafe” and it is automatically deleted / redacted out when sent to CRL Tech support.

    • This conservative approach maximally protects customer confidentiality.

    • It takes extra work to include known-safe data in errors so that error reports are more useful during troubleshooting.

  • The CockroachDB errors library already knows about redactability and helps engineers as follows:

    • The first literal string argument to errors.New, errors.Newf, errors.Wrap, etc. is automatically considered non-confidential / non-redactable.

    • Most “simple” numeric values are automatically considered non-redactable.

    • All string values passed as positional arguments to error constructors and annotation functions are considered sensitive and thus redactable.

    • Further values passed to error constructors can be made non-redactable via the SafeFormatter interface (see implementations of SafeFormat throughout the source code).

    • Error objects used as input to a new error object are decomposed into redactable and non-redactable bits automatically.

  • Errors constructed outside of cockroachdb/errors, e.g. via fmt.Errorf, are considered sensitive and thus fully redactable.
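
The rules above can be approximated in a toy sketch. It is not the implementation used by the CockroachDB errors library: the safe type, the redactf function and the “‹×›” placeholder are all assumptions made for this illustration. The format string is treated as safe, simple numeric arguments are kept, and every other argument is replaced unless it was explicitly marked safe.

```go
package main

import "fmt"

// safe marks a string as known non-confidential (hypothetical type, playing
// a role loosely similar to a value with a SafeFormat implementation).
type safe string

// redactf formats a message the way it would appear in a redacted report.
// The literal format string is kept, numeric and safe arguments are kept,
// and every other argument is replaced by a placeholder.
// The sketch assumes unsafe arguments are formatted with %s or %v.
func redactf(format string, args ...any) string {
	out := make([]any, len(args))
	for i, a := range args {
		switch a.(type) {
		case safe, int, int64, uint64, float64:
			out[i] = a
		default:
			out[i] = "‹×›"
		}
	}
	return fmt.Sprintf(format, out...)
}

func main() {
	// The user name is user-provided data; the row count is a simple number.
	fmt.Println(redactf("user %s has %d rows", "alice", 3))
	// A fixed internal name can be marked safe explicitly.
	fmt.Println(redactf("cannot read table %s", safe("system.jobs")))
}
```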

 

Copyright (C) Cockroach Labs.
Attention: This documentation is provided on an "as is" basis, without warranties or conditions of any kind, either express or implied, including, without limitation, any warranties or conditions of title, non-infringement, merchantability, or fitness for a particular purpose.