Organizing PRs and Commits

Short, simple projects produce a PR containing a single, small commit. However, we strongly encourage contributors to split up larger projects into multiple PRs or commits in order to make code reviews more effective.

This happens mainly in two situations:

Table of contents:

PR organization philosophy

A pull request (PR) is a package of work that engages three parties:

  • the programmer delivering the change,

  • the reviewer(s) who analyze and review it before it is integrated (merged) to the project,

  • later maintainers of the code who need to understand and modify the code further.

It is important to understand that a PR is not just about changing the code. It is really about creating a dialogue, via the history of commits, that helps both reviewers and future maintainers understanding the concepts and the components inside the code.

In a nutshell, the programmer’s job is to organize the PR to be easily reviewable and to make the code more easily maintainable. To achieve reviewability, the best way is to make each commit cover one conceptual unit of work towards the goal, and explain intent (in commit messages) separately from the code. To achieve maintainability, the programmer should explain their choices and spell out (in commit messages) what they left out, when applicable.

How PRs help reviewers and maintainers

To do their job properly, the reviewer wants to know “what is going on” in the change. The reviewer needs to evaluate whether there is a difference between what the programmer intended to do (i.e. the goal) and what the code actually does. To do this work properly, the reviewer needs to understand what the programmer thinks they were doing—separately from the code.

It’s easy to conflate the two; programmers often like to think “my code output is, by definition, what I wanted to do”. But that is not true. To err is human; a mistake is literally a difference between intent and result, and the reviewer cannot spot mistakes if the programmer doesn’t tell them (in prose) what they wanted to do separately from the code. It’s also easy to say ”well I have a google doc which explains intent already, why do I need to explain again?”. This is unfriendly to the future maintainer of the code, who will likely not know where to find this documentation. Also, sometimes, the best reviewer comes from another team and doesn’t know about the project yet. We wish the commit messages to speak for themselves.

Then separately, to do their job properly, the maintainers want to know “what the programmer did not put in the code”. Very often, in a project, a programmer cannot change their code in a perfect way towards the goal. The programmer makes choices, or leaves certain aspects away, on purpose, with the intent to tweak/optimize things later. The maintainers need to know what are the possible/acceptable follow-up changes, even if the code (that may be incomplete/imperfect) does not tell that on its own.

This is why we wish commit messages to tell a story that goes beyond the current change. We want the commit messages to talk about the goals, and all the ways in which the current commit/PR is not achieving the goal yet. The descriptions in prose should talk about choices (“I did X instead of Y because…” or “I did not do X even though it was requested, because…”) and link to related public issues, so that later maintainers can trace the decisions and revisit them when circumstances change.

So in summary, the commit messages and PR descriptions should talk about intent separately from the result, and document the choices. This does not need many sentences, or advanced use of the English language. It is not a complicated literary exercise.

It does need that the change has well-defined boundaries though. It is very had to talk about intent vs results and document choices if the commits are too fine-grained. If the programmer makes too many small commits (e.g. “fixed a comma”, or “added a function there” then another commit “added another function there”), there is nothing of value for the reviewer and maintainer to look at!

This is also why we encourage each commit to cover one conceptual unit of work towards the goal. Not more (do not combine too many changes / decisions into one commit), but not less either.

We call this “one commit per logical change”.

How to decompose PRs into logical changes

Studies have shown that as the code size of a change increases, the code review effectiveness decreases [1]. Keeping changes in a review “small, independent, and complete” has been suggested to improve the review process [2]. We strongly encourage contributors to thoughtfully split up larger projects into multiple PRs or commits.

For example, a change to add a new SQL statement could have the following commits:

  1. one commit to add support for the SQL syntax, with parsing unit tests. In this commit, the planning code is changed to report an "unimplemented error" when the new statement is planned. A planning unit test is added that verifies that the unimplemented error is reported

  2. another commit to plan the statement, with planning unit tests. In this commit, the execution code is changed to report an "unimplemented error" when the new statement is executed, and the planning unit test is updated to check the result of logical planning.

  3. a final commit to execute the statement, with execution unit tests. In this commit, the execution unit test is updated to check the results of execution.

All these 3 commits can be combined in a single PR.  However, we do not encourage PRs with numerous commits. After 3 logical changes in a single PR, consider splitting the PR into sub-PRs to be reviewed separately.

When using the "one commit per logical change" principle, each commit must individually have a standalone explanatory commit message, as per the page Git Commit Messages.

One aspect we’re strongly aiming for is that each standalone commit can be built into a CockroachDB executable, and pass unit tests. Achieving this strengthens the confidence we have that the commits are, in fact, pertaining to just one logical change. This also simplifies the work of analyzing the git history for regressions using the git bisect command. There are exceptions, of course, that can be negotiated on a per-PR basis with the reviewer or your team’s TL.

Sometimes, you realize that you wish to split the PR into multiple logical change only after starting the implementation and touching many files across multiple areas. This is still possible! You can use git add -p for this purpose.

Moving blocks of code

Changes may require moving blocks of code (or tests) to new packages or files, and then making edits to the moved code. We recommend splitting this type of refactoring into two commits. The first commit relocates the code without logical changes. The second makes the required logical changes to the code in its new location. This makes it easier for reviewers to scrutinize the moved code and logical changes separately. If both are contained in a single commit the logical changes will be “hidden” from the reviewer in a large block of newly added lines.

Aesthetic changes

Sometimes an author feels compelled to update spelling, grammar, or the overall structure of the code for legibility or other subjective purposes. When doing so, the author should be mindful that moving/editing code for aesthetic reasons also has a downside: it pollutes the output of git blame, which connects code to its last author/maintainer.

It's nice for a blame to be as clean as possible. It's nice for the history of a stanza to have fewer entries. Being mindful of blame is just common hygiene. There is a bar for changing a line of code; we don't do it gratuitously. For example, a one-line change to fix a typo in a comment impairs the output of blame and also has an opportunity cost in your use of time. (But see also the section “Comment updates “ in Code commenting guidelines.)

Besides the effect of blame churn on humans, it also has an effect on machines. For example, GitHub might start suggesting this person as a reviewer for future patches.

Authors should thus carefully weigh the legibility gains of aesthetic changes with the maintainability cost of impairing history navigation.

Projects with multiple interdependent in-flight PRs

Sometimes, the PR for one change has not been merged yet when we create a follow-up PR that builds upon the first one. This section shares some experience and considerations to keep in mind when this happens.

Note: this is an “advanced” topic and not intended for novice programmers.

Why do in-flight dependent PRs happen

This happens sometimes in more complex projects. For example, a first PRs might create “infrastructure” or a code framework, that the next PRs use to implement a feature.

Generally, we do not favor multiple PRs remaining in-flight concurrently, that pertain to the same project. It makes it harder for TLs and PMs to reason about progress and harder for programmers to keep everything top of mind.

However, it happens sometimes because there is another tension: once an “infrastructure” PR is merged and other PRs are merged on top of it, it makes it harder to change the infrastructure. Also, sometimes we simply do not know exactly how to build this infrastructure until we have some experience using it.

It is the job of the TL to analyze work in advance and propose a project plan that avoids multiple in-flight PRs. So, ideally, this situation remains infrequent. However, it still happens sometimes.

In any case, when this happens we do not consider merging the in-flight PRs together into a single PR. That would make things ever harder to review and iterate on.

Instead, we proceed with caution.

Work extra to keep the PR boundaries clear

When there are multiple related PRs, it is paramount to ensure that each PR speaks for itself. Even the review discussion should be self-contained. Things to keep in mind:

  • Keep discussion about the “base” PRs on the review thread for the base PR itself, not that of the dependent PRs.

  • Use PR number references in comments, to create hyperlinks.

  • If/when you re-arrange the ordering of the PRs or split PRs into multiple sub-PRs, explain this as additional review comments so that a latecomer to the conversation can understand what went on.

Avoid frequent rebases for multi-PR projects

When working on a standalone PR, it is desirable to rebase the PR’s branch on top of the master branch every now and then, to pick up bug fixes from coworkers, etc.

With multiple inter-dependent open PRs, each rebase makes it extra hard for reviewers of dependent PRs to understand what is going on. Additionally, the rebase process for multiple PRs is technically difficult for the programmmer and thus more error-prone. Therefore, in that situation we prefer to refrain from frequent rebases unless strictly required.

Be mindful of your time and that of the reviewer

Maintaining multiple PRs in-flight costs extra effort and time. The cognitive overhead of context-switching between multiple sub-projects is extremely high.

Be mindful of how much extra effort you are spending. It may be a better investment of your time (and that of your team) to decide to merge one or more of the prefix PRs early, to reduce the number of simultaneous discussions ongoing; then issue additional PRs later to address the shortcomings found in the first PRs.

References

[1] Bosu, A., Greiler, M., & Bird, C. (2015, May). Characteristics of useful code reviews: An empirical study at microsoft. In 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories (pp. 146-156). IEEE.

[2] Rigby, P., Cleary, B., Painchaud, F., Storey, M. A., & German, D. (2012). Contemporary peer review in action: Lessons from open source development. IEEE software, 29(6), 56-61.

Copyright (C) Cockroach Labs.
Attention: This documentation is provided on an "as is" basis, without warranties or conditions of any kind, either express or implied, including, without limitation, any warranties or conditions of title, non-infringement, merchantability, or fitness for a particular purpose.