Git development
 help / color / mirror / Atom feed
From: "Kristofer Karlsson via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: Kristofer Karlsson <krka@spotify.com>
Subject: [PATCH 0/3] revision: use priority queue for streaming walks
Date: Wed, 27 May 2026 15:49:59 +0000	[thread overview]
Message-ID: <pull.2127.git.1779897003.gitgitgadget@gmail.com> (raw)

This is a follow-up to kk/limit-list-optim (now in master), which replaced
the O(N) sorted linked-list insertion in limit_list() with a priority queue.
In the review thread for that patch, I mentioned that the same approach
could be applied to the streaming (non-limited) walk in get_revision_1().
Junio suggested doing it as a separate topic, and Peff noted he had a local
branch (jk/revs-commits-prio-queue) doing a broader conversion of
revs->commits to prio_queue entirely.

This series takes a lighter-weight approach: it keeps the linked list for
setup and external callers, and adds a separate commit_queue field that the
streaming walk drains into on first use. This avoids touching bisect,
line-log, and list-objects code, at the cost of a "only one should be
non-empty" invariant between the two fields.

Together with the limit_list() change already in master, this eliminates all
commit_list_insert_by_date() callers from revision.c.

Patch 1 is a small leak fix -- a missing release_revisions() call in
pack-objects that becomes visible once the commit queue uses a dynamically
allocated prio_queue array.

Patch 2 introduces a rev_walk_mode enum to replace the repeated if/else
chains in get_revision_1(). The function dispatches on walk mode in multiple
places (next commit, expand parents, flag clearing) and these chains must
stay in sync. The enum resolves the mode once and both dispatch sites switch
on the same variable. This is a lighter alternative to the vtable-based
refactoring I mentioned before. No functional change.

Patch 3 is the actual conversion of the streaming walk to use a priority
queue.

== Why this helps ==

The streaming walk in get_revision_1() inserts newly discovered parent
commits into a date-sorted queue. On master, this uses
commit_list_insert_by_date(), which walks the linked list to find the
insertion point -- O(w) per insert, where w is the queue width (active walk
frontier).

In merge-heavy repositories, the walk frontier stays wide:


Repository Commits Peak width Avg width
=======================================

monorepo (2.4M) 2,420K 2,653 1,700 linux.git 1,445K 581 235 git.git 82K 188
82

On the monorepo, each of the 2.4M commits requires scanning an average of
1,700 list entries to find the insertion point. With the priority queue,
this drops to ~11 heap comparisons.

== Benchmarks ==

All benchmarks: best of 3 runs, same machine, commit-graph present.

Streaming walks (affected by this series):

git rev-list --count HEAD (monorepo, 2.4M commits)

  master:   17.94s
  patched:   3.38s   (5.3x faster)

git rev-list HEAD (monorepo, full output)

  master:   27.72s
  patched:   8.61s   (2.8x faster, I/O-bound fraction unchanged)


Regression checks -- non-merge-heavy repos (streaming path, but frontier
stays narrow so O(w) insertion was never the bottleneck):

git rev-list --count HEAD (linux.git, 1.4M commits)

  master:    1.76s
  patched:   1.81s   (no change)

git rev-list HEAD (linux.git, full output)

  master:    4.46s
  patched:   4.52s   (no change)

git rev-list --count HEAD (git.git, 82K commits)

  master:     83ms
  patched:    86ms   (no change)


Regression checks -- other walk modes (not affected by this series):

git rev-list --count HEAD~5000...HEAD (monorepo, limited path)

  master:    7.36s
  patched:   7.02s   (no change)


== Profile breakdown ==

perf profiling of rev-list --count HEAD on the monorepo shows where the time
goes:

master (17.94s): commit_list_insert_by_date 79% 14.25s fixed overhead
(parse/lookup) 21% 3.69s

patched (3.38s): heap ops (compare + sift) 16% 0.53s fixed overhead
(parse/lookup) 84% 2.85s

The queue maintenance itself sped up 27x (14.25s to 0.53s). The overall 5.3x
is lower because the fixed costs -- object lookup (17%), commit-graph
parsing (14%), memory allocation (10%) -- are roughly constant between the
two versions at ~3s.

This means the patch removes the dominant bottleneck entirely. After the
patch, the walk cost is dominated by irreducible per-commit work (parsing
and object lookup) which scales linearly with commit count regardless of
frontier width.

Kristofer Karlsson (3):
  pack-objects: call release_revisions() after cruft traversal
  revision: introduce rev_walk_mode to clarify get_revision_1()
  revision: use priority queue for non-limited streaming walks

 builtin/pack-objects.c |   1 +
 commit.c               |  13 -----
 commit.h               |   2 -
 revision.c             | 113 +++++++++++++++++++++++++++--------------
 revision.h             |  12 ++++-
 5 files changed, 88 insertions(+), 53 deletions(-)


base-commit: c69baaf57ba26cf117c2b6793802877f19738b0d
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2127%2Fspkrka%2Fstreaming-prio-queue-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2127/spkrka/streaming-prio-queue-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/2127
-- 
gitgitgadget

             reply	other threads:[~2026-05-27 15:50 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-27 15:49 Kristofer Karlsson via GitGitGadget [this message]
2026-05-27 15:50 ` [PATCH 1/3] pack-objects: call release_revisions() after cruft traversal Kristofer Karlsson via GitGitGadget
2026-05-27 15:50 ` [PATCH 2/3] revision: introduce rev_walk_mode to clarify get_revision_1() Kristofer Karlsson via GitGitGadget
2026-05-27 15:50 ` [PATCH 3/3] revision: use priority queue for non-limited streaming walks Kristofer Karlsson via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=pull.2127.git.1779897003.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=krka@spotify.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox