Re: [PATCH] revision: use priority queue in limit_list()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Derrick Stolee <stolee@gmail.com>
To: Kristofer Karlsson via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org
Cc: Kristofer Karlsson <krka@spotify.com>
Subject: Re: [PATCH] revision: use priority queue in limit_list()
Date: Thu, 14 May 2026 15:57:46 -0400	[thread overview]
Message-ID: <7e5abff7-79c9-41c3-9cfa-2aaf0e69a6a8@gmail.com> (raw)
In-Reply-To: <pull.2114.git.1778777491939.gitgitgadget@gmail.com>

On 5/14/2026 12:51 PM, Kristofer Karlsson via GitGitGadget wrote:
> From: Kristofer Karlsson <krka@spotify.com>
> 
> limit_list() maintains a date-sorted work queue of commits using a
> linked list with commit_list_insert_by_date() for insertion.  Each
> insertion walks the list to find the right position — O(n) per insert.
> In repositories with merge-heavy histories, the symmetric difference
> can contain thousands of commits, making this O(n) insertion the
> dominant cost.

Linear operations are bad, especially when multiplied by linear-ish
loops, causing quadratic behavior.

> Replace the sorted linked list with a prio_queue (binary heap).  This
> gives O(log n) insertion and O(log n) extraction instead of O(n)
> insertion and O(1) extraction, which is a net win when the queue is
> large.

Yes, much better.

> The still_interesting() and everybody_uninteresting() helpers are
> updated to scan the prio_queue's contiguous array instead of walking a
> linked list.  process_parents() already accepts both a commit_list and
> a prio_queue parameter, so the change in limit_list() simply switches
> which one is passed.
> 
> Benchmark: git rev-list --left-right --count HEAD~N...HEAD
> Repository: 2.3M commits, merge-heavy DAG (monorepo)
> Best of 5 runs, times in seconds:
> 
>   commits in
>   symmetric diff   baseline   patched    speedup
>   --------------   --------   -------    -------
>             10       0.01      0.01       1.0x
>             50       0.01      0.01       1.0x
>           3751      21.23      8.49       2.5x
>           4524      21.70      8.29       2.6x
>          10130      20.10      6.65       3.0x
> 
> No change for small traversals; 2.5-3.0x faster when the queue grows
> to thousands of commits.

This is good. Is there any chance that you could demonstrate this with
any commits in the Git repo? It does have some interesting behavior,
especially around point releases that are independent from the 'master'
branch and thus could have lopsided symmetric differences using well-
established tag names.

>     Switching to a prio_queue (binary heap) reduces insertion cost to O(log
>     w), bringing total cost to O(N·log w). The practical result on the same
>     repository:
>     
>     commits in
>     symmetric diff   before     after      speedup
>     --------------   --------   -------    -------
>             3751      21.2s      8.5s       2.5x
>             4524      21.7s      8.3s       2.6x
>            10130      20.1s      6.6s       3.0x

Very nice! I notice that this data is in your cover letter, but
not the commit message. Is that intentional?

>     This affects any command that triggers limit_list() — i.e., when
>     revs->limited is set — including --left-right, --cherry-mark,
>     --cherry-pick, --ancestry-path, bisect, and rebase's fork-point
>     computation. The practical trigger is git status --ahead-behind on a
>     branch that has diverged from upstream in a merge-heavy repository.

This also impacts 'git log --graph' when there is no serialized
commit-graph file. We are still using limit_list() in that case.

>     The change is minimal (+21/−17 lines, single file) because
>     process_parents() already accepts both a commit_list and a prio_queue
>     parameter — limit_list() just switches which one it passes.

The key logic is turning the initial list into the starting
points for the priority queue and everything else is about
moving types around, it seems.

> @@ -1451,6 +1447,7 @@ static int limit_list(struct rev_info *revs)
>  	struct commit_list *newlist = NULL;
>  	struct commit_list **p = &newlist;
>  	struct commit *interesting_cache = NULL;
> +	struct prio_queue queue = { .compare = compare_commits_by_commit_date };

Here, we are _not_ using generation numbers, which is correct
for this case because we are matching the date-based sorting
of the previous list.

>  	while (original_list) {
>  		struct commit *commit = pop_commit(&original_list);
> +		prio_queue_put(&queue, commit);
> +	}
> +
> +	while (queue.nr) {
> +		struct commit *commit = prio_queue_get(&queue);
>  		struct object *obj = &commit->object;
  
This is a fun reuse of lines to take the old "drain the
list as it is being mutated" loop and turn it into "fill
the priority queue" and "drain the priority queue as it
is being mutated"

This code change looks good. No new tests are needed, since
this is a performance-only change. Do any of the tests in
t/perf/ demonstrate this improvement?

Thanks,
-Stolee

next prev parent reply	other threads:[~2026-05-14 19:57 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-14 16:51 [PATCH] revision: use priority queue in limit_list() Kristofer Karlsson via GitGitGadget
2026-05-14 19:40 ` Junio C Hamano
2026-05-14 19:57 ` Derrick Stolee [this message]
2026-05-15  4:16 ` Jeff King
2026-05-15  7:47   ` Kristofer Karlsson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7e5abff7-79c9-41c3-9cfa-2aaf0e69a6a8@gmail.com \
    --to=stolee@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=krka@spotify.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.