From: Patrick Steinhardt <ps@pks.im>
To: Toon Claes <toon@iotcl.com>
Cc: git@vger.kernel.org, "Jeff King" <peff@peff.net>,
"Taylor Blau" <me@ttaylorr.com>,
"Derrick Stolee" <stolee@gmail.com>,
"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: Re: [PATCH RFC v2 4/5] last-modified: implement faster algorithm
Date: Tue, 27 May 2025 12:39:48 +0200 [thread overview]
Message-ID: <aDWWdGVW90LpxZhH@pks.im> (raw)
In-Reply-To: <20250523-toon-new-blame-tree-v2-4-101e4ca4c1c9@iotcl.com>
On Fri, May 23, 2025 at 11:33:51AM +0200, Toon Claes wrote:
> The current implementation of 'git last-modified' works by doing a
> revision walk, and inspecting the diff at each level of that walk to
> annotate the to-be-found entries to a path. In other words, if the diff
> at some level touches a path which has not yet been associated with a
> commit, then that commit becomes associated with the path.
It's a bit funny that we first introduce an algorithm, only to change it
in a subsequent step again. I don't mind it much though: this variant
here is quite a bit more complicated, and it's nice to first gain an
understanding of how the simple algorithm works before going to the
harder one.
It's also nice that the tests prove that this is indeed leads to the
same result.
[snip]
> More specifically, consider a priority queue of commits sorted by
> generation number. First, enqueue the set of boundary commits with all
> paths in the original spec marked as interesting.
>
> Then, while the queue is not empty, do the following:
>
> 1. Pop an element, say, 'c', off of the queue, making sure that 'c'
> isn't reachable by anything in the '--not' set.
>
> 2. For each parent 'p' (with index 'parent_i') of 'c', do the
> following:
>
> a. Compute the diff between 'c' and 'p'.
> b. Pass any active paths that are TREESAME from 'c' to 'p'.
> c. If 'p' has any active paths, push it onto the queue.
What if an active path is changed on both sides of a merge commit? Do we
pass it to the first parent?
[snip]
> Now, some performance numbers. On github/git, our numbers look like the
> following (all wall-clock times best-of-five, and with '--max-depth=0'
> on the root):
This option does not exist in this version of git-last-modified(1).
> github ttaylorr/blame-tree-fast
> with filters: 0.754s 0.271s (2.78x faster, 6.18x overall)
> without filters: 1.676s 1.056s (1.58x faster)
>
> and on torvalds/linux:
>
> github ttaylorr/blame-tree-fast
> with filters: 0.608 0.062 (9.81x faster, ~52x overall)
> without filters: 3.251 0.676 (4.81x faster)
>
> In short, the existing implementation is comparably fast *with* filters
> as the new implementation is *without* filters. So, most repositories
> should get a dramatic speed-up by just deploying this (even without
> computing Bloom filters), and all repositories should get faster still
> when computing Bloom filters.
It would be nice to introduce "filters" as "Bloom filters". I was
initially wondering what filters you talk about until you then mention
it in the last sentence.
> diff --git a/last-modified.c b/last-modified.c
> index f628434929..0a0818cdf1 100644
> --- a/last-modified.c
> +++ b/last-modified.c
> @@ -3,18 +3,20 @@
> #include "commit.h"
> #include "diffcore.h"
> #include "diff.h"
> -#include "object.h"
> #include "revision.h"
> #include "repository.h"
> #include "log-tree.h"
> #include "dir.h"
> #include "commit-graph.h"
> #include "bloom.h"
> +#include "prio-queue.h"
> +#include "commit-slab.h"
It would be nice if we could keep these lexicographically sorted right
from the first commit.
> @@ -116,6 +128,20 @@ void last_modified_release(struct last_modified *lm)
> }
> hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
> release_revisions(&lm->rev);
> + free(lm->all_paths);
> +}
> +
> +struct commit_active_paths {
> + char *active;
> + int nr;
Should this be a `size_t` as it is counting something?
> +};
Hm, a bit weird, as I don't kno what `nr` is supposed to stand for.
Intuitively I would have expected that `active` is an array of strings,
and that `nr` tracks how many there are. But that's not the case.
Let's read on.
> +define_commit_slab(active_paths, struct commit_active_paths);
> +static struct active_paths active_paths;
> +
> +static void free_one_active_path(struct commit_active_paths *active)
s/free_one_active_path/commit_active_paths_release/
> @@ -197,7 +230,32 @@ static void last_modified_diff(struct diff_queue_struct *q,
> }
> }
>
> -static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
> +static char *scratch;
Having a global variable like this in a library is not great. Can we
instead pass around a context or something like this internally?
> +
> +static void pass_to_parent(struct commit_active_paths *c,
> + struct commit_active_paths *p,
> + int i)
> +{
> + c->active[i] = 0;
> + c->nr--;
> + p->active[i] = 1;
> + p->nr++;
Okay, so `active` is a bitfield. It's a bit weird that we use a full
byte for each bit though. It might not matter much in practice, but it
feels quite wasteful to me. Doubly so because we allocate this bitfield
for every commit we visit.
Can we maybe instead use `struct bitmap` for this?
> +}
> +
> +#define PARENT1 (1u<<16) /* used instead of SEEN */
> +#define PARENT2 (1u<<17) /* used instead of BOTTOM, BOUNDARY */
These are the same definitions as in "commit-reach.c". Might be worth it
to deduplicate those.
> @@ -221,8 +281,88 @@ static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
> return 0;
> }
>
> +static int process_parent(struct last_modified *lm, struct prio_queue *queue,
> + struct commit *c,
> + struct commit_active_paths *active_c,
> + struct commit *parent, int parent_i)
> +{
> + int i, ret = 0; // TODO type & for loop var
This looks like a left-over comment that should be addressed.
Patrick
next prev parent reply other threads:[~2025-05-27 10:39 UTC|newest]
Thread overview: 135+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-22 17:46 [PATCH RFC 0/5] Introduce git-blame-tree(1) command Toon Claes
2025-04-22 17:46 ` [PATCH RFC 1/5] blame-tree: introduce new subcommand to blame files Toon Claes
2025-04-24 16:19 ` Junio C Hamano
2025-05-07 13:13 ` Toon Claes
2025-04-22 17:46 ` [PATCH RFC 2/5] t/perf: add blame-tree perf script Toon Claes
2025-04-22 17:46 ` [PATCH RFC 3/5] blame-tree: use Bloom filters when available Toon Claes
2025-04-22 17:46 ` [PATCH RFC 4/5] blame-tree: implement faster algorithm Toon Claes
2025-04-22 17:46 ` [PATCH RFC 5/5] blame-tree.c: initialize revision machinery without walk Toon Claes
2025-04-23 13:26 ` [PATCH RFC 0/5] Introduce git-blame-tree(1) command Marc Branchaud
2025-05-07 14:22 ` Toon Claes
2025-05-07 20:23 ` Marc Branchaud
2025-05-07 20:45 ` Junio C Hamano
2025-05-08 13:26 ` Marc Branchaud
2025-05-08 14:26 ` Junio C Hamano
2025-05-08 15:12 ` Marc Branchaud
2025-05-14 14:42 ` Toon Claes
2025-05-14 19:29 ` Junio C Hamano
2025-05-14 21:15 ` Marc Branchaud
2025-05-15 13:29 ` Patrick Steinhardt
2025-05-15 16:39 ` Junio C Hamano
2025-05-15 17:39 ` Marc Branchaud
2025-05-15 19:30 ` Jeff King
2025-05-16 4:38 ` Patrick Steinhardt
2025-05-20 8:49 ` Toon Claes
2025-05-15 17:30 ` Marc Branchaud
2025-05-16 4:30 ` Patrick Steinhardt
2025-05-14 21:15 ` Marc Branchaud
2025-05-07 20:49 ` Kristoffer Haugsbakk
2025-05-08 13:20 ` D. Ben Knoble
2025-05-08 13:26 ` Marc Branchaud
2025-05-08 13:18 ` D. Ben Knoble
2025-05-23 9:33 ` [PATCH RFC v2 0/5] Introduce git-last-modified(1) command Toon Claes
2025-05-23 9:33 ` [PATCH RFC v2 1/5] last-modified: new subcommand to show when files were last modified Toon Claes
2025-05-25 20:07 ` Justin Tobler
2025-06-05 8:32 ` Toon Claes
2025-05-27 10:39 ` Patrick Steinhardt
2025-06-13 9:34 ` Toon Claes
2025-06-13 9:52 ` Kristoffer Haugsbakk
2025-05-23 9:33 ` [PATCH RFC v2 2/5] t/perf: add last-modified perf script Toon Claes
2025-05-23 9:33 ` [PATCH RFC v2 3/5] last-modified: use Bloom filters when available Toon Claes
2025-05-27 10:40 ` Patrick Steinhardt
2025-06-13 11:05 ` Toon Claes
2025-05-23 9:33 ` [PATCH RFC v2 4/5] last-modified: implement faster algorithm Toon Claes
2025-05-27 10:39 ` Patrick Steinhardt [this message]
2025-05-23 9:33 ` [PATCH RFC v2 5/5] last-modified: initialize revision machinery without walk Toon Claes
2025-05-27 10:39 ` Patrick Steinhardt
2025-07-01 20:35 ` [PATCH RFC v2 0/5] Introduce git-last-modified(1) command Kristoffer Haugsbakk
2025-07-01 21:06 ` Junio C Hamano
2025-07-01 21:30 ` Kristoffer Haugsbakk
2025-07-02 13:00 ` Toon Claes
2025-07-09 15:53 ` Toon Claes
2025-07-09 17:00 ` Junio C Hamano
2025-06-30 18:49 ` [PATCH RFC v3 0/3] " Toon Claes
2025-06-30 18:49 ` [PATCH RFC v3 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
2025-07-01 20:20 ` Kristoffer Haugsbakk
2025-07-02 11:51 ` Junio C Hamano
2025-06-30 18:49 ` [PATCH RFC v3 2/3] t/perf: add last-modified perf script Toon Claes
2025-06-30 18:49 ` [PATCH RFC v3 3/3] last-modified: use Bloom filters when available Toon Claes
2025-07-01 23:01 ` [PATCH RFC v3 0/3] Introduce git-last-modified(1) command Junio C Hamano
2025-07-09 15:26 ` [PATCH v4 " Toon Claes
2025-07-09 21:57 ` Junio C Hamano
2025-07-10 18:37 ` Junio C Hamano
2025-07-16 13:32 ` [PATCH v5 0/6] " Toon Claes
2025-07-16 13:35 ` [PATCH v5 1/6] last-modified: new subcommand to show when files were last modified Toon Claes
2025-07-18 0:02 ` Taylor Blau
2025-07-19 6:44 ` Jeff King
2025-07-22 15:50 ` Toon Claes
2025-08-01 9:09 ` Christian Couder
2025-08-01 16:59 ` Junio C Hamano
2025-07-16 13:35 ` [PATCH v5 2/6] t/perf: add last-modified perf script Toon Claes
2025-07-18 0:08 ` Taylor Blau
2025-07-22 15:52 ` Toon Claes
2025-07-16 13:35 ` [PATCH v5 3/6] last-modified: use Bloom filters when available Toon Claes
2025-07-18 0:16 ` Taylor Blau
2025-07-22 16:02 ` Toon Claes
2025-07-16 13:35 ` [PATCH v5 4/6] pretty: allow caller to disable indentation Toon Claes
2025-07-16 15:50 ` Junio C Hamano
2025-07-17 16:31 ` Toon Claes
2025-07-16 13:35 ` [PATCH v5 5/6] last-modified: support --extended format Toon Claes
2025-07-16 16:09 ` Junio C Hamano
2025-07-17 16:31 ` Toon Claes
2025-07-17 22:37 ` Junio C Hamano
2025-07-18 17:36 ` Junio C Hamano
2025-07-22 16:06 ` Toon Claes
2025-07-16 13:42 ` [PATCH v5 6/6] fixup! last-modified: use Bloom filters when available Toon Claes
2025-07-17 23:39 ` [PATCH v5 0/6] Introduce git-last-modified(1) command Taylor Blau
2025-07-22 15:35 ` Toon Claes
2025-07-30 17:59 ` Toon Claes
2025-07-31 7:45 ` Patrick Steinhardt
2025-07-30 17:55 ` [PATCH v6 0/4] " Toon Claes
2025-07-31 18:40 ` Junio C Hamano
2025-07-31 23:57 ` Junio C Hamano
2025-08-05 9:33 ` [PATCH v7 0/3] " Toon Claes
2025-08-05 14:34 ` Patrick Steinhardt
2025-08-05 16:21 ` Junio C Hamano
2025-08-05 16:34 ` Junio C Hamano
2025-08-05 16:55 ` Toon Claes
2025-08-05 17:20 ` Jean-Noël AVILA
2025-08-05 21:46 ` Junio C Hamano
2025-08-06 12:01 ` Toon Claes
2025-08-06 15:38 ` Junio C Hamano
2025-08-28 22:44 ` Junio C Hamano
2025-08-05 18:28 ` Junio C Hamano
2025-08-05 9:33 ` [PATCH v7 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
2025-08-05 9:33 ` [PATCH v7 2/3] t/perf: add last-modified perf script Toon Claes
2025-08-05 9:33 ` [PATCH v7 3/3] last-modified: use Bloom filters when available Toon Claes
2025-07-30 17:55 ` [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified Toon Claes
2025-07-31 6:42 ` Patrick Steinhardt
2025-08-01 16:22 ` Toon Claes
2025-08-01 17:09 ` Junio C Hamano
2025-08-04 6:34 ` Patrick Steinhardt
2025-08-04 17:14 ` Junio C Hamano
2025-08-05 5:35 ` Toon Claes
2025-08-01 20:34 ` Jean-Noël AVILA
2025-08-05 5:36 ` Toon Claes
2025-08-04 6:33 ` Patrick Steinhardt
2025-08-01 10:18 ` Christian Couder
2025-08-01 10:22 ` Patrick Steinhardt
2025-08-01 17:06 ` Junio C Hamano
2025-08-02 8:18 ` Christian Couder
2025-08-02 11:31 ` Christian Couder
2025-08-02 13:38 ` Christian Couder
2025-08-02 16:26 ` Junio C Hamano
2025-08-04 6:35 ` Patrick Steinhardt
2025-07-30 17:55 ` [PATCH v6 2/4] t/perf: add last-modified perf script Toon Claes
2025-07-30 17:55 ` [PATCH v6 3/4] commit-graph: export prepare_commit_graph() Toon Claes
2025-07-31 6:42 ` Patrick Steinhardt
2025-07-30 17:55 ` [PATCH v6 4/4] last-modified: use Bloom filters when available Toon Claes
2025-07-31 6:43 ` Patrick Steinhardt
2025-08-01 16:23 ` Toon Claes
2025-08-04 6:33 ` Patrick Steinhardt
2025-07-09 15:26 ` [PATCH v4 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
2025-07-09 15:26 ` [PATCH v4 2/3] t/perf: add last-modified perf script Toon Claes
2025-07-09 15:26 ` [PATCH v4 3/3] last-modified: use Bloom filters when available Toon Claes
2025-07-16 13:35 ` [PATCH v5 6/6] fixup! " Toon Claes
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aDWWdGVW90LpxZhH@pks.im \
--to=ps@pks.im \
--cc=avarab@gmail.com \
--cc=git@vger.kernel.org \
--cc=me@ttaylorr.com \
--cc=peff@peff.net \
--cc=stolee@gmail.com \
--cc=toon@iotcl.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).