Re: [PATCH] last-modified: implement faster algorithm

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Toon Claes <toon@iotcl.com>
To: Justin Tobler <jltobler@gmail.com>
Cc: git@vger.kernel.org, Karthik Nayak <karthik.188@gmail.com>,
	Taylor Blau <me@ttaylorr.com>
Subject: Re: [PATCH] last-modified: implement faster algorithm
Date: Fri, 17 Oct 2025 12:38:49 +0200	[thread overview]
Message-ID: <87sefhu81y.fsf@iotcl.com> (raw)
In-Reply-To: <kkcpsorsmyfdxlxnlzliuggsaehhrfvfphdse7aslvwsrbm64b@ylgl65mzot2z>

Justin Tobler <jltobler@gmail.com> writes:

> Ok so if I understand correctly, the optimization here is that as we
> perform the revision walk, the set of paths we look for at each commit
> monotonically decreases as changed paths are identified. Prior to this,
> we were always checking all paths for each commit even though a path may
> have already found the commit that last modified it.

Not only that, we only pass on paths to one parent. So either a path is
treesame to a parent, that means it was not changed in the current
commit, and then we only pass that path to that parent. If the commit
had multiple parents, the merge ignored versions from other parents.

If the path isn't treesame to any of the parents, we know the current
commit modified the path, and we associate that commit with that path.


> [snip]
>> As an added benefit, this implementation gives more correct results. For
>> example implementation in 'master' gives:
>
> s/implementation/the implementation/

Fair enough, but I also need to rephrase the "more correct" as pointed
out by Ben[1].

[1]: https://lore.kernel.org/git/CALnO6CBwuAdBFjESZSYZkChNdU9R17OXDc+CY=Z96QoACPgrpQ@mail.gmail.com/

>> +/* Remember to update object flag allocation in object.h */
>
> At first I was wondering if this is a leftover note, but it looks like
> it is just a reminder if the allocations change here.

Yeah, it seems to be the common pattern to do it like this. Didn't
question it further.

>> +#define PARENT1 (1u<<16) /* used instead of SEEN */
>> +#define PARENT2 (1u<<17) /* used instead of BOTTOM, BOUNDARY */
>
> Naive question: why do we use these object flags instead of the ones
> mentioned?

We set these bits in `struct object::flags`, because other systems might
be using other bits for their internal workings we want to avoid
colliding bit ranges. That's why users define the bits they use in
object.h, to have a general overview over who is using what and which
systems might get in trouble with eachother.

Bit 16 and 17 seems to be safe to use. We could have named these defines
SEEN and BOUNDARY, but in revision.h they have different bit positions,
so we name that PARENT1 and PARENT2, similar to what these bits are
named in other files.

>> +
>>  struct last_modified_entry {
>>  	struct hashmap_entry hashent;
>>  	struct object_id oid;
>>  	struct bloom_key key;
>> +	size_t diff_idx;
>>  	const char path[FLEX_ARRAY];
>>  };
>>  
>> @@ -37,13 +43,35 @@ static int last_modified_entry_hashcmp(const void *unused UNUSED,
>>  	return strcmp(ent1->path, path ? path : ent2->path);
>>  }
>>  
>> +/*
>> + * Hold a bitmap for each commit we're working with. Each bit represents a path
>> + * in `lm->all_paths`. Active bit means the path still needs to be dealt with.
>> + */
>> +define_commit_slab(commit_bitmaps, struct bitmap *);
>
> Why do we need a path bitmap for each commit? My understanding is that
> we check commits in a certain order as dictated by the priority queue.
> As soon as the commit that last-modified a path has been identified,
> wouldn't we always want the remaining commits processed to only check
> the outstanding paths?

Because we want to pass paths to parents that are treesame. Imagine a
tree with:

    bar-a.txt
    bar-b.txt
    baz-a.txt
    foo-a.txt
    foo-a.txt

When we're looking at a commit and it has the 'active_c' bitmap
`11111`. This commit has two parents, and we see bar-a.txt and bar-b.txt
are treesame on parent-0, we pass bitmap `11000` to that parent. If
foo-a.txt and foo-b.txt are treesame on the parent-1, we pass bitmap
`00011` to that parent. This leaves bitmap `00100` behind on the current
commit, and we associate baz-a.txt with that commit.

So a bit for a path would only be passed on to a single parent (or
none).

>>  struct last_modified {
>>  	struct hashmap paths;
>>  	struct rev_info rev;
>>  	bool recursive;
>>  	bool show_trees;
>> +
>> +	const char **all_paths;
>> +	size_t all_paths_nr;
>
> It is not immediately obvious to me why we have both `paths` and
> `all_paths`. From my understanding, `all_paths` is defining the path
> order for the bitmap. If this is the case, maybe it would be worth
> explaining in a comment?

Yeah, I'm not extremely happy about storing the information twice, but
I kept it for two reasons:

 * The hashmap is filled by populate_paths_from_revs() and it grows with
   every call of add_path_from_diff(). Afterwards we malloc() the
   all_paths buffer to the correct size. Re-allocing all_paths on every
   call of add_path_from_diff() would be clunky.

 * Reverse lookups: When we have a path, we can lookup the index in
   all_paths using path_idx(), which uses the hashmap to locate the
   hashmap entry (which has the `diff_idx`).

> [snip]
>>  static int last_modified_run(struct last_modified *lm)
>>  {
>> +	int max_count, queue_popped = 0;
>> +	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>> +	struct prio_queue not_queue = { compare_commits_by_gen_then_commit_date };
>> +	struct commit_list *list;
>>  	struct last_modified_callback_data data = { .lm = lm };
>>  
>>  	lm->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
>>  	lm->rev.diffopt.format_callback = last_modified_diff;
>>  	lm->rev.diffopt.format_callback_data = &data;
>> +	lm->rev.no_walk = 1;
>>  
>>  	prepare_revision_walk(&lm->rev);
>>  
>> -	while (hashmap_get_size(&lm->paths)) {
>> -		data.commit = get_revision(&lm->rev);
>> -		if (!data.commit)
>> -			BUG("paths remaining beyond boundary in last-modified");
>> +	max_count = lm->rev.max_count;
>> +
>> +	init_commit_bitmaps(&lm->commit_bitmaps);
>> +	lm->scratch = bitmap_word_alloc(lm->all_paths_nr);
>
> It looks like we initialize and release both `commit_bitmaps` and
> `scratch` here in `last_modified_run()`. Any reason we wouldn't want to
> move this to `last_modified_{init,release}()`?

For lm->commit_bitmaps, at the `cleanup:` label we call bitmap_free() on
`active_c`, but `lm->commit_bitmaps` still has a pointer to those
bitmaps. It feels a bit cleaner to me to not exit this function with
having any in `lm` pointing to invalid data.
(Also note: we don't call deep_clear_commit_bitmaps() because
the bitmaps are already freed, the commit_slab of bitmaps isn't)

`lm->scratch` we could move to `last_modified_{init,release}()` but I
decided to keep the bitmaps together.

>> +
>> +	/*
>> +	 * lm->rev.commits holds the set of boundary commits for our walk.
>
> Naive question: would it be more correct ...

Don't use "more correct", I've been told :-P

> ... to say that `rev.commits` is
> the list of starting commits? Boundary commits sounds like commits on
> the boundary of what we consider interesting/uninteresting which, from
> my understanding, is not the case here.

Why aren't these boundaries? Users can pass uninteresting commits to
this command with `--not aaaaaa` or `aaaaaa..bbbbbb`. So in this context
boundaries are used on both ends, and `queue` and `not_queue` are used
to store those.


>> +	 *
>> +	 * Loop through each such commit, and place it in the appropriate queue.
>> +	 */
>> +	for (list = lm->rev.commits; list; list = list->next) {
>> +		struct commit *c = list->item;
>> +
>> +		if (c->object.flags & BOTTOM) {
>> +			prio_queue_put(&not_queue, c);
>
> Ok so commits with the BOTTOM flag are at the boundary of the
> "interesting" commit graph. Thus they are not included in the search and
> added to the "not_queue".
>
>> +			c->object.flags |= PARENT2;
>
> What is the meaning behind the name PARENT2 in this context? From my
> understanding we are using this flag to denote a commit we are not
> interested in.

Yes, that's correct. It's used to avoid us adding it to the not_queue
more than once.

>
>> +		} else if (!(c->object.flags & PARENT1)) {
>
> Same question about PARENT1. It seems to be used to just denote commits
> that we have already encountered. The names confuse me a bit though.

Naming isn't great, but see my earlier comment.

>> +			/*
>> +			 * If the commit is a starting point (and hasn't been
>> +			 * seen yet), then initialize the set of interesting
>> +			 * paths, too.
>> +			 */
>> +			struct bitmap *active;
>> +
>> +			prio_queue_put(&queue, c);
>> +			c->object.flags |= PARENT1;
>
> We queue the commit and mark it as seen. Makes sense.
>
>> -		if (data.commit->object.flags & BOUNDARY) {
>> +			active = get_bitmap(lm, c);
>> +			for (size_t i = 0; i < lm->all_paths_nr; i++)
>> +				bitmap_set(active, i);
>
> Here we set up the path bitmap for the commit. At this point, all paths
> are still "active" and thus set accordingly. I'm still not entirely sure
> though if we really need a path bitmap per commit.

After my explanation above, let me know if you still have questions. I
must admit it took me a while to have it click in my brain too.

>> +		}
>> +	}
>> +
>> +	while (queue.nr) {
>> +		int parent_i;
>> +		struct commit_list *p;
>> +		struct commit *c = prio_queue_get(&queue);
>> +		struct bitmap *active_c = get_bitmap(lm, c);
>> +
>> +		if ((0 <= max_count && max_count < ++queue_popped) ||
>> +		    (c->object.flags & PARENT2)) {
>> +			/*
>> +			 * Either a boundary commit, or we have already seen too
>> +			 * many others. Either way, stop here.
>> +			 */
>> +			c->object.flags |= PARENT2 | BOUNDARY;
>> +			data.commit = c;
>>  			diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
>> -				      &data.commit->object.oid, "",
>> -				      &lm->rev.diffopt);
>> +				      &c->object.oid,
>> +				      "", &lm->rev.diffopt);
>>  			diff_flush(&lm->rev.diffopt);
>> +			goto cleanup;
>> +		}
>>  
>> -			break;
>> +		/*
>> +		 * Otherwise, make sure that 'c' isn't reachable from anything
>> +		 * in the '--not' queue.
>> +		 */
>> +		repo_parse_commit(lm->rev.repo, c);
>> +
>> +		while (not_queue.nr) {
>> +			struct commit_list *np;
>> +			struct commit *n = prio_queue_get(&not_queue);
>> +
>> +			repo_parse_commit(lm->rev.repo, n);
>> +
>> +			for (np = n->parents; np; np = np->next) {
>> +				if (!(np->item->object.flags & PARENT2)) {
>> +					prio_queue_put(&not_queue, np->item);
>> +					np->item->object.flags |= PARENT2;
>> +				}
>> +			}
>> +
>> +			if (commit_graph_generation(n) < commit_graph_generation(c))
>> +				break;
>
> If the generation number of 'c' is higher than 'n' we know 'c' cannot be
> an ancestor of 'n' and thus we continue on. Makes sense.

<3

>>  		}
>>  
>> -		if (!maybe_changed_path(lm, data.commit))
>> -			continue;
>> +		/*
>> +		 * Look at each parent and pass on each path that's TREESAME
>> +		 * with that parent. Stop early when no active paths remain.
>> +		 */
>> +		for (p = c->parents, parent_i = 0; p; p = p->next, parent_i++) {
>> +			process_parent(lm, &queue,
>> +				       c, active_c,
>> +				       p->item, parent_i);
>> +
>> +			if (bitmap_is_empty(active_c))
>> +				break;
>> +		}
>>  
>> -		log_tree_commit(&lm->rev, data.commit);
>> +		/*
>> +		 * Paths that remain active, or not TREESAME with any parent,
>> +		 * were changed by 'c'.
>> +		 */
>> +		if (!bitmap_is_empty(active_c))  {
>> +			data.commit = c;
>> +			for (size_t i = 0; i < lm->all_paths_nr; i++) {
>> +				if (bitmap_get(active_c, i))
>> +					mark_path(lm->all_paths[i], NULL, &data);
>> +			}
>> +		}
>> +
>> +cleanup:
>> +		bitmap_free(active_c);
>>  	}
>>  
>> +	if (hashmap_get_size(&lm->paths))
>> +		BUG("paths remaining beyond boundary in last-modified");
>> +
>> +	clear_prio_queue(&not_queue);
>> +	clear_prio_queue(&queue);
>> +	clear_commit_bitmaps(&lm->commit_bitmaps);
>> +	bitmap_free(lm->scratch);
>> +
>>  	return 0;
>>  }
> [snip]
>
> I've taken an initial look a this patch and have mostly questions so
> far. Just FYI, after applying this patch locally, the last-modified
> tests seem to be failing. The command seems to be segfaulting, but I
> haven't looked into it further.

Yeah, sorry about that. Taylor also noticed. In my final review I revived
some code I had removed at some point. I brought it back, but didn't do
a final test afterward. He provided a path to fix it[2] (which I also
had in my code at some point, but reverted).

[2]: https://lore.kernel.org/git/aPGB%2FFJtjDmyNLvG@nand.local/

-- 
Cheers,
Toon

next prev parent reply	other threads:[~2025-10-17 10:39 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-16  8:39 [PATCH] last-modified: implement faster algorithm Toon Claes
2025-10-16 18:51 ` Justin Tobler
2025-10-17 10:38   ` Toon Claes [this message]
2025-10-16 20:48 ` D. Ben Knoble
2025-10-17 10:45   ` Toon Claes
2025-10-16 23:38 ` Taylor Blau
2025-10-17  6:30   ` Jeff King
2025-10-17 14:54     ` Taylor Blau
2025-10-21  8:20       ` Jeff King
2025-10-17 12:07   ` Toon Claes
2025-10-21  9:04     ` Toon Claes
2025-10-23 23:59       ` Taylor Blau
2025-10-21 13:00     ` Toon Claes
2025-10-23 23:56     ` Taylor Blau
2025-10-27 15:48       ` Toon Claes
2025-10-17  6:37 ` Jeff King
2025-10-17 10:47   ` Toon Claes
2025-10-21 12:56 ` [PATCH v2] " Toon Claes
2025-10-21 17:52   ` Junio C Hamano
2025-10-22  0:26     ` Taylor Blau
2025-10-22  0:28       ` Taylor Blau
2025-10-22  3:48       ` Junio C Hamano
2025-10-24  0:01         ` Taylor Blau
2025-10-24  0:37           ` Junio C Hamano
2025-10-27 19:22             ` Taylor Blau
2025-10-29 13:01               ` Toon Claes
2025-10-23  8:01     ` Toon Claes
2025-10-23  7:50   ` [PATCH v3] " Toon Claes
2025-10-24  0:03     ` Taylor Blau
2025-10-27  7:03       ` Toon Claes
2025-11-03 15:47   ` [PATCH v4] " Toon Claes
2025-11-03 16:44     ` Junio C Hamano
2025-11-04 15:08       ` Toon Claes
2025-11-19 11:34     ` t8020-last-modified.sh failure on s390x (Re: [PATCH v4] last-modified: implement faster algorithm) Anders Kaseorg
2025-11-19 13:49       ` Kristoffer Haugsbakk
2025-11-19 20:06         ` Anders Kaseorg
2025-11-20  8:16           ` Jeff King
2025-11-28 16:45             ` Toon Claes
2025-11-28 17:35               ` Kristoffer Haugsbakk

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87sefhu81y.fsf@iotcl.com \
    --to=toon@iotcl.com \
    --cc=git@vger.kernel.org \
    --cc=jltobler@gmail.com \
    --cc=karthik.188@gmail.com \
    --cc=me@ttaylorr.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).