git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Toon Claes <toon@iotcl.com>
Cc: git@vger.kernel.org, Karthik Nayak <karthik.188@gmail.com>,
	Justin Tobler <jltobler@gmail.com>,
	Derrick Stolee <stolee@gmail.com>
Subject: Re: [PATCH] last-modified: implement faster algorithm
Date: Thu, 23 Oct 2025 19:56:56 -0400	[thread overview]
Message-ID: <aPrAyJYvWtlvmiEx@nand.local> (raw)
In-Reply-To: <87jz0tu3yh.fsf@iotcl.com>

On Fri, Oct 17, 2025 at 02:07:18PM +0200, Toon Claes wrote:
> > Regardless of how you handle the above, I think that the commit slab
> > name here is a little generic. I guess it's OK since this is only
> > visible within this compilation unit, but perhaps something like
> > "active_paths_bitmap" would be more descriptive.
>
> I struggled a lot naming this thing, so I'm open to suggestions.

I think calling it "active_paths_bitmap" or similar conveys that this is
somehow specific to "active paths", and since the slab is static within
the last-modified builtin, I think that's fine. It's a little word-y, so
if you have better ideas with fewer characters, I'm open to just about
anything.

> > Likewise, I wonder if we should have elemtype here be just 'struct
> > bitmap'. Unfortunately I don't think the EWAH code has a function like:
> >
> >     void bitmap_init(struct bitmap *);
> >
> > and only has ones that allocate for us. So we may consider adding one,
> > or creating a dummy bitmap and copying its contents, or otherwise.
> >
> >>  struct last_modified {
> >>  	struct hashmap paths;
> >>  	struct rev_info rev;
> >>  	bool recursive;
> >>  	bool show_trees;
> >> +
> >> +	const char **all_paths;
> >> +	size_t all_paths_nr;
> >
> > I wonder if all_paths should be a strvec here? I think that this code
> > was all written when the type was called argv_array (hilariously, that
> > change took place towards the end of July, 2020, and the --go-faster
> > code where this patch came from was written just a couple of weeks
> > earlier.)
>
> Ahha, that might be a good idea. This might allow us to get rid of the
> hashmap, which stops us from storing the paths twice. Not sure what the
> impact on the performance would be, because the hashmap now is valuable
> for path_idx() lookups.

Looking at this a little further, I don't think I consider this worth
doing. Using a strvec here is awkward since last_modified_init() really
wants to assign lm->all_paths based on each entry's diff_idx, which is
not what strvec.h is designed for.

You *could* use a string_list, and shove a pointer to the
last_modified_entry struct in the ->util field, but that is also
inefficient since we remove paths from our hashmap as we mark them, and
repeating that in a string_list would be wasteful.

> > In the GitHub version of this patch, we pass all active paths to the
> > parent, assign the PARENT1 flag if it doesn't already have it, and put
> > it in the queue as well.
> >
> > In your version, we'd skip past the next for-loop, and do the same
> > pass-to-parent dance below, along with inserting the parent into the
> > prio queue.
>
> This is the "shortcut" I'm mentioning in my cover letter. In my testing
> it seemed it didn't provide any performance gains to keep it. I consider
> less code better code, so I left it out.

Fair enough. I don't think that less code here results in a lack of
clarity, so this seems alright to me. But in general I am not sure that
I always agree that less code is better ;-).

> > So I think that this is all functionally equivalent, but I had to work
> > through a little bit of the details here, mostly since I haven't looked
> > at or thought about this code in many years ;-).
> >
> >>  static int last_modified_run(struct last_modified *lm)
> >>  {
> >> +	int max_count, queue_popped = 0;
> >> +	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
> >> +	struct prio_queue not_queue = { compare_commits_by_gen_then_commit_date };
> >> +	struct commit_list *list;
> >>  	struct last_modified_callback_data data = { .lm = lm };
> >>
> >>  	lm->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
> >>  	lm->rev.diffopt.format_callback = last_modified_diff;
> >>  	lm->rev.diffopt.format_callback_data = &data;
> >> +	lm->rev.no_walk = 1;
> >
> > This one is new relative to the original patch. Why set no_walk here?
>
> This comes from
> https://github.com/ttaylorr/git/commit/e8ea49705873d28f64b815bd00d14bdf6d48ca4d
>
> Well, it basically squashes various commits together. There are various
> commits doing different things here. I don't think it's valuable for
> anyone to see the full history of the iterations at GitHub, that's why I
> squashed it in.
>
> Would you consider it better to not set `no_walk`?

Hah, I forgot that we added this one later on. Adding it here in this
patch makes sense to me.

Thanks,
Taylor

  parent reply	other threads:[~2025-10-23 23:56 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-16  8:39 [PATCH] last-modified: implement faster algorithm Toon Claes
2025-10-16 18:51 ` Justin Tobler
2025-10-17 10:38   ` Toon Claes
2025-10-16 20:48 ` D. Ben Knoble
2025-10-17 10:45   ` Toon Claes
2025-10-16 23:38 ` Taylor Blau
2025-10-17  6:30   ` Jeff King
2025-10-17 14:54     ` Taylor Blau
2025-10-21  8:20       ` Jeff King
2025-10-17 12:07   ` Toon Claes
2025-10-21  9:04     ` Toon Claes
2025-10-23 23:59       ` Taylor Blau
2025-10-21 13:00     ` Toon Claes
2025-10-23 23:56     ` Taylor Blau [this message]
2025-10-27 15:48       ` Toon Claes
2025-10-17  6:37 ` Jeff King
2025-10-17 10:47   ` Toon Claes
2025-10-21 12:56 ` [PATCH v2] " Toon Claes
2025-10-21 17:52   ` Junio C Hamano
2025-10-22  0:26     ` Taylor Blau
2025-10-22  0:28       ` Taylor Blau
2025-10-22  3:48       ` Junio C Hamano
2025-10-24  0:01         ` Taylor Blau
2025-10-24  0:37           ` Junio C Hamano
2025-10-27 19:22             ` Taylor Blau
2025-10-29 13:01               ` Toon Claes
2025-10-23  8:01     ` Toon Claes
2025-10-23  7:50   ` [PATCH v3] " Toon Claes
2025-10-24  0:03     ` Taylor Blau
2025-10-27  7:03       ` Toon Claes
2025-11-03 15:47   ` [PATCH v4] " Toon Claes
2025-11-03 16:44     ` Junio C Hamano
2025-11-04 15:08       ` Toon Claes
2025-11-19 11:34     ` t8020-last-modified.sh failure on s390x (Re: [PATCH v4] last-modified: implement faster algorithm) Anders Kaseorg
2025-11-19 13:49       ` Kristoffer Haugsbakk
2025-11-19 20:06         ` Anders Kaseorg
2025-11-20  8:16           ` Jeff King
2025-11-28 16:45             ` Toon Claes
2025-11-28 17:35               ` Kristoffer Haugsbakk

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aPrAyJYvWtlvmiEx@nand.local \
    --to=me@ttaylorr.com \
    --cc=git@vger.kernel.org \
    --cc=jltobler@gmail.com \
    --cc=karthik.188@gmail.com \
    --cc=stolee@gmail.com \
    --cc=toon@iotcl.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).