git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Taylor Blau <me@ttaylorr.com>,
	Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, christian.couder@gmail.com,
	gitster@pobox.com, johannes.schindelin@gmx.de,
	johncai86@gmail.com, jonathantanmy@google.com,
	karthik.188@gmail.com, kristofferhaugsbakk@fastmail.com,
	newren@gmail.com, peff@peff.net, ps@pks.im
Subject: Re: [PATCH v2 02/13] pack-objects: add --path-walk option
Date: Tue, 6 May 2025 15:39:31 -0400	[thread overview]
Message-ID: <052c534a-c2a4-4ff6-9130-4c1e96e1ef72@gmail.com> (raw)
In-Reply-To: <aBVTcvhW+lW0GLpQ@nand.local>

On 5/2/25 7:21 PM, Taylor Blau wrote:
> On Mon, Mar 24, 2025 at 03:22:38PM +0000, Derrick Stolee via GitGitGadget wrote:


>> +--path-walk::
>> +	By default, `git pack-objects` walks objects in an order that
>> +	presents trees and blobs in an order unrelated to the path they
>> +	appear relative to a commit's root tree. The `--path-walk` option
>> +	enables a different walking algorithm that organizes trees and
>> +	blobs by path. This has the potential to improve delta compression
>> +	especially in the presence of filenames that cause collisions in
>> +	Git's default name-hash algorithm. Due to changing how the objects
>> +	are walked, this option is not compatible with `--delta-islands`,
>> +	`--shallow`, or `--filter`. The `--use-bitmap-index` option will
>> +	be ignored in the presence of `--path-walk.`
>> +
> 
> This is perhaps a stylistic difference, but I don't think in general it
> is necessary to say "by default [...]" to explain some existing
> behavior, unless the alternative behavior enabled by the flag is
> impossible to understand without contrasting.
> 
> In this particular instance, I think that "--path-walk" can be
> explained in isolation rather than in contrast to the default behavior.
> 
> I would probably write something like:
> 
>      --path-walk::
>      Walk objects in batches grouped by a common path. This can improve
>      delta compression, especially in the case where filenames cause
>      collisions in the default name-hash algorithm.
> 
>      Incompatible with `--delta-islands`, `--shallow`, and `--filter`. If
>      provided, `--use-bitmap-index` will be ignored.

Sure. I'll give this a shot.

>> +static inline int is_oid_interesting(struct repository *repo,
>> +				     struct object_id *oid)
>> +{
>> +	struct object *o = lookup_object(repo, oid);
> 
> This function and its caller tripped me up a little bit while reading,
> but I was able to un-confuse myself. Here are some notes that I took
> while reading, which perhaps you can validate to tell me whether or not
> I'm on the right track.
> 
> Here we allow lookup_object() to return NULL, and then immediately
> return 0 if it does. But...
> 
>> +	return o && !(o->flags & UNINTERESTING);
>> +}
>> +
>> +static int add_objects_by_path(const char *path,
>> +			       struct oid_array *oids,
>> +			       enum object_type type,
>> +			       void *data)
>> +{
>> +	struct object_entry **delta_list;
>> +	size_t oe_start = to_pack.nr_objects;
>> +	size_t oe_end;
>> +	unsigned int sub_list_size;
> 
> Could you help me understand why sub_list_size is an unsigned int and
> not a size_t here?

This is based on a similar "list_size" used in ll_find_deltas(), but
since we are starting from scratch here it would be better to use
size_t.

>> +	unsigned int *processed = data;
>> +
>> +	/*
>> +	 * First, add all objects to the packing data, including the ones
>> +	 * marked UNINTERESTING (translated to 'exclude') as they can be
>> +	 * used as delta bases.
>> +	 */
>> +	for (size_t i = 0; i < oids->nr; i++) {
>> +		int exclude;
>> +		struct object_info oi = OBJECT_INFO_INIT;
>> +		struct object_id *oid = &oids->oid[i];
>> +
>> +		/* Skip objects that do not exist locally. */
>> +		if (exclude_promisor_objects &&
>> +		    oid_object_info_extended(the_repository, oid, &oi,
>> +					     OBJECT_INFO_FOR_PREFETCH) < 0)
>> +			continue;
>> +
>> +		exclude = !is_oid_interesting(the_repository, oid);
>> +
>> +		if (exclude && !thin)
>> +			continue;
> 
> ...here we only skip calling add_object_entry() if the object is
> excluded (which lookup_object() returning NULL would cause above) *and*
> we're not using thin packs. That makes sense, since if we have a missing
> object, but we're using thin packs, then it's OK to skip.
> 
> So I think all of this looks good. However, I find the interface here a
> little awkward. The function is called "is_oid_interesting()", but it
> determines that fact by checking whether or not the object has the
> UNINTERSTING bit set on its flag, and then negating the result. But then
> we negate it again here at the caller to obtain whether or not we should
> "exclude" the object.
> 
> I was wondering if there were other callers of is_oid_interesting() that
> don't negate its result. But it doesn't look like there are (at least in
> this patch). I wonder if it would be cleaner just to inline this instead
> of having the result negated at multiple levels.



>> +		add_object_entry(oid, type, path, exclude);
>> +	}
>> +
>> +	oe_end = to_pack.nr_objects;
>> +
>> +	/* We can skip delta calculations if it is a no-op. */
>> +	if (oe_end == oe_start || !window)
>> +		return 0;
>> +
>> +	sub_list_size = 0;
> 
> I'm nit-picking, but I would imagine that sub_list_nr is probably more
> conventional, but this is textbook bike-shedding ;-).

Sure.

>> +	ALLOC_ARRAY(delta_list, oe_end - oe_start);
> 
> ALLOC_ARRAY() will barf if oe_start > oe_end and we wrap around the
> range of size_t, but it might be nice to have a sanity check here just
> their equality.

I'll look for similar examples.

>> +	/*
>> +	 * Find delta bases among this list of objects that all match the same
>> +	 * path. This causes the delta compression to be interleaved in the
>> +	 * object walk, which can lead to confusing progress indicators. This is
>> +	 * also incompatible with threaded delta calculations. In the future,
>> +	 * consider creating a list of regions in the full to_pack.objects array
>> +	 * that could be picked up by the threaded delta computation.
>> +	 */
>> +	if (sub_list_size && window) {
> 
> Do we need to check that window is non-zero here? It looks like that
> check already happens above.

If we've already checked, then

>> +		QSORT(delta_list, sub_list_size, type_size_sort);
>> +		find_deltas(delta_list, &sub_list_size, window, depth, processed);
>> +	}
>> +
>> +	free(delta_list);
> 
> Is there a return path through this function that doesn't allocate
> delta_list? I think the answer is "no", but it might be a nice guard
> against future refactorings to initialize this field to NULL just in
> case.

OK.

>> +	return 0;
>> +}
>> +
>> +static void get_object_list_path_walk(struct rev_info *revs)
>> +{
>> +	struct path_walk_info info = PATH_WALK_INFO_INIT;
>> +	unsigned int processed = 0;
>> +
>> +	info.revs = revs;
>> +	info.path_fn = add_objects_by_path;
>> +	info.path_fn_data = &processed;
>> +	revs->tag_objects = 1;
> 
> What is the purpose of setting revs->tag_objects to 1 here? Do we need
> to restore it back to its original value before returning? A comment
> here would be useful, I think.



>> @@ -4237,7 +4340,7 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av)
>>
>>   	warn_on_object_refname_ambiguity = save_warning;
>>
>> -	if (use_bitmap_index && !get_object_list_from_bitmap(revs))
>> +	if (use_bitmap_index && !path_walk && !get_object_list_from_bitmap(revs))
> 
> I was going to suggest something like:
> 
>      if (path_walk && use_bitmap_index) {
>          warning(_("cannot use --use-bitmap-index with --path-walk, ignoring"));
>          use_bitmap_index = 0;
>      }
> 
> , which would avoid the need for this check here and insulate us against
> having to make similar changes to future conditionals that care about
> use_bitmap_index.
> 
> But it looks like such a check *does* already exist below, so I am not
> sure I understand why we need to check for path_walk here.

This is probably cruft from an earlier version of the check. Thanks.

>> +	if (path_walk && filter_options.choice) {
>> +		warning(_("cannot use --filter with --path-walk"));
>> +		path_walk = 0;
>> +	}
>> +	if (path_walk && use_delta_islands) {
>> +		warning(_("cannot use delta islands with --path-walk"));
>> +		path_walk = 0;
>> +	}
>> +	if (path_walk && shallow) {
>> +		warning(_("cannot use --shallow with --path-walk"));
>> +		path_walk = 0;
>> +	}
> 
> Here's a tiny idea I had while reading this code.
> 
>      if (path_walk) {
>          const char *option = NULL;
>          if (filter_options.choice)
>              option = "--filter";
>          else if (use_delta_islands)
>              option = "--delta-islands";
>          else if (shallow)
>              option = "--shallow";
> 
>          if (option) {
>              warning(_("cannot use %s with --path-walk"), option);
>              path_walk = 0;
>          }
>      }
> 
> This may be too DRY for your taste, and it does have the disadvantage of
> reducing the grep-ability of the error messages. I don't have a strong
> feeling about it one way or another, but I figured I'd at least write it
> down.
I like the reduction of work on the translators.

Thanks,
-Stolee


  reply	other threads:[~2025-05-06 19:39 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-10  1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
2025-03-12 21:01   ` Taylor Blau
2025-03-20 19:48     ` Derrick Stolee
2025-03-10  1:50 ` [PATCH 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
2025-03-12 21:14   ` Taylor Blau
2025-03-20 19:46     ` Derrick Stolee
2025-03-10  1:50 ` [PATCH 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
2025-03-12 21:14   ` Taylor Blau
2025-03-10  1:50 ` [PATCH 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
2025-03-12 21:21   ` Taylor Blau
2025-03-20 19:57     ` Derrick Stolee
2025-03-10  1:50 ` [PATCH 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
2025-03-10 17:28 ` [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Junio C Hamano
2025-03-12 20:47   ` Taylor Blau
2025-03-20 20:18     ` Derrick Stolee
2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
2025-05-02 22:48     ` Taylor Blau
2025-03-24 15:22   ` [PATCH v2 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-02 23:21     ` Taylor Blau
2025-05-06 19:39       ` Derrick Stolee [this message]
2025-05-16 15:27         ` Derrick Stolee
2025-03-24 15:22   ` [PATCH v2 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
2025-05-02 23:25     ` Taylor Blau
2025-03-24 15:22   ` [PATCH v2 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
2025-05-02 23:31     ` Taylor Blau
2025-05-06 19:43       ` Derrick Stolee
2025-03-24 15:22   ` [PATCH v2 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
2025-05-02 23:34     ` Taylor Blau
2025-05-16 15:32       ` Derrick Stolee
2025-03-24 15:22   ` [PATCH v2 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-02 23:38     ` Taylor Blau
2025-03-24 15:22   ` [PATCH v2 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
2025-05-02 23:42     ` Taylor Blau
2025-05-06 19:46       ` Derrick Stolee
2025-05-16 15:41         ` Derrick Stolee
2025-03-24 15:22   ` [PATCH v2 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
2025-05-07  0:58     ` Taylor Blau
2025-03-24 15:22   ` [PATCH v2 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
2025-05-07  1:14     ` Taylor Blau
2025-05-16 16:27       ` Derrick Stolee
2025-05-29  0:17         ` Taylor Blau
2025-03-24 15:22   ` [PATCH v2 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
2025-05-07  1:33     ` Taylor Blau
2025-03-24 15:22   ` [PATCH v2 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
2025-05-02 21:24   ` [PATCH v2 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Junio C Hamano
2025-05-02 22:45     ` Taylor Blau
2025-05-02 23:44       ` Taylor Blau
2025-05-07  1:35         ` Taylor Blau
2025-05-16 18:11   ` [PATCH v3 " Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
2025-05-16 18:12     ` [PATCH v3 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
2025-05-16 18:12     ` [PATCH v3 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
2025-05-16 18:12     ` [PATCH v3 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
2025-05-16 18:12     ` [PATCH v3 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
2025-05-29  0:20     ` [PATCH v3 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=052c534a-c2a4-4ff6-9130-4c1e96e1ef72@gmail.com \
    --to=stolee@gmail.com \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=johannes.schindelin@gmx.de \
    --cc=johncai86@gmail.com \
    --cc=jonathantanmy@google.com \
    --cc=karthik.188@gmail.com \
    --cc=kristofferhaugsbakk@fastmail.com \
    --cc=me@ttaylorr.com \
    --cc=newren@gmail.com \
    --cc=peff@peff.net \
    --cc=ps@pks.im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).