From: Taylor Blau <me@ttaylorr.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, christian.couder@gmail.com,
gitster@pobox.com, johannes.schindelin@gmx.de,
johncai86@gmail.com, jonathantanmy@google.com,
karthik.188@gmail.com, kristofferhaugsbakk@fastmail.com,
newren@gmail.com, peff@peff.net, ps@pks.im,
Derrick Stolee <stolee@gmail.com>
Subject: Re: [PATCH 02/13] pack-objects: add --path-walk option
Date: Wed, 12 Mar 2025 17:14:14 -0400 [thread overview]
Message-ID: <Z9H5JsicyLWXagxS@nand.local> (raw)
In-Reply-To: <9b31dc87bb61f4d73eced02a24baea58bc51aa5e.1741571455.git.gitgitgadget@gmail.com>
On Mon, Mar 10, 2025 at 01:50:44AM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <stolee@gmail.com>
>
> In order to more easily compute delta bases among objects that appear at
> the exact same path, add a --path-walk option to 'git pack-objects'.
>
> This option will use the path-walk API instead of the object walk given
> by the revision machinery. Since objects will be provided in batches
> representing a common path, those objects can be tested for delta bases
> immediately instead of waiting for a sort of the full object list by
> name-hash. This has multiple benefits, including avoiding collisions by
> name-hash.
>
> The objects marked as UNINTERESTING are included in these batches, so we
> are guaranteeing some locality to find good delta bases.
>
> After the individual passes are done on a per-path basis, the default
> name-hash is used to find other opportunistic delta bases that did not
> match exactly by the full path name.
>
> The current implementation performs delta calculations while walking
> objects, which is not ideal for a few reasons. First, this will cause
> the "Enumerating objects" phase to be much longer than usual. Second, it
> does not take advantage of threading during the path-scoped delta
> calculations. Even with this lack of threading, the path-walk option is
> sometimes faster than the usual approach. Future changes will refactor
> this code to allow for threading, but that complexity is deferred until
> later to keep this patch as simple as possible.
>
> This new walk is incompatible with some features and is ignored by
> others:
>
> * Object filters are not currently integrated with the path-walk API,
> such as sparse-checkout or tree depth. A blobless packfile could be
> integrated easily, but that is deferred for later.
>
> * Server-focused features such as delta islands, shallow packs, and
> using a bitmap index are incompatible with the path-walk API.
>
> * The path walk API is only compatible with the --revs option, not
> taking object lists or pack lists over stdin. These alternative ways
> to specify the objects currently ignores the --path-walk option
> without even a warning.
>
> Future changes will create performance tests that demonstrate the power
> of this approach.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
> Documentation/git-pack-objects.adoc | 13 +-
> Documentation/technical/api-path-walk.adoc | 1 +
> builtin/pack-objects.c | 147 +++++++++++++++++++--
> t/t5300-pack-object.sh | 15 +++
> 4 files changed, 166 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/git-pack-objects.adoc b/Documentation/git-pack-objects.adoc
> index 7f69ae4855f..7dbbe6d54d2 100644
> --- a/Documentation/git-pack-objects.adoc
> +++ b/Documentation/git-pack-objects.adoc
> @@ -16,7 +16,7 @@ SYNOPSIS
> [--cruft] [--cruft-expiration=<time>]
> [--stdout [--filter=<filter-spec>] | <base-name>]
> [--shallow] [--keep-true-parents] [--[no-]sparse]
> - [--name-hash-version=<n>] < <object-list>
> + [--name-hash-version=<n>] [--path-walk] < <object-list>
>
>
> DESCRIPTION
> @@ -375,6 +375,17 @@ many different directories. At the moment, this version is not allowed
> when writing reachability bitmap files with `--write-bitmap-index` and it
> will be automatically changed to version `1`.
>
> +--path-walk::
> + By default, `git pack-objects` walks objects in an order that
> + presents trees and blobs in an order unrelated to the path they
> + appear relative to a commit's root tree. The `--path-walk` option
> + enables a different walking algorithm that organizes trees and
> + blobs by path. This has the potential to improve delta compression
> + especially in the presence of filenames that cause collisions in
> + Git's default name-hash algorithm. Due to changing how the objects
> + are walked, this option is not compatible with `--delta-islands`,
> + `--shallow`, or `--filter`.
I think from reading further below that this feature is somewhat
incompatible with --use-bitmap-index, at least in the sense that we
implicitly disable the latter whenever we see the former. Would that be
worth mentioning here?
> +static int add_objects_by_path(const char *path,
> + struct oid_array *oids,
> + enum object_type type,
> + void *data)
> +{
> + struct object_entry **delta_list;
> + size_t oe_start = to_pack.nr_objects;
> + size_t oe_end;
> + unsigned int sub_list_size;
> + unsigned int *processed = data;
> +
> + /*
> + * First, add all objects to the packing data, including the ones
> + * marked UNINTERESTING (translated to 'exclude') as they can be
> + * used as delta bases.
> + */
> + for (size_t i = 0; i < oids->nr; i++) {
> + int exclude;
> + struct object_info oi = OBJECT_INFO_INIT;
> + struct object_id *oid = &oids->oid[i];
> +
> + /* Skip objects that do not exist locally. */
> + if (exclude_promisor_objects &&
> + oid_object_info_extended(the_repository, oid, &oi,
> + OBJECT_INFO_FOR_PREFETCH) < 0)
> + continue;
> +
> + exclude = !is_oid_interesting(the_repository, oid);
> +
> + if (exclude && !thin)
> + continue;
> +
> + add_object_entry(oid, type, path, exclude);
> + }
> +
> + oe_end = to_pack.nr_objects;
> +
> + /* We can skip delta calculations if it is a no-op. */
> + if (oe_end == oe_start || !window)
> + return 0;
> +
> + sub_list_size = 0;
> + ALLOC_ARRAY(delta_list, oe_end - oe_start);
Makes sense, and seems all reasonable.
> + for (size_t i = 0; i < oe_end - oe_start; i++) {
> + struct object_entry *entry = to_pack.objects + oe_start + i;
> +
> + if (!should_attempt_deltas(entry))
> + continue;
> +
> + delta_list[sub_list_size++] = entry;
> + }
> +
> + /*
> + * Find delta bases among this list of objects that all match the same
> + * path. This causes the delta compression to be interleaved in the
> + * object walk, which can lead to confusing progress indicators. This is
> + * also incompatible with threaded delta calculations. In the future,
> + * consider creating a list of regions in the full to_pack.objects array
> + * that could be picked up by the threaded delta computation.
> + */
> + if (sub_list_size && window) {
> + QSORT(delta_list, sub_list_size, type_size_sort);
> + find_deltas(delta_list, &sub_list_size, window, depth, processed);
> + }
Interesting. I like the "regions in to_pack.objects" idea as a way of
threading the delta selection process in the future.
Thanks,
Taylor
next prev parent reply other threads:[~2025-03-12 21:14 UTC|newest]
Thread overview: 75+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-03-10 1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
2025-03-10 1:50 ` [PATCH 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
2025-03-12 21:01 ` Taylor Blau
2025-03-20 19:48 ` Derrick Stolee
2025-03-10 1:50 ` [PATCH 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
2025-03-12 21:14 ` Taylor Blau [this message]
2025-03-20 19:46 ` Derrick Stolee
2025-03-10 1:50 ` [PATCH 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
2025-03-12 21:14 ` Taylor Blau
2025-03-10 1:50 ` [PATCH 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
2025-03-10 1:50 ` [PATCH 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
2025-03-10 1:50 ` [PATCH 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
2025-03-10 1:50 ` [PATCH 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
2025-03-10 1:50 ` [PATCH 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
2025-03-10 1:50 ` [PATCH 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
2025-03-10 1:50 ` [PATCH 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
2025-03-12 21:21 ` Taylor Blau
2025-03-20 19:57 ` Derrick Stolee
2025-03-10 1:50 ` [PATCH 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
2025-03-10 1:50 ` [PATCH 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
2025-03-10 1:50 ` [PATCH 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
2025-03-10 17:28 ` [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Junio C Hamano
2025-03-12 20:47 ` Taylor Blau
2025-03-20 20:18 ` Derrick Stolee
2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2025-03-24 15:22 ` [PATCH v2 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
2025-05-02 22:48 ` Taylor Blau
2025-03-24 15:22 ` [PATCH v2 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-02 23:21 ` Taylor Blau
2025-05-06 19:39 ` Derrick Stolee
2025-05-16 15:27 ` Derrick Stolee
2025-03-24 15:22 ` [PATCH v2 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
2025-03-24 15:22 ` [PATCH v2 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
2025-05-02 23:25 ` Taylor Blau
2025-03-24 15:22 ` [PATCH v2 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
2025-05-02 23:31 ` Taylor Blau
2025-05-06 19:43 ` Derrick Stolee
2025-03-24 15:22 ` [PATCH v2 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
2025-05-02 23:34 ` Taylor Blau
2025-05-16 15:32 ` Derrick Stolee
2025-03-24 15:22 ` [PATCH v2 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-02 23:38 ` Taylor Blau
2025-03-24 15:22 ` [PATCH v2 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
2025-05-02 23:42 ` Taylor Blau
2025-05-06 19:46 ` Derrick Stolee
2025-05-16 15:41 ` Derrick Stolee
2025-03-24 15:22 ` [PATCH v2 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
2025-05-07 0:58 ` Taylor Blau
2025-03-24 15:22 ` [PATCH v2 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
2025-05-07 1:14 ` Taylor Blau
2025-05-16 16:27 ` Derrick Stolee
2025-05-29 0:17 ` Taylor Blau
2025-03-24 15:22 ` [PATCH v2 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
2025-05-07 1:33 ` Taylor Blau
2025-03-24 15:22 ` [PATCH v2 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
2025-03-24 15:22 ` [PATCH v2 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
2025-05-02 21:24 ` [PATCH v2 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Junio C Hamano
2025-05-02 22:45 ` Taylor Blau
2025-05-02 23:44 ` Taylor Blau
2025-05-07 1:35 ` Taylor Blau
2025-05-16 18:11 ` [PATCH v3 " Derrick Stolee via GitGitGadget
2025-05-16 18:11 ` [PATCH v3 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
2025-05-16 18:11 ` [PATCH v3 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-16 18:11 ` [PATCH v3 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
2025-05-16 18:11 ` [PATCH v3 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
2025-05-16 18:11 ` [PATCH v3 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
2025-05-16 18:11 ` [PATCH v3 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
2025-05-16 18:11 ` [PATCH v3 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-16 18:11 ` [PATCH v3 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
2025-05-16 18:11 ` [PATCH v3 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
2025-05-16 18:12 ` [PATCH v3 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
2025-05-16 18:12 ` [PATCH v3 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
2025-05-16 18:12 ` [PATCH v3 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
2025-05-16 18:12 ` [PATCH v3 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
2025-05-29 0:20 ` [PATCH v3 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Taylor Blau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z9H5JsicyLWXagxS@nand.local \
--to=me@ttaylorr.com \
--cc=christian.couder@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=gitster@pobox.com \
--cc=johannes.schindelin@gmx.de \
--cc=johncai86@gmail.com \
--cc=jonathantanmy@google.com \
--cc=karthik.188@gmail.com \
--cc=kristofferhaugsbakk@fastmail.com \
--cc=newren@gmail.com \
--cc=peff@peff.net \
--cc=ps@pks.im \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).