* Re: [PATCH 05/24] midx: implement `DISP` chunk
From: Taylor Blau @ 2023-11-30 19:27 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <ZWhhhaXiF_wC3er7@tanuki>
On Thu, Nov 30, 2023 at 11:18:45AM +0100, Patrick Steinhardt wrote:
> > diff --git a/Documentation/gitformat-pack.txt b/Documentation/gitformat-pack.txt
> > index 9fcb29a9c8..658682ddd5 100644
> > --- a/Documentation/gitformat-pack.txt
> > +++ b/Documentation/gitformat-pack.txt
> > @@ -396,6 +396,22 @@ CHUNK DATA:
> > is padded at the end with between 0 and 3 NUL bytes to make the
> > chunk size a multiple of 4 bytes.
> >
> > + Disjoint Packfiles (ID: {'D', 'I', 'S', 'P'})
> > + Stores a table of three 4-byte unsigned integers in network order.
> > + Each table entry corresponds to a single pack (in the order that
> > + they appear above in the `PNAM` chunk). The values for each table
> > + entry are as follows:
> > + - The first bit position (in psuedo-pack order, see below) to
>
> s/psuedo/pseudo/
Good catch, thanks. Not sure how that escaped my spell-checker...
> > +=== `DISP` chunk and disjoint packs
> > +
> > +The Disjoint Packfiles (`DISP`) chunk encodes additional information
> > +about the objects in the multi-pack index's reachability bitmap. Recall
> > +that objects from the MIDX are arranged in "pseudo-pack" order (see:
>
> The colon feels a bit out-of-place here, so: s/see:/see/
Thanks, I'll fix that up.
> > +above) for reachability bitmaps.
> > +
> > +From the example above, suppose we have packs "a", "b", and "c", with
> > +10, 15, and 20 objects, respectively. In pseudo-pack order, those would
> > +be arranged as follows:
> > +
> > + |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|
> > +
> > +When working with single-pack bitmaps (or, equivalently, multi-pack
> > +reachability bitmaps without any packs marked as disjoint),
> > +linkgit:git-pack-objects[1] performs ``verbatim'' reuse, attempting to
> > +reuse chunks of the existing packfile instead of adding objects to the
> > +packing list.
>
> I'm not sure I full understand this paragraph. In the context of a
> single pack bitmap it's clear enough. But I stumbled over the MIDX case,
> because here we potentially have multiple packfiles, so it's not exactly
> clear to me what you refer to with "the existing packfile" in that case.
> I'd think that we perform verbatim reuse of the preferred packfile,
> right? If so, we might want to make that a bit more explicit.
Yep, sorry, I can see how that would be confusing. Since we're talking
about the existing behavior at this point in the series (before
multi-pack reuse is implemented), I changed this to:
"reuse chunks of the bitmapped or preferred packfile [...]"
Thanks for carefully reading and spotting my errors ;-).
> > +object. This introduces an additional constraint over the set of packs
> > +we may want to reuse. The most straightforward approach is to mandate
> > +that the set of packs is disjoint with respect to the set of objects
> > +contained in each pack. In other words, for each object `o` in the union
> > +of all objects stored by the disjoint set of packs, `o` is contained in
> > +exactly one pack from the disjoint set.
>
> Is this a property that usually holds for our normal housekeeping, or
> does it require careful managing by the user/admin? How about geometric
> repacking?
At this point in the series, it would require careful managing to ensure
that this is the case. In practice MIDX'd packs generated with a
geometric repack are mostly disjoint, but definitely not guaranteed to
be.
Further down in this series we'll introduce new options to generate
packs which are guaranteed to be disjoint with respect to the
currently-marked set of packs in the DISP chunk.
> > @@ -764,14 +807,22 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
> > * Take only the first duplicate.
> > */
> > for (cur_object = 0; cur_object < fanout.nr; cur_object++) {
> > - if (cur_object && oideq(&fanout.entries[cur_object - 1].oid,
> > - &fanout.entries[cur_object].oid))
> > - continue;
> > + struct pack_midx_entry *ours = &fanout.entries[cur_object];
> > + if (cur_object) {
> > + struct pack_midx_entry *prev = &fanout.entries[cur_object - 1];
> > + if (oideq(&prev->oid, &ours->oid)) {
> > + if (prev->disjoint && ours->disjoint)
> > + die(_("duplicate object '%s' among disjoint packs '%s', '%s'"),
> > + oid_to_hex(&prev->oid),
> > + info[prev->pack_int_id].pack_name,
> > + info[ours->pack_int_id].pack_name);
>
> Shouldn't we die if `prev->disjoint || ours->disjoint` instead of `&&`?
> Even if one of the packs isn't marked as disjoint, it's still wrong if
> the other one is and one of its objects exists in multiple packs.
>
> Or am I misunderstanding, and we only guarantee the disjoint property
> across packfiles that are actually marked as such?
Right, we only guarantee disjointed-ness among the set of packs that are
marked disjoint. It's fine for the same object to appear in a disjoint
and non-disjoint pack, and for both of those packs to end up in the
MIDX. But that is only because we'll use the disjoint copy in our
bitmap.
If there were two packs that are marked as supposedly disjoint, but
contain at least one duplicate of an object, then we will reject those
packs as non-disjoint.
Thanks,
Taylor
^ permalink raw reply
* Re: [PATCH 04/24] midx: factor out `fill_pack_info()`
From: Taylor Blau @ 2023-11-30 19:19 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <ZWhhfXoovgYzIYE0@tanuki>
On Thu, Nov 30, 2023 at 11:18:37AM +0100, Patrick Steinhardt wrote:
> On Tue, Nov 28, 2023 at 02:08:05PM -0500, Taylor Blau wrote:
> > When selecting which packfiles will be written while generating a MIDX,
> > the MIDX internals fill out a 'struct pack_info' with various pieces of
> > book-keeping.
> >
> > Instead of filling out each field of the `pack_info` structure
> > individually in each of the two spots that modify the array of such
> > structures (`ctx->info`), extract a common routine that does this for
> > us.
> >
> > This reduces the code duplication by a modest amount. But more
> > importantly, it zero-initializes the structure before assigning values
> > into it. This hardens us for a future change which will add additional
> > fields to this structure which (until this patch) was not
> > zero-initialized.
> >
> > As a result, any new fields added to the `pack_info` structure need only
> > be updated in a single location, instead of at each spot within midx.c.
> >
> > There are no functional changes in this patch.
> >
> > Signed-off-by: Taylor Blau <me@ttaylorr.com>
> > ---
> > midx.c | 35 +++++++++++++++++++----------------
> > 1 file changed, 19 insertions(+), 16 deletions(-)
> >
> > diff --git a/midx.c b/midx.c
> > index 3b727dc633..591b3c636e 100644
> > --- a/midx.c
> > +++ b/midx.c
> > @@ -464,6 +464,17 @@ struct pack_info {
> > unsigned expired : 1;
> > };
> >
> > +static void fill_pack_info(struct pack_info *info,
> > + struct packed_git *p, char *pack_name,
> > + uint32_t orig_pack_int_id)
> > +{
> > + memset(info, 0, sizeof(struct pack_info));
> > +
> > + info->orig_pack_int_id = orig_pack_int_id;
> > + info->pack_name = pack_name;
> > + info->p = p;
> > +}
>
> Nit: all callers manually call `xstrdup(pack_name)` and pass that to
> `fill_pack_info()`. We could consider doing this in here instead so that
> ownership of the string becomes a tad clearer.
That's a great idea. I think we'd also want to mark the pack_name
argument as const, not just because xstrdup() requires it, but also
because it communicates the ownership more clearly.
I'll squash something like this in:
--- >8 ---
diff --git a/midx.c b/midx.c
index b8b3f41024..6fb5e237b7 100644
--- a/midx.c
+++ b/midx.c
@@ -465,13 +465,13 @@ struct pack_info {
};
static void fill_pack_info(struct pack_info *info,
- struct packed_git *p, char *pack_name,
+ struct packed_git *p, const char *pack_name,
uint32_t orig_pack_int_id)
{
memset(info, 0, sizeof(struct pack_info));
info->orig_pack_int_id = orig_pack_int_id;
- info->pack_name = pack_name;
+ info->pack_name = xstrdup(pack_name);
info->p = p;
}
@@ -557,8 +557,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
return;
}
- fill_pack_info(&ctx->info[ctx->nr], p, xstrdup(file_name),
- ctx->nr);
+ fill_pack_info(&ctx->info[ctx->nr], p, file_name, ctx->nr);
ctx->nr++;
}
}
@@ -1336,7 +1335,7 @@ static int write_midx_internal(const char *object_dir,
}
fill_pack_info(&ctx.info[ctx.nr++], ctx.m->packs[i],
- xstrdup(ctx.m->pack_names[i]), i);
+ ctx.m->pack_names[i], i);
}
}
--- 8< ---
> > - if (open_pack_index(ctx->info[ctx->nr].p)) {
> > + if (open_pack_index(p)) {
> > warning(_("failed to open pack-index '%s'"),
> > full_path);
> > close_pack(ctx->info[ctx->nr].p);
>
> Isn't `ctx->info[ctx->nr].p` still uninitialized at this point?
Great catch, thank you!
> > @@ -1330,10 +1333,10 @@ static int write_midx_internal(const char *object_dir,
> > if (open_pack_index(ctx.m->packs[i]))
> > die(_("could not open index for %s"),
> > ctx.m->packs[i]->pack_name);
> > - ctx.info[ctx.nr].p = ctx.m->packs[i];
>
> Just to make sure I'm not missing anything, but this assignment here was
> basically redundant before this patch already, right?
I think that's right, but in either case we're assigning the pack once
at the end of each loop iteration via a single call to fill_pack_info().
Since we're using ctx.m->packs[i] in both places (after a call to
prepare_midx_pack()), we should be OK here.
Thanks,
Taylor
^ permalink raw reply related
* Re: [PATCH 02/24] pack-bitmap-write: deep-clear the `bb_commit` slab
From: Taylor Blau @ 2023-11-30 19:11 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <ZWhhd5gLzYwPEgBl@tanuki>
On Thu, Nov 30, 2023 at 11:18:31AM +0100, Patrick Steinhardt wrote:
> > diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
> > index f4ecdf8b0e..dd3a415b9d 100644
> > --- a/pack-bitmap-write.c
> > +++ b/pack-bitmap-write.c
> > @@ -198,6 +198,13 @@ struct bb_commit {
> > unsigned idx; /* within selected array */
> > };
> >
> > +static void clear_bb_commit(struct bb_commit *commit)
> > +{
> > + free(commit->reverse_edges);
>
> I'd have expected to see `free_commit_list()` here instead of a simple
> free. Is there any reason why we don't use it?
Thanks for spotting an oversight on my part. We should definitely be
using free_commit_list() here instead of a bare free() to avoid leaking
the tail.
Thanks,
Taylor
^ permalink raw reply
* Re: [PATCH 01/24] pack-objects: free packing_data in more places
From: Taylor Blau @ 2023-11-30 19:08 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <ZWhhcv25B5nUNoyu@tanuki>
On Thu, Nov 30, 2023 at 11:18:26AM +0100, Patrick Steinhardt wrote:
> > diff --git a/pack-objects.c b/pack-objects.c
> > index f403ca6986..1c7bedcc94 100644
> > --- a/pack-objects.c
> > +++ b/pack-objects.c
> > @@ -151,6 +151,21 @@ void prepare_packing_data(struct repository *r, struct packing_data *pdata)
> > init_recursive_mutex(&pdata->odb_lock);
> > }
> >
> > +void free_packing_data(struct packing_data *pdata)
>
> Nit: shouldn't this rather be called `clear_packing_data`? `free` to me
> indicates that the data structure itself will be free'd, as well, which
> is not the case.
Thanks, that's a good suggestion. I've made the changes locally and will
include it in the subsequent round(s) ;-).
Thanks,
Taylor
^ permalink raw reply
* Re: Consider dropping the decimal places for KiB/s 52.00 KiB/s
From: Taylor Blau @ 2023-11-30 19:07 UTC (permalink / raw)
To: Jonny Grant; +Cc: git
In-Reply-To: <637be919-0b04-4e5c-8f2e-43340521e6d1@jguk.org>
On Thu, Nov 30, 2023 at 06:11:57PM +0000, Jonny Grant wrote:
> Hello
>
> May I suggest taking off the .00 KiB/s suffix, has that been
> considered? As the decimal places don't appear to change, they're
> stuck on .00.
I wonder if you have a throttled connection that is locked to 52KiB/s
exactly. The progress code that generates the throughput is in
progress.c::display_throughput(), which computes the rate. It's computed
in bytes/misec, and then passed to throughput_string() (really,
`strbuf_humanise_rate()`), which formats it appropriately.
If you're in the KiB range, it will print the decimal component, which
is:
((bytes & ((1<<10)-1)) * 100) >> 10
> $ git clone git://gcc.gnu.org/git/gcc.git git_1
> Cloning into 'git_1'...
> remote: Enumerating objects: 2949348, done.
> remote: Counting objects: 100% (209238/209238), done.
> remote: Compressing objects: 100% (14579/14579), done.
> Receiving objects: 7% (210878/2949348), 76.33 MiB | 52.00 KiB/s
On my machine:
$ git.compile clone git://gcc.gnu.org/git/gcc.git
[...]
Receiving objects: 11% (342176/2949348), 108.09 MiB | 24.01 MiB/s
I suppose we could consider dropping the decimal component if it's a
round number, but I think that it may produce awkward flickering if the
rate oscillates between a round number and a non-round number.
Thanks,
Taylor
^ permalink raw reply
* Re: [PATCH v2 1/1] t2400: avoid using pipes
From: Christian Couder @ 2023-11-30 18:16 UTC (permalink / raw)
To: Achu Luma; +Cc: git
In-Reply-To: <20231130165429.2595-2-ach.lumap@gmail.com>
Hi Luma,
On Thu, Nov 30, 2023 at 6:37 PM Achu Luma <ach.lumap@gmail.com> wrote:
>
> The exit code of the preceding command in a pipe is disregarded,
> so it's advisable to refrain from relying on it. Instead, by
> saving the output of a Git command to a file, we gain the
> ability to examine the exit codes of both commands separately.
>
> Signed-off-by: achluma <ach.lumap@gmail.com>
I think the issue with merging your patch (in
https://lore.kernel.org/git/xmqqedibzgi1.fsf@gitster.g/) was that this
"Signed-off-by: ..." line didn't show your full real name and didn't
match your name in your email address.
Assuming that "Achu Luma" is your full real name, you should replace
"achluma" with "Achu Luma" in the "Signed-off-by: ..." line.
Also it's better not to send a cover letter patch like
https://lore.kernel.org/git/20231130165429.2595-1-ach.lumap@gmail.com/
with no content for small patches like this.
When you resend, please also make sure to use [Outreachy] in the patch
subject and to increment the version number of the patch, using for
example "[PATCH v3]".
It would be nice too if after the line starting with --- below, you
could describe in a few lines the changes in the new version of the
patch compared to the previous version.
> ---
Here (after the line starting with --- above) is the place where you
can tell what changed in the patch compared to the previous version.
Note that when there is a cover letter patch, it's better to talk
about changes in the new version in the cover letter, but I dont think
it's worth sending a cover letter patch.
Thanks,
Christian.
> t/t2400-worktree-add.sh | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/t/t2400-worktree-add.sh b/t/t2400-worktree-add.sh
> index df4aff7825..7ead05bb98 100755
> --- a/t/t2400-worktree-add.sh
> +++ b/t/t2400-worktree-add.sh
> @@ -468,7 +468,8 @@ test_expect_success 'put a worktree under rebase' '
> cd under-rebase &&
> set_fake_editor &&
> FAKE_LINES="edit 1" git rebase -i HEAD^ &&
> - git worktree list | grep "under-rebase.*detached HEAD"
> + git worktree list >actual &&
> + grep "under-rebase.*detached HEAD" actual
> )
> '
>
> @@ -509,7 +510,8 @@ test_expect_success 'checkout a branch under bisect' '
> git bisect start &&
> git bisect bad &&
> git bisect good HEAD~2 &&
> - git worktree list | grep "under-bisect.*detached HEAD" &&
> + git worktree list >actual &&
> + grep "under-bisect.*detached HEAD" actual &&
> test_must_fail git worktree add new-bisect under-bisect &&
> ! test -d new-bisect
> )
> --
> 2.41.0.windows.1
>
>
^ permalink raw reply
* Consider dropping the decimal places for KiB/s 52.00 KiB/s
From: Jonny Grant @ 2023-11-30 18:11 UTC (permalink / raw)
To: git
Hello
May I suggest taking off the .00 KiB/s suffix, has that been considered? As the decimal places don't appear to change, they're stuck on .00.
I noticed the KiB/s is neatly rounded to the kibibyte.
Apologies I'm not using latest release.
git version 2.40.1
$ git clone git://gcc.gnu.org/git/gcc.git git_1
Cloning into 'git_1'...
remote: Enumerating objects: 2949348, done.
remote: Counting objects: 100% (209238/209238), done.
remote: Compressing objects: 100% (14579/14579), done.
Receiving objects: 7% (210878/2949348), 76.33 MiB | 52.00 KiB/s
Meanwhile I'll keep waiting for this slow server.
Kind regards, Jonny
^ permalink raw reply
* [PATCH v2 1/1] t2400: avoid using pipes
From: Achu Luma @ 2023-11-30 16:54 UTC (permalink / raw)
To: git; +Cc: Achu Luma
In-Reply-To: <20231130165429.2595-1-ach.lumap@gmail.com>
The exit code of the preceding command in a pipe is disregarded,
so it's advisable to refrain from relying on it. Instead, by
saving the output of a Git command to a file, we gain the
ability to examine the exit codes of both commands separately.
Signed-off-by: achluma <ach.lumap@gmail.com>
---
t/t2400-worktree-add.sh | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/t/t2400-worktree-add.sh b/t/t2400-worktree-add.sh
index df4aff7825..7ead05bb98 100755
--- a/t/t2400-worktree-add.sh
+++ b/t/t2400-worktree-add.sh
@@ -468,7 +468,8 @@ test_expect_success 'put a worktree under rebase' '
cd under-rebase &&
set_fake_editor &&
FAKE_LINES="edit 1" git rebase -i HEAD^ &&
- git worktree list | grep "under-rebase.*detached HEAD"
+ git worktree list >actual &&
+ grep "under-rebase.*detached HEAD" actual
)
'
@@ -509,7 +510,8 @@ test_expect_success 'checkout a branch under bisect' '
git bisect start &&
git bisect bad &&
git bisect good HEAD~2 &&
- git worktree list | grep "under-bisect.*detached HEAD" &&
+ git worktree list >actual &&
+ grep "under-bisect.*detached HEAD" actual &&
test_must_fail git worktree add new-bisect under-bisect &&
! test -d new-bisect
)
--
2.41.0.windows.1
^ permalink raw reply related
* Re: [PATCH 1/4] wt-status: read HEAD and ORIG_HEAD via the refdb
From: Taylor Blau @ 2023-11-30 17:36 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: git, hanwenn
In-Reply-To: <ZWg8-HW6yCa6-tnS@tanuki>
On Thu, Nov 30, 2023 at 08:42:48AM +0100, Patrick Steinhardt wrote:
> > Are there any other gotchas that we should be thinking about?
>
> Not that I can think of. As you say, a repository with malformed HEAD
> will run into other problems anyway. And `read_ref_full()` would return
> errors if these refs were malformed, which would cause us to exit early
> from anyway. So unless "rebase-merge/amend" and "rebase-merge/orig-head"
> contained the same kind of garbage we'd retain the same behaviour as
> before, and that shouldn't really be happening.
>
> One interesting bit is that we don't set `RESOLVE_REF_READING`, so
> `read_ref_full()` may return successfully even if the ref doesn't exist.
> But in practice this is fine given that the resulting oid would be
> cleared in that case.
Thanks for thinking through these. I agree with your reasoning and think
that this is fine as-is.
(Off-topic, but can you please trim your replies to only include the
quoted parts/context that you're replying to?)
Thanks,
Taylor
^ permalink raw reply
* Re: [PATCH 0/4] refs: improve handling of special refs
From: Taylor Blau @ 2023-11-30 17:35 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: git, hanwenn
In-Reply-To: <ZWg97gfn80jeB3-_@tanuki>
On Thu, Nov 30, 2023 at 08:46:54AM +0100, Patrick Steinhardt wrote:
> On Wed, Nov 29, 2023 at 05:14:39PM -0500, Taylor Blau wrote:
> > These all look pretty good to me. I had a few minor questions on the
> > first three patches, but I don't think they necessarily require a
> > reroll.
>
> I agree that none of the comments really require a reroll, but I'll
> address them if there will be another iteration of this patch series.
>
> Thanks for your review!
No problem on either. I doubt that there will be another iteration of
this series since it is already good, so no need to worry too much about
these changes.
Thanks,
Taylor
^ permalink raw reply
* [PATCH v2 0/1] *** Avoid using Pipes ***
From: Achu Luma @ 2023-11-30 16:54 UTC (permalink / raw)
To: git; +Cc: Achu Luma
In-Reply-To: <20231003174853.1732-1-ach.lumap@gmail.com>
*** BLURB HERE ***
Achu Luma (1):
t2400: avoid using pipes
t/t2400-worktree-add.sh | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
base-commit: d0e8084c65cbf949038ae4cc344ac2c2efd77415
--
2.41.0.windows.1
^ permalink raw reply
* [PATCH v2] hooks--pre-commit: detect non-ASCII when renaming
From: Julian Prein via GitGitGadget @ 2023-11-30 16:13 UTC (permalink / raw)
To: git; +Cc: Julian Prein, Julian Prein
In-Reply-To: <pull.1291.git.git.1657837073372.gitgitgadget@gmail.com>
From: Julian Prein <druckdev@protonmail.com>
When diff.renames is turned on, the diff-filter will not return renamed
files (or copied ones with diff.renames=copy) and potential non-ASCII
characters would not be caught by this hook.
Use the plumbing command diff-index instead of the porcelain one to not
be affected by diff.rename.
Signed-off-by: Julian Prein <druckdev@protonmail.com>
---
hooks--pre-commit: detect non-ASCII when renaming
A bit later than I expected, but here is v2.
Changes since v1:
* Switched to using diff-index and back to just the A filter as
suggested by Junio C Hamano
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1291%2Fdruckdev%2Fpre-commit-renames-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1291/druckdev/pre-commit-renames-v2
Pull-Request: https://github.com/git/git/pull/1291
Range-diff vs v1:
1: 101f327a040 ! 1: 1f6ca0dd3eb hooks--pre-commit: detect non-ASCII when renaming
@@ Metadata
## Commit message ##
hooks--pre-commit: detect non-ASCII when renaming
- Currently the diff-filter that is used to check for non-ASCII characters
- in filenames only checks new additions.
+ When diff.renames is turned on, the diff-filter will not return renamed
+ files (or copied ones with diff.renames=copy) and potential non-ASCII
+ characters would not be caught by this hook.
- Extend the diff-filter in the pre-commit sample to include `CR` as well.
- This way non-ASCII character in filenames are detected on a rename/copy
- as well.
+ Use the plumbing command diff-index instead of the porcelain one to not
+ be affected by diff.rename.
Signed-off-by: Julian Prein <druckdev@protonmail.com>
@@ templates/hooks--pre-commit.sample: if [ "$allownonascii" != "true" ] &&
# even required, for portability to Solaris 10's /usr/bin/tr), since
# the square bracket bytes happen to fall in the designated range.
- test $(git diff --cached --name-only --diff-filter=A -z $against |
-+ test $(git diff --cached --name-only --diff-filter=ACR -z $against |
++ test $(git diff-index --cached --name-only --diff-filter=A -z $against |
LC_ALL=C tr -d '[ -~]\0' | wc -c) != 0
then
cat <<\EOF
templates/hooks--pre-commit.sample | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/templates/hooks--pre-commit.sample b/templates/hooks--pre-commit.sample
index e144712c85c..29ed5ee486a 100755
--- a/templates/hooks--pre-commit.sample
+++ b/templates/hooks--pre-commit.sample
@@ -28,7 +28,7 @@ if [ "$allownonascii" != "true" ] &&
# Note that the use of brackets around a tr range is ok here, (it's
# even required, for portability to Solaris 10's /usr/bin/tr), since
# the square bracket bytes happen to fall in the designated range.
- test $(git diff --cached --name-only --diff-filter=A -z $against |
+ test $(git diff-index --cached --name-only --diff-filter=A -z $against |
LC_ALL=C tr -d '[ -~]\0' | wc -c) != 0
then
cat <<\EOF
base-commit: 61a22ddaf0626111193a17ac12f366bd6d167dff
--
gitgitgadget
^ permalink raw reply related
* Re: [PATCH 3/4] refs: complete list of special refs
From: Phillip Wood @ 2023-11-30 15:42 UTC (permalink / raw)
To: Patrick Steinhardt, git; +Cc: hanwenn
In-Reply-To: <0e38103114a206bedbbbd7ea97cb77fa05fd3c29.1701243201.git.ps@pks.im>
Hi Patrick
Thanks for working on this. I've left a couple of thought below.
On 29/11/2023 08:14, Patrick Steinhardt wrote:
> +static int is_special_ref(const char *refname)
> +{
> + /*
> + * Special references get written and read directly via the filesystem
> + * by the subsystems that create them. Thus, they must not go through
> + * the reference backend but must instead be read directly. It is
> + * arguable whether this behaviour is sensible, or whether it's simply
> + * a leaky abstraction enabled by us only having a single reference
> + * backend implementation. But at least for a subset of references it
> + * indeed does make sense to treat them specially:
> + *
> + * - FETCH_HEAD may contain multiple object IDs, and each one of them
> + * carries additional metadata like where it came from.
> + *
> + * - MERGE_HEAD may contain multiple object IDs when merging multiple
> + * heads.
> + *
> + * - "rebase-apply/" and "rebase-merge/" contain all of the state for
> + * rebases, where keeping it closely together feels sensible.
I'd really like to get away from treating these files as refs. I think
their use as refs is purely historic and predates the reflog and
possibly ORIG_HEAD. These days I'm not sure there is a good reason to be
running
git rev-parse rebase-merge/orig-head
One reason for not wanting to treat them as refs is that we do not
handle multi-level refs that do not begin with "refs/" consistently.
git update-ref foo/bar HEAD
succeeds and creates .git/foo/bar but
git update-ref -d foo/bar
fails with
error: refusing to update ref with bad name 'foo/bar'
To me it would make sense to refuse to create 'foo/bar' but allow an
existing ref named 'foo/bar' to be deleted but the current behavior is
the opposite of that.
I'd be quite happy to see us refuse to treat anything that fails
if (starts_with(refname, "refs/") || refname_is_safe(refname))
as a ref but I don't know how much pain that would cause.
> + const char * const special_refs[] = {
> + "AUTO_MERGE",
Is there any reason to treat this specially in the long term? It points
to a tree rather than a commit but unlike MERGE_HEAD and FETCH_HEAD it
is effectively a "normal" ref.
> + "BISECT_EXPECTED_REV",
> + "FETCH_HEAD",
> + "MERGE_AUTOSTASH",
Should we be treating this as a ref? I thought it was written as an
implementation detail of the autostash implementation rather than to
provide a ref for users and scripts.
Best Wishes
Phillip
^ permalink raw reply
* Re: [PATCH 2/2] checkout: forbid "-B <branch>" from touching a branch used elsewhere
From: Phillip Wood @ 2023-11-30 15:22 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Willem Verstraeten, git, Jeff King
In-Reply-To: <xmqqwmu42ccb.fsf@gitster.g>
Hi Junio
On 27/11/2023 01:51, Junio C Hamano wrote:
> Phillip Wood <phillip.wood123@gmail.com> writes:
>
>> At the moment this is academic as neither of the test scripts changed
>> by this patch are leak free and so I don't think we need to worry
>> about it but it raises an interesting question about how we should
>> handle memory leaks when dying. Leaving the leak when dying means that
>> a test script that tests an expected failure will never be leak free
>> but using UNLEAK() would mean we miss a leak being introduced in the
>> successful case should the call to "free()" ever be removed.
>
> Is there a leak here? The piece of memory is pointed at by an on-stack
> variable full_ref when leak sanitizer starts scanning the heap and
> the stack just before the process exits due to die, so I do not see
> a reason to worry about this particular variable over all the other
> on stack variables we accumulated before the control reached this
> point of the code.
Oh, good point. I was thinking "we exit without calling free() so it is
leaked" but as you say the leak checker (thankfully) does not consider
it a leak as there is still a reference to the allocation on the stack.
Sorry for the noise
Phillip
> Are you worried about optimizing compilers that behave more cleverly
> than their own good to somehow lose the on-stack reference to
> full_ref while calling die_if_switching_to_a_branch_in_use()? We
> might need to squelch them with UNLEAK() but that does not mean we
> have to remove the free() we see above, and I suspect a more
> productive use of our time to solve that issue is ensure that our
> leak-sanitizing build will not triger such an unwanted optimization
> anyway.
>
> Thanks.
^ permalink raw reply
* Re: [EXTERNAL] Re: [BUG] `git push` sends unnecessary objects
From: Javier Mora @ 2023-11-30 13:33 UTC (permalink / raw)
To: Derrick Stolee; +Cc: Bagas Sanjaya, Git Mailing List, Junio C Hamano
In-Reply-To: <PH0PR00MB1349BB447657A94A8EA90A78A183A@PH0PR00MB1349.namprd00.prod.outlook.com>
> I just want to push back on the word "regression" here as it is an expected result.
Yeah, I wouldn't quite call it a "regression" either; just an
inconvenient side effect of a new feature.
>> However, it is possible that extra objects are added to the pack-file if the included commits contain certain types of direct renames.
>
> [1] https://git-scm.com/docs/git-config#Documentation/git-config.txt-packuseSparse
>
> It is unfortunate that this user hit this problem, but it is easy to work around it
I eventually found that note, but not until I figured out that that
option was the one causing the trouble. So yes, it is easy to work
around this issue when you know about it. Maybe this potential corner
case should be given more visibility in the documentation so that it
is easier to spot? (Not just on the documentation for the option
itself, but maybe in the `git push` one.)
> and the benefits of the sparse algorithm should outweigh these kinds of infrequent issues (in my opinion).
Well, I don't use git THAT often and still got hit by this issue, so
it might not be that uncommon. And when the files are large, the time
spent sending them through an internet connection will probably
outweigh the processing time saved by skipping some files in the local
processing.
I wonder if it would make sense to try to make the algorithm smarter
to avoid this corner case. For example, ask the server if it already
has the objects before sending them, or modify the algorithm to look
into trees that have added OR deleted files (to detect that a file has
actually been moved, and thus the server must already have it), or
simply be extra careful when large files are involved. But I don't
know the details of the algorithm so I'm not sure all those
suggestions are feasible or even make sense at all.
^ permalink raw reply
* SMEs AID
From: HILTON FOUNDATION @ 2023-11-30 12:33 UTC (permalink / raw)
To: git
My name is Conrad N. Hilton, CEO of The Hilton Foundations.
I am working with the Centers for Disease Control (CDC) and The Hilton Foundations, We are offering business grants to your neighborhood as a relief package for SMEs.
For more information, visit our website: https://www.hiltonfoundation.org/grants
We wish you a very happy Christmas.
Best Regards
Conrad N. Hilton
Email: hiltonfoundations@gmail.com
Phone: 254-271-0344
Hilton Foundation Customer Success Story | Sage Intact
^ permalink raw reply
* Re: [PATCH 08/24] pack-objects: implement `--ignore-disjoint` mode
From: Patrick Steinhardt @ 2023-11-30 10:18 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <0368f7ab37669163b50b82185725935bde5bc946.1701198172.git.me@ttaylorr.com>
[-- Attachment #1: Type: text/plain, Size: 15023 bytes --]
On Tue, Nov 28, 2023 at 02:08:16PM -0500, Taylor Blau wrote:
> Before multi-pack reachability bitmaps learn how to perform pack reuse
> over the set of disjoint packs, we will need a way to generate packs
> that are known to be disjoint with respect to the currently marked set
> of disjoint packs.
>
> In other words, we want a way to make a pack which does not have any
> objects contained in the union of the set of packs which are currently
> marked as disjoint.
>
> There are a various ways that we could go about this, for example:
>
> - passing `--unpacked`, which would exclude all packed objects (and
> thus would not contain any objects from the disjoint pack)
>
> - passing `--stdin-packs` with the set of packs currently marked as
> disjoint as "excluded", indicating that `pack-objects` should
> discard any objects present in any of the excluded packs (thus
> producing a disjoint pack)
>
> - marking each of the disjoint packs as kept in-core with the
> `--keep-pack` flag, and then passing `--honor-pack-keep` to
> similarly ignore any object(s) from kept packs (thus also producing
> a pack which is disjoint with respect to the current set)
>
> `git repack` is the main entry-point to generating a new pack, by
> invoking `pack-objects` and then adding the new pack to the set of
> disjoint packs if generating a new MIDX. However, `repack` has a number
> of ways to invoke `pack-objects` (e.g., all-into-one repacks, geometric
> repacks, incremental repacks, etc.), all of which would require careful
> reasoning in order to prove that the resulting set of packs is disjoint.
>
> The most appealing option of the above would be to pass the set of
> disjoint packs as kept (via `--keep-pack`) and then ignore their
> contents (with `--honor-pack-keep`), doing so for all kinds of
> `pack-objects` invocations. But there may be more disjoint packs than we
> can easily fit into the command-line arguments.
>
> Instead, teach `pack-objects` a special `--ignore-disjoint` which is the
> moral equivalent of marking the set of disjoint packs as kept, and
> ignoring their contents, even if it would have otherwise been packed. In
> fact, this similarity extends down to the implementation, where each
> disjoint pack is first loaded, then has its `pack_keep_in_core` bit set.
>
> With this in place, we can use the kept-pack cache from 20b031fede
> (packfile: add kept-pack cache for find_kept_pack_entry(), 2021-02-22),
> which looks up objects first in a cache containing just the set of kept
> (in this case, disjoint) packs. Assuming that the set of disjoint packs
> is a relatively small portion of the entire repository (which should be
> a safe assumption to make), each object lookup will be very inexpensive.
This cought me by surprise a bit. I'd have expected that in the end,
most of the packfiles in a repository would be disjoint. Using for
example geometric repacks, my expectation was that all of the packs that
get written via geometric repacking would eventually become disjoint
whereas new packs added to the repository would initially not be.
Patrick
> The only place we want to avoid using `--ignore-disjoint` is in
> conjunction with `--cruft`, since doing so may cause us to omit an
> object which would have been included in a new cruft pack in order to
> freshen it. In other words, failing to do so might cause that object to
> be pruned from the repository earlier than expected.
>
> Otherwise, `--ignore-disjoint` is compatible with most other modes of
> `pack-objects`. These various combinations are tested below. As a
> result, `repack` will be able to unconditionally (except for the cruft
> pack) pass `--ignore-disjoint` when trying to add a new pack to the
> disjoint set, and the result will be usable, without having to carefully
> consider and reason about each individual case.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
> Documentation/git-pack-objects.txt | 8 ++
> builtin/pack-objects.c | 31 +++++-
> t/lib-disjoint.sh | 11 ++
> t/t5331-pack-objects-stdin.sh | 156 +++++++++++++++++++++++++++++
> 4 files changed, 203 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
> index e32404c6aa..592c4ce742 100644
> --- a/Documentation/git-pack-objects.txt
> +++ b/Documentation/git-pack-objects.txt
> @@ -96,6 +96,14 @@ base-name::
> Incompatible with `--revs`, or options that imply `--revs` (such as
> `--all`), with the exception of `--unpacked`, which is compatible.
>
> +--ignore-disjoint::
> + This flag causes an object that appears in any pack marked as
> + "disjoint" by the multi-pack index to be ignored, even if it
> + would have otherwise been packed. When used with
> + `--stdin-packs`, objects from disjoint packs may be included if
> + and only if a disjoint pack is explicitly given as an input pack
> + to `--stdin-packs`. Incompatible with `--cruft`.
> +
> --cruft::
> Packs unreachable objects into a separate "cruft" pack, denoted
> by the existence of a `.mtimes` file. Typically used by `git
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index bfa60359d4..107154db34 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -207,6 +207,7 @@ static int have_non_local_packs;
> static int incremental;
> static int ignore_packed_keep_on_disk;
> static int ignore_packed_keep_in_core;
> +static int ignore_midx_disjoint_packs;
> static int allow_ofs_delta;
> static struct pack_idx_option pack_idx_opts;
> static const char *base_name;
> @@ -1403,7 +1404,8 @@ static int want_found_object(const struct object_id *oid, int exclude,
> /*
> * Then handle .keep first, as we have a fast(er) path there.
> */
> - if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core) {
> + if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core ||
> + ignore_midx_disjoint_packs) {
> /*
> * Set the flags for the kept-pack cache to be the ones we want
> * to ignore.
> @@ -1415,7 +1417,7 @@ static int want_found_object(const struct object_id *oid, int exclude,
> unsigned flags = 0;
> if (ignore_packed_keep_on_disk)
> flags |= ON_DISK_KEEP_PACKS;
> - if (ignore_packed_keep_in_core)
> + if (ignore_packed_keep_in_core || ignore_midx_disjoint_packs)
> flags |= IN_CORE_KEEP_PACKS;
>
> if (ignore_packed_keep_on_disk && p->pack_keep)
> @@ -3389,6 +3391,7 @@ static void read_packs_list_from_stdin(void)
> die(_("could not find pack '%s'"), item->string);
> if (!is_pack_valid(p))
> die(_("packfile %s cannot be accessed"), p->pack_name);
> + p->pack_keep_in_core = 0;
> }
>
> /*
> @@ -4266,6 +4269,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
> N_("create packs suitable for shallow fetches")),
> OPT_BOOL(0, "honor-pack-keep", &ignore_packed_keep_on_disk,
> N_("ignore packs that have companion .keep file")),
> + OPT_BOOL(0, "ignore-disjoint", &ignore_midx_disjoint_packs,
> + N_("ignore packs that are marked disjoint in the MIDX")),
> OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
> N_("ignore this pack")),
> OPT_INTEGER(0, "compression", &pack_compression_level,
> @@ -4412,7 +4417,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
> if (use_internal_rev_list)
> die(_("cannot use internal rev list with --cruft"));
> if (stdin_packs)
> - die(_("cannot use --stdin-packs with --cruft"));
> + die(_("cannot use %s with %s"), "--stdin-packs", "--cruft");
> + if (ignore_midx_disjoint_packs)
> + die(_("cannot use %s with %s"), "--ignore-disjoint", "--cruft");
> }
>
> /*
> @@ -4452,6 +4459,24 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
> if (!p) /* no keep-able packs found */
> ignore_packed_keep_on_disk = 0;
> }
> + if (ignore_midx_disjoint_packs) {
> + struct multi_pack_index *m = get_multi_pack_index(the_repository);
> + struct bitmapped_pack pack;
> + unsigned any_disjoint = 0;
> + uint32_t i;
> +
> + for (i = 0; m && m->chunk_disjoint_packs && i < m->num_packs; i++) {
> + if (nth_bitmapped_pack(the_repository, m, &pack, i) < 0)
> + die(_("could not load bitmapped pack %i"), i);
> + if (pack.disjoint) {
> + pack.p->pack_keep_in_core = 1;
> + any_disjoint = 1;
> + }
> + }
> +
> + if (!any_disjoint) /* no disjoint packs to ignore */
> + ignore_midx_disjoint_packs = 0;
> + }
> if (local) {
> /*
> * unlike ignore_packed_keep_on_disk above, we do not
> diff --git a/t/lib-disjoint.sh b/t/lib-disjoint.sh
> index c6c6e74aba..c802ca6940 100644
> --- a/t/lib-disjoint.sh
> +++ b/t/lib-disjoint.sh
> @@ -36,3 +36,14 @@ test_must_be_disjoint () {
> test_must_not_be_disjoint () {
> test_disjoint_1 "$1" "no"
> }
> +
> +# packed_contents </path/to/pack-$XYZ.idx [...]>
> +#
> +# Prints the set of objects packed in the given pack indexes.
> +packed_contents () {
> + for idx in "$@"
> + do
> + git show-index <$idx || return 1
> + done >tmp &&
> + cut -d" " -f2 <tmp | sort -u
> +}
> diff --git a/t/t5331-pack-objects-stdin.sh b/t/t5331-pack-objects-stdin.sh
> index 2dcf1eecee..e522aa3f7d 100755
> --- a/t/t5331-pack-objects-stdin.sh
> +++ b/t/t5331-pack-objects-stdin.sh
> @@ -6,6 +6,7 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
>
> TEST_PASSES_SANITIZE_LEAK=true
> . ./test-lib.sh
> +. "$TEST_DIRECTORY"/lib-disjoint.sh
>
> packed_objects () {
> git show-index <"$1" >tmp-object-list &&
> @@ -237,4 +238,159 @@ test_expect_success 'pack-objects --stdin with packfiles from main and alternate
> test_cmp expected-objects actual-objects
> '
>
> +objdir=.git/objects
> +packdir=$objdir/pack
> +
> +test_expect_success 'loose objects also in disjoint packs are ignored' '
> + test_when_finished "rm -fr repo" &&
> + git init repo &&
> + (
> + cd repo &&
> +
> + # create a pack containing the objects in each commit below, but
> + # do not delete their loose copies
> + test_commit base &&
> + base_pack="$(echo base | git pack-objects --revs $packdir/pack)" &&
> +
> + test_commit other &&
> + other_pack="$(echo base..other | git pack-objects --revs $packdir/pack)" &&
> +
> + cat >in <<-EOF &&
> + pack-$base_pack.idx
> + +pack-$other_pack.idx
> + EOF
> + git multi-pack-index write --stdin-packs --bitmap <in &&
> +
> + test_commit more &&
> + out="$(git pack-objects --all --ignore-disjoint $packdir/pack)" &&
> +
> + # gather all objects in "all", and objects from the disjoint
> + # pack in "disjoint"
> + git cat-file --batch-all-objects --batch-check="%(objectname)" >all &&
> + packed_contents "$packdir/pack-$other_pack.idx" >disjoint &&
> +
> + # make sure that the set of objects we just generated matches
> + # "all \ disjoint"
> + packed_contents "$packdir/pack-$out.idx" >got &&
> + comm -23 all disjoint >want &&
> + test_cmp want got
> + )
> +'
> +
> +test_expect_success 'objects in disjoint packs are ignored (--unpacked)' '
> + test_when_finished "rm -fr repo" &&
> + git init repo &&
> + (
> + cd repo &&
> +
> + for c in A B
> + do
> + test_commit "$c" || return 1
> + done &&
> +
> + A="$(echo "A" | git pack-objects --revs $packdir/pack)" &&
> + B="$(echo "A..B" | git pack-objects --revs $packdir/pack)" &&
> +
> + cat >in <<-EOF &&
> + pack-$A.idx
> + +pack-$B.idx
> + EOF
> + git multi-pack-index write --stdin-packs --bitmap <in &&
> +
> + test_must_not_be_disjoint "pack-$A.pack" &&
> + test_must_be_disjoint "pack-$B.pack" &&
> +
> + test_commit C &&
> +
> + got="$(git pack-objects --all --unpacked --ignore-disjoint $packdir/pack)" &&
> + packed_contents "$packdir/pack-$got.idx" >actual &&
> +
> + git rev-list --objects --no-object-names B..C >expect.raw &&
> + sort <expect.raw >expect &&
> +
> + test_cmp expect actual
> + )
> +'
> +
> +test_expect_success 'objects in disjoint packs are ignored (--stdin-packs)' '
> + # Create objects in three separate packs:
> + #
> + # - pack A (midx, non disjoint)
> + # - pack B (midx, disjoint)
> + # - pack C (non-midx)
> + #
> + # Then create a new pack with `--stdin-packs` and `--ignore-disjoint`
> + # including packs A, B, and C. The resulting pack should contain
> + # only the objects from packs A, and C, excluding those from
> + # pack B as it is marked as disjoint.
> + test_when_finished "rm -fr repo" &&
> + git init repo &&
> + (
> + cd repo &&
> +
> + for c in A B C
> + do
> + test_commit "$c" || return 1
> + done &&
> +
> + A="$(echo "A" | git pack-objects --revs $packdir/pack)" &&
> + B="$(echo "A..B" | git pack-objects --revs $packdir/pack)" &&
> + C="$(echo "B..C" | git pack-objects --revs $packdir/pack)" &&
> +
> + cat >in <<-EOF &&
> + pack-$A.idx
> + +pack-$B.idx
> + EOF
> + git multi-pack-index write --stdin-packs --bitmap <in &&
> +
> + test_must_not_be_disjoint "pack-$A.pack" &&
> + test_must_be_disjoint "pack-$B.pack" &&
> +
> + # Generate a pack with `--stdin-packs` using packs "A" and "C",
> + # but excluding objects from "B". The objects from pack "B" are
> + # expected to be omitted from the generated pack for two
> + # reasons:
> + #
> + # - because it was specified as a negated tip via
> + # `--stdin-packs`
> + # - because it is a disjoint pack.
> + cat >in <<-EOF &&
> + pack-$A.pack
> + ^pack-$B.pack
> + pack-$C.pack
> + EOF
> + got="$(git pack-objects --stdin-packs --ignore-disjoint $packdir/pack <in)" &&
> +
> + packed_contents "$packdir/pack-$got.idx" >actual &&
> + packed_contents "$packdir/pack-$A.idx" \
> + "$packdir/pack-$C.idx" >expect &&
> + test_cmp expect actual &&
> +
> + # Generate another pack with `--stdin-packs`, this time
> + # using packs "B" and "C". The objects from pack "B" are
> + # expected to be in the final pack, despite it being a
> + # disjoint pack, because "B" was mentioned explicitly
> + # via `stdin-packs`.
> + cat >in <<-EOF &&
> + pack-$B.pack
> + pack-$C.pack
> + EOF
> + got="$(git pack-objects --stdin-packs --ignore-disjoint $packdir/pack <in)" &&
> +
> + packed_contents "$packdir/pack-$got.idx" >actual &&
> + packed_contents "$packdir/pack-$B.idx" \
> + "$packdir/pack-$C.idx" >expect &&
> + test_cmp expect actual
> + )
> +'
> +
> +test_expect_success '--cruft is incompatible with --ignore-disjoint' '
> + test_must_fail git pack-objects --cruft --ignore-disjoint --stdout \
> + </dev/null >/dev/null 2>actual &&
> + cat >expect <<-\EOF &&
> + fatal: cannot use --ignore-disjoint with --cruft
> + EOF
> + test_cmp expect actual
> +'
> +
> test_done
> --
> 2.43.0.24.g980b318f98
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH 07/24] midx: implement `--retain-disjoint` mode
From: Patrick Steinhardt @ 2023-11-30 10:18 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <3019738b52ba8cd78ea696a3b800fa91e722eb66.1701198172.git.me@ttaylorr.com>
[-- Attachment #1: Type: text/plain, Size: 11318 bytes --]
On Tue, Nov 28, 2023 at 02:08:13PM -0500, Taylor Blau wrote:
> Once multi-pack reachability bitmaps learn how to perform pack reuse
> over the set of disjoint packs, we will want to teach `git repack` to
> evolve the set of disjoint packs over time.
>
> To evolve the set of disjoint packs means any new packs made by `repack`
> should be disjoint with respect to the existing set of disjoint packs so
> as to be able to join that set when updating the multi-pack index.
>
> The details of generating such packs will be left to future commits. But
> any new pack(s) created by repack as disjoint will be marked as such by
> passing them over `--stdin-packs` with the special '+' marker when
> generating a new MIDX.
>
> This patch, however, addresses the question of how we retain the
> existing set of disjoint packs when updating the multi-pack index. One
> option would be for `repack` to keep track of the set of disjoint packs
> itself by querying the MIDX, and then adding the special '+' marker
> appropriately when generating the input for `--stdin-packs`.
>
> But this is verbose and error-prone, since two different parts of Git
> would need to maintain the same notion of the set of disjoint packs.
> When one disagrees with the other, the set of so-called disjoint packs
> may actually contain two or more packs which have one or more object(s)
> in common, making the set non-disjoint.
>
> Instead, introduce a `--retain-disjoint` mode for the `git
> multi-pack-index write` sub-command which keeps any packs which are:
>
> - marked as disjoint in the existing MIDX, and
>
> - not deleted (e.g., they are not excluded from the input for
> `--stdin-packs`).
>
> This will allow the `repack` command to not have to keep track of the
> set of currently-disjoint packs itself, reducing the number of lines of
> code necessary to implement this feature, and making the resulting
> implementation less error-prone.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
> Documentation/git-multi-pack-index.txt | 8 +++
> builtin/multi-pack-index.c | 3 +
> midx.c | 49 +++++++++++++++
> midx.h | 1 +
> t/lib-disjoint.sh | 38 ++++++++++++
> t/t5319-multi-pack-index.sh | 82 ++++++++++++++++++++++++++
> 6 files changed, 181 insertions(+)
> create mode 100644 t/lib-disjoint.sh
>
> diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> index d130e65b28..ac0c7b124b 100644
> --- a/Documentation/git-multi-pack-index.txt
> +++ b/Documentation/git-multi-pack-index.txt
> @@ -54,6 +54,14 @@ write::
> "disjoint". See the "`DISP` chunk and disjoint packs"
> section in linkgit:gitformat-pack[5] for more.
>
> + --retain-disjoint::
> + When writing a multi-pack index with a reachability
> + bitmap, keep any packs marked as disjoint in the
> + existing MIDX (if any) as such in the new MIDX. Existing
> + disjoint packs which are removed (e.g., not listed via
> + `--stdin-packs`) are ignored. This option works in
> + addition to the '+' marker for `--stdin-packs`.
I'm trying to understand when you're expected to pass this flag and when
you're expected not to pass it. This documentation could also help in
the documentation here so that the user can make a more informed
decision.
Patrick
> --refs-snapshot=<path>::
> With `--bitmap`, optionally specify a file which
> contains a "refs snapshot" taken prior to repacking.
> diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> index 0f1dd4651d..dcfabf2626 100644
> --- a/builtin/multi-pack-index.c
> +++ b/builtin/multi-pack-index.c
> @@ -138,6 +138,9 @@ static int cmd_multi_pack_index_write(int argc, const char **argv,
> N_("write multi-pack index containing only given indexes")),
> OPT_FILENAME(0, "refs-snapshot", &opts.refs_snapshot,
> N_("refs snapshot for selecting bitmap commits")),
> + OPT_BIT(0, "retain-disjoint", &opts.flags,
> + N_("retain non-deleted disjoint packs"),
> + MIDX_WRITE_RETAIN_DISJOINT),
> OPT_END(),
> };
>
> diff --git a/midx.c b/midx.c
> index 65ba0c70fe..ce67da9f85 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -721,6 +721,12 @@ static void midx_fanout_add_midx_fanout(struct midx_fanout *fanout,
> &fanout->entries[fanout->nr],
> cur_object);
> fanout->entries[fanout->nr].preferred = 0;
> + /*
> + * It's OK to set disjoint to 0 here, even with
> + * `--retain-disjoint`, since we will always see the disjoint
> + * copy of some object below in get_sorted_entries(), causing us
> + * to die().
> + */
> fanout->entries[fanout->nr].disjoint = 0;
> fanout->nr++;
> }
> @@ -1362,6 +1368,37 @@ static struct multi_pack_index *lookup_multi_pack_index(struct repository *r,
> return result;
> }
>
> +static int midx_retain_existing_disjoint(struct repository *r,
> + struct multi_pack_index *from,
> + struct write_midx_context *ctx)
> +{
> + struct bitmapped_pack bp;
> + uint32_t i, midx_pos;
> +
> + for (i = 0; i < ctx->nr; i++) {
> + struct pack_info *info = &ctx->info[i];
> + /*
> + * Having to call `midx_locate_pack()` in a loop is
> + * sub-optimal, since it is O(n*log(n)) in the number
> + * of packs.
> + *
> + * When reusing an existing MIDX, we know that the first
> + * 'n' packs appear in the same order, so we could avoid
> + * this when reusing an existing MIDX. But we may be
> + * instead relying on the order given to us by
> + * for_each_file_in_pack_dir(), in which case we can't
> + * make any such guarantees.
> + */
> + if (!midx_locate_pack(from, info->pack_name, &midx_pos))
> + continue;
> +
> + if (nth_bitmapped_pack(r, from, &bp, midx_pos) < 0)
> + return -1;
> + info->disjoint = bp.disjoint;
> + }
> + return 0;
> +}
> +
> static int write_midx_internal(const char *object_dir,
> struct string_list *packs_to_include,
> struct string_list *packs_to_drop,
> @@ -1444,6 +1481,18 @@ static int write_midx_internal(const char *object_dir,
> for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
> stop_progress(&ctx.progress);
>
> + if (flags & MIDX_WRITE_RETAIN_DISJOINT) {
> + struct multi_pack_index *m = ctx.m;
> + if (!m)
> + m = lookup_multi_pack_index(the_repository, object_dir);
> +
> + if (m) {
> + result = midx_retain_existing_disjoint(the_repository, m, &ctx);
> + if (result)
> + goto cleanup;
> + }
> + }
> +
> if ((ctx.m && ctx.nr == ctx.m->num_packs) &&
> !(packs_to_include || packs_to_drop)) {
> struct bitmap_index *bitmap_git;
> diff --git a/midx.h b/midx.h
> index a6e969c2ea..d7ce52ff7b 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -54,6 +54,7 @@ struct multi_pack_index {
> #define MIDX_WRITE_BITMAP (1 << 2)
> #define MIDX_WRITE_BITMAP_HASH_CACHE (1 << 3)
> #define MIDX_WRITE_BITMAP_LOOKUP_TABLE (1 << 4)
> +#define MIDX_WRITE_RETAIN_DISJOINT (1 << 5)
>
> const unsigned char *get_midx_checksum(struct multi_pack_index *m);
> void get_midx_filename(struct strbuf *out, const char *object_dir);
> diff --git a/t/lib-disjoint.sh b/t/lib-disjoint.sh
> new file mode 100644
> index 0000000000..c6c6e74aba
> --- /dev/null
> +++ b/t/lib-disjoint.sh
> @@ -0,0 +1,38 @@
> +# Helpers for scripts testing disjoint packs; see t5319 for example usage.
> +
> +objdir=.git/objects
> +
> +test_disjoint_1 () {
> + local pack="$1"
> + local want="$2"
> +
> + test-tool read-midx --bitmap $objdir >out &&
> + grep -A 3 "$pack" out >found &&
> +
> + if ! test -s found
> + then
> + echo >&2 "could not find '$pack' in MIDX"
> + return 1
> + fi
> +
> + if ! grep -q "disjoint: $want" found
> + then
> + echo >&2 "incorrect disjoint state for pack '$pack'"
> + return 1
> + fi
> + return 0
> +}
> +
> +# test_must_be_disjoint <pack-$XYZ.pack>
> +#
> +# Ensures that the given pack is marked as disjoint.
> +test_must_be_disjoint () {
> + test_disjoint_1 "$1" "yes"
> +}
> +
> +# test_must_not_be_disjoint <pack-$XYZ.pack>
> +#
> +# Ensures that the given pack is not marked as disjoint.
> +test_must_not_be_disjoint () {
> + test_disjoint_1 "$1" "no"
> +}
> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> index fd24e0c952..02cfddf151 100755
> --- a/t/t5319-multi-pack-index.sh
> +++ b/t/t5319-multi-pack-index.sh
> @@ -3,6 +3,7 @@
> test_description='multi-pack-indexes'
> . ./test-lib.sh
> . "$TEST_DIRECTORY"/lib-chunk.sh
> +. "$TEST_DIRECTORY"/lib-disjoint.sh
>
> GIT_TEST_MULTI_PACK_INDEX=0
> objdir=.git/objects
> @@ -1215,4 +1216,85 @@ test_expect_success 'non-disjoint packs are detected' '
> )
> '
>
> +test_expect_success 'retain disjoint packs while writing' '
> + test_when_finished "rm -fr repo" &&
> + git init repo &&
> + (
> + cd repo &&
> +
> + for i in 1 2
> + do
> + test_commit "$i" && git repack -d || return 1
> + done &&
> +
> + find $objdir/pack -type f -name "pack-*.idx" |
> + sed -e "s/^.*\/\(.*\)/\1/g" | sort >packs.old &&
> +
> + test_line_count = 2 packs.old &&
> + disjoint="$(head -n 1 packs.old)" &&
> + non_disjoint="$(tail -n 1 packs.old)" &&
> +
> + cat >in <<-EOF &&
> + +$disjoint
> + $non_disjoint
> + EOF
> + git multi-pack-index write --stdin-packs --bitmap <in &&
> +
> + test_must_be_disjoint "${disjoint%.idx}.pack" &&
> + test_must_not_be_disjoint "${non_disjoint%.idx}.pack" &&
> +
> + test_commit 3 &&
> + git repack -d &&
> +
> + find $objdir/pack -type f -name "pack-*.idx" |
> + sed -e "s/^.*\/\(.*\)/\1/g" | sort >packs.new &&
> +
> + new_disjoint="$(comm -13 packs.old packs.new)" &&
> + cat >in <<-EOF &&
> + $disjoint
> + $non_disjoint
> + +$new_disjoint
> + EOF
> + git multi-pack-index write --stdin-packs --bitmap \
> + --retain-disjoint <in &&
> +
> + test_must_be_disjoint "${disjoint%.idx}.pack" &&
> + test_must_be_disjoint "${new_disjoint%.idx}.pack" &&
> + test_must_not_be_disjoint "${non_disjoint%.idx}.pack"
> +
> + )
> +'
> +
> +test_expect_success 'non-disjoint packs are detected via --retain-disjoint' '
> + test_when_finished "rm -fr repo" &&
> + git init repo &&
> + (
> + cd repo &&
> + packdir=.git/objects/pack &&
> +
> + test_commit base &&
> + base="$(echo base | git pack-objects --revs $packdir/pack)" &&
> +
> + cat >in <<-EOF &&
> + +pack-$base.idx
> + EOF
> + git multi-pack-index write --stdin-packs --bitmap <in &&
> +
> + test_must_be_disjoint "pack-$base.pack" &&
> +
> + test_commit other &&
> + other="$(echo other | git pack-objects --revs $packdir/pack)" &&
> +
> + cat >in <<-EOF &&
> + pack-$base.idx
> + +pack-$other.idx
> + EOF
> + test_must_fail git multi-pack-index write --stdin-packs --retain-disjoint --bitmap <in 2>err &&
> + grep "duplicate object.* among disjoint packs" err &&
> +
> + test_must_fail git multi-pack-index write --retain-disjoint --bitmap 2>err &&
> + grep "duplicate object.* among disjoint packs" err
> + )
> +'
> +
> test_done
> --
> 2.43.0.24.g980b318f98
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH 05/24] midx: implement `DISP` chunk
From: Patrick Steinhardt @ 2023-11-30 10:18 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <c52d7e7b27a9add4f58b8334db4fe4498af1c90f.1701198172.git.me@ttaylorr.com>
[-- Attachment #1: Type: text/plain, Size: 27233 bytes --]
On Tue, Nov 28, 2023 at 02:08:08PM -0500, Taylor Blau wrote:
> When a multi-pack bitmap is used to implement verbatim pack reuse (that
> is, when verbatim chunks from an on-disk packfile are copied
> directly[^1]), it does so by using its "preferred pack" as the source
> for pack-reuse.
>
> This allows repositories to pack the majority of their objects into a
> single (often large) pack, and then use it as the single source for
> verbatim pack reuse. This increases the amount of objects that are
> reused verbatim (and consequently, decrease the amount of time it takes
> to generate many packs). But this performance comes at a cost, which is
> that the preferred packfile must pace its growth with that of the entire
> repository in order to maintain the utility of verbatim pack reuse.
>
> As repositories grow beyond what we can reasonably store in a single
> packfile, the utility of verbatim pack reuse diminishes. Or, at the very
> least, it becomes increasingly more expensive to maintain as the pack
> grows larger and larger.
>
> It would be beneficial to be able to perform this same optimization over
> multiple packs, provided some modest constraints (most importantly, that
> the set of packs eligible for verbatim reuse are disjoint with respect
> to the objects that they contain).
>
> If we assume that the packs which we treat as candidates for verbatim
> reuse are disjoint with respect to their objects, we need to make only
> modest modifications to the verbatim pack-reuse code itself. Most
> notably, we need to remove the assumption that the bits in the
> reachability bitmap corresponding to objects from the single reuse pack
> begin at the first bit position.
>
> Future patches will unwind these assumptions and reimplement their
> existing functionality as special cases of the more general assumptions
> (e.g. that reuse bits can start anywhere within the bitset, but happen
> to start at 0 for all existing cases).
>
> This patch does not yet relax any of those assumptions. Instead, it
> implements a foundational data-structure, the "Disjoint Packs" (`DISP`)
> chunk of the multi-pack index. The `DISP` chunk's contents are described
> in detail here. Importantly, the `DISP` chunk contains information to
> map regions of a multi-pack index's reachability bitmap to the packs
> whose objects they represent.
>
> For now, this chunk is only written, not read (outside of the test-tool
> used in this patch to test the new chunk's behavior). Future patches
> will begin to make use of this new chunk.
>
> This patch implements reading (though no callers outside of the above
> one perform any reading) and writing this new chunk. It also extends the
> `--stdin-packs` format used by the `git multi-pack-index write` builtin
> to be able to designate that a given pack is to be marked as "disjoint"
> by prefixing it with a '+' character.
>
> [^1]: Modulo patching any `OFS_DELTA`'s that cross over a region of the
> pack that wasn't used verbatim.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
> Documentation/git-multi-pack-index.txt | 4 +
> Documentation/gitformat-pack.txt | 109 +++++++++++++++++++++++
> builtin/multi-pack-index.c | 10 ++-
> midx.c | 116 ++++++++++++++++++++++---
> midx.h | 5 ++
> pack-bitmap.h | 9 ++
> t/helper/test-read-midx.c | 31 ++++++-
> t/t5319-multi-pack-index.sh | 58 +++++++++++++
> 8 files changed, 325 insertions(+), 17 deletions(-)
>
> diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> index 3696506eb3..d130e65b28 100644
> --- a/Documentation/git-multi-pack-index.txt
> +++ b/Documentation/git-multi-pack-index.txt
> @@ -49,6 +49,10 @@ write::
> --stdin-packs::
> Write a multi-pack index containing only the set of
> line-delimited pack index basenames provided over stdin.
> + Lines beginning with a '+' character (followed by the
> + pack index basename as before) have their pack marked as
> + "disjoint". See the "`DISP` chunk and disjoint packs"
> + section in linkgit:gitformat-pack[5] for more.
>
> --refs-snapshot=<path>::
> With `--bitmap`, optionally specify a file which
> diff --git a/Documentation/gitformat-pack.txt b/Documentation/gitformat-pack.txt
> index 9fcb29a9c8..658682ddd5 100644
> --- a/Documentation/gitformat-pack.txt
> +++ b/Documentation/gitformat-pack.txt
> @@ -396,6 +396,22 @@ CHUNK DATA:
> is padded at the end with between 0 and 3 NUL bytes to make the
> chunk size a multiple of 4 bytes.
>
> + Disjoint Packfiles (ID: {'D', 'I', 'S', 'P'})
> + Stores a table of three 4-byte unsigned integers in network order.
> + Each table entry corresponds to a single pack (in the order that
> + they appear above in the `PNAM` chunk). The values for each table
> + entry are as follows:
> + - The first bit position (in psuedo-pack order, see below) to
s/psuedo/pseudo/
> + contain an object from that pack.
> + - The number of bits whose objects are selected from that pack.
> + - A "meta" value, whose least-significant bit indicates whether or
> + not the pack is disjoint with respect to other packs. The
> + remaining bits are unused.
> + Two packs are "disjoint" with respect to one another when they have
> + disjoint sets of objects. In other words, any object found in a pack
> + contained in the set of disjoint packfiles is guaranteed to be
> + uniquely located among those packs.
> +
> OID Fanout (ID: {'O', 'I', 'D', 'F'})
> The ith entry, F[i], stores the number of OIDs with first
> byte at most i. Thus F[255] stores the total
> @@ -509,6 +525,99 @@ packs arranged in MIDX order (with the preferred pack coming first).
> The MIDX's reverse index is stored in the optional 'RIDX' chunk within
> the MIDX itself.
>
> +=== `DISP` chunk and disjoint packs
> +
> +The Disjoint Packfiles (`DISP`) chunk encodes additional information
> +about the objects in the multi-pack index's reachability bitmap. Recall
> +that objects from the MIDX are arranged in "pseudo-pack" order (see:
The colon feels a bit out-of-place here, so: s/see:/see/
> +above) for reachability bitmaps.
> +
> +From the example above, suppose we have packs "a", "b", and "c", with
> +10, 15, and 20 objects, respectively. In pseudo-pack order, those would
> +be arranged as follows:
> +
> + |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|
> +
> +When working with single-pack bitmaps (or, equivalently, multi-pack
> +reachability bitmaps without any packs marked as disjoint),
> +linkgit:git-pack-objects[1] performs ``verbatim'' reuse, attempting to
> +reuse chunks of the existing packfile instead of adding objects to the
> +packing list.
I'm not sure I full understand this paragraph. In the context of a
single pack bitmap it's clear enough. But I stumbled over the MIDX case,
because here we potentially have multiple packfiles, so it's not exactly
clear to me what you refer to with "the existing packfile" in that case.
I'd think that we perform verbatim reuse of the preferred packfile,
right? If so, we might want to make that a bit more explicit.
> +When a chunk of bytes are reused from an existing pack, any objects
s/are/is/, as it refers to the single chunk and not the plural bytes.
> +contained therein do not need to be added to the packing list, saving
> +memory and CPU time. But a chunk from an existing packfile can only be
> +reused when the following conditions are met:
> +
> + - The chunk contains only objects which were requested by the caller
> + (i.e. does not contain any objects which the caller didn't ask for
> + explicitly or implicitly).
> +
> + - All objects stored as offset- or reference-deltas also include their
> + base object in the resulting pack.
> +
> +Additionally, packfiles many not contain more than one copy of any given
s/many/may
> +object. This introduces an additional constraint over the set of packs
> +we may want to reuse. The most straightforward approach is to mandate
> +that the set of packs is disjoint with respect to the set of objects
> +contained in each pack. In other words, for each object `o` in the union
> +of all objects stored by the disjoint set of packs, `o` is contained in
> +exactly one pack from the disjoint set.
Is this a property that usually holds for our normal housekeeping, or
does it require careful managing by the user/admin? How about geometric
repacking?
> +One alternative design choice for multi-pack reuse might instead involve
> +imposing a chunk-level constraint that allows packs in the reusable set
> +to contain multiple copies across different packs, but restricts each
> +chunk against including more than one copy of such an object. This is in
> +theory possible to implement, but significantly more complicated than
> +forcing packs themselves to be disjoint. Most notably, we would have to
> +keep track of which objects have already been sent during verbatim
> +pack-reuse, defeating the main purpose of verbatim pack reuse (that we
> +don't have to keep track of individual objects).
> +
> +The `DISP` chunk encodes the necessary information in order to implement
> +multi-pack reuse over a disjoint set of packs as described above.
> +Specifically, the `DISP` chunk encodes three pieces of information (all
> +32-bit unsigned integers in network byte-order) for each packfile `p`
> +that is stored in the MIDX, as follows:
> +
> +`bitmap_pos`:: The first bit position (in pseudo-pack order) in the
> + multi-pack index's reachability bitmap occupied by an object from `p`.
> +
> +`bitmap_nr`:: The number of bit positions (including the one at
> + `bitmap_pos`) that encode objects from that pack `p`.
> +
> +`disjoint`:: Metadata, including whether or not the pack `p` is
> + ``disjoint''. The least significant bit stores whether or not the pack
> + is disjoint. The remaining bits are reserved for future use.
> +
> +For example, the `DISP` chunk corresponding to the above example (with
> +packs ``a'', ``b'', and ``c'') would look like:
> +
> +[cols="1,2,2,2"]
> +|===
> +| |`bitmap_pos` |`bitmap_nr` |`disjoint`
> +
> +|packfile ``a''
> +|`0`
> +|`10`
> +|`0x1`
> +
> +|packfile ``b''
> +|`10`
> +|`15`
> +|`0x1`
> +
> +|packfile ``c''
> +|`25`
> +|`20`
> +|`0x1`
> +|===
> +
> +With these constraints and information in place, we can treat each
> +packfile marked as disjoint as individually reusable in the same fashion
> +as verbatim pack reuse is performed on individual packs prior to the
> +implementation of the `DISP` chunk.
> +
> == cruft packs
>
> The cruft packs feature offer an alternative to Git's traditional mechanism of
> diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> index a72aebecaa..0f1dd4651d 100644
> --- a/builtin/multi-pack-index.c
> +++ b/builtin/multi-pack-index.c
> @@ -106,11 +106,17 @@ static int git_multi_pack_index_write_config(const char *var, const char *value,
> return 0;
> }
>
> +#define DISJOINT ((void*)(uintptr_t)1)
> +
> static void read_packs_from_stdin(struct string_list *to)
> {
> struct strbuf buf = STRBUF_INIT;
> - while (strbuf_getline(&buf, stdin) != EOF)
> - string_list_append(to, buf.buf);
> + while (strbuf_getline(&buf, stdin) != EOF) {
> + if (*buf.buf == '+')
> + string_list_append(to, buf.buf + 1)->util = DISJOINT;
> + else
> + string_list_append(to, buf.buf);
> + }
> string_list_sort(to);
>
> strbuf_release(&buf);
> diff --git a/midx.c b/midx.c
> index 591b3c636e..f55020072f 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -33,6 +33,7 @@
>
> #define MIDX_CHUNK_ALIGNMENT 4
> #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
> +#define MIDX_CHUNKID_DISJOINTPACKS 0x44495350 /* "DISP" */
> #define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
> #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
> #define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */
> @@ -182,6 +183,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
>
> pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets,
> &m->chunk_large_offsets_len);
> + pair_chunk(cf, MIDX_CHUNKID_DISJOINTPACKS,
> + (const unsigned char **)&m->chunk_disjoint_packs,
> + &m->chunk_disjoint_packs_len);
>
> if (git_env_bool("GIT_TEST_MIDX_READ_RIDX", 1))
> pair_chunk(cf, MIDX_CHUNKID_REVINDEX, &m->chunk_revindex,
> @@ -275,6 +279,23 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t
> return 0;
> }
>
> +int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m,
> + struct bitmapped_pack *bp, uint32_t pack_int_id)
> +{
> + if (!m->chunk_disjoint_packs)
> + return error(_("MIDX does not contain the DISP chunk"));
> +
> + if (prepare_midx_pack(r, m, pack_int_id))
> + return error(_("could not load disjoint pack %"PRIu32), pack_int_id);
> +
> + bp->p = m->packs[pack_int_id];
> + bp->bitmap_pos = get_be32(m->chunk_disjoint_packs + 3 * pack_int_id);
> + bp->bitmap_nr = get_be32(m->chunk_disjoint_packs + 3 * pack_int_id + 1);
> + bp->disjoint = !!get_be32(m->chunk_disjoint_packs + 3 * pack_int_id + 2);
> +
> + return 0;
> +}
> +
> int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result)
> {
> return bsearch_hash(oid->hash, m->chunk_oid_fanout, m->chunk_oid_lookup,
> @@ -457,11 +478,18 @@ static size_t write_midx_header(struct hashfile *f,
> return MIDX_HEADER_SIZE;
> }
>
> +#define BITMAP_POS_UNKNOWN (~((uint32_t)0))
> +
> struct pack_info {
> uint32_t orig_pack_int_id;
> char *pack_name;
> struct packed_git *p;
> - unsigned expired : 1;
> +
> + uint32_t bitmap_pos;
> + uint32_t bitmap_nr;
> +
> + unsigned expired : 1,
> + disjoint : 1;
> };
>
> static void fill_pack_info(struct pack_info *info,
> @@ -473,6 +501,7 @@ static void fill_pack_info(struct pack_info *info,
> info->orig_pack_int_id = orig_pack_int_id;
> info->pack_name = pack_name;
> info->p = p;
> + info->bitmap_pos = BITMAP_POS_UNKNOWN;
> }
>
> static int pack_info_compare(const void *_a, const void *_b)
> @@ -516,6 +545,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
> {
> struct write_midx_context *ctx = data;
> struct packed_git *p;
> + struct string_list_item *item = NULL;
>
> if (ends_with(file_name, ".idx")) {
> display_progress(ctx->progress, ++ctx->pack_paths_checked);
> @@ -534,11 +564,13 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
> * should be performed independently (likely checking
> * to_include before the existing MIDX).
> */
> - if (ctx->m && midx_contains_pack(ctx->m, file_name))
> - return;
> - else if (ctx->to_include &&
> - !string_list_has_string(ctx->to_include, file_name))
> + if (ctx->m && midx_contains_pack(ctx->m, file_name)) {
> return;
> + } else if (ctx->to_include) {
> + item = string_list_lookup(ctx->to_include, file_name);
> + if (!item)
> + return;
> + }
>
> ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
>
> @@ -559,6 +591,8 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
>
> fill_pack_info(&ctx->info[ctx->nr], p, xstrdup(file_name),
> ctx->nr);
> + if (item)
> + ctx->info[ctx->nr].disjoint = !!item->util;
> ctx->nr++;
> }
> }
> @@ -568,7 +602,8 @@ struct pack_midx_entry {
> uint32_t pack_int_id;
> time_t pack_mtime;
> uint64_t offset;
> - unsigned preferred : 1;
> + unsigned preferred : 1,
> + disjoint : 1;
> };
>
> static int midx_oid_compare(const void *_a, const void *_b)
> @@ -586,6 +621,12 @@ static int midx_oid_compare(const void *_a, const void *_b)
> if (a->preferred < b->preferred)
> return 1;
>
> + /* Sort objects in a disjoint pack last when multiple copies exist. */
> + if (a->disjoint < b->disjoint)
> + return -1;
> + if (a->disjoint > b->disjoint)
> + return 1;
> +
> if (a->pack_mtime > b->pack_mtime)
> return -1;
> else if (a->pack_mtime < b->pack_mtime)
> @@ -671,6 +712,7 @@ static void midx_fanout_add_midx_fanout(struct midx_fanout *fanout,
> &fanout->entries[fanout->nr],
> cur_object);
> fanout->entries[fanout->nr].preferred = 0;
> + fanout->entries[fanout->nr].disjoint = 0;
> fanout->nr++;
> }
> }
> @@ -696,6 +738,7 @@ static void midx_fanout_add_pack_fanout(struct midx_fanout *fanout,
> cur_object,
> &fanout->entries[fanout->nr],
> preferred);
> + fanout->entries[fanout->nr].disjoint = info->disjoint;
> fanout->nr++;
> }
> }
> @@ -764,14 +807,22 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
> * Take only the first duplicate.
> */
> for (cur_object = 0; cur_object < fanout.nr; cur_object++) {
> - if (cur_object && oideq(&fanout.entries[cur_object - 1].oid,
> - &fanout.entries[cur_object].oid))
> - continue;
> + struct pack_midx_entry *ours = &fanout.entries[cur_object];
> + if (cur_object) {
> + struct pack_midx_entry *prev = &fanout.entries[cur_object - 1];
> + if (oideq(&prev->oid, &ours->oid)) {
> + if (prev->disjoint && ours->disjoint)
> + die(_("duplicate object '%s' among disjoint packs '%s', '%s'"),
> + oid_to_hex(&prev->oid),
> + info[prev->pack_int_id].pack_name,
> + info[ours->pack_int_id].pack_name);
Shouldn't we die if `prev->disjoint || ours->disjoint` instead of `&&`?
Even if one of the packs isn't marked as disjoint, it's still wrong if
the other one is and one of its objects exists in multiple packs.
Or am I misunderstanding, and we only guarantee the disjoint property
across packfiles that are actually marked as such?
Patrick
> + continue;
> + }
> + }
>
> ALLOC_GROW(deduplicated_entries, st_add(*nr_objects, 1),
> alloc_objects);
> - memcpy(&deduplicated_entries[*nr_objects],
> - &fanout.entries[cur_object],
> + memcpy(&deduplicated_entries[*nr_objects], ours,
> sizeof(struct pack_midx_entry));
> (*nr_objects)++;
> }
> @@ -814,6 +865,27 @@ static int write_midx_pack_names(struct hashfile *f, void *data)
> return 0;
> }
>
> +static int write_midx_disjoint_packs(struct hashfile *f, void *data)
> +{
> + struct write_midx_context *ctx = data;
> + size_t i;
> +
> + for (i = 0; i < ctx->nr; i++) {
> + struct pack_info *pack = &ctx->info[i];
> + if (pack->expired)
> + continue;
> +
> + if (pack->bitmap_pos == BITMAP_POS_UNKNOWN && pack->bitmap_nr)
> + BUG("pack '%s' has no bitmap position, but has %d bitmapped object(s)",
> + pack->pack_name, pack->bitmap_nr);
> +
> + hashwrite_be32(f, pack->bitmap_pos);
> + hashwrite_be32(f, pack->bitmap_nr);
> + hashwrite_be32(f, !!pack->disjoint);
> + }
> + return 0;
> +}
> +
> static int write_midx_oid_fanout(struct hashfile *f,
> void *data)
> {
> @@ -981,8 +1053,19 @@ static uint32_t *midx_pack_order(struct write_midx_context *ctx)
> QSORT(data, ctx->entries_nr, midx_pack_order_cmp);
>
> ALLOC_ARRAY(pack_order, ctx->entries_nr);
> - for (i = 0; i < ctx->entries_nr; i++)
> + for (i = 0; i < ctx->entries_nr; i++) {
> + struct pack_midx_entry *e = &ctx->entries[data[i].nr];
> + struct pack_info *pack = &ctx->info[ctx->pack_perm[e->pack_int_id]];
> + if (pack->bitmap_pos == BITMAP_POS_UNKNOWN)
> + pack->bitmap_pos = i;
> + pack->bitmap_nr++;
> pack_order[i] = data[i].nr;
> + }
> + for (i = 0; i < ctx->nr; i++) {
> + struct pack_info *pack = &ctx->info[ctx->pack_perm[i]];
> + if (pack->bitmap_pos == BITMAP_POS_UNKNOWN)
> + pack->bitmap_pos = 0;
> + }
> free(data);
>
> trace2_region_leave("midx", "midx_pack_order", the_repository);
> @@ -1283,6 +1366,7 @@ static int write_midx_internal(const char *object_dir,
> struct hashfile *f = NULL;
> struct lock_file lk;
> struct write_midx_context ctx = { 0 };
> + int pack_disjoint_concat_len = 0;
> int pack_name_concat_len = 0;
> int dropped_packs = 0;
> int result = 0;
> @@ -1495,8 +1579,10 @@ static int write_midx_internal(const char *object_dir,
> }
>
> for (i = 0; i < ctx.nr; i++) {
> - if (!ctx.info[i].expired)
> - pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1;
> + if (ctx.info[i].expired)
> + continue;
> + pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1;
> + pack_disjoint_concat_len += 3 * sizeof(uint32_t);
> }
>
> /* Check that the preferred pack wasn't expired (if given). */
> @@ -1556,6 +1642,8 @@ static int write_midx_internal(const char *object_dir,
> add_chunk(cf, MIDX_CHUNKID_REVINDEX,
> st_mult(ctx.entries_nr, sizeof(uint32_t)),
> write_midx_revindex);
> + add_chunk(cf, MIDX_CHUNKID_DISJOINTPACKS,
> + pack_disjoint_concat_len, write_midx_disjoint_packs);
> }
>
> write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
> diff --git a/midx.h b/midx.h
> index a5d98919c8..cdd16a8378 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -7,6 +7,7 @@
> struct object_id;
> struct pack_entry;
> struct repository;
> +struct bitmapped_pack;
>
> #define GIT_TEST_MULTI_PACK_INDEX "GIT_TEST_MULTI_PACK_INDEX"
> #define GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP \
> @@ -33,6 +34,8 @@ struct multi_pack_index {
>
> const unsigned char *chunk_pack_names;
> size_t chunk_pack_names_len;
> + const uint32_t *chunk_disjoint_packs;
> + size_t chunk_disjoint_packs_len;
> const uint32_t *chunk_oid_fanout;
> const unsigned char *chunk_oid_lookup;
> const unsigned char *chunk_object_offsets;
> @@ -58,6 +61,8 @@ void get_midx_rev_filename(struct strbuf *out, struct multi_pack_index *m);
>
> struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
> int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
> +int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m,
> + struct bitmapped_pack *bp, uint32_t pack_int_id);
> int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
> off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos);
> uint32_t nth_midxed_pack_int_id(struct multi_pack_index *m, uint32_t pos);
> diff --git a/pack-bitmap.h b/pack-bitmap.h
> index 5273a6a019..b7fa1a42a9 100644
> --- a/pack-bitmap.h
> +++ b/pack-bitmap.h
> @@ -52,6 +52,15 @@ typedef int (*show_reachable_fn)(
>
> struct bitmap_index;
>
> +struct bitmapped_pack {
> + struct packed_git *p;
> +
> + uint32_t bitmap_pos;
> + uint32_t bitmap_nr;
> +
> + unsigned disjoint : 1;
> +};
> +
> struct bitmap_index *prepare_bitmap_git(struct repository *r);
> struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx);
> void count_bitmap_commit_list(struct bitmap_index *, uint32_t *commits,
> diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
> index e9a444ddba..4b44995dca 100644
> --- a/t/helper/test-read-midx.c
> +++ b/t/helper/test-read-midx.c
> @@ -100,10 +100,37 @@ static int read_midx_preferred_pack(const char *object_dir)
> return 0;
> }
>
> +static int read_midx_bitmapped_packs(const char *object_dir)
> +{
> + struct multi_pack_index *midx = NULL;
> + struct bitmapped_pack pack;
> + uint32_t i;
> +
> + setup_git_directory();
> +
> + midx = load_multi_pack_index(object_dir, 1);
> + if (!midx)
> + return 1;
> +
> + for (i = 0; i < midx->num_packs; i++) {
> + if (nth_bitmapped_pack(the_repository, midx, &pack, i) < 0)
> + return 1;
> +
> + printf("%s\n", pack_basename(pack.p));
> + printf(" bitmap_pos: %"PRIuMAX"\n", (uintmax_t)pack.bitmap_pos);
> + printf(" bitmap_nr: %"PRIuMAX"\n", (uintmax_t)pack.bitmap_nr);
> + printf(" disjoint: %s\n", pack.disjoint & 0x1 ? "yes" : "no");
> + }
> +
> + close_midx(midx);
> +
> + return 0;
> +}
> +
> int cmd__read_midx(int argc, const char **argv)
> {
> if (!(argc == 2 || argc == 3))
> - usage("read-midx [--show-objects|--checksum|--preferred-pack] <object-dir>");
> + usage("read-midx [--show-objects|--checksum|--preferred-pack|--bitmap] <object-dir>");
>
> if (!strcmp(argv[1], "--show-objects"))
> return read_midx_file(argv[2], 1);
> @@ -111,5 +138,7 @@ int cmd__read_midx(int argc, const char **argv)
> return read_midx_checksum(argv[2]);
> else if (!strcmp(argv[1], "--preferred-pack"))
> return read_midx_preferred_pack(argv[2]);
> + else if (!strcmp(argv[1], "--bitmap"))
> + return read_midx_bitmapped_packs(argv[2]);
> return read_midx_file(argv[1], 0);
> }
> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> index c4c6060cee..fd24e0c952 100755
> --- a/t/t5319-multi-pack-index.sh
> +++ b/t/t5319-multi-pack-index.sh
> @@ -1157,4 +1157,62 @@ test_expect_success 'reader notices too-small revindex chunk' '
> test_cmp expect.err err
> '
>
> +test_expect_success 'disjoint packs are stored via the DISP chunk' '
> + test_when_finished "rm -fr repo" &&
> + git init repo &&
> + (
> + cd repo &&
> +
> + for i in 1 2 3 4 5
> + do
> + test_commit "$i" &&
> + git repack -d || return 1
> + done &&
> +
> + find $objdir/pack -type f -name "*.idx" | xargs -n 1 basename | sort >packs &&
> +
> + git multi-pack-index write --stdin-packs <packs &&
> + test_must_fail test-tool read-midx --bitmap $objdir 2>err &&
> + cat >expect <<-\EOF &&
> + error: MIDX does not contain the DISP chunk
> + EOF
> + test_cmp expect err &&
> +
> + sed -e "s/^/+/g" packs >in &&
> + git multi-pack-index write --stdin-packs --bitmap \
> + --preferred-pack="$(head -n1 <packs)" <in &&
> + test-tool read-midx --bitmap $objdir >actual &&
> + for i in $(test_seq $(wc -l <packs))
> + do
> + sed -ne "${i}s/\.idx$/\.pack/p" packs &&
> + echo " bitmap_pos: $(( $(( $i - 1 )) * 3 ))" &&
> + echo " bitmap_nr: 3" &&
> + echo " disjoint: yes" || return 1
> + done >expect &&
> + test_cmp expect actual
> + )
> +'
> +
> +test_expect_success 'non-disjoint packs are detected' '
> + test_when_finished "rm -fr repo" &&
> + git init repo &&
> + (
> + cd repo &&
> +
> + test_commit base &&
> + git repack -d &&
> + test_commit other &&
> + git repack -a &&
> +
> + ls -la .git/objects/pack/ &&
> +
> + find $objdir/pack -type f -name "*.idx" |
> + sed -e "s/.*\/\(.*\)$/+\1/g" >in &&
> +
> + test_must_fail git multi-pack-index write --stdin-packs \
> + --bitmap <in 2>err &&
> + grep "duplicate object.* among disjoint packs" err
> + )
> +'
> +
> test_done
> --
> 2.43.0.24.g980b318f98
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH 04/24] midx: factor out `fill_pack_info()`
From: Patrick Steinhardt @ 2023-11-30 10:18 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <ccf1337305db60f1c8174e9b309e2a9e04ce1487.1701198172.git.me@ttaylorr.com>
[-- Attachment #1: Type: text/plain, Size: 4502 bytes --]
On Tue, Nov 28, 2023 at 02:08:05PM -0500, Taylor Blau wrote:
> When selecting which packfiles will be written while generating a MIDX,
> the MIDX internals fill out a 'struct pack_info' with various pieces of
> book-keeping.
>
> Instead of filling out each field of the `pack_info` structure
> individually in each of the two spots that modify the array of such
> structures (`ctx->info`), extract a common routine that does this for
> us.
>
> This reduces the code duplication by a modest amount. But more
> importantly, it zero-initializes the structure before assigning values
> into it. This hardens us for a future change which will add additional
> fields to this structure which (until this patch) was not
> zero-initialized.
>
> As a result, any new fields added to the `pack_info` structure need only
> be updated in a single location, instead of at each spot within midx.c.
>
> There are no functional changes in this patch.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
> midx.c | 35 +++++++++++++++++++----------------
> 1 file changed, 19 insertions(+), 16 deletions(-)
>
> diff --git a/midx.c b/midx.c
> index 3b727dc633..591b3c636e 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -464,6 +464,17 @@ struct pack_info {
> unsigned expired : 1;
> };
>
> +static void fill_pack_info(struct pack_info *info,
> + struct packed_git *p, char *pack_name,
> + uint32_t orig_pack_int_id)
> +{
> + memset(info, 0, sizeof(struct pack_info));
> +
> + info->orig_pack_int_id = orig_pack_int_id;
> + info->pack_name = pack_name;
> + info->p = p;
> +}
Nit: all callers manually call `xstrdup(pack_name)` and pass that to
`fill_pack_info()`. We could consider doing this in here instead so that
ownership of the string becomes a tad clearer.
> static int pack_info_compare(const void *_a, const void *_b)
> {
> struct pack_info *a = (struct pack_info *)_a;
> @@ -504,6 +515,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
> const char *file_name, void *data)
> {
> struct write_midx_context *ctx = data;
> + struct packed_git *p;
>
> if (ends_with(file_name, ".idx")) {
> display_progress(ctx->progress, ++ctx->pack_paths_checked);
> @@ -530,17 +542,14 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
>
> ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
>
> - ctx->info[ctx->nr].p = add_packed_git(full_path,
> - full_path_len,
> - 0);
> -
> - if (!ctx->info[ctx->nr].p) {
> + p = add_packed_git(full_path, full_path_len, 0);
> + if (!p) {
> warning(_("failed to add packfile '%s'"),
> full_path);
> return;
> }
>
> - if (open_pack_index(ctx->info[ctx->nr].p)) {
> + if (open_pack_index(p)) {
> warning(_("failed to open pack-index '%s'"),
> full_path);
> close_pack(ctx->info[ctx->nr].p);
Isn't `ctx->info[ctx->nr].p` still uninitialized at this point?
> @@ -548,9 +557,8 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
> return;
> }
>
> - ctx->info[ctx->nr].pack_name = xstrdup(file_name);
> - ctx->info[ctx->nr].orig_pack_int_id = ctx->nr;
> - ctx->info[ctx->nr].expired = 0;
> + fill_pack_info(&ctx->info[ctx->nr], p, xstrdup(file_name),
> + ctx->nr);
> ctx->nr++;
> }
> }
> @@ -1310,11 +1318,6 @@ static int write_midx_internal(const char *object_dir,
> for (i = 0; i < ctx.m->num_packs; i++) {
> ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
>
> - ctx.info[ctx.nr].orig_pack_int_id = i;
> - ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
> - ctx.info[ctx.nr].p = ctx.m->packs[i];
> - ctx.info[ctx.nr].expired = 0;
> -
> if (flags & MIDX_WRITE_REV_INDEX) {
> /*
> * If generating a reverse index, need to have
> @@ -1330,10 +1333,10 @@ static int write_midx_internal(const char *object_dir,
> if (open_pack_index(ctx.m->packs[i]))
> die(_("could not open index for %s"),
> ctx.m->packs[i]->pack_name);
> - ctx.info[ctx.nr].p = ctx.m->packs[i];
Just to make sure I'm not missing anything, but this assignment here was
basically redundant before this patch already, right?
Patrick
> }
>
> - ctx.nr++;
> + fill_pack_info(&ctx.info[ctx.nr++], ctx.m->packs[i],
> + xstrdup(ctx.m->pack_names[i]), i);
> }
> }
>
> --
> 2.43.0.24.g980b318f98
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH 02/24] pack-bitmap-write: deep-clear the `bb_commit` slab
From: Patrick Steinhardt @ 2023-11-30 10:18 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <6f5ff96998946f3f49da56fd05c096b949521339.1701198172.git.me@ttaylorr.com>
[-- Attachment #1: Type: text/plain, Size: 1799 bytes --]
On Tue, Nov 28, 2023 at 02:07:59PM -0500, Taylor Blau wrote:
> The `bb_commit` commit slab is used by the pack-bitmap-write machinery
> to track various pieces of bookkeeping used to generate reachability
> bitmaps.
>
> Even though we clear the slab when freeing the bitmap_builder struct
> (with `bitmap_builder_clear()`), there are still pointers which point to
> locations in memory that have not yet been freed, resulting in a leak.
>
> Plug the leak by introducing a suitable `free_fn` for the `struct
> bb_commit` type, and make sure it is called on each member of the slab
> via the `deep_clear_bb_data()` function.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
> pack-bitmap-write.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
> index f4ecdf8b0e..dd3a415b9d 100644
> --- a/pack-bitmap-write.c
> +++ b/pack-bitmap-write.c
> @@ -198,6 +198,13 @@ struct bb_commit {
> unsigned idx; /* within selected array */
> };
>
> +static void clear_bb_commit(struct bb_commit *commit)
> +{
> + free(commit->reverse_edges);
I'd have expected to see `free_commit_list()` here instead of a simple
free. Is there any reason why we don't use it?
Patrick
> + bitmap_free(commit->commit_mask);
> + bitmap_free(commit->bitmap);
> +}
> +
> define_commit_slab(bb_data, struct bb_commit);
>
> struct bitmap_builder {
> @@ -339,7 +346,7 @@ static void bitmap_builder_init(struct bitmap_builder *bb,
>
> static void bitmap_builder_clear(struct bitmap_builder *bb)
> {
> - clear_bb_data(&bb->data);
> + deep_clear_bb_data(&bb->data, clear_bb_commit);
> free(bb->commits);
> bb->commits_nr = bb->commits_alloc = 0;
> }
> --
> 2.43.0.24.g980b318f98
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH 00/24] pack-objects: multi-pack verbatim reuse
From: Patrick Steinhardt @ 2023-11-30 10:18 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <cover.1701198172.git.me@ttaylorr.com>
[-- Attachment #1: Type: text/plain, Size: 10142 bytes --]
On Tue, Nov 28, 2023 at 02:07:54PM -0500, Taylor Blau wrote:
> Back in fff42755ef (pack-bitmap: add support for bitmap indexes,
> 2013-12-21), we added support for reachability bitmaps, and taught
> pack-objects how to reuse verbatim chunks from the bitmapped pack. When
> multi-pack bitmaps were introduced, this pack-reuse mechanism evolved to
> use the MIDX's "preferred" pack as the source for verbatim reuse.
>
> This allows repositories to incrementally repack themselves (e.g., using
> a `--geometric` repack), storing the result in a MIDX, and generating a
> corresponding bitmap. This keeps our bitmap coverage up-to-date, while
> maintaining a relatively small number of packs.
>
> However, it is recommended (and matches what we do in production at
> GitHub) that repositories repack themselves all-into-one, and
> generate a corresponding single-pack reachability bitmap. This is done
> for a couple of reasons, but the most relevant one to this series is
> that it enables us to perform verbatim pack-reuse over a complete copy
> of the repository, since the entire repository resides in a single pack
> (and thus is eligible for verbatim pack-reuse).
>
> As repositories grow larger, packing their contents into a single pack
> becomes less feasible. This series extends the pack-reuse mechanism to
> operate over multiple packs which are known ahead of time to be disjoint
> with respect to one another's set of objects.
>
> The implementation has a few components:
>
> - A new MIDX chunk called "Disjoint packfiles" or DISP is introduced
> to keep track of the bitmap position, number of objects, and
> disjointed-ness for each pack contained in the MIDX.
>
> - A new mode for `git multi-pack-index write --stdin-packs` that
> allows specifying disjoint packs, as well as a new option
> `--retain-disjoint` which preserves the set of existing disjoint
> packs in the new MIDX.
>
> - A new pack-objects mode `--ignore-disjoint`, which produces packs
> which are disjoint with respect to the current set of disjoint packs
> (i.e. it discards any objects from the packing list which appear in
> any of the known-disjoint packs).
>
> - A new repack mode, `--extend-disjoint` which causes any new pack(s)
> which are generated to be disjoint with respect to the set of packs
> currently marked as disjoint, minus any pack(s) which are about to
> be deleted.
>
> With all of that in place, the patch series then rewrites all of the
> pack-reuse functions in terms of the new `bitmapped_pack` structure.
> Once we have dropped all of the assumptions stemming from only
> performing pack-reuse over a single candidate pack, we can then enable
> reuse over all of the disjoint packs.
>
> In addition to the many new tests in t5332 added by that series, I tried
> to simulate a "real world" test on git.git by breaking the repository
> into chunks of 1,000 commits (plus their set of reachable objects not
> reachable from earlier chunk(s)) and packing those chunks. This produces
> a large number of packs with the objects from git.git which are known to
> be disjoint with respect to one another.
>
> $ git clone git@github.com:git/git.git base
>
> $ cd base
> $ mv .git/objects/pack/pack-*.idx{,.bak}
> $ git unpack-objects <.git/objects/pack/pack-*.pack
>
> # pack the objects from each successive block of 1k commits
> $ for rev in $(git rev-list --all | awk '(NR) % 1000 == 0' | tac)
> do
> echo $rev |
> git.compile pack-objects --revs --unpacked .git/objects/pack/pack || return 1
> done
> # and grab any stragglers, pruning the unpacked objects
> $ git repack -d
> I then constructed a MIDX and corresponding bitmap
>
> $ find_pack () {
> for idx in .git/objects/pack/pack-*.idx
> do
> git show-index <$idx | grep -q "$1" && basename $idx
> done
> }
> $ preferred="$(find_pack $(git rev-parse HEAD))"
>
> $ ( cd .git/objects/pack && ls -1 *.idx ) | sed -e 's/^/+/g' |
> git.compile multi-pack-index write --bitmap --stdin-packs \
> --preferred-pack=$preferred
> $ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in
>
> With all of that in place, I was able to produce a significant speed-up
> by reusing objects from multiple packs:
>
> $ hyperfine -L v single,multi -n '{v}-pack reuse' 'git.compile -c pack.allowPackReuse={v} pack-objects --revs --stdout --use-bitmap-index --delta-base-offset <in >/dev/null'
> Benchmark 1: single-pack reuse
> Time (mean ± σ): 6.094 s ± 0.023 s [User: 43.723 s, System: 0.358 s]
> Range (min … max): 6.063 s … 6.126 s 10 runs
>
> Benchmark 2: multi-pack reuse
> Time (mean ± σ): 906.5 ms ± 3.2 ms [User: 1081.5 ms, System: 30.9 ms]
> Range (min … max): 903.5 ms … 912.7 ms 10 runs
>
> Summary
> multi-pack reuse ran
> 6.72 ± 0.03 times faster than single-pack reuse
>
> (There are corresponding tests in p5332 that test different sized chunks
> and measure the runtime performance as well as resulting pack size).
>
> Performing verbatim pack reuse naturally trades off between CPU time and
> the resulting pack size. In the above example, the single-pack reuse
> case produces a clone size of ~194 MB on my machine, while the
> multi-pack reuse case produces a clone size closer to ~266 MB, which is
> a ~37% increase in clone size.
Quite exciting, and a tradeoff that may be worth it for Git hosters. I
expect that this is going to be an extreme example of the benefits
provided by your patch series -- do you by any chance also have "real"
numbers that make it possible to quantify the effect a bit better?
No worry if you don't, I'm just curious.
> I think there is still some opportunity to close this gap, since the
> "packing" strategy here is extremely naive. In a production setting, I'm
> sure that there are more well thought out repacking strategies that
> would produce more similar clone sizes.
>
> I considered breaking this series up into smaller chunks, but was
> unsatisfied with the result. Since this series is rather large, if you
> have alternate suggestions on better ways to structure this, please let
> me know.
The series is indeed very involved to review. I only made it up to patch
8/24 and already spent quite some time on it. So I'd certainly welcome
it if this was split up into smaller parts, but don't have a suggestion
as to how this should be done (also because I didn't yet read the other
16 patches).
I'll review the remaining patches at a later point in time.
Patrick
> Thanks in advance for your review!
>
> Taylor Blau (24):
> pack-objects: free packing_data in more places
> pack-bitmap-write: deep-clear the `bb_commit` slab
> pack-bitmap: plug leak in find_objects()
> midx: factor out `fill_pack_info()`
> midx: implement `DISP` chunk
> midx: implement `midx_locate_pack()`
> midx: implement `--retain-disjoint` mode
> pack-objects: implement `--ignore-disjoint` mode
> repack: implement `--extend-disjoint` mode
> pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
> pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
> pack-bitmap: return multiple packs via
> `reuse_partial_packfile_from_bitmap()`
> pack-objects: parameterize pack-reuse routines over a single pack
> pack-objects: keep track of `pack_start` for each reuse pack
> pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
> pack-objects: prepare `write_reused_pack()` for multi-pack reuse
> pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack
> reuse
> pack-objects: include number of packs reused in output
> pack-bitmap: prepare to mark objects from multiple packs for reuse
> pack-objects: add tracing for various packfile metrics
> t/test-lib-functions.sh: implement `test_trace2_data` helper
> pack-objects: allow setting `pack.allowPackReuse` to "single"
> pack-bitmap: reuse objects from all disjoint packs
> t/perf: add performance tests for multi-pack reuse
>
> Documentation/config/pack.txt | 8 +-
> Documentation/git-multi-pack-index.txt | 12 ++
> Documentation/git-pack-objects.txt | 8 +
> Documentation/git-repack.txt | 12 ++
> Documentation/gitformat-pack.txt | 109 ++++++++++
> builtin/multi-pack-index.c | 13 +-
> builtin/pack-objects.c | 200 +++++++++++++++----
> builtin/repack.c | 57 +++++-
> midx.c | 218 +++++++++++++++++---
> midx.h | 11 +-
> pack-bitmap-write.c | 9 +-
> pack-bitmap.c | 265 ++++++++++++++++++++-----
> pack-bitmap.h | 18 +-
> pack-objects.c | 15 ++
> pack-objects.h | 1 +
> t/helper/test-read-midx.c | 31 ++-
> t/lib-disjoint.sh | 49 +++++
> t/perf/p5332-multi-pack-reuse.sh | 81 ++++++++
> t/t5319-multi-pack-index.sh | 140 +++++++++++++
> t/t5331-pack-objects-stdin.sh | 156 +++++++++++++++
> t/t5332-multi-pack-reuse.sh | 219 ++++++++++++++++++++
> t/t6113-rev-list-bitmap-filters.sh | 2 +
> t/t7700-repack.sh | 4 +-
> t/t7705-repack-extend-disjoint.sh | 142 +++++++++++++
> t/test-lib-functions.sh | 14 ++
> 25 files changed, 1650 insertions(+), 144 deletions(-)
> create mode 100644 t/lib-disjoint.sh
> create mode 100755 t/perf/p5332-multi-pack-reuse.sh
> create mode 100755 t/t5332-multi-pack-reuse.sh
> create mode 100755 t/t7705-repack-extend-disjoint.sh
>
>
> base-commit: 564d0252ca632e0264ed670534a51d18a689ef5d
> --
> 2.43.0.24.g980b318f98
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH 01/24] pack-objects: free packing_data in more places
From: Patrick Steinhardt @ 2023-11-30 10:18 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <101d6a2841a909aa9c99225ebf1b65695cf116cb.1701198172.git.me@ttaylorr.com>
[-- Attachment #1: Type: text/plain, Size: 3185 bytes --]
On Tue, Nov 28, 2023 at 02:07:57PM -0500, Taylor Blau wrote:
> The pack-objects internals use a packing_data struct to track what
> objects are part of the pack(s) being formed.
>
> Since these structures contain allocated fields, failing to
> appropriately free() them results in a leak. Plug that leak by
> introducing a free_packing_data() function, and call it in the
> appropriate spots.
>
> This is a fairly straightforward leak to plug, since none of the callers
> expect to read any values or have any references to parts of the address
> space being freed.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
> builtin/pack-objects.c | 1 +
> midx.c | 5 +++++
> pack-objects.c | 15 +++++++++++++++
> pack-objects.h | 1 +
> 4 files changed, 22 insertions(+)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 89a8b5a976..bfa60359d4 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -4522,6 +4522,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
> reuse_packfile_objects);
>
> cleanup:
> + free_packing_data(&to_pack);
> list_objects_filter_release(&filter_options);
> strvec_clear(&rp);
>
> diff --git a/midx.c b/midx.c
> index 2f3863c936..3b727dc633 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -1592,8 +1592,13 @@ static int write_midx_internal(const char *object_dir,
> flags) < 0) {
> error(_("could not write multi-pack bitmap"));
> result = 1;
> + free_packing_data(&pdata);
> + free(commits);
> goto cleanup;
> }
> +
> + free_packing_data(&pdata);
> + free(commits);
> }
> /*
> * NOTE: Do not use ctx.entries beyond this point, since it might
> diff --git a/pack-objects.c b/pack-objects.c
> index f403ca6986..1c7bedcc94 100644
> --- a/pack-objects.c
> +++ b/pack-objects.c
> @@ -151,6 +151,21 @@ void prepare_packing_data(struct repository *r, struct packing_data *pdata)
> init_recursive_mutex(&pdata->odb_lock);
> }
>
> +void free_packing_data(struct packing_data *pdata)
Nit: shouldn't this rather be called `clear_packing_data`? `free` to me
indicates that the data structure itself will be free'd, as well, which
is not the case.
Patrick
> +{
> + if (!pdata)
> + return;
> +
> + free(pdata->cruft_mtime);
> + free(pdata->in_pack);
> + free(pdata->in_pack_by_idx);
> + free(pdata->in_pack_pos);
> + free(pdata->index);
> + free(pdata->layer);
> + free(pdata->objects);
> + free(pdata->tree_depth);
> +}
> +
> struct object_entry *packlist_alloc(struct packing_data *pdata,
> const struct object_id *oid)
> {
> diff --git a/pack-objects.h b/pack-objects.h
> index 0d78db40cb..336217e8cd 100644
> --- a/pack-objects.h
> +++ b/pack-objects.h
> @@ -169,6 +169,7 @@ struct packing_data {
> };
>
> void prepare_packing_data(struct repository *r, struct packing_data *pdata);
> +void free_packing_data(struct packing_data *pdata);
>
> /* Protect access to object database */
> static inline void packing_data_lock(struct packing_data *pdata)
> --
> 2.43.0.24.g980b318f98
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH 0/4] refs: improve handling of special refs
From: Patrick Steinhardt @ 2023-11-30 7:46 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, hanwenn
In-Reply-To: <ZWe3z59WGE0+8gXN@nand.local>
[-- Attachment #1: Type: text/plain, Size: 1104 bytes --]
On Wed, Nov 29, 2023 at 05:14:39PM -0500, Taylor Blau wrote:
> On Wed, Nov 29, 2023 at 09:14:07AM +0100, Patrick Steinhardt wrote:
> > Patrick Steinhardt (4):
> > wt-status: read HEAD and ORIG_HEAD via the refdb
> > refs: propagate errno when reading special refs fails
> > refs: complete list of special refs
> > bisect: consistently write BISECT_EXPECTED_REV via the refdb
> >
> > bisect.c | 25 +++------------
> > builtin/bisect.c | 8 ++---
> > refs.c | 64 +++++++++++++++++++++++++++++++++++--
> > t/t1403-show-ref.sh | 9 ++++++
> > t/t6030-bisect-porcelain.sh | 2 +-
> > wt-status.c | 17 +++++-----
> > 6 files changed, 86 insertions(+), 39 deletions(-)
>
> These all look pretty good to me. I had a few minor questions on the
> first three patches, but I don't think they necessarily require a
> reroll.
I agree that none of the comments really require a reroll, but I'll
address them if there will be another iteration of this patch series.
Thanks for your review!
Patrick
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH 3/4] refs: complete list of special refs
From: Patrick Steinhardt @ 2023-11-30 7:44 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, hanwenn
In-Reply-To: <ZWe0RzOoHI9QZMox@nand.local>
[-- Attachment #1: Type: text/plain, Size: 5442 bytes --]
On Wed, Nov 29, 2023 at 04:59:35PM -0500, Taylor Blau wrote:
> On Wed, Nov 29, 2023 at 09:14:20AM +0100, Patrick Steinhardt wrote:
> > We have some references that are more special than others. The reason
> > for them being special is that they either do not follow the usual
> > format of references, or that they are written to the filesystem
> > directly by the respective owning subsystem and thus circumvent the
> > reference backend.
> >
> > This works perfectly fine right now because the reffiles backend will
> > know how to read those refs just fine. But with the prospect of gaining
> > a new reference backend implementation we need to be a lot more careful
> > here:
> >
> > - We need to make sure that we are consistent about how those refs are
> > written. They must either always be written via the filesystem, or
> > they must always be written via the reference backend. Any mixture
> > will lead to inconsistent state.
> >
> > - We need to make sure that such special refs are always handled
> > specially when reading them.
> >
> > We're already mostly good with regard to the first item, except for
> > `BISECT_EXPECTED_REV` which will be addressed in a subsequent commit.
> > But the current list of special refs is missing a lot of refs that
> > really should be treated specially. Right now, we only treat
> > `FETCH_HEAD` and `MERGE_HEAD` specially here.
> >
> > Introduce a new function `is_special_ref()` that contains all current
> > instances of special refs to fix the reading path.
> >
> > Based-on-patch-by: Han-Wen Nienhuys <hanwenn@gmail.com>
> > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > ---
> > refs.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 56 insertions(+), 2 deletions(-)
> >
> > diff --git a/refs.c b/refs.c
> > index 7d4a057f36..2d39d3fe80 100644
> > --- a/refs.c
> > +++ b/refs.c
> > @@ -1822,15 +1822,69 @@ static int refs_read_special_head(struct ref_store *ref_store,
> > return result;
> > }
> >
> > +static int is_special_ref(const char *refname)
> > +{
> > + /*
> > + * Special references get written and read directly via the filesystem
> > + * by the subsystems that create them. Thus, they must not go through
> > + * the reference backend but must instead be read directly. It is
> > + * arguable whether this behaviour is sensible, or whether it's simply
> > + * a leaky abstraction enabled by us only having a single reference
> > + * backend implementation. But at least for a subset of references it
> > + * indeed does make sense to treat them specially:
> > + *
> > + * - FETCH_HEAD may contain multiple object IDs, and each one of them
> > + * carries additional metadata like where it came from.
> > + *
> > + * - MERGE_HEAD may contain multiple object IDs when merging multiple
> > + * heads.
> > + *
> > + * - "rebase-apply/" and "rebase-merge/" contain all of the state for
> > + * rebases, where keeping it closely together feels sensible.
> > + *
> > + * There are some exceptions that you might expect to see on this list
> > + * but which are handled exclusively via the reference backend:
> > + *
> > + * - CHERRY_PICK_HEAD
> > + * - HEAD
> > + * - ORIG_HEAD
> > + *
> > + * Writing or deleting references must consistently go either through
> > + * the filesystem (special refs) or through the reference backend
> > + * (normal ones).
> > + */
> > + const char * const special_refs[] = {
> > + "AUTO_MERGE",
> > + "BISECT_EXPECTED_REV",
> > + "FETCH_HEAD",
> > + "MERGE_AUTOSTASH",
> > + "MERGE_HEAD",
> > + };
>
> Is there a reason that we don't want to declare this statically? If we
> did, I think we could drop one const, since the strings would instead
> reside in the .rodata section.
Not really, no.
> > + int i;
>
> Not that it matters for this case, but it may be worth declaring i to be
> an unsigned type, since it's used as an index into an array. size_t
> seems like an appropriate choice there.
Hm. We do use `int` almost everywhere when iterating through an array
via `ARRAY_SIZE`, but ultimately I don't mind whether it's `int`,
`unsigned` or `size_t`.
> > + for (i = 0; i < ARRAY_SIZE(special_refs); i++)
> > + if (!strcmp(refname, special_refs[i]))
> > + return 1;
> > +
> > + /*
> > + * git-rebase(1) stores its state in `rebase-apply/` or
> > + * `rebase-merge/`, including various reference-like bits.
> > + */
> > + if (starts_with(refname, "rebase-apply/") ||
> > + starts_with(refname, "rebase-merge/"))
>
> Do we care about case sensitivity here? Definitely not on case-sensitive
> filesystems, but I'm not sure about case-insensitive ones. For instance,
> on macOS, I can do:
>
> $ git rev-parse hEAd
>
> and get the same value as "git rev-parse HEAD" (on my Linux workstation,
> this fails as expected).
>
> I doubt that there are many users in the wild asking to resolve
> reBASe-APPLY/xyz, but I think that after this patch that would no longer
> work as-is, so we may want to replace this with istarts_with() instead.
In practice I'd argue that nobody is ever going to ask for something in
`rebase-apply/` outside of Git internals or scripts, and I'd expect
these to always use proper casing. So I rather lean towards a "no, we
don't care about case sensitivity".
Patrick
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox