* Re: Fix git-send-email.perl w.r.t. recent Getopt::Long update
From: Todd Zullinger @ 2023-11-28 2:07 UTC (permalink / raw)
To: H.Merijn Brand; +Cc: Junio C Hamano, git
In-Reply-To: <20231127093810.2092fe1d@pc09>
Hi,
H.Merijn Brand wrote:
> On Mon, 27 Nov 2023 09:58:52 +0900, Junio C Hamano <gitster@pobox.com> wrote:
>> One downside of unconditional upgrade of the call is, of course,
>> that it would no longer work for those with older Getopt::Long that
>> did not support the "!" suffix. Fortunately, Getopt::Long 2.33
>> started shipping with Perl 5.8.1 that is more than 20 years old, so
>> with the series we accepted, we also have a change to bump the
>> required version of Perl from 5.8.0 to 5.8.1 to make it clear that
>> it is deliberate that we drop the support for anything older at the
>> same time.
>
> The is a no-issue ...
>
> Just the 'use Getopt::Long' is enough to guarantee a working version:
>
> The '!' was already implemented in version 2.10 (April 1997):
> --8<---
> =item !
>
> Option does not take an argument and may be negated, i.e. prefixed by
> "no". E.g. "foo!" will allow B<--foo> (with value 1) and B<-nofoo>
> (with value 0).
> The option variable will be set to 1, or 0 if negated.
> -->8---
The real issue is the lack of support for the '--no-' prefix
when used with the '!' parameter. The '--no-' form is what
has always been documented by git-send-email(1). It was not
supported until Getopt::Long 2.33, included in perl 5.8.1.
Prior, 'foo!' provided --foo and --nofoo but not --no-foo.
But as Junio said, we can accept requiring a perl which was
released sometime in the past 2 decades in order to run the
most recent git release. ;)
Thanks for noticing this and sending a patch!
--
Todd
^ permalink raw reply
* Re: [PATCH] doc: make the gitfile syntax easier to discover
From: Marcel Krause @ 2023-11-28 5:55 UTC (permalink / raw)
To: git mailing list; +Cc: Junio C Hamano, Marcel Krause
In-Reply-To: <70125a8e-57ed-2ac6-1260-2aaa10cbc851@pimpmybyte.de>
Sorry for being slow on understanding your "git worktree add" hint.
I agree that discoverability of that command is at least as important
as discoverability of the syntax. I'll add a line about that, too.
^ permalink raw reply
* [PATCH] doc: make the gitfile syntax easier to discover
From: Marcel Krause @ 2023-11-28 6:48 UTC (permalink / raw)
To: git mailing list; +Cc: Junio C Hamano, Marcel Krause
In-Reply-To: <182d290a-86e5-b361-87a1-6860641fc726@pimpmybyte.de>
---
Documentation/gitrepository-layout.txt | 4 +++-
Documentation/glossary-content.txt | 2 ++
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/Documentation/gitrepository-layout.txt b/Documentation/gitrepository-layout.txt
index 1a2ef4c150..949cd8a31e 100644
--- a/Documentation/gitrepository-layout.txt
+++ b/Documentation/gitrepository-layout.txt
@@ -23,7 +23,9 @@ A Git repository comes in two different flavours:
*Note*: Also you can have a plain text file `.git` at the root of
your working tree, containing `gitdir: <path>` to point at the real
-directory that has the repository. This mechanism is often used for
+directory that has the repository.
+This mechanism is called a 'gitfile' and is usually managed via the
+`git submodule` and `git worktree` commands. It is often used for
a working tree of a submodule checkout, to allow you in the
containing superproject to `git checkout` a branch that does not
have the submodule. The `checkout` has to remove the entire
diff --git a/Documentation/glossary-content.txt b/Documentation/glossary-content.txt
index 5a537268e2..3f912e7bb2 100644
--- a/Documentation/glossary-content.txt
+++ b/Documentation/glossary-content.txt
@@ -184,6 +184,8 @@ current branch integrates with) obviously do not work, as there is no
[[def_gitfile]]gitfile::
A plain file `.git` at the root of a working tree that
points at the directory that is the real repository.
+ For proper use see linkgit:git-worktree[1] or linkgit:git-submodule[1].
+ For syntax see linkgit:gitrepository-layout[5].
[[def_grafts]]grafts::
Grafts enables two otherwise different lines of development to be joined
--
2.25.1
^ permalink raw reply related
* [PATCH] doc: make the gitfile syntax easier to discover
From: Marcel Krause @ 2023-11-28 6:55 UTC (permalink / raw)
To: git mailing list; +Cc: Junio C Hamano, Marcel Krause
In-Reply-To: <182d290a-86e5-b361-87a1-6860641fc726@pimpmybyte.de>
Signed-off-by: Marcel Krause <mk+copyleft@pimpmybyte.de>
---
Documentation/gitrepository-layout.txt | 4 +++-
Documentation/glossary-content.txt | 2 ++
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/Documentation/gitrepository-layout.txt b/Documentation/gitrepository-layout.txt
index 1a2ef4c150..949cd8a31e 100644
--- a/Documentation/gitrepository-layout.txt
+++ b/Documentation/gitrepository-layout.txt
@@ -23,7 +23,9 @@ A Git repository comes in two different flavours:
*Note*: Also you can have a plain text file `.git` at the root of
your working tree, containing `gitdir: <path>` to point at the real
-directory that has the repository. This mechanism is often used for
+directory that has the repository.
+This mechanism is called a 'gitfile' and is usually managed via the
+`git submodule` and `git worktree` commands. It is often used for
a working tree of a submodule checkout, to allow you in the
containing superproject to `git checkout` a branch that does not
have the submodule. The `checkout` has to remove the entire
diff --git a/Documentation/glossary-content.txt b/Documentation/glossary-content.txt
index 5a537268e2..3f912e7bb2 100644
--- a/Documentation/glossary-content.txt
+++ b/Documentation/glossary-content.txt
@@ -184,6 +184,8 @@ current branch integrates with) obviously do not work, as there is no
[[def_gitfile]]gitfile::
A plain file `.git` at the root of a working tree that
points at the directory that is the real repository.
+ For proper use see linkgit:git-worktree[1] or linkgit:git-submodule[1].
+ For syntax see linkgit:gitrepository-layout[5].
[[def_grafts]]grafts::
Grafts enables two otherwise different lines of development to be joined
--
2.25.1
^ permalink raw reply related
* Re: --end-of-options inconsistently available?!
From: Sven Strickroth @ 2023-11-28 8:40 UTC (permalink / raw)
To: Jeff King; +Cc: git
In-Reply-To: <20231127212254.GA87495@coredump.intra.peff.net>
Am 27.11.2023 um 22:22 schrieb Jeff King:
>> $ git rev-parse --symbolic-full-name --end-of-options master
>> --end-of-options
>> refs/heads/master
>>
>> Here, the output also contains "--end-of-options" as if it is a reference
>> (same for "--")
>
> This one is intentional. rev-parse in its default mode is not just
> spitting out revisions, but also options that are meant to be passed
> along to the revision machinery via other commands (like rev-list). So
> for example:
>
> $ git rev-parse --foo HEAD
> --foo
> 564d0252ca632e0264ed670534a51d18a689ef5d
>
> And it does understand end-of-options explicitly, so:
>
> $ git rev-parse --end-of-options --foo --
> --end-of-options
> fatal: bad revision '--foo'
>
> If you just want to parse a name robustly, use --verify.
I would expect that -- and --end-of-options are handled in a special way
here so that rev-parse can also be used in scripts. I need to check
whether --verify works for me (from the manual I thought I need to
specify full reference names).
>> $ git checkout -f --end-of-options HEAD~1 -- afile.txt
>> fatal: only one reference expected, 2 given.
>
> I think this is the same KEEP_DASHDASH problem as with git-reset.
I also found another problem:
$ git format-patch --end-of-options -1
fatal: option '-1' must come before non-option arguments
Where -1 is the number of commits here...
Best,
Sven
^ permalink raw reply
* Re: [PATCH] i18n: factorize even more 'incompatible options' messages
From: Patrick Steinhardt @ 2023-11-28 12:33 UTC (permalink / raw)
To: René Scharfe
Cc: Eric Sunshine, Jean-Noël Avila, Junio C Hamano, Git List
In-Reply-To: <d1f28272-635d-4638-b0f4-76d64013b0d5@web.de>
[-- Attachment #1: Type: text/plain, Size: 4338 bytes --]
On Mon, Nov 27, 2023 at 06:39:41PM +0100, René Scharfe wrote:
> Am 26.11.23 um 18:49 schrieb Eric Sunshine:
> > On Sun, Nov 26, 2023 at 6:57 AM René Scharfe <l.s.r@web.de> wrote:
> >> Continue the work of 12909b6b8a (i18n: turn "options are incompatible"
> >> into "cannot be used together", 2022-01-05) and a699367bb8 (i18n:
> >> factorize more 'incompatible options' messages, 2022-01-31) to use the
> >> same parameterized error message for reporting incompatible command line
> >> options. This reduces the number of strings to translate and makes the
> >> UI slightly more consistent.
> >
> > Thanks for tackling this.
> >
> > A couple additional instances recently slipped into `show-ref.c` which
> > were caught during review[1,2] but nevertheless made it to
> > "master"[3,4]. This patch, of course, doesn't need to address those,
> > but if rerolling for some other reason, perhaps they can be included,
> > as well(?).
Ah, I wasn't aware of these new wrappers, either. The below patch looks
good to me, thanks for the fixup.
Patrick
> > [1]: https://lore.kernel.org/git/CAPig+cSrp7vZuy7D_ENHKZKZzF4OSmCtfYNHPGMtS1Hj6gArDw@mail.gmail.com/
> > [2]: https://lore.kernel.org/git/CAPig+cRTOMie0rUf=Mhbo9e2EXf-_2kQyMeqpB9OCRB1MZZ1rw@mail.gmail.com/
> > [3]: 199970e72f (builtin/show-ref: ensure mutual exclusiveness of
> > subcommands, 2023-10-31)
> > [4]: 9080a7f178 (builtin/show-ref: add new mode to check for reference
> > existence, 2023-10-31)
>
> [4] changes the message added by [3], so that's one instance, right?
>
> --- >8 ---
> Subject: [PATCH] show-ref: use die_for_incompatible_opt3()
>
> Use the standard message for reporting the use of multiple mutually
> exclusive options by calling die_for_incompatible_opt3() instead of
> rolling our own. This has the benefits of showing only the actually
> given options, reducing the number of strings to translate and making
> the UI slightly more consistent.
>
> Adjust the test to no longer insist on a specific order of the
> reported options, as this implementation detail does not affect the
> usefulness of the error message.
>
> Reported-by: Eric Sunshine <sunshine@sunshineco.com>
> Signed-off-by: René Scharfe <l.s.r@web.de>
> ---
> builtin/show-ref.c | 6 +++---
> t/t1403-show-ref.sh | 16 +++++++++-------
> 2 files changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/builtin/show-ref.c b/builtin/show-ref.c
> index 7aac525a87..59d2291cbf 100644
> --- a/builtin/show-ref.c
> +++ b/builtin/show-ref.c
> @@ -315,9 +315,9 @@ int cmd_show_ref(int argc, const char **argv, const char *prefix)
> argc = parse_options(argc, argv, prefix, show_ref_options,
> show_ref_usage, 0);
>
> - if ((!!exclude_existing_opts.enabled + !!verify + !!exists) > 1)
> - die(_("only one of '%s', '%s' or '%s' can be given"),
> - "--exclude-existing", "--verify", "--exists");
> + die_for_incompatible_opt3(exclude_existing_opts.enabled, "--exclude-existing",
> + verify, "--verify",
> + exists, "--exists");
>
> if (exclude_existing_opts.enabled)
> return cmd_show_ref__exclude_existing(&exclude_existing_opts);
> diff --git a/t/t1403-show-ref.sh b/t/t1403-show-ref.sh
> index b50ae6fcf1..d477689e33 100755
> --- a/t/t1403-show-ref.sh
> +++ b/t/t1403-show-ref.sh
> @@ -197,18 +197,20 @@ test_expect_success 'show-ref --verify with dangling ref' '
> '
>
> test_expect_success 'show-ref sub-modes are mutually exclusive' '
> - cat >expect <<-EOF &&
> - fatal: only one of ${SQ}--exclude-existing${SQ}, ${SQ}--verify${SQ} or ${SQ}--exists${SQ} can be given
> - EOF
> -
> test_must_fail git show-ref --verify --exclude-existing 2>err &&
> - test_cmp expect err &&
> + grep "verify" err &&
> + grep "exclude-existing" err &&
> + grep "cannot be used together" err &&
>
> test_must_fail git show-ref --verify --exists 2>err &&
> - test_cmp expect err &&
> + grep "verify" err &&
> + grep "exists" err &&
> + grep "cannot be used together" err &&
>
> test_must_fail git show-ref --exclude-existing --exists 2>err &&
> - test_cmp expect err
> + grep "exclude-existing" err &&
> + grep "exists" err &&
> + grep "cannot be used together" err
> '
>
> test_expect_success '--exists with existing reference' '
> --
> 2.43.0
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH] completion: add and use the __git_get_config_subsection helper function
From: SZEDER Gábor @ 2023-11-28 12:40 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Philippe Blain
In-Reply-To: <xmqqzfzh16qp.fsf@gitster.g>
On Tue, Nov 14, 2023 at 10:08:46AM +0900, Junio C Hamano wrote:
> SZEDER Gábor <szeder.dev@gmail.com> writes:
>
> > +# Lists all subsections in the given section which contain the given
> > +# config variable, with the section and variable names removed.
> > +__git_get_config_subsections ()
> > +{
> > + local section="$1" var="$2" i IFS=$'\n'
> > + for i in $(__git config --name-only --get-regexp "^$section\..*\.$var$"); do
> > + i=${i#$section.}
> > + i=${i%.$var}
>
> As this script is allowed bash-isms, I wondered if we can use
> a single pattern substitution instead of two remove pre/suffix
> pattern substitution, but I guess it would not work, and the above
> is perfectly readable.
Yeah, I don't think it's possible to remove the prefix and suffix with
Bash builtins in a single operation.
> > + echo "$i"
>
> As the subsection is designed to contain unbounded set of end-user
> controlled names, we probably should do
>
> printf "%s\n" "$i"
>
> instead to protect us from interesting names (e.g. ones that begin
> with a dash).
Indeed, will do.
> > + done
> > +}
>
> Interesting to see that we do not need to bother deduplicating the
> output from here.
Bash will sort and deduplicate the completion words anyway, so we
don't have to. Sometimes we do deduplicate them, though, but either
to make testing easier or for performance reasons; in this case
neither of them applies.
> > diff --git a/t/t9902-completion.sh b/t/t9902-completion.sh
> > index a7c3b4eb63..11ed83d0ed 100755
> > --- a/t/t9902-completion.sh
> > +++ b/t/t9902-completion.sh
> > @@ -2130,6 +2130,19 @@ test_expect_success '__git_get_config_variables' '
> > test_cmp expect actual
> > '
> >
> > +test_expect_success '__git_get_config_subsections' '
> > + cat >expect <<-\EOF &&
> > + subsection-1
> > + SubSection-2
> > + sub.section.3
> > + EOF
> > + test_config interesting.subsection-1.name good &&
> > + test_config Interesting.SubSection-2.Name good &&
> > + test_config interesting.sub.section.3.name good &&
> > + __git_get_config_subsections interesting name >actual &&
> > + test_cmp expect actual
> > +'
>
> Good to see an uppercase character is used here ;-).
That's just for good measure, but not really necessasry here, as that
primarily tests that 'git config' lists the section and variable names
(but not the subsection!) normalized to lowercase, no matter what
CaMeLCase the user might have written them.
^ permalink raw reply
* Re: [PATCH] setup: recognize bare repositories with packed-refs
From: Adam Majer @ 2023-11-28 14:14 UTC (permalink / raw)
To: Josh Steadmon, git
In-Reply-To: <ZWTxuBV1DGieo6n2@google.com>
On 11/27/23 20:44, Josh Steadmon wrote:> Two suggestions for the test here:
> 1) Can you give the test a more descriptive name, such as "GCed bare
> repos still recognized"?
Thanks, adjusted. I've also added that empty refs/ directory is not there.
> 2) Can you add a check that bare.git/packed-refs exists?
Done.
I've also removed the -C parameter since we actually need GIT_DIR= in
all cases to prevent git from going up directory tree. -C is then
superflous. In addition, I've changed the hardcoded object id to master
branch to make it less magical looking.
- Adam
^ permalink raw reply
* [PATCH] setup: recognize bare repositories with packed-refs
From: Adam Majer @ 2023-11-28 14:28 UTC (permalink / raw)
To: git; +Cc: Adam Majer
In-Reply-To: <20231117203253.21143-1-adamm@zombino.com>
In a garbage collected bare git repository, the refs/ subdirectory is
empty. In use-cases when such a repository is directly added into
another repository, it no longer is detected as valid. Git doesn't
preserve empty paths so refs/ subdirectory is not present. Simply
creating an empty refs/ subdirectory fixes this problem.
Looking more carefully, there are two backends to handle various refs in
git -- the files backend that uses refs/ subdirectory and the
packed-refs backend that uses packed-refs file. If references are not
found in refs/ subdirectory (or directory doesn't exist), the
packed-refs directory will be consulted. Garbage collected repository
will have all its references in packed-refs file.
To allow the use-case when packed-refs is the only source of refs and
refs/ subdirectory is simply not present, augment 'is_git_directory()'
setup function to look for packed-refs file as an alternative to refs/
subdirectory.
Signed-off-by: Adam Majer <adamm@zombino.com>
---
setup.c | 10 +++++++---
t/t6500-gc.sh | 9 +++++++++
2 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/setup.c b/setup.c
index fc592dc6dd..2a6dda6ae9 100644
--- a/setup.c
+++ b/setup.c
@@ -348,7 +348,7 @@ int get_common_dir_noenv(struct strbuf *sb, const char *gitdir)
*
* - either an objects/ directory _or_ the proper
* GIT_OBJECT_DIRECTORY environment variable
- * - a refs/ directory
+ * - a refs/ directory or packed-refs file
* - either a HEAD symlink or a HEAD file that is formatted as
* a proper "ref:", or a regular file HEAD that has a properly
* formatted sha1 object name.
@@ -384,8 +384,12 @@ int is_git_directory(const char *suspect)
strbuf_setlen(&path, len);
strbuf_addstr(&path, "/refs");
- if (access(path.buf, X_OK))
- goto done;
+ if (access(path.buf, X_OK)) {
+ strbuf_setlen(&path, len);
+ strbuf_addstr(&path, "/packed-refs");
+ if (access(path.buf, R_OK))
+ goto done;
+ }
ret = 1;
done:
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 18fe1c25e6..4ad1690817 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -214,6 +214,15 @@ test_expect_success 'gc.repackFilter launches repack with a filter' '
grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out
'
+test_expect_success 'GCed bare repos without empty refs/ still recognized' '
+ GIT_DIR="$PWD"/bare.git git cat-file -e master &&
+ test_dir_is_empty bare.git/refs/heads &&
+ test_dir_is_empty bare.git/refs/tags &&
+ test_file_not_empty bare.git/packed-refs &&
+ rm -r bare.git/refs &&
+ GIT_DIR="$PWD"/bare.git git cat-file -e master
+'
+
test_expect_success 'gc.repackFilterTo store filtered out objects' '
test_when_finished "rm -rf bare.git filtered.git" &&
--
2.43.0.1.g67290e5b65
^ permalink raw reply related
* Re: Submodule update fetching with outer repo's default branch
From: Ricardo Abreu @ 2023-11-28 18:32 UTC (permalink / raw)
To: scott; +Cc: git
In-Reply-To: <111c2777-6fd4-45ab-8418-9d064999661c@app.fastmail.com>
I have the same problem. Here is a generic recipe to reproduce, assuming a repository X with a submodule Y (set with an `https` URL):
1. `git clone ... X`
2. `cd X`
3. `git submodule sync`
4. `git submodule update --init --recursive`
5. `git remote rename origin remotex` # rename the remote in X, not Y
6. Elsewhere, add a new commit to Y and create a branch "update-y" in X, pointing the submodule to that commit. Suppose the hash of the new commit in Y is `abc1abc`.
7. `git fetch remotex`
8. `git checkout update-y`
9. `git submodule sync`
10. `GIT_TRACE=1 git submodule update --init --recursive`
- observe:
* the command in the last trace message to be `git fetch remotex abc1abc` (should be `git fetch origin abc1abc`, because that is still the name of the submodule's remote)
* the wrong error message complaining about 'file' protocol (which isn't involved anywhere)
^ permalink raw reply
* Re: [PATCH] setup: recognize bare repositories with packed-refs
From: Josh Steadmon @ 2023-11-28 18:45 UTC (permalink / raw)
To: Adam Majer; +Cc: git
In-Reply-To: <20231128142845.11523-1-adamm@zombino.com>
On 2023.11.28 15:28, Adam Majer wrote:
> In a garbage collected bare git repository, the refs/ subdirectory is
> empty. In use-cases when such a repository is directly added into
> another repository, it no longer is detected as valid. Git doesn't
> preserve empty paths so refs/ subdirectory is not present. Simply
> creating an empty refs/ subdirectory fixes this problem.
>
> Looking more carefully, there are two backends to handle various refs in
> git -- the files backend that uses refs/ subdirectory and the
> packed-refs backend that uses packed-refs file. If references are not
> found in refs/ subdirectory (or directory doesn't exist), the
> packed-refs directory will be consulted. Garbage collected repository
> will have all its references in packed-refs file.
>
> To allow the use-case when packed-refs is the only source of refs and
> refs/ subdirectory is simply not present, augment 'is_git_directory()'
> setup function to look for packed-refs file as an alternative to refs/
> subdirectory.
>
> Signed-off-by: Adam Majer <adamm@zombino.com>
> ---
> setup.c | 10 +++++++---
> t/t6500-gc.sh | 9 +++++++++
> 2 files changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/setup.c b/setup.c
> index fc592dc6dd..2a6dda6ae9 100644
> --- a/setup.c
> +++ b/setup.c
> @@ -348,7 +348,7 @@ int get_common_dir_noenv(struct strbuf *sb, const char *gitdir)
> *
> * - either an objects/ directory _or_ the proper
> * GIT_OBJECT_DIRECTORY environment variable
> - * - a refs/ directory
> + * - a refs/ directory or packed-refs file
> * - either a HEAD symlink or a HEAD file that is formatted as
> * a proper "ref:", or a regular file HEAD that has a properly
> * formatted sha1 object name.
> @@ -384,8 +384,12 @@ int is_git_directory(const char *suspect)
>
> strbuf_setlen(&path, len);
> strbuf_addstr(&path, "/refs");
> - if (access(path.buf, X_OK))
> - goto done;
> + if (access(path.buf, X_OK)) {
> + strbuf_setlen(&path, len);
> + strbuf_addstr(&path, "/packed-refs");
> + if (access(path.buf, R_OK))
> + goto done;
> + }
>
> ret = 1;
> done:
> diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
> index 18fe1c25e6..4ad1690817 100755
> --- a/t/t6500-gc.sh
> +++ b/t/t6500-gc.sh
> @@ -214,6 +214,15 @@ test_expect_success 'gc.repackFilter launches repack with a filter' '
> grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out
> '
>
> +test_expect_success 'GCed bare repos without empty refs/ still recognized' '
> + GIT_DIR="$PWD"/bare.git git cat-file -e master &&
> + test_dir_is_empty bare.git/refs/heads &&
> + test_dir_is_empty bare.git/refs/tags &&
> + test_file_not_empty bare.git/packed-refs &&
> + rm -r bare.git/refs &&
> + GIT_DIR="$PWD"/bare.git git cat-file -e master
> +'
> +
> test_expect_success 'gc.repackFilterTo store filtered out objects' '
> test_when_finished "rm -rf bare.git filtered.git" &&
>
> --
> 2.43.0.1.g67290e5b65
Thanks for the fixes. This looks good to me. BTW, in the future please
add a version number when you send updated patches (e.g. add "-v 2" to
your command-line if you're using git-format-patch).
Reviewed-by: Josh Steadmon <steadmon@google.com>
^ permalink raw reply
* Re: [PATCH] setup: recognize bare repositories with packed-refs
From: Jeff King @ 2023-11-28 19:04 UTC (permalink / raw)
To: Adam Majer; +Cc: git
In-Reply-To: <20231128142845.11523-1-adamm@zombino.com>
On Tue, Nov 28, 2023 at 03:28:45PM +0100, Adam Majer wrote:
> In a garbage collected bare git repository, the refs/ subdirectory is
> empty. In use-cases when such a repository is directly added into
> another repository, it no longer is detected as valid. Git doesn't
> preserve empty paths so refs/ subdirectory is not present. Simply
> creating an empty refs/ subdirectory fixes this problem.
I understand your use case, but I still have a vague feeling that this
is bending some assumptions in a way that may create problems or
confusion later. In particular:
> Looking more carefully, there are two backends to handle various refs in
> git -- the files backend that uses refs/ subdirectory and the
> packed-refs backend that uses packed-refs file. If references are not
> found in refs/ subdirectory (or directory doesn't exist), the
> packed-refs directory will be consulted. Garbage collected repository
> will have all its references in packed-refs file.
This second paragraph doesn't seem totally accurate to me. There are not
really two backends that Git can use. For production use, there is just
one, the "files" backend, which happens to also use packed-refs under
the hood (and a convenient way for the code to structure this was a
subordinate backend). But it has never been possible to have a repo that
just uses packed-refs.
There is also the experimental reftable, of course. And there we have
not yet loosened is_git_directory(), and it has to create an unused
"refs/" directory (there has been some discussion about allowing it to
be an empty file, though no patches have been merged).
So with regards to the loosening in your patch, my questions would be:
- if we are going to change the rules for repository detection, is
this where we want to end up? We haven't changed them (yet) for
reftables. If we are going to do so, should we have a scheme that
will work for that transition, too? The "refs is an empty file"
scheme would fix your use case, too (though see below).
- is the rest of Git ready to handle a missing "refs/" directory? It
looks like making a ref will auto-create it (since we may have to
make refs/foo/bar/... anyway).
- what about other implementations? Your embedded repos will
presumably not work with libgit2, jgit, etc, until they also get
similar patches.
- what about empty repositories? In that case there will be no "refs/"
file and no "packed-refs" file (such a repository is less likely, of
course, but it may contain objects but no refs, or the point may be
to have an empty repo as a test vector). Likewise, it is possible
for a repository to have an empty "objects" directory (even with a
non-empty refs directory, if there are only symrefs), and your patch
doesn't address that.
> To allow the use-case when packed-refs is the only source of refs and
> refs/ subdirectory is simply not present, augment 'is_git_directory()'
> setup function to look for packed-refs file as an alternative to refs/
> subdirectory.
Getting back to your use case, I'd suggest one of:
- do the usual "touch refs/.gitignore" trick to explicitly track the
empty directory. It looks like the ref code will ignore this (we
don't allow ref names to start with "." in a path component)
- whatever is consuming the embedded repos could "mkdir -p refs
objects" as needed. This is a minor pain, but I think in the long
term we are moving to a world where you have to explicitly do
"GIT_DIR=$PWD/embedded.git" to access an embedded bare repo. So
they're already special and require some setup; adding an extra step
may not be so bad.
Now it may be that neither of those solutions is acceptable for various
reasons. But it is probably worth detailing those reasons in your commit
message.
-Peff
^ permalink raw reply
* [PATCH 00/24] pack-objects: multi-pack verbatim reuse
From: Taylor Blau @ 2023-11-28 19:07 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
Back in fff42755ef (pack-bitmap: add support for bitmap indexes,
2013-12-21), we added support for reachability bitmaps, and taught
pack-objects how to reuse verbatim chunks from the bitmapped pack. When
multi-pack bitmaps were introduced, this pack-reuse mechanism evolved to
use the MIDX's "preferred" pack as the source for verbatim reuse.
This allows repositories to incrementally repack themselves (e.g., using
a `--geometric` repack), storing the result in a MIDX, and generating a
corresponding bitmap. This keeps our bitmap coverage up-to-date, while
maintaining a relatively small number of packs.
However, it is recommended (and matches what we do in production at
GitHub) that repositories repack themselves all-into-one, and
generate a corresponding single-pack reachability bitmap. This is done
for a couple of reasons, but the most relevant one to this series is
that it enables us to perform verbatim pack-reuse over a complete copy
of the repository, since the entire repository resides in a single pack
(and thus is eligible for verbatim pack-reuse).
As repositories grow larger, packing their contents into a single pack
becomes less feasible. This series extends the pack-reuse mechanism to
operate over multiple packs which are known ahead of time to be disjoint
with respect to one another's set of objects.
The implementation has a few components:
- A new MIDX chunk called "Disjoint packfiles" or DISP is introduced
to keep track of the bitmap position, number of objects, and
disjointed-ness for each pack contained in the MIDX.
- A new mode for `git multi-pack-index write --stdin-packs` that
allows specifying disjoint packs, as well as a new option
`--retain-disjoint` which preserves the set of existing disjoint
packs in the new MIDX.
- A new pack-objects mode `--ignore-disjoint`, which produces packs
which are disjoint with respect to the current set of disjoint packs
(i.e. it discards any objects from the packing list which appear in
any of the known-disjoint packs).
- A new repack mode, `--extend-disjoint` which causes any new pack(s)
which are generated to be disjoint with respect to the set of packs
currently marked as disjoint, minus any pack(s) which are about to
be deleted.
With all of that in place, the patch series then rewrites all of the
pack-reuse functions in terms of the new `bitmapped_pack` structure.
Once we have dropped all of the assumptions stemming from only
performing pack-reuse over a single candidate pack, we can then enable
reuse over all of the disjoint packs.
In addition to the many new tests in t5332 added by that series, I tried
to simulate a "real world" test on git.git by breaking the repository
into chunks of 1,000 commits (plus their set of reachable objects not
reachable from earlier chunk(s)) and packing those chunks. This produces
a large number of packs with the objects from git.git which are known to
be disjoint with respect to one another.
$ git clone git@github.com:git/git.git base
$ cd base
$ mv .git/objects/pack/pack-*.idx{,.bak}
$ git unpack-objects <.git/objects/pack/pack-*.pack
# pack the objects from each successive block of 1k commits
$ for rev in $(git rev-list --all | awk '(NR) % 1000 == 0' | tac)
do
echo $rev |
git.compile pack-objects --revs --unpacked .git/objects/pack/pack || return 1
done
# and grab any stragglers, pruning the unpacked objects
$ git repack -d
I then constructed a MIDX and corresponding bitmap
$ find_pack () {
for idx in .git/objects/pack/pack-*.idx
do
git show-index <$idx | grep -q "$1" && basename $idx
done
}
$ preferred="$(find_pack $(git rev-parse HEAD))"
$ ( cd .git/objects/pack && ls -1 *.idx ) | sed -e 's/^/+/g' |
git.compile multi-pack-index write --bitmap --stdin-packs \
--preferred-pack=$preferred
$ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in
With all of that in place, I was able to produce a significant speed-up
by reusing objects from multiple packs:
$ hyperfine -L v single,multi -n '{v}-pack reuse' 'git.compile -c pack.allowPackReuse={v} pack-objects --revs --stdout --use-bitmap-index --delta-base-offset <in >/dev/null'
Benchmark 1: single-pack reuse
Time (mean ± σ): 6.094 s ± 0.023 s [User: 43.723 s, System: 0.358 s]
Range (min … max): 6.063 s … 6.126 s 10 runs
Benchmark 2: multi-pack reuse
Time (mean ± σ): 906.5 ms ± 3.2 ms [User: 1081.5 ms, System: 30.9 ms]
Range (min … max): 903.5 ms … 912.7 ms 10 runs
Summary
multi-pack reuse ran
6.72 ± 0.03 times faster than single-pack reuse
(There are corresponding tests in p5332 that test different sized chunks
and measure the runtime performance as well as resulting pack size).
Performing verbatim pack reuse naturally trades off between CPU time and
the resulting pack size. In the above example, the single-pack reuse
case produces a clone size of ~194 MB on my machine, while the
multi-pack reuse case produces a clone size closer to ~266 MB, which is
a ~37% increase in clone size.
I think there is still some opportunity to close this gap, since the
"packing" strategy here is extremely naive. In a production setting, I'm
sure that there are more well thought out repacking strategies that
would produce more similar clone sizes.
I considered breaking this series up into smaller chunks, but was
unsatisfied with the result. Since this series is rather large, if you
have alternate suggestions on better ways to structure this, please let
me know.
Thanks in advance for your review!
Taylor Blau (24):
pack-objects: free packing_data in more places
pack-bitmap-write: deep-clear the `bb_commit` slab
pack-bitmap: plug leak in find_objects()
midx: factor out `fill_pack_info()`
midx: implement `DISP` chunk
midx: implement `midx_locate_pack()`
midx: implement `--retain-disjoint` mode
pack-objects: implement `--ignore-disjoint` mode
repack: implement `--extend-disjoint` mode
pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
pack-bitmap: return multiple packs via
`reuse_partial_packfile_from_bitmap()`
pack-objects: parameterize pack-reuse routines over a single pack
pack-objects: keep track of `pack_start` for each reuse pack
pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
pack-objects: prepare `write_reused_pack()` for multi-pack reuse
pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack
reuse
pack-objects: include number of packs reused in output
pack-bitmap: prepare to mark objects from multiple packs for reuse
pack-objects: add tracing for various packfile metrics
t/test-lib-functions.sh: implement `test_trace2_data` helper
pack-objects: allow setting `pack.allowPackReuse` to "single"
pack-bitmap: reuse objects from all disjoint packs
t/perf: add performance tests for multi-pack reuse
Documentation/config/pack.txt | 8 +-
Documentation/git-multi-pack-index.txt | 12 ++
Documentation/git-pack-objects.txt | 8 +
Documentation/git-repack.txt | 12 ++
Documentation/gitformat-pack.txt | 109 ++++++++++
builtin/multi-pack-index.c | 13 +-
builtin/pack-objects.c | 200 +++++++++++++++----
builtin/repack.c | 57 +++++-
midx.c | 218 +++++++++++++++++---
midx.h | 11 +-
pack-bitmap-write.c | 9 +-
pack-bitmap.c | 265 ++++++++++++++++++++-----
pack-bitmap.h | 18 +-
pack-objects.c | 15 ++
pack-objects.h | 1 +
t/helper/test-read-midx.c | 31 ++-
t/lib-disjoint.sh | 49 +++++
t/perf/p5332-multi-pack-reuse.sh | 81 ++++++++
t/t5319-multi-pack-index.sh | 140 +++++++++++++
t/t5331-pack-objects-stdin.sh | 156 +++++++++++++++
t/t5332-multi-pack-reuse.sh | 219 ++++++++++++++++++++
t/t6113-rev-list-bitmap-filters.sh | 2 +
t/t7700-repack.sh | 4 +-
t/t7705-repack-extend-disjoint.sh | 142 +++++++++++++
t/test-lib-functions.sh | 14 ++
25 files changed, 1650 insertions(+), 144 deletions(-)
create mode 100644 t/lib-disjoint.sh
create mode 100755 t/perf/p5332-multi-pack-reuse.sh
create mode 100755 t/t5332-multi-pack-reuse.sh
create mode 100755 t/t7705-repack-extend-disjoint.sh
base-commit: 564d0252ca632e0264ed670534a51d18a689ef5d
--
2.43.0.24.g980b318f98
^ permalink raw reply
* [PATCH 01/24] pack-objects: free packing_data in more places
From: Taylor Blau @ 2023-11-28 19:07 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1701198172.git.me@ttaylorr.com>
The pack-objects internals use a packing_data struct to track what
objects are part of the pack(s) being formed.
Since these structures contain allocated fields, failing to
appropriately free() them results in a leak. Plug that leak by
introducing a free_packing_data() function, and call it in the
appropriate spots.
This is a fairly straightforward leak to plug, since none of the callers
expect to read any values or have any references to parts of the address
space being freed.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/pack-objects.c | 1 +
midx.c | 5 +++++
pack-objects.c | 15 +++++++++++++++
pack-objects.h | 1 +
4 files changed, 22 insertions(+)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 89a8b5a976..bfa60359d4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4522,6 +4522,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
reuse_packfile_objects);
cleanup:
+ free_packing_data(&to_pack);
list_objects_filter_release(&filter_options);
strvec_clear(&rp);
diff --git a/midx.c b/midx.c
index 2f3863c936..3b727dc633 100644
--- a/midx.c
+++ b/midx.c
@@ -1592,8 +1592,13 @@ static int write_midx_internal(const char *object_dir,
flags) < 0) {
error(_("could not write multi-pack bitmap"));
result = 1;
+ free_packing_data(&pdata);
+ free(commits);
goto cleanup;
}
+
+ free_packing_data(&pdata);
+ free(commits);
}
/*
* NOTE: Do not use ctx.entries beyond this point, since it might
diff --git a/pack-objects.c b/pack-objects.c
index f403ca6986..1c7bedcc94 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -151,6 +151,21 @@ void prepare_packing_data(struct repository *r, struct packing_data *pdata)
init_recursive_mutex(&pdata->odb_lock);
}
+void free_packing_data(struct packing_data *pdata)
+{
+ if (!pdata)
+ return;
+
+ free(pdata->cruft_mtime);
+ free(pdata->in_pack);
+ free(pdata->in_pack_by_idx);
+ free(pdata->in_pack_pos);
+ free(pdata->index);
+ free(pdata->layer);
+ free(pdata->objects);
+ free(pdata->tree_depth);
+}
+
struct object_entry *packlist_alloc(struct packing_data *pdata,
const struct object_id *oid)
{
diff --git a/pack-objects.h b/pack-objects.h
index 0d78db40cb..336217e8cd 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -169,6 +169,7 @@ struct packing_data {
};
void prepare_packing_data(struct repository *r, struct packing_data *pdata);
+void free_packing_data(struct packing_data *pdata);
/* Protect access to object database */
static inline void packing_data_lock(struct packing_data *pdata)
--
2.43.0.24.g980b318f98
^ permalink raw reply related
* [PATCH 02/24] pack-bitmap-write: deep-clear the `bb_commit` slab
From: Taylor Blau @ 2023-11-28 19:07 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1701198172.git.me@ttaylorr.com>
The `bb_commit` commit slab is used by the pack-bitmap-write machinery
to track various pieces of bookkeeping used to generate reachability
bitmaps.
Even though we clear the slab when freeing the bitmap_builder struct
(with `bitmap_builder_clear()`), there are still pointers which point to
locations in memory that have not yet been freed, resulting in a leak.
Plug the leak by introducing a suitable `free_fn` for the `struct
bb_commit` type, and make sure it is called on each member of the slab
via the `deep_clear_bb_data()` function.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
pack-bitmap-write.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index f4ecdf8b0e..dd3a415b9d 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -198,6 +198,13 @@ struct bb_commit {
unsigned idx; /* within selected array */
};
+static void clear_bb_commit(struct bb_commit *commit)
+{
+ free(commit->reverse_edges);
+ bitmap_free(commit->commit_mask);
+ bitmap_free(commit->bitmap);
+}
+
define_commit_slab(bb_data, struct bb_commit);
struct bitmap_builder {
@@ -339,7 +346,7 @@ static void bitmap_builder_init(struct bitmap_builder *bb,
static void bitmap_builder_clear(struct bitmap_builder *bb)
{
- clear_bb_data(&bb->data);
+ deep_clear_bb_data(&bb->data, clear_bb_commit);
free(bb->commits);
bb->commits_nr = bb->commits_alloc = 0;
}
--
2.43.0.24.g980b318f98
^ permalink raw reply related
* [PATCH 03/24] pack-bitmap: plug leak in find_objects()
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1701198172.git.me@ttaylorr.com>
The `find_objects()` function creates an object_list for any tips of the
reachability query which do not have corresponding bitmaps.
The object_list is not used outside of `find_objects()`, but we never
free it with `object_list_free()`, resulting in a leak. Let's plug that
leak by calling `object_list_free()`, which results in t6113 becoming
leak-free.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
pack-bitmap.c | 2 ++
t/t6113-rev-list-bitmap-filters.sh | 2 ++
2 files changed, 4 insertions(+)
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 0260890341..d2f1306960 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1280,6 +1280,8 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
base = fill_in_bitmap(bitmap_git, revs, base, seen);
}
+ object_list_free(¬_mapped);
+
return base;
}
diff --git a/t/t6113-rev-list-bitmap-filters.sh b/t/t6113-rev-list-bitmap-filters.sh
index 86c70521f1..459f0d7412 100755
--- a/t/t6113-rev-list-bitmap-filters.sh
+++ b/t/t6113-rev-list-bitmap-filters.sh
@@ -4,6 +4,8 @@ test_description='rev-list combining bitmaps and filters'
. ./test-lib.sh
. "$TEST_DIRECTORY"/lib-bitmap.sh
+TEST_PASSES_SANITIZE_LEAK=true
+
test_expect_success 'set up bitmapped repo' '
# one commit will have bitmaps, the other will not
test_commit one &&
--
2.43.0.24.g980b318f98
^ permalink raw reply related
* [PATCH 04/24] midx: factor out `fill_pack_info()`
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1701198172.git.me@ttaylorr.com>
When selecting which packfiles will be written while generating a MIDX,
the MIDX internals fill out a 'struct pack_info' with various pieces of
book-keeping.
Instead of filling out each field of the `pack_info` structure
individually in each of the two spots that modify the array of such
structures (`ctx->info`), extract a common routine that does this for
us.
This reduces the code duplication by a modest amount. But more
importantly, it zero-initializes the structure before assigning values
into it. This hardens us for a future change which will add additional
fields to this structure which (until this patch) was not
zero-initialized.
As a result, any new fields added to the `pack_info` structure need only
be updated in a single location, instead of at each spot within midx.c.
There are no functional changes in this patch.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
midx.c | 35 +++++++++++++++++++----------------
1 file changed, 19 insertions(+), 16 deletions(-)
diff --git a/midx.c b/midx.c
index 3b727dc633..591b3c636e 100644
--- a/midx.c
+++ b/midx.c
@@ -464,6 +464,17 @@ struct pack_info {
unsigned expired : 1;
};
+static void fill_pack_info(struct pack_info *info,
+ struct packed_git *p, char *pack_name,
+ uint32_t orig_pack_int_id)
+{
+ memset(info, 0, sizeof(struct pack_info));
+
+ info->orig_pack_int_id = orig_pack_int_id;
+ info->pack_name = pack_name;
+ info->p = p;
+}
+
static int pack_info_compare(const void *_a, const void *_b)
{
struct pack_info *a = (struct pack_info *)_a;
@@ -504,6 +515,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
const char *file_name, void *data)
{
struct write_midx_context *ctx = data;
+ struct packed_git *p;
if (ends_with(file_name, ".idx")) {
display_progress(ctx->progress, ++ctx->pack_paths_checked);
@@ -530,17 +542,14 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
- ctx->info[ctx->nr].p = add_packed_git(full_path,
- full_path_len,
- 0);
-
- if (!ctx->info[ctx->nr].p) {
+ p = add_packed_git(full_path, full_path_len, 0);
+ if (!p) {
warning(_("failed to add packfile '%s'"),
full_path);
return;
}
- if (open_pack_index(ctx->info[ctx->nr].p)) {
+ if (open_pack_index(p)) {
warning(_("failed to open pack-index '%s'"),
full_path);
close_pack(ctx->info[ctx->nr].p);
@@ -548,9 +557,8 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
return;
}
- ctx->info[ctx->nr].pack_name = xstrdup(file_name);
- ctx->info[ctx->nr].orig_pack_int_id = ctx->nr;
- ctx->info[ctx->nr].expired = 0;
+ fill_pack_info(&ctx->info[ctx->nr], p, xstrdup(file_name),
+ ctx->nr);
ctx->nr++;
}
}
@@ -1310,11 +1318,6 @@ static int write_midx_internal(const char *object_dir,
for (i = 0; i < ctx.m->num_packs; i++) {
ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
- ctx.info[ctx.nr].orig_pack_int_id = i;
- ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
- ctx.info[ctx.nr].p = ctx.m->packs[i];
- ctx.info[ctx.nr].expired = 0;
-
if (flags & MIDX_WRITE_REV_INDEX) {
/*
* If generating a reverse index, need to have
@@ -1330,10 +1333,10 @@ static int write_midx_internal(const char *object_dir,
if (open_pack_index(ctx.m->packs[i]))
die(_("could not open index for %s"),
ctx.m->packs[i]->pack_name);
- ctx.info[ctx.nr].p = ctx.m->packs[i];
}
- ctx.nr++;
+ fill_pack_info(&ctx.info[ctx.nr++], ctx.m->packs[i],
+ xstrdup(ctx.m->pack_names[i]), i);
}
}
--
2.43.0.24.g980b318f98
^ permalink raw reply related
* [PATCH 05/24] midx: implement `DISP` chunk
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1701198172.git.me@ttaylorr.com>
When a multi-pack bitmap is used to implement verbatim pack reuse (that
is, when verbatim chunks from an on-disk packfile are copied
directly[^1]), it does so by using its "preferred pack" as the source
for pack-reuse.
This allows repositories to pack the majority of their objects into a
single (often large) pack, and then use it as the single source for
verbatim pack reuse. This increases the amount of objects that are
reused verbatim (and consequently, decrease the amount of time it takes
to generate many packs). But this performance comes at a cost, which is
that the preferred packfile must pace its growth with that of the entire
repository in order to maintain the utility of verbatim pack reuse.
As repositories grow beyond what we can reasonably store in a single
packfile, the utility of verbatim pack reuse diminishes. Or, at the very
least, it becomes increasingly more expensive to maintain as the pack
grows larger and larger.
It would be beneficial to be able to perform this same optimization over
multiple packs, provided some modest constraints (most importantly, that
the set of packs eligible for verbatim reuse are disjoint with respect
to the objects that they contain).
If we assume that the packs which we treat as candidates for verbatim
reuse are disjoint with respect to their objects, we need to make only
modest modifications to the verbatim pack-reuse code itself. Most
notably, we need to remove the assumption that the bits in the
reachability bitmap corresponding to objects from the single reuse pack
begin at the first bit position.
Future patches will unwind these assumptions and reimplement their
existing functionality as special cases of the more general assumptions
(e.g. that reuse bits can start anywhere within the bitset, but happen
to start at 0 for all existing cases).
This patch does not yet relax any of those assumptions. Instead, it
implements a foundational data-structure, the "Disjoint Packs" (`DISP`)
chunk of the multi-pack index. The `DISP` chunk's contents are described
in detail here. Importantly, the `DISP` chunk contains information to
map regions of a multi-pack index's reachability bitmap to the packs
whose objects they represent.
For now, this chunk is only written, not read (outside of the test-tool
used in this patch to test the new chunk's behavior). Future patches
will begin to make use of this new chunk.
This patch implements reading (though no callers outside of the above
one perform any reading) and writing this new chunk. It also extends the
`--stdin-packs` format used by the `git multi-pack-index write` builtin
to be able to designate that a given pack is to be marked as "disjoint"
by prefixing it with a '+' character.
[^1]: Modulo patching any `OFS_DELTA`'s that cross over a region of the
pack that wasn't used verbatim.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
Documentation/git-multi-pack-index.txt | 4 +
Documentation/gitformat-pack.txt | 109 +++++++++++++++++++++++
builtin/multi-pack-index.c | 10 ++-
midx.c | 116 ++++++++++++++++++++++---
midx.h | 5 ++
pack-bitmap.h | 9 ++
t/helper/test-read-midx.c | 31 ++++++-
t/t5319-multi-pack-index.sh | 58 +++++++++++++
8 files changed, 325 insertions(+), 17 deletions(-)
diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 3696506eb3..d130e65b28 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -49,6 +49,10 @@ write::
--stdin-packs::
Write a multi-pack index containing only the set of
line-delimited pack index basenames provided over stdin.
+ Lines beginning with a '+' character (followed by the
+ pack index basename as before) have their pack marked as
+ "disjoint". See the "`DISP` chunk and disjoint packs"
+ section in linkgit:gitformat-pack[5] for more.
--refs-snapshot=<path>::
With `--bitmap`, optionally specify a file which
diff --git a/Documentation/gitformat-pack.txt b/Documentation/gitformat-pack.txt
index 9fcb29a9c8..658682ddd5 100644
--- a/Documentation/gitformat-pack.txt
+++ b/Documentation/gitformat-pack.txt
@@ -396,6 +396,22 @@ CHUNK DATA:
is padded at the end with between 0 and 3 NUL bytes to make the
chunk size a multiple of 4 bytes.
+ Disjoint Packfiles (ID: {'D', 'I', 'S', 'P'})
+ Stores a table of three 4-byte unsigned integers in network order.
+ Each table entry corresponds to a single pack (in the order that
+ they appear above in the `PNAM` chunk). The values for each table
+ entry are as follows:
+ - The first bit position (in psuedo-pack order, see below) to
+ contain an object from that pack.
+ - The number of bits whose objects are selected from that pack.
+ - A "meta" value, whose least-significant bit indicates whether or
+ not the pack is disjoint with respect to other packs. The
+ remaining bits are unused.
+ Two packs are "disjoint" with respect to one another when they have
+ disjoint sets of objects. In other words, any object found in a pack
+ contained in the set of disjoint packfiles is guaranteed to be
+ uniquely located among those packs.
+
OID Fanout (ID: {'O', 'I', 'D', 'F'})
The ith entry, F[i], stores the number of OIDs with first
byte at most i. Thus F[255] stores the total
@@ -509,6 +525,99 @@ packs arranged in MIDX order (with the preferred pack coming first).
The MIDX's reverse index is stored in the optional 'RIDX' chunk within
the MIDX itself.
+=== `DISP` chunk and disjoint packs
+
+The Disjoint Packfiles (`DISP`) chunk encodes additional information
+about the objects in the multi-pack index's reachability bitmap. Recall
+that objects from the MIDX are arranged in "pseudo-pack" order (see:
+above) for reachability bitmaps.
+
+From the example above, suppose we have packs "a", "b", and "c", with
+10, 15, and 20 objects, respectively. In pseudo-pack order, those would
+be arranged as follows:
+
+ |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|
+
+When working with single-pack bitmaps (or, equivalently, multi-pack
+reachability bitmaps without any packs marked as disjoint),
+linkgit:git-pack-objects[1] performs ``verbatim'' reuse, attempting to
+reuse chunks of the existing packfile instead of adding objects to the
+packing list.
+
+When a chunk of bytes are reused from an existing pack, any objects
+contained therein do not need to be added to the packing list, saving
+memory and CPU time. But a chunk from an existing packfile can only be
+reused when the following conditions are met:
+
+ - The chunk contains only objects which were requested by the caller
+ (i.e. does not contain any objects which the caller didn't ask for
+ explicitly or implicitly).
+
+ - All objects stored as offset- or reference-deltas also include their
+ base object in the resulting pack.
+
+Additionally, packfiles many not contain more than one copy of any given
+object. This introduces an additional constraint over the set of packs
+we may want to reuse. The most straightforward approach is to mandate
+that the set of packs is disjoint with respect to the set of objects
+contained in each pack. In other words, for each object `o` in the union
+of all objects stored by the disjoint set of packs, `o` is contained in
+exactly one pack from the disjoint set.
+
+One alternative design choice for multi-pack reuse might instead involve
+imposing a chunk-level constraint that allows packs in the reusable set
+to contain multiple copies across different packs, but restricts each
+chunk against including more than one copy of such an object. This is in
+theory possible to implement, but significantly more complicated than
+forcing packs themselves to be disjoint. Most notably, we would have to
+keep track of which objects have already been sent during verbatim
+pack-reuse, defeating the main purpose of verbatim pack reuse (that we
+don't have to keep track of individual objects).
+
+The `DISP` chunk encodes the necessary information in order to implement
+multi-pack reuse over a disjoint set of packs as described above.
+Specifically, the `DISP` chunk encodes three pieces of information (all
+32-bit unsigned integers in network byte-order) for each packfile `p`
+that is stored in the MIDX, as follows:
+
+`bitmap_pos`:: The first bit position (in pseudo-pack order) in the
+ multi-pack index's reachability bitmap occupied by an object from `p`.
+
+`bitmap_nr`:: The number of bit positions (including the one at
+ `bitmap_pos`) that encode objects from that pack `p`.
+
+`disjoint`:: Metadata, including whether or not the pack `p` is
+ ``disjoint''. The least significant bit stores whether or not the pack
+ is disjoint. The remaining bits are reserved for future use.
+
+For example, the `DISP` chunk corresponding to the above example (with
+packs ``a'', ``b'', and ``c'') would look like:
+
+[cols="1,2,2,2"]
+|===
+| |`bitmap_pos` |`bitmap_nr` |`disjoint`
+
+|packfile ``a''
+|`0`
+|`10`
+|`0x1`
+
+|packfile ``b''
+|`10`
+|`15`
+|`0x1`
+
+|packfile ``c''
+|`25`
+|`20`
+|`0x1`
+|===
+
+With these constraints and information in place, we can treat each
+packfile marked as disjoint as individually reusable in the same fashion
+as verbatim pack reuse is performed on individual packs prior to the
+implementation of the `DISP` chunk.
+
== cruft packs
The cruft packs feature offer an alternative to Git's traditional mechanism of
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index a72aebecaa..0f1dd4651d 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -106,11 +106,17 @@ static int git_multi_pack_index_write_config(const char *var, const char *value,
return 0;
}
+#define DISJOINT ((void*)(uintptr_t)1)
+
static void read_packs_from_stdin(struct string_list *to)
{
struct strbuf buf = STRBUF_INIT;
- while (strbuf_getline(&buf, stdin) != EOF)
- string_list_append(to, buf.buf);
+ while (strbuf_getline(&buf, stdin) != EOF) {
+ if (*buf.buf == '+')
+ string_list_append(to, buf.buf + 1)->util = DISJOINT;
+ else
+ string_list_append(to, buf.buf);
+ }
string_list_sort(to);
strbuf_release(&buf);
diff --git a/midx.c b/midx.c
index 591b3c636e..f55020072f 100644
--- a/midx.c
+++ b/midx.c
@@ -33,6 +33,7 @@
#define MIDX_CHUNK_ALIGNMENT 4
#define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKID_DISJOINTPACKS 0x44495350 /* "DISP" */
#define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
#define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
#define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */
@@ -182,6 +183,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets,
&m->chunk_large_offsets_len);
+ pair_chunk(cf, MIDX_CHUNKID_DISJOINTPACKS,
+ (const unsigned char **)&m->chunk_disjoint_packs,
+ &m->chunk_disjoint_packs_len);
if (git_env_bool("GIT_TEST_MIDX_READ_RIDX", 1))
pair_chunk(cf, MIDX_CHUNKID_REVINDEX, &m->chunk_revindex,
@@ -275,6 +279,23 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t
return 0;
}
+int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m,
+ struct bitmapped_pack *bp, uint32_t pack_int_id)
+{
+ if (!m->chunk_disjoint_packs)
+ return error(_("MIDX does not contain the DISP chunk"));
+
+ if (prepare_midx_pack(r, m, pack_int_id))
+ return error(_("could not load disjoint pack %"PRIu32), pack_int_id);
+
+ bp->p = m->packs[pack_int_id];
+ bp->bitmap_pos = get_be32(m->chunk_disjoint_packs + 3 * pack_int_id);
+ bp->bitmap_nr = get_be32(m->chunk_disjoint_packs + 3 * pack_int_id + 1);
+ bp->disjoint = !!get_be32(m->chunk_disjoint_packs + 3 * pack_int_id + 2);
+
+ return 0;
+}
+
int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result)
{
return bsearch_hash(oid->hash, m->chunk_oid_fanout, m->chunk_oid_lookup,
@@ -457,11 +478,18 @@ static size_t write_midx_header(struct hashfile *f,
return MIDX_HEADER_SIZE;
}
+#define BITMAP_POS_UNKNOWN (~((uint32_t)0))
+
struct pack_info {
uint32_t orig_pack_int_id;
char *pack_name;
struct packed_git *p;
- unsigned expired : 1;
+
+ uint32_t bitmap_pos;
+ uint32_t bitmap_nr;
+
+ unsigned expired : 1,
+ disjoint : 1;
};
static void fill_pack_info(struct pack_info *info,
@@ -473,6 +501,7 @@ static void fill_pack_info(struct pack_info *info,
info->orig_pack_int_id = orig_pack_int_id;
info->pack_name = pack_name;
info->p = p;
+ info->bitmap_pos = BITMAP_POS_UNKNOWN;
}
static int pack_info_compare(const void *_a, const void *_b)
@@ -516,6 +545,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
{
struct write_midx_context *ctx = data;
struct packed_git *p;
+ struct string_list_item *item = NULL;
if (ends_with(file_name, ".idx")) {
display_progress(ctx->progress, ++ctx->pack_paths_checked);
@@ -534,11 +564,13 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
* should be performed independently (likely checking
* to_include before the existing MIDX).
*/
- if (ctx->m && midx_contains_pack(ctx->m, file_name))
- return;
- else if (ctx->to_include &&
- !string_list_has_string(ctx->to_include, file_name))
+ if (ctx->m && midx_contains_pack(ctx->m, file_name)) {
return;
+ } else if (ctx->to_include) {
+ item = string_list_lookup(ctx->to_include, file_name);
+ if (!item)
+ return;
+ }
ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
@@ -559,6 +591,8 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
fill_pack_info(&ctx->info[ctx->nr], p, xstrdup(file_name),
ctx->nr);
+ if (item)
+ ctx->info[ctx->nr].disjoint = !!item->util;
ctx->nr++;
}
}
@@ -568,7 +602,8 @@ struct pack_midx_entry {
uint32_t pack_int_id;
time_t pack_mtime;
uint64_t offset;
- unsigned preferred : 1;
+ unsigned preferred : 1,
+ disjoint : 1;
};
static int midx_oid_compare(const void *_a, const void *_b)
@@ -586,6 +621,12 @@ static int midx_oid_compare(const void *_a, const void *_b)
if (a->preferred < b->preferred)
return 1;
+ /* Sort objects in a disjoint pack last when multiple copies exist. */
+ if (a->disjoint < b->disjoint)
+ return -1;
+ if (a->disjoint > b->disjoint)
+ return 1;
+
if (a->pack_mtime > b->pack_mtime)
return -1;
else if (a->pack_mtime < b->pack_mtime)
@@ -671,6 +712,7 @@ static void midx_fanout_add_midx_fanout(struct midx_fanout *fanout,
&fanout->entries[fanout->nr],
cur_object);
fanout->entries[fanout->nr].preferred = 0;
+ fanout->entries[fanout->nr].disjoint = 0;
fanout->nr++;
}
}
@@ -696,6 +738,7 @@ static void midx_fanout_add_pack_fanout(struct midx_fanout *fanout,
cur_object,
&fanout->entries[fanout->nr],
preferred);
+ fanout->entries[fanout->nr].disjoint = info->disjoint;
fanout->nr++;
}
}
@@ -764,14 +807,22 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
* Take only the first duplicate.
*/
for (cur_object = 0; cur_object < fanout.nr; cur_object++) {
- if (cur_object && oideq(&fanout.entries[cur_object - 1].oid,
- &fanout.entries[cur_object].oid))
- continue;
+ struct pack_midx_entry *ours = &fanout.entries[cur_object];
+ if (cur_object) {
+ struct pack_midx_entry *prev = &fanout.entries[cur_object - 1];
+ if (oideq(&prev->oid, &ours->oid)) {
+ if (prev->disjoint && ours->disjoint)
+ die(_("duplicate object '%s' among disjoint packs '%s', '%s'"),
+ oid_to_hex(&prev->oid),
+ info[prev->pack_int_id].pack_name,
+ info[ours->pack_int_id].pack_name);
+ continue;
+ }
+ }
ALLOC_GROW(deduplicated_entries, st_add(*nr_objects, 1),
alloc_objects);
- memcpy(&deduplicated_entries[*nr_objects],
- &fanout.entries[cur_object],
+ memcpy(&deduplicated_entries[*nr_objects], ours,
sizeof(struct pack_midx_entry));
(*nr_objects)++;
}
@@ -814,6 +865,27 @@ static int write_midx_pack_names(struct hashfile *f, void *data)
return 0;
}
+static int write_midx_disjoint_packs(struct hashfile *f, void *data)
+{
+ struct write_midx_context *ctx = data;
+ size_t i;
+
+ for (i = 0; i < ctx->nr; i++) {
+ struct pack_info *pack = &ctx->info[i];
+ if (pack->expired)
+ continue;
+
+ if (pack->bitmap_pos == BITMAP_POS_UNKNOWN && pack->bitmap_nr)
+ BUG("pack '%s' has no bitmap position, but has %d bitmapped object(s)",
+ pack->pack_name, pack->bitmap_nr);
+
+ hashwrite_be32(f, pack->bitmap_pos);
+ hashwrite_be32(f, pack->bitmap_nr);
+ hashwrite_be32(f, !!pack->disjoint);
+ }
+ return 0;
+}
+
static int write_midx_oid_fanout(struct hashfile *f,
void *data)
{
@@ -981,8 +1053,19 @@ static uint32_t *midx_pack_order(struct write_midx_context *ctx)
QSORT(data, ctx->entries_nr, midx_pack_order_cmp);
ALLOC_ARRAY(pack_order, ctx->entries_nr);
- for (i = 0; i < ctx->entries_nr; i++)
+ for (i = 0; i < ctx->entries_nr; i++) {
+ struct pack_midx_entry *e = &ctx->entries[data[i].nr];
+ struct pack_info *pack = &ctx->info[ctx->pack_perm[e->pack_int_id]];
+ if (pack->bitmap_pos == BITMAP_POS_UNKNOWN)
+ pack->bitmap_pos = i;
+ pack->bitmap_nr++;
pack_order[i] = data[i].nr;
+ }
+ for (i = 0; i < ctx->nr; i++) {
+ struct pack_info *pack = &ctx->info[ctx->pack_perm[i]];
+ if (pack->bitmap_pos == BITMAP_POS_UNKNOWN)
+ pack->bitmap_pos = 0;
+ }
free(data);
trace2_region_leave("midx", "midx_pack_order", the_repository);
@@ -1283,6 +1366,7 @@ static int write_midx_internal(const char *object_dir,
struct hashfile *f = NULL;
struct lock_file lk;
struct write_midx_context ctx = { 0 };
+ int pack_disjoint_concat_len = 0;
int pack_name_concat_len = 0;
int dropped_packs = 0;
int result = 0;
@@ -1495,8 +1579,10 @@ static int write_midx_internal(const char *object_dir,
}
for (i = 0; i < ctx.nr; i++) {
- if (!ctx.info[i].expired)
- pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1;
+ if (ctx.info[i].expired)
+ continue;
+ pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1;
+ pack_disjoint_concat_len += 3 * sizeof(uint32_t);
}
/* Check that the preferred pack wasn't expired (if given). */
@@ -1556,6 +1642,8 @@ static int write_midx_internal(const char *object_dir,
add_chunk(cf, MIDX_CHUNKID_REVINDEX,
st_mult(ctx.entries_nr, sizeof(uint32_t)),
write_midx_revindex);
+ add_chunk(cf, MIDX_CHUNKID_DISJOINTPACKS,
+ pack_disjoint_concat_len, write_midx_disjoint_packs);
}
write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
diff --git a/midx.h b/midx.h
index a5d98919c8..cdd16a8378 100644
--- a/midx.h
+++ b/midx.h
@@ -7,6 +7,7 @@
struct object_id;
struct pack_entry;
struct repository;
+struct bitmapped_pack;
#define GIT_TEST_MULTI_PACK_INDEX "GIT_TEST_MULTI_PACK_INDEX"
#define GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP \
@@ -33,6 +34,8 @@ struct multi_pack_index {
const unsigned char *chunk_pack_names;
size_t chunk_pack_names_len;
+ const uint32_t *chunk_disjoint_packs;
+ size_t chunk_disjoint_packs_len;
const uint32_t *chunk_oid_fanout;
const unsigned char *chunk_oid_lookup;
const unsigned char *chunk_object_offsets;
@@ -58,6 +61,8 @@ void get_midx_rev_filename(struct strbuf *out, struct multi_pack_index *m);
struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
+int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m,
+ struct bitmapped_pack *bp, uint32_t pack_int_id);
int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos);
uint32_t nth_midxed_pack_int_id(struct multi_pack_index *m, uint32_t pos);
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 5273a6a019..b7fa1a42a9 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -52,6 +52,15 @@ typedef int (*show_reachable_fn)(
struct bitmap_index;
+struct bitmapped_pack {
+ struct packed_git *p;
+
+ uint32_t bitmap_pos;
+ uint32_t bitmap_nr;
+
+ unsigned disjoint : 1;
+};
+
struct bitmap_index *prepare_bitmap_git(struct repository *r);
struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx);
void count_bitmap_commit_list(struct bitmap_index *, uint32_t *commits,
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index e9a444ddba..4b44995dca 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -100,10 +100,37 @@ static int read_midx_preferred_pack(const char *object_dir)
return 0;
}
+static int read_midx_bitmapped_packs(const char *object_dir)
+{
+ struct multi_pack_index *midx = NULL;
+ struct bitmapped_pack pack;
+ uint32_t i;
+
+ setup_git_directory();
+
+ midx = load_multi_pack_index(object_dir, 1);
+ if (!midx)
+ return 1;
+
+ for (i = 0; i < midx->num_packs; i++) {
+ if (nth_bitmapped_pack(the_repository, midx, &pack, i) < 0)
+ return 1;
+
+ printf("%s\n", pack_basename(pack.p));
+ printf(" bitmap_pos: %"PRIuMAX"\n", (uintmax_t)pack.bitmap_pos);
+ printf(" bitmap_nr: %"PRIuMAX"\n", (uintmax_t)pack.bitmap_nr);
+ printf(" disjoint: %s\n", pack.disjoint & 0x1 ? "yes" : "no");
+ }
+
+ close_midx(midx);
+
+ return 0;
+}
+
int cmd__read_midx(int argc, const char **argv)
{
if (!(argc == 2 || argc == 3))
- usage("read-midx [--show-objects|--checksum|--preferred-pack] <object-dir>");
+ usage("read-midx [--show-objects|--checksum|--preferred-pack|--bitmap] <object-dir>");
if (!strcmp(argv[1], "--show-objects"))
return read_midx_file(argv[2], 1);
@@ -111,5 +138,7 @@ int cmd__read_midx(int argc, const char **argv)
return read_midx_checksum(argv[2]);
else if (!strcmp(argv[1], "--preferred-pack"))
return read_midx_preferred_pack(argv[2]);
+ else if (!strcmp(argv[1], "--bitmap"))
+ return read_midx_bitmapped_packs(argv[2]);
return read_midx_file(argv[1], 0);
}
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index c4c6060cee..fd24e0c952 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -1157,4 +1157,62 @@ test_expect_success 'reader notices too-small revindex chunk' '
test_cmp expect.err err
'
+test_expect_success 'disjoint packs are stored via the DISP chunk' '
+ test_when_finished "rm -fr repo" &&
+ git init repo &&
+ (
+ cd repo &&
+
+ for i in 1 2 3 4 5
+ do
+ test_commit "$i" &&
+ git repack -d || return 1
+ done &&
+
+ find $objdir/pack -type f -name "*.idx" | xargs -n 1 basename | sort >packs &&
+
+ git multi-pack-index write --stdin-packs <packs &&
+ test_must_fail test-tool read-midx --bitmap $objdir 2>err &&
+ cat >expect <<-\EOF &&
+ error: MIDX does not contain the DISP chunk
+ EOF
+ test_cmp expect err &&
+
+ sed -e "s/^/+/g" packs >in &&
+ git multi-pack-index write --stdin-packs --bitmap \
+ --preferred-pack="$(head -n1 <packs)" <in &&
+ test-tool read-midx --bitmap $objdir >actual &&
+ for i in $(test_seq $(wc -l <packs))
+ do
+ sed -ne "${i}s/\.idx$/\.pack/p" packs &&
+ echo " bitmap_pos: $(( $(( $i - 1 )) * 3 ))" &&
+ echo " bitmap_nr: 3" &&
+ echo " disjoint: yes" || return 1
+ done >expect &&
+ test_cmp expect actual
+ )
+'
+
+test_expect_success 'non-disjoint packs are detected' '
+ test_when_finished "rm -fr repo" &&
+ git init repo &&
+ (
+ cd repo &&
+
+ test_commit base &&
+ git repack -d &&
+ test_commit other &&
+ git repack -a &&
+
+ ls -la .git/objects/pack/ &&
+
+ find $objdir/pack -type f -name "*.idx" |
+ sed -e "s/.*\/\(.*\)$/+\1/g" >in &&
+
+ test_must_fail git multi-pack-index write --stdin-packs \
+ --bitmap <in 2>err &&
+ grep "duplicate object.* among disjoint packs" err
+ )
+'
+
test_done
--
2.43.0.24.g980b318f98
^ permalink raw reply related
* [PATCH 06/24] midx: implement `midx_locate_pack()`
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1701198172.git.me@ttaylorr.com>
The multi-pack index API exposes a `midx_contains_pack()` function that
takes in a string ending in either ".idx" or ".pack" and returns whether
or not the MIDX contains a given pack corresponding to that string.
There is no corresponding function to locate the position of a pack
within the MIDX's pack order (sorted lexically by pack filename).
We could add an optional out parameter to `midx_contains_pack()` that is
filled out with the pack's position when the parameter is non-NULL. To
minimize the amount of fallout from this change, instead introduce a new
function by renaming `midx_contains_pack()` to `midx_locate_pack()`,
adding that output parameter, and then reimplementing
`midx_contains_pack()` in terms of it.
Future patches will make use of this new function.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
midx.c | 13 +++++++++++--
midx.h | 5 ++++-
2 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/midx.c b/midx.c
index f55020072f..65ba0c70fe 100644
--- a/midx.c
+++ b/midx.c
@@ -413,7 +413,8 @@ static int cmp_idx_or_pack_name(const char *idx_or_pack_name,
return strcmp(idx_or_pack_name, idx_name);
}
-int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
+int midx_locate_pack(struct multi_pack_index *m, const char *idx_or_pack_name,
+ uint32_t *pos)
{
uint32_t first = 0, last = m->num_packs;
@@ -424,8 +425,11 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
current = m->pack_names[mid];
cmp = cmp_idx_or_pack_name(idx_or_pack_name, current);
- if (!cmp)
+ if (!cmp) {
+ if (pos)
+ *pos = mid;
return 1;
+ }
if (cmp > 0) {
first = mid + 1;
continue;
@@ -436,6 +440,11 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
return 0;
}
+int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
+{
+ return midx_locate_pack(m, idx_or_pack_name, NULL);
+}
+
int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local)
{
struct multi_pack_index *m;
diff --git a/midx.h b/midx.h
index cdd16a8378..a6e969c2ea 100644
--- a/midx.h
+++ b/midx.h
@@ -70,7 +70,10 @@ struct object_id *nth_midxed_object_oid(struct object_id *oid,
struct multi_pack_index *m,
uint32_t n);
int fill_midx_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e, struct multi_pack_index *m);
-int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name);
+int midx_contains_pack(struct multi_pack_index *m,
+ const char *idx_or_pack_name);
+int midx_locate_pack(struct multi_pack_index *m, const char *idx_or_pack_name,
+ uint32_t *pos);
int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
/*
--
2.43.0.24.g980b318f98
^ permalink raw reply related
* [PATCH 07/24] midx: implement `--retain-disjoint` mode
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1701198172.git.me@ttaylorr.com>
Once multi-pack reachability bitmaps learn how to perform pack reuse
over the set of disjoint packs, we will want to teach `git repack` to
evolve the set of disjoint packs over time.
To evolve the set of disjoint packs means any new packs made by `repack`
should be disjoint with respect to the existing set of disjoint packs so
as to be able to join that set when updating the multi-pack index.
The details of generating such packs will be left to future commits. But
any new pack(s) created by repack as disjoint will be marked as such by
passing them over `--stdin-packs` with the special '+' marker when
generating a new MIDX.
This patch, however, addresses the question of how we retain the
existing set of disjoint packs when updating the multi-pack index. One
option would be for `repack` to keep track of the set of disjoint packs
itself by querying the MIDX, and then adding the special '+' marker
appropriately when generating the input for `--stdin-packs`.
But this is verbose and error-prone, since two different parts of Git
would need to maintain the same notion of the set of disjoint packs.
When one disagrees with the other, the set of so-called disjoint packs
may actually contain two or more packs which have one or more object(s)
in common, making the set non-disjoint.
Instead, introduce a `--retain-disjoint` mode for the `git
multi-pack-index write` sub-command which keeps any packs which are:
- marked as disjoint in the existing MIDX, and
- not deleted (e.g., they are not excluded from the input for
`--stdin-packs`).
This will allow the `repack` command to not have to keep track of the
set of currently-disjoint packs itself, reducing the number of lines of
code necessary to implement this feature, and making the resulting
implementation less error-prone.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
Documentation/git-multi-pack-index.txt | 8 +++
builtin/multi-pack-index.c | 3 +
midx.c | 49 +++++++++++++++
midx.h | 1 +
t/lib-disjoint.sh | 38 ++++++++++++
t/t5319-multi-pack-index.sh | 82 ++++++++++++++++++++++++++
6 files changed, 181 insertions(+)
create mode 100644 t/lib-disjoint.sh
diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index d130e65b28..ac0c7b124b 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -54,6 +54,14 @@ write::
"disjoint". See the "`DISP` chunk and disjoint packs"
section in linkgit:gitformat-pack[5] for more.
+ --retain-disjoint::
+ When writing a multi-pack index with a reachability
+ bitmap, keep any packs marked as disjoint in the
+ existing MIDX (if any) as such in the new MIDX. Existing
+ disjoint packs which are removed (e.g., not listed via
+ `--stdin-packs`) are ignored. This option works in
+ addition to the '+' marker for `--stdin-packs`.
+
--refs-snapshot=<path>::
With `--bitmap`, optionally specify a file which
contains a "refs snapshot" taken prior to repacking.
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 0f1dd4651d..dcfabf2626 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -138,6 +138,9 @@ static int cmd_multi_pack_index_write(int argc, const char **argv,
N_("write multi-pack index containing only given indexes")),
OPT_FILENAME(0, "refs-snapshot", &opts.refs_snapshot,
N_("refs snapshot for selecting bitmap commits")),
+ OPT_BIT(0, "retain-disjoint", &opts.flags,
+ N_("retain non-deleted disjoint packs"),
+ MIDX_WRITE_RETAIN_DISJOINT),
OPT_END(),
};
diff --git a/midx.c b/midx.c
index 65ba0c70fe..ce67da9f85 100644
--- a/midx.c
+++ b/midx.c
@@ -721,6 +721,12 @@ static void midx_fanout_add_midx_fanout(struct midx_fanout *fanout,
&fanout->entries[fanout->nr],
cur_object);
fanout->entries[fanout->nr].preferred = 0;
+ /*
+ * It's OK to set disjoint to 0 here, even with
+ * `--retain-disjoint`, since we will always see the disjoint
+ * copy of some object below in get_sorted_entries(), causing us
+ * to die().
+ */
fanout->entries[fanout->nr].disjoint = 0;
fanout->nr++;
}
@@ -1362,6 +1368,37 @@ static struct multi_pack_index *lookup_multi_pack_index(struct repository *r,
return result;
}
+static int midx_retain_existing_disjoint(struct repository *r,
+ struct multi_pack_index *from,
+ struct write_midx_context *ctx)
+{
+ struct bitmapped_pack bp;
+ uint32_t i, midx_pos;
+
+ for (i = 0; i < ctx->nr; i++) {
+ struct pack_info *info = &ctx->info[i];
+ /*
+ * Having to call `midx_locate_pack()` in a loop is
+ * sub-optimal, since it is O(n*log(n)) in the number
+ * of packs.
+ *
+ * When reusing an existing MIDX, we know that the first
+ * 'n' packs appear in the same order, so we could avoid
+ * this when reusing an existing MIDX. But we may be
+ * instead relying on the order given to us by
+ * for_each_file_in_pack_dir(), in which case we can't
+ * make any such guarantees.
+ */
+ if (!midx_locate_pack(from, info->pack_name, &midx_pos))
+ continue;
+
+ if (nth_bitmapped_pack(r, from, &bp, midx_pos) < 0)
+ return -1;
+ info->disjoint = bp.disjoint;
+ }
+ return 0;
+}
+
static int write_midx_internal(const char *object_dir,
struct string_list *packs_to_include,
struct string_list *packs_to_drop,
@@ -1444,6 +1481,18 @@ static int write_midx_internal(const char *object_dir,
for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
stop_progress(&ctx.progress);
+ if (flags & MIDX_WRITE_RETAIN_DISJOINT) {
+ struct multi_pack_index *m = ctx.m;
+ if (!m)
+ m = lookup_multi_pack_index(the_repository, object_dir);
+
+ if (m) {
+ result = midx_retain_existing_disjoint(the_repository, m, &ctx);
+ if (result)
+ goto cleanup;
+ }
+ }
+
if ((ctx.m && ctx.nr == ctx.m->num_packs) &&
!(packs_to_include || packs_to_drop)) {
struct bitmap_index *bitmap_git;
diff --git a/midx.h b/midx.h
index a6e969c2ea..d7ce52ff7b 100644
--- a/midx.h
+++ b/midx.h
@@ -54,6 +54,7 @@ struct multi_pack_index {
#define MIDX_WRITE_BITMAP (1 << 2)
#define MIDX_WRITE_BITMAP_HASH_CACHE (1 << 3)
#define MIDX_WRITE_BITMAP_LOOKUP_TABLE (1 << 4)
+#define MIDX_WRITE_RETAIN_DISJOINT (1 << 5)
const unsigned char *get_midx_checksum(struct multi_pack_index *m);
void get_midx_filename(struct strbuf *out, const char *object_dir);
diff --git a/t/lib-disjoint.sh b/t/lib-disjoint.sh
new file mode 100644
index 0000000000..c6c6e74aba
--- /dev/null
+++ b/t/lib-disjoint.sh
@@ -0,0 +1,38 @@
+# Helpers for scripts testing disjoint packs; see t5319 for example usage.
+
+objdir=.git/objects
+
+test_disjoint_1 () {
+ local pack="$1"
+ local want="$2"
+
+ test-tool read-midx --bitmap $objdir >out &&
+ grep -A 3 "$pack" out >found &&
+
+ if ! test -s found
+ then
+ echo >&2 "could not find '$pack' in MIDX"
+ return 1
+ fi
+
+ if ! grep -q "disjoint: $want" found
+ then
+ echo >&2 "incorrect disjoint state for pack '$pack'"
+ return 1
+ fi
+ return 0
+}
+
+# test_must_be_disjoint <pack-$XYZ.pack>
+#
+# Ensures that the given pack is marked as disjoint.
+test_must_be_disjoint () {
+ test_disjoint_1 "$1" "yes"
+}
+
+# test_must_not_be_disjoint <pack-$XYZ.pack>
+#
+# Ensures that the given pack is not marked as disjoint.
+test_must_not_be_disjoint () {
+ test_disjoint_1 "$1" "no"
+}
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index fd24e0c952..02cfddf151 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -3,6 +3,7 @@
test_description='multi-pack-indexes'
. ./test-lib.sh
. "$TEST_DIRECTORY"/lib-chunk.sh
+. "$TEST_DIRECTORY"/lib-disjoint.sh
GIT_TEST_MULTI_PACK_INDEX=0
objdir=.git/objects
@@ -1215,4 +1216,85 @@ test_expect_success 'non-disjoint packs are detected' '
)
'
+test_expect_success 'retain disjoint packs while writing' '
+ test_when_finished "rm -fr repo" &&
+ git init repo &&
+ (
+ cd repo &&
+
+ for i in 1 2
+ do
+ test_commit "$i" && git repack -d || return 1
+ done &&
+
+ find $objdir/pack -type f -name "pack-*.idx" |
+ sed -e "s/^.*\/\(.*\)/\1/g" | sort >packs.old &&
+
+ test_line_count = 2 packs.old &&
+ disjoint="$(head -n 1 packs.old)" &&
+ non_disjoint="$(tail -n 1 packs.old)" &&
+
+ cat >in <<-EOF &&
+ +$disjoint
+ $non_disjoint
+ EOF
+ git multi-pack-index write --stdin-packs --bitmap <in &&
+
+ test_must_be_disjoint "${disjoint%.idx}.pack" &&
+ test_must_not_be_disjoint "${non_disjoint%.idx}.pack" &&
+
+ test_commit 3 &&
+ git repack -d &&
+
+ find $objdir/pack -type f -name "pack-*.idx" |
+ sed -e "s/^.*\/\(.*\)/\1/g" | sort >packs.new &&
+
+ new_disjoint="$(comm -13 packs.old packs.new)" &&
+ cat >in <<-EOF &&
+ $disjoint
+ $non_disjoint
+ +$new_disjoint
+ EOF
+ git multi-pack-index write --stdin-packs --bitmap \
+ --retain-disjoint <in &&
+
+ test_must_be_disjoint "${disjoint%.idx}.pack" &&
+ test_must_be_disjoint "${new_disjoint%.idx}.pack" &&
+ test_must_not_be_disjoint "${non_disjoint%.idx}.pack"
+
+ )
+'
+
+test_expect_success 'non-disjoint packs are detected via --retain-disjoint' '
+ test_when_finished "rm -fr repo" &&
+ git init repo &&
+ (
+ cd repo &&
+ packdir=.git/objects/pack &&
+
+ test_commit base &&
+ base="$(echo base | git pack-objects --revs $packdir/pack)" &&
+
+ cat >in <<-EOF &&
+ +pack-$base.idx
+ EOF
+ git multi-pack-index write --stdin-packs --bitmap <in &&
+
+ test_must_be_disjoint "pack-$base.pack" &&
+
+ test_commit other &&
+ other="$(echo other | git pack-objects --revs $packdir/pack)" &&
+
+ cat >in <<-EOF &&
+ pack-$base.idx
+ +pack-$other.idx
+ EOF
+ test_must_fail git multi-pack-index write --stdin-packs --retain-disjoint --bitmap <in 2>err &&
+ grep "duplicate object.* among disjoint packs" err &&
+
+ test_must_fail git multi-pack-index write --retain-disjoint --bitmap 2>err &&
+ grep "duplicate object.* among disjoint packs" err
+ )
+'
+
test_done
--
2.43.0.24.g980b318f98
^ permalink raw reply related
* [PATCH 08/24] pack-objects: implement `--ignore-disjoint` mode
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1701198172.git.me@ttaylorr.com>
Before multi-pack reachability bitmaps learn how to perform pack reuse
over the set of disjoint packs, we will need a way to generate packs
that are known to be disjoint with respect to the currently marked set
of disjoint packs.
In other words, we want a way to make a pack which does not have any
objects contained in the union of the set of packs which are currently
marked as disjoint.
There are a various ways that we could go about this, for example:
- passing `--unpacked`, which would exclude all packed objects (and
thus would not contain any objects from the disjoint pack)
- passing `--stdin-packs` with the set of packs currently marked as
disjoint as "excluded", indicating that `pack-objects` should
discard any objects present in any of the excluded packs (thus
producing a disjoint pack)
- marking each of the disjoint packs as kept in-core with the
`--keep-pack` flag, and then passing `--honor-pack-keep` to
similarly ignore any object(s) from kept packs (thus also producing
a pack which is disjoint with respect to the current set)
`git repack` is the main entry-point to generating a new pack, by
invoking `pack-objects` and then adding the new pack to the set of
disjoint packs if generating a new MIDX. However, `repack` has a number
of ways to invoke `pack-objects` (e.g., all-into-one repacks, geometric
repacks, incremental repacks, etc.), all of which would require careful
reasoning in order to prove that the resulting set of packs is disjoint.
The most appealing option of the above would be to pass the set of
disjoint packs as kept (via `--keep-pack`) and then ignore their
contents (with `--honor-pack-keep`), doing so for all kinds of
`pack-objects` invocations. But there may be more disjoint packs than we
can easily fit into the command-line arguments.
Instead, teach `pack-objects` a special `--ignore-disjoint` which is the
moral equivalent of marking the set of disjoint packs as kept, and
ignoring their contents, even if it would have otherwise been packed. In
fact, this similarity extends down to the implementation, where each
disjoint pack is first loaded, then has its `pack_keep_in_core` bit set.
With this in place, we can use the kept-pack cache from 20b031fede
(packfile: add kept-pack cache for find_kept_pack_entry(), 2021-02-22),
which looks up objects first in a cache containing just the set of kept
(in this case, disjoint) packs. Assuming that the set of disjoint packs
is a relatively small portion of the entire repository (which should be
a safe assumption to make), each object lookup will be very inexpensive.
The only place we want to avoid using `--ignore-disjoint` is in
conjunction with `--cruft`, since doing so may cause us to omit an
object which would have been included in a new cruft pack in order to
freshen it. In other words, failing to do so might cause that object to
be pruned from the repository earlier than expected.
Otherwise, `--ignore-disjoint` is compatible with most other modes of
`pack-objects`. These various combinations are tested below. As a
result, `repack` will be able to unconditionally (except for the cruft
pack) pass `--ignore-disjoint` when trying to add a new pack to the
disjoint set, and the result will be usable, without having to carefully
consider and reason about each individual case.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
Documentation/git-pack-objects.txt | 8 ++
builtin/pack-objects.c | 31 +++++-
t/lib-disjoint.sh | 11 ++
t/t5331-pack-objects-stdin.sh | 156 +++++++++++++++++++++++++++++
4 files changed, 203 insertions(+), 3 deletions(-)
diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index e32404c6aa..592c4ce742 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -96,6 +96,14 @@ base-name::
Incompatible with `--revs`, or options that imply `--revs` (such as
`--all`), with the exception of `--unpacked`, which is compatible.
+--ignore-disjoint::
+ This flag causes an object that appears in any pack marked as
+ "disjoint" by the multi-pack index to be ignored, even if it
+ would have otherwise been packed. When used with
+ `--stdin-packs`, objects from disjoint packs may be included if
+ and only if a disjoint pack is explicitly given as an input pack
+ to `--stdin-packs`. Incompatible with `--cruft`.
+
--cruft::
Packs unreachable objects into a separate "cruft" pack, denoted
by the existence of a `.mtimes` file. Typically used by `git
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index bfa60359d4..107154db34 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -207,6 +207,7 @@ static int have_non_local_packs;
static int incremental;
static int ignore_packed_keep_on_disk;
static int ignore_packed_keep_in_core;
+static int ignore_midx_disjoint_packs;
static int allow_ofs_delta;
static struct pack_idx_option pack_idx_opts;
static const char *base_name;
@@ -1403,7 +1404,8 @@ static int want_found_object(const struct object_id *oid, int exclude,
/*
* Then handle .keep first, as we have a fast(er) path there.
*/
- if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core) {
+ if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core ||
+ ignore_midx_disjoint_packs) {
/*
* Set the flags for the kept-pack cache to be the ones we want
* to ignore.
@@ -1415,7 +1417,7 @@ static int want_found_object(const struct object_id *oid, int exclude,
unsigned flags = 0;
if (ignore_packed_keep_on_disk)
flags |= ON_DISK_KEEP_PACKS;
- if (ignore_packed_keep_in_core)
+ if (ignore_packed_keep_in_core || ignore_midx_disjoint_packs)
flags |= IN_CORE_KEEP_PACKS;
if (ignore_packed_keep_on_disk && p->pack_keep)
@@ -3389,6 +3391,7 @@ static void read_packs_list_from_stdin(void)
die(_("could not find pack '%s'"), item->string);
if (!is_pack_valid(p))
die(_("packfile %s cannot be accessed"), p->pack_name);
+ p->pack_keep_in_core = 0;
}
/*
@@ -4266,6 +4269,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
N_("create packs suitable for shallow fetches")),
OPT_BOOL(0, "honor-pack-keep", &ignore_packed_keep_on_disk,
N_("ignore packs that have companion .keep file")),
+ OPT_BOOL(0, "ignore-disjoint", &ignore_midx_disjoint_packs,
+ N_("ignore packs that are marked disjoint in the MIDX")),
OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
N_("ignore this pack")),
OPT_INTEGER(0, "compression", &pack_compression_level,
@@ -4412,7 +4417,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
if (use_internal_rev_list)
die(_("cannot use internal rev list with --cruft"));
if (stdin_packs)
- die(_("cannot use --stdin-packs with --cruft"));
+ die(_("cannot use %s with %s"), "--stdin-packs", "--cruft");
+ if (ignore_midx_disjoint_packs)
+ die(_("cannot use %s with %s"), "--ignore-disjoint", "--cruft");
}
/*
@@ -4452,6 +4459,24 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
if (!p) /* no keep-able packs found */
ignore_packed_keep_on_disk = 0;
}
+ if (ignore_midx_disjoint_packs) {
+ struct multi_pack_index *m = get_multi_pack_index(the_repository);
+ struct bitmapped_pack pack;
+ unsigned any_disjoint = 0;
+ uint32_t i;
+
+ for (i = 0; m && m->chunk_disjoint_packs && i < m->num_packs; i++) {
+ if (nth_bitmapped_pack(the_repository, m, &pack, i) < 0)
+ die(_("could not load bitmapped pack %i"), i);
+ if (pack.disjoint) {
+ pack.p->pack_keep_in_core = 1;
+ any_disjoint = 1;
+ }
+ }
+
+ if (!any_disjoint) /* no disjoint packs to ignore */
+ ignore_midx_disjoint_packs = 0;
+ }
if (local) {
/*
* unlike ignore_packed_keep_on_disk above, we do not
diff --git a/t/lib-disjoint.sh b/t/lib-disjoint.sh
index c6c6e74aba..c802ca6940 100644
--- a/t/lib-disjoint.sh
+++ b/t/lib-disjoint.sh
@@ -36,3 +36,14 @@ test_must_be_disjoint () {
test_must_not_be_disjoint () {
test_disjoint_1 "$1" "no"
}
+
+# packed_contents </path/to/pack-$XYZ.idx [...]>
+#
+# Prints the set of objects packed in the given pack indexes.
+packed_contents () {
+ for idx in "$@"
+ do
+ git show-index <$idx || return 1
+ done >tmp &&
+ cut -d" " -f2 <tmp | sort -u
+}
diff --git a/t/t5331-pack-objects-stdin.sh b/t/t5331-pack-objects-stdin.sh
index 2dcf1eecee..e522aa3f7d 100755
--- a/t/t5331-pack-objects-stdin.sh
+++ b/t/t5331-pack-objects-stdin.sh
@@ -6,6 +6,7 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
TEST_PASSES_SANITIZE_LEAK=true
. ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-disjoint.sh
packed_objects () {
git show-index <"$1" >tmp-object-list &&
@@ -237,4 +238,159 @@ test_expect_success 'pack-objects --stdin with packfiles from main and alternate
test_cmp expected-objects actual-objects
'
+objdir=.git/objects
+packdir=$objdir/pack
+
+test_expect_success 'loose objects also in disjoint packs are ignored' '
+ test_when_finished "rm -fr repo" &&
+ git init repo &&
+ (
+ cd repo &&
+
+ # create a pack containing the objects in each commit below, but
+ # do not delete their loose copies
+ test_commit base &&
+ base_pack="$(echo base | git pack-objects --revs $packdir/pack)" &&
+
+ test_commit other &&
+ other_pack="$(echo base..other | git pack-objects --revs $packdir/pack)" &&
+
+ cat >in <<-EOF &&
+ pack-$base_pack.idx
+ +pack-$other_pack.idx
+ EOF
+ git multi-pack-index write --stdin-packs --bitmap <in &&
+
+ test_commit more &&
+ out="$(git pack-objects --all --ignore-disjoint $packdir/pack)" &&
+
+ # gather all objects in "all", and objects from the disjoint
+ # pack in "disjoint"
+ git cat-file --batch-all-objects --batch-check="%(objectname)" >all &&
+ packed_contents "$packdir/pack-$other_pack.idx" >disjoint &&
+
+ # make sure that the set of objects we just generated matches
+ # "all \ disjoint"
+ packed_contents "$packdir/pack-$out.idx" >got &&
+ comm -23 all disjoint >want &&
+ test_cmp want got
+ )
+'
+
+test_expect_success 'objects in disjoint packs are ignored (--unpacked)' '
+ test_when_finished "rm -fr repo" &&
+ git init repo &&
+ (
+ cd repo &&
+
+ for c in A B
+ do
+ test_commit "$c" || return 1
+ done &&
+
+ A="$(echo "A" | git pack-objects --revs $packdir/pack)" &&
+ B="$(echo "A..B" | git pack-objects --revs $packdir/pack)" &&
+
+ cat >in <<-EOF &&
+ pack-$A.idx
+ +pack-$B.idx
+ EOF
+ git multi-pack-index write --stdin-packs --bitmap <in &&
+
+ test_must_not_be_disjoint "pack-$A.pack" &&
+ test_must_be_disjoint "pack-$B.pack" &&
+
+ test_commit C &&
+
+ got="$(git pack-objects --all --unpacked --ignore-disjoint $packdir/pack)" &&
+ packed_contents "$packdir/pack-$got.idx" >actual &&
+
+ git rev-list --objects --no-object-names B..C >expect.raw &&
+ sort <expect.raw >expect &&
+
+ test_cmp expect actual
+ )
+'
+
+test_expect_success 'objects in disjoint packs are ignored (--stdin-packs)' '
+ # Create objects in three separate packs:
+ #
+ # - pack A (midx, non disjoint)
+ # - pack B (midx, disjoint)
+ # - pack C (non-midx)
+ #
+ # Then create a new pack with `--stdin-packs` and `--ignore-disjoint`
+ # including packs A, B, and C. The resulting pack should contain
+ # only the objects from packs A, and C, excluding those from
+ # pack B as it is marked as disjoint.
+ test_when_finished "rm -fr repo" &&
+ git init repo &&
+ (
+ cd repo &&
+
+ for c in A B C
+ do
+ test_commit "$c" || return 1
+ done &&
+
+ A="$(echo "A" | git pack-objects --revs $packdir/pack)" &&
+ B="$(echo "A..B" | git pack-objects --revs $packdir/pack)" &&
+ C="$(echo "B..C" | git pack-objects --revs $packdir/pack)" &&
+
+ cat >in <<-EOF &&
+ pack-$A.idx
+ +pack-$B.idx
+ EOF
+ git multi-pack-index write --stdin-packs --bitmap <in &&
+
+ test_must_not_be_disjoint "pack-$A.pack" &&
+ test_must_be_disjoint "pack-$B.pack" &&
+
+ # Generate a pack with `--stdin-packs` using packs "A" and "C",
+ # but excluding objects from "B". The objects from pack "B" are
+ # expected to be omitted from the generated pack for two
+ # reasons:
+ #
+ # - because it was specified as a negated tip via
+ # `--stdin-packs`
+ # - because it is a disjoint pack.
+ cat >in <<-EOF &&
+ pack-$A.pack
+ ^pack-$B.pack
+ pack-$C.pack
+ EOF
+ got="$(git pack-objects --stdin-packs --ignore-disjoint $packdir/pack <in)" &&
+
+ packed_contents "$packdir/pack-$got.idx" >actual &&
+ packed_contents "$packdir/pack-$A.idx" \
+ "$packdir/pack-$C.idx" >expect &&
+ test_cmp expect actual &&
+
+ # Generate another pack with `--stdin-packs`, this time
+ # using packs "B" and "C". The objects from pack "B" are
+ # expected to be in the final pack, despite it being a
+ # disjoint pack, because "B" was mentioned explicitly
+ # via `stdin-packs`.
+ cat >in <<-EOF &&
+ pack-$B.pack
+ pack-$C.pack
+ EOF
+ got="$(git pack-objects --stdin-packs --ignore-disjoint $packdir/pack <in)" &&
+
+ packed_contents "$packdir/pack-$got.idx" >actual &&
+ packed_contents "$packdir/pack-$B.idx" \
+ "$packdir/pack-$C.idx" >expect &&
+ test_cmp expect actual
+ )
+'
+
+test_expect_success '--cruft is incompatible with --ignore-disjoint' '
+ test_must_fail git pack-objects --cruft --ignore-disjoint --stdout \
+ </dev/null >/dev/null 2>actual &&
+ cat >expect <<-\EOF &&
+ fatal: cannot use --ignore-disjoint with --cruft
+ EOF
+ test_cmp expect actual
+'
+
test_done
--
2.43.0.24.g980b318f98
^ permalink raw reply related
* [PATCH 09/24] repack: implement `--extend-disjoint` mode
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1701198172.git.me@ttaylorr.com>
Now that we can generate packs which are disjoint with respect to the
set of currently-disjoint packs, implement a mode of `git repack` which
extends the set of disjoint packs with any new (non-cruft) pack(s)
generated during the repack.
The idea is mostly straightforward, with a couple of gotcha's. The
straightforward part is to make sure that any new packs are disjoint
with respect to the set of currently disjoint packs which are _not_
being removed from the repository as a result of the repack.
If a pack which is currently marked as disjoint is, on the other hand,
about to be removed from the repository, it is OK (and expected) that
new pack(s) will contain some or all of its objects. Since the pack
originally marked as disjoint will be removed, it will necessarily leave
the disjoint set, making room for new packs with its same objects to
take its place. In other words, the resulting set of disjoint packs will
be disjoint with respect to one another.
The gotchas mostly have to do with making sure that we do not generate a
disjoint pack in the following scenarios:
- promisor packs
- cruft packs (which may necessarily need to include an object from a
disjoint pack in order to freshen it in certain circumstances)
- all-into-one repacks without '-d'
- `--filter-to`, which conceptually could work with the new
`--extend-disjoint` option, but only in limited circumstances
Otherwise, we mark which packs were created as disjoint by using a new
bit in the `generated_pack_data` struct, and then marking those pack(s)
as disjoint accordingly when generating the MIDX. Non-deleted packs
which are marked as disjoint are retained as such by passing the
equivalent of `--retain-disjoint` when calling the MIDX API to update
the MIDX.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
Documentation/git-repack.txt | 12 +++
builtin/repack.c | 57 +++++++++---
t/t7700-repack.sh | 4 +-
t/t7705-repack-extend-disjoint.sh | 142 ++++++++++++++++++++++++++++++
4 files changed, 203 insertions(+), 12 deletions(-)
create mode 100755 t/t7705-repack-extend-disjoint.sh
diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index c902512a9e..50ba5e7f9c 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -249,6 +249,18 @@ linkgit:git-multi-pack-index[1]).
Write a multi-pack index (see linkgit:git-multi-pack-index[1])
containing the non-redundant packs.
+--extend-disjoint::
+ Extends the set of disjoint packs. All new non-cruft pack(s)
+ generated are constructed to be disjoint with respect to the set
+ of currently disjoint packs, excluding any packs that will be
+ removed as a result of the repack operation. For more on
+ disjoint packs, see the details in linkgit:gitformat-pack[5],
+ under the section "`DISP` chunk and disjoint packs".
++
+Useful only with the combination of `--write-midx` and
+`--write-bitmap-index`. Incompatible with `--filter-to`. Incompatible
+with `-A`, `-a`, or `--cruft` unless `-d` is given.
+
CONFIGURATION
-------------
diff --git a/builtin/repack.c b/builtin/repack.c
index edaee4dbec..0601bd16c4 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -58,6 +58,7 @@ struct pack_objects_args {
int no_reuse_object;
int quiet;
int local;
+ int ignore_disjoint;
struct list_objects_filter_options filter_options;
};
@@ -293,6 +294,8 @@ static void prepare_pack_objects(struct child_process *cmd,
strvec_push(&cmd->args, "--local");
if (args->quiet)
strvec_push(&cmd->args, "--quiet");
+ if (args->ignore_disjoint)
+ strvec_push(&cmd->args, "--ignore-disjoint");
if (delta_base_offset)
strvec_push(&cmd->args, "--delta-base-offset");
strvec_push(&cmd->args, out);
@@ -334,9 +337,11 @@ static struct {
struct generated_pack_data {
struct tempfile *tempfiles[ARRAY_SIZE(exts)];
+ unsigned disjoint : 1;
};
-static struct generated_pack_data *populate_pack_exts(const char *name)
+static struct generated_pack_data *populate_pack_exts(const char *name,
+ unsigned disjoint)
{
struct stat statbuf;
struct strbuf path = STRBUF_INIT;
@@ -353,6 +358,8 @@ static struct generated_pack_data *populate_pack_exts(const char *name)
data->tempfiles[i] = register_tempfile(path.buf);
}
+ data->disjoint = disjoint;
+
strbuf_release(&path);
return data;
}
@@ -379,6 +386,8 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
prepare_pack_objects(&cmd, args, packtmp);
cmd.in = -1;
+ strvec_pushf(&cmd.args, "--no-ignore-disjoint");
+
/*
* NEEDSWORK: Giving pack-objects only the OIDs without any ordering
* hints may result in suboptimal deltas in the resulting pack. See if
@@ -421,7 +430,7 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
line.buf);
write_promisor_file(promisor_name, NULL, 0);
- item->util = populate_pack_exts(item->string);
+ item->util = populate_pack_exts(item->string, 0);
free(promisor_name);
}
@@ -731,8 +740,13 @@ static void midx_included_packs(struct string_list *include,
for_each_string_list_item(item, &existing->kept_packs)
string_list_insert(include, xstrfmt("%s.idx", item->string));
- for_each_string_list_item(item, names)
- string_list_insert(include, xstrfmt("pack-%s.idx", item->string));
+ for_each_string_list_item(item, names) {
+ const char *marker = "";
+ struct generated_pack_data *data = item->util;
+ if (data->disjoint)
+ marker = "+";
+ string_list_insert(include, xstrfmt("%spack-%s.idx", marker, item->string));
+ }
if (geometry->split_factor) {
struct strbuf buf = STRBUF_INIT;
uint32_t i;
@@ -788,7 +802,8 @@ static int write_midx_included_packs(struct string_list *include,
struct pack_geometry *geometry,
struct string_list *names,
const char *refs_snapshot,
- int show_progress, int write_bitmaps)
+ int show_progress, int write_bitmaps,
+ int exclude_disjoint)
{
struct child_process cmd = CHILD_PROCESS_INIT;
struct string_list_item *item;
@@ -852,6 +867,9 @@ static int write_midx_included_packs(struct string_list *include,
if (refs_snapshot)
strvec_pushf(&cmd.args, "--refs-snapshot=%s", refs_snapshot);
+ if (exclude_disjoint)
+ strvec_push(&cmd.args, "--retain-disjoint");
+
ret = start_command(&cmd);
if (ret)
return ret;
@@ -895,7 +913,7 @@ static void remove_redundant_bitmaps(struct string_list *include,
static int finish_pack_objects_cmd(struct child_process *cmd,
struct string_list *names,
- int local)
+ int local, int disjoint)
{
FILE *out;
struct strbuf line = STRBUF_INIT;
@@ -913,7 +931,7 @@ static int finish_pack_objects_cmd(struct child_process *cmd,
*/
if (local) {
item = string_list_append(names, line.buf);
- item->util = populate_pack_exts(line.buf);
+ item->util = populate_pack_exts(line.buf, disjoint);
}
}
fclose(out);
@@ -970,7 +988,7 @@ static int write_filtered_pack(const struct pack_objects_args *args,
fprintf(in, "%s%s.pack\n", caret, item->string);
fclose(in);
- return finish_pack_objects_cmd(&cmd, names, local);
+ return finish_pack_objects_cmd(&cmd, names, local, 0);
}
static int existing_cruft_pack_cmp(const void *va, const void *vb)
@@ -1098,7 +1116,7 @@ static int write_cruft_pack(const struct pack_objects_args *args,
fprintf(in, "%s.pack\n", item->string);
fclose(in);
- return finish_pack_objects_cmd(&cmd, names, local);
+ return finish_pack_objects_cmd(&cmd, names, local, 0);
}
static const char *find_pack_prefix(const char *packdir, const char *packtmp)
@@ -1190,6 +1208,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
N_("pack prefix to store a pack containing pruned objects")),
OPT_STRING(0, "filter-to", &filter_to, N_("dir"),
N_("pack prefix to store a pack containing filtered out objects")),
+ OPT_BOOL(0, "extend-disjoint", &po_args.ignore_disjoint,
+ N_("add new packs to the set of disjoint ones")),
OPT_END()
};
@@ -1255,6 +1275,16 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
strbuf_release(&path);
}
+ if (po_args.ignore_disjoint) {
+ if (filter_to)
+ die(_("options '%s' and '%s' cannot be used together"),
+ "--filter-to", "--extend-disjoint");
+ if (pack_everything && !delete_redundant)
+ die(_("cannot use '--extend-disjoint' with '%s' but not '-d'"),
+ pack_everything & LOOSEN_UNREACHABLE ? "-A" :
+ pack_everything & PACK_CRUFT ? "--cruft" : "-a");
+ }
+
packdir = mkpathdup("%s/pack", get_object_directory());
packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
@@ -1308,6 +1338,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
if (pack_everything & ALL_INTO_ONE) {
repack_promisor_objects(&po_args, &names);
+ if (delete_redundant)
+ strvec_pushf(&cmd.args, "--no-ignore-disjoint");
+
if (has_existing_non_kept_packs(&existing) &&
delete_redundant &&
!(pack_everything & PACK_CRUFT)) {
@@ -1364,7 +1397,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
fclose(in);
}
- ret = finish_pack_objects_cmd(&cmd, &names, 1);
+ ret = finish_pack_objects_cmd(&cmd, &names, 1, po_args.ignore_disjoint);
if (ret)
goto cleanup;
@@ -1387,6 +1420,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
cruft_po_args.local = po_args.local;
cruft_po_args.quiet = po_args.quiet;
+ cruft_po_args.ignore_disjoint = 0;
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
@@ -1487,7 +1521,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
ret = write_midx_included_packs(&include, &geometry, &names,
refs_snapshot ? get_tempfile_path(refs_snapshot) : NULL,
- show_progress, write_bitmaps > 0);
+ show_progress, write_bitmaps > 0,
+ po_args.ignore_disjoint);
if (!ret && write_bitmaps)
remove_redundant_bitmaps(&include, packdir);
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index d2975e6c93..277f1ff1d7 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -6,6 +6,7 @@ test_description='git repack works correctly'
. "${TEST_DIRECTORY}/lib-bitmap.sh"
. "${TEST_DIRECTORY}/lib-midx.sh"
. "${TEST_DIRECTORY}/lib-terminal.sh"
+. "${TEST_DIRECTORY}/lib-disjoint.sh"
commit_and_pack () {
test_commit "$@" 1>&2 &&
@@ -525,7 +526,8 @@ test_expect_success '--filter works with --max-pack-size' '
'
objdir=.git/objects
-midx=$objdir/pack/multi-pack-index
+packdir=$objdir/pack
+midx=$packdir/multi-pack-index
test_expect_success 'setup for --write-midx tests' '
git init midx &&
diff --git a/t/t7705-repack-extend-disjoint.sh b/t/t7705-repack-extend-disjoint.sh
new file mode 100755
index 0000000000..0c8be1cb3f
--- /dev/null
+++ b/t/t7705-repack-extend-disjoint.sh
@@ -0,0 +1,142 @@
+#!/bin/sh
+
+test_description='git repack --extend-disjoint works correctly'
+
+. ./test-lib.sh
+. "${TEST_DIRECTORY}/lib-disjoint.sh"
+
+packdir=.git/objects/pack
+
+GIT_TEST_MULTI=0
+GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0
+
+test_expect_success 'repack --extend-disjoint creates new disjoint packs' '
+ git init repo &&
+ (
+ cd repo &&
+
+ test_commit A &&
+ test_commit B &&
+
+ A="$(echo A | git pack-objects --revs $packdir/pack)" &&
+ B="$(echo A..B | git pack-objects --revs $packdir/pack)" &&
+
+ git prune-packed &&
+
+ cat >in <<-EOF &&
+ pack-$A.idx
+ +pack-$B.idx
+ EOF
+ git multi-pack-index write --bitmap --stdin-packs <in &&
+
+ test_must_not_be_disjoint "pack-$A.pack" &&
+ test_must_be_disjoint "pack-$B.pack" &&
+
+ test_commit C &&
+
+ find $packdir -type f -name "*.idx" | sort >packs.before &&
+ git repack --write-midx --write-bitmap-index --extend-disjoint &&
+ find $packdir -type f -name "*.idx" | sort >packs.after &&
+
+ comm -13 packs.before packs.after >packs.new &&
+
+ test_line_count = 1 packs.new &&
+
+ test_must_not_be_disjoint "pack-$A.pack" &&
+ test_must_be_disjoint "pack-$B.pack" &&
+ test_must_be_disjoint "$(basename $(cat packs.new) .idx).pack"
+ )
+'
+
+test_expect_success 'repack --extend-disjoint combines existing disjoint packs' '
+ (
+ cd repo &&
+
+ test_commit D &&
+
+ git repack -a -d --write-midx --write-bitmap-index --extend-disjoint &&
+
+ find $packdir -type f -name "*.pack" >packs &&
+ test_line_count = 1 packs &&
+
+ test_must_be_disjoint "$(basename $(cat packs))"
+
+ )
+'
+
+test_expect_success 'repack --extend-disjoint with --geometric' '
+ git init disjoint-geometric &&
+ (
+ cd disjoint-geometric &&
+
+ test_commit_bulk 8 &&
+ base="$(basename $(ls $packdir/pack-*.idx))" &&
+ echo "+$base" >>in &&
+
+ test_commit A &&
+ A="$(echo HEAD^.. | git pack-objects --revs $packdir/pack)" &&
+ test_commit B &&
+ B="$(echo HEAD^.. | git pack-objects --revs $packdir/pack)" &&
+
+ git prune-packed &&
+
+ cat >>in <<-EOF &&
+ +pack-$A.idx
+ +pack-$B.idx
+ EOF
+ git multi-pack-index write --bitmap --stdin-packs <in &&
+
+ test_must_be_disjoint "pack-$A.pack" &&
+ test_must_be_disjoint "pack-$B.pack" &&
+ test_must_be_disjoint "${base%.idx}.pack" &&
+
+ test_commit C &&
+
+ find $packdir -type f -name "*.pack" | sort >packs.before &&
+ git repack --geometric=2 -d --write-midx --write-bitmap-index --extend-disjoint &&
+ find $packdir -type f -name "*.pack" | sort >packs.after &&
+
+ comm -12 packs.before packs.after >packs.unchanged &&
+ comm -23 packs.before packs.after >packs.removed &&
+ comm -13 packs.before packs.after >packs.new &&
+
+ cat >expect <<-EOF &&
+ $packdir/${base%.idx}.pack
+ EOF
+ test_cmp expect packs.unchanged &&
+
+ sort >expect <<-EOF &&
+ $packdir/pack-$A.pack
+ $packdir/pack-$B.pack
+ EOF
+ test_cmp expect packs.removed &&
+
+ test_line_count = 1 packs.new &&
+
+ test_must_be_disjoint "$(basename $(cat packs.new))" &&
+ test_must_be_disjoint "${base%.idx}.pack"
+ )
+'
+
+for flag in "-A" "-a" "--cruft"
+do
+ test_expect_success "repack --extend-disjoint incompatible with $flag without -d" '
+ test_must_fail git repack $flag --extend-disjoint \
+ --write-midx --write-bitmap-index 2>actual &&
+ cat >expect <<-EOF &&
+ fatal: cannot use $SQ--extend-disjoint$SQ with $SQ$flag$SQ but not $SQ-d$SQ
+ EOF
+ test_cmp expect actual
+ '
+done
+
+test_expect_success 'repack --extend-disjoint is incompatible with --filter-to' '
+ test_must_fail git repack --extend-disjoint --filter-to=dir 2>actual &&
+
+ cat >expect <<-EOF &&
+ fatal: options $SQ--filter-to$SQ and $SQ--extend-disjoint$SQ cannot be used together
+ EOF
+ test_cmp expect actual
+'
+
+test_done
--
2.43.0.24.g980b318f98
^ permalink raw reply related
* [PATCH 10/24] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1701198172.git.me@ttaylorr.com>
When trying to assemble a pack with bitmaps using `--use-bitmap-index`,
`pack-objects` asks the pack-bitmap machinery for a bitmap which
indicates the set of objects we can "reuse" verbatim from on-disk.
This set is roughly comprised of: a prefix of objects in the bitmapped
pack (or preferred pack, in the case of a multi-pack reachability
bitmap), plus any other objects not included in the prefix, excluding
any deltas whose base we are not sending in the resulting pack.
The pack-bitmap machinery is responsible for computing this bitmap, and
does so with the following functions:
- reuse_partial_packfile_from_bitmap()
- try_partial_reuse()
In the existing implementation, the first function is responsible for
(a) marking the prefix of objects in the reusable pack, and then (b)
calling try_partial_reuse() on any remaining objects to ensure that they
are also reusable (and removing them from the bitmapped set if they are
not).
Likewise, the `try_partial_reuse()` function is responsible for checking
whether an isolated object (that is, an object from the bitmapped
pack/preferred pack not contained in the prefix from earlier) may be
reused, i.e. that it isn't a delta of an object that we are not sending
in the resulting pack.
These functions are based on two core assumptions, which we will unwind
in this and the following commits:
1. There is only a single pack from the bitmap which is eligible for
verbatim pack-reuse. For single-pack bitmaps, this is trivially the
bitmapped pack. For multi-pack bitmaps, this is (currently) the
MIDX's preferred pack.
2. The pack eligible for reuse has its first object in bit position 0,
and all objects from that pack follow in pack-order from that first
bit position.
In order to perform verbatim pack reuse over multiple packs, we must
unwind these two assumptions. Most notably, in order to reuse bits from
a given packfile, we need to know the first bit position occupied by
an object form that packfile. To propagate this information around, pass
a `struct bitmapped_pack *` anywhere we previously passed a `struct
packed_git *`, since the former contains the bitmap position we're
interested in (as well as a pointer to the latter).
As an additional step, factor out a sub-routine from the main
`reuse_partial_packfile_from_bitmap()` function, called
`reuse_partial_packfile_from_bitmap_1()`. This new function will be
responsible for figuring out which objects may be reused from a single
pack, and the existing function will dispatch multiple calls to its new
helper function for each reusable pack.
Consequently, `reuse_partial_packfile_from_bitmap()` will now maintain
an array of reusable packs instead of a single such pack. We currently
expect that array to have only a single element, so this awkward state
is short-lived. It will serve as useful scaffolding in subsequent
commits as we begin to work towards enabling multi-pack reuse.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
pack-bitmap.c | 105 +++++++++++++++++++++++++++++++++++---------------
1 file changed, 74 insertions(+), 31 deletions(-)
diff --git a/pack-bitmap.c b/pack-bitmap.c
index d2f1306960..2ebe2c314e 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1836,7 +1836,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
* -1 means "stop trying further objects"; 0 means we may or may not have
* reused, but you can keep feeding bits.
*/
-static int try_partial_reuse(struct packed_git *pack,
+static int try_partial_reuse(struct bitmapped_pack *pack,
size_t pos,
struct bitmap *reuse,
struct pack_window **w_curs)
@@ -1868,11 +1868,11 @@ static int try_partial_reuse(struct packed_git *pack,
* preferred pack precede all bits from other packs.
*/
- if (pos >= pack->num_objects)
+ if (pos >= pack->p->num_objects)
return -1; /* not actually in the pack or MIDX preferred pack */
- offset = delta_obj_offset = pack_pos_to_offset(pack, pos);
- type = unpack_object_header(pack, w_curs, &offset, &size);
+ offset = delta_obj_offset = pack_pos_to_offset(pack->p, pos);
+ type = unpack_object_header(pack->p, w_curs, &offset, &size);
if (type < 0)
return -1; /* broken packfile, punt */
@@ -1888,11 +1888,11 @@ static int try_partial_reuse(struct packed_git *pack,
* and the normal slow path will complain about it in
* more detail.
*/
- base_offset = get_delta_base(pack, w_curs, &offset, type,
+ base_offset = get_delta_base(pack->p, w_curs, &offset, type,
delta_obj_offset);
if (!base_offset)
return 0;
- if (offset_to_pack_pos(pack, base_offset, &base_pos) < 0)
+ if (offset_to_pack_pos(pack->p, base_offset, &base_pos) < 0)
return 0;
/*
@@ -1915,14 +1915,14 @@ static int try_partial_reuse(struct packed_git *pack,
* to REF_DELTA on the fly. Better to just let the normal
* object_entry code path handle it.
*/
- if (!bitmap_get(reuse, base_pos))
+ if (!bitmap_get(reuse, pack->bitmap_pos + base_pos))
return 0;
}
/*
* If we got here, then the object is OK to reuse. Mark it.
*/
- bitmap_set(reuse, pos);
+ bitmap_set(reuse, pack->bitmap_pos + pos);
return 0;
}
@@ -1934,29 +1934,13 @@ uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
return nth_midxed_pack_int_id(m, pack_pos_to_midx(bitmap_git->midx, 0));
}
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
- struct packed_git **packfile_out,
- uint32_t *entries,
- struct bitmap **reuse_out)
+static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git,
+ struct bitmapped_pack *pack,
+ struct bitmap *reuse)
{
- struct repository *r = the_repository;
- struct packed_git *pack;
struct bitmap *result = bitmap_git->result;
- struct bitmap *reuse;
struct pack_window *w_curs = NULL;
size_t i = 0;
- uint32_t offset;
- uint32_t objects_nr;
-
- assert(result);
-
- load_reverse_index(r, bitmap_git);
-
- if (bitmap_is_midx(bitmap_git))
- pack = bitmap_git->midx->packs[midx_preferred_pack(bitmap_git)];
- else
- pack = bitmap_git->pack;
- objects_nr = pack->num_objects;
while (i < result->word_alloc && result->words[i] == (eword_t)~0)
i++;
@@ -1969,15 +1953,15 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
* we use it instead of another pack. In single-pack bitmaps, the choice
* is made for us.
*/
- if (i > objects_nr / BITS_IN_EWORD)
- i = objects_nr / BITS_IN_EWORD;
+ if (i > pack->p->num_objects / BITS_IN_EWORD)
+ i = pack->p->num_objects / BITS_IN_EWORD;
- reuse = bitmap_word_alloc(i);
memset(reuse->words, 0xFF, i * sizeof(eword_t));
for (; i < result->word_alloc; ++i) {
eword_t word = result->words[i];
size_t pos = (i * BITS_IN_EWORD);
+ size_t offset;
for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
if ((word >> offset) == 0)
@@ -2002,6 +1986,65 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
done:
unuse_pack(&w_curs);
+}
+
+int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+ struct packed_git **packfile_out,
+ uint32_t *entries,
+ struct bitmap **reuse_out)
+{
+ struct repository *r = the_repository;
+ struct bitmapped_pack *packs = NULL;
+ struct bitmap *result = bitmap_git->result;
+ struct bitmap *reuse;
+ size_t i;
+ size_t packs_nr = 0, packs_alloc = 0;
+ size_t word_alloc;
+ uint32_t objects_nr = 0;
+
+ assert(result);
+
+ load_reverse_index(r, bitmap_git);
+
+ if (bitmap_is_midx(bitmap_git)) {
+ for (i = 0; i < bitmap_git->midx->num_packs; i++) {
+ struct bitmapped_pack pack;
+ if (nth_bitmapped_pack(r, bitmap_git->midx, &pack, i) < 0) {
+ warning(_("unable to load pack: '%s', disabling pack-reuse"),
+ bitmap_git->midx->pack_names[i]);
+ free(packs);
+ return -1;
+ }
+ if (!pack.bitmap_nr)
+ continue; /* no objects from this pack */
+ if (pack.bitmap_pos)
+ continue; /* not preferred pack */
+
+ ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
+ memcpy(&packs[packs_nr++], &pack, sizeof(pack));
+
+ objects_nr += pack.p->num_objects;
+ }
+ } else {
+ ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
+
+ packs[packs_nr].p = bitmap_git->pack;
+ packs[packs_nr].bitmap_pos = 0;
+ packs[packs_nr].bitmap_nr = bitmap_git->pack->num_objects;
+ packs[packs_nr].disjoint = 1;
+
+ objects_nr = packs[packs_nr++].p->num_objects;
+ }
+
+ word_alloc = objects_nr / BITS_IN_EWORD;
+ if (objects_nr % BITS_IN_EWORD)
+ word_alloc++;
+ reuse = bitmap_word_alloc(word_alloc);
+
+ if (packs_nr != 1)
+ BUG("pack reuse not yet implemented for multiple packs");
+
+ reuse_partial_packfile_from_bitmap_1(bitmap_git, packs, reuse);
*entries = bitmap_popcount(reuse);
if (!*entries) {
@@ -2014,7 +2057,7 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
* need to be handled separately.
*/
bitmap_and_not(result, reuse);
- *packfile_out = pack;
+ *packfile_out = packs[0].p;
*reuse_out = reuse;
return 0;
}
--
2.43.0.24.g980b318f98
^ permalink raw reply related
* [PATCH 11/24] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1701198172.git.me@ttaylorr.com>
The signature of `reuse_partial_packfile_from_bitmap()` currently takes
in a bitmap, as well as three output parameters (filled through
pointers, and passed as arguments), and also returns an integer result.
The output parameters are filled out with: (a) the packfile used for
pack-reuse, (b) the number of objects from that pack that we can reuse,
and (c) a bitmap indicating which objects we can reuse. The return value
is either -1 (when there are no objects to reuse), or 0 (when there is
at least one object to reuse).
Some of these parameters are redundant. Notably, we can infer from the
bitmap how many objects are reused by calling bitmap_popcount(). And we
can similar compute the return value based on that number as well.
As such, clean up the signature of this function to drop the "*entries"
parameter, as well as the int return value, since the single caller of
this function can infer these values themself.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/pack-objects.c | 16 +++++++++-------
pack-bitmap.c | 16 +++++++---------
pack-bitmap.h | 7 +++----
3 files changed, 19 insertions(+), 20 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 107154db34..2bb1b64e8f 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3946,13 +3946,15 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
if (!(bitmap_git = prepare_bitmap_walk(revs, 0)))
return -1;
- if (pack_options_allow_reuse() &&
- !reuse_partial_packfile_from_bitmap(
- bitmap_git,
- &reuse_packfile,
- &reuse_packfile_objects,
- &reuse_packfile_bitmap)) {
- assert(reuse_packfile_objects);
+ if (pack_options_allow_reuse())
+ reuse_partial_packfile_from_bitmap(bitmap_git, &reuse_packfile,
+ &reuse_packfile_bitmap);
+
+ if (reuse_packfile) {
+ reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
+ if (!reuse_packfile_objects)
+ BUG("expected non-empty reuse bitmap");
+
nr_result += reuse_packfile_objects;
nr_seen += reuse_packfile_objects;
display_progress(progress_state, nr_seen);
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 2ebe2c314e..614fc09a4e 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1988,10 +1988,9 @@ static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git
unuse_pack(&w_curs);
}
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
- struct packed_git **packfile_out,
- uint32_t *entries,
- struct bitmap **reuse_out)
+void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+ struct packed_git **packfile_out,
+ struct bitmap **reuse_out)
{
struct repository *r = the_repository;
struct bitmapped_pack *packs = NULL;
@@ -2013,7 +2012,7 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
warning(_("unable to load pack: '%s', disabling pack-reuse"),
bitmap_git->midx->pack_names[i]);
free(packs);
- return -1;
+ return;
}
if (!pack.bitmap_nr)
continue; /* no objects from this pack */
@@ -2046,10 +2045,10 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
reuse_partial_packfile_from_bitmap_1(bitmap_git, packs, reuse);
- *entries = bitmap_popcount(reuse);
- if (!*entries) {
+ if (!bitmap_popcount(reuse)) {
+ free(packs);
bitmap_free(reuse);
- return -1;
+ return;
}
/*
@@ -2059,7 +2058,6 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
bitmap_and_not(result, reuse);
*packfile_out = packs[0].p;
*reuse_out = reuse;
- return 0;
}
int bitmap_walk_contains(struct bitmap_index *bitmap_git,
diff --git a/pack-bitmap.h b/pack-bitmap.h
index b7fa1a42a9..5bc1ca5b65 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -78,10 +78,9 @@ int test_bitmap_hashes(struct repository *r);
struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
int filter_provided_objects);
uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
- struct packed_git **packfile,
- uint32_t *entries,
- struct bitmap **reuse_out);
+void reuse_partial_packfile_from_bitmap(struct bitmap_index *,
+ struct packed_git **packfile,
+ struct bitmap **reuse_out);
int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
kh_oid_map_t *reused_bitmaps, int show_progress);
void free_bitmap_index(struct bitmap_index *);
--
2.43.0.24.g980b318f98
^ permalink raw reply related
* [PATCH 12/24] pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()`
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1701198172.git.me@ttaylorr.com>
Further prepare for enabling verbatim pack-reuse over multiple packfiles
by changing the signature of reuse_partial_packfile_from_bitmap() to
populate an array of `struct bitmapped_pack *`'s instead of a pointer to
a single packfile.
Since the array we're filling out is sized dynamically[^1], add an
additional `size_t *` parameter which will hold the number of reusable
packs (equal to the number of elements in the array).
Note that since we still have not implemented true multi-pack reuse,
these changes aren't propagated out to the rest of the caller in
builtin/pack-objects.c.
In the interim state, we expect that the array has a single element, and
we use that element to fill out the static `reuse_packfile` variable
(which is a bog-standard `struct packed_git *`). Future commits will
continue to push this change further out through the pack-objects code.
[^1]: That is, even though we know the number of packs which are
candidates for pack-reuse, we do not know how many of those
candidates we can actually reuse.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/pack-objects.c | 9 +++++++--
pack-bitmap.c | 6 ++++--
pack-bitmap.h | 5 +++--
3 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 2bb1b64e8f..89de23f39a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3943,14 +3943,19 @@ static int pack_options_allow_reuse(void)
static int get_object_list_from_bitmap(struct rev_info *revs)
{
+ struct bitmapped_pack *packs = NULL;
+ size_t packs_nr = 0;
+
if (!(bitmap_git = prepare_bitmap_walk(revs, 0)))
return -1;
if (pack_options_allow_reuse())
- reuse_partial_packfile_from_bitmap(bitmap_git, &reuse_packfile,
+ reuse_partial_packfile_from_bitmap(bitmap_git, &packs,
+ &packs_nr,
&reuse_packfile_bitmap);
- if (reuse_packfile) {
+ if (packs) {
+ reuse_packfile = packs[0].p;
reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
if (!reuse_packfile_objects)
BUG("expected non-empty reuse bitmap");
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 614fc09a4e..670deec909 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1989,7 +1989,8 @@ static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git
}
void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
- struct packed_git **packfile_out,
+ struct bitmapped_pack **packs_out,
+ size_t *packs_nr_out,
struct bitmap **reuse_out)
{
struct repository *r = the_repository;
@@ -2056,7 +2057,8 @@ void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
* need to be handled separately.
*/
bitmap_and_not(result, reuse);
- *packfile_out = packs[0].p;
+ *packs_out = packs;
+ *packs_nr_out = packs_nr;
*reuse_out = reuse;
}
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 5bc1ca5b65..901a3b86ed 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -78,8 +78,9 @@ int test_bitmap_hashes(struct repository *r);
struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
int filter_provided_objects);
uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
-void reuse_partial_packfile_from_bitmap(struct bitmap_index *,
- struct packed_git **packfile,
+void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+ struct bitmapped_pack **packs_out,
+ size_t *packs_nr_out,
struct bitmap **reuse_out);
int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
kh_oid_map_t *reused_bitmaps, int show_progress);
--
2.43.0.24.g980b318f98
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox