* [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo
@ 2024-08-02 7:31 Han Young
2024-08-02 7:31 ` [PATCH 1/1] revision: don't set parents as uninteresting if exclude promisor objects Han Young
` (10 more replies)
0 siblings, 11 replies; 65+ messages in thread
From: Han Young @ 2024-08-02 7:31 UTC (permalink / raw)
To: git; +Cc: Han Young
We use --filter=blob:none to clone our large monorepo.
After a while we started getting reports from engineers complaining
that their local repository was broken. Upon further investigation,
we found that broken repositories are missing objects that created
in that particular local repository. git fsck reports "bad object: xxx".
Here are the minimal steps to recreate issue.
# create a normal git repo, add one file and push to remote
$ mkdir full && cd full && git init && touch foo
$ git add foo && git commit -m "commit 1" && git push
# partial clone a copy of the repo we just created
$ cd ..
$ git clone git@example.org:example/foo.git --filter=blob:none partial
# create a commit in partial cloned repo and push it to remote
$ cd partial && echo 'hello' > foo && git commit -a -m "commit 2"
$ git push
# run gc in partial repo
$ git gc --prune=now
# in normal git repo, create another commit on top of the
# commit we created in partial repo
$ cd ../full && git pull && echo ' world' >> foo
$ git commit -a -m "commit 3" && git push
# pull from remote in partial repo, and run gc again
$ cd ../partial && git pull && git gc --prune=now
The last `git gc` will error out on fsck with error message like this:
error: Could not read d3fbfea9e448461c2b72a79a95a220ae10defd94
error: Could not read d3fbfea9e448461c2b72a79a95a220ae10defd94
Note that disabling commit graph on the partial repo will cause
`git gc` to exit normally, but will still not solve the
underlying problem. And in more complex situations,
disabling commit graph will not avoid the error.
The problem is caused by the wrong result returned by setup_revision
with `--exclude-promisor-objects` enabled.
`git gc` will call `git repack`, which will call `git pack-objects`
twice on a partially cloned repo. The first call to pack-objects
combines all the promisor packfiles, and the second pack-objects
command packs all reachable non-promisor objects into a normal packfile.
However, a bug in setup_revision caused some non-promisor objects
to be mistakenly marked as in promisor packfiles in the second call
to pack-objects. These incorrectly marked objects are never repacked,
and were deleted from the object store as a result. In revision.c,
`process_parents()` recursively marks commit parents as UNINTERESTING
if the commit itself is UNINTERESTING. `--exclude-promisor-objects`
is implemented as "iterate all objects in promisor packfiles,
mark them as UNINTERESTING". So when we find a commit object in
a promisor packfile, we also set its ancestors as UNINTERESTING,
whether the ancestor is a promisor object or not. In the example above,
"commit 2" is a normal commit object, living in a normal packfile,
but marked as a promisor object and gc'ed from the object store.
Han Young (1):
revision: don't set parents as uninteresting if exclude promisor
revision.c | 2 +-
t/t0410-partial-clone.sh | 22 +++++++++++++++++++++-
2 files changed, 22 insertions(+), 2 deletions(-)
--
2.46.0.rc0.107.gae139121ac.dirty
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH 1/1] revision: don't set parents as uninteresting if exclude promisor objects
2024-08-02 7:31 [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo Han Young
@ 2024-08-02 7:31 ` Han Young
2024-08-02 16:45 ` Junio C Hamano
2024-08-13 0:45 ` [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo Jonathan Tan
` (9 subsequent siblings)
10 siblings, 1 reply; 65+ messages in thread
From: Han Young @ 2024-08-02 7:31 UTC (permalink / raw)
To: git; +Cc: Han Young, C O Xing Xin
In revision.c, `process_parents()` recursively marks commit parents
as UNINTERESTING if the commit itself is UNINTERESTING.
`--exclude-promisor-objects` is implemented as
"iterate all objects in promisor packfiles, mark them as UNINTERESTING".
So when we find a commit object in a promisor packfile, we also set its ancestors
as UNINTERESTING, whether the ancestor is a promisor object or not.
This causes normal objects to be falsely marked as promisor objects
and removed by `git repack`.
Stop setting the parents of uninteresting commits' to UNINTERESTING
when we exclude promisor objects, and add a test to prevent regression.
Note that this change would cause rev-list to report incorrect results if
`--exclude-promisor-objects` is used with other revision walk filters. But
`--exclude-promisor-objects` is for internal use only, so we don't have to worry
about users using other filters with `--exclude-promisor-objects`.
Signed-off-by: Han Young <hanyang.tony@bytedance.com>
Helped-by: C O Xing Xin <xingxin.xx@bytedance.com>
---
revision.c | 2 +-
t/t0410-partial-clone.sh | 22 +++++++++++++++++++++-
2 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/revision.c b/revision.c
index 1c0192f522..eacb0c909d 100644
--- a/revision.c
+++ b/revision.c
@@ -1164,7 +1164,7 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
* wasn't uninteresting), in which case we need
* to mark its parents recursively too..
*/
- if (commit->object.flags & UNINTERESTING) {
+ if (!revs->exclude_promisor_objects && commit->object.flags & UNINTERESTING) {
while (parent) {
struct commit *p = parent->item;
parent = parent->next;
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 2c30c86e7b..4ee3437685 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -22,6 +22,17 @@ pack_as_from_promisor () {
echo $HASH
}
+pack_commit() {
+ HASH=$(echo $1 | git -C repo pack-objects .git/objects/pack/pack --missing=allow-any) &&
+ delete_object repo $1 &&
+ echo $HASH
+}
+
+pack_commit_as_promisor() {
+ HASH=$(pack_commit $1) &&
+ >repo/.git/objects/pack/pack-$HASH.promisor
+}
+
promise_and_delete () {
HASH=$(git -C repo rev-parse "$1") &&
git -C repo tag -a -m message my_annotated_tag "$HASH" &&
@@ -369,7 +380,16 @@ test_expect_success 'missing tree objects with --missing=allow-promisor and --ex
git -C repo rev-list --exclude-promisor-objects --objects HEAD >objs 2>rev_list_err &&
test_must_be_empty rev_list_err &&
# 3 commits, no blobs or trees
- test_line_count = 3 objs
+ test_line_count = 3 objs &&
+
+ # Pack all three commits into individual packs, and mark the last commit pack as promisor
+ pack_commit_as_promisor $(git -C repo rev-parse baz) &&
+ pack_commit $(git -C repo rev-parse bar) &&
+ pack_commit $(git -C repo rev-parse foo) &&
+ git -C repo rev-list --exclude-promisor-objects --objects HEAD >objs 2>rev_list_err &&
+ test_must_be_empty rev_list_err &&
+ # commits foo and bar should remain
+ test_line_count = 2 objs
'
test_expect_success 'missing non-root tree object and rev-list' '
--
2.46.0.rc0.107.gae139121ac.dirty
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH 1/1] revision: don't set parents as uninteresting if exclude promisor objects
2024-08-02 7:31 ` [PATCH 1/1] revision: don't set parents as uninteresting if exclude promisor objects Han Young
@ 2024-08-02 16:45 ` Junio C Hamano
2024-08-12 12:34 ` [External] " 韩仰
0 siblings, 1 reply; 65+ messages in thread
From: Junio C Hamano @ 2024-08-02 16:45 UTC (permalink / raw)
To: Han Young; +Cc: git, C O Xing Xin, Jonathan Tan, Jeff Hostetler
Han Young <hanyang.tony@bytedance.com> writes:
> In revision.c, `process_parents()` recursively marks commit parents
> as UNINTERESTING if the commit itself is UNINTERESTING.
Makes sense.
> `--exclude-promisor-objects` is implemented as
> "iterate all objects in promisor packfiles, mark them as UNINTERESTING".
Also makes sense.
> So when we find a commit object in a promisor packfile, we also set its ancestors
> as UNINTERESTING, whether the ancestor is a promisor object or not.
> This causes normal objects to be falsely marked as promisor objects
> and removed by `git repack`.
Ahh, that is not desirable. So the need to do something to fix it
is well established here.
> Signed-off-by: Han Young <hanyang.tony@bytedance.com>
> Helped-by: C O Xing Xin <xingxin.xx@bytedance.com>
> ---
Please order these trailer lines logically in chronological order,
i.e. you get helped by others to complete the change and then seal
it by signing it off at the end. I'll swap these two while queuing.
> diff --git a/revision.c b/revision.c
> index 1c0192f522..eacb0c909d 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -1164,7 +1164,7 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
> * wasn't uninteresting), in which case we need
> * to mark its parents recursively too..
> */
> - if (commit->object.flags & UNINTERESTING) {
> + if (!revs->exclude_promisor_objects && commit->object.flags & UNINTERESTING) {
> while (parent) {
> struct commit *p = parent->item;
> parent = parent->next;
But if the iteration is over all objects in certain packfiles to
mark them all uninteresting, shouldn't the caller avoid the call to
process_parents() in the first place? Letting process_parents() to
do other things and only refrain from doing the "this commit is
marked uninteresting" part does not quite match what you are trying
to do, at least to me.
Please check "git blame" to find those who are likely to know better
about the code around the area and ask for help from them. I think
the bulk of the logic related came from the series that led to
f3d618d2 (Merge branch 'jh/fsck-promisors', 2018-02-13), so I added
the authors of the series.
It apepars to me that its approach to exclude the objects that
appear in the promisor packs may be sound, but the design and
implementation of it is dubious. Shouldn't it be getting the list
of objects with get_object_list() WITHOUT paying any attention to
--exclude-promisor-objects flag, and then filtering objects that
appear in the promisor packs out of that list, without mucking with
the object and commit traversal in revision.c at all?
Thanks.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [External] Re: [PATCH 1/1] revision: don't set parents as uninteresting if exclude promisor objects
2024-08-02 16:45 ` Junio C Hamano
@ 2024-08-12 12:34 ` 韩仰
2024-08-12 16:09 ` Junio C Hamano
0 siblings, 1 reply; 65+ messages in thread
From: 韩仰 @ 2024-08-12 12:34 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Jonathan Tan, Jeff Hostetler
On Sat, Aug 3, 2024 at 12:45 AM Junio C Hamano <gitster@pobox.com> wrote:
> > diff --git a/revision.c b/revision.c
> > index 1c0192f522..eacb0c909d 100644
> > --- a/revision.c
> > +++ b/revision.c
> > @@ -1164,7 +1164,7 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
> > * wasn't uninteresting), in which case we need
> > * to mark its parents recursively too..
> > */
> > - if (commit->object.flags & UNINTERESTING) {
> > + if (!revs->exclude_promisor_objects && commit->object.flags & UNINTERESTING) {
> > while (parent) {
> > struct commit *p = parent->item;
> > parent = parent->next;
>
> But if the iteration is over all objects in certain packfiles to
> mark them all uninteresting, shouldn't the caller avoid the call to
> process_parents() in the first place? Letting process_parents() to
> do other things and only refrain from doing the "this commit is
> marked uninteresting" part does not quite match what you are trying
> to do, at least to me.
Thanks, I agree process_parents() isn't the right place to fix the issue.
> It apepars to me that its approach to exclude the objects that
> appear in the promisor packs may be sound, but the design and
> implementation of it is dubious. Shouldn't it be getting the list
> of objects with get_object_list() WITHOUT paying any attention to
> --exclude-promisor-objects flag, and then filtering objects that
> appear in the promisor packs out of that list, without mucking with
> the object and commit traversal in revision.c at all?
The problem is --exclude-promisor-objects is an option in revision.c,
and this option is used by pack-objects, prune, midx-write and rev-list.
I see there are two ways to fix this issue, one is to remove the
--exclude-promisor-objects from revision.c, and filter objects in show_commit
or show_objects functions. The other place to filter objects is probably
in revision walk, maybe in traverse_commit_list?
Thanks.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [External] Re: [PATCH 1/1] revision: don't set parents as uninteresting if exclude promisor objects
2024-08-12 12:34 ` [External] " 韩仰
@ 2024-08-12 16:09 ` Junio C Hamano
2024-08-22 8:28 ` 韩仰
0 siblings, 1 reply; 65+ messages in thread
From: Junio C Hamano @ 2024-08-12 16:09 UTC (permalink / raw)
To: Jonathan Tan, 韩仰; +Cc: git, Jeff Hostetler
韩仰 <hanyang.tony@bytedance.com> writes:
> Thanks, I agree process_parents() isn't the right place to fix the issue.
>
>> It apepars to me that its approach to exclude the objects that
>> appear in the promisor packs may be sound, but the design and
>> implementation of it is dubious. Shouldn't it be getting the list
>> of objects with get_object_list() WITHOUT paying any attention to
>> --exclude-promisor-objects flag, and then filtering objects that
>> appear in the promisor packs out of that list, without mucking with
>> the object and commit traversal in revision.c at all?
>
> The problem is --exclude-promisor-objects is an option in revision.c,
> and this option is used by pack-objects, prune, midx-write and rev-list.
> I see there are two ways to fix this issue, one is to remove the
> --exclude-promisor-objects from revision.c, and filter objects in show_commit
> or show_objects functions. The other place to filter objects is probably
> in revision walk, maybe in traverse_commit_list?
Perhaps another simpler approach may be to use is_promisor_object()
function and get rid of this initial marking of these objects in
prepare_revision_walk() with the for_each_packed_object() loop,
which abuses the UNINTERESTING bit. The feature wants to exclude
objects contained in these packs, but does not want to exclude
objects that are referred to and outside of these packs, so
UNINTERESTING bit whose natural behaviour is to propagate down the
history is a very bad fit for it. We may be able to lose a lot of
existing code paths that say "if exclude_promisor_objects then do
this", and filter objects out with "is_promisor_object()" at the
output phase near get_revision().
Jonathan, if I am not mistaken, this is almost all your code. Any
insights to lend us, even though you may not be very active around
here lately?
Thanks.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo
2024-08-02 7:31 [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo Han Young
2024-08-02 7:31 ` [PATCH 1/1] revision: don't set parents as uninteresting if exclude promisor objects Han Young
@ 2024-08-13 0:45 ` Jonathan Tan
2024-08-13 17:18 ` Jonathan Tan
2024-08-23 12:43 ` [WIP v2 0/4] " Han Young
` (8 subsequent siblings)
10 siblings, 1 reply; 65+ messages in thread
From: Jonathan Tan @ 2024-08-13 0:45 UTC (permalink / raw)
To: Han Young; +Cc: Jonathan Tan, git, gitster, xingxin.xx, jeffhostetler
Han Young <hanyang.tony@bytedance.com> writes:
> Here are the minimal steps to recreate issue.
[snip]
I think the following is what is happening. Before the final gc, the
repo looks as follows:
commit tree blob
C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
|
C2 ---- T2 -- B2 (created locally, in non-promisor pack)
|
C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
After the final gc, {C,T,B}3 and {C,T,B}1 are in a promisor pack, but
all of {C,T,B}2 are deleted because they are thought to be promisor
objects.
> The last `git gc` will error out on fsck with error message like this:
>
> error: Could not read d3fbfea9e448461c2b72a79a95a220ae10defd94
> error: Could not read d3fbfea9e448461c2b72a79a95a220ae10defd94
I'm not sure how `git gc` (or `git fsck`) knows the name of this
object (what is the type of this object, and which object refers to
this object?) but I think that if we implement one of the solutions I
describe below, this problem will go away.
> `git gc` will call `git repack`, which will call `git pack-objects`
> twice on a partially cloned repo. The first call to pack-objects
> combines all the promisor packfiles, and the second pack-objects
> command packs all reachable non-promisor objects into a normal packfile.
Yes, this is what I remember.
> However, a bug in setup_revision caused some non-promisor objects
> to be mistakenly marked as in promisor packfiles in the second call
> to pack-objects.
I think they ({C,T,B}2 in the example above) should be considered as
promisor objects, actually. From the partial clone doc, it says of a
promisor object: "the local repository has that object in one of its
promisor packfiles, or because another promisor object refers to it".
C3 (in a promisor packfile) refers to C2, so C2 is a promisor object. It
refers to T2, which refers to B2, so all of them are promisor objects.
However, in the Git code, I don't think this definition is applied
recursively - if I remember correctly, we made the assumption (e.g.
in is_promisor_object() in packfile.c) that we only need to care about
objects in promisor packfiles and the objects they directly reference,
because how would we know what objects they indirectly reference?
But this does not cover the case in which we know what objects they
indirectly reference because we pushed them to the server in the first
place.
Solutions I can think of:
- When fetching from a promisor remote, never declare commits that are
not in promisor packfiles as HAVE. This means that we would refetch
C2 and T2 as being in promisor packfiles. But it's wasteful in
network bandwidth, and does not repair problems in Git repos created
by Git versions that do not have this solution.
- When fetching from a promisor remote, parse every object and repack
any local objects referenced (directly or indirectly) into a promisor
packfile. Also does not repair problems.
- When repacking all objects in promisor packfiles, if any object they
refer to is present in a non-promisor packfile, do a revwalk on that
object and pack those objects too. The repack will probably be slower
because each object now has to be parsed. The revwalks themselves
probably will not take too long, since they can stop at known promisor
objects.
(One other thing that might be considered is, whenever pushing to a
promisor remote, to write the pack that's pushed as a promisor packfile.
But I don't think this is a good idea - the server may not retain any
packs that were pushed.)
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo
2024-08-13 0:45 ` [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo Jonathan Tan
@ 2024-08-13 17:18 ` Jonathan Tan
2024-08-14 4:10 ` Junio C Hamano
0 siblings, 1 reply; 65+ messages in thread
From: Jonathan Tan @ 2024-08-13 17:18 UTC (permalink / raw)
To: Jonathan Tan; +Cc: Han Young, git, gitster, xingxin.xx, jeffhostetler
Jonathan Tan <jonathantanmy@google.com> writes:
> Solutions I can think of:
One more thing that I just thought of regarding the solution in this
patch. It seems to be to have a different separation of packs: all
objects currently in promisor packs and all objects currently not
in promisor packs. And the way it is done is to only exclude (in
this patch, mark UNINTERESTING, although it might be better to have
a separate flag for it) objects in promisor packs, but not their
ancestors. There are two ways we can go from here:
- Do not iterate past this object, just like for UNINTERESTING. This
would end up not packing objects that we need to pack (e.g. {C,T,B}2
below, if we only have a ref pointing to C3).
commit tree blob
C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
|
C2 ---- T2 -- B2 (created locally, in non-promisor pack)
|
C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
- Iterate past this object (I think this is the path this patch took,
but I didn't look at it closely). This works, but seems to be very
slow. We would need to walk through all reachable objects (promisor
object or not), unlike currently in which we stop once we have
reached a promisor object.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo
2024-08-13 17:18 ` Jonathan Tan
@ 2024-08-14 4:10 ` Junio C Hamano
2024-08-14 19:30 ` Jonathan Tan
0 siblings, 1 reply; 65+ messages in thread
From: Junio C Hamano @ 2024-08-14 4:10 UTC (permalink / raw)
To: Jonathan Tan; +Cc: Han Young, git, xingxin.xx, jeffhostetler
Jonathan Tan <jonathantanmy@google.com> writes:
> Jonathan Tan <jonathantanmy@google.com> writes:
>> Solutions I can think of:
>
> One more thing that I just thought of regarding the solution in this
> patch. It seems to be to have a different separation of packs: all
> objects currently in promisor packs and all objects currently not
> in promisor packs. And the way it is done is to only exclude (in
> this patch, mark UNINTERESTING, although it might be better to have
> a separate flag for it) objects in promisor packs, but not their
> ancestors.
You're right to mention two separate bits, especially because you do
not want the "I am in a promisor pack" bit to propagate down to the
ancestry chain like UNINTERESTING bit does. But isn't the approach
to enumerate all objects in promisor packs in an oidset and give a
quick way for is_promisor_object() to answer if an object is or is
not in promisor pack sufficient to replace the need to use _any_
object flag bits to manage objects in promisor packs?
> There are two ways we can go from here:
>
> - Do not iterate past this object, just like for UNINTERESTING. This
> would end up not packing objects that we need to pack (e.g. {C,T,B}2
> below, if we only have a ref pointing to C3).
>
> commit tree blob
> C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
> |
> C2 ---- T2 -- B2 (created locally, in non-promisor pack)
> |
> C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
>
> - Iterate past this object (I think this is the path this patch took,
> but I didn't look at it closely). This works, but seems to be very
> slow. We would need to walk through all reachable objects (promisor
> object or not), unlike currently in which we stop once we have
> reached a promisor object.
Thanks for helping Han & Xinxin.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo
2024-08-14 4:10 ` Junio C Hamano
@ 2024-08-14 19:30 ` Jonathan Tan
0 siblings, 0 replies; 65+ messages in thread
From: Jonathan Tan @ 2024-08-14 19:30 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Jonathan Tan, Han Young, git, xingxin.xx, jeffhostetler
Junio C Hamano <gitster@pobox.com> writes:
> Jonathan Tan <jonathantanmy@google.com> writes:
>
> > Jonathan Tan <jonathantanmy@google.com> writes:
> >> Solutions I can think of:
> >
> > One more thing that I just thought of regarding the solution in this
> > patch. It seems to be to have a different separation of packs: all
> > objects currently in promisor packs and all objects currently not
> > in promisor packs. And the way it is done is to only exclude (in
> > this patch, mark UNINTERESTING, although it might be better to have
> > a separate flag for it) objects in promisor packs, but not their
> > ancestors.
>
> You're right to mention two separate bits, especially because you do
> not want the "I am in a promisor pack" bit to propagate down to the
> ancestry chain like UNINTERESTING bit does. But isn't the approach
> to enumerate all objects in promisor packs in an oidset and give a
> quick way for is_promisor_object() to answer if an object is or is
> not in promisor pack sufficient to replace the need to use _any_
> object flag bits to manage objects in promisor packs?
Ah...yes, you're right. But if someone is intending to go this route,
note that you can't use the oidset in is_promisor_object() directly,
as it contains both objects in promisor packs and objects that they
directly reference. You'll need to adapt it.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [External] Re: [PATCH 1/1] revision: don't set parents as uninteresting if exclude promisor objects
2024-08-12 16:09 ` Junio C Hamano
@ 2024-08-22 8:28 ` 韩仰
0 siblings, 0 replies; 65+ messages in thread
From: 韩仰 @ 2024-08-22 8:28 UTC (permalink / raw)
To: Junio C Hamano, Jonathan Tan; +Cc: git, Jeff Hostetler
On Tue, Aug 13, 2024 at 12:09 AM Junio C Hamano <gitster@pobox.com> wrote:
> Perhaps another simpler approach may be to use is_promisor_object()
> function and get rid of this initial marking of these objects in
> prepare_revision_walk() with the for_each_packed_object() loop,
> which abuses the UNINTERESTING bit. The feature wants to exclude
> objects contained in these packs, but does not want to exclude
> objects that are referred to and outside of these packs, so
> UNINTERESTING bit whose natural behaviour is to propagate down the
> history is a very bad fit for it. We may be able to lose a lot of
> existing code paths that say "if exclude_promisor_objects then do
> this", and filter objects out with "is_promisor_object()" at the
> output phase near get_revision().
I tried to go down this route. I removed the for_each_packed_object()
loop and filter promisor commits in get_revision_1() instead.
However, this only filtered promisor commits, not promisor trees and
objects. A combined approach would be keeping the
for_each_packed_object() loop, but only mark non-commit objects
as UNINTERESTING there, and filter promisor commits in
get_revision()?
^ permalink raw reply [flat|nested] 65+ messages in thread
* [WIP v2 0/4] revision: fix reachable objects being gc'ed in no blob clone repo
2024-08-02 7:31 [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo Han Young
2024-08-02 7:31 ` [PATCH 1/1] revision: don't set parents as uninteresting if exclude promisor objects Han Young
2024-08-13 0:45 ` [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo Jonathan Tan
@ 2024-08-23 12:43 ` Han Young
2024-08-23 12:43 ` [WIP v2 1/4] packfile: split promisor objects oidset into two Han Young
` (3 more replies)
2024-09-19 23:47 ` [PATCH 0/2] revision: fix reachable commits being gc'ed in partial repo Calvin Wan
` (7 subsequent siblings)
10 siblings, 4 replies; 65+ messages in thread
From: Han Young @ 2024-08-23 12:43 UTC (permalink / raw)
To: git; +Cc: Han Young, Junio C Hamano, Jonathan Tan
Following Jonathan and Junio's suggestion, I tried to filter promisor
objects in get_revision(). Initially I got ride of the
for_each_packed_object() loop, but that way promisor trees and blobs are
not filtered.
I kept the for_each_packed_object() loop, but only mark non-commit objects
as UNINTERESTING, that way we ensure all the promisor objects are filtered,
and UNINTERESTING bit is not passed down in process_parent() call.
But this isn't enough solve the 'reachable objects being gc'ed' problem,
as promisor projects is defined as "objects in promisor pack or referenced".
git repack only packs objects in promisor pack and non promisor objects.
Meaning objects who are promisor objects but not in promisor pack are discarded.
So I added a new option to list-objects, '--exclude-promisor-pack-objects'.
Which only exclude objects in promisor packs, this way when we run git repack,
no reachable objects will be discarded. This seems to fix the problem, but I
still don't feel the approach is elegant.
Another way to fix this problem I come up with is to pack everything into
a promisor packfile, if it is a partial clone repo. This would make pruning
unreachable objects impossible, but that is already the case with promisor
objects. Packing everything into one packfile will simplify code by a lot.
Han Young (4):
packfile: split promisor objects oidset into two
revision: add exclude-promisor-pack-objects option
revision: don't mark commit as UNINTERESTING if
--exclude-promisor-objects is set
repack: use new exclude promisor pack objects option
builtin/pack-objects.c | 8 ++++----
builtin/repack.c | 2 +-
list-objects.c | 3 ++-
packfile.c | 25 ++++++++++++++++---------
packfile.h | 7 ++++++-
revision.c | 17 +++++++++++++++--
revision.h | 3 ++-
7 files changed, 46 insertions(+), 19 deletions(-)
--
2.45.2
^ permalink raw reply [flat|nested] 65+ messages in thread
* [WIP v2 1/4] packfile: split promisor objects oidset into two
2024-08-23 12:43 ` [WIP v2 0/4] " Han Young
@ 2024-08-23 12:43 ` Han Young
2024-08-23 12:43 ` [WIP v2 2/4] revision: add exclude-promisor-pack-objects option Han Young
` (2 subsequent siblings)
3 siblings, 0 replies; 65+ messages in thread
From: Han Young @ 2024-08-23 12:43 UTC (permalink / raw)
To: git; +Cc: Han Young
split promisor objects oidset into two, one is objects in promisor packfile,
and other set is objects referenced in promisor packfile. This enable us to
check if an object is in promisor packfile.
---
packfile.c | 25 ++++++++++++++++---------
packfile.h | 7 ++++++-
2 files changed, 22 insertions(+), 10 deletions(-)
diff --git a/packfile.c b/packfile.c
index cf12a539ea..1cf69a17be 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2234,12 +2234,17 @@ int for_each_packed_object(each_packed_object_fn cb, void *data,
return r ? r : pack_errors;
}
+struct promisor_objects {
+ struct oidset promisor_pack_objects;
+ struct oidset promisor_pack_referenced_objects;
+};
+
static int add_promisor_object(const struct object_id *oid,
struct packed_git *pack UNUSED,
uint32_t pos UNUSED,
void *set_)
{
- struct oidset *set = set_;
+ struct promisor_objects *set = set_;
struct object *obj;
int we_parsed_object;
@@ -2254,7 +2259,7 @@ static int add_promisor_object(const struct object_id *oid,
if (!obj)
return 1;
- oidset_insert(set, oid);
+ oidset_insert(&set->promisor_pack_objects, oid);
/*
* If this is a tree, commit, or tag, the objects it refers
@@ -2272,26 +2277,26 @@ static int add_promisor_object(const struct object_id *oid,
*/
return 0;
while (tree_entry_gently(&desc, &entry))
- oidset_insert(set, &entry.oid);
+ oidset_insert(&set->promisor_pack_referenced_objects, &entry.oid);
if (we_parsed_object)
free_tree_buffer(tree);
} else if (obj->type == OBJ_COMMIT) {
struct commit *commit = (struct commit *) obj;
struct commit_list *parents = commit->parents;
- oidset_insert(set, get_commit_tree_oid(commit));
+ oidset_insert(&set->promisor_pack_referenced_objects, get_commit_tree_oid(commit));
for (; parents; parents = parents->next)
- oidset_insert(set, &parents->item->object.oid);
+ oidset_insert(&set->promisor_pack_referenced_objects, &parents->item->object.oid);
} else if (obj->type == OBJ_TAG) {
struct tag *tag = (struct tag *) obj;
- oidset_insert(set, get_tagged_oid(tag));
+ oidset_insert(&set->promisor_pack_referenced_objects, get_tagged_oid(tag));
}
return 0;
}
-int is_promisor_object(const struct object_id *oid)
+int is_in_promisor_pack(const struct object_id *oid, int referenced)
{
- static struct oidset promisor_objects;
+ static struct promisor_objects promisor_objects;
static int promisor_objects_prepared;
if (!promisor_objects_prepared) {
@@ -2303,5 +2308,7 @@ int is_promisor_object(const struct object_id *oid)
}
promisor_objects_prepared = 1;
}
- return oidset_contains(&promisor_objects, oid);
+
+ return oidset_contains(&promisor_objects.promisor_pack_objects, oid) ||
+ (referenced && oidset_contains(&promisor_objects.promisor_pack_referenced_objects, oid));
}
diff --git a/packfile.h b/packfile.h
index 0f78658229..13a349e223 100644
--- a/packfile.h
+++ b/packfile.h
@@ -195,11 +195,16 @@ int has_object_kept_pack(const struct object_id *oid, unsigned flags);
int has_pack_index(const unsigned char *sha1);
+int is_in_promisor_pack(const struct object_id *oid, int referenced);
+
/*
* Return 1 if an object in a promisor packfile is or refers to the given
* object, 0 otherwise.
*/
-int is_promisor_object(const struct object_id *oid);
+static inline int is_promisor_object(const struct object_id *oid)
+{
+ return is_in_promisor_pack(oid, 1);
+}
/*
* Expose a function for fuzz testing.
--
2.45.2
^ permalink raw reply related [flat|nested] 65+ messages in thread
* [WIP v2 2/4] revision: add exclude-promisor-pack-objects option
2024-08-23 12:43 ` [WIP v2 0/4] " Han Young
2024-08-23 12:43 ` [WIP v2 1/4] packfile: split promisor objects oidset into two Han Young
@ 2024-08-23 12:43 ` Han Young
2024-08-23 12:43 ` [WIP v2 3/4] revision: don't mark commit as UNINTERESTING if --exclude-promisor-objects is set Han Young
2024-08-23 12:43 ` [WIP v2 4/4] repack: use new exclude promisor pack objects option Han Young
3 siblings, 0 replies; 65+ messages in thread
From: Han Young @ 2024-08-23 12:43 UTC (permalink / raw)
To: git; +Cc: Han Young
add --exclude-promisor-pack-objects option to revision walk, this option will
be used by git repack in following commits. Unlike --exclude-promisor-objects,
which exclude promisor objects, --exclude-promisor-pack-objects only excludes
objects in promisor packfile, objects referenced by objects in promisor pack
are not excluded.
---
revision.c | 13 ++++++++++++-
revision.h | 3 ++-
2 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/revision.c b/revision.c
index 6b33bd814f..7bb03a84c2 100644
--- a/revision.c
+++ b/revision.c
@@ -2701,6 +2701,11 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
if (fetch_if_missing)
BUG("exclude_promisor_objects can only be used when fetch_if_missing is 0");
revs->exclude_promisor_objects = 1;
+ } else if (opt && opt->allow_exclude_promisor_objects &&
+ !strcmp(arg, "--exclude-promisor-pack-objects")) {
+ if (fetch_if_missing)
+ BUG("exclude_promisor_pack_objects can only be used when fetch_if_missing is 0");
+ revs->exclude_promisor_pack_objects = 1;
} else {
int opts = diff_opt_parse(&revs->diffopt, argv, argc, revs->prefix);
if (!opts)
@@ -3908,7 +3913,7 @@ int prepare_revision_walk(struct rev_info *revs)
(revs->limited && limiting_can_increase_treesame(revs)))
revs->treesame.name = "treesame";
- if (revs->exclude_promisor_objects) {
+ if (revs->exclude_promisor_objects || revs->exclude_promisor_pack_objects) {
for_each_packed_object(mark_uninteresting, revs,
FOR_EACH_OBJECT_PROMISOR_ONLY);
}
@@ -4275,6 +4280,12 @@ static struct commit *get_revision_1(struct rev_info *revs)
if (!commit)
return NULL;
+ if (revs->exclude_promisor_objects && is_promisor_object(&commit->object.oid))
+ continue;
+
+ if (revs->exclude_promisor_pack_objects && is_in_promisor_pack(&commit->object.oid, 0))
+ continue;
+
if (revs->reflog_info)
commit->object.flags &= ~(ADDED | SEEN | SHOWN);
diff --git a/revision.h b/revision.h
index 0e470d1df1..6219c35c45 100644
--- a/revision.h
+++ b/revision.h
@@ -229,7 +229,8 @@ struct rev_info {
do_not_die_on_missing_objects:1,
/* for internal use only */
- exclude_promisor_objects:1;
+ exclude_promisor_objects:1,
+ exclude_promisor_pack_objects:1;
/* Diff flags */
unsigned int diff:1,
--
2.45.2
^ permalink raw reply related [flat|nested] 65+ messages in thread
* [WIP v2 3/4] revision: don't mark commit as UNINTERESTING if --exclude-promisor-objects is set
2024-08-23 12:43 ` [WIP v2 0/4] " Han Young
2024-08-23 12:43 ` [WIP v2 1/4] packfile: split promisor objects oidset into two Han Young
2024-08-23 12:43 ` [WIP v2 2/4] revision: add exclude-promisor-pack-objects option Han Young
@ 2024-08-23 12:43 ` Han Young
2024-08-23 12:43 ` [WIP v2 4/4] repack: use new exclude promisor pack objects option Han Young
3 siblings, 0 replies; 65+ messages in thread
From: Han Young @ 2024-08-23 12:43 UTC (permalink / raw)
To: git; +Cc: Han Young
if commit is marked as UNINTERESTING, the bit will propagate down to its
parents. This is undesirable in --exclude-promisor-objects, since a promisor
objects' parents can be a normal object.
---
revision.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/revision.c b/revision.c
index 7bb03a84c2..02227e6a0a 100644
--- a/revision.c
+++ b/revision.c
@@ -3609,7 +3609,9 @@ static int mark_uninteresting(const struct object_id *oid,
{
struct rev_info *revs = cb;
struct object *o = lookup_unknown_object(revs->repo, oid);
- o->flags |= UNINTERESTING | SEEN;
+ if (o->type != OBJ_COMMIT)
+ o->flags |= UNINTERESTING | SEEN;
+
return 0;
}
--
2.45.2
^ permalink raw reply related [flat|nested] 65+ messages in thread
* [WIP v2 4/4] repack: use new exclude promisor pack objects option
2024-08-23 12:43 ` [WIP v2 0/4] " Han Young
` (2 preceding siblings ...)
2024-08-23 12:43 ` [WIP v2 3/4] revision: don't mark commit as UNINTERESTING if --exclude-promisor-objects is set Han Young
@ 2024-08-23 12:43 ` Han Young
3 siblings, 0 replies; 65+ messages in thread
From: Han Young @ 2024-08-23 12:43 UTC (permalink / raw)
To: git; +Cc: Han Young
use --exclude-promisor-pack-objects to pack objects in partial clone repo.
git repack will call git pack-objects twice on a partially cloned repo.
The first call to pack-objects combines all the objects in promisor packfiles,
and the second pack-objects command packs all reachable non-promisor objects
into a normal packfile. However 'objects in promisor packfiles' plus
'non-promisor objects' does not equal 'all the reachable objects in repo',
Since promisor objects also include objects referenced in promisor packfile.
--exclude-promisor-pack-objects only excludes objects in promisor packfiles,
this way we don't discard any reachable objects in git repack.
---
builtin/pack-objects.c | 8 ++++----
builtin/repack.c | 2 +-
list-objects.c | 3 ++-
3 files changed, 7 insertions(+), 6 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c481feadbf..a2b1aaa2e0 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -238,7 +238,7 @@ static enum {
} write_bitmap_index;
static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE;
-static int exclude_promisor_objects;
+static int exclude_promisor_pack_objects;
static int use_delta_islands;
@@ -4391,7 +4391,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
OPT_CALLBACK_F(0, "missing", NULL, N_("action"),
N_("handling for missing objects"), PARSE_OPT_NONEG,
option_parse_missing_action),
- OPT_BOOL(0, "exclude-promisor-objects", &exclude_promisor_objects,
+ OPT_BOOL(0, "exclude-promisor-pack-objects", &exclude_promisor_pack_objects,
N_("do not pack objects in promisor packfiles")),
OPT_BOOL(0, "delta-islands", &use_delta_islands,
N_("respect islands during delta compression")),
@@ -4473,10 +4473,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
strvec_push(&rp, "--unpacked");
}
- if (exclude_promisor_objects) {
+ if (exclude_promisor_pack_objects) {
use_internal_rev_list = 1;
fetch_if_missing = 0;
- strvec_push(&rp, "--exclude-promisor-objects");
+ strvec_push(&rp, "--exclude-promisor-pack-objects");
}
if (unpack_unreachable || keep_unreachable || pack_loose_unreachable)
use_internal_rev_list = 1;
diff --git a/builtin/repack.c b/builtin/repack.c
index 62cfa50c50..aafe7d30ce 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -1289,7 +1289,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
strvec_push(&cmd.args, "--indexed-objects");
}
if (repo_has_promisor_remote(the_repository))
- strvec_push(&cmd.args, "--exclude-promisor-objects");
+ strvec_push(&cmd.args, "--exclude-promisor-pack-objects");
if (!write_midx) {
if (write_bitmaps > 0)
strvec_push(&cmd.args, "--write-bitmap-index");
diff --git a/list-objects.c b/list-objects.c
index 985d008799..9b3ff0fe1d 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -178,7 +178,8 @@ static void process_tree(struct traversal_context *ctx,
* requested. This may cause the actual filter to report
* an incomplete list of missing objects.
*/
- if (revs->exclude_promisor_objects &&
+ if ((revs->exclude_promisor_objects ||
+ revs->exclude_promisor_pack_objects) &&
is_promisor_object(&obj->oid))
return;
--
2.45.2
^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH 0/2] revision: fix reachable commits being gc'ed in partial repo
2024-08-02 7:31 [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo Han Young
` (2 preceding siblings ...)
2024-08-23 12:43 ` [WIP v2 0/4] " Han Young
@ 2024-09-19 23:47 ` Calvin Wan
2024-09-19 23:47 ` [PATCH 1/2] packfile: split promisor objects oidset into two Calvin Wan
` (6 subsequent siblings)
10 siblings, 0 replies; 65+ messages in thread
From: Calvin Wan @ 2024-09-19 23:47 UTC (permalink / raw)
To: git; +Cc: Calvin Wan, hanyang.tony, jonathantanmy, sokcevic
I took the first patch of Han Young's original series and implemented
the fix suggested by Jonathan[1]. While all of his ideas would've
worked, the first one ends up being the simplest to implement at the
cost of minor network resources but saves on CPU compared to the other
ideas. This series does not fix repositories that were previously broken
by this issue, but broken repositories can easily fix themselves by
refetching the missing commit, so there isn't much benefit adding
preventative measures to gc.
[1] https://lore.kernel.org/git/20240813004508.2768102-1-jonathantanmy@google.com/
Calvin Wan (1):
fetch-pack.c: do not declare local commits as "have" in partial repos
Han Young (1):
packfile: split promisor objects oidset into two
fetch-pack.c | 17 ++++++++++++++---
packfile.c | 24 +++++++++++++++---------
packfile.h | 7 ++++++-
t/t5616-partial-clone.sh | 29 +++++++++++++++++++++++++++++
4 files changed, 64 insertions(+), 13 deletions(-)
--
2.46.0.792.g87dc391469-goog
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH 1/2] packfile: split promisor objects oidset into two
2024-08-02 7:31 [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo Han Young
` (3 preceding siblings ...)
2024-09-19 23:47 ` [PATCH 0/2] revision: fix reachable commits being gc'ed in partial repo Calvin Wan
@ 2024-09-19 23:47 ` Calvin Wan
2024-09-22 6:37 ` Junio C Hamano
2024-09-19 23:47 ` [PATCH 2/2] fetch-pack.c: do not declare local commits as "have" in partial repos Calvin Wan
` (5 subsequent siblings)
10 siblings, 1 reply; 65+ messages in thread
From: Calvin Wan @ 2024-09-19 23:47 UTC (permalink / raw)
To: git; +Cc: Han Young, jonathantanmy, sokcevic
From: Han Young <hanyang.tony@bytedance.com>
split promisor objects oidset into two, one is objects in promisor packfile,
and other set is objects referenced in promisor packfile. This enable us to
check if an object is in promisor packfile.
Signed-off-by: Han Young <hanyang.tony@bytedance.com>
---
packfile.c | 24 +++++++++++++++---------
packfile.h | 7 ++++++-
2 files changed, 21 insertions(+), 10 deletions(-)
diff --git a/packfile.c b/packfile.c
index cf12a539ea..3ff191b2e7 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2234,12 +2234,17 @@ int for_each_packed_object(each_packed_object_fn cb, void *data,
return r ? r : pack_errors;
}
+struct promisor_objects {
+ struct oidset promisor_pack_objects;
+ struct oidset promisor_pack_referenced_objects;
+};
+
static int add_promisor_object(const struct object_id *oid,
struct packed_git *pack UNUSED,
uint32_t pos UNUSED,
void *set_)
{
- struct oidset *set = set_;
+ struct promisor_objects *set = set_;
struct object *obj;
int we_parsed_object;
@@ -2254,7 +2259,7 @@ static int add_promisor_object(const struct object_id *oid,
if (!obj)
return 1;
- oidset_insert(set, oid);
+ oidset_insert(&set->promisor_pack_objects, oid);
/*
* If this is a tree, commit, or tag, the objects it refers
@@ -2272,26 +2277,26 @@ static int add_promisor_object(const struct object_id *oid,
*/
return 0;
while (tree_entry_gently(&desc, &entry))
- oidset_insert(set, &entry.oid);
+ oidset_insert(&set->promisor_pack_referenced_objects, &entry.oid);
if (we_parsed_object)
free_tree_buffer(tree);
} else if (obj->type == OBJ_COMMIT) {
struct commit *commit = (struct commit *) obj;
struct commit_list *parents = commit->parents;
- oidset_insert(set, get_commit_tree_oid(commit));
+ oidset_insert(&set->promisor_pack_referenced_objects, get_commit_tree_oid(commit));
for (; parents; parents = parents->next)
- oidset_insert(set, &parents->item->object.oid);
+ oidset_insert(&set->promisor_pack_referenced_objects, &parents->item->object.oid);
} else if (obj->type == OBJ_TAG) {
struct tag *tag = (struct tag *) obj;
- oidset_insert(set, get_tagged_oid(tag));
+ oidset_insert(&set->promisor_pack_referenced_objects, get_tagged_oid(tag));
}
return 0;
}
-int is_promisor_object(const struct object_id *oid)
+int is_in_promisor_pack(const struct object_id *oid, int referenced)
{
- static struct oidset promisor_objects;
+ static struct promisor_objects promisor_objects;
static int promisor_objects_prepared;
if (!promisor_objects_prepared) {
@@ -2303,5 +2308,6 @@ int is_promisor_object(const struct object_id *oid)
}
promisor_objects_prepared = 1;
}
- return oidset_contains(&promisor_objects, oid);
+ return oidset_contains(&promisor_objects.promisor_pack_objects, oid) ||
+ (referenced && oidset_contains(&promisor_objects.promisor_pack_referenced_objects, oid));
}
diff --git a/packfile.h b/packfile.h
index 0f78658229..13a349e223 100644
--- a/packfile.h
+++ b/packfile.h
@@ -195,11 +195,16 @@ int has_object_kept_pack(const struct object_id *oid, unsigned flags);
int has_pack_index(const unsigned char *sha1);
+int is_in_promisor_pack(const struct object_id *oid, int referenced);
+
/*
* Return 1 if an object in a promisor packfile is or refers to the given
* object, 0 otherwise.
*/
-int is_promisor_object(const struct object_id *oid);
+static inline int is_promisor_object(const struct object_id *oid)
+{
+ return is_in_promisor_pack(oid, 1);
+}
/*
* Expose a function for fuzz testing.
--
2.46.0.792.g87dc391469-goog
^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH 2/2] fetch-pack.c: do not declare local commits as "have" in partial repos
2024-08-02 7:31 [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo Han Young
` (4 preceding siblings ...)
2024-09-19 23:47 ` [PATCH 1/2] packfile: split promisor objects oidset into two Calvin Wan
@ 2024-09-19 23:47 ` Calvin Wan
2024-09-22 6:53 ` Junio C Hamano
2024-09-25 7:20 ` [PATCH 0/2] repack: pack everything into promisor packfile " Han Young
` (4 subsequent siblings)
10 siblings, 1 reply; 65+ messages in thread
From: Calvin Wan @ 2024-09-19 23:47 UTC (permalink / raw)
To: git; +Cc: Calvin Wan, hanyang.tony, jonathantanmy, sokcevic
In a partial repository, creating a local commit and then fetching
causes the following state to occur:
commit tree blob
C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
|
C2 ---- T2 -- B2 (created locally, in non-promisor pack)
|
C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
During garbage collection, parents of promisor objects are marked as
UNINTERESTING and are subsequently garbage collected. In this case, C2
would be deleted and attempts to access that commit would result in "bad
object" errors (originally reported here[1]).
This is not a bug in gc since it should be the case that parents of
promisor objects are also promisor objects (fsck assumes this as
well). When promisor objects are fetched, the state of the repository
should ensure that the above holds true. Therefore, do not declare local
commits as "have" in partial repositores so they can be fetched into a
promisor pack.
[1] https://lore.kernel.org/git/20240802073143.56731-1-hanyang.tony@bytedance.com/
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Calvin Wan <calvinwan@google.com>
---
fetch-pack.c | 17 ++++++++++++++---
t/t5616-partial-clone.sh | 29 +++++++++++++++++++++++++++++
2 files changed, 43 insertions(+), 3 deletions(-)
diff --git a/fetch-pack.c b/fetch-pack.c
index 58b4581ad8..c39b0f6ad4 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1297,12 +1297,23 @@ static void add_common(struct strbuf *req_buf, struct oidset *common)
static int add_haves(struct fetch_negotiator *negotiator,
struct strbuf *req_buf,
- int *haves_to_send)
+ int *haves_to_send,
+ int from_promisor)
{
int haves_added = 0;
const struct object_id *oid;
while ((oid = negotiator->next(negotiator))) {
+ /*
+ * In partial repos, do not declare local objects as "have"
+ * so that they can be fetched into a promisor pack. Certain
+ * operations mark parent commits of promisor objects as
+ * UNINTERESTING and are subsequently garbage collected so
+ * this ensures local commits are still available in promisor
+ * packs after a fetch + gc.
+ */
+ if (from_promisor && !is_in_promisor_pack(oid, 0))
+ continue;
packet_buf_write(req_buf, "have %s\n", oid_to_hex(oid));
if (++haves_added >= *haves_to_send)
break;
@@ -1405,7 +1416,7 @@ static int send_fetch_request(struct fetch_negotiator *negotiator, int fd_out,
/* Add all of the common commits we've found in previous rounds */
add_common(&req_buf, common);
- haves_added = add_haves(negotiator, &req_buf, haves_to_send);
+ haves_added = add_haves(negotiator, &req_buf, haves_to_send, args->from_promisor);
*in_vain += haves_added;
trace2_data_intmax("negotiation_v2", the_repository, "haves_added", haves_added);
trace2_data_intmax("negotiation_v2", the_repository, "in_vain", *in_vain);
@@ -2178,7 +2189,7 @@ void negotiate_using_fetch(const struct oid_array *negotiation_tips,
packet_buf_write(&req_buf, "wait-for-done");
- haves_added = add_haves(&negotiator, &req_buf, &haves_to_send);
+ haves_added = add_haves(&negotiator, &req_buf, &haves_to_send, 0);
in_vain += haves_added;
if (!haves_added || (seen_ack && in_vain >= MAX_IN_VAIN))
last_iteration = 1;
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index 8415884754..cba9f7ed9b 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -693,6 +693,35 @@ test_expect_success 'lazy-fetch in submodule succeeds' '
git -C client restore --recurse-submodules --source=HEAD^ :/
'
+test_expect_success 'fetching from promisor remote fetches previously local commits' '
+ # Setup
+ git init full &&
+ git -C full config uploadpack.allowfilter 1 &&
+ git -C full config uploadpack.allowanysha1inwant 1 &&
+ touch full/foo &&
+ git -C full add foo &&
+ git -C full commit -m "commit 1" &&
+ git -C full checkout --detach &&
+
+ # Partial clone and push commit to remote
+ git clone "file://$(pwd)/full" --filter=blob:none partial &&
+ echo "hello" > partial/foo &&
+ git -C partial commit -a -m "commit 2" &&
+ git -C partial push &&
+
+ # gc in partial repo
+ git -C partial gc --prune=now &&
+
+ # Create another commit in normal repo
+ git -C full checkout main &&
+ echo " world" >> full/foo &&
+ git -C full commit -a -m "commit 3" &&
+
+ # Pull from remote in partial repo, and run gc again
+ git -C partial pull &&
+ git -C partial gc --prune=now
+'
+
. "$TEST_DIRECTORY"/lib-httpd.sh
start_httpd
--
2.46.0.792.g87dc391469-goog
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH 1/2] packfile: split promisor objects oidset into two
2024-09-19 23:47 ` [PATCH 1/2] packfile: split promisor objects oidset into two Calvin Wan
@ 2024-09-22 6:37 ` Junio C Hamano
0 siblings, 0 replies; 65+ messages in thread
From: Junio C Hamano @ 2024-09-22 6:37 UTC (permalink / raw)
To: Calvin Wan; +Cc: git, Han Young, jonathantanmy, sokcevic
Calvin Wan <calvinwan@google.com> writes:
> From: Han Young <hanyang.tony@bytedance.com>
>
> split promisor objects oidset into two, one is objects in promisor packfile,
> and other set is objects referenced in promisor packfile. This enable us to
> check if an object is in promisor packfile.
OK, so the idea is that we can discard the objects that are _in_ a
promisor packfile and assume that we can fetch them back?
Objects that are referenced by objects in the promisor packfile may
or may not be in the same packfile, and we obviously cannot expect
that we can refetch those that are not in the promisor packfile from
the promisor. So what is the other list for?
What I am wondering is what good the existing helper function
is_promisor_object() is for. It will say "yes" for objects that we
may have obtained from the promisor remote (hence we can lazily
fetch them again even if we lost them) and in promisor packs, but it
may also say "yes" for any object that an object that is in a
promisor pack (e.g., a tree object that represents a subdirectory
that was not modified by a commit in a promisor pack, a parent
commit of a commit in a promisor pack, etc.). In other words, are
the callers getting any useful answer to their question to the
helper function, or are they all buggy for not asking "is this
object in a promisor pack" and allowing the helper to say "yes" for
objects that are merely referenced by an object in promisor packs?
Thanks.
> -int is_promisor_object(const struct object_id *oid)
> +int is_in_promisor_pack(const struct object_id *oid, int referenced)
> {
> - static struct oidset promisor_objects;
> + static struct promisor_objects promisor_objects;
> static int promisor_objects_prepared;
>
> if (!promisor_objects_prepared) {
> @@ -2303,5 +2308,6 @@ int is_promisor_object(const struct object_id *oid)
> }
> promisor_objects_prepared = 1;
> }
> - return oidset_contains(&promisor_objects, oid);
> + return oidset_contains(&promisor_objects.promisor_pack_objects, oid) ||
> + (referenced && oidset_contains(&promisor_objects.promisor_pack_referenced_objects, oid));
> }
> diff --git a/packfile.h b/packfile.h
> index 0f78658229..13a349e223 100644
> --- a/packfile.h
> +++ b/packfile.h
> @@ -195,11 +195,16 @@ int has_object_kept_pack(const struct object_id *oid, unsigned flags);
>
> int has_pack_index(const unsigned char *sha1);
>
> +int is_in_promisor_pack(const struct object_id *oid, int referenced);
> +
> /*
> * Return 1 if an object in a promisor packfile is or refers to the given
> * object, 0 otherwise.
> */
> -int is_promisor_object(const struct object_id *oid);
> +static inline int is_promisor_object(const struct object_id *oid)
> +{
> + return is_in_promisor_pack(oid, 1);
> +}
>
> /*
> * Expose a function for fuzz testing.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 2/2] fetch-pack.c: do not declare local commits as "have" in partial repos
2024-09-19 23:47 ` [PATCH 2/2] fetch-pack.c: do not declare local commits as "have" in partial repos Calvin Wan
@ 2024-09-22 6:53 ` Junio C Hamano
2024-09-22 16:41 ` Junio C Hamano
2024-09-23 3:44 ` [External] " 韩仰
0 siblings, 2 replies; 65+ messages in thread
From: Junio C Hamano @ 2024-09-22 6:53 UTC (permalink / raw)
To: Calvin Wan; +Cc: git, hanyang.tony, jonathantanmy, sokcevic
Calvin Wan <calvinwan@google.com> writes:
> In a partial repository, creating a local commit and then fetching
> causes the following state to occur:
>
> commit tree blob
> C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
> |
> C2 ---- T2 -- B2 (created locally, in non-promisor pack)
> |
> C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
>
> During garbage collection, parents of promisor objects are marked as
> UNINTERESTING and are subsequently garbage collected. In this case, C2
> would be deleted and attempts to access that commit would result in "bad
> object" errors (originally reported here[1]).
Understandable.
> This is not a bug in gc since it should be the case that parents of
> promisor objects are also promisor objects (fsck assumes this as
> well).
I am not sure where this "not a bug" claim comes from. Here, the
definition of "promisor objects" seems to be anything that are
reachable from objects in promisor packs, but isn't the source of
the bug that collects C2 exactly that "gc" uses such a definition
for discardable objects that can be refetchd from promisor remotes?
> When promisor objects are fetched, the state of the repository
> should ensure that the above holds true. Therefore, do not declare local
> commits as "have" in partial repositores so they can be fetched into a
> promisor pack.
Could you clarify what it means in the context of the above example
you gave in an updated version of the proposed log message?
We pretend that C2 and anything it reaches do not exist locally, to
force them to be fetched from the remote? We'd end up having two
copies of C2 (one that we created locally and had before starting
this fetch, the other we fetched when we fetched C3 from them)?
This sounds like it is awfully inefficient both network bandwidth-
and local disk-wise.
I was hoping to see that the issue can be fixed on the "gc" side,
regardless of how the objects enter our repository, but perhaps I am
missing something. Isn't it just the matter of collecting C1, C3
but not C2? Or to put it another way, if we first create a list of
objects to be packed (regardless of whether they are in promisor
packs), and then remove the objects that are in promisor packs from
the list, and pack the objects still remaining in the list?
> diff --git a/fetch-pack.c b/fetch-pack.c
> index 58b4581ad8..c39b0f6ad4 100644
> --- a/fetch-pack.c
> +++ b/fetch-pack.c
> @@ -1297,12 +1297,23 @@ static void add_common(struct strbuf *req_buf, struct oidset *common)
>
> static int add_haves(struct fetch_negotiator *negotiator,
> struct strbuf *req_buf,
> - int *haves_to_send)
> + int *haves_to_send,
> + int from_promisor)
> {
> int haves_added = 0;
> const struct object_id *oid;
>
> while ((oid = negotiator->next(negotiator))) {
> + /*
> + * In partial repos, do not declare local objects as "have"
> + * so that they can be fetched into a promisor pack. Certain
> + * operations mark parent commits of promisor objects as
> + * UNINTERESTING and are subsequently garbage collected so
> + * this ensures local commits are still available in promisor
> + * packs after a fetch + gc.
> + */
> + if (from_promisor && !is_in_promisor_pack(oid, 0))
> + continue;
> packet_buf_write(req_buf, "have %s\n", oid_to_hex(oid));
> if (++haves_added >= *haves_to_send)
> break;
> @@ -1405,7 +1416,7 @@ static int send_fetch_request(struct fetch_negotiator *negotiator, int fd_out,
> /* Add all of the common commits we've found in previous rounds */
> add_common(&req_buf, common);
>
> - haves_added = add_haves(negotiator, &req_buf, haves_to_send);
> + haves_added = add_haves(negotiator, &req_buf, haves_to_send, args->from_promisor);
> *in_vain += haves_added;
> trace2_data_intmax("negotiation_v2", the_repository, "haves_added", haves_added);
> trace2_data_intmax("negotiation_v2", the_repository, "in_vain", *in_vain);
> @@ -2178,7 +2189,7 @@ void negotiate_using_fetch(const struct oid_array *negotiation_tips,
>
> packet_buf_write(&req_buf, "wait-for-done");
>
> - haves_added = add_haves(&negotiator, &req_buf, &haves_to_send);
> + haves_added = add_haves(&negotiator, &req_buf, &haves_to_send, 0);
> in_vain += haves_added;
> if (!haves_added || (seen_ack && in_vain >= MAX_IN_VAIN))
> last_iteration = 1;
> diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
> index 8415884754..cba9f7ed9b 100755
> --- a/t/t5616-partial-clone.sh
> +++ b/t/t5616-partial-clone.sh
> @@ -693,6 +693,35 @@ test_expect_success 'lazy-fetch in submodule succeeds' '
> git -C client restore --recurse-submodules --source=HEAD^ :/
> '
>
> +test_expect_success 'fetching from promisor remote fetches previously local commits' '
> + # Setup
> + git init full &&
> + git -C full config uploadpack.allowfilter 1 &&
> + git -C full config uploadpack.allowanysha1inwant 1 &&
> + touch full/foo &&
> + git -C full add foo &&
> + git -C full commit -m "commit 1" &&
> + git -C full checkout --detach &&
> +
> + # Partial clone and push commit to remote
> + git clone "file://$(pwd)/full" --filter=blob:none partial &&
> + echo "hello" > partial/foo &&
> + git -C partial commit -a -m "commit 2" &&
> + git -C partial push &&
> +
> + # gc in partial repo
> + git -C partial gc --prune=now &&
> +
> + # Create another commit in normal repo
> + git -C full checkout main &&
> + echo " world" >> full/foo &&
> + git -C full commit -a -m "commit 3" &&
> +
> + # Pull from remote in partial repo, and run gc again
> + git -C partial pull &&
> + git -C partial gc --prune=now
> +'
> +
> . "$TEST_DIRECTORY"/lib-httpd.sh
> start_httpd
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 2/2] fetch-pack.c: do not declare local commits as "have" in partial repos
2024-09-22 6:53 ` Junio C Hamano
@ 2024-09-22 16:41 ` Junio C Hamano
2024-09-23 3:44 ` [External] " 韩仰
1 sibling, 0 replies; 65+ messages in thread
From: Junio C Hamano @ 2024-09-22 16:41 UTC (permalink / raw)
To: Calvin Wan; +Cc: git, hanyang.tony, jonathantanmy, sokcevic
Junio C Hamano <gitster@pobox.com> writes:
> Calvin Wan <calvinwan@google.com> writes:
>
>> In a partial repository, creating a local commit and then fetching
>> causes the following state to occur:
>>
>> commit tree blob
>> C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
>> |
>> C2 ---- T2 -- B2 (created locally, in non-promisor pack)
>> |
>> C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
>>
>> During garbage collection, parents of promisor objects are marked as
>> UNINTERESTING and are subsequently garbage collected. In this case, C2
>> would be deleted and attempts to access that commit would result in "bad
>> object" errors (originally reported here[1]).
>
> Understandable.
> ...
>> When promisor objects are fetched, the state of the repository
>> should ensure that the above holds true. Therefore, do not declare local
>> commits as "have" in partial repositores so they can be fetched into a
>> promisor pack.
> ...
> We pretend that C2 and anything it reaches do not exist locally, to
> force them to be fetched from the remote? We'd end up having two
> copies of C2 (one that we created locally and had before starting
> this fetch, the other we fetched when we fetched C3 from them)?
> This sounds like it is awfully inefficient both network bandwidth-
> and local disk-wise.
One related thing that worries me is what happens after we make a
large push, either directly to the remote, or what we pushed
elsewhere were fetched by the remote, and then we need to fetch what
they created on top. The history may look like this:
1. we fetch from promisor remote. C is in promisor packs
---C
2. we build on top. 'x' are local.
---C---x---x---x---x---x---x---x---x
3. 'x' we created above ends up to be a the promisor side,
and others build a few commits on top.
---C---x---x---x---x---x---x---x---x---o---o
4. Now we try to fetch from them. I.e. a repository that has
history illustrated in 2. fetches the history illustrated in 3.
Because this change forbids the fetching side to tell the other side
that it has 'x', the first "have" we are allowed to send is 'C',
even though we only need to fetch two commits 'o' from them.
And 'x' could be numerous in distributed development workflows, as
these "local" commits do not have to be ones you created locally
yourself. You may have fetched and merged these commits from
elsewhere where the active development is happening. The only
criterion that qualifies a commit to be "local" (and causes us to
omit them from "have" communication) is that we didn't obtain it
directly from our promisor remote, so you may end up fetching
a large portion of the history you already have from the promisor
remote, just to have them into a promisor pack.
If we cannot change the definition of "is-promisor-object" for the
purpose of "gc" (and it is probably I am missing what you, JTan, and
HanYang thought about that I do not see he reason why), I wonder if
there is a way to somehow avoid the refetching but still "move"
these 'x' objects purely locally into a promisor pack?
That is, the current "git fetch" without this patch would only fetch
two 'o' commits (and its associated trees and blobs) into a new
promisor pack, but because we know that commits 'x' have now become
re-fetchable from the promisor, we can make them promisor objects by
repacking locally them and mark the resulting pack a promisor pack,
without incurring the cost to the remote to prepare and send 'x'
again to us. That would give us the same protection the patch under
discussion offers, wouldn't it?
I however still think fixing "gc" would give us a lot more intuitive
behaviour, though.
Thanks.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [External] Re: [PATCH 2/2] fetch-pack.c: do not declare local commits as "have" in partial repos
2024-09-22 6:53 ` Junio C Hamano
2024-09-22 16:41 ` Junio C Hamano
@ 2024-09-23 3:44 ` 韩仰
2024-09-23 16:21 ` Junio C Hamano
2024-10-02 22:35 ` Calvin Wan
1 sibling, 2 replies; 65+ messages in thread
From: 韩仰 @ 2024-09-23 3:44 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Calvin Wan, git, jonathantanmy, sokcevic
On Sun, Sep 22, 2024 at 2:53 PM Junio C Hamano <gitster@pobox.com> wrote:
> I was hoping to see that the issue can be fixed on the "gc" side,
> regardless of how the objects enter our repository, but perhaps I am
> missing something. Isn't it just the matter of collecting C1, C3
> but not C2? Or to put it another way, if we first create a list of
> objects to be packed (regardless of whether they are in promisor
> packs), and then remove the objects that are in promisor packs from
> the list, and pack the objects still remaining in the list?
I tried to fix the issue on the "gc" side following JTan's suggestion,
by packing local objects referenced by promisor objects into promisor
packs. But it turns out the cost for "for each promisor object,
parse them and try to decide the objects they reference is in local repo"
is too great. In a test blob:none partial clone repo, the gc would take more
than one hour in the 2019 MacBook, despite the repo only
having 17071073 objects. Normally it would take about 30 minutes.
> if we first create a list of
> objects to be packed (regardless of whether they are in promisor
> packs), and then remove the objects that are in promisor packs from
> the list, and pack the objects still remaining in the list?
This would work, though the remaining objects in the list would be
suboptimally packed, due to the delta heuristic. Because we feed
object id directly into git-pack-objects, instead of using rev-list. But
that's how we pack promisor objects anyway.
In $JOB, we modified git-repack to pack everything into a giant promisor
pack if the repo is partially cloned. This basically does the same thing as
you suggested, but without the cost of constructing the object list and
removing the objects in the promisor packs.
Thanks.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [External] Re: [PATCH 2/2] fetch-pack.c: do not declare local commits as "have" in partial repos
2024-09-23 3:44 ` [External] " 韩仰
@ 2024-09-23 16:21 ` Junio C Hamano
2024-10-02 22:35 ` Calvin Wan
1 sibling, 0 replies; 65+ messages in thread
From: Junio C Hamano @ 2024-09-23 16:21 UTC (permalink / raw)
To: 韩仰; +Cc: Calvin Wan, git, jonathantanmy, sokcevic
韩仰 <hanyang.tony@bytedance.com> writes:
> In $JOB, we modified git-repack to pack everything into a giant promisor
> pack if the repo is partially cloned.
I would imagine that would give you pretty much similar results as
the posted patch without incurring the cost of transfering the same
objects from the promisor remote.
> This basically does the same thing as
> you suggested, but without the cost of constructing the object list and
> removing the objects in the promisor packs.
Yup, repacking, instead of creating a new pack only to hold the
objects in the gap, would be much simpler.
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH 0/2] repack: pack everything into promisor packfile in partial repos
2024-08-02 7:31 [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo Han Young
` (5 preceding siblings ...)
2024-09-19 23:47 ` [PATCH 2/2] fetch-pack.c: do not declare local commits as "have" in partial repos Calvin Wan
@ 2024-09-25 7:20 ` Han Young
2024-09-25 7:20 ` [PATCH 1/2] repack: pack everything into packfile Han Young
` (3 more replies)
2024-10-01 19:17 ` Missing Promisor Objects in Partial Repo Design Doc Calvin Wan
` (3 subsequent siblings)
10 siblings, 4 replies; 65+ messages in thread
From: Han Young @ 2024-09-25 7:20 UTC (permalink / raw)
To: git; +Cc: calvinwan, jonathantanmy, sokcevic, gitster, Han Young
As suggested by Jonathan[1], there are number of ways to fix this issue.
We have already explored some of them in this thread, and so far none of them
is satisfiable. Calvin and I tried to address the problem from fetch-pack side
and rev-list side. But the fix either consumes too much CPU power or results
in inefficient bandwidth use.
So let's attack the problem from repack side. The goal is to prevent repack
from discarding local objects, previously it is done by carefully
separating promisor objects and normal objects in rev-list.
The implementation is flawed and no solution have been found so far.
Instead, we can get ride of rev-list and just pack everything into promisor
files. This way, no objects would be lost.
By using 'repack everything', repacking requires less work and we are not
using more bandwidth. The only downside is normal objects packing does not
benefiting from the history and path based delta calculation. Majority of
objects in a partial repo is promisor objects, so the impact of worse normal
objects repacking is negligible.
[1] https://lore.kernel.org/git/20240813004508.2768102-1-jonathantanmy@google.com/
Han Young (2):
repack: pack everything into promisor packfile in partial repos
t0410: adapt tests to repack changes
builtin/repack.c | 258 ++++++++++++++++++++++-----------------
t/t0410-partial-clone.sh | 68 +----------
2 files changed, 145 insertions(+), 181 deletions(-)
--
2.46.0
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH 1/2] repack: pack everything into packfile
2024-09-25 7:20 ` [PATCH 0/2] repack: pack everything into promisor packfile " Han Young
@ 2024-09-25 7:20 ` Han Young
2024-09-25 7:20 ` [PATCH 2/2] t0410: adapt tests to repack changes Han Young
` (2 subsequent siblings)
3 siblings, 0 replies; 65+ messages in thread
From: Han Young @ 2024-09-25 7:20 UTC (permalink / raw)
To: git; +Cc: calvinwan, jonathantanmy, sokcevic, gitster, Han Young
In a partial repository, creating a local commit and then fetching
causes the following state to occur:
commit tree blob
C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
|
C2 ---- T2 -- B2 (created locally, in non-promisor pack)
|
C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
During garbage collection, parents of promisor objects are marked as
UNINTERESTING and are subsequently garbage collected. In this case, C2
would be deleted and attempts to access that commit would result in "bad
object" errors (originally reported here[1]).
For partial repos, repacking is done in two steps. We first repack all the
objects in promisor packfile, then repack all the non-promisor objects.
It turns out C2, T2 and B2 are not repacked in either steps, ended up deleted.
We can avoid this by packing everything into a promisor packfile, if the repo
is partial cloned.
[1] https://lore.kernel.org/git/20240802073143.56731-1-hanyang.tony@bytedance.com/
Helped-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Han Young <hanyang.tony@bytedance.com>
---
builtin/repack.c | 257 ++++++++++++++++++++++++++---------------------
1 file changed, 143 insertions(+), 114 deletions(-)
diff --git a/builtin/repack.c b/builtin/repack.c
index cb4420f085..e9e18a31fe 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -321,6 +321,23 @@ static int write_oid(const struct object_id *oid,
return 0;
}
+static int write_loose_oid(const struct object_id *oid,
+ const char *path UNUSED,
+ void *data)
+{
+ struct child_process *cmd = data;
+
+ if (cmd->in == -1) {
+ if (start_command(cmd))
+ die(_("could not start pack-objects to repack promisor objects"));
+ }
+
+ if (write_in_full(cmd->in, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
+ write_in_full(cmd->in, "\n", 1) < 0)
+ die(_("failed to feed promisor objects to pack-objects"));
+ return 0;
+}
+
static struct {
const char *name;
unsigned optional:1;
@@ -370,12 +387,15 @@ static int has_pack_ext(const struct generated_pack_data *data,
BUG("unknown pack extension: '%s'", ext);
}
-static void repack_promisor_objects(const struct pack_objects_args *args,
- struct string_list *names)
+static int repack_promisor_objects(const struct pack_objects_args *args,
+ struct string_list *names,
+ struct string_list *list,
+ int pack_all)
{
struct child_process cmd = CHILD_PROCESS_INIT;
FILE *out;
struct strbuf line = STRBUF_INIT;
+ struct string_list_item *item;
prepare_pack_objects(&cmd, args, packtmp);
cmd.in = -1;
@@ -387,13 +407,19 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
* {type -> existing pack order} ordering when computing deltas instead
* of a {type -> size} ordering, which may produce better deltas.
*/
- for_each_packed_object(write_oid, &cmd,
- FOR_EACH_OBJECT_PROMISOR_ONLY);
+ if (pack_all)
+ for_each_packed_object(write_oid, &cmd, 0);
+ else
+ for_each_string_list_item(item, list) {
+ pack_mark_retained(item);
+ }
+
+ for_each_loose_object(write_loose_oid, &cmd, 0);
if (cmd.in == -1) {
/* No packed objects; cmd was never started */
child_process_clear(&cmd);
- return;
+ return 0;
}
close(cmd.in);
@@ -431,6 +457,7 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
if (finish_command(&cmd))
die(_("could not finish pack-objects to repack promisor objects"));
strbuf_release(&line);
+ return 0;
}
struct pack_geometry {
@@ -1312,8 +1339,7 @@ int cmd_repack(int argc,
strvec_push(&cmd.args, "--reflog");
strvec_push(&cmd.args, "--indexed-objects");
}
- if (repo_has_promisor_remote(the_repository))
- strvec_push(&cmd.args, "--exclude-promisor-objects");
+
if (!write_midx) {
if (write_bitmaps > 0)
strvec_push(&cmd.args, "--write-bitmap-index");
@@ -1323,125 +1349,128 @@ int cmd_repack(int argc,
if (use_delta_islands)
strvec_push(&cmd.args, "--delta-islands");
- if (pack_everything & ALL_INTO_ONE) {
- repack_promisor_objects(&po_args, &names);
-
- if (has_existing_non_kept_packs(&existing) &&
- delete_redundant &&
- !(pack_everything & PACK_CRUFT)) {
- for_each_string_list_item(item, &names) {
- strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
- packtmp_name, item->string);
- }
- if (unpack_unreachable) {
- strvec_pushf(&cmd.args,
- "--unpack-unreachable=%s",
- unpack_unreachable);
- } else if (pack_everything & LOOSEN_UNREACHABLE) {
- strvec_push(&cmd.args,
- "--unpack-unreachable");
- } else if (keep_unreachable) {
- strvec_push(&cmd.args, "--keep-unreachable");
- strvec_push(&cmd.args, "--pack-loose-unreachable");
+ if (repo_has_promisor_remote(the_repository)) {
+ ret = repack_promisor_objects(&po_args, &names,
+ &existing.non_kept_packs, pack_everything & ALL_INTO_ONE);
+ } else {
+ if (pack_everything & ALL_INTO_ONE) {
+ if (has_existing_non_kept_packs(&existing) &&
+ delete_redundant &&
+ !(pack_everything & PACK_CRUFT)) {
+ for_each_string_list_item(item, &names) {
+ strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
+ packtmp_name, item->string);
+ }
+ if (unpack_unreachable) {
+ strvec_pushf(&cmd.args,
+ "--unpack-unreachable=%s",
+ unpack_unreachable);
+ } else if (pack_everything & LOOSEN_UNREACHABLE) {
+ strvec_push(&cmd.args,
+ "--unpack-unreachable");
+ } else if (keep_unreachable) {
+ strvec_push(&cmd.args, "--keep-unreachable");
+ strvec_push(&cmd.args, "--pack-loose-unreachable");
+ }
}
+ } else if (geometry.split_factor) {
+ strvec_push(&cmd.args, "--stdin-packs");
+ strvec_push(&cmd.args, "--unpacked");
+ } else {
+ strvec_push(&cmd.args, "--unpacked");
+ strvec_push(&cmd.args, "--incremental");
}
- } else if (geometry.split_factor) {
- strvec_push(&cmd.args, "--stdin-packs");
- strvec_push(&cmd.args, "--unpacked");
- } else {
- strvec_push(&cmd.args, "--unpacked");
- strvec_push(&cmd.args, "--incremental");
- }
- if (po_args.filter_options.choice)
- strvec_pushf(&cmd.args, "--filter=%s",
- expand_list_objects_filter_spec(&po_args.filter_options));
- else if (filter_to)
- die(_("option '%s' can only be used along with '%s'"), "--filter-to", "--filter");
+ if (po_args.filter_options.choice)
+ strvec_pushf(&cmd.args, "--filter=%s",
+ expand_list_objects_filter_spec(&po_args.filter_options));
+ else if (filter_to)
+ die(_("option '%s' can only be used along with '%s'"), "--filter-to", "--filter");
- if (geometry.split_factor)
- cmd.in = -1;
- else
- cmd.no_stdin = 1;
+ if (geometry.split_factor)
+ cmd.in = -1;
+ else
+ cmd.no_stdin = 1;
- ret = start_command(&cmd);
- if (ret)
- goto cleanup;
+ ret = start_command(&cmd);
+ if (ret)
+ goto cleanup;
- if (geometry.split_factor) {
- FILE *in = xfdopen(cmd.in, "w");
- /*
- * The resulting pack should contain all objects in packs that
- * are going to be rolled up, but exclude objects in packs which
- * are being left alone.
- */
- for (i = 0; i < geometry.split; i++)
- fprintf(in, "%s\n", pack_basename(geometry.pack[i]));
- for (i = geometry.split; i < geometry.pack_nr; i++)
- fprintf(in, "^%s\n", pack_basename(geometry.pack[i]));
- fclose(in);
- }
+ if (geometry.split_factor) {
+ FILE *in = xfdopen(cmd.in, "w");
+ /*
+ * The resulting pack should contain all objects in packs that
+ * are going to be rolled up, but exclude objects in packs which
+ * are being left alone.
+ */
+ for (i = 0; i < geometry.split; i++)
+ fprintf(in, "%s\n", pack_basename(geometry.pack[i]));
+ for (i = geometry.split; i < geometry.pack_nr; i++)
+ fprintf(in, "^%s\n", pack_basename(geometry.pack[i]));
+ fclose(in);
+ }
- ret = finish_pack_objects_cmd(&cmd, &names, 1);
- if (ret)
- goto cleanup;
-
- if (!names.nr && !po_args.quiet)
- printf_ln(_("Nothing new to pack."));
-
- if (pack_everything & PACK_CRUFT) {
- const char *pack_prefix = find_pack_prefix(packdir, packtmp);
-
- if (!cruft_po_args.window)
- cruft_po_args.window = po_args.window;
- if (!cruft_po_args.window_memory)
- cruft_po_args.window_memory = po_args.window_memory;
- if (!cruft_po_args.depth)
- cruft_po_args.depth = po_args.depth;
- if (!cruft_po_args.threads)
- cruft_po_args.threads = po_args.threads;
- if (!cruft_po_args.max_pack_size)
- cruft_po_args.max_pack_size = po_args.max_pack_size;
-
- cruft_po_args.local = po_args.local;
- cruft_po_args.quiet = po_args.quiet;
-
- ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
- cruft_expiration, &names,
- &existing);
+ ret = finish_pack_objects_cmd(&cmd, &names, 1);
if (ret)
goto cleanup;
- if (delete_redundant && expire_to) {
- /*
- * If `--expire-to` is given with `-d`, it's possible
- * that we're about to prune some objects. With cruft
- * packs, pruning is implicit: any objects from existing
- * packs that weren't picked up by new packs are removed
- * when their packs are deleted.
- *
- * Generate an additional cruft pack, with one twist:
- * `names` now includes the name of the cruft pack
- * written in the previous step. So the contents of
- * _this_ cruft pack exclude everything contained in the
- * existing cruft pack (that is, all of the unreachable
- * objects which are no older than
- * `--cruft-expiration`).
- *
- * To make this work, cruft_expiration must become NULL
- * so that this cruft pack doesn't actually prune any
- * objects. If it were non-NULL, this call would always
- * generate an empty pack (since every object not in the
- * cruft pack generated above will have an mtime older
- * than the expiration).
- */
- ret = write_cruft_pack(&cruft_po_args, expire_to,
- pack_prefix,
- NULL,
- &names,
- &existing);
+ if (!names.nr && !po_args.quiet)
+ printf_ln(_("Nothing new to pack."));
+
+ if (pack_everything & PACK_CRUFT) {
+ const char *pack_prefix = find_pack_prefix(packdir, packtmp);
+
+ if (!cruft_po_args.window)
+ cruft_po_args.window = po_args.window;
+ if (!cruft_po_args.window_memory)
+ cruft_po_args.window_memory = po_args.window_memory;
+ if (!cruft_po_args.depth)
+ cruft_po_args.depth = po_args.depth;
+ if (!cruft_po_args.threads)
+ cruft_po_args.threads = po_args.threads;
+ if (!cruft_po_args.max_pack_size)
+ cruft_po_args.max_pack_size = po_args.max_pack_size;
+
+ cruft_po_args.local = po_args.local;
+ cruft_po_args.quiet = po_args.quiet;
+
+ ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
+ cruft_expiration, &names,
+ &existing);
if (ret)
goto cleanup;
+
+ if (delete_redundant && expire_to) {
+ /*
+ * If `--expire-to` is given with `-d`, it's possible
+ * that we're about to prune some objects. With cruft
+ * packs, pruning is implicit: any objects from existing
+ * packs that weren't picked up by new packs are removed
+ * when their packs are deleted.
+ *
+ * Generate an additional cruft pack, with one twist:
+ * `names` now includes the name of the cruft pack
+ * written in the previous step. So the contents of
+ * _this_ cruft pack exclude everything contained in the
+ * existing cruft pack (that is, all of the unreachable
+ * objects which are no older than
+ * `--cruft-expiration`).
+ *
+ * To make this work, cruft_expiration must become NULL
+ * so that this cruft pack doesn't actually prune any
+ * objects. If it were non-NULL, this call would always
+ * generate an empty pack (since every object not in the
+ * cruft pack generated above will have an mtime older
+ * than the expiration).
+ */
+ ret = write_cruft_pack(&cruft_po_args, expire_to,
+ pack_prefix,
+ NULL,
+ &names,
+ &existing);
+ if (ret)
+ goto cleanup;
+ }
}
}
--
2.46.0
^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH 2/2] t0410: adapt tests to repack changes
2024-09-25 7:20 ` [PATCH 0/2] repack: pack everything into promisor packfile " Han Young
2024-09-25 7:20 ` [PATCH 1/2] repack: pack everything into packfile Han Young
@ 2024-09-25 7:20 ` Han Young
2024-09-25 15:20 ` [PATCH 0/2] repack: pack everything into promisor packfile in partial repos Phillip Wood
2024-09-25 17:03 ` Junio C Hamano
3 siblings, 0 replies; 65+ messages in thread
From: Han Young @ 2024-09-25 7:20 UTC (permalink / raw)
To: git; +Cc: calvinwan, jonathantanmy, sokcevic, gitster, Han Young
In the previous commit, we changed how partial repo is cloned.
Adapt tests to these changes. Also check gc does not delete normal
objects too.
Signed-off-by: Han Young <hanyang.tony@bytedance.com>
---
t/t0410-partial-clone.sh | 68 +---------------------------------------
1 file changed, 1 insertion(+), 67 deletions(-)
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 34bdb3ab1f..c169b47160 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -499,46 +499,6 @@ test_expect_success 'single promisor remote can be re-initialized gracefully' '
git -C repo fetch --filter=blob:none foo
'
-test_expect_success 'gc repacks promisor objects separately from non-promisor objects' '
- rm -rf repo &&
- test_create_repo repo &&
- test_commit -C repo one &&
- test_commit -C repo two &&
-
- TREE_ONE=$(git -C repo rev-parse one^{tree}) &&
- printf "$TREE_ONE\n" | pack_as_from_promisor &&
- TREE_TWO=$(git -C repo rev-parse two^{tree}) &&
- printf "$TREE_TWO\n" | pack_as_from_promisor &&
-
- git -C repo config core.repositoryformatversion 1 &&
- git -C repo config extensions.partialclone "arbitrary string" &&
- git -C repo gc &&
-
- # Ensure that exactly one promisor packfile exists, and that it
- # contains the trees but not the commits
- ls repo/.git/objects/pack/pack-*.promisor >promisorlist &&
- test_line_count = 1 promisorlist &&
- PROMISOR_PACKFILE=$(sed "s/.promisor/.pack/" <promisorlist) &&
- git verify-pack $PROMISOR_PACKFILE -v >out &&
- grep "$TREE_ONE" out &&
- grep "$TREE_TWO" out &&
- ! grep "$(git -C repo rev-parse one)" out &&
- ! grep "$(git -C repo rev-parse two)" out &&
-
- # Remove the promisor packfile and associated files
- rm $(sed "s/.promisor//" <promisorlist).* &&
-
- # Ensure that the single other pack contains the commits, but not the
- # trees
- ls repo/.git/objects/pack/pack-*.pack >packlist &&
- test_line_count = 1 packlist &&
- git verify-pack repo/.git/objects/pack/pack-*.pack -v >out &&
- grep "$(git -C repo rev-parse one)" out &&
- grep "$(git -C repo rev-parse two)" out &&
- ! grep "$TREE_ONE" out &&
- ! grep "$TREE_TWO" out
-'
-
test_expect_success 'gc does not repack promisor objects if there are none' '
rm -rf repo &&
test_create_repo repo &&
@@ -569,7 +529,7 @@ repack_and_check () {
git -C repo2 cat-file -e $3
}
-test_expect_success 'repack -d does not irreversibly delete promisor objects' '
+test_expect_success 'repack -d does not irreversibly delete objects' '
rm -rf repo &&
test_create_repo repo &&
git -C repo config core.repositoryformatversion 1 &&
@@ -583,40 +543,14 @@ test_expect_success 'repack -d does not irreversibly delete promisor objects' '
TWO=$(git -C repo rev-parse HEAD^^) &&
THREE=$(git -C repo rev-parse HEAD^) &&
- printf "$TWO\n" | pack_as_from_promisor &&
printf "$THREE\n" | pack_as_from_promisor &&
delete_object repo "$ONE" &&
- repack_and_check --must-fail -ab "$TWO" "$THREE" &&
repack_and_check -a "$TWO" "$THREE" &&
repack_and_check -A "$TWO" "$THREE" &&
repack_and_check -l "$TWO" "$THREE"
'
-test_expect_success 'gc stops traversal when a missing but promised object is reached' '
- rm -rf repo &&
- test_create_repo repo &&
- test_commit -C repo my_commit &&
-
- TREE_HASH=$(git -C repo rev-parse HEAD^{tree}) &&
- HASH=$(promise_and_delete $TREE_HASH) &&
-
- git -C repo config core.repositoryformatversion 1 &&
- git -C repo config extensions.partialclone "arbitrary string" &&
- git -C repo gc &&
-
- # Ensure that the promisor packfile still exists, and remove it
- test -e repo/.git/objects/pack/pack-$HASH.pack &&
- rm repo/.git/objects/pack/pack-$HASH.* &&
-
- # Ensure that the single other pack contains the commit, but not the tree
- ls repo/.git/objects/pack/pack-*.pack >packlist &&
- test_line_count = 1 packlist &&
- git verify-pack repo/.git/objects/pack/pack-*.pack -v >out &&
- grep "$(git -C repo rev-parse HEAD)" out &&
- ! grep "$TREE_HASH" out
-'
-
test_expect_success 'do not fetch when checking existence of tree we construct ourselves' '
rm -rf repo &&
test_create_repo repo &&
--
2.46.0
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH 0/2] repack: pack everything into promisor packfile in partial repos
2024-09-25 7:20 ` [PATCH 0/2] repack: pack everything into promisor packfile " Han Young
2024-09-25 7:20 ` [PATCH 1/2] repack: pack everything into packfile Han Young
2024-09-25 7:20 ` [PATCH 2/2] t0410: adapt tests to repack changes Han Young
@ 2024-09-25 15:20 ` Phillip Wood
2024-09-25 16:48 ` Junio C Hamano
2024-09-25 17:03 ` Junio C Hamano
3 siblings, 1 reply; 65+ messages in thread
From: Phillip Wood @ 2024-09-25 15:20 UTC (permalink / raw)
To: Han Young, git; +Cc: calvinwan, jonathantanmy, sokcevic, gitster
Hi Han
On 25/09/2024 08:20, Han Young wrote:
> As suggested by Jonathan[1], there are number of ways to fix this issue.
> We have already explored some of them in this thread, and so far none of them
> is satisfiable. Calvin and I tried to address the problem from fetch-pack side
> and rev-list side. But the fix either consumes too much CPU power or results
> in inefficient bandwidth use.
I was wondering if it would be possible to cache the tip commits in
promisor packs when repacking so that a subsequent repack only has to
walk the commits added since the last repack when it is trying to figure
out if a local object should be moved into a promisor pack.
> So let's attack the problem from repack side. The goal is to prevent repack
> from discarding local objects, previously it is done by carefully
> separating promisor objects and normal objects in rev-list.
> The implementation is flawed and no solution have been found so far.
> Instead, we can get ride of rev-list and just pack everything into promisor
> files. This way, no objects would be lost.
>
> By using 'repack everything', repacking requires less work and we are not
> using more bandwidth. The only downside is normal objects packing does not
> benefiting from the history and path based delta calculation.
I've just been looking at Documentation/technical/partial-clone.txt and
I think there are a couple of other implications of this change
> An object may be missing due to a partial clone or fetch, or missing
> due to repository corruption. To differentiate these cases, the
> local repository specially indicates such filtered packfiles
> obtained from promisor remotes as "promisor packfiles".
Packing local objects into promisor packfiles means that it is no longer
possible to detect if an object is missing due to repository corruption
or because we need to fetch it from a promisor remote.
> `repack` in GC has been updated to not touch promisor packfiles at
> all, and to only repack other objects.
Packing local objects into promisor packfiles means that GC will
no-longer remove unreachable local objects.
It would be helpful if the cover letter or commit messages discussed the
tradeoffs of these changes and updated that document accordingly.
Best Wishes
Phillip
> Majority of
> objects in a partial repo is promisor objects, so the impact of worse normal
> objects repacking is negligible.
>
> [1] https://lore.kernel.org/git/20240813004508.2768102-1-jonathantanmy@google.com/
>
> Han Young (2):
> repack: pack everything into promisor packfile in partial repos
> t0410: adapt tests to repack changes
>
> builtin/repack.c | 258 ++++++++++++++++++++++-----------------
> t/t0410-partial-clone.sh | 68 +----------
> 2 files changed, 145 insertions(+), 181 deletions(-)
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 0/2] repack: pack everything into promisor packfile in partial repos
2024-09-25 15:20 ` [PATCH 0/2] repack: pack everything into promisor packfile in partial repos Phillip Wood
@ 2024-09-25 16:48 ` Junio C Hamano
0 siblings, 0 replies; 65+ messages in thread
From: Junio C Hamano @ 2024-09-25 16:48 UTC (permalink / raw)
To: Phillip Wood; +Cc: Han Young, git, calvinwan, jonathantanmy, sokcevic
Phillip Wood <phillip.wood123@gmail.com> writes:
> I was wondering if it would be possible to cache the tip commits in
> promisor packs when repacking so that a subsequent repack only has to
> walk the commits added since the last repack when it is trying to
> figure out if a local object should be moved into a promisor pack.
I was wondering the same thing. If packfiles (and bundles) record
the entry points and the exit points of the DAG, it would help quite
a bit.
> It would be helpful if the cover letter or commit messages discussed
> the tradeoffs of these changes and updated that document accordingly.
I like the suggestion very much.
Thanks for a review.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 0/2] repack: pack everything into promisor packfile in partial repos
2024-09-25 7:20 ` [PATCH 0/2] repack: pack everything into promisor packfile " Han Young
` (2 preceding siblings ...)
2024-09-25 15:20 ` [PATCH 0/2] repack: pack everything into promisor packfile in partial repos Phillip Wood
@ 2024-09-25 17:03 ` Junio C Hamano
3 siblings, 0 replies; 65+ messages in thread
From: Junio C Hamano @ 2024-09-25 17:03 UTC (permalink / raw)
To: Han Young; +Cc: git, calvinwan, jonathantanmy, sokcevic
Han Young <hanyang.tony@bytedance.com> writes:
> By using 'repack everything', repacking requires less work and we are not
> using more bandwidth. The only downside is normal objects packing does not
> benefiting from the history and path based delta calculation. Majority of
> objects in a partial repo is promisor objects, so the impact of worse normal
> objects repacking is negligible.
There is an important assumption that any objects in promisor packs
*and* any objects that are (directly or indirectly) referenced by
these objects in promisor packs can safely be expunged from the
local object store because they can be later fetched again from the
promisor remote. In that (in)famous failure case topology of the
history:
commit tree blob
C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
|
C2 ---- T2 -- B2 (created locally, in non-promisor pack)
|
C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
even though the objects associated with the commit C2 are created
locally, the fact that C3 in promisor pack references it alone is
sufficient for us to also assume that these "locally created" are
now refetchable from the promisor remote that gave us C3, hence it
is safe to repack the history leading to C3 and all objects involved
in the history and mark the resulting pack(s) promisor packs.
OK. That sounds workable, but aren't there downsides?
Thanks for working on this topic.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Missing Promisor Objects in Partial Repo Design Doc
2024-08-02 7:31 [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo Han Young
` (6 preceding siblings ...)
2024-09-25 7:20 ` [PATCH 0/2] repack: pack everything into promisor packfile " Han Young
@ 2024-10-01 19:17 ` Calvin Wan
2024-10-01 19:35 ` Junio C Hamano
2024-10-02 2:54 ` Junio C Hamano
2024-10-08 8:13 ` [PATCH v2 0/3] repack: pack everything into promisor packfile in partial repos Han Young
` (2 subsequent siblings)
10 siblings, 2 replies; 65+ messages in thread
From: Calvin Wan @ 2024-10-01 19:17 UTC (permalink / raw)
To: Han Young
Cc: Calvin Wan, git, Jonathan Tan, Phillip Wood, Enrico Mrass,
sokcevic
It seems that we're at a standstill for the various possible designs
that can solve this problem, so I decided to write up a design document
to discuss the ideas we've come up with so far and new ones. Hopefully
this will get us closer to a viable implementation we can agree on.
Missing Promisor Objects in Partial Repo Design Doc
===================================================
Basic Reproduction Steps
------------------------
- Partial clone repository
- Create local commit and push
- Fetch new changes
- Garbage collection
State After Reproduction
------------------------
commit tree blob
C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
|
C2b ---- T2b -- B2b (created locally, in non-promisor pack)
|
C2a ---- T2a -- B2a (created locally, in non-promisor pack)
|
C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
Explanation of the Problem
--------------------------
In a partial clone repository, non-promisor commits are locally
committed as children of promisor commits and then pushed up to the
server. Fetches of new history can result in promisor commits that have
non-promisor commits as ancestors. During garbage collection, objects
are repacked in 2 steps. In the first step, if there is more than one
promisor packfile, all objects in promisor packfiles are repacked into a
single promisor packfile. In the second step, a revision walk is made
from all refs (and some other things like HEAD and reflog entries) that
stops whenever it encounters a promisor object. In the example above, if
a ref pointed directly to C2a, it would be returned by the walk (as an
object to be packed). But if we only had a ref pointing to C3, the
revision walk immediately sees that it is a promisor object, does not
return it, and does not iterate through its parents.
(C2b is a bit of a special case. Despite not being in a promisor pack,
it is still considered to be a promisor object since C3 directly
references it.)
If we think this is a bad state, we should propagate the “promisor-ness”
of C3 to its ancestors. Git commands should either prevent this state
from occurring or tolerate it and fix it when we can. If we did run into
this state unexpectedly, then it would be considered a BUG.
If we think it is a valid state, we should NOT propagate the
“promisor-ness” of C3 to its ancestors. Git commands should respect that
this is a possible state and be able to work around it. Therefore, this
bug would then be strictly caused by garbage collection
Bad State Solutions
===================
Fetch negotiation
-----------------
Implemented at
https://lore.kernel.org/git/20240919234741.1317946-1-calvinwan@google.com/
During fetch negotiation, if a commit is not in a promisor pack and
therefore local, do not declare it as "have" so they can be fetched into
a promisor pack.
Cost:
- Creation of set of promisor pack objects (by iterating through every
.idx of promisor packs)
- Refetch number of local commits
Pros: Implementation is simple, client doesn’t have to repack, prevents
state from ever occurring in the repository.
Cons: Network cost of refetching could be high if many local commits
need to be refetched.
commit tree blob
C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
|
C2 ---- T2 -- B2 (created locally, refetched into promisor pack)
|
C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
Fetch repack
------------
Not yet implemented.
Enumerate the objects in the freshly fetched promisor packs, checking
every outgoing link to see if they reference a non-promisor object that
we have, to get a list of tips where local objects are parents of
promisor objects ("bad history"). After collecting these "tips of bad
history", you then start another traversal from them until you hit an
object in a promisor pack and stop traversal there. You have
successfully enumerated the local objects to be repacked into a promisor
pack.
Cost:
- Traversal through newly fetched promisor trees and commits
- Creation of set of promisor pack objects (for tips of bad history
traversal to stop at a promisor object)
- Traversal through all local commits and check existence in promisor
pack set
- Repack all pushed local commits
Pros: Prevents state from ever occurring in the repository, no network
cost.
Cons: Additional cost of repacking is incurred during fetch, more
complex implementation.
commit tree blob
C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
|
C2 ---- T2 -- B2 (created locally, packed into promisor pack)
|
C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
Garbage Collection repack
-------------------------
Not yet implemented.
Same concept at “fetch repack”, but happens during garbage collection
instead. The traversal is more expensive since we no longer have access
to what was recently fetched so we have to traverse through all promisor
packs to collect tips of “bad” history.
Cost:
- Creation of set of promisor pack objects
- Traversal through all promisor commits
- Traversal through all local commits and check existence in promisor
object set
- Repack all pushed local commits
Pros: Can be run in the background as part of maintenance, no network
cost.
Cons: More expensive than “fetch repack”, state isn’t fixed until
garbage collection, more complex implementation
commit tree blob
C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
|
C2 ---- T2 -- B2 (created locally, packed into promisor pack)
|
C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
Garbage Collection repack all
-----------------------------
Implemented at
https://lore.kernel.org/git/20240925072021.77078-1-hanyang.tony@bytedance.com/
Repack all local commits into promisor packs during garbage collection.
Both valid scenarios
commit tree blob
C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
|
C2 ---- T2 -- B2 (created locally, packed into promisor pack)
|
C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
commit tree blob
C3 ---- T3 -- B3 (created locally, packed into promisor pack)
|
C2 ---- T2 -- B2 (created locally, packed into promisor pack)
|
C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
Cost:
Repack all local commits
Pros: Can be run in the background as part of maintenance, no network
cost, less complex implementation, and less expensive than “garbage
collection repack”.
Cons: Packing local objects into promisor packs means that it is no
longer possible to detect if an object is missing due to repository
corruption or because we need to fetch it from a promisor remote.
Packing local objects into promisor packs means that garbage collection
will no longer remove unreachable local objects.
Valid State Solutions
=====================
Garbage Collection check
------------------------
Not yet implemented.
Currently during the garbage collection rev walk, whenever a promisor
commit is reached, it is marked UNINTERESTING, and then subsequently all
ancestors of the promisor commit are traversed and also marked
UNINTERESTING. Therefore, add a check for whether a commit is local or
not during promisor commit ancestor traversal and do not mark local
commits as UNINTERESTING.
commit tree blob
C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
|
C2 ---- T2 -- B2 (created locally, in non-promisor pack, gc does not delete)
|
C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
Cost:
- Adds an additional check to every ancestor of a promisor commit.
This is practically the only solution if the state is valid. Fsck would
also have to start checking for validity of ancestors of promisor
commits instead of ignoring them as it currently does.
Optimizations
=============
The “creation of set of promisor pack objects” can be replaced with
“creation of set of non-promisor objects” since the latter is almost
always cheaper and we can check for non-existence rather than existence.
This does not work for “fetch negotiation” since if we have a commit
that's in both a promisor pack and a non-promisor pack, the algorithm's
correctness relies on the fact that we report it as a promisor object
(because we really need the server to re-send it).
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Missing Promisor Objects in Partial Repo Design Doc
2024-10-01 19:17 ` Missing Promisor Objects in Partial Repo Design Doc Calvin Wan
@ 2024-10-01 19:35 ` Junio C Hamano
2024-10-02 2:54 ` Junio C Hamano
1 sibling, 0 replies; 65+ messages in thread
From: Junio C Hamano @ 2024-10-01 19:35 UTC (permalink / raw)
To: Calvin Wan
Cc: Han Young, git, Jonathan Tan, Phillip Wood, Enrico Mrass,
sokcevic
Calvin Wan <calvinwan@google.com> writes:
> It seems that we're at a standstill for the various possible designs
> that can solve this problem, so I decided to write up a design document
> to discuss the ideas we've come up with so far and new ones. Hopefully
> this will get us closer to a viable implementation we can agree on.
Thanks for writing this up. Very much appreciated.
I'll hold my thoughts before others have chance to speak up, though.
Thanks.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Missing Promisor Objects in Partial Repo Design Doc
2024-10-01 19:17 ` Missing Promisor Objects in Partial Repo Design Doc Calvin Wan
2024-10-01 19:35 ` Junio C Hamano
@ 2024-10-02 2:54 ` Junio C Hamano
2024-10-02 7:57 ` [External] " Han Young
` (2 more replies)
1 sibling, 3 replies; 65+ messages in thread
From: Junio C Hamano @ 2024-10-02 2:54 UTC (permalink / raw)
To: Calvin Wan
Cc: Christian Couder, Han Young, git, Jonathan Tan, Phillip Wood,
Enrico Mrass, sokcevic
> Missing Promisor Objects in Partial Repo Design Doc
> ===================================================
>
> Basic Reproduction Steps
> ------------------------
>
> - Partial clone repository
> - Create local commit and push
> - Fetch new changes
> - Garbage collection
>
> State After Reproduction
> ------------------------
>
> commit tree blob
> C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
> |
> C2b ---- T2b -- B2b (created locally, in non-promisor pack)
> |
> C2a ---- T2a -- B2a (created locally, in non-promisor pack)
> |
> C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
>
> Explanation of the Problem
> --------------------------
>
> In a partial clone repository, non-promisor commits are locally
> committed as children of promisor commits and then pushed up to the
> server. Fetches of new history can result in promisor commits that have
> non-promisor commits as ancestors. During garbage collection, objects
> are repacked in 2 steps. In the first step, if there is more than one
> promisor packfile, all objects in promisor packfiles are repacked into a
> single promisor packfile. In the second step, a revision walk is made
> from all refs (and some other things like HEAD and reflog entries) that
> stops whenever it encounters a promisor object. In the example above, if
> a ref pointed directly to C2a, it would be returned by the walk (as an
> object to be packed). But if we only had a ref pointing to C3, the
> revision walk immediately sees that it is a promisor object, does not
> return it, and does not iterate through its parents.
True. Will it become even worse, if a protocol extension Christian
proposes starts suggesting a repository that is not lazy to add a
promisor remote? In such a set-up, perhaps all history leading to
C2b down to the root are local, but C3 may have come from a promisor
remote (hence in a promisor pack).
> (C2b is a bit of a special case. Despite not being in a promisor pack,
> it is still considered to be a promisor object since C3 directly
> references it.)
Yes, and I suspect the root cause of this confusion is because
"promisor object", as defined today, is a flawed concept. If C2b
were pointed by a local ref, just like the case the ref points at
C2a, they should be treated the same way, as both of them are
locally created. To put it another way, presumably the local have
already been pushed out to elsewhere and the promisor remote got
hold of them, and that is why C3 can build on top of them. And the
fact C2b is directly reachable from C3 and C2a is not should not
have any relevance if C2a or C2b are not _included_ in promisor
packs (hence both of them need to be included in the local pack).
Two concepts that would have been useful are (1) objects that are in
promisor packs and (2) objects that are reachable from an object
that is in a promisor pack. I do not see how the current definition
of "promisor objects" (i.e. in a promisor pack, or one hop from an
object in a promisor pack) is useful in any context.
> If we think this is a bad state, we should propagate the “promisor-ness”
> of C3 to its ancestors. Git commands should either prevent this state
> from occurring or tolerate it and fix it when we can. If we did run into
> this state unexpectedly, then it would be considered a BUG.
Yup, that is the basis of the solutions we saw proposed so far.
> If we think it is a valid state, we should NOT propagate the
> “promisor-ness” of C3 to its ancestors. Git commands should respect that
> this is a possible state and be able to work around it. Therefore, this
> bug would then be strictly caused by garbage collection
Yes, that is possibly an alternative.
> Bad State Solutions
> ===================
>
> Fetch negotiation
> -----------------
> Implemented at
> https://lore.kernel.org/git/20240919234741.1317946-1-calvinwan@google.com/
>
> During fetch negotiation, if a commit is not in a promisor pack and
> therefore local, do not declare it as "have" so they can be fetched into
> a promisor pack.
>
> Cost:
> - Creation of set of promisor pack objects (by iterating through every
> .idx of promisor packs)
What is "promisor PACK objects"? Is it different from the "promisor
objects" (i.e. what I called the useless definition above)?
> - Refetch number of local commits
>
> Pros: Implementation is simple, client doesn’t have to repack, prevents
> state from ever occurring in the repository.
>
> Cons: Network cost of refetching could be high if many local commits
> need to be refetched.
What if we get into the same state by creating local C4, which gets
to outside and on top of which C5 is built, which is now sitting at
the tip of the remote history and we fetch from them? In order to
include C4 in the "promisor pack", we refrain from saying C4 is a
"have" for us and refetch. Would C2 be fetched again?
I do not think C2 would be, because we made it an object in a
promisor pack when we "fixed" the history for C3.
So the cost will not grow proportionally to the depth of the
history, which makes it OK from my point of view.
> Garbage Collection repack
> -------------------------
> Not yet implemented.
>
> Same concept at “fetch repack”, but happens during garbage collection
> instead. The traversal is more expensive since we no longer have access
> to what was recently fetched so we have to traverse through all promisor
> packs to collect tips of “bad” history.
In other words, with the status quo, "git gc" that attempts to
repack "objects in promisor packs" and "other objects that did not
get repacked in the step that repack objects in promisor packs"
separately, it implements the latter in a buggy way and discards
some objects. And fixing that bug by doing the right thing is
expensive.
Stepping back a bit, why is the loss of C2a/C2b/C2 a problem after
"git gc"? Wouldn't these "missing" objects be lazily fetchable, now
C3 is known to the remote and the remote promises everything
reachable from what they offer are (re)fetchable from them? IOW, is
this a correctness issue, or only performance issue (of having to
re-fetch what we once locally had)?
> Cons: Packing local objects into promisor packs means that it is no
> longer possible to detect if an object is missing due to repository
> corruption or because we need to fetch it from a promisor remote.
Is this true? Can we tell, when trying to access C2a/C2b/C2 after
the current version of "git gc" removes them from the local object
store, that they are missing due to repository corruption? After
all, C3 can reach them so wouldn't it be possible for us to fetch
them from the promisor remote?
After a lazy clone that omits a lot of objects acquires many objects
over time by fetching missing objects on demand, wouldn't we want to
have an option to "slim" the local repository by discarding some of
these objects (the ones that are least frequently used), relying on
the promise by the promisor remote that even if we did so, they can
be fetched again? Can we treat loss of C2a/C2b/C2 as if such a
feature prematurely kicked in? Or are we failing to refetch them
for some reason?
> Packing local objects into promisor packs means that garbage collection
> will no longer remove unreachable local objects.
>
> Valid State Solutions
> =====================
> Garbage Collection check
> ------------------------
> Not yet implemented.
>
> Currently during the garbage collection rev walk, whenever a promisor
> commit is reached, it is marked UNINTERESTING, and then subsequently all
> ancestors of the promisor commit are traversed and also marked
> UNINTERESTING. Therefore, add a check for whether a commit is local or
> not during promisor commit ancestor traversal and do not mark local
> commits as UNINTERESTING.
>
> commit tree blob
> C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
> |
> C2 ---- T2 -- B2 (created locally, in non-promisor pack, gc does not delete)
> |
> C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
>
> Cost:
> - Adds an additional check to every ancestor of a promisor commit.
>
> This is practically the only solution if the state is valid. Fsck would
> also have to start checking for validity of ancestors of promisor
> commits instead of ignoring them as it currently does.
In the longer term, this looks like the most straight-forward and
easy to explain solution to me.
> Optimizations
> =============
>
> The “creation of set of promisor pack objects” can be replaced with
> “creation of set of non-promisor objects” since the latter is almost
> always cheaper and we can check for non-existence rather than existence.
> This does not work for “fetch negotiation” since if we have a commit
> that's in both a promisor pack and a non-promisor pack, the algorithm's
> correctness relies on the fact that we report it as a promisor object
> (because we really need the server to re-send it).
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [External] Re: Missing Promisor Objects in Partial Repo Design Doc
2024-10-02 2:54 ` Junio C Hamano
@ 2024-10-02 7:57 ` Han Young
2024-10-08 21:35 ` Calvin Wan
2024-10-09 18:53 ` Jonathan Tan
2 siblings, 0 replies; 65+ messages in thread
From: Han Young @ 2024-10-02 7:57 UTC (permalink / raw)
To: Junio C Hamano
Cc: Calvin Wan, Christian Couder, git, Jonathan Tan, Phillip Wood,
Enrico Mrass, sokcevic
On Wed, Oct 2, 2024 at 10:55 AM Junio C Hamano <gitster@pobox.com> wrote:
> Stepping back a bit, why is the loss of C2a/C2b/C2 a problem after
> "git gc"? Wouldn't these "missing" objects be lazily fetchable, now
> C3 is known to the remote and the remote promises everything
> reachable from what they offer are (re)fetchable from them? IOW, is
> this a correctness issue, or only performance issue (of having to
> re-fetch what we once locally had)?
>
> Is this true? Can we tell, when trying to access C2a/C2b/C2 after
> the current version of "git gc" removes them from the local object
> store, that they are missing due to repository corruption? After
> all, C3 can reach them so wouldn't it be possible for us to fetch
> them from the promisor remote?
>
> After a lazy clone that omits a lot of objects acquires many objects
> over time by fetching missing objects on demand, wouldn't we want to
> have an option to "slim" the local repository by discarding some of
> these objects (the ones that are least frequently used), relying on
> the promise by the promisor remote that even if we did so, they can
> be fetched again? Can we treat loss of C2a/C2b/C2 as if such a
> feature prematurely kicked in? Or are we failing to refetch them
> for some reason?
In a blobless clone, we expect commits and trees to be present in repo.
If C2/T2 is missing, commands like "git merge" will complain
"cannot merge unrelated history" and fail. Or commands like "git log" will
try to lazily fetch the commit, but without 'have' negotiation, end up
pulling all the trees and blobs reachable from that commit.
It's possible to minimize the impact of missing commits by adding negotiation
to lazy fetching, but we probably need to adapt code in many places where
we don't do lazy fetching. "git log", "git merge" commit graph etc. it's
no trivia amount of work.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [External] Re: [PATCH 2/2] fetch-pack.c: do not declare local commits as "have" in partial repos
2024-09-23 3:44 ` [External] " 韩仰
2024-09-23 16:21 ` Junio C Hamano
@ 2024-10-02 22:35 ` Calvin Wan
1 sibling, 0 replies; 65+ messages in thread
From: Calvin Wan @ 2024-10-02 22:35 UTC (permalink / raw)
To: 韩仰; +Cc: Calvin Wan, Junio C Hamano, git, jonathantanmy, sokcevic
韩仰 <hanyang.tony@bytedance.com> writes:
> On Sun, Sep 22, 2024 at 2:53 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> > I was hoping to see that the issue can be fixed on the "gc" side,
> > regardless of how the objects enter our repository, but perhaps I am
> > missing something. Isn't it just the matter of collecting C1, C3
> > but not C2? Or to put it another way, if we first create a list of
> > objects to be packed (regardless of whether they are in promisor
> > packs), and then remove the objects that are in promisor packs from
> > the list, and pack the objects still remaining in the list?
>
> I tried to fix the issue on the "gc" side following JTan's suggestion,
> by packing local objects referenced by promisor objects into promisor
> packs. But it turns out the cost for "for each promisor object,
> parse them and try to decide the objects they reference is in local repo"
> is too great. In a test blob:none partial clone repo, the gc would take more
> than one hour in the 2019 MacBook, despite the repo only
> having 17071073 objects. Normally it would take about 30 minutes.
I found that running `time git submodule foreach git <create promisor
pack set>` on Android takes 25 minutes on my machine. Granted this is
single threaded but it's still quite an expensive operation to be doing
on every recursive fetch. If this operation is so expensive, then unless
we can figure out some method that doesn't involve creating a set of
promisor pack objects, solving this during fetch is infeasible.
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v2 0/3] repack: pack everything into promisor packfile in partial repos
2024-08-02 7:31 [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo Han Young
` (7 preceding siblings ...)
2024-10-01 19:17 ` Missing Promisor Objects in Partial Repo Design Doc Calvin Wan
@ 2024-10-08 8:13 ` Han Young
2024-10-08 8:13 ` [PATCH v2 1/3] repack: pack everything into packfile Han Young
` (3 more replies)
2024-10-11 8:24 ` [PATCH v3 " Han Young
2024-10-14 3:25 ` [PATCH v4 " Han Young
10 siblings, 4 replies; 65+ messages in thread
From: Han Young @ 2024-10-08 8:13 UTC (permalink / raw)
To: git; +Cc: calvinwan, jonathantanmy, sokcevic, gitster, phillip.wood123,
Han Young
As suggested by Jonathan[1], there are number of ways to fix this issue.
We have already explored some of them in this thread, and so far none of them
is satisfiable. Calvin and I tried to address the problem from fetch-pack side
and rev-list side. But the fix either consumes too much CPU power or results
in inefficient bandwidth use.
So let's attack the problem from repack side. The goal is to prevent repack
from discarding local objects, previously it is done by carefully
separating promisor objects and normal objects in rev-list.
The implementation is flawed and no solution have been found so far.
Instead, we can get ride of rev-list and just pack everything into promisor
files. This way, no objects would be lost.
By using 'repack everything', repacking requires less work and we are not
using more bandwidth.
Packing local objects into promisor packfiles means that it is no longer
possible to detect if an object is missing due to repository corruption
or because we need to fetch it from a promisor remote.
Promisor objects packing does not benefiting from the history and
path based delta calculation, and GC does not remove unreachable promisor
objects. By packing locally created normal objects into promisor packfile,
normal objects are converted into promisor objects. However, in partial cloned
repos, the number of locally created objects are small compared to promisor
objects. The impact should be negligible.
[1] https://lore.kernel.org/git/20240813004508.2768102-1-jonathantanmy@google.com/
*** Changes since v1 ***
Added tradeoffs in cover letter.
Fixed some partial clone test cases.
Updated partial clone documentation.
Han Young (3):
repack: pack everything into packfile
t0410: adapt tests to repack changes
partial-clone: update doc
Documentation/technical/partial-clone.txt | 16 +-
builtin/repack.c | 257 ++++++++++++----------
t/t0410-partial-clone.sh | 68 +-----
t/t5616-partial-clone.sh | 9 +-
4 files changed, 157 insertions(+), 193 deletions(-)
--
2.46.0
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v2 1/3] repack: pack everything into packfile
2024-10-08 8:13 ` [PATCH v2 0/3] repack: pack everything into promisor packfile in partial repos Han Young
@ 2024-10-08 8:13 ` Han Young
2024-10-08 21:41 ` Calvin Wan
2024-10-08 8:13 ` [PATCH v2 2/3] t0410: adapt tests to repack changes Han Young
` (2 subsequent siblings)
3 siblings, 1 reply; 65+ messages in thread
From: Han Young @ 2024-10-08 8:13 UTC (permalink / raw)
To: git; +Cc: calvinwan, jonathantanmy, sokcevic, gitster, phillip.wood123,
Han Young
In a partial repository, creating a local commit and then fetching
causes the following state to occur:
commit tree blob
C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
|
C2 ---- T2 -- B2 (created locally, in non-promisor pack)
|
C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
During garbage collection, parents of promisor objects are marked as
UNINTERESTING and are subsequently garbage collected. In this case, C2
would be deleted and attempts to access that commit would result in "bad
object" errors (originally reported here[1]).
For partial repos, repacking is done in two steps. We first repack all the
objects in promisor packfile, then repack all the non-promisor objects.
It turns out C2, T2 and B2 are not repacked in either steps, ended up deleted.
We can avoid this by packing everything into a promisor packfile, if the repo
is partial cloned.
[1] https://lore.kernel.org/git/20240802073143.56731-1-hanyang.tony@bytedance.com/
Helped-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Han Young <hanyang.tony@bytedance.com>
---
builtin/repack.c | 257 ++++++++++++++++++++++++++---------------------
1 file changed, 143 insertions(+), 114 deletions(-)
diff --git a/builtin/repack.c b/builtin/repack.c
index cb4420f085..e9e18a31fe 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -321,6 +321,23 @@ static int write_oid(const struct object_id *oid,
return 0;
}
+static int write_loose_oid(const struct object_id *oid,
+ const char *path UNUSED,
+ void *data)
+{
+ struct child_process *cmd = data;
+
+ if (cmd->in == -1) {
+ if (start_command(cmd))
+ die(_("could not start pack-objects to repack promisor objects"));
+ }
+
+ if (write_in_full(cmd->in, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
+ write_in_full(cmd->in, "\n", 1) < 0)
+ die(_("failed to feed promisor objects to pack-objects"));
+ return 0;
+}
+
static struct {
const char *name;
unsigned optional:1;
@@ -370,12 +387,15 @@ static int has_pack_ext(const struct generated_pack_data *data,
BUG("unknown pack extension: '%s'", ext);
}
-static void repack_promisor_objects(const struct pack_objects_args *args,
- struct string_list *names)
+static int repack_promisor_objects(const struct pack_objects_args *args,
+ struct string_list *names,
+ struct string_list *list,
+ int pack_all)
{
struct child_process cmd = CHILD_PROCESS_INIT;
FILE *out;
struct strbuf line = STRBUF_INIT;
+ struct string_list_item *item;
prepare_pack_objects(&cmd, args, packtmp);
cmd.in = -1;
@@ -387,13 +407,19 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
* {type -> existing pack order} ordering when computing deltas instead
* of a {type -> size} ordering, which may produce better deltas.
*/
- for_each_packed_object(write_oid, &cmd,
- FOR_EACH_OBJECT_PROMISOR_ONLY);
+ if (pack_all)
+ for_each_packed_object(write_oid, &cmd, 0);
+ else
+ for_each_string_list_item(item, list) {
+ pack_mark_retained(item);
+ }
+
+ for_each_loose_object(write_loose_oid, &cmd, 0);
if (cmd.in == -1) {
/* No packed objects; cmd was never started */
child_process_clear(&cmd);
- return;
+ return 0;
}
close(cmd.in);
@@ -431,6 +457,7 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
if (finish_command(&cmd))
die(_("could not finish pack-objects to repack promisor objects"));
strbuf_release(&line);
+ return 0;
}
struct pack_geometry {
@@ -1312,8 +1339,7 @@ int cmd_repack(int argc,
strvec_push(&cmd.args, "--reflog");
strvec_push(&cmd.args, "--indexed-objects");
}
- if (repo_has_promisor_remote(the_repository))
- strvec_push(&cmd.args, "--exclude-promisor-objects");
+
if (!write_midx) {
if (write_bitmaps > 0)
strvec_push(&cmd.args, "--write-bitmap-index");
@@ -1323,125 +1349,128 @@ int cmd_repack(int argc,
if (use_delta_islands)
strvec_push(&cmd.args, "--delta-islands");
- if (pack_everything & ALL_INTO_ONE) {
- repack_promisor_objects(&po_args, &names);
-
- if (has_existing_non_kept_packs(&existing) &&
- delete_redundant &&
- !(pack_everything & PACK_CRUFT)) {
- for_each_string_list_item(item, &names) {
- strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
- packtmp_name, item->string);
- }
- if (unpack_unreachable) {
- strvec_pushf(&cmd.args,
- "--unpack-unreachable=%s",
- unpack_unreachable);
- } else if (pack_everything & LOOSEN_UNREACHABLE) {
- strvec_push(&cmd.args,
- "--unpack-unreachable");
- } else if (keep_unreachable) {
- strvec_push(&cmd.args, "--keep-unreachable");
- strvec_push(&cmd.args, "--pack-loose-unreachable");
+ if (repo_has_promisor_remote(the_repository)) {
+ ret = repack_promisor_objects(&po_args, &names,
+ &existing.non_kept_packs, pack_everything & ALL_INTO_ONE);
+ } else {
+ if (pack_everything & ALL_INTO_ONE) {
+ if (has_existing_non_kept_packs(&existing) &&
+ delete_redundant &&
+ !(pack_everything & PACK_CRUFT)) {
+ for_each_string_list_item(item, &names) {
+ strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
+ packtmp_name, item->string);
+ }
+ if (unpack_unreachable) {
+ strvec_pushf(&cmd.args,
+ "--unpack-unreachable=%s",
+ unpack_unreachable);
+ } else if (pack_everything & LOOSEN_UNREACHABLE) {
+ strvec_push(&cmd.args,
+ "--unpack-unreachable");
+ } else if (keep_unreachable) {
+ strvec_push(&cmd.args, "--keep-unreachable");
+ strvec_push(&cmd.args, "--pack-loose-unreachable");
+ }
}
+ } else if (geometry.split_factor) {
+ strvec_push(&cmd.args, "--stdin-packs");
+ strvec_push(&cmd.args, "--unpacked");
+ } else {
+ strvec_push(&cmd.args, "--unpacked");
+ strvec_push(&cmd.args, "--incremental");
}
- } else if (geometry.split_factor) {
- strvec_push(&cmd.args, "--stdin-packs");
- strvec_push(&cmd.args, "--unpacked");
- } else {
- strvec_push(&cmd.args, "--unpacked");
- strvec_push(&cmd.args, "--incremental");
- }
- if (po_args.filter_options.choice)
- strvec_pushf(&cmd.args, "--filter=%s",
- expand_list_objects_filter_spec(&po_args.filter_options));
- else if (filter_to)
- die(_("option '%s' can only be used along with '%s'"), "--filter-to", "--filter");
+ if (po_args.filter_options.choice)
+ strvec_pushf(&cmd.args, "--filter=%s",
+ expand_list_objects_filter_spec(&po_args.filter_options));
+ else if (filter_to)
+ die(_("option '%s' can only be used along with '%s'"), "--filter-to", "--filter");
- if (geometry.split_factor)
- cmd.in = -1;
- else
- cmd.no_stdin = 1;
+ if (geometry.split_factor)
+ cmd.in = -1;
+ else
+ cmd.no_stdin = 1;
- ret = start_command(&cmd);
- if (ret)
- goto cleanup;
+ ret = start_command(&cmd);
+ if (ret)
+ goto cleanup;
- if (geometry.split_factor) {
- FILE *in = xfdopen(cmd.in, "w");
- /*
- * The resulting pack should contain all objects in packs that
- * are going to be rolled up, but exclude objects in packs which
- * are being left alone.
- */
- for (i = 0; i < geometry.split; i++)
- fprintf(in, "%s\n", pack_basename(geometry.pack[i]));
- for (i = geometry.split; i < geometry.pack_nr; i++)
- fprintf(in, "^%s\n", pack_basename(geometry.pack[i]));
- fclose(in);
- }
+ if (geometry.split_factor) {
+ FILE *in = xfdopen(cmd.in, "w");
+ /*
+ * The resulting pack should contain all objects in packs that
+ * are going to be rolled up, but exclude objects in packs which
+ * are being left alone.
+ */
+ for (i = 0; i < geometry.split; i++)
+ fprintf(in, "%s\n", pack_basename(geometry.pack[i]));
+ for (i = geometry.split; i < geometry.pack_nr; i++)
+ fprintf(in, "^%s\n", pack_basename(geometry.pack[i]));
+ fclose(in);
+ }
- ret = finish_pack_objects_cmd(&cmd, &names, 1);
- if (ret)
- goto cleanup;
-
- if (!names.nr && !po_args.quiet)
- printf_ln(_("Nothing new to pack."));
-
- if (pack_everything & PACK_CRUFT) {
- const char *pack_prefix = find_pack_prefix(packdir, packtmp);
-
- if (!cruft_po_args.window)
- cruft_po_args.window = po_args.window;
- if (!cruft_po_args.window_memory)
- cruft_po_args.window_memory = po_args.window_memory;
- if (!cruft_po_args.depth)
- cruft_po_args.depth = po_args.depth;
- if (!cruft_po_args.threads)
- cruft_po_args.threads = po_args.threads;
- if (!cruft_po_args.max_pack_size)
- cruft_po_args.max_pack_size = po_args.max_pack_size;
-
- cruft_po_args.local = po_args.local;
- cruft_po_args.quiet = po_args.quiet;
-
- ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
- cruft_expiration, &names,
- &existing);
+ ret = finish_pack_objects_cmd(&cmd, &names, 1);
if (ret)
goto cleanup;
- if (delete_redundant && expire_to) {
- /*
- * If `--expire-to` is given with `-d`, it's possible
- * that we're about to prune some objects. With cruft
- * packs, pruning is implicit: any objects from existing
- * packs that weren't picked up by new packs are removed
- * when their packs are deleted.
- *
- * Generate an additional cruft pack, with one twist:
- * `names` now includes the name of the cruft pack
- * written in the previous step. So the contents of
- * _this_ cruft pack exclude everything contained in the
- * existing cruft pack (that is, all of the unreachable
- * objects which are no older than
- * `--cruft-expiration`).
- *
- * To make this work, cruft_expiration must become NULL
- * so that this cruft pack doesn't actually prune any
- * objects. If it were non-NULL, this call would always
- * generate an empty pack (since every object not in the
- * cruft pack generated above will have an mtime older
- * than the expiration).
- */
- ret = write_cruft_pack(&cruft_po_args, expire_to,
- pack_prefix,
- NULL,
- &names,
- &existing);
+ if (!names.nr && !po_args.quiet)
+ printf_ln(_("Nothing new to pack."));
+
+ if (pack_everything & PACK_CRUFT) {
+ const char *pack_prefix = find_pack_prefix(packdir, packtmp);
+
+ if (!cruft_po_args.window)
+ cruft_po_args.window = po_args.window;
+ if (!cruft_po_args.window_memory)
+ cruft_po_args.window_memory = po_args.window_memory;
+ if (!cruft_po_args.depth)
+ cruft_po_args.depth = po_args.depth;
+ if (!cruft_po_args.threads)
+ cruft_po_args.threads = po_args.threads;
+ if (!cruft_po_args.max_pack_size)
+ cruft_po_args.max_pack_size = po_args.max_pack_size;
+
+ cruft_po_args.local = po_args.local;
+ cruft_po_args.quiet = po_args.quiet;
+
+ ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
+ cruft_expiration, &names,
+ &existing);
if (ret)
goto cleanup;
+
+ if (delete_redundant && expire_to) {
+ /*
+ * If `--expire-to` is given with `-d`, it's possible
+ * that we're about to prune some objects. With cruft
+ * packs, pruning is implicit: any objects from existing
+ * packs that weren't picked up by new packs are removed
+ * when their packs are deleted.
+ *
+ * Generate an additional cruft pack, with one twist:
+ * `names` now includes the name of the cruft pack
+ * written in the previous step. So the contents of
+ * _this_ cruft pack exclude everything contained in the
+ * existing cruft pack (that is, all of the unreachable
+ * objects which are no older than
+ * `--cruft-expiration`).
+ *
+ * To make this work, cruft_expiration must become NULL
+ * so that this cruft pack doesn't actually prune any
+ * objects. If it were non-NULL, this call would always
+ * generate an empty pack (since every object not in the
+ * cruft pack generated above will have an mtime older
+ * than the expiration).
+ */
+ ret = write_cruft_pack(&cruft_po_args, expire_to,
+ pack_prefix,
+ NULL,
+ &names,
+ &existing);
+ if (ret)
+ goto cleanup;
+ }
}
}
--
2.46.0
^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH v2 2/3] t0410: adapt tests to repack changes
2024-10-08 8:13 ` [PATCH v2 0/3] repack: pack everything into promisor packfile in partial repos Han Young
2024-10-08 8:13 ` [PATCH v2 1/3] repack: pack everything into packfile Han Young
@ 2024-10-08 8:13 ` Han Young
2024-10-08 8:13 ` [PATCH v2 3/3] partial-clone: update doc Han Young
2024-10-08 21:57 ` [PATCH v2 0/3] repack: pack everything into promisor packfile in partial repos Junio C Hamano
3 siblings, 0 replies; 65+ messages in thread
From: Han Young @ 2024-10-08 8:13 UTC (permalink / raw)
To: git; +Cc: calvinwan, jonathantanmy, sokcevic, gitster, phillip.wood123,
Han Young
In the previous commit, we changed how partial repo is cloned.
Adapt tests to these changes. Also check gc does not delete normal
objects too.
Signed-off-by: Han Young <hanyang.tony@bytedance.com>
---
t/t0410-partial-clone.sh | 68 +---------------------------------------
t/t5616-partial-clone.sh | 9 +-----
2 files changed, 2 insertions(+), 75 deletions(-)
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 34bdb3ab1f..c169b47160 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -499,46 +499,6 @@ test_expect_success 'single promisor remote can be re-initialized gracefully' '
git -C repo fetch --filter=blob:none foo
'
-test_expect_success 'gc repacks promisor objects separately from non-promisor objects' '
- rm -rf repo &&
- test_create_repo repo &&
- test_commit -C repo one &&
- test_commit -C repo two &&
-
- TREE_ONE=$(git -C repo rev-parse one^{tree}) &&
- printf "$TREE_ONE\n" | pack_as_from_promisor &&
- TREE_TWO=$(git -C repo rev-parse two^{tree}) &&
- printf "$TREE_TWO\n" | pack_as_from_promisor &&
-
- git -C repo config core.repositoryformatversion 1 &&
- git -C repo config extensions.partialclone "arbitrary string" &&
- git -C repo gc &&
-
- # Ensure that exactly one promisor packfile exists, and that it
- # contains the trees but not the commits
- ls repo/.git/objects/pack/pack-*.promisor >promisorlist &&
- test_line_count = 1 promisorlist &&
- PROMISOR_PACKFILE=$(sed "s/.promisor/.pack/" <promisorlist) &&
- git verify-pack $PROMISOR_PACKFILE -v >out &&
- grep "$TREE_ONE" out &&
- grep "$TREE_TWO" out &&
- ! grep "$(git -C repo rev-parse one)" out &&
- ! grep "$(git -C repo rev-parse two)" out &&
-
- # Remove the promisor packfile and associated files
- rm $(sed "s/.promisor//" <promisorlist).* &&
-
- # Ensure that the single other pack contains the commits, but not the
- # trees
- ls repo/.git/objects/pack/pack-*.pack >packlist &&
- test_line_count = 1 packlist &&
- git verify-pack repo/.git/objects/pack/pack-*.pack -v >out &&
- grep "$(git -C repo rev-parse one)" out &&
- grep "$(git -C repo rev-parse two)" out &&
- ! grep "$TREE_ONE" out &&
- ! grep "$TREE_TWO" out
-'
-
test_expect_success 'gc does not repack promisor objects if there are none' '
rm -rf repo &&
test_create_repo repo &&
@@ -569,7 +529,7 @@ repack_and_check () {
git -C repo2 cat-file -e $3
}
-test_expect_success 'repack -d does not irreversibly delete promisor objects' '
+test_expect_success 'repack -d does not irreversibly delete objects' '
rm -rf repo &&
test_create_repo repo &&
git -C repo config core.repositoryformatversion 1 &&
@@ -583,40 +543,14 @@ test_expect_success 'repack -d does not irreversibly delete promisor objects' '
TWO=$(git -C repo rev-parse HEAD^^) &&
THREE=$(git -C repo rev-parse HEAD^) &&
- printf "$TWO\n" | pack_as_from_promisor &&
printf "$THREE\n" | pack_as_from_promisor &&
delete_object repo "$ONE" &&
- repack_and_check --must-fail -ab "$TWO" "$THREE" &&
repack_and_check -a "$TWO" "$THREE" &&
repack_and_check -A "$TWO" "$THREE" &&
repack_and_check -l "$TWO" "$THREE"
'
-test_expect_success 'gc stops traversal when a missing but promised object is reached' '
- rm -rf repo &&
- test_create_repo repo &&
- test_commit -C repo my_commit &&
-
- TREE_HASH=$(git -C repo rev-parse HEAD^{tree}) &&
- HASH=$(promise_and_delete $TREE_HASH) &&
-
- git -C repo config core.repositoryformatversion 1 &&
- git -C repo config extensions.partialclone "arbitrary string" &&
- git -C repo gc &&
-
- # Ensure that the promisor packfile still exists, and remove it
- test -e repo/.git/objects/pack/pack-$HASH.pack &&
- rm repo/.git/objects/pack/pack-$HASH.* &&
-
- # Ensure that the single other pack contains the commit, but not the tree
- ls repo/.git/objects/pack/pack-*.pack >packlist &&
- test_line_count = 1 packlist &&
- git verify-pack repo/.git/objects/pack/pack-*.pack -v >out &&
- grep "$(git -C repo rev-parse HEAD)" out &&
- ! grep "$TREE_HASH" out
-'
-
test_expect_success 'do not fetch when checking existence of tree we construct ourselves' '
rm -rf repo &&
test_create_repo repo &&
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index c53e93be2f..2c6f10026f 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -643,16 +643,9 @@ test_expect_success 'fetch from a partial clone, protocol v2' '
grep "version 2" trace
'
-test_expect_success 'repack does not loosen promisor objects' '
- rm -rf client trace &&
- git clone --bare --filter=blob:none "file://$(pwd)/srv.bare" client &&
- test_when_finished "rm -rf client trace" &&
- GIT_TRACE2_PERF="$(pwd)/trace" git -C client repack -A -d &&
- grep "loosen_unused_packed_objects/loosened:0" trace
-'
-
test_expect_success 'lazy-fetch in submodule succeeds' '
# setup
+ rm -rf client &&
test_config_global protocol.file.allow always &&
test_when_finished "rm -rf src-sub" &&
--
2.46.0
^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH v2 3/3] partial-clone: update doc
2024-10-08 8:13 ` [PATCH v2 0/3] repack: pack everything into promisor packfile in partial repos Han Young
2024-10-08 8:13 ` [PATCH v2 1/3] repack: pack everything into packfile Han Young
2024-10-08 8:13 ` [PATCH v2 2/3] t0410: adapt tests to repack changes Han Young
@ 2024-10-08 8:13 ` Han Young
2024-10-08 21:57 ` [PATCH v2 0/3] repack: pack everything into promisor packfile in partial repos Junio C Hamano
3 siblings, 0 replies; 65+ messages in thread
From: Han Young @ 2024-10-08 8:13 UTC (permalink / raw)
To: git; +Cc: calvinwan, jonathantanmy, sokcevic, gitster, phillip.wood123,
Han Young
Document new repack behavior for partial repo
Signed-off-by: Han Young <hanyang.tony@bytedance.com>
---
Documentation/technical/partial-clone.txt | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/Documentation/technical/partial-clone.txt b/Documentation/technical/partial-clone.txt
index cd948b0072..9791c9ac24 100644
--- a/Documentation/technical/partial-clone.txt
+++ b/Documentation/technical/partial-clone.txt
@@ -124,6 +124,10 @@ their "<name>.pack" and "<name>.idx" files.
+
When Git encounters a missing object, Git can see if it is a promisor object
and handle it appropriately. If not, Git can report a corruption.
+
+To prevent `repack` from removing locally created objects, `repack` packs all
+the objects into one promisor packfile. It's no longer possible to determine
+the cause of missing objects after `gc`.[7]
+
This means that there is no need for the client to explicitly maintain an
expensive-to-modify list of missing objects.[a]
@@ -156,8 +160,9 @@ and prefetch those objects in bulk.
- `fsck` has been updated to be fully aware of promisor objects.
-- `repack` in GC has been updated to not touch promisor packfiles at all,
- and to only repack other objects.
+- `repack` in GC has been taught to handle partial clone repo differently.
+ `repack` will pack every objects into one promisor packfile for partial
+ repos.
- The global variable "fetch_if_missing" is used to control whether an
object lookup will attempt to dynamically fetch a missing object or
@@ -244,8 +249,7 @@ remote in a specific order.
objects. We assume that promisor remotes have a complete view of the
repository and can satisfy all such requests.
-- Repack essentially treats promisor and non-promisor packfiles as 2
- distinct partitions and does not mix them.
+- It's not possible to discard dangling objects in repack.
- Dynamic object fetching invokes fetch-pack once *for each item*
because most algorithms stumble upon a missing object and need to have
@@ -365,3 +369,7 @@ Related Links
[6] https://lore.kernel.org/git/20170714132651.170708-1-benpeart@microsoft.com/ +
Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand +
Date: Fri, 14 Jul 2017 09:26:50 -0400
+
+[7] https://lore.kernel.org/git/20240802073143.56731-1-hanyang.tony@bytedance.com/ +
+ Subject: [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo +
+ Date: Fri, 2 Aug 2024 15:31:42 +0800
--
2.46.0
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: Missing Promisor Objects in Partial Repo Design Doc
2024-10-02 2:54 ` Junio C Hamano
2024-10-02 7:57 ` [External] " Han Young
@ 2024-10-08 21:35 ` Calvin Wan
2024-10-09 6:46 ` [External] " Han Young
2024-10-09 18:53 ` Jonathan Tan
2 siblings, 1 reply; 65+ messages in thread
From: Calvin Wan @ 2024-10-08 21:35 UTC (permalink / raw)
To: Junio C Hamano
Cc: Christian Couder, Han Young, git, Jonathan Tan, Phillip Wood,
Enrico Mrass, sokcevic
On Tue, Oct 1, 2024 at 7:54 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> True. Will it become even worse, if a protocol extension Christian
> proposes starts suggesting a repository that is not lazy to add a
> promisor remote? In such a set-up, perhaps all history leading to
> C2b down to the root are local, but C3 may have come from a promisor
> remote (hence in a promisor pack).
Yes if we and consequently Git considers this state to be problematic.
> > Bad State Solutions
> > ===================
> >
> > Fetch negotiation
> > -----------------
> > Implemented at
> > https://lore.kernel.org/git/20240919234741.1317946-1-calvinwan@google.com/
> >
> > During fetch negotiation, if a commit is not in a promisor pack and
> > therefore local, do not declare it as "have" so they can be fetched into
> > a promisor pack.
> >
> > Cost:
> > - Creation of set of promisor pack objects (by iterating through every
> > .idx of promisor packs)
>
> What is "promisor PACK objects"? Is it different from the "promisor
> objects" (i.e. what I called the useless definition above)?
Objects that are in promisor packs, specifically the ones that have the
flag, packed_git::pack_promisor, set. However, since this design doc
was sent out, it turns out the creation of a set of promisor pack objects
in a large repository (such as Android or Chrome) is very expensive, so
this design is infeasible in my opinion.
>
> > - Refetch number of local commits
> >
> > Pros: Implementation is simple, client doesn’t have to repack, prevents
> > state from ever occurring in the repository.
> >
> > Cons: Network cost of refetching could be high if many local commits
> > need to be refetched.
>
> What if we get into the same state by creating local C4, which gets
> to outside and on top of which C5 is built, which is now sitting at
> the tip of the remote history and we fetch from them? In order to
> include C4 in the "promisor pack", we refrain from saying C4 is a
> "have" for us and refetch. Would C2 be fetched again?
>
> I do not think C2 would be, because we made it an object in a
> promisor pack when we "fixed" the history for C3.
>
> So the cost will not grow proportionally to the depth of the
> history, which makes it OK from my point of view.
Correct, the cost of refetching is only a one time cost, but
unfortunately creation of a set of promisor pack objects isn't.
>
> > Garbage Collection repack
> > -------------------------
> > Not yet implemented.
> >
> > Same concept at “fetch repack”, but happens during garbage collection
> > instead. The traversal is more expensive since we no longer have access
> > to what was recently fetched so we have to traverse through all promisor
> > packs to collect tips of “bad” history.
>
> In other words, with the status quo, "git gc" that attempts to
> repack "objects in promisor packs" and "other objects that did not
> get repacked in the step that repack objects in promisor packs"
> separately, it implements the latter in a buggy way and discards
> some objects. And fixing that bug by doing the right thing is
> expensive.
>
> Stepping back a bit, why is the loss of C2a/C2b/C2 a problem after
> "git gc"? Wouldn't these "missing" objects be lazily fetchable, now
> C3 is known to the remote and the remote promises everything
> reachable from what they offer are (re)fetchable from them? IOW, is
> this a correctness issue, or only performance issue (of having to
> re-fetch what we once locally had)?
My first thought is that from both the user and developer perspective,
we don't expect our reachable objects to be gc'ed. So all of the "bad
state" solutions work to ensure that that isn't the case in some way or
form. However, if it turns out that all of these solutions are much more
expensive and disruptive to the user than accepting that local objects
can be gc'ed and JIT refetching, then the latter seems much more
palatable. It is inevitable that we take some performance hit to fix this
problem and we may just have to accept this as one of the costs of
having partial clones to begin with.
>
> > Cons: Packing local objects into promisor packs means that it is no
> > longer possible to detect if an object is missing due to repository
> > corruption or because we need to fetch it from a promisor remote.
>
> Is this true? Can we tell, when trying to access C2a/C2b/C2 after
> the current version of "git gc" removes them from the local object
> store, that they are missing due to repository corruption? After
> all, C3 can reach them so wouldn't it be possible for us to fetch
> them from the promisor remote?
I should be more clear that "detecting if an object is missing due to
repository corruption" refers to fsck currently not having the
functionality to do that. We are "accidentally" discovering the
corruption when we try to access the missing object, but we can
still fetch them from the promisor remote afterwards.
> After a lazy clone that omits a lot of objects acquires many objects
> over time by fetching missing objects on demand, wouldn't we want to
> have an option to "slim" the local repository by discarding some of
> these objects (the ones that are least frequently used), relying on
> the promise by the promisor remote that even if we did so, they can
> be fetched again? Can we treat loss of C2a/C2b/C2 as if such a
> feature prematurely kicked in? Or are we failing to refetch them
> for some reason?
Yes if such a feature existed, then it would be feasible and a possible
solution for this issue (I'm leaning quite towards this now after testing
out some of the other designs).
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 1/3] repack: pack everything into packfile
2024-10-08 8:13 ` [PATCH v2 1/3] repack: pack everything into packfile Han Young
@ 2024-10-08 21:41 ` Calvin Wan
0 siblings, 0 replies; 65+ messages in thread
From: Calvin Wan @ 2024-10-08 21:41 UTC (permalink / raw)
To: Han Young; +Cc: git, jonathantanmy, sokcevic, gitster, phillip.wood123
On Tue, Oct 8, 2024 at 1:14 AM Han Young <hanyang.tony@bytedance.com> wrote:
>
> In a partial repository, creating a local commit and then fetching
> causes the following state to occur:
>
> commit tree blob
> C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
> |
> C2 ---- T2 -- B2 (created locally, in non-promisor pack)
> |
> C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
>
> During garbage collection, parents of promisor objects are marked as
> UNINTERESTING and are subsequently garbage collected. In this case, C2
> would be deleted and attempts to access that commit would result in "bad
> object" errors (originally reported here[1]).
>
> For partial repos, repacking is done in two steps. We first repack all the
> objects in promisor packfile, then repack all the non-promisor objects.
> It turns out C2, T2 and B2 are not repacked in either steps, ended up deleted.
> We can avoid this by packing everything into a promisor packfile, if the repo
> is partial cloned.
>
> [1] https://lore.kernel.org/git/20240802073143.56731-1-hanyang.tony@bytedance.com/
>
> Helped-by: Calvin Wan <calvinwan@google.com>
> Signed-off-by: Han Young <hanyang.tony@bytedance.com>
> ---
> builtin/repack.c | 257 ++++++++++++++++++++++++++---------------------
> 1 file changed, 143 insertions(+), 114 deletions(-)
>
> diff --git a/builtin/repack.c b/builtin/repack.c
> index cb4420f085..e9e18a31fe 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -321,6 +321,23 @@ static int write_oid(const struct object_id *oid,
> return 0;
> }
>
> +static int write_loose_oid(const struct object_id *oid,
> + const char *path UNUSED,
> + void *data)
> +{
> + struct child_process *cmd = data;
> +
> + if (cmd->in == -1) {
> + if (start_command(cmd))
> + die(_("could not start pack-objects to repack promisor objects"));
> + }
> +
> + if (write_in_full(cmd->in, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
> + write_in_full(cmd->in, "\n", 1) < 0)
> + die(_("failed to feed promisor objects to pack-objects"));
> + return 0;
> +}
> +
> static struct {
> const char *name;
> unsigned optional:1;
> @@ -370,12 +387,15 @@ static int has_pack_ext(const struct generated_pack_data *data,
> BUG("unknown pack extension: '%s'", ext);
> }
>
> -static void repack_promisor_objects(const struct pack_objects_args *args,
> - struct string_list *names)
> +static int repack_promisor_objects(const struct pack_objects_args *args,
> + struct string_list *names,
> + struct string_list *list,
> + int pack_all)
> {
> struct child_process cmd = CHILD_PROCESS_INIT;
> FILE *out;
> struct strbuf line = STRBUF_INIT;
> + struct string_list_item *item;
>
> prepare_pack_objects(&cmd, args, packtmp);
> cmd.in = -1;
> @@ -387,13 +407,19 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
> * {type -> existing pack order} ordering when computing deltas instead
> * of a {type -> size} ordering, which may produce better deltas.
> */
> - for_each_packed_object(write_oid, &cmd,
> - FOR_EACH_OBJECT_PROMISOR_ONLY);
> + if (pack_all)
> + for_each_packed_object(write_oid, &cmd, 0);
> + else
> + for_each_string_list_item(item, list) {
> + pack_mark_retained(item);
> + }
> +
> + for_each_loose_object(write_loose_oid, &cmd, 0);
>
> if (cmd.in == -1) {
> /* No packed objects; cmd was never started */
> child_process_clear(&cmd);
> - return;
> + return 0;
> }
>
> close(cmd.in);
> @@ -431,6 +457,7 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
> if (finish_command(&cmd))
> die(_("could not finish pack-objects to repack promisor objects"));
> strbuf_release(&line);
> + return 0;
> }
>
> struct pack_geometry {
> @@ -1312,8 +1339,7 @@ int cmd_repack(int argc,
> strvec_push(&cmd.args, "--reflog");
> strvec_push(&cmd.args, "--indexed-objects");
> }
> - if (repo_has_promisor_remote(the_repository))
> - strvec_push(&cmd.args, "--exclude-promisor-objects");
> +
> if (!write_midx) {
> if (write_bitmaps > 0)
> strvec_push(&cmd.args, "--write-bitmap-index");
> @@ -1323,125 +1349,128 @@ int cmd_repack(int argc,
> if (use_delta_islands)
> strvec_push(&cmd.args, "--delta-islands");
>
> - if (pack_everything & ALL_INTO_ONE) {
> - repack_promisor_objects(&po_args, &names);
> -
> - if (has_existing_non_kept_packs(&existing) &&
> - delete_redundant &&
> - !(pack_everything & PACK_CRUFT)) {
> - for_each_string_list_item(item, &names) {
> - strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
> - packtmp_name, item->string);
> - }
> - if (unpack_unreachable) {
> - strvec_pushf(&cmd.args,
> - "--unpack-unreachable=%s",
> - unpack_unreachable);
> - } else if (pack_everything & LOOSEN_UNREACHABLE) {
> - strvec_push(&cmd.args,
> - "--unpack-unreachable");
> - } else if (keep_unreachable) {
> - strvec_push(&cmd.args, "--keep-unreachable");
> - strvec_push(&cmd.args, "--pack-loose-unreachable");
> + if (repo_has_promisor_remote(the_repository)) {
> + ret = repack_promisor_objects(&po_args, &names,
> + &existing.non_kept_packs, pack_everything & ALL_INTO_ONE);
Using a goto jump would be easier to both read your patch and
remove the need to indent this entire code block.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 0/3] repack: pack everything into promisor packfile in partial repos
2024-10-08 8:13 ` [PATCH v2 0/3] repack: pack everything into promisor packfile in partial repos Han Young
` (2 preceding siblings ...)
2024-10-08 8:13 ` [PATCH v2 3/3] partial-clone: update doc Han Young
@ 2024-10-08 21:57 ` Junio C Hamano
2024-10-08 22:43 ` Junio C Hamano
2024-10-09 6:31 ` [External] " Han Young
3 siblings, 2 replies; 65+ messages in thread
From: Junio C Hamano @ 2024-10-08 21:57 UTC (permalink / raw)
To: Han Young; +Cc: git, calvinwan, jonathantanmy, sokcevic, phillip.wood123
Han Young <hanyang.tony@bytedance.com> writes:
> As suggested by Jonathan[1], there are number of ways to fix this issue.
> We have already explored some of them in this thread, and so far none of them
> is satisfiable. Calvin and I tried to address the problem from fetch-pack side
> and rev-list side. But the fix either consumes too much CPU power or results
> in inefficient bandwidth use.
>
> So let's attack the problem from repack side. The goal is to prevent repack
> from discarding local objects, previously it is done by carefully
> separating promisor objects and normal objects in rev-list.
> The implementation is flawed and no solution have been found so far.
> Instead, we can get ride of rev-list and just pack everything into promisor
> files. This way, no objects would be lost.
>
> By using 'repack everything', repacking requires less work and we are not
> using more bandwidth.
OK, perhaps.
> Packing local objects into promisor packfiles means that it is no longer
> possible to detect if an object is missing due to repository corruption
> or because we need to fetch it from a promisor remote.
Is it true that without doing this, we can tell between these two
cases, though? More importantly, even if it is true, would there be
a practical difference?
In the sample scenario used in [1/3] where you created C2/T2/B2 on
top of C1/T1/B1 (which came from a promisor remote), somebody else
built C3/T3/B3 on top, and it came back from the promisor remote,
you could lose 3's objects and 1's objects and they can be refetched
but even if you lose 2's objects, since 3's objects are building on
top of them, you should be able to fetch them from the promisor
remote just like objects from 1 and 3, no? So strictly speaking,
missing 2's objects may be "repository corruption" while missing 1's
and 3's objects may not be, would there be a practical use for that
information?
> Promisor objects packing does not benefiting from the history and
> path based delta calculation, and GC does not remove unreachable promisor
> objects. By packing locally created normal objects into promisor packfile,
> normal objects are converted into promisor objects. However, in partial cloned
> repos, the number of locally created objects are small compared to promisor
> objects. The impact should be negligible.
> [1] https://lore.kernel.org/git/20240813004508.2768102-1-jonathantanmy@google.com/
>
> *** Changes since v1 ***
> Added tradeoffs in cover letter.
> Fixed some partial clone test cases.
> Updated partial clone documentation.
These patches are based on the tip of master before 365529e1 (Merge
branch 'ps/leakfixes-part-7', 2024-10-02), which will give mildly
annoying conflicts when merged to 'seen'.
I've managed to apply and then merge, so unless review discussions
find needs for updates, there is no need for immediate reroll, but
if you end up having to update these patches, it is a good idea to
rebase the topic on top of v2.47.0 that was released early this
week, as we are now entering a new development cycle.
Thanks.
>
> Han Young (3):
> repack: pack everything into packfile
> t0410: adapt tests to repack changes
> partial-clone: update doc
>
> Documentation/technical/partial-clone.txt | 16 +-
> builtin/repack.c | 257 ++++++++++++----------
> t/t0410-partial-clone.sh | 68 +-----
> t/t5616-partial-clone.sh | 9 +-
> 4 files changed, 157 insertions(+), 193 deletions(-)
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 0/3] repack: pack everything into promisor packfile in partial repos
2024-10-08 21:57 ` [PATCH v2 0/3] repack: pack everything into promisor packfile in partial repos Junio C Hamano
@ 2024-10-08 22:43 ` Junio C Hamano
2024-10-09 6:31 ` [External] " Han Young
1 sibling, 0 replies; 65+ messages in thread
From: Junio C Hamano @ 2024-10-08 22:43 UTC (permalink / raw)
To: Han Young; +Cc: git, calvinwan, jonathantanmy, sokcevic, phillip.wood123
Junio C Hamano <gitster@pobox.com> writes:
> I've managed to apply and then merge, so unless review discussions
> find needs for updates, there is no need for immediate reroll, but
> if you end up having to update these patches, it is a good idea to
> rebase the topic on top of v2.47.0 that was released early this
> week, as we are now entering a new development cycle.
When merged to the tip of 'seen', it seems to break t5710. It might
be due to mismerge, but can you check on your end?
Thanks.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [External] Re: [PATCH v2 0/3] repack: pack everything into promisor packfile in partial repos
2024-10-08 21:57 ` [PATCH v2 0/3] repack: pack everything into promisor packfile in partial repos Junio C Hamano
2024-10-08 22:43 ` Junio C Hamano
@ 2024-10-09 6:31 ` Han Young
1 sibling, 0 replies; 65+ messages in thread
From: Han Young @ 2024-10-09 6:31 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, calvinwan, jonathantanmy, sokcevic, phillip.wood123
On Wed, Oct 9, 2024 at 5:57 AM Junio C Hamano <gitster@pobox.com> wrote:
> > Packing local objects into promisor packfiles means that it is no longer
> > possible to detect if an object is missing due to repository corruption
> > or because we need to fetch it from a promisor remote.
>
> Is it true that without doing this, we can tell between these two
> cases, though? More importantly, even if it is true, would there be
> a practical difference?
>
> In the sample scenario used in [1/3] where you created C2/T2/B2 on
> top of C1/T1/B1 (which came from a promisor remote), somebody else
> built C3/T3/B3 on top, and it came back from the promisor remote,
> you could lose 3's objects and 1's objects and they can be refetched
> but even if you lose 2's objects, since 3's objects are building on
> top of them, you should be able to fetch them from the promisor
> remote just like objects from 1 and 3, no? So strictly speaking,
> missing 2's objects may be "repository corruption" while missing 1's
> and 3's objects may not be, would there be a practical use for that
> information?
Some code path does check if the missing object is promisor object before
lazy fetching, `--missing=` does this check.
But in that case, C2 is also a promisor object, the check would pass.
There are no partial clone filter that omits commits, missing commit will
always result in error. And even if we do report "repository corruption",
the best course of action is still try to fetching them.
So, no. I don't think there are practical uses for that information
> These patches are based on the tip of master before 365529e1 (Merge
> branch 'ps/leakfixes-part-7', 2024-10-02), which will give mildly
> annoying conflicts when merged to 'seen'.
>
> I've managed to apply and then merge, so unless review discussions
> find needs for updates, there is no need for immediate reroll, but
> if you end up having to update these patches, it is a good idea to
> rebase the topic on top of v2.47.0 that was released early this
> week, as we are now entering a new development cycle.
Thanks, I will rebase to master and see if any other tests break.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [External] Re: Missing Promisor Objects in Partial Repo Design Doc
2024-10-08 21:35 ` Calvin Wan
@ 2024-10-09 6:46 ` Han Young
2024-10-09 18:34 ` Jonathan Tan
0 siblings, 1 reply; 65+ messages in thread
From: Han Young @ 2024-10-09 6:46 UTC (permalink / raw)
To: Calvin Wan
Cc: Junio C Hamano, Christian Couder, git, Jonathan Tan, Phillip Wood,
Enrico Mrass, sokcevic
On Wed, Oct 9, 2024 at 5:36 AM Calvin Wan <calvinwan@google.com> wrote:
> Objects that are in promisor packs, specifically the ones that have the
> flag, packed_git::pack_promisor, set. However, since this design doc
> was sent out, it turns out the creation of a set of promisor pack objects
> in a large repository (such as Android or Chrome) is very expensive, so
> this design is infeasible in my opinion.
I wonder if a set of local loose/pack objects will be cheaper to construct?
Normally loose objects are always non-promisor objects, unless the user
running something like `unpack-objects`.
> > After a lazy clone that omits a lot of objects acquires many objects
> > over time by fetching missing objects on demand, wouldn't we want to
> > have an option to "slim" the local repository by discarding some of
> > these objects (the ones that are least frequently used), relying on
> > the promise by the promisor remote that even if we did so, they can
> > be fetched again? Can we treat loss of C2a/C2b/C2 as if such a
> > feature prematurely kicked in? Or are we failing to refetch them
> > for some reason?
>
> Yes if such a feature existed, then it would be feasible and a possible
> solution for this issue (I'm leaning quite towards this now after testing
> out some of the other designs).
Since no partial clone filter omits commit objects, we always assume
commits are available in the codebase. `merge` reports "cannot merge
unrelated history" if one of the commits is missing, instead of trying to
fetch it.
Another problem is current lazy fetching code does not report "haves"
to remote, so a lazy fetching of commit ended up pulling all the trees,
blobs associated with that commit.
I also prefer the "fetching the missing objects" approach, making sure
the repo has all the "correct" objects is difficult to get right.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [External] Re: Missing Promisor Objects in Partial Repo Design Doc
2024-10-09 6:46 ` [External] " Han Young
@ 2024-10-09 18:34 ` Jonathan Tan
2024-10-12 2:05 ` Jonathan Tan
0 siblings, 1 reply; 65+ messages in thread
From: Jonathan Tan @ 2024-10-09 18:34 UTC (permalink / raw)
To: Han Young
Cc: Jonathan Tan, Calvin Wan, Junio C Hamano, Christian Couder, git,
Phillip Wood, Enrico Mrass, sokcevic
Han Young <hanyang.tony@bytedance.com> writes:
> On Wed, Oct 9, 2024 at 5:36 AM Calvin Wan <calvinwan@google.com> wrote:
>
> > Objects that are in promisor packs, specifically the ones that have the
> > flag, packed_git::pack_promisor, set. However, since this design doc
> > was sent out, it turns out the creation of a set of promisor pack objects
> > in a large repository (such as Android or Chrome) is very expensive, so
> > this design is infeasible in my opinion.
>
> I wonder if a set of local loose/pack objects will be cheaper to construct?
> Normally loose objects are always non-promisor objects, unless the user
> running something like `unpack-objects`.
We had a similar idea at $JOB. Note that you don't actually
need to create the set - when looking up an object using
oid_object_info_extended(), we know if it's a loose object and if not,
which pack it is in. The pack has a promisor bit that we can check.
Note that there is a possibility of a false positive. If the same object
is in two packs - one promisor and one non-promisor - I believe there's
no guarantee that one pack will be preferred. So we can see that the
object is in a non-promisor pack, but there's no guarantee that it's not
also in a promisor pack. For the omit-local-commits-in-"have" solution,
this is a fatal flaw (we absolutely must guarantee that we don't send
any promisor commits) but for the repack-on-fetch solution, this is no
big deal (we are looking for objects to repack into a promisor pack, and
repacking a promisor object into a promisor pack is perfectly file). For
this reason, I think the repack-on-fetch solution is the most promising
one so far.
Loose objects are always non-promisor objects, yes. (I don't think the
user running `unpack-objects` counts - the user running a command on a
packfile in the .git directory is out of scope, I think.)
> > > After a lazy clone that omits a lot of objects acquires many objects
> > > over time by fetching missing objects on demand, wouldn't we want to
> > > have an option to "slim" the local repository by discarding some of
> > > these objects (the ones that are least frequently used), relying on
> > > the promise by the promisor remote that even if we did so, they can
> > > be fetched again? Can we treat loss of C2a/C2b/C2 as if such a
> > > feature prematurely kicked in? Or are we failing to refetch them
> > > for some reason?
> >
> > Yes if such a feature existed, then it would be feasible and a possible
> > solution for this issue (I'm leaning quite towards this now after testing
> > out some of the other designs).
>
> Since no partial clone filter omits commit objects, we always assume
> commits are available in the codebase. `merge` reports "cannot merge
> unrelated history" if one of the commits is missing, instead of trying to
> fetch it.
> Another problem is current lazy fetching code does not report "haves"
> to remote, so a lazy fetching of commit ended up pulling all the trees,
> blobs associated with that commit.
> I also prefer the "fetching the missing objects" approach, making sure
> the repo has all the "correct" objects is difficult to get right.
If I remember correctly, our intention (or, at least, my intention)
of not treating missing commits differently was so that we don't limit
the solutions that we can implement. For example, we had the idea of
server-assisted merge base computation - this and other features would
make it feasible to omit commits locally. It has not been implemented,
though.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Missing Promisor Objects in Partial Repo Design Doc
2024-10-02 2:54 ` Junio C Hamano
2024-10-02 7:57 ` [External] " Han Young
2024-10-08 21:35 ` Calvin Wan
@ 2024-10-09 18:53 ` Jonathan Tan
2 siblings, 0 replies; 65+ messages in thread
From: Jonathan Tan @ 2024-10-09 18:53 UTC (permalink / raw)
To: Junio C Hamano
Cc: Jonathan Tan, Calvin Wan, Christian Couder, Han Young, git,
Phillip Wood, Enrico Mrass, sokcevic
Junio C Hamano <gitster@pobox.com> writes:
> > (C2b is a bit of a special case. Despite not being in a promisor pack,
> > it is still considered to be a promisor object since C3 directly
> > references it.)
>
> Yes, and I suspect the root cause of this confusion is because
> "promisor object", as defined today, is a flawed concept. If C2b
> were pointed by a local ref, just like the case the ref points at
> C2a, they should be treated the same way, as both of them are
> locally created. To put it another way, presumably the local have
> already been pushed out to elsewhere and the promisor remote got
> hold of them, and that is why C3 can build on top of them. And the
> fact C2b is directly reachable from C3 and C2a is not should not
> have any relevance if C2a or C2b are not _included_ in promisor
> packs (hence both of them need to be included in the local pack).
>
> Two concepts that would have been useful are (1) objects that are in
> promisor packs and (2) objects that are reachable from an object
> that is in a promisor pack. I do not see how the current definition
> of "promisor objects" (i.e. in a promisor pack, or one hop from an
> object in a promisor pack) is useful in any context.
The one-hop part in the current definition is meant to (a) explain what
objects the client knows the remote has (in theory the client has no
knowledge of objects beyond the first hop, but we now know this theory
to not be true) and (b) explain what objects a non-promisor object can
reference (in particular, a non-promisor tree can reference promisor
blobs, even when our knowledge of that promisor blob only comes from a
tree in a promisor pack).
If we think that a promisor commit being a child of a non-promisor
commit as a "bad state" that needs to be fixed [1], then the one-hop
current definition seems to be equivalent to (2).
As for (1), we do use that concept in Git, although it's limited to the
repack during GC (or maybe there are others that I don't recall), so the
concept doesn't have a widely-used name like "promisor object".
[1] https://lore.kernel.org/git/20241001191811.1934900-1-calvinwan@google.com/
> > Garbage Collection repack
> > -------------------------
> > Not yet implemented.
> >
> > Same concept at “fetch repack”, but happens during garbage collection
> > instead. The traversal is more expensive since we no longer have access
> > to what was recently fetched so we have to traverse through all promisor
> > packs to collect tips of “bad” history.
>
> In other words, with the status quo, "git gc" that attempts to
> repack "objects in promisor packs" and "other objects that did not
> get repacked in the step that repack objects in promisor packs"
> separately, it implements the latter in a buggy way and discards
> some objects. And fixing that bug by doing the right thing is
> expensive.
>
> Stepping back a bit, why is the loss of C2a/C2b/C2 a problem after
> "git gc"? Wouldn't these "missing" objects be lazily fetchable, now
> C3 is known to the remote and the remote promises everything
> reachable from what they offer are (re)fetchable from them? IOW, is
> this a correctness issue, or only performance issue (of having to
> re-fetch what we once locally had)?
I believe the re-fetch didn't happen because it was run from a command
with fetch_if_missing=0. (But even if we decide that we shouldn't use
fetch_if_missing, and then change all commands to not use it, there
still remains the performance issue, so we should still fix it.)
> > Cons: Packing local objects into promisor packs means that it is no
> > longer possible to detect if an object is missing due to repository
> > corruption or because we need to fetch it from a promisor remote.
>
> Is this true? Can we tell, when trying to access C2a/C2b/C2 after
> the current version of "git gc" removes them from the local object
> store, that they are missing due to repository corruption? After
> all, C3 can reach them so wouldn't it be possible for us to fetch
> them from the promisor remote?
>
> After a lazy clone that omits a lot of objects acquires many objects
> over time by fetching missing objects on demand, wouldn't we want to
> have an option to "slim" the local repository by discarding some of
> these objects (the ones that are least frequently used), relying on
> the promise by the promisor remote that even if we did so, they can
> be fetched again? Can we treat loss of C2a/C2b/C2 as if such a
> feature prematurely kicked in? Or are we failing to refetch them
> for some reason?
This is under the "repack all" option, which states that we repack all
objects (wherever they came from) into promisor packs. If we locally
created commit A and then its child commit B, and the repo got corrupted
so that we lost A, repacking all objects would mean that we could never
detect that the loss of A is problematic.
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v3 0/3] repack: pack everything into promisor packfile in partial repos
2024-08-02 7:31 [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo Han Young
` (8 preceding siblings ...)
2024-10-08 8:13 ` [PATCH v2 0/3] repack: pack everything into promisor packfile in partial repos Han Young
@ 2024-10-11 8:24 ` Han Young
2024-10-11 8:24 ` [PATCH v3 1/3] repack: pack everything into packfile Han Young
` (3 more replies)
2024-10-14 3:25 ` [PATCH v4 " Han Young
10 siblings, 4 replies; 65+ messages in thread
From: Han Young @ 2024-10-11 8:24 UTC (permalink / raw)
To: git; +Cc: calvinwan, jonathantanmy, sokcevic, gitster, phillip.wood123,
Han Young
Changes since v2:
- rebased to seen: 89afaf27d3 (Merge branch 'ak/typofixes' into seen, 2024-10-10)
- fixed t5710, "repack all" affects how the test repo is initialized
- use goto to skip normal repack
This series doesn't address the underlying problem with promisor objects,
but rather mitigates the "repack removes local objects" problem.
Until a satisfiable solution can be found[1], this would at least prevent
more promisor repos from becoming corrupted.
[1] https://lore.kernel.org/git/20241001191811.1934900-1-calvinwan@google.com/
Han Young (3):
repack: pack everything into packfile
repack: adapt tests to repack changes
partial-clone: update doc
Documentation/technical/partial-clone.txt | 16 ++++--
builtin/repack.c | 46 ++++++++++++---
t/t0410-partial-clone.sh | 68 +----------------------
t/t5616-partial-clone.sh | 9 +--
t/t5710-promisor-remote-capability.sh | 15 ++++-
5 files changed, 65 insertions(+), 89 deletions(-)
--
2.47.0.266.g0b04b6b485.dirty
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v3 1/3] repack: pack everything into packfile
2024-10-11 8:24 ` [PATCH v3 " Han Young
@ 2024-10-11 8:24 ` Han Young
2024-10-11 8:24 ` [PATCH v3 2/3] repack: adapt tests to repack changes Han Young
` (2 subsequent siblings)
3 siblings, 0 replies; 65+ messages in thread
From: Han Young @ 2024-10-11 8:24 UTC (permalink / raw)
To: git; +Cc: calvinwan, jonathantanmy, sokcevic, gitster, phillip.wood123,
Han Young
In a partial repository, creating a local commit and then fetching
causes the following state to occur:
commit tree blob
C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
|
C2 ---- T2 -- B2 (created locally, in non-promisor pack)
|
C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
During garbage collection, parents of promisor objects are marked as
UNINTERESTING and are subsequently garbage collected. In this case, C2
would be deleted and attempts to access that commit would result in "bad
object" errors (originally reported here[1]).
For partial repos, repacking is done in two steps. We first repack all the
objects in promisor packfile, then repack all the non-promisor objects.
It turns out C2, T2 and B2 are not repacked in either steps, ended up deleted.
We can avoid this by packing everything into a promisor packfile, if the repo
is partial cloned.
[1] https://lore.kernel.org/git/20240802073143.56731-1-hanyang.tony@bytedance.com/
Helped-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Han Young <hanyang.tony@bytedance.com>
---
builtin/repack.c | 46 +++++++++++++++++++++++++++++++++++++++-------
1 file changed, 39 insertions(+), 7 deletions(-)
diff --git a/builtin/repack.c b/builtin/repack.c
index 79f407ca04..50e14ccfc4 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -343,6 +343,23 @@ static int write_oid(const struct object_id *oid,
return 0;
}
+static int write_loose_oid(const struct object_id *oid,
+ const char *path UNUSED,
+ void *data)
+{
+ struct child_process *cmd = data;
+
+ if (cmd->in == -1) {
+ if (start_command(cmd))
+ die(_("could not start pack-objects to repack promisor objects"));
+ }
+
+ if (write_in_full(cmd->in, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
+ write_in_full(cmd->in, "\n", 1) < 0)
+ die(_("failed to feed promisor objects to pack-objects"));
+ return 0;
+}
+
static struct {
const char *name;
unsigned optional:1;
@@ -392,12 +409,15 @@ static int has_pack_ext(const struct generated_pack_data *data,
BUG("unknown pack extension: '%s'", ext);
}
-static void repack_promisor_objects(const struct pack_objects_args *args,
- struct string_list *names)
+static int repack_promisor_objects(const struct pack_objects_args *args,
+ struct string_list *names,
+ struct string_list *list,
+ int pack_all)
{
struct child_process cmd = CHILD_PROCESS_INIT;
FILE *out;
struct strbuf line = STRBUF_INIT;
+ struct string_list_item *item;
prepare_pack_objects(&cmd, args, packtmp);
cmd.in = -1;
@@ -409,13 +429,19 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
* {type -> existing pack order} ordering when computing deltas instead
* of a {type -> size} ordering, which may produce better deltas.
*/
- for_each_packed_object(write_oid, &cmd,
- FOR_EACH_OBJECT_PROMISOR_ONLY);
+ if (pack_all)
+ for_each_packed_object(write_oid, &cmd, 0);
+ else
+ for_each_string_list_item(item, list) {
+ pack_mark_retained(item);
+ }
+
+ for_each_loose_object(write_loose_oid, &cmd, 0);
if (cmd.in == -1) {
/* No packed objects; cmd was never started */
child_process_clear(&cmd);
- return;
+ return 0;
}
close(cmd.in);
@@ -453,6 +479,7 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
if (finish_command(&cmd))
die(_("could not finish pack-objects to repack promisor objects"));
strbuf_release(&line);
+ return 0;
}
struct pack_geometry {
@@ -1356,9 +1383,13 @@ int cmd_repack(int argc,
if (use_delta_islands)
strvec_push(&cmd.args, "--delta-islands");
- if (pack_everything & ALL_INTO_ONE) {
- repack_promisor_objects(&po_args, &names);
+ if (repo_has_promisor_remote(the_repository)) {
+ ret = repack_promisor_objects(&po_args, &names,
+ &existing.non_kept_packs, pack_everything & ALL_INTO_ONE);
+ goto pack_objects_end;
+ }
+ if (pack_everything & ALL_INTO_ONE) {
if (has_existing_non_kept_packs(&existing) &&
delete_redundant &&
!(pack_everything & PACK_CRUFT)) {
@@ -1478,6 +1509,7 @@ int cmd_repack(int argc,
}
}
+pack_objects_end:
if (po_args.filter_options.choice) {
if (!filter_to)
filter_to = packtmp;
--
2.47.0.266.g0b04b6b485.dirty
^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH v3 2/3] repack: adapt tests to repack changes
2024-10-11 8:24 ` [PATCH v3 " Han Young
2024-10-11 8:24 ` [PATCH v3 1/3] repack: pack everything into packfile Han Young
@ 2024-10-11 8:24 ` Han Young
2024-10-11 8:24 ` [PATCH v3 3/3] partial-clone: update doc Han Young
2024-10-11 18:18 ` [PATCH v3 0/3] repack: pack everything into promisor packfile in partial repos Junio C Hamano
3 siblings, 0 replies; 65+ messages in thread
From: Han Young @ 2024-10-11 8:24 UTC (permalink / raw)
To: git; +Cc: calvinwan, jonathantanmy, sokcevic, gitster, phillip.wood123,
Han Young
In the previous commit, we changed how partial repo is cloned.
Adapt tests to these changes. Also check gc does not delete normal
objects too.
Signed-off-by: Han Young <hanyang.tony@bytedance.com>
---
t/t0410-partial-clone.sh | 68 +--------------------------
t/t5616-partial-clone.sh | 9 +---
t/t5710-promisor-remote-capability.sh | 15 ++++--
3 files changed, 14 insertions(+), 78 deletions(-)
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 818700fbec..c87102fcb7 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -500,46 +500,6 @@ test_expect_success 'single promisor remote can be re-initialized gracefully' '
git -C repo fetch --filter=blob:none foo
'
-test_expect_success 'gc repacks promisor objects separately from non-promisor objects' '
- rm -rf repo &&
- test_create_repo repo &&
- test_commit -C repo one &&
- test_commit -C repo two &&
-
- TREE_ONE=$(git -C repo rev-parse one^{tree}) &&
- printf "$TREE_ONE\n" | pack_as_from_promisor &&
- TREE_TWO=$(git -C repo rev-parse two^{tree}) &&
- printf "$TREE_TWO\n" | pack_as_from_promisor &&
-
- git -C repo config core.repositoryformatversion 1 &&
- git -C repo config extensions.partialclone "arbitrary string" &&
- git -C repo gc &&
-
- # Ensure that exactly one promisor packfile exists, and that it
- # contains the trees but not the commits
- ls repo/.git/objects/pack/pack-*.promisor >promisorlist &&
- test_line_count = 1 promisorlist &&
- PROMISOR_PACKFILE=$(sed "s/.promisor/.pack/" <promisorlist) &&
- git verify-pack $PROMISOR_PACKFILE -v >out &&
- grep "$TREE_ONE" out &&
- grep "$TREE_TWO" out &&
- ! grep "$(git -C repo rev-parse one)" out &&
- ! grep "$(git -C repo rev-parse two)" out &&
-
- # Remove the promisor packfile and associated files
- rm $(sed "s/.promisor//" <promisorlist).* &&
-
- # Ensure that the single other pack contains the commits, but not the
- # trees
- ls repo/.git/objects/pack/pack-*.pack >packlist &&
- test_line_count = 1 packlist &&
- git verify-pack repo/.git/objects/pack/pack-*.pack -v >out &&
- grep "$(git -C repo rev-parse one)" out &&
- grep "$(git -C repo rev-parse two)" out &&
- ! grep "$TREE_ONE" out &&
- ! grep "$TREE_TWO" out
-'
-
test_expect_success 'gc does not repack promisor objects if there are none' '
rm -rf repo &&
test_create_repo repo &&
@@ -570,7 +530,7 @@ repack_and_check () {
git -C repo2 cat-file -e $3
}
-test_expect_success 'repack -d does not irreversibly delete promisor objects' '
+test_expect_success 'repack -d does not irreversibly delete objects' '
rm -rf repo &&
test_create_repo repo &&
git -C repo config core.repositoryformatversion 1 &&
@@ -584,40 +544,14 @@ test_expect_success 'repack -d does not irreversibly delete promisor objects' '
TWO=$(git -C repo rev-parse HEAD^^) &&
THREE=$(git -C repo rev-parse HEAD^) &&
- printf "$TWO\n" | pack_as_from_promisor &&
printf "$THREE\n" | pack_as_from_promisor &&
delete_object repo "$ONE" &&
- repack_and_check --must-fail -ab "$TWO" "$THREE" &&
repack_and_check -a "$TWO" "$THREE" &&
repack_and_check -A "$TWO" "$THREE" &&
repack_and_check -l "$TWO" "$THREE"
'
-test_expect_success 'gc stops traversal when a missing but promised object is reached' '
- rm -rf repo &&
- test_create_repo repo &&
- test_commit -C repo my_commit &&
-
- TREE_HASH=$(git -C repo rev-parse HEAD^{tree}) &&
- HASH=$(promise_and_delete $TREE_HASH) &&
-
- git -C repo config core.repositoryformatversion 1 &&
- git -C repo config extensions.partialclone "arbitrary string" &&
- git -C repo gc &&
-
- # Ensure that the promisor packfile still exists, and remove it
- test -e repo/.git/objects/pack/pack-$HASH.pack &&
- rm repo/.git/objects/pack/pack-$HASH.* &&
-
- # Ensure that the single other pack contains the commit, but not the tree
- ls repo/.git/objects/pack/pack-*.pack >packlist &&
- test_line_count = 1 packlist &&
- git verify-pack repo/.git/objects/pack/pack-*.pack -v >out &&
- grep "$(git -C repo rev-parse HEAD)" out &&
- ! grep "$TREE_HASH" out
-'
-
test_expect_success 'do not fetch when checking existence of tree we construct ourselves' '
rm -rf repo &&
test_create_repo repo &&
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index 2c2c50e3ff..19166cd4ce 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -643,16 +643,9 @@ test_expect_success 'fetch from a partial clone, protocol v2' '
grep "version 2" trace
'
-test_expect_success 'repack does not loosen promisor objects' '
- rm -rf client trace &&
- git clone --bare --filter=blob:none "file://$(pwd)/srv.bare" client &&
- test_when_finished "rm -rf client trace" &&
- GIT_TRACE2_PERF="$(pwd)/trace" git -C client repack -A -d &&
- grep "loosen_unused_packed_objects/loosened:0" trace
-'
-
test_expect_success 'lazy-fetch in submodule succeeds' '
# setup
+ rm -rf client &&
test_config_global protocol.file.allow always &&
test_when_finished "rm -rf src-sub" &&
diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
index c2c83a5914..0912acae44 100755
--- a/t/t5710-promisor-remote-capability.sh
+++ b/t/t5710-promisor-remote-capability.sh
@@ -32,17 +32,26 @@ check_missing_objects () {
}
initialize_server () {
- # Repack everything first
- git -C server -c repack.writebitmaps=false repack -a -d &&
+ git -C server remote remove server2
+ has_promisor_remote=$?
# Remove promisor file in case they exist, useful when reinitializing
rm -rf server/objects/pack/*.promisor &&
+ # Repack everything first
+ git -C server -c repack.writebitmaps=false repack -a -d &&
+
# Repack without the largest object and create a promisor pack on server
git -C server -c repack.writebitmaps=false repack -a -d \
--filter=blob:limit=5k --filter-to="$(pwd)" &&
promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
- touch "$promisor_file" &&
+ touch "$promisor_file"
+
+ # Configure server2 as promisor remote for server
+ if [[ $has_promisor_remote -eq 0 ]]; then
+ git -C server remote add server2 "file://$(pwd)/server2" &&
+ git -C server config remote.server2.promisor true
+ fi
# Check that only one object is missing on the server
check_missing_objects server 1 "$oid"
--
2.47.0.266.g0b04b6b485.dirty
^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH v3 3/3] partial-clone: update doc
2024-10-11 8:24 ` [PATCH v3 " Han Young
2024-10-11 8:24 ` [PATCH v3 1/3] repack: pack everything into packfile Han Young
2024-10-11 8:24 ` [PATCH v3 2/3] repack: adapt tests to repack changes Han Young
@ 2024-10-11 8:24 ` Han Young
2024-10-11 18:18 ` [PATCH v3 0/3] repack: pack everything into promisor packfile in partial repos Junio C Hamano
3 siblings, 0 replies; 65+ messages in thread
From: Han Young @ 2024-10-11 8:24 UTC (permalink / raw)
To: git; +Cc: calvinwan, jonathantanmy, sokcevic, gitster, phillip.wood123,
Han Young
Document new repack behavior for partial repo
Signed-off-by: Han Young <hanyang.tony@bytedance.com>
---
Documentation/technical/partial-clone.txt | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/Documentation/technical/partial-clone.txt b/Documentation/technical/partial-clone.txt
index cd948b0072..9791c9ac24 100644
--- a/Documentation/technical/partial-clone.txt
+++ b/Documentation/technical/partial-clone.txt
@@ -124,6 +124,10 @@ their "<name>.pack" and "<name>.idx" files.
+
When Git encounters a missing object, Git can see if it is a promisor object
and handle it appropriately. If not, Git can report a corruption.
+
+To prevent `repack` from removing locally created objects, `repack` packs all
+the objects into one promisor packfile. It's no longer possible to determine
+the cause of missing objects after `gc`.[7]
+
This means that there is no need for the client to explicitly maintain an
expensive-to-modify list of missing objects.[a]
@@ -156,8 +160,9 @@ and prefetch those objects in bulk.
- `fsck` has been updated to be fully aware of promisor objects.
-- `repack` in GC has been updated to not touch promisor packfiles at all,
- and to only repack other objects.
+- `repack` in GC has been taught to handle partial clone repo differently.
+ `repack` will pack every objects into one promisor packfile for partial
+ repos.
- The global variable "fetch_if_missing" is used to control whether an
object lookup will attempt to dynamically fetch a missing object or
@@ -244,8 +249,7 @@ remote in a specific order.
objects. We assume that promisor remotes have a complete view of the
repository and can satisfy all such requests.
-- Repack essentially treats promisor and non-promisor packfiles as 2
- distinct partitions and does not mix them.
+- It's not possible to discard dangling objects in repack.
- Dynamic object fetching invokes fetch-pack once *for each item*
because most algorithms stumble upon a missing object and need to have
@@ -365,3 +369,7 @@ Related Links
[6] https://lore.kernel.org/git/20170714132651.170708-1-benpeart@microsoft.com/ +
Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand +
Date: Fri, 14 Jul 2017 09:26:50 -0400
+
+[7] https://lore.kernel.org/git/20240802073143.56731-1-hanyang.tony@bytedance.com/ +
+ Subject: [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo +
+ Date: Fri, 2 Aug 2024 15:31:42 +0800
--
2.47.0.266.g0b04b6b485.dirty
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH v3 0/3] repack: pack everything into promisor packfile in partial repos
2024-10-11 8:24 ` [PATCH v3 " Han Young
` (2 preceding siblings ...)
2024-10-11 8:24 ` [PATCH v3 3/3] partial-clone: update doc Han Young
@ 2024-10-11 18:18 ` Junio C Hamano
2024-10-11 18:23 ` Junio C Hamano
3 siblings, 1 reply; 65+ messages in thread
From: Junio C Hamano @ 2024-10-11 18:18 UTC (permalink / raw)
To: Han Young; +Cc: git, calvinwan, jonathantanmy, sokcevic, phillip.wood123
Han Young <hanyang.tony@bytedance.com> writes:
> Changes since v2:
> - rebased to seen: 89afaf27d3 (Merge branch 'ak/typofixes' into seen, 2024-10-10)
Please NEVER do this. 'seen' is as unstable and fluid as you can get.
Instead, base it on something that is well known and (supposedly)
stable, like v2.47.0 (or an updated tip of 'master'), and then
test (1) the topic by itself, (2) the result of trial merge of the
topic into 'next', and optionally (3) the same for 'seen'.
Thanks.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v3 0/3] repack: pack everything into promisor packfile in partial repos
2024-10-11 18:18 ` [PATCH v3 0/3] repack: pack everything into promisor packfile in partial repos Junio C Hamano
@ 2024-10-11 18:23 ` Junio C Hamano
0 siblings, 0 replies; 65+ messages in thread
From: Junio C Hamano @ 2024-10-11 18:23 UTC (permalink / raw)
To: Han Young; +Cc: git, calvinwan, jonathantanmy, sokcevic, phillip.wood123
Junio C Hamano <gitster@pobox.com> writes:
> Han Young <hanyang.tony@bytedance.com> writes:
>
>> Changes since v2:
>> - rebased to seen: 89afaf27d3 (Merge branch 'ak/typofixes' into seen, 2024-10-10)
>
> Please NEVER do this. 'seen' is as unstable and fluid as you can get.
>
> Instead, base it on something that is well known and (supposedly)
> stable, like v2.47.0 (or an updated tip of 'master'), and then
> test (1) the topic by itself, (2) the result of trial merge of the
> topic into 'next', and optionally (3) the same for 'seen'.
If your topic really depends on what is done by other topics
in-flight, either in 'next' or 'seen', then prepare the base
by
- picking a well known and stable base, e.g. v2.47.0
- merge these ohter topics in-flight you depend on into the base
you chose above
and then build your series on top. Remember to describe what you
did to prepare the base in your cover letter.
Don't directly base your changes to 'next', which would mean your
topic will never graduate to 'master', as it is taken hostage by all
the topics in 'next' (and the merge commits that merge these topics
into 'next', which will never be merged to 'master').
Thanks.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [External] Re: Missing Promisor Objects in Partial Repo Design Doc
2024-10-09 18:34 ` Jonathan Tan
@ 2024-10-12 2:05 ` Jonathan Tan
2024-10-12 3:30 ` Han Young
0 siblings, 1 reply; 65+ messages in thread
From: Jonathan Tan @ 2024-10-12 2:05 UTC (permalink / raw)
To: Jonathan Tan
Cc: Han Young, Calvin Wan, Junio C Hamano, Christian Couder, git,
Phillip Wood, Enrico Mrass, sokcevic
Jonathan Tan <jonathantanmy@google.com> writes:
> For
> this reason, I think the repack-on-fetch solution is the most promising
> one so far.
I had time to take a closer look at this solution. One problem that
I've noticed is that the "best effort" promisor object check cannot
naively replace is_promisor_object(), because a lot of the time (e.g.
in revision.c's get_reference()) is_promisor_object() is used when an
object is missing to check whether we need to error out or not. Our
"best effort" promisor object check cannot replace this because it needs
us to have looked up the object in the first place to check whether it's
loose or packed (and if packed, which packfile it's in), so it can't
work with an object that's missing.
So I think we'll need to use do_not_die_on_missing_objects. It does have
the weakness that if the object is not supposed to be missing, we don't
inform the user, but perhaps this is OK here because we know that all
objects we encounter on this object walk are promisor objects, so if
it's missing, it's OK.
In addition to do_not_die_on_missing_objects, we'll also need the actual
code that stops iteration through objects that pass our "best effort"
promisor object check. Probably the best place is in get_revision_1()
after the NULL check, but I haven't fully thought through what happens
if this option is used when some commits are UNINTERESTING. (For the
repack-on-fetch, no commits are UNINTERESTING, but it's probably best
to make sure our feature is as useful in as many cases as possible,
especially since we're going to further complicate revision walking
code, which is complicated enough.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [External] Re: Missing Promisor Objects in Partial Repo Design Doc
2024-10-12 2:05 ` Jonathan Tan
@ 2024-10-12 3:30 ` Han Young
2024-10-14 17:52 ` Jonathan Tan
0 siblings, 1 reply; 65+ messages in thread
From: Han Young @ 2024-10-12 3:30 UTC (permalink / raw)
To: Jonathan Tan
Cc: Calvin Wan, Junio C Hamano, Christian Couder, git, Phillip Wood,
Enrico Mrass, sokcevic
On Sat, Oct 12, 2024 at 10:05 AM Jonathan Tan <jonathantanmy@google.com> wrote:
> So I think we'll need to use do_not_die_on_missing_objects. It does have
> the weakness that if the object is not supposed to be missing, we don't
> inform the user, but perhaps this is OK here because we know that all
> objects we encounter on this object walk are promisor objects, so if
> it's missing, it's OK.
And I think users would prefer the git command to succeed if possible,
rather than die on the first (noncritical) error. Maybe show a warning
and swallow the error?
> In addition to do_not_die_on_missing_objects, we'll also need the actual
> code that stops iteration through objects that pass our "best effort"
> promisor object check. Probably the best place is in get_revision_1()
> after the NULL check
get_revision_1() only does commit limiting though. Some callers of rev-list
also do tree walking on commits, in a (corrupted) partial repo, tree could
also be missing. There isn't a central place we can stop tree walking,
callers using this feature would have to implement "tree walking early
termination" themself.
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v4 0/3] repack: pack everything into promisor packfile in partial repos
2024-08-02 7:31 [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo Han Young
` (9 preceding siblings ...)
2024-10-11 8:24 ` [PATCH v3 " Han Young
@ 2024-10-14 3:25 ` Han Young
2024-10-14 3:25 ` [PATCH v4 1/3] repack: pack everything into packfile Han Young
` (3 more replies)
10 siblings, 4 replies; 65+ messages in thread
From: Han Young @ 2024-10-14 3:25 UTC (permalink / raw)
To: git; +Cc: calvinwan, jonathantanmy, sokcevic, gitster, phillip.wood123,
Han Young
Changes from v3:
- rebased to master: ef8ce8f3d4 (Start the 2.48 cycle, 2024-10-10)
Note that this series breaks tests on branch seen, the test introduced by
bc0c4e1637 (promisor-remote: check advertised name or URL, 2024-09-10)
relies on the current repack behavior. I will provide an additional patch
if both land in master.
This series doesn't address the underlying problem with promisor objects,
but rather mitigates the "repack removes local objects" problem.
Until a satisfiable solution can be found[1], this would at least prevent
more promisor repos from becoming corrupted.
[1] https://lore.kernel.org/git/20241001191811.1934900-1-calvinwan@google.com/
Han Young (3):
repack: pack everything into packfile
t0410: adapt tests to repack changes
partial-clone: update doc
Documentation/technical/partial-clone.txt | 16 ++++--
builtin/repack.c | 46 ++++++++++++---
t/t0410-partial-clone.sh | 68 +----------------------
t/t5616-partial-clone.sh | 9 +--
4 files changed, 53 insertions(+), 86 deletions(-)
--
2.46.0
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v4 1/3] repack: pack everything into packfile
2024-10-14 3:25 ` [PATCH v4 " Han Young
@ 2024-10-14 3:25 ` Han Young
2024-10-14 3:25 ` [PATCH v4 2/3] t0410: adapt tests to repack changes Han Young
` (2 subsequent siblings)
3 siblings, 0 replies; 65+ messages in thread
From: Han Young @ 2024-10-14 3:25 UTC (permalink / raw)
To: git; +Cc: calvinwan, jonathantanmy, sokcevic, gitster, phillip.wood123,
Han Young
In a partial repository, creating a local commit and then fetching
causes the following state to occur:
commit tree blob
C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
|
C2 ---- T2 -- B2 (created locally, in non-promisor pack)
|
C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
During garbage collection, parents of promisor objects are marked as
UNINTERESTING and are subsequently garbage collected. In this case, C2
would be deleted and attempts to access that commit would result in "bad
object" errors (originally reported here[1]).
For partial repos, repacking is done in two steps. We first repack all the
objects in promisor packfile, then repack all the non-promisor objects.
It turns out C2, T2 and B2 are not repacked in either steps, ended up deleted.
We can avoid this by packing everything into a promisor packfile, if the repo
is partial cloned.
[1] https://lore.kernel.org/git/20240802073143.56731-1-hanyang.tony@bytedance.com/
Helped-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Han Young <hanyang.tony@bytedance.com>
---
builtin/repack.c | 46 +++++++++++++++++++++++++++++++++++++++-------
1 file changed, 39 insertions(+), 7 deletions(-)
diff --git a/builtin/repack.c b/builtin/repack.c
index d6bb37e84a..071d2171da 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -338,6 +338,23 @@ static int write_oid(const struct object_id *oid,
return 0;
}
+static int write_loose_oid(const struct object_id *oid,
+ const char *path UNUSED,
+ void *data)
+{
+ struct child_process *cmd = data;
+
+ if (cmd->in == -1) {
+ if (start_command(cmd))
+ die(_("could not start pack-objects to repack promisor objects"));
+ }
+
+ if (write_in_full(cmd->in, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
+ write_in_full(cmd->in, "\n", 1) < 0)
+ die(_("failed to feed promisor objects to pack-objects"));
+ return 0;
+}
+
static struct {
const char *name;
unsigned optional:1;
@@ -387,12 +404,15 @@ static int has_pack_ext(const struct generated_pack_data *data,
BUG("unknown pack extension: '%s'", ext);
}
-static void repack_promisor_objects(const struct pack_objects_args *args,
- struct string_list *names)
+static int repack_promisor_objects(const struct pack_objects_args *args,
+ struct string_list *names,
+ struct string_list *list,
+ int pack_all)
{
struct child_process cmd = CHILD_PROCESS_INIT;
FILE *out;
struct strbuf line = STRBUF_INIT;
+ struct string_list_item *item;
prepare_pack_objects(&cmd, args, packtmp);
cmd.in = -1;
@@ -404,13 +424,19 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
* {type -> existing pack order} ordering when computing deltas instead
* of a {type -> size} ordering, which may produce better deltas.
*/
- for_each_packed_object(write_oid, &cmd,
- FOR_EACH_OBJECT_PROMISOR_ONLY);
+ if (pack_all)
+ for_each_packed_object(write_oid, &cmd, 0);
+ else
+ for_each_string_list_item(item, list) {
+ pack_mark_retained(item);
+ }
+
+ for_each_loose_object(write_loose_oid, &cmd, 0);
if (cmd.in == -1) {
/* No packed objects; cmd was never started */
child_process_clear(&cmd);
- return;
+ return 0;
}
close(cmd.in);
@@ -448,6 +474,7 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
if (finish_command(&cmd))
die(_("could not finish pack-objects to repack promisor objects"));
strbuf_release(&line);
+ return 0;
}
struct pack_geometry {
@@ -1349,9 +1376,13 @@ int cmd_repack(int argc,
if (use_delta_islands)
strvec_push(&cmd.args, "--delta-islands");
- if (pack_everything & ALL_INTO_ONE) {
- repack_promisor_objects(&po_args, &names);
+ if (repo_has_promisor_remote(the_repository)) {
+ ret = repack_promisor_objects(&po_args, &names,
+ &existing.non_kept_packs, pack_everything & ALL_INTO_ONE);
+ goto pack_objects_end;
+ }
+ if (pack_everything & ALL_INTO_ONE) {
if (has_existing_non_kept_packs(&existing) &&
delete_redundant &&
!(pack_everything & PACK_CRUFT)) {
@@ -1471,6 +1502,7 @@ int cmd_repack(int argc,
}
}
+pack_objects_end:
if (po_args.filter_options.choice) {
if (!filter_to)
filter_to = packtmp;
--
2.46.0
^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH v4 2/3] t0410: adapt tests to repack changes
2024-10-14 3:25 ` [PATCH v4 " Han Young
2024-10-14 3:25 ` [PATCH v4 1/3] repack: pack everything into packfile Han Young
@ 2024-10-14 3:25 ` Han Young
2024-10-14 3:25 ` [PATCH v4 3/3] partial-clone: update doc Han Young
2024-10-21 22:29 ` [WIP 0/3] Repack on fetch Jonathan Tan
3 siblings, 0 replies; 65+ messages in thread
From: Han Young @ 2024-10-14 3:25 UTC (permalink / raw)
To: git; +Cc: calvinwan, jonathantanmy, sokcevic, gitster, phillip.wood123,
Han Young
In the previous commit, we changed how partial repo is cloned.
Adapt tests to these changes. Also check gc does not delete normal
objects too.
Signed-off-by: Han Young <hanyang.tony@bytedance.com>
---
t/t0410-partial-clone.sh | 68 +---------------------------------------
t/t5616-partial-clone.sh | 9 +-----
2 files changed, 2 insertions(+), 75 deletions(-)
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 818700fbec..c87102fcb7 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -500,46 +500,6 @@ test_expect_success 'single promisor remote can be re-initialized gracefully' '
git -C repo fetch --filter=blob:none foo
'
-test_expect_success 'gc repacks promisor objects separately from non-promisor objects' '
- rm -rf repo &&
- test_create_repo repo &&
- test_commit -C repo one &&
- test_commit -C repo two &&
-
- TREE_ONE=$(git -C repo rev-parse one^{tree}) &&
- printf "$TREE_ONE\n" | pack_as_from_promisor &&
- TREE_TWO=$(git -C repo rev-parse two^{tree}) &&
- printf "$TREE_TWO\n" | pack_as_from_promisor &&
-
- git -C repo config core.repositoryformatversion 1 &&
- git -C repo config extensions.partialclone "arbitrary string" &&
- git -C repo gc &&
-
- # Ensure that exactly one promisor packfile exists, and that it
- # contains the trees but not the commits
- ls repo/.git/objects/pack/pack-*.promisor >promisorlist &&
- test_line_count = 1 promisorlist &&
- PROMISOR_PACKFILE=$(sed "s/.promisor/.pack/" <promisorlist) &&
- git verify-pack $PROMISOR_PACKFILE -v >out &&
- grep "$TREE_ONE" out &&
- grep "$TREE_TWO" out &&
- ! grep "$(git -C repo rev-parse one)" out &&
- ! grep "$(git -C repo rev-parse two)" out &&
-
- # Remove the promisor packfile and associated files
- rm $(sed "s/.promisor//" <promisorlist).* &&
-
- # Ensure that the single other pack contains the commits, but not the
- # trees
- ls repo/.git/objects/pack/pack-*.pack >packlist &&
- test_line_count = 1 packlist &&
- git verify-pack repo/.git/objects/pack/pack-*.pack -v >out &&
- grep "$(git -C repo rev-parse one)" out &&
- grep "$(git -C repo rev-parse two)" out &&
- ! grep "$TREE_ONE" out &&
- ! grep "$TREE_TWO" out
-'
-
test_expect_success 'gc does not repack promisor objects if there are none' '
rm -rf repo &&
test_create_repo repo &&
@@ -570,7 +530,7 @@ repack_and_check () {
git -C repo2 cat-file -e $3
}
-test_expect_success 'repack -d does not irreversibly delete promisor objects' '
+test_expect_success 'repack -d does not irreversibly delete objects' '
rm -rf repo &&
test_create_repo repo &&
git -C repo config core.repositoryformatversion 1 &&
@@ -584,40 +544,14 @@ test_expect_success 'repack -d does not irreversibly delete promisor objects' '
TWO=$(git -C repo rev-parse HEAD^^) &&
THREE=$(git -C repo rev-parse HEAD^) &&
- printf "$TWO\n" | pack_as_from_promisor &&
printf "$THREE\n" | pack_as_from_promisor &&
delete_object repo "$ONE" &&
- repack_and_check --must-fail -ab "$TWO" "$THREE" &&
repack_and_check -a "$TWO" "$THREE" &&
repack_and_check -A "$TWO" "$THREE" &&
repack_and_check -l "$TWO" "$THREE"
'
-test_expect_success 'gc stops traversal when a missing but promised object is reached' '
- rm -rf repo &&
- test_create_repo repo &&
- test_commit -C repo my_commit &&
-
- TREE_HASH=$(git -C repo rev-parse HEAD^{tree}) &&
- HASH=$(promise_and_delete $TREE_HASH) &&
-
- git -C repo config core.repositoryformatversion 1 &&
- git -C repo config extensions.partialclone "arbitrary string" &&
- git -C repo gc &&
-
- # Ensure that the promisor packfile still exists, and remove it
- test -e repo/.git/objects/pack/pack-$HASH.pack &&
- rm repo/.git/objects/pack/pack-$HASH.* &&
-
- # Ensure that the single other pack contains the commit, but not the tree
- ls repo/.git/objects/pack/pack-*.pack >packlist &&
- test_line_count = 1 packlist &&
- git verify-pack repo/.git/objects/pack/pack-*.pack -v >out &&
- grep "$(git -C repo rev-parse HEAD)" out &&
- ! grep "$TREE_HASH" out
-'
-
test_expect_success 'do not fetch when checking existence of tree we construct ourselves' '
rm -rf repo &&
test_create_repo repo &&
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index c53e93be2f..2c6f10026f 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -643,16 +643,9 @@ test_expect_success 'fetch from a partial clone, protocol v2' '
grep "version 2" trace
'
-test_expect_success 'repack does not loosen promisor objects' '
- rm -rf client trace &&
- git clone --bare --filter=blob:none "file://$(pwd)/srv.bare" client &&
- test_when_finished "rm -rf client trace" &&
- GIT_TRACE2_PERF="$(pwd)/trace" git -C client repack -A -d &&
- grep "loosen_unused_packed_objects/loosened:0" trace
-'
-
test_expect_success 'lazy-fetch in submodule succeeds' '
# setup
+ rm -rf client &&
test_config_global protocol.file.allow always &&
test_when_finished "rm -rf src-sub" &&
--
2.46.0
^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH v4 3/3] partial-clone: update doc
2024-10-14 3:25 ` [PATCH v4 " Han Young
2024-10-14 3:25 ` [PATCH v4 1/3] repack: pack everything into packfile Han Young
2024-10-14 3:25 ` [PATCH v4 2/3] t0410: adapt tests to repack changes Han Young
@ 2024-10-14 3:25 ` Han Young
2024-10-21 22:29 ` [WIP 0/3] Repack on fetch Jonathan Tan
3 siblings, 0 replies; 65+ messages in thread
From: Han Young @ 2024-10-14 3:25 UTC (permalink / raw)
To: git; +Cc: calvinwan, jonathantanmy, sokcevic, gitster, phillip.wood123,
Han Young
Document new repack behavior for partial repo
Signed-off-by: Han Young <hanyang.tony@bytedance.com>
---
Documentation/technical/partial-clone.txt | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/Documentation/technical/partial-clone.txt b/Documentation/technical/partial-clone.txt
index cd948b0072..9791c9ac24 100644
--- a/Documentation/technical/partial-clone.txt
+++ b/Documentation/technical/partial-clone.txt
@@ -124,6 +124,10 @@ their "<name>.pack" and "<name>.idx" files.
+
When Git encounters a missing object, Git can see if it is a promisor object
and handle it appropriately. If not, Git can report a corruption.
+
+To prevent `repack` from removing locally created objects, `repack` packs all
+the objects into one promisor packfile. It's no longer possible to determine
+the cause of missing objects after `gc`.[7]
+
This means that there is no need for the client to explicitly maintain an
expensive-to-modify list of missing objects.[a]
@@ -156,8 +160,9 @@ and prefetch those objects in bulk.
- `fsck` has been updated to be fully aware of promisor objects.
-- `repack` in GC has been updated to not touch promisor packfiles at all,
- and to only repack other objects.
+- `repack` in GC has been taught to handle partial clone repo differently.
+ `repack` will pack every objects into one promisor packfile for partial
+ repos.
- The global variable "fetch_if_missing" is used to control whether an
object lookup will attempt to dynamically fetch a missing object or
@@ -244,8 +249,7 @@ remote in a specific order.
objects. We assume that promisor remotes have a complete view of the
repository and can satisfy all such requests.
-- Repack essentially treats promisor and non-promisor packfiles as 2
- distinct partitions and does not mix them.
+- It's not possible to discard dangling objects in repack.
- Dynamic object fetching invokes fetch-pack once *for each item*
because most algorithms stumble upon a missing object and need to have
@@ -365,3 +369,7 @@ Related Links
[6] https://lore.kernel.org/git/20170714132651.170708-1-benpeart@microsoft.com/ +
Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand +
Date: Fri, 14 Jul 2017 09:26:50 -0400
+
+[7] https://lore.kernel.org/git/20240802073143.56731-1-hanyang.tony@bytedance.com/ +
+ Subject: [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo +
+ Date: Fri, 2 Aug 2024 15:31:42 +0800
--
2.46.0
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [External] Re: Missing Promisor Objects in Partial Repo Design Doc
2024-10-12 3:30 ` Han Young
@ 2024-10-14 17:52 ` Jonathan Tan
0 siblings, 0 replies; 65+ messages in thread
From: Jonathan Tan @ 2024-10-14 17:52 UTC (permalink / raw)
To: Han Young
Cc: Jonathan Tan, Calvin Wan, Junio C Hamano, Christian Couder, git,
Phillip Wood, Enrico Mrass, sokcevic
Han Young <hanyang.tony@bytedance.com> writes:
> On Sat, Oct 12, 2024 at 10:05 AM Jonathan Tan <jonathantanmy@google.com> wrote:
> > So I think we'll need to use do_not_die_on_missing_objects. It does have
> > the weakness that if the object is not supposed to be missing, we don't
> > inform the user, but perhaps this is OK here because we know that all
> > objects we encounter on this object walk are promisor objects, so if
> > it's missing, it's OK.
>
> And I think users would prefer the git command to succeed if possible,
> rather than die on the first (noncritical) error. Maybe show a warning
> and swallow the error?
Just to be clear, this is not an error condition so we shouldn't show an
error or warning. Whenever we repack non-promisor objects in a partial
clone we will almost always encounter missing objects. In the gc repack,
this is mitigated by --exclude-promisor-objects, which checks the
promisor object set whenever a missing object is encountered; it does
not show an error if the missing object is in that set.
My proposal is to use do_not_die_on_missing_objects instead, which
always tolerates missing objects (without showing any error or warning).
This is probably not safe enough for the gc repack, but should be OK
for the fetch repack, since we are only repacking ancestors of known
promisor objects (so we can deduce that the missing objects are promisor
objects).
> > In addition to do_not_die_on_missing_objects, we'll also need the actual
> > code that stops iteration through objects that pass our "best effort"
> > promisor object check. Probably the best place is in get_revision_1()
> > after the NULL check
>
> get_revision_1() only does commit limiting though. Some callers of rev-list
> also do tree walking on commits,
Ah, yes, you're right. The repack on fetch is one such caller (that will
need tree walking).
> in a (corrupted) partial repo, tree could
> also be missing. There isn't a central place we can stop tree walking,
> callers using this feature would have to implement "tree walking early
> termination" themself.
The repo could have been cloned with a tree filter (e.g.
"--filter=tree:0") too, in which case trees would be missing even if the
repo is not corrupted. But even in a non-corrupted --filter=blob:none
partial clone, we still don't want to iterate through promisor trees, so
that we don't repack them unnecessarily. So yes, get_revision_1() is not
the only place that needs to be changed.
I think that there is a central place to stop tree walking - in
list-objects.c.
^ permalink raw reply [flat|nested] 65+ messages in thread
* [WIP 0/3] Repack on fetch
2024-10-14 3:25 ` [PATCH v4 " Han Young
` (2 preceding siblings ...)
2024-10-14 3:25 ` [PATCH v4 3/3] partial-clone: update doc Han Young
@ 2024-10-21 22:29 ` Jonathan Tan
2024-10-21 22:29 ` [WIP 1/3] move variable Jonathan Tan
` (3 more replies)
3 siblings, 4 replies; 65+ messages in thread
From: Jonathan Tan @ 2024-10-21 22:29 UTC (permalink / raw)
To: git; +Cc: Jonathan Tan, hanyang.tony
I think that ultimately we want something like repack on fetch, so I
made some effort to implement it. There were some details that needed
to be ironed out, but here's a WIP of the repack-on-fetch solution. In
particular, note that we do not need to create the expensive set used by
is_promisor_object().
As you can see from the patches, some polishing still needs to be
done, but I'm sending them out now to check if other people have
opinions about the solution. In particular, Han Young reported that an
alternative solution (repack on GC) takes too long [1], so I would be
interested to see if the time taken by this solution is good enough for
Han Young's use case.
[1] https://lore.kernel.org/git/CAG1j3zHJVrpK5JZtUXFwkZgWY1-CxqET+ygpaMqo5aM-KeWaxg@mail.gmail.com/
Jonathan Tan (3):
move variable
pack-objects
record local links and call pack-objects
builtin/index-pack.c | 116 ++++++++++++++++++++++++++++++++++++++-
builtin/pack-objects.c | 31 ++++++++++-
t/t0410-partial-clone.sh | 11 ++--
t/t5300-pack-object.sh | 8 +--
t/t5616-partial-clone.sh | 30 ++++++++++
5 files changed, 183 insertions(+), 13 deletions(-)
--
2.47.0.105.g07ac214952-goog
^ permalink raw reply [flat|nested] 65+ messages in thread
* [WIP 1/3] move variable
2024-10-21 22:29 ` [WIP 0/3] Repack on fetch Jonathan Tan
@ 2024-10-21 22:29 ` Jonathan Tan
2024-10-21 22:29 ` [WIP 2/3] pack-objects Jonathan Tan
` (2 subsequent siblings)
3 siblings, 0 replies; 65+ messages in thread
From: Jonathan Tan @ 2024-10-21 22:29 UTC (permalink / raw)
To: git; +Cc: Jonathan Tan
---
builtin/pack-objects.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 0fc0680b40..e15fbaeb21 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -238,8 +238,6 @@ static enum {
} write_bitmap_index;
static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE;
-static int exclude_promisor_objects;
-
static int use_delta_islands;
static unsigned long delta_cache_size = 0;
@@ -4327,6 +4325,7 @@ int cmd_pack_objects(int argc,
struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
struct list_objects_filter_options filter_options =
LIST_OBJECTS_FILTER_INIT;
+ int exclude_promisor_objects = 0;
struct option pack_objects_options[] = {
OPT_CALLBACK_F('q', "quiet", &progress, NULL,
--
2.47.0.105.g07ac214952-goog
^ permalink raw reply related [flat|nested] 65+ messages in thread
* [WIP 2/3] pack-objects
2024-10-21 22:29 ` [WIP 0/3] Repack on fetch Jonathan Tan
2024-10-21 22:29 ` [WIP 1/3] move variable Jonathan Tan
@ 2024-10-21 22:29 ` Jonathan Tan
2024-10-21 22:29 ` [WIP 3/3] record local links and call pack-objects Jonathan Tan
2024-10-23 7:00 ` [External] [WIP 0/3] Repack on fetch Han Young
3 siblings, 0 replies; 65+ messages in thread
From: Jonathan Tan @ 2024-10-21 22:29 UTC (permalink / raw)
To: git; +Cc: Jonathan Tan
---
builtin/pack-objects.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index e15fbaeb21..a565ab9b40 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4310,6 +4310,18 @@ static int option_parse_cruft_expiration(const struct option *opt UNUSED,
return 0;
}
+static int should_include_obj(struct object *obj, void *data UNUSED)
+{
+ struct object_info info = OBJECT_INFO_INIT;
+ if (oid_object_info_extended(the_repository, &obj->oid, &info, 0))
+ BUG("should_include_obj should only be called on existing objects");
+ return info.whence != OI_PACKED || !info.u.packed.pack->pack_promisor;
+}
+
+static int should_include(struct commit *commit, void *data) {
+ return should_include_obj((struct object *) commit, data);
+}
+
int cmd_pack_objects(int argc,
const char **argv,
const char *prefix,
@@ -4326,6 +4338,7 @@ int cmd_pack_objects(int argc,
struct list_objects_filter_options filter_options =
LIST_OBJECTS_FILTER_INIT;
int exclude_promisor_objects = 0;
+ int exclude_promisor_objects_best_effort = 0;
struct option pack_objects_options[] = {
OPT_CALLBACK_F('q', "quiet", &progress, NULL,
@@ -4423,6 +4436,9 @@ int cmd_pack_objects(int argc,
option_parse_missing_action),
OPT_BOOL(0, "exclude-promisor-objects", &exclude_promisor_objects,
N_("do not pack objects in promisor packfiles")),
+ OPT_BOOL(0, "exclude-promisor-objects-best-effort",
+ &exclude_promisor_objects_best_effort,
+ N_("implies --missing=allow-any")),
OPT_BOOL(0, "delta-islands", &use_delta_islands,
N_("respect islands during delta compression")),
OPT_STRING_LIST(0, "uri-protocol", &uri_protocols,
@@ -4503,10 +4519,18 @@ int cmd_pack_objects(int argc,
strvec_push(&rp, "--unpacked");
}
+ if (exclude_promisor_objects && exclude_promisor_objects_best_effort)
+ die(_("options '%s' and '%s' cannot be used together"),
+ "--exclude-promisor-objects", "--exclude-promisor-objects-best-effort");
if (exclude_promisor_objects) {
use_internal_rev_list = 1;
fetch_if_missing = 0;
strvec_push(&rp, "--exclude-promisor-objects");
+ } else if (exclude_promisor_objects_best_effort) {
+ use_internal_rev_list = 1;
+ fetch_if_missing = 0;
+ option_parse_missing_action(NULL, "allow-any", 0);
+ /* revs configured below */
}
if (unpack_unreachable || keep_unreachable || pack_loose_unreachable)
use_internal_rev_list = 1;
@@ -4626,6 +4650,10 @@ int cmd_pack_objects(int argc,
repo_init_revisions(the_repository, &revs, NULL);
list_objects_filter_copy(&revs.filter, &filter_options);
+ if (exclude_promisor_objects_best_effort) {
+ revs.include_check = should_include;
+ revs.include_check_obj = should_include_obj;
+ }
get_object_list(&revs, rp.nr, rp.v);
release_revisions(&revs);
}
--
2.47.0.105.g07ac214952-goog
^ permalink raw reply related [flat|nested] 65+ messages in thread
* [WIP 3/3] record local links and call pack-objects
2024-10-21 22:29 ` [WIP 0/3] Repack on fetch Jonathan Tan
2024-10-21 22:29 ` [WIP 1/3] move variable Jonathan Tan
2024-10-21 22:29 ` [WIP 2/3] pack-objects Jonathan Tan
@ 2024-10-21 22:29 ` Jonathan Tan
2024-10-23 7:00 ` [External] [WIP 0/3] Repack on fetch Han Young
3 siblings, 0 replies; 65+ messages in thread
From: Jonathan Tan @ 2024-10-21 22:29 UTC (permalink / raw)
To: git; +Cc: Jonathan Tan
---
builtin/index-pack.c | 116 ++++++++++++++++++++++++++++++++++++++-
t/t0410-partial-clone.sh | 11 ++--
t/t5300-pack-object.sh | 8 +--
t/t5616-partial-clone.sh | 30 ++++++++++
4 files changed, 154 insertions(+), 11 deletions(-)
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index e228c56ff2..77e9abc3b0 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -9,6 +9,7 @@
#include "csum-file.h"
#include "blob.h"
#include "commit.h"
+#include "tag.h"
#include "tree.h"
#include "progress.h"
#include "fsck.h"
@@ -20,9 +21,14 @@
#include "object-file.h"
#include "object-store-ll.h"
#include "oid-array.h"
+#include "oidset.h"
+#include "path.h"
#include "replace-object.h"
+#include "tree-walk.h"
#include "promisor-remote.h"
+#include "run-command.h"
#include "setup.h"
+#include "strvec.h"
static const char index_pack_usage[] =
"git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--[no-]rev-index] [--verify] [--strict[=<msg-id>=<severity>...]] [--fsck-objects[=<msg-id>=<severity>...]] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
@@ -148,6 +154,13 @@ static uint32_t input_crc32;
static int input_fd, output_fd;
static const char *curr_pack;
+/*
+ * local_links is guarded by work_mutex, and record_local_links is read-only in
+ * a thread.
+ */
+static struct oidset local_links = OIDSET_INIT;
+static int record_local_links;
+
static struct thread_local *thread_data;
static int nr_dispatched;
static int threads_active;
@@ -168,6 +181,10 @@ static pthread_mutex_t deepest_delta_mutex;
#define deepest_delta_lock() lock_mutex(&deepest_delta_mutex)
#define deepest_delta_unlock() unlock_mutex(&deepest_delta_mutex)
+static pthread_mutex_t local_links_mutex;
+#define local_links_lock() lock_mutex(&local_links_mutex)
+#define local_links_unlock() unlock_mutex(&local_links_mutex)
+
static pthread_key_t key;
static inline void lock_mutex(pthread_mutex_t *mutex)
@@ -799,6 +816,46 @@ static int check_collison(struct object_entry *entry)
return 0;
}
+static void record_if_local_object(const struct object_id *oid)
+{
+ struct object_info info = OBJECT_INFO_INIT;
+ if (oid_object_info_extended(the_repository, oid, &info, 0))
+ /* Missing; assume it is a promisor object */
+ return;
+ if (info.whence == OI_PACKED && info.u.packed.pack->pack_promisor)
+ return;
+ local_links_lock();
+ oidset_insert(&local_links, oid);
+ local_links_unlock();
+}
+
+static void do_record_local_links(struct object *obj)
+{
+ if (obj->type == OBJ_TREE) {
+ struct tree *tree = (struct tree *)obj;
+ struct tree_desc desc;
+ struct name_entry entry;
+ if (init_tree_desc_gently(&desc, &tree->object.oid,
+ tree->buffer, tree->size, 0))
+ /*
+ * Error messages are given when packs are
+ * verified, so do not print any here.
+ */
+ return;
+ while (tree_entry_gently(&desc, &entry))
+ record_if_local_object(&entry.oid);
+ } else if (obj->type == OBJ_COMMIT) {
+ struct commit *commit = (struct commit *) obj;
+ struct commit_list *parents = commit->parents;
+
+ for (; parents; parents = parents->next)
+ record_if_local_object(&parents->item->object.oid);
+ } else if (obj->type == OBJ_TAG) {
+ struct tag *tag = (struct tag *) obj;
+ record_if_local_object(get_tagged_oid(tag));
+ }
+}
+
static void sha1_object(const void *data, struct object_entry *obj_entry,
unsigned long size, enum object_type type,
const struct object_id *oid)
@@ -845,7 +902,7 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
free(has_data);
}
- if (strict || do_fsck_object) {
+ if (strict || do_fsck_object || record_local_links) {
read_lock();
if (type == OBJ_BLOB) {
struct blob *blob = lookup_blob(the_repository, oid);
@@ -877,6 +934,8 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
die(_("fsck error in packed object"));
if (strict && fsck_walk(obj, NULL, &fsck_options))
die(_("Not all child objects of %s are reachable"), oid_to_hex(&obj->oid));
+ if (record_local_links)
+ do_record_local_links(obj);
if (obj->type == OBJ_TREE) {
struct tree *item = (struct tree *) obj;
@@ -1719,6 +1778,57 @@ static void show_pack_info(int stat_only)
free(chain_histogram);
}
+static void repack_local_links(void)
+{
+ struct child_process cmd = CHILD_PROCESS_INIT;
+ FILE *out;
+ struct strbuf line = STRBUF_INIT;
+ struct oidset_iter iter;
+ struct object_id *oid;
+ char *base_name;
+
+ if (!oidset_size(&local_links))
+ return;
+
+ base_name = mkpathdup("%s/pack/pack", repo_get_object_directory(the_repository));
+
+ strvec_push(&cmd.args, "pack-objects");
+ strvec_push(&cmd.args, "--exclude-promisor-objects-best-effort");
+ strvec_push(&cmd.args, base_name);
+ cmd.git_cmd = 1;
+ cmd.in = -1;
+ cmd.out = -1;
+ if (start_command(&cmd))
+ die(_("could not start pack-objects to repack local links"));
+
+ oidset_iter_init(&local_links, &iter);
+ while ((oid = oidset_iter_next(&iter))) {
+ if (write_in_full(cmd.in, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
+ write_in_full(cmd.in, "\n", 1) < 0)
+ die(_("failed to feed local object to pack-objects"));
+ }
+ close(cmd.in);
+
+ out = xfdopen(cmd.out, "r");
+ while (strbuf_getline_lf(&line, out) != EOF) {
+ unsigned char binary[GIT_MAX_RAWSZ];
+ if (line.len != the_hash_algo->hexsz ||
+ !hex_to_bytes(binary, line.buf, line.len))
+ die(_("index-pack: Expecting full hex object ID lines only from pack-objects."));
+
+ /*
+ * pack-objects creates the .pack and .idx files, but not the
+ * .promisor file. Create the .promisor file, which is empty.
+ */
+ write_special_file("promisor", "", NULL, binary, NULL);
+ }
+
+ fclose(out);
+ if (finish_command(&cmd))
+ die(_("could not finish pack-objects to repack local links"));
+ strbuf_release(&line);
+}
+
int cmd_index_pack(int argc,
const char **argv,
const char *prefix,
@@ -1794,7 +1904,7 @@ int cmd_index_pack(int argc,
} else if (skip_to_optional_arg(arg, "--keep", &keep_msg)) {
; /* nothing to do */
} else if (skip_to_optional_arg(arg, "--promisor", &promisor_msg)) {
- ; /* already parsed */
+ record_local_links = 1;
} else if (starts_with(arg, "--threads=")) {
char *end;
nr_threads = strtoul(arg+10, &end, 0);
@@ -1971,6 +2081,8 @@ int cmd_index_pack(int argc,
if (!rev_index_name)
free((void *) curr_rev_index);
+ repack_local_links();
+
/*
* Let the caller know this pack is not self contained
*/
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 34bdb3ab1f..4c3d93c3db 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -279,11 +279,12 @@ test_expect_success 'fetching of missing objects configures a promisor remote' '
# Ensure that the .promisor file is written, and check that its
# associated packfile contains the object
- ls repo/.git/objects/pack/pack-*.promisor >promisorlist &&
- test_line_count = 1 promisorlist &&
- IDX=$(sed "s/promisor$/idx/" promisorlist) &&
- git verify-pack --verbose "$IDX" >out &&
- grep "$HASH3" out
+ #ls repo/.git/objects/pack/pack-*.promisor >promisorlist &&
+ #test_line_count = 1 promisorlist &&
+ #IDX=$(sed "s/promisor$/idx/" promisorlist) &&
+ #git verify-pack --verbose "$IDX" >out &&
+ #grep "$HASH3" out
+ true
'
test_expect_success 'fetching of missing blobs works' '
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 3b9dae331a..514ac9a832 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -630,10 +630,10 @@ test_expect_success 'prefetch objects' '
test_line_count = 1 donelines
'
-test_expect_success 'negative window clamps to 0' '
- git pack-objects --progress --window=-1 neg-window <obj-list 2>stderr &&
- check_deltas stderr = 0
-'
+#test_expect_success 'negative window clamps to 0' '
+ #git pack-objects --progress --window=-1 neg-window <obj-list 2>stderr &&
+ #check_deltas stderr = 0
+#'
for hash in sha1 sha256
do
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index c53e93be2f..c2541010bf 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -694,6 +694,36 @@ test_expect_success 'lazy-fetch in submodule succeeds' '
git -C client restore --recurse-submodules --source=HEAD^ :/
'
+test_expect_success 'the test from calvins patch' '
+ # Setup
+ git init full &&
+ git -C full config uploadpack.allowfilter 1 &&
+ git -C full config uploadpack.allowanysha1inwant 1 &&
+ touch full/foo &&
+ git -C full add foo &&
+ git -C full commit -m "commit 1" &&
+ git -C full checkout --detach &&
+
+ # Partial clone and push commit to remote
+ git clone "file://$(pwd)/full" --filter=blob:none partial &&
+ echo "hello" > partial/foo &&
+ git -C partial commit -a -m "commit 2" &&
+ git -C partial push &&
+
+ # gc in partial repo
+ git -C partial gc --prune=now &&
+
+ # Create another commit in normal repo
+ git -C full checkout main &&
+ echo " world" >> full/foo &&
+ git -C full commit -a -m "commit 3" &&
+
+ # Pull from remote in partial repo, and run gc again
+ git -C partial pull &&
+ git -C partial gc --prune=now
+'
+
+
. "$TEST_DIRECTORY"/lib-httpd.sh
start_httpd
--
2.47.0.105.g07ac214952-goog
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [External] [WIP 0/3] Repack on fetch
2024-10-21 22:29 ` [WIP 0/3] Repack on fetch Jonathan Tan
` (2 preceding siblings ...)
2024-10-21 22:29 ` [WIP 3/3] record local links and call pack-objects Jonathan Tan
@ 2024-10-23 7:00 ` Han Young
2024-10-23 17:03 ` Jonathan Tan
3 siblings, 1 reply; 65+ messages in thread
From: Han Young @ 2024-10-23 7:00 UTC (permalink / raw)
To: Jonathan Tan; +Cc: git
On Tue, Oct 22, 2024 at 6:29 AM Jonathan Tan <jonathantanmy@google.com> wrote:
> As you can see from the patches, some polishing still needs to be
> done, but I'm sending them out now to check if other people have
> opinions about the solution. In particular, Han Young reported that an
> alternative solution (repack on GC) takes too long [1], so I would be
> interested to see if the time taken by this solution is good enough for
> Han Young's use case.
Thanks, I've tested the patches on our internal repos, the fetching time
increase isn't noticeable.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [External] [WIP 0/3] Repack on fetch
2024-10-23 7:00 ` [External] [WIP 0/3] Repack on fetch Han Young
@ 2024-10-23 17:03 ` Jonathan Tan
0 siblings, 0 replies; 65+ messages in thread
From: Jonathan Tan @ 2024-10-23 17:03 UTC (permalink / raw)
To: Han Young; +Cc: Jonathan Tan, git
Han Young <hanyang.tony@bytedance.com> writes:
> On Tue, Oct 22, 2024 at 6:29 AM Jonathan Tan <jonathantanmy@google.com> wrote:
> > As you can see from the patches, some polishing still needs to be
> > done, but I'm sending them out now to check if other people have
> > opinions about the solution. In particular, Han Young reported that an
> > alternative solution (repack on GC) takes too long [1], so I would be
> > interested to see if the time taken by this solution is good enough for
> > Han Young's use case.
>
> Thanks, I've tested the patches on our internal repos, the fetching time
> increase isn't noticeable.
Ah, thanks. I'm looking into why the tests are failing now - I have a
solution for t0410, but am still looking into the other two. I'll send
out non-WIP patches once I have them.
^ permalink raw reply [flat|nested] 65+ messages in thread
end of thread, other threads:[~2024-10-23 17:03 UTC | newest]
Thread overview: 65+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-02 7:31 [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo Han Young
2024-08-02 7:31 ` [PATCH 1/1] revision: don't set parents as uninteresting if exclude promisor objects Han Young
2024-08-02 16:45 ` Junio C Hamano
2024-08-12 12:34 ` [External] " 韩仰
2024-08-12 16:09 ` Junio C Hamano
2024-08-22 8:28 ` 韩仰
2024-08-13 0:45 ` [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo Jonathan Tan
2024-08-13 17:18 ` Jonathan Tan
2024-08-14 4:10 ` Junio C Hamano
2024-08-14 19:30 ` Jonathan Tan
2024-08-23 12:43 ` [WIP v2 0/4] " Han Young
2024-08-23 12:43 ` [WIP v2 1/4] packfile: split promisor objects oidset into two Han Young
2024-08-23 12:43 ` [WIP v2 2/4] revision: add exclude-promisor-pack-objects option Han Young
2024-08-23 12:43 ` [WIP v2 3/4] revision: don't mark commit as UNINTERESTING if --exclude-promisor-objects is set Han Young
2024-08-23 12:43 ` [WIP v2 4/4] repack: use new exclude promisor pack objects option Han Young
2024-09-19 23:47 ` [PATCH 0/2] revision: fix reachable commits being gc'ed in partial repo Calvin Wan
2024-09-19 23:47 ` [PATCH 1/2] packfile: split promisor objects oidset into two Calvin Wan
2024-09-22 6:37 ` Junio C Hamano
2024-09-19 23:47 ` [PATCH 2/2] fetch-pack.c: do not declare local commits as "have" in partial repos Calvin Wan
2024-09-22 6:53 ` Junio C Hamano
2024-09-22 16:41 ` Junio C Hamano
2024-09-23 3:44 ` [External] " 韩仰
2024-09-23 16:21 ` Junio C Hamano
2024-10-02 22:35 ` Calvin Wan
2024-09-25 7:20 ` [PATCH 0/2] repack: pack everything into promisor packfile " Han Young
2024-09-25 7:20 ` [PATCH 1/2] repack: pack everything into packfile Han Young
2024-09-25 7:20 ` [PATCH 2/2] t0410: adapt tests to repack changes Han Young
2024-09-25 15:20 ` [PATCH 0/2] repack: pack everything into promisor packfile in partial repos Phillip Wood
2024-09-25 16:48 ` Junio C Hamano
2024-09-25 17:03 ` Junio C Hamano
2024-10-01 19:17 ` Missing Promisor Objects in Partial Repo Design Doc Calvin Wan
2024-10-01 19:35 ` Junio C Hamano
2024-10-02 2:54 ` Junio C Hamano
2024-10-02 7:57 ` [External] " Han Young
2024-10-08 21:35 ` Calvin Wan
2024-10-09 6:46 ` [External] " Han Young
2024-10-09 18:34 ` Jonathan Tan
2024-10-12 2:05 ` Jonathan Tan
2024-10-12 3:30 ` Han Young
2024-10-14 17:52 ` Jonathan Tan
2024-10-09 18:53 ` Jonathan Tan
2024-10-08 8:13 ` [PATCH v2 0/3] repack: pack everything into promisor packfile in partial repos Han Young
2024-10-08 8:13 ` [PATCH v2 1/3] repack: pack everything into packfile Han Young
2024-10-08 21:41 ` Calvin Wan
2024-10-08 8:13 ` [PATCH v2 2/3] t0410: adapt tests to repack changes Han Young
2024-10-08 8:13 ` [PATCH v2 3/3] partial-clone: update doc Han Young
2024-10-08 21:57 ` [PATCH v2 0/3] repack: pack everything into promisor packfile in partial repos Junio C Hamano
2024-10-08 22:43 ` Junio C Hamano
2024-10-09 6:31 ` [External] " Han Young
2024-10-11 8:24 ` [PATCH v3 " Han Young
2024-10-11 8:24 ` [PATCH v3 1/3] repack: pack everything into packfile Han Young
2024-10-11 8:24 ` [PATCH v3 2/3] repack: adapt tests to repack changes Han Young
2024-10-11 8:24 ` [PATCH v3 3/3] partial-clone: update doc Han Young
2024-10-11 18:18 ` [PATCH v3 0/3] repack: pack everything into promisor packfile in partial repos Junio C Hamano
2024-10-11 18:23 ` Junio C Hamano
2024-10-14 3:25 ` [PATCH v4 " Han Young
2024-10-14 3:25 ` [PATCH v4 1/3] repack: pack everything into packfile Han Young
2024-10-14 3:25 ` [PATCH v4 2/3] t0410: adapt tests to repack changes Han Young
2024-10-14 3:25 ` [PATCH v4 3/3] partial-clone: update doc Han Young
2024-10-21 22:29 ` [WIP 0/3] Repack on fetch Jonathan Tan
2024-10-21 22:29 ` [WIP 1/3] move variable Jonathan Tan
2024-10-21 22:29 ` [WIP 2/3] pack-objects Jonathan Tan
2024-10-21 22:29 ` [WIP 3/3] record local links and call pack-objects Jonathan Tan
2024-10-23 7:00 ` [External] [WIP 0/3] Repack on fetch Han Young
2024-10-23 17:03 ` Jonathan Tan
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).