git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Issue with git log and reference repositories using --dissociate and --filter=blob:none
@ 2024-05-17 13:52 Matt Cree
  2024-05-21  9:55 ` Jeff King
  0 siblings, 1 reply; 2+ messages in thread
From: Matt Cree @ 2024-05-17 13:52 UTC (permalink / raw)
  To: git

Hello all, I am working with an application that uses a reference cache for repositories. We use that cache on subsequent clones to save on some of the overheads of cloning the repositories each time we need them, which is fairly frequently.

We've ran into an issue with git log combined with our use of the reference repository (and possibly --dissociate on the clone) where the commit history cannot be fetched (for a while) with the error:
> error: Could not read 2ffba4df2f9ec9df145fcdd84fe20a3d934b4555
> fatal: Failed to traverse parents of commit 7603ede45da4d396f2641b01e2ef3e13d49b572f

This is part of a user facing feature but thankfully it's extremely infrequent as of right now.

We do a partial clone of the repository with the following
> git -c core.symlinks=false -c gc.auto=0 clone --reference ./$reference_repo --dissociate --filter=blob:none --no-checkout -b master https://github.com/mattcree/dissociate-clone-issue ./$clone_repo

Then we try to get the log 
> git -c core.symlinks=false -c gc.auto=0 log -100 --first-parent --pretty="%H %an %ct %s" master b3447a67238c760aa2845d32e5eb95b96e67c733

I set the following trace options to help debugging
> GIT_TRACE=2 GIT_CURL_VERBOSE=2 GIT_TRACE_PERFORMANCE=2 GIT_TRACE_PACK_ACCESS=2 GIT_TRACE_PACKET=2 GIT_TRACE_PACKFILE=2 GIT_TRACE_SETUP=2 GIT_TRACE_SHALLOW=2

From what I can tell, what is going on here is the following
1. We cloned the repository using --dissociate which forces a repack of the cloned repository
2. The clone completes fast (on git 2.40+) but in the background 'git-remote-https' is running
3. The bug appears while I request the log during this time
4. When the 'git-remote-https' process ends, the log can be requested successfully

From what I can tell git repack is called during the dissociate, which I guess forces an update to the pack files. For this reason the pack file the commit objects are in may not exist at the time when the git log is called.

This means when going through the commits, eventually it does not find one here https://github.com/git/git/blob/492ee03f60297e7e83d101f4519ab8abc98782bc/revision.c#L1106 -- this code path does seem aware of the possibility of missing objects but in this case the arguments it has been given clearly don't stop it from failing here.

When we remove '--dissociate' this issue does not appear. The decision to use dissociate was mainly just driven by perceived safety (I did not work on this part and can't say either way) -- we may fix it for now by removing this.

When we remove '--filter=blob:none' the issue also do not appear.

When running the log command, Git 2.39.3 appears to print quite a bit of http logging e.g.
> Info: [HTTP/2] [1] OPENED stream for https://github.com/mattcree/dissociate-clone-issue/info/refs?service=git-upload-pack

This does not appear with 2.40+. I believe this is either a bug or a poorly handled scenario caused by the lazy retrieval of pack files but I can't say for sure. This is the second time I'm coming to the mailing list with a 'is this a bug?' type question -- it appears to me that it is, but our use case is fairly niche so I wasn't sure if we found something here. I've dived in a bit with the code and there's a lot of moving parts which I am not familiar with, so I may have missed something, but I think I've covered the main issues and I have supplied a recreation below.

The following gist contains a recreation including the reference repository state, the repository to clone, and a script for repeating the situation, and a selection from the log output of running with `--filter=blob:none` -- however it's probably easier for you to run the script yourself for the full output.
> https://gist.github.com/mattcree/b5fcd364c97219465f37b62598db36b0


^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: Issue with git log and reference repositories using --dissociate and --filter=blob:none
  2024-05-17 13:52 Issue with git log and reference repositories using --dissociate and --filter=blob:none Matt Cree
@ 2024-05-21  9:55 ` Jeff King
  0 siblings, 0 replies; 2+ messages in thread
From: Jeff King @ 2024-05-21  9:55 UTC (permalink / raw)
  To: Matt Cree; +Cc: git, Jonathan Tan

[+cc Jonathan Tan for his commit mentioned below but also as the general
promisor repacking guru]

On Fri, May 17, 2024 at 02:52:45PM +0100, Matt Cree wrote:

> From what I can tell, what is going on here is the following
> 1. We cloned the repository using --dissociate which forces a repack of the cloned repository
> 2. The clone completes fast (on git 2.40+) but in the background 'git-remote-https' is running
> 3. The bug appears while I request the log during this time
> 4. When the 'git-remote-https' process ends, the log can be requested successfully

Thanks for providing a thorough example, which let me reproduce the
problem. I think the stuff about remote-https is a red herring. The
fundamental issue is some weirdness in the interaction between the
promisor and alternates features when repacking (a "promisor" pack is
one we got from a filtered fetch, where the other side "promises" that
it can give us the rest of the objects later). Plus Git being a little
too hesitant to lazy-fetch the missing commit.

Here's an even simpler from-scratch reproduction:

-- >8 --
# start with a clean slate
rm -rf server client reference

# we have a server that allows partial clones
git init server
git -C server config uploadpack.allowFilter true

# we also have a reference repo which has its first commit
git -C server commit --allow-empty -m one
git clone --bare server reference

# but importantly the server also has a second commit which is not
# in the reference repo
git -C server commit --allow-empty -m two

# now we do a partial reference clone. This is going to get a new copy
# of commit "two" in a pack marked as a promisor (since we know the
# other side can give us anything it points to later if we ask). But it
# won't get a copy of "one", because that's in the reference repo
git clone --no-local --reference reference --filter=blob:none server client

# At this point we're good, and can access all objects. But if we had
# specified --dissociate, it would do the equivalent of these commands:
cd client
git repack -ad
rm -f .git/objects/info/alternates

# And now we find that when we run git-log, we are missing commit "one"!
git --no-pager log
-- >8 --

During that repack we actually make two packs:

  1. We repack all of the promisor objects we have into their own new
     pack (so just "two" in this case).

  2. And then we follow up by making a new pack with all of the
     non-promisor objects (both local and non-local). Which in this case
     would usually be "one". But we tell it to exclude any promisor
     objects, which causes our traversal to mark "two" with the
     UNINTERESTING flag. And then as we traverse, that flag transitively
     applies to things we can reach, including "one", and we exclude it
     from this pack.

     So we realize that "one" is something we _could_ ask the server
     for, even though it's not in a promisor pack itself. But it's sort
     of both a promisor (reachable from a promisor object) and sort of
     not (we don't have it in a promisor pack). And so it's included in
     neither of the new packs.

I think this is sub-optimal, in that we'd usually try to keep promisor
objects we have locally (without alternates, they'd have come from the
original filtered fetch and would be in the promisor pack). But we fail
to migrate them to the new pack in this case.

However, this isn't actually a corruption, because you really can get
"one" from the server. But git-log doesn't ask for it! It used to, but
stopped due to 7e2ad1cda2 (commit: don't lazy-fetch commits,
2022-12-14), where parse_commit(), etc, won't do the lazy fetch. You can
still grab it with something like "git cat-file commit HEAD^", which
lazy-fetches via read_object_file(). After which the repo is repaired
(though of course in a bigger repository there may be many such
commits).

The logic in 7e2ad1cda2 is that we don't have any filters which exclude
commits, so there's no need to lazy-fetch them. But obviously we can
still be missing them due to the promisor repacking scheme, as above. So
I'm not sure if that commit is wrong, or if the repacking should be more
careful about holding on to objects from the reference repo (which would
fix the bug, but also do the more efficient thing the user was trying in
the first place).

-Peff

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2024-05-21  9:55 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-17 13:52 Issue with git log and reference repositories using --dissociate and --filter=blob:none Matt Cree
2024-05-21  9:55 ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).