* [PATCH] has_sha1_file: re-check pack directory before giving up
@ 2013-08-30  1:10 Jeff King
  2013-08-30  1:40 ` Eric Sunshine
  ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread

From: Jeff King @ 2013-08-30 1:10 UTC (permalink / raw)
To: git; +Cc: Michael Haggerty, Junio C Hamano

When we read a sha1 file, we first look for a packed version, then a loose version, and then re-check the pack directory again before concluding that we cannot find it. This lets us handle a process that is writing to the repository simultaneously (e.g., receive-pack writing a new pack followed by a ref update, or git-repack packing existing loose objects into a new pack).

However, we do not do the same trick with has_sha1_file; we only check the packed objects once, followed by loose objects. This means that we might incorrectly report that we do not have an object, even though we could find it if we simply re-checked the pack directory.

By itself, this is usually not a big deal. The other process is running simultaneously, so we may run has_sha1_file before it writes, anyway. It is a race whether we see the object or not. However, we may also see other things the writing process has done (like updating refs); and in that case, we must be able to also see the new objects.

For example, imagine we are doing a for_each_ref iteration, and somebody simultaneously pushes. Receive-pack may write the pack and update a ref after we have examined the objects/pack directory, but before the iteration gets to the updated ref. When we do finally see the updated ref, for_each_ref will call has_sha1_file to check whether the ref is broken. If has_sha1_file returns the wrong answer, we erroneously will think that the ref is broken.

In the case of git-fsck, which uses the DO_FOR_EACH_INCLUDE_BROKEN flag, this will cause us to erroneously complain that the ref points to an invalid object. But for git-repack, which does not use that flag, we will skip the ref completely!
So not only will we fail to process the new objects that the ref points to (which is acceptabale, since the processes are running simultaneously, and we might well do our whole repack before the other process updates the ref), but we will not see the ref at all. Its old objects may be omitted from the pack (and even lost, if --unpack-unreachable is used with an expiration time).

There's no test included here, because the success case is two processes running simultaneously forever. But you can replicate the issue with:

  # base.sh
  # run this in one terminal; it creates and pushes
  # repeatedly to a repository
  git init parent &&
  (cd parent &&

   # create a base commit that will trigger us looking at
   # the objects/pack directory before we hit the updated ref
   echo content >file &&
   git add file &&
   git commit -m base &&

   # set the unpack limit abnormally low, which
   # lets us simulate full-size pushes using tiny ones
   git config receive.unpackLimit 1
  ) &&
  git clone parent child &&
  cd child &&
  n=0 &&
  while true; do
    echo $n >file && git add file && git commit -m $n &&
    git push origin HEAD:refs/remotes/child/master &&
    n=$(($n + 1))
  done

  # fsck.sh
  # now run this simultaneously in another terminal; it
  # repeatedly fscks, looking for us to consider the
  # newly-pushed ref broken.
  cd parent &&
  while true; do
    broken=`git fsck 2>&1 | grep remotes/child`
    if test -n "$broken"; then
      echo $broken
      exit 1
    fi
  done

Without this patch, the fsck loop fails within a few seconds (and almost instantly if the test repository actually has a large number of refs). With it, the two can run indefinitely.

Signed-off-by: Jeff King <peff@peff.net>
---
I observed this being triggered in practice on a repository with a large number of refs. Clients complained of refs being reported as bogus by the server and being omitted (because of upload-pack missing the ref in the advertisement), and git-fsck would randomly report breakages (that would then go away if run again).
I didn't observe any object loss as I theorized above, but I'm pretty sure it's possible (ironically, _not_ the new objects, because they will be newer than the prune time, but the old objects they are based on).

Note that this works against the optimization in b495697 (fetch-pack: avoid repeatedly re-scanning pack directory, 2013-01-26), which uses has_sha1_file specifically to avoid the re-scan. But it is much more important to be correct here, so I'd like to start with this fix. For the use case in b495697, we can have a "quick but occasionally wrong" version.

 sha1_file.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/sha1_file.c b/sha1_file.c
index 8e27db1..06784fb 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2925,7 +2925,10 @@ int has_sha1_file(const unsigned char *sha1)
 
 	if (find_pack_entry(sha1, &e))
 		return 1;
-	return has_loose_object(sha1);
+	if (has_loose_object(sha1))
+		return 1;
+	reprepare_packed_git();
+	return find_pack_entry(sha1, &e);
 }
 
 static void check_tree(const void *buf, size_t size)
-- 
1.8.4.2.g87d4a77

^ permalink raw reply related	[flat|nested] 6+ messages in thread
* Re: [PATCH] has_sha1_file: re-check pack directory before giving up
  2013-08-30  1:10 [PATCH] has_sha1_file: re-check pack directory before giving up Jeff King
@ 2013-08-30  1:40 ` Eric Sunshine
  2013-08-30  4:17 ` Junio C Hamano
  2013-08-30  4:28 ` Jeff King
  2 siblings, 0 replies; 6+ messages in thread

From: Eric Sunshine @ 2013-08-30 1:40 UTC (permalink / raw)
To: Jeff King; +Cc: Git List, Michael Haggerty, Junio C Hamano

On Thu, Aug 29, 2013 at 9:10 PM, Jeff King <peff@peff.net> wrote:
> When we read a sha1 file, we first look for a packed
> version, then a loose version, and then re-check the pack
> directory again before concluding that we cannot find it.
> This lets us handle a process that is writing to the
> repository simultaneously (e.g., receive-pack writing a new
> pack followed by a ref update, or git-repack packing
> existing loose objects into a new pack).
>
> However, we do not do the same trick with has_sha1_file; we
> only check the packed objects once, followed by loose
> objects. This means that we might incorrectly report that we
> do not have an object, even though we could find it if we
> simply re-checked the pack directory.
>
> By itself, this is usually not a big deal. The other process
> is running simultaneously, so we may run has_sha1_file
> before it writes, anyway. It is a race whether we see the
> object or not. However, we may also see other things
> the writing process has done (like updating refs); and in
> that case, we must be able to also see the new objects.
>
> For example, imagine we are doing a for_each_ref iteration,
> and somebody simultaneously pushes. Receive-pack may write
> the pack and update a ref after we have examined the
> objects/pack directory, but before the iteration gets to the
> updated ref. When we do finally see the updated ref,
> for_each_ref will call has_sha1_file to check whether the
> ref is broken. If has_sha1_file returns the wrong answer, we
> erroneously will think that the ref is broken.
>
> In the case of git-fsck, which uses the
> DO_FOR_EACH_INCLUDE_BROKEN flag, this will cause us to
> erroneously complain that the ref points to an invalid
> object. But for git-repack, which does not use that flag, we
> will skip the ref completely! So not only will we fail to
> process the new objects that the ref points to (which is
> acceptabale, since the processes are running simultaneously,

s/acceptabale/acceptable/

> and we might well do our whole repack before the other
> process updates the ref), but we will not see the ref at
> all. Its old objects may be omitted from the pack (and even
> lost, if --unpack-unreachable is used with an expiration
> time).
>
> There's no test included here, because the success case is
> two processes running simultaneously forever. But you can
> replicate the issue with:
>
>   # base.sh
>   # run this in one terminal; it creates and pushes
>   # repeatedly to a repository
>   git init parent &&
>   (cd parent &&
>
>    # create a base commit that will trigger us looking at
>    # the objects/pack directory before we hit the updated ref
>    echo content >file &&
>    git add file &&
>    git commit -m base &&
>
>    # set the unpack limit abnormally low, which
>    # lets us simulate full-size pushes using tiny ones
>    git config receive.unpackLimit 1
>   ) &&
>   git clone parent child &&
>   cd child &&
>   n=0 &&
>   while true; do
>     echo $n >file && git add file && git commit -m $n &&
>     git push origin HEAD:refs/remotes/child/master &&
>     n=$(($n + 1))
>   done
>
>   # fsck.sh
>   # now run this simultaneously in another terminal; it
>   # repeatedly fscks, looking for us to consider the
>   # newly-pushed ref broken.
>   cd parent &&
>   while true; do
>     broken=`git fsck 2>&1 | grep remotes/child`
>     if test -n "$broken"; then
>       echo $broken
>       exit 1
>     fi
>   done
>
> Without this patch, the fsck loop fails within a few seconds
> (and almost instantly if the test repository actually has a
> large number of refs). With it, the two can run
> indefinitely.
>
> Signed-off-by: Jeff King <peff@peff.net>

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH] has_sha1_file: re-check pack directory before giving up
  2013-08-30  1:10 [PATCH] has_sha1_file: re-check pack directory before giving up Jeff King
  2013-08-30  1:40 ` Eric Sunshine
@ 2013-08-30  4:17 ` Junio C Hamano
  2013-08-30  4:25   ` Jeff King
  2013-08-30  4:28 ` Jeff King
  2 siblings, 1 reply; 6+ messages in thread

From: Junio C Hamano @ 2013-08-30 4:17 UTC (permalink / raw)
To: Jeff King; +Cc: git, Michael Haggerty

Jeff King <peff@peff.net> writes:

> When we read a sha1 file, we first look for a packed
> version, then a loose version, and then re-check the pack
> directory again before concluding that we cannot find it.
> This lets us handle a process that is writing to the
> repository simultaneously (e.g., receive-pack writing a new
> pack followed by a ref update, or git-repack packing
> existing loose objects into a new pack).
>
> However, we do not do the same trick with has_sha1_file; we
> only check the packed objects once, followed by loose
> objects. This means that we might incorrectly report that we
> do not have an object, even though we could find it if we
> simply re-checked the pack directory.

Hmm, would the same reasoning apply to sha1_object_info(), or do existing critical code happen not to have a problematic calling sequence like you noticed for repack?

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH] has_sha1_file: re-check pack directory before giving up
  2013-08-30  4:17 ` Junio C Hamano
@ 2013-08-30  4:25   ` Jeff King
  0 siblings, 0 replies; 6+ messages in thread

From: Jeff King @ 2013-08-30 4:25 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Michael Haggerty

On Thu, Aug 29, 2013 at 09:17:57PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
>
> > When we read a sha1 file, we first look for a packed
> > version, then a loose version, and then re-check the pack
> > directory again before concluding that we cannot find it.
> > This lets us handle a process that is writing to the
> > repository simultaneously (e.g., receive-pack writing a new
> > pack followed by a ref update, or git-repack packing
> > existing loose objects into a new pack).
> >
> > However, we do not do the same trick with has_sha1_file; we
> > only check the packed objects once, followed by loose
> > objects. This means that we might incorrectly report that we
> > do not have an object, even though we could find it if we
> > simply re-checked the pack directory.
>
> Hmm, would the same reasoning apply to sha1_object_info(), or do
> existing critical code happen not to have a problematic calling
> sequence like you noticed for repack?

I think the same reasoning would apply; however, we seem to already do the pack-loose-pack lookup there:

  int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
  {
	[...]
	if (!find_pack_entry(sha1, &e)) {
		/* Most likely it's a loose object. */
		if (!sha1_loose_object_info(sha1, oi)) {
			oi->whence = OI_LOOSE;
			return 0;
		}

		/* Not a loose object; someone else may have just packed it. */
		reprepare_packed_git();
		if (!find_pack_entry(sha1, &e))
			return -1;
	}

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH] has_sha1_file: re-check pack directory before giving up
  2013-08-30  1:10 [PATCH] has_sha1_file: re-check pack directory before giving up Jeff King
  2013-08-30  1:40 ` Eric Sunshine
  2013-08-30  4:17 ` Junio C Hamano
@ 2013-08-30  4:28 ` Jeff King
  2013-08-30 19:14   ` [PATCHv2] " Jeff King
  2 siblings, 1 reply; 6+ messages in thread

From: Jeff King @ 2013-08-30 4:28 UTC (permalink / raw)
To: git; +Cc: Michael Haggerty, Junio C Hamano

On Thu, Aug 29, 2013 at 09:10:52PM -0400, Jeff King wrote:

> In the case of git-fsck, which uses the
> DO_FOR_EACH_INCLUDE_BROKEN flag, this will cause us to
> erroneously complain that the ref points to an invalid
> object. But for git-repack, which does not use that flag, we
> will skip the ref completely!

Hmm. This is slightly inaccurate. fsck does not use INCLUDE_BROKEN, and that is why it recognizes (and prints the warning) the "broken" ref. pack-objects would also print a warning, but would dutifully ignore the broken ref during the repack.

So it is actually something like for-each-ref, which _does_ use INCLUDE_BROKEN, that behaves differently. And it tends to work, because it ends up calling read_sha1_file to find out about the file rather than checking has_sha1_file.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread
* [PATCHv2] has_sha1_file: re-check pack directory before giving up
  2013-08-30  4:28 ` Jeff King
@ 2013-08-30 19:14   ` Jeff King
  0 siblings, 0 replies; 6+ messages in thread

From: Jeff King @ 2013-08-30 19:14 UTC (permalink / raw)
To: git; +Cc: Michael Haggerty, Junio C Hamano

On Fri, Aug 30, 2013 at 12:28:01AM -0400, Jeff King wrote:

> On Thu, Aug 29, 2013 at 09:10:52PM -0400, Jeff King wrote:
>
> > In the case of git-fsck, which uses the
> > DO_FOR_EACH_INCLUDE_BROKEN flag, this will cause us to
> > erroneously complain that the ref points to an invalid
> > object. But for git-repack, which does not use that flag, we
> > will skip the ref completely!
>
> Hmm. This is slightly inaccurate. fsck does not use INCLUDE_BROKEN, and
> that is why it recognizes (and prints the warning) the "broken" ref.
> pack-objects would also print a warning, but would dutifully ignore the
> broken ref during the repack.
>
> So it is actually something like for-each-ref, which _does_ use
> INCLUDE_BROKEN, that behaves differently. And it tends to work, because
> it ends up calling read_sha1_file to find out about the file rather than
> checking has_sha1_file.

Here's a re-roll with a commit message that clarifies the role of DO_FOR_EACH_INCLUDE_BROKEN. The code is the same, and the diff between the commit messages is below.

@@ -34,18 +34,17 @@
 ref is broken. If has_sha1_file returns the wrong answer, we
 erroneously will think that the ref is broken.
 
-In the case of git-fsck, which uses the
-DO_FOR_EACH_INCLUDE_BROKEN flag, this will cause us to
-erroneously complain that the ref points to an invalid
-object. But for git-repack, which does not use that flag, we
-will skip the ref completely! So not only will we fail to
-process the new objects that the ref points to (which is
-acceptabale, since the processes are running simultaneously,
-and we might well do our whole repack before the other
-process updates the ref), but we will not see the ref at
-all. Its old objects may be omitted from the pack (and even
-lost, if --unpack-unreachable is used with an expiration
-time).
+For a normal iteration without DO_FOR_EACH_INCLUDE_BROKEN,
+this means that the caller does not see the ref at all
+(neither the old nor the new value). So not only will we
+fail to see the new value of the ref (which is acceptable,
+since we are running simultaneously with the writer, and we
+might well read the ref before the writer commits its
+write), but we will not see the old value either. For
+programs that act on reachability like pack-objects or
+prune, this can cause data loss, as we may see the objects
+referenced by the original ref value as dangling (and either
+omit them from the pack, or delete them via prune).
 
 There's no test included here, because the success case is
 two processes running simultaneously forever. But you can
@@ -79,7 +78,11 @@
 	# fsck.sh
 	# now run this simultaneously in another terminal; it
 	# repeatedly fscks, looking for us to consider the
-	# newly-pushed ref broken.
+	# newly-pushed ref broken. We cannot use for-each-ref
+	# here, as it uses DO_FOR_EACH_INCLUDE_BROKEN, which
+	# skips the has_sha1_file check (and if it wants
+	# more information on the object, it will actually read
+	# the object, which does the proper two-step lookup)
 	cd parent &&
 	while true; do
 		broken=`git fsck 2>&1 | grep remotes/child`

-- >8 --
Subject: has_sha1_file: re-check pack directory before giving up

When we read a sha1 file, we first look for a packed version, then a loose version, and then re-check the pack directory again before concluding that we cannot find it. This lets us handle a process that is writing to the repository simultaneously (e.g., receive-pack writing a new pack followed by a ref update, or git-repack packing existing loose objects into a new pack).

However, we do not do the same trick with has_sha1_file; we only check the packed objects once, followed by loose objects.
This means that we might incorrectly report that we do not have an object, even though we could find it if we simply re-checked the pack directory.

By itself, this is usually not a big deal. The other process is running simultaneously, so we may run has_sha1_file before it writes, anyway. It is a race whether we see the object or not. However, we may also see other things the writing process has done (like updating refs); and in that case, we must be able to also see the new objects.

For example, imagine we are doing a for_each_ref iteration, and somebody simultaneously pushes. Receive-pack may write the pack and update a ref after we have examined the objects/pack directory, but before the iteration gets to the updated ref. When we do finally see the updated ref, for_each_ref will call has_sha1_file to check whether the ref is broken. If has_sha1_file returns the wrong answer, we erroneously will think that the ref is broken.

For a normal iteration without DO_FOR_EACH_INCLUDE_BROKEN, this means that the caller does not see the ref at all (neither the old nor the new value). So not only will we fail to see the new value of the ref (which is acceptable, since we are running simultaneously with the writer, and we might well read the ref before the writer commits its write), but we will not see the old value either. For programs that act on reachability like pack-objects or prune, this can cause data loss, as we may see the objects referenced by the original ref value as dangling (and either omit them from the pack, or delete them via prune).

There's no test included here, because the success case is two processes running simultaneously forever.
But you can replicate the issue with:

  # base.sh
  # run this in one terminal; it creates and pushes
  # repeatedly to a repository
  git init parent &&
  (cd parent &&

   # create a base commit that will trigger us looking at
   # the objects/pack directory before we hit the updated ref
   echo content >file &&
   git add file &&
   git commit -m base &&

   # set the unpack limit abnormally low, which
   # lets us simulate full-size pushes using tiny ones
   git config receive.unpackLimit 1
  ) &&
  git clone parent child &&
  cd child &&
  n=0 &&
  while true; do
    echo $n >file && git add file && git commit -m $n &&
    git push origin HEAD:refs/remotes/child/master &&
    n=$(($n + 1))
  done

  # fsck.sh
  # now run this simultaneously in another terminal; it
  # repeatedly fscks, looking for us to consider the
  # newly-pushed ref broken. We cannot use for-each-ref
  # here, as it uses DO_FOR_EACH_INCLUDE_BROKEN, which
  # skips the has_sha1_file check (and if it wants
  # more information on the object, it will actually read
  # the object, which does the proper two-step lookup)
  cd parent &&
  while true; do
    broken=`git fsck 2>&1 | grep remotes/child`
    if test -n "$broken"; then
      echo $broken
      exit 1
    fi
  done

Without this patch, the fsck loop fails within a few seconds (and almost instantly if the test repository actually has a large number of refs). With it, the two can run indefinitely.

Signed-off-by: Jeff King <peff@peff.net>
---
 sha1_file.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/sha1_file.c b/sha1_file.c
index e1b2290..02ebb80 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2640,7 +2640,10 @@ int has_sha1_file(const unsigned char *sha1)
 
 	if (find_pack_entry(sha1, &e))
 		return 1;
-	return has_loose_object(sha1);
+	if (has_loose_object(sha1))
+		return 1;
+	reprepare_packed_git();
+	return find_pack_entry(sha1, &e);
 }
 
 static void check_tree(const void *buf, size_t size)
-- 
1.8.4.2.g87d4a77

^ permalink raw reply related	[flat|nested] 6+ messages in thread
end of thread, other threads: [~2013-08-30 19:14 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-30  1:10 [PATCH] has_sha1_file: re-check pack directory before giving up Jeff King
2013-08-30  1:40 ` Eric Sunshine
2013-08-30  4:17 ` Junio C Hamano
2013-08-30  4:25   ` Jeff King
2013-08-30  4:28 ` Jeff King
2013-08-30 19:14   ` [PATCHv2] " Jeff King