From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: "René Scharfe" <l.s.r@web.de>,
"Shawn Pearce" <spearce@spearce.org>,
"Martin von Gagern" <Martin.vGagern@gmx.net>,
git@vger.kernel.org
Subject: Re: [PATCH 3/2] t5309: mark delta-cycle failover tests as passing
Date: Sun, 31 Aug 2014 11:15:50 -0400
Message-ID: <20140831151550.GA16499@peff.net>
In-Reply-To: <20140830132311.GA14709@peff.net>
On Sat, Aug 30, 2014 at 09:23:11AM -0400, Jeff King wrote:
> The implications of this make me slightly nervous, though. In the
> --fix-thin case, the resulting pack will have 3 objects:
>
> - A as a delta on B
> - B as a delta on A
> - a full copy of either A (or B) provided by --fix-thin
>
> We create a .idx that has duplicate entries for A. If a reader is trying
> to reconstruct B and they find the full copy of A, they're fine. If they
> find the delta copy, what happens?
>
> Ideally the reader would say "hey, I can't reconstruct A here, let me
> try to find another copy". But I am not sure if that happens, or if we
> are even capable of finding another copy of A (certainly we can find one
> in another pack, but I do not think we are smart enough to find a
> duplicate in the same pack).
The main reason this made me nervous before is that I did not fully
understand _why_ it worked with the current code. That bugged me, so I
dug further. And the answer is that it does not work; it just happens
to succeed for some small cases.
Try this on top:
diff --git a/t/t5309-pack-delta-cycles.sh b/t/t5309-pack-delta-cycles.sh
index 5309095..4086983 100755
--- a/t/t5309-pack-delta-cycles.sh
+++ b/t/t5309-pack-delta-cycles.sh
@@ -58,9 +58,19 @@ test_expect_success 'index-pack detects REF_DELTA cycles' '
 
 test_expect_success 'failover to an object in another pack' '
 	clear_packs &&
+	{
+		pack_header 100 &&
+		for i in $(test_seq 50); do
+			pack_obj $A $B &&
+			pack_obj $B $A || break
+		done
+	} >megacycle.pack &&
+	pack_trailer megacycle.pack &&
 	git index-pack --stdin <ab.pack &&
-	git index-pack --stdin --fix-thin <cycle.pack &&
-	test_must_fail git index-pack --strict --stdin --fix-thin <cycle.pack
+	git index-pack --stdin --fix-thin <megacycle.pack &&
+	echo >&2 indexed pack successfully... &&
+	git fsck &&
+	echo >&2 actually re-read pack successfully
 '
 
 test_expect_success 'failover to a duplicate object in the same pack' '
This pack has the same cycle problem; we are just adding many more
copies of each delta to it. That means any given sha1-lookup in the
index is more likely to hit a delta than the base object.
We successfully index the pack, but our fsck goes into an infinite loop.
Yikes.
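To put a rough number on "more likely": the sketch below assumes, purely
for intuition, that a lookup is equally likely to land on any of the
index entries sharing an object id. That is not how the real binary
search picks among duplicates, and delta_hit_fraction is a made-up
helper, not anything in git.

```python
from fractions import Fraction


def delta_hit_fraction(delta_copies, full_copies=1):
    """Fraction of the index entries for one object id that are deltas.

    Assumes every duplicate entry is equally likely to be returned,
    which is a simplification for illustration only.
    """
    return Fraction(delta_copies, delta_copies + full_copies)


# cycle.pack: one delta copy of A plus the full copy added by --fix-thin
print(delta_hit_fraction(1))
# megacycle.pack: fifty delta copies of A plus one full copy
print(delta_hit_fraction(50))
```

Under that assumption the small pack hits the full copy half the time,
while the 100-entry pack almost never does.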
I haven't really looked into it, but I suspect we would need some kind
of cycle detection on the delta resolution (and possibly to teach the
sha1-lookup to recognize duplicate objects in the pack and treat them
individually). Frankly, I don't think it is worth the effort or
complexity. We should probably just declare delta cycles insane and
reject them outright.
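For a sense of what that extra machinery might involve, here is a toy
model in Python. It is not git code and not a proposed patch: the entry
tuples, the resolve() function and the "append the payload" delta are
all invented for illustration. It tracks which ids are currently being
resolved and, on hitting one again, falls back to the next duplicate.

```python
class UnresolvableObject(Exception):
    """Raised when no copy of an object can be reconstructed."""


def resolve(oid, entries, in_progress=None):
    """Return the content of `oid`, trying every copy in `entries`.

    `entries` is a list of (oid, base_oid, payload) tuples; base_oid is
    None for a full object. `in_progress` holds the ids being resolved
    further up the call stack, which is how a cycle is recognised.
    """
    if in_progress is None:
        in_progress = set()
    if oid in in_progress:
        raise UnresolvableObject(f"delta cycle through {oid}")
    in_progress.add(oid)
    try:
        for entry_oid, base_oid, payload in entries:
            if entry_oid != oid:
                continue
            if base_oid is None:
                return payload  # full copy, nothing to resolve
            try:
                base = resolve(base_oid, entries, in_progress)
            except UnresolvableObject:
                continue  # this copy is unusable, try a duplicate
            return base + payload  # toy "delta": append to the base
        raise UnresolvableObject(f"no usable copy of {oid}")
    finally:
        in_progress.discard(oid)


# A as a delta on B, B as a delta on A, plus a full copy of A.
pack = [("A", "B", "+a"), ("B", "A", "+b"), ("A", None, "a-full")]
print(resolve("A", pack))
print(resolve("B", pack))
```

Even in this toy, every lookup has to walk all duplicates and carry
state through the recursion, which hints at the complexity cost.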
We used to do that because the only way to correctly resolve them was by
introducing a duplicate base object, and we did not allow that. Patch 2
from my series loosened this, which makes index-pack work, but not
necessarily the rest of git. And since index-pack is the gatekeeper on
receiving objects from remotes, it needs to be the _most_ picky. So my
series is definitely a regression as-is.
We can solve this in one of three ways:
  1. Teach the rest of git to handle recoverable delta cycles. This is
     probably crazy and not worth the effort (and it just lets crap
     through that will hurt other git implementations, too).

  2. Continue to let through duplicate objects in index-pack, but
     specifically detect and reject delta cycles. This is more work (I'm
     not sure yet how easy it would be to detect cycles), but it would
     mean we can treat duplicates (a much less nasty problem) and cycles
     differently.

  3. Go back to outlawing duplicate bases. This is very easy. Just drop
     my patch 2. :)
I am inclined to go with the third option. There has already been a
suggestion from Shawn that we disallow duplicates entirely, and I was
tempted to go that direction even without this finding. But to me this
makes it a no-brainer; the question has gone from "how strict do we want
to be" to "do we want to protect the rest of the code against useless
and potentially harmful violations of their assumptions".
If we do go with (3), that opens up two new questions:
  a. Should we disallow _all_ duplicates, or just those that are bases?
     This is actually easy to code; the assert() in find_unresolved_deltas
     catches the bases, and the .idx writer catches any other ones.

  b. How optional do we want to make this? Right now (without this
     series) the delta-base duplicates always die, and regular
     duplicates are prevented only under --strict.

     If we treat them the same, it should probably be die-by-default.
     Should there be an optional mode to let this stuff through (i.e., an
     "I know this might cause problems with the rest of git, but I am
     desperate to get the data out of this pack" mode)?

     If we treat them differently, there is not much harm in an option
     to loosen regular duplicates, as I think the rest of git handles
     it. For bases, in theory you might be able to recover some data.
     But you may also run into this infinite loop. It is very much "at
     your own risk".

     I wonder if index-pack is really the right place for such a "please
     help me get the data out of this broken pack" operation in the
     first place. If it is a broken pack, we are probably much better
     off exploding it into loose objects than trying to index a broken
     pack. That's way less efficient, but this should be a last resort.
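On question (a): part of why catching every duplicate is cheap is that
an index is sorted by object id, so duplicates end up adjacent. A
minimal sketch of that check, with find_duplicates as a hypothetical
helper rather than anything in index-pack:

```python
def find_duplicates(object_ids):
    """Return the ids that occur more than once, in sorted order."""
    ordered = sorted(object_ids)
    dups = []
    # After sorting, equal ids are neighbours, so one pass is enough.
    for prev, cur in zip(ordered, ordered[1:]):
        if prev == cur and (not dups or dups[-1] != cur):
            dups.append(cur)
    return dups


print(find_duplicates(["b", "a", "b", "c", "a", "a"]))
```

A strict writer would die on the first such pair instead of collecting
them; the list is only there to make the toy easy to inspect.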
I think my preference is to outlaw all duplicates unconditionally in
index-pack. Catch the duplicate base in find_unresolved_deltas as we do
now, but improve the error message. Confirm that unpack-objects can
handle these cases. Optionally attach advice to the duplicate errors
directing people to use "unpack-objects" and/or set
transfer.unpackLimit higher.
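For readers unfamiliar with the knob: as I understand the documented
behavior, a transfer with fewer objects than the limit (default 100) is
exploded into loose objects, and anything at or above it is kept as a
pack and indexed. The function below is only a model of that rule;
choose_unpacker is an invented name.

```python
def choose_unpacker(object_count, unpack_limit=100):
    """Name the program that would process an incoming pack.

    Models the documented unpackLimit rule: strictly below the limit
    means loose objects, otherwise the pack is kept and indexed.
    """
    if object_count < unpack_limit:
        return "unpack-objects"
    return "index-pack"


print(choose_unpacker(99))
print(choose_unpacker(100))
print(choose_unpacker(101, unpack_limit=1000))
```

Raising the limit therefore routes more incoming packs around
index-pack, which is the point of the advice.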
-Peff