* duplicate objects in packfile
@ 2013-08-14 18:17 Jeff King
  2013-08-14 18:29 ` Junio C Hamano
  2013-08-14 18:50 ` duplicate objects in packfile Nicolas Pitre
  0 siblings, 2 replies; 37+ messages in thread
From: Jeff King @ 2013-08-14 18:17 UTC (permalink / raw)
  To: git; +Cc: Shawn O. Pearce, Junio C Hamano, Nicolas Pitre

I'm tracking down a rather odd problem in a packfile I found on GitHub.
This particular packfile contains the same object at various offsets
within the file. In fact, there are several packfiles that exhibit this
behavior, all in the same repository, and each one contains several
duplicated objects (some of which are present 3 or even 4 times).

index-pack is happy to index these packfiles, and just creates multiple
entries for the object. The usual binary search in find_pack_entry_one
will find one of them (though which one it finds depends on the exact
layout of the index). But curiously, the experimental sha1_entry_pos
lookup does not. It hits an assert() that can only be triggered in the
face of duplicate objects. For example:

  $ git cat-file -t 4ea4acbcb0930ac42acc87a0d203864dec1a9697
  commit

  $ GIT_USE_LOOKUP=1 git cat-file -t 4ea4acbcb0930ac42acc87a0d203864dec1a9697
  git: sha1-lookup.c:222: sha1_entry_pos: Assertion `lov < hiv' failed.
  Aborted

It's easy enough to fix with a repack, but I am curious:

  1. Is sha1_entry_pos wrong to barf on duplicate items in the index?
     If so, do we want to fix it, or simply retire GIT_USE_LOOKUP?

     Related, should we consider duplicate items in a packfile to be a
     bogus packfile (and consequently notice and complain during
     indexing)? I don't think it _hurts_ anything (aside from the
     assert above), though it is of course wasteful.

     I didn't go into detail on how the assertion above is triggered,
     but I can break it down line by line if anybody cares; the short
     of it is that it can only be caused by an unsorted index or a
     duplicate entry.

  2. How can duplicate entries get into a packfile? Git itself should
     not generate duplicate entries (pack-objects is careful to remove
     duplicates). Since these packs almost certainly were pushed by a
     client, I wondered if "index-pack --fix-thin" might accidentally
     add multiple copies of an object when it is the preferred base
     for multiple objects, but it specifically avoids doing so.

     The packs in question were received by us in 2010. Did we ever
     have bugs in this area? I don't recall any, nor could I find any
     in the history. That makes me suspect the user may have been
     using an alternate git implementation. libgit2 did not know how
     to pack in 2010, and Grit still doesn't. JGit did, and I don't
     know offhand about Dulwich. Does anyone know of bugs in those
     implementations that might have caused this?

The packs in question are public, so I can share them if anybody is
curious to look (but the whole repository is on the order of 700M).

Given the dates on the packs and how rare this is, I'm pretty much
willing to chalk it up to a random bug (in git or otherwise) that no
longer exists. But I was curious if anybody else knew anything, and we
may want to fix sha1_entry_pos to behave more like the regular binary
search.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread
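
The contrast between the two lookups is easy to see on a toy model. A
plain bisection only ever compares keys, so duplicate keys merely mean
it returns an arbitrary member of the run, just like find_pack_entry_one
above (a sketch for illustration, not git's code):

	#include <string.h>

	/*
	 * Toy bisection over n sorted 20-byte keys; returns the index of a
	 * match (an arbitrary one if the key is duplicated), or -1 if absent.
	 */
	static int bisect(const unsigned char (*keys)[20], int n,
			  const unsigned char *want)
	{
		int lo = 0, hi = n;

		while (lo < hi) {
			int mi = lo + (hi - lo) / 2;
			int cmp = memcmp(keys[mi], want, 20);

			if (!cmp)
				return mi;
			if (cmp < 0)
				lo = mi + 1;
			else
				hi = mi;
		}
		return -1;
	}

sha1_entry_pos, by contrast, interpolates: it derives numeric values lov
and hiv from the endpoint keys and picks the next probe by scaling the
target's distance over the range between them, roughly
mi = lo + (kyv - lov) * (hi - lo) / (hiv - lov). Once both endpoints sit
inside a run of duplicate keys, lov == hiv, the scaling would divide by
zero, and that is the condition the assert(lov < hiv) above trips on.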
* Re: duplicate objects in packfile
  2013-08-14 18:17 duplicate objects in packfile Jeff King
@ 2013-08-14 18:29 ` Junio C Hamano
  2013-08-14 18:39 ` Junio C Hamano
  2013-08-14 18:50 ` duplicate objects in packfile Nicolas Pitre
  1 sibling, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2013-08-14 18:29 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Shawn O. Pearce, Nicolas Pitre

Jeff King <peff@peff.net> writes:

> lookup does not. It hits an assert() that can only be triggered in the
> face of duplicate objects. For example:
>
>   $ git cat-file -t 4ea4acbcb0930ac42acc87a0d203864dec1a9697
>   commit
>
>   $ GIT_USE_LOOKUP=1 git cat-file -t 4ea4acbcb0930ac42acc87a0d203864dec1a9697
>   git: sha1-lookup.c:222: sha1_entry_pos: Assertion `lov < hiv' failed.
>   Aborted

Older versions of JGit used to put duplicate objects in a pack, and
IIRC Shawn declared that a bug in the packer and fixed it, so from
that point of view, I think rejecting is probably the right thing,
even though I think warning and continuing is also acceptable and
indeed may be better.

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: duplicate objects in packfile
  2013-08-14 18:29 ` Junio C Hamano
@ 2013-08-14 18:39 ` Junio C Hamano
  2013-08-14 18:54 ` Nicolas Pitre
  2013-08-16 15:01 ` Jeff King
  0 siblings, 2 replies; 37+ messages in thread
From: Junio C Hamano @ 2013-08-14 18:39 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Shawn O. Pearce, Nicolas Pitre

Junio C Hamano <gitster@pobox.com> writes:

> Jeff King <peff@peff.net> writes:
>
>> lookup does not. It hits an assert() that can only be triggered in the
>> face of duplicate objects. For example:
>>
>>   $ git cat-file -t 4ea4acbcb0930ac42acc87a0d203864dec1a9697
>>   commit
>>
>>   $ GIT_USE_LOOKUP=1 git cat-file -t 4ea4acbcb0930ac42acc87a0d203864dec1a9697
>>   git: sha1-lookup.c:222: sha1_entry_pos: Assertion `lov < hiv' failed.
>>   Aborted
>
> Older versions of JGit used to put duplicate objects in a pack, and
> IIRC Shawn declared that a bug in the packer and fixed it, so from
> that point of view, I think rejecting is probably the right thing,
> even though I think warning and continuing is also acceptable and
> indeed may be better.

Also repacking may have a funny corner case. I do not recall the
details as the above was a long time ago, but when I was tracking it
down, a delta was made against one copy of the base object, and
referred to it using delta-offset, while there was another copy of
the base object which was found by the bisection search, and from
there on, the inconsistencies between these two (they represent the
same payload, but they are at different offsets in the same pack and
with different in-pack sizes) led to some funky behaviour.

^ permalink raw reply	[flat|nested] 37+ messages in thread
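
The corner case hinges on how a delta names its base. An OFS_DELTA
entry stores a byte distance back from the delta's own position, so it
is pinned to one particular copy; a lookup by sha1 goes through the
.idx instead and is free to land on the other copy, whose offset and
in-pack size differ. A self-contained sketch of the base-offset
decoding (the encoding is the pack format's; the function name is
invented):

	#include <stddef.h>
	#include <sys/types.h>

	/*
	 * Decode the variable-length base offset that follows an OFS_DELTA
	 * header in a packfile; the base object starts this many bytes
	 * before the delta's own position, tying the delta to exactly one
	 * copy of the base even if the same object appears elsewhere in
	 * the pack.
	 */
	static off_t ofs_delta_base(const unsigned char *p, size_t *used)
	{
		size_t i = 0;
		unsigned char c = p[i++];
		off_t offset = c & 0x7f;

		while (c & 0x80) {
			c = p[i++];
			offset = ((offset + 1) << 7) + (c & 0x7f);
		}
		*used = i;
		return offset;
	}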
* Re: duplicate objects in packfile
  2013-08-14 18:39 ` Junio C Hamano
@ 2013-08-14 18:54 ` Nicolas Pitre
  0 siblings, 0 replies; 37+ messages in thread
From: Nicolas Pitre @ 2013-08-14 18:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Shawn O. Pearce

On Wed, 14 Aug 2013, Junio C Hamano wrote:

> Also repacking may have a funny corner case. I do not recall the
> details as the above was a long time ago, but when I was tracking it
> down, a delta was made against one copy of the base object, and
> referred to it using delta-offset, while there was another copy of
> the base object which was found by the bisection search, and from
> there on, the inconsistencies between these two (they represent the
> same payload, but they are at different offsets in the same pack and
> with different in-pack sizes) led to some funky behaviour.

Crap. Better try to cope with this (with a test case, etc.) rather
than rejecting them now I'd say.

Strictly speaking, they're still valid pack files.

Nicolas

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: duplicate objects in packfile
  2013-08-14 18:39 ` Junio C Hamano
  2013-08-14 18:54 ` Nicolas Pitre
@ 2013-08-16 15:01 ` Jeff King
  2013-08-21 20:49 ` [RFC/PATCH 0/4] duplicate objects in packfiles Jeff King
  1 sibling, 1 reply; 37+ messages in thread
From: Jeff King @ 2013-08-16 15:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Shawn O. Pearce, Nicolas Pitre

On Wed, Aug 14, 2013 at 11:39:34AM -0700, Junio C Hamano wrote:

> > Older versions of JGit used to put duplicate objects in a pack, and
> > IIRC Shawn declared that a bug in the packer and fixed it, so from
> > that point of view, I think rejecting is probably the right thing,
> > even though I think warning and continuing is also acceptable and
> > indeed may be better.
>
> Also repacking may have a funny corner case. I do not recall the
> details as the above was a long time ago, but when I was tracking it
> down, a delta was made against one copy of the base object, and
> referred to it using delta-offset, while there was another copy of
> the base object which was found by the bisection search, and from
> there on, the inconsistencies between these two (they represent the
> same payload, but they are at different offsets in the same pack and
> with different in-pack sizes) led to some funky behaviour.

Thanks for the pointer. I found this commit:

  https://eclipse.googlesource.com/jgit/jgit/+/2fbf296fda205446eac11a13abd4fcdb182f28d9

which is presumably what you're thinking of. I did not run into the
problem described in my case, but presumably I did not have a delta
cycle between the multiple versions.

In theory we should find the same copy of the object each time we
search, but there are enough code paths to access the objects that I
would not be surprised if such funkiness is still triggerable,
including infinite loops.

That makes me inclined to teach index-pack to reject duplicate objects
in a single pack in order to prevent denial-of-service attacks. We can
potentially make them work in all code paths, but given that nobody
should be doing this legitimately, rejecting the duplicates outright
keeps our attack surface small, and nobody but attackers or users of
broken implementations should care.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread
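
The infinite-loop worry comes from the usual way delta resolution
recurses on the base. A sketch of the hazard (hypothetical resolver,
not index-pack's actual code; the two helpers are assumed):

	/* assumed helpers and layout, not real git API */
	struct obj {
		int is_delta;
		struct obj *base;	/* valid when is_delta is set */
	};
	void *inflate_full(struct obj *o);
	void *apply_delta_to(void *base, struct obj *o);

	/*
	 * Expanding an object means expanding its base first. If
	 * duplicate (or just broken) entries let A be stored as a delta
	 * against B while B is stored as a delta against A, this
	 * recursion never bottoms out.
	 */
	static void *expand(struct obj *o)
	{
		if (!o->is_delta)
			return inflate_full(o);
		return apply_delta_to(expand(o->base), o);
	}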
* [RFC/PATCH 0/4] duplicate objects in packfiles
  2013-08-16 15:01 ` Jeff King
@ 2013-08-21 20:49 ` Jeff King
  2013-08-21 20:51 ` [PATCH 1/4] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
  ` (4 more replies)
  0 siblings, 5 replies; 37+ messages in thread
From: Jeff King @ 2013-08-21 20:49 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Shawn O. Pearce, Nicolas Pitre

On Fri, Aug 16, 2013 at 11:01:38AM -0400, Jeff King wrote:

> That makes me inclined to teach index-pack to reject duplicate objects
> in a single pack in order to prevent denial-of-service attacks. We can
> potentially make them work in all code paths, but given that nobody
> should be doing this legitimately, rejecting the duplicates outright
> keeps our attack surface small, and nobody but attackers or users of
> broken implementations should care.

Here's a patch series in that direction:

  [1/4]: sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP

    This reproduces and fixes the sha1-lookup bug. We should do this
    no matter what else we do.

  [2/4]: index-pack: optionally reject packs with duplicate objects

    This adds a pack.indexDuplicates option so that sites receiving
    packfiles from random folks on the internet can protect themselves
    from the potential denial-of-service mentioned above. The default
    remains to allow it.

  [3/4]: reject duplicates when indexing a packfile we created

    This is a safety check for packs we generate. Optional, but I
    think it's probably a good idea (and doesn't cost very much).

  [4/4]: index-pack: optionally skip duplicate packfile entries

    I really wanted to have a "fix" mode where we could take in packs
    with duplicate entries and just use them as-is. It's not correct
    to throw away the duplicates (they may be bases in a delta cycle),
    but we could maybe get by with simply not referencing them in the
    pack index. Unfortunately, the pack reader does not like the index
    we generate; see the patch for details and possible solutions (all
    of which are ugly). And it only helps in a delta cycle when delta
    base offsets are used.

I had hoped to have a 5/4 flipping the default to "skip", since it
would potentially fix the infinite loop problem and wouldn't have a
downside. But since it doesn't work (and cannot fix the REF_DELTA
case), it seems like a bad idea.

Which leaves the open question: should the default for index-pack flip
to reject duplicates rather than allow? It seems like it would be worth
it to identify buggy packfiles before they move between repos. And in
an emergency, we have the config flag to be more permissive in case
somebody really needs to move the data via git.

Thoughts?

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread
* [PATCH 1/4] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP 2013-08-21 20:49 ` [RFC/PATCH 0/4] duplicate objects in packfiles Jeff King @ 2013-08-21 20:51 ` Jeff King 2013-08-21 20:52 ` [PATCH 2/4] index-pack: optionally reject packs with duplicate objects Jeff King ` (3 subsequent siblings) 4 siblings, 0 replies; 37+ messages in thread From: Jeff King @ 2013-08-21 20:51 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Shawn O. Pearce, Nicolas Pitre The sha1_entry_pos function tries to be smart about selecting the middle of a range for its binary search by looking at the value differences between the "lo" and "hi" constraints. However, it is unable to cope with entries with duplicate keys in the sorted list. We may hit a point in the search where both our "lo" and "hi" point to the same key. In this case, the range of values between our endpoints is 0, and trying to scale the difference between our key and the endpoints over that range is undefined (i.e., divide by zero). The current code catches this with an "assert(lov < hiv)". Moreover, after seeing that the first 20 byte of the key are the same, we will try to establish a value from the 21st byte. Which is nonsensical. Instead, we can detect the case that we are in a run of duplicates, and simply do a final comparison against any one of them (since they are all the same, it does not matter which). If the keys match, we have found our entry (or one of them, anyway). If not, then we know that we do not need to look further, as we must be in a run of the duplicate key. Furthermore, we know that one of our endpoints must be the edge of the run of duplicates. For example, given this sequence: idx 0 1 2 3 4 5 key A C C C C D If we are searching for "B", we might hit the duplicate run at lo=1, hi=3 (e.g., by first mi=3, then mi=0). But we can never have lo > 1, because B < C. That is, if our key is less than the run, we know that "lo" is the edge, but we can say nothing of "hi". Similarly, if our key is greater than the run, we know that "hi" is the edge, but we can say nothing of "lo". But that is enough for us to return not only "not found", but show the position at which we would insert the new item. Signed-off-by: Jeff King <peff@peff.net> --- sha1-lookup.c | 23 ++++++++++ t/t5308-pack-detect-duplicates.sh | 97 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 120 insertions(+) create mode 100755 t/t5308-pack-detect-duplicates.sh diff --git a/sha1-lookup.c b/sha1-lookup.c index c4dc55d..614cbb6 100644 --- a/sha1-lookup.c +++ b/sha1-lookup.c @@ -204,7 +204,30 @@ int sha1_entry_pos(const void *table, * byte 0 thru (ofs-1) are the same between * lo and hi; ofs is the first byte that is * different. + * + * If ofs==20, then no bytes are different, + * meaning we have entries with duplicate + * keys. We know that we are in a solid run + * of this entry (because the entries are + * sorted, and our lo and hi are the same, + * there can be nothing but this single key + * in between). So we can stop the search. + * Either one of these entries is it (and + * we do not care which), or we do not have + * it. 
*/ + if (ofs == 20) { + mi = lo; + mi_key = base + elem_size * mi + key_offset; + cmp = memcmp(mi_key, key, 20); + if (!cmp) + return mi; + if (cmp < 0) + return -1 - hi; + else + return -1 - lo; + } + hiv = hi_key[ofs_0]; if (ofs_0 < 19) hiv = (hiv << 8) | hi_key[ofs_0+1]; diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh new file mode 100755 index 0000000..4800379 --- /dev/null +++ b/t/t5308-pack-detect-duplicates.sh @@ -0,0 +1,97 @@ +#!/bin/sh + +test_description='handling of duplicate objects in incoming packfiles' +. ./test-lib.sh + +# git will never intentionally create packfiles with +# duplicate objects, so we have to construct them by hand. +# +# $1 is the number of times to duplicate each object (must be +# less than 127 for simplicify of implementation). +create_pack() { + { + # header magic + printf 'PACK' && + # version 2 + printf '\0\0\0\2' && + # num of objects + printf '\0\0\0\'"$(printf "%o" $(($1*2)))" && + + for i in $(test_seq 1 "$1"); do + # blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 + printf '\060\170\234\003\0\0\0\0\1' && + # blob e68fe8129b546b101aee9510c5328e7f21ca1d18 + printf '\062\170\234\143\267\3\0\0\116\0\106' + done + } >tmp.pack && + # we store and cat the pack because we also have to output + # the sha1 pack trailer + cat tmp.pack && + test-sha1 <tmp.pack | hex2bytes && + rm -f tmp.pack +} + +# convert an ascii hex representation of bytes into binary +hex2bytes() { + "$PERL_PATH" -ne 's/[0-9a-f]{2}/print chr hex $&/ge' +} + +# remove any existing packs, since we want to make +# sure future operations use any new packs we are about +# to install +clear_packs() { + rm -f .git/objects/pack/* +} + +# The sha1s we have in our pack. It's important that these have the same +# starting byte, so that they end up in the same fanout section of the index. +# That lets us make sure we are exercising the binary search with both sets. +LO_SHA1=e68fe8129b546b101aee9510c5328e7f21ca1d18 +HI_SHA1=e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 + +# And here's a "missing sha1" which will produce failed lookups. It must also +# be in the same fanout section, and should be between the two (so that during +# our binary search, we are sure to end up looking at one or the other of the +# duplicate runs). +MISSING_SHA1='e69d000000000000000000000000000000000000' + +# double-check that create_pack actually works +test_expect_success 'pack with no duplicates' ' + create_pack 1 >no-dups.pack && + git index-pack --stdin <no-dups.pack +' + +test_expect_success 'index-pack will allow duplicate objects by default' ' + clear_packs && + create_pack 100 >dups.pack && + git index-pack --stdin <dups.pack +' + +test_expect_success 'create test vectors' ' + cat >input <<-EOF && + $LO_SHA1 + $HI_SHA1 + $MISSING_SHA1 + EOF + cat >expect <<-EOF + $LO_SHA1 blob 2 + $HI_SHA1 blob 0 + $MISSING_SHA1 missing + EOF +' + +test_expect_success 'lookup in duplicated pack (binary search)' ' + git cat-file --batch-check <input >actual && + test_cmp expect actual +' + +test_expect_success 'lookup in duplicated pack (GIT_USE_LOOKUP)' ' + ( + GIT_USE_LOOKUP=1 && + export GIT_USE_LOOKUP && + git cat-file --batch-check <input >actual + ) && + test_cmp expect actual +' + +test_done -- 1.8.4.rc2.28.g6bb5f3f ^ permalink raw reply related [flat|nested] 37+ messages in thread
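
The "one endpoint is the edge of the run" reasoning from the commit
message can be checked with a toy model of the same search, using
single characters in place of 20-byte sha1s (illustration only, not
git code):

	#include <stdio.h>

	/*
	 * Toy bisection with the same return convention as
	 * sha1_entry_pos: a non-negative index on a hit, or -1 - pos on
	 * a miss, where pos is the position at which the key would be
	 * inserted.
	 */
	static int lookup(const char *keys, int n, char want)
	{
		int lo = 0, hi = n;

		while (lo < hi) {
			int mi = lo + (hi - lo) / 2;

			if (keys[mi] == want)
				return mi;
			if (keys[mi] < want)
				lo = mi + 1;
			else
				hi = mi;
		}
		return -1 - lo;
	}

	int main(void)
	{
		const char *keys = "ACCCCD";	/* idx 0 1 2 3 4 5 */

		printf("%d\n", lookup(keys, 6, 'B'));	/* -2: insert at idx 1 */
		printf("%d\n", lookup(keys, 6, 'C'));	/* hits somewhere in 1..4 */
		return 0;
	}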
* [PATCH 2/4] index-pack: optionally reject packs with duplicate objects 2013-08-21 20:49 ` [RFC/PATCH 0/4] duplicate objects in packfiles Jeff King 2013-08-21 20:51 ` [PATCH 1/4] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King @ 2013-08-21 20:52 ` Jeff King 2013-08-22 13:45 ` Duy Nguyen 2013-08-21 20:53 ` [PATCH 3/4] reject duplicates when indexing a packfile we created Jeff King ` (2 subsequent siblings) 4 siblings, 1 reply; 37+ messages in thread From: Jeff King @ 2013-08-21 20:52 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Shawn O. Pearce, Nicolas Pitre Git should never generate packs with duplicate objects. However, we may see such packs due to bugs in Git or other implementations (e.g., JGit had such a bug a few years ago). In theory, such packs should not be a problem for us (we will simply find one of the instances of the object when looking in the pack). However, the JGit bug report mentioned possible infinite loops during repacking due to cycles in the delta chain. Though this problem hasn't specifically been reproduced on modern git, there is no reason not to be careful with incoming packs, given that only buggy implementations should be producing such packs, anyway. This patch introduces the pack.indexDuplicates option to allow or reject such packs from index-pack. The default remains to allow it. Signed-off-by: Jeff King <peff@peff.net> --- builtin/index-pack.c | 10 ++++++++++ pack-write.c | 23 +++++++++++++++++++++++ pack.h | 5 +++++ t/t5308-pack-detect-duplicates.sh | 8 ++++++++ 4 files changed, 46 insertions(+) diff --git a/builtin/index-pack.c b/builtin/index-pack.c index 9c1cfac..83fd3bb 100644 --- a/builtin/index-pack.c +++ b/builtin/index-pack.c @@ -1369,6 +1369,16 @@ static int git_index_pack_config(const char *k, const char *v, void *cb) #endif return 0; } + if (!strcmp(k, "pack.indexduplicates")) { + int boolval = git_config_maybe_bool(k, v); + if (boolval > 0 || !strcmp(v, "ignore")) + opts->duplicates = WRITE_IDX_DUPLICATES_IGNORE; + else if (boolval == 0 || !strcmp(v, "reject")) + opts->duplicates = WRITE_IDX_DUPLICATES_REJECT; + else + die("unknown value for pack.indexduplicates: %s", v); + return 0; + } return git_default_config(k, v, cb); } diff --git a/pack-write.c b/pack-write.c index ca9e63b..542b081 100644 --- a/pack-write.c +++ b/pack-write.c @@ -37,6 +37,19 @@ static int need_large_offset(off_t offset, const struct pack_idx_option *opts) sizeof(ofsval), cmp_uint32); } +static void *find_duplicate(void *vbase, size_t n, size_t size, + int (*cmp)(const void *, const void *)) +{ + unsigned char *base = vbase; + while (n > 1) { + if (!cmp(base, base + size)) + return base; + base += size; + n--; + } + return NULL; +} + /* * On entry *sha1 contains the pack content SHA1 hash, on exit it is * the SHA1 hash of sorted object names. 
The objects array passed in @@ -68,6 +81,16 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec else sorted_by_sha = list = last = NULL; + if (opts->duplicates == WRITE_IDX_DUPLICATES_REJECT) { + struct pack_idx_entry **dup; + + dup = find_duplicate(sorted_by_sha, nr_objects, + sizeof(*sorted_by_sha), sha1_compare); + if (dup) + die("pack has duplicate entries for %s", + sha1_to_hex((*dup)->sha1)); + } + if (opts->flags & WRITE_IDX_VERIFY) { assert(index_name); f = sha1fd_check(index_name); diff --git a/pack.h b/pack.h index aa6ee7d..cd4b536 100644 --- a/pack.h +++ b/pack.h @@ -44,6 +44,11 @@ struct pack_idx_option { uint32_t version; uint32_t off32_limit; + enum { + WRITE_IDX_DUPLICATES_IGNORE = 0, + WRITE_IDX_DUPLICATES_REJECT + } duplicates; + /* * List of offsets that would fit within off32_limit but * need to be written out as 64-bit entity for byte-for-byte diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh index 4800379..0f2d928 100755 --- a/t/t5308-pack-detect-duplicates.sh +++ b/t/t5308-pack-detect-duplicates.sh @@ -94,4 +94,12 @@ test_expect_success 'lookup in duplicated pack (GIT_USE_LOOKUP)' ' test_cmp expect actual ' +test_expect_success 'index-pack can reject packs with duplicates' ' + clear_packs && + create_pack 2 >dups.pack && + test_must_fail \ + git -c pack.indexDuplicates=0 index-pack --stdin <dups.pack && + test_expect_code 1 git cat-file -e $LO_SHA1 +' + test_done -- 1.8.4.rc2.28.g6bb5f3f ^ permalink raw reply related [flat|nested] 37+ messages in thread
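
One property makes the single linear pass in find_duplicate sufficient:
write_idx_file only sees the entries after they have been sorted by
sha1 (the sorted_by_sha array above), and in sorted order any
duplicates must be adjacent. The same idea over ints (toy version, for
illustration):

	#include <stddef.h>

	/*
	 * On sorted input, equal elements sit next to each other, so
	 * checking each neighboring pair visits every possible duplicate
	 * exactly once, in O(n) with no extra state.
	 */
	static const int *find_dup(const int *base, size_t n)
	{
		size_t i;

		for (i = 0; i + 1 < n; i++)
			if (base[i] == base[i + 1])
				return &base[i];
		return NULL;
	}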
* Re: [PATCH 2/4] index-pack: optionally reject packs with duplicate objects
  2013-08-21 20:52 ` [PATCH 2/4] index-pack: optionally reject packs with duplicate objects Jeff King
@ 2013-08-22 13:45 ` Duy Nguyen
  2013-08-22 14:43 ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Duy Nguyen @ 2013-08-22 13:45 UTC (permalink / raw)
  To: Jeff King
  Cc: Git Mailing List, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre

On Thu, Aug 22, 2013 at 3:52 AM, Jeff King <peff@peff.net> wrote:
> @@ -68,6 +81,16 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
>  	else
>  		sorted_by_sha = list = last = NULL;
>
> +	if (opts->duplicates == WRITE_IDX_DUPLICATES_REJECT) {
> +		struct pack_idx_entry **dup;
> +
> +		dup = find_duplicate(sorted_by_sha, nr_objects,
> +				     sizeof(*sorted_by_sha), sha1_compare);
> +		if (dup)
> +			die("pack has duplicate entries for %s",
> +			    sha1_to_hex((*dup)->sha1));
> +	}
> +
>  	if (opts->flags & WRITE_IDX_VERIFY) {
>  		assert(index_name);
>  		f = sha1fd_check(index_name);

write_idx_file() is called after index-pack processes all delta
objects. Could resolve_deltas() go cyclic with certain duplicate
object setup?
--
Duy

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 2/4] index-pack: optionally reject packs with duplicate objects
  2013-08-22 13:45 ` Duy Nguyen
@ 2013-08-22 14:43 ` Jeff King
  2013-08-22 23:12 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2013-08-22 14:43 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre

On Thu, Aug 22, 2013 at 08:45:19PM +0700, Nguyen Thai Ngoc Duy wrote:

> On Thu, Aug 22, 2013 at 3:52 AM, Jeff King <peff@peff.net> wrote:
> > @@ -68,6 +81,16 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
> >  	else
> >  		sorted_by_sha = list = last = NULL;
> >
> > +	if (opts->duplicates == WRITE_IDX_DUPLICATES_REJECT) {
> > +		struct pack_idx_entry **dup;
> > +
> > +		dup = find_duplicate(sorted_by_sha, nr_objects,
> > +				     sizeof(*sorted_by_sha), sha1_compare);
> > +		if (dup)
> > +			die("pack has duplicate entries for %s",
> > +			    sha1_to_hex((*dup)->sha1));
> > +	}
> > +
> >  	if (opts->flags & WRITE_IDX_VERIFY) {
> >  		assert(index_name);
> >  		f = sha1fd_check(index_name);
>
> write_idx_file() is called after index-pack processes all delta
> objects. Could resolve_deltas() go cyclic with certain duplicate
> object setup?

Good question. I'm not sure. I'll check it out.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread
* [PATCHv2 0/6] duplicate objects and delta cycles, oh my!
  2013-08-22 14:43 ` Jeff King
@ 2013-08-22 23:12 ` Jeff King
  2013-08-22 23:12 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
  ` (6 more replies)
  0 siblings, 7 replies; 37+ messages in thread
From: Jeff King @ 2013-08-22 23:12 UTC (permalink / raw)
  To: Git Mailing List
  Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre

On Thu, Aug 22, 2013 at 10:43:05AM -0400, Jeff King wrote:

> > write_idx_file() is called after index-pack processes all delta
> > objects. Could resolve_deltas() go cyclic with certain duplicate
> > object setup?
>
> Good question. I'm not sure. I'll check it out.

I think the answer is "no", based on both reasoning and testing (both
of which are included in patches 3-4 of the series below). So here's
my re-roll of the series.

  [1/6]: test-sha1: add a binary output mode

    New in this iteration; the previous version piped test-sha1 into
    perl to create the pack trailer, but with this simple change we
    can drop the perl dependency.

  [2/6]: sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP

    Same code as before. I've factored the pack-creation bits from the
    tests into lib-pack.sh, so they can be reused elsewhere when we
    want to create bogus packs (and patches 3-4 reuse them here).

  [3/6]: add tests for indexing packs with delta cycles
  [4/6]: test index-pack on packs with recoverable delta cycles

    New tests covering delta cycles.

  [5/6]: index-pack: optionally reject packs with duplicate objects

    Similar to before, but I converted the config flag to a simple
    boolean (since we scrapped the "fix" of the tri-state "allow,
    reject, fix").

  [6/6]: default pack.indexDuplicates to false

    This flips the safety check on by default everywhere (before, it
    was left off for index-pack).

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread
* [PATCH 1/6] test-sha1: add a binary output mode
  2013-08-22 23:12 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
@ 2013-08-22 23:12 ` Jeff King
  2013-08-22 23:14 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
  ` (5 subsequent siblings)
  6 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-22 23:12 UTC (permalink / raw)
  To: Git Mailing List
  Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre

The test-sha1 helper program will run our internal sha1
routines over its input and output the 40-byte hex sha1.
Sometimes, however, it is useful to have the binary 20-byte
sha1 (and it's a pain to convert back in the shell). Let's
add a "-b" option to output the binary version.

Signed-off-by: Jeff King <peff@peff.net>
---
 test-sha1.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/test-sha1.c b/test-sha1.c
index 80daba9..e57eae1 100644
--- a/test-sha1.c
+++ b/test-sha1.c
@@ -5,10 +5,15 @@ int main(int ac, char **av)
 	git_SHA_CTX ctx;
 	unsigned char sha1[20];
 	unsigned bufsz = 8192;
+	int binary = 0;
 	char *buffer;
 
-	if (ac == 2)
-		bufsz = strtoul(av[1], NULL, 10) * 1024 * 1024;
+	if (ac == 2) {
+		if (!strcmp(av[1], "-b"))
+			binary = 1;
+		else
+			bufsz = strtoul(av[1], NULL, 10) * 1024 * 1024;
+	}
 
 	if (!bufsz)
 		bufsz = 8192;
@@ -42,6 +47,10 @@ int main(int ac, char **av)
 		git_SHA1_Update(&ctx, buffer, this_sz);
 	}
 	git_SHA1_Final(sha1, &ctx);
-	puts(sha1_to_hex(sha1));
+
+	if (binary)
+		fwrite(sha1, 1, 20, stdout);
+	else
+		puts(sha1_to_hex(sha1));
 	exit(0);
 }
--
1.8.4.rc2.28.g6bb5f3f

^ permalink raw reply related	[flat|nested] 37+ messages in thread
* [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP 2013-08-22 23:12 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King 2013-08-22 23:12 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King @ 2013-08-22 23:14 ` Jeff King 2013-08-23 16:41 ` Junio C Hamano 2013-08-23 19:41 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Johannes Sixt 2013-08-22 23:14 ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King ` (4 subsequent siblings) 6 siblings, 2 replies; 37+ messages in thread From: Jeff King @ 2013-08-22 23:14 UTC (permalink / raw) To: Git Mailing List Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre The sha1_entry_pos function tries to be smart about selecting the middle of a range for its binary search by looking at the value differences between the "lo" and "hi" constraints. However, it is unable to cope with entries with duplicate keys in the sorted list. We may hit a point in the search where both our "lo" and "hi" point to the same key. In this case, the range of values between our endpoints is 0, and trying to scale the difference between our key and the endpoints over that range is undefined (i.e., divide by zero). The current code catches this with an "assert(lov < hiv)". Moreover, after seeing that the first 20 byte of the key are the same, we will try to establish a value from the 21st byte. Which is nonsensical. Instead, we can detect the case that we are in a run of duplicates, and simply do a final comparison against any one of them (since they are all the same, it does not matter which). If the keys match, we have found our entry (or one of them, anyway). If not, then we know that we do not need to look further, as we must be in a run of the duplicate key. Furthermore, we know that one of our endpoints must be the edge of the run of duplicates. For example, given this sequence: idx 0 1 2 3 4 5 key A C C C C D If we are searching for "B", we might hit the duplicate run at lo=1, hi=3 (e.g., by first mi=3, then mi=0). But we can never have lo > 1, because B < C. That is, if our key is less than the run, we know that "lo" is the edge, but we can say nothing of "hi". Similarly, if our key is greater than the run, we know that "hi" is the edge, but we can say nothing of "lo". But that is enough for us to return not only "not found", but show the position at which we would insert the new item. Signed-off-by: Jeff King <peff@peff.net> --- sha1-lookup.c | 23 ++++++++++++ t/lib-pack.sh | 78 +++++++++++++++++++++++++++++++++++++++ t/t5308-pack-detect-duplicates.sh | 73 ++++++++++++++++++++++++++++++++++++ 3 files changed, 174 insertions(+) create mode 100644 t/lib-pack.sh create mode 100755 t/t5308-pack-detect-duplicates.sh diff --git a/sha1-lookup.c b/sha1-lookup.c index c4dc55d..614cbb6 100644 --- a/sha1-lookup.c +++ b/sha1-lookup.c @@ -204,7 +204,30 @@ int sha1_entry_pos(const void *table, * byte 0 thru (ofs-1) are the same between * lo and hi; ofs is the first byte that is * different. + * + * If ofs==20, then no bytes are different, + * meaning we have entries with duplicate + * keys. We know that we are in a solid run + * of this entry (because the entries are + * sorted, and our lo and hi are the same, + * there can be nothing but this single key + * in between). So we can stop the search. + * Either one of these entries is it (and + * we do not care which), or we do not have + * it. 
*/ + if (ofs == 20) { + mi = lo; + mi_key = base + elem_size * mi + key_offset; + cmp = memcmp(mi_key, key, 20); + if (!cmp) + return mi; + if (cmp < 0) + return -1 - hi; + else + return -1 - lo; + } + hiv = hi_key[ofs_0]; if (ofs_0 < 19) hiv = (hiv << 8) | hi_key[ofs_0+1]; diff --git a/t/lib-pack.sh b/t/lib-pack.sh new file mode 100644 index 0000000..61c5d19 --- /dev/null +++ b/t/lib-pack.sh @@ -0,0 +1,78 @@ +#!/bin/sh +# +# Support routines for hand-crafting weird or malicious packs. +# +# You can make a complete pack like: +# +# pack_header 2 >foo.pack && +# pack_obj e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 >>foo.pack && +# pack_obj e68fe8129b546b101aee9510c5328e7f21ca1d18 >>foo.pack && +# pack_trailer foo.pack + +# Print the big-endian 4-byte octal representation of $1 +uint32_octal() { + n=$1 + printf '\%o' $(($n / 16777216)); n=$((n % 16777216)) + printf '\%o' $(($n / 65536)); n=$((n % 65536)) + printf '\%o' $(($n / 256)); n=$((n % 256)) + printf '\%o' $(($n )); +} + +# Print the big-endian 4-byte binary representation of $1 +uint32_binary() { + printf "$(uint32_octal "$1")" +} + +# Print a pack header, version 2, for a pack with $1 objects +pack_header() { + printf 'PACK' && + printf '\0\0\0\2' && + uint32_binary "$1" +} + +# Print the pack data for object $1, as a delta against object $2 (or as a full +# object if $2 is missing or empty). The output is suitable for including +# directly in the packfile, and represents the entirety of the object entry. +# Doing this on the fly (especially picking your deltas) is quite tricky, so we +# have hardcoded some well-known objects. See the case statements below for the +# complete list. +pack_obj() { + case "$1" in + # empty blob + e69de29bb2d1d6434b8b29ae775ad8c2e48c5391) + case "$2" in + '') + printf '\060\170\234\003\0\0\0\0\1' + return + ;; + esac + ;; + + # blob containing "\7\76" + e68fe8129b546b101aee9510c5328e7f21ca1d18) + case "$2" in + '') + printf '\062\170\234\143\267\3\0\0\116\0\106' + return + ;; + esac + ;; + esac + + echo >&2 "BUG: don't know how to print $1${2:+ (from $2)}" + return 1 +} + +# Compute and append pack trailer to "$1" +pack_trailer() { + test-sha1 -b <"$1" >trailer.tmp && + cat trailer.tmp >>"$1" && + rm -f trailer.tmp +} + +# Remove any existing packs to make sure that +# whatever we index next will be the pack that we +# actually use. +clear_packs() { + rm -f .git/objects/pack/* +} diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh new file mode 100755 index 0000000..719d48f --- /dev/null +++ b/t/t5308-pack-detect-duplicates.sh @@ -0,0 +1,73 @@ +#!/bin/sh + +test_description='handling of duplicate objects in incoming packfiles' +. ./test-lib.sh +. ../lib-pack.sh + +# The sha1s we have in our pack. It's important that these have the same +# starting byte, so that they end up in the same fanout section of the index. +# That lets us make sure we are exercising the binary search with both sets. +LO_SHA1=e68fe8129b546b101aee9510c5328e7f21ca1d18 +HI_SHA1=e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 + +# And here's a "missing sha1" which will produce failed lookups. It must also +# be in the same fanout section, and should be between the two (so that during +# our binary search, we are sure to end up looking at one or the other of the +# duplicate runs). +MISSING_SHA1='e69d000000000000000000000000000000000000' + +# git will never intentionally create packfiles with +# duplicate objects, so we have to construct them by hand. 
+# +# $1 is the name of the packfile to create +# +# $2 is the number of times to duplicate each object +create_pack() { + pack_header "$((2 * $2))" >"$1" && + for i in $(test_seq 1 "$2"); do + pack_obj $LO_SHA1 && + pack_obj $HI_SHA1 + done >>"$1" && + pack_trailer "$1" +} + +# double-check that create_pack actually works +test_expect_success 'pack with no duplicates' ' + create_pack no-dups.pack 1 && + git index-pack --stdin <no-dups.pack +' + +test_expect_success 'index-pack will allow duplicate objects by default' ' + clear_packs && + create_pack dups.pack 100 && + git index-pack --stdin <dups.pack +' + +test_expect_success 'create batch-check test vectors' ' + cat >input <<-EOF && + $LO_SHA1 + $HI_SHA1 + $MISSING_SHA1 + EOF + cat >expect <<-EOF + $LO_SHA1 blob 2 + $HI_SHA1 blob 0 + $MISSING_SHA1 missing + EOF +' + +test_expect_success 'lookup in duplicated pack (binary search)' ' + git cat-file --batch-check <input >actual && + test_cmp expect actual +' + +test_expect_success 'lookup in duplicated pack (GIT_USE_LOOKUP)' ' + ( + GIT_USE_LOOKUP=1 && + export GIT_USE_LOOKUP && + git cat-file --batch-check <input >actual + ) && + test_cmp expect actual +' + +test_done -- 1.8.4.rc2.28.g6bb5f3f ^ permalink raw reply related [flat|nested] 37+ messages in thread
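
The uint32_octal/uint32_binary pair in lib-pack.sh goes through octal
escapes because that is the portable way to coax arbitrary bytes out
of printf in the shell. For comparison, the same big-endian encoding
in C takes only a few lines (a sketch, not part of the series; the
function names are invented):

	#include <stdio.h>

	/* Write v as 4 big-endian bytes, like uint32_binary in lib-pack.sh. */
	static void write_uint32_be(FILE *fh, unsigned long v)
	{
		unsigned char buf[4];

		buf[0] = (v >> 24) & 0xff;
		buf[1] = (v >> 16) & 0xff;
		buf[2] = (v >> 8) & 0xff;
		buf[3] = v & 0xff;
		fwrite(buf, 1, 4, fh);
	}

	/* ... so a version-2 pack header for n objects, like pack_header: */
	static void write_pack_header(FILE *fh, unsigned long nr_objects)
	{
		fwrite("PACK", 1, 4, fh);
		write_uint32_be(fh, 2);
		write_uint32_be(fh, nr_objects);
	}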
* Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP 2013-08-22 23:14 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King @ 2013-08-23 16:41 ` Junio C Hamano 2013-08-23 18:24 ` Jeff King 2013-08-23 19:41 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Johannes Sixt 1 sibling, 1 reply; 37+ messages in thread From: Junio C Hamano @ 2013-08-23 16:41 UTC (permalink / raw) To: Jeff King; +Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre Jeff King <peff@peff.net> writes: > Furthermore, we know that one of our endpoints must be > the edge of the run of duplicates. For example, given this > sequence: > > idx 0 1 2 3 4 5 > key A C C C C D > > If we are searching for "B", we might hit the duplicate run > at lo=1, hi=3 (e.g., by first mi=3, then mi=0). But we can > never have lo > 1, because B < C. That is, if our key is > less than the run, we know that "lo" is the edge, but we can > say nothing of "hi". Similarly, if our key is greater than > the run, we know that "hi" is the edge, but we can say > nothing of "lo". But that is enough for us to return not > only "not found", but show the position at which we would > insert the new item. This is somewhat tricky and may deserve an in-code comment. > diff --git a/sha1-lookup.c b/sha1-lookup.c > index c4dc55d..614cbb6 100644 > --- a/sha1-lookup.c > +++ b/sha1-lookup.c > @@ -204,7 +204,30 @@ int sha1_entry_pos(const void *table, > * byte 0 thru (ofs-1) are the same between > * lo and hi; ofs is the first byte that is > * different. > + * > + * If ofs==20, then no bytes are different, > + * meaning we have entries with duplicate > + * keys. We know that we are in a solid run > + * of this entry (because the entries are > + * sorted, and our lo and hi are the same, > + * there can be nothing but this single key > + * in between). So we can stop the search. > + * Either one of these entries is it (and > + * we do not care which), or we do not have > + * it. > */ > + if (ofs == 20) { > + mi = lo; > + mi_key = base + elem_size * mi + key_offset; > + cmp = memcmp(mi_key, key, 20); It think we already know that mi_key[0:ofs_0] and key[0:ofs_0] are the same at this point and we do not have to compare full 20 bytes again, but this is done only once and a better readablity of the above trumps micro-optimization possibility, I think. > + if (!cmp) > + return mi; > + if (cmp < 0) > + return -1 - hi; > + else > + return -1 - lo; > + } > + > hiv = hi_key[ofs_0]; > if (ofs_0 < 19) > hiv = (hiv << 8) | hi_key[ofs_0+1]; > diff --git a/t/lib-pack.sh b/t/lib-pack.sh > new file mode 100644 > index 0000000..61c5d19 > --- /dev/null > +++ b/t/lib-pack.sh > @@ -0,0 +1,78 @@ > +#!/bin/sh > +# > +# Support routines for hand-crafting weird or malicious packs. 
> +# > +# You can make a complete pack like: > +# > +# pack_header 2 >foo.pack && > +# pack_obj e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 >>foo.pack && > +# pack_obj e68fe8129b546b101aee9510c5328e7f21ca1d18 >>foo.pack && > +# pack_trailer foo.pack > + > +# Print the big-endian 4-byte octal representation of $1 > +uint32_octal() { micronit (style): uint32_octal () { > + n=$1 > + printf '\%o' $(($n / 16777216)); n=$((n % 16777216)) > + printf '\%o' $(($n / 65536)); n=$((n % 65536)) > + printf '\%o' $(($n / 256)); n=$((n % 256)) > + printf '\%o' $(($n )); > +} > + > +# Print the big-endian 4-byte binary representation of $1 > +uint32_binary() { > + printf "$(uint32_octal "$1")" > +} > + > +# Print a pack header, version 2, for a pack with $1 objects > +pack_header() { > + printf 'PACK' && > + printf '\0\0\0\2' && > + uint32_binary "$1" > +} > + > +# Print the pack data for object $1, as a delta against object $2 (or as a full > +# object if $2 is missing or empty). The output is suitable for including > +# directly in the packfile, and represents the entirety of the object entry. > +# Doing this on the fly (especially picking your deltas) is quite tricky, so we > +# have hardcoded some well-known objects. See the case statements below for the > +# complete list. Cute ;-) I like the idea of having this function with a right API in place, and cheating by limiting its implementation to what is necessary. Thanks. ^ permalink raw reply [flat|nested] 37+ messages in thread
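
The micro-optimization Junio is pointing at would be a one-line change
inside the quoted hunk (sketch only; as the reply below notes, it
would also have to run before ofs_0 is recomputed):

	/*
	 * bytes 0..ofs_0-1 of mi_key and key are already known to be
	 * equal here, so only the tail needs comparing:
	 */
	cmp = memcmp(mi_key + ofs_0, key + ofs_0, 20 - ofs_0);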
* Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP 2013-08-23 16:41 ` Junio C Hamano @ 2013-08-23 18:24 ` Jeff King 2013-08-23 18:54 ` Nicolas Pitre 2013-08-24 0:01 ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King 0 siblings, 2 replies; 37+ messages in thread From: Jeff King @ 2013-08-23 18:24 UTC (permalink / raw) To: Junio C Hamano Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre On Fri, Aug 23, 2013 at 09:41:57AM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > Furthermore, we know that one of our endpoints must be > > the edge of the run of duplicates. For example, given this > > sequence: > > > > idx 0 1 2 3 4 5 > > key A C C C C D > > > > If we are searching for "B", we might hit the duplicate run > > at lo=1, hi=3 (e.g., by first mi=3, then mi=0). But we can > > never have lo > 1, because B < C. That is, if our key is > > less than the run, we know that "lo" is the edge, but we can > > say nothing of "hi". Similarly, if our key is greater than > > the run, we know that "hi" is the edge, but we can say > > nothing of "lo". But that is enough for us to return not > > only "not found", but show the position at which we would > > insert the new item. > > This is somewhat tricky and may deserve an in-code comment. Do you want me to re-roll, pushing it down into the comment, or do you want to mark it up yourself? I think there might be some value in the latter as your re-writing of it as a comment may cross-check that my logic is sound. > > + if (ofs == 20) { > > + mi = lo; > > + mi_key = base + elem_size * mi + key_offset; > > + cmp = memcmp(mi_key, key, 20); > > It think we already know that mi_key[0:ofs_0] and key[0:ofs_0] are > the same at this point and we do not have to compare full 20 bytes > again, but this is done only once and a better readablity of the > above trumps micro-optimization possibility, I think. Yes, I had the same idea, and came to the same conclusion. Though if anybody did want to try it, note that we have just overwritten the old ofs_0, so you would want to bump the new code up above that line). > > +uint32_octal() { > > micronit (style): > > uint32_octal () { Hmph. I always forget which one we prefer, and we seem to have equal numbers of both already. Again, want a re-roll or to mark it up yourself? > > +# Print the pack data for object $1, as a delta against object $2 (or as a full > > +# object if $2 is missing or empty). The output is suitable for including > > +# directly in the packfile, and represents the entirety of the object entry. > > +# Doing this on the fly (especially picking your deltas) is quite tricky, so we > > +# have hardcoded some well-known objects. See the case statements below for the > > +# complete list. > > Cute ;-) I like the idea of having this function with a right API in > place, and cheating by limiting its implementation to what is > necessary. Just for reference, the procedure I used to generate the "base" data is reasonably straight forward: sha1=$(printf %s "$content" | git hash-object -w --stdin) echo $sha1 | git pack-objects --stdout >tmp.pack tail -c +13 tmp.pack >no-header.pack head -c -20 no-header.pack >no-trailer.pack od -b no-trailer.pack | grep ' ' | cut -d' ' -f2- | tr ' ' '\\' Since we want binary, we can skip the "od" call at the end (I needed it to convert to something readable to hand "printf"). But "head -c" is not portable, nor is head with a negative count. 
To find items in the same fanout, I just used for-loops to calculate the sha1s of all 2-byte blobs. And that is why we have the odd magic "\7\76" blob. Making the deltas was considerably less elegant, since we cannot provoke pack-objects to pick arbitrary deltas (and it will not even try to delta tiny objects, anyway, which would bloat our samples). I ended up with the horrible patch below. We _could_ clean it up (error-checking? Who needs it?) and make it a debug-and-testing-only option for pack-objects, but I just didn't think the grossness was worth it. Still, it's probably worth documenting here on the list in case somebody else ever needs to add new samples to lib-pack.sh. --- builtin/pack-objects.c | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 8da2a66..e8937f5 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2442,6 +2442,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) const char *rp_av[6]; int rp_ac = 0; int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0; + int magic = 0; struct option pack_objects_options[] = { OPT_SET_INT('q', "quiet", &progress, N_("do not show progress meter"), 0), @@ -2505,6 +2506,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) N_("pack compression level")), OPT_SET_INT(0, "keep-true-parents", &grafts_replace_parents, N_("do not hide commits by grafts"), 0), + OPT_BOOL(0, "magic", &magic, "make deltas"), OPT_END(), }; @@ -2520,6 +2522,34 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) argc = parse_options(argc, argv, prefix, pack_objects_options, pack_usage, 0); + if (magic) { + unsigned char sha1[20]; + struct delta_index *index; + void *src, *trg, *delta; + enum object_type src_type, trg_type; + unsigned long src_size, trg_size, delta_size, z_delta_size; + unsigned char header[10]; + unsigned long header_len; + + get_sha1(argv[0], sha1); + trg = read_sha1_file(sha1, &trg_type, &trg_size); + + get_sha1(argv[1], sha1); + src = read_sha1_file(sha1, &src_type, &src_size); + + index = create_delta_index(src, src_size); + delta = create_delta(index, trg, trg_size, &delta_size, 8192); + + z_delta_size = do_compress(&delta, delta_size); + header_len = encode_in_pack_object_header(OBJ_REF_DELTA, delta_size, + header); + fwrite(header, 1, header_len, stdout); + fwrite(sha1, 1, 20, stdout); + fwrite(delta, 1, z_delta_size, stdout); + return 0; + } + + if (argc) { base_name = argv[0]; argc--; ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
  2013-08-23 18:24 ` Jeff King
@ 2013-08-23 18:54 ` Nicolas Pitre
  2013-08-23 18:56 ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Nicolas Pitre @ 2013-08-23 18:54 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Git Mailing List, Duy Nguyen, Shawn O. Pearce

On Fri, 23 Aug 2013, Jeff King wrote:

> Making the deltas was considerably less elegant, since we cannot provoke
> pack-objects to pick arbitrary deltas (and it will not even try to delta
> tiny objects, anyway, which would bloat our samples). I ended up with
> the horrible patch below. We _could_ clean it up (error-checking? Who
> needs it?) and make it a debug-and-testing-only option for pack-objects,
> but I just didn't think the grossness was worth it. Still, it's probably
> worth documenting here on the list in case somebody else ever needs to
> add new samples to lib-pack.sh.

Maybe using test-delta (from test-delta.c) would have helped here?

In any case, if something needs to be permanently added into the code
to help in the creation of test objects, I think test-delta.c is a far
better place than pack-objects.c.

Nicolas

^ permalink raw reply	[flat|nested] 37+ messages in thread
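
test-delta is a command-line wrapper around the same diff-delta API the
pack-objects hack pokes at directly. Generating a raw delta between two
in-memory buffers comes down to a single call (a sketch against the
in-tree API; the wrapper function is invented, and the result still
needs deflating and an in-pack object header before it can be spliced
into a packfile):

	#include "delta.h"

	static void *make_raw_delta(const void *src_buf, unsigned long src_size,
				    const void *trg_buf, unsigned long trg_size,
				    unsigned long *delta_size)
	{
		/* 0 means no limit on the resulting delta's size */
		return diff_delta(src_buf, src_size, trg_buf, trg_size,
				  delta_size, 0);
	}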
* Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
  2013-08-23 18:54 ` Nicolas Pitre
@ 2013-08-23 18:56 ` Jeff King
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-23 18:56 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Junio C Hamano, Git Mailing List, Duy Nguyen, Shawn O. Pearce

On Fri, Aug 23, 2013 at 02:54:19PM -0400, Nicolas Pitre wrote:

> On Fri, 23 Aug 2013, Jeff King wrote:
>
> > Making the deltas was considerably less elegant, since we cannot provoke
> > pack-objects to pick arbitrary deltas (and it will not even try to delta
> > tiny objects, anyway, which would bloat our samples). I ended up with
> > the horrible patch below. We _could_ clean it up (error-checking? Who
> > needs it?) and make it a debug-and-testing-only option for pack-objects,
> > but I just didn't think the grossness was worth it. Still, it's probably
> > worth documenting here on the list in case somebody else ever needs to
> > add new samples to lib-pack.sh.
>
> Maybe using test-delta (from test-delta.c) would have helped here?
>
> In any case, if something needs to be permanently added into the code to
> help in the creation of test objects, I think test-delta.c is a far
> better place than pack-objects.c.

*forehead palm*

I didn't even know we had test-delta. Yes, that is obviously a way
better place (I initially looked at pack-objects because it has the
helpers to do compression and the type/size header properly).

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread
* [PATCHv3 0/6] duplicate objects and delta cycles 2013-08-23 18:24 ` Jeff King 2013-08-23 18:54 ` Nicolas Pitre @ 2013-08-24 0:01 ` Jeff King 2013-08-24 0:01 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King ` (5 more replies) 1 sibling, 6 replies; 37+ messages in thread From: Jeff King @ 2013-08-24 0:01 UTC (permalink / raw) To: Junio C Hamano Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre On Fri, Aug 23, 2013 at 02:24:10PM -0400, Jeff King wrote: > > This is somewhat tricky and may deserve an in-code comment. > > Do you want me to re-roll[...] Since there were a few things to fix, I went ahead and re-rolled. Here's v3, which changes: 1. Move "edge of run" logic description into the in-code comment rather than the commit message. 2. Include space between shell function names and parentheses. 3. Use $TEST_DIRECTORY to find lib-pack so that "--root" works. I added in Nicolas's acks, too, as this version makes no changes of substance from the previous ack'd version. Interdiff is below. sha1-lookup.c | 24 ++++++++++++++++++++++++ t/lib-pack.sh | 12 ++++++------ t/t5308-pack-detect-duplicates.sh | 4 ++-- t/t5309-pack-delta-cycles.sh | 2 +- 4 files changed, 33 insertions(+), 9 deletions(-) diff --git a/sha1-lookup.c b/sha1-lookup.c index 614cbb6..2dd8515 100644 --- a/sha1-lookup.c +++ b/sha1-lookup.c @@ -215,6 +215,30 @@ int sha1_entry_pos(const void *table, * Either one of these entries is it (and * we do not care which), or we do not have * it. + * + * Furthermore, we know that one of our + * endpoints must be the edge of the run of + * duplicates. For example, given this + * sequence: + * + * idx 0 1 2 3 4 5 + * key A C C C C D + * + * If we are searching for "B", we might + * hit the duplicate run at lo=1, hi=3 + * (e.g., by first mi=3, then mi=0). But we + * can never have lo > 1, because B < C. + * That is, if our key is less than the + * run, we know that "lo" is the edge, but + * we can say nothing of "hi". Similarly, + * if our key is greater than the run, we + * know that "hi" is the edge, but we can + * say nothing of "lo". + * + * Therefore if we do not find it, we also + * know where it would go if it did exist: + * just on the far side of the edge that we + * know about. */ if (ofs == 20) { mi = lo; diff --git a/t/lib-pack.sh b/t/lib-pack.sh index e028c40..7e8685b 100644 --- a/t/lib-pack.sh +++ b/t/lib-pack.sh @@ -10,7 +10,7 @@ # pack_trailer foo.pack # Print the big-endian 4-byte octal representation of $1 -uint32_octal() { +uint32_octal () { n=$1 printf '\%o' $(($n / 16777216)); n=$((n % 16777216)) printf '\%o' $(($n / 65536)); n=$((n % 65536)) @@ -19,12 +19,12 @@ uint32_binary() { } # Print the big-endian 4-byte binary representation of $1 -uint32_binary() { +uint32_binary () { printf "$(uint32_octal "$1")" } # Print a pack header, version 2, for a pack with $1 objects -pack_header() { +pack_header () { printf 'PACK' && printf '\0\0\0\2' && uint32_binary "$1" @@ -36,7 +36,7 @@ pack_header() { # Doing this on the fly (especially picking your deltas) is quite tricky, so we # have hardcoded some well-known objects. See the case statements below for the # complete list. 
-pack_obj() { +pack_obj () { case "$1" in # empty blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391) @@ -86,7 +86,7 @@ pack_obj() { } # Compute and append pack trailer to "$1" -pack_trailer() { +pack_trailer () { test-sha1 -b <"$1" >trailer.tmp && cat trailer.tmp >>"$1" && rm -f trailer.tmp @@ -95,6 +95,6 @@ pack_trailer() { # Remove any existing packs to make sure that # whatever we index next will be the pack that we # actually use. -clear_packs() { +clear_packs () { rm -f .git/objects/pack/* } diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh index e982095..e0ba5e1 100755 --- a/t/t5308-pack-detect-duplicates.sh +++ b/t/t5308-pack-detect-duplicates.sh @@ -2,7 +2,7 @@ test_description='handling of duplicate objects in incoming packfiles' test_description='handling of duplicate objects in incoming packfiles' . ./test-lib.sh -. ../lib-pack.sh +. "$TEST_DIRECTORY"/lib-pack.sh # The sha1s we have in our pack. It's important that these have the same # starting byte, so that they end up in the same fanout section of the index. @@ -22,7 +22,7 @@ MISSING_SHA1='e69d000000000000000000000000000000000000' # $1 is the name of the packfile to create # # $2 is the number of times to duplicate each object -create_pack() { +create_pack () { pack_header "$((2 * $2))" >"$1" && for i in $(test_seq 1 "$2"); do pack_obj $LO_SHA1 && diff --git a/t/t5309-pack-delta-cycles.sh b/t/t5309-pack-delta-cycles.sh index 38e1809..3e7861b 100755 --- a/t/t5309-pack-delta-cycles.sh +++ b/t/t5309-pack-delta-cycles.sh @@ -2,7 +2,7 @@ test_description='test index-pack handling of delta cycles in packfiles' test_description='test index-pack handling of delta cycles in packfiles' . ./test-lib.sh -. ../lib-pack.sh +. "$TEST_DIRECTORY"/lib-pack.sh # Two similar-ish objects that we have computed deltas between. A=01d7713666f4de822776c7622c10f1b07de280dc ^ permalink raw reply related [flat|nested] 37+ messages in thread
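
The insertion-position convention spelled out at the end of the new
in-code comment is what callers rely on; decoding it looks roughly like
this (hypothetical caller, with use_entry and insert_before standing in
for whatever a real caller does on a hit or a miss):

	#include <stddef.h>
	#include "sha1-lookup.h"

	void use_entry(int pos);	/* hypothetical */
	void insert_before(int pos);	/* hypothetical */

	static void lookup_example(const void *table, size_t elem_size,
				   size_t key_offset, unsigned nr,
				   const unsigned char *key)
	{
		int pos = sha1_entry_pos(table, elem_size, key_offset,
					 0, nr, nr, key);

		if (pos >= 0)
			use_entry(pos);		/* found: any one of a duplicate run */
		else
			insert_before(-1 - pos);	/* miss: first entry >= key */
	}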
* [PATCH 1/6] test-sha1: add a binary output mode
  2013-08-24  0:01 ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
@ 2013-08-24  0:01 ` Jeff King
  2013-08-24  0:02 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
  ` (4 subsequent siblings)
  5 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-24 0:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre

The test-sha1 helper program will run our internal sha1
routines over its input and output the 40-byte hex sha1.
Sometimes, however, it is useful to have the binary 20-byte
sha1 (and it's a pain to convert back in the shell). Let's
add a "-b" option to output the binary version.

Signed-off-by: Jeff King <peff@peff.net>
Acked-by: Nicolas Pitre <nico@fluxnic.net>
---
 test-sha1.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/test-sha1.c b/test-sha1.c
index 80daba9..e57eae1 100644
--- a/test-sha1.c
+++ b/test-sha1.c
@@ -5,10 +5,15 @@ int main(int ac, char **av)
 	git_SHA_CTX ctx;
 	unsigned char sha1[20];
 	unsigned bufsz = 8192;
+	int binary = 0;
 	char *buffer;
 
-	if (ac == 2)
-		bufsz = strtoul(av[1], NULL, 10) * 1024 * 1024;
+	if (ac == 2) {
+		if (!strcmp(av[1], "-b"))
+			binary = 1;
+		else
+			bufsz = strtoul(av[1], NULL, 10) * 1024 * 1024;
+	}
 
 	if (!bufsz)
 		bufsz = 8192;
@@ -42,6 +47,10 @@ int main(int ac, char **av)
 		git_SHA1_Update(&ctx, buffer, this_sz);
 	}
 	git_SHA1_Final(sha1, &ctx);
-	puts(sha1_to_hex(sha1));
+
+	if (binary)
+		fwrite(sha1, 1, 20, stdout);
+	else
+		puts(sha1_to_hex(sha1));
 	exit(0);
 }
--
1.8.4.rc2.28.g6bb5f3f

^ permalink raw reply related	[flat|nested] 37+ messages in thread
* [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP 2013-08-24 0:01 ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King 2013-08-24 0:01 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King @ 2013-08-24 0:02 ` Jeff King 2013-08-24 0:02 ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King ` (3 subsequent siblings) 5 siblings, 0 replies; 37+ messages in thread From: Jeff King @ 2013-08-24 0:02 UTC (permalink / raw) To: Junio C Hamano Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre The sha1_entry_pos function tries to be smart about selecting the middle of a range for its binary search by looking at the value differences between the "lo" and "hi" constraints. However, it is unable to cope with entries with duplicate keys in the sorted list. We may hit a point in the search where both our "lo" and "hi" point to the same key. In this case, the range of values between our endpoints is 0, and trying to scale the difference between our key and the endpoints over that range is undefined (i.e., divide by zero). The current code catches this with an "assert(lov < hiv)". Moreover, after seeing that the first 20 bytes of the key are the same, we will try to establish a value from the 21st byte, which is nonsensical. Instead, we can detect the case that we are in a run of duplicates, and simply do a final comparison against any one of them (since they are all the same, it does not matter which). If the keys match, we have found our entry (or one of them, anyway). If not, then we know that we do not need to look further, as we must be in a run of the duplicated key. Signed-off-by: Jeff King <peff@peff.net> Acked-by: Nicolas Pitre <nico@fluxnic.net> --- sha1-lookup.c | 47 +++++++++++++++++++++++ t/lib-pack.sh | 78 +++++++++++++++++++++++++++++++++++++++ t/t5308-pack-detect-duplicates.sh | 73 ++++++++++++++++++++++++++++++++++++ 3 files changed, 198 insertions(+) create mode 100644 t/lib-pack.sh create mode 100755 t/t5308-pack-detect-duplicates.sh diff --git a/sha1-lookup.c b/sha1-lookup.c index c4dc55d..2dd8515 100644 --- a/sha1-lookup.c +++ b/sha1-lookup.c @@ -204,7 +204,54 @@ int sha1_entry_pos(const void *table, * byte 0 thru (ofs-1) are the same between * lo and hi; ofs is the first byte that is * different. + * + * If ofs==20, then no bytes are different, + * meaning we have entries with duplicate + * keys. We know that we are in a solid run + * of this entry (because the entries are + * sorted, and our lo and hi are the same, + * there can be nothing but this single key + * in between). So we can stop the search. + * Either one of these entries is it (and + * we do not care which), or we do not have + * it. + * + * Furthermore, we know that one of our + * endpoints must be the edge of the run of + * duplicates. For example, given this + * sequence: + * + * idx 0 1 2 3 4 5 + * key A C C C C D + * + * If we are searching for "B", we might + * hit the duplicate run at lo=1, hi=3 + * (e.g., by first mi=3, then mi=0). But we + * can never have lo > 1, because B < C. + * That is, if our key is less than the + * run, we know that "lo" is the edge, but + * we can say nothing of "hi". Similarly, + * if our key is greater than the run, we + * know that "hi" is the edge, but we can + * say nothing of "lo". + * + * Therefore if we do not find it, we also + * know where it would go if it did exist: + * just on the far side of the edge that we + * know about.
*/ + if (ofs == 20) { + mi = lo; + mi_key = base + elem_size * mi + key_offset; + cmp = memcmp(mi_key, key, 20); + if (!cmp) + return mi; + if (cmp < 0) + return -1 - hi; + else + return -1 - lo; + } + hiv = hi_key[ofs_0]; if (ofs_0 < 19) hiv = (hiv << 8) | hi_key[ofs_0+1]; diff --git a/t/lib-pack.sh b/t/lib-pack.sh new file mode 100644 index 0000000..fecd5a0 --- /dev/null +++ b/t/lib-pack.sh @@ -0,0 +1,78 @@ +#!/bin/sh +# +# Support routines for hand-crafting weird or malicious packs. +# +# You can make a complete pack like: +# +# pack_header 2 >foo.pack && +# pack_obj e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 >>foo.pack && +# pack_obj e68fe8129b546b101aee9510c5328e7f21ca1d18 >>foo.pack && +# pack_trailer foo.pack + +# Print the big-endian 4-byte octal representation of $1 +uint32_octal () { + n=$1 + printf '\%o' $(($n / 16777216)); n=$((n % 16777216)) + printf '\%o' $(($n / 65536)); n=$((n % 65536)) + printf '\%o' $(($n / 256)); n=$((n % 256)) + printf '\%o' $(($n )); +} + +# Print the big-endian 4-byte binary representation of $1 +uint32_binary () { + printf "$(uint32_octal "$1")" +} + +# Print a pack header, version 2, for a pack with $1 objects +pack_header () { + printf 'PACK' && + printf '\0\0\0\2' && + uint32_binary "$1" +} + +# Print the pack data for object $1, as a delta against object $2 (or as a full +# object if $2 is missing or empty). The output is suitable for including +# directly in the packfile, and represents the entirety of the object entry. +# Doing this on the fly (especially picking your deltas) is quite tricky, so we +# have hardcoded some well-known objects. See the case statements below for the +# complete list. +pack_obj () { + case "$1" in + # empty blob + e69de29bb2d1d6434b8b29ae775ad8c2e48c5391) + case "$2" in + '') + printf '\060\170\234\003\0\0\0\0\1' + return + ;; + esac + ;; + + # blob containing "\7\76" + e68fe8129b546b101aee9510c5328e7f21ca1d18) + case "$2" in + '') + printf '\062\170\234\143\267\3\0\0\116\0\106' + return + ;; + esac + ;; + esac + + echo >&2 "BUG: don't know how to print $1${2:+ (from $2)}" + return 1 +} + +# Compute and append pack trailer to "$1" +pack_trailer () { + test-sha1 -b <"$1" >trailer.tmp && + cat trailer.tmp >>"$1" && + rm -f trailer.tmp +} + +# Remove any existing packs to make sure that +# whatever we index next will be the pack that we +# actually use. +clear_packs () { + rm -f .git/objects/pack/* +} diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh new file mode 100755 index 0000000..04fe242 --- /dev/null +++ b/t/t5308-pack-detect-duplicates.sh @@ -0,0 +1,73 @@ +#!/bin/sh + +test_description='handling of duplicate objects in incoming packfiles' +. ./test-lib.sh +. "$TEST_DIRECTORY"/lib-pack.sh + +# The sha1s we have in our pack. It's important that these have the same +# starting byte, so that they end up in the same fanout section of the index. +# That lets us make sure we are exercising the binary search with both sets. +LO_SHA1=e68fe8129b546b101aee9510c5328e7f21ca1d18 +HI_SHA1=e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 + +# And here's a "missing sha1" which will produce failed lookups. It must also +# be in the same fanout section, and should be between the two (so that during +# our binary search, we are sure to end up looking at one or the other of the +# duplicate runs). +MISSING_SHA1='e69d000000000000000000000000000000000000' + +# git will never intentionally create packfiles with +# duplicate objects, so we have to construct them by hand.
+# +# $1 is the name of the packfile to create +# +# $2 is the number of times to duplicate each object +create_pack () { + pack_header "$((2 * $2))" >"$1" && + for i in $(test_seq 1 "$2"); do + pack_obj $LO_SHA1 && + pack_obj $HI_SHA1 + done >>"$1" && + pack_trailer "$1" +} + +# double-check that create_pack actually works +test_expect_success 'pack with no duplicates' ' + create_pack no-dups.pack 1 && + git index-pack --stdin <no-dups.pack +' + +test_expect_success 'index-pack will allow duplicate objects by default' ' + clear_packs && + create_pack dups.pack 100 && + git index-pack --stdin <dups.pack +' + +test_expect_success 'create batch-check test vectors' ' + cat >input <<-EOF && + $LO_SHA1 + $HI_SHA1 + $MISSING_SHA1 + EOF + cat >expect <<-EOF + $LO_SHA1 blob 2 + $HI_SHA1 blob 0 + $MISSING_SHA1 missing + EOF +' + +test_expect_success 'lookup in duplicated pack (binary search)' ' + git cat-file --batch-check <input >actual && + test_cmp expect actual +' + +test_expect_success 'lookup in duplicated pack (GIT_USE_LOOKUP)' ' + ( + GIT_USE_LOOKUP=1 && + export GIT_USE_LOOKUP && + git cat-file --batch-check <input >actual + ) && + test_cmp expect actual +' + +test_done -- 1.8.4.rc2.28.g6bb5f3f ^ permalink raw reply related [flat|nested] 37+ messages in thread
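The ofs==20 shortcut above is the crux of the fix: once both endpoints of the remaining range hold the same key, a single comparison decides everything. The same idea transplanted onto a plain bisection, as a self-contained sketch (illustration only, not git's interpolation code; all names here are made up):

#include <stdio.h>
#include <string.h>

#define KEYSZ 20

/* Find key in a sorted table of 20-byte keys that may contain
 * duplicates. Returns the index of a match (any entry of a run),
 * or -1-pos where pos is the insertion point. */
static int lookup(unsigned char (*table)[KEYSZ], int nr,
                  const unsigned char *key)
{
        int lo = 0, hi = nr;    /* half-open candidate range [lo, hi) */

        while (lo < hi) {
                int mi = lo + (hi - lo) / 2;
                int cmp;

                /* duplicate-run shortcut, as in the ofs==20 case above:
                 * if the endpoints are identical, the whole range is one
                 * repeated key and a single comparison settles it */
                if (hi - lo > 1 && !memcmp(table[lo], table[hi - 1], KEYSZ)) {
                        cmp = memcmp(table[lo], key, KEYSZ);
                        if (!cmp)
                                return lo;      /* any entry of the run */
                        return cmp < 0 ? -1 - hi : -1 - lo;
                }

                cmp = memcmp(table[mi], key, KEYSZ);
                if (!cmp)
                        return mi;
                if (cmp < 0)
                        lo = mi + 1;
                else
                        hi = mi;
        }
        return -1 - lo;
}

int main(void)
{
        unsigned char t[5][KEYSZ], probe[KEYSZ];
        int i;

        for (i = 0; i < 4; i++)
                memset(t[i], 0xcc, KEYSZ);      /* a run of duplicates */
        memset(t[4], 0xdd, KEYSZ);
        memset(probe, 0xbb, KEYSZ);             /* a key below the run */
        printf("%d\n", lookup(t, 5, probe));    /* -1: would insert at 0 */
        return 0;
}

Probing for 0xbb narrows the range onto the 0xcc run, the shortcut fires, and the search ends with one comparison instead of dividing by a zero-width value range.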
* [PATCH 3/6] add tests for indexing packs with delta cycles 2013-08-24 0:01 ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King 2013-08-24 0:01 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King 2013-08-24 0:02 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King @ 2013-08-24 0:02 ` Jeff King 2013-08-24 0:02 ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King ` (2 subsequent siblings) 5 siblings, 0 replies; 37+ messages in thread From: Jeff King @ 2013-08-24 0:02 UTC (permalink / raw) To: Junio C Hamano Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre If we receive a broken or malicious pack from a remote, we will feed it to index-pack. As index-pack processes the objects as a stream, reconstructing and hashing each object to get its name, it is not very susceptible to doing the wrong thing with bad data (it simply notices that the data is bogus and aborts). However, one question raised on the list is whether it could be susceptible to problems during the delta-resolution phase. In particular, can a cycle in the packfile deltas cause us to go into an infinite loop or cause any other problem? The answer is no. We cannot have a cycle of delta-base offsets, because they go only in one direction (the OFS_DELTA object mentions its base by an offset towards the beginning of the file, and we explicitly reject negative offsets). We can have a cycle of REF_DELTA objects, which refer to base objects by sha1 name. However, index-pack does not know these sha1 names ahead of time; it has to reconstruct the objects to get their names, and it cannot do so if there is a delta cycle (in other words, it does not even realize there is a cycle, but only that there are items that cannot be resolved). Even though we can reason out that index-pack should handle this fine, let's add a few tests to make sure it behaves correctly. Signed-off-by: Jeff King <peff@peff.net> Acked-by: Nicolas Pitre <nico@fluxnic.net> --- t/lib-pack.sh | 22 +++++++++++++++++ t/t5309-pack-delta-cycles.sh | 59 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 81 insertions(+) create mode 100755 t/t5309-pack-delta-cycles.sh diff --git a/t/lib-pack.sh b/t/lib-pack.sh index fecd5a0..7e8685b 100644 --- a/t/lib-pack.sh +++ b/t/lib-pack.sh @@ -55,6 +55,28 @@ pack_obj () { printf '\062\170\234\143\267\3\0\0\116\0\106' return ;; + 01d7713666f4de822776c7622c10f1b07de280dc) + printf '\165\1\327\161\66\146\364\336\202\47\166' && + printf '\307\142\54\20\361\260\175\342\200\334\170' && + printf '\234\143\142\142\142\267\003\0\0\151\0\114' + return + ;; + esac + ;; + + # blob containing "\7\0" + 01d7713666f4de822776c7622c10f1b07de280dc) + case "$2" in + '') + printf '\062\170\234\143\147\0\0\0\20\0\10' + return + ;; + e68fe8129b546b101aee9510c5328e7f21ca1d18) + printf '\165\346\217\350\22\233\124\153\20\32\356' && + printf '\225\20\305\62\216\177\41\312\35\30\170\234' && + printf '\143\142\142\142\147\0\0\0\53\0\16' + return + ;; esac ;; esac diff --git a/t/t5309-pack-delta-cycles.sh b/t/t5309-pack-delta-cycles.sh new file mode 100755 index 0000000..1640bf7 --- /dev/null +++ b/t/t5309-pack-delta-cycles.sh @@ -0,0 +1,59 @@ +#!/bin/sh + +test_description='test index-pack handling of delta cycles in packfiles' +. ./test-lib.sh +. "$TEST_DIRECTORY"/lib-pack.sh + +# Two similar-ish objects that we have computed deltas between.
+A=01d7713666f4de822776c7622c10f1b07de280dc +B=e68fe8129b546b101aee9510c5328e7f21ca1d18 + +# double-check our hand-constructed packs +test_expect_success 'index-pack works with a single delta (A->B)' ' + clear_packs && + { + pack_header 2 && + pack_obj $A $B && + pack_obj $B + } >ab.pack && + pack_trailer ab.pack && + git index-pack --stdin <ab.pack && + git cat-file -t $A && + git cat-file -t $B +' + +test_expect_success 'index-pack works with a single delta (B->A)' ' + clear_packs && + { + pack_header 2 && + pack_obj $A && + pack_obj $B $A + } >ba.pack && + pack_trailer ba.pack && + git index-pack --stdin <ba.pack && + git cat-file -t $A && + git cat-file -t $B +' + +test_expect_success 'index-pack detects missing base objects' ' + clear_packs && + { + pack_header 1 && + pack_obj $A $B + } >missing.pack && + pack_trailer missing.pack && + test_must_fail git index-pack --fix-thin --stdin <missing.pack +' + +test_expect_success 'index-pack detects REF_DELTA cycles' ' + clear_packs && + { + pack_header 2 && + pack_obj $A $B && + pack_obj $B $A + } >cycle.pack && + pack_trailer cycle.pack && + test_must_fail git index-pack --fix-thin --stdin <cycle.pack +' + +test_done -- 1.8.4.rc2.28.g6bb5f3f ^ permalink raw reply related [flat|nested] 37+ messages in thread
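The commit message's claim that index-pack sees a REF_DELTA cycle only as "items that cannot be resolved" can be modeled in a few lines: treat resolution as a worklist that only processes deltas whose base is already known, so a cycle never becomes eligible and the loop terminates with leftovers rather than spinning. A toy model (hypothetical structure, not index-pack's actual code):

#include <stdio.h>

struct obj {
        int base;       /* index of the base object, -1 for a full object */
        int resolved;
};

int main(void)
{
        /* A and B form a REF_DELTA-style cycle; C is full; D deltas off C */
        struct obj o[] = {
                { 1, 0 },       /* A: delta against B */
                { 0, 0 },       /* B: delta against A */
                { -1, 1 },      /* C: full object, usable immediately */
                { 2, 0 },       /* D: delta against C */
        };
        int n = 4, i, progress = 1;

        while (progress) {
                progress = 0;
                for (i = 0; i < n; i++) {
                        if (o[i].resolved || o[i].base < 0)
                                continue;
                        if (o[o[i].base].resolved) {
                                /* reconstruct, hash, learn the name */
                                o[i].resolved = 1;
                                progress = 1;
                        }
                }
        }
        for (i = 0; i < n; i++)
                if (!o[i].resolved)
                        printf("object %d unresolvable (cycle or missing base)\n", i);
        return 0;
}

D resolves on the first pass, the second pass makes no progress, and A and B are reported as unresolvable; nothing ever loops forever, which is exactly what the 'detects REF_DELTA cycles' test above checks for the real code.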
* [PATCH 4/6] test index-pack on packs with recoverable delta cycles 2013-08-24 0:01 ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King ` (2 preceding siblings ...) 2013-08-24 0:02 ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King @ 2013-08-24 0:02 ` Jeff King 2013-08-24 0:02 ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King 2013-08-24 0:02 ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King 5 siblings, 0 replies; 37+ messages in thread From: Jeff King @ 2013-08-24 0:02 UTC (permalink / raw) To: Junio C Hamano Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre The previous commit added tests to show that index-pack correctly bails in unrecoverable situations. There are some situations where the data could be recovered, but it is not currently: 1. If we can break the cycle using an object from another pack via --fix-thin. 2. If we can break the cycle using a duplicate of one of the objects found in the same pack. Note that neither of these is particularly high priority; a delta cycle within a pack should never occur, and we have no record of even a buggy git implementation creating such a pack. However, it's worth adding these tests for two reasons. One, to document that we do not currently handle the situation, even though it is possible. And two, to exercise the code that runs in this situation; even though it fails, by running it we can confirm that index-pack detects the situation and aborts, and does not misbehave (e.g., by following the cycle in an infinite loop). In both cases, we hit an assert that aborts index-pack. Signed-off-by: Jeff King <peff@peff.net> Acked-by: Nicolas Pitre <nico@fluxnic.net> --- t/t5309-pack-delta-cycles.sh | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/t/t5309-pack-delta-cycles.sh b/t/t5309-pack-delta-cycles.sh index 1640bf7..3e7861b 100755 --- a/t/t5309-pack-delta-cycles.sh +++ b/t/t5309-pack-delta-cycles.sh @@ -56,4 +56,22 @@ test_expect_success 'index-pack detects REF_DELTA cycles' ' test_must_fail git index-pack --fix-thin --stdin <cycle.pack ' +test_expect_failure 'failover to an object in another pack' ' + clear_packs && + git index-pack --stdin <ab.pack && + git index-pack --stdin --fix-thin <cycle.pack +' + +test_expect_failure 'failover to a duplicate object in the same pack' ' + clear_packs && + { + pack_header 3 && + pack_obj $A $B && + pack_obj $B $A && + pack_obj $A + } >recoverable.pack && + pack_trailer recoverable.pack && + git index-pack --fix-thin --stdin <recoverable.pack +' + test_done -- 1.8.4.rc2.28.g6bb5f3f ^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH 5/6] index-pack: optionally reject packs with duplicate objects 2013-08-24 0:01 ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King ` (3 preceding siblings ...) 2013-08-24 0:02 ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King @ 2013-08-24 0:02 ` Jeff King 2013-08-24 0:02 ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King 5 siblings, 0 replies; 37+ messages in thread From: Jeff King @ 2013-08-24 0:02 UTC (permalink / raw) To: Junio C Hamano Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre Git should never generate packs with duplicate objects. However, we may see such packs due to bugs in Git or other implementations (e.g., JGit had such a bug a few years ago). In theory, such packs should not be a problem for us (we will simply find one of the instances of the object when looking in the pack). However, the JGit bug report mentioned possible infinite loops during repacking due to cycles in the delta chain. Though this problem hasn't specifically been reproduced on modern git, there is no reason not to be careful with incoming packs, given that only buggy implementations should be producing such packs, anyway. This patch introduces the pack.indexDuplicates option to allow or reject such packs from index-pack. The default remains to allow it. Signed-off-by: Jeff King <peff@peff.net> Acked-by: Nicolas Pitre <nico@fluxnic.net> --- builtin/index-pack.c | 4 ++++ pack-write.c | 24 ++++++++++++++++++++++++ pack.h | 2 ++ t/t5308-pack-detect-duplicates.sh | 8 ++++++++ 4 files changed, 38 insertions(+) diff --git a/builtin/index-pack.c b/builtin/index-pack.c index 79dfe47..72e19a0 100644 --- a/builtin/index-pack.c +++ b/builtin/index-pack.c @@ -1364,6 +1364,10 @@ static int git_index_pack_config(const char *k, const char *v, void *cb) #endif return 0; } + if (!strcmp(k, "pack.indexduplicates")) { + opts->allow_duplicates = git_config_bool(k, v); + return 0; + } return git_default_config(k, v, cb); } diff --git a/pack-write.c b/pack-write.c index ca9e63b..da946a7 100644 --- a/pack-write.c +++ b/pack-write.c @@ -7,6 +7,7 @@ void reset_pack_idx_option(struct pack_idx_option *opts) memset(opts, 0, sizeof(*opts)); opts->version = 2; opts->off32_limit = 0x7fffffff; + opts->allow_duplicates = 1; } static int sha1_compare(const void *_a, const void *_b) @@ -37,6 +38,19 @@ static int need_large_offset(off_t offset, const struct pack_idx_option *opts) sizeof(ofsval), cmp_uint32); } +static void *find_duplicate(void *vbase, size_t n, size_t size, + int (*cmp)(const void *, const void *)) +{ + unsigned char *base = vbase; + while (n > 1) { + if (!cmp(base, base + size)) + return base; + base += size; + n--; + } + return NULL; +} + /* * On entry *sha1 contains the pack content SHA1 hash, on exit it is * the SHA1 hash of sorted object names.
The objects array passed in @@ -68,6 +82,16 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec else sorted_by_sha = list = last = NULL; + if (!opts->allow_duplicates) { + struct pack_idx_entry **dup; + + dup = find_duplicate(sorted_by_sha, nr_objects, + sizeof(*sorted_by_sha), sha1_compare); + if (dup) + die("pack has duplicate entries for %s", + sha1_to_hex((*dup)->sha1)); + } + if (opts->flags & WRITE_IDX_VERIFY) { assert(index_name); f = sha1fd_check(index_name); diff --git a/pack.h b/pack.h index aa6ee7d..45217b6 100644 --- a/pack.h +++ b/pack.h @@ -44,6 +44,8 @@ struct pack_idx_option { uint32_t version; uint32_t off32_limit; + int allow_duplicates; + /* * List of offsets that would fit within off32_limit but * need to be written out as 64-bit entity for byte-for-byte diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh index 04fe242..97ce2e0 100755 --- a/t/t5308-pack-detect-duplicates.sh +++ b/t/t5308-pack-detect-duplicates.sh @@ -70,4 +70,12 @@ test_expect_success 'lookup in duplicated pack (GIT_USE_LOOKUP)' ' test_cmp expect actual ' +test_expect_success 'index-pack can reject packs with duplicates' ' + clear_packs && + create_pack dups.pack 2 && + test_must_fail \ + git -c pack.indexDuplicates=0 index-pack --stdin <dups.pack && + test_expect_code 1 git cat-file -e $LO_SHA1 +' + test_done -- 1.8.4.rc2.28.g6bb5f3f ^ permalink raw reply related [flat|nested] 37+ messages in thread
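Because write_idx_file sorts the entries by sha1 before this check runs, find_duplicate above only needs a single pass of neighbour comparisons. The same pattern reduced to a self-contained toy (ints instead of pack_idx_entry; a sketch, not the patch's code):

#include <stdio.h>

/* On a sorted array, duplicates must be adjacent: compare each element
 * with its successor and return the first equal pair, or NULL. */
static const void *first_dup(const void *base, size_t n, size_t size,
                             int (*cmp)(const void *, const void *))
{
        const unsigned char *p = base;

        for (; n > 1; n--, p += size)
                if (!cmp(p, p + size))
                        return p;
        return NULL;
}

static int intcmp(const void *a, const void *b)
{
        return *(const int *)a - *(const int *)b;
}

int main(void)
{
        int sorted[] = { 1, 3, 3, 7 };
        const int *dup = first_dup(sorted, 4, sizeof(int), intcmp);

        printf("%d\n", dup ? *dup : -1);        /* prints 3 */
        return 0;
}

The linear scan is O(n) on top of the sort the index writer already performs, which is why the check is cheap enough to run on every pack.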
* [PATCH 6/6] default pack.indexDuplicates to false 2013-08-24 0:01 ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King ` (4 preceding siblings ...) 2013-08-24 0:02 ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King @ 2013-08-24 0:02 ` Jeff King 5 siblings, 0 replies; 37+ messages in thread From: Jeff King @ 2013-08-24 0:02 UTC (permalink / raw) To: Junio C Hamano Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre We should never see duplicate objects in packs, and it is unknown what kind of complications such packs could create during the repacking process. The previous commit introduced a safety valve for checking packs coming into the repository and being indexed by index-pack. Let's turn the safety valve on by default. This helps protect sites receiving packfiles from potentially malicious strangers, and shouldn't affect normal use (and if it does, it will have helped us identify a buggy packfile writer). In a pinch, users can recover by turning off the option, or by resorting to unpack-objects to explode the pack. Note that this also turns the option on for packs we write ourselves (e.g., via pack-objects, fast-import, or streaming large blobs). We do not expect maliciously constructed packfiles in these code paths, of course, but it can serve as an extra check that we have not accidentally created a buggy pack (and the check itself is cheap to perform). Signed-off-by: Jeff King <peff@peff.net> Acked-by: Nicolas Pitre <nico@fluxnic.net> --- pack-write.c | 1 - t/t5308-pack-detect-duplicates.sh | 9 ++++----- 2 files changed, 4 insertions(+), 6 deletions(-) diff --git a/pack-write.c b/pack-write.c index da946a7..1e3c459 100644 --- a/pack-write.c +++ b/pack-write.c @@ -7,7 +7,6 @@ void reset_pack_idx_option(struct pack_idx_option *opts) memset(opts, 0, sizeof(*opts)); opts->version = 2; opts->off32_limit = 0x7fffffff; - opts->allow_duplicates = 1; } static int sha1_compare(const void *_a, const void *_b) diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh index 97ce2e0..e0ba5e1 100755 --- a/t/t5308-pack-detect-duplicates.sh +++ b/t/t5308-pack-detect-duplicates.sh @@ -37,10 +37,10 @@ test_expect_success 'index-pack will allow duplicate objects by default' ' git index-pack --stdin <no-dups.pack ' -test_expect_success 'index-pack will allow duplicate objects by default' ' +test_expect_success 'index-pack will allow duplicate objects if asked' ' clear_packs && create_pack dups.pack 100 && - git index-pack --stdin <dups.pack + git -c pack.indexDuplicates=true index-pack --stdin <dups.pack ' test_expect_success 'create batch-check test vectors' ' @@ -70,11 +70,10 @@ test_expect_success 'index-pack can reject packs with duplicates' ' test_cmp expect actual ' -test_expect_success 'index-pack can reject packs with duplicates' ' +test_expect_success 'index-pack rejects packs with duplicates by default' ' clear_packs && create_pack dups.pack 2 && - test_must_fail \ - git -c pack.indexDuplicates=0 index-pack --stdin <dups.pack && + test_must_fail git index-pack --stdin <dups.pack && test_expect_code 1 git cat-file -e $LO_SHA1 ' -- 1.8.4.rc2.28.g6bb5f3f ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP 2013-08-22 23:14 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King 2013-08-23 16:41 ` Junio C Hamano @ 2013-08-23 19:41 ` Johannes Sixt 2013-08-23 23:37 ` Jeff King 1 sibling, 1 reply; 37+ messages in thread From: Johannes Sixt @ 2013-08-23 19:41 UTC (permalink / raw) To: Jeff King Cc: Git Mailing List, Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre On 23.08.2013 01:14, Jeff King wrote: > +++ b/t/t5308-pack-detect-duplicates.sh > @@ -0,0 +1,73 @@ > +#!/bin/sh > + > +test_description='handling of duplicate objects in incoming packfiles' > +. ./test-lib.sh > +. ../lib-pack.sh This should be . "$TEST_DIRECTORY"/lib-pack.sh to support running tests with --root (also in patch 3/6). -- Hannes ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP 2013-08-23 19:41 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Johannes Sixt @ 2013-08-23 23:37 ` Jeff King 0 siblings, 0 replies; 37+ messages in thread From: Jeff King @ 2013-08-23 23:37 UTC (permalink / raw) To: Johannes Sixt Cc: Git Mailing List, Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre On Fri, Aug 23, 2013 at 09:41:39PM +0200, Johannes Sixt wrote: > On 23.08.2013 01:14, Jeff King wrote: > >+++ b/t/t5308-pack-detect-duplicates.sh > >@@ -0,0 +1,73 @@ > >+#!/bin/sh > >+ > >+test_description='handling of duplicate objects in incoming packfiles' > >+. ./test-lib.sh > >+. ../lib-pack.sh > > This should be > > . "$TEST_DIRECTORY"/lib-pack.sh > > to support running tests with --root (also in patch 3/6). Doh, you would think that I would remember that, as the one who introduced "--root" in the first place. Will fix. Thanks for noticing. -Peff ^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH 3/6] add tests for indexing packs with delta cycles 2013-08-22 23:12 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King 2013-08-22 23:12 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King 2013-08-22 23:14 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King @ 2013-08-22 23:14 ` Jeff King 2013-08-22 23:15 ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King ` (3 subsequent siblings) 6 siblings, 0 replies; 37+ messages in thread From: Jeff King @ 2013-08-22 23:14 UTC (permalink / raw) To: Git Mailing List Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre If we receive a broken or malicious pack from a remote, we will feed it to index-pack. As index-pack processes the objects as a stream, reconstructing and hashing each object to get its name, it is not very susceptible to doing the wrong thing with bad data (it simply notices that the data is bogus and aborts). However, one question raised on the list is whether it could be susceptible to problems during the delta-resolution phase. In particular, can a cycle in the packfile deltas cause us to go into an infinite loop or cause any other problem? The answer is no. We cannot have a cycle of delta-base offsets, because they go only in one direction (the OFS_DELTA object mentions its base by an offset towards the beginning of the file, and we explicitly reject negative offsets). We can have a cycle of REF_DELTA objects, which refer to base objects by sha1 name. However, index-pack does not know these sha1 names ahead of time; it has to reconstruct the objects to get their names, and it cannot do so if there is a delta cycle (in other words, it does not even realize there is a cycle, but only that there are items that cannot be resolved). Even though we can reason out that index-pack should handle this fine, let's add a few tests to make sure it behaves correctly. Signed-off-by: Jeff King <peff@peff.net> --- t/lib-pack.sh | 22 +++++++++++++++++ t/t5309-pack-delta-cycles.sh | 59 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 81 insertions(+) create mode 100755 t/t5309-pack-delta-cycles.sh diff --git a/t/lib-pack.sh b/t/lib-pack.sh index 61c5d19..e028c40 100644 --- a/t/lib-pack.sh +++ b/t/lib-pack.sh @@ -55,6 +55,28 @@ pack_obj() { printf '\062\170\234\143\267\3\0\0\116\0\106' return ;; + 01d7713666f4de822776c7622c10f1b07de280dc) + printf '\165\1\327\161\66\146\364\336\202\47\166' && + printf '\307\142\54\20\361\260\175\342\200\334\170' && + printf '\234\143\142\142\142\267\003\0\0\151\0\114' + return + ;; + esac + ;; + + # blob containing "\7\0" + 01d7713666f4de822776c7622c10f1b07de280dc) + case "$2" in + '') + printf '\062\170\234\143\147\0\0\0\20\0\10' + return + ;; + e68fe8129b546b101aee9510c5328e7f21ca1d18) + printf '\165\346\217\350\22\233\124\153\20\32\356' && + printf '\225\20\305\62\216\177\41\312\35\30\170\234' && + printf '\143\142\142\142\147\0\0\0\53\0\16' + return + ;; esac ;; esac diff --git a/t/t5309-pack-delta-cycles.sh b/t/t5309-pack-delta-cycles.sh new file mode 100755 index 0000000..9b3f2c3 --- /dev/null +++ b/t/t5309-pack-delta-cycles.sh @@ -0,0 +1,59 @@ +#!/bin/sh + +test_description='test index-pack handling of delta cycles in packfiles' +. ./test-lib.sh +. ../lib-pack.sh + +# Two similar-ish objects that we have computed deltas between.
+A=01d7713666f4de822776c7622c10f1b07de280dc +B=e68fe8129b546b101aee9510c5328e7f21ca1d18 + +# double-check our hand-constructed packs +test_expect_success 'index-pack works with a single delta (A->B)' ' + clear_packs && + { + pack_header 2 && + pack_obj $A $B && + pack_obj $B + } >ab.pack && + pack_trailer ab.pack && + git index-pack --stdin <ab.pack && + git cat-file -t $A && + git cat-file -t $B +' + +test_expect_success 'index-pack works with a single delta (B->A)' ' + clear_packs && + { + pack_header 2 && + pack_obj $A && + pack_obj $B $A + } >ba.pack && + pack_trailer ba.pack && + git index-pack --stdin <ba.pack && + git cat-file -t $A && + git cat-file -t $B +' + +test_expect_success 'index-pack detects missing base objects' ' + clear_packs && + { + pack_header 1 && + pack_obj $A $B + } >missing.pack && + pack_trailer missing.pack && + test_must_fail git index-pack --fix-thin --stdin <missing.pack +' + +test_expect_success 'index-pack detects REF_DELTA cycles' ' + clear_packs && + { + pack_header 2 && + pack_obj $A $B && + pack_obj $B $A + } >cycle.pack && + pack_trailer cycle.pack && + test_must_fail git index-pack --fix-thin --stdin <cycle.pack +' + +test_done -- 1.8.4.rc2.28.g6bb5f3f ^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH 4/6] test index-pack on packs with recoverable delta cycles 2013-08-22 23:12 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King ` (2 preceding siblings ...) 2013-08-22 23:14 ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King @ 2013-08-22 23:15 ` Jeff King 2013-08-22 23:15 ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King ` (2 subsequent siblings) 6 siblings, 0 replies; 37+ messages in thread From: Jeff King @ 2013-08-22 23:15 UTC (permalink / raw) To: Git Mailing List Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre The previous commit added tests to show that index-pack correctly bails in unrecoverable situations. There are some situations where the data could be recovered, but it is not currently: 1. If we can break the cycle using an object from another pack via --fix-thin. 2. If we can break the cycle using a duplicate of one of the objects found in the same pack. Note that neither of these is particularly high priority; a delta cycle within a pack should never occur, and we have no record of even a buggy git implementation creating such a pack. However, it's worth adding these tests for two reasons. One, to document that we do not currently handle the situation, even though it is possible. And two, to exercise the code that runs in this situation; even though it fails, by running it we can confirm that index-pack detects the situation and aborts, and does not misbehave (e.g., by following the cycle in an infinite loop). In both cases, we hit an assert that aborts index-pack. Signed-off-by: Jeff King <peff@peff.net> --- t/t5309-pack-delta-cycles.sh | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/t/t5309-pack-delta-cycles.sh b/t/t5309-pack-delta-cycles.sh index 9b3f2c3..38e1809 100755 --- a/t/t5309-pack-delta-cycles.sh +++ b/t/t5309-pack-delta-cycles.sh @@ -56,4 +56,22 @@ test_expect_success 'index-pack detects REF_DELTA cycles' ' test_must_fail git index-pack --fix-thin --stdin <cycle.pack ' +test_expect_failure 'failover to an object in another pack' ' + clear_packs && + git index-pack --stdin <ab.pack && + git index-pack --stdin --fix-thin <cycle.pack +' + +test_expect_failure 'failover to a duplicate object in the same pack' ' + clear_packs && + { + pack_header 3 && + pack_obj $A $B && + pack_obj $B $A && + pack_obj $A + } >recoverable.pack && + pack_trailer recoverable.pack && + git index-pack --fix-thin --stdin <recoverable.pack +' + test_done -- 1.8.4.rc2.28.g6bb5f3f ^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH 5/6] index-pack: optionally reject packs with duplicate objects 2013-08-22 23:12 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King ` (3 preceding siblings ...) 2013-08-22 23:15 ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King @ 2013-08-22 23:15 ` Jeff King 2013-08-22 23:16 ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King 2013-08-22 23:35 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Nicolas Pitre 6 siblings, 0 replies; 37+ messages in thread From: Jeff King @ 2013-08-22 23:15 UTC (permalink / raw) To: Git Mailing List Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre Git should never generate packs with duplicate objects. However, we may see such packs due to bugs in Git or other implementations (e.g., JGit had such a bug a few years ago). In theory, such packs should not be a problem for us (we will simply find one of the instances of the object when looking in the pack). However, the JGit bug report mentioned possible infinite loops during repacking due to cycles in the delta chain. Though this problem hasn't specifically been reproduced on modern git, there is no reason not to be careful with incoming packs, given that only buggy implementations should be producing such packs, anyway. This patch introduces the pack.indexDuplicates option to allow or reject such packs from index-pack. The default remains to allow it. Signed-off-by: Jeff King <peff@peff.net> --- builtin/index-pack.c | 4 ++++ pack-write.c | 24 ++++++++++++++++++++++++ pack.h | 2 ++ t/t5308-pack-detect-duplicates.sh | 8 ++++++++ 4 files changed, 38 insertions(+) diff --git a/builtin/index-pack.c b/builtin/index-pack.c index 9c1cfac..c27556f 100644 --- a/builtin/index-pack.c +++ b/builtin/index-pack.c @@ -1369,6 +1369,10 @@ static int git_index_pack_config(const char *k, const char *v, void *cb) #endif return 0; } + if (!strcmp(k, "pack.indexduplicates")) { + opts->allow_duplicates = git_config_bool(k, v); + return 0; + } return git_default_config(k, v, cb); } diff --git a/pack-write.c b/pack-write.c index ca9e63b..da946a7 100644 --- a/pack-write.c +++ b/pack-write.c @@ -7,6 +7,7 @@ void reset_pack_idx_option(struct pack_idx_option *opts) memset(opts, 0, sizeof(*opts)); opts->version = 2; opts->off32_limit = 0x7fffffff; + opts->allow_duplicates = 1; } static int sha1_compare(const void *_a, const void *_b) @@ -37,6 +38,19 @@ static int need_large_offset(off_t offset, const struct pack_idx_option *opts) sizeof(ofsval), cmp_uint32); } +static void *find_duplicate(void *vbase, size_t n, size_t size, + int (*cmp)(const void *, const void *)) +{ + unsigned char *base = vbase; + while (n > 1) { + if (!cmp(base, base + size)) + return base; + base += size; + n--; + } + return NULL; +} + /* * On entry *sha1 contains the pack content SHA1 hash, on exit it is * the SHA1 hash of sorted object names.
The objects array passed in @@ -68,6 +82,16 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec else sorted_by_sha = list = last = NULL; + if (!opts->allow_duplicates) { + struct pack_idx_entry **dup; + + dup = find_duplicate(sorted_by_sha, nr_objects, + sizeof(*sorted_by_sha), sha1_compare); + if (dup) + die("pack has duplicate entries for %s", + sha1_to_hex((*dup)->sha1)); + } + if (opts->flags & WRITE_IDX_VERIFY) { assert(index_name); f = sha1fd_check(index_name); diff --git a/pack.h b/pack.h index aa6ee7d..45217b6 100644 --- a/pack.h +++ b/pack.h @@ -44,6 +44,8 @@ struct pack_idx_option { uint32_t version; uint32_t off32_limit; + int allow_duplicates; + /* * List of offsets that would fit within off32_limit but * need to be written out as 64-bit entity for byte-for-byte diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh index 719d48f..cb9b44e 100755 --- a/t/t5308-pack-detect-duplicates.sh +++ b/t/t5308-pack-detect-duplicates.sh @@ -70,4 +70,12 @@ test_expect_success 'lookup in duplicated pack (GIT_USE_LOOKUP)' ' test_cmp expect actual ' +test_expect_success 'index-pack can reject packs with duplicates' ' + clear_packs && + create_pack dups.pack 2 && + test_must_fail \ + git -c pack.indexDuplicates=0 index-pack --stdin <dups.pack && + test_expect_code 1 git cat-file -e $LO_SHA1 +' + test_done -- 1.8.4.rc2.28.g6bb5f3f ^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH 6/6] default pack.indexDuplicates to false 2013-08-22 23:12 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King ` (4 preceding siblings ...) 2013-08-22 23:15 ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King @ 2013-08-22 23:16 ` Jeff King 2013-08-22 23:35 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Nicolas Pitre 6 siblings, 0 replies; 37+ messages in thread From: Jeff King @ 2013-08-22 23:16 UTC (permalink / raw) To: Git Mailing List Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre We should never see duplicate objects in packs, and it is unknown what kind of complications such packs could create during the repacking process. The previous commit introduced a safety valve for checking packs coming into the repository and being indexed by index-pack. Let's turn the safety valve on by default. This helps protect sites receiving packfiles from potentially malicious strangers, and shouldn't affect normal use (and if it does, it will have helped us identify a buggy packfile writer). In a pinch, users can recover by turning off the option, or by resorting to unpack-objects to explode the pack. Note that this also turns the option on for packs we write ourselves (e.g., via pack-objects, fast-import, or streaming large blobs). We do not expect maliciously constructed packfiles in these code paths, of course, but it can serve as an extra check that we have not accidentally created a buggy pack (and the check itself is cheap to perform). Signed-off-by: Jeff King <peff@peff.net> --- pack-write.c | 1 - t/t5308-pack-detect-duplicates.sh | 9 ++++----- 2 files changed, 4 insertions(+), 6 deletions(-) diff --git a/pack-write.c b/pack-write.c index da946a7..1e3c459 100644 --- a/pack-write.c +++ b/pack-write.c @@ -7,7 +7,6 @@ void reset_pack_idx_option(struct pack_idx_option *opts) memset(opts, 0, sizeof(*opts)); opts->version = 2; opts->off32_limit = 0x7fffffff; - opts->allow_duplicates = 1; } static int sha1_compare(const void *_a, const void *_b) diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh index cb9b44e..e982095 100755 --- a/t/t5308-pack-detect-duplicates.sh +++ b/t/t5308-pack-detect-duplicates.sh @@ -37,10 +37,10 @@ test_expect_success 'index-pack will allow duplicate objects by default' ' git index-pack --stdin <no-dups.pack ' -test_expect_success 'index-pack will allow duplicate objects by default' ' +test_expect_success 'index-pack will allow duplicate objects if asked' ' clear_packs && create_pack dups.pack 100 && - git index-pack --stdin <dups.pack + git -c pack.indexDuplicates=true index-pack --stdin <dups.pack ' test_expect_success 'create batch-check test vectors' ' @@ -70,11 +70,10 @@ test_expect_success 'index-pack can reject packs with duplicates' ' test_cmp expect actual ' -test_expect_success 'index-pack can reject packs with duplicates' ' +test_expect_success 'index-pack rejects packs with duplicates by default' ' clear_packs && create_pack dups.pack 2 && - test_must_fail \ - git -c pack.indexDuplicates=0 index-pack --stdin <dups.pack && + test_must_fail git index-pack --stdin <dups.pack && test_expect_code 1 git cat-file -e $LO_SHA1 ' -- 1.8.4.rc2.28.g6bb5f3f ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCHv2 0/6] duplicate objects and delta cycles, oh my! 2013-08-22 23:12 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King ` (5 preceding siblings ...) 2013-08-22 23:16 ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King @ 2013-08-22 23:35 ` Nicolas Pitre 6 siblings, 0 replies; 37+ messages in thread From: Nicolas Pitre @ 2013-08-22 23:35 UTC (permalink / raw) To: Jeff King; +Cc: Git Mailing List, Duy Nguyen, Junio C Hamano, Shawn O. Pearce On Thu, 22 Aug 2013, Jeff King wrote: > On Thu, Aug 22, 2013 at 10:43:05AM -0400, Jeff King wrote: > > > > write_idx_file() is called after index-pack processes all delta > > > objects. Could resolve_deltas() go cyclic with certain duplicate > > > object setup? > > > > Good question. I'm not sure. I'll check it out. > > I think the answer is "no", based on both reasoning and testing (both of > which are included in patches 3-4 of the series below). > > So here's my re-roll of the series. This looks all good to me. Acked-by: Nicolas Pitre <nico@fluxnic.net> Nicolas ^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH 3/4] reject duplicates when indexing a packfile we created 2013-08-21 20:49 ` [RFC/PATCH 0/4] duplicate objects in packfiles Jeff King 2013-08-21 20:51 ` [PATCH 1/4] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King 2013-08-21 20:52 ` [PATCH 2/4] index-pack: optionally reject packs with duplicate objects Jeff King @ 2013-08-21 20:53 ` Jeff King 2013-08-21 20:55 ` [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries Jeff King 2013-08-21 22:17 ` [RFC/PATCH 0/4] duplicate objects in packfiles Junio C Hamano 4 siblings, 0 replies; 37+ messages in thread From: Jeff King @ 2013-08-21 20:53 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Shawn O. Pearce, Nicolas Pitre The pack index-writing code is capable of detecting and rejecting duplicate entries. This can be used with index-pack to prevent buggy foreign packs from entering the repository. However, we can also be careful about the packs we generate, by double-checking during the index write that we do not have any duplicate objects. This should never happen, but it serves as an additional check that we are not generating such packfiles. Signed-off-by: Jeff King <peff@peff.net> --- This turns on rejection everywhere _except_ a separately-run index-pack. If we decide to turn it on there, too, it would make sense to scrap this patch and just put it in reset_pack_idx_option(). builtin/pack-objects.c | 1 + bulk-checkin.c | 1 + fast-import.c | 1 + 3 files changed, 3 insertions(+) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index f069462..8da2a66 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2511,6 +2511,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) read_replace_refs = 0; reset_pack_idx_option(&pack_idx_opts); + pack_idx_opts.duplicates = WRITE_IDX_DUPLICATES_REJECT; git_config(git_pack_config, NULL); if (!pack_compression_seen && core_compression_seen) pack_compression_level = core_compression_level; diff --git a/bulk-checkin.c b/bulk-checkin.c index 6b0b6d4..bad191f 100644 --- a/bulk-checkin.c +++ b/bulk-checkin.c @@ -174,6 +174,7 @@ static void prepare_to_stream(struct bulk_checkin_state *state, state->f = create_tmp_packfile(&state->pack_tmp_name); reset_pack_idx_option(&state->pack_idx_opts); + state->pack_idx_opts.duplicates = WRITE_IDX_DUPLICATES_REJECT; /* Pretend we are going to write only one object */ state->offset = write_pack_header(state->f, 1); diff --git a/fast-import.c b/fast-import.c index 23f625f..b7beab0 100644 --- a/fast-import.c +++ b/fast-import.c @@ -3360,6 +3360,7 @@ int main(int argc, char **argv) setup_git_directory(); reset_pack_idx_option(&pack_idx_opts); + pack_idx_opts.duplicates = WRITE_IDX_DUPLICATES_REJECT; git_config(git_pack_config, NULL); if (!pack_compression_seen && core_compression_seen) pack_compression_level = core_compression_level; -- 1.8.4.rc2.28.g6bb5f3f ^ permalink raw reply related [flat|nested] 37+ messages in thread
* [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries 2013-08-21 20:49 ` [RFC/PATCH 0/4] duplicate objects in packfiles Jeff King ` (2 preceding siblings ...) 2013-08-21 20:53 ` [PATCH 3/4] reject duplicates when indexing a packfile we created Jeff King @ 2013-08-21 20:55 ` Jeff King 2013-08-21 23:20 ` Junio C Hamano 2013-08-21 22:17 ` [RFC/PATCH 0/4] duplicate objects in packfiles Junio C Hamano 4 siblings, 1 reply; 37+ messages in thread From: Jeff King @ 2013-08-21 20:55 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Shawn O. Pearce, Nicolas Pitre When we are building the pack index, we can notice that there are duplicate objects, pick one "winner" instance, and mention the object only once in the index (mapped to the winner's offset). This has the effect that the duplicate packfile entries are never found by lookup. The data still exists in the packfile, though, and can be used as a delta base if delta base offsets are used. If delta refs are used, then it is possible that some deltas may be broken. Unfortunately, this scheme does not quite work, as the pack reader checks that the index and packfile claim the same number of objects, and will refuse to open such a packfile. We have a few options: 1. Loosen the check in the reader. This drops a potentially useful sanity check on the data, and it won't work for any other implementations (including older versions of git). 2. Loosen the check only if a special flag is found in the index indicating that we removed duplicates. This means that we only lose the safety check in the rare case that duplicates are found. But there is actually no place in the index to put such a flag, and in addition no other implementation would understand our flag. 3. Instead of reducing the number of objects in the index, include "dummy" entries using the null sha1 or a similar sentinel value (and pointing to some invalid offset). This should work, but is awfully hacky (and will probably create havoc as we will now claim to have the null sha1, but will throw errors if you actually look at it). Signed-off-by: Jeff King <peff@peff.net> --- I think this line of "fixing" should probably be scrapped. Truly fixing it and covering the REF_DELTA case would involve magic in the actual object lookups (we would have to detect cycles and find the "other" object that is the real base). And it's probably just not worth the effort.
builtin/index-pack.c | 2 ++ pack-write.c | 30 ++++++++++++++++++++++++++++++ pack.h | 3 ++- t/t5308-pack-detect-duplicates.sh | 8 ++++++++ 4 files changed, 42 insertions(+), 1 deletion(-) diff --git a/builtin/index-pack.c b/builtin/index-pack.c index 83fd3bb..1dadd56 100644 --- a/builtin/index-pack.c +++ b/builtin/index-pack.c @@ -1375,6 +1375,8 @@ static int git_index_pack_config(const char *k, const char *v, void *cb) opts->duplicates = WRITE_IDX_DUPLICATES_IGNORE; else if (boolval == 0 || !strcmp(v, "reject")) opts->duplicates = WRITE_IDX_DUPLICATES_REJECT; + else if (!strcmp(v, "skip")) + opts->duplicates = WRITE_IDX_DUPLICATES_SKIP; else die("unknown value for pack.indexduplicates: %s", v); return 0; diff --git a/pack-write.c b/pack-write.c index 542b081..657da2a 100644 --- a/pack-write.c +++ b/pack-write.c @@ -50,6 +50,27 @@ static void *find_duplicate(void *vbase, size_t n, size_t size, return NULL; } +static size_t remove_duplicates(void *base, size_t n, size_t size, + int (*cmp)(const void *, const void *)) +{ + unsigned char *from = base, *to = base; + + from += size; + to += size; + n--; + + while (n > 0) { + if (cmp(from, from - size)) { + if (to != from) + memcpy(to, from, size); + to += size; + } + from += size; + n--; + } + return (to - (unsigned char *)base) / size; +} + /* * On entry *sha1 contains the pack content SHA1 hash, on exit it is * the SHA1 hash of sorted object names. The objects array passed in @@ -89,6 +110,15 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec if (dup) die("pack has duplicate entries for %s", sha1_to_hex((*dup)->sha1)); + } else if (opts->duplicates == WRITE_IDX_DUPLICATES_SKIP) { + int nr_unique = remove_duplicates(sorted_by_sha, + nr_objects, + sizeof(*sorted_by_sha), + sha1_compare); + if (nr_unique != nr_objects) { + nr_objects = nr_unique; + last = sorted_by_sha + nr_objects; + } } if (opts->flags & WRITE_IDX_VERIFY) { diff --git a/pack.h b/pack.h index cd4b536..3017ea4 100644 --- a/pack.h +++ b/pack.h @@ -46,7 +46,8 @@ struct pack_idx_option { enum { WRITE_IDX_DUPLICATES_IGNORE = 0, - WRITE_IDX_DUPLICATES_REJECT + WRITE_IDX_DUPLICATES_REJECT, + WRITE_IDX_DUPLICATES_SKIP } duplicates; /* diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh index 0f2d928..963d7b9 100755 --- a/t/t5308-pack-detect-duplicates.sh +++ b/t/t5308-pack-detect-duplicates.sh @@ -102,4 +102,12 @@ test_expect_success 'index-pack can reject packs with duplicates' ' test_expect_code 1 git cat-file -e $LO_SHA1 ' +test_expect_success 'index-pack can fix packs with duplicates' ' + clear_packs && + create_pack 2 >dups.pack && + git -c pack.indexDuplicates=skip index-pack --stdin <dups.pack && + git cat-file --batch-check <input >actual && + test_cmp expect actual +' + test_done -- 1.8.4.rc2.28.g6bb5f3f ^ permalink raw reply related [flat|nested] 37+ messages in thread
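For reference, the from/to walk in remove_duplicates above is the classic in-place compaction of a sorted array: keep the first element of each run and shift the survivors left. Reduced to a self-contained toy on ints (a sketch, not the patch's generic version):

#include <stdio.h>

/* Compact a sorted array in place, keeping the first element of each
 * run of equals; returns the new count. Comparing a[from] with
 * a[from - 1] is safe: a slot is only ever rewritten with its own or
 * an earlier value, and "to" never passes "from". */
static size_t dedup(int *a, size_t n)
{
        size_t from, to;

        if (!n)
                return 0;
        for (from = 1, to = 1; from < n; from++)
                if (a[from] != a[from - 1])
                        a[to++] = a[from];
        return to;
}

int main(void)
{
        int a[] = { 1, 1, 2, 3, 3, 3, 4 };
        size_t i, n = dedup(a, 7);

        for (i = 0; i < n; i++)
                printf("%d ", a[i]);    /* 1 2 3 4 */
        printf("\n");
        return 0;
}

The index entries shrink this way, but, as the messages below work out, the packfile itself still contains every copy, which is what makes the "skip" approach unravel.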
* Re: [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries 2013-08-21 20:55 ` [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries Jeff King @ 2013-08-21 23:20 ` Junio C Hamano 2013-08-22 0:47 ` Jeff King 0 siblings, 1 reply; 37+ messages in thread From: Junio C Hamano @ 2013-08-21 23:20 UTC (permalink / raw) To: Jeff King; +Cc: git, Shawn O. Pearce, Nicolas Pitre Jeff King <peff@peff.net> writes: > When we are building the pack index, we can notice that > there are duplicate objects, pick one "winner" instance, and > mention the object only once in the index (mapped to the > winner's offset). > > This has the effect that the duplicate packfile entries are > never found by lookup. The data still exists in the > packfile, though, and can be used as a delta base if delta > base offsets are used. If delta refs are used, then it is > possible that some deltas may be broken. I do not understand the last bit. If two copies of an object exist but you have only one slot for the object in the index, and another object names it as its base with ref-delta, then reconstituting it should work just fine---whichever representation of the base object is recorded in the .idx, that first needs to be reconstituted before the delta is applied to it, and both copies should yield identical contents for the delta base object, no? In any case, ejecting one from the pack .idx would not help in the presence of either a broken or malicious packer that reuses deltas too aggressively. Suppose you have objects A and B and somehow manage to create a cycle of deltas, A names B as its delta base and B names A as its delta base. The packer may notice its mistake and then add another copy of A as a base object. The pack contains two copies of A (one is delta based on B, the other is full) and B (delta against A). If B refers to the copy of A that is delta against B using ofs-delta, fixing the .idx file will have no effect. read_packed_sha1(B) will read the delta data of B, finds the offset to start reading the data for A which was excised from the .idx and unless that codepath is updated to consult revindex (i.e. you mark one copy of A in the .pack as bad, and when B refers to that bad copy of A via ofs-delta, you check the offset against revindex to get the object name of A and go to the good copy of A), you will never finish reading B because reading the bad copy of A will lead you to first reconstitute B. > I think this line of "fixing" should probably be scrapped. I tend to agree. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries 2013-08-21 23:20 ` Junio C Hamano @ 2013-08-22 0:47 ` Jeff King 0 siblings, 0 replies; 37+ messages in thread From: Jeff King @ 2013-08-22 0:47 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Shawn O. Pearce, Nicolas Pitre On Wed, Aug 21, 2013 at 04:20:07PM -0700, Junio C Hamano wrote: > I do not understand the last bit. If two copies of an object exist > but you have only one slot for the object in the index, and another > object names it as its base with ref-delta, then reconstituting it > should work just fine---whichever representation of the base object > is recorded in the .idx, that first needs to be reconstituted before > the delta is applied to it, and both copies should yield identical > contents for the delta base object, no? > > In any case, ejecting one from the pack .idx would not help in the > presence of either a broken or malicious packer that reuses deltas too > aggressively. Suppose you have objects A and B and somehow manage > to create a cycle of deltas, A names B as its delta base and B names > A as its delta base. The packer may notice its mistake and then add > another copy of A as a base object. The pack contains two copies of > A (one is delta based on B, the other is full) and B (delta against > A). Yes, that is the potentially problematic scenario... > If B refers to the copy of A that is delta against B using ofs-delta, > fixing the .idx file will have no effect. read_packed_sha1(B) will > read the delta data of B, finds the offset to start reading the data > for A which was excised from the .idx and unless that codepath is > updated to consult revindex (i.e. you mark one copy of A in the .pack > as bad, and when B refers to that bad copy of A via ofs-delta, you > check the offset against revindex to get the object name of A and go > to the good copy of A), you will never finish reading B because > reading the bad copy of A will lead you to first reconstitute B. Yes. There are two levels of brokenness here. One is that the pack has a delta base offset for B that leads to the "wrong" A, creating a cycle. This is an utterly broken pack and cannot be fixed automatically (you do not even know the name of the cycled A because you cannot reconstitute it to find its name and realize that it has a duplicate!). But we should notice this during index-pack, because we cannot reconstruct the object. Another situation is that your delta base points to the "right" A. You can reconstruct either A or B, because although there is a duplicate, there are no cycles in the delta chain (i.e., the chain does not care about object name, only offset, and offsets point only one way). So we do not have a problem at all with reconstructing the object data. We do have two ways we might access A from the same pack. For regular object lookup, that is probably not a big deal. For operations like repacking that look more closely at the on-disk representation, I do not know. We may mark A as "has a delta" or "does not have a delta" at one point in the code, but then later find the other copy of A which does not match. I did not trace through all of pack-objects to find whether that is possible, or what bugs it might cause. But indexing only one copy as the "master" means that we will always get a consistent view of the object (from that particular pack, at least). Now consider REF_DELTA. We look up the base object by name, so we may get a different answer depending on how it is indexed.
E.g., index-pack keeps an internal hash table, whereas later callers will use the .idx file. When looking at the .idx file, we may use a regular binary search, or sha1_entry_pos. The exact order of the search may even change from git version to git version. So you may have a situation where index-pack finds the "right" A and properly reconstructs the file while creating the .idx, and thinks all is well. But later lookups may find the "wrong" A, and fail. In other words, we cannot trust that data that makes it through index-pack is not corrupted (and it is index-pack's output that lets receive-pack commit to the client that we have their objects, so we want to be reasonably sure we have uncorrupted copies at that point). Choosing a single "master" A to go in the .idx means we will get a consistent view of A once the .idx is generated. But it may not be the right one, and it may not be the one we used to check that we can resolve A and B in the first place. The right fix for that situation is to keep both entries in the .idx, detect delta cycles, then look up extra copies of A, but being aware that you already found one copy in the .idx and that sha1_entry_pos (or the vanilla binary search) should return any other copies, if they exist. I do not think we detect delta cycles at all (though I didn't check), but certainly we do not have a way to tell sha1_entry_pos any different information so that it will find the "other" A. So I think it is a recoverable state, but it is a non-trivial bit of code for something that should never happen. Rejecting on push/fetch via index-pack is simple and easy, and we can deal with the fallout if and when it ever actually happens. > > I think this line of "fixing" should probably be scrapped. > > I tend to agree. OK. Let's drop the "skip" patch, then. I'll re-roll with a patch to flip the default to "reject". -Peff ^ permalink raw reply [flat|nested] 37+ messages in thread
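The lookup ambiguity described above is easy to see outside git: a binary search over a sorted array containing duplicates may legally land on any member of the run, and which member depends on the probe sequence. A toy demonstration with libc's bsearch:

#include <stdio.h>
#include <stdlib.h>

static int intcmp(const void *a, const void *b)
{
        return *(const int *)a - *(const int *)b;
}

int main(void)
{
        /* bsearch may return any slot holding 3; which one depends on
         * the array size and the implementation's probe order -- the
         * same ambiguity as two copies of "A" in a pack .idx */
        int a[] = { 1, 3, 3, 3, 7 };
        int key = 3;
        int *hit = bsearch(&key, a, 5, sizeof(int), intcmp);

        printf("matched index %d\n", (int)(hit - a));
        return 0;
}

Different search strategies (a plain bisection, sha1_entry_pos, index-pack's internal hash table) are all "correct" yet can disagree about which duplicate they find, which is why rejecting such packs at the border is the simple, safe answer.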
So I think it is a recoverable state, but it is a non-trivial bit of code for something that should never happen. Rejecting on push/fetch via index-pack is simple and easy, and we can deal with the fallout if and when it ever actually happens.

> > I think this line of "fixing" should probably be scrapped.
>
> I tend to agree.

OK. Let's drop the "skip" patch, then. I'll re-roll with a patch to flip the default to "reject".

-Peff

^ permalink raw reply [flat|nested] 37+ messages in thread

* Re: [RFC/PATCH 0/4] duplicate objects in packfiles
  2013-08-21 20:49 ` [RFC/PATCH 0/4] duplicate objects in packfiles Jeff King
  ` (3 preceding siblings ...)
  2013-08-21 20:55 ` [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries Jeff King
@ 2013-08-21 22:17 ` Junio C Hamano
  4 siblings, 0 replies; 37+ messages in thread
From: Junio C Hamano @ 2013-08-21 22:17 UTC (permalink / raw)
To: Jeff King; +Cc: git, Shawn O. Pearce, Nicolas Pitre

Jeff King <peff@peff.net> writes:

> Which leaves the open question: should the default for index-pack flip
> to reject duplicates rather than allow?

I'd say so. In an emergency, people with broken packfiles can feed them to unpack-objects ;-)

More seriously, an alternative emergency mode may be an option to unpack-objects that does not actually unpack the objects, but instead streams them to a new pack or something, to resurrect whatever objects are salvageable from a corrupt packfile.
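Something along those lines already works as a crude salvage today. A sketch of the emergency path (the pack name here is made up; -r tells unpack-objects to keep going past corrupt entries, and a later repack picks the salvaged loose objects back up):

    $ mv .git/objects/pack/pack-deadbeef.pack .git/objects/pack/pack-deadbeef.idx /tmp/
    $ git unpack-objects -r < /tmp/pack-deadbeef.pack
    $ git repack -a -d

Objects not reachable from any ref will stay loose rather than be repacked, but at least they are out of the broken pack.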
^ permalink raw reply [flat|nested] 37+ messages in thread

* Re: duplicate objects in packfile
  2013-08-14 18:17 duplicate objects in packfile Jeff King
  2013-08-14 18:29 ` Junio C Hamano
@ 2013-08-14 18:50 ` Nicolas Pitre
  1 sibling, 0 replies; 37+ messages in thread
From: Nicolas Pitre @ 2013-08-14 18:50 UTC (permalink / raw)
To: Jeff King; +Cc: git, Shawn O. Pearce, Junio C Hamano

On Wed, 14 Aug 2013, Jeff King wrote:

> 1. Is sha1_entry_pos wrong to barf on duplicate items in the index? If
>    so, do we want to fix it, or simply retire GIT_USE_LOOKUP?

I'd think that sha1_entry_pos should be more lenient here, especially if this doesn't compromise the overall git behavior.

> Related, should we consider duplicate items in a packfile to be a
> bogus packfile (and consequently notice and complain during
> indexing)? I don't think it _hurts_ anything (aside from the assert
> above), though it is of course wasteful.

This should indeed be considered a bogus pack file. But we already have a lot of code for coping with bogus/corrupted pack files, and handling this case as well would not hurt. More importantly, we should make sure the code we have doesn't generate such packs.

> 2. How can duplicate entries get into a packfile?
>
> Git itself should not generate duplicate entries (pack-objects is
> careful to remove duplicates). Since these packs almost certainly
> were pushed by a client, I wondered if "index-pack --fix-thin"
> might accidentally add multiple copies of an object when it is the
> preferred base for multiple objects, but it specifically avoids
> doing so.

It is probably simpler than that. An alternative pack-objects implementation could stream multiple copies of an object upon a push, and index-pack on the receiving end would simply store what has been received to disk, as is, without a fuss.

> Given the dates on the packs and how rare this is, I'm pretty much
> willing to chalk it up to a random bug (in git or otherwise) that does
> not any longer exist.

Possibly. Given that this does not compromise the validity of the pack, and that a simple repack "fixes" it, I would not worry too much about it.
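For what it's worth, the dedup on the generating side is conceptually tiny: a name-keyed lookup before an object is queued for packing. A toy sketch, with a made-up table size and hash, not the actual pack-objects code (which uses a proper growable hash table):

    #include <string.h>

    #define KEYLEN 20			/* binary sha1 */
    #define TABLE_SIZE (1 << 16)	/* toy fixed size */

    static unsigned char names[TABLE_SIZE][KEYLEN];
    static unsigned char used[TABLE_SIZE];

    /* returns 1 if the name was already queued, 0 if newly added */
    static int add_object_once(const unsigned char *name)
    {
    	unsigned int h = (name[0] | (name[1] << 8)) % TABLE_SIZE;

    	while (used[h]) {
    		if (!memcmp(names[h], name, KEYLEN))
    			return 1;	/* duplicate: do not queue again */
    		h = (h + 1) % TABLE_SIZE;	/* linear probe */
    	}
    	used[h] = 1;
    	memcpy(names[h], name, KEYLEN);
    	return 0;
    }

A packer that skips this check -- or an alternative implementation that never had it -- is all it takes to produce the packs Jeff found.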
Nicolas

^ permalink raw reply [flat|nested] 37+ messages in thread

end of thread, other threads:[~2013-08-24 0:02 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-14 18:17 duplicate objects in packfile Jeff King
2013-08-14 18:29 ` Junio C Hamano
2013-08-14 18:39 ` Junio C Hamano
2013-08-14 18:54 ` Nicolas Pitre
2013-08-16 15:01 ` Jeff King
2013-08-21 20:49 ` [RFC/PATCH 0/4] duplicate objects in packfiles Jeff King
2013-08-21 20:51 ` [PATCH 1/4] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
2013-08-21 20:52 ` [PATCH 2/4] index-pack: optionally reject packs with duplicate objects Jeff King
2013-08-22 13:45 ` Duy Nguyen
2013-08-22 14:43 ` Jeff King
2013-08-22 23:12 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
2013-08-22 23:12 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
2013-08-22 23:14 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
2013-08-23 16:41 ` Junio C Hamano
2013-08-23 18:24 ` Jeff King
2013-08-23 18:54 ` Nicolas Pitre
2013-08-23 18:56 ` Jeff King
2013-08-24 0:01 ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
2013-08-24 0:01 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
2013-08-24 0:02 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
2013-08-24 0:02 ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King
2013-08-24 0:02 ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King
2013-08-24 0:02 ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King
2013-08-24 0:02 ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King
2013-08-23 19:41 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Johannes Sixt
2013-08-23 23:37 ` Jeff King
2013-08-22 23:14 ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King
2013-08-22 23:15 ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King
2013-08-22 23:15 ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King
2013-08-22 23:16 ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King
2013-08-22 23:35 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Nicolas Pitre
2013-08-21 20:53 ` [PATCH 3/4] reject duplicates when indexing a packfile we created Jeff King
2013-08-21 20:55 ` [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries Jeff King
2013-08-21 23:20 ` Junio C Hamano
2013-08-22 0:47 ` Jeff King
2013-08-21 22:17 ` [RFC/PATCH 0/4] duplicate objects in packfiles Junio C Hamano
2013-08-14 18:50 ` duplicate objects in packfile Nicolas Pitre