From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
"Shawn O. Pearce" <spearce@spearce.org>,
Nicolas Pitre <nico@fluxnic.net>
Subject: [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries
Date: Wed, 21 Aug 2013 16:55:55 -0400 [thread overview]
Message-ID: <20130821205555.GD28165@sigill.intra.peff.net> (raw)
In-Reply-To: <20130821204955.GA28025@sigill.intra.peff.net>
When we are building the pack index, we can notice that
there are duplicate objects, pick one "winner" instance, and
mention the object only once in the index (mapped to the
winner's offset).
This has the effect that the duplicate packfile entries are
never found by lookup. The data still exists in the
packfile, though, and can be used as a delta base if delta
base offsets are used. If delta refs are used, then it is
possible that some deltas may be broken.
Unfortunately, this scheme does not quite work, as the pack
reader checks that the index and packfile claim the same
number of objects, and will refuse to open such a packfile.
We have a few options:
1. Loosen the check in the reader. This drops a
potentially useful sanity check on the data, and it
won't work for any other implementations (including
older versions of git).
2. Loosen the check only if a special flag is found in the
index indicating that we removed duplicates. This means
that we only lose the safety check in the rare case
that duplicates are found. But there is actually no
place in the index to put such a flag, and in addition
no other implementation would understand our flag.
3. Instead of reducing the numnber of objects in the
index, include "dummy" entries using the null sha1 or a
similar sentinel value (and pointing to some invalid
offset). This should work, but is awfully hacky (and
will probably create havoc as we will now claim to have
the null sha1, but will throw errors if you actually
look at it).
Signed-off-by: Jeff King <peff@peff.net>
---
I think this line of "fixing" should probably be scrapped. Truly fixing
it and covering the REF_DELTA case would involve magic in the actual
object lookups (we would have to detect cycles and find the "other"
object that is the real base). And it's probably just not worth the
effort.
builtin/index-pack.c | 2 ++
pack-write.c | 30 ++++++++++++++++++++++++++++++
pack.h | 3 ++-
t/t5308-pack-detect-duplicates.sh | 8 ++++++++
4 files changed, 42 insertions(+), 1 deletion(-)
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 83fd3bb..1dadd56 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1375,6 +1375,8 @@ static int git_index_pack_config(const char *k, const char *v, void *cb)
opts->duplicates = WRITE_IDX_DUPLICATES_IGNORE;
else if (boolval == 0 || !strcmp(v, "reject"))
opts->duplicates = WRITE_IDX_DUPLICATES_REJECT;
+ else if (!strcmp(v, "skip"))
+ opts->duplicates = WRITE_IDX_DUPLICATES_SKIP;
else
die("unknown value for pack.indexduplicates: %s", v);
return 0;
diff --git a/pack-write.c b/pack-write.c
index 542b081..657da2a 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -50,6 +50,27 @@ static void *find_duplicate(void *vbase, size_t n, size_t size,
return NULL;
}
+static size_t remove_duplicates(void *base, size_t n, size_t size,
+ int (*cmp)(const void *, const void *))
+{
+ unsigned char *from = base, *to = base;
+
+ from += size;
+ to += size;
+ n--;
+
+ while (n > 0) {
+ if (cmp(from, from - size)) {
+ if (to != from)
+ memcpy(to, from, size);
+ to += size;
+ }
+ from += size;
+ n--;
+ }
+ return (to - (unsigned char *)base) / size;
+}
+
/*
* On entry *sha1 contains the pack content SHA1 hash, on exit it is
* the SHA1 hash of sorted object names. The objects array passed in
@@ -89,6 +110,15 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
if (dup)
die("pack has duplicate entries for %s",
sha1_to_hex((*dup)->sha1));
+ } else if (opts->duplicates == WRITE_IDX_DUPLICATES_SKIP) {
+ int nr_unique = remove_duplicates(sorted_by_sha,
+ nr_objects,
+ sizeof(*sorted_by_sha),
+ sha1_compare);
+ if (nr_unique != nr_objects) {
+ nr_objects = nr_unique;
+ last = sorted_by_sha + nr_objects;
+ }
}
if (opts->flags & WRITE_IDX_VERIFY) {
diff --git a/pack.h b/pack.h
index cd4b536..3017ea4 100644
--- a/pack.h
+++ b/pack.h
@@ -46,7 +46,8 @@ struct pack_idx_option {
enum {
WRITE_IDX_DUPLICATES_IGNORE = 0,
- WRITE_IDX_DUPLICATES_REJECT
+ WRITE_IDX_DUPLICATES_REJECT,
+ WRITE_IDX_DUPLICATES_SKIP
} duplicates;
/*
diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh
index 0f2d928..963d7b9 100755
--- a/t/t5308-pack-detect-duplicates.sh
+++ b/t/t5308-pack-detect-duplicates.sh
@@ -102,4 +102,12 @@ test_expect_success 'index-pack can reject packs with duplicates' '
test_expect_code 1 git cat-file -e $LO_SHA1
'
+test_expect_success 'index-pack can fix packs with duplicates' '
+ clear_packs &&
+ create_pack 2 >dups.pack &&
+ git -c pack.indexDuplicates=skip index-pack --stdin <dups.pack &&
+ git cat-file --batch-check <input >actual &&
+ test_cmp expect actual
+'
+
test_done
--
1.8.4.rc2.28.g6bb5f3f
next prev parent reply other threads:[~2013-08-21 20:56 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-08-14 18:17 duplicate objects in packfile Jeff King
2013-08-14 18:29 ` Junio C Hamano
2013-08-14 18:39 ` Junio C Hamano
2013-08-14 18:54 ` Nicolas Pitre
2013-08-16 15:01 ` Jeff King
2013-08-21 20:49 ` [RFC/PATCH 0/4] duplicate objects in packfiles Jeff King
2013-08-21 20:51 ` [PATCH 1/4] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
2013-08-21 20:52 ` [PATCH 2/4] index-pack: optionally reject packs with duplicate objects Jeff King
2013-08-22 13:45 ` Duy Nguyen
2013-08-22 14:43 ` Jeff King
2013-08-22 23:12 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
2013-08-22 23:12 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
2013-08-22 23:14 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
2013-08-23 16:41 ` Junio C Hamano
2013-08-23 18:24 ` Jeff King
2013-08-23 18:54 ` Nicolas Pitre
2013-08-23 18:56 ` Jeff King
2013-08-24 0:01 ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
2013-08-24 0:01 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
2013-08-24 0:02 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
2013-08-24 0:02 ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King
2013-08-24 0:02 ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King
2013-08-24 0:02 ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King
2013-08-24 0:02 ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King
2013-08-23 19:41 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Johannes Sixt
2013-08-23 23:37 ` Jeff King
2013-08-22 23:14 ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King
2013-08-22 23:15 ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King
2013-08-22 23:15 ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King
2013-08-22 23:16 ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King
2013-08-22 23:35 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Nicolas Pitre
2013-08-21 20:53 ` [PATCH 3/4] reject duplicates when indexing a packfile we created Jeff King
2013-08-21 20:55 ` Jeff King [this message]
2013-08-21 23:20 ` [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries Junio C Hamano
2013-08-22 0:47 ` Jeff King
2013-08-21 22:17 ` [RFC/PATCH 0/4] duplicate objects in packfiles Junio C Hamano
2013-08-14 18:50 ` duplicate objects in packfile Nicolas Pitre
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130821205555.GD28165@sigill.intra.peff.net \
--to=peff@peff.net \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=nico@fluxnic.net \
--cc=spearce@spearce.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).