* [BUG] diffcore-rename with duplicate tree entries can segfault
@ 2015-02-24 21:43 Jeff King
2015-02-24 22:42 ` Junio C Hamano
0 siblings, 1 reply; 12+ messages in thread
From: Jeff King @ 2015-02-24 21:43 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
I ran across a real-world case where git segfaults on some trees that
have duplicate entries. Those trees are obviously broken, and I'm fine
with us producing whatever output we like on them. But probably we
shouldn't segfault.
Basically what happens is that the rename_dst array maps paths to
diff_filepairs. When we process the results of the rename, if we have
duplicates we may hit that same pathname multiple times, and pull out
the same diff_filepair. So we end up with a diff_queue that has the same
filepair mentioned multiple times. When it comes time to flush and free
that queue, we do a double (or triple) free of the filepair and its
contents, which corrupts the heap.
I managed to reduce it to a simplified test case:
diff --git a/t/t4058-diff-duplicates.sh b/t/t4058-diff-duplicates.sh
new file mode 100755
index 0000000..9e03490
--- /dev/null
+++ b/t/t4058-diff-duplicates.sh
@@ -0,0 +1,85 @@
+#!/bin/sh
+
+test_description='test tree diff when trees have duplicate entries'
+. ./test-lib.sh
+
+# make_tree_entry <mode> <mode> <sha1>
+#
+# We have to rely on perl here because not all printfs understand
+# hex escapes (only octal), and xxd is not portable.
+make_tree_entry () {
+ printf '%s %s\0' "$1" "$2" &&
+ perl -e 'print chr(hex($_)) for ($ARGV[0] =~ /../g)' "$3"
+}
+
+# Like git-mktree, but without all of the pesky sanity checking.
+# Arguments come in groups of three, each group specifying a single
+# tree entry (see make_tree_entry above).
+make_tree () {
+ while test $# -gt 2; do
+ make_tree_entry "$1" "$2" "$3"
+ shift; shift; shift
+ done |
+ git hash-object -w -t tree --stdin
+}
+
+# this is kind of a convoluted setup, but matches
+# a real-world case. Each tree contains four entries
+# for the given path, one with one sha1, and three with
+# the other. The first tree has them split across
+# two subtrees (which are themselves duplicate entries in
+# the root tree), and the second has them all in a single subtree.
+test_expect_success 'create trees with duplicate entries' '
+ blob_one=$(echo one | git hash-object -w --stdin) &&
+ blob_two=$(echo two | git hash-object -w --stdin) &&
+ inner_one_a=$(make_tree \
+ 100644 inner $blob_one
+ ) &&
+ inner_one_b=$(make_tree \
+ 100644 inner $blob_two \
+ 100644 inner $blob_two \
+ 100644 inner $blob_two
+ ) &&
+ outer_one=$(make_tree \
+ 040000 outer $inner_one_a \
+ 040000 outer $inner_one_b
+ ) &&
+ inner_two=$(make_tree \
+ 100644 inner $blob_one \
+ 100644 inner $blob_two \
+ 100644 inner $blob_two \
+ 100644 inner $blob_two
+ ) &&
+ outer_two=$(make_tree \
+ 040000 outer $inner_two
+ ) &&
+ git tag one $outer_one &&
+ git tag two $outer_two
+'
+
+test_expect_success 'diff-tree between trees' '
+ {
+ printf ":000000 100644 $_z40 $blob_two A\touter/inner\n" &&
+ printf ":000000 100644 $_z40 $blob_two A\touter/inner\n" &&
+ printf ":000000 100644 $_z40 $blob_two A\touter/inner\n" &&
+ printf ":100644 000000 $blob_two $_z40 D\touter/inner\n" &&
+ printf ":100644 000000 $blob_two $_z40 D\touter/inner\n" &&
+ printf ":100644 000000 $blob_two $_z40 D\touter/inner\n"
+ } >expect &&
+ git diff-tree -r --no-abbrev one two >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'diff-tree with renames' '
+ {
+ printf ":100644 100644 $blob_two $blob_two M\touter/inner\n" &&
+ printf ":000000 100644 $_z40 $blob_two A\touter/inner\n" &&
+ printf ":000000 100644 $_z40 $blob_two A\touter/inner\n" &&
+ printf ":100644 000000 $blob_two $_z40 D\touter/inner\n" &&
+ printf ":100644 000000 $blob_two $_z40 D\touter/inner\n"
+ } >expect &&
+ git diff-tree -M -r --no-abbrev one two >actual &&
+ test_cmp expect actual
+'
+
+test_done
The diff-tree outputs are up for debate. Possibly we should not even be
checking the output at all, and just making sure we don't segfault. The
first one is produced with stock git, and kind-of makes sense (we don't
link up the obviously matching paths in the tree diff because they come
from two different subtrees). Of course with "-p" it cannot be applied,
but it is at least somewhat informative to the user.
The second one is what is produced by the patch below. It also kind of
makes sense (we link up one pair, but the others are left out of the
rename detection). Though I would have expected the `M` in the first
line to be an `R100`.
My idea for a fix was to make sure we didn't pull the same entry out of
locate_rename_dst multiple times:
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 4e132f1..7030502 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -12,6 +12,7 @@
static struct diff_rename_dst {
struct diff_filespec *two;
struct diff_filepair *pair;
+ int used;
} *rename_dst;
static int rename_dst_nr, rename_dst_alloc;
@@ -27,7 +28,7 @@ static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two,
struct diff_rename_dst *dst = &(rename_dst[next]);
int cmp = strcmp(two->path, dst->two->path);
if (!cmp)
- return dst;
+ return dst->used ? NULL : dst;
if (cmp < 0) {
last = next;
continue;
@@ -46,6 +47,7 @@ static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two,
rename_dst[first].two = alloc_filespec(two->path);
fill_filespec(rename_dst[first].two, two->sha1, two->sha1_valid, two->mode);
rename_dst[first].pair = NULL;
+ rename_dst[first].used = 0;
return &(rename_dst[first]);
}
@@ -587,6 +589,7 @@ void diffcore_rename(struct diff_options *options)
if (dst && dst->pair) {
diff_q(&outq, dst->pair);
pair_to_free = p;
+ dst->used = 1;
}
else
/* no matching rename/copy source, so
That does fix this problem, and it doesn't break any other tests. But
frankly, I don't know what I'm doing and this feels like a giant hack.
The "M" output above is confusing to me, and I have no faith that there
isn't another way this problem can come up (e.g., with duplicate
sources).
Given that this is tangentially related to the "-B -M" stuff you've been
looking at (and it's your code in the first place :) ), I thought you
might have some insight.
-Peff
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [BUG] diffcore-rename with duplicate tree entries can segfault
2015-02-24 21:43 [BUG] diffcore-rename with duplicate tree entries can segfault Jeff King
@ 2015-02-24 22:42 ` Junio C Hamano
2015-02-24 22:49 ` Jeff King
0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2015-02-24 22:42 UTC (permalink / raw)
To: Jeff King; +Cc: git
Jeff King <peff@peff.net> writes:
> I ran across a real-world case where git segfaults on some trees that
> have duplicate entries. Those trees are obviously broken, and I'm fine
> with us producing whatever output we like on them. But probably we
> shouldn't segfault.
Thanks.
> ...
> That does fix this problem, and it doesn't break any other tests. But
> frankly, I don't know what I'm doing and this feels like a giant hack.
>
> Given that this is tangentially related to the "-B -M" stuff you've been
> looking at (and it's your code in the first place :) ), I thought you
> might have some insight.
Indeed.
Honestly, I'd rather see us diagnose duplicate filepairs as an error
and drop them on the floor upon entry to the diffcore_std(), even
before we go into the rename codepath.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [BUG] diffcore-rename with duplicate tree entries can segfault
2015-02-24 22:42 ` Junio C Hamano
@ 2015-02-24 22:49 ` Jeff King
2015-02-24 23:11 ` Junio C Hamano
0 siblings, 1 reply; 12+ messages in thread
From: Jeff King @ 2015-02-24 22:49 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Tue, Feb 24, 2015 at 02:42:42PM -0800, Junio C Hamano wrote:
> > That does fix this problem, and it doesn't break any other tests. But
> > frankly, I don't know what I'm doing and this feels like a giant hack.
> >
> > Given that this is tangentially related to the "-B -M" stuff you've been
> > looking at (and it's your code in the first place :) ), I thought you
> > might have some insight.
>
> Indeed.
>
> Honestly, I'd rather see us diagnose duplicate filepairs as an error
> and drop them on the floor upon entry to the diffcore_std(), even
> before we go into the rename codepath.
Yeah, I had a similar thought. Just saying "your diff is broken, we
can't do rename detection" is totally reasonable to me.
My main concern with that approach is that we would waste time finding
the duplicate paths, for something that comes up only rarely. At the
time of locate_rename_dst, we've already created a mapping, and it's
very easy to detect the duplicates. But before that, we have only the
linear list of queued items.
In theory they're sorted and we could do an O(n) pass to find
duplicates. But I'm not sure if the sorting holds in the face of other
breakages (like unsorted trees; they also shouldn't happen, but the
whole point here is to gracefully handle things that shouldn't).
I dunno. Maybe we could do an O(n) pass to check sort order and
uniqueness. If either fails (which should be rare), then we sort and
re-check uniqueness.
I'm assuming there _is_ a sane sort order. We have two halves of a
filepair, but I think before any of the rename or break detection kicks
in, each pair should either:
1. Have a name in pair->one, and an invalid filespec in pair->two
(i.e., a deletion).
2. The opposite (name in pair->two, /dev/null in pair->one). An
addition.
3. The same name in pair->one and pair->two.
-Peff
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [BUG] diffcore-rename with duplicate tree entries can segfault
2015-02-24 22:49 ` Jeff King
@ 2015-02-24 23:11 ` Junio C Hamano
2015-02-24 23:47 ` Jeff King
0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2015-02-24 23:11 UTC (permalink / raw)
To: Jeff King; +Cc: git
Jeff King <peff@peff.net> writes:
> I'm assuming there _is_ a sane sort order. We have two halves of a
> filepair, but I think before any of the rename or break detection kicks
> in, each pair should either:
>
> 1. Have a name in pair->one, and an invalid filespec in pair->two
> (i.e., a deletion).
>
> 2. The opposite (name in pair->two, /dev/null in pair->one). An
> addition.
>
> 3. The same name in pair->one and pair->two.
I think creation and deletion are expressed with mode=0 and not with
/dev/null.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [BUG] diffcore-rename with duplicate tree entries can segfault
2015-02-24 23:11 ` Junio C Hamano
@ 2015-02-24 23:47 ` Jeff King
2015-02-25 5:00 ` Junio C Hamano
0 siblings, 1 reply; 12+ messages in thread
From: Jeff King @ 2015-02-24 23:47 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Tue, Feb 24, 2015 at 03:11:00PM -0800, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
> > I'm assuming there _is_ a sane sort order. We have two halves of a
> > filepair, but I think before any of the rename or break detection kicks
> > in, each pair should either:
> >
> > 1. Have a name in pair->one, and an invalid filespec in pair->two
> > (i.e., a deletion).
> >
> > 2. The opposite (name in pair->two, /dev/null in pair->one). An
> > addition.
> >
> > 3. The same name in pair->one and pair->two.
>
> I think creation and deletion are expressed with mode=0 and not with
> /dev/null.
Yeah, or better spelled DIFF_FILE_VALID().
So here's a first stab at doing this. There are a couple of issues:
1. It makes the check part of diffcore_std. My thinking was that we
can actually correct some problems (like sort order) for all cases,
and the duplicate check may want to disable more than renames
(e.g., it could turn off any fancy features like break detection).
It is possible to run diffcore_rename by itself, which would miss
this protection. We don't seem to do that anywhere, though (even
for try_to_follow_renames, we set up a new diff_options struct and
just call diffcore_std).
2. It disables rename detection by tweaking the diff_options struct.
This is OK for a single diff, but I suspect is wrong for "git log",
as we use the same diff_options for each (so one bogus diff would
turn off renames for the rest of the commits). We can probably get
around this by returning a "bogus, don't do renames" flag from
diffcore_sanity and respecting it in diffcore_std.
3. The sort order check is wrong. :-/ It needs to take into account
git's magic "if it's a tree, pretend it has '/' after it" rule.
That's not too hard for a single tree (fsck.c:verify_ordered does
it). But for filepairs, I'm not sure what to do. Most cases
have a single mode/name pair. But what about a D/F typechange? If
"foo" becomes "foo/", which do I use to sort?
I have a feeling the order in which we queue the pairs may even
depend on the direction of the typechange. Or maybe it always
matches the p->one version. I'll have to think on it.
I'm out of time to work on this tonight, but I'll try to get back to it
tomorrow. Any wisdom is appreciated.
---
diff --git a/Makefile b/Makefile
index 44f1dd1..838a21c 100644
--- a/Makefile
+++ b/Makefile
@@ -690,6 +690,7 @@ LIB_OBJS += diffcore-delta.o
LIB_OBJS += diffcore-order.o
LIB_OBJS += diffcore-pickaxe.o
LIB_OBJS += diffcore-rename.o
+LIB_OBJS += diffcore-sanity.o
LIB_OBJS += diff-delta.o
LIB_OBJS += diff-lib.o
LIB_OBJS += diff-no-index.o
diff --git a/diff.c b/diff.c
index d1bd534..5f7c43a 100644
--- a/diff.c
+++ b/diff.c
@@ -4768,6 +4768,7 @@ void diffcore_std(struct diff_options *options)
/* NOTE please keep the following in sync with diff_tree_combined() */
if (options->skip_stat_unmatch)
diffcore_skip_stat_unmatch(options);
+ diffcore_sanity(options);
if (!options->found_follow) {
/* See try_to_follow_renames() in tree-diff.c */
if (options->break_opt != -1)
diff --git a/diffcore-sanity.c b/diffcore-sanity.c
new file mode 100644
index 0000000..ab770c9
--- /dev/null
+++ b/diffcore-sanity.c
@@ -0,0 +1,84 @@
+#include "cache.h"
+#include "diff.h"
+#include "diffcore.h"
+
+enum sanity_check {
+ SANITY_OK,
+ SANITY_UNSORTED,
+ SANITY_DUPLICATES
+};
+
+static const char *filepair_name(const struct diff_filepair *p)
+{
+ /*
+ * There's no point in checking that one or the other is valid;
+ * that invariant is set by earlier code, not by the trees
+ * themselves.
+ */
+ if (!DIFF_FILE_VALID(p->one))
+ return p->two->path;
+ if (!DIFF_FILE_VALID(p->two))
+ return p->one->path;
+ /*
+ * one->path should be the same as two->path here;
+ * we could check, but again, this invariant comes
+ * from our diff code, not the tree
+ */
+ return p->one->path;
+}
+
+static enum sanity_check check_sanity(struct diff_queue_struct *q)
+{
+ int i;
+ for (i = 1; i < q->nr; i++) {
+ int cmp = strcmp(filepair_name(q->queue[i - 1]),
+ filepair_name(q->queue[i]));
+ if (cmp == 0)
+ return SANITY_DUPLICATES;
+ if (cmp > 0)
+ return SANITY_UNSORTED;
+ }
+ return SANITY_OK;
+}
+
+int cmp_filepair(const void *va, const void *vb)
+{
+ const struct diff_filepair * const *a = va, * const *b = vb;
+ return strcmp(filepair_name(*a), filepair_name(*b));
+}
+
+static void sort_diff_queue(struct diff_queue_struct *q)
+{
+ qsort(q->queue, q->nr, sizeof(*q->queue), cmp_filepair);
+}
+
+void diffcore_sanity(struct diff_options *options)
+{
+ enum sanity_check check = check_sanity(&diff_queued_diff);
+
+ if (check == SANITY_OK)
+ return;
+
+ /*
+ * We can fix sorting, but once fixed, we have to check
+ * again to make sure we don't have duplicates, since
+ * that relies on sort order.
+ */
+ if (check == SANITY_UNSORTED) {
+ warning("diff entries are not sorted; your trees may be broken");
+ sort_diff_queue(&diff_queued_diff);
+ check = check_sanity(&diff_queued_diff);
+ if (check == SANITY_OK)
+ return;
+ }
+
+ /*
+ * If we get here, we have duplicates. The rename code isn't ready
+ * to handle this, so we have to turn it off.
+ */
+ if (options->detect_rename) {
+ warning("duplicate tree entries found; disabling rename detection");
+ options->detect_rename = 0;
+ }
+ /* XXX break detection, too? */
+}
diff --git a/diffcore.h b/diffcore.h
index 33ea2de..fa359ce 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -107,6 +107,7 @@ extern struct diff_filepair *diff_queue(struct diff_queue_struct *,
struct diff_filespec *);
extern void diff_q(struct diff_queue_struct *, struct diff_filepair *);
+extern void diffcore_sanity(struct diff_options *);
extern void diffcore_break(int);
extern void diffcore_rename(struct diff_options *);
extern void diffcore_merge_broken(void);
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [BUG] diffcore-rename with duplicate tree entries can segfault
2015-02-24 23:47 ` Jeff King
@ 2015-02-25 5:00 ` Junio C Hamano
2015-02-25 21:40 ` Jeff King
0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2015-02-25 5:00 UTC (permalink / raw)
To: Jeff King; +Cc: git
Jeff King <peff@peff.net> writes:
> 3. The sort order check is wrong. :-/ It needs to take into account
> git's magic "if it's a tree, pretend it has '/' after it" rule.
> That's not too hard for a single tree (fsck.c:verify_ordered does
> it). But for filepairs, I'm not sure what to do. Most cases
> have a single mode/name pair. But what about a D/F typechange? If
> "foo" becomes "foo/", which do I use to sort?
I think diff-index populates the diff queue in a wrong order and
then calls diffcore_fix_diff_index() to fix it up.
I am a bit worried about the effect this stricter input check might
have to "diff --no-index" codepath, though.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [BUG] diffcore-rename with duplicate tree entries can segfault
2015-02-25 5:00 ` Junio C Hamano
@ 2015-02-25 21:40 ` Jeff King
2015-02-25 21:50 ` Junio C Hamano
0 siblings, 1 reply; 12+ messages in thread
From: Jeff King @ 2015-02-25 21:40 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Tue, Feb 24, 2015 at 09:00:20PM -0800, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
> > 3. The sort order check is wrong. :-/ It needs to take into account
> > git's magic "if it's a tree, pretend it has '/' after it" rule.
> > That's not too hard for a single tree (fsck.c:verify_ordered does
> > it). But for filepairs, I'm not sure what to do. Most cases
> > have a single mode/name pair. But what about a D/F typechange? If
> > "foo" becomes "foo/", which do I use to sort?
>
> I think diff-index populates the diff queue in a wrong order and
> then calls diffcore_fix_diff_index() to fix it up.
>
> I am a bit worried about the effect this stricter input check might
> have to "diff --no-index" codepath, though.
Interesting, I didn't even know about diffcore_fix_diff_index.
Let's forget about sorting for a moment. Unsorted trees will produce odd
results, certainly, but the order of your diff is perhaps the least of
the problems. The only reason we care about it here is to do a cheap
uniq() operation.
But we can also do that with a hash table, or an auxiliary sorted array.
And sure enough, that's exactly what the rename_dst array is. When we
are inserting into it, we can easily detect duplicates there, and just
abort the rename operation. Like:
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 4e132f1..e26dd62 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -27,7 +27,7 @@ static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two,
struct diff_rename_dst *dst = &(rename_dst[next]);
int cmp = strcmp(two->path, dst->two->path);
if (!cmp)
- return dst;
+ return insert_ok ? NULL : dst;
if (cmp < 0) {
last = next;
continue;
@@ -450,8 +450,10 @@ void diffcore_rename(struct diff_options *options)
else if (!DIFF_OPT_TST(options, RENAME_EMPTY) &&
is_empty_blob_sha1(p->two->sha1))
continue;
- else
- locate_rename_dst(p->two, 1);
+ else {
+ if (!locate_rename_dst(p->two, 1))
+ goto cleanup;
+ }
}
else if (!DIFF_OPT_TST(options, RENAME_EMPTY) &&
is_empty_blob_sha1(p->one->sha1))
which I think is a pretty simple and sane fix. I do notice that
rename_dst and rename_src are essentially hash tables. They predate
hashmap.c by quite a bit, but it might be worth moving them. It would
simplify the code, and I suspect perform better (they progressively
insert into a sorted list; in the worst case that means O(n^2)
memmove's, though if the input is sorted we get best-case behavior).
So to go forward, I'm happy to prepare a patch, but I'd like to know:
1. Does something like the above look reasonable to you (I'd probably
refactor it to avoid the bizarre return value semantics from
locate_rename_dst, though)?
2. If so, do you want something minimal like what's above, or do you
mind if I build it on top of a hashmap conversion? I suspect the
logic may also end up more clear with the hashmap (since inserting
versus lookup will be more distinct in the callers).
-Peff
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [BUG] diffcore-rename with duplicate tree entries can segfault
2015-02-25 21:40 ` Jeff King
@ 2015-02-25 21:50 ` Junio C Hamano
2015-02-27 1:38 ` [PATCH 0/2] " Jeff King
0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2015-02-25 21:50 UTC (permalink / raw)
To: Jeff King; +Cc: git
Jeff King <peff@peff.net> writes:
> But we can also do that with a hash table, or an auxiliary sorted array.
> And sure enough, that's exactly what the rename_dst array is.
> ...
> which I think is a pretty simple and sane fix.
Yeah, good observation.
> So to go forward, I'm happy to prepare a patch, but I'd like to know:
>
> 1. Does something like the above look reasonable to you (I'd probably
> refactor it to avoid the bizarre return value semantics from
> locate_rename_dst, though)?
>
> 2. If so, do you want something minimal like what's above, or do you
> mind if I build it on top of a hashmap conversion? I suspect the
> logic may also end up more clear with the hashmap (since inserting
> versus lookup will be more distinct in the callers).
No, I don't mind. The diff-b-m topic seems to need a lot deeper
rethink than I originally anticipated anyway, and it can wait for a
clean-up to use hashmap to stabilize.
Thanks.
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 0/2] diffcore-rename with duplicate tree entries can segfault
2015-02-25 21:50 ` Junio C Hamano
@ 2015-02-27 1:38 ` Jeff King
2015-02-27 1:39 ` [PATCH 1/2] diffcore-rename: split locate_rename_dst into two functions Jeff King
2015-02-27 1:42 ` [PATCH 2/2] diffcore-rename: avoid processing duplicate destinations Jeff King
0 siblings, 2 replies; 12+ messages in thread
From: Jeff King @ 2015-02-27 1:38 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Wed, Feb 25, 2015 at 01:50:38PM -0800, Junio C Hamano wrote:
> > So to go forward, I'm happy to prepare a patch, but I'd like to know:
> >
> > 1. Does something like the above look reasonable to you (I'd probably
> > refactor it to avoid the bizarre return value semantics from
> > locate_rename_dst, though)?
> >
> > 2. If so, do you want something minimal like what's above, or do you
> > mind if I build it on top of a hashmap conversion? I suspect the
> > logic may also end up more clear with the hashmap (since inserting
> > versus lookup will be more distinct in the callers).
>
> No, I don't mind. The diff-b-m topic seems to need a lot deeper
> rethink than I originally anticipated anyway, and it can wait for a
> clean-up to use hashmap to stabilize.
I tried switching to a hashmap, but the diff_score code actually wants
to index into the array using an int. In a hashmap, we'd use a pointer
instead, but that increases the size of "struct diff_score", which is
something that we have to allocate a lot of (src * dst, I think).
So I punted on that and just cleaned up the locate_rename_dst interface
a bit. Here's the result.
[1/2]: diffcore-rename: split locate_rename_dst into two functions
[2/2]: diffcore-rename: avoid processing duplicate destinations
-Peff
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 1/2] diffcore-rename: split locate_rename_dst into two functions
2015-02-27 1:38 ` [PATCH 0/2] " Jeff King
@ 2015-02-27 1:39 ` Jeff King
2015-02-27 1:42 ` [PATCH 2/2] diffcore-rename: avoid processing duplicate destinations Jeff King
1 sibling, 0 replies; 12+ messages in thread
From: Jeff King @ 2015-02-27 1:39 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
This function manages the mapping of destination pathnames
to filepairs, and it handles both insertion and lookup. This
makes the return value a bit confusing, as we return a newly
created entry (even though no caller cares), and have no
room to indicate to the caller that an entry already
existed.
Instead, let's break this up into two distinct functions,
both backed by a common binary search. The binary search
will use our normal "return the index if we found something,
or negative index minus one to show where it would have
gone" semantics.
Signed-off-by: Jeff King <peff@peff.net>
---
diffcore-rename.c | 38 ++++++++++++++++++++++++++------------
1 file changed, 26 insertions(+), 12 deletions(-)
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 4e132f1..4250cc0 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -15,8 +15,7 @@ static struct diff_rename_dst {
} *rename_dst;
static int rename_dst_nr, rename_dst_alloc;
-static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two,
- int insert_ok)
+static int find_rename_dst(struct diff_filespec *two)
{
int first, last;
@@ -27,16 +26,33 @@ static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two,
struct diff_rename_dst *dst = &(rename_dst[next]);
int cmp = strcmp(two->path, dst->two->path);
if (!cmp)
- return dst;
+ return next;
if (cmp < 0) {
last = next;
continue;
}
first = next+1;
}
- /* not found */
- if (!insert_ok)
- return NULL;
+ return -first - 1;
+}
+
+static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two)
+{
+ int ofs = find_rename_dst(two);
+ return ofs < 0 ? NULL : &rename_dst[ofs];
+}
+
+/*
+ * Returns 0 on success, -1 if we found a duplicate.
+ */
+static int add_rename_dst(struct diff_filespec *two)
+{
+ int first = find_rename_dst(two);
+
+ if (first >= 0)
+ return -1;
+ first = -first - 1;
+
/* insert to make it at "first" */
ALLOC_GROW(rename_dst, rename_dst_nr + 1, rename_dst_alloc);
rename_dst_nr++;
@@ -46,7 +62,7 @@ static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two,
rename_dst[first].two = alloc_filespec(two->path);
fill_filespec(rename_dst[first].two, two->sha1, two->sha1_valid, two->mode);
rename_dst[first].pair = NULL;
- return &(rename_dst[first]);
+ return 0;
}
/* Table of rename/copy src files */
@@ -451,7 +467,7 @@ void diffcore_rename(struct diff_options *options)
is_empty_blob_sha1(p->two->sha1))
continue;
else
- locate_rename_dst(p->two, 1);
+ add_rename_dst(p->two);
}
else if (!DIFF_OPT_TST(options, RENAME_EMPTY) &&
is_empty_blob_sha1(p->one->sha1))
@@ -582,8 +598,7 @@ void diffcore_rename(struct diff_options *options)
* We would output this create record if it has
* not been turned into a rename/copy already.
*/
- struct diff_rename_dst *dst =
- locate_rename_dst(p->two, 0);
+ struct diff_rename_dst *dst = locate_rename_dst(p->two);
if (dst && dst->pair) {
diff_q(&outq, dst->pair);
pair_to_free = p;
@@ -613,8 +628,7 @@ void diffcore_rename(struct diff_options *options)
*/
if (DIFF_PAIR_BROKEN(p)) {
/* broken delete */
- struct diff_rename_dst *dst =
- locate_rename_dst(p->one, 0);
+ struct diff_rename_dst *dst = locate_rename_dst(p->one);
if (dst && dst->pair)
/* counterpart is now rename/copy */
pair_to_free = p;
--
2.3.0.449.g1690e78
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 2/2] diffcore-rename: avoid processing duplicate destinations
2015-02-27 1:38 ` [PATCH 0/2] " Jeff King
2015-02-27 1:39 ` [PATCH 1/2] diffcore-rename: split locate_rename_dst into two functions Jeff King
@ 2015-02-27 1:42 ` Jeff King
2015-02-27 21:48 ` Junio C Hamano
1 sibling, 1 reply; 12+ messages in thread
From: Jeff King @ 2015-02-27 1:42 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
The rename code cannot handle an input where we have
duplicate destinations (i.e., more than one diff_filepair in
the queue with the same string in its pair->two->path). We
end up allocating only one slot in the rename_dst mapping.
If we fill in the diff_filepair for that slot, when we
re-queue the results, we may queue that filepair multiple
times. When the diff is finally flushed, the filepair is
processed and free()d multiple times, leading to heap
corruption.
This situation should only happen when a tree diff sees
duplicates in one of the trees (see the added test for a
detailed example). Rather than handle it, the sanest thing
is just to turn off rename detection altogether for the
diff.
Signed-off-by: Jeff King <peff@peff.net>
---
Like I mentioned before, I'm OK with not checking the actual diff output
in the test. It's not like it was planned, and is just what diff_tree()
happens to produce. It does make sense, though. We descend into the
first "outer/" of the "a/" side along with the sole "outer/" of the
"b/" side. We see that the entries from "b/" are all added. Then we come
back out, and see that "a/" has another "outer/", but that "b/" does
not. So all of those entries look like they were deleted.
diffcore-rename.c | 8 +++--
t/t4058-diff-duplicates.sh | 79 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 85 insertions(+), 2 deletions(-)
create mode 100755 t/t4058-diff-duplicates.sh
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 4250cc0..af1fe08 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -466,8 +466,12 @@ void diffcore_rename(struct diff_options *options)
else if (!DIFF_OPT_TST(options, RENAME_EMPTY) &&
is_empty_blob_sha1(p->two->sha1))
continue;
- else
- add_rename_dst(p->two);
+ else if (add_rename_dst(p->two) < 0) {
+ warning("skipping rename detection, detected"
+ " duplicate destination '%s'",
+ p->two->path);
+ goto cleanup;
+ }
}
else if (!DIFF_OPT_TST(options, RENAME_EMPTY) &&
is_empty_blob_sha1(p->one->sha1))
diff --git a/t/t4058-diff-duplicates.sh b/t/t4058-diff-duplicates.sh
new file mode 100755
index 0000000..0a23242
--- /dev/null
+++ b/t/t4058-diff-duplicates.sh
@@ -0,0 +1,79 @@
+#!/bin/sh
+
+test_description='test tree diff when trees have duplicate entries'
+. ./test-lib.sh
+
+# make_tree_entry <mode> <mode> <sha1>
+#
+# We have to rely on perl here because not all printfs understand
+# hex escapes (only octal), and xxd is not portable.
+make_tree_entry () {
+ printf '%s %s\0' "$1" "$2" &&
+ perl -e 'print chr(hex($_)) for ($ARGV[0] =~ /../g)' "$3"
+}
+
+# Like git-mktree, but without all of the pesky sanity checking.
+# Arguments come in groups of three, each group specifying a single
+# tree entry (see make_tree_entry above).
+make_tree () {
+ while test $# -gt 2; do
+ make_tree_entry "$1" "$2" "$3"
+ shift; shift; shift
+ done |
+ git hash-object -w -t tree --stdin
+}
+
+# this is kind of a convoluted setup, but matches
+# a real-world case. Each tree contains four entries
+# for the given path, one with one sha1, and three with
+# the other. The first tree has them split across
+# two subtrees (which are themselves duplicate entries in
+# the root tree), and the second has them all in a single subtree.
+test_expect_success 'create trees with duplicate entries' '
+ blob_one=$(echo one | git hash-object -w --stdin) &&
+ blob_two=$(echo two | git hash-object -w --stdin) &&
+ inner_one_a=$(make_tree \
+ 100644 inner $blob_one
+ ) &&
+ inner_one_b=$(make_tree \
+ 100644 inner $blob_two \
+ 100644 inner $blob_two \
+ 100644 inner $blob_two
+ ) &&
+ outer_one=$(make_tree \
+ 040000 outer $inner_one_a \
+ 040000 outer $inner_one_b
+ ) &&
+ inner_two=$(make_tree \
+ 100644 inner $blob_one \
+ 100644 inner $blob_two \
+ 100644 inner $blob_two \
+ 100644 inner $blob_two
+ ) &&
+ outer_two=$(make_tree \
+ 040000 outer $inner_two
+ ) &&
+ git tag one $outer_one &&
+ git tag two $outer_two
+'
+
+test_expect_success 'diff-tree between trees' '
+ {
+ printf ":000000 100644 $_z40 $blob_two A\touter/inner\n" &&
+ printf ":000000 100644 $_z40 $blob_two A\touter/inner\n" &&
+ printf ":000000 100644 $_z40 $blob_two A\touter/inner\n" &&
+ printf ":100644 000000 $blob_two $_z40 D\touter/inner\n" &&
+ printf ":100644 000000 $blob_two $_z40 D\touter/inner\n" &&
+ printf ":100644 000000 $blob_two $_z40 D\touter/inner\n"
+ } >expect &&
+ git diff-tree -r --no-abbrev one two >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'diff-tree with renames' '
+ # same expectation as above, since we disable rename detection
+ git diff-tree -M -r --no-abbrev one two >actual &&
+ test_cmp expect actual
+'
+
+test_done
--
2.3.0.449.g1690e78
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] diffcore-rename: avoid processing duplicate destinations
2015-02-27 1:42 ` [PATCH 2/2] diffcore-rename: avoid processing duplicate destinations Jeff King
@ 2015-02-27 21:48 ` Junio C Hamano
0 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2015-02-27 21:48 UTC (permalink / raw)
To: Jeff King; +Cc: git
Jeff King <peff@peff.net> writes:
> Like I mentioned before, I'm OK with not checking the actual diff output
> in the test. It's not like it was planned, and is just what diff_tree()
> happens to produce. It does make sense, though....
When the topic is on processing broken input, I do not think "It
does make sense, though." is a primary point, unless the result
expected by these tests is the only possibly sane outcome
(otherwise, another equally-sensible output will be rejected as a
test failure). So I agree that there is a possibility that we might
regret having the diff output tested here in the future when
somebody further works in the area.
But let's not worry too much about it for now.
Thanks; the solution seems reasonable.
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2015-02-27 21:48 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-02-24 21:43 [BUG] diffcore-rename with duplicate tree entries can segfault Jeff King
2015-02-24 22:42 ` Junio C Hamano
2015-02-24 22:49 ` Jeff King
2015-02-24 23:11 ` Junio C Hamano
2015-02-24 23:47 ` Jeff King
2015-02-25 5:00 ` Junio C Hamano
2015-02-25 21:40 ` Jeff King
2015-02-25 21:50 ` Junio C Hamano
2015-02-27 1:38 ` [PATCH 0/2] " Jeff King
2015-02-27 1:39 ` [PATCH 1/2] diffcore-rename: split locate_rename_dst into two functions Jeff King
2015-02-27 1:42 ` [PATCH 2/2] diffcore-rename: avoid processing duplicate destinations Jeff King
2015-02-27 21:48 ` Junio C Hamano
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).