* Wrong damage counting in diffcore_count_changes? @ 2009-12-04 20:07 Linus Torvalds 2009-12-04 20:47 ` Junio C Hamano 0 siblings, 1 reply; 4+ messages in thread From: Linus Torvalds @ 2009-12-04 20:07 UTC (permalink / raw) To: Junio C Hamano, Git Mailing List Ok, so I had somebody actually ask me about '--dirstat', and as a result I ended up looking at how well the numbers it calculates really reflect the damage done to a file. And to my horror, it doesn't necessarily reflect the damage well at all! Now, dirstat just takes the same damage numbers that git uses to estimate similarity for renames, so if dirstat gets odd numbers, then that implies that file similarity will also be somewhat odd. So looking at _why_ the dirstat numbers were odd, I came up with this patch that seems to make much sense. What used to happen is that diffcore_count_changes() simply ignored any hashes in the destination that didn't match hashes in the source. EXCEPT if the source hash didn't exist at all, in which case it would count _one_ destination hash that happened to have the "next" hash value. This changes it so that - whenever it bypasses a destination hash (because it doesn't match a source), it counts the bytes associated with that as "literal added" - at the end (once we have used up all the source hashes), we do the same thing with the remaining destination hashes. - when hashes do match, and we use the difference in counts as a value, we also use up that destination hash entry (the 'd++' This _seems_ to make --dirstat output more sensible, and I'd hope that that in turn should mean that file rename detection should also be more sensible. But I haven't actually verified it in any way. Maybe I just screwed up file rename detection entirely. Did I miss something? Linus --- diffcore-delta.c | 11 ++++++++++- 1 files changed, 10 insertions(+), 1 deletions(-) diff --git a/diffcore-delta.c b/diffcore-delta.c index e670f85..7cf431d 100644 --- a/diffcore-delta.c +++ b/diffcore-delta.c @@ -201,10 +201,15 @@ int diffcore_count_changes(struct diff_filespec *src, while (d->cnt) { if (d->hashval >= s->hashval) break; + la += d->cnt; d++; } src_cnt = s->cnt; - dst_cnt = d->hashval == s->hashval ? d->cnt : 0; + dst_cnt = 0; + if (d->cnt && d->hashval == s->hashval) { + dst_cnt = d->cnt; + d++; + } if (src_cnt < dst_cnt) { la += dst_cnt - src_cnt; sc += src_cnt; @@ -213,6 +218,10 @@ int diffcore_count_changes(struct diff_filespec *src, sc += dst_cnt; s++; } + while (d->cnt) { + la += d->cnt; + d++; + } if (!src_count_p) free(src_count); ^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: Wrong damage counting in diffcore_count_changes? 2009-12-04 20:07 Wrong damage counting in diffcore_count_changes? Linus Torvalds @ 2009-12-04 20:47 ` Junio C Hamano 2009-12-04 22:48 ` Linus Torvalds 0 siblings, 1 reply; 4+ messages in thread From: Junio C Hamano @ 2009-12-04 20:47 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List Linus Torvalds <torvalds@linux-foundation.org> writes: > This changes it so that > > - whenever it bypasses a destination hash (because it doesn't match a > source), it counts the bytes associated with that as "literal added" > > - at the end (once we have used up all the source hashes), we do the same > thing with the remaining destination hashes. > > - when hashes do match, and we use the difference in counts as a value, > we also use up that destination hash entry (the 'd++' > > This _seems_ to make --dirstat output more sensible, and I'd hope that > that in turn should mean that file rename detection should also be more > sensible. But I haven't actually verified it in any way. Maybe I just > screwed up file rename detection entirely. > > Did I miss something? Well, the current loop structure largely comes from your eb4d0e3 (optimize diffcore-delta by sorting hash entries., 2007-10-02) so you would be the best judge of the change ;-), even though it seems that the current code inherited the "skipping of dst when src does not exist" bug from c06c796 (diffcore-rename: somewhat optimized., 2006-03-12). Earlier code, e.g. as of ba23bbc (diffcore-delta: make change counter to byte oriented again., 2006-03-04), used to be very simple minded and inefficient and iterated over src_count[] and dst_count[] arrays for the entire hash value namespace and there was no such "missing, skipped, happened to have the next value" issue. I think you understand the original intention of the function correctly and from a cursory look of the patch I think it fixes the bug in the current code, and any changes to renames/breaks should be improvements. But my lunchbreak is over, and my evening is booked, so I unfortunately cannot spend more time thinking about any possible fallouts from this change until tomorrow. Sorry, and thanks. > Linus > --- > diffcore-delta.c | 11 ++++++++++- > 1 files changed, 10 insertions(+), 1 deletions(-) > > diff --git a/diffcore-delta.c b/diffcore-delta.c > index e670f85..7cf431d 100644 > --- a/diffcore-delta.c > +++ b/diffcore-delta.c > @@ -201,10 +201,15 @@ int diffcore_count_changes(struct diff_filespec *src, > while (d->cnt) { > if (d->hashval >= s->hashval) > break; > + la += d->cnt; > d++; > } > src_cnt = s->cnt; > - dst_cnt = d->hashval == s->hashval ? d->cnt : 0; > + dst_cnt = 0; > + if (d->cnt && d->hashval == s->hashval) { > + dst_cnt = d->cnt; > + d++; > + } > if (src_cnt < dst_cnt) { > la += dst_cnt - src_cnt; > sc += src_cnt; > @@ -213,6 +218,10 @@ int diffcore_count_changes(struct diff_filespec *src, > sc += dst_cnt; > s++; > } > + while (d->cnt) { > + la += d->cnt; > + d++; > + } > > if (!src_count_p) > free(src_count); ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Wrong damage counting in diffcore_count_changes? 2009-12-04 20:47 ` Junio C Hamano @ 2009-12-04 22:48 ` Linus Torvalds 2009-12-04 23:20 ` Linus Torvalds 0 siblings, 1 reply; 4+ messages in thread From: Linus Torvalds @ 2009-12-04 22:48 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git Mailing List On Fri, 4 Dec 2009, Junio C Hamano wrote: > > Well, the current loop structure largely comes from your eb4d0e3 (optimize > diffcore-delta by sorting hash entries., 2007-10-02) so you would be the > best judge of the change ;-), even though it seems that the current code > inherited the "skipping of dst when src does not exist" bug from c06c796 > (diffcore-rename: somewhat optimized., 2006-03-12). Yeah, I think the sorting thing tried to not change any logic, so the counting predates that whole thing. > But my lunchbreak is over, and my evening is booked, so I unfortunately > cannot spend more time thinking about any possible fallouts from this > change until tomorrow. > > Sorry, and thanks. No problem. Just an example of the fallout here on the kernel: - totally made-up trivial differences in two different kernel directories: [torvalds@nehalem linux]$ git diff -p --stat > kernel/sched.c | 1 + mm/memory.c | 1 + 2 files changed, 2 insertions(+), 0 deletions(-) diff --git a/kernel/sched.c b/kernel/sched.c index 3c11ae0..7a86f4f 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -1,3 +1,4 @@ +123 /* * kernel/sched.c * diff --git a/mm/memory.c b/mm/memory.c index 6ab19dd..0de58a6 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1,3 +1,4 @@ +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa /* * linux/mm/memory.c * - Here's what current git reports (which is obviously garbage): [torvalds@nehalem linux]$ git diff --dirstat 100.0% kernel/ - Here's what a fixed git with that patch reports: [torvalds@nehalem linux]$ ~/git/git diff --dirstat 8.6% kernel/ 91.3% mm/ and notice how the fixed diff now sees the change to mm/memory.c as a real change (it used to dismiss it entirely because it was a new previously non-existent pattern, so the hash didn't exist in the source), and gives reasonable percentages as to how much damage has been done (ie the mm/memory.c changes were obviously bigger: 42 new characters vs 4 new characters). So the patch definitely improves dirstat, although for a rather made-up example (I normally use it for much bigger diffs, where the impact of this bug is not nearly as noticeable). But that patch also changes the end result (in major ways) for real examples too - probably exactly because it undercounted additions of new code. I can't prove that the new numbers are "better", but I think they are: - old and presumably broken: [torvalds@nehalem linux]$ git diff -M --dirstat --cumulative v2.6.32-rc8..v2.6.32 5.8% arch/ 3.5% drivers/net/e1000e/ 9.6% drivers/net/ 3.9% drivers/staging/ 34.8% drivers/ 4.1% fs/cachefiles/ 24.9% fs/fscache/ 32.6% fs/ 14.5% kernel/ 3.7% net/ - new and hopefully fixed: [torvalds@nehalem linux]$ ~/git/git diff -M --dirstat --cumulative v2.6.32-rc8..v2.6.32 3.3% Documentation/filesystems/caching/ 3.5% Documentation/filesystems/ 7.6% Documentation/ 3.2% arch/arm/ 4.2% arch/blackfin/ 10.1% arch/ 3.1% drivers/gpu/drm/ 7.8% drivers/net/ 3.1% drivers/staging/ 30.0% drivers/ 5.6% fs/cachefiles/ 19.9% fs/fscache/ 28.4% fs/ 3.1% include/ 12.9% kernel/ so it really does seem like the old code is crud. It just never really mattered, because from a "is this a copy" standpoint, we don't care all that much about the "added" content, we care mostly about the original size and how much of it still remains. (Sanity check: the diffstat for that thing says: 277 files changed, 4426 insertions(+), 1244 deletions(-) and diffstat for just Documentation/ says 5 files changed, 289 insertions(+), 14 deletions(-) so you'd expect that the Documentation changes would be at _least_ (289+14)/(4426+1244), ie ~5%, and since text documentation lines tend to be more dense than actual code (with lines with just curly braces etc), I do think that 7.6% Documentation/ sounds much more likely than <3% (invisible). I also did a "git log -M --summary" on the current kernel git tree, and it didn't actually change any of _that_. So it seems that rename detection still ends up spitting out the same numbers. Which is actually not surprising, because rename detection doesn't even end up _using_ the "literal_added" part at all (it just does: score = (int)(src_copied * MAX_SCORE / max_size); ie it bases it's score on the amount of copied data, scaling it by the bigger of the two src/dst sizes). So just fixing the "literal added" count should not mess anything up, and seems to fix dirstat. It also looks a bit like diffcore-break actually worked around this whole thing, and does /* sanity */ if (src->size < src_copied) src_copied = src->size; if (dst->size < literal_added + src_copied) { if (src_copied < dst->size) literal_added = dst->size - src_copied; else literal_added = 0; } src_removed = src->size - src_copied; so this _may_ change what -B does, but I get the feeling that it should improve that too. I'm running a before-and-after "git log -M -B --summary" on the kernel now, but it's a pretty expensive operation, so it hasn't finished yet. Linus ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Wrong damage counting in diffcore_count_changes? 2009-12-04 22:48 ` Linus Torvalds @ 2009-12-04 23:20 ` Linus Torvalds 0 siblings, 0 replies; 4+ messages in thread From: Linus Torvalds @ 2009-12-04 23:20 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git Mailing List On Fri, 4 Dec 2009, Linus Torvalds wrote: > > It also looks a bit like diffcore-break actually worked around this whole > thing, and does > > /* sanity */ > if (src->size < src_copied) > src_copied = src->size; > if (dst->size < literal_added + src_copied) { > if (src_copied < dst->size) > literal_added = dst->size - src_copied; > else > literal_added = 0; > } > src_removed = src->size - src_copied; > > so this _may_ change what -B does, but I get the feeling that it should > improve that too. I'm running a before-and-after "git log -M -B --summary" > on the kernel now, but it's a pretty expensive operation, so it hasn't > finished yet. Ok, somewhat confirmed. Before-and-after changes: - commit 2def7b8bcd4c49ca71a950611c9d456fd35282d2 - 2 files changed, 187 insertions(+), 218 deletions(-) - delete mode 100644 drivers/staging/hv/include/ChannelMessages.h + 1 files changed, 109 insertions(+), 4 deletions(-) + rename drivers/staging/hv/{include/ChannelMessages.h => ChannelMgmt.h} (70%) ie it _used_ to get broken up, now it shows up as a rename, presumably due to better scoring. - commit 64a1403d797d38c0bd18ba43bda5653c012c0d58. Similar. - commit 3ce0a23d2d253185df24e22e3d5f89800bb3dd1c - 34 files changed, 8316 insertions(+), 633 deletions(-) - create mode 100644 drivers/gpu/drm/radeon/avivod.h + 34 files changed, 8287 insertions(+), 643 deletions(-) + copy drivers/gpu/drm/radeon/{radeon_share.h => avivod.h} (50%) that looks like a smaller diff, judging by the line counts - commit 9f77da9f40045253e91f55c12d4481254b513d2d - 4 files changed, 358 insertions(+), 328 deletions(-) + 4 files changed, 132 insertions(+), 420 deletions(-) rewrite include/linux/rcutree.h (73%) + rename {include/linux => kernel}/rcutree.h (80%) - commit d15dd3e5d74186a3b0a4db271b440bbdc0f6da36 - 6 files changed, 75 insertions(+), 41 deletions(-) - create mode 100644 drivers/net/wireless/ath/ath.h + 6 files changed, 59 insertions(+), 47 deletions(-) + copy drivers/net/wireless/ath/{main.c => ath.h} (73%) - commit cf4f1e76c49dacfde0680b170b9a9b6a42f296bb - 2 files changed, 716 insertions(+), 714 deletions(-) + 2 files changed, 371 insertions(+), 1034 deletions(-) rewrite drivers/usb/host/r8a66597.h (62%) + rename {drivers/usb/host => include/linux/usb}/r8a66597.h (65%) - commit d69864158e24f323e818403c6b89ad4871aea6f6 - 4 files changed, 172 insertions(+), 238 deletions(-) + 3 files changed, 90 insertions(+), 106 deletions(-) + rename arch/sparc/include/asm/{dma-mapping_64.h => dma-mapping.h} (71%) delete mode 100644 arch/sparc/include/asm/dma-mapping_32.h - delete mode 100644 arch/sparc/include/asm/dma-mapping_64.h etc. From what I can see, it looks like in general it picked better things with the new diffcore_count_changes() implementation, Looking at just the "files changed" summaries, it tends to look like this: - 21 files changed, 5659 insertions(+), 3227 deletions(-) + 19 files changed, 5723 insertions(+), 2285 deletions(-) - 3 files changed, 645 insertions(+), 654 deletions(-) + 3 files changed, 387 insertions(+), 820 deletions(-) - 11 files changed, 187 insertions(+), 177 deletions(-) + 10 files changed, 166 insertions(+), 109 deletions(-) - 100 files changed, 1114 insertions(+), 3388 deletions(-) + 99 files changed, 1102 insertions(+), 3337 deletions(-) - 2 files changed, 127 insertions(+), 106 deletions(-) + 2 files changed, 113 insertions(+), 92 deletions(-) - 7 files changed, 128 insertions(+), 59 deletions(-) + 7 fileas changed, 87 insertions(+), 69 deletions(-) - 154 files changed, 3641 insertions(+), 4232 deletions(-) + 154 files changed, 3635 insertions(+), 4233 deletions(-) - 24 files changed, 162 insertions(+), 226 deletions(-) + 23 files changed, 61 insertions(+), 77 deletions(-) - 4 files changed, 1995 insertions(+), 2114 deletions(-) + 3 files changed, 796 insertions(+), 814 deletions(-) - 3 files changed, 1104 insertions(+), 1114 deletions(-) + 3 files changed, 886 insertions(+), 1038 deletions(-) - 46 files changed, 17743 insertions(+), 5986 deletions(-) + 46 files changed, 16239 insertions(+), 6636 deletions(-) - 9 files changed, 295 insertions(+), 284 deletions(-) + 9 files changed, 203 insertions(+), 291 deletions(-) ie I can see several cases where the new break choice resulted in fewer files changed and/or fewer over-all lines changed (due to better rename choices), and I haven't seen any going the other way. There's probably some, but it does seem that in general the patch results in better picks (when it makes any difference in the first place - there seems to be only 71 commits in the kernel git tree that are affected at all) Linus ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2009-12-04 23:20 UTC | newest] Thread overview: 4+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-12-04 20:07 Wrong damage counting in diffcore_count_changes? Linus Torvalds 2009-12-04 20:47 ` Junio C Hamano 2009-12-04 22:48 ` Linus Torvalds 2009-12-04 23:20 ` Linus Torvalds
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox