From: Linus Torvalds <torvalds@linux-foundation.org>
To: Junio C Hamano <gitster@pobox.com>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: Wrong damage counting in diffcore_count_changes?
Date: Fri, 4 Dec 2009 14:48:28 -0800 (PST) [thread overview]
Message-ID: <alpine.LFD.2.00.0912041419540.24579@localhost.localdomain> (raw)
In-Reply-To: <7vljhio4a3.fsf@alter.siamese.dyndns.org>
On Fri, 4 Dec 2009, Junio C Hamano wrote:
>
> Well, the current loop structure largely comes from your eb4d0e3 (optimize
> diffcore-delta by sorting hash entries., 2007-10-02) so you would be the
> best judge of the change ;-), even though it seems that the current code
> inherited the "skipping of dst when src does not exist" bug from c06c796
> (diffcore-rename: somewhat optimized., 2006-03-12).
Yeah, I think the sorting thing tried to not change any logic, so the
counting predates that whole thing.
> But my lunchbreak is over, and my evening is booked, so I unfortunately
> cannot spend more time thinking about any possible fallouts from this
> change until tomorrow.
>
> Sorry, and thanks.
No problem. Just an example of the fallout here on the kernel:
- totally made-up trivial differences in two different kernel
directories:
[torvalds@nehalem linux]$ git diff -p --stat >
kernel/sched.c | 1 +
mm/memory.c | 1 +
2 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 3c11ae0..7a86f4f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1,3 +1,4 @@
+123
/*
* kernel/sched.c
*
diff --git a/mm/memory.c b/mm/memory.c
index 6ab19dd..0de58a6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1,3 +1,4 @@
+aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
/*
* linux/mm/memory.c
*
- Here's what current git reports (which is obviously garbage):
[torvalds@nehalem linux]$ git diff --dirstat
100.0% kernel/
- Here's what a fixed git with that patch reports:
[torvalds@nehalem linux]$ ~/git/git diff --dirstat
8.6% kernel/
91.3% mm/
and notice how the fixed diff now sees the change to mm/memory.c as a real
change (it used to dismiss it entirely because it was a new previously
non-existent pattern, so the hash didn't exist in the source), and gives
reasonable percentages as to how much damage has been done (ie the
mm/memory.c changes were obviously bigger: 42 new characters vs 4 new
characters).
So the patch definitely improves dirstat, although for a rather made-up
example (I normally use it for much bigger diffs, where the impact of this
bug is not nearly as noticeable).
But that patch also changes the end result (in major ways) for real
examples too - probably exactly because it undercounted additions of new
code. I can't prove that the new numbers are "better", but I think they
are:
- old and presumably broken:
[torvalds@nehalem linux]$ git diff -M --dirstat --cumulative v2.6.32-rc8..v2.6.32
5.8% arch/
3.5% drivers/net/e1000e/
9.6% drivers/net/
3.9% drivers/staging/
34.8% drivers/
4.1% fs/cachefiles/
24.9% fs/fscache/
32.6% fs/
14.5% kernel/
3.7% net/
- new and hopefully fixed:
[torvalds@nehalem linux]$ ~/git/git diff -M --dirstat --cumulative v2.6.32-rc8..v2.6.32
3.3% Documentation/filesystems/caching/
3.5% Documentation/filesystems/
7.6% Documentation/
3.2% arch/arm/
4.2% arch/blackfin/
10.1% arch/
3.1% drivers/gpu/drm/
7.8% drivers/net/
3.1% drivers/staging/
30.0% drivers/
5.6% fs/cachefiles/
19.9% fs/fscache/
28.4% fs/
3.1% include/
12.9% kernel/
so it really does seem like the old code is crud. It just never really
mattered, because from a "is this a copy" standpoint, we don't care all
that much about the "added" content, we care mostly about the original
size and how much of it still remains.
(Sanity check: the diffstat for that thing says:
277 files changed, 4426 insertions(+), 1244 deletions(-)
and diffstat for just Documentation/ says
5 files changed, 289 insertions(+), 14 deletions(-)
so you'd expect that the Documentation changes would be at _least_
(289+14)/(4426+1244), ie ~5%, and since text documentation lines tend to
be more dense than actual code (with lines with just curly braces etc), I
do think that 7.6% Documentation/ sounds much more likely than <3%
(invisible).
I also did a "git log -M --summary" on the current kernel git tree, and it
didn't actually change any of _that_. So it seems that rename detection
still ends up spitting out the same numbers.
Which is actually not surprising, because rename detection doesn't even
end up _using_ the "literal_added" part at all (it just does:
score = (int)(src_copied * MAX_SCORE / max_size);
ie it bases it's score on the amount of copied data, scaling it by the
bigger of the two src/dst sizes).
So just fixing the "literal added" count should not mess anything up, and
seems to fix dirstat.
It also looks a bit like diffcore-break actually worked around this whole
thing, and does
/* sanity */
if (src->size < src_copied)
src_copied = src->size;
if (dst->size < literal_added + src_copied) {
if (src_copied < dst->size)
literal_added = dst->size - src_copied;
else
literal_added = 0;
}
src_removed = src->size - src_copied;
so this _may_ change what -B does, but I get the feeling that it should
improve that too. I'm running a before-and-after "git log -M -B --summary"
on the kernel now, but it's a pretty expensive operation, so it hasn't
finished yet.
Linus
next prev parent reply other threads:[~2009-12-04 22:49 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-12-04 20:07 Wrong damage counting in diffcore_count_changes? Linus Torvalds
2009-12-04 20:47 ` Junio C Hamano
2009-12-04 22:48 ` Linus Torvalds [this message]
2009-12-04 23:20 ` Linus Torvalds
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.LFD.2.00.0912041419540.24579@localhost.localdomain \
--to=torvalds@linux-foundation.org \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox