Git development
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Junio C Hamano <gitster@pobox.com>,
	Git Mailing List <git@vger.kernel.org>
Subject: Wrong damage counting in diffcore_count_changes?
Date: Fri, 4 Dec 2009 12:07:47 -0800 (PST)	[thread overview]
Message-ID: <alpine.LFD.2.00.0912041200120.24579@localhost.localdomain> (raw)


Ok, so I had somebody actually ask me about '--dirstat', and as a result I 
ended up looking at how well the numbers it calculates really reflect the 
damage done to a file.

And to my horror, it doesn't necessarily reflect the damage well at all!

Now, dirstat just takes the same damage numbers that git uses to estimate 
similarity for renames, so if dirstat gets odd numbers, then that implies 
that file similarity will also be somewhat odd.

So looking at _why_ the dirstat numbers were odd, I came up with this 
patch that seems to make much sense. What used to happen is that 
diffcore_count_changes() simply ignored any hashes in the destination that 
didn't match hashes in the source. EXCEPT if the source hash didn't exist 
at all, in which case it would count _one_ destination hash that happened 
to have the "next" hash value. 

This changes it so that

 - whenever it bypasses a destination hash (because it doesn't match a 
   source), it counts the bytes associated with that as "literal added"

 - at the end (once we have used up all the source hashes), we do the same 
   thing with the remaining destination hashes.

 - when hashes do match, and we use the difference in counts as a value, 
   we also use up that destination hash entry (the 'd++'

This _seems_ to make --dirstat output more sensible, and I'd hope that 
that in turn should mean that file rename detection should also be more 
sensible. But I haven't actually verified it in any way. Maybe I just 
screwed up file rename detection entirely.

Did I miss something?

		Linus
---
 diffcore-delta.c |   11 ++++++++++-
 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/diffcore-delta.c b/diffcore-delta.c
index e670f85..7cf431d 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -201,10 +201,15 @@ int diffcore_count_changes(struct diff_filespec *src,
 		while (d->cnt) {
 			if (d->hashval >= s->hashval)
 				break;
+			la += d->cnt;
 			d++;
 		}
 		src_cnt = s->cnt;
-		dst_cnt = d->hashval == s->hashval ? d->cnt : 0;
+		dst_cnt = 0;
+		if (d->cnt && d->hashval == s->hashval) {
+			dst_cnt = d->cnt;
+			d++;
+		}
 		if (src_cnt < dst_cnt) {
 			la += dst_cnt - src_cnt;
 			sc += src_cnt;
@@ -213,6 +218,10 @@ int diffcore_count_changes(struct diff_filespec *src,
 			sc += dst_cnt;
 		s++;
 	}
+	while (d->cnt) {
+		la += d->cnt;
+		d++;
+	}
 
 	if (!src_count_p)
 		free(src_count);

             reply	other threads:[~2009-12-04 20:08 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-12-04 20:07 Linus Torvalds [this message]
2009-12-04 20:47 ` Wrong damage counting in diffcore_count_changes? Junio C Hamano
2009-12-04 22:48   ` Linus Torvalds
2009-12-04 23:20     ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.LFD.2.00.0912041200120.24579@localhost.localdomain \
    --to=torvalds@linux-foundation.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox