From: Junio C Hamano <gitster@pobox.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kumar Gala <galak@kernel.crashing.org>,
	Git Mailing List <git@vger.kernel.org>
Subject: Re: Fix a pathological case in git detecting proper renames
Date: Thu, 29 Nov 2007 16:40:21 -0800
Message-ID: <7v4pf4tvka.fsf@gitster.siamese.dyndns.org>
In-Reply-To: <alpine.LFD.0.9999.0711291442300.8458@woody.linux-foundation.org> (Linus Torvalds's message of "Thu, 29 Nov 2007 15:03:06 -0800 (PST)")

Linus Torvalds <torvalds@linux-foundation.org> writes:

> For the fuzzy rename detection, we generate the full score matrix, and 
> sort it by the score, up front. So all the scoring - and more importantly, 
> all the sorting - has actually been done before we actually start looking 
> at *any* renames at all, so we cannot easily do the same thing I did for 
> the exact renames, namely to take into account _earlier_ renames in the 
> scoring. Because those earlier renames have simply not been done when the 
> score is calculated.

I think I've mentioned this before, but another thing we may want to do
is to give a similarity boost to a src-dst pair if other files in the same
src directory are found to be renamed to the same dst directory.  That
is, if you have the same contents in the preimage at A/init.S and B/init.S,
and similar contents appear in C/init.S in the postimage, instead of
randomly picking A/init.S over B/init.S as the source, we can notice
that A/Makefile was moved to C/Makefile (but B/Makefile was sufficiently
different from A/Makefile in the preimage), and favor A/init.S over
B/init.S as the rename source of C/init.S.
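
To make the directory heuristic concrete, here is a minimal sketch in C.
It is not taken from git's source: struct rename_pair, dir_len(),
same_dir(), boosted_score() and the bonus parameter are all hypothetical
names introduced for illustration, and the score scale is arbitrary.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record of a rename that has already been accepted. */
struct rename_pair {
	const char *src;
	const char *dst;
};

/*
 * Length of the directory part of a path, including the trailing '/'
 * (0 when the path has no directory component).
 */
size_t dir_len(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? (size_t)(slash - path) + 1 : 0;
}

/* True when both paths live in the same directory. */
int same_dir(const char *a, const char *b)
{
	size_t la = dir_len(a), lb = dir_len(b);
	return la == lb && !strncmp(a, b, la);
}

/*
 * Return base_score plus bonus if any already-detected rename moved a
 * file out of src's directory into dst's directory; otherwise return
 * base_score unchanged.
 */
int boosted_score(const char *src, const char *dst, int base_score,
		  const struct rename_pair *done, size_t nr_done,
		  int bonus)
{
	size_t i;

	for (i = 0; i < nr_done; i++)
		if (same_dir(done[i].src, src) && same_dir(done[i].dst, dst))
			return base_score + bonus;
	return base_score;
}
```

With A/Makefile -> C/Makefile already recorded in done[], the pair
(A/init.S, C/init.S) gets the bonus while (B/init.S, C/init.S) does not,
which breaks the tie in favor of A/init.S.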

As for the code structure, I think a very early draft of the rename
detector did not build the full matrix.  Instead it iterated over each
dst to see if there was a good src for it, picked the best src above the
threshold, and went on to the next dst, like this:

	for (dst in dst candidates) {
		best_src = NULL;
		best_score = minimum_score;
		for (src in src candidates) {
			score = similarity(dst, src);
			if (score > best_score) {
				/* remember the best candidate so far */
				best_score = score;
				best_src = src;
			}
		}
		if (best_src) {
			match dst with best_src;
		}
	}

This was restructured into the current "full matrix first" form before the
rename detection logic first hit your tree, and I do not think the loop
above was ever shown in the field to perform worse than the full matrix
version.

We could first compute the current full matrix, which takes into account
neither basename similarity nor what other renames have been detected,
and then use that matrix primarily to define the order in which dst
candidates are processed by the above loop.  At that point, similarity
between dst and src does not need to be recomputed fully (the matrix
would record it).  Instead, we can adjust it by what other renames have
already been detected (this includes "this src has already been used"
and "somebody nearby moved to the same directory") and by basename
similarity, so that these affect which src candidate is chosen for each
dst.
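
The two-phase idea above could be sketched roughly as follows.  This is
an illustration under stated assumptions, not git's implementation:
assign_renames(), row_max() and used_penalty are hypothetical names,
scores are assumed to be non-negative, and only the "this src has
already been used" adjustment is modeled (the directory and basename
adjustments would be applied at the same point).

```c
#include <stddef.h>

/* Largest value in one row of the matrix (scores assumed >= 0). */
int row_max(const int *row, size_t n)
{
	int best = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (row[i] > best)
			best = row[i];
	return best;
}

/*
 * score is an nr_dst x nr_src matrix in row-major order:
 * score[d * nr_src + s] is the precomputed similarity of dst d and
 * src s.  On return match[d] is the src index chosen for dst d, or -1.
 * src_used and dst_done are caller-provided scratch arrays.
 */
void assign_renames(const int *score, size_t nr_dst, size_t nr_src,
		    int minimum_score, int used_penalty,
		    int *match, char *src_used, char *dst_done)
{
	size_t d, s, round;

	for (s = 0; s < nr_src; s++)
		src_used[s] = 0;
	for (d = 0; d < nr_dst; d++) {
		dst_done[d] = 0;
		match[d] = -1;
	}

	for (round = 0; round < nr_dst; round++) {
		size_t next = nr_dst;
		int next_best = 0, best_src = -1, best_score = minimum_score;
		const int *row;

		/* The matrix only decides which dst to process next. */
		for (d = 0; d < nr_dst; d++) {
			int m;
			if (dst_done[d])
				continue;
			m = row_max(score + d * nr_src, nr_src);
			if (next == nr_dst || m > next_best) {
				next = d;
				next_best = m;
			}
		}

		/* Reuse the recorded score, adjusted by earlier matches. */
		row = score + next * nr_src;
		for (s = 0; s < nr_src; s++) {
			int adj = row[s] - (src_used[s] ? used_penalty : 0);
			if (adj > best_score) {
				best_score = adj;
				best_src = (int)s;
			}
		}
		if (best_src >= 0) {
			match[next] = best_src;
			src_used[best_src] = 1;
		}
		dst_done[next] = 1;
	}
}
```

For example, with rows {70, 60} and {90, 10}, a threshold of 50 and a
penalty of 30, the second dst is processed first and takes src 0; the
first dst then prefers the unused src 1 (60) over the penalized src 0
(70 - 30 = 40).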

