problem with git detecting proper renames

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* problem with git detecting proper renames
@ 2007-11-29 16:57 Kumar Gala
  2007-11-29 17:44 ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Kumar Gala @ 2007-11-29 16:57 UTC (permalink / raw)
  To: git

I was wondering if there was a way to ensure that git tracks renames
properly (or maybe its reporting them properly).

I did some git-mv and got the following:

the problem is git seems confused about what file was associated with its
source.

For example:
.../mpc8541cds => freescale/mpc8555cds}/init.S

thanks

- k

---

(the following is the results of a git-format-patch after the git-mv)

 Makefile                                           |    6 +-
 board/cds/mpc8548cds/Makefile                      |   60 -----
 board/cds/mpc8555cds/Makefile                      |   60 -----
 board/cds/mpc8555cds/init.S                        |  255 --------------------
 board/cds/mpc8555cds/u-boot.lds                    |  150 ------------
 board/{cds => freescale}/common/cadmus.c           |    0
 board/{cds => freescale}/common/cadmus.h           |    0
 board/{cds => freescale}/common/eeprom.c           |    0
 board/{cds => freescale}/common/eeprom.h           |    0
 board/{cds => freescale}/common/ft_board.c         |    0
 board/{cds => freescale}/common/via.c              |    0
 board/{cds => freescale}/common/via.h              |    0
 board/{cds => freescale}/mpc8541cds/Makefile       |    0
 board/{cds => freescale}/mpc8541cds/config.mk      |    0
 board/{cds => freescale}/mpc8541cds/init.S         |    0
 board/{cds => freescale}/mpc8541cds/mpc8541cds.c   |    0
 board/{cds => freescale}/mpc8541cds/u-boot.lds     |    4 +-
 .../mpc8541cds => freescale/mpc8548cds}/Makefile   |    0
 board/{cds => freescale}/mpc8548cds/config.mk      |    0
 board/{cds => freescale}/mpc8548cds/init.S         |    0
 board/{cds => freescale}/mpc8548cds/mpc8548cds.c   |    0
 board/{cds => freescale}/mpc8548cds/u-boot.lds     |    4 +-
 .../mpc8541cds => freescale/mpc8555cds}/Makefile   |    0
 board/{cds => freescale}/mpc8555cds/config.mk      |    0
 .../mpc8541cds => freescale/mpc8555cds}/init.S     |    0
 board/{cds => freescale}/mpc8555cds/mpc8555cds.c   |    0
 .../mpc8541cds => freescale/mpc8555cds}/u-boot.lds |    4 +-
 27 files changed, 9 insertions(+), 534 deletions(-)
 delete mode 100644 board/cds/mpc8548cds/Makefile
 delete mode 100644 board/cds/mpc8555cds/Makefile
 delete mode 100644 board/cds/mpc8555cds/init.S
 delete mode 100644 board/cds/mpc8555cds/u-boot.lds
 rename board/{cds => freescale}/common/cadmus.c (100%)
 rename board/{cds => freescale}/common/cadmus.h (100%)
 rename board/{cds => freescale}/common/eeprom.c (100%)
 rename board/{cds => freescale}/common/eeprom.h (100%)
 rename board/{cds => freescale}/common/ft_board.c (100%)
 rename board/{cds => freescale}/common/via.c (100%)
 rename board/{cds => freescale}/common/via.h (100%)
 copy board/{cds => freescale}/mpc8541cds/Makefile (100%)
 rename board/{cds => freescale}/mpc8541cds/config.mk (100%)
 copy board/{cds => freescale}/mpc8541cds/init.S (100%)
 rename board/{cds => freescale}/mpc8541cds/mpc8541cds.c (100%)
 copy board/{cds => freescale}/mpc8541cds/u-boot.lds (97%)
 copy board/{cds/mpc8541cds => freescale/mpc8548cds}/Makefile (100%)
 rename board/{cds => freescale}/mpc8548cds/config.mk (100%)
 rename board/{cds => freescale}/mpc8548cds/init.S (100%)
 rename board/{cds => freescale}/mpc8548cds/mpc8548cds.c (100%)
 rename board/{cds => freescale}/mpc8548cds/u-boot.lds (97%)
 rename board/{cds/mpc8541cds => freescale/mpc8555cds}/Makefile (100%)
 rename board/{cds => freescale}/mpc8555cds/config.mk (100%)
 rename board/{cds/mpc8541cds => freescale/mpc8555cds}/init.S (100%)
 rename board/{cds => freescale}/mpc8555cds/mpc8555cds.c (100%)
 rename board/{cds/mpc8541cds => freescale/mpc8555cds}/u-boot.lds (97%)

diff --git a/Makefile b/Makefile
index 92632b9..a0f35df 100644
--- a/Makefile
+++ b/Makefile
@@ -1945,7 +1945,7 @@ MPC8541CDS_config:	unconfig
 		echo "#define CONFIG_LEGACY" >>$(obj)include/config.h ; \
 		echo "... legacy" ; \
 	fi
-	@$(MKCONFIG) -a MPC8541CDS ppc mpc85xx mpc8541cds cds
+	@$(MKCONFIG) -a MPC8541CDS ppc mpc85xx mpc8541cds freescale

 MPC8544DS_config:	unconfig
 	@$(MKCONFIG) $(@:_config=) ppc mpc85xx mpc8544ds freescale
@@ -1958,7 +1958,7 @@ MPC8548CDS_config:	unconfig
 		echo "#define CONFIG_LEGACY" >>$(obj)include/config.h ; \
 		echo "... legacy" ; \
 	fi
-	@$(MKCONFIG) -a MPC8548CDS ppc mpc85xx mpc8548cds cds
+	@$(MKCONFIG) -a MPC8548CDS ppc mpc85xx mpc8548cds freescale

 MPC8555CDS_legacy_config \
 MPC8555CDS_config:	unconfig
@@ -1968,7 +1968,7 @@ MPC8555CDS_config:	unconfig
 		echo "#define CONFIG_LEGACY" >>$(obj)include/config.h ; \
 		echo "... legacy" ; \
 	fi
-	@$(MKCONFIG) -a MPC8555CDS ppc mpc85xx mpc8555cds cds
+	@$(MKCONFIG) -a MPC8555CDS ppc mpc85xx mpc8555cds freescale

 MPC8568MDS_config:	unconfig
 	@$(MKCONFIG) $(@:_config=) ppc mpc85xx mpc8568mds freescale

[snip]

diff --git a/board/cds/mpc8541cds/u-boot.lds b/board/freescale/mpc8555cds/u-boot.lds
similarity index 97%
rename from board/cds/mpc8541cds/u-boot.lds
rename to board/freescale/mpc8555cds/u-boot.lds
index 7a5daef..df21ea8 100644
--- a/board/cds/mpc8541cds/u-boot.lds
+++ b/board/freescale/mpc8555cds/u-boot.lds
@@ -34,7 +34,7 @@ SECTIONS
   .bootpg 0xFFFFF000 :
   {
     cpu/mpc85xx/start.o	(.bootpg)
-    board/cds/mpc8541cds/init.o (.bootpg)
+    board/freescale/mpc8555cds/init.o (.bootpg)
   } = 0xffff

   /* Read-only sections, merged into text segment: */
@@ -64,7 +64,7 @@ SECTIONS
   .text      :
   {
     cpu/mpc85xx/start.o	(.text)
-    board/cds/mpc8541cds/init.o (.text)
+    board/freescale/mpc8555cds/init.o (.text)
     cpu/mpc85xx/traps.o (.text)
     cpu/mpc85xx/interrupts.o (.text)
     cpu/mpc85xx/cpu_init.o (.text)
-- 
1.5.3.4

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: problem with git detecting proper renames
  2007-11-29 16:57 problem with git detecting proper renames Kumar Gala
@ 2007-11-29 17:44 ` Linus Torvalds
  2007-11-29 19:06   ` Kumar Gala
  2007-11-30  0:21   ` problem with " Jakub Narebski
  0 siblings, 2 replies; 14+ messages in thread
From: Linus Torvalds @ 2007-11-29 17:44 UTC (permalink / raw)
  To: Kumar Gala; +Cc: git

On Thu, 29 Nov 2007, Kumar Gala wrote:
> 
> I did some git-mv and got the following:
> 
> the problem is git seems confused about what file was associated with its
> source.

Well, I wouldn't say "confused". It found multiple identical options for 
the source, and picked the first one (where "first one" may not be obvious 
to a human, it can depend on an internal hash order).

But if you have the resultant git tree somewhere public (or just send me 
the exact "git mv" and revision to recreate), I'll happily give it a look, 
to see if we can improve our heuristics to be closer to what a human would 
expect.

For example, in this case, it looks  like there were two totally identical 
"init.S" files that got renamed with the same identical content to two new 
names. YOU seem to expect that it would stay as two renames, but from a 
content angle, since the two sources were identical, it's a totally 
arbitrary choice whether it's a "copy one source to two destinations and 
delete the other source" or whether it's two cases of "move one source to 
another destination" (and the latter case also has the issue of which way 
to move it).

(You also had two identical Makefile's with the exact same issue).

So git doesn't care about how you did the rename, it only cares about the 
end result, and the exact same way that it will detect a rename if you 
implement it as a "copy file" and then a later "delete old file", it will 
also potentially go the other way, or just decide that identical contents 
moved in different ways.

But we can certainly tweak the heuristics. For example, if we find 
multiple identical renames, right now we just pick one fairly at random, 
and have no logic to prefer independent renames over "multiple copies and 
a delete". But this code is actually fairly simple, and with a good 
example I can easily add heurstics (for example, it probably *is* better 
to consider it to be two renames, just because the resulting diff will be 
smaller - since a "delete" diff is much larger than a rename diff).

		Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: problem with git detecting proper renames
  2007-11-29 17:44 ` Linus Torvalds
@ 2007-11-29 19:06   ` Kumar Gala
  2007-11-29 19:27     ` Linus Torvalds
  2007-11-30  0:21   ` problem with " Jakub Narebski
  1 sibling, 1 reply; 14+ messages in thread
From: Kumar Gala @ 2007-11-29 19:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git


On Nov 29, 2007, at 11:44 AM, Linus Torvalds wrote:

> On Thu, 29 Nov 2007, Kumar Gala wrote:
>>
>> I did some git-mv and got the following:
>>
>> the problem is git seems confused about what file was associated  
>> with its
>> source.
>
> Well, I wouldn't say "confused". It found multiple identical options  
> for
> the source, and picked the first one (where "first one" may not be  
> obvious
> to a human, it can depend on an internal hash order).
>
> But if you have the resultant git tree somewhere public (or just  
> send me
> the exact "git mv" and revision to recreate), I'll happily give it a  
> look,
> to see if we can improve our heuristics to be closer to what a human  
> would
> expect.
>
> For example, in this case, it looks  like there were two totally  
> identical
> "init.S" files that got renamed with the same identical content to  
> two new
> names. YOU seem to expect that it would stay as two renames, but  
> from a
> content angle, since the two sources were identical, it's a totally
> arbitrary choice whether it's a "copy one source to two destinations  
> and
> delete the other source" or whether it's two cases of "move one  
> source to
> another destination" (and the latter case also has the issue of  
> which way
> to move it).
>
> (You also had two identical Makefile's with the exact same issue).
>
> So git doesn't care about how you did the rename, it only cares  
> about the
> end result, and the exact same way that it will detect a rename if you
> implement it as a "copy file" and then a later "delete old file", it  
> will
> also potentially go the other way, or just decide that identical  
> contents
> moved in different ways.

I was guessing most of this but wanted to make sure there wasn't some  
cool feature of git I wasn't aware of.

> But we can certainly tweak the heuristics. For example, if we find
> multiple identical renames, right now we just pick one fairly at  
> random,
> and have no logic to prefer independent renames over "multiple  
> copies and
> a delete". But this code is actually fairly simple, and with a good
> example I can easily add heurstics (for example, it probably *is*  
> better
> to consider it to be two renames, just because the resulting diff  
> will be
> smaller - since a "delete" diff is much larger than a rename diff).

In the case of multiple identical matches can we look at the file name  
as a possible heuristic?

- k

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: problem with git detecting proper renames
  2007-11-29 19:06   ` Kumar Gala
@ 2007-11-29 19:27     ` Linus Torvalds
  2007-11-29 19:32       ` Kumar Gala
  2007-11-29 20:27       ` Kumar Gala
  0 siblings, 2 replies; 14+ messages in thread
From: Linus Torvalds @ 2007-11-29 19:27 UTC (permalink / raw)
  To: Kumar Gala; +Cc: git

On Thu, 29 Nov 2007, Kumar Gala wrote:
> 
> In the case of multiple identical matches can we look at the file name as a
> possible heuristic?

We already do. But we only do the base-name part and check it for 
exactness, since moving across directories is very common, and we 
explicitly want to pick up files that have the same base name.

However, in your case, not only did you have the same content, you had the 
same basename too! So git considered your renames to be totally identical 
wrt scoring with the current heuristics, and just picked one source at 
random. 

And the current heuristics don't even have any "if you already found a 
rename, avoid picking the same one twice", so it would pick the *same* 
source both times, which is why it looked like "two copies and one 
delete".

This is why I'd like to have a real-life example. I can change the 
heuristics, and I even know what are likely to be better heuristics, but I 
still want to actually see and play with an example so that when I send 
Junio a patch, I can explain it and say I've tested it with something 
real..

			Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: problem with git detecting proper renames
  2007-11-29 19:27     ` Linus Torvalds
@ 2007-11-29 19:32       ` Kumar Gala
  2007-11-29 20:27       ` Kumar Gala
  1 sibling, 0 replies; 14+ messages in thread
From: Kumar Gala @ 2007-11-29 19:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git


On Nov 29, 2007, at 1:27 PM, Linus Torvalds wrote:

>
>
> On Thu, 29 Nov 2007, Kumar Gala wrote:
>>
>> In the case of multiple identical matches can we look at the file  
>> name as a
>> possible heuristic?
>
> We already do. But we only do the base-name part and check it for
> exactness, since moving across directories is very common, and we
> explicitly want to pick up files that have the same base name.
>
> However, in your case, not only did you have the same content, you  
> had the
> same basename too! So git considered your renames to be totally  
> identical
> wrt scoring with the current heuristics, and just picked one source at
> random.
>
> And the current heuristics don't even have any "if you already found a
> rename, avoid picking the same one twice", so it would pick the *same*
> source both times, which is why it looked like "two copies and one
> delete".
>
> This is why I'd like to have a real-life example. I can change the
> heuristics, and I even know what are likely to be better heuristics,  
> but I
> still want to actually see and play with an example so that when I  
> send
> Junio a patch, I can explain it and say I've tested it with something
> real..

Ok, this is a real example from the u-boot tree.  If you give me a  
little while I can point you at a kernel.org git tree that showed this  
issue.

- k

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: problem with git detecting proper renames
  2007-11-29 19:27     ` Linus Torvalds
  2007-11-29 19:32       ` Kumar Gala
@ 2007-11-29 20:27       ` Kumar Gala
  2007-11-29 21:30         ` Fix a pathological case in " Linus Torvalds
  1 sibling, 1 reply; 14+ messages in thread
From: Kumar Gala @ 2007-11-29 20:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

> This is why I'd like to have a real-life example. I can change the
> heuristics, and I even know what are likely to be better heuristics,  
> but I
> still want to actually see and play with an example so that when I  
> send
> Junio a patch, I can explain it and say I've tested it with something
> real..

Ok, here's the tree:

         git.kernel.org:/pub/scm/boot/u-boot/galak/u-boot.git linus_git

and the commit that is doing the file movement is:

ba30ae3cc2e92d4b2362fbc01bedb659615e123e

let me know if there is anything else you need.

- k

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Fix a pathological case in git detecting proper renames
  2007-11-29 20:27       ` Kumar Gala
@ 2007-11-29 21:30         ` Linus Torvalds
  2007-11-29 23:03           ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2007-11-29 21:30 UTC (permalink / raw)
  To: Kumar Gala, Junio C Hamano; +Cc: Git Mailing List

Kumar Gala had a case in the u-boot archive with multiple renames of files 
with identical contents, and git would turn those into multiple "copy" 
operations of one of the sources, and just deleting the other sources.

This patch makes the git exact rename detection prefer to spread out the 
renames over the multiple sources, rather than do multiple copies of one 
source.

NOTE! The changes are a bit larger than required, because I also renamed 
the variables named "one" and "two" to "target" and "source" respectively. 
That makes the logic easier to follow, especially as the "one" was 
illogically the target and not the soruce, for purely historical reasons 
(this piece of code used to traverse over sources and targets in the wrong 
order, and when we fixed that, we didn't fix the names back then. So I 
fixed them now).

The important part of this change is just the trivial score calculations 
for when files have identical contents:

	/* Give higher scores to sources that haven't been used already */
	score = !source->rename_used;
	score += basename_same(source, target);

and when we have multiple choices we'll now pick the choice that gets the 
best rename score, rather than only looking at whether the basename 
matched.

It's worth noting a few gotchas:

 - this scoring is currently only done for the "exact match" case. 

   In particular, in Kumar's example, even after this patch, the inexact
   match case is still done as a copy+delete rather than as two renames:

	 delete mode 100644 board/cds/mpc8555cds/u-boot.lds
	 copy board/{cds => freescale}/mpc8541cds/u-boot.lds (97%)
	 rename board/{cds/mpc8541cds => freescale/mpc8555cds}/u-boot.lds (97%)

   because apparently the "cds/mpc8541cds/u-boot.lds" copy looked 
   a bit more similar to both end results. That said, I *suspect* we just 
   have the exact same issue there - the similarity analysis just gave 
   identical (or at least very _close_ to identical) similarity points, 
   and we do not have any logic to prefer multiple renames over a 
   copy/delete there.

   That is a separate patch.

 - When you have identical contents and identical basenames, the actual 
   entry that is chosen is still picked fairly "at random" for the first 
   one (but the subsequent ones will prefer entries that haven't already 
   been used).

   It's not actually really random, in that it actually depends on the
   relative alphabetical order of the files (which in turn will have 
   impacted the order that the entries got hashed!), so it gives 
   consistent results that can be explained. But I wanted to point it out 
   as an issue for when anybody actually does cross-renames.

   In Kumar's case the choice is the right one (and for a single normal 
   directory rename it should always be, since the relative alphabetical 
   sorting of the files will be identical), and we now get:

	 rename board/{cds => freescale}/mpc8541cds/init.S (100%)
	 rename board/{cds => freescale}/mpc8548cds/init.S (100%)

   which is the "expected" answer. However, it might still be better to 
   change the pedantic "exact same basename" on/off choice into a more 
   graduated "how similar are the pathnames" scoring situation, in order 
   to be more likely to get the exact rename choice that people *expect* 
   to see, rather than other alternatives that may *technically* be 
   equally good, but are surprising to a human.

It's also unclear whether we should consider "basenames are equal" or 
"have already used this as a source" to be more important. This gives them 
equal weight, but I suspect we might want to just multiple the "basenames 
are equal" weight by two, or something, to prefer equal basenames even if 
that causes a copy/delete pair. I dunno.

Anyway, what I'm just saying in a really long-winded manner is that I 
think this is right as-is, but it's not the complete solution, and it may 
want some further tweaking in the future.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

On Thu, 29 Nov 2007, Kumar Gala wrote:
> 
> let me know if there is anything else you need.

No, this was all right, and I've already got a patch ready for you to try.

So this patch actually does do what you want (for the exact renames, if 
not for the u-boot.lds file), but I wanted to just point out that we will 
almost certainly at least want to extend it to the inexact rename 
detection logic too, _and_ we may well want to make the "score" 
calculation a bit more involved depending on the actual filename, rather 
than just depend on the equality of the basename.

		Linus

---
 diffcore-rename.c |   25 ++++++++++++++++---------
 1 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index f9ebea5..f64294e 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -244,28 +244,35 @@ static int find_identical_files(struct file_similarity *src,
 	 * Walk over all the destinations ...
 	 */
 	do {
-		struct diff_filespec *one = dst->filespec;
+		struct diff_filespec *target = dst->filespec;
 		struct file_similarity *p, *best;
-		int i = 100;
+		int i = 100, best_score = -1;

 		/*
 		 * .. to find the best source match
 		 */
 		best = NULL;
 		for (p = src; p; p = p->next) {
-			struct diff_filespec *two = p->filespec;
+			int score;
+			struct diff_filespec *source = p->filespec;

 			/* False hash collission? */
-			if (hashcmp(one->sha1, two->sha1))
+			if (hashcmp(source->sha1, target->sha1))
 				continue;
 			/* Non-regular files? If so, the modes must match! */
-			if (!S_ISREG(one->mode) || !S_ISREG(two->mode)) {
-				if (one->mode != two->mode)
+			if (!S_ISREG(source->mode) || !S_ISREG(target->mode)) {
+				if (source->mode != target->mode)
 					continue;
 			}
-			best = p;
-			if (basename_same(one, two))
-				break;
+			/* Give higher scores to sources that haven't been used already */
+			score = !source->rename_used;
+			score += basename_same(source, target);
+			if (score > best_score) {
+				best = p;
+				best_score = score;
+				if (score == 2)
+					break;
+			}

 			/* Too many identical alternatives? Pick one */
 			if (!--i)

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Fix a pathological case in git detecting proper renames
  2007-11-29 21:30         ` Fix a pathological case in " Linus Torvalds
@ 2007-11-29 23:03           ` Linus Torvalds
  2007-11-29 23:52             ` Jeff King
  2007-11-30  0:40             ` Junio C Hamano
  0 siblings, 2 replies; 14+ messages in thread
From: Linus Torvalds @ 2007-11-29 23:03 UTC (permalink / raw)
  To: Kumar Gala, Junio C Hamano; +Cc: Git Mailing List

On Thu, 29 Nov 2007, Linus Torvalds wrote:
> 
> It's worth noting a few gotchas:
> 
>  - this scoring is currently only done for the "exact match" case. 
> 
>    In particular, in Kumar's example, even after this patch, the inexact
>    match case is still done as a copy+delete rather than as two renames:
> 
> 	 delete mode 100644 board/cds/mpc8555cds/u-boot.lds
> 	 copy board/{cds => freescale}/mpc8541cds/u-boot.lds (97%)
> 	 rename board/{cds/mpc8541cds => freescale/mpc8555cds}/u-boot.lds (97%)
> 
>    because apparently the "cds/mpc8541cds/u-boot.lds" copy looked 
>    a bit more similar to both end results. That said, I *suspect* we just 
>    have the exact same issue there - the similarity analysis just gave 
>    identical (or at least very _close_ to identical) similarity points, 
>    and we do not have any logic to prefer multiple renames over a 
>    copy/delete there.
> 
>    That is a separate patch.

Side note: just in case people were expecting me to actually _ship_ that 
separate patch that handles the fuzzy matches too.. I wasn't planning on 
doing that patch. The way the fuzzy rename detection is currently done, 
that's actually quite painful.

For the fuzzy rename detection, we generate the full score matrix, and 
sort it by the score, up front. So all the scoring - and more importantly, 
all the sorting - has actually been done before we actually start looking 
at *any* renames at all, so we cannot easily do the same thing I did for 
the exact renames, namely to take into account _earlier_ renames in the 
scoring. Because those earlier renames have simply not been done when the 
score is calculated.

This would probably become easier to do with the linear-time hash-based 
similarity engine (the stuff Jeff King was working on), but the way the 
code is currently structured - with no incremental rename detection at 
all, and with all the scoring in one global table - it's pretty painful.

			Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Fix a pathological case in git detecting proper renames
  2007-11-29 23:03           ` Linus Torvalds
@ 2007-11-29 23:52             ` Jeff King
  2007-11-30  0:41               ` Linus Torvalds
  2007-11-30  0:40             ` Junio C Hamano
  1 sibling, 1 reply; 14+ messages in thread
From: Jeff King @ 2007-11-29 23:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kumar Gala, Junio C Hamano, Git Mailing List

On Thu, Nov 29, 2007 at 03:03:06PM -0800, Linus Torvalds wrote:

> This would probably become easier to do with the linear-time hash-based 
> similarity engine (the stuff Jeff King was working on), but the way the 
> code is currently structured - with no incremental rename detection at 
> all, and with all the scoring in one global table - it's pretty painful.

I think it will get worse, because you are simultaneously calculating
all of the similarity scores bit by bit rather than doing a loop. Though
perhaps you mean at the end you will end up with a list of src/dst pairs
sorted by score, and you can loop over that.

-Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Fix a pathological case in git detecting proper renames
  2007-11-29 23:52             ` Jeff King
@ 2007-11-30  0:41               ` Linus Torvalds
  2007-11-30  0:48                 ` Jeff King
  2007-11-30  1:18                 ` Kumar Gala
  0 siblings, 2 replies; 14+ messages in thread
From: Linus Torvalds @ 2007-11-30  0:41 UTC (permalink / raw)
  To: Jeff King; +Cc: Kumar Gala, Junio C Hamano, Git Mailing List



On Thu, 29 Nov 2007, Jeff King wrote:
> 
> I think it will get worse, because you are simultaneously calculating
> all of the similarity scores bit by bit rather than doing a loop. Though
> perhaps you mean at the end you will end up with a list of src/dst pairs
> sorted by score, and you can loop over that.

Well, after thinking about this a bit, I think there's a solution that may 
work well with the current thing too: instead of looping just *once* over 
the list of rename pairs, loop twice - and simply refuse to do copies on 
the first loop.

This trivial patch does that, and turns Kumar's test-case into a perfect 
rename list.

It's not pretty, it's not smart, but it seems to work. There's something 
to be said for keeping it simple and stupid.

And it should not be nearly as expensive as it may _look_. Yes, the loop 
is "(i = 0; i < num_create * num_src; i++)", but the important part is 
that the whole array is sorted by rename score, and we have a

	if (mx[i].score < minimum_score)
		break;

in it, so uthe loop actually would tend to terminate rather quickly.

Anyway, Kumar, the thing to take away from this is:

 - git really doesn't even *care* about the whole "rename detection" 
   internally, and any commits you have done with renames are totally 
   independent of the heuristics we then use to *show* the renames.

 - the rename detection really is for just two reasons: (a) keep humans 
   happy, and keep the diffs small and (b) help automatic merging across 
   renames. So getting renames right is certainly good, but it's more of a 
   "politeness" issue than a "correctness" issue, although the merge 
   portion of it does matter a lot sometimes.

 - the important thing here is that you can commit your changes and not 
   worry about them being somehow "corrupted" by lack of rename detection, 
   even if you commit them with a version of git that doesn't do rename 
   detection the way you expected it. The rename detection is an 
   "after-the-fact" thing, not something that actually gets saved in the 
   repository, which is why we can change the heuristics _after_ seeing 
   examples, and the examples magically correct themselves!

 - try out the two patches I've posted, and see if they work for you. They 
   pass the test-suite, and the output for your example commit looks sane, 
   but hey, if you have other test-cases, try them out.

Here's Kumar's pretty diffstat with both my patches:

	 Makefile                                         |    6 +++---
	 board/{cds => freescale}/common/cadmus.c         |    0
	 board/{cds => freescale}/common/cadmus.h         |    0
	 board/{cds => freescale}/common/eeprom.c         |    0
	 board/{cds => freescale}/common/eeprom.h         |    0
	 board/{cds => freescale}/common/ft_board.c       |    0
	 board/{cds => freescale}/common/via.c            |    0
	 board/{cds => freescale}/common/via.h            |    0
	 board/{cds => freescale}/mpc8541cds/Makefile     |    0
	 board/{cds => freescale}/mpc8541cds/config.mk    |    0
	 board/{cds => freescale}/mpc8541cds/init.S       |    0
	 board/{cds => freescale}/mpc8541cds/mpc8541cds.c |    0
	 board/{cds => freescale}/mpc8541cds/u-boot.lds   |    4 ++--
	 board/{cds => freescale}/mpc8548cds/Makefile     |    0
	 board/{cds => freescale}/mpc8548cds/config.mk    |    0
	 board/{cds => freescale}/mpc8548cds/init.S       |    0
	 board/{cds => freescale}/mpc8548cds/mpc8548cds.c |    0
	 board/{cds => freescale}/mpc8548cds/u-boot.lds   |    4 ++--
	 board/{cds => freescale}/mpc8555cds/Makefile     |    0
	 board/{cds => freescale}/mpc8555cds/config.mk    |    0
	 board/{cds => freescale}/mpc8555cds/init.S       |    0
	 board/{cds => freescale}/mpc8555cds/mpc8555cds.c |    0
	 board/{cds => freescale}/mpc8555cds/u-boot.lds   |    4 ++--
	 23 files changed, 9 insertions(+), 9 deletions(-)

and here it is before:

	 Makefile                                           |    6 +-
	 board/cds/mpc8548cds/Makefile                      |   60 -----
	 board/cds/mpc8555cds/Makefile                      |   60 -----
	 board/cds/mpc8555cds/init.S                        |  255 --------------------
	 board/cds/mpc8555cds/u-boot.lds                    |  150 ------------
	 board/{cds => freescale}/common/cadmus.c           |    0
	 board/{cds => freescale}/common/cadmus.h           |    0
	 board/{cds => freescale}/common/eeprom.c           |    0
	 board/{cds => freescale}/common/eeprom.h           |    0
	 board/{cds => freescale}/common/ft_board.c         |    0
	 board/{cds => freescale}/common/via.c              |    0
	 board/{cds => freescale}/common/via.h              |    0
	 board/{cds => freescale}/mpc8541cds/Makefile       |    0
	 board/{cds => freescale}/mpc8541cds/config.mk      |    0
	 board/{cds => freescale}/mpc8541cds/init.S         |    0
	 board/{cds => freescale}/mpc8541cds/mpc8541cds.c   |    0
	 board/{cds => freescale}/mpc8541cds/u-boot.lds     |    4 +-
	 .../mpc8541cds => freescale/mpc8548cds}/Makefile   |    0
	 board/{cds => freescale}/mpc8548cds/config.mk      |    0
	 board/{cds => freescale}/mpc8548cds/init.S         |    0
	 board/{cds => freescale}/mpc8548cds/mpc8548cds.c   |    0
	 board/{cds => freescale}/mpc8548cds/u-boot.lds     |    4 +-
	 .../mpc8541cds => freescale/mpc8555cds}/Makefile   |    0
	 board/{cds => freescale}/mpc8555cds/config.mk      |    0
	 .../mpc8541cds => freescale/mpc8555cds}/init.S     |    0
	 board/{cds => freescale}/mpc8555cds/mpc8555cds.c   |    0
	 .../mpc8541cds => freescale/mpc8555cds}/u-boot.lds |    4 +-
	 27 files changed, 9 insertions(+), 534 deletions(-)

so it certainly makes the diffs prettier.

		Linus

---
 diffcore-rename.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index f64294e..3d37725 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -497,6 +497,19 @@ void diffcore_rename(struct diff_options *options)
 	qsort(mx, num_create * num_src, sizeof(*mx), score_compare);
 	for (i = 0; i < num_create * num_src; i++) {
 		struct diff_rename_dst *dst = &rename_dst[mx[i].dst];
+		struct diff_filespec *src;
+		if (dst->pair)
+			continue; /* already done, either exact or fuzzy. */
+		if (mx[i].score < minimum_score)
+			break; /* there is no more usable pair. */
+		src = rename_src[mx[i].src].one;
+		if (src->rename_used)
+			continue;
+		record_rename_pair(mx[i].dst, mx[i].src, mx[i].score);
+		rename_count++;
+	}
+	for (i = 0; i < num_create * num_src; i++) {
+		struct diff_rename_dst *dst = &rename_dst[mx[i].dst];
 		if (dst->pair)
 			continue; /* already done, either exact or fuzzy. */
 		if (mx[i].score < minimum_score)

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Fix a pathological case in git detecting proper renames
  2007-11-30  0:41               ` Linus Torvalds
@ 2007-11-30  0:48                 ` Jeff King
  2007-11-30  1:18                 ` Kumar Gala
  1 sibling, 0 replies; 14+ messages in thread
From: Jeff King @ 2007-11-30  0:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kumar Gala, Junio C Hamano, Git Mailing List

On Thu, Nov 29, 2007 at 04:41:09PM -0800, Linus Torvalds wrote:

> It's not pretty, it's not smart, but it seems to work. There's something 
> to be said for keeping it simple and stupid.
> 
> And it should not be nearly as expensive as it may _look_. Yes, the loop 
> is "(i = 0; i < num_create * num_src; i++)", but the important part is 
> that the whole array is sorted by rename score, and we have a
> 
> 	if (mx[i].score < minimum_score)
> 		break;
> 
> in it, so uthe loop actually would tend to terminate rather quickly.

I think the slowdown is a non-issue. From the benchmarking I did in the
past, all of the time was spent _before_ even getting to the qsort of
scores. So even if you doubled the expense of that loop, it would have a
negligible impact.

But I haven't actually benchmarked this new patch, of course.

-Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Fix a pathological case in git detecting proper renames
  2007-11-30  0:41               ` Linus Torvalds
  2007-11-30  0:48                 ` Jeff King
@ 2007-11-30  1:18                 ` Kumar Gala
  1 sibling, 0 replies; 14+ messages in thread
From: Kumar Gala @ 2007-11-30  1:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, Junio C Hamano, Git Mailing List


On Nov 29, 2007, at 6:41 PM, Linus Torvalds wrote:

>
>
> On Thu, 29 Nov 2007, Jeff King wrote:
>>
>> I think it will get worse, because you are simultaneously calculating
>> all of the similarity scores bit by bit rather than doing a loop.  
>> Though
>> perhaps you mean at the end you will end up with a list of src/dst  
>> pairs
>> sorted by score, and you can loop over that.
>
> Well, after thinking about this a bit, I think there's a solution  
> that may
> work well with the current thing too: instead of looping just *once*  
> over
> the list of rename pairs, loop twice - and simply refuse to do  
> copies on
> the first loop.
>
> This trivial patch does that, and turns Kumar's test-case into a  
> perfect
> rename list.

Glad I can be of use for something :)

> It's not pretty, it's not smart, but it seems to work. There's  
> something
> to be said for keeping it simple and stupid.
>
> And it should not be nearly as expensive as it may _look_. Yes, the  
> loop
> is "(i = 0; i < num_create * num_src; i++)", but the important part is
> that the whole array is sorted by rename score, and we have a
>
> 	if (mx[i].score < minimum_score)
> 		break;
>
> in it, so uthe loop actually would tend to terminate rather quickly.
>
> Anyway, Kumar, the thing to take away from this is:
>
> - git really doesn't even *care* about the whole "rename detection"
>   internally, and any commits you have done with renames are totally
>   independent of the heuristics we then use to *show* the renames.
>
> - the rename detection really is for just two reasons: (a) keep humans
>   happy, and keep the diffs small and (b) help automatic merging  
> across
>   renames. So getting renames right is certainly good, but it's more  
> of a
>   "politeness" issue than a "correctness" issue, although the merge
>   portion of it does matter a lot sometimes.

Yeah both of these dawned on me after my brain started working and  
thinking about what you were talking about when you say git manages  
'content' and thinking about how the content management correlates to  
the diffs we generate.

> - the important thing here is that you can commit your changes and not
>   worry about them being somehow "corrupted" by lack of rename  
> detection,
>   even if you commit them with a version of git that doesn't do rename
>   detection the way you expected it. The rename detection is an
>   "after-the-fact" thing, not something that actually gets saved in  
> the
>   repository, which is why we can change the heuristics _after_ seeing
>   examples, and the examples magically correct themselves!
>
> - try out the two patches I've posted, and see if they work for you.  
> They
>   pass the test-suite, and the output for your example commit looks  
> sane,
>   but hey, if you have other test-cases, try them out.

Only started using -M when a co-worker informed me about it.  But if I  
see other cases in the future I'll let you guys know.

> Here's Kumar's pretty diffstat with both my patches:
>
> 	 Makefile                                         |    6 +++---
> 	 board/{cds => freescale}/common/cadmus.c         |    0
> 	 board/{cds => freescale}/common/cadmus.h         |    0
> 	 board/{cds => freescale}/common/eeprom.c         |    0
> 	 board/{cds => freescale}/common/eeprom.h         |    0
> 	 board/{cds => freescale}/common/ft_board.c       |    0
> 	 board/{cds => freescale}/common/via.c            |    0
> 	 board/{cds => freescale}/common/via.h            |    0
> 	 board/{cds => freescale}/mpc8541cds/Makefile     |    0
> 	 board/{cds => freescale}/mpc8541cds/config.mk    |    0
> 	 board/{cds => freescale}/mpc8541cds/init.S       |    0
> 	 board/{cds => freescale}/mpc8541cds/mpc8541cds.c |    0
> 	 board/{cds => freescale}/mpc8541cds/u-boot.lds   |    4 ++--
> 	 board/{cds => freescale}/mpc8548cds/Makefile     |    0
> 	 board/{cds => freescale}/mpc8548cds/config.mk    |    0
> 	 board/{cds => freescale}/mpc8548cds/init.S       |    0
> 	 board/{cds => freescale}/mpc8548cds/mpc8548cds.c |    0
> 	 board/{cds => freescale}/mpc8548cds/u-boot.lds   |    4 ++--
> 	 board/{cds => freescale}/mpc8555cds/Makefile     |    0
> 	 board/{cds => freescale}/mpc8555cds/config.mk    |    0
> 	 board/{cds => freescale}/mpc8555cds/init.S       |    0
> 	 board/{cds => freescale}/mpc8555cds/mpc8555cds.c |    0
> 	 board/{cds => freescale}/mpc8555cds/u-boot.lds   |    4 ++--
> 	 23 files changed, 9 insertions(+), 9 deletions(-)
>
> and here it is before:
>
> 	 Makefile                                           |    6 +-
> 	 board/cds/mpc8548cds/Makefile                      |   60 -----
> 	 board/cds/mpc8555cds/Makefile                      |   60 -----
> 	 board/cds/mpc8555cds/init.S                        |  255  
> --------------------
> 	 board/cds/mpc8555cds/u-boot.lds                    |  150  
> ------------
> 	 board/{cds => freescale}/common/cadmus.c           |    0
> 	 board/{cds => freescale}/common/cadmus.h           |    0
> 	 board/{cds => freescale}/common/eeprom.c           |    0
> 	 board/{cds => freescale}/common/eeprom.h           |    0
> 	 board/{cds => freescale}/common/ft_board.c         |    0
> 	 board/{cds => freescale}/common/via.c              |    0
> 	 board/{cds => freescale}/common/via.h              |    0
> 	 board/{cds => freescale}/mpc8541cds/Makefile       |    0
> 	 board/{cds => freescale}/mpc8541cds/config.mk      |    0
> 	 board/{cds => freescale}/mpc8541cds/init.S         |    0
> 	 board/{cds => freescale}/mpc8541cds/mpc8541cds.c   |    0
> 	 board/{cds => freescale}/mpc8541cds/u-boot.lds     |    4 +-
> 	 .../mpc8541cds => freescale/mpc8548cds}/Makefile   |    0
> 	 board/{cds => freescale}/mpc8548cds/config.mk      |    0
> 	 board/{cds => freescale}/mpc8548cds/init.S         |    0
> 	 board/{cds => freescale}/mpc8548cds/mpc8548cds.c   |    0
> 	 board/{cds => freescale}/mpc8548cds/u-boot.lds     |    4 +-
> 	 .../mpc8541cds => freescale/mpc8555cds}/Makefile   |    0
> 	 board/{cds => freescale}/mpc8555cds/config.mk      |    0
> 	 .../mpc8541cds => freescale/mpc8555cds}/init.S     |    0
> 	 board/{cds => freescale}/mpc8555cds/mpc8555cds.c   |    0
> 	 .../mpc8541cds => freescale/mpc8555cds}/u-boot.lds |    4 +-
> 	 27 files changed, 9 insertions(+), 534 deletions(-)
>
> so it certainly makes the diffs prettier.

Much better.  I love with open source development works.

- k

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Fix a pathological case in git detecting proper renames
  2007-11-29 23:03           ` Linus Torvalds
  2007-11-29 23:52             ` Jeff King
@ 2007-11-30  0:40             ` Junio C Hamano
  1 sibling, 0 replies; 14+ messages in thread
From: Junio C Hamano @ 2007-11-30  0:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kumar Gala, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> For the fuzzy rename detection, we generate the full score matrix, and 
> sort it by the score, up front. So all the scoring - and more importantly, 
> all the sorting - has actually been done before we actually start looking 
> at *any* renames at all, so we cannot easily do the same thing I did for 
> the exact renames, namely to take into account _earlier_ renames in the 
> scoring. Because those earlier renames have simply not been done when the 
> score is calculated.

I think I've mentioned this before, but another thing we may want to do
is to give similarity boost to a src-dst pair if other files in the same
src directory are found to be renamed to the same dst directory.  That
is, if you have the same contents in the preimage at A/init.S and B/init.S,
and a similar contents appear in C/init.S in the postimage, instead of
randomly picking A/init.S over B/init.S as the source, we can notice
that A/Makefile was moved to C/Makefile (but B/Makefile was sufficiently
different from A/Makefile in the preimage), and favor A/init.S over
B/init.S as the rename source of C/init.S.

About the code structure, I think the very early draft of rename
detector did not do the full matrix, but iterated over dst to see if
there is a good src for it, picked the best src that is above the
threshold, and went on to next dst, like this:

	for (dst in dst candidates) {
        	best_src = NULL;
                best_score = minimum_score;
                for (src in src candidates) {
                	score = similarity(dst, src);
                        if (score > best_score)
                            best_src = src;
		}
		if (best_src) {
			match dst with src;
		}
	}

This was restructured in the current "full matrix first" form before the
rename detection logic first hit your tree, and I do not think it was
shown in the field to perform worse than the full matrix version.

We could do the current full matrix that does not take basename
similarity nor what other renames were detected first, and then use that
matrix result in order to primarily define the order of dst candidates
to process and run the above loop.  At that point, similarity between
dst and src does not need to be recomputed fully (the matrix would
record it).  Instead, we can tweak it to take other renames that already
have been detected (this includes "this src has already been used", and
"somebody nearby moved to the same directory") and basename similarity
to affect which possible src candidate to choose for each dst.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: problem with git detecting proper renames
  2007-11-29 17:44 ` Linus Torvalds
  2007-11-29 19:06   ` Kumar Gala
@ 2007-11-30  0:21   ` Jakub Narebski
  1 sibling, 0 replies; 14+ messages in thread
From: Jakub Narebski @ 2007-11-30  0:21 UTC (permalink / raw)
  To: git

Linus Torvalds wrote:
> On Thu, 29 Nov 2007, Kumar Gala wrote:
>> 
>> I did some git-mv and got the following:
>> 
>> the problem is git seems confused about what file was associated with its
>> source.
> 
> Well, I wouldn't say "confused". It found multiple identical options for 
> the source, and picked the first one (where "first one" may not be obvious 
> to a human, it can depend on an internal hash order).

By the way, which git version do you use? IIRC we have improved rename
detection heuristics to take into account similarity of filenames when
contents is identical...

...ah, I see, it is git 1.5.3.4

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2007-11-30  1:19 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-11-29 16:57 problem with git detecting proper renames Kumar Gala
2007-11-29 17:44 ` Linus Torvalds
2007-11-29 19:06   ` Kumar Gala
2007-11-29 19:27     ` Linus Torvalds
2007-11-29 19:32       ` Kumar Gala
2007-11-29 20:27       ` Kumar Gala
2007-11-29 21:30         ` Fix a pathological case in " Linus Torvalds
2007-11-29 23:03           ` Linus Torvalds
2007-11-29 23:52             ` Jeff King
2007-11-30  0:41               ` Linus Torvalds
2007-11-30  0:48                 ` Jeff King
2007-11-30  1:18                 ` Kumar Gala
2007-11-30  0:40             ` Junio C Hamano
2007-11-30  0:21   ` problem with " Jakub Narebski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).