git.vger.kernel.org archive mirror
* Pure renames/copies
@ 2005-11-21 12:01 Santi Béjar
  2005-11-21 18:37 ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Santi Béjar @ 2005-11-21 12:01 UTC (permalink / raw)
  To: Git Mailing List

Hello:


        Is there any way to ask git to find pure renames or copies?

        I ask this because it is a much cheaper operation than what -C
        and -M do (-M100 does not work), and it can be used when the number
        of paths is big, or when you track binary files.

        Thanks

        Santi


* Re: Pure renames/copies
  2005-11-21 12:01 Pure renames/copies Santi Béjar
@ 2005-11-21 18:37 ` Linus Torvalds
  2005-11-21 19:31   ` Junio C Hamano
  2005-11-21 19:50   ` Junio C Hamano
  0 siblings, 2 replies; 11+ messages in thread
From: Linus Torvalds @ 2005-11-21 18:37 UTC (permalink / raw)
  To: Santi Béjar; +Cc: Git Mailing List




On Mon, 21 Nov 2005, Santi Béjar wrote:
> 
>         Is there any way to ask git to find pure renames or copies?

Not directly, but git sure makes it easy for you.

Do this:

	git-diff-tree -r old..new |
		grep '^:[^<tab>]*0000000000000000000000000000000000000000'

and you'll get all the information you need to do it efficiently (that 
"<tab>" is obviously the tab character). Or, if you want to look at just 
one commit, just do "git-diff-tree -r <commit>".
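
For anyone who would rather not type a literal tab into the pattern, the same filter can be sketched with awk, which splits the record on whitespace so the two object names can be compared directly. This is only a sketch: the function name adds_and_deletes is made up for illustration, and it assumes the raw output format shown in this message.

```shell
#!/bin/sh
# Keep only the raw diff records whose old or new object name is all zeros,
# i.e. paths that were created or deleted. Reads diff-tree output on stdin.
# (adds_and_deletes is a hypothetical helper name, not a git command.)
adds_and_deletes() {
    awk '
        BEGIN { zero = "0000000000000000000000000000000000000000" }
        # Fields: $1=:oldmode $2=newmode $3=oldsha $4=newsha $5=status
        /^:/ && ($3 == zero || $4 == zero)'
}

# Typical use, with old and new being the two trees to compare:
#   git-diff-tree -r old..new | adds_and_deletes
```

The commit-id line that diff-tree prints first for a single commit does not start with ":" and is dropped, just as it is by the grep.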

In the git tree, commit 0086e2c854e3af3209915e4ec2f933bcef400050 can act 
as a good example of this: the output of 

	git-diff-tree -r 0086e2c854e3af3209915e4ec2f933bcef400050

is

	0086e2c854e3af3209915e4ec2f933bcef400050
	:100644 100644 328b399f9fe6e2b668691ab359319f50561cd773 16a8af63f0523cec82faa23f29cee579ac224e82 M      .gitignore
	:100644 000000 a8cc5739d7851da3aeca2388d74eb92c464f1732 0000000000000000000000000000000000000000 D      Documentation/git-lost+found.txt
	:000000 100644 0000000000000000000000000000000000000000 03156f218bb41b955779207ec2e94120f958fc45 A      Documentation/git-lost-found.txt
	:100644 100644 a9d47c115c071694321d076af8a73a06ddd46875 1c32dd5be7156ae0e1142523fe50d84745964793 M      Documentation/git.txt
	:100644 100644 b75cb137875b3cdb8746d2e0135e6f2743e2046a 5b2eca897386e17021d2a8a052b0c2759df96447 M      Makefile
	:100755 000000 3892f52005d1e36676681806a87ef35dc0689f22 0000000000000000000000000000000000000000 D      git-lost+found.sh
	:000000 100755 0000000000000000000000000000000000000000 3892f52005d1e36676681806a87ef35dc0689f22 A      git-lost-found.sh

and then after the "grep", you have just

	:100644 000000 a8cc5739d7851da3aeca2388d74eb92c464f1732 0000000000000000000000000000000000000000 D      Documentation/git-lost+found.txt
	:000000 100644 0000000000000000000000000000000000000000 03156f218bb41b955779207ec2e94120f958fc45 A      Documentation/git-lost-found.txt
	:100755 000000 3892f52005d1e36676681806a87ef35dc0689f22 0000000000000000000000000000000000000000 D      git-lost+found.sh
	:000000 100755 0000000000000000000000000000000000000000 3892f52005d1e36676681806a87ef35dc0689f22 A      git-lost-found.sh

left, which shows you the new and the deleted files.

Then, look for renames: just match up a new file that has the same SHA1 as 
a deleted file, and you can see that the change from "git-lost+found.sh" 
to "git-lost-found.sh" was just such an exact rename, because they 
share the 3892f52005d1e36676681806a87ef35dc0689f22 SHA1.
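
The matching step can be sketched as follows. It uses awk rather than the cut + sort + join pipeline, purely to keep the example short; exact_renames is a hypothetical name, and the input is assumed to be raw diff-tree records with a tab before the path.

```shell
#!/bin/sh
# Pair each deleted path with an added path that has the same object name.
# Reads raw diff-tree records on stdin and prints "old-path<TAB>new-path"
# for every exact rename found. If several deleted (or added) paths share
# one object name, only the last one seen is reported.
exact_renames() {
    awk -F '\t' '
        /^:/ {
            # meta[3] = old sha, meta[4] = new sha, meta[5] = status letter
            split($1, meta, " ")
            if (meta[5] == "D") deleted[meta[3]] = $2
            else if (meta[5] == "A") added[meta[4]] = $2
        }
        END {
            for (sha in added)
                if (sha in deleted)
                    printf "%s\t%s\n", deleted[sha], added[sha]
        }'
}

# Typical use:
#   git-diff-tree -r old..new | exact_renames
```

Added paths that find no partner here are the ones left over for the copy check.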

After rename detection, look at any remaining new files (in the above 
example, only

	:000000 100644 0000000000000000000000000000000000000000 03156f218bb41b955779207ec2e94120f958fc45 A      Documentation/git-lost-found.txt

would be left), and try to match up the SHA1 of that file with the result 
of "git-ls-tree -r $old", ie something like

	git-ls-tree -r 0086e2c854e3af3209915e4ec2f933bcef400050^ |
		grep 03156f218bb41b955779207ec2e94120f958fc45

which in this case is empty (that new file wasn't an exact copy of any old 
file, it was a rename+edit, of course).
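
A slightly stricter version of that lookup, as a sketch: it compares only the object-name column of the ls-tree output, so a match cannot come from a path that happens to contain the hash. The name copy_sources is made up here, and the "mode type sha<TAB>path" layout of git-ls-tree -r output is assumed.

```shell
#!/bin/sh
# Print the paths in the old tree whose blob has the given object name.
# Reads `git-ls-tree -r` output ("mode type sha<TAB>path") on stdin;
# the object name to look for is the first argument.
copy_sources() {
    awk -F '\t' -v want="$1" '
        {
            split($1, meta, " ")
            # Appending "" forces a string comparison of the two names.
            if ((meta[3] "") == (want "")) print $2
        }'
}

# Typical use, for one leftover new file:
#   git-ls-tree -r "$old" | copy_sources 03156f218bb41b955779207ec2e94120f958fc45
```

Empty output means the new file is not an exact copy of anything in the old tree.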

Very efficient, very simple, you can do it either with a small 
shell-script (using cut + sort + join + grep), or write a specialized tool 
around the git-diff-tree logic.

Of course, arguably "-M100" should really do this optimization for you. 
Junio?

			Linus


* Re: Pure renames/copies
  2005-11-21 18:37 ` Linus Torvalds
@ 2005-11-21 19:31   ` Junio C Hamano
  2005-11-21 19:50   ` Junio C Hamano
  1 sibling, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2005-11-21 19:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> Of course, arguably "-M100" should really do this optimization for you. 
> Junio?

I'd agree.  That is what -M100 should mean.


* Re: Pure renames/copies
  2005-11-21 18:37 ` Linus Torvalds
  2005-11-21 19:31   ` Junio C Hamano
@ 2005-11-21 19:50   ` Junio C Hamano
  2005-11-21 21:01     ` H. Peter Anvin
  2005-11-22  9:03     ` Santi Bejar
  1 sibling, 2 replies; 11+ messages in thread
From: Junio C Hamano @ 2005-11-21 19:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> Of course, arguably "-M100" should really do this optimization for you. 
> Junio?

Probably something like this would suffice.

-- >8 --
Subject: rename detection with -M100 means "exact renames only".

When the user is interested in pure renames, there is no point
doing the similarity scores.  This changes the score argument
parsing to special case -M100 (otherwise, it is a precision
scaled value 0 <= v < 1 and would mean 0.1, not 1.0 --- if you
do mean 0.1, you can say -M1), and optimizes the diffcore_rename
transformation to only look at pure renames in that case.

Signed-off-by: Junio C Hamano <junkio@cox.net>

---

diff --git a/diff.c b/diff.c
index 0391e8c..0f839c1 100644
--- a/diff.c
+++ b/diff.c
@@ -853,6 +853,10 @@ static int parse_num(const char **cp_p)
 	}
 	*cp_p = cp;
 
+	/* special case: -M100 would mean 1.0 not 0.1 */
+	if (num == 100 && scale == 1000)
+		return MAX_SCORE;
+
 	/* user says num divided by scale and we say internally that
 	 * is MAX_SCORE * num / scale.
 	 */
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 6a9d95d..dba965c 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -307,6 +307,9 @@ void diffcore_rename(struct diff_options
 	if (rename_count == rename_dst_nr)
 		goto cleanup;
 
+	if (minimum_score == MAX_SCORE)
+		goto cleanup;
+
 	num_create = (rename_dst_nr - rename_count);
 	num_src = rename_src_nr;
 	mx = xmalloc(sizeof(*mx) * num_create * num_src);
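
To make the scaling rule in the commit message concrete, here is a small shell transcription of the digit loop with the new special case. It is a sketch for trying out values, not the C code itself: score_for is a made-up name, and MAX_SCORE=60000 is an assumption, since the thread never states its value.

```shell
#!/bin/sh
# ASSUMPTION: the value of MAX_SCORE is not given in this thread; 60000 is
# used only to make the arithmetic concrete.
MAX_SCORE=60000

# Mimic parse_num() with the -M100 special case: read the leading digits
# (only the first five count), treat them as the fraction num / 10^digits,
# and map exactly "100" to MAX_SCORE. score_for is a hypothetical name.
score_for() {
    rest=$1 num=0 scale=1 cnt=0
    while :; do
        ch=${rest%"${rest#?}"}      # first character of what is left
        case $ch in
        [0-9])
            if [ "$cnt" -lt 5 ]; then
                scale=$((scale * 10))
                num=$((num * 10 + ch))
            fi
            cnt=$((cnt + 1))
            rest=${rest#?} ;;
        *)
            break ;;
        esac
    done
    if [ "$num" -eq 100 ] && [ "$scale" -eq 1000 ]; then
        echo "$MAX_SCORE"           # special case: -M100 means 1.0
    else
        echo $((MAX_SCORE * num / scale))
    fi
}

# score_for 1    gives MAX_SCORE / 10  (0.1)
# score_for 50   gives MAX_SCORE / 2   (0.5)
# score_for 100  gives MAX_SCORE       (1.0, no longer 0.100)
```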


* Re: Pure renames/copies
  2005-11-21 19:50   ` Junio C Hamano
@ 2005-11-21 21:01     ` H. Peter Anvin
  2005-11-21 21:33       ` Junio C Hamano
  2005-11-22  9:03     ` Santi Bejar
  1 sibling, 1 reply; 11+ messages in thread
From: H. Peter Anvin @ 2005-11-21 21:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

Junio C Hamano wrote:
> Linus Torvalds <torvalds@osdl.org> writes:
> 
> 
>>Of course, arguably "-M100" should really do this optimization for you. 
>>Junio?
> 
> 
> Probably something like this would suffice.
> 
> -- >8 --
> Subject: rename detection with -M100 means "exact renames only".
> 
> When the user is interested in pure renames, there is no point
> doing the similarity scores.  This changes the score argument
> parsing to special case -M100 (otherwise, it is a precision
> scaled value 0 <= v < 1 and would mean 0.1, not 1.0 --- if you
> do mean 0.1, you can say -M1), and optimizes the diffcore_rename
> transformation to only look at pure renames in that case.
> 

Any reason we can't make it take an actual decimal number, like -M1.0 or 
-M0.345?  It seems odd and annoying to invent our own notation for 
floating-point numbers, especially in userspace.

	-hpa


* Re: Pure renames/copies
  2005-11-21 21:01     ` H. Peter Anvin
@ 2005-11-21 21:33       ` Junio C Hamano
  2005-11-21 21:37         ` H. Peter Anvin
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2005-11-21 21:33 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: git

"H. Peter Anvin" <hpa@zytor.com> writes:

> Any reason we can't make it take an actual decimal number, like -M1.0 or 
> -M0.345?  It seems odd and annoying to invent our own notation for 
> floating-point numbers, especially in userspace.

No reason we "can't".  As for why we "don't": inertia and nothing
else.  It happened around this time.

	http://marc.theaimsgroup.com/?l=git&m=111654149421574

We could additionally accept a decimal number 0 <= x <= 1, and that
should be a simple patch to diff.c::parse_num().


* Re: Pure renames/copies
  2005-11-21 21:33       ` Junio C Hamano
@ 2005-11-21 21:37         ` H. Peter Anvin
  2005-11-21 22:00           ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: H. Peter Anvin @ 2005-11-21 21:37 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano wrote:
> "H. Peter Anvin" <hpa@zytor.com> writes:
> 
> 
>>Any reason we can't make it take an actual decimal number, like -M1.0 or 
>>-M0.345?  It seems odd and annoying to invent our own notation for 
>>floating-point numbers, especially in userspace.
> 
> 
> No reason we "can't".  As for why we "don't": inertia and nothing
> else.  It happened around this time.
> 
> 	http://marc.theaimsgroup.com/?l=git&m=111654149421574
> 
> We could additionally accept a decimal number 0 <= x <= 1, and that
> should be a simple patch to diff.c::parse_num().
> 

Okay, in that post Linus suggests that -M without an argument should be 
== 100% (1.0), thus avoiding having to mess up the meaning of -M100 as 
0.100.  It seems like a really odd thing to have -M100 mean something 
that's completely out of line with the rest of the meaning.

	-hpa


* Re: Pure renames/copies
  2005-11-21 21:37         ` H. Peter Anvin
@ 2005-11-21 22:00           ` Junio C Hamano
  2005-11-21 22:10             ` H. Peter Anvin
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2005-11-21 22:00 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: git

"H. Peter Anvin" <hpa@zytor.com> writes:

> Okay, in that post Linus suggests that -M without an argument should be 
> == 100% (1.0), thus avoiding having to mess up the meaning of -M100 as 
> 0.100.  It seems like a really odd thing to have -M100 mean something 
> that's completely out of line with the rest of the meaning.

True, but it might be too late to change that; I suspect people
expect -M to do a bit more than pure renames by now.


* Re: Pure renames/copies
  2005-11-21 22:00           ` Junio C Hamano
@ 2005-11-21 22:10             ` H. Peter Anvin
  2005-11-21 22:17               ` H. Peter Anvin
  0 siblings, 1 reply; 11+ messages in thread
From: H. Peter Anvin @ 2005-11-21 22:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git


Junio C Hamano wrote:
> "H. Peter Anvin" <hpa@zytor.com> writes:
> 
> 
>>Okay, in that post Linus suggests that -M without an argument should be 
>>== 100% (1.0), thus avoiding having to mess up the meaning of -M100 as 
>>0.100.  It seems like a really odd thing to have -M100 mean something 
>>that's completely out of line with the rest of the meaning.
> 
> True, but it might be too late to change that; I suspect people
> expect -M to do a bit more than pure renames by now.
> 

Okay, how about the following?  It lets both -M1.0 and -M100% work, 
while keeping everything else compatible, and avoiding artificial 
special cases.

	-hpa

[-- Attachment #2: diff --]
[-- Type: text/plain, Size: 932 bytes --]

diff --git a/diff.c b/diff.c
index 0391e8c..df62d2b 100644
--- a/diff.c
+++ b/diff.c
@@ -843,11 +843,19 @@ static int parse_num(const char **cp_p)
 
 	cnt = num = 0;
 	scale = 1;
-	while ('0' <= (ch = *cp) && ch <= '9') {
-		if (cnt++ < 5) {
-			/* We simply ignore more than 5 digits precision. */
-			scale *= 10;
-			num = num * 10 + ch - '0';
+	for(;;) {
+		ch = *cp;
+		if ( ch == '.' ) {
+			scale = 1;
+		} else if ( ch == '%' ) {
+			scale = 100;
+		} else if ( ch >= '0' && ch <= '9' ) {
+			if ( scale < 100000 ) {
+				scale *= 10;
+				num = (num*10) + (ch-'0');
+			}
+		} else {
+			break;
 		}
 		cp++;
 	}
@@ -856,7 +864,7 @@ static int parse_num(const char **cp_p)
 	/* user says num divided by scale and we say internally that
 	 * is MAX_SCORE * num / scale.
 	 */
-	return (MAX_SCORE * num / scale);
+	return (num >= scale) ? MAX_SCORE : (MAX_SCORE * num / scale);
 }
 
 int diff_scoreopt_parse(const char *opt)


* Re: Pure renames/copies
  2005-11-21 22:10             ` H. Peter Anvin
@ 2005-11-21 22:17               ` H. Peter Anvin
  0 siblings, 0 replies; 11+ messages in thread
From: H. Peter Anvin @ 2005-11-21 22:17 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Junio C Hamano, git


Better variant, which handles stuff like "4.5%" and rejects 
"192.168.0.1".  Additionally, make sure numbers are unsigned (I'm making 
them unsigned long just for the hell of it), to make sure that 
artificial wraparound scenarios don't cause harm.

	-hpa


[-- Attachment #2: diff --]
[-- Type: text/plain, Size: 1186 bytes --]

diff --git a/diff.c b/diff.c
index 0391e8c..ffe8a55 100644
--- a/diff.c
+++ b/diff.c
@@ -838,16 +838,29 @@ int diff_opt_parse(struct diff_options *
 
 static int parse_num(const char **cp_p)
 {
-	int num, scale, ch, cnt;
+	unsigned long num, scale;
+	int ch, dot;
 	const char *cp = *cp_p;
 
-	cnt = num = 0;
+	num = 0;
 	scale = 1;
-	while ('0' <= (ch = *cp) && ch <= '9') {
-		if (cnt++ < 5) {
-			/* We simply ignore more than 5 digits precision. */
-			scale *= 10;
-			num = num * 10 + ch - '0';
+	dot = 0;
+	for(;;) {
+		ch = *cp;
+		if ( !dot && ch == '.' ) {
+			scale = 1;
+			dot = 1;
+		} else if ( ch == '%' ) {
+			scale = dot ? scale*100 : 100;
+			cp++;	/* % is always at the end */
+			break;
+		} else if ( ch >= '0' && ch <= '9' ) {
+			if ( scale < 100000 ) {
+				scale *= 10;
+				num = (num*10) + (ch-'0');
+			}
+		} else {
+			break;
 		}
 		cp++;
 	}
@@ -856,7 +869,7 @@ static int parse_num(const char **cp_p)
 	/* user says num divided by scale and we say internally that
 	 * is MAX_SCORE * num / scale.
 	 */
-	return (MAX_SCORE * num / scale);
+	return (num >= scale) ? MAX_SCORE : (MAX_SCORE * num / scale);
 }
 
 int diff_scoreopt_parse(const char *opt)
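
To see what this variant accepts, the loop can be transcribed into shell and fed a few inputs. This is a sketch only: parse_score is a made-up name, MAX_SCORE=60000 is an assumed value, and rejecting leftover input is assumed to be what the caller of parse_num() does (the description above says "192.168.0.1" is rejected, and the loop itself merely stops at the second dot).

```shell
#!/bin/sh
# ASSUMPTION: MAX_SCORE is not stated in this thread; 60000 is illustrative.
MAX_SCORE=60000

# Shell transcription of the parse_num() loop from the patch above.
# Prints the score, or fails if any input is left unparsed (assumed to be
# the job of the C caller). parse_score is a hypothetical name.
parse_score() {
    rest=$1 num=0 scale=1 dot=0
    while :; do
        ch=${rest%"${rest#?}"}      # first character of what is left
        case $ch in
        .)
            [ "$dot" -eq 0 ] || break   # a second dot ends the number
            scale=1
            dot=1 ;;
        %)
            if [ "$dot" -eq 1 ]; then
                scale=$((scale * 100))
            else
                scale=100
            fi
            rest=${rest#?}          # % is always at the end
            break ;;
        [0-9])
            if [ "$scale" -lt 100000 ]; then
                scale=$((scale * 10))
                num=$((num * 10 + ch))
            fi ;;
        *)
            break ;;
        esac
        rest=${rest#?}
    done
    if [ -n "$rest" ]; then
        echo "invalid score: $1" >&2
        return 1
    fi
    if [ "$num" -ge "$scale" ]; then
        echo "$MAX_SCORE"
    else
        echo $((MAX_SCORE * num / scale))
    fi
}

# parse_score 1.0 and parse_score 100% both give MAX_SCORE;
# parse_score 100 still means 0.100; parse_score 192.168.0.1 fails.
```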


* Re: Pure renames/copies
  2005-11-21 19:50   ` Junio C Hamano
  2005-11-21 21:01     ` H. Peter Anvin
@ 2005-11-22  9:03     ` Santi Bejar
  1 sibling, 0 replies; 11+ messages in thread
From: Santi Bejar @ 2005-11-22  9:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

>
> Probably something like this would suffice.
>

Ok, thanks. Now the only issue with my broken repository (it does not
have all the blobs) is that it outputs:

error: unable to find abcde....

for all the src paths, but the result is ok.

But I can live with it.

Thanks

