Following renames


* Following renames
@ 2006-03-26  1:49 Petr Baudis
  2006-03-26  2:49 ` Junio C Hamano
  2006-03-26  3:19 ` Linus Torvalds
  0 siblings, 2 replies; 41+ messages in thread
From: Petr Baudis @ 2006-03-26  1:49 UTC (permalink / raw)
  To: git

  Hi,

  so, now that I've put up with the fuzzy rename autodetection (for
now), I'd like to make cg-log auto-follow renames and I'm wondering
about the best implementation (it seems that I won't do without core
Git cooperation). I think it should be possible to implement in a way
so that it has minimal performance impact and therefore I can have it
turned on by default.

  Now I'm using the notorious

	git-rev-list listoffiles | git-diff-tree --stdin

pipeline in cg-log, and I'm wondering about the best way to add rename
detection there.

  In [1], Linus suggests a non-core solution. Unfortunately, it doesn't
fly - it requires at least two git-ls-tree calls per revision which
would bog things down awfully (to roughly half of the original speed).

  But even if git-rev-list reported disappearing files, Cogito would
have to do a lot of complicated bookkeeping in order to properly track
renames in parallel branches - for each 'head' commit at any point of
the history traversal, you need to record a separate set of interesting
files. It would also have to restart git-rev-list at any moment when a
rename happens on any of the head commits. Scales well not.
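A minimal sketch of the bookkeeping described above, assuming the history is already available in memory. The `follow_paths` function, the `history` layout and the commit ids are hypothetical and not part of Git or Cogito. It only shows why each commit needs its own path set once branches disagree about a file's name.

```python
def follow_paths(order, parents, tip, start_paths):
    """Propagate a per-commit set of interesting paths from the tip backwards.

    order       -- commit ids, children always listed before their parents
    parents     -- commit id -> {parent id: {new_path: old_path}}, the renames
                   seen when diffing that parent against the commit
    tip         -- commit the walk starts from
    start_paths -- paths the user asked about, as named in the tip
    """
    interesting = {tip: set(start_paths)}
    for commit in order:
        paths = interesting.get(commit)
        if not paths:
            continue  # no interesting paths were handed down to this commit
        for parent, renames in parents[commit].items():
            # Map each path to the name it had in this particular parent.
            older = {renames.get(path, path) for path in paths}
            interesting.setdefault(parent, set()).update(older)
    return interesting


# Hypothetical history: A is the root, B renames old.c to new.c,
# C leaves it alone, and D merges B and C (keeping the new name).
history = {
    "D": {"B": {}, "C": {"new.c": "old.c"}},
    "B": {"A": {"new.c": "old.c"}},
    "C": {"A": {}},
    "A": {},
}
sets = follow_paths(["D", "B", "C", "A"], history, "D", ["new.c"])
```

In this toy history the two sides of the merge end up with different path sets (the B side follows new.c, the C side follows old.c), which is the per-head state a Porcelain would have to carry.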

  An obvious solution would be to have git-diff-tree --follow which
updates its interesting path set based on seen renames, and now that
I've written about non-linear history, it's obvious that it's incorrect.
The other obvious way to go is then to add rename detection support to
git-rev-list, and it's less obvious that this is a dead end too - I
didn't inspect the code myself yet, but for now I trust Linus in [2]
(I didn't quite understand the argument, I guess I need to sleep on it).

  So, any thoughts about how to approach this? Either git-diff-tree
would have to be taught about the heads bookkeeping, or the git-rev-list
hurdles would have to be overcome, or we might have a
git-rev-rename-filter or something, but that feels quite redundant and
might meet with the same problems as git-rev-list.

  == References ==

  [1] Oct 21 <Pine.LNX.4.64.0510211814050.10477@g5.osdl.org>
  [2] Oct 22 <Pine.LNX.4.64.0510221251330.10477@g5.osdl.org>

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.


* Re: Following renames
  2006-03-26  1:49 Following renames Petr Baudis
@ 2006-03-26  2:49 ` Junio C Hamano
  2006-03-26  3:52   ` Jakub Narebski
                     ` (3 more replies)
  2006-03-26  3:19 ` Linus Torvalds
  1 sibling, 4 replies; 41+ messages in thread
From: Junio C Hamano @ 2006-03-26  2:49 UTC (permalink / raw)
  To: Petr Baudis; +Cc: git

Petr Baudis <pasky@ucw.cz> writes:

>   An obvious solution would be to have git-diff-tree --follow which
> updates its interesting path set based on seen renames, and now that
> I've written about non-linear history, it's obvious that it's incorrect.
> The other obvious way to go is then to add rename detection support to
> git-rev-list, and it's less obvious that this is a dead end too - I
> didn't inspect the code myself yet, but for now I trust Linus in [2]
> (I didn't quite understand the argument, I guess I need to sleep on it).

I'd have to sleep on how the core side can help Porcelains, but
I think it is a good thing that you, one of the most vocal
advocates on the list for doing rename recording, are thinking
about this issue and probably would look into rev-list.c soon.

Looking at the evolution of rev-list.c file itself was a good
exercise to realize that rename tracking (more specifically,
having whatchanged to follow renames) is not such a useful
thing (at least for me).

If I am interested in rev-list.c's evolution from "the set of
command line flags it supported" point of view, then whatchanged
to show the history of rev-list.c file itself would be a very
good way to show that to me.  rev-list_usage[] = "..." stayed
there almost from the beginning.

However, if I am interested in how the way it traverses the
commits has changed over time, I would need to start from
revision.c and switch to rev-list.c when that part of the code
was split out from it, because the current rev-list.c does not
have the main part of the traversal logic at all.

Another example.  Today's tar-tree updates have one interesting
function I think should belong to strbuf.c, and before merging
it to the mainline, I may move that function from tar-tree.c to
strbuf.c.  After that happens, if I run "whatchanged strbuf.c"
to see where that function came from, I would want it to notice
it came from tar-tree.c, although it is not a rename at all.
Just one function moved from a file to another.

What this suggests is that switching the set of paths to follow
while traversing ancestry chain needs to depend on which part of
the original file you are interested in.  Marking "this commit
renames (or copies) file A to file B" is not that useful -- for
that matter, detecting at runtime like we currently do is not
better either.  If a file A and file B were cleaned up and
merged into a single file C, which is in the tip of the tree,
which one you would want whatchanged to switch following depends
on which part of the C you were interested in.

Unless you are interested in the _entire_ contents of the file,
that is.  Then tracking or even recording renames becomes
useful, but that is a special case.

That is the reason I am not so enthused about recording renames.
I think the time is better spent on enhancing what pickaxe tries
to do (currently it does very little), which I hinted in a
separate message late last night.

But that does not have to stop you, and does not have to stop me
from thinking about ways to help you either.


* Re: Following renames
  2006-03-26  2:49 ` Junio C Hamano
@ 2006-03-26  3:52   ` Jakub Narebski
  2006-03-27  6:00     ` Paul Jakma
  2006-03-26 10:52   ` Petr Baudis
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 41+ messages in thread
From: Jakub Narebski @ 2006-03-26  3:52 UTC (permalink / raw)
  To: git

Junio C Hamano wrote:

> Petr Baudis <pasky@ucw.cz> writes:
> 
>>   An obvious solution would be to have git-diff-tree --follow which
>> updates its interesting path set based on seen renames, and now that
>> I've written about non-linear history, it's obvious that it's incorrect.
>> The other obvious way to go is then to add rename detection support to
>> git-rev-list, and it's less obvious that this is a dead end too - I
>> didn't inspect the code myself yet, but for now I trust Linus in [2]
>> (I didn't quite understand the argument, I guess I need to sleep on it).
> 
> I'd have to sleep on how the core side can help Porcelains, but
> I think it is a good thing that you, one of the most vocal
> advocates on the list for doing rename recording, are thinking
> about this issue and probably would look into rev-list.c soon.
> 
> Looking at the evolution of rev-list.c file itself was a good
> exercise to realize that rename tracking (more specifically,
> having whatchanged to follow renames) is not such a useful
> thing (at least for me).
[...]
> What this suggests is that switching the set of paths to follow
> while traversing ancestry chain needs to depend on which part of
> the original file you are interested in.  Marking "this commit
> renames (or copies) file A to file B" is not that useful -- for
> that matter, detecting at runtime like we currently do is not
> better either.  If a file A and file B were cleaned up and
> merged into a single file C, which is in the tip of the tree,
> which one you would want whatchanged to switch following depends
> on which part of the C you were interested in.
> 
> Unless you are interested in the _entire_ contents of the file,
> that is.  Then tracking or even recording renames becomes
> useful, but that is a special case.
> 
> That is the reason I am not so enthused about recording renames.
> I think the time is better spent on enhancing what pickaxe tries
> to do (currently it does very little), which I hinted in a
> separate message late last night.

I think one of the better ideas/suggestions about *recording* filenames was
in the "impure renames / history tracking" thread
 http://marc.theaimsgroup.com/?l=git&m=114122175216489&w=2
 <Pine.LNX.4.64.0603011343170.13612@sheen.jakma.org>
about adding *auxiliary* (helper) information about renames in commits. I'm
not sure about recording parts of the file that were moved or copied. That
might have been left for runtime detection in the likes of pickaxe.

As it would be helper-only information it would ensure backwards
compatibility (older versions would ignore additional information) and
forward compatibility (newer versions would fall back to current runtime
rename tracking/detection).

To be generic, I think that the command to record rename/copy or
copy'n'paste/cut'n'paste would take a set of source files (one or more,
unless we want to have an option to mark the file as new, suppressing any
superficial similarities, in which case it would be zero or more), and a set
of destination files (one or more, with files which were in the source repeated
if it was a copy, not repeated if it was a rename or cut'n'paste; unless we want
to record deletions also, in which case it would be zero or more files).
Such information can, I guess, be easily entered by the user... if one remembers
to record the rename/cut'n'paste/etc., that is. Perhaps if there were a way to
easily add such information later, for example by confirming detected
renames/relationships during merge... It would be much more difficult for
the user to enter the ranges unassisted.

What worries me is that such information, recorded in "own fields to the GIT
revision messages" (in commits), can be used only if you track the ancestry;
it doesn't help if you only have two or more revisions and do not build a
relationship graph between them. But maybe I worry unnecessarily...

BTW. following renames is important not only in examining file [contents]
history, in the likes of diff, whatchanged, annotate/blame, pickaxe but
also for merges.

References:
===========
* http://marc.theaimsgroup.com/?l=linux-kernel&m=111314792424707
* http://article.gmane.org/gmane.comp.version-control.git/217
* http://marc.theaimsgroup.com/?l=git&m=114123702826251
* http://marc.theaimsgroup.com/?l=git&m=114315795227271

-- 
Jakub Narebski
Warsaw, Poland


* Re: Following renames
  2006-03-26  3:52   ` Jakub Narebski
@ 2006-03-27  6:00     ` Paul Jakma
  0 siblings, 0 replies; 41+ messages in thread
From: Paul Jakma @ 2006-03-27  6:00 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

On Sun, 26 Mar 2006, Jakub Narebski wrote:

> I think one of the better ideas/suggestions about *recording* filenames was
> in the "impure renames / history tracking" thread
> http://marc.theaimsgroup.com/?l=git&m=114122175216489&w=2
> <Pine.LNX.4.64.0603011343170.13612@sheen.jakma.org>

For the record, the responses I received were educational ;). 
Sufficiently so I no longer think renames should be recorded. At 
least, definitely not as renames.

I now grok the reasoning for doing it by 'similarity' - it is indeed 
a *much* more useful concept. (E.g. the 'pickaxe' idea people keep 
alluding to sounds amazingly useful).

So the question really is what, if any, weaknesses does the current 
similarity estimation have, and how to solve them. I can think of two 
weaknesses:

1. the similarity algorithms can be expensive potentially, and they
    essentially get run a lot with the same inputs, to produce the
    same results - over and over as one works with a git repo. (there
    was a thread a while ago on this I think).

2. Some 'similarities' are just not deducible by current software
    state of the art. E.g. where some code is rewritten in another
    language:

 	foo.X -> foo.Y

    The high-level algorithms may remain the exact same, but the code
    may be unrecognisable as similar except to a human. However,
    tracking history back across this rewrite probably would still be
    valuable to the human.

So I think what /might/ be interesting is to have a 'similarity 
cache', which would help 1, and to allow for manual injection of such 
hints (into a separate and stronger cache most likely) - which would 
help 2.

Something to record the following information:

(tree1,tree2)[1]:
 	Id1 <-> Id1'
 	.
 	.
 	.
 	Idn <-> Idn'

That would allow:

1. Performance repercussions of similarity estimation to be one-time,
    cached there-after. (throw-away information, if a better
    similarity estimation heuristic comes along, you can rebuild this
    cache)

2. The user to inject their own 'hints' into similarity estimation,
    particularly for cases that just aren't obvious and probably never
    will be to software estimators (e.g. the rewrite cases), but where
    the user sees value in being able to follow back the history.

Avoids:

- encoding anything permanently into the repository (which was
   something I was thinking of, and others before me apparently, but
   which I now accept would be an awful idea ;) ).

[1] I'm not sure if it should be indexed by (commit ID) or
    (tree1,tree2) tuple. ??
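A minimal sketch of such a cache, assuming it is keyed by the (tree1,tree2) tuple and that user hints live in a separate table which always wins over computed results. The class name, method names and the `estimator` callback are hypothetical; nothing like this exists in Git.

```python
class SimilarityCache:
    """Cache of 'Id <-> Id'' pairings per (tree1, tree2), plus user hints."""

    def __init__(self, estimator):
        self.estimator = estimator  # callable(tree1, tree2) -> {old_id: new_id}
        self.computed = {}          # throw-away results, safe to rebuild
        self.hints = {}             # manually injected, stronger pairings

    def pairs(self, tree1, tree2):
        key = (tree1, tree2)
        if key not in self.computed:
            # The expensive estimation runs once per pair of trees.
            self.computed[key] = dict(self.estimator(tree1, tree2))
        merged = dict(self.computed[key])
        merged.update(self.hints.get(key, {}))  # hints override estimates
        return merged

    def add_hint(self, tree1, tree2, old_id, new_id):
        self.hints.setdefault((tree1, tree2), {})[old_id] = new_id

    def rebuild(self, estimator):
        """Swap in a better heuristic: drop estimates, keep the hints."""
        self.estimator = estimator
        self.computed.clear()
```

Keeping the hints apart from the computed entries is what lets the computed part stay throw-away information.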

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
Men take only their needs into consideration -- never their abilities.
 		-- Napoleon Bonaparte


* Re: Following renames
  2006-03-26  2:49 ` Junio C Hamano
  2006-03-26  3:52   ` Jakub Narebski
@ 2006-03-26 10:52   ` Petr Baudis
  2006-03-26 10:55     ` Petr Baudis
  2006-03-26 16:08   ` Timo Hirvonen
  2006-03-26 16:31   ` Jakub Narebski
  3 siblings, 1 reply; 41+ messages in thread
From: Petr Baudis @ 2006-03-26 10:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

(Note that I do *not* want to raise the explicit vs. implicit rename
tracking argument, in case anyone misunderstands. I've accepted
implicit rename tracking as a fact of Git life for now. I just want to
make use of it now. ;-)

Dear diary, on Sun, Mar 26, 2006 at 04:49:48AM CEST, I got a letter
where Junio C Hamano <junkio@cox.net> said that...
> Looking at the evolution of rev-list.c file itself was a good
> exercise to realize that rename tracking (more specifically,
> having whatchanged to follow renames) is not such a useful
> thing (at least for me).

Well, no one argues that rename tracking cures all the woes of hackerkind
and anything more precise than that is useless. I'm rather saying that
rename tracking indeed _is_ a special case of something more general and
truly very interesting, but a special case so frequent that it's worth
doing even if we can't do the general case yet. Or at least people
*think* it's very frequent and it gives them the warm fuzzy feeling
knowing that the tool can handle it (at least somehow) - and the warm
fuzzy feeling is important, especially if you're trusting your sources
to the tool.

So, obviously, you'll find plenty of counter-examples where rename
detection won't help. I don't argue that. I merely say that there will
still be enough cases where following renames will help to warrant
doing it.

Now, Git history has enough examples of where rename following would be
useful. When I'm digging into the history, I'm hitting the big tools
rename barrier all the time, and just yesterday when wondering about
jdl's <snap> removal from git.txt I've hit 2cf565c53 - coming along any
file to that commit should make me follow Documentation/core-git.txt out
of the commit (well, that's rather copy than rename detection).

> Another example.  Today's tar-tree updates have one interesting
> function I think should belong to strbuf.c, and before merging
> it to the mainline, I may move that function from tar-tree.c to
> strbuf.c.  After that happens, if I run "whatchanged strbuf.c"
> to see where that function came from, I would want it to notice
> it came from tar-tree.c, although it is not a rename at all.
> Just one function moved from a file to another.

A wild pickaxe - when the string disappears from file X, scan all the
changes in the commit and start following files where it reappears. This
should help, right?
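A minimal sketch of one step of that idea, assuming history is walked from new to old and that each commit is available as a map from path to its text before and after the commit. The function name and the data layout are hypothetical; this is not how diffcore's pickaxe is implemented.

```python
def wild_pickaxe_step(followed, needle, changes):
    """Return the set of paths to follow on the parent side of one commit.

    followed -- paths currently being followed
    needle   -- the string the user cares about
    changes  -- path -> (before, after) text for files touched by the commit
    """
    result = set(followed)
    for path in followed:
        if path not in changes:
            continue  # this commit did not touch the file
        before, after = changes[path]
        if needle in after and needle not in before:
            # Walking backwards, the string vanishes from this file here.
            # Follow every other touched file that had it before the commit.
            result |= {
                other
                for other, (old_text, _new_text) in changes.items()
                if other != path and needle in old_text
            }
    return result
```

The original path stays in the set, so this is the greedy variant: it may follow more files than strictly needed.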

But when you want to implement this, you hit the exact same problems as
when you try to follow renames, only a different part of diffcore
detects it. So, what I'm trying to solve is actually not just following
renames but a more general problem.

> If a file A and file B were cleaned up and merged into a single file
> C, which is in the tip of the tree, which one you would want
> whatchanged to switch following depends on which part of the C you
> were interested in.

If in doubt (and the user does not use pickaxe to clarify it), you can
just follow both. The user will get some extra stuff (or maybe even not
if he wants to know about pieces from both), but we are at least trying
to be useful and DTRT instead of doing nothing in case we would by any
chance not do the very best.

> Unless you are interested in the _entire_ contents of the file,
> that is.  Then tracking or even recording renames becomes
> useful, but that is a special case.

A frequent (and wanted) special case.

> That is the reason I am not so enthused about recording renames.
> I think the time is better spent on enhancing what pickaxe tries
> to do (currently it does very little), which I hinted in a
> separate message late last night.

Sure, pickaxe is cool, but as I said above, if you try to teach _it_
following around files, you'll hit the exact same problems as me. We're
just trying to build something using lego blocks with different stuff
inside but otherwise actually looking pretty much the same.

The thing with pickaxe is that frequently it would be simply more
laborious to dig for and construct the proper pickaxe string than just
firing up cg-log -s filename with greedy rename following and quickly
scanning through the results.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.


* Re: Following renames
  2006-03-26 10:52   ` Petr Baudis
@ 2006-03-26 10:55     ` Petr Baudis
  0 siblings, 0 replies; 41+ messages in thread
From: Petr Baudis @ 2006-03-26 10:55 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Dear diary, on Sun, Mar 26, 2006 at 12:52:48PM CEST, I got a letter
where Petr Baudis <pasky@suse.cz> said that...
> Well, no one argues that rename tracking cures all the woes of hackerkind
                                                                 ^^^^^^^^^^

Or is it hackerdom?

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.


* Re: Following renames
  2006-03-26  2:49 ` Junio C Hamano
  2006-03-26  3:52   ` Jakub Narebski
  2006-03-26 10:52   ` Petr Baudis
@ 2006-03-26 16:08   ` Timo Hirvonen
  2006-03-26 16:43     ` Linus Torvalds
  2006-03-26 16:31   ` Jakub Narebski
  3 siblings, 1 reply; 41+ messages in thread
From: Timo Hirvonen @ 2006-03-26 16:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: pasky, git

On Sat, 25 Mar 2006 18:49:48 -0800
Junio C Hamano <junkio@cox.net> wrote:

> Looking at the evolution of rev-list.c file itself was a good
> exercise to realize that rename tracking (more specifically,
> having whatchanged to follow renames) is not such a useful
> thing (at least for me).

It would be useful for me.  I had all files organized in subdirectories,
but then noticed it was not a good idea because make does not play nicely
with subdirs, so I moved all files to top level directory.

Now

    git-whatchanged -p file.c

stops at the big rename. To continue I have to do

    git-whatchanged -p <some-commit> -- <old-filename>

> Another example.  Today's tar-tree updates have one interesting
> function I think should belong to strbuf.c, and before merging
> it to the mainline, I may move that function from tar-tree.c to
> strbuf.c.  After that happens, if I run "whatchanged strbuf.c"
> to see where that function came from, I would want it to notice
> it came from tar-tree.c, although it is not a rename at all.
> Just one function moved from a file to another.

Yes in this case you can do

$ git-whatchanged strbuf.c
$ git-whatchanged tar-tree.c

but after rename...

$ git-whatchanged old-file.c
fatal: 'old-file.c': No such file or directory

$ touch old-file.c
$ git-whatchanged old-file.c

Hah, it worked!


Hmm... this works too without the touch-hack:

$ git-whatchanged file.c old-file.c

I wish I had known this before.

> What this suggests is that switching the set of paths to follow
> while traversing ancestry chain needs to depend on which part of
> the original file you are interested in.  Marking "this commit
> renames (or copies) file A to file B" is not that useful -- for
> that matter, detecting at runtime like we currently do is not
> better either.  If a file A and file B were cleaned up and
> merged into a single file C, which is in the tip of the tree,
> which one you would want whatchanged to switch following depends
> on which part of the C you were interested in.

OK, maybe following renames is not such a good idea.  But for GUIs
(gitk, qgit) following renames or even file merges (select a file to
follow by clicking it) would be a big plus.

-- 
http://onion.dynserv.net/~timo/


* Re: Following renames
  2006-03-26 16:08   ` Timo Hirvonen
@ 2006-03-26 16:43     ` Linus Torvalds
  0 siblings, 0 replies; 41+ messages in thread
From: Linus Torvalds @ 2006-03-26 16:43 UTC (permalink / raw)
  To: Timo Hirvonen; +Cc: Junio C Hamano, pasky, git



On Sun, 26 Mar 2006, Timo Hirvonen wrote:
>
> $ git-whatchanged old-file.c
> fatal: 'old-file.c': No such file or directory
> 
> $ touch old-file.c
> $ git-whatchanged old-file.c
> 
> Hah, it worked!

It worked even before:

	git-whatchanged -- old-file.c

always works.

If you think of the "filename spec" as _always_ having to have a "--" to 
separate the filenames from the other arguments, you're thinking the right 
way. Then, there's a _shorthand_ for existing files, where we allow users 
being lazy (because _I_ am very lazy indeed), which allows dropping of the 
"--", but then the code requires that the filenames are real filenames as 
of now.
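The rule can be sketched roughly as follows, assuming a caller-supplied `is_revision` predicate. The `split_args` function and its parameters are hypothetical names for illustration and this is not the rev-parse code itself. Everything after "--" is taken as a path without question; without "--", a path must exist right now.

```python
import os


def split_args(args, is_revision, exists=os.path.lexists):
    """Split command-line arguments into (revisions, paths)."""
    revs, paths = [], []
    explicit = False  # True once "--" has been seen
    implicit = False  # True once a non-revision argument started the paths
    for arg in args:
        if explicit:
            paths.append(arg)  # after "--": a filename, existing or not
        elif arg == "--":
            explicit = True
        elif not implicit and is_revision(arg):
            revs.append(arg)
        else:
            # Lazy shorthand without "--": only real, current files allowed.
            implicit = True
            if not exists(arg):
                raise ValueError(f"'{arg}': No such file or directory")
            paths.append(arg)
    return revs, paths
```

The `exists` parameter defaults to `os.path.lexists`, which, like lstat, does not follow symlinks; it is injectable only so the sketch can be tried without touching the filesystem.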

> Hmm... this works too without the touch-hack:
> 
> $ git-whatchanged file.c old-file.c
> 
> I wish I had known this before.

Actually, it -shouldn't- work. It's just that "git-rev-parse" isn't as 
anal as it should be.

Here's a fix.

		Linus
----
diff --git a/rev-parse.c b/rev-parse.c
index f90e999..104b1e2 100644
--- a/rev-parse.c
+++ b/rev-parse.c
@@ -172,7 +172,9 @@ int main(int argc, char **argv)
 		char *dotdot;
 	
 		if (as_is) {
-			show_file(arg);
+			if (show_file(arg) && as_is < 2)
+				if (lstat(arg, &st) < 0)
+					die("'%s': %s", arg, strerror(errno));
 			continue;
 		}
 		if (!strcmp(arg,"-n")) {
@@ -192,7 +194,7 @@ int main(int argc, char **argv)
 
 		if (*arg == '-') {
 			if (!strcmp(arg, "--")) {
-				as_is = 1;
+				as_is = 2;
 				/* Pass on the "--" if we show anything but files.. */
 				if (filter & (DO_FLAGS | DO_REVS))
 					show_file(arg);


* Re: Following renames
  2006-03-26  2:49 ` Junio C Hamano
                     ` (2 preceding siblings ...)
  2006-03-26 16:08   ` Timo Hirvonen
@ 2006-03-26 16:31   ` Jakub Narebski
  2006-03-26 16:46     ` Linus Torvalds
  3 siblings, 1 reply; 41+ messages in thread
From: Jakub Narebski @ 2006-03-26 16:31 UTC (permalink / raw)
  To: git

I wonder what is the most common case in Linux kernel or git.

1.) renaming the file in the same directory, old-file.c to new-file.c?
2.) moving file to other directory (project reorganization), 
    old-dir/file.c to new-dir/file.c?
3.) splitting file into modules, huge-file.c to file1.c, file2.c?
4.) copying fragment of one file to other?
5.) moving fragment of code from one file to other?

-- 
Jakub Narebski
Warsaw, Poland


* Re: Following renames
  2006-03-26 16:31   ` Jakub Narebski
@ 2006-03-26 16:46     ` Linus Torvalds
  2006-03-26 17:10       ` Jakub Narebski
  0 siblings, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2006-03-26 16:46 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git



On Sun, 26 Mar 2006, Jakub Narebski wrote:
>
> I wonder what is the most common case in Linux kernel or git.
> 
> 1.) renaming the file in the same directory, old-file.c to new-file.c?

The kernel uses subdirectories extensively, and a lot of renames (most of 
them, I'd say) are because of that subdirectory structure. 

So the same-directory case is the unusual one, I'd say.

> 3.) splitting file into modules, huge-file.c to file1.c, file2.c?
> 4.) copying fragment of one file to other?
> 5.) moving fragment of code from one file to other?

I'd say that (5) is very common. And (4) happens a lot under certain 
circumstances (new driver, new architecture, new filesystem..).

Doing (3) happens, but probably less often than it should ;/

		Linus


* Re: Following renames
  2006-03-26 16:46     ` Linus Torvalds
@ 2006-03-26 17:10       ` Jakub Narebski
  2006-03-26 18:10         ` Linus Torvalds
  0 siblings, 1 reply; 41+ messages in thread
From: Jakub Narebski @ 2006-03-26 17:10 UTC (permalink / raw)
  To: git

Linus Torvalds wrote:

> On Sun, 26 Mar 2006, Jakub Narebski wrote:
>>
>> I wonder what is the most common case in Linux kernel or git.
>> 
>> 1.) renaming the file in the same directory, old-file.c to new-file.c?
>> 2.) moving file to other directory (project reorganization), 
>>     old-dir/file.c to new-dir/file.c?
> The kernel uses subdirectories extensively, and a lot of renames (most of
> them, I'd say) are because of that subdirectory structure.
> 
> So the same-directory case is the unusual one, I'd say.

If (2) is common enough then discussed improvements to rename detection, 
namely comparing basenames as a base for candidate selection is a good idea.
I wonder how common is (2) compared to (1)+(2) i.e. move to other dir 
and rename, old-dir/old-file.c to new-dir/new-subdir/new-file.c

>> 3.) splitting file into modules, huge-file.c to file1.c, file2.c?
>> 4.) copying fragment of one file to other?
>> 5.) moving fragment of code from one file to other?
> 
> I'd say that (5) is very common. And (4) happens a lot under certain
> circumstances (new driver, new architecture, new filesystem..).
> 
> Doing (3) happens, but probably less often than it should ;/

Detecting (4) and (5) fast (i.e. for merges) without auxiliary (helper) 
information would probably be hard. For interrogation/porcelainish commands
(like pickaxe) it would probably be easier.

-- 
Jakub Narebski
Warsaw, Poland


* Re: Following renames
  2006-03-26 17:10       ` Jakub Narebski
@ 2006-03-26 18:10         ` Linus Torvalds
  2006-03-26 19:22           ` Marco Costalba
  2006-03-27  6:55           ` Jakub Narebski
  0 siblings, 2 replies; 41+ messages in thread
From: Linus Torvalds @ 2006-03-26 18:10 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

On Sun, 26 Mar 2006, Jakub Narebski wrote:
> 
> If (2) is common enough then discussed improvements to rename detection, 
> namely comparing basenames as a base for candidate selection is a good idea.

BK had this "renametool" which got started automatically when you applied 
a patch that removed one or more files and added one or more files, so 
that you could then pair up the files manually.

It left the rename pairing 100% to the user, but it helped a bit by 
guessing what the pairing might be, and yes, it used the basenames to set 
up that initial guess.

It worked in many cases, but it also failed in many cases. I do think it 
was a useful heuristic within the BK model (since the _real_ choice was 
left to the user), but I don't think it's very useful for git.

The thing is, the fast rename detection that is in the "next" branch 
really does a lot better, and it's fast enough.

(If you wanted to make it even faster, but less precise, you could limit 
it to the first few kilobytes of file contents - still a _lot_ better 
heuristic than the actual filename, and it would make the worst-case 
behaviour much better).
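A rough sketch of the "first few kilobytes" idea, using Python's difflib as a stand-in similarity measure. This is not the algorithm in the "next" branch; the function names, the 4096-character limit and the 0.5 threshold are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def prefix_similarity(a, b, limit=4096):
    """Similarity in [0, 1] computed on at most `limit` leading characters."""
    return SequenceMatcher(None, a[:limit], b[:limit], autojunk=False).ratio()


def best_rename_source(new_content, removed, limit=4096, threshold=0.5):
    """Pick the removed file whose prefix best matches the new file.

    removed -- path -> content of files that disappeared in the same change.
    Returns the best path, or None if nothing reaches the threshold.
    """
    best_path, best_score = None, threshold
    for path, content in removed.items():
        score = prefix_similarity(new_content, content, limit)
        if score >= best_score:
            best_path, best_score = path, score
    return best_path
```

Capping the input bounds the cost of each comparison no matter how large the files are, at the price of ignoring anything past the cut-off.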

> I wonder how common is (2) compared to (1)+(2) i.e. move to other dir 
> and rename, old-dir/old-file.c to new-dir/new-subdir/new-file.c

I don't have any numbers, but from using renametool for a few years, my 
gut feel/recollection is that about half of renames in the kernel were 
moving to a new directory, and about half changed names (often in 
_addition_ to moving). But I didn't much think about it, so that's just a 
very rough guess based on using a tool that helped you do these things 
manually.

For example, one common case was a directory structure like

	..
	type-file1.c
	type-file2.c
	otherfiles.c
	yet-more.c
	..

being split up into a subdirectory

	..
	type/file1.c
	type/file2.c
	otherfiles.c
	yet-more.c
	..

(eg drivers/scsi/aic7xx-* being given a subdirectory of its own, as 
drivers/scsi/aic7xx/*). So the basename wouldn't stay the same, because it 
contained some piece of data that became redundant with the move.

> >> 3.) splitting file into modules, huge-file.c to file1.c, file2.c?
> >> 4.) copying fragment of one file to other?
> >> 5.) moving fragment of code from one file to other?
> > 
> > I'd say that (5) is very common. And (4) happens a lot under certain
> > circumstances (new driver, new architecture, new filesystem..).
> > 
> > Doing (3) happens, but probably less often than it should ;/
> 
> Detecting (4) and (5) fast (i.e. for merges) without auxiliary (helper) 
> information would probably be hard. For interrogation/porcelainish commands
> (like pickaxe) it would probably be easier.

Yes. I don't think we necessarily want to merge automatically across 
things like that, even if it sounds like something you'd want in a perfect 
world. Stupid and obvious (and fails) is often better than smart and 
complex (and succeeds), because at least you _understand_ what happens. 

But _following_ a particular change back is important, and should be both 
efficient and simple to do. Ie the example tool I talked about in

        http://article.gmane.org/gmane.comp.version-control.git/217

is still relevant and important, I think.

I literally think that people wouldn't even _want_ a "git annotate", if 
they instead had more of a visual tool that showed the current state of 
the file, and you could click on a line/set of lines to follow it back to 
the previous change to that area. I'd argue that almost always when you 
want "annotate", you already have the particular place that you want to 
look at in mind (you're really not interested in the whole file).

So wouldn't it be _much_ nicer to have a "graphical git-whatchanged", 
where you just delve deeper (and you don't even look at the whole file 
like git-whatchanged does, but you ask for a very particular region).

Ie, what I imagine would be something gitk/qgit like, where you see the 
file content, select a line or two (or a whole function), and it goes back 
in history and shows you the last diff that changed that 
line/two/function. We can do that EFFICIENTLY. Much more efficiently than 
git-annotate, in fact. And then when you see the diff, you might say "I'm 
not interested in this one, that was just a re-indent" and then continue 
back. 

THAT is the kind of graphical tool I'd want. And dammit, it should even be 
_easy_. I'm just a total klutz myself when it comes to doing things like
Qt or nice tcl/tk text-panes, and this really does have to be visual, 
since the whole point is that "select text" and interactive part.

So if somebody wants to be a hero, and feels comfortable with those kinds 
of things, this really should be a fairly straightforward thing to do (it 
would be useful even without rename detection or data movement detection, 
but it's also something where you really _could_ do efficient data 
movement detection by just looking at the "whole diff" when something 
changed in that small area).

		Linus


* Re: Following renames
  2006-03-26 18:10         ` Linus Torvalds
@ 2006-03-26 19:22           ` Marco Costalba
  2006-03-26 22:23             ` Linus Torvalds
  2006-03-27  6:55           ` Jakub Narebski
  1 sibling, 1 reply; 41+ messages in thread
From: Marco Costalba @ 2006-03-26 19:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jakub Narebski, git

On 3/26/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> So wouldn't it be _much_ nicer to have a "graphical git-whatchanged",
> where you just delve deeper (and you don't even look at the whole file
> like git-whatchanged does, but you ask for a very particular region).
>
> Ie, what I imagine would be something gitk/qgit like, where you see the
> file content, select a line or two (or a whole function), and it goes back
> in history and shows you the last diff that changed that
> line/two/function. We can do that EFFICIENTLY. Much more efficiently than
> git-annotate, in fact. And then when you see the diff, you might say "I'm
> not interested in this one, that was just a re-indent" and then continue
> back.
>
> THAT is the kind of graphical tool I'd want. And dammit, it should even be
> _easy_. I'm just a total klutz myself when it comes to doing things like
> Qt or nice tcl/tk text-panes, and this really does have to be visual,
> since the whole point is that "select text" and interactive part.
>
> So if somebody wants to be a hero, and feels comfortable with those kinds
> of things, this really should be a fairly straightforward thing to do (it
> would be useful even without rename detection or data movement detection,
> but it's also something where you really _could_ do efficient data
> movement detection by just looking at the "whole diff" when something
> changed in that small area).
>

I am a thousand miles away from being a hero (and glad of it), but....

I really need a bit of feedback or comment about this because IMHO
qgit annotate is *almost* what you are asking for, so I
need to understand the difference well:

FIRST WAY

After annotating a file history (double click on a file name in
bottom-right window or directly from tree view), you see the whole
file annotated. If you have the diff window open you see also the
corresponding patch (scrolled to selected file name).

Now, double clicking on the chosen code line in the file content
currently does two things:

- The diff window is updated to show the corresponding revision patch,
i.e. the last patch that modified that line of code.

- The file content, as well as the file annotation, changes to show the
content of the file just after the patch was applied. From there it is
possible to go back in the history of that code region in the
same way, i.e. by double clicking on interesting lines.

The biggest limitation of 'annotation browsing' is that patches that
only remove code are not annotated, and you need to check them directly
in the diff window.

SECOND WAY

Without opening the file viewer it is possible to select a file (or
more than one, or a directory) from the tree view and press the magic wand
button. This causes the main view to be updated with the output of
git-rev-list -- <selected paths>, i.e. a filtered view.

With the diff viewer window open you can browse through the patch history
related to the chosen file.

The biggest limitation is that all the revisions that touch the file are
shown, not only the ones limited to a selected region.

IF I HAVE UNDERSTOOD...

If I have understood, what you would like to see is something like the following:

- From the diff/file viewer window, select a code region.

- Press the magic wand button and feed git-rev-list with <selected path>
_and_ <selected content>.

- Show the git-rev-list output in the main window as usual, but now the
revisions are filtered not only by path but also by the region of
code touched.


Am I guessing correctly?

Marco


* Re: Following renames
  2006-03-26 19:22           ` Marco Costalba
@ 2006-03-26 22:23             ` Linus Torvalds
  2006-03-27  5:47               ` Marco Costalba
  0 siblings, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2006-03-26 22:23 UTC (permalink / raw)
  To: Marco Costalba; +Cc: Jakub Narebski, git

On Sun, 26 Mar 2006, Marco Costalba wrote:
> 
> FIRST WAY
> 
> After annotating a file history (double click on a file name in
> bottom-right window or directly from tree view), you see the whole
> file annotated. If you have the diff window open you see also the
> corresponding patch (scrolled to selected file name).

The problem is that this step is already _way_ too expensive.

I don't want to work with any tool that makes "Step 1" take a minute or 
two for a project that has a few years of history. Try it on the linux 
historic project with some file that gets lots of modifications.

In other words, starting off with "annotate" is MUCH too expensive. You 
should start off basically with "git-whatchanged".

		Linus


* Re: Following renames
  2006-03-26 22:23             ` Linus Torvalds
@ 2006-03-27  5:47               ` Marco Costalba
  2006-03-27  6:46                 ` Junio C Hamano
  2006-03-27  8:07                 ` Linus Torvalds
  0 siblings, 2 replies; 41+ messages in thread
From: Marco Costalba @ 2006-03-27  5:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jakub Narebski, git

On 3/27/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Sun, 26 Mar 2006, Marco Costalba wrote:
> >
> > FIRST WAY
> >
> > After annotating a file history (double click on a file name in
> > bottom-right window or directly from tree view), you see the whole
> > file annotated. If you have the diff window open you see also the
> > corresponding patch (scrolled to selected file name).
>
> The problem is that this step is already _way_ too expensive.
>
> I don't want to work with any tool that makes "Step 1" take a minute or
> two for a project that has a few years of history. Try it on the linux
> historic project with some file that gets lots of modifications.
>

Historic Linux test (63428 revisions)

File: drivers/net/tg3.c
Revisions that modify tg3.c : 292

With qgit
15s to retrieve file history (git-rev-list)
19.5s to annotate (git-diff-tree -p, current GNU algorithm, not new faster one)

and...

$ time git-whatchanged HEAD drivers/net/tg3.c > /dev/null
98.01user 2.44system 1:46.19elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (797major+43033minor)pagefaults 0swaps

NOTE: It seems that git-whatchanged requires the file to be checked out
in order to work. It didn't work with nothing checked out.


Marco


* Re: Following renames
  2006-03-27  5:47               ` Marco Costalba
@ 2006-03-27  6:46                 ` Junio C Hamano
  2006-03-27  8:07                 ` Linus Torvalds
  1 sibling, 0 replies; 41+ messages in thread
From: Junio C Hamano @ 2006-03-27  6:46 UTC (permalink / raw)
  To: Marco Costalba; +Cc: git

"Marco Costalba" <mcostalba@gmail.com> writes:

> NOTE: It seems that git-whatchanged requires the file to be checked out
> in order to work. It didn't work with nothing checked out.

Perhaps,

	$ git-whatchanged HEAD -- drivers/net/tg3.c

as Linus explained in a separate message today...


* Re: Following renames
  2006-03-27  5:47               ` Marco Costalba
  2006-03-27  6:46                 ` Junio C Hamano
@ 2006-03-27  8:07                 ` Linus Torvalds
  2006-03-27 11:19                   ` Marco Costalba
  2006-03-27 11:55                   ` Marco Costalba
  1 sibling, 2 replies; 41+ messages in thread
From: Linus Torvalds @ 2006-03-27  8:07 UTC (permalink / raw)
  To: Marco Costalba; +Cc: Jakub Narebski, git



On Mon, 27 Mar 2006, Marco Costalba wrote:
> 
> Historic Linux test (63428 revisions)
> 
> File: drivers/net/tg3.c
> Revisions that modify tg3.c : 292
> 
> With qgit
> 15s to retrieve file history (git-rev-list)
> 19.5s to annotate (git-diff-tree -p, current GNU algorithm, not new faster one)

.. and it does absolutely _nothing_ while it's doing that, does it?

> $ time git-whatchanged HEAD drivers/net/tg3.c > /dev/null
> 98.01user 2.44system 1:46.19elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (797major+43033minor)pagefaults 0swaps

In contrast, git-whatchanged will start outputting the recent changes 
immediately.

And that's the point. Almost always, we're interested in the _recent_ 
stuff. The fact that it takes longer to get the old history  is not very 
important. You generally don't ask "what changed in this file" for a file 
that hasn't changed in five years.

		Linus


* Re: Following renames
  2006-03-27  8:07                 ` Linus Torvalds
@ 2006-03-27 11:19                   ` Marco Costalba
  2006-03-27 11:30                     ` Johannes Schindelin
  2006-03-27 16:52                     ` Linus Torvalds
  2006-03-27 11:55                   ` Marco Costalba
  1 sibling, 2 replies; 41+ messages in thread
From: Marco Costalba @ 2006-03-27 11:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jakub Narebski, git

On 3/27/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Mon, 27 Mar 2006, Marco Costalba wrote:
> >
> > Historic Linux test (63428 revisions)
> >
> > File: drivers/net/tg3.c
> > Revisions that modify tg3.c : 292
> >
> > With qgit
> > 15s to retrieve file history (git-rev-list)
> > 19.5s to annotate (git-diff-tree -p, current GNU algorithm, not new faster one)
>
> .. and it does absolutely _nothing_ while it's doing that, does it?
>

yes, it's true.

> > $ time git-whatchanged HEAD drivers/net/tg3.c > /dev/null
> > 98.01user 2.44system 1:46.19elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
> > 0inputs+0outputs (797major+43033minor)pagefaults 0swaps
>
> In contrast, git-whatchanged will start outputting the recent changes
> immediately.
>
> And that's the point. Almost always, we're interested in the _recent_
> stuff. The fact that it takes longer to get the old history  is not very
> important. You generally don't ask "what changed in this file" for a file
> that hasn't changed in five years.
>

We could run git-rev-list with a time range specifier (changes from the
last year, for example) by default, so as to get fast results, and walk
the full history _only_ on request.

This could perhaps solve the problem of fast output for recent revs, if
this is the problem.

I still think the problem with annotation is that you don't see
patches that _remove_ lines of code, you need the whole diff for this.

Marco


* Re: Following renames
  2006-03-27 11:19                   ` Marco Costalba
@ 2006-03-27 11:30                     ` Johannes Schindelin
  2006-03-27 16:52                     ` Linus Torvalds
  1 sibling, 0 replies; 41+ messages in thread
From: Johannes Schindelin @ 2006-03-27 11:30 UTC (permalink / raw)
  To: Marco Costalba; +Cc: git

Hi,

On Mon, 27 Mar 2006, Marco Costalba wrote:

> I still think the problem with annotation is that you don't see
> patches that _remove_ lines of code, you need the whole diff for this.

Interesting. You'd need a "git-emalb" (blame, but reverse), and you'd need 
to tell it a range "rev1..rev2" which is *not* to be interpreted as "^rev1 
rev2" but as a direct path from rev1 to rev2.

Ciao,
Dscho


* Re: Following renames
  2006-03-27 11:19                   ` Marco Costalba
  2006-03-27 11:30                     ` Johannes Schindelin
@ 2006-03-27 16:52                     ` Linus Torvalds
  1 sibling, 0 replies; 41+ messages in thread
From: Linus Torvalds @ 2006-03-27 16:52 UTC (permalink / raw)
  To: Marco Costalba; +Cc: Jakub Narebski, git

On Mon, 27 Mar 2006, Marco Costalba wrote:
> >
> > And that's the point. Almost always, we're interested in the _recent_
> > stuff. The fact that it takes longer to get the old history  is not very
> > important. You generally don't ask "what changed in this file" for a file
> > that hasn't changed in five years.
> 
> We could run git-rev-list with a time range specifier (changes from the
> last year, for example) by default, so as to get fast results, and walk
> the full history _only_ on request.

Yes.

However, what I've been meaning to do (but just haven't had the time and 
energy for so far) is to fix "git-rev-list" with a path limiter.

Right now that always causes things to be totally serialized, and the 
revision walking will first look up _all_ the history (well, it will prune 
out the merges) before starting to output stuff.

So right now in order for "git-whatchanged" to be fast and incremental, it 
doesn't do any path limiting with git-rev-list at ALL, and does it all in 
git-diff-tree. Which is horrid.

> I still think the problem with annotation is that you don't see
> patches that _remove_ lines of code, you need the whole diff for this.

Well, that's just another reason "annotate" sucks.

If you select a range of lines, my suggested tool _would_ show you lines 
that got removed there, and git-whatchanged does it quite well.

I really think "annotate" is _fundamentally_ a broken operation. It's not 
what any sane developer actually wants, and it has serious limitations (ie 
it depends on whole history, and it cannot show removals well).

		Linus


* Re: Following renames
  2006-03-27  8:07                 ` Linus Torvalds
  2006-03-27 11:19                   ` Marco Costalba
@ 2006-03-27 11:55                   ` Marco Costalba
  2006-03-27 12:27                     ` Andreas Ericsson
  1 sibling, 1 reply; 41+ messages in thread
From: Marco Costalba @ 2006-03-27 11:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jakub Narebski, git

On 3/27/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
> In contrast, git-whatchanged will start outputting the recent changes
> immediately.
>

To integrate git-whatchanged-like functionality with a filter on a
specific code region, Linus's original request, I am wondering about
something like this:

A new git-diff-tree option --range=a..b to delimit a region,
identified by code line boundaries.

As an example,

	git-diff-tree --range=10..15 HEAD -- <path>

could give one of these answers, added to the standard git-diff-tree output:

* 10..25 --> region modified; the new region boundaries are lines 10 to 25

  15..20 --> region _NOT_ modified, but the new region boundaries are
lines 15 to 20 (perhaps the patch added 5 lines _before_ the region)

  10..15 --> region _NOT_ modified; lines, if any, were added/removed
_after_ the region

* 10..15 --> region modified, with the same boundaries (for example,
trailing whitespace removed)

With this new git-diff-tree option it becomes very simple to retrieve
a file history limited to a region, because the region boundaries
output for one rev are fed as input for the parent rev.
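
The boundary bookkeeping described above can be sketched roughly as follows.
This is only an illustration under stated assumptions, not an existing
git-diff-tree feature: the function name map_region and the
(new_start, new_len, old_len) description of a change are hypothetical, and
the mapping goes from the newer file to the older one, since the walk
proceeds from a rev to its parent.

```python
def map_region(changes, start, end):
    """Map lines start..end (1-based, inclusive) of the newer file to the
    older file.

    `changes` is a list of (new_start, new_len, old_len) tuples: new_len
    lines beginning at new_start in the newer file replace old_len lines
    of the older file. new_len == 0 is a pure deletion located just
    before line new_start; old_len == 0 is a pure insertion.

    Returns (old_start, old_end, modified). old_end < old_start means the
    region did not exist in the older file.
    """
    offset = 0          # old line number minus new line number so far
    old_start = old_end = None
    modified = False
    for new_start, new_len, old_len in sorted(changes):
        new_stop = new_start + new_len      # first line after the change
        if new_stop <= start:               # change lies above the region
            offset += old_len - new_len
            continue
        if new_start > end:                 # change lies below the region
            break
        modified = True
        change_old_start = new_start + offset
        if old_start is None:
            old_start = min(start, new_start) + offset
        offset += old_len - new_len
        if end >= new_stop:                 # region continues past the change
            old_end = end + offset
        else:                               # region ends inside the change
            old_end = change_old_start + old_len - 1
    if old_start is None:                   # nothing touched the region
        old_start, old_end = start + offset, end + offset
    return old_start, old_end, modified


# Lines 12..13 of the newer file replaced 12 older lines inside region 10..15.
print(map_region([(12, 2, 12)], 10, 15))    # (10, 25, True)
```

Feeding the returned boundaries back in as start and end for the next diff
gives the per-region history walk; a front end would stop or ask the user
whenever modified is true.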

Comments?

Marco


* Re: Following renames
  2006-03-27 11:55                   ` Marco Costalba
@ 2006-03-27 12:27                     ` Andreas Ericsson
  0 siblings, 0 replies; 41+ messages in thread
From: Andreas Ericsson @ 2006-03-27 12:27 UTC (permalink / raw)
  To: Marco Costalba; +Cc: Linus Torvalds, Jakub Narebski, git

Marco Costalba wrote:
> On 3/27/06, Linus Torvalds <torvalds@osdl.org> wrote:
> 
>>In contrast, git-whatchanged will start outputting the recent changes
>>immediately.
>>
> 
> 
> To integrate git-whatchanged-like functionality with a filter on a
> specific code region, Linus's original request, I am wondering about
> something like this:
> 
> A new git-diff-tree option --range=a..b to delimit a region,
> identified by code line boundaries.
> 

Make it --line-range if you implement this. On a first glance I thought 
you meant --commit-range.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: Following renames
  2006-03-26 18:10         ` Linus Torvalds
  2006-03-26 19:22           ` Marco Costalba
@ 2006-03-27  6:55           ` Jakub Narebski
  2006-03-27  7:40             ` David Lang
  1 sibling, 1 reply; 41+ messages in thread
From: Jakub Narebski @ 2006-03-27  6:55 UTC (permalink / raw)
  To: git

Linus Torvalds wrote:

> On Sun, 26 Mar 2006, Jakub Narebski wrote:
>> 
>> If (2) is common enough then discussed improvements to rename detection,
>> namely comparing basenames as a base for candidate selection is a good
>> idea.
> 
> BK had this "renametool" which got started automatically when you applied
> a patch that removed one or more files and added one or more files, so
> that you could then pair up the files manually.
[...]
> The thing is, the fast rename detection that is in the "next" branch
> really does a lot better, and it's fast enough.

I was thinking about the fast rename detection algorithm in the "next" branch.

The question is whether recording additional (helper) information about
content copying and moving, as the mentioned "renametool" did, is worth
the effort, both in coding it and from the user's point of view, or whether
better detection of content copying and moving ("rename detection") for
whatchanged and similar tools would suffice.

I am of the opinion that voluntary information about content moving and
copying in the commits would help.

Purposes:
1.) Record content moving and similarity information which cannot be
calculated, or cannot be calculated easily; see Paul Jakma's response in
this thread
  MessageID: <Pine.LNX.4.64.0603270642090.5276@sheen.jakma.org>
for examples: copying a fragment of code (a small fragment of the whole
file), creating documentation or a header file from code, or a code
skeleton from a template, or rewriting code in a different language (e.g.
shell script to Perl, or script to compiled code, e.g. Perl or Python to C).
2.) Caching the results of the similarity algorithm/rename detection tool
(also from Paul Jakma's post), including remembering false positives and
undetected renames, for efficiency. Automatically calculated parts might be
throw-away.

Sources of information:
1.) Manually entered information *at commit*, including *-rm, *-mv, *-cp
like commands (which nobody likes) and a systematized notation
(pseudolanguage?) for copying and moving contents in the log messages.
2.) Semi-manual tools like the mentioned "renametool" of BK.
3.) Support from the editor (remembering where a copied and pasted, or cut
and pasted, fragment came from, and providing a prefilled command to record
content moving ("renames") or a prefilled commit log containing this
information). Hard to get, probably most useful.
4.) Information from resolved merges and results of diagnosis (pickaxe like)
tools, especially recording "renames" which were not detected, and removing
"renames" which were detected falsely.

Is that the place where I should provide code (patch) for testing the
idea :) ?

>> I wonder how common is (2) compared to (1)+(2) i.e. move to other dir
>> and rename, old-dir/old-file.c to new-dir/new-subdir/new-file.c
>
> For example, one common case was a directory structure like
> 
> ..
> type-file1.c
> type-file2.c
> otherfiles.c
> yet-more.c
> ..
> 
> being split up into a subdirectory
> 
> ..
> type/file1.c
> type/file2.c
> otherfiles.c
> yet-more.c
> ..
> 
> (eg drivers/scsi/aic7xx-* being given a subdirectory of its own, as
> drivers/scsi/aic7xx/*). So the basename wouldn't stay the same, because it
> contained some piece of data that became redundant with the move.

Perhaps the fast rename detection algorithm needs some smart similarity
estimate for names, which would put more weight on the parts closer to the
basename, and would detect */type-file1.c and */type/file1.c as similar.
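
One possible shape for such a name similarity estimate is sketched below.
Everything in it is an assumption made for illustration: the function names,
the choice to split names on '/', '-' and '_', and the weights that halve
with each step away from the basename. It is not what the rename detection
in "next" does.

```python
import re


def name_tokens(path):
    """Split a path into tokens on '/', '-' and '_', dropping empty ones."""
    return [tok for tok in re.split(r"[/_-]", path) if tok]


def name_similarity(a, b):
    """Score two paths between 0.0 and 1.0.

    Tokens are compared pairwise starting from the end of the path, and
    each step away from the basename counts half as much as the previous
    one, so the parts closest to the basename dominate the score.
    """
    ta = name_tokens(a)[::-1]
    tb = name_tokens(b)[::-1]
    score = total = 0.0
    weight = 1.0
    for i in range(max(len(ta), len(tb))):
        total += weight
        if i < len(ta) and i < len(tb) and ta[i] == tb[i]:
            score += weight
        weight /= 2
    return score / total if total else 1.0


# Both names split into the same tokens, so they count as identical.
print(name_similarity("drivers/scsi/type-file1.c",
                      "drivers/scsi/type/file1.c"))    # 1.0
```

Such a score would only be used to pick or rank rename candidates; the
content similarity check would still make the final decision.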

-- 
Jakub Narebski
Warsaw, Poland


* Re: Following renames
  2006-03-27  6:55           ` Jakub Narebski
@ 2006-03-27  7:40             ` David Lang
  2006-03-27  7:53               ` Jakub Narebski
  0 siblings, 1 reply; 41+ messages in thread
From: David Lang @ 2006-03-27  7:40 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

On Mon, 27 Mar 2006, Jakub Narebski wrote:

> 2.) Caching the results of similarity algorithm/rename detection tool (also
> Paul Jakma post), including remembering false positives and undetected
> renames, for efficiency. Automatically calculated parts might be
> throw-away.

this sounds like it could easily devolve into an O(n!) situation where you 
are caching how everything is related (or not related) to everything 
else. Paul was making the point that the purpose was to cache the data to 
eliminate the time needed to calculate it, but if you don't store all the 
results then you don't know if the result is not relevant, or unknown, so 
you need to calculate it again.

David Lang

-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
  -- C.A.R. Hoare


* Re: Following renames
  2006-03-27  7:40             ` David Lang
@ 2006-03-27  7:53               ` Jakub Narebski
  0 siblings, 0 replies; 41+ messages in thread
From: Jakub Narebski @ 2006-03-27  7:53 UTC (permalink / raw)
  To: git

David Lang wrote:

> On Mon, 27 Mar 2006, Jakub Narebski wrote:
> 
>> 2.) Caching the results of similarity algorithm/rename detection tool
>> (also Paul Jakma post), including remembering false positives and
>> undetected renames, for efficiency. Automatically calculated parts might
>> be throw-away.
> 
> this sounds like it could easily devolve into an O(n!) situation where you
> are caching how everything is related (or not related) to everything
> else. Paul was making the point that the purpose was to cache the data to
> eliminate the time needed to calculate it, but if you don't store all the
> results then you don't know if the result is not relevant, or unknown, so
> you need to calculate it again.

First of all, you only remember non-trivial relations (the trivial one
being that file.c is always related to file.c). If the cache were kept only
for commits, it would be O(c*p*n), where c is the number of commits, p is
the percentage of content moving ("renames") times the percentage of files
changed in the commit, and n is the number of files; probably O(c) in
practice. Even if we remember results for all (tree1,tree2) pairs it would
be O(c^2). Additionally the cache can be limited in size (pruning the
oldest entries).
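
A size-limited cache of this kind could look roughly like the sketch below.
The class name RenameCache and its get/put interface are hypothetical. The
sketch assumes that results are keyed by a (tree1, tree2) pair and that the
oldest stored entry is the one pruned. An empty result is stored explicitly,
so "calculated, no renames" can be told apart from "unknown".

```python
from collections import OrderedDict


class RenameCache:
    """Bounded cache of rename-detection results per (tree1, tree2) pair."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()   # insertion order = age

    def get(self, tree1, tree2):
        """Return the cached {old_path: new_path} dict, or None if unknown.

        An empty dict means the pair was examined and nothing non-trivial
        was found, so it does not have to be calculated again.
        """
        return self._entries.get((tree1, tree2))

    def put(self, tree1, tree2, renames):
        """Store a result and prune the oldest entries beyond the limit."""
        key = (tree1, tree2)
        self._entries[key] = dict(renames)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)


cache = RenameCache(max_entries=2)
cache.put("tree-a", "tree-b", {"type-file1.c": "type/file1.c"})
cache.put("tree-b", "tree-c", {})       # examined, no renames found
print(cache.get("tree-b", "tree-c"))    # {}
print(cache.get("tree-c", "tree-d"))    # None
```

With at most max_entries pairs kept, the memory use stays bounded
regardless of the number of commits, at the price of recalculating results
that have been pruned.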

-- 
Jakub Narebski
Warsaw, Poland


* Re: Following renames
  2006-03-26  1:49 Following renames Petr Baudis
  2006-03-26  2:49 ` Junio C Hamano
@ 2006-03-26  3:19 ` Linus Torvalds
  2006-03-26  7:35   ` Ryan Anderson
  2006-03-26 10:07   ` Petr Baudis
  1 sibling, 2 replies; 41+ messages in thread
From: Linus Torvalds @ 2006-03-26  3:19 UTC (permalink / raw)
  To: Petr Baudis; +Cc: git



On Sun, 26 Mar 2006, Petr Baudis wrote:
> 
>   In [1], Linus suggests a non-core solution. Unfortunately, it doesn't
> fly - it requires at least two git-ls-tree calls per revision which
> would bog things down awfully (to roughly half of the original speed).

No it doesn't. It requires one git-ls-tree WHEN SOMETHING IS RENAMED.

In other words, basically never.

		Linus


* Re: Following renames
  2006-03-26  3:19 ` Linus Torvalds
@ 2006-03-26  7:35   ` Ryan Anderson
  2006-03-26 21:09     ` Petr Baudis
  2006-03-26 10:07   ` Petr Baudis
  1 sibling, 1 reply; 41+ messages in thread
From: Ryan Anderson @ 2006-03-26  7:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Petr Baudis, git


Linus Torvalds wrote:
> On Sun, 26 Mar 2006, Petr Baudis wrote:
>   
>>   In [1], Linus suggests a non-core solution. Unfortunately, it doesn't
>> fly - it requires at least two git-ls-tree calls per revision which
>> would bog things down awfully (to roughly half of the original speed).
>>     
>
> No it doesn't. It requires one git-ls-tree WHEN SOMETHING IS RENAMED.
>
> In other words, basically never.
>   

A simple example is the first loop in git-annotate.perl.  (Which was
basically written by Linus, I just translated it from a
shell/pseudo-code example into Perl)


-- 

Ryan Anderson
  sometimes Pug Majere





* Re: Following renames
  2006-03-26  7:35   ` Ryan Anderson
@ 2006-03-26 21:09     ` Petr Baudis
  0 siblings, 0 replies; 41+ messages in thread
From: Petr Baudis @ 2006-03-26 21:09 UTC (permalink / raw)
  To: Ryan Anderson; +Cc: Linus Torvalds, git

Dear diary, on Sun, Mar 26, 2006 at 09:35:02AM CEST, I got a letter
where Ryan Anderson <ryan@michonline.com> said that...
> Linus Torvalds wrote:
> > On Sun, 26 Mar 2006, Petr Baudis wrote:
> >   
> >>   In [1], Linus suggests a non-core solution. Unfortunately, it doesn't
> >> fly - it requires at least two git-ls-tree calls per revision which
> >> would bog things down awfully (to roughly half of the original speed).
> >>     
> >
> > No it doesn't. It requires one git-ls-tree WHEN SOMETHING IS RENAMED.
> >
> > In other words, basically never.
> >   
> 
> A simple example is the first loop in git-annotate.perl.  (Which was
> basically written by Linus, I just translated it from a
> shell/pseudo-code example into Perl)

One case it does not handle:

         2
      -- b --
  1 /         \ 6
  a             d
    \ 3     5 /
      c --- d

git-rev-list at 6 will (understandably) show

        6 5
        5

and you will never detect the d -> b rename leading to 2.

This is one reason why I'm actually not using --parents, but pipe stuff
directly to git-diff-tree --stdin -M and then read its output. This also
lets me merge parallel lines of development based on date and I don't
have to fork for each file deletion.

With any luck I'll have the first draft of my (also perlish) script done
this evening yet. (BTW, it has the same output format as

	git-rev-list | git-diff-tree --pretty=raw -M

so with some tweaking it could also serve as a git-whatchanged backend.
Actually, it would be nice to have it in core Git in the long term so
that it gets all the portability tweaks and such.)

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.


* Re: Following renames
  2006-03-26  3:19 ` Linus Torvalds
  2006-03-26  7:35   ` Ryan Anderson
@ 2006-03-26 10:07   ` Petr Baudis
  2006-03-26 10:34     ` Fredrik Kuivinen
  2006-03-26 16:33     ` Linus Torvalds
  1 sibling, 2 replies; 41+ messages in thread
From: Petr Baudis @ 2006-03-26 10:07 UTC (permalink / raw)
  To: Linus Torvalds, Ryan Anderson; +Cc: git

Dear diary, on Sun, Mar 26, 2006 at 05:19:50AM CEST, I got a letter
where Linus Torvalds <torvalds@osdl.org> said that...
> On Sun, 26 Mar 2006, Petr Baudis wrote:
> > 
> >   In [1], Linus suggests a non-core solution. Unfortunately, it doesn't
> > fly - it requires at least two git-ls-tree calls per revision which
> > would bog things down awfully (to roughly half of the original speed).
> 
> No it doesn't. It requires one git-ls-tree WHEN SOMETHING IS RENAMED.
> 
> In other words, basically never.

Huh? I don't see that now (and caps don't help me see it better). That's
certainly not what is in [1], and I don't see _how_ to detect the
renames in this case, and what would I be actually doing git-ls-tree for
when I've already detected the rename. Based on [1], I'd be doing
git-ls-tree merely to detect that a file _disappeared_ in the first
place, I have to do other stuff to detect the renames themselves.

Dear diary, on Sun, Mar 26, 2006 at 09:35:02AM CEST, I got a letter
where Ryan Anderson <ryan@michonline.com> said that...
> A simple example is the first loop in git-annotate.perl.  (Which was
> basically written by Linus, I just translated it from a
> shell/pseudo-code example into Perl)

Thanks for the hint. Unfortunately, this is precisely the thing I want
to avoid, that is essentially reimplementing part of git-rev-list - to
do something good, I would have to do my own toposort and merge by date
between parallel lines. OTOH, I might just construct a large revlist
commandline specifying all the segments I'm interested in and see what
happens when I run that.

Besides, doing it in shell would be a pretty ugly job (forcing me to
finally rewrite it in perl is not a bad thing but that'd be a somewhat
larger project since I share various common routines with other cg
tools, etc).

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.


* Re: Following renames
  2006-03-26 10:07   ` Petr Baudis
@ 2006-03-26 10:34     ` Fredrik Kuivinen
  2006-03-26 16:33     ` Linus Torvalds
  1 sibling, 0 replies; 41+ messages in thread
From: Fredrik Kuivinen @ 2006-03-26 10:34 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Linus Torvalds, Ryan Anderson, git

On Sun, Mar 26, 2006 at 12:07:17PM +0200, Petr Baudis wrote:
> Dear diary, on Sun, Mar 26, 2006 at 05:19:50AM CEST, I got a letter
> where Linus Torvalds <torvalds@osdl.org> said that...
> > On Sun, 26 Mar 2006, Petr Baudis wrote:
> > > 
> > >   In [1], Linus suggests a non-core solution. Unfortunately, it doesn't
> > > fly - it requires at least two git-ls-tree calls per revision which
> > > would bog things down awfully (to roughly half of the original speed).
> > 
> > No it doesn't. It requires one git-ls-tree WHEN SOMETHING IS RENAMED.
> > 
> > In other words, basically never.
> 
> Huh? I don't see that now (and caps don't help me see it better). That's
> certainly not what is in [1], and I don't see _how_ to detect the
> renames in this case, and what would I be actually doing git-ls-tree for
> when I've already detected the rename. Based on [1], I'd be doing
> git-ls-tree merely to detect that a file _disappeared_ in the first
> place, I have to do other stuff to detect the renames themselves.
> 
> Dear diary, on Sun, Mar 26, 2006 at 09:35:02AM CEST, I got a letter
> where Ryan Anderson <ryan@michonline.com> said that...
> > A simple example is the first loop in git-annotate.perl.  (Which was
> > basically written by Linus, I just translated it from a
> > shell/pseudo-code example into Perl)
> 
> Thanks for the hint. Unfortunately, this is precisely the thing I want
> to avoid, that is essentially reimplementing part of git-rev-list - to
> do something good, I would have to do my own toposort and merge by date
> between parallel lines. OTOH, I might just construct a large revlist
> commandline specifying all the segments I'm interested in and see what
> happens when I run that.
> 
> Besides, doing it in shell would be a pretty ugly job (forcing me to
> finally rewrite it in perl is not a bad thing but that'd be a somewhat
> larger project since I share various common routines with other cg
> tools, etc).
> 

If you decide to modify rev-list to do rename tracking you might want
to have a look at how this is done in blame.c. git-blame only tracks
one file (since that is what it needs) but I think it should be
possible to track multiple files with a similar approach.

- Fredrik


* Re: Following renames
  2006-03-26 10:07   ` Petr Baudis
  2006-03-26 10:34     ` Fredrik Kuivinen
@ 2006-03-26 16:33     ` Linus Torvalds
  2006-03-26 19:14       ` Petr Baudis
  1 sibling, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2006-03-26 16:33 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Ryan Anderson, git

On Sun, 26 Mar 2006, Petr Baudis wrote:
> 
> Huh? I don't see that now (and caps don't help me see it better). That's
> certainly not what is in [1], and I don't see _how_ to detect the
> renames in this case, and what would I be actually doing git-ls-tree for
> when I've already detected the rename. Based on [1], I'd be doing
> git-ls-tree merely to detect that a file _disappeared_ in the first
> place, I have to do other stuff to detect the renames themselves.

No, the point is that "git-rev-list" already does all of [1] in the core.

If you do

	git-rev-list --parents --remove-empty $REV -- $filename

then you'll get the whole history for that filename. When it ends, you 
know the file went away, and then you do basically _one_ "where the hell 
did it go" thing.

And yes, it's not git-ls-tree (unless you only want to follow pure 
renames), it's actually one "git-diff-tree -M $lastrev". Then you just 
continue with the new filename (and do another "git-rev-list" until you 
hit the next rename).

		Linus
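
A minimal sketch of the loop described above, for a single line of history
only. It assumes the modern "git rev-list" / "git diff-tree" spelling rather
than the dashed commands, drops --parents because only commit ids are needed,
and assumes simple path names without tabs or backslashes. The helper name
follow_file is hypothetical, not part of git or Cogito.

```shell
# follow_file REV PATH  (hypothetical helper)
# Prints "<commit><TAB><path>" for every commit that changed PATH,
# walking backwards and switching to the old name at each rename.
follow_file() {
    rev=$1
    path=$2
    while :; do
        last=
        # History of the current name; --remove-empty stops the walk
        # once the path disappears from the tree.
        for commit in $(git rev-list --remove-empty "$rev" -- "$path"); do
            printf '%s\t%s\n' "$commit" "$path"
            last=$commit
        done
        [ -n "$last" ] || break
        # One "where did it go" query on the oldest commit just listed.
        old=$(git diff-tree -r -M --no-commit-id --name-status \
                  --diff-filter=R "$last" |
              awk -F'\t' -v new="$path" '$3 == new { print $2; exit }')
        [ -n "$old" ] || break
        # Continue from the parent, under the previous name.
        path=$old
        rev="$last^"
    done
}
```

Called as "follow_file HEAD some/file", it costs one rev-list per name the
file has had plus one diff-tree per rename. It does not handle merges where
the two sides used different names.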


* Re: Following renames
  2006-03-26 16:33     ` Linus Torvalds
@ 2006-03-26 19:14       ` Petr Baudis
  2006-03-26 20:31         ` Petr Baudis
                           ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Petr Baudis @ 2006-03-26 19:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ryan Anderson, git

Dear diary, on Sun, Mar 26, 2006 at 06:33:13PM CEST, I got a letter
where Linus Torvalds <torvalds@osdl.org> said that...
> If you do
> 
> 	git-rev-list --parents --remove-empty $REV -- $filename
> 
> then you'll get the whole history for that filename. When it ends, you 
> know the file went away, and then you do basically _one_ "where the hell 
> did it go" thing.
> 
> And yes, it's not git-ls-tree (unless you only want to follow pure 
> renames), it's actually one "git-diff-tree -M $lastrev". Then you just 
> continue with the new filename (and do another "git-rev-list" until you 
> hit the next rename).

I wrote a long rant, but then it all suddenly fit together and I now
have an idea how to implement it reasonably elegantly.

So only a bugreport remains:

My current target is to support this tree (letters are filenames,
numbers are commit ids; I'll translate any git output to those digits):

    2    4
    b -- d
1 /        \ 6
a            d
  \ 3    5 /
    c -- d

With the commits created in the numerical order (so log shows
1,2,3,4,5,6, and my target is cg-log d showing the same output). If
anyone wants the sample history, it's at

	http://pasky.or.cz/~xpasky/renametree1.git/

Curiously, git-rev-list does something totally strange when trying to
list per-file history at this point:

	$ git-rev-list HEAD -- d
	4

Huh? (It should list 6, 5, 4 instead.)

I worked around it by recording a change in d in the merge 6:

	http://pasky.or.cz/~xpasky/renametree2.git/

	$ git-rev-list --parents --remove-empty HEAD -- d
	6 4 5
	5
	4

Which is the expected output.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.


* Re: Following renames
  2006-03-26 19:14       ` Petr Baudis
@ 2006-03-26 20:31         ` Petr Baudis
  2006-03-26 22:22         ` Linus Torvalds
  2006-03-26 23:26         ` Petr Baudis
  2 siblings, 0 replies; 41+ messages in thread
From: Petr Baudis @ 2006-03-26 20:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ryan Anderson, git

Dear diary, on Sun, Mar 26, 2006 at 09:14:45PM CEST, I got a letter
where Petr Baudis <pasky@suse.cz> said that...
> Curiously, git-rev-list does something totally strange when trying to
> list per-file history at this point:
> 
> 	$ git-rev-list HEAD -- d
> 	4
> 
> Huh? (It should list 6, 5, 4 instead.)

Obviously not 6 since the file was not changed in that revision, but I'd
still expect it to list 5.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.


* Re: Following renames
  2006-03-26 19:14       ` Petr Baudis
  2006-03-26 20:31         ` Petr Baudis
@ 2006-03-26 22:22         ` Linus Torvalds
  2006-03-26 22:31           ` Petr Baudis
  2006-03-26 23:26         ` Petr Baudis
  2 siblings, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2006-03-26 22:22 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Ryan Anderson, git

On Sun, 26 Mar 2006, Petr Baudis wrote:
> 
> My current target is to support this tree (letters are filenames,
> numbers are commit ids; I'll translate any git output to those digits):
> 
>     2    4
>     b -- d
> 1 /        \ 6
> a            d
>   \ 3    5 /
>     c -- d

Yeah, the problem with this is that you need to track separate names 
across separate points. However:

> Curiously, git-rev-list does something totally strange when trying to
> list per-file history at this point:
> 
> 	$ git-rev-list HEAD -- d
> 	4
> 
> Huh? (It should list 6, 5, 4 instead.)

What it does is list the points where file "d" _changed_.

"d" did not change in 6 - it had a parent commit (4) where "d" had the 
same contents (in fact, it likely had _two_ parents where it had the same 
contents, but git will pick the first one). So commit "6" is 
uninteresting, and commit "5" will never even be looked at, since we 
decided that the history of "d" comes from the first parent with the same 
contents.

So then it lists "4", because file "d" really did change in that commit 
(it went away).

Now you need to look at "4" and find the rename (which gives you 2) and 
then from there you do rename detection and get (1), and as a result your 
change history should end up being

 (1)a -> (2)b -> (4)d (-> (6)d, which was your start point)

which is correct (now, there are other histories _too_ that get us to the 
same point, but the one you found this way was _a_ history).

> I worked around it by recording a change in d in the merge 6:
> 
> 	http://pasky.or.cz/~xpasky/renametree2.git/
> 
> 	$ git-rev-list --parents --remove-empty HEAD -- d
> 	6 4 5
> 	5
> 	4
> 
> Which is the expected output.

No, it's the expected output just because you expected merges to always 
show up. Merges get ignored if any of the parents have the same content 
already.

		Linus
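
The behaviour discussed above can be reproduced in a scratch repository. The
following is only a sketch: it assumes a reasonably recent git on the PATH
(modern "git rev-list" spelling), uses commit subjects 1..6 in place of real
ids, and the helper name demo_rename_tree is hypothetical.

```shell
# demo_rename_tree (hypothetical helper): rebuild the six-commit example
# in a throwaway repository and print the subjects of the commits that
# "git rev-list HEAD -- d" reports. The body runs in a subshell, so the
# caller's working directory is left alone.
demo_rename_tree() (
    set -e
    dir=$(mktemp -d)
    trap 'rm -rf "$dir"' EXIT
    cd "$dir"
    git init -q .
    git config user.name "Example"
    git config user.email "example@example.invalid"

    echo content >a
    git add a
    git commit -q -m 1          # 1: file a
    git branch side
    git mv a b
    git commit -q -m 2          # 2: a -> b
    git checkout -q side
    git mv a c
    git commit -q -m 3          # 3: a -> c
    git checkout -q -
    git mv b d
    git commit -q -m 4          # 4: b -> d
    git checkout -q side
    git mv c d
    git commit -q -m 5          # 5: c -> d
    git checkout -q -
    git merge -q -m 6 side      # 6: merge, d identical in both parents

    # Path-limited history of d, shown by commit subject.
    git rev-list HEAD -- d |
    while read -r commit; do
        git log -1 --format=%s "$commit"
    done
)
```

Running demo_rename_tree should print just "4": the merge has the same "d"
as its first parent, so only that parent is followed, and 5 is never visited.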


* Re: Following renames
  2006-03-26 22:22         ` Linus Torvalds
@ 2006-03-26 22:31           ` Petr Baudis
  2006-03-26 22:43             ` Junio C Hamano
  2006-03-26 23:09             ` Linus Torvalds
  0 siblings, 2 replies; 41+ messages in thread
From: Petr Baudis @ 2006-03-26 22:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ryan Anderson, git

Dear diary, on Mon, Mar 27, 2006 at 12:22:04AM CEST, I got a letter
where Linus Torvalds <torvalds@osdl.org> said that...
> So commit "6" is uninteresting, and commit "5" will never even be
> looked at, since we decided that the history of "d" comes from the
> first parent with the same contents.

And this is the thing I have a problem with - this does not make much
sense to me, why can't we just follow all parents instead of arbitrarily
choosing one of them?

> which is correct (now, there are other histories _too_ that get us to the 
> same point, but the one you found this way was _a_ history).

Ok, in that case I want the _full_ history. :-)

> No, it's the expected output just because you expected merges to always 
> show up. Merges get ignored if any of the parents have the same content 
> already.

Eek. Can I avoid that? What was the reason for choosing this behavior?

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.


* Re: Following renames
  2006-03-26 22:31           ` Petr Baudis
@ 2006-03-26 22:43             ` Junio C Hamano
  2006-03-26 23:10               ` Linus Torvalds
  2006-03-26 23:09             ` Linus Torvalds
  1 sibling, 1 reply; 41+ messages in thread
From: Junio C Hamano @ 2006-03-26 22:43 UTC (permalink / raw)
  To: Petr Baudis; +Cc: git, Linus Torvalds

Petr Baudis <pasky@suse.cz> writes:

>> No, it's the expected output just because you expected merges to always 
>> show up. Merges get ignored if any of the parents have the same content 
>> already.
>
> Eek. Can I avoid that? What was the reason for choosing this behavior?

Perhaps rev-list --sparse?


* Re: Following renames
  2006-03-26 22:43             ` Junio C Hamano
@ 2006-03-26 23:10               ` Linus Torvalds
  2006-03-27  7:30                 ` Junio C Hamano
  0 siblings, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2006-03-26 23:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Petr Baudis, git



On Sun, 26 Mar 2006, Junio C Hamano wrote:
> Petr Baudis <pasky@suse.cz> writes:
> 
> >> No, it's the expected output just because you expected merges to always 
> >> show up. Merges get ignored if any of the parents have the same content 
> >> already.
> >
> > Eek. Can I avoid that? What was the reason for choosing this behavior?
> 
> Perhaps rev-list --sparse?

No. "--sparse" still removes the uninteresting parents of merges. It just 
doesn't then make the linear history any denser.

		Linus


* Re: Following renames
  2006-03-26 23:10               ` Linus Torvalds
@ 2006-03-27  7:30                 ` Junio C Hamano
  0 siblings, 0 replies; 41+ messages in thread
From: Junio C Hamano @ 2006-03-27  7:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Petr Baudis, git

Linus Torvalds <torvalds@osdl.org> writes:

> No. "--sparse" still removes the uninteresting parents of merges. It just 
> doesn't then make the linear history any denser.

Hmph, you are right.  add_parents_to_list() calls prune_fn
unconditionally while running limit_list().

Disabling that with yet another flag might be a possibility but
I suspect then it would not be much different from running
rev-list without path limiter and having the caller process the
result.


* Re: Following renames
  2006-03-26 22:31           ` Petr Baudis
  2006-03-26 22:43             ` Junio C Hamano
@ 2006-03-26 23:09             ` Linus Torvalds
  1 sibling, 0 replies; 41+ messages in thread
From: Linus Torvalds @ 2006-03-26 23:09 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Ryan Anderson, git

On Mon, 27 Mar 2006, Petr Baudis wrote:

> Dear diary, on Mon, Mar 27, 2006 at 12:22:04AM CEST, I got a letter
> where Linus Torvalds <torvalds@osdl.org> said that...
> > So commit "6" is uninteresting, and commit "5" will never even be
> > looked at, since we decided that the history of "d" comes from the
> > first parent with the same contents.
> 
> And this is the thing I have a problem with - this does not make much
> sense to me, why can't we just follow all parents instead of arbitrarily
> choosing one of them?

Sure, you can. It's _usually_ a huge waste of time, though. Why would you 
want to do more work than you need, since clearly the other parent was 
_not_ interesting from the standpoint of the question "where did this 
content come from"?

> > No, it's the expected output just because you expected merges to always 
> > show up. Merges get ignored if any of the parents have the same content 
> > already.
> 
> Eek. Can I avoid that? What was the reason for choosing this behavior?

Huge efficiency gains.

Lookie here. Do

	gitk -- rev-list.c

on the git archive with the current git-rev-list, and with your hacked-up 
version.

And tell me my version isn't a hell of a lot better. Because, I guarantee 
you, it is. We're just not _interested_ in all those merges that didn't 
actually make any difference.

Read up on what modern neuro-science thinks about the human brain, and 
what a lot of it is about. It's about ignoring irrelevant information.

The ability to throw stuff out that isn't interesting is the _real_ basis 
of true intelligence. I'd rather have git do the _intelligent_ history, 
than show history that isn't relevant and working harder doing so.

		Linus


* Re: Following renames
  2006-03-26 19:14       ` Petr Baudis
  2006-03-26 20:31         ` Petr Baudis
  2006-03-26 22:22         ` Linus Torvalds
@ 2006-03-26 23:26         ` Petr Baudis
  2006-03-27 21:59           ` Petr Baudis
  2 siblings, 1 reply; 41+ messages in thread
From: Petr Baudis @ 2006-03-26 23:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ryan Anderson, git

Dear diary, on Sun, Mar 26, 2006 at 09:14:45PM CEST, I got a letter
where Petr Baudis <pasky@suse.cz> said that...
> Dear diary, on Sun, Mar 26, 2006 at 06:33:13PM CEST, I got a letter
> where Linus Torvalds <torvalds@osdl.org> said that...
> > If you do
> > 
> > 	git-rev-list --parents --remove-empty $REV -- $filename
> > 
> > then you'll get the whole history for that filename. When it ends, you 
> > know the file went away, and then you do basically _one_ "where the hell 
> > did it go" thing.
> > 
> > And yes, it's not git-ls-tree (unless you only want to follow pure 
> > renames), it's actually one "git-diff-tree -M $lastrev". Then you just 
> > continue with the new filename (and do another "git-rev-list" until you 
> > hit the next rename).
> 
> I wrote a long rant, but then it all suddenly fit together and I now
> have an idea how to implement it reasonably elegantly.

So, this is what I have. Testing (I've given it very little of that) and
thoughts welcome. It is probably pretty efficient; at least in terms of
fork()s, it does only 2*N of them, where N is the number of commits
containing interesting renames.  Actually, it should even be possible
to reduce this to N+1 if you do a single git-diff-tree call and multiplex
the different git-rev-lists to it, but I'm too tired to do the trickery now.

It has 'cg' in the name but depends on no Cogito stuff; it should in
fact be trivially possible to put it into git-whatchanged in place of the
final pipeline (not that I'd suggest doing this universally,
but perhaps git-whatchanged -f ...?). There are three downsides in this
regard:

(i) No -c support. I need the separate deltas coming out of
git-diff-tree, but I think I can join them together pretty easily on my
own, except that I have problems with -c (see
<20060326102100.GF18185@pasky.or.cz>), so I'm not sure how exactly it is
supposed to behave.

(ii) Only --pretty=raw output. It shouldn't be hard to add the
reformatting code, but I'm personally not going to use it and am kind of
lazy, so I'll let someone else do that, I guess. :-)

(iii) Raw deltas required. -p parsing support would be certainly useful
and possible, but see (ii).
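
For reference, a raw delta line from git-diff-tree has the form
":oldmode newmode oldsha newsha status<TAB>src<TAB>dst", where status is
R or C plus a similarity score for renames and copies. A rough sketch of
picking those entries out of such a stream follows; the function name
renames_from_raw and the sample line are made up for illustration.

```shell
# renames_from_raw (hypothetical helper): read raw diff-tree output on
# stdin and print "status<TAB>src<TAB>dst" for every rename or copy.
renames_from_raw() {
    awk -F'\t' '
        /^:/ {
            # The first tab-separated field holds modes, ids and status;
            # the status letter is its last space-separated word.
            n = split($1, meta, " ")
            if (meta[n] ~ /^[RC][0-9]*$/)
                print meta[n] "\t" $2 "\t" $3
        }'
}

# Example (sample ids are invented):
#   printf ':100644 100644 1111111 1111111 R100\tb\td\n' | renames_from_raw
```

Header and message lines of --pretty=raw output do not start with ":" and
are skipped, so a whole git-diff-tree --stdin stream can be piped through.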


To quickly see what it does, you can try it e.g. on the git-log.sh file
in the Git repository.

Thoughts? Opinions? Bugs? Patches?


Signed-off-by: Petr Baudis <pasky@suse.cz>


diff --git a/cg-Xfollowrenames b/cg-Xfollowrenames
new file mode 100755
index 0000000..fa5c552
--- /dev/null
+++ b/cg-Xfollowrenames
@@ -0,0 +1,246 @@
+#!/usr/bin/env perl
+#
+# git-rev-list | git-diff-tree --stdin following renames
+# Copyright (c) Petr Baudis, 2006
+# Uses bits of git-annotate.perl by Ryan Anderson.
+#
+# This script will efficiently show output as of the
+#
+#	git-rev-list --remove-empty ARGS -- FILE... |
+#	git-diff-tree -M -r -m --stdin --pretty=raw ARGS
+#
+# pipeline, except that it follows renames of individual files listed
+# in the FILE... set.
+#
+# Usage:
+#
+#	cg-Xfollowrenames revlistargs -- difftreeargs -- revs -- files
+
+# TODO: Does not work on multiple files properly yet - most probably
+# (I didn't test it!). We want git-rev-list to stop traversing the history
+# when _any_ file disappears while now it probably stops traversing when
+# _all_ files disappear.
+
+use warnings;
+use strict;
+
+$| = 1;
+
+our (@revlist_args, @difftree_args, @revs, @files);
+
+{ # Load arguments
+	my @argp = (\@revlist_args, \@difftree_args, \@revs, \@files);
+	my $argi = 0;
+	for my $arg (@ARGV) {
+		if ($arg eq '--' and $argi < $#argp) {
+			$argi++;
+			next;
+		}
+		push(@{$argp[$argi]}, $arg);
+	}
+}
+
+
+# The heads we watch (sorted by commit time)
+our @heads;
+# Each head is: {
+#	# Persistent for the whole line of development:
+#	pipe => $pipe,
+#	files => \@files, # to watch for
+#
+#	id => $sha1, # useful actually only for debugging
+#	time => $timestamp,
+#	str => $prettyoutput,
+#	parents => \@sha1s,
+#
+#	# When the commit is processed, spawn these extra heads:
+#	recurse => {$sha1id => \@files, ...},
+# }
+
+# To avoid printing duplicate commits
+# FIXME: Currently, we will not handle merge commits properly since
+# we hit them multiple times.
+our %commits;
+
+
+sub open_pipe($@) {
+	my ($stdin, @execlist) = @_;
+
+	my $pid = open my $kid, "-|";
+	defined $pid or die "Cannot fork: $!";
+
+	unless ($pid) {
+		if (defined $stdin) {
+			open STDIN, "<&", $stdin or die "Cannot dup(): $!";
+		}
+		exec @execlist;
+		die "Cannot exec @execlist: $!";
+	}
+
+	return $kid;
+}
+
+sub revlist($@) {
+	my ($rev, @files) = @_;
+	open_pipe(undef, "git-rev-list", "--remove-empty",
+	                 @revlist_args, $rev, "--", @files)
+		or die "Failed to exec git-rev-list: $!";
+}
+
+sub difftree($) {
+	my ($revlist) = @_;
+	open_pipe($revlist, "git-diff-tree", "-r", "-m", "--stdin", "-M",
+	                    "--pretty=raw", @difftree_args)
+		or die "Failed to exec git-diff-tree: $!";
+}
+
+sub revdiffpipe($@) {
+	my ($rev, @files) = @_;
+	my $pipe = difftree(revlist($rev, @files));
+}
+
+
+sub read_commit($$) {
+	my ($head, $tolerant) = @_;
+	my $pipe = $head->{'pipe'};
+	my $against;
+	my @oldset = @{$head->{'files'}};
+	my @newset;
+	my $rename;
+
+	# Load header
+	while (my $line = <$pipe>) {
+		$head->{'str'} .= $line;
+		chomp $line;
+		$line eq '' and goto header_loaded;
+
+		if ($line =~ /^diff-tree (\S+) \(from (root|\S+)\)/) {
+			$head->{'id'} = $1;
+			if (not $tolerant and $commits{$1}++) {
+				close $pipe;
+				return undef;
+			}
+			# The 'root' case is harmless since there'll be no renames.
+			$against = $2;
+		} elsif ($line =~ /^parent (\S+)/) {
+			push (@{$head->{'parents'}}, $1);
+		} elsif ($line =~ /^committer .*?> (\d+)/) {
+			$head->{'time'} = $1;
+		}
+	}
+	return undef;
+header_loaded:
+
+	# Load message
+	while (my $line = <$pipe>) {
+		$head->{'str'} .= $line;
+		chomp $line;
+		$line eq '' and goto message_loaded;
+	}
+	return undef;
+message_loaded:
+
+	# Load delta
+	while (my $line = <$pipe>) {
+		$head->{'str'} .= $line;
+		chomp $line;
+		$line eq '' and goto delta_loaded;
+
+		$line =~ /^:/ or return undef;
+		my ($info, $newfile, $oldfile) = split("\t", $line);
+		if ($info =~ /[RC]\d*$/) {
+			# Behold, a rename!
+			# (Or a copy, it's all the same for us.)
+			my $i;
+			for ($i = 0; $i <= $#oldset; $i++) {
+				$oldfile eq $oldset[$i] or next;
+				$rename = 1;
+				splice(@oldset, $i, 1);
+				push(@newset, $newfile);
+				last;
+			}
+			# In case of multiple candidates, follow
+			# all of them:
+			# (TODO: This might be a policy decision
+			# best left on the user.)
+			if ($i > $#oldset and grep { $oldfile eq $_ } @newset) {
+				$rename = 1;
+				push(@newset, $newfile);
+			}
+		} elsif ($info =~ /D$/) {
+			# Not weeding out deleted files might cause bizarre
+			# results when following multiple files since
+			# git-rev-list weeds them out too (probably?).
+			@oldset = grep { $newfile ne $_ } @oldset;
+			@{$head->{'files'}} = grep { $newfile ne $_ } @{$head->{'files'}};
+		}
+	}
+	$head->{'str'} .= "\n";
+delta_loaded:
+
+	if ($rename) {
+		$head->{'recurse'}->{$against} = [@newset, @oldset];
+	}
+	return 1;
+}
+
+sub load_commit($) {
+	my ($head) = @_;
+	$head->{'time'} = undef;
+	$head->{'str'} = '';
+	$head->{'parents'} = [];
+
+	read_commit($head, 0) or return undef;
+
+	# In case there was a merge, the commit will be multiple times
+	# here, each time with a different delta section. Read them all.
+	for (1 .. $#{$head->{'parents'}}) { # stupid vim syntax highlighting
+		read_commit($head, 1) or return undef;
+	}
+
+	return 1;
+}
+
+
+# Add head at the proper position
+sub add_head($) {
+	my ($head) = @_;
+	my $i;
+	for ($i = 0; $i <= $#heads; $i++) {
+		last if ($head->{'time'} > $heads[$i]->{'time'})
+	}
+	splice(@heads, $i, 0, $head);
+}
+
+# Create new head
+sub init_head($@) {
+	my ($rev, @files) = @_;
+	my $head = { files => \@files, 'pipe' => revdiffpipe($rev, @files) };
+	load_commit($head) or return;
+	add_head($head);
+}
+
+
+
+{ # Seed the heads list
+	for my $rev (@revs) {
+		init_head($rev, @files);
+	}
+}
+
+# Process the heads
+{
+	while (@heads) {
+		my $head = splice(@heads, 0, 1);
+
+		print $head->{'str'};
+
+		foreach my $parent (keys %{$head->{'recurse'}}) {
+			init_head($parent, @{$head->{'recurse'}->{$parent}});
+		}
+		$head->{'recurse'} = undef;
+
+		load_commit($head) or next;
+		add_head($head);
+	}
+}


-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.


* Re: Following renames
  2006-03-26 23:26         ` Petr Baudis
@ 2006-03-27 21:59           ` Petr Baudis
  0 siblings, 0 replies; 41+ messages in thread
From: Petr Baudis @ 2006-03-27 21:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ryan Anderson, git

Dear diary, on Mon, Mar 27, 2006 at 01:26:49AM CEST, I got a letter
where Petr Baudis <pasky@suse.cz> said that...
> To quickly see what it does, you can try it e.g. on the git-log.sh file
> in the Git repository.

By the way, the cg-log in master uses it now to automagically follow
file renames (in case you call it per-file like cg-log FILENAME). If you
hate it, you can prevent it by cg-log --no-renames (cg-log -R).

Looks pretty slick.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.


end of thread (newest message ~2006-03-27 22:00 UTC)

Thread overview: 41+ messages
2006-03-26  1:49 Following renames Petr Baudis
2006-03-26  2:49 ` Junio C Hamano
2006-03-26  3:52   ` Jakub Narebski
2006-03-27  6:00     ` Paul Jakma
2006-03-26 10:52   ` Petr Baudis
2006-03-26 10:55     ` Petr Baudis
2006-03-26 16:08   ` Timo Hirvonen
2006-03-26 16:43     ` Linus Torvalds
2006-03-26 16:31   ` Jakub Narebski
2006-03-26 16:46     ` Linus Torvalds
2006-03-26 17:10       ` Jakub Narebski
2006-03-26 18:10         ` Linus Torvalds
2006-03-26 19:22           ` Marco Costalba
2006-03-26 22:23             ` Linus Torvalds
2006-03-27  5:47               ` Marco Costalba
2006-03-27  6:46                 ` Junio C Hamano
2006-03-27  8:07                 ` Linus Torvalds
2006-03-27 11:19                   ` Marco Costalba
2006-03-27 11:30                     ` Johannes Schindelin
2006-03-27 16:52                     ` Linus Torvalds
2006-03-27 11:55                   ` Marco Costalba
2006-03-27 12:27                     ` Andreas Ericsson
2006-03-27  6:55           ` Jakub Narebski
2006-03-27  7:40             ` David Lang
2006-03-27  7:53               ` Jakub Narebski
2006-03-26  3:19 ` Linus Torvalds
2006-03-26  7:35   ` Ryan Anderson
2006-03-26 21:09     ` Petr Baudis
2006-03-26 10:07   ` Petr Baudis
2006-03-26 10:34     ` Fredrik Kuivinen
2006-03-26 16:33     ` Linus Torvalds
2006-03-26 19:14       ` Petr Baudis
2006-03-26 20:31         ` Petr Baudis
2006-03-26 22:22         ` Linus Torvalds
2006-03-26 22:31           ` Petr Baudis
2006-03-26 22:43             ` Junio C Hamano
2006-03-26 23:10               ` Linus Torvalds
2006-03-27  7:30                 ` Junio C Hamano
2006-03-26 23:09             ` Linus Torvalds
2006-03-26 23:26         ` Petr Baudis
2006-03-27 21:59           ` Petr Baudis
