On recording renames


* On recording renames
@ 2006-03-04  4:03 Junio C Hamano
  2006-03-04  4:09 ` Paul Jakma
  2006-03-04  6:16 ` Junio C Hamano
  0 siblings, 2 replies; 5+ messages in thread
From: Junio C Hamano @ 2006-03-04  4:03 UTC (permalink / raw)
  To: git; +Cc: paul

Recently some people said "I want to tell git that I renamed
fileA to fileB" on the list and #git channel, and I made some
vague comments on the issue that might have confused people.
This message is to clarify where I stand.

First of all, I understand some people "want to tell git", but
at the same time, I know very well that "git does not even want
to hear".  It does not care about names -- it only tracks
contents [*1*].

Having said that, there are at least two cases where renamed
files matter in practice from the end user's point of view:
diff and merge.

For diff, it is often __very__ frustrating when you know you
renamed hello.c to world.c and then edited just a bit and "git
diff -M hello.c world.c" does not notice.  You can do one of two
things to help:

 (1) figure out why git (diffcore-rename) does not think they
     are similar enough, and improve its similarity estimator.
     Some of you who are paying attention to what is in my
     "next" branch might have noticed that I have been working
     in this area recently.

 (2) add a way to tell git-diff-files to compare hello.c in the
     index with working tree file world.c:

     	$ git-diff-files -p 'hello.c->world.c'

And people who "want to tell git" are after the second way.
Although this can probably be implemented as an extension to
diffcore-rename [*2*], I have to say that is kludging around the
real problem.  Only as a workaround for pathological cases it
may be OK, but I am really reluctant to accept such a change
without trying avenue (1) first.
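
To make avenue (1) concrete, here is a minimal sketch of what a
similarity estimator could look like.  It is NOT the diffcore-rename
algorithm; it assumes a plain line-based measure (characters on lines
common to both files, divided by the size of the larger file), and the
names similarity(), looks_like_rename() and the 0.5 threshold are made
up for illustration.

```python
from collections import Counter


def similarity(src: str, dst: str) -> float:
    """Share of the larger file made up of lines present in both files.

    Hypothetical measure for illustration only, not git's estimator.
    """
    src_lines = Counter(src.splitlines(keepends=True))
    dst_lines = Counter(dst.splitlines(keepends=True))
    shared = src_lines & dst_lines  # multiset intersection
    copied = sum(len(line) * count for line, count in shared.items())
    larger = max(len(src), len(dst))
    return copied / larger if larger else 1.0


def looks_like_rename(src: str, dst: str, threshold: float = 0.5) -> bool:
    """Treat dst as a renamed src when enough content was carried over."""
    return similarity(src, dst) >= threshold


# Toy contents: world.c keeps three of hello.c's four lines.
hello_c = "a\nb\nc\nd\n"
world_c = "a\nb\nc\nX\n"
print(similarity(hello_c, world_c), looks_like_rename(hello_c, world_c))
```

Improving the estimator then means changing how similarity() counts
carried-over content, without the user having to say anything.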

About merge.  Currently the recursive merge strategy claims to
handle renames and I've seen it handle renames well in some
cases.  However, it only uses three trees.  The rename between
merge base and one head, and the rename between merge base and
the other head are computed, compared and the usual three-way merge
rules are applied (e.g. if you kept it there while she moved it
to somewhere else, result is to move it to somewhere else).  If
two development tracks forked a long time ago are being merged,
and corresponding files deviated from each other beyond
recognition, there is no way for any heuristics to figure out
one is a rename-edit of the other only by looking at these three
trees:

       a1--a2--a3--a4--a5--A
      /                     \
  ---O---b1--b2--b3--b4--B---* 

    O has hello.c
    a1 renames file hello.c to world.c and a2-a5-A modifies world.c
    b1-b4-B modifies hello.c 
    we are about to merge A and B

    comparing O and A may not notice O's hello.c and A's world.c
    are similar!

But you are allowed to write a new merge strategy that is more
careful about renames.  There is no reason you can only look at
three trees.  Such a merge strategy, when given commit A and B,
would walk the history back, running "diff-tree -M" for each
commit along the way, and difference between O's hello.c and
a1's world.c would be hopefully *much* smaller than O's hello.c
and A's world.c -- even the current similarity estimator may
recognize it is a rename.

That is the first thing I'd like to see.  I do not want to even
think about recording renames in commit objects before anybody
explores that avenue.
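
The history-walking idea can be sketched as follows.  This is a toy
model under stated assumptions, not a merge strategy: each commit is a
dict mapping path to content, the history is a simple list from O to
A, and similarity() is a made-up line-based measure standing in for
what "diff-tree -M" would compute.  trace_rename() and the 0.5
threshold are hypothetical names and values.

```python
from collections import Counter


def similarity(src: str, dst: str) -> float:
    """Hypothetical line-based similarity, not git's real estimator."""
    shared = Counter(src.splitlines(True)) & Counter(dst.splitlines(True))
    copied = sum(len(line) * count for line, count in shared.items())
    larger = max(len(src), len(dst))
    return copied / larger if larger else 1.0


def trace_rename(history, path, threshold=0.5):
    """Follow `path` through consecutive snapshots, one commit at a time.

    Returns the path it ends up under, or None when the trail is lost.
    """
    for old, new in zip(history, history[1:]):
        if path in new:
            continue  # still there under the same name
        added = [p for p in new if p not in old]
        best = max(added, key=lambda p: similarity(old[path], new[p]),
                   default=None)
        if best is None or similarity(old[path], new[best]) < threshold:
            return None
        path = best
    return path


# O has hello.c; a1 renames it with a small edit; A has drifted far away.
o_hello = "".join(f"line {i}\n" for i in range(8))
a1_world = "".join(f"line {i}\n" for i in range(7)) + "edit 7\n"
a_world = "line 0\nline 1\n" + "".join(f"edit {i}\n" for i in range(2, 8))
history = [{"hello.c": o_hello}, {"world.c": a1_world}, {"world.c": a_world}]

print(trace_rename(history, "hello.c"))                     # step by step
print(trace_rename([history[0], history[-1]], "hello.c"))   # O vs A only
```

With these toy contents the step-by-step walk finds world.c, while the
direct comparison of O and A gives up, which is the situation in the
picture above.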

Even with that, if O's hello.c and a1's world.c are _so_
different that the changes are beyond recognition, you
_might_ want to "tell git" about the rename, or even record such
a rename in the commit object a1.  But I personally doubt it
would help anything in practice.  After such a huge rewrite
between O->a1, merging between A and B will be very hard anyway,
and you would need some off-line method to extract the intention
of the developer who originally did a1 commit while merging A
and B.  And when you inspect that change yourself, you may
decide that O's hello.c corresponds to a1's world.c.  At that
point you will be hand merging the mess, so your being able to tell
git about it would not help you much.

[Footnotes]

*1* This is by design, and I am not going to debate if that is a
good design or not here.  There is a "Linus once said 'you say
users know better but users cannot be trusted -- trust me'"
factor involved.  I am a trusting kind and somebody needs to
convince me not to trust Linus.

*2* You would supply "in what you are comparing, the source path
hello.c and destination path world.c are similar with similarity
index 80% -- do not bother to estimate yourself, I am telling
you their similarity index so trust me".

	$ git-diff-files -p -M --similarity='hello.c world.c 80%'


* Re: On recording renames
  2006-03-04  4:03 On recording renames Junio C Hamano
@ 2006-03-04  4:09 ` Paul Jakma
  2006-03-04  6:16 ` Junio C Hamano
  1 sibling, 0 replies; 5+ messages in thread
From: Paul Jakma @ 2006-03-04  4:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Hi Junio,

On Fri, 3 Mar 2006, Junio C Hamano wrote:

> Recently some people said "I want to tell git that I renamed fileA 
> to fileB" on the list and #git channel,

> *1* This is by design, and I am not going to debate if that is a
> good design or not here.

Thanks for your detailed email. Before I continue digesting it, I'd 
like to revise my original proposal (having now read 
more of the 'design philosophy'):

- I want to tell git that object A is related to object B between two
   trees.

I think at this stage a proof-of-concept would be an idea. I'll try 
to get back with that before the end of the month.

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
we just switched to Sprint.


* Re: On recording renames
  2006-03-04  4:03 On recording renames Junio C Hamano
  2006-03-04  4:09 ` Paul Jakma
@ 2006-03-04  6:16 ` Junio C Hamano
  2006-03-04 13:19   ` Junio C Hamano
  1 sibling, 1 reply; 5+ messages in thread
From: Junio C Hamano @ 2006-03-04  6:16 UTC (permalink / raw)
  To: git; +Cc: paul

Junio C Hamano <junkio@cox.net> writes:

>
>        a1--a2--a3--a4--a5--A
>       /                     \
>   ---O---b1--b2--b3--b4--B---* 
>
>     O has hello.c
>     a1 renames file hello.c to world.c and a2-a5-A modifies world.c
>     b1-b4-B modifies hello.c 
>     we are about to merge A and B
>
>     comparing O and A may not notice O's hello.c and A's world.c
>     are similar!
>
> But you are allowed to write a new merge strategy that is more
> careful about renames.  There is no reason you can only look at
> three trees.  Such a merge strategy, when given commit A and B,
> would walk the history back, running "diff-tree -M" for each
> commit along the way, and difference between O's hello.c and
> a1's world.c would be hopefully *much* smaller than O's hello.c
> and A's world.c -- even the current similarity estimator may
> recognize it is a rename.

A bit more on merges and renames.  The thread that started on
Dec 16 2005 by Don Zickus is about a case that anybody
interested in renaming merge should think about.  Unfortunately
gmane web interface says it is "down for maintenance" so I
cannot give a URL, but the message IDs of key messages are:

        <68948ca0512161205x3d5921bfm3bfcaa64f988eb99@mail.gmail.com>
        <7vbqzgbcyv.fsf@assigned-by-dhcp.cox.net>

The whole thread is worth reading, but the punch line is:

        The transition happened over time with multiple commits.
        You cannot record "this is the rename" by attributing that
        information to one particular commit.


* Re: On recording renames
  2006-03-04  6:16 ` Junio C Hamano
@ 2006-03-04 13:19   ` Junio C Hamano
  2006-03-04 17:36     ` Linus Torvalds
  0 siblings, 1 reply; 5+ messages in thread
From: Junio C Hamano @ 2006-03-04 13:19 UTC (permalink / raw)
  To: git; +Cc: paul

Junio C Hamano <junkio@cox.net> writes:

> A bit more on merges and renames...
> The whole thread is worth reading, but the punch line is:
>
>         The transition happened over time with multiple commits.
>         You cannot record "this is the rename" by attributing that
>         information to one particular commit.

After re-reading that thread, and especially the analysis of the
history of that part of the kernel source I did back then, I am
again convinced that Linus was right when he said "file renames
do not matter".  That real-life example shows how inadequate
file boundaries are when dealing with content changes.

An ideal merge strategy would handle the case where pieces of
code gradually move around across file boundaries.  I do not
think this is something you can sensibly do by recording file
rename history.  It would not help the situation a bit even if
you gave each file (or content or object or whatever you want to
call it) a persistent ID.

One way (now, it is my turn to handwave) to do such a merge
might be to take the whole tree as if it were a flat single file
(think of it as a concatenation of all files in the tree) with
each line tagged with the pathname.  You and your friend would
start from something like this.  A single file that describes
topics of interest to both of you:

                    notes.txt:Kernel Topics
                    notes.txt: - filesystem
                    notes.txt: - scheduler
                    notes.txt: - devices
                    notes.txt:Cool Git Topics
                    notes.txt: - git-cvsserver
                    notes.txt: - Cogito

And your friend splits this into two files and starts editing,
while you edit the original file:

        your friend                     you

        linux.txt:Kernel Topics         notes.txt:Kernel Topics
        linux.txt: - filesystem         notes.txt: - filesystem
        linux.txt: - scheduler          notes.txt: - scheduler
        linux.txt: - devices            notes.txt: - devices
        linux.txt: - stable driver API  notes.txt: - mm
        git.txt:Cool Git Topics         notes.txt:Cool Git Topics
        git.txt: - git-cvsserver        notes.txt: - git-cvsserver
        git.txt: - Cogito               notes.txt: - gitview
                                        notes.txt: - Cogito
                                        notes.txt: - StGIT
                                        notes.txt: - diff --cc

Now you would want to compare notes and merge them.  When
comparing these two "trees", the clever merge algorithm would
take this two-column thingy and merge both labels
(i.e. pathnames) and contents:

                    linux.txt:Kernel Topics
                    linux.txt: - filesystem
                    linux.txt: - scheduler
                    linux.txt: - devices
                    linux.txt: - mm
                    linux.txt: - stable driver API
                    git.txt:Cool Git Topics
                    git.txt: - git-cvsserver
                    git.txt: - gitview
                    git.txt: - Cogito
                    git.txt: - StGIT
                    git.txt: - diff --cc

It could even guess that the lines you touched are related to the
hunk your friend moved to another file (iow, your friend gave a
new label to the region you touched), and label your new line
with the same pathname as surrounding lines.

I suspect this is weave merge taken to its extreme, but I am
handwaving so please do not ask me how I would propose to
implement it ;-).  The point really is that file is a poor unit
of operation when dealing with changes.
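
For the notes.txt example only, the labeled-line idea can be sketched
like this.  It is a toy under heavy assumptions: both sides only ADD
lines relative to the common base, the friend's labels win for base
lines, and a line you added inherits the label of the line merged just
before it.  The names insertions() and merge_labeled(), and the choice
to place your additions before your friend's at the same spot, are
arbitrary.  It uses Python's difflib to align the contents.

```python
from difflib import SequenceMatcher


def insertions(base, other):
    """Map base position -> indices of lines `other` added before it."""
    added = {}
    matcher = SequenceMatcher(None, base, other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            added.setdefault(i1, []).extend(range(j1, j2))
        elif tag != "equal":
            raise NotImplementedError("sketch handles pure insertions only")
    return added


def merge_labeled(base, theirs, yours):
    """Merge lists of (path, text) pairs; `theirs` decides the paths."""
    base_text = [text for _, text in base]
    their_text = [text for _, text in theirs]
    your_text = [text for _, text in yours]

    # Path the other side gave to each line that came from the base.
    relabel = {}
    matcher = SequenceMatcher(None, base_text, their_text, autojunk=False)
    for tag, i1, i2, j1, _ in matcher.get_opcodes():
        if tag == "equal":
            for k in range(i2 - i1):
                relabel[i1 + k] = theirs[j1 + k][0]

    their_add = insertions(base_text, their_text)
    your_add = insertions(base_text, your_text)

    merged = []
    for pos in range(len(base) + 1):
        for j in your_add.get(pos, []):
            # Your new line takes the path of the line merged before it.
            path = merged[-1][0] if merged else yours[j][0]
            merged.append((path, yours[j][1]))
        for j in their_add.get(pos, []):
            merged.append(theirs[j])
        if pos < len(base):
            merged.append((relabel[pos], base_text[pos]))
    return merged


def tree(*files):
    """Flatten (path, [lines]) pairs into one labeled line list."""
    return [(path, text) for path, lines in files for text in lines]


kernel = ["Kernel Topics", " - filesystem", " - scheduler", " - devices"]
base = tree(("notes.txt",
             kernel + ["Cool Git Topics", " - git-cvsserver", " - Cogito"]))
theirs = tree(("linux.txt", kernel + [" - stable driver API"]),
              ("git.txt", ["Cool Git Topics", " - git-cvsserver", " - Cogito"]))
yours = tree(("notes.txt",
              kernel + [" - mm", "Cool Git Topics", " - git-cvsserver",
                        " - gitview", " - Cogito", " - StGIT", " - diff --cc"]))

for path, text in merge_labeled(base, theirs, yours):
    print(f"{path}:{text}")
```

Deletions, edits and conflicting changes are exactly the hard part and
are left out here on purpose.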


* Re: On recording renames
  2006-03-04 13:19   ` Junio C Hamano
@ 2006-03-04 17:36     ` Linus Torvalds
  0 siblings, 0 replies; 5+ messages in thread
From: Linus Torvalds @ 2006-03-04 17:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, paul

On Sat, 4 Mar 2006, Junio C Hamano wrote:
> 
> An ideal merge strategy would handle the case where pieces of
> code gradually move around across file boundaries.  I do not
> think this is something you can sensibly do by recording file
> rename history.  It would not help the situation a bit even if
> you gave each file (or content or object or whatever you want to
> call it) a persistent ID.

Actually, we have an absolutely perfect example of this much closer to 
home.

I originally did the "rev-list split" series on an older version of git, 
before you did the --objects-edge and the full pathname hashing 
improvements. But when I was done, you'd merged that, and I needed to 
merge my rev-list.c split with your improvements in order to send it to 
you.

Now, the whole file hadn't actually been renamed, but about 50% of that 
file had been split into a new one. So effectively you had a merge where 
part of the new stuff had to be merged into another file.

Now, I think this is actually more common than renames in many ways. It's 
not a "complete" rename, but as far as _part_ of your changes were
concerned, it was one.
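
A rough way to see such a "partial rename" is to ask where each line
of the old file ended up.  The sketch below is only an illustration
under simple assumptions: exact line matching, a tree given as a dict
of path to content, and a made-up helper name where_lines_went().  The
file contents are toy stand-ins, not the real rev-list.c.

```python
from collections import Counter


def where_lines_went(old_text, new_tree):
    """Return {path: fraction of old lines now found in that path}."""
    old_lines = old_text.splitlines()
    if not old_lines:
        return {}
    # Lines still unclaimed in each file of the new tree.
    remaining = {path: Counter(text.splitlines())
                 for path, text in new_tree.items()}
    landed = Counter()
    for line in old_lines:
        for path, lines in remaining.items():
            if lines[line] > 0:
                lines[line] -= 1  # each new line can be claimed once
                landed[path] += 1
                break
    return {path: count / len(old_lines) for path, count in landed.items()}


# Toy contents: three of four lines move out into a new file.
old_rev_list = "parse\nsetup\nwalk\nmain\n"
new_tree = {"rev-list.c": "main\n", "revision.c": "parse\nsetup\nwalk\n"}
print(where_lines_went(old_rev_list, new_tree))
```

A file-level rename record has no place to express a result like
"most of this file now lives over there".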

And yes, such a split can be something that is done in stages, again 
exactly the same way about 85% of rev-list.c was moved into revision.c in 
two stages: the first stage was the argument parsing and setup, and the 
second stage was the actual revision walking logic.

		Linus

