* On recording renames
From: Junio C Hamano @ 2006-03-04 4:03 UTC (permalink / raw)
To: git; +Cc: paul
Recently some people said "I want to tell git that I renamed
fileA to fileB" on the list and #git channel, and I made some
vague comments on the issue that might have confused people.
This message is to clarify where I stand.
First of all, I understand some people "want to tell git", but
at the same time, I know very well that "git does not even want
to hear". It does not care about names -- it only tracks
contents [*1*].
Having said that, there are at least two cases where renamed
files matter in practice from the end user's point of view:
diff and merge.
For diff, it is often __very__ frustrating when you know you
renamed hello.c to world.c and then edited just a bit and "git
diff -M hello.c world.c" does not notice. You can do one of two
things to help:
 (1) figure out why git (diffcore-rename) does not think they
     are similar enough, and improve its similarity estimator.
     Some of you who are paying attention to what is in my
     "next" branch might have noticed that I have been working
     in this area recently.
 (2) add a way to tell git-diff-files to compare hello.c in the
     index with working tree file world.c:
        $ git-diff-files -p 'hello.c->world.c'
And people who "want to tell git" are after the second way.
Although this can probably be implemented as an extension to
diffcore-rename [*2*], I have to say that is kludging around the
real problem. It may be OK as a workaround for pathological
cases, but I am really reluctant to accept such a change
without trying avenue (1) first.
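To make avenue (1) a little more concrete, here is a minimal sketch of
what a similarity estimator does. It is not diffcore-rename's actual
algorithm; it simply scores two blobs by the fraction of lines they
share, using Python's difflib. The names similarity() and
looks_like_rename() and the 0.5 threshold are assumptions made up for
this illustration.

```python
import difflib


def similarity(old_text: str, new_text: str) -> float:
    """Score in [0, 1]: how much line content the two blobs share."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    return difflib.SequenceMatcher(None, old_lines, new_lines).ratio()


def looks_like_rename(old_text: str, new_text: str, threshold: float = 0.5) -> bool:
    """Treat the pair as a rename when the score reaches the (assumed) threshold."""
    return similarity(old_text, new_text) >= threshold


# hello.c was renamed to world.c and then edited "just a bit".
hello_c = """#include <stdio.h>
int main(void)
{
    puts("hello");
    return 0;
}
"""
world_c = hello_c.replace('puts("hello")', 'puts("hello, world")')

print(similarity(hello_c, world_c))
print(looks_like_rename(hello_c, world_c))
```

A line-based score like this is crude: one heavily reformatted file can
fall below any fixed threshold, which is exactly the kind of case where
improving the estimator would pay off.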
About merge. Currently the recursive merge strategy claims to
handle renames and I've seen it handle renames well in some
cases. However, it only uses three trees. The rename between
merge base and one head, and the rename between merge base and
the other head are computed, compared and usual three-way merge
rules are applied (e.g. if you kept it there while she moved it
to somewhere else, the result is to move it to somewhere else). If
two development tracks that forked a long time ago are being merged,
and corresponding files deviated from each other beyond
recognition, there is no way for any heuristics to figure out
one is a rename-edit of the other only by looking at these three
trees:
         a1--a2--a3--a4--a5--A
        /                     \
    ---O---b1--b2--b3--b4--B---*

    O has hello.c
    a1 renames file hello.c to world.c and a2-a5-A modifies world.c
    b1-b4-B modifies hello.c
    we are about to merge A and B

    comparing O and A may not notice O's hello.c and A's world.c
    are similar!
But you are allowed to write a new merge strategy that is more
careful about renames. There is no reason you can only look at
three trees. Such a merge strategy, when given commit A and B,
would walk the history back, running "diff-tree -M" for each
commit along the way, and difference between O's hello.c and
a1's world.c would be hopefully *much* smaller than O's hello.c
and A's world.c -- even the current similarity estimator may
recognize it is a rename.
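The idea of walking the history one commit at a time can be sketched as
follows. This is only a toy model under stated assumptions: each commit
is a plain dict mapping path to file text, the history is a linear list
from O to A, and rename detection is a naive line-based comparison
rather than what "diff-tree -M" really does. The helper names
(detect_renames, follow_path) are hypothetical.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Line-based similarity score in [0, 1]."""
    return difflib.SequenceMatcher(None, a.splitlines(), b.splitlines()).ratio()


def detect_renames(old_tree: dict, new_tree: dict, threshold: float = 0.5) -> dict:
    """Pair each path that vanished with the most similar path that appeared."""
    removed = [p for p in old_tree if p not in new_tree]
    added = [p for p in new_tree if p not in old_tree]
    renames = {}
    for src in removed:
        best = max(added, key=lambda dst: similarity(old_tree[src], new_tree[dst]),
                   default=None)
        if best is not None and similarity(old_tree[src], new_tree[best]) >= threshold:
            renames[src] = best
    return renames


def follow_path(history: list, path: str, threshold: float = 0.5):
    """Track `path` through consecutive trees, one commit at a time."""
    for old_tree, new_tree in zip(history, history[1:]):
        path = detect_renames(old_tree, new_tree, threshold).get(path, path)
        if path not in new_tree:
            return None  # deleted, or changed beyond recognition in one step
    return path


def make(lines):
    return "\n".join(lines) + "\n"


# O has hello.c; a1 renames it to world.c; every commit up to A rewrites
# two more lines, so A's world.c shares nothing with O's hello.c.
lines = [f"line {i}" for i in range(10)]
history = [{"hello.c": make(lines)}]
for step in range(5):
    lines[2 * step] = f"rewritten {2 * step}"
    lines[2 * step + 1] = f"rewritten {2 * step + 1}"
    history.append({"world.c": make(lines)})

print(detect_renames(history[0], history[-1]))  # three-tree view: O vs A only
print(follow_path(history, "hello.c"))          # step-by-step view
```

Comparing only O and A finds nothing in this toy history, while the
step-by-step walk sees a small O->a1 difference and can carry the path
forward from there.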
That is the first thing I'd like to see. I do not want to even
think about recording renames in commit objects before anybody
explores that avenue.
Even with that, if O's hello.c and a1's world.c are _so_
different that the changes are beyond recognition, you
_might_ want to "tell git" about the rename, or even record such
a rename in the commit object a1. But I personally doubt it
would help anything in practice. After such a huge rewrite
between O->a1, merging between A and B will be very hard anyway,
and you would need some off-line method to extract the intention
of the developer who originally did a1 commit while merging A
and B. And when you inspect that change yourself, you may
decide O's hello.c corresponds to a1's world.c yourself. At that
point you will be hand merging the mess, so your being able to tell
git about it would not help you much.
[Footnotes]
*1* This is by design, and I am not going to debate if that is a
good design or not here. There is a "Linus once said 'you say
users know better but users cannot be trusted -- trust me'"
factor involved. I am a trusting kind and somebody needs to
convince me not to trust Linus.
*2* You would supply "in what you are comparing, the source path
hello.c and destination path world.c are similar with similarity
index 80% -- do not bother to estimate yourself, I am telling
you their similarity index so trust me".
$ git-diff-files -p -M --similarity='hello.c world.c 80%'
* Re: On recording renames
From: Paul Jakma @ 2006-03-04 4:09 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
Hi Junio,
On Fri, 3 Mar 2006, Junio C Hamano wrote:
> Recently some people said "I want to tell git that I renamed fileA
> to fileB" on the list and #git channel,
> *1* This is by design, and I am not going to debate if that is a
> good design or not here.
Thanks for your detailed email. Before I continue digesting it, I'd
like to revise my original proposal (having now read
more of the 'design philosophy'):
- I want to tell git that object A is related to object B between two
trees.
I think at this stage a proof-of-concept would be an idea. I'll try
to get back with that before the end of the month.
regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
Fortune:
we just switched to Sprint.
* Re: On recording renames
From: Junio C Hamano @ 2006-03-04 6:16 UTC (permalink / raw)
To: git; +Cc: paul
Junio C Hamano <junkio@cox.net> writes:
>
>          a1--a2--a3--a4--a5--A
>         /                     \
>     ---O---b1--b2--b3--b4--B---*
>
>     O has hello.c
>     a1 renames file hello.c to world.c and a2-a5-A modifies world.c
>     b1-b4-B modifies hello.c
>     we are about to merge A and B
>
>     comparing O and A may not notice O's hello.c and A's world.c
>     are similar!
>
> But you are allowed to write a new merge strategy that is more
> careful about renames. There is no reason you can only look at
> three trees. Such a merge strategy, when given commit A and B,
> would walk the history back, running "diff-tree -M" for each
> commit along the way, and difference between O's hello.c and
> a1's world.c would be hopefully *much* smaller than O's hello.c
> and A's world.c -- even the current similarity estimator may
> recognize it is a rename.
A bit more on merges and renames. The thread that started on
Dec 16 2005 by Don Zickus is about a case that anybody
interested in renaming merge should think about. Unfortunately
the gmane web interface says it is "down for maintenance" so I
cannot give a URL, but the message IDs of key messages are:
<68948ca0512161205x3d5921bfm3bfcaa64f988eb99@mail.gmail.com>
<7vbqzgbcyv.fsf@assigned-by-dhcp.cox.net>
The whole thread is worth reading, but the punch line is:
The transition happened over time with multiple commits.
You cannot record "this is the rename" by attributing that
information to one particular commit.
* Re: On recording renames
From: Junio C Hamano @ 2006-03-04 13:19 UTC (permalink / raw)
To: git; +Cc: paul
Junio C Hamano <junkio@cox.net> writes:
> A bit more on merges and renames...
> The whole thread is worth reading, but the punch line is:
>
> The transition happened over time with multiple commits.
> You cannot record "this is the rename" by attributing that
> information to one particular commit.
After re-reading that thread, and especially the analysis of the
history of that part of the kernel source I did back then, I am
again convinced that Linus was right when he said "file renames
do not matter". That real-life example shows how inadequate
file boundaries are when dealing with content changes.
An ideal merge strategy would handle the case where pieces of
code gradually move around across file boundaries. I do not
think this is something you can sensibly do by recording file
rename history. It would not help the situation a bit even if
you gave each file (or content or object or whatever you want to
call it) a persistent ID.
One way (now, it is my turn to handwave) to do such a merge
might be to take the whole tree as if it were a flat single file
(think of it as a concatenation of all files in the tree) with
each line tagged with the pathname. You and your friend would
start from something like this. A single file that describes
topics of interest to both of you:
notes.txt:Kernel Topics
notes.txt: - filesystem
notes.txt: - scheduler
notes.txt: - devices
notes.txt:Cool Git Topics
notes.txt: - git-cvsserver
notes.txt: - Cogito
And your friend splits this into two files and starts editing,
while you edit the original file:
    your friend                     you

    linux.txt:Kernel Topics         notes.txt:Kernel Topics
    linux.txt: - filesystem         notes.txt: - filesystem
    linux.txt: - scheduler          notes.txt: - scheduler
    linux.txt: - devices            notes.txt: - devices
    linux.txt: - stable driver API  notes.txt: - mm
    git.txt:Cool Git Topics         notes.txt:Cool Git Topics
    git.txt: - git-cvsserver        notes.txt: - git-cvsserver
    git.txt: - Cogito               notes.txt: - gitview
                                    notes.txt: - Cogito
                                    notes.txt: - StGIT
                                    notes.txt: - diff --cc
Now you would want to compare notes and merge them. When
comparing these two "trees", the clever merge algorithm would
treat this two-column thingy and merge both labels
(i.e. pathnames) and contents:
linux.txt:Kernel Topics
linux.txt: - filesystem
linux.txt: - scheduler
linux.txt: - devices
linux.txt: - mm
linux.txt: - stable driver API
git.txt:Cool Git Topics
git.txt: - git-cvsserver
git.txt: - gitview
git.txt: - Cogito
git.txt: - StGIT
git.txt: - diff --cc
It could even guess that the lines you touched are related to the
hunk your friend moved to another file (iow, your friend gave a
new label to the region you touched), and label your new line
with the same pathname as surrounding lines.
I suspect this is weave merge taken to its extreme, but I am
handwaving so please do not ask me how I would propose to
implement it ;-). The point really is that a file is a poor unit
of operation when dealing with changes.
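For what it is worth, the notes.txt example above can be reproduced
with a very small sketch. It is nowhere near the "clever merge
algorithm" handwaved about here: it assumes a non-empty common base,
that both sides only insert lines (no deletions or rewrites), and that
the friend's pathname labels win for every base line. The function
names flatten(), align() and merge() are invented for this
illustration.

```python
import difflib


def flatten(tree: dict) -> list:
    """Concatenate all files into one list of (path, line) pairs."""
    return [(path, line) for path, text in tree.items() for line in text.splitlines()]


def align(base_lines: list, side: list):
    """Return the label a side gives each base line, and the side's
    inserted (path, line) pairs keyed by the base index they precede."""
    side_lines = [line for _, line in side]
    labels, inserts = {}, {}
    matcher = difflib.SequenceMatcher(None, base_lines, side_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for k in range(i2 - i1):
                labels[i1 + k] = side[j1 + k][0]
        elif tag == "insert":
            inserts[i1] = side[j1:j2]
        else:
            raise ValueError("this sketch only handles pure insertions")
    return labels, inserts


def merge(base: list, friend: list, you: list) -> list:
    """Merge contents and labels; your new lines take the friend's label
    of the neighbouring base line."""
    base_lines = [line for _, line in base]
    friend_labels, friend_ins = align(base_lines, friend)
    _, your_ins = align(base_lines, you)
    merged = []
    for i in range(len(base_lines) + 1):
        nearby = friend_labels[max(i - 1, 0)]
        merged.extend((nearby, line) for _, line in your_ins.get(i, []))
        merged.extend(friend_ins.get(i, []))
        if i < len(base_lines):
            merged.append((friend_labels[i], base_lines[i]))
    return merged


kernel = "Kernel Topics\n - filesystem\n - scheduler\n - devices\n"
base = flatten({"notes.txt": kernel + "Cool Git Topics\n - git-cvsserver\n - Cogito\n"})
friend = flatten({
    "linux.txt": kernel + " - stable driver API\n",
    "git.txt": "Cool Git Topics\n - git-cvsserver\n - Cogito\n",
})
you = flatten({
    "notes.txt": kernel + " - mm\nCool Git Topics\n - git-cvsserver\n"
                 " - gitview\n - Cogito\n - StGIT\n - diff --cc\n",
})

merged = merge(base, friend, you)
for path, line in merged:
    print(f"{path}:{line}")
```

With these inputs the printed lines match the merged listing shown
earlier. Anything beyond pure insertions (moved hunks, edited lines,
conflicting additions) would need real merge machinery.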
* Re: On recording renames
From: Linus Torvalds @ 2006-03-04 17:36 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, paul
On Sat, 4 Mar 2006, Junio C Hamano wrote:
>
> An ideal merge strategy would handle the case where pieces of
> code gradually move around across file boundaries. I do not
> think this is something you can sensibly do by recording file
> rename history. It would not help the situation a bit even if
> you gave each file (or content or object or whatever you want to
> call it) a persistent ID.
Actually, we have an absolutely perfect example of this much closer to
home.
I originally did the "rev-list split" series on an older version of git,
before you did the --objects-edge and the full pathname hashing
improvements. But when I was done, you'd merged that, and I needed to
merge my rev-list.c split with your improvements in order to send it to
you.
Now, the whole file hadn't actually been renamed, but about 50% of that
file had been split into a new one. So effectively you had a merge where
part of the new stuff had to be merged into another file.
Now, I think this is actually more common than renames in many ways. It's
not a "complete" rename, but as far as _part_ of your changes were
concerned, it was one.
And yes, such a split can be something that is done in stages, again
exactly the same way about 85% of rev-list.c was moved into revision.c in
two stages: the first stage was the argument parsing and setup, and the
second stage was the actual revision walking logic.
Linus