* another perspective on renames.
@ 2005-04-14 22:22 C. Scott Ananian
2005-04-15 5:16 ` Paul Jackson
0 siblings, 1 reply; 4+ messages in thread
From: C. Scott Ananian @ 2005-04-14 22:22 UTC (permalink / raw)
To: git
Perhaps our thinking is being clouded by 'how other SCMs do things' ---
do we *really* need extra rename metadata? As Linus pointed out, as long
as a commit is done immediately after a rename (ie before the renamed file
is changed) the tree object contains all the information one needs: you
can notice that a given object's content-hash is named 'foo' in the first
version and 'bar' in the second version.
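To make the idea concrete, here is a minimal sketch of detecting a pure rename by comparing two snapshots. It assumes each tree can be flattened to a path -> content-hash mapping; the helper names (blob_id, snapshot, detect_renames) are made up for illustration, and plain SHA-1 over the raw bytes stands in for whatever hash the object store really uses.

```python
from __future__ import annotations

import hashlib


def blob_id(content: bytes) -> str:
    """Identity of a file is the hash of its contents (simplified: raw SHA-1)."""
    return hashlib.sha1(content).hexdigest()


def snapshot(files: dict[str, bytes]) -> dict[str, str]:
    """Flatten a tree to a path -> content-hash mapping (hypothetical helper)."""
    return {path: blob_id(data) for path, data in files.items()}


def detect_renames(old: dict[str, str], new: dict[str, str]) -> list[tuple[str, str]]:
    """Pair each path that disappeared with a new path carrying the same hash."""
    removed = {p: h for p, h in old.items() if p not in new}
    added = {p: h for p, h in new.items() if p not in old}

    # Index the removed paths by content hash.
    by_hash: dict[str, list[str]] = {}
    for path, h in sorted(removed.items()):
        by_hash.setdefault(h, []).append(path)

    renames = []
    for path, h in sorted(added.items()):
        candidates = by_hash.get(h)
        if candidates:
            renames.append((candidates.pop(0), path))
    return renames


before = snapshot({"foo": b"hello\n", "keep": b"unchanged\n"})
after = snapshot({"bar": b"hello\n", "keep": b"unchanged\n"})
renames = detect_renames(before, after)
```

This only works when the commit happens before the renamed file is edited, which is exactly the condition stated above.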
Ingo thought that this was insufficient because two *different* objects
(ie having different revision histories) might be mutated to a point where
they had the *same* contents (and then would be condensed into a single
blob). But isn't that a feature of the git-fs history generally (ie not a
renaming-specific issue)?
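A toy illustration of that condensing, assuming a content-addressed store keyed by a hash of the bytes. BlobStore and the file contents are invented for the example; this is not git's actual on-disk format.

```python
import hashlib


class BlobStore:
    """Toy content-addressed store: one entry per distinct content."""

    def __init__(self):
        self.blobs = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        self.blobs[key] = content  # storing the same bytes twice is a no-op
        return key


store = BlobStore()

# Two files with unrelated starting points are edited until they agree.
foo_history = [store.put(b"int x;\n"), store.put(b"int y;\n")]
bar_history = [store.put(b"long y;\n"), store.put(b"int y;\n")]

# The final revisions share one blob; nothing records which path led to it.
same_blob = foo_history[-1] == bar_history[-1]
```

No rename is involved here at all, which is the point: the join happens whenever contents converge.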
One solution would be to invent a new 'file-revision-history' annotation
on top of git-fs in order to keep these derivation paths separate...
...but perhaps we might think of this as a 'feature' of our SCM instead?
The 'history' of a file may have join points where a single 'content' may
have been derived by two or more completely different paths. Explicit
guidance to the front-end tools is required to 'unmerge' these files after
this occurs (ie updating the directory cache for one, but not the others).
This makes sense for include/arch/{foo,bar}/baz.h, but maybe not so much
for (say) the empty file.
Anyway, maybe it's worth thinking a little about an SCM in which this is a
feature, instead of (or in addition to) automatically assuming this is a
bug we need to add infrastructure to work around.
--scott
PBFORTUNE Soviet cryptographic D5 SLBM MI5 CIA postcard WASHTUB [Hello to all my fans in domestic surveillance]
explosion Sigint Bush ODEARL FJHOPEFUL assassination Uzi Hussein Nader
( http://cscott.net/ )
* Re: another perspective on renames.
2005-04-14 22:22 another perspective on renames C. Scott Ananian
@ 2005-04-15 5:16 ` Paul Jackson
2005-04-15 8:27 ` Ingo Molnar
2005-04-15 14:47 ` C. Scott Ananian
0 siblings, 2 replies; 4+ messages in thread
From: Paul Jackson @ 2005-04-15 5:16 UTC (permalink / raw)
To: C. Scott Ananian; +Cc: git
Scott wrote:
> Anyway, maybe it's worth thinking a little about an SCM in which this is a
> feature, instead of (or in addition to) automatically assuming this is a
> bug we need to add infrastructure to work around.
Agreed.
To me, the main purpose in tracking renames is to obtain a deeper
history of the line-by-line changes in a file.
==> But that doesn't seem relevant here.
Last I looked, git has no such history. A given file's contents
are the indivisible atom of the git world, with no fine structure.
This is quite unlike classic SCMs, built on file formats that
track source lines, not files, as the atomic unit.
To me, rename is a special case of the more general case of a
big chunk of code (a portion of a file) that was in one place
either being moved or copied to another place.
I wonder if there might be some way to use the tools that biologists use
to analyze DNA sequences, to track the evolution of source code,
identifying things like common chunks of code that differ in just a few
mutations, and presenting the history of the evolution, at selectable
levels of detail.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
* Re: another perspective on renames.
2005-04-15 5:16 ` Paul Jackson
@ 2005-04-15 8:27 ` Ingo Molnar
2005-04-15 14:47 ` C. Scott Ananian
1 sibling, 0 replies; 4+ messages in thread
From: Ingo Molnar @ 2005-04-15 8:27 UTC (permalink / raw)
To: Paul Jackson; +Cc: C. Scott Ananian, git
* Paul Jackson <pj@engr.sgi.com> wrote:
> Scott wrote:
> > Anyway, maybe it's worth thinking a little about an SCM in which this is a
> > feature, instead of (or in addition to) automatically assuming this is a
> > bug we need to add infrastructure to work around.
>
> Agreed.
>
> To me, the main purpose in tracking renames is to obtain a deeper
> history of the line-by-line changes in a file.
>
> ==> But that doesn't seem relevant here.
>
> Last I looked, git has no such history. A given file's contents are the
> indivisible atom of the git world, with no fine structure.
>
> This is quite unlike classic SCMs, built on file formats that track
> source lines, not files, as the atomic unit.
i believe the fundamental thing to think about is not file or line or
namespace, but 'tracking developer intent'. While keeping in mind that
GIT is not an SCM, all SCMs boil down to this single thing: being able
to track what the developer did and why he did it - to be a useful tool
later on. (SCMs are for humans with bad limitations, who have this
fundamental design bug and keep forgetting things.)
the basic question is, how much to track. The most extreme form of
tracking (just for the sake of visualizing it) would be to have an
eye-position recognizing software attached to a webcam looking at the
developer, and then exactly mapping what he did, how long did he look at
one particular line of code and exactly what did he type when doing
that. [ Perhaps also a thought-reader module in addition, once one is
available. (combined with another module that removes all the swearing)]
but i think Linus is on the right track to suggest that "the file names
don't matter all that much, it's all about the content". Global diffs
might track most types of plain renames, and if they get it wrong - do we
care? Misdetection of renames can happen, but realistically only with
small files and trivial code, which won't have a lot of history.
The only serious type of misdetection would be if two large modules in
two different places in the namespace happen to have exactly the same
content but have a different history (because e.g. they were merged in
via two separate trees, one came from one tree, the other from the other
tree), and the developer renamed both of them in the same commit: in
such a case the global diff would have no way to figure out what the
proper thread of history is. But is this a realistic scenario? If the
two files are nontrivial and have the same content, why weren't they
merged in the namespace in the first place?
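The ambiguous case can be sketched as follows, assuming each commit is reduced to a path -> content-hash mapping. classify_renames and the paths and hash strings are hypothetical, chosen only to show where a content-only comparison runs out of information.

```python
from collections import defaultdict


def classify_renames(old, new):
    """Split hash-matched renames into unambiguous and ambiguous groups.

    old and new map path -> content hash for two consecutive commits.
    """
    removed = defaultdict(list)
    added = defaultdict(list)
    for path, h in old.items():
        if path not in new:
            removed[h].append(path)
    for path, h in new.items():
        if path not in old:
            added[h].append(path)

    certain, ambiguous = [], []
    for h in sorted(removed):
        if h not in added:
            continue  # content vanished entirely: a deletion, not a rename
        sources, targets = sorted(removed[h]), sorted(added[h])
        if len(sources) == 1 and len(targets) == 1:
            certain.append((sources[0], targets[0]))
        else:
            # Identical content on several paths: any pairing is a guess.
            ambiguous.append((sources, targets))
    return certain, ambiguous


old = {"arch/a/mod.c": "h1", "arch/b/mod.c": "h1", "README": "h2"}
new = {"drivers/x.c": "h1", "drivers/y.c": "h1", "DOCS": "h2"}
certain, ambiguous = classify_renames(old, new)
```

Here README -> DOCS is recovered, while the two identical modules land in the ambiguous group, since nothing in the content says which one became drivers/x.c.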
the moment we allow 'namespace' into the picture, things get complex and
ugly. Directory recursion is already a complexity that would have been
nice to avoid.
Ingo
* Re: another perspective on renames.
2005-04-15 5:16 ` Paul Jackson
2005-04-15 8:27 ` Ingo Molnar
@ 2005-04-15 14:47 ` C. Scott Ananian
1 sibling, 0 replies; 4+ messages in thread
From: C. Scott Ananian @ 2005-04-15 14:47 UTC (permalink / raw)
To: Paul Jackson; +Cc: git
On Thu, 14 Apr 2005, Paul Jackson wrote:
> To me, rename is a special case of the more general case of a
> big chunk of code (a portion of a file) that was in one place
> either being moved or copied to another place.
>
> I wonder if there might be some way to use the tools that biologists use
> to analyze DNA sequences, to track the evolution of source code,
> identifying things like common chunks of code that differ in just a few
> mutations, and presenting the history of the evolution, at selectable
> levels of detail.
The rsync algorithm (http://samba.anu.edu.au/rsync/tech_report/node2.html)
is probably a good place to start, although it is relatively sensitive to
mutations. It will be able to efficiently detect identical blocks larger
than some block size N (512 bytes or so for rsync). You might well
consider smaller blocks to be irrelevant. The data can be made
considerably more useful to developers by canonicalizing before searching
(ie, compressing whitespace to ' ', etc)[*]. Note that the identical
regions do *not* have to line up on block boundaries; see the rsync
algorithm for more detail.
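A rough sketch of the block-matching step, in the spirit of the rsync weak/strong checksum scheme described in that report. The function names, the use of SHA-1 as the strong checksum, and the small block size in the example are my assumptions; real rsync differs in its details.

```python
import hashlib

M = 1 << 16  # modulus for the weak rolling checksum


def weak_sum(block: bytes):
    """Weak checksum pair (a, b) for one block."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide the window one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b


def find_shared_blocks(old: bytes, new: bytes, n: int = 512):
    """Return (old_offset, new_offset) pairs for identical n-byte blocks.

    Blocks of `old` are taken at multiples of n; matches in `new` may start
    at any byte offset.
    """
    index = {}
    for off in range(0, len(old) - n + 1, n):
        block = old[off:off + n]
        index.setdefault(weak_sum(block), []).append((off, hashlib.sha1(block).digest()))

    matches = []
    if len(new) < n:
        return matches
    pos = 0
    a, b = weak_sum(new[:n])
    while True:
        hit = None
        candidates = index.get((a, b))
        if candidates:
            # Weak checksum matched: confirm with the strong checksum.
            strong = hashlib.sha1(new[pos:pos + n]).digest()
            hit = next((o for o, d in candidates if d == strong), None)
        if hit is not None:
            matches.append((hit, pos))
            pos += n
            if pos + n > len(new):
                break
            a, b = weak_sum(new[pos:pos + n])
        else:
            if pos + n >= len(new):
                break
            a, b = roll(a, b, new[pos], new[pos + n], n)
            pos += 1
    return matches


# 128 bytes made of eight distinct 16-byte blocks.
old_data = b"".join(i.to_bytes(2, "big") for i in range(64))
# A chunk of old_data moved to an unaligned position in another file.
new_data = b"XYZ" + old_data[32:64] + b"tail"
shared = find_shared_blocks(old_data, new_data, n=16)
```

The matches are found at byte offset 3 in the new data even though the old blocks were indexed only at multiples of 16, which is the "does not have to line up on block boundaries" property.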
I think Linus has made a persuasive case that the 'developer-friendly'
features of an SCM (ie annotate, log, and friends) can be built *on top*
of GIT. This is a perfect example. Since the computation is non-trivial
(although linear in the number of lines of code involved in the history of
a file; ie doesn't depend on the unrelated size of the archive), it might
make sense for the front-end SCM to maintain its own caches --- for
example, of the block and rolling checksums for each file required by the
rsync algorithm. The key point being that these are just *caches*, not
essential history information, and can always be wiped and regenerated.
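One way such a cache might look, as a sketch only: entries are keyed by the content hash, so they can never go stale and can be thrown away at any time. ChecksumCache and block_digests are hypothetical names, and the per-block SHA-1 list stands in for whatever checksums the front-end actually needs.

```python
import hashlib


def block_digests(content: bytes, n: int = 512):
    """The derived data we want to cache: one digest per n-byte block."""
    return [hashlib.sha1(content[i:i + n]).hexdigest() for i in range(0, len(content), n)]


class ChecksumCache:
    """Front-end cache keyed by content hash; safe to wipe at any time."""

    def __init__(self, compute):
        self._compute = compute
        self._entries = {}
        self.misses = 0  # how many times we had to recompute

    def get(self, content: bytes):
        key = hashlib.sha1(content).hexdigest()
        if key not in self._entries:
            self.misses += 1
            self._entries[key] = self._compute(content)
        return self._entries[key]

    def wipe(self):
        self._entries.clear()


cache = ChecksumCache(block_digests)
data = b"x" * 1300
first = cache.get(data)   # computed
second = cache.get(data)  # served from the cache
cache.wipe()
third = cache.get(data)   # regenerated, identical to the first result
```

Because the key is derived from the content alone, wiping the cache loses time but never information.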
The nice 'feature' of this system (some may disagree, I guess) is that it
does *not* depend on extensive programmer annotation of file changes (ie,
chunk A in file B came from lines C-D of file E, or file F was once named
G, etc). By inferring history from content-similar files and blocks, it
seems that it would be more able to generate useful results after
importing third-party sources, which may come in distinct 'releases' but
lack explicit history annotations.
--scott
[*] in general, i will be *glad* to see source-management move away from
CVS' line-oriented style; there's no good reason we should still be worrying
about whitespace changes, etc. When we build 'developer-friendly' tools
we should make every effort to auto-detect source code, image formats,
etc, and automatically perform appropriate canonicalization and
beautification of diffs, because this can be/should be/is entirely
separate from git's underlying storage representation.
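As a small example of the kind of canonicalization meant here, assuming plain text input: collapse every run of whitespace inside a line to a single space before comparing. The canonicalize name is made up, and a real tool would want format-aware rules (this one would be wrong for whitespace-significant languages).

```python
import re


def canonicalize(text: str) -> str:
    """Collapse whitespace runs to one space and trim each line."""
    lines = [re.sub(r"\s+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines)


original = "int  x\t= 1;  \nreturn x;\n"
reindented = "int x = 1;\n    return x;\n"
same_after_cleanup = canonicalize(original) == canonicalize(reindented)
```

The stored blobs stay byte-exact; only the comparison done by the front-end sees the canonical form.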
Mk 48 PANCHO ZPSECANT MKDELTA SCRANTON D5 SLBM JMTRAX Delta Force
MI6 SGUAT Khaddafi SMOTH interception mail drop SECANT PBSUCCESS Cocaine
( http://cscott.net/ )