From: "C. Scott Ananian" <cscott@cscott.net>
To: Paul Jackson <pj@engr.sgi.com>
Cc: git@vger.kernel.org
Subject: Re: another perspective on renames.
Date: Fri, 15 Apr 2005 10:47:01 -0400 (EDT)
Message-ID: <Pine.LNX.4.61.0504151031330.27637@cag.csail.mit.edu>
In-Reply-To: <20050414221626.10c6c0e7.pj@engr.sgi.com>
On Thu, 14 Apr 2005, Paul Jackson wrote:
> To me, rename is a special case of the more general case of a
> big chunk of code (a portion of a file) that was in one place
> either being moved or copied to another place.
>
> I wonder if there might be some way to use the tools that biologists use
> to analyze DNA sequences, to track the evolution of source code,
> identifying things like common chunks of code that differ in just a few
> mutations, and presenting the history of the evolution, at selectable
> levels of detail.
The rsync algorithm (http://samba.anu.edu.au/rsync/tech_report/node2.html)
is probably a good place to start, although it is relatively sensitive to
mutations. It will be able to efficiently detect identical blocks larger
than some block size N (512 bytes or so for rsync). You might well
consider smaller blocks to be irrelevant. The results can be made
considerably more useful to developers by canonicalizing before searching
(i.e., collapsing runs of whitespace to ' ', etc.)[*]. Note that the identical
regions do *not* have to line up on block boundaries; see the rsync
algorithm for more detail.
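
As a rough illustration of the idea above (not rsync's actual code), here
is a minimal Python sketch. It indexes fixed-offset blocks of an old text
and then slides a rolling weak checksum over a new text one byte at a
time, confirming candidates with a strong hash. The names canonicalize,
weak_checksum, roll and find_common_blocks, the tiny block size, and the
use of SHA-1 as the strong hash are all assumptions made for this example.

```python
import hashlib
import re

BLOCK = 8  # hypothetical block size for the demo; rsync uses far larger blocks


def canonicalize(text):
    """Collapse every run of whitespace to a single space."""
    return re.sub(r"\s+", " ", text).strip()


def weak_checksum(data):
    """rsync-style weak checksum (a, b) of a bytes block, each mod 2**16."""
    n = len(data)
    a = sum(data) & 0xFFFF
    b = sum((n - i) * byte for i, byte in enumerate(data)) & 0xFFFF
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) & 0xFFFF
    b = (b - n * out_byte + a) & 0xFFFF
    return a, b


def find_common_blocks(old, new, block=BLOCK):
    """Return (old_offset, new_offset) pairs of identical blocks.

    Blocks of `old` are taken at multiples of `block`; matches in `new`
    may start at any byte offset.
    """
    old_b, new_b = old.encode(), new.encode()
    index = {}
    for off in range(0, len(old_b) - block + 1, block):
        chunk = old_b[off:off + block]
        entry = (hashlib.sha1(chunk).digest(), off)
        index.setdefault(weak_checksum(chunk), []).append(entry)

    matches = []
    if len(new_b) < block:
        return matches
    a, b = weak_checksum(new_b[:block])
    pos = 0
    while True:
        # Only compute the strong hash when the weak checksum hits.
        for strong, off in index.get((a, b), ()):
            if hashlib.sha1(new_b[pos:pos + block]).digest() == strong:
                matches.append((off, pos))
                break
        if pos + block >= len(new_b):
            break
        a, b = roll(a, b, new_b[pos], new_b[pos + block], block)
        pos += 1
    return matches
```

One would run canonicalize() over both texts first and then call
find_common_blocks(old, new); a block that moved within the file (or into
another file) still shows up as a match, just at a different offset.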
I think Linus has made a persuasive case that the 'developer-friendly'
features of an SCM (i.e., annotate, log, and friends) can be built *on top*
of GIT. This is a perfect example. Since the computation is non-trivial
(although linear in the number of lines of code involved in the history of
a file; i.e., it doesn't depend on the size of the rest of the archive), it
might make sense for the front-end SCM to maintain its own caches --- for
example, of the block and rolling checksums for each file required by the
rsync algorithm. The key point being that these are just *caches*, not
essential history information, and can always be wiped and regenerated.
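
A minimal sketch of such a throwaway cache, assuming per-block signatures
keyed by a hash of the file content. ChecksumCache and its methods are
hypothetical names, and the key here is a plain SHA-1 of the bytes, which
is not the same thing as git's object name.

```python
import hashlib


class ChecksumCache:
    """Derived data only: safe to wipe, rebuilt on demand from content."""

    def __init__(self, block=512):
        self.block = block
        self._store = {}

    def _compute(self, content):
        # One strong hash per fixed-size block of the content.
        return [
            hashlib.sha1(content[i:i + self.block]).hexdigest()
            for i in range(0, len(content), self.block)
        ]

    def signatures(self, content):
        """Return cached block signatures, computing them on a miss."""
        key = hashlib.sha1(content).hexdigest()
        if key not in self._store:
            self._store[key] = self._compute(content)
        return self._store[key]

    def wipe(self):
        """Drop everything; nothing of historical value is lost."""
        self._store.clear()
```

Because the key depends only on content, wiping the cache and asking
again yields exactly the same signatures.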
The nice 'feature' of this system (some may disagree, I guess) is that it
does *not* depend on extensive programmer annotation of file changes (i.e.,
chunk A in file B came from lines C-D of file E, or file F was once named
G, etc.). By inferring history from content-similar files and blocks, it
seems that it would be more able to generate useful results after
importing third-party sources, which may come in distinct 'releases' but
lack explicit history annotations.
--scott
[*] In general, I will be *glad* to see source-management move away from
CVS' line-oriented style; there's no good reason we should still be worrying
about whitespace changes, etc. When we build 'developer-friendly' tools
we should make every effort to auto-detect source code, image formats,
etc, and automatically perform appropriate canonicalization and
beautification of diffs, because this can be/should be/is entirely
separate from git's underlying storage representation.
Mk 48 PANCHO ZPSECANT MKDELTA SCRANTON D5 SLBM JMTRAX Delta Force
MI6 SGUAT Khaddafi SMOTH interception mail drop SECANT PBSUCCESS Cocaine
( http://cscott.net/ )