From: "C. Scott Ananian" <cscott@cscott.net>
To: Paul Jackson <pj@engr.sgi.com>
Cc: git@vger.kernel.org
Subject: Re: another perspective on renames.
Date: Fri, 15 Apr 2005 10:47:01 -0400 (EDT)
Message-ID: <Pine.LNX.4.61.0504151031330.27637@cag.csail.mit.edu>
In-Reply-To: <20050414221626.10c6c0e7.pj@engr.sgi.com>

On Thu, 14 Apr 2005, Paul Jackson wrote:

> To me, rename is a special case of the more general case of a
> big chunk of code (a portion of a file) that was in one place
> either being moved or copied to another place.
>
> I wonder if there might be some way to use the tools that biologists use
> to analyze DNA sequences, to track the evolution of source code,
> identifying things like common chunks of code that differ in just a few
> mutations, and presenting the history of the evolution, at selectable
> levels of detail.

The rsync algorithm (http://samba.anu.edu.au/rsync/tech_report/node2.html)
is probably a good place to start, although it is relatively sensitive to
mutations: a single changed byte inside a block prevents that block from
matching.  It can efficiently detect identical blocks larger than some
block size N (512 bytes or so for rsync).  You might well consider smaller
blocks to be irrelevant.  The data can be made considerably more useful to
developers by canonicalizing before searching (i.e., compressing
whitespace to ' ', etc.)[*].  Note that the identical regions do *not*
have to line up on block boundaries; see the rsync algorithm for more
detail.
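
A minimal sketch of the block-matching idea, not rsync's actual code: the
old file is indexed in fixed blocks of size n, and a rolling weak checksum
slides over the new file one byte at a time, so matches need not sit on
block boundaries.  The function names are hypothetical, and SHA-1 stands
in for rsync's strong checksum purely for illustration.

```python
import hashlib

MOD = 1 << 16  # modulus for the two 16-bit halves of the weak checksum


def weak_checksum(block):
    """Return the (a, b) pair of an rsync-style weak checksum."""
    n = len(block)
    a = sum(block) % MOD
    b = sum((n - i) * x for i, x in enumerate(block)) % MOD
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide a window of length n one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - n * out_byte + a) % MOD
    return a, b


def index_blocks(old, n):
    """Map weak checksum -> [(strong digest, offset)] for aligned blocks."""
    table = {}
    for off in range(0, len(old) - n + 1, n):
        block = old[off:off + n]
        entry = (hashlib.sha1(block).digest(), off)
        table.setdefault(weak_checksum(block), []).append(entry)
    return table


def find_matches(old, new, n):
    """Return (offset_in_new, offset_in_old) for each matched block."""
    matches = []
    if len(new) < n:
        return matches
    table = index_blocks(old, n)
    pos = 0
    a, b = weak_checksum(new[:n])
    while True:
        hit = None
        candidates = table.get((a, b))
        if candidates:
            # The weak checksum is only a filter; confirm with the strong one.
            strong = hashlib.sha1(new[pos:pos + n]).digest()
            for digest, off in candidates:
                if digest == strong:
                    hit = off
                    break
        if hit is not None:
            matches.append((pos, hit))
            pos += n  # skip past the matched block
            if pos + n > len(new):
                break
            a, b = weak_checksum(new[pos:pos + n])
        else:
            if pos + n >= len(new):
                break
            a, b = roll(a, b, new[pos], new[pos + n], n)
            pos += 1
    return matches
```

For example, find_matches(old, new, 8) on an old file of three 8-byte
blocks and a new file that prepends 3 bytes to the last two of them
reports both blocks at their shifted offsets.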

I think Linus has made a persuasive case that the 'developer-friendly'
features of an SCM (i.e., annotate, log, and friends) can be built *on top*
of GIT.  This is a perfect example.  The computation is non-trivial,
although it is linear in the number of lines of code involved in the
history of a file; i.e., it doesn't depend on the size of the unrelated
rest of the archive.  It might therefore make sense for the front-end SCM
to maintain its own caches: for example, of the block and rolling
checksums for each file required by the rsync algorithm.  The key point is
that these are just *caches*, not essential history information, and can
always be wiped and regenerated.
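
As a rough illustration of such a cache (a sketch under assumptions, not
anything GIT provides): the hypothetical SignatureCache below keys each
file's block checksums by a SHA-1 of the file content, so an entry can
never go stale, and wiping the cache loses nothing that cannot be
recomputed from the content itself.

```python
import hashlib


class SignatureCache:
    """Hypothetical front-end cache of per-file block checksums."""

    def __init__(self, block_size=512):
        self.block_size = block_size
        self._entries = {}  # content hash -> list of block digests
        self.computed = 0   # how many signatures were actually computed

    def _compute(self, content):
        # Checksum every complete, aligned block of the content.
        n = self.block_size
        self.computed += 1
        return [hashlib.sha1(content[off:off + n]).hexdigest()
                for off in range(0, len(content) - n + 1, n)]

    def signature(self, content):
        """Return the block digests for content, computing them on a miss."""
        key = hashlib.sha1(content).hexdigest()
        if key not in self._entries:
            self._entries[key] = self._compute(content)
        return self._entries[key]

    def wipe(self):
        """Discard everything; later lookups regenerate entries on demand."""
        self._entries.clear()
```

Calling signature() twice on the same content computes the checksums
once; after wipe(), the next call simply recomputes the same result.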

The nice 'feature' of this system (some may disagree, I guess) is that it
does *not* depend on extensive programmer annotation of file changes (i.e.,
chunk A in file B came from lines C-D of file E, or file F was once named
G, etc.).  By inferring history from content-similar files and blocks, it
seems that it would be better able to generate useful results after
importing third-party sources, which may come in distinct 'releases' but
lack explicit history annotations.
   --scott

[*] In general, I will be *glad* to see source management move away from
CVS's line-oriented style; there's no good reason we should still be
worrying about whitespace changes, etc.  When we build 'developer-friendly'
tools we should make every effort to auto-detect source code, image
formats, etc., and automatically perform appropriate canonicalization and
beautification of diffs, because this can be/should be/is entirely
separate from git's underlying storage representation.
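
A minimal sketch of the simplest such canonicalization, whitespace
compression, applied before any comparison.  The function name is
hypothetical, and dropping blank lines is an extra assumption made here
for illustration; the stored content is never modified.

```python
import re


def canonicalize(text):
    """Collapse each run of whitespace to one space and drop blank lines."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

Two versions of a file that differ only in indentation or spacing then
canonicalize to the same string, so a comparison built on top of this
sees no change.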

Mk 48 PANCHO ZPSECANT MKDELTA SCRANTON D5 SLBM JMTRAX Delta Force 
MI6 SGUAT Khaddafi SMOTH interception mail drop SECANT PBSUCCESS Cocaine
                          ( http://cscott.net/ )
