Re: [JGIT] Blame functionality for jgit

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: "Manuel Woelker" <manuel.woelker@gmail.com>
To: "Shawn O. Pearce" <spearce@spearce.org>
Cc: "Robin Rosenberg" <robin.rosenberg@dewire.com>, git@vger.kernel.org
Subject: Re: [JGIT] Blame functionality for jgit
Date: Mon, 12 Jan 2009 22:17:47 +0100	[thread overview]
Message-ID: <3d045c7e0901121317j4ccd9515vbc7a44abc8ae5356@mail.gmail.com> (raw)
In-Reply-To: <20090112174232.GJ10179@spearce.org>

On Mon, Jan 12, 2009 at 6:42 PM, Shawn O. Pearce <spearce@spearce.org> wrote:
>
>> I largely ported the cgit blame algorithm described here
>> https://kerneltrap.org/mailarchive/git/2006/10/12/224187 , the
>> relevant file is builtin-blame.c cf.
>> http://repo.or.cz/w/git.git?a=blob;f=builtin-blame.c;hb=HEAD
>
> OK.  Fortunately Junio has largely given his blessing to this code
> being converted to Java under the BSD license.

Glad to hear that, Dscho already mentioned that a port like this might
raise license concerns. So I hope Junio would be okay with that.

> Unfortunately I can't reach incava.org to download java-diff, but
> its entry on sourceforge says it uses an LGPL license.  Within JGit
> we do want to stick to BSD and BSD dependencies, so bringing in
> java-diff as a new dependency isn't something we really want to do.
>
> Fortunately Dscho has been working on a Java diff implementation for
> JGit, and is considering releasing it under a BSD license so we can
> include it.
>
> The interface is still to be decided, but in general we have
> found that running against a byte[] is much faster than running
> against String.  We're probably going to want the diff library to
> take a byte[] and an IntList (as created by RawParseUtils.lineMap)
> and let it work against the byte array ranges directly, without any
> additional allocations, except where it needs to build up its hash
> of records.

That sounds like a good plan. For now I am not all that concerned
about performance myself (premature optimization and all that), but in
the long run - and especially with rename/copy detection that will
definitely a factor for usability. We could adopt the interface to
allow for a "fast" implementation, and have existing diff
implementations adapted to that. I'd be very interested to see Dscho's
diff work, and maybe we can get something going on that front. A
patience diff implementation would be rather neat too, and all my
googling hasn't turned up anything.

> I think eventually we'll have a BSD licensed LCS that will come with
> JGit, which would eliminate the need for such a shim.  I'd like
> to see that happen before blame gets added, but if blame is ready
> and the shim is reasonably well done, I'm inclined to bring blame
> in early.

While trying to look up the Myers diff algorithm I found a diff
implementation in Apache wicket (cf.
http://wicket.sourceforge.net/apidocs/wicket/util/diff/myers/package-summary.html
). This one is under an Apache license, is that any better? It's truly
kind of sad that you need a degree in law these days to get any work
done in this license jungle. I just happen to strongly oppose the
reinvention of circular transportation-enabling devices...

> - Don't use @author tags, we don't use them anywhere else.
>
> - Please include copyright headers on all new files.
>
> - I think the IDiff, IDifference should be in a ".diff" package so
> we can reuse them for other diff applications.  Especially if we
> wind up using a shim to link to different LCS implementations.
>
> - I think the API for BlameEngine should be more like:
>
>  public class BlameEngine {
>    private final RevWalk rw;
>
>    public BlameEngine(final Repository repo) {
>      this(new RevWalk(repo));
>    }
>
>    public BlameEngine(final RevWalk walk) {
>      rw = walk;
>    }
>
>    public RevWalk getRevWalk() {
>      return rw;
>    }
>
>    public List<BlameEntry> blame(RevCommit c, String path);
>  }
>
>  One reason for this style of API is we can have the engine cache
>  blame records.  This can make a GUI navigator much more efficient
>  as it jumps around between commits and asks for blames over the
>  same data.
>
> - Internally you should *always* use RevCommit, not Commit.
> The RevCommit class can be orders of magnitude faster than Commit.
>
> - Internally you should always use RevTree and TreeWalk to locate
> blob IDs.  Its orders of magnitude faster than Tree and TreeEntry.
>
> - Don't use new String(loader.getBytes()) (e.g. in Origin.getData)
> to get the data for a file.  We want to do raw diffs, so that the
> character encoding is considered as part of the blame.  E.g. if
> a file switches character encodings, the blame has to go to the
> commit that introduced the new encoding.  This is a huge reason
> to use byte[] and IntList over an array of String.
>
> - RawParseUtils.lineMap will be faster than a Pattern of "^(.*)$".
>
Thanks for the comments. This first version was basically just a
proof-of-concept to see if and how it could work. I didn't want to
pour too much effort into, before knowing if this project was viable
at all. This input is very valuable and I will try to incorporate the
suggestions into future revisions.

I'll keep you posted on new developments
 - Manuel

next prev parent reply	other threads:[~2009-01-12 21:19 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-01-11 20:23 [JGIT] Blame functionality for jgit Manuel Woelker
2009-01-12 17:42 ` Shawn O. Pearce
2009-01-12 21:17   ` Manuel Woelker [this message]
2009-01-12 21:55     ` Shawn O. Pearce

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3d045c7e0901121317j4ccd9515vbc7a44abc8ae5356@mail.gmail.com \
    --to=manuel.woelker@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=robin.rosenberg@dewire.com \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).