From: "Manuel Woelker" <manuel.woelker@gmail.com>
To: "Shawn O. Pearce" <spearce@spearce.org>
Cc: "Robin Rosenberg" <robin.rosenberg@dewire.com>, git@vger.kernel.org
Subject: Re: [JGIT] Blame functionality for jgit
Date: Mon, 12 Jan 2009 22:17:47 +0100 [thread overview]
Message-ID: <3d045c7e0901121317j4ccd9515vbc7a44abc8ae5356@mail.gmail.com> (raw)
In-Reply-To: <20090112174232.GJ10179@spearce.org>
On Mon, Jan 12, 2009 at 6:42 PM, Shawn O. Pearce <spearce@spearce.org> wrote:
>
>> I largely ported the cgit blame algorithm described here
>> https://kerneltrap.org/mailarchive/git/2006/10/12/224187 , the
>> relevant file is builtin-blame.c cf.
>> http://repo.or.cz/w/git.git?a=blob;f=builtin-blame.c;hb=HEAD
>
> OK. Fortunately Junio has largely given his blessing to this code
> being converted to Java under the BSD license.
Glad to hear that, Dscho already mentioned that a port like this might
raise license concerns. So I hope Junio would be okay with that.
> Unfortunately I can't reach incava.org to download java-diff, but
> its entry on sourceforge says it uses an LGPL license. Within JGit
> we do want to stick to BSD and BSD dependencies, so bringing in
> java-diff as a new dependency isn't something we really want to do.
>
> Fortunately Dscho has been working on a Java diff implementation for
> JGit, and is considering releasing it under a BSD license so we can
> include it.
>
> The interface is still to be decided, but in general we have
> found that running against a byte[] is much faster than running
> against String. We're probably going to want the diff library to
> take a byte[] and an IntList (as created by RawParseUtils.lineMap)
> and let it work against the byte array ranges directly, without any
> additional allocations, except where it needs to build up its hash
> of records.
That sounds like a good plan. For now I am not all that concerned
about performance myself (premature optimization and all that), but in
the long run - and especially with rename/copy detection that will
definitely a factor for usability. We could adopt the interface to
allow for a "fast" implementation, and have existing diff
implementations adapted to that. I'd be very interested to see Dscho's
diff work, and maybe we can get something going on that front. A
patience diff implementation would be rather neat too, and all my
googling hasn't turned up anything.
> I think eventually we'll have a BSD licensed LCS that will come with
> JGit, which would eliminate the need for such a shim. I'd like
> to see that happen before blame gets added, but if blame is ready
> and the shim is reasonably well done, I'm inclined to bring blame
> in early.
While trying to look up the Myers diff algorithm I found a diff
implementation in Apache wicket (cf.
http://wicket.sourceforge.net/apidocs/wicket/util/diff/myers/package-summary.html
). This one is under an Apache license, is that any better? It's truly
kind of sad that you need a degree in law these days to get any work
done in this license jungle. I just happen to strongly oppose the
reinvention of circular transportation-enabling devices...
> - Don't use @author tags, we don't use them anywhere else.
>
> - Please include copyright headers on all new files.
>
> - I think the IDiff, IDifference should be in a ".diff" package so
> we can reuse them for other diff applications. Especially if we
> wind up using a shim to link to different LCS implementations.
>
> - I think the API for BlameEngine should be more like:
>
> public class BlameEngine {
> private final RevWalk rw;
>
> public BlameEngine(final Repository repo) {
> this(new RevWalk(repo));
> }
>
> public BlameEngine(final RevWalk walk) {
> rw = walk;
> }
>
> public RevWalk getRevWalk() {
> return rw;
> }
>
> public List<BlameEntry> blame(RevCommit c, String path);
> }
>
> One reason for this style of API is we can have the engine cache
> blame records. This can make a GUI navigator much more efficient
> as it jumps around between commits and asks for blames over the
> same data.
>
> - Internally you should *always* use RevCommit, not Commit.
> The RevCommit class can be orders of magnitude faster than Commit.
>
> - Internally you should always use RevTree and TreeWalk to locate
> blob IDs. Its orders of magnitude faster than Tree and TreeEntry.
>
> - Don't use new String(loader.getBytes()) (e.g. in Origin.getData)
> to get the data for a file. We want to do raw diffs, so that the
> character encoding is considered as part of the blame. E.g. if
> a file switches character encodings, the blame has to go to the
> commit that introduced the new encoding. This is a huge reason
> to use byte[] and IntList over an array of String.
>
> - RawParseUtils.lineMap will be faster than a Pattern of "^(.*)$".
>
Thanks for the comments. This first version was basically just a
proof-of-concept to see if and how it could work. I didn't want to
pour too much effort into, before knowing if this project was viable
at all. This input is very valuable and I will try to incorporate the
suggestions into future revisions.
I'll keep you posted on new developments
- Manuel
next prev parent reply other threads:[~2009-01-12 21:19 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-01-11 20:23 [JGIT] Blame functionality for jgit Manuel Woelker
2009-01-12 17:42 ` Shawn O. Pearce
2009-01-12 21:17 ` Manuel Woelker [this message]
2009-01-12 21:55 ` Shawn O. Pearce
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=3d045c7e0901121317j4ccd9515vbc7a44abc8ae5356@mail.gmail.com \
--to=manuel.woelker@gmail.com \
--cc=git@vger.kernel.org \
--cc=robin.rosenberg@dewire.com \
--cc=spearce@spearce.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).