* [JGIT] Blame functionality for jgit
@ 2009-01-11 20:23 Manuel Woelker
2009-01-12 17:42 ` Shawn O. Pearce
0 siblings, 1 reply; 4+ messages in thread
From: Manuel Woelker @ 2009-01-11 20:23 UTC (permalink / raw)
To: spearce, Robin Rosenberg; +Cc: git
Hello there,
Over the weekend I have been hacking the jgit sources a little to see
if I can add blame/praise/annotate functionality to it. The results
can be found at http://github.com/manuel-woelker/egit/tree/blame . All
work is in the blame branch in org.spearce.jgit.blame package.
I largely ported the cgit blame algorithm described here
https://kerneltrap.org/mailarchive/git/2006/10/12/224187 , the
relevant file is builtin-blame.c cf.
http://repo.or.cz/w/git.git?a=blob;f=builtin-blame.c;hb=HEAD
The structure has been kept largely intact, but I have tried to
translate the concepts to idiomatic java, with the bulk of the logic
now in the Scoreboard class
The blame algorithm needs to use a diff algorithm to find common parts
in files. AFAICT there is no diff implementation in jgit at the
moment. I used the incava java-diff library, (see
http://www.incava.org/projects/java/java-diff ), but I introduced an
interface that should make it possible to swap implementations with a
minimum of effort. To compile I just create a new eclipse project with
the java-diff sources.
Currently renames, copies etc. are not supported, so only files with
the same name can receive the blame. Unmodified renames and copies
should be fairly simple to implement. Modified renames and copies
might prove to be a little bit harder, so that would have to wait
until jgit can follow history across renames/copies.
There are some simple unit tests to check the basic functionality. I
also "blamed" SUBMITTING_PATCHES in the egit repo, and got the same
results as cgit. I am certain that there a some bugs lurking in the
code, but overall it looks quite promising.
I would like to hear your thoughts on a couple of topics:
- Merge/patch/diff/blame functionality needs a diff implementation,
what are our options within technical and license constraints?
- What is the roadmap for these features?
- Can you see this blame effort getting integrated upstream?
I would love to contribute more effort to egit and the blame
functionality in particular. To me, "blame" is one of the killer
features of modern SCMs.
Last no least, kudos to the git and egit teams for their hard work on
making git such a great piece of software.
- Manuel Woelker
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [JGIT] Blame functionality for jgit
2009-01-11 20:23 [JGIT] Blame functionality for jgit Manuel Woelker
@ 2009-01-12 17:42 ` Shawn O. Pearce
2009-01-12 21:17 ` Manuel Woelker
0 siblings, 1 reply; 4+ messages in thread
From: Shawn O. Pearce @ 2009-01-12 17:42 UTC (permalink / raw)
To: Manuel Woelker; +Cc: Robin Rosenberg, git
Manuel Woelker <manuel.woelker@gmail.com> wrote:
> Over the weekend I have been hacking the jgit sources a little to see
> if I can add blame/praise/annotate functionality to it. The results
> can be found at http://github.com/manuel-woelker/egit/tree/blame . All
> work is in the blame branch in org.spearce.jgit.blame package.
Interesting.
> I largely ported the cgit blame algorithm described here
> https://kerneltrap.org/mailarchive/git/2006/10/12/224187 , the
> relevant file is builtin-blame.c cf.
> http://repo.or.cz/w/git.git?a=blob;f=builtin-blame.c;hb=HEAD
OK. Fortunately Junio has largely given his blessing to this code
being converted to Java under the BSD license.
> The blame algorithm needs to use a diff algorithm to find common parts
> in files. AFAICT there is no diff implementation in jgit at the
> moment. I used the incava java-diff library, (see
> http://www.incava.org/projects/java/java-diff ), but I introduced an
> interface that should make it possible to swap implementations with a
> minimum of effort. To compile I just create a new eclipse project with
> the java-diff sources.
Unfortunately I can't reach incava.org to download java-diff, but
its entry on sourceforge says it uses an LGPL license. Within JGit
we do want to stick to BSD and BSD dependencies, so bringing in
java-diff as a new dependency isn't something we really want to do.
Fortunately Dscho has been working on a Java diff implementation for
JGit, and is considering releasing it under a BSD license so we can
include it.
The interface is still to be decided, but in general we have
found that running against a byte[] is much faster than running
against String. We're probably going to want the diff library to
take a byte[] and an IntList (as created by RawParseUtils.lineMap)
and let it work against the byte array ranges directly, without any
additional allocations, except where it needs to build up its hash
of records.
> I would like to hear your thoughts on a couple of topics:
> - Merge/patch/diff/blame functionality needs a diff implementation,
> what are our options within technical and license constraints?
I'm open to a plugin interface like you have done with the IDiff
API to hide java-diff, but for that to work we cannot have the
java-diff code as part of the JGit classpath. It needs to be
something added externally by the user, in a different context,
so the user can easily load their preferred diff implementation.
If that happents to be the LGPL java-diff, that's the user's choice.
We can provide the shim needed to get it to load, and the adapter,
but IMHO we shouldn't distribute the LGPL code, and we shouldn't
make it required in order to make the library compile or work.
Eclipes has its own LCS implementation, we should also be able
to have the Eclipse plugin provide its own shim to link to the
Eclipse EPL licensed LCS, so Eclipse can avoid java-diff entirely.
I'm not sure if NetBeans has an LCS available that the NetBeans
plugin could easily link to. It probably does.
I think eventually we'll have a BSD licensed LCS that will come with
JGit, which would eliminate the need for such a shim. I'd like
to see that happen before blame gets added, but if blame is ready
and the shim is reasonably well done, I'm inclined to bring blame
in early.
> - What is the roadmap for these features?
There isn't one. Its whatever gets contributed in a state that is
ready for inclusion. :-)
Since you are submitting this work, I'd say its on you to get
blame added.
> - Can you see this blame effort getting integrated upstream?
Yes.
> I would love to contribute more effort to egit and the blame
> functionality in particular. To me, "blame" is one of the killer
> features of modern SCMs.
I agree, I use git blame fairly often myself...
> Last no least, kudos to the git and egit teams for their hard work on
> making git such a great piece of software.
Thanks.
Now for some comments about your blame branch.
- Don't use @author tags, we don't use them anywhere else.
- Please include copyright headers on all new files.
- I think the IDiff, IDifference should be in a ".diff" package so
we can reuse them for other diff applications. Especially if we
wind up using a shim to link to different LCS implementations.
- I think the API for BlameEngine should be more like:
public class BlameEngine {
private final RevWalk rw;
public BlameEngine(final Repository repo) {
this(new RevWalk(repo));
}
public BlameEngine(final RevWalk walk) {
rw = walk;
}
public RevWalk getRevWalk() {
return rw;
}
public List<BlameEntry> blame(RevCommit c, String path);
}
One reason for this style of API is we can have the engine cache
blame records. This can make a GUI navigator much more efficient
as it jumps around between commits and asks for blames over the
same data.
- Internally you should *always* use RevCommit, not Commit.
The RevCommit class can be orders of magnitude faster than Commit.
- Internally you should always use RevTree and TreeWalk to locate
blob IDs. Its orders of magnitude faster than Tree and TreeEntry.
- Don't use new String(loader.getBytes()) (e.g. in Origin.getData)
to get the data for a file. We want to do raw diffs, so that the
character encoding is considered as part of the blame. E.g. if
a file switches character encodings, the blame has to go to the
commit that introduced the new encoding. This is a huge reason
to use byte[] and IntList over an array of String.
- RawParseUtils.lineMap will be faster than a Pattern of "^(.*)$".
Otherwise, I'm finding your code to be quite clear, and converted
rather nicely from C. I'd love to see this integrated into the
JGit project.
--
Shawn.
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [JGIT] Blame functionality for jgit
2009-01-12 17:42 ` Shawn O. Pearce
@ 2009-01-12 21:17 ` Manuel Woelker
2009-01-12 21:55 ` Shawn O. Pearce
0 siblings, 1 reply; 4+ messages in thread
From: Manuel Woelker @ 2009-01-12 21:17 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Robin Rosenberg, git
On Mon, Jan 12, 2009 at 6:42 PM, Shawn O. Pearce <spearce@spearce.org> wrote:
>
>> I largely ported the cgit blame algorithm described here
>> https://kerneltrap.org/mailarchive/git/2006/10/12/224187 , the
>> relevant file is builtin-blame.c cf.
>> http://repo.or.cz/w/git.git?a=blob;f=builtin-blame.c;hb=HEAD
>
> OK. Fortunately Junio has largely given his blessing to this code
> being converted to Java under the BSD license.
Glad to hear that, Dscho already mentioned that a port like this might
raise license concerns. So I hope Junio would be okay with that.
> Unfortunately I can't reach incava.org to download java-diff, but
> its entry on sourceforge says it uses an LGPL license. Within JGit
> we do want to stick to BSD and BSD dependencies, so bringing in
> java-diff as a new dependency isn't something we really want to do.
>
> Fortunately Dscho has been working on a Java diff implementation for
> JGit, and is considering releasing it under a BSD license so we can
> include it.
>
> The interface is still to be decided, but in general we have
> found that running against a byte[] is much faster than running
> against String. We're probably going to want the diff library to
> take a byte[] and an IntList (as created by RawParseUtils.lineMap)
> and let it work against the byte array ranges directly, without any
> additional allocations, except where it needs to build up its hash
> of records.
That sounds like a good plan. For now I am not all that concerned
about performance myself (premature optimization and all that), but in
the long run - and especially with rename/copy detection that will
definitely a factor for usability. We could adopt the interface to
allow for a "fast" implementation, and have existing diff
implementations adapted to that. I'd be very interested to see Dscho's
diff work, and maybe we can get something going on that front. A
patience diff implementation would be rather neat too, and all my
googling hasn't turned up anything.
> I think eventually we'll have a BSD licensed LCS that will come with
> JGit, which would eliminate the need for such a shim. I'd like
> to see that happen before blame gets added, but if blame is ready
> and the shim is reasonably well done, I'm inclined to bring blame
> in early.
While trying to look up the Myers diff algorithm I found a diff
implementation in Apache wicket (cf.
http://wicket.sourceforge.net/apidocs/wicket/util/diff/myers/package-summary.html
). This one is under an Apache license, is that any better? It's truly
kind of sad that you need a degree in law these days to get any work
done in this license jungle. I just happen to strongly oppose the
reinvention of circular transportation-enabling devices...
> - Don't use @author tags, we don't use them anywhere else.
>
> - Please include copyright headers on all new files.
>
> - I think the IDiff, IDifference should be in a ".diff" package so
> we can reuse them for other diff applications. Especially if we
> wind up using a shim to link to different LCS implementations.
>
> - I think the API for BlameEngine should be more like:
>
> public class BlameEngine {
> private final RevWalk rw;
>
> public BlameEngine(final Repository repo) {
> this(new RevWalk(repo));
> }
>
> public BlameEngine(final RevWalk walk) {
> rw = walk;
> }
>
> public RevWalk getRevWalk() {
> return rw;
> }
>
> public List<BlameEntry> blame(RevCommit c, String path);
> }
>
> One reason for this style of API is we can have the engine cache
> blame records. This can make a GUI navigator much more efficient
> as it jumps around between commits and asks for blames over the
> same data.
>
> - Internally you should *always* use RevCommit, not Commit.
> The RevCommit class can be orders of magnitude faster than Commit.
>
> - Internally you should always use RevTree and TreeWalk to locate
> blob IDs. Its orders of magnitude faster than Tree and TreeEntry.
>
> - Don't use new String(loader.getBytes()) (e.g. in Origin.getData)
> to get the data for a file. We want to do raw diffs, so that the
> character encoding is considered as part of the blame. E.g. if
> a file switches character encodings, the blame has to go to the
> commit that introduced the new encoding. This is a huge reason
> to use byte[] and IntList over an array of String.
>
> - RawParseUtils.lineMap will be faster than a Pattern of "^(.*)$".
>
Thanks for the comments. This first version was basically just a
proof-of-concept to see if and how it could work. I didn't want to
pour too much effort into, before knowing if this project was viable
at all. This input is very valuable and I will try to incorporate the
suggestions into future revisions.
I'll keep you posted on new developments
- Manuel
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [JGIT] Blame functionality for jgit
2009-01-12 21:17 ` Manuel Woelker
@ 2009-01-12 21:55 ` Shawn O. Pearce
0 siblings, 0 replies; 4+ messages in thread
From: Shawn O. Pearce @ 2009-01-12 21:55 UTC (permalink / raw)
To: Manuel Woelker; +Cc: Robin Rosenberg, git
Manuel Woelker <manuel.woelker@gmail.com> wrote:
> On Mon, Jan 12, 2009 at 6:42 PM, Shawn O. Pearce <spearce@spearce.org> wrote:
>
> That sounds like a good plan. For now I am not all that concerned
> about performance myself (premature optimization and all that), but in
> the long run - and especially with rename/copy detection that will
> definitely a factor for usability.
Yea, I know. Premature optimization is the root of all evil.
But we've also learned the hard way that Java is slow as snot
compared to C git. The only way we can even stay close is to
optimize the hell out of the tight inner sections, and very often
that means using byte[] and avoiding upconverting to String to as
late as we possibly can.
Performance *is* a feature in Git. Its not a "nice to have",
its a requirement. The old history view for example was too damn
slow using Commit, requiring minutes on one of my systems to render
egit.git history. Using RevCommit its subsecond response time.
I just wanted to point out that we care quite a bit about speed, and
that given our input (raw byte[] from the pack) we need to be able
to quickly make decisions without upconverting to String, otherwise
blame performance will be so bad that its completely unusable.
> > I think eventually we'll have a BSD licensed LCS [...]
>
> While trying to look up the Myers diff algorithm I found a diff
> implementation in Apache wicket (cf.
> http://wicket.sourceforge.net/apidocs/wicket/util/diff/myers/package-summary.html
> ). This one is under an Apache license, is that any better? It's truly
> kind of sad that you need a degree in law these days to get any work
> done in this license jungle. I just happen to strongly oppose the
> reinvention of circular transportation-enabling devices...
Yea, even the ASF has trouble deciding if the Apache License and
the GPL can get along:
"The Apache Software Foundation is still trying to determine if
this version of the Apache License is compatible with the GPL."
http://www.apache.org/licenses/
The Apache License doesn't play nice with GPLv2 apparently, but is
OK with GPLv2, at least according to the FSF.
Anyway. The Apache License is roughly the new style BSD, but with
patent protection clauses built in. I think we can consume code
under the Apache License and redistribute it without any trouble
for us, our any of our downstream consumers.
> I'll keep you posted on new developments
Looking forward to it.
--
Shawn.
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2009-01-12 21:56 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-01-11 20:23 [JGIT] Blame functionality for jgit Manuel Woelker
2009-01-12 17:42 ` Shawn O. Pearce
2009-01-12 21:17 ` Manuel Woelker
2009-01-12 21:55 ` Shawn O. Pearce
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).