* [JGIT] Blame functionality for jgit @ 2009-01-11 20:23 Manuel Woelker 2009-01-12 17:42 ` Shawn O. Pearce 0 siblings, 1 reply; 4+ messages in thread From: Manuel Woelker @ 2009-01-11 20:23 UTC (permalink / raw) To: spearce, Robin Rosenberg; +Cc: git Hello there, Over the weekend I have been hacking the jgit sources a little to see if I can add blame/praise/annotate functionality to it. The results can be found at http://github.com/manuel-woelker/egit/tree/blame . All work is in the blame branch in org.spearce.jgit.blame package. I largely ported the cgit blame algorithm described here https://kerneltrap.org/mailarchive/git/2006/10/12/224187 , the relevant file is builtin-blame.c cf. http://repo.or.cz/w/git.git?a=blob;f=builtin-blame.c;hb=HEAD The structure has been kept largely intact, but I have tried to translate the concepts to idiomatic java, with the bulk of the logic now in the Scoreboard class The blame algorithm needs to use a diff algorithm to find common parts in files. AFAICT there is no diff implementation in jgit at the moment. I used the incava java-diff library, (see http://www.incava.org/projects/java/java-diff ), but I introduced an interface that should make it possible to swap implementations with a minimum of effort. To compile I just create a new eclipse project with the java-diff sources. Currently renames, copies etc. are not supported, so only files with the same name can receive the blame. Unmodified renames and copies should be fairly simple to implement. Modified renames and copies might prove to be a little bit harder, so that would have to wait until jgit can follow history across renames/copies. There are some simple unit tests to check the basic functionality. I also "blamed" SUBMITTING_PATCHES in the egit repo, and got the same results as cgit. I am certain that there a some bugs lurking in the code, but overall it looks quite promising. I would like to hear your thoughts on a couple of topics: - Merge/patch/diff/blame functionality needs a diff implementation, what are our options within technical and license constraints? - What is the roadmap for these features? - Can you see this blame effort getting integrated upstream? I would love to contribute more effort to egit and the blame functionality in particular. To me, "blame" is one of the killer features of modern SCMs. Last no least, kudos to the git and egit teams for their hard work on making git such a great piece of software. - Manuel Woelker ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [JGIT] Blame functionality for jgit 2009-01-11 20:23 [JGIT] Blame functionality for jgit Manuel Woelker @ 2009-01-12 17:42 ` Shawn O. Pearce 2009-01-12 21:17 ` Manuel Woelker 0 siblings, 1 reply; 4+ messages in thread From: Shawn O. Pearce @ 2009-01-12 17:42 UTC (permalink / raw) To: Manuel Woelker; +Cc: Robin Rosenberg, git Manuel Woelker <manuel.woelker@gmail.com> wrote: > Over the weekend I have been hacking the jgit sources a little to see > if I can add blame/praise/annotate functionality to it. The results > can be found at http://github.com/manuel-woelker/egit/tree/blame . All > work is in the blame branch in org.spearce.jgit.blame package. Interesting. > I largely ported the cgit blame algorithm described here > https://kerneltrap.org/mailarchive/git/2006/10/12/224187 , the > relevant file is builtin-blame.c cf. > http://repo.or.cz/w/git.git?a=blob;f=builtin-blame.c;hb=HEAD OK. Fortunately Junio has largely given his blessing to this code being converted to Java under the BSD license. > The blame algorithm needs to use a diff algorithm to find common parts > in files. AFAICT there is no diff implementation in jgit at the > moment. I used the incava java-diff library, (see > http://www.incava.org/projects/java/java-diff ), but I introduced an > interface that should make it possible to swap implementations with a > minimum of effort. To compile I just create a new eclipse project with > the java-diff sources. Unfortunately I can't reach incava.org to download java-diff, but its entry on sourceforge says it uses an LGPL license. Within JGit we do want to stick to BSD and BSD dependencies, so bringing in java-diff as a new dependency isn't something we really want to do. Fortunately Dscho has been working on a Java diff implementation for JGit, and is considering releasing it under a BSD license so we can include it. The interface is still to be decided, but in general we have found that running against a byte[] is much faster than running against String. We're probably going to want the diff library to take a byte[] and an IntList (as created by RawParseUtils.lineMap) and let it work against the byte array ranges directly, without any additional allocations, except where it needs to build up its hash of records. > I would like to hear your thoughts on a couple of topics: > - Merge/patch/diff/blame functionality needs a diff implementation, > what are our options within technical and license constraints? I'm open to a plugin interface like you have done with the IDiff API to hide java-diff, but for that to work we cannot have the java-diff code as part of the JGit classpath. It needs to be something added externally by the user, in a different context, so the user can easily load their preferred diff implementation. If that happents to be the LGPL java-diff, that's the user's choice. We can provide the shim needed to get it to load, and the adapter, but IMHO we shouldn't distribute the LGPL code, and we shouldn't make it required in order to make the library compile or work. Eclipes has its own LCS implementation, we should also be able to have the Eclipse plugin provide its own shim to link to the Eclipse EPL licensed LCS, so Eclipse can avoid java-diff entirely. I'm not sure if NetBeans has an LCS available that the NetBeans plugin could easily link to. It probably does. I think eventually we'll have a BSD licensed LCS that will come with JGit, which would eliminate the need for such a shim. I'd like to see that happen before blame gets added, but if blame is ready and the shim is reasonably well done, I'm inclined to bring blame in early. > - What is the roadmap for these features? There isn't one. Its whatever gets contributed in a state that is ready for inclusion. :-) Since you are submitting this work, I'd say its on you to get blame added. > - Can you see this blame effort getting integrated upstream? Yes. > I would love to contribute more effort to egit and the blame > functionality in particular. To me, "blame" is one of the killer > features of modern SCMs. I agree, I use git blame fairly often myself... > Last no least, kudos to the git and egit teams for their hard work on > making git such a great piece of software. Thanks. Now for some comments about your blame branch. - Don't use @author tags, we don't use them anywhere else. - Please include copyright headers on all new files. - I think the IDiff, IDifference should be in a ".diff" package so we can reuse them for other diff applications. Especially if we wind up using a shim to link to different LCS implementations. - I think the API for BlameEngine should be more like: public class BlameEngine { private final RevWalk rw; public BlameEngine(final Repository repo) { this(new RevWalk(repo)); } public BlameEngine(final RevWalk walk) { rw = walk; } public RevWalk getRevWalk() { return rw; } public List<BlameEntry> blame(RevCommit c, String path); } One reason for this style of API is we can have the engine cache blame records. This can make a GUI navigator much more efficient as it jumps around between commits and asks for blames over the same data. - Internally you should *always* use RevCommit, not Commit. The RevCommit class can be orders of magnitude faster than Commit. - Internally you should always use RevTree and TreeWalk to locate blob IDs. Its orders of magnitude faster than Tree and TreeEntry. - Don't use new String(loader.getBytes()) (e.g. in Origin.getData) to get the data for a file. We want to do raw diffs, so that the character encoding is considered as part of the blame. E.g. if a file switches character encodings, the blame has to go to the commit that introduced the new encoding. This is a huge reason to use byte[] and IntList over an array of String. - RawParseUtils.lineMap will be faster than a Pattern of "^(.*)$". Otherwise, I'm finding your code to be quite clear, and converted rather nicely from C. I'd love to see this integrated into the JGit project. -- Shawn. ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [JGIT] Blame functionality for jgit 2009-01-12 17:42 ` Shawn O. Pearce @ 2009-01-12 21:17 ` Manuel Woelker 2009-01-12 21:55 ` Shawn O. Pearce 0 siblings, 1 reply; 4+ messages in thread From: Manuel Woelker @ 2009-01-12 21:17 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Robin Rosenberg, git On Mon, Jan 12, 2009 at 6:42 PM, Shawn O. Pearce <spearce@spearce.org> wrote: > >> I largely ported the cgit blame algorithm described here >> https://kerneltrap.org/mailarchive/git/2006/10/12/224187 , the >> relevant file is builtin-blame.c cf. >> http://repo.or.cz/w/git.git?a=blob;f=builtin-blame.c;hb=HEAD > > OK. Fortunately Junio has largely given his blessing to this code > being converted to Java under the BSD license. Glad to hear that, Dscho already mentioned that a port like this might raise license concerns. So I hope Junio would be okay with that. > Unfortunately I can't reach incava.org to download java-diff, but > its entry on sourceforge says it uses an LGPL license. Within JGit > we do want to stick to BSD and BSD dependencies, so bringing in > java-diff as a new dependency isn't something we really want to do. > > Fortunately Dscho has been working on a Java diff implementation for > JGit, and is considering releasing it under a BSD license so we can > include it. > > The interface is still to be decided, but in general we have > found that running against a byte[] is much faster than running > against String. We're probably going to want the diff library to > take a byte[] and an IntList (as created by RawParseUtils.lineMap) > and let it work against the byte array ranges directly, without any > additional allocations, except where it needs to build up its hash > of records. That sounds like a good plan. For now I am not all that concerned about performance myself (premature optimization and all that), but in the long run - and especially with rename/copy detection that will definitely a factor for usability. We could adopt the interface to allow for a "fast" implementation, and have existing diff implementations adapted to that. I'd be very interested to see Dscho's diff work, and maybe we can get something going on that front. A patience diff implementation would be rather neat too, and all my googling hasn't turned up anything. > I think eventually we'll have a BSD licensed LCS that will come with > JGit, which would eliminate the need for such a shim. I'd like > to see that happen before blame gets added, but if blame is ready > and the shim is reasonably well done, I'm inclined to bring blame > in early. While trying to look up the Myers diff algorithm I found a diff implementation in Apache wicket (cf. http://wicket.sourceforge.net/apidocs/wicket/util/diff/myers/package-summary.html ). This one is under an Apache license, is that any better? It's truly kind of sad that you need a degree in law these days to get any work done in this license jungle. I just happen to strongly oppose the reinvention of circular transportation-enabling devices... > - Don't use @author tags, we don't use them anywhere else. > > - Please include copyright headers on all new files. > > - I think the IDiff, IDifference should be in a ".diff" package so > we can reuse them for other diff applications. Especially if we > wind up using a shim to link to different LCS implementations. > > - I think the API for BlameEngine should be more like: > > public class BlameEngine { > private final RevWalk rw; > > public BlameEngine(final Repository repo) { > this(new RevWalk(repo)); > } > > public BlameEngine(final RevWalk walk) { > rw = walk; > } > > public RevWalk getRevWalk() { > return rw; > } > > public List<BlameEntry> blame(RevCommit c, String path); > } > > One reason for this style of API is we can have the engine cache > blame records. This can make a GUI navigator much more efficient > as it jumps around between commits and asks for blames over the > same data. > > - Internally you should *always* use RevCommit, not Commit. > The RevCommit class can be orders of magnitude faster than Commit. > > - Internally you should always use RevTree and TreeWalk to locate > blob IDs. Its orders of magnitude faster than Tree and TreeEntry. > > - Don't use new String(loader.getBytes()) (e.g. in Origin.getData) > to get the data for a file. We want to do raw diffs, so that the > character encoding is considered as part of the blame. E.g. if > a file switches character encodings, the blame has to go to the > commit that introduced the new encoding. This is a huge reason > to use byte[] and IntList over an array of String. > > - RawParseUtils.lineMap will be faster than a Pattern of "^(.*)$". > Thanks for the comments. This first version was basically just a proof-of-concept to see if and how it could work. I didn't want to pour too much effort into, before knowing if this project was viable at all. This input is very valuable and I will try to incorporate the suggestions into future revisions. I'll keep you posted on new developments - Manuel ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [JGIT] Blame functionality for jgit 2009-01-12 21:17 ` Manuel Woelker @ 2009-01-12 21:55 ` Shawn O. Pearce 0 siblings, 0 replies; 4+ messages in thread From: Shawn O. Pearce @ 2009-01-12 21:55 UTC (permalink / raw) To: Manuel Woelker; +Cc: Robin Rosenberg, git Manuel Woelker <manuel.woelker@gmail.com> wrote: > On Mon, Jan 12, 2009 at 6:42 PM, Shawn O. Pearce <spearce@spearce.org> wrote: > > That sounds like a good plan. For now I am not all that concerned > about performance myself (premature optimization and all that), but in > the long run - and especially with rename/copy detection that will > definitely a factor for usability. Yea, I know. Premature optimization is the root of all evil. But we've also learned the hard way that Java is slow as snot compared to C git. The only way we can even stay close is to optimize the hell out of the tight inner sections, and very often that means using byte[] and avoiding upconverting to String to as late as we possibly can. Performance *is* a feature in Git. Its not a "nice to have", its a requirement. The old history view for example was too damn slow using Commit, requiring minutes on one of my systems to render egit.git history. Using RevCommit its subsecond response time. I just wanted to point out that we care quite a bit about speed, and that given our input (raw byte[] from the pack) we need to be able to quickly make decisions without upconverting to String, otherwise blame performance will be so bad that its completely unusable. > > I think eventually we'll have a BSD licensed LCS [...] > > While trying to look up the Myers diff algorithm I found a diff > implementation in Apache wicket (cf. > http://wicket.sourceforge.net/apidocs/wicket/util/diff/myers/package-summary.html > ). This one is under an Apache license, is that any better? It's truly > kind of sad that you need a degree in law these days to get any work > done in this license jungle. I just happen to strongly oppose the > reinvention of circular transportation-enabling devices... Yea, even the ASF has trouble deciding if the Apache License and the GPL can get along: "The Apache Software Foundation is still trying to determine if this version of the Apache License is compatible with the GPL." http://www.apache.org/licenses/ The Apache License doesn't play nice with GPLv2 apparently, but is OK with GPLv2, at least according to the FSF. Anyway. The Apache License is roughly the new style BSD, but with patent protection clauses built in. I think we can consume code under the Apache License and redistribute it without any trouble for us, our any of our downstream consumers. > I'll keep you posted on new developments Looking forward to it. -- Shawn. ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2009-01-12 21:56 UTC | newest] Thread overview: 4+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-01-11 20:23 [JGIT] Blame functionality for jgit Manuel Woelker 2009-01-12 17:42 ` Shawn O. Pearce 2009-01-12 21:17 ` Manuel Woelker 2009-01-12 21:55 ` Shawn O. Pearce
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).