* filter-branch performance
From: Henning Moll @ 2014-12-09 18:52 UTC
To: git

Hi,

I am running this command

    git filter-branch --env-filter 'export GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL" GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_DATE="$GIT_AUTHOR_DATE"' --prune-empty --tag-name-filter cat -- --all

in a repository which I copied to /dev/shm beforehand. According to "top",
the git process consumes only about 5 percent of the CPU. The load is
between 0.70 and 1.00.

I assume that there is a lot of process forking going on. Could that be
the cause? Any ideas on how to improve this further?

Best regards
Henning

* Re: filter-branch performance
From: Jeff King @ 2014-12-09 18:59 UTC
To: Henning Moll; +Cc: git

On Tue, Dec 09, 2014 at 07:52:33PM +0100, Henning Moll wrote:

> i am running this command
>
> git filter-branch --env-filter 'export
> GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
> GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_DATE="$GIT_AUTHOR_DATE"'
> --prune-empty --tag-name-filter cat -- --all
>
> in a repository which i copied to /dev/shm before. According to "top", the
> git process only consumes about 5 percent of the CPU. The load is between
> 0.70 and 1.00.
>
> I assume that there is a lot of process forking going on. Could that be the
> cause?

Yes. filter-branch is a shell script, and it is probably running
multiple git commands per commit it is filtering.

> Any ideas how to further improve?

In your case you are not touching the tree contents at all. Last time I
looked into this, I believe that filter-branch always loaded the index
for each commit, even if no --index-filter is being used. So teaching
filter-branch to optimize this case would be one strategy.

Another is to try using "git fast-export | git fast-import", and munging
the data stream in between. That may be more work, depending on how fancy
you want to get with accurate parsing (look into fast-export's --no-data,
which omits blob data; that should make things faster and make hacky
context-less parsing less likely to cause problems).

-Peff

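A minimal sketch of the "munge the stream in between" idea, written in Python rather than shell so that counted "data" payloads can be skipped safely. It assumes the stream layout that git fast-export normally emits (an "author" line followed by a "committer" line in each commit, and commit messages as "data <count>" blocks). The function name and the sample identities are made up for illustration; this is not code from the thread.

```python
import io
import re

# "data <count>" introduces a payload of exactly <count> raw bytes.
DATA_RE = re.compile(rb"data (\d+)\n")


def set_committer_to_author(src, dst):
    """Copy a fast-export stream from src to dst, replacing each committer
    line with the name, email and date of the preceding author line."""
    author = None
    while True:
        line = src.readline()
        if not line:
            break
        match = DATA_RE.fullmatch(line)
        if match:
            # Copy the payload verbatim, so message text that merely looks
            # like a header line is never rewritten.
            dst.write(line)
            dst.write(src.read(int(match.group(1))))
            continue
        if line.startswith(b"commit "):
            author = None  # new commit: forget the previous author
        elif line.startswith(b"author "):
            author = line[len(b"author "):]
        elif line.startswith(b"committer ") and author is not None:
            line = b"committer " + author
            author = None
        dst.write(line)


# Hypothetical sample stream; the message deliberately starts with
# "committer " to show that payload bytes are left alone.
message = b"committer not a header <x@example.com> 0 +0000\n"
sample = b"".join([
    b"commit refs/heads/master\n",
    b"mark :1\n",
    b"author A U Thor <author@example.com> 1370854045 +0200\n",
    b"committer C O Mitter <committer@example.com> 1370854098 +0200\n",
    b"data %d\n" % len(message),
    message,
    b"\n",
])
out = io.BytesIO()
set_committer_to_author(io.BytesIO(sample), out)
result = out.getvalue()
```

Wired to sys.stdin.buffer and sys.stdout.buffer, a filter like this could sit between "git fast-export --no-data --all" and "git fast-import --force" in a scratch clone. It does not reproduce --prune-empty, and repositories with signed tags would likely need fast-export's --signed-tags option as well.
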
* Re: filter-branch performance
From: Roberto Tyley @ 2014-12-10 14:18 UTC
To: Jeff King; +Cc: Henning Moll, git@vger.kernel.org

On 9 December 2014 at 18:59, Jeff King <peff@peff.net> wrote:
> On Tue, Dec 09, 2014 at 07:52:33PM +0100, Henning Moll wrote:
>> I assume that there is a lot of process forking going on. Could that be the
>> cause?
>
> Yes. filter-branch is a shell script, and it is probably running
> multiple git commands per commit it is filtering.
>
>> Any ideas how to further improve?

Depending on how much time you can sink into improving the performance
(versus just allowing the process to run to completion), you could also
look into a non-forking solution, as well as not bothering to load the
commit trees. To me non-forking means putting everything into the JVM by
using JGit, like the BFG does, though libgit2 might also be an option.

Changing the BFG's code to do the transformation in your script is
absolutely trivial - define a commit-node cleaner like this:

    object SetCommitterToAuthor extends CommitNodeCleaner {
      override def fixer(kit: CommitNodeCleaner.Kit) = c =>
        c.copy(committer = c.author) // PersonIdent class holds name, email & time
    }

...trivial if you don't mind compiling Scala with SBT that is, and I'm
sure some people do! A DSL for non-Scala people to define their own BFG
scripts would be good, I must get on that some day.

The BFG is generally faster than filter-branch for 3 reasons:

1. No forking - everything stays in the JVM process
2. Embarrassingly parallel algorithm makes good use of multi-core machines
3. Memoization means no Git object (file or folder) is cleaned more than once

In the case of your problem, only the first factor will be noticeably
helpful. Unfortunately commits do need to be cleaned sequentially, as
their hashes depend on the hashes of their parents, and filter-branch
doesn't clean /commits/ more than once, the way it does with files or
folders - so the last 2 reasons in the list won't be significant. For
your specific use case though, the fact that the BFG doesn't load the
file tree at all unless it needs to clean it will also help.

I decided to knock up an egregious hack in the BFG to see what
performance would be like. I ran it against a fairly large repo
(https://github.com/bfg-repo-cleaner-demos/intellij-community-original),
100k commits, stored in /dev/shm, and used the SetCommitterToAuthor code
above.

The BFG run completed in 31.7 seconds; you can see the resulting repo
here:

https://github.com/rtyley/intellij-community-set-committer-to-author

I started running the same test some time ago using filter-branch;
unfortunately that test has not completed yet - the BFG appears to be
substantially faster.

Before:

    $ git cat-file -p b02bf46c4e93c2e8570910cdd68eb6f4ce21ff81
    tree 7a412e49ecdbd966d7efe5fe746ff3ea3b6067d1
    parent 8794219e3e84aed3cc8af926ffd74beafa51fb6b
    author peter <peter@jetbrains.com> 1370854045 +0200
    committer peter <peter@jetbrains.com> 1370854098 +0200

After:

    $ git cat-file -p 3adb7b2a5c87320a5a028b6a59a7132c75a6e91c
    tree 7a412e49ecdbd966d7efe5fe746ff3ea3b6067d1
    parent 5efcdb551789b0d0bb541de9325f09521c5fbcb6
    author peter <peter@jetbrains.com> 1370854045 +0200
    committer peter <peter@jetbrains.com> 1370854045 +0200   <- time fixed

The relevant code is in:

https://github.com/rtyley/bfg-repo-cleaner/compare/set-committer-to-author

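The point that commit hashes depend on the hashes of their parents can be illustrated with a short sketch. It assumes the SHA-1 object format (a "commit <size>" header, a NUL byte, then the commit text); the identities and messages are invented, and the helper name commit_id is purely illustrative.

```python
import hashlib


def commit_id(tree, parents, author, committer, message):
    """Return the SHA-1 object id of a commit with the given fields."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines += ["author " + author, "committer " + committer, "", message]
    body = "\n".join(lines).encode()
    return hashlib.sha1(b"commit %d\0" % len(body) + body).hexdigest()


EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # id of the empty tree
author = "A U Thor <author@example.com> 1370854045 +0200"
committer = "C O Mitter <committer@example.com> 1370854098 +0200"

# Rewriting the root commit's committer changes the root's id...
old_root = commit_id(EMPTY_TREE, [], author, committer, "root\n")
new_root = commit_id(EMPTY_TREE, [], author, author, "root\n")

# ...and the child, identical apart from its parent line, gets a new id
# too, so it cannot be finalised before its parent has been rewritten.
old_child = commit_id(EMPTY_TREE, [old_root], author, author, "child\n")
new_child = commit_id(EMPTY_TREE, [new_root], author, author, "child\n")
```

Note that the tree id stays the same throughout, which is why trees and blobs can be memoized independently of history while commits cannot.
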
* Re: filter-branch performance
From: Jeff King @ 2014-12-10 14:37 UTC
To: Roberto Tyley; +Cc: Henning Moll, git@vger.kernel.org

On Wed, Dec 10, 2014 at 02:18:24PM +0000, Roberto Tyley wrote:

> Depending on how much time you can sink into improving the performance
> (versus just allowing the process to run to completion), you could
> also look into a non-forking solution, as well as not bothering to
> load the commit trees. To me non-forking means putting everything into
> the JVM by using JGit, like the BFG does, though libgit2 might also be
> an option.
>
> Changing the BFG's code to do the transformation in your script is
> absolutely trivial - define a commit-node cleaner like this:
>
> object SetCommitterToAuthor extends CommitNodeCleaner {
>   override def fixer(kit: CommitNodeCleaner.Kit) = c =>
>     c.copy(committer = c.author) // PersonIdent class holds name, email &
> time
> }

Thanks. I _almost_ mentioned BFG in the original email, but I didn't
think it could do arbitrary fixes like this. Can you monkey-patch in
arbitrary code, or do you have to rebuild all of BFG to include the
snippet above?

> ...trivial if you don't mind compiling Scala with SBT that is, and I'm
> sure some people do! A DSL for non-Scala people to define their own
> BFG scripts would be good, I must get on that some day.

That would be cool. Even if the DSL was just Java, if you could do
something like:

    vi fix.java
    javac fix.java
    bfg --filter=fix.class

that would be very useful (and I am probably showing my lack of Java
chops by getting the compilation command or filenames wrong :) ).

> I started running the same test some time ago using filter-branch,
> unfortunately that test has not completed yet - the BFG appears to be
> substantially faster.

No fair if you didn't run filter-branch on a PC and BFG on a Raspberry
Pi. You have to give us a fighting chance. :)

-Peff

* Re: filter-branch performance
From: Roberto Tyley @ 2014-12-10 15:25 UTC
To: Jeff King; +Cc: Henning Moll, git@vger.kernel.org

On 10 December 2014 at 14:37, Jeff King <peff@peff.net> wrote:
> On Wed, Dec 10, 2014 at 02:18:24PM +0000, Roberto Tyley wrote:
>> object SetCommitterToAuthor extends CommitNodeCleaner {
>>   override def fixer(kit: CommitNodeCleaner.Kit) = c =>
>>     c.copy(committer = c.author) // PersonIdent class holds name, email &
>> time
>> }
>
> Thanks. I _almost_ mentioned BFG in the original email, but I didn't
> think it could do arbitrary fixes like this. Can you monkey-patch in
> arbitrary code, or do you have to rebuild all of BFG to include the
> snippet above?

Well, I publish a bfg-library jar to Maven Central, so you don't need to
rebuild that:

http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22bfg-library_2.11%22

...in principle you can write a Java/Groovy/whatever project that calls
that jar (your entry point would be
com.madgag.git.bfg.cleaner.RepoRewriter) - though to be honest, I can't
swear to how /friendly/ the API would be to call from non-Scala-land, as
I haven't tried it.

Incidentally, if people want to try compiling this monkey-patched BFG at
home, this is how you'd do it:

* Install SBT - http://www.scala-sbt.org/download.html (or 'brew install
  sbt' for Mac OS X)
* git clone https://github.com/rtyley/bfg-repo-cleaner.git --branch set-committer-to-author
* cd bfg-repo-cleaner
* sbt "bfg/run --no-blob-protection"

There will be a lot of automated downloading of dependencies, and
compilation will be slow the first time around, but at least there
aren't that many steps. I do realise that being Scala/JVM based makes
working on the BFG a bit of a specialist activity at the moment!

>> A DSL for non-Scala people to define their own
>> BFG scripts would be good, I must get on that some day.
>
> That would be cool. Even if the DSL was just Java, if you could do
> something like:
>
>   vi fix.java
>   javac fix.java
>   bfg --filter=fix.class
>
> that would be very useful (and I am probably showing my lack of Java chops
> by getting the compilation command or filenames wrong :) ).

Your syntax is right :) I'll give it some thought.

>> I started running the same test some time ago using filter-branch,
>> unfortunately that test has not completed yet - the BFG appears to be
>> substantially faster.
>
> No fair if you didn't run filter-branch on a PC and BFG on a Raspberry
> Pi. You have to give us a fighting chance. :)

I guess I made that rod for my own back :) http://youtu.be/Ir4IHzPhJuI
for those who haven't seen it.

* Re: filter-branch performance
From: Junio C Hamano @ 2014-12-10 16:05 UTC
To: Roberto Tyley; +Cc: Jeff King, Henning Moll, git@vger.kernel.org

Roberto Tyley <roberto.tyley@gmail.com> writes:

> The BFG is generally faster than filter-branch for 3 reasons:
>
> 1. No forking - everything stays in the JVM process
> 2. Embarrassingly parallel algorithm makes good use of multi-core machines
> 3. Memoization means no Git object (file or folder) is cleaned more than once
>
> In the case of your problem, only the first factor will be noticeably
> helpful. Unfortunately commits do need to be cleaned sequentially, as
> their hashes depend on the hashes of their parents, and filter-branch
> doesn't clean /commits/ more than once, the way it does with files or
> folders - so the last 2 reasons in the list won't be significant.

Just this part. If your history is bushy, you should be able to
rewrite histories of merged branches in parallel up to the point
they are merged---rewriting of the merge commit of course has to
wait until all the branches have been rewritten, though.

* Re: filter-branch performance
From: Roberto Tyley @ 2014-12-10 23:44 UTC
To: Junio C Hamano; +Cc: Jeff King, Henning Moll, git@vger.kernel.org

On 10 December 2014 at 16:05, Junio C Hamano <gitster@pobox.com> wrote:
> Roberto Tyley <roberto.tyley@gmail.com> writes:
>
>> The BFG is generally faster than filter-branch for 3 reasons:
>>
>> 1. No forking - everything stays in the JVM process
>> 2. Embarrassingly parallel algorithm makes good use of multi-core machines
>> 3. Memoization means no Git object (file or folder) is cleaned more than once
>>
>> In the case of your problem, only the first factor will be noticeably
>> helpful. Unfortunately commits do need to be cleaned sequentially, as
>> their hashes depend on the hashes of their parents, and filter-branch
>> doesn't clean /commits/ more than once, the way it does with files or
>> folders - so the last 2 reasons in the list won't be significant.
>
> Just this part. If your history is bushy, you should be able to
> rewrite histories of merged branches in parallel up to the point
> they are merged---rewriting of the merge commit of course has to
> wait until all the branches have been rewritten, though.

That's true, and the BFG does take advantage of that parallelism, so as
well as point 1, point 2 will provide some benefit if history is bushy
enough :)
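
The "bushy history" observation can be sketched as a scheduling problem: a commit becomes ready for rewriting once all of its parents have been rewritten, so commits that become ready together are independent of each other. The toy below groups a made-up history into such waves. It is only an illustration of the dependency structure, not a description of how the BFG schedules its work, and the function name and commit labels are hypothetical.

```python
from collections import defaultdict


def rewrite_waves(parents):
    """parents maps each commit to the list of its parent commits.

    Returns a list of waves. Commits in the same wave do not depend on
    each other, so they could be rewritten concurrently; a merge only
    appears after every branch leading into it has been handled."""
    children = defaultdict(list)
    pending = {}  # number of parents not yet rewritten, per commit
    for commit, commit_parents in parents.items():
        pending[commit] = len(commit_parents)
        for parent in commit_parents:
            children[parent].append(commit)

    wave = sorted(c for c, count in pending.items() if count == 0)
    waves = []
    while wave:
        waves.append(wave)
        ready = []
        for commit in wave:
            for child in children[commit]:
                pending[child] -= 1
                if pending[child] == 0:
                    ready.append(child)
        wave = sorted(ready)
    return waves


# Two branches fork from "root" and are merged again.
history = {
    "root": [],
    "a1": ["root"],
    "a2": ["a1"],
    "b1": ["root"],
    "merge": ["a2", "b1"],
}
waves = rewrite_waves(history)
# a1 and b1 land in the same wave; "merge" comes last.
```

On a purely linear history every wave contains a single commit, which matches the earlier remark that commits then have to be cleaned sequentially.
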