filter-branch performance

* filter-branch performance
@ 2014-12-09 18:52 Henning Moll
  2014-12-09 18:59 ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Henning Moll @ 2014-12-09 18:52 UTC (permalink / raw)
  To: git

Hi,

I am running this command

git filter-branch \
  --env-filter '
    export GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
    export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME"
    export GIT_COMMITTER_DATE="$GIT_AUTHOR_DATE"
  ' \
  --prune-empty --tag-name-filter cat -- --all

in a repository which I copied to /dev/shm beforehand. According to "top", 
the git process only consumes about 5 percent of the CPU. The load is 
between 0.70 and 1.00.

I assume that there is a lot of process forking going on. Could that be 
the cause?

Any ideas on how to improve this further?

Best regards
Henning

* Re: filter-branch performance
  2014-12-09 18:52 filter-branch performance Henning Moll
@ 2014-12-09 18:59 ` Jeff King
  2014-12-10 14:18   ` Roberto Tyley
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2014-12-09 18:59 UTC (permalink / raw)
  To: Henning Moll; +Cc: git

On Tue, Dec 09, 2014 at 07:52:33PM +0100, Henning Moll wrote:

> I am running this command
> 
> git filter-branch --env-filter 'export
> GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
> GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_DATE="$GIT_AUTHOR_DATE"'
> --prune-empty --tag-name-filter cat -- --all
> 
> in a repository which I copied to /dev/shm before. According to "top", the
> git process only consumes about 5 percent of the CPU. The load is between
> 0.70 and 1.00.
> 
> I assume that there is a lot of process forking going on. Could that be the
> cause?

Yes. filter-branch is a shell script, and it is probably running
multiple git commands per commit it is filtering.

> Any ideas how to further improve?

In your case you are not touching the tree contents at all. Last time I
looked into this, I believe that filter-branch always loaded the index
for each commit, even if no --index-filter is being used. So teaching
filter-branch to optimize this case would be one strategy.

Another is to try using "git fast-export | git fast-import", and munging
the data stream in between. That may be more work, depending on how fancy
you want to get with accurate parsing (look into fast-export's
--no-data, which omits blob data; that should make things faster and
make hacky context-less parsing less likely to cause problems).
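
As a rough illustration of the stream-munging idea, here is a minimal
Python sketch of a filter that could sit between the two commands. It is
an example under stated assumptions, not something run against a real
repository: the function name committer_to_author is made up, and it
relies on the export stream listing each commit's "author" line before
its "committer" line and carrying messages as counted "data <n>"
payloads, which it copies verbatim so that message text is never
mistaken for a header.

```python
import re

# A counted payload header as emitted by fast-export: "data <byte count>\n".
_DATA = re.compile(rb"data (\d+)\n")


def committer_to_author(src, dst):
    """Copy a fast-export stream from src to dst (binary file objects),
    replacing each commit's committer identity and date with its author's."""
    author = None
    while True:
        line = src.readline()
        if not line:
            break
        m = _DATA.fullmatch(line)
        if m:
            # Copy the payload (commit or tag message) untouched.
            dst.write(line)
            dst.write(src.read(int(m.group(1))))
            continue
        if line.startswith(b"commit "):
            author = None  # forget the previous commit's author
        elif line.startswith(b"author "):
            author = line[len(b"author "):]
        elif line.startswith(b"committer ") and author is not None:
            line = b"committer " + author
        dst.write(line)
```

One hypothetical way to wire it up, on a scratch copy of the repository,
would be a small script that calls
committer_to_author(sys.stdin.buffer, sys.stdout.buffer), placed between
"git fast-export --no-data --all" and "git fast-import --force". This
sketch does not reproduce what --prune-empty or --tag-name-filter do.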

-Peff

* Re: filter-branch performance
  2014-12-09 18:59 ` Jeff King
@ 2014-12-10 14:18   ` Roberto Tyley
  2014-12-10 14:37     ` Jeff King
  2014-12-10 16:05     ` Junio C Hamano
  0 siblings, 2 replies; 7+ messages in thread
From: Roberto Tyley @ 2014-12-10 14:18 UTC (permalink / raw)
  To: Jeff King; +Cc: Henning Moll, git@vger.kernel.org

On 9 December 2014 at 18:59, Jeff King <peff@peff.net> wrote:
> On Tue, Dec 09, 2014 at 07:52:33PM +0100, Henning Moll wrote:
>> I assume that there is a lot of process forking going on. Could that be the
>> cause?
>
> Yes. filter-branch is a shell script, and it is probably running
> multiple git commands per commit it is filtering.
>
>> Any ideas how to further improve?

Depending on how much time you can sink into improving the performance
(versus just allowing the process to run to completion), you could
also look into a non-forking solution, as well as not bothering to
load the commit trees. To me non-forking means putting everything into
the JVM by using JGit, like the BFG does, though libgit2 might also be
an option.

Changing the BFG's code to do the transformation in your script is
absolutely trivial - define a commit-node cleaner like this:

object SetCommitterToAuthor extends CommitNodeCleaner {
  // PersonIdent class holds name, email & time
  override def fixer(kit: CommitNodeCleaner.Kit) = c => c.copy(committer = c.author)
}

...trivial if you don't mind compiling Scala with SBT that is, and I'm
sure some people do! A DSL for non-Scala people to define their own
BFG scripts would be good, I must get on that some day.

The BFG is generally faster than filter-branch for 3 reasons:

1. No forking - everything stays in the JVM process
2. Embarrassingly parallel algorithm makes good use of multi-core machines
3. Memoization means no Git object (file or folder) is cleaned more than once
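
To make the third point concrete, here is a minimal Python sketch of
memoization keyed by object id. It is not the BFG's code; the names
memoized_cleaner and expensive_clean and the ids "b1" and "b2" are
invented for the illustration.

```python
def memoized_cleaner(clean):
    """Wrap clean(object_id) so each distinct id is cleaned at most once."""
    cache = {}

    def cleaner(object_id):
        if object_id not in cache:
            cache[object_id] = clean(object_id)
        return cache[object_id]

    return cleaner


calls = []  # records every real (non-cached) cleaning


def expensive_clean(object_id):
    calls.append(object_id)
    return "cleaned-" + object_id


clean = memoized_cleaner(expensive_clean)

# Three commits whose trees all reference the same two blobs.
commit_blobs = [["b1", "b2"], ["b1", "b2"], ["b1", "b2"]]
results = [[clean(b) for b in blobs] for blobs in commit_blobs]

print(len(calls))  # number of real cleanings performed
```

Six lookups are made, but only the two distinct ids reach
expensive_clean; the other four are served from the cache.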

In the case of your problem, only the first factor will be noticeably
helpful. Unfortunately commits do need to be cleaned sequentially, as
their hashes depend on the hashes of their parents, and filter-branch
doesn't clean /commits/ more than once, the way it does with files or
folders - so the last 2 reasons in the list won't be significant.
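
The dependency of a commit's hash on its parents can be seen in a small
Python sketch. It serializes simplified commit objects by hand and
hashes them the way Git names objects (SHA-1 over "<type> <size>\0" plus
the body). The identities, messages and helper names are invented for
the example; the empty-tree id is Git's well-known constant.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # Git's empty tree


def git_object_id(obj_type, body):
    """Object name as Git computes it: SHA-1 over '<type> <size>\\0' + body."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_id(tree, parents, author, committer, message):
    """Serialize a minimal commit object and return its id."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = ("\n".join(lines) + "\n").encode()
    return git_object_id("commit", body)


alice = "Alice <alice@example.com> 1370854045 +0200"
carol = "Carol <carol@example.com> 1370854098 +0200"

# Original history: only the root has a committer that differs from its author.
old_root = commit_id(EMPTY_TREE, [], alice, carol, "root")
old_child = commit_id(EMPTY_TREE, [old_root], alice, alice, "child")

# Fixing the root changes its id, and the child embeds that id as its parent.
new_root = commit_id(EMPTY_TREE, [], alice, alice, "root")
new_child = commit_id(EMPTY_TREE, [new_root], alice, alice, "child")

print(old_root != new_root, old_child != new_child)
```

The child's own author and committer fields are unchanged, yet its id
still changes because its "parent" line changed, which is why a commit
cannot be finalized before its parents.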

For your specific use case tho', the fact that BFG doesn't load the
file tree at all unless it needs to clean it will also help.

I decided to knock up an egregious hack in the BFG to see what
performance would be like. I ran it against a fairly large repo
(https://github.com/bfg-repo-cleaner-demos/intellij-community-original),
100k commits, stored in /dev/shm, and used the SetCommitterToAuthor
code above. The BFG run completed in 31.7 seconds, you can see the
resulting repo here:

https://github.com/rtyley/intellij-community-set-committer-to-author

I started running the same test some time ago using filter-branch,
unfortunately that test has not completed yet - the BFG appears to be
substantially faster.

Before:
$ git cat-file -p b02bf46c4e93c2e8570910cdd68eb6f4ce21ff81
tree 7a412e49ecdbd966d7efe5fe746ff3ea3b6067d1
parent 8794219e3e84aed3cc8af926ffd74beafa51fb6b
author peter <peter@jetbrains.com> 1370854045 +0200
committer peter <peter@jetbrains.com> 1370854098 +0200

After:
$ git cat-file -p 3adb7b2a5c87320a5a028b6a59a7132c75a6e91c
tree 7a412e49ecdbd966d7efe5fe746ff3ea3b6067d1
parent 5efcdb551789b0d0bb541de9325f09521c5fbcb6
author peter <peter@jetbrains.com> 1370854045 +0200
committer peter <peter@jetbrains.com> 1370854045 +0200 <- time fixed

The relevant code is in:
https://github.com/rtyley/bfg-repo-cleaner/compare/set-committer-to-author

* Re: filter-branch performance
  2014-12-10 14:18   ` Roberto Tyley
@ 2014-12-10 14:37     ` Jeff King
  2014-12-10 15:25       ` Roberto Tyley
  2014-12-10 16:05     ` Junio C Hamano
  1 sibling, 1 reply; 7+ messages in thread
From: Jeff King @ 2014-12-10 14:37 UTC (permalink / raw)
  To: Roberto Tyley; +Cc: Henning Moll, git@vger.kernel.org

On Wed, Dec 10, 2014 at 02:18:24PM +0000, Roberto Tyley wrote:

> Depending on how much time you can sink into improving the performance
> (versus just allowing the process to run to completion), you could
> also look into a non-forking solution, as well as not bothering to
> load the commit trees. To me non-forking means putting everything into
> the JVM by using JGit, like the BFG does, though libgit2 might also be
> an option.
> 
> Changing the BFG's code to do the transformation in your script is
> absolutely trivial - define a commit-node cleaner like this:
> 
> object SetCommitterToAuthor extends CommitNodeCleaner {
>   // PersonIdent class holds name, email & time
>   override def fixer(kit: CommitNodeCleaner.Kit) = c => c.copy(committer = c.author)
> }

Thanks. I _almost_ mentioned BFG in the original email, but I didn't
think it could do arbitrary fixes like this. Can you monkey-patch in
arbitrary code, or do you have to rebuild all of BFG to include the
snippet above?

> ...trivial if you don't mind compiling Scala with SBT that is, and I'm
> sure some people do! A DSL for non-Scala people to define their own
> BFG scripts would be good, I must get on that some day.

That would be cool.  Even if the DSL was just Java, if you could do
something like:

  vi fix.java
  javac fix.java
  bfg --filter=fix.class

that would be very useful (and I am probably showing my lack of Java chops
by getting the compilation command or filenames wrong :) ).

> I started running the same test some time ago using filter-branch,
> unfortunately that test has not completed yet - the BFG appears to be
> substantially faster.

No fair if you didn't run filter-branch on a PC and BFG on a Raspberry
Pi. You have to give us a fighting chance. :)

-Peff

* Re: filter-branch performance
  2014-12-10 14:37     ` Jeff King
@ 2014-12-10 15:25       ` Roberto Tyley
  0 siblings, 0 replies; 7+ messages in thread
From: Roberto Tyley @ 2014-12-10 15:25 UTC (permalink / raw)
  To: Jeff King; +Cc: Henning Moll, git@vger.kernel.org

On 10 December 2014 at 14:37, Jeff King <peff@peff.net> wrote:
> On Wed, Dec 10, 2014 at 02:18:24PM +0000, Roberto Tyley wrote:
>> object SetCommitterToAuthor extends CommitNodeCleaner {
>>   // PersonIdent class holds name, email & time
>>   override def fixer(kit: CommitNodeCleaner.Kit) = c => c.copy(committer = c.author)
>> }
>
> Thanks. I _almost_ mentioned BFG in the original email, but I didn't
> think it could do arbitrary fixes like this. Can you monkey-patch in
> arbitrary code, or do you have to rebuild all of BFG to include the
> snippet above?

Well, I publish a bfg-library jar to Maven Central, so you don't need
to rebuild that:

http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22bfg-library_2.11%22

...in principle you can write a Java/Groovy/whatever project that
calls that jar (your entry point would be
com.madgag.git.bfg.cleaner.RepoRewriter) - tho' to be honest, I can't
swear to how /friendly/ the API would be to call from non-Scala-land,
as I haven't tried it.

Incidentally, if people want to try compiling this monkey-patched BFG
at home, this is how you'd do it:

* Install SBT - http://www.scala-sbt.org/download.html (or 'brew install sbt' for Mac OS X)
* git clone https://github.com/rtyley/bfg-repo-cleaner.git --branch set-committer-to-author
* cd bfg-repo-cleaner
* sbt "bfg/run --no-blob-protection"

There will be a lot of automated downloading of dependencies, and
compilation will be slow the first time around, but at least there
aren't that many steps. I do realise that being Scala/JVM based makes
working on the BFG a bit of a specialist activity at the moment!

>> A DSL for non-Scala people to define their own
>> BFG scripts would be good, I must get on that some day.
>
> That would be cool.  Even if the DSL was just Java, if you could do
> something like:
>
>   vi fix.java
>   javac fix.java
>   bfg --filter=fix.class
>
> that would be very useful (and I am probably showing my lack of Java chops
> by getting the compilation command or filenames wrong :) ).

Your syntax is right :) I'll give it some thought.

>> I started running the same test some time ago using filter-branch,
>> unfortunately that test has not completed yet - the BFG appears to be
>> substantially faster.
>
> No fair if you didn't run filter-branch on a PC and BFG on a Raspberry
> Pi. You have to give us a fighting chance. :)

I guess I made that rod for my own back :) http://youtu.be/Ir4IHzPhJuI
for those who haven't seen it.

* Re: filter-branch performance
  2014-12-10 14:18   ` Roberto Tyley
  2014-12-10 14:37     ` Jeff King
@ 2014-12-10 16:05     ` Junio C Hamano
  2014-12-10 23:44       ` Roberto Tyley
  1 sibling, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2014-12-10 16:05 UTC (permalink / raw)
  To: Roberto Tyley; +Cc: Jeff King, Henning Moll, git@vger.kernel.org

Roberto Tyley <roberto.tyley@gmail.com> writes:

> The BFG is generally faster than filter-branch for 3 reasons:
>
> 1. No forking - everything stays in the JVM process
> 2. Embarrassingly parallel algorithm makes good use of multi-core machines
> 3. Memoization means no Git object (file or folder) is cleaned more than once
>
> In the case of your problem, only the first factor will be noticeably
> helpful. Unfortunately commits do need to be cleaned sequentially, as
> their hashes depend on the hashes of their parents, and filter-branch
> doesn't clean /commits/ more than once, the way it does with files or
> folders - so the last 2 reasons in the list won't be significant.

Just this part.  If your history is bushy, you should be able to
rewrite histories of merged branches in parallel up to the point
they are merged---rewriting of the merge commit of course has to
wait until all the branches have been rewritten, though.
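
A minimal Python sketch of that scheduling idea, assuming the history is
given as a mapping from each commit to its parents: commits are grouped
into "waves", and every commit in a wave has all of its parents in
earlier waves, so the members of one wave could be rewritten in
parallel. The function name rewrite_waves and the toy history are
invented for the example; this is not how filter-branch or the BFG
schedules work.

```python
def rewrite_waves(parents):
    """Group commits into waves; each wave depends only on earlier waves.

    parents maps a commit name to the list of its parent commit names.
    """
    remaining = {c: set(ps) for c, ps in parents.items()}
    done = set()
    waves = []
    while remaining:
        # A commit is ready once every one of its parents has been rewritten.
        ready = sorted(c for c, ps in remaining.items() if ps <= done)
        if not ready:
            raise ValueError("cycle or missing parent in history graph")
        waves.append(ready)
        done.update(ready)
        for c in ready:
            del remaining[c]
    return waves


# Two branches fork from A and are merged again in M.
history = {
    "A": [],
    "B": ["A"], "D": ["B"],
    "C": ["A"], "E": ["C"],
    "M": ["D", "E"],
}
print(rewrite_waves(history))
# [['A'], ['B', 'C'], ['D', 'E'], ['M']]
```

On a purely linear history every wave holds a single commit, so there is
nothing to parallelize; the gain grows with how bushy the graph is.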

* Re: filter-branch performance
  2014-12-10 16:05     ` Junio C Hamano
@ 2014-12-10 23:44       ` Roberto Tyley
  0 siblings, 0 replies; 7+ messages in thread
From: Roberto Tyley @ 2014-12-10 23:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Henning Moll, git@vger.kernel.org

On 10 December 2014 at 16:05, Junio C Hamano <gitster@pobox.com> wrote:
> Roberto Tyley <roberto.tyley@gmail.com> writes:
>
>> The BFG is generally faster than filter-branch for 3 reasons:
>>
>> 1. No forking - everything stays in the JVM process
>> 2. Embarrassingly parallel algorithm makes good use of multi-core machines
>> 3. Memoization means no Git object (file or folder) is cleaned more than once
>>
>> In the case of your problem, only the first factor will be noticeably
>> helpful. Unfortunately commits do need to be cleaned sequentially, as
>> their hashes depend on the hashes of their parents, and filter-branch
>> doesn't clean /commits/ more than once, the way it does with files or
>> folders - so the last 2 reasons in the list won't be significant.
>
> Just this part.  If your history is bushy, you should be able to
> rewrite histories of merged branches in parallel up to the point
> they are merged---rewriting of the merge commit of course has to
> wait until all the branches have been rewritten, though.

That's true, and the bfg does take advantage of that parallelism, so
as well as point 1, point 2 will provide some benefit if history is
bushy enough :)
