git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* A fast alternative to git-filter-branch - The BFG Repo-Cleaner
@ 2013-02-05  0:04 Roberto Tyley
  0 siblings, 0 replies; only message in thread
From: Roberto Tyley @ 2013-02-05  0:04 UTC (permalink / raw)
  To: git

I recently released The BFG Repo-Cleaner, a new tool for cleansing bad
data out of Git repository histories. The BFG is typically at least
10-50x faster than git-filter-branch at these tasks:

* Removing Crazy Big Files from repo history
* Removing Passwords, Credentials & other Private data

http://rtyley.github.com/bfg-repo-cleaner/

As an example, these are timings for deleting an arbitrary file from
the large GCC repository (148495 commits):

The BFG : 3m29s
$ bfg -D README-fixinc

git filter-branch : 472m31s
$ git filter-branch --index-filter 'git rm --cached --ignore-unmatch
gcc/README-fixinc' --prune-empty --tag-name-filter cat -- --all

(roughly a 135x speed increase, reducing the task of processing a
large codebase from an overnight job to the work of a few minutes....
all timings done in a 4GB tmpfs ramdisk)


The BFG has some simple but very powerful command-line options, which
perform at similar speed:

remove all blobs bigger than 1 megabyte :
$ bfg --strip-blobs-bigger-than 1M  my-repo.git

replace all passwords (listed in a file 'passwords.txt') with ***REMOVED*** :
$ bfg --replace-banned-strings passwords.txt  my-repo.git


The main source of the BFG's performance advantage comes from
preventing repeated examination of the same tree objects. The approach
of git-filter-branch performs filtering for each commit, against the
complete file-hierarchy of each commit, one after the other, even
though commit trees are largely very similar. For the use-cases of The
BFG that's unnecessary- we don't care where, and in which commit, a
'bad' file exists - we just want it dealt with. Consequently the BFG
processes the Git object db on a memoised tree-by-tree basis,
processing each and every file & folder exactly once - the final
processing of the commit hierarchy is very quick. This _does_ mean
that it's not possible to delete files based on their absolute path
within the repo, but they can deleted based on their filename,
blob-id, or contents. This, and multi-core processing by default,
gives the dramatic speed-up while still providing the same results.
There's more performance data here:
https://docs.google.com/spreadsheet/ccc?key=0AsR1d5Zpes8HdER3VGU1a3dOcmVHMmtzT2dsS2xNenc

I'd welcome feedback, and if anyone has cause to filter a repository's
history in future, I'd appreciate you giving the BFG a try and letting
me know how you found it.

thanks,
Roberto Tyley
software dev @ The Guardian

http://rtyley.github.com/bfg-repo-cleaner/

^ permalink raw reply	[flat|nested] only message in thread

only message in thread, other threads:[~2013-02-05  0:04 UTC | newest]

Thread overview: (only message) (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-02-05  0:04 A fast alternative to git-filter-branch - The BFG Repo-Cleaner Roberto Tyley

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).