From: Roberto Tyley <roberto.tyley@gmail.com>
To: git@vger.kernel.org
Subject: A fast alternative to git-filter-branch - The BFG Repo-Cleaner
Date: Tue, 5 Feb 2013 00:04:20 +0000
Message-ID: <CAFY1edb6osN+Qe33K9e6imaMG=3_ZUJx7Q1R++RHfY6h+zGXYQ@mail.gmail.com>

I recently released The BFG Repo-Cleaner, a new tool for cleansing bad
data out of Git repository histories. The BFG is typically at least
10-50x faster than git-filter-branch at these tasks:

 * Removing Crazy Big Files from repo history
 * Removing Passwords, Credentials & other Private data

http://rtyley.github.com/bfg-repo-cleaner/

As an example, these are timings for deleting an arbitrary file from
the large GCC repository (148495 commits):

The BFG : 3m29s
  $ bfg -D README-fixinc

git filter-branch : 472m31s
  $ git filter-branch --index-filter \
      'git rm --cached --ignore-unmatch gcc/README-fixinc' \
      --prune-empty --tag-name-filter cat -- --all

(Roughly a 135x speed increase, which reduces the task of processing a
large codebase from an overnight job to the work of a few minutes. All
timings were done in a 4GB tmpfs ramdisk.)
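
The ratio can be checked from the two timings quoted above. A small
Python sketch, where the helper name to_seconds is made up for this
illustration:

```python
import re

def to_seconds(timing: str) -> int:
    """Convert a timing such as '472m31s' into whole seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", timing)
    if match is None:
        raise ValueError(f"unrecognised timing: {timing!r}")
    minutes, seconds = (int(part) for part in match.groups())
    return minutes * 60 + seconds

bfg = to_seconds("3m29s")               # 209 seconds
filter_branch = to_seconds("472m31s")   # 28351 seconds
speedup = filter_branch / bfg
print(f"{speedup:.1f}x")                # prints 135.7x
```
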
The BFG has some simple but very powerful command-line options, which
perform at similar speed:

remove all blobs bigger than 1 megabyte :
  $ bfg --strip-blobs-bigger-than 1M my-repo.git

replace all passwords (listed in a file 'passwords.txt') with ***REMOVED*** :
  $ bfg --replace-banned-strings passwords.txt my-repo.git
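
To show what the second option means for file contents, here is a
conceptual Python sketch. It is not the BFG's implementation: it
assumes the passwords file holds one literal string per line, and the
names load_banned and scrub are hypothetical.

```python
def load_banned(lines):
    """Return the non-empty lines of a passwords file as literal strings."""
    return [line.strip() for line in lines if line.strip()]

def scrub(text, banned, marker="***REMOVED***"):
    """Replace every occurrence of each banned string with the marker."""
    # Longest first, so a password that contains another one is replaced whole.
    for secret in sorted(banned, key=len, reverse=True):
        text = text.replace(secret, marker)
    return text

# Assumed file contents: one password per line, blank lines ignored.
banned = load_banned(["hunter2\n", "\n", "s3cr3t-token\n"])
print(scrub("db.password=hunter2\napi.key=s3cr3t-token\n", banned))
```

In a history rewrite the same substitution would have to be applied to
every blob that contains a listed string, which is why doing it once
per distinct blob rather than once per commit matters.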

The main source of the BFG's performance advantage is that it avoids
repeated examination of the same tree objects. git-filter-branch
performs filtering for each commit, against the complete file-hierarchy
of each commit, one after the other, even though commit trees are
largely very similar. For the use-cases of The BFG that's unnecessary:
we don't care where, or in which commit, a 'bad' file exists, we just
want it dealt with. Consequently the BFG processes the Git object db on
a memoised tree-by-tree basis, processing each and every file & folder
exactly once, so the final processing of the commit hierarchy is very
quick. This _does_ mean that it's not possible to delete files based on
their absolute path within the repo, but they can be deleted based on
their filename, blob-id, or contents. This, and multi-core processing
by default, gives the dramatic speed-up while still providing the same
results.
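
The memoised tree-by-tree idea can be sketched in a few lines of
Python. This is a toy model under stated assumptions, not the BFG's
code: the in-memory store `trees` and the functions put_tree and
make_cleaner are invented for illustration, and real Git tree objects
are not Python tuples. It shows only that a subtree shared by many
commits is cleaned once and the result reused.

```python
import hashlib

# Toy object store: tree id -> tuple of (name, kind, object id) entries.
# This layout is an assumption for the sketch, not Git's on-disk format.
trees = {}

def put_tree(entries):
    """Store a tree under a content-derived id, like a content-addressed db."""
    entries = tuple(sorted(entries))
    tree_id = hashlib.sha1(repr(entries).encode()).hexdigest()
    trees[tree_id] = entries
    return tree_id

def make_cleaner(bad_names):
    memo = {}      # original tree id -> cleaned tree id
    visits = []    # every tree actually processed, kept for demonstration

    def clean(tree_id):
        if tree_id in memo:              # seen before: reuse the earlier result
            return memo[tree_id]
        visits.append(tree_id)
        kept = []
        for name, kind, obj_id in trees[tree_id]:
            if kind == "tree":
                kept.append((name, kind, clean(obj_id)))
            elif name not in bad_names:  # match on filename only, not on path
                kept.append((name, kind, obj_id))
        memo[tree_id] = put_tree(kept)
        return memo[tree_id]

    return clean, visits

# Two commits whose root trees share the same 'gcc' subtree.
gcc = put_tree([("README-fixinc", "blob", "b1"), ("main.c", "blob", "b2")])
root1 = put_tree([("gcc", "tree", gcc), ("NEWS", "blob", "b3")])
root2 = put_tree([("gcc", "tree", gcc), ("NEWS", "blob", "b4")])

clean, visits = make_cleaner({"README-fixinc"})
new_roots = [clean(root) for root in (root1, root2)]
print(len(visits))   # 3: root1, gcc and root2; gcc is processed only once
```

Because the memo is keyed by tree id alone, the cleaner never knows
which path or commit led to a tree, which is the reason deletion by
absolute path is not available in this scheme.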

There's more performance data here:

https://docs.google.com/spreadsheet/ccc?key=0AsR1d5Zpes8HdER3VGU1a3dOcmVHMmtzT2dsS2xNenc

I'd welcome feedback, and if anyone has cause to filter a repository's
history in future, I'd appreciate you giving the BFG a try and letting
me know how you found it.

thanks,
Roberto Tyley
software dev @ The Guardian
http://rtyley.github.com/bfg-repo-cleaner/