* [ANNOUNCE] git_fast_filter
From: Elijah Newren @ 2009-04-08 3:35 UTC
To: Git Mailing List

Just thought I'd make this available, in case there's others with niche
needs that find it useful...

git_fast_filter assists with quickly rewriting the history of a
repository by making it easy to write scripts whose purpose is to serve
as safe filters between fast-export and fast-import.  git_fast_filter
comes with example programs, a basic test-suite, and a double your money
back satisfaction guarantee.  (I love free software.)

You can get it from
  git://gitorious.org/git_fast_filter/mainline.git

In more detail...

=== Purpose ===

git_fast_filter is designed to make it easy to filter or rewrite the
history of a repository.  As such, it fills the same role as
git-filter-branch, and was written primarily to overcome the sometimes
severe speed shortcomings of git-filter-branch.  In particular, using
git_fast_filter can avoid thousands or millions of new process forks, and
can allow you to rewrite the same file only one time instead of 50,000
times.

However, while using git_fast_filter is fairly simple and quick, it is
hard to beat writing a simple git-filter-branch one-liner for efficiency
of human time.  Also, the two tools use very different methods of
rewriting history and do not have exactly overlapping feature sets, so
the best tool for a particular job is going to be very problem dependent.

As human time is often more important than computer time, especially for
one-shot rewrites, git-filter-branch will probably continue to be the
more common tool.  However, git_fast_filter is useful in cases where
computer time of a rewrite matters (particularly larger repositories and
more involved rewrites that need to be run and tested many times on large
data sets).  Also, git_fast_filter has a couple of features that may come
in handy in special cases (assisting with generating fast-export output
from scratch, interleaving commits from separate repositories, and
bidirectional collaboration between filtered and unfiltered
repositories).

=== Idea ===

The way git_fast_filter works is by providing a simple python library,
git_fast_filter.py.  This library can be used in simple python scripts to
create a filter for the output of git-fast-export.  Thus, the typical
calling convention is of the form:

  git fast-export | filter_script.py | git fast-import

=== Example ===

An example script that renames the 'master' branch to 'other' is shown
below (this is similar to the example in the git-fast-export manpage, but
is safe against the string 'refs/heads/master' appearing in some file or
commit message in the repository):

  #!/usr/bin/python

  from git_fast_filter import Commit, FastExportFilter

  # Called once per commit in the stream; only the branch name is changed.
  def my_commit_callback(commit):
      if commit.branch == "refs/heads/master":
          commit.branch = "refs/heads/other"

  # Reads fast-export output and writes the filtered stream.
  filter = FastExportFilter(commit_callback = my_commit_callback)
  filter.run()

The user can then run this script by:

  $ mkdir target && cd target && git init
  $ (cd /PATH/LEADING/TO/source && git fast-export --all) \
      | /PATH/TO/filter_script.py | git fast-import

(Note: The user can have the script take care of the git init, the cd's,
and the invocations of git fast-export and git fast-import by just
passing directory names to FastExportFilter.run; however, writing out the
details explicitly as in the above example makes it clearer what is going
on.)

Elijah
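To make the "safe against the string 'refs/heads/master' appearing in some
file or commit message" point concrete, here is a minimal Python 3 sketch
that does not use git_fast_filter at all.  It is not how the library is
implemented; it only illustrates the general idea of parsing the stream
instead of pattern-matching on it.  The function name rename_branch and the
sample stream are hypothetical.  The sketch relies on the fast-import
stream convention that a "data <n>" line is followed by exactly n raw
bytes, and it assumes the byte-count form of "data" is the only one
present.

```python
import io

OLD_REF = b"refs/heads/master"
NEW_REF = b"refs/heads/other"


def rename_branch(stream_in, stream_out, old=OLD_REF, new=NEW_REF):
    """Copy a fast-export stream, renaming `old` only on command lines."""
    while True:
        line = stream_in.readline()
        if not line:
            break
        if line.startswith(b"data "):
            # "data <n>" announces n raw bytes (a blob or a commit message).
            # Copy them verbatim so their contents are never rewritten.
            size = int(line[len(b"data "):].strip())
            stream_out.write(line)
            stream_out.write(stream_in.read(size))
            continue
        # Only "commit <ref>" and "reset <ref>" command lines are renamed
        # in this sketch; other commands pass through unchanged.
        for command in (b"commit ", b"reset "):
            if line == command + old + b"\n":
                line = command + new + b"\n"
        stream_out.write(line)


def demo():
    # A commit whose message happens to be the very line a naive,
    # line-based substitution would match.
    message = b"commit refs/heads/master\n"
    sample = (
        b"commit refs/heads/master\n"
        + b"data %d\n" % len(message)
        + message
        + b"\n"
    )
    out = io.BytesIO()
    rename_branch(io.BytesIO(sample), out)
    return out.getvalue()


if __name__ == "__main__":
    print(demo().decode())
```

In the demo only the first line is renamed; the identical-looking line
inside the commit message is copied through untouched.  For real use the
function could be given sys.stdin.buffer and sys.stdout.buffer in place of
the in-memory streams.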
* Re: [ANNOUNCE] git_fast_filter
From: Johannes Schindelin @ 2009-04-08 9:45 UTC
To: Elijah Newren; +Cc: Git Mailing List

Hi,

On Tue, 7 Apr 2009, Elijah Newren wrote:

> Just thought I'd make this available, in case there's others with niche
> needs that find it useful...

Have you seen

	http://thread.gmane.org/gmane.comp.version-control.git/52323

I was rather disappointed that skimo left the patch series in a rather
half-useful state.

Ciao,
Dscho
* Re: [ANNOUNCE] git_fast_filter
From: Elijah Newren @ 2009-04-08 16:55 UTC
To: Johannes Schindelin; +Cc: Git Mailing List

Hi,

On Wed, Apr 8, 2009 at 3:45 AM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> On Tue, 7 Apr 2009, Elijah Newren wrote:
>
>> Just thought I'd make this available, in case there's others with niche
>> needs that find it useful...
>
> Have you seen
>
>        http://thread.gmane.org/gmane.comp.version-control.git/52323
>
> I was rather disappointed that skimo left the patch series in a rather
> half-useful state.

No, I had not.  Looks useful, though it appears to be missing a number of
options from git-filter-branch, such as --subdirectory-filter,
--tree-filter, and --prune-empty.  I'm guessing that's related to your
"half-useful" comment?

I was particularly interested in --tree-filter for my rewrite, since I
needed it (to remove all cvs keywords and a few other touch-ups) and it
was the slowest of the filters I'd be using.  Problem is, in a repository
with 40,000 commits and 70,000 files in the latest commits, --tree-filter
is unacceptably slow.  On average, a commit would have about 35,000 files
(assuming approximately linear growth over the commit history), meaning
that I'd have to modify 40,000 x 35,000 files = 1,400,000,000 files[1].
However, on average, fewer than a dozen files changed per commit, so there
are fewer than 40,000 x 12 = 480,000 unique files that actually need to be
rewritten.  git-fast-export provides (and git-fast-import expects) just
those half million files, and rewriting half a million files instead of
1.4 billion files is the difference between a 45-minute rewrite and a
3-month one.
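The arithmetic above can be checked with a few lines of Python.  The
commit and file counts are the ones quoted in the message; the variable
names are illustrative, and the last step assumes that run time scales
linearly with the number of files rewritten, which is a simplification
rather than a measurement.

```python
commits = 40_000
files_at_tip = 70_000

# Roughly linear growth from 0 to 70,000 files gives an average tree
# size of about half the final count.
avg_files_per_commit = files_at_tip // 2

# "Less than a dozen" changed files per commit, taken as an upper bound.
changed_per_commit = 12

tree_filter_rewrites = commits * avg_files_per_commit  # every file, every commit
stream_rewrites = commits * changed_per_commit         # only the files that changed
ratio = tree_filter_rewrites / stream_rewrites

# Assumption: time is proportional to the number of files rewritten.
fast_minutes = 45
slow_days = fast_minutes * ratio / 60 / 24

if __name__ == "__main__":
    print(f"tree-filter rewrites: {tree_filter_rewrites:,}")
    print(f"stream rewrites:      {stream_rewrites:,}")
    print(f"ratio:                {ratio:,.0f}x")
    print(f"estimated slow run:   {slow_days:.0f} days")
```

Under that proportionality assumption, the 45-minute figure scales to
roughly 91 days, which matches the "3 month" estimate in the message.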

I didn't see a way to easily avoid the 1.4 billion file rewrite using
git-filter-branch (or git-rewrite-commits had I known about it), and
writing something to parse and modify git-fast-export output seemed like
the easiest solution.  Perhaps I could have written some fancy
index-filter script that recorded original and modified file sha1sums
somewhere and used that to only check out certain files and rewrite them,
but such an idea hadn't occurred to me (and I'm not sure it would have
been the better route even if it had).  Maybe there's something I missed
that would have made this easy, though?

Elijah

[1] For simplicity, I'm ignoring the 'binary' files that should not have
any cvs-keyword unmunging performed on them.  However, they do present an
issue, particularly with extra process forks, since you need to determine
which files are safe to modify.  I used libmagic (the library behind the
unix 'file' command) to avoid the need to run 'file' repeatedly.
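As a rough illustration of the per-file step described in footnote [1],
the Python 3 sketch below collapses expanded CVS keywords in a blob and
leaves binary-looking blobs alone, all in-process and without forking.
It is not the script used for the rewrite discussed here.  The binary
check is a swapped-in NUL-byte heuristic, not libmagic; the names
looks_binary and unmunge, the 8000-byte probe size, and the keyword list
(which deliberately leaves out $Log$, since its expansion spans several
lines) are all assumptions made for the example.

```python
import re

# Keywords handled by this sketch (assumption: $Log$ is excluded because
# its expansion inserts extra lines that a one-line pattern cannot undo).
KEYWORDS = rb"Author|Date|Header|Id|Locker|Name|RCSfile|Revision|Source|State"

# Matches an expanded keyword such as "$Id: foo.c,v 1.2 ... $".
EXPANDED = re.compile(rb"\$(" + KEYWORDS + rb"):[^$\n]*\$")


def looks_binary(blob, probe=8000):
    """Heuristic stand-in for libmagic: a NUL byte near the start."""
    return b"\0" in blob[:probe]


def unmunge(blob):
    """Collapse expanded keywords to their bare form, e.g. $Id$."""
    if looks_binary(blob):
        return blob  # never touch content that looks binary
    return EXPANDED.sub(rb"$\1$", blob)


if __name__ == "__main__":
    text = b"/* $Id: foo.c,v 1.2 2009/04/08 03:35:00 elijah Exp $ */\n"
    print(unmunge(text).decode(), end="")
```

Because the check and the substitution both run inside one process, each
changed blob is handled once with no external command per file, which is
the cost the footnote is concerned with.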