git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 1/3] filter-branch: add new --blob-filter option.
@ 2008-06-13  0:52 Avery Pennarun
  2008-06-13  0:52 ` [PATCH v2 2/3] filter-branch --blob-filter: speed/flexibility improvements Avery Pennarun
  2008-06-13  6:25 ` [PATCH v2 1/3] filter-branch: add new --blob-filter option Jeff King
  0 siblings, 2 replies; 5+ messages in thread
From: Avery Pennarun @ 2008-06-13  0:52 UTC (permalink / raw)
  To: git; +Cc: gitster, Jeff King

From: Jeff King <peff@peff.net>

On Tue, Apr 22, 2008 at 12:51:14PM -0400, Avery Pennarun wrote:

> Do you think git would benefit from having a generalized version of
> this script?  Basically, the user provides a "munge" script on the
> command line, and there's a git-filter-branch mode for auto-munging
> (with a cache) every file in every checkin.  Even if it's *only* ever
> used for CRLF, I can imagine this being useful to a lot of people.

It was easy enough to work up the patch below, which allows

  git filter-branch --blob-filter 'tr a-z A-Z'

However, it's _still_ horribly slow. Shell script is nice and flexible,
but running a tight loop like this is just painful. I suspect
filter-branch in something like perl would be a lot faster and just as
flexible (you could even do it in C, but you'd probably have to invent a
little domain-specific scripting language).

It is still much better performance than a tree filter, though:

  $ cd git && time git filter-branch --tree-filter '
      find . -type f | while read f; do
        tr a-z A-Z <"$f" >tmp
        mv tmp "$f"
      done
    ' HEAD~10..HEAD

  real    4m38.626s
  user    1m32.726s
  sys     2m51.163s

  $ cd git && git filter-branch --blob-filter 'tr a-z A-Z' HEAD~10..HEAD
  real    1m40.809s
  user    0m36.822s
  sys     1m14.273s

Lots of system time in both. I'm sure we spend a fair bit of time
hitting our very large map and blob-cache directories, which would be
much more nicely implemented as associative arrays in memory (if we were
using a more featureful language).

Anyway, here is the patch. I don't know if it is even worth applying,
since it is still painfully slow.

Acked-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 git-filter-branch.sh |   30 ++++++++++++++++++++++++++++++
 1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/git-filter-branch.sh b/git-filter-branch.sh
index d04c346..a0d9a79 100755
--- a/git-filter-branch.sh
+++ b/git-filter-branch.sh
@@ -54,6 +54,23 @@ EOF
 
 eval "$functions"
 
+munge_blobs() {
+	while read mode sha1 stage path
+	do
+		if ! test -r "$workdir/../blob-cache/$sha1"
+		then
+			new=`git cat-file blob $sha1 |
+			     eval "$filter_blob" |
+			     git hash-object -w --stdin`
+			printf $new >$workdir/../blob-cache/$sha1
+		fi
+		printf "%s %s\t%s\n" \
+			"$mode" \
+			$(cat "$workdir/../blob-cache/$sha1") \
+			"$path"
+	done
+}
+
 # When piped a commit, output a script to set the ident of either
 # "author" or "committer
 
@@ -105,6 +122,7 @@ tempdir=.git-rewrite
 filter_env=
 filter_tree=
 filter_index=
+filter_blob=
 filter_parent=
 filter_msg=cat
 filter_commit='git commit-tree "$@"'
@@ -150,6 +168,9 @@ do
 	--index-filter)
 		filter_index="$OPTARG"
 		;;
+	--blob-filter)
+		filter_blob="$OPTARG"
+		;;
 	--parent-filter)
 		filter_parent="$OPTARG"
 		;;
@@ -227,6 +248,9 @@ ret=0
 # map old->new commit ids for rewriting parents
 mkdir ../map || die "Could not create map/ directory"
 
+# cache rewritten blobs for blob filter
+mkdir ../blob-cache || die "Could not create blob-cache/ directory"
+
 case "$filter_subdir" in
 "")
 	git rev-list --reverse --topo-order --default HEAD \
@@ -295,6 +319,12 @@ while read commit parents; do
 	eval "$filter_index" < /dev/null ||
 		die "index filter failed: $filter_index"
 
+	if test -n "$filter_blob"; then
+		git ls-files --stage |
+		munge_blobs |
+		git update-index --index-info
+	fi
+
 	parentstr=
 	for parent in $parents; do
 		for reparent in $(map "$parent"); do
-- 
1.5.6.rc2.29.g4717e

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH v2 2/3] filter-branch --blob-filter: speed/flexibility improvements.
  2008-06-13  0:52 [PATCH v2 1/3] filter-branch: add new --blob-filter option Avery Pennarun
@ 2008-06-13  0:52 ` Avery Pennarun
  2008-06-13  0:52   ` [PATCH v2 3/3] filter-branch --blob-filter: add tests Avery Pennarun
  2008-06-13  6:25 ` [PATCH v2 1/3] filter-branch: add new --blob-filter option Jeff King
  1 sibling, 1 reply; 5+ messages in thread
From: Avery Pennarun @ 2008-06-13  0:52 UTC (permalink / raw)
  To: git; +Cc: gitster, Avery Pennarun

Export the current file path as $GIT_BLOB_PATH, so we can filter a blob
differently based on its path, and change the caching mechanism to re-filter
a particular blob if its path changes.

Also, make it much faster by not calling 'cat'. The main loop of
munge_blobs() had to fork-exec "cat" every time through the loop, even when
a blob was already cached.  Let's use the sh builtin 'read' instead for a
huge speedup.

cd git
time git filter-branch --blob-filter 'tr a-z A-Z' HEAD~10..HEAD

(original --blob-filter)
real    3m58.569s
user    0m22.900s
sys     3m32.030s

(with 'cat' calls removed)
real	1m11.931s
user	0m8.520s
sys	1m2.900s

(with 'cat' calls removed and blob cache already filled)
real	0m19.660s
user	0m3.930s
sys	0m15.720s

Signed-off-by: Avery Pennarun <apenwarr@gmail.com>
---
 Documentation/git-filter-branch.txt |   27 +++++++++++++++++++++++++++
 git-filter-branch.sh                |   27 +++++++++++++++++----------
 2 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/Documentation/git-filter-branch.txt b/Documentation/git-filter-branch.txt
index ea77f1f..0c5cd0f 100644
--- a/Documentation/git-filter-branch.txt
+++ b/Documentation/git-filter-branch.txt
@@ -12,6 +12,7 @@ SYNOPSIS
 	[--index-filter <command>] [--parent-filter <command>]
 	[--msg-filter <command>] [--commit-filter <command>]
 	[--tag-name-filter <command>] [--subdirectory-filter <directory>]
+	[--blob-filter <command]
 	[--original <namespace>] [-d <directory>] [-f | --force]
 	[<rev-list options>...]
 
@@ -149,6 +150,16 @@ to other tags will be rewritten to point to the underlying commit.
 	The result will contain that directory (and only that) as its
 	project root.
 
+--blob-filter <command>::
+	This is the filter for modifying the contents of each file (blob) in
+	the tree.  The contents of a file are provided on stdin, and the new
+	file contents should be provided on stdout.  The pathname of the
+	blob in the current revision is in $GIT_BLOB_PATH. For efficiency,
+	the before/after results of a given blob+filename are only
+	calculated once and then cached, so your filter must always return
+	the same output blob for any given input blob.  You might use this
+	filter for converting CRLF to LF in all your files, for example.
+
 --original <namespace>::
 	Use this option to set the namespace where the original commits
 	will be stored. The default value is 'refs/original'.
@@ -196,6 +207,22 @@ git filter-branch --index-filter 'git update-index --remove filename' HEAD
 
 Now, you will get the rewritten history saved in HEAD.
 
+To convert CRLF to LF in all your files using the "fromdos" program (be
+careful: this will attempt to modify binary files too!):
+
+----------------------------------------------
+git filter-branch --blob-filter 'fromdos' HEAD
+----------------------------------------------
+
+To convert CRLF to LF in all your *.c and *.cpp files:
+
+---------------------------------------------------------
+git filter-branch --blob-filter 'case "$GIT_BLOB_PATH" in
+	*.c|*.cpp) fromdos;;
+	*) cat;;
+esac' HEAD
+---------------------------------------------------------
+
 To set a commit (which typically is at the tip of another
 history) to be the parent of the current initial commit, in
 order to paste the other history behind the current history:
diff --git a/git-filter-branch.sh b/git-filter-branch.sh
index a0d9a79..f1ee263 100755
--- a/git-filter-branch.sh
+++ b/git-filter-branch.sh
@@ -55,19 +55,24 @@ EOF
 eval "$functions"
 
 munge_blobs() {
-	while read mode sha1 stage path
+	while read GIT_BLOB_MODE GIT_BLOB_SHA1 stage GIT_BLOB_PATH
 	do
-		if ! test -r "$workdir/../blob-cache/$sha1"
+		export GIT_BLOB_MODE GIT_BLOB_SHA1 GIT_BLOB_PATH
+		cachefile="$cachedir/$GIT_BLOB_SHA1/$GIT_BLOB_PATH"
+		if ! test -r "$cachefile"
 		then
-			new=`git cat-file blob $sha1 |
-			     eval "$filter_blob" |
-			     git hash-object -w --stdin`
-			printf $new >$workdir/../blob-cache/$sha1
+			new=$(git cat-file blob $GIT_BLOB_SHA1 |
+			      eval "$filter_blob" |
+			      git hash-object -w --stdin)
+			mkdir -p "$(dirname "$cachefile")"
+			echo -n $new >"$cachefile"
+		else
+			read new <"$cachefile"
 		fi
 		printf "%s %s\t%s\n" \
-			"$mode" \
-			$(cat "$workdir/../blob-cache/$sha1") \
-			"$path"
+			"$GIT_BLOB_MODE" \
+			"$new" \
+			"$GIT_BLOB_PATH"
 	done
 }
 
@@ -108,6 +113,7 @@ USAGE="[--env-filter <command>] [--tree-filter <command>] \
 [--index-filter <command>] [--parent-filter <command>] \
 [--msg-filter <command>] [--commit-filter <command>] \
 [--tag-name-filter <command>] [--subdirectory-filter <directory>] \
+[--blob-filter <command>] \
 [--original <namespace>] [-d <directory>] [-f | --force] \
 [<rev-list options>...]"
 
@@ -249,7 +255,8 @@ ret=0
 mkdir ../map || die "Could not create map/ directory"
 
 # cache rewritten blobs for blob filter
-mkdir ../blob-cache || die "Could not create blob-cache/ directory"
+cachedir="$workdir/../blob-cache"
+mkdir "$cachedir" || die "Could not create blob-cache/ directory"
 
 case "$filter_subdir" in
 "")
-- 
1.5.6.rc2.29.g4717e

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH v2 3/3] filter-branch --blob-filter: add tests.
  2008-06-13  0:52 ` [PATCH v2 2/3] filter-branch --blob-filter: speed/flexibility improvements Avery Pennarun
@ 2008-06-13  0:52   ` Avery Pennarun
  0 siblings, 0 replies; 5+ messages in thread
From: Avery Pennarun @ 2008-06-13  0:52 UTC (permalink / raw)
  To: git; +Cc: gitster, Avery Pennarun

Signed-off-by: Avery Pennarun <apenwarr@gmail.com>
---
 t/t7003-filter-branch-blob.sh |   94 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 94 insertions(+), 0 deletions(-)
 create mode 100755 t/t7003-filter-branch-blob.sh

diff --git a/t/t7003-filter-branch-blob.sh b/t/t7003-filter-branch-blob.sh
new file mode 100755
index 0000000..f18031e
--- /dev/null
+++ b/t/t7003-filter-branch-blob.sh
@@ -0,0 +1,94 @@
+#!/bin/sh
+
+test_description='git-filter-branch --blob-filter'
+. ./test-lib.sh
+
+make_commit () {
+	echo -n "$2" >"$1"
+	git add "$1"
+	git commit -a -m "$1"
+}
+
+match_file() {
+	f="$(cat "$1")"
+	echo "'$f'" = "'$2'"
+	test "$f" = "$2"
+}
+
+myfilter() {
+	git-filter-branch --blob-filter 'case "$GIT_BLOB_PATH" in '"$1"') echo -n REPLACEMENT;; *) cat;; esac' HEAD &&
+	rm -rf .git/refs/original
+}
+
+test_expect_success 'setup' '
+	make_commit A "textA" &&
+	make_commit "Space file" "Space text" &&
+	make_commit B.txt "textB" &&
+	make_commit C.jpg "jpgC" &&
+	git checkout -b caching &&
+	make_commit AA "textA" &&
+	make_commit A "textA2" &&
+	git checkout -b renames master &&
+	mkdir dir &&
+	rm -f B.txt &&
+	make_commit dir/B.jpg "textB" &&
+	rm -f C.jpg &&
+	make_commit dir/C.txt "jpgC"
+'
+
+test_expect_success 'rewrite all' '
+	git checkout -b rewrite1 master &&
+	git-filter-branch --blob-filter echo\ -n\ \$GIT_BLOB_PATH HEAD
+	rm -rf .git/refs/original
+'
+
+test_expect_success 'rewrite all - result' '
+	match_file A "A" &&
+	match_file "Space file" "Space file" &&
+	match_file B.txt "B.txt" &&
+	match_file C.jpg "C.jpg"
+'
+
+countfilter() {
+	rm -f counter
+	export P="$PWD"
+	git-filter-branch --blob-filter 'echo tick >>$P/counter; echo -n $GIT_BLOB_PATH' HEAD &&
+	rm -rf .git/refs/original
+}
+
+test_expect_success 'caching' '
+	git checkout -b rewrite1b caching &&
+	countfilter
+'
+
+test_expect_success 'caching - result' '
+	match_file A "A" &&
+	match_file AA "AA" &&
+	match_file B.txt "B.txt" &&
+	match_file C.jpg "C.jpg" &&
+	test "$(cat counter | wc -l)" = 6
+'
+
+test_expect_success 'rewrite .txt only' '
+	git checkout -b rewrite2 master &&
+	myfilter \*.txt
+'
+
+test_expect_success 'rewrite .txt only - result' '
+	match_file A "textA" &&
+	match_file B.txt "REPLACEMENT" &&
+	match_file C.jpg "jpgC"
+'
+
+test_expect_success 'rewrite with renames' '
+	git checkout -b rewrite3 renames &&
+	myfilter \*.txt
+'
+
+test_expect_success 'rewrite with renames - result' '
+	match_file A "textA" &&
+	match_file dir/B.jpg "textB" &&
+	match_file dir/C.txt "REPLACEMENT"
+'
+
+test_done
-- 
1.5.6.rc2.29.g4717e

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH v2 1/3] filter-branch: add new --blob-filter option.
  2008-06-13  0:52 [PATCH v2 1/3] filter-branch: add new --blob-filter option Avery Pennarun
  2008-06-13  0:52 ` [PATCH v2 2/3] filter-branch --blob-filter: speed/flexibility improvements Avery Pennarun
@ 2008-06-13  6:25 ` Jeff King
  2008-06-13 16:10   ` Avery Pennarun
  1 sibling, 1 reply; 5+ messages in thread
From: Jeff King @ 2008-06-13  6:25 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: git, gitster

On Thu, Jun 12, 2008 at 08:52:22PM -0400, Avery Pennarun wrote:

> It was easy enough to work up the patch below, which allows
> 
>   git filter-branch --blob-filter 'tr a-z A-Z'

First, two procedural complaints:

  1. We're supposed to be in rc freeze, so this is not a great time to
     publish a new feature. ;)

  2. When bringing back an old patch, please please please give at least
     a little bit of cover letter context. "Here is what happened last
     time, here are the reasons this patch was not accepted before, and
     here is {why I think it that decision was wrong, what I have done
     to improve the patch, etc}.

IIRC, the situation last time had two issues:

  1. it was a one-off "we're not sure if this is really useful" patch

  2. it was unclear whether paths should be available, and if they were,
     there was an issue of encountering the same hash at two different
     paths.

I assume your answer to '1' is "I have been using this and it is
useful". And for '2', it looks like you have extended the cache
mechanism to take into account the sha1 and the path, which I think is
the right solution (and I am pleased to see it looks like the final test
covers the exact situation I was concerned about).

So:

(for 1/3):
Signed-off-by: Jeff King <peff@peff.net>

(for the others (and for 1/3, do I get to ack my own patch?)):
Acked-by: Jeff King <peff@peff.net>

-Peff

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2 1/3] filter-branch: add new --blob-filter option.
  2008-06-13  6:25 ` [PATCH v2 1/3] filter-branch: add new --blob-filter option Jeff King
@ 2008-06-13 16:10   ` Avery Pennarun
  0 siblings, 0 replies; 5+ messages in thread
From: Avery Pennarun @ 2008-06-13 16:10 UTC (permalink / raw)
  To: Jeff King; +Cc: git, gitster

On 6/13/08, Jeff King <peff@peff.net> wrote:
>   1. We're supposed to be in rc freeze, so this is not a great time to
>      publish a new feature. ;)

I thought that was what branches were for :)  Anyway, I only rarely
get the chance to work on this stuff lately, so I guess I got excited.

>   2. When bringing back an old patch, please please please give at least
>      a little bit of cover letter context. "Here is what happened last
>      time, here are the reasons this patch was not accepted before, and
>      here is {why I think it that decision was wrong, what I have done
>      to improve the patch, etc}.

Will do next time.

>  IIRC, the situation last time had two issues:
>
>   1. it was a one-off "we're not sure if this is really useful" patch
>
>   2. it was unclear whether paths should be available, and if they were,
>      there was an issue of encountering the same hash at two different
>      paths.
>
>  I assume your answer to '1' is "I have been using this and it is
>  useful". And for '2', it looks like you have extended the cache
>  mechanism to take into account the sha1 and the path, which I think is
>  the right solution (and I am pleased to see it looks like the final test
>  covers the exact situation I was concerned about).

Yes, for #1 it is indeed useful.  I'm using git-svn on Windows with an
IDE that auto-generates files with CRLF in them, and the translation
of that is something roughly like "ARRGH!"  I have to re-fix the
newlines on various different branches at various times and this is
the best way I've found.  (Although I can also imagine using it for
whitespace fixes, etc.)

You are correct about #2.  I believe I've covered all the complaints
that were brought up at the time.

>  (for 1/3):
>  Signed-off-by: Jeff King <peff@peff.net>
>
>  (for the others (and for 1/3, do I get to ack my own patch?)):
>  Acked-by: Jeff King <peff@peff.net>

Thanks!

Avery

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2008-06-13 16:11 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-06-13  0:52 [PATCH v2 1/3] filter-branch: add new --blob-filter option Avery Pennarun
2008-06-13  0:52 ` [PATCH v2 2/3] filter-branch --blob-filter: speed/flexibility improvements Avery Pennarun
2008-06-13  0:52   ` [PATCH v2 3/3] filter-branch --blob-filter: add tests Avery Pennarun
2008-06-13  6:25 ` [PATCH v2 1/3] filter-branch: add new --blob-filter option Jeff King
2008-06-13 16:10   ` Avery Pennarun

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).