git.vger.kernel.org archive mirror
* Git on Windows, CRLF issues
@ 2008-04-21 19:48 Peter Karlsson
  2008-04-21 20:07 ` Johannes Schindelin
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Peter Karlsson @ 2008-04-21 19:48 UTC (permalink / raw)
  To: git

Hi!

I have begun moving old repositories for Windows-based software to Git. 
Since the tool I am moving from stores everything with CRLF line endings and 
has RCS-like keyword expansion, I'm treating it all as binary data when 
exporting to Git, i.e. I have CRLF in the checked-in data (and I do want 
that, since these are Windows-only sources).

Now the latest msysgit comes along and (finally!) sets core.autocrlf to 
true by default.

How do I handle this without everyone breaking check-ins? I can't 
require everyone to unset core.autocrlf globally. Can I do that with 
gitattributes?

-- 
\\// Peter - http://www.softwolves.pp.se/


* Re: Git on Windows, CRLF issues
  2008-04-21 19:48 Git on Windows, CRLF issues Peter Karlsson
@ 2008-04-21 20:07 ` Johannes Schindelin
  2008-04-21 21:53   ` Avery Pennarun
  2008-04-21 21:51 ` Jakub Narebski
  2008-04-22  6:31 ` Johannes Sixt
  2 siblings, 1 reply; 30+ messages in thread
From: Johannes Schindelin @ 2008-04-21 20:07 UTC (permalink / raw)
  To: Peter Karlsson; +Cc: git

Hi,

On Mon, 21 Apr 2008, Peter Karlsson wrote:

> Now the latest msysgit comes along and (finally!) sets core.autocrlf to 
> true by default.

It is actually nice to hear at least _somebody_ not insulting us for this 
decision.  Thank you!

> How do I handle this without everyone breaking check-ins? I can't 
> require everyone to unset core.autocrlf globally. Can I do that with 
> gitattributes?

I think that the only solution to this is (sorry!) to have one single big 
checkin which converts all CR/LF to LF line endings...

Sorry,
Dscho


* Re: Git on Windows, CRLF issues
  2008-04-21 19:48 Git on Windows, CRLF issues Peter Karlsson
  2008-04-21 20:07 ` Johannes Schindelin
@ 2008-04-21 21:51 ` Jakub Narebski
  2008-04-22  6:52   ` Peter Karlsson
  2008-04-22  6:31 ` Johannes Sixt
  2 siblings, 1 reply; 30+ messages in thread
From: Jakub Narebski @ 2008-04-21 21:51 UTC (permalink / raw)
  To: Peter Karlsson; +Cc: git

Peter Karlsson <peter@softwolves.pp.se> writes:

> I have begun moving old repositories for Windows-based software to
> Git. Since the tool I am moving from stores everything with CRLF line
> endings and has RCS-like keyword expansion, I'm treating it all as
> binary data when exporting to Git, i.e. I have CRLF in the checked-in
> data (and I do want that, since these are Windows-only sources).
> 
> Now the latest msysgit comes along and (finally!) sets core.autocrlf
> to true by default.
> 
> How do I handle this without everyone breaking check-ins? I
> can't require everyone to unset core.autocrlf globally. Can I do
> that with gitattributes?

I think you can, by unsetting the `crlf` attribute, i.e. by putting the
following in .gitattributes:

   * -crlf

See gitattributes(5):

  `crlf`
  ^^^^^^

  This attribute controls the line-ending convention.

  [...]

  Unset::

        Unsetting the `crlf` attribute on a path is meant to
        mark the path as a "binary" file.  The path never goes
        through line endings conversion upon checkin/checkout.

Not tested!
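
A minimal sketch of how the suggestion above could be checked, assuming it is
run inside a scratch repository. The helper names disable_crlf_conversion and
crlf_attr_of are made up for illustration; only `git check-attr` is a real
command. Later Git versions spell the attribute `-text` and keep `-crlf` for
backward compatibility.

```shell
# Hypothetical helper: mark every path as exempt from line-ending conversion.
disable_crlf_conversion() {
    printf '* -crlf\n' >> .gitattributes
}

# Hypothetical helper: ask Git how it resolves the crlf attribute for a path.
# The path does not need to exist; only the attribute rules are consulted.
crlf_attr_of() {
    git check-attr crlf -- "$1"
}
```

After disable_crlf_conversion has been run, `crlf_attr_of foo.c` should report
the attribute as unset, whatever core.autocrlf says.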
-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: Git on Windows, CRLF issues
  2008-04-21 20:07 ` Johannes Schindelin
@ 2008-04-21 21:53   ` Avery Pennarun
  2008-04-22  2:39     ` Jeff King
  2008-04-22  6:41     ` Johannes Sixt
  0 siblings, 2 replies; 30+ messages in thread
From: Avery Pennarun @ 2008-04-21 21:53 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Peter Karlsson, git

On 4/21/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> I think that the only solution to this is (sorry!) to have one single big
>  checkin which converts all CR/LF to LF line endings...

If it were me (and I hope it will be, soon, if we can entirely shut
down svn internally), I would prefer to use git-filter-branch to go
through *all* my checkins and fix up the CRLFs in all of them.  That
way the history will be clean and diffs/annotates/merges will go more
smoothly.

Does anyone know the most efficient way to do this with
git-filter-branch, when there are already thousands of files in the
repo with CRLF in them?  Running dos2unix on all the files for every
single revision could take a *very* long time.

Have fun,

Avery


* Re: Git on Windows, CRLF issues
  2008-04-21 21:53   ` Avery Pennarun
@ 2008-04-22  2:39     ` Jeff King
  2008-04-22 16:51       ` Avery Pennarun
  2008-04-22  6:41     ` Johannes Sixt
  1 sibling, 1 reply; 30+ messages in thread
From: Jeff King @ 2008-04-22  2:39 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Johannes Schindelin, Peter Karlsson, git

On Mon, Apr 21, 2008 at 05:53:34PM -0400, Avery Pennarun wrote:

> Does anyone know the most efficient way to do this with
> git-filter-branch, when there are already thousands of files in the
> repo with CRLF in them?  Running dos2unix on all the files for every
> single revision could take a *very* long time.

Yes, a tree filter would probably be quite slow due to checking out, and
then munging all of the files.

You could maybe do an index filter that gets the blob SHA1 of each file
that is new, and just munges those. But I think it is even simpler to
just keep a cache of original blob hashes mapping to munged blob hashes.

Something like:

  git filter-branch --index-filter '
    git ls-files --stage |
    perl /path/to/caching-munger |
    git update-index --index-info
  '

where your caching munger looks something like:

-- >8 --
#!/usr/bin/perl

use strict;
use DB_File;
use Fcntl;
tie my %cache, 'DB_File', "$ENV{HOME}/filter-cache", O_RDWR|O_CREAT, 0666
  or die "unable to open db: $!";

while(<>) {
  my ($mode, $hash, $path) = /^(\d+) ([0-9a-f]{40}) \d\t(.*)/
    or die "bad ls-files line: $_";
  $cache{$hash} = munge($hash)
    unless exists $cache{$hash};
  print "$mode $cache{$hash}\t$path\n";
}

sub munge {
  my $h = shift;
  my $r = scalar `git show $h | sed 's/\$/\\r/' | git hash-object -w --stdin`;
  chomp $r;
  return $r;
}
-- 8< --

so we keep a dbm of the hash mapping, and do no work if we have already
seen this blob. If we don't, then we actually do the expensive 'show |
munge | hash-object'. And here our munge adds a CR, but you should be
able to do an arbitrary transformation.
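
For the direction asked about earlier in the thread (CRLF to LF), the
transformation could be sketched roughly as below. This is an illustration
under assumptions, not part of the script above: crlf_to_lf and munge_blob are
made-up names, and binary blobs are not detected or skipped.

```shell
# A literal carriage return, obtained portably (plain sed has no \r escape).
cr=$(printf '\r')

# Strip one trailing CR from every line read on stdin.
crlf_to_lf() {
    sed "s/${cr}\$//"
}

# Hypothetical helper: rewrite one blob and print the hash of the new blob.
# $1 is the hash of the original blob.
munge_blob() {
    git cat-file blob "$1" | crlf_to_lf | git hash-object -w --stdin
}
```

Because the result depends only on the blob contents, it is safe to cache it
under the original hash, as the script above does.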

-Peff


* Re: Git on Windows, CRLF issues
  2008-04-21 19:48 Git on Windows, CRLF issues Peter Karlsson
  2008-04-21 20:07 ` Johannes Schindelin
  2008-04-21 21:51 ` Jakub Narebski
@ 2008-04-22  6:31 ` Johannes Sixt
  2008-04-22  8:42   ` Peter Karlsson
  2 siblings, 1 reply; 30+ messages in thread
From: Johannes Sixt @ 2008-04-22  6:31 UTC (permalink / raw)
  To: Peter Karlsson; +Cc: git

Peter Karlsson wrote:
> I have begun moving old repositories for Windows-based software to Git.
> Since the tool I am moving from stores everything with CRLF line endings
> and has RCS-like keyword expansion, I'm treating it all as binary data
> when exporting to Git, i.e. I have CRLF in the checked-in data (and I do
> want that, since these are Windows-only sources).
> 
> Now the latest msysgit comes along and (finally!) sets core.autocrlf to
> true by default.
> 
> How do I handle this without everyone breaking check-ins? I can't
> require everyone to unset core.autocrlf globally. Can I do that with
> gitattributes?

I see 2 other options:

1. Create a custom setup of msysgit that has core.autocrlf set to false.

2. You are still converting repositories? Convert the files in your
repository to LF. I did it like this, but it took a week or more of
labor to get the scripts into a shape where I could reproduce the conversion
(and it all happened before core.autocrlf even existed).
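
A related variant of option 1, sketched under the assumption that a
per-clone setting is acceptable instead of a custom msysgit build: a value in
the repository's own config takes precedence over the global and system-wide
ones. The function name pin_autocrlf_off is made up for illustration.

```shell
# Hypothetical helper: run inside a clone to switch off line-ending
# conversion for this repository only. The value lands in .git/config.
pin_autocrlf_off() {
    git config core.autocrlf false
}
```

The setting is not carried along by clone or push, so it would have to be
applied in every clone.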

-- Hannes


* Re: Git on Windows, CRLF issues
  2008-04-21 21:53   ` Avery Pennarun
  2008-04-22  2:39     ` Jeff King
@ 2008-04-22  6:41     ` Johannes Sixt
  1 sibling, 0 replies; 30+ messages in thread
From: Johannes Sixt @ 2008-04-22  6:41 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Johannes Schindelin, Peter Karlsson, git

Avery Pennarun wrote:
> Does anyone know the most efficient way to [convert CRLF] with
> git-filter-branch, when there are already thousands of files in the
> repo with CRLF in them?  Running dos2unix on all the files for every
> single revision could take a *very* long time.

I chose to write a custom script. Otherwise, a file that stays the same
throughout the history would still have been converted on each commit. My
script converted each unique file only once, then reconstructed the tree
objects and then changed the commits.

In the end I don't think it paid off. It took me a week or so to convert
the repo; I could just as well have let filter-branch run for a week. But I
also have to mention that I did the CVS->git conversion a few times to get
a suitable history, and I also repeated the CRLF conversion several times, and
back then git-filter-branch did not exist in its current shape.

-- Hannes


* Re: Git on Windows, CRLF issues
  2008-04-21 21:51 ` Jakub Narebski
@ 2008-04-22  6:52   ` Peter Karlsson
  2008-04-22  9:04     ` Johannes Sixt
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Karlsson @ 2008-04-22  6:52 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Git Mailing List

Jakub Narebski:

> I think you can, by unsetting the `crlf` attribute, i.e. by putting the
> following in .gitattributes:
> 
>    * -crlf

Yeah, that does indeed seem to work, no matter how core.autocrlf is
configured globally. I think this is the best way to go for the
repositories I am working on (as they are very much DOS/Windows-only).

Does anyone know how to hack an existing repository so that I can add
such a .gitattributes file to all commits? I've tried reading the
git-filter-branch manual page a few times, but I am still confused by
it.

I guess I need some combination of "git filter-branch --tree-filter"
and "git update-index --add".

It doesn't matter much that all the commits are rewritten, as I am
still the only one to have cloned them.

-- 
\\// Peter - http://www.softwolves.pp.se/


* Re: Git on Windows, CRLF issues
  2008-04-22  6:31 ` Johannes Sixt
@ 2008-04-22  8:42   ` Peter Karlsson
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Karlsson @ 2008-04-22  8:42 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git

Johannes Sixt:

> 2. You are still converting repositories? Convert the files in your
> repository to LF.

Or, perhaps, this is the way to go.

Got to figure out how to get CRLF->LF conversion working without having
RCS keyword expansion going haywire. I'm using RCS format as a
middle-man between the old repositories (PVCS) I'm converting and
parsecvs which imports them into Git. The old repository has expanded
keywords, and I must avoid having RCS/CVS expand them as they would
expand in a different manner... :-/

-- 
\\// Peter - http://www.softwolves.pp.se/


* Re: Git on Windows, CRLF issues
  2008-04-22  6:52   ` Peter Karlsson
@ 2008-04-22  9:04     ` Johannes Sixt
  0 siblings, 0 replies; 30+ messages in thread
From: Johannes Sixt @ 2008-04-22  9:04 UTC (permalink / raw)
  To: Peter Karlsson; +Cc: Jakub Narebski, Git Mailing List

Peter Karlsson wrote:
> Jakub Narebski:
> 
>> I think you can, by unsetting the `crlf` attribute, i.e. by putting the
>> following in .gitattributes:
>>
>>    * -crlf
> 
> Yeah, that does indeed seem to work, no matter how core.autocrlf is
> configured globally. I think this is the best way to go for the
> repositories I am working on (as they are very much DOS/Windows-only).
> 
> Does anyone know how to hack an existing repository so that I can add
> such a .gitattributes file to all commits? I've tried reading the
> git-filter-branch manual page a few times, but I am still confused by
> it.

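Something like (untested):

# Store the .gitattributes contents as a blob and remember its hash.
X=$(echo "* -crlf" | git hash-object -w --stdin)
# filter-branch runs as a child process, so X must be exported to reach it.
export X
git filter-branch \
	--index-filter 'printf "100644 %s\t.gitattributes\n" "$X" |
			git update-index --index-info' \
	-- --all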

-- Hannes


* Re: Git on Windows, CRLF issues
  2008-04-22  2:39     ` Jeff King
@ 2008-04-22 16:51       ` Avery Pennarun
  2008-04-23  7:11         ` Peter Karlsson
  2008-04-23  8:08         ` Jeff King
  0 siblings, 2 replies; 30+ messages in thread
From: Avery Pennarun @ 2008-04-22 16:51 UTC (permalink / raw)
  To: Jeff King; +Cc: Johannes Schindelin, Peter Karlsson, git

On 4/21/08, Jeff King <peff@peff.net> wrote:
>  You could maybe do an index filter that gets the blob SHA1 of each file
>  that is new, and just munges those. But I think it is even simpler to
>  just keep a cache of original blob hashes mapping to munged blob hashes.
> [...]

Thanks, this is really cool.  I'll try it next time I'm messing with
our repositories (this week is unfortunately a bit too busy).

Do you think git would benefit from having a generalized version of
this script?  Basically, the user provides a "munge" script on the
command line, and there's a git-filter-branch mode for auto-munging
(with a cache) every file in every checkin.  Even if it's *only* ever
used for CRLF, I can imagine this being useful to a lot of people.

Thanks,

Avery


* Re: Git on Windows, CRLF issues
  2008-04-22 16:51       ` Avery Pennarun
@ 2008-04-23  7:11         ` Peter Karlsson
  2008-04-23  8:10           ` Jeff King
  2008-04-23  8:08         ` Jeff King
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Karlsson @ 2008-04-23  7:11 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Jeff King, Johannes Schindelin, Git Mailing List

Avery Pennarun:

> Do you think git would benefit from having a generalized version of
> this script?

Definitely. Also, something that would work with a) several branches
(i.e. traverse all the branches, keeping the points at which they
diverge), and b) submodules (i.e. apply the same changes to the
submodules and update the submodule index accordingly).

I ended up doing CRLF conversion for most of the repositories I had
converted. Fortunately, most of them had a single branch, so after
having created a small script that did CRLF->LF for the text files, I
could do a

  git filter-branch --tree-filter 'c:/temp/crlf2lf.sh' \
                    --tag-name-filter 'cat' HEAD

on each repository and get everything converted during my lunch break.
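
The crlf2lf.sh script itself is not shown in this thread. A hypothetical
sketch of what such a tree filter might do is given below; the function name
crlf2lf_tree and the list of extensions treated as text are assumptions. It
relies on the tree filter being run with the checked-out tree as the current
directory.

```shell
# A literal carriage return, obtained portably.
cr=$(printf '\r')

# Hypothetical tree filter body: convert CRLF to LF in files whose names
# look like text. Writing back with cat keeps the original file mode.
crlf2lf_tree() {
    find . -type f \( -name '*.c' -o -name '*.h' -o -name '*.txt' \) |
    while IFS= read -r f; do
        sed "s/${cr}\$//" "$f" > "$f.lf" &&
        cat "$f.lf" > "$f" &&
        rm -f "$f.lf"
    done
}
```

Files with other extensions are left untouched, so binaries that happen to
contain CR LF byte pairs are not damaged.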


What I couldn't figure out is why, after converting everything,
removing all references to the repositories I cloned from, and removing
references to the old objects in the reflogs,

 git fsck --unreachable

did not report any unreachable objects? I would have guessed the entire
old history and its objects would now be invalidated and could be
killed off.

-- 
\\// Peter - http://www.softwolves.pp.se/


* Re: Git on Windows, CRLF issues
  2008-04-22 16:51       ` Avery Pennarun
  2008-04-23  7:11         ` Peter Karlsson
@ 2008-04-23  8:08         ` Jeff King
  2008-04-23 10:13           ` Johannes Schindelin
  2008-04-23 10:58           ` Johannes Sixt
  1 sibling, 2 replies; 30+ messages in thread
From: Jeff King @ 2008-04-23  8:08 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Johannes Schindelin, Peter Karlsson, git

On Tue, Apr 22, 2008 at 12:51:14PM -0400, Avery Pennarun wrote:

> Do you think git would benefit from having a generalized version of
> this script?  Basically, the user provides a "munge" script on the
> command line, and there's a git-filter-branch mode for auto-munging
> (with a cache) every file in every checkin.  Even if it's *only* ever
> used for CRLF, I can imagine this being useful to a lot of people.

It was easy enough to work up the patch below, which allows

  git filter-branch --blob-filter 'tr a-z A-Z'

However, it's _still_ horribly slow. Shell script is nice and flexible,
but running a tight loop like this is just painful. I suspect
filter-branch in something like perl would be a lot faster and just as
flexible (you could even do it in C, but you'd probably have to invent a
little domain-specific scripting language).

It is still much better performance than a tree filter, though:

  $ cd git && time git filter-branch --tree-filter '
      find . -type f | while read f; do
        tr a-z A-Z <"$f" >tmp
        mv tmp "$f"
      done
    ' HEAD~10..HEAD

  real    4m38.626s
  user    1m32.726s
  sys     2m51.163s

  $ cd git && time git filter-branch --blob-filter 'tr a-z A-Z' HEAD~10..HEAD
  real    1m40.809s
  user    0m36.822s
  sys     1m14.273s

Lots of system time in both. I'm sure we spend a fair bit of time
hitting our very large map and blob-cache directories, which would be
much more nicely implemented as associative arrays in memory (if we were
using a more featureful language).

Anyway, here is the patch. I don't know if it is even worth applying,
since it is still painfully slow.

---
 git-filter-branch.sh |   30 ++++++++++++++++++++++++++++++
 1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/git-filter-branch.sh b/git-filter-branch.sh
index 333f6a8..0602b25 100755
--- a/git-filter-branch.sh
+++ b/git-filter-branch.sh
@@ -54,6 +54,23 @@ EOF
 
 eval "$functions"
 
+munge_blobs() {
+	while read mode sha1 stage path
+	do
+		if ! test -r "$workdir/../blob-cache/$sha1"
+		then
+			new=`git cat-file blob $sha1 |
+			     eval "$filter_blob" |
+			     git hash-object -w --stdin`
+			printf $new >$workdir/../blob-cache/$sha1
+		fi
+		printf "%s %s\t%s\n" \
+			"$mode" \
+			$(cat "$workdir/../blob-cache/$sha1") \
+			"$path"
+	done
+}
+
 # When piped a commit, output a script to set the ident of either
 # "author" or "committer
 
@@ -105,6 +122,7 @@ tempdir=.git-rewrite
 filter_env=
 filter_tree=
 filter_index=
+filter_blob=
 filter_parent=
 filter_msg=cat
 filter_commit='git commit-tree "$@"'
@@ -150,6 +168,9 @@ do
 	--index-filter)
 		filter_index="$OPTARG"
 		;;
+	--blob-filter)
+		filter_blob="$OPTARG"
+		;;
 	--parent-filter)
 		filter_parent="$OPTARG"
 		;;
@@ -227,6 +248,9 @@ ret=0
 # map old->new commit ids for rewriting parents
 mkdir ../map || die "Could not create map/ directory"
 
+# cache rewritten blobs for blob filter
+mkdir ../blob-cache || die "Could not create blob-cache/ directory"
+
 case "$filter_subdir" in
 "")
 	git rev-list --reverse --topo-order --default HEAD \
@@ -295,6 +319,12 @@ while read commit parents; do
 	eval "$filter_index" < /dev/null ||
 		die "index filter failed: $filter_index"
 
+	if test -n "$filter_blob"; then
+		git ls-files --stage |
+		munge_blobs |
+		git update-index --index-info
+	fi
+
 	parentstr=
 	for parent in $parents; do
 		for reparent in $(map "$parent"); do
-- 
1.5.5.1.144.g4c416.dirty


* Re: Git on Windows, CRLF issues
  2008-04-23  7:11         ` Peter Karlsson
@ 2008-04-23  8:10           ` Jeff King
  2008-04-23 13:47             ` Peter Karlsson
  0 siblings, 1 reply; 30+ messages in thread
From: Jeff King @ 2008-04-23  8:10 UTC (permalink / raw)
  To: Peter Karlsson; +Cc: Avery Pennarun, Johannes Schindelin, Git Mailing List

On Wed, Apr 23, 2008 at 08:11:49AM +0100, Peter Karlsson wrote:

> I ended up doing CRLF conversion for most of the repositories I had
> converted. Fortunately, most of them had a single branch, so after
> having created a small script that did CRLF->LF for the text files, I
> could do a
> 
>   git filter-branch --tree-filter 'c:/temp/crlf2lf.sh' \
>                     --tag-name-filter 'cat' HEAD
> 
> on each repository and get everything converted during my lunch break.

Sure, but that is quite slow on a larger tree, since it has to do a
full checkout for each commit. The idea of the specialized filter was to
avoid that. But if your project was small enough to do it that way, that
certainly works.

> What I couldn't figure out is why, after converting everything,
> removing all references to the repositories I cloned from, and removing
> references to the old objects in the reflogs,
> 
>  git fsck --unreachable
> 
> did not report any unreachable objects? I would have guessed the entire
> old history and its objects would now be invalidated and could be
> killed off.

Did you remove refs/original/ ?
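
A sketch of the clean-up being hinted at, assuming a reasonably recent Git:
filter-branch keeps the pre-rewrite tips under refs/original/, and the reflogs
also reference the old commits, so both have to go before the old objects
become unreachable. The function name drop_old_history is made up; the Git
commands are real, but check the options against your version.

```shell
# Hypothetical helper: discard everything that still points at the
# pre-rewrite history, then prune the objects that became unreachable.
drop_old_history() {
    # Delete the backup refs written by filter-branch.
    git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do
        git update-ref -d "$ref"
    done
    # Forget reflog entries that mention the old commits.
    git reflog expire --expire=now --all
    # Repack and remove unreachable objects immediately.
    git gc --prune=now
}
```

This is destructive: once it has run, the old history cannot be recovered
from this repository.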

-Peff


* Re: Git on Windows, CRLF issues
  2008-04-23  8:08         ` Jeff King
@ 2008-04-23 10:13           ` Johannes Schindelin
  2008-04-23 10:58             ` Jeff King
  2008-04-23 10:58           ` Johannes Sixt
  1 sibling, 1 reply; 30+ messages in thread
From: Johannes Schindelin @ 2008-04-23 10:13 UTC (permalink / raw)
  To: Jeff King; +Cc: Avery Pennarun, Peter Karlsson, git

Hi,

On Wed, 23 Apr 2008, Jeff King wrote:

> On Tue, Apr 22, 2008 at 12:51:14PM -0400, Avery Pennarun wrote:
> 
> > Do you think git would benefit from having a generalized version of 
> > this script?  Basically, the user provides a "munge" script on the 
> > command line, and there's a git-filter-branch mode for auto-munging 
> > (with a cache) every file in every checkin.  Even if it's *only* ever 
> > used for CRLF, I can imagine this being useful to a lot of people.
> 
> It was easy enough to work up the patch below, which allows
> 
>   git filter-branch --blob-filter 'tr a-z A-Z'
> 
> However, it's _still_ horribly slow.

You create a quite huge blob-cache, so you are pretty heavy on disk-I/O.  
Have you tried (as suggested in the man page) to run this on a huge RAM 
disk?  That should blow you away.

> Shell script is nice and flexible, but running a tight loop like this is 
> just painful. I suspect filter-branch in something like perl would be a 
> lot faster and just as flexible (you could even do it in C, but you'd 
> probably have to invent a little domain-specific scripting language).

I hoped that the rewrite-commits attempt was more than just that: an 
attempt.  So there is a point you could start from, doing things in C.

But I doubt that you get any joy: either your language is too limited, or 
you will get the same problems (fork() overhead) again.

> Anyway, here is the patch. I don't know if it is even worth applying, 
> since it is still painfully slow.

I like your patch:

Acked-by: Johannes Schindelin <johannes.schindelin@gmx.de>

Ciao,
Dscho


* Re: Git on Windows, CRLF issues
  2008-04-23 10:13           ` Johannes Schindelin
@ 2008-04-23 10:58             ` Jeff King
  0 siblings, 0 replies; 30+ messages in thread
From: Jeff King @ 2008-04-23 10:58 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Avery Pennarun, Peter Karlsson, git

On Wed, Apr 23, 2008 at 11:13:27AM +0100, Johannes Schindelin wrote:

> > It was easy enough to work up the patch below, which allows
> > 
> >   git filter-branch --blob-filter 'tr a-z A-Z'
> > 
> > However, it's _still_ horribly slow.
> 
> You create a quite huge blob-cache, so you are pretty heavy on disk-I/O.  
> Have you tried (as suggested in the man page) to run this on a huge RAM 
> disk?  That should blow you away.

No, I didn't. But the disk I/O is pretty minimal. The blob cache is only
a few megabytes, and it stays entirely in Linux's disk cache. My disk
light only blinks every 5-10 seconds to flush dirty pages to disk.

> I hoped that the rewrite-commits attempt was more than just that: an 
> attempt.  So there is a point you could start from, doing things in C.
>
> But I doubt that you get any joy: either your language is too limited, or 
> you will get the same problems (fork() overhead) again.

Ah, right. I totally forgot about that effort. I will take a peek next
time I need to do some filtering.

> > Anyway, here is the patch. I don't know if it is even worth applying, 
> > since it is still painfully slow.
> 
> I like your patch:
> 
> Acked-by: Johannes Schindelin <johannes.schindelin@gmx.de>

I think it could use some documentation updates. Avery, do you want to
try adding a CRLF example to the manpage?

-Peff


* Re: Git on Windows, CRLF issues
  2008-04-23  8:08         ` Jeff King
  2008-04-23 10:13           ` Johannes Schindelin
@ 2008-04-23 10:58           ` Johannes Sixt
  2008-04-23 11:04             ` Jeff King
  2008-04-23 20:02             ` Avery Pennarun
  1 sibling, 2 replies; 30+ messages in thread
From: Johannes Sixt @ 2008-04-23 10:58 UTC (permalink / raw)
  To: Jeff King; +Cc: Avery Pennarun, Johannes Schindelin, Peter Karlsson, git

Jeff King wrote:
> It was easy enough to work up the patch below, which allows
> 
>   git filter-branch --blob-filter 'tr a-z A-Z'
...
> +munge_blobs() {
> +	while read mode sha1 stage path
> +	do
> +		if ! test -r "$workdir/../blob-cache/$sha1"
> +		then
> +			new=`git cat-file blob $sha1 |
> +			     eval "$filter_blob" |
> +			     git hash-object -w --stdin`
> +			printf $new >$workdir/../blob-cache/$sha1
> +		fi
> +		printf "%s %s\t%s\n" \
> +			"$mode" \
> +			$(cat "$workdir/../blob-cache/$sha1") \
> +			"$path"
> +	done
> +}

In practice, this is not sufficient. The blob filter must have an
opportunity to decide what it wants to do, not just blindly munge every
blob. The minimum is a path name, e.g. in $1:

	new=$(git cat-file blob $sha1 |
		$SHELL_PATH -c "$filter_blob" ignored "$path" |
		git hash-object -w --stdin)

-- Hannes


* Re: Git on Windows, CRLF issues
  2008-04-23 10:58           ` Johannes Sixt
@ 2008-04-23 11:04             ` Jeff King
  2008-04-23 11:46               ` Johannes Sixt
  2008-04-23 20:02             ` Avery Pennarun
  1 sibling, 1 reply; 30+ messages in thread
From: Jeff King @ 2008-04-23 11:04 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Avery Pennarun, Johannes Schindelin, Peter Karlsson, git

On Wed, Apr 23, 2008 at 12:58:57PM +0200, Johannes Sixt wrote:

> In practice, this is not sufficient. The blob filter must have an
> opportunity to decide what it wants to do, not just blindly munge every
> blob. The minimum is a path name, e.g. in $1:
> 
> 	new=$(git cat-file blob $sha1 |
> 		$SHELL_PATH -c "$filter_blob" ignored "$path" |
> 		git hash-object -w --stdin)

I intentionally left that out, because:

  - I assumed if you were going to do trickery with pathnames, you
    should just be doing an index filter

  - it violates the cache assumption, which is that blob $X is always
    transformed the same way

I assume you are wanting to do something like:

  git filter-branch --blob-filter '
    case "$1" in
      *.jpg) cat ;;
          *) tr a-z A-Z ;;
    esac
  '

Obviously it is unlikely to get the same blob sha1 as "foo.jpg" and
"foo.txt", but it just feels a little wrong.

-Peff


* Re: Git on Windows, CRLF issues
  2008-04-23 11:04             ` Jeff King
@ 2008-04-23 11:46               ` Johannes Sixt
  2008-04-23 21:47                 ` Jeff King
  0 siblings, 1 reply; 30+ messages in thread
From: Johannes Sixt @ 2008-04-23 11:46 UTC (permalink / raw)
  To: Jeff King; +Cc: Avery Pennarun, Johannes Schindelin, Peter Karlsson, git

Jeff King wrote:
> On Wed, Apr 23, 2008 at 12:58:57PM +0200, Johannes Sixt wrote:
> 
>> In practice, this is not sufficient. The blob filter must have an
>> opportunity to decide what it wants to do, not just blindly munge every
>> blob. The minimum is a path name, e.g. in $1:
>>
>> 	new=$(git cat-file blob $sha1 |
>> 		$SHELL_PATH -c "$filter_blob" ignored "$path" |
>> 		git hash-object -w --stdin)
> 
> I intentionally left that out, because:
> 
>   - I assumed if you were going to do trickery with pathnames, you
>     should just be doing an index filter
> 
>   - it violates the cache assumption, which is that blob $X is always
>     transformed the same way
> 
> I assume you are wanting to do something like:
> 
>   git filter-branch --blob-filter '
>     case "$1" in
>       *.jpg) cat ;;
>           *) tr a-z A-Z ;;
>     esac
>   '
> 
> Obviously it is unlikely to get the same blob sha1 as "foo.jpg" and
> "foo.txt", but it just feels a little wrong.

Yes, that's how I intended it to work. What's wrong here? The fact that a
user might name a JPEG foo.txt instead of foo.jpg? Or that the same blob
might appear with entirely different names, including different suffixes?
Well, tough luck. Use an index filter. But without any sort of hint what
the blob is about, your original --blob-filter is useless except for the
most simplistic repositories.

-- Hannes


* Re: Git on Windows, CRLF issues
  2008-04-23  8:10           ` Jeff King
@ 2008-04-23 13:47             ` Peter Karlsson
  2008-04-23 14:24               ` Johan Herland
  2008-04-23 15:12               ` Johannes Sixt
  0 siblings, 2 replies; 30+ messages in thread
From: Peter Karlsson @ 2008-04-23 13:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Avery Pennarun, Johannes Schindelin, Git Mailing List

Jeff King:

> Sure, but that is quite slow on a larger tree, since it has to do a
> full checkout for each commit.

Indeed. That's why I would welcome a script such as the one you
mentioned :-) Fortunately, the repositories I worked on were small
enough to not suffer too much (even when using Git on Windows, which is
a bit slower than on Linux).

[Not seeing any unreachable objects]
> Did you remove refs/original/ ?

That, and I cloned the repository to a new location after the conversion
and removed the references to "origin" there. It does seem that the
objects are still there, but I can't see them with "gitk --all".

-- 
\\// Peter - http://www.softwolves.pp.se/


* Re: Git on Windows, CRLF issues
  2008-04-23 13:47             ` Peter Karlsson
@ 2008-04-23 14:24               ` Johan Herland
  2008-04-23 15:12               ` Johannes Sixt
  1 sibling, 0 replies; 30+ messages in thread
From: Johan Herland @ 2008-04-23 14:24 UTC (permalink / raw)
  To: Peter Karlsson; +Cc: git, Jeff King, Avery Pennarun, Johannes Schindelin

On Wednesday 23 April 2008, Peter Karlsson wrote:
> Jeff King:
> > Sure, but that is quite slow on a larger tree, since it has to do a
> > full checkout for each commit.
>
> Indeed. That's why I would welcome a script such as the one you
> mentioned :-) Fortunately, the repositories I worked on were small
> enough to not suffer too much (even when using Git on Windows, which
> is a bit slower than on Linux).
>
> [Not seeing any unreachable objects]
>
> > Did you remove refs/original/ ?
>
> That, and I cloned the repository to a new location after the
> conversion and removed the references to "origin" there. It does
> seem that the objects are still there, but I can't see them with
> "gitk --all".

Maybe they are kept alive by reflogs?


Have fun! :)

...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net


* Re: Git on Windows, CRLF issues
  2008-04-23 13:47             ` Peter Karlsson
  2008-04-23 14:24               ` Johan Herland
@ 2008-04-23 15:12               ` Johannes Sixt
  1 sibling, 0 replies; 30+ messages in thread
From: Johannes Sixt @ 2008-04-23 15:12 UTC (permalink / raw)
  To: Peter Karlsson
  Cc: Jeff King, Avery Pennarun, Johannes Schindelin, Git Mailing List

Peter Karlsson schrieb:
> [Not seeing any unreachable objects]
> Jeff King:
>> Did you remove refs/original/ ?
> 
> That, and I cloned the repository to a new location after the conversion
> and removed the references to "origin" there. It does seem that the
> objects are still there, but I can't see them with "gitk --all".

Did you clone locally? Then you must use the file:// protocol, otherwise
everything is hard-linked from the origin.
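
To illustrate the difference, here is a small sketch (directory names
are invented, and the unreferenced blob merely stands in for the
leftover pre-conversion objects). It clones the same repository once by
plain path and once through a file:// URL, then checks which clone
ended up with an object that no ref can reach:

```shell
set -e
work=$(mktemp -d)                 # scratch location, an assumption of this demo
cd "$work"

git init -q src
cd src
git config user.name "Example User"
git config user.email "user@example.invalid"
echo hello > a.txt
git add a.txt
git commit -q -m "init"
# Write a blob that no commit or tree refers to.
dangling=$(echo "never referenced" | git hash-object -w --stdin)
cd ..

git clone -q src plain-clone               # local path: object files are linked or copied
git clone -q "file://$work/src" url-clone  # file:// URL: goes through the normal transport

has_object() {
  git -C "$1" cat-file -e "$dangling" 2>/dev/null && echo yes || echo no
}
plain=$(has_object plain-clone)
url=$(has_object url-clone)
echo "plain path clone has the unreferenced blob: $plain"
echo "file:// clone has the unreferenced blob:    $url"
```

Only the file:// clone transfers just what is reachable from the refs,
which is why it is the one to use when checking whether the old objects
are really gone.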

-- Hannes

* Re: Git on Windows, CRLF issues
  2008-04-23 10:58           ` Johannes Sixt
  2008-04-23 11:04             ` Jeff King
@ 2008-04-23 20:02             ` Avery Pennarun
  2008-04-24  6:25               ` Johannes Sixt
  1 sibling, 1 reply; 30+ messages in thread
From: Avery Pennarun @ 2008-04-23 20:02 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Jeff King, Johannes Schindelin, Peter Karlsson, git

On 4/23/08, Johannes Sixt <j.sixt@viscovery.net> wrote:
> In practice, this is not sufficient. The blob filter must have an
>  opportunity to decide what it wants to do, not just blindly munge every
>  blob. The minimum is a path name, e.g. in $1:

Actually, it may not have been intentional, but because of the way
'eval' works, the munge script will find that $path already contains
the path of the file being munged.  Works for me.
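
The effect can be reproduced without git at all. The sketch below is
not the filter-branch code; run_filter and the filter string are
hypothetical stand-ins that only show why a snippet run through 'eval'
sees variables set by its caller:

```shell
# Hypothetical stand-in for the filter string given on the command line.
filter='case "$path" in *.jpg) cat ;; *) tr a-z A-Z ;; esac'

run_filter() {
  path=$1          # set by the caller, as the rewriting loop would do
  eval "$filter"   # evaluated in the current shell, so $path is in scope
}

text_result=$(printf 'hello\n' | run_filter notes.txt)
jpeg_result=$(printf 'hello\n' | run_filter photo.jpg)
echo "notes.txt -> $text_result"
echo "photo.jpg -> $jpeg_result"
```

Had the filter been started as a separate process instead, $path would
only be visible if it were exported.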

Have fun,

Avery

* Re: Git on Windows, CRLF issues
  2008-04-23 11:46               ` Johannes Sixt
@ 2008-04-23 21:47                 ` Jeff King
  2008-04-23 23:01                   ` Junio C Hamano
  0 siblings, 1 reply; 30+ messages in thread
From: Jeff King @ 2008-04-23 21:47 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Avery Pennarun, Johannes Schindelin, Peter Karlsson, git

On Wed, Apr 23, 2008 at 01:46:20PM +0200, Johannes Sixt wrote:

> > I assume you are wanting to do something like:
> > 
> >   git filter-branch --blob-filter '
> >     case "$1" in
> >       *.jpg) cat ;;
> >           *) tr a-z A-Z ;;
> >     esac
> >   '
> > 
> > Obviously it is unlikely to get the same blob sha1 as "foo.jpg" and
> > "foo.txt", but it just feels a little wrong.
> 
> Yes, that's how I intended it to work. What's wrong here? The fact that a
> user might name a JPEG foo.txt instead of foo.jpg? Or that the same blob
> might appear with entirely different names, including different suffixes?
> Well, tough luck. Use an index filter. But without any sort of hint what
> the blob is about, your original --blob-filter is useless except for the
> most simplistic repositories.

Yes, the script produces incorrect results if you have the same blob
with different names. IOW, if I accidentally add a JPEG as 'foo', and
then later rename it to 'foo.jpg', it will munge the blob the first time
it sees it, and then use the munged value for 'foo.jpg', since we never
even run the case statement. Yes, this is not terribly likely, but it
does seem like an awful (and hard to diagnose!) bug to have hiding in
the script.

The correct fix is either:

  - the blob cache needs to take into account sha1 _and_ path

  - the cache lookup needs to be _inside_ the path filter. In that case
    you would either have to support it in the script (e.g.,
    --blob-ignore jpg), or you could make the caching an optional part
    of the blob filter (the way you can call 'map' explicitly from your
    filters).
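
A toy model of the first option, to make the failure concrete. Nothing
here is taken from the patch: munge, key_of and the cache layout are
invented for the illustration, and cksum stands in for the blob sha1.
The same content is seen first as 'foo' and then as 'foo.jpg':

```shell
set -e
cache=$(mktemp -d)                # scratch cache directory, an assumption

# Hypothetical filter: leave *.jpg alone, upper-case everything else.
filter() { case "$1" in *.jpg) cat ;; *) tr a-z A-Z ;; esac; }

# Stand-in for an object name: checksum of the given string.
key_of() { printf '%s' "$1" | cksum | cut -d' ' -f1; }

# munge MODE NAME CONTENT
# MODE picks the cache key: content only, or content plus path.
munge() {
  mode=$1; name=$2; content=$3
  key=$(key_of "$content")
  if [ "$mode" = content-and-path ]; then
    key=$key-$(key_of "$name")
  fi
  entry=$cache/$mode-$key
  if [ ! -f "$entry" ]; then
    printf '%s' "$content" | filter "$name" > "$entry"
  fi
  cat "$entry"
}

# Keyed by content only: the second lookup never reaches the case statement.
buggy_first=$(munge content-only foo "jpeg bytes")
buggy_second=$(munge content-only foo.jpg "jpeg bytes")

# Keyed by content and path: each name gets its own decision.
fixed_first=$(munge content-and-path foo "jpeg bytes")
fixed_second=$(munge content-and-path foo.jpg "jpeg bytes")

echo "content-only:     foo='$buggy_first' foo.jpg='$buggy_second'"
echo "content-and-path: foo='$fixed_first' foo.jpg='$fixed_second'"
```

The price of the second key is that the same blob may be filtered once
per distinct path, so some of the speedup is given back.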

-Peff

* Re: Git on Windows, CRLF issues
  2008-04-23 21:47                 ` Jeff King
@ 2008-04-23 23:01                   ` Junio C Hamano
  2008-04-23 23:04                     ` Avery Pennarun
  2008-04-24  1:37                     ` Jeff King
  0 siblings, 2 replies; 30+ messages in thread
From: Junio C Hamano @ 2008-04-23 23:01 UTC (permalink / raw)
  To: Jeff King
  Cc: Johannes Sixt, Avery Pennarun, Johannes Schindelin,
	Peter Karlsson, git

Jeff King <peff@peff.net> writes:

> The correct fix is either:
>
>   - the blob cache needs to take into account sha1 _and_ path
>
>   - the cache lookup needs to be _inside_ the path filter. In that case
>     you would either have to support it in the script (e.g.,
>     --blob-ignore jpg), or you could make the caching an optional part
>     of the blob filter (the way you can call 'map' explicitly from your
>     filters).

But once you start saying "even originally the same blob (i.e. identified
by one object name) can be rewritten into different result, depending on
where in the tree it appears", would it make sense to have blob filters to
begin with?

Shouldn't that kind of context sensitive (in the space dimension -- you
can introduce the context sensitivity in the time dimension by saying
there may even be cases where you would want to filter differently
depending on the path and which commit the blob appears, which is even
worse) filtering be best left to the tree or index filter?
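
For comparison, a path-aware conversion with a plain tree filter might
look like the sketch below. It is only an illustration in a throwaway
repository (file names and patterns are made up), it uses tr instead of
dos2unix so that it has no extra dependencies, and tr -d removes every
CR rather than only those in front of an LF. As noted earlier in the
thread, a tree filter checks out every commit, so this is the slow
variant:

```shell
set -e
work=$(mktemp -d)                 # scratch location, an assumption of this demo
cd "$work"
git init -q
git config user.name "Example User"
git config user.email "user@example.invalid"
git config core.autocrlf false    # keep the CRLF bytes exactly as written

printf 'one\r\ntwo\r\n' > notes.txt
git add notes.txt
git commit -q -m "add notes with CRLF endings"

# Count the CR bytes in a blob named as <rev>:<path>.
count_cr() { git cat-file blob "$1" | tr -cd '\r' | wc -c | tr -d ' '; }
before=$(count_cr HEAD:notes.txt)

# The filter runs inside a checkout of each commit and picks files by name.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --tree-filter '
  find . -type f \( -name "*.txt" -o -name "*.c" \) | while read -r f; do
    tr -d "\r" < "$f" > "$f.lf" && mv "$f.lf" "$f"
  done
' HEAD >/dev/null

after=$(count_cr HEAD:notes.txt)
echo "CRs before: $before, after: $after"
```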

* Re: Git on Windows, CRLF issues
  2008-04-23 23:01                   ` Junio C Hamano
@ 2008-04-23 23:04                     ` Avery Pennarun
  2008-04-24  8:11                       ` Johannes Schindelin
  2008-04-24  1:37                     ` Jeff King
  1 sibling, 1 reply; 30+ messages in thread
From: Avery Pennarun @ 2008-04-23 23:04 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Johannes Sixt, Johannes Schindelin, Peter Karlsson,
	git

On 4/23/08, Junio C Hamano <gitster@pobox.com> wrote:
> But once you start saying "even originally the same blob (i.e. identified
>  by one object name) can be rewritten into different result, depending on
>  where in the tree it appears", would it make sense to have blob filters to
>  begin with?
>
>  Shouldn't that kind of context sensitive (in the space dimension -- you
>  can introduce the context sensitivity in the time dimension by saying
>  there may even be cases where you would want to filter differently
>  depending on the path and which commit the blob appears, which is even
>  worse) filtering be best left to the tree or index filter?

What I really want is the equivalent of "dos2unix --recursive *.c
*.txt etc" for all commits.

Theoretically, a .txt file might be renamed to a .jpg file, in which
case funny things would happen with such a filter, depending which
commit was seen first.

I'm pretty confident that this will never happen to me, but it's a
valid concern.

Have fun,

Avery

* Re: Git on Windows, CRLF issues
  2008-04-23 23:01                   ` Junio C Hamano
  2008-04-23 23:04                     ` Avery Pennarun
@ 2008-04-24  1:37                     ` Jeff King
  1 sibling, 0 replies; 30+ messages in thread
From: Jeff King @ 2008-04-24  1:37 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Sixt, Avery Pennarun, Johannes Schindelin,
	Peter Karlsson, git

On Wed, Apr 23, 2008 at 04:01:10PM -0700, Junio C Hamano wrote:

> But once you start saying "even originally the same blob (i.e. identified
> by one object name) can be rewritten into different result, depending on
> where in the tree it appears", would it make sense to have blob filters to
> begin with?
> 
> Shouldn't that kind of context sensitive (in the space dimension -- you
> can introduce the context sensitivity in the time dimension by saying
> there may even be cases where you would want to filter differently
> depending on the path and which commit the blob appears, which is even
> worse) filtering be best left to the tree or index filter?

Yes, that was my original reasoning. But I think the problem then is
that the blob filter isn't terribly useful. IOW, it is not really a
separate filter, but rather an optimizing pattern for an index filter,
so maybe calling it a blob filter is the wrong approach, and it would be
better as a short perl script in contrib/filter-branch. Then you could
call:

  git filter-branch --index-filter '
    /path/to/git/contrib/filter-branch/dos2unix \
      "*.txt" "*.c"
  '
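
No such script exists in contrib/ as far as this thread goes, so the
following is only a guess at what it could look like, written in shell
rather than perl. The name crlf-to-lf.sh and the CRLF_CACHE variable
are invented. It walks the index entries matching the given patterns,
converts each blob once, remembers the result by old object name, and
points the index entry at the converted blob:

```shell
set -e
work=$(mktemp -d)                 # scratch location, an assumption of this demo
cd "$work"

# Hypothetical helper, meant to be called from --index-filter.
cat > crlf-to-lf.sh <<'EOF'
#!/bin/sh
# usage: crlf-to-lf.sh PATTERN...
cache=${CRLF_CACHE:?set CRLF_CACHE to a scratch directory}
git ls-files -s -- "$@" | while read -r mode sha stage path; do
  # Convert each distinct blob only once; later commits reuse the result.
  if [ ! -f "$cache/$sha" ]; then
    git cat-file blob "$sha" | tr -d '\r' |
      git hash-object -w --stdin > "$cache/$sha"
  fi
  git update-index --cacheinfo "$mode" "$(cat "$cache/$sha")" "$path"
done
EOF
chmod +x crlf-to-lf.sh
mkdir cache
export CRLF_CACHE="$work/cache"

git init -q repo
cd repo
git config user.name "Example User"
git config user.email "user@example.invalid"
git config core.autocrlf false    # keep the CRLF bytes exactly as written
printf 'one\r\ntwo\r\n' > notes.txt
git add notes.txt
git commit -q -m "add notes with CRLF endings"

count_cr() { git cat-file blob "$1" | tr -cd '\r' | wc -c | tr -d ' '; }
before=$(count_cr HEAD:notes.txt)

FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --index-filter \
  "\"$work/crlf-to-lf.sh\" '*.txt' '*.c'" HEAD >/dev/null

after=$(count_cr HEAD:notes.txt)
echo "CRs before: $before, after: $after"
```

Because the cache is only consulted for paths that match the patterns,
a blob that shows up under a non-matching name is simply left alone.
The sketch does not cope with paths that git ls-files would quote.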

-Peff

* Re: Git on Windows, CRLF issues
  2008-04-23 20:02             ` Avery Pennarun
@ 2008-04-24  6:25               ` Johannes Sixt
  0 siblings, 0 replies; 30+ messages in thread
From: Johannes Sixt @ 2008-04-24  6:25 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Jeff King, Johannes Schindelin, Peter Karlsson, git

Avery Pennarun schrieb:
> On 4/23/08, Johannes Sixt <j.sixt@viscovery.net> wrote:
>> In practice, this is not sufficient. The blob filter must have an
>>  opportunity to decide what it wants to do, not just blindly munge every
>>  blob. The minimum is a path name, e.g. in $1:
> 
> Actually, it may not have been intentional, but because of the way
> 'eval' works, the munge script will find that $path already contains
> the path of the file being munged.  Works for me.

Yes, of course! So I stand corrected, and Jeff's patch makes sense.

For consistency's sake, the path should be made available in, say,
GIT_BLOB_PATH just like the commit is available in GIT_COMMIT.

-- Hannes

* Re: Git on Windows, CRLF issues
  2008-04-23 23:04                     ` Avery Pennarun
@ 2008-04-24  8:11                       ` Johannes Schindelin
  2008-04-24 16:56                         ` Avery Pennarun
  0 siblings, 1 reply; 30+ messages in thread
From: Johannes Schindelin @ 2008-04-24  8:11 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Junio C Hamano, Jeff King, Johannes Sixt, Peter Karlsson, git

Hi,

On Wed, 23 Apr 2008, Avery Pennarun wrote:

> What I really want is the equivalent of "dos2unix --recursive *.c *.txt 
> etc" for all commits.

I start to wonder if "git fast-export --all | my-intelligent-perl-script | 
git fast-import" would not be a better solution here.

All you would have to do is to detect when a blob begins, and how long it 
is, and work with that.  If your trees do not contain any binary files, it 
should be trivial.

Ciao,
Dscho

* Re: Git on Windows, CRLF issues
  2008-04-24  8:11                       ` Johannes Schindelin
@ 2008-04-24 16:56                         ` Avery Pennarun
  0 siblings, 0 replies; 30+ messages in thread
From: Avery Pennarun @ 2008-04-24 16:56 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Jeff King, Johannes Sixt, Peter Karlsson, git

On 4/24/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>  On Wed, 23 Apr 2008, Avery Pennarun wrote:
>  > What I really want is the equivalent of "dos2unix --recursive *.c *.txt
>  > etc" for all commits.
>
> I start to wonder if "git fast-export --all | my-intelligent-perl-script |
>  git fast-import" would not be a better solution here.
>
>  All you would have to do is to detect when a blob begins, and how long it
>  is, and work with that.  If your trees do not contain any binary files, it
>  should be trivial.

Err, yes... as long as there are no binary files.  I'm not so lucky,
and life is a little more complex in that case.  It also gives no easy
way of selectively applying the blob filter based on filename, which
is pretty important when you do have some binary files and you're
trying to decide whether to run dos2unix.

(In contrast, the *other* objection, which is that the same blob might
have multiple filenames, doesn't bother me at all, since I'm sure I
don't have any .txt files that were accidentally named .jpg at some
point.)

I agree that a working solution based on
git-fast-export/git-fast-import should run faster than any of the
other proposed solutions, but my version of Jeff's patch is quite fast
and it's easy to compose simple command lines that "make simple things
simple and hard things possible":

  git-filter-branch --blob-filter dos2unix HEAD
  git-filter-branch --blob-filter 'case "$path" in *.c) expand -8;; *) cat;; esac' HEAD

It sure beats writing a perl script every time you want to do something.

Jeff wrote:
> But I think the problem then is
> that the blob filter isn't terribly useful. IOW, it is not really a
> separate filter, but rather an optimizing pattern for an index filter,
> so maybe calling it a blob filter is the wrong approach

The problem is that doing an optimization on an index filter is kind
of hard for a user to express, and each user will have to implement
the caching logic by hand every time.  Using --index-filter at all
requires extremely high levels of shell and git knowledge.

The fact that the blob transformation might "slightly depend on" the
path is not actually very important; fundamentally we're still
transforming blobs, not paths.  We're just using the filename as a
*hint* about what kind of transformation we need to do on that
particular blob.

I think the measure of a good idea here is how straightforward it is
to express what you want on the command line, and --blob-filter makes
it easy to express a certain class of filters.

Have fun,

Avery
