* Git on Windows, CRLF issues
From: Peter Karlsson @ 2008-04-21 19:48 UTC
To: git

Hi!

I have begun moving old repositories for Windows-based software to Git.
Since the tool I am moving from stores everything with CRLF line endings
and has RCS-like keyword expansion, I'm treating it all as binary data
when exporting to Git, i.e. I have CRLF in the checked-in data (and I do
want that, since this is Windows-only sources).

Now the latest msysgit comes along and (finally!) sets core.autocrlf to
true by default.

How do I handle this without having everyone breaking check-ins? I can't
require everyone to unset core.autocrlf globally. Can I do that with
gitattributes?

-- 
\\// Peter - http://www.softwolves.pp.se/
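One way the gitattributes question could be approached, offered as a hedged sketch and not as something proposed in this thread: a committed `.gitattributes` that unsets the line-ending attribute for every path tells Git to do no end-of-line conversion, whatever each user's core.autocrlf says. In 2008-era Git the attribute was spelled `crlf`; later releases call it `text` and keep `crlf` as an alias. The temporary directory below is a stand-in for the root of a real working tree.

```shell
# Hypothetical working tree; in practice this is the repository root.
worktree=$(mktemp -d)

# "-text" unsets the text attribute for every path, so Git performs no
# end-of-line conversion on checkin or checkout.
# On 2008-era Git the equivalent spelling is "* -crlf".
printf '%s\n' '* -text' >"$worktree/.gitattributes"

cat "$worktree/.gitattributes"
```

The file has to be committed for other clones to pick it up; a per-clone alternative is `.git/info/attributes`, which is not shared.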
* Re: Git on Windows, CRLF issues
From: Johannes Schindelin @ 2008-04-21 20:07 UTC
To: Peter Karlsson; +Cc: git

Hi,

On Mon, 21 Apr 2008, Peter Karlsson wrote:

> Now the latest msysgit comes along and (finally!) sets core.autocrlf to
> true by default.

It is actually nice to hear at least _somebody_ not insulting us for this
decision. Thank you!

> How do I handle this without having everyone breaking check-ins? I can't
> require everyone to unset core.autocrlf globally. Can I do that with
> gitattributes?

I think that the only solution to this is (sorry!) to have one single big
checkin which converts all CR/LF to LF line endings...

Desole,
Dscho
* Re: Git on Windows, CRLF issues
From: Avery Pennarun @ 2008-04-21 21:53 UTC
To: Johannes Schindelin; +Cc: Peter Karlsson, git

On 4/21/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> I think that the only solution to this is (sorry!) to have one single big
> checkin which converts all CR/LF to LF line endings...

If it were me (and I hope it will be, soon, if we can entirely shut down
svn internally), I would prefer to use git-filter-branch to go through
*all* my checkins and fix up the CRLFs in all of them. That way the
history will be clean and diffs/annotates/merges will go more smoothly.

Does anyone know the most efficient way to do this with
git-filter-branch, when there are already thousands of files in the
repo with CRLF in them? Running dos2unix on all the files for every
single revision could take a *very* long time.

Have fun,

Avery
* Re: Git on Windows, CRLF issues
From: Jeff King @ 2008-04-22 2:39 UTC
To: Avery Pennarun; +Cc: Johannes Schindelin, Peter Karlsson, git

On Mon, Apr 21, 2008 at 05:53:34PM -0400, Avery Pennarun wrote:

> Does anyone know the most efficient way to do this with
> git-filter-branch, when there are already thousands of files in the
> repo with CRLF in them? Running dos2unix on all the files for every
> single revision could take a *very* long time.

Yes, a tree filter would probably be quite slow due to checking out, and
then munging all of the files.

You could maybe do an index filter that gets the blob SHA1 of each file
that is new, and just munges those. But I think it is even simpler to
just keep a cache of original blob hashes mapping to munged blob hashes.
Something like:

  git filter-branch --index-filter '
    git ls-files --stage |
    perl /path/to/caching-munger |
    git update-index --index-info
  '

where your caching munger looks something like:

-- >8 --
#!/usr/bin/perl

use strict;
use DB_File;
use Fcntl;

# Persistent map from original blob hash to munged blob hash.
tie my %cache, 'DB_File', "$ENV{HOME}/filter-cache", O_RDWR|O_CREAT, 0666
  or die "unable to open db: $!";

# Input is "git ls-files --stage": <mode> <sha1> <stage>\t<path>
while(<>) {
  my ($mode, $hash, $path) = /^(\d+) ([0-9a-f]{40}) \d\t(.*)/
    or die "bad ls-files line: $_";
  # Only munge blobs we have not seen before.
  $cache{$hash} = munge($hash) unless exists $cache{$hash};
  # Output is what "git update-index --index-info" expects.
  print "$mode $cache{$hash}\t$path\n";
}

# Rewrite one blob and return the hash of the new blob.
sub munge {
  my $h = shift;
  my $r = scalar `git show $h | sed 's/\$/\\r/' | git hash-object -w --stdin`;
  chomp $r;
  return $r;
}
-- 8< --

so we keep a dbm of the hash mapping, and do no work if we have already
seen this blob. If we haven't, then we actually do the expensive
'show | munge | hash-object'.

And here our munge adds a CR, but you should be able to do an arbitrary
transformation.

-Peff
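The munge in the script above appends a CR, while the conversion asked about in this thread goes the other way. A minimal sketch of that direction, assuming a POSIX awk and text-only blobs; the function names `crlf_to_lf` and `munge_blob` are illustrative and not part of Git.

```shell
# Strip one trailing CR from each line read on stdin (CRLF -> LF).
# CRs elsewhere in a line are left alone. Intended for text only: awk
# adds a final newline if the input lacks one and may mangle binary data.
crlf_to_lf() {
	awk '{ sub(/\r$/, ""); print }'
}

# Rewrite one blob and print the id of the new blob, mirroring munge() in
# the Perl script. Must be run inside a Git repository.
munge_blob() {
	git cat-file blob "$1" | crlf_to_lf | git hash-object -w --stdin
}
```

In the Perl script, the same effect could be had by swapping the `sed` stage for an equivalent CR-stripping command; the caching logic stays unchanged.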
* Re: Git on Windows, CRLF issues
From: Avery Pennarun @ 2008-04-22 16:51 UTC
To: Jeff King; +Cc: Johannes Schindelin, Peter Karlsson, git

On 4/21/08, Jeff King <peff@peff.net> wrote:
> You could maybe do an index filter that gets the blob SHA1 of each file
> that is new, and just munges those. But I think it is even simpler to
> just keep a cache of original blob hashes mapping to munged blob hashes.
> [...]

Thanks, this is really cool. I'll try it next time I'm messing with
our repositories (this week is unfortunately a bit too busy).

Do you think git would benefit from having a generalized version of
this script? Basically, the user provides a "munge" script on the
command line, and there's a git-filter-branch mode for auto-munging
(with a cache) every file in every checkin. Even if it's *only* ever
used for CRLF, I can imagine this being useful to a lot of people.

Thanks,

Avery
* Re: Git on Windows, CRLF issues
From: Peter Karlsson @ 2008-04-23 7:11 UTC
To: Avery Pennarun; +Cc: Jeff King, Johannes Schindelin, Git Mailing List

Avery Pennarun:

> Do you think git would benefit from having a generalized version of
> this script?

Definitely. Also, something that would work with a) several branches
(i.e. traverse all the branches, keeping the points at which they
diverge), and b) submodules (i.e. apply the same changes to the
submodules and update the submodule index accordingly).

I ended up doing CRLF conversion for most of the repositories I had
converted. Fortunately, most of them had a single branch, so after
having created a small script that did CRLF->LF for the text files, I
could do a

  git filter-branch --tree-filter 'c:/temp/crlf2lf.sh' \
    --tag-name-filter 'cat' HEAD

on each repository and get everything converted during my lunch break.

What I couldn't figure out is why, after converting everything,
removing all references to the repositories I cloned from, and removing
references to the old objects in the reflogs,

  git fsck --unreachable

did not report any unreachable objects. I would have guessed the entire
old history and its objects would now be invalidated and could be
killed off.

-- 
\\// Peter - http://www.softwolves.pp.se/
* Re: Git on Windows, CRLF issues
From: Jeff King @ 2008-04-23 8:10 UTC
To: Peter Karlsson; +Cc: Avery Pennarun, Johannes Schindelin, Git Mailing List

On Wed, Apr 23, 2008 at 08:11:49AM +0100, Peter Karlsson wrote:

> I ended up doing CRLF conversion for most of the repositories I had
> converted. Fortunately, most of them had a single branch, so after
> having created a small script that did CRLF->LF for the text files, I
> could do a
>
>   git filter-branch --tree-filter 'c:/temp/crlf2lf.sh' \
>     --tag-name-filter 'cat' HEAD
>
> on each repository and get everything converted during my lunch break.

Sure, but that is quite slow on a larger tree, since it has to do a full
checkout for each commit. The idea of the specialized filter was to
avoid that. But if your project was small enough to do it that way, that
certainly works.

> What I couldn't figure out is why, after converting everything,
> removing all references to the repositories I cloned from, and removing
> references to the old objects in the reflogs,
>
>   git fsck --unreachable
>
> did not report any unreachable objects. I would have guessed the entire
> old history and its objects would now be invalidated and could be
> killed off.

Did you remove refs/original/ ?

-Peff
* Re: Git on Windows, CRLF issues
From: Peter Karlsson @ 2008-04-23 13:47 UTC
To: Jeff King; +Cc: Avery Pennarun, Johannes Schindelin, Git Mailing List

Jeff King:

> Sure, but that is quite slow on a larger tree, since it has to do a
> full checkout for each commit.

Indeed. That's why I would welcome a script such as the one you
mentioned :-) Fortunately, the repositories I worked on were small
enough to not suffer too much (even when using Git on Windows, which is
a bit slower than on Linux).

[Not seeing any unreachable objects]
> Did you remove refs/original/ ?

That, and cloned the repository to a new location after the conversion,
and removed the references to "origin" there. It does seem that the
objects are still there, but I can't see them with "gitk --all".

-- 
\\// Peter - http://www.softwolves.pp.se/
* Re: Git on Windows, CRLF issues
From: Johan Herland @ 2008-04-23 14:24 UTC
To: Peter Karlsson; +Cc: git, Jeff King, Avery Pennarun, Johannes Schindelin

On Wednesday 23 April 2008, Peter Karlsson wrote:
> Jeff King:
> > Sure, but that is quite slow on a larger tree, since it has to do a
> > full checkout for each commit.
>
> Indeed. That's why I would welcome a script such as the one you
> mentioned :-) Fortunately, the repositories I worked on were small
> enough to not suffer too much (even when using Git on Windows, which
> is a bit slower than on Linux).
>
> [Not seeing any unreachable objects]
>
> > Did you remove refs/original/ ?
>
> That, and cloned the repository to a new location after the
> conversion, and removing the references to "origin" there. It does
> seem that the objects are still there, but I can't see them with
> "gitk --all".

Maybe they are kept alive by reflogs?

Have fun! :)

...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net
* Re: Git on Windows, CRLF issues
From: Johannes Sixt @ 2008-04-23 15:12 UTC
To: Peter Karlsson; +Cc: Jeff King, Avery Pennarun, Johannes Schindelin, Git Mailing List

Peter Karlsson wrote:
> [Not seeing any unreachable objects]
> Jeff King:
>> Did you remove refs/original/ ?
>
> That, and cloned the repository to a new location after the conversion,
> and removing the references to "origin" there. It does seem that the
> objects are still there, but I can't see them with "gitk --all".

Did you clone locally? Then you must use the file:// protocol, otherwise
everything is hard-linked from the origin.

-- Hannes
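The suggestions in this subthread (drop refs/original/, expire the reflogs, then prune) can be strung together roughly as below. This is a sketch only, assuming a reasonably recent Git; `drop_old_history` is a made-up function name, and because it discards the backup refs for good it is best tried on a copy of the repository first.

```shell
# Remove everything that still keeps the pre-rewrite objects alive in the
# current repository, then prune them. Destructive: use a copy first.
drop_old_history() {
	# 1. Delete the backup refs that filter-branch leaves in refs/original/.
	git for-each-ref --format='%(refname)' refs/original/ |
	while read -r ref
	do
		git update-ref -d "$ref"
	done
	# 2. Expire all reflog entries, which also point at the old commits.
	git reflog expire --expire=now --all
	# 3. Repack and prune the objects that are now unreachable.
	git gc --prune=now
}
```

The alternative Hannes describes is to leave the rewritten repository alone and run `git clone file:///path/to/rewritten new-copy` (paths hypothetical): a clone over file:// transfers only reachable objects, whereas a clone from a plain local path hard-links the whole object store.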
* Re: Git on Windows, CRLF issues
From: Jeff King @ 2008-04-23 8:08 UTC
To: Avery Pennarun; +Cc: Johannes Schindelin, Peter Karlsson, git

On Tue, Apr 22, 2008 at 12:51:14PM -0400, Avery Pennarun wrote:

> Do you think git would benefit from having a generalized version of
> this script? Basically, the user provides a "munge" script on the
> command line, and there's a git-filter-branch mode for auto-munging
> (with a cache) every file in every checkin. Even if it's *only* ever
> used for CRLF, I can imagine this being useful to a lot of people.

It was easy enough to work up the patch below, which allows

  git filter-branch --blob-filter 'tr a-z A-Z'

However, it's _still_ horribly slow. Shell script is nice and flexible,
but running a tight loop like this is just painful. I suspect
filter-branch in something like perl would be a lot faster and just as
flexible (you could even do it in C, but you'd probably have to invent a
little domain-specific scripting language).

It is still much better performance than a tree filter, though:

  $ cd git && time git filter-branch --tree-filter '
      find . -type f | while read f; do
        tr a-z A-Z <"$f" >tmp
        mv tmp "$f"
      done
    ' HEAD~10..HEAD
  real    4m38.626s
  user    1m32.726s
  sys     2m51.163s

  $ cd git && git filter-branch --blob-filter 'tr a-z A-Z' HEAD~10..HEAD
  real    1m40.809s
  user    0m36.822s
  sys     1m14.273s

Lots of system time in both. I'm sure we spend a fair bit of time
hitting our very large map and blob-cache directories, which would be
much more nicely implemented as associative arrays in memory (if we were
using a more featureful language).

Anyway, here is the patch. I don't know if it is even worth applying,
since it is still painfully slow.

---
 git-filter-branch.sh |   30 ++++++++++++++++++++++++++++++
 1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/git-filter-branch.sh b/git-filter-branch.sh
index 333f6a8..0602b25 100755
--- a/git-filter-branch.sh
+++ b/git-filter-branch.sh
@@ -54,6 +54,23 @@ EOF
 
 eval "$functions"
 
+munge_blobs() {
+	while read mode sha1 stage path
+	do
+		if ! test -r "$workdir/../blob-cache/$sha1"
+		then
+			new=`git cat-file blob $sha1 |
+			     eval "$filter_blob" |
+			     git hash-object -w --stdin`
+			printf $new >$workdir/../blob-cache/$sha1
+		fi
+		printf "%s %s\t%s\n" \
+			"$mode" \
+			$(cat "$workdir/../blob-cache/$sha1") \
+			"$path"
+	done
+}
+
 # When piped a commit, output a script to set the ident of either
 # "author" or "committer
 
@@ -105,6 +122,7 @@ tempdir=.git-rewrite
 filter_env=
 filter_tree=
 filter_index=
+filter_blob=
 filter_parent=
 filter_msg=cat
 filter_commit='git commit-tree "$@"'
@@ -150,6 +168,9 @@ do
 	--index-filter)
 		filter_index="$OPTARG"
 		;;
+	--blob-filter)
+		filter_blob="$OPTARG"
+		;;
 	--parent-filter)
 		filter_parent="$OPTARG"
 		;;
@@ -227,6 +248,9 @@ ret=0
 # map old->new commit ids for rewriting parents
 mkdir ../map || die "Could not create map/ directory"
 
+# cache rewritten blobs for blob filter
+mkdir ../blob-cache || die "Could not create blob-cache/ directory"
+
 case "$filter_subdir" in
 "")
 	git rev-list --reverse --topo-order --default HEAD \
@@ -295,6 +319,12 @@ while read commit parents; do
 	eval "$filter_index" < /dev/null ||
 		die "index filter failed: $filter_index"
 
+	if test -n "$filter_blob"; then
+		git ls-files --stage |
+		munge_blobs |
+		git update-index --index-info
+	fi
+
 	parentstr=
 	for parent in $parents; do
 		for reparent in $(map "$parent"); do
-- 
1.5.5.1.144.g4c416.dirty
* Re: Git on Windows, CRLF issues
From: Johannes Schindelin @ 2008-04-23 10:13 UTC
To: Jeff King; +Cc: Avery Pennarun, Peter Karlsson, git

Hi,

On Wed, 23 Apr 2008, Jeff King wrote:

> On Tue, Apr 22, 2008 at 12:51:14PM -0400, Avery Pennarun wrote:
>
> > Do you think git would benefit from having a generalized version of
> > this script? Basically, the user provides a "munge" script on the
> > command line, and there's a git-filter-branch mode for auto-munging
> > (with a cache) every file in every checkin. Even if it's *only* ever
> > used for CRLF, I can imagine this being useful to a lot of people.
>
> It was easy enough to work up the patch below, which allows
>
>   git filter-branch --blob-filter 'tr a-z A-Z'
>
> However, it's _still_ horribly slow.

You create a quite huge blob-cache, so you are pretty heavy on disk-I/O.
Have you tried (as suggested in the man page) to run this on a huge RAM
disk? That should blow you away.

> Shell script is nice and flexible, but running a tight loop like this is
> just painful. I suspect filter-branch in something like perl would be a
> lot faster and just as flexible (you could even do it in C, but you'd
> probably have to invent a little domain-specific scripting language).

I hoped that the rewrite-commits attempt was more than just that: an
attempt. So there is a point you could start from, doing things in C.

But I doubt that you get any joy: either your language is too limited, or
you will get the same problems (fork() overhead) again.

> Anyway, here is the patch. I don't know if it is even worth applying,
> since it is still painfully slow.

I like your patch:

Acked-by: Johannes Schindelin <johannes.schindelin@gmx.de>

Ciao,
Dscho
* Re: Git on Windows, CRLF issues
From: Jeff King @ 2008-04-23 10:58 UTC
To: Johannes Schindelin; +Cc: Avery Pennarun, Peter Karlsson, git

On Wed, Apr 23, 2008 at 11:13:27AM +0100, Johannes Schindelin wrote:

> > It was easy enough to work up the patch below, which allows
> >
> >   git filter-branch --blob-filter 'tr a-z A-Z'
> >
> > However, it's _still_ horribly slow.
>
> You create a quite huge blob-cache, so you are pretty heavy on disk-I/O.
> Have you tried (as suggested in the man page) to run this on a huge RAM
> disk? That should blow you away.

No, I didn't. But the disk I/O is pretty minimal. The blob cache is only
a few megabytes, and it stays entirely in Linux's disk cache. My disk
light only blinks every 5-10 seconds to flush dirty pages to disk.

> I hoped that the rewrite-commits attempt was more than just that: an
> attempt. So there is a point you could start from, doing things in C.
>
> But I doubt that you get any joy: either your language is too limited, or
> you will get the same problems (fork() overhead) again.

Ah, right. I totally forgot about that effort. I will take a peek next
time I need to do some filtering.

> > Anyway, here is the patch. I don't know if it is even worth applying,
> > since it is still painfully slow.
>
> I like your patch:
>
> Acked-by: Johannes Schindelin <johannes.schindelin@gmx.de>

I think it could use some documentation updates. Avery, do you want to
try adding a CRLF example to the manpage?

-Peff
* Re: Git on Windows, CRLF issues
From: Johannes Sixt @ 2008-04-23 10:58 UTC
To: Jeff King; +Cc: Avery Pennarun, Johannes Schindelin, Peter Karlsson, git

Jeff King wrote:
> It was easy enough to work up the patch below, which allows
>
>   git filter-branch --blob-filter 'tr a-z A-Z'
...
> +munge_blobs() {
> +	while read mode sha1 stage path
> +	do
> +		if ! test -r "$workdir/../blob-cache/$sha1"
> +		then
> +			new=`git cat-file blob $sha1 |
> +			     eval "$filter_blob" |
> +			     git hash-object -w --stdin`
> +			printf $new >$workdir/../blob-cache/$sha1
> +		fi
> +		printf "%s %s\t%s\n" \
> +			"$mode" \
> +			$(cat "$workdir/../blob-cache/$sha1") \
> +			"$path"
> +	done
> +}

In practice, this is not sufficient. The blob filter must have an
opportunity to decide what it wants to do, not just blindly munge every
blob. The minimum is a path name, e.g. in $1:

	new=$(git cat-file blob $sha1 |
	      $SHELL_PATH -c "$filter_blob" ignored "$path" |
	      git hash-object -w --stdin)

-- Hannes
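The extra word `ignored` in the command above is there because, with `sh -c`, the first argument after the script text becomes `$0` and only the next one becomes `$1`. A small sketch of that behaviour, using plain `sh` in place of `$SHELL_PATH` and a made-up path and filter:

```shell
# With "sh -c SCRIPT NAME ARG", NAME is assigned to $0 and ARG to $1,
# so the filter script sees the path as its first positional parameter.
filter_blob='printf "%s\n" "filtering $1"'
result=$(sh -c "$filter_blob" ignored "src/main.c")
echo "$result"   # prints: filtering src/main.c
```

Unlike `eval`, this runs the filter in a fresh shell, so the filter sees only what is passed as arguments or exported in the environment.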
* Re: Git on Windows, CRLF issues
From: Jeff King @ 2008-04-23 11:04 UTC
To: Johannes Sixt; +Cc: Avery Pennarun, Johannes Schindelin, Peter Karlsson, git

On Wed, Apr 23, 2008 at 12:58:57PM +0200, Johannes Sixt wrote:

> In practice, this is not sufficient. The blob filter must have an
> opportunity to decide what it wants to do, not just blindly munge every
> blob. The minimum is a path name, e.g. in $1:
>
> 	new=$(git cat-file blob $sha1 |
> 	      $SHELL_PATH -c "$filter_blob" ignored "$path" |
> 	      git hash-object -w --stdin)

I intentionally left that out, because:

  - I assumed if you were going to do trickery with pathnames, you
    should just be doing an index filter

  - it violates the cache assumption, which is that blob $X is always
    transformed the same way

I assume you are wanting to do something like:

  git filter-branch --blob-filter '
    case "$1" in
      *.jpg) cat ;;
      *) tr a-z A-Z ;;
    esac
  '

Obviously it is unlikely to get the same blob sha1 as "foo.jpg" and
"foo.txt", but it just feels a little wrong.

-Peff
* Re: Git on Windows, CRLF issues
From: Johannes Sixt @ 2008-04-23 11:46 UTC
To: Jeff King; +Cc: Avery Pennarun, Johannes Schindelin, Peter Karlsson, git

Jeff King wrote:
> On Wed, Apr 23, 2008 at 12:58:57PM +0200, Johannes Sixt wrote:
>
>> In practice, this is not sufficient. The blob filter must have an
>> opportunity to decide what it wants to do, not just blindly munge every
>> blob. The minimum is a path name, e.g. in $1:
>>
>> 	new=$(git cat-file blob $sha1 |
>> 	      $SHELL_PATH -c "$filter_blob" ignored "$path" |
>> 	      git hash-object -w --stdin)
>
> I intentionally left that out, because:
>
>   - I assumed if you were going to do trickery with pathnames, you
>     should just be doing an index filter
>
>   - it violates the cache assumption, which is that blob $X is always
>     transformed the same way
>
> I assume you are wanting to do something like:
>
>   git filter-branch --blob-filter '
>     case "$1" in
>       *.jpg) cat ;;
>       *) tr a-z A-Z ;;
>     esac
>   '
>
> Obviously it is unlikely to get the same blob sha1 as "foo.jpg" and
> "foo.txt", but it just feels a little wrong.

Yes, that's how I intended it to work. What's wrong here? The fact that a
user might name a JPEG foo.txt instead of foo.jpg? Or that the same blob
might appear with entirely different names, including different suffixes?
Well, tough luck. Use an index filter. But without any sort of hint what
the blob is about, your original --blob-filter is useless except for the
most simplistic repositories.

-- Hannes
* Re: Git on Windows, CRLF issues
From: Jeff King @ 2008-04-23 21:47 UTC
To: Johannes Sixt; +Cc: Avery Pennarun, Johannes Schindelin, Peter Karlsson, git

On Wed, Apr 23, 2008 at 01:46:20PM +0200, Johannes Sixt wrote:

> > I assume you are wanting to do something like:
> >
> >   git filter-branch --blob-filter '
> >     case "$1" in
> >       *.jpg) cat ;;
> >       *) tr a-z A-Z ;;
> >     esac
> >   '
> >
> > Obviously it is unlikely to get the same blob sha1 as "foo.jpg" and
> > "foo.txt", but it just feels a little wrong.
>
> Yes, that's how I intended it to work. What's wrong here? The fact that a
> user might name a JPEG foo.txt instead of foo.jpg? Or that the same blob
> might appear with entirely different names, including different suffixes?
> Well, tough luck. Use an index filter. But without any sort of hint what
> the blob is about, your original --blob-filter is useless except for the
> most simplistic repositories.

Yes, the script produces incorrect results if you have the same blob
with different names. IOW, if I accidentally add a JPEG as 'foo', and
then later rename it to 'foo.jpg', it will munge the blob the first time
it sees it, and then use the munged value for 'foo.jpg', since we never
even run the case statement.

Yes, this is not terribly likely, but it does seem like an awful (and
hard to diagnose!) bug to have hiding in the script. The correct fix is
either:

  - the blob cache needs to take into account sha1 _and_ path

  - the cache lookup needs to be _inside_ the path filter. In that case
    you would either have to support it in the script (e.g.,
    --blob-ignore jpg), or you could make the caching an optional part
    of the blob filter (the way you can call 'map' explicitly from your
    filters).

-Peff
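The first fix listed above, keying the cache on both blob id and path, could look roughly like this. It is a sketch with no Git involved: `cached_filter`, `cache_dir` and the use of `cksum` to turn a path into a file name are all assumptions made for illustration, and the "blob id" is just an opaque string here.

```shell
cache_dir=$(mktemp -d)

# cached_filter ID PATH FILTER: run FILTER (a shell snippet) over stdin once
# per distinct (ID, PATH) pair and replay the stored result afterwards.
cached_filter() {
	id=$1 path=$2 filter=$3
	# Hash the path so that slashes and spaces cannot break the file name.
	key=$id-$(printf '%s' "$path" | cksum | tr -d ' ')
	if ! test -r "$cache_dir/$key"
	then
		eval "$filter" >"$cache_dir/$key"
	fi
	cat "$cache_dir/$key"
}

# The JPEG-added-as-'foo' case from the message above: same id, two names.
filter='case "$path" in *.jpg) cat ;; *) tr a-z A-Z ;; esac'
echo data | cached_filter abc123 foo "$filter"       # prints DATA
echo data | cached_filter abc123 foo.jpg "$filter"   # prints data
```

The price is a larger cache and less reuse, since a blob that keeps its content across a rename is filtered again under the new name.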
* Re: Git on Windows, CRLF issues
From: Junio C Hamano @ 2008-04-23 23:01 UTC
To: Jeff King; +Cc: Johannes Sixt, Avery Pennarun, Johannes Schindelin, Peter Karlsson, git

Jeff King <peff@peff.net> writes:

> The correct fix is either:
>
>   - the blob cache needs to take into account sha1 _and_ path
>
>   - the cache lookup needs to be _inside_ the path filter. In that case
>     you would either have to support it in the script (e.g.,
>     --blob-ignore jpg), or you could make the caching an optional part
>     of the blob filter (the way you can call 'map' explicitly from your
>     filters).

But once you start saying "even originally the same blob (i.e. identified
by one object name) can be rewritten into different result, depending on
where in the tree it appears", would it make sense to have blob filters to
begin with?

Shouldn't that kind of context sensitive (in the space dimension -- you
can introduce the context sensitivity in the time dimension by saying
there may even be cases where you would want to filter differently
depending on the path and which commit the blob appears, which is even
worse) filtering be best left to the tree or index filter?
* Re: Git on Windows, CRLF issues
From: Avery Pennarun @ 2008-04-23 23:04 UTC
To: Junio C Hamano; +Cc: Jeff King, Johannes Sixt, Johannes Schindelin, Peter Karlsson, git

On 4/23/08, Junio C Hamano <gitster@pobox.com> wrote:
> But once you start saying "even originally the same blob (i.e. identified
> by one object name) can be rewritten into different result, depending on
> where in the tree it appears", would it make sense to have blob filters to
> begin with?
>
> Shouldn't that kind of context sensitive (in the space dimension -- you
> can introduce the context sensitivity in the time dimension by saying
> there may even be cases where you would want to filter differently
> depending on the path and which commit the blob appears, which is even
> worse) filtering be best left to the tree or index filter?

What I really want is the equivalent of "dos2unix --recursive *.c
*.txt etc" for all commits. Theoretically, a .txt file might be
renamed to a .jpg file, in which case funny things would happen with
such a filter, depending which commit was seen first. I'm pretty
confident that this will never happen to me, but it's a valid concern.

Have fun,

Avery
* Re: Git on Windows, CRLF issues
From: Johannes Schindelin @ 2008-04-24 8:11 UTC
To: Avery Pennarun; +Cc: Junio C Hamano, Jeff King, Johannes Sixt, Peter Karlsson, git

Hi,

On Wed, 23 Apr 2008, Avery Pennarun wrote:

> What I really want is the equivalent of "dos2unix --recursive *.c *.txt
> etc" for all commits.

I start to wonder if "git fast-export --all | my-intelligent-perl-script |
git fast-import" would not be a better solution here.

All you would have to do is to detect when a blob begins, and how long it
is, and work with that. If your trees do not contain any binary files, it
should be trivial.

Ciao,
Dscho
* Re: Git on Windows, CRLF issues
From: Avery Pennarun @ 2008-04-24 16:56 UTC
To: Johannes Schindelin; +Cc: Junio C Hamano, Jeff King, Johannes Sixt, Peter Karlsson, git

On 4/24/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Wed, 23 Apr 2008, Avery Pennarun wrote:
> > What I really want is the equivalent of "dos2unix --recursive *.c *.txt
> > etc" for all commits.
>
> I start to wonder if "git fast-export --all | my-intelligent-perl-script |
> git fast-import" would not be a better solution here.
>
> All you would have to do is to detect when a blob begins, and how long it
> is, and work with that. If your trees do not contain any binary files, it
> should be trivial.

Err, yes... as long as there are no binary files. I'm not so lucky,
and life is a little more complex in that case.

It also gives no easy way of selectively applying the blob filter
based on filename, which is pretty important when you do have some
binary files and you're trying to decide whether to run dos2unix. (In
contrast, the *other* objection, which is that the same blob might
have multiple filenames, doesn't bother me at all, since I'm sure I
don't have any .txt files that were accidentally named .jpg at some
point.)

I agree that a working solution based on
git-fast-export/git-fast-import should run faster than any of the
other proposed solutions, but my version of Jeff's patch is quite fast
and it's easy to compose simple command lines that "make simple things
simple and hard things possible":

  git-filter-branch --blob-filter dos2unix HEAD

  git-filter-branch --blob-filter 'case "$path" in *.c) expand -8;; *) cat;; esac' HEAD

It sure beats writing a perl script every time you want to do something.

Jeff wrote:
> But I think the problem then is
> that the blob filter isn't terribly useful. IOW, it is not really a
> separate filter, but rather an optimizing pattern for an index filter,
> so maybe calling it a blob filter is the wrong approach

The problem is that doing an optimization on an index filter is kind
of hard for a user to express, and each user will have to implement
the caching logic by hand every time. Using --index-filter at all
requires extremely high levels of shell and git knowledge.

The fact that the blob transformation might "slightly depend on" the
path is not actually very important; fundamentally we're still
transforming blobs, not paths. We're just using the filename as a
*hint* about what kind of transformation we need to do on that
particular blob.

I think the measure of a good idea here is how straightforward it is
to express what you want on the command line, and --blob-filter makes
it easy to express a certain class of filters.

Have fun,

Avery
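The "filename as a hint" idea can be sketched as a filter that strips CRs only for selected extensions and copies everything else through untouched. The function name `eol_filter` and the extension list are illustrative assumptions; `--blob-filter` is the option proposed in this thread, and the sketch only assumes that the filter receives the blob on stdin and the file name in `$path`, as in the examples above.

```shell
# Convert CRLF to LF for selected text extensions; pass everything else
# (for example images) through byte for byte. Reads the blob on stdin and
# expects the file name in $path.
eol_filter() {
	case "$path" in
	*.c|*.h|*.txt) awk '{ sub(/\r$/, ""); print }' ;;
	*) cat ;;
	esac
}

# Same bytes under two names: only the .c file loses its CR.
path=hello.c;  printf 'int x;\r\n' | eol_filter | od -c
path=logo.jpg; printf 'int x;\r\n' | eol_filter | od -c
```

Matching on extension keeps the command line short, which is the point being argued here; the rename hazard raised earlier in the thread still applies if a cache keyed on blob id alone sits in front of it.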
* Re: Git on Windows, CRLF issues 2008-04-23 23:01 ` Junio C Hamano 2008-04-23 23:04 ` Avery Pennarun @ 2008-04-24 1:37 ` Jeff King 1 sibling, 0 replies; 30+ messages in thread From: Jeff King @ 2008-04-24 1:37 UTC (permalink / raw) To: Junio C Hamano Cc: Johannes Sixt, Avery Pennarun, Johannes Schindelin, Peter Karlsson, git On Wed, Apr 23, 2008 at 04:01:10PM -0700, Junio C Hamano wrote: > But once you start saying "even originally the same blob (i.e. identified > by one object name) can be rewritten into different result, depending on > where in the tree it appears", would it make sense to have blob filters to > begin with? > > Shouldn't that kind of context sensitive (in the space dimension -- you > can introduce the context sensitivity in the time dimension by saying > there may even be cases where you would want to filter differently > depending on the path and which commit the blob appears, which is even > worse) filtering be best left to the tree or index filter? Yes, that was my original reasoning. But I think the problem then is that the blob filter isn't terribly useful. IOW, it is not really a separate filter, but rather an optimizing pattern for an index filter, so maybe calling it a blob filter is the wrong approach, and it would be better as a short perl script in contrib/filter-branch. Then you could call: git filter-branch --index-filter ' /path/to/git/contrib/filter-branch/dos2unix \ "*.txt" "*.c" ' -Peff ^ permalink raw reply [flat|nested] 30+ messages in thread
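To make the "optimizing pattern for an index filter" concrete, here is a rough sketch of what such a helper might do. It is written in shell rather than the perl suggested above, and it is not the contrib script from the message (which is only proposed there). The names `dos2unix_index` and `CACHE_DIR` are assumptions for illustration, `tr -d '\r'` again stands in for dos2unix, and paths with unusual characters are not handled carefully.

```shell
# Hypothetical index filter: rewrite CRLF to LF for index entries matching
# the given pathspecs, converting each distinct blob only once.
# CACHE_DIR is an assumed, caller-provided directory that outlives one commit.
dos2unix_index() {
    : "${CACHE_DIR:?set CACHE_DIR to a writable directory}"
    mkdir -p "$CACHE_DIR" || return 1
    git ls-files -s -- "$@" |
    while read -r mode sha stage path; do
        # Only regular files; skip symlinks and submodule entries.
        case "$mode" in 100*) ;; *) continue ;; esac
        if [ -f "$CACHE_DIR/$sha" ]; then
            # This blob was already converted for an earlier commit.
            new=$(cat "$CACHE_DIR/$sha")
        else
            new=$(git cat-file blob "$sha" | tr -d '\r' | git hash-object -w --stdin)
            printf '%s\n' "$new" > "$CACHE_DIR/$sha"
        fi
        # Emit an index update only when the content actually changed.
        [ "$new" = "$sha" ] || printf '%s %s %s\t%s\n' "$mode" "$new" "$stage" "$path"
    done |
    git update-index --index-info
}
```

The cache maps old object names to new ones, so a file that stays the same across thousands of commits costs one conversion and then one file lookup per commit. Under filter-branch the function would have to be made available inside the `--index-filter` string (for example by sourcing a file that defines it) and `CACHE_DIR` would need to be an exported absolute path; both of those details are assumptions, not something tested here.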
* Re: Git on Windows, CRLF issues 2008-04-23 10:58 ` Johannes Sixt 2008-04-23 11:04 ` Jeff King @ 2008-04-23 20:02 ` Avery Pennarun 2008-04-24 6:25 ` Johannes Sixt 1 sibling, 1 reply; 30+ messages in thread From: Avery Pennarun @ 2008-04-23 20:02 UTC (permalink / raw) To: Johannes Sixt; +Cc: Jeff King, Johannes Schindelin, Peter Karlsson, git On 4/23/08, Johannes Sixt <j.sixt@viscovery.net> wrote: > In practice, this is not sufficient. The blob filter must have an > opportunity to decide what it wants to do, not just blindly munge every > blob. The minimum is a path name, e.g. in $1: Actually, it may not have been intentional, but because of the way 'eval' works, the munge script will find that $path already contains the path of the file being munged. Works for me. Have fun, Avery ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Git on Windows, CRLF issues 2008-04-23 20:02 ` Avery Pennarun @ 2008-04-24 6:25 ` Johannes Sixt 0 siblings, 0 replies; 30+ messages in thread From: Johannes Sixt @ 2008-04-24 6:25 UTC (permalink / raw) To: Avery Pennarun; +Cc: Jeff King, Johannes Schindelin, Peter Karlsson, git Avery Pennarun schrieb: > On 4/23/08, Johannes Sixt <j.sixt@viscovery.net> wrote: >> In practice, this is not sufficient. The blob filter must have an >> opportunity to decide what it wants to do, not just blindly munge every >> blob. The minimum is a path name, e.g. in $1: > > Actually, it may not have been intentional, but because of the way > 'eval' works, the munge script will find that $path already contains > the path of the file being munged. Works for me. Yes, of course! So I stand corrected, and Jeff's patch makes sense. For consistency's sake, the path should be made available in, say, GIT_BLOB_PATH just like the commit is available in GIT_COMMIT. -- Hannes ^ permalink raw reply [flat|nested] 30+ messages in thread
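The point about `eval` can be shown with a toy model that involves no git at all. This is a sketch, not the code from the patch: `run_blob_filter` is a hypothetical name, and `GIT_BLOB_PATH` is only the variable name proposed in the message above, not something git sets today.

```shell
# Toy model of the filter loop: the caller assigns $path and then evals the
# user-supplied filter string in the same shell, so the string can read $path.
run_blob_filter() {
    path=$1
    filter=$2
    # Proposed name from the thread; exporting it also makes the value
    # visible to child processes started by the filter.
    GIT_BLOB_PATH=$path
    export GIT_BLOB_PATH
    eval "$filter"
}
```

For example, `printf 'a\tb\n' | run_blob_filter foo.c 'case "$path" in *.c) expand -8;; *) cat;; esac'` takes the `expand` branch because the evaluated string sees the caller's `$path`. A plain shell variable is visible to the evaluated string but not to programs it launches, which is one practical reason an exported name could be useful.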
* Re: Git on Windows, CRLF issues 2008-04-21 21:53 ` Avery Pennarun 2008-04-22 2:39 ` Jeff King @ 2008-04-22 6:41 ` Johannes Sixt 1 sibling, 0 replies; 30+ messages in thread From: Johannes Sixt @ 2008-04-22 6:41 UTC (permalink / raw) To: Avery Pennarun; +Cc: Johannes Schindelin, Peter Karlsson, git Avery Pennarun schrieb: > Does anyone know the most efficient way to [convert CRLF] with > git-filter-branch, when there are already thousands of files in the > repo with CRLF in them? Running dos2unix on all the files for every > single revision could take a *very* long time. I chose to write a custom script. Otherwise, a file that stays the same throughout the history would still have been converted on each commit. My script converted each unique file only once, then reconstructed the tree objects and then changed the commits. In the end I don't think it paid off. It took me a week or so to convert the repo; I could just as well have let filter-branch run for a week. But I also have to mention that I did the CVS->git conversion a few times to get a suitable history, and I also repeated the CRLF conversion several times, and back then git-filter-branch did not exist in its current shape. -- Hannes ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Git on Windows, CRLF issues 2008-04-21 19:48 Git on Windows, CRLF issues Peter Karlsson 2008-04-21 20:07 ` Johannes Schindelin @ 2008-04-21 21:51 ` Jakub Narebski 2008-04-22 6:52 ` Peter Karlsson 2008-04-22 6:31 ` Johannes Sixt 2 siblings, 1 reply; 30+ messages in thread From: Jakub Narebski @ 2008-04-21 21:51 UTC (permalink / raw) To: Peter Karlsson; +Cc: git Peter Karlsson <peter@softwolves.pp.se> writes: > I have began moving old repositories for Windows-based software to > Git. Since the tool I am moving from stores everything with CRLF line > endings and have RCS-like keyword expansion, I'm treating it all as > binary data when exporting to Git, i.e I have CRLF in the checked-in > data (and I do want that, since this is Windows-only sources). > > Now the latests msysgit comes along and (finally!) sets core.autocrlf > to true by default. > > How do I handle this without having everyone breaking check-ins? I > can't require everyone to do unset core.autocrlf globally. Can I do > that with gitattributes? I think you can, by unsetting `crlf` attribute, i.e. putting the following in .gitattributes: * -crlf See gitattributes(5): `crlf` ^^^^^^ This attribute controls the line-ending convention. [...] Unset:: Unsetting the `crlf` attribute on a path is meant to mark the path as a "binary" file. The path never goes through line endings conversion upon checkin/checkout. Not tested! -- Jakub Narebski Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 30+ messages in thread
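Since the suggestion above is marked "Not tested", one way to check it is to ask git how it resolves the attribute for a given path. The sketch below assumes it is run inside a work tree and that overwriting any existing `.gitattributes` there is acceptable; `mark_all_binary` is a hypothetical name. In later Git releases the `crlf` attribute was superseded by `text`, with `crlf` kept as a backward-compatible alias, so the same check should also apply there.

```shell
# Hypothetical helper: write the attribute line suggested above into the
# current work tree (overwriting .gitattributes) and report how git
# resolves `crlf` for the given paths.
mark_all_binary() {
    printf '* -crlf\n' > .gitattributes
    git check-attr crlf -- "$@"
}
```

For a path such as `src/main.c` the expected report is that `crlf` is `unset`, meaning the path is exempt from line-ending conversion regardless of core.autocrlf.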
* Re: Git on Windows, CRLF issues 2008-04-21 21:51 ` Jakub Narebski @ 2008-04-22 6:52 ` Peter Karlsson 2008-04-22 9:04 ` Johannes Sixt 0 siblings, 1 reply; 30+ messages in thread From: Peter Karlsson @ 2008-04-22 6:52 UTC (permalink / raw) To: Jakub Narebski; +Cc: Git Mailing List Jakub Narebski: > I think you can, by unsetting `crlf` attribute, i.e. putting the > following in .gitattributes: > > * -crlf Yeah, that does indeed seem to work, no matter how core.autocrlf is configured globally. I think this is the best way to go for the repositories I am working on (as they are very much DOS/Windows-only). Does anyone know how to hack an existing repository so that I can add such a .gitattributes file to all commits? I've tried reading the git-filter-branch manual page a few times, but I am still confused by it. I guess I need some combination of "git filter-branch --tree-filter" and "git update-index --add". It doesn't matter much that all the commits are rewritten, as I am still the only one to have cloned them. -- \\// Peter - http://www.softwolves.pp.se/ ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Git on Windows, CRLF issues 2008-04-22 6:52 ` Peter Karlsson @ 2008-04-22 9:04 ` Johannes Sixt 0 siblings, 0 replies; 30+ messages in thread From: Johannes Sixt @ 2008-04-22 9:04 UTC (permalink / raw) To: Peter Karlsson; +Cc: Jakub Narebski, Git Mailing List Peter Karlsson schrieb: > Jakub Narebski: > >> I think you can, by unsetting `crlf` attribute, i.e. putting the >> following in .gitattributes: >> >> * -crlf > > Yeah, that does indeed seem to work, no matter how core.autocrlf is > configured globally. I think this is the best way to go for the > repositories I am working on (as they are very much DOS/Windows-only). > > Does anyone know how to hack an existing repository so that I can add > such a .gitattributes file to all commits? I've tried reading the > git-filter-branch manual page a few times, but I am still confused by > it. Something like (untested, using bash): X=$(echo "* -crlf" | git hash-object -w --stdin) git filter-branch \ --index-filter $'git-update-index --index-info <<< \ "100644 $X\t.gitattributes"' \ -- --all -- Hannes ^ permalink raw reply [flat|nested] 30+ messages in thread
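The command above is flagged as untested and relies on bash-only syntax (`$'...'` and `<<<`). A more portable sketch of the same idea splits it into two small helpers. The function names are hypothetical; the `mode SP type SP sha1 TAB path` input format is one of the forms `git update-index --index-info` accepts.

```shell
# Hypothetical helper: store the one-line .gitattributes as a blob and
# print its object name.
gitattributes_blob() {
    printf '* -crlf\n' | git hash-object -w --stdin
}

# Hypothetical helper: register that blob in the current index under the
# path .gitattributes. $1 is the object name printed by gitattributes_blob.
add_gitattributes_to_index() {
    printf '100644 blob %s\t.gitattributes\n' "$1" |
    git update-index --index-info
}
```

Applied to history, the second step would become the index filter, with the object name interpolated before filter-branch starts so that the filter does not rely on a shell variable being visible inside it:

    blob=$(gitattributes_blob)
    git filter-branch --index-filter "
        printf '100644 blob $blob\t.gitattributes\n' | git update-index --index-info
    " -- --all

As with the original suggestion, this rewrites every commit, so it is only appropriate while nobody else has cloned the repository.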
* Re: Git on Windows, CRLF issues 2008-04-21 19:48 Git on Windows, CRLF issues Peter Karlsson 2008-04-21 20:07 ` Johannes Schindelin 2008-04-21 21:51 ` Jakub Narebski @ 2008-04-22 6:31 ` Johannes Sixt 2008-04-22 8:42 ` Peter Karlsson 2 siblings, 1 reply; 30+ messages in thread From: Johannes Sixt @ 2008-04-22 6:31 UTC (permalink / raw) To: Peter Karlsson; +Cc: git Peter Karlsson schrieb: > I have began moving old repositories for Windows-based software to Git. > Since the tool I am moving from stores everything with CRLF line endings > and have RCS-like keyword expansion, I'm treating it all as binary data > when exporting to Git, i.e I have CRLF in the checked-in data (and I do > want that, since this is Windows-only sources). > > Now the latests msysgit comes along and (finally!) sets core.autocrlf to > true by default. > > How do I handle this without having everyone breaking check-ins? I can't > require everyone to do unset core.autocrlf globally. Can I do that with > gitattributes? I see 2 other options: 1. Create a custom setup of msysgit that has core.autocrlf set to false. 2. You are still converting repositories? Convert the files in your repository to LF. I did it like this, but it was a week -or more- worth of labor to get the scripts in a shape that I could reproduce the conversion (and it all happened before core.autocrlf even existed). -- Hannes ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Git on Windows, CRLF issues 2008-04-22 6:31 ` Johannes Sixt @ 2008-04-22 8:42 ` Peter Karlsson 0 siblings, 0 replies; 30+ messages in thread From: Peter Karlsson @ 2008-04-22 8:42 UTC (permalink / raw) To: Johannes Sixt; +Cc: git Johannes Sixt: > 2. You are still converting repositories? Convert the files in your > repository to LF. Or, perhaps, this is the way to go. Got to figure out how to get CRLF->LF conversion working without having RCS keyword expansion going haywire. I'm using RCS format as a middle-man between the old repositories (PVCS) I'm converting and parsecvs which imports them into Git. The old repository has expanded keywords, and I must avoid having RCS/CVS expand them as they would expand in a different manner... :-/ -- \\// Peter - http://www.softwolves.pp.se/ ^ permalink raw reply [flat|nested] 30+ messages in thread