git.vger.kernel.org archive mirror
* Non-ASCII paths and git-cvsserver
@ 2006-11-09 11:11 sf
  2006-11-10 18:59 ` Martin Langhoff
  2006-11-10 19:49 ` Junio C Hamano
  0 siblings, 2 replies; 11+ messages in thread
From: sf @ 2006-11-09 11:11 UTC (permalink / raw)
  To: git

Hello,

I want to access a git repository via git-cvsserver. The problem is that 
the repository contains paths with umlauts. These paths come out quoted 
and escaped when checked out with cvs.

Test case:

#! /bin/sh

set -e -u -x

WORK='/tmp/gittest'
FILE=$'\303\244'

mkdir "${WORK}"
mkdir "${WORK}/git"

#trap 'rm -r "${WORK}"' EXIT

cd "${WORK}/git"

git init-db
git repo-config gitcvs.enabled 1
git repo-config gitcvs.logfile "${WORK}/git/.git/cvslog.txt"

touch "${FILE}"
git add "${FILE}"
git commit -a -mx

cd "${WORK}"

CVS_SERVER='git-cvsserver'
export CVS_SERVER

cvs -d ":fork:${WORK}/git/.git" co master

ls master

### end


This is what I get:

+ WORK=/tmp/gittest
+ FILE=$'\303\244'
+ mkdir /tmp/gittest
+ mkdir /tmp/gittest/git
+ cd /tmp/gittest/git
+ git init-db
defaulting to local storage area
+ git repo-config gitcvs.enabled 1
+ git repo-config gitcvs.logfile /tmp/gittest/git/.git/cvslog.txt
+ touch $'\303\244'
+ git add $'\303\244'
+ git commit -a -mx
Committing initial tree 23d6145738bba135994775c19d6e8ae707d399ee
+ cd /tmp/gittest
+ CVS_SERVER=git-cvsserver
+ export CVS_SERVER
+ cvs -d :fork:/tmp/gittest/git/.git co master
cvs checkout: Updating master
U master/"\303\244"
+ ls master
"\303\244"  CVS


I do not speak perl so can anyone help?

Regards

Stephan


* Re: Non-ASCII paths and git-cvsserver
  2006-11-09 11:11 Non-ASCII paths and git-cvsserver sf
@ 2006-11-10 18:59 ` Martin Langhoff
  2006-11-10 19:49 ` Junio C Hamano
  1 sibling, 0 replies; 11+ messages in thread
From: Martin Langhoff @ 2006-11-10 18:59 UTC (permalink / raw)
  To: sf; +Cc: git

On 11/9/06, sf <sf@b-i-t.de> wrote:
> I want to access a git repository via git-cvsserver. The problem is that
> the repository contains paths with umlauts. These paths come out quoted
> and escaped when checked out with cvs.

Thanks for the detailed report! I am travelling right now, so with
"high latency" and on a machine that's missing sqlite libs :-/

But I'll give it a go anyway.

Does this mini-patch help? You'll need Perl 5.8.x and probably a
recent SQLite for this.

diff --git a/git-cvsserver.perl b/git-cvsserver.perl
index 8817f8b..c534de5 100755
--- a/git-cvsserver.perl
+++ b/git-cvsserver.perl
@@ -22,6 +22,9 @@ use Fcntl;
 use File::Temp qw/tempdir tempfile/;
 use File::Basename;

+binmode(STDIN,  ':utf8');
+binmode(STDOUT, ':utf8');
+
 my $log = GITCVS::log->new();
 my $cfg;

@@ -2104,6 +2107,11 @@ sub new
         $self->{tables}{$table} = 1;
     }

+    # this will set the encoding for new DBs
+    # or return false for existing DBs that are not
+    # utf-8
+    $self->{dbh}->do('PRAGMA encoding = "UTF-8"');
+
     # Construct the revision table if required
     unless ( $self->{tables}{revision} )


* Re: Non-ASCII paths and git-cvsserver
  2006-11-09 11:11 Non-ASCII paths and git-cvsserver sf
  2006-11-10 18:59 ` Martin Langhoff
@ 2006-11-10 19:49 ` Junio C Hamano
  2006-11-13 13:58   ` sf
  1 sibling, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2006-11-10 19:49 UTC (permalink / raw)
  To: sf; +Cc: git, Martin Langhoff

sf <sf@b-i-t.de> writes:

> I want to access a git repository via git-cvsserver. The problem is
> that the repository contains paths with umlauts. These paths come out
> quoted and escaped when checked out with cvs.

I think this is because the cvsserver invokes diff-tree and
ls-tree without -z, and the output from these commands quotes
non-ASCII letters as unsafe.
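
To illustrate the difference -z makes, here is a minimal Python sketch (not part of git-cvsserver, which is Perl) that splits NUL-terminated `git ls-tree -z` records. The function name `parse_ls_tree_z` and the sample bytes are assumptions made up for illustration; the sample is hand-written, not captured from git.

```python
def parse_ls_tree_z(data: bytes):
    """Split NUL-terminated ls-tree records; paths are kept as raw bytes."""
    entries = []
    for record in data.split(b"\0"):
        if not record:
            continue  # output ends with a NUL, so the final piece is empty
        # Each record is "<mode> <type> <hash>TAB<path>".
        meta, path = record.split(b"\t", 1)
        mode, objtype, objhash = meta.split(b" ")
        entries.append((mode.decode("ascii"), objtype.decode("ascii"),
                        objhash.decode("ascii"), path))
    return entries


# Hand-written sample record for a file whose name is the UTF-8 bytes of "ä".
sample = b"100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391\t\xc3\xa4\0"
entries = parse_ls_tree_z(sample)
print(entries[0][3])  # the path arrives as the two raw bytes, not as "\303\244"
```

Because the path is never quoted in this mode, whatever is stored downstream holds the real file name bytes rather than the escaped form.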

Martin's sqlite change is probably needed as well, but regardless
of that something like this patch is needed -- otherwise what
populates the sqlite database will be quoted to begin with, so it
would not help much.

I've tested with your reproduction recipe, but otherwise not
tested this patch.

-- >8 --

diff --git a/git-cvsserver.perl b/git-cvsserver.perl
index 8817f8b..ca519b7 100755
--- a/git-cvsserver.perl
+++ b/git-cvsserver.perl
@@ -2343,67 +2343,72 @@ sub update
 
         if ( defined ( $lastpicked ) )
         {
-            my $filepipe = open(FILELIST, '-|', 'git-diff-tree', '-r', $lastpicked, $commit->{hash}) or die("Cannot call git-diff-tree : $!");
+            my $filepipe = open(FILELIST, '-|', 'git-diff-tree', '-z', '-r', $lastpicked, $commit->{hash}) or die("Cannot call git-diff-tree : $!");
+	    local ($/) = "\0";
             while ( <FILELIST> )
             {
-                unless ( /^:\d{6}\s+\d{3}(\d)\d{2}\s+[a-zA-Z0-9]{40}\s+([a-zA-Z0-9]{40})\s+(\w)\s+(.*)$/o )
+		chomp;
+                unless ( /^:\d{6}\s+\d{3}(\d)\d{2}\s+[a-zA-Z0-9]{40}\s+([a-zA-Z0-9]{40})\s+(\w)$/o )
                 {
                     die("Couldn't process git-diff-tree line : $_");
                 }
+		my ($mode, $hash, $change) = ($1, $2, $3);
+		my $name = <FILELIST>;
+		chomp($name);
 
-                # $log->debug("File mode=$1, hash=$2, change=$3, name=$4");
+                # $log->debug("File mode=$mode, hash=$hash, change=$change, name=$name");
 
                 my $git_perms = "";
-                $git_perms .= "r" if ( $1 & 4 );
-                $git_perms .= "w" if ( $1 & 2 );
-                $git_perms .= "x" if ( $1 & 1 );
+                $git_perms .= "r" if ( $mode & 4 );
+                $git_perms .= "w" if ( $mode & 2 );
+                $git_perms .= "x" if ( $mode & 1 );
                 $git_perms = "rw" if ( $git_perms eq "" );
 
-                if ( $3 eq "D" )
+                if ( $change eq "D" )
                 {
-                    #$log->debug("DELETE   $4");
-                    $head->{$4} = {
-                        name => $4,
-                        revision => $head->{$4}{revision} + 1,
+                    #$log->debug("DELETE   $name");
+                    $head->{$name} = {
+                        name => $name,
+                        revision => $head->{$name}{revision} + 1,
                         filehash => "deleted",
                         commithash => $commit->{hash},
                         modified => $commit->{date},
                         author => $commit->{author},
                         mode => $git_perms,
                     };
-                    $self->insert_rev($4, $head->{$4}{revision}, $2, $commit->{hash}, $commit->{date}, $commit->{author}, $git_perms);
+                    $self->insert_rev($name, $head->{$name}{revision}, $hash, $commit->{hash}, $commit->{date}, $commit->{author}, $git_perms);
                 }
-                elsif ( $3 eq "M" )
+                elsif ( $change eq "M" )
                 {
-                    #$log->debug("MODIFIED $4");
-                    $head->{$4} = {
-                        name => $4,
-                        revision => $head->{$4}{revision} + 1,
-                        filehash => $2,
+                    #$log->debug("MODIFIED $name");
+                    $head->{$name} = {
+                        name => $name,
+                        revision => $head->{$name}{revision} + 1,
+                        filehash => $hash,
                         commithash => $commit->{hash},
                         modified => $commit->{date},
                         author => $commit->{author},
                         mode => $git_perms,
                     };
-                    $self->insert_rev($4, $head->{$4}{revision}, $2, $commit->{hash}, $commit->{date}, $commit->{author}, $git_perms);
+                    $self->insert_rev($name, $head->{$name}{revision}, $hash, $commit->{hash}, $commit->{date}, $commit->{author}, $git_perms);
                 }
-                elsif ( $3 eq "A" )
+                elsif ( $change eq "A" )
                 {
-                    #$log->debug("ADDED    $4");
-                    $head->{$4} = {
-                        name => $4,
+                    #$log->debug("ADDED    $name");
+                    $head->{$name} = {
+                        name => $name,
                         revision => 1,
-                        filehash => $2,
+                        filehash => $hash,
                         commithash => $commit->{hash},
                         modified => $commit->{date},
                         author => $commit->{author},
                         mode => $git_perms,
                     };
-                    $self->insert_rev($4, $head->{$4}{revision}, $2, $commit->{hash}, $commit->{date}, $commit->{author}, $git_perms);
+                    $self->insert_rev($name, $head->{$name}{revision}, $hash, $commit->{hash}, $commit->{date}, $commit->{author}, $git_perms);
                 }
                 else
                 {
-                    $log->warn("UNKNOWN FILE CHANGE mode=$1, hash=$2, change=$3, name=$4");
+                    $log->warn("UNKNOWN FILE CHANGE mode=$mode, hash=$hash, change=$change, name=$name");
                     die;
                 }
             }
@@ -2412,10 +2417,12 @@ sub update
             # this is used to detect files removed from the repo
             my $seen_files = {};
 
-            my $filepipe = open(FILELIST, '-|', 'git-ls-tree', '-r', $commit->{hash}) or die("Cannot call git-ls-tree : $!");
+            my $filepipe = open(FILELIST, '-|', 'git-ls-tree', '-z', '-r', $commit->{hash}) or die("Cannot call git-ls-tree : $!");
+	    local $/ = "\0";
             while ( <FILELIST> )
             {
-                unless ( /^(\d+)\s+(\w+)\s+([a-zA-Z0-9]+)\s+(.*)$/o )
+		chomp;
+                unless ( /^(\d+)\s+(\w+)\s+([a-zA-Z0-9]+)\t(.*)$/o )
                 {
                     die("Couldn't process git-ls-tree line : $_");
                 }


* Re: Non-ASCII paths and git-cvsserver
  2006-11-10 19:49 ` Junio C Hamano
@ 2006-11-13 13:58   ` sf
  2006-11-13 14:20     ` Jakub Narebski
  2006-11-13 18:22     ` Martin Langhoff
  0 siblings, 2 replies; 11+ messages in thread
From: sf @ 2006-11-13 13:58 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Martin Langhoff

Junio C Hamano wrote:
> sf <sf@b-i-t.de> writes:
> 
>> I want to access a git repository via git-cvsserver. The problem is
>> that the repository contains paths with umlauts. These paths come out
>> quoted and escaped when checked out with cvs.
> 
> I think this is because the cvsserver invokes diff-tree and
> ls-tree without -z, and the output from these commands quotes
> non-ASCII letters as unsafe.

I knew I had seen that kind of quoting before but right then I thought 
it was related to Perl or SQLite.

> Martin's sqlite change is probably needed as well, but regardless
> of that something like this patch is needed -- otherwise what
> populates the sqlite database will be quoted to begin with, so it
> would not help much.

Martin, are you sure your patch is needed? (see below)

> I've tested with your reproduction recipe, but otherwise not
> tested this patch.

Thanks, Junio. Paths with umlauts are returned correctly now both in 
UTF-8 and ISO-8859-1. I guess git-cvsserver is now as encoding agnostic 
as git core.

Regards



* Re: Non-ASCII paths and git-cvsserver
  2006-11-13 13:58   ` sf
@ 2006-11-13 14:20     ` Jakub Narebski
  2006-11-13 18:30       ` Robin Rosenberg
  2006-11-13 18:22     ` Martin Langhoff
  1 sibling, 1 reply; 11+ messages in thread
From: Jakub Narebski @ 2006-11-13 14:20 UTC (permalink / raw)
  To: git

sf wrote:

> Thanks, Junio. Paths with umlauts are returned correctly now both in 
> UTF-8 and ISO-8859-1. I guess git-cvsserver is now as encoding agnostic 
> as git core.

By the way, now that git has a per-user config file, ~/.gitconfig, perhaps
it is time to add an i18n.filesystemEncoding configuration variable, to
automatically convert between the filesystem encoding (something you usually
don't have any control over) and the UTF-8 encoding of paths in tree objects.
-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git



* Re: Non-ASCII paths and git-cvsserver
  2006-11-13 13:58   ` sf
  2006-11-13 14:20     ` Jakub Narebski
@ 2006-11-13 18:22     ` Martin Langhoff
  2006-11-14 10:40       ` sf
  1 sibling, 1 reply; 11+ messages in thread
From: Martin Langhoff @ 2006-11-13 18:22 UTC (permalink / raw)
  To: sf; +Cc: Junio C Hamano, git

On 11/13/06, sf <sf@b-i-t.de> wrote:
> Martin, are you sure your patch is needed? (see below)

Not 100% sure. I was just making sure we crossed all the Ts and dotted
the Is. I gather you have tried my patch and it didn't make any
difference. What SQLite and Perl versions are you using?

cheers,





* Re: Non-ASCII paths and git-cvsserver
  2006-11-13 14:20     ` Jakub Narebski
@ 2006-11-13 18:30       ` Robin Rosenberg
  2006-11-13 18:57         ` Jakub Narebski
  2006-11-13 19:48         ` Junio C Hamano
  0 siblings, 2 replies; 11+ messages in thread
From: Robin Rosenberg @ 2006-11-13 18:30 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

On Monday 13 November 2006 15:20, Jakub Narebski wrote:
> sf wrote:
> > Thanks, Junio. Paths with umlauts are returned correctly now both in
> > UTF-8 and ISO-8859-1. I guess git-cvsserver is now as encoding agnostic
> > as git core.
>
> By the way, now that git has per user config file, ~/.gitconfig, perhaps
> it is time to add i18n.filesystemEncoding configuration variable, to
> automatically convert between filesystem encoding (something you usually
> don't have any control over) and UTF-8 encoding of paths in tree objects.

I'd prefer git to store filenames and comments in UTF-8 and convert on 
input/output when and if it is necessary rather than forcing everybody to 
take the hit. Most systems, but far from all, already use UTF-8 so it's a 
noop for them. The only reason I want conversion is for the years to come 
where we still live in two worlds of non-utf-8 and utf-8 and then forget 
about everything non-utf-8, rather than carry around the baggage forever.



* Re: Non-ASCII paths and git-cvsserver
  2006-11-13 18:30       ` Robin Rosenberg
@ 2006-11-13 18:57         ` Jakub Narebski
  2006-11-13 21:41           ` Robin Rosenberg
  2006-11-13 19:48         ` Junio C Hamano
  1 sibling, 1 reply; 11+ messages in thread
From: Jakub Narebski @ 2006-11-13 18:57 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: git

On Monday 13 November 2006 19:30, Robin Rosenberg wrote:
> On Monday 13 November 2006 15:20, Jakub Narebski wrote:
>> sf wrote:
>>> Thanks, Junio. Paths with umlauts are returned correctly now both in
>>> UTF-8 and ISO-8859-1. I guess git-cvsserver is now as encoding agnostic
>>> as git core.
>>
>> By the way, now that git has per user config file, ~/.gitconfig, perhaps
>> it is time to add i18n.filesystemEncoding configuration variable, to
>> automatically convert between filesystem encoding (something you usually
>> don't have any control over) and UTF-8 encoding of paths in tree objects.
> 
> I'd prefer git to store filenames and comments in UTF-8 and convert on 
> input/output when and if it is necessary rather than forcing everybody to 
> take the hit. Most systems, but far from all, already use UTF-8 so it's a 
> noop for them. The only reason I want conversion is for the years to come 
> where we still live in two worlds of non-utf-8 and utf-8 and then forget 
> about everything non-utf-8, rather than carry around the baggage forever.

That was my idea: to have an i18n.filesystemEncoding configuration variable
to convert between the filesystem encoding (which is usually something you don't
have control over, and which varies from place to place, but not from
repository to repository) and the UTF-8 encoding in which git would store filenames.

-- 
Jakub Narebski


* Re: Non-ASCII paths and git-cvsserver
  2006-11-13 18:30       ` Robin Rosenberg
  2006-11-13 18:57         ` Jakub Narebski
@ 2006-11-13 19:48         ` Junio C Hamano
  1 sibling, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2006-11-13 19:48 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: git

Robin Rosenberg <robin.rosenberg.lists@dewire.com> writes:

> On Monday 13 November 2006 15:20, Jakub Narebski wrote:
>> sf wrote:
>> > Thanks, Junio. Paths with umlauts are returned correctly now both in
>> > UTF-8 and ISO-8859-1. I guess git-cvsserver is now as encoding agnostic
>> > as git core.
>>
>> By the way, now that git has per user config file, ~/.gitconfig, perhaps
>> it is time to add i18n.filesystemEncoding configuration variable, to
>> automatically convert between filesystem encoding (something you usually
>> don't have any control over) and UTF-8 encoding of paths in tree objects.
>
> I'd prefer git to store filenames and comments in UTF-8 and convert on 
> input/output when and if it is necessary rather than forcing everybody to 
> take the hit. Most systems, but far from all, already use UTF-8 so it's a 
> noop for them. The only reason I want conversion is for the years to come 
> where we still live in two worlds of non-utf-8 and utf-8 and then forget 
> about everything non-utf-8, rather than carry around the baggage forever.

Pathnames in git core are encoding agnostic just like UNIX
pathnames are.  As you say, if the project convention is UTF-8
then it would not make any difference either way, so the status
quo is fine for people living in a UTF-8-only world.
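
As a small illustration of what "encoding agnostic" means at the byte level, the following Python sketch compares the same visible name in two encodings. The helper `is_valid_utf8` is a hypothetical name introduced here, not something from git.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


name_utf8 = "ä".encode("utf-8")      # two bytes: 0xC3 0xA4
name_latin1 = "ä".encode("latin-1")  # one byte:  0xE4

# To a byte-oriented tool these are simply two different pathnames.
print(name_utf8 == name_latin1)
# The Latin-1 form is not valid UTF-8, so it cannot be stored as UTF-8
# without a conversion step.
print(is_valid_utf8(name_utf8), is_valid_utf8(name_latin1))
```

A tool that treats pathnames as opaque bytes handles both names without needing to know which encoding produced them.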

To people for whom it is inconvenient to work with UTF-8,
including me, it is always wrong to record UTF-8 at the core
level and try to autoconvert.  If (non-git) tools, libraries and
legacy-to-unicode roundtrip conversion were perfect, we would
have already converted and be living in a UTF-8-only world.  Projects
that choose to run with legacy pathname encoding should be
allowed to do so without taking the roundtrip risk of converting to
and from UTF-8.

Interestingly enough, Linus mentioned this once, a lot better
than I would have, here:

http://thread.gmane.org/gmane.comp.version-control.git/12240/focus=12279

Having said that, I am not opposed to having an option to make the
external interface do the pathname conversion.  If your
project chooses to use euc-jp for commit messages, your
configuration variable i18n.commitencoding is set to euc-jp, and
if gitweb always wants to do its thing in utf-8 (which is
probably a sensible thing to do), it would make a lot of sense
to take the commit message and convert it from euc-jp to utf-8
before rendering it in HTML.  Maybe i18n.pathnameencoding could
be used for similar purposes for external interfaces.
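
The conversion at the external interface described above could look roughly like the following Python sketch. It is only an illustration: gitweb itself is Perl, and the function name `render_commit_message` and the sample message are assumptions invented here.

```python
import html


def render_commit_message(raw: bytes, commit_encoding: str) -> str:
    """Decode a stored commit message and escape it for a UTF-8 HTML page."""
    # commit_encoding stands in for the value of i18n.commitencoding.
    text = raw.decode(commit_encoding)
    return html.escape(text)


# Hypothetical commit message stored in euc-jp.
stored = "日本語 <fix>".encode("euc_jp")
fragment = render_commit_message(stored, "euc_jp")

# The page would then be emitted as UTF-8 bytes.
print(fragment.encode("utf-8"))
```

The stored bytes are left untouched; only the copy handed to the web page is converted.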

But the core will stay encoding agnostic; pathnames stored in
the index and tree are what you can feed stat() and open(), and
what you read from readdir().  Maybe we could revisit this
decision in five years, but not now.



* Re: Non-ASCII paths and git-cvsserver
  2006-11-13 18:57         ` Jakub Narebski
@ 2006-11-13 21:41           ` Robin Rosenberg
  0 siblings, 0 replies; 11+ messages in thread
From: Robin Rosenberg @ 2006-11-13 21:41 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

On Monday 13 November 2006 19:57, Jakub Narebski wrote:
> That was my idea: to have an i18n.filesystemEncoding configuration variable
> to convert between the filesystem encoding (which is usually something you
> don't have control over, and which varies from place to place, but not
> from repository to repository) and the UTF-8 encoding in which git would
> store filenames.

Yes, I know.



* Re: Non-ASCII paths and git-cvsserver
  2006-11-13 18:22     ` Martin Langhoff
@ 2006-11-14 10:40       ` sf
  0 siblings, 0 replies; 11+ messages in thread
From: sf @ 2006-11-14 10:40 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Junio C Hamano, git

Martin Langhoff wrote:
> On 11/13/06, sf <sf@b-i-t.de> wrote:
>> Martin, are you sure your patch is needed? (see below)
> 
> Not 100% sure. I was just making sure we crossed all the Ts and dotted
> the Is. I gather you have tried my patch and it didn't make any
> difference. What SQLite and Perl versions are you using?

Your patch did make a difference but the outcome is not good:

+ WORK=/tmp/gittest
+ FILE=$'\303\244'
+ mkdir /tmp/gittest
+ mkdir /tmp/gittest/git
+ cd /tmp/gittest/git
+ git init-db
defaulting to local storage area
+ git repo-config gitcvs.enabled 1
+ git repo-config gitcvs.logfile /tmp/gittest/git/.git/cvslog.txt
+ touch $'\303\244'
+ git add $'\303\244'
+ git commit -a -mx
Committing initial tree 23d6145738bba135994775c19d6e8ae707d399ee
+ cd /tmp/gittest
+ CVS_SERVER=git-cvsserver
+ export CVS_SERVER
+ cvs -d :fork:/tmp/gittest/git/.git co master
cvs checkout: Updating master
U master/ä
+ ls master
ä  CVS


The pathname has been UTF-8 encoded _twice_!
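
For readers unfamiliar with double encoding, this Python sketch shows one way it can arise. It assumes the file name bytes were treated as Latin-1 characters and then encoded to UTF-8 again; that is a plausible mechanism for a `:utf8` output layer applied to a byte string, but it is an assumption, not something confirmed in this thread.

```python
# Bytes of the file name as stored in the repository (UTF-8 for "ä").
name_bytes = b"\xc3\xa4"

# Assumed failure mode: each byte is read as one Latin-1 character ...
misread = name_bytes.decode("latin-1")    # two characters: U+00C3, U+00A4
# ... and the resulting text is encoded to UTF-8 a second time.
double_encoded = misread.encode("utf-8")

print(double_encoded)                     # four bytes instead of two
print(len(double_encoded.decode("utf-8")))  # two characters instead of one
```

Under that assumption the damage is reversible: decoding as UTF-8 and re-encoding as Latin-1 gives back the original two bytes.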

Perl's version is 5.8.8. How do I get the version of SQLite? Do you mean 
DBD-SQLite-1.11?

Regards

Stephan

