[RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.

* [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
@ 2009-03-02  8:47 Peter Krefting
  2009-03-02 10:30 ` Johannes Sixt
                   ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Peter Krefting @ 2009-03-02  8:47 UTC (permalink / raw)
  To: git

When opening a file through open() or fopen(), the path passed is
UTF-8 encoded. To handle this on Windows, we need to convert the
path string to UTF-16 and use the Unicode-based interface.
---
Windows does support file names using arbitrary Unicode characters; you just 
need to use its wchar_t interfaces instead of the char ones (the char ones 
just get converted into wchar_t on the API level anyway, for the same 
reasons). This is the beginning of support for UTF-8 file names in Git on 
Windows.

Since there is no real file system abstraction beyond using stdio (AFAIK), I 
need to hack it by replacing fopen (and open). Probably opendir/readdir as 
well (might be trickier), and possibly even hack around main() to parse the 
wchar_t command-line instead of the char copy.

This will lose all chances of Windows 9x compatibility, but I don't know if 
there are any attempts at supporting it anyway?

Please note that MultiByteToWideChar() will reject any invalid UTF-8 
strings; perhaps it should just fall back to a regular open()/fopen() in 
that case?

No Signed-Off line since this is unfinished, just presenting rough sketches 
of an idea.

  compat/mingw.c |   60 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
  compat/mingw.h |    3 ++
  2 files changed, 62 insertions(+), 1 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index e25cb4f..8b19b80 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -9,13 +9,30 @@ int mingw_open (const char *filename, int oflags, ...)
  {
  	va_list args;
  	unsigned mode;
+	wchar_t *unicode_filename;
+	int unicode_filename_len;
  	va_start(args, oflags);
  	mode = va_arg(args, int);
  	va_end(args);

  	if (!strcmp(filename, "/dev/null"))
  		filename = "nul";
-	int fd = open(filename, oflags, mode);
+
+	unicode_filename_len = MultiByteToWideChar(CP_UTF8, 0, filename, -1, NULL, 0);
+	if (0 == unicode_filename_len) {
+		errno = EINVAL;
+		return -1;
+	};
+
+	unicode_filename = xmalloc(unicode_filename_len * sizeof (wchar_t));
+	if (NULL == unicode_filename) {
+		errno = ENOMEM;
+		return -1;
+	}
+	MultiByteToWideChar(CP_UTF8, 0, filename, -1, unicode_filename, unicode_filename_len);
+	int fd = _wopen(unicode_filename, oflags, mode);
+	free(unicode_filename);
+
  	if (fd < 0 && (oflags & O_CREAT) && errno == EACCES) {
  		DWORD attrs = GetFileAttributes(filename);
  		if (attrs != INVALID_FILE_ATTRIBUTES && (attrs & FILE_ATTRIBUTE_DIRECTORY))
@@ -24,6 +41,47 @@ int mingw_open (const char *filename, int oflags, ...)
  	return fd;
  }

+FILE *mingw_fopen (const char *filename, const char *mode)
+{
+	wchar_t *unicode_filename, *unicode_mode;
+	int unicode_filename_len, unicode_mode_len;
+	FILE *fh;
+
+	unicode_filename_len = MultiByteToWideChar(CP_UTF8, 0, filename, -1, NULL, 0);
+	if (0 == unicode_filename_len) {
+		errno = EINVAL;
+		return NULL;
+	};
+
+	unicode_filename = xmalloc(unicode_filename_len * sizeof (wchar_t));
+	if (NULL == unicode_filename) {
+		errno = ENOMEM;
+		return NULL;
+	}
+	MultiByteToWideChar(CP_UTF8, 0, filename, -1, unicode_filename, unicode_filename_len);
+
+	unicode_mode_len = MultiByteToWideChar(CP_UTF8, 0, mode, -1, NULL, 0);
+	if (0 == unicode_mode_len) {
+		free(unicode_filename);
+		errno = EINVAL;
+		return NULL;
+	};
+
+	unicode_mode = xmalloc(unicode_mode_len * sizeof (wchar_t));
+	if (NULL == unicode_mode) {
+		free(unicode_filename);
+		errno = ENOMEM;
+		return NULL;
+	}
+	MultiByteToWideChar(CP_UTF8, 0, mode, -1, unicode_mode, unicode_mode_len);
+
+	fh = _wfopen(unicode_filename, unicode_mode);
+	free(unicode_filename);
+	free(unicode_mode);
+
+	return fh;
+}
+
  static inline time_t filetime_to_time_t(const FILETIME *ft)
  {
  	long long winTime = ((long long)ft->dwHighDateTime << 32) + ft->dwLowDateTime;
diff --git a/compat/mingw.h b/compat/mingw.h
index 4f275cb..235df0a 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -142,6 +142,9 @@ int sigaction(int sig, struct sigaction *in, struct sigaction *out);
  int mingw_open (const char *filename, int oflags, ...);
  #define open mingw_open

+FILE *mingw_fopen (const char *filename, const char *mode);
+#define fopen mingw_fopen
+
  char *mingw_getcwd(char *pointer, int len);
  #define getcwd mingw_getcwd

-- 
1.6.0.2.1172.ga5ed0
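
The patch above calls MultiByteToWideChar() twice: once with a NULL buffer to
learn the required size, then again to convert. A minimal portable sketch of
that size-then-convert pattern follows. It is not the Win32 implementation;
utf8_decode() and utf8_to_utf16() are hypothetical helper names, and uint16_t
stands in for the 16-bit wchar_t used on Windows. Unlike the Win32 call with
flags set to 0, this sketch always rejects ill-formed input.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence starting at s. On success, store the code
 * point in *cp and return the number of bytes consumed (1..4). Return 0
 * for ill-formed input: bad lead byte, truncated or overlong sequence,
 * surrogate, or a value above U+10FFFF.
 */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
	static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t len, i;
	uint32_t c;

	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		len = 2; c = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		len = 3; c = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		len = 4; c = s[0] & 0x07;
	} else {
		return 0;
	}

	for (i = 1; i < len; i++) {
		/* A NUL terminator fails this test too, so we never overrun. */
		if ((s[i] & 0xC0) != 0x80)
			return 0;
		c = (c << 6) | (s[i] & 0x3F);
	}
	if (c < min_for_len[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 0;
	*cp = c;
	return len;
}

/*
 * Convert NUL-terminated UTF-8 to UTF-16. With out == NULL only the
 * required size is computed. Returns the number of 16-bit units
 * including the terminator, or 0 if the input is ill-formed or does
 * not fit into cap units.
 */
size_t utf8_to_utf16(const char *in, uint16_t *out, size_t cap)
{
	const unsigned char *s = (const unsigned char *)in;
	size_t n = 0;

	while (*s) {
		uint32_t cp;
		size_t units, used = utf8_decode(s, &cp);

		if (!used)
			return 0;
		units = cp >= 0x10000 ? 2 : 1;
		if (out) {
			if (n + units >= cap)	/* keep room for the terminator */
				return 0;
			if (units == 1) {
				out[n] = (uint16_t)cp;
			} else {
				cp -= 0x10000;
				out[n] = (uint16_t)(0xD800 | (cp >> 10));
				out[n + 1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
			}
		}
		n += units;
		s += used;
	}
	if (out) {
		if (n >= cap)
			return 0;
		out[n] = 0;
	}
	return n + 1;
}
```

A caller would first call utf8_to_utf16(name, NULL, 0), allocate that many
units, and then call it again with the buffer. A return value of 0 is the
point where a fallback to the plain char* open() could be considered.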

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-02  8:47 [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded Peter Krefting
@ 2009-03-02 10:30 ` Johannes Sixt
  2009-03-02 10:46   ` Peter Krefting
  2009-03-03  9:43 ` Dmitry Potapov
  2009-03-07 10:38 ` Robin Rosenberg
  2 siblings, 1 reply; 33+ messages in thread
From: Johannes Sixt @ 2009-03-02 10:30 UTC (permalink / raw)
  To: Peter Krefting; +Cc: git

Peter Krefting wrote:
> When opening a file through open() or fopen(), the path passed is
> UTF-8 encoded.

I don't think that this assumption is valid. Whenever the Windows API has
to convert between Unicode strings and char* strings, it uses the current
"ANSI code page". As far as I know, the UTF-8 codepage (65001) cannot be
used as the "current ANSI code page". Users will always have some code
page set that is not UTF-8.

For example, if the user specifies a file name on the command line, then
it will not enter git in UTF-8, but in the current "ANSI" or "OEM code
page" encoding. If git prints a file name under the assumption that it is
UTF-8 encoded, then it will be displayed incorrectly because the system
uses a different encoding.

> Since there is no real file system abstraction beyond using stdio
> (AFAIK), I need to hack it by replacing fopen (and open). Probably
> opendir/readdir as well (might be trickier), and possibly even hack
> around main() to parse the wchar_t command-line instead of the char copy.

I think you are grossly underestimating the venture that you want to
undertake here.

Please come up with a plan how you are going to deal with the various
issues. File names enter and leave the system through different channels:

- the command line and terminal window
- object database (tree objects)
- opendir/readdir; opening files or directories for reading or writing

And there is probably some more... How do you treat encodings in these
channels? What if the file names are not valid UTF-8? Etc.

The biggest obstacle will be that git does not have a notion of "file name
encoding" - it simply treats a file name as a stream of bytes. There is no
place to write an encoding. If the byte streams are regarded as having an
encoding, then you can have ambiguities, mixed encodings, or invalid
characters. You would have to deal with this in some way.

> This will lose all chances of Windows 9x compatibility, but I don't know
> if there are any attempts at supporting it anyway?

Windows 9x is already out of the loop. We use GetFileInformationByHandle()
that is only available since Windows 2000.

-- Hannes
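
The mismatch described above can be made concrete with a small sketch. It
assumes ISO-8859-1 as a stand-in for the "ANSI code page" (real Windows code
pages such as 1252 differ in the 0x80..0x9F range), and latin1_to_utf8() is a
hypothetical helper, not part of git. The same name "Ü.txt" is five bytes in
one encoding and six in the other, so bytes produced under one assumption are
misread under the other.

```c
#include <stddef.h>

/*
 * Convert a NUL-terminated ISO-8859-1 string to UTF-8. Every byte below
 * 0x80 is copied as-is; every other byte becomes a two-byte sequence.
 * Returns the number of bytes written (excluding the terminator), or
 * (size_t)-1 if the result does not fit into cap bytes.
 */
size_t latin1_to_utf8(const char *in, char *out, size_t cap)
{
	const unsigned char *s = (const unsigned char *)in;
	size_t n = 0;

	for (; *s; s++) {
		size_t need = *s < 0x80 ? 1 : 2;

		if (n + need >= cap)	/* keep room for the terminator */
			return (size_t)-1;
		if (need == 1) {
			out[n++] = (char)*s;
		} else {
			out[n++] = (char)(0xC0 | (*s >> 6));
			out[n++] = (char)(0x80 | (*s & 0x3F));
		}
	}
	if (n >= cap)
		return (size_t)-1;
	out[n] = '\0';
	return n;
}
```

In ISO-8859-1, "Ü" is the single byte 0xDC; in UTF-8 it is 0xC3 0x9C. A
terminal that expects the local code page would show those two bytes as two
separate characters.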

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-02 10:30 ` Johannes Sixt
@ 2009-03-02 10:46   ` Peter Krefting
  2009-03-02 10:56     ` Johannes Schindelin
  2009-03-02 12:34     ` Johannes Sixt
  0 siblings, 2 replies; 33+ messages in thread
From: Peter Krefting @ 2009-03-02 10:46 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git

Johannes Sixt:

> I don't think that this assumption is valid.

Depends on where you are coming from. For the files stored in the Git 
repositories, I believe all file names are supposed to be UTF-8 encoded 
(just like commit messages and user names are). That's the assumption I 
started working from.

> Users will always have some code page set that is not UTF-8.

Indeed. And as long as the char-pointer interfaces in stdio and elsewhere 
work on that assumption, we have a problem.

> For example, if the user specifies a file name on the command line, then
> it will not enter git in UTF-8, but in the current "ANSI" or "OEM code
> page" encoding.

That problem is already solved as we do have a wchar_t command line 
available. If you pass a file name that is not representable in the current 
"ANSI" codepage on the command line, it will come out as garbage in the 
char* version, but will be correct in the wchar_t* version. Thus we need to 
convert that to utf-8 and use that instead.

> If git prints a file name under the assumption that it is UTF-8 encoded, 
> then it will be displayed incorrectly because the system uses a different 
> encoding.

Here setting the local codepage to UTF-8 *might* work, although I haven't 
tested that. Or always use the wchar_t versions of printf and friends.

> I think you are grossly underestimating the venture that you want to 
> undertake here.

I've done this before with other software, so, yes, I know it is quite a big 
undertaking. That is also why I started out with a minimal RFC patch to see 
if there was any interest in working with this.

> Please come up with a plan how you are going to deal with the various
> issues. File names enter and leave the system through different channels:
>
> - the command line and terminal window

GetCommandLineW() as described above.

> - object database (tree objects)

Those file names are supposedly always UTF-8.

> - opendir/readdir; opening files or directories for reading or writing

Wrap file open and directory read to use the wchar_t versions, converting 
that to UTF-8 strings at the API level.

> And there is probably some more... How do you treat encodings in these 
> channels? What if the file names are not valid UTF-8? Etc.

Ill-formed UTF-8 should just be rejected. Invalid UTF-8 is worse. I'm not 
sure what the Linux version does, when running in a UTF-8 locale. Does it 
allow ill-formed or illegal UTF-8 sequences?

NTFS allows almost any sequence of wchar_t's, it doesn't even have to be 
valid UTF-16.

> The biggest obstacle will be that git does not have a notion of "file name 
> encoding" - it simply treats a file name as a stream of bytes.

Yeah, that is one of the major bugs in its design, IMHO. But almost everyone 
seems to assume that file names are UTF-8 strings anyway, so in the absence 
of any other information, it's as good an assumption as any to make.

> If the byte streams are regarded as having an encoding, then you can have 
> ambiguities, mixed encodings, or invalid characters. You would have to 
> deal with this in some way.

We already see problems with file names that cannot properly be 
represented on some file systems (case-only differences in the Linux kernel 
when checked out on Windows; Mac OS' built-in Unicode normalization of file 
names, etc.), so this would not be an entirely new kind of problem.

> Windows 9x is already out of the loop.

Good.

-- 
\\// Peter - http://www.softwolves.pp.se/
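
Rejecting ill-formed UTF-8, as suggested above, needs a strict check. The
following is a minimal sketch based on the well-formed byte ranges in the
Unicode Standard; utf8_is_well_formed() is a hypothetical name and not
something git provides. It rejects overlong forms, encoded surrogates, and
values above U+10FFFF.

```c
#include <stddef.h>

/*
 * Return 1 if the len bytes at s form well-formed UTF-8, else 0.
 * The second byte of a sequence has a restricted range for some lead
 * bytes; that is what excludes overlong forms, surrogates (ED A0..BF)
 * and code points above U+10FFFF (F4 90..BF).
 */
int utf8_is_well_formed(const unsigned char *s, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char b = s[i];
		unsigned char lo = 0x80, hi = 0xBF;
		size_t extra, k;

		if (b < 0x80) {
			i++;
			continue;
		} else if (b >= 0xC2 && b <= 0xDF) {
			extra = 1;
		} else if (b >= 0xE0 && b <= 0xEF) {
			extra = 2;
			if (b == 0xE0)
				lo = 0xA0;	/* no overlong 3-byte forms */
			if (b == 0xED)
				hi = 0x9F;	/* no UTF-16 surrogates */
		} else if (b >= 0xF0 && b <= 0xF4) {
			extra = 3;
			if (b == 0xF0)
				lo = 0x90;	/* no overlong 4-byte forms */
			if (b == 0xF4)
				hi = 0x8F;	/* nothing above U+10FFFF */
		} else {
			return 0;		/* 80..C1 and F5..FF never lead */
		}

		if (len - i <= extra)
			return 0;		/* truncated sequence */
		if (s[i + 1] < lo || s[i + 1] > hi)
			return 0;
		for (k = 2; k <= extra; k++)
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;
		i += extra + 1;
	}
	return 1;
}
```

A file name stored in ISO-8859-1, such as the bytes DC 2E 74 78 74 for
"Ü.txt", fails this check, which is the case any fallback policy would have
to handle.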

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-02 10:46   ` Peter Krefting
@ 2009-03-02 10:56     ` Johannes Schindelin
  2009-03-02 12:03       ` Peter Krefting
  2009-03-02 12:34     ` Johannes Sixt
  1 sibling, 1 reply; 33+ messages in thread
From: Johannes Schindelin @ 2009-03-02 10:56 UTC (permalink / raw)
  To: Peter Krefting; +Cc: Johannes Sixt, git

Hi,

On Mon, 2 Mar 2009, Peter Krefting wrote:

> Johannes Sixt:
> 
> > I don't think that this assumption is valid.
> 
> Depends on where you are coming from. For the files stored in the Git 
> repositories, I believe all file names are supposed to be UTF-8 encoded 
> (just like commit messages and user names are). That's the assumption I 
> started working from.

No.  As far as Git is concerned, the file names are just as much blobs as 
the file contents.

The fact that Windows messes with this notion just as it messes with the 
file contents (think the endless story whose name is CR/LF) shows only how 
"well" designed the concepts in Windows are.

And as it stands, we have at least two issues on the msysGit issue tracker 
that complain that Git does not work with localized file names properly.

So no, file names are not UTF-8 at all, especially not on Windows.

Do not get me wrong, I really welcome you taking care of the issue, but I 
do not think that forcing UTF-8 is a solution.

Thanks & sorry,
Dscho

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-02 10:56     ` Johannes Schindelin
@ 2009-03-02 12:03       ` Peter Krefting
       [not found]         ` <a2633edd0903020512u5682e9am203f0faccd0acf6a@mail.gmail.com>
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Krefting @ 2009-03-02 12:03 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Johannes Sixt, git

Johannes Schindelin:

> No.  As far as Git is concerned, the file names are just as much blobs as 
> the file contents.

I've struggled with the same problems on Linux before, since its file 
systems don't have the concept of characters, either. I guess it's just 
design principles, but as far as I am concerned, having file names be 
constructed from characters makes a lot more sense than having them 
constructed from bytes.

Git does the right thing in assuming commit messages and user names to be 
UTF-8 characters, though; it would have been nice to have file names covered 
by the same constraints.

> The fact that Windows messes with this notion just as it messes with the 
> file contents (think the endless story whose name is CR/LF) shows only how 
> "well" designed the concepts in Windows are.

In this case, yes, Windows' way of doing it does make more sense, at least to 
me. And as far as text files are concerned, treating text as sequences of 
bytes is in most cases not a very smart thing to do, either, but it's hard 
not to given how most computers are constructed.

> And as it stands, we have at least two issues on the msysGit issue tracker 
> that complain that Git does not work with localized file names properly.
>
> So no, file names are not UTF-8 at all, especially not on Windows.

I am not trying to make file names *on Windows* be UTF-8. I am trying to 
make file names on Windows be Windows file names, i.e. UTF-16 Unicode. It's 
just that since Git internally uses the char* APIs, and from what I have 
seen in most other cases assumes that char* text is UTF-8, I am trying to 
convert from Windows' view of path names to Git's (UTF-16 to UTF-8) and back.

The other way would be to keep the char* APIs but convert to the Windows 
locale encoding ("ANSI codepage"), but that will break horribly as not all 
file names that can be used on a file system can be represented as such. 
Plus, every call to a Windows API using a char* path name *is* converted to 
UTF-16 anyway, since that is what is used internally in the Windows NT 
subsystems.

> Do not get me wrong, I really welcome you taking care of the issue, but I
> do not think that forcing UTF-8 is a solution.

Some kind of handling of Git repositories where file names are not UTF-8 
would probably need to be added, yes.

-- 
\\// Peter - http://www.softwolves.pp.se/
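
The conversion from Windows' view of a path name (UTF-16) to Git's (UTF-8)
can be sketched in portable C as follows. This is an illustration only:
utf16_to_utf8() is a hypothetical helper, uint16_t stands in for the Windows
wchar_t, and the choice to fail on unpaired surrogates is an assumption, since
NTFS itself does not require valid UTF-16.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Convert NUL-terminated UTF-16 to UTF-8. With out == NULL only the
 * required size is computed. Returns the number of bytes including the
 * terminator, or 0 on an unpaired surrogate or if cap is too small.
 */
size_t utf16_to_utf8(const uint16_t *in, char *out, size_t cap)
{
	size_t n = 0;

	while (*in) {
		uint32_t cp = *in++;
		unsigned char tmp[4];
		size_t len, i;

		if (cp >= 0xD800 && cp <= 0xDBFF) {
			/* High surrogate: a low surrogate must follow. */
			if (*in < 0xDC00 || *in > 0xDFFF)
				return 0;
			cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(*in++ - 0xDC00);
		} else if (cp >= 0xDC00 && cp <= 0xDFFF) {
			return 0;	/* stray low surrogate */
		}

		if (cp < 0x80) {
			tmp[0] = (unsigned char)cp;
			len = 1;
		} else if (cp < 0x800) {
			tmp[0] = (unsigned char)(0xC0 | (cp >> 6));
			tmp[1] = (unsigned char)(0x80 | (cp & 0x3F));
			len = 2;
		} else if (cp < 0x10000) {
			tmp[0] = (unsigned char)(0xE0 | (cp >> 12));
			tmp[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
			tmp[2] = (unsigned char)(0x80 | (cp & 0x3F));
			len = 3;
		} else {
			tmp[0] = (unsigned char)(0xF0 | (cp >> 18));
			tmp[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
			tmp[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
			tmp[3] = (unsigned char)(0x80 | (cp & 0x3F));
			len = 4;
		}

		if (out) {
			if (n + len >= cap)	/* keep room for the terminator */
				return 0;
			for (i = 0; i < len; i++)
				out[n + i] = (char)tmp[i];
		}
		n += len;
	}
	if (out) {
		if (n >= cap)
			return 0;
		out[n] = '\0';
	}
	return n + 1;
}
```

Every valid UTF-16 name has a UTF-8 form, which is why this direction is
lossless, whereas a conversion to an "ANSI codepage" is not.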

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-02 10:46   ` Peter Krefting
  2009-03-02 10:56     ` Johannes Schindelin
@ 2009-03-02 12:34     ` Johannes Sixt
  2009-03-02 13:12       ` Peter Krefting
  1 sibling, 1 reply; 33+ messages in thread
From: Johannes Sixt @ 2009-03-02 12:34 UTC (permalink / raw)
  To: Peter Krefting; +Cc: git

Peter Krefting wrote:
> Johannes Sixt:
>> If git prints a file name under the assumption that it is UTF-8
>> encoded, then it will be displayed incorrectly because the system uses
>> a different encoding.
> 
> Here setting the local codepage to UTF-8 *might* work, although I
> haven't tested that. Or always use the wchar_t versions of printf and
> friends.

You cannot expect users to switch the locale. For example, I have to test
our software with Japanese settings: I *cannot* switch to UTF-8 just
because of git.

Can you set the local codepage per program? (I don't know.) It might help
here, but it doesn't help in all cases, particularly in certain pipelines:

  git ls-files -o
  git ls-files -o | git update-index --add --stdin
  find . -name \*.jpg | git update-index --add --stdin

- What encoding should 'ls-files' use for its output? Certainly not always
UTF-8: stdout should use the local code page so that the file names are
interpreted correctly by the terminal window (it expects the local code page).

- What encoding should 'update-index' expect from its input? Can you be
sure that other programs generate UTF-8 output?

How do you solve that?

-- Hannes

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-02 12:34     ` Johannes Sixt
@ 2009-03-02 13:12       ` Peter Krefting
  2009-03-02 19:58         ` Robin Rosenberg
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Krefting @ 2009-03-02 13:12 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git

Johannes Sixt:

> Can you set the local codepage per program? (I don't know.)

The locale is set per thread, and gets reset when the program exits. So 
setting the codepage to UTF-8 before outputting should work. That should 
also work for displaying the log to the terminal if you have UTF-8 log 
messages.

Converting it to wchar_t and using wprintf and similar should be safer, 
though (and I have no idea what happens if you try to pipe the output to 
something else).

> - What encoding should 'ls-files' use for its output? Certainly not always 
> UTF-8: stdout should use the local code page so that the file names are 
> interpreted correctly by the terminal window (it expects the local code 
> page).

That is exactly why trying to mix "protocol" data ("plumbing" in Git's case) 
and user output will always come back and bite you, one way or another. I 
haven't really the faintest idea how pipes work with Unicode on Windows. 
Somewhere along the line there will probably be some conversions, which 
would cause interesting issues.

Better not use pipes, then. Heh. I sense that there is a slight problem with 
the architecture of Git and trying to get it to behave on Windows... :-)

> - What encoding should 'update-index' expect from its input? Can you be 
> sure that other programs generate UTF-8 output?

Theoretically, if all the internal stuff is hacked around to output Unicode, 
and the thread codepage is set up to use UTF-8, it should "just work". And 
if run directly from the shell, it should still be converted to whatever the 
system is set up to emit. That would mean, however, that a Git program that 
internally runs

   git-foo | git-bar | git-gazonk

might behave differently compared to if a user would enter it on the 
command-line.

-- 
\\// Peter - http://www.softwolves.pp.se/

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
       [not found]         ` <a2633edd0903020512u5682e9am203f0faccd0acf6a@mail.gmail.com>
@ 2009-03-02 13:57           ` Peter Krefting
  2009-03-02 14:29             ` Thomas Rast
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Krefting @ 2009-03-02 13:57 UTC (permalink / raw)
  To: git

Hi!

> Makes sense too. I think the whole API would have to be changed to use 
> TCHAR*.

I'd rather just say wchar_t explicitly. I'm not particularly fond of macros 
that change under your feet just because you fail to define a symbol 
somewhere...

> Then you need to do the right conversion at the right places, this will be 
> quite tricky, painful work, but there is probably no way around that.

In the other project I worked on we ended up wrapping all file-related calls 
in our own porting interface, and then let each platform we compiled for 
implement their own methods for handling Unicode paths. For Windows it's 
trivial since all APIs are Unicode. For Unix-like OSes it's tricky as you 
have to take the locale settings into account, but fortunately the world is 
slowly moving towards UTF-8 locales, which eases the pain a bit.

> Note that not only conversions will be needed but you'll also need to 
> adjust all routines handling filenames to use the proper Unicode version. 
> (strchr -> _tstrchr, open -> _topen, strcpy -> _tstrcpy, strlen -> 
> _tcslen, ...).

Not necessarily. If the code can be set up to use UTF-8 char* internally, 
not everything needs to be rewritten (I've done that too, only took a 
couple of years to move the codebase over to all-Unicode).

-- 
\\// Peter - http://www.softwolves.pp.se/
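
The reason a UTF-8 char* code base does not need every string routine
replaced is that bytes below 0x80 never occur inside a multi-byte UTF-8
sequence, so byte-oriented functions that search for ASCII characters such as
'/' keep working. A minimal sketch, with hypothetical helper names that are
not taken from git:

```c
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer to the last component of a UTF-8 path. strrchr()
 * works on bytes, which is safe here because '/' (0x2F) can never be
 * part of a multi-byte UTF-8 sequence.
 */
const char *utf8_basename(const char *path)
{
	const char *slash = strrchr(path, '/');

	return slash ? slash + 1 : path;
}

/*
 * Count code points in well-formed UTF-8 by skipping continuation
 * bytes (10xxxxxx). strlen() would count bytes instead, which matters
 * only where a length in characters is actually needed.
 */
size_t utf8_codepoints(const char *s)
{
	size_t n = 0;

	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

For "dir/Ü.txt" encoded as UTF-8, utf8_basename() returns the six-byte name
"Ü.txt", which utf8_codepoints() counts as five code points.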

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-02 13:57           ` Peter Krefting
@ 2009-03-02 14:29             ` Thomas Rast
  2009-03-02 20:41               ` Peter Krefting
  0 siblings, 1 reply; 33+ messages in thread
From: Thomas Rast @ 2009-03-02 14:29 UTC (permalink / raw)
  To: Peter Krefting; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 1442 bytes --]

Peter Krefting wrote:
> In the other project I worked on we ended up wrapping all file-related calls 
> in our own porting interface, and then let each platform we compiled for 
> implement their own methods for handling Unicode paths. For Windows it's 
> trivial since all APIs are Unicode. For Unix-like OSes it's tricky as you 
> have to take the locale settings into account, but fortunately the world is 
> slowly moving towards UTF-8 locales, which eases the pain a bit.

Have you thought about all the consequences this would have for the
*nix people here? [*]

Even if you pretend that Git did always enforce UTF-8 paths in its
trees, so that there's no backward compatibility to be cared for,
you're still in a world of hurt when trying to check out such paths
under a locale (or whatever setting might control this new encoding
logic) that does not support the whole range of UTF-8.

Like, say, the C locale.

Next you get to see to it that the users can spell all filenames even
if their locale doesn't let them, since they'll want to do things like
'git show $rev:$file' with them.

With backwards compatibility it's even worse as you're suddenly
imposing extra restrictions on what a valid filename in the repository
must look like.

[*] I'm _extremely_ tempted to write "people using non-broken OSes",
but let's pretend to be neutral for a second.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-02 13:12       ` Peter Krefting
@ 2009-03-02 19:58         ` Robin Rosenberg
  2009-03-02 20:52           ` Peter Krefting
  0 siblings, 1 reply; 33+ messages in thread
From: Robin Rosenberg @ 2009-03-02 19:58 UTC (permalink / raw)
  To: Peter Krefting; +Cc: Johannes Sixt, git

> Johannes Sixt:
> 
> > Can you set the local codepage per program? (I don't know.)
> 
> The locale is set per thread, and gets reset when the program exits. So 
> setting the codepage to UTF-8 before outputting should work. That should 
> also work for displaying the log to the terminal if you have UTF-8 log 
> messages.

Messing with locale is probably going to break subtly. An explicit approach
is better, respecting the user's locale when necessary.

> Converting it to wchar_t and using wprintf and similar should be safer, 
> though (and I have no idea what happens if you try to pipe the output to 
> something else).
> 
> > - What encoding should 'ls-files' use for its output? Certainly not always 
> > UTF-8: stdout should use the local code page so that the file names are 
> > interpreted correctly by the terminal window (it expects the local code 
> > page).
> 
> That is exactly why trying to mix "protocol" data ("plumbing" in Git's case) 
> and user output will always come back and bite you, one way or another. I 
> haven't really the faintest idea how pipes work with Unicode on Windows. 
> Somewhere along the line there will probably be some conversions, which 
> would cause interesting issues.

Pipes are just bytes so you have to know what you're piping by convention
or protocol. You can ask for the console output page, which may be set to
a multibyte locale or unicode and maybe trust that.... (just guessing, really).

> Better not use pipes, then. Heh. I sense that there is a slight problem with 
> the architecture of Git and trying to get it to behave on Windows... :-)

architecture? Like the "architecture" of species? No, it's evolution.
If that applies to the linux kernel, it's not so strange it applies to git too.

> > - What encoding should 'update-index' expect from its input? Can you be 
> > sure that other programs generate UTF-8 output?
> 
> Theoretically, if all the internal stuff is hacked around to output Unicode, 
> and the thread codepage is set up to use UTF-8, it should "just work". And 

msys doesn't seem to understand UTF-8 at all, so depending on that to work
seems futile. Simply bypassing the locale for any internal work is probably the 
most sane thing. That also won't depend on the quality of the locale support in 
the runtime. Start by making the git commands work without msys bash,
and figure a way to fix msys later, unless someone has a very good idea on
how to fix msys.

> if run directly from the shell, it should still be converted to whatever the 
> system is set up to emit. That would mean, however, that a Git program that 
> internally runs
> 
>    git-foo | git-bar | git-gazonk
> 
> might behave differently compared to if a user would enter it on the 
> command-line.
> 

You might also want to check out my work in the area. See 

http://www.jgit.org/cgi-bin/gitweb/gitweb.cgi?p=GIT.git;a=shortlog;h=i18n

The goal is locale neutrality yielding the "expected" result, in the user's eyes, regardless
of locale as much as possible. Junio didn't want to have it for five years, so I
guess there's still three and a half to go. Hopefully he can change his mind. That branch
is heavily outdated by now, as some of the functionality has been introduced by other
means like logoutputencoding and other parts of git have been rewritten.

Related to this, JGit assumes UTF-8 on reading. If it's not valid UTF-8 we try the user's 
locale (roughly), and on writing object metadata, including any sort of identifier, 
we always write UTF-8 when we have to be explicit. We let the runtime decide how
to encode file names in the file system using the user's locale.

I'd be almost happy with a solution that works when people are interacting using
the subset that is convertible between the character sets in use.

-- robin
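
The "assume UTF-8 on reading, else try the user's locale" policy described
above can be sketched as follows. This is not JGit's code: name_to_utf8() and
valid_utf8() are hypothetical names, and ISO-8859-1 is assumed as the
fallback locale purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { NAME_TOO_LONG = 0, NAME_WAS_UTF8 = 1, NAME_FROM_FALLBACK = 2 };

/* Return 1 if the NUL-terminated string s is well-formed UTF-8. */
static int valid_utf8(const unsigned char *s)
{
	while (*s) {
		uint32_t cp;
		size_t len, i;

		if (*s < 0x80) {
			s++;
			continue;
		} else if ((*s & 0xE0) == 0xC0) {
			len = 2; cp = *s & 0x1F;
		} else if ((*s & 0xF0) == 0xE0) {
			len = 3; cp = *s & 0x0F;
		} else if ((*s & 0xF8) == 0xF0) {
			len = 4; cp = *s & 0x07;
		} else {
			return 0;
		}
		for (i = 1; i < len; i++) {
			if ((s[i] & 0xC0) != 0x80)	/* also stops at NUL */
				return 0;
			cp = (cp << 6) | (s[i] & 0x3F);
		}
		/* Reject overlong forms, surrogates and out-of-range values. */
		if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
		    (len == 4 && cp < 0x10000) || cp > 0x10FFFF ||
		    (cp >= 0xD800 && cp <= 0xDFFF))
			return 0;
		s += len;
	}
	return 1;
}

/*
 * Store a UTF-8 version of the file name in out. Valid UTF-8 is copied
 * unchanged; anything else is treated as ISO-8859-1 and converted.
 */
int name_to_utf8(const char *in, char *out, size_t cap)
{
	const unsigned char *s = (const unsigned char *)in;
	size_t n = 0;

	if (valid_utf8(s)) {
		size_t len = strlen(in);

		if (len >= cap)
			return NAME_TOO_LONG;
		memcpy(out, in, len + 1);
		return NAME_WAS_UTF8;
	}
	for (; *s; s++) {
		size_t need = *s < 0x80 ? 1 : 2;

		if (n + need >= cap)
			return NAME_TOO_LONG;
		if (need == 1) {
			out[n++] = (char)*s;
		} else {
			out[n++] = (char)(0xC0 | (*s >> 6));
			out[n++] = (char)(0x80 | (*s & 0x3F));
		}
	}
	if (n >= cap)
		return NAME_TOO_LONG;
	out[n] = '\0';
	return NAME_FROM_FALLBACK;
}
```

The heuristic is ambiguous by construction: an ISO-8859-1 name whose bytes
happen to form valid UTF-8 is taken as UTF-8, which is one of the
ambiguities raised earlier in the thread.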

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-02 14:29             ` Thomas Rast
@ 2009-03-02 20:41               ` Peter Krefting
  2009-03-03  7:56                 ` Lars Noschinski
  2009-03-03  9:47                 ` Dmitry Potapov
  0 siblings, 2 replies; 33+ messages in thread
From: Peter Krefting @ 2009-03-02 20:41 UTC (permalink / raw)
  To: Thomas Rast; +Cc: git

Thomas Rast:

> Have you thought about all the consequences this would have for the *nix 
> people here? [*]

Yeah. It will fix problems trying to check out a Git repository created by 
me in an iso8859-1 locale on a machine using a utf-8 locale, where both ends 
would like to have a file named "Ü".

Or, hopefully, a careful adoption of this on Windows won't affect Unixes and 
other systems with pre-Unicode APIs at all, since the Windows code would be 
in the "compat" directory.

> you're still in a world of hurt when trying to check out such paths under 
> a locale (or whatever setting might control this new encoding logic) that 
> does not support the whole range of UTF-8.

Yeah. That would be a case similar to the casing problem on Windows.

> With backwards compatibility it's even worse as you're suddenly imposing 
> extra restrictions on what a valid filename in the repository must look 
> like.

Indeed. It is unfortunate that this wasn't properly specified to start with. 
It's mostly a minor issue since *most* people will not use non-ASCII file 
names, at least in the kind of projects that Git has attracted so far. The 
problem arises if Git is to attract "the masses". Especially on Windows, 
where file names using non-ASCII characters are common, this needs to be 
addressed eventually.

> [*] I'm _extremely_ tempted to write "people using non-broken OSes", but 
> let's pretend to be neutral for a second.

In most cases, I would most definitely agree with you on calling it that, 
but when it comes to Unicode support, Windows is one of the least broken 
OSes (with Symbian being my favourite).

-- 
\\// Peter - http://www.softwolves.pp.se/

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-02 19:58         ` Robin Rosenberg
@ 2009-03-02 20:52           ` Peter Krefting
  2009-03-02 21:21             ` Robin Rosenberg
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Krefting @ 2009-03-02 20:52 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Johannes Sixt, git

Robin Rosenberg:

> Pipes are just bytes so you have to know what you're piping by convention 
> or protocol. You can ask for the console output page, which may be set to 
> a multibyte locale or unicode and maybe trust that.... (just guessing, 
> really).

You can get cmd.exe to write data to pipes and redirections as UTF-16 
Unicode (cmd.exe /u), perhaps there is a way to capitalise on that? 
"Unfortunately", the Git stuff is mostly called from a bash shell inside 
msys, so it requires a "bit" more work...

> architecture? Like the "architecture" of species? No, it's evolution.

There's still an architecture there, somewhere. Perhaps not intended or 
specified, but there definitely is one :-)

> http://www.jgit.org/cgi-bin/gitweb/gitweb.cgi?p=GIT.git;a=shortlog;h=i18n
>
> The goal is locale neutrality yielding the "expected", in the users eyes, 
> result regardless of locale as much as possible.

Ah, yes, that looks like an interesting starting point. I had assumed 
that Git on Linux would use UTF-8 for everything, since it already 
does that for the commit messages despite me using an iso8859-1 locale. 
Apparently I haven't done my homework.

> We let the runtime decide on how to encode file names in the file system 
> using the user's locale.

That's good. That's what I'm trying to achieve. Or, rather, avoid the user 
locale altogether (which is easy on Windows since the file names are always 
stored in Unicode, and the user locale can be bypassed).

> I'd be almost happy with a solution that works when people are interacting 
> using the subset that is convertible between the character sets in use.

You mean like the "invariant" character set? :-) Using Unicode internally 
(in whatever encoding) is nice, the problem is when you have to interact 
with the world around you.

-- 
\\// Peter - http://www.softwolves.pp.se/

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-02 20:52           ` Peter Krefting
@ 2009-03-02 21:21             ` Robin Rosenberg
  2009-03-03  5:51               ` Peter Krefting
  0 siblings, 1 reply; 33+ messages in thread
From: Robin Rosenberg @ 2009-03-02 21:21 UTC (permalink / raw)
  To: Peter Krefting; +Cc: Johannes Sixt, git

On Monday 02 March 2009 21:52:41, Peter Krefting <peter@softwolves.pp.se> wrote:
> Robin Rosenberg:
> 
> > I'd be almost happy with a solution that works when people are interacting 
> > using the subset that is convertible between the character sets in use.
> 
> You mean like the "invariant" character set? :-) Using Unicode internally 
> (in whatever encoding) is nice, the problem is when you have to interact 
> with the world around you.

Not sure what that is. I mean that in a local Nordic setting, people can use iso-8859-1|15/windows-1252/UTF-8 for their needs by means of converting the characters as needed without loss, with very few practical restrictions.

For a larger setting that won't do, but then the need is typically less since people tend to use ASCII only, or you jump to all unicode.

Just because I use UTF-8 doesn't mean I start using more characters in practice.

-- robin

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-02 21:21             ` Robin Rosenberg
@ 2009-03-03  5:51               ` Peter Krefting
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Krefting @ 2009-03-03  5:51 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Johannes Sixt, git

Robin Rosenberg:

> Not sure what that is.

"Invariant" is defined in an old RFC as the common subset of several 
ASCII-like and ASCII-based encodings. This was back before the MIME days, 
IIANM.

> I mean that in a local Nordic setting, people can use 
> iso-8859-1|15/windows-1252/UTF-8 for their needs by means of converting 
> the characters as needed without loss, with very few practical 
> restrictions.

Indeed. The trick is to have the storage (in this case, Git and its tree 
objects) storing the file name data in a commonly agreed-upon way. Then it 
is simple to convert at the end-points.
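
As a rough sketch of what converting at an end-point could look like for the
iso-8859-1 case: every Latin-1 byte maps to the Unicode code point with the
same value, so going to UTF-8 cannot fail or lose anything. The function name
latin1_to_utf8 is made up for this illustration; nothing like it exists in
Git or in the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of Git: convert in_len bytes of
 * ISO 8859-1 text to NUL-terminated UTF-8 in out.
 * Returns the number of bytes written (excluding the NUL), or -1 if
 * the output buffer is too small.
 */
int latin1_to_utf8(const unsigned char *in, size_t in_len,
                   unsigned char *out, size_t out_size)
{
	size_t i, o = 0;

	assert(in != NULL && out != NULL);
	for (i = 0; i < in_len; i++) {
		unsigned char c = in[i];
		size_t need = (c < 0x80) ? 1 : 2;

		/* always keep one byte in reserve for the terminating NUL */
		if (o + need + 1 > out_size)
			return -1;
		if (c < 0x80) {
			out[o++] = c;                 /* ASCII is unchanged */
		} else {
			out[o++] = 0xC0 | (c >> 6);   /* lead byte 110000xx */
			out[o++] = 0x80 | (c & 0x3F); /* continuation 10xxxxxx */
		}
	}
	if (o + 1 > out_size)
		return -1;
	out[o] = '\0';
	return (int)o;
}
```

The reverse direction (UTF-8 to Latin-1) is the lossy one, since it has to
reject or replace every code point above U+00FF.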

> Just because I use UTF-8 doesn't mean I start using more characters in 
> practice.

Most people do not, no. But using a Unicode encoding means that they at 
least have the option. Sometimes, having to mangle stuff down to ASCII is a 
pain.

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-02 20:41               ` Peter Krefting
@ 2009-03-03  7:56                 ` Lars Noschinski
  2009-03-03 11:54                   ` Peter Krefting
  2009-03-03  9:47                 ` Dmitry Potapov
  1 sibling, 1 reply; 33+ messages in thread
From: Lars Noschinski @ 2009-03-03  7:56 UTC (permalink / raw)
  To: Peter Krefting; +Cc: Thomas Rast, git

* Peter Krefting <peter@softwolves.pp.se> [09-03-02 21:41]:
> Indeed. It is unfortunate that this wasn't properly specified to start with. 
> It's mostly a minor issue since *most* people will not use non-ASCII file 
> names. At least for most of the kinds of projects that Git has attracted so 
> far, so the problem is not that big. The problem is if Git is to attract "the 
> masses". Especially on Windows, where file names using non-ASCII are common, 
> this needs to be addressed eventually.

Using no encoding for filenames was the obvious (and, I would argue,
correct) choice. Unix filenames are specified to be a sequence of bytes,
excluding '/' and '\0'. A lot of these sequences are not valid UTF-8.
Further, the encoding needed for filenames depends on the encoding used
in the source code for referencing these files. Again, for the unix file
handling functions, this means no encoding.
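
To make the "not valid UTF-8" point concrete, here is a minimal sketch of a
well-formedness check. The name is_valid_utf8 is invented for this mail and
is not an existing Git function. A latin-1 file name such as the bytes
"DC 2E 74 78 74" fails it, because 0xDC announces a two-byte sequence but is
followed by a plain ASCII byte.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if the len bytes at s are well-formed
 * UTF-8, 0 otherwise. Rejects stray continuation bytes, truncated
 * sequences, overlong forms, surrogates and values above U+10FFFF.
 */
int is_valid_utf8(const unsigned char *s, size_t len)
{
	size_t i = 0;

	assert(s != NULL || len == 0);
	while (i < len) {
		unsigned char c = s[i];
		unsigned long cp, min;
		size_t n, k;

		if (c < 0x80) {
			i++;
			continue;
		} else if ((c & 0xE0) == 0xC0) {
			n = 1; cp = c & 0x1F; min = 0x80;
		} else if ((c & 0xF0) == 0xE0) {
			n = 2; cp = c & 0x0F; min = 0x800;
		} else if ((c & 0xF8) == 0xF0) {
			n = 3; cp = c & 0x07; min = 0x10000;
		} else {
			return 0; /* continuation byte or 0xF8..0xFF as lead */
		}

		if (len - i <= n)
			return 0; /* sequence is cut off at the end of input */
		for (k = 1; k <= n; k++) {
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}
		if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
			return 0;
		i += n + 1;
	}
	return 1;
}
```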

Changing the filename (on checkout), so that the user sees an Ü
regardless of his or her locale (instead of an \0xDC, which only
resolves to an Ü on latin-1) would be an absolutely broken concept here.

> >[*] I'm _extremely_ tempted to write "people using non-broken OSes", but let's 
> >pretend to be neutral for a second.
> 
> In most cases, I would most definitely agree with you on calling it that, but 
> when it comes to Unicode support, Windows is one of the least broken OSes (with 
> Symbian being my favourite).

IMHO having encoding specific open functions is begging for problems.

 - Lars.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-02  8:47 [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded Peter Krefting
  2009-03-02 10:30 ` Johannes Sixt
@ 2009-03-03  9:43 ` Dmitry Potapov
  2009-03-03 11:56   ` Peter Krefting
  2009-03-07 10:38 ` Robin Rosenberg
  2 siblings, 1 reply; 33+ messages in thread
From: Dmitry Potapov @ 2009-03-03  9:43 UTC (permalink / raw)
  To: Peter Krefting; +Cc: git

On Mon, Mar 02, 2009 at 09:47:22AM +0100, Peter Krefting wrote:
> When opening a file through open() or fopen(), the path passed is
> UTF-8 encoded. To handle this on Windows, we need to convert the
> path string to UTF-16 and use the Unicode-based interface.

IMHO, you grossly underestimate what is needed to enable UTF-8 encoding
in Windows. AFAIK, Microsoft C runtime library does not support UTF-8,
so you have to wrap all C functions taking 'char*' as an input parameter.
For example, think about what is going to happen if Git tries to print
a simple error message:
  fprintf (stderr, "unable to open %s", path);

> Since there is no real file system abstraction beyond using stdio
> (AFAIK), I need to hack it by replacing fopen (and open). Probably
> opendir/readdir as well (might be trickier), and possibly even hack
> around main() to parse the wchar_t command-line instead of the char copy.

And the command-line is not the only source of file names. Some Git
commands read lists of files from stdin, usually through a pipe. In
what encoding are they going to be?

Dmitry

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-02 20:41               ` Peter Krefting
  2009-03-03  7:56                 ` Lars Noschinski
@ 2009-03-03  9:47                 ` Dmitry Potapov
  2009-03-03 11:48                   ` Peter Krefting
  1 sibling, 1 reply; 33+ messages in thread
From: Dmitry Potapov @ 2009-03-03  9:47 UTC (permalink / raw)
  To: Peter Krefting; +Cc: Thomas Rast, git

On Mon, Mar 02, 2009 at 09:41:57PM +0100, Peter Krefting wrote:
>
> In most cases, I would most definitely agree with you on calling it that,
> but when it comes to Unicode support, Windows is one of the least broken
> OSes (with Symbian being my favourite).

The C Standard requires that the type wchar_t is capable of representing
any character in the current locale. If Windows uses UTF-16 as internal
encoding (so, it can work with symbols outside of the BMP), it means you
cannot have 16-bit wchar_t and be compliant with the C standard...

Dmitry

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-03  9:47                 ` Dmitry Potapov
@ 2009-03-03 11:48                   ` Peter Krefting
  2009-03-03 17:13                     ` Dmitry Potapov
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Krefting @ 2009-03-03 11:48 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: Thomas Rast, git

Dmitry Potapov:

> The C Standard requires that the type wchar_t is capable of representing 
> any character in the current locale. If Windows uses UTF-16 as internal 
> encoding (so, it can work with symbols outside of the BMP), it means you 
> cannot have 16-bit wchar_t and be compliant with the C standard...

No, that's not quite correct. wchar_t is defined to be "an integer type whose 
range of values can represent distinct codes for all members of 
the largest extended character set specified among the supported locales". 
Since Windows defines all local character sets as Unicode-based, having 
wchar_t defined as Unicode means that it can represent everything.

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-03  7:56                 ` Lars Noschinski
@ 2009-03-03 11:54                   ` Peter Krefting
  2009-03-03 16:29                     ` Lars Noschinski
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Krefting @ 2009-03-03 11:54 UTC (permalink / raw)
  To: Lars Noschinski; +Cc: Thomas Rast, git

Lars Noschinski:

> Using no encoding for filenames was the obvious (and I would argue) 
> correct choice. Unix filenames are specified to be a sequence of bytes, 
> excluding '/' and '\0'.

I know the Unix way of thinking lends itself to such a design. This is one 
of the few cases where I personally think Unix has got it wrong, and Windows 
(NT) has got it right. But then again, Unix' design pre-dates the locale 
issue by quite some time, so it is not difficult to see where it comes from.

> Changing the filename (on checkout), so that the user sees an Ü regardless 
> of his or her locale (instead of an \0xDC, which only resolves to an Ü on 
> latin-1) would be an absolutely broken concept here.

Why would it? It is my view as a user on my files that defines how file names 
are looked upon. If I have three machines, one Linux box using an iso8859-1 
locale, an OS X box (where, I would believe, file APIs use UTF-8, someone 
please correct me if I'm wrong), and a Windows box (which uses UTF-16 on the 
file system layer, but does provide compatibility functions that use char 
pointers), and create a file on each of these called "Ü.txt" (which would be 
the sequence "DC 2E 74 78 74" on the Linux box, "C3 9C 2E 74 78 74" (or 
probably something else since I believe OS X decomposes the string) on the 
OS X box and "00DC 002E 0074 0078 0074" on the Windows box), I see these 
three file names as equal.

If I were to create a Git repo on each of the three machines, put the file 
in it, and then clone that on one of the other machines, *I* would assume 
that the file names were converted to fit the host operating system.

> IMHO having encoding specific open functions is begging for problems.

Indeed. That's why I like Windows' wchar_t APIs, and dislike Unix' and 
Linux' char APIs that, in some ways, depend on the user locale.

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-03  9:43 ` Dmitry Potapov
@ 2009-03-03 11:56   ` Peter Krefting
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Krefting @ 2009-03-03 11:56 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: git

Dmitry Potapov:

> IMHO, you grossly underestimate what is needed to enable UTF-8 encoding in 
> Windows. AFAIK, Microsoft C runtime library does not support UTF-8, so you 
> have to wrap all C functions taking 'char*' as an input parameter.

I have to wrap all file-related functions, at least.
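
For fopen() such a wrapper would be roughly the following. This is only a
sketch under a few assumptions: utf8_fopen and utf8_to_wide are names made up
here (not what the patch uses), MB_ERR_INVALID_CHARS is assumed to be accepted
together with CP_UTF8 on the Windows versions we care about, and on
non-Windows builds it simply falls through to plain fopen().

```c
#include <assert.h>
#include <stdio.h>

#ifdef _WIN32
#include <errno.h>
#include <stdlib.h>
#include <windows.h>

/*
 * Hypothetical helper: convert a NUL-terminated UTF-8 string to a
 * freshly allocated UTF-16 string. Returns NULL on invalid input or
 * allocation failure; the caller frees the result.
 */
static wchar_t *utf8_to_wide(const char *s)
{
	wchar_t *w;
	/* First call only asks for the required length, NUL included. */
	int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
				    s, -1, NULL, 0);

	if (n <= 0)
		return NULL;
	w = malloc((size_t)n * sizeof(*w));
	if (w && !MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
				      s, -1, w, n)) {
		free(w);
		w = NULL;
	}
	return w;
}
#endif

/* Hypothetical fopen() replacement that takes a UTF-8 encoded path. */
FILE *utf8_fopen(const char *path, const char *mode)
{
	assert(path != NULL && mode != NULL);
#ifdef _WIN32
	{
		wchar_t *wpath = utf8_to_wide(path);
		wchar_t *wmode = utf8_to_wide(mode);
		FILE *f = NULL;
		int saved_errno;

		if (wpath && wmode)
			f = _wfopen(wpath, wmode);
		else
			errno = EINVAL;
		saved_errno = errno;    /* free() may clobber errno */
		free(wpath);
		free(wmode);
		errno = saved_errno;
		return f;
	}
#else
	return fopen(path, mode);
#endif
}
```

Whether a failed conversion should return an error like this or fall back to
the char-based fopen() is exactly the open question from the original mail.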

> For example, think about what is going to happen if Git tries to print a 
> simple error message: fprintf (stderr, "unable to open %s", path);

Yeah. That's a problem. That might be solvable by setting the thread locale 
to something UTF-8 based and have the console window convert to the output 
codepage (that is what it does when you use wprintf and friends).

> And the command-line is not the only source of file names. Some Git 
> commands read lists of files from stdin, usually through a pipe. In what 
> encoding are they going to be?

Indeed. Pipes are a problem.

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-03 11:54                   ` Peter Krefting
@ 2009-03-03 16:29                     ` Lars Noschinski
  2009-03-03 20:59                       ` Robin Rosenberg
  0 siblings, 1 reply; 33+ messages in thread
From: Lars Noschinski @ 2009-03-03 16:29 UTC (permalink / raw)
  To: git

* Peter Krefting <peter@softwolves.pp.se> [09-03-03 12:54]:
> Lars Noschinski:
> >Changing the filename (on checkout), so that the user sees an Ü regardless of 
> >his or her locale (instead of an \0xDC, which only resolves to an Ü on 
> >latin-1) would be an absolutely broken concept here.
> 
> Why would it? It is my view as a user on my files that define how file names 
> are looked upon. If I have three machines, one Linux box using a iso8859-1 
> locale, an OS X box (where, I would believe, file APIs use UTF-8, someone 
> please correct me if I'm wrong), and a Windows box (which uses UTF-16 on the 
> file system layer, but does provide compatibility functions that use char 
> pointers), and create a file on each of these called "Ü.txt" (which would be 
> the sequence "DC 2E 74 78 74" on the Linux box, "C3 9C 2E 74 78 74" (or 
> probably something else since I believe OS X decomposes the string) on the OS X 
> box and "00DC 002E 0074 0078 0074" on the Windows box, I see these three file 
> names as equal.

Because a function in the source code refers to (e.g.) "DC 2E 74 78 74",
not "C3 9C 2E 74 78 74" nor "00DC 002E 0074 0078 0074". And it does so
regardless of the locale.

The file name may look funny depending on your locale, but if you rename
the file to fit your local encoding, it would not work.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-03 11:48                   ` Peter Krefting
@ 2009-03-03 17:13                     ` Dmitry Potapov
  2009-03-04 10:51                       ` Peter Krefting
  0 siblings, 1 reply; 33+ messages in thread
From: Dmitry Potapov @ 2009-03-03 17:13 UTC (permalink / raw)
  To: Peter Krefting; +Cc: Thomas Rast, git

On Tue, Mar 3, 2009 at 2:48 PM, Peter Krefting <peter@softwolves.pp.se> wrote:
> Dmitry Potapov:
>
>> The C Standard requires that the type wchar_t is capable of representing
>> any character in the current locale. If Windows uses UTF-16 as internal
>> encoding (so, it can work with symbols outside of the BMP), it means you
>> cannot have 16-bit wchar_t and be compliant with the C standard...
>
> No, that's not quite correct. wchar_t is defined to be "an integer type
> whose range of values can represent distinct codes for all members of the
> largest extended character set specified among the supported locales". Since
> Windows defines all local character sets as Unicode-based, having wchar_t
> defined as Unicode means that it can represent everything.

No, it does not, if you have wchar_t that is only 16-bit wide, because
characters outside of the BMP have integer values in Unicode greater than
65535...

Dmitry

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
@ 2009-03-03 18:25 John Dlugosz
  2009-03-04 10:53 ` Peter Krefting
  0 siblings, 1 reply; 33+ messages in thread
From: John Dlugosz @ 2009-03-03 18:25 UTC (permalink / raw)
  To: git; +Cc: peter

===Re:===
The other way would be to keep the char* APIs but convert to the Windows
locale encoding ("ANSI codepage"), but that will break horribly as not all
file names that can be used on a file system can be represented as such.
===end===

Actually, UTF-8 is a valid code page on Windows.  The code page ID is
65001.  So, if you set the process code page to that, =and= set the file
system API's code page to follow rather than using the OEM code page
(the default), it should work just fine.

Also, there is a national code page that =will= represent all file names
on the systems and is supported:  That is the Chinese GB18030, code page
54936.  That has every character that Unicode does, just encoded
differently to be forward compatible with GBK.  That is fully supported
by windows, as it is required by law to sell in Chinese markets.

Let me know if I can be of help.  I know character set stuff and Win32
fairly well.

--John

TradeStation Group, Inc. is a publicly-traded holding company (NASDAQ GS: TRAD) of three operating subsidiaries, TradeStation Securities, Inc. (Member NYSE, FINRA, SIPC and NFA), TradeStation Technologies, Inc., a trading software and subscription company, and TradeStation Europe Limited, a United Kingdom, FSA-authorized introducing brokerage firm. None of these companies provides trading or investment advice, recommendations or endorsements of any kind. The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited.
  If you received this in error, please contact the sender and delete the material from any computer.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
@ 2009-03-03 19:36 John Dlugosz
  0 siblings, 0 replies; 33+ messages in thread
From: John Dlugosz @ 2009-03-03 19:36 UTC (permalink / raw)
  To: git; +Cc: j.sixt

===Re:===
You cannot expect users to switch the locale. For example, I have to test
our software with Japanese settings: I *cannot* switch to UTF-8 just
because of git.

Can you set the local codepage per program? (I don't know.) It might help
here, but it doesn't help in all cases, particularly in certain pipelines:
===end===

Yes, you can.  The code page can be set per thread.  The function call
is:

	SetThreadLocale (lcid);

where lcid is just 65001 for UTF-8.  (The other fields in the LCID are
high-order bits and all zero for no sublanguage and default sort order).

When a thread is created, it starts with the system default thread
locale.  So call SetThreadLocale on every thread you create.  In
particular, realize that the new thread does not inherit this from the
creating thread.

Meanwhile... the file I/O functions don't use the same code page.  The
encoding of file names on a floppy disk or whatnot was historically done
using the "OEM code page", and when a different code page is used for
text editing, that shouldn't break compatibility.  So, all functions
exported from Kernel32.dll that accept or return file names use a
separate setting, and setting the locale as shown above will not affect
it.  This might be the source of confusion to those experimenting with
it.

So, also make a call to

	SetFileApisToANSI();

This affects the entire process, not just the thread.

So much for specifying UTF-8 file names in Windows.  A related issue is
the console input and output of same.  I don't know if the sh program
that is part of msys or Cygwin does anything to the console window it is
using, but each console window can have its own code page as well.  The
default for 8-bit API (char*'s) is also the OEM character set, not the
so-called ANSI character set that is specified with SetThreadLocale.
I've not experimented with setting this (and restoring it) within a
program invoked in that console.  But if you use the 16-bit API for
console I/O, it is not a problem and works regardless of how the user
chose to set it.  To make it even more confusing, the console doesn't
respect the UTF-8 setting if the font is not set properly too.

--John


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
@ 2009-03-03 20:39 John Dlugosz
  2009-03-03 21:02 ` Dmitry Potapov
  0 siblings, 1 reply; 33+ messages in thread
From: John Dlugosz @ 2009-03-03 20:39 UTC (permalink / raw)
  To: git; +Cc: dpotapov

Re: AFAIK, Microsoft C runtime library does not support UTF-8,

Actually, here is a clip from the runtime library source code:

        tmode = _textmode(fh);

        switch(tmode) {
            case __IOINFO_TM_UTF8 :
                /* For a UTF-8 file, we need 2 buffers, because after reading we
                   need to convert it into UNICODE - MultiByteToWideChar doesn't do
                   in-place conversions. */

                /* MultiByte To WideChar conversion may double the size of the
                   buffer required & hence we divide cnt by 2 */

                /*
                 * Since we are reading UTF8 stream, cnt bytes read may vary
                 * from cnt wchar_t characters to cnt/4 wchar_t characters. For
                 * this reason if we need to read cnt characters, we will
                 * allocate MBCS buffer of cnt. In case cnt is 0, we will
                 * have 4 as minimum value. This will make sure we don't
                 * overflow for reading from pipe case.
                 *
                 *
                 * In this case the numbers of wchar_t characters that we can
                 * read is cnt/2. This means that the buffer size that we will
                 * require is cnt/2.
                 */

                /* For UTF8 we want the count to be an even number */

This is in the _read(fd, buffer, count) function, and shows that it will
in fact read UTF-8 and automatically transform it to UTF-16LE
transparently.  The documentation for _open explains this feature.

Meanwhile, a quick look at _mbslen() etc. shows that they are
implemented, and will handle UTF-8 encoded text as variable-length char*
just fine as long as suitable tables are loaded in its locale.  An
internal header shows macros for generating the lead-byte information as
needed by that table.
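
To illustrate what a multibyte-aware length function has to do for UTF-8,
here is a small portable sketch. It is not the runtime library's _mbslen
implementation, and the name utf8_strlen is made up for this mail. It assumes
well-formed input and counts every byte that is not a 10xxxxxx continuation
byte.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: count code points in a NUL-terminated UTF-8
 * string. Each character contributes exactly one byte that is not a
 * continuation byte (10xxxxxx), so only those bytes are counted.
 * The input is assumed to be well-formed UTF-8.
 */
size_t utf8_strlen(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

Code that only knows about double-byte characters would instead treat a
4-byte sequence as two characters, which is the failure mode described below
for GB18030.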

Now, the default when a program starts is to use the "C" locale.  The
locale argument to setlocale can take a form ".code_page", so calling

	setlocale (LC_CTYPE, ".65001");

should do the trick.  Assuming, that is, that you don't hit macros that
assume that characters are never multibyte.  So define the preprocessor
symbol _MBCS when you compile.

Older versions might not work right because MBCS (multibyte character
strings) was only actually implemented to DBCS (double-byte).  That is,
a single lead byte would be followed by a second byte, and no other
cases are provided for.  But, GB18030 has up to 4 bytes in a single
character.  It might still not be completely "clean" though because
GB18030 has a "double double" nature to it.  Just like assuming 16-bit
characters period mostly works with surrogate pairs even if you didn't
code full UTF-16 support, DBCS code will see a 4-byte GB18030 character
as two double byte characters.  So it gets the len (in characters)
wrong, and might still break up what is supposed to be a single
character.  So it really needs some improvement from the historical
DBCS-only code to work properly.  

Anyway, if UTF-8 really doesn't work with MBCS functions acceptably
well, and the goal is to allow passage of all characters through the
program, then set the program to use Chinese.  GB18030 is =fully=
supported and is just another (albeit strange) encoding for Unicode.

As for what
	fprintf (stderr, "unable to open %s", path);
will do, it will have no problem copying the contents of path to the
output stream no matter how it is encoded.  The result will be sent to
stderr, which may be autotranslating the local code page to UTF-16 or
UTF-8, but by default just feeds the stream of bytes to the console
window's 8-bit API, which has its own code page setting.

Personally, I have printf'ed UTF-8 encoded text to standard output.  It
looks OK if the console is also set to UTF-8.

--John
(please excuse the footer; it's not my idea)

TradeStation Group, Inc. is a publicly-traded holding company (NASDAQ GS: TRAD) of three operating subsidiaries, TradeStation Securities, Inc. (Member NYSE, FINRA, SIPC and NFA), TradeStation Technologies, Inc., a trading software and subscription company, and TradeStation Europe Limited, a United Kingdom, FSA-authorized introducing brokerage firm. None of these companies provides trading or investment advice, recommendations or endorsements of any kind. The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited.
  If you received this in error, please contact the sender and delete the material from any computer.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-03 16:29                     ` Lars Noschinski
@ 2009-03-03 20:59                       ` Robin Rosenberg
  0 siblings, 0 replies; 33+ messages in thread
From: Robin Rosenberg @ 2009-03-03 20:59 UTC (permalink / raw)
  To: Lars Noschinski; +Cc: git, Peter Krefting, Thomas Rast

Lars Noschinski <lars-2008-2@usenet.noschinski.de> writes:
> * Peter Krefting <peter@softwolves.pp.se> [09-03-03 12:54]:
> > Lars Noschinski:
> > >Changing the filename (on checkout), so that the user sees an Ü regardless of 
> > >his or her locale (instead of an \0xDC, which only resolves to an Ü on 
> > >latin-1) would be an absolutely broken concept here.
> > 
> > Why would it? It is my view as a user on my files that define how file names 
> > are looked upon. If I have three machines, one Linux box using a iso8859-1 
> > locale, an OS X box (where, I would believe, file APIs use UTF-8, someone 
> > please correct me if I'm wrong), and a Windows box (which uses UTF-16 on the 
> > file system layer, but does provide compatibility functions that use char 
> > pointers), and create a file on each of these called "Ü.txt" (which would be 
> > the sequence "DC 2E 74 78 74" on the Linux box, "C3 9C 2E 74 78 74" (or 
> > probably something else since I believe OS X decomposes the string) on the OS X 
> > box and "00DC 002E 0074 0078 0074" on the Windows box, I see these three file 
> > names as equal.
> 
> Because a function in the source code refers to (e.g.) "DC 2E 74 78 74",
> not "C3 9C 2E 74 78 74" nor "00DC 002E 0074 0078 0074". And it does so
> regardless of the locale.

The only actual language I know where I've seen people use non-ascii names for
referenced files, i.e. classes, is Java and there you specify the encoding to
the compiler. Class names are not byte sequences there. XML files are another
case where referenced files are defined in unicode. I assume this applies to
C# and other modern languages too.

> The file name may look funny depending on your locale, but if you rename
> the file to fit your local encoding, it would not work.

In the Java case, you /have/ to "rename" or the build will break. Build systems like Ant
or Maven require you to "rename" too regardless of what you build. A C Git clone
will produce unbuildable code, but JGit will produce a working one for unicode
aware systems and documentation, the case where unicode filenames are more common
than in source, will look good.

-- robin

PS. I readded the people you forgot to Cc

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-03 20:39 John Dlugosz
@ 2009-03-03 21:02 ` Dmitry Potapov
  2009-03-03 21:56   ` John Dlugosz
  0 siblings, 1 reply; 33+ messages in thread
From: Dmitry Potapov @ 2009-03-03 21:02 UTC (permalink / raw)
  To: John Dlugosz; +Cc: git

On Tue, Mar 3, 2009 at 11:39 PM, John Dlugosz <JDlugosz@tradestation.com> wrote:
>
> Now, the default when a program starts is to use the "C" locale.  The
> locale argument to setlocale can take a form ".code_page", so calling
>
>        setlocale (LC_CTYPE, ".65001");
>
> should do the trick.  Assuming, that is, that you don't hit macros that
> assume that characters are never multibyte.  So define the preprocessor
> symbol _MBCS when you compile.

If Microsoft fixed the problem with UTF-8 support in the C runtime, that is
really good news, because setlocale did not work not so long ago:

http://blogs.msdn.com/michkap/archive/2006/03/13/550191.aspx

As for the Win32 API, it has always worked correctly with UTF-8... In fact,
the documentation of the GetOEMCP function goes as far as recommending
UTF-8 or UTF-16: "For the most consistent results, applications should
use Unicode, such as UTF-8 or UTF-16, instead of a specific code page."

So it would be great if Git supported UTF-8 on Windows (as an option), but it
is not my itch right now....

Dmitry

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-03 21:02 ` Dmitry Potapov
@ 2009-03-03 21:56   ` John Dlugosz
  0 siblings, 0 replies; 33+ messages in thread
From: John Dlugosz @ 2009-03-03 21:56 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: git

===Re:===
If Microsoft fixed the problem with UTF-8 support in the C runtime, that is
really good news, because setlocale did not work not so long ago:
===end===

They totally replaced it with one written by P.J.Plauger.  I'm not sure
when, but I would guess around VC++7.1, which was a "sea change" and
felt more like a different brand than a simple update.  That's when
templates started following the standard.

Re:
http://blogs.msdn.com/michkap/archive/2006/03/13/550191.aspx

Interesting.  So it sort-of worked, as per my overlong muse as I looked
at the source code, but they started explicitly preventing it because it
doesn't always work for everything.

    //  verify codepage validity
    if (!iCodePage || iCodePage == CP_UTF7 || iCodePage == CP_UTF8 ||
        !IsValidCodePage((WORD)iCodePage))
        return FALSE;

===Re:===
As for the Win32 API, it has always worked correctly with UTF-8... In fact,
the documentation of the GetOEMCP function goes as far as recommending
UTF-8 or UTF-16: "For the most consistent results, applications should
use Unicode, such as UTF-8 or UTF-16, instead of a specific code page."
===end===

I remember a time when it did not.  I don't recall if it was NT (as
opposed to consumer windows) or some version of NT beyond 3.5 (starting
in 4?) that it became available.  But I had to supply code with the
program because it could not count on it.

===Re:===
So it would be great if Git supported UTF-8 on Windows (as an option), but
it is not my itch right now....
===end===

Someone else mentioned "most people use ASCII file names", and I would
take that to be true only if "most people" == "developers".  If you look
at my wife's "explorer" view, it's all Chinese.  Files are downloaded
with Asian file names.  Most people =in= China are used to seamless
support within Windows.  It's only with Chinese MUI on English Windows
that the "ANSI" stuff doesn't match and programs that use 8-bit API
calls suddenly croak as they see "?????" for input.

--John



* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-03 17:13                     ` Dmitry Potapov
@ 2009-03-04 10:51                       ` Peter Krefting
  2009-03-04 14:18                         ` Dmitry Potapov
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Krefting @ 2009-03-04 10:51 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: git

Dmitry Potapov:

> No, it does not, if you have wchar_t that is only 16-bit wide, because 
> characters outside of the BMP have integer values in Unicode greater than 
> 65535...

UTF-16 allows you to reference all of Unicode (i.e. up to U+10FFFF) using 
surrogate pairs. That means that not all characters can be represented as a 
single wchar_t, that is true. The problem with changing wchar_t is that it 
was defined to use 16-bit values at a time where Unicode was defined to use 
16-bit code points (but they soon figured out that was not enough).
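
As an illustration of the surrogate-pair mechanism, here is a minimal sketch
in C. The function name utf16_encode and its signature are made up for this
example; it only shows the arithmetic that maps a code point above U+FFFF
onto two 16-bit units, and is not taken from Windows or from Git.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode one Unicode code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is
 * not a valid Unicode scalar value (too large, or itself a surrogate).
 */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
	assert(out != NULL);
	if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
		return 0;
	if (cp < 0x10000) {
		/* Inside the BMP: a single 16-bit unit is enough. */
		out[0] = (uint16_t)cp;
		return 1;
	}
	/* Outside the BMP: split the 20 remaining bits into two halves. */
	cp -= 0x10000;
	out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
	out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
	return 2;
}
```

With a 16-bit wchar_t, a character such as U+1F600 therefore occupies two
array elements, which is why code that treats one wchar_t as one character
can miscount.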

Anyway, this is getting off-topic. Please feel free to reply in private.

-- 
\\// Peter - http://www.softwolves.pp.se/


* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-03 18:25 John Dlugosz
@ 2009-03-04 10:53 ` Peter Krefting
  2009-03-04 19:34   ` John Dlugosz
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Krefting @ 2009-03-04 10:53 UTC (permalink / raw)
  To: John Dlugosz; +Cc: git

John Dlugosz:

> Actually, UTF-8 is a valid code page on Windows.

Yes, but I am unsure whether it can be set as a thread locale for the sake 
of file APIs.

> Also, there is a national code page that =will= represent all file names 
> on the systems and is supported: That is the Chinese GB18030, code page 
> 54936.

Yeah, but unfortunately it is explicitly documented that it is only 
supported in MultiByteToWideChar, WideCharToMultiByte and some text painting 
APIs in Windows, i.e. the stdio functions and others may break horribly.

-- 
\\// Peter - http://www.softwolves.pp.se/


* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-04 10:51                       ` Peter Krefting
@ 2009-03-04 14:18                         ` Dmitry Potapov
  0 siblings, 0 replies; 33+ messages in thread
From: Dmitry Potapov @ 2009-03-04 14:18 UTC (permalink / raw)
  To: Peter Krefting; +Cc: git

On Wed, Mar 04, 2009 at 11:51:15AM +0100, Peter Krefting wrote:

> The problem with changing wchar_t is that
> it was defined to use 16-bit values at a time where Unicode was defined
> to use 16-bit code points (but they soon figured out that was not
> enough).

I do realize that is a problem, and unfortunately there is no easy and
quick fix to it. But you brought up Windows as an example of good Unicode
support... Well, to my mind, it is not, at least, not for C programs.
You have two serious problems here:
1. wchar_t is too small to hold all Unicode characters, as is required
   by the C standard.
2. UTF-8 support is broken in the C runtime library.

In fact, if UTF-8 were supported by C runtime, we would not have this thread
in the first place... Now, it is possible to wrap all C functions used by Git to
make them work with UTF-8, but it is a lot of work...
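
A rough sketch of what wrapping one such function could look like, along the
lines of the mingw_open() change in the patch. The names utf8_fopen and
utf8_to_wide are hypothetical and not part of Git; the Windows branch relies
on MultiByteToWideChar() and _wfopen(), and the fallback to plain fopen() on
invalid UTF-8 is only the option Peter raised, not a settled design. On
other platforms the wrapper simply forwards to fopen().

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>

/* Hypothetical helper: convert a UTF-8 string to a malloc'ed UTF-16
 * string, or return NULL if the input is not valid UTF-8. */
static wchar_t *utf8_to_wide(const char *s)
{
	wchar_t *w;
	int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
				    s, -1, NULL, 0);
	if (n <= 0)
		return NULL;
	w = malloc((size_t)n * sizeof(*w));
	if (w && !MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
				      s, -1, w, n)) {
		free(w);
		w = NULL;
	}
	return w;
}
#endif

/* Hypothetical wrapper: open a file whose name is UTF-8 encoded. */
FILE *utf8_fopen(const char *path, const char *mode)
{
#ifdef _WIN32
	wchar_t *wpath, *wmode;
	FILE *f;

	assert(path != NULL && mode != NULL);
	wpath = utf8_to_wide(path);
	wmode = utf8_to_wide(mode);
	if (wpath && wmode)
		f = _wfopen(wpath, wmode);
	else
		f = fopen(path, mode); /* assumed fallback for invalid UTF-8 */
	free(wpath);
	free(wmode);
	return f;
#else
	assert(path != NULL && mode != NULL);
	return fopen(path, mode);
#endif
}
```

Every other path-taking function (open, stat, opendir, unlink, ...) would
need the same treatment, which is where the "lot of work" comes from.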

Dmitry


* RE: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-04 10:53 ` Peter Krefting
@ 2009-03-04 19:34   ` John Dlugosz
  0 siblings, 0 replies; 33+ messages in thread
From: John Dlugosz @ 2009-03-04 19:34 UTC (permalink / raw)
  To: Peter Krefting; +Cc: git

===Re:===
Yes, but I am unsure whether it [UTF-8] can be set as a thread locale
for the sake of file APIs.
===end===

Why wouldn't it?  If the ANSI forms simply allocate buffers and call
WideCharToMultiByte and MultiByteToWideChar, it should work with
anything those functions handle.  My only concern would be with buffer
length when converting to MultiByte, if it assumes a limit based on 2
bytes max per character.  But, it works with GB18030, which can have
4-byte characters.
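
To make the buffer-length concern concrete, here is a portable sketch of a
UTF-16 to UTF-8 conversion. It is not how the Windows ANSI wrappers are
implemented (that is unknown here); utf16_to_utf8 and the macro
UTF8_MAX_BYTES_PER_UNIT are names invented for the example. The point is
that one 16-bit unit can expand to 3 bytes in UTF-8, so a buffer sized at 2
bytes per unit would be too small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Worst case for UTF-8: a BMP character (one unit) needs 3 bytes; a
 * surrogate pair (two units) needs 4, i.e. 2 bytes per unit. */
#define UTF8_MAX_BYTES_PER_UNIT 3

/* Hypothetical converter: writes no terminator. Returns the number of
 * bytes written, or (size_t)-1 on an unpaired surrogate or when the
 * output buffer is too small. */
size_t utf16_to_utf8(const uint16_t *in, size_t n,
		     unsigned char *out, size_t outsz)
{
	size_t i = 0, o = 0;

	assert(in != NULL && out != NULL);
	while (i < n) {
		uint32_t cp = in[i++];
		size_t need;

		if (cp >= 0xD800 && cp <= 0xDBFF) {
			/* High surrogate: must be followed by a low one. */
			if (i == n || in[i] < 0xDC00 || in[i] > 0xDFFF)
				return (size_t)-1;
			cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i++] - 0xDC00);
		} else if (cp >= 0xDC00 && cp <= 0xDFFF) {
			return (size_t)-1; /* stray low surrogate */
		}
		need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
		if (outsz - o < need)
			return (size_t)-1;
		if (need == 1) {
			out[o++] = (unsigned char)cp;
		} else if (need == 2) {
			out[o++] = (unsigned char)(0xC0 | (cp >> 6));
			out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
		} else if (need == 3) {
			out[o++] = (unsigned char)(0xE0 | (cp >> 12));
			out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
			out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
		} else {
			out[o++] = (unsigned char)(0xF0 | (cp >> 18));
			out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
			out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
			out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
		}
	}
	return o;
}
```

A caller that allocates n * UTF8_MAX_BYTES_PER_UNIT bytes for n input units
never hits the "buffer too small" case; one that allocates n * 2 does as
soon as a CJK character appears.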

It's certainly easy enough to try.

===Re:===
Yeah, but unfortunately it [GB18030] is explicitly documented that it is
only supported in MultiByteToWideChar, WideCharToMultiByte and some text
painting APIs in Windows, i.e. the stdio functions and others may break
horribly.
===end===

Code that works with the other multi-byte "ANSI" character sets, and GBK
in particular, will handle GB18030 "reasonably well" with no changes.
For example, printf ("xxx%sxxx", name), where each 'x' may actually be
any character, will work without problems -- it won't mis-identify the %
in the middle of a 4-byte character.  But printf ("%5s",name) will count
some of the characters in 'name' as two, and print less than 5 of them;
or worse yet, break a character in half.

I can't think of anything that breaks horribly.  Only situations that
involve counting them will have issues.
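
A small sketch of the counting problem, using UTF-8 instead of GB18030 purely
so that it runs anywhere; the byte-counting behaviour of printf is the same.
The helper names are invented for the example. Note that truncation happens
with a precision ("%.5s" or "%.*s"); a plain field width such as "%5s" pads
by bytes rather than cutting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format at most max_bytes BYTES of s into out.
 * printf precision knows nothing about multi-byte characters, so the
 * cut can land in the middle of one. Returns snprintf's result. */
int format_truncated(char *out, size_t outsz, const char *s, int max_bytes)
{
	assert(out != NULL && s != NULL);
	return snprintf(out, outsz, "%.*s", max_bytes, s);
}

/* Hypothetical helper: count UTF-8 characters by skipping continuation
 * bytes (those of the form 10xxxxxx). Assumes well-formed input. */
size_t utf8_char_count(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

For a name that contains a two-byte character in the third position,
format_truncated(buf, sizeof buf, name, 3) leaves the lead byte of that
character at the end of buf without its continuation byte, which is the
"break a character in half" case.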

As empirical evidence, lots of Windows software works fine in China.
You need full GB18030 support to read a newspaper on the web, because
the 4-byte characters are mostly obscure and regional words, but also
proper nouns including the names of some prominent people (Prime
Minister or something like that; I don't remember exactly).  But mostly
you don't encounter them and chug along with GBK and the occasional '?'
where some character did not work.

--John



* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
  2009-03-02  8:47 [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded Peter Krefting
  2009-03-02 10:30 ` Johannes Sixt
  2009-03-03  9:43 ` Dmitry Potapov
@ 2009-03-07 10:38 ` Robin Rosenberg
  2 siblings, 0 replies; 33+ messages in thread
From: Robin Rosenberg @ 2009-03-07 10:38 UTC (permalink / raw)
  To: Peter Krefting; +Cc: git


Slightly related: a new Cygwin (not msysgit-related) version with UTF-8 support was announced. Most notably:

- New setlocale implementation allows to specify POSIX locale strings.
  You can now use, for instance in bash, `export LC_ALL=en_US.UTF-8'.
  The language and territory will be ignored for now, the charset
  will be used by multibyte-related functions.

- UTF-8 filenames are supported now. 

- Support UTF-8 in console window.

This certainly makes it more feasible to interoperate with *nix repos that have non-ASCII metadata and file names.

-- robin


end of thread, other threads:[~2009-03-07 10:59 UTC | newest]

Thread overview: 33+ messages
2009-03-02  8:47 [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded Peter Krefting
2009-03-02 10:30 ` Johannes Sixt
2009-03-02 10:46   ` Peter Krefting
2009-03-02 10:56     ` Johannes Schindelin
2009-03-02 12:03       ` Peter Krefting
     [not found]         ` <a2633edd0903020512u5682e9am203f0faccd0acf6a@mail.gmail.com>
2009-03-02 13:57           ` Peter Krefting
2009-03-02 14:29             ` Thomas Rast
2009-03-02 20:41               ` Peter Krefting
2009-03-03  7:56                 ` Lars Noschinski
2009-03-03 11:54                   ` Peter Krefting
2009-03-03 16:29                     ` Lars Noschinski
2009-03-03 20:59                       ` Robin Rosenberg
2009-03-03  9:47                 ` Dmitry Potapov
2009-03-03 11:48                   ` Peter Krefting
2009-03-03 17:13                     ` Dmitry Potapov
2009-03-04 10:51                       ` Peter Krefting
2009-03-04 14:18                         ` Dmitry Potapov
2009-03-02 12:34     ` Johannes Sixt
2009-03-02 13:12       ` Peter Krefting
2009-03-02 19:58         ` Robin Rosenberg
2009-03-02 20:52           ` Peter Krefting
2009-03-02 21:21             ` Robin Rosenberg
2009-03-03  5:51               ` Peter Krefting
2009-03-03  9:43 ` Dmitry Potapov
2009-03-03 11:56   ` Peter Krefting
2009-03-07 10:38 ` Robin Rosenberg
  -- strict thread matches above, loose matches on Subject: below --
2009-03-03 18:25 John Dlugosz
2009-03-04 10:53 ` Peter Krefting
2009-03-04 19:34   ` John Dlugosz
2009-03-03 19:36 John Dlugosz
2009-03-03 20:39 John Dlugosz
2009-03-03 21:02 ` Dmitry Potapov
2009-03-03 21:56   ` John Dlugosz
