git.vger.kernel.org archive mirror
* [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
@ 2009-03-02  8:47 Peter Krefting
  2009-03-02 10:30 ` Johannes Sixt
                   ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Peter Krefting @ 2009-03-02  8:47 UTC
  To: git

When opening a file through open() or fopen(), the path passed is
UTF-8 encoded. To handle this on Windows, we need to convert the
path string to UTF-16 and use the Unicode-based interface.
---
Windows does support file names using arbitrary Unicode characters; you just
need to use its wchar_t interfaces instead of the char ones (the char ones
just get converted into wchar_t at the API level anyway, for the same
reasons). This is the beginning of support for UTF-8 file names in Git on
Windows.

Since there is no real file system abstraction beyond using stdio (AFAIK), I 
need to hack it by replacing fopen (and open). Probably opendir/readdir as 
well (might be trickier), and possibly even hack around main() to parse the 
wchar_t command-line instead of the char copy.

This will lose all chances of Windows 9x compatibility, but I don't know if
there are any attempts at supporting it anyway.

Please note that MultiByteToWideChar() will reject any invalid UTF-8
strings; perhaps it should just fall back to a regular open()/fopen() in
that case?
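
One way to frame the fallback question above is to check the path for
well-formed UTF-8 before choosing an interface. The following is a minimal,
portable sketch of that decision only. The names utf8_is_valid(),
choose_open_path() and enum open_path are hypothetical and not part of the
patch, and the actual calls to _wopen()/open() are left out.

```c
#include <stddef.h>

/* Return 1 if s is a well-formed, NUL-terminated UTF-8 string, else 0.
 * Rejects overlong forms, UTF-16 surrogates and values above U+10FFFF. */
int utf8_is_valid(const char *s)
{
	const unsigned char *p = (const unsigned char *)s;

	while (*p) {
		unsigned long cp, min;
		int trail;

		if (*p < 0x80) {
			p++;			/* plain ASCII byte */
			continue;
		} else if ((*p & 0xE0) == 0xC0) {
			cp = *p & 0x1F; trail = 1; min = 0x80;
		} else if ((*p & 0xF0) == 0xE0) {
			cp = *p & 0x0F; trail = 2; min = 0x800;
		} else if ((*p & 0xF8) == 0xF0) {
			cp = *p & 0x07; trail = 3; min = 0x10000;
		} else {
			return 0;		/* stray continuation or invalid lead byte */
		}

		p++;
		while (trail--) {
			if ((*p & 0xC0) != 0x80)
				return 0;	/* truncated sequence (also catches NUL) */
			cp = (cp << 6) | (*p++ & 0x3F);
		}
		if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
			return 0;		/* overlong, out of range, or surrogate */
	}
	return 1;
}

/* Hypothetical policy names, used only for this illustration. */
enum open_path { OPEN_WIDE, OPEN_NARROW };

/* Use the wchar_t interface only when the name is valid UTF-8. */
enum open_path choose_open_path(const char *filename)
{
	return utf8_is_valid(filename) ? OPEN_WIDE : OPEN_NARROW;
}
```

A wrapper such as mingw_open() could convert the name and call the wide
interface when this returns OPEN_WIDE, and pass the bytes through unchanged
otherwise. Whether a silent fallback is desirable is a separate question.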

No Signed-off-by line since this is unfinished; I am just presenting rough
sketches of an idea.

  compat/mingw.c |   60 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
  compat/mingw.h |    3 ++
  2 files changed, 62 insertions(+), 1 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index e25cb4f..8b19b80 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -9,13 +9,30 @@ int mingw_open (const char *filename, int oflags, ...)
  {
  	va_list args;
  	unsigned mode;
+	wchar_t *unicode_filename;
+	int unicode_filename_len;
  	va_start(args, oflags);
  	mode = va_arg(args, int);
  	va_end(args);

  	if (!strcmp(filename, "/dev/null"))
  		filename = "nul";
-	int fd = open(filename, oflags, mode);
+
+	unicode_filename_len = MultiByteToWideChar(CP_UTF8, 0, filename, -1, NULL, 0);
+	if (0 == unicode_filename_len) {
+		errno = EINVAL;
+		return -1;
+	}
+
+	unicode_filename = xmalloc(unicode_filename_len * sizeof (wchar_t));
+	if (NULL == unicode_filename) {
+		errno = ENOMEM;
+		return -1;
+	}
+	MultiByteToWideChar(CP_UTF8, 0, filename, -1, unicode_filename, unicode_filename_len);
+	int fd = _wopen(unicode_filename, oflags, mode);
+	free(unicode_filename);
+
  	if (fd < 0 && (oflags & O_CREAT) && errno == EACCES) {
  		DWORD attrs = GetFileAttributes(filename);
  		if (attrs != INVALID_FILE_ATTRIBUTES && (attrs & FILE_ATTRIBUTE_DIRECTORY))
@@ -24,6 +41,47 @@ int mingw_open (const char *filename, int oflags, ...)
  	return fd;
  }

+FILE *mingw_fopen (const char *filename, const char *mode)
+{
+	wchar_t *unicode_filename, *unicode_mode;
+	int unicode_filename_len, unicode_mode_len;
+	FILE *fh;
+
+	unicode_filename_len = MultiByteToWideChar(CP_UTF8, 0, filename, -1, NULL, 0);
+	if (0 == unicode_filename_len) {
+		errno = EINVAL;
+		return NULL;
+	}
+
+	unicode_filename = xmalloc(unicode_filename_len * sizeof (wchar_t));
+	if (NULL == unicode_filename) {
+		errno = ENOMEM;
+		return NULL;
+	}
+	MultiByteToWideChar(CP_UTF8, 0, filename, -1, unicode_filename, unicode_filename_len);
+
+	unicode_mode_len = MultiByteToWideChar(CP_UTF8, 0, mode, -1, NULL, 0);
+	if (0 == unicode_mode_len) {
+		free(unicode_filename);
+		errno = EINVAL;
+		return NULL;
+	}
+
+	unicode_mode = xmalloc(unicode_mode_len * sizeof (wchar_t));
+	if (NULL == unicode_mode) {
+		free(unicode_filename);
+		errno = ENOMEM;
+		return NULL;
+	}
+	MultiByteToWideChar(CP_UTF8, 0, mode, -1, unicode_mode, unicode_mode_len);
+
+	fh = _wfopen(unicode_filename, unicode_mode);
+	free(unicode_filename);
+	free(unicode_mode);
+
+	return fh;
+}
+
  static inline time_t filetime_to_time_t(const FILETIME *ft)
  {
  	long long winTime = ((long long)ft->dwHighDateTime << 32) + ft->dwLowDateTime;
diff --git a/compat/mingw.h b/compat/mingw.h
index 4f275cb..235df0a 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -142,6 +142,9 @@ int sigaction(int sig, struct sigaction *in, struct sigaction *out);
  int mingw_open (const char *filename, int oflags, ...);
  #define open mingw_open

+FILE *mingw_fopen (const char *filename, const char *mode);
+#define fopen mingw_fopen
+
  char *mingw_getcwd(char *pointer, int len);
  #define getcwd mingw_getcwd

-- 
1.6.0.2.1172.ga5ed0

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
@ 2009-03-03 18:25 John Dlugosz
  2009-03-04 10:53 ` Peter Krefting
  0 siblings, 1 reply; 33+ messages in thread
From: John Dlugosz @ 2009-03-03 18:25 UTC
  To: git; +Cc: peter

===Re:===
The other way would be to keep the char* APIs but convert to the Windows
locale encoding ("ANSI codepage"), but that will break horribly as not all
file names that can be used on a file system can be represented as such.
===end===

Actually, UTF-8 is a valid code page on Windows.  The code page ID is
65001.  So, if you set the process code page to that, =and= set the file
system API's code page to follow rather than using the OEM code page
(the default), it should work just fine.

Also, there is a national code page that =will= represent all file names
on the system and is supported:  That is the Chinese GB18030, code page
54936.  That has every character that Unicode does, just encoded
differently to be forward compatible with GBK.  That is fully supported
by Windows, as it is required by law in order to sell in Chinese markets.

Let me know if I can be of help.  I know character set stuff and Win32
fairly well.

--John



TradeStation Group, Inc. is a publicly-traded holding company (NASDAQ GS: TRAD) of three operating subsidiaries, TradeStation Securities, Inc. (Member NYSE, FINRA, SIPC and NFA), TradeStation Technologies, Inc., a trading software and subscription company, and TradeStation Europe Limited, a United Kingdom, FSA-authorized introducing brokerage firm. None of these companies provides trading or investment advice, recommendations or endorsements of any kind. The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited.
  If you received this in error, please contact the sender and delete the material from any computer.

* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
@ 2009-03-03 19:36 John Dlugosz
  0 siblings, 0 replies; 33+ messages in thread
From: John Dlugosz @ 2009-03-03 19:36 UTC
  To: git; +Cc: j.sixt

===Re:===
You cannot expect users to switch the locale. For example, I have to test
our software with Japanese settings: I *cannot* switch to UTF-8 just
because of git.

Can you set the local codepage per program? (I don't know.) It might help
here, but it doesn't help in all cases, particularly in certain pipelines:
===end===

Yes, you can.  The code page can be set per thread.  The function call
is:

	SetThreadLocale (lcid);

where lcid is just 65001 for UTF-8.  (The other fields in the LCID are
high-order bits and all zero for no sublanguage and default sort order).

When a thread is created, it starts with the system default thread
locale.  So call SetThreadLocale on every thread you create.  In
particular, realize that the new thread does not inherit this from the
creating thread.

Meanwhile... the file I/O functions don't use the same code page.  The
encoding of file names on a floppy disk or whatnot was historically done
using the "OEM code page", and when a different code page is used for
text editing, that shouldn't break compatibility.  So, all functions
exported from Kernel32.dll that accept or return file names use a
separate setting, and setting the locale as shown above will not affect
them.  This might be a source of confusion for those experimenting with
it.

So, also make a call to
	
	SetFileApisToANSI();

This affects the entire process, not just the thread.

So much for specifying UTF-8 file names in Windows.  A related issue is
the console input and output of same.  I don't know if the sh program
that is part of msys or Cygwin does anything to the console window it is
using, but each console window can have its own code page as well.  The
default for 8-bit API (char*'s) is also the OEM character set, not the
so-called ANSI character set that is specified with SetThreadLocale.
I've not experimented with setting this (and restoring it) within a
program invoked in that console.  But if you use the 16-bit API for
console I/O, it is not a problem and works regardless of how the user
chose to set it.  To make it even more confusing, the console doesn't
respect the UTF-8 setting if the font is not set properly too.

--John



* Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
@ 2009-03-03 20:39 John Dlugosz
  2009-03-03 21:02 ` Dmitry Potapov
  0 siblings, 1 reply; 33+ messages in thread
From: John Dlugosz @ 2009-03-03 20:39 UTC
  To: git; +Cc: dpotapov

Re: AFAIK, Microsoft C runtime library does not support UTF-8,

Actually, here is a clip from the runtime library source code:

        tmode = _textmode(fh);

        switch(tmode) {
            case __IOINFO_TM_UTF8 :
                /* For a UTF-8 file, we need 2 buffers, because after reading we
                   need to convert it into UNICODE - MultiByteToWideChar doesn't do
                   in-place conversions. */

                /* MultiByte To WideChar conversion may double the size of the
                   buffer required & hence we divide cnt by 2 */

                /*
                 * Since we are reading UTF8 stream, cnt bytes read may vary
                 * from cnt wchar_t characters to cnt/4 wchar_t characters. For
                 * this reason if we need to read cnt characters, we will
                 * allocate MBCS buffer of cnt. In case cnt is 0, we will
                 * have 4 as minimum value. This will make sure we don't
                 * overflow for reading from pipe case.
                 *
                 *
                 * In this case the numbers of wchar_t characters that we can
                 * read is cnt/2. This means that the buffer size that we will
                 * require is cnt/2.
                 */

                /* For UTF8 we want the count to be an even number */

This is in the _read(fd, buffer, count) function, and shows that it will
in fact read UTF-8 and automatically transform it to UTF-16LE
transparently.  The documentation for _open explains this feature.
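
To make the buffer-size remarks in that comment concrete, here is a small
portable sketch (not the runtime library's code) that counts how many UTF-16
code units a run of UTF-8 bytes turns into. The function name
utf16_units_for_utf8() is made up for this illustration, and the input is
assumed to be well-formed UTF-8.

```c
#include <stddef.h>

/* Count the UTF-16 code units produced by n bytes of UTF-8.
 * Assumes well-formed input: continuation bytes (10xxxxxx) add nothing,
 * a 4-byte lead (11110xxx) becomes a surrogate pair (2 units), and every
 * other lead byte becomes a single unit. */
size_t utf16_units_for_utf8(const unsigned char *buf, size_t n)
{
	size_t i, units = 0;

	for (i = 0; i < n; i++) {
		if ((buf[i] & 0xC0) == 0x80)
			continue;	/* continuation byte: no new unit */
		units += ((buf[i] & 0xF8) == 0xF0) ? 2 : 1;
	}
	return units;
}
```

Four ASCII bytes give four units, while a three-byte sequence gives only
one, so the number of wchar_t values per byte read varies with the content.
That is why the reader has to size its buffers conservatively.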

Meanwhile, a quick look at _mbslen() etc. shows that they are
implemented, and will handle UTF-8 encoded text as variable-length char*
just fine as long as suitable tables are loaded in its locale.  An
internal header shows macros for generating the lead-byte information as
needed by that table.

Now, the default when a program starts is to use the "C" locale.  The
locale argument to setlocale can take a form ".code_page", so calling

	setlocale (LC_CTYPE, ".65001");

should do the trick.  Assuming, that is, that you don't hit macros that
assume that characters are never multibyte.  So define the preprocessor
symbol _MBCS when you compile.
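
Since setlocale() returns NULL when it cannot honor a request, and whether a
given runtime accepts the ".65001" form is not guaranteed, it seems prudent
to check the result. A minimal sketch, with the hypothetical helper names
try_ctype_locale() and pick_ctype_locale():

```c
#include <locale.h>
#include <stddef.h>

/* Try to switch LC_CTYPE to `name`. Returns 1 on success. On failure
 * setlocale() leaves the program's locale unchanged and we return 0. */
int try_ctype_locale(const char *name)
{
	return setlocale(LC_CTYPE, name) != NULL;
}

/* Try the UTF-8 code page form first, then fall back to the "C" locale.
 * Returns the name that was accepted, or NULL if neither was.
 * Assumption: ".65001" is only meaningful to a Windows C runtime; other
 * platforms may reject it or interpret it differently. */
const char *pick_ctype_locale(void)
{
	if (try_ctype_locale(".65001"))
		return ".65001";
	if (try_ctype_locale("C"))
		return "C";
	return NULL;
}
```

A program could call pick_ctype_locale() once at startup and log which name
was accepted, so that a silent failure does not go unnoticed.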

Older versions might not work right because MBCS (multibyte character
strings) was only actually implemented to DBCS (double-byte).  That is,
a single lead byte would be followed by a second byte, and no other
cases are provided for.  But, GB18030 has up to 4 bytes in a single
character.  It might still not be completely "clean" though because
GB18030 has a "double double" nature to it.  Just like assuming 16-bit
characters period mostly works with surrogate pairs even if you didn't
code full UTF-16 support, DBCS code will see a 4-byte GB18030 character
as two double byte characters.  So it gets the len (in characters)
wrong, and might still break up what is supposed to be a single
character.  So it really needs some improvement from the historical
DBCS-only code to work properly.  
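
The miscount can be shown with a small portable sketch. Both functions
below are hypothetical illustrations, not code from any runtime library.
The first counts the way DBCS-only code would; the second also recognizes
the four-byte GB18030 form, whose second and fourth bytes are in the range
0x30..0x39. Well-formed input is assumed.

```c
#include <stddef.h>

/* DBCS-style count: a lead byte 0x81..0xFE always consumes one more byte. */
size_t count_chars_dbcs(const unsigned char *s, size_t n)
{
	size_t i = 0, chars = 0;

	while (i < n) {
		i += (s[i] >= 0x81 && s[i] <= 0xFE && i + 1 < n) ? 2 : 1;
		chars++;
	}
	return chars;
}

/* GB18030-aware count: a lead byte followed by a byte in 0x30..0x39
 * starts a four-byte sequence; any other lead byte starts a two-byte one. */
size_t count_chars_gb18030(const unsigned char *s, size_t n)
{
	size_t i = 0, chars = 0;

	while (i < n) {
		if (s[i] < 0x81 || s[i] > 0xFE || i + 1 >= n)
			i += 1;		/* single byte (or truncated tail) */
		else if (s[i + 1] >= 0x30 && s[i + 1] <= 0x39 && i + 3 < n)
			i += 4;		/* four-byte sequence */
		else
			i += 2;		/* two-byte sequence */
		chars++;
	}
	return chars;
}
```

For the four bytes 0x81 0x30 0x81 0x30, the DBCS count is two and the
GB18030-aware count is one, which is the "double double" effect described
above. For ordinary two-byte text the two functions agree.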

Anyway, if UTF-8 really doesn't work with MBCS functions acceptably
well, and the goal is to allow passage of all characters through the
program, then set the program to use Chinese.  GB18030 is =fully=
supported and is just another (albeit strange) encoding for Unicode.

As for what
	fprintf (stderr, "unable to open %s", path);
will do, it will have no problem copying the contents of path to the
output stream no matter how it is encoded.  The result will be sent to
stderr, which may be autotranslating the local code page to UTF-16 or
UTF-8, but by default just feeds the stream of bytes to the console
window's 8-bit API, which has its own code page setting.
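
The first half of that claim is easy to check in isolation. This sketch
formats the same message into a buffer with snprintf() and verifies that
the bytes of the path come out unchanged. The helper name
message_preserves_path() is made up for this example, and it says nothing
about what the console does with the bytes afterwards.

```c
#include <stdio.h>
#include <string.h>

/* Format the error message into buf and report whether the bytes of
 * path appear unchanged right after the fixed prefix. */
int message_preserves_path(const char *path, char *buf, size_t buflen)
{
	static const char prefix[] = "unable to open ";
	int n = snprintf(buf, buflen, "unable to open %s", path);

	if (n < 0 || (size_t)n >= buflen)
		return 0;	/* formatting failed or output was truncated */
	/* Compare including the terminating NUL. */
	return memcmp(buf + strlen(prefix), path, strlen(path) + 1) == 0;
}
```

The narrow %s conversion copies bytes up to the terminating NUL, so the
result is the same for UTF-8 names and for names in any other 8-bit
encoding.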

Personally, I have printf'ed UTF-8 encoded text to standard output.  It
looks OK if the console is also set to UTF-8.

--John
(please excuse the footer; it's not my idea)





end of thread, other threads:[~2009-03-07 10:59 UTC | newest]

Thread overview: 33+ messages
2009-03-02  8:47 [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded Peter Krefting
2009-03-02 10:30 ` Johannes Sixt
2009-03-02 10:46   ` Peter Krefting
2009-03-02 10:56     ` Johannes Schindelin
2009-03-02 12:03       ` Peter Krefting
     [not found]         ` <a2633edd0903020512u5682e9am203f0faccd0acf6a@mail.gmail.com>
2009-03-02 13:57           ` Peter Krefting
2009-03-02 14:29             ` Thomas Rast
2009-03-02 20:41               ` Peter Krefting
2009-03-03  7:56                 ` Lars Noschinski
2009-03-03 11:54                   ` Peter Krefting
2009-03-03 16:29                     ` Lars Noschinski
2009-03-03 20:59                       ` Robin Rosenberg
2009-03-03  9:47                 ` Dmitry Potapov
2009-03-03 11:48                   ` Peter Krefting
2009-03-03 17:13                     ` Dmitry Potapov
2009-03-04 10:51                       ` Peter Krefting
2009-03-04 14:18                         ` Dmitry Potapov
2009-03-02 12:34     ` Johannes Sixt
2009-03-02 13:12       ` Peter Krefting
2009-03-02 19:58         ` Robin Rosenberg
2009-03-02 20:52           ` Peter Krefting
2009-03-02 21:21             ` Robin Rosenberg
2009-03-03  5:51               ` Peter Krefting
2009-03-03  9:43 ` Dmitry Potapov
2009-03-03 11:56   ` Peter Krefting
2009-03-07 10:38 ` Robin Rosenberg
  -- strict thread matches above, loose matches on Subject: below --
2009-03-03 18:25 John Dlugosz
2009-03-04 10:53 ` Peter Krefting
2009-03-04 19:34   ` John Dlugosz
2009-03-03 19:36 John Dlugosz
2009-03-03 20:39 John Dlugosz
2009-03-03 21:02 ` Dmitry Potapov
2009-03-03 21:56   ` John Dlugosz
