* git archive --format zip utf-8 issues
@ 2012-08-10 21:58 Sven Strickroth
  2012-08-10 22:47 ` Junio C Hamano
  0 siblings, 1 reply; 21+ messages in thread
From: Sven Strickroth @ 2012-08-10 21:58 UTC
To: git

Hi,

when I create a git repository, add a file containing utf-8 characters
or umlauts (like öäü.txt), commit and then export the HEAD revision to a
zip archive using "git archive --format zip -o 1.zip HEAD", the zip file
contains incorrect filenames:

$ unzip -l 1.zip
Archive:  1.zip
4490a6dab1df5404f91ab3eb871f133154bff0bf
  Length      Date    Time    Name
---------  ---------- -----   ----
        6  2012-08-10 23:41   +?+?++.txt
---------                     -------
        6                     1 file

-- 
Best regards,
 Sven Strickroth
 PGP key id F5A9D4C4 @ any key-server

* Re: git archive --format zip utf-8 issues
From: Junio C Hamano @ 2012-08-10 22:47 UTC
To: Sven Strickroth; +Cc: git, René Scharfe

Sven Strickroth <sven.strickroth@tu-clausthal.de> writes:

> when I create a git repository, add a file containing utf-8 characters
> or umlauts (like öäü.txt), commit and then export the HEAD revision to a
> zip archive using "git archive --format zip -o 1.zip HEAD", the zip file
> contains incorrect filenames:

My reading of archive-zip.c seems to suggest that we write out
whatever pathname you have in the tree, so a pathname encoded in
UTF-8 will be literally written out in the resulting zip archive.

Do you know in what encoding the pathnames are _expected_ to be
stored in zip archives?  Random documentation seems to suggest that
there is no standard encoding, e.g. http://docs.python.org/library/zipfile.html
says:

    There is no official file name encoding for ZIP files. If you
    have unicode file names, you must convert them to byte strings
    in your desired encoding before passing them to write(). WinZip
    interprets all file names as encoded in CP437, also known as DOS
    Latin.

which may explain it.

It may not be a bad idea for "git archive --format=zip" to

 (1) check if pathname is a correct UTF-8; and
 (2) check if it can be reencoded to latin-1

and if (and only if) both are true, automatically re-encode the path
to latin-1.

Of course, "git archive --format=zip --path-reencode=utf8-to-latin1"
would be the most generic way to do this.

* Re: git archive --format zip utf-8 issues
From: Sven Strickroth @ 2012-08-10 23:53 UTC
To: git; +Cc: Junio C Hamano, René Scharfe

Am 11.08.2012 00:47 schrieb Junio C Hamano:
> Do you know in what encoding the pathnames are _expected_ to be
> stored in zip archives?

re-encoding to latin1 does not always work and may break double byte
totally (e.g. chinese or japanese).

PKZIP APPNOTE seems to be the zip standard and it specifies a utf-8
flag: http://www.pkware.com/documents/casestudies/APPNOTE.TXT
> A.  Local file header:
> general purpose bit flag: (2 bytes)
> Bit 11: Language encoding flag (EFS).  If this bit is
> set, the filename and comment fields for this file
> must be encoded using UTF-8. (see APPENDIX D)

-- 
Best regards,
 Sven Strickroth
 PGP key id F5A9D4C4 @ any key-server

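To make the quoted APPNOTE passage concrete, here is a minimal sketch of how a reader could test bit 11 in a local file header. It assumes the APPNOTE layout (4-byte signature "PK\003\004", 2 bytes "version needed to extract", then the 2-byte general purpose bit flag, all little-endian). The names `ZIP_FLAG_EFS`, `read_le16` and `local_header_has_utf8_flag` are hypothetical and are not part of git.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 11 of the general purpose bit flag: file name and comment are UTF-8. */
#define ZIP_FLAG_EFS (1u << 11)

/* Read a 16-bit little-endian value, the byte order used by the ZIP format. */
static uint16_t read_le16(const unsigned char *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Return 1 if the local file header at hdr declares a UTF-8 file name,
 * 0 if it does not, and -1 if the buffer is too short or does not start
 * with the local file header signature "PK\003\004".
 */
static int local_header_has_utf8_flag(const unsigned char *hdr, size_t len)
{
	assert(hdr != NULL);
	if (len < 8 || hdr[0] != 'P' || hdr[1] != 'K' ||
	    hdr[2] != 3 || hdr[3] != 4)
		return -1;
	/* Offset 4: version needed to extract; offset 6: flags. */
	return (read_le16(hdr + 6) & ZIP_FLAG_EFS) ? 1 : 0;
}
```

Since bit 11 has the value 0x0800, it shows up on disk as the byte pair 00 08 at offset 6 of the header.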
* Re: git archive --format zip utf-8 issues
From: René Scharfe @ 2012-08-11 20:53 UTC
To: Sven Strickroth; +Cc: git, Junio C Hamano

Am 11.08.2012 01:53, schrieb Sven Strickroth:
> Am 11.08.2012 00:47 schrieb Junio C Hamano:
>> Do you know in what encoding the pathnames are _expected_ to be
>> stored in zip archives?
>
> re-encoding to latin1 does not always work and may break double byte
> totally (e.g. chinese or japanese).
>
> PKZIP APPNOTE seems to be the zip standard and it specifies a utf-8
> flag: http://www.pkware.com/documents/casestudies/APPNOTE.TXT
>> A.  Local file header:
>> general purpose bit flag: (2 bytes)
>> Bit 11: Language encoding flag (EFS).  If this bit is
>> set, the filename and comment fields for this file
>> must be encoded using UTF-8. (see APPENDIX D)

Yes, that's one of the two methods for supporting UTF-8 filenames
described there.

The other method involves writing extra ZIP header fields and was
invented by Info-ZIP.  They don't use it consistently anymore, though
(from zip -h2):

  "Zip now stores UTF-8 in entry path and comment fields on systems
   where UTF-8 char set is default, such as most modern Unix, and
   and on other systems in new extra fields with escaped versions in
   entry path and comment fields for backward compatibility."

René

* Re: git archive --format zip utf-8 issues
From: Junio C Hamano @ 2012-08-12 4:08 UTC
To: René Scharfe; +Cc: Sven Strickroth, git

René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:

>> PKZIP APPNOTE seems to be the zip standard and it specifies a utf-8
>> flag: http://www.pkware.com/documents/casestudies/APPNOTE.TXT
>>> A.  Local file header:
>>> general purpose bit flag: (2 bytes)
>>> Bit 11: Language encoding flag (EFS).  If this bit is
>>> set, the filename and comment fields for this file
>>> must be encoded using UTF-8. (see APPENDIX D)
>
> Yes, that's one of the two methods for supporting UTF-8 filenames
> described there.
>
> The other method involves writing extra ZIP header fields and was
> invented by Info-ZIP.  They don't use it consistently anymore, though
> (from zip -h2):
>
>   "Zip now stores UTF-8 in entry path and comment fields on systems
>    where UTF-8 char set is default, such as most modern Unix, and
>    and on other systems in new extra fields with escaped versions in
>    entry path and comment fields for backward compatibility."

Thanks; so if we adopt one of these methods, the readers that matter
will be happy?  And if so, which one?  Or both?

* Re: git archive --format zip utf-8 issues
From: René Scharfe @ 2012-08-11 20:53 UTC
To: Junio C Hamano; +Cc: Sven Strickroth, git

Am 11.08.2012 00:47, schrieb Junio C Hamano:
> Sven Strickroth <sven.strickroth@tu-clausthal.de> writes:
>
>> when I create a git repository, add a file containing utf-8 characters
>> or umlauts (like öäü.txt), commit and then export the HEAD revision to a
>> zip archive using "git archive --format zip -o 1.zip HEAD", the zip file
>> contains incorrect filenames:
>
> My reading of archive-zip.c seems to suggest that we write out
> whatever pathname you have in the tree, so a pathname encoded in
> UTF-8 will be literally written out in the resulting zip archive.

Sorry for my imperialistic attitude of "ASCII filenames should be enough
for everybody".  Laziness..

> Do you know in what encoding the pathnames are _expected_ to be
> stored in zip archives?  Random documentation seems to suggest that
> there is no standard encoding, e.g. http://docs.python.org/library/zipfile.html
> says:
>
>     There is no official file name encoding for ZIP files. If you
>     have unicode file names, you must convert them to byte strings
>     in your desired encoding before passing them to write(). WinZip
>     interprets all file names as encoded in CP437, also known as DOS
>     Latin.
>
> which may explain it.

http://www.pkware.com/documents/casestudies/APPNOTE.TXT is the standard
document, as Sven noted, and it says that filenames are encoded in code
page 437, or optionally UTF-8 (a later addition).

Discussions like http://stackoverflow.com/questions/106367/ seem to
indicate that at least some archivers use the local code page as well.

> It may not be a bad idea for "git archive --format=zip" to
>
>  (1) check if pathname is a correct UTF-8; and
>  (2) check if it can be reencoded to latin-1
>
> and if (and only if) both are true, automatically re-encode the path
> to latin-1.

The standard says we need to convert to CP437, or to UTF-8, or provide
both versions.  A more interesting question is: What's supported by which
programs?

The ZIP functionality built into Windows 7 doesn't seem to work with
UTF-8 encoded filenames (except for those that only use the ASCII
subset), and to ignore the UTF-8 part if both are given.

Handling umlauts should be possible anyway, because they are on code
page 437, but for other characters we'd have to aim for compatibility
with other programs like Info-ZIP and 7-Zip.

How do we know which encoding was used for a filename?

> Of course, "git archive --format=zip --path-reencode=utf8-to-latin1"
> would be the most generic way to do this.

I really hope we can make do without additional options.

René

* Re: git archive --format zip utf-8 issues
From: Sven Strickroth @ 2012-08-11 21:37 UTC
To: git; +Cc: René Scharfe, Junio C Hamano

Am 11.08.2012 22:53 schrieb René Scharfe:
> The standard says we need to convert to CP437, or to UTF-8, or provide
> both versions.  A more interesting question is: What's supported by which
> programs?
>
> The ZIP functionality built into Windows 7 doesn't seem to work with
> UTF-8 encoded filenames (except for those that only use the ASCII
> subset), and to ignore the UTF-8 part if both are given.

I played a bit with the git source code and found out that

diff --git a/archive-zip.c b/archive-zip.c
index f5af81f..e0ccb4f 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -257,7 +257,7 @@ static int write_zip_entry(struct archiver_args *args,
 	copy_le16(dirent.creator_version,
 		S_ISLNK(mode) || (S_ISREG(mode) && (mode & 0111)) ? 0x0317 : 0);
 	copy_le16(dirent.version, 10);
-	copy_le16(dirent.flags, flags);
+	copy_le16(dirent.flags, flags+2048);
 	copy_le16(dirent.compression_method, method);
 	copy_le16(dirent.mtime, zip_time);
 	copy_le16(dirent.mdate, zip_date);
--
works with 7-zip, however, not with the Windows 7 built-in zip.

If I create a zip file with 7-zip which contains umlauts and other
unicode chars like (國立1-кккк.txt) the Windows 7 built-in zip displays
them correctly, too.

-- 
Best regards,
 Sven Strickroth
 PGP key id F5A9D4C4 @ any key-server

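The experiment above adds 2048, which is the value of bit 11 (`1 << 11`). Addition only gives the intended result while that bit is still clear, so a bitwise OR is the safer way to express it. A minimal sketch of the difference follows; the helper names `set_utf8_flag` and `add_2048` are hypothetical and exist only for this illustration.

```c
#include <assert.h>

#define ZIP_UTF8 (1u << 11)	/* 2048, the APPNOTE language encoding flag */

/* Setting the bit with OR is idempotent: applying it twice changes nothing. */
static unsigned int set_utf8_flag(unsigned int flags)
{
	assert(flags <= 0xffff);	/* the header field is 16 bits wide */
	return flags | ZIP_UTF8;
}

/* Adding 2048 carries into bit 12 if bit 11 was already set. */
static unsigned int add_2048(unsigned int flags)
{
	assert(flags <= 0xffff);
	return flags + 2048;
}
```

Starting from `flags == 8` (the streaming flag) both variants yield 2056, but applying them a second time gives 2056 for the OR and 4104 for the addition.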
* Re: git archive --format zip utf-8 issues
From: Jeff King @ 2012-08-30 22:26 UTC
To: Sven Strickroth; +Cc: git, René Scharfe, Junio C Hamano

On Sat, Aug 11, 2012 at 11:37:05PM +0200, Sven Strickroth wrote:

> Am 11.08.2012 22:53 schrieb René Scharfe:
> > The standard says we need to convert to CP437, or to UTF-8, or provide
> > both versions.  A more interesting question is: What's supported by which
> > programs?
> >
> > The ZIP functionality built into Windows 7 doesn't seem to work with
> > UTF-8 encoded filenames (except for those that only use the ASCII
> > subset), and to ignore the UTF-8 part if both are given.
>
> I played a bit with the git source code and found out that
>
> diff --git a/archive-zip.c b/archive-zip.c
> index f5af81f..e0ccb4f 100644
> --- a/archive-zip.c
> +++ b/archive-zip.c
> @@ -257,7 +257,7 @@ static int write_zip_entry(struct archiver_args *args,
>  	copy_le16(dirent.creator_version,
>  		S_ISLNK(mode) || (S_ISREG(mode) && (mode & 0111)) ? 0x0317 : 0);
>  	copy_le16(dirent.version, 10);
> -	copy_le16(dirent.flags, flags);
> +	copy_le16(dirent.flags, flags+2048);
>  	copy_le16(dirent.compression_method, method);
>  	copy_le16(dirent.mtime, zip_time);
>  	copy_le16(dirent.mdate, zip_date);
> --
> works with 7-zip, however, not with the Windows 7 built-in zip.
>
> If I create a zip file with 7-zip which contains umlauts and other
> unicode chars like (國立1-кккк.txt) the Windows 7 built-in zip displays
> them correctly, too.

Ping on this stalled discussion.

It seems like there are two separate issues here:

  1. Knowing the encoding of pathnames in the repository.

  2. Setting the right flags in zip output.

A full solution would handle both parts, but let's ignore (1) for a
moment, and assume we have utf-8 (or can massage into utf-8 from an
encoding specified by the user).

It seems like just setting the magic utf-8 flag would be the only thing
we need to do, according to the standard. But according to discussions
referenced elsewhere in this thread, that flag was invented only in
2007, so we may be dealing with older implementations (I have no idea
how common they would be; that may be the problem with Windows 7's zip
you are seeing).  We could re-encode to cp437, which the standard
specifies, but apparently some implementations do not respect that
(and use a local code page instead). And it cannot represent all utf-8
characters, anyway.

It sounds like 7-zip has figured out a more portable solution. Can you
show us a sample of 7-zip's output with utf-8 characters to compare to
what git generates? I wonder if it is using a combination of methods.

-Peff

* Re: git archive --format zip utf-8 issues
From: René Scharfe @ 2012-09-04 20:23 UTC
To: Jeff King; +Cc: Sven Strickroth, git, Junio C Hamano

Am 31.08.2012 00:26, schrieb Jeff King:
> Ping on this stalled discussion.

Sorry, I got distracted by other stuff again.  I did some experiments,
though, and here's a preliminary result.

> It seems like there are two separate issues here:
>
>    1. Knowing the encoding of pathnames in the repository.
>
>    2. Setting the right flags in zip output.
>
> A full solution would handle both parts, but let's ignore (1) for a
> moment, and assume we have utf-8 (or can massage into utf-8 from an
> encoding specified by the user).

Yes, good thinking.  Re-encoding may be beneficial for tar files as well,
but we can ignore that point for the moment.

> It seems like just setting the magic utf-8 flag would be the only thing
> we need to do, according to the standard. But according to discussions
> referenced elsewhere in this thread, that flag was invented only in
> 2007, so we may be dealing with older implementations (I have no idea
> how common they would be; that may be the problem with Windows 7's zip
> you are seeing).  We could re-encode to cp437, which the standard
> specifies, but apparently some implementations do not respect that
> (and use a local code page instead). And it cannot represent all utf-8
> characters, anyway.

Yes, we could do that, plus adding an extra field with a UTF-8 version
of the path.  That's the legacy method invented by Info-ZIP.  They
switched to using the new flag on Linux at least, though.

> It sounds like 7-zip has figured out a more portable solution. Can you
> show us a sample of 7-zip's output with utf-8 characters to compare to
> what git generates? I wonder if it is using a combination of methods.

I'm not so sure they produce more portable files.  I created an archive
with files named jaя.txt, smørrebrød.txt, süd.txt and €uro.txt with
7-Zip on Windows 7 and while unzip on Ubuntu 12.04 managed to recreate
the cyrillic character and the Euro symbol, it mangled the slashed o and
the umlaut.

With the following patch I could create archives with git on Linux and
msysgit and extract them flawlessly on Windows with 7-Zip and with
Info-ZIP unzip on Linux, but not with unzip on Windows, where it mangled
all non-ASCII characters.

This gets confusing; it would help to have a compatibility matrix for
all interesting extractors and character classes -- for each proposed
solution or archiver we'd like to imitate.

But now for the patch, which is a bit confusing as well.  I'm curious to
hear about results for more platforms, extractors and character classes.
Based on that we can see if we need to generate the extra fields instead
of relying on the new flag.

-- >8 --
Subject: [PATCH] archive-zip: support UTF-8 paths

Set general purpose flag 11 if we encounter a path that contains
non-ASCII characters.  We assume that all paths are given as UTF-8; no
conversion is done.

The flag seems to be ignored by unzip unless we also mark the archive
entry as coming from a Unix system.  This is done by setting the field
creator_version ("version made by" in the standard[1]) to 0x03NN.

The NN part represents the version of the standard supported by us, and
this patch sets it to 3f (for version 6.3) for Unix paths.  We keep
creator_version set to 0 (FAT filesystem, standard version 0) in the
non-special cases, as before.

But when we declare a file to have a Unix path, then we have to set the
file mode as well, or unzip will extract the files with the permission
set 0000, i.e. inaccessible by all.

[1] http://www.pkware.com/documents/casestudies/APPNOTE.TXT

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
 archive-zip.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/archive-zip.c b/archive-zip.c
index f5af81f..928da1d 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -4,6 +4,8 @@
 #include "cache.h"
 #include "archive.h"
 #include "streaming.h"
+#include "commit.h"
+#include "utf8.h"
 
 static int zip_date;
 static int zip_time;
@@ -16,7 +18,8 @@ static unsigned int zip_dir_offset;
 static unsigned int zip_dir_entries;
 
 #define ZIP_DIRECTORY_MIN_SIZE	(1024 * 1024)
-#define ZIP_STREAM	(8)
+#define ZIP_STREAM	(1 << 3)
+#define ZIP_UTF8	(1 << 11)
 
 struct zip_local_header {
 	unsigned char magic[4];
@@ -173,7 +176,8 @@ static int write_zip_entry(struct archiver_args *args,
 {
 	struct zip_local_header header;
 	struct zip_dir_header dirent;
-	unsigned long attr2;
+	unsigned int creator_version = 0;
+	unsigned long attr2 = 0;
 	unsigned long compressed_size;
 	unsigned long crc;
 	unsigned long direntsize;
@@ -187,6 +191,13 @@ static int write_zip_entry(struct archiver_args *args,
 
 	crc = crc32(0, NULL, 0);
 
+	if (has_non_ascii(path)) {
+		if (is_utf8(path))
+			flags |= ZIP_UTF8;
+		else
+			warning("Path is not valid UTF-8: %s", path);
+	}
+
 	if (pathlen > 0xffff) {
 		return error("path too long (%d chars, SHA1: %s): %s",
 				(int)pathlen, sha1_to_hex(sha1), path);
@@ -204,10 +215,15 @@ static int write_zip_entry(struct archiver_args *args,
 		enum object_type type = sha1_object_info(sha1, &size);
 
 		method = 0;
-		attr2 = S_ISLNK(mode) ? ((mode | 0777) << 16) :
-			(mode & 0111) ? ((mode) << 16) : 0;
 		if (S_ISREG(mode) && args->compression_level != 0 && size > 0)
 			method = 8;
+		if (S_ISLNK(mode) || (mode & 0111) || (flags & ZIP_UTF8)) {
+			creator_version = 0x033f;
+			attr2 = mode;
+			if (S_ISLNK(mode))
+				attr2 |= 0777;
+			attr2 <<= 16;
+		}
 		compressed_size = size;
 
 		if (S_ISREG(mode) && type == OBJ_BLOB && !args->convert &&
@@ -254,8 +270,7 @@ static int write_zip_entry(struct archiver_args *args,
 	}
 
 	copy_le32(dirent.magic, 0x02014b50);
-	copy_le16(dirent.creator_version,
-		S_ISLNK(mode) || (S_ISREG(mode) && (mode & 0111)) ? 0x0317 : 0);
+	copy_le16(dirent.creator_version, creator_version);
 	copy_le16(dirent.version, 10);
 	copy_le16(dirent.flags, flags);
 	copy_le16(dirent.compression_method, method);
-- 
1.7.11.msysgit.1.1.gbf71771

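The numbers in the commit message can be unpacked with a small sketch. It assumes the APPNOTE layout in which the high byte of "version made by" names the host system (0 for FAT, 3 for Unix) and the low byte encodes the specification version as major * 10 + minor, and in which Unix extractors read the file mode from the upper 16 bits of the external attributes. The function names are hypothetical; only the arithmetic mirrors the patch.

```c
#include <assert.h>

#define ZIP_HOST_UNIX 3u

/* Host system 3 (Unix) and spec version 6.3 combine to 0x033f. */
static unsigned int make_creator_version(unsigned int host,
					 unsigned int major,
					 unsigned int minor)
{
	assert(host <= 0xff && major * 10 + minor <= 0xff);
	return (host << 8) | (major * 10 + minor);
}

/* The Unix mode goes into the upper 16 bits of the external attributes. */
static unsigned long make_external_attr(unsigned int mode, int is_symlink)
{
	unsigned long attr = mode;

	if (is_symlink)
		attr |= 0777;	/* symlinks get full permissions, as in the patch */
	return attr << 16;
}
```

For a regular file with mode 0100644 this yields external attributes of 0x81a40000; leaving the field at 0 while claiming a Unix origin is what makes unzip extract files with permission 0000.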
* Re: git archive --format zip utf-8 issues
From: Junio C Hamano @ 2012-09-04 21:03 UTC
To: René Scharfe; +Cc: Jeff King, Sven Strickroth, git

René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:

> But now for the patch, which is a bit confusing as well.  I'm curious to
> hear about results for more platforms, extractors and character classes.
> Based on that we can see if we need to generate the extra fields instead
> of relying on the new flag.

Thanks for keeping the ball rolling.

> Subject: [PATCH] archive-zip: support UTF-8 paths
>
> Set general purpose flag 11 if we encounter a path that contains
> non-ASCII characters.  We assume that all paths are given as UTF-8; no
> conversion is done.
>
> The flag seems to be ignored by unzip unless we also mark the archive
> entry as coming from a Unix system.  This is done by setting the field
> creator_version ("version made by" in the standard[1]) to 0x03NN.
>
> The NN part represents the version of the standard supported by us, and
> this patch sets it to 3f (for version 6.3) for Unix paths.  We keep
> creator_version set to 0 (FAT filesystem, standard version 0) in the
> non-special cases, as before.
>
> But when we declare a file to have a Unix path, then we have to set the
> file mode as well, or unzip will extract the files with the permission
> set 0000, i.e. inaccessible by all.
>
> [1] http://www.pkware.com/documents/casestudies/APPNOTE.TXT
>
> Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
> ---
>  archive-zip.c | 27 +++++++++++++++++++++------
>  1 file changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/archive-zip.c b/archive-zip.c
> index f5af81f..928da1d 100644
> --- a/archive-zip.c
> +++ b/archive-zip.c
> @@ -4,6 +4,8 @@
>  #include "cache.h"
>  #include "archive.h"
>  #include "streaming.h"
> +#include "commit.h"
> +#include "utf8.h"
>
>  static int zip_date;
>  static int zip_time;
> @@ -16,7 +18,8 @@ static unsigned int zip_dir_offset;
>  static unsigned int zip_dir_entries;
>
>  #define ZIP_DIRECTORY_MIN_SIZE	(1024 * 1024)
> -#define ZIP_STREAM	(8)
> +#define ZIP_STREAM	(1 << 3)
> +#define ZIP_UTF8	(1 << 11)
>
>  struct zip_local_header {
>  	unsigned char magic[4];
> @@ -173,7 +176,8 @@ static int write_zip_entry(struct archiver_args *args,
>  {
>  	struct zip_local_header header;
>  	struct zip_dir_header dirent;
> -	unsigned long attr2;
> +	unsigned int creator_version = 0;
> +	unsigned long attr2 = 0;
>  	unsigned long compressed_size;
>  	unsigned long crc;
>  	unsigned long direntsize;
> @@ -187,6 +191,13 @@ static int write_zip_entry(struct archiver_args *args,
>
>  	crc = crc32(0, NULL, 0);
>
> +	if (has_non_ascii(path)) {

Do we want to treat \033 as "ascii" in this codepath?  The function
primarily is used by the log formatter to see if we need 8-bit CTE
when writing out in the e-mail format.

> +		if (is_utf8(path))
> +			flags |= ZIP_UTF8;
> +		else
> +			warning("Path is not valid UTF-8: %s", path);
> +	}
> +
>  	if (pathlen > 0xffff) {
>  		return error("path too long (%d chars, SHA1: %s): %s",
>  				(int)pathlen, sha1_to_hex(sha1), path);
> @@ -204,10 +215,15 @@ static int write_zip_entry(struct archiver_args *args,
>  		enum object_type type = sha1_object_info(sha1, &size);
>
>  		method = 0;
> -		attr2 = S_ISLNK(mode) ? ((mode | 0777) << 16) :
> -			(mode & 0111) ? ((mode) << 16) : 0;
>  		if (S_ISREG(mode) && args->compression_level != 0 && size > 0)
>  			method = 8;
> +		if (S_ISLNK(mode) || (mode & 0111) || (flags & ZIP_UTF8)) {
> +			creator_version = 0x033f;
> +			attr2 = mode;
> +			if (S_ISLNK(mode))
> +				attr2 |= 0777;
> +			attr2 <<= 16;
> +		}
>  		compressed_size = size;
>
>  		if (S_ISREG(mode) && type == OBJ_BLOB && !args->convert &&
> @@ -254,8 +270,7 @@ static int write_zip_entry(struct archiver_args *args,
>  	}
>
>  	copy_le32(dirent.magic, 0x02014b50);
> -	copy_le16(dirent.creator_version,
> -		S_ISLNK(mode) || (S_ISREG(mode) && (mode & 0111)) ? 0x0317 : 0);
> +	copy_le16(dirent.creator_version, creator_version);
>  	copy_le16(dirent.version, 10);
>  	copy_le16(dirent.flags, flags);
>  	copy_le16(dirent.compression_method, method);

* Re: git archive --format zip utf-8 issues
From: René Scharfe @ 2012-09-05 19:36 UTC
To: Junio C Hamano; +Cc: Jeff King, Sven Strickroth, git

Am 04.09.2012 23:03, schrieb Junio C Hamano:
> René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:
>> +	if (has_non_ascii(path)) {
>
> Do we want to treat \033 as "ascii" in this codepath?  The function
> primarily is used by the log formatter to see if we need 8-bit CTE
> when writing out in the e-mail format.

Argh, yes, I'd think so.  The function name misled me.  This won't
matter for compatibility testing, but should be corrected before
inclusion.

Just checked: unzip strips the ESC character when extracting (whether
the UTF-8 flag is set or not) and 7-Zip replaces it with an underscore.
The built-in ZIP extractor of Windows 7 skips such a file; it doesn't
even show up in archive directory listings.

René

* Re: git archive --format zip utf-8 issues
From: René Scharfe @ 2012-09-18 19:40 UTC
To: Junio C Hamano; +Cc: Jeff King, Sven Strickroth, git

[-- Attachment #1: Type: text/plain, Size: 3044 bytes --]

Hello again,

so two weeks have passed, and I've moved at a glacial pace towards a
method for measuring the compatibility of our generated ZIP files.
Sorry, I just keep getting distracted.

Anyway, the idea is to have a bunch of files with names using different
scripts, zip them with several packers (including git archive), unzip
them and compare the result with the original files.  As test corpus I
used files named like the pangrams on this UTF-8 sampler page, the
exact commands are attached:

	http://www.columbia.edu/~fdc/utf8/index.html#quickbrownfox

The numbers below are how many lines the output of diff -ru contains
for this pair of packer and unpacker.  There are 37 files, so the worst
result is 74 lines of difference ("Only in [...]" for both sides), while
0 indicates a perfect score.

Hmm, come to think of it, an empty directory would show up as 37, so
this metric is not ideal.  A better one would be to simply give one
point for each correctly unpacked file.

                                             Windows      Info-ZIP unzip
                              7-Zip  PeaZip  builtin  Linux  msysgit  Windows
7-Zip 9.20                        0       0       46     26       43       43
PeaZip 4.7.1 win64                0       0       46     26       42       42
Info-ZIP zip 3.0 Linux            0       0       72      0       43       43
Info-ZIP zip 3.0 Windows         45      45      n/a      0       43       43
git-master                       72      72       72     60       72       72
git-master-patch1                 0       0       72     60       72       72
git-master-patch2                 0       0       72      0       72       72
git-v1.7.11.msysgit.1            72      72       72     60       72       72
git-v1.7.11.msysgit.1-patch1      0       0       72     60       72       72
git-v1.7.11.msysgit.1-patch2      0       0       72      0       72       72

Info-ZIP's programs don't work too well on Windows.  The built-in
unzipper of Windows 7 even refuses to open the file created by the
Windows version of zip.  Speaking of which, this is the worst of the
unpackers.

With the two patches applied, we can say "use 7-Zip or PeaZip on Windows
and unzip on Linux" and filenames with all tested characters will be
preserved.  I was surprised to see this working fine with msysgit like
that, even though no reencoding is introduced by the patches.

I wonder what 7-Zip and PeaZip do that gives them a slightly nicer score
with the Windows-internal unzipper.  Umlauts, Nordic characters and
accents are preserved by that combination.  It seems that unzip on Linux
fails to unpack exactly these names, so perhaps they employ a dirty
trick like using the local encoding in the ZIP file, which makes it
unportable.

I'll reply with the two patches, which contain basically the same code
as the previous patch, only split up.  The second one declares that
filenames with UTF-8 encoding came from Unix (instead of FAT), which
makes unzip happy.  This, however, implies that we have to store Unix
permissions for these entries, which is a bit ugly.

René

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: pangrams.sh --]
[-- Type: text/plain; charset=windows-1252; name="pangrams.sh", Size: 2536 bytes --]

#!/bin/sh
(
mkdir pangrams
cd pangrams
echo English >"The quick brown fox jumps over the lazy dog"
echo Irish 1 >"An ḃfuil do ċroí ag bualaḋ ó ḟaitíos an ġrá a ṁeall"
echo Irish 2 >"lena ṗóg éada ó ṡlí do leasa ṫú"
echo Irish 3 >"D'ḟuascail Íosa Úrṁac na hÓiġe Beannaiṫe pór"
echo Irish 4 >"Éava agus Áḋaiṁ"
echo Dutch >"Pa's wijze lynx bezag vroom het fikse aquaduct"
echo German 1 >"Falsches Üben von Xylophonmusik quält"
echo German 2 >"jeden größeren Zwerg"
echo Norwegian >"Blåbærsyltetøy"
echo Danish >"Høj bly gom vandt fræk sexquiz på wc"
echo Swedish >"Flygande bäckasiner söka strax hwila på mjuka tuvor"
echo Icelandic >"Sævör grét áðan því úlpan var ónýt"
echo Finnish >"Törkylempijävongahdus"
echo Polish >"Pchnąć w tę łódź jeża lub osiem skrzyń fig"
echo Czech >"Příliš žluťoučký kůň úpěl ďábelské kódy"
echo Slovak 1 >"Starý kôň na hŕbe kníh žuje tíško povädnuté ruže"
echo Slovak 2 >"na stĺpe sa ďateľ učí kvákať novú ódu o živote"
echo monotonic Greek >"ξεσκεπάζω την ψυχοφθόρα βδελυγμία"
echo polytonic Greek >"ξεσκεπάζω τὴν ψυχοφθόρα βδελυγμία"
echo Russian >"Съешь же ещё этих мягких французских булок да выпей чаю"
echo Bulgarian 1 >"Жълтата дюля беше щастлива"
echo Bulgarian 2 >"че пухът, който цъфна, замръзна като гьон"
echo Northern Sami >"Vuol Ruoŧa geđggiid leat máŋga luosa ja čuovžža"
echo Hungarian >"Árvíztűrő tükörfúrógép"
echo Spanish 1 >"El pingüino Wenceslao hizo kilómetros bajo exhaustiva"
echo Spanish 2 >"lluvia y frío añoraba a su querido cachorro"
echo Portuguese 1 >"O próximo vôo à noite sobre o Atlântico"
echo Portuguese 2 >"põe freqüentemente o único médico"
echo French 1 >"Les naïfs ægithales hâtifs pondant à Noël où il gèle"
echo French 2 >"sont sûrs d'être déçus en voyant leurs drôles"
echo French 3 >"d'œufs abîmés"
echo Esperanto >"Eĥoŝanĝo ĉiuĵaŭde"
echo Hebrew >"זה כיף סתם לשמוע איך תנצח קרפד עץ טוב בגן"
echo Hiragana 1 >"いろはにほへど ちりぬるを"
echo Hiragana 2 >"わがよたれぞ つねならむ"
echo Hiragana 3 >"うゐのおくやま けふこえて"
echo Hiragana 4 >"あさきゆめみじ ゑひもせず"
)

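The alternative metric suggested in this message, one point for each correctly unpacked file, could be sketched as follows. This is a hypothetical helper and not part of the attached script. It assumes the expected and the extracted file names are already available as arrays of byte strings, and it compares them byte for byte, so names that look identical but use a different Unicode normalization would count as mismatches.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count how many of the expected file names also appear, byte for byte,
 * among the extracted names.  Each expected name scores at most once.
 */
static size_t count_preserved_names(const char *const *expected,
				    size_t n_expected,
				    const char *const *extracted,
				    size_t n_extracted)
{
	size_t score = 0;

	assert(expected != NULL && extracted != NULL);
	for (size_t i = 0; i < n_expected; i++) {
		for (size_t j = 0; j < n_extracted; j++) {
			if (!strcmp(expected[i], extracted[j])) {
				score++;
				break;
			}
		}
	}
	return score;
}
```

With the 37 pangram files a perfect packer and unpacker pair would score 37, and an empty output directory would score 0 instead of landing in the middle of the scale.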
* [PATCH 1/2] archive-zip: support UTF-8 paths
  2012-09-18 19:40 ` René Scharfe
@ 2012-09-18 19:46 ` René Scharfe
  2012-09-18 19:53 ` [PATCH 2/2] archive-zip: declare creator to be Unix for " René Scharfe
  ` (3 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: René Scharfe @ 2012-09-18 19:46 UTC (permalink / raw)
  Cc: Junio C Hamano, Jeff King, Sven Strickroth, git

Set general purpose flag 11 if we encounter a path that contains
non-ASCII characters. We assume that all paths are given as UTF-8; no
conversion is done.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
Changes from previous version: Stop using has_non_ascii(), which does
slightly too much for our purposes, and split off creator version change
into a separate patch.

 archive-zip.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/archive-zip.c b/archive-zip.c
index f5af81f..0f763e8 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -4,6 +4,7 @@
 #include "cache.h"
 #include "archive.h"
 #include "streaming.h"
+#include "utf8.h"

 static int zip_date;
 static int zip_time;
@@ -16,7 +17,8 @@ static unsigned int zip_dir_offset;
 static unsigned int zip_dir_entries;

 #define ZIP_DIRECTORY_MIN_SIZE	(1024 * 1024)
-#define ZIP_STREAM	(8)
+#define ZIP_STREAM	(1 << 3)
+#define ZIP_UTF8	(1 << 11)

 struct zip_local_header {
 	unsigned char magic[4];
@@ -164,6 +166,17 @@ static void set_zip_header_data_desc(struct zip_local_header *header,
 	copy_le32(header->size, size);
 }

+static int has_only_ascii(const char *s)
+{
+	for (;;) {
+		int c = *s++;
+		if (c == '\0')
+			return 1;
+		if (!isascii(c))
+			return 0;
+	}
+}
+
 #define STREAM_BUFFER_SIZE (1024 * 16)

 static int write_zip_entry(struct archiver_args *args,
@@ -187,6 +200,13 @@ static int write_zip_entry(struct archiver_args *args,

 	crc = crc32(0, NULL, 0);

+	if (!has_only_ascii(path)) {
+		if (is_utf8(path))
+			flags |= ZIP_UTF8;
+		else
+			warning("Path is not valid UTF-8: %s", path);
+	}
+
 	if (pathlen > 0xffff) {
 		return error("path too long (%d chars, SHA1: %s): %s",
 				(int)pathlen, sha1_to_hex(sha1), path);
--
1.7.12
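For readers who want to try the flag logic of patch 1 outside of git,
here is a minimal standalone sketch. It is not the code from the patch:
is_valid_utf8() is a hypothetical stand-in for git's is_utf8(), and
zip_path_flags() is a made-up helper name. Only the ZIP_UTF8 value
(general purpose bit 11) and the has_only_ascii() idea come from the
patch itself.

```c
#include <assert.h>

#define ZIP_UTF8 (1u << 11)	/* general purpose bit 11: name is UTF-8 */

/* Return 1 if every byte of s is 7-bit ASCII. */
static int has_only_ascii(const char *s)
{
	const unsigned char *p = (const unsigned char *)s;

	assert(s);
	for (; *p; p++)
		if (*p & 0x80)
			return 0;
	return 1;
}

/*
 * Hypothetical stand-in for git's is_utf8(): checks lead/continuation
 * bytes and rejects overlong forms, surrogates and values > U+10FFFF.
 */
static int is_valid_utf8(const char *s)
{
	const unsigned char *p = (const unsigned char *)s;

	assert(s);
	while (*p) {
		unsigned int cp, min;
		int extra;

		if (*p < 0x80) {
			p++;
			continue;
		} else if ((*p & 0xe0) == 0xc0) {
			extra = 1; cp = *p & 0x1f; min = 0x80;
		} else if ((*p & 0xf0) == 0xe0) {
			extra = 2; cp = *p & 0x0f; min = 0x800;
		} else if ((*p & 0xf8) == 0xf0) {
			extra = 3; cp = *p & 0x07; min = 0x10000;
		} else {
			return 0;
		}
		p++;
		while (extra--) {
			/* a NUL here also fails, so we never run past the end */
			if ((*p & 0xc0) != 0x80)
				return 0;
			cp = (cp << 6) | (*p & 0x3f);
			p++;
		}
		if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
			return 0;
	}
	return 1;
}

/* Add ZIP_UTF8 to flags only for non-ASCII paths that are valid UTF-8. */
static unsigned int zip_path_flags(unsigned int flags, const char *path)
{
	if (!has_only_ascii(path) && is_valid_utf8(path))
		flags |= ZIP_UTF8;
	return flags;
}
```

Pure ASCII names keep their flags untouched, so archives without
international file names stay byte-for-byte the same in this respect.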
* [PATCH 2/2] archive-zip: declare creator to be Unix for UTF-8 paths
  2012-09-18 19:40 ` René Scharfe
  2012-09-18 19:46 ` [PATCH 1/2] archive-zip: support UTF-8 paths René Scharfe
@ 2012-09-18 19:53 ` René Scharfe
  2012-09-18 20:24 ` git archive --format zip utf-8 issues René Scharfe
  ` (2 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: René Scharfe @ 2012-09-18 19:53 UTC (permalink / raw)
  Cc: Junio C Hamano, Jeff King, Sven Strickroth, git

The UTF-8 flag seems to be ignored by unzip unless we also mark the
archive entry as coming from a Unix system. This is done by setting the
field creator_version ("version made by" in the standard[1]) to 0x03NN.

The NN part represents the version of the standard supported by us, and
this patch sets it to 3f (for version 6.3) for Unix paths. We keep
creator_version set to 0 (FAT filesystem, standard version 0) in the
non-special cases, as before.

But when we declare a file to have a Unix path, then we have to set the
file mode as well, or unzip will extract the files with the permission
set 0000, i.e. inaccessible by all.

[1] http://www.pkware.com/documents/casestudies/APPNOTE.TXT
---
No sign-off for this, yet. Perhaps there is a better way to convince
unzip to respect the flag? And if not, do we need to offer umask
settings for ZIP as well as we have for tar? And perhaps declare all
files as being from a Unix filesystem, for consistency?

 archive-zip.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/archive-zip.c b/archive-zip.c
index 0f763e8..e9b3dc9 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -186,7 +186,8 @@ static int write_zip_entry(struct archiver_args *args,
 {
 	struct zip_local_header header;
 	struct zip_dir_header dirent;
-	unsigned long attr2;
+	unsigned int creator_version = 0;
+	unsigned long attr2 = 0;
 	unsigned long compressed_size;
 	unsigned long crc;
 	unsigned long direntsize;
@@ -224,10 +225,15 @@ static int write_zip_entry(struct archiver_args *args,
 		enum object_type type = sha1_object_info(sha1, &size);

 		method = 0;
-		attr2 = S_ISLNK(mode) ? ((mode | 0777) << 16) :
-			(mode & 0111) ? ((mode) << 16) : 0;
 		if (S_ISREG(mode) && args->compression_level != 0 && size > 0)
 			method = 8;
+		if (S_ISLNK(mode) || (mode & 0111) || (flags & ZIP_UTF8)) {
+			creator_version = 0x033f;
+			attr2 = mode;
+			if (S_ISLNK(mode))
+				attr2 |= 0777;
+			attr2 <<= 16;
+		}
 		compressed_size = size;

 		if (S_ISREG(mode) && type == OBJ_BLOB && !args->convert &&
@@ -274,8 +280,7 @@ static int write_zip_entry(struct archiver_args *args,
 	}

 	copy_le32(dirent.magic, 0x02014b50);
-	copy_le16(dirent.creator_version,
-		S_ISLNK(mode) || (S_ISREG(mode) && (mode & 0111)) ? 0x0317 : 0);
+	copy_le16(dirent.creator_version, creator_version);
 	copy_le16(dirent.version, 10);
 	copy_le16(dirent.flags, flags);
 	copy_le16(dirent.compression_method, method);
--
1.7.12
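To make the field layout from patch 2 easier to follow, here is a small
standalone sketch of how "version made by" and the external attributes
relate to a Unix mode. The struct and the function name unix_entry_info()
are hypothetical, and the mode constants are spelled out locally instead
of using S_ISLNK(); the values 0x033f and the shift by 16 are taken from
the patch.

```c
#include <assert.h>

#define ZIP_UTF8	(1u << 11)
#define MODE_TYPE_MASK	0170000u	/* file type bits of a Unix mode */
#define MODE_SYMLINK	0120000u	/* symbolic link, as in git tree entries */

struct unix_info {
	unsigned int creator_version;	/* "version made by" */
	unsigned long attr2;		/* external file attributes */
};

/*
 * High byte 3 of creator_version means "Unix"; low byte 0x3f (63)
 * means version 6.3 of the specification.  The Unix mode goes into
 * the upper 16 bits of the external attributes.
 */
static struct unix_info unix_entry_info(unsigned int mode, unsigned int flags)
{
	struct unix_info info = { 0, 0 };
	int is_link = (mode & MODE_TYPE_MASK) == MODE_SYMLINK;

	assert(mode <= 0177777u);	/* must fit into 16 bits before shifting */
	if (is_link || (mode & 0111) || (flags & ZIP_UTF8)) {
		info.creator_version = 0x033f;
		info.attr2 = mode;
		if (is_link)
			info.attr2 |= 0777;
		info.attr2 <<= 16;
	}
	return info;
}
```

A plain 0100644 file with an ASCII name keeps creator_version 0 and no
attributes, as before; the same file with the UTF-8 flag set gets Unix
as creator and its mode recorded, so it is not extracted with
permissions 0000.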
* Re: git archive --format zip utf-8 issues
  2012-09-18 19:40 ` René Scharfe
  2012-09-18 19:46 ` [PATCH 1/2] archive-zip: support UTF-8 paths René Scharfe
  2012-09-18 19:53 ` [PATCH 2/2] archive-zip: declare creator to be Unix for " René Scharfe
@ 2012-09-18 20:24 ` René Scharfe
  2012-09-18 21:12 ` Junio C Hamano
  2012-09-24 15:56 ` [PATCH 3/2] archive-zip: write extended timestamp René Scharfe
  4 siblings, 0 replies; 21+ messages in thread
From: René Scharfe @ 2012-09-18 20:24 UTC (permalink / raw)
  Cc: Junio C Hamano, Jeff King, Sven Strickroth, git

[-- Attachment #1: Type: text/plain, Size: 576 bytes --]

On 18.09.2012 21:40, René Scharfe wrote:
>                                   Windows  Info-ZIP unzip
>                    7-Zip  PeaZip  builtin  Linux  msysgit  Windows
> git-master-patch1      0       0       72     60       72       72
> git-master-patch2      0       0       72      0       72       72

Oh, and when I wrote Windows, I meant a German Windows 7 Home Premium
x64, and Linux is Ubuntu 12.04 x86.

I've also attached generated ZIP files for the two archivers above, for
easier testing. Let's see if they make it through.

René

[-- Attachment #2: pangrams-git-master-patch2.zip --]
[-- Type: application/x-zip-compressed, Size: 7432 bytes --]

[-- Attachment #3: pangrams-git-master-patch1.zip --]
[-- Type: application/x-zip-compressed, Size: 7432 bytes --]
* Re: git archive --format zip utf-8 issues
  2012-09-18 19:40 ` René Scharfe
  ` (2 preceding siblings ...)
  2012-09-18 20:24 ` git archive --format zip utf-8 issues René Scharfe
@ 2012-09-18 21:12 ` Junio C Hamano
  2012-09-20 22:00 ` René Scharfe
  2012-09-24 15:56 ` [PATCH 3/2] archive-zip: write extended timestamp René Scharfe
  4 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2012-09-18 21:12 UTC (permalink / raw)
  To: René Scharfe; +Cc: Jeff King, Sven Strickroth, git

René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:

>                                          Windows  Info-ZIP unzip
>                           7-Zip  PeaZip  builtin  Linux  msysgit  Windows
> 7-Zip 9.20                    0       0       46     26       43       43
> PeaZip 4.7.1 win64            0       0       46     26       42       42
> Info-ZIP zip 3.0 Linux        0       0       72      0       43       43
> Info-ZIP zip 3.0 Windows     45      45      n/a      0       43       43
> ...
> I wonder what 7-Zip and PeaZip do that gives them a slightly nicer
> score with the Windows-internal unzipper. Umlauts, Nordic characters
> and accents are preserved by that combination. It seems that unzip on
> Linux fails to unpack exactly these names, so perhaps they employ a
> dirty trick like using the local encoding in the ZIP file, which makes
> it unportable.
> ...

Thanks for this work.

It is kind of surprising that "Windows builtin" has very poor score
extracting from the output of Zip tools running on Windows (I am
looking at 46, 46 and n/a over there). If you tell it to create an
archive from its disk and then extract from it, I wonder what would
happen.

Does this result mean that practically nobody uses Zip archive with
exotic letters in paths on that platform? I am not talking about
developers and savvy people who know where to download third-party
Zip archivers and how to install them. I am imagining a grandma who
received an archive full of photos of her grandchild in her Outlook
Express or GMail inbox, clicked the attachment to download it, and
is trying to view the photo inside.
* Re: git archive --format zip utf-8 issues
  2012-09-18 21:12 ` Junio C Hamano
@ 2012-09-20 22:00 ` René Scharfe
  2012-09-24 15:56 ` René Scharfe
  0 siblings, 1 reply; 21+ messages in thread
From: René Scharfe @ 2012-09-20 22:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Sven Strickroth, git

On 18.09.2012 23:12, Junio C Hamano wrote:
> René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:
>
>>                                          Windows  Info-ZIP unzip
>>                           7-Zip  PeaZip  builtin  Linux  msysgit  Windows
>> 7-Zip 9.20                    0       0       46     26       43       43
>> PeaZip 4.7.1 win64            0       0       46     26       42       42
>> Info-ZIP zip 3.0 Linux        0       0       72      0       43       43
>> Info-ZIP zip 3.0 Windows     45      45      n/a      0       43       43
>
> It is kind of surprising that "Windows builtin" has very poor score
> extracting from the output of Zip tools running on Windows (I am
> looking at 46, 46 and n/a over there). If you tell it to create an
> archive from its disk and then extract from it, I wonder what would
> happen.

I didn't include it as a packer because it refused to archive the
pangrams directory due to illegal characters in one of the filenames.
When I just tried a bit harder, I had to delete all but 14 files with
Latin script, accents etc. before I could zip the directory. I'll
include these results in the next round.

It uses codepage 850 on my system (MSDOS Latin 1). I don't expect this
to be portable.

> Does this result mean that practically nobody uses Zip archive with
> exotic letters in paths on that platform? I am not talking about
> developers and savvy people who know where to download third-party
> Zip archivers and how to install them. I am imagining a grandma who
> received an archive full of photos of her grandchild in her Outlook
> Express or GMail inbox, clicked the attachment to download it, and
> is trying to view the photo inside.

Not necessarily. Photos often have names like img_0123.jpg etc., which
are handled just fine. And all family members probably use the same
codepage on their computers, so they're less likely to run into this
problem.

By the way, I found this bug asking for codepage support in unzip:

	https://bugs.launchpad.net/ubuntu/+source/unzip/+bug/580961

Multiple codepages seem to be used for ZIP files in the wild; none of
them is supported by unzip on Linux, which only accepts ASCII or UTF-8.

René
* Re: git archive --format zip utf-8 issues
  2012-09-20 22:00 ` René Scharfe
@ 2012-09-24 15:56 ` René Scharfe
  2012-09-24 18:13 ` Junio C Hamano
  0 siblings, 1 reply; 21+ messages in thread
From: René Scharfe @ 2012-09-24 15:56 UTC (permalink / raw)
  Cc: Junio C Hamano, Jeff King, Sven Strickroth, git

[-- Attachment #1: Type: text/plain, Size: 2058 bytes --]

Hi,

I found a way to make unzip respect the UTF-8 flag in ZIP files:
Apparently (from looking at the source) an extended field needs to be
present in order for it to even look at general purpose flag 11. I sent
a patch to add an extended timestamp field that fits the bill.

Here are new numbers on ZIP international filename compatibility:

                   7-Zip   PeaZip  builtin    unzip    unzip    unzip       7z
                 Windows  Windows  Windows    Linux    mingw  Windows    Linux
git Linux              1        1        1        7        1        1        1
git 1 Linux           37       37        1        7        1        1       37
git 2 Linux           37       37        1       37        1        1       37
git 3 Linux           37       37        1       37       15       15       37
git mingw              1        1        1        7        1        1        1
git 1 mingw           37       37        1        7        1        1       37
git 2 mingw           37       37        1       37        1        1       37
git 3 mingw           37       37        1       37       15       15       37
7-Zip Windows         37       37       14       24       15       15       24
PeaZip Windows        37       37       14       24       15       15       24
zip Linux             37       37        1       37       15       15       37
zip Windows           14       14        0       37       15       15        1
builtin Windows       14       14       14        1       14       14        1

The test corpus still consists of 37 files based on the pangrams on the
following web page:

	http://www.columbia.edu/~fdc/utf8/index.html#quickbrownfox

The files can be created using the attached script. It also provides a
check command to count the files with correct names, and the results of
that for different ZIP extractors are given in the table.

The built-in ZIP functionality on Windows was only able to pack 14 of
the 37 files, which explains the low score across the board for this
packer.

"git 1" is the patch "archive-zip: support UTF-8 paths" added, which
lets archive-zip make use of the UTF-8 flag. "git 2" is "git 1" plus
the patch "archive-zip: declare creator to be Unix for UTF-8 paths".
Both have been posted before. "git 3" is "git 1" plus the new patch
"archive-zip: write extended timestamp".

Let's drop patch 2 (Unix as creator) and keep patches 1 (UTF-8 flag)
and 3 (mtime field) to make archive-zip record non-ASCII filenames in
a portable way. It's not perfect, but I don't know how to do any
better given that Windows' built-in ZIP functionality expects
filenames in the local code page and with an international audience
for projects distributing ZIP files.

René

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: pangrams.sh --]
[-- Type: text/plain; charset=windows-1252; name="pangrams.sh", Size: 2367 bytes --]

#!/bin/sh

files() {
	cat <<EOF
pangrams/わがよたれぞ つねならむ
pangrams/うゐのおくやま けふこえて
pangrams/いろはにほへど ちりぬるを
pangrams/あさきゆめみじ ゑひもせず
pangrams/An ḃfuil do ċroí ag bualaḋ ó ḟaitíos an ġrá a ṁeall
pangrams/Árvíztűrő tükörfúrógép
pangrams/Blåbærsyltetøy
pangrams/D'ḟuascail Íosa Úrṁac na hÓiġe Beannaiṫe pór
pangrams/d'œufs abîmés
pangrams/Éava agus Áḋaiṁ
pangrams/Eĥoŝanĝo ĉiuĵaŭde
pangrams/El pingüino Wenceslao hizo kilómetros bajo exhaustiva
pangrams/Falsches Üben von Xylophonmusik quält
pangrams/Flygande bäckasiner söka strax hwila på mjuka tuvor
pangrams/Høj bly gom vandt fræk sexquiz på wc
pangrams/jeden größeren Zwerg
pangrams/lena ṗóg éada ó ṡlí do leasa ṫú
pangrams/Les naïfs ægithales hâtifs pondant à Noël où il gèle
pangrams/lluvia y frío añoraba a su querido cachorro
pangrams/na stĺpe sa ďateľ učí kvákať novú ódu o živote
pangrams/O próximo vôo à noite sobre o Atlântico
pangrams/Pa's wijze lynx bezag vroom het fikse aquaduct
pangrams/Pchnąć w tę łódź jeża lub osiem skrzyń fig
pangrams/põe freqüentemente o único médico
pangrams/Příliš žluťoučký kůň úpěl ďábelské kódy
pangrams/Sævör grét áðan því úlpan var ónýt
pangrams/sont sûrs d'être déçus en voyant leurs drôles
pangrams/Starý kôň na hŕbe kníh žuje tíško povädnuté ruže
pangrams/The quick brown fox jumps over the lazy dog
pangrams/Törkylempijävongahdus
pangrams/Vuol Ruoŧa geđggiid leat máŋga luosa ja čuovžža
pangrams/זה כיף סתם לשמוע איך תנצח קרפד עץ טוב בגן
pangrams/ξεσκεπάζω την ψυχοφθόρα βδελυγμία
pangrams/ξεσκεπάζω τὴν ψυχοφθόρα βδελυγμία
pangrams/Жълтата дюля беше щастлива
pangrams/Съешь же ещё этих мягких французских булок да выпей чаю
pangrams/че пухът, който цъфна, замръзна като гьон
EOF
}

case "$1" in
create)
	mkdir -p pangrams
	files | while read file
	do
		touch "$file"
	done
	;;
check)
	files | while read file
	do
		test -f "$file" && echo "$file"
	done | wc -l
	;;
*)
	echo "Usage: $0 create | check" >&2
	exit 1
	;;
esac
* Re: git archive --format zip utf-8 issues
  2012-09-24 15:56 ` René Scharfe
@ 2012-09-24 18:13 ` Junio C Hamano
  0 siblings, 0 replies; 21+ messages in thread
From: Junio C Hamano @ 2012-09-24 18:13 UTC (permalink / raw)
  To: René Scharfe; +Cc: Jeff King, Sven Strickroth, git

René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:

> "git 1" is the patch "archive-zip: support UTF-8 paths" added, which
> lets archive-zip make use of the UTF-8 flag. "git 2" is "git 1" plus
> the patch "archive-zip: declare creator to be Unix for UTF-8
> paths". Both have been posted before. "git 3" is "git 1" plus the new
> patch "archive-zip: write extended timestamp".
>
> Let's drop patch 2 (Unix as creator) and keep patches 1 (UTF-8 flag)
> and 3 (mtime field) to make archive-zip record non-ASCII filenames in
> a portable way. It's not perfect, but I don't know how to do any
> better given that Windows' built-in ZIP functionality expects
> filenames in the local code page and with an international audience
> for projects distributing ZIP files.

Thanks.
* [PATCH 3/2] archive-zip: write extended timestamp
  2012-09-18 19:40 ` René Scharfe
  ` (3 preceding siblings ...)
  2012-09-18 21:12 ` Junio C Hamano
@ 2012-09-24 15:56 ` René Scharfe
  4 siblings, 0 replies; 21+ messages in thread
From: René Scharfe @ 2012-09-24 15:56 UTC (permalink / raw)
  Cc: Junio C Hamano, Jeff King, Sven Strickroth, git

File modification times in ZIP files are encoded in DOS format: local
time with a granularity of two seconds. Add an extra field to all
archive entries to also record the mtime in Unix' fashion, as UTC with
a granularity of one second.

This has the desirable side-effect of convincing Info-ZIP unzip 6.00
to respect general purpose flag 11, which is used to indicate that a
file name is encoded in UTF-8.

Any extra field would do, actually, but the extended timestamp is a
reasonably small one (22 bytes per entry). Archives created by Info-ZIP
zip 3.0 contain it, too (but with ctime and atime as well).

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
 archive-zip.c | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/archive-zip.c b/archive-zip.c
index 0f763e8..55f66b4 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -76,6 +76,14 @@ struct zip_dir_trailer {
 	unsigned char _end[1];
 };

+struct zip_extra_mtime {
+	unsigned char magic[2];
+	unsigned char extra_size[2];
+	unsigned char flags[1];
+	unsigned char mtime[4];
+	unsigned char _end[1];
+};
+
 /*
  * On ARM, padding is added at the end of the struct, so a simple
  * sizeof(struct ...) reports two bytes more than the payload size
@@ -85,6 +93,9 @@ struct zip_dir_trailer {
 #define ZIP_DATA_DESC_SIZE	offsetof(struct zip_data_desc, _end)
 #define ZIP_DIR_HEADER_SIZE	offsetof(struct zip_dir_header, _end)
 #define ZIP_DIR_TRAILER_SIZE	offsetof(struct zip_dir_trailer, _end)
+#define ZIP_EXTRA_MTIME_SIZE	offsetof(struct zip_extra_mtime, _end)
+#define ZIP_EXTRA_MTIME_PAYLOAD_SIZE \
+	(ZIP_EXTRA_MTIME_SIZE - offsetof(struct zip_extra_mtime, flags))

 static void copy_le16(unsigned char *dest, unsigned int n)
 {
@@ -186,6 +197,7 @@ static int write_zip_entry(struct archiver_args *args,
 {
 	struct zip_local_header header;
 	struct zip_dir_header dirent;
+	struct zip_extra_mtime extra;
 	unsigned long attr2;
 	unsigned long compressed_size;
 	unsigned long crc;
@@ -266,8 +278,13 @@ static int write_zip_entry(struct archiver_args *args,
 		}
 	}

+	copy_le16(extra.magic, 0x5455);
+	copy_le16(extra.extra_size, ZIP_EXTRA_MTIME_PAYLOAD_SIZE);
+	extra.flags[0] = 1;	/* just mtime */
+	copy_le32(extra.mtime, args->time);
+
 	/* make sure we have enough free space in the dictionary */
-	direntsize = ZIP_DIR_HEADER_SIZE + pathlen;
+	direntsize = ZIP_DIR_HEADER_SIZE + pathlen + ZIP_EXTRA_MTIME_SIZE;
 	while (zip_dir_size < zip_dir_offset + direntsize) {
 		zip_dir_size += ZIP_DIRECTORY_MIN_SIZE;
 		zip_dir = xrealloc(zip_dir, zip_dir_size);
@@ -283,7 +300,7 @@ static int write_zip_entry(struct archiver_args *args,
 	copy_le16(dirent.mdate, zip_date);
 	set_zip_dir_data_desc(&dirent, size, compressed_size, crc);
 	copy_le16(dirent.filename_length, pathlen);
-	copy_le16(dirent.extra_length, 0);
+	copy_le16(dirent.extra_length, ZIP_EXTRA_MTIME_SIZE);
 	copy_le16(dirent.comment_length, 0);
 	copy_le16(dirent.disk, 0);
 	copy_le16(dirent.attr1, 0);
@@ -301,11 +318,13 @@ static int write_zip_entry(struct archiver_args *args,
 	else
 		set_zip_header_data_desc(&header, size, compressed_size, crc);
 	copy_le16(header.filename_length, pathlen);
-	copy_le16(header.extra_length, 0);
+	copy_le16(header.extra_length, ZIP_EXTRA_MTIME_SIZE);
 	write_or_die(1, &header, ZIP_LOCAL_HEADER_SIZE);
 	zip_offset += ZIP_LOCAL_HEADER_SIZE;
 	write_or_die(1, path, pathlen);
 	zip_offset += pathlen;
+	write_or_die(1, &extra, ZIP_EXTRA_MTIME_SIZE);
+	zip_offset += ZIP_EXTRA_MTIME_SIZE;
 	if (stream && method == 0) {
 		unsigned char buf[STREAM_BUFFER_SIZE];
 		ssize_t readlen;
@@ -402,6 +421,8 @@ static int write_zip_entry(struct archiver_args *args,
 	zip_dir_offset += ZIP_DIR_HEADER_SIZE;
 	memcpy(zip_dir + zip_dir_offset, path, pathlen);
 	zip_dir_offset += pathlen;
+	memcpy(zip_dir + zip_dir_offset, &extra, ZIP_EXTRA_MTIME_SIZE);
+	zip_dir_offset += ZIP_EXTRA_MTIME_SIZE;
 	zip_dir_entries++;

 	return 0;
--
1.7.12
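As an illustration of the two time encodings that patch 3 contrasts,
here is a standalone sketch that serializes an mtime-only extended
timestamp field into a byte buffer and packs a DOS time and date. The
helper names (put_le16, put_le32, build_extra_mtime, dos_time, dos_date)
are made up for this example; the header ID 0x5455, the flags value 1
and the field order follow the struct in the patch, and the DOS layout
is the usual 5/6/5 and 7/4/5 bit packing.

```c
#include <assert.h>
#include <stddef.h>

#define EXTRA_MTIME_SIZE 9	/* 2 id + 2 size + 1 flags + 4 mtime */

static void put_le16(unsigned char *dest, unsigned int n)
{
	dest[0] = n & 0xff;
	dest[1] = (n >> 8) & 0xff;
}

static void put_le32(unsigned char *dest, unsigned long n)
{
	put_le16(dest, n & 0xffff);
	put_le16(dest + 2, (n >> 16) & 0xffff);
}

/* Fill buf with an extended timestamp field that carries only the mtime. */
static size_t build_extra_mtime(unsigned char *buf, unsigned long mtime)
{
	assert(buf);
	put_le16(buf, 0x5455);		/* header ID, "UT" on disk */
	put_le16(buf + 2, 5);		/* payload size: flags + mtime */
	buf[4] = 1;			/* bit 0: mtime present */
	put_le32(buf + 5, mtime);	/* seconds since the Unix epoch, UTC */
	return EXTRA_MTIME_SIZE;
}

/* DOS time: hours, minutes and seconds/2 packed into 16 bits. */
static unsigned int dos_time(int hour, int min, int sec)
{
	return (hour << 11) | (min << 5) | (sec / 2);
}

/* DOS date: years since 1980, month and day packed into 16 bits. */
static unsigned int dos_date(int year, int month, int day)
{
	return ((year - 1980) << 9) | (month << 5) | day;
}
```

Because the seconds are halved, dos_time() cannot tell 23:41:06 from
23:41:07, while the four mtime bytes of the extra field keep the full
one-second resolution.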
* Re: git archive --format zip utf-8 issues
  2012-08-11 20:53 ` René Scharfe
  2012-08-11 21:37 ` Sven Strickroth
@ 2012-08-12 4:27 ` Junio C Hamano
  1 sibling, 0 replies; 21+ messages in thread
From: Junio C Hamano @ 2012-08-12 4:27 UTC (permalink / raw)
  To: René Scharfe; +Cc: Sven Strickroth, git

René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:

> ... A more interesting question is: What's supported by
> which programs?

Yes, that is the most interesting question.

>> Of course, "git archive --format=zip --path-reencode=utf8-to-latin1"
>> would be the most generic way to do this.
>
> I really hope we can make do without additional options.

We need to at least know the path encoding used in the tree objects,
and I'd be OK with a solution that assumes a single encoding is used
for the entire tree.

We would eventually need to also know the encoding used on the local
working tree (i.e. in what encoding paths are returned from readdir()
and the pathspec the user gives us from the command line), and iconv
it to the tree objects encoding for the project when creating a
cache_entry object to be fed to add_to_index(), and iconv it back from
the tree objects encoding to the working tree encoding in
write_entry(), but that is a longer term direction.

For now, in order to address the immediate issue, we only need the
tree object encoding, which should default to UTF-8 for
interoperability. So "git archive --format=zip
--in-object-path-encoding=big5" for a project whose tree object
pathnames are in that encoding (and we always record paths in UTF-8
when writing zipfiles) should be the minimal that we need for now.
Optionally, with a configuration variable i18n.inObjectPathEncoding
(as opposed to the eventual i18n.worktreePathEncoding) set to big5,
users of such a project can say "git archive --format=zip" without the
"--in-object-path-encoding" option.

Considering that zip is a format meant for exchange, I'd think we
would be fine to always write in UTF-8 and leaving the readers
responsible for converting the pathname while extracting. If a major
zip extractor is incapable of handling UTF-8 (or even if capable it is
cumbersome, for that matter), we may end up having to add
"--in-archive-path-encoding=UTF-8" option to "git archive", with
associated "zip.archivePathEncoding" variable, though.
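The conversion step discussed above (tree object encoding in, UTF-8 out)
could look roughly like the following sketch based on POSIX iconv(3).
The helper name path_to_utf8() is hypothetical and nothing like it is
part of the patches in this thread; the encoding name is whatever the
proposed option or configuration variable would supply. It assumes an
iconv implementation where the input pointer is declared char **, as in
glibc.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: convert path from the encoding named by `from`
 * to UTF-8.  Returns a malloc'ed NUL-terminated string that the caller
 * must free, or NULL if the encoding is unknown or the path cannot be
 * converted.
 */
static char *path_to_utf8(const char *path, const char *from)
{
	size_t inleft, outsize, outleft;
	char *in, *out, *result;
	iconv_t cd;

	assert(path && from);
	cd = iconv_open("UTF-8", from);
	if (cd == (iconv_t)-1)
		return NULL;

	inleft = strlen(path);
	outsize = inleft * 4 + 1;	/* generous guess; overflow is an error */
	outleft = outsize - 1;		/* keep room for the terminating NUL */
	result = malloc(outsize);
	if (!result) {
		iconv_close(cd);
		return NULL;
	}

	in = (char *)path;	/* iconv() does not modify the input bytes */
	out = result;
	if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
		free(result);
		result = NULL;
	} else {
		*out = '\0';
	}
	iconv_close(cd);
	return result;
}
```

With something like this, a path stored as Latin-1 in a tree object
would be converted once before it is written to the archive, and the
UTF-8 flag could then be set unconditionally for non-ASCII names.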