git.vger.kernel.org archive mirror
From: "René Scharfe" <rene.scharfe@lsrfire.ath.cx>
To: Junio C Hamano <gitster@pobox.com>
Cc: Sven Strickroth <sven.strickroth@tu-clausthal.de>, git@vger.kernel.org
Subject: Re: git archive --format zip utf-8 issues
Date: Sat, 11 Aug 2012 22:53:29 +0200
Message-ID: <5026C649.2090700@lsrfire.ath.cx>
In-Reply-To: <7vtxwagy9f.fsf@alter.siamese.dyndns.org>

On 11.08.2012 00:47, Junio C Hamano wrote:
> Sven Strickroth <sven.strickroth@tu-clausthal.de> writes:
>
>> when I create a git repository, add a file containing utf-8 characters
>> or umlauts (like öäü.txt), commit and then export the HEAD revision to a
>> zip archive using "git archive --format zip -o 1.zip HEAD", the zip file
>> contains incorrect filenames:
>
> My reading of archive-zip.c seems to suggest that we write out
> whatever pathname you have in the tree, so a pathname encoded in
> UTF-8 will be literally written out in the resulting zip archive.

Sorry for my imperialistic attitude of "ASCII filenames should be enough 
for everybody".  Laziness..

> Do you know in what encoding the pathnames are _expected_ to be
> stored in zip archives?  Random documentation seems to suggest that
> there is no standard encoding, e.g. http://docs.python.org/library/zipfile.html
> says:
>
>      There is no official file name encoding for ZIP files. If you
>      have unicode file names, you must convert them to byte strings
>      in your desired encoding before passing them to write(). WinZip
>      interprets all file names as encoded in CP437, also known as DOS
>      Latin.
>
> which may explain it.

http://www.pkware.com/documents/casestudies/APPNOTE.TXT is the standard 
document, as Sven noted, and it says that filenames are encoded in code 
page 437, or optionally UTF-8 (a later addition).  Discussions like 
http://stackoverflow.com/questions/106367/ seem to indicate that at 
least some archivers use the local code page as well.
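
For illustration, the UTF-8 option in APPNOTE.TXT is signalled per entry by
bit 11 (0x0800) of the general purpose bit flag.  The sketch below is not
git code; it only uses Python 3's zipfile module to write one entry and
read the flag back.  The helper name utf8_flag_is_set is made up for this
example.

```python
import io
import zipfile

# General purpose bit 11: "language encoding flag" in APPNOTE.TXT.
UTF8_FLAG = 0x0800


def utf8_flag_is_set(filename: str) -> bool:
    """Write one empty entry in memory and report whether bit 11 is set.

    Hypothetical helper for illustration; it shows what Python 3's
    zipfile does, not what git archive does.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(filename, b"")
    # Re-open the archive and inspect the flags stored for the entry.
    with zipfile.ZipFile(buf) as zf:
        info = zf.infolist()[0]
    return bool(info.flag_bits & UTF8_FLAG)
```

Python 3's zipfile sets the flag only when the name is not pure ASCII, so
an ASCII-only name is stored without it.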

> It may not be a bad idea for "git archive --format=zip" to
>
>   (1) check if pathname is a correct UTF-8; and
>   (2) check if it can be reencoded to latin-1
>
> and if (and only if) both are true, automatically re-encode the path
> to latin-1.

The standard says we need to convert to CP437, or to UTF-8, or provide 
both versions. A more interesting question is: What's supported by which 
programs?

The ZIP functionality built into Windows 7 doesn't seem to work with 
UTF-8 encoded filenames (except for those that only use the ASCII 
subset), and it seems to ignore the UTF-8 part if both are given.  Handling 
umlauts should be possible anyway, because they are in code page 437, 
but for other characters we'd have to aim for compatibility with other 
programs like Info-ZIP and 7-Zip.
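
A rough way to see which names fit into code page 437 is to try the
conversion and catch the failure.  This is only a sketch using Python's
built-in cp437 codec; fits_cp437 is a hypothetical name, and nothing here
reflects how archive-zip.c would do it.

```python
def fits_cp437(name: str) -> bool:
    """Return True if every character of name exists in code page 437."""
    try:
        name.encode("cp437")
    except UnicodeEncodeError:
        # At least one character has no CP437 representation.
        return False
    return True


def to_cp437(name: str) -> bytes:
    """Encode name to CP437; raises UnicodeEncodeError if it does not fit."""
    return name.encode("cp437")
```

German umlauts pass this check, while names in scripts outside the code
page (CJK, for example) do not and would need the UTF-8 route.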

How do we know which encoding was used for a filename?
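
One possible heuristic, sketched here under the assumption that paths in
the tree are plain byte strings with no recorded encoding: treat pure
ASCII as safe, accept anything that decodes as UTF-8, and give up
otherwise.  The function name and the three labels are invented for this
example, and a byte string that happens to be valid UTF-8 could still
have been meant as something else.

```python
def guess_path_encoding(path: bytes) -> str:
    """Guess how a path was encoded: "ascii", "utf-8" or "unknown".

    Heuristic only: it checks whether the bytes are well-formed, not
    what the committer actually intended.
    """
    if path.isascii():
        return "ascii"
    try:
        path.decode("utf-8")
    except UnicodeDecodeError:
        # Not well-formed UTF-8, e.g. a latin-1 or CP437 encoded name.
        return "unknown"
    return "utf-8"
```

A latin-1 encoded "öäü.txt" lands in the "unknown" bucket because its
bytes are not well-formed UTF-8.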

> Of course, "git archive --format=zip --path-reencode=utf8-to-latin1"
> would be the most generic way to do this.

I really hope we can make do without additional options.

René

