From: "René Scharfe" <rene.scharfe@lsrfire.ath.cx>
To: Junio C Hamano <gitster@pobox.com>
Cc: Jeff King <peff@peff.net>,
	Sven Strickroth <sven.strickroth@tu-clausthal.de>,
	git@vger.kernel.org
Subject: Re: git archive --format zip utf-8 issues
Date: Tue, 18 Sep 2012 21:40:57 +0200
Message-ID: <5058CE49.3070108@lsrfire.ath.cx>
In-Reply-To: <5047A9C0.9020200@lsrfire.ath.cx>
[-- Attachment #1: Type: text/plain, Size: 3044 bytes --]
Hello again,

Two weeks have passed, and I've moved at a glacial pace towards a
method for measuring the compatibility of our generated ZIP files.
Sorry, I just keep getting distracted.
Anyway, the idea is to have a bunch of files with names using different
scripts, zip them with several packers (including git archive), unzip
them with several unpackers and compare the results with the original
files.

As a test corpus I used files named after the pangrams on this UTF-8
sampler page; the exact commands are attached:

	http://www.columbia.edu/~fdc/utf8/index.html#quickbrownfox
The numbers below are how many lines the output of diff -ru contains
for each pair of packer and unpacker. There are 37 files, so the worst
result is 74 lines of difference ("Only in [...]" for both sides), while
0 indicates a perfect score.

Hmm, come to think of it, an empty directory would show up as 37, so
this metric is not ideal. A better one would be to simply give one
point for each correctly unpacked file.
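
A minimal sketch of both metrics, assuming the original files sit in
one flat directory and the unpacker's output in another. The function
names diff_lines and score_unpacked are made up for illustration and
are not part of the attached script.

```sh
#!/bin/sh

# diff_lines ORIG UNPACKED
# The metric used for the table: number of lines printed by diff -ru.
# 0 is perfect; a file that comes back under a different name counts
# twice ("Only in" for each side).
diff_lines() {
	diff -ru "$1" "$2" | wc -l | tr -d ' '
}

# score_unpacked ORIG UNPACKED
# The alternative metric: one point for each file in ORIG that exists
# in UNPACKED under the same name and with the same content.
score_unpacked() {
	orig=$1
	unpacked=$2
	score=0
	for f in "$orig"/*
	do
		test -f "$f" || continue
		name=${f##*/}
		if test -f "$unpacked/$name" && cmp -s "$f" "$unpacked/$name"
		then
			score=$((score + 1))
		fi
	done
	echo "$score"
}
```

With an empty output directory, diff_lines reports one line per
original file even though nothing was unpacked, while score_unpacked
reports 0.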
                                             Windows       Info-ZIP unzip
Packer                        7-Zip  PeaZip  builtin  Linux  msysgit  Windows
7-Zip 9.20                        0       0       46     26       43       43
PeaZip 4.7.1 win64                0       0       46     26       42       42
Info-ZIP zip 3.0 Linux            0       0       72      0       43       43
Info-ZIP zip 3.0 Windows         45      45      n/a      0       43       43
git-master                       72      72       72     60       72       72
git-master-patch1                 0       0       72     60       72       72
git-master-patch2                 0       0       72      0       72       72
git-v1.7.11.msysgit.1            72      72       72     60       72       72
git-v1.7.11.msysgit.1-patch1      0       0       72     60       72       72
git-v1.7.11.msysgit.1-patch2      0       0       72      0       72       72
Info-ZIP's programs don't work too well on Windows. The built-in
unzipper of Windows 7 even refuses to open the file created by the
Windows version of zip. Speaking of which, that built-in unzipper is
the worst of the unpackers.
With the two patches applied, we can say "use 7-Zip or PeaZip on Windows
and unzip on Linux" and filenames with all tested characters will be
preserved. I was surprised to see this working fine with msysgit like
that, even though no reencoding is introduced by the patches.
I wonder what 7-Zip and PeaZip do that gives them a slightly nicer score
with the Windows-internal unzipper. Umlauts, Nordic characters and
accents are preserved by that combination. It seems that unzip on Linux
fails to unpack exactly these names, so perhaps they employ a dirty
trick like using the local encoding in the ZIP file, which makes it
unportable.
I'll reply with the two patches, which contain basically the same code
as the previous patch, only split up. The second one declares that
entries with UTF-8 encoded filenames came from Unix (instead of FAT),
which makes unzip happy. This, however, implies that we record Unix
permissions for these entries, which is a bit ugly.
René
[-- Attachment #2: pangrams.sh --]
[-- Type: text/plain; charset=windows-1252; name="pangrams.sh", Size: 2536 bytes --]
#!/bin/sh
(
mkdir pangrams
cd pangrams
echo English >"The quick brown fox jumps over the lazy dog"
echo Irish 1 >"An ḃfuil do ċroí ag bualaḋ ó ḟaitíos an ġrá a ṁeall"
echo Irish 2 >"lena ṗóg éada ó ṡlí do leasa ṫú"
echo Irish 3 >"D'ḟuascail Íosa Úrṁac na hÓiġe Beannaiṫe pór"
echo Irish 4 >"Éava agus Áḋaiṁ"
echo Dutch >"Pa's wijze lynx bezag vroom het fikse aquaduct"
echo German 1 >"Falsches Üben von Xylophonmusik quält"
echo German 2 >"jeden größeren Zwerg"
echo Norwegian >"Blåbærsyltetøy"
echo Danish >"Høj bly gom vandt fræk sexquiz på wc"
echo Swedish >"Flygande bäckasiner söka strax hwila på mjuka tuvor"
echo Icelandic >"Sævör grét áðan því úlpan var ónýt"
echo Finnish >"Törkylempijävongahdus"
echo Polish >"Pchnąć w tę łódź jeża lub osiem skrzyń fig"
echo Czech >"Příliš žluťoučký kůň úpěl ďábelské kódy"
echo Slovak 1 >"Starý kôň na hŕbe kníh žuje tíško povädnuté ruže"
echo Slovak 2 >"na stĺpe sa ďateľ učí kvákať novú ódu o živote"
echo monotonic Greek >"ξεσκεπάζω την ψυχοφθόρα βδελυγμία"
echo polytonic Greek >"ξεσκεπάζω τὴν ψυχοφθόρα βδελυγμία"
echo Russian >"Съешь же ещё этих мягких французских булок да выпей чаю"
echo Bulgarian 1 >"Жълтата дюля беше щастлива"
echo Bulgarian 2 >"че пухът, който цъфна, замръзна като гьон"
echo Northern Sami >"Vuol Ruoŧa geđggiid leat máŋga luosa ja čuovžža"
echo Hungarian >"Árvíztűrő tükörfúrógép"
echo Spanish 1 >"El pingüino Wenceslao hizo kilómetros bajo exhaustiva"
echo Spanish 2 >"lluvia y frío añoraba a su querido cachorro"
echo Portuguese 1 >"O próximo vôo à noite sobre o Atlântico"
echo Portuguese 2 >"põe freqüentemente o único médico"
echo French 1 >"Les naïfs ægithales hâtifs pondant à Noël où il gèle"
echo French 2 >"sont sûrs d'être déçus en voyant leurs drôles"
echo French 3 >"d'œufs abîmés"
echo Esperanto >"Eĥoŝanĝo ĉiuĵaŭde"
echo Hebrew >"זה כיף סתם לשמוע איך תנצח קרפד עץ טוב בגן"
echo Hiragana 1 >"いろはにほへど ちりぬるを"
echo Hiragana 2 >"わがよたれぞ つねならむ"
echo Hiragana 3 >"うゐのおくやま けふこえて"
echo Hiragana 4 >"あさきゆめみじ ゑひもせず"
)