From: "René Scharfe" <rene.scharfe@lsrfire.ath.cx>
To: Junio C Hamano <gitster@pobox.com>
Cc: Jeff King <peff@peff.net>,
	Sven Strickroth <sven.strickroth@tu-clausthal.de>,
	git@vger.kernel.org
Subject: Re: git archive --format zip utf-8 issues
Date: Tue, 18 Sep 2012 21:40:57 +0200
Message-ID: <5058CE49.3070108@lsrfire.ath.cx>
In-Reply-To: <5047A9C0.9020200@lsrfire.ath.cx>

Hello again,

Two weeks have passed, and I've moved at a glacial pace towards a
method for measuring the compatibility of our generated ZIP files.
Sorry, I just keep getting distracted.

Anyway, the idea is to have a bunch of files with names using different 
scripts, zip them with several packers (including git archive), unzip 
them and compare the result with the original files.
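
For the git archive case, one such round trip might look roughly like the
sketch below.  The function name, its arguments and the throwaway commit
identity are made up for illustration; only "git archive --format=zip"
and Info-ZIP's unzip are the actual tools under test.

```shell
#!/bin/sh
# Sketch: pack a directory with "git archive", unpack it with unzip.
# roundtrip_git, $src and $work are hypothetical names, not part of
# the attached script.  $src must not end in a slash.
roundtrip_git() {
	src=$1 work=$2
	mkdir -p "$work/repo" "$work/out" &&
	cp -R "$src" "$work/repo/" &&
	(
		cd "$work/repo" &&
		git init -q &&
		git add . &&
		# Placeholder identity so the commit works anywhere.
		git -c user.name=tester -c user.email=tester@example.com \
			commit -q -m "test corpus" &&
		git archive --format=zip -o ../test.zip HEAD
	) &&
	(cd "$work/out" && unzip -q ../test.zip)
}
```

After "roundtrip_git pangrams work" the unpacked copy would be in
work/out/pangrams, ready to be compared with the original directory.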

As a test corpus I used files named after the pangrams on this UTF-8
sampler page; the exact commands are attached:

    http://www.columbia.edu/~fdc/utf8/index.html#quickbrownfox

Each number below is the count of lines that diff -ru prints for that
pair of packer (row) and unpacker (column).  There are 37 files, so the
worst result is 74 lines of difference ("Only in [...]" for both sides),
while 0 indicates a perfect score.
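
A minimal sketch of how such a number can be obtained, assuming the
original and the unpacked files sit in two directories; the function
name and the directory arguments are hypothetical:

```shell
#!/bin/sh
# Sketch: count the lines "diff -ru" prints for two directory trees.
# diff_score and its arguments are illustrative names only.
diff_score() {
	# diff exits non-zero when the trees differ; only its output
	# matters here.  tr strips the padding some wc versions add.
	diff -ru "$1" "$2" | wc -l | tr -d ' '
}
```

For example, "diff_score pangrams unpacked" prints 0 if both trees are
identical.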

Hmm, come to think of it, an empty directory would show up as 37, so 
this metric is not ideal.  A better one would be to simply give one 
point for each correctly unpacked file.
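
Such a per-file score could be computed along these lines.  This is only
a sketch: it assumes a flat directory without dotfiles, like the test
corpus, and the function name is made up.

```shell
#!/bin/sh
# Sketch: one point for each file of the original directory that
# exists under the same name and with the same content in the
# unpacked directory.  file_score is a hypothetical helper.
file_score() {
	orig=$1 unpacked=$2 score=0
	for f in "$orig"/*
	do
		name=${f##*/}
		# Skip non-files (and the unexpanded glob of an empty dir).
		if test -f "$f" && test -f "$unpacked/$name" &&
		   cmp -s "$f" "$unpacked/$name"
		then
			score=$((score + 1))
		fi
	done
	echo "$score"
}
```

With this metric an empty output directory scores 0 and a perfect
unpacker scores the number of files in the corpus.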

                                          Windows    Info-ZIP unzip
                             7-Zip PeaZip builtin Linux msysgit Windows
7-Zip 9.20                      0      0      46    26      43      43
PeaZip 4.7.1 win64              0      0      46    26      42      42
Info-ZIP zip 3.0 Linux          0      0      72     0      43      43
Info-ZIP zip 3.0 Windows       45     45     n/a     0      43      43
git-master                     72     72      72    60      72      72
git-master-patch1               0      0      72    60      72      72
git-master-patch2               0      0      72     0      72      72
git-v1.7.11.msysgit.1          72     72      72    60      72      72
git-v1.7.11.msysgit.1-patch1    0      0      72    60      72      72
git-v1.7.11.msysgit.1-patch2    0      0      72     0      72      72

Info-ZIP's programs don't work too well on Windows.  The built-in
unzipper of Windows 7 even refuses to open the file created by the
Windows version of zip.  Speaking of which, the Windows built-in
unzipper is the worst of the unpackers.

With the two patches applied, we can say "use 7-Zip or PeaZip on Windows 
and unzip on Linux" and filenames with all tested characters will be 
preserved.  I was surprised to see this working fine with msysgit like 
that, even though no reencoding is introduced by the patches.

I wonder what 7-Zip and PeaZip do that gives them a slightly nicer score
with the Windows-internal unzipper.  Umlauts, Nordic characters and
accents are preserved by that combination.  It seems that unzip on Linux
fails to unpack exactly these names, so perhaps 7-Zip and PeaZip employ
a dirty trick like using the local encoding in the ZIP file, which makes
it unportable.

I'll reply with the two patches, which contain basically the same code
as the previous patch, only split up.  The second one declares that
entries with UTF-8 encoded filenames were created on Unix (instead of
FAT), which makes unzip happy.  This, however, implies that the entries
carry Unix permissions, which is a bit ugly.

René

[-- Attachment #2: pangrams.sh --]

#!/bin/sh
(
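	# Stop here if the directory cannot be created or entered, so
	# that no files end up in the current directory by accident.
	mkdir -p pangrams && cd pangrams || exit 1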

	echo English >"The quick brown fox jumps over the lazy dog"
	echo Irish 1 >"An ḃfuil do ċroí ag bualaḋ ó ḟaitíos an ġrá a ṁeall"
	echo Irish 2 >"lena ṗóg éada ó ṡlí do leasa ṫú"
	echo Irish 3 >"D'ḟuascail Íosa Úrṁac na hÓiġe Beannaiṫe pór"
	echo Irish 4 >"Éava agus Áḋaiṁ"
	echo Dutch >"Pa's wijze lynx bezag vroom het fikse aquaduct"
	echo German 1 >"Falsches Üben von Xylophonmusik quält"
	echo German 2 >"jeden größeren Zwerg"
	echo Norwegian >"Blåbærsyltetøy"
	echo Danish >"Høj bly gom vandt fræk sexquiz på wc"
	echo Swedish >"Flygande bäckasiner söka strax hwila på mjuka tuvor"
	echo Icelandic >"Sævör grét áðan því úlpan var ónýt"
	echo Finnish >"Törkylempijävongahdus"
	echo Polish >"Pchnąć w tę łódź jeża lub osiem skrzyń fig"
	echo Czech >"Příliš žluťoučký kůň úpěl ďábelské kódy"
	echo Slovak 1 >"Starý kôň na hŕbe kníh žuje tíško povädnuté ruže"
	echo Slovak 2 >"na stĺpe sa ďateľ učí kvákať novú ódu o živote"
	echo monotonic Greek >"ξεσκεπάζω την ψυχοφθόρα βδελυγμία"
	echo polytonic Greek >"ξεσκεπάζω τὴν ψυχοφθόρα βδελυγμία"
	echo Russian >"Съешь же ещё этих мягких французских булок да выпей чаю"
	echo Bulgarian 1 >"Жълтата дюля беше щастлива"
	echo Bulgarian 2 >"че пухът, който цъфна, замръзна като гьон"
	echo Northern Sami >"Vuol Ruoŧa geđggiid leat máŋga luosa ja čuovžža"
	echo Hungarian >"Árvíztűrő tükörfúrógép"
	echo Spanish 1 >"El pingüino Wenceslao hizo kilómetros bajo exhaustiva"
	echo Spanish 2 >"lluvia y frío añoraba a su querido cachorro"
	echo Portuguese 1 >"O próximo vôo à noite sobre o Atlântico"
	echo Portuguese 2 >"põe freqüentemente o único médico"
	echo French 1 >"Les naïfs ægithales hâtifs pondant à Noël où il gèle"
	echo French 2 >"sont sûrs d'être déçus en voyant leurs drôles"
	echo French 3 >"d'œufs abîmés"
	echo Esperanto >"Eĥoŝanĝo ĉiuĵaŭde"
	echo Hebrew >"זה כיף סתם לשמוע איך תנצח קרפד עץ טוב בגן"
	echo Hiragana 1 >"いろはにほへど ちりぬるを"
	echo Hiragana 2 >"わがよたれぞ つねならむ"
	echo Hiragana 3 >"うゐのおくやま けふこえて"
	echo Hiragana 4 >"あさきゆめみじ ゑひもせず"
)
