git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "René Scharfe" <rene.scharfe@lsrfire.ath.cx>
To: unlisted-recipients:; (no To-header on input)
Cc: Junio C Hamano <gitster@pobox.com>, Jeff King <peff@peff.net>,
	Sven Strickroth <sven.strickroth@tu-clausthal.de>,
	git@vger.kernel.org
Subject: Re: git archive --format zip utf-8 issues
Date: Mon, 24 Sep 2012 17:56:57 +0200	[thread overview]
Message-ID: <506082C9.6050604@lsrfire.ath.cx> (raw)
In-Reply-To: <505B91E9.7060208@lsrfire.ath.cx>

[-- Attachment #1: Type: text/plain, Size: 2058 bytes --]

Hi,

I found a way to make unzip respect the UTF-8 flag in ZIP files: 
Apparently (from looking at the source) an extended field needs to be 
present in order for it to even look at general purpose flag 11.  I sent 
a patch to add an extended timestamp field that fits the bill.

Here are new numbers on ZIP international filename compatibility:

		7-Zip	PeaZip	builtin	unzip	unzip	unzip	7z
		Windows	Windows	Windows	Linux	mingw	Windows	Linux
git	Linux	1	1	1	7	1	1	1
git 1	Linux	37	37	1	7	1	1	37
git 2	Linux	37	37	1	37	1	1	37
git 3	Linux	37	37	1	37	15	15	37
git	mingw	1	1	1	7	1	1	1
git 1	mingw	37	37	1	7	1	1	37
git 2	mingw	37	37	1	37	1	1	37
git 3	mingw	37	37	1	37	15	15	37
7-Zip	Windows	37	37	14	24	15	15	24
PeaZip	Windows	37	37	14	24	15	15	24
zip	Linux	37	37	1	37	15	15	37
zip	Windows	14	14	0	37	15	15	1
builtin	Windows	14	14	14	1	14	14	1

The test corpus still consists of 37 files based on the pangrams on the 
following web page:

	http://www.columbia.edu/~fdc/utf8/index.html#quickbrownfox

The files can be created using the attached script.  It also provides a 
check command to count the files with correct names, and the results of 
that for different ZIP extractors are give in the table.  The built-in 
ZIP functionality on Windows was only able to pack 14 of the 37 files, 
which explains the low score across the board for this packer.

"git 1" is the patch "archive-zip: support UTF-8 paths" added, which 
let's archive-zip make use of the UTF-8 flag.  "git 2" is "git 1" plus 
the patch "archive-zip: declare creator to be Unix for UTF-8 paths". 
Both have been posted before.  "git 3" is "git 1" plus the new patch 
"archive-zip: write extended timestamp".

Let's drop patch 2 (Unix as creator) and keep patches 1 (UTF-8 flag) and 
3 (mtime field) to make archive-zip record non-ASCII filenames in a 
portable way.  It's not perfect, but I don't know how to do any better 
given that Windows' built-in ZIP functionality expects filenames in the 
local code page and with an international audience for projects 
distributing ZIP files.

René


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: pangrams.sh --]
[-- Type: text/plain; charset=windows-1252; name="pangrams.sh", Size: 2367 bytes --]

#!/bin/sh

files() {
cat <<EOF
pangrams/わがよたれぞ つねならむ
pangrams/うゐのおくやま けふこえて
pangrams/いろはにほへど ちりぬるを
pangrams/あさきゆめみじ ゑひもせず
pangrams/An ḃfuil do ċroí ag bualaḋ ó ḟaitíos an ġrá a ṁeall
pangrams/Árvíztűrő tükörfúrógép
pangrams/Blåbærsyltetøy
pangrams/D'ḟuascail Íosa Úrṁac na hÓiġe Beannaiṫe pór
pangrams/d'œufs abîmés
pangrams/Éava agus Áḋaiṁ
pangrams/Eĥoŝanĝo ĉiuĵaŭde
pangrams/El pingüino Wenceslao hizo kilómetros bajo exhaustiva
pangrams/Falsches Üben von Xylophonmusik quält
pangrams/Flygande bäckasiner söka strax hwila på mjuka tuvor
pangrams/Høj bly gom vandt fræk sexquiz på wc
pangrams/jeden größeren Zwerg
pangrams/lena ṗóg éada ó ṡlí do leasa ṫú
pangrams/Les naïfs ægithales hâtifs pondant à Noël où il gèle
pangrams/lluvia y frío añoraba a su querido cachorro
pangrams/na stĺpe sa ďateľ učí kvákať novú ódu o živote
pangrams/O próximo vôo à noite sobre o Atlântico
pangrams/Pa's wijze lynx bezag vroom het fikse aquaduct
pangrams/Pchnąć w tę łódź jeża lub osiem skrzyń fig
pangrams/põe freqüentemente o único médico
pangrams/Příliš žluťoučký kůň úpěl ďábelské kódy
pangrams/Sævör grét áðan því úlpan var ónýt
pangrams/sont sûrs d'être déçus en voyant leurs drôles
pangrams/Starý kôň na hŕbe kníh žuje tíško povädnuté ruže
pangrams/The quick brown fox jumps over the lazy dog
pangrams/Törkylempijävongahdus
pangrams/Vuol Ruoŧa geđggiid leat máŋga luosa ja čuovžža
pangrams/זה כיף סתם לשמוע איך תנצח קרפד עץ טוב בגן
pangrams/ξεσκεπάζω την ψυχοφθόρα βδελυγμία
pangrams/ξεσκεπάζω τὴν ψυχοφθόρα βδελυγμία
pangrams/Жълтата дюля беше щастлива
pangrams/Съешь же ещё этих мягких французских булок да выпей чаю
pangrams/че пухът, който цъфна, замръзна като гьон
EOF
}

case "$1" in
create)
	mkdir -p pangrams
	files | while read file
	do
		touch "$file"
	done
	;;
check)
	files | while read file
	do
		test -f "$file" && echo "$file"
	done | wc -l
	;;
*)
	echo "Usage: $0 create | check" >&2
	exit 1
	;;
esac

  reply	other threads:[~2012-09-24 15:57 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-08-10 21:58 git archive --format zip utf-8 issues Sven Strickroth
2012-08-10 22:47 ` Junio C Hamano
2012-08-10 23:53   ` Sven Strickroth
2012-08-11 20:53     ` René Scharfe
2012-08-12  4:08       ` Junio C Hamano
2012-08-11 20:53   ` René Scharfe
2012-08-11 21:37     ` Sven Strickroth
2012-08-30 22:26       ` Jeff King
2012-09-04 20:23         ` René Scharfe
2012-09-04 21:03           ` Junio C Hamano
2012-09-05 19:36             ` René Scharfe
2012-09-18 19:40               ` René Scharfe
2012-09-18 19:46                 ` [PATCH 1/2] archive-zip: support UTF-8 paths René Scharfe
2012-09-18 19:53                 ` [PATCH 2/2] archive-zip: declare creator to be Unix for " René Scharfe
2012-09-18 20:24                 ` git archive --format zip utf-8 issues René Scharfe
2012-09-18 21:12                 ` Junio C Hamano
2012-09-20 22:00                   ` René Scharfe
2012-09-24 15:56                     ` René Scharfe [this message]
2012-09-24 18:13                       ` Junio C Hamano
2012-09-24 15:56                 ` [PATCH 3/2] archive-zip: write extended timestamp René Scharfe
2012-08-12  4:27     ` git archive --format zip utf-8 issues Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=506082C9.6050604@lsrfire.ath.cx \
    --to=rene.scharfe@lsrfire.ath.cx \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=sven.strickroth@tu-clausthal.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).