From: Michael Haggerty <mhagger@alum.mit.edu>
To: John Keeping <john@keeping.me.uk>
Cc: Junio C Hamano <gitster@pobox.com>,
git@vger.kernel.org, Sverre Rabbelier <srabbelier@gmail.com>
Subject: Re: [PATCH] git-remote-testpy: fix patch hashing on Python 3
Date: Mon, 28 Jan 2013 11:44:34 +0100 [thread overview]
Message-ID: <51065692.9000708@alum.mit.edu> (raw)
In-Reply-To: <20130127145056.GP7498@serenity.lan>
[-- Attachment #1: Type: text/plain, Size: 4681 bytes --]
On 01/27/2013 03:50 PM, John Keeping wrote:
> When this change was originally made (0846b0c - git-remote-testpy: hash
> bytes explicitly , I didn't realised that the "hex" encoding we chose is
> a "bytes to bytes" encoding so it just fails with an error on Python 3
> in the same way as the original code.
>
> It is not possible to provide a single code path that works on Python 2
> and Python 3 since Python 2.x will attempt to decode the string before
> encoding it, which fails for strings that are not valid in the default
> encoding. Python 3.1 introduced the "surrogateescape" error handler
> which handles this correctly and permits a bytes -> unicode -> bytes
> round-trip to be lossless.
>
> At this point Python 3.0 is unsupported so we don't go out of our way to
> try to support it.
>
> Helped-by: Michael Haggerty <mhagger@alum.mit.edu>
> Signed-off-by: John Keeping <john@keeping.me.uk>
> ---
> On Sun, Jan 27, 2013 at 02:13:29PM +0000, John Keeping wrote:
>> On Sun, Jan 27, 2013 at 05:44:37AM +0100, Michael Haggerty wrote:
>>> So to handle all of the cases across Python versions as closely as
>>> possible to the old 2.x code, it might be necessary to make the code
>>> explicitly depend on the Python version number, like:
>>>
>>> hasher = _digest()
>>> if sys.hexversion < 0x03000000:
>>> pathbytes = repo.path
>>> elif sys.hexversion < 0x03010000:
>>> # If support for Python 3.0.x is desired (note: result can
>>> # be different in this case than under 2.x or 3.1+):
>>> pathbytes = repo.path.encode(sys.getfilesystemencoding(),
>>> 'backslashreplace')
>>> else
>>> pathbytes = repo.path.encode(sys.getfilesystemencoding(),
>>> 'surrogateescape')
>>> hasher.update(pathbytes)
>>> repo.hash = hasher.hexdigest()
>
> How about this?
>
> git-remote-testpy.py | 18 +++++++++++++++++-
> 1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/git-remote-testpy.py b/git-remote-testpy.py
> index c7a04ec..16b0c52 100644
> --- a/git-remote-testpy.py
> +++ b/git-remote-testpy.py
> @@ -36,6 +36,22 @@ if sys.hexversion < 0x02000000:
> sys.stderr.write("git-remote-testgit: requires Python 2.0 or later.\n")
> sys.exit(1)
>
> +
> +def _encode_filepath(path):
> + """Encodes a Unicode file path to a byte string.
> +
> + On Python 2 this is a no-op; on Python 3 we encode the string as
> + suggested by [1] which allows an exact round-trip from the command line
> + to the filesystem.
> +
> + [1] http://docs.python.org/3/c-api/unicode.html#file-system-encoding
> +
> + """
> + if sys.hexversion < 0x03000000:
> + return path
> + return path.encode('utf-8', 'surrogateescape')
> +
> +
> def get_repo(alias, url):
> """Returns a git repository object initialized for usage.
> """
> @@ -45,7 +61,7 @@ def get_repo(alias, url):
> repo.get_head()
>
> hasher = _digest()
> - hasher.update(repo.path.encode('hex'))
> + hasher.update(_encode_filepath(repo.path))
> repo.hash = hasher.hexdigest()
>
> repo.get_base_path = lambda base: os.path.join(
>
NAK. It is still not right. If the locale is not utf-8 based, then it
is incorrect to re-encode the string using utf-8. I think you really
have to use sys.getfilesystemencoding() as I suggested.
The attached program demonstrates the problem: the output of re-encoding
using UTF-8 depends on the locale, whereas that of re-encoding using the
filesystemencoding is independent of locale (as we want). The output,
using Python 3.2.3:
# This is 0xb6 0xc3:
$ ARG="ö"
$ LANG='C' /usr/bin/python3 chaos3.py "$ARG"
LANG = 'C'
fse = 'ascii'
sys.argv[1] = u"U+DCC3 U+DCB6"
re-encoded using UTF-8: b"C3 B6"
re-encoded using fse: b"C3 B6"
$ LANG='C.UTF-8' /usr/bin/python3 chaos3.py "$ARG"
LANG = 'C.UTF-8'
fse = 'utf-8'
sys.argv[1] = u"U+00F6"
re-encoded using UTF-8: b"C3 B6"
re-encoded using fse: b"C3 B6"
$ LANG='en_US.iso88591' /usr/bin/python3 chaos3.py "$ARG"
LANG = 'en_US.iso88591'
fse = 'iso8859-1'
sys.argv[1] = u"U+00C3 U+00B6"
re-encoded using UTF-8: b"C3 83 C2 B6"
re-encoded using fse: b"C3 B6"
Even though the Unicode intermediate representation is different for
UTF-8 and ASCII, re-encoding using the correct encoding gives back the
original bytes (which is what we want). But when using the ios8859-1
locale, the original bytes look like a valid latin1 string so they are
not surrogated going in, giving the incorrect Unicode string u"U+00C3
U+00B6". When this is re-encoded using UTF-8, the code points U+00C3
and U+00B6 are each encoded as two bytes.
Michael
--
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/
[-- Attachment #2: chaos3.py --]
[-- Type: text/x-python, Size: 664 bytes --]
#! /usr/bin/python3
import sys
import os
def explicit(s):
"""Convert a string or bytestring into an unambiguous human-readable string."""
if isinstance(s, str):
return 'u"%s"' % (' '.join('U+%04X' % (ord(c),) for c in s))
else:
return 'b"%s"' % (' '.join('%02X' % (c,) for c in s))
fse = sys.getfilesystemencoding()
print('LANG = %r' % (os.getenv('LANG'),))
print('fse = %r' % (fse,))
print('sys.argv[1] = %s' % explicit(sys.argv[1]))
print('re-encoded using UTF-8: %s' % explicit(sys.argv[1].encode('utf-8', 'surrogateescape')))
print('re-encoded using fse: %s' % explicit(sys.argv[1].encode(fse, 'surrogateescape')))
print()
next prev parent reply other threads:[~2013-01-28 10:45 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-01-20 13:15 [PATCH v3 0/8] Python 3 support for git_remote_helpers John Keeping
2013-01-20 13:15 ` [PATCH v3 1/8] git_remote_helpers: Allow building with Python 3 John Keeping
2013-01-23 18:49 ` Sverre Rabbelier
2013-01-20 13:15 ` [PATCH v3 2/8] git_remote_helpers: fix input when running under " John Keeping
2013-01-23 19:20 ` Sverre Rabbelier
2013-01-23 19:47 ` John Keeping
2013-01-23 20:14 ` Sverre Rabbelier
2013-01-23 20:36 ` Junio C Hamano
2013-01-25 20:23 ` Brandon Casey
2013-02-05 16:07 ` Erik Faye-Lund
2013-01-20 13:15 ` [PATCH v3 3/8] git_remote_helpers: Force rebuild if python version changes John Keeping
2013-01-23 18:51 ` Sverre Rabbelier
2013-01-20 13:15 ` [PATCH v3 4/8] git_remote_helpers: Use 2to3 if building with Python 3 John Keeping
2013-01-20 13:15 ` [PATCH v3 5/8] svn-fe: allow svnrdump_sim.py to run " John Keeping
2013-01-20 13:15 ` [PATCH v3 6/8] git-remote-testpy: hash bytes explicitly John Keeping
2013-01-26 17:51 ` John Keeping
2013-01-26 21:44 ` Junio C Hamano
2013-01-26 23:32 ` [PATCH] git-remote-testpy: fix patch hashing on Python 3 John Keeping
2013-01-27 4:44 ` [PATCH v3 6/8] git-remote-testpy: hash bytes explicitly Michael Haggerty
2013-01-27 5:30 ` Junio C Hamano
2013-01-27 14:21 ` John Keeping
2013-01-27 5:30 ` Sverre Rabbelier
2013-01-27 8:41 ` Michael Haggerty
2013-01-27 14:13 ` John Keeping
2013-01-27 14:50 ` [PATCH] git-remote-testpy: fix patch hashing on Python 3 John Keeping
2013-01-27 19:49 ` Junio C Hamano
2013-01-27 20:04 ` John Keeping
2013-01-27 20:11 ` Junio C Hamano
2013-01-27 20:21 ` John Keeping
2013-01-27 20:38 ` Junio C Hamano
2013-01-27 20:47 ` Junio C Hamano
2013-01-27 22:42 ` John Keeping
2013-01-27 23:18 ` Junio C Hamano
2013-01-28 10:44 ` Michael Haggerty [this message]
2013-01-28 11:20 ` [PATCH] fixup! git-remote-testpy: fix path " John Keeping
2013-01-28 17:53 ` Junio C Hamano
2013-01-20 13:15 ` [PATCH v3 7/8] git-remote-testpy: don't do unbuffered text I/O John Keeping
2013-01-20 13:15 ` [PATCH v3 8/8] git-remote-testpy: call print as a function John Keeping
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=51065692.9000708@alum.mit.edu \
--to=mhagger@alum.mit.edu \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=john@keeping.me.uk \
--cc=srabbelier@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).