git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: John Keeping <john@keeping.me.uk>
To: Michael Haggerty <mhagger@alum.mit.edu>
Cc: Junio C Hamano <gitster@pobox.com>,
	git@vger.kernel.org, Sverre Rabbelier <srabbelier@gmail.com>
Subject: Re: [PATCH v3 6/8] git-remote-testpy: hash bytes explicitly
Date: Sun, 27 Jan 2013 14:13:29 +0000	[thread overview]
Message-ID: <20130127141329.GN7498@serenity.lan> (raw)
In-Reply-To: <5104B0B5.1030501@alum.mit.edu>

On Sun, Jan 27, 2013 at 05:44:37AM +0100, Michael Haggerty wrote:
> On 01/26/2013 10:44 PM, Junio C Hamano wrote:
> > John Keeping <john@keeping.me.uk> writes:
> >> @@ -45,7 +45,7 @@ def get_repo(alias, url):
> >>      repo.get_head()
> >>  
> >>      hasher = _digest()
> >> -    hasher.update(repo.path)
> >> +    hasher.update(repo.path.encode('utf-8'))
> >>      repo.hash = hasher.hexdigest()
> >>  
> >>      repo.get_base_path = lambda base: os.path.join(
> 
> This will still fail under Python 2.x if repo.path is a byte string that
> contains non-ASCII characters.

I had forgotten about Python 2 while doing this.

>                                 And it will fail under Python 3.1 and
> later if repo.path contains characters using the surrogateescape
> encoding option [1], as it will if the original command-line argument
> contained bytes that cannot be decoded into Unicode using the user's
> default encoding:

Interesting.  I wasn't aware of the "surrogateescape" error handler.

> 'surrogateescape' is not supported in Python 3.0, but I think it would
> be quite acceptable only to support Python 3.x for x >= 1.

I agree.

> But 'surrogateescape' doesn't seem to be supported at all in Python 2.x
> (I tested 2.7.3 and it's not there).
> 
> Here you don't really need byte-for-byte correctness; it would be enough
> to get *some* byte string that is unique for a given input (ideally,
> consistent with ASCII or UTF-8 for backwards compatibility).  So you
> could use
> 
>     b = s.encode('utf-8', 'backslashreplace')
> 
> Unfortunately, this doesn't work under Python 2.x:
> 
>     $ python2 -c "
>     import sys
>     print(repr(sys.argv[1]))
>     print(repr(sys.argv[1].encode('utf-8', 'backslashreplace')))
>     " $(echo français|iconv -t latin1)
>     'fran\xe7ais'
>     Traceback (most recent call last):
>       File "<string>", line 4, in <module>
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position
> 4: ordinal not in range(128)
> 
> Apparently when you call bytestring.encode(), Python first tries to
> decode the string to Unicode using the 'ascii' encoding.

Actually it appears to use sys.getdefaultencoding() to do this initial
decode.  Not that it makes much difference here since the failure is the
same.

> So to handle all of the cases across Python versions as closely as
> possible to the old 2.x code, it might be necessary to make the code
> explicitly depend on the Python version number, like:
> 
>     hasher = _digest()
>     if sys.hexversion < 0x03000000:
>         pathbytes = repo.path
>     elif sys.hexversion < 0x03010000:
>         # If support for Python 3.0.x is desired (note: result can
>         # be different in this case than under 2.x or 3.1+):
>         pathbytes = repo.path.encode(sys.getfilesystemencoding(),
> 'backslashreplace')
>     else
>         pathbytes = repo.path.encode(sys.getfilesystemencoding(),
> 'surrogateescape')
>     hasher.update(pathbytes)
>     repo.hash = hasher.hexdigest()

If we don't want to put a version check in it probably wants to look
like this (ignoring Python 3.0 since I don't think we need to support
it):

    hasher = _digest()
    try:
        codecs.lookup_error('surrogateescape')
    except LookupError:
        pathbytes = repo.path
    else:
        pathbytes = repo.path.encode(sys.getfilesystemencoding(),
                                     'surrogateescape')
    hasher.update(pathbytes)
    repo.hash = hasher.hexdigest()

The version with a version check seems better to me, although this
should probably be a utility function.


John

  parent reply	other threads:[~2013-01-27 14:19 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-01-20 13:15 [PATCH v3 0/8] Python 3 support for git_remote_helpers John Keeping
2013-01-20 13:15 ` [PATCH v3 1/8] git_remote_helpers: Allow building with Python 3 John Keeping
2013-01-23 18:49   ` Sverre Rabbelier
2013-01-20 13:15 ` [PATCH v3 2/8] git_remote_helpers: fix input when running under " John Keeping
2013-01-23 19:20   ` Sverre Rabbelier
2013-01-23 19:47     ` John Keeping
2013-01-23 20:14       ` Sverre Rabbelier
2013-01-23 20:36         ` Junio C Hamano
2013-01-25 20:23           ` Brandon Casey
2013-02-05 16:07             ` Erik Faye-Lund
2013-01-20 13:15 ` [PATCH v3 3/8] git_remote_helpers: Force rebuild if python version changes John Keeping
2013-01-23 18:51   ` Sverre Rabbelier
2013-01-20 13:15 ` [PATCH v3 4/8] git_remote_helpers: Use 2to3 if building with Python 3 John Keeping
2013-01-20 13:15 ` [PATCH v3 5/8] svn-fe: allow svnrdump_sim.py to run " John Keeping
2013-01-20 13:15 ` [PATCH v3 6/8] git-remote-testpy: hash bytes explicitly John Keeping
2013-01-26 17:51   ` John Keeping
2013-01-26 21:44     ` Junio C Hamano
2013-01-26 23:32       ` [PATCH] git-remote-testpy: fix patch hashing on Python 3 John Keeping
2013-01-27  4:44       ` [PATCH v3 6/8] git-remote-testpy: hash bytes explicitly Michael Haggerty
2013-01-27  5:30         ` Junio C Hamano
2013-01-27 14:21           ` John Keeping
2013-01-27  5:30         ` Sverre Rabbelier
2013-01-27  8:41           ` Michael Haggerty
2013-01-27 14:13         ` John Keeping [this message]
2013-01-27 14:50           ` [PATCH] git-remote-testpy: fix patch hashing on Python 3 John Keeping
2013-01-27 19:49             ` Junio C Hamano
2013-01-27 20:04               ` John Keeping
2013-01-27 20:11                 ` Junio C Hamano
2013-01-27 20:21                   ` John Keeping
2013-01-27 20:38                     ` Junio C Hamano
2013-01-27 20:47                       ` Junio C Hamano
2013-01-27 22:42                         ` John Keeping
2013-01-27 23:18                           ` Junio C Hamano
2013-01-28 10:44             ` Michael Haggerty
2013-01-28 11:20               ` [PATCH] fixup! git-remote-testpy: fix path " John Keeping
2013-01-28 17:53                 ` Junio C Hamano
2013-01-20 13:15 ` [PATCH v3 7/8] git-remote-testpy: don't do unbuffered text I/O John Keeping
2013-01-20 13:15 ` [PATCH v3 8/8] git-remote-testpy: call print as a function John Keeping

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130127141329.GN7498@serenity.lan \
    --to=john@keeping.me.uk \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=mhagger@alum.mit.edu \
    --cc=srabbelier@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).