git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andrew Oakley <andrew@adoakley.name>
To: Joel Holdsworth <jholdsworth@nvidia.com>
Cc: git@vger.kernel.org,
	Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>,
	Dorgon Chang <dorgonman@hotmail.com>,
	Joachim Kuebart <joachim.kuebart@gmail.com>,
	Daniel Levin <dendy.ua@gmail.com>,
	Johannes Schindelin <johannes.schindelin@gmx.de>,
	Luke Diamand <luke@diamand.org>, Ben Keene <seraphire@gmail.com>
Subject: Re: [PATCH v2 1/3] git-p4: remove support for Python 2
Date: Sun, 12 Dec 2021 17:50:54 +0000	[thread overview]
Message-ID: <20211212175054.5d3c11af@ado-tr.dyn.home.arpa> (raw)
In-Reply-To: <20211210153101.35433-2-jholdsworth@nvidia.com>

On Fri, 10 Dec 2021 15:30:59 +0000
Joel Holdsworth <jholdsworth@nvidia.com> wrote:
> The motivation for this change is that there is a family of issues
> with git-p4's handling of incoming text data when it contains bytes
> which cannot be decoded into UTF-8 characters. For text files created
> in Windows, CP1252 Smart Quote Characters (0x93 and 0x94) are seen
> fairly frequently. These codes are invalid in UTF-8, so if the script
> encounters any file or file name containing them, on Python 2 the
> symbols will be corrupted, and on Python 3 the script will fail with
> an exception.

As I've pointed out previously, peforce fails to store the encoding of
text like commit messages.  With Windows perforce clients, the encoding
used seems to be based on the current code page on the client which
made the commit.  If you're part of a global organisation with people
in different locales making commits then you will find that there is
not a consistent encoding for commit messages.

Given that you don't know the encoding of the text, what's the best
thing to do with the data?  Options I can see are:

- Feed the raw bytes directly into git.  The i18n.commitEncoding config
  option can be set by the user if they want to attempt to decode the
  commit messages in something other than UTF-8.
- Attempt to detect the encoding somehow, feed the raw bytes directly
  into git and set the encoding on the commit.
- Attempt to dedect the encoding somehow and reencode everything into
  UTF-8.

Right now, if you use python2 then you get the behaviour as described
in the first of these options.  It doesn't "corrupt" anything, it just
transfers the bytes from perforce into git.  If you use python3 then
git-p4 is unusable because it throws exceptions trying to decode things.

It's not clear to me how "attempt to detect the encoding somehow" would
work.  The first option therefore seems like the best choice.

I think that this is the problem which really needs solving.  Dropping
support for python2 doesn't make the issue go away (although it might
make it slightly easier to write the code).  I think that the python2
compatibility should be maintained at least until the encoding problems
have been solved for python3.

I previously wrote some patches which attempt to move in what I think is
the right direction, but unfortunately they never got upstreamed:

https://lore.kernel.org/git/20210412085251.51475-1-andrew@adoakley.name/

Your comments elsewhere that git-p4 could benifit from some clean-up
seem accurate to me, and it would be good to see that kind of change.

  reply	other threads:[~2021-12-12 18:17 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-10 15:30 [PATCH v2 0/3] Transition git-p4.py to support Python 3 only Joel Holdsworth
2021-12-10 15:30 ` [PATCH v2 1/3] git-p4: remove support for Python 2 Joel Holdsworth
2021-12-12 17:50   ` Andrew Oakley [this message]
2021-12-12 22:01     ` Ævar Arnfjörð Bjarmason
2021-12-10 15:31 ` [PATCH v2 2/3] git-p4: eliminate decode_stream and encode_stream Joel Holdsworth
2021-12-10 15:31 ` [PATCH v2 3/3] git-p4: add "Nvidia Corporation" to copyright header Joel Holdsworth
2021-12-10 18:57   ` Luke Diamand
2021-12-11 21:19   ` Elijah Newren
2021-12-13  0:11     ` brian m. carlson
  -- strict thread matches above, loose matches on Subject: below --
2021-12-13  0:30 [PATCH v2 1/3] git-p4: remove support for Python 2 Tzadik Vanderhoof

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20211212175054.5d3c11af@ado-tr.dyn.home.arpa \
    --to=andrew@adoakley.name \
    --cc=dendy.ua@gmail.com \
    --cc=dorgonman@hotmail.com \
    --cc=git@vger.kernel.org \
    --cc=jholdsworth@nvidia.com \
    --cc=joachim.kuebart@gmail.com \
    --cc=johannes.schindelin@gmx.de \
    --cc=luke@diamand.org \
    --cc=seraphire@gmail.com \
    --cc=tzadik.vanderhoof@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).