git.vger.kernel.org archive mirror
From: paul <pl.gruener@gmail.com>
To: git@vger.kernel.org
Subject: commit messages are silently re-encoded to UTF-8 despite docs implying otherwise
Date: Mon, 19 Jun 2023 15:22:27 +0200
Message-ID: <c6950001-ced5-448d-8f0e-22f597c68b9b@gmail.com>

When given a commit message that is not valid UTF-8, Git assumes it is encoded in latin1 (ISO-8859-1) and silently re-encodes it from latin1 to UTF-8. That assumption is sensible, but the docs imply that no such conversion happens.

Documentation/i18n.txt explicitly says (emphasis mine):
> Although we encourage that the commit log messages are encoded
> in UTF-8, both the core and Git Porcelain are designed **not to
> force UTF-8** on projects.  If all participants of a particular
> project find it more convenient to use legacy encodings, Git
> does not forbid it.  However, there are a few things to keep in
> mind.
> . 'git commit' and 'git commit-tree' issues
>   a warning if the commit log message given to it does not look
>   like a valid UTF-8 string, unless you explicitly say your
>   project uses a legacy encoding.  The way to say this is to
>   have `i18n.commitEncoding` in `.git/config` file, like this:
> […]
> Note that **we deliberately chose not to re-code the commit log
> message when a commit is made to force UTF-8** at the commit
> object level, because re-coding to UTF-8 is not necessarily a
> reversible operation.

Said warning reads:
> # Warning: commit message did not conform to UTF-8.
> # You may want to amend it after fixing the message, or set the config
> # variable i18n.commitEncoding to the encoding your project uses.

I interpret this as: the config variable is only used to silence the warning (and as a courtesy to future readers), not to control some behind-the-scenes conversion, especially since the note above says that no re-coding happens.

But we can easily verify that a latin1-encoded commit message produces exactly the same commit as its UTF-8 equivalent:
---
#!/usr/bin/env bash

export {GIT_AUTHOR_NAME,GIT_AUTHOR_EMAIL,GIT_COMMITTER_EMAIL,GIT_COMMITTER_NAME}=me
export {GIT_AUTHOR_DATE,GIT_COMMITTER_DATE}=2005-04-07T22:13:13
# symbol 'Ä' is 0xc384 in utf-8 but 0xc4 in iso-8859-1

git commit-tree -m "$(printf '\xc3\x84\n')" "$(git hash-object -t tree /dev/null)"
# 4df6ed26ffd0b02c8b5c1d85ba184b02fbcaa2db

git commit-tree -m "$(printf '\xc4\n')" "$(git hash-object -t tree /dev/null)"
# Warning: commit message did not conform to UTF-8.
# You may want to amend it after fixing the message, or set the config
# variable i18n.commitEncoding to the encoding your project uses.
# 4df6ed26ffd0b02c8b5c1d85ba184b02fbcaa2db
---
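Identical hashes already imply identical stored bytes, but the message can also be inspected directly. The following is only a rough sketch, assuming a POSIX shell with git, mktemp, sed, od and tr available; `stored_message_hex` is a hypothetical helper name, not a Git command.

```shell
#!/bin/sh
# Hypothetical helper (not part of Git): commit the given printf-style byte
# string as a message in a throwaway repository, then print the bytes that
# were actually stored for the message, as hex.
stored_message_hex() {
    repo=$(mktemp -d) || return 1
    (
        cd "$repo" && git init -q . >/dev/null 2>&1 &&
        export GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me \
               GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me &&
        tree=$(git hash-object -t tree /dev/null) &&
        # The UTF-8 warning goes to stderr; it is discarded here.
        commit=$(git commit-tree -m "$(printf "$1")" "$tree" 2>/dev/null) &&
        # The message is everything after the first blank line of the object.
        git cat-file commit "$commit" | sed '1,/^$/d' | od -An -tx1 | tr -d ' \n'
    )
    status=$?
    rm -rf "$repo"
    return $status
}

stored_message_hex '\303\204'; echo   # 'Ä' as UTF-8
stored_message_hex '\304'; echo       # 'Ä' as ISO-8859-1
```

If the re-encoding described above takes place, both calls should print the same bytes (c3 84 followed by the trailing newline 0a).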

It's debatable whether Git should attempt an encoding conversion at all. I think it's fine if properly communicated, so at the very least the warning message should be changed to something like

> # Warning: commit message did not conform to UTF-8 and was automatically
> # converted. You possibly need to fix and amend it.
> # To use a different encoding than UTF-8, set the config variable
> # i18n.commitEncoding to the encoding your project uses.

and the documentation adapted similarly.

regards,
paul


PS: the code responsible for the conversion is in commit.c +1654:
> /* And check the encoding */
> if (encoding_is_utf8 && !verify_utf8(&buffer))
>     fprintf(stderr, _(commit_utf8_warn));

and
> /*
>  * This verifies that the buffer is in proper utf8 format.
>  *
>  * If it isn't, it assumes any non-utf8 characters are Latin1,
>  * and does the conversion.
>  */
> static int verify_utf8(struct strbuf *buf)
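
The rule that comment describes can be imitated outside of Git. This is a simplified sketch, not Git's implementation: it assumes iconv, od and tr are available, and it converts the whole message when validation fails, whereas the comment above speaks of converting only the non-UTF-8 characters. `normalize_to_utf8` and `hex` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: pass stdin through unchanged if it is valid UTF-8,
# otherwise treat the whole input as ISO-8859-1 and convert it to UTF-8.
normalize_to_utf8() {
    msg=$(cat)
    if printf '%s' "$msg" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
        printf '%s' "$msg"
    else
        printf '%s' "$msg" | iconv -f ISO-8859-1 -t UTF-8
    fi
}

# Hypothetical helper: print stdin as a contiguous hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

printf '\303\204' | normalize_to_utf8 | hex; echo   # already UTF-8
printf '\304' | normalize_to_utf8 | hex; echo       # latin1 input
```

Both invocations should end up with the same two bytes, which matches the identical commit hashes in the script above.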

