From: paul <pl.gruener@gmail.com>
To: git@vger.kernel.org
Subject: commit messages are silently re-encoded to UTF-8 despite docs implying otherwise
Date: Mon, 19 Jun 2023 15:22:27 +0200
Message-ID: <c6950001-ced5-448d-8f0e-22f597c68b9b@gmail.com>
When given a commit message that is not valid UTF-8, Git assumes it is encoded in latin1 (ISO-8859-1) and silently re-encodes it from latin1 to UTF-8. The assumption itself is sensible; the problem is that the docs imply no such conversion happens.
Documentation/i18n.txt explicitly says (emphasis mine):
> Although we encourage that the commit log messages are encoded
> in UTF-8, both the core and Git Porcelain are designed **not to
> force UTF-8** on projects. If all participants of a particular
> project find it more convenient to use legacy encodings, Git
> does not forbid it. However, there are a few things to keep in
> mind.
> . 'git commit' and 'git commit-tree' issues
> a warning if the commit log message given to it does not look
> like a valid UTF-8 string, unless you explicitly say your
> project uses a legacy encoding. The way to say this is to
> have `i18n.commitEncoding` in `.git/config` file, like this:
> […]
> Note that **we deliberately chose not to re-code the commit log
> message when a commit is made to force UTF-8** at the commit
> object level, because re-coding to UTF-8 is not necessarily a
> reversible operation.
Said warning reads:
> # Warning: commit message did not conform to UTF-8.
> # You may want to amend it after fixing the message, or set the config
> # variable i18n.commitEncoding to the encoding your project uses.
I interpret this as: the config variable is only used to silence the warning (and as a courtesy to future readers), not to control some behind-the-scenes conversion, especially since the note above says that no re-coding happens.
But we can easily verify that a latin1-encoded commit message produces the same commit as its UTF-8 equivalent, i.e. that the message is converted:
---
#!/usr/bin/env bash
export {GIT_AUTHOR_NAME,GIT_AUTHOR_EMAIL,GIT_COMMITTER_EMAIL,GIT_COMMITTER_NAME}=me
export {GIT_AUTHOR_DATE,GIT_COMMITTER_DATE}=2005-04-07T22:13:13
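# symbol 'Ä' is 0xc384 in utf-8 but 0xc4 in iso-8859-1
# message given as valid UTF-8 (substitutions quoted to avoid word splitting)
git commit-tree -m "$(printf '\xc3\x84\n')" "$(git hash-object -t tree /dev/null)"
# 4df6ed26ffd0b02c8b5c1d85ba184b02fbcaa2db
# same message given as a single latin1 byte
git commit-tree -m "$(printf '\xc4\n')" "$(git hash-object -t tree /dev/null)"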
# Warning: commit message did not conform to UTF-8.
# You may want to amend it after fixing the message, or set the config
# variable i18n.commitEncoding to the encoding your project uses.
# 4df6ed26ffd0b02c8b5c1d85ba184b02fbcaa2db
---
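Comparing hashes shows the conversion only indirectly. A more direct check is to dump the bytes that ended up in the commit object. The following is a minimal sketch, not part of the script above: it assumes bash (for `printf '\x..'`), creates a throwaway repository with `mktemp`, and the variable names `repo`, `tree`, `commit` and `stored` are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: inspect the message bytes Git stored for a latin1 commit message.
repo=$(mktemp -d)                       # throwaway repository (assumption)
git init -q "$repo"
export GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me
export GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me

# Write the empty tree so the commit has something to point at.
tree=$(git -C "$repo" hash-object -w -t tree /dev/null)

# Commit with the single latin1 byte 0xc4 as message; the warning goes to stderr.
commit=$(git -C "$repo" commit-tree -m "$(printf '\xc4')" "$tree" 2>/dev/null)

# The message is the last line of the raw commit object; print it as hex.
stored=$(git -C "$repo" cat-file commit "$commit" | tail -n 1 | od -An -tx1 | tr -d ' \n')
echo "$stored"
```

If the message is re-encoded as described, this prints `c3840a` (UTF-8 'Ä' plus newline) rather than `c40a`.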
It's debatable whether Git should even attempt to do an encoding conversion – I think it's fine if properly communicated, so at the very least the warning message should be changed to something like
> # Warning: commit message did not conform to UTF-8 and was automatically
> # converted. You possibly need to fix and amend it.
> # To use a different encoding than UTF-8, set the config variable
> # i18n.commitEncoding to the encoding your project uses.
and the documentation adapted similarly.
regards,
paul
PS: the code responsible for the conversion is in
commit.c +1654
> /* And check the encoding */
> if (encoding_is_utf8 && !verify_utf8(&buffer))
> fprintf(stderr, _(commit_utf8_warn));
and
> /*
> * This verifies that the buffer is in proper utf8 format.
> *
> * If it isn't, it assumes any non-utf8 characters are Latin1,
> * and does the conversion.
> */
> static int verify_utf8(struct strbuf *buf)
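The comment does not spell out what "does the conversion" means for a single byte. Every latin1 value from 0x80 to 0xff maps to a two-byte UTF-8 sequence, so the fallback can be illustrated with a little bit arithmetic. This is only a sketch of the encoding rule, not a copy of the code in commit.c, and the function name `latin1_byte_to_utf8_hex` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the UTF-8 encoding (as hex) of one latin1 byte value.
latin1_byte_to_utf8_hex() {
    local c=$(( $1 ))
    if (( c < 0x80 )); then
        # ASCII range: identical in latin1 and UTF-8
        printf '%02x\n' "$c"
    else
        # 0x80..0xff: lead byte carries the top 2 bits, continuation byte the low 6
        printf '%02x%02x\n' $(( 0xc0 | (c >> 6) )) $(( 0x80 | (c & 0x3f) ))
    fi
}

latin1_byte_to_utf8_hex 0xc4   # 'Ä' in latin1; prints c384
```

This matches the observation in the script: the byte 0xc4 ends up as 0xc3 0x84, which is why both invocations produce the same commit.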