git.vger.kernel.org archive mirror
* commit messages are silently re-encoded to UTF-8 despite docs implying otherwise
@ 2023-06-19 13:22 paul
  2023-06-21 18:56 ` Torsten Bögershausen
  0 siblings, 1 reply; 2+ messages in thread
From: paul @ 2023-06-19 13:22 UTC (permalink / raw)
  To: git

When given a commit message that is not valid UTF-8, Git assumes it is encoded in Latin-1 (ISO-8859-1) and silently re-encodes it from Latin-1 to UTF-8. That assumption is sensible; the problem is that the docs imply no such conversion happens.

Documentation/i18n.txt explicitly says (emphasis mine):
> Although we encourage that the commit log messages are encoded
> in UTF-8, both the core and Git Porcelain are designed **not to
> force UTF-8** on projects.  If all participants of a particular
> project find it more convenient to use legacy encodings, Git
> does not forbid it.  However, there are a few things to keep in
> mind.
> . 'git commit' and 'git commit-tree' issues
>   a warning if the commit log message given to it does not look
>   like a valid UTF-8 string, unless you explicitly say your
>   project uses a legacy encoding.  The way to say this is to
>   have `i18n.commitEncoding` in `.git/config` file, like this:
> […]
> Note that **we deliberately chose not to re-code the commit log
> message when a commit is made to force UTF-8** at the commit
> object level, because re-coding to UTF-8 is not necessarily a
> reversible operation.

Said warning reads:
> # Warning: commit message did not conform to UTF-8.
> # You may want to amend it after fixing the message, or set the config
> # variable i18n.commitEncoding to the encoding your project uses.

I interpret this as: that config variable is only used to silence the warning (and as a courtesy to the future reader), not to control some behind-the-scenes conversion, especially because the note above says that no re-coding happens.

But we can easily verify that a non-UTF-8 commit message produces the same commit as its UTF-8 equivalent, which means the message was re-encoded:
---
#!/usr/bin/env bash

export {GIT_AUTHOR_NAME,GIT_AUTHOR_EMAIL,GIT_COMMITTER_EMAIL,GIT_COMMITTER_NAME}=me
export {GIT_AUTHOR_DATE,GIT_COMMITTER_DATE}=2005-04-07T22:13:13
# symbol 'Ä' is 0xc384 in utf-8 but 0xc4 in iso-8859-1

git commit-tree -m $(printf '\xc3\x84\n') $(git hash-object -t tree /dev/null)
# 4df6ed26ffd0b02c8b5c1d85ba184b02fbcaa2db

git commit-tree -m $(printf '\xc4\n') $(git hash-object -t tree /dev/null)
# Warning: commit message did not conform to UTF-8.
# You may want to amend it after fixing the message, or set the config
# variable i18n.commitEncoding to the encoding your project uses.
# 4df6ed26ffd0b02c8b5c1d85ba184b02fbcaa2db
---
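
To look at the stored bytes directly instead of inferring the conversion from the identical hashes, something along these lines should work. It is only a sketch: the throwaway repository and the variable names are illustrative, and `\304` is simply the octal spelling of the 0xc4 byte used above, chosen so the script also runs under a plain POSIX `sh`.

```shell
#!/bin/sh
# Sketch: dump the message bytes Git stored for a Latin-1 commit message.
# Assumes git, od and tr are installed; repo path and names are illustrative.
set -eu

repo=$(mktemp -d)
trap 'rm -rf "$repo"' EXIT
git init -q "$repo"
cd "$repo"

GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me
GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me
export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL

tree=$(git hash-object -t tree /dev/null)

# '\304' is octal for 0xc4, i.e. 'Ä' in ISO-8859-1 (not valid UTF-8 on its own).
commit=$(git commit-tree -m "$(printf '\304')" "$tree" 2>/dev/null)

# The message is the last line of the raw commit object; print it as hex.
stored_hex=$(git cat-file commit "$commit" | tail -n 1 | od -An -tx1 | tr -d ' \n')
echo "$stored_hex"
```

If the conversion described here takes place, this should print `c3840a` (the UTF-8 form of 'Ä' plus the trailing newline) rather than `c40a`.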

It's debatable whether Git should attempt an encoding conversion at all. I think it's fine if properly communicated, so at the very least the warning message should be changed to something like

> # Warning: commit message did not conform to UTF-8 and was automatically
> # converted. You possibly need to fix and amend it.
> # To use a different encoding than UTF-8, set the config variable
> # i18n.commitEncoding to the encoding your project uses.

and the documentation adapted similarly.
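
For comparison, this is roughly what declaring a legacy encoding looks like in practice. The sketch assumes ISO-8859-1 and a throwaway repository (both illustrative choices). With the variable set, Git is expected to record an `encoding` header in the commit object and to leave the message bytes untouched, which the last lines let you check.

```shell
#!/bin/sh
# Sketch: commit a Latin-1 message with i18n.commitEncoding set.
# Assumes git, sed, od and tr are installed; repo path and names are illustrative.
set -eu

repo=$(mktemp -d)
trap 'rm -rf "$repo"' EXIT
git init -q "$repo"
cd "$repo"

GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me
GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me
export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL

# Tell Git that this repository's commit messages are ISO-8859-1.
git config i18n.commitEncoding ISO-8859-1

tree=$(git hash-object -t tree /dev/null)
# '\304' is octal for 0xc4, i.e. 'Ä' in ISO-8859-1.
commit=$(git commit-tree -m "$(printf '\304')" "$tree")

# LC_ALL=C keeps sed from choking on the non-UTF-8 byte in the message.
encoding_header=$(git cat-file commit "$commit" | LC_ALL=C sed -n '/^encoding /p')
legacy_hex=$(git cat-file commit "$commit" | tail -n 1 | od -An -tx1 | tr -d ' \n')

echo "$encoding_header"
echo "$legacy_hex"
```

Here the hex dump should be `c40a`, i.e. the original byte, and the header line should read `encoding ISO-8859-1`.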

regards,
paul


PS: the code responsible for the conversion is in commit.c, line 1654:
> /* And check the encoding */
> if (encoding_is_utf8 && !verify_utf8(&buffer))
>     fprintf(stderr, _(commit_utf8_warn));

and
> /*
>  * This verifies that the buffer is in proper utf8 format.
>  *
>  * If it isn't, it assumes any non-utf8 characters are Latin1,
>  * and does the conversion.
>  */
> static int verify_utf8(struct strbuf *buf)
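
The effect of that Latin-1 assumption can be reproduced outside Git. The sketch below uses `iconv` rather than Git's own code, so it mirrors what the comment describes, not how `verify_utf8` is implemented. The second conversion is an added illustration: byte 0xa4 is '€' in ISO-8859-15, but read as Latin-1 it becomes '¤'.

```shell
#!/bin/sh
# Sketch: what "assume Latin-1 and convert to UTF-8" does to single bytes.
# Uses iconv as a stand-in for Git's internal conversion (an assumption).
set -eu

# '\304' is octal for 0xc4: 'Ä' in ISO-8859-1.
converted_hex=$(printf '\304' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "$converted_hex"   # c384

# '\244' is octal for 0xa4: '€' in ISO-8859-15, but '¤' (U+00A4) in ISO-8859-1.
currency_hex=$(printf '\244' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "$currency_hex"    # c2a4
```

The first result matches the 0xc384 encoding of 'Ä' noted in the script above. The second shows that a message written in some other legacy encoding still converts without error, just to a different character than the author intended.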
