git.vger.kernel.org archive mirror
* commit messages are silently re-encoded to UTF-8 despite docs implying otherwise
@ 2023-06-19 13:22 paul
  2023-06-21 18:56 ` Torsten Bögershausen
  0 siblings, 1 reply; 2+ messages in thread
From: paul @ 2023-06-19 13:22 UTC (permalink / raw)
  To: git

When supplying a commit message that is not UTF-8, Git assumes it is encoded in latin1/ISO-8859-1, and silently does a latin1 -> UTF-8 re-encoding. That assumption is sensible, but the problem is the docs imply such a conversion does not happen.

Documentation/i18n.txt explicitly says (emphasis mine):
> Although we encourage that the commit log messages are encoded
> in UTF-8, both the core and Git Porcelain are designed **not to
> force UTF-8** on projects.  If all participants of a particular
> project find it more convenient to use legacy encodings, Git
> does not forbid it.  However, there are a few things to keep in
> mind.
> . 'git commit' and 'git commit-tree' issues
>   a warning if the commit log message given to it does not look
>   like a valid UTF-8 string, unless you explicitly say your
>   project uses a legacy encoding.  The way to say this is to
>   have `i18n.commitEncoding` in `.git/config` file, like this:
> […]
> Note that **we deliberately chose not to re-code the commit log
> message when a commit is made to force UTF-8** at the commit
> object level, because re-coding to UTF-8 is not necessarily a
> reversible operation.

Said warning reads:
> # Warning: commit message did not conform to UTF-8.
> # You may want to amend it after fixing the message, or set the config
> # variable i18n.commitEncoding to the encoding your project uses.

I interpret this as: that config variable is only used to silence the warning (and as a courtesy to future readers), not to control some behind-the-scenes conversion, especially because the note above says re-coding doesn't happen.

But we can easily verify that a non-UTF-8 (Latin-1) commit message produces exactly the same commit object as its UTF-8 equivalent:
---
#!/usr/bin/env bash

export {GIT_AUTHOR_NAME,GIT_AUTHOR_EMAIL,GIT_COMMITTER_EMAIL,GIT_COMMITTER_NAME}=me
export {GIT_AUTHOR_DATE,GIT_COMMITTER_DATE}=2005-04-07T22:13:13
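# symbol 'Ä' is 0xc384 in utf-8 but 0xc4 in iso-8859-1

# message given as valid UTF-8 (substitutions quoted so the shell passes them as single words)
git commit-tree -m "$(printf '\xc3\x84\n')" "$(git hash-object -t tree /dev/null)"
# 4df6ed26ffd0b02c8b5c1d85ba184b02fbcaa2db

# same character given as a single Latin-1 byte
git commit-tree -m "$(printf '\xc4\n')" "$(git hash-object -t tree /dev/null)"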
# Warning: commit message did not conform to UTF-8.
# You may want to amend it after fixing the message, or set the config
# variable i18n.commitEncoding to the encoding your project uses.
# 4df6ed26ffd0b02c8b5c1d85ba184b02fbcaa2db
---
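
The matching hashes already imply the conversion, but it can also be seen directly by dumping the stored commit object. A minimal sketch, assuming `od`, `tail` and `tr` are available; the throwaway repository, the `me` identity and the variable names are placeholders chosen for illustration:

```shell
#!/usr/bin/env bash
set -eu

# Work in a throwaway repository so no real project is touched.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

export GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me
export GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me

tree=$(git hash-object -w -t tree /dev/null)

# The message is the single byte 0xc4 ('Ä' in ISO-8859-1).
# The UTF-8 warning goes to stderr and is discarded here.
commit=$(git commit-tree -m "$(printf '\xc4')" "$tree" 2>/dev/null)

# Hex dump of the last line of the stored object, i.e. the message.
# If the conversion described above happens, this prints c3840a
# ('Ä' as UTF-8 followed by a newline) rather than c40a.
stored_hex=$(git cat-file commit "$commit" | tail -n 1 | od -An -tx1 | tr -d ' \n')
echo "$stored_hex"
```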

It's debatable whether Git should even attempt to do an encoding conversion – I think it's fine if properly communicated, so at the very least the warning message should be changed to something like

> # Warning: commit message did not conform to UTF-8 and was automatically
> # converted. You possibly need to fix and amend it.
> # To use a different encoding than UTF-8, set the config variable
> # i18n.commitEncoding to the encoding your project uses.

and the documentation adapted similarly.
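
For comparison, a sketch of the case where the encoding is declared. The assumption here, based on reading the code rather than on anything the docs promise, is that with `i18n.commitEncoding` set to a non-UTF-8 value the UTF-8 check is skipped, the message bytes are stored untouched, and an `encoding` header is written into the commit object. The throwaway repository and variable names are again placeholders:

```shell
#!/usr/bin/env bash
set -eu

# Throwaway repository, as a placeholder for a real project.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

export GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me
export GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me

tree=$(git hash-object -w -t tree /dev/null)

# Same single-byte Latin-1 message, but with the legacy encoding declared
# for this one invocation via -c.
commit=$(git -c i18n.commitEncoding=ISO-8859-1 \
    commit-tree -m "$(printf '\xc4')" "$tree")

# Expected under the assumption above: the raw byte survives (c40a) ...
stored_hex=$(git cat-file commit "$commit" | tail -n 1 | od -An -tx1 | tr -d ' \n')
# ... and the object carries an explicit encoding header.
header=$(git cat-file commit "$commit" | LC_ALL=C grep '^encoding ')

echo "$stored_hex"
echo "$header"
```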

regards,
paul


PS: the code responsible for the conversion is in
commit.c +1654
> /* And check the encoding */
> if (encoding_is_utf8 && !verify_utf8(&buffer))
>     fprintf(stderr, _(commit_utf8_warn));

and
> /*
>  * This verifies that the buffer is in proper utf8 format.
>  *
>  * If it isn't, it assumes any non-utf8 characters are Latin1,
>  * and does the conversion.
>  */
> static int verify_utf8(struct strbuf *buf)


* Re: commit messages are silently re-encoded to UTF-8 despite docs implying otherwise
  2023-06-19 13:22 commit messages are silently re-encoded to UTF-8 despite docs implying otherwise paul
@ 2023-06-21 18:56 ` Torsten Bögershausen
  0 siblings, 0 replies; 2+ messages in thread
From: Torsten Bögershausen @ 2023-06-21 18:56 UTC (permalink / raw)
  To: paul; +Cc: git


(please no top-posting, see my reply at the end)
On Mon, Jun 19, 2023 at 03:22:27PM +0200, paul wrote:
> When supplying a commit message that is not UTF-8, Git assumes it is encoded in latin1/ISO-8859-1, and silently does a latin1 -> UTF-8 re-encoding. That assumption is sensible, but the problem is the docs imply such a conversion does not happen.
>
> Documentation/i18n.txt explicitly says (emphasis mine):
> > Although we encourage that the commit log messages are encoded
> > in UTF-8, both the core and Git Porcelain are designed **not to
> > force UTF-8** on projects.  If all participants of a particular
> > project find it more convenient to use legacy encodings, Git
> > does not forbid it.  However, there are a few things to keep in
> > mind.
> > . 'git commit' and 'git commit-tree' issues
> >   a warning if the commit log message given to it does not look
> >   like a valid UTF-8 string, unless you explicitly say your
> >   project uses a legacy encoding.  The way to say this is to
> >   have `i18n.commitEncoding` in `.git/config` file, like this:
> > […]
> > Note that **we deliberately chose not to re-code the commit log
> > message when a commit is made to force UTF-8** at the commit
> > object level, because re-coding to UTF-8 is not necessarily a
> > reversible operation.
>
> Said warning reads:
> > # Warning: commit message did not conform to UTF-8.
> > # You may want to amend it after fixing the message, or set the config
> > # variable i18n.commitEncoding to the encoding your project uses.
>
> I interpret this as: that config variable is only used to silence the warning (and as courtesy to the future reader), not to control some behind-the-scenes conversion, especially because the note above says a re-coding doesn't happen.
>
> But we can easily verify that a non-UTF-8 commit message produces the same commit:
> ---
> #!/usr/bin/env bash
>
> export {GIT_AUTHOR_NAME,GIT_AUTHOR_EMAIL,GIT_COMMITTER_EMAIL,GIT_COMMITTER_NAME}=me
> export {GIT_AUTHOR_DATE,GIT_COMMITTER_DATE}=2005-04-07T22:13:13
> # symbol 'Ä' is 0xc384 in utf-8 but 0xc4 in iso-8859-1
>
> git commit-tree -m $(printf '\xc3\x84\n') $(git hash-object -t tree /dev/null)
> # 4df6ed26ffd0b02c8b5c1d85ba184b02fbcaa2db
>
> git commit-tree -m $(printf '\xc4\n') $(git hash-object -t tree /dev/null)
> # Warning: commit message did not conform to UTF-8.
> # You may want to amend it after fixing the message, or set the config
> # variable i18n.commitEncoding to the encoding your project uses.
> # 4df6ed26ffd0b02c8b5c1d85ba184b02fbcaa2db
> ---
>
> It's debatable whether Git should even attempt to do an encoding conversion – I think it's fine if properly communicated, so at the very least the warning message should be changed to something like
>
> > # Warning: commit message did not conform to UTF-8 and was automatically
> > # converted. You possibly need to fix and amend it.
> > # To use a different encoding than UTF-8, set the config variable
> > # i18n.commitEncoding to the encoding your project uses.
>
> and the documentation adapted similarly.
>
> regards,
> paul
>
>
> PS: the code responsible for the conversion is in
> commit.c +1654
> > /* And check the encoding */
> > if (encoding_is_utf8 && !verify_utf8(&buffer))
> >     fprintf(stderr, _(commit_utf8_warn));
>
> and
> > /*
> >  * This verifies that the buffer is in proper utf8 format.
> >  *
> >  * If it isn't, it assumes any non-utf8 characters are Latin1,
> >  * and does the conversion.
> >  */
> > static int verify_utf8(struct strbuf *buf)

Thanks for the well-written report.
I haven't had time to dig into the details as deeply as you did
and work out what the best solution is.
Fixing the documentation to reflect reality is always good.
In the beginning, Git was more or less unaware of encodings.
Then Windows came along, using local code pages, and all Linux distros
switched to using UTF-8 everywhere.
Having said that:

Do you want to send a patch (or two) to improve the situation?

And out of interest:
How did you find this problem, and on which OS or system?
