From: "Torsten Bögershausen" <tboegi@web.de>
To: Jan-Philip Gehrcke <jgehrcke@googlemail.com>, git@vger.kernel.org
Subject: Re: Should the --encoding argument to log/show commands make any guarantees about their output?
Date: Mon, 15 Jun 2015 18:21:40 +0200
Message-ID: <557EFB94.3040104@web.de>
In-Reply-To: <557E91D2.3000908@googlemail.com>
On 2015-06-15 10.50, Jan-Philip Gehrcke wrote:
> Hello,
>
> I was surprised to see that the output of
>
> git log --encoding=utf-8 "--format=format:%b"
>
> can contain byte sequences that are invalid in UTF-8. Note: I am using git 2.1.4 and the %b format specifier represents the commit message body.
>
> I have seen this with the Linux git repository and the following test:
>
> git log --encoding=utf-8 "--format=format:%b" | python2 -c \
> 'import sys; [l.decode("utf-8") for l in sys.stdin]'
>
> Soon enough, errors like this appear:
>
> 'utf8' codec can't decode byte 0xf6 in position 19
>
> The help message to the --encoding argument reads:
>
>> The commit objects record the encoding used for the log message in
>> their encoding header; this option can be used to tell the command to
>> re-code the commit log message in the encoding preferred by the user
>
> I realize that this message does not give any guarantee about the output of the command, in the sense that --encoding=utf-8 produces valid UTF-8 data in all cases.
>
> However, I wonder what --encoding precisely does and if it has the behavior most users would expect.
>
> Let me describe what I think it currently does:
>
> The program attempts to re-code a log message, so it follows the chain
>
> raw input -> unicode -> raw output
I am not sure what "raw input/output" means,
but only one re-encoding step is involved, e.g.
input(8859) -> output(UTF-8)
When the commit does not declare an encoding for its message, UTF-8 is assumed.
But Git does not verify that the message really is UTF-8.
We could guess that a message that is not valid UTF-8 is ISO-8859-1, but that is not implemented.
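To illustrate the single re-encoding step, here is a minimal Python sketch of the behaviour as described in this thread. It is only a model, not Git's code (Git does this in C), and the function name reencode_message and its parameters are made up for this example.

```python
from typing import Optional


def reencode_message(raw: bytes, declared: Optional[str], output: str = "utf-8") -> bytes:
    """Model of one re-encoding step: declared (or assumed) encoding -> output encoding.

    Hypothetical helper for illustration only; it is not part of Git.
    """
    # A commit without an encoding header is assumed to be UTF-8.
    source = declared or "utf-8"
    try:
        # One conversion: interpret the bytes in the source encoding,
        # then write them out in the requested output encoding.
        return raw.decode(source).encode(output)
    except (UnicodeError, LookupError):
        # The message is not valid in the source encoding (or the encoding
        # name is unknown): the bytes are passed through unchanged.
        return raw
```

With this model, a message declared as ISO-8859-1 comes out as valid UTF-8, while a message that only claims (or is assumed) to be UTF-8 but contains a stray 0xf6 byte comes out byte for byte as it went in.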
>
> For the first step, knowledge about the input encoding is required.
When someone makes a commit whose message does not conform to UTF-8,
Git shows this message:
"Warning: commit message did not conform to UTF-8.\n"
"You may want to amend it after fixing the message, or set the config\n"
"variable i18n.commitencoding to the encoding your project uses.\n";
If the user ignores this warning, how should Git guess the encoding?
(Later Git versions try to do an auto-conversion assuming ISO-8859-1,
but that doesn't help existing repos.)
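For a tool reading old commits, the "UTF-8, else ISO-8859-1" guess could be sketched like this in Python. This is a heuristic for downstream code, not a description of what any Git version does, and guess_decode is a hypothetical name.

```python
def guess_decode(raw: bytes) -> str:
    """Decode as UTF-8 if possible, otherwise fall back to ISO-8859-1.

    Heuristic sketch only; the fallback can be wrong for messages that
    were written in some other legacy encoding.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this never fails,
        # but the result is only a guess.
        return raw.decode("iso-8859-1")
```

The fallback always produces some text, which is exactly why it cannot tell a correct guess from a wrong one.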
> This is retrieved from the encoding header of the commit object if present or (from the docs)
>"lack of this header implies that the commit log message is encoded in UTF-8."
>If this step fails (if the entry contains a byte sequence that is invalid in the specified/assumed input codec),
>the procedure is aborted and the data is dumped as is (obviously without applying the requested output encoding).
>
> Is that correct?
Yes, see above.
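To find the affected lines rather than stop at the first error, the test from the original mail could be reworked roughly as follows in Python 3. This is a sketch that works on bytes already captured from git log; the name find_invalid_utf8_lines is made up for this example.

```python
from typing import List, Tuple


def find_invalid_utf8_lines(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (line number, raw line) for every line that is not valid UTF-8."""
    bad = []
    # Splitting on the newline byte is safe: in valid UTF-8 the byte 0x0A
    # never appears inside a multi-byte sequence.
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((lineno, line))
    return bad
```

The input could be obtained, for example, with subprocess.run(["git", "log", "--encoding=utf-8", "--format=format:%b"], capture_output=True).stdout, so that the bytes are inspected before anything tries to decode them.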
>
> From my point of view the most natural abstraction of a log *message* is *text*, not bytes.
>The same is true for author names.
>If I want to build a tool chain on top of log/show, this usually means that I want to work with text information.
>Hence, I want to retrieve text (a sequence of code points) from git show/log.
>Text must be transported in encoded form, sure,
>but it must not contain byte sequences that are invalid in this codec.
>Because otherwise it's just not text anymore.
>
Call it corrupted.
> Hence, from my point of view, the rationale that git show/log should be able to output *text* information means
> that they should not emit byte sequences that are invalid in the codec specified via the --encoding argument.
> In the current situation, the work of dealing with invalid byte sequences is just outsourced to software
> further below in the tool chain
>(at some point a replacement character � should be displayed to the user instead of the invalid raw bytes).
>
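Until something changes in Git, a consumer of the output can do the replacement itself. A minimal Python sketch, assuming the tool wants UTF-8 text and accepts U+FFFD in place of invalid bytes (to_display_text is a hypothetical name):

```python
def to_display_text(raw: bytes) -> str:
    """Decode git log/show output as UTF-8, replacing invalid bytes with U+FFFD."""
    # errors="replace" substitutes the replacement character instead of raising.
    return raw.decode("utf-8", errors="replace")
```

This loses the original bytes, so it is only suitable for display, not for round-tripping the message.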
> I am not entirely sure where this discussion should lead to.
Yes, until someone writes a patch to improve either the documentation or the code,
nothing will be changed.
> However, I think that if the behavior of the software will not be changed,
>then the documentation for the --encoding option should be more precise and
>clarify what actually happens behind the scenes. What do you think?
Patches are more than welcome.
>
>
> Cheers,
>
>
> Jan-Philip Gehrcke