git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Torsten Bögershausen" <tboegi@web.de>
To: Jan-Philip Gehrcke <jgehrcke@googlemail.com>, git@vger.kernel.org
Subject: Re: Should the --encoding argument to log/show commands make any guarantees about their output?
Date: Mon, 15 Jun 2015 18:21:40 +0200	[thread overview]
Message-ID: <557EFB94.3040104@web.de> (raw)
In-Reply-To: <557E91D2.3000908@googlemail.com>

On 2015-06-15 10.50, Jan-Philip Gehrcke wrote:
> Hello,
> 
> I was surprised to see that the output of
> 
>     git log --encoding=utf-8 "--format=format:%b"
> 
> can contain byte sequences that are invalid in UTF-8. Note: I am using git 2.1.4 and the %b format specifier represents the commit message body.
> 
> I have seen this with the Linux git repository and the following test:
> 
>     git log --encoding=utf-8 "--format=format:%b" | python2 -c \
>         'import sys; [l.decode("utf-8") for l in sys.stdin]'
> 
> Soon enough errors like this appears:
> 
>     'utf8' codec can't decode byte 0xf6 in position 19
> 
> The help message to the --encoding argument reads:
> 
>> The commit objects record the encoding used for the log message in
>> their encoding header; this option can be used to tell the command to
>> re-code the commit log message in the encoding preferred by the user
> 
> I realize that this message does not give any guarantee about the output of the command, in the sense that --encoding=utf-8 produces valid UTF-8 data in all cases.
> 
> However, I wonder what --encoding precisely does and if it has the behavior most users would expect.
> 
> Let me describe what I think it currently does:
> 
> The program attempts to re-code a log message, so it follows the chain
> 
>     raw input -> unicode -> raw output
Not sure what "raw input/output" means.
But there is only one reencode step involved, e.g.
input(8859) -> output(UTF-8)
When the encoding of the commit message is undefined, UTF-8 is assumed.
But Git does no verify if the encoding is really UTF-8.
We could guess that if it is not UTF-8 then it is ISO-8859-1, but that is not implemented.

> 
> For the first step, knowledge about the input encoding is required. 
When someone does a commit where the commit message does not conform to UTF-8,
This message is shown from Git:
"Warning: commit message did not conform to UTF-8.\n"
"You may want to amend it after fixing the message, or set the config\n"
"variable i18n.commitencoding to the encoding your project uses.\n";

If the user ignores this warning, how should Git guess the encoding  ?
(Later Git versions try do an auto-conversion assuming ISO-8859-1) ,
but that doesn't help real existing repos.

> This is retrieved from the encoding header of the commit object if present or (from the docs) 
>"lack of this header implies that the commit log message is encoded in UTF-8." 
>If this step fails (if the entry contains a byte sequence that is invalid in the specified/assumed input codec), 
>the procedure is aborted and the data is dumped as is (obviously without applying the requested output encoding).
> 
> Is that correct?
Yes, see above.
> 
> From my point of view the most natural abstraction of a log *message* is *text*, not bytes. 

>The same is true for author names. 

>If I want to build a tool chain on top of log/show, this usually means that I want to work with text information. 
>Hence, I want to retrieve text (a sequence of code points) from git show/log. 
>Text must be transported in encoded form, sure, 
>but it must not contain byte sequences that are invalid in this codec. 
>Because otherwise it's just not text anymore.
> 
Call it corrupted.
> Hence, from my point of view, the rational that git show/log should be able to output *text* information means
> that they should not emit byte sequences that are invalid in the codec specified via the --encoding argument. 
> In the current situation, the work of dealing with invalid byte sequences is just outsourced to software
> further below in the tool chain 
>(at some point a replacement character � should be displayed to the user instead of the invalid raw bytes).
> 
> I am not entirely sure where this discussion should lead to. 
Yes, until someone writes a patch to improve either the documentation or the code,
nothing will be changed.
> However, I think that if the behavior of the software will not be changed, 
>then the documentation for the --encoding option should be more precise and 
>clarify what actually happens behind the scenes. What do you think?
Patches are more than welcome.
> 
> 
> Cheers,
> 
> 
> Jan-Philip Gehrcke

  reply	other threads:[~2015-06-15 16:21 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-06-15  8:50 Should the --encoding argument to log/show commands make any guarantees about their output? Jan-Philip Gehrcke
2015-06-15 16:21 ` Torsten Bögershausen [this message]
2015-06-16  9:38   ` Jan-Philip Gehrcke
2015-06-16 20:04     ` Torsten Bögershausen
2015-06-17 16:42 ` Junio C Hamano
2015-06-17 17:07   ` Jan-Philip Gehrcke
2015-06-17 18:46     ` Jeff King
2015-06-17 20:02       ` Junio C Hamano
2015-06-17 19:55     ` Torsten Bögershausen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=557EFB94.3040104@web.de \
    --to=tboegi@web.de \
    --cc=git@vger.kernel.org \
    --cc=jgehrcke@googlemail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).