Re: Should the --encoding argument to log/show commands make any guarantees about their output?


From: "Torsten Bögershausen" <tboegi@web.de>
To: Jan-Philip Gehrcke <jgehrcke@googlemail.com>, git@vger.kernel.org
Subject: Re: Should the --encoding argument to log/show commands make any guarantees about their output?
Date: Mon, 15 Jun 2015 18:21:40 +0200
Message-ID: <557EFB94.3040104@web.de>
In-Reply-To: <557E91D2.3000908@googlemail.com>

On 2015-06-15 10.50, Jan-Philip Gehrcke wrote:
> Hello,
> 
> I was surprised to see that the output of
> 
>     git log --encoding=utf-8 "--format=format:%b"
> 
> can contain byte sequences that are invalid in UTF-8. Note: I am using git 2.1.4 and the %b format specifier represents the commit message body.
> 
> I have seen this with the Linux git repository and the following test:
> 
>     git log --encoding=utf-8 "--format=format:%b" | python2 -c \
>         'import sys; [l.decode("utf-8") for l in sys.stdin]'
> 
> Soon enough errors like this appear:
> 
>     'utf8' codec can't decode byte 0xf6 in position 19
> 
> The help message to the --encoding argument reads:
> 
>> The commit objects record the encoding used for the log message in
>> their encoding header; this option can be used to tell the command to
>> re-code the commit log message in the encoding preferred by the user
> 
> I realize that this message does not give any guarantee about the output of the command, in the sense that --encoding=utf-8 produces valid UTF-8 data in all cases.
> 
> However, I wonder what --encoding precisely does and if it has the behavior most users would expect.
> 
> Let me describe what I think it currently does:
> 
> The program attempts to re-code a log message, so it follows the chain
> 
>     raw input -> unicode -> raw output
Not sure what "raw input/output" means.
There is only one re-encoding step involved, e.g.
input (ISO-8859-1) -> output (UTF-8)
When the encoding of the commit message is not declared, UTF-8 is assumed,
but Git does not verify that the message really is UTF-8.
We could guess that anything that is not valid UTF-8 is ISO-8859-1, but that is not implemented.
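
To make the guess concrete, here is a minimal sketch in Python (not Git's C code). The helper name `decode_commit_message` and its parameters are hypothetical, and the ISO-8859-1 fallback is only the assumption discussed here, not something Git does at this point:

```python
def decode_commit_message(raw, declared=None):
    """Decode commit message bytes to text.

    `declared` stands for the commit's encoding header; None means the
    header is absent. Hypothetical helper, not Git's implementation.
    """
    encoding = declared or "utf-8"  # no header implies UTF-8
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Guess: anything invalid in the declared encoding is ISO-8859-1.
        # ISO-8859-1 maps every byte value to a code point, so this
        # decode cannot fail (but it may guess wrong).
        return raw.decode("iso-8859-1")


# 0xf6 is "ö" in ISO-8859-1 and cannot start a valid UTF-8 sequence.
text = decode_commit_message(b"B\xf6gershausen")
print(text.encode("utf-8"))
```

The weakness of such a guess is that it never fails: bytes from any other legacy encoding would be silently decoded to the wrong characters.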

> 
> For the first step, knowledge about the input encoding is required. 
When someone makes a commit whose message does not conform to UTF-8,
Git shows this message:
"Warning: commit message did not conform to UTF-8.\n"
"You may want to amend it after fixing the message, or set the config\n"
"variable i18n.commitencoding to the encoding your project uses.\n";

If the user ignores this warning, how should Git guess the encoding?
(Later Git versions try to do an auto-conversion assuming ISO-8859-1,
but that doesn't help real existing repos.)

> This is retrieved from the encoding header of the commit object if present or (from the docs)
> "lack of this header implies that the commit log message is encoded in UTF-8."
> If this step fails (if the entry contains a byte sequence that is invalid in the specified/assumed input codec),
> the procedure is aborted and the data is dumped as is (obviously without applying the requested output encoding).
> 
> Is that correct?
Yes, see above.
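
As a rough model of the behavior described in this thread (a sketch, not Git's actual code; the function name `reencode_or_passthrough` and its parameters are made up for illustration), in Python:

```python
def reencode_or_passthrough(raw, output_encoding, declared=None):
    """Model of the re-encoding step as described in this thread.

    `declared` stands for the commit's encoding header (None if absent).
    """
    input_encoding = declared or "utf-8"  # missing header implies UTF-8
    if input_encoding.lower() == output_encoding.lower():
        # Same encoding on both sides: nothing to convert, nothing verified.
        return raw
    try:
        # The single re-encoding step: input encoding -> output encoding.
        return raw.decode(input_encoding).encode(output_encoding)
    except (UnicodeError, LookupError):
        # Conversion failed: the bytes are passed through unchanged.
        return raw


# No header, invalid UTF-8 inside: the byte 0xf6 reaches the output as is.
print(reencode_or_passthrough(b"B\xf6gershausen", "utf-8"))
# With a correct header the same bytes are converted to valid UTF-8.
print(reencode_or_passthrough(b"B\xf6gershausen", "utf-8", "iso-8859-1"))
```

Either way the caller gets bytes back and cannot tell from the output alone whether a conversion happened.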
> 
> From my point of view the most natural abstraction of a log *message* is *text*, not bytes.
> The same is true for author names.
> If I want to build a tool chain on top of log/show, this usually means that I want to work with text information.
> Hence, I want to retrieve text (a sequence of code points) from git show/log.
> Text must be transported in encoded form, sure,
> but it must not contain byte sequences that are invalid in this codec.
> Because otherwise it's just not text anymore.
> 
Call it corrupted.
> Hence, from my point of view, the rationale that git show/log should be able to output *text* information means
> that they should not emit byte sequences that are invalid in the codec specified via the --encoding argument.
> In the current situation, the work of dealing with invalid byte sequences is just outsourced to software
> further down the tool chain
> (at some point a replacement character � should be displayed to the user instead of the invalid raw bytes).
> 
> I am not entirely sure where this discussion should lead to. 
Yes, until someone writes a patch to improve either the documentation or the code,
nothing will be changed.
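
Until then, a tool built on top of log/show has to deal with invalid bytes itself. A minimal sketch in Python 3, assuming the tool reads the output as raw bytes (the helper name `to_text` is hypothetical):

```python
def to_text(raw_lines, encoding="utf-8"):
    """Decode raw output lines to text.

    Invalid byte sequences are replaced with U+FFFD instead of raising.
    """
    return [line.decode(encoding, errors="replace") for line in raw_lines]


# The second line contains the stray byte 0xf6 from the error report above.
lines = to_text([b"valid line\n", b"B\xf6gershausen\n"])
print(lines)
```

In a pipeline like the test earlier in this thread, the raw lines would come from `sys.stdin.buffer` rather than from a literal list.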
> However, I think that if the behavior of the software will not be changed,
> then the documentation for the --encoding option should be more precise and
> clarify what actually happens behind the scenes. What do you think?
Patches are more than welcome.
> 
> 
> Cheers,
> 
> 
> Jan-Philip Gehrcke


