From: Jan-Philip Gehrcke <jgehrcke@googlemail.com>
To: git@vger.kernel.org
Subject: Re: Should the --encoding argument to log/show commands make any guarantees about their output?
Date: Wed, 17 Jun 2015 19:07:48 +0200
Message-ID: <5581A964.4000500@googlemail.com>
In-Reply-To: <xmqqzj3y2snq.fsf@gitster.dls.corp.google.com>

On 17.06.2015 18:42, Junio C Hamano wrote:
> Jan-Philip Gehrcke <jgehrcke@googlemail.com> writes:
>
>> I was surprised to see that the output of
>>
>>      git log --encoding=utf-8 "--format=format:%b"
>>
>> can contain byte sequences that are invalid in UTF-8. Note: I am using
>> git 2.1.4 and the %b format specifier represents the commit message
>> body.
>
> Yeah, if the original was bad and cannot be sanely expressed in
> UTF-8, you have two options.  You can show the contents as raw bytes
> recorded in the object with a warning so that the user can use it as
> such (e.g. perhaps the original was indeed an iso8859-2 but was
> incorrectly marked as UTF-8, or something like that, and a human
> that is more intelligent than a tool _could_ guess and attempt to
> recover).  Or you can error out and refuse to produce output.

The two-option scenario is totally clear. One must stress, though, that
the "error-out" option can, as discussed, be kept minimally invasive: it
is sufficient (and common) to replace only those byte sequences that
would be invalid in the requested output encoding with a replacement
symbol. This retains as much information as possible while guaranteeing
that a subsequent decoder receives valid input.
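
To illustrate the skip-and-replace idea, here is a minimal Python sketch.
It is not what git does; it only shows how a consumer of the output could
check for invalid UTF-8 and substitute U+FFFD for the offending bytes. The
helper names `is_valid_utf8` and `sanitize_utf8` and the sample bytes are
made up for this example.

```python
# Hypothetical helpers; only the Python standard library is used.

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def sanitize_utf8(data: bytes) -> bytes:
    """Replace each invalid byte sequence with U+FFFD and re-encode."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


# Assumed sample: a message body written as ISO-8859-1 but treated as UTF-8.
raw = b"caf\xe9 fix"
clean = sanitize_utf8(raw)
```

After this, `raw` fails the strict check while `clean` passes it, and the
valid bytes around the bad one are preserved unchanged.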

> We deliberately made a design choice to take the former option.

I totally support this design choice in general, especially when
invoking `git whatever` without options. This is, I think, mainly
about documentation and the semantics of "--encoding". From my point of
view, `--encoding=utf-8` semantically suggests that the output *is*
valid UTF-8. But it is not, at least not always. My initial question
was: what do you think about this? Should we

* just make this clearer in the docs, and/or
* adjust the behavior of --encoding, or
* do something entirely different, like adding a new command-line
  option, or
* just leave things as they are?

Thanks and cheers,


Jan-Philip

Thread overview: 9+ messages
2015-06-15  8:50 Should the --encoding argument to log/show commands make any guarantees about their output? Jan-Philip Gehrcke
2015-06-15 16:21 ` Torsten Bögershausen
2015-06-16  9:38   ` Jan-Philip Gehrcke
2015-06-16 20:04     ` Torsten Bögershausen
2015-06-17 16:42 ` Junio C Hamano
2015-06-17 17:07   ` Jan-Philip Gehrcke [this message]
2015-06-17 18:46     ` Jeff King
2015-06-17 20:02       ` Junio C Hamano
2015-06-17 19:55     ` Torsten Bögershausen
