git.vger.kernel.org archive mirror
From: Jeff King <peff@peff.net>
To: Jan-Philip Gehrcke <jgehrcke@googlemail.com>
Cc: Junio C Hamano <gitster@pobox.com>, git@vger.kernel.org
Subject: Re: Should the --encoding argument to log/show commands make any guarantees about their output?
Date: Wed, 17 Jun 2015 14:46:08 -0400
Message-ID: <20150617184607.GA28455@peff.net>
In-Reply-To: <5581A964.4000500@googlemail.com>

On Wed, Jun 17, 2015 at 07:07:48PM +0200, Jan-Philip Gehrcke wrote:

> The two-option scenario is totally clear. Although one must stress that the
> "error-out" option can, as discussed, be kept minimally invasive: it is
> sufficient (and common) to just skip those byte sequences (and replace them
> with a replacement symbol) that would be invalid in the requested output
> encoding. This would retain as much information as possible while
> guaranteeing a subsequent decoder to retrieve valid input.

I think "munge into valid UTF-8, even if it means losing data" is a
totally valid and useful option. I'm not completely sure that git should
do that, though.  E.g., you could just as easily do:

  git log --encoding=utf8 | drop_invalid_utf8 | your_script

Or quite possibly, your_script could do the munging itself while reading
the data. I do not know much about Python's input handling, but in Perl,
it is easy to say "the input is utf8, and replace anything bogus with a
substitution character"[1].
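
For the Python side, a minimal sketch of such a filter might look like the
following. It relies only on the standard `errors="replace"` decoding mode;
the function name `decode_lossy` and the script name used below are made up
for illustration, and this is not something git ships.

```python
import sys


def decode_lossy(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing each invalid sequence with U+FFFD.

    Hypothetical helper for illustration; it is lossy by design.
    """
    return raw.decode("utf-8", errors="replace")


def main() -> None:
    # Read raw bytes so that Python does not try to decode stdin itself,
    # then write the cleaned text back out as valid UTF-8.
    for raw_line in sys.stdin.buffer:
        sys.stdout.buffer.write(decode_lossy(raw_line).encode("utf-8"))


if __name__ == "__main__":
    main()
```

Saved under an assumed name such as drop_invalid_utf8.py, it could stand in
for the middle stage of the pipeline above:

  git log --encoding=utf8 | python3 drop_invalid_utf8.py | your_script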

> Should we
> 
> * just make this more clear in the docs and/or
> * should we adjust the behavior of --encoding or
> * should we do something entirely different, like adding a new command line
> option or
> * should we just leave things as they are?

I would vote for a documentation change, perhaps like:

Subject: docs: clarify that --encoding can produce invalid sequences

In the common case that the commit encoding matches the
output encoding, we do not touch the buffer at all, which
makes things much more efficient. But it might be unclear to
a consumer that we will pass through bogus sequences.

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/pretty-options.txt | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/Documentation/pretty-options.txt b/Documentation/pretty-options.txt
index 74aa01a..642af6e 100644
--- a/Documentation/pretty-options.txt
+++ b/Documentation/pretty-options.txt
@@ -37,7 +37,10 @@ people using 80-column terminals.
 	in their encoding header; this option can be used to tell the
 	command to re-code the commit log message in the encoding
 	preferred by the user.  For non plumbing commands this
-	defaults to UTF-8.
+	defaults to UTF-8. Note that if an object claims to be encoded
+	in `X` and we are outputting in `X`, we will output the object
+	verbatim; this means that invalid sequences in the original
+	commit may be copied to the output.
 
 --notes[=<ref>]::
 	Show the notes (see linkgit:git-notes[1]) that annotate the
-- 
2.4.4.719.g3984bc6
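
To make the documented behavior concrete, here is a toy Python model of it.
This is only a sketch and not git's actual C implementation: the function
name `reencode_log_message` is hypothetical, and the lossy conversion in the
second branch is an assumption made purely so the example runs. The point is
the first branch, where matching encodings mean the bytes are returned
untouched, valid or not.

```python
import codecs


def reencode_log_message(raw: bytes, commit_encoding: str,
                         output_encoding: str) -> bytes:
    """Toy model of the pass-through described in the patch above."""
    # Normalize spellings such as "utf8" and "UTF-8" to one codec name.
    same = (codecs.lookup(commit_encoding).name
            == codecs.lookup(output_encoding).name)
    if same:
        # Claimed encoding equals requested encoding: output verbatim,
        # so any invalid sequences in the commit are copied through.
        return raw
    # Assumption for illustration only: convert, replacing what cannot
    # be decoded or encoded.
    text = raw.decode(commit_encoding, errors="replace")
    return text.encode(output_encoding, errors="replace")


# A commit that claims UTF-8 but contains a stray Latin-1 byte.
bogus = b"caf\xe9"
passed_through = reencode_log_message(bogus, "utf8", "UTF-8")
converted = reencode_log_message(bogus, "latin-1", "utf-8")
```

In this model `passed_through` is still the invalid byte string, which is
why a consumer that needs valid UTF-8 has to check or clean the output
itself.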
