* Possible bug: "git log" ignores "--encoding=UTF-8" option if --pretty=format:%e%n%s%n is used
From: Constantine Plotnikov @ 2008-11-11 19:12 UTC
To: git
If encoding is forced for git log with "--encoding=UTF-8", it seems to
be ineffective for the formatted output. Is it intentional?
Let's consider the following scenario (script is for bash, git 1.5.4.3
and 1.6.0.2):
git init
echo initial >test.txt
git add test.txt
echo -e \\0320\\0242\\0320\\0265\\0321\\0201\\0321\\0202 >msg-UTF-8.txt
git commit -F msg-UTF-8.txt
echo updated >test.txt
git add test.txt
echo -e \\0322\\0345\\0361\\0362 >msg-windows-1251.txt
git config i18n.commitencoding Windows-1251
git commit -F msg-windows-1251.txt
git log --encoding=Windows-1251 >log1.txt
git log --encoding=UTF-8 >log2.txt
git log --encoding=Windows-1251 --pretty=format:%e%n%s%n >log3.txt
git log --encoding=UTF-8 --pretty=format:%e%n%s%n >log4.txt
In both cases the message is the Russian word for "test", in a
different encoding.
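As a quick sanity check of the two byte sequences in the script, the following sketch converts the Windows-1251 bytes to UTF-8 and compares the result with the UTF-8 bytes. It assumes an `iconv` binary that knows the WINDOWS-1251 charset; the variable names are only illustrative.

```shell
# The same bytes as in the script, written as printf octal escapes.
utf8_msg=$(printf '\320\242\320\265\321\201\321\202')
cp1251_msg_as_utf8=$(printf '\322\345\361\362' | iconv -f WINDOWS-1251 -t UTF-8)

# If both messages are the same word, the converted bytes match exactly.
if [ "$utf8_msg" = "$cp1251_msg_as_utf8" ]; then
    echo "same text"
else
    echo "different text"
fi
```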
If we compare log1.txt and log2.txt, we will see that the specified
encoding is used for both commits in the log.
If we compare log3.txt and log4.txt, we will see that the contents of
the two files are completely identical. The native encoding is
used for each commit. So the first listed commit
is encoded using Windows-1251 and the second one using UTF-8. However
in the description of the %s %b options nothing is said about which
encoding is used and implied behavior is that they are affected by
--encoding option.
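The symptom can be checked mechanically rather than by eye. A minimal sketch, assuming the log3.txt and log4.txt files produced by the script; the helper name `encoding_respected` is made up for this example and is not part of git.

```shell
# Hypothetical helper: compares two logs captured with different
# --encoding values and reports whether the option had any effect.
encoding_respected() {
    cmp -s "$1" "$2"
    case $? in
        0) echo "ignored" ;;            # identical bytes: no re-encoding happened
        1) echo "respected" ;;          # bytes differ: output was re-encoded
        *) echo "error"; return 2 ;;    # e.g. one of the files is missing
    esac
}
```

Calling `encoding_respected log3.txt log4.txt` on the versions named in the report would be expected to print "ignored", since the two files are described as identical.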
I suggest documenting that the placeholders %s and %b use native
commit encoding and introducing the placeholders %S and %B options
that use encoding specified on the command line or the default log
encoding.
I also suggest adding %g and %G placeholders (%m placeholder is
already occupied) that print the entire commit message instead of just
the subject or the body. Currently the tools have to join the entire
message from two parts when they are just interested in the entire
message.
Regards,
Constantine
* Re: Possible bug: "git log" ignores "--encoding=UTF-8" option if --pretty=format:%e%n%s%n is used
From: Jeff King @ 2008-11-12 10:43 UTC
To: Constantine Plotnikov; +Cc: git
On Tue, Nov 11, 2008 at 10:12:46PM +0300, Constantine Plotnikov wrote:
> is encoded using Windows-1251 and the second one using UTF-8. However
> in the description of the %s %b options nothing is said about which
> encoding is used and implied behavior is that they are affected by
> --encoding option.
>
> I suggest documenting that the placeholders %s and %b use native
> commit encoding and introducing the placeholders %S and %B options
> that use encoding specified on the command line or the default log
> encoding.
I don't actually use any encodings except UTF-8, so maybe there is some
subtle reason not to do so that I don't understand, but I would have
expected all of the format placeholders to respect any --encoding
parameter.
If that is the desired behavior, this should not be too hard to make a
patch for:
1. in pretty_print_commit, move the code path for userformat to just
after the re-encoding
2. pass the re-encoded buffer to format_commit_message, where it will
be put into the context struct
3. use the re-encoded buffer in parse_commit_header
Maybe it would make a good exercise for somebody who wants to dig into
git a little deeper? Volunteers?
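The ordering in the three steps can be illustrated with a toy model in shell. This is not git's C code and does not touch pretty_print_commit; it only shows the idea of re-encoding the whole commit buffer first and parsing fields out of the re-encoded copy afterwards. The function name `subject_in_encoding` is hypothetical, `iconv` is assumed to be available, and the "subject" is simplified to the first line after the headers.

```shell
# Toy model only: reads a raw commit object on stdin and prints its
# first message line in the encoding given as $1.
subject_in_encoding() {
    out_enc=$1
    buf=$(mktemp) || return 1
    cat >"$buf"
    # The commit's own encoding is taken from its "encoding" header;
    # headers end at the first blank line.
    in_enc=$(LC_ALL=C sed -n -e '/^$/q' -e 's/^encoding //p' "$buf")
    # Re-encode the whole buffer first, then parse the re-encoded copy:
    # drop the headers and print the first line that follows them.
    iconv -f "${in_enc:-UTF-8}" -t "$out_enc" "$buf" |
        LC_ALL=C sed -n -e '1,/^$/d' -e 'p' -e 'q'
    rm -f "$buf"
}
```

With a real repository the input could come from `git cat-file commit <rev>`; because parsing happens after conversion, every field taken from the buffer ends up in the requested encoding.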
> I also suggest adding %g and %G placeholders (%m placeholder is
> already occupied) that print the entire commit message instead of just
> the subject or the body. Currently the tools have to join the entire
> message from two parts when they are just interested in the entire
> message.
This actually annoyed me earlier today. What got me was that '%s%n%n%b'
doesn't necessarily give you the exact commit message; if it's a
one-liner (i.e., body is blank), then you end up with an extra newline.
Again, this should be a pretty easy exercise to add. Volunteers?
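Until such a placeholder exists, a tool has to do the join itself and special-case the empty body. A minimal sketch, assuming the subject and body were already obtained separately via %s and %b; the helper name `join_message` is invented for this example. It only approximates the stored message, since %s is not guaranteed to be byte-identical to the original first paragraph.

```shell
# Hypothetical helper: rebuild a message from subject ($1) and body ($2)
# without the stray blank line that '%s%n%n%b' yields for one-liners.
join_message() {
    subject=$1
    body=$2
    if [ -n "$body" ]; then
        # Subject, one blank separator line, then the body.
        printf '%s\n\n%s\n' "$subject" "$body"
    else
        # One-line commit: subject only, no separator.
        printf '%s\n' "$subject"
    fi
}
```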
-Peff
* Re: Possible bug: "git log" ignores "--encoding=UTF-8" option if --pretty=format:%e%n%s%n is used
From: Jeff King @ 2008-11-12 11:26 UTC
To: Constantine Plotnikov; +Cc: git
[re-adding list to the cc]
On Wed, Nov 12, 2008 at 02:11:46PM +0300, Constantine Plotnikov wrote:
> > I don't actually use any encodings except UTF-8, so maybe there is some
> > subtle reason not to do so that I don't understand, but I would have
> > expected all of the format placeholders to respect any --encoding
> > parameter.
> >
> Even if this is the bug, it would be better to leave the old behavior
> for backward compatibility reasons and introduce new placeholders.
> Currently tools have to decode messages according to the commit
> encoding, and changing behavior of options will break these tools
> that have implemented workaround for this problem.
Are there such tools? I assumed they would have complained about this as
a bug before writing their own encoding conversion tools. And this is,
AFAIK, the first bug report.
I don't mind playing it safe to avoid breaking other people's tools, but
I'm also not excited about adding a second, "respect encoding" version
of many placeholders (and it's not just %s and %b; I think you would
need author and committer names and emails, too).
-Peff
* Re: Possible bug: "git log" ignores "--encoding=UTF-8" option if --pretty=format:%e%n%s%n is used
From: Constantine Plotnikov @ 2008-11-12 13:08 UTC
To: Jeff King; +Cc: git
On Wed, Nov 12, 2008 at 2:26 PM, Jeff King <peff@peff.net> wrote:
> [re-adding list to the cc]
>
> On Wed, Nov 12, 2008 at 02:11:46PM +0300, Constantine Plotnikov wrote:
>
>> > I don't actually use any encodings except UTF-8, so maybe there is some
>> > subtle reason not to do so that I don't understand, but I would have
>> > expected all of the format placeholders to respect any --encoding
>> > parameter.
>> >
>> Even if this is the bug, it would be better to leave the old behavior
>> for backward compatibility reasons and introduce new placeholders.
>> Currently tools have to decode messages according to the commit
>> encoding, and changing behavior of options will break these tools
>> that have implemented workaround for this problem.
>
> Are there such tools? I assumed they would have complained about this as
> a bug before writing their own encoding conversion tools. And this is,
> AFAIK, the first bug report.
>
> I don't mind playing it safe to avoid breaking other people's tools, but
> I'm also not excited about adding a second, "respect encoding" version
> of many placeholders (and it's not just %s and %b; I think you would
> need author and committer names and emails, too).
>
The reason for the request was that for IDE integration (I'm working
on the IDEA plugin), we need to work with past versions of git as
well. However, we could document this as a known git bug that will be
fixed in some future version, and for now just show incorrect data in
the history view when a non-UTF-8 encoding is used. I hope that
non-UTF-8 encoding for commits is indeed a rare case, so users will
not complain much.
BTW for some reason --pretty=raw is affected by encoding option on the
command line. And this is a bit surprising as from description of the
raw format it looks like it should not be affected, because the
re-encoded commit is not "the entire commit exactly as stored in the
commit object". Possibly the man page should be updated to clarify
this.
Regards,
Constantine
* Re: Possible bug: "git log" ignores "--encoding=UTF-8" option if --pretty=format:%e%n%s%n is used
From: Junio C Hamano @ 2008-11-13 1:38 UTC
To: Constantine Plotnikov; +Cc: Jeff King, git
"Constantine Plotnikov" <constantine.plotnikov@gmail.com> writes:
> BTW for some reason --pretty=raw is affected by encoding option on the
> command line.
Unfortunately, that is what you get for reading from a Porcelain output,
which is meant for, and are subject to improvement for, human consumption.
If you want bit-for-bit information, you can always ask "git cat-file".
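For a tool that goes this route, the raw commit object carries its own "encoding" header, so the message bytes can be interpreted without relying on "git log". A sketch, assuming the object is obtained with `git cat-file commit <rev>`; the helper name `commit_encoding` is hypothetical.

```shell
# Hypothetical helper: reads a raw commit object on stdin and prints the
# encoding its message is stored in.
commit_encoding() {
    # Header lines end at the first blank line; stop there so that
    # message text starting with "encoding " is never mistaken for a header.
    enc=$(LC_ALL=C sed -n -e '/^$/q' -e 's/^encoding //p')
    # A commit without an "encoding" header is treated as UTF-8.
    printf '%s\n' "${enc:-UTF-8}"
}
```

Typical use would be `git cat-file commit HEAD | commit_encoding`.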
* Re: Possible bug: "git log" ignores "--encoding=UTF-8" option if --pretty=format:%e%n%s%n is used
From: Jeff King @ 2008-11-13 4:34 UTC
To: Junio C Hamano; +Cc: Constantine Plotnikov, git
On Wed, Nov 12, 2008 at 05:38:47PM -0800, Junio C Hamano wrote:
> > BTW for some reason --pretty=raw is affected by encoding option on the
> > command line.
>
> Unfortunately, that is what you get for reading from a Porcelain output,
> which is meant for, and are subject to improvement for, human consumption.
>
> If you want bit-for-bit information, you can always ask "git cat-file".
What about "git rev-list --pretty=raw"? Is that also porcelain?
I would be curious to hear your take on our failure to respect
--encoding for --pretty=format. Is it a bug to be fixed, or a historical
behavior to be maintained?
-Peff
* Re: Possible bug: "git log" ignores "--encoding=UTF-8" option if --pretty=format:%e%n%s%n is used
From: Junio C Hamano @ 2008-11-13 5:10 UTC
To: Jeff King; +Cc: Constantine Plotnikov, git
Jeff King <peff@peff.net> writes:
> What about "git rev-list --pretty=raw"? Is that also porcelain?
Does it re-encode? I didn't check, but ideally it shouldn't (but I do not
care too much either way, to be honest).
> I would be curious to hear your take on our failure to respect
> --encoding for --pretty=format. Is it a bug to be fixed, or a historical
> behavior to be maintained?
I think the fix you outlined was quite reasonable.
* Re: Possible bug: "git log" ignores "--encoding=UTF-8" option if --pretty=format:%e%n%s%n is used
From: Jeff King @ 2008-11-13 5:48 UTC
To: Junio C Hamano; +Cc: Constantine Plotnikov, git
On Wed, Nov 12, 2008 at 09:10:26PM -0800, Junio C Hamano wrote:
> > What about "git rev-list --pretty=raw"? Is that also porcelain?
>
> Does it re-encode? I didn't check, but ideally it shouldn't (but I do not
> care too much either way, to be honest).
Yes, it uses the same pretty_print_commit routine as the "log".
> > I would be curious to hear your take on our failure to respect
> > --encoding for --pretty=format. Is it a bug to be fixed, or a historical
> > behavior to be maintained?
>
> I think the fix you outlined was quite reasonable.
One thing I just realized that makes it even more reasonable: we
properly munge the encoding header when we _do_ re-encode. So whether we
re-encode or not, you will always get the correct encoding for what is
being output via "%e". Which means that a tool which handles the current
"broken" behavior by doing the re-encoding itself will trivially handle the
new version: the output will just always be in the --encoding specified
instead of whatever the original encoding was.
And if there are tools that are not looking at the output encoding (and
blindly assuming --encoding works), then they are already broken by the
current behavior, and we will be fixing them.
So I think it is safe to "fix" it as I described.
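The compatibility argument can be illustrated with a sketch of such a tool: it trusts whatever "%e" reports for each commit and converts the subject to UTF-8 itself, so it behaves the same whether or not git has already re-encoded the output. The sketch assumes the output of `--pretty=format:%e%n%s%n` arrives as an encoding line, a subject line, and a blank line per commit, that an empty encoding line means UTF-8, and that `iconv` is available; the function name `normalize_log` is made up.

```shell
# Hypothetical consumer: reads "%e%n%s%n" records on stdin and prints
# every subject in UTF-8, using the per-commit encoding line.
normalize_log() {
    while IFS= read -r enc && IFS= read -r subject; do
        # An empty encoding line is treated as UTF-8.
        printf '%s\n' "$subject" | iconv -f "${enc:-UTF-8}" -t UTF-8
        # Skip the blank line that separates records; stop at end of input.
        IFS= read -r _blank || break
    done
}
```

It would be fed with something like `git log --pretty=format:%e%n%s%n | normalize_log`.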
-Peff