* Possible bug: "git log" ignores "--encoding=UTF-8" option if --pretty=format:%e%n%s%n is used
From: Constantine Plotnikov @ 2008-11-11 19:12 UTC
To: git

If encoding is forced for git log with "--encoding=UTF-8", it seems to
be ineffective for the formatted output. Is it intentional?

Let's consider the following scenario (script is for bash, git 1.5.4.3
and 1.6.0.2):

    git init
    echo initial >test.txt
    git add test.txt
    echo -e \\0320\\0242\\0320\\0265\\0321\\0201\\0321\\0202 >msg-UTF-8.txt
    git commit -F msg-UTF-8.txt
    echo updated >test.txt
    git add test.txt
    echo -e \\0322\\0345\\0361\\0362 >msg-windows-1251.txt
    git config i18n.commitencoding Windows-1251
    git commit -F msg-windows-1251.txt
    git log --encoding=Windows-1251 >log1.txt
    git log --encoding=UTF-8 >log2.txt
    git log --encoding=Windows-1251 --pretty=format:%e%n%s%n >log3.txt
    git log --encoding=UTF-8 --pretty=format:%e%n%s%n >log4.txt

In both cases the commit message is the Russian word for "test", each
time in a different encoding.

If we compare log1.txt and log2.txt, we will see that in each file the
specified encoding is used for both commits in the log.

If we compare log3.txt and log4.txt, we will see that the contents of
the two files are completely identical. The native encoding is used
for each commit. So the first listed commit
is encoded using Windows-1251 and the second one using UTF-8. However
in the description of the %s %b options nothing is said about which
encoding is used and implied behavior is that they are affected by
--encoding option.

I suggest documenting that the placeholders %s and %b use native
commit encoding and introducing the placeholders %S and %B options
that use encoding specified on the command line or the default log
encoding.

I also suggest adding %g and %G placeholders (%m placeholder is
already occupied) that print the entire commit message instead of just
the subject or the body. Currently the tools have to join the entire
message from two parts when they are just interested in the entire
message.

Regards,
Constantine
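As a quick sanity check on the two message files used in the script, the sketch below (not part of the original report) dumps the bytes of both messages and re-encodes the Windows-1251 one to UTF-8. It uses `printf` octal escapes in place of `echo -e`, and it assumes an `iconv` that accepts the name `CP1251` for Windows-1251; the helper names are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: the two commit messages from the report are the same word stored
# in two encodings. Assumption: iconv knows the alias CP1251.

# Bytes of the UTF-8 message (same octal values as in the report).
utf8_msg() { printf '\320\242\320\265\321\201\321\202\n'; }

# Bytes of the Windows-1251 message.
cp1251_msg() { printf '\322\345\361\362\n'; }

# Hex dump on one line, so the raw bytes can be compared on any terminal.
hex() { od -An -tx1 | tr -d ' \n'; echo; }

utf8_msg | hex
cp1251_msg | hex
# Re-encode the Windows-1251 bytes; the dump should match the UTF-8 one.
cp1251_msg | iconv -f CP1251 -t UTF-8 | hex
```

If the first and third dumps agree, any difference between log3.txt and log4.txt can only come from how git handles the encoding, not from the test data.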
* Re: Possible bug: "git log" ignores "--encoding=UTF-8" option if --pretty=format:%e%n%s%n is used
From: Jeff King @ 2008-11-12 10:43 UTC
To: Constantine Plotnikov; +Cc: git

On Tue, Nov 11, 2008 at 10:12:46PM +0300, Constantine Plotnikov wrote:

> is encoded using Windows-1251 and the second one using UTF-8. However
> in the description of the %s %b options nothing is said about which
> encoding is used and implied behavior is that they are affected by
> --encoding option.
>
> I suggest documenting that the placeholders %s and %b use native
> commit encoding and introducing the placeholders %S and %B options
> that use encoding specified on the command line or the default log
> encoding.

I don't actually use any encodings except UTF-8, so maybe there is some
subtle reason not to do so that I don't understand, but I would have
expected all of the format placeholders to respect any --encoding
parameter.

If that is the desired behavior, this should not be too hard to make a
patch for:

  1. in pretty_print_commit, move the code path for userformat to just
     after the re-encoding

  2. pass the re-encoded buffer to format_commit_message, where it will
     be put into the context struct

  3. use the re-encoded buffer in parse_commit_header

Maybe it would make a good exercise for somebody who wants to dig into
git a little deeper? Volunteers?

> I also suggest adding %g and %G placeholders (%m placeholder is
> already occupied) that print the entire commit message instead of just
> the subject or the body. Currently the tools have to join the entire
> message from two parts when they are just interested in the entire
> message.

This actually annoyed me earlier today. What got me was that
'%s%n%n%b' doesn't necessarily give you the exact commit message; if
it's a one-liner (i.e., body is blank), then you end up with an extra
newline.

Again, this should be a pretty easy exercise to add. Volunteers?

-Peff
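The '%s%n%n%b' mismatch can be observed with a small sketch like the following. It is only an illustration: the helper names are hypothetical, and it assumes a commit without unusual headers, so that everything after the first blank line of the raw object is the message.

```bash
#!/usr/bin/env bash
# Sketch: compare the message rebuilt from %s and %b with the stored one.
# Helper names are invented for this example; run them inside a repository.

# Message as stored: everything after the first blank line of the raw commit.
stored_message() { git cat-file commit "$1" | sed '1,/^$/d'; }

# Message rebuilt from the subject and body placeholders.
joined_message() { git log -1 --pretty=format:'%s%n%n%b' "$1"; }

# Example use (compare byte counts, because "$(...)" would strip the
# trailing newlines that make the difference):
#   stored_message HEAD | wc -c
#   joined_message HEAD | wc -c
```

For a commit whose message is a single line, the second count should come out larger than the first, which is the extra newline described above.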
[parent not found: <85647ef50811120311q7bc5451x7c084fd2a7864177@mail.gmail.com>]
* Re: Possible bug: "git log" ignores "--encoding=UTF-8" option if --pretty=format:%e%n%s%n is used
From: Jeff King @ 2008-11-12 11:26 UTC
To: Constantine Plotnikov; +Cc: git

[re-adding list to the cc]

On Wed, Nov 12, 2008 at 02:11:46PM +0300, Constantine Plotnikov wrote:

> > I don't actually use any encodings except UTF-8, so maybe there is some
> > subtle reason not to do so that I don't understand, but I would have
> > expected all of the format placeholders to respect any --encoding
> > parameter.
> >
> Even if this is the bug, it would be better to leave the old behavior
> for backward compatibility reasons and introduce new placeholders.
> Currently tools have to decode messages according to the commit
> encoding, and changing behavior of options will break these tools
> that have implemented workaround for this problem.

Are there such tools? I assumed they would have complained about this as
a bug before writing their own encoding conversion tools. And this is,
AFAIK, the first bug report.

I don't mind playing it safe to avoid breaking other people's tools, but
I'm also not excited about adding a second, "respect encoding" version
of many placeholders (and it's not just %s and %b; I think you would
need author and committer names and emails, too).

-Peff
* Re: Possible bug: "git log" ignores "--encoding=UTF-8" option if --pretty=format:%e%n%s%n is used
From: Constantine Plotnikov @ 2008-11-12 13:08 UTC
To: Jeff King; +Cc: git

On Wed, Nov 12, 2008 at 2:26 PM, Jeff King <peff@peff.net> wrote:
> [re-adding list to the cc]
>
> On Wed, Nov 12, 2008 at 02:11:46PM +0300, Constantine Plotnikov wrote:
>
>> > I don't actually use any encodings except UTF-8, so maybe there is some
>> > subtle reason not to do so that I don't understand, but I would have
>> > expected all of the format placeholders to respect any --encoding
>> > parameter.
>> >
>> Even if this is the bug, it would be better to leave the old behavior
>> for backward compatibility reasons and introduce new placeholders.
>> Currently tools have to decode messages according to the commit
>> encoding, and changing behavior of options will break these tools
>> that have implemented workaround for this problem.
>
> Are there such tools? I assumed they would have complained about this as
> a bug before writing their own encoding conversion tools. And this is,
> AFAIK, the first bug report.
>
> I don't mind playing it safe to avoid breaking other people's tools, but
> I'm also not excited about adding a second, "respect encoding" version
> of many placeholders (and it's not just %s and %b; I think you would
> need author and committer names and emails, too).
>
The reason for the request was that for IDE integration (I'm working on
the IDEA plugin), we need to work with past versions of git as well.
However, we could document that this is a known git bug that will be
fixed in some future version, and for a while just show incorrect data
in the history view when a non-UTF-8 encoding is used. I hope that
non-UTF-8 encoding for commits is indeed a rare case, so users will not
complain much.

BTW for some reason --pretty=raw is affected by encoding option on the
command line. And this is a bit surprising as from description of the
raw format it looks like it should not be affected, because the
re-encoded commit is not "the entire commit exactly as stored in the
commit object". Possibly the man page should be updated to clarify
this.

Regards,
Constantine
* Re: Possible bug: "git log" ignores "--encoding=UTF-8" option if --pretty=format:%e%n%s%n is used
From: Junio C Hamano @ 2008-11-13 1:38 UTC
To: Constantine Plotnikov; +Cc: Jeff King, git

"Constantine Plotnikov" <constantine.plotnikov@gmail.com> writes:

> BTW for some reason --pretty=raw is affected by encoding option on the
> command line.

Unfortunately, that is what you get for reading from a Porcelain output,
which is meant for, and are subject to improvement for, human consumption.

If you want bit-for-bit information, you can always ask "git cat-file".
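A minimal sketch of the "git cat-file" route mentioned above, for tools that want the commit exactly as stored. The helper names are invented for illustration; the only git command used is `git cat-file commit <rev>`.

```bash
#!/usr/bin/env bash
# Sketch: read commit data byte for byte, bypassing log's re-encoding.
# Helper names are hypothetical; run them inside a repository.

# Print the raw commit object: headers, a blank line, then the message.
raw_commit() { git cat-file commit "$1"; }

# Print the value of the "encoding" header, or nothing if the commit has
# none (git treats a commit without this header as UTF-8).
commit_encoding() {
    raw_commit "$1" | sed -n -e '/^$/q' -e 's/^encoding //p'
}

# Example use:
#   raw_commit HEAD | od -c | head
#   commit_encoding HEAD
```

The `sed` script stops at the first blank line, so a message line that happens to start with "encoding " is never mistaken for the header.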
* Re: Possible bug: "git log" ignores "--encoding=UTF-8" option if --pretty=format:%e%n%s%n is used
From: Jeff King @ 2008-11-13 4:34 UTC
To: Junio C Hamano; +Cc: Constantine Plotnikov, git

On Wed, Nov 12, 2008 at 05:38:47PM -0800, Junio C Hamano wrote:

> > BTW for some reason --pretty=raw is affected by encoding option on the
> > command line.
>
> Unfortunately, that is what you get for reading from a Porcelain output,
> which is meant for, and are subject to improvement for, human consumption.
>
> If you want bit-for-bit information, you can always ask "git cat-file".

What about "git rev-list --pretty=raw"? Is that also porcelain?

I would be curious to hear your take on our failure to respect
--encoding for --pretty=format. Is it a bug to be fixed, or a historical
behavior to be maintained?

-Peff
* Re: Possible bug: "git log" ignores "--encoding=UTF-8" option if --pretty=format:%e%n%s%n is used
From: Junio C Hamano @ 2008-11-13 5:10 UTC
To: Jeff King; +Cc: Constantine Plotnikov, git

Jeff King <peff@peff.net> writes:

> What about "git rev-list --pretty=raw"? Is that also porcelain?

Does it re-encode? I didn't check, but ideally it shouldn't (but I do not
care too much either way, to be honest).

> I would be curious to hear your take on our failure to respect
> --encoding for --pretty=format. Is it a bug to be fixed, or a historical
> behavior to be maintained?

I think the fix you outlined was quite reasonable.
* Re: Possible bug: "git log" ignores "--encoding=UTF-8" option if --pretty=format:%e%n%s%n is used
From: Jeff King @ 2008-11-13 5:48 UTC
To: Junio C Hamano; +Cc: Constantine Plotnikov, git

On Wed, Nov 12, 2008 at 09:10:26PM -0800, Junio C Hamano wrote:

> > What about "git rev-list --pretty=raw"? Is that also porcelain?
>
> Does it re-encode? I didn't check, but ideally it shouldn't (but I do not
> care too much either way, to be honest).

Yes, it uses the same pretty_print_commit routine as the "log".

> > I would be curious to hear your take on our failure to respect
> > --encoding for --pretty=format. Is it a bug to be fixed, or a historical
> > behavior to be maintained?
>
> I think the fix you outlined was quite reasonable.

One thing I just realized that makes it even more reasonable: we
properly munge the encoding header when we _do_ re-encode. So whether we
re-encode or not, you will always get the correct encoding for what is
being output via "%e".

Which means that a tool which handles the current "broken" behavior by
re-encoding themselves will trivially handle the new version: the output
will just always be in the --encoding specified instead of whatever the
original encoding was. And if there are tools that are not looking at
the output encoding (and blindly assuming --encoding works), then they
are already broken by the current behavior, and we will be fixing them.

So I think it is safe to "fix" it as I described.

-Peff