* git format-patch doesn't add Content-type for UTF-8 diffs
From: Paul Eggert @ 2014-06-30 9:03 UTC
To: git
I've been having trouble sending my Git-generated patches to the tz
mailing list. Patches containing UTF-8 text are garbled, e.g., if you
visit <http://mm.icann.org/pipermail/tz/2014-June/021086.html> you'll
see "ÃœrÃ¼mqi" where the patch actually had "Ürümqi".
I've tracked this down to the fact that "git format-patch" isn't
outputting a Content-Type: line in the outgoing email. I thought it was
supposed to do that; the man page implies that it does.
Here's how I can reproduce the bug with the git 1.9.3 that's shipped
with Fedora 20. Notice that the patch is missing the line
"Content-Type: text/plain; charset=UTF-8" that the git-format-patch man
page implies it should be generating, and this causes the ICANN email
software to misinterpret the patch's character set encoding.
$ git init
Initialized empty Git repository in /home/eggert/junk/d/.git/
$ echo x >x
$ git add x
$ git commit -m'x'
[master (root-commit) 5d0e0ce] x
1 file changed, 1 insertion(+)
create mode 100644 x
$ echo '§' >x
$ git commit -am'added UTF-8'
[master 57f0669] added UTF-8
1 file changed, 1 insertion(+), 1 deletion(-)
$ git format-patch -1
0001-added-UTF-8.patch
$ cat 0001-added-UTF-8.patch
From 57f066927a1d8e253715b7980460d81cb549b162 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Mon, 30 Jun 2014 01:49:28 -0700
Subject: [PATCH] added UTF-8
---
x | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/x b/x
index 587be6b..3038d22 100644
--- a/x
+++ b/x
@@ -1 +1 @@
-x
+§
--
1.9.3
* Re: git format-patch doesn't add Content-type for UTF-8 diffs
From: Jeff King @ 2014-06-30 17:30 UTC
To: Paul Eggert; +Cc: git
On Mon, Jun 30, 2014 at 02:03:25AM -0700, Paul Eggert wrote:
> I've been having trouble sending my Git-generated patches to the tz mailing
> list. Patches containing UTF-8 text are garbled, e.g., if you visit
> <http://mm.icann.org/pipermail/tz/2014-June/021086.html> you'll see
> "ÃœrÃ¼mqi" where the patch actually had "Ürümqi".
>
> I've tracked this down to the fact that "git format-patch" isn't outputting
> a Content-Type: line in the outgoing email. I thought it was supposed to do
> that; the man page implies that it does.
format-patch will add a content-type header if the commit message
contains non-ascii characters, and is marked as an alternate encoding
(usually this is utf8, but you can use i18n.commitEncoding to store them
in a different format).
However, it doesn't look at the filenames or diff contents at all. If it
were to do so, it would have to guess at the correct encoding, since git
doesn't know anything about the encoding of filenames or contents.
Worse, you could actually have several different encodings, across
multiple files in the same diff.
Typically, the next stage in the pipeline is to give the output to
send-email, or to a MUA. Send-email will detect high-bit characters in
this case and ask you which encoding you want. Many MUAs will do some
kind of auto-detection and fill in the content-type (e.g., I know that
mutt handles this correctly).
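To make the "detect high-bit characters" step concrete, here is a minimal Python sketch. It is not send-email's actual code; the function name `needs_charset` and the sample message are assumptions made up for illustration. It reports whether a message contains bytes outside 7-bit ASCII while its headers carry no Content-Type line, which is the situation in the reproduction recipe.

```python
def needs_charset(patch: bytes) -> bool:
    """Return True if the patch has 8-bit bytes but no Content-Type header."""
    # The header block ends at the first blank line.
    headers, _, _ = patch.partition(b"\n\n")
    has_content_type = any(
        line.lower().startswith(b"content-type:")
        for line in headers.split(b"\n")
    )
    # Any byte >= 0x80 means the message is not pure 7-bit ASCII.
    has_high_bit = any(byte >= 0x80 for byte in patch)
    return has_high_bit and not has_content_type


# Hypothetical, shortened version of the patch from the report.
sample = (
    b"From: Paul Eggert <eggert@cs.ucla.edu>\n"
    b"Subject: [PATCH] added UTF-8\n"
    b"\n"
    b"-x\n"
    + "+§\n".encode("utf-8")
)

print(needs_charset(sample))
```

A tool that sees `True` here still has to ask (or guess) which charset to declare; the check only says that some declaration is missing.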
How do you send the mails after they come out of format-patch?
> Here's how I can reproduce the bug with the git 1.9.3 that's shipped with
> Fedora 20. Notice that the patch is missing the line "Content-Type:
> text/plain; charset=UTF-8" that the git-format-patch man page implies it
> should be generating, and this causes the ICANN email software to
> misinterpret the patch's character set encoding.
> [...]
Thanks for the reproduction recipe. While I think we have to accept that
some hard cases (e.g., multiple encodings in a single diff) can't be
handled cleanly, it would be really nice if this all-utf8 case worked
out of the box. And perhaps the complex cases could use binary diffs
when we see multiple encodings.
One tricky thing about the implementation is that we stream the output
from format-patch, and write the content-type header (if any) before we
start opening the blobs for diff.
I wonder if it would be enough to do:
1. Always add a content-type header, even if the commit is utf-8 and
contains only ascii characters. This _shouldn't_ hurt anything,
though I suppose it would if you have latin1 (for example) commit
messages and did not correctly set the encoding header in your
commits.
2. When producing diff header lines do not respect core.quotepath if
the filename is not valid for the encoding we claimed earlier.
3. When producing lines of textual diff, use a binary diff if the
contents are not valid for the encoding we claimed earlier.
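Steps 2 and 3 above could be sketched roughly as follows. This is Python rather than git's C, and it is only one reading of the proposal: the helper names (`is_valid_for`, `choose_diff_mode`, `should_quote_path`) are hypothetical, and "valid" is taken to mean "decodes without error under the claimed encoding".

```python
def is_valid_for(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly under the given encoding."""
    try:
        data.decode(encoding)  # strict decoding raises on invalid sequences
        return True
    except UnicodeDecodeError:
        return False


def choose_diff_mode(blob: bytes, claimed_encoding: str = "utf-8") -> str:
    """Step 3: fall back to a binary diff when the contents do not match."""
    return "text" if is_valid_for(blob, claimed_encoding) else "binary"


def should_quote_path(path: bytes, core_quotepath: bool,
                      claimed_encoding: str = "utf-8") -> bool:
    """Step 2: quote the path anyway if it is invalid for the encoding."""
    return core_quotepath or not is_valid_for(path, claimed_encoding)


print(choose_diff_mode("§\n".encode("utf-8")))    # valid UTF-8
print(choose_diff_mode(b"\xa7\n"))                # latin1 byte for the same sign
print(should_quote_path(b"caf\xe9.txt", core_quotepath=False))
```

With a claimed encoding of UTF-8, the first call yields a text diff, the second a binary diff, and the latin1 filename is quoted even though quoting was turned off.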
That would make the utf-8 case "just work", and would prevent us from
ever sending malformed contents (i.e., mismatched encodings in the
commit message and diff contents). However, it is not perfect:
1. Right now if you send a diff for a latin1 file but do not use any
non-ascii characters in your commit message (and do not set
i18n.commitEncoding, so it is "utf8"), you get no claimed encoding
in the email. If your receiving end is OK with that, everything
works, and you get to see text diffs.
So my scheme would be a slight regression there. But it is somewhat
of an accident waiting to happen. If you ever use a utf-8 character
in your commit message, that particular email will be marked as
utf-8, and your diff will be broken.
2. We can only check "is it valid?" for the encoding. That works well
with utf-8, which has rules. But for something like latin1 versus
another "use a code page for the high-bit bytes" type of encoding,
we cannot really tell the difference. However, I do not think we
are making anything _worse_ there. You'd already get mojibake in
such a case.
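A small Python illustration of that second limitation, under the assumption that "valid" means "decodes without error". The function `plausible_encodings` is a made-up helper, not anything in git. A byte that is invalid UTF-8 is accepted by every single-byte code page tried, so validity alone cannot say which one was meant.

```python
def plausible_encodings(data, candidates=("utf-8", "latin-1", "iso-8859-15")):
    """Return the candidate encodings under which data decodes without error."""
    result = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        result.append(name)
    return result


euro = "€".encode("iso-8859-15")   # a single byte, 0xA4
print(plausible_encodings(euro))   # UTF-8 is ruled out, both code pages remain
print(euro.decode("latin-1"))      # the same byte read as latin1 is "¤"
```

UTF-8 can be rejected because it has structural rules; latin1 and ISO-8859-15 both accept the byte and simply disagree on what it means.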
-Peff
* Re: git format-patch doesn't add Content-type for UTF-8 diffs
From: Paul Eggert @ 2014-06-30 18:54 UTC
To: Jeff King; +Cc: git
Jeff King wrote:
> How do you send the mails after they come out of format-patch?
I run a shell command like this (on Solaris 10):
/usr/lib/sendmail -ONoRecipientAction=add-to tz@iana.org < 0001-whatever.patch
(The "NoRecipientAction" option pacifies the IANA MTA.)
This is an old machine not under my control, with an old 'git' installed
that I don't use and don't particularly want to worry about porting to.
I generate the patch file on a different machine with git 1.9.3, and
scp it into the email-sending machine.
I suppose that I could work around the problem with this shell command:
(grep -qi '^MIME-Version: ' 0001-whatever.patch ||
   printf '%s\n' \
     'MIME-Version: 1.0' \
     'Content-Type: text/plain; charset=UTF-8' \
     'Content-Transfer-Encoding: 8bit'
 cat 0001-whatever.patch) |
  /usr/lib/sendmail -ONoRecipientAction=add-to tz@iana.org
but that's less convenient.
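For comparison, the same workaround as a Python sketch. This is only an illustration: `add_mime_headers` is a hypothetical helper, it assumes the patch is UTF-8 throughout, and like the shell command it simply puts the three headers in front of the file. It matches the header name case-insensitively, since mail header field names are not case-sensitive.

```python
MIME_HEADERS = (
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=UTF-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
)


def add_mime_headers(patch: bytes) -> bytes:
    """Prepend MIME headers unless the patch already declares MIME-Version."""
    # Only look at the header block, which ends at the first blank line.
    headers, _, _ = patch.partition(b"\n\n")
    for line in headers.split(b"\n"):
        if line.lower().startswith(b"mime-version:"):
            return patch  # already has MIME headers; leave it untouched
    return MIME_HEADERS + patch


plain = b"Subject: [PATCH] added UTF-8\n\n" + "+§\n".encode("utf-8")
print(add_mime_headers(plain).decode("utf-8"))
```

The returned bytes could then be fed to sendmail's standard input in place of the original file.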
> I wonder if it would be enough to do:
>
> 1. Always add a content-type header, even if the commit is utf-8 and
> contains only ascii characters.
That would help for my case, yes. We use only UTF-8, and to me it
feels weird that patches are mailed properly if the commit log contains
non-ASCII characters, but don't work if the commit log is ASCII and the
diff contains non-ASCII.
* Re: git format-patch doesn't add Content-type for UTF-8 diffs
From: Torsten Bögershausen @ 2014-07-01 4:38 UTC
To: Jeff King, Paul Eggert; +Cc: git
> I wonder if it would be enough to do:
> 1. Always add a content-type header, even if the commit is utf-8 and
> contains only ascii characters. This _shouldn't_ hurt anything,
> though I suppose it would if you have latin1 (for example) commit
> messages and did not correctly set the encoding header in your
> commits.
Does it make sense to call this function (from utf8.c)
int is_utf8(const char *text)
and, depending on its result, either add the content-type header for
utf-8 or not?