* git format-patch doesn't add Content-type for UTF-8 diffs
From: Paul Eggert @ 2014-06-30 9:03 UTC
To: git

I've been having trouble sending my Git-generated patches to the tz
mailing list. Patches containing UTF-8 text are garbled, e.g., if you
visit <http://mm.icann.org/pipermail/tz/2014-June/021086.html> you'll see
"Ürümqi" where the patch actually had "Ürümqi".

I've tracked this down to the fact that "git format-patch" isn't
outputting a Content-Type: line in the outgoing email. I thought it was
supposed to do that; the man page implies that it does.

Here's how I can reproduce the bug with the git 1.9.3 that's shipped with
Fedora 20. Notice that the patch is missing the line "Content-Type:
text/plain; charset=UTF-8" that the git-format-patch man page implies it
should be generating, and this causes the ICANN email software to
misinterpret the patch's character set encoding.

$ git init
Initialized empty Git repository in /home/eggert/junk/d/.git/
$ echo x >x
$ git add x
$ git commit -m'x'
[master (root-commit) 5d0e0ce] x
 1 file changed, 1 insertion(+)
 create mode 100644 x
$ echo '§' >x
$ git commit -am'added UTF-8'
[master 57f0669] added UTF-8
 1 file changed, 1 insertion(+), 1 deletion(-)
$ git format-patch -1
0001-added-UTF-8.patch
$ cat 0001-added-UTF-8.patch
From 57f066927a1d8e253715b7980460d81cb549b162 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Mon, 30 Jun 2014 01:49:28 -0700
Subject: [PATCH] added UTF-8

---
 x | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/x b/x
index 587be6b..3038d22 100644
--- a/x
+++ b/x
@@ -1 +1 @@
-x
+§
-- 
1.9.3
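
The symptom reported above can be checked mechanically before a patch is mailed. The following is a minimal sketch, not anything git itself ships: the function name needs_charset_header is hypothetical, and it assumes the patch file uses "\n" line endings with the header block ending at the first blank line, as in the output shown above.

```python
def needs_charset_header(patch: bytes) -> bool:
    """Return True if the patch has high-bit bytes but declares no Content-Type.

    Hypothetical helper: assumes the header block ends at the first blank line.
    """
    head, _, _body = patch.partition(b"\n\n")
    # Header field names are case-insensitive, so compare in lower case.
    has_content_type = any(
        line.lower().startswith(b"content-type:") for line in head.split(b"\n")
    )
    # Any byte >= 0x80 means the text is not plain ASCII.
    has_high_bit = any(byte >= 0x80 for byte in patch)
    return has_high_bit and not has_content_type
```

For the 0001-added-UTF-8.patch file above, which contains "§" in the diff and no Content-Type: line, this check would flag the patch.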
* Re: git format-patch doesn't add Content-type for UTF-8 diffs
From: Jeff King @ 2014-06-30 17:30 UTC
To: Paul Eggert; +Cc: git

On Mon, Jun 30, 2014 at 02:03:25AM -0700, Paul Eggert wrote:

> I've been having trouble sending my Git-generated patches to the tz mailing
> list. Patches containing UTF-8 text are garbled, e.g., if you visit
> <http://mm.icann.org/pipermail/tz/2014-June/021086.html> you'll see
> "Ürümqi" where the patch actually had "Ürümqi".
>
> I've tracked this down to the fact that "git format-patch" isn't outputting
> a Content-Type: line in the outgoing email. I thought it was supposed to do
> that; the man page implies that it does.

format-patch will add a content-type header if the commit message
contains non-ascii characters, and is marked as an alternate encoding
(usually this is utf8, but you can use i18n.commitEncoding to store them
in a different format).

However, it doesn't look at the filenames or diff contents at all. If it
were to do so, it would have to guess at the correct encoding, since git
doesn't know anything about the encoding of filenames or contents. Worse,
you could actually have several different encodings, across multiple
files in the same diff.

Typically, the next stage in the pipeline is to give the output to
send-email, or to a MUA. Send-email will detect high-bit characters in
this case and ask you which encoding you want. Many MUAs will do some
kind of auto-detection and fill in the content-type (e.g., I know that
mutt handles this correctly).

How do you send the mails after they come out of format-patch?

> Here's how I can reproduce the bug with the git 1.9.3 that's shipped with
> Fedora 20. Notice that the patch is missing the line "Content-Type:
> text/plain; charset=UTF-8" that the git-format-patch man page implies it
> should be generating, and this causes the ICANN email software to
> misinterpret the patch's character set encoding.
> [...]

Thanks for the reproduction recipe.

While I think we have to accept that some hard cases (e.g., multiple
encodings in a single diff) can't be handled cleanly, it would be really
nice if this all-utf8 case worked out of the box. And perhaps the complex
cases could use binary diffs when we see multiple encodings.

One tricky thing about the implementation is that we stream the output
from format-patch, and write the content-type header (if any) before we
start opening the blobs for diff.

I wonder if it would be enough to do:

  1. Always add a content-type header, even if the commit is utf-8 and
     contains only ascii characters. This _shouldn't_ hurt anything,
     though I suppose it would if you have latin1 (for example) commit
     messages and did not correctly set the encoding header in your
     commits.

  2. When producing diff header lines do not respect core.quotepath if
     the filename is not valid for the encoding we claimed earlier.

  3. When producing lines of textual diff, use a binary diff if the
     contents are not valid for the encoding we claimed earlier.

That would make the utf-8 case "just work", and would prevent us from
ever sending malformed contents (i.e., mismatched encodings in the commit
message and diff contents). However, it is not perfect:

  1. Right now if you send a diff for a latin1 file but do not use any
     non-ascii characters in your commit message (and do not set
     i18n.commitEncoding, so it is "utf8"), you get no claimed encoding
     in the email. If your receiving end is OK with that, everything
     works, and you get to see text diffs.

     So my scheme would be a slight regression there. But it is somewhat
     of an accident waiting to happen. If you ever use a utf-8 character
     in your commit message, that particular email will be marked as
     utf-8, and your diff will be broken.

  2. We can only check "is it valid?" for the encoding. That works well
     with utf-8, which has rules. But for something like latin1 versus
     another "use a code page for the high-bit bytes" type of encoding,
     we cannot really tell the difference.

     However, I do not think we are making anything _worse_ there. You'd
     already get mojibake in such a case.

-Peff
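
The validity checks in steps 2 and 3 of the proposal can be sketched as follows. This is only an illustration of the idea under stated assumptions, not git's implementation: valid_for and plan_diff are hypothetical names, and "quote the filename" is one reading of "do not respect core.quotepath".

```python
def valid_for(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly in the claimed encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def plan_diff(claimed_encoding: str, filename: bytes, content: bytes) -> dict:
    """Decide the fallbacks for one file, given the encoding already claimed.

    Hypothetical sketch: the content-type header was written earlier, so the
    only remaining choice is how to emit data that does not match it.
    """
    return {
        # Step 2: quote the filename if it is not valid for the claimed encoding.
        "quote_filename": not valid_for(filename, claimed_encoding),
        # Step 3: fall back to a binary diff for non-matching contents.
        "binary_diff": not valid_for(content, claimed_encoding),
    }
```

The second caveat shows up directly in this sketch: a byte such as 0xA7 on its own is rejected when the claimed encoding is "utf-8", but every byte sequence decodes as "latin-1", so the check cannot tell one single-byte code page from another.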
* Re: git format-patch doesn't add Content-type for UTF-8 diffs
From: Paul Eggert @ 2014-06-30 18:54 UTC
To: Jeff King; +Cc: git

Jeff King wrote:
> How do you send the mails after they come out of format-patch?

I run a shell command like this (on Solaris 10):

/usr/lib/sendmail -ONoRecipientAction=add-to tz@iana.org < 0001-whatever.patch

(The "NoRecipientAction" option pacifies the IANA MTA.) This is an old
machine not under my control, with an old 'git' installed that I don't
use and don't particularly want to worry about porting to. I generate the
patch file on a different machine with git 1.9.3, and scp it into the
email-sending machine.

I suppose that I could work around the problem with this shell command:

(grep -qi '^Mime-Version: ' 0001-whatever.patch ||
   printf '%s\n' \
     'MIME-Version: 1.0' \
     'Content-Type: text/plain; charset=UTF-8' \
     'Content-Transfer-Encoding: 8bit'
 cat 0001-whatever.patch) |
  /usr/lib/sendmail -ONoRecipientAction=add-to tz@iana.org

but that's less convenient.

> I wonder if it would be enough to do:
>
>   1. Always add a content-type header, even if the commit is utf-8 and
>      contains only ascii characters.

That would help for my case, yes. We use only UTF-8, and to me it feels
weird that patches are mailed properly if the commit log contains
non-ASCII characters, but don't work if the commit log is ASCII and the
diff contains non-ASCII.
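
The same workaround could be done in a small script on the machine that generates the patch. The sketch below is a variation on the shell command above, not a translation of it: it appends the three MIME headers to the end of the existing header block instead of placing them in front of the file, on the assumption that keeping the leading "From " line first is preferable. The name add_mime_headers is hypothetical, and "\n" line endings are assumed.

```python
MIME_HEADERS = (
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=UTF-8\n"
    b"Content-Transfer-Encoding: 8bit"
)


def add_mime_headers(patch: bytes) -> bytes:
    """Add UTF-8 MIME headers to a patch email unless it already has them."""
    # The header block ends at the first blank line.
    head, sep, body = patch.partition(b"\n\n")
    for line in head.split(b"\n"):
        # Header names are case-insensitive, so match in lower case.
        if line.lower().startswith(b"mime-version:"):
            return patch
    return head + b"\n" + MIME_HEADERS + sep + body
```

The result can then be piped to sendmail exactly as in the first command above.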
* Re: git format-patch doesn't add Content-type for UTF-8 diffs
From: Torsten Bögershausen @ 2014-07-01 4:38 UTC
To: Jeff King, Paul Eggert; +Cc: git

> I wonder if it would be enough to do:
>
>   1. Always add a content-type header, even if the commit is utf-8 and
>      contains only ascii characters. This _shouldn't_ hurt anything,
>      though I suppose it would if you have latin1 (for example) commit
>      messages and did not correctly set the encoding header in your
>      commits.

Does it make sense to call this function (from utf8.c)

int is_utf8(const char *text)

and either add the content-type header for utf-8 (or not)?
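
One reading of this suggestion, sketched in Python rather than C: emit the header only when the text is non-ASCII and valid UTF-8. The function content_type_header is hypothetical, and the decode attempt stands in for the is_utf8() check; how the ASCII-only case should be treated is an assumption here, since the message does not say.

```python
from typing import Optional


def content_type_header(text: bytes) -> Optional[str]:
    """Return a UTF-8 Content-Type header line, or None if none is warranted."""
    if text.isascii():
        # Assumption: pure ASCII needs no charset declaration.
        return None
    try:
        text.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so claiming UTF-8 would be wrong.
        return None
    return "Content-Type: text/plain; charset=UTF-8"
```

As noted earlier in the thread, such a check would have to see the diff contents before the header is written, which conflicts with format-patch streaming its output.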