* [PATCH] CMIT_FMT_EMAIL: Q-encode Subject: and display-name part of From: fields.
From: Junio C Hamano @ 2006-05-16 10:18 UTC
To: git
By convention, the commit message and the author/committer names
in the commit objects are UTF-8 encoded. When formatting for
e-mails, Q-encode them according to RFC 2047.
While we are at it, generate the content-type and
content-transfer-encoding headers as well.
Signed-off-by: Junio C Hamano <junkio@cox.net>
---
With this patch, the output formatted with
git show --pretty=email --patch-with-stat 9d7f73d4
would start like this:
From 9d7f73d43fa49d0d2f5a8cfcce9d659e8ad2d265 Thu Apr 7 15:13:13 2005
From: =?utf-8?q?Lukas_Sandstr=C3=B6m?= <lukass@etek.chalmers.se>
Date: Sat, 25 Feb 2006 12:20:13 +0100
Subject: [PATCH] git-fetch: print the new and old ref when fast-forwarding
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
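The encoded display name in the sample above can be reproduced with a small standalone sketch. This is not the code from the patch: `q_special` and `q_encode_utf8` are hypothetical names, the input is assumed to be UTF-8, and the function always emits an encoded-word instead of first checking whether encoding is needed. It escapes the same set of bytes as the patch (high-bit bytes, '=', '?' and '_').

```c
#include <stddef.h>
#include <stdio.h>

/* Bytes that must be written as =XX inside a Q-encoded word. */
static int q_special(unsigned char ch)
{
	return (ch & 0x80) || ch == '=' || ch == '?' || ch == '_';
}

/*
 * Q-encode len bytes of UTF-8 text from src into dst as one
 * encoded-word.  dst must have room for 3 * len + 13 bytes.
 * Returns the number of bytes written, not counting the NUL.
 */
static size_t q_encode_utf8(char *dst, const char *src, size_t len)
{
	char *p = dst;
	size_t i;

	p += sprintf(p, "=?utf-8?q?");
	for (i = 0; i < len; i++) {
		/* Go through unsigned char so high-bit bytes print as two hex digits. */
		unsigned char ch = (unsigned char)src[i];

		if (q_special(ch))
			p += sprintf(p, "=%02X", ch);
		else if (ch == ' ')
			*p++ = '_';
		else
			*p++ = (char)ch;
	}
	p += sprintf(p, "?=");
	return (size_t)(p - dst);
}
```

Called on the UTF-8 bytes of the author name, this yields the `=?utf-8?q?Lukas_Sandstr=C3=B6m?=` form shown in the From: line.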
This is marked RFC because I am not convinced that this kind of
header formatting should be done by format-patch; we might be
better off leaving the proper massaging to whatever downstream
program reads its output (e.g. send-email or imap-send).
We produce the mbox format (and that is a requirement -- its
output should be consumable by git-am), so the downstream needs
to strip off the initial UNIX-From line at least anyway.
Thoughts?
If we decide to do the header formatting here, there are two
further enhancements that need to be done:
(1) The charset must be configurable for projects that use
encoding different from UTF-8, perhaps with the .git/config
[i18n] commitEncoding. It is only a convention, not a hard
rule, to use UTF-8 for the metainformation.
(2) Some projects, notably Wine, seem to prefer patches to be
sent as attachments, and we have support for that in the
script version of format-patch. We would want to have the
same here. This needs to be an option; define a new
format, CMIT_FMT_MIME, and invoke it with --pretty=mime.
Ideally we would want to say, in the body part header for
the attachment, that the type of the payload is a raw 8bit
text/patch without any specific charset (if the upstream
project has a UTF-8 encoded file, you should not send in a
patch in iso-8859-1 and expect somebody to automagically
transcode your patch -- the patch is applied as is and MTA
should not molest it).
The RFC 2047 Q-encoding code definitely needs to be audited by
an RFC lawyer. I used to be one myself, but I have lost my edge
and patience these days.
diff --git a/commit.c b/commit.c
index 93b3903..dee5756 100644
--- a/commit.c
+++ b/commit.c
@@ -413,6 +413,46 @@ static int get_one_line(const char *msg,
return ret;
}
+static int is_rfc2047_special(char ch)
+{
+ return ((ch & 0x80) || (ch == '=') || (ch == '?') || (ch == '_'));
+}
+
+static int add_rfc2047(char *buf, const char *line, int len)
+{
+ char *bp = buf;
+ int i, needquote;
+ static const char q_utf8[] = "=?utf-8?q?";
+
+ for (i = needquote = 0; !needquote && i < len; i++) {
+ unsigned ch = line[i];
+ if (ch & 0x80)
+ needquote++;
+ if ((i + 1 < len) &&
+ (ch == '=' && line[i+1] == '?'))
+ needquote++;
+ }
+ if (!needquote)
+ return sprintf(buf, "%.*s", len, line);
+
+ memcpy(bp, q_utf8, sizeof(q_utf8)-1);
+ bp += sizeof(q_utf8)-1;
+ for (i = 0; i < len; i++) {
+ unsigned ch = line[i] & 0xFF; /* avoid sign extension of high-bit bytes */
+ if (is_rfc2047_special(ch)) {
+ sprintf(bp, "=%02X", ch);
+ bp += 3;
+ }
+ else if (ch == ' ')
+ *bp++ = '_';
+ else
+ *bp++ = ch;
+ }
+ memcpy(bp, "?=", 2);
+ bp += 2;
+ return bp - buf;
+}
+
static int add_user_info(const char *what, enum cmit_fmt fmt, char *buf, const char *line)
{
char *date;
@@ -431,12 +471,26 @@ static int add_user_info(const char *wha
tz = strtol(date, NULL, 10);
if (fmt == CMIT_FMT_EMAIL) {
- what = "From";
+ char *name_tail = strchr(line, '<');
+ int display_name_length;
+ if (!name_tail)
+ return 0;
+ while (line < name_tail && isspace(name_tail[-1]))
+ name_tail--;
+ display_name_length = name_tail - line;
filler = "";
+ strcpy(buf, "From: ");
+ ret = strlen(buf);
+ ret += add_rfc2047(buf + ret, line, display_name_length);
+ memcpy(buf + ret, name_tail, namelen - display_name_length);
+ ret += namelen - display_name_length;
+ buf[ret++] = '\n';
+ }
+ else {
+ ret = sprintf(buf, "%s: %.*s%.*s\n", what,
+ (fmt == CMIT_FMT_FULLER) ? 4 : 0,
+ filler, namelen, line);
}
- ret = sprintf(buf, "%s: %.*s%.*s\n", what,
- (fmt == CMIT_FMT_FULLER) ? 4 : 0,
- filler, namelen, line);
switch (fmt) {
case CMIT_FMT_MEDIUM:
ret += sprintf(buf + ret, "Date: %s\n", show_date(time, tz));
@@ -575,14 +629,24 @@ unsigned long pretty_print_commit(enum c
int slen = strlen(subject);
memcpy(buf + offset, subject, slen);
offset += slen;
+ offset += add_rfc2047(buf + offset, line, linelen);
+ }
+ else {
+ memset(buf + offset, ' ', indent);
+ memcpy(buf + offset + indent, line, linelen);
+ offset += linelen + indent;
}
- memset(buf + offset, ' ', indent);
- memcpy(buf + offset + indent, line, linelen);
- offset += linelen + indent;
buf[offset++] = '\n';
if (fmt == CMIT_FMT_ONELINE)
break;
- subject = NULL;
+ if (subject) {
+ static const char header[] =
+ "Content-Type: text/plain; charset=UTF-8\n"
+ "Content-Transfer-Encoding: 8bit\n";
+ memcpy(buf + offset, header, sizeof(header)-1);
+ offset += sizeof(header)-1;
+ subject = NULL;
+ }
}
while (offset && isspace(buf[offset-1]))
offset--;
* Re: [PATCH] CMIT_FMT_EMAIL: Q-encode Subject: and display-name part of From: fields.
From: Jakub Narebski @ 2006-05-16 10:38 UTC
To: git
Junio C Hamano wrote:
> By convention, the commit message and the author/committer names
> in the commit objects are UTF-8 encoded. When formatting for
> e-mails, Q-encode them according to RFC 2047.
>
> While we are at it, generate the content-type and
> content-transfer-encoding headers as well.
>
> Signed-off-by: Junio C Hamano <junkio@cox.net>
>
> ---
>
> With this patch, the output formatted with
>
> git show --pretty=email --patch-with-stat 9d7f73d4
>
> would start like this:
>
> From 9d7f73d43fa49d0d2f5a8cfcce9d659e8ad2d265 Thu Apr 7 15:13:13 2005
> From: =?utf-8?q?Lukas_Sandstr=C3=B6m?= <lukass@etek.chalmers.se>
> Date: Sat, 25 Feb 2006 12:20:13 +0100
> Subject: [PATCH] git-fetch: print the new and old ref when fast-forwarding
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
I guess that we also need
MIME-Version: 1.0
(from what I remember of troubles with Outlook Express not sending all
the required headers, and tin not working properly).
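A minimal sketch of what emitting the full MIME header block could look like. `add_mime_headers` and its `charset` parameter are hypothetical and not part of the patch, which hard-codes UTF-8 and does not emit MIME-Version.

```c
#include <stdio.h>

/*
 * Write the MIME header block for an 8-bit text/plain body into buf.
 * charset is supplied by the caller (an assumption; the patch above
 * always uses UTF-8).
 * Returns the number of bytes written, or -1 if buf is too small.
 */
static int add_mime_headers(char *buf, size_t size, const char *charset)
{
	int n = snprintf(buf, size,
			 "MIME-Version: 1.0\n"
			 "Content-Type: text/plain; charset=%s\n"
			 "Content-Transfer-Encoding: 8bit\n",
			 charset);

	return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Passing the charset as a parameter would also leave room for the configurable commit encoding discussed in the original message.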
If I remember correctly, encoding headers with quoted-printable is needed
only because the headers come before the charset is declared. IIRC there
was a proposal to use UTF-8 for headers regardless of the charset used for
the body of the message.
P.S. Should we set the User-Agent header as well?
--
Jakub Narebski
Warsaw, Poland
* Re: [PATCH] CMIT_FMT_EMAIL: Q-encode Subject: and display-name part of From: fields.
From: Rocco Rutte @ 2006-05-16 10:49 UTC
To: git
Hi,
* Junio C Hamano [06-05-16 03:18:24 -0700] wrote:
[...]
> Thoughts?
> If we decide to do the header formatting here, there are two
> further enhancements that need to be done:
> (1) The charset must be configurable for projects that use
> encoding different from UTF-8, perhaps with the .git/config
> [i18n] commitEncoding. It is only a convention, not a hard
> rule, to use UTF-8 for the metainformation.
Writing an encoder that fully conforms to RFC 2047 is a mess. Not so
much because the algorithms are difficult, but because there are many
things to take care of if you want to do it right.
For example, an encoded word may be at most 75 characters long (RFC 2047,
section 2). For names this may not be an issue, but for subjects it is.
I didn't really check whether your patch produces only the minimum
encoding (i.e. encoding only those words that need it, rather than all
words with '_' or '=20' in between them), but if it doesn't, 75 isn't
that much after all. And you may need to think about header folding
(and unfolding when reading it back in).
Also, supporting any character set (via iconv()) blows up the
implementation. There are character sets for which other RFCs define the
encoding method, so using only quoted-printable is not fully correct in
all possible cases.
And, given the first point, some character sets can really become a
mess: you need to produce several encoded words because the input would
otherwise exceed the RFC limits, and for multi-byte character sets you
mustn't break within a multi-byte character sequence, only at character
boundaries. So you need a generic way to detect the byte size of such a
character in any supported character set.
With just the UTF-8 encoding all of this is pretty simple though.
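For the UTF-8-only case, the splitting described above can be sketched roughly as follows. All names here (`utf8_seq_len`, `q_cost`, `q_encode_folded`, `MAX_ENCODED_WORD`) are hypothetical, and the sketch makes some simplifying assumptions: the input is UTF-8, every word is Q-encoded rather than only those that need it, and the length of the header name on the first line is not accounted for.

```c
#include <stddef.h>
#include <stdio.h>

#define MAX_ENCODED_WORD 75	/* RFC 2047 limit for one encoded-word */

/* Length in bytes of the UTF-8 sequence that starts with lead byte ch. */
static size_t utf8_seq_len(unsigned char ch)
{
	if (ch < 0x80)
		return 1;
	if ((ch & 0xE0) == 0xC0)
		return 2;
	if ((ch & 0xF0) == 0xE0)
		return 3;
	if ((ch & 0xF8) == 0xF0)
		return 4;
	return 1;	/* stray continuation or invalid byte: one unit */
}

/* Number of output characters one input byte needs when Q-encoded. */
static size_t q_cost(unsigned char ch)
{
	return ((ch & 0x80) || ch == '=' || ch == '?' || ch == '_') ? 3 : 1;
}

/*
 * Q-encode src as a series of encoded-words of at most 75 characters,
 * separated by a newline and a space (header folding), never splitting
 * a UTF-8 sequence.  dst must have room for 4 * len + 16 bytes.
 */
static size_t q_encode_folded(char *dst, const char *src, size_t len)
{
	static const char prefix[] = "=?utf-8?q?";
	const size_t overhead = sizeof(prefix) - 1 + 2;	/* prefix plus "?=" */
	char *p = dst;
	size_t i = 0;

	while (i < len) {
		size_t used = overhead;

		if (i)
			p += sprintf(p, "\n ");	/* fold before the next word */
		p += sprintf(p, "%s", prefix);
		while (i < len) {
			size_t n = utf8_seq_len((unsigned char)src[i]);
			size_t cost = 0, k;

			if (n > len - i)
				n = len - i;	/* truncated sequence at end of input */
			for (k = 0; k < n; k++)
				cost += q_cost((unsigned char)src[i + k]);
			if (used + cost > MAX_ENCODED_WORD)
				break;	/* whole character moves to the next word */
			for (k = 0; k < n; k++) {
				unsigned char ch = (unsigned char)src[i + k];

				if (q_cost(ch) == 3)
					p += sprintf(p, "=%02X", ch);
				else
					*p++ = (ch == ' ') ? '_' : (char)ch;
			}
			used += cost;
			i += n;
		}
		p += sprintf(p, "?=");
	}
	*p = '\0';
	return (size_t)(p - dst);
}
```

Because spaces are encoded as '_' inside the words, the whitespace added by folding carries no content; RFC 2047 decoders ignore whitespace between adjacent encoded-words. Doing the same for an arbitrary character set would need a per-charset replacement for `utf8_seq_len`, which is the hard part pointed out above.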
I would rather try to find a way to implement this in a scripting
language that already has standard modules for this or makes it easy to
write one. In C this gets quite lengthy...
bye, Rocco
--
:wq!