[PATCH] CMIT_FMT_EMAIL: Q-encode Subject: and display-name part of From: fields.

* [PATCH] CMIT_FMT_EMAIL: Q-encode Subject: and display-name part of From: fields.
@ 2006-05-16 10:18 Junio C Hamano
  2006-05-16 10:38 ` Jakub Narebski
  2006-05-16 10:49 ` Rocco Rutte
  0 siblings, 2 replies; 3+ messages in thread
From: Junio C Hamano @ 2006-05-16 10:18 UTC
  To: git

By convention, the commit message and the author/committer names
in the commit objects are UTF-8 encoded.  When formatting for
e-mails, Q-encode them according to RFC 2047.

While we are at it, generate the content-type and
content-transfer-encoding headers as well.

Signed-off-by: Junio C Hamano <junkio@cox.net>

---

 With this patch, the output formatted with

	git show --pretty=email --patch-with-stat 9d7f73d4

 would start like this:

   From 9d7f73d43fa49d0d2f5a8cfcce9d659e8ad2d265  Thu Apr 7 15:13:13 2005
   From: =?utf-8?q?Lukas_Sandstr=C3=B6m?= <lukass@etek.chalmers.se>
   Date: Sat, 25 Feb 2006 12:20:13 +0100
   Subject: [PATCH] git-fetch: print the new and old ref when fast-forwarding
   Content-Type: text/plain; charset=UTF-8
   Content-Transfer-Encoding: 8bit

 This is marked RFC because I am not convinced that this kind of
 header formatting should be done by format-patch; we might be
 better off leaving the proper massaging to whatever downstream
 program reads its output (e.g. send-email or imap-send).
 We produce the mbox format (and that is a requirement -- its
 output should be consumable by git-am), so the downstream needs
 to strip off the initial UNIX-From line at least anyway.

 Thoughts?

 If we decide to do the header formatting here, there are two
 further enhancements that need to be done:

 (1) The charset must be configurable for projects that use
     encoding different from UTF-8, perhaps with the .git/config
     [i18n] commitEncoding.  It is only a convention, not a hard
     rule, to use UTF-8 for the metainformation.

 (2) Some projects, notably Wine, seem to prefer patches to be
     sent as attachments, and we have support for that in the
     script version of format-patch.  We would want to have the
     same here.  This needs to be an option; define a new
     format, CMIT_FMT_MIME, and invoke it with --pretty=mime.

     Ideally we would want to say, in the body part header for
     the attachment, that the type of the payload is a raw 8bit
     text/patch without any specific charset (if the upstream
     project has a UTF-8 encoded file, you should not send in a
     patch in iso-8859-1 and expect somebody to automagically
     transcode your patch -- the patch is applied as is and MTA
     should not molest it).

 The RFC2047 q-encoding code definitely needs to be audited by
 an RFC lawyer.  I used to be one myself but I lost my edge and
 patience these days.

diff --git a/commit.c b/commit.c
index 93b3903..dee5756 100644
--- a/commit.c
+++ b/commit.c
@@ -413,6 +413,46 @@ static int get_one_line(const char *msg,
 	return ret;
 }
 
+static int is_rfc2047_special(char ch)
+{
+	return ((ch & 0x80) || (ch == '=') || (ch == '?') || (ch == '_'));
+}
+
+static int add_rfc2047(char *buf, const char *line, int len)
+{
+	char *bp = buf;
+	int i, needquote;
+	static const char q_utf8[] = "=?utf-8?q?";
+
+	for (i = needquote = 0; !needquote && i < len; i++) {
+		unsigned ch = line[i];
+		if (ch & 0x80)
+			needquote++;
+		if ((i + 1 < len) &&
+		    (ch == '=' && line[i+1] == '?'))
+			needquote++;
+	}
+	if (!needquote)
+		return sprintf(buf, "%.*s", len, line);
+
+	memcpy(bp, q_utf8, sizeof(q_utf8)-1);
+	bp += sizeof(q_utf8)-1;
+	for (i = 0; i < len; i++) {
+		unsigned ch = line[i] & 0xFF;
+		if (is_rfc2047_special(ch)) {
+			sprintf(bp, "=%02X", ch);
+			bp += 3;
+		}
+		else if (ch == ' ')
+			*bp++ = '_';
+		else
+			*bp++ = ch;
+	}
+	memcpy(bp, "?=", 2);
+	bp += 2;
+	return bp - buf;
+}
+
 static int add_user_info(const char *what, enum cmit_fmt fmt, char *buf, const char *line)
 {
 	char *date;
@@ -431,12 +471,26 @@ static int add_user_info(const char *wha
 	tz = strtol(date, NULL, 10);
 
 	if (fmt == CMIT_FMT_EMAIL) {
-		what = "From";
+		char *name_tail = strchr(line, '<');
+		int display_name_length;
+		if (!name_tail)
+			return 0;
+		while (line < name_tail && isspace(name_tail[-1]))
+			name_tail--;
+		display_name_length = name_tail - line;
 		filler = "";
+		strcpy(buf, "From: ");
+		ret = strlen(buf);
+		ret += add_rfc2047(buf + ret, line, display_name_length);
+		memcpy(buf + ret, name_tail, namelen - display_name_length);
+		ret += namelen - display_name_length;
+		buf[ret++] = '\n';
+	}
+	else {
+		ret = sprintf(buf, "%s: %.*s%.*s\n", what,
+			      (fmt == CMIT_FMT_FULLER) ? 4 : 0,
+			      filler, namelen, line);
 	}
-	ret = sprintf(buf, "%s: %.*s%.*s\n", what,
-		      (fmt == CMIT_FMT_FULLER) ? 4 : 0,
-		      filler, namelen, line);
 	switch (fmt) {
 	case CMIT_FMT_MEDIUM:
 		ret += sprintf(buf + ret, "Date:   %s\n", show_date(time, tz));
@@ -575,14 +629,24 @@ unsigned long pretty_print_commit(enum c
 			int slen = strlen(subject);
 			memcpy(buf + offset, subject, slen);
 			offset += slen;
+			offset += add_rfc2047(buf + offset, line, linelen);
+		}
+		else {
+			memset(buf + offset, ' ', indent);
+			memcpy(buf + offset + indent, line, linelen);
+			offset += linelen + indent;
 		}
-		memset(buf + offset, ' ', indent);
-		memcpy(buf + offset + indent, line, linelen);
-		offset += linelen + indent;
 		buf[offset++] = '\n';
 		if (fmt == CMIT_FMT_ONELINE)
 			break;
-		subject = NULL;
+		if (subject) {
+			static const char header[] =
+				"Content-Type: text/plain; charset=UTF-8\n"
+				"Content-Transfer-Encoding: 8bit\n";
+			memcpy(buf + offset, header, sizeof(header)-1);
+			offset += sizeof(header)-1;
+			subject = NULL;
+		}
 	}
 	while (offset && isspace(buf[offset-1]))
 		offset--;

* Re: [PATCH] CMIT_FMT_EMAIL: Q-encode Subject: and display-name part of From: fields.
  2006-05-16 10:18 [PATCH] CMIT_FMT_EMAIL: Q-encode Subject: and display-name part of From: fields Junio C Hamano
@ 2006-05-16 10:38 ` Jakub Narebski
  2006-05-16 10:49 ` Rocco Rutte
  1 sibling, 0 replies; 3+ messages in thread
From: Jakub Narebski @ 2006-05-16 10:38 UTC
  To: git

Junio C Hamano wrote:

> By convention, the commit message and the author/committer names
> in the commit objects are UTF-8 encoded.  When formatting for
> e-mails, Q-encode them according to RFC 2047.
> 
> While we are at it, generate the content-type and
> content-transfer-encoding headers as well.
> 
> Signed-off-by: Junio C Hamano <junkio@cox.net>
> 
> ---
> 
>  With this patch, the output formatted with
> 
> git show --pretty=email --patch-with-stat 9d7f73d4
> 
>  would start like this:
> 
>    From 9d7f73d43fa49d0d2f5a8cfcce9d659e8ad2d265  Thu Apr 7 15:13:13 2005
>    From: =?utf-8?q?Lukas_Sandstr=C3=B6m?= <lukass@etek.chalmers.se>
>    Date: Sat, 25 Feb 2006 12:20:13 +0100
>    Subject: [PATCH] git-fetch: print the new and old ref when fast-forwarding 
>    Content-Type: text/plain; charset=UTF-8 
>    Content-Transfer-Encoding: 8bit

I guess that we also need

     MIME-Version: 1.0

(from what I remember of the trouble with Outlook Express not sending all
the required headers, and tin not working properly as a result).
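
A minimal sketch of what emitting the three headers together could look
like. The helper name add_mime_headers() and its charset parameter are
hypothetical and are not part of the posted patch, which hard-codes the
two Content-* lines:

```c
#include <stdio.h>

/*
 * Hypothetical helper (not in the patch): write the MIME header trio
 * into buf and return the number of bytes written, excluding the NUL.
 * The charset is a parameter only to illustrate the idea; the patch
 * itself always emits UTF-8.
 */
static int add_mime_headers(char *buf, size_t size, const char *charset)
{
	return snprintf(buf, size,
			"MIME-Version: 1.0\n"
			"Content-Type: text/plain; charset=%s\n"
			"Content-Transfer-Encoding: 8bit\n",
			charset);
}
```

Called as add_mime_headers(buf, sizeof(buf), "UTF-8"), this would put
MIME-Version: ahead of the two headers shown in the sample output.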

If I remember correctly, encoding headers using quoted-printable is needed
only because the headers come before the charset is declared. IIRC there was
a proposal to use UTF-8 for headers regardless of the charset used for the
body of the message.

P.S. Should we set the User-Agent header as well?
-- 
Jakub Narebski
Warsaw, Poland

* Re: [PATCH] CMIT_FMT_EMAIL: Q-encode Subject: and display-name part of From: fields.
  2006-05-16 10:18 [PATCH] CMIT_FMT_EMAIL: Q-encode Subject: and display-name part of From: fields Junio C Hamano
  2006-05-16 10:38 ` Jakub Narebski
@ 2006-05-16 10:49 ` Rocco Rutte
  1 sibling, 0 replies; 3+ messages in thread
From: Rocco Rutte @ 2006-05-16 10:49 UTC
  To: git

Hi,

* Junio C Hamano [06-05-16 03:18:24 -0700] wrote:

[...]

> Thoughts?

> If we decide to do the header formatting here, there are two
> further enhancements that need to be done:

> (1) The charset must be configurable for projects that use
>     encoding different from UTF-8, perhaps with the .git/config
>     [i18n] commitEncoding.  It is only a convention, not a hard
>     rule, to use UTF-8 for the metainformation.

Writing an encoder that fully conforms to RFC 2047 is a mess. Not so
much because the algorithms are difficult, but because there are many
things to take care of if you want to do it right.

For example, an encoded word may be at most 75 characters long,
including the charset, the encoding and the delimiters. For names this
is probably not an issue, but for subjects it is. I didn't really check
whether your patch produces only the minimum encoding (i.e. only those
words that need it, and not just all words with '_' or '=20' in between
them), but if not, 75 isn't that much after all. And you may need to
think about header folding (and unfolding for reading it back in).
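
To make the limit concrete, here is a rough sketch that computes how long
the single encoded word would be if a whole line were encoded the way the
patch does it. The function names are made up for illustration; the
escaping rule mirrors is_rfc2047_special() from the patch, and the 75
comes from RFC 2047:

```c
#include <stddef.h>

#define RFC2047_MAX_ENCODED_WORD 75

/* Bytes that the patch's is_rfc2047_special() turns into "=XX". */
static int needs_escape(unsigned char ch)
{
	return (ch & 0x80) || ch == '=' || ch == '?' || ch == '_';
}

/*
 * Length of "=?utf-8?q?" + payload + "?=" when line[0..len) is put
 * into one encoded word. A space becomes '_' and so costs one byte.
 */
static size_t q_encoded_length(const char *line, size_t len)
{
	size_t total = 10 + 2;	/* "=?utf-8?q?" and "?=" */
	size_t i;

	for (i = 0; i < len; i++)
		total += needs_escape((unsigned char)line[i]) ? 3 : 1;
	return total;
}

/* Nonzero if the whole line fits into a single encoded word. */
static int fits_in_one_encoded_word(const char *line, size_t len)
{
	return q_encoded_length(line, len) <= RFC2047_MAX_ENCODED_WORD;
}
```

A pure-ASCII subject of 64 bytes is already too long for one encoded word
under this scheme, and every non-ASCII byte eats three characters of the
budget.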

Also, supporting any character set (via iconv()) blows up the
implementation. There are character sets for which other RFCs define the
encoding method, so using only quoted-printable is not fully correct in
all possible cases.

And, combined with the first point, supporting several character sets
really can become a mess: you need to produce several encoded words
because the input would otherwise exceed the RFC limits, and for
multi-byte character sets you mustn't break within a multi-byte
character sequence but only at character boundaries. So you need a
generic way to detect the byte size of such a character in any
supported character set.
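
For UTF-8 alone the boundary detection is short. The following is only a
sketch under the assumption that the input is valid UTF-8; both function
names are invented here, and the budget is the number of payload
characters left once "=?utf-8?q?" and "?=" are accounted for (63 when
aiming at the 75 character limit):

```c
#include <stddef.h>

/*
 * Byte length of the UTF-8 sequence that starts with this lead byte.
 * Stray continuation bytes and invalid lead bytes count as one unit.
 */
static size_t utf8_seq_len(unsigned char lead)
{
	if (lead < 0x80)
		return 1;
	if ((lead & 0xE0) == 0xC0)
		return 2;
	if ((lead & 0xF0) == 0xE0)
		return 3;
	if ((lead & 0xF8) == 0xF0)
		return 4;
	return 1;
}

/*
 * Return how many bytes of line[0..len) can go into one encoded word
 * without splitting a character and without the Q-encoded payload
 * exceeding budget. Each byte of a multi-byte character costs three
 * payload characters ("=XX"); '=', '?' and '_' also cost three.
 */
static size_t q_split_point(const char *line, size_t len, size_t budget)
{
	size_t i = 0, used = 0;

	while (i < len) {
		unsigned char ch = line[i];
		size_t n = utf8_seq_len(ch);
		size_t cost;

		if (n > len - i)
			n = len - i;	/* truncated sequence at end of input */
		if (n > 1 || ch >= 0x80 || ch == '=' || ch == '?' || ch == '_')
			cost = 3 * n;
		else
			cost = 1;
		if (used + cost > budget)
			break;
		used += cost;
		i += n;
	}
	return i;
}
```

A caller would invoke q_split_point() repeatedly, emitting one encoded
word per returned chunk and folding the header between them. Doing the
same for arbitrary character sets is where it gets ugly.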

With just the UTF-8 encoding all of this is pretty simple though.

I would rather try to find a way to implement this in a scripting
language that already has standard modules for this or makes it easy to
write one. In C this gets quite lengthy...

   bye, Rocco
-- 
:wq!
