[PATCH] Fix Q-encoded multi-octet-char split in email.

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] Fix Q-encoded multi-octet-char split in email.
@ 2012-07-03  1:41 katsu
  2012-07-03  6:35 ` Jeff King
  2012-07-03  9:52 ` Erik Faye-Lund
  0 siblings, 2 replies; 8+ messages in thread
From: katsu @ 2012-07-03  1:41 UTC (permalink / raw)
  To: git, gitster; +Cc: katsu, Takeharu Katsuyama

Issue: Email subject written in multi-octet language like japanese cannot
be displayed in correct at destinations's email client, because the
Q-encoded subject which is longer than 78 octets is split by a octet not by
a character at line breaks.
e.g.)
   "=?utf-8?q? [PATCH] ... =E8=83=86=E8=81=A9?="
                    |
                    V
   "=?utf-8?q? [PATCH] ... =E8=83=86=E8?="
   "=?utf-8?q?=81=A9=?"

Changes: Add a judge if a character is an part of utf-8 muti-octet, and
split the characters by a character not by a octet at line breaks in
function add_rfc2407() in pretty.c. Like following.

   "=?utf-8?q? [PATCH] ... =E8=83=86?="
   "=?utf-8?q?=E8=81=A9=?"

Signed-off-by: Takeharu Katsuyama <tkatsu.ne@gmail.com>
---
 pretty.c |   29 ++++++++++++++++++++++++++++-
 1 files changed, 28 insertions(+), 1 deletions(-)
 mode change 100644 => 100755 pretty.c

diff --git a/pretty.c b/pretty.c
old mode 100644
new mode 100755
index 8b1ea9f..266a8fe
--- a/pretty.c
+++ b/pretty.c
@@ -272,6 +272,12 @@ static void add_rfc2047(struct strbuf *sb, const char *line, int len,
 	static const int max_length = 78; /* per rfc2822 */
 	int i;
 	int line_len;
+	int utf_ctr, use_utf;
+
+	if (!strcmp(encoding, "UTF-8") || !strcmp(encoding, "utf-8"))
+		use_utf = 1;
+	else
+		use_utf = 0;
 
 	/* How many bytes are already used on the current line? */
 	for (i = sb->len - 1; i >= 0; i--)
@@ -293,10 +299,31 @@ needquote:
 	strbuf_grow(sb, len * 3 + strlen(encoding) + 100);
 	strbuf_addf(sb, "=?%s?q?", encoding);
 	line_len += strlen(encoding) + 5; /* 5 for =??q? */
+	utf_ctr = 0;
 	for (i = 0; i < len; i++) {
 		unsigned ch = line[i] & 0xFF;
 
-		if (line_len >= max_length - 2) {
+		/*
+		 * Judge if it is an utf-8 char, to avoid inserting newline
+		 * in the middle of utf-8 char code.
+		 */
+		if (use_utf) {
+			if (ch >= 0xC2 && ch <= 0xDF)	/* 1'st byte of 2-bytes utf-8 */
+				utf_ctr = 1;
+			else if (ch >= 0xE0 && ch <= 0xEF)	/*  3-bytes utf-8 */
+				utf_ctr = 2;
+			else if (ch >= 0xF0 && ch <= 0xF7)	/*  4-bytes utf-8 */
+				utf_ctr = 3;
+			else if (ch >= 0xF8 && ch <= 0xFB)	/*  5-bytes utf-8 */
+				utf_ctr = 4;
+			else if (ch >= 0xFC && ch <= 0xFD)	/*  6-bytes utf-8 */
+				utf_ctr = 5;
+			else if (ch >= 0x80 && ch <= 0xBF)  /* 2'nd to 6'th byte of utf-8 */
+				utf_ctr--;
+			else
+				utf_ctr = 0;
+		}
+		if (line_len >= (max_length - 2 - utf_ctr *3)) {
 			strbuf_addf(sb, "?=\n =?%s?q?", encoding);
 			line_len = strlen(encoding) + 5 + 1; /* =??q? plus SP */
 		}
-- 
1.7.9

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] Fix Q-encoded multi-octet-char split in email.
  2012-07-03  1:41 [PATCH] Fix Q-encoded multi-octet-char split in email katsu
@ 2012-07-03  6:35 ` Jeff King
       [not found]   ` <CAGxub4-9E0W8ZgsPHeTyUyxmPD80LUd7NjSezg5Zt2-nZPBMJA@mail.gmail.com>
  2012-07-03  9:52 ` Erik Faye-Lund
  1 sibling, 1 reply; 8+ messages in thread
From: Jeff King @ 2012-07-03  6:35 UTC (permalink / raw)
  To: katsu; +Cc: git, gitster, Takeharu Katsuyama

On Tue, Jul 03, 2012 at 10:41:37AM +0900, katsu wrote:

> Issue: Email subject written in multi-octet language like japanese cannot
> be displayed in correct at destinations's email client, because the
> Q-encoded subject which is longer than 78 octets is split by a octet not by
> a character at line breaks.
> e.g.)
>    "=?utf-8?q? [PATCH] ... =E8=83=86=E8=81=A9?="
>                     |
>                     V
>    "=?utf-8?q? [PATCH] ... =E8=83=86=E8?="
>    "=?utf-8?q?=81=A9=?"
> 
> Changes: Add a judge if a character is an part of utf-8 muti-octet, and
> split the characters by a character not by a octet at line breaks in
> function add_rfc2407() in pretty.c. Like following.
> 
>    "=?utf-8?q? [PATCH] ... =E8=83=86?="
>    "=?utf-8?q?=E8=81=A9=?"
> 
> Signed-off-by: Takeharu Katsuyama <tkatsu.ne@gmail.com>

Yeah, we definitely don't handle that properly according to the rfc.
This patch is is going in the right direction, but I have a few
comments:

> --- a/pretty.c
> +++ b/pretty.c
> @@ -272,6 +272,12 @@ static void add_rfc2047(struct strbuf *sb, const char *line, int len,
>  	static const int max_length = 78; /* per rfc2822 */
>  	int i;
>  	int line_len;
> +	int utf_ctr, use_utf;
> +
> +	if (!strcmp(encoding, "UTF-8") || !strcmp(encoding, "utf-8"))
> +		use_utf = 1;
> +	else
> +		use_utf = 0;

Please use is_encoding_utf8, which handles both of these spellings, as
well as "utf8" and "UTF8" (it also handles encoding==NULL; I don't think
that can happen in this code path, but it is nice to be defensive).

> @@ -293,10 +299,31 @@ needquote:
>  	strbuf_grow(sb, len * 3 + strlen(encoding) + 100);
>  	strbuf_addf(sb, "=?%s?q?", encoding);
>  	line_len += strlen(encoding) + 5; /* 5 for =??q? */
> +	utf_ctr = 0;
>  	for (i = 0; i < len; i++) {
>  		unsigned ch = line[i] & 0xFF;
>  
> -		if (line_len >= max_length - 2) {
> +		/*
> +		 * Judge if it is an utf-8 char, to avoid inserting newline
> +		 * in the middle of utf-8 char code.
> +		 */
> +		if (use_utf) {
> +			if (ch >= 0xC2 && ch <= 0xDF)	/* 1'st byte of 2-bytes utf-8 */
> +				utf_ctr = 1;
> +			else if (ch >= 0xE0 && ch <= 0xEF)	/*  3-bytes utf-8 */
> +				utf_ctr = 2;
> +			else if (ch >= 0xF0 && ch <= 0xF7)	/*  4-bytes utf-8 */
> +				utf_ctr = 3;
> +			else if (ch >= 0xF8 && ch <= 0xFB)	/*  5-bytes utf-8 */
> +				utf_ctr = 4;
> +			else if (ch >= 0xFC && ch <= 0xFD)	/*  6-bytes utf-8 */
> +				utf_ctr = 5;
> +			else if (ch >= 0x80 && ch <= 0xBF)  /* 2'nd to 6'th byte of utf-8 */
> +				utf_ctr--;
> +			else
> +				utf_ctr = 0;
> +		}
> +		if (line_len >= (max_length - 2 - utf_ctr *3)) {

Can we re-use utf8_width here instead of rewriting these rules?

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

[parent not found: <CAGxub4-9E0W8ZgsPHeTyUyxmPD80LUd7NjSezg5Zt2-nZPBMJA@mail.gmail.com>]

* Re: [PATCH] Fix Q-encoded multi-octet-char split in email.
       [not found]   ` <CAGxub4-9E0W8ZgsPHeTyUyxmPD80LUd7NjSezg5Zt2-nZPBMJA@mail.gmail.com>
@ 2012-07-04  6:44     ` Jeff King
  2012-07-18  5:10       ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff King @ 2012-07-04  6:44 UTC (permalink / raw)
  To: Katsuyama Takeharu; +Cc: git, gitster, Takeharu Katsuyama

On Wed, Jul 04, 2012 at 03:19:31PM +0900, Katsuyama Takeharu wrote:

> diff --git a/pretty.c b/pretty.c
> --- a/pretty.c
> +++ b/pretty.c
> @@ -272,6 +272,13 @@ static void add_rfc2047(struct strbuf *sb, const char
> *line, int len,
>      static const int max_length = 78; /* per rfc2822 */
>      int i;
>      int line_len;
> +    int utf8_ctr, use_utf8;
> +    const char *utf8_start;
> +
> +    if (is_encoding_utf8(encoding) && encoding != NULL)
> +        use_utf8 = 1;
> +    else
> +        use_utf8 = 0;

I think you can drop the "encoding != NULL" here. If we don't have an
explicit encoding, git always assumes utf8 (also, as it happens we never
hit this point with a NULL encoding in the current code anyway, though
that could in theory change in the future).

> > Can we re-use utf8_width here instead of rewriting these rules?
> 
> Yes you can. But there are an issue which utf8_width seems not to return
> correct value.  It returns 3  even if  a provided code has 3 octet utf-8
> char(e.g. 0xE38292).
> I expect it returns 2.

Hmm. I think I may have led you astray. It seems that the return value
of utf8_width is not about the byte-width of the character
representation, but rather about the intended character-width of the
glyph. But since we are encoding the bytes, we care about the former.

So I think you would really want to use pick_one_utf8_char and see how
many characters it consumed, like this:

  const char *p = &line[i];
  pick_one_utf8_char(&p, NULL);
  if (!p) /* not valid utf8, just assume single byte */
          utf8_ctr = 1;
  else
          utf8_ctr = p - &line[i];

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Fix Q-encoded multi-octet-char split in email.
  2012-07-04  6:44     ` Jeff King
@ 2012-07-18  5:10       ` Junio C Hamano
  2012-07-18  7:27         ` Jeff King
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2012-07-18  5:10 UTC (permalink / raw)
  To: Jeff King; +Cc: Katsuyama Takeharu, git, Takeharu Katsuyama

Ping on a seemingly stalled thread.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Fix Q-encoded multi-octet-char split in email.
  2012-07-18  5:10       ` Junio C Hamano
@ 2012-07-18  7:27         ` Jeff King
  2012-07-25 11:10           ` Drew Northup
  2012-08-16 21:52           ` Junio C Hamano
  0 siblings, 2 replies; 8+ messages in thread
From: Jeff King @ 2012-07-18  7:27 UTC (permalink / raw)
  To: Katsuyama Takeharu; +Cc: Junio C Hamano, git, Takeharu Katsuyama

On Tue, Jul 17, 2012 at 10:10:25PM -0700, Junio C Hamano wrote:

> Ping on a seemingly stalled thread.

Hrm. I could swear that Takeharu sent a follow-up using
pick_one_utf8_char directly that looked OK, but now I can't seem to find
it in the list archives. I wonder if it was off-list and I didn't
realize it.

If I did not simply dream it, can you re-post the latest patch you
sent?

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Fix Q-encoded multi-octet-char split in email.
  2012-07-18  7:27         ` Jeff King
@ 2012-07-25 11:10           ` Drew Northup
  2012-08-16 21:52           ` Junio C Hamano
  1 sibling, 0 replies; 8+ messages in thread
From: Drew Northup @ 2012-07-25 11:10 UTC (permalink / raw)
  To: Jeff King; +Cc: Katsuyama Takeharu, Junio C Hamano, git, Takeharu Katsuyama

On Wed, Jul 18, 2012 at 3:27 AM, Jeff King <peff@peff.net> wrote:
> On Tue, Jul 17, 2012 at 10:10:25PM -0700, Junio C Hamano wrote:
>
>> Ping on a seemingly stalled thread.

> If I did not simply dream it, can you re-post the latest patch you
> sent?
>
> -Peff

...And I thought dreaming I was trapped in a DOS prompt was bad.
Dreaming in patches? OUCH!

-- 
-Drew Northup
--------------------------------------------------------------
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Fix Q-encoded multi-octet-char split in email.
  2012-07-18  7:27         ` Jeff King
  2012-07-25 11:10           ` Drew Northup
@ 2012-08-16 21:52           ` Junio C Hamano
  1 sibling, 0 replies; 8+ messages in thread
From: Junio C Hamano @ 2012-08-16 21:52 UTC (permalink / raw)
  To: Jeff King; +Cc: Katsuyama Takeharu, git, Takeharu Katsuyama

Jeff King <peff@peff.net> writes:

> On Tue, Jul 17, 2012 at 10:10:25PM -0700, Junio C Hamano wrote:
>
>> Ping on a seemingly stalled thread.
>
> Hrm. I could swear that Takeharu sent a follow-up using
> pick_one_utf8_char directly that looked OK, but now I can't seem to find
> it in the list archives. I wonder if it was off-list and I didn't
> realize it.
>
> If I did not simply dream it, can you re-post the latest patch you
> sent?

Another ping on a seemingly stalled thread.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Fix Q-encoded multi-octet-char split in email.
  2012-07-03  1:41 [PATCH] Fix Q-encoded multi-octet-char split in email katsu
  2012-07-03  6:35 ` Jeff King
@ 2012-07-03  9:52 ` Erik Faye-Lund
  1 sibling, 0 replies; 8+ messages in thread
From: Erik Faye-Lund @ 2012-07-03  9:52 UTC (permalink / raw)
  To: katsu; +Cc: git, gitster, Takeharu Katsuyama

On Tue, Jul 3, 2012 at 3:41 AM, katsu <gkatsu.ne@gmail.com> wrote:
> Issue: Email subject written in multi-octet language like japanese cannot
> be displayed in correct at destinations's email client, because the
> Q-encoded subject which is longer than 78 octets is split by a octet not by
> a character at line breaks.
> e.g.)
>    "=?utf-8?q? [PATCH] ... =E8=83=86=E8=81=A9?="
>                     |
>                     V
>    "=?utf-8?q? [PATCH] ... =E8=83=86=E8?="
>    "=?utf-8?q?=81=A9=?"
>
> Changes: Add a judge if a character is an part of utf-8 muti-octet, and
> split the characters by a character not by a octet at line breaks in
> function add_rfc2407() in pretty.c.

You mean add_rfc2047(), right?

Anyway, I'm not an expert here, but can't a "soft" newline (as
specified in rfc 2045) be used in message headers? If it could, then
we wouldn't need to grok the underlying encoding when wrapping, which
strikes me as slightly better...

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2012-08-16 21:52 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-07-03  1:41 [PATCH] Fix Q-encoded multi-octet-char split in email katsu
2012-07-03  6:35 ` Jeff King
     [not found]   ` <CAGxub4-9E0W8ZgsPHeTyUyxmPD80LUd7NjSezg5Zt2-nZPBMJA@mail.gmail.com>
2012-07-04  6:44     ` Jeff King
2012-07-18  5:10       ` Junio C Hamano
2012-07-18  7:27         ` Jeff King
2012-07-25 11:10           ` Drew Northup
2012-08-16 21:52           ` Junio C Hamano
2012-07-03  9:52 ` Erik Faye-Lund

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).