* [PATCH] Fix Q-encoded multi-octet-char split in email.
From: katsu @ 2012-07-03 1:41 UTC
To: git, gitster; +Cc: katsu, Takeharu Katsuyama
Issue: An email subject written in a multi-octet language such as Japanese
cannot be displayed correctly in the recipient's email client, because a
Q-encoded subject longer than 78 octets is split at an octet boundary, not at
a character boundary, when lines are broken.
e.g.)
"=?utf-8?q? [PATCH] ... =E8=83=86=E8=81=A9?="
|
V
"=?utf-8?q? [PATCH] ... =E8=83=86=E8?="
"=?utf-8?q?=81=A9?="
Changes: Add a check for whether an octet is part of a UTF-8 multi-octet
character, and break lines at a character boundary rather than an octet
boundary in function add_rfc2407() in pretty.c, as follows.
"=?utf-8?q? [PATCH] ... =E8=83=86?="
"=?utf-8?q?=E8=81=A9?="
Signed-off-by: Takeharu Katsuyama <tkatsu.ne@gmail.com>
---
pretty.c | 29 ++++++++++++++++++++++++++++-
1 files changed, 28 insertions(+), 1 deletions(-)
mode change 100644 => 100755 pretty.c
diff --git a/pretty.c b/pretty.c
old mode 100644
new mode 100755
index 8b1ea9f..266a8fe
--- a/pretty.c
+++ b/pretty.c
@@ -272,6 +272,12 @@ static void add_rfc2047(struct strbuf *sb, const char *line, int len,
static const int max_length = 78; /* per rfc2822 */
int i;
int line_len;
+ int utf_ctr, use_utf;
+
+ if (!strcmp(encoding, "UTF-8") || !strcmp(encoding, "utf-8"))
+ use_utf = 1;
+ else
+ use_utf = 0;
/* How many bytes are already used on the current line? */
for (i = sb->len - 1; i >= 0; i--)
@@ -293,10 +299,31 @@ needquote:
strbuf_grow(sb, len * 3 + strlen(encoding) + 100);
strbuf_addf(sb, "=?%s?q?", encoding);
line_len += strlen(encoding) + 5; /* 5 for =??q? */
+ utf_ctr = 0;
for (i = 0; i < len; i++) {
unsigned ch = line[i] & 0xFF;
- if (line_len >= max_length - 2) {
+ /*
+ * Judge if it is an utf-8 char, to avoid inserting newline
+ * in the middle of utf-8 char code.
+ */
+ if (use_utf) {
+ if (ch >= 0xC2 && ch <= 0xDF) /* 1'st byte of 2-bytes utf-8 */
+ utf_ctr = 1;
+ else if (ch >= 0xE0 && ch <= 0xEF) /* 3-bytes utf-8 */
+ utf_ctr = 2;
+ else if (ch >= 0xF0 && ch <= 0xF7) /* 4-bytes utf-8 */
+ utf_ctr = 3;
+ else if (ch >= 0xF8 && ch <= 0xFB) /* 5-bytes utf-8 */
+ utf_ctr = 4;
+ else if (ch >= 0xFC && ch <= 0xFD) /* 6-bytes utf-8 */
+ utf_ctr = 5;
+ else if (ch >= 0x80 && ch <= 0xBF) /* 2'nd to 6'th byte of utf-8 */
+ utf_ctr--;
+ else
+ utf_ctr = 0;
+ }
+ if (line_len >= (max_length - 2 - utf_ctr *3)) {
strbuf_addf(sb, "?=\n =?%s?q?", encoding);
line_len = strlen(encoding) + 5 + 1; /* =??q? plus SP */
}
--
1.7.9
* Re: [PATCH] Fix Q-encoded multi-octet-char split in email.
From: Jeff King @ 2012-07-03 6:35 UTC
To: katsu; +Cc: git, gitster, Takeharu Katsuyama
On Tue, Jul 03, 2012 at 10:41:37AM +0900, katsu wrote:
> Issue: An email subject written in a multi-octet language such as Japanese
> cannot be displayed correctly in the recipient's email client, because a
> Q-encoded subject longer than 78 octets is split at an octet boundary, not at
> a character boundary, when lines are broken.
> e.g.)
> "=?utf-8?q? [PATCH] ... =E8=83=86=E8=81=A9?="
> |
> V
> "=?utf-8?q? [PATCH] ... =E8=83=86=E8?="
> "=?utf-8?q?=81=A9?="
>
> Changes: Add a check for whether an octet is part of a UTF-8 multi-octet
> character, and break lines at a character boundary rather than an octet
> boundary in function add_rfc2407() in pretty.c, as follows.
>
> "=?utf-8?q? [PATCH] ... =E8=83=86?="
> "=?utf-8?q?=E8=81=A9?="
>
> Signed-off-by: Takeharu Katsuyama <tkatsu.ne@gmail.com>
Yeah, we definitely don't handle that properly according to the rfc.
This patch is going in the right direction, but I have a few
comments:
> --- a/pretty.c
> +++ b/pretty.c
> @@ -272,6 +272,12 @@ static void add_rfc2047(struct strbuf *sb, const char *line, int len,
> static const int max_length = 78; /* per rfc2822 */
> int i;
> int line_len;
> + int utf_ctr, use_utf;
> +
> + if (!strcmp(encoding, "UTF-8") || !strcmp(encoding, "utf-8"))
> + use_utf = 1;
> + else
> + use_utf = 0;
Please use is_encoding_utf8, which handles both of these spellings, as
well as "utf8" and "UTF8" (it also handles encoding==NULL; I don't think
that can happen in this code path, but it is nice to be defensive).
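As a rough illustration of the kind of check being asked for here, a
case-insensitive comparison that accepts both spellings could look like the
sketch below. This is not git's is_encoding_utf8(); the helper names are
made up for the example, and treating a NULL encoding as UTF-8 is an
assumption, not something stated in this thread.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: compare two names, ignoring ASCII case. */
int same_name_icase(const char *a, const char *b)
{
	while (*a && *b) {
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
		a++;
		b++;
	}
	/* Equal only if both strings ended at the same point. */
	return *a == *b;
}

/* Hypothetical stand-in for an "is this UTF-8?" check. */
int looks_like_utf8(const char *name)
{
	if (!name)
		return 1; /* assumption: no encoding given means UTF-8 */
	return same_name_icase(name, "utf-8") || same_name_icase(name, "utf8");
}
```

With this shape, "UTF-8", "utf-8", "UTF8" and "utf8" all take the same path,
so the caller does not need one strcmp() per spelling.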
> @@ -293,10 +299,31 @@ needquote:
> strbuf_grow(sb, len * 3 + strlen(encoding) + 100);
> strbuf_addf(sb, "=?%s?q?", encoding);
> line_len += strlen(encoding) + 5; /* 5 for =??q? */
> + utf_ctr = 0;
> for (i = 0; i < len; i++) {
> unsigned ch = line[i] & 0xFF;
>
> - if (line_len >= max_length - 2) {
> + /*
> + * Judge if it is an utf-8 char, to avoid inserting newline
> + * in the middle of utf-8 char code.
> + */
> + if (use_utf) {
> + if (ch >= 0xC2 && ch <= 0xDF) /* 1'st byte of 2-bytes utf-8 */
> + utf_ctr = 1;
> + else if (ch >= 0xE0 && ch <= 0xEF) /* 3-bytes utf-8 */
> + utf_ctr = 2;
> + else if (ch >= 0xF0 && ch <= 0xF7) /* 4-bytes utf-8 */
> + utf_ctr = 3;
> + else if (ch >= 0xF8 && ch <= 0xFB) /* 5-bytes utf-8 */
> + utf_ctr = 4;
> + else if (ch >= 0xFC && ch <= 0xFD) /* 6-bytes utf-8 */
> + utf_ctr = 5;
> + else if (ch >= 0x80 && ch <= 0xBF) /* 2'nd to 6'th byte of utf-8 */
> + utf_ctr--;
> + else
> + utf_ctr = 0;
> + }
> + if (line_len >= (max_length - 2 - utf_ctr *3)) {
Can we re-use utf8_width here instead of rewriting these rules?
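To make the intended behaviour concrete, here is a minimal standalone sketch
of folding at character boundaries. It does not use git's strbuf or utf8.c;
utf8_seq_len() and q_encode_folded() are hypothetical names, the set of
octets left unencoded is deliberately simplified to ASCII letters and digits,
and sequences are limited to 4 octets as in RFC 3629 (the patch above also
handles 5- and 6-octet forms). Malformed input is not validated.

```c
#include <stdio.h>
#include <string.h>

/* Octets in the UTF-8 sequence that starts with `lead`; 1 for anything
 * that is not a valid multi-octet lead byte. */
size_t utf8_seq_len(unsigned char lead)
{
	if (lead >= 0xC2 && lead <= 0xDF)
		return 2;
	if (lead >= 0xE0 && lead <= 0xEF)
		return 3;
	if (lead >= 0xF0 && lead <= 0xF4)
		return 4;
	return 1;
}

/* Simplification: only ASCII letters and digits are emitted literally. */
static int is_literal(unsigned char c)
{
	return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
	       (c >= '0' && c <= '9');
}

/* Q-encode `text` into `out`, starting a new encoded-word before any
 * character that would not fit in `max_len` columns together with the
 * closing "?=". Returns 0 on success, -1 if `out` is too small. */
int q_encode_folded(char *out, size_t outsz, const char *charset,
		    const char *text, size_t max_len)
{
	size_t len = strlen(text), pos, col, i = 0;
	int n = snprintf(out, outsz, "=?%s?q?", charset);

	if (n < 0 || (size_t)n >= outsz)
		return -1;
	pos = col = (size_t)n;

	while (i < len) {
		size_t clen = utf8_seq_len((unsigned char)text[i]);
		size_t need = 0, j;

		if (clen > len - i)
			clen = len - i; /* truncated input: take what is left */
		for (j = 0; j < clen; j++)
			need += is_literal((unsigned char)text[i + j]) ? 1 : 3;

		if (col + need + 2 > max_len) {
			n = snprintf(out + pos, outsz - pos, "?=\n =?%s?q?", charset);
			if (n < 0 || (size_t)n >= outsz - pos)
				return -1;
			pos += (size_t)n;
			col = (size_t)n - 3; /* SP plus "=?charset?q?" */
		}
		for (j = 0; j < clen; j++) {
			unsigned char c = (unsigned char)text[i + j];

			if (is_literal(c))
				n = snprintf(out + pos, outsz - pos, "%c", c);
			else
				n = snprintf(out + pos, outsz - pos, "=%02X", (unsigned)c);
			if (n < 0 || (size_t)n >= outsz - pos)
				return -1;
			pos += (size_t)n;
		}
		col += need;
		i += clen;
	}
	n = snprintf(out + pos, outsz - pos, "?=");
	return (n < 0 || (size_t)n >= outsz - pos) ? -1 : 0;
}
```

The point of computing `need` for the whole character before emitting any of
it is that the fold decision is made once per character, so the octets of one
character always land in the same encoded-word.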
-Peff
* Re: [PATCH] Fix Q-encoded multi-octet-char split in email.
From: Erik Faye-Lund @ 2012-07-03 9:52 UTC
To: katsu; +Cc: git, gitster, Takeharu Katsuyama
On Tue, Jul 3, 2012 at 3:41 AM, katsu <gkatsu.ne@gmail.com> wrote:
> Issue: An email subject written in a multi-octet language such as Japanese
> cannot be displayed correctly in the recipient's email client, because a
> Q-encoded subject longer than 78 octets is split at an octet boundary, not at
> a character boundary, when lines are broken.
> e.g.)
> "=?utf-8?q? [PATCH] ... =E8=83=86=E8=81=A9?="
> |
> V
> "=?utf-8?q? [PATCH] ... =E8=83=86=E8?="
> "=?utf-8?q?=81=A9?="
>
> Changes: Add a check for whether an octet is part of a UTF-8 multi-octet
> character, and break lines at a character boundary rather than an octet
> boundary in function add_rfc2407() in pretty.c.
You mean add_rfc2047(), right?
Anyway, I'm not an expert here, but can't a "soft" newline (as
specified in rfc 2045) be used in message headers? If it could, then
we wouldn't need to grok the underlying encoding when wrapping, which
strikes me as slightly better...
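For reference while weighing that question, the failure in the example at
the top of the thread can be reproduced with a small check on a single
encoded-word payload (the text between "?q?" and "?="). The sketch below is
only an illustration: hexval() and q_payload_ends_mid_char() are hypothetical
names, and it detects only a payload that stops partway through a UTF-8
sequence, not one that begins with orphaned continuation octets.

```c
#include <stddef.h>

/* Value of one hex digit, or -1 if `c` is not a hex digit. */
static int hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	return -1;
}

/* Return 1 if the Q-encoded payload ends in the middle of a UTF-8
 * sequence, 0 otherwise. */
int q_payload_ends_mid_char(const char *payload)
{
	int pending = 0; /* continuation octets still expected */
	const char *p = payload;

	while (*p) {
		unsigned c;

		if (p[0] == '=' && hexval(p[1]) >= 0 && hexval(p[2]) >= 0) {
			c = (unsigned)(hexval(p[1]) * 16 + hexval(p[2]));
			p += 3;
		} else {
			c = (unsigned char)*p++;
		}

		if (c >= 0x80 && c <= 0xBF) {
			if (pending > 0)
				pending--;
		} else if (c >= 0xC2 && c <= 0xDF) {
			pending = 1;
		} else if (c >= 0xE0 && c <= 0xEF) {
			pending = 2;
		} else if (c >= 0xF0 && c <= 0xF4) {
			pending = 3;
		} else {
			pending = 0;
		}
	}
	return pending > 0;
}
```

Applied to the payloads quoted above, "=E8=83=86=E8" is reported as ending
mid-character, while "=E8=83=86" is not.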
Thread overview: 8+ messages
2012-07-03 1:41 [PATCH] Fix Q-encoded multi-octet-char split in email katsu
2012-07-03 6:35 ` Jeff King
[not found] ` <CAGxub4-9E0W8ZgsPHeTyUyxmPD80LUd7NjSezg5Zt2-nZPBMJA@mail.gmail.com>
2012-07-04 6:44 ` Jeff King
2012-07-18 5:10 ` Junio C Hamano
2012-07-18 7:27 ` Jeff King
2012-07-25 11:10 ` Drew Northup
2012-08-16 21:52 ` Junio C Hamano
2012-07-03 9:52 ` Erik Faye-Lund