[PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header
@ 2008-12-26 18:38 Kirill Smelkov
  2009-01-07 22:43 ` [BUG PATCH " Kirill Smelkov
  2009-01-08 10:08 ` [PATCH " Alexander Potashev
  0 siblings, 2 replies; 12+ messages in thread
From: Kirill Smelkov @ 2008-12-26 18:38 UTC (permalink / raw)
  To: Junio C Hamano, Junio C Hamano; +Cc: Kirill Smelkov, git

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 6867 bytes --]

When native language (RU) is in use, subject header usually contains several
parts, e.g.

Subject: [Navy-patches] [PATCH]
	=?utf-8?b?0JjQt9C80LXQvdGR0L0g0YHQv9C40YHQvtC6INC/0LA=?=
	=?utf-8?b?0LrQtdGC0L7QsiDQvdC10L7QsdGF0L7QtNC40LzRi9GFINC00LvRjyA=?=
	=?utf-8?b?0YHQsdC+0YDQutC4?=

This exposes several bugs in builtin-mailinfo.c that I try to fix:


1. decode_b_segment: do not append explicit NUL -- explicit NUL was preventing
   correct header construction on parts concatenation via strbuf_addbuf in
   decode_header_bq. Fixes:

-Subject: Изменён список пакетов необходимых для сборки
+Subject: Изменён список па


Then

2. (hackish) do not emit '\n' after processing of every header segment. It
   seems we should emit previous part as-is only if it does not end with
   '=?='. Fixes:

-Subject: Изменён список пакетов необходимых для сборки
+Subject: Изменён список па кетов необходимых для сборки


Sorry for low-quality patch and description. I did what I could and don't have
energy and time dig more into MIME.

Please help.

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>

---
 builtin-mailinfo.c  |   18 ++++++++++++++++-
 t/t5100-mailinfo.sh |    2 +-
 t/t5100/info0012    |    5 ++++
 t/t5100/msg0012     |    7 ++++++
 t/t5100/patch0012   |   30 +++++++++++++++++++++++++++++
 t/t5100/sample.mbox |   52 +++++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 112 insertions(+), 2 deletions(-)

diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
index e890f7a..d138bc3 100644
--- a/builtin-mailinfo.c
+++ b/builtin-mailinfo.c
@@ -436,6 +436,14 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
 			 * for now we just trust the data.
 			 */
 			c = 0;
+
+			/* XXX: the following is needed not to output NUL in
+			 * the resulting string
+			 *
+			 * This seems to be ok, but I'm not 100% sure -- that's
+			 * why this is an RFC.
+			 */
+			continue;
 		}
 		else
 			continue; /* garbage */
@@ -513,7 +521,15 @@ static int decode_header_bq(struct strbuf *it)
 		strbuf_reset(&piecebuf);
 		rfc2047 = 1;
 
-		if (in != ep) {
+		/* XXX: the follwoing is needed not to output '\n' on every
+		 * multi-line segment in Subject.
+		 *
+		 * I suspect this is not 100% correct, but I'm not a MIME guy
+		 * -- that's why this is an RFC.
+		 */
+
+		/* if in does not end with '=?=', we emit it as is */
+		if (in <= (ep-2) && !(ep[-1]=='\n' && ep[-2]=='=')) {
 			strbuf_add(&outbuf, in, ep - in);
 			in = ep;
 		}
diff --git a/t/t5100-mailinfo.sh b/t/t5100-mailinfo.sh
index fe14589..6825f99 100755
--- a/t/t5100-mailinfo.sh
+++ b/t/t5100-mailinfo.sh
@@ -11,7 +11,7 @@ test_expect_success 'split sample box' \
 	'git mailsplit -o. "$TEST_DIRECTORY"/t5100/sample.mbox >last &&
 	last=`cat last` &&
 	echo total is $last &&
-	test `cat last` = 11'
+	test `cat last` = 12'
 
 for mail in `echo 00*`
 do
diff --git a/t/t5100/info0012 b/t/t5100/info0012
new file mode 100644
index 0000000..ac1216f
--- /dev/null
+++ b/t/t5100/info0012
@@ -0,0 +1,5 @@
+Author: Dmitriy Blinov
+Email: bda@mnsspb.ru
+Subject: Изменён список пакетов необходимых для сборки
+Date: Wed, 12 Nov 2008 17:54:41 +0300
+
diff --git a/t/t5100/msg0012 b/t/t5100/msg0012
new file mode 100644
index 0000000..1dc2bf7
--- /dev/null
+++ b/t/t5100/msg0012
@@ -0,0 +1,7 @@
+textlive-* исправлены на texlive-*
+docutils заменён на python-docutils
+
+Действительно, оказалось, что rest2web вытягивает за собой
+python-docutils. В то время как сам rest2web не нужен.
+
+Signed-off-by: Dmitriy Blinov <bda@mnsspb.ru>
diff --git a/t/t5100/patch0012 b/t/t5100/patch0012
new file mode 100644
index 0000000..36a0b68
--- /dev/null
+++ b/t/t5100/patch0012
@@ -0,0 +1,30 @@
+---
+ howto/build_navy.txt |    6 +++---
+ 1 files changed, 3 insertions(+), 3 deletions(-)
+
+diff --git a/howto/build_navy.txt b/howto/build_navy.txt
+index 3fd3afb..0ee807e 100644
+--- a/howto/build_navy.txt
++++ b/howto/build_navy.txt
+@@ -119,8 +119,8 @@
+    - libxv-dev
+    - libusplash-dev
+    - latex-make
+-   - textlive-lang-cyrillic
+-   - textlive-latex-extra
++   - texlive-lang-cyrillic
++   - texlive-latex-extra
+    - dia
+    - python-pyrex
+    - libtool
+@@ -128,7 +128,7 @@
+    - sox
+    - cython
+    - imagemagick
+-   - docutils
++   - python-docutils
+ 
+ #. на машине dinar: добавить свой открытый ssh-ключ в authorized_keys2 пользователя ddev
+ #. на своей машине: отредактировать /etc/sudoers (команда ``visudo``) примерно следующим образом::
+-- 
+1.5.6.5
diff --git a/t/t5100/sample.mbox b/t/t5100/sample.mbox
index 4bf7947..94da4da 100644
--- a/t/t5100/sample.mbox
+++ b/t/t5100/sample.mbox
@@ -501,3 +501,55 @@ index 3e5fe51..aabfe5c 100644
 
 --=-=-=--
 
+From bda@mnsspb.ru Wed Nov 12 17:54:41 2008
+From: Dmitriy Blinov <bda@mnsspb.ru>
+To: navy-patches@dinar.mns.mnsspb.ru
+Date: Wed, 12 Nov 2008 17:54:41 +0300
+Message-Id: <1226501681-24923-1-git-send-email-bda@mnsspb.ru>
+X-Mailer: git-send-email 1.5.6.5
+MIME-Version: 1.0
+Content-Type: text/plain;
+  charset=utf-8
+Content-Transfer-Encoding: 8bit
+Subject: [Navy-patches] [PATCH]
+	=?utf-8?b?0JjQt9C80LXQvdGR0L0g0YHQv9C40YHQvtC6INC/0LA=?=
+	=?utf-8?b?0LrQtdGC0L7QsiDQvdC10L7QsdGF0L7QtNC40LzRi9GFINC00LvRjyA=?=
+	=?utf-8?b?0YHQsdC+0YDQutC4?=
+
+textlive-* исправлены на texlive-*
+docutils заменён на python-docutils
+
+Действительно, оказалось, что rest2web вытягивает за собой
+python-docutils. В то время как сам rest2web не нужен.
+
+Signed-off-by: Dmitriy Blinov <bda@mnsspb.ru>
+---
+ howto/build_navy.txt |    6 +++---
+ 1 files changed, 3 insertions(+), 3 deletions(-)
+
+diff --git a/howto/build_navy.txt b/howto/build_navy.txt
+index 3fd3afb..0ee807e 100644
+--- a/howto/build_navy.txt
++++ b/howto/build_navy.txt
+@@ -119,8 +119,8 @@
+    - libxv-dev
+    - libusplash-dev
+    - latex-make
+-   - textlive-lang-cyrillic
+-   - textlive-latex-extra
++   - texlive-lang-cyrillic
++   - texlive-latex-extra
+    - dia
+    - python-pyrex
+    - libtool
+@@ -128,7 +128,7 @@
+    - sox
+    - cython
+    - imagemagick
+-   - docutils
++   - python-docutils
+ 
+ #. на машине dinar: добавить свой открытый ssh-ключ в authorized_keys2 пользователя ddev
+ #. на своей машине: отредактировать /etc/sudoers (команда ``visudo``) примерно следующим образом::
+-- 
+1.5.6.5
-- 
tg: (2292ebd..) t/mailinfo-multiline-subject (depends on: tmp)

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header
  2008-12-26 18:38 [PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header Kirill Smelkov
@ 2009-01-07 22:43 ` Kirill Smelkov
  2009-01-08  8:13   ` Junio C Hamano
  2009-01-08 10:08 ` [PATCH " Alexander Potashev
  1 sibling, 1 reply; 12+ messages in thread
From: Kirill Smelkov @ 2009-01-07 22:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Fri, Dec 26, 2008 at 09:38:41PM +0300, Kirill Smelkov wrote:
> When native language (RU) is in use, subject header usually contains several
> parts, e.g.
> 
> Subject: [Navy-patches] [PATCH]
> 	=?utf-8?b?0JjQt9C80LXQvdGR0L0g0YHQv9C40YHQvtC6INC/0LA=?=
> 	=?utf-8?b?0LrQtdGC0L7QsiDQvdC10L7QsdGF0L7QtNC40LzRi9GFINC00LvRjyA=?=
> 	=?utf-8?b?0YHQsdC+0YDQutC4?=

Which btw should be extracted by git-mailinfo to:

    'Subject: Изменён список пакетов необходимых для сборки'

> This exposes several bugs in builtin-mailinfo.c that I try to fix:
> 
> 
> 1. decode_b_segment: do not append explicit NUL -- explicit NUL was preventing
>    correct header construction on parts concatenation via strbuf_addbuf in
>    decode_header_bq. Fixes:
> 
> -Subject: Изменён список пакетов необходимых для сборки
> +Subject: Изменён список па
> 
> 
> Then
> 
> 2. (hackish) do not emit '\n' after processing of every header segment. It
>    seems we should emit previous part as-is only if it does not end with
>    '=?='. Fixes:
> 
> -Subject: Изменён список пакетов необходимых для сборки
> +Subject: Изменён список па кетов необходимых для сборки
> 
> 
> Sorry for low-quality patch and description. I did what I could and don't have
> energy and time dig more into MIME.
> 
> Please help.
> 
> Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
> 
> ---
>  builtin-mailinfo.c  |   18 ++++++++++++++++-
>  t/t5100-mailinfo.sh |    2 +-
>  t/t5100/info0012    |    5 ++++
>  t/t5100/msg0012     |    7 ++++++
>  t/t5100/patch0012   |   30 +++++++++++++++++++++++++++++
>  t/t5100/sample.mbox |   52 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 112 insertions(+), 2 deletions(-)

Junio, All,

What about this patch?

It at least exposes bug in git-mailinfo wrt handling of multiline
subjects, and in very details documents it and adds a test for it.


Yes, my fixes are of 'low quality', but may I try to attract git
community attention one more time?


Thanks beforehand,
Kirill


P.S. original post with patch:

http://marc.info/?l=git&m=123031899307286&w=2

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header
  2009-01-07 22:43 ` [BUG PATCH " Kirill Smelkov
@ 2009-01-08  8:13   ` Junio C Hamano
  2009-01-08  8:35     ` Junio C Hamano
  0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2009-01-08  8:13 UTC (permalink / raw)
  To: Kirill Smelkov; +Cc: git

Kirill Smelkov <kirr@landau.phys.spbu.ru> writes:

> On Fri, Dec 26, 2008 at 09:38:41PM +0300, Kirill Smelkov wrote:
>> When native language (RU) is in use, subject header usually contains several
>> parts, e.g.
> ...
> Junio, All,
>
> What about this patch?

What's most interesting is that I do not recall seeing this patch before.
Neither gmane (which is my back-up interface to the mailing list) nor my
mailbox seems to have a copy, and from the look of quoted parts (namely,
some Russian strings in the message), it is not implausible that my spam
filter (either on my receiving end or at the ISP) may have eaten it.

> It at least exposes bug in git-mailinfo wrt handling of multiline
> subjects, and in very details documents it and adds a test for it.
>
> ..., but may I try to attract git
> community attention one more time?

It is very appreciated.

> P.S. original post with patch:
>
> http://marc.info/?l=git&m=123031899307286&w=2

I have not had chance to look at your patch at marc yet, but from the look
of your problem description, I presume you could trigger this with any
utf-8 b-encoded loooooong subject line?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header
  2009-01-08  8:13   ` Junio C Hamano
@ 2009-01-08  8:35     ` Junio C Hamano
  2009-01-08 23:11       ` Kirill Smelkov
  0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2009-01-08  8:35 UTC (permalink / raw)
  To: Kirill Smelkov; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> Kirill Smelkov <kirr@landau.phys.spbu.ru> writes:
> ...
>> http://marc.info/?l=git&m=123031899307286&w=2
>
> I have not had chance to look at your patch at marc yet, but from the look
> of your problem description, I presume you could trigger this with any
> utf-8 b-encoded loooooong subject line?

Ok, I took a look at it after downloading from the marc archive.

> diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
> index e890f7a..d138bc3 100644
> --- a/builtin-mailinfo.c
> +++ b/builtin-mailinfo.c
> @@ -436,6 +436,14 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
>  			 * for now we just trust the data.
>  			 */
>  			c = 0;
> +
> +			/* XXX: the following is needed not to output NUL in
> +			 * the resulting string
> +			 *
> +			 * This seems to be ok, but I'm not 100% sure -- that's
> +			 * why this is an RFC.
> +			 */
> +			continue;
>  		}
>  		else
>  			continue; /* garbage */

B encoding (RFC 2045) encodes an octet stream into a sequence of groups of
4 letters from 64-char alphabet, each of which encodes 6-bit, plus zero or
more padding char '=' to make the result multiple of 4.

 * If the length of the payload is a multiple of 3 octets, there is no
   special handling.  Padding char '=' is not produced;

 * If it is a multiple of 3 octets plus one, the remaining one octet is
   encoded with two letters, and two more padding char '=' is added;

 * If it is a multiple of 3 octets plus two, the remaining two octets are
   encoded with three letters, and one padding char '=' is added.

Hence, a "correct" implementation should decode the input as if '=' were
the same as 'A' (which encodes 6 bits of 0) til the end, making sure that
the padding char '=' appears only at the end of the input, that no char
outside the Base64 encoding alphabet appears in the input, and that the
length of the entire encoded string is multiple of 4.  Finally it would
discard either one or two octets (depending on the number of padding chars
it saw) from the end of the output.

Our decode_b_segment() however emits each octet as it completes, without
waiting for the 24-bit group that contains it to complete.  When decoding
a correctly encoded input, by the time we see a padding '=', all the real
payload octets are complete and we would not have any real information
still kept in the variable "acc" (accumulator), so ignoring '=' (you do
not even need to assign c = 0) like your patch did would work just fine.
An alternative would be to count the number of padding at the end and drop
the NULs from the output as necessary after the loop but that does not add
any value to the current code.

Ideally we should validate the encoded string a bit more carefully (see
the "correct" implementation about), and warn if a malformed input is
found (but probably not reject outright).  But as a low-impact fix for the
maintenance branches, I think your fix is very good.

	Side note: I suspect that the existing code was Ok before strbuf
	conversion as we assumed NUL terminated output buffer.

> @@ -513,7 +521,15 @@ static int decode_header_bq(struct strbuf *it)
>  		strbuf_reset(&piecebuf);
>  		rfc2047 = 1;
>  
> -		if (in != ep) {
> +		/* XXX: the follwoing is needed not to output '\n' on every
> +		 * multi-line segment in Subject.
> +		 *
> +		 * I suspect this is not 100% correct, but I'm not a MIME guy
> +		 * -- that's why this is an RFC.
> +		 */
> +
> +		/* if in does not end with '=?=', we emit it as is */
> +		if (in <= (ep-2) && !(ep[-1]=='\n' && ep[-2]=='=')) {
>  			strbuf_add(&outbuf, in, ep - in);
>  			in = ep;
> 
>  		}

I am not a MIME guy either (and mailinfo has a big comment that says we do
not really do MIME --- we just pretend to do), but let me give it a try.

RFC2046 specifies that an encoded-word ("=?charset?encoding?...?=") may
not be more than 75 characters long, and multiple encoded-words, separated
by CRLF SPACE can be used to encode more text if needed.

It further specifies that an encoded-word can appear next to ordinary text
or another encoded-word but it must be separated by linear white space,
and says that such linear white space is to be ignored when displaying.

Which means that we should be eating the CRLF SPACE we see if we have seen
an encoded-word immediately before and we are about to process another
encoded-word.

Based on the above discussion, here is what I came up with.  It passes
your test, but I ran out of energy to try breaking it seriously in any
other way than just running the existing test suite.  

We might want to steal some test cases from the "8. Examples" section of
RFC2047 and add them to t5100.

Thanks.

 builtin-mailinfo.c |   27 +++++++++++++++++++--------
 1 files changed, 19 insertions(+), 8 deletions(-)

diff --git c/builtin-mailinfo.c w/builtin-mailinfo.c
index e890f7a..fcb32c9 100644
--- c/builtin-mailinfo.c
+++ w/builtin-mailinfo.c
@@ -430,13 +430,6 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
 			c -= 'a' - 26;
 		else if ('0' <= c && c <= '9')
 			c -= '0' - 52;
-		else if (c == '=') {
-			/* padding is almost like (c == 0), except we do
-			 * not output NUL resulting only from it;
-			 * for now we just trust the data.
-			 */
-			c = 0;
-		}
 		else
 			continue; /* garbage */
 		switch (pos++) {
@@ -514,7 +507,25 @@ static int decode_header_bq(struct strbuf *it)
 		rfc2047 = 1;

 		if (in != ep) {
-			strbuf_add(&outbuf, in, ep - in);
+			/*
+			 * We are about to process an encoded-word
+			 * that begins at ep, but there is something
+			 * before the encoded word.
+			 */
+			char *scan;
+			for (scan = in; scan < ep; scan++)
+				if (!isspace(*scan))
+					break;
+
+			if (scan != ep || in == it->buf) {
+				/*
+				 * We should not lose that "something",
+				 * unless we have just processed an
+				 * encoded-word, and there is only LWS
+				 * before the one we are about to process.
+				 */
+				strbuf_add(&outbuf, in, ep - in);
+			}
 			in = ep;
 		}
 		/* E.g.

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header
  2009-01-08  8:35     ` Junio C Hamano
@ 2009-01-08 23:11       ` Kirill Smelkov
  2009-01-10 10:12         ` Kirill Smelkov
  2009-01-11  1:54         ` Junio C Hamano
  0 siblings, 2 replies; 12+ messages in thread
From: Kirill Smelkov @ 2009-01-08 23:11 UTC (permalink / raw)
  To: Junio C Hamano, Alexander Potashev; +Cc: git

On Thu, Jan 08, 2009 at 12:13:42AM -0800, Junio C Hamano wrote:
> Kirill Smelkov <kirr@landau.phys.spbu.ru> writes:
> 
> > On Fri, Dec 26, 2008 at 09:38:41PM +0300, Kirill Smelkov wrote:
> >> When native language (RU) is in use, subject header usually contains several
> >> parts, e.g.
> > ...
> > Junio, All,
> >
> > What about this patch?
> 
> What's most interesting is that I do not recall seeing this patch before.
> Neither gmane (which is my back-up interface to the mailing list) nor my
> mailbox seems to have a copy, and from the look of quoted parts (namely,
> some Russian strings in the message), it is not implausible that my spam
> filter (either on my receiving end or at the ISP) may have eaten it.
> 
> > It at least exposes bug in git-mailinfo wrt handling of multiline
> > subjects, and in very details documents it and adds a test for it.
> >
> > ..., but may I try to attract git
> > community attention one more time?
> 
> It is very appreciated.

Thanks!


On Thu, Jan 08, 2009 at 12:35:52AM -0800, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
> > Kirill Smelkov <kirr@landau.phys.spbu.ru> writes:
> > ...
> >> http://marc.info/?l=git&m=123031899307286&w=2
> >
> > I have not had chance to look at your patch at marc yet, but from the look
> > of your problem description, I presume you could trigger this with any
> > utf-8 b-encoded loooooong subject line?
> 
> Ok, I took a look at it after downloading from the marc archive.
> 
> > diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
> > index e890f7a..d138bc3 100644
> > --- a/builtin-mailinfo.c
> > +++ b/builtin-mailinfo.c
> > @@ -436,6 +436,14 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
> >  			 * for now we just trust the data.
> >  			 */
> >  			c = 0;
> > +
> > +			/* XXX: the following is needed not to output NUL in
> > +			 * the resulting string
> > +			 *
> > +			 * This seems to be ok, but I'm not 100% sure -- that's
> > +			 * why this is an RFC.
> > +			 */
> > +			continue;
> >  		}
> >  		else
> >  			continue; /* garbage */
> 
> B encoding (RFC 2045) encodes an octet stream into a sequence of groups of
> 4 letters from 64-char alphabet, each of which encodes 6-bit, plus zero or
> more padding char '=' to make the result multiple of 4.
> 
>  * If the length of the payload is a multiple of 3 octets, there is no
>    special handling.  Padding char '=' is not produced;
> 
>  * If it is a multiple of 3 octets plus one, the remaining one octet is
>    encoded with two letters, and two more padding char '=' is added;
> 
>  * If it is a multiple of 3 octets plus two, the remaining two octets are
>    encoded with three letters, and one padding char '=' is added.
> 
> Hence, a "correct" implementation should decode the input as if '=' were
> the same as 'A' (which encodes 6 bits of 0) til the end, making sure that
> the padding char '=' appears only at the end of the input, that no char
> outside the Base64 encoding alphabet appears in the input, and that the
> length of the entire encoded string is multiple of 4.  Finally it would
> discard either one or two octets (depending on the number of padding chars
> it saw) from the end of the output.
> 
> Our decode_b_segment() however emits each octet as it completes, without
> waiting for the 24-bit group that contains it to complete.  When decoding
> a correctly encoded input, by the time we see a padding '=', all the real
> payload octets are complete and we would not have any real information
> still kept in the variable "acc" (accumulator), so ignoring '=' (you do
> not even need to assign c = 0) like your patch did would work just fine.
> An alternative would be to count the number of padding at the end and drop
> the NULs from the output as necessary after the loop but that does not add
> any value to the current code.
> 
> Ideally we should validate the encoded string a bit more carefully (see
> the "correct" implementation about), and warn if a malformed input is
> found (but probably not reject outright).  But as a low-impact fix for the
> maintenance branches, I think your fix is very good.
> 
> 	Side note: I suspect that the existing code was Ok before strbuf
> 	conversion as we assumed NUL terminated output buffer.

Junio, thanks for the explanation.

I've updated the patch and included your analysis into description.

> > @@ -513,7 +521,15 @@ static int decode_header_bq(struct strbuf *it)
> >  		strbuf_reset(&piecebuf);
> >  		rfc2047 = 1;
> >  
> > -		if (in != ep) {
> > +		/* XXX: the follwoing is needed not to output '\n' on every
> > +		 * multi-line segment in Subject.
> > +		 *
> > +		 * I suspect this is not 100% correct, but I'm not a MIME guy
> > +		 * -- that's why this is an RFC.
> > +		 */
> > +
> > +		/* if in does not end with '=?=', we emit it as is */
> > +		if (in <= (ep-2) && !(ep[-1]=='\n' && ep[-2]=='=')) {
> >  			strbuf_add(&outbuf, in, ep - in);
> >  			in = ep;
> > 
> >  		}
> 
> I am not a MIME guy either (and mailinfo has a big comment that says we do
> not really do MIME --- we just pretend to do), but let me give it a try.
> 
> RFC2046 specifies that an encoded-word ("=?charset?encoding?...?=") may
> not be more than 75 characters long, and multiple encoded-words, separated
> by CRLF SPACE can be used to encode more text if needed.
> 
> It further specifies that an encoded-word can appear next to ordinary text
> or another encoded-word but it must be separated by linear white space,
> and says that such linear white space is to be ignored when displaying.
> 
> Which means that we should be eating the CRLF SPACE we see if we have seen
> an encoded-word immediately before and we are about to process another
> encoded-word.
> 
> Based on the above discussion, here is what I came up with.  It passes
> your test, but I ran out of energy to try breaking it seriously in any
> other way than just running the existing test suite.  

Thanks again very much!

I was once maintaining software, and I think I understand what you mean
by saying 'ran out of energy', so I'll try to do my best to help improve
this patch and to get it merged.

> We might want to steal some test cases from the "8. Examples" section of
> RFC2047 and add them to t5100.

Good idea. I took all the examples and incorporated them into our
testsuite.

> 
> Thanks.
> 
>  builtin-mailinfo.c |   27 +++++++++++++++++++--------
>  1 files changed, 19 insertions(+), 8 deletions(-)
> 
> diff --git c/builtin-mailinfo.c w/builtin-mailinfo.c
> index e890f7a..fcb32c9 100644
> --- c/builtin-mailinfo.c
> +++ w/builtin-mailinfo.c
> @@ -430,13 +430,6 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
>  			c -= 'a' - 26;
>  		else if ('0' <= c && c <= '9')
>  			c -= '0' - 52;
> -		else if (c == '=') {
> -			/* padding is almost like (c == 0), except we do
> -			 * not output NUL resulting only from it;
> -			 * for now we just trust the data.
> -			 */
> -			c = 0;
> -		}
>  		else
>  			continue; /* garbage */
>  		switch (pos++) {
> @@ -514,7 +507,25 @@ static int decode_header_bq(struct strbuf *it)
>  		rfc2047 = 1;
>  
>  		if (in != ep) {
> -			strbuf_add(&outbuf, in, ep - in);
> +			/*
> +			 * We are about to process an encoded-word
> +			 * that begins at ep, but there is something
> +			 * before the encoded word.
> +			 */
> +			char *scan;
> +			for (scan = in; scan < ep; scan++)
> +				if (!isspace(*scan))
> +					break;
> +
> +			if (scan != ep || in == it->buf) {
> +				/*
> +				 * We should not lose that "something",
> +				 * unless we have just processed an
> +				 * encoded-word, and there is only LWS
> +				 * before the one we are about to process.
> +				 */
> +				strbuf_add(&outbuf, in, ep - in);
> +			}
>  			in = ep;
>  		}
>  		/* E.g.

Based on the above description the code looks good now. I've
incorporated it into the patch and added tests from RFC2047 (see patch
below).

On Thu, Jan 08, 2009 at 01:08:13PM +0300, Alexander Potashev wrote:
> On 21:38 Fri 26 Dec     , Kirill Smelkov wrote:
> > When native language (RU) is in use, subject header usually contains several
> > parts, e.g.
> > 
> > Subject: [Navy-patches] [PATCH]
> > 	=?utf-8?b?0JjQt9C80LXQvdGR0L0g0YHQv9C40YHQvtC6INC/0LA=?=
> > 	=?utf-8?b?0LrQtdGC0L7QsiDQvdC10L7QsdGF0L7QtNC40LzRi9GFINC00LvRjyA=?=
> > 	=?utf-8?b?0YHQsdC+0YDQutC4?=
> > 
> 
> >  t/t5100/info0012    |    5 ++++
> >  t/t5100/msg0012     |    7 ++++++
> >  t/t5100/patch0012   |   30 +++++++++++++++++++++++++++++
> >  t/t5100/sample.mbox |   52 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  6 files changed, 112 insertions(+), 2 deletions(-)
> 
> The testcases are too long, a minimal mbox with encoded "Subject:" would
> be enough to test the mailinfo parser, it's all the you need to test
> here.

Thanks Alexander for pointing this out.

I've based my testcase on already-in-there tests, which e.g. for
t/t5100/{info,msg,patch}00{04,05,09,10,11} are of approximately the same
size and are based on real mails.

Is this ok?


As to new RFC2047-examples based tests, I've tried to keep them to the
bare minimum.


Changes since v1:

 o incorporated Junio's description and code about padding
 o incorporated Junio's description and code about LWS between encoded
   words
 o incorporated tests from RFC2047 examples  (one testresult is unclear
   -- see patch description)


From: Kirill Smelkov <kirr@landau.phys.spbu.ru>
Subject: mailinfo: correctly handle multiline 'Subject:' header

When native language (RU) is in use, subject header usually contains several
parts, e.g.

Subject: [Navy-patches] [PATCH]
	=?utf-8?b?0JjQt9C80LXQvdGR0L0g0YHQv9C40YHQvtC6INC/0LA=?=
	=?utf-8?b?0LrQtdGC0L7QsiDQvdC10L7QsdGF0L7QtNC40LzRi9GFINC00LvRjyA=?=
	=?utf-8?b?0YHQsdC+0YDQutC4?=

( which btw should be extracted by git-mailinfo to:

    'Subject: Изменён список пакетов необходимых для сборки' )

This exposes several bugs in builtin-mailinfo.c which we try to fix:

1. decode_b_segment: do not append explicit NUL -- explicit NUL was preventing
   correct header construction on parts concatenation via strbuf_addbuf in
   decode_header_bq. Fixes:

-Subject: Изменён список пакетов необходимых для сборки
+Subject: Изменён список па

Junio:

> B encoding (RFC 2045) encodes an octet stream into a sequence of groups of
> 4 letters from 64-char alphabet, each of which encodes 6-bit, plus zero or
> more padding char '=' to make the result multiple of 4.
>
>  * If the length of the payload is a multiple of 3 octets, there is no
>    special handling.  Padding char '=' is not produced;
>
>  * If it is a multiple of 3 octets plus one, the remaining one octet is
>    encoded with two letters, and two more padding char '=' is added;
>
>  * If it is a multiple of 3 octets plus two, the remaining two octets are
>    encoded with three letters, and one padding char '=' is added.
>
> Hence, a "correct" implementation should decode the input as if '=' were
> the same as 'A' (which encodes 6 bits of 0) til the end, making sure that
> the padding char '=' appears only at the end of the input, that no char
> outside the Base64 encoding alphabet appears in the input, and that the
> length of the entire encoded string is multiple of 4.  Finally it would
> discard either one or two octets (depending on the number of padding chars
> it saw) from the end of the output.
>
> Our decode_b_segment() however emits each octet as it completes, without
> waiting for the 24-bit group that contains it to complete.  When decoding
> a correctly encoded input, by the time we see a padding '=', all the real
> payload octets are complete and we would not have any real information
> still kept in the variable "acc" (accumulator), so ignoring '=' (you do
> not even need to assign c = 0) like your patch did would work just fine.
> An alternative would be to count the number of padding at the end and drop
> the NULs from the output as necessary after the loop but that does not add
> any value to the current code.
>
> Ideally we should validate the encoded string a bit more carefully (see
> the "correct" implementation about), and warn if a malformed input is
> found (but probably not reject outright).  But as a low-impact fix for the
> maintenance branches, I think your fix is very good.
>
> 	Side note: I suspect that the existing code was Ok before strbuf
> 	conversion as we assumed NUL terminated output buffer.


Then

2. whitespaces between encoded words should be removed

-Subject: Изменён список пакетов необходимых для сборки
+Subject: Изменён список па кетов необходимых для сборки

Junio:

> I am not a MIME guy either (and mailinfo has a big comment that says we do
> not really do MIME --- we just pretend to do), but let me give it a try.
>
> RFC2046 specifies that an encoded-word ("=?charset?encoding?...?=") may
> not be more than 75 characters long, and multiple encoded-words, separated
> by CRLF SPACE can be used to encode more text if needed.
>
> It further specifies that an encoded-word can appear next to ordinary text
> or another encoded-word but it must be separated by linear white space,
> and says that such linear white space is to be ignored when displaying.
>
> Which means that we should be eating the CRLF SPACE we see if we have seen
> an encoded-word immediately before and we are about to process another
> encoded-word.

Also as suggested by Junio, in order to try to catch other MIME problems test
cases from the "8. Examples" section of RFC2047 are added to t5100 testsuite as
well.

    [but I'm not sure whether testresult with Nathaniel Borenstein
     (םולש ןב ילטפנ) is correct -- see rfc2047-info-0004]

Big-thanks-to: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Kirill Smelkov <kirr@landau.phys.spbu.ru>

---
 builtin-mailinfo.c           |   27 +++++++++++++++------
 t/t5100-mailinfo.sh          |   24 ++++++++++++++++++-
 t/t5100/info0012             |    5 ++++
 t/t5100/msg0012              |    7 +++++
 t/t5100/patch0012            |   30 ++++++++++++++++++++++++
 t/t5100/rfc2047-info-0001    |    4 +++
 t/t5100/rfc2047-info-0002    |    4 +++
 t/t5100/rfc2047-info-0003    |    4 +++
 t/t5100/rfc2047-info-0004    |    5 ++++
 t/t5100/rfc2047-info-0005    |    2 +
 t/t5100/rfc2047-info-0006    |    2 +
 t/t5100/rfc2047-info-0007    |    2 +
 t/t5100/rfc2047-info-0008    |    2 +
 t/t5100/rfc2047-info-0009    |    2 +
 t/t5100/rfc2047-info-0010    |    2 +
 t/t5100/rfc2047-info-0011    |    2 +
 t/t5100/rfc2047-samples.mbox |   48 ++++++++++++++++++++++++++++++++++++++
 t/t5100/sample.mbox          |   52 ++++++++++++++++++++++++++++++++++++++++++
 18 files changed, 215 insertions(+), 9 deletions(-)

diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
index f7c8c08..77a7121 100644
--- a/builtin-mailinfo.c
+++ b/builtin-mailinfo.c
@@ -430,13 +430,6 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
 			c -= 'a' - 26;
 		else if ('0' <= c && c <= '9')
 			c -= '0' - 52;
-		else if (c == '=') {
-			/* padding is almost like (c == 0), except we do
-			 * not output NUL resulting only from it;
-			 * for now we just trust the data.
-			 */
-			c = 0;
-		}
 		else
 			continue; /* garbage */
 		switch (pos++) {
@@ -514,7 +507,25 @@ static int decode_header_bq(struct strbuf *it)
 		rfc2047 = 1;
 
 		if (in != ep) {
-			strbuf_add(&outbuf, in, ep - in);
+			/*
+			 * We are about to process an encoded-word
+			 * that begins at ep, but there is something
+			 * before the encoded word.
+			 */
+			char *scan;
+			for (scan = in; scan < ep; scan++)
+				if (!isspace(*scan))
+					break;
+
+			if (scan != ep || in == it->buf) {
+				/*
+				 * We should not lose that "something",
+				 * unless we have just processed an
+				 * encoded-word, and there is only LWS
+				 * before the one we are about to process.
+				 */
+				strbuf_add(&outbuf, in, ep - in);
+			}
 			in = ep;
 		}
 		/* E.g.
diff --git a/t/t5100-mailinfo.sh b/t/t5100-mailinfo.sh
index fe14589..625c204 100755
--- a/t/t5100-mailinfo.sh
+++ b/t/t5100-mailinfo.sh
@@ -11,7 +11,7 @@ test_expect_success 'split sample box' \
 	'git mailsplit -o. "$TEST_DIRECTORY"/t5100/sample.mbox >last &&
 	last=`cat last` &&
 	echo total is $last &&
-	test `cat last` = 11'
+	test `cat last` = 12'
 
 for mail in `echo 00*`
 do
@@ -26,6 +26,28 @@ do
 	'
 done
 
+
+test_expect_success 'split box with rfc2047 samples' \
+	'mkdir rfc2047 &&
+	git mailsplit -orfc2047 "$TEST_DIRECTORY"/t5100/rfc2047-samples.mbox \
+	  >rfc2047/last &&
+	last=`cat rfc2047/last` &&
+	echo total is $last &&
+	test `cat rfc2047/last` = 11'
+
+for mail in `echo rfc2047/00*`
+do
+	test_expect_success "mailinfo $mail" '
+		git mailinfo -u $mail-msg $mail-patch <$mail >$mail-info &&
+		echo msg &&
+		test_cmp "$TEST_DIRECTORY"/t5100/empty $mail-msg &&
+		echo patch &&
+		test_cmp "$TEST_DIRECTORY"/t5100/empty $mail-patch &&
+		echo info &&
+		test_cmp "$TEST_DIRECTORY"/t5100/rfc2047-info-$(basename $mail) $mail-info
+	'
+done
+
 test_expect_success 'respect NULs' '
 
 	git mailsplit -d3 -o. "$TEST_DIRECTORY"/t5100/nul-plain &&
diff --git a/t/t5100/empty b/t/t5100/empty
new file mode 100644
index 0000000..e69de29
diff --git a/t/t5100/info0012 b/t/t5100/info0012
new file mode 100644
index 0000000..ac1216f
--- /dev/null
+++ b/t/t5100/info0012
@@ -0,0 +1,5 @@
+Author: Dmitriy Blinov
+Email: bda@mnsspb.ru
+Subject: Изменён список пакетов необходимых для сборки
+Date: Wed, 12 Nov 2008 17:54:41 +0300
+
diff --git a/t/t5100/msg0012 b/t/t5100/msg0012
new file mode 100644
index 0000000..1dc2bf7
--- /dev/null
+++ b/t/t5100/msg0012
@@ -0,0 +1,7 @@
+textlive-* исправлены на texlive-*
+docutils заменён на python-docutils
+
+Действительно, оказалось, что rest2web вытягивает за собой
+python-docutils. В то время как сам rest2web не нужен.
+
+Signed-off-by: Dmitriy Blinov <bda@mnsspb.ru>
diff --git a/t/t5100/patch0012 b/t/t5100/patch0012
new file mode 100644
index 0000000..36a0b68
--- /dev/null
+++ b/t/t5100/patch0012
@@ -0,0 +1,30 @@
+---
+ howto/build_navy.txt |    6 +++---
+ 1 files changed, 3 insertions(+), 3 deletions(-)
+
+diff --git a/howto/build_navy.txt b/howto/build_navy.txt
+index 3fd3afb..0ee807e 100644
+--- a/howto/build_navy.txt
++++ b/howto/build_navy.txt
+@@ -119,8 +119,8 @@
+    - libxv-dev
+    - libusplash-dev
+    - latex-make
+-   - textlive-lang-cyrillic
+-   - textlive-latex-extra
++   - texlive-lang-cyrillic
++   - texlive-latex-extra
+    - dia
+    - python-pyrex
+    - libtool
+@@ -128,7 +128,7 @@
+    - sox
+    - cython
+    - imagemagick
+-   - docutils
++   - python-docutils
+ 
+ #. на машине dinar: добавить свой открытый ssh-ключ в authorized_keys2 пользователя ddev
+ #. на своей машине: отредактировать /etc/sudoers (команда ``visudo``) примерно следующим образом::
+-- 
+1.5.6.5
diff --git a/t/t5100/rfc2047-info-0001 b/t/t5100/rfc2047-info-0001
new file mode 100644
index 0000000..0a383b0
--- /dev/null
+++ b/t/t5100/rfc2047-info-0001
@@ -0,0 +1,4 @@
+Author: Keith Moore
+Email: moore@cs.utk.edu
+Subject: If you can read this you understand the example.
+
diff --git a/t/t5100/rfc2047-info-0002 b/t/t5100/rfc2047-info-0002
new file mode 100644
index 0000000..881be75
--- /dev/null
+++ b/t/t5100/rfc2047-info-0002
@@ -0,0 +1,4 @@
+Author: Olle Järnefors
+Email: ojarnef@admin.kth.se
+Subject: Time for ISO 10646?
+
diff --git a/t/t5100/rfc2047-info-0003 b/t/t5100/rfc2047-info-0003
new file mode 100644
index 0000000..d0f7891
--- /dev/null
+++ b/t/t5100/rfc2047-info-0003
@@ -0,0 +1,4 @@
+Author: Patrik Fältström
+Email: paf@nada.kth.se
+Subject: RFC-HDR care and feeding
+
diff --git a/t/t5100/rfc2047-info-0004 b/t/t5100/rfc2047-info-0004
new file mode 100644
index 0000000..850f831
--- /dev/null
+++ b/t/t5100/rfc2047-info-0004
@@ -0,0 +1,5 @@
+Author: Nathaniel Borenstein  
+     (םולש ןב ילטפנ)
+Email: nsb@thumper.bellcore.com
+Subject: Test of new header generator
+
diff --git a/t/t5100/rfc2047-info-0005 b/t/t5100/rfc2047-info-0005
new file mode 100644
index 0000000..c27be3b
--- /dev/null
+++ b/t/t5100/rfc2047-info-0005
@@ -0,0 +1,2 @@
+Subject: (a)
+
diff --git a/t/t5100/rfc2047-info-0006 b/t/t5100/rfc2047-info-0006
new file mode 100644
index 0000000..9dad474
--- /dev/null
+++ b/t/t5100/rfc2047-info-0006
@@ -0,0 +1,2 @@
+Subject: (a b)
+
diff --git a/t/t5100/rfc2047-info-0007 b/t/t5100/rfc2047-info-0007
new file mode 100644
index 0000000..294f195
--- /dev/null
+++ b/t/t5100/rfc2047-info-0007
@@ -0,0 +1,2 @@
+Subject: (ab)
+
diff --git a/t/t5100/rfc2047-info-0008 b/t/t5100/rfc2047-info-0008
new file mode 100644
index 0000000..294f195
--- /dev/null
+++ b/t/t5100/rfc2047-info-0008
@@ -0,0 +1,2 @@
+Subject: (ab)
+
diff --git a/t/t5100/rfc2047-info-0009 b/t/t5100/rfc2047-info-0009
new file mode 100644
index 0000000..294f195
--- /dev/null
+++ b/t/t5100/rfc2047-info-0009
@@ -0,0 +1,2 @@
+Subject: (ab)
+
diff --git a/t/t5100/rfc2047-info-0010 b/t/t5100/rfc2047-info-0010
new file mode 100644
index 0000000..9dad474
--- /dev/null
+++ b/t/t5100/rfc2047-info-0010
@@ -0,0 +1,2 @@
+Subject: (a b)
+
diff --git a/t/t5100/rfc2047-info-0011 b/t/t5100/rfc2047-info-0011
new file mode 100644
index 0000000..9dad474
--- /dev/null
+++ b/t/t5100/rfc2047-info-0011
@@ -0,0 +1,2 @@
+Subject: (a b)
+
diff --git a/t/t5100/rfc2047-samples.mbox b/t/t5100/rfc2047-samples.mbox
new file mode 100644
index 0000000..3ca2470
--- /dev/null
+++ b/t/t5100/rfc2047-samples.mbox
@@ -0,0 +1,48 @@
+From nobody Mon Sep 17 00:00:00 2001
+From: =?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu>
+To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>
+CC: =?ISO-8859-1?Q?Andr=E9?= Pirard <PIRARD@vm1.ulg.ac.be>
+Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
+ =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
+
+From nobody Mon Sep 17 00:00:00 2001
+From: =?ISO-8859-1?Q?Olle_J=E4rnefors?= <ojarnef@admin.kth.se>
+To: ietf-822@dimacs.rutgers.edu, ojarnef@admin.kth.se
+Subject: Time for ISO 10646?
+
+From nobody Mon Sep 17 00:00:00 2001
+To: Dave Crocker <dcrocker@mordor.stanford.edu>
+Cc: ietf-822@dimacs.rutgers.edu, paf@comsol.se
+From: =?ISO-8859-1?Q?Patrik_F=E4ltstr=F6m?= <paf@nada.kth.se>
+Subject: Re: RFC-HDR care and feeding
+
+From nobody Mon Sep 17 00:00:00 2001
+From: Nathaniel Borenstein <nsb@thumper.bellcore.com>
+      (=?iso-8859-8?b?7eXs+SDv4SDp7Oj08A==?=)
+To: Greg Vaudreuil <gvaudre@NRI.Reston.VA.US>, Ned Freed
+   <ned@innosoft.com>, Keith Moore <moore@cs.utk.edu>
+Subject: Test of new header generator
+MIME-Version: 1.0
+Content-type: text/plain; charset=ISO-8859-1
+
+From nobody Mon Sep 17 00:00:00 2001
+Subject: (=?ISO-8859-1?Q?a?=)
+
+From nobody Mon Sep 17 00:00:00 2001
+Subject: (=?ISO-8859-1?Q?a?= b)
+
+From nobody Mon Sep 17 00:00:00 2001
+Subject: (=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=)
+
+From nobody Mon Sep 17 00:00:00 2001
+Subject: (=?ISO-8859-1?Q?a?=  =?ISO-8859-1?Q?b?=)
+
+From nobody Mon Sep 17 00:00:00 2001
+Subject: (=?ISO-8859-1?Q?a?=
+    =?ISO-8859-1?Q?b?=)
+
+From nobody Mon Sep 17 00:00:00 2001
+Subject: (=?ISO-8859-1?Q?a_b?=)
+
+From nobody Mon Sep 17 00:00:00 2001
+Subject: (=?ISO-8859-1?Q?a?= =?ISO-8859-2?Q?_b?=)
diff --git a/t/t5100/sample.mbox b/t/t5100/sample.mbox
index 4bf7947..94da4da 100644
--- a/t/t5100/sample.mbox
+++ b/t/t5100/sample.mbox
@@ -501,3 +501,55 @@ index 3e5fe51..aabfe5c 100644
 
 --=-=-=--
 
+From bda@mnsspb.ru Wed Nov 12 17:54:41 2008
+From: Dmitriy Blinov <bda@mnsspb.ru>
+To: navy-patches@dinar.mns.mnsspb.ru
+Date: Wed, 12 Nov 2008 17:54:41 +0300
+Message-Id: <1226501681-24923-1-git-send-email-bda@mnsspb.ru>
+X-Mailer: git-send-email 1.5.6.5
+MIME-Version: 1.0
+Content-Type: text/plain;
+  charset=utf-8
+Content-Transfer-Encoding: 8bit
+Subject: [Navy-patches] [PATCH]
+	=?utf-8?b?0JjQt9C80LXQvdGR0L0g0YHQv9C40YHQvtC6INC/0LA=?=
+	=?utf-8?b?0LrQtdGC0L7QsiDQvdC10L7QsdGF0L7QtNC40LzRi9GFINC00LvRjyA=?=
+	=?utf-8?b?0YHQsdC+0YDQutC4?=
+
+textlive-* исправлены на texlive-*
+docutils заменён на python-docutils
+
+Действительно, оказалось, что rest2web вытягивает за собой
+python-docutils. В то время как сам rest2web не нужен.
+
+Signed-off-by: Dmitriy Blinov <bda@mnsspb.ru>
+---
+ howto/build_navy.txt |    6 +++---
+ 1 files changed, 3 insertions(+), 3 deletions(-)
+
+diff --git a/howto/build_navy.txt b/howto/build_navy.txt
+index 3fd3afb..0ee807e 100644
+--- a/howto/build_navy.txt
++++ b/howto/build_navy.txt
+@@ -119,8 +119,8 @@
+    - libxv-dev
+    - libusplash-dev
+    - latex-make
+-   - textlive-lang-cyrillic
+-   - textlive-latex-extra
++   - texlive-lang-cyrillic
++   - texlive-latex-extra
+    - dia
+    - python-pyrex
+    - libtool
+@@ -128,7 +128,7 @@
+    - sox
+    - cython
+    - imagemagick
+-   - docutils
++   - python-docutils
+ 
+ #. на машине dinar: добавить свой открытый ssh-ключ в authorized_keys2 пользователя ddev
+ #. на своей машине: отредактировать /etc/sudoers (команда ``visudo``) примерно следующим образом::
+-- 
+1.5.6.5
-- 
tg: (c123b7c..) t/mailinfo-multiline-subject (depends on: master)

Thanks,
Kirill

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header
  2009-01-08 23:11       ` Kirill Smelkov
@ 2009-01-10 10:12         ` Kirill Smelkov
  2009-01-11  1:54         ` Junio C Hamano
  1 sibling, 0 replies; 12+ messages in thread
From: Kirill Smelkov @ 2009-01-10 10:12 UTC (permalink / raw)
  To: Junio C Hamano, Alexander Potashev; +Cc: git

On Fri, Jan 09, 2009 at 02:11:35AM +0300, Kirill Smelkov wrote:
> Changes since v1:
> 
>  o incorporated Junio's description and code about padding
>  o incorporated Junio's description and code about LWS between encoded
>    words
>  o incorporated tests from RFC2047 examples  (one testresult is unclear
>    -- see patch description)
> 
> 
> From: Kirill Smelkov <kirr@landau.phys.spbu.ru>
> Subject: mailinfo: correctly handle multiline 'Subject:' header

[...]

Junio, All, just in case this again got spam-detected:

http://marc.info/?l=git&m=123145624611936&w=2


Thanks,
Kirill

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header
  2009-01-08 23:11       ` Kirill Smelkov
  2009-01-10 10:12         ` Kirill Smelkov
@ 2009-01-11  1:54         ` Junio C Hamano
  2009-01-12 22:34           ` Kirill Smelkov
  1 sibling, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2009-01-11  1:54 UTC (permalink / raw)
  To: Kirill Smelkov; +Cc: Alexander Potashev, git

Kirill Smelkov <kirr@landau.phys.spbu.ru> writes:

>     [but I'm not sure whether testresult with Nathaniel Borenstein
>      (םולש ןב ילטפנ) is correct -- see rfc2047-info-0004]
> ...
> diff --git a/t/t5100/rfc2047-info-0004 b/t/t5100/rfc2047-info-0004
> new file mode 100644
> index 0000000..850f831
> --- /dev/null
> +++ b/t/t5100/rfc2047-info-0004
> @@ -0,0 +1,5 @@
> +Author: Nathaniel Borenstein  
> +     (םולש ןב ילטפנ)
> +Email: nsb@thumper.bellcore.com
> +Subject: Test of new header generator
> +

That does look wrong.  If you can fix this, please do so; otherwise please
mark the test that deals with this entry with test_expect_failure, until
somebody else does.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header
  2009-01-11  1:54         ` Junio C Hamano
@ 2009-01-12 22:34           ` Kirill Smelkov
  2009-01-12 23:27             ` Junio C Hamano
  0 siblings, 1 reply; 12+ messages in thread
From: Kirill Smelkov @ 2009-01-12 22:34 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Alexander Potashev, git

On Sat, Jan 10, 2009 at 05:54:14PM -0800, Junio C Hamano wrote:
> Kirill Smelkov <kirr@landau.phys.spbu.ru> writes:
> 
> >     [but I'm not sure whether testresult with Nathaniel Borenstein
> >      (םולש ןב ילטפנ) is correct -- see rfc2047-info-0004]
> > ...
> > diff --git a/t/t5100/rfc2047-info-0004 b/t/t5100/rfc2047-info-0004
> > new file mode 100644
> > index 0000000..850f831
> > --- /dev/null
> > +++ b/t/t5100/rfc2047-info-0004
> > @@ -0,0 +1,5 @@
> > +Author: Nathaniel Borenstein  
> > +     ([somethig that could be detected as spam])
> > +Email: nsb@thumper.bellcore.com
> > +Subject: Test of new header generator
> > +
> 
> That does look wrong.  If you can fix this, please do so; otherwise please
> mark the test that deals with this entry with test_expect_failure, until
> somebody else does.

Yes, I think I've dealt with it -- we weren't unfolding 'From' header,
and we were not skipping comments in rfc822 headers, so:

From: Kirill Smelkov <kirr@landau.phys.spbu.ru>
Subject: [PATCH] mailinfo: 'From:' header should be unfold as well

At present we do headers unfolding (see RFC822 3.1.1. LONG HEADER FIELDS) for
all fields except 'From' (always) and 'Subject' (when keep_subject is set)

Not unfolding 'From' is a bug -- see above-mentioned RFC link.

Signed-off-by: Kirill Smelkov <kirr@landau.phys.spbu.ru>

---
 builtin-mailinfo.c  |    1 +
 t/t5100/sample.mbox |    5 ++++-
 2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
index f7c8c08..6d72c1b 100644
--- a/builtin-mailinfo.c
+++ b/builtin-mailinfo.c
@@ -860,6 +860,7 @@ static void handle_info(void)
 			}
 			output_header_lines(fout, "Subject", hdr);
 		} else if (!memcmp(header[i], "From", 4)) {
+			cleanup_space(hdr);
 			handle_from(hdr);
 			fprintf(fout, "Author: %s\n", name.buf);
 			fprintf(fout, "Email: %s\n", email.buf);
diff --git a/t/t5100/sample.mbox b/t/t5100/sample.mbox
index 4bf7947..d465685 100644
--- a/t/t5100/sample.mbox
+++ b/t/t5100/sample.mbox
@@ -2,7 +2,10 @@
 	
     
 From nobody Mon Sep 17 00:00:00 2001
-From: A U Thor <a.u.thor@example.com>
+From: A
+      U
+      Thor
+      <a.u.thor@example.com>
 Date: Fri, 9 Jun 2006 00:44:16 -0700
 Subject: [PATCH] a commit.
 
-- 
tg: (1562445..) t/mail-from-unfold (depends on: master)




From: Kirill Smelkov <kirr@landau.phys.spbu.ru>
Subject: [PATCH] mailinfo: more smarter removal of rfc822 comments from 'From'

As described in RFC822 (3.4.3 COMMENTS, and  A.1.4.), comments, as e.g.

    John (zzz) Doe <john.doe@xz> (Comment)

should "NOT [be] included in the destination mailbox"

We need this functionality to pass all RFC2047 based tests in the next commit.

Signed-off-by: Kirill Smelkov <kirr@landau.phys.spbu.ru>

---
 builtin-mailinfo.c  |   30 ++++++++++++++++++++++++++++++
 t/t5100/sample.mbox |    4 ++--
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
index 6d72c1b..c0b1ab4 100644
--- a/builtin-mailinfo.c
+++ b/builtin-mailinfo.c
@@ -29,6 +29,9 @@ static struct strbuf **p_hdr_data, **s_hdr_data;
 #define MAX_HDR_PARSED 10
 #define MAX_BOUNDARIES 5
 
+static void cleanup_space(struct strbuf *sb);
+
+
 static void get_sane_name(struct strbuf *out, struct strbuf *name, struct strbuf *email)
 {
 	struct strbuf *src = name;
@@ -120,6 +123,33 @@ static void handle_from(const struct strbuf *from)
 		strbuf_setlen(&f, f.len - 1);
 	}
 
+	/* This still could not be finished for emails like
+	 *
+	 *	"John (zzz) Doe <john.doe@xz> (Comment)"
+	 *
+	 * The email part had already been removed, so let's kill comments as
+	 * well -- RFC822 says comments should not be present in destination
+	 * mailbox (3.4.3. Comments  and  A.1.4.)
+	 */
+	while (1) {
+		char *ta;
+
+		at = strchr(f.buf, '(');
+		if (!at)
+			break;
+		ta = strchr(at, ')');
+		if (!ta)
+			break;
+
+		strbuf_remove(&f, at - f.buf, ta-at + (*ta ? 1 : 0));
+	}
+
+	/* and let's finally cleanup spaces that were around (possibly
+	 * internal) comments
+	 */
+	cleanup_space(&f);
+	strbuf_trim(&f);
+
 	get_sane_name(&name, &f, &email);
 	strbuf_release(&f);
 }
diff --git a/t/t5100/sample.mbox b/t/t5100/sample.mbox
index d465685..42e02f3 100644
--- a/t/t5100/sample.mbox
+++ b/t/t5100/sample.mbox
@@ -2,10 +2,10 @@
 	
     
 From nobody Mon Sep 17 00:00:00 2001
-From: A
+From: A (zzz)
       U
       Thor
-      <a.u.thor@example.com>
+      <a.u.thor@example.com> (Comment)
 Date: Fri, 9 Jun 2006 00:44:16 -0700
 Subject: [PATCH] a commit.
 
-- 
tg: (b798ad9..) t/mail-from-comments (depends on: t/mail-from-unfold)



All these patches + original one (trivially adapted) could be pulled from

    git://repo.or.cz/git/kirr.git  for-junio



Kirill Smelkov (3):
      mailinfo: 'From:' header should be unfold as well
      mailinfo: more smarter removal of rfc822 comments from 'From'
      mailinfo: correctly handle multiline 'Subject:' header


builtin-mailinfo.c           |   58 ++++++++++++++++++++++++++++++++++++------
t/t5100-mailinfo.sh          |   24 ++++++++++++++++-
t/t5100/info0012             |    5 +++
t/t5100/msg0012              |    7 +++++
t/t5100/patch0012            |   30 +++++++++++++++++++++
t/t5100/rfc2047-info-0001    |    4 +++
t/t5100/rfc2047-info-0002    |    4 +++
t/t5100/rfc2047-info-0003    |    4 +++
t/t5100/rfc2047-info-0004    |    4 +++
t/t5100/rfc2047-info-0005    |    2 +
t/t5100/rfc2047-info-0006    |    2 +
t/t5100/rfc2047-info-0007    |    2 +
t/t5100/rfc2047-info-0008    |    2 +
t/t5100/rfc2047-info-0009    |    2 +
t/t5100/rfc2047-info-0010    |    2 +
t/t5100/rfc2047-info-0011    |    2 +
t/t5100/rfc2047-samples.mbox |   48 ++++++++++++++++++++++++++++++++++
t/t5100/sample.mbox          |   57 ++++++++++++++++++++++++++++++++++++++++-
18 files changed, 249 insertions(+), 10 deletions(-)


Thanks,
Kirill

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header
  2009-01-12 22:34           ` Kirill Smelkov
@ 2009-01-12 23:27             ` Junio C Hamano
  2009-01-13  9:39               ` Kirill Smelkov
  0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2009-01-12 23:27 UTC (permalink / raw)
  To: Kirill Smelkov; +Cc: Alexander Potashev, git

Kirill Smelkov <kirr@landau.phys.spbu.ru> writes:

> diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
> index f7c8c08..6d72c1b 100644
> --- a/builtin-mailinfo.c
> +++ b/builtin-mailinfo.c
> @@ -860,6 +860,7 @@ static void handle_info(void)
>  			}
>  			output_header_lines(fout, "Subject", hdr);
>  		} else if (!memcmp(header[i], "From", 4)) {
> +			cleanup_space(hdr);
>  			handle_from(hdr);
>  			fprintf(fout, "Author: %s\n", name.buf);
>  			fprintf(fout, "Email: %s\n", email.buf);
> diff --git a/t/t5100/sample.mbox b/t/t5100/sample.mbox
> index 4bf7947..d465685 100644
> --- a/t/t5100/sample.mbox
> +++ b/t/t5100/sample.mbox
> @@ -2,7 +2,10 @@
>  	
>      
>  From nobody Mon Sep 17 00:00:00 2001
> -From: A U Thor <a.u.thor@example.com>
> +From: A
> +      U
> +      Thor
> +      <a.u.thor@example.com>
>  Date: Fri, 9 Jun 2006 00:44:16 -0700
>  Subject: [PATCH] a commit.

I think this is a reasonable change.

But doesn't this

>  From nobody Mon Sep 17 00:00:00 2001
> -From: A
> +From: A (zzz)
>        U
>        Thor
> -      <a.u.thor@example.com>
> +      <a.u.thor@example.com> (Comment)

regress for people who spell their names like this?

	From: john.doe@email.xz (John Doe)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header
  2009-01-12 23:27             ` Junio C Hamano
@ 2009-01-13  9:39               ` Kirill Smelkov
  2009-01-14  8:19                 ` Kirill Smelkov
  0 siblings, 1 reply; 12+ messages in thread
From: Kirill Smelkov @ 2009-01-13  9:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Alexander Potashev, git

On Mon, Jan 12, 2009 at 03:27:44PM -0800, Junio C Hamano wrote:
> Kirill Smelkov <kirr@landau.phys.spbu.ru> writes:
> 
> > diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
> > index f7c8c08..6d72c1b 100644
> > --- a/builtin-mailinfo.c
> > +++ b/builtin-mailinfo.c
> > @@ -860,6 +860,7 @@ static void handle_info(void)
> >  			}
> >  			output_header_lines(fout, "Subject", hdr);
> >  		} else if (!memcmp(header[i], "From", 4)) {
> > +			cleanup_space(hdr);
> >  			handle_from(hdr);
> >  			fprintf(fout, "Author: %s\n", name.buf);
> >  			fprintf(fout, "Email: %s\n", email.buf);
> > diff --git a/t/t5100/sample.mbox b/t/t5100/sample.mbox
> > index 4bf7947..d465685 100644
> > --- a/t/t5100/sample.mbox
> > +++ b/t/t5100/sample.mbox
> > @@ -2,7 +2,10 @@
> >  	
> >      
> >  From nobody Mon Sep 17 00:00:00 2001
> > -From: A U Thor <a.u.thor@example.com>
> > +From: A
> > +      U
> > +      Thor
> > +      <a.u.thor@example.com>
> >  Date: Fri, 9 Jun 2006 00:44:16 -0700
> >  Subject: [PATCH] a commit.
> 
> I think this is a reasonable change.

Thanks.


> But doesn't this
> 
> >  From nobody Mon Sep 17 00:00:00 2001
> > -From: A
> > +From: A (zzz)
> >        U
> >        Thor
> > -      <a.u.thor@example.com>
> > +      <a.u.thor@example.com> (Comment)
> 
> regress for people who spell their names like this?
> 
> 	From: john.doe@email.xz (John Doe)

I think everything is ok:

There is an explicit handler for such emails before my comments removal
in builtin-mailinfo.c:

        /* The remainder is name.  It could be "John Doe <john.doe@xz>"
         * or "john.doe@xz (John Doe)", but we have removed the
         * email part, so trim from both ends, possibly removing
         * the () pair at the end.
         */
        strbuf_trim(&f);
        if (f.buf[0] == '(' && f.len && f.buf[f.len - 1] == ')') {
                strbuf_remove(&f, 0, 1);
                strbuf_setlen(&f, f.len - 1);
        }


http://repo.or.cz/w/git.git?a=blob;f=builtin-mailinfo.c;h=f7c8c08b320c99d8bf96443ae57aa33c1de7e8c0;hb=HEAD#l112


And only a test for this is missing



From 77316ad6db2c3b0f4be238c4ba855b2f785b50d6 Mon Sep 17 00:00:00 2001
From: Kirill Smelkov <kirr@mns.spb.ru>
Date: Tue, 13 Jan 2009 12:33:48 +0300
Subject: [PATCH] mailinfo: add explicit test for mails like '<a.u.thor@example.com> (A U Thor)'

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
---
 t/t5100-mailinfo.sh |    2 +-
 t/t5100/info0013    |    5 +++++
 t/t5100/sample.mbox |    5 +++++
 3 files changed, 11 insertions(+), 1 deletions(-)
 create mode 100644 t/t5100/info0013
 create mode 100644 t/t5100/msg0013
 create mode 100644 t/t5100/patch0013

diff --git a/t/t5100-mailinfo.sh b/t/t5100-mailinfo.sh
index 625c204..e70ea94 100755
--- a/t/t5100-mailinfo.sh
+++ b/t/t5100-mailinfo.sh
@@ -11,7 +11,7 @@ test_expect_success 'split sample box' \
 	'git mailsplit -o. "$TEST_DIRECTORY"/t5100/sample.mbox >last &&
 	last=`cat last` &&
 	echo total is $last &&
-	test `cat last` = 12'
+	test `cat last` = 13'
 
 for mail in `echo 00*`
 do
diff --git a/t/t5100/info0013 b/t/t5100/info0013
new file mode 100644
index 0000000..bbe049e
--- /dev/null
+++ b/t/t5100/info0013
@@ -0,0 +1,5 @@
+Author: A U Thor
+Email: a.u.thor@example.com
+Subject: a patch
+Date: Fri, 9 Jun 2006 00:44:16 -0700
+
diff --git a/t/t5100/msg0013 b/t/t5100/msg0013
new file mode 100644
index 0000000..e69de29
diff --git a/t/t5100/patch0013 b/t/t5100/patch0013
new file mode 100644
index 0000000..e69de29
diff --git a/t/t5100/sample.mbox b/t/t5100/sample.mbox
index 4f80b82..c5ad206 100644
--- a/t/t5100/sample.mbox
+++ b/t/t5100/sample.mbox
@@ -556,3 +556,8 @@ index 3fd3afb..0ee807e 100644
  #. п╫п╟ я│п╡п╬п╣п╧ п╪п╟я┬п╦п╫п╣: п╬я┌я─п╣п╢п╟п╨я┌п╦я─п╬п╡п╟я┌я▄ /etc/sudoers (п╨п╬п╪п╟п╫п╢п╟ ``visudo``) п©я─п╦п╪п╣я─п╫п╬ я│п╩п╣п╢я┐я▌я┴п╦п╪ п╬п╠я─п╟п╥п╬п╪::
 -- 
 1.5.6.5
+From nobody Mon Sep 17 00:00:00 2001
+From: <a.u.thor@example.com> (A U Thor)
+Date: Fri, 9 Jun 2006 00:44:16 -0700
+Subject: [PATCH] a patch
+
-- 
1.6.1.101.g0335

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header
  2009-01-13  9:39               ` Kirill Smelkov
@ 2009-01-14  8:19                 ` Kirill Smelkov
  0 siblings, 0 replies; 12+ messages in thread
From: Kirill Smelkov @ 2009-01-14  8:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Alexander Potashev, git

On Tue, Jan 13, 2009 at 12:39:16PM +0300, Kirill Smelkov wrote:
> On Mon, Jan 12, 2009 at 03:27:44PM -0800, Junio C Hamano wrote:
> > Kirill Smelkov <kirr@landau.phys.spbu.ru> writes:

[...]

> > But doesn't this
> > 
> > >  From nobody Mon Sep 17 00:00:00 2001
> > > -From: A
> > > +From: A (zzz)
> > >        U
> > >        Thor
> > > -      <a.u.thor@example.com>
> > > +      <a.u.thor@example.com> (Comment)
> > 
> > regress for people who spell their names like this?
> > 
> > 	From: john.doe@email.xz (John Doe)
> 
> I think everything is ok:
[...]

Just in case it got spam-detected again:

http://marc.info/?l=git&m=123183962105146&w=2


Thanks,
Kirill

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header
  2008-12-26 18:38 [PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header Kirill Smelkov
  2009-01-07 22:43 ` [BUG PATCH " Kirill Smelkov
@ 2009-01-08 10:08 ` Alexander Potashev
  1 sibling, 0 replies; 12+ messages in thread
From: Alexander Potashev @ 2009-01-08 10:08 UTC (permalink / raw)
  To: Kirill Smelkov; +Cc: Junio C Hamano, git

On 21:38 Fri 26 Dec     , Kirill Smelkov wrote:
> When native language (RU) is in use, subject header usually contains several
> parts, e.g.
> 
> Subject: [Navy-patches] [PATCH]
> 	=?utf-8?b?0JjQt9C80LXQvdGR0L0g0YHQv9C40YHQvtC6INC/0LA=?=
> 	=?utf-8?b?0LrQtdGC0L7QsiDQvdC10L7QsdGF0L7QtNC40LzRi9GFINC00LvRjyA=?=
> 	=?utf-8?b?0YHQsdC+0YDQutC4?=
> 

>  t/t5100/info0012    |    5 ++++
>  t/t5100/msg0012     |    7 ++++++
>  t/t5100/patch0012   |   30 +++++++++++++++++++++++++++++
>  t/t5100/sample.mbox |   52 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 112 insertions(+), 2 deletions(-)

The testcases are too long, a minimal mbox with encoded "Subject:" would
be enough to test the mailinfo parser, it's all the you need to test
here.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2009-01-14  8:21 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-12-26 18:38 [PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header Kirill Smelkov
2009-01-07 22:43 ` [BUG PATCH " Kirill Smelkov
2009-01-08  8:13   ` Junio C Hamano
2009-01-08  8:35     ` Junio C Hamano
2009-01-08 23:11       ` Kirill Smelkov
2009-01-10 10:12         ` Kirill Smelkov
2009-01-11  1:54         ` Junio C Hamano
2009-01-12 22:34           ` Kirill Smelkov
2009-01-12 23:27             ` Junio C Hamano
2009-01-13  9:39               ` Kirill Smelkov
2009-01-14  8:19                 ` Kirill Smelkov
2009-01-08 10:08 ` [PATCH " Alexander Potashev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).