* [PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header
From: Kirill Smelkov @ 2008-12-26 18:38 UTC
To: Junio C Hamano, Junio C Hamano; +Cc: Kirill Smelkov, git

When native language (RU) is in use, subject header usually contains several
parts, e.g.

Subject: [Navy-patches] [PATCH]
	=?utf-8?b?0JjQt9C80LXQvdGR0L0g0YHQv9C40YHQvtC6INC/0LA=?=
	=?utf-8?b?0LrQtdGC0L7QsiDQvdC10L7QsdGF0L7QtNC40LzRi9GFINC00LvRjyA=?=
	=?utf-8?b?0YHQsdC+0YDQutC4?=

This exposes several bugs in builtin-mailinfo.c that I try to fix:

1. decode_b_segment: do not append explicit NUL -- explicit NUL was preventing
   correct header construction on parts concatenation via strbuf_addbuf in
   decode_header_bq. Fixes:

   -Subject: Изменён список пакетов необходимых для сборки
   +Subject: Изменён список па

Then

2. (hackish) do not emit '\n' after processing of every header segment. It
   seems we should emit previous part as-is only if it does not end with
   '=?='. Fixes:

   -Subject: Изменён список пакетов необходимых для сборки
   +Subject: Изменён список па кетов необходимых для сборки

Sorry for low-quality patch and description. I did what I could and don't have
energy and time dig more into MIME.

Please help.
Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru> --- builtin-mailinfo.c | 18 ++++++++++++++++- t/t5100-mailinfo.sh | 2 +- t/t5100/info0012 | 5 ++++ t/t5100/msg0012 | 7 ++++++ t/t5100/patch0012 | 30 +++++++++++++++++++++++++++++ t/t5100/sample.mbox | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 112 insertions(+), 2 deletions(-) diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c index e890f7a..d138bc3 100644 --- a/builtin-mailinfo.c +++ b/builtin-mailinfo.c @@ -436,6 +436,14 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg) * for now we just trust the data. */ c = 0; + + /* XXX: the following is needed not to output NUL in + * the resulting string + * + * This seems to be ok, but I'm not 100% sure -- that's + * why this is an RFC. + */ + continue; } else continue; /* garbage */ @@ -513,7 +521,15 @@ static int decode_header_bq(struct strbuf *it) strbuf_reset(&piecebuf); rfc2047 = 1; - if (in != ep) { + /* XXX: the follwoing is needed not to output '\n' on every + * multi-line segment in Subject. + * + * I suspect this is not 100% correct, but I'm not a MIME guy + * -- that's why this is an RFC. + */ + + /* if in does not end with '=?=', we emit it as is */ + if (in <= (ep-2) && !(ep[-1]=='\n' && ep[-2]=='=')) { strbuf_add(&outbuf, in, ep - in); in = ep; } diff --git a/t/t5100-mailinfo.sh b/t/t5100-mailinfo.sh index fe14589..6825f99 100755 --- a/t/t5100-mailinfo.sh +++ b/t/t5100-mailinfo.sh @@ -11,7 +11,7 @@ test_expect_success 'split sample box' \ 'git mailsplit -o. 
"$TEST_DIRECTORY"/t5100/sample.mbox >last && last=`cat last` && echo total is $last && - test `cat last` = 11' + test `cat last` = 12' for mail in `echo 00*` do diff --git a/t/t5100/info0012 b/t/t5100/info0012 new file mode 100644 index 0000000..ac1216f --- /dev/null +++ b/t/t5100/info0012 @@ -0,0 +1,5 @@ +Author: Dmitriy Blinov +Email: bda@mnsspb.ru +Subject: Изменён список пакетов необходимых для сборки +Date: Wed, 12 Nov 2008 17:54:41 +0300 + diff --git a/t/t5100/msg0012 b/t/t5100/msg0012 new file mode 100644 index 0000000..1dc2bf7 --- /dev/null +++ b/t/t5100/msg0012 @@ -0,0 +1,7 @@ +textlive-* исправлены на texlive-* +docutils заменён на python-docutils + +Действительно, оказалось, что rest2web вытягивает за собой +python-docutils. В то время как сам rest2web не нужен. + +Signed-off-by: Dmitriy Blinov <bda@mnsspb.ru> diff --git a/t/t5100/patch0012 b/t/t5100/patch0012 new file mode 100644 index 0000000..36a0b68 --- /dev/null +++ b/t/t5100/patch0012 @@ -0,0 +1,30 @@ +--- + howto/build_navy.txt | 6 +++--- + 1 files changed, 3 insertions(+), 3 deletions(-) + +diff --git a/howto/build_navy.txt b/howto/build_navy.txt +index 3fd3afb..0ee807e 100644 +--- a/howto/build_navy.txt ++++ b/howto/build_navy.txt +@@ -119,8 +119,8 @@ + - libxv-dev + - libusplash-dev + - latex-make +- - textlive-lang-cyrillic +- - textlive-latex-extra ++ - texlive-lang-cyrillic ++ - texlive-latex-extra + - dia + - python-pyrex + - libtool +@@ -128,7 +128,7 @@ + - sox + - cython + - imagemagick +- - docutils ++ - python-docutils + + #. на машине dinar: добавить свой открытый ssh-ключ в authorized_keys2 пользователя ddev + #. 
на своей машине: отредактировать /etc/sudoers (команда ``visudo``) примерно следующим образом:: +-- +1.5.6.5 diff --git a/t/t5100/sample.mbox b/t/t5100/sample.mbox index 4bf7947..94da4da 100644 --- a/t/t5100/sample.mbox +++ b/t/t5100/sample.mbox @@ -501,3 +501,55 @@ index 3e5fe51..aabfe5c 100644 --=-=-=-- +From bda@mnsspb.ru Wed Nov 12 17:54:41 2008 +From: Dmitriy Blinov <bda@mnsspb.ru> +To: navy-patches@dinar.mns.mnsspb.ru +Date: Wed, 12 Nov 2008 17:54:41 +0300 +Message-Id: <1226501681-24923-1-git-send-email-bda@mnsspb.ru> +X-Mailer: git-send-email 1.5.6.5 +MIME-Version: 1.0 +Content-Type: text/plain; + charset=utf-8 +Content-Transfer-Encoding: 8bit +Subject: [Navy-patches] [PATCH] + =?utf-8?b?0JjQt9C80LXQvdGR0L0g0YHQv9C40YHQvtC6INC/0LA=?= + =?utf-8?b?0LrQtdGC0L7QsiDQvdC10L7QsdGF0L7QtNC40LzRi9GFINC00LvRjyA=?= + =?utf-8?b?0YHQsdC+0YDQutC4?= + +textlive-* исправлены на texlive-* +docutils заменён на python-docutils + +Действительно, оказалось, что rest2web вытягивает за собой +python-docutils. В то время как сам rest2web не нужен. + +Signed-off-by: Dmitriy Blinov <bda@mnsspb.ru> +--- + howto/build_navy.txt | 6 +++--- + 1 files changed, 3 insertions(+), 3 deletions(-) + +diff --git a/howto/build_navy.txt b/howto/build_navy.txt +index 3fd3afb..0ee807e 100644 +--- a/howto/build_navy.txt ++++ b/howto/build_navy.txt +@@ -119,8 +119,8 @@ + - libxv-dev + - libusplash-dev + - latex-make +- - textlive-lang-cyrillic +- - textlive-latex-extra ++ - texlive-lang-cyrillic ++ - texlive-latex-extra + - dia + - python-pyrex + - libtool +@@ -128,7 +128,7 @@ + - sox + - cython + - imagemagick +- - docutils ++ - python-docutils + + #. на машине dinar: добавить свой открытый ssh-ключ в authorized_keys2 пользователя ddev + #. на своей машине: отредактировать /etc/sudoers (команда ``visudo``) примерно следующим образом:: +-- +1.5.6.5 -- tg: (2292ebd..) t/mailinfo-multiline-subject (depends on: tmp) ^ permalink raw reply related [flat|nested] 12+ messages in thread
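Bug 1 in the message above can be pictured with a small stand-alone sketch. It does not use git's strbuf API: `append_piece` and `visible_len` are hypothetical helpers standing in for a length-counted append and for any later consumer that treats the header as a NUL-terminated string (an assumption about how the truncation shows up, not a trace of the real code path).

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a length-counted append (not git's strbuf API):
 * copy len bytes of piece to out at offset *used and keep out NUL-terminated.
 * The caller must make sure out has room for *used + len + 1 bytes.
 */
static void append_piece(char *out, size_t *used, const char *piece, size_t len)
{
	memcpy(out + *used, piece, len);
	*used += len;
	out[*used] = '\0';
}

/* Length seen by code that treats the buffer as a plain C string. */
static size_t visible_len(const char *out)
{
	return strlen(out);
}
```

Appending a 3-byte piece that ends in an explicit NUL and then a 5-byte piece leaves 8 bytes in the buffer, yet `visible_len` reports only 2: everything after the first piece is present but invisible, which matches the truncated subject shown in the report.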
* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header 2008-12-26 18:38 [PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header Kirill Smelkov @ 2009-01-07 22:43 ` Kirill Smelkov 2009-01-08 8:13 ` Junio C Hamano 2009-01-08 10:08 ` [PATCH " Alexander Potashev 1 sibling, 1 reply; 12+ messages in thread From: Kirill Smelkov @ 2009-01-07 22:43 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Fri, Dec 26, 2008 at 09:38:41PM +0300, Kirill Smelkov wrote: > When native language (RU) is in use, subject header usually contains several > parts, e.g. > > Subject: [Navy-patches] [PATCH] > =?utf-8?b?0JjQt9C80LXQvdGR0L0g0YHQv9C40YHQvtC6INC/0LA=?= > =?utf-8?b?0LrQtdGC0L7QsiDQvdC10L7QsdGF0L7QtNC40LzRi9GFINC00LvRjyA=?= > =?utf-8?b?0YHQsdC+0YDQutC4?= Which btw should be extracted by git-mailinfo to: 'Subject: Изменён список пакетов необходимых для сборки' > This exposes several bugs in builtin-mailinfo.c that I try to fix: > > > 1. decode_b_segment: do not append explicit NUL -- explicit NUL was preventing > correct header construction on parts concatenation via strbuf_addbuf in > decode_header_bq. Fixes: > > -Subject: Изменён список пакетов необходимых для сборки > +Subject: Изменён список па > > > Then > > 2. (hackish) do not emit '\n' after processing of every header segment. It > seems we should emit previous part as-is only if it does not end with > '=?='. Fixes: > > -Subject: Изменён список пакетов необходимых для сборки > +Subject: Изменён список па кетов необходимых для сборки > > > Sorry for low-quality patch and description. I did what I could and don't have > energy and time dig more into MIME. > > Please help. 
>
> Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
>
> ---
>  builtin-mailinfo.c  |   18 ++++++++++++++++-
>  t/t5100-mailinfo.sh |    2 +-
>  t/t5100/info0012    |    5 ++++
>  t/t5100/msg0012     |    7 ++++++
>  t/t5100/patch0012   |   30 +++++++++++++++++++++++++++++
>  t/t5100/sample.mbox |   52 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 112 insertions(+), 2 deletions(-)

Junio, All,

What about this patch?

It at least exposes bug in git-mailinfo wrt handling of multiline
subjects, and in very details documents it and adds a test for it.

Yes, my fixes are of 'low quality', but may I try to attract git
community attention one more time?

Thanks beforehand,
Kirill

P.S. original post with patch:

    http://marc.info/?l=git&m=123031899307286&w=2
* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header
From: Junio C Hamano @ 2009-01-08  8:13 UTC
To: Kirill Smelkov; +Cc: git

Kirill Smelkov <kirr@landau.phys.spbu.ru> writes:

> On Fri, Dec 26, 2008 at 09:38:41PM +0300, Kirill Smelkov wrote:
>> When native language (RU) is in use, subject header usually contains several
>> parts, e.g.
> ...
> Junio, All,
>
> What about this patch?

What's most interesting is that I do not recall seeing this patch before.
Neither gmane (which is my back-up interface to the mailing list) nor my
mailbox seems to have a copy, and from the look of quoted parts (namely,
some Russian strings in the message), it is not implausible that my spam
filter (either on my receiving end or at the ISP) may have eaten it.

> It at least exposes bug in git-mailinfo wrt handling of multiline
> subjects, and in very details documents it and adds a test for it.
>
> ..., but may I try to attract git
> community attention one more time?

It is very appreciated.

> P.S. original post with patch:
>
>     http://marc.info/?l=git&m=123031899307286&w=2

I have not had chance to look at your patch at marc yet, but from the look
of your problem description, I presume you could trigger this with any
utf-8 b-encoded loooooong subject line?
* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header
From: Junio C Hamano @ 2009-01-08  8:35 UTC
To: Kirill Smelkov; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> Kirill Smelkov <kirr@landau.phys.spbu.ru> writes:
> ...
>>     http://marc.info/?l=git&m=123031899307286&w=2
>
> I have not had chance to look at your patch at marc yet, but from the look
> of your problem description, I presume you could trigger this with any
> utf-8 b-encoded loooooong subject line?

Ok, I took a look at it after downloading from the marc archive.

> diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
> index e890f7a..d138bc3 100644
> --- a/builtin-mailinfo.c
> +++ b/builtin-mailinfo.c
> @@ -436,6 +436,14 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
>  			 * for now we just trust the data.
>  			 */
>  			c = 0;
> +
> +			/* XXX: the following is needed not to output NUL in
> +			 * the resulting string
> +			 *
> +			 * This seems to be ok, but I'm not 100% sure -- that's
> +			 * why this is an RFC.
> +			 */
> +			continue;
>  		}
>  		else
>  			continue; /* garbage */

B encoding (RFC 2045) encodes an octet stream into a sequence of groups of
4 letters from 64-char alphabet, each of which encodes 6-bit, plus zero or
more padding char '=' to make the result multiple of 4.

 * If the length of the payload is a multiple of 3 octets, there is no
   special handling.  Padding char '=' is not produced;

 * If it is a multiple of 3 octets plus one, the remaining one octet is
   encoded with two letters, and two more padding char '=' is added;

 * If it is a multiple of 3 octets plus two, the remaining two octets are
   encoded with three letters, and one padding char '=' is added.
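The three cases above reduce to simple arithmetic on the payload length. The sketch below is only an illustration of that arithmetic; `b64_pad_count` and `b64_encoded_len` are hypothetical names, not functions from builtin-mailinfo.c.

```c
#include <stddef.h>

/* Number of '=' padding chars base64 appends for a payload of n octets. */
static size_t b64_pad_count(size_t n)
{
	return (3 - n % 3) % 3;
}

/* Total encoded length (letters plus padding) for a payload of n octets. */
static size_t b64_encoded_len(size_t n)
{
	return 4 * ((n + 2) / 3);
}
```

For instance, the first encoded-word of the Subject under discussion holds 44 base64 characters ending in a single '=', which is what these helpers give for a 32-octet payload (a multiple of 3 plus two).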

Hence, a "correct" implementation should decode the input as if '=' were
the same as 'A' (which encodes 6 bits of 0) til the end, making sure that
the padding char '=' appears only at the end of the input, that no char
outside the Base64 encoding alphabet appears in the input, and that the
length of the entire encoded string is multiple of 4.  Finally it would
discard either one or two octets (depending on the number of padding chars
it saw) from the end of the output.

Our decode_b_segment() however emits each octet as it completes, without
waiting for the 24-bit group that contains it to complete.  When decoding
a correctly encoded input, by the time we see a padding '=', all the real
payload octets are complete and we would not have any real information
still kept in the variable "acc" (accumulator), so ignoring '=' (you do
not even need to assign c = 0) like your patch did would work just fine.
An alternative would be to count the number of padding at the end and drop
the NULs from the output as necessary after the loop but that does not add
any value to the current code.

Ideally we should validate the encoded string a bit more carefully (see
the "correct" implementation above), and warn if a malformed input is
found (but probably not reject outright).  But as a low-impact fix for the
maintenance branches, I think your fix is very good.

    Side note: I suspect that the existing code was Ok before strbuf
    conversion as we assumed NUL terminated output buffer.

> @@ -513,7 +521,15 @@ static int decode_header_bq(struct strbuf *it)
>  		strbuf_reset(&piecebuf);
>  		rfc2047 = 1;
>
> -		if (in != ep) {
> +		/* XXX: the follwoing is needed not to output '\n' on every
> +		 * multi-line segment in Subject.
> +		 *
> +		 * I suspect this is not 100% correct, but I'm not a MIME guy
> +		 * -- that's why this is an RFC.
> +		 */
> +
> +		/* if in does not end with '=?=', we emit it as is */
> +		if (in <= (ep-2) && !(ep[-1]=='\n' && ep[-2]=='=')) {
>  			strbuf_add(&outbuf, in, ep - in);
>  			in = ep;
>
>  		}

I am not a MIME guy either (and mailinfo has a big comment that says we do
not really do MIME --- we just pretend to do), but let me give it a try.

RFC2047 specifies that an encoded-word ("=?charset?encoding?...?=") may
not be more than 75 characters long, and multiple encoded-words, separated
by CRLF SPACE can be used to encode more text if needed.

It further specifies that an encoded-word can appear next to ordinary text
or another encoded-word but it must be separated by linear white space,
and says that such linear white space is to be ignored when displaying.

Which means that we should be eating the CRLF SPACE we see if we have seen
an encoded-word immediately before and we are about to process another
encoded-word.

Based on the above discussion, here is what I came up with.  It passes
your test, but I ran out of energy to try breaking it seriously in any
other way than just running the existing test suite.

We might want to steal some test cases from the "8. Examples" section of
RFC2047 and add them to t5100.

Thanks.

 builtin-mailinfo.c |   27 +++++++++++++++++++--------
 1 files changed, 19 insertions(+), 8 deletions(-)

diff --git c/builtin-mailinfo.c w/builtin-mailinfo.c
index e890f7a..fcb32c9 100644
--- c/builtin-mailinfo.c
+++ w/builtin-mailinfo.c
@@ -430,13 +430,6 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
 			c -= 'a' - 26;
 		else if ('0' <= c && c <= '9')
 			c -= '0' - 52;
-		else if (c == '=') {
-			/* padding is almost like (c == 0), except we do
-			 * not output NUL resulting only from it;
-			 * for now we just trust the data.
-			 */
-			c = 0;
-		}
 		else
 			continue; /* garbage */
 		switch (pos++) {
@@ -514,7 +507,25 @@ static int decode_header_bq(struct strbuf *it)
 		rfc2047 = 1;

 		if (in != ep) {
-			strbuf_add(&outbuf, in, ep - in);
+			/*
+			 * We are about to process an encoded-word
+			 * that begins at ep, but there is something
+			 * before the encoded word.
+			 */
+			char *scan;
+			for (scan = in; scan < ep; scan++)
+				if (!isspace(*scan))
+					break;
+
+			if (scan != ep || in == it->buf) {
+				/*
+				 * We should not lose that "something",
+				 * unless we have just processed an
+				 * encoded-word, and there is only LWS
+				 * before the one we are about to process.
+				 */
+				strbuf_add(&outbuf, in, ep - in);
+			}
 			in = ep;
 		}
 		/* E.g.
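The white-space rule applied in the second hunk of the patch above can also be read in isolation. The following is a stand-alone sketch of that decision only, not the code from builtin-mailinfo.c: `only_lws` and `keep_gap` are hypothetical names, and `at_start` stands in for the `in == it->buf` test (no encoded-word has been decoded yet).

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 when every byte in [in, ep) is white
 * space (an empty range counts as white space only), 0 otherwise.
 */
static int only_lws(const char *in, const char *ep)
{
	const char *scan;

	for (scan = in; scan < ep; scan++)
		if (!isspace((unsigned char)*scan))
			return 0;
	return 1;
}

/*
 * Decide whether the text in [in, ep), which precedes an encoded-word
 * starting at ep, should be copied to the output.  It is dropped only
 * when it is pure white space sitting between two encoded-words.
 */
static int keep_gap(const char *in, const char *ep, int at_start)
{
	return at_start || !only_lws(in, ep);
}
```

With this rule a gap of "\n " between two encoded-words is dropped, while a gap such as " b " that contains ordinary text is kept, in line with the "(ab)" and "(a b)" style examples from RFC2047 mentioned in the thread.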
* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header 2009-01-08 8:35 ` Junio C Hamano @ 2009-01-08 23:11 ` Kirill Smelkov 2009-01-10 10:12 ` Kirill Smelkov 2009-01-11 1:54 ` Junio C Hamano 0 siblings, 2 replies; 12+ messages in thread From: Kirill Smelkov @ 2009-01-08 23:11 UTC (permalink / raw) To: Junio C Hamano, Alexander Potashev; +Cc: git On Thu, Jan 08, 2009 at 12:13:42AM -0800, Junio C Hamano wrote: > Kirill Smelkov <kirr@landau.phys.spbu.ru> writes: > > > On Fri, Dec 26, 2008 at 09:38:41PM +0300, Kirill Smelkov wrote: > >> When native language (RU) is in use, subject header usually contains several > >> parts, e.g. > > ... > > Junio, All, > > > > What about this patch? > > What's most interesting is that I do not recall seeing this patch before. > Neither gmane (which is my back-up interface to the mailing list) nor my > mailbox seems to have a copy, and from the look of quoted parts (namely, > some Russian strings in the message), it is not implausible that my spam > filter (either on my receiving end or at the ISP) may have eaten it. > > > It at least exposes bug in git-mailinfo wrt handling of multiline > > subjects, and in very details documents it and adds a test for it. > > > > ..., but may I try to attract git > > community attention one more time? > > It is very appreciated. Thanks! On Thu, Jan 08, 2009 at 12:35:52AM -0800, Junio C Hamano wrote: > Junio C Hamano <gitster@pobox.com> writes: > > > Kirill Smelkov <kirr@landau.phys.spbu.ru> writes: > > ... > >> http://marc.info/?l=git&m=123031899307286&w=2 > > > > I have not had chance to look at your patch at marc yet, but from the look > > of your problem description, I presume you could trigger this with any > > utf-8 b-encoded loooooong subject line? > > Ok, I took a look at it after downloading from the marc archive. 
> > > diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c > > index e890f7a..d138bc3 100644 > > --- a/builtin-mailinfo.c > > +++ b/builtin-mailinfo.c > > @@ -436,6 +436,14 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg) > > * for now we just trust the data. > > */ > > c = 0; > > + > > + /* XXX: the following is needed not to output NUL in > > + * the resulting string > > + * > > + * This seems to be ok, but I'm not 100% sure -- that's > > + * why this is an RFC. > > + */ > > + continue; > > } > > else > > continue; /* garbage */ > > B encoding (RFC 2045) encodes an octet stream into a sequence of groups of > 4 letters from 64-char alphabet, each of which encodes 6-bit, plus zero or > more padding char '=' to make the result multiple of 4. > > * If the length of the payload is a multiple of 3 octets, there is no > special handling. Padding char '=' is not produced; > > * If it is a multiple of 3 octets plus one, the remaining one octet is > encoded with two letters, and two more padding char '=' is added; > > * If it is a multiple of 3 octets plus two, the remaining two octets are > encoded with three letters, and one padding char '=' is added. > > Hence, a "correct" implementation should decode the input as if '=' were > the same as 'A' (which encodes 6 bits of 0) til the end, making sure that > the padding char '=' appears only at the end of the input, that no char > outside the Base64 encoding alphabet appears in the input, and that the > length of the entire encoded string is multiple of 4. Finally it would > discard either one or two octets (depending on the number of padding chars > it saw) from the end of the output. > > Our decode_b_segment() however emits each octet as it completes, without > waiting for the 24-bit group that contains it to complete. 
When decoding > a correctly encoded input, by the time we see a padding '=', all the real > payload octets are complete and we would not have any real information > still kept in the variable "acc" (accumulator), so ignoring '=' (you do > not even need to assign c = 0) like your patch did would work just fine. > An alternative would be to count the number of padding at the end and drop > the NULs from the output as necessary after the loop but that does not add > any value to the current code. > > Ideally we should validate the encoded string a bit more carefully (see > the "correct" implementation about), and warn if a malformed input is > found (but probably not reject outright). But as a low-impact fix for the > maintenance branches, I think your fix is very good. > > Side note: I suspect that the existing code was Ok before strbuf > conversion as we assumed NUL terminated output buffer. Junio, thanks for the explanation. I've updated the patch and included your analysis into description. > > @@ -513,7 +521,15 @@ static int decode_header_bq(struct strbuf *it) > > strbuf_reset(&piecebuf); > > rfc2047 = 1; > > > > - if (in != ep) { > > + /* XXX: the follwoing is needed not to output '\n' on every > > + * multi-line segment in Subject. > > + * > > + * I suspect this is not 100% correct, but I'm not a MIME guy > > + * -- that's why this is an RFC. > > + */ > > + > > + /* if in does not end with '=?=', we emit it as is */ > > + if (in <= (ep-2) && !(ep[-1]=='\n' && ep[-2]=='=')) { > > strbuf_add(&outbuf, in, ep - in); > > in = ep; > > > > } > > I am not a MIME guy either (and mailinfo has a big comment that says we do > not really do MIME --- we just pretend to do), but let me give it a try. > > RFC2046 specifies that an encoded-word ("=?charset?encoding?...?=") may > not be more than 75 characters long, and multiple encoded-words, separated > by CRLF SPACE can be used to encode more text if needed. 
> > It further specifies that an encoded-word can appear next to ordinary text > or another encoded-word but it must be separated by linear white space, > and says that such linear white space is to be ignored when displaying. > > Which means that we should be eating the CRLF SPACE we see if we have seen > an encoded-word immediately before and we are about to process another > encoded-word. > > Based on the above discussion, here is what I came up with. It passes > your test, but I ran out of energy to try breaking it seriously in any > other way than just running the existing test suite. Thanks again very much! I was once maintaining software, and I think I understand what you mean by saying 'ran out of energy', so I'll try to do my best to help improve this patch and to get it merged. > We might want to steal some test cases from the "8. Examples" section of > RFC2047 and add them to t5100. Good idea. I took all the examples and incorporated them into our testsuite. > > Thanks. > > builtin-mailinfo.c | 27 +++++++++++++++++++-------- > 1 files changed, 19 insertions(+), 8 deletions(-) > > diff --git c/builtin-mailinfo.c w/builtin-mailinfo.c > index e890f7a..fcb32c9 100644 > --- c/builtin-mailinfo.c > +++ w/builtin-mailinfo.c > @@ -430,13 +430,6 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg) > c -= 'a' - 26; > else if ('0' <= c && c <= '9') > c -= '0' - 52; > - else if (c == '=') { > - /* padding is almost like (c == 0), except we do > - * not output NUL resulting only from it; > - * for now we just trust the data. > - */ > - c = 0; > - } > else > continue; /* garbage */ > switch (pos++) { > @@ -514,7 +507,25 @@ static int decode_header_bq(struct strbuf *it) > rfc2047 = 1; > > if (in != ep) { > - strbuf_add(&outbuf, in, ep - in); > + /* > + * We are about to process an encoded-word > + * that begins at ep, but there is something > + * before the encoded word. 
> + */ > + char *scan; > + for (scan = in; scan < ep; scan++) > + if (!isspace(*scan)) > + break; > + > + if (scan != ep || in == it->buf) { > + /* > + * We should not lose that "something", > + * unless we have just processed an > + * encoded-word, and there is only LWS > + * before the one we are about to process. > + */ > + strbuf_add(&outbuf, in, ep - in); > + } > in = ep; > } > /* E.g. Based on the above description the code looks good now. I've incorporated it into the patch and added tests from RFC2047 (see patch below). On Thu, Jan 08, 2009 at 01:08:13PM +0300, Alexander Potashev wrote: > On 21:38 Fri 26 Dec , Kirill Smelkov wrote: > > When native language (RU) is in use, subject header usually contains several > > parts, e.g. > > > > Subject: [Navy-patches] [PATCH] > > =?utf-8?b?0JjQt9C80LXQvdGR0L0g0YHQv9C40YHQvtC6INC/0LA=?= > > =?utf-8?b?0LrQtdGC0L7QsiDQvdC10L7QsdGF0L7QtNC40LzRi9GFINC00LvRjyA=?= > > =?utf-8?b?0YHQsdC+0YDQutC4?= > > > > > t/t5100/info0012 | 5 ++++ > > t/t5100/msg0012 | 7 ++++++ > > t/t5100/patch0012 | 30 +++++++++++++++++++++++++++++ > > t/t5100/sample.mbox | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++ > > 6 files changed, 112 insertions(+), 2 deletions(-) > > The testcases are too long, a minimal mbox with encoded "Subject:" would > be enough to test the mailinfo parser, it's all the you need to test > here. Thanks Alexander for pointing this out. I've based my testcase on already-in-there tests, which e.g. for t/t5100/{info,msg,patch}00{04,05,09,10,11} are of approximately the same size and are based on real mails. Is this ok? As to new RFC2047-examples based tests, I've tried to keep them to the bare minimum. 
Changes since v1: o incorporated Junio's description and code about padding o incorporated Junio's description and code about LWS between encoded words o incorporated tests from RFC2047 examples (one testresult is unclear -- see patch description) From: Kirill Smelkov <kirr@landau.phys.spbu.ru> Subject: mailinfo: correctly handle multiline 'Subject:' header When native language (RU) is in use, subject header usually contains several parts, e.g. Subject: [Navy-patches] [PATCH] =?utf-8?b?0JjQt9C80LXQvdGR0L0g0YHQv9C40YHQvtC6INC/0LA=?= =?utf-8?b?0LrQtdGC0L7QsiDQvdC10L7QsdGF0L7QtNC40LzRi9GFINC00LvRjyA=?= =?utf-8?b?0YHQsdC+0YDQutC4?= ( which btw should be extracted by git-mailinfo to: 'Subject: Изменён список пакетов необходимых для сборки' ) This exposes several bugs in builtin-mailinfo.c which we try to fix: 1. decode_b_segment: do not append explicit NUL -- explicit NUL was preventing correct header construction on parts concatenation via strbuf_addbuf in decode_header_bq. Fixes: -Subject: Изменён список пакетов необходимых для сборки +Subject: Изменён список па Junio: > B encoding (RFC 2045) encodes an octet stream into a sequence of groups of > 4 letters from 64-char alphabet, each of which encodes 6-bit, plus zero or > more padding char '=' to make the result multiple of 4. > > * If the length of the payload is a multiple of 3 octets, there is no > special handling. Padding char '=' is not produced; > > * If it is a multiple of 3 octets plus one, the remaining one octet is > encoded with two letters, and two more padding char '=' is added; > > * If it is a multiple of 3 octets plus two, the remaining two octets are > encoded with three letters, and one padding char '=' is added. 
> > Hence, a "correct" implementation should decode the input as if '=' were > the same as 'A' (which encodes 6 bits of 0) til the end, making sure that > the padding char '=' appears only at the end of the input, that no char > outside the Base64 encoding alphabet appears in the input, and that the > length of the entire encoded string is multiple of 4. Finally it would > discard either one or two octets (depending on the number of padding chars > it saw) from the end of the output. > > Our decode_b_segment() however emits each octet as it completes, without > waiting for the 24-bit group that contains it to complete. When decoding > a correctly encoded input, by the time we see a padding '=', all the real > payload octets are complete and we would not have any real information > still kept in the variable "acc" (accumulator), so ignoring '=' (you do > not even need to assign c = 0) like your patch did would work just fine. > An alternative would be to count the number of padding at the end and drop > the NULs from the output as necessary after the loop but that does not add > any value to the current code. > > Ideally we should validate the encoded string a bit more carefully (see > the "correct" implementation about), and warn if a malformed input is > found (but probably not reject outright). But as a low-impact fix for the > maintenance branches, I think your fix is very good. > > Side note: I suspect that the existing code was Ok before strbuf > conversion as we assumed NUL terminated output buffer. Then 2. whitespaces between encoded words should be removed -Subject: Изменён список пакетов необходимых для сборки +Subject: Изменён список па кетов необходимых для сборки Junio: > I am not a MIME guy either (and mailinfo has a big comment that says we do > not really do MIME --- we just pretend to do), but let me give it a try. 
> > RFC2046 specifies that an encoded-word ("=?charset?encoding?...?=") may > not be more than 75 characters long, and multiple encoded-words, separated > by CRLF SPACE can be used to encode more text if needed. > > It further specifies that an encoded-word can appear next to ordinary text > or another encoded-word but it must be separated by linear white space, > and says that such linear white space is to be ignored when displaying. > > Which means that we should be eating the CRLF SPACE we see if we have seen > an encoded-word immediately before and we are about to process another > encoded-word. Also as suggested by Junio, in order to try to catch other MIME problems test cases from the "8. Examples" section of RFC2047 are added to t5100 testsuite as well. [but I'm not sure whether testresult with Nathaniel Borenstein (םולש ןב ילטפנ) is correct -- see rfc2047-info-0004] Big-thanks-to: Junio C Hamano <gitster@pobox.com> Signed-off-by: Kirill Smelkov <kirr@landau.phys.spbu.ru> --- builtin-mailinfo.c | 27 +++++++++++++++------ t/t5100-mailinfo.sh | 24 ++++++++++++++++++- t/t5100/info0012 | 5 ++++ t/t5100/msg0012 | 7 +++++ t/t5100/patch0012 | 30 ++++++++++++++++++++++++ t/t5100/rfc2047-info-0001 | 4 +++ t/t5100/rfc2047-info-0002 | 4 +++ t/t5100/rfc2047-info-0003 | 4 +++ t/t5100/rfc2047-info-0004 | 5 ++++ t/t5100/rfc2047-info-0005 | 2 + t/t5100/rfc2047-info-0006 | 2 + t/t5100/rfc2047-info-0007 | 2 + t/t5100/rfc2047-info-0008 | 2 + t/t5100/rfc2047-info-0009 | 2 + t/t5100/rfc2047-info-0010 | 2 + t/t5100/rfc2047-info-0011 | 2 + t/t5100/rfc2047-samples.mbox | 48 ++++++++++++++++++++++++++++++++++++++ t/t5100/sample.mbox | 52 ++++++++++++++++++++++++++++++++++++++++++ 18 files changed, 215 insertions(+), 9 deletions(-) diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c index f7c8c08..77a7121 100644 --- a/builtin-mailinfo.c +++ b/builtin-mailinfo.c @@ -430,13 +430,6 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg) c -= 'a' - 26; else if ('0' <= c && 
c <= '9') c -= '0' - 52; - else if (c == '=') { - /* padding is almost like (c == 0), except we do - * not output NUL resulting only from it; - * for now we just trust the data. - */ - c = 0; - } else continue; /* garbage */ switch (pos++) { @@ -514,7 +507,25 @@ static int decode_header_bq(struct strbuf *it) rfc2047 = 1; if (in != ep) { - strbuf_add(&outbuf, in, ep - in); + /* + * We are about to process an encoded-word + * that begins at ep, but there is something + * before the encoded word. + */ + char *scan; + for (scan = in; scan < ep; scan++) + if (!isspace(*scan)) + break; + + if (scan != ep || in == it->buf) { + /* + * We should not lose that "something", + * unless we have just processed an + * encoded-word, and there is only LWS + * before the one we are about to process. + */ + strbuf_add(&outbuf, in, ep - in); + } in = ep; } /* E.g. diff --git a/t/t5100-mailinfo.sh b/t/t5100-mailinfo.sh index fe14589..625c204 100755 --- a/t/t5100-mailinfo.sh +++ b/t/t5100-mailinfo.sh @@ -11,7 +11,7 @@ test_expect_success 'split sample box' \ 'git mailsplit -o. "$TEST_DIRECTORY"/t5100/sample.mbox >last && last=`cat last` && echo total is $last && - test `cat last` = 11' + test `cat last` = 12' for mail in `echo 00*` do @@ -26,6 +26,28 @@ do ' done + +test_expect_success 'split box with rfc2047 samples' \ + 'mkdir rfc2047 && + git mailsplit -orfc2047 "$TEST_DIRECTORY"/t5100/rfc2047-samples.mbox \ + >rfc2047/last && + last=`cat rfc2047/last` && + echo total is $last && + test `cat rfc2047/last` = 11' + +for mail in `echo rfc2047/00*` +do + test_expect_success "mailinfo $mail" ' + git mailinfo -u $mail-msg $mail-patch <$mail >$mail-info && + echo msg && + test_cmp "$TEST_DIRECTORY"/t5100/empty $mail-msg && + echo patch && + test_cmp "$TEST_DIRECTORY"/t5100/empty $mail-patch && + echo info && + test_cmp "$TEST_DIRECTORY"/t5100/rfc2047-info-$(basename $mail) $mail-info + ' +done + test_expect_success 'respect NULs' ' git mailsplit -d3 -o. 
"$TEST_DIRECTORY"/t5100/nul-plain && diff --git a/t/t5100/empty b/t/t5100/empty new file mode 100644 index 0000000..e69de29 diff --git a/t/t5100/info0012 b/t/t5100/info0012 new file mode 100644 index 0000000..ac1216f --- /dev/null +++ b/t/t5100/info0012 @@ -0,0 +1,5 @@ +Author: Dmitriy Blinov +Email: bda@mnsspb.ru +Subject: Изменён список пакетов необходимых для сборки +Date: Wed, 12 Nov 2008 17:54:41 +0300 + diff --git a/t/t5100/msg0012 b/t/t5100/msg0012 new file mode 100644 index 0000000..1dc2bf7 --- /dev/null +++ b/t/t5100/msg0012 @@ -0,0 +1,7 @@ +textlive-* исправлены на texlive-* +docutils заменён на python-docutils + +Действительно, оказалось, что rest2web вытягивает за собой +python-docutils. В то время как сам rest2web не нужен. + +Signed-off-by: Dmitriy Blinov <bda@mnsspb.ru> diff --git a/t/t5100/patch0012 b/t/t5100/patch0012 new file mode 100644 index 0000000..36a0b68 --- /dev/null +++ b/t/t5100/patch0012 @@ -0,0 +1,30 @@ +--- + howto/build_navy.txt | 6 +++--- + 1 files changed, 3 insertions(+), 3 deletions(-) + +diff --git a/howto/build_navy.txt b/howto/build_navy.txt +index 3fd3afb..0ee807e 100644 +--- a/howto/build_navy.txt ++++ b/howto/build_navy.txt +@@ -119,8 +119,8 @@ + - libxv-dev + - libusplash-dev + - latex-make +- - textlive-lang-cyrillic +- - textlive-latex-extra ++ - texlive-lang-cyrillic ++ - texlive-latex-extra + - dia + - python-pyrex + - libtool +@@ -128,7 +128,7 @@ + - sox + - cython + - imagemagick +- - docutils ++ - python-docutils + + #. на машине dinar: добавить свой открытый ssh-ключ в authorized_keys2 пользователя ddev + #. на своей машине: отредактировать /etc/sudoers (команда ``visudo``) примерно следующим образом:: +-- +1.5.6.5 diff --git a/t/t5100/rfc2047-info-0001 b/t/t5100/rfc2047-info-0001 new file mode 100644 index 0000000..0a383b0 --- /dev/null +++ b/t/t5100/rfc2047-info-0001 @@ -0,0 +1,4 @@ +Author: Keith Moore +Email: moore@cs.utk.edu +Subject: If you can read this you understand the example. 
+ diff --git a/t/t5100/rfc2047-info-0002 b/t/t5100/rfc2047-info-0002 new file mode 100644 index 0000000..881be75 --- /dev/null +++ b/t/t5100/rfc2047-info-0002 @@ -0,0 +1,4 @@ +Author: Olle Järnefors +Email: ojarnef@admin.kth.se +Subject: Time for ISO 10646? + diff --git a/t/t5100/rfc2047-info-0003 b/t/t5100/rfc2047-info-0003 new file mode 100644 index 0000000..d0f7891 --- /dev/null +++ b/t/t5100/rfc2047-info-0003 @@ -0,0 +1,4 @@ +Author: Patrik Fältström +Email: paf@nada.kth.se +Subject: RFC-HDR care and feeding + diff --git a/t/t5100/rfc2047-info-0004 b/t/t5100/rfc2047-info-0004 new file mode 100644 index 0000000..850f831 --- /dev/null +++ b/t/t5100/rfc2047-info-0004 @@ -0,0 +1,5 @@ +Author: Nathaniel Borenstein + (םולש ןב ילטפנ) +Email: nsb@thumper.bellcore.com +Subject: Test of new header generator + diff --git a/t/t5100/rfc2047-info-0005 b/t/t5100/rfc2047-info-0005 new file mode 100644 index 0000000..c27be3b --- /dev/null +++ b/t/t5100/rfc2047-info-0005 @@ -0,0 +1,2 @@ +Subject: (a) + diff --git a/t/t5100/rfc2047-info-0006 b/t/t5100/rfc2047-info-0006 new file mode 100644 index 0000000..9dad474 --- /dev/null +++ b/t/t5100/rfc2047-info-0006 @@ -0,0 +1,2 @@ +Subject: (a b) + diff --git a/t/t5100/rfc2047-info-0007 b/t/t5100/rfc2047-info-0007 new file mode 100644 index 0000000..294f195 --- /dev/null +++ b/t/t5100/rfc2047-info-0007 @@ -0,0 +1,2 @@ +Subject: (ab) + diff --git a/t/t5100/rfc2047-info-0008 b/t/t5100/rfc2047-info-0008 new file mode 100644 index 0000000..294f195 --- /dev/null +++ b/t/t5100/rfc2047-info-0008 @@ -0,0 +1,2 @@ +Subject: (ab) + diff --git a/t/t5100/rfc2047-info-0009 b/t/t5100/rfc2047-info-0009 new file mode 100644 index 0000000..294f195 --- /dev/null +++ b/t/t5100/rfc2047-info-0009 @@ -0,0 +1,2 @@ +Subject: (ab) + diff --git a/t/t5100/rfc2047-info-0010 b/t/t5100/rfc2047-info-0010 new file mode 100644 index 0000000..9dad474 --- /dev/null +++ b/t/t5100/rfc2047-info-0010 @@ -0,0 +1,2 @@ +Subject: (a b) + diff --git a/t/t5100/rfc2047-info-0011 
b/t/t5100/rfc2047-info-0011 new file mode 100644 index 0000000..9dad474 --- /dev/null +++ b/t/t5100/rfc2047-info-0011 @@ -0,0 +1,2 @@ +Subject: (a b) + diff --git a/t/t5100/rfc2047-samples.mbox b/t/t5100/rfc2047-samples.mbox new file mode 100644 index 0000000..3ca2470 --- /dev/null +++ b/t/t5100/rfc2047-samples.mbox @@ -0,0 +1,48 @@ +From nobody Mon Sep 17 00:00:00 2001 +From: =?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu> +To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk> +CC: =?ISO-8859-1?Q?Andr=E9?= Pirard <PIRARD@vm1.ulg.ac.be> +Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?= + =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?= + +From nobody Mon Sep 17 00:00:00 2001 +From: =?ISO-8859-1?Q?Olle_J=E4rnefors?= <ojarnef@admin.kth.se> +To: ietf-822@dimacs.rutgers.edu, ojarnef@admin.kth.se +Subject: Time for ISO 10646? + +From nobody Mon Sep 17 00:00:00 2001 +To: Dave Crocker <dcrocker@mordor.stanford.edu> +Cc: ietf-822@dimacs.rutgers.edu, paf@comsol.se +From: =?ISO-8859-1?Q?Patrik_F=E4ltstr=F6m?= <paf@nada.kth.se> +Subject: Re: RFC-HDR care and feeding + +From nobody Mon Sep 17 00:00:00 2001 +From: Nathaniel Borenstein <nsb@thumper.bellcore.com> + (=?iso-8859-8?b?7eXs+SDv4SDp7Oj08A==?=) +To: Greg Vaudreuil <gvaudre@NRI.Reston.VA.US>, Ned Freed + <ned@innosoft.com>, Keith Moore <moore@cs.utk.edu> +Subject: Test of new header generator +MIME-Version: 1.0 +Content-type: text/plain; charset=ISO-8859-1 + +From nobody Mon Sep 17 00:00:00 2001 +Subject: (=?ISO-8859-1?Q?a?=) + +From nobody Mon Sep 17 00:00:00 2001 +Subject: (=?ISO-8859-1?Q?a?= b) + +From nobody Mon Sep 17 00:00:00 2001 +Subject: (=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=) + +From nobody Mon Sep 17 00:00:00 2001 +Subject: (=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=) + +From nobody Mon Sep 17 00:00:00 2001 +Subject: (=?ISO-8859-1?Q?a?= + =?ISO-8859-1?Q?b?=) + +From nobody Mon Sep 17 00:00:00 2001 +Subject: (=?ISO-8859-1?Q?a_b?=) + +From nobody Mon Sep 17 00:00:00 2001 +Subject: (=?ISO-8859-1?Q?a?= 
=?ISO-8859-2?Q?_b?=) diff --git a/t/t5100/sample.mbox b/t/t5100/sample.mbox index 4bf7947..94da4da 100644 --- a/t/t5100/sample.mbox +++ b/t/t5100/sample.mbox @@ -501,3 +501,55 @@ index 3e5fe51..aabfe5c 100644 --=-=-=-- +From bda@mnsspb.ru Wed Nov 12 17:54:41 2008 +From: Dmitriy Blinov <bda@mnsspb.ru> +To: navy-patches@dinar.mns.mnsspb.ru +Date: Wed, 12 Nov 2008 17:54:41 +0300 +Message-Id: <1226501681-24923-1-git-send-email-bda@mnsspb.ru> +X-Mailer: git-send-email 1.5.6.5 +MIME-Version: 1.0 +Content-Type: text/plain; + charset=utf-8 +Content-Transfer-Encoding: 8bit +Subject: [Navy-patches] [PATCH] + =?utf-8?b?0JjQt9C80LXQvdGR0L0g0YHQv9C40YHQvtC6INC/0LA=?= + =?utf-8?b?0LrQtdGC0L7QsiDQvdC10L7QsdGF0L7QtNC40LzRi9GFINC00LvRjyA=?= + =?utf-8?b?0YHQsdC+0YDQutC4?= + +textlive-* исправлены на texlive-* +docutils заменён на python-docutils + +Действительно, оказалось, что rest2web вытягивает за собой +python-docutils. В то время как сам rest2web не нужен. + +Signed-off-by: Dmitriy Blinov <bda@mnsspb.ru> +--- + howto/build_navy.txt | 6 +++--- + 1 files changed, 3 insertions(+), 3 deletions(-) + +diff --git a/howto/build_navy.txt b/howto/build_navy.txt +index 3fd3afb..0ee807e 100644 +--- a/howto/build_navy.txt ++++ b/howto/build_navy.txt +@@ -119,8 +119,8 @@ + - libxv-dev + - libusplash-dev + - latex-make +- - textlive-lang-cyrillic +- - textlive-latex-extra ++ - texlive-lang-cyrillic ++ - texlive-latex-extra + - dia + - python-pyrex + - libtool +@@ -128,7 +128,7 @@ + - sox + - cython + - imagemagick +- - docutils ++ - python-docutils + + #. на машине dinar: добавить свой открытый ssh-ключ в authorized_keys2 пользователя ddev + #. на своей машине: отредактировать /etc/sudoers (команда ``visudo``) примерно следующим образом:: +-- +1.5.6.5 -- tg: (c123b7c..) t/mailinfo-multiline-subject (depends on: master) Thanks, Kirill ^ permalink raw reply related [flat|nested] 12+ messages in thread
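The whitespace rule that the patch above adds to decode_header_bq() can be sketched as a standalone C function. This is only an illustration under stated assumptions: the name keep_gap() and the prev_was_encoded_word flag are hypothetical and not part of builtin-mailinfo.c, where the patch appears to use `in == it->buf` to mean "no encoded-word has been processed yet".

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not git's code.
 *
 * gap/len describe the text found just before an encoded-word.
 * Returns nonzero if that text must be copied to the output.
 */
int keep_gap(const char *gap, size_t len, int prev_was_encoded_word)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (!isspace((unsigned char)gap[i]))
			return 1;	/* ordinary text: never lose it */

	/*
	 * Only linear white space: it is dropped when it sits between
	 * two encoded-words, and kept otherwise.
	 */
	return !prev_was_encoded_word;
}
```

With this rule the "CRLF SPACE" separating two encoded-words of a folded Subject is not copied, while text such as " b " in front of an encoded-word survives.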
* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header 2009-01-08 23:11 ` Kirill Smelkov @ 2009-01-10 10:12 ` Kirill Smelkov 2009-01-11 1:54 ` Junio C Hamano 1 sibling, 0 replies; 12+ messages in thread From: Kirill Smelkov @ 2009-01-10 10:12 UTC (permalink / raw) To: Junio C Hamano, Alexander Potashev; +Cc: git On Fri, Jan 09, 2009 at 02:11:35AM +0300, Kirill Smelkov wrote: > Changes since v1: > > o incorporated Junio's description and code about padding > o incorporated Junio's description and code about LWS between encoded > words > o incorporated tests from RFC2047 examples (one testresult is unclear > -- see patch description) > > > From: Kirill Smelkov <kirr@landau.phys.spbu.ru> > Subject: mailinfo: correctly handle multiline 'Subject:' header [...] Junio, All, just in case this again got spam-detected: http://marc.info/?l=git&m=123145624611936&w=2 Thanks, Kirill ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header 2009-01-08 23:11 ` Kirill Smelkov 2009-01-10 10:12 ` Kirill Smelkov @ 2009-01-11 1:54 ` Junio C Hamano 2009-01-12 22:34 ` Kirill Smelkov 1 sibling, 1 reply; 12+ messages in thread From: Junio C Hamano @ 2009-01-11 1:54 UTC (permalink / raw) To: Kirill Smelkov; +Cc: Alexander Potashev, git Kirill Smelkov <kirr@landau.phys.spbu.ru> writes: > [but I'm not sure whether testresult with Nathaniel Borenstein > (םולש ןב ילטפנ) is correct -- see rfc2047-info-0004] > ... > diff --git a/t/t5100/rfc2047-info-0004 b/t/t5100/rfc2047-info-0004 > new file mode 100644 > index 0000000..850f831 > --- /dev/null > +++ b/t/t5100/rfc2047-info-0004 > @@ -0,0 +1,5 @@ > +Author: Nathaniel Borenstein > + (םולש ןב ילטפנ) > +Email: nsb@thumper.bellcore.com > +Subject: Test of new header generator > + That does look wrong. If you can fix this, please do so; otherwise please mark the test that deals with this entry with test_expect_failure, until somebody else does. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header 2009-01-11 1:54 ` Junio C Hamano @ 2009-01-12 22:34 ` Kirill Smelkov 2009-01-12 23:27 ` Junio C Hamano 0 siblings, 1 reply; 12+ messages in thread From: Kirill Smelkov @ 2009-01-12 22:34 UTC (permalink / raw) To: Junio C Hamano; +Cc: Alexander Potashev, git On Sat, Jan 10, 2009 at 05:54:14PM -0800, Junio C Hamano wrote: > Kirill Smelkov <kirr@landau.phys.spbu.ru> writes: > > > [but I'm not sure whether testresult with Nathaniel Borenstein > > (םולש ןב ילטפנ) is correct -- see rfc2047-info-0004] > > ... > > diff --git a/t/t5100/rfc2047-info-0004 b/t/t5100/rfc2047-info-0004 > > new file mode 100644 > > index 0000000..850f831 > > --- /dev/null > > +++ b/t/t5100/rfc2047-info-0004 > > @@ -0,0 +1,5 @@ > > +Author: Nathaniel Borenstein > > + ([something that could be detected as spam]) > > +Email: nsb@thumper.bellcore.com > > +Subject: Test of new header generator > > + > > That does look wrong. If you can fix this, please do so; otherwise please > mark the test that deals with this entry with test_expect_failure, until > somebody else does. Yes, I think I've dealt with it -- we weren't unfolding the 'From' header, and we were not skipping comments in rfc822 headers, so: From: Kirill Smelkov <kirr@landau.phys.spbu.ru> Subject: [PATCH] mailinfo: 'From:' header should be unfold as well At present we do header unfolding (see RFC822 3.1.1. LONG HEADER FIELDS) for all fields except 'From' (always) and 'Subject' (when keep_subject is set). Not unfolding 'From' is a bug -- see above-mentioned RFC link. 
Signed-off-by: Kirill Smelkov <kirr@landau.phys.spbu.ru> --- builtin-mailinfo.c | 1 + t/t5100/sample.mbox | 5 ++++- 2 files changed, 5 insertions(+), 1 deletions(-) diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c index f7c8c08..6d72c1b 100644 --- a/builtin-mailinfo.c +++ b/builtin-mailinfo.c @@ -860,6 +860,7 @@ static void handle_info(void) } output_header_lines(fout, "Subject", hdr); } else if (!memcmp(header[i], "From", 4)) { + cleanup_space(hdr); handle_from(hdr); fprintf(fout, "Author: %s\n", name.buf); fprintf(fout, "Email: %s\n", email.buf); diff --git a/t/t5100/sample.mbox b/t/t5100/sample.mbox index 4bf7947..d465685 100644 --- a/t/t5100/sample.mbox +++ b/t/t5100/sample.mbox @@ -2,7 +2,10 @@ From nobody Mon Sep 17 00:00:00 2001 -From: A U Thor <a.u.thor@example.com> +From: A + U + Thor + <a.u.thor@example.com> Date: Fri, 9 Jun 2006 00:44:16 -0700 Subject: [PATCH] a commit. -- tg: (1562445..) t/mail-from-unfold (depends on: master) From: Kirill Smelkov <kirr@landau.phys.spbu.ru> Subject: [PATCH] mailinfo: more smarter removal of rfc822 comments from 'From' As described in RFC822 (3.4.3 COMMENTS, and A.1.4.), comments, as e.g. John (zzz) Doe <john.doe@xz> (Comment) should "NOT [be] included in the destination mailbox" We need this functionality to pass all RFC2047 based tests in the next commit. 
Signed-off-by: Kirill Smelkov <kirr@landau.phys.spbu.ru> --- builtin-mailinfo.c | 30 ++++++++++++++++++++++++++++++ t/t5100/sample.mbox | 4 ++-- 2 files changed, 32 insertions(+), 2 deletions(-) diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c index 6d72c1b..c0b1ab4 100644 --- a/builtin-mailinfo.c +++ b/builtin-mailinfo.c @@ -29,6 +29,9 @@ static struct strbuf **p_hdr_data, **s_hdr_data; #define MAX_HDR_PARSED 10 #define MAX_BOUNDARIES 5 +static void cleanup_space(struct strbuf *sb); + + static void get_sane_name(struct strbuf *out, struct strbuf *name, struct strbuf *email) { struct strbuf *src = name; @@ -120,6 +123,33 @@ static void handle_from(const struct strbuf *from) strbuf_setlen(&f, f.len - 1); } + /* This still could not be finished for emails like + * + * "John (zzz) Doe <john.doe@xz> (Comment)" + * + * The email part had already been removed, so let's kill comments as + * well -- RFC822 says comments should not be present in destination + * mailbox (3.4.3. Comments and A.1.4.) + */ + while (1) { + char *ta; + + at = strchr(f.buf, '('); + if (!at) + break; + ta = strchr(at, ')'); + if (!ta) + break; + + strbuf_remove(&f, at - f.buf, ta-at + (*ta ? 1 : 0)); + } + + /* and let's finally cleanup spaces that were around (possibly + * internal) comments + */ + cleanup_space(&f); + strbuf_trim(&f); + get_sane_name(&name, &f, &email); strbuf_release(&f); } diff --git a/t/t5100/sample.mbox b/t/t5100/sample.mbox index d465685..42e02f3 100644 --- a/t/t5100/sample.mbox +++ b/t/t5100/sample.mbox @@ -2,10 +2,10 @@ From nobody Mon Sep 17 00:00:00 2001 -From: A +From: A (zzz) U Thor - <a.u.thor@example.com> + <a.u.thor@example.com> (Comment) Date: Fri, 9 Jun 2006 00:44:16 -0700 Subject: [PATCH] a commit. -- tg: (b798ad9..) 
t/mail-from-comments (depends on: t/mail-from-unfold) All these patches + original one (trivially adapted) could be pulled from git://repo.or.cz/git/kirr.git for-junio Kirill Smelkov (3): mailinfo: 'From:' header should be unfold as well mailinfo: more smarter removal of rfc822 comments from 'From' mailinfo: correctly handle multiline 'Subject:' header builtin-mailinfo.c | 58 ++++++++++++++++++++++++++++++++++++------ t/t5100-mailinfo.sh | 24 ++++++++++++++++- t/t5100/info0012 | 5 +++ t/t5100/msg0012 | 7 +++++ t/t5100/patch0012 | 30 +++++++++++++++++++++ t/t5100/rfc2047-info-0001 | 4 +++ t/t5100/rfc2047-info-0002 | 4 +++ t/t5100/rfc2047-info-0003 | 4 +++ t/t5100/rfc2047-info-0004 | 4 +++ t/t5100/rfc2047-info-0005 | 2 + t/t5100/rfc2047-info-0006 | 2 + t/t5100/rfc2047-info-0007 | 2 + t/t5100/rfc2047-info-0008 | 2 + t/t5100/rfc2047-info-0009 | 2 + t/t5100/rfc2047-info-0010 | 2 + t/t5100/rfc2047-info-0011 | 2 + t/t5100/rfc2047-samples.mbox | 48 ++++++++++++++++++++++++++++++++++ t/t5100/sample.mbox | 57 ++++++++++++++++++++++++++++++++++++++++- 18 files changed, 249 insertions(+), 10 deletions(-) Thanks, Kirill ^ permalink raw reply related [flat|nested] 12+ messages in thread
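As a rough illustration of what unfolding a 'From' value like the one in the sample.mbox change above amounts to, here is a minimal C sketch. collapse_space() is a hypothetical stand-in written for this example; it is not git's cleanup_space(), and unlike a pure whitespace collapse it also trims both ends.

```c
#include <ctype.h>

/*
 * Hypothetical helper, not git's cleanup_space().
 *
 * Rewrites s in place so that every run of whitespace, including the
 * newline plus indentation of a folded header line, becomes a single
 * space. Leading and trailing whitespace is dropped.
 */
void collapse_space(char *s)
{
	char *src = s, *dst = s;
	int pending_space = 0;

	for (; *src; src++) {
		if (isspace((unsigned char)*src)) {
			pending_space = 1;
			continue;
		}
		if (pending_space && dst != s)
			*dst++ = ' ';	/* one space per whitespace run */
		pending_space = 0;
		*dst++ = *src;
	}
	*dst = '\0';
}
```

Applied to the folded value "A\n U\n Thor\n <a.u.thor@example.com>", this gives a single-line value from which the name and email can then be split.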
* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header 2009-01-12 22:34 ` Kirill Smelkov @ 2009-01-12 23:27 ` Junio C Hamano 2009-01-13 9:39 ` Kirill Smelkov 0 siblings, 1 reply; 12+ messages in thread From: Junio C Hamano @ 2009-01-12 23:27 UTC (permalink / raw) To: Kirill Smelkov; +Cc: Alexander Potashev, git Kirill Smelkov <kirr@landau.phys.spbu.ru> writes: > diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c > index f7c8c08..6d72c1b 100644 > --- a/builtin-mailinfo.c > +++ b/builtin-mailinfo.c > @@ -860,6 +860,7 @@ static void handle_info(void) > } > output_header_lines(fout, "Subject", hdr); > } else if (!memcmp(header[i], "From", 4)) { > + cleanup_space(hdr); > handle_from(hdr); > fprintf(fout, "Author: %s\n", name.buf); > fprintf(fout, "Email: %s\n", email.buf); > diff --git a/t/t5100/sample.mbox b/t/t5100/sample.mbox > index 4bf7947..d465685 100644 > --- a/t/t5100/sample.mbox > +++ b/t/t5100/sample.mbox > @@ -2,7 +2,10 @@ > > > From nobody Mon Sep 17 00:00:00 2001 > -From: A U Thor <a.u.thor@example.com> > +From: A > + U > + Thor > + <a.u.thor@example.com> > Date: Fri, 9 Jun 2006 00:44:16 -0700 > Subject: [PATCH] a commit. I think this is a reasonable change. But doesn't this > From nobody Mon Sep 17 00:00:00 2001 > -From: A > +From: A (zzz) > U > Thor > - <a.u.thor@example.com> > + <a.u.thor@example.com> (Comment) regress for people who spell their names like this? From: john.doe@email.xz (John Doe) ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header 2009-01-12 23:27 ` Junio C Hamano @ 2009-01-13 9:39 ` Kirill Smelkov 2009-01-14 8:19 ` Kirill Smelkov 0 siblings, 1 reply; 12+ messages in thread From: Kirill Smelkov @ 2009-01-13 9:39 UTC (permalink / raw) To: Junio C Hamano; +Cc: Alexander Potashev, git On Mon, Jan 12, 2009 at 03:27:44PM -0800, Junio C Hamano wrote: > Kirill Smelkov <kirr@landau.phys.spbu.ru> writes: > > > diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c > > index f7c8c08..6d72c1b 100644 > > --- a/builtin-mailinfo.c > > +++ b/builtin-mailinfo.c > > @@ -860,6 +860,7 @@ static void handle_info(void) > > } > > output_header_lines(fout, "Subject", hdr); > > } else if (!memcmp(header[i], "From", 4)) { > > + cleanup_space(hdr); > > handle_from(hdr); > > fprintf(fout, "Author: %s\n", name.buf); > > fprintf(fout, "Email: %s\n", email.buf); > > diff --git a/t/t5100/sample.mbox b/t/t5100/sample.mbox > > index 4bf7947..d465685 100644 > > --- a/t/t5100/sample.mbox > > +++ b/t/t5100/sample.mbox > > @@ -2,7 +2,10 @@ > > > > > > From nobody Mon Sep 17 00:00:00 2001 > > -From: A U Thor <a.u.thor@example.com> > > +From: A > > + U > > + Thor > > + <a.u.thor@example.com> > > Date: Fri, 9 Jun 2006 00:44:16 -0700 > > Subject: [PATCH] a commit. > > I think this is a reasonable change. Thanks. > But doesn't this > > > From nobody Mon Sep 17 00:00:00 2001 > > -From: A > > +From: A (zzz) > > U > > Thor > > - <a.u.thor@example.com> > > + <a.u.thor@example.com> (Comment) > > regress for people who spell their names like this? > > From: john.doe@email.xz (John Doe) I think everything is ok: There is an explicit handler for such emails before my comments removal in builtin-mailinfo.c: /* The remainder is name. It could be "John Doe <john.doe@xz>" * or "john.doe@xz (John Doe)", but we have removed the * email part, so trim from both ends, possibly removing * the () pair at the end. 
*/ strbuf_trim(&f); if (f.buf[0] == '(' && f.len && f.buf[f.len - 1] == ')') { strbuf_remove(&f, 0, 1); strbuf_setlen(&f, f.len - 1); } http://repo.or.cz/w/git.git?a=blob;f=builtin-mailinfo.c;h=f7c8c08b320c99d8bf96443ae57aa33c1de7e8c0;hb=HEAD#l112 And only a test for this is missing From 77316ad6db2c3b0f4be238c4ba855b2f785b50d6 Mon Sep 17 00:00:00 2001 From: Kirill Smelkov <kirr@mns.spb.ru> Date: Tue, 13 Jan 2009 12:33:48 +0300 Subject: [PATCH] mailinfo: add explicit test for mails like '<a.u.thor@example.com> (A U Thor)' Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru> --- t/t5100-mailinfo.sh | 2 +- t/t5100/info0013 | 5 +++++ t/t5100/sample.mbox | 5 +++++ 3 files changed, 11 insertions(+), 1 deletions(-) create mode 100644 t/t5100/info0013 create mode 100644 t/t5100/msg0013 create mode 100644 t/t5100/patch0013 diff --git a/t/t5100-mailinfo.sh b/t/t5100-mailinfo.sh index 625c204..e70ea94 100755 --- a/t/t5100-mailinfo.sh +++ b/t/t5100-mailinfo.sh @@ -11,7 +11,7 @@ test_expect_success 'split sample box' \ 'git mailsplit -o. "$TEST_DIRECTORY"/t5100/sample.mbox >last && last=`cat last` && echo total is $last && - test `cat last` = 12' + test `cat last` = 13' for mail in `echo 00*` do diff --git a/t/t5100/info0013 b/t/t5100/info0013 new file mode 100644 index 0000000..bbe049e --- /dev/null +++ b/t/t5100/info0013 @@ -0,0 +1,5 @@ +Author: A U Thor +Email: a.u.thor@example.com +Subject: a patch +Date: Fri, 9 Jun 2006 00:44:16 -0700 + diff --git a/t/t5100/msg0013 b/t/t5100/msg0013 new file mode 100644 index 0000000..e69de29 diff --git a/t/t5100/patch0013 b/t/t5100/patch0013 new file mode 100644 index 0000000..e69de29 diff --git a/t/t5100/sample.mbox b/t/t5100/sample.mbox index 4f80b82..c5ad206 100644 --- a/t/t5100/sample.mbox +++ b/t/t5100/sample.mbox @@ -556,3 +556,8 @@ index 3fd3afb..0ee807e 100644 #. 
на своей машине: отредактировать /etc/sudoers (команда ``visudo``) примерно следующим образом:: -- 1.5.6.5 +From nobody Mon Sep 17 00:00:00 2001 +From: <a.u.thor@example.com> (A U Thor) +Date: Fri, 9 Jun 2006 00:44:16 -0700 +Subject: [PATCH] a patch + -- 1.6.1.101.g0335 ^ permalink raw reply related [flat|nested] 12+ messages in thread
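The ordering argument in the reply above (the "()" pair handler runs first, comment removal second) can be sketched in C as follows. clean_name() is a hypothetical function written for this illustration, operating on a plain char buffer rather than a struct strbuf; it assumes the email part has already been removed and the remainder trimmed.

```c
#include <string.h>

/*
 * Hypothetical helper, not the code in builtin-mailinfo.c.
 *
 * s holds what is left of a From value once the email address has
 * been cut out and surrounding whitespace trimmed.
 */
void clean_name(char *s)
{
	size_t len = strlen(s);
	char *open, *close;

	/* "(John Doe)": the whole remainder is one pair, the name is inside. */
	if (len >= 2 && s[0] == '(' && s[len - 1] == ')') {
		memmove(s, s + 1, len - 2);
		s[len - 2] = '\0';
	}

	/* Any parenthesised run that is still present is treated as a comment. */
	while ((open = strchr(s, '(')) != NULL &&
	       (close = strchr(open, ')')) != NULL)
		memmove(open, close + 1, strlen(close + 1) + 1);
}
```

In this sketch "(John Doe)" becomes "John Doe" before the comment loop runs, so the "john.doe@email.xz (John Doe)" style keeps its name. Whitespace left behind by removed comments is not cleaned here; the patch does that afterwards with cleanup_space() and strbuf_trim(). Note that an input which both starts and ends with a parenthesis, such as "(zzz) A U Thor (Comment)", is taken as a single pair by this sketch.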
* Re: [BUG PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header 2009-01-13 9:39 ` Kirill Smelkov @ 2009-01-14 8:19 ` Kirill Smelkov 0 siblings, 0 replies; 12+ messages in thread From: Kirill Smelkov @ 2009-01-14 8:19 UTC (permalink / raw) To: Junio C Hamano; +Cc: Alexander Potashev, git On Tue, Jan 13, 2009 at 12:39:16PM +0300, Kirill Smelkov wrote: > On Mon, Jan 12, 2009 at 03:27:44PM -0800, Junio C Hamano wrote: > > Kirill Smelkov <kirr@landau.phys.spbu.ru> writes: [...] > > But doesn't this > > > > > From nobody Mon Sep 17 00:00:00 2001 > > > -From: A > > > +From: A (zzz) > > > U > > > Thor > > > - <a.u.thor@example.com> > > > + <a.u.thor@example.com> (Comment) > > > > regress for people who spell their names like this? > > > > From: john.doe@email.xz (John Doe) > > I think everything is ok: [...] Just in case it got spam-detected again: http://marc.info/?l=git&m=123183962105146&w=2 Thanks, Kirill ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header 2008-12-26 18:38 [PATCH RFC] mailinfo: correctly handle multiline 'Subject:' header Kirill Smelkov 2009-01-07 22:43 ` [BUG PATCH " Kirill Smelkov @ 2009-01-08 10:08 ` Alexander Potashev 1 sibling, 0 replies; 12+ messages in thread From: Alexander Potashev @ 2009-01-08 10:08 UTC (permalink / raw) To: Kirill Smelkov; +Cc: Junio C Hamano, git On 21:38 Fri 26 Dec , Kirill Smelkov wrote: > When native language (RU) is in use, subject header usually contains several > parts, e.g. > > Subject: [Navy-patches] [PATCH] > =?utf-8?b?0JjQt9C80LXQvdGR0L0g0YHQv9C40YHQvtC6INC/0LA=?= > =?utf-8?b?0LrQtdGC0L7QsiDQvdC10L7QsdGF0L7QtNC40LzRi9GFINC00LvRjyA=?= > =?utf-8?b?0YHQsdC+0YDQutC4?= > > t/t5100/info0012 | 5 ++++ > t/t5100/msg0012 | 7 ++++++ > t/t5100/patch0012 | 30 +++++++++++++++++++++++++++++ > t/t5100/sample.mbox | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++ > 6 files changed, 112 insertions(+), 2 deletions(-) The testcases are too long; a minimal mbox with an encoded "Subject:" would be enough to test the mailinfo parser, and that is all you need to test here. ^ permalink raw reply [flat|nested] 12+ messages in thread