[PATCH v2 1/2] send-email: align RFC 2047 decoding more closely with the spec

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 1/2] send-email: align RFC 2047 decoding more closely with the spec
@ 2014-12-06 19:36 Роман Донченко
  2014-12-06 19:36 ` [PATCH v2 2/2] send-email: handle adjacent RFC 2047-encoded words properly Роман Донченко
  0 siblings, 1 reply; 9+ messages in thread
From: Роман Донченко @ 2014-12-06 19:36 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason,
	Jay Soffian, Jeff King, Thomas Rast

More specifically:

* Add "\" to the list of characters not allowed in a token (see RFC 2047
  errata).

* Share regexes between unquote_rfc2047 and is_rfc2047_quoted. Besides
  removing duplication, this also makes unquote_rfc2047 more stringent.

* Allow both "q" and "Q" to identify the encoding.

* Allow lowercase hexadecimal digits in the "Q" encoding.

And, more on the cosmetic side:

* Change the "encoded-text" regex to exclude rather than include characters,
  for clarity and consistency with "token".

Signed-off-by: Роман Донченко <dpb@corrigendum.ru>
---
 git-send-email.perl | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/git-send-email.perl b/git-send-email.perl
index 9949db0..d461ffb 100755
--- a/git-send-email.perl
+++ b/git-send-email.perl
@@ -145,6 +145,11 @@ my $have_mail_address = eval { require Mail::Address; 1 };
 my $smtp;
 my $auth;
 
+# Regexes for RFC 2047 productions.
+my $re_token = qr/[^][()<>@,;:\\"\/?.= \000-\037\177-\377]+/;
+my $re_encoded_text = qr/[^? \000-\037\177-\377]+/;
+my $re_encoded_word = qr/=\?($re_token)\?($re_token)\?($re_encoded_text)\?=/;
+
 # Variables we fill in automatically, or via prompting:
 my (@to,$no_to,@initial_to,@cc,$no_cc,@initial_cc,@bcclist,$no_bcc,@xh,
 	$initial_reply_to,$initial_subject,@files,
@@ -913,15 +918,20 @@ $time = time - scalar $#files;
 
 sub unquote_rfc2047 {
 	local ($_) = @_;
-	my $encoding;
-	s{=\?([^?]+)\?q\?(.*?)\?=}{
-		$encoding = $1;
-		my $e = $2;
-		$e =~ s/_/ /g;
-		$e =~ s/=([0-9A-F]{2})/chr(hex($1))/eg;
-		$e;
+	my $charset;
+	s{$re_encoded_word}{
+		$charset = $1;
+		my $encoding = $2;
+		my $text = $3;
+		if ($encoding eq 'q' || $encoding eq 'Q') {
+			$text =~ s/_/ /g;
+			$text =~ s/=([0-9A-F]{2})/chr(hex($1))/egi;
+			$text;
+		} else {
+			$&; # other encodings not supported yet
+		}
 	}eg;
-	return wantarray ? ($_, $encoding) : $_;
+	return wantarray ? ($_, $charset) : $_;
 }
 
 sub quote_rfc2047 {
@@ -934,10 +944,8 @@ sub quote_rfc2047 {
 
 sub is_rfc2047_quoted {
 	my $s = shift;
-	my $token = qr/[^][()<>@,;:"\/?.= \000-\037\177-\377]+/;
-	my $encoded_text = qr/[!->@-~]+/;
 	length($s) <= 75 &&
-	$s =~ m/^(?:"[[:ascii:]]*"|=\?$token\?$token\?$encoded_text\?=)$/o;
+	$s =~ m/^(?:"[[:ascii:]]*"|$re_encoded_word)$/o;
 }
 
 sub subject_needs_rfc2047_quoting {
-- 
2.1.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2 2/2] send-email: handle adjacent RFC 2047-encoded words properly
  2014-12-06 19:36 [PATCH v2 1/2] send-email: align RFC 2047 decoding more closely with the spec Роман Донченко
@ 2014-12-06 19:36 ` Роман Донченко
  2014-12-07  9:18   ` Jeff King
  0 siblings, 1 reply; 9+ messages in thread
From: Роман Донченко @ 2014-12-06 19:36 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason,
	Jay Soffian, Jeff King, Thomas Rast

The RFC says that they are to be concatenated after decoding (i.e. the
intervening whitespace is ignored).
---
 git-send-email.perl   | 26 ++++++++++++++++----------
 t/t9001-send-email.sh |  7 +++++++
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/git-send-email.perl b/git-send-email.perl
index d461ffb..7d5cc8a 100755
--- a/git-send-email.perl
+++ b/git-send-email.perl
@@ -919,17 +919,23 @@ $time = time - scalar $#files;
 sub unquote_rfc2047 {
 	local ($_) = @_;
 	my $charset;
-	s{$re_encoded_word}{
-		$charset = $1;
-		my $encoding = $2;
-		my $text = $3;
-		if ($encoding eq 'q' || $encoding eq 'Q') {
-			$text =~ s/_/ /g;
-			$text =~ s/=([0-9A-F]{2})/chr(hex($1))/egi;
-			$text;
-		} else {
-			$&; # other encodings not supported yet
+	my $sep = qr/[ \t]+/;
+	s{$re_encoded_word(?:$sep$re_encoded_word)*}{
+		my @words = split $sep, $&;
+		foreach (@words) {
+			m/$re_encoded_word/;
+			$charset = $1;
+			my $encoding = $2;
+			my $text = $3;
+			if ($encoding eq 'q' || $encoding eq 'Q') {
+				$_ = $text;
+				s/_/ /g;
+				s/=([0-9A-F]{2})/chr(hex($1))/egi;
+			} else {
+				# other encodings not supported yet
+			}
 		}
+		join '', @words;
 	}eg;
 	return wantarray ? ($_, $charset) : $_;
 }
diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
index 19a3ced..fa965ff 100755
--- a/t/t9001-send-email.sh
+++ b/t/t9001-send-email.sh
@@ -240,6 +240,13 @@ test_expect_success $PREREQ 'non-ascii self name is suppressed' "
 		'non_ascii_self_suppressed'
 "
 
+# This name is long enough to force format-patch to split it into multiple
+# encoded-words, assuming it uses UTF-8 with the "Q" encoding.
+test_expect_success $PREREQ 'long non-ascii self name is suppressed' "
+	test_suppress_self_quoted 'Ƒüñníęř €. Nâṁé' 'odd_?=mail@example.com' \
+		'long_non_ascii_self_suppressed'
+"
+
 test_expect_success $PREREQ 'sanitized self name is suppressed' "
 	test_suppress_self_unquoted '\"A U. Thor\"' 'author@example.com' \
 		'self_name_sanitized_suppressed'
-- 
2.1.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 2/2] send-email: handle adjacent RFC 2047-encoded words properly
  2014-12-06 19:36 ` [PATCH v2 2/2] send-email: handle adjacent RFC 2047-encoded words properly Роман Донченко
@ 2014-12-07  9:18   ` Jeff King
  2014-12-07 14:35     ` Роман Донченко
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff King @ 2014-12-07  9:18 UTC (permalink / raw)
  To: Роман Донченко
  Cc: git, Junio C Hamano, Ævar Arnfjörð Bjarmason,
	Jay Soffian, Thomas Rast

On Sat, Dec 06, 2014 at 10:36:23PM +0300, Роман Донченко wrote:

> The RFC says that they are to be concatenated after decoding (i.e. the
> intervening whitespace is ignored).

Thanks. Both patches look good to me, and I'd be happy to have them
applied as-is. I wrote a few comments below, but in all cases I think I
convinced myself that what you wrote is best.

> +	my $sep = qr/[ \t]+/;
> +	s{$re_encoded_word(?:$sep$re_encoded_word)*}{
> +		my @words = split $sep, $&;
> +		foreach (@words) {
> +			m/$re_encoded_word/;
> +			$charset = $1;
> +			my $encoding = $2;
> +			my $text = $3;

It feels a little weird to have to split and rematch $re_encoded_word in
the loop, but I don't think there is a way around it. ($1, $2, $3) will
have our first word, and ($4, $5, $6) will have the final (if any), but
I don't think we can get access to what is in between.

So I think what you have here is the best we can do.

> +			if ($encoding eq 'q' || $encoding eq 'Q') {
> +				$_ = $text;
> +				s/_/ /g;
> +				s/=([0-9A-F]{2})/chr(hex($1))/egi;

It took me a minute to figure out why this works. $_ is a reference to
the iterator for @words, so it is re-assigning that element of the array
first to the encoded text, and then modifying it in place.

I wonder if it would be more obvious like this:

  join '',
  map {
          m/$re_encoded_word/;
	  $charset = $1;
	  my $encoding = $2;
	  my $text = $3;
          if ($encoding eq 'q' || $encoding eq 'Q') {
	    $text =~ s/_/ /g;
	    $text =~ s=([0-9A-F]{2}/chr(hex($1))/egi;
	  } else {
	    # other encoding not supported yet
	  }
  } split($sep, $&);

I dunno. I kind of like your version better now that I understand it,
but it did take me a minute.

One final note on this bit of code: if there are multiple encoded words,
we grab the $charset from the final encoded word, and never report the
earlier charsets. Technically they do not all have to be the same
(rfc2047 even has an example where they are not). I think we can dismiss
this, though, as:

  1. It was like this before your patches (we might have seen multiple
     non-adjacent encoded words; you're just handling adjacent ones),
     and nobody has complained.

  2. Using two separate encodings in the same header is sufficiently
     ridiculous that I can live with us not handling it properly.

> +# This name is long enough to force format-patch to split it into multiple
> +# encoded-words, assuming it uses UTF-8 with the "Q" encoding.
> +test_expect_success $PREREQ 'long non-ascii self name is suppressed' "
> +	test_suppress_self_quoted 'Ƒüñníęř €. Nâṁé' 'odd_?=mail@example.com' \
> +		'long_non_ascii_self_suppressed'
> +"

Cute. :)

-Peff

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 2/2] send-email: handle adjacent RFC 2047-encoded words properly
  2014-12-07  9:18   ` Jeff King
@ 2014-12-07 14:35     ` Роман Донченко
  2014-12-07 17:48       ` Philip Oakley
  0 siblings, 1 reply; 9+ messages in thread
From: Роман Донченко @ 2014-12-07 14:35 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Junio C Hamano, Ævar Arnfjörð Bjarmason,
	Jay Soffian, Thomas Rast

Jeff King <peff@peff.net> писал в своём письме Sun, 07 Dec 2014 12:18:59  
+0300:

> On Sat, Dec 06, 2014 at 10:36:23PM +0300, Роман Донченко wrote:
>
>> The RFC says that they are to be concatenated after decoding (i.e. the
>> intervening whitespace is ignored).
>
> Thanks. Both patches look good to me, and I'd be happy to have them
> applied as-is. I wrote a few comments below, but in all cases I think I
> convinced myself that what you wrote is best.

I had the same concerns myself, and eventually convinced myself of the  
same. :-)

> One final note on this bit of code: if there are multiple encoded words,
> we grab the $charset from the final encoded word, and never report the
> earlier charsets. Technically they do not all have to be the same
> (rfc2047 even has an example where they are not). I think we can dismiss
> this, though, as:
>
>   1. It was like this before your patches (we might have seen multiple
>      non-adjacent encoded words; you're just handling adjacent ones),
>      and nobody has complained.
>
>   2. Using two separate encodings in the same header is sufficiently
>      ridiculous that I can live with us not handling it properly.

Yeah, that bugs me as well. But I think handling multiple encodings would  
require substantial reworking of the code, so I chickened out (with the  
same excuses :-)).

Roman.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 2/2] send-email: handle adjacent RFC 2047-encoded words properly
  2014-12-07 14:35     ` Роман Донченко
@ 2014-12-07 17:48       ` Philip Oakley
  2014-12-07 18:17         ` Роман Донченко
  0 siblings, 1 reply; 9+ messages in thread
From: Philip Oakley @ 2014-12-07 17:48 UTC (permalink / raw)
  To: Jeff King,
	Роман Донченко
  Cc: git, Junio C Hamano, Ævar Arnfjörð Bjarmason,
	Jay Soffian, Thomas Rast

From: "Роман Донченко" <dpb@corrigendum.ru>
> Jeff King <peff@peff.net> писал в своём письме Sun, 07 Dec 2014 
> 12:18:59  +0300:
>
>> On Sat, Dec 06, 2014 at 10:36:23PM +0300, Роман Донченко wrote:
>>
>>> The RFC says that they are to be concatenated after decoding (i.e. 
>>> the
>>> intervening whitespace is ignored).
>>
>> Thanks. Both patches look good to me, and I'd be happy to have them
>> applied as-is. I wrote a few comments below, but in all cases I think 
>> I
>> convinced myself that what you wrote is best.
>
> I had the same concerns myself, and eventually convinced myself of the 
> same. :-)
>
>> One final note on this bit of code: if there are multiple encoded 
>> words,
>> we grab the $charset from the final encoded word, and never report 
>> the
>> earlier charsets. Technically they do not all have to be the same
>> (rfc2047 even has an example where they are not). I think we can 
>> dismiss
>> this, though, as:
>>
>>   1. It was like this before your patches (we might have seen 
>> multiple
>>      non-adjacent encoded words; you're just handling adjacent ones),
>>      and nobody has complained.
>>
>>   2. Using two separate encodings in the same header is sufficiently
>>      ridiculous that I can live with us not handling it properly.
>
> Yeah, that bugs me as well. But I think handling multiple encodings 
> would  require substantial reworking of the code, so I chickened out 
> (with the  same excuses :-)).

Would that be worth a terse comment in the documentation change part of 
the patch?
"Multiple  (RFC2047) encodings are not supported.",
or would that be bike shed noise.

Philip 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 2/2] send-email: handle adjacent RFC 2047-encoded words properly
  2014-12-07 17:48       ` Philip Oakley
@ 2014-12-07 18:17         ` Роман Донченко
  2014-12-10 22:21           ` Philip Oakley
  0 siblings, 1 reply; 9+ messages in thread
From: Роман Донченко @ 2014-12-07 18:17 UTC (permalink / raw)
  To: Jeff King, Philip Oakley
  Cc: git, Junio C Hamano, Ævar Arnfjörð Bjarmason,
	Jay Soffian, Thomas Rast

Philip Oakley <philipoakley@iee.org> писал в своём письме Sun, 07 Dec 2014  
20:48:05 +0300:

> From: "Роман Донченко" <dpb@corrigendum.ru>
>> Jeff King <peff@peff.net> писал в своём письме Sun, 07 Dec 2014  
>> 12:18:59  +0300:
>>
>>> On Sat, Dec 06, 2014 at 10:36:23PM +0300, Роман Донченко wrote:
>>> One final note on this bit of code: if there are multiple encoded  
>>> words,
>>> we grab the $charset from the final encoded word, and never report the
>>> earlier charsets. Technically they do not all have to be the same
>>> (rfc2047 even has an example where they are not). I think we can  
>>> dismiss
>>> this, though, as:
>>>
>>>   1. It was like this before your patches (we might have seen multiple
>>>      non-adjacent encoded words; you're just handling adjacent ones),
>>>      and nobody has complained.
>>>
>>>   2. Using two separate encodings in the same header is sufficiently
>>>      ridiculous that I can live with us not handling it properly.
>>
>> Yeah, that bugs me as well. But I think handling multiple encodings  
>> would  require substantial reworking of the code, so I chickened out  
>> (with the  same excuses :-)).
>
> Would that be worth a terse comment in the documentation change part of  
> the patch?
> "Multiple  (RFC2047) encodings are not supported.",
> or would that be bike shed noise.

I didn't change any documentation... and in either case, they weren't  
supported in the first place, so I don't think it's anything I need to  
mention.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 2/2] send-email: handle adjacent RFC 2047-encoded words properly
  2014-12-07 18:17         ` Роман Донченко
@ 2014-12-10 22:21           ` Philip Oakley
  0 siblings, 0 replies; 9+ messages in thread
From: Philip Oakley @ 2014-12-10 22:21 UTC (permalink / raw)
  To: Роман Донченко
  Cc: git, Junio C Hamano, Ævar Arnfjörð Bjarmason,
	Jay Soffian, Thomas Rast, Jeff King

From: "Роман Донченко" <dpb@corrigendum.ru>
Sent: Sunday, December 07, 2014 6:17 PM
>> Would that be worth a terse comment in the documentation change part 
>> of  the patch?
>> "Multiple  (RFC2047) encodings are not supported.",
>> or would that be bike shed noise.
>
> I didn't change any documentation... and in either case, they weren't 
> supported in the first place, so I don't think it's anything I need to 
> mention.

I'd confused this with the crossing thread by Paolo Bonzini 
<bonzini@gnu.org> [PATCH 2/2] git-send-email: add --transfer-encoding 
option; 25 November 2014 14:00. $gmane/260217.

Sorry for the noise.
--
Philip 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 1/2] send-email: align RFC 2047 decoding more closely with the spec
@ 2014-12-14 15:59 Роман Донченко
  2014-12-15 23:13 ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Роман Донченко @ 2014-12-14 15:59 UTC (permalink / raw)
  To: gitster; +Cc: git

More specifically:

* Add "\" to the list of characters not allowed in a token (see RFC 2047
  errata).

* Share regexes between unquote_rfc2047 and is_rfc2047_quoted. Besides
  removing duplication, this also makes unquote_rfc2047 more stringent.

* Allow both "q" and "Q" to identify the encoding.

* Allow lowercase hexadecimal digits in the "Q" encoding.

And, more on the cosmetic side:

* Change the "encoded-text" regex to exclude rather than include characters,
  for clarity and consistency with "token".

Signed-off-by: Роман Донченко <dpb@corrigendum.ru>
Acked-by: Jeff King <peff@peff.net>
---
 git-send-email.perl | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/git-send-email.perl b/git-send-email.perl
index 9949db0..d461ffb 100755
--- a/git-send-email.perl
+++ b/git-send-email.perl
@@ -145,6 +145,11 @@ my $have_mail_address = eval { require Mail::Address; 1 };
 my $smtp;
 my $auth;
 
+# Regexes for RFC 2047 productions.
+my $re_token = qr/[^][()<>@,;:\\"\/?.= \000-\037\177-\377]+/;
+my $re_encoded_text = qr/[^? \000-\037\177-\377]+/;
+my $re_encoded_word = qr/=\?($re_token)\?($re_token)\?($re_encoded_text)\?=/;
+
 # Variables we fill in automatically, or via prompting:
 my (@to,$no_to,@initial_to,@cc,$no_cc,@initial_cc,@bcclist,$no_bcc,@xh,
 	$initial_reply_to,$initial_subject,@files,
@@ -913,15 +918,20 @@ $time = time - scalar $#files;
 
 sub unquote_rfc2047 {
 	local ($_) = @_;
-	my $encoding;
-	s{=\?([^?]+)\?q\?(.*?)\?=}{
-		$encoding = $1;
-		my $e = $2;
-		$e =~ s/_/ /g;
-		$e =~ s/=([0-9A-F]{2})/chr(hex($1))/eg;
-		$e;
+	my $charset;
+	s{$re_encoded_word}{
+		$charset = $1;
+		my $encoding = $2;
+		my $text = $3;
+		if ($encoding eq 'q' || $encoding eq 'Q') {
+			$text =~ s/_/ /g;
+			$text =~ s/=([0-9A-F]{2})/chr(hex($1))/egi;
+			$text;
+		} else {
+			$&; # other encodings not supported yet
+		}
 	}eg;
-	return wantarray ? ($_, $encoding) : $_;
+	return wantarray ? ($_, $charset) : $_;
 }
 
 sub quote_rfc2047 {
@@ -934,10 +944,8 @@ sub quote_rfc2047 {
 
 sub is_rfc2047_quoted {
 	my $s = shift;
-	my $token = qr/[^][()<>@,;:"\/?.= \000-\037\177-\377]+/;
-	my $encoded_text = qr/[!->@-~]+/;
 	length($s) <= 75 &&
-	$s =~ m/^(?:"[[:ascii:]]*"|=\?$token\?$token\?$encoded_text\?=)$/o;
+	$s =~ m/^(?:"[[:ascii:]]*"|$re_encoded_word)$/o;
 }
 
 sub subject_needs_rfc2047_quoting {
-- 
2.1.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 1/2] send-email: align RFC 2047 decoding more closely with the spec
  2014-12-14 15:59 [PATCH v2 1/2] send-email: align RFC 2047 decoding more closely with the spec Роман Донченко
@ 2014-12-15 23:13 ` Junio C Hamano
  0 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2014-12-15 23:13 UTC (permalink / raw)
  To: Роман Донченко
  Cc: git

Thanks; queued both patches.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2014-12-15 23:13 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-12-06 19:36 [PATCH v2 1/2] send-email: align RFC 2047 decoding more closely with the spec Роман Донченко
2014-12-06 19:36 ` [PATCH v2 2/2] send-email: handle adjacent RFC 2047-encoded words properly Роман Донченко
2014-12-07  9:18   ` Jeff King
2014-12-07 14:35     ` Роман Донченко
2014-12-07 17:48       ` Philip Oakley
2014-12-07 18:17         ` Роман Донченко
2014-12-10 22:21           ` Philip Oakley
  -- strict thread matches above, loose matches on Subject: below --
2014-12-14 15:59 [PATCH v2 1/2] send-email: align RFC 2047 decoding more closely with the spec Роман Донченко
2014-12-15 23:13 ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).