From mboxrd@z Thu Jan 1 00:00:00 1970 From: =?utf-8?B?0KDQvtC80LDQvSDQlNC+0L3Rh9C10L3QutC+?= Subject: Re: [PATCH] send-email: handle adjacent RFC 2047-encoded words properly Date: Mon, 24 Nov 2014 21:26:22 +0300 Message-ID: References: <1416786604-4988-1-git-send-email-dpb@corrigendum.ru> <20141124153609.GA25912@peff.net> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed delsp=yes Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: git@vger.kernel.org, "Junio C Hamano" , =?utf-8?Q?=C3=86var_Arnfj=C3=B6r=C3=B0_Bjar?= =?utf-8?Q?mason?= , "Jay Soffian" , "Thomas Rast" To: "Jeff King" X-From: git-owner@vger.kernel.org Mon Nov 24 19:26:36 2014 Return-path: Envelope-to: gcvg-git-2@plane.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1XsyLI-0007dB-Kh for gcvg-git-2@plane.gmane.org; Mon, 24 Nov 2014 19:26:32 +0100 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753598AbaKXS02 convert rfc822-to-quoted-printable (ORCPT ); Mon, 24 Nov 2014 13:26:28 -0500 Received: from forward9l.mail.yandex.net ([84.201.143.142]:34212 "EHLO forward9l.mail.yandex.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751537AbaKXS01 convert rfc822-to-8bit (ORCPT ); Mon, 24 Nov 2014 13:26:27 -0500 Received: from smtp2m.mail.yandex.net (smtp2m.mail.yandex.net [77.88.61.129]) by forward9l.mail.yandex.net (Yandex) with ESMTP id 20D52E610C9; Mon, 24 Nov 2014 21:26:24 +0300 (MSK) Received: from smtp2m.mail.yandex.net (localhost [127.0.0.1]) by smtp2m.mail.yandex.net (Yandex) with ESMTP id 55E934208B9; Mon, 24 Nov 2014 21:26:23 +0300 (MSK) Received: from cs7-mts-1-124.dialup.mts-nn.ru (cs7-mts-1-124.dialup.mts-nn.ru [82.208.71.124]) by smtp2m.mail.yandex.net (nwsmtp/Yandex) with ESMTPSA id i2S17R8nAN-QMauC98n; Mon, 24 Nov 2014 21:26:22 +0300 (using TLSv1 with cipher AES256-SHA (256/256 bits)) (Client certificate not present) X-Yandex-Uniq: d832ac02-4e78-4b40-8d42-34912e4a4479 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=corrigendum.ru; s=mail; t=1416853582; bh=2cyqjyARzZNkUvGOHyRQ/9eJUeKlWxpe1qOiEjUDjwE=; h=Content-Type:To:Cc:Subject:References:Date:MIME-Version: Content-Transfer-Encoding:From:Message-ID:In-Reply-To:User-Agent; b=z/+ePKy31nUXei2knHhRrMOjvQyk+kvDHZsvQdDNkB7ekUCR2QsPlM2JqiQ+yvVFe UBAKh6yEi2Z8EIa2KdrVQAdgC+uf1zw/a3uLYbnIJxs/aw6Vd+0QIEPGRvGux4+H++ aMN3mgjf1SVNQGx4FrNO+H6A2knISflDJMJDgeuE= Authentication-Results: smtp2m.mail.yandex.net; dkim=pass header.i=@corrigendum.ru In-Reply-To: <20141124153609.GA25912@peff.net> User-Agent: Opera Mail/12.17 (Win64) Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: Jeff King =D0=BF=D0=B8=D1=81=D0=B0=D0=BB =D0=B2 =D1=81=D0= =B2=D0=BE=D1=91=D0=BC =D0=BF=D0=B8=D1=81=D1=8C=D0=BC=D0=B5 Mon, 24 Nov = 2014 18:36:09 =20 +0300: > On Mon, Nov 24, 2014 at 02:50:04AM +0300, =D0=A0=D0=BE=D0=BC=D0=B0=D0= =BD =D0=94=D0=BE=D0=BD=D1=87=D0=B5=D0=BD=D0=BA=D0=BE wrote: > >> The RFC says that they are to be concatenated after decoding (i.e. t= he >> intervening whitespace is ignored). >> >> I change the sender's name to an all-Cyrillic string in the tests so= =20 >> that >> its encoded form goes over the 76 characters in a line limit, forcin= g >> format-patch to split it into multiple encoded words. >> >> Since I have to modify the regular expression for an encoded word =20 >> anyway, >> I take the opportunity to bring it closer to the spec, most notably >> disallowing embedded spaces and making it case-insensitive (thus =20 >> allowing >> the encoding to be specified as both "q" and "Q"). > > The overall goal makes sense to me. Thanks for working on this. I hav= e a > few questions/comments, though. > >> sub unquote_rfc2047 { >> local ($_) =3D @_; >> + >> + my $et =3D qr/[!->@-~]+/; # encoded-text from RFC 2047 >> + my $sep =3D qr/[ \t]+/; >> + my $encoded_word =3D qr/=3D\?($et)\?q\?($et)\?=3D/i; > > The first $et in $encoded_word is actually the charset, which is defi= ned > by RFC 2047 as: > > charset =3D token ; see section 3 > > token =3D 1* > > especials =3D "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / " > <"> / "/" / "[" / "]" / "?" / "." / "=3D" > > Your regex is a little more liberal. I doubt that it is a big deal in > practice (actually, in practice, I suspect [a-zA-Z0-9-] would be fine= ). > But if we are tightening things up in general, it may make sense to d= o > so here (and I notice that is_rfc2047_quoted does a more thorough $to= ken > definition, and it probably makes sense for the two functions to be > consistent). Yeah, I did realize that token is more restrictive than encoded-text, b= ut =20 I didn't want to stray too far from the subject line of the patch. What= =20 I'll probably do is split the patch into two, one for regex tweaking an= d =20 one for multiple-word handling. And yeah, I'll try to make the two =20 functions use the same regexes. > > For your definition of encoded-text, RFC 2047 says: > > encoded-text =3D 1* or SPACE> > > It looks like you pulled the definition of $et from is_rfc2047_quoted= , > but I am not clear on where that original came from (it is from a3a82= 62, > but that commit message does not explain the regex). No, it's actually an independent discovery. :-) I don't think it needs = =20 explanation, though - it's just a character class with two ranges cover= ing =20 every printable character but the question mark. > Also, I note that we handle 'q'-style encodings here, but not 'b'. I > wonder if it is worth adding that in while we are in the area (it is = not > a big deal if you always send-email git-generated patches, as we neve= r > generate it). I could add "b" decoding, but since format-patch never generates "b" =20 encodings, testing would be a problem. And I'd rather not do it without= =20 any tests. > >> + s{$encoded_word(?:$sep$encoded_word)+}{ > > If I am reading this right, it requires at least two $encoded_words. > Should this "+" be a "*"? I hang my head in shame. Looks like I'll have to add more tests... > >> + my @words =3D split $sep, $&; >> + foreach (@words) { >> + m/$encoded_word/; >> + $encoding =3D $1; >> + $_ =3D $2; >> + s/_/ /g; >> + s/=3D([0-9A-F]{2})/chr(hex($1))/eg; > > In the spirit of your earlier change, should this final regex be > case-insensitive? RFC 2047 says only "Upper case should be used for > hexadecimal digits "A" through "F." but that does not seem like a "MU= ST" > to me. Sounds reasonable. Roman.