From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jakub Narebski
Subject: [PATCH/RFCv2 (version B)] gitweb: Allow UTF-8 encoded CGI query parameters and path_info
Date: Fri, 3 Feb 2012 13:44:54 +0100
Message-ID: <201202031344.55750.jnareb@gmail.com>
References: <1328136653-20559-1-git-send-email-michal.kiedrowicz@gmail.com> <201202022357.29569.jnareb@gmail.com> <20120203083935.5d9d4b18@mkiedrowicz.ivo.pl>
In-Reply-To: <20120203083935.5d9d4b18@mkiedrowicz.ivo.pl>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Cc: git@vger.kernel.org
To: Michal Kiedrowicz
User-Agent: KMail/1.9.3
X-Mailing-List: git@vger.kernel.org

Gitweb tries hard to properly process UTF-8 data, by marking output
from git commands and contents of files as UTF-8 with the to_utf8()
subroutine.  This ensures that gitweb prints UTF-8 correctly, e.g. in
the 'log' and 'commit' views.

Unfortunately it misses another source of potentially Unicode input,
namely query parameters.  The result is that one cannot search for a
string containing characters outside US-ASCII.  For example, searching
for "Michał Kiedrowicz" (containing the letter 'ł' - LATIN SMALL LETTER
L WITH STROKE, with Unicode codepoint U+0142, represented with the
0xc5 0x82 bytes in UTF-8 and percent-encoded as %C5%82) results in the
following incorrect data in the search field

	MichaÅ\x{82} Kiedrowicz

(that is, 'Å' followed by the invisible control character U+0082).

This is caused by CGI by default treating the '0xc5 0x82' bytes as two
characters in the Perl legacy encoding latin-1 (iso-8859-1), because
the 's' query parameter is not processed explicitly as a UTF-8 encoded
string.

The solution used here follows the "Using Unicode in a Perl CGI script"
article at http://www.lemoda.net/cgi/perl-unicode/index.html:

	use CGI;
	use Encode 'decode_utf8';
	my $cgi = CGI->new;
	my $value = $cgi->param('input');
	$value = decode_utf8($value);

Decoding UTF-8 is done when filling the %input_params hash and the
$path_info variable; the former required moving from explicit $cgi->param(