From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jakub Narebski Subject: Re: [RFD] Handling of non-UTF8 data in gitweb Date: Sat, 10 Dec 2011 17:18:33 +0100 Message-ID: <201112101718.34848.jnareb@gmail.com> References: <201112041709.32212.jnareb@gmail.com> <7vehwhcj3q.fsf@alter.siamese.dyndns.org> Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Cc: git@vger.kernel.org, =?iso-8859-1?q?J=FCrgen_Kreileder?= , John Hawley To: Junio C Hamano X-From: git-owner@vger.kernel.org Sat Dec 10 17:18:47 2011 Return-path: Envelope-to: gcvg-git-2@lo.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by lo.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1RZPdM-00038V-Jy for gcvg-git-2@lo.gmane.org; Sat, 10 Dec 2011 17:18:44 +0100 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1750956Ab1LJQSk (ORCPT ); Sat, 10 Dec 2011 11:18:40 -0500 Received: from mail-ey0-f174.google.com ([209.85.215.174]:40590 "EHLO mail-ey0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750826Ab1LJQSj (ORCPT ); Sat, 10 Dec 2011 11:18:39 -0500 Received: by eaak14 with SMTP id k14so2196397eaa.19 for ; Sat, 10 Dec 2011 08:18:38 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=from:to:subject:date:user-agent:cc:references:in-reply-to :mime-version:content-type:content-transfer-encoding :content-disposition:message-id; bh=R1LvSBv8IpFh5opKnJK7UiG0J5omZPx4A97dnPT77q8=; b=LI0BapXpslmHkpoN30IugJU9MNFZN087ed9RYPlqNzv5ECWbpTvOj8tH8jaMHU0mdI FP40RTK8MXot2Cb6mC0UBXmc/FoKecvqq7JXsV7ddB+VLsWoohRtFXQuEMvh1VL4vA3l qJYbYbGjAgfXEyijl2KwnAfq1X0DSqSNhQ+sw= Received: by 10.213.27.15 with SMTP id g15mr2354156ebc.111.1323533918038; Sat, 10 Dec 2011 08:18:38 -0800 (PST) Received: from [192.168.1.13] (abvt218.neoplus.adsl.tpnet.pl. [83.8.217.218]) by mx.google.com with ESMTPS id d6sm47767364eec.10.2011.12.10.08.18.35 (version=TLSv1/SSLv3 cipher=OTHER); Sat, 10 Dec 2011 08:18:36 -0800 (PST) User-Agent: KMail/1.9.3 In-Reply-To: <7vehwhcj3q.fsf@alter.siamese.dyndns.org> Content-Disposition: inline Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: On Wed, 7 Dec 2011, Junio C Hamano wrote: > Jakub Narebski writes: > > > But doing this would change gitweb behavior. Currently when > > encountering something (usually line of output) that is not valid > > UTF-8, we decode it (to UTF-8) using $fallback_encoding, by default > > 'latin1'. Note however that this value is per gitweb installation, > > not per repository. > > I think we added and you acked 00f429a (gitweb: Handle non UTF-8 text > better, 2007-06-03) for a good reason, and I think the above argues that > whatever issue the commit tried to address is a non-issue. Is it really > true? I think that UTF-8 is much more prevalent character encoding in operating systems, programming languages, APIs, and software applications than it was 4 years ago. Also the solution implemented in said commit was a good start, but it remains incomplete: $fallback_encoding is per-installation which is too big granularity (there is `gui.encoding` per-repository config... but it is about main not fallback encoding; best would be to use gitattribute but currently there is no way to check attribute value at given revision). The proposed use open qw(:std :utf8); and removal of to_utf8 and $fallback_encoding would be regression compared to post-00f429a... but the tradeoff of more robust UTF-8 handling might be worth it. Note that to_utf8 handles git command output part by part, not as a whole; for UTF-8 vs latin1 (i.e. iso-8859-1) it does not matter though because latin1 is very unlikely to be recognized as valid utf-8[1], and ASCII characters pass-through for UTF-8. [1]: http://en.wikipedia.org/wiki/UTF-8#Advantages > > ... I guess > > it could be emulated by defining our own 'utf-8-with-fallback' > > encoding, or by defining our own PerlIO layer with PerlIO::via. > > But it no longer be simple solution (though still automatic). > > Between the current "everybody who read from the input must remember to > call to_utf8" and "everybody gets to read utf8 automatically for internal > processing", even though the latter may involve one-time pain to set up > the framework to do so, the pros-and-cons feels obvious to me. There is also a matter of performance (':utf8' and ':encoding(UTF-8)' are AFAIK implemented in C, both the Encode part and PerlIO part). -- Jakub Narebski Poland