From: Jan-Philip Gehrcke
Subject: Should the --encoding argument to log/show commands make any guarantees about their output?
Date: Mon, 15 Jun 2015 10:50:26 +0200
Message-ID: <557E91D2.3000908@googlemail.com>
To: git@vger.kernel.org

Hello,

I was surprised to see that the output of

    git log --encoding=utf-8 "--format=format:%b"

can contain byte sequences that are invalid in UTF-8. Note: I am using
git 2.1.4, and the %b format specifier represents the commit message
body.

I have seen this with the Linux git repository and the following test:

    git log --encoding=utf-8 "--format=format:%b" | python2 -c \
      'import sys; [l.decode("utf-8") for l in sys.stdin]'

Soon enough, errors like this appear:

    'utf8' codec can't decode byte 0xf6 in position 19

The help message for the --encoding argument reads:

> The commit objects record the encoding used for the log message in
> their encoding header; this option can be used to tell the command to
> re-code the commit log message in the encoding preferred by the user

I realize that this message does not guarantee anything about the
output of the command; in particular, it does not promise that
--encoding=utf-8 produces valid UTF-8 data in all cases.

However, I wonder what --encoding precisely does and whether it has the
behavior most users would expect. Let me describe what I think it
currently does:

The program attempts to re-code a log message, so it follows the chain

    raw input -> unicode -> raw output

For the first step, knowledge about the input encoding is required.
This is retrieved from the encoding header of the commit object if
present or (from the docs) "lack of this header implies that the commit
log message is encoded in UTF-8."
If this step fails (if the entry contains a byte=20 sequence that is invalid in the specified/assumed input codec), the=20 procedure is aborted and the data is dumped as is (obviously without=20 applying the requested output encoding). Is that correct? From my point of view the most natural abstraction of a log *message*=20 is *text*, not bytes. The same is true for author names. If I want to=20 build a tool chain on top of log/show, this usually means that I want t= o=20 work with text information. Hence, I want to retrieve text (a sequence=20 of code points) from git show/log. Text must be transported in encoded=20 form, sure, but it must not contain byte sequences that are invalid in=20 this codec. Because otherwise it's just not text anymore. Hence, from my point of view, the rational that git show/log should be=20 able to output *text* information means that they should not emit byte=20 sequences that are invalid in the codec specified via the --encoding=20 argument. In the current situation, the work of dealing with invalid=20 byte sequences is just outsourced to software further below in the tool= =20 chain (at some point a replacement character =EF=BF=BD should be displa= yed to=20 the user instead of the invalid raw bytes). I am not entirely sure where this discussion should lead to. However, I= =20 think that if the behavior of the software will not be changed, then th= e=20 documentation for the --encoding option should be more precise and=20 clarify what actually happens behind the scenes. What do you think? Cheers, Jan-Philip Gehrcke
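
For illustration only, here is a minimal Python 3 sketch of the chain
described above (raw input -> unicode -> raw output). It is a model of
the behavior guessed at in this mail, not git's actual implementation.
The names recode, header_encoding and replace_invalid are hypothetical
and exist only for this example.

```python
def recode(raw, header_encoding=None, output_encoding="utf-8",
           replace_invalid=False):
    """Model of re-coding one log message (hypothetical helper).

    raw             -- the message bytes as stored in the commit object
    header_encoding -- value of the commit's encoding header, or None
    replace_invalid -- False: pass the raw bytes through on failure (the
                       behavior this mail suspects); True: substitute
                       U+FFFD so the result is always valid output
    """
    # A missing encoding header is assumed to mean UTF-8, per the docs.
    input_encoding = header_encoding or "utf-8"
    try:
        # Step 1: raw input -> unicode
        text = raw.decode(input_encoding)
    except UnicodeDecodeError:
        if not replace_invalid:
            # Give up and dump the data as is, without re-coding.
            return raw
        # Alternative: keep going, marking undecodable bytes with U+FFFD.
        text = raw.decode(input_encoding, errors="replace")
    # Step 2: unicode -> raw output in the requested encoding
    return text.encode(output_encoding, errors="replace")


# A Latin-1 message whose commit correctly declares its encoding.
declared = recode(b"Gr\xf6\xdfe", header_encoding="iso-8859-1")

# The same bytes without an encoding header: assumed UTF-8, which fails.
passed_through = recode(b"Gr\xf6\xdfe")
replaced = recode(b"Gr\xf6\xdfe", replace_invalid=True)
```

In this model, passed_through still contains the byte 0xf6 and so is
not valid UTF-8, which matches the decode error shown earlier, while
replaced always decodes cleanly as UTF-8.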