From: Jeff King
Subject: Re: [PATCH] Fix Q-encoded multi-octet-char split in email.
Date: Wed, 4 Jul 2012 02:44:50 -0400
Message-ID: <20120704064450.GA24807@sigill.intra.peff.net>
References: <1341279697-4596-1-git-send-email-gkatsu.ne@gmail.com> <20120703063511.GA16679@sigill.intra.peff.net>
To: Katsuyama Takeharu
Cc: git@vger.kernel.org, gitster@pobox.com, Takeharu Katsuyama

On Wed, Jul 04, 2012 at 03:19:31PM +0900, Katsuyama Takeharu wrote:

> diff --git a/pretty.c b/pretty.c
> --- a/pretty.c
> +++ b/pretty.c
> @@ -272,6 +272,13 @@ static void add_rfc2047(struct strbuf *sb, const char *line, int len,
>  	static const int max_length = 78; /* per rfc2822 */
>  	int i;
>  	int line_len;
> +	int utf8_ctr, use_utf8;
> +	const char *utf8_start;
> +
> +	if (is_encoding_utf8(encoding) && encoding != NULL)
> +		use_utf8 = 1;
> +	else
> +		use_utf8 = 0;

I think you can drop the "encoding != NULL" here. If we don't have an
explicit encoding, git always assumes utf8 (also, as it happens, we
never hit this point with a NULL encoding in the current code anyway,
though that could in theory change in the future).

> > Can we re-use utf8_width here instead of rewriting these rules?
>
> Yes you can. But there is an issue: utf8_width does not seem to
> return the correct value. It returns 2 even when the provided code is
> a 3-octet utf-8 char (e.g. 0xE38292).
> I expect it to return 3.

Hmm. I think I may have led you astray. It seems that the return value
of utf8_width is not about the byte-width of the character
representation, but rather about the intended character-width of the
glyph. But since we are encoding the bytes, we care about the former.

So I think you would really want to use pick_one_utf8_char and see how
many bytes it consumed, like this:

  const char *p = &line[i];
  pick_one_utf8_char(&p, NULL);
  if (!p) /* not valid utf8, just assume single byte */
    utf8_ctr = 1;
  else
    utf8_ctr = p - &line[i];

-Peff
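
To make the byte-width versus glyph-width distinction concrete outside
of git, here is a rough standalone sketch. utf8_byte_len() is a
hypothetical helper invented for this illustration, not git's
pick_one_utf8_char(); it only mirrors the fallback suggested above
(treat anything malformed as a single byte). It assumes the caller
knows how many bytes remain in the buffer, and as a simplification it
does not reject overlong forms or surrogates.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of git): return the number of bytes
 * occupied by the UTF-8 sequence starting at s, or 1 if the bytes do
 * not form a well-formed sequence. Returns 0 only for empty input.
 * Simplification: overlong encodings and surrogates are not rejected.
 */
size_t utf8_byte_len(const char *s, size_t remaining)
{
	unsigned char lead;
	size_t len, k;

	if (!remaining)
		return 0;

	lead = (unsigned char)s[0];
	if (lead < 0x80)
		return 1;	/* plain ASCII */
	else if ((lead & 0xe0) == 0xc0)
		len = 2;
	else if ((lead & 0xf0) == 0xe0)
		len = 3;
	else if ((lead & 0xf8) == 0xf0)
		len = 4;
	else
		return 1;	/* stray continuation or invalid lead byte */

	if (len > remaining)
		return 1;	/* truncated sequence, assume single byte */

	for (k = 1; k < len; k++)
		if (((unsigned char)s[k] & 0xc0) != 0x80)
			return 1;	/* bad continuation byte */

	return len;
}
```

For the example from the thread, the bytes E3 82 92 give a length of 3
here. That byte count, not the display width of the glyph, is what
matters when deciding where a Q-encoded word may be split.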