From: Kevin Bracey
Subject: Re: [PATCH] Unicode: update of combining code points
Date: Wed, 16 Apr 2014 13:51:43 +0300
Message-ID: <534E60BF.5020602@bracey.fi>
References: <201404072130.15686.tboegi@web.de> <534E0B84.6070602@web.de>
In-Reply-To: <534E0B84.6070602@web.de>
To: Torsten Bögershausen
Cc: Peter Krefting, git@vger.kernel.org

On 16/04/2014 07:48, Torsten Bögershausen wrote:
> On 15.04.14 21:10, Peter Krefting wrote:
>> Torsten Bögershausen:
>>
>>> diff --git a/utf8.c b/utf8.c
>>> index a831d50..77c28d4 100644
>>> --- a/utf8.c
>>> +++ b/utf8.c
>> Is there a script that generates this code from the Unicode database files, or did you hand-update it?
>>
> Some of the code points which have "0 length on the display" are called
> "combining", others are called "vowels" or "accents".
> E.g. 5BF is not marked any of them, but if you look at the glyph, it should
> be combining (please correct me if that is wrong).

Indeed it is combining (more specifically it has General Category
"Nonspacing_Mark" = "Mn").

>
> If I could have found a file which indicates for each code point, what it
> is, I could write a script.
>

The most complete and machine-readable data are in these files:

http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt

The general categories can also be seen more legibly in:

http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt

For docs, see:

http://www.unicode.org/reports/tr44/
http://www.unicode.org/reports/tr11/
http://www.unicode.org/ucd/

The existing utf8.c comments describe the attributes being selected from
the tables (general categories "Cf","Mn","Me", East Asian Width "W",
"F"). And they suggest that the combining character table was originally
auto-generated from UnicodeData.txt with a "uniset" tool. Presumably this?

https://github.com/depp/uniset

The fullwidth-checking code looks like it was done by hand, although
apparently uniset can process EastAsianWidth.txt.

Kevin
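
As a rough sketch of the kind of script discussed above (not the uniset
tool, and not how the utf8.c table was actually produced), the following
Python reads lines in the semicolon-separated UnicodeData.txt layout and
collapses the code points with general category "Cf", "Mn" or "Me" into
ranges. The function name, the sample lines (trailing fields left empty)
and the C-style output layout are assumptions made for illustration.

```python
def zero_width_ranges(lines, categories=("Cf", "Mn", "Me")):
    """Return sorted, inclusive (first, last) ranges of code points whose
    General Category (third field) is in `categories`."""
    points = []
    range_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        cp, name, cat = int(fields[0], 16), fields[1], fields[2]
        # Large blocks are listed as a "<..., First>" / "<..., Last>" pair.
        if name.endswith(", First>"):
            range_start = cp
            continue
        if name.endswith(", Last>") and range_start is not None:
            first, range_start = range_start, None
        else:
            first = cp
        if cat in categories:
            points.extend(range(first, cp + 1))

    # Merge adjacent code points into inclusive ranges.
    ranges = []
    for cp in sorted(points):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp
        else:
            ranges.append([cp, cp])
    return [tuple(r) for r in ranges]


# Hypothetical sample input; a real run would read UnicodeData.txt instead.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;;
0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;;;;;
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;;;;;
05BE;HEBREW PUNCTUATION MAQAF;Pd;0;R;;;;;N;;;;;
05BF;HEBREW POINT RAFE;Mn;23;NSM;;;;;N;;;;;
""".splitlines()

if __name__ == "__main__":
    for first, last in zero_width_ranges(SAMPLE):
        # Output layout is an assumption, loosely modelled on a C table row.
        print("{ 0x%04X, 0x%04X }," % (first, last))
```

To run it against the real data, pass the lines of a downloaded
UnicodeData.txt to zero_width_ranges() in place of SAMPLE. Whether the
resulting ranges match the table in utf8.c would still need checking by
hand.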