From: Kevin Bracey <kevin@bracey.fi>
To: "Torsten Bögershausen" <tboegi@web.de>
Cc: Peter Krefting <peter@softwolves.pp.se>, git@vger.kernel.org
Subject: Re: [PATCH] Unicode: update of combining code points
Date: Wed, 16 Apr 2014 13:51:43 +0300
Message-ID: <534E60BF.5020602@bracey.fi>
In-Reply-To: <534E0B84.6070602@web.de>

On 16/04/2014 07:48, Torsten Bögershausen wrote:
> On 15.04.14 21:10, Peter Krefting wrote:
>> Torsten Bögershausen:
>>
>>> diff --git a/utf8.c b/utf8.c
>>> index a831d50..77c28d4 100644
>>> --- a/utf8.c
>>> +++ b/utf8.c
>> Is there a script that generates this code from the Unicode database files, or did you hand-update it?
>>
> Some of the code points which have "0 length on the display" are called
> "combining", others are called "vowels" or "accents".
> E.g. 5BF is not marked any of them, but if you look at the glyph, it should
> be combining (please correct me if that is wrong).

Indeed it is combining (more specifically it has General Category 
"Nonspacing_Mark" = "Mn").
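
One quick way to check this is sketched below, assuming Python's standard
unicodedata module. It reports the data of whichever Unicode version is
bundled with the interpreter, which is not necessarily the latest UCD.

```python
import unicodedata

# U+05BF is the code point discussed above.
ch = "\u05bf"

category = unicodedata.category(ch)          # two-letter General Category
combining_class = unicodedata.combining(ch)  # canonical combining class

print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {category}, ccc={combining_class}")
```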

>
> If I could have found a file which indicates for each code point, what it
> is, I could write a script.
>

The most complete and machine-readable data are in these files:

http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt

The general categories can also be seen more legibly in:

http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt

For docs, see:

http://www.unicode.org/reports/tr44/
http://www.unicode.org/reports/tr11/
http://www.unicode.org/ucd/
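
As a rough sketch of what a script over UnicodeData.txt could look like:
each line is a semicolon-separated record whose first field is the code
point in hex and whose third field is the General Category. The function
name and the embedded sample lines are only illustrative; a real script
would read the downloaded file instead.

```python
# A few records in the UnicodeData.txt layout, embedded so the sketch
# runs without network access.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;;;;
0488;COMBINING CYRILLIC HUNDRED THOUSANDS SIGN;Me;0;NSM;;;;;N;;;;;
05BF;HEBREW POINT RAFE;Mn;23;NSM;;;;;N;;;;;
200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
"""

# Categories treated as zero width, per the utf8.c comments.
ZERO_WIDTH_CATEGORIES = {"Cf", "Mn", "Me"}


def zero_width_code_points(lines, wanted=ZERO_WIDTH_CATEGORIES):
    """Return the sorted code points whose General Category is in `wanted`."""
    found = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)  # field 0: code point in hex
        if fields[2] in wanted:          # field 2: General Category
            found.append(code_point)
    return sorted(found)


points = zero_width_code_points(SAMPLE.splitlines())
print([f"U+{cp:04X}" for cp in points])
```

Note that UnicodeData.txt encodes a few large blocks as "First"/"Last"
pairs rather than one line per code point. This sketch does not expand
them; as far as I know none of them fall in the three categories above.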

The existing utf8.c comments describe the attributes being selected from
the tables (general categories "Cf", "Mn", "Me"; East Asian Width "W",
"F"). They also suggest that the combining character table was originally
auto-generated from UnicodeData.txt with a "uniset" tool. Presumably this one?

https://github.com/depp/uniset
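
For illustration only, here is a sketch of the kind of step such a tool
performs: merging individual code points into inclusive ranges and
printing them as a C table. This is not uniset's actual output format.
The table name and the { first, last } pair layout are assumptions, and
the input list is a made-up sample.

```python
def to_ranges(code_points):
    """Merge code points into sorted, inclusive (first, last) runs."""
    ranges = []
    for cp in sorted(set(code_points)):
        if ranges and cp == ranges[-1][1] + 1:
            # Extends the current run by one.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            # Starts a new run.
            ranges.append((cp, cp))
    return ranges


def format_c_table(name, ranges):
    """Render ranges as a C array of { first, last } pairs (assumed layout)."""
    rows = ",\n".join(
        f"\t{{ 0x{first:04X}, 0x{last:04X} }}" for first, last in ranges
    )
    return f"static const struct interval {name}[] = {{\n{rows}\n}};"


# Hypothetical input, standing in for code points selected from UnicodeData.txt.
sample_points = [0x0300, 0x0301, 0x0302, 0x05BF, 0x05C1, 0x05C2]
ranges = to_ranges(sample_points)
table = format_c_table("zero_width", ranges)
print(table)
```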

The fullwidth-checking code looks like it was done by hand, although 
apparently uniset can process EastAsianWidth.txt.
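
If the fullwidth table were generated as well, a sketch could look like
the following. It assumes the EastAsianWidth.txt layout of a code point
or "first..last" range, a semicolon, the width property, and an optional
"#" comment. The function name and the short sample are illustrative, not
taken from any existing tool.

```python
# A few lines in the EastAsianWidth.txt layout, embedded for the sketch.
SAMPLE = """\
# abbreviated sample
0020;Na # comment
1100..115F;W # comment
3000;F # comment
FF01..FF60;F # comment
"""


def wide_ranges(lines, wanted=("W", "F")):
    """Return sorted inclusive (first, last) ranges whose width is in `wanted`."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        span, width = (part.strip() for part in data.split(";")[:2])
        if width not in wanted:
            continue
        first, _, last = span.partition("..")
        # A single code point has no "..", so `last` falls back to `first`.
        ranges.append((int(first, 16), int(last or first, 16)))
    return sorted(ranges)


wide = wide_ranges(SAMPLE.splitlines())
print([(f"U+{a:04X}", f"U+{b:04X}") for a, b in wide])
```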

Kevin

