From: "Torsten Bögershausen" <tboegi@web.de>
To: Junio C Hamano <gitster@pobox.com>
Cc: "Jonathan Nieder" <jrnieder@gmail.com>,
git@vger.kernel.org, "Torsten Bögershausen" <tboegi@web.de>
Subject: Re: [PATCH] Unicode: update of combining code points
Date: Wed, 09 Apr 2014 18:48:29 +0200
Message-ID: <534579DD.1060607@web.de>
In-Reply-To: <xmqq61mj30tg.fsf@gitster.dls.corp.google.com>
On 04/09/2014 12:37 AM, Junio C Hamano wrote:
> Jonathan Nieder <jrnieder@gmail.com> writes:
>
>> Torsten Bögershausen wrote:
>>
>>> Unicode 6.3 defines the following code as combining or accents,
>>> git_wcwidth() should return 0.
>>>
>>> Earlier unicode standards had defined these code point as "reserved":
>> Thanks for the update. Could the commit message also explain how this
>> was noticed and what the user-visible effect is?
>>
>> For example:
>>
>> "Unicode just announced that <...>. That means we should mark the
>> relevant code points as combining characters so git knows they are
>> zero-width and doesn't screw up the alignment when presenting branch
>> names in columns with 'git branch --column'"
>>
>> or something like that.
> Perhaps (the original read clearly enough for me, though).
>
>> [...]
>>> 358 COMBINING DOT ABOVE RIGHT
>>> 359 COMBINING ASTERISK BELOW
>> I'm not sure this list is needed --- the code + the reference to the
>> Unicode 6.3 standard seems like enough (but if you think otherwise,
>> I don't really mind).
> I can go either way.
>
>>> This commit touches only the range 300-6FF, there may be more to be updated.
>> The "there may be more" here sounds ominous.
> Indeed it does ;-)
>
>> Does that mean Unicode
>> 6.3 also added some zero-width characters in other ranges that should
>> be dealt with in the future? How many such ranges? How do we know
>> when we're done?
>>
>> Just biting off the most important characters first and putting off
>> the rest for later sounds fine to me --- my complaint is that the
>> above comment doesn't make clear what the to-do list is for finishing
>> the update later.
> I'll queue this at the tip of 'pu', not to forget about it while
> waiting for a clarification.
>
> Thanks.
Thanks for the comments; here comes the long version of the story:
I recently fooled myself by running
"git config --global user.name" with a decomposed "ö" on a new Mac OS X machine.
While there were few problems on Mac OS, all Windows and Linux machines stumbled
over the decomposed ö, or more exactly over 0x308, COMBINING DIAERESIS (the two dots),
giving all kinds of weird output in "git log".
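To make the decomposed form concrete, here is a minimal sketch in Python (not git code). It only uses the standard unicodedata module; the helper names naive_width() and mark_aware_width() are made up for illustration and are not how git_wcwidth() is implemented.

```python
import unicodedata

precomposed = "\u00f6"  # LATIN SMALL LETTER O WITH DIAERESIS
decomposed = unicodedata.normalize("NFD", precomposed)

# The decomposed form is the base letter "o" followed by U+0308.
codepoints = [f"U+{ord(ch):04X}" for ch in decomposed]

def naive_width(s):
    """Hypothetical helper: one column per code point, marks included."""
    return len(s)

def mark_aware_width(s):
    """Hypothetical helper: zero columns for code points whose canonical
    combining class is non-zero (a simplification of a real width table)."""
    return sum(0 if unicodedata.combining(ch) else 1 for ch in s)

# Recomposing (NFC) gives back the single precomposed code point.
recomposed = unicodedata.normalize("NFC", decomposed)
```

A width function that does not know about 0x308 counts two columns for the decomposed ö, while one that treats it as zero-width counts one.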
Looking into commit.c and utf8.c to see how to improve the situation, I made these observations:
- Some code from commit.c can possibly be moved into utf8.c, so that we have
only one UTF-8 parser.
- A solution would be to run precompose_string() under Mac OS (it is a no-op elsewhere).
This would have saved my day. I will probably make a patch some day.
- Some of the combining code points exist in Unicode 6.3, but not in utf8.c
(which seems to be based on Unicode >2.0 <6.3)
I found some in the 0x300 area, looked at the neighbors, and had enough time to
read all code charts up to 0x7FF.
So if somebody knows how to find out which code points are combining characters or accents, in other words which ones should return 0 in git_wcwidth(), please let me know.
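One possible way to generate such a list, sketched in Python rather than taken from git: walk a code point range and collect everything whose General_Category is a nonspacing or enclosing mark. Treating exactly the categories "Mn" and "Me" as zero-width is an assumption here, and the result depends on the Unicode data shipped with the Python interpreter (see unicodedata.unidata_version), which is not necessarily 6.3. The function name zero_width_ranges() is hypothetical.

```python
import unicodedata

# Assumption: nonspacing (Mn) and enclosing (Me) marks are the candidates.
ZERO_WIDTH_CATEGORIES = {"Mn", "Me"}

def zero_width_ranges(first, last):
    """Return inclusive (start, end) ranges of mark code points in [first, last]."""
    ranges = []
    for cp in range(first, last + 1):
        if unicodedata.category(chr(cp)) not in ZERO_WIDTH_CATEGORIES:
            continue
        if ranges and ranges[-1][1] == cp - 1:
            ranges[-1] = (ranges[-1][0], cp)  # extend the current run
        else:
            ranges.append((cp, cp))           # start a new run
    return ranges

if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for start, end in zero_width_ranges(0x300, 0x7FF):
        print(f"{{ 0x{start:04X}, 0x{end:04X} }},")
```

The printed ranges could then be compared by hand against the table in utf8.c; running the same function over a larger range would show what is left outside 0..7FF.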
How about this as a commit message:
Unicode: partially update to version 6.3
Unicode 6.3 defines the following code points as combining or accents;
git_wcwidth() should return 0 for them.
Earlier Unicode standards had defined these code points as "reserved":
358--35C
487
5A2, 5BA, 5C5, 5C7
604, 616--61A, 659--65F
Note: for this commit only the range 0..7FF has been checked,
more updates may be needed.
Signed-off-by: Torsten Bögershausen <tboegi@web.de>