From: Gabriel Krisman Bertazi <krisman@collabora.com>
To: "Pali Rohár" <pali.rohar@gmail.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	"Theodore Y. Ts'o" <tytso@mit.edu>,
	Namjae Jeon <linkinjeon@gmail.com>
Subject: Re: vfat: Broken case-insensitive support for UTF-8
Date: Tue, 21 Jan 2020 19:25:18 -0500
Message-ID: <85wo9knxqp.fsf@collabora.com>
In-Reply-To: <20200120214046.f6uq7rlih7diqahz@pali> ("Pali Rohár"'s message of "Mon, 20 Jan 2020 22:40:46 +0100")

Pali Rohár <pali.rohar@gmail.com> writes:
> On Monday 20 January 2020 21:07:12 OGAWA Hirofumi wrote:
>> Pali Rohár <pali.rohar@gmail.com> writes:
>>
>> >> To be perfect, the table would have to emulate what Windows use. It can
>> >> be unicode standard, or something other.
>> >
>> > Windows FAT32 implementation (fastfat.sys) is opensource. So it should
>> > be possible to inspect code and figure out how it is working.
>> >
>> > I will try to look at it.
>>
>> I don't think the conversion library is not in fs driver though,
>> checking implement itself would be good.
>
> Ok, I did some research. It took me it longer as I thought as lot of
> stuff is undocumented and hard to find all relevant information.
>
> So... fastfat.sys is using ntos function RtlUpcaseUnicodeString() which
> takes UTF-16 string and returns upper case UTF-16 string. There is no
> mapping table in fastfat.sys driver itself.
>
> RtlUpcaseUnicodeString() is a ntos kernel function and after my research
> it seems that this function is using only conversion table stored in
> file l_intl.nls (from c:\windows\system32).
>
> Project wine describe this file as "unicode casing tables" and seems
> that it can parse this file format. Even more it distributes its own
> version of this file which looks like to be generated from official
> Unicode UnicodeData.txt via Perl script make_unicode (part of wine).
>
> So question is... how much is MS changing l_intl.nls file in their
> released Windows versions?
>
> I would try to decode what is format of that file l_intl.nls and try to
> compare data in it from some Windows versions.
>
> Can we reuse upper case mapping table from that file?

Regarding fs/unicode, we have some infrastructure to parse UCD files,
handle Unicode versioning, and store the data in a more compact
structure. See the mkutf8data script.

Right now, we only store the mapping from each code point to its NFD +
full casefold form, but it would be possible to extend the parsing
script to store the un-normalized uppercase version in the data
structure as well. So, if l_intl.nls is generated from UnicodeData.txt,
you might consider extending fs/unicode to store it. We store the code
points in a format optimized for decoding UTF-8, but the infrastructure
is halfway there already.
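
To make the difference between the two mappings concrete, here is a
rough user-space sketch in Python. It is not the mkutf8data code and
says nothing about the l_intl.nls format; it only assumes the documented
UnicodeData.txt layout, where field 12 holds the simple uppercase
mapping. The function names and the four-record sample are made up for
illustration, and Python's str.casefold() plus NFD is used only as an
approximation of what fs/unicode stores today.

```python
import unicodedata

# Four records copied from UnicodeData.txt. A real run would read the
# whole file for the Unicode version of interest.
SAMPLE_UCD = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;
00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9
"""

# Field 12 of UnicodeData.txt is Simple_Uppercase_Mapping.
SIMPLE_UPPERCASE_FIELD = 12


def parse_simple_uppercase(lines):
    """Return {code point: simple uppercase code point} for the records given."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        if len(fields) < 15:
            continue  # not a complete UnicodeData.txt record
        upper = fields[SIMPLE_UPPERCASE_FIELD]
        if upper:  # an empty field means the code point has no simple mapping
            table[int(fields[0], 16)] = int(upper, 16)
    return table


def upcase_simple(name, table):
    """Upper-case one code point at a time, with no normalization."""
    return "".join(chr(table.get(ord(ch), ord(ch))) for ch in name)


def nfd_casefold(name):
    """Approximate the NFD + full casefold form with the Python stdlib."""
    return unicodedata.normalize("NFD", name.casefold())


if __name__ == "__main__":
    table = parse_simple_uppercase(SAMPLE_UCD.splitlines())
    for cp, up in sorted(table.items()):
        print(f"U+{cp:04X} -> U+{up:04X}")
    # Show both forms of the same name as code point lists.
    name = "a\u00e9"
    print([f"U+{ord(c):04X}" for c in upcase_simple(name, table)])
    print([f"U+{ord(c):04X}" for c in nfd_casefold(name)])
```

The simple uppercase mapping keeps U+00E9 as a single code point
(U+00C9), while the NFD + casefold form decomposes it into U+0065
U+0301. That is why the table we store today cannot be reused as-is for
an uppercase comparison.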
--
Gabriel Krisman Bertazi