From: "Pali Rohár" <pali.rohar@gmail.com>
To: util-linux@vger.kernel.org, "Steve Kenton" <skenton@ou.edu>,
"Vojtěch Vladyka" <xvlady00@stud.feec.vutbr.cz>,
"Jan Kara" <jack@suse.cz>, "Karel Zak" <kzak@redhat.com>
Subject: libblkid: udf: Incorrect implementation of Unicode strings
Date: Fri, 12 May 2017 16:38:59 +0200
Message-ID: <201705121638.59416@pali>
Hi!
Since the beginning, libblkid's udf code has handled 16bit OSTA
compressed unicode as UTF-16BE and 8bit OSTA compressed unicode as UTF-8.
The UDF 2.01 specification says:
====
For a CompressionID of 8 or 16, the value of the CompressionID shall
specify the number of BitsPerCharacter for the d-characters defined in
the CharacterBitStream field. Each sequence of CompressionID bits in the
CharacterBitStream field shall represent an OSTA Compressed Unicode d-
character. The bits of the character being encoded shall be added to the
CharacterBitStream from most- to least-significant-bit. The bits shall
be added to the CharacterBitStream starting from the most significant
bit of the current byte being encoded into. The value of the OSTA
Compressed Unicode d-character interpreted as a Uint16 defines the value
of the corresponding d-character in the Unicode 2.0 standard.
====
So an 8bit OSTA compressed unicode buffer contains a sequence of
Unicode code points, one per 8 bits, which is effectively equivalent
to the Latin1 (ISO-8859-1) encoding.
And 16bit OSTA compressed unicode is a sequence of Unicode code points,
one per 16 bits in big endian, which is probably only UCS-2 and not full
UTF-16.
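
A minimal Python sketch of this reading of the spec may make it more
concrete. The function name decode_osta_cs0 is only illustrative (it is
not libblkid code), and it assumes the first byte of the buffer is the
CompressionID, as in the label bytes quoted further down:

```python
def decode_osta_cs0(buf: bytes) -> str:
    """Decode an OSTA compressed unicode buffer (hypothetical helper).

    Assumes buf[0] is the CompressionID and the rest is the
    CharacterBitStream.
    """
    if not buf:
        return ""
    comp_id, data = buf[0], buf[1:]
    if comp_id == 8:
        # One code point per byte, which is exactly Latin1 (ISO-8859-1).
        return data.decode("latin-1")
    if comp_id == 16:
        # One code point per 16 bits, big endian. Each unit is taken as a
        # code point on its own (UCS-2 reading, no surrogate pairing).
        if len(data) % 2:
            raise ValueError("odd number of bytes in 16bit stream")
        return "".join(
            chr(int.from_bytes(data[i:i + 2], "big"))
            for i in range(0, len(data), 2)
        )
    raise ValueError(f"unsupported CompressionID {comp_id}")
```

For example, decode_osta_cs0(b"\x08abc") gives "abc" and
decode_osta_cs0(b"\x10\x01\x00") gives the single character U+0100.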
So the problem is with 8bit OSTA compressed unicode when it contains
bytes which are not UTF-8 invariants (ASCII), as those are decoded
differently as Latin1 and as UTF-8.
This means the libblkid udf implementation of reading Unicode strings
is wrong, and this affects all read operations (Label, UUID, ...).
To verify this problem I prepared a small udf image (attached) which has
the logical volume identifier (known as label): 0x08 0xC3 0xBF 0x00 ... 0x03
According to the spec it should be decoded as the string "Ã¿"
(LATIN CAPITAL LETTER A WITH TILDE, INVERTED QUESTION MARK).
But blkid shows me "ÿ" (LATIN SMALL LETTER Y WITH DIAERESIS).
I checked the grub2 and Windows implementations and they show "Ã¿".
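
The difference can be reproduced on the label bytes alone. This is only
a sketch in Python, not the blkid code path: it takes the first three
bytes of the identifier quoted above and decodes the two character bytes
both ways (the variable names are illustrative):

```python
# CompressionID 8 followed by the two character bytes from the test image.
label = bytes([0x08, 0xC3, 0xBF])
payload = label[1:]

# Reading per the spec: one code point per byte (Latin1).
per_spec = payload.decode("latin-1")

# Reading the same bytes as UTF-8, which is the behaviour reported for blkid.
as_utf8 = payload.decode("utf-8")

print([f"U+{ord(c):04X}" for c in per_spec])  # ['U+00C3', 'U+00BF']
print([f"U+{ord(c):04X}" for c in as_utf8])   # ['U+00FF']
```

So the same two bytes give two characters under the spec reading and a
single U+00FF when treated as UTF-8.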
So... what to do with the blkid implementation? Fixing it would break
all existing labels and uuids on Linux. Not fixing it would mean having
different labels across different systems which implement it
properly.
The problem appeared when I sent a patch implementing the same UUID
algorithm in grub2. (The patch has not been merged yet.)
Note that Linux's mkudffs from udftools generates the label correctly,
so it is also incompatible with the blkid implementation. But because I
tested only ASCII characters and Unicode characters above U+FF, I did
not detect this problem... (ASCII is the same in UTF-8 and Latin1, and
chars above U+FF can be encoded only as UTF-16 resp. UCS-2)
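
Why those two kinds of test strings hide the problem can be sketched in
Python. Everything here is an assumption for illustration: encode_cs0
is a hypothetical encoder that picks CompressionID 8 whenever all
characters fit into one byte (mkudffs may choose differently),
decode_old mimics the UTF-8 / UTF-16BE reading described above, and
errors="replace" is only there to keep the sketch from raising.

```python
def encode_cs0(text: str) -> bytes:
    """Hypothetical encoder: 8bit if every code point is <= U+FF, else 16bit."""
    if all(ord(c) <= 0xFF for c in text):
        return b"\x08" + text.encode("latin-1")
    # BMP characters only; each code point becomes one big endian 16bit unit.
    return b"\x10" + b"".join(ord(c).to_bytes(2, "big") for c in text)


def decode_old(buf: bytes) -> str:
    """Mimics the reported behaviour: 8bit as UTF-8, 16bit as UTF-16BE."""
    comp_id, data = buf[0], buf[1:]
    if comp_id == 8:
        return data.decode("utf-8", errors="replace")
    return data.decode("utf-16-be", errors="replace")


def roundtrips(text: str) -> bool:
    """True if the old decoding gives back the string that was encoded."""
    return decode_old(encode_cs0(text)) == text


for sample in ["abc", "\u010d", "\u00ff", "\u00c3\u00bf"]:
    print(ascii(sample), roundtrips(sample))
```

Under these assumptions the ASCII string and U+010D round-trip fine,
while U+00FF and the two-character label from the test image do not, so
only characters in the range U+80 to U+FF expose the difference.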
--
Pali Rohár
pali.rohar@gmail.com
[-- Attachment #1.2: udf.img.xz --]
[-- Type: application/x-xz, Size: 2376 bytes --]