Re: libblkid: udf: Incorrect implementation of Unicode strings

From: "Pali Rohár" <pali.rohar@gmail.com>
To: Karel Zak <kzak@redhat.com>
Cc: util-linux@vger.kernel.org, "Steve Kenton" <skenton@ou.edu>,
	"Vojtěch Vladyka" <xvlady00@stud.feec.vutbr.cz>,
	"Jan Kara" <jack@suse.cz>
Subject: Re: libblkid: udf: Incorrect implementation of Unicode strings
Date: Wed, 17 May 2017 00:17:19 +0200
Message-ID: <201705170017.19752@pali>
In-Reply-To: <20170515100940.lggq2um6xt3dg66p@ws.net.home>

On Monday 15 May 2017 12:09:40 Karel Zak wrote:
> On Fri, May 12, 2017 at 04:38:59PM +0200, Pali Rohár wrote:
> > Hi!
> > 
> > Since the beginning, libblkid's UDF code has handled 16-bit OSTA
> > compressed Unicode as UTF-16BE and 8-bit OSTA compressed Unicode
> > as UTF-8.
> > 
> > The UDF 2.01 specification says:
> > ====
> > For a CompressionID of 8 or 16, the value of the CompressionID
> > shall specify the number of BitsPerCharacter for the d-characters
> > defined in the CharacterBitStream field. Each sequence of
> > CompressionID bits in the CharacterBitStream field shall represent
> > an OSTA Compressed Unicode d-character. The bits of the character
> > being encoded shall be added to the CharacterBitStream from most-
> > to least-significant-bit. The bits shall be added to the
> > CharacterBitStream starting from the most significant bit of the
> > current byte being encoded into. The value of the OSTA Compressed
> > Unicode d-character interpreted as a Uint16 defines the value of
> > the corresponding d-character in the Unicode 2.0 standard.
> > ====
> > 
> > So an 8-bit OSTA compressed Unicode buffer contains a sequence of
> > Unicode code points, one per 8 bits, which is effectively
> > equivalent to the Latin-1 (ISO-8859-1) encoding.
> > 
> > And 16-bit OSTA compressed Unicode means a sequence of Unicode
> > code points, one per 16 bits in big endian, which is probably only
> > UCS-2 and not full UTF-16.
> > 
> > So the problem is with 8-bit OSTA compressed Unicode when it
> > contains bytes which are not UTF-8 invariants (ASCII), as those
> > are decoded differently as Latin-1 and as UTF-8.
> > 
> > This means the libblkid UDF implementation of reading Unicode
> > strings is wrong, and it affects all read operations (label, UUID,
> > ...).
> > 
> > To verify this problem I prepared a small UDF image (attached)
> > which has the logical volume identifier (known as the label):
> > 0x08 0xC3 0xBF 0x00 ... 0x03
> > 
> > According to the spec it should be decoded as the string "Ã¿"
> > (LATIN CAPITAL LETTER A WITH TILDE, INVERTED QUESTION MARK).
> > 
> > But blkid shows me "ÿ" (LATIN SMALL LETTER Y WITH DIAERESIS).
> > 
> > I checked the grub2 and Windows implementations and they show "Ã¿".
> > 
> > So... what to do with the blkid implementation? Fixing it would
> > break all existing labels and UUIDs on Linux. Not fixing it would
> > mean having different labels than other systems which implement
> > it properly.
> 
> The issue has never been reported, so I guess the number of the
> affected LABELs is pretty small :-)
> 
> From my point of view it would be better to follow the standard, fix
> the issue and be compatible with the other utils and systems. It
> would be nice to fix it now for v2.30 where we already have changes
> in udf/iso stuff. Please, send the patch :-)

A fix for all UDF strings except the UUID is in this pull request:
https://github.com/karelzak/util-linux/pull/438

I hope it is correct now. A UDF image with "Ã¿" has been added to the tests.
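
For illustration, the decoding rule from the spec quote above can be
sketched in a few lines of Python. This is only a minimal sketch, not the
libblkid code from the pull request: the function names decode_osta_cs0
and decode_dstring are hypothetical, the buffer is assumed to start with
the CompressionID byte, and the dstring helper assumes that the last byte
of the field holds the number of used bytes (as in the 0x08 0xC3 0xBF
0x00 ... 0x03 label above).

```python
def decode_osta_cs0(buf: bytes) -> str:
    """Decode an OSTA compressed Unicode buffer (hypothetical helper).

    buf[0] is the CompressionID (8 or 16); the rest is the character
    bit stream, one code point per 8 or per 16 bits (big endian).
    """
    if not buf:
        return ""
    comp_id, payload = buf[0], buf[1:]
    if comp_id == 8:
        # One code point per byte, U+0000..U+00FF: same mapping as Latin-1.
        return payload.decode("latin-1")
    if comp_id == 16:
        # One code point per big-endian 16-bit unit, treated as UCS-2
        # (no surrogate pairing); a trailing odd byte is ignored.
        usable = len(payload) - (len(payload) % 2)
        return "".join(
            chr(int.from_bytes(payload[i:i + 2], "big"))
            for i in range(0, usable, 2)
        )
    raise ValueError(f"unsupported CompressionID: {comp_id}")


def decode_dstring(field: bytes) -> str:
    """Decode a fixed-size dstring field (assumption: last byte = used length)."""
    if not field:
        return ""
    used = field[-1]
    if used == 0:
        return ""
    if used > len(field) - 1:
        raise ValueError("dstring length byte exceeds field size")
    return decode_osta_cs0(field[:used])


# The label from the test image, as a 128-byte dstring field.
label_field = bytes([0x08, 0xC3, 0xBF]) + bytes(124) + bytes([0x03])
print(decode_dstring(label_field))          # spec-conforming decoding
print(label_field[1:3].decode("utf-8"))     # what treating it as UTF-8 gives
```

The first print shows the two-character string that grub2 and Windows
display; the second shows the single character that results from
treating the same two bytes as UTF-8.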

-- 
Pali Rohár
pali.rohar@gmail.com

