libblkid: udf: Incorrect implementation of Unicode strings


* libblkid: udf: Incorrect implementation of Unicode strings
@ 2017-05-12 14:38 Pali Rohár
  2017-05-15 10:09 ` Karel Zak
  0 siblings, 1 reply; 12+ messages in thread
From: Pali Rohár @ 2017-05-12 14:38 UTC (permalink / raw)
  To: util-linux, Steve Kenton, Vojtěch Vladyka, Jan Kara,
	Karel Zak

[-- Attachment #1.1: Type: text/plain, Size: 2698 bytes --]

Hi!

Since the beginning, libblkid's udf code has handled 16-bit OSTA compressed
Unicode as UTF-16BE and 8-bit OSTA compressed Unicode as UTF-8.

The UDF 2.01 specification says:
====
For a CompressionID of 8 or 16, the value of the CompressionID shall 
specify the number of BitsPerCharacter for the d-characters defined in 
the CharacterBitStream field. Each sequence of CompressionID bits in the 
CharacterBitStream field shall represent an OSTA Compressed Unicode d-
character. The bits of the character being encoded shall be added to the 
CharacterBitStream from most- to least-significant-bit. The bits shall 
be added to the CharacterBitStream starting from the most significant 
bit of the current byte being encoded into. The value of the OSTA 
Compressed Unicode d-character interpreted as a Uint16 defines the value 
of the corresponding d-character in the Unicode 2.0 standard.
====

So an 8-bit OSTA compressed Unicode buffer contains a sequence of
Unicode code points, one per 8 bits, which is effectively equivalent to
the Latin1 (ISO-8859-1) encoding.

And 16-bit OSTA compressed Unicode is a sequence of Unicode code points,
one per 16 bits, in big endian, which is probably only UCS-2 and not
full UTF-16.

So the problem is with 8-bit OSTA compressed Unicode when it contains
bytes which are not UTF-8 invariants (ASCII), as those are decoded
differently as Latin1 and as UTF-8.

This means the libblkid udf implementation of reading Unicode strings is
wrong, and it affects all read operations (Label, UUID, ...).
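
As an illustration of the reading of the spec described above, here is a
minimal Python sketch of a decoder (not the libblkid code). The function
name decode_dstring is made up for this example, and it assumes the usual
UDF dstring layout: the first byte is the CompressionID and the last byte
of the fixed-size field holds the number of bytes in use.

```python
def decode_dstring(field: bytes) -> str:
    """Decode a fixed-size UDF dstring field (hypothetical helper).

    Assumed layout: field[0] is the CompressionID (8 or 16) and
    field[-1] is the number of bytes in use, CompressionID included.
    """
    if not field:
        return ""
    used = field[-1]
    if used == 0:
        return ""
    if used > len(field) - 1:
        raise ValueError("dstring length byte is larger than the field")

    comp_id = field[0]
    payload = field[1:used]

    if comp_id == 8:
        # One code point per byte: the same mapping as Latin1.
        return payload.decode("latin-1")
    if comp_id == 16:
        # One code point per 16 bits, big endian, treated as UCS-2.
        if len(payload) % 2:
            raise ValueError("odd payload length for 16-bit characters")
        return "".join(
            chr(int.from_bytes(payload[i:i + 2], "big"))
            for i in range(0, len(payload), 2)
        )
    raise ValueError("unsupported CompressionID: %d" % comp_id)
```

The 8-bit branch is the one where the result differs from a UTF-8 decoder
as soon as a byte is outside of ASCII.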

To verify this problem I prepared a small udf image (attached) which has
the logical volume identifier (known as the label): 0x08 0xC3 0xBF 0x00 ... 0x03

According to the spec it should be decoded as the string "Ã¿"
(LATIN CAPITAL LETTER A WITH TILDE, INVERTED QUESTION MARK).

But blkid shows me "ÿ" (LATIN SMALL LETTER Y WITH DIAERESIS).

I checked the grub2 and Windows implementations and they show "Ã¿".
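
The difference between the two results can be reproduced with the two
payload bytes of that label alone. This is only a small Python
illustration of the two interpretations, not code from blkid or grub2:

```python
# The two character bytes that follow the CompressionID 0x08 in the label.
payload = bytes([0xC3, 0xBF])

# Per-byte code points (the spec reading, same as Latin1): two characters.
as_latin1 = payload.decode("latin-1")   # U+00C3 U+00BF

# The same bytes taken as one UTF-8 sequence: a single character.
as_utf8 = payload.decode("utf-8")       # U+00FF

print([hex(ord(c)) for c in as_latin1])  # ['0xc3', '0xbf']
print([hex(ord(c)) for c in as_utf8])    # ['0xff']
```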

So... what to do with the blkid implementation? Fixing it would break
all existing labels and UUIDs on Linux. Not fixing it would mean having
different labels from other systems which implement it properly.

The problem appeared when I sent a patch implementing the same UUID
algorithm in grub2. (The patch has not been merged yet.)

Note that Linux's mkudffs from udftools generates the label correctly, so
it is also incompatible with the blkid implementation. But because I
tested only ASCII characters and Unicode characters above U+FF, I did
not detect this problem... (ASCII is the same in UTF-8 and Latin1, and
chars above U+FF can be encoded only as UTF-16, resp. UCS-2.)

-- 
Pali Rohár
pali.rohar@gmail.com

[-- Attachment #1.2: udf.img.xz --]
[-- Type: application/x-xz, Size: 2376 bytes --]

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: libblkid: udf: Incorrect implementation of Unicode strings
  2017-05-12 14:38 libblkid: udf: Incorrect implementation of Unicode strings Pali Rohár
@ 2017-05-15 10:09 ` Karel Zak
  2017-05-15 12:38   ` Pali Rohár
  2017-05-16 22:17   ` Pali Rohár
  0 siblings, 2 replies; 12+ messages in thread
From: Karel Zak @ 2017-05-15 10:09 UTC (permalink / raw)
  To: Pali Rohár; +Cc: util-linux, Steve Kenton, Vojtěch Vladyka, Jan Kara

On Fri, May 12, 2017 at 04:38:59PM +0200, Pali Rohár wrote:
> So... what to do with blkid implementation? Fixing it would mean to 
> break all existing labels and uuids on Linux. Not fixing it would mean 
> to have different labels across different systems which implements it 
> properly.

The issue has never been reported, so I guess the number of the affected
LABELs is pretty small :-)

From my point of view it would be better to follow the standard, fix
the issue and be compatible with other utils and systems. It
would be nice to fix it now for v2.30, where we already have changes in
the udf/iso stuff. Please send the patch :-)

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

* Re: libblkid: udf: Incorrect implementation of Unicode strings
  2017-05-15 10:09 ` Karel Zak
@ 2017-05-15 12:38   ` Pali Rohár
  2017-05-16 11:01     ` Karel Zak
  2017-05-16 22:17   ` Pali Rohár
  1 sibling, 1 reply; 12+ messages in thread
From: Pali Rohár @ 2017-05-15 12:38 UTC (permalink / raw)
  To: Karel Zak; +Cc: util-linux, Steve Kenton, Vojtěch Vladyka, Jan Kara

On Monday 15 May 2017 12:09:40 Karel Zak wrote:
> On Fri, May 12, 2017 at 04:38:59PM +0200, Pali Rohár wrote:
> > So... what to do with blkid implementation? Fixing it would mean to 
> > break all existing labels and uuids on Linux. Not fixing it would mean 
> > to have different labels across different systems which implements it 
> > properly.
> 
> The issue has never been reported, so I guess the number of the affected
> LABELs is pretty small :-)

Yes, that is possible. Most labels are just ASCII, and if somebody
needs something special, then it is probably non-Latin and so above
the U+FF code point...

> From my point of view it would be better to follow the standard, fix
> the issue and be compatible with the another utils and systems. It
> would be nice to fix it now for v2.30 where we already have changes in
> udf/iso stuff. Please, send the patch :-)

Ok, I can do that.

But the question remains what to do with the UUID. The first 16 characters
of the Volume Set Identifier are unique, non-trivial and should be the
hexadecimal representation of a timestamp. Currently blkid uses them for
generating the UUID.

But "character" here means a Unicode code point, not a byte. So what to do
if the Volume Set Identifier (which we use for the UUID) contains
non-hexadecimal and also non-alphabetical or non-ASCII characters?

Currently blkid reads non-alphabetical chars more or less as bytes and
encodes each of them as two hexadecimal digits. But because the
implementation of reading OSTA compressed Unicode is broken, this would
change once that reading is fixed.

So what can be stored in the UUID? If any UTF-8 sequence is allowed, then
we can just take 16 chars of VolSetId, convert OSTA Unicode to UTF-8 and
store that in the UUID. But that means the UUID could also contain
non-printable characters and some exotic or non-Latin characters... The
other option: if arbitrary Unicode characters are not allowed in the UUID,
then we need to decide how to convert/escape them into printable ASCII,
alphanumerics or hexdigits.

The simplest way for the UUID is of course to take the first 16 chars of
VolSetId and encode them in UTF-8... but is that allowed? And is it usable
for users (to specify a disk by an arbitrary Unicode/UTF-8 sequence)?

Let me know your opinion.

I suggest including all UDF changes in one release, so the "breakage" would
be just between two versions. So if the above Label/UUID changes are not
ready for the next release, I would suggest postponing the currently merged
UDF changes.

-- 
Pali Rohár
pali.rohar@gmail.com

* Re: libblkid: udf: Incorrect implementation of Unicode strings
  2017-05-15 12:38   ` Pali Rohár
@ 2017-05-16 11:01     ` Karel Zak
  2017-05-16 11:59       ` Pali Rohár
  0 siblings, 1 reply; 12+ messages in thread
From: Karel Zak @ 2017-05-16 11:01 UTC (permalink / raw)
  To: Pali Rohár; +Cc: util-linux

On Mon, May 15, 2017 at 02:38:45PM +0200, Pali Rohár wrote:
> But question remain what to do with UUID.

It seems the generated UUID is a libblkid feature and other tools/systems
don't use anything like a UUID for UDF, right?

If yes... then we can keep it unchanged and generate the UUID in the same
way as now (hexadecimal digits). The "OSTA Unicode fix" may be used for
LABEL= (etc.) only. I guess nothing forces us to generate UUIDs from the
decoded VolSetId.

Anyway, the UUID has to be printable.

> I suggest to include all UDF changes in one release, so "breakage" would
> be just between two versions. So if above Label/UUID changes would not
> be ready for next release, I would suggest to postpone currently merged
> UDF changes.

Yes.

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

* Re: libblkid: udf: Incorrect implementation of Unicode strings
  2017-05-16 11:01     ` Karel Zak
@ 2017-05-16 11:59       ` Pali Rohár
  2017-05-16 12:52         ` Karel Zak
  0 siblings, 1 reply; 12+ messages in thread
From: Pali Rohár @ 2017-05-16 11:59 UTC (permalink / raw)
  To: Karel Zak; +Cc: util-linux

On Tuesday 16 May 2017 13:01:39 Karel Zak wrote:
> On Mon, May 15, 2017 at 02:38:45PM +0200, Pali Rohár wrote:
> > But question remain what to do with UUID.
> 
> It seem generated UUID is libblkid feature and another tools/systems
> don't use anything like UUID for UDF, right?

Yes. Introduced in https://github.com/karelzak/util-linux/pull/135

But I would like to see UUID support in other places too (e.g. Grub2),
so that it would be possible to really use it as the UUID of the FS. Which
means we need some normalized way of generating it.

> If yes... then we can keep it unchanged, generate UUDI in the same way
> as now (hexadecimal digits). The "OSTA Unicode fix" maybe be used for
> LABEL= (etc) only. I guess nothing forces use to generate UUIDs from 
> decoded VolSetId.
> 
> Anyway, UUID has to be printable.

Let's first define the allowed characters in a UUID and then what we do
with UDF's UUID.

Does printable mean only printable ASCII? Or also printable Unicode? Or
only alphanumeric?

Printable ASCII characters are 0x20 - 0x7E (inclusive), which means that
space is also printable.

So what could make sense:

* ASCII uppercase (or lowercase) hexdigits
* ASCII hexdigits
* ASCII alphanumeric
* ASCII alphanumeric and underscore
* ASCII printable without space
* ASCII printable (including space)
* UNICODE Basic Latin without space + Latin-1 Supplement without space
* UNICODE Latin script without controls and spaces
* UNICODE Latin script without controls (including spaces)
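
To make the ASCII candidates in the list above concrete, here is a small
Python sketch of checks for some of them. The function names are invented
for this example and nothing here is taken from libblkid or udev; the
Unicode-based candidates are left out on purpose.

```python
import string

LOWER_HEXDIGITS = set("0123456789abcdef")


def is_lower_hex(s: str) -> bool:
    """ASCII lowercase hexdigits only."""
    return all(c in LOWER_HEXDIGITS for c in s)


def is_ascii_alnum(s: str, allow_underscore: bool = False) -> bool:
    """ASCII alphanumeric, optionally with underscore."""
    allowed = set(string.ascii_letters + string.digits)
    if allow_underscore:
        allowed.add("_")
    return all(c in allowed for c in s)


def is_printable_ascii(s: str, allow_space: bool = True) -> bool:
    """Printable ASCII, 0x20 - 0x7E, optionally excluding space (0x20)."""
    low = 0x20 if allow_space else 0x21
    return all(low <= ord(c) <= 0x7E for c in s)
```

Each check is stricter than the ones below it, so a string of lowercase
hexdigits passes all three.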

> > I suggest to include all UDF changes in one release, so "breakage" would
> > be just between two versions. So if above Label/UUID changes would not
> > be ready for next release, I would suggest to postpone currently merged
> > UDF changes.
> 
> Yes.

Ok.

-- 
Pali Rohár
pali.rohar@gmail.com

* Re: libblkid: udf: Incorrect implementation of Unicode strings
  2017-05-16 11:59       ` Pali Rohár
@ 2017-05-16 12:52         ` Karel Zak
  2017-05-16 14:02           ` Pali Rohár
  0 siblings, 1 reply; 12+ messages in thread
From: Karel Zak @ 2017-05-16 12:52 UTC (permalink / raw)
  To: Pali Rohár; +Cc: util-linux

On Tue, May 16, 2017 at 01:59:40PM +0200, Pali Rohár wrote:
> On Tuesday 16 May 2017 13:01:39 Karel Zak wrote:
> > On Mon, May 15, 2017 at 02:38:45PM +0200, Pali Rohár wrote:
> > > But question remain what to do with UUID.
> > 
> > It seem generated UUID is libblkid feature and another tools/systems
> > don't use anything like UUID for UDF, right?
> 
> Yes. Introduced in https://github.com/karelzak/util-linux/pull/135

:-)

> But I would like to see UUID support also on other places (e.g. Grub2)
> so it would be possible to use it really as UUID of FS. Which means we
> need some normalized way of generation.

OK.

> > If yes... then we can keep it unchanged, generate UUDI in the same way
> > as now (hexadecimal digits). The "OSTA Unicode fix" maybe be used for
> > LABEL= (etc) only. I guess nothing forces use to generate UUIDs from 
> > decoded VolSetId.
> > 
> > Anyway, UUID has to be printable.
> 
> Lets first define allowed characters in UUID and then what we do with
> UDF's UUID.
> 
> Printable means only printable ASCII? Or also printable from Unicode? Or
> only alphanumeric?

I'd like to be very conservative and avoid anything other than ASCII.
It's an identifier that should be usable everywhere.

udev uses the UUID for paths and symlinks; "bad chars" are escaped and
that is very user-unfriendly. We should also be friendly to non-UTF
users, terminals, etc.

IMHO the best solution would be to use lowercase hex digits as for
other filesystems (and the super ideal would be to follow the UUID notation
for formatting, e.g. "c5490147-2a6c-4c8a-aa1b-33492034f927") ;-).

> > > I suggest to include all UDF changes in one release, so "breakage" would
> > > be just between two versions. So if above Label/UUID changes would not
> > > be ready for next release, I would suggest to postpone currently merged
> > > UDF changes.
> > 
> > Yes.

I have released v2.30-rc1; we have time until -rc2 (~1 month).

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

* Re: libblkid: udf: Incorrect implementation of Unicode strings
  2017-05-16 12:52         ` Karel Zak
@ 2017-05-16 14:02           ` Pali Rohár
  2017-05-17  7:13             ` Karel Zak
  0 siblings, 1 reply; 12+ messages in thread
From: Pali Rohár @ 2017-05-16 14:02 UTC (permalink / raw)
  To: Karel Zak; +Cc: util-linux

On Tuesday 16 May 2017 14:52:57 Karel Zak wrote:
> On Tue, May 16, 2017 at 01:59:40PM +0200, Pali Rohár wrote:
> > On Tuesday 16 May 2017 13:01:39 Karel Zak wrote:
> > > On Mon, May 15, 2017 at 02:38:45PM +0200, Pali Rohár wrote:
> > > > But question remain what to do with UUID.
> > > 
> > > It seem generated UUID is libblkid feature and another tools/systems
> > > don't use anything like UUID for UDF, right?
> > 
> > Yes. Introduced in https://github.com/karelzak/util-linux/pull/135
> 
> :-)
> 
> > But I would like to see UUID support also on other places (e.g. Grub2)
> > so it would be possible to use it really as UUID of FS. Which means we
> > need some normalized way of generation.
> 
> OK.
> 
> > > If yes... then we can keep it unchanged, generate UUDI in the same way
> > > as now (hexadecimal digits). The "OSTA Unicode fix" maybe be used for
> > > LABEL= (etc) only. I guess nothing forces use to generate UUIDs from 
> > > decoded VolSetId.
> > > 
> > > Anyway, UUID has to be printable.
> > 
> > Lets first define allowed characters in UUID and then what we do with
> > UDF's UUID.
> > 
> > Printable means only printable ASCII? Or also printable from Unicode? Or
> > only alphanumeric?
> 
> I'd like to be very conservative and avoid anything else than ASCII.
> It's identifier that should be usable everywhere.
> 
> udev uses the UUID for paths and symlinks, "bad chars" are escaped and
> it's very user unfriendly. We should be also user friendly to non-UTF
> users, terminals, etc.
> 
> IMHO the best solution would be to use lowercase hex-digits like for
> another filesystems (and super ideal would be follow UUID notation for
> formatting (e.g. "c5490147-2a6c-4c8a-aa1b-33492034f927") ;-).

We have only 16 Unicode characters (and the first 8 are hexdigits), so
the above format for 128-bit UUID notation is not possible.

Currently VolSetID is parsed as bytes instead of (Unicode) characters.
We can parse it correctly, read the first 16 chars, convert them to UTF-8
and then use those UTF-8 bytes as input for generating the UUID. This step
has the advantage that it deals with Unicode (and does not depend on the
internal representation of the VolSetID string stored in UDF) and also
that it produces normalized bytes which can be used later...

You want to have only lowercase hexdigits in the UUID. I understand the
reason, it makes sense. But how to generate a UUID from a (potentially
arbitrary) UTF-8 sequence of 16 Unicode characters, given that UTF-8 is a
variable-length encoding?

Currently the UUID generator splits those 16 chars/bytes into a first and
second half, because according to the UDF standard the first half should
contain only hexdigits (and in most cases it really does!). A half which
is not alphanumeric is encoded via %02x per byte. And the final string is
truncated to 16 bytes (to have a fixed length).

What we can do is take the UTF-8 sequence (instead of the raw UDF bytes)
and encode non-hexdigit bytes (instead of non-alnum) via %02x. And truncate
again to 16 hexdigits.
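
A rough Python sketch of one possible reading of this idea follows; it is
not the blkid code and not a finished proposal. The function name is made
up, the decision is taken per byte rather than per half, and the
lowercasing of hexdigits is an assumption added to match the wish for
lowercase output.

```python
HEXDIGIT_BYTES = set(b"0123456789abcdefABCDEF")


def uuid_from_volsetid(volsetid: str) -> str:
    """Hypothetical UUID generation from an already decoded VolSetID.

    Takes the first 16 characters, converts them to UTF-8, keeps bytes
    that are hexdigits (lowercased, an assumption) and prints every other
    byte as %02x. The result is truncated to 16 hexdigits.
    """
    utf8 = volsetid[:16].encode("utf-8")
    out = []
    for byte in utf8:
        if byte in HEXDIGIT_BYTES:
            out.append(chr(byte).lower())
        else:
            out.append("%02x" % byte)
    return "".join(out)[:16]


print(uuid_from_volsetid("0123456789ABCDEF"))   # 0123456789abcdef
print(uuid_from_volsetid("1234ABCDLinux UDF"))  # 1234abcd4c696e75
```

Note that a VolSetID shorter than 16 characters can give a shorter result
in this sketch; padding is not handled here.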

What do you think about it? Or do you have a better idea?

> > > > I suggest to include all UDF changes in one release, so "breakage" would
> > > > be just between two versions. So if above Label/UUID changes would not
> > > > be ready for next release, I would suggest to postpone currently merged
> > > > UDF changes.
> > > 
> > > Yes.
> 
> I have released v2.30-rc1, we have time to -rc2 (~1 month).
> 
>     Karel
> 

-- 
Pali Rohár
pali.rohar@gmail.com

* Re: libblkid: udf: Incorrect implementation of Unicode strings
  2017-05-16 14:02           ` Pali Rohár
@ 2017-05-17  7:13             ` Karel Zak
  2017-05-17 18:50               ` Pali Rohár
  0 siblings, 1 reply; 12+ messages in thread
From: Karel Zak @ 2017-05-17  7:13 UTC (permalink / raw)
  To: Pali Rohár; +Cc: util-linux

On Tue, May 16, 2017 at 04:02:45PM +0200, Pali Rohár wrote:
> > > > If yes... then we can keep it unchanged, generate UUDI in the same way
> > > > as now (hexadecimal digits). The "OSTA Unicode fix" maybe be used for
> > > > LABEL= (etc) only. I guess nothing forces use to generate UUIDs from 
> > > > decoded VolSetId.
> > > > 
> > > > Anyway, UUID has to be printable.
> > > 
> > > Lets first define allowed characters in UUID and then what we do with
> > > UDF's UUID.
> > > 
> > > Printable means only printable ASCII? Or also printable from Unicode? Or
> > > only alphanumeric?
> > 
> > I'd like to be very conservative and avoid anything else than ASCII.
> > It's identifier that should be usable everywhere.
> > 
> > udev uses the UUID for paths and symlinks, "bad chars" are escaped and
> > it's very user unfriendly. We should be also user friendly to non-UTF
> > users, terminals, etc.
> > 
> > IMHO the best solution would be to use lowercase hex-digits like for
> > another filesystems (and super ideal would be follow UUID notation for
> > formatting (e.g. "c5490147-2a6c-4c8a-aa1b-33492034f927") ;-).
> 
> We have only 16 Unicode characters (and first 8 are hexdigits), so
> above format for 128bit UUID notation is not possible.
> 
> Currently VolSetID is parsed as bytes instead of (Unicode) characters.
> We can correctly parse it, read first 16 chars, convert then UTF-8 and
> then use those UTF-8 bytes as input for generating UUID. This step has
> advantage that deals with Unicode (and does not matter on internal
> representation of VolSetID string stored in UDF) and also that produce
> normalized bytes which can be later used...
> 
> You want to have only lowercase hexdigits in UUID. I understand this
> reason, it makes sense. But how to generate UUID from (potentially
> arbitrary) UTF-8 sequence of 16 Unicode characters? Because UTF-8 is
> variable length encoding.
> 
> Currently UUID generator split those 16 chars/bytes into first and
> second half because according to UDF standard that first half should
> contain only hexdigits (and in most cases they really are!). Half which
> is not alphanumeric is encoded via %02x per byte. And final string
> truncated to 16 bytes (to have fixed length).
> 
> What we can do is to take UTF-8 sequence (instead raw UDF bytes) and
> encode non-hexdigits bytes (instead non-alnum) via %02x. And truncate
> again to 16 hexdigits.

This is what I expected... don't think of them as characters,
but as random bytes that we print as %02x. The result will
always be the same for the same UDF header, right?

Another option would be to use some hash sum to standardize an arbitrary
number of bytes (for example, we use MD5 to generate the UUID in
libblkid/src/superblocks/hfs.c). In this case we can also use some
other bytes from the header, for example volume_descriptor.tag. The
disadvantage is the dependence on checksum code, so bad portability to
other projects (grub, etc.).
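
Just to show the shape of the hash option, a minimal Python sketch is
below. It is not what hfs.c does; the function name, the choice of input
bytes and the formatting of the digest as a 128-bit UUID string are all
assumptions made for the example.

```python
import hashlib
import uuid


def uuid_from_hash(volsetid_bytes: bytes, extra: bytes = b"") -> str:
    """Hypothetical hash-based UUID: MD5 over the identifier bytes plus
    optional extra header bytes, printed in the usual UUID notation."""
    digest = hashlib.md5(volsetid_bytes + extra).digest()  # 16 bytes
    return str(uuid.UUID(bytes=digest))


# The same input bytes always give the same lowercase, hyphenated string.
print(uuid_from_hash("1234ABCDLinux UDF".encode("utf-8")))
```

Any input length maps to a fixed 128 bits here, at the price of every
other project needing the same hash code to reproduce the value.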

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

* Re: libblkid: udf: Incorrect implementation of Unicode strings
  2017-05-17  7:13             ` Karel Zak
@ 2017-05-17 18:50               ` Pali Rohár
  2017-05-18  8:34                 ` Karel Zak
  0 siblings, 1 reply; 12+ messages in thread
From: Pali Rohár @ 2017-05-17 18:50 UTC (permalink / raw)
  To: Karel Zak; +Cc: util-linux

[-- Attachment #1: Type: Text/Plain, Size: 4040 bytes --]

On Wednesday 17 May 2017 09:13:33 Karel Zak wrote:
> On Tue, May 16, 2017 at 04:02:45PM +0200, Pali Rohár wrote:
> > > > > If yes... then we can keep it unchanged, generate UUDI in the
> > > > > same way as now (hexadecimal digits). The "OSTA Unicode fix"
> > > > > maybe be used for LABEL= (etc) only. I guess nothing forces
> > > > > use to generate UUIDs from decoded VolSetId.
> > > > > 
> > > > > Anyway, UUID has to be printable.
> > > > 
> > > > Lets first define allowed characters in UUID and then what we
> > > > do with UDF's UUID.
> > > > 
> > > > Printable means only printable ASCII? Or also printable from
> > > > Unicode? Or only alphanumeric?
> > > 
> > > I'd like to be very conservative and avoid anything else than
> > > ASCII. It's identifier that should be usable everywhere.
> > > 
> > > udev uses the UUID for paths and symlinks, "bad chars" are
> > > escaped and it's very user unfriendly. We should be also user
> > > friendly to non-UTF users, terminals, etc.
> > > 
> > > IMHO the best solution would be to use lowercase hex-digits like
> > > for another filesystems (and super ideal would be follow UUID
> > > notation for formatting (e.g.
> > > "c5490147-2a6c-4c8a-aa1b-33492034f927") ;-).
> > 
> > We have only 16 Unicode characters (and first 8 are hexdigits), so
> > above format for 128bit UUID notation is not possible.
> > 
> > Currently VolSetID is parsed as bytes instead of (Unicode)
> > characters. We can correctly parse it, read first 16 chars,
> > convert then UTF-8 and then use those UTF-8 bytes as input for
> > generating UUID. This step has advantage that deals with Unicode
> > (and does not matter on internal representation of VolSetID string
> > stored in UDF) and also that produce normalized bytes which can be
> > later used...
> > 
> > You want to have only lowercase hexdigits in UUID. I understand
> > this reason, it makes sense. But how to generate UUID from
> > (potentially arbitrary) UTF-8 sequence of 16 Unicode characters?
> > Because UTF-8 is variable length encoding.
> > 
> > Currently UUID generator split those 16 chars/bytes into first and
> > second half because according to UDF standard that first half
> > should contain only hexdigits (and in most cases they really
> > are!). Half which is not alphanumeric is encoded via %02x per
> > byte. And final string truncated to 16 bytes (to have fixed
> > length).
> > 
> > What we can do is to take UTF-8 sequence (instead raw UDF bytes)
> > and encode non-hexdigits bytes (instead non-alnum) via %02x. And
> > truncate again to 16 hexdigits.
> 
> This is what I expected... don't think about it as about characters,
> but as about random bytes that we print as %02x. The result will be
> always the same for the same UDF header, right?

Still, most UDF disks created by Nero, new mkudffs or new udfclient have
only hexdigits stored in VolSetID. So it is better to use them directly
instead of encoding hexdigit characters via %02x.

I tried to modify the current algorithm to take the UTF-8 representation as
its input (and therefore correctly handle both 8-bit and 16-bit OSTA
Unicode) and to honor the above hexdigits in VolSetID.

Please review my proposed changes:
https://github.com/karelzak/util-linux/pull/439

If you do not agree with the changes, or you have other ideas or comments,
let me know.

> The another option would be use some hash sum to standardize
> arbitrary number of bytes (for example we use MD5 to generate UUID
> for libblkid/src/superblocks/hfs.c). In this case we can use also
> some another bytes from the header, for example
> volume_descriptor.tag. The disadvantage is dependence on checksum
> code, so bad portability to another projects (grub, etc.).

That looks too complicated; especially the choice of checksum/hash function
could be problematic to implement in other projects. The hash
function probably does not have to be fast (even though there are fast MD5
implementations).

-- 
Pali Rohár
pali.rohar@gmail.com

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

* Re: libblkid: udf: Incorrect implementation of Unicode strings
  2017-05-17 18:50               ` Pali Rohár
@ 2017-05-18  8:34                 ` Karel Zak
  0 siblings, 0 replies; 12+ messages in thread
From: Karel Zak @ 2017-05-18  8:34 UTC (permalink / raw)
  To: Pali Rohár; +Cc: util-linux

On Wed, May 17, 2017 at 08:50:48PM +0200, Pali Rohár wrote:
> Still, more UDF disks created by Nero, new mkudffs or new udfclient have 
> only hexdigits stored in VolSetID. So it is better to use them directly 
> instead of encoding hexdigits characters via %02x.

OK, I see.

> I tried to modify current algorithm to take UTF-8 representation for its 
> input (therefore correctly handle both 8bit and 16bit OSTA Unicode) and 
> honor above hexdigits in VolSetID.
> 
> Please look at review my proposed changes:
> https://github.com/karelzak/util-linux/pull/439

Looks good.

It seems the most invasive change (in many cases) is the tolower() :-)
IMHO it's a good idea to generate a normalized UUID.

> > The another option would be use some hash sum to standardize
> > arbitrary number of bytes (for example we use MD5 to generate UUID
> > for libblkid/src/superblocks/hfs.c). In this case we can use also
> > some another bytes from the header, for example
> > volume_descriptor.tag. The disadvantage is dependence on checksum
> > code, so bad portability to another projects (grub, etc.).
> 
> Looks like too complicated, specially decision on checksum/hash function 
> could be problematic to implement in other projects. Probably hash 
> functions does not have to be fast (even there are fast MD5 
> implementations).

Yes, portability to other projects is painful in this case.

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

* Re: libblkid: udf: Incorrect implementation of Unicode strings
  2017-05-15 10:09 ` Karel Zak
  2017-05-15 12:38   ` Pali Rohár
@ 2017-05-16 22:17   ` Pali Rohár
  2017-05-17  7:19     ` Karel Zak
  1 sibling, 1 reply; 12+ messages in thread
From: Pali Rohár @ 2017-05-16 22:17 UTC (permalink / raw)
  To: Karel Zak; +Cc: util-linux, Steve Kenton, Vojtěch Vladyka, Jan Kara

[-- Attachment #1: Type: Text/Plain, Size: 3085 bytes --]

On Monday 15 May 2017 12:09:40 Karel Zak wrote:
> On Fri, May 12, 2017 at 04:38:59PM +0200, Pali Rohár wrote:
> > Hi!
> > 
> > Since beginning libblkid's udf code handles 16bit OSTA compressed
> > unicode as UTF-16BE and 8bit OSTA compressed unicode as UTF-8.
> > 
> > In UDF 2.01 specification is written:
> > ====
> > For a CompressionID of 8 or 16, the value of the CompressionID
> > shall specify the number of BitsPerCharacter for the d-characters
> > defined in the CharacterBitStream field. Each sequence of
> > CompressionID bits in the CharacterBitStream field shall represent
> > an OSTA Compressed Unicode d-character. The bits of the character
> > being encoded shall be added to the CharacterBitStream from most-
> > to least-significant-bit. The bits shall be added to the
> > CharacterBitStream starting from the most significant bit of the
> > current byte being encoded into. The value of the OSTA Compressed
> > Unicode d-character interpreted as a Uint16 defines the value of
> > the corresponding d-character in the Unicode 2.0 standard.
> > ====
> > 
> > So it means that an 8-bit OSTA compressed Unicode buffer contains a
> > sequence of Unicode code points, one per 8 bits, which is
> > effectively equivalent to Latin1 (ISO-8859-1) encoding.
> > 
> > And 16-bit OSTA compressed Unicode means a sequence of Unicode
> > code points, one per 16 bits in big endian, which is probably only
> > UCS-2 and not full UTF-16.
> > 
> > So the problem is with 8-bit OSTA compressed Unicode when it contains
> > bytes that are not UTF-8 invariants (ASCII), as those are decoded
> > differently with Latin1 and UTF-8.
> > 
> > This means the libblkid udf implementation of reading Unicode
> > strings is wrong, and it affects all read operations (Label, UUID, ...).
> > 
> > To verify this problem I prepared a small udf image (attached) which
> > has the logical volume identifier (known as the label): 0x08 0xC3 0xBF
> > 0x00 ... 0x03
> > 
> > According to the spec it should be decoded as the string "Ã¿"
> > (LATIN CAPITAL LETTER A WITH TILDE, INVERTED QUESTION MARK).
> > 
> > But blkid shows me "ÿ" (LATIN SMALL LETTER Y WITH DIAERESIS).
> > 
> > I checked the grub2 and Windows implementations and they show "Ã¿".
> > 
> > So... what to do with the blkid implementation? Fixing it would mean
> > breaking all existing labels and uuids on Linux. Not fixing it would
> > mean having different labels across the different systems that
> > implement it properly.
> 
> The issue has never been reported, so I guess the number of
> affected LABELs is pretty small :-)
> 
> From my point of view it would be better to follow the standard, fix
> the issue and be compatible with other utils and systems. It
> would be nice to fix it now for v2.30 where we already have changes
> in the udf/iso stuff. Please, send the patch :-)

A fix for all UDF strings except the UUID is in this pull request:
https://github.com/karelzak/util-linux/pull/438

I hope it is correct now. A UDF image with "Ã¿" is added to the tests.
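
To make the difference concrete, here is a small sketch of the decoding described in the quoted spec text. It is not the code from the pull request; the function names are hypothetical, and the dstring layout (last byte of the field holds the number of used bytes, compression ID included) is stated here as an assumption.

```python
def decode_osta_cs0(used: bytes) -> str:
    """Decode OSTA compressed Unicode: one compression ID byte, then payload."""
    if not used:
        return ""
    comp_id, payload = used[0], used[1:]
    if comp_id == 8:
        # One code point per byte, which is the same mapping as Latin1.
        return payload.decode("latin-1")
    if comp_id == 16:
        # One code point per 16-bit big-endian unit; a trailing odd byte
        # is ignored. No surrogate pairing is done (UCS-2 style).
        usable = len(payload) - (len(payload) % 2)
        return "".join(
            chr(int.from_bytes(payload[i:i + 2], "big"))
            for i in range(0, usable, 2)
        )
    raise ValueError(f"unsupported compression ID: {comp_id}")


def decode_dstring(field: bytes) -> str:
    """Decode a fixed-size dstring field (assumed layout, see lead-in)."""
    if not field:
        return ""
    used_len = min(field[-1], len(field) - 1)
    return decode_osta_cs0(field[:used_len])


# The label from the test image: 0x08 0xC3 0xBF 0x00 ... 0x03 (128 bytes).
label_field = bytes([0x08, 0xC3, 0xBF]) + bytes(124) + bytes([0x03])
print(decode_dstring(label_field))          # "Ã¿" (U+00C3 U+00BF)
print(bytes([0xC3, 0xBF]).decode("utf-8"))  # "ÿ" (U+00FF), the old result
```

The same two payload bytes give two code points when read one per byte, and a single code point when misread as UTF-8, which is exactly the mismatch seen in blkid.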

-- 
Pali Rohár
pali.rohar@gmail.com


* Re: libblkid: udf: Incorrect implementation of Unicode strings
  2017-05-16 22:17   ` Pali Rohár
@ 2017-05-17  7:19     ` Karel Zak
  0 siblings, 0 replies; 12+ messages in thread
From: Karel Zak @ 2017-05-17  7:19 UTC (permalink / raw)
  To: Pali Rohár; +Cc: util-linux, Steve Kenton, Vojtěch Vladyka, Jan Kara

On Wed, May 17, 2017 at 12:17:19AM +0200, Pali Rohár wrote:
> Fix for all UDF strings except UUID is in this pull request:
> https://github.com/karelzak/util-linux/pull/438
> 
> I hope it is correct now. UDF image with "Ã¿" is added to tests.

Thanks! Applied.

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

Thread overview: 12+ messages
2017-05-12 14:38 libblkid: udf: Incorrect implementation of Unicode strings Pali Rohár
2017-05-15 10:09 ` Karel Zak
2017-05-15 12:38   ` Pali Rohár
2017-05-16 11:01     ` Karel Zak
2017-05-16 11:59       ` Pali Rohár
2017-05-16 12:52         ` Karel Zak
2017-05-16 14:02           ` Pali Rohár
2017-05-17  7:13             ` Karel Zak
2017-05-17 18:50               ` Pali Rohár
2017-05-18  8:34                 ` Karel Zak
2017-05-16 22:17   ` Pali Rohár
2017-05-17  7:19     ` Karel Zak