From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: util-linux-owner@vger.kernel.org
Received: from mail-wm0-f41.google.com ([74.125.82.41]:38603 "EHLO
	mail-wm0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751987AbdEQSuw (ORCPT ); Wed, 17 May 2017 14:50:52 -0400
Received: by mail-wm0-f41.google.com with SMTP id v15so26766461wmv.1
	for ; Wed, 17 May 2017 11:50:51 -0700 (PDT)
From: Pali Rohár
To: Karel Zak
Subject: Re: libblkid: udf: Incorrect implementation of Unicode strings
Date: Wed, 17 May 2017 20:50:48 +0200
Cc: util-linux@vger.kernel.org
References: <201705121638.59416@pali> <20170516140245.GE10015@pali>
	<20170517071333.wj4glanhfh2aje72@ws.net.home>
In-Reply-To: <20170517071333.wj4glanhfh2aje72@ws.net.home>
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="nextPart2147606.tOzCa2by4I";
	protocol="application/pgp-signature"; micalg=pgp-sha1
Message-Id: <201705172050.48793@pali>
Sender: util-linux-owner@vger.kernel.org
List-ID:

--nextPart2147606.tOzCa2by4I
Content-Type: Text/Plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

On Wednesday 17 May 2017 09:13:33 Karel Zak wrote:
> On Tue, May 16, 2017 at 04:02:45PM +0200, Pali Rohár wrote:
> > > > > If yes... then we can keep it unchanged, generate UUDI in the
> > > > > same way as now (hexadecimal digits). The "OSTA Unicode fix"
> > > > > maybe be used for LABEL= (etc) only. I guess nothing forces
> > > > > use to generate UUIDs from decoded VolSetId.
> > > > >
> > > > > Anyway, UUID has to be printable.
> > > >
> > > > Lets first define allowed characters in UUID and then what we
> > > > do with UDF's UUID.
> > > >
> > > > Printable means only printable ASCII? Or also printable from
> > > > Unicode? Or only alphanumeric?
> > >
> > > I'd like to be very conservative and avoid anything else than
> > > ASCII. It's identifier that should be usable everywhere.
> > >
> > > udev uses the UUID for paths and symlinks, "bad chars" are
> > > escaped and it's very user unfriendly. We should be also user
> > > friendly to non-UTF users, terminals, etc.
> > >
> > > IMHO the best solution would be to use lowercase hex-digits like
> > > for another filesystems (and super ideal would be follow UUID
> > > notation for formatting (e.g.
> > > "c5490147-2a6c-4c8a-aa1b-33492034f927") ;-).
> >
> > We have only 16 Unicode characters (and first 8 are hexdigits), so
> > above format for 128bit UUID notation is not possible.
> >
> > Currently VolSetID is parsed as bytes instead of (Unicode)
> > characters. We can correctly parse it, read first 16 chars,
> > convert then UTF-8 and then use those UTF-8 bytes as input for
> > generating UUID. This step has advantage that deals with Unicode
> > (and does not matter on internal representation of VolSetID string
> > stored in UDF) and also that produce normalized bytes which can be
> > later used...
> >
> > You want to have only lowercase hexdigits in UUID. I understand
> > this reason, it makes sense. But how to generate UUID from
> > (potentially arbitrary) UTF-8 sequence of 16 Unicode characters?
> > Because UTF-8 is variable length encoding.
> >
> > Currently UUID generator split those 16 chars/bytes into first and
> > second half because according to UDF standard that first half
> > should contain only hexdigits (and in most cases they really
> > are!). Half which is not alphanumeric is encoded via %02x per
> > byte. And final string truncated to 16 bytes (to have fixed
> > length).
> >
> > What we can do is to take UTF-8 sequence (instead raw UDF bytes)
> > and encode non-hexdigits bytes (instead non-alnum) via %02x. And
> > truncate again to 16 hexdigits.
>
> This is what I expected... don't think about it as about characters,
> but as about random bytes that we print as %02x. The result will be
> always the same for the same UDF header, right?

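To make the idea quoted above concrete, here is a rough sketch of "take the
UTF-8 bytes, keep hexdigits as they are, print every other byte as %02x,
truncate to 16". It is only an illustration under my own assumptions, not the
code from libblkid: the names udf_uuid_sketch, is_hexdigit and UDF_UUID_LEN
are hypothetical, and it works byte by byte instead of splitting VolSetID
into two halves.

```c
#include <stddef.h>
#include <string.h>

#define UDF_UUID_LEN 16   /* hypothetical name: fixed length of the result */

/* Locale-independent check for an ASCII hexadecimal digit. */
static int is_hexdigit(unsigned char c)
{
    return (c >= '0' && c <= '9') ||
           (c >= 'a' && c <= 'f') ||
           (c >= 'A' && c <= 'F');
}

/*
 * Hypothetical helper, not a libblkid function.
 * Input:  UTF-8 bytes of the (already decoded) VolSetID.
 * Output: at most 16 lowercase hexdigits, NUL terminated.
 */
void udf_uuid_sketch(const unsigned char *utf8, size_t len,
                     char out[UDF_UUID_LEN + 1])
{
    static const char hex[] = "0123456789abcdef";
    char buf[UDF_UUID_LEN + 2];   /* the last byte may overshoot by one digit */
    size_t n = 0;

    for (size_t i = 0; i < len && n < UDF_UUID_LEN; i++) {
        unsigned char c = utf8[i];

        if (is_hexdigit(c)) {
            /* hexdigits are used directly, only lowercased */
            buf[n++] = (c >= 'A' && c <= 'F') ? (char)(c - 'A' + 'a') : (char)c;
        } else {
            /* any other byte is written as two digits, same as "%02x" */
            buf[n++] = hex[c >> 4];
            buf[n++] = hex[c & 0x0f];
        }
    }

    if (n > UDF_UUID_LEN)
        n = UDF_UUID_LEN;         /* truncate to the fixed length */
    memcpy(out, buf, n);
    out[n] = '\0';
}
```

The result depends only on the input bytes, so the same UDF header always
gives the same string. Note that in this sketch the truncation can cut the
two digits of a non-hexdigit byte in half when that byte lands on position
16.
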
Still, most UDF disks created by Nero, new mkudffs or new udfclient have
only hexdigits stored in VolSetID. So it is better to use them directly
instead of encoding hexdigit characters via %02x.

I tried to modify the current algorithm to take the UTF-8 representation as
its input (and therefore correctly handle both 8bit and 16bit OSTA Unicode)
and to honor the above hexdigits in VolSetID.

Please review my proposed changes:
https://github.com/karelzak/util-linux/pull/439

If you do not agree with the changes, or you have other ideas or comments,
let me know.

> The another option would be use some hash sum to standardize
> arbitrary number of bytes (for example we use MD5 to generate UUID
> for libblkid/src/superblocks/hfs.c). In this case we can use also
> some another bytes from the header, for example
> volume_descriptor.tag. The disadvantage is dependence on checksum
> code, so bad portability to another projects (grub, etc.).

That looks too complicated. Especially the choice of checksum/hash function
could be problematic to implement in other projects. The hash function
probably does not have to be fast (even though there are fast MD5
implementations).

-- 
Pali Rohár
pali.rohar@gmail.com

--nextPart2147606.tOzCa2by4I
Content-Type: application/pgp-signature; name=signature.asc
Content-Description: This is a digitally signed message part.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)

iEYEABECAAYFAlkcm4gACgkQi/DJPQPkQ1JQKACcDP16jiFDES0IPkVoKD/pyhul
/GsAoKHzgC2txv8ups7HsWqlI2aXszMj
=5dPd
-----END PGP SIGNATURE-----

--nextPart2147606.tOzCa2by4I--