From: Laszlo Ersek <lersek-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Peter Jones <pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
Matt Fleming
<matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
Cc: efi kernel list
<linux-efi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
Matthew Garrett <mjg59-1xO5oi07KQx4cg9Nei1l7Q@public.gmane.org>
Subject: Re: [PATCH 1/5] Add ucs2 -> utf8 helper functions
Date: Fri, 12 Feb 2016 14:22:29 +0100 [thread overview]
Message-ID: <56BDDC95.8030608@redhat.com> (raw)
In-Reply-To: <1454600074-14854-2-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
(I imported these messages from the gmane archive, after reading about
this work on LWN. Sorry if I'm not looking at the latest patches.)
On 02/04/16 16:34, Peter Jones wrote:
> This adds ucs2_utf8size(), which tells us how big our ucs2 string is in
> bytes, and ucs2_as_utf8, which translates from ucs2 to utf8..
>
> Signed-off-by: Peter Jones <pjones-H+wXaHxf7aLQT0dZR+AlfA-XMD5yJDbdMReXY1tMh2IBg@public.gmane.org>
> Tested-by: Lee, Chun-Yi <jlee-IBi9RG/b67k-XMD5yJDbdMReXY1tMh2IBg@public.gmane.org>
> Acked-by: Matthew Garrett <mjg59-JW9irJGTvgXQT0dZR+AlfA-XMD5yJDbdMReXY1tMh2IBg@public.gmane.org>
> ---
> include/linux/ucs2_string.h | 4 +++
> lib/ucs2_string.c | 62 +++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 66 insertions(+)
>
> diff --git a/include/linux/ucs2_string.h b/include/linux/ucs2_string.h
> index cbb20af..bb679b4 100644
> --- a/include/linux/ucs2_string.h
> +++ b/include/linux/ucs2_string.h
> @@ -11,4 +11,8 @@ unsigned long ucs2_strlen(const ucs2_char_t *s);
> unsigned long ucs2_strsize(const ucs2_char_t *data, unsigned long maxlength);
> int ucs2_strncmp(const ucs2_char_t *a, const ucs2_char_t *b, size_t len);
>
> +unsigned long ucs2_utf8size(const ucs2_char_t *src);
> +unsigned long ucs2_as_utf8(u8 *dest, const ucs2_char_t *src,
> + unsigned long maxlength);
> +
> #endif /* _LINUX_UCS2_STRING_H_ */
> diff --git a/lib/ucs2_string.c b/lib/ucs2_string.c
> index 6f500ef..17dd74e 100644
> --- a/lib/ucs2_string.c
> +++ b/lib/ucs2_string.c
> @@ -49,3 +49,65 @@ ucs2_strncmp(const ucs2_char_t *a, const ucs2_char_t *b, size_t len)
> }
> }
> EXPORT_SYMBOL(ucs2_strncmp);
> +
> +unsigned long
> +ucs2_utf8size(const ucs2_char_t *src)
> +{
> + unsigned long i;
> + unsigned long j = 0;
> +
> + for (i = 0; i < ucs2_strlen(src); i++) {
> + u16 c = src[i];
> +
> + if (c > 0x800)
> + j += 3;
> + else if (c > 0x80)
> + j += 2;
> + else
> + j += 1;
> + }
> +
> + return j;
> +}
> +EXPORT_SYMBOL(ucs2_utf8size);
> +
> +/*
> + * copy at most maxlength bytes of whole utf8 characters to dest from the
> + * ucs2 string src.
> + *
> + * The return value is the number of characters copied, not including the
> + * final NUL character.
> + */
> +unsigned long
> +ucs2_as_utf8(u8 *dest, const ucs2_char_t *src, unsigned long maxlength)
> +{
> + unsigned int i;
> + unsigned long j = 0;
> + unsigned long limit = ucs2_strnlen(src, maxlength);
> +
> + for (i = 0; maxlength && i < limit; i++) {
> + u16 c = src[i];
> +
> + if (c > 0x800) {
> + if (maxlength < 3)
> + break;
> + maxlength -= 3;
> + dest[j++] = 0xe0 | (c & 0xf000) >> 12;
> + dest[j++] = 0x80 | (c & 0x0fc0) >> 8;
> + dest[j++] = 0x80 | (c & 0x003f);
> + } else if (c > 0x80) {
> + if (maxlength < 2)
> + break;
> + maxlength -= 2;
> + dest[j++] = 0xc0 | (c & 0xfe0) >> 5;
> + dest[j++] = 0x80 | (c & 0x01f);
> + } else {
> + maxlength -= 1;
> + dest[j++] = c & 0x7f;
> + }
> + }
> + if (maxlength)
> + dest[j] = '\0';
> + return j;
> +}
> +EXPORT_SYMBOL(ucs2_as_utf8);
>
Since this code is being added to a generic library, I have two comments
/ questions:
(1) shouldn't we handle the endianness of ucs2_char_t explicitly?
https://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes
(2) Code *units* that stand for low or high halves of surrogate pairs
(0xD800 to 0xDFFF) are not treated specially; meaning the unicode code
*point* they represent (from U+10000 to U+10FFFF) is not decoded, and
then separately encoded to UTF-8. Instead, the above will transcode the
surrogates individually to UTF-8, which looks invalid.
https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF
https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points
Has this been considered?
Again, the above two questions don't seem relevant for UEFI
specifically: first, UEFI is little-endian only; second, UEFI does not
support surrogate characters -- I found a hint about this in the HII
chapter of the spec, "31.2.6.2.2 Surrogate Area".
But since this code is being added to a generic library, the UEFI
assumptions may not hold for other (future) callers.
Or does "ucs2" -- as opposed to "utf16" -- imply that the caller is
responsible for not passing in code units from the surrogate area? If
so, I think this should be spelled out in a comment as well, and maybe
even WARN'd about.
Just my two cents; I'm by no means a unicode expert.
Thanks
Laszlo
next prev parent reply other threads:[~2016-02-12 13:22 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-02-04 15:34 efivarfs immutable files patch set Peter Jones
[not found] ` <1454600074-14854-1-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 15:34 ` [PATCH 1/5] Add ucs2 -> utf8 helper functions Peter Jones
[not found] ` <1454600074-14854-2-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-12 13:22 ` Laszlo Ersek [this message]
[not found] ` <56BDDC95.8030608-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-12 15:07 ` Peter Jones
2016-02-15 10:15 ` Matt Fleming
2016-02-04 15:34 ` [PATCH 2/5] efi: use ucs2_as_utf8 in efivarfs instead of open coding a bad version (v2) Peter Jones
[not found] ` <1454600074-14854-3-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 22:06 ` Matt Fleming
2016-02-04 15:34 ` [PATCH 3/5] efi: do variable name validation tests in utf8 Peter Jones
[not found] ` <1454600074-14854-4-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 21:39 ` Matt Fleming
2016-02-04 15:34 ` [PATCH 4/5] efi: make our variable validation list include the guid (v3) Peter Jones
[not found] ` <1454600074-14854-5-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 22:54 ` Matt Fleming
2016-02-04 15:34 ` [PATCH 5/5] efi: Make efivarfs entries immutable by default. (v5) Peter Jones
[not found] ` <1454600074-14854-6-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 23:42 ` Matt Fleming
[not found] ` <20160204234211.GI2586-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
2016-02-08 19:48 ` efi: make most efivarfs files immutable by default Peter Jones
2016-02-08 19:48 ` [PATCH 1/5] Add ucs2 -> utf8 helper functions Peter Jones
2016-02-08 19:48 ` [PATCH 3/5] efi: do variable name validation tests in utf8 (v2) Peter Jones
2016-02-08 19:48 ` [PATCH 5/5] efi: Make efivarfs entries immutable by default. (v5) Peter Jones
[not found] ` <1454960895-3473-1-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-08 19:48 ` [PATCH 2/5] efi: use ucs2_as_utf8 in efivarfs instead of open coding a bad version (v3) Peter Jones
2016-02-08 19:48 ` [PATCH 4/5] efi: make our variable validation list include the guid (v3) Peter Jones
2016-02-10 13:22 ` efi: make most efivarfs files immutable by default Matt Fleming
[not found] ` <20160210132225.GA2949-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
2016-02-10 14:51 ` [PATCH] efi: minor fixup in efivar_validate() declaration Peter Jones
[not found] ` <1455115862-2490-1-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-10 16:38 ` Matt Fleming
2016-02-12 13:36 ` [PATCH 5/5] efi: Make efivarfs entries immutable by default. (v5) Laszlo Ersek
[not found] ` <56BDDFDC.406-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-12 15:09 ` Peter Jones
[not found] ` <20160212150948.GC31573-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-15 10:48 ` Matt Fleming
[not found] ` <20160215104801.GB2591-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
2016-02-15 17:02 ` Peter Jones
[not found] ` <20160215170215.GC785-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-16 12:49 ` Matt Fleming
-- strict thread matches above, loose matches on Subject: below --
2016-02-03 16:43 [PATCH 1/5] Add ucs2 -> utf8 helper functions Peter Jones
2016-02-03 13:02 Peter Jones
2016-02-02 22:33 Preventing "rm -rf /sys/firmware/efi/efivars/" from damage Peter Jones
[not found] ` <1454452386-27709-1-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-02 22:33 ` [PATCH 1/5] Add ucs2 -> utf8 helper functions Peter Jones
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=56BDDC95.8030608@redhat.com \
--to=lersek-h+wxahxf7alqt0dzr+alfa@public.gmane.org \
--cc=linux-efi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
--cc=matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org \
--cc=mjg59-1xO5oi07KQx4cg9Nei1l7Q@public.gmane.org \
--cc=pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).