From: Peter Jones <pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Laszlo Ersek <lersek-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Matt Fleming
<matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>,
efi kernel list
<linux-efi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
Matthew Garrett <mjg59-1xO5oi07KQx4cg9Nei1l7Q@public.gmane.org>
Subject: Re: [PATCH 1/5] Add ucs2 -> utf8 helper functions
Date: Fri, 12 Feb 2016 10:07:13 -0500 [thread overview]
Message-ID: <20160212150713.GB31573@redhat.com> (raw)
In-Reply-To: <56BDDC95.8030608-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
On Fri, Feb 12, 2016 at 02:22:29PM +0100, Laszlo Ersek wrote:
> (I imported these messages from the gmane archive, after reading about
> this work on LWN. Sorry if I'm not looking at the latest patches.)
>
> On 02/04/16 16:34, Peter Jones wrote:
> > This adds ucs2_utf8size(), which tells us how big our ucs2 string is in
> > bytes, and ucs2_as_utf8, which translates from ucs2 to utf8..
> >
> > Signed-off-by: Peter Jones <pjones-H+wXaHxf7aLQT0dZR+AlfA-XMD5yJDbdMReXY1tMh2IBg@public.gmane.org>
> > Tested-by: Lee, Chun-Yi <jlee-IBi9RG/b67k-XMD5yJDbdMReXY1tMh2IBg@public.gmane.org>
> > Acked-by: Matthew Garrett <mjg59-JW9irJGTvgXQT0dZR+AlfA-XMD5yJDbdMReXY1tMh2IBg@public.gmane.org>
> > ---
> > include/linux/ucs2_string.h | 4 +++
> > lib/ucs2_string.c | 62 +++++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 66 insertions(+)
> >
> > diff --git a/include/linux/ucs2_string.h b/include/linux/ucs2_string.h
> > index cbb20af..bb679b4 100644
> > --- a/include/linux/ucs2_string.h
> > +++ b/include/linux/ucs2_string.h
> > @@ -11,4 +11,8 @@ unsigned long ucs2_strlen(const ucs2_char_t *s);
> > unsigned long ucs2_strsize(const ucs2_char_t *data, unsigned long maxlength);
> > int ucs2_strncmp(const ucs2_char_t *a, const ucs2_char_t *b, size_t len);
> >
> > +unsigned long ucs2_utf8size(const ucs2_char_t *src);
> > +unsigned long ucs2_as_utf8(u8 *dest, const ucs2_char_t *src,
> > + unsigned long maxlength);
> > +
> > #endif /* _LINUX_UCS2_STRING_H_ */
> > diff --git a/lib/ucs2_string.c b/lib/ucs2_string.c
> > index 6f500ef..17dd74e 100644
> > --- a/lib/ucs2_string.c
> > +++ b/lib/ucs2_string.c
> > @@ -49,3 +49,65 @@ ucs2_strncmp(const ucs2_char_t *a, const ucs2_char_t *b, size_t len)
> > }
> > }
> > EXPORT_SYMBOL(ucs2_strncmp);
> > +
> > +unsigned long
> > +ucs2_utf8size(const ucs2_char_t *src)
> > +{
> > + unsigned long i;
> > + unsigned long j = 0;
> > +
> > + for (i = 0; i < ucs2_strlen(src); i++) {
> > + u16 c = src[i];
> > +
> > + if (c > 0x800)
> > + j += 3;
> > + else if (c > 0x80)
> > + j += 2;
> > + else
> > + j += 1;
> > + }
> > +
> > + return j;
> > +}
> > +EXPORT_SYMBOL(ucs2_utf8size);
> > +
> > +/*
> > + * copy at most maxlength bytes of whole utf8 characters to dest from the
> > + * ucs2 string src.
> > + *
> > + * The return value is the number of characters copied, not including the
> > + * final NUL character.
> > + */
> > +unsigned long
> > +ucs2_as_utf8(u8 *dest, const ucs2_char_t *src, unsigned long maxlength)
> > +{
> > + unsigned int i;
> > + unsigned long j = 0;
> > + unsigned long limit = ucs2_strnlen(src, maxlength);
> > +
> > + for (i = 0; maxlength && i < limit; i++) {
> > + u16 c = src[i];
> > +
> > + if (c > 0x800) {
> > + if (maxlength < 3)
> > + break;
> > + maxlength -= 3;
> > + dest[j++] = 0xe0 | (c & 0xf000) >> 12;
> > + dest[j++] = 0x80 | (c & 0x0fc0) >> 8;
> > + dest[j++] = 0x80 | (c & 0x003f);
> > + } else if (c > 0x80) {
> > + if (maxlength < 2)
> > + break;
> > + maxlength -= 2;
> > + dest[j++] = 0xc0 | (c & 0xfe0) >> 5;
> > + dest[j++] = 0x80 | (c & 0x01f);
> > + } else {
> > + maxlength -= 1;
> > + dest[j++] = c & 0x7f;
> > + }
> > + }
> > + if (maxlength)
> > + dest[j] = '\0';
> > + return j;
> > +}
> > +EXPORT_SYMBOL(ucs2_as_utf8);
> >
>
> Since this code is being added to a generic library, I have two comments
> / questions:
>
> (1) shouldn't we handle the endianness of ucs2_char_t explicitly?
>
> https://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes
TBH I've explicitly ignored this because EFI is always LE. If somebody
else decides they're using this code for something else, then either a)
we'll need to add some cpu_to_le16() and whatnot, or declare it a
prerequisite on the input.
> (2) Code *units* that stand for low or high halves of surrogate pairs
> (0xD800 to 0xDFFF) are not treated specially; meaning the unicode code
> *point* they represent (from U+10000 to U+10FFFF) is not decoded, and
> then separately encoded to UTF-8. Instead, the above will transcode the
> surrogates individually to UTF-8, which looks invalid.
>
> https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF
>
> https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points
>
> Has this been considered?
>
> Again, the above two questions don't seem relevant for UEFI
> specifically: first, UEFI is little-endian only; second, UEFI does not
> support surrogate characters -- I found a hint about this in the HII
> chapter of the spec, "31.2.6.2.2 Surrogate Area".
>
> But since this code is being added to a generic library, the UEFI
> assumptions may not hold for other (future) callers.
>
> Or does "ucs2" -- as opposed to "utf16" -- imply that the caller is
> responsible for not passing in code units from the surrogate area? If
> so, I think this should be spelled out in a comment as well, and maybe
> even WARN'd about.
UCS-2 means /there are no surrogates/. Everything directly maps to the
16-bit subset set of unicode characters.
--
Peter
next prev parent reply other threads:[~2016-02-12 15:07 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-02-04 15:34 efivarfs immutable files patch set Peter Jones
[not found] ` <1454600074-14854-1-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 15:34 ` [PATCH 1/5] Add ucs2 -> utf8 helper functions Peter Jones
[not found] ` <1454600074-14854-2-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-12 13:22 ` Laszlo Ersek
[not found] ` <56BDDC95.8030608-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-12 15:07 ` Peter Jones [this message]
2016-02-15 10:15 ` Matt Fleming
2016-02-04 15:34 ` [PATCH 2/5] efi: use ucs2_as_utf8 in efivarfs instead of open coding a bad version (v2) Peter Jones
[not found] ` <1454600074-14854-3-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 22:06 ` Matt Fleming
2016-02-04 15:34 ` [PATCH 3/5] efi: do variable name validation tests in utf8 Peter Jones
[not found] ` <1454600074-14854-4-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 21:39 ` Matt Fleming
2016-02-04 15:34 ` [PATCH 4/5] efi: make our variable validation list include the guid (v3) Peter Jones
[not found] ` <1454600074-14854-5-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 22:54 ` Matt Fleming
2016-02-04 15:34 ` [PATCH 5/5] efi: Make efivarfs entries immutable by default. (v5) Peter Jones
[not found] ` <1454600074-14854-6-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 23:42 ` Matt Fleming
[not found] ` <20160204234211.GI2586-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
2016-02-08 19:48 ` efi: make most efivarfs files immutable by default Peter Jones
2016-02-08 19:48 ` [PATCH 1/5] Add ucs2 -> utf8 helper functions Peter Jones
2016-02-08 19:48 ` [PATCH 3/5] efi: do variable name validation tests in utf8 (v2) Peter Jones
[not found] ` <1454960895-3473-1-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-08 19:48 ` [PATCH 2/5] efi: use ucs2_as_utf8 in efivarfs instead of open coding a bad version (v3) Peter Jones
2016-02-08 19:48 ` [PATCH 4/5] efi: make our variable validation list include the guid (v3) Peter Jones
2016-02-10 13:22 ` efi: make most efivarfs files immutable by default Matt Fleming
[not found] ` <20160210132225.GA2949-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
2016-02-10 14:51 ` [PATCH] efi: minor fixup in efivar_validate() declaration Peter Jones
[not found] ` <1455115862-2490-1-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-10 16:38 ` Matt Fleming
2016-02-08 19:48 ` [PATCH 5/5] efi: Make efivarfs entries immutable by default. (v5) Peter Jones
2016-02-12 13:36 ` Laszlo Ersek
[not found] ` <56BDDFDC.406-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-12 15:09 ` Peter Jones
[not found] ` <20160212150948.GC31573-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-15 10:48 ` Matt Fleming
[not found] ` <20160215104801.GB2591-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
2016-02-15 17:02 ` Peter Jones
[not found] ` <20160215170215.GC785-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-16 12:49 ` Matt Fleming
-- strict thread matches above, loose matches on Subject: below --
2016-02-03 16:43 [PATCH 1/5] Add ucs2 -> utf8 helper functions Peter Jones
2016-02-03 13:02 Peter Jones
2016-02-02 22:33 Preventing "rm -rf /sys/firmware/efi/efivars/" from damage Peter Jones
[not found] ` <1454452386-27709-1-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-02 22:33 ` [PATCH 1/5] Add ucs2 -> utf8 helper functions Peter Jones
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160212150713.GB31573@redhat.com \
--to=pjones-h+wxahxf7alqt0dzr+alfa@public.gmane.org \
--cc=lersek-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
--cc=linux-efi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
--cc=matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org \
--cc=mjg59-1xO5oi07KQx4cg9Nei1l7Q@public.gmane.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).