Re: [PATCH 1/5] Add ucs2 -> utf8 helper functions

linux-efi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Laszlo Ersek <lersek-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Peter Jones <pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	Matt Fleming
	<matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
Cc: efi kernel list
	<linux-efi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Matthew Garrett <mjg59-1xO5oi07KQx4cg9Nei1l7Q@public.gmane.org>
Subject: Re: [PATCH 1/5] Add ucs2 -> utf8 helper functions
Date: Fri, 12 Feb 2016 14:22:29 +0100	[thread overview]
Message-ID: <56BDDC95.8030608@redhat.com> (raw)
In-Reply-To: <1454600074-14854-2-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

(I imported these messages from the gmane archive, after reading about
this work on LWN. Sorry if I'm not looking at the latest patches.)

On 02/04/16 16:34, Peter Jones wrote:
> This adds ucs2_utf8size(), which tells us how big our ucs2 string is in
> bytes, and ucs2_as_utf8, which translates from ucs2 to utf8..
> 
> Signed-off-by: Peter Jones <pjones-H+wXaHxf7aLQT0dZR+AlfA-XMD5yJDbdMReXY1tMh2IBg@public.gmane.org>
> Tested-by: Lee, Chun-Yi <jlee-IBi9RG/b67k-XMD5yJDbdMReXY1tMh2IBg@public.gmane.org>
> Acked-by: Matthew Garrett <mjg59-JW9irJGTvgXQT0dZR+AlfA-XMD5yJDbdMReXY1tMh2IBg@public.gmane.org>
> ---
>  include/linux/ucs2_string.h |  4 +++
>  lib/ucs2_string.c           | 62 +++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 66 insertions(+)
> 
> diff --git a/include/linux/ucs2_string.h b/include/linux/ucs2_string.h
> index cbb20af..bb679b4 100644
> --- a/include/linux/ucs2_string.h
> +++ b/include/linux/ucs2_string.h
> @@ -11,4 +11,8 @@ unsigned long ucs2_strlen(const ucs2_char_t *s);
>  unsigned long ucs2_strsize(const ucs2_char_t *data, unsigned long maxlength);
>  int ucs2_strncmp(const ucs2_char_t *a, const ucs2_char_t *b, size_t len);
>  
> +unsigned long ucs2_utf8size(const ucs2_char_t *src);
> +unsigned long ucs2_as_utf8(u8 *dest, const ucs2_char_t *src,
> +			   unsigned long maxlength);
> +
>  #endif /* _LINUX_UCS2_STRING_H_ */
> diff --git a/lib/ucs2_string.c b/lib/ucs2_string.c
> index 6f500ef..17dd74e 100644
> --- a/lib/ucs2_string.c
> +++ b/lib/ucs2_string.c
> @@ -49,3 +49,65 @@ ucs2_strncmp(const ucs2_char_t *a, const ucs2_char_t *b, size_t len)
>          }
>  }
>  EXPORT_SYMBOL(ucs2_strncmp);
> +
> +unsigned long
> +ucs2_utf8size(const ucs2_char_t *src)
> +{
> +	unsigned long i;
> +	unsigned long j = 0;
> +
> +	for (i = 0; i < ucs2_strlen(src); i++) {
> +		u16 c = src[i];
> +
> +		if (c > 0x800)
> +			j += 3;
> +		else if (c > 0x80)
> +			j += 2;
> +		else
> +			j += 1;
> +	}
> +
> +	return j;
> +}
> +EXPORT_SYMBOL(ucs2_utf8size);
> +
> +/*
> + * copy at most maxlength bytes of whole utf8 characters to dest from the
> + * ucs2 string src.
> + *
> + * The return value is the number of characters copied, not including the
> + * final NUL character.
> + */
> +unsigned long
> +ucs2_as_utf8(u8 *dest, const ucs2_char_t *src, unsigned long maxlength)
> +{
> +	unsigned int i;
> +	unsigned long j = 0;
> +	unsigned long limit = ucs2_strnlen(src, maxlength);
> +
> +	for (i = 0; maxlength && i < limit; i++) {
> +		u16 c = src[i];
> +
> +		if (c > 0x800) {
> +			if (maxlength < 3)
> +				break;
> +			maxlength -= 3;
> +			dest[j++] = 0xe0 | (c & 0xf000) >> 12;
> +			dest[j++] = 0x80 | (c & 0x0fc0) >> 8;
> +			dest[j++] = 0x80 | (c & 0x003f);
> +		} else if (c > 0x80) {
> +			if (maxlength < 2)
> +				break;
> +			maxlength -= 2;
> +			dest[j++] = 0xc0 | (c & 0xfe0) >> 5;
> +			dest[j++] = 0x80 | (c & 0x01f);
> +		} else {
> +			maxlength -= 1;
> +			dest[j++] = c & 0x7f;
> +		}
> +	}
> +	if (maxlength)
> +		dest[j] = '\0';
> +	return j;
> +}
> +EXPORT_SYMBOL(ucs2_as_utf8);
> 

Since this code is being added to a generic library, I have two comments
/ questions:

(1) shouldn't we handle the endianness of ucs2_char_t explicitly?

https://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes

(2) Code *units* that stand for low or high halves of surrogate pairs
(0xD800 to 0xDFFF) are not treated specially; meaning the unicode code
*point* they represent (from U+10000 to U+10FFFF) is not decoded, and
then separately encoded to UTF-8. Instead, the above will transcode the
surrogates individually to UTF-8, which looks invalid.

https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF

https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points

Has this been considered?

Again, the above two questions don't seem relevant for UEFI
specifically: first, UEFI is little-endian only; second, UEFI does not
support surrogate characters -- I found a hint about this in the HII
chapter of the spec, "31.2.6.2.2 Surrogate Area".

But since this code is being added to a generic library, the UEFI
assumptions may not hold for other (future) callers.

Or does "ucs2" -- as opposed to "utf16" -- imply that the caller is
responsible for not passing in code units from the surrogate area? If
so, I think this should be spelled out in a comment as well, and maybe
even WARN'd about.

Just my two cents; I'm by no means a unicode expert.

Thanks
Laszlo

next prev parent reply	other threads:[~2016-02-12 13:22 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-02-04 15:34 efivarfs immutable files patch set Peter Jones
     [not found] ` <1454600074-14854-1-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 15:34   ` [PATCH 1/5] Add ucs2 -> utf8 helper functions Peter Jones
     [not found]     ` <1454600074-14854-2-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-12 13:22       ` Laszlo Ersek [this message]
     [not found]         ` <56BDDC95.8030608-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-12 15:07           ` Peter Jones
2016-02-15 10:15           ` Matt Fleming
2016-02-04 15:34   ` [PATCH 2/5] efi: use ucs2_as_utf8 in efivarfs instead of open coding a bad version (v2) Peter Jones
     [not found]     ` <1454600074-14854-3-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 22:06       ` Matt Fleming
2016-02-04 15:34   ` [PATCH 3/5] efi: do variable name validation tests in utf8 Peter Jones
     [not found]     ` <1454600074-14854-4-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 21:39       ` Matt Fleming
2016-02-04 15:34   ` [PATCH 4/5] efi: make our variable validation list include the guid (v3) Peter Jones
     [not found]     ` <1454600074-14854-5-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 22:54       ` Matt Fleming
2016-02-04 15:34   ` [PATCH 5/5] efi: Make efivarfs entries immutable by default. (v5) Peter Jones
     [not found]     ` <1454600074-14854-6-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 23:42       ` Matt Fleming
     [not found]         ` <20160204234211.GI2586-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
2016-02-08 19:48           ` efi: make most efivarfs files immutable by default Peter Jones
2016-02-08 19:48             ` [PATCH 1/5] Add ucs2 -> utf8 helper functions Peter Jones
2016-02-08 19:48             ` [PATCH 3/5] efi: do variable name validation tests in utf8 (v2) Peter Jones
2016-02-08 19:48             ` [PATCH 5/5] efi: Make efivarfs entries immutable by default. (v5) Peter Jones
     [not found]             ` <1454960895-3473-1-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-08 19:48               ` [PATCH 2/5] efi: use ucs2_as_utf8 in efivarfs instead of open coding a bad version (v3) Peter Jones
2016-02-08 19:48               ` [PATCH 4/5] efi: make our variable validation list include the guid (v3) Peter Jones
2016-02-10 13:22               ` efi: make most efivarfs files immutable by default Matt Fleming
     [not found]                 ` <20160210132225.GA2949-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
2016-02-10 14:51                   ` [PATCH] efi: minor fixup in efivar_validate() declaration Peter Jones
     [not found]                     ` <1455115862-2490-1-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-10 16:38                       ` Matt Fleming
2016-02-12 13:36       ` [PATCH 5/5] efi: Make efivarfs entries immutable by default. (v5) Laszlo Ersek
     [not found]         ` <56BDDFDC.406-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-12 15:09           ` Peter Jones
     [not found]             ` <20160212150948.GC31573-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-15 10:48               ` Matt Fleming
     [not found]                 ` <20160215104801.GB2591-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
2016-02-15 17:02                   ` Peter Jones
     [not found]                     ` <20160215170215.GC785-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-16 12:49                       ` Matt Fleming
  -- strict thread matches above, loose matches on Subject: below --
2016-02-03 16:43 [PATCH 1/5] Add ucs2 -> utf8 helper functions Peter Jones
2016-02-03 13:02 Peter Jones
2016-02-02 22:33 Preventing "rm -rf /sys/firmware/efi/efivars/" from damage Peter Jones
     [not found] ` <1454452386-27709-1-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-02 22:33   ` [PATCH 1/5] Add ucs2 -> utf8 helper functions Peter Jones

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=56BDDC95.8030608@redhat.com \
    --to=lersek-h+wxahxf7alqt0dzr+alfa@public.gmane.org \
    --cc=linux-efi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org \
    --cc=mjg59-1xO5oi07KQx4cg9Nei1l7Q@public.gmane.org \
    --cc=pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).