All of lore.kernel.org
 help / color / mirror / Atom feed
From: Laszlo Ersek <lersek-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Peter Jones <pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	Matt Fleming
	<matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
Cc: efi kernel list
	<linux-efi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Matthew Garrett <mjg59-1xO5oi07KQx4cg9Nei1l7Q@public.gmane.org>
Subject: Re: [PATCH 1/5] Add ucs2 -> utf8 helper functions
Date: Fri, 12 Feb 2016 14:22:29 +0100	[thread overview]
Message-ID: <56BDDC95.8030608@redhat.com> (raw)
In-Reply-To: <1454600074-14854-2-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

(I imported these messages from the gmane archive, after reading about
this work on LWN. Sorry if I'm not looking at the latest patches.)

On 02/04/16 16:34, Peter Jones wrote:
> This adds ucs2_utf8size(), which tells us how big our ucs2 string is in
> bytes, and ucs2_as_utf8, which translates from ucs2 to utf8..
> 
> Signed-off-by: Peter Jones <pjones-H+wXaHxf7aLQT0dZR+AlfA-XMD5yJDbdMReXY1tMh2IBg@public.gmane.org>
> Tested-by: Lee, Chun-Yi <jlee-IBi9RG/b67k-XMD5yJDbdMReXY1tMh2IBg@public.gmane.org>
> Acked-by: Matthew Garrett <mjg59-JW9irJGTvgXQT0dZR+AlfA-XMD5yJDbdMReXY1tMh2IBg@public.gmane.org>
> ---
>  include/linux/ucs2_string.h |  4 +++
>  lib/ucs2_string.c           | 62 +++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 66 insertions(+)
> 
> diff --git a/include/linux/ucs2_string.h b/include/linux/ucs2_string.h
> index cbb20af..bb679b4 100644
> --- a/include/linux/ucs2_string.h
> +++ b/include/linux/ucs2_string.h
> @@ -11,4 +11,8 @@ unsigned long ucs2_strlen(const ucs2_char_t *s);
>  unsigned long ucs2_strsize(const ucs2_char_t *data, unsigned long maxlength);
>  int ucs2_strncmp(const ucs2_char_t *a, const ucs2_char_t *b, size_t len);
>  
> +unsigned long ucs2_utf8size(const ucs2_char_t *src);
> +unsigned long ucs2_as_utf8(u8 *dest, const ucs2_char_t *src,
> +			   unsigned long maxlength);
> +
>  #endif /* _LINUX_UCS2_STRING_H_ */
> diff --git a/lib/ucs2_string.c b/lib/ucs2_string.c
> index 6f500ef..17dd74e 100644
> --- a/lib/ucs2_string.c
> +++ b/lib/ucs2_string.c
> @@ -49,3 +49,65 @@ ucs2_strncmp(const ucs2_char_t *a, const ucs2_char_t *b, size_t len)
>          }
>  }
>  EXPORT_SYMBOL(ucs2_strncmp);
> +
> +unsigned long
> +ucs2_utf8size(const ucs2_char_t *src)
> +{
> +	unsigned long i;
> +	unsigned long j = 0;
> +
> +	for (i = 0; i < ucs2_strlen(src); i++) {
> +		u16 c = src[i];
> +
> +		if (c > 0x800)
> +			j += 3;
> +		else if (c > 0x80)
> +			j += 2;
> +		else
> +			j += 1;
> +	}
> +
> +	return j;
> +}
> +EXPORT_SYMBOL(ucs2_utf8size);
> +
> +/*
> + * copy at most maxlength bytes of whole utf8 characters to dest from the
> + * ucs2 string src.
> + *
> + * The return value is the number of characters copied, not including the
> + * final NUL character.
> + */
> +unsigned long
> +ucs2_as_utf8(u8 *dest, const ucs2_char_t *src, unsigned long maxlength)
> +{
> +	unsigned int i;
> +	unsigned long j = 0;
> +	unsigned long limit = ucs2_strnlen(src, maxlength);
> +
> +	for (i = 0; maxlength && i < limit; i++) {
> +		u16 c = src[i];
> +
> +		if (c > 0x800) {
> +			if (maxlength < 3)
> +				break;
> +			maxlength -= 3;
> +			dest[j++] = 0xe0 | (c & 0xf000) >> 12;
> +			dest[j++] = 0x80 | (c & 0x0fc0) >> 8;
> +			dest[j++] = 0x80 | (c & 0x003f);
> +		} else if (c > 0x80) {
> +			if (maxlength < 2)
> +				break;
> +			maxlength -= 2;
> +			dest[j++] = 0xc0 | (c & 0xfe0) >> 5;
> +			dest[j++] = 0x80 | (c & 0x01f);
> +		} else {
> +			maxlength -= 1;
> +			dest[j++] = c & 0x7f;
> +		}
> +	}
> +	if (maxlength)
> +		dest[j] = '\0';
> +	return j;
> +}
> +EXPORT_SYMBOL(ucs2_as_utf8);
> 

Since this code is being added to a generic library, I have two comments
/ questions:

(1) shouldn't we handle the endianness of ucs2_char_t explicitly?

https://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes

(2) Code *units* that stand for low or high halves of surrogate pairs
(0xD800 to 0xDFFF) are not treated specially; meaning the unicode code
*point* they represent (from U+10000 to U+10FFFF) is not decoded, and
then separately encoded to UTF-8. Instead, the above will transcode the
surrogates individually to UTF-8, which looks invalid.

https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF

https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points

Has this been considered?

Again, the above two questions don't seem relevant for UEFI
specifically: first, UEFI is little-endian only; second, UEFI does not
support surrogate characters -- I found a hint about this in the HII
chapter of the spec, "31.2.6.2.2 Surrogate Area".

But since this code is being added to a generic library, the UEFI
assumptions may not hold for other (future) callers.

Or does "ucs2" -- as opposed to "utf16" -- imply that the caller is
responsible for not passing in code units from the surrogate area? If
so, I think this should be spelled out in a comment as well, and maybe
even WARN'd about.

Just my two cents; I'm by no means a unicode expert.

Thanks
Laszlo

  parent reply	other threads:[~2016-02-12 13:22 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-02-04 15:34 efivarfs immutable files patch set Peter Jones
     [not found] ` <1454600074-14854-1-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 15:34   ` [PATCH 1/5] Add ucs2 -> utf8 helper functions Peter Jones
     [not found]     ` <1454600074-14854-2-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-12 13:22       ` Laszlo Ersek [this message]
     [not found]         ` <56BDDC95.8030608-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-12 15:07           ` Peter Jones
2016-02-15 10:15           ` Matt Fleming
2016-02-04 15:34   ` [PATCH 2/5] efi: use ucs2_as_utf8 in efivarfs instead of open coding a bad version (v2) Peter Jones
     [not found]     ` <1454600074-14854-3-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 22:06       ` Matt Fleming
2016-02-04 15:34   ` [PATCH 3/5] efi: do variable name validation tests in utf8 Peter Jones
     [not found]     ` <1454600074-14854-4-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 21:39       ` Matt Fleming
2016-02-04 15:34   ` [PATCH 4/5] efi: make our variable validation list include the guid (v3) Peter Jones
     [not found]     ` <1454600074-14854-5-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 22:54       ` Matt Fleming
2016-02-04 15:34   ` [PATCH 5/5] efi: Make efivarfs entries immutable by default. (v5) Peter Jones
     [not found]     ` <1454600074-14854-6-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-04 23:42       ` Matt Fleming
     [not found]         ` <20160204234211.GI2586-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
2016-02-08 19:48           ` efi: make most efivarfs files immutable by default Peter Jones
2016-02-08 19:48             ` Peter Jones
2016-02-08 19:48             ` [PATCH 1/5] Add ucs2 -> utf8 helper functions Peter Jones
2016-02-08 19:48             ` [PATCH 3/5] efi: do variable name validation tests in utf8 (v2) Peter Jones
2016-02-08 19:48             ` [PATCH 5/5] efi: Make efivarfs entries immutable by default. (v5) Peter Jones
     [not found]             ` <1454960895-3473-1-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-08 19:48               ` [PATCH 2/5] efi: use ucs2_as_utf8 in efivarfs instead of open coding a bad version (v3) Peter Jones
2016-02-08 19:48                 ` Peter Jones
2016-02-08 19:48               ` [PATCH 4/5] efi: make our variable validation list include the guid (v3) Peter Jones
2016-02-08 19:48                 ` Peter Jones
2016-02-10 13:22               ` efi: make most efivarfs files immutable by default Matt Fleming
2016-02-10 13:22                 ` Matt Fleming
     [not found]                 ` <20160210132225.GA2949-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
2016-02-10 14:51                   ` [PATCH] efi: minor fixup in efivar_validate() declaration Peter Jones
2016-02-10 14:51                     ` Peter Jones
     [not found]                     ` <1455115862-2490-1-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-10 16:38                       ` Matt Fleming
2016-02-10 16:38                         ` Matt Fleming
2016-02-12 13:36       ` [PATCH 5/5] efi: Make efivarfs entries immutable by default. (v5) Laszlo Ersek
     [not found]         ` <56BDDFDC.406-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-12 15:09           ` Peter Jones
     [not found]             ` <20160212150948.GC31573-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-15 10:48               ` Matt Fleming
     [not found]                 ` <20160215104801.GB2591-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
2016-02-15 17:02                   ` Peter Jones
     [not found]                     ` <20160215170215.GC785-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-16 12:49                       ` Matt Fleming
  -- strict thread matches above, loose matches on Subject: below --
2016-02-03 16:43 [PATCH 1/5] Add ucs2 -> utf8 helper functions Peter Jones
2016-02-03 13:02 Peter Jones
2016-02-02 22:33 Preventing "rm -rf /sys/firmware/efi/efivars/" from damage Peter Jones
     [not found] ` <1454452386-27709-1-git-send-email-pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-02 22:33   ` [PATCH 1/5] Add ucs2 -> utf8 helper functions Peter Jones

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=56BDDC95.8030608@redhat.com \
    --to=lersek-h+wxahxf7alqt0dzr+alfa@public.gmane.org \
    --cc=linux-efi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org \
    --cc=mjg59-1xO5oi07KQx4cg9Nei1l7Q@public.gmane.org \
    --cc=pjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.