From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from list by lists.gnu.org with archive (Exim 4.90_1)
	id 1mQdm1-0007WX-5S
	for mharc-grub-devel@gnu.org; Wed, 15 Sep 2021 18:53:02 -0400
Received: from eggs.gnu.org ([2001:470:142:3::10]:54644)
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <development@efficientek.com>)
 id 1mQdlx-0007VP-Ua
 for grub-devel@gnu.org; Wed, 15 Sep 2021 18:52:57 -0400
Received: from mail-qv1-xf2d.google.com ([2607:f8b0:4864:20::f2d]:44675)
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
 (Exim 4.90_1) (envelope-from <development@efficientek.com>)
 id 1mQdlu-0007de-Q2
 for grub-devel@gnu.org; Wed, 15 Sep 2021 18:52:57 -0400
Received: by mail-qv1-xf2d.google.com with SMTP id 62so2975408qvb.11
 for <grub-devel@gnu.org>; Wed, 15 Sep 2021 15:52:53 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=efficientek-com.20150623.gappssmtp.com; s=20150623;
 h=date:from:to:cc:subject:message-id:in-reply-to:references:reply-to
 :mime-version:content-transfer-encoding;
 bh=LGkIRgQ568qM3NexAQRoCEWEW2jK8867lhJEnCi6wh8=;
 b=as14JssfoP9ekSjikIZvxUIWzW72YqvyEYUGbvV09N4nndA8ILE83aj7mkFEhoeDK5
 TUFol/Eckt67u5B+zOH8dzJStuO9hLZzD3lxSSLBojxng1AoUJX6jPYMLXi4yZSsOcBG
 2E11/J24U2GTXsQZZdfrCQ75lqlMfHjYGW5HpdvSB5izHP9xC9DfLSggzwKrt3OPHClv
 fBgqWhBbelg7D8suXZgbqx6Oo0ZCbgMwMMcSLEUXb2VRmpwgba5yIzbcQ9EgalOfwtb9
 mCn/NDQThci2ywC68zJS34gsGWtui/+BOJkupVYbhVuMh9feRFVamY+WKDFmmLVRA+oA
 PmzA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:date:from:to:cc:subject:message-id:in-reply-to
 :references:reply-to:mime-version:content-transfer-encoding;
 bh=LGkIRgQ568qM3NexAQRoCEWEW2jK8867lhJEnCi6wh8=;
 b=E9Scoj8KCl0O3clCtYJREmwb7d26wMRSzakFrOGepP8C2J5J+4WFOTgFobDLjyOh4o
 CCrWYdMah4qaUYEwRQnfV2n2f/YbFGHyyBmvo4qJn78NDCrAxwQzaTQ9bo4oEZE/IsRC
 hlC5f1UhvE5IJZi2gwFBTpu7eLo0Pk6LworlAbBQ9XJU6giqaAWtfSDfatXWO+PuYwr+
 ljK/qDo5BRT+D6w/v6pXhzEfNkb/sixojDwNrHyDV7vSKAWXRdXyIGYcIOnQRuSjhWmf
 IhAPXiELQYYMkvfXkIKetcdjM0cqdc5orfwDjBoizjLZN3LAZ8MXHmOiI4KLvMFWOfdo
 uPFg==
X-Gm-Message-State: AOAM5304Q2ZQKt/styISfgkdAnCUT5HCxvwwtq6J/RM5o39J25mZ9mrm
 sDqwoZEWWA+CDkd1hpGCpyoOdww++SQiMxHS
X-Google-Smtp-Source: ABdhPJyESFNGbw9GYYD+n4lqzyY9/BKA2lI9tSTUli2XwfbFMWWMC2LSx3Is0GX6K8pQPu1KTafAjA==
X-Received: by 2002:a05:6214:2d1:: with SMTP id
 g17mr2378544qvu.63.1631746372047; 
 Wed, 15 Sep 2021 15:52:52 -0700 (PDT)
Received: from ubuntu ([2806:103e:1d:2421:11d9:2260:6763:77dc])
 by smtp.gmail.com with ESMTPSA id j9sm880117qta.65.2021.09.15.15.52.51
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Wed, 15 Sep 2021 15:52:51 -0700 (PDT)
Date: Wed, 15 Sep 2021 22:52:40 +0000
From: Glenn Washburn <development@efficientek.com>
To: Daniel Kiper <dkiper@net-space.pl>
Cc: grub-devel@gnu.org, Vladimir 'phcoder' Serbinenko <phcoder@gmail.com>,
 Peter Jones <pjones@redhat.com>
Subject: Re: [PATCH] udf: Fix regression which is preventing symlink access
Message-ID: <20210915225240.7011f2b7@ubuntu>
In-Reply-To: <20210915145228.xyu2za42raq5zjby@tomti.i.net-space.pl>
References: <20210910160323.1247372-1-development@efficientek.com>
 <20210914142755.mraorb7nhh3tzbfr@tomti.i.net-space.pl>
 <20210914181903.79ee0ee2@ubuntu>
 <20210915145228.xyu2za42raq5zjby@tomti.i.net-space.pl>
Reply-To: development@efficientek.com
X-Mailer: Claws Mail 3.17.8 (GTK+ 2.24.33; x86_64-pc-linux-gnu)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Received-SPF: pass client-ip=2607:f8b0:4864:20::f2d;
 envelope-from=development@efficientek.com; helo=mail-qv1-xf2d.google.com
X-Spam_score_int: -18
X-Spam_score: -1.9
X-Spam_bar: -
X-Spam_report: (-1.9 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1,
 DKIM_VALID=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001,
 SPF_PASS=-0.001 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-BeenThere: grub-devel@gnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: The development of GNU GRUB <grub-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/grub-devel>,
 <mailto:grub-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/grub-devel>
List-Post: <mailto:grub-devel@gnu.org>
List-Help: <mailto:grub-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/grub-devel>,
 <mailto:grub-devel-request@gnu.org?subject=subscribe>
X-List-Received-Date: Wed, 15 Sep 2021 22:52:58 -0000

On Wed, 15 Sep 2021 16:52:28 +0200
Daniel Kiper <dkiper@net-space.pl> wrote:

> On Tue, Sep 14, 2021 at 06:19:03PM +0000, Glenn Washburn wrote:
> > On Tue, 14 Sep 2021 16:27:55 +0200
> > Daniel Kiper <dkiper@net-space.pl> wrote:
> >
> > > On Fri, Sep 10, 2021 at 04:03:23PM +0000, Glenn Washburn wrote:
> > > > This code was broken by commit 3f05d693 ("malloc: Use overflow
> > > > checking primitives where we do complex allocations"), which
> > > > added overflow checking in many areas. The problem here is that
> > > > the changes update the local variable sz, which was already in
> > > > use and which was not updated before the change. So the code
> > > > using sz was getting a different value of than it would have
> > > > previously for the same UDF image. This causes the logic
> > > > getting the destination of the symlink to not realize that its
> > > > gotten the full destination, but keeps trying to read past the
> > > > end of the destination. The bytes after the end are generally
> > > > NULL padding bytes, but that's not a valid component type
> > > > (ECMA-167 14.16.1.1). So grub_udf_read_symlink branches to
> > > > error logic, returning NULL, instead of the symlink destination
> > > > path.
> > > >
> > > > The result of this bug is that the UDF filesystem tests were
> > > > failing in the symlink test with the grub-fstest error message:
> > > >
> > > >     grub-fstest: error: cannot open `(loop0)/sym': invalid
> > > > symlink.
> > > >
> > > > This change stores the result of doubling sz in another local
> > > > variable s, so as not to modify sz. Also remove unnecessary
> > > > grub_add, which increased the output by 1 to account for a NULL
> > > > byte. This isn't needed because an output buffer of size twice
> > > > sz is already guaranteed to be more than enough to contain the
> > > > path components converted to UTF-8. The worst case upper- bound
> > > > for the needed output buffer size is (sz-4)*1.5, where 4 is the
> > > > size
> > >
> > > I think 4 comes from ECMA-167 spec. Could you add a reference to
> > > it here? The number of paragraph would be perfect...
> >
> > Its 14.16.1 basically in the same place as the reference earlier,
> > which is why I didn't include it. But, yes, I can include it.
> 
> Yes, please.

Ok, will do.

> > > > of a path component header and 1.5 is the maximum growth in
> > > > bytes when converting from 2-byte unicode code-points to UTF-8
> > > > (from 2 bytes to 3).
> > >
> > > Could you explain how did you come up with the 1.5 value? It
> > > would be nice if you refer to a spec or something like that.
> >
> > There is no spec that I know of (but would be happy to know of
> > one). Its something I've deduced based on my understanding of
> > Unicode, UTF-8, and UTF-16. All unicode code points less than or
> > equal to 2 bytes (code points <0x10000) can be represented in UTF-8
> > by a maximum of 3 bytes [1]. Longer UTF-16 encodings don't matter
> > because those will be 4 bytes or longer. The maximum number of
> > bytes for a UTF-8 encoding of a unicode
> 
> The [1] says: Since Errata DCN-5157, the range of code points was
> expanded to all code points from Unicode 4.0 (or any newer or older
> version), which includes Plane 1-16 characters such as Emoji.
> 
> So, I think your assumption about longer encodings is incorrect for
> the UDF.

No, I don't believe so. The "assumption" (actually a logical
conclusion) that I understand you to be calling incorrect is "Longer
UTF-16 encodings don't matter because those will be 4 bytes or longer".
That wording was perhaps a little confusing, as no codepoint takes more
than 4 bytes in UTF-16. Be that as it may, I went on to explain that
they do not matter because those UTF-16 byte strings never occupy more
bytes when re-encoded as UTF-8. The question under debate is "how much
can a UTF-16 byte string grow when re-encoded as UTF-8?". Thus
codepoints that cannot grow (some in fact shrink!) can be disregarded
in the analysis. You mention Unicode Planes 1-16 as relevant, but
remember those are exactly the planes whose codepoints take 4 bytes in
UTF-16 (and 4 bytes in UTF-8), and thus the set of codepoints we just
disregarded as not relevant.

Perhaps you do not believe that every codepoint outside of Plane 0
requires no more bytes in UTF-8 than in UTF-16. If so, please provide
one counterexample, that is, one codepoint outside of Plane 0 whose
UTF-8 encoding is longer than its UTF-16 encoding. Likewise, if you
believe that there exists a codepoint which is 2 bytes in UTF-16 but
occupies 4 bytes in UTF-8, an example would help to make your case.

This explanation may provide more clarity on the matter[1].
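
For anyone who would rather check than take my word for it, here is a
rough sketch in Python (not GRUB code; the function names are made up
for illustration) that walks every Unicode scalar value and compares
its UTF-16 and UTF-8 lengths. It assumes Python's built-in "utf-16-le"
and "utf-8" codecs are a fair stand-in for the encodings discussed
here, and it skips the surrogate range, which does not encode on its
own.

```python
def utf16_utf8_lengths(cp):
    """Return (UTF-16 byte length, UTF-8 byte length) of one codepoint."""
    ch = chr(cp)
    return len(ch.encode("utf-16-le")), len(ch.encode("utf-8"))


def max_growth():
    """Find the worst-case UTF-8/UTF-16 size ratio over all codepoints."""
    worst = 0.0
    worst_cp = None
    for cp in range(0x110000):
        # Lone surrogates are not valid scalar values; skip them.
        if 0xD800 <= cp <= 0xDFFF:
            continue
        u16, u8 = utf16_utf8_lengths(cp)
        if cp >= 0x10000:
            # Planes 1-16: 4 bytes in both encodings, so no growth.
            assert u16 == 4 and u8 == 4
        ratio = u8 / u16
        if ratio > worst:
            worst, worst_cp = ratio, cp
    return worst, worst_cp


worst, worst_cp = max_growth()
print(worst, hex(worst_cp))  # prints "1.5 0x800"
```

The first codepoint to hit the worst case is U+0800, the first one
needing 3 bytes in UTF-8 while still fitting in 2 bytes of UTF-16.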

> > code point is 4 bytes. So the longer UTF-16 encodings can only be
> > equal to or longer than the UTF-8 encoding, thus the UTF-16 ->
> > UTF-8 would be shrinking or maintaining the length of the original
> > byte string. Since the worst case growth is 2 bytes to 3, that's
> > 1.5 times the original string size. QED.
> >
> > Do you want all that in there? I could just remove that part about
> > the 1.5 too.
> >
> > Here's an SO question that addresses this. Yes, unofficial, but I
> > think adds some weight as to the correctness of the logic above.
> >
> > https://stackoverflow.com/questions/55056322/maximum-utf-8-string-size-given-utf-16-size
> >
> > Glenn
> >
> > [1] https://en.wikipedia.org/wiki/UTF-8#Encoding
> 
> Daniel
> 
> [1] https://en.wikipedia.org/wiki/Universal_Disk_Format#Character_set

Glenn

[1] https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings#Efficiency