Re: [PATCH] udf: Fix regression which is preventing symlink access

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Glenn Washburn <development@efficientek.com>
To: Daniel Kiper <dkiper@net-space.pl>
Cc: grub-devel@gnu.org,
	Vladimir 'phcoder' Serbinenko <phcoder@gmail.com>,
	Peter Jones <pjones@redhat.com>
Subject: Re: [PATCH] udf: Fix regression which is preventing symlink access
Date: Wed, 15 Sep 2021 22:52:40 +0000	[thread overview]
Message-ID: <20210915225240.7011f2b7@ubuntu> (raw)
In-Reply-To: <20210915145228.xyu2za42raq5zjby@tomti.i.net-space.pl>

On Wed, 15 Sep 2021 16:52:28 +0200
Daniel Kiper <dkiper@net-space.pl> wrote:

> On Tue, Sep 14, 2021 at 06:19:03PM +0000, Glenn Washburn wrote:
> > On Tue, 14 Sep 2021 16:27:55 +0200
> > Daniel Kiper <dkiper@net-space.pl> wrote:
> >
> > > On Fri, Sep 10, 2021 at 04:03:23PM +0000, Glenn Washburn wrote:
> > > > This code was broken by commit 3f05d693 ("malloc: Use overflow
> > > > checking primitives where we do complex allocations"), which
> > > > added overflow checking in many areas. The problem here is that
> > > > the changes update the local variable sz, which was already in
> > > > use and which was not updated before the change. So the code
> > > > using sz was getting a different value of than it would have
> > > > previously for the same UDF image. This causes the logic
> > > > getting the destination of the symlink to not realize that its
> > > > gotten the full destination, but keeps trying to read past the
> > > > end of the destination. The bytes after the end are generally
> > > > NULL padding bytes, but that's not a valid component type
> > > > (ECMA-167 14.16.1.1). So grub_udf_read_symlink branches to
> > > > error logic, returning NULL, instead of the symlink destination
> > > > path.
> > > >
> > > > The result of this bug is that the UDF filesystem tests were
> > > > failing in the symlink test with the grub-fstest error message:
> > > >
> > > >     grub-fstest: error: cannot open `(loop0)/sym': invalid
> > > > symlink.
> > > >
> > > > This change stores the result of doubling sz in another local
> > > > variable s, so as not to modify sz. Also remove unnecessary
> > > > grub_add, which increased the output by 1 to account for a NULL
> > > > byte. This isn't needed because an output buffer of size twice
> > > > sz is already guaranteed to be more than enough to contain the
> > > > path components converted to UTF-8. The worst case upper- bound
> > > > for the needed output buffer size is (sz-4)*1.5, where 4 is the
> > > > size
> > >
> > > I think 4 comes from ECMA-167 spec. Could you add a reference to
> > > it here? The number of paragraph would be perfect...
> >
> > Its 14.16.1 basically in the same place as the reference earlier,
> > which is why I didn't include it. But, yes, I can include it.
> 
> Yes, please.

Ok, will do.

> > > > of a path component header and 1.5 is the maximum growth in
> > > > bytes when converting from 2-byte unicode code-points to UTF-8
> > > > (from 2 bytes to 3).
> > >
> > > Could you explain how did you come up with the 1.5 value? It
> > > would be nice if you refer to a spec or something like that.
> >
> > There is no spec that I know of (but would be happy to know of
> > one). Its something I've deduced based on my understanding of
> > Unicode, UTF-8, and UTF-16. All unicode code points less than or
> > equal to 2 bytes (code points <0x10000) can be represented in UTF-8
> > by a maximum of 3 bytes [1]. Longer UTF-16 encodings don't matter
> > because those will be 4 bytes or longer. The maximum number of
> > bytes for a UTF-8 encoding of a unicode
> 
> The [1] says: Since Errata DCN-5157, the range of code points was
> expanded to all code points from Unicode 4.0 (or any newer or older
> version), which includes Plane 1-16 characters such as Emoji.
> 
> So, I think your assumption about longer encodings is incorrect for
> the UDF.

No, I don't believe so. The "assumption", which was actually a logical
conclusion, that I understand you to be saying is incorrect is "Longer
UTF-16 encodings don't matter because those will be 4 bytes or longer".
This was perhaps a little confusing as there are no UTF-16 encoded
codepoints longer than 4 bytes. Be that as it may, I go on to explain
that they do not matter because those UTF-16 bytes strings will never
occupy more bytes when encoded in UTF-8. The question of debate is "how
much can a UTF-16 byte string grow when re-encoded as UTF-8?". Thus
codepoints that can not grow (most in fact strink!) can be disregarded
in the analysis. You talk about unicode Planes 1-16 as relevant, but
remember those Planes are the ones where the UTF-16 is 4 bytes, and
thus the set of code words we just disregarded as not relevant.

Perhaps you do not believe that all codepoints outside of Plane 0
require no more bytes in UTF-16 than in UTF-8. If so, please provide one
counter example, that is one codepoint satisfying the condition.
Likewise, if you believe that there exists a 2-byte codepoint in UTF-16
which occupies 4-bytes in UTF-8, a counter example would help to make
your case.

This explanation may provide more clarity on the matter[1].

> > code point is 4 bytes. So the longer UTF-16 encodings can only be
> > equal to or longer than the UTF-8 encoding, thus the UTF-16 ->
> > UTF-8 would be shrinking or maintaining the length of the original
> > byte string. Since the worst case growth is 2 bytes to 3, that's
> > 1.5 times the original string size. QED.
> >
> > Do you want all that in there? I could just remove that part about
> > the 1.5 too.
> >
> > Here's an SO question that addresses this. Yes, unofficial, but I
> > think adds some weight as to the correctness of the logic above.
> >
> > https://stackoverflow.com/questions/55056322/maximum-utf-8-string-size-given-utf-16-size
> >
> > Glenn
> >
> > [1] https://en.wikipedia.org/wiki/UTF-8#Encoding
> 
> Daniel
> 
> [1] https://en.wikipedia.org/wiki/Universal_Disk_Format#Character_set

Glenn

[1]
https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings#Efficiency

next prev parent reply	other threads:[~2021-09-15 22:53 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-10 16:03 [PATCH] udf: Fix regression which is preventing symlink access Glenn Washburn
2021-09-14 14:27 ` Daniel Kiper
2021-09-14 18:19   ` Glenn Washburn
2021-09-15 14:52     ` Daniel Kiper
2021-09-15 22:52       ` Glenn Washburn [this message]
2021-09-16 21:34         ` Daniel Kiper

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210915225240.7011f2b7@ubuntu \
    --to=development@efficientek.com \
    --cc=dkiper@net-space.pl \
    --cc=grub-devel@gnu.org \
    --cc=phcoder@gmail.com \
    --cc=pjones@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.