git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] refs/packed: fix BUG when seeking refs with UTF-8 characters
@ 2025-04-04 10:58 Patrick Steinhardt
  2025-04-04 18:21 ` Elijah Newren
  2025-04-04 20:57 ` Jeff King
  0 siblings, 2 replies; 5+ messages in thread
From: Patrick Steinhardt @ 2025-04-04 10:58 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Karthik Nayak, brian m. carlson, Jeff King,
	Junio C Hamano, shejialuo, Christian Couder

It was reported that using git-pull(1) in a repository whose remote
contains branches with emojis leads to the following bug:

    $ git pull
    remote: Enumerating objects: 161255, done.
    remote: Counting objects: 100% (55884/55884), done.
    remote: Compressing objects: 100% (5518/5518), done.
    remote: Total 161255 (delta 54253), reused 50509 (delta 50364),
    pack-reused 105371 (from 4)
    Receiving objects: 100% (161255/161255), 309.90 MiB | 16.87 MiB/s, done.
    Resolving deltas: 100% (118048/118048), completed with 13416 local objects.
    From github.com:github/github
       97ab7ae3f3745..8fb2f9fa180ed  master -> origin/master
    [...snip many screenfuls of updates to origin remotes...]
    BUG: refs/packed-backend.c:984: packed-refs backend yielded reference
    preceding its prefix
    error: fetch died of signal 6

This issue bisects to 22600c04529 (refs/iterator: implement seeking for
packed-ref iterators, 2025-03-12) where we have implemented seeking for
the packed-ref iterator. As part of that change we introduced a check
that verifies that the iterator only returns refnames bigger than the
prefix. In theory, this check should always hold: when a prefix is set
we know that we would've seeked that prefix first, so we should never
see a reference sorting before that prefix.

But in practice the check itself is misbehaving when handling unicode
characters. The particular issue triggered with a branch that got the
"shaved ice" unicode character in its name, which is composed of the
bytes "0xEE 0x90 0xBF". The bug triggers when we compare the refname
"refs/heads/<shaved-ice>" to something like "refs/heads/z", and it
specifically hits when comparing the first byte, "0xEE".

The root cause is that the most-significant bit of 0xEE is set. The
`refname` and `prefix` pointers that we use to compare bytes with one
another are both pointers to signed characters. As such, when we
dereference the 0xEE byte the result is a _negative_ value, and this
value will of course compare smaller than "z".

We can see that this issue is avoided in `cmp_packed_refname()`, where
we explicitly cast each byte to its unsigned form. Fix the bug by doing
the same in `packed_ref_iterator_advance()`.

Reported-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
Hi,

this patch addresses the issue reported by Elijah at [1]. Thanks!

Patrick

[1]: <CABPp-BFBqC_t5QSexRQpYsqXBa11WK+OqGt167E=K=xod=buQw@mail.gmail.com>
---
 refs/packed-backend.c  |  4 ++--
 t/t1408-packed-refs.sh | 15 +++++++++++++++
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index b4289a7d9ce..7e31904bd41 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -980,9 +980,9 @@ static int packed_ref_iterator_advance(struct ref_iterator *ref_iterator)
 			continue;
 
 		while (prefix && *prefix) {
-			if (*refname < *prefix)
+			if ((unsigned char)*refname < (unsigned char)*prefix)
 				BUG("packed-refs backend yielded reference preceding its prefix");
-			else if (*refname > *prefix)
+			else if ((unsigned char)*refname > (unsigned char)*prefix)
 				return ITER_DONE;
 			prefix++;
 			refname++;
diff --git a/t/t1408-packed-refs.sh b/t/t1408-packed-refs.sh
index 41ba1f1d7fc..833477f0fa3 100755
--- a/t/t1408-packed-refs.sh
+++ b/t/t1408-packed-refs.sh
@@ -42,4 +42,19 @@ test_expect_success 'no error from stale entry in packed-refs' '
 	test_cmp expect actual
 '
 
+test_expect_success 'list packed refs with unicode characters' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+	(
+		cd repo &&
+		test_commit --no-tag A &&
+		git update-ref refs/heads/ HEAD &&
+		git update-ref refs/heads/z HEAD &&
+		git pack-refs --all &&
+		printf "%s commit\trefs/heads/z\n" $(git rev-parse HEAD) >expect &&
+		git for-each-ref refs/heads/z >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done

---
base-commit: 5b97a56fa0e7d580dc8865b73107407c9b3f0eff
change-id: 20250404-b4-pks-packed-backend-seek-with-utf8-668c182ddcf7


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] refs/packed: fix BUG when seeking refs with UTF-8 characters
  2025-04-04 10:58 [PATCH] refs/packed: fix BUG when seeking refs with UTF-8 characters Patrick Steinhardt
@ 2025-04-04 18:21 ` Elijah Newren
  2025-04-04 20:57 ` Jeff King
  1 sibling, 0 replies; 5+ messages in thread
From: Elijah Newren @ 2025-04-04 18:21 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Karthik Nayak, brian m. carlson, Jeff King, Junio C Hamano,
	shejialuo, Christian Couder

On Fri, Apr 4, 2025 at 3:58 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> It was reported that using git-pull(1) in a repository whose remote
> contains branches with emojis leads to the following bug:
>
>     $ git pull
>     remote: Enumerating objects: 161255, done.
>     remote: Counting objects: 100% (55884/55884), done.
>     remote: Compressing objects: 100% (5518/5518), done.
>     remote: Total 161255 (delta 54253), reused 50509 (delta 50364),
>     pack-reused 105371 (from 4)
>     Receiving objects: 100% (161255/161255), 309.90 MiB | 16.87 MiB/s, done.
>     Resolving deltas: 100% (118048/118048), completed with 13416 local objects.
>     From github.com:github/github
>        97ab7ae3f3745..8fb2f9fa180ed  master -> origin/master
>     [...snip many screenfuls of updates to origin remotes...]
>     BUG: refs/packed-backend.c:984: packed-refs backend yielded reference
>     preceding its prefix
>     error: fetch died of signal 6
>
> This issue bisects to 22600c04529 (refs/iterator: implement seeking for
> packed-ref iterators, 2025-03-12) where we have implemented seeking for
> the packed-ref iterator. As part of that change we introduced a check
> that verifies that the iterator only returns refnames bigger than the
> prefix. In theory, this check should always hold: when a prefix is set
> we know that we would've seeked that prefix first, so we should never
> see a reference sorting before that prefix.
>
> But in practice the check itself is misbehaving when handling unicode
> characters. The particular issue triggered with a branch that got the
> "shaved ice" unicode character in its name, which is composed of the
> bytes "0xEE 0x90 0xBF". The bug triggers when we compare the refname
> "refs/heads/<shaved-ice>" to something like "refs/heads/z", and it
> specifically hits when comparing the first byte, "0xEE".
>
> The root cause is that the most-significant bit of 0xEE is set. The
> `refname` and `prefix` pointers that we use to compare bytes with one
> another are both pointers to signed characters. As such, when we
> dereference the 0xEE byte the result is a _negative_ value, and this
> value will of course compare smaller than "z".
>
> We can see that this issue is avoided in `cmp_packed_refname()`, where
> we explicitly cast each byte to its unsigned form. Fix the bug by doing
> the same in `packed_ref_iterator_advance()`.
>
> Reported-by: Elijah Newren <newren@gmail.com>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> Hi,
>
> this patch addresses the issue reported by Elijah at [1]. Thanks!

Thanks for the fix!

> Patrick
>
> [1]: <CABPp-BFBqC_t5QSexRQpYsqXBa11WK+OqGt167E=K=xod=buQw@mail.gmail.com>
> ---
>  refs/packed-backend.c  |  4 ++--
>  t/t1408-packed-refs.sh | 15 +++++++++++++++
>  2 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/refs/packed-backend.c b/refs/packed-backend.c
> index b4289a7d9ce..7e31904bd41 100644
> --- a/refs/packed-backend.c
> +++ b/refs/packed-backend.c
> @@ -980,9 +980,9 @@ static int packed_ref_iterator_advance(struct ref_iterator *ref_iterator)
>                         continue;
>
>                 while (prefix && *prefix) {
> -                       if (*refname < *prefix)
> +                       if ((unsigned char)*refname < (unsigned char)*prefix)
>                                 BUG("packed-refs backend yielded reference preceding its prefix");
> -                       else if (*refname > *prefix)
> +                       else if ((unsigned char)*refname > (unsigned char)*prefix)
>                                 return ITER_DONE;
>                         prefix++;
>                         refname++;

Oh, right, I assumed there was going to be lots of other places that
needed casting outside this function.  If I would have merely checked
the one other line in this function and updated it, I would have had
the fix...

> diff --git a/t/t1408-packed-refs.sh b/t/t1408-packed-refs.sh
> index 41ba1f1d7fc..833477f0fa3 100755
> --- a/t/t1408-packed-refs.sh
> +++ b/t/t1408-packed-refs.sh
> @@ -42,4 +42,19 @@ test_expect_success 'no error from stale entry in packed-refs' '
>         test_cmp expect actual
>  '
>
> +test_expect_success 'list packed refs with unicode characters' '
> +       test_when_finished "rm -rf repo" &&
> +       git init repo &&
> +       (
> +               cd repo &&
> +               test_commit --no-tag A &&
> +               git update-ref refs/heads/ HEAD &&
> +               git update-ref refs/heads/z HEAD &&
> +               git pack-refs --all &&
> +               printf "%s commit\trefs/heads/z\n" $(git rev-parse HEAD) >expect &&
> +               git for-each-ref refs/heads/z >actual &&
> +               test_cmp expect actual
> +       )
> +'
> +
>  test_done
>
> ---
> base-commit: 5b97a56fa0e7d580dc8865b73107407c9b3f0eff
> change-id: 20250404-b4-pks-packed-backend-seek-with-utf8-668c182ddcf7

I also tested this on my backup of the repo from when I triggered the
error, first making sure that I could still trigger again without this
fix, and then that this fix handled that case.  Works great, thanks!

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] refs/packed: fix BUG when seeking refs with UTF-8 characters
  2025-04-04 10:58 [PATCH] refs/packed: fix BUG when seeking refs with UTF-8 characters Patrick Steinhardt
  2025-04-04 18:21 ` Elijah Newren
@ 2025-04-04 20:57 ` Jeff King
  2025-04-05  1:38   ` brian m. carlson
  1 sibling, 1 reply; 5+ messages in thread
From: Jeff King @ 2025-04-04 20:57 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Elijah Newren, Karthik Nayak, brian m. carlson,
	Junio C Hamano, shejialuo, Christian Couder

On Fri, Apr 04, 2025 at 12:58:38PM +0200, Patrick Steinhardt wrote:

> But in practice the check itself is misbehaving when handling unicode
> characters. The particular issue triggered with a branch that got the
> "shaved ice" unicode character in its name, which is composed of the
> bytes "0xEE 0x90 0xBF". The bug triggers when we compare the refname
> "refs/heads/<shaved-ice>" to something like "refs/heads/z", and it
> specifically hits when comparing the first byte, "0xEE".
> 
> The root cause is that the most-significant bit of 0xEE is set. The
> `refname` and `prefix` pointers that we use to compare bytes with one
> another are both pointers to signed characters. As such, when we
> dereference the 0xEE byte the result is a _negative_ value, and this
> value will of course compare smaller than "z".
> 
> We can see that this issue is avoided in `cmp_packed_refname()`, where
> we explicitly cast each byte to its unsigned form. Fix the bug by doing
> the same in `packed_ref_iterator_advance()`.

Ah, good catch. I think this signed-ness issue has come up before, long
ago, but I don't remember the context. In theory any stable ordering is
OK for sorting, but of course cmp_packed_refname() chose to use unsigned
in order to match strcmp(), and the standard defines it as interpreting
the bytes as unsigned. One of the enjoyable quirks of C.

> diff --git a/refs/packed-backend.c b/refs/packed-backend.c
> index b4289a7d9ce..7e31904bd41 100644
> --- a/refs/packed-backend.c
> +++ b/refs/packed-backend.c
> @@ -980,9 +980,9 @@ static int packed_ref_iterator_advance(struct ref_iterator *ref_iterator)
>  			continue;
>  
>  		while (prefix && *prefix) {
> -			if (*refname < *prefix)
> +			if ((unsigned char)*refname < (unsigned char)*prefix)
>  				BUG("packed-refs backend yielded reference preceding its prefix");
> -			else if (*refname > *prefix)
> +			else if ((unsigned char)*refname > (unsigned char)*prefix)
>  				return ITER_DONE;
>  			prefix++;
>  			refname++;

The patch itself looks good to me.

> +test_expect_success 'list packed refs with unicode characters' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +		test_commit --no-tag A &&
> +		git update-ref refs/heads/ HEAD &&
> +		git update-ref refs/heads/z HEAD &&

It's possible some filesystems might be unhappy with this character, but
I guess we can see if anybody screams.

> +		git pack-refs --all &&
> +		printf "%s commit\trefs/heads/z\n" $(git rev-parse HEAD) >expect &&
> +		git for-each-ref refs/heads/z >actual &&
> +		test_cmp expect actual

This loses the exit code of rev-parse, but IMHO that is not a big deal.
We'd notice the broken output when we call test_cmp. I don't know if
people who are eagerly hunting down missed exit codes might flag it,
though.

Thanks for the quick turnaround on this (and to Elijah for reporting).

-Peff

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] refs/packed: fix BUG when seeking refs with UTF-8 characters
  2025-04-04 20:57 ` Jeff King
@ 2025-04-05  1:38   ` brian m. carlson
  2025-04-08  6:46     ` Jeff King
  0 siblings, 1 reply; 5+ messages in thread
From: brian m. carlson @ 2025-04-05  1:38 UTC (permalink / raw)
  To: Jeff King
  Cc: Patrick Steinhardt, git, Elijah Newren, Karthik Nayak,
	Junio C Hamano, shejialuo, Christian Couder

[-- Attachment #1: Type: text/plain, Size: 3981 bytes --]

On 2025-04-04 at 20:57:40, Jeff King wrote:
> On Fri, Apr 04, 2025 at 12:58:38PM +0200, Patrick Steinhardt wrote:
> 
> > But in practice the check itself is misbehaving when handling unicode
> > characters. The particular issue triggered with a branch that got the
> > "shaved ice" unicode character in its name, which is composed of the
> > bytes "0xEE 0x90 0xBF". The bug triggers when we compare the refname
> > "refs/heads/<shaved-ice>" to something like "refs/heads/z", and it
> > specifically hits when comparing the first byte, "0xEE".
> > 
> > The root cause is that the most-significant bit of 0xEE is set. The
> > `refname` and `prefix` pointers that we use to compare bytes with one
> > another are both pointers to signed characters. As such, when we
> > dereference the 0xEE byte the result is a _negative_ value, and this
> > value will of course compare smaller than "z".
> > 
> > We can see that this issue is avoided in `cmp_packed_refname()`, where
> > we explicitly cast each byte to its unsigned form. Fix the bug by doing
> > the same in `packed_ref_iterator_advance()`.
> 
> Ah, good catch. I think this signed-ness issue has come up before, long
> ago, but I don't remember the context. In theory any stable ordering is
> OK for sorting, but of course cmp_packed_refname() chose to use unsigned
> in order to match strcmp(), and the standard defines it as interpreting
> the bytes as unsigned. One of the enjoyable quirks of C.

I'd agree it's a quirk, but I'm not sure I'd call it enjoyable.  Anyway,
this does seem like the right solution and I agree matching strcmp is
the right decision here.  I also think unsigned byte comparisons are
more intuitive, honestly.

> > diff --git a/refs/packed-backend.c b/refs/packed-backend.c
> > index b4289a7d9ce..7e31904bd41 100644
> > --- a/refs/packed-backend.c
> > +++ b/refs/packed-backend.c
> > @@ -980,9 +980,9 @@ static int packed_ref_iterator_advance(struct ref_iterator *ref_iterator)
> >  			continue;
> >  
> >  		while (prefix && *prefix) {
> > -			if (*refname < *prefix)
> > +			if ((unsigned char)*refname < (unsigned char)*prefix)
> >  				BUG("packed-refs backend yielded reference preceding its prefix");
> > -			else if (*refname > *prefix)
> > +			else if ((unsigned char)*refname > (unsigned char)*prefix)
> >  				return ITER_DONE;
> >  			prefix++;
> >  			refname++;
> 
> The patch itself looks good to me.

Same here.

> > +test_expect_success 'list packed refs with unicode characters' '
> > +	test_when_finished "rm -rf repo" &&
> > +	git init repo &&
> > +	(
> > +		cd repo &&
> > +		test_commit --no-tag A &&
> > +		git update-ref refs/heads/ HEAD &&
> > +		git update-ref refs/heads/z HEAD &&
> 
> It's possible some filesystems might be unhappy with this character, but
> I guess we can see if anybody screams.

I very much doubt that's going to be a problem.  Windows uses UTF-16
internally, so it will have no problems with Unicode; macOS only allows
UTF-8 in its file systems and I know it gracefully supports emoji; and
other Unix systems don't care one way or the other because it's just
some bytes.  Even ISO9660 with Rock Ridge or Joliet extensions will work
here, as will FAT, UDF (used on DVDs and some hard disks), and 9P.

Certainly somebody could try something very esoteric, but I expect other
things will break as well.  I'm okay with favouring testing things we
know many people _are_ using (emojis and Unicode) over things very few
people are using (very esoteric file systems[0]).

I could imagine a hypothetical encoding issue in the Cygwin or MSYS
layer being a problem, but if it passes the testsuite in CI, I'm almost
certain it will be fine.

[0] I think I'm familiar with a decent number of file systems and I
cannot think of any available on a modern OS that would have this
problem.  I'm sure some must exist, though.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] refs/packed: fix BUG when seeking refs with UTF-8 characters
  2025-04-05  1:38   ` brian m. carlson
@ 2025-04-08  6:46     ` Jeff King
  0 siblings, 0 replies; 5+ messages in thread
From: Jeff King @ 2025-04-08  6:46 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Patrick Steinhardt, git, Elijah Newren, Karthik Nayak,
	Junio C Hamano, shejialuo, Christian Couder

On Sat, Apr 05, 2025 at 01:38:36AM +0000, brian m. carlson wrote:

> > Ah, good catch. I think this signed-ness issue has come up before, long
> > ago, but I don't remember the context. In theory any stable ordering is
> > OK for sorting, but of course cmp_packed_refname() chose to use unsigned
> > in order to match strcmp(), and the standard defines it as interpreting
> > the bytes as unsigned. One of the enjoyable quirks of C.
> 
> I'd agree it's a quirk, but I'm not sure I'd call it enjoyable.  Anyway,
> this does seem like the right solution and I agree matching strcmp is
> the right decision here.  I also think unsigned byte comparisons are
> more intuitive, honestly.

My sarcasm may have gotten lost in the wires. ;)

And yes, I think unsigned byte comparisons are more intuitive, too. The
fact that "char" is (or according to the standard _might_ be) signed is
confusing, especially in a world of unicode.

> > It's possible some filesystems might be unhappy with this character, but
> > I guess we can see if anybody screams.
> 
> I very much doubt that's going to be a problem.  Windows uses UTF-16
> internally, so it will have no problems with Unicode; macOS only allows
> UTF-8 in its file systems and I know it gracefully supports emoji; and
> other Unix systems don't care one way or the other because it's just
> some bytes.  Even ISO9660 with Rock Ridge or Joliet extensions will work
> here, as will FAT, UDF (used on DVDs and some hard disks), and 9P.
> 
> Certainly somebody could try something very esoteric, but I expect other
> things will break as well.  I'm okay with favouring testing things we
> know many people _are_ using (emojis and Unicode) over things very few
> people are using (very esoteric file systems[0]).

Good point that people are doing this in practice anyway. The worst case
is probably that we flush out some interesting limitations, learn more
about the world, and add a small prereq flag to the test.

-Peff

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-04-08  6:46 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-04 10:58 [PATCH] refs/packed: fix BUG when seeking refs with UTF-8 characters Patrick Steinhardt
2025-04-04 18:21 ` Elijah Newren
2025-04-04 20:57 ` Jeff King
2025-04-05  1:38   ` brian m. carlson
2025-04-08  6:46     ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).