[PATCH 0/10] some zlib inflating bug fixes

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/10] some zlib inflating bug fixes
@ 2025-02-25  6:25 Jeff King
  2025-02-25  6:28 ` [PATCH 01/10] loose_object_info(): BUG() on inflating content with unknown type Jeff King
                   ` (10 more replies)
  0 siblings, 11 replies; 34+ messages in thread
From: Jeff King @ 2025-02-25  6:25 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

Here are a few bug fixes related to zlib-inflating objects. One is an
infinite loop, but triggering it requires writing to the local repo
along with running some seldom-used commands, so I think is not a
security risk.  The other can be triggered by sending a specially
crafted pack, but it triggers a BUG(), so you'd only be crashing out
your own push.

The fixes themselves are in patches 4 and 5. The rest are related
cleanups or clarifications in nearby code.

  [01/10]: loose_object_info(): BUG() on inflating content with unknown type
  [02/10]: unpack_loose_header(): simplify next_out assignment
  [03/10]: unpack_loose_header(): report headers without NUL as "bad"
  [04/10]: unpack_loose_header(): fix infinite loop on broken zlib input
  [05/10]: git_inflate(): skip zlib_post_call() sanity check on Z_NEED_DICT
  [06/10]: unpack_loose_header(): avoid numeric comparison of zlib status
  [07/10]: unpack_loose_rest(): avoid numeric comparison of zlib status
  [08/10]: unpack_loose_rest(): never clean up zstream
  [09/10]: unpack_loose_rest(): simplify error handling
  [10/10]: unpack_loose_rest(): rewrite return handling for clarity

 git-zlib.c          | 27 +++++++++++++----------
 object-file.c       | 48 ++++++++++++++++++++--------------------
 t/t1006-cat-file.sh | 53 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 92 insertions(+), 36 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 01/10] loose_object_info(): BUG() on inflating content with unknown type
  2025-02-25  6:25 [PATCH 0/10] some zlib inflating bug fixes Jeff King
@ 2025-02-25  6:28 ` Jeff King
  2025-02-25 11:42   ` Patrick Steinhardt
                     ` (2 more replies)
  2025-02-25  6:29 ` [PATCH 02/10] unpack_loose_header(): simplify next_out assignment Jeff King
                   ` (9 subsequent siblings)
  10 siblings, 3 replies; 34+ messages in thread
From: Jeff King @ 2025-02-25  6:28 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

After unpack_loose_header() returns, it will have inflated not only the
object header, but possibly some bytes of the object content. When we
call unpack_loose_rest() to extract the actual content, it finds those
extra bytes by skipping past the header's terminating NUL in the buffer.
Like this:

  int bytes = strlen(buffer) + 1;
  n = stream->total_out - bytes;
  ...
  memcpy(buf, (char *) buffer + bytes, n);

This won't work with the OBJECT_INFO_ALLOW_UNKNOWN_TYPE flag, as there
we allow a header of arbitrary size. We put into a strbuf, but feed only
the final 32-byte chunk we read to unpack_loose_rest(). In that case
stream->total_out may unexpectedly large, and thus our "n" will be
large, causing an out-of-bounds read (we do check it against our
allocated buffer size, which prevents an out-of-bounds write).

Probably this could be made to work by feeding the strbuf to
unpack_loose_rest(), along with adjusting some types (e.g., "bytes"
would need to be a size_t, since it is no longer operating on a 32-byte
buffer).

But I don't think it's possible to actually trigger this in practice.
The only caller who passes ALLOW_UNKNOWN_TYPE is cat-file, which only
allows it with the "-t" and "-s" options (neither of which access the
content). There is one way you can _almost_ trigger it: the oid compat
routines (i.e., accessing sha1 via sha256 names and vice versa) will
convert objects on the fly (which requires access to the content) using
the same flags that were passed in. So in theory this:

  t='some very large type field that causes an extra inflate call'
  sha1_oid=$(git hash-object -w -t "$t" file)
  sha256_oid=$(git rev-parse --output-object-format=sha256 $sha1_oid)
  git cat-file --allow-unknown-type -s $sha256_oid

would try to access the content. But it doesn't work, because using
compat objects requires an entry in the .git/objects/loose-object-idx
file, and we don't generate such an entry for non-standard types (see
the "compat" section of write_object_file_literally()).

If we use "t=blob" instead, then it does access the compat object, but
it doesn't trigger the problem (because "blob" is a standard short type
name, and it fits in the initial 32-byte buffer).

So given that this is almost a memory error bug, I think it's worth
addressing. But because we can't actually trigger the situation, I'm
hesitant to try a fix that we can't run. Instead let's document the
restriction and protect ourselves from the out-of-bounds read by adding
a BUG() check.

Signed-off-by: Jeff King <peff@peff.net>
---
I found this because I was tracing the code path after
unpack_loose_header() returns to verify some assumptions in the other
patches.

It really makes me wonder if this "unknown type" stuff has any value
at all. You can create an object with any type using "hash-object
--literally -t". And you can ask about its type and size. But you can
never retrieve the object content! Nor can you pack it or transfer it,
since packs use a numeric type field.

This code was added ~2015, but I don't think anybody built more on top
of it. I wonder if we should just consider it a failed experiment and
rip out the support.

 object-file.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/object-file.c b/object-file.c
index 00c3a4b910..45b251ba04 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1580,6 +1580,8 @@ static int loose_object_info(struct repository *r,

 		if (!oi->contentp)
 			break;
+		if (hdrbuf.len)
+			BUG("unpacking content with unknown types not yet supported");
 		*oi->contentp = unpack_loose_rest(&stream, hdr, *oi->sizep, oid);
 		if (*oi->contentp)
 			goto cleanup;
-- 
2.48.1.709.gf47ae731ff

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 02/10] unpack_loose_header(): simplify next_out assignment
  2025-02-25  6:25 [PATCH 0/10] some zlib inflating bug fixes Jeff King
  2025-02-25  6:28 ` [PATCH 01/10] loose_object_info(): BUG() on inflating content with unknown type Jeff King
@ 2025-02-25  6:29 ` Jeff King
  2025-02-28  0:18   ` Taylor Blau
  2025-02-25  6:29 ` [PATCH 03/10] unpack_loose_header(): report headers without NUL as "bad" Jeff King
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Jeff King @ 2025-02-25  6:29 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

When using OBJECT_INFO_ALLOW_UNKNOWN_TYPE to unpack a header that
doesn't fit into our initial 32-byte buffer, we loop over calls
git_inflate(), feeding it our buffer to the "next_out" pointer each
time. As the code is written, we reset next_out after each inflate call
(and after reading the output), ready for the next loop.

This isn't wrong, but there are a few advantages to setting up
"next_out" right before each inflate call, rather than after:

  1. It drops a few duplicated lines of code.

  2. It makes it obvious that we always feed a fresh buffer on each call
     (and thus can never see Z_BUF_ERROR due to due to a lack of output
     space).

  3. After we exit the loop, we'll leave stream->next_out pointing to
     the end of the fetched data (this is how zlib callers find out how
     much data is in the buffer). This doesn't matter in practice, since
     nobody looks at it again. But it's probably the least-surprising
     thing to do, as it matches how next_out is left when the whole
     thing fits in the initial 32-byte buffer (and we don't enter the
     loop at all).

Signed-off-by: Jeff King <peff@peff.net>
---
Not strictly necessary, but I think it makes some of the reasoning in
patch 4 easier.

 object-file.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/object-file.c b/object-file.c
index 45b251ba04..0fd42981fb 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1385,18 +1385,17 @@ enum unpack_loose_header_result unpack_loose_header(git_zstream *stream,
 	 * reading the stream.
 	 */
 	strbuf_add(header, buffer, stream->next_out - (unsigned char *)buffer);
-	stream->next_out = buffer;
-	stream->avail_out = bufsiz;
 
 	do {
+		stream->next_out = buffer;
+		stream->avail_out = bufsiz;
+
 		obj_read_unlock();
 		status = git_inflate(stream, 0);
 		obj_read_lock();
 		strbuf_add(header, buffer, stream->next_out - (unsigned char *)buffer);
 		if (memchr(buffer, '\0', stream->next_out - (unsigned char *)buffer))
 			return 0;
-		stream->next_out = buffer;
-		stream->avail_out = bufsiz;
 	} while (status != Z_STREAM_END);
 	return ULHR_TOO_LONG;
 }
-- 
2.48.1.709.gf47ae731ff


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 03/10] unpack_loose_header(): report headers without NUL as "bad"
  2025-02-25  6:25 [PATCH 0/10] some zlib inflating bug fixes Jeff King
  2025-02-25  6:28 ` [PATCH 01/10] loose_object_info(): BUG() on inflating content with unknown type Jeff King
  2025-02-25  6:29 ` [PATCH 02/10] unpack_loose_header(): simplify next_out assignment Jeff King
@ 2025-02-25  6:29 ` Jeff King
  2025-02-25  6:29 ` [PATCH 04/10] unpack_loose_header(): fix infinite loop on broken zlib input Jeff King
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 34+ messages in thread
From: Jeff King @ 2025-02-25  6:29 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

If a caller asks us to read the whole loose object header value into a
strbuf (e.g., via "cat-file --allow-unknown-type"), we'll keep reading
until we see a NUL byte marking the end of the header.

If we hit Z_STREAM_END before seeing the NUL, we obviously have to stop.
But we return ULHR_TOO_LONG, which doesn't make any sense. The "too
long" return code is used in the normal, 32-byte limited mode to
indicate that we stopped looking. There is no such thing as "too long"
here, as we'd keep reading forever until we see the end of stream or the
NUL.

Instead, we should return ULHR_BAD. The loose object has no NUL marking
the end of header, so it is malformed. The behavior difference is
slight; in either case we'd consider the object unreadable and refuse to
go further. The only difference is the specific error message we
produce.

There's no test case here, as we'd need to generate a valid zlib stream
without a NUL. That's not something Git will do without writing new
custom code. And in the next patch we'll fix another bug in this area
which will make this easier to do (and we will test it then).

Signed-off-by: Jeff King <peff@peff.net>
---
You might be curious what happens if we see an error before the stream
end. If so, read on to patch 4...

 object-file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/object-file.c b/object-file.c
index 0fd42981fb..8c9295413b 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1397,7 +1397,7 @@ enum unpack_loose_header_result unpack_loose_header(git_zstream *stream,
 		if (memchr(buffer, '\0', stream->next_out - (unsigned char *)buffer))
 			return 0;
 	} while (status != Z_STREAM_END);
-	return ULHR_TOO_LONG;
+	return ULHR_BAD;
 }

 static void *unpack_loose_rest(git_zstream *stream,
-- 
2.48.1.709.gf47ae731ff

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 04/10] unpack_loose_header(): fix infinite loop on broken zlib input
  2025-02-25  6:25 [PATCH 0/10] some zlib inflating bug fixes Jeff King
                   ` (2 preceding siblings ...)
  2025-02-25  6:29 ` [PATCH 03/10] unpack_loose_header(): report headers without NUL as "bad" Jeff King
@ 2025-02-25  6:29 ` Jeff King
  2025-02-25 11:42   ` Patrick Steinhardt
                     ` (2 more replies)
  2025-02-25  6:30 ` [PATCH 05/10] git_inflate(): skip zlib_post_call() sanity check on Z_NEED_DICT Jeff King
                   ` (6 subsequent siblings)
  10 siblings, 3 replies; 34+ messages in thread
From: Jeff King @ 2025-02-25  6:29 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

When reading a loose object, we first try to expand the first 32 bytes
to read the type+size header. This is enough for any of the normal Git
types. But since 46f034483e (sha1_file: support reading from a loose
object of unknown type, 2015-05-03), the caller can also ask us to parse
any unknown names, which can be much longer. In this case we keep
inflating until we find the NUL at the end of the header, or hit
Z_STREAM_END.

But what if zlib can't make forward progress? For example, if the loose
object file is truncated, we'll have no more data to feed it. It will
return Z_BUF_ERROR, and we'll just loop infinitely, calling
git_inflate() over and over but never seeing new bytes nor an
end-of-stream marker.

We can fix this by only looping when we think we can make forward
progress. This will always be Z_OK in this case. In other code we might
also be able to continue on Z_BUF_ERROR, but:

  - We will never see Z_BUF_ERROR because the output buffer is full; we
    always feed a fresh 32-byte buffer on each call to git_inflate().

  - We may see Z_BUF_ERROR if we run out of input. But since we've fed
    the whole mmap'd buffer to zlib, if it runs out of input there is
    nothing more we can do.

So if we don't see Z_OK (and didn't see the end-of-header NUL, otherwise
we'd have broken out of the loop), then we should stop looping and
return an error.

The test case shows an example where the input is truncated (which gives
us the input Z_BUF_ERROR case above).

Although we do operate on objects we might get from an untrusted remote,
I don't think the security implications of this bug are too great. It
can only trigger if both of these are true:

  - You're reading a loose object whose on-disk representation was
    written by an attacker. So fetching an object (or receiving a push)
    are mostly OK, because even with unpack-objects it is our local,
    trusted code that writes out the object file. The exception may be
    fetching from an untrusted local repo, or using dumb-http, which
    copies objects verbatim. But...

  - The only code path which triggers the inflate loop is cat-file's
    --allow-unknown-type option. This is unlikely to be called at all
    outside of debugging. But I also suspect that objects with
    non-standard types (or that are truncated) would not survive the
    usual fetch/receive checks in the first place.

So I think it would be quite hard to trick somebody into running the
infinite loop, and we can just fix the bug.

Co-authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 object-file.c       |  2 +-
 t/t1006-cat-file.sh | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/object-file.c b/object-file.c
index 8c9295413b..5b2446bfc1 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1396,7 +1396,7 @@ enum unpack_loose_header_result unpack_loose_header(git_zstream *stream,
 		strbuf_add(header, buffer, stream->next_out - (unsigned char *)buffer);
 		if (memchr(buffer, '\0', stream->next_out - (unsigned char *)buffer))
 			return 0;
-	} while (status != Z_STREAM_END);
+	} while (status == Z_OK);
 	return ULHR_BAD;
 }
 
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 398865d6eb..0b0d915773 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -903,6 +903,25 @@ test_expect_success 'cat-file -t and -s on corrupt loose object' '
 	)
 '
 
+test_expect_success 'truncated object with --allow-unknown-type' - <<\EOT
+	objtype='a really long type name that exceeds the 32-byte limit' &&
+	blob=$(git hash-object -w --literally -t "$objtype" /dev/null) &&
+	objpath=.git/objects/$(test_oid_to_path "$blob") &&
+
+	# We want to truncate the object far enough in that we don't hit the
+	# end while inflating the first 32 bytes (since we want to have to dig
+	# for the trailing NUL of the header). But we don't want to go too far,
+	# since our header isn't very big. And of course we are counting
+	# deflated zlib bytes in the on-disk file, so it's a bit of a guess.
+	# Empirically 50 seems to work.
+	mv "$objpath" obj.bak &&
+	test_when_finished 'mv obj.bak "$objpath"' &&
+	test_copy_bytes 50 <obj.bak >"$objpath" &&
+
+	test_must_fail git cat-file --allow-unknown-type -t $blob 2>err &&
+	test_grep "unable to unpack $blob header" err
+EOT
+
 # Tests for git cat-file --follow-symlinks
 test_expect_success 'prep for symlink tests' '
 	echo_without_newline "$hello_content" >morx &&
-- 
2.48.1.709.gf47ae731ff


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 05/10] git_inflate(): skip zlib_post_call() sanity check on Z_NEED_DICT
  2025-02-25  6:25 [PATCH 0/10] some zlib inflating bug fixes Jeff King
                   ` (3 preceding siblings ...)
  2025-02-25  6:29 ` [PATCH 04/10] unpack_loose_header(): fix infinite loop on broken zlib input Jeff King
@ 2025-02-25  6:30 ` Jeff King
  2025-02-26 13:26   ` Junio C Hamano
  2025-02-25  6:30 ` [PATCH 06/10] unpack_loose_header(): avoid numeric comparison of zlib status Jeff King
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Jeff King @ 2025-02-25  6:30 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

This fixes a case where malformed object input can cause us to hit a
BUG() call in the git-zlib.c code.

The zlib format allows the use of preset dictionaries to reduce the size
of deflated data. The checksum of the dictionary is computed by the
deflate code and goes into the stream. On the inflating side, zlib sees
the dictionary checksum and returns Z_NEED_DICT, asking the caller to
provide the dictionary data via inflateSetDictionary().

This should never happen in Git, because we never provide a dictionary
for deflating (and if we get a stream that mentions a dictionary, we
have no idea how to provide it). So normally Z_NEED_DICT is a hard error
for us. But something interesting happens if we _do_ happen to see it
(e.g., because of a corrupt or malicious input).

In git_inflate() as we loop over calls to zlib's inflate(), we translate
between our large-integer git_zstream values and zlib's native z_stream
types, copying in and out with zlib_pre_call() and zlib_post_call(). In
zlib_post_call() we have a few sanity checks, including one that checks
that the number of bytes consumed by zlib (as measured by it moving the
"next_in" pointer) is equal to the movement of its "total_in" count.

But these do not correspond when we see Z_NEED_DICT! Zlib consumes the
bytes from the input buffer but it does not increment total_in. And so
we hit the BUG("total_in mismatch") call.

There are a few options here:

  - We could ditch that BUG() check. It is making too many assumptions
    about how zlib updates these values. But it does have value in most
    cases as a sanity check on the values we're copying.

  - We could skip the zlib_post_call() entirely when we see Z_NEED_DICT.
    We know that it's hard error for us, so we should just send the
    status up the stack and let the caller bail.

    The downside is that if we ever did want to support dictionaries,
    we couldn't (the git_zstream will be out of sync, since we never
    copied its values back from the z_stream).

  - We could continue to call zlib_post_call(), but skip just that BUG()
    check if the status is Z_NEED_DICT. This keeps git_inflate() as a
    thin wrapper around inflate(), and would let us later support
    dictionaries for some calls if we wanted to.

This patch uses the third approach. It seems like the least-surprising
thing to keep git_inflate() a close to inflate() as possible. And while
it makes the diff a bit larger (since we have to pass the status down to
to the zlib_post_call() function), it's a static local function, and
every caller by definition will have just made a zlib call (and so will
have a status integer).

Co-authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 git-zlib.c          | 27 ++++++++++++++++-----------
 t/t1006-cat-file.sh | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+), 11 deletions(-)

diff --git a/git-zlib.c b/git-zlib.c
index 651dd9e07c..df9604910e 100644
--- a/git-zlib.c
+++ b/git-zlib.c
@@ -45,7 +45,7 @@ static void zlib_pre_call(git_zstream *s)
 	s->z.avail_out = zlib_buf_cap(s->avail_out);
 }
 
-static void zlib_post_call(git_zstream *s)
+static void zlib_post_call(git_zstream *s, int status)
 {
 	unsigned long bytes_consumed;
 	unsigned long bytes_produced;
@@ -54,7 +54,12 @@ static void zlib_post_call(git_zstream *s)
 	bytes_produced = s->z.next_out - s->next_out;
 	if (s->z.total_out != s->total_out + bytes_produced)
 		BUG("total_out mismatch");
-	if (s->z.total_in != s->total_in + bytes_consumed)
+	/*
+	 * zlib does not update total_in when it returns Z_NEED_DICT,
+	 * causing a mismatch here. Skip the sanity check in that case.
+	 */
+	if (status != Z_NEED_DICT &&
+	    s->z.total_in != s->total_in + bytes_consumed)
 		BUG("total_in mismatch");
 
 	s->total_out = s->z.total_out;
@@ -72,7 +77,7 @@ void git_inflate_init(git_zstream *strm)
 
 	zlib_pre_call(strm);
 	status = inflateInit(&strm->z);
-	zlib_post_call(strm);
+	zlib_post_call(strm, status);
 	if (status == Z_OK)
 		return;
 	die("inflateInit: %s (%s)", zerr_to_string(status),
@@ -90,7 +95,7 @@ void git_inflate_init_gzip_only(git_zstream *strm)
 
 	zlib_pre_call(strm);
 	status = inflateInit2(&strm->z, windowBits);
-	zlib_post_call(strm);
+	zlib_post_call(strm, status);
 	if (status == Z_OK)
 		return;
 	die("inflateInit2: %s (%s)", zerr_to_string(status),
@@ -103,7 +108,7 @@ void git_inflate_end(git_zstream *strm)
 
 	zlib_pre_call(strm);
 	status = inflateEnd(&strm->z);
-	zlib_post_call(strm);
+	zlib_post_call(strm, status);
 	if (status == Z_OK)
 		return;
 	error("inflateEnd: %s (%s)", zerr_to_string(status),
@@ -122,7 +127,7 @@ int git_inflate(git_zstream *strm, int flush)
 				 ? 0 : flush);
 		if (status == Z_MEM_ERROR)
 			die("inflate: out of memory");
-		zlib_post_call(strm);
+		zlib_post_call(strm, status);
 
 		/*
 		 * Let zlib work another round, while we can still
@@ -160,7 +165,7 @@ void git_deflate_init(git_zstream *strm, int level)
 	memset(strm, 0, sizeof(*strm));
 	zlib_pre_call(strm);
 	status = deflateInit(&strm->z, level);
-	zlib_post_call(strm);
+	zlib_post_call(strm, status);
 	if (status == Z_OK)
 		return;
 	die("deflateInit: %s (%s)", zerr_to_string(status),
@@ -176,7 +181,7 @@ static void do_git_deflate_init(git_zstream *strm, int level, int windowBits)
 	status = deflateInit2(&strm->z, level,
 				  Z_DEFLATED, windowBits,
 				  8, Z_DEFAULT_STRATEGY);
-	zlib_post_call(strm);
+	zlib_post_call(strm, status);
 	if (status == Z_OK)
 		return;
 	die("deflateInit2: %s (%s)", zerr_to_string(status),
@@ -207,7 +212,7 @@ int git_deflate_abort(git_zstream *strm)
 
 	zlib_pre_call(strm);
 	status = deflateEnd(&strm->z);
-	zlib_post_call(strm);
+	zlib_post_call(strm, status);
 	return status;
 }
 
@@ -227,7 +232,7 @@ int git_deflate_end_gently(git_zstream *strm)
 
 	zlib_pre_call(strm);
 	status = deflateEnd(&strm->z);
-	zlib_post_call(strm);
+	zlib_post_call(strm, status);
 	return status;
 }
 
@@ -244,7 +249,7 @@ int git_deflate(git_zstream *strm, int flush)
 				 ? 0 : flush);
 		if (status == Z_MEM_ERROR)
 			die("deflate: out of memory");
-		zlib_post_call(strm);
+		zlib_post_call(strm, status);
 
 		/*
 		 * Let zlib work another round, while we can still
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 0b0d915773..e493600aff 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -922,6 +922,38 @@ test_expect_success 'truncated object with --allow-unknown-type' - <<\EOT
 	test_grep "unable to unpack $blob header" err
 EOT
 
+test_expect_success 'object reading handles zlib dictionary' - <<\EOT
+	echo 'content that will be recompressed' >file &&
+	blob=$(git hash-object -w file) &&
+	objpath=.git/objects/$(test_oid_to_path "$blob") &&
+
+	# Recompress a loose object using a precomputed zlib dictionary.
+	# This was originally done with:
+	#
+	#  perl -MCompress::Raw::Zlib -e '
+	#    binmode STDIN;
+	#    binmode STDOUT;
+	#    my $data = do { local $/; <STDIN> };
+	#    my $in = new Compress::Raw::Zlib::Inflate;
+	#    my $de = new Compress::Raw::Zlib::Deflate(
+	#      -Dictionary => "anything"
+	#    );
+	#    $in->inflate($data, $raw);
+	#    $de->deflate($raw, $out);
+	#    print $out;
+	#  ' <obj.bak >$objpath
+	#
+	# but we do not want to require the perl module for all test runs (nor
+	# carry a custom t/helper program that uses zlib features we don't
+	# otherwise care about).
+	mv "$objpath" obj.bak &&
+	test_when_finished 'mv obj.bak "$objpath"' &&
+	printf '\170\273\017\112\003\143' >$objpath &&
+
+	test_must_fail git cat-file blob $blob 2>err &&
+	test_grep 'error: inflate: needs dictionary' err
+EOT
+
 # Tests for git cat-file --follow-symlinks
 test_expect_success 'prep for symlink tests' '
 	echo_without_newline "$hello_content" >morx &&
-- 
2.48.1.709.gf47ae731ff


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 06/10] unpack_loose_header(): avoid numeric comparison of zlib status
  2025-02-25  6:25 [PATCH 0/10] some zlib inflating bug fixes Jeff King
                   ` (4 preceding siblings ...)
  2025-02-25  6:30 ` [PATCH 05/10] git_inflate(): skip zlib_post_call() sanity check on Z_NEED_DICT Jeff King
@ 2025-02-25  6:30 ` Jeff King
  2025-02-28  0:32   ` Taylor Blau
  2025-02-25  6:31 ` [PATCH 07/10] unpack_loose_rest(): " Jeff King
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Jeff King @ 2025-02-25  6:30 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

When unpacking a loose header, we try to inflate the first 32 bytes.
We'd expect either Z_OK (we filled up the output buffer, but there are
more bytes in the object) or Z_STREAM_END (this is a tiny object whose
header and content fit in the buffer).

We check for that with "if (status < Z_OK)", making the assumption that
all of the errors we'd see have negative values (as Z_OK itself is "0",
and Z_STREAM_END is "1").

But there's at least one case this misses: Z_NEED_DICT is "2". This
isn't something we'd ever expect to see, but if we do see it, we should
consider it an error (since we have no dictionary to load).

Instead, the current code interprets Z_NEED_DICT as success and looks
for the object header's terminating NUL in the bytes we've read. This
will generaly be zero bytes if the dictionary is mentioned at the start
of the stream. So we'll fail to find it and complain "the header is too
long" (ULHR_LONG). But really, the problem is that the object is
malformed, and we should return ULHR_BAD.

This is a minor bug, as we consider both cases to be an error. But it
does mean we print the wrong error message. The test case added in the
previous patch triggers this code, so we can just confirm the error
message we see here.

Signed-off-by: Jeff King <peff@peff.net>
---
 object-file.c       | 2 +-
 t/t1006-cat-file.sh | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/object-file.c b/object-file.c
index 5b2446bfc1..5b347cfc2f 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1362,7 +1362,7 @@ enum unpack_loose_header_result unpack_loose_header(git_zstream *stream,
 	obj_read_unlock();
 	status = git_inflate(stream, 0);
 	obj_read_lock();
-	if (status < Z_OK)
+	if (status != Z_OK && status != Z_STREAM_END)
 		return ULHR_BAD;
 
 	/*
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index e493600aff..86a2825473 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -951,6 +951,8 @@ test_expect_success 'object reading handles zlib dictionary' - <<\EOT
 	printf '\170\273\017\112\003\143' >$objpath &&
 
 	test_must_fail git cat-file blob $blob 2>err &&
+	test_grep ! 'too long' err &&
+	test_grep 'error: unable to unpack' err &&
 	test_grep 'error: inflate: needs dictionary' err
 EOT
 
-- 
2.48.1.709.gf47ae731ff


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 07/10] unpack_loose_rest(): avoid numeric comparison of zlib status
  2025-02-25  6:25 [PATCH 0/10] some zlib inflating bug fixes Jeff King
                   ` (5 preceding siblings ...)
  2025-02-25  6:30 ` [PATCH 06/10] unpack_loose_header(): avoid numeric comparison of zlib status Jeff King
@ 2025-02-25  6:31 ` Jeff King
  2025-02-25  6:33 ` [PATCH 08/10] unpack_loose_rest(): never clean up zstream Jeff King
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 34+ messages in thread
From: Jeff King @ 2025-02-25  6:31 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

When unpacking the actual content of a loose object file, we insist both
that the status code we got is Z_STREAM_END, and that we consumed all
bytes.

If we didn't, we'll return an error, but the specific error message we
produce depends on which of the two error conditions we saw. So we'll
check both a second time to decide which error to produce. But this
second time, our status code check is loose: it checks for a negative
status value.

This can get confused by zlib codes which are not negative, such as
Z_NEED_DICT. In this case we'd erroneously print nothing at all, when we
should say "corrupt loose object".

Instead, this second check should check explicitly against Z_STREAM_END.

Note that Z_OK is "0", so the existing code also produced no message for
Z_OK. But it's impossible to see that status, since we only break out of
the inflate loop when we stop seeing Z_OK (so a stream which has more
bytes than its object header claims would eventually yield Z_BUF_ERROR).

There's no test here, as it would require a loose object whose zlib
stream returns Z_NEED_DICT in the middle of the object content. I think
that is probably possible, but even our Z_NEED_DICT test in t1006 does
not trigger this, because we hit that error while reading the header. I
found this bug while reviewing all callers of git_inflate() for bugs
similar to the one we saw in unpack_loose_header(). This was the only
other case that did a numeric comparison rather than explicitly checking
for Z_STREAM_END.

Signed-off-by: Jeff King <peff@peff.net>
---
 object-file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/object-file.c b/object-file.c
index 5b347cfc2f..859888eb2a 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1441,7 +1441,7 @@ static void *unpack_loose_rest(git_zstream *stream,
 		return buf;
 	}

-	if (status < 0)
+	if (status != Z_STREAM_END)
 		error(_("corrupt loose object '%s'"), oid_to_hex(oid));
 	else if (stream->avail_in)
 		error(_("garbage at end of loose object '%s'"),
-- 
2.48.1.709.gf47ae731ff

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 08/10] unpack_loose_rest(): never clean up zstream
  2025-02-25  6:25 [PATCH 0/10] some zlib inflating bug fixes Jeff King
                   ` (6 preceding siblings ...)
  2025-02-25  6:31 ` [PATCH 07/10] unpack_loose_rest(): " Jeff King
@ 2025-02-25  6:33 ` Jeff King
  2025-02-26 13:16   ` Junio C Hamano
  2025-02-25  6:33 ` [PATCH 09/10] unpack_loose_rest(): simplify error handling Jeff King
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Jeff King @ 2025-02-25  6:33 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

The unpack_loose_rest() function has funny ownership semantics: we pass
in a z_stream opened by the caller, but then only _sometimes_ close it.

This oddity has developed over time. When the function was originally
split out in 5180cacc20 (Split up unpack_sha1_file() some more,
2005-06-02), it always called inflateEnd() to clean up the stream
(though nowadays it is a git_zstream and we call git_inflate_end()).

But in 7efbff7531 (unpack_sha1_file(): detect corrupt loose object
files., 2007-03-05) we added error code paths which don't close the
stream. This makes some sense, as we'd still look at parts of the stream
struct to decide which error to show (though I am not sure in practice
if inflateEnd() even touches those fields).

This subtlety makes it hard to know when the caller has to clean up the
stream and when it does not. That led to the leak fixed by aa9ef614dc
(object-file: fix memory leak when reading corrupted headers,
2024-08-14).

Let's instead always leave the stream intact, forcing the caller to
clean it up. You might think that would create more work for the
callers, but it actually ends up simplifying them, since they can put
the call to git_inflate_end() in the common cleanup code path.

Two things to note, though:

  - The check_stream_oid() function is used as a replacement for
    unpack_loose_rest() in read_loose_object() to read blobs. It
    inherited the same funny semantics, and we should fix it here, too
    (to keep the cleanup in read_loose_object() consistent).

  - In read_loose_object() we need a second "out" label, as we can jump
    to the existing label before opening the stream at all (and since
    the struct is opaque, there is no way to if it was initialized or
    not, so we must not call git_inflate_end() in that case).

Signed-off-by: Jeff King <peff@peff.net>
---
This patch and the two after are pure cleanups which should have no
behavior effect. I think they make things better, but one could argue
they are churn.

I didn't reproduce it, but I think this is also fixing a leak when
check_stream_oid() returned an error.

 object-file.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/object-file.c b/object-file.c
index 859888eb2a..8cf87caef5 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1437,7 +1437,6 @@ static void *unpack_loose_rest(git_zstream *stream,
 		}
 	}
 	if (status == Z_STREAM_END && !stream->avail_in) {
-		git_inflate_end(stream);
 		return buf;
 	}
 
@@ -1601,8 +1600,8 @@ static int loose_object_info(struct repository *r,
 		die(_("loose object %s (stored in %s) is corrupt"),
 		    oid_to_hex(oid), path);
 
-	git_inflate_end(&stream);
 cleanup:
+	git_inflate_end(&stream);
 	munmap(map, mapsize);
 	if (oi->sizep == &size_scratch)
 		oi->sizep = NULL;
@@ -3081,7 +3080,6 @@ static int check_stream_oid(git_zstream *stream,
 		git_hash_update(&c, buf, stream->next_out - buf);
 		total_read += stream->next_out - buf;
 	}
-	git_inflate_end(stream);
 
 	if (status != Z_STREAM_END) {
 		error(_("corrupt loose object '%s'"), oid_to_hex(expected_oid));
@@ -3128,35 +3126,34 @@ int read_loose_object(const char *path,
 	if (unpack_loose_header(&stream, map, mapsize, hdr, sizeof(hdr),
 				NULL) != ULHR_OK) {
 		error(_("unable to unpack header of %s"), path);
-		git_inflate_end(&stream);
-		goto out;
+		goto out_inflate;
 	}
 
 	if (parse_loose_header(hdr, oi) < 0) {
 		error(_("unable to parse header of %s"), path);
-		git_inflate_end(&stream);
-		goto out;
+		goto out_inflate;
 	}
 
 	if (*oi->typep == OBJ_BLOB && *size > big_file_threshold) {
 		if (check_stream_oid(&stream, hdr, *size, path, expected_oid) < 0)
-			goto out;
+			goto out_inflate;
 	} else {
 		*contents = unpack_loose_rest(&stream, hdr, *size, expected_oid);
 		if (!*contents) {
 			error(_("unable to unpack contents of %s"), path);
-			git_inflate_end(&stream);
-			goto out;
+			goto out_inflate;
 		}
 		hash_object_file_literally(the_repository->hash_algo,
 					   *contents, *size,
 					   oi->type_name->buf, real_oid);
 		if (!oideq(expected_oid, real_oid))
-			goto out;
+			goto out_inflate;
 	}
 
 	ret = 0; /* everything checks out */
 
+out_inflate:
+	git_inflate_end(&stream);
 out:
 	if (map)
 		munmap(map, mapsize);
-- 
2.48.1.709.gf47ae731ff


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 09/10] unpack_loose_rest(): simplify error handling
  2025-02-25  6:25 [PATCH 0/10] some zlib inflating bug fixes Jeff King
                   ` (7 preceding siblings ...)
  2025-02-25  6:33 ` [PATCH 08/10] unpack_loose_rest(): never clean up zstream Jeff King
@ 2025-02-25  6:33 ` Jeff King
  2025-02-26 13:46   ` Junio C Hamano
  2025-02-28  0:34   ` Taylor Blau
  2025-02-25  6:34 ` [PATCH 10/10] unpack_loose_rest(): rewrite return handling for clarity Jeff King
  2025-02-28  0:38 ` [PATCH 0/10] some zlib inflating bug fixes Taylor Blau
  10 siblings, 2 replies; 34+ messages in thread
From: Jeff King @ 2025-02-25  6:33 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

Inflating a loose object is considered successful only if we got
Z_STREAM_END and there were no more bytes. We check both of those
conditions and return success, but then have to check them a second time
to decide which error message to produce.

I.e., we do something like this:

  if (!error_1 && !error_2)
          ...return success...

  if (error_1)
          ...handle error1...
  else if (error_2)
          ...handle error2...
  ...common error handling...

This repetition was the source of a small bug fixed in an earlier commit
(our Z_STREAM_END check was not the same in the two conditionals).

Instead we can chain them all into a single if/else cascade, which
avoids repeating ourselves:

  if (error_1)
          ...handle error1...
  else if (error_2)
          ...handle error2....
  else
          ...return success...
  ...common error handling...

Signed-off-by: Jeff King <peff@peff.net>
---
 object-file.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/object-file.c b/object-file.c
index 8cf87caef5..b7928fb74e 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1436,15 +1436,15 @@ static void *unpack_loose_rest(git_zstream *stream,
 			obj_read_lock();
 		}
 	}
-	if (status == Z_STREAM_END && !stream->avail_in) {
-		return buf;
-	}
 
 	if (status != Z_STREAM_END)
 		error(_("corrupt loose object '%s'"), oid_to_hex(oid));
 	else if (stream->avail_in)
 		error(_("garbage at end of loose object '%s'"),
 		      oid_to_hex(oid));
+	else
+		return buf;
+
 	free(buf);
 	return NULL;
 }
-- 
2.48.1.709.gf47ae731ff


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 10/10] unpack_loose_rest(): rewrite return handling for clarity
  2025-02-25  6:25 [PATCH 0/10] some zlib inflating bug fixes Jeff King
                   ` (8 preceding siblings ...)
  2025-02-25  6:33 ` [PATCH 09/10] unpack_loose_rest(): simplify error handling Jeff King
@ 2025-02-25  6:34 ` Jeff King
  2025-02-28  0:36   ` Taylor Blau
  2025-02-28  0:38 ` [PATCH 0/10] some zlib inflating bug fixes Taylor Blau
  10 siblings, 1 reply; 34+ messages in thread
From: Jeff King @ 2025-02-25  6:34 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

We have a pattern like:

  if (error1)
     ...handle error 1...
  else if (error2)
     ...handle error 2...
  else
     ...return buf...
  ...free buf and return NULL...

This is a little subtle because it is the return in the success block
that lets us skip the common error handling. Rewrite this instead to
free the buffer in each error path, marking it as NULL, and then all
code paths can use the common return.

This should make the logic a bit easier to follow. It does mean
duplicating the buf cleanup for errors, but it's a single line.

Signed-off-by: Jeff King <peff@peff.net>
---
Obviously could be squashed into the previous one, but I thought the
sequence of diffs made it easier to understand what was being changed.

 object-file.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/object-file.c b/object-file.c
index b7928fb74e..1df8870578 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1437,16 +1437,16 @@ static void *unpack_loose_rest(git_zstream *stream,
 		}
 	}
 
-	if (status != Z_STREAM_END)
+	if (status != Z_STREAM_END) {
 		error(_("corrupt loose object '%s'"), oid_to_hex(oid));
-	else if (stream->avail_in)
+		FREE_AND_NULL(buf);
+	} else if (stream->avail_in) {
 		error(_("garbage at end of loose object '%s'"),
 		      oid_to_hex(oid));
-	else
-		return buf;
+		FREE_AND_NULL(buf);
+	}
 
-	free(buf);
-	return NULL;
+	return buf;
 }
 
 /*
-- 
2.48.1.709.gf47ae731ff

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 01/10] loose_object_info(): BUG() on inflating content with unknown type
  2025-02-25  6:28 ` [PATCH 01/10] loose_object_info(): BUG() on inflating content with unknown type Jeff King
@ 2025-02-25 11:42   ` Patrick Steinhardt
  2025-02-26  1:47   ` Junio C Hamano
  2025-02-28  0:14   ` Taylor Blau
  2 siblings, 0 replies; 34+ messages in thread
From: Patrick Steinhardt @ 2025-02-25 11:42 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Taylor Blau

On Tue, Feb 25, 2025 at 01:28:24AM -0500, Jeff King wrote:
> After unpack_loose_header() returns, it will have inflated not only the
> object header, but possibly some bytes of the object content. When we
> call unpack_loose_rest() to extract the actual content, it finds those
> extra bytes by skipping past the header's terminating NUL in the buffer.
> Like this:
> 
>   int bytes = strlen(buffer) + 1;
>   n = stream->total_out - bytes;
>   ...
>   memcpy(buf, (char *) buffer + bytes, n);
> 
> This won't work with the OBJECT_INFO_ALLOW_UNKNOWN_TYPE flag, as there
> we allow a header of arbitrary size. We put into a strbuf, but feed only

s/into/it &/

> the final 32-byte chunk we read to unpack_loose_rest(). In that case
> stream->total_out may unexpectedly large, and thus our "n" will be

s/may/& be/

> large, causing an out-of-bounds read (we do check it against our
> allocated buffer size, which prevents an out-of-bounds write).
> 
> Probably this could be made to work by feeding the strbuf to
> unpack_loose_rest(), along with adjusting some types (e.g., "bytes"
> would need to be a size_t, since it is no longer operating on a 32-byte
> buffer).

I was a bit confused initially as I was thinking in terms of `size_t`
and `uint32_t` as I misread 32-byte for 32-bit, which is an immediate
shortcut that my brain took because 32 bit is something you read all the
time. I don't really have a great idea for how to introduce the byte
chunk better though to avoid this.

> Signed-off-by: Jeff King <peff@peff.net>
> ---
> I found this because I was tracing the code path after
> unpack_loose_header() returns to verify some assumptions in the other
> patches.
> 
> It really makes me wonder if this "unknown type" stuff has any value
> at all. You can create an object with any type using "hash-object
> --literally -t". And you can ask about its type and size. But you can
> never retrieve the object content! Nor can you pack it or transfer it,
> since packs use a numeric type field.
> 
> This code was added ~2015, but I don't think anybody built more on top
> of it. I wonder if we should just consider it a failed experiment and
> rip out the support.

I certainly do not know and cannot think of any usecase for this. I also
expect that a repository with unknown object types is a recipe for weird
edge cases in case they are being read somehow.

I guess we'll know a bit more when this patch series lands? In case
nobody complains it is another indicator that unknown object types
aren't being used out there in the wild.

>  object-file.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/object-file.c b/object-file.c
> index 00c3a4b910..45b251ba04 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1580,6 +1580,8 @@ static int loose_object_info(struct repository *r,
>  
>  		if (!oi->contentp)
>  			break;
> +		if (hdrbuf.len)
> +			BUG("unpacking content with unknown types not yet supported");
>  		*oi->contentp = unpack_loose_rest(&stream, hdr, *oi->sizep, oid);
>  		if (*oi->contentp)
>  			goto cleanup;

Okay. I was wondering whether we still need `hdrbuf`, but we of course
do in order to continue reading the type and length itself. The only
thing we restrict is reading the contents of such objects.

Patrick

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 04/10] unpack_loose_header(): fix infinite loop on broken zlib input
  2025-02-25  6:29 ` [PATCH 04/10] unpack_loose_header(): fix infinite loop on broken zlib input Jeff King
@ 2025-02-25 11:42   ` Patrick Steinhardt
  2025-02-25 19:00     ` Eric Sunshine
  2025-02-26 12:56   ` Junio C Hamano
  2025-02-28  0:21   ` Taylor Blau
  2 siblings, 1 reply; 34+ messages in thread
From: Patrick Steinhardt @ 2025-02-25 11:42 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Taylor Blau

On Tue, Feb 25, 2025 at 01:29:58AM -0500, Jeff King wrote:
> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> index 398865d6eb..0b0d915773 100755
> --- a/t/t1006-cat-file.sh
> +++ b/t/t1006-cat-file.sh
> @@ -903,6 +903,25 @@ test_expect_success 'cat-file -t and -s on corrupt loose object' '
>  	)
>  '
>  
> +test_expect_success 'truncated object with --allow-unknown-type' - <<\EOT
> +	objtype='a really long type name that exceeds the 32-byte limit' &&
> +	blob=$(git hash-object -w --literally -t "$objtype" /dev/null) &&
> +	objpath=.git/objects/$(test_oid_to_path "$blob") &&
> +
> +	# We want to truncate the object far enough in that we don't hit the
> +	# end while inflating the first 32 bytes (since we want to have to dig
> +	# for the trailing NUL of the header). But we don't want to go too far,
> +	# since our header isn't very big. And of course we are counting
> +	# deflated zlib bytes in the on-disk file, so it's a bit of a guess.
> +	# Empirically 50 seems to work.
> +	mv "$objpath" obj.bak &&
> +	test_when_finished 'mv obj.bak "$objpath"' &&

The order should probably be reversed here, as we nowadays tend to first
queue the cleanup before doing the actual work. Not that it really
matters in this case.

Patrick

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 04/10] unpack_loose_header(): fix infinite loop on broken zlib input
  2025-02-25 11:42   ` Patrick Steinhardt
@ 2025-02-25 19:00     ` Eric Sunshine
  0 siblings, 0 replies; 34+ messages in thread
From: Eric Sunshine @ 2025-02-25 19:00 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Jeff King, git, Taylor Blau

On Tue, Feb 25, 2025 at 6:46 AM Patrick Steinhardt <ps@pks.im> wrote:
> On Tue, Feb 25, 2025 at 01:29:58AM -0500, Jeff King wrote:
> > +     mv "$objpath" obj.bak &&
> > +     test_when_finished 'mv obj.bak "$objpath"' &&
>
> The order should probably be reversed here, as we nowadays tend to first
> queue the cleanup before doing the actual work. Not that it really
> matters in this case.

I'd say this case is fine as-is since that particular cleanup *only*
makes sense if `mv` succeeded. Moreover, if these statements were to
be reversed, we'd need to take extra precaution (using `||:`, for
instance) against pointless failure of the cleanup code itself; i.e.:

    test_when_finished 'mv obj.bak "$objpath" ||:' &&
    mv "$objpath" obj.bak &&

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 01/10] loose_object_info(): BUG() on inflating content with unknown type
  2025-02-25  6:28 ` [PATCH 01/10] loose_object_info(): BUG() on inflating content with unknown type Jeff King
  2025-02-25 11:42   ` Patrick Steinhardt
@ 2025-02-26  1:47   ` Junio C Hamano
  2025-02-28  0:16     ` Taylor Blau
  2025-02-28  0:14   ` Taylor Blau
  2 siblings, 1 reply; 34+ messages in thread
From: Junio C Hamano @ 2025-02-26  1:47 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Taylor Blau

Jeff King <peff@peff.net> writes:

> It really makes me wonder if this "unknown type" stuff has any value
> at all. You can create an object with any type using "hash-object
> --literally -t". And you can ask about its type and size. But you can
> never retrieve the object content! Nor can you pack it or transfer it,
> since packs use a numeric type field.

Correct.  IIRC, the "--literally" support was mostly for debugging,
and as you noticed, is very much limited because it can only create
funny objects that are loose.  And the debugging was not really about
adding more object types, but was more about "what would our code do
when we see an object that is corrupt whose type we do not recognise".

I personally think the "--literally" should not survive the Git 3.0
boundary.

Thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 04/10] unpack_loose_header(): fix infinite loop on broken zlib input
  2025-02-25  6:29 ` [PATCH 04/10] unpack_loose_header(): fix infinite loop on broken zlib input Jeff King
  2025-02-25 11:42   ` Patrick Steinhardt
@ 2025-02-26 12:56   ` Junio C Hamano
  2025-02-28  0:21   ` Taylor Blau
  2 siblings, 0 replies; 34+ messages in thread
From: Junio C Hamano @ 2025-02-26 12:56 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Taylor Blau

Jeff King <peff@peff.net> writes:

> diff --git a/object-file.c b/object-file.c
> index 8c9295413b..5b2446bfc1 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1396,7 +1396,7 @@ enum unpack_loose_header_result unpack_loose_header(git_zstream *stream,
>  		strbuf_add(header, buffer, stream->next_out - (unsigned char *)buffer);
>  		if (memchr(buffer, '\0', stream->next_out - (unsigned char *)buffer))
>  			return 0;
> -	} while (status != Z_STREAM_END);
> +	} while (status == Z_OK);
>  	return ULHR_BAD;
>  }

Ah, fond memories.  I do recall we had very similar bugs whose
correct solution was "instead of stopping with Z_something, just
write your loop to keep ging while you get Z_OK" every time we
encountered them.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 08/10] unpack_loose_rest(): never clean up zstream
  2025-02-25  6:33 ` [PATCH 08/10] unpack_loose_rest(): never clean up zstream Jeff King
@ 2025-02-26 13:16   ` Junio C Hamano
  0 siblings, 0 replies; 34+ messages in thread
From: Junio C Hamano @ 2025-02-26 13:16 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Taylor Blau

Jeff King <peff@peff.net> writes:

> This patch and the two after are pure cleanups which should have no
> behavior effect. I think they make things better, but one could argue
> they are churn.
>
> I didn't reproduce it, but I think this is also fixing a leak when
> check_stream_oid() returned an error.

Looking good.  Thanks for paying down our technical debt.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/10] git_inflate(): skip zlib_post_call() sanity check on Z_NEED_DICT
  2025-02-25  6:30 ` [PATCH 05/10] git_inflate(): skip zlib_post_call() sanity check on Z_NEED_DICT Jeff King
@ 2025-02-26 13:26   ` Junio C Hamano
  2025-02-28  0:31     ` Taylor Blau
  0 siblings, 1 reply; 34+ messages in thread
From: Junio C Hamano @ 2025-02-26 13:26 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Taylor Blau

Jeff King <peff@peff.net> writes:

> But these do not correspond when we see Z_NEED_DICT! Zlib consumes the
> bytes from the input buffer but it does not increment total_in. And so
> we hit the BUG("total_in mismatch") call.
>
> There are a few options here:
>
>   - We could ditch that BUG() check. It is making too many assumptions
>     about how zlib updates these values. But it does have value in most
>     cases as a sanity check on the values we're copying.
>
>   - We could skip the zlib_post_call() entirely when we see Z_NEED_DICT.
>     We know that it's hard error for us, so we should just send the
>     status up the stack and let the caller bail.
>
>     The downside is that if we ever did want to support dictionaries,
>     we couldn't (the git_zstream will be out of sync, since we never
>     copied its values back from the z_stream).
>
>   - We could continue to call zlib_post_call(), but skip just that BUG()
>     check if the status is Z_NEED_DICT. This keeps git_inflate() as a
>     thin wrapper around inflate(), and would let us later support
>     dictionaries for some calls if we wanted to.
>
> This patch uses the third approach. It seems like the least-surprising
> thing to keep git_inflate() a close to inflate() as possible. And while
> it makes the diff a bit larger (since we have to pass the status down to
> to the zlib_post_call() function), it's a static local function, and
> every caller by definition will have just made a zlib call (and so will
> have a status integer).

Ouch.  I am not sure if I would have made the choice you guys made
if I were writing this fix, but that is mostly because I do not
anticipate that we would ever support NEED_DICT and the other two
options seem simpler, and certainly not because I have any concrete
reason to oppose the third approach.

> +test_expect_success 'object reading handles zlib dictionary' - <<\EOT
> +	echo 'content that will be recompressed' >file &&
> +	blob=$(git hash-object -w file) &&
> +	objpath=.git/objects/$(test_oid_to_path "$blob") &&
> +
> +	# Recompress a loose object using a precomputed zlib dictionary.
> +	# This was originally done with:
> +	#
> +	#  perl -MCompress::Raw::Zlib -e '
> +	#    binmode STDIN;
> +	#    binmode STDOUT;
> +	#    my $data = do { local $/; <STDIN> };
> +	#    my $in = new Compress::Raw::Zlib::Inflate;
> +	#    my $de = new Compress::Raw::Zlib::Deflate(
> +	#      -Dictionary => "anything"
> +	#    );
> +	#    $in->inflate($data, $raw);
> +	#    $de->deflate($raw, $out);
> +	#    print $out;
> +	#  ' <obj.bak >$objpath
> +	#
> +	# but we do not want to require the perl module for all test runs (nor
> +	# carry a custom t/helper program that uses zlib features we don't
> +	# otherwise care about).
> +	mv "$objpath" obj.bak &&
> +	test_when_finished 'mv obj.bak "$objpath"' &&
> +	printf '\170\273\017\112\003\143' >$objpath &&
> +
> +	test_must_fail git cat-file blob $blob 2>err &&
> +	test_grep 'error: inflate: needs dictionary' err
> +EOT
> +

Wheee.  I guess an ugly and down-to-bit-sequence test is much better
than having no test at all ;-)

We'd want to say >"$objpath" to adhere to CodingGuidelines, but
I guess those particular versions of bash has not been seen in the
wild for quite some time and we may want to loosen the rule there.

Anyway, looking good.  Thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 09/10] unpack_loose_rest(): simplify error handling
  2025-02-25  6:33 ` [PATCH 09/10] unpack_loose_rest(): simplify error handling Jeff King
@ 2025-02-26 13:46   ` Junio C Hamano
  2025-02-28  0:34   ` Taylor Blau
  1 sibling, 0 replies; 34+ messages in thread
From: Junio C Hamano @ 2025-02-26 13:46 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Taylor Blau

Jeff King <peff@peff.net> writes:

> Inflating a loose object is considered successful only if we got
> Z_STREAM_END and there were no more bytes. We check both of those
> conditions and return success, but then have to check them a second time
> to decide which error message to produce.
>
> I.e., we do something like this:
>
>   if (!error_1 && !error_2)
>           ...return success...
>
>   if (error_1)
>           ...handle error1...
>   else if (error_2)
>           ...handle error2...
>   ...common error handling...
>
> This repetition was the source of a small bug fixed in an earlier commit
> (our Z_STREAM_END check was not the same in the two conditionals).
>
> Instead we can chain them all into a single if/else cascade, which
> avoids repeating ourselves:
>
>   if (error_1)
>           ...handle error1...
>   else if (error_2)
>           ...handle error2....
>   else
>           ...return success...
>   ...common error handling...
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  object-file.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

Of course the resulting code is so much cleaner and more obvious.
Thanks for cleaning it up.


> diff --git a/object-file.c b/object-file.c
> index 8cf87caef5..b7928fb74e 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1436,15 +1436,15 @@ static void *unpack_loose_rest(git_zstream *stream,
>  			obj_read_lock();
>  		}
>  	}
> -	if (status == Z_STREAM_END && !stream->avail_in) {
> -		return buf;
> -	}
>  
>  	if (status != Z_STREAM_END)
>  		error(_("corrupt loose object '%s'"), oid_to_hex(oid));
>  	else if (stream->avail_in)
>  		error(_("garbage at end of loose object '%s'"),
>  		      oid_to_hex(oid));
> +	else
> +		return buf;
> +
>  	free(buf);
>  	return NULL;
>  }

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 01/10] loose_object_info(): BUG() on inflating content with unknown type
  2025-02-25  6:28 ` [PATCH 01/10] loose_object_info(): BUG() on inflating content with unknown type Jeff King
  2025-02-25 11:42   ` Patrick Steinhardt
  2025-02-26  1:47   ` Junio C Hamano
@ 2025-02-28  0:14   ` Taylor Blau
  2 siblings, 0 replies; 34+ messages in thread
From: Taylor Blau @ 2025-02-28  0:14 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Tue, Feb 25, 2025 at 01:28:24AM -0500, Jeff King wrote:
> It really makes me wonder if this "unknown type" stuff has any value
> at all. You can create an object with any type using "hash-object
> --literally -t". And you can ask about its type and size. But you can
> never retrieve the object content! Nor can you pack it or transfer it,
> since packs use a numeric type field.
>
> This code was added ~2015, but I don't think anybody built more on top
> of it. I wonder if we should just consider it a failed experiment and
> rip out the support.

Yeah, I was wondering the same thing. I think 10 years seems like a long
enough time to declare that it was a failed experiment ;-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 01/10] loose_object_info(): BUG() on inflating content with unknown type
  2025-02-26  1:47   ` Junio C Hamano
@ 2025-02-28  0:16     ` Taylor Blau
  2025-03-04  6:43       ` Jeff King
  0 siblings, 1 reply; 34+ messages in thread
From: Taylor Blau @ 2025-02-28  0:16 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git

On Tue, Feb 25, 2025 at 05:47:56PM -0800, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
> > It really makes me wonder if this "unknown type" stuff has any value
> > at all. You can create an object with any type using "hash-object
> > --literally -t". And you can ask about its type and size. But you can
> > never retrieve the object content! Nor can you pack it or transfer it,
> > since packs use a numeric type field.
>
> Correct.  IIRC, the "--literally" support was mostly for debugging,
> and as you noticed, is very much limited because it can only create
> funny objects that are loose.  And the debugging was not really about
> adding more object types, but was more about "what would our code do
> when we see an object that is corrupt whose type we do not recognise".
>
> I personally think the "--literally" should not survive the Git 3.0
> boundary.

It is quite useful for testing intentionally broken objects, like
commits with malformed author/committer lines, or trees with
out-of-order entries, etc.

Perhaps we could replace that with a test helper that is only accessible
within the test suite that acts like "hash-object --literally" and
remove "--literally" from the plumbing interface? I dunno.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 02/10] unpack_loose_header(): simplify next_out assignment
  2025-02-25  6:29 ` [PATCH 02/10] unpack_loose_header(): simplify next_out assignment Jeff King
@ 2025-02-28  0:18   ` Taylor Blau
  0 siblings, 0 replies; 34+ messages in thread
From: Taylor Blau @ 2025-02-28  0:18 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Tue, Feb 25, 2025 at 01:29:00AM -0500, Jeff King wrote:
> When using OBJECT_INFO_ALLOW_UNKNOWN_TYPE to unpack a header that
> doesn't fit into our initial 32-byte buffer, we loop over calls
> git_inflate(), feeding it our buffer to the "next_out" pointer each
> time. As the code is written, we reset next_out after each inflate call
> (and after reading the output), ready for the next loop.
>
> This isn't wrong, but there are a few advantages to setting up
> "next_out" right before each inflate call, rather than after:
>
>   1. It drops a few duplicated lines of code.
>
>   2. It makes it obvious that we always feed a fresh buffer on each call
>      (and thus can never see Z_BUF_ERROR due to due to a lack of output
>      space).
>
>   3. After we exit the loop, we'll leave stream->next_out pointing to
>      the end of the fetched data (this is how zlib callers find out how
>      much data is in the buffer). This doesn't matter in practice, since
>      nobody looks at it again. But it's probably the least-surprising
>      thing to do, as it matches how next_out is left when the whole
>      thing fits in the initial 32-byte buffer (and we don't enter the
>      loop at all).

Thanks for calling (3) out. There is definitely a subtle behavior change
there, but having the context that it doesn't matter in practice is
useful in judging whether or not the refactoring is OK.

In any event, I agree that this is a much cleaner read, and recall
thinking something along the same lines as (1) when we were working on
this together.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 04/10] unpack_loose_header(): fix infinite loop on broken zlib input
  2025-02-25  6:29 ` [PATCH 04/10] unpack_loose_header(): fix infinite loop on broken zlib input Jeff King
  2025-02-25 11:42   ` Patrick Steinhardt
  2025-02-26 12:56   ` Junio C Hamano
@ 2025-02-28  0:21   ` Taylor Blau
  2 siblings, 0 replies; 34+ messages in thread
From: Taylor Blau @ 2025-02-28  0:21 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Tue, Feb 25, 2025 at 01:29:58AM -0500, Jeff King wrote:
> Co-authored-by: Taylor Blau <me@ttaylorr.com>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  object-file.c       |  2 +-
>  t/t1006-cat-file.sh | 19 +++++++++++++++++++
>  2 files changed, 20 insertions(+), 1 deletion(-)

I don't have much more to add here having helped write this patch and
reviewed an early copy of it on the git-security list. So the series up
to this point LGTM, let's keep reading on...

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/10] git_inflate(): skip zlib_post_call() sanity check on Z_NEED_DICT
  2025-02-26 13:26   ` Junio C Hamano
@ 2025-02-28  0:31     ` Taylor Blau
  2025-03-04  7:08       ` Jeff King
  0 siblings, 1 reply; 34+ messages in thread
From: Taylor Blau @ 2025-02-28  0:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git

On Wed, Feb 26, 2025 at 05:26:04AM -0800, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
> > But these do not correspond when we see Z_NEED_DICT! Zlib consumes the
> > bytes from the input buffer but it does not increment total_in. And so
> > we hit the BUG("total_in mismatch") call.
> >
> > There are a few options here:
> >
> >   - We could ditch that BUG() check. It is making too many assumptions
> >     about how zlib updates these values. But it does have value in most
> >     cases as a sanity check on the values we're copying.
> >
> >   - We could skip the zlib_post_call() entirely when we see Z_NEED_DICT.
> >     We know that it's hard error for us, so we should just send the
> >     status up the stack and let the caller bail.
> >
> >     The downside is that if we ever did want to support dictionaries,
> >     we couldn't (the git_zstream will be out of sync, since we never
> >     copied its values back from the z_stream).
> >
> >   - We could continue to call zlib_post_call(), but skip just that BUG()
> >     check if the status is Z_NEED_DICT. This keeps git_inflate() as a
> >     thin wrapper around inflate(), and would let us later support
> >     dictionaries for some calls if we wanted to.
> >
> > This patch uses the third approach. It seems like the least-surprising
> > thing to keep git_inflate() a close to inflate() as possible. And while
> > it makes the diff a bit larger (since we have to pass the status down to
> > to the zlib_post_call() function), it's a static local function, and
> > every caller by definition will have just made a zlib call (and so will
> > have a status integer).
>
> Ouch.  I am not sure if I would have made the choice you guys made
> if I were writing this fix, but that is mostly because I do not
> anticipate that we would ever support NEED_DICT and the other two
> options seem simpler, and certainly not because I have any concrete
> reason to oppose the third approach.

It's been a few weeks since I last looked at this patch, but I have a
vague recollection that we chose the second approach while discussing
this together.

And indeed, looking at the copy I have from that session, it looks like
we did

--- 8< ---
diff --git a/git-zlib.c b/git-zlib.c
index 651dd9e07c..265d3074e1 100644
--- a/git-zlib.c
+++ b/git-zlib.c
@@ -122,6 +122,8 @@ int git_inflate(git_zstream *strm, int flush)
                                 ? 0 : flush);
                if (status == Z_MEM_ERROR)
                        die("inflate: out of memory");
+               if (status == Z_NEED_DICT)
+                       break;
                zlib_post_call(strm);

                /*
--- >8 ---

I actually have a vague preference towards that approach, since I too do
not anticipate that we'd ever need/want to support Z_NEED_DICT (and if
we did, we could always change from option (2) to (3) later on).
Likewise, the resulting diff is considerably smaller by line count,
though on the other hand this diff is not all that more complicated
overall.

I don't have a strong enough preference to suggest you make any changes
here, but I thought I'd mention it nonetheless.

> > +test_expect_success 'object reading handles zlib dictionary' - <<\EOT
> > +	echo 'content that will be recompressed' >file &&
> > +	blob=$(git hash-object -w file) &&
> > +	objpath=.git/objects/$(test_oid_to_path "$blob") &&
> > +
> > +	# Recompress a loose object using a precomputed zlib dictionary.
> > +	# This was originally done with:
> > +	#
> > +	#  perl -MCompress::Raw::Zlib -e '
> > +	#    binmode STDIN;
> > +	#    binmode STDOUT;
> > +	#    my $data = do { local $/; <STDIN> };
> > +	#    my $in = new Compress::Raw::Zlib::Inflate;
> > +	#    my $de = new Compress::Raw::Zlib::Deflate(
> > +	#      -Dictionary => "anything"
> > +	#    );
> > +	#    $in->inflate($data, $raw);
> > +	#    $de->deflate($raw, $out);
> > +	#    print $out;
> > +	#  ' <obj.bak >$objpath
> > +	#
> > +	# but we do not want to require the perl module for all test runs (nor
> > +	# carry a custom t/helper program that uses zlib features we don't
> > +	# otherwise care about).
> > +	mv "$objpath" obj.bak &&
> > +	test_when_finished 'mv obj.bak "$objpath"' &&
> > +	printf '\170\273\017\112\003\143' >$objpath &&
> > +
> > +	test_must_fail git cat-file blob $blob 2>err &&
> > +	test_grep 'error: inflate: needs dictionary' err
> > +EOT
> > +
>
> Wheee.  I guess an ugly and down-to-bit-sequence test is much better
> than having no test at all ;-)

As a fun aside, Peff wrote this Perl script (unsurprisingly, I strongly
dislike writing Perl), and amazingly it worked the first try when we
were still hypothesizing about what this bug would look like. It was
quite the surprise to both of us, I think, that it worked so well.

Thanks,
Taylor

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 06/10] unpack_loose_header(): avoid numeric comparison of zlib status
  2025-02-25  6:30 ` [PATCH 06/10] unpack_loose_header(): avoid numeric comparison of zlib status Jeff King
@ 2025-02-28  0:32   ` Taylor Blau
  2025-03-04  6:55     ` Jeff King
  0 siblings, 1 reply; 34+ messages in thread
From: Taylor Blau @ 2025-02-28  0:32 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Tue, Feb 25, 2025 at 01:30:56AM -0500, Jeff King wrote:
> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> index e493600aff..86a2825473 100755
> --- a/t/t1006-cat-file.sh
> +++ b/t/t1006-cat-file.sh
> @@ -951,6 +951,8 @@ test_expect_success 'object reading handles zlib dictionary' - <<\EOT
>  	printf '\170\273\017\112\003\143' >$objpath &&
>
>  	test_must_fail git cat-file blob $blob 2>err &&
> +	test_grep ! 'too long' err &&
> +	test_grep 'error: unable to unpack' err &&
>  	test_grep 'error: inflate: needs dictionary' err
>  EOT

All looking good here, too.

I think the test_grep is hiding what is a fairly unpleasant error
message that says the same thing a few times from different points in
the call-stack. But that isn't anything new from this series, and I'm
content to let it be a problem for another day ;-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 09/10] unpack_loose_rest(): simplify error handling
  2025-02-25  6:33 ` [PATCH 09/10] unpack_loose_rest(): simplify error handling Jeff King
  2025-02-26 13:46   ` Junio C Hamano
@ 2025-02-28  0:34   ` Taylor Blau
  1 sibling, 0 replies; 34+ messages in thread
From: Taylor Blau @ 2025-02-28  0:34 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Tue, Feb 25, 2025 at 01:33:51AM -0500, Jeff King wrote:
> ---
>  object-file.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

Much cleaner indeed. Well spotted, and thanks for fixing it up.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 10/10] unpack_loose_rest(): rewrite return handling for clarity
  2025-02-25  6:34 ` [PATCH 10/10] unpack_loose_rest(): rewrite return handling for clarity Jeff King
@ 2025-02-28  0:36   ` Taylor Blau
  2025-03-04  7:10     ` Jeff King
  0 siblings, 1 reply; 34+ messages in thread
From: Taylor Blau @ 2025-02-28  0:36 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Tue, Feb 25, 2025 at 01:34:21AM -0500, Jeff King wrote:
> This should make the logic a bit easier to follow. It does mean
> duplicating the buf cleanup for errors, but it's a single line.

At least to my eyes, I actually prefer the state after 9/10 and would
probably be OK to see this patch get dropped. I wish I had a compelling
reason *why* I felt that way, but I think it may too subjective.

I don't feel strongly about it either way, though.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/10] some zlib inflating bug fixes
  2025-02-25  6:25 [PATCH 0/10] some zlib inflating bug fixes Jeff King
                   ` (9 preceding siblings ...)
  2025-02-25  6:34 ` [PATCH 10/10] unpack_loose_rest(): rewrite return handling for clarity Jeff King
@ 2025-02-28  0:38 ` Taylor Blau
  10 siblings, 0 replies; 34+ messages in thread
From: Taylor Blau @ 2025-02-28  0:38 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Tue, Feb 25, 2025 at 01:25:18AM -0500, Jeff King wrote:
>  git-zlib.c          | 27 +++++++++++++----------
>  object-file.c       | 48 ++++++++++++++++++++--------------------
>  t/t1006-cat-file.sh | 53 +++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 92 insertions(+), 36 deletions(-)

Thanks for putting these together and sending them out. I didn't have a
ton to add throughout since you and I wrote patches 4 and 5 together,
but I re-read things to make sure that everything still seemed sane a
week or two out.

The remainder of the series looks good to me too, though I am not in
love with the final patch. Like I mentioned in my response there, I
think it's pretty subjective, but I would be OK to see the first nine
patches queued and the last one dropped. OTOH, I wouldn't be upset to
see it included either.

Thanks again for working on this :-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 01/10] loose_object_info(): BUG() on inflating content with unknown type
  2025-02-28  0:16     ` Taylor Blau
@ 2025-03-04  6:43       ` Jeff King
  2025-03-04 15:41         ` Junio C Hamano
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff King @ 2025-03-04  6:43 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, git

On Thu, Feb 27, 2025 at 07:16:28PM -0500, Taylor Blau wrote:

> On Tue, Feb 25, 2025 at 05:47:56PM -0800, Junio C Hamano wrote:
> > Jeff King <peff@peff.net> writes:
> >
> > > It really makes me wonder if this "unknown type" stuff has any value
> > > at all. You can create an object with any type using "hash-object
> > > --literally -t". And you can ask about its type and size. But you can
> > > never retrieve the object content! Nor can you pack it or transfer it,
> > > since packs use a numeric type field.
> >
> > Correct.  IIRC, the "--literally" support was mostly for debugging,
> > and as you noticed, is very much limited because it can only create
> > funny objects that are loose.  And the debugging was not really about
> > adding more object types, but was more about "what would our code do
> > when we see an object that is corrupt whose type we do not recognise".
> >
> > I personally think the "--literally" should not survive the Git 3.0
> > boundary.
> 
> It is quite useful for testing intentionally broken objects, like
> commits with malformed author/committer lines, or trees with
> out-of-order entries, etc.
> 
> Perhaps we could replace that with a test helper that is only accessible
> within the test suite that acts like "hash-object --literally" and
> remove "--literally" from the plumbing interface? I dunno.

I don't want to get rid of --literally. It is great not only for testing
and debugging, but also if you needed to recreate a slightly broken
object (e.g., to recover from corruption). Hash-object will complain
about objects that fail fsck, even though they are perfectly usable (and
may be embedded in existing history).

But this unknown type stuff is not and has never been usable. So I'd
propose to keep --literally but always reject unknown types.

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 06/10] unpack_loose_header(): avoid numeric comparison of zlib status
  2025-02-28  0:32   ` Taylor Blau
@ 2025-03-04  6:55     ` Jeff King
  0 siblings, 0 replies; 34+ messages in thread
From: Jeff King @ 2025-03-04  6:55 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git

On Thu, Feb 27, 2025 at 07:32:25PM -0500, Taylor Blau wrote:

> On Tue, Feb 25, 2025 at 01:30:56AM -0500, Jeff King wrote:
> > diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> > index e493600aff..86a2825473 100755
> > --- a/t/t1006-cat-file.sh
> > +++ b/t/t1006-cat-file.sh
> > @@ -951,6 +951,8 @@ test_expect_success 'object reading handles zlib dictionary' - <<\EOT
> >  	printf '\170\273\017\112\003\143' >$objpath &&
> >
> >  	test_must_fail git cat-file blob $blob 2>err &&
> > +	test_grep ! 'too long' err &&
> > +	test_grep 'error: unable to unpack' err &&
> >  	test_grep 'error: inflate: needs dictionary' err
> >  EOT
> 
> All looking good here, too.
> 
> I think the test_grep is hiding what is a fairly unpleasant error
> message that says the same thing a few times from different points in
> the call-stack. But that isn't anything new from this series, and I'm
> content to let it be a problem for another day ;-).

Yeah, we get one set of errors when we ask for the type to find out if
we need to peel a tag (it's not a tag, it's OBJ_ERR ;) ). And then
again we ask if it's a blob to try streaming. It's still OBJ_ERR. And
then we fall back to the non-streaming case.

We should probably check for errors earlier. And also avoid asking for
the type twice when we didn't peel a tag, which is just stupidly
inefficient.

Perhaps something like this (untested):

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 9de1016acd..e1dbbfeb43 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -236,7 +236,13 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name,
 
 		if (exp_type_id == OBJ_BLOB) {
 			struct object_id blob_oid;
-			if (oid_object_info(the_repository, &oid, NULL) == OBJ_TAG) {
+			enum object_type found_type = oid_object_info(the_repository,
+								      &oid, NULL);
+
+			if (found_type < 0)
+				die(_("unable to read %s"), oid_to_hex(&oid));
+
+			if (found_type == OBJ_TAG) {
 				char *buffer = repo_read_object_file(the_repository,
 								     &oid,
 								     &type,
@@ -251,10 +257,11 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name,
 						      &hash_algos[oid.algo]))
 					die("%s not a valid tag", oid_to_hex(&oid));
 				free(buffer);
+				found_type = type;
 			} else
 				oidcpy(&blob_oid, &oid);
 
-			if (oid_object_info(the_repository, &blob_oid, NULL) == OBJ_BLOB) {
+			if (found_type == OBJ_BLOB) {
 				ret = stream_blob(&blob_oid);
 				goto cleanup;
 			}

But yes, this is all way out of scope for this series, and is true of
any corruption.

-Peff

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/10] git_inflate(): skip zlib_post_call() sanity check on Z_NEED_DICT
  2025-02-28  0:31     ` Taylor Blau
@ 2025-03-04  7:08       ` Jeff King
  0 siblings, 0 replies; 34+ messages in thread
From: Jeff King @ 2025-03-04  7:08 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, git

On Thu, Feb 27, 2025 at 07:31:01PM -0500, Taylor Blau wrote:

> It's been a few weeks since I last looked at this patch, but I have a
> vague recollection that we chose the second approach while discussing
> this together.
> 
> And indeed, looking at the copy I have from that session, it looks like
> we did
> 
> --- 8< ---
> diff --git a/git-zlib.c b/git-zlib.c
> index 651dd9e07c..265d3074e1 100644
> --- a/git-zlib.c
> +++ b/git-zlib.c
> @@ -122,6 +122,8 @@ int git_inflate(git_zstream *strm, int flush)
>                                  ? 0 : flush);
>                 if (status == Z_MEM_ERROR)
>                         die("inflate: out of memory");
> +               if (status == Z_NEED_DICT)
> +                       break;
>                 zlib_post_call(strm);
> 
>                 /*
> --- >8 ---
> 
> I actually have a vague preference towards that approach, since I too do
> not anticipate that we'd ever need/want to support Z_NEED_DICT (and if
> we did, we could always change from option (2) to (3) later on).
> Likewise, the resulting diff is considerably smaller by line count,
> though on the other hand this diff is not all that more complicated
> overall.

Yes, that was definitely the patch we initially wrote. And I do agree
that it's pretty unlikely that we'd use Z_NEED_DICT ever. But what I
really disliked is that it leaves the git_zstream in a totally broken
state, unsynced with the underlying z_stream.

If the caller takes it as a hard error and immediately throws away the
stream that's not too bad. Although even then it gets a little weird: we
have to call git_inflated_end(), which itself does the usual zlib
pre/post calls. The "pre" call overwrites the zstream with old values,
which zlib's inflateEnd() then operates on.

It seems to be OK in practice, but we are making assumptions about which
parts of the struct zlib cares about. I guess it's unlikely that
unexpectedly rewinding total_in would cause zlib to misbehave, but it's
probably better not to be too intimate with it.

And of course if somebody ever does want to play with Z_NEED_DICT, it's
a bit of a booby trap we've left for them. We know today that they will
need to change from option (2) to (3), but in that hypothetical future
they will have to first figure out why their z_stream is corrupted. ;)

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 10/10] unpack_loose_rest(): rewrite return handling for clarity
  2025-02-28  0:36   ` Taylor Blau
@ 2025-03-04  7:10     ` Jeff King
  2025-03-04 21:32       ` Taylor Blau
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff King @ 2025-03-04  7:10 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git

On Thu, Feb 27, 2025 at 07:36:42PM -0500, Taylor Blau wrote:

> On Tue, Feb 25, 2025 at 01:34:21AM -0500, Jeff King wrote:
> > This should make the logic a bit easier to follow. It does mean
> > duplicating the buf cleanup for errors, but it's a single line.
> 
> At least to my eyes, I actually prefer the state after 9/10 and would
> probably be OK to see this patch get dropped. I wish I had a compelling
> reason *why* I felt that way, but I think it may too subjective.
> 
> I don't feel strongly about it either way, though.

I also don't have a super strong feeling, though I fall on the other
side of the line (which is why I bothered sending the patch).

If we didn't do that, I think the alternative is probably a comment
like:

  if (error1)
	error(describe error1);
  else if (error2)
	error(describe error2);
  else
	return buf;

  /* if we didn't return above, we saw some error */
  free(buf);
  return NULL;

I dunno. I'd probably stick with what I send. ;)

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 01/10] loose_object_info(): BUG() on inflating content with unknown type
  2025-03-04  6:43       ` Jeff King
@ 2025-03-04 15:41         ` Junio C Hamano
  0 siblings, 0 replies; 34+ messages in thread
From: Junio C Hamano @ 2025-03-04 15:41 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git

Jeff King <peff@peff.net> writes:

> But this unknown type stuff is not and has never been usable. So I'd
> propose to keep --literally but always reject unknown types.

That sounds like a good way to go.  Thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 10/10] unpack_loose_rest(): rewrite return handling for clarity
  2025-03-04  7:10     ` Jeff King
@ 2025-03-04 21:32       ` Taylor Blau
  0 siblings, 0 replies; 34+ messages in thread
From: Taylor Blau @ 2025-03-04 21:32 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Tue, Mar 04, 2025 at 02:10:11AM -0500, Jeff King wrote:
> On Thu, Feb 27, 2025 at 07:36:42PM -0500, Taylor Blau wrote:
>
> > On Tue, Feb 25, 2025 at 01:34:21AM -0500, Jeff King wrote:
> > > This should make the logic a bit easier to follow. It does mean
> > > duplicating the buf cleanup for errors, but it's a single line.
> >
> > At least to my eyes, I actually prefer the state after 9/10 and would
> > probably be OK to see this patch get dropped. I wish I had a compelling
> > reason *why* I felt that way, but I think it may too subjective.
> >
> > I don't feel strongly about it either way, though.
>
> I also don't have a super strong feeling, though I fall on the other
> side of the line (which is why I bothered sending the patch).
>
> If we didn't do that, I think the alternative is probably a comment
> like:
>
>   if (error1)
> 	error(describe error1);
>   else if (error2)
> 	error(describe error2);
>   else
> 	return buf;
>
>   /* if we didn't return above, we saw some error */
>   free(buf);
>   return NULL;
>
> I dunno. I'd probably stick with what I send. ;)

Fair enough ;-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2025-03-04 21:32 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-25  6:25 [PATCH 0/10] some zlib inflating bug fixes Jeff King
2025-02-25  6:28 ` [PATCH 01/10] loose_object_info(): BUG() on inflating content with unknown type Jeff King
2025-02-25 11:42   ` Patrick Steinhardt
2025-02-26  1:47   ` Junio C Hamano
2025-02-28  0:16     ` Taylor Blau
2025-03-04  6:43       ` Jeff King
2025-03-04 15:41         ` Junio C Hamano
2025-02-28  0:14   ` Taylor Blau
2025-02-25  6:29 ` [PATCH 02/10] unpack_loose_header(): simplify next_out assignment Jeff King
2025-02-28  0:18   ` Taylor Blau
2025-02-25  6:29 ` [PATCH 03/10] unpack_loose_header(): report headers without NUL as "bad" Jeff King
2025-02-25  6:29 ` [PATCH 04/10] unpack_loose_header(): fix infinite loop on broken zlib input Jeff King
2025-02-25 11:42   ` Patrick Steinhardt
2025-02-25 19:00     ` Eric Sunshine
2025-02-26 12:56   ` Junio C Hamano
2025-02-28  0:21   ` Taylor Blau
2025-02-25  6:30 ` [PATCH 05/10] git_inflate(): skip zlib_post_call() sanity check on Z_NEED_DICT Jeff King
2025-02-26 13:26   ` Junio C Hamano
2025-02-28  0:31     ` Taylor Blau
2025-03-04  7:08       ` Jeff King
2025-02-25  6:30 ` [PATCH 06/10] unpack_loose_header(): avoid numeric comparison of zlib status Jeff King
2025-02-28  0:32   ` Taylor Blau
2025-03-04  6:55     ` Jeff King
2025-02-25  6:31 ` [PATCH 07/10] unpack_loose_rest(): " Jeff King
2025-02-25  6:33 ` [PATCH 08/10] unpack_loose_rest(): never clean up zstream Jeff King
2025-02-26 13:16   ` Junio C Hamano
2025-02-25  6:33 ` [PATCH 09/10] unpack_loose_rest(): simplify error handling Jeff King
2025-02-26 13:46   ` Junio C Hamano
2025-02-28  0:34   ` Taylor Blau
2025-02-25  6:34 ` [PATCH 10/10] unpack_loose_rest(): rewrite return handling for clarity Jeff King
2025-02-28  0:36   ` Taylor Blau
2025-03-04  7:10     ` Jeff King
2025-03-04 21:32       ` Taylor Blau
2025-02-28  0:38 ` [PATCH 0/10] some zlib inflating bug fixes Taylor Blau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).