* Test breakage with zlib-ng
@ 2023-12-12 14:16 Ondrej Pohorelsky
  2023-12-12 17:04 ` René Scharfe
  0 siblings, 1 reply; 17+ messages in thread
From: Ondrej Pohorelsky @ 2023-12-12 14:16 UTC (permalink / raw)
  To: git

Hi everyone,

As some might have heard, there is a proposal for Fedora 40 to
transition from zlib to zlib-ng[0]. Because of this, there has been a
rebuild of all packages to ensure every package works under zlib-ng.

The Git test suite has three breakages in t6300-for-each-ref.sh.
To be precise, they are:

not ok 35 - basic atom: refs/heads/main objectsize:disk
not ok 107 - basic atom: refs/tags/testtag objectsize:disk
not ok 108 - basic atom: refs/tags/testtag *objectsize:disk

All of these tests check a format atom and compare the result against
$disklen. I discussed it with Tulio Magno Quites Machado Filho, who ran
the tests and is the owner of the proposal. It seems that zlib-ng's
compression shaves or adds some bytes relative to the expected output,
which then fails the comparison.

Here is an example:

```
expecting success of 6300.35 'basic atom: refs/heads/main objectsize:disk':
		git for-each-ref --format="%($format)" "$ref" >actual &&
		sanitize_pgp <actual >actual.clean &&
		test_cmp expected actual.clean

++ git for-each-ref '--format=%(objectsize:disk)' refs/heads/main
++ sanitize_pgp
++ perl -ne '
	/^-----END PGP/ and $in_pgp = 0;
	print unless $in_pgp;
	/^-----BEGIN PGP/ and $in_pgp = 1;
'
++ command /usr/bin/perl -ne '
	/^-----END PGP/ and $in_pgp = 0;
	print unless $in_pgp;
	/^-----BEGIN PGP/ and $in_pgp = 1;
'
++ test_cmp expected actual.clean
++ test 2 -ne 2
++ eval 'diff -u' '"$@"'
+++ diff -u expected actual.clean
--- expected	2023-12-06 21:06:07.808849497 +0000
+++ actual.clean	2023-12-06 21:06:07.812849541 +0000
@@ -1 +1 @@
-138
+137
error: last command exited with $?=1
not ok 35 - basic atom: refs/heads/main objectsize:disk
```

The whole build log can be found here[1].
I can easily patch these tests in Fedora to be compatible with zlib-ng
by comparing against concrete values instead of $disklen, but I would
like a universal solution that works with both zlib and zlib-ng. So if
anyone has an idea on how to do it, please let me know.

Thanks

[0] https://discussion.fedoraproject.org/t/f40-change-proposal-transitioning-to-zlib-ng-as-a-compatible-replacement-for-zlib-system-wide/95807
[1] https://download.copr.fedorainfracloud.org/results/tuliom/zlib-ng-compat-mpb/fedora-rawhide-x86_64/06729801-git/builder-live.log.gz

Cheers,
Ondřej Pohořelský

^ permalink raw reply	[flat|nested] 17+ messages in thread
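The failure above boils down to a property of deflate: the compressed size of identical input varies with the implementation and its settings, so both 137 and 138 are valid byte counts for the same object. A stand-in illustration using gzip's effort levels (gzip, the level numbers, and the payload are only placeholders for the zlib/zlib-ng difference; the exact byte counts mean nothing):

```shell
# Compress the same bytes at two deflate effort levels and compare the
# stream sizes; higher effort compresses at least as well, and two
# deflate libraries can differ the same way even at equal levels.
seq 1 500 >payload
fast=$(gzip -c -1 payload | wc -c)
best=$(gzip -c -9 payload | wc -c)
echo "level 1: $fast bytes, level 9: $best bytes"
```

Any test that hard-codes one of these numbers breaks as soon as the library or level changes, which is exactly what the thread goes on to fix.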
* Re: Test breakage with zlib-ng
  2023-12-12 14:16 Test breakage with zlib-ng Ondrej Pohorelsky
@ 2023-12-12 17:04 ` René Scharfe
  2023-12-12 20:01   ` Jeff King
                      ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: René Scharfe @ 2023-12-12 17:04 UTC (permalink / raw)
  To: Ondrej Pohorelsky, git; +Cc: brian m . carlson, Junio C Hamano

Am 12.12.23 um 15:16 schrieb Ondrej Pohorelsky:
> Hi everyone,
>
> As some might have heard, there is a proposal for Fedora 40 to
> transition from zlib to zlib-ng[0]. Because of this, there has been a
> rebuild of all packages to ensure every package works under zlib-ng.
>
> The Git test suite has three breakages in t6300-for-each-ref.sh.
> To be precise, they are:
>
> not ok 35 - basic atom: refs/heads/main objectsize:disk
> not ok 107 - basic atom: refs/tags/testtag objectsize:disk
> not ok 108 - basic atom: refs/tags/testtag *objectsize:disk
>
> All of these tests check a format atom and compare the result against
> $disklen.

Why do these three objects (HEAD commit of main, testtag and testtag
target) have the same size? Half of the answer is that testtag points
to the HEAD of main. But the other half is pure coincidence as far as
I can see.

> I can easily patch these tests in Fedora to be compatible with zlib-ng
> by comparing against concrete values instead of $disklen, but I would
> like a universal solution that works with both zlib and zlib-ng. So if
> anyone has an idea on how to do it, please let me know.

The test stores the expected values at the top, in the following lines,
for the two possible repository formats:

	test_expect_success setup '
		test_oid_cache <<-EOF &&
		disklen sha1:138
		disklen sha256:154
		EOF

So it's using hard-coded values already, which breaks when the
compression rate changes. We could set core.compression to 0 to take
compression out of the picture. Or we could get the sizes of the
objects by checking their files, which would not require hard-coding
anymore. Patch below.
--- >8 ---
Subject: [PATCH] t6300: avoid hard-coding object sizes

f4ee22b526 (ref-filter: add tests for objectsize:disk, 2018-12-24)
hard-coded the expected object sizes. Coincidentally the size of commit
and tag is the same with zlib at the default compression level.

1f5f8f3e85 (t6300: abstract away SHA-1-specific constants, 2020-02-22)
encoded the sizes as a single value, which coincidentally also works
with sha256.

Different compression libraries like zlib-ng may arrive at different
values. Get them from the file system instead of hard-coding them to
make switching the compression library (or changing the compression
level) easier.

Reported-by: Ondrej Pohorelsky <opohorel@redhat.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
 t/t6300-for-each-ref.sh | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
index 54e2281259..843a7fe143 100755
--- a/t/t6300-for-each-ref.sh
+++ b/t/t6300-for-each-ref.sh
@@ -20,12 +20,13 @@ setdate_and_increment () {
 	export GIT_COMMITTER_DATE GIT_AUTHOR_DATE
 }
 
-test_expect_success setup '
-	test_oid_cache <<-EOF &&
-	disklen sha1:138
-	disklen sha256:154
-	EOF
+test_object_file_size () {
+	oid=$(git rev-parse "$1")
+	path=".git/objects/$(test_oid_to_path $oid)"
+	test_file_size "$path"
+}
 
+test_expect_success setup '
 	# setup .mailmap
 	cat >.mailmap <<-EOF &&
 	A Thor <athor@example.com> A U Thor <author@example.com>
@@ -94,7 +95,6 @@ test_atom () {
 }
 
 hexlen=$(test_oid hexsz)
-disklen=$(test_oid disklen)
 
 test_atom head refname refs/heads/main
 test_atom head refname: refs/heads/main
@@ -129,7 +129,7 @@ test_atom head push:strip=1 remotes/myfork/main
 test_atom head push:strip=-1 main
 test_atom head objecttype commit
 test_atom head objectsize $((131 + hexlen))
-test_atom head objectsize:disk $disklen
+test_atom head objectsize:disk $(test_object_file_size refs/heads/main)
 test_atom head deltabase $ZERO_OID
 test_atom head objectname $(git rev-parse refs/heads/main)
 test_atom head objectname:short $(git rev-parse --short refs/heads/main)
@@ -203,8 +203,8 @@ test_atom tag upstream ''
 test_atom tag push ''
 test_atom tag objecttype tag
 test_atom tag objectsize $((114 + hexlen))
-test_atom tag objectsize:disk $disklen
-test_atom tag '*objectsize:disk' $disklen
+test_atom tag objectsize:disk $(test_object_file_size refs/tags/testtag)
+test_atom tag '*objectsize:disk' $(test_object_file_size refs/heads/main)
 test_atom tag deltabase $ZERO_OID
 test_atom tag '*deltabase' $ZERO_OID
 test_atom tag objectname $(git rev-parse refs/tags/testtag)
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread
* Re: Test breakage with zlib-ng
  2023-12-12 17:04 ` René Scharfe
@ 2023-12-12 20:01   ` Jeff King
  2023-12-12 22:54     ` René Scharfe
  2023-12-12 22:18   ` Test breakage with zlib-ng brian m. carlson
  2023-12-12 22:30   ` Junio C Hamano
  2 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2023-12-12 20:01 UTC (permalink / raw)
  To: René Scharfe
  Cc: Ondrej Pohorelsky, git, brian m . carlson, Junio C Hamano

On Tue, Dec 12, 2023 at 06:04:55PM +0100, René Scharfe wrote:

> Subject: [PATCH] t6300: avoid hard-coding object sizes
>
> f4ee22b526 (ref-filter: add tests for objectsize:disk, 2018-12-24)
> hard-coded the expected object sizes. Coincidentally the size of commit
> and tag is the same with zlib at the default compression level.
>
> 1f5f8f3e85 (t6300: abstract away SHA-1-specific constants, 2020-02-22)
> encoded the sizes as a single value, which coincidentally also works
> with sha256.
>
> Different compression libraries like zlib-ng may arrive at different
> values. Get them from the file system instead of hard-coding them to
> make switching the compression library (or changing the compression
> level) easier.

Yeah, this is definitely the right solution here. I'm surprised the
hard-coded values didn't cause problems before now. ;)

The patch looks good to me, but a few small comments:

> +test_object_file_size () {
> +	oid=$(git rev-parse "$1")
> +	path=".git/objects/$(test_oid_to_path $oid)"
> +	test_file_size "$path"
> +}

Here we're assuming the objects are loose. I think that's probably OK
(and certainly the test will notice if that changes).

We're covering the formatting code paths along with the underlying
implementation that fills in object_info->disk_sizep for loose objects.
Which I think is plenty for this particular script, which is about
for-each-ref.

It would be nice to have coverage of the packed_object_info() code path,
though. Back when it was added in a4ac106178 (cat-file: add
%(objectsize:disk) format atom, 2013-07-10), I cowardly punted on this,
writing:

    This patch does not include any tests, as the exact numbers
    returned are volatile and subject to zlib and packing
    decisions. We cannot even reliably guarantee that the
    on-disk size is smaller than the object content (though in
    general this should be the case for non-trivial objects).

I don't think it's that big a deal, but I guess we could do something
like:

  prev=
  git show-index <$pack_idx |
  sort -n |
  grep -A1 $oid |
  while read ofs oid csum
  do
	test -n "$prev" && echo "$((ofs - prev))"
	prev=$ofs
  done

It feels a little redundant with what Git is doing under the hood, but
at least is exercising the code (and we're using the idx directly, so
we're confirming that the revindex is right).

Anyway, that is all way beyond the scope of your patch, but I wonder if
it's worth doing on top.

> @@ -129,7 +129,7 @@ test_atom head push:strip=1 remotes/myfork/main
>  test_atom head push:strip=-1 main
>  test_atom head objecttype commit
>  test_atom head objectsize $((131 + hexlen))
> -test_atom head objectsize:disk $disklen
> +test_atom head objectsize:disk $(test_object_file_size refs/heads/main)

These test_object_file_size calls are happening outside of any
test_expect_* block, so we'd miss failing exit codes (and also the
helper is not &&-chained), and any stderr would leak to the output.
That's probably OK in practice, though (if something goes wrong then the
expected value output will be bogus and the test itself will fail).

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread
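Peff's pipeline can be tried outside any repository by feeding it fabricated `git show-index` lines (offset, object name, CRC — all values below are invented). Each object's packed size is the gap between its offset and the next one in sorted order:

```shell
# Fabricated index listing, deliberately out of order like real
# show-index output (which is sorted by object name, not offset).
cat >idx.raw <<\EOF
1500 cccc c3
12 aaaa c1
200 bbbb c2
EOF
oid=bbbb
size=$(
	prev=
	sort -n idx.raw |
	grep -A1 "$oid" |
	while read ofs oid csum
	do
		test -n "$prev" && echo "$((ofs - prev))"
		prev=$ofs
	done
)
echo "$size"   # 1500 - 200 = 1300
```

The last object in a pack has no successor, which is why a real helper also needs the pack size minus the trailing hash as a sentinel offset.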
* Re: Test breakage with zlib-ng
  2023-12-12 20:01 ` Jeff King
@ 2023-12-12 22:54   ` René Scharfe
  2023-12-13 12:28     ` [PATCH 2/1] test-lib-functions: add object size functions René Scharfe
  0 siblings, 1 reply; 17+ messages in thread
From: René Scharfe @ 2023-12-12 22:54 UTC (permalink / raw)
  To: Jeff King; +Cc: Ondrej Pohorelsky, git, brian m . carlson, Junio C Hamano

Am 12.12.23 um 21:01 schrieb Jeff King:
> On Tue, Dec 12, 2023 at 06:04:55PM +0100, René Scharfe wrote:
>
>> Subject: [PATCH] t6300: avoid hard-coding object sizes
>>
>> f4ee22b526 (ref-filter: add tests for objectsize:disk, 2018-12-24)
>> hard-coded the expected object sizes. Coincidentally the size of commit
>> and tag is the same with zlib at the default compression level.
>>
>> 1f5f8f3e85 (t6300: abstract away SHA-1-specific constants, 2020-02-22)
>> encoded the sizes as a single value, which coincidentally also works
>> with sha256.
>>
>> Different compression libraries like zlib-ng may arrive at different
>> values. Get them from the file system instead of hard-coding them to
>> make switching the compression library (or changing the compression
>> level) easier.
>
> Yeah, this is definitely the right solution here. I'm surprised the
> hard-coded values didn't cause problems before now. ;)
>
> The patch looks good to me, but a few small comments:
>
>> +test_object_file_size () {
>> +	oid=$(git rev-parse "$1")
>> +	path=".git/objects/$(test_oid_to_path $oid)"
>> +	test_file_size "$path"
>> +}
>
> Here we're assuming the objects are loose. I think that's probably OK
> (and certainly the test will notice if that changes).
>
> We're covering the formatting code paths along with the underlying
> implementation that fills in object_info->disk_sizep for loose objects.
> Which I think is plenty for this particular script, which is about
> for-each-ref.
>
> It would be nice to have coverage of the packed_object_info() code path,
> though. Back when it was added in a4ac106178 (cat-file: add
> %(objectsize:disk) format atom, 2013-07-10), I cowardly punted on this,
> writing:
>
>     This patch does not include any tests, as the exact numbers
>     returned are volatile and subject to zlib and packing
>     decisions. We cannot even reliably guarantee that the
>     on-disk size is smaller than the object content (though in
>     general this should be the case for non-trivial objects).
>
> I don't think it's that big a deal, but I guess we could do something
> like:
>
>   prev=
>   git show-index <$pack_idx |
>   sort -n |
>   grep -A1 $oid |
>   while read ofs oid csum
>   do
> 	test -n "$prev" && echo "$((ofs - prev))"
> 	prev=$ofs
>   done
>
> It feels a little redundant with what Git is doing under the hood, but
> at least is exercising the code (and we're using the idx directly, so
> we're confirming that the revindex is right).

A generic object size function based on both methods could live in the
test lib and be used for e.g. cat-file tests as well. Getting such a
function polished and library-worthy is probably more work than I
naively imagine, however -- due to our shunning of pipes alone.

> Anyway, that is all way beyond the scope of your patch, but I wonder if
> it's worth doing on top.
>
>> @@ -129,7 +129,7 @@ test_atom head push:strip=1 remotes/myfork/main
>>  test_atom head push:strip=-1 main
>>  test_atom head objecttype commit
>>  test_atom head objectsize $((131 + hexlen))
>> -test_atom head objectsize:disk $disklen
>> +test_atom head objectsize:disk $(test_object_file_size refs/heads/main)
>
> These test_object_file_size calls are happening outside of any
> test_expect_* block, so we'd miss failing exit codes (and also the
> helper is not &&-chained), and any stderr would leak to the output.
> That's probably OK in practice, though (if something goes wrong then the
> expected value output will be bogus and the test itself will fail).

Right. We could also set variables during setup, though, to put readers'
minds at rest.

René

^ permalink raw reply	[flat|nested] 17+ messages in thread
* [PATCH 2/1] test-lib-functions: add object size functions
  2023-12-12 22:54 ` René Scharfe
@ 2023-12-13 12:28   ` René Scharfe
  2023-12-14 20:59     ` Jeff King
  0 siblings, 1 reply; 17+ messages in thread
From: René Scharfe @ 2023-12-13 12:28 UTC (permalink / raw)
  To: Jeff King, git; +Cc: Ondrej Pohorelsky, brian m . carlson, Junio C Hamano

Add test_object_size and its helpers test_loose_object_size and
test_packed_object_size, which allow determining the size of a Git
object using only the low-level Git commands rev-parse and show-index.

Use it in t6300 to replace the bare-bones function test_object_file_size
as a motivating example. There it provides the expected output of the
high-level Git command for-each-ref.

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
So how about this? I'm a bit nervous about all the rules about output
descriptors and error propagation and whatnot in the test library, but
this implementation seems simple enough and might be useful in more than
one test. No idea how to add support for alternate object directories,
but I doubt we'll ever need it.
---
 t/t6300-for-each-ref.sh | 16 ++++++--------
 t/test-lib-functions.sh | 47 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+), 9 deletions(-)

diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
index 843a7fe143..4687660f38 100755
--- a/t/t6300-for-each-ref.sh
+++ b/t/t6300-for-each-ref.sh
@@ -20,12 +20,6 @@ setdate_and_increment () {
 	export GIT_COMMITTER_DATE GIT_AUTHOR_DATE
 }
 
-test_object_file_size () {
-	oid=$(git rev-parse "$1")
-	path=".git/objects/$(test_oid_to_path $oid)"
-	test_file_size "$path"
-}
-
 test_expect_success setup '
 	# setup .mailmap
 	cat >.mailmap <<-EOF &&
@@ -40,7 +34,11 @@ test_expect_success setup '
 	git branch -M main &&
 	setdate_and_increment &&
 	git tag -a -m "Tagging at $datestamp" testtag &&
+	testtag_oid=$(git rev-parse refs/tags/testtag) &&
+	testtag_disksize=$(test_object_size $testtag_oid) &&
 	git update-ref refs/remotes/origin/main main &&
+	commit_oid=$(git rev-parse refs/heads/main) &&
+	commit_disksize=$(test_object_size $commit_oid) &&
 	git remote add origin nowhere &&
 	git config branch.main.remote origin &&
 	git config branch.main.merge refs/heads/main &&
@@ -129,7 +127,7 @@ test_atom head push:strip=1 remotes/myfork/main
 test_atom head push:strip=-1 main
 test_atom head objecttype commit
 test_atom head objectsize $((131 + hexlen))
-test_atom head objectsize:disk $(test_object_file_size refs/heads/main)
+test_atom head objectsize:disk $commit_disksize
 test_atom head deltabase $ZERO_OID
 test_atom head objectname $(git rev-parse refs/heads/main)
 test_atom head objectname:short $(git rev-parse --short refs/heads/main)
@@ -203,8 +201,8 @@ test_atom tag upstream ''
 test_atom tag push ''
 test_atom tag objecttype tag
 test_atom tag objectsize $((114 + hexlen))
-test_atom tag objectsize:disk $(test_object_file_size refs/tags/testtag)
-test_atom tag '*objectsize:disk' $(test_object_file_size refs/heads/main)
+test_atom tag objectsize:disk $testtag_disksize
+test_atom tag '*objectsize:disk' $commit_disksize
 test_atom tag deltabase $ZERO_OID
 test_atom tag '*deltabase' $ZERO_OID
 test_atom tag objectname $(git rev-parse refs/tags/testtag)
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 9c3cf12b26..9b49645f77 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1733,6 +1733,53 @@ test_oid_to_path () {
 	echo "${1%$basename}/$basename"
 }
 
+test_loose_object_size () {
+	test "$#" -ne 1 && BUG "1 param"
+	local path=$(test_oid_to_path "$1")
+	test_file_size "$(git rev-parse --git-path "objects/$path")" 2>&4
+}
+
+test_packed_object_size () {
+	test "$#" -ne 2 && BUG "2 params"
+	local oid=$1 idx=$2 packsize rawsz end
+
+	packsize=$(test_file_size "${idx%.idx}.pack")
+	rawsz=$(test_oid rawsz)
+	end=$(($packsize - $rawsz))
+
+	git show-index <"$idx" |
+	awk -v oid="$oid" -v end="$end" '
+		$2 == oid {start = $1}
+		{offsets[$1] = 1}
+		END {
+			if (!start || start >= end)
+				exit 1
+			for (o in offsets)
+				if (start < o && o < end)
+					end = o
+			print end - start
+		}
+	' && return 0
+
+	echo >&4 "error: '$oid' not found in '$idx'"
+	return 1
+}
+
+test_object_size () {
+	test "$#" -ne 1 && BUG "1 param"
+	local oid=$1
+
+	test_loose_object_size "$oid" 4>/dev/null && return 0
+
+	for idx in "$(git rev-parse --git-path objects/pack)"/pack-*.idx
+	do
+		test_packed_object_size "$oid" "$idx" 4>/dev/null && return 0
+	done
+
+	echo >&4 "error: '$oid' not found"
+	return 1
+}
+
 # Parse oids from git ls-files --staged output
 test_parse_ls_files_stage_oids () {
 	awk '{print $2}' -
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread
* Re: [PATCH 2/1] test-lib-functions: add object size functions
  2023-12-13 12:28 ` [PATCH 2/1] test-lib-functions: add object size functions René Scharfe
@ 2023-12-14 20:59   ` Jeff King
  2023-12-19 16:42     ` René Scharfe
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2023-12-14 20:59 UTC (permalink / raw)
  To: René Scharfe
  Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano

On Wed, Dec 13, 2023 at 01:28:56PM +0100, René Scharfe wrote:

> Add test_object_size and its helpers test_loose_object_size and
> test_packed_object_size, which allow determining the size of a Git
> object using only the low-level Git commands rev-parse and show-index.
>
> Use it in t6300 to replace the bare-bones function test_object_file_size
> as a motivating example. There it provides the expected output of the
> high-level Git command for-each-ref.

This adds a packed-object function, but I doubt anybody actually calls
it. If we're going to do that, it's probably worth adding some tests for
"cat-file --batch-check" or similar.

At which point I wonder if rather than having a function for a single
object, we are better off just testing the result of:

  git cat-file --batch-all-objects --unordered --batch-check='%(objectsize:disk)'

against a single post-processed "show-index" invocation.

> So how about this? I'm a bit nervous about all the rules about output
> descriptors and error propagation and whatnot in the test library, but
> this implementation seems simple enough and might be useful in more than
> one test. No idea how to add support for alternate object directories,
> but I doubt we'll ever need it.

I'm not sure that we need to do anything special with output
redirection. Shouldn't these functions just send errors to stderr as
usual? If they are run inside a test_expect block, that goes to
descriptor 4 (which is either /dev/null or the original stderr,
depending on whether "-v" was used).

> +test_loose_object_size () {
> +	test "$#" -ne 1 && BUG "1 param"
> +	local path=$(test_oid_to_path "$1")
> +	test_file_size "$(git rev-parse --git-path "objects/$path")" 2>&4
> +}

OK. We lose the exit code from "rev-parse" but that is probably OK for
our purposes.

> +test_packed_object_size () {
> +	test "$#" -ne 2 && BUG "2 params"
> +	local oid=$1 idx=$2 packsize rawsz end
> +
> +	packsize=$(test_file_size "${idx%.idx}.pack")
> +	rawsz=$(test_oid rawsz)
> +	end=$(($packsize - $rawsz))

OK, this $end is the magic required for the final entry. Makes sense.

> +	git show-index <"$idx" |
> +	awk -v oid="$oid" -v end="$end" '
> +		$2 == oid {start = $1}
> +		{offsets[$1] = 1}
> +		END {
> +			if (!start || start >= end)
> +				exit 1
> +			for (o in offsets)
> +				if (start < o && o < end)
> +					end = o
> +			print end - start
> +		}
> +	' && return 0

I was confused at first, because I didn't see any sorting happening. But
if I understand correctly, you're just looking for the smallest "end"
that comes after the start of the object we're looking for. Which I
think works.

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH 2/1] test-lib-functions: add object size functions
  2023-12-14 20:59 ` Jeff King
@ 2023-12-19 16:42   ` René Scharfe
  2023-12-21  9:47     ` [PATCH] t1006: add tests for %(objectsize:disk) Jeff King
  0 siblings, 1 reply; 17+ messages in thread
From: René Scharfe @ 2023-12-19 16:42 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano

Am 14.12.23 um 21:59 schrieb Jeff King:
> On Wed, Dec 13, 2023 at 01:28:56PM +0100, René Scharfe wrote:
>
>> Add test_object_size and its helpers test_loose_object_size and
>> test_packed_object_size, which allow determining the size of a Git
>> object using only the low-level Git commands rev-parse and show-index.
>>
>> Use it in t6300 to replace the bare-bones function test_object_file_size
>> as a motivating example. There it provides the expected output of the
>> high-level Git command for-each-ref.
>
> This adds a packed-object function, but I doubt anybody actually calls
> it. If we're going to do that, it's probably worth adding some tests for
> "cat-file --batch-check" or similar.

Yes, and I was assuming that someone else would be eager to add such
tests. *ahem*

> At which point I wonder if rather than having a function for a single
> object, we are better off just testing the result of:
>
>   git cat-file --batch-all-objects --unordered --batch-check='%(objectsize:disk)'
>
> against a single post-processed "show-index" invocation.

Sure, we might want to optimize for bulk-processing and possibly end up
only checking the size of single objects in t6300, making new library
functions unnecessary.

When dumping size information of multiple objects, it's probably a good
idea to include "%(objectname)" as well in the format.

You'd need one show-index call for each .idx file. A simple test would
only have a single one; a library function might need to handle multiple
packs.

>> So how about this? I'm a bit nervous about all the rules about output
>> descriptors and error propagation and whatnot in the test library, but
>> this implementation seems simple enough and might be useful in more than
>> one test. No idea how to add support for alternate object directories,
>> but I doubt we'll ever need it.
>
> I'm not sure that we need to do anything special with output
> redirection. Shouldn't these functions just send errors to stderr as
> usual? If they are run inside a test_expect block, that goes to
> descriptor 4 (which is either /dev/null or the original stderr,
> depending on whether "-v" was used).

Good point. My bad excuse is that I copied the redirection to fd 4 from
test_grep.

>> +	git show-index <"$idx" |
>> +	awk -v oid="$oid" -v end="$end" '
>> +		$2 == oid {start = $1}
>> +		{offsets[$1] = 1}
>> +		END {
>> +			if (!start || start >= end)
>> +				exit 1
>> +			for (o in offsets)
>> +				if (start < o && o < end)
>> +					end = o
>> +			print end - start
>> +		}
>> +	' && return 0
>
> I was confused at first, because I didn't see any sorting happening. But
> if I understand correctly, you're just looking for the smallest "end"
> that comes after the start of the object we're looking for. Which I
> think works.

Yes, calculating the minimum offset suffices when handling a single
object -- no sorting required. For bulk mode we'd better sort, of
course:

	git show-index <"$idx" |
	sort -n |
	awk -v end="$end" '
		NR > 1 {print oid, $1 - start}
		{start = $1; oid = $2}
		END {print oid, end - start}
	'

No idea how to make such a thing robust against malformed or truncated
output from show-index, but perhaps that's not necessary, depending on
how the result is used.

René

^ permalink raw reply	[flat|nested] 17+ messages in thread
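The bulk variant can be sanity-checked on the same kind of fabricated show-index output, with a literal `end` standing in for `$(($packsize - $rawsz))` (all numbers below are invented):

```shell
# Fabricated "<offset> <oid> <crc>" lines; after sorting by offset,
# each object's size is the distance to the next offset, and the last
# object runs up to end (pack size minus the trailing hash).
cat >idx.raw <<\EOF
1500 cccc c3
12 aaaa c1
200 bbbb c2
EOF
sizes=$(
	sort -n idx.raw |
	awk -v end=2000 '
		NR > 1 {print oid, $1 - start}
		{start = $1; oid = $2}
		END {print oid, end - start}
	'
)
echo "$sizes"
```

This prints one `<oid> <size>` pair per object: `aaaa 188`, `bbbb 1300`, `cccc 500`.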
* [PATCH] t1006: add tests for %(objectsize:disk)
  2023-12-19 16:42 ` René Scharfe
@ 2023-12-21  9:47   ` Jeff King
  2023-12-21 12:19     ` René Scharfe
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2023-12-21 9:47 UTC (permalink / raw)
  To: René Scharfe
  Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano

On Tue, Dec 19, 2023 at 05:42:39PM +0100, René Scharfe wrote:

> > This adds a packed-object function, but I doubt anybody actually calls
> > it. If we're going to do that, it's probably worth adding some tests for
> > "cat-file --batch-check" or similar.
>
> Yes, and I was assuming that someone else would be eager to add such
> tests. *ahem*

:P OK, here it is. This can be its own topic, or go on top of the
rs/t6300-compressed-size-fix branch.

> > At which point I wonder if rather than having a function for a single
> > object, we are better off just testing the result of:
> >
> >   git cat-file --batch-all-objects --unordered --batch-check='%(objectsize:disk)'
> >
> > against a single post-processed "show-index" invocation.
>
> Sure, we might want to optimize for bulk-processing and possibly end up
> only checking the size of single objects in t6300, making new library
> functions unnecessary.

So yeah, I think the approach here makes library functions unnecessary
(and I see you already asked Junio to drop your patch 2).

> When dumping size information of multiple objects, it's probably a good
> idea to include "%(objectname)" as well in the format.

Yep, definitely.

-- >8 --
Subject: [PATCH] t1006: add tests for %(objectsize:disk)

Back when we added this placeholder in a4ac106178 (cat-file: add
%(objectsize:disk) format atom, 2013-07-10), there were no tests,
claiming "[...]the exact numbers returned are volatile and subject to
zlib and packing decisions".

But we can use a little shell hackery to get the expected numbers
ourselves. To a certain degree this is just re-implementing what Git is
doing under the hood, but it is still worth doing. It makes sure we
exercise the %(objectsize:disk) code at all, and having the two
implementations agree gives us more confidence.

Note that our shell code assumes that no object appears twice (either in
two packs, or as both loose and packed), as then the results really are
undefined. That's OK for our purposes, and the test will notice if that
assumption is violated (the shell version would produce duplicate lines
that Git's output does not have).

Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Jeff King <peff@peff.net>
---
I stole a bit of your awk. You can tell because I'd have written it in
perl. ;)

 t/t1006-cat-file.sh | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 271c5e4fd3..21915be308 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -1100,6 +1100,40 @@ test_expect_success 'cat-file --batch="batman" with --batch-all-objects will wor
 	cmp expect actual
 '
 
+test_expect_success 'cat-file %(objectsize:disk) with --batch-all-objects' '
+	# our state has both loose and packed objects,
+	# so find both for our expected output
+	{
+		find .git/objects/?? -type f |
+		awk -F/ "{ print \$0, \$3\$4 }" |
+		while read path oid
+		do
+			size=$(test_file_size "$path") &&
+			echo "$oid $size" ||
+			return 1
+		done &&
+		rawsz=$(test_oid rawsz) &&
+		find .git/objects/pack -name "*.idx" |
+		while read idx
+		do
+			git show-index <"$idx" >idx.raw &&
+			sort -n <idx.raw >idx.sorted &&
+			packsz=$(test_file_size "${idx%.idx}.pack") &&
+			end=$((packsz - rawsz)) &&
+			awk -v end="$end" "
+				NR > 1 { print oid, \$1 - start }
+				{ start = \$1; oid = \$2 }
+				END { print oid, end - start }
+			" idx.sorted ||
+			return 1
+		done
+	} >expect.raw &&
+	sort <expect.raw >expect &&
+	git cat-file --batch-all-objects \
+		--batch-check="%(objectname) %(objectsize:disk)" >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'set up replacement object' '
 	orig=$(git rev-parse HEAD) &&
 	git cat-file commit $orig >orig &&
-- 
2.43.0.430.gaf21263e5d

^ permalink raw reply related	[flat|nested] 17+ messages in thread
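The loose-object half of the patch leans on the layout of .git/objects: the first two hex digits of the object name are the fan-out directory, the rest is the file name. The `find | awk` pair just reverses that split; on one shortened, purely hypothetical path:

```shell
# Glue fields 3 and 4 of the /-separated path back into an object name,
# exactly as the awk program in the patch does.
line=$(printf '%s\n' .git/objects/ab/cdef0123 | awk -F/ '{ print $0, $3$4 }')
echo "$line"   # .git/objects/ab/cdef0123 abcdef0123
```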
* Re: [PATCH] t1006: add tests for %(objectsize:disk)
  2023-12-21  9:47 ` [PATCH] t1006: add tests for %(objectsize:disk) Jeff King
@ 2023-12-21 12:19   ` René Scharfe
  2023-12-21 21:30     ` Jeff King
  0 siblings, 1 reply; 17+ messages in thread
From: René Scharfe @ 2023-12-21 12:19 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano

Am 21.12.23 um 10:47 schrieb Jeff King:
> On Tue, Dec 19, 2023 at 05:42:39PM +0100, René Scharfe wrote:
>
>>> This adds a packed-object function, but I doubt anybody actually calls
>>> it. If we're going to do that, it's probably worth adding some tests for
>>> "cat-file --batch-check" or similar.
>>
>> Yes, and I was assuming that someone else would be eager to add such
>> tests. *ahem*
>
> :P OK, here it is. This can be its own topic, or go on top of the
> rs/t6300-compressed-size-fix branch.

Great, thank you!

> -- >8 --
> Subject: [PATCH] t1006: add tests for %(objectsize:disk)
>
> Back when we added this placeholder in a4ac106178 (cat-file: add
> %(objectsize:disk) format atom, 2013-07-10), there were no tests,
> claiming "[...]the exact numbers returned are volatile and subject to
> zlib and packing decisions".
>
> But we can use a little shell hackery to get the expected numbers
> ourselves. To a certain degree this is just re-implementing what Git is
> doing under the hood, but it is still worth doing. It makes sure we
> exercise the %(objectsize:disk) code at all, and having the two
> implementations agree gives us more confidence.
>
> Note that our shell code assumes that no object appears twice (either in
> two packs, or as both loose and packed), as then the results really are
> undefined. That's OK for our purposes, and the test will notice if that
> assumption is violated (the shell version would produce duplicate lines
> that Git's output does not have).
>
> Helped-by: René Scharfe <l.s.r@web.de>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> I stole a bit of your awk. You can tell because I'd have written it in
> perl. ;)

I think we can do it even in shell, especially if...

>
>  t/t1006-cat-file.sh | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
>
> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> index 271c5e4fd3..21915be308 100755
> --- a/t/t1006-cat-file.sh
> +++ b/t/t1006-cat-file.sh
> @@ -1100,6 +1100,40 @@ test_expect_success 'cat-file --batch="batman" with --batch-all-objects will wor
>  	cmp expect actual
>  '
>
> +test_expect_success 'cat-file %(objectsize:disk) with --batch-all-objects' '
> +	# our state has both loose and packed objects,
> +	# so find both for our expected output
> +	{
> +		find .git/objects/?? -type f |
> +		awk -F/ "{ print \$0, \$3\$4 }" |
> +		while read path oid
> +		do
> +			size=$(test_file_size "$path") &&
> +			echo "$oid $size" ||
> +			return 1
> +		done &&
> +		rawsz=$(test_oid rawsz) &&
> +		find .git/objects/pack -name "*.idx" |
> +		while read idx
> +		do
> +			git show-index <"$idx" >idx.raw &&
> +			sort -n <idx.raw >idx.sorted &&
> +			packsz=$(test_file_size "${idx%.idx}.pack") &&
> +			end=$((packsz - rawsz)) &&
> +			awk -v end="$end" "
> +				NR > 1 { print oid, \$1 - start }
> +				{ start = \$1; oid = \$2 }
> +				END { print oid, end - start }
> +			" idx.sorted ||

... we stop slicing the data against the grain. Let's reverse the order
(sort -nr), then we don't need to carry the oid forward:

			sort -nr <idx.raw >idx.sorted &&
			packsz=$(test_file_size "${idx%.idx}.pack") &&
			end=$((packsz - rawsz)) &&
			awk -v end="$end" "
				{ print \$2, end - \$1; end = \$1 }
			" idx.sorted ||

And at that point it should be easy to use a shell loop instead of awk:

			while read start oid rest
			do
				size=$((end - start)) &&
				end=$start &&
				echo "$oid $size" ||
				return 1
			done <idx.sorted

> +			return 1
> +		done
> +	} >expect.raw &&
> +	sort <expect.raw >expect &&

The reversal above becomes irrelevant with that line, so the result in
expect stays the same.

Should we deduplicate here, like cat-file does (i.e. use "sort -u")?
Having the same object in multiple places for whatever reason would not
be a cause for reporting an error in this test, I would think.

> +	git cat-file --batch-all-objects \
> +		--batch-check="%(objectname) %(objectsize:disk)" >actual &&
> +	test_cmp expect actual
> +'
> +
>  test_expect_success 'set up replacement object' '
>  	orig=$(git rev-parse HEAD) &&
>  	git cat-file commit $orig >orig &&

One more thing: We can do the work of the first awk invocation in the
already existing loop as well:

> +		find .git/objects/?? -type f |
> +		awk -F/ "{ print \$0, \$3\$4 }" |
> +		while read path oid
> +		do
> +			size=$(test_file_size "$path") &&
> +			echo "$oid $size" ||
> +			return 1
> +		done &&

... but the substitutions are a bit awkward:

		find .git/objects/?? -type f |
		while read path
		do
			basename=${path##*/} &&
			dirname=${path%/$basename} &&
			oid="${dirname#.git/objects/}${basename}" &&
			size=$(test_file_size "$path") &&
			echo "$oid $size" ||
			return 1
		done &&

The avoided awk invocation might be worth the trouble, though.

René

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH] t1006: add tests for %(objectsize:disk) 2023-12-21 12:19 ` René Scharfe @ 2023-12-21 21:30 ` Jeff King 2023-12-21 23:13 ` René Scharfe 0 siblings, 1 reply; 17+ messages in thread From: Jeff King @ 2023-12-21 21:30 UTC (permalink / raw) To: René Scharfe Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano On Thu, Dec 21, 2023 at 01:19:53PM +0100, René Scharfe wrote: > I think we can do it even in shell, especially if... > [...] Yeah, your conversion looks accurate. I do wonder if it is worth golfing further, though. If it were a process invocation per object, I'd definitely say the efficiency gain is worth it. But dropping one process from the whole test isn't that exciting either way. > (sort -r), then we don't need to carry the oid forward: > > sort -nr <idx.raw >idx.sorted && > packsz=$(test_file_size "${idx%.idx}.pack") && > end=$((packsz - rawsz)) && > awk -v end="$end" " > { print \$2, end - \$1; end = \$1 } > " idx.sorted || > > And at that point it should be easy to use a shell loop instead of awk: > > while read start oid rest > do > size=$((end - start)) && > end=$start && > echo "$oid $size" || > return 1 > done <idx.sorted The one thing I do like is that we don't have to escape anything inside an awk program that is forced to use double-quotes. ;) > Should we deduplicate here, like cat-file does (i.e. use "sort -u")? > Having the same object in multiple places for whatever reason would not > be a cause for reporting an error in this test, I would think. No, for the reasons I said in the commit message: if an object exists in multiple places the test is already potentially invalid, as Git does not promise which version it will use. So it might work racily, or it might work for now but be fragile. By not de-duplicating, we make sure the test's assumption holds. > One more thing: We can do the work of the first awk invocation in the > already existing loop as well: > [...] > ... 
but the substitutions are a bit awkward: > > find .git/objects/?? -type f | > while read path > do > basename=${path##*/} && > dirname=${path%/$basename} && > oid="${dirname#.git/objects/}${basename}" && > size=$(test_file_size "$path") && > echo "$oid $size" || > return 1 > done && > > The avoided awk invocation might be worth the trouble, though. Yeah, I briefly considered whether it would be possible in pure shell, but didn't get very far before assuming it was going to be ugly. Thank you for confirming. ;) Again, if we were doing one awk per object, I'd try to avoid it. But since we can cover all objects in a single pass, I think it's OK. -Peff ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] t1006: add tests for %(objectsize:disk) 2023-12-21 21:30 ` Jeff King @ 2023-12-21 23:13 ` René Scharfe 2023-12-23 10:09 ` [PATCH v2] " Jeff King 2023-12-23 10:18 ` [PATCH] " Jeff King 0 siblings, 2 replies; 17+ messages in thread From: René Scharfe @ 2023-12-21 23:13 UTC (permalink / raw) To: Jeff King; +Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano Am 21.12.23 um 22:30 schrieb Jeff King: > On Thu, Dec 21, 2023 at 01:19:53PM +0100, René Scharfe wrote: > >> I think we can do it even in shell, especially if... >> [...] > > Yeah, your conversion looks accurate. I do wonder if it is worth golfing > further, though. If it were a process invocation per object, I'd > definitely say the efficiency gain is worth it. But dropping one process > from the whole test isn't that exciting either way. Fair enough. > >> (sort -r), then we don't need to carry the oid forward: >> >> sort -nr <idx.raw >idx.sorted && >> packsz=$(test_file_size "${idx%.idx}.pack") && >> end=$((packsz - rawsz)) && >> awk -v end="$end" " >> { print \$2, end - \$1; end = \$1 } >> " idx.sorted || >> >> And at that point it should be easy to use a shell loop instead of awk: >> >> while read start oid rest >> do >> size=$((end - start)) && >> end=$start && >> echo "$oid $size" || >> return 1 >> done <idx.sorted > > The one thing I do like is that we don't have to escape anything inside > an awk program that is forced to use double-quotes. ;) For me it's processing the data in the "correct" order (descending, i.e. starting at the end, which we have to calculate first anyway based on the size). >> Should we deduplicate here, like cat-file does (i.e. use "sort -u")? >> Having the same object in multiple places for whatever reason would not >> be a cause for reporting an error in this test, I would think. 
> > No, for the reasons I said in the commit message: if an object exists in > multiple places the test is already potentially invalid, as Git does not > promise which version it will use. So it might work racily, or it might > work for now but be fragile. By not de-duplicating, we make sure the > test's assumption holds. Oh, skipped that paragraph. Still I don't see how a duplicate object would necessarily invalidate t1006. The comment for the test "cat-file --batch-all-objects shows all objects" a few lines above indicates that it's picky about the provenance of objects, but it uses a separate repository. I can't infer the same requirement for the root repo, but we already established that I can't read. Anyway, if someone finds a use for git repack without -d or git unpack-objects or whatever else causes duplicates in the root repository of t1006 then they can try to reverse your ban with concrete arguments. René ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v2] t1006: add tests for %(objectsize:disk) 2023-12-21 23:13 ` René Scharfe @ 2023-12-23 10:09 ` Jeff King 2023-12-24 9:30 ` René Scharfe 2023-12-23 10:18 ` [PATCH] " Jeff King 1 sibling, 1 reply; 17+ messages in thread From: Jeff King @ 2023-12-23 10:09 UTC (permalink / raw) To: René Scharfe Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano On Fri, Dec 22, 2023 at 12:13:10AM +0100, René Scharfe wrote: > >> while read start oid rest > >> do > >> size=$((end - start)) && > >> end=$start && > >> echo "$oid $size" || > >> return 1 > >> done <idx.sorted > > > > The one thing I do like is that we don't have to escape anything inside > > an awk program that is forced to use double-quotes. ;) > > For me it's processing the data in the "correct" order (descending, i.e. > starting at the end, which we have to calculate first anyway based on the > size). That was one thing that I thought made it more complicated. The obvious order to me is start-to-end in the pack. But I do agree that going in reverse order makes things much simpler, as we compute the size of each entry as we see it (and so there are fewer special cases). So I'm convinced that it's worth switching. Here's a v2 with your suggestion. -- >8 -- Subject: t1006: add tests for %(objectsize:disk) Back when we added this placeholder in a4ac106178 (cat-file: add %(objectsize:disk) format atom, 2013-07-10), there were no tests, claiming "[...]the exact numbers returned are volatile and subject to zlib and packing decisions". But we can use a little shell hackery to get the expected numbers ourselves. To a certain degree this is just re-implementing what Git is doing under the hood, but it is still worth doing. It makes sure we exercise the %(objectsize:disk) code at all, and having the two implementations agree gives us more confidence. Note that our shell code assumes that no object appears twice (either in two packs, or as both loose and packed), as then the results really are undefined. 
That's OK for our purposes, and the test will notice if that assumption is violated (the shell version would produce duplicate lines that Git's output does not have). Helped-by: René Scharfe <l.s.r@web.de> Signed-off-by: Jeff King <peff@peff.net> --- t/t1006-cat-file.sh | 36 ++++++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh index 271c5e4fd3..e0c6482797 100755 --- a/t/t1006-cat-file.sh +++ b/t/t1006-cat-file.sh @@ -1100,6 +1100,42 @@ test_expect_success 'cat-file --batch="batman" with --batch-all-objects will wor cmp expect actual ' +test_expect_success 'cat-file %(objectsize:disk) with --batch-all-objects' ' + # our state has both loose and packed objects, + # so find both for our expected output + { + find .git/objects/?? -type f | + awk -F/ "{ print \$0, \$3\$4 }" | + while read path oid + do + size=$(test_file_size "$path") && + echo "$oid $size" || + return 1 + done && + rawsz=$(test_oid rawsz) && + find .git/objects/pack -name "*.idx" | + while read idx + do + git show-index <"$idx" >idx.raw && + sort -nr <idx.raw >idx.sorted && + packsz=$(test_file_size "${idx%.idx}.pack") && + end=$((packsz - rawsz)) && + while read start oid rest + do + size=$((end - start)) && + end=$start && + echo "$oid $size" || + return 1 + done <idx.sorted || + return 1 + done + } >expect.raw && + sort <expect.raw >expect && + git cat-file --batch-all-objects \ + --batch-check="%(objectname) %(objectsize:disk)" >actual && + test_cmp expect actual +' + test_expect_success 'set up replacement object' ' orig=$(git rev-parse HEAD) && git cat-file commit $orig >orig && -- 2.43.0.448.g93112243fb ^ permalink raw reply related [flat|nested] 17+ messages in thread
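The reverse-order size computation in the v2 loop can be tried in isolation. The offsets, object names, and payload end of 100 below are invented for illustration; walking the list in descending offset order yields each entry's size as end-minus-start with no lookahead needed:

```shell
# Same made-up records as a show-index dump would give, sorted descending.
end=100
printf '%s\n' "12 oid1" "40 oid2" "85 oid3" |
sort -nr |
while read start oid rest
do
	size=$((end - start)) &&   # this entry ends where the previous began
	end=$start &&
	echo "$oid $size"
done
# -> oid3 15, oid2 45, oid1 28
```

The output order is reversed relative to the awk version, which is why the later `sort <expect.raw >expect` makes the difference irrelevant.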
* Re: [PATCH v2] t1006: add tests for %(objectsize:disk) 2023-12-23 10:09 ` [PATCH v2] " Jeff King @ 2023-12-24 9:30 ` René Scharfe 0 siblings, 0 replies; 17+ messages in thread From: René Scharfe @ 2023-12-24 9:30 UTC (permalink / raw) To: Jeff King; +Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano Am 23.12.23 um 11:09 schrieb Jeff King: > > --- > t/t1006-cat-file.sh | 36 ++++++++++++++++++++++++++++++++++++ > 1 file changed, 36 insertions(+) > > diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh > index 271c5e4fd3..e0c6482797 100755 > --- a/t/t1006-cat-file.sh > +++ b/t/t1006-cat-file.sh > @@ -1100,6 +1100,42 @@ test_expect_success 'cat-file --batch="batman" with --batch-all-objects will wor > cmp expect actual > ' > > +test_expect_success 'cat-file %(objectsize:disk) with --batch-all-objects' ' > + # our state has both loose and packed objects, > + # so find both for our expected output > + { > + find .git/objects/?? -type f | > + awk -F/ "{ print \$0, \$3\$4 }" | > + while read path oid > + do > + size=$(test_file_size "$path") && > + echo "$oid $size" || > + return 1 > + done && > + rawsz=$(test_oid rawsz) && > + find .git/objects/pack -name "*.idx" | > + while read idx > + do > + git show-index <"$idx" >idx.raw && > + sort -nr <idx.raw >idx.sorted && > + packsz=$(test_file_size "${idx%.idx}.pack") && > + end=$((packsz - rawsz)) && > + while read start oid rest > + do > + size=$((end - start)) && > + end=$start && > + echo "$oid $size" || > + return 1 > + done <idx.sorted || > + return 1 > + done > + } >expect.raw && > + sort <expect.raw >expect && > + git cat-file --batch-all-objects \ > + --batch-check="%(objectname) %(objectsize:disk)" >actual && > + test_cmp expect actual > +' > + > test_expect_success 'set up replacement object' ' > orig=$(git rev-parse HEAD) && > git cat-file commit $orig >orig && Looks good to me. René ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] t1006: add tests for %(objectsize:disk) 2023-12-21 23:13 ` René Scharfe 2023-12-23 10:09 ` [PATCH v2] " Jeff King @ 2023-12-23 10:18 ` Jeff King 2023-12-24 9:30 ` René Scharfe 1 sibling, 1 reply; 17+ messages in thread From: Jeff King @ 2023-12-23 10:18 UTC (permalink / raw) To: René Scharfe Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano On Fri, Dec 22, 2023 at 12:13:10AM +0100, René Scharfe wrote: > >> Should we deduplicate here, like cat-file does (i.e. use "sort -u")? > >> Having the same object in multiple places for whatever reason would not > >> be a cause for reporting an error in this test, I would think. > > > > No, for the reasons I said in the commit message: if an object exists in > > multiple places the test is already potentially invalid, as Git does not > > promise which version it will use. So it might work racily, or it might > > work for now but be fragile. By not de-duplicating, we make sure the > > test's assumption holds. > > Oh, skipped that paragraph. Still I don't see how a duplicate object > would necessarily invalidate t1006. The comment for the test "cat-file > --batch-all-objects shows all objects" a few lines above indicates that > it's picky about the provenance of objects, but it uses a separate > repository. I can't infer the same requirement for the root repo, but > we already established that I can't read. The cat-file documentation explicitly calls this situation out: Note also that multiple copies of an object may be present in the object database; in this case, it is undefined which copy’s size or delta base will be reported. So if t1006 were to grow such a duplicate object, what will happen? If we de-dup in the new test, then we might end up mentioning the same copy (and the test passes), or we might not (and the test fails). But much worse, the results might be racy (depending on how cat-file happens to decide which one to use). 
By no de-duping, then the test will reliably fail and the author can decide how to handle it then. IOW it is about failing immediately and predictably rather than letting a future change to sneak a race or other accident-waiting-to-happen into t1006. > Anyway, if someone finds a use for git repack without -d or > git unpack-objects or whatever else causes duplicates in the root > repository of t1006 then they can try to reverse your ban with concrete > arguments. In the real world, the most common way to get a duplicate is to fetch or push into a repository, such that: 1. There are enough objects to retain the pack (100 by default) 2. There's a thin delta in the on-the-wire pack (i.e., a delta against a base that the sender knows the receiver has, but which isn't itself sent). Then "index-pack --fix-thin" will complete the on-disk pack by storing a copy of the base object in it. And now we have it in two packs (and if it's a delta or loose in the original, the size will be different). -Peff ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] t1006: add tests for %(objectsize:disk) 2023-12-23 10:18 ` [PATCH] " Jeff King @ 2023-12-24 9:30 ` René Scharfe 0 siblings, 0 replies; 17+ messages in thread From: René Scharfe @ 2023-12-24 9:30 UTC (permalink / raw) To: Jeff King; +Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano Am 23.12.23 um 11:18 schrieb Jeff King: > On Fri, Dec 22, 2023 at 12:13:10AM +0100, René Scharfe wrote: > >>>> Should we deduplicate here, like cat-file does (i.e. use "sort -u")? >>>> Having the same object in multiple places for whatever reason would not >>>> be a cause for reporting an error in this test, I would think. >>> >>> No, for the reasons I said in the commit message: if an object exists in >>> multiple places the test is already potentially invalid, as Git does not >>> promise which version it will use. So it might work racily, or it might >>> work for now but be fragile. By not de-duplicating, we make sure the >>> test's assumption holds. >> >> Oh, skipped that paragraph. Still I don't see how a duplicate object >> would necessarily invalidate t1006. The comment for the test "cat-file >> --batch-all-objects shows all objects" a few lines above indicates that >> it's picky about the provenance of objects, but it uses a separate >> repository. I can't infer the same requirement for the root repo, but >> we already established that I can't read. > > The cat-file documentation explicitly calls this situation out: > > Note also that multiple copies of an object may be present in the > object database; in this case, it is undefined which copy’s size or > delta base will be reported. > > So if t1006 were to grow such a duplicate object, what will happen? If > we de-dup in the new test, then we might end up mentioning the same copy > (and the test passes), or we might not (and the test fails). But much > worse, the results might be racy (depending on how cat-file happens to > decide which one to use). 
By no de-duping, then the test will reliably > fail and the author can decide how to handle it then. > > IOW it is about failing immediately and predictably rather than letting > a future change to sneak a race or other accident-waiting-to-happen into > t1006. > >> Anyway, if someone finds a use for git repack without -d or >> git unpack-objects or whatever else causes duplicates in the root >> repository of t1006 then they can try to reverse your ban with concrete >> arguments. > > In the real world, the most common way to get a duplicate is to fetch or > push into a repository, such that: > > 1. There are enough objects to retain the pack (100 by default) > > 2. There's a thin delta in the on-the-wire pack (i.e., a delta against > a base that the sender knows the receiver has, but which isn't > itself sent). > > Then "index-pack --fix-thin" will complete the on-disk pack by storing a > copy of the base object in it. And now we have it in two packs (and if > it's a delta or loose in the original, the size will be different). I think I get it now. The size possibly being different is crucial. cat-file deduplicates based on object ID alone. sort -u in t1006 would deduplicate based on object ID and size, meaning that it would only remove duplicates of the same size. Emulating the deduplication of cat-file is also possible, but would introduce the race you mentioned. However, even removing only same-size duplicates is unreliable because there is no guarantee that the same object has the same size in different packs. Adding a new object that is a better delta base would change the size. So, deduplicating based on object ID and size is sound for any particular run, but sizes are not stable and thus we need to know if the tests do something that adds duplicates of any size. René ^ permalink raw reply [flat|nested] 17+ messages in thread
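The distinction drawn here is easy to demonstrate with sort itself: `sort -u` deduplicates whole lines, so two records for the same oid survive whenever their sizes differ, whereas restricting the comparison key emulates cat-file's oid-only dedup (and arbitrarily picks one size). The records are invented:

```shell
# Whole-line dedup: only the byte-identical duplicate is dropped,
# "oid1 137" and "oid1 138" both remain -- the test would still fail.
printf '%s\n' "oid1 137" "oid1 137" "oid1 138" | sort -u

# Key-restricted dedup on field 1: one line per oid survives, but
# which size it carries is an arbitrary pick -- the race described above.
printf '%s\n' "oid1 137" "oid1 137" "oid1 138" | sort -u -k1,1
```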
* Re: Test breakage with zlib-ng 2023-12-12 17:04 ` René Scharfe 2023-12-12 20:01 ` Jeff King @ 2023-12-12 22:18 ` brian m. carlson 2023-12-12 22:30 ` Junio C Hamano 2 siblings, 0 replies; 17+ messages in thread From: brian m. carlson @ 2023-12-12 22:18 UTC (permalink / raw) To: René Scharfe; +Cc: Ondrej Pohorelsky, git, Junio C Hamano [-- Attachment #1: Type: text/plain, Size: 984 bytes --] On 2023-12-12 at 17:04:55, René Scharfe wrote: > --- >8 --- > Subject: [PATCH] t6300: avoid hard-coding object sizes > > f4ee22b526 (ref-filter: add tests for objectsize:disk, 2018-12-24) > hard-coded the expected object sizes. Coincidentally the size of commit > and tag is the same with zlib at the default compression level. > > 1f5f8f3e85 (t6300: abstract away SHA-1-specific constants, 2020-02-22) > encoded the sizes as a single value, which coincidentally also works > with sha256. > > Different compression libraries like zlib-ng may arrive at different > values. Get them from the file system instead of hard-coding them to > make switching the compression library (or changing the compression > level) easier. This definitely seems like the right thing to do. I was a bit lazy at the time and probably could have improved it then, but it's at least good that we're doing it now. -- brian m. carlson (he/him or they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Test breakage with zlib-ng 2023-12-12 17:04 ` René Scharfe 2023-12-12 20:01 ` Jeff King 2023-12-12 22:18 ` Test breakage with zlib-ng brian m. carlson @ 2023-12-12 22:30 ` Junio C Hamano 2 siblings, 0 replies; 17+ messages in thread From: Junio C Hamano @ 2023-12-12 22:30 UTC (permalink / raw) To: René Scharfe; +Cc: Ondrej Pohorelsky, git, brian m . carlson René Scharfe <l.s.r@web.de> writes: > Or we could get the sizes of the objects by checking their files, > which would not require hard-coding anymore. Patch below. That was my first reaction to seeing the original report. It is a bit surprising that the necessary fix is so small and makes me wonder why we initially did the hardcoded values, which feels more work to initially write the test vector. Looking good. Thanks. > --- >8 --- > Subject: [PATCH] t6300: avoid hard-coding object sizes > > f4ee22b526 (ref-filter: add tests for objectsize:disk, 2018-12-24) > hard-coded the expected object sizes. Coincidentally the size of commit > and tag is the same with zlib at the default compression level. > > 1f5f8f3e85 (t6300: abstract away SHA-1-specific constants, 2020-02-22) > encoded the sizes as a single value, which coincidentally also works > with sha256. > > Different compression libraries like zlib-ng may arrive at different > values. Get them from the file system instead of hard-coding them to > make switching the compression library (or changing the compression > level) easier. 
> > Reported-by: Ondrej Pohorelsky <opohorel@redhat.com> > Signed-off-by: René Scharfe <l.s.r@web.de> > --- > t/t6300-for-each-ref.sh | 18 +++++++++--------- > 1 file changed, 9 insertions(+), 9 deletions(-) > > diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh > index 54e2281259..843a7fe143 100755 > --- a/t/t6300-for-each-ref.sh > +++ b/t/t6300-for-each-ref.sh > @@ -20,12 +20,13 @@ setdate_and_increment () { > export GIT_COMMITTER_DATE GIT_AUTHOR_DATE > } > > -test_expect_success setup ' > - test_oid_cache <<-EOF && > - disklen sha1:138 > - disklen sha256:154 > - EOF > +test_object_file_size () { > + oid=$(git rev-parse "$1") > + path=".git/objects/$(test_oid_to_path $oid)" > + test_file_size "$path" > +} > > +test_expect_success setup ' > # setup .mailmap > cat >.mailmap <<-EOF && > A Thor <athor@example.com> A U Thor <author@example.com> > @@ -94,7 +95,6 @@ test_atom () { > } > > hexlen=$(test_oid hexsz) > -disklen=$(test_oid disklen) > > test_atom head refname refs/heads/main > test_atom head refname: refs/heads/main > @@ -129,7 +129,7 @@ test_atom head push:strip=1 remotes/myfork/main > test_atom head push:strip=-1 main > test_atom head objecttype commit > test_atom head objectsize $((131 + hexlen)) > -test_atom head objectsize:disk $disklen > +test_atom head objectsize:disk $(test_object_file_size refs/heads/main) > test_atom head deltabase $ZERO_OID > test_atom head objectname $(git rev-parse refs/heads/main) > test_atom head objectname:short $(git rev-parse --short refs/heads/main) > @@ -203,8 +203,8 @@ test_atom tag upstream '' > test_atom tag push '' > test_atom tag objecttype tag > test_atom tag objectsize $((114 + hexlen)) > -test_atom tag objectsize:disk $disklen > -test_atom tag '*objectsize:disk' $disklen > +test_atom tag objectsize:disk $(test_object_file_size refs/tags/testtag) > +test_atom tag '*objectsize:disk' $(test_object_file_size refs/heads/main) > test_atom tag deltabase $ZERO_OID > test_atom tag '*deltabase' $ZERO_OID > 
test_atom tag objectname $(git rev-parse refs/tags/testtag) > -- > 2.43.0 ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2023-12-24 9:30 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-12-12 14:16 Test breakage with zlib-ng Ondrej Pohorelsky 2023-12-12 17:04 ` René Scharfe 2023-12-12 20:01 ` Jeff King 2023-12-12 22:54 ` René Scharfe 2023-12-13 12:28 ` [PATCH 2/1] test-lib-functions: add object size functions René Scharfe 2023-12-14 20:59 ` Jeff King 2023-12-19 16:42 ` René Scharfe 2023-12-21 9:47 ` [PATCH] t1006: add tests for %(objectsize:disk) Jeff King 2023-12-21 12:19 ` René Scharfe 2023-12-21 21:30 ` Jeff King 2023-12-21 23:13 ` René Scharfe 2023-12-23 10:09 ` [PATCH v2] " Jeff King 2023-12-24 9:30 ` René Scharfe 2023-12-23 10:18 ` [PATCH] " Jeff King 2023-12-24 9:30 ` René Scharfe 2023-12-12 22:18 ` Test breakage with zlib-ng brian m. carlson 2023-12-12 22:30 ` Junio C Hamano