* Test breakage with zlib-ng
@ 2023-12-12 14:16 Ondrej Pohorelsky
  2023-12-12 17:04 ` René Scharfe
  0 siblings, 1 reply; 17+ messages in thread
From: Ondrej Pohorelsky @ 2023-12-12 14:16 UTC (permalink / raw)
  To: git

Hi everyone,

As some might have heard, there is a proposal for Fedora 40 to
transition from zlib to zlib-ng[0]. Because of this, there has been a
rebuild of all packages to ensure every package works under zlib-ng.

The Git test suite has three breakages in t6300-for-each-ref.sh.
To be precise, they are:

not ok 35 - basic atom: refs/heads/main objectsize:disk
not ok 107 - basic atom: refs/tags/testtag objectsize:disk
not ok 108 - basic atom: refs/tags/testtag *objectsize:disk

All of these tests check a format atom and compare the result against
$disklen. I discussed it with Tulio Magno Quites Machado Filho, who ran
the tests and is the owner of the proposal. It seems that zlib-ng's
compression shaves or adds some bytes relative to the expected output,
which then fails the comparison.

Here is an example:

```
expecting success of 6300.35 'basic atom: refs/heads/main objectsize:disk':
		git for-each-ref --format="%($format)" "$ref" >actual &&
		sanitize_pgp <actual >actual.clean &&
		test_cmp expected actual.clean

++ git for-each-ref '--format=%(objectsize:disk)' refs/heads/main
++ sanitize_pgp
++ perl -ne '
	/^-----END PGP/ and $in_pgp = 0;
	print unless $in_pgp;
	/^-----BEGIN PGP/ and $in_pgp = 1;
'
++ command /usr/bin/perl -ne '
	/^-----END PGP/ and $in_pgp = 0;
	print unless $in_pgp;
	/^-----BEGIN PGP/ and $in_pgp = 1;
'
++ test_cmp expected actual.clean
++ test 2 -ne 2
++ eval 'diff -u' '"$@"'
+++ diff -u expected actual.clean
--- expected	2023-12-06 21:06:07.808849497 +0000
+++ actual.clean	2023-12-06 21:06:07.812849541 +0000
@@ -1 +1 @@
-138
+137
error: last command exited with $?=1
not ok 35 - basic atom: refs/heads/main objectsize:disk
```

The whole build log can be found here[1].
I can easily patch these tests in Fedora to be compatible with zlib-ng
by comparing against concrete values instead of $disklen, but I would
like a universal solution that works with both zlib and zlib-ng. So if
anyone has an idea on how to do it, please let me know.

Thanks

[0] https://discussion.fedoraproject.org/t/f40-change-proposal-transitioning-to-zlib-ng-as-a-compatible-replacement-for-zlib-system-wide/95807
[1] https://download.copr.fedorainfracloud.org/results/tuliom/zlib-ng-compat-mpb/fedora-rawhide-x86_64/06729801-git/builder-live.log.gz

Cheers,
Ondřej Pohořelský

^ permalink raw reply	[flat|nested] 17+ messages in thread
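The failure above boils down to a property of deflate: the compressed size of identical input varies with the implementation and its settings, so both 137 and 138 are valid byte counts for the same object. A stand-in illustration using gzip's effort levels (gzip, the level numbers, and the payload are only placeholders for the zlib/zlib-ng difference; the exact byte counts mean nothing):

```shell
# Compress the same bytes at two deflate effort levels and compare the
# stream sizes; higher effort compresses at least as well, and two
# deflate libraries can differ the same way even at equal levels.
seq 1 500 >payload
fast=$(gzip -c -1 payload | wc -c)
best=$(gzip -c -9 payload | wc -c)
echo "level 1: $fast bytes, level 9: $best bytes"
```

Any test that hard-codes one of these numbers breaks as soon as the library or level changes, which is exactly what the thread goes on to fix.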
* Re: Test breakage with zlib-ng
  2023-12-12 14:16 Test breakage with zlib-ng Ondrej Pohorelsky
@ 2023-12-12 17:04 ` René Scharfe
  2023-12-12 20:01   ` Jeff King
                      ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: René Scharfe @ 2023-12-12 17:04 UTC (permalink / raw)
  To: Ondrej Pohorelsky, git; +Cc: brian m . carlson, Junio C Hamano

Am 12.12.23 um 15:16 schrieb Ondrej Pohorelsky:
> Hi everyone,
>
> As some might have heard, there is a proposal for Fedora 40 to
> transition from zlib to zlib-ng[0]. Because of this, there has been a
> rebuild of all packages to ensure every package works under zlib-ng.
>
> The Git test suite has three breakages in t6300-for-each-ref.sh.
> To be precise, they are:
>
> not ok 35 - basic atom: refs/heads/main objectsize:disk
> not ok 107 - basic atom: refs/tags/testtag objectsize:disk
> not ok 108 - basic atom: refs/tags/testtag *objectsize:disk
>
> All of these tests check a format atom and compare the result against
> $disklen.

Why do these three objects (HEAD commit of main, testtag and testtag
target) have the same size? Half of the answer is that testtag points
to the HEAD of main. But the other half is pure coincidence as far as
I can see.

> I can easily patch these tests in Fedora to be compatible with zlib-ng
> by comparing against concrete values instead of $disklen, but I would
> like a universal solution that works with both zlib and zlib-ng. So if
> anyone has an idea on how to do it, please let me know.

The test stores the expected values at the top, in the following lines,
for the two possible repository formats:

	test_expect_success setup '
		test_oid_cache <<-EOF &&
		disklen sha1:138
		disklen sha256:154
		EOF

So it's using hard-coded values already, which breaks when the
compression rate changes. We could set core.compression to 0 to take
compression out of the picture. Or we could get the sizes of the
objects by checking their files, which would not require hard-coding
anymore. Patch below.
--- >8 ---
Subject: [PATCH] t6300: avoid hard-coding object sizes

f4ee22b526 (ref-filter: add tests for objectsize:disk, 2018-12-24)
hard-coded the expected object sizes. Coincidentally the size of commit
and tag is the same with zlib at the default compression level.

1f5f8f3e85 (t6300: abstract away SHA-1-specific constants, 2020-02-22)
encoded the sizes as a single value, which coincidentally also works
with sha256.

Different compression libraries like zlib-ng may arrive at different
values. Get them from the file system instead of hard-coding them to
make switching the compression library (or changing the compression
level) easier.

Reported-by: Ondrej Pohorelsky <opohorel@redhat.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
 t/t6300-for-each-ref.sh | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
index 54e2281259..843a7fe143 100755
--- a/t/t6300-for-each-ref.sh
+++ b/t/t6300-for-each-ref.sh
@@ -20,12 +20,13 @@ setdate_and_increment () {
 	export GIT_COMMITTER_DATE GIT_AUTHOR_DATE
 }
 
-test_expect_success setup '
-	test_oid_cache <<-EOF &&
-	disklen sha1:138
-	disklen sha256:154
-	EOF
+test_object_file_size () {
+	oid=$(git rev-parse "$1")
+	path=".git/objects/$(test_oid_to_path $oid)"
+	test_file_size "$path"
+}
 
+test_expect_success setup '
 	# setup .mailmap
 	cat >.mailmap <<-EOF &&
 	A Thor <athor@example.com> A U Thor <author@example.com>
@@ -94,7 +95,6 @@ test_atom () {
 }
 
 hexlen=$(test_oid hexsz)
-disklen=$(test_oid disklen)
 
 test_atom head refname refs/heads/main
 test_atom head refname: refs/heads/main
@@ -129,7 +129,7 @@ test_atom head push:strip=1 remotes/myfork/main
 test_atom head push:strip=-1 main
 test_atom head objecttype commit
 test_atom head objectsize $((131 + hexlen))
-test_atom head objectsize:disk $disklen
+test_atom head objectsize:disk $(test_object_file_size refs/heads/main)
 test_atom head deltabase $ZERO_OID
 test_atom head objectname $(git rev-parse refs/heads/main)
 test_atom head objectname:short $(git rev-parse --short refs/heads/main)
@@ -203,8 +203,8 @@ test_atom tag upstream ''
 test_atom tag push ''
 test_atom tag objecttype tag
 test_atom tag objectsize $((114 + hexlen))
-test_atom tag objectsize:disk $disklen
-test_atom tag '*objectsize:disk' $disklen
+test_atom tag objectsize:disk $(test_object_file_size refs/tags/testtag)
+test_atom tag '*objectsize:disk' $(test_object_file_size refs/heads/main)
 test_atom tag deltabase $ZERO_OID
 test_atom tag '*deltabase' $ZERO_OID
 test_atom tag objectname $(git rev-parse refs/tags/testtag)
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread
* Re: Test breakage with zlib-ng
  2023-12-12 17:04 ` René Scharfe
@ 2023-12-12 20:01   ` Jeff King
  2023-12-12 22:54     ` René Scharfe
  2023-12-12 22:18   ` Test breakage with zlib-ng brian m. carlson
  2023-12-12 22:30   ` Junio C Hamano
  2 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2023-12-12 20:01 UTC (permalink / raw)
  To: René Scharfe
  Cc: Ondrej Pohorelsky, git, brian m . carlson, Junio C Hamano

On Tue, Dec 12, 2023 at 06:04:55PM +0100, René Scharfe wrote:

> Subject: [PATCH] t6300: avoid hard-coding object sizes
>
> f4ee22b526 (ref-filter: add tests for objectsize:disk, 2018-12-24)
> hard-coded the expected object sizes. Coincidentally the size of commit
> and tag is the same with zlib at the default compression level.
>
> 1f5f8f3e85 (t6300: abstract away SHA-1-specific constants, 2020-02-22)
> encoded the sizes as a single value, which coincidentally also works
> with sha256.
>
> Different compression libraries like zlib-ng may arrive at different
> values. Get them from the file system instead of hard-coding them to
> make switching the compression library (or changing the compression
> level) easier.

Yeah, this is definitely the right solution here. I'm surprised the
hard-coded values didn't cause problems before now. ;)

The patch looks good to me, but a few small comments:

> +test_object_file_size () {
> +	oid=$(git rev-parse "$1")
> +	path=".git/objects/$(test_oid_to_path $oid)"
> +	test_file_size "$path"
> +}

Here we're assuming the objects are loose. I think that's probably OK
(and certainly the test will notice if that changes).

We're covering the formatting code paths along with the underlying
implementation that fills in object_info->disk_sizep for loose objects.
Which I think is plenty for this particular script, which is about
for-each-ref.

It would be nice to have coverage of the packed_object_info() code path,
though. Back when it was added in a4ac106178 (cat-file: add
%(objectsize:disk) format atom, 2013-07-10), I cowardly punted on this,
writing:

    This patch does not include any tests, as the exact numbers
    returned are volatile and subject to zlib and packing
    decisions. We cannot even reliably guarantee that the
    on-disk size is smaller than the object content (though in
    general this should be the case for non-trivial objects).

I don't think it's that big a deal, but I guess we could do something
like:

  prev=
  git show-index <$pack_idx |
  sort -n |
  grep -A1 $oid |
  while read ofs oid csum
  do
	test -n "$prev" && echo "$((ofs - prev))"
	prev=$ofs
  done

It feels a little redundant with what Git is doing under the hood, but
at least is exercising the code (and we're using the idx directly, so
we're confirming that the revindex is right).

Anyway, that is all way beyond the scope of your patch, but I wonder if
it's worth doing on top.

> @@ -129,7 +129,7 @@ test_atom head push:strip=1 remotes/myfork/main
>  test_atom head push:strip=-1 main
>  test_atom head objecttype commit
>  test_atom head objectsize $((131 + hexlen))
> -test_atom head objectsize:disk $disklen
> +test_atom head objectsize:disk $(test_object_file_size refs/heads/main)

These test_object_file_size calls are happening outside of any
test_expect_* block, so we'd miss failing exit codes (and also the
helper is not &&-chained), and any stderr would leak to the output.
That's probably OK in practice, though (if something goes wrong then the
expected value output will be bogus and the test itself will fail).

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread
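Peff's pipeline can be tried outside any repository by feeding it fabricated `git show-index` lines (offset, object name, CRC — all values below are invented). Each object's packed size is the gap between its offset and the next one in sorted order:

```shell
# Fabricated index listing, deliberately out of order like real
# show-index output (which is sorted by object name, not offset).
cat >idx.raw <<\EOF
1500 cccc c3
12 aaaa c1
200 bbbb c2
EOF
oid=bbbb
size=$(
	prev=
	sort -n idx.raw |
	grep -A1 "$oid" |
	while read ofs oid csum
	do
		test -n "$prev" && echo "$((ofs - prev))"
		prev=$ofs
	done
)
echo "$size"   # 1500 - 200 = 1300
```

The last object in a pack has no successor, which is why a real helper also needs the pack size minus the trailing hash as a sentinel offset.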
* Re: Test breakage with zlib-ng
  2023-12-12 20:01 ` Jeff King
@ 2023-12-12 22:54   ` René Scharfe
  2023-12-13 12:28     ` [PATCH 2/1] test-lib-functions: add object size functions René Scharfe
  0 siblings, 1 reply; 17+ messages in thread
From: René Scharfe @ 2023-12-12 22:54 UTC (permalink / raw)
  To: Jeff King; +Cc: Ondrej Pohorelsky, git, brian m . carlson, Junio C Hamano

Am 12.12.23 um 21:01 schrieb Jeff King:
> On Tue, Dec 12, 2023 at 06:04:55PM +0100, René Scharfe wrote:
>
>> Subject: [PATCH] t6300: avoid hard-coding object sizes
>>
>> f4ee22b526 (ref-filter: add tests for objectsize:disk, 2018-12-24)
>> hard-coded the expected object sizes. Coincidentally the size of commit
>> and tag is the same with zlib at the default compression level.
>>
>> 1f5f8f3e85 (t6300: abstract away SHA-1-specific constants, 2020-02-22)
>> encoded the sizes as a single value, which coincidentally also works
>> with sha256.
>>
>> Different compression libraries like zlib-ng may arrive at different
>> values. Get them from the file system instead of hard-coding them to
>> make switching the compression library (or changing the compression
>> level) easier.
>
> Yeah, this is definitely the right solution here. I'm surprised the
> hard-coded values didn't cause problems before now. ;)
>
> The patch looks good to me, but a few small comments:
>
>> +test_object_file_size () {
>> +	oid=$(git rev-parse "$1")
>> +	path=".git/objects/$(test_oid_to_path $oid)"
>> +	test_file_size "$path"
>> +}
>
> Here we're assuming the objects are loose. I think that's probably OK
> (and certainly the test will notice if that changes).
>
> We're covering the formatting code paths along with the underlying
> implementation that fills in object_info->disk_sizep for loose objects.
> Which I think is plenty for this particular script, which is about
> for-each-ref.
>
> It would be nice to have coverage of the packed_object_info() code path,
> though. Back when it was added in a4ac106178 (cat-file: add
> %(objectsize:disk) format atom, 2013-07-10), I cowardly punted on this,
> writing:
>
>     This patch does not include any tests, as the exact numbers
>     returned are volatile and subject to zlib and packing
>     decisions. We cannot even reliably guarantee that the
>     on-disk size is smaller than the object content (though in
>     general this should be the case for non-trivial objects).
>
> I don't think it's that big a deal, but I guess we could do something
> like:
>
>   prev=
>   git show-index <$pack_idx |
>   sort -n |
>   grep -A1 $oid |
>   while read ofs oid csum
>   do
> 	test -n "$prev" && echo "$((ofs - prev))"
> 	prev=$ofs
>   done
>
> It feels a little redundant with what Git is doing under the hood, but
> at least is exercising the code (and we're using the idx directly, so
> we're confirming that the revindex is right).

A generic object size function based on both methods could live in the
test lib and be used for e.g. cat-file tests as well. Getting such a
function polished and library-worthy is probably more work than I
naively imagine, however -- due to our shunning of pipes alone.

> Anyway, that is all way beyond the scope of your patch, but I wonder if
> it's worth doing on top.
>
>> @@ -129,7 +129,7 @@ test_atom head push:strip=1 remotes/myfork/main
>>  test_atom head push:strip=-1 main
>>  test_atom head objecttype commit
>>  test_atom head objectsize $((131 + hexlen))
>> -test_atom head objectsize:disk $disklen
>> +test_atom head objectsize:disk $(test_object_file_size refs/heads/main)
>
> These test_object_file_size calls are happening outside of any
> test_expect_* block, so we'd miss failing exit codes (and also the
> helper is not &&-chained), and any stderr would leak to the output.
> That's probably OK in practice, though (if something goes wrong then the
> expected value output will be bogus and the test itself will fail).

Right. We could also set variables during setup, though, to put readers'
minds at rest.

René

^ permalink raw reply	[flat|nested] 17+ messages in thread
* [PATCH 2/1] test-lib-functions: add object size functions
  2023-12-12 22:54 ` René Scharfe
@ 2023-12-13 12:28   ` René Scharfe
  2023-12-14 20:59     ` Jeff King
  0 siblings, 1 reply; 17+ messages in thread
From: René Scharfe @ 2023-12-13 12:28 UTC (permalink / raw)
  To: Jeff King, git; +Cc: Ondrej Pohorelsky, brian m . carlson, Junio C Hamano

Add test_object_size and its helpers test_loose_object_size and
test_packed_object_size, which allow determining the size of a Git
object using only the low-level Git commands rev-parse and show-index.

Use it in t6300 to replace the bare-bones function test_object_file_size
as a motivating example. There it provides the expected output of the
high-level Git command for-each-ref.

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
So how about this? I'm a bit nervous about all the rules about output
descriptors and error propagation and whatnot in the test library, but
this implementation seems simple enough and might be useful in more than
one test. No idea how to add support for alternate object directories,
but I doubt we'll ever need it.
---
 t/t6300-for-each-ref.sh | 16 ++++++--------
 t/test-lib-functions.sh | 47 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+), 9 deletions(-)

diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
index 843a7fe143..4687660f38 100755
--- a/t/t6300-for-each-ref.sh
+++ b/t/t6300-for-each-ref.sh
@@ -20,12 +20,6 @@ setdate_and_increment () {
 	export GIT_COMMITTER_DATE GIT_AUTHOR_DATE
 }
 
-test_object_file_size () {
-	oid=$(git rev-parse "$1")
-	path=".git/objects/$(test_oid_to_path $oid)"
-	test_file_size "$path"
-}
-
 test_expect_success setup '
 	# setup .mailmap
 	cat >.mailmap <<-EOF &&
@@ -40,7 +34,11 @@ test_expect_success setup '
 	git branch -M main &&
 	setdate_and_increment &&
 	git tag -a -m "Tagging at $datestamp" testtag &&
+	testtag_oid=$(git rev-parse refs/tags/testtag) &&
+	testtag_disksize=$(test_object_size $testtag_oid) &&
 	git update-ref refs/remotes/origin/main main &&
+	commit_oid=$(git rev-parse refs/heads/main) &&
+	commit_disksize=$(test_object_size $commit_oid) &&
 	git remote add origin nowhere &&
 	git config branch.main.remote origin &&
 	git config branch.main.merge refs/heads/main &&
@@ -129,7 +127,7 @@ test_atom head push:strip=1 remotes/myfork/main
 test_atom head push:strip=-1 main
 test_atom head objecttype commit
 test_atom head objectsize $((131 + hexlen))
-test_atom head objectsize:disk $(test_object_file_size refs/heads/main)
+test_atom head objectsize:disk $commit_disksize
 test_atom head deltabase $ZERO_OID
 test_atom head objectname $(git rev-parse refs/heads/main)
 test_atom head objectname:short $(git rev-parse --short refs/heads/main)
@@ -203,8 +201,8 @@ test_atom tag upstream ''
 test_atom tag push ''
 test_atom tag objecttype tag
 test_atom tag objectsize $((114 + hexlen))
-test_atom tag objectsize:disk $(test_object_file_size refs/tags/testtag)
-test_atom tag '*objectsize:disk' $(test_object_file_size refs/heads/main)
+test_atom tag objectsize:disk $testtag_disksize
+test_atom tag '*objectsize:disk' $commit_disksize
 test_atom tag deltabase $ZERO_OID
 test_atom tag '*deltabase' $ZERO_OID
 test_atom tag objectname $(git rev-parse refs/tags/testtag)
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 9c3cf12b26..9b49645f77 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1733,6 +1733,53 @@ test_oid_to_path () {
 	echo "${1%$basename}/$basename"
 }
 
+test_loose_object_size () {
+	test "$#" -ne 1 && BUG "1 param"
+	local path=$(test_oid_to_path "$1")
+	test_file_size "$(git rev-parse --git-path "objects/$path")" 2>&4
+}
+
+test_packed_object_size () {
+	test "$#" -ne 2 && BUG "2 params"
+	local oid=$1 idx=$2 packsize rawsz end
+
+	packsize=$(test_file_size "${idx%.idx}.pack")
+	rawsz=$(test_oid rawsz)
+	end=$(($packsize - $rawsz))
+
+	git show-index <"$idx" |
+	awk -v oid="$oid" -v end="$end" '
+		$2 == oid {start = $1}
+		{offsets[$1] = 1}
+		END {
+			if (!start || start >= end)
+				exit 1
+			for (o in offsets)
+				if (start < o && o < end)
+					end = o
+			print end - start
+		}
+	' && return 0
+
+	echo >&4 "error: '$oid' not found in '$idx'"
+	return 1
+}
+
+test_object_size () {
+	test "$#" -ne 1 && BUG "1 param"
+	local oid=$1
+
+	test_loose_object_size "$oid" 4>/dev/null && return 0
+
+	for idx in "$(git rev-parse --git-path objects/pack)"/pack-*.idx
+	do
+		test_packed_object_size "$oid" "$idx" 4>/dev/null && return 0
+	done
+
+	echo >&4 "error: '$oid' not found"
+	return 1
+}
+
 # Parse oids from git ls-files --staged output
 test_parse_ls_files_stage_oids () {
 	awk '{print $2}' -
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread
* Re: [PATCH 2/1] test-lib-functions: add object size functions
  2023-12-13 12:28 ` [PATCH 2/1] test-lib-functions: add object size functions René Scharfe
@ 2023-12-14 20:59   ` Jeff King
  2023-12-19 16:42     ` René Scharfe
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2023-12-14 20:59 UTC (permalink / raw)
  To: René Scharfe
  Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano

On Wed, Dec 13, 2023 at 01:28:56PM +0100, René Scharfe wrote:

> Add test_object_size and its helpers test_loose_object_size and
> test_packed_object_size, which allow determining the size of a Git
> object using only the low-level Git commands rev-parse and show-index.
>
> Use it in t6300 to replace the bare-bones function test_object_file_size
> as a motivating example. There it provides the expected output of the
> high-level Git command for-each-ref.

This adds a packed-object function, but I doubt anybody actually calls
it. If we're going to do that, it's probably worth adding some tests for
"cat-file --batch-check" or similar.

At which point I wonder if rather than having a function for a single
object, we are better off just testing the result of:

  git cat-file --batch-all-objects --unordered --batch-check='%(objectsize:disk)'

against a single post-processed "show-index" invocation.

> So how about this? I'm a bit nervous about all the rules about output
> descriptors and error propagation and whatnot in the test library, but
> this implementation seems simple enough and might be useful in more than
> one test. No idea how to add support for alternate object directories,
> but I doubt we'll ever need it.

I'm not sure that we need to do anything special with output
redirection. Shouldn't these functions just send errors to stderr as
usual? If they are run inside a test_expect block, that goes to
descriptor 4 (which is either /dev/null or the original stderr,
depending on whether "-v" was used).

> +test_loose_object_size () {
> +	test "$#" -ne 1 && BUG "1 param"
> +	local path=$(test_oid_to_path "$1")
> +	test_file_size "$(git rev-parse --git-path "objects/$path")" 2>&4
> +}

OK. We lose the exit code from "rev-parse" but that is probably OK for
our purposes.

> +test_packed_object_size () {
> +	test "$#" -ne 2 && BUG "2 params"
> +	local oid=$1 idx=$2 packsize rawsz end
> +
> +	packsize=$(test_file_size "${idx%.idx}.pack")
> +	rawsz=$(test_oid rawsz)
> +	end=$(($packsize - $rawsz))

OK, this $end is the magic required for the final entry. Makes sense.

> +	git show-index <"$idx" |
> +	awk -v oid="$oid" -v end="$end" '
> +		$2 == oid {start = $1}
> +		{offsets[$1] = 1}
> +		END {
> +			if (!start || start >= end)
> +				exit 1
> +			for (o in offsets)
> +				if (start < o && o < end)
> +					end = o
> +			print end - start
> +		}
> +	' && return 0

I was confused at first, because I didn't see any sorting happening. But
if I understand correctly, you're just looking for the smallest "end"
that comes after the start of the object we're looking for. Which I
think works.

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH 2/1] test-lib-functions: add object size functions
  2023-12-14 20:59 ` Jeff King
@ 2023-12-19 16:42   ` René Scharfe
  2023-12-21  9:47     ` [PATCH] t1006: add tests for %(objectsize:disk) Jeff King
  0 siblings, 1 reply; 17+ messages in thread
From: René Scharfe @ 2023-12-19 16:42 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano

Am 14.12.23 um 21:59 schrieb Jeff King:
> On Wed, Dec 13, 2023 at 01:28:56PM +0100, René Scharfe wrote:
>
>> Add test_object_size and its helpers test_loose_object_size and
>> test_packed_object_size, which allow determining the size of a Git
>> object using only the low-level Git commands rev-parse and show-index.
>>
>> Use it in t6300 to replace the bare-bones function test_object_file_size
>> as a motivating example. There it provides the expected output of the
>> high-level Git command for-each-ref.
>
> This adds a packed-object function, but I doubt anybody actually calls
> it. If we're going to do that, it's probably worth adding some tests for
> "cat-file --batch-check" or similar.

Yes, and I was assuming that someone else would be eager to add such
tests. *ahem*

> At which point I wonder if rather than having a function for a single
> object, we are better off just testing the result of:
>
>   git cat-file --batch-all-objects --unordered --batch-check='%(objectsize:disk)'
>
> against a single post-processed "show-index" invocation.

Sure, we might want to optimize for bulk-processing and possibly end up
only checking the size of single objects in t6300, making new library
functions unnecessary.

When dumping size information of multiple objects, it's probably a good
idea to include "%(objectname)" as well in the format.

You'd need one show-index call for each .idx file. A simple test would
only have a single one; a library function might need to handle multiple
packs.

>> So how about this? I'm a bit nervous about all the rules about output
>> descriptors and error propagation and whatnot in the test library, but
>> this implementation seems simple enough and might be useful in more than
>> one test. No idea how to add support for alternate object directories,
>> but I doubt we'll ever need it.
>
> I'm not sure that we need to do anything special with output
> redirection. Shouldn't these functions just send errors to stderr as
> usual? If they are run inside a test_expect block, that goes to
> descriptor 4 (which is either /dev/null or the original stderr,
> depending on whether "-v" was used).

Good point. My bad excuse is that I copied the redirection to fd 4 from
test_grep.

>> +	git show-index <"$idx" |
>> +	awk -v oid="$oid" -v end="$end" '
>> +		$2 == oid {start = $1}
>> +		{offsets[$1] = 1}
>> +		END {
>> +			if (!start || start >= end)
>> +				exit 1
>> +			for (o in offsets)
>> +				if (start < o && o < end)
>> +					end = o
>> +			print end - start
>> +		}
>> +	' && return 0
>
> I was confused at first, because I didn't see any sorting happening. But
> if I understand correctly, you're just looking for the smallest "end"
> that comes after the start of the object we're looking for. Which I
> think works.

Yes, calculating the minimum offset suffices when handling a single
object -- no sorting required. For bulk mode we'd better sort, of
course:

	git show-index <"$idx" |
	sort -n |
	awk -v end="$end" '
		NR > 1 {print oid, $1 - start}
		{start = $1; oid = $2}
		END {print oid, end - start}
	'

No idea how to make such a thing robust against malformed or truncated
output from show-index, but perhaps that's not necessary, depending on
how the result is used.

René

^ permalink raw reply	[flat|nested] 17+ messages in thread
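The bulk variant can be sanity-checked on the same kind of fabricated show-index output, with a literal `end` standing in for `$(($packsize - $rawsz))` (all numbers below are invented):

```shell
# Fabricated "<offset> <oid> <crc>" lines; after sorting by offset,
# each object's size is the distance to the next offset, and the last
# object runs up to end (pack size minus the trailing hash).
cat >idx.raw <<\EOF
1500 cccc c3
12 aaaa c1
200 bbbb c2
EOF
sizes=$(
	sort -n idx.raw |
	awk -v end=2000 '
		NR > 1 {print oid, $1 - start}
		{start = $1; oid = $2}
		END {print oid, end - start}
	'
)
echo "$sizes"
```

This prints one `<oid> <size>` pair per object: `aaaa 188`, `bbbb 1300`, `cccc 500`.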
* [PATCH] t1006: add tests for %(objectsize:disk)
  2023-12-19 16:42 ` René Scharfe
@ 2023-12-21  9:47   ` Jeff King
  2023-12-21 12:19     ` René Scharfe
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2023-12-21 9:47 UTC (permalink / raw)
  To: René Scharfe
  Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano

On Tue, Dec 19, 2023 at 05:42:39PM +0100, René Scharfe wrote:

> > This adds a packed-object function, but I doubt anybody actually calls
> > it. If we're going to do that, it's probably worth adding some tests for
> > "cat-file --batch-check" or similar.
>
> Yes, and I was assuming that someone else would be eager to add such
> tests. *ahem*

:P OK, here it is. This can be its own topic, or go on top of the
rs/t6300-compressed-size-fix branch.

> > At which point I wonder if rather than having a function for a single
> > object, we are better off just testing the result of:
> >
> >   git cat-file --batch-all-objects --unordered --batch-check='%(objectsize:disk)'
> >
> > against a single post-processed "show-index" invocation.
>
> Sure, we might want to optimize for bulk-processing and possibly end up
> only checking the size of single objects in t6300, making new library
> functions unnecessary.

So yeah, I think the approach here makes library functions unnecessary
(and I see you already asked Junio to drop your patch 2).

> When dumping size information of multiple objects, it's probably a good
> idea to include "%(objectname)" as well in the format.

Yep, definitely.

-- >8 --
Subject: [PATCH] t1006: add tests for %(objectsize:disk)

Back when we added this placeholder in a4ac106178 (cat-file: add
%(objectsize:disk) format atom, 2013-07-10), there were no tests,
claiming "[...]the exact numbers returned are volatile and subject to
zlib and packing decisions".

But we can use a little shell hackery to get the expected numbers
ourselves. To a certain degree this is just re-implementing what Git is
doing under the hood, but it is still worth doing. It makes sure we
exercise the %(objectsize:disk) code at all, and having the two
implementations agree gives us more confidence.

Note that our shell code assumes that no object appears twice (either in
two packs, or as both loose and packed), as then the results really are
undefined. That's OK for our purposes, and the test will notice if that
assumption is violated (the shell version would produce duplicate lines
that Git's output does not have).

Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Jeff King <peff@peff.net>
---
I stole a bit of your awk. You can tell because I'd have written it in
perl. ;)

 t/t1006-cat-file.sh | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 271c5e4fd3..21915be308 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -1100,6 +1100,40 @@ test_expect_success 'cat-file --batch="batman" with --batch-all-objects will wor
 	cmp expect actual
 '
 
+test_expect_success 'cat-file %(objectsize:disk) with --batch-all-objects' '
+	# our state has both loose and packed objects,
+	# so find both for our expected output
+	{
+		find .git/objects/?? -type f |
+		awk -F/ "{ print \$0, \$3\$4 }" |
+		while read path oid
+		do
+			size=$(test_file_size "$path") &&
+			echo "$oid $size" ||
+			return 1
+		done &&
+		rawsz=$(test_oid rawsz) &&
+		find .git/objects/pack -name "*.idx" |
+		while read idx
+		do
+			git show-index <"$idx" >idx.raw &&
+			sort -n <idx.raw >idx.sorted &&
+			packsz=$(test_file_size "${idx%.idx}.pack") &&
+			end=$((packsz - rawsz)) &&
+			awk -v end="$end" "
+				NR > 1 { print oid, \$1 - start }
+				{ start = \$1; oid = \$2 }
+				END { print oid, end - start }
+			" idx.sorted ||
+			return 1
+		done
+	} >expect.raw &&
+	sort <expect.raw >expect &&
+	git cat-file --batch-all-objects \
+		--batch-check="%(objectname) %(objectsize:disk)" >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'set up replacement object' '
 	orig=$(git rev-parse HEAD) &&
 	git cat-file commit $orig >orig &&
-- 
2.43.0.430.gaf21263e5d

^ permalink raw reply related	[flat|nested] 17+ messages in thread
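The loose-object half of the patch leans on the layout of .git/objects: the first two hex digits of the object name are the fan-out directory, the rest is the file name. The `find | awk` pair just reverses that split; on one shortened, purely hypothetical path:

```shell
# Glue fields 3 and 4 of the /-separated path back into an object name,
# exactly as the awk program in the patch does.
line=$(printf '%s\n' .git/objects/ab/cdef0123 | awk -F/ '{ print $0, $3$4 }')
echo "$line"   # .git/objects/ab/cdef0123 abcdef0123
```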
* Re: [PATCH] t1006: add tests for %(objectsize:disk)
  2023-12-21  9:47 ` [PATCH] t1006: add tests for %(objectsize:disk) Jeff King
@ 2023-12-21 12:19   ` René Scharfe
  2023-12-21 21:30     ` Jeff King
  0 siblings, 1 reply; 17+ messages in thread
From: René Scharfe @ 2023-12-21 12:19 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano

Am 21.12.23 um 10:47 schrieb Jeff King:
> On Tue, Dec 19, 2023 at 05:42:39PM +0100, René Scharfe wrote:
>
>>> This adds a packed-object function, but I doubt anybody actually calls
>>> it. If we're going to do that, it's probably worth adding some tests for
>>> "cat-file --batch-check" or similar.
>>
>> Yes, and I was assuming that someone else would be eager to add such
>> tests. *ahem*
>
> :P OK, here it is. This can be its own topic, or go on top of the
> rs/t6300-compressed-size-fix branch.

Great, thank you!

> -- >8 --
> Subject: [PATCH] t1006: add tests for %(objectsize:disk)
>
> Back when we added this placeholder in a4ac106178 (cat-file: add
> %(objectsize:disk) format atom, 2013-07-10), there were no tests,
> claiming "[...]the exact numbers returned are volatile and subject to
> zlib and packing decisions".
>
> But we can use a little shell hackery to get the expected numbers
> ourselves. To a certain degree this is just re-implementing what Git is
> doing under the hood, but it is still worth doing. It makes sure we
> exercise the %(objectsize:disk) code at all, and having the two
> implementations agree gives us more confidence.
>
> Note that our shell code assumes that no object appears twice (either in
> two packs, or as both loose and packed), as then the results really are
> undefined. That's OK for our purposes, and the test will notice if that
> assumption is violated (the shell version would produce duplicate lines
> that Git's output does not have).
>
> Helped-by: René Scharfe <l.s.r@web.de>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> I stole a bit of your awk. You can tell because I'd have written it in
> perl. ;)

I think we can do it even in shell, especially if...

>
>  t/t1006-cat-file.sh | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
>
> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> index 271c5e4fd3..21915be308 100755
> --- a/t/t1006-cat-file.sh
> +++ b/t/t1006-cat-file.sh
> @@ -1100,6 +1100,40 @@ test_expect_success 'cat-file --batch="batman" with --batch-all-objects will wor
>  	cmp expect actual
>  '
>
> +test_expect_success 'cat-file %(objectsize:disk) with --batch-all-objects' '
> +	# our state has both loose and packed objects,
> +	# so find both for our expected output
> +	{
> +		find .git/objects/?? -type f |
> +		awk -F/ "{ print \$0, \$3\$4 }" |
> +		while read path oid
> +		do
> +			size=$(test_file_size "$path") &&
> +			echo "$oid $size" ||
> +			return 1
> +		done &&
> +		rawsz=$(test_oid rawsz) &&
> +		find .git/objects/pack -name "*.idx" |
> +		while read idx
> +		do
> +			git show-index <"$idx" >idx.raw &&
> +			sort -n <idx.raw >idx.sorted &&
> +			packsz=$(test_file_size "${idx%.idx}.pack") &&
> +			end=$((packsz - rawsz)) &&
> +			awk -v end="$end" "
> +				NR > 1 { print oid, \$1 - start }
> +				{ start = \$1; oid = \$2 }
> +				END { print oid, end - start }
> +			" idx.sorted ||

... we stop slicing the data against the grain. Let's reverse the order
(sort -nr), then we don't need to carry the oid forward:

			sort -nr <idx.raw >idx.sorted &&
			packsz=$(test_file_size "${idx%.idx}.pack") &&
			end=$((packsz - rawsz)) &&
			awk -v end="$end" "
				{ print \$2, end - \$1; end = \$1 }
			" idx.sorted ||

And at that point it should be easy to use a shell loop instead of awk:

			while read start oid rest
			do
				size=$((end - start)) &&
				end=$start &&
				echo "$oid $size" ||
				return 1
			done <idx.sorted

> +			return 1
> +		done
> +	} >expect.raw &&
> +	sort <expect.raw >expect &&

The reversal above becomes irrelevant with that line, so the result in
expect stays the same.

Should we deduplicate here, like cat-file does (i.e. use "sort -u")?
Having the same object in multiple places for whatever reason would not
be a cause for reporting an error in this test, I would think.

> +	git cat-file --batch-all-objects \
> +		--batch-check="%(objectname) %(objectsize:disk)" >actual &&
> +	test_cmp expect actual
> +'
> +
>  test_expect_success 'set up replacement object' '
>  	orig=$(git rev-parse HEAD) &&
>  	git cat-file commit $orig >orig &&

One more thing: We can do the work of the first awk invocation in the
already existing loop as well:

> +		find .git/objects/?? -type f |
> +		awk -F/ "{ print \$0, \$3\$4 }" |
> +		while read path oid
> +		do
> +			size=$(test_file_size "$path") &&
> +			echo "$oid $size" ||
> +			return 1
> +		done &&

... but the substitutions are a bit awkward:

		find .git/objects/?? -type f |
		while read path
		do
			basename=${path##*/} &&
			dirname=${path%/$basename} &&
			oid="${dirname#.git/objects/}${basename}" &&
			size=$(test_file_size "$path") &&
			echo "$oid $size" ||
			return 1
		done &&

The avoided awk invocation might be worth the trouble, though.

René

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH] t1006: add tests for %(objectsize:disk) 2023-12-21 12:19 ` René Scharfe @ 2023-12-21 21:30 ` Jeff King 2023-12-21 23:13 ` René Scharfe 0 siblings, 1 reply; 17+ messages in thread From: Jeff King @ 2023-12-21 21:30 UTC (permalink / raw) To: René Scharfe Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano On Thu, Dec 21, 2023 at 01:19:53PM +0100, René Scharfe wrote: > I think we can do it even in shell, especially if... > [...] Yeah, your conversion looks accurate. I do wonder if it is worth golfing further, though. If it were a process invocation per object, I'd definitely say the efficiency gain is worth it. But dropping one process from the whole test isn't that exciting either way. > (sort -r), then we don't need to carry the oid forward: > > sort -nr <idx.raw >idx.sorted && > packsz=$(test_file_size "${idx%.idx}.pack") && > end=$((packsz - rawsz)) && > awk -v end="$end" " > { print \$2, end - \$1; end = \$1 } > " idx.sorted || > > And at that point it should be easy to use a shell loop instead of awk: > > while read start oid rest > do > size=$((end - start)) && > end=$start && > echo "$oid $size" || > return 1 > done <idx.sorted The one thing I do like is that we don't have to escape anything inside an awk program that is forced to use double-quotes. ;) > Should we deduplicate here, like cat-file does (i.e. use "sort -u")? > Having the same object in multiple places for whatever reason would not > be a cause for reporting an error in this test, I would think. No, for the reasons I said in the commit message: if an object exists in multiple places the test is already potentially invalid, as Git does not promise which version it will use. So it might work racily, or it might work for now but be fragile. By not de-duplicating, we make sure the test's assumption holds. > One more thing: We can do the work of the first awk invocation in the > already existing loop as well: > [...] > ... 
but the substitutions are a bit awkward: > > find .git/objects/?? -type f | > while read path > do > basename=${path##*/} && > dirname=${path%/$basename} && > oid="${dirname#.git/objects/}${basename}" && > size=$(test_file_size "$path") && > echo "$oid $size" || > return 1 > done && > > The avoided awk invocation might be worth the trouble, though. Yeah, I briefly considered whether it would be possible in pure shell, but didn't get very far before assuming it was going to be ugly. Thank you for confirming. ;) Again, if we were doing one awk per object, I'd try to avoid it. But since we can cover all objects in a single pass, I think it's OK. -Peff ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] t1006: add tests for %(objectsize:disk) 2023-12-21 21:30 ` Jeff King @ 2023-12-21 23:13 ` René Scharfe 2023-12-23 10:09 ` [PATCH v2] " Jeff King 2023-12-23 10:18 ` [PATCH] " Jeff King 0 siblings, 2 replies; 17+ messages in thread From: René Scharfe @ 2023-12-21 23:13 UTC (permalink / raw) To: Jeff King; +Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano Am 21.12.23 um 22:30 schrieb Jeff King: > On Thu, Dec 21, 2023 at 01:19:53PM +0100, René Scharfe wrote: > >> I think we can do it even in shell, especially if... >> [...] > > Yeah, your conversion looks accurate. I do wonder if it is worth golfing > further, though. If it were a process invocation per object, I'd > definitely say the efficiency gain is worth it. But dropping one process > from the whole test isn't that exciting either way. Fair enough. > >> (sort -r), then we don't need to carry the oid forward: >> >> sort -nr <idx.raw >idx.sorted && >> packsz=$(test_file_size "${idx%.idx}.pack") && >> end=$((packsz - rawsz)) && >> awk -v end="$end" " >> { print \$2, end - \$1; end = \$1 } >> " idx.sorted || >> >> And at that point it should be easy to use a shell loop instead of awk: >> >> while read start oid rest >> do >> size=$((end - start)) && >> end=$start && >> echo "$oid $size" || >> return 1 >> done <idx.sorted > > The one thing I do like is that we don't have to escape anything inside > an awk program that is forced to use double-quotes. ;) For me it's processing the data in the "correct" order (descending, i.e. starting at the end, which we have to calculate first anyway based on the size). >> Should we deduplicate here, like cat-file does (i.e. use "sort -u")? >> Having the same object in multiple places for whatever reason would not >> be a cause for reporting an error in this test, I would think. 
> > No, for the reasons I said in the commit message: if an object exists in > multiple places the test is already potentially invalid, as Git does not > promise which version it will use. So it might work racily, or it might > work for now but be fragile. By not de-duplicating, we make sure the > test's assumption holds. Oh, skipped that paragraph. Still I don't see how a duplicate object would necessarily invalidate t1006. The comment for the test "cat-file --batch-all-objects shows all objects" a few lines above indicates that it's picky about the provenance of objects, but it uses a separate repository. I can't infer the same requirement for the root repo, but we already established that I can't read. Anyway, if someone finds a use for git repack without -d or git unpack-objects or whatever else causes duplicates in the root repository of t1006 then they can try to reverse your ban with concrete arguments. René ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v2] t1006: add tests for %(objectsize:disk) 2023-12-21 23:13 ` René Scharfe @ 2023-12-23 10:09 ` Jeff King 2023-12-24 9:30 ` René Scharfe 2023-12-23 10:18 ` [PATCH] " Jeff King 1 sibling, 1 reply; 17+ messages in thread From: Jeff King @ 2023-12-23 10:09 UTC (permalink / raw) To: René Scharfe Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano On Fri, Dec 22, 2023 at 12:13:10AM +0100, René Scharfe wrote: > >> while read start oid rest > >> do > >> size=$((end - start)) && > >> end=$start && > >> echo "$oid $size" || > >> return 1 > >> done <idx.sorted > > > > The one thing I do like is that we don't have to escape anything inside > > an awk program that is forced to use double-quotes. ;) > > For me it's processing the data in the "correct" order (descending, i.e. > starting at the end, which we have to calculate first anyway based on the > size). That was one thing that I thought made it more complicated. The obvious order to me is start-to-end in the pack. But I do agree that going in reverse order makes things much simpler, as we compute the size of each entry as we see it (and so there are fewer special cases). So I'm convinced that it's worth switching. Here's a v2 with your suggestion. -- >8 -- Subject: t1006: add tests for %(objectsize:disk) Back when we added this placeholder in a4ac106178 (cat-file: add %(objectsize:disk) format atom, 2013-07-10), there were no tests, claiming "[...]the exact numbers returned are volatile and subject to zlib and packing decisions". But we can use a little shell hackery to get the expected numbers ourselves. To a certain degree this is just re-implementing what Git is doing under the hood, but it is still worth doing. It makes sure we exercise the %(objectsize:disk) code at all, and having the two implementations agree gives us more confidence. Note that our shell code assumes that no object appears twice (either in two packs, or as both loose and packed), as then the results really are undefined. 
That's OK for our purposes, and the test will notice if that assumption is violated (the shell version would produce duplicate lines that Git's output does not have). Helped-by: René Scharfe <l.s.r@web.de> Signed-off-by: Jeff King <peff@peff.net> --- t/t1006-cat-file.sh | 36 ++++++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh index 271c5e4fd3..e0c6482797 100755 --- a/t/t1006-cat-file.sh +++ b/t/t1006-cat-file.sh @@ -1100,6 +1100,42 @@ test_expect_success 'cat-file --batch="batman" with --batch-all-objects will wor cmp expect actual ' +test_expect_success 'cat-file %(objectsize:disk) with --batch-all-objects' ' + # our state has both loose and packed objects, + # so find both for our expected output + { + find .git/objects/?? -type f | + awk -F/ "{ print \$0, \$3\$4 }" | + while read path oid + do + size=$(test_file_size "$path") && + echo "$oid $size" || + return 1 + done && + rawsz=$(test_oid rawsz) && + find .git/objects/pack -name "*.idx" | + while read idx + do + git show-index <"$idx" >idx.raw && + sort -nr <idx.raw >idx.sorted && + packsz=$(test_file_size "${idx%.idx}.pack") && + end=$((packsz - rawsz)) && + while read start oid rest + do + size=$((end - start)) && + end=$start && + echo "$oid $size" || + return 1 + done <idx.sorted || + return 1 + done + } >expect.raw && + sort <expect.raw >expect && + git cat-file --batch-all-objects \ + --batch-check="%(objectname) %(objectsize:disk)" >actual && + test_cmp expect actual +' + test_expect_success 'set up replacement object' ' orig=$(git rev-parse HEAD) && git cat-file commit $orig >orig && -- 2.43.0.448.g93112243fb ^ permalink raw reply related [flat|nested] 17+ messages in thread
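The reverse-order size computation in the v2 loop can be tried in isolation. The offsets, object names, and payload end of 100 below are invented for illustration; walking the list in descending offset order yields each entry's size as end-minus-start with no lookahead needed:

```shell
# Same made-up records as a show-index dump would give, sorted descending.
end=100
printf '%s\n' "12 oid1" "40 oid2" "85 oid3" |
sort -nr |
while read start oid rest
do
	size=$((end - start)) &&   # this entry ends where the previous began
	end=$start &&
	echo "$oid $size"
done
# -> oid3 15, oid2 45, oid1 28
```

The output order is reversed relative to the awk version, which is why the later `sort <expect.raw >expect` makes the difference irrelevant.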
* Re: [PATCH v2] t1006: add tests for %(objectsize:disk) 2023-12-23 10:09 ` [PATCH v2] " Jeff King @ 2023-12-24 9:30 ` René Scharfe 0 siblings, 0 replies; 17+ messages in thread From: René Scharfe @ 2023-12-24 9:30 UTC (permalink / raw) To: Jeff King; +Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano Am 23.12.23 um 11:09 schrieb Jeff King: > > --- > t/t1006-cat-file.sh | 36 ++++++++++++++++++++++++++++++++++++ > 1 file changed, 36 insertions(+) > > diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh > index 271c5e4fd3..e0c6482797 100755 > --- a/t/t1006-cat-file.sh > +++ b/t/t1006-cat-file.sh > @@ -1100,6 +1100,42 @@ test_expect_success 'cat-file --batch="batman" with --batch-all-objects will wor > cmp expect actual > ' > > +test_expect_success 'cat-file %(objectsize:disk) with --batch-all-objects' ' > + # our state has both loose and packed objects, > + # so find both for our expected output > + { > + find .git/objects/?? -type f | > + awk -F/ "{ print \$0, \$3\$4 }" | > + while read path oid > + do > + size=$(test_file_size "$path") && > + echo "$oid $size" || > + return 1 > + done && > + rawsz=$(test_oid rawsz) && > + find .git/objects/pack -name "*.idx" | > + while read idx > + do > + git show-index <"$idx" >idx.raw && > + sort -nr <idx.raw >idx.sorted && > + packsz=$(test_file_size "${idx%.idx}.pack") && > + end=$((packsz - rawsz)) && > + while read start oid rest > + do > + size=$((end - start)) && > + end=$start && > + echo "$oid $size" || > + return 1 > + done <idx.sorted || > + return 1 > + done > + } >expect.raw && > + sort <expect.raw >expect && > + git cat-file --batch-all-objects \ > + --batch-check="%(objectname) %(objectsize:disk)" >actual && > + test_cmp expect actual > +' > + > test_expect_success 'set up replacement object' ' > orig=$(git rev-parse HEAD) && > git cat-file commit $orig >orig && Looks good to me. René ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] t1006: add tests for %(objectsize:disk) 2023-12-21 23:13 ` René Scharfe 2023-12-23 10:09 ` [PATCH v2] " Jeff King @ 2023-12-23 10:18 ` Jeff King 2023-12-24 9:30 ` René Scharfe 1 sibling, 1 reply; 17+ messages in thread From: Jeff King @ 2023-12-23 10:18 UTC (permalink / raw) To: René Scharfe Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano On Fri, Dec 22, 2023 at 12:13:10AM +0100, René Scharfe wrote: > >> Should we deduplicate here, like cat-file does (i.e. use "sort -u")? > >> Having the same object in multiple places for whatever reason would not > >> be a cause for reporting an error in this test, I would think. > > > > No, for the reasons I said in the commit message: if an object exists in > > multiple places the test is already potentially invalid, as Git does not > > promise which version it will use. So it might work racily, or it might > > work for now but be fragile. By not de-duplicating, we make sure the > > test's assumption holds. > > Oh, skipped that paragraph. Still I don't see how a duplicate object > would necessarily invalidate t1006. The comment for the test "cat-file > --batch-all-objects shows all objects" a few lines above indicates that > it's picky about the provenance of objects, but it uses a separate > repository. I can't infer the same requirement for the root repo, but > we already established that I can't read. The cat-file documentation explicitly calls this situation out: Note also that multiple copies of an object may be present in the object database; in this case, it is undefined which copy’s size or delta base will be reported. So if t1006 were to grow such a duplicate object, what will happen? If we de-dup in the new test, then we might end up mentioning the same copy (and the test passes), or we might not (and the test fails). But much worse, the results might be racy (depending on how cat-file happens to decide which one to use). 
By no de-duping, then the test will reliably fail and the author can decide how to handle it then. IOW it is about failing immediately and predictably rather than letting a future change to sneak a race or other accident-waiting-to-happen into t1006. > Anyway, if someone finds a use for git repack without -d or > git unpack-objects or whatever else causes duplicates in the root > repository of t1006 then they can try to reverse your ban with concrete > arguments. In the real world, the most common way to get a duplicate is to fetch or push into a repository, such that: 1. There are enough objects to retain the pack (100 by default) 2. There's a thin delta in the on-the-wire pack (i.e., a delta against a base that the sender knows the receiver has, but which isn't itself sent). Then "index-pack --fix-thin" will complete the on-disk pack by storing a copy of the base object in it. And now we have it in two packs (and if it's a delta or loose in the original, the size will be different). -Peff ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] t1006: add tests for %(objectsize:disk) 2023-12-23 10:18 ` [PATCH] " Jeff King @ 2023-12-24 9:30 ` René Scharfe 0 siblings, 0 replies; 17+ messages in thread From: René Scharfe @ 2023-12-24 9:30 UTC (permalink / raw) To: Jeff King; +Cc: git, Ondrej Pohorelsky, brian m . carlson, Junio C Hamano Am 23.12.23 um 11:18 schrieb Jeff King: > On Fri, Dec 22, 2023 at 12:13:10AM +0100, René Scharfe wrote: > >>>> Should we deduplicate here, like cat-file does (i.e. use "sort -u")? >>>> Having the same object in multiple places for whatever reason would not >>>> be a cause for reporting an error in this test, I would think. >>> >>> No, for the reasons I said in the commit message: if an object exists in >>> multiple places the test is already potentially invalid, as Git does not >>> promise which version it will use. So it might work racily, or it might >>> work for now but be fragile. By not de-duplicating, we make sure the >>> test's assumption holds. >> >> Oh, skipped that paragraph. Still I don't see how a duplicate object >> would necessarily invalidate t1006. The comment for the test "cat-file >> --batch-all-objects shows all objects" a few lines above indicates that >> it's picky about the provenance of objects, but it uses a separate >> repository. I can't infer the same requirement for the root repo, but >> we already established that I can't read. > > The cat-file documentation explicitly calls this situation out: > > Note also that multiple copies of an object may be present in the > object database; in this case, it is undefined which copy’s size or > delta base will be reported. > > So if t1006 were to grow such a duplicate object, what will happen? If > we de-dup in the new test, then we might end up mentioning the same copy > (and the test passes), or we might not (and the test fails). But much > worse, the results might be racy (depending on how cat-file happens to > decide which one to use). 
By no de-duping, then the test will reliably > fail and the author can decide how to handle it then. > > IOW it is about failing immediately and predictably rather than letting > a future change to sneak a race or other accident-waiting-to-happen into > t1006. > >> Anyway, if someone finds a use for git repack without -d or >> git unpack-objects or whatever else causes duplicates in the root >> repository of t1006 then they can try to reverse your ban with concrete >> arguments. > > In the real world, the most common way to get a duplicate is to fetch or > push into a repository, such that: > > 1. There are enough objects to retain the pack (100 by default) > > 2. There's a thin delta in the on-the-wire pack (i.e., a delta against > a base that the sender knows the receiver has, but which isn't > itself sent). > > Then "index-pack --fix-thin" will complete the on-disk pack by storing a > copy of the base object in it. And now we have it in two packs (and if > it's a delta or loose in the original, the size will be different). I think I get it now. The size possibly being different is crucial. cat-file deduplicates based on object ID alone. sort -u in t1006 would deduplicate based on object ID and size, meaning that it would only remove duplicates of the same size. Emulating the deduplication of cat-file is also possible, but would introduce the race you mentioned. However, even removing only same-size duplicates is unreliable because there is no guarantee that the same object has the same size in different packs. Adding a new object that is a better delta base would change the size. So, deduplicating based on object ID and size is sound for any particular run, but sizes are not stable and thus we need to know if the tests do something that adds duplicates of any size. René ^ permalink raw reply [flat|nested] 17+ messages in thread
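The distinction drawn here is easy to demonstrate with sort itself: `sort -u` deduplicates whole lines, so two records for the same oid survive whenever their sizes differ, whereas restricting the comparison key emulates cat-file's oid-only dedup (and arbitrarily picks one size). The records are invented:

```shell
# Whole-line dedup: only the byte-identical duplicate is dropped,
# "oid1 137" and "oid1 138" both remain -- the test would still fail.
printf '%s\n' "oid1 137" "oid1 137" "oid1 138" | sort -u

# Key-restricted dedup on field 1: one line per oid survives, but
# which size it carries is an arbitrary pick -- the race described above.
printf '%s\n' "oid1 137" "oid1 137" "oid1 138" | sort -u -k1,1
```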
* Re: Test breakage with zlib-ng 2023-12-12 17:04 ` René Scharfe 2023-12-12 20:01 ` Jeff King @ 2023-12-12 22:18 ` brian m. carlson 2023-12-12 22:30 ` Junio C Hamano 2 siblings, 0 replies; 17+ messages in thread From: brian m. carlson @ 2023-12-12 22:18 UTC (permalink / raw) To: René Scharfe; +Cc: Ondrej Pohorelsky, git, Junio C Hamano [-- Attachment #1: Type: text/plain, Size: 984 bytes --] On 2023-12-12 at 17:04:55, René Scharfe wrote: > --- >8 --- > Subject: [PATCH] t6300: avoid hard-coding object sizes > > f4ee22b526 (ref-filter: add tests for objectsize:disk, 2018-12-24) > hard-coded the expected object sizes. Coincidentally the size of commit > and tag is the same with zlib at the default compression level. > > 1f5f8f3e85 (t6300: abstract away SHA-1-specific constants, 2020-02-22) > encoded the sizes as a single value, which coincidentally also works > with sha256. > > Different compression libraries like zlib-ng may arrive at different > values. Get them from the file system instead of hard-coding them to > make switching the compression library (or changing the compression > level) easier. This definitely seems like the right thing to do. I was a bit lazy at the time and probably could have improved it then, but it's at least good that we're doing it now. -- brian m. carlson (he/him or they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Test breakage with zlib-ng 2023-12-12 17:04 ` René Scharfe 2023-12-12 20:01 ` Jeff King 2023-12-12 22:18 ` Test breakage with zlib-ng brian m. carlson @ 2023-12-12 22:30 ` Junio C Hamano 2 siblings, 0 replies; 17+ messages in thread From: Junio C Hamano @ 2023-12-12 22:30 UTC (permalink / raw) To: René Scharfe; +Cc: Ondrej Pohorelsky, git, brian m . carlson René Scharfe <l.s.r@web.de> writes: > Or we could get the sizes of the objects by checking their files, > which would not require hard-coding anymore. Patch below. That was my first reaction to seeing the original report. It is a bit surprising that the necessary fix is so small and makes me wonder why we initially did the hardcoded values, which feels more work to initially write the test vector. Looking good. Thanks. > --- >8 --- > Subject: [PATCH] t6300: avoid hard-coding object sizes > > f4ee22b526 (ref-filter: add tests for objectsize:disk, 2018-12-24) > hard-coded the expected object sizes. Coincidentally the size of commit > and tag is the same with zlib at the default compression level. > > 1f5f8f3e85 (t6300: abstract away SHA-1-specific constants, 2020-02-22) > encoded the sizes as a single value, which coincidentally also works > with sha256. > > Different compression libraries like zlib-ng may arrive at different > values. Get them from the file system instead of hard-coding them to > make switching the compression library (or changing the compression > level) easier. 
> > Reported-by: Ondrej Pohorelsky <opohorel@redhat.com> > Signed-off-by: René Scharfe <l.s.r@web.de> > --- > t/t6300-for-each-ref.sh | 18 +++++++++--------- > 1 file changed, 9 insertions(+), 9 deletions(-) > > diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh > index 54e2281259..843a7fe143 100755 > --- a/t/t6300-for-each-ref.sh > +++ b/t/t6300-for-each-ref.sh > @@ -20,12 +20,13 @@ setdate_and_increment () { > export GIT_COMMITTER_DATE GIT_AUTHOR_DATE > } > > -test_expect_success setup ' > - test_oid_cache <<-EOF && > - disklen sha1:138 > - disklen sha256:154 > - EOF > +test_object_file_size () { > + oid=$(git rev-parse "$1") > + path=".git/objects/$(test_oid_to_path $oid)" > + test_file_size "$path" > +} > > +test_expect_success setup ' > # setup .mailmap > cat >.mailmap <<-EOF && > A Thor <athor@example.com> A U Thor <author@example.com> > @@ -94,7 +95,6 @@ test_atom () { > } > > hexlen=$(test_oid hexsz) > -disklen=$(test_oid disklen) > > test_atom head refname refs/heads/main > test_atom head refname: refs/heads/main > @@ -129,7 +129,7 @@ test_atom head push:strip=1 remotes/myfork/main > test_atom head push:strip=-1 main > test_atom head objecttype commit > test_atom head objectsize $((131 + hexlen)) > -test_atom head objectsize:disk $disklen > +test_atom head objectsize:disk $(test_object_file_size refs/heads/main) > test_atom head deltabase $ZERO_OID > test_atom head objectname $(git rev-parse refs/heads/main) > test_atom head objectname:short $(git rev-parse --short refs/heads/main) > @@ -203,8 +203,8 @@ test_atom tag upstream '' > test_atom tag push '' > test_atom tag objecttype tag > test_atom tag objectsize $((114 + hexlen)) > -test_atom tag objectsize:disk $disklen > -test_atom tag '*objectsize:disk' $disklen > +test_atom tag objectsize:disk $(test_object_file_size refs/tags/testtag) > +test_atom tag '*objectsize:disk' $(test_object_file_size refs/heads/main) > test_atom tag deltabase $ZERO_OID > test_atom tag '*deltabase' $ZERO_OID > 
test_atom tag objectname $(git rev-parse refs/tags/testtag) > -- > 2.43.0 ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2023-12-24 9:30 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-12-12 14:16 Test breakage with zlib-ng Ondrej Pohorelsky 2023-12-12 17:04 ` René Scharfe 2023-12-12 20:01 ` Jeff King 2023-12-12 22:54 ` René Scharfe 2023-12-13 12:28 ` [PATCH 2/1] test-lib-functions: add object size functions René Scharfe 2023-12-14 20:59 ` Jeff King 2023-12-19 16:42 ` René Scharfe 2023-12-21 9:47 ` [PATCH] t1006: add tests for %(objectsize:disk) Jeff King 2023-12-21 12:19 ` René Scharfe 2023-12-21 21:30 ` Jeff King 2023-12-21 23:13 ` René Scharfe 2023-12-23 10:09 ` [PATCH v2] " Jeff King 2023-12-24 9:30 ` René Scharfe 2023-12-23 10:18 ` [PATCH] " Jeff King 2023-12-24 9:30 ` René Scharfe 2023-12-12 22:18 ` Test breakage with zlib-ng brian m. carlson 2023-12-12 22:30 ` Junio C Hamano