git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "René Scharfe" <l.s.r@web.de>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org, Ondrej Pohorelsky <opohorel@redhat.com>,
	"brian m . carlson" <sandals@crustytoothpaste.net>,
	Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH] t1006: add tests for %(objectsize:disk)
Date: Sun, 24 Dec 2023 10:30:17 +0100	[thread overview]
Message-ID: <290b20f6-ce3b-49b3-a61d-69277fd8c430@web.de> (raw)
In-Reply-To: <20231223101853.GC2016274@coredump.intra.peff.net>

Am 23.12.23 um 11:18 schrieb Jeff King:
> On Fri, Dec 22, 2023 at 12:13:10AM +0100, René Scharfe wrote:
>
>>>> Should we deduplicate here, like cat-file does (i.e. use "sort -u")?
>>>> Having the same object in multiple places for whatever reason would not
>>>> be a cause for reporting an error in this test, I would think.
>>>
>>> No, for the reasons I said in the commit message: if an object exists in
>>> multiple places the test is already potentially invalid, as Git does not
>>> promise which version it will use. So it might work racily, or it might
>>> work for now but be fragile. By not de-duplicating, we make sure the
>>> test's assumption holds.
>>
>> Oh, skipped that paragraph.  Still I don't see how a duplicate object
>> would necessarily invalidate t1006.  The comment for the test "cat-file
>> --batch-all-objects shows all objects" a few lines above indicates that
>> it's picky about the provenance of objects, but it uses a separate
>> repository.  I can't infer the same requirement for the root repo, but
>> we already established that I can't read.
>
> The cat-file documentation explicitly calls this situation out:
>
>   Note also that multiple copies of an object may be present in the
>   object database; in this case, it is undefined which copy’s size or
>   delta base will be reported.
>
> So if t1006 were to grow such a duplicate object, what will happen? If
> we de-dup in the new test, then we might end up mentioning the same copy
> (and the test passes), or we might not (and the test fails). But much
> worse, the results might be racy (depending on how cat-file happens to
> decide which one to use). By no de-duping, then the test will reliably
> fail and the author can decide how to handle it then.
>
> IOW it is about failing immediately and predictably rather than letting
> a future change to sneak a race or other accident-waiting-to-happen into
> t1006.
>
>> Anyway, if someone finds a use for git repack without -d or
>> git unpack-objects or whatever else causes duplicates in the root
>> repository of t1006 then they can try to reverse your ban with concrete
>> arguments.
>
> In the real world, the most common way to get a duplicate is to fetch or
> push into a repository, such that:
>
>   1. There are enough objects to retain the pack (100 by default)
>
>   2. There's a thin delta in the on-the-wire pack (i.e., a delta against
>      a base that the sender knows the receiver has, but which isn't
>      itself sent).
>
> Then "index-pack --fix-thin" will complete the on-disk pack by storing a
> copy of the base object in it. And now we have it in two packs (and if
> it's a delta or loose in the original, the size will be different).

I think I get it now.  The size possibly being different is crucial.
cat-file deduplicates based on object ID alone.  sort -u in t1006 would
deduplicate based on object ID and size, meaning that it would only
remove duplicates of the same size.  Emulating the deduplication of
cat-file is also possible, but would introduce the race you mentioned.

However, even removing only same-size duplicates is unreliable because
there is no guarantee that the same object has the same size in
different packs.  Adding a new object that is a better delta base would
change the size.

So, deduplicating based on object ID and size is sound for any
particular run, but sizes are not stable and thus we need to know if
the tests do something that adds duplicates of any size.

René

  reply	other threads:[~2023-12-24  9:30 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-12 14:16 Test breakage with zlib-ng Ondrej Pohorelsky
2023-12-12 17:04 ` René Scharfe
2023-12-12 20:01   ` Jeff King
2023-12-12 22:54     ` René Scharfe
2023-12-13 12:28       ` [PATCH 2/1] test-lib-functions: add object size functions René Scharfe
2023-12-14 20:59         ` Jeff King
2023-12-19 16:42           ` René Scharfe
2023-12-21  9:47             ` [PATCH] t1006: add tests for %(objectsize:disk) Jeff King
2023-12-21 12:19               ` René Scharfe
2023-12-21 21:30                 ` Jeff King
2023-12-21 23:13                   ` René Scharfe
2023-12-23 10:09                     ` [PATCH v2] " Jeff King
2023-12-24  9:30                       ` René Scharfe
2023-12-23 10:18                     ` [PATCH] " Jeff King
2023-12-24  9:30                       ` René Scharfe [this message]
2023-12-12 22:18   ` Test breakage with zlib-ng brian m. carlson
2023-12-12 22:30   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=290b20f6-ce3b-49b3-a61d-69277fd8c430@web.de \
    --to=l.s.r@web.de \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=opohorel@redhat.com \
    --cc=peff@peff.net \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).