From: Victoria Dye <vdye@github.com>
To: Taylor Blau <me@ttaylorr.com>, git@vger.kernel.org
Cc: gitster@pobox.com
Subject: Re: [PATCH 2/3] t7703: demonstrate object corruption with pack.packSizeLimit
Date: Fri, 20 May 2022 12:42:58 -0700 [thread overview]
Message-ID: <a060d672-df82-95a3-072a-7ab7b0566556@github.com> (raw)
In-Reply-To: <08da02fa74c211ae1019cb0a9f4e30cc239e1ab9.1653073280.git.me@ttaylorr.com>
Taylor Blau wrote:
> When doing a `--geometric=<d>` repack, `git repack` determines a
> splitting point among packs ordered by their object count such that:
>
> - each pack above the split has at least `<d>` times as many objects
> as the next-largest pack by object count, and
> - the first pack above the split has at least `<d>` times as many
> object as the sum of all packs below the split line combined
>
> `git repack` then creates a pack containing all of the objects contained
> in packs below the split line by running `git pack-objects
> --stdin-packs` underneath. Once packs are moved into place, then any
> packs below the split line are removed, since their objects were just
> combined into a new pack.
>
> But `git repack` tries to be careful to avoid removing a pack that it
> just wrote, by checking:
>
> struct packed_git *p = geometry->pack[i];
> if (string_list_has_string(&names, hash_to_hex(p->hash)))
> continue;
>
> in the `delete_redundant` and `geometric` conditional towards the end of
> `cmd_repack`.
>
> But it's possible to trick `git repack` into not recognizing a pack that
> it just wrote when `names` is out-of-order (which violates
> `string_list_has_string()`'s assumption that the list is sorted and thus
> binary search-able).
>
> When this happens in just the right circumstances, it is possible to
> remove a pack that we just wrote, leading to object corruption.
>
> Luckily, this is quite difficult to provoke in practice (for a couple of
> reasons):
>
> - we ordinarily write just one pack, so `names` usually contains just
> one entry, and is thus sorted
> - when we do write more than one pack (e.g., due to `--max-pack-size`)
> we have to: (a) write a pack identical to one that already
> exists, (b) have that pack be below the split line, and (c) have
> the set of packs written by `pack-objects` occur in an order which
> tricks `string_list_has_string()`.
>
> Demonstrate the above scenario in a failing test, which causes `git
> repack --geometric` to write a pack which occurs below the split line,
> _and_ fail to recognize that it wrote that pack.
>
> The following patch will fix this bug.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
> t/t7703-repack-geometric.sh | 47 +++++++++++++++++++++++++++++++++++++
> 1 file changed, 47 insertions(+)
>
> diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
> index 91bb2b37a8..2cd1de7295 100755
> --- a/t/t7703-repack-geometric.sh
> +++ b/t/t7703-repack-geometric.sh
> @@ -7,6 +7,7 @@ test_description='git repack --geometric works correctly'
> GIT_TEST_MULTI_PACK_INDEX=0
>
> objdir=.git/objects
> +packdir=$objdir/pack
> midx=$objdir/pack/multi-pack-index
>
> test_expect_success '--geometric with no packs' '
> @@ -230,4 +231,50 @@ test_expect_success '--geometric chooses largest MIDX preferred pack' '
> )
> '
>
> +test_expect_failure '--geometric with pack.packSizeLimit' '
> + git init pack-rewrite &&
> + test_when_finished "rm -fr pack-rewrite" &&
> + (
> + cd pack-rewrite &&
> +
> + test-tool genrandom foo 1048576 >foo &&
> + test-tool genrandom bar 1048576 >bar &&
> +
I was a bit worried about this test being flaky in the future (relying on
particular pseudorandomly-generated file contents and the subsequent
ordering of hashes on the packs). But, since neither 'genrandom' nor the
pack hash generation seem likely to change (and I can't come up with an
alternative to this approach anyway), the test looks good as-is.
> + git add foo bar &&
> + test_tick &&
> + git commit -m base &&
> +
> + git rev-parse HEAD:foo HEAD:bar >p1.objects &&
> + git rev-parse HEAD HEAD^{tree} >p2.objects &&
> +
> + # These two packs each contain two objects, so the following
> + # `--geometric` repack will try to combine them.
> + p1="$(git pack-objects $packdir/pack <p1.objects)" &&
> + p2="$(git pack-objects $packdir/pack <p2.objects)" &&
> +
> + # Remove any loose objects in packs, since we do not want extra
> + # copies around (which would mask over potential object
> + # corruption issues).
> + git prune-packed &&
> +
> + # Both p1 and p2 will be rolled up, but pack-objects will write
> + # three packs:
> + #
> + # - one containing object "foo",
> + # - another containing object "bar",
> + # - a final pack containing the commit and tree objects
> + # (identical to p2 above)
> + git repack --geometric 2 -d --max-pack-size=1048576 &&
> +
> + # Ensure `repack` can detect that the third pack it wrote
> + # (containing just the tree and commit objects) was identical to
> + # one that was below the geometric split, so that we can save it
> + # from deletion.
> + #
> + # If `repack` fails to do that, we will incorrectly delete p2,
> + # causing object corruption.
> + git fsck
> + )
> +'
> +
> test_done
next prev parent reply other threads:[~2022-05-20 19:43 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-05-20 19:01 [PATCH 0/3] repack: handle --keep-pack, --max-pack-size for geometric repacks Taylor Blau
2022-05-20 19:01 ` [PATCH 1/3] repack: respect --keep-pack with geometric repack Taylor Blau
2022-05-20 19:01 ` [PATCH 2/3] t7703: demonstrate object corruption with pack.packSizeLimit Taylor Blau
2022-05-20 19:42 ` Victoria Dye [this message]
2022-05-20 23:22 ` Taylor Blau
2022-05-20 20:54 ` Junio C Hamano
2022-05-20 19:01 ` [PATCH 3/3] builtin/repack.c: ensure that `names` is sorted Taylor Blau
2022-05-20 19:46 ` [PATCH 0/3] repack: handle --keep-pack, --max-pack-size for geometric repacks Victoria Dye
2022-05-20 20:05 ` Derrick Stolee
2022-05-20 20:55 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=a060d672-df82-95a3-072a-7ab7b0566556@github.com \
--to=vdye@github.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=me@ttaylorr.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.