From: Taylor Blau <me@ttaylorr.com>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 1/7] oid_array: use size_t for count and allocation
Date: Tue, 14 Apr 2020 18:27:58 -0600
Message-ID: <20200415002758.GC7457@syl.local>
In-Reply-To: <20200330140309.GA2456038@coredump.intra.peff.net>

On Mon, Mar 30, 2020 at 10:03:09AM -0400, Jeff King wrote:
> The oid_array object uses an "int" to store the number of items and the
> allocated size. It's rather unlikely for somebody to have more than 2^31
> objects in a repository (the sha1's alone would be 40GB!), but if they
> do, we'd overflow our alloc variable.
>
> You can reproduce this case with something like:
>
>   git init repo
>   cd repo
>
>   # make a pack with 2^24 objects
>   perl -e '
>     my $nr = 2**24;
>
>     for (my $i = 0; $i < $nr; $i++) {
> 	    print "blob\n";
> 	    print "data 4\n";
> 	    print pack("N", $i);
>     }
>   ' | git fast-import
>
>   # now make 256 copies of it; most of these objects will be duplicates,
>   # but oid_array doesn't de-dup until all values are read and it can
>   # sort the result.
>   cd .git/objects/pack/
>   pack=$(echo *.pack)
>   idx=$(echo *.idx)
>   for i in $(seq 0 255); do
>     # no need to waste disk space
>     ln "$pack" "pack-extra-$i.pack"
>     ln "$idx" "pack-extra-$i.idx"
>   done
>
>   # and now force an oid_array to store all of it
>   git cat-file --batch-all-objects --batch-check
>
> which results in:
>
>   fatal: size_t overflow: 32 * 18446744071562067968
>
> So the good news is that st_mult() sees the problem (the large number is
> because our int wraps negative, and then that gets cast to a size_t),
> doing the job it was meant to: bailing in crazy situations rather than
> causing an undersized buffer.
>
> But we should avoid hitting this case at all, and instead limit
> ourselves based on what malloc() is willing to give us. We can easily do
> that by switching to size_t.
>
> The cat-file process above made it to ~120GB virtual set size before the
> integer overflow (our internal hash storage is 32 bytes now in
> preparation for sha256, so we'd expect ~128GB total needed, plus
> potentially more to copy from one realloc'd block to another). After
> this patch (and about 130GB of RAM+swap), it does eventually read in the
> whole set. No test for obvious reasons.

;). This patch looks good, and makes immediate sense given your
explanation.
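
For anyone following along, the arithmetic behind that "fatal" message
can be sketched roughly as below. This is only an illustration, not
git's code: checked_mult() and wrapped_count_as_size_t() are made-up
names standing in for the idea behind st_mult(), and the numbers assume
a 32-bit int and a 64-bit size_t.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for an overflow-checked multiply: returns 1 and
 * stores a * b in *out if the product fits in a size_t, otherwise
 * returns 0 and leaves *out alone.
 */
static int checked_mult(size_t a, size_t b, size_t *out)
{
	if (a && b > SIZE_MAX / a)
		return 0;
	*out = a * b;
	return 1;
}

/*
 * What a signed count looks like after it has wrapped to INT_MIN and is
 * then converted to size_t. With 32-bit int and 64-bit size_t this is
 * 2^64 - 2^31 = 18446744071562067968, the value in the error above.
 */
static size_t wrapped_count_as_size_t(void)
{
	int alloc = INT_MIN;

	return (size_t)alloc;
}
```

Passing 32 (the per-entry hash size mentioned above) and the wrapped
count to checked_mult() reports an overflow instead of returning a
small, wrong byte count, which is the "bail rather than undersize the
buffer" behavior described in the commit message.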

> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  sha1-array.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/sha1-array.h b/sha1-array.h
> index dc1bca9c9a..c5e4b9324f 100644
> --- a/sha1-array.h
> +++ b/sha1-array.h
> @@ -49,8 +49,8 @@
>   */
>  struct oid_array {
>  	struct object_id *oid;
> -	int nr;
> -	int alloc;
> +	size_t nr;
> +	size_t alloc;
>  	int sorted;
>  };
>
> --
> 2.26.0.597.g7e08ed78ff
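
As a rough sketch of why size_t is the natural type here (again not
git's implementation; toy_id, toy_array and toy_array_append() are
hypothetical names, and the growth rule is just an assumption chosen
for the example), a growable array whose count and allocation are
size_t is limited only by what realloc() will hand back:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 32-byte entry, standing in for a hash value */
struct toy_id {
	unsigned char hash[32];
};

/* Toy array with size_t bookkeeping, mirroring the shape of the patch */
struct toy_array {
	struct toy_id *id;
	size_t nr;
	size_t alloc;
};

/* Append one entry; returns 0 on success, -1 if the array cannot grow */
static int toy_array_append(struct toy_array *a, const struct toy_id *id)
{
	if (a->nr == a->alloc) {
		size_t new_alloc;
		struct toy_id *p;

		/* conservative bound so the byte count below cannot wrap */
		if (a->alloc > SIZE_MAX / sizeof(*a->id) / 2)
			return -1;
		new_alloc = a->alloc + a->alloc / 2 + 16;
		p = realloc(a->id, new_alloc * sizeof(*p));
		if (!p)
			return -1;
		a->id = p;
		a->alloc = new_alloc;
	}
	a->id[a->nr++] = *id;
	return 0;
}
```

With int fields the same loop would hit signed overflow long before the
allocator refused; with size_t the only failure modes left are the
explicit bound check and realloc() returning NULL.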

Thanks,
Taylor
