Re: Resolving deltas dominates clone time - Ævar Arnfjörð Bjarmason

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Martin Fick <mfick@codeaurora.org>,
	Git Mailing List <git@vger.kernel.org>
Subject: Re: Resolving deltas dominates clone time
Date: Sat, 20 Apr 2019 09:59:12 +0200	[thread overview]
Message-ID: <874l6tayzz.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <20190420035825.GB3559@sigill.intra.peff.net>


On Sat, Apr 20 2019, Jeff King wrote:

> On Fri, Apr 19, 2019 at 03:47:22PM -0600, Martin Fick wrote:
>
>> I have been thinking about this problem, and I suspect that this compute time
>> is actually spent doing SHA1 calculations, is that possible? Some basic back
>> of the envelope math and scripting seems to show that the repo may actually
>> contain about 2TB of data if you add up the size of all the objects in the
>> repo. Some quick research on the net seems to indicate that we might be able
>> to expect something around 500MB/s throughput on computing SHA1s, does that
>> seem reasonable? If I really have 2TB of data, should it then take around
>> 66mins to get the SHA1s for all that data? Could my repo clone time really be
>> dominated by SHA1 math?
>
> That sounds about right, actually. 8GB to 2TB is a compression ratio of
> 250:1. That's bigger than I've seen, but I get 51:1 in the kernel.
>
> Try this (with a recent version of git; your v1.8.2.1 won't have
> --batch-all-objects):
>
>   # count the on-disk size of all objects
>   git cat-file --batch-all-objects --batch-check='%(objectsize) %(objectsize:disk)' |
>   perl -alne '
>     $repo += $F[0];
>     $disk += $F[1];
>     END { print "$repo / $disk = ", $repo/$disk }
>   '
>
> 250:1 isn't inconceivable if you have large blobs which have small
> changes to them (and at 8GB for 8 million objects, you probably do have
> some larger blobs, since the kernel is about 1/8th the size for the same
> number of objects).
>
> So yes, if you really do have to hash 2TB of data, that's going to take
> a while. "openssl speed" on my machine gives per-second speeds of:
>
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> sha1            135340.73k   337086.10k   677821.10k   909513.73k  1007528.62k  1016916.65k
>
> So it's faster on bigger chunks, but yeah 500-1000MB/s seems like about
> the best you're going to do. And...
>
>> I mention 1.8.2.1 because we have many old machines which need this. However,
>> I also tested this with git v2.18 and it actually is much slower even
>> (~140mins).
>
> I think v2.18 will have the collision-detecting sha1 on by default,
> which is slower. Building with OPENSSL_SHA1 should be the fastest (and
> are those numbers above). Git's internal (but not collision detecting)
> BLK_SHA1 is somewhere in the middle.
>
>> Any advice on how to speed up cloning this repo, or what to pursue more
>> in my investigation?
>
> If you don't mind losing the collision-detection, using openssl's sha1
> might help. The delta resolution should be threaded, too. So in _theory_
> you're using 66 minutes of CPU time, but that should only take 1-2
> minutes on your 56-core machine. I don't know at what point you'd run
> into lock contention, though. The locking there is quite coarse.

There's also my (been meaning to re-roll)
https://public-inbox.org/git/20181113201910.11518-1-avarab@gmail.com/
*that* part of the SHA-1 checking is part of what's going on here. It'll
help a *tiny* bit, but of course is part of the "trust remote" risk
management...

> We also hash non-deltas while we're receiving them over the network.
> That's accounted for in the "receiving pack" part of the progress meter.
> If the time looks to be going to "resolving deltas", then that should
> all be threaded.
>
> If you want to replay the slow part, it should just be index-pack. So
> something like (with $old as a fresh clone of the repo):
>
>   git init --bare new-repo.git
>   cd new-repo.git
>   perf record git index-pack -v --stdin <$old/.git/objects/pack/pack-*.pack
>   perf report
>
> should show you where the time is going (substitute perf with whatever
> profiling tool you like).
>
> As far as avoiding that work altogether, there aren't a lot of options.
> Git clients do not trust the server, so the server sends only the raw
> data, and the client is responsible for computing the object ids. The
> only exception is a local filesystem clone, which will blindly copy or
> hardlink the .pack and .idx files from the source.
>
> In theory there could be a protocol extension to let the client say "I
> trust you, please send me the matching .idx that goes with this pack,
> and I'll assume there was no bitrot nor trickery on your part". I
> don't recall anybody ever discussing such a patch in the past, but I
> think Microsoft's VFS for Git project that backs development on Windows
> might do similar trickery under the hood.

I started to write:

    I wonder if there's room for some tacit client/server cooperation
    without such a protocol change.

    E.g. the server sending over a pack constructed in such a way that
    everything required for a checkout is at the beginning of the
    data. Now we implicitly tend to do it mostly the other way around
    for delta optimization purposes.

    That would allow a smart client in a hurry to index-pack it as they
    go along, and as soon as they have enough to check out HEAD return
    to the client, and continue the rest in the background

But realized I was just starting to describe something like 'clone
--depth=1' followed by a 'fetch --unshallow' in the background, except
that would work better (if you did "just the tip" naïvely you'd get
'missing object' on e.g. 'git log', with that ad-hoc hack we'd need to
write out two packs etc...).

    $ rm -rf /tmp/git; time git clone --depth=1 https://chromium.googlesource.com/chromium/src /tmp/git; time git -C /tmp/git fetch --unshallow
    Cloning into '/tmp/git'...
    remote: Counting objects: 304839, done
    remote: Finding sources: 100% (304839/304839)
    remote: Total 304839 (delta 70483), reused 204837 (delta 70483)
    Receiving objects: 100% (304839/304839), 1.48 GiB | 19.87 MiB/s, done.
    Resolving deltas: 100% (70483/70483), done.
    Checking out files: 100% (302768/302768), done.

    real    2m10.223s
    user    1m2.434s
    sys     0m15.564s
    [not waiting for that second bit, but it'll take ages...]

I think just having a clone mode that did that for you might scratch a
lot of people's itch. I.e. "I want full history, but mainly want a
checkout right away, so background the full clone".

But at this point I'm just starting to describe some shoddy version of
Documentation/technical/partial-clone.txt :), OTOH there's no "narrow
clone and fleshen right away" option.

On protocol extensions: Just having a way to "wget" the corresponding
*.idx file from the server would be great, and reduce clone times by a
lot. There's the risk of trusting the server, but most people's use-case
is going to be pushing right back to the same server, which'll be doing
a full validation.

We could also defer that validation instead of skipping it. E.g. wget
*.{pack,idx} followed by a 'fsck' in the background. I've sometimes
wanted that anyway, i.e. "fsck --auto" similar to "gc --auto"
periodically to detect repository bitflips.

Or, do some "narrow" validation of such an *.idx file right
away. E.g. for all the trees/blobs required for the current checkout,
and background the rest.

next prev parent reply	other threads:[~2019-04-20  7:59 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-04-19 21:47 Resolving deltas dominates clone time Martin Fick
2019-04-20  3:58 ` Jeff King
2019-04-20  7:59   ` Ævar Arnfjörð Bjarmason [this message]
2019-04-22 15:57     ` Jeff King
2019-04-22 18:01       ` Ævar Arnfjörð Bjarmason
2019-04-22 18:43         ` Jeff King
2019-04-23  7:07           ` Ævar Arnfjörð Bjarmason
2019-04-22 20:21   ` Martin Fick
2019-04-22 20:56     ` Jeff King
2019-04-22 21:02       ` Jeff King
2019-04-22 21:19       ` [PATCH] p5302: create the repo in each index-pack test Jeff King
2019-04-23  1:09         ` Junio C Hamano
2019-04-23  2:07           ` Jeff King
2019-04-23  2:27             ` Junio C Hamano
2019-04-23  2:36               ` Jeff King
2019-04-23  2:40                 ` Junio C Hamano
2019-04-22 22:32       ` Resolving deltas dominates clone time Martin Fick
2019-04-23  1:55         ` Jeff King
2019-04-23  4:21           ` Jeff King
2019-04-23 10:08             ` Duy Nguyen
2019-04-23 20:09               ` Martin Fick
2019-04-30 18:02                 ` Jeff King
2019-04-30 22:08                   ` Martin Fick
2019-04-30 17:50               ` Jeff King
2019-04-30 18:48                 ` Ævar Arnfjörð Bjarmason
2019-04-30 20:33                   ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=874l6tayzz.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=mfick@codeaurora.org \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.