Re: Resolving deltas dominates clone time - Ævar Arnfjörð Bjarmason

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Martin Fick <mfick@codeaurora.org>,
	Git Mailing List <git@vger.kernel.org>, "Jansen\,
	Geert" <gerardu@amazon.com>,
	Jonathan Tan <jonathantanmy@google.com>
Subject: Re: Resolving deltas dominates clone time
Date: Mon, 22 Apr 2019 20:01:15 +0200	[thread overview]
Message-ID: <874l6pudg4.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <20190422155716.GA9680@sigill.intra.peff.net>


On Mon, Apr 22 2019, Jeff King wrote:

> On Sat, Apr 20, 2019 at 09:59:12AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > If you don't mind losing the collision-detection, using openssl's sha1
>> > might help. The delta resolution should be threaded, too. So in _theory_
>> > you're using 66 minutes of CPU time, but that should only take 1-2
>> > minutes on your 56-core machine. I don't know at what point you'd run
>> > into lock contention, though. The locking there is quite coarse.
>>
>> There's also my (been meaning to re-roll)
>> https://public-inbox.org/git/20181113201910.11518-1-avarab@gmail.com/
>> *that* part of the SHA-1 checking is part of what's going on here. It'll
>> help a *tiny* bit, but of course is part of the "trust remote" risk
>> management...
>
> I think we're talking about two different collision detections, and your
> patch wouldn't help at all here.
>
> Your patch is optionally removing the "woah, we got an object with a
> duplicate sha1, let's check that the bytes are the same in both copies"
> check. But Martin's problem is a clone, so we wouldn't have any existing
> objects to duplicate in the first place.

Right, but we do anyway, as reported by Geert at @amazon.com preceding
that patch of mine. But it is 99.99% irrelevant to *performance* in this
case after the loose object cache you added (but before that could make
all the difference depending on the FS).

I just mentioned it to plant a flag on another bit of the code where
index-pack in general has certain paranoias/validation the user might be
willing to optionally drop just at "clone" time.

> The problem in his case is literally just that the actual SHA-1 is
> expensive, and that can be helped by using the optimized openssl
> implementation rather than the sha1dc (which checks not collisions with
> objects we _have_, but evidence of somebody trying to exploit weaknesses
> in sha1).
>
> One thing we could do to make that easier is a run-time flag to switch
> between sha1dc and a faster implementation (either openssl or blk_sha1,
> depending on the build). That would let you flip the "trust" bit per
> operation, rather than having it baked into your build.

Yeah, this would be neat.

> (Note that the oft-discussed "use a faster sha1 implementation for
> checksums, but sha1dc for object hashing" idea would not help here,
> because these really are object hashes whose time is dominating. We have
> to checksum 8GB of raw packfile but 2TB of object data).
>
>> I started to write:
>>
>>     I wonder if there's room for some tacit client/server cooperation
>>     without such a protocol change.
>>
>>     E.g. the server sending over a pack constructed in such a way that
>>     everything required for a checkout is at the beginning of the
>>     data. Now we implicitly tend to do it mostly the other way around
>>     for delta optimization purposes.
>>
>>     That would allow a smart client in a hurry to index-pack it as they
>>     go along, and as soon as they have enough to check out HEAD return
>>     to the client, and continue the rest in the background
>
> Interesting idea. You're not reducing the total client effort, but
> you're improving latency of getting the user to a checkout. Of course
> that doesn't help if they want to run "git log" as their first
> operation. ;)
>
>> But realized I was just starting to describe something like 'clone
>> --depth=1' followed by a 'fetch --unshallow' in the background, except
>> that would work better (if you did "just the tip" naïvely you'd get
>> 'missing object' on e.g. 'git log', with that ad-hoc hack we'd need to
>> write out two packs etc...).
>
> Right, that would work. I will note one thing, though: the total time to
> do a 1-depth clone followed by an unshallow is probably much higher than
> doing the whole clone as one unit, for two reasons:

Indeed. The hypothesis is that the user doesn't really care about the
clone-time, but the clone-to-repo-mostly-usable time.

>   1. The server won't use reachability bitmaps when serving the
>      follow-up fetch (because shallowness invalidates the reachability
>      data they're caching), so it will spend much more time in the
>      "Counting objects" phase.
>
>   2. The server has to throw away some deltas. Imagine version X of a
>      file in the tip commit is stored as a delta against version Y in
>      that commit's parent. The initial clone has to throw away the
>      on-disk delta of X and send you the whole object (because you are
>      not requesting Y at all). And then in the follow-up fetch, it must
>      either send you Y as a base object (wasting bandwidth), or it must
>      on-the-fly generate a delta from Y to X (wasting CPU).
>
>> But at this point I'm just starting to describe some shoddy version of
>> Documentation/technical/partial-clone.txt :), OTOH there's no "narrow
>> clone and fleshen right away" option.
>
> Yes. And partial-clone suffers from the problems above to an even
> greater extent. ;)
>
>> On protocol extensions: Just having a way to "wget" the corresponding
>> *.idx file from the server would be great, and reduce clone times by a
>> lot. There's the risk of trusting the server, but most people's use-case
>> is going to be pushing right back to the same server, which'll be doing
>> a full validation.
>
> One tricky thing is that the server may be handing you a bespoke .pack
> file. There is no matching ".idx" at all, neither in-memory nor on disk.
> And you would not want the whole on-disk .pack/.idx pair from a site
> like GitHub, where there are objects from many forks.
>
> So in general, I think you'd need some cooperation from the server side
> to ask it to generate and send the .idx that matches the .pack it is
> sending you. Or even if not the .idx format itself, some stable list of
> sha1s that you could use to reproduce it without hashing each
> uncompressed byte yourself.

Yeah, depending on how jt/fetch-cdn-offload is designed (see my
https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/) it
could be (ab)used to do this. I.e. you'd keep a "base" *.{pack,idx}
around for such a purpose.

So in such a case you'd serve up that recent-enough *.{pack,idx} for the
client to "wget", and the client would then trust it (or not) and do the
equivalent of a "fetch" from that point to be 100% up-to-date.

> This could even be stuffed into the pack format and stripped out by
> the receiving index-pack (i.e., each entry is prefixed with "and by
> the way, here is my sha1...").

That would be really interesting. I.e. just having room for that (or
anything else) in the pack format.

I wonder if it could be added to the delta-chain in the current format
as a nasty hack :)

>> We could also defer that validation instead of skipping it. E.g. wget
>> *.{pack,idx} followed by a 'fsck' in the background. I've sometimes
>> wanted that anyway, i.e. "fsck --auto" similar to "gc --auto"
>> periodically to detect repository bitflips.
>>
>> Or, do some "narrow" validation of such an *.idx file right
>> away. E.g. for all the trees/blobs required for the current checkout,
>> and background the rest.
>
> The "do we have all of the objects we need" is already separate from
> "figure out the sha1 of each object", so I think you'd get that
> naturally if you just took in an untrusted .idx (it also demonstrates
> that any .idx cost is really focusing on blobs, because the "do we have
> all objects" check is going to decompress every commit and tree in the
> repo anyway).
>
> -Peff

next prev parent reply	other threads:[~2019-04-22 18:01 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-04-19 21:47 Resolving deltas dominates clone time Martin Fick
2019-04-20  3:58 ` Jeff King
2019-04-20  7:59   ` Ævar Arnfjörð Bjarmason
2019-04-22 15:57     ` Jeff King
2019-04-22 18:01       ` Ævar Arnfjörð Bjarmason [this message]
2019-04-22 18:43         ` Jeff King
2019-04-23  7:07           ` Ævar Arnfjörð Bjarmason
2019-04-22 20:21   ` Martin Fick
2019-04-22 20:56     ` Jeff King
2019-04-22 21:02       ` Jeff King
2019-04-22 21:19       ` [PATCH] p5302: create the repo in each index-pack test Jeff King
2019-04-23  1:09         ` Junio C Hamano
2019-04-23  2:07           ` Jeff King
2019-04-23  2:27             ` Junio C Hamano
2019-04-23  2:36               ` Jeff King
2019-04-23  2:40                 ` Junio C Hamano
2019-04-22 22:32       ` Resolving deltas dominates clone time Martin Fick
2019-04-23  1:55         ` Jeff King
2019-04-23  4:21           ` Jeff King
2019-04-23 10:08             ` Duy Nguyen
2019-04-23 20:09               ` Martin Fick
2019-04-30 18:02                 ` Jeff King
2019-04-30 22:08                   ` Martin Fick
2019-04-30 17:50               ` Jeff King
2019-04-30 18:48                 ` Ævar Arnfjörð Bjarmason
2019-04-30 20:33                   ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=874l6pudg4.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=gerardu@amazon.com \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    --cc=mfick@codeaurora.org \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.