Re: Resolving deltas dominates clone time

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Jeff King <peff@peff.net>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: Martin Fick <mfick@codeaurora.org>,
	Git Mailing List <git@vger.kernel.org>,
	"Jansen, Geert" <gerardu@amazon.com>,
	Jonathan Tan <jonathantanmy@google.com>
Subject: Re: Resolving deltas dominates clone time
Date: Mon, 22 Apr 2019 14:43:29 -0400	[thread overview]
Message-ID: <20190422184329.GA20304@sigill.intra.peff.net> (raw)
In-Reply-To: <874l6pudg4.fsf@evledraar.gmail.com>

On Mon, Apr 22, 2019 at 08:01:15PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > Your patch is optionally removing the "woah, we got an object with a
> > duplicate sha1, let's check that the bytes are the same in both copies"
> > check. But Martin's problem is a clone, so we wouldn't have any existing
> > objects to duplicate in the first place.
> 
> Right, but we do anyway, as reported by Geert at @amazon.com preceding
> that patch of mine. But it is 99.99% irrelevant to *performance* in this
> case after the loose object cache you added (but before that could make
> all the difference depending on the FS).

I scratched my head at this a bit. If we don't have any other objects,
then what are we comparing against? But I think you mean that we have
the overhead of doing the object lookups to find that out. Yes, that can
add up if your filesystem has high latency, but I think in this case it
is a drop in the bucket compared to dealing with the actual object data.

> I just mentioned it to plant a flag on another bit of the code where
> index-pack in general has certain paranoias/validation the user might be
> willing to optionally drop just at "clone" time.

Yeah, I agree it may be worth pursuing independently. I just don't think
it will help Martin's case in any noticeable way.

> > Right, that would work. I will note one thing, though: the total time to
> > do a 1-depth clone followed by an unshallow is probably much higher than
> > doing the whole clone as one unit, for two reasons:
> 
> Indeed. The hypothesis is that the user doesn't really care about the
> clone-time, but the clone-to-repo-mostly-usable time.

There was a little bit of self-interest in there for me, too, as a
server operator. While it does add to the end-to-end time, most of the
resource use for the shallow fetch gets put on the server. IOW, I don't
think we'd be happy to see clients doing this depth-1-and-then-unshallow
strategy for every clone.

> > So in general, I think you'd need some cooperation from the server side
> > to ask it to generate and send the .idx that matches the .pack it is
> > sending you. Or even if not the .idx format itself, some stable list of
> > sha1s that you could use to reproduce it without hashing each
> > uncompressed byte yourself.
> 
> Yeah, depending on how jt/fetch-cdn-offload is designed (see my
> https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/) it
> could be (ab)used to do this. I.e. you'd keep a "base" *.{pack,idx}
> around for such a purpose.
> 
> So in such a case you'd serve up that recent-enough *.{pack,idx} for the
> client to "wget", and the client would then trust it (or not) and do the
> equivalent of a "fetch" from that point to be 100% up-to-date.

I think it's sort of orthogonal. Either way you have to teach the client
how to get a .pack/.idx combo. Whether it learns to receive it inline
from the first fetch, or whether it is taught to expect it from the
out-of-band fetch, most of the challenge is the same.

> > This could even be stuffed into the pack format and stripped out by
> > the receiving index-pack (i.e., each entry is prefixed with "and by
> > the way, here is my sha1...").
> 
> That would be really interesting. I.e. just having room for that (or
> anything else) in the pack format.
> 
> I wonder if it could be added to the delta-chain in the current format
> as a nasty hack :)

There's definitely not "room" in any sense of the word in the pack
format. :) However, as long as all parties agreed, we can stick whatever
we want into the on-the-wire format. So I was imagining something more
like:

  1. pack-objects learns a --report-object-id option that sticks some
     additional bytes before each object (in its simplest form,
     $obj_hash bytes of id)

  2. likewise, index-pack learns a --parse-object-id option to receive
     it and skip hashing the object bytes

  3. we get a new protocol capability, "send-object-ids". If the server
     advertises and the client requests it, then both sides turn on the
     appropriate option

You could even imagine generalizing it to "--report-object-metadata",
and including 0 or more metadata packets before each object. With object
id being one, but possibly other computable bits like "generation
number" after that. I'm not convinced other metadata is worth the
space/time tradeoff, though. After all, this is stuff that the client
_could_ generate and cache themselves, so you're trading off bandwidth
to save the client from doing the computation.

Anyway, food for thought. :)

-Peff

next prev parent reply	other threads:[~2019-04-22 18:43 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-04-19 21:47 Resolving deltas dominates clone time Martin Fick
2019-04-20  3:58 ` Jeff King
2019-04-20  7:59   ` Ævar Arnfjörð Bjarmason
2019-04-22 15:57     ` Jeff King
2019-04-22 18:01       ` Ævar Arnfjörð Bjarmason
2019-04-22 18:43         ` Jeff King [this message]
2019-04-23  7:07           ` Ævar Arnfjörð Bjarmason
2019-04-22 20:21   ` Martin Fick
2019-04-22 20:56     ` Jeff King
2019-04-22 21:02       ` Jeff King
2019-04-22 21:19       ` [PATCH] p5302: create the repo in each index-pack test Jeff King
2019-04-23  1:09         ` Junio C Hamano
2019-04-23  2:07           ` Jeff King
2019-04-23  2:27             ` Junio C Hamano
2019-04-23  2:36               ` Jeff King
2019-04-23  2:40                 ` Junio C Hamano
2019-04-22 22:32       ` Resolving deltas dominates clone time Martin Fick
2019-04-23  1:55         ` Jeff King
2019-04-23  4:21           ` Jeff King
2019-04-23 10:08             ` Duy Nguyen
2019-04-23 20:09               ` Martin Fick
2019-04-30 18:02                 ` Jeff King
2019-04-30 22:08                   ` Martin Fick
2019-04-30 17:50               ` Jeff King
2019-04-30 18:48                 ` Ævar Arnfjörð Bjarmason
2019-04-30 20:33                   ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190422184329.GA20304@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=avarab@gmail.com \
    --cc=gerardu@amazon.com \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    --cc=mfick@codeaurora.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).