Re: [PATCH] fetch-pack: speed up loading of refs via commit graph

From: Patrick Steinhardt <ps@pks.im>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] fetch-pack: speed up loading of refs via commit graph
Date: Thu, 5 Aug 2021 08:04:51 +0200
Message-ID: <YQt/g0iZxAVgw66o@ncase>
In-Reply-To: <YQr/vLNjZomIe1ME@coredump.intra.peff.net>

On Wed, Aug 04, 2021 at 04:59:40PM -0400, Jeff King wrote:
> On Wed, Aug 04, 2021 at 03:56:11PM +0200, Patrick Steinhardt wrote:
> 
> > When doing reference negotiation, git-fetch-pack(1) is loading all refs
> > from disk in order to determine which commits it has in common with the
> > remote repository. This can be quite expensive in repositories with many
> > references though: in a real-world repository with around 2.2 million
> > refs, fetching a single commit by its ID takes around 44 seconds.
> > 
> > Dominating the loading time is decompression and parsing of the objects
> > which are referenced by commits. Given the fact that we only care about
> > commits (or tags which can be peeled to one) in this context, there is
> > thus an easy performance win by switching the parsing logic to make use
> > of the commit graph in case we have one available. Like this, we avoid
> > hitting the object database to parse these commits but instead only load
> > them from the commit-graph. This results in a significant performance
> > boost when executing git-fetch in said repository with 2.2 million refs:
> > 
> >     Benchmark #1: HEAD~: git fetch $remote $commit
> >       Time (mean ± σ):     44.168 s ±  0.341 s    [User: 42.985 s, System: 1.106 s]
> >       Range (min … max):   43.565 s … 44.577 s    10 runs
> > 
> >     Benchmark #2: HEAD: git fetch $remote $commit
> >       Time (mean ± σ):     19.498 s ±  0.724 s    [User: 18.751 s, System: 0.690 s]
> >       Range (min … max):   18.629 s … 20.454 s    10 runs
> > 
> >     Summary
> >       'HEAD: git fetch $remote $commit' ran
> >         2.27 ± 0.09 times faster than 'HEAD~: git fetch $remote $commit'
> 
> Nice. I've sometimes wondered if parse_object() should be doing this
> optimization itself. Though we'd possibly still want callers (like this
> one) to give us more hints, since we already know the type is
> OBJ_COMMIT. Whereas parse_object() would have to discover that itself
> (though we already incur the extra type lookup there to handle blobs).

That would certainly make it much harder to hit this pitfall. The one
thing to be careful about is that we still need to verify that the
object exists in our ODB. Otherwise we may look up a commit via the
commit-graph even though the commit doesn't exist anymore.
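
A toy model of that caveat, in Python rather than git's C: the
commit-graph is treated as a cache that may outlive the objects it
describes, so a graph hit is only trusted after confirming that the
object is still in the ODB. All names here (`ToyRepo`, `load_commit`,
`object_exists`, `parse_from_odb`) are hypothetical and are not git
APIs; this is a sketch of the idea, not of the patch.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    # Hypothetical stand-ins: neither attribute mirrors a real git structure.
    odb: dict = field(default_factory=dict)           # oid -> (type, parents)
    commit_graph: dict = field(default_factory=dict)  # oid -> parents
    odb_parses: int = 0                               # counts expensive parses

    def object_exists(self, oid):
        # Cheap existence check: no decompression or parsing involved.
        return oid in self.odb

    def parse_from_odb(self, oid):
        # Expensive path: stands in for inflating and parsing the object.
        self.odb_parses += 1
        entry = self.odb.get(oid)
        if entry is None or entry[0] != "commit":
            return None
        return entry[1]

    def load_commit(self, oid):
        # Fast path: trust the graph only if the object still exists.
        if oid in self.commit_graph and self.object_exists(oid):
            return self.commit_graph[oid]
        # Slow path: graph miss, or a stale entry for a deleted object.
        return self.parse_from_odb(oid)


repo = ToyRepo()
repo.odb["c1"] = ("commit", ["c0"])
repo.commit_graph["c1"] = ["c0"]
repo.commit_graph["gone"] = ["c1"]  # stale: no longer present in the ODB

print(repo.load_commit("c1"))    # answered from the graph, no ODB parse
print(repo.load_commit("gone"))  # stale graph entry is not trusted
```

The existence check is the only ODB access on the fast path, which is
what keeps it cheap compared to a full parse.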

> Do you have a lot of tags in your repository?

No, it's only about 2000 tags.

> I wonder where the remaining 20s is going. 

Rebasing this commit on top of my git-rev-list(1) series [1] for the
connectivity check gives another 25% speedup, going down from 20s to 14s
(numbers are a bit different given that I'm on a different machine right
now). Of what remains, several things take time:

    - 20% of the time is spent sorting the refs in
      `mark_complete_and_common_ref()`. This time around I feel less
      comfortable just disabling the sorting, given that doing so may
      impact correctness.

    - 30% of the time is spent looking up object types via
      `oid_object_info_extended()`, where 75% of these lookups come from
      `deref_without_lazy_fetch()`. This can be improved a bit by doing
      the `lookup_unknown_object()` dance, buying a modest speedup of
      ~8%. But this again has a memory tradeoff, given that we must
      allocate each object large enough to fit any object type.
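
As a back-of-the-envelope reading of the two items above, here is the
same breakdown as arithmetic. It assumes the percentages are shares of
the roughly 14s run, which is an assumption on top of the numbers
given; the absolute seconds are illustrative only.

```python
# Assumption: percentages are shares of the ~14 s run mentioned above.
total_s = 14.0

sort_share = 0.20         # sorting in mark_complete_and_common_ref()
type_lookup_share = 0.30  # oid_object_info_extended()
deref_fraction = 0.75     # part of those lookups from deref_without_lazy_fetch()

# Share of the whole run spent in lookups from deref_without_lazy_fetch().
deref_share = type_lookup_share * deref_fraction

# Share not covered by either of the two items.
other_share = 1.0 - sort_share - type_lookup_share

for label, share in [
    ("sorting refs", sort_share),
    ("type lookups (all)", type_lookup_share),
    ("  of which deref_without_lazy_fetch()", deref_share),
    ("everything else", other_share),
]:
    print(f"{label:40s} {share:6.1%}  ~{share * total_s:4.1f} s")
```

Under that assumption, the lookups from `deref_without_lazy_fetch()`
account for a bit under a quarter of the run, and about half of the
time is not attributed to either item.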

Other than that I don't see anything obvious in the flame graphs. In
case anybody is interested, I've posted flame graphs in our GitLab issue
at [2]: one for the state before this patch, one with this patch, and
one in combination with [1].

[1]: http://public-inbox.org/git/cover.1627896460.git.ps@pks.im/
[2]: https://gitlab.com/gitlab-org/gitlab/-/issues/336657#note_642957933

>   - you'd want to double check that we always call this during ref
>     iteration (it looks like we do, and I think peel_iterated_ref()
>     falls back to a normal peel otherwise)
> 
>   - for a tag-of-tag-of-X, that will give us the complete peel to X. But
>     it looks like deref_without_lazy_fetch() marks intermediate tags
>     with the COMPLETE flag, too. I'm not sure how important that is
>     (i.e., is it necessary for correctness, or just an optimization, in
>     which case we might be better off guessing that tags are
>     single-layer, as it's by far the common case).
> 
> If we don't go that route, there's another possible speedup: after
> parsing a tag, the type of tag->tagged (if it is not NULL) will be known
> from the tag's contents, and we can avoid the oid_object_info_extended()
> type lookup. It might need some extra surgery to convince the tag-parse
> not to fetch promisor objects, though.
> 
> I'm not sure it would make that big a difference, though. If we save one
> type-lookup per parsed tag, then the tag parsing is likely to dwarf it.

Yeah, I'd assume the same. And in any case, our repo doesn't really have
any problems with tags given that there are so few of them. So I
wouldn't have the data to back up any performance improvements here.
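
For reference, the tag-contents idea quoted above can be sketched as a
toy in Python (not git's C code; `parse_tag_header`, `peel` and `store`
are hypothetical names). A tag object's header carries `object` and
`type` lines, so once a tag is parsed the type of what it points at is
already known and needs no separate lookup. The sketch ignores
promisor objects entirely.

```python
def parse_tag_header(payload):
    """Return (tagged_oid, tagged_type) from a tag object's header lines."""
    fields = {}
    for line in payload.split("\n"):
        if not line:  # a blank line ends the header
            break
        key, _, value = line.partition(" ")
        fields.setdefault(key, value)
    return fields["object"], fields["type"]


def peel(store, oid, obj_type):
    """Follow tags to a non-tag object; return (oid, type, tags_parsed)."""
    tags_parsed = 0
    while obj_type == "tag":
        # The tag itself names the type of the object it points at, so
        # no separate type lookup for the tagged object is done here.
        oid, obj_type = parse_tag_header(store[oid])
        tags_parsed += 1
    return oid, obj_type, tags_parsed


# Hypothetical object store holding raw tag payloads only: oid -> text.
store = {
    "t2": "object t1\ntype tag\ntag outer\n"
          "tagger A <a@example.com> 0 +0000\n\nouter\n",
    "t1": "object c1\ntype commit\ntag inner\n"
          "tagger A <a@example.com> 0 +0000\n\ninner\n",
}

print(peel(store, "t2", "tag"))
```

For the tag-of-tag chain in `store`, both tags are parsed and the final
type comes from the inner tag's `type` line.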

Patrick
