Re: Multi-threaded 'git clone'

From: Jeff King <peff@peff.net>
To: David Lang <david@lang.hm>
Cc: Koosha Khajehmoogahi <koosha.khajeh@gmail.com>,
	git <git@vger.kernel.org>
Subject: Re: Multi-threaded 'git clone'
Date: Mon, 16 Feb 2015 10:03:06 -0500
Message-ID: <20150216150305.GA8279@peff.net>
In-Reply-To: <alpine.DEB.2.02.1502160521030.23770@nftneq.ynat.uz>

On Mon, Feb 16, 2015 at 05:31:13AM -0800, David Lang wrote:

> I think it's an interesting question to look at, but before you start
> looking at changing the architecture of the current code, I would suggest
> doing a bit more analysis of the problem to see if the bottleneck is really
> where you think it is.
> 
> First measure, then optimize :-)

Yes, very much so. Fortunately some people have already done some of
this work. :)

On the server side of a clone, the things that must be done before
sending any data are:

  1. Count up all of the objects that must be sent by traversing the
     object graph.

  2. Find any pairs for delta compression (this is the "Compressing
     objects" phase of the progress reporting).

Step (1) naively takes 30-45 seconds for a kernel repo. However, with
reachability bitmaps, it's instant-ish. I just did a clone from
kernel.org, and it looks like they've turned on bitmaps.
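
As a rough illustration of why bitmaps help, here is a toy Python sketch.
It is not git's implementation: the graph, the object names, and the use
of a plain integer as a "bitmap" are all assumptions made for the example.
The walk visits every reachable object, while the bitmap version only ORs
together results that were computed ahead of time.

```python
# Toy object graph: each object maps to the objects it references.
# Names are hypothetical; real git objects are commits, trees, and blobs.
GRAPH = {
    "commit2": ["commit1", "tree2"],
    "commit1": ["tree1"],
    "tree2": ["blob_a", "blob_b"],
    "tree1": ["blob_a"],
    "blob_a": [],
    "blob_b": [],
}


def reachable_by_walk(tips):
    """Walk the graph from the tips, visiting each reachable object once."""
    seen = set()
    stack = list(tips)
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        stack.extend(GRAPH[obj])
    return seen


# One bit position per object (assumption: positions follow sorted names).
INDEX = {name: i for i, name in enumerate(sorted(GRAPH))}


def bitmap_for(tip):
    """Precompute the set of objects reachable from one tip as an int bitmap."""
    bits = 0
    for obj in reachable_by_walk([tip]):
        bits |= 1 << INDEX[obj]
    return bits


# In this sketch the bitmaps are built once, ahead of any clone request.
STORED_BITMAPS = {tip: bitmap_for(tip) for tip in ("commit1", "commit2")}


def reachable_by_bitmap(tips):
    """Answer the same question by OR-ing stored bitmaps, with no graph walk."""
    bits = 0
    for tip in tips:
        bits |= STORED_BITMAPS[tip]
    return {name for name, i in INDEX.items() if (bits >> i) & 1}
```

Both functions return the same set for a tip that has a stored bitmap; the
cost of the second one does not depend on how deep the history is.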

For step (2), git will reuse deltas that already exist in the on-disk
packfile, and will not consider new deltas between objects that are
already in the same pack (because we would already have considered them
when packing in the first place). So the key for servers is to keep
things pretty packed. My kernel.org clone shows that they could probably
stand to repack torvalds/linux.git, but it's not too terrible.

The delta search is multithreaded, so whatever work we do happens in
parallel. But note that some servers may turn pack.threads down to 1
(since their many CPUs are kept busy by multiple requests, rather than
trying to finish a single one).

Then the server streams the data to the client. It might do some light
work transforming the data as it comes off the disk, but most of it is
just blitted straight from disk, and the network is the bottleneck.

On the client side, the incoming data streams into an index-pack
process. For each full object it sees, it hashes and records the name of
the object as it comes in. For deltas, it queues them for resolution
after the complete pack arrives.
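
A minimal Python sketch of that receive loop, under stated assumptions:
the entry tuples and the receive_pack() helper are hypothetical and do not
reflect the real pack wire format. The naming rule it uses (SHA-1 over
"<type> <size>", a NUL byte, then the content) is how git names objects.

```python
import hashlib


def git_object_id(obj_type, content):
    """Return the SHA-1 object name: hash of '<type> <size>' + NUL + content."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def receive_pack(entries):
    """Consume a stream of entries as they arrive.

    Each entry is a hypothetical tuple, either
      ("full", obj_type, content)   or
      ("delta", base_ref, delta_bytes).
    Full objects are hashed immediately; deltas are only queued.
    """
    names = []           # object names recorded as full objects arrive
    pending_deltas = []  # deferred until the complete pack is in
    for entry in entries:
        if entry[0] == "full":
            _, obj_type, content = entry
            names.append(git_object_id(obj_type, content))
        else:
            pending_deltas.append(entry)
    return names, pending_deltas
```

For example, an empty blob hashes to
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the same name git gives it.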

Once the full pack arrives, it resolves all of the deltas. This part is
also multithreaded. If you check out "top" during the "resolving deltas"
phase of the clone, you should see multiple cores in use.

So I don't think there is any room for "just multithread it" in this
process. The CPU intensive bits are already multithreaded. There may be
room for optimizing that, though (e.g., reducing lock contention or
similar).

It would also be possible to resolve deltas while the pack is streaming
in, rather than waiting until the whole thing arrives. That's not
possible in all cases (an object may be a delta against a base that
comes later in the pack), but in practice git puts bases before their
deltas. However, it's overall less efficient, because you may end up
walking through the same parts of the delta chain more than once. For
example, imagine you see a stream of objects A, B, C, D. You get B and
see that it's a delta against A. So you resolve it, hash the object, and
are good. Now you see C, which is a delta against B. To generate C, you
have to compute B again. Now you get to D, which is another delta
against B. So now we compute B again.
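
The repeated work can be counted with a toy Python model. This is only a
sketch: the "delta" here simply appends a payload to its base (real git
deltas are copy/insert instructions), and the names PACK and
resolve_streaming() are made up for the example.

```python
# Toy pack in arrival order: (name, base name or None, payload).
# A "delta" in this model appends its payload to the base content.
PACK = [
    ("A", None, "a"),
    ("B", "A", "b"),
    ("C", "B", "c"),
    ("D", "B", "d"),
]


def resolve_streaming(pack):
    """Resolve each object as it arrives, keeping no intermediate results."""
    stored = {}        # what the pack actually contains: (base, payload)
    results = {}       # reconstructed contents, kept only to check the output
    applications = 0   # how many times a delta was applied to a base

    def rebuild(name):
        nonlocal applications
        base, payload = stored[name]
        if base is None:
            return payload
        applications += 1
        # The base is rebuilt from scratch every time it is needed.
        return rebuild(base) + payload

    for name, base, payload in pack:
        stored[name] = (base, payload)
        results[name] = rebuild(name)
    return results, applications
```

Running resolve_streaming(PACK) applies five deltas to resolve three delta
objects, because B is recomputed once for C and once more for D.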

You can get around this somewhat with a cache of intermediate object
contents, but of course there may be hundreds or thousands of chains
like this in use at once, so you're going to end up with some cache
misses.
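
To see how cache size interacts with interleaved chains, here is a hedged
Python sketch using a small LRU cache. The append-only "delta", the
INTERLEAVED pack, and resolve_with_cache() are all assumptions of this
example, not anything taken from git.

```python
from collections import OrderedDict

# Two delta chains whose objects arrive interleaved: (name, base, payload).
INTERLEAVED = [
    ("A1", None, "a"), ("A2", None, "x"),
    ("B1", "A1", "b"), ("B2", "A2", "y"),
    ("C1", "B1", "c"), ("C2", "B2", "z"),
    ("D1", "B1", "d"), ("D2", "B2", "w"),
]


def resolve_with_cache(pack, capacity):
    """Streaming resolution with an LRU cache of reconstructed contents.

    Returns the number of delta applications performed.
    """
    stored = {}
    cache = OrderedDict()  # name -> content, least recently used first
    applications = 0

    def rebuild(name):
        nonlocal applications
        if name in cache:
            cache.move_to_end(name)  # mark as recently used
            return cache[name]
        base, payload = stored[name]
        if base is None:
            content = payload
        else:
            content = rebuild(base) + payload
            applications += 1
        cache[name] = content
        if len(cache) > capacity:
            cache.popitem(last=False)  # evict the least recently used entry
        return content

    for name, base, payload in pack:
        stored[name] = (base, payload)
        rebuild(name)
    return applications
```

With room for every object (capacity 8) this performs 6 delta applications,
one per delta. With capacity 1 the two chains keep evicting each other and
the count rises to 10, the same as having no cache at all.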

What index-pack does instead is wait until it has all of the objects,
then find A and ask "what objects use A as a base?". Then it computes
B, hashes it, and asks "what objects use B as a base?". It finds C and
D, after which it knows it can drop the intermediate result B.
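
That base-first order can be sketched in Python as follows. This is an
illustration under assumptions, not index-pack's code: deltas are modeled
as "append the payload to the base", and resolve_base_first() is a
hypothetical name.

```python
from collections import defaultdict


def resolve_base_first(pack):
    """Resolve deltas by walking from each base to the objects built on it.

    pack is a list of (name, base name or None, payload) tuples; a toy
    "delta" appends its payload to the base content.
    """
    children = defaultdict(list)  # base name -> names of deltas that use it
    payloads = {}
    roots = []
    for name, base, payload in pack:
        payloads[name] = payload
        if base is None:
            roots.append(name)
        else:
            children[base].append(name)

    # Contents are kept in `results` only so the output can be checked; a
    # real implementation would hash each object and drop the content once
    # all of its children have been expanded.
    results = {}
    applications = 0
    stack = [(root, payloads[root]) for root in roots]
    while stack:
        name, content = stack.pop()
        results[name] = content
        for child in children[name]:  # "what objects use this as a base?"
            applications += 1
            stack.append((child, content + payloads[child]))
    return results, applications
```

For the A, B, C, D example, each delta is applied exactly once: three
applications in total, with B computed a single time.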

So that's less work overall, though in some workloads it might finish
faster if you streamed it (because otherwise your many processors sit
idle while we are blocked on network bandwidth). So that's a potential
area of exploration.

-Peff
