Re: Multi-threaded 'git clone'

From: Junio C Hamano <gitster@pobox.com>
To: Martin Fick <mfick@codeaurora.org>
Cc: Jeff King <peff@peff.net>, git <git@vger.kernel.org>,
	Koosha Khajehmoogahi <koosha.khajeh@gmail.com>,
	David Lang <david@lang.hm>
Subject: Re: Multi-threaded 'git clone'
Date: Tue, 17 Feb 2015 15:32:48 -0800
Message-ID: <xmqqegpoxgf3.fsf@gitster.dls.corp.google.com>
In-Reply-To: <20150217052007.CC8B713FECF@smtp.codeaurora.org> (Martin Fick's message of "Mon, 16 Feb 2015 22:20:02 -0700")

Martin Fick <mfick@codeaurora.org> writes:

> Sorry for the long winded rant. I suspect that some variation of all
> my suggestions have already been suggested, but maybe they will
> rekindle some older, now useful thoughts, or inspire some new ones.
> And maybe some of these are better to pursue than more parallelism?

We avoid doing a grand design document without having some prototype
implementation, but I think the limitation of the current protocol
has become apparent enough that we should do something about it, and
we should do it in a way that different implementations of Git can
all implement.

I think "multi-threaded clone" is the wrong title for this discussion,
in that the user does not care if it is done by multi-threading the
current logic or in any other way.  The user just wants a faster
clone.

In addition, the current "fetch" protocol has the following problems
that limit us:

 - It is not easy to make it resumable, because we recompute every
   time.  This is especially problematic for the initial fetch aka
   "clone" as we will be talking about a large transfer [*1*].

 - The protocol extension has a fairly low length limit [*2*].

 - Because the protocol exchange starts by the server side
   advertising all its refs, even when the fetcher is interested in
   a single ref, the initial overhead is nontrivial, especially when
   you are doing a small incremental update.  The worst case is an
   auto-builder that polls every five minutes, even when there are no
   new commits to be fetched [*3*].

 - Because we recompute every time, taking into account what the
   fetcher has, in addition to what the fetcher obtained earlier
   from us in order to reduce the transferred bytes, the payload for
   incremental updates becomes tailor-made for each fetch and cannot
   be easily reused [*4*].

I'd like to see a new protocol that lets us overcome the above
limitations (did I miss others? I am sure people can help here)
sometime this year.



[Footnotes]

*1* The "first fetch this bundle from elsewhere and then come back
    here for incremental updates" raised earlier in this thread may
    be a way to alleviate this, as the large bundle can be served
    from a static file.

*2* An earlier "this symbolic ref points at that concrete ref"
    attempt failed because of this and we only talk about HEAD.

*3* A new "fetch" protocol must avoid this "one side blindly gives a
    large message as the first thing".  I have been toying with the
    idea of making the fetcher talk first, by declaring "I am
    interested in your refs that match refs/heads/* or refs/tags/*,
    and I have a superset of objects that are reachable from the
    set of refs' values X you gave me earlier", where X is a small
    token generated by hashing the output from "git ls-remote $there
    refs/heads/* refs/tags/*".  In the best case where the server
    understands what X is and has cached pack data, it can then
    send:

    - differences in the refs that match the wildcards (e.g. "Back
      then at X I did not have refs/heads/next but now I do and it
      points at this commit.  My refs/heads/master is now at that
      commit.  I no longer have refs/heads/pu.  Everything else in
      the refs/ hierarchy you are interested in is the same as state
      X").

    - The new name of the state Y (again, the hashed value of the
      output from "git ls-remote $there refs/heads/* refs/tags/*")
      to make sure the above differences can be verified at the
      receiving end.

    - the cached pack data that contains all necessary objects
      between X and Y.

    Note that the above would work if and only if we accept that it
    is OK to send objects between the remote tracking branches the
    fetcher has (i.e. the objects it last fetched from the server)
    and the current tips of branches the server has, without
    optimizing by taking into account that some commits in that set
    may have already been obtained by the fetcher from a
    third-party.

    If the server does not recognize state X (after all it is just a
    SHA-1 hash value, so the server cannot recreate the set of refs
    and their values from it unless it remembers), the exchange
    would have to degenerate to the traditional transfer.

    The server would want to recognize the result of hashing an
    empty string, though.  The fetcher is saying "I have nothing"
    in that case.
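
    As a rough illustration of the state token and the ref differences
    described in this footnote, here is a minimal Python sketch.  It
    assumes the token is a SHA-1 over "<object-id> TAB <refname>" lines
    sorted by refname, which mirrors the shape of "git ls-remote"
    output.  The function names (state_token, ref_diff, apply_diff) and
    the placeholder object ids are hypothetical and not part of any
    existing protocol.

```python
import hashlib


def state_token(refs):
    """Hash a {refname: object_id} mapping rendered in ls-remote-like form.

    Assumption: one "<oid>\t<refname>\n" line per ref, sorted by refname.
    """
    text = "".join(f"{oid}\t{name}\n" for name, oid in sorted(refs.items()))
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def ref_diff(old, new):
    """Return (changed, removed) describing how to turn `old` into `new`."""
    changed = {name: oid for name, oid in new.items() if old.get(name) != oid}
    removed = sorted(name for name in old if name not in new)
    return changed, removed


def apply_diff(old, changed, removed):
    """Apply a ref difference on the receiving end."""
    result = {name: oid for name, oid in old.items() if name not in removed}
    result.update(changed)
    return result


# State X: master and pu exist.  State Y: next appeared, master moved,
# pu is gone.  Object ids are placeholders, not real commits.
X = {"refs/heads/master": "a" * 40, "refs/heads/pu": "b" * 40}
Y = {"refs/heads/master": "d" * 40, "refs/heads/next": "c" * 40}

changed, removed = ref_diff(X, Y)
rebuilt = apply_diff(X, changed, removed)

# The fetcher verifies the differences by checking that the result
# hashes to the advertised name of state Y.
assert state_token(rebuilt) == state_token(Y)

# "I have nothing" is the hash of an empty string.
EMPTY = state_token({})
```

    The final check is the point of sending the name of state Y along
    with the differences: the receiving end can confirm it arrived at
    the same set of refs without the server re-advertising all of them.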


*4* The scheme in *3* can be extended to bring the fetcher up to
    date step-wise.  If the server's state was X when the fetcher last
    contacted it, and since then the server received multiple pushes
    and has two snapshots of states, Y and Z, then the exchange may
    go like this:

    fetcher: I am interested in refs/heads/* and refs/tags/* and I
             have your state X.

    server:  Here is the incremental difference to the refs and the
             end result should hash to Y.  Here comes the pack data
             to bring you up to date.

    fetcher: (after receiving, unpacking and updating the
             remote-tracking refs) Thanks.  Do you have more?

    server:  Yes, here is the incremental difference to the refs and the
             end result should hash to Z.  Here comes the pack data
             to bring you up to date.

    fetcher: (after receiving, unpacking and updating the
             remote-tracking refs) Thanks.  Do you have more?

    server:  No, you are now fully up to date with me.  Bye.
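
    The exchange above can be sketched as a small simulation.  This is
    only a toy model under stated assumptions: the server remembers an
    ordered list of ref snapshots keyed by their hashed names, the pack
    data is left out entirely, and the Server class, next_step() and
    fetch() are hypothetical names made up for illustration.

```python
import hashlib


def state_token(refs):
    """SHA-1 over "<oid>\t<refname>\n" lines sorted by refname (assumed form)."""
    text = "".join(f"{oid}\t{name}\n" for name, oid in sorted(refs.items()))
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class Server:
    """Toy server that remembers an ordered history of ref snapshots."""

    def __init__(self, snapshots):
        self.snapshots = [dict(s) for s in snapshots]
        self.index = {state_token(s): i for i, s in enumerate(self.snapshots)}

    def next_step(self, token):
        """Return the snapshot following `token`, or None if up to date.

        Raises KeyError when the state is not recognized, in which case
        a real exchange would fall back to the traditional transfer.
        """
        i = self.index[token]
        if i + 1 == len(self.snapshots):
            return None
        return self.snapshots[i + 1]


def fetch(server, refs):
    """Ask "do you have more?" until the server says no; count the steps."""
    steps = 0
    while True:
        nxt = server.next_step(state_token(refs))
        if nxt is None:
            return refs, steps
        # A real fetcher would apply the ref differences and unpack the
        # cached pack here before asking again.
        refs = dict(nxt)
        steps += 1


# Placeholder object ids; the empty snapshot stands for "I have nothing".
X = {"refs/heads/master": "1" * 40}
Y = {"refs/heads/master": "2" * 40}
Z = {"refs/heads/master": "3" * 40, "refs/tags/v1": "4" * 40}
server = Server([{}, X, Y, Z])

final_refs, steps = fetch(server, X)
```

    Because each step depends only on the previous state and not on
    what a particular fetcher happens to have, the pack for X to Y and
    the pack for Y to Z could be computed once and reused for every
    fetcher that is at the same state.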
