* Initial git clone behaviour
From: Eric Curtin @ 2016-01-06 22:26 UTC
To: git
Hi Guys,
I am not a git developer or a git expert, but here is a change I would
love to see.
When I do an initial clone of a really large repository (the Linux
kernel, for example) on a brand new machine, it can take quite a while
before I can start working with the code.
Often I do a standard git clone:
git clone (name of repo)
Followed by a depth=1 clone in parallel, so I can get building and
working with the code asap:
git clone --depth=1 (name of repo)
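Roughly like this, where the repository URL and the target directories are
only placeholders (the full clone runs in the background while I start using
the shallow one straight away):

git clone https://example.com/big-repo.git big-repo-full &              # full history, slow
git clone --depth=1 https://example.com/big-repo.git big-repo-shallow   # tip only, finishes first
cd big-repo-shallow                                                      # work here, switch over later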
Could we change the default behavior of git so that we initially get
all the current files quickly, so that we can start working with them,
and then get the rest of the data afterwards? At least a user could get
to work quicker this way. Any disadvantages of this approach? Maybe I am
not the first to suggest something like this.
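For what it is worth, something close to this can already be approximated by
hand today: do the shallow clone, start working in it, and deepen that same
clone in place later. A minimal sketch, with the URL again just a placeholder:

git clone --depth=1 https://example.com/big-repo.git big-repo
cd big-repo
# ... build and work with the checked-out tip ...
git fetch --unshallow    # later: fetch the remaining history into the same clone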
* Re: Initial git clone behaviour
From: Junio C Hamano @ 2016-01-06 23:14 UTC
To: Eric Curtin; +Cc: Git Mailing List
On Wed, Jan 6, 2016 at 2:26 PM, Eric Curtin <ericcurtin17@gmail.com> wrote:
>
> Often I do a standard git clone:
>
> git clone (name of repo)
>
> Followed by a depth=1 clone in parallel, so I can get building and
> working with the code asap:
>
> git clone --depth=1 (name of repo)
>
> Could we change the default behavior of git so that we initially get
> all the current files quickly, so that we can start working with them,
> and then get the rest of the data afterwards? At least a user could get
> to work quicker this way. Any disadvantages of this approach?
It would put more burden on a shared and limited resource (i.e.
the server side).
For example, I just tried a depth=1 clone of Linus's repository from
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
which transferred ~150MB pack data to check out 52k files in 90 seconds.
On the other hand, a full clone transferred ~980MB pack data and it took
170 seconds to complete. You can already see that a full clone is highly
optimized--it does not take even twice the time of getting the most recent
checkout to grab 10 years' worth of development (562k commits).
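If you want to reproduce the comparison, something along these lines should
do (the target directory names are arbitrary, and the transferred pack size
shows up in the "Receiving objects" progress line):

time git clone --depth=1 git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux-shallow
time git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux-full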
This efficiency comes from some tradeoffs, and one of them is that not
all the data necessary to check out the latest tree contents can be stored
near the beginning of the pack data. So "we'll check out the tip while the
remainder of the data is still incoming" would not be workable, unless
you are willing to destroy full-clone performance.
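If you are curious, a rough way to peek at how the objects ended up laid out
in a freshly cloned pack is something like:

git verify-pack -v .git/objects/pack/pack-*.idx | head -20
# each line: object SHA-1, type, size, size in pack, offset within the pack file

The offset column gives a rough picture of where the objects needed for the
tip checkout actually sit in the pack.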
* Re: Initial git clone behaviour
From: Eric Curtin @ 2016-01-06 23:24 UTC
To: Junio C Hamano; +Cc: Git Mailing List
On 6 January 2016 at 23:14, Junio C Hamano <gitster@pobox.com> wrote:
> On Wed, Jan 6, 2016 at 2:26 PM, Eric Curtin <ericcurtin17@gmail.com> wrote:
>>
>> Often I do a standard git clone:
>>
>> git clone (name of repo)
>>
>> Followed by a depth=1 clone in parallel, so I can get building and
>> working with the code asap:
>>
>> git clone --depth=1 (name of repo)
>>
>> Could we change the default behavior of git so that we initially get
>> all the current files quickly, so that we can start working with them,
>> and then get the rest of the data afterwards? At least a user could get
>> to work quicker this way. Any disadvantages of this approach?
>
> It would put more burden on a shared and limited resource (i.e.
> the server side).
>
> For example, I just tried a depth=1 clone of Linus's repository from
>
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>
> which transferred ~150MB pack data to check out 52k files in 90 seconds.
>
> On the other hand, a full clone transferred ~980MB pack data and it took
> 170 seconds to complete. You can already see that a full clone is highly
> optimized--it does not take even twice the time of getting the most recent
> checkout to grab 10 years' worth of development (562k commits).
>
> This efficiency comes from some tradeoffs, and one of them is that not
> all the data necessary to check out the latest tree contents can be stored
> near the beginning of the pack data. So "we'll check out the tip while the
> remainder of the data is still incoming" would not be workable, unless
> you are willing to destroy full-clone performance.
Ok, my internet connection at home is pretty terrible then! I get
nowhere near these timings. It takes over an hour to do a full clone
from my house, and approximately 30 minutes for the depth=1 clone (I did
not time it exactly).
That all makes sense, I guess; probably not a good idea to regress
full-clone performance for the sake of this use case. It was just a
query really!