git.vger.kernel.org archive mirror
* Re: GSoC resumable clone
@ 2011-03-11 15:17 Shawn Pearce
  2011-03-11 15:37 ` Jeff King
  2011-03-11 15:42 ` Nguyen Thai Ngoc Duy
  0 siblings, 2 replies; 7+ messages in thread
From: Shawn Pearce @ 2011-03-11 15:17 UTC (permalink / raw)
  To: Alexander Miseler
  Cc: Nguyen Thai Ngoc Duy, Ilari Liusvaara, Jeff King,
	Ramkumar Ramachandra, Jonathan Nieder, Jens Lehmann,
	Christian Couder, Thomas Rast, git, Pranav Ravichandran

On Fri, Mar 11, 2011 at 06:10, Alexander Miseler <alexander@miseler.de> wrote:
> On 11.03.2011 14:48, Nguyen Thai Ngoc Duy wrote:
>>> On Fri, Mar 11, 2011 at 01:18:45PM +0100, Alexander Miseler wrote:
>>>>
>>>> Resumable clone
>>
>> A simpler way to a restartable clone is to use bundles (Nicolas'
>> idea). Some glue is needed to teach git-fetch/git-daemon to use the
>> bundles, and git-push to automatically create bundles periodically (or
>> a new command that can be run from cron). I think this way fits in
>> GSoC scope better.

I think the cached bundle idea is horrifically stupid in the face of
the subsequent cached pack idea. JGit already implements cached packs,
and it works very well. The feature just needs to be back-ported to
builtin/pack-objects.c, along with some minor edits to my RFC patch to
git-repack.sh to be able to construct the cached pack.

Unlike a cached bundle, the cached pack doesn't eat up useless disk
space on the server. It's still the only copy of the object content,
which keeps server disk usage (and buffer cache usage) lower.

A protocol extension in the fetch-pack/upload-pack protocol is
required to allow pack-objects to delimit the early thin-pack from the
later cached pack, as well as supply the cached-pack's identity. A
client who breaks the connection after the leading thin-pack has been
received could restart by downloading the cached pack from a specific
starting byte.
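
To make the restart concrete, here is a minimal client-side sketch of
that resume step in Python. This is an illustration only: the helper
name and the HTTP-style Range header format are assumptions, not part
of the actual fetch-pack/upload-pack protocol.

```python
import os

def resume_range(partial_path, total_size):
    """Return the byte range still needed to finish downloading a
    cached pack, or None if the partial file is already complete.
    Resumes from the first byte not yet on disk."""
    have = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    if have >= total_size:
        return None
    # HTTP-style range request: last byte index is inclusive.
    return "bytes=%d-%d" % (have, total_size - 1)
```

A client that received 1000 bytes of a 4096-byte cached pack before
the connection broke would ask for "bytes=1000-4095" on reconnect.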

Without waiting for pack v4, cached packs can shave a full minute of
server CPU time during a clone of the linux-2.6 kernel. That's nothing
to laugh at; it's a full CPU minute, and these days a full CPU minute
is a lot of computational work. It is also pretty backwards compatible
with the current network protocol: even ancient Git clients can still
use the cached pack during an initial clone, saving a lot of server
resources.

With cached packs, organizations like Gentoo wouldn't need to
implement bizarre hacks in their upload-pack binary to prevent clones
over git:// from their servers.

It's also well within GSoC size scope. I think the hard part is
understanding enough of how the revision walker works inside
pack-objects to construct the leading thin-pack.

>> [1] The idea of my work above was mentioned elsewhere: history is
>> cut down by path. Each file/dir's history is a very long chain of
>> deltas. We can stream deltas (in parallel if needed) over the wire,
>> resuming where the chain stopped last time.
>
> This may all be aiming too short. IMHO the best solution would be some
> generic way for the client to specify exactly what it wants to get and
> to get just that. This would lay the groundwork for:
> - lazy clones
> - sparse clones
> - resumable cloning
> - resumable fetching

Junio and I would like to see the narrow checkout code re-implemented
to support obtaining only a subset of the paths from the remote.

Once that is implemented, a client on a really bad network connection
could do a resumable clone by grabbing a shallow clone of depth 1
along no paths, partitioning the root tree up, then extending its
paths, grabbing subdirectories until the root commit is fully
expanded. Then it can walk back, increasing its depth until it runs
into the cached pack... where it can then do byte range requests.

This won't be pretty. And given that the leading thin-pack for a
cached pack can be less than 2% of the entire data transfer, it may
not be necessary for a resumable clone. IMHO if you cannot get 2% of
the data transfer before your connection breaks, maybe you should ask
for the data on DVD via post, because your network sucks.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: GSoC resumable clone
  2011-03-11 15:17 GSoC resumable clone Shawn Pearce
@ 2011-03-11 15:37 ` Jeff King
  2011-03-11 15:41   ` Shawn Pearce
  2011-03-11 15:42 ` Nguyen Thai Ngoc Duy
  1 sibling, 1 reply; 7+ messages in thread
From: Jeff King @ 2011-03-11 15:37 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Alexander Miseler, Nguyen Thai Ngoc Duy, Ilari Liusvaara,
	Ramkumar Ramachandra, Jonathan Nieder, Jens Lehmann,
	Christian Couder, Thomas Rast, git, Pranav Ravichandran

On Fri, Mar 11, 2011 at 07:17:31AM -0800, Shawn O. Pearce wrote:

> >> A simpler way to a restartable clone is to use bundles (Nicolas'
> >> idea). Some glue is needed to teach git-fetch/git-daemon to use the
> >> bundles, and git-push to automatically create bundles periodically (or
> >> a new command that can be run from cron). I think this way fits in
> >> GSoC scope better.
> 
> I think the cached bundle idea is horrifically stupid in the face of
> the subsequent cached pack idea. JGit already implements cached packs,
> and it works very well. The feature just needs to be back-ported to
> builtin/pack-objects.c, along with some minor edits to my RFC patch to
> git-repack.sh to be able to construct the cached pack.

I think there is room for both ideas. The cached bundle idea is not just
"here, download this bundle first". It is "here, download this _other
thing_ first, which might be a bundle, another git repo, a torrent,
etc".

So yeah, cached packs are a way better solution if you are just going to
have an extra bundle on the same machine. But that's just one use case.
The ability for my server to say "go hit kernel.org first, and then come
back to me to pick up the deltas" is also valuable. Similarly, the
ability to serve an initial bundle off a torrent is useful for extremely
large projects.

-Peff


* Re: GSoC resumable clone
  2011-03-11 15:37 ` Jeff King
@ 2011-03-11 15:41   ` Shawn Pearce
  2011-03-11 15:48     ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Shawn Pearce @ 2011-03-11 15:41 UTC (permalink / raw)
  To: Jeff King
  Cc: Alexander Miseler, Nguyen Thai Ngoc Duy, Ilari Liusvaara,
	Ramkumar Ramachandra, Jonathan Nieder, Jens Lehmann,
	Christian Couder, Thomas Rast, git, Pranav Ravichandran

On Fri, Mar 11, 2011 at 07:37, Jeff King <peff@peff.net> wrote:
> On Fri, Mar 11, 2011 at 07:17:31AM -0800, Shawn O. Pearce wrote:
>
>> >> A simpler way to a restartable clone is to use bundles (Nicolas'
>> >> idea). Some glue is needed to teach git-fetch/git-daemon to use the
>> >> bundles, and git-push to automatically create bundles periodically (or
>> >> a new command that can be run from cron). I think this way fits in
>> >> GSoC scope better.
>>
>> I think the cached bundle idea is horrifically stupid in the face of
>> the subsequent cached pack idea. JGit already implements cached packs,
>> and it works very well. The feature just needs to be back-ported to
>> builtin/pack-objects.c, along with some minor edits to my RFC patch to
>> git-repack.sh to be able to construct the cached pack.
>
> I think there is room for both ideas. The cached bundle idea is not just
> "here, download this bundle first". It is "here, download this _other
> thing_ first, which might be a bundle, another git repo, a torrent,
> etc".

Fair enough. Though I wouldn't limit this to bundles. Instead I would
suggest supporting any valid Git URLs, and then extend our URL syntax
to support bundles over http://, rsync://, and torrent.

> So yeah, cached packs are a way better solution if you are just going to
> have an extra bundle on the same machine. But that's just one use case.
> The ability for my server to say "go hit kernel.org first, and then come
> back to me to pick up the deltas" is also valuable. Similarly, the
> ability to serve an initial bundle off a torrent is useful for extremely
> large projects.

If we support any URL and don't assume the URL is a bundle, you can
point traffic at kernel.org to, for example, grab Linus's primary
repository first, even if he doesn't have a bundle.

-- 
Shawn.


* Re: GSoC resumable clone
  2011-03-11 15:17 GSoC resumable clone Shawn Pearce
  2011-03-11 15:37 ` Jeff King
@ 2011-03-11 15:42 ` Nguyen Thai Ngoc Duy
  1 sibling, 0 replies; 7+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-03-11 15:42 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Alexander Miseler, Ilari Liusvaara, Jeff King,
	Ramkumar Ramachandra, Jonathan Nieder, Jens Lehmann,
	Christian Couder, Thomas Rast, git, Pranav Ravichandran

On Fri, Mar 11, 2011 at 10:17 PM, Shawn Pearce <spearce@spearce.org> wrote:
> I think the cached bundle idea is horrifically stupid in the face of
> the subsequent cached pack idea. JGit already implements cached packs,
> and it works very well. The feature just needs to be back-ported to
> builtin/pack-objects.c, along with some minor edits to my RFC patch to
> git-repack.sh to be able to construct the cached pack.
> ...

I wonder why I missed it. Probably too recent and not yet carved into
my mind.

> Junio and I would like to see the narrow checkout code re-implemented
> to support obtaining only a subset of the paths from the remote.

I'm close to finishing negative pathspecs (for extending narrow
clones). I'll get there.

> Once that is implemented, a client on a really bad network connection
> could do a resumable clone by grabbing a shallow clone of depth 1
> along no paths, partition the root tree up, then extend its paths
> grabbing subdirectories until the root commit is fully expanded. Then
> it can walk back increasing its depth until it runs into the cached
> pack... where it can then do byte range requests.

Yes. But then it'll cost the server more processing power.
Partitioning by path greatly reduces the chances of reusing deltas.
-- 
Duy


* Re: GSoC resumable clone
  2011-03-11 15:41   ` Shawn Pearce
@ 2011-03-11 15:48     ` Jeff King
  2011-03-11 20:50       ` Ilari Liusvaara
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2011-03-11 15:48 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Alexander Miseler, Nguyen Thai Ngoc Duy, Ilari Liusvaara,
	Ramkumar Ramachandra, Jonathan Nieder, Jens Lehmann,
	Christian Couder, Thomas Rast, git, Pranav Ravichandran

On Fri, Mar 11, 2011 at 07:41:14AM -0800, Shawn O. Pearce wrote:

> > I think there is room for both ideas. The cached bundle idea is not just
> > "here, download this bundle first". It is "here, download this _other
> > thing_ first, which might be a bundle, another git repo, a torrent,
> > etc".
> 
> Fair enough. Though I wouldn't limit this to bundles. Instead I would
> suggest supporting any valid Git URLs, and then extend our URL syntax
> to support bundles over http://, rsync://, and torrent.

Sorry, I didn't mean to imply that it was limited to bundles. It would
support arbitrary URLs or schemes. See this thread for some past
discussion:

  http://article.gmane.org/gmane.comp.version-control.git/164700

> If we support any URL and don't assume the URL is a bundle, you can
> point traffic at kernel.org to for example grab Linus' primary
> repository first, even if he doesn't have a bundle.

Exactly.

-Peff


* Re: GSoC resumable clone
  2011-03-11 15:48     ` Jeff King
@ 2011-03-11 20:50       ` Ilari Liusvaara
  2011-03-11 21:43         ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Ilari Liusvaara @ 2011-03-11 20:50 UTC (permalink / raw)
  To: Jeff King
  Cc: Shawn Pearce, Alexander Miseler, Nguyen Thai Ngoc Duy,
	Ramkumar Ramachandra, Jonathan Nieder, Jens Lehmann,
	Christian Couder, Thomas Rast, git, Pranav Ravichandran

On Fri, Mar 11, 2011 at 10:48:22AM -0500, Jeff King wrote:
> On Fri, Mar 11, 2011 at 07:41:14AM -0800, Shawn O. Pearce wrote:
> 
> > Fair enough. Though I wouldn't limit this to bundles. Instead I would
> > suggest supporting any valid Git URLs, and then extend our URL syntax
> > to support bundles over http://, rsync://, and torrent.
> 
> Sorry, I didn't mean to imply that it was limited to bundles. It would
> support arbitrary URLs or schemes. See this thread for some past
> discussion:

Security pitfall: You need a way to restrict URL schemes that can
be specified from the remote. Some URL schemes are wildly unsafe
to use that way (or just don't make sense).

The URL schemes where it is safe and makes sense are (at least):
- git://
- ssh:// (and the scp syntax)
- http://
- ftp://
- https://
- ftps://
- rsync://
- file:// (?)

New capabilities, perhaps? That would allow enabling this on a
per-remote-helper basis, if the remote helper is deemed safe to
receive arbitrary URLs from untrusted sources.
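
The allowlist check itself is tiny. A sketch in Python, using the
scheme list above (file:// is excluded here since it is marked with a
question mark; the helper name is made up for illustration):

```python
from urllib.parse import urlsplit

# Schemes deemed safe to follow when supplied by an untrusted remote.
SAFE_SCHEMES = {"git", "ssh", "http", "ftp", "https", "ftps", "rsync"}

def is_safe_redirect_url(url):
    """Accept an alternate-location URL only if its scheme is on the
    client's allowlist; everything else is refused."""
    scheme = urlsplit(url).scheme.lower()
    return scheme in SAFE_SCHEMES
```

So a remote advertising git://kernel.org/... would be followed, while
file:///etc/passwd or an unknown scheme would be rejected outright.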

-Ilari


* Re: GSoC resumable clone
  2011-03-11 20:50       ` Ilari Liusvaara
@ 2011-03-11 21:43         ` Jeff King
  0 siblings, 0 replies; 7+ messages in thread
From: Jeff King @ 2011-03-11 21:43 UTC (permalink / raw)
  To: Ilari Liusvaara
  Cc: Shawn Pearce, Alexander Miseler, Nguyen Thai Ngoc Duy,
	Ramkumar Ramachandra, Jonathan Nieder, Jens Lehmann,
	Christian Couder, Thomas Rast, git, Pranav Ravichandran

On Fri, Mar 11, 2011 at 10:50:41PM +0200, Ilari Liusvaara wrote:

> > Sorry, I didn't mean to imply that it was limited to bundles. It would
> > support arbitrary URLs or schemes. See this thread for some past
> > discussion:
> 
> Security pitfall: You need a way to restrict URL schemes that can
> be specified from the remote. Some URL schemes are wildly unsafe
> to use that way (or just don't make sense).

Did you mean on the server end? Or the client?

If the server, then I think no, it's a client decision. If on the client
end, then yes, but it's one of many criteria.

The server end provides specially-formed refs that mention alternate
locations.  The client decides which of those locations, if any, meet
its criteria for mirroring, including but not limited to:

  1. Whether the client supports the protocol in question (not everybody
     will be able to torrent, for example).

  2. Whether the client's network allows it (e.g., restrictive proxies).

  3. Whether it meets the client's security requirements (e.g., we
     probably shouldn't accept file:// URLs at all).

But it's clear to me that the security decision is only one of many
criteria, and that the client is in a much better place to make those
decisions. And that some of those decisions are going to have to be
configurable by the user.
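
The selection logic above can be sketched as a chain of client-side
predicates. This is illustrative only — the function and parameter
names are invented, and criterion 2 (network/proxy reachability) is
omitted because it requires actually probing the network:

```python
from urllib.parse import urlsplit

def choose_mirror(candidates, supported_schemes, allowed_schemes):
    """Pick the first advertised alternate location that passes all of
    the client's criteria, or None to fall back to a normal fetch."""
    for url in candidates:
        scheme = urlsplit(url).scheme.lower()
        if scheme not in supported_schemes:  # criterion 1: client capability
            continue
        if scheme not in allowed_schemes:    # criterion 3: security policy
            continue
        return url
    return None
```

A client that cannot torrent and refuses file:// would skip those
candidates and settle on, say, a git:// mirror — or on None, meaning
"just fetch from the origin server as usual".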

So yes, I agree we shouldn't blindly follow URLs.

-Peff

