Re: New Feature wanted: Is it possible to let git clone continue last break point?

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Jeff King <peff@peff.net>
To: Shawn Pearce <spearce@spearce.org>
Cc: Junio C Hamano <gitster@pobox.com>,
	Jonathan Nieder <jrnieder@gmail.com>,
	netroby <hufeng1987@gmail.com>,
	Git Mail List <git@vger.kernel.org>,
	Tomas Carnecky <tom@dbservice.com>
Subject: Re: New Feature wanted: Is it possible to let git clone continue last break point?
Date: Wed, 2 Nov 2011 22:42:48 -0400	[thread overview]
Message-ID: <20111103024248.GA9492@sigill.intra.peff.net> (raw)
In-Reply-To: <CAJo=hJtt8vjB5oU+tEabN2AS7c-24bMHNwQSoWtZYtjjrR3d7Q@mail.gmail.com>

On Wed, Nov 02, 2011 at 05:06:53PM -0700, Shawn O. Pearce wrote:

> Yup, I agree. The "repo" tool used by Android does this in Python
> right now[1].  Its a simple hack, if the protocol is HTTP or HTTPS the
> client first tries to download $URL/clone.bundle. My servers have
> rules that trap on */clone.bundle and issue an HTTP 302 Found response
> to direct the client to a CDN. Works. :-)

I thought of doing something like that, but I wanted to be able to make
cross-domain links. The "302 to a CDN" thing is a clever hack, but it
requires more control of the webserver than some users might have. And
of course it doesn't work for the "redirect to git:// on a different
server" trick. Or redirect from "git://".

My thought of having it in "refs/mirrors" is only slightly less hacky,
but I think covers all of those cases. :)

> > Even if the bundle thing ends up too wasteful, it may still be useful to
> > offer a "if you don't have X, go see Y" type of mirror when "Y" is
> > something efficient, like git:// at a faster host (i.e., the "I built 3
> > commits on top of Linus" case).
> 
> Actually, I really think the bundle thing is wasteful. Its a ton of
> additional disk. Hosts like kernel.org want to use sendfile() when
> possible to handle bulk transfers. git:// is not efficient for them
> because we don't have sendfile() capability.

I didn't quite parse this. You say it is wasteful, but then indicate
that it can use sendfile(), which is a good thing.

However, I do agree with this:

> Its also expensive for kernel.org to create each Git repository twice
> on disk. The disk is cheap. Its the kernel buffer cache that is damned
> expensive. Assume for a minute that Linus' kernel repository is a
> popular thing to access. If 400M of that history is available in a
> normal pack file on disk, and again 400M is available as a "clone
> bundle thingy", kernel.org now has to eat 800M of disk buffer cache
> for that one Git repository, because both of those files are going to
> be hot.

Doubling the disk cache required is evil and ugly. I was hoping it
wouldn't matter because the bundle would be hosted on some far-away CDN
server anyway, though. But that is highly dependent on your setup. And
it's really just glossing over the fact that you have twice as many
servers. ;)

> I think I messed up with "repo" using a Git bundle file as its data
> source. What we should have done was a bog standard pack file. Then
> the client can download the pack file into the .git/objects/pack
> directory and just generate the index, reusing the entire dumb
> protocol transport logic. It also allows the server to pass out the
> same file the server retains for the repository itself, and thus makes
> the disk buffer cache only 400M for Linus' repository.

That would be cool, but what about ref tips? The pack is just a big blob
of objects, but we need ref tips to advertise to the server when we come
back via the smart protocol. We can make a guess about them, obviously,
but it would be nice to communicate them. I guess the mirror data could
include the tips and a pointer to a pack file.

Another issue with packs is that they generally aren't supposed to be
--thin on disk, whereas bundles can be. So I could point you to a
succession of bundles. Which is maybe a feature, or maybe just makes
things insanely complex[1].

> One (maybe dumb idea I had) was making the $GIT_DIR/objects/info/packs
> file contain other lines to list reference tips at the time the pack
> was made.

So yeah, that's another solution to the ref tip thingy, and that would
work. I don't think it would make a big difference whether the tips were
in the "mirror" file, or alongside the packfile. The latter I guess
might make administration easier. The "real" repo points its mirror one
time to a static pack store, and then the client goes and grabs whatever
it can from that store.

> Then we wind up with a git:// or ssh:// protocol extension that
> enables sendfile() on an entire pack, and to provide the matching
> objects/info/packs data to help a client over git:// or ssh://
> initialize off the existing pack files.

I think we can get around this by pointing git:// clients, either via
protocol extension or via a magic ref, to an http pack store. Sure, it's
an extra TCP connection, but that's not a big deal compared to doing an
initial clone of most repos.

So the sendfile() stuff would always happen over http.

> But either way, I like the idea of coupling the "resumable pack
> download" to the *existing* pack files, because this is easy to deal
> with.

Yeah, I'm liking that idea. In reference to my [1] above, what I've
started with is making:

  git fetch http://host/foo.bundle

work automatically. And it does work. But it actually spools the bundle
to disk and then unpacks from it, rather than placing it right into the
objects/pack directory. I did this because:

  1. We have to feed it to "index-pack --fix-thin", because bundles can
     be thin. So they're not suitable for sticking right into the pack
     directory.

  2. We could feed it straight to an index-pack pipe, but then we don't
     have a byte-for-byte file on disk to resume an interrupted
     transfer.

But spooling sucks, of course. It means we use twice as much disk space
during the index-pack as we would otherwise need to, not to mention the
latency of not starting the index-pack until we get the whole file.

Pulling down a non-thin packfile makes the problem go away. We can spool
it right into objects/pack, index it on the fly, and if all is well,
move it into its final filename. If the transfer is interrupted, you
drop what's been indexed so far, finish the transfer, and then re-start
the indexing from scratch (actually, the "on the fly" would probably
involve teaching index-pack to be clever about incrementally reading a
partially written file, but it should be possible).

-Peff

next prev parent reply	other threads:[~2011-11-03  2:42 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CAEZo+gfKVY-YgMjd=bEYzRV4-460kqDik-yVcQ9Xs=DoCZOMDg@mail.gmail.com>
2011-10-31  2:28 ` New Feature wanted: Is it possible to let git clone continue last break point? netroby
2011-10-31  4:00   ` Tay Ray Chuan
2011-10-31  9:07   ` Jonathan Nieder
2011-10-31  9:16     ` netroby
2011-11-02 22:06     ` Jeff King
2011-11-02 22:41       ` Junio C Hamano
2011-11-02 23:27         ` Jeff King
2011-11-03  0:06           ` Shawn Pearce
2011-11-03  2:42             ` Jeff King [this message]
2011-11-03  4:19               ` Shawn Pearce
2011-11-04  8:56                 ` Clemens Buchacher
2011-11-04  9:35                   ` Johannes Sixt
2011-11-04 14:22                     ` Shawn Pearce
2011-11-04 15:55                       ` Jakub Narebski
2011-11-04 16:05                       ` Nguyen Thai Ngoc Duy
2011-11-05 10:00                       ` Clemens Buchacher
2011-10-31  9:14   ` Jakub Narebski
2011-10-31 12:49     ` Michael Schubert

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20111103024248.GA9492@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=hufeng1987@gmail.com \
    --cc=jrnieder@gmail.com \
    --cc=spearce@spearce.org \
    --cc=tom@dbservice.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).