From: "Jon Smirl" <jonsmirl@gmail.com>
To: "Linus Torvalds" <torvalds@linux-foundation.org>, jnareb@gmail.com
Cc: "Nicolas Pitre" <nico@cam.org>,
"Shawn O. Pearce" <spearce@spearce.org>,
"Git Mailing List" <git@vger.kernel.org>
Subject: Re: git-daemon on NSLU2
Date: Sat, 25 Aug 2007 11:44:07 -0400
Message-ID: <9e4733910708250844n7074cb8coa5844fa6c46b40f0@mail.gmail.com>
In-Reply-To: <alpine.LFD.0.999.0708241616390.25853@woody.linux-foundation.org>
On 8/24/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > I can clone the tree in five minutes using the http protocol. Using the
> > git protocol would take 24hrs if I let it finish.
>
> The http side doesn't actually do any global verification, the way
> git-daemon does. So to it, everything is just temporary buffers, and you
> don't need any memory at all, really.
>
> git-daemon will create a packfile. That means that it has to generate the
> *global* object reachability, and will then optimize the object packing
> etc etc. That's a minimum of something like 48 bytes per object for just
> the object chains, and the kernel has a *lot* of objects (over half a
> million).
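For scale, the figures quoted above can be multiplied out. This is only a back-of-envelope sketch: the 48 bytes per object and the half-million object count come from the quoted text, the count is a lower bound ("over half a million"), and the function name is made up for illustration.

```python
# Lower-bound estimate of memory needed just for the object chains,
# using the two numbers quoted above. Everything else (tree and commit
# parsing, delta search, page cache) comes on top of this.
BYTES_PER_OBJECT = 48  # "a minimum of something like 48 bytes per object"


def chain_memory_bytes(object_count, bytes_per_object=BYTES_PER_OBJECT):
    """Return the minimum bytes needed to hold the object chains."""
    return object_count * bytes_per_object


total = chain_memory_bytes(500_000)  # "over half a million" objects
print(f"{total} bytes = {total / 2**20:.1f} MiB")
```

That comes to roughly 23 MiB before any object data is read at all.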
This process creates a large, repeated workload: the server takes a
200MB pack, repacks it to add a few loose objects, and then throws the
result away. This model makes the NSLU2 unusable, and I see the same
problem at my shared hosting provider. An initial clone of a repo that
takes 3min from kernel.org takes 25min on a shared host, since the RAM
is not dedicated.
There are three categories of fetches:
1) initial clone, fetch all
2) fetch recent
3) I haven't fetched in three months
99% of fetches fall in the first two categories.
A very simple solution is to sendfile() any existing pack that contains
objects the client wants and let the client deal with the unwanted
objects. Yes, this does send extra traffic over the net, but the only
group significantly impacted is #3, which is the most infrequent
group.
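The pack selection rule can be sketched roughly as follows. This is an illustration only: the pack names and one-letter object ids are invented, and a real server would learn each pack's contents from its index file rather than from an in-memory dict.

```python
# Sketch: choose whole packs to send, given what the client wants.
# pack_contents maps a pack name to the set of object ids it holds.

def packs_to_send(pack_contents, wanted):
    """Return names of packs holding at least one wanted object."""
    return [name for name, objs in pack_contents.items() if objs & wanted]


def extra_objects(pack_contents, chosen, wanted):
    """Return objects that would be sent although the client did not ask."""
    sent = set().union(*(pack_contents[name] for name in chosen))
    return sent - wanted


packs = {
    "pack-old": {"a", "b", "c"},
    "pack-new": {"d", "e"},
}
chosen = packs_to_send(packs, wanted={"e"})
print(chosen, extra_objects(packs, chosen, wanted={"e"}))
```

The second function makes the cost visible: it is the extra traffic the client has to discard.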
Loose objects are handled as they are currently. To optimize this
scheme you need to let the loose objects build up at the server and
then periodically sweep only the older ones into a pack. Packing the
entire repo into a single pack would cause recent fetches to retrieve
the entire pack.
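Picking the "older" loose objects for the periodic sweep could look something like this. It is a sketch under assumptions: age is judged by file mtime, the cutoff is a parameter I made up, and the function only lists candidates. Actually packing them (for example by feeding the names to git pack-objects) is left out.

```python
import os
import time


def old_loose_objects(objects_dir, min_age_seconds, now=None):
    """List ids of loose objects whose mtime is at least min_age_seconds old."""
    now = time.time() if now is None else now
    result = []
    for fanout in sorted(os.listdir(objects_dir)):
        subdir = os.path.join(objects_dir, fanout)
        # Loose objects live in two-character fan-out directories;
        # this skips pack/ and info/.
        if len(fanout) != 2 or not os.path.isdir(subdir):
            continue
        for name in sorted(os.listdir(subdir)):
            path = os.path.join(subdir, name)
            if now - os.path.getmtime(path) >= min_age_seconds:
                result.append(fanout + name)
    return result
```

Recent loose objects stay loose, so a "fetch recent" client never pulls in a big pack just to get them.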
Initial clone can be optimized further by recognizing that the
receiving repository is empty and sending it everything; the server
has no need to compute which objects are missing. This speeds up the
initial clone, since the existing packs can be sent immediately
instead of waiting on a pack file to be built. Build the loose object
pack in parallel with sending the existing packs.
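The ordering for the initial clone case might be sketched like this. The send and build_loose_pack callables are placeholders I invented; the point is only that the loose-object pack is being built while the existing packs are already going out.

```python
from concurrent.futures import ThreadPoolExecutor


def serve_initial_clone(existing_packs, send, build_loose_pack):
    """Send existing packs at once; build the loose pack in the background."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start packing the loose objects before the first byte is sent.
        pending = pool.submit(build_loose_pack)
        for pack in existing_packs:
            send(pack)
        # Blocks only if packing is still running after the big packs went out.
        send(pending.result())


sent = []
serve_initial_clone(["pack-1", "pack-2"], sent.append, lambda: "pack-loose")
print(sent)
```

No reachability walk happens anywhere in this path, which is the whole saving.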
I recognize that when cloning a single branch or using --reference,
too many objects will also be transmitted, but I believe the benefit
of reduced server load outweighs the overhead of transmitting extra
objects in this case. You can always remove the extra objects on the
client side.
On 8/24/07, Jakub Narebski <jnareb@gmail.com> wrote:
> There was idea to special case clone (just concatenate the packs, the
> receiving side as someone told there can detect pack boundaries; do not
> forget to pack loose objects, first), instead of using generic fetch --all
> for clone, but no code. Code speaks louder than words (although if someone
> would provide details of pack boundary detection...)
Write the file name and length into the socket before sending the
pack. Use sendfile() or its current incarnation to actually send the
pack. Insert these header lines between packs.
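A minimal sketch of that framing, assuming a header of the form "name length" followed by a newline. That exact header layout is my invention, not an existing git protocol, and the receiver here keeps each pack in memory only to keep the example short.

```python
import os
import socket


def send_packs(sock, pack_paths):
    """Send each pack preceded by a 'name length' header line."""
    for path in pack_paths:
        size = os.path.getsize(path)
        header = f"{os.path.basename(path)} {size}\n".encode("utf-8")
        sock.sendall(header)
        with open(path, "rb") as f:
            # socket.sendfile() uses os.sendfile() where the platform
            # supports it and falls back to plain send() otherwise.
            sock.sendfile(f)


def recv_packs(stream):
    """Yield (name, data) pairs from a binary file-like stream until EOF."""
    while True:
        header = stream.readline()
        if not header:
            return  # clean end of stream between packs
        name, size_text = header.decode("utf-8").rstrip("\n").rsplit(" ", 1)
        size = int(size_text)
        data = stream.read(size)
        if len(data) != size:
            raise EOFError(f"truncated pack {name}")
        yield name, data
```

On the receiving side, wrap the socket with sock.makefile("rb") and pass that to recv_packs(). Because the length is known up front, the receiver finds the pack boundaries without parsing the pack format.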
> In addition to the object chains yourself, the native protocol will also
> obviously have to actually *look* at and parse all the tree and commit
> objects while it does all this, so while it doesn't necessarily keep all
> of those in memory all the time, it will need to access them, and if you
> don't have enough memory to cache them, that will add its own set of IO.
>
> So I haven't checked exactly how much memory you really want to have to
> serve big projects, but with some handwavy guesstimate, if you actually
> want to do a good job I'd guess that you really want to have at least as
> much memory as the size of largest project you are serving, and probably
> add at least 10-20% on top of that.
>
> So for the kernel, at a guess, you'd probably want to have at least 256MB
> of RAM to do a half-way good job. 512MB is likely nicer and allows you to
> actually cache the stuff over multiple accesses.
>
> But I haven't actually tested. Maybe it might be bearable at 128M.
>
> Linus
>
--
Jon Smirl
jonsmirl@gmail.com