git.vger.kernel.org archive mirror
From: Sebastian Bober <sbober@servercare.de>
To: "Shawn O. Pearce" <spearce@spearce.org>
Cc: Sverre Rabbelier <srabbelier@gmail.com>,
	Richard Hartmann <richih.mailinglist@gmail.com>,
	Git List <git@vger.kernel.org>,
	Avery Pennarun <apenwarr@gmail.com>,
	Nicolas Pitre <nico@fluxnic.net>, Sam Vilain <sam@vilain.net>
Subject: Re: Git import of the recent full enwiki dump
Date: Sat, 17 Apr 2010 03:01:47 +0200
Message-ID: <20100417010147.GB32053@post.servercare.de>
In-Reply-To: <20100417005342.GA8475@spearce.org>

On Fri, Apr 16, 2010 at 05:53:42PM -0700, Shawn O. Pearce wrote:
> Sebastian Bober <sbober@servercare.de> wrote:
> > The question would be, how the commits and the trees are laid out.
> > If every wiki revision shall be a git commit, then we'd need to handle
> > 300M commits. And we have 19M wiki pages (that would be files). The tree
> > objects would be very large and git-fast-import would crawl.
> > 
> > Some tests with the german wikipedia have shown that importing the blobs
> > is doable on normal hardware. Getting the trees and commits into git
> > was not possible up to now, as fast-import was just too slow (and getting
> > slower after 1M commits).
> 
> Well, to be fair to fast-import, its tree handling code is linear
> scan based, because that's how any other part of Git handles trees.
> 
> If you just toss all 19M wiki pages into a single top level tree,
> that's going to take a very long time to locate the wiki page
> talking about Zoos.
> 

I'm not dissing fast-import, it's fantastic. We tried trees 2 to 10 levels
deep (a depth of 3 worked best), but after a few million commits it
just got unbearably slow, with the ETA constantly rising.

That was because of tree creation and the SHA-1 computation for these
tree objects.
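
For illustration only, here is a minimal sketch of what a 3-level layout
could look like. How pages were actually mapped to directories in these
tests is not stated here; hashing the page title and using hex prefixes of
the digest as directory names is an assumption, and sharded_path with its
depth and width parameters is a hypothetical helper.

```python
import hashlib


def sharded_path(title: str, depth: int = 3, width: int = 2) -> str:
    """Map a wiki page title to a nested path (hypothetical layout).

    Assumption: directory names are successive hex prefixes of the
    SHA-1 of the title, so pages spread evenly over the leaf directories.
    """
    digest = hashlib.sha1(title.encode("utf-8")).hexdigest()
    # Take `depth` chunks of `width` hex characters as directory names.
    dirs = [digest[i * width:(i + 1) * width] for i in range(depth)]
    # A "/" in the title would create extra levels, so escape it.
    leaf = title.replace("/", "%2F")
    return "/".join(dirs + [leaf])


if __name__ == "__main__":
    print(sharded_path("Zoo"))
```

With depth=3 and width=2 there are 256**3 (about 16.8M) leaf directories,
so 19M pages would average roughly one entry per leaf, and a one-file
commit would only touch the three small trees along the path plus the
root tree.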

> > I had the idea of having an importer that would just handle this special
> > case (1 file change per commit), but haven't gotten around to trying that yet.
> 
> Really, fast-import should be able to handle this well, assuming you
> aren't just tossing all 19M files into a single massive directory
> and hoping for the best.  Because *any* program working on that
> sort of layout will need to spit out the 19M entry tree object on
> each and every commit, just so it can compute the SHA-1 checksum
> to get the tree name for the commit.
> 
> -- 
> Shawn.
> 
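
To make the cost Shawn describes concrete, here is a rough sketch of how
a tree object's name is computed: the SHA-1 covers a "tree <size>" header
plus one "<mode> <name>\0<20-byte id>" record per entry, so every entry of
a tree has to be serialized again whenever that tree changes. The helper
names tree_id and tree_body_size are hypothetical, the sketch only handles
blob entries (real Git orders subdirectory entries slightly differently),
and the average name length used in the estimate is an assumption.

```python
import hashlib


def tree_id(entries):
    """Return (hex id, body size) for a tree of blob entries.

    entries: iterable of (mode, name, raw_oid) as bytes, where
    raw_oid is the 20-byte binary SHA-1 of the blob.
    Simplification: entries are sorted by raw name bytes, which is
    only correct when the tree contains no subdirectories.
    """
    body = b"".join(
        mode + b" " + name + b"\x00" + oid
        for mode, name, oid in sorted(entries, key=lambda e: e[1])
    )
    header = b"tree %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest(), len(body)


def tree_body_size(n_entries, avg_name_len):
    """Bytes to hash for a flat tree: '100644' + ' ' + name + NUL + id."""
    return n_entries * (6 + 1 + avg_name_len + 1 + 20)


if __name__ == "__main__":
    # Assumed average title length of 20 bytes, for illustration only.
    size = tree_body_size(19_000_000, 20)
    print(f"flat tree: about {size / 1e6:.0f} MB hashed per commit")
```

Under that assumed name length, a single flat tree of 19M entries means
hashing on the order of 900 MB for every one-file commit, which is why
the layout matters more than the importer.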
