From: Sebastian Bober <sbober@servercare.de>
To: Sam Vilain <sam@vilain.net>
Cc: "Shawn O. Pearce" <spearce@spearce.org>,
Sverre Rabbelier <srabbelier@gmail.com>,
Richard Hartmann <richih.mailinglist@gmail.com>,
Git List <git@vger.kernel.org>,
Avery Pennarun <apenwarr@gmail.com>,
Nicolas Pitre <nico@fluxnic.net>
Subject: Re: [spf:guess] Re: Git import of the recent full enwiki dump
Date: Sat, 17 Apr 2010 03:58:57 +0200
Message-ID: <20100417015857.GD32053@post.servercare.de>
In-Reply-To: <1271468696.3302.35.camel@denix>

On Sat, Apr 17, 2010 at 01:44:56PM +1200, Sam Vilain wrote:
> On Sat, 2010-04-17 at 03:01 +0200, Sebastian Bober wrote:
> > I'm not dissing fast-import; it's fantastic. We tried trees from 2 to
> > 10 levels deep (3 being the best depth), but after a few million
> > commits it just got unbearably slow, with the ETA constantly rising.
>
> How often are you checkpointing? Like any data import IME, you can't
> leave transactions going indefinitely and expect good performance!

We have tried checkpointing (and even stopping and restarting
fast-import) every 10,000-100,000 commits. That mitigates some of
fast-import's speed and memory issues, but in the end fast-import lost
time at every restart/checkpoint.
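
FWIW, here is a minimal sketch of the kind of stream we pipe into
git fast-import: pages sharded into a 3-level tree, with an explicit
'checkpoint' command at a fixed interval. The interval, the shard
scheme and the revisions placeholder are made up for illustration;
this is not our real importer:

import hashlib
import sys

CHECKPOINT_EVERY = 10000   # illustrative interval

def shard(title):
    # spread pages over a 3-level tree, e.g. "a/b/c/Some_Page.wiki"
    h = hashlib.sha1(title.encode("utf-8")).hexdigest()
    return "%s/%s/%s/%s.wiki" % (h[0], h[1], h[2], title)

def emit(out, revisions):
    # revisions yields (title, text, author, unix_time, log_message)
    for n, (title, text, author, ts, msg) in enumerate(revisions, 1):
        blob = text.encode("utf-8")
        out.write(("blob\nmark :%d\ndata %d\n" % (n, len(blob))).encode())
        out.write(blob + b"\n")
        log = msg.encode("utf-8")
        out.write(b"commit refs/heads/master\n")
        out.write(("committer %s <wiki@example.org> %d +0000\n"
                   % (author, ts)).encode())
        out.write(("data %d\n" % len(log)).encode() + log + b"\n")
        out.write(("M 100644 :%d %s\n" % (n, shard(title))).encode())
        if n % CHECKPOINT_EVERY == 0:
            # ask fast-import to dump its pack and update the refs
            out.write(b"checkpoint\n")

if __name__ == "__main__":
    # stand-in for the real dump reader
    revs = [("Example_Page", "page text", "Editor", 1271468696, "one rev")]
    emit(sys.stdout.buffer, revs)

Even with checkpoints like these, the throughput kept dropping over
time.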

> Would it be at all possible to consider using a submodule for each
> page, with a super-project commit that is updated once a day or so?
>
> This will create a natural partitioning of the data set in a way that
> is likely to be more useful and efficient to work with. Hand-held
> devices could be shipped with a "shallow" clone of the main
> repository, with shallow clones of the sub-repositories too (in such
> a setup, the device would not really use a checkout, of course, to
> save space). Then, history for individual pages could be extended as
> required. The device could "update" the master history, so it would
> know in summary form which pages have changed. It would then go on to
> fetch updates for individual pages that the user is watching, or
> potentially even get them all. There's an interesting next idea here:
> device-to-device update bundles. And another one: distributed update.
> Instead of writing to a "master" version, the act of editing a wiki
> page becomes creating a fork, and the editorial process promotes
> these forks to be the master version in the superproject. Users who
> have pulled the full repository for a page will be able to see other
> people's forks, to get "latest" versions or for editing purposes.
> This adds not only a distributed update action, but also the ability
> to have a decent peer review/editorial process without it being
> arduous.

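The per-page fetch flow you describe should map directly onto plain
git commands. A rough sketch (the URL and the page path are invented,
and the shallow submodule fetch assumes a reasonably recent git):

import subprocess

def git(*args):
    subprocess.run(["git"] + list(args), check=True)

# one-time setup: shallow clone of the superproject (a real device
# build would avoid the full checkout, as you say, but 'submodule
# update' wants .gitmodules in the worktree, so the sketch keeps it)
git("clone", "--depth", "1", "https://wiki.example.org/enwiki.git",
    "enwiki")

# daily update: refresh the superproject's summary history
git("-C", "enwiki", "pull", "--depth", "1", "origin", "master")

# then pull history only for the pages the user watches
git("-C", "enwiki", "submodule", "update", "--init", "--depth", "1",
    "--", "pages/a/b/c/Some_Page")

Device-to-device updates could then be git bundles of exactly those
per-page repositories.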

> Without good data set partitioning, I don't think the above workflow
> is possible. I was approaching the problem by first trying to back a
> SQL RDBMS onto git, e.g. MySQL or SQLite (Postgres would be nice, but
> probably much harder), so I first set out by designing a table store.
> But the representation of the data is not important, just the
> distributed versioning of it.

Yep, we had many ideas on how to partition the data. None of them have
been tried so far, because we hoped to get it done the "straight" way.
But that may not be possible.
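
If we do end up with your submodule-per-page layout, the superproject
side of the import at least stays tiny: each daily commit just records
gitlink entries (mode 160000) pointing at the tips of the per-page
repositories. A sketch in the same vein as above, with made-up paths
and hashes (a real superproject would also need a .gitmodules file
mapping those paths to URLs):

import sys

def superproject_commit(out, msg, ts, page_tips):
    # page_tips maps "pages/a/b/c/Some_Page" -> 40-hex tip of that
    # page's own repository
    log = msg.encode("utf-8")
    out.write(b"commit refs/heads/master\n")
    out.write(("committer enwiki-import <wiki@example.org> %d +0000\n"
               % ts).encode())
    out.write(("data %d\n" % len(log)).encode() + log + b"\n")
    for path, sha in sorted(page_tips.items()):
        # mode 160000 is how a git tree records a submodule (gitlink)
        out.write(("M 160000 %s %s\n" % (sha, path)).encode())

superproject_commit(sys.stdout.buffer, "updates for 2010-04-16",
                    1271468696,
                    {"pages/a/b/c/Some_Page":
                     "0123456789abcdef0123456789abcdef01234567"})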

> Actually this raises the question: what is it that you are trying to
> achieve with this Wikipedia import?

Ultimately, a distributed Wikipedia: the possibility to fork or branch
Wikipedia, to have an inclusionist and an exclusionist Wikipedia all
in one.

bye,
Sebastian