Re: [spf:guess] Re: [spf:guess] Re: Git import of the recent full enwiki dump

From: Sebastian Bober <sbober@servercare.de>
To: Sam Vilain <sam@vilain.net>
Cc: "Shawn O. Pearce" <spearce@spearce.org>,
	Sverre Rabbelier <srabbelier@gmail.com>,
	Richard Hartmann <richih.mailinglist@gmail.com>,
	Git List <git@vger.kernel.org>,
	Avery Pennarun <apenwarr@gmail.com>,
	Nicolas Pitre <nico@fluxnic.net>
Subject: Re: [spf:guess] Re: [spf:guess] Re: Git import of the recent full enwiki dump
Date: Sat, 17 Apr 2010 09:48:53 +0200
Message-ID: <20100417074853.GE32053@post.servercare.de>
In-Reply-To: <1271475292.3506.53.camel@denix>

On Sat, Apr 17, 2010 at 03:34:52PM +1200, Sam Vilain wrote:
> On Sat, 2010-04-17 at 03:58 +0200, Sebastian Bober wrote:
> > > Without good data set partitioning I don't think I see the above
> > > workflow being as possible.  I was approaching the problem by first
> > > trying to back a SQL RDBMS to git, eg MySQL or SQLite (postgres would be
> > > nice, but probably much harder) - so I first set out by designing a
> > > table store.  But the representation of the data is not important, just
> > > the distributed version of it.
> > 
> > Yep, we had many ideas how to partition the data. All that was not tried
> > up to now, because we had the hope to get it done the "straight" way.
> > But that may not be possible.
> 
> I just don't think it's a practical aim or even useful.  Who really
> wants the complete history of all wikipedia pages?  Only a very few -
> libraries, national archives, and some collectors.

Heh, exactly. And I just want to see if it can be done.

> > We have tried checkpointing (even stopping/starting fast-import) every
> > 10,000 - 100,000 commits. That does mitigate some speed and memory
> > issues of fast-import. But in the end fast-import lost time at every
> > restart / checkpoint.
> 
> One more thought - fast-import really does work better if you send it
> all the versions of a blob in sequence so that it can write out deltas
> the first time around.

This is already done that way.
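
For illustration only, here is a rough Python sketch of what "all the
versions of a blob in sequence" can look like in a fast-import stream.
It is not levitation's actual code: the function names and the
(title, rev_id, text) tuple layout are made up for this example. Only
the blob/mark/data commands are from git-fast-import(1).

```python
import io


def write_blob(out, mark, content):
    """Write one fast-import 'blob' command with an explicit mark."""
    out.write(b"blob\n")
    out.write(b"mark :%d\n" % mark)
    out.write(b"data %d\n" % len(content))  # exact byte count
    out.write(content)
    out.write(b"\n")  # optional LF after the raw data


def emit_page_blobs(out, revisions, first_mark=1):
    """Emit blobs grouped by page, oldest revision first.

    `revisions` is a hypothetical iterable of (title, rev_id, text)
    tuples in any order. Sorting by (title, rev_id) puts all versions
    of one page next to each other in the stream.
    Returns a dict mapping (title, rev_id) to the mark that was used.
    """
    marks = {}
    mark = first_mark
    for title, rev_id, text in sorted(revisions, key=lambda r: (r[0], r[1])):
        write_blob(out, mark, text.encode("utf-8"))
        marks[(title, rev_id)] = mark
        mark += 1
    return marks


if __name__ == "__main__":
    buf = io.BytesIO()
    emit_page_blobs(buf, [("B", 1, "b1"), ("A", 2, "a2"), ("A", 1, "a1")])
    print(buf.getvalue().decode("utf-8"))
```

Sorting in memory obviously only works for a small test wiki; for a
full dump the grouping would have to come from the order of the XML
itself or from an external sort.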

> Another advantage of the per-page partitioning is that they can
> checkpoint/gc independently, allowing for more parallelization of the
> job.
> 
> > > Actually this raises the question - what is it that you are trying to
> > > achieve with this wikipedia import?
> > 
> > Ultimately, having a distributed Wikipedia. Having the possibility to
> > fork or branch Wikipedia, to have an inclusionist and exclusionist
> > Wikipedia all in one.
> 
> This sounds like far too much fun for me to miss out on, now downloading
> enwiki-20100312-pages-meta-history.xml.7z :-) and I will give this a
> crack!


Please try a smaller wiki first for testing. The project at

  git://github.com/sbober/levitation-perl.git

provides, in its branches, several ways to parse the XML and to
generate the fast-import input.
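
To give an idea of the kind of input that gets generated, here is a
minimal Python sketch of a commit stream with a checkpoint every N
commits. It is an assumption-laden illustration, not what levitation
does: the function names, the tuple layout, the file naming and the
mark numbering are invented. The commit, committer, data, M and
checkpoint commands are the ones documented in git-fast-import(1). The
blob marks are assumed to have been defined earlier in the same stream.

```python
import io


def write_commit(out, mark, blob_mark, path, committer, when, message,
                 ref="refs/heads/master"):
    """Write one commit that sets `path` to an already-marked blob.

    `committer` is expected as 'Name <email>'; `when` is seconds since
    the epoch, written with a fixed +0000 offset. `path` is assumed not
    to need quoting (no leading double quote, no newline).
    """
    msg = message.encode("utf-8")
    out.write(b"commit %s\n" % ref.encode("utf-8"))
    out.write(b"mark :%d\n" % mark)
    out.write(b"committer %s %d +0000\n" % (committer.encode("utf-8"), when))
    out.write(b"data %d\n" % len(msg))
    out.write(msg)
    out.write(b"\n")
    out.write(b"M 100644 :%d %s\n" % (blob_mark, path.encode("utf-8")))
    out.write(b"\n")


def emit_history(out, revisions, first_commit_mark, checkpoint_every):
    """Emit one commit per revision, with periodic checkpoints.

    `revisions` is a hypothetical iterable of
    (blob_mark, path, committer, when, message) tuples.
    Blob and commit marks share one namespace, so `first_commit_mark`
    must lie above every blob mark in use.
    Returns the number of checkpoint commands written.
    """
    checkpoints = 0
    for n, (blob_mark, path, committer, when, message) in enumerate(revisions, 1):
        write_commit(out, first_commit_mark + n - 1, blob_mark, path,
                     committer, when, message)
        if n % checkpoint_every == 0:
            out.write(b"checkpoint\n\n")
            checkpoints += 1
    return checkpoints


if __name__ == "__main__":
    buf = io.BytesIO()
    who = "Example <example@example.org>"
    emit_history(buf, [(1, "A.mediawiki", who, 0, "rev 1"),
                       (2, "A.mediawiki", who, 60, "rev 2")],
                 first_commit_mark=1000, checkpoint_every=2)
    print(buf.getvalue().decode("utf-8"))
```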


bye,
  Sebastian

