From: Sam Vilain <sam@vilain.net>
To: Sebastian Bober <sbober@servercare.de>
Cc: "Shawn O. Pearce" <spearce@spearce.org>,
Sverre Rabbelier <srabbelier@gmail.com>,
Richard Hartmann <richih.mailinglist@gmail.com>,
Git List <git@vger.kernel.org>,
Avery Pennarun <apenwarr@gmail.com>,
Nicolas Pitre <nico@fluxnic.net>
Subject: Re: [spf:guess] Re: [spf:guess] Re: Git import of the recent full enwiki dump
Date: Sat, 17 Apr 2010 15:34:52 +1200
Message-ID: <1271475292.3506.53.camel@denix>
In-Reply-To: <20100417015857.GD32053@post.servercare.de>
On Sat, 2010-04-17 at 03:58 +0200, Sebastian Bober wrote:
> > Without good data set partitioning, I don't think the above
> > workflow is feasible. I was approaching the problem by first
> > trying to back a SQL RDBMS with git, e.g. MySQL or SQLite (PostgreSQL
> > would be nice, but probably much harder) - so I first set out by
> > designing a table store. But the representation of the data is not
> > important, just the distributed version of it.
>
> Yep, we had many ideas for how to partition the data. None of them has
> been tried so far, because we hoped to get it done the "straight" way.
> But that may not be possible.
I just don't think importing everything in one go is a practical aim, or
even a useful one. Who really wants the complete history of all Wikipedia
pages? Only a very few - libraries, national archives, and some collectors.
> We have tried checkpointing (even stopping/starting fast-import) every
> 10,000 - 100,000 commits. That does mitigate some speed and memory
> issues of fast-import. But in the end fast-import lost time at every
> restart / checkpoint.
One more thought - fast-import really does work better if you send it
all the versions of a blob in sequence so that it can write out deltas
the first time around.
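
To make the ordering point concrete, here is a minimal sketch of a stream
generator that writes every blob of one page back to back, oldest first,
and only then emits the commits in chronological order, referring to the
blobs by mark. The Revision type, the build_stream name, the committer
identity and the use of the page title as the file path are all
assumptions for illustration, not anything taken from an existing
importer. It also assumes (page, timestamp) pairs are unique and that
titles are safe to use as paths.

```python
from typing import Iterable, NamedTuple


class Revision(NamedTuple):
    """One page revision (hypothetical shape, for illustration only)."""
    page: str       # page title, used directly as the file path
    timestamp: int  # seconds since the epoch, UTC
    text: str       # full revision text


def build_stream(revisions: Iterable[Revision],
                 ref: str = "refs/heads/master") -> bytes:
    """Build a git fast-import stream with blobs grouped per page."""
    revs = list(revisions)
    out = []
    marks = {}

    # Pass 1: all blobs, sorted so that successive versions of the same
    # page are adjacent in the stream.
    ordered = sorted(revs, key=lambda r: (r.page, r.timestamp))
    for mark, rev in enumerate(ordered, start=1):
        data = rev.text.encode("utf-8")
        marks[(rev.page, rev.timestamp)] = mark
        out.append(b"blob\nmark :%d\ndata %d\n" % (mark, len(data)))
        out.append(data + b"\n")

    # Pass 2: commits in chronological order, pointing at blobs by mark.
    next_mark = len(ordered) + 1
    for rev in sorted(revs, key=lambda r: (r.timestamp, r.page)):
        msg = ("Edit %s" % rev.page).encode("utf-8")
        out.append(b"commit %s\n" % ref.encode("utf-8"))
        out.append(b"mark :%d\n" % next_mark)
        out.append(b"committer Importer <importer@example.invalid> %d +0000\n"
                   % rev.timestamp)
        out.append(b"data %d\n" % len(msg) + msg + b"\n")
        out.append(b"M 100644 :%d %s\n\n"
                   % (marks[(rev.page, rev.timestamp)],
                      rev.page.encode("utf-8")))
        next_mark += 1

    return b"".join(out)
```

The resulting bytes can be piped into git fast-import in an empty
repository. For a dump the size of enwiki the two passes would obviously
have to be done per partition rather than over the whole data set in
memory.
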
Another advantage of per-page partitioning is that the partitions can be
checkpointed and gc'd independently, allowing for more parallelization of
the job.
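
As a rough sketch of how pages might be assigned to independent
partitions, the following buckets page titles by a stable hash so that
every revision of a page always lands in the same partition. The function
names and the idea of a fixed partition count are assumptions for
illustration; one partition per page would be the extreme case of the
same scheme.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def partition_for(page: str, partitions: int) -> int:
    """Map a page title to a partition index in a stable way."""
    digest = hashlib.sha1(page.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def group_by_partition(pages: Iterable[str],
                       partitions: int) -> Dict[int, List[str]]:
    """Group page titles by partition index."""
    groups = defaultdict(list)
    for page in pages:
        groups[partition_for(page, partitions)].append(page)
    return dict(groups)
```

Each group could then be fed to its own importer process and repository,
which is what would let the checkpoint and gc steps run independently.
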
> > Actually this raises the question - what is it that you are trying to
> > achieve with this wikipedia import?
>
> Ultimately, having a distributed Wikipedia. Having the possibility to
> fork or branch Wikipedia, to have an inclusionist and exclusionist
> Wikipedia all in one.
This sounds like far too much fun for me to miss out on. I'm now
downloading enwiki-20100312-pages-meta-history.xml.7z :-) and will give
this a crack!
Sam