From: Sebastian Bober <sbober@servercare.de>
To: Richard Hartmann <richih.mailinglist@gmail.com>
Cc: Sverre Rabbelier <srabbelier@gmail.com>,
	Git List <git@vger.kernel.org>,
	Avery Pennarun <apenwarr@gmail.com>,
	Nicolas Pitre <nico@fluxnic.net>,
	"Shawn O. Pearce" <spearce@spearce.org>,
	Sam Vilain <sam@vilain.net>
Subject: Re: Git import of the recent full enwiki dump
Date: Sat, 17 Apr 2010 03:25:31 +0200
Message-ID: <20100417012531.GC32053@post.servercare.de>
In-Reply-To: <y2h2d460de71004161810p2c331099q4b2d7dabd01e5f8@mail.gmail.com>

On Sat, Apr 17, 2010 at 03:10:56AM +0200, Richard Hartmann wrote:
> On Sat, Apr 17, 2010 at 02:19, Sverre Rabbelier <srabbelier@gmail.com> wrote:
> 
> > Assuming you do the import incrementally
> > using something like git-fast-import (feeding it with a custom
> > exporter that uses the dump as its input) you shouldn't even need an
> > extraordinary machine to do it (although you'd need a lot of storage).
> 
> I am using a Python script [1] to import the XML dump.

There is also a version available at (plug):

  git://github.com/sbober/levitation-perl.git

That is a bit faster and consumes less memory (and is written in Perl).
But that, too, will not be able to handle enwiki at the moment.
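
On the memory side, an exporter for a dump of this size generally has to
stream the XML instead of loading it whole. The following is only a minimal
sketch of that idea, not code taken from either levitation version. The
helper names (local, iter_revisions) and the SAMPLE document are made up for
illustration, and the element names (page, title, revision, text) are assumed
to follow the usual MediaWiki export layout.

```python
import io
import xml.etree.ElementTree as ET

# Tiny hypothetical stand-in for a MediaWiki XML dump.
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">'
    b"<page><title>Aardvark</title>"
    b"<revision><id>1</id><text>first</text></revision>"
    b"<revision><id>2</id><text>second</text></revision>"
    b"</page></mediawiki>"
)


def local(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(source):
    """Yield (title, text) for each revision, one at a time.

    Each revision element is cleared after it has been yielded, so the full
    text of old revisions is not kept in memory.
    """
    title = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            text = ""
            for child in elem:
                if local(child.tag) == "text":
                    text = child.text or ""
            yield title, text
            elem.clear()  # drop the revision's children and text
        elif name == "page":
            elem.clear()  # drop the emptied revision shells of this page


if __name__ == "__main__":
    for title, text in iter_revisions(io.BytesIO(SAMPLE)):
        print(title, len(text))
```

Clearing each element as soon as it has been handled keeps the working set
close to a single revision, although the root element still accumulates empty
page shells over a very long run.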

> 
> 
> > Speaking of which, it might make sense to separate the
> > worktree by prefix, so articles starting with "aa" go under the "aa"
> > directory, etc?
> 
> Very good idea. What command would I need to send to
> git-fast-import to do that?

levitation does that already. 
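
For anyone who wants to see the idea in terms of the fast-import stream
itself: there is no special command for it, the prefix is simply part of the
path in the "M" (filemodify) line. The sketch below is an illustration under
stated assumptions, not levitation's actual layout. The function names, the
two-character prefix and the escaping of "/" and spaces are all hypothetical
choices, as are the committer name and address in the demo.

```python
import io


def sharded_path(title):
    """Map an article title to '<prefix>/<name>' using its first two characters."""
    # Hypothetical escaping: keep '/' out of the file name, avoid spaces.
    safe = title.replace("/", "%2F").replace(" ", "_")
    # Pad one-character titles so the prefix is always two characters long.
    prefix = safe[:2].lower().ljust(2, "_")
    return f"{prefix}/{safe}"


def write_revision(out, mark, title, text, author, email, timestamp, message):
    """Write one blob and one commit in git fast-import stream format.

    Titles that start with a double quote or contain a newline would need
    C-style path quoting, which this sketch does not handle.
    """
    body = text.encode("utf-8")
    msg = message.encode("utf-8")
    out.write(b"blob\nmark :%d\ndata %d\n" % (mark, len(body)))
    out.write(body + b"\n")
    # Without a 'from' line, the commit continues the branch's current tip.
    out.write(b"commit refs/heads/master\n")
    out.write(
        b"committer %s <%s> %d +0000\n"
        % (author.encode("utf-8"), email.encode("utf-8"), timestamp)
    )
    out.write(b"data %d\n" % len(msg))
    out.write(msg + b"\n")
    out.write(b"M 100644 :%d %s\n\n" % (mark, sharded_path(title).encode("utf-8")))


def demo():
    """Build the stream for a single made-up revision and return it as bytes."""
    out = io.BytesIO()
    write_revision(out, 1, "Aardvark", "hello", "A U Thor",
                   "author@example.com", 1271467531, "import rev 1")
    return out.getvalue()
```

Writing the same stream to the standard input of "git fast-import" in an
empty repository would give one commit with the file aa/Aardvark.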

> 
> > Hope that helps, and if you do convert it (and it turns out to be
> > usable, and you decide to keep it up to date somehow), put it up
> > somewhere! :)
> 
> It did.
> I will make it available if it turns out to be useful. Keeping it up to
> date might be harder unless they keep on releasing new
> (incremental) snapshots.

If desired, I could produce input files for git-fast-import for a larger
wiki (like the German or Japanese Wikipedia), so that other people can
have a look at the performance.


bye,
  Sebastian
