* Git import of the recent full enwiki dump
From: Richard Hartmann @ 2010-04-16 23:47 UTC
To: wikitech-l, git

-- This email has been sent to two lists --

Hi all,

I would be interested to import the whole enwiki dump [1] into git [2].
This data set is probably the largest set of changes on earth, so it's
highly interesting to see what git will make of it.

As of right now, I am trying to import it on my local machine, but my
first, rough projections tell me my machine will melt down at some
point ;)

Assuming my local import fails, I would appreciate it if this could be
added to wikitech's longer-term todo list. If anyone has access to a
system with several TiB of free disk space which they can spare for a
week or three, it would be awesome. If given shell access, I can take
care of this task, but I would be happy to assist anyone attempting it,
as well. If need be, I can get various people from various communities
to vouch for me, my character & that I Do Not Break Stuff.

Richard Hartmann

PS: If anyone attempts to do this, please poke me, either via email or
as RichiH on freenode, OFTC and IRCnet.

[1] http://download.wikimedia.org/enwiki/20100130/
[2] http://git-scm.com/

* Re: Git import of the recent full enwiki dump
From: Sverre Rabbelier @ 2010-04-17 0:19 UTC
To: Richard Hartmann
Cc: Git List, Avery Pennarun, Nicolas Pitre, Shawn O. Pearce, Sam Vilain

Heya,

[-wikitech-l; if they should be kept on the cc, please re-add. I assume
that the discussion of the git aspects is not relevant to that list]

On Sat, Apr 17, 2010 at 01:47, Richard Hartmann
<richih.mailinglist@gmail.com> wrote:
> This data set is probably the largest set of changes on earth, so
> it's highly interesting to see what git will make of it.

I think that git might actually be able to handle it. Git's been known
not to handle _large files_ very well, but a lot of history/a lot of
files is something different. Assuming you do the import incrementally
using something like git-fast-import (feeding it with a custom exporter
that uses the dump as its input), you shouldn't even need an
extraordinary machine to do it (although you'd need a lot of storage).

> As of right now, I am trying to import it on my local machine, but
> my first, rough projections tell me my machine will melt down at
> some point ;)

How are you importing? Did you script a process that does something
like 'move the next revision of a file into place && git add . && git
commit'? I don't know how well that would work, since I reckon the
worktree will be huge. Speaking of which, it might make sense to
separate the worktree by prefix, so articles starting with "aa" go
under the "aa" directory, etc.

Anyway, other gits might have more interesting things to say. Cc-ed is
Avery, who has been working on a tool to back up entire hard drives in
git. Also cc-ed are Nico and Shawn, who both have a lot of experience
with the object backend and the pack implementation. Also Sam, who has
worked on importing the entire Perl history into git; I'm not sure how
big that is, but they have a lot of changesets too, I think. There's a
bunch of people who have worked on importing stuff like KDE into git
who might have interesting things to add, but I don't know who those
are.

Hope that helps, and if you do convert it (and it turns out to be
usable, and you decide to keep it up to date somehow), put it up
somewhere! :)

-- 
Cheers,

Sverre Rabbelier

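For illustration, a minimal sketch of such a dump-to-fast-import
exporter, assuming revisions arrive as (title, author, email,
timestamp, text) values already parsed from the XML dump. The path
sharding scheme, helper names, and sample data are assumptions made for
the example; they are not taken from import.py or levitation.

    import sys

    def shard(title):
        # Illustrative sharding: "Aardvark" becomes "aa/Aardvark.mediawiki",
        # so no single tree object has to list all 19M page names.
        safe = title.replace(' ', '_').replace('/', '%2F')
        return '{}/{}.mediawiki'.format(safe[:2].lower(), safe)

    def emit_revision(out, title, author, email, timestamp, text):
        # Write one fast-import commit per wiki revision; the stream is
        # meant to be piped into `git fast-import`.
        blob = text.encode('utf-8')
        msg = 'Update {}'.format(title).encode('utf-8')
        out.write(b'commit refs/heads/master\n')
        out.write('committer {} <{}> {} +0000\n'
                  .format(author, email, timestamp).encode('utf-8'))
        out.write('data {}\n'.format(len(msg)).encode('ascii') + msg + b'\n')
        out.write('M 100644 inline {}\n'.format(shard(title)).encode('utf-8'))
        out.write('data {}\n'.format(len(blob)).encode('ascii') + blob + b'\n\n')

    if __name__ == '__main__':
        # Hypothetical single revision; a real run would stream every
        # revision parsed from pages-meta-history.xml, page by page.
        emit_revision(sys.stdout.buffer, 'Aardvark', 'Example User',
                      'user@example.org', 1271462820, 'The aardvark is ...')

Piping the output of such a script into `git fast-import` inside an
empty repository is enough to try the stream format on a small dump.
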
* Re: Git import of the recent full enwiki dump
From: Sebastian Bober @ 2010-04-17 0:48 UTC
To: Sverre Rabbelier
Cc: Richard Hartmann, Git List, Avery Pennarun, Nicolas Pitre, Shawn O. Pearce, Sam Vilain

On Sat, Apr 17, 2010 at 02:19:40AM +0200, Sverre Rabbelier wrote:
> Heya,
>
> [-wikitech-l; if they should be kept on the cc, please re-add. I
> assume that the discussion of the git aspects is not relevant to that
> list]
>
> On Sat, Apr 17, 2010 at 01:47, Richard Hartmann
> <richih.mailinglist@gmail.com> wrote:
> > This data set is probably the largest set of changes on earth, so
> > it's highly interesting to see what git will make of it.
>
> I think that git might actually be able to handle it. Git's been known
> not to handle _large files_ very well, but a lot of history/a lot of
> files is something different. Assuming you do the import incrementally
> using something like git-fast-import (feeding it with a custom
> exporter that uses the dump as its input), you shouldn't even need an
> extraordinary machine to do it (although you'd need a lot of storage).

The question is how the commits and the trees are laid out. If every
wiki revision is to be a git commit, then we'd need to handle 300M
commits. And we have 19M wiki pages (which would be files). The tree
objects would be very large and git-fast-import would crawl.

Some tests with the German Wikipedia have shown that importing the
blobs is doable on normal hardware. Getting the trees and commits into
git has not been possible up to now, as fast-import was just too slow
(and getting slower after 1M commits).

I had the idea of an importer that would just handle this special case
(one file change per commit), but didn't get around to trying that yet.

bye,
Sebastian

* Re: Git import of the recent full enwiki dump
From: Shawn O. Pearce @ 2010-04-17 0:53 UTC
To: Sebastian Bober
Cc: Sverre Rabbelier, Richard Hartmann, Git List, Avery Pennarun, Nicolas Pitre, Sam Vilain

Sebastian Bober <sbober@servercare.de> wrote:
> The question is how the commits and the trees are laid out. If every
> wiki revision is to be a git commit, then we'd need to handle 300M
> commits. And we have 19M wiki pages (which would be files). The tree
> objects would be very large and git-fast-import would crawl.
>
> Some tests with the German Wikipedia have shown that importing the
> blobs is doable on normal hardware. Getting the trees and commits into
> git has not been possible up to now, as fast-import was just too slow
> (and getting slower after 1M commits).

Well, to be fair to fast-import, its tree handling code is based on a
linear scan, because that's how every other part of Git handles trees.

If you just toss all 19M wiki pages into a single top-level tree, it is
going to take a very long time to locate the wiki page talking about
Zoos.

> I had the idea of an importer that would just handle this special case
> (one file change per commit), but didn't get around to trying that
> yet.

Really, fast-import should be able to handle this well, assuming you
aren't just tossing all 19M files into a single massive directory and
hoping for the best. *Any* program working on that sort of layout will
need to spit out the 19M-entry tree object on each and every commit,
just so it can compute the SHA-1 checksum to get the tree name for the
commit.

-- 
Shawn.

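Rough back-of-the-envelope arithmetic behind that point, as a sketch;
the per-entry size, the prefix scheme, and the uniform-spread
assumption are all assumptions for the example, not figures from the
thread.

    # A tree entry is "<mode> <name>\0<20-byte SHA-1>"; assume ~25 bytes
    # of page title, i.e. roughly 50 bytes per entry on average.
    pages = 19_000_000
    flat_tree_bytes = pages * 50            # ~950 MB of tree object,
    # rewritten and re-hashed for every single commit in a flat layout.

    # With two levels of two-letter prefixes ("aa/rd/Aardvark") and a
    # uniform spread, a one-file commit rewrites only the trees along
    # one path: the root, one "aa" tree, and one "aa/rd" tree.
    leaf_entries = pages / (26 * 26) ** 2   # about 42 entries per leaf

    print(flat_tree_bytes, leaf_entries)
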
* Re: Git import of the recent full enwiki dump
From: Sebastian Bober @ 2010-04-17 1:01 UTC
To: Shawn O. Pearce
Cc: Sverre Rabbelier, Richard Hartmann, Git List, Avery Pennarun, Nicolas Pitre, Sam Vilain

On Fri, Apr 16, 2010 at 05:53:42PM -0700, Shawn O. Pearce wrote:
> Sebastian Bober <sbober@servercare.de> wrote:
> > The question is how the commits and the trees are laid out. If every
> > wiki revision is to be a git commit, then we'd need to handle 300M
> > commits. And we have 19M wiki pages (which would be files). The tree
> > objects would be very large and git-fast-import would crawl.
> >
> > Some tests with the German Wikipedia have shown that importing the
> > blobs is doable on normal hardware. Getting the trees and commits
> > into git has not been possible up to now, as fast-import was just
> > too slow (and getting slower after 1M commits).
>
> Well, to be fair to fast-import, its tree handling code is based on a
> linear scan, because that's how every other part of Git handles trees.
>
> If you just toss all 19M wiki pages into a single top-level tree, it
> is going to take a very long time to locate the wiki page talking
> about Zoos.

I'm not dissing fast-import, it's fantastic. We tried trees 2-10 levels
deep (the best depth being 3), but after a few million commits it just
got unbearably slow, with the ETA constantly rising. That was because
of tree creation and the SHA-1 computation for those tree objects.

> > I had the idea of an importer that would just handle this special
> > case (one file change per commit), but didn't get around to trying
> > that yet.
>
> Really, fast-import should be able to handle this well, assuming you
> aren't just tossing all 19M files into a single massive directory and
> hoping for the best. *Any* program working on that sort of layout will
> need to spit out the 19M-entry tree object on each and every commit,
> just so it can compute the SHA-1 checksum to get the tree name for the
> commit.
>
> -- 
> Shawn.

* Re: [spf:guess] Re: Git import of the recent full enwiki dump
From: Sam Vilain @ 2010-04-17 1:44 UTC
To: Sebastian Bober
Cc: Shawn O. Pearce, Sverre Rabbelier, Richard Hartmann, Git List, Avery Pennarun, Nicolas Pitre

On Sat, 2010-04-17 at 03:01 +0200, Sebastian Bober wrote:
> I'm not dissing fast-import, it's fantastic. We tried trees 2-10
> levels deep (the best depth being 3), but after a few million commits
> it just got unbearably slow, with the ETA constantly rising.

How often are you checkpointing? Like any data import, IME you can't
leave transactions running indefinitely and expect good performance!

Would it be at all possible to consider using a submodule for each
page, with a super-project commit which is updated for every day of
updates or so?

This would create a natural partitioning of the data set in a way which
is likely to be more useful and efficient to work with. Hand-held
devices could be shipped with a "shallow" clone of the main repository,
with shallow clones of the sub-repositories too (in such a setup, the
device would not really use a checkout, of course, to save space).
Then, history for individual pages could be extended as required. The
device could "update" the master history, so it would know in summary
form which pages have changed. It would then go on to fetch updates for
individual pages that the user is watching, or potentially even get
them all.

There's an interesting next idea here: device-to-device update bundles.
And another one: distributed update. If, instead of writing to a
"master" version, the action of editing a wiki page becomes creating a
fork, then the editorial process promotes these forks to be the master
version in the superproject. Users who have pulled the full repository
for a page would be able to see other people's forks, to get "latest"
versions or for editing purposes. This adds not only a distributed
update action, but the ability to have a decent peer review/editorial
process without it being arduous.

Without good data set partitioning I don't see the above workflow being
possible. I was approaching the problem by first trying to back an SQL
RDBMS with git, e.g. MySQL or SQLite (Postgres would be nice, but
probably much harder), so I first set out by designing a table store.
But the representation of the data is not important, just the
distributed versioning of it.

Actually, this raises the question: what is it that you are trying to
achieve with this Wikipedia import?

Sam

* Re: [spf:guess] Re: Git import of the recent full enwiki dump
From: Sebastian Bober @ 2010-04-17 1:58 UTC
To: Sam Vilain
Cc: Shawn O. Pearce, Sverre Rabbelier, Richard Hartmann, Git List, Avery Pennarun, Nicolas Pitre

On Sat, Apr 17, 2010 at 01:44:56PM +1200, Sam Vilain wrote:
> On Sat, 2010-04-17 at 03:01 +0200, Sebastian Bober wrote:
> > I'm not dissing fast-import, it's fantastic. We tried trees 2-10
> > levels deep (the best depth being 3), but after a few million
> > commits it just got unbearably slow, with the ETA constantly rising.
>
> How often are you checkpointing? Like any data import, IME you can't
> leave transactions running indefinitely and expect good performance!

We have tried checkpointing (even stopping/starting fast-import) every
10,000 - 100,000 commits. That does mitigate some speed and memory
issues of fast-import, but in the end fast-import lost time at every
restart/checkpoint.

> Would it be at all possible to consider using a submodule for each
> page, with a super-project commit which is updated for every day of
> updates or so?
>
> This would create a natural partitioning of the data set in a way
> which is likely to be more useful and efficient to work with.
> Hand-held devices could be shipped with a "shallow" clone of the main
> repository, with shallow clones of the sub-repositories too (in such
> a setup, the device would not really use a checkout, of course, to
> save space). Then, history for individual pages could be extended as
> required. The device could "update" the master history, so it would
> know in summary form which pages have changed. It would then go on to
> fetch updates for individual pages that the user is watching, or
> potentially even get them all.
>
> There's an interesting next idea here: device-to-device update
> bundles. And another one: distributed update. If, instead of writing
> to a "master" version, the action of editing a wiki page becomes
> creating a fork, then the editorial process promotes these forks to
> be the master version in the superproject. Users who have pulled the
> full repository for a page would be able to see other people's forks,
> to get "latest" versions or for editing purposes. This adds not only
> a distributed update action, but the ability to have a decent peer
> review/editorial process without it being arduous.
>
> Without good data set partitioning I don't see the above workflow
> being possible. I was approaching the problem by first trying to back
> an SQL RDBMS with git, e.g. MySQL or SQLite (Postgres would be nice,
> but probably much harder), so I first set out by designing a table
> store. But the representation of the data is not important, just the
> distributed versioning of it.

Yep, we had many ideas about how to partition the data. None of that
has been tried so far, because we had hoped to get it done the
"straight" way. But that may not be possible.

> Actually, this raises the question: what is it that you are trying to
> achieve with this Wikipedia import?

Ultimately, having a distributed Wikipedia. Having the possibility to
fork or branch Wikipedia, to have an inclusionist and an exclusionist
Wikipedia all in one.

bye,
Sebastian

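A sketch of how such periodic checkpointing can be driven from the
exporter side, assuming the same stream-writing setup as in the earlier
sketch; the interval here is illustrative (the thread reports trying
10,000 - 100,000 commits), and the helper name is hypothetical.

    def maybe_checkpoint(out, commits_done, every=50000):
        # Ask fast-import to flush its current packfile, marks and
        # branch refs without restarting the process; `progress` just
        # echoes a line so the operator can watch the import advance.
        if commits_done and commits_done % every == 0:
            out.write(b'checkpoint\n')
            out.write('progress {} commits imported\n'
                      .format(commits_done).encode('ascii'))
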
* Re: [spf:guess] Re: [spf:guess] Re: Git import of the recent full enwiki dump
From: Sam Vilain @ 2010-04-17 3:34 UTC
To: Sebastian Bober
Cc: Shawn O. Pearce, Sverre Rabbelier, Richard Hartmann, Git List, Avery Pennarun, Nicolas Pitre

On Sat, 2010-04-17 at 03:58 +0200, Sebastian Bober wrote:
> > Without good data set partitioning I don't see the above workflow
> > being possible. I was approaching the problem by first trying to
> > back an SQL RDBMS with git, e.g. MySQL or SQLite (Postgres would be
> > nice, but probably much harder), so I first set out by designing a
> > table store. But the representation of the data is not important,
> > just the distributed versioning of it.
>
> Yep, we had many ideas about how to partition the data. None of that
> has been tried so far, because we had hoped to get it done the
> "straight" way. But that may not be possible.

I just don't think it's a practical aim, or even useful. Who really
wants the complete history of all Wikipedia pages? Only a very few:
libraries, national archives, and some collectors.

> We have tried checkpointing (even stopping/starting fast-import) every
> 10,000 - 100,000 commits. That does mitigate some speed and memory
> issues of fast-import, but in the end fast-import lost time at every
> restart/checkpoint.

One more thought: fast-import really does work better if you send it
all the versions of a blob in sequence, so that it can write out deltas
the first time around.

Another advantage of the per-page partitioning is that the repositories
can checkpoint/gc independently, allowing for more parallelization of
the job.

> > Actually, this raises the question: what is it that you are trying
> > to achieve with this Wikipedia import?
>
> Ultimately, having a distributed Wikipedia. Having the possibility to
> fork or branch Wikipedia, to have an inclusionist and an exclusionist
> Wikipedia all in one.

This sounds like far too much fun for me to miss out on; I am now
downloading enwiki-20100312-pages-meta-history.xml.7z :-) and will give
this a crack!

Sam

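A sketch of that ordering, assuming revisions can be grouped per page
before being streamed; the mark numbering and helper name are
illustrative. The returned marks would later be referenced from commit
commands as 'M 100644 :<mark> <path>' instead of inline data.

    def emit_page_blobs(out, revisions, first_mark):
        # Stream every revision of one page back to back as marked
        # blobs, so fast-import can deltify each one against the blob
        # written just before it; return the mark assigned to each.
        marks = []
        mark = first_mark
        for text in revisions:
            data = text.encode('utf-8')
            out.write('blob\nmark :{}\n'.format(mark).encode('ascii'))
            out.write('data {}\n'.format(len(data)).encode('ascii'))
            out.write(data + b'\n')
            marks.append(mark)
            mark += 1
        return marks, mark
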
* Re: [spf:guess] Re: [spf:guess] Re: Git import of the recent full enwiki dump
From: Sebastian Bober @ 2010-04-17 7:48 UTC
To: Sam Vilain
Cc: Shawn O. Pearce, Sverre Rabbelier, Richard Hartmann, Git List, Avery Pennarun, Nicolas Pitre

On Sat, Apr 17, 2010 at 03:34:52PM +1200, Sam Vilain wrote:
> On Sat, 2010-04-17 at 03:58 +0200, Sebastian Bober wrote:
> > > Without good data set partitioning I don't see the above workflow
> > > being possible. I was approaching the problem by first trying to
> > > back an SQL RDBMS with git, e.g. MySQL or SQLite (Postgres would
> > > be nice, but probably much harder), so I first set out by
> > > designing a table store. But the representation of the data is
> > > not important, just the distributed versioning of it.
> >
> > Yep, we had many ideas about how to partition the data. None of that
> > has been tried so far, because we had hoped to get it done the
> > "straight" way. But that may not be possible.
>
> I just don't think it's a practical aim, or even useful. Who really
> wants the complete history of all Wikipedia pages? Only a very few:
> libraries, national archives, and some collectors.

Heh, exactly. And I just want to see if it can be done.

> > We have tried checkpointing (even stopping/starting fast-import)
> > every 10,000 - 100,000 commits. That does mitigate some speed and
> > memory issues of fast-import, but in the end fast-import lost time
> > at every restart/checkpoint.
>
> One more thought: fast-import really does work better if you send it
> all the versions of a blob in sequence, so that it can write out
> deltas the first time around.

This is already done that way.

> Another advantage of the per-page partitioning is that the
> repositories can checkpoint/gc independently, allowing for more
> parallelization of the job.
>
> > > Actually, this raises the question: what is it that you are trying
> > > to achieve with this Wikipedia import?
> >
> > Ultimately, having a distributed Wikipedia. Having the possibility
> > to fork or branch Wikipedia, to have an inclusionist and an
> > exclusionist Wikipedia all in one.
>
> This sounds like far too much fun for me to miss out on; I am now
> downloading enwiki-20100312-pages-meta-history.xml.7z :-) and will
> give this a crack!

Please have a look at a smaller wiki for testing. The project at
git://github.com/sbober/levitation-perl.git provides, in its branches,
several ways to parse the XML and to generate the fast-import input.

bye,
Sebastian

* Re: Git import of the recent full enwiki dump
From: Richard Hartmann @ 2010-04-17 1:10 UTC
To: Sverre Rabbelier
Cc: Git List, Avery Pennarun, Nicolas Pitre, Shawn O. Pearce, Sam Vilain

On Sat, Apr 17, 2010 at 02:19, Sverre Rabbelier <srabbelier@gmail.com> wrote:

> Assuming you do the import incrementally using something like
> git-fast-import (feeding it with a custom exporter that uses the dump
> as its input), you shouldn't even need an extraordinary machine to do
> it (although you'd need a lot of storage).

I am using a Python script [1] to import the XML dump.

> Speaking of which, it might make sense to separate the worktree by
> prefix, so articles starting with "aa" go under the "aa" directory,
> etc.

Very good idea. What command would I need to send to git-fast-import to
do that?

> Hope that helps, and if you do convert it (and it turns out to be
> usable, and you decide to keep it up to date somehow), put it up
> somewhere! :)

It did.
I will make it available if it turns out to be useful. Keeping it up to
date might be harder unless they keep on releasing new (incremental)
snapshots.

Thanks,
Richard

[1] http://github.com/scy/levitation/blob/master/import.py

* Re: Git import of the recent full enwiki dump 2010-04-17 1:10 ` Richard Hartmann @ 2010-04-17 1:18 ` Shawn O. Pearce 2010-04-17 1:25 ` Sebastian Bober 1 sibling, 0 replies; 12+ messages in thread From: Shawn O. Pearce @ 2010-04-17 1:18 UTC (permalink / raw) To: Richard Hartmann Cc: Sverre Rabbelier, Git List, Avery Pennarun, Nicolas Pitre, Sam Vilain Richard Hartmann <richih.mailinglist@gmail.com> wrote: > On Sat, Apr 17, 2010 at 02:19, Sverre Rabbelier <srabbelier@gmail.com> wrote: > > Speaking of which, it might make sense to separate the > > worktree by prefix, so articles starting with "aa" go under the "aa" > > directory, etc? > > Very good idea. What command would I need to send to > git-fast-import to do that? When you send the 'M' command around line 479, just set the filename to 'aa/aardvark' or whatever it is. fast-import will automatically create directories by splitting on forward slashes. -- Shawn. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Git import of the recent full enwiki dump
From: Sebastian Bober @ 2010-04-17 1:25 UTC
To: Richard Hartmann
Cc: Sverre Rabbelier, Git List, Avery Pennarun, Nicolas Pitre, Shawn O. Pearce, Sam Vilain

On Sat, Apr 17, 2010 at 03:10:56AM +0200, Richard Hartmann wrote:
> On Sat, Apr 17, 2010 at 02:19, Sverre Rabbelier <srabbelier@gmail.com> wrote:
>
> > Assuming you do the import incrementally using something like
> > git-fast-import (feeding it with a custom exporter that uses the
> > dump as its input), you shouldn't even need an extraordinary machine
> > to do it (although you'd need a lot of storage).
>
> I am using a Python script [1] to import the XML dump.

There is also a version available at (plug):

git://github.com/sbober/levitation-perl.git

It is a bit faster and consumes less memory (and is written in Perl).
But that, too, will not be able to handle enwiki at the moment.

> > Speaking of which, it might make sense to separate the worktree by
> > prefix, so articles starting with "aa" go under the "aa" directory,
> > etc.
>
> Very good idea. What command would I need to send to git-fast-import
> to do that?

levitation does that already.

> > Hope that helps, and if you do convert it (and it turns out to be
> > usable, and you decide to keep it up to date somehow), put it up
> > somewhere! :)
>
> It did.
> I will make it available if it turns out to be useful. Keeping it up
> to date might be harder unless they keep on releasing new
> (incremental) snapshots.

If desired, I could produce git-fast-import input files for a larger
wiki (like the German or Japanese Wikipedia), so that other people can
have a look at the performance.

bye,
Sebastian
