* Re: [Foundation-l] Wikipedia meets git
[not found] ` <5396c0d10910210543i4c0a3350je5bee4c6389a2292@mail.gmail.com>
@ 2009-10-21 19:49 ` Bernie Innocenti
2009-10-21 20:08 ` jamesmikedupont
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Bernie Innocenti @ 2009-10-21 19:49 UTC (permalink / raw)
To: Samuel Klein; +Cc: Wikimedia Foundation Mailing List, git
[cc+=git@vger.kernel.org]

On Wed, 21-10-2009 at 08:43 -0400, Samuel Klein wrote:
> That sounds like a great idea. I know a few other people who have
> worked on git-based wikis and toyed with making them compatible with
> mediawiki (copying bernie innocenti, one of the most eloquent :).

Then I'll do my best to sound as eloquent as expected :)

While I think git's internal structure is wonderfully simple and
elegant, I'm a little worried about its scalability in the wiki use
case.

The scenario for which git's repository format was designed is "patch
oriented" revision control of a filesystem tree. The central object of
a git repository is the "commit", which represents a set of changes to
multiple files. I'll skip the juicy details of how those changes are
actually packed together to save disk space, making git's repository
format amazingly compact.

Commits are linked to each other to represent the history. Git can
efficiently represent a highly non-linear history with thousands of
branches, each containing hundreds of thousands of revisions. Branching
and merging huge trees is so fast that one is left wondering whether
anything has happened at all.

So far, so good. This commit-oriented design is great if you want to
track the history of *the whole tree* at once, applying related changes
to multiple files atomically. In Git, as in most other version control
systems, there's no such thing as a *file* revision! Git manages entire
trees. Trees are assigned unique revision identifiers (in fact, ugly
SHA-1 hashes), and can optionally be tagged or branched at will.
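
To make this concrete, here's roughly what a commit object contains (an
illustrative sketch; the hashes and identities are made up):

$ git cat-file -p HEAD
tree 9bedf678...
parent 3fa8c4b0...
author A. Hacker <hacker@example.org> 1256150000 +0200
committer A. Hacker <hacker@example.org> 1256150000 +0200

Fix typo in README

The commit points at a single tree object (a snapshot of the whole
project) plus its parent commits; no per-file history appears anywhere.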

And here's the catch: the history of individual files is not directly
represented in a git repository. It is typically scattered across
thousands of commit objects, with no direct links to help find them. If
you want to retrieve the log of a file that was changed only 6 times in
the entire history of the Linux kernel, you'd have to dig through *all*
of the 170K revisions on the "master" branch.

And it takes some time, even though git is blazingly fast:

bernie@giskard:~/src/kernel/linux-2.6$ time git log --pretty=oneline REPORTING-BUGS | wc -l
6
real 0m1.668s
user 0m1.416s
sys 0m0.210s

(My laptop has a low-power CPU; a fast server would be 8-10x faster.)
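
(The same walk is visible with the plumbing command that underlies "git
log"; it should report the same handful of commits:

$ git rev-list master -- REPORTING-BUGS | wc -l
6

Either way, the cost grows with the total number of commits on the
branch, not with the number of revisions of that one file.)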

Now, the English Wikipedia seems to have slightly more than 3M
articles, with -- how many? -- tens of millions of revisions for sure.
Going through them *every time* one needs to consult the history of a
file would be 100x slower. Tens of seconds. Not acceptable, huh?

It seems to me that the typical usage pattern of an encyclopedia is to
change each article individually. Perhaps I'm underestimating the role
of bots here. Anyway, there's no consistency *requirement* for mass
changes to be applied atomically throughout the whole encyclopedia,
right?

In conclusion, the "tree at a time" design is going to be a performance
bottleneck for a large wiki, with no compensating benefit. Unless, of
course, the concept of changesets were exposed in the UI, which would
be an interesting idea to explore.

Mercurial (Hg) seems to have a better repository layout for the "one
file at a time" access pattern... Unfortunately, it's also much slower
than git for almost any other purpose, sometimes by an order of
magnitude. I'm not even sure how well Hg would cope with a repository
containing 3M files and some 30M revisions. The largest Hg tree I've
dealt with is the mozilla-central repo, which is already unbearably
slow to work with.

It would be interesting to compare notes with the other DSCM hackers,
too.

--
// Bernie Innocenti - http://codewiz.org/
\X/ Sugar Labs - http://sugarlabs.org/
* Re: [Foundation-l] Wikipedia meets git
2009-10-21 19:49 ` [Foundation-l] Wikipedia meets git Bernie Innocenti
@ 2009-10-21 20:08 ` jamesmikedupont
2009-10-21 23:36 ` David Gerard
2009-10-21 20:31 ` [Foundation-l] " Avery Pennarun
2009-10-21 21:05 ` Nicolas Pitre
2 siblings, 1 reply; 6+ messages in thread
From: jamesmikedupont @ 2009-10-21 20:08 UTC (permalink / raw)
To: Bernie Innocenti; +Cc: Samuel Klein, Wikimedia Foundation Mailing List, git
Wow, I am impressed.

Let me remind you of one thing: most people are working on very small
subsets of the data. Very few people will want to have all the data;
think about getting all the versions from all the git repos: it would
be the same.

My idea is for smaller chapters, towns, or regions that want to get
started easily to host their own branches of relevant data.

Given a world full of such servers, the sum would be great, but the
individual branches needed at any one time would be small.

mike

On Wed, Oct 21, 2009 at 9:49 PM, Bernie Innocenti <bernie@codewiz.org> wrote:
> [cc+=git@vger.kernel.org]
>
> On Wed, 21-10-2009 at 08:43 -0400, Samuel Klein wrote:
>> That sounds like a great idea. I know a few other people who have
>> worked on git-based wikis and toyed with making them compatible with
>> mediawiki (copying bernie innocenti, one of the most eloquent :).
>
> Then I'll do my best to sound as eloquent as expected :)
>
> While I think git's internal structure is wonderfully simple and
> elegant, I'm a little worried about its scalability in the wiki use
> case.
>
> The scenario for which git's repository format was designed is "patch
> oriented" revision control of a filesystem tree. The central object of
> a git repository is the "commit", which represents a set of changes to
> multiple files. I'll skip the juicy details of how those changes are
> actually packed together to save disk space, making git's repository
> format amazingly compact.
>
> Commits are linked to each other to represent the history. Git can
> efficiently represent a highly non-linear history with thousands of
> branches, each containing hundreds of thousands of revisions.
> Branching and merging huge trees is so fast that one is left wondering
> whether anything has happened at all.
>
> So far, so good. This commit-oriented design is great if you want to
> track the history of *the whole tree* at once, applying related
> changes to multiple files atomically. In Git, as in most other version
> control systems, there's no such thing as a *file* revision! Git
> manages entire trees. Trees are assigned unique revision identifiers
> (in fact, ugly SHA-1 hashes), and can optionally be tagged or branched
> at will.
>
> And here's the catch: the history of individual files is not directly
> represented in a git repository. It is typically scattered across
> thousands of commit objects, with no direct links to help find them.
> If you want to retrieve the log of a file that was changed only 6
> times in the entire history of the Linux kernel, you'd have to dig
> through *all* of the 170K revisions on the "master" branch.
>
> And it takes some time, even though git is blazingly fast:
>
> bernie@giskard:~/src/kernel/linux-2.6$ time git log --pretty=oneline REPORTING-BUGS | wc -l
> 6
>
> real 0m1.668s
> user 0m1.416s
> sys 0m0.210s
>
> (My laptop has a low-power CPU; a fast server would be 8-10x faster.)
>
>
> Now, the English Wikipedia seems to have slightly more than 3M
> articles, with -- how many? -- tens of millions of revisions for sure.
> Going through them *every time* one needs to consult the history of a
> file would be 100x slower. Tens of seconds. Not acceptable, huh?
>
> It seems to me that the typical usage pattern of an encyclopedia is to
> change each article individually. Perhaps I'm underestimating the role
> of bots here. Anyway, there's no consistency *requirement* for mass
> changes to be applied atomically throughout the whole encyclopedia,
> right?
>
> In conclusion, the "tree at a time" design is going to be a
> performance bottleneck for a large wiki, with no compensating benefit.
> Unless, of course, the concept of changesets were exposed in the UI,
> which would be an interesting idea to explore.
>
> Mercurial (Hg) seems to have a better repository layout for the "one
> file at a time" access pattern... Unfortunately, it's also much slower
> than git for almost any other purpose, sometimes by an order of
> magnitude. I'm not even sure how well Hg would cope with a repository
> containing 3M files and some 30M revisions. The largest Hg tree I've
> dealt with is the mozilla-central repo, which is already unbearably
> slow to work with.
>
> It would be interesting to compare notes with the other DSCM hackers,
> too.
>
> --
> // Bernie Innocenti - http://codewiz.org/
> \X/ Sugar Labs - http://sugarlabs.org/
* Re: [Foundation-l] Wikipedia meets git
2009-10-21 19:49 ` [Foundation-l] Wikipedia meets git Bernie Innocenti
2009-10-21 20:08 ` jamesmikedupont
@ 2009-10-21 20:31 ` Avery Pennarun
2009-10-21 21:05 ` Nicolas Pitre
2 siblings, 0 replies; 6+ messages in thread
From: Avery Pennarun @ 2009-10-21 20:31 UTC (permalink / raw)
To: Bernie Innocenti; +Cc: Samuel Klein, Wikimedia Foundation Mailing List, git
On Wed, Oct 21, 2009 at 3:49 PM, Bernie Innocenti <bernie@codewiz.org> wrote:
> And here's the catch: the history of individual files is not directly
> represented in a git repository. It is typically scattered across
> thousands of commit objects, with no direct links to help find them.
> If you want to retrieve the log of a file that was changed only 6
> times in the entire history of the Linux kernel, you'd have to dig
> through *all* of the 170K revisions on the "master" branch.
>
> And it takes some time, even though git is blazingly fast:
>
> bernie@giskard:~/src/kernel/linux-2.6$ time git log --pretty=oneline REPORTING-BUGS | wc -l
> 6
>
> real 0m1.668s
> user 0m1.416s
> sys 0m0.210s
>
> (My laptop has a low-power CPU; a fast server would be 8-10x faster.)
>
>
> Now, the English Wikipedia seems to have slightly more than 3M
> articles, with -- how many? -- tens of millions of revisions for sure.
> Going through them *every time* one needs to consult the history of a
> file would be 100x slower. Tens of seconds. Not acceptable, huh?

I think this slowness could be overcome with a simple cache mapping
filename -> commit-id list, right?

That is, you run some variant of "git log --name-only" and, for each
file changed by each commit, append an entry to that file's commit
list. A commit hook keeps the cache up to date from then on. When you
want to view the history of a particular file, you simply retrieve the
commits on that file's list and no others.

It sounds like such a cache could be implemented quite easily outside
of git itself.
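
A rough sketch of the idea, assuming a flat "path TAB commit-id" text
file as the cache (cache format and file names invented for
illustration):

# build the cache once: one "path<TAB>commit" line per file change
git log --name-only --pretty=format:'@%H' |
awk '/^@/ { c = substr($0, 2); next } NF { print $0 "\t" c }' \
    > /tmp/path-commit.cache

# history of one file, without walking every commit on the branch
awk -F'\t' '$1 == "REPORTING-BUGS" { print $2 }' /tmp/path-commit.cache

A post-commit hook would append the new commit's entries to keep the
cache current.
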
Would that help?

That said, I'll bet you'll find other performance glitches when you
import millions of files and tens or hundreds of millions of commits.
But we probably won't know what those problems are until someone
imports them :)

Have fun,
Avery
* Re: [Foundation-l] Wikipedia meets git
2009-10-21 19:49 ` [Foundation-l] Wikipedia meets git Bernie Innocenti
2009-10-21 20:08 ` jamesmikedupont
2009-10-21 20:31 ` [Foundation-l] " Avery Pennarun
@ 2009-10-21 21:05 ` Nicolas Pitre
2 siblings, 0 replies; 6+ messages in thread
From: Nicolas Pitre @ 2009-10-21 21:05 UTC (permalink / raw)
To: Bernie Innocenti; +Cc: Samuel Klein, Wikimedia Foundation Mailing List, git
On Wed, 21 Oct 2009, Bernie Innocenti wrote:
> And here's the catch: the history of individual files is not directly
> represented in a git repository. It is typically scattered across
> thousands of commit objects, with no direct links to help find them.
> If you want to retrieve the log of a file that was changed only 6
> times in the entire history of the Linux kernel, you'd have to dig
> through *all* of the 170K revisions on the "master" branch.
>
> And it takes some time, even though git is blazingly fast:
>
> bernie@giskard:~/src/kernel/linux-2.6$ time git log --pretty=oneline REPORTING-BUGS | wc -l
> 6
>
> real 0m1.668s
> user 0m1.416s
> sys 0m0.210s
>
> (My laptop has a low-power CPU; a fast server would be 8-10x faster.)
>
>
> Now, the English Wikipedia seems to have slightly more than 3M
> articles, with -- how many? -- tens of millions of revisions for sure.
> Going through them *every time* one needs to consult the history of a
> file would be 100x slower. Tens of seconds. Not acceptable, huh?
>
> It seems to me that the typical usage pattern of an encyclopedia is to
> change each article individually. Perhaps I'm underestimating the role
> of bots here. Anyway, there's no consistency *requirement* for mass
> changes to be applied atomically throughout the whole encyclopedia,
> right?

You certainly don't need to put all files in the same tree, then.
Splitting the whole thing along sections that are unlikely to overlap
would be the way to go: subsections could have their own branches with
no other files in them, or you could even rely on Git submodules. The
partitioning doesn't have to be either of the two extremes, one branch
per file à la CVS or all files in a single branch/tree as Git does by
default.
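
A minimal sketch of the submodule variant (repository names invented
for illustration):

git init encyclopedia
cd encyclopedia
git submodule add git://example.org/articles-a-m.git a-m
git submodule add git://example.org/articles-n-z.git n-z
git commit -m 'partition articles into per-section submodules'

Each section keeps its own independent history, and a client clones
only the sections it actually needs.
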
Nicolas
* Re: [Foundation-l] Wikipedia meets git
2009-10-21 20:08 ` jamesmikedupont
@ 2009-10-21 23:36 ` David Gerard
[not found] ` <fbad4e140910211636hd772962x4535ccbda6faa3c7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 6+ messages in thread
From: David Gerard @ 2009-10-21 23:36 UTC (permalink / raw)
To: Wikimedia Foundation Mailing List; +Cc: Bernie Innocenti, git
2009/10/21 jamesmikedupont@googlemail.com <jamesmikedupont@googlemail.com>:
> most people are working on very small subsets of the data. Very few
> people will want to have all the data; think about getting all the
> versions from all the git repos: it would be the same.
> My idea is for smaller chapters, towns, or regions that want to get
> started easily to host their own branches of relevant data.
> Given a world full of such servers, the sum would be great, but the
> individual branches needed at any one time would be small.

A distributed backend is a nice idea anyway - imagine a meteor hitting
the Florida data centres ...

And there are third-party users who could benefit from a highly
distributed backend, such as Wikileaks.

This thread should probably move to mediawiki-l ...

- d.
* Re: Wikipedia meets git
[not found] ` <fbad4e140910211636hd772962x4535ccbda6faa3c7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-10-22 6:27 ` jamesmikedupont-gM/Ye1E23mwN+BqQ9rBEUg
0 siblings, 0 replies; 6+ messages in thread
From: jamesmikedupont-gM/Ye1E23mwN+BqQ9rBEUg @ 2009-10-22 6:27 UTC (permalink / raw)
To: David Gerard; +Cc: Bernie Innocenti, Wikimedia Foundation Mailing List, git
Ok, I have started a Google group called mediawiki-vcs:

http://groups.google.com/group/mediawiki-vcs

We should just move the discussion there.

Additionally, I did not name it git but vcs, because we should support
multiple backends via plugins. I am interested in using git because I
think git is great, but others should be free to use CVS if they feel
it is needed.

mike

On Thu, Oct 22, 2009 at 1:36 AM, David Gerard <dgerard-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> 2009/10/21 jamesmikedupont-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org <jamesmikedupont-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org>:
>
>> most people are working on very small subsets of the data. Very few
>> people will want to have all the data; think about getting all the
>> versions from all the git repos: it would be the same.
>> My idea is for smaller chapters, towns, or regions that want to get
>> started easily to host their own branches of relevant data.
>> Given a world full of such servers, the sum would be great, but the
>> individual branches needed at any one time would be small.
>
>
> A distributed backend is a nice idea anyway - imagine a meteor hitting
> the Florida data centres ...
>
> And there are third-party users who could benefit from a highly
> distributed backend, such as Wikileaks.
>
> This thread should probably move to mediawiki-l ...
>
>
> - d.