From: Avery Pennarun <apenwarr@gmail.com>
To: Bernie Innocenti <bernie@codewiz.org>
Cc: Samuel Klein <meta.sj@gmail.com>,
	Wikimedia Foundation Mailing List 
	<foundation-l@lists.wikimedia.org>, git <git@vger.kernel.org>
Subject: Re: [Foundation-l] Wikipedia meets git
Date: Wed, 21 Oct 2009 16:31:20 -0400
Message-ID: <32541b130910211331n4f65c2d4ga76ac90816fe45d@mail.gmail.com>
In-Reply-To: <1256154567.1477.87.camel@giskard>

On Wed, Oct 21, 2009 at 3:49 PM, Bernie Innocenti <bernie@codewiz.org> wrote:
> And here's the catch: the history of individual files is not
> directly represented in a git repository. It is typically scattered
> across thousands of commit objects, with no direct links to help find
> them. If you want to retrieve the log of a file that was changed only 6
> times in the entire history of the Linux kernel, you'd have to dig
> through *all* of the 170K revisions in the "master" branch.
>
> And it takes some time even if git is blazingly fast:
>
>  bernie@giskard:~/src/kernel/linux-2.6$ time git log  --pretty=oneline REPORTING-BUGS  | wc -l
>  6
>
>  real   0m1.668s
>  user   0m1.416s
>  sys    0m0.210s
>
> (my laptop has a low-power CPU. A fast server would be 8-10x faster).
>
>
> Now, the English Wikipedia seems to have slightly more than 3M articles,
> with--how many? tens of millions of revisions for sure. Going through
> them *every time* one needs to consult the history of a file would be
> 100x slower. Tens of seconds. Not acceptable, uh?

I think this slowness could be overcome using a simple cache of
filename -> commitid list, right?

That is, you run some variant of "git log --name-only" and, for each
file changed by each commit, add an element to the commit list for
that file.  When committing in the future, use a hook that updates the
cache.  When you want to view the history of a particular file, simply
retrieve only the commits in that file's commit list and skip all the
others.

It sounds like such a cache could be implemented quite easily outside
of git itself.
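
To make the idea concrete, here is a rough sketch of such a cache in
Python. It is only an illustration under some assumptions: the function
names (parse_name_only_log, build_index, record_commit) are made up for
this example, the index lives in a plain in-memory dict rather than on
disk, and the "commit " marker in the log format would be confused by a
file whose name starts with that word. By default "git log --name-only"
lists no files for merge commits and does not follow renames, so a real
tool would need to decide how to treat those.

```python
import subprocess
from collections import defaultdict

COMMIT_PREFIX = "commit "  # marker emitted by the --pretty format below


def parse_name_only_log(log_text):
    """Turn `git log --name-only --pretty=format:'commit %H'` output into
    {filename: [commit ids, newest first]}."""
    index = defaultdict(list)
    current = None
    for line in log_text.splitlines():
        if line.startswith(COMMIT_PREFIX):
            # Start of a new commit; remember its id.
            current = line[len(COMMIT_PREFIX):].strip()
        elif line.strip() and current is not None:
            # Any other non-blank line is a path changed by that commit.
            index[line].append(current)
    return dict(index)


def build_index(repo_path="."):
    """One-time full scan of the history (the slow step, done once)."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only",
         "--pretty=format:commit %H"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_name_only_log(out)


def record_commit(index, commit_id, changed_files):
    """What a post-commit hook could do: put the new commit at the front
    of the list for every file it touched."""
    for name in changed_files:
        index.setdefault(name, []).insert(0, commit_id)
```

Looking up a file's history is then just index.get("REPORTING-BUGS", []),
and each id can be passed to "git show" without walking the rest of the
history.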

Would that help?

That said, I'll bet you find other performance glitches when you
import millions of files and tens/hundreds of millions of commits.
But we probably won't know what those problems are until someone
imports them :)

Have fun,

Avery
