From: Jeff King <peff@peff.net>
To: Matthieu Moy <Matthieu.Moy@grenoble-inp.fr>
Cc: "Arnaud Lacurie" <arnaud.lacurie@ensimag.imag.fr>,
	git@vger.kernel.org,
	"Jérémie Nikaes" <jeremie.nikaes@ensimag.imag.fr>,
	"Claire Fousse" <claire.fousse@ensimag.imag.fr>,
	"David Amouyal" <david.amouyal@ensimag.imag.fr>,
	"Sylvain Boulmé" <sylvain.boulme@imag.fr>
Subject: Re: [RFC/PATCH] Added a remote helper to interact with mediawiki, pull & clone handled
Date: Thu, 2 Jun 2011 23:43:48 -0400
Message-ID: <20110603034348.GA1371@sigill.intra.peff.net>
In-Reply-To: <vpqy61jami7.fsf@bauges.imag.fr>

On Fri, Jun 03, 2011 at 12:37:04AM +0200, Matthieu Moy wrote:

> The idea is that we ultimately want to be able to import a subset of a
> large wiki. In Wikipedia, for example, "show me revisions since N" will
> be very large after a few minutes. OTOH, "show me revisions touching the
> few pages I'm following" should be fast. And at least, it's O(imported
> wiki size), not O(complete wiki size)

Yeah, I think what you want to do is dependent on wiki size. For a small
wiki, it doesn't matter; all pages is not much. For a large wiki, you
want a subset of the pages, and you _never_ want to do any operations on
the whole page space. In the middle are medium-sized wikis, where you
would like to look at the whole page space, but ideally not in O(number
of pages).

But the point is somewhat moot, because having just read through the
MediaWiki API, I've come to the conclusion (which seems familiar
from the last time I looked at this problem) that there is no way to ask
for what I want in a single query. That is, to say "show me all
revisions of all pages matching some subset X, that have been modified
since revision N". Or even "show me all pages matching some subset X
that have been modified since revision N", and then we could at least
cull the pages that haven't been touched.

But AFAICT, none of those is possible. I think we are stuck asking for
each page's information individually. (You can query multiple pages'
revision information simultaneously, but you get only a single
revision from each in that case. There's not even a way to say "get me
the latest revision number for all of these pages".)
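
To make the per-page approach concrete, here is a rough sketch that
builds one query URL per page. The parameter names (prop=revisions,
rvstartid, rvdir, rvlimit, rvprop) come from the MediaWiki API; the
function names and the example.org endpoint are made up for
illustration, and the exact semantics of rvstartid should be checked
against the API documentation for the server version in use.

```python
from urllib.parse import urlencode


def build_revision_query(api_url, title, since_rev_id, limit=50):
    """Build a URL asking for one page's revisions, oldest first.

    Hypothetical helper: rvstartid/rvlimit only work when a single
    page is named, which is why there is one query per page.
    """
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvstartid": since_rev_id,  # start enumerating at this revision
        "rvdir": "newer",           # walk forward in time
        "rvlimit": limit,
        "rvprop": "ids|timestamp",
        "format": "json",
    }
    return api_url + "?" + urlencode(params)


def build_all_queries(api_url, titles, since_rev_id):
    """One URL per tracked page; cost grows with the number of pages."""
    return [build_revision_query(api_url, t, since_rev_id) for t in titles]
```

The number of requests is proportional to the number of pages being
followed, not to the size of the whole wiki.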

One thing we could do to reduce the total run-time is to issue several
queries in parallel so that the query latency isn't so dominant. I
don't know what a good level of parallelism is for a server like
Wikipedia, though. I'm sure they don't appreciate users hammering the
servers too hard. Ideally you want just enough queries outstanding that
the remote server is always working on _one_, and the rest are doing
something else (traveling across the network, local processing and
storage, etc). But I'm not sure of a good way to measure that.
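
Something like the following would cap the number of outstanding
queries. It is only a sketch: fetch_all and fetch_one are hypothetical
names, fetch_one stands in for whatever performs a single per-page
request, and the default of 4 in flight is an arbitrary placeholder,
not a measurement of what any server tolerates.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(titles, fetch_one, max_in_flight=4):
    """Run fetch_one(title) for every title, at most max_in_flight at once.

    Returns a dict mapping each title to its result. An exception
    raised by any fetch_one call propagates to the caller.
    """
    titles = list(titles)
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # map() keeps results in the same order as the input titles.
        results = pool.map(fetch_one, titles)
        return dict(zip(titles, results))
```

Tuning max_in_flight would still need the kind of measurement
discussed above; the pool only enforces the ceiling.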

> but let's not be too ambitious for now: it's a student's project,
> completing one week from now, and the goal is to have something clean
> and extensible. Bells and whistles will come later ;-).

Yes. I think all of this is outside the scope of a student project. I
just like to dream. :)

-Peff
