From: Jeff King <peff@peff.net>
To: Arnaud Lacurie <arnaud.lacurie@ensimag.imag.fr>
Cc: git@vger.kernel.org,
	"Jérémie Nikaes" <jeremie.nikaes@ensimag.imag.fr>,
	"Claire Fousse" <claire.fousse@ensimag.imag.fr>,
	"David Amouyal" <david.amouyal@ensimag.imag.fr>,
	"Matthieu Moy" <matthieu.moy@grenoble-inp.fr>,
	"Sylvain Boulmé" <sylvain.boulme@imag.fr>
Subject: Re: [RFC/PATCH] Added a remote helper to interact with mediawiki, pull & clone handled
Date: Thu, 2 Jun 2011 13:03:27 -0400
Message-ID: <20110602170327.GA2928@sigill.intra.peff.net>
In-Reply-To: <1307006911-4326-1-git-send-email-arnaud.lacurie@ensimag.imag.fr>

On Thu, Jun 02, 2011 at 11:28:31AM +0200, Arnaud Lacurie wrote:

> +sub mw_import {
> [...]
> +		# Get 500 revisions at a time due to the mediawiki api limit
> +		while (1) {
> +			my $result = $mediawiki->api($query);
> +
> +			# Parse each of those 500 revisions
> +			foreach my $revision (@{$result->{query}->{pages}->{$id}->{revisions}}) {
> +				my $page_rev_ids;
> +				$page_rev_ids->{pageid} = $page->{pageid};
> +				$page_rev_ids->{revid} = $revision->{revid};
> +				push (@revisions, $page_rev_ids);
> +				$revnum++;
> +			}
> +			last unless $result->{'query-continue'};
> +			$query->{rvstartid} = $result->{'query-continue'}->{revisions}->{rvstartid};
> +			print "\n";
> +		}

What is this newline at the end here for? With it, my import reliably
fails with:

  fatal: Unsupported command: 
  fast-import: dumping crash report to .git/fast_import_crash_6091

Removing it seems to make things work.

> +		my $user = $rev->{user} || 'Anonymous';
> +		my $dt = DateTime::Format::ISO8601->parse_datetime($rev->{timestamp});
> +
> +		my $comment = defined $rev->{comment} ? $rev->{comment} : '*Empty MediaWiki Message*';

In importing the git wiki, I ran into an empty timestamp. This throws an
exception which kills the whole import:

  $ git clone mediawiki::https://git.wiki.kernel.org/ git-wiki
  2821/7949: Revision n°4210 of GitSurvey
  Invalid date format:  at /home/peff/compile/git/contrib/mw-to-git/git-remote-mediawiki line 195
          main::mw_import('https://git.wiki.kernel.org/') called at /home/peff/compile/git/contrib/mw-to-git/git-remote-mediawiki line 42

At the very least, we should intercept this and put in some placeholder
timestamp. I'm not sure what the best placeholder would be. Maybe use
the date from the previous revision, plus one second? Or maybe there is
some other bug causing us to have an empty timestamp. I didn't dig
deeper yet.
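
To make the "previous revision plus one second" idea concrete, here is a
rough sketch. It is not taken from the patch: parse_mw_timestamp and
revision_epoch are hypothetical helper names, and it assumes timestamps
arrive in the usual "YYYY-MM-DDTHH:MM:SSZ" form. It uses only core
modules rather than DateTime, and it returns undef on an empty or
malformed timestamp instead of throwing.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical helper: convert a MediaWiki-style timestamp
# ("YYYY-MM-DDTHH:MM:SSZ") to epoch seconds. Returns undef for an
# empty, missing, or malformed string instead of dying.
sub parse_mw_timestamp {
    my ($ts) = @_;
    return unless defined $ts;
    my ($y, $mo, $d, $h, $mi, $s) =
        $ts =~ /^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})Z$/
        or return;
    # timegm croaks on out-of-range fields (e.g. month 13), so trap that.
    my $epoch = eval { timegm($s, $mi, $h, $d, $mo - 1, $y) };
    return $epoch;
}

# Hypothetical helper: choose an epoch for a revision. If its own
# timestamp is unusable, fall back to "previous revision + 1 second",
# or to 0 when there is no previous revision at all.
sub revision_epoch {
    my ($ts, $prev_epoch) = @_;
    my $epoch = parse_mw_timestamp($ts);
    return $epoch if defined $epoch;
    return defined $prev_epoch ? $prev_epoch + 1 : 0;
}

# Example: the second revision has an empty timestamp.
my $prev;
for my $ts ('2011-06-02T17:03:27Z', '', '2011-06-02T17:05:00Z') {
    $prev = revision_epoch($ts, $prev);
    print "$prev +0000\n";    # "<epoch> <tz>" as used in committer lines
}
```

Whether a synthesized date is better than skipping the revision is a
separate question; the sketch only shows that the import does not have
to die.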

> +		# mediawiki revision number in the git note
> +		my $note_comment = encode_utf8("note added by git-mediawiki");
> +		my $note_comment_length = bytes::length($note_comment);
> +		my $note_content = encode_utf8("mediawiki_revision: " . $pagerevids->{revid} . "\n");
> +		my $note_content_length = bytes::length($note_content);
> +
> +		if ($fetch_from == 1 && $n == 1) {
> +			print "reset refs/notes/commits\n";
> +		}
> +		print "commit refs/notes/commits\n";

Should these go in refs/notes/commits? I don't think we have a "best
practices" yet for the notes namespaces, as it is still a relatively new
concept. But I always thought "refs/notes/commits" would be for the
user's "regular" notes, and that programmatic things would get their own
notes, like "refs/notes/mediawiki".

That wouldn't show them by default, but you could do:

  git log --notes=mediawiki

to see them (and maybe that is a feature, because most of the time you
won't care about the mediawiki revision).
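
For illustration, a sketch of what emitting the note under a dedicated
ref might look like. The ref name "refs/notes/mediawiki" is only the
suggestion above, note_commit_stream is a hypothetical helper, and the
committer identity and mark number are placeholders. It builds the
fast-import commands as a string (a "commit" on the notes ref followed
by an "N inline" notemodify) rather than printing them directly.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Assumed ref name, following the suggestion in this mail.
my $NOTES_REF = 'refs/notes/mediawiki';

# Hypothetical helper: build the fast-import commands that attach a
# "mediawiki_revision: N" note to the commit identified by a mark.
sub note_commit_stream {
    my ($commit_mark, $revid, $committer, $epoch) = @_;
    my $msg  = encode_utf8("note added by git-mediawiki");
    my $note = encode_utf8("mediawiki_revision: $revid\n");
    # encode_utf8 returns a byte string, so length() counts bytes here.
    return join '',
        "commit $NOTES_REF\n",
        "committer $committer $epoch +0000\n",
        "data " . length($msg) . "\n", $msg, "\n",
        "N inline :$commit_mark\n",
        "data " . length($note) . "\n", $note, "\n";
}

# Example with placeholder values.
print note_commit_stream(1, 4210, 'Example <example@example.com>', 1307034207);
```

With the notes on their own ref, "git log" stays quiet by default and
"git log --notes=mediawiki" shows the revision ids on demand.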

> +		} else {
> +			print STDERR "You appear to have cloned an empty mediawiki\n";
> +			#What do we have to do here ? If nothing is done, an error is thrown saying that
> +			#HEAD is refering to unknown object 0000000000000000000
> +		}

Hmm. We do allow cloning empty git repos. It might be nice for there to
be some way for a remote helper to signal "everything OK, but the result
is empty". But I think that is probably something that needs to be added
to the remote-helper protocol, and so is outside the scope of your
script (maybe it is as simple as interpreting the null sha1 as "empty";
I dunno).

Overall, it's looking pretty good. I like that I can resume a
half-finished import via "git fetch". Though I do have one complaint:
running "git fetch" fetches the metainfo for every revision of every
page, just as it does for an initial clone. Is there something in the
mediawiki API to say "show me revisions since N" (where N would be the
mediawiki revision of the tip of what we imported)?
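
One possible shape for such a query, as a sketch only: I believe the
prop=revisions module accepts rvdir=newer together with rvstartid (the
patch already uses rvstartid for continuation), but I have not checked
how it behaves on this wiki. revisions_since_query is a hypothetical
helper, and it assumes we remember the last imported revision id per
page; since rvstartid is inclusive, the caller would skip the first
result.

```perl
use strict;
use warnings;

# Hypothetical helper: build a query hash asking for the revisions of
# one page, oldest first, starting at an already-imported revision id.
# Assumption: $last_revid is a revision of that same page.
sub revisions_since_query {
    my ($page_id, $last_revid) = @_;
    return {
        action    => 'query',
        prop      => 'revisions',
        pageids   => $page_id,
        rvprop    => 'ids|timestamp|comment|user',
        rvlimit   => 500,          # same batch size as the patch
        rvdir     => 'newer',      # enumerate oldest first
        rvstartid => $last_revid,  # inclusive; skip the first result
    };
}

# Example with placeholder ids; the hash would be passed to
# $mediawiki->api(...) as in the patch.
my $query = revisions_since_query(42, 4210);
print join(', ', map { "$_=$query->{$_}" } sort keys %$query), "\n";
```

That still costs one request per page, so it only helps if the list of
changed pages can be narrowed down some other way.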

-Peff
