From: Jeff King <peff@peff.net>
To: Arnaud Lacurie <arnaud.lacurie@ensimag.imag.fr>
Cc: git@vger.kernel.org,
"Jérémie Nikaes" <jeremie.nikaes@ensimag.imag.fr>,
"Claire Fousse" <claire.fousse@ensimag.imag.fr>,
"David Amouyal" <david.amouyal@ensimag.imag.fr>,
"Matthieu Moy" <matthieu.moy@grenoble-inp.fr>,
"Sylvain Boulmé" <sylvain.boulme@imag.fr>
Subject: Re: [RFC/PATCH] Added a remote helper to interact with mediawiki, pull & clone handled
Date: Thu, 2 Jun 2011 13:03:27 -0400
Message-ID: <20110602170327.GA2928@sigill.intra.peff.net>
In-Reply-To: <1307006911-4326-1-git-send-email-arnaud.lacurie@ensimag.imag.fr>
On Thu, Jun 02, 2011 at 11:28:31AM +0200, Arnaud Lacurie wrote:
> +sub mw_import {
> [...]
> +    # Get 500 revisions at a time due to the mediawiki api limit
> +    while (1) {
> +        my $result = $mediawiki->api($query);
> +
> +        # Parse each of those 500 revisions
> +        foreach my $revision (@{$result->{query}->{pages}->{$id}->{revisions}}) {
> +            my $page_rev_ids;
> +            $page_rev_ids->{pageid} = $page->{pageid};
> +            $page_rev_ids->{revid} = $revision->{revid};
> +            push (@revisions, $page_rev_ids);
> +            $revnum++;
> +        }
> +        last unless $result->{'query-continue'};
> +        $query->{rvstartid} = $result->{'query-continue'}->{revisions}->{rvstartid};
> +        print "\n";
> +    }

What is this newline at the end here for? With it, my import reliably
fails with:

  fatal: Unsupported command:
  fast-import: dumping crash report to .git/fast_import_crash_6091

Removing it seems to make things work.

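If the newline was meant as progress output, one option is to send it to stderr. The sketch below assumes the helper's stdout is what gets fed to fast-import, so a stray blank line there is read as an (empty) command. The name report_progress is hypothetical and not part of the patch.

```perl
use strict;
use warnings;

# Hypothetical helper: write human-readable progress to STDERR (or to any
# handle passed in), never to STDOUT, which is assumed to carry the
# fast-import stream.
sub report_progress {
    my ($message, $fh) = @_;
    $fh = \*STDERR unless defined $fh;
    print {$fh} "$message\n";
    return;
}

# Example: called once per batch inside the while loop, in place of
# the bare print "\n".
report_progress("fetched another batch of revisions");
```

The optional handle argument is only there so the output can be captured in a test.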
> + my $user = $rev->{user} || 'Anonymous';
> + my $dt = DateTime::Format::ISO8601->parse_datetime($rev->{timestamp});
> +
> + my $comment = defined $rev->{comment} ? $rev->{comment} : '*Empty MediaWiki Message*';

In importing the git wiki, I ran into an empty timestamp. This throws an
exception which kills the whole import:

  $ git clone mediawiki::https://git.wiki.kernel.org/ git-wiki
  2821/7949: Revision n°4210 of GitSurvey
  Invalid date format: at /home/peff/compile/git/contrib/mw-to-git/git-remote-mediawiki line 195
  main::mw_import('https://git.wiki.kernel.org/') called at /home/peff/compile/git/contrib/mw-to-git/git-remote-mediawiki line 42

At the very least, we should intercept this and put in some placeholder
timestamp. I'm not sure what the best placeholder would be. Maybe use
the date from the previous revision, plus one second? Or maybe there is
some other bug causing us to have an empty timestamp. I didn't dig
deeper yet.

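A minimal sketch of the "previous revision plus one second" fallback follows. To stay self-contained it parses the timestamp with a regex and the core Time::Local module rather than DateTime::Format::ISO8601, and it assumes timestamps look like "2011-06-02T17:03:27Z". The name revision_epoch is hypothetical.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical helper, not part of the patch: return an epoch time for a
# MediaWiki-style timestamp. If the timestamp is missing, empty, or does
# not parse, fall back to the previous revision's time plus one second
# instead of dying.
sub revision_epoch {
    my ($timestamp, $previous_epoch) = @_;
    if (defined $timestamp
        && $timestamp =~ /^(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z$/) {
        # timegm croaks on out-of-range fields, so guard it with eval.
        my $epoch = eval { timegm($6, $5, $4, $3, $2 - 1, $1) };
        return $epoch if defined $epoch;
    }
    return $previous_epoch + 1;
}

# Example: a good timestamp followed by an empty and a missing one.
my $previous = 0;
for my $ts ('2011-06-02T17:03:27Z', '', undef) {
    $previous = revision_epoch($ts, $previous);
    print STDERR "using epoch $previous\n";
}
```

This only papers over the symptom. If the empty timestamp comes from some other bug, the placeholder would hide it.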
> +    # mediawiki revision number in the git note
> +    my $note_comment = encode_utf8("note added by git-mediawiki");
> +    my $note_comment_length = bytes::length($note_comment);
> +    my $note_content = encode_utf8("mediawiki_revision: " . $pagerevids->{revid} . "\n");
> +    my $note_content_length = bytes::length($note_content);
> +
> +    if ($fetch_from == 1 && $n == 1) {
> +        print "reset refs/notes/commits\n";
> +    }
> +    print "commit refs/notes/commits\n";

Should these go in refs/notes/commits? I don't think we have a "best
practices" yet for the notes namespaces, as it is still a relatively new
concept. But I always thought "refs/notes/commits" would be for the
user's "regular" notes, and that programmatic things would get their own
notes, like "refs/notes/mediawiki".

That wouldn't show them by default, but you could do:

  git log --notes=mediawiki

to see them (and maybe that is a feature, because most of the time you
won't care about the mediawiki revision).

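For illustration, here is roughly what emitting the note into a dedicated ref could look like. This is a sketch, not the patch's code: note_commit_stream and the committer identity are placeholders, and length() is only a byte count here because the strings are assumed to be plain ASCII.

```perl
use strict;
use warnings;

# Dedicated namespace for notes written by the helper.
my $NOTES_REF = 'refs/notes/mediawiki';

# Hypothetical helper: build the fast-import commands for one note commit.
# $commit_ish is a mark such as ":1" or a sha1; $when is a raw date such
# as "1307034207 +0000". All strings are assumed to be ASCII.
sub note_commit_stream {
    my ($commit_ish, $revid, $when) = @_;
    my $message = 'note added by git-mediawiki';
    my $content = "mediawiki_revision: $revid\n";
    return join('',
        "commit $NOTES_REF\n",
        # Placeholder identity, chosen for this sketch only.
        'committer git-mediawiki <git-mediawiki@example.invalid> ' . $when . "\n",
        'data ' . length($message) . "\n" . $message . "\n",
        "N inline $commit_ish\n",
        'data ' . length($content) . "\n" . $content . "\n",
    );
}

# Example: note for the commit marked :1, MediaWiki revision 4210.
print STDERR note_commit_stream(':1', 4210, '1307034207 +0000');
```

The example prints to stderr only to show the shape of the stream. In the helper it would go to stdout with the rest of the import.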
> +    } else {
> +        print STDERR "You appear to have cloned an empty mediawiki\n";
> +        #What do we have to do here ? If nothing is done, an error is thrown saying that
> +        #HEAD is referring to unknown object 0000000000000000000
> +    }

Hmm. We do allow cloning empty git repos. It might be nice for there to
be some way for a remote helper to signal "everything OK, but the result
is empty". But I think that is probably something that needs to be added
to the remote-helper protocol, and so is outside the scope of your
script (maybe it is as simple as interpreting the null sha1 as "empty";
I dunno).
Overall, it's looking pretty good. I like that I can resume a
half-finished import via "git fetch". Though I do have one complaint:
running "git fetch" fetches the metainfo for every revision of every
page, just as it does for an initial clone. Is there something in the
mediawiki API to say "show me revisions since N" (where N would be the
mediawiki revision of the tip of what we imported)?
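One possibility, offered as an untested guess: the revisions query already takes rvstartid (the patch uses it for continuation), and combined with rvdir=newer it should list a page's revisions from N onward. The sketch below only builds such a query. The name revisions_since_query and the use of titles to select the page are assumptions, the result may include revision N itself, and it still needs one query per page.

```perl
use strict;
use warnings;

# Hypothetical helper: build a per-page query asking for revisions
# starting at $last_revid, oldest first. Whether this matches what the
# patch's $mediawiki->api() call expects is an assumption.
sub revisions_since_query {
    my ($page_title, $last_revid) = @_;
    return {
        action    => 'query',
        prop      => 'revisions',
        titles    => $page_title,
        rvprop    => 'ids',
        rvlimit   => 500,
        rvdir     => 'newer',       # oldest first, so rvstartid is a lower bound
        rvstartid => $last_revid,
    };
}

# Example: everything on GitSurvey from revision 4210 onward.
my $query = revisions_since_query('GitSurvey', 4210);
print STDERR join(', ', map { "$_=$query->{$_}" } sort keys %$query), "\n";
```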
-Peff