From: "Simon.Cathebras" <Simon.Cathebras@ensimag.imag.fr>
To: Pavel Volek <Pavel.Volek@ensimag.imag.fr>
Cc: git@vger.kernel.org, Volek Pavel <me@pavelvolek.cz>,
	NGUYEN Kim Thuat <Kim-Thuat.Nguyen@ensimag.imag.fr>,
	ROUCHER IGLESIAS Javier <roucherj@ensimag.imag.fr>,
	Matthieu Moy <Matthieu.Moy@imag.fr>
Subject: Re: [PATCHv1] git-remote-mediawiki: import "File:" attachments
Date: Fri, 08 Jun 2012 18:20:59 +0200
Message-ID: <4FD2266B.3040706@ensimag.imag.fr>
In-Reply-To: <1339165376-20267-1-git-send-email-Pavel.Volek@ensimag.imag.fr>



On 08/06/2012 16:22, Pavel Volek wrote:
> From: Volek Pavel<me@pavelvolek.cz>
>
> The current version of the git-remote-mediawiki supports only import and export
> of the pages, doesn't support import and export of file attachements which are
> also exposed by MediaWiki API. This patch adds the functionality to import the
> last versions of the files and all versions of description pages for these
> files.
>
> Signed-off-by: Pavel Volek<Pavel.Volek@ensimag.imag.fr>
> Signed-off-by: NGUYEN Kim Thuat<Kim-Thuat.Nguyen@ensimag.imag.fr>
> Signed-off-by: ROUCHER IGLESIAS Javier<roucherj@ensimag.imag.fr>
> Signed-off-by: Matthieu Moy<Matthieu.Moy@imag.fr>
> ---

>   contrib/mw-to-git/git-remote-mediawiki | 290 +++++++++++++++++++++++++++------
>   1 file changed, 244 insertions(+), 46 deletions(-)

I am wondering why you are showing removals in a v1 patch?

>
> diff --git a/contrib/mw-to-git/git-remote-mediawiki b/contrib/mw-to-git/git-remote-mediawiki
> index c18bfa1..9f21217 100755
> --- a/contrib/mw-to-git/git-remote-mediawiki
> +++ b/contrib/mw-to-git/git-remote-mediawiki
> @@ -212,59 +212,230 @@ sub get_mw_pages {
>   	my $user_defined;
>   	if (@tracked_pages) {
>   		$user_defined = 1;
> -		# The user provided a list of pages titles, but we
> -		# still need to query the API to get the page IDs.
> -
> -		my @some_pages = @tracked_pages;
> -		while (@some_pages) {
> -			my $last = 50;
> -			if ($#some_pages<  $last) {
> -				$last = $#some_pages;
> -			}
> -			my @slice = @some_pages[0..$last];
> -			get_mw_first_pages(\@slice, \%pages);
> -			@some_pages = @some_pages[51..$#some_pages];
> -		}
> +		get_mw_tracked_pages(\%pages);
>   	}
>   	if (@tracked_categories) {
>   		$user_defined = 1;
> -		foreach my $category (@tracked_categories) {
> -			if (index($category, ':')<  0) {
> -				# Mediawiki requires the Category
> -				# prefix, but let's not force the user
> -				# to specify it.
> -				$category = "Category:" . $category;
> -			}
> -			my $mw_pages = $mediawiki->list( {
> -				action =>  'query',
> -				list =>  'categorymembers',
> -				cmtitle =>  $category,
> -				cmlimit =>  'max' } )
> -			    || die $mediawiki->{error}->{code} . ': ' . $mediawiki->{error}->{details};
> -			foreach my $page (@{$mw_pages}) {
> -				$pages{$page->{title}} = $page;
> -			}
> -		}
> +		get_mw_tracked_categories(\%pages);
>   	}
>   	if (!$user_defined) {
> -		# No user-provided list, get the list of pages from
> -		# the API.
> -		my $mw_pages = $mediawiki->list({
> -			action =>  'query',
> -			list =>  'allpages',
> -			aplimit =>  500,
> -		});
> -		if (!defined($mw_pages)) {
> -			print STDERR "fatal: could not get the list of wiki pages.\n";
> -			print STDERR "fatal: '$url' does not appear to be a mediawiki\n";
> -			print STDERR "fatal: make sure '$url/api.php' is a valid page.\n";
> -			exit 1;
> +		 get_mw_all_pages(\%pages);
> +	}
> +	return values(%pages);
> +}
> +
> +sub get_mw_all_pages {
> +	my $pages = shift;
> +	# No user-provided list, get the list of pages from the API.
> +	my $mw_pages = $mediawiki->list({
> +		action =>  'query',
> +		list =>  'allpages',
> +		aplimit =>  500,
> +	});
> +	if (!defined($mw_pages)) {
> +		print STDERR "fatal: could not get the list of wiki pages.\n";
> +		print STDERR "fatal: '$url' does not appear to be a mediawiki\n";
> +		print STDERR "fatal: make sure '$url/api.php' is a valid page.\n";
> +		exit 1;
> +	}
> +	foreach my $page (@{$mw_pages}) {
> +		$pages->{$page->{title}} = $page;
> +	}
> +
> +	# Attach list of all pages for meadia files from the API,
> +	# they are in a different namespace, only one namespace
> +	# can be queried at the same moment
> +	my $mw_mediapages = $mediawiki->list({
> +		action =>  'query',
> +		list =>  'allpages',
> +		apnamespace =>  get_mw_namespace_id("File"),
> +		aplimit =>  500,
> +	});
> +	if (!defined($mw_mediapages)) {
> +		print STDERR "fatal: could not get the list of media file pages.\n";
> +		print STDERR "fatal: '$url' does not appear to be a mediawiki\n";
> +		print STDERR "fatal: make sure '$url/api.php' is a valid page.\n";
> +		exit 1;
> +	}
> +	foreach my $page (@{$mw_mediapages}) {
> +		$pages->{$page->{title}} = $page;
> +	}
> +}
> +
> +sub get_mw_tracked_pages {
> +	my $pages = shift;
> +	# The user provided a list of pages titles, but we
> +	# still need to query the API to get the page IDs.
> +	my @some_pages = @tracked_pages;
> +	while (@some_pages) {
> +		my $last = 50;
> +		if ($#some_pages<  $last) {
> +			$last = $#some_pages;
> +		}
> +		my @slice = @some_pages[0..$last];
> +		get_mw_first_pages(\@slice, \%{$pages});
> +		@some_pages = @some_pages[51..$#some_pages];
> +	}
> +
> +	# Get pages of related media files.
> +	get_mw_linked_mediapages(\@tracked_pages, \%{$pages});
> +}
> +
> +sub get_mw_tracked_categories {
> +	my $pages = shift;
> +	foreach my $category (@tracked_categories) {
> +		if (index($category, ':')<  0) {
> +			# Mediawiki requires the Category
> +			# prefix, but let's not force the user
> +			# to specify it.
> +			$category = "Category:" . $category;
>   		}
> +		my $mw_pages = $mediawiki->list( {
> +			action =>  'query',
> +			list =>  'categorymembers',
> +			cmtitle =>  $category,
> +			cmlimit =>  'max' } )
> +			|| die $mediawiki->{error}->{code} . ': '
> +				. $mediawiki->{error}->{details};
>   		foreach my $page (@{$mw_pages}) {
> -			$pages{$page->{title}} = $page;
> +			$pages->{$page->{title}} = $page;
> +		}
> +
> +		my @titles = map $_->{title}, @{$mw_pages};
> +		# Get pages of related media files.
> +		get_mw_linked_mediapages(\@titles, \%{$pages});
> +	}
> +}
> +
> +sub get_mw_linked_mediapages {
> +	my $titles = shift;
> +	my @titles = @{$titles};
> +	my $pages = shift;
> +
> +	# pattern 'page1|page2|...' required by the API
> +	my $mw_titles = join('|', @titles);
> +
> +	# Media files could be included or linked from
> +	# a page, get all related
> +	my $query = {
> +		action =>  'query',
> +		prop =>  'links|images',
> +		titles =>  $mw_titles,
> +		plnamespace =>  get_mw_namespace_id("File"),
> +		pllimit =>  500,
> +	};

Why is there a comma after 500?
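
For what it's worth, Perl itself accepts a trailing comma after the last
element of a list or hash constructor, so this is a style question rather
than a syntax one. A minimal sketch with made-up values (the two hashes
below are illustrative and not taken from the patch):

```perl
use strict;
use warnings;

# Illustrative values only: the same hash constructor written with
# and without a trailing comma after the last pair.
my $with_comma = {
	action => 'query',
	pllimit => 500,
};
my $without_comma = {
	action => 'query',
	pllimit => 500
};

# Both hashes hold the same two keys.
print join(',', sort keys %{$with_comma}), "\n";
print join(',', sort keys %{$without_comma}), "\n";
```

The usual argument for the trailing comma is that adding a parameter later
touches only one line of the diff.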

> +	my $result = $mediawiki->api($query);


What happens if the titles in the query contain special characters
that MediaWiki does not allow in file names, such as { or [?
Maybe you should write a test for it and, if it doesn't work, try the
functions called:
     mediawiki_clean/smudge_filename
in the file git-remote-mediawiki
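
To make the concern concrete, here is a rough sketch of the kind of check
I have in mind. The helper split_valid_titles is hypothetical (it is not
in git-remote-mediawiki), and the set of rejected characters
(# < > [ ] { } |) is my assumption about what MediaWiki refuses in titles,
so please check it against the wiki you test with:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of git-remote-mediawiki.
# Assumption: the characters # < > [ ] { } | are not valid in
# MediaWiki titles, and '|' would also clash with the
# 'page1|page2|...' separator used in the query.
sub split_valid_titles {
	my @titles = @_;
	my (@valid, @invalid);
	foreach my $title (@titles) {
		if ($title =~ /[#<>\[\]{}|]/) {
			push(@invalid, $title);
		} else {
			push(@valid, $title);
		}
	}
	return (\@valid, \@invalid);
}

my ($valid, $invalid) =
	split_valid_titles('Main Page', 'Foo{bar}', 'File:a|b.png');

# Only the titles that passed the check are joined for the API.
my $mw_titles = join('|', @{$valid});
print "query titles: $mw_titles\n";
print "skipped: " . join(', ', @{$invalid}) . "\n";
```

Whether such titles should be skipped, reported, or escaped is a separate
decision; the sketch only separates them.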


> +
> +	while (my ($id, $page) = each(%{$result->{query}->{pages}})) {
> +		my @titles;
> +		if (defined($page->{links})) {
> +			my @link_titles = map $_->{title}, @{$page->{links}};
> +			push(@titles, @link_titles);
> +		}
> +		if (defined($page->{images})) {
> +			my @image_titles = map $_->{title}, @{$page->{images}};
> +			push(@titles, @image_titles);
> +		}
> +		if (@titles) {
> +			get_mw_first_pages(\@titles, \%{$pages});
>   		}
>   	}
> -	return values(%pages);
> +}
> +
> +sub get_mw_medafile_for_mediapage_revision {
> +	# Name of the file on Wiki, with the prefix.
> +	my $mw_filename = shift;
> +	my $timestamp = shift;
> +	my %mediafile;
> +
> +	# Search if on MediaWiki exists a media file with given
> +	# timestamp and in that case download the file.
> +	my $query = {
> +		action =>  'query',
> +		prop =>  'imageinfo',
> +		titles =>  $mw_filename,
> +		iistart =>  $timestamp,
> +		iiend =>  $timestamp,
> +		iiprop =>  'timestamp|archivename',
> +		iilimit =>  1,
> +	};

Why a comma after iilimit? (It is the end of the parameter list here, I think.)

> +	my $result = $mediawiki->api($query);
> +
> +	my ($fileid, $file) = each ( %{$result->{query}->{pages}} );
> +	if (defined($file->{imageinfo})) {
> +		my $fileinfo = pop(@{$file->{imageinfo}});
> +		if (defined($fileinfo->{archivename})) {
> +			return; # now we are not able to download files from archive
> +		}
> +
> +		my $filename; # real filename without prefix
> +		if (index($mw_filename, 'File:') == 0) {
> +			$filename = substr $mw_filename, 5;
> +		} else {
> +			$filename = substr $mw_filename, 6;
> +		}
> +
> +		$mediafile{title} = $filename;
> +		$mediafile{content} = download_mw_mediafile($mw_filename);
> +	}
> +	return %mediafile;
> +}
> +
> +# Returns MediaWiki id for a canonical namespace name.
> +# Ex.: "File", "Project".
> +# Looks for the namespace id in the local configuration
> +# variables, if it is not found asks MW API.
> +sub get_mw_namespace_id {
> +	mw_connect_maybe();
> +
> +	my $name = shift;
> +
> +	# Look at configuration file, if the record
> +	# for that namespace is already stored.
> +	my @tracked_namespaces = split(/[ \n]/, run_git("config --get-all remote.". $remotename .".namespaces"));

Broken indentation, or is the line too long?
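
One possible way to wrap it, as a sketch only. The run_git stub and the
value of $remotename below are stand-ins so that the snippet runs on its
own, and the "File=6" / "Project=4" entries are example data in the
name=id form that the patch stores:

```perl
use strict;
use warnings;

# Stand-ins so the snippet is self-contained; in the real script
# $remotename and run_git() already exist.
my $remotename = 'origin';
sub run_git {
	my $args = shift;
	return "File=6\nProject=4\n";	# example data only
}

# Build the configuration key first, then keep the call short.
my $config_key = "remote." . $remotename . ".namespaces";
my @tracked_namespaces = split(/[ \n]/,
	run_git("config --get-all " . $config_key));

print "$_\n" foreach @tracked_namespaces;
```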

> +
> +	# NS not found =>  get namespace id from MW and store it in
> +	# configuration file.
> +	my $query = {
> +		action =>  'query',
> +		meta =>  'siteinfo',
> +		siprop =>  'namespaces',
> +	};

Same remark here about the trailing comma.

> +	my $result = $mediawiki->api($query);
> +
> +	while (my ($id, $ns) = each(%{$result->{query}->{namespaces}})) {
> +		if (defined($ns->{canonical})&&  ($ns->{canonical} eq $name)) {
> +			run_git("config --add remote.". $remotename .".namespaces ". $name ."=". $ns->{id});
> +			return $ns->{id};
> +		}
> +	}
> +	die "Namespace $name was not found on MediaWiki.";
> +}
> +
> +sub download_mw_mediafile {
> +	my $filename = shift;
> +
> +	$mediawiki->{config}->{files_url} = $url;
> +
> +	my $file = $mediawiki->download( { title =>  $filename } );

Just wondering: what happens if $filename contains a character that is
forbidden in wiki file names, such as '{' or '|'?
I am worried about it because I ran into similar issues in my own
work on tests for git-remote-mediawiki.

Hope I helped :).

Simon

-- 
CATHEBRAS Simon

2A-ENSIMAG

Information Systems Engineering track
Bug-Buster member
