Wikipedia:Computer help desk/ParseMediaWikiDump

Parse::MediaWikiDump is a Perl module created by Triddle that makes accessing the information in a MediaWiki dump file easy.

Download
The latest version of Parse::MediaWikiDump is available at http://www.cpan.org/modules/by-authors/id/T/TR/TRIDDLE/.

Find uncategorized articles in the main name space
#!/usr/bin/perl -w

use strict;
use Parse::MediaWikiDump;

my $file = shift(@ARGV) or die "must specify a MediaWiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my $page;

while (defined($page = $pages->page)) {
    # main namespace only
    next unless $page->namespace eq '';

    print $page->title, "\n" unless defined($page->categories);
}

Find double redirects in the main name space
This program does not follow the proper case sensitivity rules for matching article titles; see the POD that comes with the module for a much more complete version of this program.

#!/usr/bin/perl -w

use strict;
use Parse::MediaWikiDump;

my $file = shift(@ARGV) or die "must specify a MediaWiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my $page;
my %redirs;

while (defined($page = $pages->page)) {
    next unless $page->namespace eq '';
    next unless defined($page->redirect);

    my $title = $page->title;

    $redirs{$title} = $page->redirect;
}

foreach my $key (keys(%redirs)) {
    my $redirect = $redirs{$key};

    if (defined($redirs{$redirect})) {
        print "$key\n";
    }
}
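The detection step needs no dump file at all: once the redirect map is built, a title is a double redirect whenever its target is itself a key of the map. A minimal sketch of that hash lookup, using a hypothetical hand-built redirect table in place of the parsed dump:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical redirect table: title => redirect target
my %redirs = (
    'A'    => 'B',    # A -> B -> C is a double redirect
    'B'    => 'C',
    'Solo' => 'End',  # End is not itself a redirect
);

# A key is a double redirect when its target is also a key of %redirs
my @double = sort grep { defined $redirs{ $redirs{$_} } } keys %redirs;

print "$_\n" for @double;  # prints "A"
```

The same `defined $redirs{$redirect}` test is what the full program above applies to every redirect found in the dump.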

Import only a certain category of pages
Can someone fill in the Perl code below?


#!/usr/bin/perl

use strict;
use warnings;
use Parse::MediaWikiDump;
use DBI;

my $server   = "localhost";
my $name     = "dbname";
my $user     = "admin";
my $password = "pass";

my $dsn = "DBI:mysql:database=$name;host=$server;";
my $dbh = DBI->connect($dsn, $user, $password);

my $source = 'pages_articles.xml';

my $pages = Parse::MediaWikiDump::Pages->new($source);
print "Done parsing.\n";

while (defined(my $page = $pages->page)) {
    my $c = $page->categories;

    if (grep /Mathematics/, @$c) {
        my $id    = $page->id;
        my $title = $page->title;
        my $text  = $page->text;

        #$dbh->do("insert ...");

        print "title '$title' id $id was inserted.\n";
    }
}
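The category filter in the loop can be exercised without a dump file. `$page->categories` returns a reference to an array of the page's category names, so a hypothetical list stands in for it in this sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical value of $page->categories: a reference to an
# array of the page's category names
my $c = [ 'Mathematics', 'Number theory' ];

# In list context grep returns the matching elements; assigning to a
# scalar gives their count, so a true value means the page matches
my $is_math = grep /Mathematics/, @$c;

print $is_math ? "match\n" : "no match\n";  # prints "match"
```

Note that `/Mathematics/` is a substring match, so it would also match a category such as "History of Mathematics"; anchor the pattern (`/^Mathematics$/`) if an exact category name is wanted.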

Extract articles linked to important Wikis but not to a specific one
The script checks whether an article contains interwiki links to :de, :es, :it, :ja and :nl but NOT :fr. It is useful for linking "popular" articles to a specific wiki, and it may also give useful hints about which articles should be translated first.


#!/usr/bin/perl -w

# Code: Dake

use strict;
use utf8;
use Parse::MediaWikiDump;

my $file = shift(@ARGV) or die "must specify a MediaWiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my $page;

binmode STDOUT, ":utf8";

while (defined($page = $pages->page)) {
    # main namespace only
    next unless $page->namespace eq '';

    my $text = $page->text;

    if (($$text =~ /\[\[de:/i) && ($$text =~ /\[\[es:/i) &&
        ($$text =~ /\[\[nl:/i) && ($$text =~ /\[\[ja:/i) &&
        ($$text =~ /\[\[it:/i) && !($$text =~ /\[\[fr:/i)) {
        print $page->title, "\n";
    }
}
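The interwiki test reduces to a handful of regular-expression matches against the page text. Since `$page->text` returns a reference to the wikitext, the matches dereference it with `$$text`; here the same check runs on a short made-up wikitext fragment:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

# Hypothetical wikitext with the five wanted interwiki links
# but no [[fr:...]] link
my $wikitext = "Some article text.\n" .
               "[[de:Artikel]] [[es:Articulo]] [[nl:Artikel]] " .
               "[[ja:記事]] [[it:Articolo]]\n";
my $text = \$wikitext;  # $page->text returns a scalar reference

my $wanted = ($$text =~ /\[\[de:/i) && ($$text =~ /\[\[es:/i)
          && ($$text =~ /\[\[nl:/i) && ($$text =~ /\[\[ja:/i)
          && ($$text =~ /\[\[it:/i) && !($$text =~ /\[\[fr:/i);

print $wanted ? "report\n" : "skip\n";  # prints "report"
```

Adding a `[[fr:Article]]` link to the fragment would flip the final negated match and suppress the title.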

Related software

 * Wikipedia preprocessor (wikiprep.pl) is a Perl script that preprocesses raw XML dumps, builds link tables and category hierarchies, collects anchor text for each article, etc.