World of Warcraft Data Mining

A friend recently asked me about scraping data off websites, in particular the WOWArmory. WOWArmory is a searchable database of characters, items, guilds and dungeons for the MMO World of Warcraft.

What makes this site unusual is that every query returns XML rather than HTML. For example, the following search returns an XML document containing a list of all the characters in a particular guild:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/layout/guild-info.xsl"?>
<page globalSearch="1" lang="en_us" requestUrl="/guild-info.xml">
  <guildKey factionId="1" name="Batmen" nameUrl="Batmen" realm="Kel'Thuzad" realmUrl="Kel%27Thuzad" url="r=Kel%27Thuzad&amp;n=Batmen"/>
  <guildInfo>
    <guild>
      <members filterField="" filterValue="" maxPage="1" memberCount="6" page="1" sortDir="a" sortField="">
        <character class="Hunter" classId="3" gender="Male" genderId="0" level="70" name="Gommit" race="Orc" raceId="2" rank="1" url="r=Kel%27Thuzad&amp;n=Gommit"/>
        <character class="Hunter" classId="3" gender="Female" genderId="1" level="67" name="Laurisse" race="Blood Elf" raceId="10" rank="1" url="r=Kel%27Thuzad&amp;n=Laurisse"/>
        <character class="Hunter" classId="3" gender="Female" genderId="1" level="65" name="Kyss" race="Orc" raceId="2" rank="4" url="r=Kel%27Thuzad&amp;n=Kyss"/>
        <character class="Priest" classId="5" gender="Female" genderId="1" level="55" name="Zaurisse" race="Blood Elf" raceId="10" rank="1" url="r=Kel%27Thuzad&amp;n=Zaurisse"/>
        <character class="Warlock" classId="9" gender="Male" genderId="0" level="55" name="Keloth" race="Blood Elf" raceId="10" rank="4" url="r=Kel%27Thuzad&amp;n=Keloth"/>
        <character class="Hunter" classId="3" gender="Male" genderId="0" level="19" name="Ohwut" race="Troll" raceId="8" rank="1" url="r=Kel%27Thuzad&amp;n=Ohwut"/>
      </members>
    </guild>
  </guildInfo>
</page>

Apparently the server generates only XML and off-loads the generation of HTML to the client: the xml-stylesheet instruction at the top of the document names an XSLT stylesheet, and the open source library Sarissa is used on the client for the XSLT and ECMAScript work.

The following is a quick and dirty guide to scraping that data using PHP.

The first thing we need to do is actually fetch the XML document. Included with most PHP distributions is a library called CURL, which is used to connect to and communicate with servers using protocols such as HTTP, FTP and LDAP. In the win32 version of PHP, CURL may not be loaded by default. Check the php.ini file and make sure the following line exists (and is uncommented):

extension=php_curl.dll
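
Before going further, it can help to ask PHP itself whether the extension really got loaded. The following is only a small sketch; the helper name curlIsAvailable() is made up for this example and is not part of PHP or CURL.

```php
<?php
// Hypothetical helper: reports whether the curl extension is loaded.
function curlIsAvailable(): bool
{
    return extension_loaded('curl') && function_exists('curl_init');
}

if (curlIsAvailable()) {
    // curl_version() returns an array describing the linked libcurl.
    echo "curl is available, version " . curl_version()['version'] . "\n";
} else {
    echo "curl is not loaded; check the extension line in php.ini\n";
}
```

Running php -m from a command prompt also lists the loaded modules. On PHP 7.2 and later the php.ini line can be written as extension=curl as well.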

To use CURL, we first need to create a session. This is done through the curl_init() function, which returns a handle for the session:

$ch = curl_init();

Once we have the session handle, we need to set some options on it. This is done through the curl_setopt() function.

The WOWArmory site only returns XML if it believes your browser supports JavaScript; otherwise it returns HTML, which is a lot more difficult to parse. So we need to trick the site into thinking our PHP client is a supported browser. We do that by setting the user agent string to that of a browser like Firefox:

curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11");

Next, we need to set the URL we want to fetch the data from:

curl_setopt($ch, CURLOPT_URL, "http://www.wowarmory.com/guild-info.xml?r=Kel%27Thuzad&n=Batmen&p=1");
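
Note that the query string in this URL is URL-encoded by hand (the apostrophe in the realm name appears as %27). As a sketch of how the same URL could be assembled from plain values, http_build_query() can do the encoding; the helper name buildGuildUrl() is hypothetical and only used for illustration.

```php
<?php
// Hypothetical helper: builds the guild-info URL from a realm and guild name.
function buildGuildUrl(string $realm, string $guild, int $page = 1): string
{
    // http_build_query() URL-encodes each value and joins the pairs with "&".
    $query = http_build_query(['r' => $realm, 'n' => $guild, 'p' => $page], '', '&');
    return 'http://www.wowarmory.com/guild-info.xml?' . $query;
}

echo buildGuildUrl("Kel'Thuzad", 'Batmen') . "\n";
```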

We also need to set a few other options:

curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);

For a detailed explanation of these options and the others that are available, refer to the PHP manual entry for curl_setopt().
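
As a rough summary of what each option does, the same settings can be collected into one array and applied with curl_setopt_array(). This is a sketch rather than a required pattern; the helper name armoryCurlOptions() is an assumption made for this example.

```php
<?php
// Hypothetical helper: returns the option set used in this guide for a given URL.
function armoryCurlOptions(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
        CURLOPT_AUTOREFERER    => true, // set the Referer header automatically when following redirects
        CURLOPT_FAILONERROR    => true, // treat HTTP status codes of 400 and above as failures
        CURLOPT_RETURNTRANSFER => true, // make curl_exec() return the response as a string
        CURLOPT_FOLLOWLOCATION => true, // follow Location: redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds
    ];
}

$ch = curl_init();
// curl_setopt_array() applies every option and returns false if any of them fails.
if (!curl_setopt_array($ch, armoryCurlOptions('http://www.wowarmory.com/guild-info.xml?r=Kel%27Thuzad&n=Batmen&p=1'))) {
    echo "could not set one of the curl options\n";
}
```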

Finally, to fetch the page data, we call curl_exec():

$xml = curl_exec($ch);
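
curl_exec() returns false when the request fails, so it is worth checking the result before parsing it. A minimal sketch of that check, wrapped in a hypothetical helper named fetchXml() (the name and the null return value are choices made for this example):

```php
<?php
// Hypothetical helper: fetches a URL and returns the body, or null on failure.
function fetchXml(string $url): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_errno() and curl_error() describe what went wrong.
        error_log('CURL error ' . curl_errno($ch) . ': ' . curl_error($ch));
        return null;
    }
    return $body;
}
```

A caller would then test the result, for example: if (($xml = fetchXml($url)) === null) { exit(1); }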

We now have an XML document as a string, but we need to be able to parse it to extract the data we care about. The DOMDocument class can be used to get the Document Object Model (DOM) for the XML document:

$dom = new DOMDocument();
@$dom->loadXML($xml);
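
The @ operator above simply hides any warnings the parser raises. If you would rather see why a document failed to parse, libxml can collect the errors instead. The following is one possible sketch; parseXml() is a hypothetical helper name, and the two sample strings are made up for illustration.

```php
<?php
// Hypothetical helper: parses an XML string, returning a DOMDocument or null.
function parseXml(string $xml): ?DOMDocument
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $ok = $dom->loadXML($xml);

    foreach (libxml_get_errors() as $error) {
        error_log('XML error on line ' . $error->line . ': ' . trim($error->message));
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $dom : null;
}

var_dump(parseXml('<page><guildInfo/></page>') !== null); // well-formed input
var_dump(parseXml('<page><guildInfo></page>') !== null);  // mismatched tag
```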

The DOM allows us to navigate the structure of the XML document without having to do any parsing ourselves, and lets us iterate through each node (a node corresponds to an XML element).
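
To give an idea of what walking the nodes by hand looks like, here is a small sketch that works on a trimmed-down copy of the guild document (the sample XML is cut down from the listing earlier and is only there to keep the example self-contained):

```php
<?php
// A trimmed-down copy of the guild document shown earlier.
$xml = <<<XML
<page>
  <guildInfo>
    <guild>
      <members memberCount="2">
        <character class="Hunter" level="70" name="Gommit"/>
        <character class="Priest" level="55" name="Zaurisse"/>
      </members>
    </guild>
  </guildInfo>
</page>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

// Find the <members> element, then iterate over its children by hand.
$names = [];
$members = $dom->getElementsByTagName('members')->item(0);
foreach ($members->childNodes as $child) {
    // childNodes also contains whitespace text nodes, so keep elements only.
    if ($child instanceof DOMElement && $child->tagName === 'character') {
        $names[] = $child->getAttribute('name');
    }
}
echo implode("\n", $names) . "\n";
```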

This can be tedious to do. Luckily, PHP has an XPath object that navigates the nodes for us and returns only what we care about. For example, if we want to get a list of all the character elements, we would use the following XPath query: "/page/guildInfo/guild/members/character". Or, since character only occurs in one location in the document, we could also use "//character".

The following PHP code will return a DOMNodeList object containing all the character elements in the document:

$xpath = new DOMXPath($dom);
$nodeList = $xpath->evaluate("//character");
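
XPath can also filter on attribute values, which saves looping over nodes we do not want. The sketch below assumes a cut-down copy of the guild XML and picks out the hunters of level 60 and above; the filter itself is just an illustration.

```php
<?php
$xml = <<<XML
<page><guildInfo><guild><members>
  <character class="Hunter" level="70" name="Gommit"/>
  <character class="Hunter" level="67" name="Laurisse"/>
  <character class="Priest" level="55" name="Zaurisse"/>
  <character class="Hunter" level="19" name="Ohwut"/>
</members></guild></guildInfo></page>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// A predicate in square brackets filters on attribute values.
$hunters = $xpath->query('//character[@class="Hunter" and @level >= 60]');

$hunterNames = [];
foreach ($hunters as $node) {
    $hunterNames[] = $node->getAttribute('name');
}
echo implode(', ', $hunterNames) . "\n";

// evaluate() can also return a scalar result, such as a count.
$total = $xpath->evaluate('count(//character)');
echo "total characters: " . $total . "\n";
```

query() always returns a node list, while evaluate() returns a typed result where the expression allows it, which is why the count comes back as a number.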

Using the $nodeList, we can now dump out the members of the guild. As shown in the above XML document, each piece of data for a character is stored as an attribute of the character tag, thus for each node if we want to print the character name, we need to display the "name" attribute:

for ($i = 0; $i < $nodeList->length; $i++) {
    echo $nodeList->item($i)->getAttribute("name") . "\n";
}
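
The other attributes can be read the same way. As a sketch, the following copies every attribute of each character element into an associative array, using a two-character sample document made up from the listing above:

```php
<?php
$xml = '<members>'
     . '<character class="Hunter" level="70" name="Gommit" race="Orc"/>'
     . '<character class="Warlock" level="55" name="Keloth" race="Blood Elf"/>'
     . '</members>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$characters = [];
// DOMNodeList is traversable, so foreach works in place of an index loop.
foreach ($xpath->query('//character') as $node) {
    $row = [];
    // Copy every attribute of the element into an associative array.
    foreach ($node->attributes as $attr) {
        $row[$attr->name] = $attr->value;
    }
    $characters[] = $row;
}

foreach ($characters as $c) {
    printf("%-10s level %2d %s %s\n", $c['name'], $c['level'], $c['race'], $c['class']);
}
```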

Let's combine this all into one script now and pass the realm and guild name as command line parameters:

<?php

// Expect two arguments: the realm and the guild name.
if ($argc < 3) {
  echo "usage: " . $argv[0] . " <realm> <guild>\n";
  exit(1);
}

$realm = $argv[1];
$guild = $argv[2];

// urlencode() escapes characters such as the apostrophe in "Kel'Thuzad".
$url = "http://www.wowarmory.com/guild-info.xml?r=" . urlencode($realm) . "&n=" . urlencode($guild) . "&p=1";
// echo $url . "\n"; // uncomment to print the request URL while debugging

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11");
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);

// curl_exec() returns false on failure; an empty body is not usable either.
$xml = curl_exec($ch);
if ($xml === false || $xml === "") {
  echo "CURL error number:" . curl_errno($ch) . " CURL error:" . curl_error($ch) . "\n";
  exit(1);
}

// Parse the XML string into a DOM (the @ hides parser warnings).
$dom = new DOMDocument();
@$dom->loadXML($xml);

// Select every character element and print its name attribute.
$xpath = new DOMXPath($dom);
$nodes = $xpath->evaluate("//character");

for ($i = 0; $i < $nodes->length; $i++) {
  echo $nodes->item($i)->getAttribute("name") . "\n";
}
?>

Assuming the above is saved to a file called test.php, we can execute the following from a command prompt:

> php test.php "Kel'Thuzad" Batmen

and get the following output:

Gommit
Laurisse
Kyss
Zaurisse
Keloth
Ohwut