Build an API for any website with Web-Scraping

There are a lot of websites out there, with a lot of data on them. Sometimes you are building a killer web app and you just have to have some data from a certain site. The problem is that the site doesn’t have an API you can plug into. Never fear: using some simple tools combined with the Zend Framework, you can create your own web-scraping (screen-scraping) API in no time.

Before I continue, I should mention that it is not only impolite but quite possibly illegal to take the content of a third-party site without the owner’s permission or knowledge.

To accomplish our task we need some open-source tools:

Zend Framework (Required)
Quix (Recommended)
Dojo Toolkit (Recommended)

Our basic strategy is this:

Find a target page with some cool content
Find the CSS selector (or XPath) of the element we want to scrape
Verify the selector using Dojo
Build a Zend_Service object to fetch the cool content to PHP

Step 1: Finding the cool page

Say I am a developer and I really want to get the outward-facing IP of the machine that my PHP script is running on. If there is a router or proxy between my machine and the Internet, this can be nontrivial. As a solution, I decide to use the service at http://www.whatismyip.com/

Step 2: Finding the CSS3 Selector for the element

It’s pretty obvious which element we want to get the selector for. If you are feeling clever, you can simply read the code or use a DOM inspector to figure out the query on your own. However, selectorgadget.com provides a cool tool that lets you point and click to determine the appropriate selector. You can get a bookmarklet directly from their site, or you can get the much more powerful Quix bookmarklet, which includes SelectorGadget and a bunch of other cool tools.

If you decided on Quix (I know I did), click on the bookmarklet and enter “sg” in the command box. SelectorGadget should load, and as you move your mouse around the screen it should highlight different DOM nodes. We want the top one, so click on the text “Your IP…” and, checking out the path, we see that it is a rather boring h1.

Step 3: Validation via Dojo

I like to validate the path before taking it over to Zend. I use a Dojo and Firebug Lite bookmarklet, which injects a debugging version of Dojo into any page via the AOL CDN. To add this bookmarklet, drag Inject Dojo into your bookmarks toolbar. In the debugging console that pops up, enter dojo.query('h1'); and you should see the h1 DOM element returned.

Step 4: Moving it all to PHP

Now that we have successfully found our CSS3 selector path, we can move over to PHP and come up with a new Zend_Service component. Our approach will be to extend Zend_Service_Abstract and implement a custom method to perform the screen scrape.

The finished class looks like this:

class WhatsMyIpService extends Zend_Service_Abstract
{
    /**
     * The service endpoint.
     * This is where Zend_Http_Client will navigate
     * to fill service requests
     * @var string
     */
    protected $_endpoint = 'http://whatismyip.com';

    /**
     * handle to the client
     * @var Zend_Http_Client
     */
    protected $_client;

    public function __construct()
    {
        //clone the shared client so other services cannot affect ours
        $this->_client = clone self::getHttpClient();
    }

    /**
     * Method to get the external IP of the computer / server
     * script is running on
     * @return string
     */
    public function getMyIp()
    {
        //reset the client parameters, set the URL to whatismyip.com
        //and actually perform the request
        $result = $this->_client
            ->resetParameters()
            ->setUri($this->_endpoint)
            ->request(Zend_Http_Client::GET);

        //check to make sure that the result isn't an HTTP error
        if($result->isError()){
            throw new Exception('Client returned error: ' . $result->getMessage());
        }

        try{
            //setup the query object with the result body (HTML page)
            $query = new Zend_Dom_Query($result->getBody());

            $domCollection = $query->query('h1');
        }catch(Zend_Dom_Exception $e){
            throw new Exception('Error Loading Document: ' . $e->getMessage());
        }

        //check to make sure the query returned a result
        if($domCollection->count() == 0){
            throw new Exception('Cannot find DOM Element');
        }

        //get the titlestring from the nodevalue
        $titleString = (string) $domCollection->current()->nodeValue;

        //now we should have the content of h1 stored in the titleString
        //it should read something like "Your IP Address Is: xxx.xxx.xxx.xxx"

        //Now we will parse out the IP address using regular expressions
        if (preg_match('/([\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3})/', $titleString, $matches)) {
            return $matches[1];
        }else{
            throw new Exception('Unable to parse IP from page');
        }
    }
}

I hope that the script is sufficiently commented, but to summarize:

In the constructor, we get a local instance of a Zend_Http_Client object. The clone keyword is used to prevent any other Zend_Service objects from polluting our client with their own requests.

In the getMyIp() method, we first set up and perform the request (using the fluent interface provided by Zend_Http_Client). Notice how we reset the parameters: in this case it’s not actually necessary, as we aren’t passing any parameters, but it is a good habit to get into in case we later expand this class to pass GET or POST parameters.

Next, we examine what the HTTP client has passed back to us. Hopefully, if nothing has gone wrong, it is an HTML string representation of the page at whatismyip.com. A basic check is performed to ensure that no HTTP error has occurred, and then we instantiate a Zend_Dom_Query object, which supports both CSS and XPath selectors.
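To see what the selector step does without installing the framework, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath classes in place of Zend_Dom_Query. The HTML string is made-up sample markup (the real page may look different), and the XPath expression //h1 is simply the equivalent of the CSS selector h1.

```php
<?php
// Stand-in for the response body; the real markup at whatismyip.com may differ.
$html = '<html><body><h1>Your IP Address Is: 203.0.113.7</h1><p>Other content</p></body></html>';

// Collect libxml warnings internally, since real-world HTML is rarely perfect.
$previous = libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The CSS selector "h1" corresponds to the XPath expression "//h1".
$xpath = new DOMXPath($document);
$nodes = $xpath->query('//h1');

if ($nodes === false || $nodes->length === 0) {
    throw new Exception('Cannot find DOM Element');
}

// Take the text content of the first matching node.
$titleString = $nodes->item(0)->nodeValue;
echo $titleString, PHP_EOL;
```

With Zend_Dom_Query the same lookup is $query->query('h1'), or $query->queryXpath('//h1') if you prefer to write the XPath yourself.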

Finally, after running the CSS selector query, we check to ensure we got a DOM element back, get its value, and parse out the IP using regular expressions. It’s pretty impressive what can be done in fewer than 70 lines of code with the Zend Framework.

To run this class, we could create a directory structure as follows:

library <- Contains the Zend Framework
WhatsMyIpService.php <- The above class
getIp.php <- PHP file containing the following:

set_include_path(implode(PATH_SEPARATOR, array(
    realpath(dirname(__FILE__) . '/library'),
    get_include_path(),
)));

/** Zend_Loader_Autoloader **/
require_once 'Zend/Loader/Autoloader.php';
Zend_Loader_Autoloader::getInstance()->setFallbackAutoloader(true);

$service = new WhatsMyIpService();
echo $service->getMyIp();

The first couple of lines get the library on the include path, the second block sets up autoloading so we don’t have to manually include files, and the final lines instantiate the class and call the method.
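One last note on the parsing step: the regular expression in getMyIp() also matches out-of-range strings such as 999.1.1.1. One possible hardening, sketched here as a standalone function, is to pass the match through PHP's filter_var(). The name extractIpv4 is hypothetical and not part of the class above, and the sample strings are invented for illustration.

```php
<?php
/**
 * Hypothetical helper: returns the first valid IPv4 address
 * found in $text, or null if there is none.
 */
function extractIpv4($text)
{
    // Same idea as the pattern in getMyIp(): four dot-separated groups of 1-3 digits.
    if (!preg_match('/(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/', $text, $matches)) {
        return null;
    }

    // The pattern alone accepts values like 999.1.1.1,
    // so let filter_var() confirm that every octet is in range.
    $ip = filter_var($matches[1], FILTER_VALIDATE_IP, FILTER_FLAG_IPV4);

    return $ip === false ? null : $ip;
}

var_dump(extractIpv4('Your IP Address Is: 203.0.113.7'));
var_dump(extractIpv4('Your IP Address Is: 999.1.1.1'));
var_dump(extractIpv4('No address here'));
```

The first call returns the address and the other two return null, so inside getMyIp() a null result could be turned into the same 'Unable to parse IP from page' exception.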