Build an API for any website with Web-Scraping
There are a lot of websites out there, with a lot of data on them. Sometimes you are building a killer web app and you just have to have some data from a certain site. The problem is that the site doesn’t have an API you can plug into. Never fear: with some simple tools and the Zend Framework, you can create your own web-scraping (screen-scraping) API in no time.
Before I continue, I should mention that taking the content of a third-party site without the owner’s permission or knowledge is not only impolite but quite possibly illegal.
To accomplish our task we need some open-source tools:
- Zend Framework (Required)
- Quix (Recommended)
- Dojo Toolkit (Recommended)
Our basic strategy is this:
- Find a target page with some cool content
- Find the CSS selector (or XPath) of the element we want to scrape
- Verify the selector using Dojo
- Build a Zend_Service object to fetch the cool content to PHP
Step 1: Finding the cool page
Say I am a developer and I really want to get the outward-facing IP of the machine that my PHP script is running on. If there is a router or proxy between my machine and the Internet, this can be nontrivial. As a solution I decide to use the service http://www.whatismyip.com/
Step 2: Finding the CSS3 Selector for the element
It’s pretty obvious which element we want the selector for. If you are feeling clever you can simply read the code or use a DOM inspector to work out the query on your own. However, selectorgadget.com provides a cool tool that lets you point and click to determine the appropriate selector. You can get a bookmarklet directly from their site, or you can get the much more powerful Quix bookmarklet, which includes SelectorGadget and a bunch of other cool tools.
If you decided on Quix (I know I did), click on the bookmarklet and enter “sg” in the command box. SelectorGadget should load, and as you move your mouse around the screen it should highlight different DOM nodes. We want the top one, so click on the text “Your IP…”; checking the path, we see that it is a rather boring h1.
Step 3: Validation via Dojo
I like to validate the path before taking it over to Zend. I use a Dojo and Firebug Lite bookmarklet, which injects a debugging version of Dojo into any page via the AOL CDN. To add this bookmarklet, drag Inject Dojo into your bookmarks toolbar. In the debugging console that pops up, enter dojo.query('h1'); and you should see the h1 DOM element being returned.
Step 4: Moving it all to PHP
Now that we have found our CSS3 selector path, we can move over to PHP and write a new Zend_Service component. Our approach will be to extend Zend_Service_Abstract and implement some custom methods to perform the screen scrape.
The finished class looks like this:
<?php
class WhatsMyIpService extends Zend_Service_Abstract
{
    /**
     * The service endpoint.
     * This is where Zend_Http_Client will navigate
     * to fill service requests
     * @var string
     */
    protected $_endpoint = 'http://whatismyip.com';

    /**
     * Handle to the client
     * @var Zend_Http_Client
     */
    protected $_client;

    public function __construct()
    {
        //clone the shared client so that other Zend_Service objects
        //cannot pollute it with their own requests
        $this->_client = clone self::getHttpClient();
    }

    /**
     * Method to get the external IP of the computer / server
     * the script is running on
     * @return string
     */
    public function getMyIp()
    {
        //reset the client parameters, set the URL to whatismyip.com
        //and actually perform the request
        $result = $this->_client
            ->resetParameters()
            ->setUri($this->_endpoint)
            ->request(Zend_Http_Client::GET);

        //check to make sure that the result isn't an HTTP error
        if ($result->isError()) {
            throw new Exception('Client returned error: ' . $result->getMessage());
        }

        try {
            //set up the query object with the result body (HTML page)
            $query = new Zend_Dom_Query($result->getBody());
            $domCollection = $query->query('h1');
        } catch (Zend_Dom_Exception $e) {
            throw new Exception('Error Loading Document: ' . $e->getMessage());
        }

        //check to make sure the query returned a result
        if ($domCollection->count() == 0) {
            throw new Exception('Cannot find DOM Element');
        }

        //get the title string from the node value
        $titleString = (string) $domCollection->current()->nodeValue;

        //now we should have the content of h1 stored in $titleString
        //it should read something like "Your IP Address Is: xxx.xxx.xxx.xxx"
        //Now we will parse out the IP address using regular expressions
        if (preg_match('/([\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3})/', $titleString, $matches)) {
            return $matches[1];
        } else {
            throw new Exception('Unable to parse IP from page');
        }
    }
}
I hope that the script is sufficiently commented, but to summarize:
In the constructor, we get a local instance of a Zend_Http_Client object. clone is used to prevent any other Zend_Service objects from polluting our client with their own requests.
In the getMyIp() method, we first set up and perform the request (using the fluent interface provided by Zend_Http_Client). Notice how we reset the parameters: in this case it’s not actually necessary as we aren’t passing any parameters, but it is a good habit to get into in case we later expand this class to include GET or POST parameter passing.
Next, we examine what the HTTP client has passed back to us. If nothing has gone wrong, it is an HTML string representation of the page at whatismyip.com. Some basic checks are performed to ensure that no HTTP errors have occurred, and then we instantiate a Zend_Dom_Query object, which supports both CSS and XPath selectors.
Finally, after running the CSS selector query, we check that we got a DOM element back, get its value and parse out the IP using regular expressions. It’s pretty impressive what can be done in fewer than 70 lines of code with the Zend Framework.
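Before moving the regular expression to PHP, it can be handy to try the parsing step in the same browser console used for the Dojo check. The following is a minimal JavaScript sketch of that step, not part of the PHP class; the function name extractIp and the sample heading text are assumptions made for illustration.

```javascript
// Same pattern as the PHP preg_match() call:
// four dot-separated groups of one to three digits.
const IP_PATTERN = /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/;

// Hypothetical helper: returns the first IP-like substring in `text`,
// or null when nothing matches. Like the PHP version, it does not
// check that each group is <= 255.
function extractIp(text) {
  const match = IP_PATTERN.exec(text);
  return match ? match[1] : null;
}

// Sample heading text in the assumed "Your IP Address Is: ..." format.
// 192.0.2.10 is an address reserved for documentation.
console.log(extractIp("Your IP Address Is: 192.0.2.10")); // 192.0.2.10
console.log(extractIp("No address here"));                // null
```

In the Dojo console, the text of the first node returned by dojo.query('h1') could be passed to extractIp() to confirm that the selector and the pattern work together on the live page.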
To run this class, we could create a directory structure as follows:
library <- Includes Zend framework
WhatsMyIpService.php <- The above class
getIp.php <- php file including the following:
<?php
set_include_path(implode(PATH_SEPARATOR, array(
    realpath(dirname(__FILE__) . '/library'),
    get_include_path(),
)));

/** Zend_Loader_Autoloader **/
require_once 'Zend/Loader/Autoloader.php';
Zend_Loader_Autoloader::getInstance()->setFallbackAutoloader(true);

$service = new WhatsMyIpService();
echo $service->getMyIp();
The first couple of lines put the library on the include path, the second block sets up autoloading so we don’t have to include files manually, and the final lines instantiate the class and call the method.
Greasemonkey And Dojo Integration Redux
Back in 2007 I wrote a post on how to integrate Dojo with Greasemonkey.
Since then, Greasemonkey has been rewritten to include security and bug fixes, which has broken my demo code. The problem is that the new security model doesn’t return a reference to the newly created dijit.Dialog when the constructor is called. The workaround is to set the ID of the dialog, and then call dijit.byId() to get a handle to it.
Of course, this is going to pose problems when creating non-dijit objects, as they will all be created on the page-level scope. The work-around is likely constructing clever eval() strings, and then accessing the objects using unsafeWindow. If anyone comes up with a more elegant solution, let me know about it in the comments.
The following can be used to overwrite the previous version of the user-script, restoring the broken functionality as well as making use of some of the newly introduced Dojo features.
// ==UserScript==
// @name           Dojo Integration Test
// @namespace      test
// @description    Proof Of Concept To Integrate Dojo And Greasemonkey
// @include        *
// ==/UserScript==

function startup(){
    dojo = unsafeWindow["dojo"];
    dijit = unsafeWindow["dijit"];

    dojo.addClass(dojo.body(), 'tundra');
    dojo.require("dijit.Dialog");

    //Don't do anything until "dijit.Dialog" has been loaded
    dojo.addOnLoad(function(){
        //Actually create the dialog
        new dijit.Dialog({
            id: 'test',
            title: "Dojo Integration Test",
            content: 'Dojo lives... In Greasemonkey'
        });
        dijit.byId('test').show();
    });
}

//include flags in djConfig to tell dojo it's being used after the page has loaded
unsafeWindow.djConfig = {
    afterOnLoad: true,
    addOnLoad: startup
};

//Include Dojo from the AOL CDN
var script = document.createElement('script');
script.src = "http://o.aolcdn.com/dojo/1.4/dojo/dojo.xd.js.uncompressed.js";
document.getElementsByTagName('head')[0].appendChild(script);

//Include the Tundra theme CSS file
var link = document.createElement('link');
link.rel = "stylesheet";
link.type = "text/css";
link.href = "http://o.aolcdn.com/dojo/1.4/dijit/themes/tundra/tundra.css";
document.getElementsByTagName('head')[0].appendChild(link);
JavaScript Sudoku Solver
In artificial intelligence terms, the game of Sudoku is a constraint satisfaction problem. Constraint satisfaction problems are attractive because well-known heuristics lead to a simple algorithm for solving them. On the other hand, constraint satisfaction problems with a large problem domain may take an inordinate amount of time to solve.
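To make the idea of a heuristic-guided constraint search concrete, here is a minimal sketch of a backtracking solver that always fills the empty cell with the fewest remaining candidates. It is an illustration under stated assumptions, not the solver from the full entry: the board is assumed to be a flat array of 81 numbers with 0 for an empty cell, the givens are assumed to be consistent, and the names candidates and solve are made up for this example.

```javascript
// Values still allowed in cell i: digits not already used
// in its row, its column, or its 3x3 box.
function candidates(board, i) {
  const row = Math.floor(i / 9), col = i % 9;
  const boxRow = row - (row % 3), boxCol = col - (col % 3);
  const used = new Set();
  for (let k = 0; k < 9; k++) {
    used.add(board[row * 9 + k]);   // same row
    used.add(board[k * 9 + col]);   // same column
    used.add(board[(boxRow + Math.floor(k / 3)) * 9 + boxCol + (k % 3)]); // same box
  }
  const result = [];
  for (let v = 1; v <= 9; v++) {
    if (!used.has(v)) result.push(v);
  }
  return result;
}

// Fills the board in place. Returns true if a solution was found,
// false if the current position cannot be completed.
function solve(board) {
  // Heuristic: choose the empty cell with the fewest candidates.
  let best = -1, bestOptions = null;
  for (let i = 0; i < 81; i++) {
    if (board[i] !== 0) continue;
    const options = candidates(board, i);
    if (options.length === 0) return false; // dead end, backtrack
    if (bestOptions === null || options.length < bestOptions.length) {
      best = i;
      bestOptions = options;
    }
  }
  if (best === -1) return true; // no empty cells left
  for (const v of bestOptions) {
    board[best] = v;
    if (solve(board)) return true;
  }
  board[best] = 0; // undo before returning to the caller
  return false;
}

// Usage: a sample puzzle, "." marks an empty cell.
const rows = [
  "53..7....", "6..195...", ".98....6.",
  "8...6...3", "4..8.3..1", "7...2...6",
  ".6....28.", "...419..5", "....8..79",
];
const board = Array.from(rows.join(""), ch => (ch === "." ? 0 : Number(ch)));
if (solve(board)) {
  for (let r = 0; r < 9; r++) console.log(board.slice(r * 9, r * 9 + 9).join(""));
}
```

Picking the most constrained cell first keeps the branching factor small, which is why this kind of search usually finishes quickly even though the unconstrained search space is enormous.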
Sudoku consists of a 9×9 grid, with each cell taking one of 9 possible values. If we ignore all the constraints, this gives 9^81 possible boards (about 1.9662705 × 10^77). Say we can check one hundred billion boards a second (100,000,000,000); it would still take ((9^81)/100000000000)/(31556926) = 6.23 × 10^58 years to iterate over all possible combinations! Clearly, randomly generating boards and checking whether the constraints are fulfilled is not a realistic solution.
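The arithmetic above is easy to reproduce. The sketch below uses JavaScript BigInt so that 9^81 stays exact; the check rate and the number of seconds per year are the figures from the text, and the helper toScientific is a made-up name for this example.

```javascript
// Number of boards when every one of the 81 cells can take any of 9 values.
const boards = 9n ** 81n;

// Figures taken from the text: boards checked per second, seconds per year.
const boardsPerSecond = 100000000000n;
const secondsPerYear = 31556926n;

// Whole years needed to enumerate every board (integer division).
const years = boards / (boardsPerSecond * secondsPerYear);

// Hypothetical helper: formats a positive BigInt as "d.ddd" + "e" + exponent,
// truncating (not rounding) after `digits` decimal places.
function toScientific(value, digits) {
  const s = value.toString();
  return s[0] + "." + s.slice(1, 1 + digits) + "e" + (s.length - 1);
}

console.log(toScientific(boards, 4)); // 1.9662e77
console.log(toScientific(years, 2));  // 6.23e58
```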
Shellinabox Gentoo Init Script
I’ve been playing around with ShellInABox and I think it is quite neat. From the website:
Shell In A Box implements a web server that can export arbitrary command line tools to a web based terminal emulator. This emulator is accessible to any JavaScript and CSS enabled web browser and does not require any additional browser plugins.
So basically it gives you a shell to your local system wherever you go (as long as you have a browser released since the turn of the century). This can be especially handy if you are on a public machine without permission to install software (such as PuTTY). The only disadvantage is that the remote machine has to be running shellinabox, so this will not work for shared hosting environments. However, if you set up a shellinabox machine, you can then SSH from it into other boxes that aren’t running the daemon.
I’m running Gentoo Linux on my utility machine, and shellinabox doesn’t ship with a Gentoo init script. Please enjoy the one I have written below. Basically, I installed shellinabox normally, then copied its generated certs to /var/lib/shellinabox to be used for SSL connections.
Corrections and improvements are appreciated.
/etc/init.d/shellinaboxd:
#!/sbin/runscript
# Copyright 1999-2009 Gentoo Foundation
# Distributed under the terms of the GNU General Public License v2
# $Header: $

CMD=/usr/local/bin/shellinaboxd
CERT_DIR=/var/lib/shellinabox
PIDFILE=/var/run/shellinabox.pid

depend() {
	need net
}

start() {
	ebegin "Starting Shellinabox"
	start-stop-daemon --start --pidfile $PIDFILE --exec $CMD -- --cert $CERT_DIR -b=$PIDFILE
	eend $?
}
Translink Zend Framework API
Translink is the local public transit provider for beautiful Vancouver, Canada. The system consists of buses, boats and trains. Translink released an iPhone app some time ago that allows the lookup of bus information. Michael Weisman was kind enough to write about the “hidden” API that the iPhone app uses to perform AJAX calls.