Posts Tagged PHP

Zend Framework Badword filtering with Cdyne

One of my recent PHP projects had a requirement to filter inappropriate language out of user-submitted content. After thinking about the problem briefly, I decided that I didn’t want to write the filter myself but would rather find a third-party service that could filter my text for me. By doing this, I eliminated the need to create and maintain a bad-word list, and saved the CPU cycles required to actually perform the search-and-replace (although, arguably, remote API calls are more expensive anyway).

After some searching I stumbled across the free Cdyne Profanity Filter Service. Not only does this service filter out the standard inappropriate language that you would expect, it also avoids false positives (e.g., the word hello isn’t filtered for containing the word hell), and it has fairly robust phonetic character matching to catch things like a$$. The Cdyne service is exposed through a SOAP WSDL, so it is easy to interface with from languages other than PHP.
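To illustrate the false-positive problem mentioned above, here is a minimal sketch in plain PHP. It is not how Cdyne implements its filter; the function names and the one-word list are assumptions made up for this example. It only shows why naive substring replacement mangles a word like hello, and how word-boundary matching avoids that.

```php
<?php
/**
 * Naive approach: replace the bad word wherever it appears,
 * even inside innocent words.
 */
function filterNaive($text, array $badWords)
{
    return str_ireplace($badWords, '****', $text);
}

/**
 * Whole-word approach: only replace a bad word when it stands alone.
 */
function filterWholeWords($text, array $badWords)
{
    // Escape each word so it is safe to embed in the pattern.
    $alternatives = implode('|', array_map(function ($word) {
        return preg_quote($word, '/');
    }, $badWords));

    return preg_replace('/\b(?:' . $alternatives . ')\b/i', '****', $text);
}

// Hypothetical word list, for illustration only.
$badWords = array('hell');

echo filterNaive('hello', $badWords), "\n";
echo filterWholeWords('hello', $badWords), "\n";
echo filterWholeWords('What the hell', $badWords), "\n";
```

Even the whole-word version does nothing about phonetic spellings such as a$$, which is part of the reason a maintained service is attractive.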

I ended up writing a Zend Framework based SOAP client service for the Cdyne filter, and I figured I would share it with anyone else who is looking to do filtering. The following zip contains the service class, along with some unit tests demonstrating the use of the class methods. You should be able to rename the Zext_Service_Cdyne_ProfanityFilter class to one of your choosing if you do not like the pseudo-namespacing I’ve used. Check out Cdyne’s wiki for more info.

ProfanityFilter.zip


Build an API for any website with Web-Scraping

There are a lot of websites out there, with a lot of data on them. Sometimes you are building a killer web app and you just have to have some data off a certain site. The problem is, that particular site doesn’t have an API that you can just plug into! Never fear: using some simple tools combined with the Zend Framework, you can create your own web-scraping (screen-scraping) API in no time.

Before I continue, I should mention that it is not only impolite but quite possibly illegal to take the content of a third-party site without the owner’s permission or knowledge.

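To accomplish our task we need some open-source tools: the Zend Framework, the SelectorGadget bookmarklet (or the Quix bookmarklet, which includes it), and a Dojo and Firebug Lite bookmarklet.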

Our basic strategy is this:

  1. Find a target page with some cool content
  2. Find the CSS selector (or XPath) of the element we want to scrape
  3. Verify the selector using Dojo
  4. Build a Zend_Service object to fetch the cool content to PHP

Step 1: Finding the cool page

Say I am a developer and I really want to get the outward-facing IP of the machine that my PHP script is running on. If there is a router or proxy between my machine and the Internet, this can be non-trivial. As a solution I decide to use the service http://www.whatismyip.com/

Step 2: Finding the CSS3 Selector for the element

It’s pretty obvious which element we want to get the selector for. If you are feeling clever you can simply read the code or use a DOM inspector to figure out what the query is on your own. However, selectorgadget.com provides a cool tool that will allow you to point and click to determine the appropriate selector. You can get a bookmarklet directly from their site, or you can get the much more powerful Quix bookmarklet, which includes SelectorGadget and a bunch of other cool tools.

If you decided on Quix (I know I did), click on the bookmarklet and enter “sg” in the command box. SelectorGadget should load up, and as you move your mouse around the screen it should highlight different DOM nodes. We want the top one, so click on the text “Your IP…” and, checking out the path, we see that it is a rather boring h1.

Step 3: Validation via Dojo

I like to validate the path before taking it over to Zend. I use a Dojo and Firebug Lite bookmarklet, which injects a debugging version of Dojo into any page via the AOL CDN. To add this bookmarklet, drag Inject Dojo into your bookmarks toolbar. In the debugging console that pops up, enter dojo.query('h1'); and you should see the h1 DOM element being returned.

Step 4: Moving it all to PHP

Now that we have successfully found our CSS3 selector path, we can move over to PHP and come up with a new Zend_Service component. Our approach will be to extend Zend_Service_Abstract and implement some custom methods to perform the screen scrape.

The finished class looks like this:

class WhatsMyIpService extends Zend_Service_Abstract
{
	/**
	 * The service endpoint.
	 * This is where Zend_Http_Client will navigate
	 * to fill service requests
	 * @var string
	 */
	protected $_endpoint = 'http://whatismyip.com';
 
	/**
	 * handle to the client
	 * @var Zend_Http_Client
	 */
	protected $_client;
 
	public function __construct()
	{
		//clone the shared client so other Zend_Service objects cannot alter our copy
		$this->_client = clone self::getHttpClient();
	}
 
	/**
	 * Method to get the external IP of the computer / server
	 * script is running on
	 * @return string
	 */
	public function getMyIp()
	{
		//reset the client parameters, set the URL to whatismyip.com
		//and actually perform the request
		$result = $this->_client
			->resetParameters()
			->setUri($this->_endpoint)
			->request(Zend_Http_Client::GET);
 
		//check to make sure that the result isn't an HTTP error
		if($result->isError()){
			throw new Exception('Client returned error: ' . $result->getMessage());
		}
 
		try{
			//setup the query object with the result body (HTML page)
			$query = new Zend_Dom_Query($result->getBody());
 
			$domCollection = $query->query('h1');
		}catch(Zend_Dom_Exception $e){
			throw new Exception('Error Loading Document: ' . $e->getMessage());
		}
 
		//check to make sure the query returned a result
		if($domCollection->count() == 0){
			throw new Exception('Cannot find DOM Element');
		}
 
		//get the titlestring from the nodevalue
		$titleString = (string) $domCollection->current()->nodeValue;
 
		//now we should have the content of h1 stored in the titleString
		//it should read something like "Your IP Address Is: xxx.xxx.xxx.xxx"
 
		//Now we will parse out the IP address using regular expressions
		if (preg_match('/([\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3})/', $titleString, $matches)) {
			return $matches[1];
		}else{
			throw new Exception('Unable to parse IP from page');
		}
	}
}

I hope that the script is sufficiently commented, but to summarize:

In the constructor, we get a local instance of a Zend_Http_Client object. The clone keyword is used to prevent any other Zend_Service objects from polluting our client with their own requests.

In the getMyIp() method, we first set up and perform the request (using the fluent interface provided by Zend_Http_Client). Notice how we reset the parameters: in this case it’s not actually necessary as we aren’t passing any parameters, but it is a good habit to get into in case we later expand this class to include GET or POST parameter passing.

Next, we examine what the HTTP client has passed back to us. Hopefully, if nothing has gone wrong, it is an HTML string representation of the page at whatismyip.com. Some basic checks are performed to ensure that no HTTP errors have occurred, and then we instantiate a Zend_Dom_Query object, which provides both CSS and XPath selectors.
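If you want to experiment with the query step without the Zend Framework on hand, the same lookup can be sketched with PHP’s built-in DOM extension, using the XPath expression //h1 in place of the CSS selector h1. The helper name firstNodeText and the sample markup are assumptions for illustration; the sample is a stand-in, not the real page’s HTML.

```php
<?php
/**
 * Return the text of the first node matching an XPath expression,
 * or null when nothing matches.
 */
function firstNodeText($html, $xpathExpression)
{
    $document = new DOMDocument();

    // Real-world HTML is rarely well formed, so collect parser
    // warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    $nodes = $xpath->query($xpathExpression);

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return $nodes->item(0)->nodeValue;
}

// Stand-in markup, using a documentation-only IP address.
$html = '<html><body><h1>Your IP Address Is: 203.0.113.7</h1></body></html>';

echo firstNodeText($html, '//h1'), "\n";
```

Zend_Dom_Query saves you from writing the XPath by hand, which matters more once the selector is something less boring than h1.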

Finally, after running the CSS selector query, we check to ensure we got a DOM element back, get its value and parse out the IP using regular expressions. It’s pretty impressive what can be done in fewer than 70 lines of code with the Zend Framework.

To run this class, we could create a directory structure as follows:

  library <- Includes the Zend Framework
  WhatsMyIpService.php <- The above class
  getIp.php <- PHP file including the following:

set_include_path(implode(PATH_SEPARATOR, array(
    realpath(dirname(__FILE__) . '/library'),
    get_include_path(),
)));
 
/** Zend_Loader_Autoloader **/
require_once 'Zend/Loader/Autoloader.php';
Zend_Loader_Autoloader::getInstance()->setFallbackAutoloader(true);
 
$service = new WhatsMyIpService();
echo $service->getMyIp();

The first few lines put the library on the include path, the second block sets up autoloading so we don’t have to manually include files, and the final lines instantiate the class and call the method.
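One thing worth knowing about the parsing step in getMyIp(): the pattern accepts any four groups of one to three digits, so a string like 999.1.1.1 would pass. A possible hardening, sketched here in plain PHP with a hypothetical helper named extractIpv4, is to run each candidate through filter_var() before trusting it.

```php
<?php
/**
 * Extract the first valid IPv4 address from a string,
 * or return null when none is found.
 */
function extractIpv4($text)
{
    // Collect every dotted-quad candidate in the text.
    if (!preg_match_all('/\b\d{1,3}(?:\.\d{1,3}){3}\b/', $text, $matches)) {
        return null;
    }

    foreach ($matches[0] as $candidate) {
        // filter_var() rejects candidates with out-of-range octets.
        if (filter_var($candidate, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false) {
            return $candidate;
        }
    }

    return null;
}

// Sample heading text with a documentation-only IP address.
echo extractIpv4('Your IP Address Is: 203.0.113.7'), "\n";
```

Returning null rather than throwing is just a choice for this sketch; inside the service class, throwing an exception as the original method does would be more consistent.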


Translink Zend Framework API

Translink is the local public transit provider for beautiful Vancouver, Canada. The system consists of buses, boats and trains. Translink released an iPhone app some time ago that allows the lookup of bus information. Michael Weisman was kind enough to write about the “hidden” API that is used by the iPhone app to perform AJAX calls.

BC Lottery Corporation API

The British Columbia Lottery Corporation has an unpublished API that they use to pull data down for the Flash application on their home page. The Zext PHP API exposes functionality to query the most recent winning numbers from the BCLC website, as well as retrieve current jackpot estimates for the main lotteries in this province.

Zend Framework Doctrine Model Autoloader

There have been several tutorials outlining how to autoload Doctrine Models using the Zend_Loader_Autoloader. However, none of these have permitted Zend / PEAR style naming conventions for models. I prefer to use these conventions because, although it makes my model names longer, the “name-spacing” gives a certain degree of organization and order to the application.
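
For anyone unfamiliar with the convention, here is a minimal sketch of the Zend/PEAR-style mapping from a class name to a file path, where each underscore becomes a directory separator. The function name pearClassToPath and the Zext_Model_User class are hypothetical, chosen only for illustration; this is not the autoloader from the full post.

```php
<?php
/**
 * Map a PEAR-style class name to its relative file path:
 * underscores become directory separators and ".php" is appended.
 */
function pearClassToPath($className)
{
    return str_replace('_', '/', $className) . '.php';
}

// The same mapping explains the familiar Zend include path.
echo pearClassToPath('Zend_Loader_Autoloader'), "\n";

// Hypothetical model class name, for illustration only.
echo pearClassToPath('Zext_Model_User'), "\n";
```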