Archive for the ‘PHP’ category

Scraping Google Front Page Results

August 25th, 2009

In this article I’ll show you how to use cURL and simple_html_dom to scrape the basic content from the first page of Google results for a given search query.

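What you will need: PHP with the cURL extension enabled, and simple_html_dom.php placed in the same directory as the script.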

The Code

<?php
include('simple_html_dom.php');
 
function strip_tags_content($text, $tags = '', $invert = FALSE) {
	/*
	This function removes all html tags and the contents within them
	unlike strip_tags which only removes the tags themselves.
	*/
	//removes <br> often found in google result text, which is not handled below
	$text = str_ireplace('<br>', '', $text);
 
	preg_match_all('/<(.+?)[\s]*\/?[\s]*>/si', trim($tags), $tags);
	$tags = array_unique($tags[1]);
 
	if(is_array($tags) AND count($tags) > 0) {
		//if invert is false, it will remove all tags except those passed in
		if($invert == FALSE) {
			return preg_replace('@<(?!(?:'. implode('|', $tags) .')\b)(\w+)\b.*?>.*?</\1>@si', '', $text);
		//if invert is true, it will remove only the tags passed to this function
		} else {
			return preg_replace('@<('. implode('|', $tags) .')\b.*?>.*?</\1>@si', '', $text);
		}
	//if no tags were passed to this function, simply remove all the tags
	} elseif($invert == FALSE) {
		return preg_replace('@<(\w+)\b.*?>.*?</\1>@si', '', $text);
	}
 
	return $text;
}
 
function file_get_contents_curl($url) {
	/*
	This is a file_get_contents replacement function using cURL
	One slight difference is that it uses your browser's identity
	as its own when contacting Google.
	*/
	$ch = curl_init();
 
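	//fall back to a generic User-Agent when the script is not called from a browser
	$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'Mozilla/5.0';
	curl_setopt($ch, CURLOPT_USERAGENT, $agent);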
	curl_setopt($ch, CURLOPT_HEADER, 0);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	curl_setopt($ch, CURLOPT_URL, $url);
 
	$data = curl_exec($ch);
	curl_close($ch);
 
	return $data;
}
 
//Set query if any passed; urlencode already turns spaces into '+',
//so replacing them first would send a literal %2B to Google
$q = isset($_GET['q']) ? urlencode($_GET['q']) : 'none';
 
//Obtain the first page html with the formatted url
$data = file_get_contents_curl('http://www.google.com/search?hl=en&q='.$q);
 
/*
create a simple_html_dom object from the retrieved string
you could also perform file_get_html("http://...") instead of
file_get_contents_curl above, but it wouldn't change the default
User-Agent
*/
 
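$html = str_get_html($data);
//str_get_html returns false when there is nothing to parse
if(!$html) exit('No HTML could be retrieved.');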
 
 
$result = array();
 
foreach($html->find('li.g') as $g)
{
	/*
	each search result is in a list item with a class name 'g'
	we are separating each of the elements within, into an array
 
	Titles are stored within <h3><a...>{title}</a></h3>
	Links are in the href of the anchor contained in the <h3>...</h3>
	Summaries are stored in a div with a classname of 's'
	*/
 
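	$h3 = $g->find('h3.r', 0);
	$s = $g->find('div.s', 0);
	//skip list items that are not regular results (no title heading or link)
	$a = $h3 ? $h3->find('a', 0) : null;
	if(!$a) continue;
	$result[] = array('title' => strip_tags($a->innertext), 
		'link' => $a->href, 
		'description' => $s ? strip_tags_content($s->innertext) : '');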
}
 
if(isset($_GET['serialize']) && $_GET['serialize'] == '1')
{
	/* 
	if you pass serialize=1 to the script
	it will echo out a serialized string
	which can be unserialized back to an 
	array on a receiving script
	*/
	echo serialize($result);
}
else
{
	/* 
	Otherwise it prints out the array structure so that it
	is more human readable. You could instead perform a 
	foreach loop on the variable $result so that you can 
	organize the html output, or insert the data into a database
	*/
	echo "<textarea style='width: 1024px; height: 600px;'>";
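	//escape the dump so markup in the results cannot break out of the textarea
	echo htmlspecialchars(print_r($result, true));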
	echo "</textarea>";
}
//Cleans up the memory 
$html->clear(); exit();
?>

The variable ‘q’ passed to the file will provide the query string. If you pass serialize=1 to the script, the output will be a serialized string which can be converted back to an array with the PHP function unserialize.
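
On the receiving side, a minimal sketch might look like the following. The function name decode_results is hypothetical, and the allowed_classes option assumes PHP 7 or later; it keeps unserialize from building objects out of data that arrived over the network.

```php
<?php
// Hypothetical helper: turn the serialized string back into the result array.
function decode_results($payload) {
	// allowed_classes => false (PHP 7+) refuses to build objects from remote data
	$result = @unserialize($payload, array('allowed_classes' => false));

	// unserialize returns false on malformed input
	return is_array($result) ? $result : array();
}

// Stand-in for the string the scraper echoes when called with serialize=1
$payload = serialize(array(
	array('title' => 'Example', 'link' => 'http://www.example.com/', 'description' => 'Sample text'),
));

foreach (decode_results($payload) as $row) {
	echo $row['title'] . ' - ' . $row['link'] . "\n";
}
```

In a real receiving script, $payload would be the body fetched from the scraper's URL rather than a locally built string.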

Keep in mind that hitting Google too often for search results from the same server might eventually get your server blocked, especially if you run multiple queries in a short period of time.
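
One way to cut down on repeat requests is to keep each results page on disk for a while and reuse it. The sketch below is only an illustration: cached_fetch, its parameters and the cache file naming are all hypothetical, and the fetcher is passed in as a callable so any download function can be used.

```php
<?php
// Hypothetical helper: reuse a stored copy of a page for $ttl seconds
// before calling $fetcher (for example file_get_contents_curl) again.
function cached_fetch($url, $ttl, $fetcher, $cache_dir = null) {
	$cache_dir = ($cache_dir === null) ? sys_get_temp_dir() : $cache_dir;
	$file = $cache_dir . '/scrape_' . md5($url) . '.cache';

	// Serve the stored copy while it is still fresh
	if (is_file($file) && (time() - filemtime($file)) < $ttl) {
		return file_get_contents($file);
	}

	$data = call_user_func($fetcher, $url);

	// Only store successful, non-empty responses
	if (is_string($data) && $data !== '') {
		file_put_contents($file, $data, LOCK_EX);
	}

	return $data;
}
```

With the script above, the download line could then become something like cached_fetch('http://www.google.com/search?hl=en&q='.$q, 3600, 'file_get_contents_curl'), so the same query reaches Google at most once an hour.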

An example of the output when the word ‘funny’ is searched.

Array
(
    [0] => Array
        (
            [title] => Funny Videos, Funny Pictures, Funny Jokes, Funny News - Funny.com
            [link] => http://www.funny.com/
            [description] =>  videos,  pictures,  jokes,  news.
        )
 
    [1] => Array
        (
            [title] => Lolcats 'n' Funny Pictures of Cats – I Can Has Cheezburger?
            [link] => http://icanhascheezburger.com/
            [description] => Aug 25, 2009  Humorous captioned pictures of felines and other animals. 
            Visitors can submit their own material or add captions to a large archive of 
        )
 
    [2] => Array
        (
            [title] => Funny Videos, Funny Pictures, Flash Games, Jokes
            [link] => http://www.ebaumsworld.com/
            [description] =>  videos, flash games, clean jokes, clean humor, hilarious flash,  pics , 
            office humor, prank phone calls, flash cartoons,  animation, 
        )
 
    [3] => Array
        (
            [title] => Video results for funny
            [link] => http://video.google.com/videosearch?hl=en&q=funny&um=1&ie=UTF-8&ei=TW2USrSHIIK0NriQwPoH&sa=X
            &oi=video_result_group&ct=title&resnum=4
            [description] => CollegeHumor's recent videos section has all the best  videos on the Internet. 
            There are  video clips, hilarious viral videos,  college 
        )
 
    [4] => Array
        (
            [title] => Funny Videos on CollegeHumor. Watch funny videos and comedy movie ...
            [link] => http://www.collegehumor.com/videos
            [description] => CollegeHumor's recent videos section has all the best  videos on the Internet. 
            There are  video clips, hilarious viral videos,  college 
        )
 
    [5] => Array
        (
            [title] => Funnyjunk - Funny Pictures and Funny Videos
            [link] => http://www.funnyjunk.com/
            [description] =>  pictures,  videos, flash games and  movies.
        )
 
    [6] => Array
        (
            [title] => Photobucket | funny Pictures, funny Images, funny Photos
            [link] => http://photobucket.com/images/funny/
            [description] => View 462373  Pictures,  Images,  Photos on Photobucket. Share them with 
            your friends on MySpace or upload your own!
        )
 
    [7] => Array
        (
            [title] => FAIL Blog: Pictures and Videos of Owned, Pwnd and Fail Moments
            [link] => http://failblog.org/
            [description] =>  FAIL Pictures and Videos. Home Send In The Fail Boat Vote [Random]. You must 
            be logged in to add favorites | Register for a new account 
        )
 
    [8] => Array
        (
            [title] => funny
            [link] => http://www.reddit.com/r/funny/
            [description] => . (147376 subscribers). a community for 1 year. moderators · Submit a link. to 
            anything interesting: news article, blog entry, video, picture. 
        )
 
    [9] => Array
        (
            [title] => News results for funny
            [link] => http://news.google.com/news?hl=en&q=funny&um=1&ie=UTF-8&ei=TW2USrSHIIK0NriQwPoH&sa=X
            &oi=news_group&ct=title&resnum=11
            [description] => "A more mature but still  Judd Apatow comedy whose move into serious 
            human relation issues nearly scuttles the third act. 
        )
 
)

Four free Geolocation Methods

August 22nd, 2009

Geolocation, or geo-targeting, is a method of identifying a visitor’s location in the world. You can use this information for anything from greeting a visitor in their native language to automatically redirecting visitors to affiliate offers that are valid for their region. This article looks at four services that offer geolocation for free, focusing primarily on retrieving the visitor’s country code.

Maxmind GeoLite Country

The GeoLite Country database, which Maxmind advertises as 99.5% accurate, lets you look up a visitor’s country locally on your server. The database file is updated on the first of each month. The benefit of a local binary database is that lookups are performed right on the server, so they may be faster and easier to cache than requests to a remote server you have no control over. However, performing a large number of queries may put some strain on your server if you have a cheaper machine or use shared hosting.

The easiest way to use the GeoLite country database is to download the binary format along with the geoip.inc file from the Maxmind PHP API. Once the GeoIP.dat and geoip.inc files are uploaded onto the server, the following PHP code will retrieve the visitor’s country code.

include("geoip.inc");
$gi = geoip_open("GeoIP.dat",GEOIP_STANDARD);
$addr = $_SERVER["REMOTE_ADDR"];
$country_code = geoip_country_code_by_addr($gi, $addr);	
geoip_close($gi);

APIs are also available for C, Perl, JavaScript, Python, C#, Ruby and Java, along with an Apache module and Microsoft COM objects.

GeoPlugin

One of the more flexible remote services is GeoPlugin.com. It provides the most information from a single line of PHP code, but requires a connection to the geoplugin.com server for every request (I recommend caching or using cookies for repeat visitors).

PHP Example:

$fetch = unserialize(fetch_url('http://www.geoplugin.net/php.gp?ip='.$_SERVER['REMOTE_ADDR']));
$geo_code = $fetch['geoplugin_countryCode'];

You can also easily grab the city name using the ‘geoplugin_city’ key in the returned array. Other options supported are JavaScript, JSON and XML.
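
The examples in this article call fetch_url, which is not a built-in PHP function. A minimal sketch of such a helper, assuming the cURL extension is available, could look like this; the timeout values are arbitrary choices, not recommendations from any of the services.

```php
<?php
// Assumed helper: fetch_url() is not built into PHP, so this is one
// possible definition using cURL. Returns the body, or false on failure.
function fetch_url($url) {
	$ch = curl_init();

	curl_setopt($ch, CURLOPT_URL, $url);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	curl_setopt($ch, CURLOPT_HEADER, 0);
	// Arbitrary limits so a slow geolocation server cannot stall the page
	curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 3);
	curl_setopt($ch, CURLOPT_TIMEOUT, 5);

	$data = curl_exec($ch);
	$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
	// The handle is released when $ch goes out of scope

	// Treat transport errors and non-200 responses as a failed lookup
	if ($data === false || $status != 200) {
		return false;
	}

	return $data;
}
```

Returning false on failure lets calling code test the result with a plain if(!$fetch) before trying to parse it.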

HostIP.info

HostIP.info offers the simplest way to retrieve a country code for a given IP address.

$geo_code = trim(fetch_url("http://api.hostip.info/country.php?ip=".$_SERVER['REMOTE_ADDR']));

IpInfoDB.com

While not as straightforward as HostIP or GeoPlugin, IpInfoDB.com provides a backup server in the event the first request fails.

$fetch = fetch_url("http://www.ipinfodb.com/ip_query_country.php?ip=".$_SERVER['REMOTE_ADDR']."&output=xml");
if(!$fetch)
	$fetch = fetch_url("http://backup.ipinfodb.com/ip_query.php?ip=".$_SERVER['REMOTE_ADDR']."&output=xml");
 
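//simplexml_load_string returns false on bad input, whereas
//new SimpleXMLElement() would throw an exception
$fetchb = $fetch ? simplexml_load_string($fetch) : false;
if (!$fetchb) 
	$geo_code = "";
else
	$geo_code = (string) $fetchb->CountryCode;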

There you have it: four different geolocation services. Depending on your needs, some may perform better than others. For high-traffic sites the Maxmind solution is likely the best one, short of paying for a monthly service.
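
Because the remote services can fail or time out, one option is to try several lookups in order and keep the first usable answer. The sketch below is only an illustration: first_country_code is a hypothetical name, each lookup is passed in as a callable so any of the four methods above can be plugged in, and the two-letter check assumes ISO-style country codes.

```php
<?php
// Hypothetical helper: run each lookup in turn and return the first
// result that looks like a two-letter country code, or '' if none does.
function first_country_code($ip, $lookups) {
	foreach ($lookups as $lookup) {
		$code = strtoupper(trim((string) call_user_func($lookup, $ip)));

		if (preg_match('/^[A-Z]{2}$/', $code)) {
			return $code;
		}
	}

	return '';
}

// Stand-in lookups; real ones would wrap the GeoLite, GeoPlugin,
// HostIP or IpInfoDB snippets shown earlier.
$lookups = array(
	function ($ip) { return false; },   // simulates a failed request
	function ($ip) { return "ca\n"; },  // simulates a raw response body
);

echo first_country_code('203.0.113.7', $lookups);
```

Putting the local Maxmind lookup first would keep most requests off the network, with the remote services acting only as a fallback.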

PayPal IPN with PHP

August 19th, 2009

If you’re a freelance coder you most likely have a PayPal account. One of the most useful features provided by PayPal for anyone looking to automate their ordering process is Instant Payment Notification (IPN). This guide will show you how to use IPN with a PayPal ‘Buy Now’ button and PHP. Additional tips to further secure your ordering process are also discussed.

What is IPN?

In a nutshell, IPN is PayPal’s method of instantly notifying your server of a payment. It can either be set up globally for all transactions on your account, or be provided for a specific button or subscription.


The IPN Process:

  1. PayPal sends details of the transaction to your server at the provided URL.
  2. Your script compiles the information provided and sends it to PayPal’s verification server.
  3. PayPal’s server will either verify the information as valid, or will reject it as invalid.
  4. If considered a valid transaction, process the information as needed, otherwise discard and ignore.
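
The four steps above can be sketched roughly as follows. This is an outline under stated assumptions, not PayPal's reference code: the function names are hypothetical, the endpoint URL should be checked against PayPal's current documentation, and the HTTP call is passed in as a callable so the logic can be exercised without contacting PayPal. The parts taken from the IPN protocol itself are the cmd=_notify-validate prefix on the echoed data and the VERIFIED or INVALID reply.

```php
<?php
// Step 2: echo back every field PayPal sent, prefixed with cmd=_notify-validate.
function build_ipn_validation_body($post) {
	$body = 'cmd=_notify-validate';
	foreach ($post as $key => $value) {
		$body .= '&' . urlencode($key) . '=' . urlencode($value);
	}
	return $body;
}

// Steps 2-4: send the body for verification and report whether PayPal accepted it.
// $sender is an assumed callable: function ($url, $body) returning the response text.
function ipn_is_verified($post, $sender, $url = 'https://ipnpb.paypal.com/cgi-bin/webscr') {
	$reply = call_user_func($sender, $url, build_ipn_validation_body($post));

	// PayPal answers with the single word VERIFIED or INVALID
	return is_string($reply) && trim($reply) === 'VERIFIED';
}

// Stand-in sender for illustration; a real one would POST $body to $url with cURL.
$fake_sender = function ($url, $body) {
	return (strpos($body, 'cmd=_notify-validate') === 0) ? 'VERIFIED' : 'INVALID';
};

// Step 1 would normally use $_POST; sample data is used here instead.
$post = array('txn_id' => 'ABC123', 'payment_status' => 'Completed');

if (ipn_is_verified($post, $fake_sender)) {
	echo "valid transaction, process the order\n";
} else {
	echo "invalid, discard and ignore\n";
}
```

Even after a VERIFIED reply, a real script would still want to compare the receiver account, amount and currency against what the order expects before processing it.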

Creating a Payment Button

Once logged into PayPal, direct your attention to the Merchant Services section.

The “Buy Now Button” will be sufficient for a one-time purchase of an electronic or tangible product. The button can be set up as a simple Buy Now button for a single purchase with no options, or with drop-downs for product choices or other options. Either way, there are two key settings you will want to use when creating your button.

Secure Merchant ID

Select Secure Merchant ID as the means of identifying the account, as opposed to an email address. This will not only help prevent automated spam attempts but will also help prevent fake transactions (more on that later).

IPN Location

You will also need to add the following line to the advanced options. The URL will be the destination notified in the event of a payment or related transaction. Normally you don’t want to call it something as obvious as ipn.php in the root of your site; rather, bury the script in a folder and give it some other name such as txn1.php.



Tracking Option

If you have multiple options that you will want to verify in your IPN script, you can get PayPal to send you the specific Item ID by turning on the tracking option for the button.

Once you have created your button and generated the code to be used on the site, you will want to set up your IPN script to save the necessary transactions.