Scraping Google Front Page Results

August 25th, 2009

In this article I’ll show you how to use cURL and simple_html_dom to scrape the basic content (title, link and summary) from the front page of Google results for a given search query.

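What you will need: PHP with the cURL extension enabled, and the simple_html_dom.php library file placed where the script can include it.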

The Code

<?php
include('simple_html_dom.php');
 
function strip_tags_content($text, $tags = '', $invert = FALSE) {
	/*
	This function removes all html tags and the contents within them
	unlike strip_tags which only removes the tags themselves.
	*/
	//removes <br> often found in google result text, which is not handled below
	$text = str_ireplace('<br>', '', $text);
 
	preg_match_all('/<(.+?)[\s]*\/?[\s]*>/si', trim($tags), $tags);
	$tags = array_unique($tags[1]);
 
	if(is_array($tags) AND count($tags) > 0) {
		//if invert is false, it will remove all tags except those passed in $tags
		if($invert == FALSE) {
			return preg_replace('@<(?!(?:'. implode('|', $tags) .')\b)(\w+)\b.*?>.*?</\1>@si', '', $text);
		//if invert is true, it will remove only the tags passed to this function
		} else {
			return preg_replace('@<('. implode('|', $tags) .')\b.*?>.*?</\1>@si', '', $text);
		}
	//if no tags were passed to this function, simply remove all the tags
	} elseif($invert == FALSE) {
		return preg_replace('@<(\w+)\b.*?>.*?</\1>@si', '', $text);
	}
 
	return $text;
}
 
function file_get_contents_curl($url) {
	/*
	This is a file_get_contents replacement function using cURL
	One slight difference is that it uses your browser's identity
	as its own when contacting Google.
	*/
	$ch = curl_init();
 
	curl_setopt($ch, CURLOPT_USERAGENT,	$_SERVER['HTTP_USER_AGENT']);
	curl_setopt($ch, CURLOPT_HEADER, 0);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	curl_setopt($ch, CURLOPT_URL, $url);
 
	$data = curl_exec($ch);
	curl_close($ch);
 
	return $data;
}
 
//Set the query if one was passed; urlencode() already turns spaces into '+',
//so replacing them first would send a literal %2B instead
$q = isset($_GET['q']) ? urlencode($_GET['q']) : 'none';
 
//Obtain the first page html with the formatted url
$data = file_get_contents_curl('http://www.google.com/search?hl=en&q='.$q);
 
/*
create a simple_html_dom object from the retrieved string
you could also perform file_get_html("http://...") instead of
file_get_contents_curl above, but it wouldn't change the default
User-Agent
*/
 
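$html = str_get_html($data);
//str_get_html returns false when the fetch failed or the string could not be parsed
if(!$html) {
	exit('Could not fetch or parse the results page.');
}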
 
 
$result = array();
 
foreach($html->find('li.g') as $g)
{
	/*
	each search result is in a list item with a class name 'g'
	we are separating each of the elements within, into an array
 
	Titles are stored within <h3><a...>{title}</a></h3>
	Links are in the href of the anchor contained in the <h3>...</h3>
	Summaries are stored in a div with a classname of 's'
	*/
 
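	$h3 = $g->find('h3.r', 0);
	$s = $g->find('div.s', 0);
	$a = $h3 ? $h3->find('a', 0) : null;
	//skip list items that do not have both a title link and a summary
	if(!$a || !$s) {
		continue;
	}
	$result[] = array('title' => strip_tags($a->innertext), 
		'link' => $a->href, 
		'description' => strip_tags_content($s->innertext));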
}
 
if(isset($_GET['serialize']) && $_GET['serialize'] == '1')
{
	/* 
	if you pass serialize=1 to the script
	it will echo out a serialized string
	which can be unserialized back to an 
	array on a receiving script
	*/
	echo serialize($result);
}
else
{
	/* 
	Otherwise it prints out the array structure so that it
	is more human readable. You could instead perform a 
	foreach loop on the variable $result so that you can 
	organize the html output, or insert the data into a database
	*/
	echo "<textarea style='width: 1024px; height: 600px;'>";
	print_r($result);
	echo "</textarea>";
}
//Cleans up the memory 
$html->clear(); exit();
?>
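
The comment in the else branch mentions looping over $result to build your own HTML output. Here is a minimal sketch of that, assuming $result has the title/link/description shape built above. The function name render_results is made up for illustration, and the sample array simply copies one entry from the example output further down.

```php
<?php
// Hypothetical helper: turns the $result array into an ordered HTML list.
function render_results(array $result) {
	$out = "<ol>\n";
	foreach ($result as $r) {
		// Escape every value, since it was scraped from a remote page.
		// double_encode is false on the assumption that the scraped strings
		// may already contain entities such as &amp;
		$title = htmlspecialchars($r['title'], ENT_QUOTES, 'UTF-8', false);
		$link = htmlspecialchars($r['link'], ENT_QUOTES, 'UTF-8', false);
		$desc = htmlspecialchars(trim($r['description']), ENT_QUOTES, 'UTF-8', false);
		$out .= "<li><a href=\"$link\">$title</a><br>$desc</li>\n";
	}
	return $out . "</ol>\n";
}

// Sample data in the same shape as $result
$sample = array(
	array('title' => 'Funnyjunk - Funny Pictures and Funny Videos',
		'link' => 'http://www.funnyjunk.com/',
		'description' => ' pictures,  videos, flash games and  movies.'),
);

$rendered = render_results($sample);
echo $rendered;
```

In the script itself you would call render_results($result) in place of the textarea and print_r lines.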

The search query is read from the ‘q’ parameter passed to the script. If you also pass serialize=1, the output will be a serialized string, which can be converted back to an array with the PHP function unserialize.
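
On the receiving side, decoding the serialized output could look roughly like the sketch below. The helper name decode_results is made up, the $body string stands in for whatever your receiving script fetched over HTTP, and the allowed_classes option assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: decodes the string the scraper prints when serialize=1.
function decode_results($serialized) {
	// allowed_classes => false stops unserialize() from instantiating objects,
	// which is a sensible default for data that arrived over the network
	$result = @unserialize($serialized, array('allowed_classes' => false));
	// Anything that is not an array (including false on failure) becomes an empty list
	return is_array($result) ? $result : array();
}

// Stand-in for the response body of the scraper script
$body = serialize(array(
	array('title' => 'funny',
		'link' => 'http://www.reddit.com/r/funny/',
		'description' => 'a community for 1 year.'),
));

$results = decode_results($body);
echo $results[0]['title'], ' => ', $results[0]['link'], "\n";
```

Returning an empty array on failure keeps a later foreach over the results from breaking when the scraper printed something unexpected.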

Keep in mind that hitting Google too often for search results from the same server might eventually get your server blocked from accessing it, especially if you make multiple queries in a short period of time.
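
One way to cut down on repeat requests is to cache each results page for a while. This is only a sketch and not part of the original script: cached_fetch, the one hour lifetime and the temp-directory file naming are all assumptions, and $fetch can be any function that takes a URL and returns a string, such as file_get_contents_curl.

```php
<?php
// Hypothetical helper: returns a cached copy of $url if it is younger than
// $ttl seconds, otherwise calls $fetch($url) and stores what it returns.
function cached_fetch($url, $fetch, $ttl = 3600, $dir = null) {
	$dir = ($dir === null) ? sys_get_temp_dir() : $dir;
	$file = $dir . '/scrape_' . md5($url) . '.html';

	// Make sure we look at the file's current state, not a cached stat result
	clearstatcache();
	if (is_file($file) && (time() - filemtime($file)) < $ttl) {
		return file_get_contents($file);
	}

	$data = call_user_func($fetch, $url);
	if ($data !== false) {
		// Only successful fetches are written to the cache
		file_put_contents($file, $data, LOCK_EX);
	}
	return $data;
}

// Demonstration with a fake fetcher that counts how often it is called
$calls = 0;
$fake = function ($url) use (&$calls) {
	$calls++;
	return "<html>$url</html>";
};

$url = 'http://example.com/?run=' . uniqid(); // unique URL so each run starts uncached
$first = cached_fetch($url, $fake);
$second = cached_fetch($url, $fake);
echo "fetcher was called $calls time(s)\n";
```

In the script you would pass 'file_get_contents_curl' as $fetch, so the same query within the hour is served from disk.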

An example of the output when searching for the word ‘funny’:

Array
(
    [0] => Array
        (
            [title] => Funny Videos, Funny Pictures, Funny Jokes, Funny News - Funny.com
            [link] => http://www.funny.com/
            [description] =>  videos,  pictures,  jokes,  news.
        )
 
    [1] => Array
        (
            [title] => Lolcats 'n' Funny Pictures of Cats – I Can Has Cheezburger?
            [link] => http://icanhascheezburger.com/
            [description] => Aug 25, 2009  Humorous captioned pictures of felines and other animals. 
            Visitors can submit their own material or add captions to a large archive of 
        )
 
    [2] => Array
        (
            [title] => Funny Videos, Funny Pictures, Flash Games, Jokes
            [link] => http://www.ebaumsworld.com/
            [description] =>  videos, flash games, clean jokes, clean humor, hilarious flash,  pics , 
            office humor, prank phone calls, flash cartoons,  animation, 
        )
 
    [3] => Array
        (
            [title] => Video results for funny
            [link] => http://video.google.com/videosearch?hl=en&q=funny&um=1&ie=UTF-8&ei=TW2USrSHIIK0NriQwPoH&sa=X&oi=video_result_group&ct=title&resnum=4
            [description] => CollegeHumor's recent videos section has all the best  videos on the Internet. 
            There are  video clips, hilarious viral videos,  college 
        )
 
    [4] => Array
        (
            [title] => Funny Videos on CollegeHumor. Watch funny videos and comedy movie ...
            [link] => http://www.collegehumor.com/videos
            [description] => CollegeHumor's recent videos section has all the best  videos on the Internet. 
            There are  video clips, hilarious viral videos,  college 
        )
 
    [5] => Array
        (
            [title] => Funnyjunk - Funny Pictures and Funny Videos
            [link] => http://www.funnyjunk.com/
            [description] =>  pictures,  videos, flash games and  movies.
        )
 
    [6] => Array
        (
            [title] => Photobucket | funny Pictures, funny Images, funny Photos
            [link] => http://photobucket.com/images/funny/
            [description] => View 462373  Pictures,  Images,  Photos on Photobucket. Share them with 
            your friends on MySpace or upload your own!
        )
 
    [7] => Array
        (
            [title] => FAIL Blog: Pictures and Videos of Owned, Pwnd and Fail Moments
            [link] => http://failblog.org/
            [description] =>  FAIL Pictures and Videos. Home Send In The Fail Boat Vote [Random]. You must 
            be logged in to add favorites | Register for a new account 
        )
 
    [8] => Array
        (
            [title] => funny
            [link] => http://www.reddit.com/r/funny/
            [description] => . (147376 subscribers). a community for 1 year. moderators · Submit a link. to 
            anything interesting: news article, blog entry, video, picture. 
        )
 
    [9] => Array
        (
            [title] => News results for funny
            [link] => http://news.google.com/news?hl=en&q=funny&um=1&ie=UTF-8&ei=TW2USrSHIIK0NriQwPoH&sa=X&oi=news_group&ct=title&resnum=11
            [description] => "A more mature but still  Judd Apatow comedy whose move into serious 
            human relation issues nearly scuttles the third act. 
        )
 
)
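
You may notice that the word ‘funny’ is missing from most of the descriptions above. That fits how strip_tags_content works: it removes each tag together with its contents. The explanation rests on an assumption, namely that Google wrapped the matched terms in a tag such as <em>. The sketch below compares strip_tags with the pattern strip_tags_content uses when no tags are passed, on a made-up snippet.

```php
<?php
// Made-up snippet; the <em> highlighting is an assumption about the result markup
$snippet = '<em>Funny</em> videos, <em>funny</em> pictures.';

// strip_tags() removes only the tags and keeps the text inside them
$tags_only = strip_tags($snippet);

// Same pattern strip_tags_content() uses when no tags are passed:
// each element is removed together with its contents
$with_content = preg_replace('@<(\w+)\b.*?>.*?</\1>@si', '', $snippet);

var_dump($tags_only);
var_dump($with_content);
```

If you would rather keep the highlighted words, using strip_tags for the description is one option, at the cost of also keeping the text of any other nested elements.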