Scraping Google Front Page Results

In this article I’ll show you how to use cURL and the Simple HTML DOM Parser to scrape the basic content from the front page of Google’s results for a given search query.

What you will need:

- PHP with the cURL extension enabled
- The Simple HTML DOM Parser library (simple_html_dom.php)

The Code

<?php
include('simple_html_dom.php');
 
function strip_tags_content($text, $tags = '', $invert = FALSE) {
	/*
	This function removes all html tags and the contents within them
	unlike strip_tags which only removes the tags themselves.
	*/
	//removes <br> tags often found in Google result text, which are not handled below
	$text = str_ireplace('<br>', '', $text);
 
	preg_match_all('/<(.+?)[\s]*\/?[\s]*>/si', trim($tags), $tags);
	$tags = array_unique($tags[1]);
 
	if(is_array($tags) AND count($tags) > 0) {
		//if invert is false, it will remove all tags except those passed in $tags
		if($invert == FALSE) {
			return preg_replace('@<(?!(?:'. implode('|', $tags) .')\b)(\w+)\b.*?>.*?</\1>@si', '', $text);
		//if invert is true, it will remove only the tags passed to this function
		} else {
			return preg_replace('@<('. implode('|', $tags) .')\b.*?>.*?</\1>@si', '', $text);
		}
	//if no tags were passed to this function, simply remove all the tags
	} elseif($invert == FALSE) {
		return preg_replace('@<(\w+)\b.*?>.*?</\1>@si', '', $text);
	}
 
	return $text;
}
 
function file_get_contents_curl($url) {
	/*
	This is a file_get_contents replacement function using cURL.
	One slight difference is that it uses your browser's identity
	as its own when contacting Google.
	*/
	$ch = curl_init();
 
	//fall back to a generic User-Agent if none is available (e.g. when run from the CLI)
	curl_setopt($ch, CURLOPT_USERAGENT, isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'Mozilla/5.0');
	curl_setopt($ch, CURLOPT_HEADER, 0);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	curl_setopt($ch, CURLOPT_URL, $url);
 
	$data = curl_exec($ch);
	curl_close($ch);
 
	return $data;
}
 
//Set the query if one was passed; urlencode() already converts spaces to '+'
$q = isset($_GET['q']) ? urlencode($_GET['q']) : 'none';
 
//Obtain the first page HTML with the formatted URL
$data = file_get_contents_curl('http://www.google.com/search?hl=en&q='.$q);
 
/*
create a simple_html_dom object from the retrieved string
you could also perform file_get_html("http://...") instead of
file_get_contents_curl above, but it wouldn't change the default
User-Agent
*/
 
$html = str_get_html($data);
 
 
$result = array();
 
foreach($html->find('li.g') as $g)
{
	/*
	each search result is in a list item with a class name 'g'
	we are separating the elements within into an array
 
	Titles are stored within <h3><a...>{title}</a></h3>
	Links are in the href of the anchor contained in the <h3>...</h3>
	Summaries are stored in a div with a class name of 's'
	*/
 
	$h3 = $g->find('h3.r', 0);
	$s = $g->find('div.s', 0);
	//skip result blocks that lack a title or summary (e.g. embedded video/news units)
	if(!$h3 || !$s) continue;
	$a = $h3->find('a', 0);
	$result[] = array('title' => strip_tags($a->innertext), 
		'link' => $a->href, 
		'description' => strip_tags_content($s->innertext));
}
 
if(isset($_GET['serialize']) && $_GET['serialize'] == '1')
{
	/* 
	if you pass serialize=1 to the script
	it will echo out a serialized string
	which can be unserialized back to an 
	array on a receiving script
	*/
	echo serialize($result);
}
else
{
	/* 
	Otherwise it prints out the array structure so that it
	is more human readable. You could instead perform a 
	foreach loop on the variable $result so that you can 
	organize the html output, or insert the data into a database
	*/
	echo "<textarea style='width: 1024px; height: 600px;'>";
	print_r($result);
	echo "</textarea>";
}
//Cleans up the memory 
$html->clear(); exit();
?>

The variable ‘q’ passed to the file will provide the query string. If you pass serialize=1 to the script, the output will be a serialized string which can be converted back to an array with the PHP function unserialize.
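As a quick sketch of that round trip (the array contents here are made up for illustration, not real scraper output):

```php
<?php
// Minimal sketch of the serialize round trip: the scraper echoes
// serialize($result), and the receiving script rebuilds the array
// with unserialize(). The sample data below is invented.
$result = array(
    array(
        'title'       => 'Example',
        'link'        => 'http://example.com/',
        'description' => 'A made-up result.',
    ),
);

$payload  = serialize($result);    // what the scraper outputs
$restored = unserialize($payload); // what the receiver does

echo $restored[0]['title']; // prints "Example"
```

In practice the receiving script would fetch the payload over HTTP (with file_get_contents or cURL) rather than building it locally.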

Keep in mind that hitting Google too often for search results from the same server might eventually get your server blocked, especially if you fire off multiple queries in a short period of time.
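One simple mitigation is to pause between requests. Below is a minimal sketch, not part of the original script: a hypothetical helper that computes how long to sleep() so consecutive requests stay at least a minimum number of seconds apart.

```php
<?php
// Hypothetical helper (not in the original script): given the timestamp of
// the last request and the current time, return how many seconds of the
// minimum delay between requests are still left.
function seconds_to_wait($last_request_time, $now, $min_delay = 10) {
    $elapsed = $now - $last_request_time;
    return ($elapsed >= $min_delay) ? 0 : $min_delay - $elapsed;
}

// Usage sketch: before each call to file_get_contents_curl(), do
//   sleep(seconds_to_wait($last, time()));
//   $last = time();
echo seconds_to_wait(100, 104); // 6 seconds of the 10-second delay remain
```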

An example of the output when the word ‘funny’ is searched.

Array
(
    [0] => Array
        (
            [title] => Funny Videos, Funny Pictures, Funny Jokes, Funny News - Funny.com
            [link] => http://www.funny.com/
            [description] =>  videos,  pictures,  jokes,  news.
        )
 
    [1] => Array
        (
            [title] => Lolcats 'n' Funny Pictures of Cats – I Can Has Cheezburger?
            [link] => http://icanhascheezburger.com/
            [description] => Aug 25, 2009  Humorous captioned pictures of felines and other animals. 
            Visitors can submit their own material or add captions to a large archive of 
        )
 
    [2] => Array
        (
            [title] => Funny Videos, Funny Pictures, Flash Games, Jokes
            [link] => http://www.ebaumsworld.com/
            [description] =>  videos, flash games, clean jokes, clean humor, hilarious flash,  pics , 
            office humor, prank phone calls, flash cartoons,  animation, 
        )
 
    [3] => Array
        (
            [title] => Video results for funny
            [link] => http://video.google.com/videosearch?hl=en&q=funny&um=1&ie=UTF-8&ei=TW2USrSHIIK0NriQwPoH&sa=X
            &oi=video_result_group&ct=title&resnum=4
            [description] => CollegeHumor's recent videos section has all the best  videos on the Internet. 
            There are  video clips, hilarious viral videos,  college 
        )
 
    [4] => Array
        (
            [title] => Funny Videos on CollegeHumor. Watch funny videos and comedy movie ...
            [link] => http://www.collegehumor.com/videos
            [description] => CollegeHumor's recent videos section has all the best  videos on the Internet. 
            There are  video clips, hilarious viral videos,  college 
        )
 
    [5] => Array
        (
            [title] => Funnyjunk - Funny Pictures and Funny Videos
            [link] => http://www.funnyjunk.com/
            [description] =>  pictures,  videos, flash games and  movies.
        )
 
    [6] => Array
        (
            [title] => Photobucket | funny Pictures, funny Images, funny Photos
            [link] => http://photobucket.com/images/funny/
            [description] => View 462373  Pictures,  Images,  Photos on Photobucket. Share them with 
            your friends on MySpace or upload your own!
        )
 
    [7] => Array
        (
            [title] => FAIL Blog: Pictures and Videos of Owned, Pwnd and Fail Moments
            [link] => http://failblog.org/
            [description] =>  FAIL Pictures and Videos. Home Send In The Fail Boat Vote [Random]. You must 
            be logged in to add favorites | Register for a new account 
        )
 
    [8] => Array
        (
            [title] => funny
            [link] => http://www.reddit.com/r/funny/
            [description] => . (147376 subscribers). a community for 1 year. moderators · Submit a link. to 
            anything interesting: news article, blog entry, video, picture. 
        )
 
    [9] => Array
        (
            [title] => News results for funny
            [link] => http://news.google.com/news?hl=en&q=funny&um=1&ie=UTF-8&ei=TW2USrSHIIK0NriQwPoH&sa=X
            &oi=news_group&ct=title&resnum=11
            [description] => "A more mature but still  Judd Apatow comedy whose move into serious 
            human relation issues nearly scuttles the third act. 
        )
 
)

6 comments

  1. Frank says:

    Hello,

    This is very nicely done. It might be worth adding proxy functionality so people don’t get blocked by Google. I was wondering if you could explain to me a little how I would go about using a foreach loop to put the data into a database table with the columns TITLE, LINK & DESCRIPTION.

  2. Frank says:

    Hi mate,

    Sadly I have to report that this script no longer works due to some changes Google have made to their pages. A new ajax element has been added, also contained in li.g. If you hit a page with the new ajax bit, the script fails. It would be good if you could help fix the script so it works again, because it’s a good one :)

  3. Frank says:

    heh, i fixed it.

    replace:

    $h3 = $g->find('h3.r', 0);
    $s = $g->find('div.s', 0);
    $a = $h3->find('a', 0);

    with:

    $h3 = $g->find('h3.r', 0);
    $s = $g->find('div.s', 0);
    if($h3 != ""){
    $a = $h3->find('a', 0);
    }

    if you don’t understand why this is needed, check out the recent changes Google have made to some of their results pages by searching for: tiger woods latest update

  4. JayJay says:

    Smooth bit of coding there :) I’ve had the Simple HTML DOM Parser class on the back-end of my server for a while now and I’ve been looking for some direction for its use. The code works like a charm and you can extend http://www.google.com/search?hl=en&q= by adding (&num=100) to get the first 100 results. eg: http://www.google.com/search?num=100&hl=en&q=

  5. JayJay says:

    RE: Frank @ frankmacdonald.co.uk

    I tried your keyword (tiger woods latest update) with your code and returned “Fatal error: Call to a member function find() on a non-object in /home/linkdir/public_html…..” Though I added + to the query to fill in the spaces between the keywords. eg: tiger+woods+latest+update as google doesn’t seem to like the spaces in the string (I’ve had no problems with the other SE’s with spaces) If you add the plus to the spaces and run with

    $h3 = $g->find('h3.r', 0);
    $s = $g->find('div.s', 0);
    $a = $h3->find('a', 0);

    from the original script you should get a result :) without the plus added to your keyword spaces you return an empty array.

  6. rocker_e says:

    JayJay : I tried adding &num=100 but still get only the first page of results…how can I get all 100 in the array?

    Thanks!