In this article I’ll show you how to use cURL and the simple_html_dom library to scrape the basic content (title, link and description) from the first page of Google results for a given search query.
What you will need:
- PHP 5+
- Simple HTML DOM Parser
- cURL support
The Code
<?php
include('simple_html_dom.php');

function strip_tags_content($text, $tags = '', $invert = FALSE) {
    /*
    This function removes all HTML tags and the contents within them,
    unlike strip_tags, which only removes the tags themselves.
    */
    // Removes <br>, often found in Google result text, which is not handled below
    $text = str_ireplace('<br>', '', $text);
    preg_match_all('/<(.+?)[\s]*\/?[\s]*>/si', trim($tags), $tags);
    $tags = array_unique($tags[1]);
    if (is_array($tags) AND count($tags) > 0) {
        // If invert is false, remove all tags except those passed in
        if ($invert == FALSE) {
            return preg_replace('@<(?!(?:'. implode('|', $tags) .')\b)(\w+)\b.*?>.*?</\1>@si', '', $text);
        // If invert is true, remove only the tags passed to this function
        } else {
            return preg_replace('@<('. implode('|', $tags) .')\b.*?>.*?</\1>@si', '', $text);
        }
    // If no tags were passed to this function, simply remove all the tags
    } elseif ($invert == FALSE) {
        return preg_replace('@<(\w+)\b.*?>.*?</\1>@si', '', $text);
    }
    return $text;
}

function file_get_contents_curl($url) {
    /*
    This is a file_get_contents replacement function using cURL.
    One slight difference is that it uses your browser's identity
    as its own when contacting Google.
    */
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

// Set the query, if one was passed (urlencode() already turns spaces into '+')
$q = isset($_GET['q']) ? urlencode($_GET['q']) : 'none';

// Obtain the first page of HTML with the formatted URL
$data = file_get_contents_curl('http://www.google.com/search?hl=en&q='.$q);

/*
Create a simple_html_dom object from the retrieved string.
You could also call file_get_html("http://...") instead of
file_get_contents_curl above, but it wouldn't change the default User-Agent.
*/
$html = str_get_html($data);

$result = array();
foreach($html->find('li.g') as $g) {
    /*
    Each search result is in a list item with a class name of 'g'.
    We separate each of the elements within into an array.
    Titles are stored within <h3><a...>{title}</a></h3>
    Links are in the href of the anchor contained in the <h3>...</h3>
    Summaries are stored in a div with a class name of 's'
    */
    $h3 = $g->find('h3.r', 0);
    $s = $g->find('div.s', 0);
    $a = $h3->find('a', 0);
    $result[] = array('title' => strip_tags($a->innertext),
        'link' => $a->href,
        'description' => strip_tags_content($s->innertext));
}

if (isset($_GET['serialize']) && $_GET['serialize'] == '1') {
    /*
    If you pass serialize=1 to the script, it will echo out a serialized
    string which can be unserialized back to an array on a receiving script.
    */
    echo serialize($result);
} else {
    /*
    Otherwise it prints out the array structure so that it is more human
    readable. You could instead perform a foreach loop on the variable
    $result so that you can organize the HTML output, or insert the data
    into a database.
    */
    echo "<textarea style='width: 1024px; height: 600px;'>";
    print_r($result);
    echo "</textarea>";
}

// Cleans up the memory
$html->clear();
exit();
?>
The ‘q’ parameter passed to the script in the URL provides the search query. If you also pass serialize=1, the output will be a serialized string, which can be converted back to an array with the PHP function unserialize.
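As a rough sketch of the receiving side, the snippet below round-trips an array shaped like the scraper's $result through serialize and unserialize. The sample row and the decode_results helper are made up for illustration; in practice the string would come from requesting the script with serialize=1.

```php
<?php
// Hypothetical helper: turn the serialized string back into an array.
// Returns an empty array if the payload is not a valid serialized array.
function decode_results($payload) {
    $decoded = @unserialize($payload);
    return is_array($decoded) ? $decoded : array();
}

// Stand-in for the string the scraper would echo with serialize=1.
$payload = serialize(array(
    array(
        'title' => 'Example title',
        'link' => 'http://www.example.com/',
        'description' => 'Example description.',
    ),
));

foreach (decode_results($payload) as $row) {
    echo $row['title'] . ' -> ' . $row['link'] . "\n";
}
```

Only feed unserialize strings from a source you trust, such as your own copy of the script.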
Keep in mind that hitting Google too often for search results from the same server might eventually get your server blocked, especially if you send multiple queries in a short period of time.
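One simple way to space requests out is to enforce a minimum gap between them. The sketch below is only an illustration: the seconds_to_wait helper, the query list and the gap value are assumptions, and the gap is not a limit published by Google.

```php
<?php
// Hypothetical helper: how many seconds to wait before the next request,
// given when the last one was sent and a minimum gap between requests.
function seconds_to_wait($lastRequestAt, $now, $minGap) {
    $elapsed = $now - $lastRequestAt;
    return $elapsed >= $minGap ? 0 : $minGap - $elapsed;
}

$queries = array('funny', 'funny videos', 'funny pictures');
$minGap = 1;        // kept short for demonstration; use a longer gap in practice
$lastRequestAt = 0; // no request has been sent yet

foreach ($queries as $query) {
    $wait = seconds_to_wait($lastRequestAt, time(), $minGap);
    if ($wait > 0) {
        sleep($wait);
    }
    $lastRequestAt = time();
    // The request itself would go here, e.g. with file_get_contents_curl().
    echo 'Would fetch results for: ' . $query . "\n";
}
```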
An example of the output when the word ‘funny’ is searched.
Array
(
    [0] => Array
        (
            [title] => Funny Videos, Funny Pictures, Funny Jokes, Funny News - Funny.com
            [link] => http://www.funny.com/
            [description] => videos, pictures, jokes, news.
        )

    [1] => Array
        (
            [title] => Lolcats 'n' Funny Pictures of Cats I Can Has Cheezburger?
            [link] => http://icanhascheezburger.com/
            [description] => Aug 25, 2009 Humorous captioned pictures of felines and other animals. Visitors can submit their own material or add captions to a large archive of
        )

    [2] => Array
        (
            [title] => Funny Videos, Funny Pictures, Flash Games, Jokes
            [link] => http://www.ebaumsworld.com/
            [description] => videos, flash games, clean jokes, clean humor, hilarious flash, pics , office humor, prank phone calls, flash cartoons, animation,
        )

    [3] => Array
        (
            [title] => Video results for funny
            [link] => http://video.google.com/videosearch?hl=en&q=funny&um=1&ie=UTF-8&ei=TW2USrSHIIK0NriQwPoH&sa=X&oi=video_result_group&ct=title&resnum=4
            [description] => CollegeHumor's recent videos section has all the best videos on the Internet. There are video clips, hilarious viral videos, college
        )

    [4] => Array
        (
            [title] => Funny Videos on CollegeHumor. Watch funny videos and comedy movie ...
            [link] => http://www.collegehumor.com/videos
            [description] => CollegeHumor's recent videos section has all the best videos on the Internet. There are video clips, hilarious viral videos, college
        )

    [5] => Array
        (
            [title] => Funnyjunk - Funny Pictures and Funny Videos
            [link] => http://www.funnyjunk.com/
            [description] => pictures, videos, flash games and movies.
        )

    [6] => Array
        (
            [title] => Photobucket | funny Pictures, funny Images, funny Photos
            [link] => http://photobucket.com/images/funny/
            [description] => View 462373 Pictures, Images, Photos on Photobucket. Share them with your friends on MySpace or upload your own!
        )

    [7] => Array
        (
            [title] => FAIL Blog: Pictures and Videos of Owned, Pwnd and Fail Moments
            [link] => http://failblog.org/
            [description] => FAIL Pictures and Videos. Home Send In The Fail Boat Vote [Random]. You must be logged in to add favorites | Register for a new account
        )

    [8] => Array
        (
            [title] => funny
            [link] => http://www.reddit.com/r/funny/
            [description] => . (147376 subscribers). a community for 1 year. moderators · Submit a link. to anything interesting: news article, blog entry, video, picture.
        )

    [9] => Array
        (
            [title] => News results for funny
            [link] => http://news.google.com/news?hl=en&q=funny&um=1&ie=UTF-8&ei=TW2USrSHIIK0NriQwPoH&sa=X&oi=news_group&ct=title&resnum=11
            [description] => "A more mature but still Judd Apatow comedy whose move into serious human relation issues nearly scuttles the third act.
        )

)
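The comments in the code mention looping over $result to insert the data into a database. Here is a minimal sketch of that idea using PDO with a prepared statement. The in-memory SQLite database, the table name results and its column names are all assumptions chosen so the example runs on its own (it needs the pdo_sqlite extension); the sample row is taken from the output above.

```php
<?php
// Sample data standing in for the scraper's $result array.
$result = array(
    array(
        'title' => 'Funnyjunk - Funny Pictures and Funny Videos',
        'link' => 'http://www.funnyjunk.com/',
        'description' => 'pictures, videos, flash games and movies.',
    ),
);

// Assumption: an in-memory SQLite database, used only so the sketch runs
// without a database server. Swap the DSN for your own database.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE results (title TEXT, link TEXT, description TEXT)');

// A prepared statement keeps scraped text from being interpreted as SQL.
$insert = $db->prepare(
    'INSERT INTO results (title, link, description) VALUES (:title, :link, :description)'
);

foreach ($result as $row) {
    $insert->execute(array(
        ':title' => $row['title'],
        ':link' => $row['link'],
        ':description' => $row['description'],
    ));
}

echo $db->query('SELECT COUNT(*) FROM results')->fetchColumn() . " row(s) stored\n";
```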
Hello,
This is very nicely done. It might be worth adding proxy functionality so people don’t get blocked by Google. I was wondering if you could explain a little how I would go about using a foreach loop to put the data into a database table with the columns TITLE, LINK & DESCRIPTION.
Hi mate,
Sadly I have to report that this script no longer works due to some changes Google has made to their pages. A new Ajax element has been added that is also contained in li.g. If you hit a page with the new Ajax bit, the script fails. It would be good if you could help fix the script so it works again, because it’s a good one :)
Heh, I fixed it.
replace:
$h3 = $g->find('h3.r', 0);
$s = $g->find('div.s', 0);
$a = $h3->find('a', 0);
with:
$h3 = $g->find('h3.r', 0);
$s = $g->find('div.s', 0);
if($h3 != ""){
$a = $h3->find('a', 0);
}
If you don’t understand why this is needed, check out the recent changes Google has made to some of their results pages by searching for: tiger woods latest update
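A guard like this stops the fatal error, but a result without a title can still end up in the array unless the loop skips it. The sketch below shows the skip-with-continue idea on made-up markup. To keep it runnable without simple_html_dom, it swaps in PHP's built-in DOMDocument and DOMXPath (this assumes the DOM extension is enabled); the same check should carry over to the $h3, $s and $a variables in the original loop.

```php
<?php
// Made-up markup: the second li.g has no h3.r, like the newer result blocks.
$sample = '<ol>'
    . '<li class="g"><h3 class="r"><a href="http://www.example.com/">Example</a></h3>'
    . '<div class="s">A summary.</div></li>'
    . '<li class="g"><div class="s">A block without a title.</div></li>'
    . '</ol>';

$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

$result = array();
foreach ($xpath->query('//li[@class="g"]') as $g) {
    $a = $xpath->query('.//h3[@class="r"]/a', $g)->item(0);
    $s = $xpath->query('.//div[@class="s"]', $g)->item(0);
    // Skip entries that lack a title link or a summary instead of
    // calling methods on a missing node.
    if ($a === null || $s === null) {
        continue;
    }
    $result[] = array(
        'title' => $a->textContent,
        'link' => $a->getAttribute('href'),
        'description' => $s->textContent,
    );
}

print_r($result);
```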
Smooth bit of coding there :) I’ve had the Simple HTML DOM Parser class on the back-end of my server for a while now and I’ve been looking for some direction for its use. The code works like a charm, and you can extend http://www.google.com/search?hl=en&q= by adding (&num=100) to get the first 100 results, e.g.: http://www.google.com/search?num=100&hl=en&q=
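For anyone assembling these URLs by hand, here is a small sketch that builds one with http_build_query, which also takes care of encoding spaces in the query. The build_search_url helper and its default of 10 results are assumptions for illustration; whether the num parameter is honored is up to Google.

```php
<?php
// Hypothetical helper: build the search URL from a query and a result count.
function build_search_url($query, $num = 10) {
    $params = array(
        'hl' => 'en',
        'num' => $num,
        'q' => $query,
    );
    // http_build_query() URL-encodes each value; spaces become '+'.
    // The separator is passed explicitly so php.ini settings cannot change it.
    return 'http://www.google.com/search?' . http_build_query($params, '', '&');
}

echo build_search_url('funny videos', 100) . "\n";
// prints http://www.google.com/search?hl=en&num=100&q=funny+videos
```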
RE: Frank @ frankmacdonald.co.uk
I tried your keyword (tiger woods latest update) with your code and it returned “Fatal error: Call to a member function find() on a non-object in /home/linkdir/public_html…..”, though I had added + to the query to fill in the spaces between the keywords, e.g. tiger+woods+latest+update, as Google doesn’t seem to like spaces in the string (I’ve had no problems with spaces on the other SEs). If you add the plus to the spaces and run with
$h3 = $g->find('h3.r', 0);
$s = $g->find('div.s', 0);
$a = $h3->find('a', 0);
from the original script, you should get a result :) Without the plus added to your keyword spaces, you get an empty array.
JayJay : I tried adding &num=100 but still get only the first page of results…how can I get all 100 in the array?
Thanks!