In this article I’ll show you how to use cURL and simple_html_dom to scrape the basic content from the first page of Google results for a given search query.
What you will need:
- PHP 5+
- Simple HTML DOM Parser
- cURL support
The Code
<?php
include('simple_html_dom.php');

/*
 * Removes HTML tags together with the content inside them, unlike
 * strip_tags(), which only removes the tags themselves.
 */
function strip_tags_content($text, $tags = '', $invert = FALSE) {
    // Remove <br>, often found in Google result text, which is not handled below
    $text = str_ireplace('<br>', '', $text);

    preg_match_all('/<(.+?)[\s]*\/?[\s]*>/si', trim($tags), $tags);
    $tags = array_unique($tags[1]);

    if (is_array($tags) AND count($tags) > 0) {
        if ($invert == FALSE) {
            // invert is false: remove all tags except those passed in
            return preg_replace('@<(?!(?:'. implode('|', $tags) .')\b)(\w+)\b.*?>.*?</\1>@si', '', $text);
        } else {
            // invert is true: remove only the tags passed to this function
            return preg_replace('@<('. implode('|', $tags) .')\b.*?>.*?</\1>@si', '', $text);
        }
    } elseif ($invert == FALSE) {
        // no tags were passed to this function: simply remove all the tags
        return preg_replace('@<(\w+)\b.*?>.*?</\1>@si', '', $text);
    }
    return $text;
}

/*
 * A file_get_contents() replacement using cURL. One slight difference is
 * that it uses your browser's identity as its own when contacting Google.
 */
function file_get_contents_curl($url) {
    // Fall back to an empty User-Agent when the script is not run from a browser
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

// Set the query if one was passed; urlencode() already turns spaces into '+'
$q = isset($_GET['q']) ? urlencode($_GET['q']) : 'none';

// Obtain the first page of HTML with the formatted URL
$data = file_get_contents_curl('http://www.google.com/search?hl=en&q='.$q);

/*
 * Create a simple_html_dom object from the retrieved string. You could also
 * call file_get_html("http://...") instead of file_get_contents_curl above,
 * but it wouldn't change the default User-Agent.
 */
$html = str_get_html($data);
if (!$html) {
    // The request failed or the response could not be parsed
    exit('Could not fetch or parse the results page.');
}

$result = array();
foreach ($html->find('li.g') as $g) {
    /*
     * Each search result is in a list item with a class name of 'g'.
     * We separate the elements within it into an array:
     * titles are stored within <h3><a...>{title}</a></h3>,
     * links are in the href of the anchor contained in the <h3>...</h3>,
     * summaries are stored in a div with a class name of 's'.
     */
    $h3 = $g->find('h3.r', 0);
    $s = $g->find('div.s', 0);
    $a = $h3 ? $h3->find('a', 0) : null;
    if (!$a) {
        // Skip list items that do not have a title link
        continue;
    }
    $result[] = array(
        'title' => strip_tags($a->innertext),
        'link' => $a->href,
        'description' => $s ? strip_tags_content($s->innertext) : ''
    );
}

if (isset($_GET['serialize']) && $_GET['serialize'] == '1') {
    /*
     * If you pass serialize=1 to the script, it echoes a serialized string
     * which can be unserialized back to an array on a receiving script.
     */
    echo serialize($result);
} else {
    /*
     * Otherwise it prints out the array structure so that it is more human
     * readable. You could instead loop over $result with foreach to organize
     * the HTML output, or insert the data into a database.
     */
    echo "<textarea style='width: 1024px; height: 600px;'>";
    print_r($result);
    echo "</textarea>";
}

// Clean up the memory
$html->clear();
exit();
?>
The ‘q’ parameter passed to the script supplies the search query. If you also pass serialize=1, the output is a serialized string, which can be converted back to an array with the PHP function unserialize.
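A receiving script could decode that serialized output along the lines of the minimal sketch below. The helper name decode_results is hypothetical, the sample payload stands in for the response body you would actually fetch from the scraper, and the allowed_classes option assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: decodes the scraper's serialized output into an array.
function decode_results($payload) {
    // allowed_classes => false (PHP 7+) stops unserialize() from creating objects;
    // @ silences the notice raised when the payload is not valid serialized data
    $results = @unserialize($payload, array('allowed_classes' => false));
    // Anything that is not an array is treated as an invalid response
    return is_array($results) ? $results : array();
}

// Sample payload standing in for the body returned by the scraper with serialize=1
$payload = serialize(array(
    array(
        'title' => 'Example title',
        'link' => 'http://www.example.com/',
        'description' => 'Example description.',
    ),
));

foreach (decode_results($payload) as $row) {
    echo $row['title'], ' - ', $row['link'], "\n";
}
```

Returning an empty array for anything that is not an array keeps the caller's foreach loop safe when the scraper returns an error page instead of serialized data.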
Keep in mind that hitting Google too often for search results from the same server might eventually get your server blocked, especially if you run multiple queries in a short period of time.
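One way to reduce the number of requests is to cache results per query and to wait between uncached requests. The following is only a sketch under assumptions: the helper throttled_fetch, its parameters, and the 0.2 second interval are hypothetical, and the stand-in fetcher would be replaced by a real request in practice. It is not a guarantee against being blocked.

```php
<?php
// Hypothetical helper: returns a cached result when one exists, otherwise
// waits until $minInterval seconds have passed since the previous request.
function throttled_fetch($query, $fetcher, array &$cache, &$lastRequest, $minInterval) {
    if (isset($cache[$query])) {
        return $cache[$query];
    }
    if ($lastRequest !== null) {
        $wait = $minInterval - (microtime(true) - $lastRequest);
        if ($wait > 0) {
            usleep((int) ceil($wait * 1000000));
        }
    }
    $cache[$query] = call_user_func($fetcher, $query);
    $lastRequest = microtime(true);
    return $cache[$query];
}

$cache = array();
$lastRequest = null;
$calls = 0;

// Stand-in fetcher; a real one would request the results page for $q
$fetcher = function ($q) use (&$calls) {
    $calls++;
    return 'results for ' . $q;
};

$start = microtime(true);
throttled_fetch('funny', $fetcher, $cache, $lastRequest, 0.2);
throttled_fetch('funny', $fetcher, $cache, $lastRequest, 0.2); // served from the cache
throttled_fetch('cats', $fetcher, $cache, $lastRequest, 0.2);  // waits before fetching
$elapsed = microtime(true) - $start;

echo $calls, " requests made\n";
```

In a real script the interval would be much longer than 0.2 seconds, and the cache would need to live somewhere that survives between page loads, such as a file or a database.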
An example of the output when searching for the word ‘funny’:
Array
(
    [0] => Array
        (
            [title] => Funny Videos, Funny Pictures, Funny Jokes, Funny News - Funny.com
            [link] => http://www.funny.com/
            [description] => videos, pictures, jokes, news.
        )

    [1] => Array
        (
            [title] => Lolcats 'n' Funny Pictures of Cats I Can Has Cheezburger?
            [link] => http://icanhascheezburger.com/
            [description] => Aug 25, 2009 Humorous captioned pictures of felines and other animals. Visitors can submit their own material or add captions to a large archive of
        )

    [2] => Array
        (
            [title] => Funny Videos, Funny Pictures, Flash Games, Jokes
            [link] => http://www.ebaumsworld.com/
            [description] => videos, flash games, clean jokes, clean humor, hilarious flash, pics , office humor, prank phone calls, flash cartoons, animation,
        )

    [3] => Array
        (
            [title] => Video results for funny
            [link] => http://video.google.com/videosearch?hl=en&q=funny&um=1&ie=UTF-8&ei=TW2USrSHIIK0NriQwPoH&sa=X &oi=video_result_group&ct=title&resnum=4
            [description] => CollegeHumor's recent videos section has all the best videos on the Internet. There are video clips, hilarious viral videos, college
        )

    [4] => Array
        (
            [title] => Funny Videos on CollegeHumor. Watch funny videos and comedy movie ...
            [link] => http://www.collegehumor.com/videos
            [description] => CollegeHumor's recent videos section has all the best videos on the Internet. There are video clips, hilarious viral videos, college
        )

    [5] => Array
        (
            [title] => Funnyjunk - Funny Pictures and Funny Videos
            [link] => http://www.funnyjunk.com/
            [description] => pictures, videos, flash games and movies.
        )

    [6] => Array
        (
            [title] => Photobucket | funny Pictures, funny Images, funny Photos
            [link] => http://photobucket.com/images/funny/
            [description] => View 462373 Pictures, Images, Photos on Photobucket. Share them with your friends on MySpace or upload your own!
        )

    [7] => Array
        (
            [title] => FAIL Blog: Pictures and Videos of Owned, Pwnd and Fail Moments
            [link] => http://failblog.org/
            [description] => FAIL Pictures and Videos. Home Send In The Fail Boat Vote [Random]. You must be logged in to add favorites | Register for a new account
        )

    [8] => Array
        (
            [title] => funny
            [link] => http://www.reddit.com/r/funny/
            [description] => . (147376 subscribers). a community for 1 year. moderators · Submit a link. to anything interesting: news article, blog entry, video, picture.
        )

    [9] => Array
        (
            [title] => News results for funny
            [link] => http://news.google.com/news?hl=en&q=funny&um=1&ie=UTF-8&ei=TW2USrSHIIK0NriQwPoH&sa=X &oi=news_group&ct=title&resnum=11
            [description] => "A more mature but still Judd Apatow comedy whose move into serious human relation issues nearly scuttles the third act.
        )

)
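Instead of dumping the array with print_r, the results could be turned into an HTML list. The sketch below assumes the array has the title, link and description keys shown in the output above; the helper name render_results is hypothetical, and the two sample rows are copied from that output.

```php
<?php
// Hypothetical helper: renders the result array as an HTML list,
// escaping every value so scraped text cannot inject markup.
function render_results(array $result) {
    $out = "<ul>\n";
    foreach ($result as $row) {
        $out .= '<li><a href="' . htmlspecialchars($row['link'], ENT_QUOTES, 'UTF-8') . '">'
              . htmlspecialchars($row['title'], ENT_QUOTES, 'UTF-8') . '</a><br>'
              . htmlspecialchars($row['description'], ENT_QUOTES, 'UTF-8') . "</li>\n";
    }
    return $out . "</ul>\n";
}

// Two rows taken from the sample output
$result = array(
    array(
        'title' => 'Funny Videos, Funny Pictures, Funny Jokes, Funny News - Funny.com',
        'link' => 'http://www.funny.com/',
        'description' => 'videos, pictures, jokes, news.',
    ),
    array(
        'title' => 'Funnyjunk - Funny Pictures and Funny Videos',
        'link' => 'http://www.funnyjunk.com/',
        'description' => 'pictures, videos, flash games and movies.',
    ),
);

echo render_results($result);
```

Escaping with htmlspecialchars matters here because the titles and descriptions come from a third-party page and may contain characters such as < or &.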