Monday, 23 November 2009

phpQuery: using Google as an information source

For a project I need to fetch search results from Google and follow the links on the results page to index those valuable resources.
Google doesn't supply an XML interface to its search results.

Bing (www.bing.com) and Yahoo (www.yahoo.com, through an API key) do. As an aside from the subject of this post, I find this monopolistic behaviour of Google repulsive... They index all OUR content and resources, make money off it through advertising, and then they abandon the "open source" character of their activities. Once our content enters their databases, the content is theirs!
Start using Bing more often!

Because Google does not offer an XML response interface, I had to find a way to "crawl" our own content.

I found the "phpQuery" framework, ironically hosted on Google Code.

What I wanted to do is get all the relevant links from the search results page and follow those links to make the information available within the context of the search application.

I want to share the basic code that I came up with for getting the links from a webpage using phpQuery:

<?php
require_once('phpQuery/phpQuery.php');

//The URL of the webpage to fetch
//$link = "http://www.emidconsult.com";
$link = "http://www.google.nl/search?q=multiple+sclerose";

//Get the whole HTML contents
$page = file_get_contents($link);
//Stop if the page could not be fetched
if ($page === false) {
    die("Could not fetch ".$link."\n");
}
//Put the HTML content in a phpQuery object
phpQuery::newDocumentHTML($page, 'utf-8');

//Get the [body] contents out of the phpQuery object
//$body = pq('body');
//print $body;

//Iterate through every link (a element) in the webpage.
foreach (pq('a') as $ref) {
    //Get the link (cast to string, because an a element may have no href)
    $href = (string) pq($ref)->attr('href');
    //Look for an external (absolute) URL in the a element...
    $pos1 = strpos($href, 'http://');
    //...and for a reference to Google's own search commands
    $pos2 = strpos($href, '/search?q');

    //Keep the link only if it is external and not a Google search command.
    //Compare with false explicitly: strpos() returns 0 for a match at the start.
    if (($pos1 !== false) && ($pos2 === false)) {
        echo $href."\n";
        //Now do some things with the href...
    }
}
?>
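
The same link filtering can also be tried without phpQuery and without a network connection. What follows is only a rough sketch, assuming PHP's built-in DOM extension is available. The function names isExternalLink and extractExternalLinks and the sample markup are made up for illustration and are not part of phpQuery. The filter is a little stricter than the loop above: the href has to start with http:// instead of merely containing it.

```php
<?php
// Hypothetical helper: accept absolute http:// links that do not
// point at Google's own /search?q commands.
function isExternalLink($href)
{
    return strpos($href, 'http://') === 0
        && strpos($href, '/search?q') === false;
}

// Hypothetical helper: collect the external links from an HTML string
// using PHP's built-in DOM extension instead of phpQuery.
function extractExternalLinks($html)
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if (isExternalLink($href)) {
            $links[] = $href;
        }
    }
    return $links;
}

// Made-up sample markup standing in for a fetched results page.
$sample = '<html><body>'
    . '<a href="http://www.example.org/ms">result</a>'
    . '<a href="/search?q=multiple+sclerose">next page</a>'
    . '<a href="http://www.google.nl/search?q=related">related</a>'
    . '</body></html>';

foreach (extractExternalLinks($sample) as $url) {
    echo $url . "\n";
}
```

With the sample markup, only the example.org link passes the filter. Keeping the filter in its own function makes it easy to adjust the rule later, for instance to accept https:// links as well.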
