Monday, 23 November 2009

phpQuery: using Google as an information source

For a project I need to fetch search results from Google and follow the links on the results page to index those valuable resources.
Google doesn't supply an XML interface to its search results.

Bing (www.bing.com) and Yahoo (www.yahoo.com, through an API key) do. As an aside from the subject of this post, I find this monopolistic behaviour of Google repulsive... They index all OUR content and resources, make money off it through advertising, and then they drop the "open source" character of their activities. Once our content enters their databases, the content is theirs!
Start using Bing more often!

Because Google does not offer an XML response interface, I had to find a way to "crawl" our own content.

I found the "phpQuery" framework, ironically hosted on Google Code.

What I wanted to do is get all the relevant links from the search results page and follow those links to make the information available within the context of the search application.

I want to share the basic code that I came up with for getting the links from a web page with phpQuery:

<?php
require_once('phpQuery/phpQuery.php');

//The URL of the webpage to fetch
//$link = "http://www.emidconsult.com";
$link = "http://www.google.nl/search?q=multiple+sclerose";

//Get the whole HTML contents
$page = file_get_contents($link);
if ($page === false) {
    die("Could not fetch " . $link . "\n");
}
//Put the HTML content in a phpQuery object
phpQuery::newDocumentHTML($page, 'utf-8');

//Get the [body] contents out of the phpQuery object
//$body = pq('body');
//print $body;

//Iterate through every link (a element) in the webpage.
foreach (pq('a') as $ref) {
    //Get the link (cast to string, because an a element may have no href)
    $href = (string) pq($ref)->attr('href');
    //Keep the url only if it contains an external (http://) link
    $pos1 = strpos($href, 'http://');
    //and has no reference to Google's own search commands
    $pos2 = strpos($href, '/search?q');

    //strpos returns false when nothing is found, so compare with false explicitly
    if (($pos1 === false) || ($pos2 !== false)) {
        //DO NOTHING
    } else {
        echo $href . "\n";
        //Now do some things with the href...
    }
}
?>
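
The filter inside the loop above can also be pulled out into a small function, which makes it easy to try on a handful of hrefs without fetching a page. This is only a sketch: the names isExternalResultLink and filterResultLinks are my own and not part of phpQuery, the sample hrefs are made up, and the check treats any occurrence of '/search?q' as one of Google's own links. It also drops duplicate links, which the loop above does not do.

```php
<?php
// Hypothetical helper, not part of phpQuery: true when an href should be kept.
function isExternalResultLink(string $href): bool
{
    // Must contain an absolute http:// URL somewhere in the href.
    if (strpos($href, 'http://') === false) {
        return false;
    }
    // Must not point back at Google's own search command.
    return strpos($href, '/search?q') === false;
}

// Hypothetical helper: filter a list of hrefs and drop duplicates.
function filterResultLinks(array $hrefs): array
{
    $kept = array_filter($hrefs, 'isExternalResultLink');
    // array_unique keeps the first occurrence; array_values reindexes from 0.
    return array_values(array_unique($kept));
}

// Sample hrefs invented for illustration.
$sample = [
    'http://www.emidconsult.com',
    '/search?q=multiple+sclerose&start=10',
    'http://www.emidconsult.com',
    '#top',
];

foreach (filterResultLinks($sample) as $href) {
    echo $href . "\n";
}
```

In the phpQuery loop you would collect every $href into an array first and then pass that array to filterResultLinks before following the links.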

Thursday, 19 November 2009

Google and making multimedia searchable

Today I saw the announcement "Google adds automatic captions to YouTube". I really like this addition to the possibilities of YouTube.

But as we all know, Google gets gigantic exposure in everything it does.

This feature seems a neat thing that makes it easier for consumers to upload their videos to YouTube regardless of language, and it eliminates the need to add manual captions or descriptions to those videos: Google converts the speech to text automatically...

As I am more involved in information access and enterprise search, I place this news in another context: the possibility of making video and audio searchable.

Within the enterprise, making video and audio searchable is getting more attention. Autonomy as well as Exalead are also focusing on this part of enterprise search.

To see the capabilities of Exalead you can visit their labs at http://labs.exalead.com/experiments/voxalead.html.

Autonomy is also very active in the media indexing market. They have solutions like Virage that do the same thing. I find it very disappointing that Autonomy has no demo of their capabilities.

Bottom line?
Google is doing the same thing as the large search technology providers. The "public" will say that Google is "ahead of the pack". The advantage that Google has is a large community that follows everything it does and spreads the word fast.
Other search providers like Autonomy and Exalead can do, and are doing, the same thing as Google is doing now (and are maybe better at it), but are not in a position to reach the audience that Google can.

Tuesday, 3 November 2009

Coveo into free enterprise search?

Short Post:
Today I received an email from KMWorld with an advertisement by Coveo.

They are releasing a free version of their enterprise search suite under the name "Expresso".

While they are not the only ones (OmniFind Yahoo! Edition and Microsoft Search Server Express do the same), I still think that wrapping something like "enterprise search" in a free gift is not a good idea.
As we all know, enterprise search is more than a software solution. Of course you can plug in a filesystem, but that will not solve most of your information needs.

Still it is something to play with.


Monday, 2 November 2009

Microsoft's FAST and SharePoint integration (SharePoint 2010)

Stephen Arnold has written up an excellent analysis of the information that was given about the coming release of "Fast Search Server 2010 for SharePoint".

"Third, there is a reminder to me that SharePoint is a work in progress. I think someone told me that it is the next operating system from Microsoft. The Fast component will be called Fast Search Server 2010 for SharePoint. Now the story gets interesting. Here’s what’s coming:"
He then sums up some aspects of the features:
  • A content processing pipeline.
  • Metadata extraction.
  • Structured data search.
  • Visual search.
  • Advanced linguistics.
  • Best bets.
  • Development platform.
  • Customization.
Microsoft has a lot of work to do, since there were no live demos or examples that could show these features in action.