Tuesday, September 15, 2009

Reasons for choosing Solr on All for Good are twisted

Today I read a blog post on the Lucid Imagination blog about a Google employee choosing Solr as the search engine for the All for Good site.
The blog post cited part of the testimonial on the allforgood site, and I have to disagree with the stated reason for choosing Solr.

The problem they had was:

One of the top concerns we’ve been hearing from nonprofit organizations who list
volunteer opportunities on All for Good is that their opportunities aren’t
updated on the site as frequently as they need. This happens because All for
Good doesn’t directly receive volunteer opportunities from nonprofits – we crawl
feeds from partners like VolunteerMatch and Idealist just like Google web search
crawls web pages. Crawlers don’t immediately update, they take time to find new
information.

Their stated solution:

Today, we’re rolling out improvements to All for Good that will help solve this
problem and improve search quality for users. The biggest change, which you
won’t see directly, is that our search engine is now powered by SOLR, an
incredible open source project that will allow us to provide higher quality and
more up-to-date opportunities. Nonprofits should start seeing their
opportunities indexed faster, and users should see more relevant and complete
results.


Now... why do I disagree with the way the choice for Solr is justified?

It is the claim that using Solr solves their "latency" problem. Remember, the biggest problem was that the indexed information was not up to date.
Solr doesn't solve that. Solr is just a search service built around Lucene; it doesn't take care of the crawling part of the problem.
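
To make that concrete, here is a minimal sketch (my own illustration, not All for Good's code) of how content gets into Solr: a client has to push each document to Solr's update handler explicitly. The endpoint URL and field names are assumptions for a default local Solr install.

```python
import urllib.request

SOLR_UPDATE_URL = "http://localhost:8983/solr/update"  # assumed default endpoint

def index_opportunity(doc_id: str, title: str) -> None:
    """Push one document to Solr and commit so it becomes searchable."""
    xml = (
        "<add><doc>"
        f"<field name='id'>{doc_id}</field>"
        f"<field name='title'>{title}</field>"
        "</doc></add>"
    )
    for body in (xml, "<commit/>"):
        req = urllib.request.Request(
            SOLR_UPDATE_URL,
            data=body.encode("utf-8"),
            headers={"Content-Type": "text/xml"},
        )
        urllib.request.urlopen(req)

# The document is only as fresh as the moment this call is made --
# Solr itself never goes out to fetch anything.
index_opportunity("opp-123", "Beach cleanup volunteers needed")
```

The point: Solr indexes whatever it is handed, whenever it is handed it. How recent that content is depends entirely on the component that does the handing.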

Those of us who build search applications and information access solutions know that it is the combination of crawling frequency, the accessibility of the sources to be indexed (RSS, web, document repositories, databases, etc.), and the preprocessing of those diverse formats that determines the speed of indexing and, thereby, of the search process.

In this case Nutch will probably take care of the crawling part, so the frequency with which updates are processed depends on the speed of that part of the solution, not on the fact that Solr is used... A sketch of that dependency follows below.
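
Here is a sketch of the part that actually governs freshness (the feed URL and polling interval are hypothetical; the testimonial only says partner feeds such as VolunteerMatch and Idealist are crawled): a simple polling loop. The worst-case staleness is roughly the polling interval plus processing time, regardless of which search engine ends up indexing the result.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "http://example.org/opportunities.rss"  # hypothetical partner feed
POLL_INTERVAL_SECONDS = 6 * 60 * 60                # hypothetical: crawl every 6 hours

def poll_and_index() -> None:
    """Periodically fetch the feed and push each item to Solr."""
    while True:
        with urllib.request.urlopen(FEED_URL) as resp:
            feed = ET.parse(resp)
        for item in feed.iter("item"):
            guid = item.findtext("guid", default="")
            title = item.findtext("title", default="")
            index_opportunity(guid, title)  # reuses the helper sketched above
        # Anything published right after this pass waits a full interval
        # before it can ever reach the index.
        time.sleep(POLL_INTERVAL_SECONDS)
```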
