Tuesday, 13 October 2009

Experiences with Solr 1.4 Enterprise Search Server (Part 0)

This weekend I started exploring the book "Solr 1.4 Enterprise Search Server".

Although Solr 1.4 has not yet been released at the time of writing, the latest nightly build provides nearly all of its functionality and is very stable.

In some of the coming posts on this blog I will share my experiences with reading the book and working through its examples. I will also compare Solr with some of the other search engines that I use in my work, such as Autonomy, Exalead and the Google Search Appliance.

The e-book is accompanied by example code, downloadable from the Packt website. The example code includes the Solr version that was used while writing the book, so every example in the book should work.
The example code contains a fully populated and configured Solr / Lucene instance. This instance holds several hundred thousand records pulled from the MusicBrainz database. To use this data you need a development / test environment with enough disk space and RAM.

One negative remark about this dataset is that it is mostly "structured database data": a lot of fields with small amounts of data.

The enterprise search environments that I come across mostly hold lots of unstructured information in documents from file systems, DMSs and CMSs.
Of course database offloading is a hot topic in BI / enterprise search land, but most of the information that has to be searched and found comes from unstructured documents.

Maybe the fact that database data was chosen says something about where Solr / Lucene is actually used in the real world.

It would be nice to have a more representative dataset to work with. That would make the examples more useful.

1 comment:

Grant Ingersoll said

I wouldn't read too much into the dataset chosen for the book. I've seen/used Lucene/Solr for plenty of unstructured text, ranging in size from a few hundred words of unstructured text to book length.

I'd say most cases of any search system are a bunch of metadata fields accompanied by 1-5 unstructured fields, but, of course, YMMV.