Wednesday, October 21, 2009

Solr 1.4 Enterprise Search Server Book review

Today I came across a blog post about the book "Solr 1.4 Enterprise Search Server". As I mentioned earlier, I am in the process of reading, dicing and slicing the book.
I will post updates on my journey through the examples and my findings, but I thought it would be nice to share a short review of the book in the meantime:

http://happygiraffe.net/blog/2009/10/20/book-review-solr-1-4-enterprise-search-server/

This review says more about the structure and topics of the book, whereas I will give more information on my experiences while using it.

Google GSA gets smart (and smarter automatically)

Google announced some updates for generation 6 of the GSA. They added functionality that automatically boosts the relevance of a document for a term when its link is clicked in the search results.

From InformationWeek:
Chief among the additions is a new algorithm called Self-Learning Scorer, which analyzes employee clicks and behavior to improve the relevance of search results. If, for example, most employees searching for a given term click on the third search result, the GSA will place that result higher on the search results page for future searches.
On the Internet, Google gets highly relevant search results in part because of its PageRank algorithm, which analyzes Web links and treats them as votes for relevance. On corporate intranets, a rich link structure is absent, which makes search relevance harder. The Self-Learning Scorer system should help compensate for the absence of Web links inside corporate firewalls.
[...]
"In other words, the GSA is moving upstream, while keeping the price low, and continuing to disrupt the confused search market," says Feldman in the report.

Google makes good use of "User Generated Content", in this case the fact that a clicked result is "proof" that the document is relevant for that particular search.
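
To make that concrete, here is a toy sketch of click-based boosting. It is in no way Google's actual Self-Learning Scorer, just an illustration of the principle: count clicks per (query, document) pair and fold them into the base relevance score.

import math
from collections import defaultdict

# Toy illustration of click-based relevance boosting; not Google's
# actual Self-Learning Scorer. Clicks are counted per (query, document)
# pair and folded into the base score as a damped, logarithmic boost.
clicks = defaultdict(int)

def record_click(query, doc_id):
    clicks[(query, doc_id)] += 1

def boosted_score(query, doc_id, base_score, alpha=0.3):
    # log1p damping keeps a burst of clicks from dominating the ranking
    return base_score * (1.0 + alpha * math.log1p(clicks[(query, doc_id)]))

# Example: the third result gets clicked often and climbs the ranking.
for _ in range(50):
    record_click("expense report", "doc3")
results = [("doc1", 1.0), ("doc2", 0.9), ("doc3", 0.8)]
ranked = sorted(results,
                key=lambda r: boosted_score("expense report", r[0], r[1]),
                reverse=True)
print(ranked)  # doc3 now outranks doc1 and doc2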

It is good to see that the difference between internet search and enterprise search is becoming clearer.

Tuesday, October 20, 2009

New technology vs. old pattern matching?

Just yesterday I blogged about the article by Stephen Arnold in which he refers to new players and technology like Exalead, as opposed to "old" players with their technology, like Autonomy.

A moment ago I received an e-mail alert from Reuters with the title "Autonomy Celebrates a Decade of Automatically Detecting and Acting on Patterns in Information".

This press release by Autonomy presents the state-of-the-art solutions that they have built upon pattern-matching techniques:
Autonomy brings advanced technology to executives looking to gain competitive advantage by reinventing how they leverage information through a pattern-based strategy. The company has hundreds of patents in complex and advanced technology in the areas of pattern matching, pattern recognition, probabilistic analysis, clustering, visualization, eduction, time analysis, and sentiment analysis. Unlike legacy systems that require complex modelling, manual programming and data integration efforts, Autonomy's patterns technology automatically understands all forms of information including social media, audio and video and databases.

Is this a coincidence? Based on the examples in the press release, I am still impressed by the way Autonomy is operating.

Monday, October 19, 2009

Tactical choices for a FAST ESP / Sharepoint investment

A smart analysis of the possibilities and choices you have when faced with an upgrade or a change in requirements regarding your "old" FAST ESP (OEM) environment. The relationship with SharePoint is an interesting one.

From: http://arnoldit.com/wordpress/2009/10/19/reflections-on-sharepoint-and-search/

What can customers in a SharePoint environment do with a Fast ESP legacy system? I know what I would do. I would ask that the Microsoft SharePoint engineers find a certified third party who can hook the Fast ESP system into the SharePoint 10 system once these products become available in November 2009. If these experts cannot do the job within the time and budget limits of the organization, I would get a newer system. I know I would look at high profile modern systems such as Exalead’s, and I would check out relative newcomers like Gaviri. In fact, I would do some proofs of concept and pick the best system for my needs. I know I would not consider the older systems that are on the market; for example, the BASIS technology or the Bayesian systems. I want 64 bit, smart systems, not the pains of the past.

I find it very interesting how Stephen Arnold favours the technology of Exalead over "old" players on the market like Recommind and Autonomy, which have based their "magic" on Bayesian probabilistic pattern recognition.

Wednesday, October 14, 2009

More Google browser specific choices

From http://www.washingtonpost.com/wp-dyn/content/article/2009/10/14/AR2009101400362.html

Our tipster says that he's only seeing the new ads in the developer version of Chrome, but I'm seeing them as well in Safari, though some TechCrunch staff aren't seeing them in any browser. Google is always switching up ad placement and formats in various bucket tests, some of which are browser-specific, so the inconsistency isn't surprising.


It seems that Google has some favorites when it comes to testing ;-).

Database technology vs. Search technology

Today I read this article on Beyond Search about struggling with large datasets in a database.

Search technology can scale very easily with products like Autonomy IDOL and Exalead.
An Oracle RDBMS is good (the best?) at managing structured data, but it is not well suited to handling heavy transaction loads AND heavy query loads at the same time.

Using search technology in conjunction with database technology simplifies the management of an Oracle environment. Much time is spent optimizing databases to serve both transactions and querying.
The access path needed for updating and inserting records is totally different from the path that must be followed to find information. Database administrators spend many hours on this subject.

My advice: invest in optimizing the RDBMS to handle transactions and VERY specific questions on detailed data, and invest in a good, scalable solution based on search technology to handle the queries.
With this "database offloading" strategy you will be able to let your dataset grow and still be able to manage it, as well as keep it usable for all the processes and users in your company that need the data.

Tuesday, October 13, 2009

Experiences with Solr 1.4 Enterprise Search Server (Part 0)

This weekend I started exploring the book "Solr 1.4 Enterprise Search Server".

Although Solr 1.4 has not actually been released at the time of writing, the latest nightly build does provide nearly all of the functionality and is very stable.

In some of the coming posts on this blog I will share my experiences with reading the book and working through the examples. I will also make comparisons with some of the other search engines that I use in my work, like Autonomy, Exalead and the Google Search Appliance.

The e-book is accompanied by example code, downloadable from the Packt website. That example code contains the Solr version that was used while writing the book, so every example in the book should work.
The example code contains a fully populated and configured Solr / Lucene instance. This instance consists of some hundreds of thousands of records pulled from the MusicBrainz database. To use this data you need a development / test environment with enough disk space and RAM.
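
Once that instance is running, querying it is just an HTTP request away. A minimal sketch, assuming the instance listens on the default http://localhost:8983/solr and that the artist name is indexed in a field called "a_name" (the URL and field name are assumptions; check the schema that ships with the example code):

import urllib.parse
import urllib.request

# Query the MusicBrainz-based example index over plain HTTP; the URL
# and the "a_name" field are assumptions based on a default setup.
params = urllib.parse.urlencode({"q": "a_name:Rush", "wt": "json"})
url = "http://localhost:8983/solr/select?" + params
print(urllib.request.urlopen(url).read())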

One negative remark about this dataset is that it is mostly "structured database data": a lot of fields with small amounts of data.

The Enterprise Search environments that I stumble upon mostly hold lots of unstructured information: documents from file systems, DMSes and CMSes.
Of course database offloading is a hot topic in BI / enterprise search land, but most information that has to be searched and found comes from unstructured documents.

Maybe the fact that database data was chosen says something about the field of operation of Solr / Lucene in the real world.

It would be nice to have a more representative data set to work with. That would make the examples more useful.

Information Access is all about the interfaces

Just read this post on Beyond Search about the difference between the Google/Apple interfaces and other interfaces in the enterprise that give you access to information. The image is below.

This funny little comic illustrates what I have always said when talking about interfaces that must help people find information: they must be simple, yet powerful.

In my opinion it is necessary to start a dialogue based on the initial search. Compare it with walking into a store and telling the salesperson "I want a gift". Based on that query, search engines would respond with "0 results". A "gift" is not concrete enough.

The sales person would start asking you questions: "For what occasion?", "Is it for a woman or a man?", etc.

So: start simple and guide the searching person through the possible answers.
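
Faceted search is one way to implement that dialogue. A minimal sketch against a Solr index, where facet fields stand in for the salesperson's follow-up questions (the index, its URL and the field names are hypothetical):

import urllib.parse
import urllib.request

# Instead of answering a vague query with "0 results", ask the engine
# for facet counts so the interface can pose follow-up questions.
params = urllib.parse.urlencode([
    ("q", "gift"),
    ("facet", "true"),
    ("facet.field", "occasion"),   # "For what occasion?"
    ("facet.field", "recipient"),  # "Is it for a woman or a man?"
    ("wt", "json"),
])
url = "http://localhost:8983/solr/select?" + params
print(urllib.request.urlopen(url).read())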


Visualisation of Enterprise Search ROI with Google GSA

The Dutch company VLC has a great little application on their website that lets you interactively determine the ROI of investing in a Google Search Appliance (GSA). They make use of the following factors:
  • Documents
  • Employees
  • % of knowledge workers
  • Cost per employee

Below is the text from the VLC website (translated from Dutch):

Calculate the benefits of a good search solution

With a simple formula you can calculate how much money can be saved by making information properly searchable with an Enterprise Search solution.

# of employees × (average annual salary / 2,080 hours) × (minutes saved / employee) = € savings per year.

The application below performs this calculation. Drag the balls on the sliders back and forth to match your organisation's figures. It quickly becomes clear that a good search solution can deliver a large productivity saving.
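
The quoted formula leaves the time unit of the "minutes saved" implicit. Here is a worked example under explicit assumptions: all figures below are made up for illustration, and the minutes saved are read as per employee per working day.

# Worked example of the VLC savings formula; every figure is an
# illustrative assumption.
employees = 200
annual_salary = 50000.0              # EUR per employee
hourly_rate = annual_salary / 2080   # EUR per hour (2,080 work hours/year)
minutes_saved_per_day = 10           # per employee per working day
workdays_per_year = 230              # assumption

hours_saved_per_employee = minutes_saved_per_day / 60.0 * workdays_per_year
savings = employees * hourly_rate * hours_saved_per_employee
print("EUR %.0f saved per year" % savings)  # ~EUR 184,295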

Friday, October 9, 2009

Going from one monopolist to another

http://blog.seattlepi.com/microsoft/archives/180794.asp

Google is offering office productivity tools in the cloud through Google Apps. To my knowledge, they are the only ones with such a rich online solution.

When more and more people start using Google Apps, the suite is going to become a new standard. That will attract even more users, because it is easier to exchange files in the same format.

And what about the "openness" of the Google Apps document format? Sure, you can save a document in another format and then open it in that program (let's say "Word"). But once you have edited that document with Word, you cannot save it back as a Google Apps document without trouble. You can upload the file as a new document, but that is not what you want. Also, the layout of the document will be messed up...
I say: when you choose Google Apps you are bound to it, just as with the choice for any other office suite.

When Google has the biggest user base in the world, won't they be the next monopolist in the market?

Consider this:
The fact that all your data is on their servers is even creepier.

With the offline Microsoft Office solution your data is still safe in your own network.

At the moment we use Google Apps only for "work in progress" and for notes that must be shared or collaborated on.
The final documents and reports we still create and save with Microsoft Office (and OpenOffice, although that causes compatibility issues). Those files are safely kept under our own control.

Thursday, October 8, 2009

In the field of defying standards, Google is following Microsoft

I have said it before, and I will say it again: Google is growing out to be the next Microsoft. Driven by shareholder value, the company is no longer the underdog. They are starting to set their own standards. The public opinion is changing...

By implementing currently non-standard features on their homepage, Google are sending out a strong message on what they believe the new standard features should be, and not coincidentally, it is the features that their own browser implements and supports. This is not the first time Google has sent a wrecking-ball into the standards process. Google Gears was launched long before Chrome as a way to implement proposed HTML5 standards, such as offline caching, into browsers (see my NextGenWeb series from last year). It was born out of frustration with the slow and bureaucratic standardization process – something that Google couldn't afford to wait for, as their web applications could not advance further without a non-aligned platform to build them on.

A large part of the anti-trust case against Microsoft was that with combined desktop, browser and server market dominance the company could abuse that position to make the web a Microsoft web by implementing Microsoft-only features. Google are using their dominance to force an issue that has been stalled for far too long – but the difference is that they are using their force for potentially a greater good (I hope). The theoretical Microsoft web would have been "this website only supports Internet Explorer", whereas with Google so far we have "this website is a lot better, and has sexy buttons, if you use Chrome (which btw is open source)".


source: http://www.washingtonpost.com/wp-dyn/content/article/2009/10/08/AR2009100800192.html