
Search 500,000 Documents for Free

I was pre-briefed about the IBM/Yahoo! partnership and offering this Tuesday. In summary, IBM and Yahoo! have teamed up to offer a completely free search stack that will index up to 500,000 documents.

You download the software (there’s a form to fill out, but that’s it), put it on your server, and let it go. There’s not much more than that.

It’s not “enterprise search” in the sense that there’s no role-driven security or integration with business applications. It “just” does search. But that’s a whole lot to get for free. It’s behind-the-firewall search.

Most search behind the firewall sucks. Having acquired, set up, and administered a Google Mini at one of my past jobs, I have first-hand experience with how wonderful it can be to get real, public-web-grade search set up behind the firewall.

Freely Disruptive

As Dan pointed out in the comments the other day, the largest barrier to entry may be people’s perception that free things are not worthy of business use. What’s good about a free thing like this is that nutty employees with a spare server can just go download it, set it up, and start using it without having to go through the normal, lengthy process.

In that sense, this is a great move for IBM and Yahoo! to be highly disruptive in the behind-the-firewall search market. Provided the technology works as promised, this could be a good enough solution for many, many companies. Revenue-wise, I’m actually in rare agreement with IBM’s on-ramp thinking for this. I can see that plenty of people will want more enterprisey features once they realize the power and value of real search behind the firewall.

The Long View

Indeed, if IBM OmniFind Yahoo! Edition finds wide adoption, this could be a tipping point for enterprise search. The demos and pitches so far have just been the usual “hey! it mashes up with maps! and knows your role so it limits access to data!” stories. My sense is that until people actually get plain old search behind the firewall, decision makers at those companies won’t choose to spend cash and time on search.

For example, let’s take figuring out what your company’s official vacation days for America are. How long does it take you to track that down? If your intranet is like most intranets I’ve had to work with, the answer ranges from 5-10 minutes to “I have to send an email to The Guy Who Knows.” But, if you have real behind-the-firewall search, you just type in “us vacation” and you find it in seconds.

Having that kind of experience for all sorts of “corporate data” is what will ultimately convince gold holders to spend money on enterprise search, snazzy maps mashups or no. Of course, few people are going to want to spend money for something as seemingly simple as being able to look up vacation days. “We already have intranet search, right?” they’ll say…that sucks.

So, offering search for free is the best way to get the ball rolling for more sales down the line. And, it seriously screws with Google, FAST, and others as well.


The above advantages are largely for IBM. For Yahoo!, the payoff is getting a foot in the enterprise market. Yahoo! should have some interesting experience with the grassroots adoption of IM in the enterprise, but partnering with IBM should give a whole ‘nuther layer of experience to draw from.

That is, if they take advantage of it, this should be a good way for Yahoo! to learn the enterprise software ecosystem culture that IBM has perfected. Of course, technologically it goes both ways. As I always say, the enterprise world can learn a lot from the current consumer tech world. This whole deal is a case in point ;>

Good Enough and Security

Sure, the search is not “secure” in the sense that any given user will have their results filtered by what they have access to. Instead, any user can see any URL that’s “public” on the intranet. In other consumer-tech-behind-the-firewall announcements I’ve been in recently, this has been an oddly difficult point for enterprise-centric people to get. Of course, it’s a natural, almost subconscious concept for the URL-culture.

Thus, the question of security and behind-the-firewall search is a cultural concern rather than a tick-list item. Search like IBM OmniFind Yahoo! Edition will find everything in the clear so you can’t rely on security by obscurity…but you’re not doing that with that payroll spreadsheet in SharePoint anyways, right?

To be clear, my point is: if you’re worried that search is finding things it shouldn’t be finding, and that you should, thus, turn off search, you’re applying the wrong solution to the problem. Instead, take that spreadsheet off SharePoint or stop allowing anyone on your network to access it. Indeed, finding such security violations is a nice feature of search behind the firewall. Better a good-willed employee finds it and emails you than a malicious intranet surfer who tells no one.

Now, if you’d like to truly use search as the enterprise command line (including pulling in content based on the roles of the user), investing in something else might be worth it. But, if you’re like many organizations and just have no or crappy search for your intranet, IBM OmniFind Yahoo! Edition looks like a big contender, especially given that it’s free.

As far as it working as promised, at least one person I talked with today said they’d played around with it and it delivers on the promises. I’m curious to hear other people’s experiences with it.

For the Developers

There are a couple interesting technical aspects:

Both of these are interesting in that this free stack of software could be used as a search back end for developers who wanted to spend the time to layer other things on-top of it.

An entry on the search blog provides a nice jumping off into the topic.


During the call, someone from 451 asked why they didn’t open source this, which was a good question on my mind as well. The answer was (a.) hey, we’ve got Lucene in it!, and, (b.) there are some third party entanglements. It’s like the Java answer 😉

Another question from a Butler Group dude was right on as well: what’s the pricing plan for customers who want to search 500,001 documents but don’t want to upgrade to all the fanciness of IBM’s non-free OmniFind search? Indeed, as a general rule of thumb I like to advise companies to figure out how to let customers pay them for what they want. If I just want to index 1,000,000 documents, there needs to be pricing around that instead of having to switch over to another product.

Why Not Go All In?

I’m not sure if there’s much commercial advantage to limiting how many documents can be searched for free. That is, eventually, IBM and Yahoo! might do well to completely commoditize the “basic search” market by giving away unlimited search for free. The Lucene crowd is already doing this in large part, but the brand and packaging of IBM and Yahoo! could accelerate that.

There’s still plenty of revenue to be had from making search into a command line for the enterprise. The server sales alone to support 500,000+ URLs in “real time” should be nice. Crawling that many documents at a reasonable frequency takes more horsepower than you might think.
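
To put a rough number on that crawling load, here’s a back-of-the-envelope sketch in Python. The function name and the re-crawl intervals are my own assumptions for illustration, not anything IBM or Yahoo! has published, and it only counts the average fetch rate: it ignores download time, parsing, and index writes, all of which add to the real load.

```python
def required_crawl_rate(num_documents: int, recrawl_interval_hours: float) -> float:
    """Average documents per second needed to revisit every document
    once per re-crawl interval (hypothetical helper, illustration only)."""
    if recrawl_interval_hours <= 0:
        raise ValueError("recrawl_interval_hours must be positive")
    seconds_in_interval = recrawl_interval_hours * 3600
    return num_documents / seconds_in_interval


if __name__ == "__main__":
    # 500,000 is the free document limit; the intervals are assumed examples.
    for label, hours in [("hourly", 1), ("daily", 24), ("weekly", 168)]:
        rate = required_crawl_rate(500_000, hours)
        print(f"{label:>6}: {rate:.2f} docs/sec sustained")
```

Even a once-a-day refresh of the full 500,000 documents means fetching several documents every second, around the clock, before any indexing work is counted.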

Forcing the tipping point outlined above might be the best long-term bet.

For those who like them, here are my raw notes, including supported formats and pricing for “support”:


(Thanks to ScottD! for IM’ing me and reminding me to write this up.)

Disclaimer: IBM is a client.


Categories: Collaborative, Companies, Enterprise Software, Open Source.


3 Responses

  1. Hopefully my laptop has enough room for the cache – the indexing is happening at the moment …

    In your notes you mention that it will index files (SX*), does that include the ODF-format files from OOo v2?

  2. Update:
    “7,720 pages were crawled since this session began.”
    ” 7,639 documents in the index; index size is 719 MB.”

    Go to whoa didn’t take all that long …

  3. Ric: I’m not sure if it includes those formats. I’m sure we can find out from the IBM folks. There’s even a blog and some forums.