In this episode of make all
, I’m joined by the ASF’s and Lucid Imagination‘s Grant Ingersoll to talk about using search as a database. I’ve come across people who are using Solr as a search-based way of retrieving and storing data in their applications, and I wanted to explore that topic with someone in the know; Grant is steeped in the technology.
Download the episode directly right here, subscribe to the feed in iTunes or another podcatcher to have episodes downloaded automatically, or just click play below to listen to it right here:
Show Notes
- What is Lucene & Solr? “Anything that you can make into a String, you can search with Lucene.”
- The rest of the Lucene ecosystem, like Nutch, Mahout, etc.
- Grant: “I really haven’t used any SQL since 2004.”
- Uses of search as a database: e-commerce, search over a database; “search as the means” – “you don’t even know you’re doing searches” when you use the applications, like log management for transaction tracking; embedded in BI and analytics apps, like monitoring buzz of Twitter, blogs; taking advantage of the extra info returned by search, like facets, groupings, trends, etc.;
- Is this all “read only”? Not exactly.
- What’s all of this look like in production? A lot depends on how you need to scale out. Starting small, getting to sharded indexes, etc.
- See him over in Twitter @gsingers .
Transcript
(I haven’t checked this transcription in detail, so if you notice something funny looking, ask before assuming it’s “funny.”)
Michael Coté: Hello everybody! It’s the 27th of April, 2010 and this is episode number three of ‘make all’, the exciting development podcast about all sorts of very interesting development topics. This is your host Michael Coté, available at peopleoverprocess.com as always, and I am joined by the guest for this episode. We are going to talk about essentially the idea of, sort of, using search as a datastore or a database, if you will, which is a very vague way of putting it. But why don’t you introduce yourself.
Grant Ingersoll: Thanks Michael! My name is Grant Ingersoll. I am a committer as part of the Apache Lucene & Solr projects which are one of the most popular open-source search projects out there and I am pleased to be here.
Michael Coté: And so I wanted to get you on here because I have been having several conversations recently where people are using Solr as sort of a front-end for a database, for a datastore. Not a front-end for a database, exactly, but they are searching over all of these piles of content that they have, and rather than putting it in a database, or maybe even a NoSQL sort of database, they are really just using search as the way they look over the data they have, for querying and for using it, which I think is an interesting use case. And Solr and Lucene (Lucene even more than Solr, if I remember) have been around for quite some time.
So before we get into that interesting use case of sort of using search as a database, like, can you give us a quick introduction of Lucene and Solr itself, just kind of what they are in general and kind of their histories?
Grant Ingersoll: Sure, yeah, no problem. It is an interesting use case, so taking a step back, I guess “in the beginning there was Lucene” and it was good. So what Lucene is, it’s a Java-based search library. To that end, it’s a set of APIs that allow you, the developer, or you, the company, to quickly and easily add search capabilities to your application.
Obviously it requires you, since it’s a Java API, to write Java code to wrap around all of that beautiful search code to do things like converting your data into the format that Lucene needs in order to index it. Now of course, that format is essentially just a string, but you need to convert that data, add it into Lucene, and Lucene then, on the inside, goes and does the indexing process. Essentially what it’s doing is creating data structures that make that data available for searching.
Then on the search side of the coin, you need to have your application go and access those data structures through Lucene’s APIs to run queries. Those queries can be anything from very simple term queries or keyword-based queries, just like you are used to typing into your favorite Internet search engine, up through things like phrases; you can do things like wild-card queries, adding asterisks and question marks, and things like that in your query. And there is a whole bunch of other variations of queries.
So what Lucene is doing is providing you all of those APIs that make it easy for you to then add search.
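To make that concrete, here is a minimal sketch of that index-and-search loop, written against a recent Lucene release (the API has shifted somewhat since this conversation was recorded); the index path, field name, and query text are made up for illustration:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        // Indexing: turn whatever data you have into strings and add it as a Document.
        Directory dir = FSDirectory.open(Paths.get("/tmp/demo-index")); // hypothetical path
        StandardAnalyzer analyzer = new StandardAnalyzer();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body", "Lucene is a Java-based search library", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Searching: parse a query (terms, phrases, wildcards, ...) and run it against the index.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("body", analyzer).parse("java AND search");
            TopDocs hits = searcher.search(query, 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("body"));
            }
        }
    }
}
```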
Michael Coté: And it sounds like, in sort of true Java object-oriented fashion, like you were saying, the idea is that there is the core search functionality, and as a user of Lucene, your job is to write the implementation for inputting the searchable content and then also a little bit of performing the search, or dealing with the output of the search. And the idea being there that you are not stuck with just plain text, for example; I am sure there is some interface, and if you can implement this interface, you can make it search over whatever you want, even if it’s kind of obtuse objects that aren’t usually plain-textable.
Grant Ingersoll: So Lucene makes that all very agnostic. I mean, there are things like Hibernate Search which integrates with blobs and clobs and all of those kinds of things. Basically, anything that you can make into a string you can add into Lucene. And there are front-end tools for this as well. One of the affiliated projects with Lucene is actually called Apache Tika, and it’s a project that takes common file formats like Adobe PDF and Microsoft Office documents and all that, and converts them to well-structured text documents, or HTML/XML, however you want to look at it. So then you could take all of that extracted content and easily add it into Lucene or Solr. That’s why it’s one of the common affiliated items.
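A rough sketch of that extraction step, using Tika’s facade class against a hypothetical PDF file:

```java
import java.io.File;

import org.apache.tika.Tika;

public class TikaSketch {
    public static void main(String[] args) throws Exception {
        // The Tika facade detects the file type (PDF, Word, HTML, ...) and hands back
        // plain text that you can feed straight into Lucene or Solr.
        Tika tika = new Tika();
        String text = tika.parseToString(new File("listing.pdf")); // hypothetical file
        System.out.println(text);
    }
}
```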
But I have seen people, in fact, I am just working today on a little demo: I’ve got real estate data from the New York City government, and it’s got price information, it’s got tax codes, it’s got address and location data, and I am going to throw all of that stuff into Lucene and make all of it searchable.
So that’s a good example. So that kind of covers the Lucene API. And then, the other part of that, and what you alluded to that people need to do, is they need to build up what I call the scaffolding. It’s all that infrastructure that goes around Lucene, that takes care of getting your data in and managing the index and things like that, which make Lucene ready to run in production.
So it’s really dead-simple to get data in and then search it, but it does take work to make this production ready and to really scale out. I mean, Lucene can scale out to billions and billions of documents; I have seen some very large implementations out there. But what you have to do is build out the scaffolding. You may have to add in things like distributed search, you may have to add in things like replication, many of the things you would do with a database.
So then, and this is kind of where Solr comes in, Solr was a project that was started internally at CNET back in about 2004. Just to add a little timeline: Lucene was started in 1997 by a guy named Doug Cutting, who I am sure everybody has heard of if they have heard of Hadoop and Lucene, and was donated to the Apache Software Foundation in 2001, and then along came Solr.
So in 2004, Solr was started at CNET to solve a problem that they were having, which was basically that they wanted to give all of their developers access to search services over what is essentially a web server interface. So in 2004 that was started internally, and then it was donated to the Apache Software Foundation in 2006.
So what Solr does is basically provide all of that scaffolding around Lucene. It manages your indexing process, it manages the search side, and it gives you a REST-like interface on the front-end, so you can just PUT or GET or POST parameters and commands into Solr and out comes a response. By default, it outputs an XML response, but you can actually have it output JSON or PHP or Ruby or anything that is easy for you to then slurp into your application.
In fact, that whole interface is completely pluggable. So if you have your own binary format that you need for some legacy system, you could even do that with Solr.
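Since the interface is just HTTP, a query is nothing more than a GET against Solr’s select handler; here is a sketch using the standard Java HTTP client, with a hypothetical core name and the wt parameter switched to JSON:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SolrHttpSketch {
    public static void main(String[] args) throws Exception {
        // q is the query, rows caps the number of results, and wt picks the
        // response writer (json here; Solr answers in XML by default).
        String url = "http://localhost:8983/solr/products/select?q=television&rows=10&wt=json";
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON with numFound, the matching docs, etc.
    }
}
```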
Michael Coté: So it sort of adds that one layer up from Lucene, of middleware and infrastructure that you would need. It kind of obviates the need to go that extra mile with the Lucene library itself (Voice Overlap) and just start using search as a service, I guess.
Grant Ingersoll: Exactly, and it makes it easy for PHP developers, Ruby developers, anybody to run. Plenty of people I have seen don’t even know a lick of Java; all they really know how to do is exec out the Java process. In effect, what Solr does is run inside of a servlet container, so Tomcat or Jetty or Resin or WebSphere or whatever you want to use. So all they ever really have to do is start up this servlet container and away they go; they can just talk to it over HTTP with REST-like commands.
So that’s kind of the first layer of Solr, and then what Solr has on top of it as well is a number of things that people often do with search that are also provided by Lucene and then there are some additional things from Solr as well.
So in many search applications, for instance, you want to do things like hit highlighting. In other words, you want to show where the query terms actually occurred in the document. So Lucene actually provides a Java library that does that. What Solr then does is make that all easily configurable and set up and that just kind of works out of the box with Solr. You don’t have to do any programming, it’s just a configuration item.
It also adds, and this is a pretty popular feature especially in the e-commerce space, what we call faceting, or sometimes people call it parametric search or guided navigation, or just kind of general discovery type things.
And what faceting is, I’ll just explain briefly: most people have seen it if you go on the Amazon website and do a search for TVs. What you will see down on the left-hand side is the search results broken down into categories, so things like by manufacturer, by price, by review, and for each of those it will say, oh, there are 23 Sony TVs, there are 13 Samsung TVs. Those are facets. That’s kind of guided navigation on the left-hand side.
Michael Coté: This is sort of subgroups of your search results, like some attributes of the search results, like the maker of the TV, or something.
Grant Ingersoll: Right, and it makes it easier for people to know how to further narrow down the search results without having to guess at additional keywords to add in. So Solr adds in out-of-the-box faceting, very easy to use. It also adds in things like easy-to-use spellchecking, and other things like “more like this” or similar-pages functionality; so given a query and a particular document, you would say, find me other documents that are like this document. That’s all out of the box with Solr, easy to set up, all through XML configuration. So that’s kind of Solr in a nutshell.
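As a sketch of what faceting and hit highlighting look like from Solr’s Java client, SolrJ; the core name and the field names (manufacturer, price_range, description) are assumptions about a schema, not anything Solr ships with:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetSketch {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();

        SolrQuery query = new SolrQuery("tv");
        query.setFacet(true);                                 // turn faceting on
        query.addFacetField("manufacturer", "price_range");   // hypothetical facet fields
        query.setHighlight(true);                             // hit highlighting
        query.addHighlightField("description");

        QueryResponse response = solr.query(query);
        for (FacetField facet : response.getFacetFields()) {
            for (FacetField.Count count : facet.getValues()) {
                // e.g. "manufacturer: Sony (23)"
                System.out.println(facet.getName() + ": " + count.getName() + " (" + count.getCount() + ")");
            }
        }
        System.out.println(response.getHighlighting()); // per-document highlighted snippets
        solr.close();
    }
}
```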
Michael Coté: Are there still — are things like Nutch still around, that sort of complete the picture? Like, what’s the story with that? It’s been a long time since I looked into using it, but what’s going on with the user-facing level of all this?
Grant Ingersoll: So there are actually a few other things that are part of what I call the Lucene ecosystem. Lucene and Solr, I would say, are pretty much at the center of that as far as what’s delivered from Apache. There are also a few other ecosystem items. I already mentioned Apache Tika; that’s one. Another is Apache Nutch. Nutch is another way of consuming Lucene, but it also comes with a bunch of other things.
For instance, it primarily focuses on Internet-scale crawling. So if you need to crawl, or go out there and scrape websites and then make all of that content searchable, what you can do with Nutch now is send Nutch off to go crawl, and it actually uses Apache Hadoop for this. Many people don’t realize that Hadoop actually came out of Lucene; it was part of the Nutch project a long time ago, and has since spun out and become its own thing.
So you can go off and crawl all these documents and bring them in. Nutch then uses Tika to extract the content out of those documents, and then it takes that content and can send it either into Lucene proper or into Solr. And so then either Solr or Lucene makes all of that content searchable.
So one can essentially build a pretty large scale web-crawling search system for free, using Nutch and Solr and Tika and Lucene, essentially the Lucene ecosystem.
The last bit, and this is something that I find particularly interesting, it’s one of the areas I am pretty into, is called Apache Mahout. Mahout is a fairly new project aimed at building a scalable machine learning library.
So things for classification, clustering, collaborative filtering or recommendation systems; we also have a bunch of algorithms for doing frequent pattern mining, or association mining, data mining kinds of things. So the typical algorithms you are talking about, that people may or may not be familiar with here, are things like neural networks or support vector machines or Bayesian statistics, if you really want to get into some low-level name dropping of what algorithms are out there —
Michael Coté: That’s right, academic CS name dropping, it’s a new height of popularity at parties.
Grant Ingersoll: Yeah, it’s nice because we build it out on Hadoop, actually, so it’s just another great use of Hadoop: these really large-scale machine learning libraries. But they actually play quite nicely with search, because these are things that people often want to do in search; you need things like classification.
So classification, for example: if you are crawling a news site, you want to automatically categorize all the sports articles into the sports genre, or you want all the politics articles to be automatically labeled with politics. That then all feeds well into search, and in fact, you can then build facets off of those classifications and all that stuff.
So that’s really, I think, in a lot of ways what’s coming and becoming more and more popular in the search space: this notion of adding intelligence into the search capabilities.
Michael Coté: Yeah, I always enjoy search things, I guess. Like applications that, for lack of a better phrase, get me closer to making a decision about something, rather than just giving me a bunch of options.
Grant Ingersoll: Definitely!
Michael Coté: Like somehow knowing you are looking for this kind of thing, and here are ways you can evaluate all these things that we found; things that let you figure out what they are without necessarily looking at each of them.
Grant Ingersoll: Definitely! And tools that — you really see this, I think, in a lot of the “big boy players” out there, the Googles, the Yahoo!s, the Amazons, the Facebooks: they are leveraging all of that data that you have from your users and all your user feedback, and combining that in intelligent ways with your actual content.
That’s really powerful stuff, and like you said, instead of people just guessing and searching through tons and tons of data, it can help them home in on what they are actually looking for much more quickly.
Michael Coté: Yeah. So on to the interesting use of search that I have come across. I think it was one of the other RedMonk guys, James Governor, who helped out with and was at NoSQL London last week, and I think ‘The Guardian’, the newspaper there, had a use case where they were talking about using all of this, using Solr and Lucene as — usually when you think about search, you think about a user looking at a computer screen going through search results, you think about Google essentially, and even within applications you think about a user doing something.
The case that they were talking about was more using search programmatically inside of a program; instead of using some SQL to talk to a relational database to query things, just allowing their application to use search the same way a person would.
It’s one of these things that’s really obvious, but it’s also a really clever way of doing something. And so with that introduction, and since you and I were talking yesterday about these use cases a little bit, I am curious just to hear how you have seen people using Solr and Lucene as part of their middleware rather than part of their user experience, if you will?
Grant Ingersoll: Yeah, well, it’s interesting, you mentioned NoSQL. As I mentioned to you yesterday, I have been doing search since about 2000, I have been using Lucene since about 2004, and prior to that I was doing databases, and I really haven’t used any SQL since about 2004. Because I think once you are in the search space and you get your head around thinking about search, you tend to, just like database people tend to view all solutions as being database-oriented, you start to see everything as search-oriented. So to me, like I said, I haven’t written any SQL in a long time, other than maybe for getting my data into search.
So the NoSQL movement, to me, started a long time ago, and it’s just as well founded on Lucene and Solr as on any of the other things. That being said, Lucene and Solr play very well with all of the NoSQL-type tools that are quite popular right now —
Michael Coté: You’ll have to tell me if this is the case with Solr and Lucene, but it seems like a lot of the constraint of using a relational database as a data source has really been driven by operational and infrastructure concerns. Meaning that in the past, if you were doing a single server, it was more efficient to have a database that drives what you have, whereas one of the shifts that seems to make NoSQL more popular now than before it was called that, or whatever, is that you are willing to look beyond one server.
So the infrastructure you are using to back your datastore can be a lot more varied than it used to be in the past, and it’s also been optimized a lot more, and so forth and so on. And I think the interesting constraint behind part of the NoSQL appeal is that you are not stuck with this thing. You are not really constrained by that infrastructure concern, which really has nothing to do with your application development, and so you put up with this unnatural way of querying data, when it would seem like search is a much more natural way of just looking stuff up, and even persisting things, if you can wangle that a little bit.
Grant Ingersoll: Yeah, I think there are a lot of good points in there. I mean, for one, with search you start thinking about the problem space a little bit differently. Oftentimes you have to de-normalize your data, which, if you are coming from a really strong database background, can feel like sacrilege at times. But once you do that, and there are trade-offs to it as well, oftentimes you start to look at things a little more fuzzily, and what it does is allow your users to get at the data in more interesting ways, whether that’s, like you said, embedded in an application and/or actually just in the keyword search box.
So I think search definitely plays a role in there, and I have seen a lot of different applications where people are just using search as the database. It may not always be the authoritative store, but it often is the primary store that they are accessing on a regular basis.
And like you said, it then can scale out quite nicely. I mean, people have known how to scale out search for quite some time, even pre-Google. They have been distributing indexes and distributing search load and all that for quite a while.
Michael Coté: It’s always a shocking revelation that distributed computing and clustering worked well before Google.
Grant Ingersoll: Yeah, I think the other part of it, the other reason the NoSQL movement has caught on, and a little bit of the history on the database side, is that it has always been really easy to find database developers. So people have DB skills and they’ve had them for a long time; it’s well taught at university. It’s one of the first things you learn. I know it was one of the first things I learned: you get in and you learn how to do SQL, because that’s what’s driving your application.
And now people are scaling up and scaling out more, and you start doing joins across 20 or 30 tables or things like that, and those just grind to a halt in a traditional database scenario. So if you can rethink how to store that data in a different way, I think you can get around some of those inefficiencies, and search is often one of the ways to do that. But you do lose some things as well.
Michael Coté: So, what are some of these application types? I mean, what have you seen people using search for as a database, essentially?
Grant Ingersoll: Well, I think there are quite a few. First off, I’ve seen people who just index whatever is in their database. E-commerce is a good example of that. In the SQL world, if you’re doing search over a SQL database, well, what are you doing? In its most basic form you’re doing a LIKE query against a text field, and maybe you have a few wild-card expressions you can throw in there.
But if you add search in there, you’ve got a much richer context, being able to actually do all kinds of manipulations of that data and slice and dice it in ways that you can’t do through more rigorous joins. So you can start bringing back results that are only loosely related, for instance, to what’s in the database. You might be able to discover two products that do great as up-sells together, but because in the database they are much more rigidly defined, you won’t be able to find those items doing just a regular LIKE query or a more strict search.
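For a sense of that difference, here is a sketch of the kind of query expressiveness the standard Lucene/Solr query syntax gives you that a SQL LIKE cannot; the field names and values are invented for illustration and assume a numeric price field:

```java
import org.apache.solr.client.solrj.SolrQuery;

public class QuerySyntaxSketch {
    public static void main(String[] args) {
        // Boolean operators, fuzzy matching (~ catches the "televsion" typo),
        // phrase boosting (^) and range queries, all in one query string.
        SolrQuery query = new SolrQuery(
                "(name:televsion~1 OR description:\"flat screen\"^2) AND price:[200 TO 800]");
        query.setRows(20);
        System.out.println(query); // prints the encoded request parameters
    }
}
```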
So that’s one, I think another — you often see search as the means. So it’s like you’re talking about, it’s part of an application and you don’t even know you’re doing searches. For instance, I’ve seen somebody who’s done an application where they are logging all of their transactions across all these different integration points, all of those go into Lucene & Solr and then their user interface is just all driven by search underneath the hood.
So you won’t even know that a search ever occurred, but the way you are accessing all that data is always just a lookup into the Lucene & Solr index, a search.
Beyond that, you see Lucene & Solr kind of embedded in business intelligence applications or analytics applications. So, for instance, these companies that are monitoring the buzz out there, the blogs and Twitter and all of that stuff, take all of that data in, and they are not just looking at ten results like you and I would by typing in search keywords.
All we care about really is that top ten, but what they’ll do is say, well, do the search and give me a million results. And then they are going to take all of those million results and do a bunch of post-processing, maybe some machine learning, maybe some advanced analytics or analyses, and then generate reports: here’s what the blogosphere is saying, here’s what Twitter is saying about product X, or here’s all the sentiment, here’s what people are feeling about product X or product Y, or here’s what they’re saying about RedMonk today.
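In SolrJ terms that pattern is roughly: ask for a very large result page instead of the usual ten, then hand every document off to whatever downstream analysis you run. A sketch, with an assumed “buzz” core and made-up field names (for truly deep result sets you would lean on Solr’s cursor-based paging rather than one giant page):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class AnalyticsSketch {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/buzz").build();

        // Pull back far more than the usual top ten so the post-processing step
        // (sentiment, trends, report generation) has the full result set to work on.
        SolrQuery query = new SolrQuery("\"product X\"");
        query.setRows(100000);
        query.setFields("id", "text", "published"); // hypothetical fields

        long mentions = 0;
        for (SolrDocument doc : solr.query(query).getResults()) {
            mentions++;
            // hand doc off to sentiment analysis, trend counting, reporting, ...
        }
        System.out.println("documents to analyze: " + mentions);
        solr.close();
    }
}
```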
Michael Coté: Yeah, that gets to an interesting thing that’s kind of been running around in the background of these examples: whether it’s the faceting or other things, using search as a type of analytics itself, the way you organize the search results and surface the different facets and everything.
You can then use it as more just — instead of just a list of things you are interested in, there is this extra information that gets added to it.
Grant Ingersoll: Exactly.
Michael Coté: I don’t think you always think of search as being the source for that information; it’s more like BI or analytics that adds it on. But it’s sort of there out of the gate, with the various meta information that gets added to the search results. That stuff is added as part of the search process, and rather than throwing it away, you can start using it as part of your application.
Grant Ingersoll: Right, and also the index becomes the store. It’s the primary store; it’s the way we look at things. Think about business intelligence on top of search as opposed to business intelligence on top of a database. With a database, if I build my BI application on it, I have to decide upfront, well, here are the queries, here are the SQL joins and all of that, that I’m going to run to generate my reports.
And you have some flexibility there, but maybe not as much as you would get by being able to say to a user, okay, tell me what search criteria you want to generate your reports on. Then you can put in these much more fuzzy terms; you could say, well, I want to find all the widgets that were shipped in New York and had this certain description in them, and then give me reports on those kinds of things. It’s much more free-form.
And so you have Lucene and Solr and search as the database, and then you’ve got this layer of analytics on top that is post-processing the search results to then present them in intelligible ways to the users, and I think that’s pretty powerful stuff.
Michael Coté: So in these cases, are they all sort of read-only? What’s the transactional nature, like if you want to write something back, how does that happen, if at all?
Grant Ingersoll: Yeah, if you are a company building one of these things, what do you want to do, what’s the buzz these days? You want users engaged and you want user feedback. So all of a sudden, you flip the coin and you have this massive number of users interacting with your system.
So you’re capturing both explicit and implicit operations that those users are doing. You’re obviously going to capture things like click-throughs and all the log events they generate as they click and poke and prod and enter query terms and all that, but then you are also getting explicit feedback. So they can give you things like ratings, thumbs-up, thumbs-down, scores of 1-5, what did you think about this, as well as reviews and descriptions.
And so now you can take all of that feedback as well, and you have this continuous loop of adding those things back into your system. And then of course that can feed future reports and analytics. There again, your index becomes the primary authority, or at least the primary store. Obviously, maybe you still want the database to be the authoritative store, but I’ve seen people who actually use the index as the authoritative store too. The point is, the index is the place where you go to do all the work, because it’s the one that allows you to have all these different views and generate these discovery mechanisms that you just don’t get, I think, with a database.
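Writing that feedback back is the same update path as any other document; a sketch with SolrJ, where the core name and fields are assumptions about what such a feedback schema might look like:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class FeedbackSketch {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/feedback").build();

        // Explicit user feedback goes back into the index as just another document,
        // so future searches, facets and reports can build on it.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "review-12345");       // hypothetical schema
        doc.addField("product_id", "sku-98765");
        doc.addField("rating", 4);
        doc.addField("review", "Great picture, the remote is clunky.");

        solr.add(doc);
        solr.commit(); // make the feedback searchable
        solr.close();
    }
}
```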
Michael Coté: Yeah, that’s interesting.
Grant Ingersoll: The database still exists and the database is still important. Like you said, NoSQL to me means being pragmatic about when you use SQL and when you don’t.
Michael Coté: Yeah, yeah. I mean, it’s like all great brand names: it’s a very short name that requires several pages of explanation and lots of caveats about how it actually relates to SQL and all this business. But anyway, in the NoSQL area it’s always a big debate, like, oh, should we even be calling it NoSQL? But you know, I think it’s nice to have a punchy name behind it, and having people get into that discussion of those three pages of footnotes, so to speak, is healthy.
Grant Ingersoll: Yeah, and it’s important. I mean, as a developer, you need to constantly be thinking about what’s the best solution for the problem, not making the technology that I happen to know at the moment fit the solution. You want to come up with what is the best solution. In five years we’ll be talking about something else; not even five years, one year, six months.
Michael Coté: We’ll have the NoNoSQL movement. That will be great.
Grant Ingersoll: Right, I think that one has already started.
Michael Coté: So the last thing I was curious about: I’m always curious, when you go over a piece of middleware, or framework, or whatever you’re using in an application, what the infrastructure ends up looking like. I mean, is this something that you run over multiple servers, and how do you end up managing all of that stuff? Once you move into production, what does it look like to manage the stuff and the topology and so forth?
Grant Ingersoll: Yeah, that’s a great question, and ironically, in a lot of ways, especially running Solr, it looks like a database. You have this thing running as a service and you talk to it over Solr’s version of JDBC, or whatever you want to call it. But a lot of this depends on how you need to scale out. So if we start small, your typical installation of Solr would probably be: you have one server up for indexing, that’s to ingest the content, and then you have a second server, which is a replica of that master index, and that server, or that process, will typically then serve up all your queries.
Now, you don’t have to do it that way. In the small case you could actually do everything on one server, in one process, both indexing and search. But just for failover and fault tolerance and all that, probably the most common setup is that one. So then you have to address a couple of different scenarios. The first one is what I’d call the high query volume scenario. This is pretty common for things like e-commerce and sites that have a smallish amount of data, and by smallish let’s call it maybe less than 20 million records, but really high query volume.
So if you take a typical e-commerce store, or a popular e-commerce store, they may only have 500,000 product SKUs, or product IDs and descriptions and prices and all that, but they may serve a million, two million, ten million, twenty million queries per day, right?
So what they need essentially is a replicated index. They copy this index and put the copies out behind a load balancer and then just let it rip; people can just hit that thing, and they can add new machines as needed to deal with the spikes in traffic that typically occur, in the U.S. anyway, around Thanksgiving and the holiday season.
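In client code, that split usually just means pointing updates at the indexing master and queries at a replica; a sketch with SolrJ, where the host names and core are invented (the query replicas would normally sit behind the load balancer Grant mentions):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class MasterReplicaSketch {
    public static void main(String[] args) throws Exception {
        // Writes go to the indexing master; queries go to a replica.
        SolrClient master  = new HttpSolrClient.Builder("http://index-master:8983/solr/products").build();
        SolrClient replica = new HttpSolrClient.Builder("http://query-replica1:8983/solr/products").build();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "sku-123");
        doc.addField("name", "42 inch LCD TV");
        master.add(doc);
        master.commit();

        // Replication copies the index to the replica on its configured schedule;
        // query traffic never touches the master.
        System.out.println(replica.query(new SolrQuery("lcd tv")).getResults().getNumFound());

        master.close();
        replica.close();
    }
}
```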
Michael Coté: Right.
Grant Ingersoll: So that’s kind of the high query volume case. The other side of the coin, what you have in some cases, is a very large index; in fact, the index is so big it doesn’t fit on a single machine. So now you need what’s called a distributed index, often referred to as a sharded index. What you do is essentially slice up the index into smaller shards and then spread those across machines.
So you may not have very high query volume at this point, but you have a lot of data, and you want to go search that data. That’s all supported by Solr as well. And then the last piece is the combination of high query volume plus sharding, in which case you shard just like you normally would, but then within each shard you replicate out that shard. So then you can deal with a really large index at high query volume.
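A distributed query in this pre-SolrCloud style is just a normal query plus a shards parameter listing the nodes that hold the slices; a sketch, with invented host names and field:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ShardedQuerySketch {
    public static void main(String[] args) throws Exception {
        // Send the query to any one node and list the shards that hold slices of
        // the index; Solr fans the query out and merges the results for you.
        SolrClient solr = new HttpSolrClient.Builder("http://shard1:8983/solr/logs").build();

        SolrQuery query = new SolrQuery("transaction_id:abc123"); // hypothetical field
        query.set("shards",
                "shard1:8983/solr/logs,shard2:8983/solr/logs,shard3:8983/solr/logs");

        System.out.println(solr.query(query).getResults().getNumFound());
        solr.close();
    }
}
```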
Michael Coté: Right. Then I guess that also gets to CDN and geo stuff and optimizing the use over the globe and all kinds of things; it’s a big heady global affair.
Grant Ingersoll: Right, data center awareness and all that stuff. Now, even when they have sharded indexes, most people aren’t truly at such a large scale that they need to do a whole lot extra.
Most people in my experience, all the people I’ve talked to over the years, both through my employer and previously, fit in a single data center, and maybe they have five or ten machines in a sharded and replicated environment, maybe up to 20 or even 50. One of the things I’m pretty excited about that’s coming out in the new version of Solr is this:
We’ve actually taken Apache ZooKeeper, which is a Hadoop-related technology that allows for distributed configuration and control and master/worker election, all those things you need to make a really effective large-scale system, and by large-scale I mean a hundred-plus nodes or a thousand nodes, to take care of all of those coordination capabilities. We’re integrating that with Solr, so Solr will be even easier to use in a truly large-scale distributed environment.
Because I would say, up to now with Solr, you have to do some extra operational work once you get past, say, the 50-node or 100-node mark, which, like I said, most people are well under.
Michael Coté: Yeah.
Grant Ingersoll: But if you truly wanted to do Internet scale, you of course need all of this stuff that goes beyond, like data center awareness and even rack awareness and all that. So what we are working on in Solr right now is integrating ZooKeeper and some other distributed technologies coming out of the Hadoop project, so that people can really build that almost infinite-scale search application.
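That ZooKeeper integration is what later shipped as SolrCloud; at the time of this conversation it was still trunk work. In today’s SolrJ the idea looks roughly like this: the client asks ZooKeeper for the cluster state rather than being handed a shard list, so it always knows which nodes host which shards and replicas (the ZooKeeper address and collection name here are assumptions):

```java
import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class SolrCloudSketch {
    public static void main(String[] args) throws Exception {
        // The client watches cluster state in ZooKeeper, so routing to shards and
        // replicas, and reacting to nodes coming and going, happens automatically.
        CloudSolrClient solr = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build();
        solr.setDefaultCollection("logs"); // hypothetical collection

        System.out.println(solr.query(new SolrQuery("*:*")).getResults().getNumFound());
        solr.close();
    }
}
```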
Michael Coté: Yeah, I mean, there is this interesting effect of cloud computing, or whatever the concept is nowadays, to use the other buzzword [inaudible 35:55], it gets pulled under this DevOps concept. At the end of the day, managing your infrastructure and your operational concerns starts to be more of a development concern, and you had a nice illustration of that right here. We kind of had a little bit of this back in the J2EE days, but people tried to abstract it away so you didn’t have to think like a sysadmin very much; now things seem to have swayed to the opposite end. It’s sort of helpful for your application if you start thinking about sysadmin concerns while you are developing it.
Grant Ingersoll: Yeah. Obviously, the operational cost can be quite expensive managing all those data centers and all that. So anything you can do as a developer to reduce those costs by having things like automated failover and fault tolerance, and all of those things and being able to essentially, seamlessly scale, add and remove resources as needed, being able to leverage things like Amazon’s EC2 or Google App Engine or whatever, all seamlessly and without having to reconfigure by hand. That’s all pretty key to those things. The foundations of those are in Solr’s trunk development space right now and will be in the next release, and then we’ll just be building out from there.
And the other nice thing that you get for free with all that stuff is that, with all of that in there, I feel completely safe making Solr & Lucene the authoritative store as well. So, to close the loop back to the database and persistence use cases, I would say at that point I feel 100% comfortable in that scenario, having that be my primary store and my authoritative store, because I know I have fault tolerance, replication, backups, all of that stuff, all just seamlessly working.
Michael Coté: Right. Well, that’s great! I mean, that was exactly the kind of interesting overview and use I was curious to explore. So are Solr & Lucene top-level projects, or where are they over at Apache.org?
Grant Ingersoll: Lucene is the top-level project over at ASF. It’s just Lucene.Apache.org, and underneath there you’ll find Solr. Some of the other projects I have mentioned, some of those are actually spinning out to be their own top-level projects. So, for instance, Tika, Nutch, and Mahout are all in the process of — well, they all are officially top-level projects, but we’re in the process of moving those out infrastructure-wise so that they will be their own top-levels. But, of course, there will still be what I call Friends of Lucene.
Michael Coté: Yeah. It’s the interesting post-Jakarta era of the ASF, where everything is a top-level project.
Grant Ingersoll: Right. Some of it is just organizational principles, but some of it also — it does give a certain level of — hey, this thing really warrants being its own thing and people should be more aware of it as a top-level kind of thing.
Michael Coté: Yeah, definitely. How about yourself? You’ve got like a blog or the Twitter or anything like that?
Grant Ingersoll: Yeah, I have probably more than I need to have. So, let’s see, my Twitter ID is gsingers, so gsingers. I don’t know why, it was my old UNIX ID from way back when —
Michael Coté: Yeah, you get attached to those. It’s funny, you can tell when someone has their ten-plus-year-old UNIX ID and they just sign up with that. It’s always fun to see.
Grant Ingersoll: Yeah, and then primarily I blog through my employer, so if you go blog.lucidimagination.com, that’s the primary place I blog. I also sometimes blog under my own name, at lucene.grantingersoll.com.
So those are the primary ways to get a hold of me, I guess, and otherwise it’s the Solr & Lucene and Mahout mailing lists; I’m pretty active there as well. So if you go to the Solr & Lucene website or the Mahout website, you can track down the mailing lists there.
Michael Coté: Well, great! I appreciate you taking all this time to go over that stuff. Like I said, I’ve been coming across this sort of search-as-database thing, for lack of a better phrase, more and more. So it’s cool to talk about it. I really appreciate it.
Grant Ingersoll: Yeah. Thank you Michael! It was quite enjoyable!
Disclosure: The Apache Software Foundation is a client. See the RedMonk client list for other relevant clients.