As I’ve mentioned before, I’ve been seeing search-as-middleware cropping up recently. The idea is that search technologies like Solr make for good middleware layers, especially when you have a large pool of unstructured data. Shantanu Deo at AT&T joins me for this episode of make all to discuss one such implementation for att.com’s backend. It’s an interesting discussion of a new way to use search, and it also gives a preview of his upcoming talk at the Lucene Revolution conference next month.
- What does the AT&T CMS do?
- The evolution of CMS into a more portal, application platform
- Where does this show up? AT&T home page – tiles served up there, promotional content.
- How did you draw up what you needed for search? We go over the requirements.
- How does Solr fit in with everything? Sounds like a service…or is it part of the application stack?
- What kind of hardware stack does it run on? A lot less than you’d think.
- How does the indexing/crawling work? Searching – using faceted searches.
- What are the facets y’all are using?
- The other use: global search. Customizing the CMS to the user, e.g., only showing you services and products available in your area, like U-verse or a MicroCell.
- Can we do personalized content as a search problem? Assigning URLs to different user groups… using search as a filter then to select only the relevant content for a user. Search as Middleware.
- Search as a back-end for API-driven, composite applications.
- Lucene Revolution Talk May 25th to 26th – http://lucenerevolution.org/
Michael Coté: Hello everybody! It’s another edition of ‘make all’, the podcast about fun and interesting things with those damn computers. As always, this is your host Michael Coté available at peopleoverprocess.com. And for this episode, we are sort of getting back into the area of search if you will, and all of the fun stuff that’s underneath that. To talk about that, we have someone who is going to be at the upcoming Lucene Revolution Conference giving a talk about how he and his organization have been using Solr. Do you want to introduce yourself real quick?
Shantanu Deo: Hi! My name is Shantanu Deo. I work for AT&T and I am the manager for their Content Management System Group. I have been here for a little over two years in various capacities. Earlier I used to be on the release management team on that side. Also I did some work with search and before that I was at Amazon, for the better part of years or thereabouts. And before that I have been at various companies, mostly in the electronic manufacturing industry on the software side of that.
Michael Coté: Well, what is — I mean since you mentioned it. What is electronic manufacturing? Is that literally manufacturing various gadgets and gizmos and things like that?
Shantanu Deo: Yeah, basically, I worked for Panasonic and a couple of other Japanese companies which make the machines that assemble the PCBs. So it’s like automated pick and place and all the automation aspects associated with —
Michael Coté: You worked on the robots?
Shantanu Deo: Yeah, exactly, exactly.
Michael Coté: Well I think this is the first person we have talked to who was a robot creator, so that’s exciting, or part of the process at least.
I think to start with – I think everyone listening to this knows what a CMS system is, right? It stands for content management system obviously, and a lot of how it springs up nowadays is managing the ton of content that’s involved in public websites, and for internal websites as well. But could you give us an overview of in the case of the work — the CMS stuff that you do at AT&T, like what is the CMS that you are speaking of?
Shantanu Deo: Yeah, sure. So in a large organization, there is always a need to change content on a daily, hourly, whatever, much more frequently basis. So what happens is you obviously have some sort of application behind it that’s running your website but typically that involves coding or whatever your technology choice is like JSP or ASP or whatever development. But just the pure content of it can be abstracted out and managed separately, and that’s essentially what the content management systems typically do, although nowadays they are getting into this whole application space as well.
In fact, we are in the middle of transitioning from an older system to a newer system, and so when we did our due diligence of looking around the market space, it seems like a lot of the content management systems are gravitating towards that whole space, where they occupy more central — almost within the app layer type of space. Before, at least, the ones that I have been familiar with were more like passive ones where you generate the content and it’s all done and your application can refer to it.
Michael Coté: Yeah, I have noticed recently that — as you were saying, you were starting to say before I interrupted you there. It does seem like — sort of there is classic CMS which is all about – it’s all about pushing the content out there and maybe even hosting the content but also about that workflow of authorization that you are talking about.
Shantanu Deo: Oh, absolutely, yeah.
Michael Coté: And then what I have noticed a lot recently is there is almost this kind of convergence of what we used to call portals and to some extent sort of app servers and then CMS systems, they seem to be kind of trying to plug into each other to somewhat become the same thing, like having mini applications or widgets or portlets, if you want to use that old (Voice Overlap).
Shantanu Deo: Yeah, yeah, so you have like these marketing campaigns or managing all of that versus integrating with the catalog behind the scenes to make sure that — like to provide more of a WYSIWYG kind of interface to the content authors. So like you can see, make the change and see it live almost or in-context editing if you will. So that you see all the other stuff that you are not actually handling but what your users will see. So all of that seems to be where the industry seems to be heading.
Michael Coté: Yeah, you know, it seems like if anything, Wikipedia has probably trained people to expect that like — why can’t I just click on this page to edit it? Like it shouldn’t be so onerous to like edit content on a webpage.
Shantanu Deo: Exactly, yeah.
Michael Coté: And you know granted in, a lot of what makes anything enterprise, are those workflows of approval and stuff that you don’t just have —
Shantanu Deo: Oh, yeah.
Michael Coté: People really doing things –
Shantanu Deo: Correct, definitely.
Michael Coté: But still that’s one of the things that I’ve enjoyed seeing in the CMS systems of late is really — most all enterprise systems including CMS have kind of nailed the enterprise stuff like being compliant and being performant and having workflows and everything, and now that they’ve kind of solved those problems, it’s really like they’re refocusing on, I don’t know, usability. Like just —
Shantanu Deo: Yeah.
Michael Coté: — making it easier for users to use essentially.
Shantanu Deo: Yeah, absolutely. I mean, when we had the demos, I remember, all the people who’ve been touching the existing CMS were pretty much blown away by some of the vendors when they showed the UI. So yeah, a lot of people have put in a lot of work towards making that aspect of the system very user-friendly.
Michael Coté: And so speaking of user friendliness to try to make a masterful segue here, one of the things that is really vital to all UIs nowadays or pretty much any application is Search and you know, the issue with Search, I mean we were just talking about a convergence of three different platforms plus the applications, I mean, it’s like all, all large complex software, eventually your software stack becomes everything that you have. And like, so you know, I am curious first to hear, like, when you were thinking about how you were going to apply Search to all of this… Well, first off, if people wanted to see the CMS in action like, where on the Web do they go to see it; where does it surface if you will to work with?
Shantanu Deo: Well, actually you know, if you go to the att.com homepage, let’s say for example, not to be too obvious, but you know that’s where I work, so you see a lot of the tiles, essentially you know, promotional content show up. That whole page is effectively made up of like different pieces of content that somehow come together you know during application rendering time.
So the pieces might be actually developed specifically to target different regions or different user types and depending upon whether you’re authenticated or non-authenticated or depending upon, some other characteristics of the user; that’s what you would end up seeing.
Michael Coté: Yeah, like I am always getting asked if I want to go to paperless billing. Maybe there are all sorts of – I don’t know if that’s part of the CMS system but it is.
Shantanu Deo: Yeah.
Michael Coté: I think a lot of listeners probably have an iPhone or have U-verse or somehow have come across AT&T before. So they have these consumer services. So yeah, so I mean given that the breadth that you’re going over, when you guys were thinking about doing Search, right, — what were the criteria that you drew up? Like what did you want to accomplish with Search?
Shantanu Deo: So I mean, I had a couple of different touch points for Search, not specifically to CMS. Although there is a Search aspect to the CMS that we are currently implementing and I can you know go over that briefly later, but when you asked me the question specifically, the two touch points I mentioned earlier were actually – the first one I had was basically when I was in my earlier position also at AT&T, where I just happened to be part of the process where we were integrating, I think it was some other vendor who would supply us with some targeted content based on the user, user history, browsing history and what have you —
Michael Coté: Oh, all right.
Shantanu Deo: — that’s when I came across the catalog feed that we were providing and I just kind of thought of putting together like kind of an under the table project if you will to expose that catalog in a different sort of way than we were currently doing on our side. So what I thought was how best to present that with some more user-friendly UI aspects that we were not presently at that time using.
I came across jQuery and stuff and then I realized that behind this there needed to be some sort of an engine that’ll service all of that. At that time I had some experience with another Search vendor, we were working with it as part of my daily duties. But somehow it seemed like it had a need for a lot of maintenance or attention, if you will.
Michael Coté: Care and feeding, I think they call it.
Shantanu Deo: Yeah, yeah. So I was kind of looking for alternatives which would not need that much attention, and also be easy to set up and get going. That’s when I came across Apache Solr, and I thought of giving it a try and it worked just beautifully. So we basically had the Catalog Search application up and going, just kind of me and a colleague of mine by the name of Rama, so we just kind of put it together; and I must give credit to my manager at that point who allowed me the space to provide that and then various people in the organization kind of recognized the value of that and decided to support that.
So that came to fruition and actually went live. And it’s been since incorporated into the main ATT Search component. So when you see or if you go to att.com and search some aspects of that like the sliders and those are all the faceted search artifacts that you see and the results you see from that.
Michael Coté: So it sounds like you guys, if I can pull apart some of the stuff you said, there’s at least one way that you’re using this Search and it’s just traditional search like you go to a website –
Shantanu Deo: Yeah, yeah.
Michael Coté: And then you’re also mentioning the Catalog Search –
Shantanu Deo: The Catalog Search was just basically fronting, ingesting the AT&T Catalog and –
Michael Coté: And then that would be all, is that catalog just all the stuff AT&T has to sell or is it just content?
Shantanu Deo: Yeah.
Michael Coté: Okay.
Shantanu Deo: Yeah, that’s exactly that. And then fronting that with some other more UI aspects to make it pretty.
Michael Coté: So since you had mentioned previously that you have been using a sort of a search substrate, if you will, that needed a lot of care and feeding and how would you rate the care and feeding that Solr needs, not necessarily versus that but just in general like what’s the day-to-day worrying and fussing with it?
Shantanu Deo: Actually there’s none. So as far as I know in fact, I think this has been up over a year, or more and I’ve not heard of anything going wrong with it at all. So in fact the other day – it was funny – there was some other related issue and people had just forgotten that it existed. So that was kind of a very good endorsement of Solr in my mind because it was so quiet that it just worked basically, so, yeah.
Michael Coté: Would you say that it’s sort of — so the other thing I am curious to hear about because there’s a lot of people who use Solr and have exactly this problem that you have, right? So I wonder if you could tell us like how, it sounds like from the fact that people forgot that it existed which in this light is very positive that —
Shantanu Deo: Exactly, yeah.
Michael Coté: — it’s sort of set up as a service if you will, like instead of being embedded in an application. So I am curious how —
Shantanu Deo: I think that’s how we use it but I think it can probably be used in other incarnations which I may not be aware of. But that’s how we use it like, we have a separate set of boxes that just run that in a dedicated instance, basically if you will.
Michael Coté: Do you have an impressive cluster running it, or what kind of system did you have to build for everything?
Shantanu Deo: Actually you would be surprised, I mean of course, we have fairly decent crossbar for the applications and all that but actually I was surprised as well that we don’t need that many instances of Solr behind the scenes to support all of that traffic. It seems between the caching layer and the apps we have very few instances of Solr supporting all of that search traffic. So I think we have in the range of boxes that you can count on your hand I think to support a much larger application cluster.
Michael Coté: Yeah! I mean it sounds like to support like search over global AT&T stuff. So that’s like, that’s impressive. I guess it’s impressive that the servers that you have are not impressive. You don’t have like some mega cluster running everything for you.
Shantanu Deo: No, we definitely don’t need that, yeah.
Michael Coté: So how did you — I am kind of purposely putting this in a naive way, but if you have search you’re basically crawling things and updating what’s in the index if you will and then the other side of search is an actual user comes and wants to search something, and so I wonder if you can walk us through those two stages like what’s kind of like the crawling or the indexing or the getting content in there that you have?
Shantanu Deo: Yeah, so we have like I think a periodic feed that comes from the catalog as and when that gets updated, and we just do kind of incremental indexing of that content, and obviously when we are talking about the Catalog Search aspects, we didn’t even talk about the other instance where we use Solr differently. But in this case, whenever catalog updates take place, we have that content indexed and we have certain faceted searches for one thing that we provide.
So based on how the schema is set up, once that’s been indexed, basically Solr just switches over to the new index, and we have I think a web service layer fronting that, and whenever the user searches, there is an Ajax request that goes to the web service, and behind the scenes the web service makes calls to Solr, and pretties up the information that it returns and the user sees the results of the Ajax response.
Michael Coté: When you are building out the searching aspects, I mean was it kind of just the traditional like we are building out the search interface, and we need the UI people to do a UI design, and I mean was it just like the normal stuff you would go through for a public website, nothing out of the ordinary or anything?
Shantanu Deo: Yeah, like I mentioned, that particular project was more like an under the table kind of launch where we just kind of decided to see what we could get out of that like we have sliders for the various parameters like price, so if users needed to search for phones up to a certain price point or up to a certain rate or all these different characteristics of phones, manufacturer or whether or not it had cameras or whether or not it had a keyboard, all those aspects of the phone were exposed as facets of the catalog.
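[Editor's sketch: as a rough illustration of the slider-and-facet searches described here, the web service could build Solr request parameters along these lines. The field names (price, manufacturer, has_camera, has_keyboard) are assumptions made up for the example; the actual catalog schema is not described in the episode. The facet, facet.field, and fq parameters are standard Solr query parameters.]

```python
from urllib.parse import urlencode


def build_phone_query(max_price=None, manufacturer=None,
                      has_camera=None, has_keyboard=None):
    """Return Solr select parameters for a faceted phone search.

    All field names are hypothetical; substitute the real schema fields.
    """
    params = [
        ("q", "*:*"),                      # match everything, narrow with filters
        ("wt", "json"),
        ("facet", "true"),                 # ask Solr for facet counts
        ("facet.field", "manufacturer"),
        ("facet.field", "has_camera"),
        ("facet.field", "has_keyboard"),
    ]
    if max_price is not None:
        # A price slider maps naturally onto a range filter query.
        params.append(("fq", f"price:[* TO {max_price}]"))
    if manufacturer is not None:
        params.append(("fq", f'manufacturer:"{manufacturer}"'))
    if has_camera is not None:
        params.append(("fq", f"has_camera:{str(has_camera).lower()}"))
    if has_keyboard is not None:
        params.append(("fq", f"has_keyboard:{str(has_keyboard).lower()}"))
    return params


# Query string the web service would append to its Solr select URL.
query_string = urlencode(build_phone_query(max_price=200, has_keyboard=True))
```

[Each slider or checkbox in the UI becomes one filter query, so adding a new facet is a schema and parameter change rather than new search logic.]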
Michael Coté: Yeah, it sounds like it was kind of a funner problem domain than other people might have because who doesn’t like phones. So as a programmer, you probably want to optimize all the ways you can search for stuff, whereas I don’t know if you are doing search over a furniture warehouse company or something it may not be quite as exciting to talk about handles and hinges and wood —
Shantanu Deo: Yeah, right, yeah I mean we had a blast. It wasn’t a very big problem that just a couple of guys couldn’t tackle. So it was pretty easy, but yeah, Solr, I guess to its credit, makes it very easy to work with as well.
Michael Coté: So another thing I am trying to think through the searches I have done like on the site, but I am kind of coming up blank. But one thing I am curious about is are there ways that you guys involve an individual customer in the search results or ways that you’re thinking about doing that, because it does seem like — I guess the phrase people would use is personalizing the search.
That’s one of the interesting things that Google and other people do – it doesn’t always work, but they try to pull upon the pile of data that Google has about you and all your relationships and kind of hone your searching, if you will.
Shantanu Deo: Right. So I think since I have moved from that team, I think people have taken that a little step further. I think they have provided additional capabilities, like you are predicting, what is it called, I forget —
Michael Coté: Like as you type it’s kind of searching?
Shantanu Deo: Yeah, exactly, yeah, that one. So that functionality has been added and a couple of other things they are doing. I don’t think it’s currently doing personalized search, but they do other things like substituting, like if you have searched for this term, then you probably meant these other terms.
Michael Coté: Oh, right, right. Yeah. Well, I mean, there is also the question of like what would you really personalize? The only wacky cooked up example I can think is, if you knew someone regularly called from out of the country, you might want to make sure they have a phone that is international enabled, if you will, or something like that.
Shantanu Deo: Actually, yeah, I mean, you are not too far off base basically, we are definitely targeting some of that with this new CMS system that we are working with to personalize some of that. And actually speaking of personalization, this comes to a nice segue to the other use of Solr that I was talking about.
Michael Coté: Oh good, I was going to ask, yeah.
Shantanu Deo: Yeah, presenting at the conference was actually for global search, and this is where we actually have very deep personalization. So that’s an interesting use of a search, if you will, whether or not it happens to be Solr, it’s kind of tangential.
But what the problem we were presented was that AT&T has so many different user types. We have in the order of like kind of several tens of customer types or different characteristics that you would want to personalize a user’s experience on. So that was a project – that’s still ongoing actually, but it will be live shortly – where the business wanted to basically only show a certain aspect of the site or in this case it was the browsing options available or navigations, so to speak. So they wanted to personalize on that.
Michael Coté: So it’s sort of removing things from the site depending on who is searching around?
Shantanu Deo: Yeah, not just removing, but also like showing specific different URLs for that.
Michael Coté: Oh, right, right, right. Yeah. I mean, again, to make a cooked up example or maybe a better one, but like I am a U-verse subscriber, so it would probably be kind of silly to show me subscribing to lesser Internet options.
Shantanu Deo: Yeah, exactly, yeah. Right. So in that case then what we had was the problem like, let’s say, if you had 60, 70 different user groups, so a user could be in any of those different user groups at the same time. So you could be a U-verse subscriber, but you could be in that particular region where some other service was not available, that sort of thing. So in that case you ended up with the situation where you couldn’t beforehand know which set of categories the user would fall in.
Michael Coté: I like the sound of that, because I used to be one of those guys who would — like I also have an AT&T MicroCell and everything, and I would go to that little app where you put in your zip code all the time, and try to figure out if I could get in, without like what you are talking about, you go to the site and you get taunted by things you can’t have, which is terrible.
Shantanu Deo: Yeah, exactly. So yeah, in that case then you end up with like a situation where if you are trying to solve this problem in code, you would have to code for all these different various combinations of user groups and then figure out what to show.
So it becomes a management nightmare, not to speak of like the complex coding that you have to do. I mean, not in terms of like coding, but imagine if you want to test for all of these things, it becomes more of a headache. So that’s when we kind of thought of modeling this differently, approaching it differently and looked at it as, can we do it as a search problem?
And that’s where my experience with Solr in that earlier project we talked about earlier came in handy, and we realized when we were kind of working, putting our heads together, that hey, we could model this like a search problem, if we could classify each URL to be belonging to a certain set of user groups.
So that’s when we kind of modeled all of the URLs, even though they were hierarchical in some sense, like some URLs are in the top navigation and some of them are related to a given top navigation header at a secondary level, and even have another level where you would have like for each secondary you could have a bunch of other tertiary URLs or what have you associated with that. So there was just a little association, hierarchical association as well.
Michael Coté: So correct me if I am wrong, but it sounds like the way you kind of solved the problem is — so you kind of inserted search as almost a filter, like search for everything —
Shantanu Deo: Exactly, yeah. It’s exactly that. It’s like a filter query really. So you search the whole thing but you filter on certain aspects of the user’s attributes that you know beforehand. So then you end up with — once you have like — because Solr in this case was — we decided to go with it because we already had it working, so it was a very easy transition to make or addition to make.
So we flattened out the whole structure a little bit and using some encoding appropriate to pushing it into Solr as one flat file, we then attached like user groups for whether you want to show a particular URL or not. We categorized or we added those attributes to each URL and then searched for all the URLs, but based on whether a user is in a group or whether you wanted to hide particular URLs if that user was in that group.
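[Editor's sketch: to make the "search as a filter" idea concrete, here is a minimal in-memory model of what is described above. Each URL carries the groups it should be shown to and the groups it should be hidden from, and the filter keeps a URL only if the user matches a show group and no hide group. The class, field names, group names, and paths are all hypothetical; in the real system this filtering is done by Solr, not by application code.]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NavEntry:
    url: str
    show_groups: frozenset = frozenset()  # show to users in any of these groups
    hide_groups: frozenset = frozenset()  # hide from users in any of these groups


def visible_entries(entries, user_groups):
    """Return the entries a user with the given groups should see."""
    groups = set(user_groups)
    return [
        e for e in entries
        if (e.show_groups & groups) and not (e.hide_groups & groups)
    ]


# Hypothetical navigation data.
entries = [
    NavEntry("/shop/wireless", frozenset({"everyone"})),
    NavEntry("/shop/internet", frozenset({"everyone"}),
             frozenset({"uverse_subscriber"})),
    NavEntry("/account/u-verse", frozenset({"uverse_subscriber"})),
]

seen = [e.url for e in visible_entries(entries, {"everyone", "uverse_subscriber"})]
```

[Adding a new user group means tagging entries with it and reindexing; the filtering rule itself does not change, which is the maintenance benefit discussed later in the conversation.]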
Michael Coté: That’s a good example of this. I mean, I always struggle to phrase this, but kind of it’s like using search as middleware, if you will, and it seems like — it’s almost like if you have — everyone is, rightly so, into using sort of APIs for public web services and things, and having RESTful APIs and lowercase web services as I would say.
And it seems like — I have been hearing a lot of stories over the past year or so that the kind of backend for APIs, if you will, one of the effective ways to do it, like you have been describing is, to use search for it. And maybe you expose it purely as a search, like there’s a program, is the person searching instead of a person, but it’s kind of an interesting, somewhat new way to think about how you go through all that filtering of displaying what someone has permission to see or should see.
Shantanu Deo: Yeah, yeah, exactly, that’s exactly what it ends up being. And so far we have had reasonable success with that, especially I think if you look at it from a maintenance perspective. On an ongoing basis, whenever marketing decides to come up with another category of users, it’s very easy to just add that new user group and just reindex and search and you don’t need to touch any code at all. Whereas if you had gone by in the traditional sense, you would have to recode, retest, and go through a lot of expensive testing iterations. So this is avoided by using search instead.
Michael Coté: Because really, and I am always bad with terminology, but really what you do, you are really updating the corpus or whatever of stuff. You are updating all the stuff that’s being searched over, not really updating the application. So since you are not touching the code, you don’t really need to change it.
Shantanu Deo: Not touching the code, exactly. So then the upgrades are almost instantaneous.
Michael Coté: So what does – to kind of borrow a term – what does the database look like? I mean, what is that corpus or body of text, like how are you managing that?
Shantanu Deo: Well, it’s basically a listing of flat files where all URLs are just listed one after the other, and the encoding scheme captures the structure within that, that the user actually sees. But as far as Solr is concerned or search is concerned, it doesn’t care.
Michael Coté: Oh, that makes it super easy then, because you are just dealing with files.
Shantanu Deo: Yeah, one file actually with everything in it. The encoding is then interpreted by — so Solr acts as the filter for making sure that you see only the URLs that you are meant to see and that the business wants you to see. And then the encoding kicks in, like a small thin layer in between, after Solr comes with results, that then builds the hierarchy based on that encoding scheme and you get to see the proper hierarchical structure.
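[Editor's sketch: the actual encoding scheme is not described in the episode, so the following assumes a simple made-up one in which each flat record carries a slash-separated path of navigation labels. It shows how a thin layer could rebuild the hierarchy from the flat results that come back from the search.]

```python
def build_hierarchy(records):
    """Rebuild a nested navigation tree from flat (encoded_path, url) pairs.

    The "Label/Sublabel" path encoding is an assumption for illustration.
    """
    root = {"url": None, "children": {}}
    # Sorting puts parents before their children, though the loop below
    # also tolerates a missing parent by creating a placeholder node.
    for encoded_path, url in sorted(records):
        node = root
        for label in encoded_path.split("/"):
            node = node["children"].setdefault(
                label, {"url": None, "children": {}}
            )
        node["url"] = url
    return root["children"]


# Flat results as they might come back after filtering (hypothetical data).
tree = build_hierarchy([
    ("Shop/Phones", "/shop/phones"),
    ("Shop", "/shop"),
    ("Support", "/support"),
])
```

[A placeholder node with no URL appears if a child survives the filter but its parent does not; a real implementation would need a policy for that case.]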
Michael Coté: Right, right. No, that’s really, as I said, a really interesting use of doing search there. Like instead of having to update a database or do all this other stuff, you just kind of — you write your system such that Solr or the search middleware or whatever it may be, Solr in this case is the thing that does all of that filtering and stuff for you. I mean, I guess at some point in the application there are sort of queries being constructed that are sent to Solr. It’s not like it has magic voodoo that figures out who the user is.
Shantanu Deo: No, it’s not, of course not. I mean, there is some logic that takes place, and that has to take place, that’s the minimal to figure out who you are and all that stuff.
But then it’s just a simple query that — yeah, it’s just a query of, hey, I am a user who belongs to user groups A, B, and C, what do I see? And that’s all pretty simple.
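[Editor's sketch: the "I belong to groups A, B, and C, what do I see?" query might translate into Solr filter queries roughly like this. The field names show_groups and hide_groups are assumptions, not the real schema; the q, fq, rows, and wt parameters are standard Solr query parameters.]

```python
from urllib.parse import urlencode


def nav_query_params(user_groups):
    """Build Solr parameters selecting the URLs visible to these groups.

    show_groups / hide_groups are hypothetical multi-valued fields.
    """
    if not user_groups:
        raise ValueError("at least one user group is required")
    # Sort for a stable query string, which also helps any caching layer.
    groups = " OR ".join(f'"{g}"' for g in sorted(user_groups))
    return [
        ("q", "*:*"),
        ("fq", f"show_groups:({groups})"),   # must match a show group
        ("fq", f"-hide_groups:({groups})"),  # must not match any hide group
        ("rows", "500"),
        ("wt", "json"),
    ]


params = nav_query_params({"B", "A", "C"})
query_string = urlencode(params)
```

[The application only has to work out which groups the user is in; everything else is expressed as data in the index.]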
Michael Coté: Yeah, yeah. So I mean, you have a whole presentation at Lucene Revolution about this. I mean, are there some points of the presentation that we haven’t really hit on? We kind of went over a technical overview of what you guys are doing, but I mean, are there any sort of like lessons learned, or any sort of tips or advice that you are going to go over?
Shantanu Deo: Well, I haven’t actually started working, to be honest, on the presentation.
Michael Coté: Well, I will give you a secret and everyone who is listening, I have a presentation that’s due today and I am just now finishing it up. So the ideas have to gestate in your head and you have got to get them perfect and then it’s just a matter of wiring it up.
Shantanu Deo: Yeah. I think — well, one of the things that came up is, in any organization is, you have a lot of like testing infrastructure before stuff hits production. So you have various mappings to consider. The URL that you see in production is not necessarily the URL you would see in your FST or testing environment versus a dev environment versus a QA or a staging environment, what have you, like different orgs have different naming conventions.
So we had all of these, and to make matters a little bit more interesting, we had other groups within the company also using the fundamental data that we were also using, and then they had their own environments to work with.
So I think one of the struggles we had in trying to encapsulate all of this into one file was that we missed out on these other environments. So what do we do with that?
I mean, there are ways to get around it — I mean, this has nothing to do with search, but this is just a practical reality that we encounter at least, to figure out — there is no easy way to get all of that other than somehow coming up with some automated, not exactly fully automated, but some way of replacing your current domain with whatever your test domain has to be, in a consistent way, but do it for only a subset of URLs that you are not hosting.
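[Editor's sketch: one way to do the consistent domain replacement described here is a per-environment host map applied to each URL, leaving any host not in the map untouched. The host names below are placeholders, and which subset of URLs should be rewritten is an assumption; the episode does not spell out the rule.]

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_for_environment(url, host_map):
    """Swap the host of `url` if it appears in `host_map`, else return it as-is."""
    parts = urlsplit(url)
    new_host = host_map.get(parts.netloc)
    if new_host is None:
        return url  # not in the subset we rewrite
    return urlunsplit(parts._replace(netloc=new_host))


# Hypothetical mapping from production hosts to test hosts.
test_hosts = {"www.example.com": "test.example.com"}

rewritten = rewrite_for_environment("https://www.example.com/shop?x=1", test_hosts)
untouched = rewrite_for_environment("https://partner.example.org/a", test_hosts)
```

[Keeping one map per environment (dev, QA, staging) lets the same flat file of production URLs be reused everywhere.]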
Michael Coté: Yeah, yeah. I mean, I think what you are getting at is, in a large company or organization, there is sort of ownership of data and kind of control of access to that data. And then also, like you said, like the production URLs are different than the test URLs and everything. And so there is a fair amount of time you need to spend to make sure you have proper access to the data, and that as you move through the phases of production, the data is actually accurate. You don’t have like the hot live data all the time to play with.
Shantanu Deo: Yeah, exactly, yeah.
Michael Coté: I always find that when you get in these types of situations that’s where it’s easy for complexity to sneak in. So you sort of — you spend a lot of time keeping it simple. To use an old Mark Pilgrim quote, like a lot of effort went into making this effortless.
Shantanu Deo: Yeah. Very good! That’s a nice quote. But yeah, that’s exactly true, like we thought that we have solved the main problem, but that’s just half the battle, like you have to make sure that things actually work in all the other —
Michael Coté: Yeah. And then like you funnily said at the beginning, like at some point if you are successful at simplifying some process, people forget it’s there and they rediscover it, like that’s always a nice sign of success.
Well, great! I think in the time we had we actually got a good overview there. And it was — like I was saying towards the two-thirds in there, like I have been interested in the sort of search as middleware stuff examples that I have come across, and Solr is definitely like the technology I come across most often that fits into there.
So I appreciate you spending all this time to give us that overview of the talk you will be doing.
Shantanu Deo: Yeah. My pleasure!
Michael Coté: And yeah, it’s the Lucene Revolution Conference and it’s in San Francisco on May 25 and 26 if I recall. And obviously you will be there, unless you get one of your robot friends to come and give the presentation for you. All right!
Shantanu Deo: Yeah. I look forward to it.
Michael Coté: Yeah, definitely. Well, thanks again, and thanks to everyone for listening. And we will talk to you next time.
Disclosure: Lucid Imagination sponsored this episode and is a client.