Hear what’s in the application stack for a high-scale, college sports-team web application service. I always like hearing about the stack “metal to glass” that a team uses to deliver their software, and we get that here with Serge Knystautas from PrestoSports.
Update: the first version of the file had the audio split between the right and left, which is terrible. I’ve updated the file so that the audio is balanced, or, “normal.” It should be fixed in the podcast feed as well. Apologies for the error!
It’s March Madness, which means people in the US freak out about basketball and want to interact with the teams and their data all the time. This spike in traffic for the duration of the madness would seem to make a perfect case study of applying cloud computing, if not just high-scale web operations and application development. Here, we’ll talk about one such application (from PrestoSports): how it was built, the traffic demands, and what exactly makes the stack “cloud” versus just another web app.
- PrestoSports – we go over what the application does.
- Integrating with the social site du jour.
- The need to federate data over web sites.
- Getting into the stack: the key driver was scalability and performance. Some cloud talk.
- Software used: a proprietary CMS, Velocity, lots of Jakarta commons stuff.
- What are you using to monitor and manage it? Nagios, plus some custom JVM checks.
- Working with failure – how that’s built into the system.
- What’s the scale of people working on this: 5 customer support people, 2 web tier folks, 5-6 Java developers, with 1-2 on third level support.
- How does Java work out here?
As usual with these un-sponsored episodes, I haven’t spent time to clean up the transcript. If you see us saying something crazy, check the original audio first. There are time-codes where there were transcription problems.
Michael Coté: Well, hello everybody! It’s another edition of ‘make all’, the podcast about fun and interesting things with those damn computers.
As always, this is your host Michael Coté and I have got a guest calling in from remote. This is going to be one of our sort of, for lack of a better phrase, a stack profiles sort of pieces. I had someone over at the ASF, the Apache Software Foundation, tell me about kind of an interesting application using a whole bunch of Apache stuff. So I thought it would be a good chance just to kind of see what one team has used to build the application they are delivering, and to get to that, why don’t you introduce yourself guest.
Serge Knystautas: My name is Serge Knystautas; I am the Founder of PrestoSports, which hosts college sports websites.
Michael Coté: Can you give us an idea of what the websites or the apps look like, like what are they helping people do?
Serge Knystautas: Well, they are the official athletic websites for a lot of college teams, so it’s very timely with the March Madness that’s going on. Butler and VCU, the two Cinderella teams in the Final Four, have websites with us, which helps any of those fans or anybody who wants to know more about these unknowns in the college sports space. Who are they? Who are the players? What are their schedules, stats, stories? How did the coach named Shaka get his name? Everything that you want to know about these teams, the schools can use our technology to publish that information out for web and mobile and however fans want to get to it.
Michael Coté: I mean, can you give us a sense of sort of kind of both ends of the updating the site and people browsing the site? I mean, starting out with sort of like, for lack of a better phrase, what’s the updatability of this site? Like I imagine it’s a lot more than as we used to call brochureware. That like you said, there’s schedules in there and I imagine people are going in there and making it kind of a lively living site.
Serge Knystautas: Yeah. I mean, I think — we call it a Content Management System tailored to sports information. So a lot of it is just ad hoc Content Management System of uploading images and HTML, but then there is a lot of workflow around sports. Both there’s sports specific information, like I mentioned about player bios and schedules and statistics, stuff that you don’t find in most websites.
But then there is also a whole lot of workflow. A lot of the information that they are putting in is happening 30 times in the season and it’s the same kind of, prep my rosters, and then during the game these activities are going on, and then afterwards I have got to email this to the media and I have got to post this to my website and I send this to my opponent. So there is a lot of workflow that they are used to that we build into it.
So it’s both the types of information that they are updating and then how they go about doing it that’s tailored on the entering side.
On the fan side, it is a lot of learning what sports fans really like to do on a website. The information they look for is certainly schedules, and player information is one of the biggest things, but what’s becoming more and more prevalent is video clips, so they can go in and see highlights of what happened recently.
So that’s an emerging thing we are integrating into websites. Certainly everybody has always liked to watch games, but clips are really taking off as a way that fans can interact on these websites and get to that information.
Most of our customers are sports teams that don’t show up on CBS and ESPN, so this is really a way to get all types of information out to their fans. You may have gone to a smaller school on the East Coast and you have now got a job in California and you haven’t been able to follow your team even though you are a rabid basketball fan. So this now lets you get access to all sorts of information. So that’s who we go after.
It can be alums. It can be other prospective student athletes. It can be parents, friends, all sorts of different types of people who follow these teams.
Michael Coté: So it seems like basically the teams are using it for most of their external communication as mediated through the web, if you will. And kind of like, like you were saying, for sending emails to media and press; it’s kind of just like their platform for communication.
And also like — I don’t know, I guess people call things like this social nowadays, but just that community around all the fans and the sports and everything, which I imagine is, depending on what’s going on, there’s sort of like a pretty high demand for lots of current updated news from many different people.
Serge Knystautas: Oh yeah. I mean, our customers are people who five minutes after the game is over are getting angry phone calls from fans saying, why don’t I know the score of the game? So it’s very timely information. And as you say, it’s external communication. So when we first started a number of years ago it was primarily web, but now it’s mobile, now we push that out to email, text message alerts, update, post to Facebook, tweets of course. So it has become sort of like this Content Management System which was just driving the website, it’s now driving the external communication.
Michael Coté: Well, that raises a point that I always kind of see anecdotally and theoretically, so it would be interesting to hear what you guys are finding, but it seems like increasingly if you have sort of like a web presence, if you will, or a mobile presence, that you are sort of continuously pressured to integrate with these external things like Facebook or Twitter and things like that. It almost seems like integrating with services has become an important feature for people.
I wonder if you have got — like what you are seeing as far as the pace, the fan base in your case or the user base, like what pressures are they putting on people to like, you really need to integrate with this new social site that I am interested in?
Serge Knystautas: Well, yeah, I think it’s huge. I mean, in fact, just from the developer, we have come up with a theme for a year of what we are working towards, and last year it was integration, because, gosh, it was Twitter, it was Facebook, it was photo store service, it was — I can’t even think of — there are like a half dozen other types of services like that where we are pushing out information.
It’s not so much that the website traffic is dropping, it’s just that fans have so many or people in general, consumers in general, just have so many different ways that they can be getting information, and the type of interaction you spend on a website is different than when you are watching tweets fly by or when you are browsing Facebook.
So it’s something that’s challenging for us, because you are not literally posting the same sentence as a text alert on the website and Facebook and Twitter; it’s all different types of engagement.
So you are trying to drive engagement and fan interaction on Facebook, but then you are just sending out little, as fast as you can, alerts on tweets and text messages, and then you are giving a lot of all that background and history and sort of digging into like, so how has this player played over the past three years against different types of teams and research information on the website? So it’s different types of information.
Michael Coté: I mean, it seems like there is this interesting need to I guess sort of like federate your content, and not necessarily like federate the content like in a whole piece, but kind of just federate notifications about it, but then even in some mediums, like to have sort of standalone content, like I mean just tweeting the score of a game. That’s a nice piece of content there and it’s not like you really need to drive someone to a link just to see a score.
Also, I remember seeing some headline this morning that, I forget what company it was, but there was some major entertainment company where they have sort of — and it’s almost like I am putting it in air quotes, but they had lost traffic to Facebook. It’s kind of like what you are saying is their audience is still their audience, in the sense that they are paying attention to their content and whatever, but they are kind of in this other island of the web rather than being on their site, which is an interesting change that has gone on in the past few years.
Serge Knystautas: Yeah. And sports information is perhaps the most timely information you have. I mean, aside from maybe ‘American Idol’ or something else, you are not having as many spoiler issues as you do with sports.
The favorite story I like to have is, one of my employees two years ago proposed to his wife, but he did not call any of his family that night to tell them about the engagement because he would have found out the score of that Arizona basketball game and so he had to wait until the next day so he could watch the game and then call all his family to tell them about his engagement.
So it’s something where — yeah, you have got to get your message out. It’s not necessarily proprietary, but you want people to find out as fast as possible, otherwise they are going to go to another source for it.
Michael Coté: Yeah, yeah, definitely. So now that we have got that kind of application down, let’s get into the meat of the discussion and kind of talk about the software and the other parts of the stack that you are using for this — I don’t know, like you were saying, this kind of like hyped up CMS system.
I mean, I guess you can figure out where you want to start as far as the, from the metal to the glass or whatever, but like what did you guys kind of start with when — what did you guys start with when you were building this platform, like what language did you write it in, first off?
Serge Knystautas: The application is almost all written in Java. We used a lot of XML and MySQL. Everything is on Linux.
I think the key driver that we had designing this system in the first place was scalability and performance. In the sense that, whether you are updating the score about your kid’s ten-year-old soccer game or the score of VCU making it into the Final Four, it’s a relatively limited or predictable number of updates coming in, but then the number of people who are going to see it can scale astronomically.
So it was really a function of having a predictable read/write interaction and making sure that we could scale that predictably, and then having a system that could scale on a moment’s notice for all the fans, since that was very unknown and uncontrollable.
Michael Coté: Right, right. And when did you guys start out building this with those concerns?
Serge Knystautas: I would say 2004-2005 was when we really began it.
Michael Coté: Okay. So quite some time ago. I mean, I was thinking from — thinking about the sort of bio of the company that you guys kind of — you guys were sort of a pre-cloud sort of thing, if you will. So I imagine you have seen a lot of — I don’t know, it would be interesting to see your take on how those problems have been solved or not solved with all this whacky cloud stuff?
Serge Knystautas: Well, yeah, it was before cloud became a real common term. I would say that we are a typical web app, and so we are aware of what we need to be able to scale and not scale.
So I think cloud computing has greatly solved the fan side of things, or at least it’s making it more manageable, and there are some parts of cloud talk that are useful to us and then there are some parts of cloud talk that are not.
But it was originally designed without knowing that we could rely on Amazon or Rackspace’s on-demand virtual computing or anything, anything that’s sort of the core of cloud right now.
Michael Coté: Yeah, yeah. I mean, you guys must have had discussions with your hosters to add in more hardware and things like that.
Serge Knystautas: Yeah, it used to be much harder to scale up like that, or just so much more system administrator time and it was designed to make it easy, but it was still laborious to provision the hardware, configure it, all that kind of stuff.
Michael Coté: Yeah. I mean, I spend a lot of time talking with people about the sort of operations side of cloud stuff, and I think that’s one thing that kind of gets overlooked because that’s part of what cloud is, is that there is a great amount of innovation in automating things, like you are saying, just spending less time scaling up.
Because as with any new technology buzz phrase, there is always the crowd that talks about how they have been doing that for years or whatever, and there is a lot of the practices in automation that seems — and driven by the technology out there, that seems pretty generally new that you get with cloud that wasn’t so available to people in the past.
So I mean, looking at the software, I mean I know you guys use a lot of Apache projects and things like that. I mean, can you give us a sense of what you guys have built this system on?
Serge Knystautas: Yeah. Well, the Content Management System is proprietary. Velocity, which is an Apache project, is our rendering layer; there are probably a dozen or so different Apache Commons libraries, and I am trying to think of some of the others, there are a lot of little projects in Apache that we are using. For the core updating system, there is a lot of Spring/Hibernate, kind of a standard web framework that we are using.
So it’s a lot of custom code using a lot of what Java plus all the Apache projects lets you do in terms of building a really robust platform. I mean, I can say our rendering system, which is how we generate all these fan pages, gets its performance from a lot of handwritten Java code that uses these different libraries, as opposed to a single framework that does it for us, because we really want to be able to control how all that data is handled and where the caching happens.
And yeah, it’s a lot of tinkering at the very low level of this system, because that’s very costly for us if we get that wrong.
Michael Coté: I mean, can we dig into that a little bit more, because I think that is one of the interesting architectural decisions or more difficult ones that you are always making is to build something your own or get something, as we used to say, off the shelf or off the web now I guess.
And like you guys, like you are saying, like you have a proprietary CMS and a lot of the processing you do is proprietary. So like when did you sort of cross the line where you thought like, well, screw it, we are going to have to write our own stuff? Like how did that architectural decision get made in your head?
Serge Knystautas: There are a number of different web frameworks but a lot of them are geared towards what we call the admin or the updating side of things, whether it was Struts or some of the other frameworks that had to evolve, they are really geared towards more form rendering type of interaction, update this form, save this data in the database. Our system is all about basically taking — building a web page that could involve 40 different regularly updating information.
One of the challenges we had with ours is, we really had no ability to rely on a CDN to sort of push out information and just keep static copies of it, because our information is changing by the minute. So it’s a question that we were going to have to regenerate these web pages all the time, and it was just — back then, whether it was using edge servers or — anyways, it just wasn’t going to work for us. We had to be able to figure out how to generate those web pages very quickly based on our custom types of data.
So that part was done pretty early on, and then we went with more simple web frameworks, Spring and Hibernate, not simple, but kind of out of the box ways for how we are having our users update that information.
So that was kind of the bridges. Where there wasn’t any really great solutions, and it was going to be very costly performance issues, we decided to build that ourselves. And where it was, well, we are going to have hundreds of screens of the way users update information and we just want to make it easy for developers to turn those out quickly and tweak it, we will use kind of a more of an out of box framework to do that.
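To make the "control where the caching happens" idea concrete, here is a minimal sketch in Java of a per-fragment cache with a short time-to-live, so a page built from many frequently changing pieces re-renders only the pieces that have gone stale. This is not PrestoSports' code: the class name `FragmentCache`, the `get` signature, and the injected clock are all assumptions made for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the interview.
final class FragmentCache {
    private static final class Entry {
        final String html;
        final long expiresAtMillis;

        Entry(String html, long expiresAtMillis) {
            this.html = html;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<String, Entry> entries = new ConcurrentHashMap<>();
    // The clock is injected so expiry can be exercised without sleeping.
    private final LongSupplier clock;

    FragmentCache(LongSupplier clock) {
        this.clock = clock;
    }

    /** Returns the cached fragment, re-rendering it only once its time-to-live has passed. */
    String get(String key, long ttlMillis, Supplier<String> renderer) {
        long now = clock.getAsLong();
        // compute() runs atomically per key, so concurrent readers of a stale
        // fragment trigger one render instead of many. The renderer must not
        // touch this cache itself.
        Entry entry = entries.compute(key, (k, old) ->
                (old != null && now < old.expiresAtMillis)
                        ? old
                        : new Entry(renderer.get(), now + ttlMillis));
        return entry.html;
    }
}
```

A caller might wrap each piece of a page, for example `cache.get("scoreboard", 5_000, () -> renderScoreboard())` with `System::currentTimeMillis` as the clock, where `renderScoreboard` is a stand-in for whatever produces the HTML. Choosing a time-to-live per fragment is one way to trade freshness against render cost; the interview does not say which policy the real system uses.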
Michael Coté: Right, right. And how often would you say you guys go in and sort of implement features or do coding that’s based around just scaling up in performance versus sort of adding a new feature or something like that, irrespective of performance concerns?
Serge Knystautas: It used to be, I would say every couple of weeks we were rolling out new stuff to tweak. Where is some piece of data getting cached? The types of data we would cache. Where are we accessing that piece of data?
But over the past while, partly because we kind of stabilized some of that stuff and sort of made some stuff easily configurable, we probably only do things every two or three months now where we are actually making changes. Like some of the stuff we are looking at is, we have really adopted cloud stuff in the past two years, and now, when I am in a situation where we are turning on 20 servers for a weekend, there is almost more configuration management that I need to get in there, and I have got to make that stuff more manageable.
So I may be tweaking it in some ways towards slightly slower performance than if I was actually just building this on dedicated hardware, but it’s because I want to be able to just real easily scale up and not have to worry about configuring each setting on every new Java VM that gets going. I am going to make this simpler and just say, okay, we will just throw a little bit more hardware at it, because we are using the cloud here just to be able to handle this huge spike that’s coming this weekend.
Michael Coté: Yeah. I mean, that’s interesting, there is like the sort of body of thought called DevOps or whatever, where there is people kind of going to that, that similar thing where in order to have the performance and scale up that you want based on cloud, you would need to do a bit of, systems programming is wrong, but you have to do configuration management programming, which I think is a — it’s not really a thing most programmers worry about. But I mean it’s sort of a — it is that tradeoff you make, like you are not working on a user facing feature necessarily, you are just working on tweaking all your infrastructure and everything.
Serge Knystautas: Yeah. And sort of going into that — this was — as we were getting PrestoSports going, I was still doing some tech consulting, and what I did for the most part was helping people with performance issues. And the biggest problem that I always found was people didn’t know where the performance was dying.
I mean, there are just so many layers you can have just within a JVM of anything from heap space to I/O to like little thread contention and all sorts of things. You don’t know why things are slowing down or is this at the app layer, is this at the database layer or is this some caching layer.
So yeah, related to DevOps, we are always building more and more ways to really see, okay, we have changed this cache but is that — are we just wasting our time changing that code or is the problem someplace else, or this is only a problem because we made a mistake someplace else?
Michael Coté: And can you kind of dig into the tools and things that you guys are using or building to do that? I mean, what are you using to sort of like monitor things and then also when you do detect like you need those 20 servers, like what do you use to spin 20 more nodes up or whatever, what’s the kind of management story?
Serge Knystautas: Well, it’s a little I would say outdated. We are using Nagios for system monitoring, but then what we have done is built up anywhere from 10-15 specific JVM checks. So we are within the application checking the number of active HTTP requests that are being handled, the number of requests that are being handled per second, down to system load, and all sorts of other nitty-gritty, things that we know that our system fails when these numbers start going up.
So it’s sort of — it’s one of these things where as we have come to know our application and as we are not making as substantive performance changes, we kind of know, okay, this is when a JVM is going to start breaking and we need a new instance of a server.
So we have used Nagios to call on a regular — every couple of minutes these application specific performance metrics and then those get reported, and then we have a mixture of dedicated servers and cloud instances, so we sort of monitor in much more detail the dedicated instances, because they are all getting kind of an equal amount.
If we have 20 servers going and five of them are starting to die, it’s a pretty good chance that the other 15 are dying as well.
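As a rough illustration of an application-level check like the ones described here, the sketch below counts in-flight HTTP requests and maps the count onto the standard Nagios plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL). The class name, the thresholds, and the wording of the status line are assumptions; the interview does not describe how the real checks are wired into Nagios.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of one JVM-level check; names and thresholds are made up.
final class ActiveRequestCheck {
    // Exit codes defined by the Nagios plugin convention.
    static final int OK = 0;
    static final int WARNING = 1;
    static final int CRITICAL = 2;

    private final AtomicInteger active = new AtomicInteger();
    private final int warnAt;
    private final int critAt;

    ActiveRequestCheck(int warnAt, int critAt) {
        this.warnAt = warnAt;
        this.critAt = critAt;
    }

    // Call these from a servlet filter (or similar) around each request.
    void requestStarted() { active.incrementAndGet(); }
    void requestFinished() { active.decrementAndGet(); }

    /** Maps a request count onto a Nagios status code. */
    int classify(int count) {
        if (count >= critAt) return CRITICAL;
        if (count >= warnAt) return WARNING;
        return OK;
    }

    int status() {
        return classify(active.get());
    }

    /** One-line summary in the usual "NAME STATUS - detail" plugin style. */
    String statusLine() {
        String[] names = {"OK", "WARNING", "CRITICAL"};
        int count = active.get(); // read once so the label and the number agree
        return "ACTIVE_REQUESTS " + names[classify(count)] + " - " + count + " active HTTP requests";
    }
}
```

A small script polled by Nagios every couple of minutes could fetch `statusLine()` and `status()` from an internal URL or over JMX, print the line, and exit with the code. Which transport to use is a design choice left open here.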
And then we have other, sort of cruder instruments just to tell whether a server is dying or not. And actually some of that’s now getting put within the actual JVMs themselves. They are starting to do more self-diagnostics to say, hey, this is not working.
So for instance, a couple of weeks ago we noticed some of our cloud instances were going into a read only file system mode, something was getting corrupted somewhere. The problem that was happening is, it was still handling a lot of requests.
So from our load balancer it was like, okay, I will keep sending more requests to that instance, and we only start getting these weird support requests bubbling up, that we are getting invalid images every 20 — every 20 images on our site are showing up as an invalid image. And that’s because, well, two of our cloud instances had basically broken themselves and so we are going into having the JVM write code to detect certain problems that have happened. And then if these things are noticed, then it’s like, okay, I am going to shut myself down, so I stop handling any issues, and then the system administrator will know something broke on this server or cloud instance and let’s go back and fix it.
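The self-diagnostic described here could look something like the following sketch: the JVM periodically tries to create and delete a small probe file, and once that fails it marks itself unhealthy so a load balancer health check can stop routing traffic to it. The class `DiskSelfCheck` and its methods are hypothetical, and the latch-until-an-operator-intervenes behavior is an assumption drawn from the "shut myself down" description.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; the real checks are not described in detail in the interview.
final class DiskSelfCheck {
    private volatile boolean healthy = true;

    /** Returns true if a small probe file can be created and deleted in dir. */
    static boolean canWrite(Path dir) {
        try {
            Path probe = Files.createTempFile(dir, "health", ".probe");
            Files.delete(probe);
            return true;
        } catch (IOException | RuntimeException e) {
            // A read-only or missing file system surfaces as an IOException here.
            return false;
        }
    }

    /** Runs one probe; a failure latches the instance into the unhealthy state. */
    void probe(Path dir) {
        if (!canWrite(dir)) {
            healthy = false;
        }
    }

    /** A health endpoint polled by the load balancer could return this value. */
    boolean isHealthy() {
        return healthy;
    }
}
```

Running `probe` from a scheduled task every minute or so, against the directory where uploaded images are stored, would catch the read-only case before fans see broken images. The interval and directory are choices for whoever operates the system.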
Michael Coté: Yeah. I mean, that’s a — there is sort of this practice in cloud development and just website development of building failure into the system, if you will. It sounds like —
Serge Knystautas: Yeah.
Michael Coté: Or not building failure, I guess developers already do that, because as I recall they are the ones who write the bugs. But I guess it’s expecting and working with failure, not building it into the system. And it is, like I mean, when you have clusters or clouds or whatever, you have multiple nodes, like it’s a lot easier to just kind of let them error out. If there is a certain threshold they reach, then you go fix them. But that’s always an interesting story to hear.
And like I mean, so when you guys go to diagnose these problem nodes, like do you sort of just like blow them away and restart them, or do you kind of spend time figuring out what went wrong and fixing what went wrong, or how do you triage that kind of thing?
Serge Knystautas: That’s a good question. I mean, it sort of depends on the type of error. There is certainly a group of known errors that happen. Some of them are infrequent and so we don’t do too much except for a system administrator who knows to either shut down that cloud instance and restart it, or contact our hosting facility because that’s not fixing it, or run fsck or something. And then there are other, more frequent errors. Hopefully you realize once this has happened enough times that you are going to build some way to let the system automatically correct itself.
Michael Coté: Yeah.
Serge Knystautas: So it’s a mixture. And it’s also a mixture of recognizing whether this is just some customer using it wrong versus there is actually something that’s going wrong.
Actually, like on that image problem, we had these cloud instances go into read only mode, so we were getting invalid images. In the past, historically, that has always been the user uploading some PSD that they had renamed to JPEG, and so that was why it wasn’t working.
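One way to tell the "customer renamed a PSD" case apart from a genuine server fault is to look at the first bytes of the upload instead of trusting the file extension. The sketch below relies on two well-known signatures: JPEG data begins with the bytes FF D8 FF, and Photoshop files begin with the ASCII characters 8BPS. The class and method names are hypothetical, and nothing in the interview says PrestoSports does it this way.

```java
import java.util.Locale;

// Hypothetical sketch: classify an upload by its leading bytes.
final class ImageSniffer {
    enum Kind { JPEG, PSD, UNKNOWN }

    /** Classifies a file from its first few bytes. */
    static Kind sniff(byte[] head) {
        // JPEG: start-of-image marker FF D8, followed by another FF marker byte.
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xFF
                && (head[1] & 0xFF) == 0xD8
                && (head[2] & 0xFF) == 0xFF) {
            return Kind.JPEG;
        }
        // Photoshop documents start with the ASCII signature "8BPS".
        if (head.length >= 4
                && head[0] == '8' && head[1] == 'B'
                && head[2] == 'P' && head[3] == 'S') {
            return Kind.PSD;
        }
        return Kind.UNKNOWN;
    }

    /** True when the name claims JPEG but the content is something else. */
    static boolean misnamedJpeg(String fileName, byte[] head) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        boolean claimsJpeg = lower.endsWith(".jpg") || lower.endsWith(".jpeg");
        return claimsJpeg && sniff(head) != Kind.JPEG;
    }
}
```

Rejecting or flagging a misnamed file at upload time gives support a clear message for the customer, so that an "invalid image" report that does not match this pattern is more likely to point at the servers.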
Michael Coté: Right.
Serge Knystautas: So it’s a mixture. That’s what you get paid to do is spend time figuring out those problems.
Michael Coté: Yeah. Like bumping up against the business a little bit more like, like what’s the scale of like the support and development team that you guys have to support all of this?
Serge Knystautas: Well, our customer support team, which is people on call working with customers primarily, is about five people that rotate through the week. We have a web engineering group, which is kind of like tier two, and that is about two or three people who handle more complicated HTML or jQuery or Velocity templating issues.
And then our development team is, we have actually got five or six Java developers. Typically only one or two of them handle specific issues, sometimes they will get involved if that fourth Java developer just wrote this new feature and now it’s breaking something else and they will get involved to help out with support. But that’s the rough idea.
So we have a handful that work directly with the customer and then it gets escalated, whether it’s just a web engineer versus actual Java programming problem.
Michael Coté: Right, right. So speaking of Java, that’s another thing I was curious before we wrap up to hear about is, I mean, I don’t know, I mean, it’s kind of anecdotal but I feel like I don’t encounter Java as a web application framework as much as I encounter other things, and so I wonder how Java has been working out for this sort of high scale web application.
Serge Knystautas: It has been good. I mean, it gives us a lot of flexibility. I think if I was building this starting a year or two ago, I may have looked a little bit more seriously at some of these other platforms. And certainly with the — in some ways with cloud computing and just the more prevalence of easy access to a lot of hardware you can provision on demand, I may have gone with a platform where you don’t get as much nitty-gritty control, but that’s not why you are trying to optimize anymore, because I think as you are getting to your — you can spend a lot of time paying expensive developers to fix a problem or you can just add a couple of more servers.
It has worked out well for us, but that’s probably because I have been writing Java for 15 years and hiring a lot of good people who can work with this language. We know it pretty well. Eventually we will be dinosaurs and have to move on to the next language, but for now it’s working out for us well.
Michael Coté: Right, right, right. Yeah. But I mean — yeah, I mean to that and I mean if you have good Java developers it’s not really — it would be kind of crazy to start something new, especially with how much you have built into the application already.
But I mean, are there other languages that you mix into there or do you try to just stick with Java?
Serge Knystautas: On the server side it has been pretty much all Java. I am trying to think if we have used anything else. Yeah, maybe a little bit, I don’t know, gosh, there’s Bash, and there’s certainly some XML and XPath and stuff like that, sort of standard language agnostic stuff.
Michael Coté: But pretty much you have been able to make Java work out wherever you needed to in the various layers of the app?
Serge Knystautas: Yeah, yeah, it has been doing well for us.
Michael Coté: So the last thing I was curious to hear about is, and that you are developing an external facing application that, like you said, people are very rabid to interact with. I mean, what kind of development process have you guys used over the years? Like is it sort of standard stuff, or do you kind of come up with your own process, like what are typical release cycles for you?
Serge Knystautas: Well, I would say we do very agile development. We do weekly releases. So we work out of Subversion, and on a weekly basis we cut a new version that goes to testing. And then it sits about a week in testing and then goes live to everybody.
But really we come from the school of thought of trying to do releases as fast as possible, just to limit the number of bugs and things that possibly could go wrong on each release. In some cases we create branches in Subversion for longer running development that’s going to be a very large new feature, and then we will figure out how to merge that back into the trunk.
But for the most part, as much as possible, we try to break up tasks into small increments and roll them out so we can get more customers working with it.
So there’s our production system and then there’s a test system, where — a test server that people test the version that’s about to go live. But then we have a number of development servers which allow, as every developer is making a commit, that other people, the non-technical staff can give feedback and realize where their requirements were wrong or how —
Michael Coté: So just sort of a demo instance running that people can look at.
Serge Knystautas: Yeah, yeah, definitely.
Michael Coté: Yeah. Well, that makes sense. I mean, I don’t want to take up too much of your time, but that’s the barrage of questions I was looking forward to peppering you with.
So yeah, if people are interested in more, I guess there’s prestosports.com, but are you on the crazy Twitter or anything like that? What do you want to tell people as far as touch points they can go to?
Serge Knystautas: I would tell them go to prestosports.com first, that you can link over. It has got links to our blog site, where you can read more about both technical and non-technical stuff that we are working on. And we are on Facebook and Twitter as well.
But I would go to prestosports.com first and you can find out a whole lot more about us, and as we talked about earlier, you can engage with us in whichever of the different ways you want.
Michael Coté: And there’s definitely — you can find the various colleges that are using PrestoSports stuff, but just kind of looking — on your website you have a few of them listed, so you can see it in action.
Serge Knystautas: Yeah, definitely.
Michael Coté: Well, great! Well, thanks, I appreciate you taking all the time at the conference there to just kind of go over what’s in your stack. That’s good stuff.
Serge Knystautas: Yeah. Thanks. I appreciate the interview. It was fun talking with you.