Had the pleasure of speaking with Splunk’s CEO Michael Baum this morning for the second time. For James and Cote, it was the first – and it might be the first time in weeks that we’ve all been on the phone at the same time. The last time I spoke with the Splunk gang was in July of last year, and a lot’s changed since then – most notably last week’s introduction of Splunk Base, which was covered by /. here. I thought the technology was interesting last summer, and it’s become only more so since then.
Following that conversation, some of the Splunk folks were quite considerate in getting me a Gentoo friendly version. While it’s not in Portage (the Gentoo package management system), folks like Christina Noren and Brad Hall gave me a lot of support in adding NPTL support through the Gentoo USE flag system, etc. While the effort was much appreciated, it took me weeks longer than I expected to get the application installed – through no fault of Splunk’s, I had other things wrong with the box – and because of those problems, it rarely worked.
A couple of weeks ago, however, I scheduled another Splunk briefing and removed my existing install and started from scratch. This time, given the fact that the box had been more or less fallow for a couple of weeks and didn’t have 25 different things running on it, everything went fine. Splunk was up and running, and to give it some fodder, I had it chew on the mail, FTP, and Apache logs from our production server that I’d pulled down.
It crawled the various logs pretty quickly – as we’re talking data volumes in the hundreds of MB range rather than the GBs Splunk’s capable of handling – and then I was presented with a very nice, Ajax style interface (John Vey’s work, perhaps?) to my various logs. After playing around a bit, I quickly ran out of steam – what would I Splunk for next? It was like being asked to test Google: search engines are a lot easier to use when you actually have something to look for.
But while on the call today, I remembered that we did in fact have a problem that our logs might be able to shed light on. A couple of months back, our Movable Type instance decided to quit sending us email notifications of comments and trackbacks. This was not attributable to any obvious product upgrade or plugin installation, and given the fact that the comments were available via the standard MT interface and my comments feed, I didn’t really bother looking into it. SSHing into the server and grepping through random files seemed like too much of a hassle.
Enter Splunk; for this exercise, I pulled down the latest product log files via FTP, but have since set up a cron job to do that regularly. From there, I manually uploaded them to Splunk in a process quite similar to the one required to get the above screenshot into Flickr.
From there, I clicked on the mail log file and searched for mt, assuming – correctly, as it turned out – that Movable Type log events would contain references to mt*.* files. Sure enough, I came across entries containing both mt-comments.cgi and sendmail. I then pulled out what I’m guessing is an event ID, and searched on that. That returned me the results pictured (you’ll have to look at the full size to read it, sorry). By themselves, they solve nothing; near as I can determine, they tell me that MT is triggering sendmail successfully, that it’s passing my email, and that sendmail is for some reason transmitting to 127.0.0.1 (localhost).
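For the grep-inclined, here is a rough Python sketch of the same two-step search: find the lines that mention mt-comments.cgi, pull out what looks like the event ID, then collect every line that shares that ID. To be clear, this is not how Splunk does it. The sample log lines, the host name, the IDs, and the regular expression are all made up for illustration, and I’m assuming a syslog-style layout in which the ID follows `sendmail[PID]:`. Real mail logs may well look different.

```python
import re

# Hypothetical sample lines, invented for illustration; real mail logs will differ.
SAMPLE_LOG = [
    "Mar  1 10:15:02 web1 sendmail[4321]: k21AF2xx004321: from=apache, ctladdr=mt-comments.cgi, size=512",
    "Mar  1 10:15:02 web1 sendmail[4321]: k21AF2xx004321: to=me@example.com, relay=[127.0.0.1], stat=Sent",
    "Mar  1 10:16:40 web1 sendmail[4400]: k21AGexx004400: from=root, size=300",
]

# Assumed layout: "... sendmail[PID]: EVENTID: details". The pattern is a guess.
EVENT_ID = re.compile(r"sendmail\[\d+\]: (\w+):")


def event_id(line):
    """Return the event ID found in a log line, or None if there is none."""
    match = EVENT_ID.search(line)
    return match.group(1) if match else None


def related_events(lines, needle="mt-comments.cgi"):
    """Return every line that shares an event ID with a line containing needle."""
    # Step 1: collect the IDs of the lines that mention the needle.
    wanted = {event_id(line) for line in lines if needle in line}
    wanted.discard(None)
    # Step 2: keep every line carrying one of those IDs, in original order.
    return [line for line in lines if event_id(line) in wanted]


if __name__ == "__main__":
    for line in related_events(SAMPLE_LOG):
        print(line)
```

With the made-up sample above, this prints the two lines that share the first ID and skips the unrelated third one, which is roughly what the second search gave me.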
But while it doesn’t solve my problem, I know more than I did before:
- I know that MT can find sendmail
- I know that sendmail is receiving the request and responding
This information would seem to indicate that if the problem is with MT, it’s that it’s not passing something that sendmail needs – and that the more likely culprit is my sendmail configuration.
Seems trivial, perhaps, but I can tell you that I spent hours in my SI days tracking down problems caused by just such trivial information. The other reaction I’ve gotten from a couple of people is that they’ll just stick w/ grep, because after all the information is contained within log files. I don’t agree. I can crawl through a directory of my mail files, but I’d much rather do it with Gmail – the Ajax features, the search, the filtering make me far more productive. As my colleague put it today, Splunk and its competitors are nothing short of productivity tools for sysadmins.
What would I like to see next? As I told Michael, some hand holding for non-log file veterans would be welcome, whether it’s provided by Splunk or layered on by third parties. It’d also be nice if Splunk had some built in intelligence about standard log locations for some of the more popular applications and offered to include them by default. I’m also very interested to see where Splunk Base goes, as the best practices, shared log analyses and so on that collaboration makes possible could be hugely valuable. But all in all, I’m pretty impressed with Splunk. I need more problems to test it on, but it seems pretty useful so far. Any of you guys using it and/or testing it?
Disclaimer: Neither Six Apart (Movable Type) nor Splunk is a RedMonk client, although we are paying users of MT.