A couple of weeks ago, I had the privilege of visiting one of my old friends, John Arley Burns, at his work, ERF Enterprise Network Services. He’s the President and COO, which means, among many other concerns, that he makes sure everything (IT and otherwise) is up and running smoothly. They provide wireless broadband networks to banks. The banks use these networks to connect their WAN/LAN across all the branches. In rural areas, this can be a big deal if high-speed lines aren’t available. Though that’s not the case as much as it used to be, the reliable and secure networks that ERF ENS provides are a major selling point.
Systems Management Ethnography
But, I’m not here to talk about them (as much as Arley might want me to ;>) — that’s just an intro. The reason I went out to Taylor was to check out how they were using open source systems management software. As you’ve hopefully figured out by now, dear readers, I am quite the systems management enthusiast. So, seeing systems management in action was a real treat for me. I’ve worked on the other end — developing systems management software — for years, but it’s all too rare that I get to do some systems management ethnography.
Monitoring and Reporting
As we’ll see, their monitoring needs were taken care of by their open source systems management apps after a normal amount of configuration. The weak point of the “suite” they ended up with was reporting, that is, the task of using all the low-level data for trend analysis, proactive prevention, and improvement.
The Setup
ERF ENS’s network is composed of several kinds of nodes: devices that transmit, control, and encrypt traffic, plus the clients for the wireless network they provide to banks. Each of these devices has an SNMP agent on it and runs either an embedded OS or a *nix variant.
Their first interest is making sure that everything is up and running. Once those nodes go down, the banks lose connectivity. While they need tools to help with day-to-day incident management, they also need tools (reports, mostly) that show the big picture over many months, so they can track down long-term problems that are easy to lose sight of day-to-day.
What They’re Using
Cacti
Oddly enough, I hadn’t come across this yet. It just goes to show you how much there is out there. It’s a very suite-ish, PHP-based platform with plugins. That is, it’s a web-based portal that systems management data flows into and that provides the engine for driving monitoring.
The focus is more network monitoring than application monitoring, but at the low-level, it’s all just OIDs anyhow, whether they come from network devices or applications. You could just as easily monitor network uptime as application performance if the right OIDs were available.
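To make the “it’s all just OIDs” point concrete, here is a minimal Python sketch of what a monitoring tool does with raw polled values. The two OIDs are the standard MIB-II ones for sysUpTime and ifInOctets; the polled numbers, the interface index, and the helper names are made up for illustration, and no real SNMP library is involved. This is not how Cacti itself is implemented.

```python
# Standard MIB-II OIDs; the sample values below are invented for illustration.
SYS_UPTIME = "1.3.6.1.2.1.1.3.0"           # sysUpTime, in hundredths of a second
IF_IN_OCTETS_1 = "1.3.6.1.2.1.2.2.1.10.1"  # ifInOctets for interface index 1

COUNTER32_MODULUS = 2 ** 32  # Counter32 values wrap back to zero at 2^32


def uptime_seconds(timeticks):
    """Convert SNMP TimeTicks (1/100 s) to seconds."""
    return timeticks / 100.0


def counter_rate(previous, current, interval_seconds):
    """Per-second rate between two Counter32 samples, tolerating one wrap."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = (current - previous) % COUNTER32_MODULUS
    return delta / interval_seconds


# Two hypothetical polls of the same device, five minutes apart.
first_poll = {SYS_UPTIME: 8640000, IF_IN_OCTETS_1: 4294967000}
second_poll = {SYS_UPTIME: 8670000, IF_IN_OCTETS_1: 1499704}

interval = uptime_seconds(second_poll[SYS_UPTIME] - first_poll[SYS_UPTIME])
bytes_per_second = counter_rate(
    first_poll[IF_IN_OCTETS_1], second_poll[IF_IN_OCTETS_1], interval
)
```

Swap in an application’s OIDs and the arithmetic is the same, which is why the network-versus-application distinction matters so little at this level. A real poller would also have to notice reboots, since those reset the counters too.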
There are no transactions or any sort of “higher level” systems management concepts: it’s just numbers, graphs, and events, the kind of information that sysadmins need when they’re trying to diagnose a problem. Also of note is that Cacti is one of many RRDtool-based applications. RRDtool is used in most everything in open source systems management. It’d be interesting to see a Java port of it.
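If you haven’t run into RRDtool, the core idea is a fixed-size “round robin” archive: samples are consolidated (averaged, for example) into a set number of slots, and the oldest slot is overwritten when the archive is full, so the file never grows. The Python below is only a toy illustration of that idea. The class name and parameters are my own, and it is not RRDtool’s file format or API.

```python
from collections import deque


class RoundRobinArchive:
    """Toy fixed-size archive: averages N samples per slot, drops the oldest slot."""

    def __init__(self, slots, samples_per_slot):
        self.rows = deque(maxlen=slots)   # the oldest row falls off automatically
        self.samples_per_slot = samples_per_slot
        self._pending = []                # samples waiting to be consolidated

    def update(self, value):
        self._pending.append(value)
        if len(self._pending) == self.samples_per_slot:
            # Consolidate the pending samples into a single averaged row.
            self.rows.append(sum(self._pending) / len(self._pending))
            self._pending = []


# Three slots, two samples per slot; the sample values are invented.
archive = RoundRobinArchive(slots=3, samples_per_slot=2)
for sample in [10, 20, 30, 50, 5, 15, 100, 200]:
    archive.update(sample)
```

After eight samples the archive has consolidated four averages but only holds the most recent three, which is the trade-off that keeps storage constant at the cost of old detail.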
Tutos
While Tutos isn’t systems management, they’d done some custom programming to link Tutos with Cacti. Tutos is a project management system (also in PHP). ERF ENS customized Tutos so that they could list all of their customers and then cross-launch into the appropriate Cacti view, meaning that they did some quick integration between their CRM and their systems management system. In that sense, they’d made a quick and easy composite application by hacking on the PHP.
PHP Composites
Which raises an interesting point about PHP’s ease of use: do you need composite application and portal frameworks when you can just open up the PHP file and add in the “portlet” you want? Of course, the answer will be yes at times, but I have a feeling that in an application driven by a dynamic language (PHP, JSPs, RoR) the answer could be (and probably is) no more often than we’d like to think. (On the other hand, as Bill would promptly point out, “I’d hate to be the one to manage that.”)
Big Brother
Though it’s not exactly open source, Big Brother has the same low barriers to entry that any OSS systems management application does. Before switching over to Cacti, they used Big Brother to get a quick dashboard of how their network was doing. As with Cacti, the point is that they just want to know if everything is up and running and keep enough historical data to troubleshoot as needed.
SNMP
The role of SNMP in systems management is often overlooked, but it’s critical to the success of any systems management application, closed or open source. Nowadays, pretty much anything you’d want to monitor comes with SNMP, even Windows, though it’s not turned on by default. What this means is that you can take an “agentless” approach to monitoring.
Most devices already have an agent, so you don’t need to bother with writing and deploying an agent. This is a clarification/addition that I didn’t put in my last screed on the topic of agent vs. agentless monitoring: it’s not that agentless monitoring “can do everything,” it’s that there are already agents everywhere, so you often don’t need to write your own. On the other hand, high-data areas like log management and analysis probably call for an agent on the target more often than not.
Yes, but SNMP. What a handsome set of letters. I am, of course, biased because of my love of systems management, but I would put SNMP (and UDP, implicitly) in the top 5 — OK, maybe 10 — protocols/formats ever, along with TCP/IP, HTTP, HTML, and voice protocols. SNMP just works, and it’s everywhere.
What’s Missing: Reporting
After I got a thorough tour of the muck, Arley and I escaped to his office and talked about the utopia he’d like to have. Thus far, they haven’t found, or had the time to put together, the reports that they’d really like. They can put these reports together by hand, but they’d really like the system to spit them out for them. What they’re looking for are SLA, MTTR (mean time to repair), and availability reports: graphs, numbers, and reports that give them a sense of the overall, long-term health of their setup.
Arley, who’s concerned with the ongoing operations and keeping the customers happy, appreciates that managing each incident needs lots of tooling around it, but he also wants to answer questions like, “are we fixing each problem quickly enough?” and “how often does the system go down?”.
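Those two questions map onto two simple calculations. As a rough sketch, assuming outages are recorded as (start, end) pairs that don’t overlap and all fall inside the reporting period, MTTR is the average outage length and availability is the share of the period not spent down. The function names and the sample outage log are invented for illustration; they are not from ERF ENS’s data or from Cacti.

```python
from datetime import datetime, timedelta


def mttr(outages):
    """Mean time to repair: the average length of the recorded outages."""
    if not outages:
        return timedelta(0)
    total = sum((end - start for start, end in outages), timedelta(0))
    return total / len(outages)


def availability(outages, period_start, period_end):
    """Fraction of the period the service was up (outages must not overlap)."""
    downtime = sum((end - start for start, end in outages), timedelta(0))
    return 1.0 - downtime / (period_end - period_start)


# Hypothetical outage log for one 30-day month.
outages = [
    (datetime(2006, 6, 3, 2, 0), datetime(2006, 6, 3, 2, 30)),      # 30 minutes
    (datetime(2006, 6, 14, 9, 0), datetime(2006, 6, 14, 10, 30)),   # 90 minutes
    (datetime(2006, 6, 27, 16, 0), datetime(2006, 6, 27, 17, 36)),  # 96 minutes
]
period_start = datetime(2006, 6, 1)
period_end = datetime(2006, 7, 1)

monthly_mttr = mttr(outages)
monthly_availability = availability(outages, period_start, period_end)
```

With this made-up log, the month works out to an MTTR of 72 minutes and 99.5% availability. The hard part in practice isn’t the arithmetic, it’s getting clean outage start and end times out of the event data in the first place.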
As with most enterprise software, it’s the reports that sell the software and help differentiate it from the competition. Indeed, reports are one of the ways you can start getting a business understanding of what’s going on in IT. Otherwise, there’s so much attention to simple fire-fighting that it’s difficult to remember how many fires there were and where those fires were happening, and to start developing a plan to prevent more fires. Everyone needs their Smokey.
Many of the open source systems management people and companies I’ve been talking with get this point. Time and time again, they’re driving towards a more simplified view of systems management: “don’t bombard me with 200 parameters, give me 20 that tell me the most important things. Or, even better, tell me what to do. Or even better [and this is when they start getting really animated] do it for me.”
At the same time, they don’t want the reporting to get so abstract that you lose touch with the low-level goings-on in your infrastructure. You’ve got to keep your pair of wing-tips and pumps ever at the ready, changing from one to the other as needed.
Money and Integration
Arley was very enthusiastic about any way he could attach money to the data he was getting from the system. Even something as simple as figuring out a rough burn rate (“how much money are we losing each minute if this system goes down?”) lit his eyes up.
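A back-of-the-envelope version of that burn rate is just revenue divided by time. The sketch below assumes you can put a monthly revenue figure on the system in question and that the revenue is spread evenly over the month, which is a simplification. All of the numbers and names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def burn_rate_per_minute(monthly_revenue, days_in_month=30):
    """Revenue at risk for each minute the system is down, spread evenly."""
    return monthly_revenue / (days_in_month * MINUTES_PER_DAY)


def outage_cost(monthly_revenue, outage_minutes, days_in_month=30):
    """Rough cost of a single outage of the given length."""
    return burn_rate_per_minute(monthly_revenue, days_in_month) * outage_minutes


# Hypothetical: the system supports 86,400 (in any currency) of revenue a month.
rate = burn_rate_per_minute(86400)
cost_of_72_minute_outage = outage_cost(86400, 72)
```

It’s crude, since it ignores SLA penalties and the fact that an outage at 2AM costs less than one at noon, but even a number this rough turns an uptime graph into something a COO can act on.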
Another point of concern was how he could integrate the system with other applications as they came along. The Tutos example is a good one in this regard. Though they could hack together a cross-link between Tutos and Cacti, there wasn’t much else they could do easily.
These last two concerns — reporting and integration with the overall business — have long been the Holy Grail of the systems management world. Some people have the time and money to get The Grail, while others are too small and/or busy for those luxuries.
The Take-away
These last two points are the biggest win for our ongoing open source systems management coverage. When it comes to systems management, being able to collect all the metrics, present graphs and events, and send off notifications of failures is just par for the course. If open source systems management applications provide just those features, they won’t create much disruption in the market.
One of the many things I’ll be looking for in open source systems management applications is the more “advanced,” business-driven reports and screens that help make sense of the piles of monitoring data out there. That kind of intelligent analysis is what lets people like Arley make decisions that improve their business and make their customers happy, not just bring the mail server back up for the n-th time this month.
Put another way, the system has to do more than just wake up a sysadmin at 2AM with a page.
I realize that sounds like a bunch of Holy Grail Hogwash, but the point is this: even if OSS systems management apps can displace vendors from low-level monitoring, unless those apps cater to the Arleys of the world, the existing systems management vendors will still enjoy plenty of revenue. That’s good news for the existing vendors, and a strategy for the OSS systems management folks out there.
Thanks!
So, thanks to Arley for thinking to get me up to Taylor, and Rick White, thedward, and Dave Burns for the low-down.
If you don’t mind me poking around your NOC or passel of computers, and you’re in the Austin area, email me, and maybe we can figure out another exciting field-trip 😉