What to Graph: An MRTG Manifesto
Graph everything
It's important to make the distinction between trending and alerting, which both fall under the "monitoring" genre. Alerting is getting an email, a page, or a phone call when something goes wrong. Trending is the act of collecting statistics when things are going right or wrong, and presenting that data.
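To make the distinction concrete, here is a toy sketch in Python. It is not how MRTG works internally; the class name, the threshold, and the sample values are all made up for illustration. The point is only that trending stores every sample, while alerting speaks up only when a limit is crossed.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Monitor:
    """Toy monitor separating trending from alerting (hypothetical, not MRTG)."""
    alert_threshold: float  # assumed limit, e.g. connections per second
    history: List[Tuple[int, float]] = field(default_factory=list)

    def record(self, timestamp: int, value: float) -> Optional[str]:
        # Trending: keep every sample, whether things are going right or wrong.
        self.history.append((timestamp, value))
        # Alerting: produce a message only when the limit is exceeded.
        if value > self.alert_threshold:
            return f"ALERT at t={timestamp}: {value} exceeds {self.alert_threshold}"
        return None


# Made-up samples taken every 300 seconds.
monitor = Monitor(alert_threshold=1000.0)
samples = [(0, 200.0), (300, 450.0), (600, 1200.0), (900, 300.0)]
alerts = [msg for t, v in samples if (msg := monitor.record(t, v)) is not None]
```

After the run, `monitor.history` holds all four samples, while `alerts` holds a single message for the one sample that crossed the threshold. The history is what you graph; the alert is what wakes you up.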
Trending is what we're concerned with here, and my trending tool of choice is the venerable open source project MRTG. Many of you have probably used MRTG in one form or another, either having set it up yourself, or at least having viewed the graphs someone else had set up.
With alerting, it's very easy and extremely detrimental to go overboard. Too many false alarms, and your pager becomes your worst nightmare. It also becomes ineffective, as superfluous alerts lead to the "cry wolf" syndrome.
With trending, that isn't the case: you can never over-trend. The more data you collect, the better off you are. The time expenditure and hardware resources required are minimal, and the cost for the software is nothing.
Being responsible for a load balancer is a tough gig. Everyone blames the load balancer. It's not that things don't go wrong with the load balancer; they're complex devices and there are many things that can go wrong, both code and configuration related. But whether or not it's the load balancer's fault, it gets blamed. So much so that I've written an entire article about it on the O'Reilly sysadmin site, entitled "It's Always The Load Balancer".
When things go bad and they come looking for the load balancer admin (and they will come), you'd better be armed with the most information you can get your hands on. You'd better be able to back up your assertion that it's not the load balancer, or you'd better be able to figure out what is wrong with the load balancer.
Obviously, this is where MRTG comes in.
MRTG provides information on not only what is happening now, but what's happened in the past 36 hours, the past week, the past month, and even the past year. It is a highly effective troubleshooting tool, as indispensable as "top" is on Unix systems, or network sniffers are to network admins. MRTG is the premier tool for this kind of trending. Not only is it open source, flexible, and free, but there is no commercial tool that even compares.
I started this site several years ago because I wanted to share the lessons I learned on the value of trending metrics for load balancers. On many, many occasions having trending data has saved my ass. I hope you find the information on this site useful, and I hope you find MRTG as useful as I do.