Let’s face it: most of us in IT spend entirely too much time in firefighting mode. We talk a great game about being proactive and keeping ahead of issues, monitoring our systems for utilization and capacity so we can schedule upgrades before things bog down. But then, with our phones beeping almost non-stop and email demanding a check every 30 seconds for critical alerts, we end each day further behind, and spend half our free time checking our smartphones to make sure nothing has gone down. It is all too easy to fall into the trap of monitoring so much that you drown in the information. That leads to the even more dangerous condition of missing the alerts that really are critical because they get lost in a sea of noise, or worse still, creating rules to sort the alerts and not even bothering to check them on a regular basis. All too often I find clients who say they have great monitoring and alerting, but when I ask them to show me what they are doing, they show me an Outlook client with a massive tree of folders, dozens of rules, and thousands of unread messages.
When your monitoring and reporting solution overwhelms you with noise, it is simply human nature to ignore it. You’d never get anything done otherwise. And while you may have the best intentions, far too often the only time you dig through all those folders is to find the one alert email you should have seen before something became a critical failure. Your monitoring has put you into information overload, and it’s time to troubleshoot your way out of that mess.
How much is too much?
Real-time alerts, or those that the monitoring system sends you as soon as it detects a condition that requires immediate attention, should be kept to a minimum. Service failures, low disk space warnings, failed backups (that don’t automatically schedule themselves to try again), virus detections, privileged account lockouts – these are the things you really should be looking at fairly quickly. Anything else is noise. The first thing you need to do as a team is look at a representative day’s worth of alert messages and identify the ones that don’t need immediate attention. Anything that can be safely ignored or looked at later shouldn’t be an immediate alert. Informational messages are the same way. If you really need to know about every success, then you need a NOC or dedicated monitoring team. The idea here is to weed out all the noise so that when your phone buzzes in the middle of a meeting, it is only because there’s something you really need to look at.
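The triage described above can be sketched as a simple routing function. The event names and the contents of the `IMMEDIATE` set are illustrative assumptions, not any product’s built-in categories; the point is that the list of page-worthy conditions is short and explicit.

```python
# Sketch of an alert triage filter. The event names below are hypothetical;
# substitute whatever identifiers your monitoring system emits.
IMMEDIATE = {
    "service_failure",
    "low_disk_space",
    "backup_failed",
    "virus_detected",
    "privileged_account_lockout",
}

def route_alert(event_type: str) -> str:
    """Return 'page' for events needing a human right now,
    else queue the event for the daily summary."""
    return "page" if event_type in IMMEDIATE else "daily_summary"
```

Everything that falls through to `daily_summary` is exactly what the daily digest step later in this article picks up.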
Who’s on deck?
Another common problem I see is alerts that go to a distribution list, and everyone assumes someone else has got it covered. D/Ls are the right thing to use for alerts, but you need to set up a rotation of who is the first responder, and who is the backup, and when an alert is received, whoever is actually going to respond needs to reply-all that they have it covered. That way you all know it is getting taken care of, and you don’t have two (or more) people trying to do the same thing.
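One way to make “who’s up” unambiguous is to derive the first responder and backup from the calendar, so nobody has to remember whose week it is. This sketch assumes a weekly rotation and a hypothetical three-person roster; real rotations also need swap handling, which is omitted here.

```python
from datetime import date

# Hypothetical team roster; the rotation advances one slot per ISO week.
ROSTER = ["alice", "bob", "carol"]

def on_call(today: date, roster=ROSTER):
    """Return (first_responder, backup) for the week containing `today`."""
    week = today.isocalendar()[1]          # ISO week number, 1-53
    i = week % len(roster)
    return roster[i], roster[(i + 1) % len(roster)]
```

The reply-all acknowledgment is still important even with a schedule: if the first responder is unreachable, the backup needs to know the alert is unclaimed.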
But it’s during scheduled maintenance
If you have maintenance windows, make sure your monitoring system is configured to stop alerting during that window. Whether you are doing system upgrades, patching, recabling, or anything else, you don’t want alerts waking people up during the expected reboots for patching.
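If your monitoring product lacks a built-in blackout feature, the suppression logic is simple enough to bolt on yourself. A minimal sketch, assuming a recurring Sunday 02:00–04:00 window (the window itself is an example, not a recommendation):

```python
from datetime import datetime, time

# Assumed weekly maintenance window: Sundays, 02:00-04:00 local time.
MAINT_DAY = 6                      # Monday == 0, so 6 == Sunday
MAINT_START, MAINT_END = time(2, 0), time(4, 0)

def should_alert(now: datetime) -> bool:
    """Suppress alerts that fire inside the weekly maintenance window."""
    in_window = (now.weekday() == MAINT_DAY
                 and MAINT_START <= now.time() < MAINT_END)
    return not in_window
```

Every alert path should pass through a check like this before it pages anyone.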
Oh yeah, you can ignore that, I rebooted
Look for monitoring systems that have a really simple pause button, and make sure you press it before doing something that would trigger an alert, like restarting a service or rebooting a server. You don’t want others responding to a perceived service failure while you are actively working on the box; it’s that sort of “boy who cried wolf” alert that makes people start ignoring alerts altogether.
PING doesn’t mean all is well
Pinging a box to make sure it is online and reachable is great, but that doesn’t tell you anything about the running services. Implement monitors that actually exercise the services: run queries, submit HTTP GETs, log on, check mail, and so on. I’ve seen hard-crashed servers whose NICs still responded to pings, so don’t rely on ping alone to be sure everything is up.
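A service-level check doesn’t have to be elaborate. These two sketches, using only Python’s standard library, actually open the service port and actually issue a GET; the hosts, ports, and URLs are yours to fill in.

```python
import socket
from urllib.request import urlopen

def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Open a real TCP connection to the service port.
    A live NIC on a hung server won't pass this; a ping would."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_check(url: str, timeout: float = 5.0) -> bool:
    """Issue an actual GET and require a 2xx response, not mere reachability."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False
```

For databases and mail, the same idea applies: run a trivial query or fetch a test mailbox rather than trusting that the port is open.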
Daily summaries are your friend
Remember all those extra alerts we weeded out in the first step? Those should be moved to daily summaries that hit the team’s inboxes first thing in the morning. Once everyone is logged on and down to business, team members should take turns reviewing the summary so that the things that could wait until the next day do get the attention they need.
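The summary itself can be as simple as grouping the day’s queued alerts by type with counts, so a reviewer scans one message instead of hundreds. The alert dictionary shape here is an assumption; adapt it to whatever your queue actually stores.

```python
from collections import Counter

def build_summary(alerts: list[dict]) -> str:
    """Collapse a day's queued alerts into one digest,
    grouped by alert type, most frequent first."""
    counts = Counter(a["type"] for a in alerts)
    lines = [f"{n:>5}  {t}" for t, n in counts.most_common()]
    return "Daily alert summary\n" + "\n".join(lines)
```

A type that suddenly jumps from two occurrences a day to two hundred is often the early warning the real-time alerts were never meant to carry.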
Automate your responses
If the appropriate response to an alert in the middle of the night is to restart the service, run a script, or bounce the box, let your monitoring solution do that for you. Only if the service doesn’t come back up after the automated action should the on-call admin have to remote in for further investigation.
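The restart-then-escalate flow might look like this sketch. It assumes a systemd host (hence `systemctl`; swap in your platform’s command) and takes your existing health check and paging hook as callbacks, since both are specific to your environment.

```python
import subprocess

def auto_remediate(service, is_healthy, page, restart_cmd=("systemctl", "restart")):
    """Try the scripted fix first; page a human only if the service
    is still unhealthy afterward. `is_healthy` and `page` are whatever
    health check and paging hook you already have."""
    subprocess.run([*restart_cmd, service], check=False)
    if is_healthy(service):
        return True   # remediated automatically; nobody gets woken up
    page(f"{service} still down after automated restart")
    return False
```

Most monitoring suites can run a script as an alert action, so a wrapper like this is usually all the glue required.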
Use SMS to get people’s attention
Ideally, you should use SMS to send text alerts to admins’ phones, instead of email. We all get far too much email around the clock, and the on-call guy shouldn’t have to lose sleep unless something really goes wrong. Silencing your email alerts while keeping SMS alerts audible lets you sleep through the night, but will actually wake you up if something critical does occur.
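If your monitoring tool can’t send SMS natively, one common workaround is a carrier’s email-to-SMS gateway. The gateway domain, phone number, and SMTP host below are placeholders (gateways vary by carrier), and keeping the body short matters because gateways truncate long messages.

```python
import smtplib
from email.message import EmailMessage

def build_sms(number: str, gateway: str, body: str) -> EmailMessage:
    """Build a text alert addressed to a carrier email-to-SMS gateway.
    `gateway` is carrier-specific, e.g. 'sms.example-carrier.net' (placeholder)."""
    msg = EmailMessage()
    msg["To"] = f"{number}@{gateway}"
    msg["From"] = "monitoring@example.com"   # placeholder sender
    msg["Subject"] = "CRITICAL"
    msg.set_content(body[:160])              # classic SMS length limit
    return msg

def send_sms(msg: EmailMessage, smtp_host: str = "localhost") -> None:
    """Relay the message through your own SMTP server."""
    with smtplib.SMTP(smtp_host) as s:
        s.send_message(msg)
```

Dedicated SMS gateway services exist as well; the email route is just the cheapest one to wire up from most monitoring products.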
By reducing the noise to manageable levels, automating responses, and moving informational alerts to daily summaries, you can get a better handle on your monitoring and alerting, actually provide appropriate and timely responses to the alerts that need you, and start moving away from that daily firefighting mode.