How do you monitor what goes on within your infrastructure? Do you gather logs, use SNMP, query WMI, or deploy agents that report in? There are almost as many ways to monitor servers as there are things on servers to monitor, but in today’s post we are going to look at the two main schools of thought and discuss the pros and cons of each. On the one hand, we’ll consider the built-in monitoring capabilities of modern operating systems. On the other hand, we will look at what deploying agents or other third-party software can do for you. In the end, hopefully you will have enough to make an informed decision on which way you want to go.
Why should we monitor?
It’s a valid question. Why should we monitor our servers? Won’t we notice when things go badly, before they become a problem? The answer to that is probably “not really, at least, not before it’s too late!” Admins who think they can just react when things fall down and go boom, or who feel they can check all their servers every day the good old fashioned way, by logging onto them, are either crazy, reckless, insomniacs, or they don’t have enough servers to actually be considered sysadmins. You need to monitor your servers for resources, performance, and errors, as well as monitoring the apps they provide. Consider a file server. What happens when it runs out of space? Or an email server that can no longer send emails because there’s a problem with a connector, or DNS? What about any server running at 100% CPU utilization? How responsive do you think it will be to your users? There’s more to monitoring though, as anyone who has had a disk fail can tell you. Most disks start to throw errors long before they go code brown. If only you had a way to notice those errors before it was too late!
What should we monitor?
For any server, running any operating system, I like to start with what I call the “big four.” That’s CPU utilization, available Memory, free Disk space, and Network utilization. C-M-D-N. Any server, no matter what services or apps it provides or what operating system it runs, needs sufficient resources to meet both normal and peak loads, so monitoring those critical resources gives you a good snapshot of overall server health. Then of course, you need to monitor the application logs for whatever it is the server is providing. You also want to keep up with any patching and updates, as well as how antimalware software is doing. Finally, and perhaps most importantly, you want to know how things are going from a security perspective, by keeping an eye on logon successes and failures, as well as privilege use. You can get much more granular, depending on the app, so you will want to consult the vendor guidance for whatever app or service you are running. Whether it’s included services like DHCP or IIS on Windows, or the SMB server on Linux, or complex ERP applications from third parties, each will have recommendations on what to watch and what to watch out for.
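To make the “big four” idea concrete, here is a minimal Python sketch that checks two of them, free disk space and CPU load, against thresholds. It uses only the standard library, so memory and network are left out (those usually need platform-specific calls or a third-party package). The threshold values and function names are made up for illustration, and os.getloadavg() only works on Unix-like systems.

```python
import os
import shutil

# Hypothetical thresholds; tune them for your own servers.
DISK_FREE_MIN_PCT = 10.0   # warn when less than 10% of the disk is free
LOAD_PER_CPU_MAX = 1.0     # warn when 1-minute load exceeds 1.0 per core


def disk_free_pct(path="/"):
    """Return the percentage of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100.0


def load_per_cpu():
    """Return the 1-minute load average divided by the CPU count (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def evaluate(disk_pct, load):
    """Return a list of warning strings for values that cross a threshold."""
    warnings = []
    if disk_pct < DISK_FREE_MIN_PCT:
        warnings.append(f"low disk space: {disk_pct:.1f}% free")
    if load > LOAD_PER_CPU_MAX:
        warnings.append(f"high CPU load: {load:.2f} per core")
    return warnings


if __name__ == "__main__":
    for warning in evaluate(disk_free_pct("/"), load_per_cpu()):
        print("WARNING:", warning)
```

Run from cron every few minutes, something like this gives you a crude early warning. Keeping the threshold logic in evaluate() separate from the measurement makes it easy to test without touching a real server.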
What’s there in the O/S?
Most operating systems have pretty solid built-in monitoring. Windows has its Event Logs, Performance Monitor, and Resource Monitor, and can take actions when certain triggers are hit. Windows also includes the ability to centralize data from the Event Logs using subscriptions, so that you can gather logs from multiple systems in one place. That way, you don’t actually have to log onto each of your servers. Rounding out Windows Event Logs is Log Parser which, while over a decade old, is still a pretty good tool for ripping through lots of logs in a hurry. Of course, Windows also offers a variety of APIs and ways to query the operating system and services, including Windows Remote Management, WMI, and remote PowerShell. Whether you want to roll your own or search online for scripts others have created, if you have some time and are willing to work through some debugging and tweaking, you can do a ton of monitoring without buying anything extra or installing anything extra on your servers.
Linux has several CLI tools for monitoring, and the syslog facility for reporting/gathering logs from multiple systems. You can configure your Linux boxes to send syslog messages to a central Linux server running syslogd, and of course you can start up syslogd on a Linux box to receive those feeds, as well as syslog messages from routers, switches, firewalls, and more. It’s helpful to use some application to automate the review of all those logs, but even manually parsing them is an option. With them all in one place, it’s easier than connecting to each system one at a time.
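The same central syslog server can also take messages from your own scripts. As a rough sketch, assuming a collector is listening on the standard UDP port 514, Python’s standard logging module can forward messages to it. The hostname in the usage note and the logger name are placeholders, not anything from a real setup.

```python
import logging
import logging.handlers


def build_syslog_logger(host, port=514, name="app-monitor"):
    """Return a logger that forwards records to a remote syslog server.

    SysLogHandler sends over UDP by default, so delivery is not
    guaranteed; that matches how classic syslog forwarding behaves.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(address=(host, port))
    # Prefix each message with the logger name so it is easy to filter centrally.
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Calling build_syslog_logger("syslog.example.internal").warning("disk almost full") would then show up on the central server alongside the feeds from your other systems, assuming that host exists and accepts remote messages.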
And of course, both Windows and Linux support SNMP. While you will need some SNMP monitoring system to query systems and receive traps, all you have to do on both Linux and Windows to use SNMP is start it up and configure it. It’s an optional feature of both operating systems.
The biggest benefit of using what is already in the operating system is that, for the most part, it’s already there. You might have to configure it, but you don’t have to install it, nor will you need to patch it separately from the operating system itself.
Of course, you get what you pay for, and while the operating systems are great values, the monitoring they include is short on bells and whistles. The built-in tools provide the basics, but on their own they will do little to alert you to problems or forecast trouble before it arrives. And reporting? Forget about it. Unless your management likes to read text files, you will spend a lot of time taking all that great information and putting it into formats the boss can understand.
What about agents?
There are lots of third-party tools out there that can install agents on both Windows and Linux systems and use a central system to query those agents to keep an eye on things. They can monitor the big four, check the status of running services, review logs, and check the health and performance of other software running on these systems. These agents are typically bundled with monitoring software; it’s not the agents you are paying for, but rather the automation in alerting and reporting that relies upon them. Those agents, in addition to needing to be installed, may need to be granted additional privileges or permissions to function fully on a system, and they will also need to be patched and updated as appropriate. With a good third-party patching solution like GFI LanGuard you can patch a lot of third-party apps, but the agents that monitoring solutions require are typically not on that list. And as a general rule, those agents consume additional CPU cycles and RAM, making their resource costs a factor.
Finally, while there are lots of applications that use agents for Windows systems, the same cannot be said for Linux. If you’re a Windows shop that may not be a consideration, but if you run a mix of Windows and Linux, you may need to consider this, and either narrow your choices, or have to monitor different systems in different ways.
Which way should we go?
Ultimately, you need to determine what will work best for you, and provide you with what you need. If you like to write or alter others’ scripts, and have the time to do that, what’s built into both Windows and Linux may be all that you need. Between remote PowerShell or WinRM for Windows, and SSH into Linux, you can probably automate most of the queries you need, and then, by tailing a log file, have a process that takes action, like sending you an email alert, if things look bad. Or, you may already have a SIEM or other monitoring application that, rather than relying upon agents, works with what is already in the operating system. To me, that’s the best possible approach. But if you are looking for more automation and reporting with less work required to set it up, and you need a complete solution running right now, purchasing a turn-key solution that relies upon agents may make sense to you. There are only so many hours in a day, and work-life balance quickly disappears if you have to stay up all night trying to cobble together code. A solution that provides forecasting, alerting, and pretty reports for management may be well worth the extra RAM and CPU cycles, as well as the money, it will cost to get going.
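For the roll-your-own route, here is a minimal Python sketch of the “tail a log file and send an email” idea. It is an illustration under assumptions, not a finished tool: the match pattern, the addresses, and the SMTP host are all placeholders you would supply, and it does not handle log rotation or throttle repeated alerts.

```python
import re
import smtplib
import time
from email.message import EmailMessage

# Hypothetical pattern for "things look bad"; adjust for your own logs.
PATTERN = re.compile(r"error|fail", re.IGNORECASE)


def follow(path, poll_seconds=1.0):
    """Yield lines appended to a file after we start watching, like `tail -f`."""
    with open(path, "r", errors="replace") as handle:
        handle.seek(0, 2)  # jump to the end so only new lines are reported
        while True:
            line = handle.readline()
            if line:
                yield line.rstrip("\n")
            else:
                time.sleep(poll_seconds)


def is_alert(line):
    """Return True when the line matches the alert pattern."""
    return PATTERN.search(line) is not None


def build_alert(line, sender, recipient):
    """Build an email message carrying the offending log line."""
    message = EmailMessage()
    message["Subject"] = "Server alert"
    message["From"] = sender
    message["To"] = recipient
    message.set_content(line)
    return message


def watch(path, smtp_host, sender, recipient):
    """Follow the log forever and email every matching line."""
    for line in follow(path):
        if is_alert(line):
            with smtplib.SMTP(smtp_host) as smtp:
                smtp.send_message(build_alert(line, sender, recipient))
```

You would start it with something like watch("/var/log/syslog", "mail.example.internal", "monitor@example.internal", "you@example.internal"), where every one of those values is made up. Even this small script shows the trade-off: it is free and needs no agent, but the rotation handling, alert throttling, and reporting are all yours to build.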
Ultimately, you need to determine what, for you, is required, and from that you can start to look at options that meet those requirements. Evaluate them on their costs, resource requirements, ease of implementation and upkeep for you and your team, and pick what makes sense for you. Hopefully the above gives you more to consider and will help you with making the right decision for you.