I’m not sure how I missed it when it was published back in January, but Ross Smith IV, Principal Program Manager for Office 365 Customer Experience, Exchange Jedi Master, and industry-recognized authority on all things Exchange, published an interesting post on the Exchange team blog. At first read you’d almost want to say “well, duh!” but the sad thing is, he wouldn’t have written it if it didn’t need to be said, again and again, at 120 dB, and then carved into the side of the technology mountain in letters 50 feet tall.
Mr. Smith bases his recommendations in this post on facts… internal support cases, escalations, crit-sits, customer data, and the sort of “been there, done that” experience that is as good as it gets. His post is absolutely worth reading in its entirety, but let me see if I can summarize it for you.
“Microsoft recommends adopting a software update strategy that ensures all software follows N to N-1 policy.” Those are pretty strong words, especially since he recommends this not just for Microsoft products, but for all products, including operating systems, software and applications, hardware drivers, and firmware. What is baffling to me is that this needs to be said.
Why does a vendor release a patch? Usually, it’s because something is broken that needs to be fixed! Far too often I deal with customers who treat patching as something done in full just before a system goes into production, then ignored unless there is an active exploit in the wild and the security team forces them to patch.
No! As long as we have code, we will have bugs, and risks, and iterative improvements in performance and/or functionality.
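To make the “N to N-1” idea concrete, here is a minimal sketch of a compliance check. The release list and build numbers below are made up for illustration; in practice you would pull them from the vendor’s published build tables.

```python
# Sketch of an "N to N-1" compliance check.
# The build numbers here are hypothetical examples, not real release data.
releases = ["15.2.1258", "15.2.1544", "15.2.1748"]  # ordered oldest -> newest

def is_compliant(installed: str, releases: list[str]) -> bool:
    """True if the installed build is the current (N) or previous (N-1) release."""
    return installed in releases[-2:]

print(is_compliant("15.2.1748", releases))  # current release (N): True
print(is_compliant("15.2.1258", releases))  # two releases behind: False
```

The same check applies to firmware, drivers, and client software: anything more than one release behind current is out of policy and should be scheduled for update.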
The “money shot” from Smith’s first section is that, when referring to critical incidents, “…in almost every case, the customer was experiencing known issues that were resolved in current releases.” Hmm, keep current, things work. How about that. Don’t overlook the fact that he isn’t referring just to Exchange, or even to the server operating system. Keeping client software and hardware drivers current was also part of that recommendation, and missing patches or out-of-date drivers were root causes in some of those critical incidents.
Use change control
No, really, use change control. For real. Submit your proposed changes. Test them in a lab, making sure the lab is a true representation of production. It won’t be 1:1 scale, but it should run the same versions of operating systems, applications, and services, with the same interfaces and client mix… and it had better be completely separate from production! Deploy your changes during change windows, test the production deployment, and either resolve any issues or roll the change back. I never want to see a process where the change request takes longer to complete than the actual change, but I also don’t want to see anyone “testing” changes in production, because that is just tempting fate!
$4!^ happens, design for that fact
Things happen, accidents occur, hardware fails, and people make mistakes. By building redundancy into every system you deploy, you ensure that services remain available even when something goes code brown. If it is critical enough to impact production should it go down, then it needs redundancy, from the hardware layer on up.
Ignore the vendor’s instructions at your own risk
Here is another key quote from Smith’s blog post: “…it’s rare to see a case where a customer honestly knows more about how a vendor’s product works than does the vendor.”
That’s a pretty important point to understand. You may have installed a product once, or 10 times, or in some rare situations 100 times, but you didn’t create the product, write the code, or run it through alpha, then beta, then production. And unless you are a consultant who specializes in that particular product, you are a rare individual indeed if you truly know a product better than the company that created it.
Read the friendly manual, and follow the recommendations therein. They are there because they have been proven to work again and again. More importantly, when a vendor tests updates, it tests them against its recommended configurations. If your systems are not running in line with vendor recommendations, you are far more likely to find that a patch breaks something. That is NOT a reason to skip patching. That IS a reason to follow the manufacturer’s instructions.
If you fail to plan, plan to fail
Smith includes 11 (yes, his list goes to 11!) steps for deploying Exchange. Nine of those 11 can readily be applied to practically any other deployment of any other product because the list starts with “Identify the business and technical requirements that need to be solved” and ends with “Continue collecting data and analyzing it, and adjust if changes occur.” More on that last one in a moment, but first, you will have to read his full post to get the other nine. Go ahead, it’s well worth the time!
Monitor, analyze, and plan accordingly
Monitoring is key for three reasons: it lets you confirm the system is functioning as intended (or make adjustments if it is not), it lets you observe changes over time, and it enables you to proactively plan for and accommodate growth before you hit critical resource constraints. Don’t just grab log files and shove them into a share somewhere, though… regularly review the logs, investigate discrepancies, and watch for trends so you can plan for the future.
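As a sketch of that last point, capacity samples collected over time can be projected forward to estimate when a resource will run out. This is a simple least-squares example with made-up weekly disk-usage numbers, not any particular monitoring product’s method:

```python
# Sketch: fit a linear growth trend through periodic capacity samples and
# project forward to estimate how many periods remain before a limit is hit.
# (Illustrative numbers only; real data would come from your monitoring logs.)

def periods_until_limit(samples: list[float], limit: float) -> float:
    """Least-squares slope through the samples, projected to the limit."""
    n = len(samples)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(samples) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, samples)) \
            / sum((x - x_mean) ** 2 for x in xs)
    if slope <= 0:
        return float("inf")  # flat or shrinking usage: no projected exhaustion
    return (limit - samples[-1]) / slope

# Disk usage in GB, sampled weekly, on a 500 GB volume.
usage = [300, 310, 322, 330, 341]
print(round(periods_until_limit(usage, 500), 1))  # weeks until full: 15.6
```

With roughly 10 GB of growth per week, the trend says the volume fills in about 16 weeks, which is exactly the kind of heads-up that turns an emergency into a routine change request.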
These tips are solid enough to write into the terms of a customer contract. Heed the advice in Ross Smith IV’s blog post and apply it to all your systems… not just your Exchange infrastructure. It’s about as good as it gets.