As physical infrastructure has changed over the years, management and configuration mechanisms have evolved as well. In the old days, large machines were often managed by hand by expert administrators who understood their configuration intimately. Then, as smaller machines were integrated into the enterprise, management-by-hand became difficult and build automation became paramount. Technologies like RIS (Microsoft's Remote Installation Services) and Sun's JumpStart became our best friends, and rdist allowed us to manage and update host configurations to a limited extent.
Then, when commodity hardware became stable enough, and more importantly cheap enough to buy redundant systems, we added amazing community-based projects like CFEngine, Puppet, and Chef to our toolboxes. Imaging models became popular as well, and every host in the enterprise gained the possibility of being consistent, or of becoming so within 15 minutes with a rebuild.
Throughout this evolution, we went from a 1:[1-10] to a 1:[20-30] to a 1:1000+ sysadmin:host ratio. We now measure workload in host classes, or in the number of customers we support, rather than in physical machines. A group of identical machines carries almost the same OS configuration overhead as a single host. Configuration management through Kickstart plus CFEngine/Puppet/Chef/etc., or through an image model, became an absolute necessity. I witnessed headcount being cut by 30% at one company when the move to CFEngine was made. With less management by hand, fewer hands are needed.
So, what if you work at a small business with only a handful of hosts? Do you need automation that would let you manage 1000 hosts with the same ease as one or two? Absolutely! Isn't that a lot of work to set up, and way too big a hammer? No way! Even for a small shop, configuration management is a necessity, and here is why: configuration management creates a predictable, recoverable, and flexible environment. That is something every business needs.
We’ve all had a developer, DBA, or user come to us and say something like: “I’m getting different results from machine X than from machine Y. Are you sure they are identical?” Oracle Grid Control has a feature that allows DBAs to compare system configurations, letting them quickly provide evidence to their sysadmins that something is wrong.
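You don’t need Grid Control to answer that question, either. Here is a minimal sketch of the same idea with plain tools, assuming you have captured a sorted “package version” manifest on each host first (the capture command and file names below are just examples, not a prescription):

```shell
# Capture a manifest on each host, e.g.:
#   rpm -qa --qf '%{NAME} %{VERSION}\n' | sort > /tmp/$(hostname).manifest
# Then diff the two manifests; any output shows exactly where the
# hosts have drifted apart.
compare_manifests() {
    diff -u "$1" "$2"
}
```

diff’s exit status (0 for identical, 1 for different) also makes this easy to wire into a nightly cron check.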
I once had a colleague run a script which executed “chmod -R 777 /” on several critical Digital UNIX hosts. Pandemonium ensued as pagers blared, evening plans were cancelled, and all hands were called on deck. (The guilty party was quickly nicknamed “Captain C-H-mod” for the remainder of his career.) The problem was quickly diagnosed, but recovering hosts built by hand was a big issue. How were these hosts set up? What had been modified over the years? Should we try to fix the permissions or rebuild? In the meantime, the business had ground to a halt and approximately 600 employees were standing around in warehouses across the US. Some 10 hours later, all systems had been rebuilt and many problems had been worked through. Some 15 bleary-eyed engineers went home after 3 AM to disappointed sleeping families.
Then there are the even more common requests, such as: “Replace machine X with hardware Y for scaling,” “The lease is up on managed-by-hand-1.you-are-making-your-life-harder-than-you-should.com,” or “Please build out a new host that looks almost like the existing host X.”
These problems affect small sets of systems as well as large installations. Proper configuration management cuts the time, complexity, and error rate of these tasks by an order of magnitude. By funneling every OS modification on a host through configuration management, you ensure that the host has a high likelihood of maintaining a predictable state, and that when it is rebuilt, it will be identical to its former state. Rebuilds are kicked off and finished before you can go to Starbucks and back. Developers are happy that their machines are all exactly the same, and if they question the system state, you can tell them you’ll happily rebuild the host for them in the next 15 minutes if they would like. MTTR (Mean Time To Recovery) is minimized, which is very good for the business, and the business is the reason we have a job in the first place.
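The “funnel everything through one place” idea boils down to declaring desired state and converging only when a host drifts from it. A toy sketch in shell (the path and mode in the test are illustrative; CFEngine, Puppet, and Chef do this declaratively for whole hosts, not one file at a time):

```shell
# ensure_mode FILE MODE: converge FILE's permissions to MODE.
# Idempotent: a second run finds nothing to do and stays silent.
# Uses GNU stat; on BSD/macOS the equivalent flag is `stat -f '%Lp'`.
ensure_mode() {
    path="$1"; want="$2"
    have=$(stat -c '%a' "$path")
    if [ "$have" != "$want" ]; then
        echo "converging $path: $have -> $want"
        chmod "$want" "$path"
    fi
}
```

Run from cron on every host, a pile of small convergence rules like this is what turns “are X and Y identical?” from an argument into a non-question.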
There is a hump to get over to implement automation and configuration management, but it’s not a big one. Regardless of whether you are in AWS or in a colo, there are many creative ways to couple imaging / OS builds with configuration management. Once it’s in place, it gives us more time to focus on work that makes money for the company, such as consulting with development teams to increase service availability (and sleep for everyone), or creating an offering that turns infrastructure into a revenue generator. It gives you time to work on things that change the perception of infrastructure from overhead into an integral part of the business that glues everything together. Oh yeah, and if you do end up needing to scale from 2 hosts to 2000, I suppose you could use it for that as well.
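As one sketch of that coupling (hedged: the package name and service commands are examples for a Red Hat-style build, not a recipe): a Kickstart %post section can bootstrap the configuration-management agent, so every freshly imaged host pulls its full configuration on first boot.

```shell
%post
# Runs inside the freshly installed system at the end of the Kickstart
# build. Install the CM agent and arrange for it to start at boot; from
# here on, every change to this host flows through configuration
# management rather than through hands on a keyboard.
yum -y install puppet
chkconfig puppet on
%end
```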