Depending on when you read this, I'm either just about to leave for my "Lights Out / Technology Off" vacation to discover my heritage in Newfoundland, or I'm there right now, or I've already come back.
I'll provide some insights on the trip another time.
In the meantime, I'm consumed with the preparation tasks. As I write this article, I have two weeks until I leave.
It's another great time to review and practice our contingency plans.
Like any company worth its salt in the technology arena, REM has a simple set of redundancies on our core technology *and* our human components. Notice the choice of the word "simple" rather than "complex." One of my well-known standard operating procedures around REM is to remove complexity wherever possible, and nowhere is that more evident than in our failover procedures for when I'm gone.
We have a clear set of resolutions for a vast array of predictable problems, ranging from "an email was accidentally erased by a user" to "a power supply blows up on a database server." Each involves following a simple recovery procedure of ten steps or fewer, designed so that even the least experienced person can follow it if required, and everyone knows their part from top to bottom.
The solutions don't rely on "vendor promises" or proprietary "Swiss Army knife" products. They are practical steps that can be understood by anyone with basic computer knowledge. Their recovery times are up to 15% longer than those promised by vendors, and they cost a little more money, but in my experience, vendor promises of instant recoveries rarely hold up when a disaster truly happens.
Want an example of one of our methods?
I have a complete "running" clone of our live server environment poised to take over any failing component.
What are the steps to recover from a completely failed database server - let's say the power supply blew up?
- Power off faulty database server.
- Log into the cloned emergency database server.
- Change the IP address of the emergency server to that of the powered off machine.
- Grab the latest database files for affected sites from any one of our 3 backup locations, if newer versions exist.
- Restore newer files.
- Done.
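The steps above can be sketched as a short script. This is a minimal, hypothetical illustration, not REM's actual procedure: the hostname, IP address, interface, and paths are all made-up placeholders, and it defaults to a dry run that only prints the commands it would execute.

```shell
#!/bin/sh
# Hypothetical sketch of the database failover steps above.
# All names below (db-emergency, 10.0.0.10, eth0, /backups/primary,
# /var/lib/db) are illustrative placeholders, not real values.

DRY_RUN=${DRY_RUN:-1}   # default to a dry run; set DRY_RUN=0 to execute for real

run() {
  if [ "$DRY_RUN" -eq 1 ]; then
    echo "WOULD RUN: $*"
  else
    "$@"
  fi
}

FAILED_IP="10.0.0.10"          # address of the powered-off database server
EMERGENCY_HOST="db-emergency"  # the running clone poised to take over
BACKUP_DIR="/backups/primary"  # first of the three backup locations

# Step: give the clone the failed server's IP address.
run ssh "$EMERGENCY_HOST" ip addr add "$FAILED_IP/24" dev eth0

# Step: restore database files, copying only those newer than the clone's.
run rsync -a --update "$BACKUP_DIR/" "$EMERGENCY_HOST:/var/lib/db/"

echo "Failover steps complete."
```

The dry-run default mirrors the fire-drill idea: anyone can walk through the script and see exactly what each step would do before a real disaster forces the issue.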
What are the steps to recover from a completely failed mail server - let's say all hard drives crashed?
- Power off faulty mail server.
- Log into the cloned emergency mail server.
- Change the IP address of the emergency server to that of the powered off machine.
- Grab the latest email files from any one of our 3 backup locations, if newer files exist.
- Restore newer files.
- Done.
Obviously, I'm picking some cut-and-dried scenarios to illustrate my point, but suffice it to say that simple solutions are by far the safer bet when it really matters: getting things running again.
Now we'll spend the next 10 days or so with fire drills so that I can answer questions now, and not when I'm in the forests of Canada's most easterly province.
Photo courtesy Paul Shaw.