This is a headshot of Jamie McBurney.

 

Depending on when you read this, I'm either just about to leave for my "Lights Out / Technology Off" vacation to discover my heritage in Newfoundland, I'm there right now, or I've already come back.

 

I'll provide some insights on the trip another time.

 

In the meantime, I'm consumed with the preparation tasks.  As I write this article, I have 2 weeks until I leave.

 

It's another great time to review and practice our contingency plans.

 

Like any company worth its salt in the technology arena, REM has a simplistic set of redundancies on our core technology *and* our human components.  Notice the choice of the word "simplistic" instead of "complex".  One of my well-known standard operating procedures around REM is to remove complexity wherever possible, and nowhere is that more evident than in our failover procedures for when I'm gone.

 

We have a clear set of resolutions for a vast array of predictable problems, ranging from "an email was accidentally erased by a user" to "the power supply blows up on a database server".  Each one involves reading through a simple recovery procedure of 10 steps or fewer, designed to be followed by the least experienced person if required, and everyone knows their part from top to bottom.

 

The solutions don't rely on "vendor promises" or proprietary "Swiss Army knife" solutions.  They are practical steps that can be understood by anyone with basic computer knowledge.  They have recovery times that are up to 15% longer than those promised by vendors and cost a little more money, but my experience is that vendor promises of instant recovery rarely hold up when a disaster truly happens.

 

Want an example of one of our methods?

 

I have a complete "running" clone of our live server environment poised to take over any failing component.

 

What are the steps to recover from a completely failed database server - let's say the power supply blew up?

  1. Power off faulty database server.
  2. Log into the cloned emergency database server.
  3. Change the IP address of the emergency server to that of the powered off machine.
  4. Grab the latest database files for affected sites from any one of our 3 backup locations if newer versions exist.
  5. Restore newer files.
  6. Done.
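
For the curious, here is a minimal sketch of the "if newer versions exist" check behind steps 4 and 5. It is not our actual tooling: the function names, the use of file modification times, and the idea that each backup location is a mounted folder are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterable, Optional


def newest_backup(filename: str, backup_dirs: Iterable[Path]) -> Optional[Path]:
    """Return the most recently modified copy of `filename` across all
    backup locations, or None if no location has it.

    Assumption: each backup location is reachable as a local directory.
    """
    candidates = [d / filename for d in backup_dirs if (d / filename).is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def needs_restore(clone_file: Path, backup_file: Optional[Path]) -> bool:
    """True when a backup exists and is newer than the clone's copy,
    or when the clone has no copy at all.

    Assumption: modification time is a good enough proxy for "newer".
    """
    if backup_file is None:
        return False
    if not clone_file.exists():
        return True
    return backup_file.stat().st_mtime > clone_file.stat().st_mtime
```

The point of keeping it this small is the same as the point of the checklist: a person following the steps by hand is doing exactly this comparison, so nothing is hidden inside a tool they don't understand.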

What are the steps to recover from a completely failed mail server - let's say all the hard drives crashed?

  1. Power off faulty mail server.
  2. Log into the cloned emergency mail server.
  3. Change the IP address of the emergency server to that of the powered off machine.
  4. Grab the latest email files from any one of our 3 backup locations if newer files exist.
  5. Restore newer files.
  6. Done.

Obviously, I'm picking some cut-and-dried scenarios to illustrate my point, but suffice it to say that simple solutions are by far the safer bet when it really matters - getting things running again.

 

Now we'll spend the next 10 days or so running fire drills so that I can answer questions now, and not when I'm in the forests of Canada's most easterly province.

 

Photo courtesy Paul Shaw.
