Ease the pain of network downtime by managing expectations
Published: 12 May 2003 10:11 BST
Avoid embarrassment by checking the obvious
The first server I ever administered was a Dell running NT4 Server. It was working well when I was asked to move it from the main office to a secure area. At first, the job seemed fairly simple.
There was no network point in the archive room where it was to live, so the first thing I had to do was arrange one. This entailed simply drilling a hole through a partition wall and running a cable through it. Having ensured that my new point was live by plugging a desktop system into it, I arranged with the rest of the company the best time to move the server.
As it turned out, the entire company was in a meeting discussing their next research project, so I had free run of the building and the network. They were having lunch sent in, so I figured I had plenty of time to allow for disasters. I unplugged the server screen and moved it to the new area.
I piled the keyboard, mouse, and UPS onto the server case, unlocked the wheels, and removed the power plug. The server was still active, running from the UPS, which was set to run the system for 20 minutes before closing down. In no time, everything was plugged back in, and I ran around the office to make sure that I could see server volumes on the desktop machines.
"Great", I thought, "a job well done". When the rest of the workforce returned, however, it turned out that the messaging services had stalled, and nobody could send email. I took a few more minutes to restart the services and kicked myself for not thinking to check it.
With any luck, that particular scenario will not occur again, since I recorded it in my server diary and added it to the procedure for similar operations.
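A post-move procedure like the one described above could be partly automated. What follows is only a minimal sketch, not the procedure from my server diary: the checklist entries, the port numbers (139 for NetBIOS file sharing, 25 for SMTP mail), and the function names are all assumptions chosen for illustration.

```python
import socket

# Hypothetical checklist: service name -> TCP port.
# Both ports are assumptions about what the moved server offers.
CHECKLIST = {"file sharing": 139, "mail (SMTP)": 25}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checklist(host, checklist=CHECKLIST, probe=port_open):
    """Probe every service on the list and return {name: True/False}."""
    return {name: probe(host, port) for name, port in checklist.items()}


def failed_services(results):
    """Return the sorted names of services that did not respond."""
    return sorted(name for name, ok in results.items() if not ok)
```

Calling `failed_services(run_checklist("server.example"))` from a desktop machine, with your own server name in place of the placeholder, would list anything that did not answer. An open port only shows that something is listening; a stalled service may still accept connections, so sending a real test email remains the more reliable check.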
Have a contingency plan
It's possible that your work may overrun the time constraints allotted, and that you'll need to retreat from your efforts to let users back on. Be sure that it's possible to roll back to the original system state. But, before you abandon a job that's nearly complete, try some on-the-fly negotiations with the user base. They may be happy to stay offline for another hour if it prevents another shutdown in the near future.
It's important to know what the options are and where the point of no return lies. Thankfully, I've never gone past it. By employing a strict if-it-ain't-broke-don't-fix-it policy, I've managed to keep things reasonably functional. Any work I wasn't sure about went onto the test machines for evaluation before I implemented it, and I also created an additional backup. In a pinch, it would have been possible to plug my test server into the live network to replace the main machine.
In any event, you should build a margin of safety into any scheduled downtime slot. If you come in ahead of schedule, your team will feel good about it, and the user base will also think you have done well. If your estimates are too "realistic" and leave no slack, you will have to live up to them or risk losing the confidence of both your team and the users.
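The arithmetic behind a margin of safety is simple enough to sketch. The 50 percent default below is purely an illustrative assumption, not a recommended figure, and the function names are hypothetical.

```python
import math


def padded_window(estimate_minutes, margin=0.5):
    """Return the downtime to announce: the honest estimate plus a safety margin.

    margin is a fraction of the estimate (0.5 means 50 percent extra).
    """
    if estimate_minutes < 0 or margin < 0:
        raise ValueError("estimate and margin must be non-negative")
    return math.ceil(estimate_minutes * (1 + margin))


def finished_within_window(actual_minutes, announced_minutes):
    """True if the work was finished inside the announced slot."""
    return actual_minutes <= announced_minutes
```

With an honest estimate of 60 minutes and the default margin, `padded_window(60)` gives a 90-minute slot to announce, so a job that actually takes 75 minutes still finishes inside the window.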










