
This paper was first presented at AUUG's National Conference, September 2003, Sydney.


What would you do if you lost your office?

Disaster Recovery in Lower Manhattan, September 2001

Andrew Frederick Cowie - Operational Dynamics

Abstract

“Have you ever thought what would happen if you lost your office?” That happened to us at Upoc when terrorists struck the World Trade Center in New York - our office was two blocks away. Disaster recovery planning is something that is often neglected by small companies, but it is as important for them as any other. This paper discusses the experiences of a small company located in lower Manhattan and the lessons we learned as a result of having lived through 11 September 2001.

Our office was buried in dust, and we lost the use of all the systems within it for over three weeks.

Some things we did right - our production platform was remotely located. By virtue of running Unix and having deliberately ensured that the systems were able to be remotely administered, we kept our live systems running during the disaster.

Other lessons we learned the hard way. The importance of decentralizing (notably DNS). Ensuring that you have alternate ways to communicate with everyone in your company. Making sure that critical functions continue – especially corporate email and software development. Although we'd thought long and hard about what to do if we lost our production platform, losing our development platform was almost a knockout blow.

Despite the magnitude of possible situations that you might face, it is possible to plan for disaster. You can be prepared for emergencies if you think ahead – and doing so improves the organization's ability to deal systematically not only with crisis but with more mundane day to day problems as well.

Before and After

Author

Andrew Cowie is an operations consultant based in Sydney. He runs a company which focuses on usability, scalability and maintainability by helping clients create effective teams, establish procedures, and improve systems performance. You can reach him at andrew@operationaldynamics.com

Copyright

Copyright © 2003 Operational Dynamics Consulting Pty Ltd, All Rights Reserved.

Introduction

Most of us have heard of “Disaster Recovery”. More generally referred to as “Business Continuity”, it's something that large corporations do, right? Visions of large empty rooms with rows and rows of silent computers waiting for a disaster to drive workers away from their offices and into the emergency alternate facilities.

The reality for small and medium sized businesses is different. Capital is tight. It is almost always out of the question to have standby facilities sitting idle, just waiting – if any budget is available it is going to be spent on production equipment and on the staff needed to make things run.

Small companies can, however, do effective business continuity planning – and can survive disaster – as long as they think ahead. It doesn't necessarily take a big budget, but it does take the right mindset.

In September of 2001, I was working as Director of Operations for Upoc, Inc, a mobile communities platform (think SMS mailing lists) whose company office was located in downtown New York. This paper will describe our experiences in the aftermath of the terrorist strike, discuss the measures we were forced to take, and offer some thoughts on what your organization can do to be prepared for disaster.

The scene

What we had

A modest office with about 40 people, and a remote datacenter.

Our production platform and staging test servers were co-located in a large datacenter hotel in central New Jersey. [That's far enough away from New York that it was a pain for the Systems Administrators to have to hike out there, but close enough that it could be reached relatively quickly if we had a hardware failure in the middle of the night. We like to take credit for foresight and good planning for having stuck the production platform way out in the boonies, but the truth is that at the height of the Internet boom, there simply wasn't any datacenter space to be had in Manhattan, period].

We had all Sun hardware running Solaris throughout, with a Cisco network. All the servers and networking gear were configured for high availability – two or more machines of each class when they could run in parallel, and larger machines with fully capable alternates clustered in a hot standby configuration for functions that relied on a single node. We had put great effort into ensuring reliability and robustness for the production platform.

Our corporate office, on the other hand, had the usual systems – corporate email, shared file server, accounting system, and development server, along with desktops for all the employees. We had RAID storage backing up the email store and the development box, and even had a pair of T1 uplinks (each from a different ISP) to provide redundancy.

Our office was located in Lower Manhattan. For those of you who haven't been there, New York has several areas which in most other cities would rate as “downtown”, but this one is simply called “the financial district”. Our offices were right near the corner of a couple of streets called Wall and Broadway. You might have heard of them.

That intersection is two blocks away from the site of the World Trade Center.

What we lost

At first we didn't know. In fact, dealing with uncertainty would become a major factor in how events unfolded for us.

When the planes first struck, we had about 10 people in the office – there for an early morning teleconference. When the first building collapsed, we had the somewhat terrifying experience of still having IP connectivity to the office – but suddenly no one was answering their instant messengers. We didn't know what happened to them – and no one was answering their cell phones.

It turns out that some of our staff had fled the building; others, who were outside already, were half buried in debris. It took us almost the entire day, but ultimately we were able to track everyone down. Thankfully we hadn't lost anyone. Many other companies weren't so lucky.

We did, however, lose access to all the systems in the office. Suddenly no one was able to receive email anymore. The accounting books were safely locked up in the office manager's machine – which was in a building that we couldn't get to. And all the tools that facilitated development of our code were there too.

For several days, we didn't even know whether we had an office anymore or not. Even assuming the entire downtown area didn't burn away Great Fire of Chicago 1871 style, what shape was the building in? Was it structurally sound? Even if we could reoccupy the office, what would we find? And would services be restored to the building anytime soon? If the building didn't have heat, light and telecoms, then we wouldn't be able to work there anymore even if they did let us reoccupy. A lot of uncertainty.

That's a point that needs emphasizing – telecoms. It doesn't matter if your building is habitable. If the phones don't work, and if your Internet connection isn't lit up, then what's the point?

It rapidly became apparent that we weren't going to be getting back to our office anytime soon. Not only did we not know the status of our office, but Lower Manhattan was sealed off. So, for the two weeks immediately following the disaster, we were in a mad scramble to try to find new office space.

A fellow startup company called Giant Bear (I never did figure out what they did), located in midtown Manhattan, graciously let us double up in their office space for those few weeks. That allowed two critical groups to have a place to meet: management (let's figure out if we can save the company, shall we?) and operations (let's see if we can keep the company running!).

The bigger picture

From my travels since that time, I have come to realize that, not unexpectedly, most of the attention from the worldwide media focused on the tall buildings, their collapse, and the large smoking crater that resulted. What seems to have been given relatively little coverage is the effect on the rest of downtown New York.

Little details like:

  • The entire power grid was smashed. 7 World Trade Center (the building that caught fire later in the day and collapsed that evening) happened to have a major electrical power substation in it. And almost all the buildings in the immediate area took their power feed from that substation. Oops.

  • Almost every diesel auxiliary generator failed within hours. Many buildings had emergency generators, but guess where they were located? On the roof. The concrete dust and soot that descended on the area clogged up the air intakes and jammed up the gearing. Even most of those with proper environmental filters ran out of fuel – for quite some time nobody was allowed to go anywhere near the disaster site with a truck load of gas or oil. Things were a little paranoid for a while.

  • The phone system was decimated. Right beside 7 WTC was 140 West St – a building full of telecoms and phone switches. It was partially destroyed by fire, and severely damaged by all the water they poured in to put out the fire (electrical equipment tends not to take too well to being soaked). As a result, hundreds of thousands of subscribers (i.e. most of the businesses in the financial district) lost their phone connections and data links (which, of course, are often carried through the 'last mile' on local loops provided by the phone company). Hundreds of thousands more (this time residential customers) lost their connections when Verizon yanked out and rerouted trunk lines to reestablish connectivity for the New York Stock Exchange.

  • The cell phone system was saturated. With the landline system knocked out, everyone kept trying to make mobile calls. Most of the cells in the immediate area lost power and/or uplink, and the cells further away were jammed.


Even once we were able to reoccupy our buildings and start the cleanup, it was not exactly a nice place to work:

  • The fire didn't go out until the first week of December. Depending on which way the wind was blowing, you almost couldn't breathe. Certainly the authorities were encouraging us not to breathe the air. Don't know about you, but that's hard to do for long periods of time. Like a few months.

  • We're not done with the power yet: for over six months, the power lines were running along the streets in wooden boxes hammered together as cable runs with white and orange stripe paint on them while ConEd dug up every street in lower Manhattan and relaid the power grid. New York is not exactly a well laid out city to begin with – imagine how hard it was to get a delivery truck anywhere near your office.

Situation management

What's it like to cope with an emergency situation?

Not knowing the situation ... accurate information is the most valuable commodity going in a complex and fast moving crisis situation. You have to make quick decisions under time pressure and with incomplete information, but act you must. You're going to wear the consequences of the decisions you make for a long time to come, so anything you can do to get the best possible picture of the situation is of vital importance.

We didn't know the state of our office; finding out what was there (if anything) was critical. Four days after the disaster, three of us were able to make it down to our building. Climbing the stairwell by flashlight with some of the building staff, we made it into our office. We found that some of the windows had blown open and that everything was buried under 3 centimeters of dust and ash. It was a total mess. We were able to recover four critical items:

  • the company cheque book (so we could make payroll and pay emergency expenses)

  • the backup tapes (along with a tape drive, a router, and some other basic hardware we'd need to set up somewhere else)

  • the current project plan and functional specifications binder (so we'd know what Engineering was supposed to be working on)

  • Vince's walkman (one of my sysadmins had survived a building nearly falling on him, but was traumatized by having left his CD player behind).


While recovering those items was important (ok, maybe not the CD player), of far greater value was having gained information about what shape our office was in. That allowed us to make decisions – and to realize that if we could go back into our premises before too long, we could vacuum the place down and make it a productive workplace once again.

Emergency communications

Ok, so the fire brigades and police have radios. What do you have?

In a situation like the one we faced, being able to get information to people is critical. It's not just management and operations who are seeking to know what's going on – everyone else in your company needs to know too. By reaching out to them and letting them know the story, you reassure them, keep them from guessing, and prevent the kinds of rumours that lead to panic.

Of course, sending an email to “All” isn't going to work if a) the corporate mail server isn't running anymore and b) no one is plugged into it even if it was – because no one is in the office! You need to have an ability to get messages to people via a means independent of the office. A simple mailing list hosted on one of the production servers might do the trick. A Yahoo! Group. Anything. The catch is that you can't set this sort of thing up after the fact (or rather, if you do, it'll be a nightmare). If you take the time to gather emergency email contacts from people, and keep a mailing list with those contacts up to date, then in the event of an emergency you'll be better able to reach your people.

Managers at all levels need to be able to reach out to their people. Most companies have organization charts, and most good leaders will know how to contact their people “at home”. Rather than just an informal practice, this needs to be codified as a best practice for everyone in the organization to follow – and not just managers. Since certain links in the communications chain may be unable to pass messages along, you need redundancy in your ability to pass critical messages. If team members know how to get in touch with each other, then the likelihood that a critical message will get through to everyone is that much higher.
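The redundancy argument above can be checked mechanically. What follows is only a minimal sketch: the roster, the names in it, and the `fragile_links` helper are all hypothetical, and the threshold of two independent contacts is an assumption rather than anything we measured. It lists the people who would be cut off if a single colleague could not pass a message along.

```python
from collections import defaultdict

# Hypothetical roster: who holds out-of-office contact details for whom.
knows_how_to_reach = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "carol"],
    "carol": ["alice", "dave"],
    "dave": ["carol"],
}

def fragile_links(roster, minimum=2):
    """Return people who can be reached by fewer than `minimum` colleagues."""
    reachable_by = defaultdict(set)
    for holder, contacts in roster.items():
        for person in contacts:
            if person != holder:
                reachable_by[person].add(holder)
    return sorted(p for p in roster if len(reachable_by[p]) < minimum)

# Only alice knows how to reach bob, and only carol knows how to reach dave.
print(fragile_links(knows_how_to_reach))  # ['bob', 'dave']
```

Anyone who shows up in that list depends on exactly one link in the chain, which is the situation to fix before an emergency rather than during one.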

Lessons learned and recommendations

SMS is surprisingly useful

I described above how the mobile phone network was saturated, making it almost impossible to get through with a voice call. SMS, however, was flowing fine. Because text messages are carried on spare bandwidth in the cell phone signaling and control layer frequencies, and it only takes a small percentage of available signaling capacity to set up or tear down a call, SMS messages were able to get through even though all the voice frequencies were busy.

As mentioned, we were even further ahead in that Upoc's service is to send messages to groups of cell phones. We had a group with all the employees in it, so everyone was able to intercommunicate quite easily despite the fact we were scattered all over the city and even though no one could get a voice call through.

Nevertheless, if staff have up-to-date contact information about their co-workers, superiors, and subordinates, then in the event of emergency SMS can be used as a reliable way to get messages through.

So,

  • Ensure you have contact information for everyone around you – an emergency email address, home phone number, and mobile numbers. If they have a girlfriend in San Francisco, make sure you find out that number too.

Ability to remote operate is key

You need to be able to administer the production platform remotely and to keep systems running with a decentralized team. It was two weeks before we managed to bring the entire company back together again in the same physical location, and almost two weeks after that before our office was suitable as a workplace and network control center.

Thankfully, our production platform ran Unix, and we had deliberately ensured we could operate all the systems remotely.

As is common when the production platform is separate from the normal day to day office, we had a private VPN (in this case carried by a frame-relay circuit) from our office to our production network backbone. This allowed us to efficiently administer the Live site from our office. We had, however, also ensured that the platform was accessible via the Internet. We used locked down bastion hosts and careful access control lists, to be sure, but we recognized that there would be times that the administrative VPN would be down and that we would still need to be able to administer the production systems regardless.

This decision proved to be critical in surviving September 11th – if we had not enabled our sysadmins to work from outside the office (ie from home), then we would have been unable to access our platform from the moment the power failed in Lower Manhattan until two weeks later when it was restored and a week after that when the frame relay finally came back up.

As it happened, our systems team, spread around the area, was able to keep in touch and keep production running.

  • Make sure your staff can run the platform in all respects without needing to be in the office. Don't rely on “connecting to the office first, then across the VPN to production”. Yes, securing such access takes careful planning, but the advantage in productivity (not to mention emergencies) is worth the effort.

  • Unix rocks in the datacenter because it is so easy to remotely administer. I wouldn't want to have tried any of this if we'd been running Windows servers.

  • Ensure all your Systems staff have high speed (broadband) Internet access. They work very hard, and long hours. The least you can do is pick up the tab for their access from home. After all, they're going to be working from home all the time; you might as well do what you can to make them as productive as possible so they can get back to their own lives.
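One way to test the first recommendation is to write down each route into production along with what it depends on, then ask which routes survive a given set of failures. The sketch below is illustrative only: the path names and dependency labels are assumptions loosely modelled on the setup described above, not a record of our actual configuration.

```python
# Hypothetical access paths to production and what each one depends on.
ACCESS_PATHS = {
    "office VPN over frame relay": {"office power", "frame relay circuit"},
    "home broadband to bastion host": {"home broadband", "bastion host"},
}

def usable_paths(paths, failed):
    """Return the names of paths none of whose dependencies have failed."""
    failed = set(failed)
    return sorted(name for name, deps in paths.items() if not deps & failed)

# Lose the office: the route through the bastion host is still available.
print(usable_paths(ACCESS_PATHS, ["office power", "frame relay circuit"]))

# With only the office route configured, the same failure leaves nothing.
office_only = {"office VPN over frame relay": ACCESS_PATHS["office VPN over frame relay"]}
print(usable_paths(office_only, ["office power"]))
```

If any single failure produces an empty list, that dependency is a single point of failure for administering the platform.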

Be prepared to reroute email

Of all the Internet technologies, email is actually pretty good at dealing with interruptions. If a mail server can't get through to the destination of a message, it will queue the message and retry a little later.

That's great, and covers you if your systems are down for a few hours. Even a few days. But what if you're down for two weeks? Most mail transport agents (MTAs) tend to give up after a week, if they're allowed to queue messages even that long.

Mail delivery is, of course, directly related to DNS, because it is the mail exchanger (MX) records which determine where the mail for your domain goes. Our corporate domain, @upoc-inc.com, was served by the same nameservers as our production domain, @upoc.com, located in our production site. Because we had our Live platform running, we were able to get at the zone files for the corporate domain. We reconfigured one of our production machines to quietly act as mail receiver for mail destined for upoc-inc.com, and then adjusted the MX records to send our company email there instead of the (not responding) server in the office.

Not losing email is important, but that's not everything, of course. People need to be able to read that email. We ran Qmail as our MTA. Qmail has a very straightforward delivery mechanism which made it quite a simple matter to arrange to a) keep a copy of each message (so we could re-inject it into the corporate email store if / when we ever got it back) and b) forward a copy to the emergency email box (Yahoo, Hotmail, their home ISP, etc) that our staff were using.
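For readers unfamiliar with Qmail, per-user delivery is controlled by a .qmail file in which each line is one delivery instruction: a path ending in a slash delivers into a maildir, and a line beginning with & forwards to another address. The sketch below generates such a file for the keep-a-copy-and-forward arrangement. It is a reconstruction under assumptions, not the configuration we actually deployed; the `dot_qmail` helper, the maildir name, and the address are hypothetical.

```python
def dot_qmail(archive_maildir, forward_to):
    """Build the text of a .qmail file that archives and forwards each message."""
    if not archive_maildir.startswith((".", "/")):
        raise ValueError("maildir path must start with '.' or '/'")
    if not archive_maildir.endswith("/"):
        archive_maildir += "/"  # the trailing slash marks the target as a maildir
    # First line: keep a local copy. Second line: forward to the emergency mailbox.
    return f"{archive_maildir}\n&{forward_to}\n"

# Hypothetical user: archive locally and forward to a personal account.
print(dot_qmail("./Maildir-archive", "jane.doe@example.net"), end="")
```

Because the archived copies sit in an ordinary maildir, they can later be re-injected into the restored mail store.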

The final step in the sequence was to set up a private mail relay so that mail would appear to come from the proper address (upoc-inc.com) rather than whatever expedient means an individual was using (webmail.myisp.net). Despite everything that was going on, we wanted to maintain a professional appearance and reassure our clients (not to mention our investors) that the business was still a going concern. A big part of this was simply restoring the ability of our executives and our sales staff to communicate with the outside world from email addresses that looked as they should – the @upoc-inc.com corporate address. Setting up a mail relay for them to route their email through was a big step in achieving that.

The recommendations here are simple yet profound:

  • Our ability to do any of this relied on our ability to alter the DNS records. So, there is great value in finding a really reliable, fault tolerant and highly available way for your DNS records to be served. If you can move the primary nameserver for your domain to a third party location (or outsource it to someone with the ability to provide redundancy) then you are leaps and bounds ahead if it becomes necessary to make changes in a hurry. The registrars tend not to be very good in this regard as they tend to have slow and cumbersome update processes; someone like dyndns.org who is in the business of providing reliable DNS service with an excellent update interface and the ability to easily update zone records is ideal.

  • Keep the time-to-live (TTL) values low, at least for MX records! It doesn't do much good to change the zone records if everyone already has answers that will still be valid for weeks to come.

  • If you've planned ahead to be able to reroute email in the event of emergency, why not go all the way and set up a webmail system to go along with the off-site queuing relay? It's easy to do ahead of time. And that's when you've got the time to do it right – far better ahead of time than after the disaster has already happened.
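The interaction between TTL values and mail queue lifetimes comes down to simple arithmetic. As a rough sketch, assume sending servers retry for seven days (in line with the "about a week" figure above) and that it takes four hours to get a changed zone published; both numbers, and the two TTLs compared, are illustrative assumptions.

```python
def worst_case_cutover_hours(ttl_seconds, hours_to_publish_change):
    """Upper bound on how long remote servers may keep using the old MX record."""
    return hours_to_publish_change + ttl_seconds / 3600

def mail_at_risk(ttl_seconds, hours_to_publish_change, sender_queue_days):
    """True if cached MX answers could outlive the senders' retry window."""
    cutover = worst_case_cutover_hours(ttl_seconds, hours_to_publish_change)
    return cutover > sender_queue_days * 24

ONE_HOUR = 3600
TWO_WEEKS = 14 * 24 * 3600

for label, ttl in (("one hour TTL", ONE_HOUR), ("two week TTL", TWO_WEEKS)):
    hours = worst_case_cutover_hours(ttl, hours_to_publish_change=4)
    risky = mail_at_risk(ttl, hours_to_publish_change=4, sender_queue_days=7)
    print(f"{label}: up to {hours:.0f} hours on the old MX, mail at risk: {risky}")
```

With a one hour TTL the old record ages out within 5 hours, well inside the retry window. With a two week TTL, remote servers could keep trying the dead server for up to 340 hours, double the 168 hours a week-long queue allows, so mail would start to bounce.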

Ability to remote develop is very important

Our engineers and developers effectively got a two week vacation. Not because they didn't want to help, but because the company was completely unprepared to support their efforts in a distributed, decentralized manner.

What would happen to GNOME if the CVS repository was lost? Well, nothing, because there are copies of the current code in multiple places (not the least of which are thousands of individual developers' machines worldwide) and the repository itself is backed up and mirrored by several independent organizations.

It's not like that in most companies, however. Production facilities are usually provisioned for high availability, heavy load, and redundancy. But internal development and testing facilities are given the same protections in only the rarest of circumstances. The corporate email system is usually at least moderately well protected, but that's it.

Development at most companies is conducted within tightly secure environments – for perfectly good reasons – but the result is that if a calamity befalls the company and the development environment is even temporarily lost, then that organization is in big trouble. Many organizations are very restrictive about VPNs and other related remote access technologies because they feel they will be opening security holes and fear their secrets will be lost. A fair concern. But because everyone has to pass through the physical security, everything is concentrated in one place.

At Upoc, there weren't any overarching corporate security concerns – the employees were a hard working tightly knit crew. The engineering team, however, did almost all of its work at the office. After all, everything was there. The development server. The code repository. The binder with the use cases and the project plan. The bug database.

So when we lost our office, we were crippled. It was very nearly a death blow. Thankfully, our production platform was remarkably stable through those weeks, and we were able to keep it running without needing any serious bug fixes or code updates. Development, however, ground to a complete halt. We had a product launch for a major company that was fast approaching. And, of course, people sitting idle still draw a paycheque and for a small company burning venture capital, every single day counts.

  • A critical component of keeping a company running is ensuring the ability for the development team to keep working even if the office infrastructure is damaged or unavailable.

  • Ensure that you have multiple copies of all the critical systems. If you can implement things in such a way that staff can work on projects when outside the office, then you're . Obviously, this is easy for open-source projects, but proprietary companies can implement this too. Just make sure you have:

Physical separation of infrastructure

By having infrastructure located in different physical locations, you gain robustness against this sort of disaster. Anyone who has worked in systems administration in recent years knows, however, that it doesn't take a disaster on the scale New York experienced two years ago to cause a crisis – if someone takes a shovel to the wrong patch of ground, you can lose your Internet connection.

Small companies which are technically savvy tend to think that they can self-host their hardware. Choosing to do so can make things “easier” perhaps, but it takes an enormous amount of money to duplicate the infrastructure that a well provisioned professional datacenter can provide. It is almost certain that the datacenter will have better Internet connectivity, better power redundancy, and better physical security. To build the kind of redundancy you need to survive normal operating conditions, let alone disaster, will take resources that a small company normally just doesn't have.

And while external co-location may seem expensive, the advantages gained are significant – not the least of which is that it takes you down the road of physically separating your infrastructure, and forces you into the discipline to be able to operate and administer your platform remotely.

Thinking ahead

There is one simple step anyone can take to be better prepared for the unexpected. Think ahead.

As the Internet boom turned to bust, overcapacity was taking its toll on the infrastructure providers – particularly hosting providers. We had given some thought to what would happen if our hosting provider went bankrupt – considerations such as what we might do to relocate the production platform, how we might temporarily run it from another location – which led us to thinking about what it would take to move IP blocks and the value of having our DNS be served from a third party location.

Well, as it turns out, we didn't lose production. We lost the office instead.

But we'd thought about the sort of issues that relocation would cause, and that meant we were better prepared when we were suddenly faced with the prospect of having to move our office.

In western militaries, ethics is taught by presenting situations to people, then forcing them to think about what they would do if faced with a similar situation. This is more than just simple preparation or training. In critical situations, there isn't time to debate “the right thing to do”. You're going to make decisions quickly, and with incomplete information. If you've thought things through ahead of time, if you've taken the opportunity to think about what to do in different situations, then when confronted with the unexpected you will have the preparation you need to do the right thing.

  • Take the time to think about what you would do if you lost your office. Don't just do this alone. Do this with your systems staff, development teams, finance people, everyone. They will know (far better than you) what is critical to their day-to-day work.

  • Involving them in the process will get them thinking ahead. A small company doesn't need to have an elaborate “Disaster Recovery Plan”. It just needs people who are on the ball making sure that they're prepared for adversity.

Redundancy is not just about hardware. It's about people.

While most Unix Systems Administrators tend to think of their skill sets as generic (“oh, any competent sysadmin should be able to do that”) the reality in any team is that certain people specialize in certain tasks (“Collin is the network god. Brian is the backup wizard”). This doesn't do a business (small or large) any good – especially in disaster. Brian's Internet connection may be knocked out. Brian may be in the hospital. Who is going to restore the backups?

While the aforementioned areas may be legitimate areas of specialization, this concentration tends to bleed into the activities it takes to keep the business going. There are certainly activities which are needed on a routine basis – launching new content, perhaps – and there is no reason that these actions should be the province of one person alone.

Yet they often are – with the result that in a disaster situation, suddenly the best backup system in the world is useless if a critical business function can't happen because the one person who knows how to make it happen is dead.

  • Ensuring that activities such as these are clearly documented is key. Clearly set procedures to follow when taking care of tasks – routine or otherwise – mean that business functions and recurring activities can be run through, without mistakes, when the pressure is on.

  • Establishing effective procedures is often thought of as separate from ensuring the ability of the business to continue through adversity, but with procedures as a foundation, the traumas of day to day operations – let alone disaster – are reduced.

Be prepared to give people a break

For all that I have concentrated on ensuring that you can keep people working, I will close with a different thought: be prepared to give people a break – as in days off. Not because people are necessarily psychologically traumatized (some were) but because sometimes you just need to get away and see something else. My partner and I went to Los Angeles for a weekend in mid October, and it was both a figurative and literal breath of fresh air. It made all the difference.

  • People often just need a bit of time to come to grips with what they've been through. Don't be afraid to give them a little time. Of course, that's good advice for any situation.

Conclusion

11 September 2001 was a dark day. Thousands lost their lives, tens of thousands lost their jobs, and millions were directly affected. But in the aftermath, you do your part to get things back together, and that's your contribution to showing the enemy that our way of life is stronger, and that they won't beat us, no matter what they do.

The most important part about being prepared for crisis is just to have thought things through. Inevitably (Mr Murphy being the kind of fellow that he is) the situations you face won't be exactly as you planned. But if you've given serious consideration to how you might respond to various scenarios, then when you're faced with the unthinkable, you'll respond professionally and do credit to yourself and your organization.

And hey, you might even save the day.
