What would you do if you lost your office?
Disaster Recovery in
Lower Manhattan, September 2001
Andrew
Frederick Cowie - Operational Dynamics
Abstract
“Have you ever thought what would
happen if you lost your office?” That happened to us at Upoc
when terrorists struck the World Trade Center in New York - our
office was two blocks away. Disaster recovery planning is something
that is often neglected by small companies, but it is as important
for them as any other. This paper discusses the experiences of a
small company located in lower Manhattan and the lessons we learned
as a result of having lived through 11 September 2001.
Our office was buried in dust, and we
lost the use of all the systems within it for over three weeks.
Some things we did right - our production
platform was remotely located. By virtue of running Unix and having
deliberately ensured that the systems were able to be remotely
administrated, we kept our live systems running during the disaster.
Other lessons we learned the hard way.
The importance of decentralizing (notably DNS). Ensuring that you
have alternate ways to communicate with everyone in your company.
Making sure that critical functions continue – especially
corporate email and software development. Although we'd thought long
and hard about what to do if we lost our production platform, losing
our development platform was almost a knockout blow.
Despite the magnitude of possible
situations that you might face, it is possible to plan for disaster.
You can be prepared for emergencies if you think ahead – and
doing so improves the organization's ability to deal systematically
not only with crisis but with more mundane day to day problems as
well.
Author
Andrew Cowie
is an operations consultant based in Sydney. He runs a company which
focuses on usability, scalability and maintainability by helping
clients create effective teams, establish procedures, and improve
systems performance. You can reach him
at andrew@operationaldynamics.com
Copyright
Copyright
© 2003 Operational Dynamics Consulting Pty Ltd, All Rights
Reserved.
Introduction
Most of us have heard of “Disaster
Recovery”. More generally referred to as “Business
Continuity”, it's something that large corporations do, right?
Visions of large empty rooms with rows and rows of silent computers
waiting for a disaster to drive workers away from their offices and
into the emergency alternate facilities.
The reality for small and medium sized businesses
is different. Capital is tight. It is almost always out of the
question to have standby facilities sitting idle, just waiting –
if any budget is available it is going to be spent on production
equipment and on the staff needed to make things run.
Small companies can, however, do effective
business continuity planning – and can survive disaster –
as long as they think ahead. It doesn't necessarily take a big
budget, but it does take the right mindset.
In September of 2001, I was working as Director of
Operations for Upoc, Inc, a mobile
communities platform (think SMS mailing lists) whose company office
was located in downtown New York. This paper will describe our
experiences in the aftermath of the terrorist strike, discuss the
measures we were forced to take, and offer some thoughts on what your
organization can do to be prepared for disaster.
The scene
What we had
A modest office with about 40 people, and a remote
datacenter.
Our production platform, and staging test servers
were co-located in a large datacenter hotel
in central New Jersey. [That's far enough away from New York that it
was a pain for the Systems Administrators to have to hike out there,
but close enough that it could be reached relatively quickly if we
had a hardware failure in the middle of the night. We like to take
credit for foresight and good planning for having stuck the
production platform way out in the boonies, but the truth is that at
the height of the Internet boom, there simply wasn't any datacenter
space to be had in Manhattan, period].
We had all Sun
hardware running Solaris
throughout, with a Cisco network.
All the servers and networking gear were configured for high
availability – two or more machines of each class when they
could run in parallel, and larger machines with fully capable
alternates clustered in a hot standby configuration for functions
that relied on a single node. We had put great effort into ensuring
reliability and robustness for the production platform.
Our corporate office, on the other hand, had the
usual systems – corporate email, shared file server, accounting
system, and development server, along with desktops for all the
employees. We had RAID storage backing up the email store and the
development box, and even had a pair of T1 uplinks (each from a
different ISP) to provide redundancy.
Our office was located in Lower Manhattan. For
those of you who haven't been there, New York has several areas which
in most other cities would rate “downtown” but there is
simply called “the financial district”. Our offices were
right near the corner of a couple streets called Wall and Broadway.
You might have heard of them.
That intersection is two blocks away from the site
of the World Trade Center.
What we lost
At first we didn't know. In fact, dealing with
uncertainty would become a major factor in how events unfolded for
us.
When the planes first struck, we had about 10
people in the office – there for an early morning
teleconference. When the first building collapsed, we had the
somewhat terrifying experience of still having IP connectivity to the
office – but suddenly no one was answering their instant
messengers. We didn't know what happened to them – and no one
was answering their cell phones.
It turns out that some of our staff had fled the
building; others, who were outside already, were half buried in
debris. It took us almost the entire day, but ultimately we were able
to track everyone down. Thankfully we hadn't lost anyone. Many other
companies weren't so lucky.
We did, however, lose access to all the systems in
the office. Suddenly no one was able to receive email anymore. The
accounting books were safely locked up in the office manager's
machine – which was in a building that we couldn't get to. And
all the tools that facilitated development of our code were there
too.
For several days, we didn't even know whether we
had an office anymore or not. Even assuming the entire downtown area
didn't burn away Great Fire of Chicago 1871 style, what shape was the
building in? Was it structurally sound? Even if we could reoccupy the
office, what would we find? And would services be restored to the
building anytime soon? If the building didn't have heat, light and
telecoms, then we wouldn't be able to work there anymore even if they
did let us reoccupy. A lot of uncertainty.
That's a point that needs emphasizing –
telecoms. It doesn't matter if your building is habitable. If the
phones don't work, and if your Internet connection isn't lit up, then
what's the point?
It rapidly became
apparent that we weren't going to be getting back to our office
anytime soon. Not only did we not know the status of our office, but
Lower Manhattan was sealed off. So, for the two weeks immediately
following the disaster, we were in a mad scramble to try to find new
office space.
A fellow start up company called Giant Bear (I
never did figure out what they did) located in midtown Manhattan,
graciously let us double up in their office space for those few
weeks. That allowed two critical groups to have a place to meet:
management (let's figure out if we can save the company, shall we?)
and operations (let's see if we can keep the company running!).
The bigger picture
From my travels since that time, I have come to
realize that, not unexpectedly, most of the attention from the
worldwide media focused on the tall buildings, their collapse, and
the large smoking crater that resulted. What seems to have been given
relatively little coverage is the effect on the rest of
downtown New York.
Little details like:
The entire power grid was smashed. 7 World Trade
Center (the building that caught fire later in the day and collapsed
that evening) happened to have a major electrical power substation in
it. And almost all the buildings in the immediate area took their
power feed from that substation. Oops.
Almost every
diesel auxiliary generator failed within hours. Many buildings had
emergency generators, but guess where they were located? On the
roof. The concrete dust and soot that descended on the area clogged
up the air intakes and jammed up the gearing. Even most of those
with proper environmental filters ran out of fuel – for quite
some time nobody was allowed to go anywhere near the disaster site
with a truck load of gas or oil. Things were a little paranoid for a
while.
The
phone system was decimated. Right beside 7 WTC was 140 West St –
a building full of telecoms and phone switches. It was partially
destroyed by fire, and severely damaged by all the water they poured
in to put out the
fire (electrical equipment tends not to take too well to being
soaked). As a result, hundreds of thousands of subscribers
(i.e. most of the businesses in the financial district) lost their
phone connections and data links (which, of course, are often
carried through the 'last mile' on local loops provided by the phone
company). Hundreds of thousands more (this time residential
customers) lost their connections when Verizon yanked out and
rerouted trunk lines to reestablish connectivity for the New York
Stock Exchange.
The cell phone
system was saturated. With the landline
system knocked out, everyone kept trying to make mobile calls. Most
of the cells in the immediate area lost power and/or uplink,
and the cells further away were jammed.
Even once we were able to reoccupy our buildings
and start the cleanup, it was not exactly a nice place to work:
The fire didn't go
out until the first week of December. Depending on which way the
wind was blowing, you almost couldn't breathe. Certainly the
authorities were encouraging us not to breathe the air. Don't
know about you, but that's hard to do for long periods of time. Like
a few months.
We're not done
with the power yet: for over six months, the power lines were running
along the streets in wooden boxes hammered together as cable runs
with white and orange stripe paint on them while ConEd dug up every
street in lower Manhattan and relaid the power grid. New York is not
exactly a well laid out city to begin with – imagine how hard
it was to get a delivery truck anywhere near your office?
Situation management
What's it like to cope with an emergency
situation?
Not knowing the situation ... accurate information
is the most valuable commodity going in a complex and fast-moving
crisis. You have to make quick decisions under time
pressure and with incomplete information, but act you must. You're
going to wear the consequences of the decisions you make for a long
time to come, so anything you can do to get the best possible picture
of the situation is of vital importance.
We didn't know the state of our office; finding
out what was there (if anything) was critical. Four days after the
disaster, three of us were able to make it down to our building.
Climbing the stairwell by flashlight with some of the building staff,
we made it into our office. We found that some of the windows had
blown open and that everything was buried under 3 centimeters of dust
and ash. It was a total mess. We were able to recover four critical
items:
the company cheque
book (so we could make payroll and pay emergency expenses)
the backup tapes
(along with a tape drive, a router, and some other basic hardware
we'd need to set up somewhere else)
the current
project plan and functional specifications binder (so we'd know what
Engineering was supposed to be working on)
Vince's walkman
(one of my sysadmins had survived a building nearly falling on him,
but was traumatized by having left his CD player behind).
While recovering those items was important (ok,
maybe not the CD player), of far greater value was having gained
information about what shape our office was in. That allowed us to
make decisions – and to realize that if we could go back into
our premises before too long, we could vacuum the place down and make
it a productive workplace once again.
Emergency communications
Ok, so the fire brigades and police have radios.
What do you have?
In a situation like we faced, being able to get
information to people is critical. It's not just management and
operations who are seeking to know what's going on – everyone
else in your company needs to know too. By being able to reach out to
them and let them know the story, then you reassure them, keep them
from guessing, and prevent the kinds of rumours that lead to panic.
Of course, sending an email to “All”
isn't going to work if a) the corporate mail server isn't running
anymore and b) no one is plugged into it even if it was –
because no one is in the office! You need to have an ability to get
messages to people via a means independent of the office. A simple
mailing list hosted on one of the production servers might do the
trick. A Yahoo! Group. Anything. The catch is that you can't set this
sort of thing up after the fact (or rather, if you do, it'll be a
nightmare). If you take the time to gather emergency email contacts
from people, and keep a mailing list with those contacts up to date,
then in the event of an emergency you'll be better able to reach your
people.
Managers at all levels need to be able to reach
out to their people. Most companies have organization charts, and
most good leaders will know how to contact their people “at
home”. Rather than just an informal practice, this needs to be
codified as a best practice for everyone in the organization to
follow – and not just managers. Since certain links in the
communications chain may be unable to pass messages along, you need
redundancy in your ability to pass critical messages. If team members
know how to get in touch with each other, then the likelihood that a
critical message will get through to everyone is that much higher.
Lessons learned and recommendations
SMS is surprisingly useful
I described above how the mobile phone network was
saturated, making it almost impossible to get through with a voice
call. SMS, however, was flowing fine. Because text messages are
carried on spare bandwidth in the cell phone signaling and control
layer frequencies, and it only takes a small percentage of available
signaling capacity to construct or tear down a call, SMS were able to
get through even though all the voice frequencies were busy.
As mentioned, we were
even further ahead in that Upoc's service is to send messages to
groups of cell phones. We had a group with all the employees
in it, so everyone was able to intercommunicate quite easily despite
the fact that we were scattered all over the city and even though no one
could get a voice call through.
Nevertheless, if staff
have up-to-date contact information about their co-workers,
superiors, and subordinates, then in the event of emergency SMS can
be used as a reliable way to get messages through.
So,
Ensure you have
contact information for everyone around you – an emergency
email address, home phone number, and mobile numbers. If they have a
girlfriend in San Francisco, make sure you find out that number too.
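A contact list like this is easy to audit before you need it. What follows is only a minimal Python sketch: the `Contact` record, its field names, the `under_covered` helper and the sample people are all invented for illustration, and none of it is something Upoc actually ran. It flags anyone who can be reached through fewer than two independent channels.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    """One person's out-of-office contact details (hypothetical fields)."""
    name: str
    emergency_email: str = ""
    home_phone: str = ""
    mobile: str = ""

    def channels(self):
        """Return the labels of the channels that are actually filled in."""
        candidates = (
            ("email", self.emergency_email),
            ("home", self.home_phone),
            ("mobile", self.mobile),
        )
        return [label for label, value in candidates if value.strip()]


def under_covered(roster, minimum=2):
    """Names of people reachable on fewer than `minimum` channels."""
    return [c.name for c in roster if len(c.channels()) < minimum]


# Made-up sample data.
roster = [
    Contact("Alice", "alice@example.org", "555-0100", "555-0101"),
    Contact("Bob", mobile="555-0102"),
]

print(under_covered(roster))  # ['Bob']
```

Running a check like this every few months, and chasing up the names it prints, is probably enough for a small company.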
Ability to remote operate is key
You need to be able to administer the production
platform remotely and to keep systems running with a decentralized
team. It was two weeks before we managed to bring the entire company
back together again in the same physical location, and almost two
weeks after that before our office was suitable as a workplace and
network control center.
Thankfully, our production platform ran Unix, and
we had deliberately ensured we could operate all the systems
remotely.
As is common when the production platform is
separate from the normal day to day office, we had a private VPN (in
this case carried by a frame-relay circuit) from our office to our
production network backbone. This allowed us to efficiently
administer the Live site from our office. We had, however, also
ensured that the platform was also accessible via the Internet. We
used locked down bastion hosts and careful access control lists to be
sure, but we recognized that there would be times that the
administrative VPN would be down and that we would still need to be
able to systems administer production regardless.
This decision proved to be critical in surviving
September 11th – if we had not enabled our sysadmins to work
from outside the office (i.e. from home), then we would have been
unable to access our platform from the moment the power failed in
Lower Manhattan until two weeks later when it was restored and a week
after that when the frame relay finally came back up.
As it happened, our systems team, spread around
the area, was able to keep in touch and keep production running.
Make sure your
staff can run the platform in all respects without
needing to be in the office. Don't rely on “connecting to the
office first, then across the VPN to production”. Yes,
securing such access takes careful planning, but the advantage in
productivity (not to mention emergencies) is worth the effort.
Unix rocks in the
datacenter because it is so easy to remotely administer. I wouldn't
want to have tried any of this if we'd been running Windows servers.
Ensure all your
Systems staff have high speed (broadband) Internet access. They work
very hard, and long hours. The least you can do is pick up the tab
for their access from home. After all, they're going to be working
from home all the time; you might as well do what you can to make
them as productive as possible so they can get back to their own
lives.
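The "don't route everything through the office" rule can be checked on paper by treating your access paths as a small graph and asking whether production is still reachable with the office removed. The sketch below is an illustration only; the site names (`home`, `office`, `bastion`, `production`) and the `reachable` helper are assumptions, not a description of Upoc's actual network.

```python
from collections import deque


def reachable(links, start, goal, down=frozenset()):
    """Breadth-first search over access links, skipping any site in `down`."""
    if start in down or goal in down:
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        site = queue.popleft()
        if site == goal:
            return True
        for nxt in links.get(site, ()):
            if nxt not in seen and nxt not in down:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical topologies: each key lists the sites you can connect to from it.
office_only = {"home": ["office"], "office": ["production"]}
with_bastion = {
    "home": ["office", "bastion"],
    "office": ["production"],
    "bastion": ["production"],
}

print(reachable(office_only, "home", "production", down={"office"}))   # False
print(reachable(with_bastion, "home", "production", down={"office"}))  # True
```

The same check works for any single site you are worried about: put it in `down` and see what you can still get to.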
Be prepared to reroute email
Of all the Internet
technologies, email is actually pretty good at dealing with
interruptions. If a mail server can't get through to the destination
of a message, it will queue the message and retry a little later.
That's great, and
covers you if your systems are down for a few hours. Even a few days.
But what if you're down for two weeks? Most mail transport agents
(MTAs) tend to give up after a week, if they're allowed to queue
messages even that long.
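The arithmetic is trivial, but worth writing down. This small Python sketch assumes the one-week queue lifetime mentioned above; in practice each sending server sets its own limit, so treat the default as a placeholder. The function name `bounced_before_recovery` is made up for the example.

```python
from datetime import timedelta


def bounced_before_recovery(outage, queue_lifetime=timedelta(days=7)):
    """True if mail queued when the outage began is abandoned before it ends.

    `queue_lifetime` is an assumed sender-side limit, not a fixed standard.
    """
    return outage > queue_lifetime


print(bounced_before_recovery(timedelta(hours=6)))  # False
print(bounced_before_recovery(timedelta(weeks=2)))  # True
```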
Mail delivery is, of
course, directly related to DNS, because it is the mail exchanger
(MX) records which determine where the mail for your domain goes. Our
corporate domain, @upoc-inc.com, was served by the same nameservers
as our production domain, @upoc.com, located in our production site.
Because we had our Live platform running, we were able to get at the
zone files for the corporate domain. We reconfigured one of our
production machines to quietly act as mail receiver for mail destined
for upoc-inc.com, and then adjusted the MX records to send our
company email there instead of the (not responding) server in the
office.
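To see why changing the MX records was enough, it helps to model how a sending server picks a destination: it tries the listed mail exchangers in order of preference (lowest number first) and queues the message if none answers. The following is a simplified Python model, not real resolver code; the hostnames and the `delivery_target` helper are hypothetical.

```python
def delivery_target(mx_records, reachable_hosts):
    """Return the most-preferred reachable MX host, or None.

    `mx_records` is a list of (preference, hostname) pairs; a lower
    preference value is tried first. None means the sender has to queue
    the message and retry later.
    """
    for _preference, host in sorted(mx_records):
        if host in reachable_hosts:
            return host
    return None


# Hypothetical records before and after the emergency change.
before = [(10, "mail.office.example")]
after = [(10, "relay.production.example")]
up = {"relay.production.example"}  # the office server is not responding

print(delivery_target(before, up))  # None
print(delivery_target(after, up))   # relay.production.example
```

Listing a second, off-site exchanger ahead of time at a higher preference number would give the same result without anyone having to touch the zone in a hurry.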
Not losing email
is important, but that's not everything, of course. People need to be
able to read that email. We ran Qmail
as our MTA. Qmail has a very straightforward delivery mechanism
which made it quite a simple matter to arrange to a) keep a copy of
each message (so we could re-inject it into the corporate email
store if / when we ever got it back) and b) forward a copy to the
emergency email box (Yahoo, Hotmail, their home ISP, etc) that our
staff were using.
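The paper does not record the exact configuration, but one plausible way to get this behaviour from qmail is a per-user `.qmail` file, in which a line beginning with `.` or `/` and ending with `/` delivers into a maildir, and a line beginning with `&` forwards to an address. The Python sketch below just renders such a file; the `dot_qmail` helper, the directory name and the address are assumptions for illustration.

```python
def dot_qmail(archive_maildir, forward_to):
    """Render .qmail contents that keep a local copy and forward the rest.

    `archive_maildir` is a maildir path relative to the user's home (it
    must already exist); `forward_to` is a list of external addresses.
    """
    if not archive_maildir.endswith("/"):
        archive_maildir += "/"          # a trailing slash marks a maildir
    if not archive_maildir.startswith(("./", "/")):
        archive_maildir = "./" + archive_maildir
    lines = [archive_maildir]
    lines += ["&" + address for address in forward_to]
    return "\n".join(lines) + "\n"


# Hypothetical staff member using a home ISP mailbox during the outage.
print(dot_qmail("Maildir.archive", ["jane@example.net"]), end="")
```

Because the archived copies sit in an ordinary maildir, re-injecting them into the corporate mail store later is a matter of moving files.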
The final step in the
sequence was to set up a private mail relay so that mail would appear
to come from the proper address (upoc-inc.com) rather than whatever
expedient means an individual was using (webmail.myisp.net). Despite
everything that was going on, we wanted to maintain a professional
appearance and reassure our clients (not to mention our investors)
that the business was still a going concern. A big part of this was
simply restoring the ability of our executives and our sales staff to
communicate with the outside world from email addresses that looked
as they should – the @upoc-inc.com corporate address. Setting
up a mail relay for them to route their email through was a big step
in achieving that.
The recommendations
here are simple yet profound:
Our ability to do
any of this relied on our ability to alter the DNS records. So,
there is great value in finding a really reliable, fault
tolerant and highly available way for your DNS records to be served.
If you can move the primary nameserver for your domain to a third
party location (or outsource it to someone with the ability to
provide redundancy) then you are leaps and bounds ahead if it
becomes necessary to make changes in a hurry. The registrars tend
not to be very good in this regard as they tend to have slow
and cumbersome update processes; someone like dyndns.org
who is in the business of providing reliable DNS service with an
excellent update interface and the ability to easily update zone
records is ideal.
Keep the
time-to-live (TTL) values low, at least for MX records! It doesn't
do much good to change the zone records if everyone already has
answers that will still be valid for weeks to come.
If you've planned
ahead to be able to reroute email in the event of emergency, why not
go all the way and set up a webmail system to go along with the
off-site queuing relay? It's easy to do ahead of time. And that's
when you've got the time to do it right – far better ahead of
time than after the disaster has already happened.
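The point about TTLs is easy to audit. A record's TTL is the longest a resolver may keep serving the old answer after you change it, so any long TTL is a delay you will not be able to shorten during an emergency. This is a minimal Python sketch; the one-hour threshold, the `slow_to_change` helper and the sample zone data are assumptions, not recommendations from the original text.

```python
def slow_to_change(records, max_ttl=3600):
    """Return the records whose TTL (in seconds) exceeds `max_ttl`."""
    return [record for record in records if record[2] > max_ttl]


# Hypothetical zone data as (name, type, ttl_seconds).
zone = [
    ("example.com.", "MX", 604800),  # one week
    ("example.com.", "A", 300),      # five minutes
]

for name, rtype, ttl in slow_to_change(zone):
    print(f"{name} {rtype}: old answers may be cached for up to {ttl / 3600:.0f} hours")
```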
Ability to remote develop is very important
Our engineers and developers effectively got a two
week vacation. Not because they didn't want to help, but because the
company was completely unprepared to support their efforts in a
distributed, decentralized manner.
What would happen to GNOME
if the CVS repository was lost? Well, nothing, because there are
copies of the current code in multiple places (not the least of which
are thousands of individual developers' machines worldwide) and the
repository itself is backed up and mirrored by several independent
organizations.
It's not like that in most companies, however.
Production facilities are usually provisioned for high availability,
heavy load, and redundancy. But internal development and testing
facilities are given the same protections in only the rarest of
circumstances. The corporate email system is usually at least
moderately well protected, but that's it.
Development at most companies is conducted within
tightly secure environments – for perfectly good reasons –
but the result is that if a calamity befalls the company and the
development environment is even temporarily lost, then that
organization is in big trouble. Many organizations are very
restrictive about VPNs and other related remote access technologies
because they feel they will be opening security holes and fear their
secrets will be lost. A fair concern. But because everyone has to
pass through the physical security, everything is concentrated in one
place.
At Upoc, there weren't any overarching corporate
security concerns – the employees were a hard working tightly
knit crew. The engineering team, however, did almost all of its work
at the office. After all,
everything was there. The development server. The code
repository. The binder with the use cases and the project plan. The
bug database.
So when we lost our office, we were crippled. It
was very nearly a death blow. Thankfully, our production platform was
remarkably stable through those weeks, and we were able to keep it
running without needing any serious bug fixes or code updates.
Development, however, ground to a complete halt. We had a product
launch for a major company that was fast approaching. And, of course,
people sitting idle still draw a paycheque and for a small company
burning venture capital, every single day counts.
A critical
component of keeping a company running is ensuring that the
development team can keep working even if the office
infrastructure is damaged or unavailable.
Ensure that you
have multiple copies of all the critical systems. If you can
implement things in such a way that staff can work on projects when
outside the office, then you're well ahead. Obviously, this is easy for
open-source projects, but proprietary companies can implement this
too. Just make sure you have:
Physical separation of infrastructure
By having infrastructure located in different
physical locations, you gain robustness against this sort of
disaster. Anyone who has worked in systems administration in recent
years knows, however, that it doesn't take a disaster on the scale
New York experienced two years ago to cause a crisis – if
someone takes a shovel to the wrong patch of ground, you can lose
your Internet connection.
Small companies which are technically savvy tend
to think that they can self-host their hardware. Choosing to do so
can make things “easier” perhaps, but it takes an
enormous amount of money to duplicate the infrastructure that a well
provisioned professional datacenter can provide. It is almost certain
that they will have better Internet connectivity, better power
redundancy, and better physical security. To build the kind of
redundancy you need to survive normal operating conditions, let alone
disaster, will take resources that a small company normally just
doesn't have.
And while external co-location may seem expensive,
the advantages gained are significant – not the least of which
is that it takes you down the road of physically separating your
infrastructure, and forces you into the discipline to be able
to operate and administer your platform remotely.
Thinking ahead
There is one simple step anyone can take to be
better prepared for the unexpected. Think ahead.
As the Internet boom turned to bust, overcapacity
was taking its toll on the infrastructure providers –
particularly hosting providers. We had given some thought to what
would happen if our hosting provider went bankrupt –
considerations such as what we might do to relocate the production
platform, how we might temporarily run it from another location –
which led us to thinking about what it would take to move IP blocks
and the value of having our DNS be served from a third party
location.
Well, as it turns out, we didn't lose production.
We lost the office instead.
But we'd thought
about the sort of issues that relocation would cause, and that meant
we were better prepared when we were suddenly faced with the prospect
of having to move our office.
In
western militaries, ethics is taught by presenting situations to
people, then forcing them to think about what they
would do if faced with a similar situation. This is more than just
simple preparation or training. In critical situations, there isn't
time to debate “the right thing to do”. You're going to
make decisions quickly, and with incomplete information. If you've
thought things through ahead of time, if you've taken the opportunity
to think about what to do in different situations, then when
confronted with the unexpected you will have the preparation you need
to do the right thing.
Take the time to
think about what you would do if you lost your office. Don't
just do this alone. Do this with your systems staff, development
teams, finance people, everyone. They will know (far better than
you) what is critical to their day-to-day work.
Involving them in
the process will get them thinking ahead. A small company
doesn't need to have an elaborate “Disaster Recovery
Plan”. It just needs people who are on the ball making sure
that they're prepared for adversity.
Redundancy is not just about hardware. It's about
people.
While most Unix Systems Administrators tend to
think of their skill sets as generic (“oh, any competent
sysadmin should be able to do that”) the reality in any team is
that certain people specialize in certain tasks (“Collin is the
network god. Brian is the backup wizard”). This doesn't do a
business (small or large) any good – especially in disaster.
Brian's Internet connection may be knocked out. Brian may be in the
hospital. Who is going to restore the backups?
While the aforementioned areas may be
legitimate areas of specialization, this concentration tends
to bleed into the activities it takes to keep the business going.
There are certainly activities which are needed on a routine basis –
launching new content, perhaps – and there is no reason that
these actions should be the province of one person alone.
Yet they often are – with the result that in
a disaster situation, suddenly the best backup system in the world is
useless if a critical business function can't happen because the one
person who knows how to make it happen is dead.
Ensuring that
activities such as these are clearly documented is key. Clearly set
procedures to follow when taking care of tasks – routine or
otherwise – mean that business functions and recurring
activities can be run through, without mistakes, when the pressure
is on.
Establishing
effective procedures is often thought of as separate from ensuring
the ability of the business to continue through adversity, but with
procedures as a foundation, then the traumas of day to day
operations – let alone disaster – are reduced.
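Finding the tasks that depend on a single person does not need anything elaborate. The sketch below assumes you can write down, for each recurring task, who knows how to do it; the `single_points_of_failure` helper and the task list are illustrative, and "Dana" is a made-up name alongside the Brian and Collin mentioned earlier.

```python
def single_points_of_failure(skills):
    """Map each task that exactly one person can do to that person."""
    return {
        task: next(iter(people))
        for task, people in skills.items()
        if len(people) == 1
    }


# Hypothetical skills matrix: task -> set of people who can perform it.
skills = {
    "restore backups": {"Brian"},
    "network changes": {"Collin", "Brian"},
    "launch new content": {"Dana"},
}

for task, person in sorted(single_points_of_failure(skills).items()):
    print(f"Only {person} can {task}: document it and train a second person")
```

Each line this prints is a candidate for a written procedure and a second pair of hands.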
Be prepared to give people a break
For
all that I have concentrated on ensuring that you can keep people
working, I will close with a different thought: be
prepared to give people a break – as in days off. Not
because people are necessarily psychologically traumatized (some
were) but because sometimes you just need to get away and see
something else. My partner and I went to Los Angeles for a weekend in
mid October, and it was both a figurative and literal breath of fresh
air. It made all the difference.
Conclusion
11 September 2001 was a dark day. Thousands lost
their lives, tens of thousands lost their jobs, and millions were
directly affected. But in the aftermath, you do your part to get
things back together, and that's your contribution to showing the
enemy that our way of life is stronger, and that they won't beat us,
no matter what they do.
The most important part about being prepared for
crisis is just to have thought things through. Inevitably (Mr Murphy
being the kind of fellow that he is) the situations you face won't be
exactly as you planned. But if you've given serious consideration to
how you might respond to various scenarios, then when you're faced
with the unthinkable, you'll respond professionally and do credit to
yourself and your organization.
And hey, you might even save the day.