We'll also talk about the proliferation
of web interfaces, alternative architectures to the conventional
e-commerce platform, and the debate between congruent and convergent
configuration management.
Finally, we'll look to the future and
consider the hype surrounding grid computing and what impact it will
have on the theory and practice of infrastructure architecture.
This is an extended abstract only. The full paper is still a work
in progress as it involves consolidation of views from authors around
the world and we are building on feedback from several conferences.
Andrew Cowie is a management consultant based in Sydney with
clients worldwide. His firm specializes in strategy, organizational
architecture, procedures to survive change, and performance hardening
for the people and systems behind the mission critical enterprise. He
helps people improve the effectiveness of their technology by
driving usability, scalability, maintainability and security through
leadership, teamwork, change management, and hands-on systems tuning.
You can reach him at andrew@operationaldynamics.com
Copyright © 2005 Operational Dynamics Consulting Pty Ltd, All
Rights Reserved. Permission to redistribute this document may be
obtained by contacting us.
We believe the ideas presented here are broadly applicable and so
encourage you to make use of this paper in your organization. If you
do, please contact us so that your experiences and views can be
incorporated into further research on this subject.
Introduction
The Systems Administration world is a challenging one indeed. It
is a métier which requires broad scope, detailed knowledge, and a
flair for great craftsmanship. It involves not just running the
systems, but also planning, architecture, design and, not least,
purchasing the underlying hardware that said infrastructure is going
to live on.
Keeping track of it all isn't easy, and even less so is carrying
out these myriad responsibilities. The range of choices which
confronts IT professionals when it comes time to choose what
infrastructure they are going to use to build and maintain their
systems is dazzling.
Of course, the sharks smell the blood in the water, so there are
any number of vendors (and consultants!) who want to sell you the
solution to all your problems. Rarely, though, do such vendors have
as much of a stake in your systems running as you do. Their one part
may work well, but it's the system taken as a whole that has to
work, and trying to comprehend the enormity of that challenge is one
of the themes we'll be discussing.
At the end of the day, it's your datacenter and your platform.
It's you that has to live with it and its complexity, not the
vendors. Helping you keep abreast of what's out there is what this
paper will be about.
A divergent industry
As we look at the offerings to the marketplace by the leading
vendors, we realize that there is clearly disagreement in the
industry today about what the best approaches to deploying and
managing complex infrastructure are. Somewhat unusually for a
marketplace as mature as you would have thought IT should be, the
approaches to solving these problems of provisioning and managing
infrastructure are diverging.
Horizontal vs Vertical Scaling
This one is quick: on the one hand, you get people advocating
that you add more and more servers to handle load in parallel
(scaling horizontally). On the other, you have people
advocating the use of great big machines – you deal with load
by getting an ever larger server with more CPUs and more memory and
faster disks and did I mention more CPUs? That's scaling
vertically.
In practice, you end up with both approaches being appropriate.
Conventionally, databases tend to need big iron to run on. Web
servers, on the other hand, being I/O bound, tend to do well scaled
out so that you can have more and more machines spewing content back
to users.
The trend here isn't the definitions above so much as
seeing people using the “wrong” approach. Some
e-commerce shops have managed to spread databases out over lots of
medium size machines and it works quite well, thank you very much.
And it's not uncommon to see someone with an enormous machine doing
everything – serving web content, running application servers
and their underlying database engines, doing mail forwarding –
the works. The point is that what you hear people telling you is the
“right” way to do something is not necessarily the only
way to approach a problem.
Server consolidation vs Increasing complexity
Anyone reading this is no doubt aware of the trend over the last
few years towards server consolidation. While the cost of
individual computer components may drop over time, the price tag of
top-quality, reliable, robust systems has ever been high. Enormous
growth in the demand for IT across organizations has often meant
departments have overlapping systems, effort is duplicated or
wasted, and opportunities for efficiencies are lost – and
older obsolete hardware is sometimes maintained even though it can
be expensive to do so. This has meant that cost savings are
frequently the motivation for such consolidation.
(Interestingly, though, there's always been a counter factor
against reducing the number of boxes, and that is simply that in the
Unix and Linux world, machines have an excellent track record of working
reliably and staying that way. It doesn't override the cost
argument, but it's worth noting that “it doesn't work”
is rarely the reason for decommissioning a Unix system!)
Increasing complexity, on the other hand, motivates you to
have more servers rather than fewer. The more complex any application
is, the more likely it is to need its own box. It's not just the
load that use of an application or service puts on a machine –
although that's what drives the scaling described above – it's
also the complexity of installing and maintaining the application on
it. Indeed, best practice with complex enterprise applications is to
have one box for each logical function – application servers
here, mail servers there, databases in the middle – and not to
mix them up. Of course, said best practice means running more boxes,
not fewer. So much for convergence of approaches.
I would note that at least one of the factors driving server
consolidation is the fact that the average IT shop hasn't put
in place effective practices to manage large numbers of systems. If
they knew more about doing that well, then perhaps they wouldn't be
so worried about the large numbers of nodes.
Of course, some advocates of server consolidation aren't talking
about reducing the number of logical systems, they're just
talking about reducing the number of physical boxes. That
leads us to the next topic.
Blade servers vs Virtualization
Virtualization is the technique of running more than one
logical system on a single physical machine. This means not just
multi-tasking processes, but multi-tasking entire operating systems
and the applications installed therein.
The idea of emulation has been around for a long time, but
usually it was in the sense of “emulate the environment of a
machine running Mac OS 8 so I can run old programs on my Mac OS X
system”. What's somewhat newer is having machines with enough
horsepower to handle the overhead.
Even so, there's an overhead. People tend to miss the point that
ultimately their processes are all in contention for the same
physical interfaces. If a system is I/O bound, then adding more
virtual servers “because there's spare CPU” is
completely counterproductive.
Going the other direction are blade servers.
You've all seen pictures of densely packed cabinets with a hundred
or more “servers” in them. Each blade server is a CPU
(or two) on a card that may (or may not) have its own hard drive.
Typically each node in such an installation will be connected to a
very large, very fast disk array that is used in common by all the
blades.
Blade servers are ideal when horizontally scaling; they also find a
perfect match as the hardware aggregated together to make the
clustered supercomputers used for research and modeling. Of course,
all those CPUs so closely packed together do tend to generate a bit
of heat, so you rather need to make sure that there's enough air
conditioning in the room...
YADWIIHTLIT
(Yet Another Damn Web Interface I Have To Log In To!)
It seems that every single tool today boasts a web interface. For
many complex sub-systems, it's the only way to access or control
them. Such interfaces provide “universality of access”,
the vendors tell us. But every such system that has a web interface
inevitably has its own logins, trouble alerts, and backup. How are
you supposed to stay on top of it all?
This trend finally penetrated my thick skull when I was at a
demonstration by a certain Scandinavian company of one of their
firewall products. The sales engineer was up there trying to log
into the 1RU hardware device she'd brought along. She had already
clearly established that she was a very bright person, but it took
her like eight tries to log in because she'd forgotten what
username/password she needed for this particular device. Then, when
she'd logged in, the sales guy started to go on about the virtues of
the built in trouble ticket system, and how it sent out alerts, how
tasks could be assigned to people and how basically “all your
network management could be done right here”.
Well, no, actually it can't. But if the only realistic way to get
at that individual device's configuration is through a web login,
and if the product is designed to carry out its workflow via an
internal trouble ticket system, how are you supposed to keep up with
the 30 or so different types of such interfaces you have in your
production network?
There's a bigger problem, though. One of the best practices I
will be discussing in a minute is creating configurations (or better
yet entire systems) in an automated manner. That works great for
servers and routers. But in an emergency or otherwise, how do you do
an automated restore or rebuild of a system that you can only
control through its web interface?
Finally, although it is often dismissed as a minor point, these web
interfaces are generated by devices and sub-systems that are buried
deep in your infrastructure. In order for your systems
administrators to be able to reach those interfaces, they need to
have direct network connectivity from their desktop browser all the
way to that system, a system that otherwise is supposed to be
isolated and blocked off from any outside access. Troublesome at
best, show-stopping at worst. What good is a web interface if you
can't get at it without driving to the datacenter?
Alternatives to the conventional e-commerce platform
There is much attention given by the press to the "internet
ecosystem" and in particular there is often an assumption that
a three tier architecture is the only way to approach designing an
e-commerce platform. It turns out that people like Yahoo, Akamai, and
Google are doing things very differently indeed, both in terms of
application design and in terms of how they manage the
infrastructure which their applications live on. We will describe
some of these architectures, and point to the trend of achieving
breakthrough results with surprisingly simple and unconventional
approaches.
Automation of Configuration Management
We'll conclude with a survey of the state-of-the-art in the
configuration management space. The great difficulty of deploying
and configuring ever increasing numbers of systems led to the rise
of first generation approaches like cfengine and isconf. From
consideration of when these tools worked, and more importantly, when
they did not, an entire field of academic study and research has
emerged.
Over the last five years, the USENIX/SAGE “LISA”
conference has emerged as the nexus of the configuration management
world. The leading thinkers and practitioners in this space have
been converging there, and an enormous amount of progress, both
published and colloquial, has been made.
The three major schools of thought are:
Convergence
You may have heard of cfengine.
It's certainly got a catchy name. The technique it uses is to look
at a configuration file and to make changes to that file if it's
missing something it's supposed to have.
Take an Apache config file. Perhaps you want it to have a
“UseCanonicalName Off” directive. You could manually go around each
system and check if
it's there, or you could use a tool like this one. It checks each
system to see if the line is there, and then, if it isn't, it adds
the line. Which is very cool.
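
The check-then-add behaviour just described can be sketched in a few lines of Python. This is only an illustration of the idea, not cfengine's syntax or implementation; the function name `ensure_line` and the sample file contents are assumptions made for the example.

```python
from pathlib import Path
import tempfile


def ensure_line(path: Path, line: str) -> bool:
    """Append `line` to the file at `path` unless it is already present.

    Returns True if the file was changed, False if nothing needed doing.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if any(entry.strip() == line for entry in existing):
        return False  # already in the desired state, so leave it alone
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


if __name__ == "__main__":
    # Hypothetical config file in a scratch directory, not a real Apache install.
    with tempfile.TemporaryDirectory() as tmp:
        conf = Path(tmp) / "httpd.conf"
        conf.write_text("ServerName www.example.com\n")
        print(ensure_line(conf, "UseCanonicalName Off"))  # first pass edits the file
        print(ensure_line(conf, "UseCanonicalName Off"))  # second pass is a no-op
```

Note that the rest of the file is left untouched; the tool only nudges the one thing it knows about, which is exactly why you don't end up knowing what else is in there.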
In a production deployment, you end up specifying in somewhat
exhausting detail classes of machines and what actions are to occur
on those machines. Then the tools grind away, evolving your systems
to reach the desired configuration.
This all sounds great, but in practice it isn't sufficient. The
trouble is that you don't actually end up knowing what's out there.
Worse, mathematically, it can be proved that under various (commonly
encountered) scenarios, the system never converges.
That would seem to be a problem, but it hasn't stopped people
all over the world from following this approach and managing hundreds
of machines at a time.
Congruence
This approach has also been around a while, and originated with
tools like isconf. Here the approach is to generate everything.
Each and every configuration file, from /etc/resolv.conf
through Apache's httpd.conf, right up to the list of packages that
are to be on a box, is produced by processing some files that contain
the description of what's supposed to be on a system, doing some
substitutions or selections, and then overwriting the configuration
files on the target machine with the new versions.
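
As a rough sketch of that generate-and-overwrite cycle, and assuming a deliberately tiny setup, the Python below renders two files from templates and writes them unconditionally. It is not how isconf works internally; the template contents, the relative paths, and the names `render_all` and `apply_description` are all hypothetical.

```python
from pathlib import Path
from string import Template
import tempfile

# Hypothetical templates; a real deployment would keep these under version control.
TEMPLATES = {
    "etc/resolv.conf": Template("search $domain\nnameserver $nameserver\n"),
    "etc/httpd/httpd.conf": Template(
        "ServerName $hostname.$domain\nUseCanonicalName Off\n"
    ),
}


def render_all(description: dict) -> dict:
    """Return {relative path: complete file contents} for one host description."""
    return {rel: tpl.substitute(description) for rel, tpl in TEMPLATES.items()}


def apply_description(root: Path, description: dict) -> None:
    """Write every managed file under `root`, replacing whatever is there."""
    for rel, contents in render_all(description).items():
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(contents)  # unconditional overwrite: local edits are lost


if __name__ == "__main__":
    host = {"hostname": "web1", "domain": "example.com", "nameserver": "192.0.2.53"}
    with tempfile.TemporaryDirectory() as tmp:
        apply_description(Path(tmp), host)
        print((Path(tmp) / "etc/resolv.conf").read_text())
```

Running `apply_description` a second time after a hand edit restores the generated version, which is the behaviour the rest of this section weighs up.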
The upside to this is that, done right, you can be absolutely
certain about what is going to end up on your target systems and
that they will have the correct and current information. At any
time, you can completely
rebuild your infrastructure and know it's going to end up just
right.
The downside is
that this approach doesn't necessarily “play nice with others”
(just like your Mum taught you to); if there are other things that
try to configure some aspect of the system, then this will overwrite
those changes. And it means that you have to get really good at how
to specify all the different classes of systems you have, and how to
allow for exceptions.
Sys admins tend to get pissed off when changes they made on a
system are blown away, but then, that's the whole point.
Interestingly,
both convergence and congruence have been implemented using Make as
the underlying ordering tool. It really is about the big picture
approach you take, not the particular tools you use to enable you to
do so.
Encapsulation
This is the term I've coined to describe the approaches which
advocate taking an object orientated
approach to the problem. They wrap up all the functionality of a
given aspect of the system and provide an external interface to make
changes. The agent is then responsible for starting and stopping
services and all the underlying actions necessary to make the change
take.
This approach has, casually, been out
there for a while – high availability failover clusters can be
thought of as behaving in this manner. What's a bit newer on the
scene is the idea of doing all
your system management through such agents.
The two approaches described above can be considered as a
philosophical adjustment to the way you manage your systems. You no
longer manage the machines manually – you manage them care of tools
which help you get them (and keep them) the way they should be.
Encapsulation
techniques take the philosophical shift to the extreme – you
no longer operate on your systems at all – you merely interact
with the agents that are responsible for encapsulating the behaviour
of the various services on your machines. Continuing the Apache
example, instead of changing an Apache config file by hand, you
merely state “this PHP page is to go into a new virtual host
called blah, make it happen”. The agent would take care of
validating the new page and virtual host name, editing the Apache
configuration files and the DNS records too, and then would restart
whatever services needed to be restarted for the change to take.
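
To make the virtual host example a little more concrete, here is a minimal Python sketch of such an agent. Everything in it is an assumption for illustration: the `WebAgent` class, its method names, and the in-memory dictionaries that stand in for real Apache configuration files and DNS records. No real service is touched.

```python
import re
from dataclasses import dataclass, field


@dataclass
class WebAgent:
    """Hypothetical agent that owns everything behind the web service."""

    domain: str
    vhosts: dict = field(default_factory=dict)        # host name -> page
    dns_records: dict = field(default_factory=dict)   # fqdn -> address
    pending_restarts: set = field(default_factory=set)

    def add_virtual_host(self, name: str, page: str, address: str) -> str:
        """The one entry point callers use; the details stay inside the agent."""
        if not re.fullmatch(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?", name):
            raise ValueError(f"invalid virtual host name: {name!r}")
        if name in self.vhosts:
            raise ValueError(f"virtual host already exists: {name!r}")
        fqdn = f"{name}.{self.domain}"
        self.vhosts[name] = page           # stands in for editing Apache config
        self.dns_records[fqdn] = address   # stands in for updating DNS
        self.pending_restarts.update({"httpd", "named"})
        return fqdn

    def restart_services(self) -> list:
        """Restart whatever the accumulated changes touched (simulated)."""
        restarted = sorted(self.pending_restarts)
        self.pending_restarts.clear()
        return restarted


if __name__ == "__main__":
    agent = WebAgent(domain="example.com")
    print(agent.add_virtual_host("blah", "index.php", "192.0.2.10"))
    print(agent.restart_services())
```

The caller states the outcome it wants and never sees a config file; validation, the edits, and the restarts are the agent's business.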
Sounds like magic. Or maybe voodoo. And it is. This isn't so much
a tool you install as an approach you take. But if you take
configuration management to the limit and really start thinking
about infrastructure architecture as a whole, this is where you end
up.
But does anybody know about these?
Over and above this otherwise factual enumeration, there are
three trends to report on. First is the convergence between these
schools at higher levels of abstraction. Although they all take
different approaches to solving the problem, it seems that once you
get a bit higher up the stack everyone is looking at the problem in
the same terms. That's actually an important point because it means
we've got a common basis for discussion – and it's taken 10
years to reach this point.
Second, people have realized that how you express policy for what
your network and its systems should be ends up driving how you
implement infrastructure management, so even the kind of language
you use to express that policy is of great importance. In fact,
it's the issue of language and “how do I go about constructing
a grammar to even articulate my environment” which is the
common challenge that all the approaches to configuration management
are facing.
Finally, the most important trend is this: back in the real
world very few people are aware of all this work, and so continue to
blunder into the blind alleys that lie in wait for those who
innocently set out to deploy a computing platform. While there's
nothing wrong with making mistakes and learning from experiences,
there is something to be said for avoiding such mistakes when you're
talking about multi-million dollar IT investments. If systems
administration wants to be treated as a profession, then it needs to
raise its game somewhat with regard to institutionalizing these
learnings; employing these techniques with greater rigor is a step
in the right direction.
A few thoughts from the futurists: The rise of grid computing
The hype surrounding grid computing has exceeded even the IT
industry's usual standards of overblown hysterics. As a look towards
the future, though, we should consider the very real impact that
grids of computing power will have on the theory and practice of
infrastructure architecture.
The policy articulation and management work that such grids are
requiring may well render much of current practice of systems
administration obsolete. A moment ago I committed a near heresy by
suggesting that systems administration may not really be a
profession. (It is an occupation, to be sure, and for many an
avocation. The term profession, however, actually has a rather well
founded and rigorous definition. I digress). Whatever you conceive
systems administration to be, much of it may disappear
as grid computing hits the scene.
The opportunity of grid computing is commercial. A large
collection of autonomous machines distributed over wide
geographical areas represents an enormous
inherent resource. When you look at all this power from the
perspective of return on invested capital, the possibility to
leverage unused capacity is compelling indeed. In order to
effectively manage the resources inherent in such a grid, however,
you need a mechanism to describe what is running, where,
and with what constraints. But doing that right at large
scale, especially the what and where parts, means getting an
unprecedented level of automation in place to handle the task.
In a more down to earth example, consider this: as an operations
manager, all I really want to do is specify that I want enough web
servers and mail servers to
handle the load. That load is variable, so I want the allocation to
happen dynamically. And, since there will be contention for
resources at peak times, I'm going to state that our policy is that
there must be at least 10 web servers and a minimum of 4 mail
servers at all times.
And that's it.
Obviously I'm leaving out a lot of
complexity, but if you can make that kind of policy articulation in
a dynamic environment, then one way or another everything to enable
it is going to [have to] Just Work (tm).
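
The policy statement above is small enough to sketch directly. The Python below is one possible reading of it under loudly stated assumptions: the floors of 10 web and 4 mail servers come from the text, but the shared pool, the per-server capacity figures, the load numbers, and the "neediest service first" tie-breaking rule are all invented for the example.

```python
import math

# Policy floors taken from the example in the text.
MINIMUMS = {"web": 10, "mail": 4}


def allocate(pool_size: int, load: dict, capacity_per_server: dict) -> dict:
    """Decide how many servers each service gets from a shared pool.

    Each service wants enough servers for its current load, but never
    fewer than its policy minimum. If the pool cannot satisfy everyone,
    the minimums are honoured first and the remainder is handed out one
    server at a time to whichever service is furthest below its demand.
    """
    if pool_size < sum(MINIMUMS.values()):
        raise ValueError("pool is too small to honour the policy minimums")
    wanted = {
        svc: max(MINIMUMS[svc], math.ceil(load[svc] / capacity_per_server[svc]))
        for svc in MINIMUMS
    }
    granted = dict(MINIMUMS)
    spare = pool_size - sum(granted.values())
    while spare > 0:
        shortfall = {svc: wanted[svc] - granted[svc] for svc in granted}
        neediest = max(shortfall, key=shortfall.get)
        if shortfall[neediest] <= 0:
            break  # every service already has what it asked for
        granted[neediest] += 1
        spare -= 1
    return granted


if __name__ == "__main__":
    # Hypothetical numbers: requests per second and what one server can absorb.
    capacity = {"web": 100, "mail": 100}
    print(allocate(30, {"web": 1500, "mail": 200}, capacity))
    print(allocate(16, {"web": 2000, "mail": 800}, capacity))
```

Even this toy version has to decide what happens under contention, which is a small taste of the complexity being left out.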
The Engineers who architect systems and create such mechanisms
are going to be busy indeed. So are the Technicians who will monitor
such systems. But in the future, such environments may very well be
lacking a job description near and dear to our hearts: System
Administrator.