
This presentation was first given online 17 Dec 04 as a part of the Vth International Conference of Unix at UNINET, Umeet 2004. It was subsequently developed into a full paper, the first draft of which appeared in the proceedings of the 2005 SAGE-AU Systems Administration Conference, Perth. The full version will be given as an invited talk at Large Installation Systems Administration in San Diego, 8 Dec 05.

See below for the draft paper online, or download the paper and print it:


PDF

You can likewise view the presentation slides online here, or download a copy.

A transcript of the original IRC log is available at their site.

See also:

Surviving Change

Disaster Recovery



Trends in Unix (and Linux) Infrastructure Management





Abstract

One of the biggest trends in the industry today is a divergence in the approaches to deploying and managing complex infrastructure. Horizontal vs vertical scaling, server consolidation vs increasing complexity, and blade servers vs virtualization. Everyone claims that their solution is the best, but for which problem? We analyze each of these trends for their technical background and examine the marketing focus that drives them.

We'll also talk about the proliferation of web interfaces, alternative architectures to the conventional e-commerce platform, and the debate between congruent and convergent configuration management.

Finally, we'll look to the future and consider the hype surrounding grid computing and what impact it will have on the theory and practice of infrastructure architecture.



Note

This is an extended abstract only. The full paper is still a work in progress as it involves consolidation of views from authors around the world and we are building on feedback from several conferences.

Author

Andrew Cowie is a management consultant based in Sydney with clients worldwide. His firm specializes in strategy, organizational architecture, procedures to survive change, and performance hardening for the people and systems behind the mission critical enterprise. He helps people improve the effectiveness of their technology by driving usability, scalability, maintainability and security through leadership, teamwork, change management, and hands-on systems tuning. You can reach him at andrew@operationaldynamics.com

Copyright

Copyright © 2005 Operational Dynamics Consulting Pty Ltd, All Rights Reserved. Permission to redistribute this document may be obtained by contacting us.

We believe the ideas presented here are broadly applicable and so encourage you to make use of this paper in your organization. If you do, please contact us so that your experiences and views can be incorporated into further research on this subject.

Introduction

The Systems Administration world is a challenging one indeed. It is a métier which requires broad scope, detailed knowledge, and a flair for great craftsmanship. It requires not just running the systems, but also planning, architecture, design and, not least, purchasing the underlying hardware that said infrastructure is going to live on.

Keeping track of it all isn't easy, and carrying out these myriad responsibilities is even less so. The range of choices confronting IT professionals when it comes time to choose the infrastructure they will use to build and maintain their systems is dazzling.

Of course, the sharks smell the blood in the water, so there are any number of vendors (and consultants!) who want to sell you the solution to all your problems. Rarely, though, do such vendors have as much of a stake in your systems running as you do. Their one part may work well, but it's the system taken as a whole that has to work, and trying to comprehend the enormity of that challenge is one of the themes we'll be discussing.

At the end of the day, it's your datacenter and your platform – it's you that has to live with it and its complexity, not the vendors. Helping you keep abreast of what's out there is what this paper is about.

A divergent industry

As we look at what the leading vendors offer the marketplace, it is clear that the industry today disagrees about what the best approaches to deploying and managing complex infrastructure are. Somewhat unusually for a marketplace as mature as you would have thought IT should be, the approaches to provisioning and managing infrastructure are diverging.

Horizontal vs Vertical Scaling

This one is quick: on the one hand, you get people advocating that you add more and more servers to handle load in parallel (scaling horizontally). On the other, you have people advocating the use of great big machines – you deal with load by getting an ever larger server with more CPUs and more memory and faster disks and did I mention more CPUs? That's scaling vertically.

In practice, you end up with both approaches being appropriate. Conventionally, databases tend to need big iron to run on. Web servers, on the other hand, being I/O bound, tend to do well scaled out so that you can have more and more machines spewing content back to users.

The trend here isn't the definitions above so much as seeing people succeed with the “wrong” approach. Some e-commerce shops have managed to spread databases out over lots of medium size machines and it works quite well, thank you very much. And it's not uncommon to see someone with an enormous machine doing everything – serving web content, running application servers and their underlying database engines, doing mail forwarding – the works. The point is that what you hear people telling you is the “right” way to do something is not necessarily the only way to approach a problem.

Server consolidation vs Increasing complexity

Anyone reading this is no doubt aware of the trend over the last few years towards server consolidation. While the cost of individual computer components may drop over time, the price tag of top-quality, reliable, robust systems has ever been high. Enormous growth in the demand for IT across organizations has often meant departments have overlapping systems, effort is duplicated or wasted, and opportunities for efficiencies are lost – and older obsolete hardware is sometimes maintained even though it can be expensive to do so. This has meant that cost savings are frequently the motivation for such consolidation.

(Interestingly, though, there's always been a counter factor against reducing the number of boxes, and that is simply that in the Unix and Linux world, machines have an excellent track record of working reliably and staying that way. It doesn't override the cost argument, but it's worth noting that “it doesn't work” is rarely the reason for decommissioning a Unix system!)

Increasing complexity, on the other hand, motivates you to have more servers rather than fewer. The more complex any application is, the more likely it is to need its own box. It's not just the load that use of an application or service puts on a machine – although that's what drives the scaling described above – it's also the complexity of installing and maintaining the application on it. Indeed, best practice with complex enterprise applications is to have one box for each logical function – application servers here, mail servers there, databases in the middle – and not to mix them up. Of course, said best practice means running more boxes, not fewer. So much for convergence of approaches.

I would note that at least one of the factors driving server consolidation is the fact that the average IT shop hasn't put in place effective practices to manage large numbers of systems. If they knew more about doing that well, then perhaps they wouldn't be so worried about the large numbers of nodes.

Of course, some advocates of server consolidation aren't talking about reducing the number of logical systems; they're just talking about reducing the number of physical boxes. That leads us to the next topic.

Blade servers vs Virtualization

Virtualization is the technique of running more than one logical system on a single physical machine. This means not just multi-tasking processes, but multi-tasking entire operating systems and the applications installed therein.

The idea of emulation has been around for a long time, but usually it was in the sense of “emulate the environment of a machine running Mac OS 8 so I can run old programs on my Mac OS X system”. What's somewhat newer is having machines with enough horsepower to handle the overhead.

Even so, there's an overhead. People tend to miss the point that ultimately their processes are all in contention for the same physical interfaces. If a system is I/O bound, then adding more virtual servers “because there's spare CPU” is completely counterproductive.

Going the other direction are blade servers. You've all seen pictures of densely packed cabinets with a hundred or more “servers” in them. Each blade server is a CPU (or two) on a card that may (or may not) have its own hard drive. Typically each node in such an installation will be connected to a very large, very fast disk array that is used in common by all the blades.

Blade servers are ideal when scaling horizontally; they are also a perfect match as the hardware aggregated together to make the clustered supercomputers used for research and modeling. Of course, all those CPUs so closely packed together do tend to generate a bit of heat, so you rather need to make sure that there's enough air conditioning in the room...

YADWIIHTLIT

(Yet Another Damn Web Interface I Have To Log In To!)

It seems that every single tool today boasts a web interface. For many complex sub-systems, it's the only way to access or control them. Such interfaces provide “universality of access”, the vendors tell us. But every such system that has a web interface inevitably has its own logins, trouble alerts, and backup. How are you supposed to stay on top of it all?

This trend finally penetrated my thick skull when I was at a demonstration by a certain Scandinavian company of one of their firewall products. The sales engineer was up there trying to log into the 1RU hardware device she'd brought along. She had already clearly established that she was a very bright person, but it took her like eight tries to log in because she'd forgotten what username/password she needed for this particular device. Then, when she'd logged in, the sales guy started to go on about the virtues of the built in trouble ticket system, and how it sent out alerts, how tasks could be assigned to people and how basically “all your network management could be done right here”.

Well, no, actually it can't. But if the only realistic way to get at that individual device's configuration is through a web login, and if the product is designed to carry out its workflow via an internal trouble ticket system, how are you supposed to keep up with the 30 or so different types of such interfaces you have in your production network?

There's a bigger problem, though. One of the best practices I will be discussing in a minute is creating configurations (or better yet entire systems) in an automated manner. That works great for servers and routers. But in an emergency or otherwise, how do you do an automated restore or rebuild of a system that you can only control through its web interface?

Finally, and although often dismissed as a minor point, these web interfaces run on, or are generated by, devices and sub-systems that are buried deep in your infrastructure. In order for your systems administrators to be able to reach those interfaces, they need to have direct network connectivity from their desktop browser all the way to that system – a system that otherwise is supposed to be isolated and blocked off from any outside access. Troublesome at best, show-stopping at worst. What good is a web interface if you can't get at it without driving to the datacenter?

Alternatives to the conventional e-commerce platform

There is much attention given by the press to the "internet ecosystem" and in particular there is often an assumption that a three tier architecture is the only way to approach designing an e-commerce platform. It turns out that people like Yahoo [1], Akamai [2], and Google [3] are doing things very differently indeed, both in terms of application design and in terms of how they manage the infrastructure which their applications live on. We will describe some of these architectures, and point to the trend of achieving breakthrough results with surprisingly simple and unconventional approaches.

Automation of Configuration Management

We'll conclude with a survey of the state-of-the-art in the configuration management space. The great difficulty of deploying and configuring ever increasing numbers of systems led to the rise of first generation approaches like cfengine and isconf. From consideration of when these tools worked, and more importantly, when they did not, an entire field of academic study and research has emerged.

Over the last five years, the USENIX/SAGE “LISA” conference has emerged as the nexus of the configuration management world. The leading thinkers and practitioners in this space have been converging there, and an enormous amount of progress, both published and colloquial, has been made.

The three major schools of thought are:

Convergence

You may have heard of cfengine. It's certainly got a catchy name. The technique it uses is to look at a configuration file and to make changes to that file if it's missing something it's supposed to have.

Take an Apache config file. Perhaps you want it to have a “UseCanonicalName Off” directive. You could manually go around each system and check if it's there, or you could use a tool like this one. It checks each system to see if the line is there and, if it isn't, adds the line. Which is very cool.
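To make the idea concrete, here is a minimal sketch of a convergent check in Python. It is not cfengine's syntax or implementation; the function name `ensure_line` and the throwaway file are assumptions made purely for illustration. The point is only the shape of the operation: look at what is there, and change it only if the desired line is missing.

```python
from pathlib import Path
import tempfile


def ensure_line(path: Path, directive: str) -> bool:
    """Append `directive` to the file at `path` if no line already matches it.

    Returns True if the file was changed, False if it was already converged.
    (Hypothetical helper, not part of any real tool.)
    """
    lines = path.read_text().splitlines() if path.exists() else []
    if any(line.strip() == directive for line in lines):
        return False  # already in the desired state, so do nothing
    lines.append(directive)
    path.write_text("\n".join(lines) + "\n")
    return True


# Demonstration against a throwaway file rather than a real httpd.conf
with tempfile.TemporaryDirectory() as tmp:
    conf = Path(tmp) / "httpd.conf"
    conf.write_text("ServerName www.example.com\n")
    first = ensure_line(conf, "UseCanonicalName Off")   # adds the line
    second = ensure_line(conf, "UseCanonicalName Off")  # finds it, no change
    final_text = conf.read_text()
```

Running the check a second time changes nothing, which is the property a convergent tool relies on. Note that everything else already in the file is left exactly as it was found.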

In a production deployment, you end up specifying in somewhat exhausting detail classes of machines and what actions are to occur on those machines. Then the tools grind away, evolving your systems to reach the desired configuration.

This all sounds great, but in practice it isn't sufficient. The trouble is that you don't actually end up knowing what's out there. Worse, it can be proved mathematically that under various (commonly encountered) scenarios, the system never converges. That would seem to be a problem – but it hasn't stopped people all over the world from following this approach and managing hundreds of machines at a time.

Congruence

This approach has also been around a while, and was originated by tools like isconf. Here the approach is to generate everything. Each and every configuration file, from /etc/resolv.conf through Apache's httpd.conf, right up to the specification of which packages are to be on a box, is produced by processing some files that contain the description of what's supposed to be on a system, doing some substitutions or selections, and then overwriting the configuration files on the target machine with the new versions.

The upside to this is that, done right, you can be absolutely certain about what is going to end up on a target system and that it will have the correct and current information. At any time, you can completely rebuild your infrastructure and know it's going to end up just right.
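By way of contrast with the convergent approach, a congruent tool can be sketched as "render from the description, then overwrite". The following Python is an illustration only and does not reflect how isconf itself works; the template, the `HOST_CLASSES` table, the addresses and the `generate` function are all invented for the example.

```python
from pathlib import Path
from string import Template
import tempfile

# Hypothetical template for /etc/resolv.conf
RESOLV_TEMPLATE = Template("search $domain\nnameserver $dns1\nnameserver $dns2\n")

# Hypothetical description of what each class of host should look like
HOST_CLASSES = {
    "web": {"domain": "example.com", "dns1": "192.0.2.1", "dns2": "192.0.2.2"},
}


def generate(host_class: str, target: Path) -> str:
    """Render the file from the description and overwrite whatever is there."""
    content = RESOLV_TEMPLATE.substitute(HOST_CLASSES[host_class])
    target.write_text(content)  # unconditional overwrite: local edits are lost
    return content


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "resolv.conf"
    target.write_text("nameserver 10.9.9.9  # hand edit by a sys admin\n")
    generate("web", target)
    result = target.read_text()
```

The generated file depends only on the description, never on what was previously on the machine, which is where the certainty comes from.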

The downside is that this approach doesn't necessarily “play nice with others” (just like your Mum taught you to); if there are other things that try to configure some aspect of the system, then this will overwrite those changes. And it means that you have to get really good at how to specify all the different classes of systems you have, and how to allow for exceptions.

Sys admins tend to get pissed off when the changes they made on a system are blown away, but then, that's the whole point.

Interestingly, both convergence and congruence have been implemented using Make as the underlying ordering tool. It really is about the big picture approach you take, not the particular tools you use to enable you to do so.
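What Make contributes in this role is dependency ordering: each step runs only after the steps it depends on. As a rough sketch of that one idea, and not of Make itself, the standard library's `graphlib` module (Python 3.9 or later) can order a set of configuration steps. The step names below are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical configuration steps, each mapped to the steps it depends on
steps = {
    "install-apache": set(),
    "write-httpd-conf": {"install-apache"},
    "write-resolv-conf": set(),
    "start-apache": {"write-httpd-conf", "write-resolv-conf"},
}

# static_order() yields every step after all of its dependencies
order = list(TopologicalSorter(steps).static_order())
```

Whether each step then nudges a file towards the desired state or regenerates it wholesale is a separate decision from the ordering, which is why the same tool can serve both schools.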

Encapsulation

This is the term I've coined to describe the approaches which advocate taking an object orientated approach to the problem. They wrap up all the functionality of a given aspect of the system and provide an external interface to make changes. The agent is then responsible for starting and stopping services and all the underlying actions necessary to make the change take.

This approach has, casually, been out there for a while – high availability failover clusters can be thought of as behaving in this manner. What's a bit newer on the scene is the idea of doing all your system management through such agents.

The two approaches described above can be considered as a philosophical adjustment to the way you manage your systems. You no longer manage the machines manually – you manage them care of tools which help you get them (and keep them) the way they should be.

Encapsulation techniques take the philosophical shift to the extreme – you no longer operate on your systems at all – you merely interact with the agents that are responsible for encapsulating the behaviour of the various services on your machines. Continuing the Apache example, instead of changing an Apache config file by hand, you merely state “this PHP page is to go into a new virtual host called blah, make it happen”. The agent would take care of validating the new page and virtual host name, editing the Apache configuration files and the DNS records too, and then would restart whatever services needed to be restarted for the change to take.
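The shape of such an agent can be sketched in a few lines of Python. Everything here is hypothetical: the `WebAgent` class, its method, the service names and the addresses are invented for illustration, and the "configuration" is held in memory rather than in real Apache or DNS files. What matters is that the caller states the intent once and never touches the underlying pieces.

```python
import re


class WebAgent:
    """Hypothetical agent that owns the web server and DNS configuration."""

    def __init__(self):
        self.vhosts = {}      # virtual host name -> page served
        self.dns = {}         # host name -> address
        self.restarted = []   # services restarted, in order

    def add_virtual_host(self, name: str, page: str, address: str) -> None:
        # Validate the request before touching anything
        if not re.fullmatch(r"[a-z0-9-]+(\.[a-z0-9-]+)+", name):
            raise ValueError(f"invalid virtual host name: {name!r}")
        if not page.endswith(".php"):
            raise ValueError(f"not a PHP page: {page!r}")
        # Make the underlying changes the caller never sees
        self.vhosts[name] = page
        self.dns[name] = address
        # Restart the services this change affects (names are assumptions)
        for service in ("named", "httpd"):
            self.restarted.append(service)


agent = WebAgent()
agent.add_virtual_host("blah.example.com", "index.php", "192.0.2.10")
```

A request that fails validation raises an error before any state is modified, so a bad request leaves the system as it was.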

Sounds like magic. Or maybe voodoo. And it is. This isn't so much a tool you install as an approach you take. But if you take configuration management to the limit and really start thinking about infrastructure architecture as a whole, this is where you end up.

But does anybody know about these?

Over and above this otherwise factual enumeration, there are three trends to report on. First is the convergence between these schools at higher levels of abstraction. Although they all take different approaches to solving the problem, it seems that once you get a bit higher up the stack everyone is looking at the problem in the same terms. That's actually an important point because it means we've got a common basis for discussion – and it's taken 10 years to reach this point.

Second, people have realized that how you express policy for what your network and its systems should be ends up driving how you implement infrastructure management, so even the kind of language you use to express that policy is of great importance. In fact, it's the issue of language, and “how do I go about constructing a grammar to even articulate my environment”, which is the common challenge that all the approaches to configuration management are facing.

Finally, the most important trend is this: back in the real world very few people are aware of all this work, and so they continue to blunder into the blind alleys that lie in wait for those who innocently set out to deploy a computing platform. While there's nothing wrong with making mistakes and learning from experiences, there is something to be said for avoiding such mistakes when you're talking about multi-million dollar IT investments. If systems administration wants to be treated as a profession, then it needs to raise its game somewhat with regard to institutionalizing these learnings; employing these techniques with greater rigor is a step in the right direction.

A few thoughts from the futurists: The rise of grid computing

The hype surrounding grid computing has exceeded even the IT industry's usual standards of overblown hysterics. As a look towards the future, though, we should consider the very real impact that grids of computing power will have on the theory and practice of infrastructure architecture.

The policy articulation and management work that such grids require may well render much of the current practice of systems administration obsolete. A moment ago I committed a near heresy by suggesting that systems administration may not really be a profession. (It is an occupation, to be sure, and for many an avocation. The term profession, however, actually has a rather well founded and rigorous definition. I digress.) Whatever you conceive systems administration to be, much of it may disappear as grid computing hits the scene.

The opportunity of grid computing is commercial. A large collection of autonomous machines distributed over wide geographical areas represents an enormous inherent resource. When you look at all this power from the perspective of return on invested capital, the possibility to leverage unused capacity is compelling indeed. In order to effectively manage the resources inherent in such a grid, however, you need a mechanism to describe what is running, where, and with what constraints. But doing that right at large scale, especially the what and where parts, means getting an unprecedented level of automation in place to handle the task.

In a more down to earth example, consider this: as an operations manager, all I really want to do is specify that I want enough web servers and mail servers to handle the load. That load is variable, so I want the allocation to happen dynamically. And, since there will be contention for resources at peak times, I'm going to state that our policy is that there must be at least 10 web servers and a minimum of 4 mail servers at all times.

And that's it.

Obviously I'm leaving out a lot of complexity, but if you can make that kind of policy articulation in a dynamic environment, then one way or another everything to enable it is going to [have to] Just Work (tm).
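For the sake of argument, the policy just stated can be written down in a few lines of Python. This is a toy under loud assumptions: the per-node capacities, the load figures, the pool sizes and the `allocate` function are all made up, and a real grid scheduler would have to deal with placement, failure and much else. Only the two minimums come from the policy above.

```python
import math

# Policy from the text: minimum node counts per service
MINIMUMS = {"web": 10, "mail": 4}

# Assumed capacity of one node, in requests per second (made-up figures)
CAPACITY = {"web": 500, "mail": 200}


def allocate(load: dict, pool_size: int) -> dict:
    """Return node counts per service: minimums first, then follow the load."""
    if sum(MINIMUMS.values()) > pool_size:
        raise RuntimeError("pool too small to satisfy the minimum policy")
    wanted = {
        svc: max(MINIMUMS[svc], math.ceil(load.get(svc, 0) / CAPACITY[svc]))
        for svc in MINIMUMS
    }
    # Under contention, shed nodes above the minimum until the pool fits
    while sum(wanted.values()) > pool_size:
        svc = max(wanted, key=lambda s: wanted[s] - MINIMUMS[s])
        wanted[svc] -= 1
    return wanted


quiet = allocate({"web": 1200, "mail": 300}, pool_size=40)
peak = allocate({"web": 12000, "mail": 2000}, pool_size=30)
```

On a quiet day the minimums dominate; at peak, with more demand than the pool can serve, nodes are trimmed from whichever service sits furthest above its minimum, and neither service ever drops below it.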

The Engineers who architect systems and create such mechanisms are going to be busy indeed. So are the Technicians who will monitor such systems. But in the future, such environments may very well be lacking a job description near and dear to our hearts: System Administrator.

[1] I think I mentioned I'm a management consultant? So no, I don't work for Yahoo.

[2] No, I don't work for Akamai either.

[3] And finally, no, I don't work for Google. All three, however, are welcome to contact me and inquire about engaging our services.

