Bulletproofing Servers: Building a Challenge for Murphy
by Andy Neely
Most system administrators who have maintained a server for more than a few months will have their own stories to tell. It might be an installation or a configuration problem, a daemon that stops responding every six or eight weeks, or the 150 million duplicate entries that filled up the log partition last Sunday.
One of the intriguing things about the software side of the game is that, despite its complexity, software always seems to provide an answer in the end. It might take a bit of digging in the code, or a change in the way something is done, but over time the industry seems to develop some impressively robust software to provide the services that the modern world thrives upon.
Then a hard drive fails and reminds us that there's only so much that we, as system administrators, can do to protect ourselves from the evil clutches of entropy. Software operates at the whim of hardware, which makes stable hardware important to our long-term happiness as system administrators.
Using RAID to protect hard drives
Smooth operation of a server's hard drives is crucial, because most servers will read from or write data to the hard drive periodically, and a failure of a drive operation can cause the server to halt or -- worse yet -- corrupt the data itself. Proper use of a RAID (Redundant Array of Independent Disks) solution can virtually eliminate the possibility of drive-related downtime.
RAID 1 uses a special hard-drive controller or a software solution to make two drives appear as a single drive to the server. Data written to the drive array is written to both drives simultaneously, and data is read from whichever of the two drives delivers the data faster. In the event of a failure of one drive, the other drive will still have reliable data. RAID 1 requires doubling the number of drives, and that adds cost to your configuration.
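The mirroring behavior described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the "drives" are in-memory lists, the class name `MirroredPair` and its methods are hypothetical, and reads are served by the first healthy drive rather than the faster one. A real controller or software RAID layer does considerably more.

```python
class MirroredPair:
    """Toy RAID 1 model: two in-memory 'drives' holding identical blocks."""

    def __init__(self, block_count):
        # Each drive is a list of blocks; None means "never written".
        self.drives = [[None] * block_count, [None] * block_count]
        self.healthy = [True, True]

    def write(self, block, data):
        # Every write goes to each drive that is still healthy.
        for index, drive in enumerate(self.drives):
            if self.healthy[index]:
                drive[block] = data

    def fail(self, index):
        # Simulate the loss of one drive.
        self.healthy[index] = False

    def read(self, block):
        # Serve the read from any healthy drive (here: the first one found).
        for index, drive in enumerate(self.drives):
            if self.healthy[index]:
                return drive[block]
        raise IOError("both drives have failed; data is lost")


pair = MirroredPair(block_count=4)
pair.write(0, b"payroll")
pair.fail(0)           # the first drive dies
print(pair.read(0))    # the second drive still serves the data
```

The cost noted above shows up directly in the model: two full-sized lists are kept to store one list's worth of data.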
When using many drives, consider using RAID 5 to improve drive reliability without doubling the cost. RAID 5 joins three or more drives into a single redundant "drive," providing usable drive space equal to one drive less than those used in the array. Because every write to a RAID 5 array must also update parity information, which is spread across all the drives, it can perform slowly when using a small number of drives.
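The capacity rule and the role of parity can be made concrete with a short sketch. It assumes the common XOR-based parity scheme and uses hypothetical helper names (`raid5_usable_capacity`, `xor_blocks`) and made-up block contents. It models a single stripe only, not a real array layout.

```python
from functools import reduce


def raid5_usable_capacity(drive_count, drive_size_gb):
    """Usable space is one drive less than the number of drives in the array."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return (drive_count - 1) * drive_size_gb


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus the parity block computed from them.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Lose the drive holding the second block, then rebuild that block
# from the surviving data blocks and the parity block.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

With four hypothetical 36 GB drives, `raid5_usable_capacity(4, 36)` gives 108 GB of usable space, compared with 72 GB if the same four drives were arranged as two RAID 1 pairs.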
Placing each physical drive in a hot-swappable drive tray allows a failed drive to be removed and swapped without downtime. This can be useful even in non-RAID environments. If a major server failure occurs, the drive tray can be swapped quickly into a similarly configured set of server hardware.
Andy Neely will be presenting the tutorial Purchasing Commodity Hardware for Your Open Source Project, and other sessions, at the O'Reilly Open Source Convention in San Diego, CA, July 23-27, 2001.
Software RAID solutions are available for most modern operating systems. Although beneficial from a cost perspective, note that software RAID will use your server's processing power to do its work -- which lowers performance -- and that software solutions do not necessarily guarantee the same reliability that a physically isolated drive bus can offer.
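Whichever software RAID you use, a degraded array only protects you if someone notices it. As one hedged illustration, assuming a Linux server using the kernel's md driver (the article does not name a specific operating system), the status file `/proc/mdstat` marks each array member as `U` (up) or `_` (missing). The function name `degraded_arrays` and the sample text below are illustrative, not output captured from a real machine.

```python
import re

# Matches the status field of an mdstat detail line, e.g. "[2/2] [UU]".
STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return names of md arrays that report a missing member ('_')."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S*)\s*:", line)
        if header:
            # Remember which array the following detail line belongs to.
            current = header.group(1)
            continue
        status = STATUS.search(line)
        if status and current and "_" in status.group(3):
            degraded.append(current)
    return degraded


# Illustrative sample in the style of /proc/mdstat (not real output).
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      1048576 blocks [2/2] [UU]
md1 : active raid1 sdb2[1](F) sda2[0]
      8388608 blocks [2/1] [U_]
unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

On a live Linux system the same function could be fed the contents of `/proc/mdstat` from a scheduled job that alerts an administrator when the returned list is not empty.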
If a RAID 1 controller fails and you don't have a spare, you can, as a last resort, replace it with a normal drive controller and use only one of the two drives. Be warned that doing this leaves you with no data redundancy at all, so bring a replacement controller online as quickly as possible.
Power to spare
A server without power is no more than an expensive doorstop. Server power supplies are under heavy strain from operating constantly for years without relief. Power supplies also contain at least one fan for cooling, and losing a fan will virtually guarantee the failure of the supply. Use of a dual power-supply solution can eliminate the downtime normally caused by failure of the supply. If one of the power supply modules fails, the failed module can be replaced without interrupting the operation of the server itself.
Although dozens of options exist, most dual supplies fall under one of two basic designs: those that fit in a standard ATX-style server case, and those that require a custom case. The former are the most flexible and, because standard tower cases are quite common, they are the easiest to use if the server's case needs to be replaced in an emergency.
Depending on the model, a dual supply uses either one or two power cables. Using two cables gives you the ability to plug the server into two independent electrical circuits, such as independent battery-backed Uninterruptible Power Supplies (UPS). If one UPS fails, you'll still have power to the server.
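A little arithmetic shows why two independent feeds are worth the trouble. The sketch below assumes the feeds fail independently of each other, which is only approximately true in practice (a building-wide outage affects both). The 99 percent figure is a made-up example, not a measurement, and the function name `combined_availability` is hypothetical.

```python
def combined_availability(*availabilities):
    """Availability of redundant units, assuming independent failures."""
    all_down = 1.0
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        # The server loses power only if every feed is down at once.
        all_down *= (1.0 - a)
    return 1.0 - all_down


# Hypothetical figure: each UPS-fed circuit is available 99% of the time.
single = 0.99
pair = combined_availability(single, single)
print(f"one feed: {single:.4f}, two independent feeds: {pair:.4f}")
```

Under these assumptions, two 99 percent feeds together come out at roughly 99.99 percent, because both must fail at the same moment to take the server down.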
When networking goes bad
Adding a redundant Network Interface Card (NIC) can protect against a variety of network failures. If the second NIC is connected to a different port of the same network device (such as a switch or router) as the first NIC, you can protect the server from failure of either physical network cable. If one cable is unplugged, the other will keep the network connection alive. For maximum effectiveness, plug the second network cable into a completely different network device that has been configured for redundancy on the network.
Some NICs are available with two or more ports on a single card; consider avoiding these, because the card becomes a single point of failure. NIC redundancy requires proper support on the network itself, so be sure to consult with your network administrator to discuss your options.
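The failover idea can be sketched as an active-backup selection: use the preferred interface while it has link, otherwise fall back to the next one. The example assumes a Linux host, where each interface exposes a `carrier` flag under `/sys/class/net`. The interface names and the function names (`link_is_up`, `choose_active`) are hypothetical. In production this job is normally done by the operating system's own bonding or teaming support, not by a script.

```python
from pathlib import Path


def link_is_up(interface, sysfs_root="/sys/class/net"):
    """Read the Linux sysfs carrier flag; treat any read error as 'down'."""
    try:
        carrier = (Path(sysfs_root) / interface / "carrier").read_text()
        return carrier.strip() == "1"
    except OSError:
        return False


def choose_active(interfaces, is_up=link_is_up):
    """Active-backup selection: first interface in priority order with link."""
    for name in interfaces:
        if is_up(name):
            return name
    return None


# Simulated link states (hypothetical): the primary cable has been unplugged.
states = {"eth0": False, "eth1": True}
print(choose_active(["eth0", "eth1"], is_up=lambda name: states[name]))
```

Passing the link check in as a parameter keeps the selection logic testable without touching real hardware.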
The parts box: Emergency spares
Many components in a server currently cannot support a hot spare, including the processor, memory, and most cards. Although these devices are much less likely to fail than those mentioned above, failure of any one of them will lead to forced downtime. To avoid extended downtime, keep spares of these components on hand whenever possible.
If you have a small budget, consider spending a little extra to purchase spares of the more expensive server parts and use them in your desktop computer. You'll be spending only the difference in cost between server and desktop computer components, and if the server fails you can take your desktop computer offline and cannibalize it for parts to get the more important server back online quickly.
If you maintain several servers or have the budget to spare, consider keeping a complete spare set of components on hand or, better yet, building a spare of the complete server. This is especially useful if you have several servers with similar hardware configurations. If one server fails, you can swap the drives -- or better yet, the drive trays -- and a few key components to have the core of the server back online quickly.
If your design and budget permit it, consider a Server Load Balancing (SLB) solution to offer the ultimate in server redundancy. In the event your server fails, SLB can switch the transactions to an identical server configured to operate as a backup.
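At its core, the failover decision an SLB device makes is a health check followed by a choice of backend. The sketch below is a simplified illustration, not a description of any particular SLB product: the host names are placeholders, the function names (`tcp_alive`, `pick_backend`) are hypothetical, and a plain TCP connection attempt stands in for a real health probe.

```python
import socket


def tcp_alive(host, port, timeout=2.0):
    """Health check: can we open a TCP connection to the service?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_backend(backends, check=tcp_alive):
    """Return the first (host, port) that passes its health check, else None."""
    for host, port in backends:
        if check(host, port):
            return (host, port)
    return None


# Placeholder servers in priority order: primary first, then the backup.
BACKENDS = [("primary.example.com", 80), ("backup.example.com", 80)]

# Simulate a failed primary without touching the network.
down = {"primary.example.com"}
print(pick_backend(BACKENDS, check=lambda host, port: host not in down))
```

Real SLB solutions also have to deal with in-flight transactions and session state, which this sketch ignores.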
As your server environment grows, so will your potential for hardware failures. Applying these bulletproofing suggestions to early servers and growing into SLB solutions where appropriate can help keep your servers chugging along closer and closer to the mythical 100-percent uptime we all desire.
Andy Neely is the Vice President of Technical Operations for Front Range Internet, Inc., and a speaker at the upcoming Open Source Software Convention.