resilience, hot standby
introduction
HA (High Availability) is a promise that a thing is almost always ready for use. An HA computer is one that is switched on and working continuously, except for a tiny percentage of the year. An HA service is one that is ready for customer requests practically all the time.
Perfect availability is not possible. Enterprises get close by asking for high availability. This is provided either by a few expensive computers with redundant parts or by a lot of cheap computers arranged in clusters.
what it is
Service providers want their services available for use continuously. The Internet operates 24 hours a day, 7 days a week. When office workers in New York are asleep, office workers in Tokyo are shopping online. B2B transactions such as news supply and book purchasing happen when they are required, not when the boss is awake. Important services are often called mission critical.
Perfect availability is just not possible when dealing with complex manmade components. Garden plants are always available to slugs and snails and nerds are always available for love but computer systems just cannot be always available. Every day computers crash, computer parts wear out and applications are upgraded. What we can do is make our network highly available. We look for any single points of failure and build ways round them.
Statistics are involved in calculating high availability. If you have one computer, the time it is available depends on factors like frequency of failure and time to recover from a failure. You may pore over historical statistics and figure out that this computer can be relied on 96% of the time before the memory melts down or the disk blows up. Many enterprises aim for 99.999% service availability, which works out at no more than about 5.3 minutes of downtime a year. You can't get that number with this computer.
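The arithmetic behind those percentages can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function names are made up for this example, the availability formula is the usual mean-time-between-failures over total-time ratio, and a year is taken as 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def availability(mtbf_hours, mttr_hours):
    """Fraction of time a machine is up, given its mean time between
    failures (MTBF) and its mean time to recover (MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail):
    """Yearly downtime allowed by an availability figure such as 0.99999."""
    return (1.0 - avail) * MINUTES_PER_YEAR


# "Five nines" leaves roughly five minutes a year.
print(round(downtime_minutes_per_year(0.99999), 1))           # 5.3
# The 96% computer is down for about two weeks a year.
print(round(downtime_minutes_per_year(0.96) / (24 * 60), 1))  # 14.6
```

A machine that runs for 96 hours between failures and takes 4 hours to repair has `availability(96, 4)` of 0.96, which is the 96% computer described above.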
many computers and servers
You can do HA by running a service on an expensive computer with duplicated parts. A person with a lot of money, such as the manager of a bank, buys an expensive computer from Tandem that is extremely reliable. The parts in an extremely reliable computer can break down but extra parts are waiting to take over. An extremely reliable computer has redundant parts inside it such as multiple power supplies, CPUs and disks. A common safety feature is to use two disk drives. One is used by the computer and the other is a perfect copy that takes over if the first one breaks.
You can do HA by running a service on a cluster. A cluster is a collection of servers that work together to provide one service. A person on a budget buys several computers, runs an identical server on each one and makes a cluster from them. The statistical chance of several servers failing at the same time is remote, so the cluster is very reliable. There are several clusters in the LIC, such as the web servers, LDAP servers and database servers. If any one of the shared infrastructure servers breaks there is no noticeable service reduction.
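Why a cluster of unreliable machines can be reliable is easiest to see with numbers. The sketch below assumes two things the text does not promise: that servers fail independently of each other, and that a single surviving server can carry the whole load. The function names are invented for this example.

```python
def cluster_availability(server_availability, servers):
    """Chance that at least one of `servers` identical servers is up,
    assuming failures are independent."""
    if servers < 1:
        raise ValueError("a cluster needs at least one server")
    return 1.0 - (1.0 - server_availability) ** servers


def servers_needed(server_availability, target):
    """Smallest cluster size whose availability reaches `target`."""
    if not (0.0 < server_availability < 1.0 and 0.0 < target < 1.0):
        raise ValueError("availabilities must be strictly between 0 and 1")
    n = 1
    while cluster_availability(server_availability, n) < target:
        n += 1
    return n


# Four cheap 96% servers are enough to reach "five nines" on paper.
print(servers_needed(0.96, 0.99999))  # 4
```

Real servers share power, network and software bugs, so failures are never fully independent and the real figure is lower than this calculation suggests.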
You can do HA by providing a good data center to house your computers. A good data center provides safety from disasters. It has safety features such as air conditioning to stop computers overheating, security features such as restricted entry to stop strangers stealing computers and redundant features such as dual power feeds.
The things in an HA system must be monitored to prove service levels have been reached and to detect problems.
HA measurements show how much time the service is available for. These measurements are important for keeping customers happy. A customer will not believe you if you say "The service has been available 100% of the time. As God is my witness that is the honest truth. Would I lie to a valued customer? Cross my heart and hope to die, I swear on the grave of my ancestors it is never down". Regular service measurements are logged and turned into graphs by a report server.
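Turning logged measurements into an availability figure might look something like this. It is only a sketch: the log format (a day label plus a pass/fail flag for each probe) and the function name are assumptions made for the example, not a description of the real report server.

```python
from collections import defaultdict


def availability_by_day(probes):
    """Summarise probe results as the fraction of successful probes per day.

    `probes` is an iterable of (day_label, succeeded) pairs, for example
    one entry for every minute the monitor tried the service.
    """
    totals = defaultdict(lambda: [0, 0])  # day -> [successes, attempts]
    for day, succeeded in probes:
        totals[day][1] += 1
        if succeeded:
            totals[day][0] += 1
    return {day: ok / tried for day, (ok, tried) in totals.items()}


# Hypothetical log: Monday was perfect, Tuesday lost one probe in four.
log = [("mon", True)] * 4 + [("tue", True)] * 3 + [("tue", False)]
print(availability_by_day(log))  # {'mon': 1.0, 'tue': 0.75}
```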
HA measurements detect broken components. If a server crashes then the load balancer detects the problem and stops sending requests to that server. If a computer has one primary disk and one backup then one can break down without affecting service. The remaining disk becomes a single point of failure: if the second breaks down then the computer and all the services on it are effectively dead. Someone is told when a component breaks so that it can be fixed before the spare breaks too.
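The load balancer behaviour described above can be sketched as a round-robin picker that skips any server failing its health check. The class, the server names and the `is_healthy` callback are all hypothetical. A real load balancer probes servers over the network on a timer rather than asking a Python function.

```python
class RoundRobinBalancer:
    """Hand out servers in turn, skipping any that fail the health check."""

    def __init__(self, servers, is_healthy):
        self.servers = list(servers)
        self.is_healthy = is_healthy  # callable: server -> bool
        self._next = 0

    def pick(self):
        # Try each server at most once per request.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if self.is_healthy(server):
                return server
        raise RuntimeError("no healthy servers left")


down = {"web2"}  # pretend web2 has crashed
balancer = RoundRobinBalancer(["web1", "web2", "web3"],
                              lambda server: server not in down)
print([balancer.pick() for _ in range(4)])
# ['web1', 'web3', 'web1', 'web3']
```

Once web2 is repaired and removed from `down`, it starts receiving requests again without any other change.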
many network components
The network components must also be duplicated. It is no good for customers if all the computers are happily humming away and someone pulls out the only network cable connecting the LIC to the Internet. With two of everything, if the cleaner pulls the plug on a network switch to plug in her kettle, traffic can still flow through the other one. Customers don't notice. Uptime is good.
The network is effectively doubled up. Two ISP connections lead through two sets of networks. Two sets of routers, switches and cables carry the request to the hosts.
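The effect of doubling up can be estimated with the usual series and parallel rules: a path works only if every component along it works, and the network works if at least one path works. The figures below are invented for illustration, and the calculation assumes the two paths fail independently, which is not true if both ISP cables share a trench.

```python
from math import prod


def series(*parts):
    """Availability of a chain that needs every part working."""
    return prod(parts)


def parallel(*paths):
    """Availability of alternatives where any one working path is enough."""
    return 1.0 - prod(1.0 - p for p in paths)


# Hypothetical figures for one path: ISP link, router, switch.
one_path = series(0.99, 0.999, 0.999)
two_paths = parallel(one_path, one_path)

print(f"one path:  {one_path:.4f}")
print(f"two paths: {two_paths:.6f}")
```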
[Figure: computer interfaces]
Each computer has at least three Ethernet interfaces. One is used by administrators to manage the computer. The other two are for business traffic. In exactly the same way that a shop lets customers in through the front door and tradesmen get sent round the back with the garbage and the smelly drains, our server hosts have a front and a back. The management interface is the back of the box. Business interfaces are the front. Requests can arrive at either business interface.
what it isn't
Easy. You may figure out every single point of failure just to find that men digging holes in your street have severed both your ISP cables.
A "mission critical" service has nothing to do with religious work carried out in a foreign land.
where it is
Every part of the LIC has HA features.
history
In 1994 the Beowulf Project was started to build cheap computing clusters.
In 2001 IBM Global Services installed a fat and fast computing cluster at NCSA.



