introduction
These are notes I took from a discussion in 2008.
An HA service is one that is ready for customer requests practically all the time.
what it is
discussion contributors
- Matt Hamilton <matth@netsight.co.uk>
- Steve Roome <steve@pepcross.com>
- Mark Hughes <mhsparks@gmail.com>
- Ric <underscore@vorticity.co.uk>
- Andy Davies <dajdavies@gmail.com>
- Tom Gidden <tom@gidden.net>
What is HA?
Having a computer system's availability of 99.999% means the system is highly available (http://en.wikipedia.org/wiki/Myth_of_the_nines).
- what is HA? Is it 99.99999% or 99.9999999% uptime? Or just a term bandied around a lot?
- 99.9% availability is really quite a low bar to set.
- 99.9% availability is achievable with a single Windows box and a stack of reinstall floppies.
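To put the percentages above in perspective, here is a minimal sketch that turns an availability figure into the downtime it allows per year. It assumes a 365-day year, and the function name is made up for illustration. On that basis 99.9% allows roughly 8.8 hours of downtime a year, while 99.999% allows a little over 5 minutes.

```python
# Sketch: convert an availability percentage into allowed downtime per year.
# Assumes a 365-day year; the function name is illustrative only.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes of downtime per year that the given availability allows."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% availability allows about {minutes:.1f} minutes of downtime a year")
```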
Who uses it?
- Some sites need it
- a lot of people whose applications can run on a single server don't go for HA
- most application, OS and hardware failures can be rapidly mitigated with a reasonable HA setup.
- Amazon, Google, Microsoft et al. have the same systems in multiple data centres on different continents...
- We've seen failures in the power supply grids in the US and Europe
- Google have lots of thin servers at the front end.
- an HA setup provides some scalability.
- Customers will judge whether it is good.
- HA is a real struggle for anyone to really provide, but they sell it as that anyway.
- HA definition: more than one datacentre in more than one building, located in different hemispheres
who maintains it?
- relying on data centre staff to perform emergency maintenance is not fine.
- datacentre staff can have top Cisco qualifications and decades of networking experience.
- datacentre staff are damn good at emergency maintenance
internal requirements
- needs multiple heartbeat systems
- needs multiple types of software (OS and application).
- needs an OS that can scale: FreeBSD 7 scales well; Solaris scales well, if you can handle the useless userland; Linux, I'm not sure; Windows, I doubt it
- Most firewalls work reliably for about five years before a hardware failure.
- firewalls generally outlast the building lease
external requirements
- needs multiple sites
- needs multiple ISPs
- needs comms cables to go out through two separate sides of each building
- needs comms cables to connect to different cable companies
- needs comms cables to cross land that is owned by different people
- needs ISPs with different peering and different AS numbers (and no single-point-of-failure lower-tier ISPs in those chains either)
- No single HA setup has its % availability truly improved by failover firewalls WITHOUT also having both multiple sites and multiple ISPs.
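The point about failover firewalls can be illustrated with basic availability arithmetic. This is only a sketch: it assumes component failures are independent (which these notes argue is often untrue in practice), and every availability figure in it is hypothetical. Components that must all work are multiplied together; redundant components only fail if all copies fail.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Everything in the chain must be up, so availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Redundant copies: the group is down only if every copy is down.

    Assumes the copies fail independently, which is optimistic.
    """
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical figures, chosen only for illustration.
FIREWALL = 0.999
ISP = 0.995
SITE = 0.999

single_firewall = series(FIREWALL, ISP, SITE)
paired_firewalls = series(parallel(FIREWALL, FIREWALL), ISP, SITE)
two_sites_two_isps = parallel(series(FIREWALL, ISP, SITE), series(FIREWALL, ISP, SITE))

print(f"one firewall, one ISP, one site:  {single_firewall:.5f}")
print(f"two firewalls, one ISP, one site: {paired_firewalls:.5f}")
print(f"two complete sites and ISPs:      {two_sites_two_isps:.5f}")
```

With a single ISP and a single site still in the chain, the paired firewalls can never lift the total above the availability of the ISP itself.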
best service to customers
- keep customers well-informed at all times
- convince customers that you're doing your best to fix things.
- customers hate incompetence and disingenuousness
- customers do not hate a lack of service caused by shitty luck and a confluence of improbabilities
- good customer service is much better than a baroque multi-homed clusterf*** (farm) that'll fail catastrophically when someone doesn't understand how EXPLAIN SELECT works. (MySQL optimisation)
good things
- a pair of firewalls need to have an active failover
- if customers require 99.9% then no single points of failure are needed to achieve it.
- Failover = Fewer sleepless nights ;)
- HA benefits: a true HA setup covers the possibility of an aeroplane flying into the building and taking out all the kit AND all of the staff on the site
Why would I not want it?
cost
- it's not cheap to implement
- I hate seeing money flushed down the toilet
- companies spend vast amounts of time and money on HA setups
- every HA place seems to spend more time maintaining monitoring than anything else.
- 2 big boxes: Cost of boxes with more than 8 CPUs and 8+GB of RAM?
- adding more boxes increases cost
complexity
- the complexity of HA stuff above can often outweigh the advantages.
- "let's make this bigger and more reliable" tends to go a little wrong.
- redundant *everything* will not save you from an outage.
- it's tough when you have enough boxes to add to complexity, but not enough to *force* simplification of deployment
- scaling problems: two boxes, or ten, that's okay
- adding more boxes may create a spiralling mess
- adding more boxes may give a tiny 0.01% improvement in service
- adding more boxes increases the number of failures you will have.
- adding more boxes often seems to provide a 0.01% decrease in service
- adding more boxes often seems to provide more capacity
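The tension between "a tiny improvement in service" and "more failures" can be sketched with a k-of-n calculation: the service is up if at least k of n boxes are up. This assumes independent box failures and a hypothetical per-box availability of 99%, neither of which comes from the discussion, and it ignores the human-error and complexity costs raised in these notes.

```python
from math import comb


def k_of_n_availability(n: int, k: int, box_availability: float) -> float:
    """Probability that at least k of n independent boxes are up."""
    a = box_availability
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


def expected_boxes_down(n: int, box_availability: float) -> float:
    """Average number of boxes that are down at any given moment."""
    return n * (1 - box_availability)


BOX = 0.99  # hypothetical per-box availability

# (n boxes, k needed to carry the load)
for n, k in ((1, 1), (2, 1), (3, 2), (10, 9)):
    service = k_of_n_availability(n, k, BOX)
    down = expected_boxes_down(n, BOX)
    print(f"{n:2d} boxes, {k} needed: service {service:.6f}, boxes down on average {down:.2f}")
```

The number of box failures to deal with grows in step with the number of boxes, whatever happens to the service figure.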
imperfect solution
- there is always a way to break it ;)
- no amount of cleverness and complicated setup can save you sometimes.
- Amazon had an outage that took out all of their S3 storage system for several hours.
- probably damaged Amazon's S3 annual availability figures
- is a cascading failure to do with capacity planning?
- an unclean failure causes a longer outage
- every well planned HA system I've seen LOOKS better than the one box solution
inadequate design
- expensive HA setups can have a *lower* availability than simpler setups
- it's a sham to get IT directors bigger budgets and more kit, and more staff and hence more seniority
- not 100% safe: firewall CSU/DSU (http://en.wikipedia.org/wiki/CSU/DSU), ISP, single geographical location
- I've spent many nights fixing various different supposedly reliable clustered database setups that turned out not to fail over very well
- HA in practice does not work out to be the HA that the design had in mind.
- HA: it's not always done pretty "simply".
bugs
- failover problems: Unclean failovers, Stuff can end up flip-flopping
- IP packet problems: packet loss, packet errors, NICs can work for heartbeats but have other packet loss or errors
- Router problems: Sometimes it even takes a few seconds for the routing to reconverge on a failover.
- DB: MySQL: MySQL performance and reliability has been something dogging me since the '90s.
- DB: MySQL: Tom Gidden had an interesting setup in London a few years back using replication and some "stupid tricks"
- DB: MySQL: I'd suggest bouncing some possible configs of your intended setup via underscore as folks like Tom, Jan, Matt and so on (sorry, if I missed all the other cheeky MySQL gurus out there) are bound to offer some useful pointers.
- DB: Oracle, Sybase and Redbrick work, just not as well as advertised
- complex switches, especially GSRs (Gigabit Switch Routers), overheat, blow up, fail and crash far more often than they ought to
- Complex switches are often less reliable than the average UNIX box
- SPOF: a badly-designed configuration of multiple boxes can be a single point of failure
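One common way to reduce flip-flopping is hysteresis: only fail over after several consecutive missed heartbeats, and only fail back after a longer run of good ones. The sketch below is not taken from the discussion; the class name and thresholds are hypothetical, and it does nothing about the case where heartbeats get through while other traffic is being lost.

```python
class FailoverMonitor:
    """Decide which node should be active, with hysteresis to damp flapping."""

    def __init__(self, fail_after: int = 3, recover_after: int = 5) -> None:
        self.fail_after = fail_after        # consecutive misses before failing over
        self.recover_after = recover_after  # consecutive good beats before failing back
        self.active = "primary"
        self._missed = 0
        self._good = 0

    def observe(self, heartbeat_ok: bool) -> str:
        """Record one heartbeat result from the primary and return the active node."""
        if heartbeat_ok:
            self._good += 1
            self._missed = 0
        else:
            self._missed += 1
            self._good = 0

        if self.active == "primary" and self._missed >= self.fail_after:
            self.active = "standby"
        elif self.active == "standby" and self._good >= self.recover_after:
            self.active = "primary"
        return self.active


monitor = FailoverMonitor()
# An intermittent heartbeat does not trigger a failover on its own.
for ok in (True, False, True, False, True):
    monitor.observe(ok)
print(monitor.active)
```

Making the fail-back threshold larger than the fail-over threshold is a deliberate choice here: a primary that has just failed has to prove itself for longer before it gets the traffic back.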
maintenance
- systems *DO* fail, no matter how hard you try to make them resilient
- HA enemies: Unplugging cables is necessary to replace any unit.
- HA enemies: Unplugging cables is not something that's tested for very well.
- HA enemies: Unplugging cables often seems to break things quite badly.
- HA enemies: Switch OS upgrade rollouts can cause widespread failure, e.g. with a new version of IOS the failover mechanism may itself fail and leave kit flip-flopping.
human error
- people are thousands if not millions of times more prone to failure than almost any decent equipment.
- the majority of failures I have experienced occurred from someone pressing the wrong button or misconfiguring something.
- the larger and more complex the setup is, the greater the scope for human error.
- computers don't f%^k up... humans do.
- human error: it takes a combination of cock-ups to cause an HA outage, e.g. network reboot -> causes routing change -> internal DNS loss -> significant outage
- human error: the chance of admin downing the wrong interfaces increases
- HA enemies: more cables increase the chance of someone accidentally unplugging the heartbeat monitor cables
- HA enemies: human error. Making a mistake 1% of the time is a low figure.
- human error: the probability of the staff accidentally cocking it up during regular maintenance has become higher than the probability of some software or hardware failure.
- human error: maintenance cockups generally take out entire sites
- human error is far more likely to cock things up than hardware
- human error includes a developer writing a runaway query that works fine in testing but sucks in production
- human error tends to affect all your systems: fallbacks and all.
- human error includes an ISP bollocksing up a Cisco config
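To see why a 1% mistake rate matters, here is a rough sketch of how the chance of at least one mistake grows with the number of maintenance operations. It assumes each operation is an independent trial with the same error rate, which is a simplification, and the function names are illustrative.

```python
from math import ceil, log


def chance_of_any_mistake(operations: int, error_rate: float = 0.01) -> float:
    """Probability of at least one mistake across a number of independent operations."""
    return 1.0 - (1.0 - error_rate) ** operations


def operations_until_likely_mistake(error_rate: float = 0.01, threshold: float = 0.5) -> int:
    """Smallest number of operations at which a mistake is at least `threshold` likely."""
    return ceil(log(1.0 - threshold) / log(1.0 - error_rate))


for n in (1, 10, 100):
    print(f"{n:3d} operations: {chance_of_any_mistake(n):.1%} chance of at least one mistake")
print(f"a mistake becomes more likely than not after {operations_until_likely_mistake()} operations")
```

A bigger setup means more operations per year, so the same per-operation error rate produces more outages.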
external enemies
- HA enemies: ISP cable damage. The rate of major data cables being dug up around the UK is easily on the tens/day scale
- ISPs these days are so flaky it's unbelievable!
- ISP: there's little one can do about major backhaul outages on multiple suppliers.
- SPOF: domain registrar accidentally dropping your registration
What are the options?
computer configurations
one small box
- it's best to have a good, clean, SIMPLE system
- needs spare hardware ready-to-fit, a working backup strategy, and well-planned recovery procedures.
- a single box LAMP solution is the only way to reduce cost
- a single box LAMP solution is reliable
- a simple setup wins over a complex one.
- can't handle traffic peaks around 40,000 dynamic page views a minute.
- How can it be quicker to replace hardware with any kind of backup than to reconfigure a network?
- a hard drive failure is the most probable.
- The effect of a hard drive failure is massively reduced by using even a fairly naïve RAID setup.
- A single box can cater for hard drive failure: RAID in a box is easy
- resetting a simple system: hit the reset button, back in 2 minutes
- HA no more useful than a single box setup running off "SAFE" storage. (is this Cisco SAFE? http://www.cisco.com/warp/public/cc/so/cuso/epso/sqfr/safe_wp.pdf)
- it generally takes less time to replace one box with an identical one manually than it does to reconfigure the network logically to use some other device.
- sites like MySpace have lowered customers' expectations
- if a site is down for a few minutes customers don't seem to really mind.
- companies like Rackspace promise to replace hardware within 1 hour.
- manually replugging a cable is a one minute job.
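The single-box argument rests on repair time, and the usual steady-state formula makes that explicit: availability = MTBF / (MTBF + MTTR). The sketch below uses the 2-minute reset and 1-hour hardware replacement mentioned above; the failure interval of once a year and the 8-hour rebuild are assumptions picked purely for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf must be positive and mttr must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


ASSUMED_MTBF_HOURS = 365 * 24  # hypothetical: one failure a year

repair_scenarios = (
    ("reset button, 2 minutes", 2 / 60),
    ("hardware replaced within 1 hour", 1.0),
    ("manual rebuild, 8 hours (hypothetical)", 8.0),
)

for label, mttr in repair_scenarios:
    print(f"{label}: {availability(ASSUMED_MTBF_HOURS, mttr):.5%}")
```

Under these assumptions, a box that fails once a year and is back within an hour still clears 99.9%, which is why spare hardware and rehearsed recovery matter so much for the simple setup.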
one big box
- one big box: you can change everything without taking the box down.
- expensive. An E2900 is the price of a house - $300,000
two small boxes
- Even small one box solutions I'd deliver in a VM (or Xen) cluster.
- adding kit: How many more 9's on the 99.9% uptime does this give you over a one (big) box solution?
- How much more money does this cost you over a one (big) box solution?
- Which part of the A got more H with this new setup ?
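The two questions above (how many more 9's, how much more money) can be combined into a single rough figure: cost per extra nine. This is a sketch only; the prices and availability figures are invented, and the "number of nines" is taken as -log10(1 - availability).

```python
from math import log10


def nines(availability_percent: float) -> float:
    """Number of nines, e.g. 99.9% -> 3, 99.99% -> 4."""
    if not 0.0 <= availability_percent < 100.0:
        raise ValueError("availability must be at least 0 and below 100")
    return -log10(1.0 - availability_percent / 100.0)


def cost_per_extra_nine(base_cost: float, base_pct: float,
                        new_cost: float, new_pct: float) -> float:
    """Extra money spent for each additional nine gained by the new setup."""
    gained = nines(new_pct) - nines(base_pct)
    if gained <= 0:
        raise ValueError("the new setup gains no availability")
    return (new_cost - base_cost) / gained


# Hypothetical: a 10,000 single box at 99.9% versus a 30,000 pair at 99.99%.
print(f"{cost_per_extra_nine(10_000, 99.9, 30_000, 99.99):,.0f} per extra nine")
```

If the new setup gains nothing, the function refuses to give a number, which is roughly the point the questions are making.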
two big boxes
- two big boxes with a crossover cable
- One live box with another waiting in standby is not really scalable or ideal in a HA environment
- Even a couple of everything (server wise) can be done pretty simply
- provide a high level of both resilience and flexibility.
- We use it regularly for maintenance and deployments
- scaled so that each carries < 50% of load
- < 50% of load - you also have headroom for spikes
- An E10k or similar would do it (hasn't existed for years...)
- maybe E2900? An E2900 is the price of a house - $300,000 (http://www.sun.com/servers/index.jsp?cat=Sun%20Fire%20Midrange%20Servers&tab=3)
- 2 big boxes: less time and effort required than for many boxes?
- throwing hardware at the problem - I've never worked anywhere that already had one AND had really needed it rather than better SQL.
- 2 big boxes: Efficient software design could reduce hardware price tag
- 2 big boxes: for kit which can cost millions is a pretty sad state of affairs.
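The "< 50% of load" rule generalises: with n boxes sharing the load, the survivors can absorb one failure only if each box normally runs at or below (n - 1) / n of its capacity. A minimal sketch, with hypothetical load and capacity figures in page views per minute:

```python
def max_safe_utilisation(n_boxes: int) -> float:
    """Highest per-box utilisation at which the loss of one box can still be absorbed."""
    if n_boxes < 2:
        raise ValueError("need at least two boxes to survive a failure")
    return (n_boxes - 1) / n_boxes


def survives_single_failure(peak_load: float, per_box_capacity: float, n_boxes: int) -> bool:
    """True if the remaining boxes can carry the peak load with one box down."""
    return (n_boxes - 1) * per_box_capacity >= peak_load


# Hypothetical: two boxes, each able to serve 50,000 page views a minute.
print(max_safe_utilisation(2))                     # two boxes: each must stay at or below half
print(survives_single_failure(40_000, 50_000, 2))  # peak fits on the surviving box
print(survives_single_failure(60_000, 50_000, 2))  # peak is too much for one box
```

Anything under the limit is the headroom for spikes mentioned above.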
many boxes
- Resilient Firewalls,
- Resilient load balancers,
- Resilient ssl hardware acceleration,
- Resilient web servers,
- Resilient DNS servers
- Resilient Firewalls.
- Resilient application servers.
- Failover database servers, should be resilient with DB application clustering, might be one day.
- Resilient Firewalls.
- LAN.
ideas
- Was it worth it or would you do it differently next time ?
- please write a book on it showing how to do it cost-effectively!
- this'd be better over a few beers or as a presentation with probabilities and cost figures

