HA notes

introduction

These are notes I took from a discussion in 2008.

A highly available (HA) service is one that is ready to serve customer requests practically all the time.

discussion contributors

Steve is talking from the experience of running -- among other things -- a multi-million-pound, 500-machine HA setup that failed pretty much every day through no fault of his own, for a website you would consider far more critical than a 99.9% requirement.

What is HA?

A computer system with 99.999% availability ("five nines") is what is usually meant by highly available (http://en.wikipedia.org/wiki/Myth_of_the_nines).
  • what is HA: is it 99.99999% or 99.9999999% uptime, or just a term that gets bandied around a lot?
  • 99.9% availability is really quite a low bar to set.
  • 99.9% availability is achievable with a single Windows box and a stack of reinstall floppies.
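
For context on those figures, here is a quick back-of-the-envelope calculation (a minimal Python sketch) of the downtime budget each level of "nines" allows per year:

    # Downtime budget per year for various availability targets.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    for nines in ("99.9", "99.99", "99.999"):
        availability = float(nines) / 100
        downtime = (1 - availability) * MINUTES_PER_YEAR
        print(f"{nines}% allows ~{downtime:.0f} minutes "
              f"(~{downtime / 60:.1f} hours) of downtime a year")

Roughly 8.8 hours a year at 99.9%, under an hour at 99.99%, and about five minutes at 99.999% -- which is why 99.9% is described above as a low bar.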

Who uses it?

  • Some sites need it
  • a lot of people whose applications can run on a single server don't go for HA
  • most application, OS and hardware failures can be rapidly mitigated with a reasonable HA setup.
  • Amazon, Google, Microsoft et al. run the same systems in multiple datacentres on different continents...
  • We've seen failures in the power supply grids in the US and Europe
  • google have lots of thin servers at the front end.
  • an HA setup provides some scalability.
  • Customers will judge whether it is good.
  • HA is a real struggle for anyone to actually provide, but they sell it as HA anyway.
  • HA definition: more than one datacentre in more than one building, located in different hemispheres

who maintains it?

  • relying on data centre staff to perform emergency maintenance is not fine.
  • datacentre staff can have top Cisco qualifications and decades of networking experience.
  • datacentre staff are damn good at emergency maintenance

internal requirements

  • needs multiple heartbeat systems (see the sketch after this list)
  • needs multiple types of software (OS and application).
  • needs an OS that can scale: FreeBSD 7 scales well; Solaris scales well if you can handle the useless userland; Linux, I'm not sure; Windows, I doubt it
  • Most firewalls work reliably for about five years before a hardware failure.
  • firewalls generally outlast the building lease
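
On the "multiple heartbeat systems" point, a minimal sketch of the idea (addresses, ports and thresholds here are hypothetical): only treat the peer as dead when every independent heartbeat path fails, so one flaky NIC or unplugged cable does not trigger a spurious failover.

    # Sketch: only declare the peer dead when *every* independent heartbeat
    # path has failed, to avoid split-brain caused by one flaky NIC or cable.
    # Hostnames and ports are made up for illustration.
    import socket

    HEARTBEAT_PATHS = [
        ("10.0.0.2", 9999),      # direct crossover cable (hypothetical)
        ("192.168.1.2", 9999),   # via the shared switch (hypothetical)
    ]

    def path_alive(host, port, timeout=2.0):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def peer_alive():
        return any(path_alive(host, port) for host, port in HEARTBEAT_PATHS)

    if not peer_alive():
        print("all heartbeat paths down - consider taking over the service IP")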

external requirements

  • needs multiple sites
  • needs multiple ISPs
  • needs comms cables to go out through two separate sides of each building
  • needs comms cables to connect to different cable companies
  • needs comms cables to cross land that is owned by different people
  • needs ISPs with different peering and different AS numbers (and no single-point-of-failure lower-tier ISPs in those chains either)
  • No single HA setup has its % availability truly improved by failover firewalls WITHOUT having both multiple sites and multiple ISPs (see the arithmetic after this list).
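
To see why the last point holds, here is some illustrative availability arithmetic (the component figures are invented; only the shape of the result matters): redundant kit behind a single ISP is still capped by that ISP, while a second independent site and ISP attacks the whole chain.

    # Illustrative availability arithmetic (figures are made up).
    # Parallel (independent, redundant): A = 1 - product(1 - a_i)
    # Series (everything must work):     A = product(a_i)

    def parallel(*avail):
        unavailable = 1.0
        for a in avail:
            unavailable *= (1 - a)
        return 1 - unavailable

    site = 0.999   # one site: servers, firewalls, power (assumed)
    isp = 0.998    # one ISP and its comms cables (assumed)

    single    = site * isp                          # one site, one ISP
    dual_kit  = parallel(site, site) * isp          # failover firewalls, one ISP
    dual_site = parallel(site * isp, site * isp)    # two sites, two ISPs

    print(f"single site, single ISP   : {single:.4%}")
    print(f"redundant kit, single ISP : {dual_kit:.4%}")
    print(f"two sites, two ISPs       : {dual_site:.4%}")

The second line barely moves because the single ISP still caps it; the third line is where the extra nines actually appear.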

best service to customers

  • keep customers well-informed at all times
  • convince customers that you're doing your best to fix things.
  • customers hate incompetence and disingenuousness
  • customers do not hate a lack of service caused by shitty luck and a confluence of improbabilities
  • good customer service is much better than a baroque multi-homed clusterf*** (farm) that'll fail catastrophically when someone doesn't understand how EXPLAIN SELECT works. (MySQL optimisation)
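
On the EXPLAIN SELECT aside: a minimal sketch of checking a query plan before it gets anywhere near production. The connection details, table and query are hypothetical, and pymysql is used here purely as one example of a MySQL client.

    # Sketch: run EXPLAIN on a query and eyeball the plan before deploying it.
    # Connection details, table and query are all hypothetical.
    import pymysql

    conn = pymysql.connect(host="db.example.com", user="app",
                           password="secret", database="shop")
    try:
        with conn.cursor() as cur:
            cur.execute("EXPLAIN SELECT * FROM orders WHERE customer_id = %s", (42,))
            for row in cur.fetchall():
                print(row)   # look for full table scans / missing indexes
    finally:
        conn.close()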

good things

  • a pair of firewalls needs to have active failover
  • if customers require 99.9%, then eliminating single points of failure is needed to achieve it.
  • Failover = fewer sleepless nights ;)
  • HA benefits: a true HA setup covers the possibility of an aeroplane flying into the building and taking out all the kit AND all of the staff on the site

Why would I not want it?

cost

  • it's not cheap to implement
  • I hate seeing money flushed down the toilet
  • companies spend vast amounts of time and money on HA setups
  • every HA place seems to spend more time maintaining the monitoring than anything else; that is where the time can go.
  • 2 big boxes: cost of boxes with more than 8 CPUs and 8+ GB of RAM?
  • adding more boxes increases cost

complexity

  • the complexity of HA stuff above can often outweigh the advantages.
  • "let's make this bigger and more reliable" tends to go a little wrong.
  • redundant *everything* will not save you from an outage.
  • it's tough when you have enough boxes to add to complexity, but not enough to *force* simplification of deployment
  • scaling problems: two boxes, or even ten, is okay
  • adding more boxes may create a spiralling mess
  • adding more boxes may give a tiny 0.01% improvement in service
  • adding more boxes increases the number of failures you will have (see the sketch after this list).
  • adding more boxes often seems to provide a 0.01% decrease in service
  • adding more boxes often seems to provide more capacity
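
A rough sketch of the "more boxes, more failures" point (the MTBF figure is invented): the number of hardware incidents you have to handle grows linearly with the box count, even when the availability of the service barely changes.

    # Sketch: expected hardware failures per year as the box count grows.
    # The MTBF figure is an illustrative assumption.
    MTBF_YEARS = 3.0   # assume each box suffers one hardware failure every 3 years

    for boxes in (1, 2, 10, 50, 500):
        failures_per_year = boxes / MTBF_YEARS
        print(f"{boxes:3d} boxes -> ~{failures_per_year:5.1f} hardware failures a year")

At 500 boxes that is a hardware failure every couple of days, which is in the same ballpark as Steve's 500-machine setup described above.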

imperfect solution

  • there is always a way to break it ;)
  • no amount of cleverness and complicated setup can save you sometimes.
  • Amazon had an outage that took out all of their S3 storage system for several hours.
  • probably damaged Amazon's S3 annual availability figures
  • is a cascading failure to do with capacity planning?
  • an unclean failure causes a longer outage
  • every well planned HA system I've seen LOOKS better than the one box solution

inadequate design

  • expensive HA setups can have a *lower* availability than simpler setups
  • it's a sham to get IT directors bigger budgets and more kit, and more staff and hence more seniority
  • not 100% safe: firewall CSU/DSU (http://en.wikipedia.org/wiki/CSU/DSU), ISP, single geographical location
  • I've spent many nights fixing various different supposedly reliable clustered database setups that turned out not to fail over very well
  • HA in practice does not work out to be the HA that the design had in mind.
  • HA: it's not always done pretty "simply".

bugs

  • failover problems: unclean failovers; stuff can end up flip-flopping (see the hysteresis sketch after this list)
  • IP packet problems: packet loss, packet errors; NICs can pass heartbeats fine yet still suffer other packet loss or errors
  • Router problems: sometimes it even takes a few seconds for the routing to reconverge on a failover.
  • DB: MySQL: MySQL performance and reliability have been dogging me since the '90s.
  • DB: MySQL: Tom Gidden had an interesting setup in London a few years back using replication and some "stupid tricks"
  • DB: MySQL: I'd suggest bouncing some possible configs of your intended setup via underscore, as folks like Tom, Jan, Matt and so on (sorry if I missed all the other cheeky MySQL gurus out there) are bound to offer some useful pointers.
  • DB: Oracle, Sybase and Redbrick work, just not as well as advertised
  • complex switches, especially GSRs (Gigabit Switch Routers), overheat, blow up, fail and crash far more often than they ought to
  • Complex switches are often less reliable than the average UNIX box
  • SPOF: a badly-designed configuration of multiple boxes can be a single point of failure
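
On the flip-flopping point, a minimal sketch of failover hysteresis (the thresholds are invented): require several consecutive bad checks before failing over, and a longer run of good ones before failing back, so a marginal link does not bounce the service.

    # Sketch of failover hysteresis to damp flip-flopping. Thresholds are
    # illustrative; tune them to the check interval and the cost of a failover.
    FAIL_THRESHOLD = 3       # consecutive failed checks before failing over
    RECOVER_THRESHOLD = 10   # consecutive good checks before failing back

    class FailoverState:
        def __init__(self):
            self.on_primary = True
            self.bad = 0
            self.good = 0

        def record(self, check_ok):
            """Feed in one health-check result; returns True if primary is active."""
            if check_ok:
                self.good += 1
                self.bad = 0
                if not self.on_primary and self.good >= RECOVER_THRESHOLD:
                    self.on_primary = True
            else:
                self.bad += 1
                self.good = 0
                if self.on_primary and self.bad >= FAIL_THRESHOLD:
                    self.on_primary = False
            return self.on_primary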

maintenance

  • systems *DO* fail, no matter how hard you try to make them resilient
  • HA enemies: Unplugging cables is necessary to replace any unit.
  • HA enemies: Unplugging cables is not something that's tested for very well.
  • HA enemies: Unplugging cables often seems to break things quite badly.
  • HA enemies: switch OS upgrade rollouts can cause widespread failure, e.g. a new version of IOS may cause the failover mechanism itself to fail and leave kit flip-flopping.

human error

  • people are thousands if not millions of times more prone to failure than almost any decent equipment.
  • the majority of failures I have experienced occurred from someone pressing the wrong button or misconfiguring something.
  • the larger and more complex the setup is, the greater the scope for human error.
  • computers don't f%^k up... humans do.
  • human error: it takes a combination of cock-ups to cause an HA outage, e.g. network reboot -> routing change -> internal DNS loss -> significant outage
  • human error: the chance of admin downing the wrong interfaces increases
  • HA enemies: more cables increase the chance of someone accidentally unplugging the heartbeat monitor cables
  • HA enemies: human error. Making a mistake 1% of the time is a low figure (see the arithmetic after this list).
  • human error: the probability of the staff accidentally cocking it up during regular maintenance has become higher than the probability of some software or hardware failure.
  • human error: maintenance cockups generally take out entire sites
  • human error is far more likely to cock things up than hardware
  • human error includes a developer writing a runaway query that works fine in testing but sucks in production
  • human error tends to affect all your systems: fallbacks and all.
  • human error includes an ISP bollocksing up a Cisco config
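
Some rough arithmetic on that "1% of the time" figure (the change counts are invented): even a low per-change error rate adds up quickly over a year's maintenance, and unlike hardware faults these mistakes tend to hit the primary and the fallback at the same time.

    # Sketch: chance of at least one maintenance cock-up in a year, given a
    # 1% per-change mistake rate. Change counts are illustrative assumptions.
    ERROR_RATE = 0.01

    for changes_per_year in (10, 50, 200):
        p_clean_year = (1 - ERROR_RATE) ** changes_per_year
        print(f"{changes_per_year:3d} changes/year -> "
              f"{1 - p_clean_year:.0%} chance of at least one cock-up")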

external enemies

  • HA enemies: ISP cable damage. The rate at which major data cables get dug up around the UK is easily on the tens-per-day scale
  • ISPs these days are so flaky it's unbelievable!
  • ISP: there's little one can do about major backhaul outages on multiple suppliers.
  • SPOF: domain registrar accidentally dropping your registration

What are the options?

computer configurations

one small box

  • it's best to have a good, clean, SIMPLE system
  • needs spare hardware ready-to-fit, a working backup strategy, and well-planned recovery procedures.
  • a single box LAMP solution is the only way to reduce cost
  • a single box LAMP solution is reliable
  • a simple setup wins over a complex one.
  • can't handle traffic peaks around 40,000 dynamic page views a minute.
  • How can it be quicker to replace hardware (with any kind of backup) than to reconfigure a network?
  • a hard drive failure is the most probable failure.
  • The effect of a hard drive failure is massively reduced by using even a fairly naïve RAID setup.
  • A single box can cater for hard drive failure: RAID in a box is easy (see the sketch after this list)
  • resetting a simple system: hit the reset button, back in 2 minutes
  • HA is no more useful than a single box setup running off "SAFE" storage. (is this Cisco SAFE? http://www.cisco.com/warp/public/cc/so/cuso/epso/sqfr/safe_wp.pdf)
  • it generally takes less time to replace one box with an identical one manually than it does to reconfigure the network logically to use some other device.
  • sites like MySpace have lowered customers' expectations
  • if a site is down for a few minutes customers don't seem to really mind.
  • companies like Rackspace promise to replace hardware within 1 hour.
  • manually replugging a cable is a one minute job.
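
A back-of-the-envelope sketch of the RAID point (the drive failure rate and rebuild time are invented figures): a mirror only loses the array if the second disk dies inside the rebuild window, which is why even naive RAID takes the most probable failure largely off the table.

    # Sketch: impact of the most probable failure (a dead disk) with and
    # without a RAID-1 mirror. All figures are illustrative assumptions.
    p_drive_year = 0.03    # chance a given drive dies within a year (assumed)
    rebuild_days = 1.0     # time to swap the disk and resync the mirror (assumed)

    # Single disk: any drive failure means an outage and a restore from backup.
    p_single = p_drive_year

    # Mirror: the array is only lost if the second disk dies during the
    # rebuild window that follows the first failure.
    p_first = 1 - (1 - p_drive_year) ** 2
    p_mirror = p_first * p_drive_year * (rebuild_days / 365)

    print(f"single disk : {p_single:.2%} chance of a data-losing failure per year")
    print(f"RAID-1 pair : {p_mirror:.4%} chance per year")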

one big box

  • one big box: you can change everything without taking the box down.
  • expensive: an E2900 is the price of a house - $300,000

two small boxes

  • Even small one-box solutions I'd deliver in a VM (or Xen) cluster.
  • adding kit: How many more 9's on the 99.9% uptime does this give you over a one (big) box solution?
  • How much more money does this cost you over a one (big) box solution?
  • Which part of the A got more H with this new setup ?
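
A naive answer to the "how many more 9's" question, assuming a 99.9% box and, unrealistically, fully independent failures:

    # Naive extra-nines arithmetic for a failover pair. Assumes each box is
    # 99.9% available and fails independently of the other -- which in
    # practice it does not (shared ISP, shared switch, shared humans).
    a = 0.999
    print(f"one box                        : {a:.4%}")
    print(f"two boxes, independent failures: {1 - (1 - a) ** 2:.4%}")

On paper that is three extra nines; in practice the correlated failures listed under "human error" and "external enemies" eat most of them.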

two big boxes

  • two big boxes with a crossover cable
  • One live box with another waiting in standby is not really scalable or ideal in an HA environment
  • Even a couple of everything (server wise) can be done pretty simply
  • provide a high level of both resilience and flexibility.
  • We use it regularly for maintenance and deployments
  • scaled so that each carries < 50% of load
  • < 50% of load - you also have headroom for spikes
  • An E10k or similar would do it (hasn't existed for years...)
  • maybe E2900? An E2900 is the price of a house - $300,000 (http://www.sun.com/servers/index.jsp?cat=Sun%20Fire%20Midrange%20Servers&tab=3)
  • 2 big boxes: less time and effort required than for many boxes?
  • throwing hardware at the problem - I've never worked anywhere that already had one AND had really needed it rather than better SQL.
  • 2 big boxes: Efficient software design could reduce hardware price tag
  • 2 big boxes: for kit that can cost millions, that's a pretty sad state of affairs.

many boxes

  • Resilient firewalls,
  • Resilient load balancers,
  • Resilient SSL hardware acceleration,
  • Resilient web servers,
  • Resilient DNS servers,
  • Resilient firewalls,
  • Resilient application servers,
  • Failover database servers - should be resilient with DB application clustering; might be one day,
  • Resilient firewalls,
  • LAN.
Nearly everything has multiple NICs into resilient switch infrastructure. And to round it off, BGP routing failover to a DR site with all the same again.

ideas

  • Was it worth it or would you do it differently next time ?
  • please write a book on it showing how to do it cost-effectively!
  • this'd be better over a few beers or as a presentation with probabilities and cost figures