Developers talk about building systems with 4 – no, 5! – nines of availability, and while their ambitions are noble, I sometimes find them pointless. These targets are often set without considering the time and energy needed to sustain such high availability over the long term, and sometimes without even understanding how much uptime is actually being requested.

The table below shows how quickly the maximum allowable downtime decreases as more nines are tacked on:

Availability   # of nines   Yearly downtime      Weekly downtime
90%            1            36d 12h 34m 55s      16h 48m 00s
99%            2            3d 15h 39m 29s       1h 40m 48s
99.5%          2.5          1d 19h 49m 44s       50m 24s
99.9%          3            8h 45m 56s           10m 4s
99.99%         4            52m 35s              1m 0s
99.999%        5            5m 15s               6s
99.9999%       6            31s                  <1s
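The figures above follow from a simple calculation: the allowable downtime is just (1 - availability) multiplied by the length of the period. A quick sketch in Python (illustrative only; the year length of 365.2425 days is an assumption chosen to match the table):

```python
def max_downtime(availability: float, period_seconds: float) -> float:
    """Maximum allowable downtime, in seconds, over a given period."""
    return (1.0 - availability) * period_seconds

YEAR = 365.2425 * 24 * 3600  # mean Gregorian year, in seconds
WEEK = 7 * 24 * 3600

for nines, a in [(2, 0.99), (3, 0.999), (4, 0.9999), (5, 0.99999)]:
    print(f"{nines} nines: {max_downtime(a, YEAR) / 3600:8.2f} h/year, "
          f"{max_downtime(a, WEEK):7.2f} s/week")
```

Running this reproduces the table: 4 nines allows roughly 52 minutes of downtime per year, and 5 nines allows only about 6 seconds per week.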

Being able to say that a service has 4 nines of availability just sounds cool. But is it really needed?

Higher availability isn't always needed

No, it isn't.

Before designing a service, first examine the user's requirements. Some tools really do need close to 100% uptime. For example, when dealing with medical or telecommunications workloads, 5 nines makes sense. Someone's life could be at stake. It's similar to civil engineering, where bridges must be built with 100% uptime in mind to avoid the loss of human life.

But most software engineering is very different from civil engineering. For tools that perform data ETLs or batch processing, 2 nines of availability is probably sufficient. Entertainment services, like YouTube or Minecraft servers, can also afford lower availability. Many services that handle financial transactions can get away with 3 (or maybe 3.5) nines of availability instead of 4. After all, online commerce will recover quickly after being offline for a few hours, even during the Christmas season.

Higher availability is expensive

The cost of high availability comes from several factors.

Often, infrastructure needs to be duplicated or tripled in order to remain online even when some parts go down. Buying that hardware isn't free.
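The reason duplication buys availability is simple probability: if a single instance is up with probability a, then at least one of n independent replicas is up with probability 1 - (1 - a)^n. A minimal sketch, assuming replica failures are independent (a big assumption in practice, since shared networks, power, and deploy pipelines correlate failures):

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of n independent replicas, any one of which suffices."""
    return 1.0 - (1.0 - single) ** replicas

# Two replicas that are each 99% available reach roughly four nines -- on
# paper. Correlated failures make the real number worse.
print(f"{combined_availability(0.99, 2):.6f}")  # prints 0.999900
```

This is also why the cost curve is so steep: each extra nine demands more hardware, and the replicas themselves need load balancers and failover logic, which add their own failure modes.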

Observability tools need to probe more frequently in order to confidently confirm that systems are up at all times. And those observability tools also need monitoring of their own! Otherwise, we see funny situations, like Amazon's status page going down because S3 is down, or Slack not updating its status page to reflect that the service is unavailable.

More experienced engineers are needed to design, build, and maintain these systems too. These systems aren't easy to make, and these engineers don't come cheap.

Higher availability is unrealistic

At the end of the day, 100% availability is unrealistic. Even bridges collapse. You can build the most robust system on the planet, but still, your system will inevitably have downtime. Patches and upgrades may need to be installed periodically. Hurricanes may cause electricity outages. Physical wires may be cut. The internet is not up 100% of the time.

And so, at some point, users will need to concede that 100% availability is impossible. That's okay. Instead of focusing only on minimizing downtime, we can schedule planned downtime, so users can predict and work around the windows when systems will knowingly be offline. We can also create backups and playbooks so that developers have a step-by-step process to follow when unexpected downtime occurs.

High availability is nice, but downtime is just a fact of life.