Everything You Need to Know About Availability in Distributed Systems

Introduction

System availability is a critical metric in distributed system design. Unlike latency, which indirectly influences system performance and user experience, availability has a direct impact on both. Availability is a prerequisite for a system to succeed: an unavailable system cannot perform its function and fails by default. But what is availability, and how can we measure it?

What is Availability?

Availability is one of the terms you will be hearing throughout your career as a software engineer, and understanding it is critical to building reliable systems. As systems grow bigger, they become more distributed, and one of the most important goals of distributed systems is to be highly available.

Availability is the probability that a system will be operational at a given time.

In simple terms, availability is how often a system is able to perform its function. It is measured either as the percentage of uptime over a period such as a year, or as the percentage of successful requests within a given timeframe. Availability matters because users who rely on your system or pay for your service expect it to be available all the time. If it is not, they will be unhappy and may switch to another service, even if yours has low latency.

Many companies take availability very seriously. For example, Google has a dedicated team of Site Reliability Engineers (SREs) who work to keep Google’s services available.

SLA and High Availability?

Companies use Service Level Agreements (SLAs) to guarantee availability. For example, GCP has an SLA of 99.95% availability.

GCP guarantees that its services will be available 99.95% of the time in each month. If it misses that target, customers receive a credit of up to 100% of that month’s bill, applied to a future bill. As you can see, availability directly affects the business and the company’s revenue: companies are willing to forgo charging their customers when they cannot deliver the guaranteed level of availability.

Availability is very often expressed in nines. For example, three-nines availability means that the system is available 99.9% of the time in a given year. GCP, for example, guarantees three-nines availability for its Virtual Machines.

Three nines is the level usually referred to as High Availability (HA). High Availability describes systems that offer high operational performance and quality over a relevant period. A system with 99.9% availability can be down for about 8.76 hours in a year. At 99.95% (which is still three nines) that drops to about 4.38 hours, and at 99.99% (four nines) to only 52.56 minutes.

The following table shows the translation from a given availability percentage to the corresponding time a system would be unavailable.
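The conversion from an availability percentage to allowed downtime is simple arithmetic. A minimal sketch, assuming a 365-day year and a 30-day month (the function names are illustrative, not from any library):

```python
# Convert an availability percentage into the downtime it allows.
# Assumptions: a 365-day year and a 30-day month.

MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600
MINUTES_PER_MONTH = 30 * 24 * 60    # 43,200


def allowed_downtime_minutes(availability_percent: float, period_minutes: int) -> float:
    """Return the minutes of downtime permitted in the given period."""
    return period_minutes * (100.0 - availability_percent) / 100.0


def format_minutes(minutes: float) -> str:
    """Render a duration in the largest convenient unit."""
    if minutes >= 24 * 60:
        return f"{minutes / (24 * 60):.2f} days"
    if minutes >= 60:
        return f"{minutes / 60:.2f} hours"
    return f"{minutes:.2f} minutes"


if __name__ == "__main__":
    print(f"{'Availability':<14}{'Downtime/year':<18}{'Downtime/month'}")
    for pct in (90.0, 99.0, 99.9, 99.95, 99.99, 99.999):
        per_year = allowed_downtime_minutes(pct, MINUTES_PER_YEAR)
        per_month = allowed_downtime_minutes(pct, MINUTES_PER_MONTH)
        print(f"{pct:<14}{format_minutes(per_year):<18}{format_minutes(per_month)}")
```

Running the script prints one row per availability level, which makes it easy to see how quickly the downtime budget shrinks with each additional nine.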

How to Calculate Availability?

Availability is measured using either uptime or a count of requests. In either case, the formula has the same structure: successful units / total units.

Time-based Availability

Availability is measured as a percentage of uptime in a given year.

Availability = uptime / (uptime + downtime)
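As a rough sketch of the time-based formula, the function below derives uptime from a list of recorded outages within an observation window. The function name and the outage-list representation are assumptions made for illustration, and outages are assumed not to overlap.

```python
from datetime import datetime, timedelta


def time_based_availability(window_start, window_end, outages):
    """Availability = uptime / (uptime + downtime) for one observation window.

    `outages` is a list of (start, end) datetime pairs, assumed not to overlap.
    """
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window must have positive length")
    downtime = 0.0
    for start, end in outages:
        # Count only the part of each outage that falls inside the window.
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            downtime += (clipped_end - clipped_start).total_seconds()
    uptime = total - downtime
    return uptime / total


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    end = start + timedelta(days=30)
    # One hypothetical outage lasting 43.2 minutes.
    outage = (datetime(2024, 1, 10, 12, 0),
              datetime(2024, 1, 10, 12, 0) + timedelta(minutes=43.2))
    print(f"{time_based_availability(start, end, [outage]):.4%}")
```

A 43.2-minute outage in a 30-day window is 0.1% of the window, so this example works out to three nines.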

Request-based Availability

Availability is measured as a percentage of successful requests in a given timeframe.

Availability = successful requests / (successful requests + failed requests)

For example, if your system received 1000 requests within 24 hours and 800 of them were successful, the availability would be:

Availability = 800 / (800 + 200) = 0.8 = 80%
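The same calculation can be sketched in code. The second helper assumes, purely for illustration, that only HTTP 5xx responses count as failures. What counts as a failed request is a choice each team has to make.

```python
def request_based_availability(successful: int, failed: int) -> float:
    """Availability = successful requests / (successful + failed requests)."""
    total = successful + failed
    if total == 0:
        raise ValueError("no requests observed in this timeframe")
    return successful / total


def availability_from_status_codes(status_codes) -> float:
    """Derive request-based availability from HTTP status codes.

    Assumption: server errors (5xx) are failures, everything else is a success.
    """
    codes = list(status_codes)
    failed = sum(1 for code in codes if code >= 500)
    return request_based_availability(len(codes) - failed, failed)


if __name__ == "__main__":
    # The example from the text: 800 successful requests out of 1000.
    print(f"{request_based_availability(800, 200):.0%}")
```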

How to Achieve High Availability?

Achieving high availability is imperative for modern distributed systems, particularly those supporting mission-critical operations. Multiple strategies can be deployed to reach this objective, each varying in complexity and effectiveness. Below are some of the primary methods used to achieve high availability:

Redundancy

The most straightforward way to ensure high availability is through redundancy. This involves deploying multiple instances of critical system components, such as servers, databases, and networking hardware. When a primary resource fails, the backup resource takes over, thus minimizing downtime.
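To see why redundancy helps, consider a simplified model in which each replica fails independently and the system is up as long as at least one replica is up. Under that assumption (which real deployments only approximate, since failures are often correlated), the combined availability is 1 - (1 - a)^n. The function names below are illustrative.

```python
def combined_availability(replica_availability: float, replicas: int) -> float:
    """Availability of `replicas` independent copies where any one suffices."""
    if not 0.0 <= replica_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    # The system is down only when every replica is down at the same time.
    return 1.0 - (1.0 - replica_availability) ** replicas


def replicas_needed(replica_availability: float, target: float) -> int:
    """Smallest number of replicas whose combined availability meets `target`."""
    if not 0.0 < replica_availability < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("availabilities must be strictly between 0 and 1")
    replicas = 1
    while combined_availability(replica_availability, replicas) < target:
        replicas += 1
    return replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.99, n):.6f}")
```

In this model, two replicas at 99% each give roughly 99.99% combined, which is why even a single backup can add nines.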

Load Balancing

Load balancing distributes incoming requests or workloads across multiple resources. This improves performance and lets the remaining resources absorb traffic when one or more fail. Load balancing algorithms range from simple round-robin to more complex methods that take the current load on each resource into account.
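The two families of algorithms can be sketched as follows. Both classes are illustrative, in-memory toys with hypothetical names. A production load balancer would also need health checks and thread safety.

```python
import itertools


class RoundRobinBalancer:
    """Hands out backends in a fixed rotation, ignoring their load."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)


class LeastLoadedBalancer:
    """Picks the backend with the fewest in-flight requests."""

    def __init__(self, backends):
        self._in_flight = {backend: 0 for backend in backends}
        if not self._in_flight:
            raise ValueError("at least one backend is required")

    def acquire(self):
        # Ties are broken by insertion order of the backends.
        backend = min(self._in_flight, key=self._in_flight.get)
        self._in_flight[backend] += 1
        return backend

    def release(self, backend):
        # Call this when the request handled by `backend` has finished.
        self._in_flight[backend] -= 1


if __name__ == "__main__":
    rr = RoundRobinBalancer(["a", "b", "c"])
    print([rr.pick() for _ in range(4)])
```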

Fault Tolerance

Fault-tolerant systems are designed to continue operation even when components fail. These systems identify failures quickly and switch to backup resources without manual intervention, reducing the downtime experienced by users.
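A minimal sketch of automatic failover, assuming each replica is represented by a callable that either returns a result or raises an exception. The function name is hypothetical, and a real system would add timeouts, health checks, and retry limits.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each replica in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for replica in replicas:
        try:
            return replica()
        except Exception as error:
            # Remember the failure and fall through to the next replica.
            last_error = error
    raise RuntimeError("all replicas failed") from last_error


if __name__ == "__main__":
    def primary():
        raise ConnectionError("primary is down")

    def backup():
        return "served by backup"

    print(call_with_failover([primary, backup]))
```

The caller never sees the primary's failure, which is the behavior the paragraph above describes: the switch to a backup happens without manual intervention.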

Geo-Redundancy

For systems that require global reach, implementing geo-redundancy is beneficial. Geo-redundancy involves distributing resources across multiple locations to safeguard against localized failures such as natural disasters or power outages.

Auto-scaling

Auto-scaling allows systems to automatically adjust resources according to current demand. During peak traffic the system scales up, and during low traffic it scales down. This helps maintain availability under load and optimizes resource utilization, thereby reducing operational costs.
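One common way to make the scaling decision is a proportional rule: scale the instance count by the ratio of observed to target utilization, then clamp it to configured bounds. The sketch below assumes utilization is given in percent. The parameter names and default bounds are illustrative.

```python
import math


def desired_instances(current_instances: int,
                      current_utilization: float,
                      target_utilization: float,
                      min_instances: int = 1,
                      max_instances: int = 10) -> int:
    """Proportional scaling rule, clamped to [min_instances, max_instances]."""
    if target_utilization <= 0:
        raise ValueError("target utilization must be positive")
    # Round up so the fleet is never sized below what the ratio asks for.
    raw = math.ceil(current_instances * current_utilization / target_utilization)
    return max(min_instances, min(max_instances, raw))


if __name__ == "__main__":
    # 4 instances running at 90% against a 60% target: scale up.
    print(desired_instances(4, 90, 60))
    # 4 instances running at 30% against a 60% target: scale down.
    print(desired_instances(4, 30, 60))
```

The upper bound caps cost, while the lower bound keeps enough capacity running to survive the loss of an instance.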

Measuring Availability in Terms of the User Experience

Traditional metrics often focus solely on server uptime, which can give an incomplete picture of user experience. For instance, suppose your backend system is fully operational and shows 100% availability based on server uptime, but a network issue prevents the front end from communicating with the backend. Users are then unable to interact with the system. Although your metrics show 100% availability in this scenario, the system is effectively unavailable to users.

To illustrate, consider a real-world example of a public transportation system. Even if the trains run flawlessly (backend), if the ticket vending machines are malfunctioning or inaccessible (frontend), passengers can’t utilize the trains. Thus, availability metrics must encompass not just backend uptime but also the frontend components’ availability and ability to communicate with the backend.
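One way to reason about this is to treat the user's path as a chain in which every link must work. Assuming the links fail independently (a simplification), the end-to-end availability is the product of the individual availabilities. The function name is illustrative.

```python
import math


def end_to_end_availability(component_availabilities) -> float:
    """Availability of a chain where every component must be up."""
    values = list(component_availabilities)
    if not values:
        raise ValueError("at least one component is required")
    return math.prod(values)


if __name__ == "__main__":
    # Frontend, network path, and backend at three nines each.
    print(f"{end_to_end_availability([0.999, 0.999, 0.999]):.4%}")
    # A perfectly healthy backend behind a broken network path.
    print(f"{end_to_end_availability([1.0, 0.0, 1.0]):.0%}")
```

The second call mirrors the scenario above: the backend reports 100%, yet what the user experiences is 0%.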

Measuring Availability in Terms of the Business Objectives

So why can’t we have 100% availability? Well, there are several reasons for that.

  1. Cost: Making a system 100% available is expensive. It requires redundant hardware, software, and networking components. It also requires complex monitoring and failover systems.
  2. Risk: There is always a risk of failure, no matter how well-designed a system is. A natural disaster, a power outage, or a cyberattack could all cause a system to fail.

In this case, measuring availability against business objectives is more appropriate. For example, suppose the business has set a revenue target of $25 million per year, and we make, on average, $0.01 per successful request. At 100 requests per second * 31,536,000 seconds per year * 80% success rate * $0.01 per successful request, we’ll earn $25.23 million. In other words, even with a 20% failure rate, we’ll still hit our revenue target! This calculation is simplified and ignores the fact that customers may leave if they experience too many failures.

Even so, it illustrates that availability is a business decision and not only a technical one, because making a system highly available costs a lot of money.
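The arithmetic behind the revenue example can be checked with a short script. It also inverts the formula to find the lowest success rate that still meets the target. The function names are illustrative, and the model deliberately ignores customer churn, as noted above.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def annual_revenue(requests_per_second: float,
                   success_rate: float,
                   revenue_per_success: float) -> float:
    """Revenue earned in a year from successful requests only."""
    return requests_per_second * SECONDS_PER_YEAR * success_rate * revenue_per_success


def minimum_success_rate(revenue_target: float,
                         requests_per_second: float,
                         revenue_per_success: float) -> float:
    """Lowest success rate that still reaches the revenue target."""
    return revenue_target / (requests_per_second * SECONDS_PER_YEAR * revenue_per_success)


if __name__ == "__main__":
    print(f"${annual_revenue(100, 0.80, 0.01):,.0f}")
    print(f"{minimum_success_rate(25_000_000, 100, 0.01):.2%}")
```

With the numbers from the text, the break-even success rate comes out just under 80%, which is why the 80% scenario clears the $25 million target with little margin.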

Perceived Availability vs. System Uptime Availability

When talking about availability, it’s crucial to distinguish between “System Uptime Availability” and “Perceived Availability.”

System Uptime Availability

System Uptime Availability refers to a system’s actual, measurable uptime—how often it’s operational and accessible. This is what most people consider when discussing Availability. It’s quantifiable, usually included in SLAs, and measured using metrics like “nines” to indicate the percentage of time a system is available in a given period.

System Uptime Availability is a technical metric, often contractually guaranteed, indicating how often a service is accessible and operational.

Perceived Availability

Perceived Availability, on the other hand, is the user’s impression of the system’s availability. Even if a system is up and running, high latency can create the perception that it’s unavailable or unreliable. For example, if a webpage takes too long to load, a user might perceive the service as unavailable, even if the system uptime is 100%.

Perceived Availability is influenced by the user experience, including latency and other performance issues, and may not always align with technical uptime metrics.
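One way to approximate perceived availability is to count a request as good only if it both succeeded and finished within a latency budget. The sketch below assumes a hypothetical 1000 ms threshold and a simple (succeeded, latency) record per request. Both are illustrative choices rather than a standard.

```python
from typing import Iterable, Tuple

Request = Tuple[bool, float]  # (succeeded, latency in milliseconds)


def success_only_availability(requests: Iterable[Request]) -> float:
    """Share of requests that succeeded, regardless of how long they took."""
    records = list(requests)
    if not records:
        raise ValueError("no requests observed")
    return sum(1 for ok, _ in records if ok) / len(records)


def perceived_availability(requests: Iterable[Request],
                           latency_threshold_ms: float = 1000.0) -> float:
    """Share of requests that succeeded AND met the latency budget."""
    records = list(requests)
    if not records:
        raise ValueError("no requests observed")
    good = sum(1 for ok, latency in records
               if ok and latency <= latency_threshold_ms)
    return good / len(records)


if __name__ == "__main__":
    sample = [(True, 120.0), (True, 2500.0), (False, 80.0), (True, 300.0)]
    print(success_only_availability(sample), perceived_availability(sample))
```

The gap between the two numbers is the portion of traffic that the system technically served but that users likely experienced as an outage.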

Summary

What is Availability?

  1. Availability refers to how often a system is operational and accessible, typically measured as a percentage.
  2. It is an essential goal in the design of distributed systems.
  3. Availability impacts user experience and directly affects business revenue.

SLA and High Availability?

  1. SLAs are contractual guarantees about the level of service, often specifying availability requirements.
  2. High Availability (HA) systems aim for high uptime, often quantified in terms of “nines.”
  3. Failure to meet SLAs can result in financial repercussions, such as crediting customer bills.

How to Achieve High Availability?

  1. Multiple strategies exist for achieving high availability, including redundancy, load balancing, and fault tolerance.
  2. The chosen strategy should align with technical requirements and business objectives.

Measuring Availability in Terms of Business Objectives

  1. 100% availability is often impractical due to financial and risk factors.
  2. Availability should be aligned with business goals, sometimes making it acceptable to have less than perfect uptime.
  3. The cost of high availability must be weighed against the potential revenue and customer satisfaction.