Throughput in Distributed Systems: A Complete Guide



Introduction

In this article, we are going to talk about throughput. We will start by defining what throughput is and why it is essential. Then we will cover how to measure throughput, how to increase it, and the trade-offs you need to make when trying to improve your system’s throughput.

Generally, the larger a system gets, the more distributed it becomes, and one of the main reasons for that is to increase the system’s throughput.

Throughput is a critical metric to understand and optimize for.

What is Throughput?

Throughput is one of those terms we often hear in the context of distributed systems and aim to optimise for.

Definition

If we try to come up with a definition of throughput, it will be something like this:

Throughput is the number of units of work that a system can process in a period.

So, throughput measures how much work a system can do in a given time. It is a measure of the system’s capacity to do work. But before we go into more detail about throughput, let’s break down the definition of throughput and understand what it means in real-world examples.

Real World Example 1

The London Underground is the metro system serving London, United Kingdom.

It has 11 lines and 270 stations. It is one of the busiest tube systems in the world. According to the Mayor of London’s website, as of 2023, 2.6 million people travel from Bond Street to Liverpool Street alone daily.

So, let’s do the math. If we assume that the tube system is open for 15 hours a day and, for simplicity, that traffic is spread evenly across the day, then we can say that this route has a throughput of 2.6 million / 15 ≈ 173,333 people per hour.
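As a quick sanity check, the estimate is simple division. A minimal sketch, using the assumed figures above (2.6 million daily journeys, a 15-hour operating day):

```python
# Back-of-the-envelope throughput estimate for the tube example.
# These figures are the article's assumptions, not official TfL data.
daily_passengers = 2_600_000
operating_hours = 15

per_hour = daily_passengers / operating_hours
per_minute = per_hour / 60

print(f"{per_hour:,.0f} people/hour")     # ~173,333 people/hour
print(f"{per_minute:,.0f} people/minute")  # ~2,889 people/minute
```

The same pattern works for any system: pick a unit of work, pick a time window, and divide.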

Experienced Londoners know that on some lines, you may need to join a queue during rush hour to get on the tube. This is because the underground operates at maximum throughput during this time. To address this problem, you may hear announcements like “This station is busy; please use other stations”, especially at stations like Oxford Circus.

So, if we look at the definition of throughput again, we can say that the tube system on the route has a throughput of 173,333 people per hour or roughly 3000 people per minute.

But there are days, like New Year’s Eve, when traffic is several times higher than average. On these days, at some stations, you may hear announcements like “This station is closed due to overcrowding”, and staff will close the station and not let anyone in until enough people inside have left to free up capacity.

The more traffic a system gets, the more throughput it needs.

Of course, these numbers may not reflect reality exactly, but they show how you can estimate the throughput of a system and then back the estimate with data or metrics.

Real World Example 2

Let’s look at a supermarket checkout. The throughput of the checkout system is the number of customers served in a given time. The math here is relatively easy: if a supermarket has ten checkouts and each checkout serves one customer every 3 minutes, it can serve 100 waiting customers in ~30 minutes. There are two ways to increase the throughput of the checkout system.

  1. Increase the number of checkouts
  2. Increase the speed of each checkout

Each of these options has its trade-offs.

  1. If you increase the number of checkouts, you will need to invest resources, find more space, and hire more staff to operate them, and the extra checkouts may not get enough traffic to justify their cost.
  2. If you increase the speed of each checkout, you will need to invest in better technology and better staff training, or the service quality may decrease.

Increasing the throughput of a system involves necessary trade-offs.
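The checkout example above can be sketched as a tiny calculation; both levers from the list are parameters here (the numbers are the article's assumptions):

```python
# Toy model of the supermarket checkout example:
# throughput = customers served per minute across all checkouts.
def checkout_throughput(num_checkouts: int, minutes_per_customer: float) -> float:
    """Customers served per minute across all checkouts."""
    return num_checkouts / minutes_per_customer

base = checkout_throughput(10, 3)        # 10 checkouts, 3 min per customer
more_lanes = checkout_throughput(20, 3)  # lever 1: double the checkouts
faster = checkout_throughput(10, 1.5)    # lever 2: halve the service time

print(base, more_lanes, faster)  # ~3.33, ~6.67, ~6.67 customers/minute
print(100 / base)                # ~30 minutes to clear 100 waiting customers
```

Both levers double the throughput in this model; the trade-offs listed above are what the model leaves out.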

Throughput in System Design

Now that we understand what throughput is and how to measure it, let’s talk about how it applies to distributed systems. Unlike the tube system and the supermarket checkout, distributed systems are not physical systems; they are software systems.

The throughput of a distributed system is the number of requests that the system can process in a given amount of time, where a request can be anything from a simple HTTP call to a complex database query.

Factors that Affect Throughput

Several factors affect the throughput of a distributed system.

  1. The number of servers - This is the obvious yet tricky one. The more servers you add to your system, the more requests you can serve. Sounds simple, right? But if you have a distributed system with 100 servers, you can’t just throw another 100 servers and expect the throughput to double. There are other factors that you need to take into account.

  2. The network bandwidth - The maximum amount of data that can be transferred between two points in a given time. If the network bandwidth is too low, it will bottleneck the system’s throughput no matter how many servers you have.

  3. The application code - The application code can have a significant impact on throughput. If the code is inefficient or poorly written, it can slow down the system and reduce throughput.

  4. The Hardware - The hardware that the system is running on can also affect throughput. For example, a system with a faster CPU and more memory will be able to process requests more quickly than a system with a slower CPU and less memory.

  5. The Workload - The workload is the mix of requests that the system receives. Some workloads are more demanding than others. For example, a workload consisting of many small requests will have a different impact on throughput than a workload comprising a few large requests.

  6. The configuration - The system’s configuration can also affect throughput. For example, a system configured to use a lot of caching will be able to process requests more quickly than a system not configured to use caching.
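The caching point above is easy to see in action. Below is a toy Python sketch: the 1 ms sleep stands in for an expensive backend call (an assumption for illustration), and a simple memoising cache is compared against the uncached path on a workload with repeated keys:

```python
import time
from functools import lru_cache

def slow_lookup(key: int) -> int:
    # Stand-in for an expensive operation (e.g. a database query).
    time.sleep(0.001)
    return key * 2

@lru_cache(maxsize=None)
def cached_lookup(key: int) -> int:
    # Same work, but results are memoised after the first call per key.
    time.sleep(0.001)
    return key * 2

def requests_per_second(fn, n: int = 200) -> float:
    """Run n requests against fn and return the achieved throughput."""
    start = time.perf_counter()
    for i in range(n):
        fn(i % 10)  # a workload where keys repeat often
    return n / (time.perf_counter() - start)

uncached_rps = requests_per_second(slow_lookup)
cached_rps = requests_per_second(cached_lookup)
print(f"uncached: {uncached_rps:,.0f} req/s")
print(f"cached:   {cached_rps:,.0f} req/s")
```

The cached version only pays the expensive cost once per distinct key, so its measured throughput is far higher on this repeated-key workload; the trade-off, as the article notes later, is that caching has its own costs (staleness, memory).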

Factors Affected by Throughput

Throughput and latency are essential metrics in distributed systems. As we already know, throughput measures how much work a system can do in a given amount of time, while latency measures how long it takes to finish a unit of work. In the context of a distributed system, the unit of work is usually a request.

In general, there is an inverse relationship between throughput and latency. This means that increasing throughput often leads to an increase in latency and vice versa.

Increasing throughput often leads to an increase in latency and vice versa.

Some techniques can be used to increase throughput without increasing latency, like caching and load balancing. But these techniques have their own trade-offs.

The best way to improve throughput and latency depends on the specific system. However, by understanding the relationship between these two metrics, you can make informed decisions about improving your system’s performance.
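One standard way to make the throughput–latency relationship concrete (not from the article itself, but from queueing theory) is Little's Law: the number of requests in flight equals throughput times latency. With a fixed concurrency limit, such as a bounded worker pool, the two metrics move inversely:

```python
# Little's Law: concurrency = throughput * latency, so with fixed
# concurrency, throughput = concurrency / latency.
def throughput(concurrency: int, latency_seconds: float) -> float:
    """Requests/second sustained at a given concurrency and per-request latency."""
    return concurrency / latency_seconds

# 50 workers, 100 ms per request:
print(throughput(50, 0.100))  # ~500 req/s
# Same 50 workers, latency doubles to 200 ms -> throughput halves:
print(throughput(50, 0.200))  # ~250 req/s
```

This is why adding load to a saturated system raises latency: the in-flight count is capped, so extra demand queues up instead of being served.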

How to Measure Throughput

When we talk about measuring the throughput of a system that handles requests, it is evident that we need to count the number of requests the system can handle in a given amount of time. There are several ways to measure throughput. The most common way is to use a tool like Apache Bench or Artillery to send a large number of requests to the system and measure how long it takes for the system to process them.
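Tools like Apache Bench or Artillery do this measurement from the outside, over HTTP. The core idea can be sketched in-process; this is a toy measurement against a stand-in handler (a hypothetical placeholder for your real request path), not a substitute for a proper load test:

```python
import time

def handle_request() -> str:
    # Stand-in for the work your system does per request.
    return "ok"

def measure_throughput(n_requests: int = 10_000) -> float:
    """Run n requests and return requests handled per second."""
    start = time.perf_counter()
    for _ in range(n_requests):
        handle_request()
    elapsed = time.perf_counter() - start
    return n_requests / elapsed

print(f"{measure_throughput():,.0f} requests/second")
```

A real benchmark would also send requests concurrently and report latency percentiles alongside the throughput figure, since the two must be read together.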

You may also need to measure the throughput of a system that does not handle requests, such as a database or a cache. For example, AWS DynamoDB measures capacity in Write Capacity Units (WCU) and Read Capacity Units (RCU). One WCU is defined as one write per second for an item up to 1 KB in size, meaning that if you have a table with 10 WCU, you can write ten items of up to 1 KB each per second. Similarly, one RCU is defined as one strongly consistent read per second, or two eventually consistent reads per second, for an item up to 4 KB in size, meaning that a table with 10 RCU can perform ten strongly consistent reads per second.
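The WCU/RCU arithmetic can be sketched as a back-of-the-envelope calculator (illustrative only; consult the AWS documentation for exact billing rules):

```python
import math

# Rough DynamoDB capacity math based on the unit definitions above.
def wcu_needed(writes_per_second: float, item_size_kb: float) -> int:
    # One WCU = one write/second for an item up to 1 KB;
    # larger items consume one extra WCU per additional 1 KB.
    return math.ceil(writes_per_second * math.ceil(item_size_kb / 1))

def rcu_needed(reads_per_second: float, item_size_kb: float,
               strongly_consistent: bool = True) -> int:
    # One RCU = one strongly consistent read/second for an item up to 4 KB,
    # or two eventually consistent reads/second.
    units = reads_per_second * math.ceil(item_size_kb / 4)
    if not strongly_consistent:
        units /= 2
    return math.ceil(units)

print(wcu_needed(10, 1))         # 10 WCU: ten 1 KB writes per second
print(rcu_needed(10, 4))         # 10 RCU: ten strongly consistent 4 KB reads/second
print(rcu_needed(10, 4, False))  # 5 RCU with eventually consistent reads
```

The point is that "throughput" here is provisioned in domain-specific units rather than raw requests per second.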

This is just a demonstration of how throughput can be measured differently depending on the system. It does not have to be requests per second; it can be any unit of work that the system needs to handle.

Summary

Throughput measures how much work a system can do in a given time. It is an important metric in distributed systems, as it determines how many requests the system can handle. You need to know the throughput requirements before designing a system.

Many factors affect throughput, including the number of servers, the network bandwidth, the application code, the workload, and the hardware. The specific factors significantly impacting throughput will vary depending on the system.

The best way to improve throughput will vary depending on the specific system.

Further Reading

Below are some recommended resources for additional information:

  1. Discover how Amazon maximises throughput to handle millions of orders per day