Ably Blog for Ably

Posted on Oct 7 • Edited on Oct 8 • Originally published at ably.com

Achieving delivery guarantees in a pub/sub system

#pubsub #development #javascript #architecture

Pub/sub (or Publish/Subscribe) solves two of the great challenges of modern application design. Namely, how to build systems that:

Scale up and down when needed.

Continue serving requests even when something goes wrong.

And it does that well thanks to decoupling. In a pub/sub system, each component is fully independent. This means you can integrate internal and external systems seamlessly, without any one component needing to know the details of how the others work or where they are. If a component fails or needs more capacity, you can easily replace it or replicate it, while the pub/sub system handles routing messages to the new components.

Figure: A comparison of tightly coupled and decoupled architectures in authentication systems, showing how pub/sub systems enable asynchronous communication flows.

But decoupling comes with a trade-off. As the individual components no longer connect directly to each other, they don’t get an immediate status update on whether their message arrived. So, a pub/sub system needs to do more work to make sure messages arrive where they’re needed, in the right order, and the expected number of times.

In this article, we’ll explore how pub/sub systems deliver these guarantees, why only some systems manage to do so, and what impact these guarantees have on your application architecture. Let’s start by going deeper into why pub/sub systems need explicit mechanisms to guarantee delivery.

Why do we need guarantees in pub/sub?

Imagine an app that lets users track stock prices in realtime. Every few seconds, the app polls for updates with a REST API call to the server. If the stock prices that the user follows have changed, the server responds with the updated data.

Because this occurs over a single request-response connection, both the app and server are always aware of the interaction’s status. If there’s an error, the server responds with a 4XX or 5XX HTTP code. If everything works as expected, the server sends the data, and the app acknowledges receipt. In cases of a timeout, both the app and server recognize the failure, so the app can retry the request and the server can note that it’s unclear whether the app received the update.

Figure: An example app backend communicating bidirectionally with an app, sending data to the app and receiving receipt acknowledgments from it.

Scale out to thousands of mobile apps, and constant polling for updates quickly becomes unsustainable. One way to make it scalable and more efficient is to put a pub/sub messaging system between the app’s backend and the individual subscriber devices.

In that scenario, publishing a stock price update would look more like this:

ACME Inc’s stock price changes.
The app backend publishes the updated price to a channel called ACME on the pub/sub platform. The backend’s job is now done, so it can move on to other things.
The pub/sub platform pushes that update out to all devices subscribed to the ACME channel.
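
To make those three steps concrete, here is a minimal in-memory sketch of a broker that fans out a publish to every subscriber of a channel. The `Broker` class and its `subscribe` and `publish` methods are hypothetical names made up for this illustration. They are not Ably's API, and a real platform would do this across a network rather than in one process.

```javascript
// Minimal in-memory pub/sub broker (illustrative only, not Ably's API).
class Broker {
  constructor() {
    // channel name -> Set of subscriber callbacks
    this.channels = new Map();
  }

  subscribe(channel, handler) {
    if (!this.channels.has(channel)) this.channels.set(channel, new Set());
    this.channels.get(channel).add(handler);
    // Return an unsubscribe function so the subscriber can detach later.
    return () => this.channels.get(channel).delete(handler);
  }

  publish(channel, message) {
    const handlers = this.channels.get(channel) ?? new Set();
    // Fan out to every subscriber; the publisher never sees who they are.
    for (const handler of handlers) handler(message);
    return handlers.size;
  }
}

const broker = new Broker();
const phoneA = [];
const phoneB = [];
broker.subscribe("ACME", (msg) => phoneA.push(msg));
broker.subscribe("ACME", (msg) => phoneB.push(msg));

// The backend only needs to know the channel name.
const reached = broker.publish("ACME", { symbol: "ACME", price: 101.5 });
```

Notice that `publish` returns as soon as the handlers have been called. Nothing here tells the backend whether a device actually processed the update, which is exactly the gap that delivery guarantees have to fill.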

Figure: How an example app's backend can scale by communicating with a pub/sub service, which manages receipts and relays messages between the backend and multiple devices.

When it comes to thinking about why we need delivery guarantees in pub/sub systems, the important things to note are:

The app backend has no direct connection with any of the subscribers.
The app backend doesn’t need to know anything about any part of the pub/sub system or the rest of the application architecture. It just needs to know where to send updates.
Because the backend and the mobile devices are no longer tightly coupled, the pub/sub system could be one server or multiple servers running in distinct data centers.

Each of those factors is great for scalability, but complicates tracking the status of an individual message. For example, if the backend disconnects as soon as the pub/sub system acknowledges that it has received the message, then how can it know whether the message is delivered to subscribers? Or if the pub/sub system makes multiple copies of an individual message to ensure fault tolerance, how can the system prevent more than one delivery taking place?

That’s where delivery guarantees come in. The creators of the pub/sub system have to engineer ways of making sure messages reach their destinations in the right order and the right number of times. And the starting point is understanding the different types of delivery guarantee.

The different types of delivery guarantee

Delivery guarantees are just one piece of a larger story about data integrity in distributed systems. There’s no one-size-fits-all solution, and what works for one application might not work for another. Deciding what’s right for your use case often comes down to the trade-offs you're willing to make and whether you want your chosen pub/sub system to handle data integrity for you.

As we saw above, decoupling publishers and subscribers means the publisher has no knowledge of the systems that will receive a message. This allows pub/sub systems to be resilient and scalable, but it also means that multiple nodes within the system might end up handling copies of the same message. That’s great for reliability but it means that data integrity isn't just a binary question of whether a message is delivered or not.

So, as a developer working with a pub/sub system you need to consider:

Delivery: Will every message reach its intended recipient(s)?
Ordering: If you send a series of messages, will they arrive in the correct sequence?
Delivery semantics: How many times will the system deliver each message: exactly once, at least once, or at most once?
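
The difference between at-most-once and at-least-once is easiest to see on a lossy link. The sketch below is a simplified model rather than a real network client. `makeLink`, `sendAtMostOnce`, and `sendAtLeastOnce` are hypothetical helpers, and the link's behavior is scripted so the result is deterministic.

```javascript
// A scripted "network": each send attempt consumes the next outcome.
//   "lost"     -> the message never arrives, so there is no ack
//   "ack-lost" -> the message arrives, but the ack is dropped
//   "ok"       -> the message arrives and the ack comes back
function makeLink(outcomes, onDeliver) {
  let attempt = 0;
  return function send(message) {
    const outcome = outcomes[attempt++] ?? "ok";
    if (outcome !== "lost") onDeliver(message);
    return outcome === "ok"; // true means the sender saw an ack
  };
}

// At-most-once: send a single time and never retry.
function sendAtMostOnce(send, message) {
  send(message);
}

// At-least-once: retry until an ack is seen (bounded for safety).
function sendAtLeastOnce(send, message, maxAttempts = 5) {
  for (let i = 0; i < maxAttempts; i++) {
    if (send(message)) return true;
  }
  return false;
}

// The only attempt is lost, so the subscriber receives nothing.
const receivedA = [];
sendAtMostOnce(makeLink(["lost"], (m) => receivedA.push(m)), "price=101");

// The first ack is lost, so the retry delivers the same message again.
const receivedB = [];
sendAtLeastOnce(makeLink(["ack-lost", "ok"], (m) => receivedB.push(m)), "price=101");
```

In this run `receivedA` ends up empty and `receivedB` holds the same message twice. Exactly-once is what you get when at-least-once delivery is combined with a way to recognize and discard the duplicate.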

In an ideal world, we’d all say “yes please” to guaranteed exactly-once delivery in the right order. But it’s relatively rare to find a single platform that can do it all. So, how do you decide which guarantees are important in your situation?

Which pub/sub delivery guarantees matter?

Depending on the pub/sub platform you choose, you might need to make trade-offs between particular guarantees and other factors such as engineering complexity and latency.

As with most engineering choices, this comes down to what your specific application or use case requires. For example, if you're building a realtime chat app, message ordering will be highly important. People expect their conversations to make sense and it would introduce a lot of complexity, and perhaps delays in the chat UI, if the app itself had to reorder messages before displaying them. On the other hand, in event-driven architectures like logging systems, you might prioritize delivery over strict ordering. While logs can often be processed out of order without impacting their usefulness, losing a log entry could mean missing a critical issue that needs attention.

To learn more about this topic, take a look at our guide to message durability.

Are strict guarantees possible with pub/sub?

Pub/sub systems can offer strict delivery guarantees, like message ordering and exactly-once delivery, but whether they do or not comes down to whether the platform vendor chooses to invest the engineering effort required to provide them.

To understand the complexity of the problem, we need to outline some of the engineering challenges:

Decoupled systems run independently: In a pub/sub system, publishers and subscribers operate asynchronously and independently. Unlike a synchronous REST request, for example, there’s no built-in mechanism for tracking the full journey of a message through the system.
Failures happen: Pub/sub systems often run across multiple nodes and regions. That’s great for the overall reliability of the system but more moving parts means more opportunities for individual components to experience issues. So, pub/sub systems need strategies like message acknowledgments, retries, and persistence to overcome such failures.
Message routing is complex: Likewise, all those different nodes and cloud regions mean that it’s harder to predict how long a message will take to process. Variable latency, congestion, and outages all make enforcing FIFO guarantees challenging.
Each protocol has its own characteristics: The way that publishers and subscribers communicate with the pub/sub system also impacts guarantees. HTTP, with its stateless request-response model, struggles with realtime delivery and reliable ordering due to its reliance on polling. WebSocket’s persistent, full-duplex connection enables realtime delivery, and its reliance on TCP helps to ensure message ordering across a single connection. gRPC offers multiplexed streaming and better flow control, but is not natively available in web browsers.

Despite these challenges, strict delivery guarantees in pub/sub systems are achievable. Different vendors take their own approaches to solving these problems. Naturally, we’re most familiar with Ably’s solution, so let’s explore how Ably's architecture makes strict delivery guarantees possible.

How Ably achieves strict guarantees

Data integrity guarantees are hard to achieve in pub/sub systems but, with the right engineering, they’re entirely achievable. At Ably, we’ve built a global realtime platform that ensures exactly-once delivery, in order. Here’s how we do it.

1. Message persistence and durability

Persistence until delivery: Ably ensures that messages are persisted until they are known to have been delivered to subscribers. This guarantees that no messages are lost in transit.
Backlog for catch-up: If a subscriber disconnects, Ably maintains a message backlog that enables subscribers to catch up on any missed messages once they reconnect, ensuring no gaps in delivery.

2. Fault tolerance

Fault tolerance at the publisher level: If a publisher encounters an issue, Ably lets the publisher try different endpoints across Ably's network to ensure the message gets into the system.
Fault tolerance at the system level: Ably ensures fault tolerance by replicating messages across multiple nodes and regions. That means each message is stored on several geographically distributed nodes. Ably will only acknowledge a message as successfully accepted into the system once it has been replicated across a minimum number of nodes. Should a failure occur, Ably redirects traffic around the system to avoid disruption.
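
The "acknowledge only after a minimum number of replicas" rule can be sketched in a few lines. This is a toy model under stated assumptions. `StorageNode`, `acceptMessage`, and the region names are hypothetical, replication here is synchronous and in-process, and none of it reflects Ably's internal implementation.

```javascript
// Hypothetical storage node that may be up or down.
class StorageNode {
  constructor(name, healthy = true) {
    this.name = name;
    this.healthy = healthy;
    this.stored = new Map(); // message id -> message
  }
  store(message) {
    if (!this.healthy) return false;
    this.stored.set(message.id, message);
    return true;
  }
}

// Acknowledge a publish only once at least `minReplicas` nodes hold the message.
function acceptMessage(nodes, message, minReplicas) {
  let copies = 0;
  for (const node of nodes) {
    if (node.store(message)) copies++;
  }
  return { acked: copies >= minReplicas, copies };
}

const nodes = [
  new StorageNode("us-east"),
  new StorageNode("eu-west", false), // simulate an outage in one region
  new StorageNode("ap-south"),
];

// Two healthy nodes are enough for a minimum of 2, but not for a minimum of 3.
const result = acceptMessage(nodes, { id: "m1", price: 101.5 }, 2);
const strict = acceptMessage(nodes, { id: "m2", price: 102.0 }, 3);
```

With one region down, the first publish is acknowledged and the second is not. An unacknowledged publish is the publisher's cue to retry, which is why the retry needs to be safe to repeat.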

3. Unique message identifiers

Idempotency via unique IDs: Ably assigns each message a unique, timestamp-based serial number. This ensures that if the message is duplicated in the system or delivered more than once, it can be discarded or processed just once, to enable exactly-once delivery semantics.
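
As a rough idea of what a timestamp-based serial could look like, here is a small generator. The format (a zero-padded millisecond timestamp followed by a per-millisecond counter) is an assumption made for this example and is not Ably's actual serial format. `createSerialGenerator` is a hypothetical helper, and the clock is injectable so the behavior can be shown with fixed values.

```javascript
// Hypothetical serial format: <13-digit ms timestamp>-<4-digit counter>.
// Zero-padding makes plain string comparison match chronological order
// (as long as fewer than 10,000 serials are issued per millisecond).
function createSerialGenerator(now = Date.now) {
  let lastMs = -1;
  let counter = 0;
  return function nextSerial() {
    const ms = Math.max(now(), lastMs); // never go backwards if the clock does
    counter = ms === lastMs ? counter + 1 : 0;
    lastMs = ms;
    return `${String(ms).padStart(13, "0")}-${String(counter).padStart(4, "0")}`;
  };
}

// A fixed clock: two messages in the same millisecond, then one a millisecond later.
const ticks = [1000, 1000, 1001];
const nextSerial = createSerialGenerator(() => ticks.shift());
const serials = [nextSerial(), nextSerial(), nextSerial()];
```

The counter keeps two messages from the same millisecond distinct, so the serial can act as both a deduplication key and a sort key.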

4. Maintaining message ordering

Order preservation through protocols: One way Ably maintains message ordering is by using persistent protocols like WebSocket. However, Ably can fall back to other methods so it’s only one part of the solution.
FIFO queues: Ably runs a distributed FIFO (First In, First Out) queue, which is strictly ordered based on timestamp-derived unique identifiers. However, as in any globally distributed system, messages might arrive from different regions at different times because of varying latencies. One way Ably mitigates this is to rely on those unique IDs. As they’re based on a timestamp, even in cases of network failures or node issues, Ably can use them to hold back messages until they can be delivered in the correct order. This ensures that subscribers always process messages in the exact sequence they were published.
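
Holding messages back until they can be released in order is often implemented as a reorder buffer. The sketch below simplifies things by assuming each message carries a contiguous integer `seq` starting at 1, which makes gaps easy to detect. That is an assumption for illustration, since a timestamp-derived ID alone doesn't reveal whether an earlier message is still missing. `ReorderBuffer` is a hypothetical name.

```javascript
// Holds back out-of-order messages until the gap before them is filled.
// Assumes each message carries a contiguous integer `seq` starting at 1.
class ReorderBuffer {
  constructor(deliver) {
    this.deliver = deliver;
    this.nextSeq = 1;
    this.pending = new Map(); // seq -> message waiting for earlier ones
  }

  receive(message) {
    if (message.seq < this.nextSeq) return; // already delivered: drop the duplicate
    this.pending.set(message.seq, message);
    // Release every message that is now contiguous with what was delivered.
    while (this.pending.has(this.nextSeq)) {
      this.deliver(this.pending.get(this.nextSeq));
      this.pending.delete(this.nextSeq);
      this.nextSeq++;
    }
  }
}

const delivered = [];
const buffer = new ReorderBuffer((m) => delivered.push(m.price));

// Messages arrive as 2, 3, 1, for example because of differing regional latencies.
buffer.receive({ seq: 2, price: 102 });
buffer.receive({ seq: 3, price: 103 });
const heldBack = delivered.length; // still 0: nothing can be released yet
buffer.receive({ seq: 1, price: 101 });
```

Once message 1 arrives, all three are released in publish order. The trade-off is latency: a single slow message delays everything queued behind it.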

So, what does that look like in practice?

The journey of a message through Ably

Let’s illustrate this by going back to our stock market data example. ACME Inc’s stock price is rising and our app needs to send updates to subscribers.

1. Publication stage: Getting the message into Ably

The first step is pushing the stock price update from the finance app backend (our publisher) into Ably’s system. Here's how Ably handles this process:

Assigning a unique identifier: Each message is given a unique identifier, typically a timestamp-based ID. This identifier ensures that each message is distinct and helps with both deduplication and maintaining message order as it moves through the system.
Message acceptance: Once the message is sent, Ably acknowledges its acceptance. If the publisher doesn’t receive this acknowledgment (due to network issues or other problems), the publisher automatically retries sending the message until the acknowledgment is received. If the message somehow gets into Ably twice, the unique ID means Ably can reject any duplicates.
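
The retry-until-acknowledged loop only stays safe if every retry reuses the same message ID. Here is a sketch of both sides. `IdempotentBroker`, `publishWithRetry`, and the `dropAcks` option (which simulates acknowledgments lost on the network) are all hypothetical and exist only for this demo. Ably's client libraries handle this for you.

```javascript
// Broker side: remember accepted ids so a retried publish is not stored twice.
class IdempotentBroker {
  constructor() {
    this.seenIds = new Set();
    this.log = [];
  }
  publish(message) {
    if (!this.seenIds.has(message.id)) {
      this.seenIds.add(message.id);
      this.log.push(message);
    }
    return true; // ack either way: the message is safely in the system
  }
}

// Publisher side: resend the SAME id until an ack arrives.
// `dropAcks` simulates how many acks the network loses (demo-only assumption).
function publishWithRetry(broker, message, { dropAcks = 0, maxAttempts = 5 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const acked = broker.publish(message);
    if (acked && attempt > dropAcks) return attempt;
  }
  throw new Error(`no ack after ${maxAttempts} attempts`);
}

const idempotentBroker = new IdempotentBroker();
const attempts = publishWithRetry(
  idempotentBroker,
  { id: "acme-0001", price: 101.5 },
  { dropAcks: 2 }
);
```

Here the publisher needs three attempts before it sees an acknowledgment, yet the broker's log contains the message only once.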

Figure: How an app backend publishes messages to Ably and receives acknowledgments, including persistence to a FIFO queue.

2. Ensuring data integrity: Durability and queuing

After the message has been accepted by Ably, the platform persists it in a queue and makes sure it does so in the correct order to maintain the exactly-once and message order guarantees.

Message persistence and replication: To ensure resilience, Ably replicates the message across multiple nodes and cloud regions. Each node checks for existing messages with the same unique identifier to avoid duplication. This replication protects the message in case of node failure or region-specific outages.
Queuing and FIFO processing: Once persisted, the message is placed into a queue that is ordered by the message ID. Ably processes this queue in strict FIFO (First In, First Out) order, ensuring that messages are delivered to subscribers in the exact sequence they were published.

3. Delivery stage: Routing to subscribers

With the message persisted in a queue in the right order, it’s time to route it to the subscribers:

Routing to subscribers: For subscribers using persistent connections like WebSocket, Ably pushes messages in realtime, minimizing latency. For subscribers using REST or polling methods, the message is held until the client explicitly requests it.
Subscriber acknowledgment: Once a subscriber receives the message, it sends an acknowledgment back to Ably. If Ably doesn’t receive this acknowledgment within a set timeframe, it retries delivery to ensure the message isn’t lost.
Handling retries and deduplication: If retries are necessary, the unique message ID allows subscribers to automatically discard any duplicate messages. This ensures that messages are processed exactly once, even if they’re delivered multiple times due to retries.

4. Delivery retries

If a subscriber disconnects, on reconnection it tells Ably which message it last received. Ably then replays any missed messages in the correct order, ensuring no gaps or duplicates in the data.
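
Resuming from the last received message might look something like this in miniature. `ReplayChannel` and `resumeFrom` are hypothetical names, the serial is a simple counter, and the backlog is kept in memory forever. A real system would bound the backlog by time or size.

```javascript
// Channel that keeps a backlog so reconnecting subscribers can resume.
class ReplayChannel {
  constructor() {
    this.backlog = []; // messages in publish order, each with an increasing serial
    this.nextSerial = 1;
  }
  publish(data) {
    const message = { serial: this.nextSerial++, data };
    this.backlog.push(message);
    return message;
  }
  // Return everything published after the serial the subscriber last saw.
  resumeFrom(lastSerial) {
    return this.backlog.filter((m) => m.serial > lastSerial);
  }
}

const channel = new ReplayChannel();
channel.publish("price=101");
const lastSeen = channel.publish("price=102").serial; // subscriber got this, then dropped
channel.publish("price=103");
channel.publish("price=104");

// On reconnect the subscriber reports `lastSeen` and receives only the gap.
const missed = channel.resumeFrom(lastSeen).map((m) => m.data);
```

Because the subscriber states where it left off, the replay contains the two missed updates and nothing it has already processed.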

Figure: Message persistence in Ably when a subscriber is unavailable, with messages stored until the subscriber reconnects.

5. Global replication and the resultant ordering challenges

Ably’s global distribution model increases fault tolerance but adds complexity to maintaining message order:

Global replication for fault tolerance: Unlike many pub/sub systems that operate in a single region, Ably replicates messages across geographically distinct cloud regions. This ensures high availability, even if one region goes offline. However, it also introduces the risk of messages arriving out of order due to latency differences across regions.
Enforcing ordering: To counter this, Ably uses serial numbers and checks message IDs before delivering them to subscribers. If a message arrives out of order, it is held back until the correct sequence can be restored. This ensures that messages are always delivered in strict FIFO order, even in the face of network inconsistencies.

Figure: How Ably replicates messages across multiple geographic regions to ensure consistent delivery.

6. Exactly-once delivery guarantee

The final stage is ensuring that each message is delivered exactly once and processed by the subscriber with no duplicates. Ably guarantees exactly-once delivery by using a combination of unique message identifiers, acknowledgment tracking, and message persistence. Even in the event of retries or network failures, Ably’s distributed queue enables it to deliver each message exactly once.

Figure: Exactly-once message delivery using Ably, showing sequential processing and the prevention of duplicate transmissions.

When are strict delivery guarantees necessary?

We looked earlier at some of the technical reasons why strict guarantees are important in a pub/sub system. But what types of use cases demand exactly-once delivery and strict ordering?

Chat applications: Getting messages in the wrong order or not at all can make a chat experience feel unreliable and frustrating for end users.
Financial and sports data feeds: Financial data and sports updates drive realtime decisions, whether that’s placing a bet on the outcome of a game or making a buy decision. Missing data could lead to bad decisions and damage the reputation of the platform.
Communication between microservices: Not being able to rely on the communication between parts of your application architecture could lead to bugs, data loss, and unexpected behavior.
Realtime collaboration: In collaborative tools like document editing or design platforms, missing updates can lead to team members working on outdated information, causing inconsistencies and errors.

But in practice, strict delivery guarantees are useful for just about any system you build. It’s rare that data is truly disposable, so if you work with a pub/sub system that can’t guarantee delivery semantics or ordering, then you’ll need to handle message tracking, deduplication, and other ways of enforcing those guarantees in your own code.

For best-in-class pub/sub guarantees, try Ably

As we’ve seen, guaranteeing that a pub/sub system can deliver each message exactly once and in the correct order is a significant engineering challenge. At Ably, we’ve built a global realtime platform that does just that for billions of messages.

Our global architecture and our expertise in distributed systems have enabled us to engineer a realtime platform that guarantees low latency, exactly-once delivery, and message ordering despite the complexities. WebSocket, idempotent publishing with unique identifiers, deduplication within the platform, subscriber acknowledgements, and more make that happen.
