Event-Driven Architecture using Kafka
Sendoso started as a single repository and a monolith application, just like any web application. A single repository and application have benefits: any developer can clone the repo, build, deploy, and dive into any aspect of the application without having to navigate another repository. After all these years, however, our repository has grown significantly. Build and deploy steps take a good part of an hour. The cognitive load has increased too. No developer has the capacity or time to learn the details of every scenario, nor should they be expected to. Daily builds and deployments to production require sign-offs from multiple teams. A single bug or unexplained behavior anywhere in the application can derail the release process and impact every team’s progress.
The key, of course, is splitting the monolith into manageable subdomains. The goal is to have subdomains with a bounded context that can easily be understood by its domain experts. The same process can be repeated recursively as a subdomain grows in scope over time.
There’s a flip side to this. While engineers and test engineers benefit from a service-oriented architecture (SOA), it also means operational overhead and additional infrastructure and observability costs. Without tooling and automation, the SRE/Ops team will bear the brunt of deploying, monitoring, and maintaining multiple applications.
Microservices and Sendoso
Microservices are a set of services that work together to make the whole application operate. Each service is self-reliant yet coordinates with the others, which keeps development and deployment quick. If a developer needs to update a specific service, they update only that service and the rest remain unchanged, which makes the process faster.
For creating microservices, we decided to start with the low-hanging fruit: any new functionality that is not tightly dependent on the existing application is being moved to its own service. Using a shared database in a microservice architecture is an anti-pattern; service interfaces/APIs should be used for information exchange between different services and the core. One note of caution: introducing too many APIs also implies coupling or incorrectly defined service boundaries. That means the code is separated but the data isn’t decoupled.
Hello World Service and Cost Reduction
To make new service creation easy, we are templatizing services: a “hello world” service that can be cloned, built, and deployed to an environment. The service template comes with monitoring and alerting built in.
Relying on Kafka for Data Processing
Message passing in a distributed system is key for scale. We decided to use Kafka as a message bus. Kafka is a distributed platform based on the publish-subscribe (pub-sub) model. This model has two main entities: the producer and the consumer. As the names indicate, the producer produces messages and the consumer consumes them. The producer publishes messages to a data stream, and the consumer, which is subscribed to that stream, processes them. The data stream sits between the two and is managed by a broker; all data exchange happens through the broker, so there is no direct connection between the producer and the consumer.
Kafka has a solid stream-processing framework (Kafka Streams) that runs on the Java Virtual Machine (JVM), the runtime for Java programs and for other languages compiled to Java bytecode; it would have been an obvious first choice for a greenfield project.
A number of Sendoso applications are Ruby-based. We decided to use the Karafka gem for consumers and the waterdrop/ruby-kafka gems for producers. This works well: with just one pod, we were able to move 1 GB between producer and consumer in roughly a minute, using t3.small AWS MSK brokers.
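To make the pub-sub flow concrete, here is a minimal sketch of a producer and consumer pair. The broker addresses, topic, and class names are illustrative; the producer uses the lower-level ruby-kafka API that waterdrop builds on, and the consumer is written in the Karafka 2.x style, so the exact calls may differ from the gem versions actually in use.

```ruby
require "kafka"
require "karafka"

# Producer: publish an event to a topic (broker addresses are placeholders).
kafka    = Kafka.new(["msk-broker-1:9092", "msk-broker-2:9092"], client_id: "sendoso-example")
producer = kafka.producer(required_acks: :all) # wait for the in-sync replicas to acknowledge
producer.produce('{"order_id":42,"status":"shipped"}', topic: "order_events", key: "42")
producer.deliver_messages

# Consumer: a Karafka consumer subscribed to the same topic via the app's routing.
class OrderEventsConsumer < Karafka::BaseConsumer
  def consume
    messages.each do |message|
      # message.key and message.raw_payload mirror what the producer sent
      puts "#{message.key} => #{message.raw_payload}"
    end
  end
end
```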
Source of Truth
Event sourcing is an effective architectural pattern for recording changes to application state. The event sequence is important: we need the changes as they were originally applied. In a pure event-sourcing setup, incoming events are first persisted into Kafka and then processed by services independently; Kafka thereby becomes the source of truth (SOT), a data source that gives a complete picture of the data object as a whole. That, however, would have meant dramatic changes in our core application. Our source of truth therefore remains the core database, but we generate events in Kafka when data gets persisted.

To ensure transactional behavior, we employed the Transactional Outbox pattern. Essentially, we a) created a new events table in the database, and b) wrote event data in the same transaction in which we update our SOT table. Kafka Connect subsequently reads this table and inserts records into the relevant Kafka topics. This ensures that we never end up in an inconsistent state where data was inserted in the database but the event was not added to the Kafka topic, or vice versa.

We evaluated a few connectors for sourcing data from MySQL (JdbcSourceConnector and Debezium). Our scenario was supported out of the box by JdbcSourceConnector, making it possible to have one events table in MySQL from which different rows can be routed to the relevant topic based on the topic field.
TransactionalOutbox Table Schema:
| Topic | Key | Payload | Timestamp |
| --- | --- | --- | --- |
The connector picks up a row and creates the event (Key, Payload) in the respective Kafka topic. The timestamp field helps JdbcSourceConnector track its progress.
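The write side of the pattern is small. The sketch below shows what the outbox insert can look like with ActiveRecord; the model names (Order, OutboxEvent), columns, and topic are illustrative and not Sendoso's actual schema.

```ruby
require "active_record"
require "json"

# Transactional Outbox: the source-of-truth update and the event row are
# committed in the same database transaction, so neither can exist without
# the other. Order and OutboxEvent are assumed ActiveRecord models.
class OrderShipper
  def ship!(order, tracking_number)
    ActiveRecord::Base.transaction do
      # 1) Update the source-of-truth table.
      order.update!(status: "shipped", tracking_number: tracking_number)

      # 2) Write the event row; Kafka Connect later reads it and publishes
      #    the payload to the topic named in the `topic` column.
      OutboxEvent.create!(
        topic:     "order_events",
        key:       order.id.to_s,
        payload:   { order_id: order.id, status: "shipped" }.to_json,
        timestamp: Time.now.utc
      )
    end
  end
end
```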
The connector is configured as follows in Kafka Connect, which is deployed in Kubernetes:
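The exact configuration is not reproduced here; the JSON below is an illustrative reconstruction. The connection details, table name, and transform chain (the standard ValueToKey/ExtractField SMTs plus Confluent's ExtractTopic transform to route each row by its topic column) are assumptions rather than the production settings.

```json
{
  "name": "outbox-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://mysql:3306/sendoso",
    "table.whitelist": "transactional_outbox",
    "mode": "timestamp",
    "timestamp.column.name": "timestamp",
    "topic.prefix": "outbox_",
    "transforms": "createKey,extractKey,route,extractPayload",
    "transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
    "transforms.createKey.fields": "key",
    "transforms.extractKey.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.extractKey.field": "key",
    "transforms.route.type": "io.confluent.connect.transforms.ExtractTopic$Value",
    "transforms.route.field": "topic",
    "transforms.extractPayload.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
    "transforms.extractPayload.field": "payload"
  }
}
```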
Notice how key, topic, payload, and timestamp fields are used.
At-Least-Once and In-Order Guarantees
Kafka Connect provides robust defaults out of the box. For example, to ensure delivery of every event, it sets acks=all so that the producer doesn't move forward until the data has been committed to at least the in-sync replicas (ISR). Kafka Connect does this by default, but we needed to do the same wherever we produce directly to Kafka topics.
It is a good idea to keep the number of in-sync replicas at 2 or, preferably, 3 so that Kafka will not lose data in the face of a broker crash.
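One way to apply these durability settings is at topic creation time. The command below is an illustrative example (the broker address, topic name, and counts are placeholders): a replication factor of 3 with min.insync.replicas=2 lets acks=all producers survive the loss of a single broker without losing data.

```sh
kafka-topics.sh --bootstrap-server msk-broker-1:9092 \
  --create --topic order_events \
  --partitions 3 --replication-factor 3 \
  --config min.insync.replicas=2
```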
There are a few other configurations that are important for the in-order guarantee, e.g., setting max.in.flight.requests.per.connection=1 and setting the following timeouts to reasonably high values: request.timeout.ms, max.block.ms, and delivery.timeout.ms.
When a topic has multiple partitions, the object key should be used for partitioning. This ensures that all updates to an object land on the same partition and are consumed by the same consumer, and are hence applied in the original order.
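A short sketch of keyed publishing with ruby-kafka follows; the topic name and payloads are illustrative. Because both updates share the same key, they land on the same partition and are read back in the order they were produced.

```ruby
require "kafka"
require "json"

kafka    = Kafka.new(["msk-broker-1:9092"], client_id: "sendoso-example")
producer = kafka.producer(required_acks: :all)

order_id = "42"
[{ status: "created" }, { status: "shipped" }].each do |event|
  # The object key doubles as the partitioning key, so every update to
  # order 42 goes to the same partition and is consumed in order.
  producer.produce(event.to_json, topic: "order_events", key: order_id)
end
producer.deliver_messages
```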
Retry and Dead Letter Queues
We also used retry and dead-letter queues, as explained here. This ensures progress if a poison or invalid event has entered the stream. However, in this case messages may be processed out of order, so (asynchronous) retry queues are not suitable for scenarios where in-order delivery is important.
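As a rough illustration of the dead-letter side, the sketch below parks messages that fail processing on a separate topic so the main stream keeps moving. The topic names and the process_event helper are hypothetical, and since our actual consumers run on Karafka, the real implementation differs from this bare ruby-kafka loop.

```ruby
require "kafka"
require "json"

kafka    = Kafka.new(["msk-broker-1:9092"], client_id: "sendoso-example")
consumer = kafka.consumer(group_id: "order-events-processor")
consumer.subscribe("order_events")

consumer.each_message do |message|
  begin
    process_event(JSON.parse(message.value)) # hypothetical business logic
  rescue StandardError
    # A poison or invalid message: move it to a dead-letter topic instead of
    # blocking the partition, so the rest of the stream keeps flowing.
    kafka.deliver_message(message.value, key: message.key, topic: "order_events.dlq")
  end
end
```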
Kafka Observability and Operations
Kafdrop is a web UI for viewing Kafka topics and browsing consumer groups. The tool displays information such as brokers, topics, partitions, and consumers, and it also supports topic creation and deletion. Besides that, it allows us to observe consumer lag and to inspect messages from any partition and offset.
While this is not a replacement for monitoring and observability, it has been beneficial to visualize what is happening on the Kafka brokers. We have configured separate Kafdrop pods for each MSK cluster.
For monitoring and alerting on Kafka metrics, we rely on the newrelic-prometheus integration to forward MSK metrics to New Relic. This has also been deployed as a Kubernetes pod.
Next In Line
This is an important step in the right direction, but it also brings many new challenges. One common issue with breaking a monolith into microservices is that contributors lose sight of the overall context. The drawbacks can be mitigated by correctly drawing the responsibility/service boundaries and adjusting ownership as the domain evolves. Another popular option is to use orchestration services, i.e., a service that talks to many other microservices. The latter microservices may not need the full picture, but the orchestration service provides the cognitive context. Similarly, as we add more services, we plan on investing in load balancing, security, circuit breakers, and request throttling between services. A service mesh will be an important consideration at that point.
Acknowledgments
Kafka is complemented by many open-source tools and technologies. For this project, we benefited from Kafka, Kafdrop, Karafka, Kafka Connect, and the Kafka MySQL connector, to name a few. Shout-out to Maciej Mensfeld for his responses to our queries about throughput issues in Karafka.