Jose Boretto

How to join multiple Kafka topics - Overview

Approach comparison

Apache Kafka is a powerful tool for building scalable real-time streaming applications, and a common use case is joining data from multiple Kafka topics. Depending on the needs of your application, there are several ways to approach this problem. In this article, we'll explore three approaches:

  1. N consumers write to 1 table (reader-oriented) using Kafka consumers.

  2. N consumers write to N tables (writer-oriented) using Kafka consumers.

  3. Kafka Streams joins the topics and writes to an intermediate topic, then a consumer writes to 1 table.

Each of these strategies has its own trade-offs in terms of complexity, scalability, and performance. Let's analyze each one in detail.


1. N Consumers Write to One Table (Reader-Oriented) with Kafka Consumer

In this approach, multiple Kafka consumers are each responsible for reading from their respective Kafka topics and writing to a single database table. This method assumes that all consumers are reading from distinct topics, and once they consume data, they normalize or aggregate the data into a unified format and insert it into a single table.

(Diagram: N consumers, each reading from its own topic, all writing to a single table)
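To make the approach concrete, here is a minimal sketch of one such consumer using the plain Java client and JDBC. The topic name (`orders`), the target table (`unified_events`), and the column mapping are hypothetical placeholders; in practice each of the N consumers would normalize its own topic's schema before the insert.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReaderOrientedConsumer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "orders-to-unified-table"); // one consumer group per source topic
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/app", "app", "secret")) {

            consumer.subscribe(List.of("orders")); // each of the N consumers subscribes to its own topic

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Normalize the topic-specific payload into the shared schema,
                    // then write it into the single target table.
                    try (PreparedStatement ps = db.prepareStatement(
                            "INSERT INTO unified_events (event_key, payload) VALUES (?, ?)")) {
                        ps.setString(1, record.key());
                        ps.setString(2, record.value());
                        ps.executeUpdate();
                    }
                }
                consumer.commitSync(); // commit offsets only after the rows are persisted
            }
        }
    }
}
```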

Advantages:

  • Simplicity: This approach is straightforward to implement. Each consumer focuses only on reading from its designated topic and writing to the same target table.

  • Single View: A single table in the database provides a unified view of data from all topics, making queries easier for end-users or downstream systems.

Challenges:

  • Concurrency Issues: With multiple consumers writing to the same table, there’s a potential for race conditions, deadlocks, or row-level locking, especially in databases that don't handle concurrent writes well.

  • Scalability: As the volume of data or the number of topics grows, contention for database resources increases, affecting write performance.

  • Transformation Logic: All consumers must implement consistent transformation logic to maintain data integrity, which can increase complexity if the topics have varying schemas or structures.

  • Write Bottlenecks: Since all writes funnel into a single table, database performance could become a bottleneck.

N+1 Problem:

This approach can encounter the N+1 problem when a single event needs to trigger writes to multiple rows in the database. For example, if one event results in N separate rows being written (or updated), the system must execute N separate database statements (and potentially N transactions) to process that one event. As N grows, writes slow down significantly and database load increases.

  • Performance Impact: The N+1 problem can cause severe performance bottlenecks, especially in high-throughput scenarios. The larger N becomes, the more time and resources are required to process a single event, which could overwhelm the database with excessive writes.
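As a rough illustration of the pattern (with a hypothetical `order_items` table and `OrderItem` record), the loop below issues one statement, and one round trip, per derived row; the batching technique described later in this article collapses these into a single bulk write.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

record OrderItem(String sku, int quantity) {}

class NPlusOneExample {
    // One consumed event fans out into N rows: each row costs its own statement
    // and round trip, so database load grows with N for every single event.
    static void writeItems(Connection db, long orderId, List<OrderItem> items) throws SQLException {
        for (OrderItem item : items) {
            try (PreparedStatement ps = db.prepareStatement(
                    "INSERT INTO order_items (order_id, sku, quantity) VALUES (?, ?, ?)")) {
                ps.setLong(1, orderId);
                ps.setString(2, item.sku());
                ps.setInt(3, item.quantity());
                ps.executeUpdate(); // N executions (and possibly N commits) for a single Kafka event
            }
        }
    }
}
```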

When to Use:

  • You need a unified view of data and the topics have similar structures.

  • Simplicity is key, and your database can handle concurrent writes at scale.


2. N Consumers Write to N Tables (Writer-Oriented) with Kafka Consumer

In this method, each consumer reads from its Kafka topic and writes to its own dedicated table in the database. This is a "writer-oriented" approach, where each consumer is responsible for maintaining its isolated dataset.

(Diagram: N consumers, each reading from its own topic and writing to its own dedicated table)

Advantages:

  • Decoupling: Each consumer is entirely independent of the others. There are no shared resources, so there's no risk of write contention.

  • High Performance: Since each consumer has its own table, there are no race conditions, locking issues, or competition for table-level resources in the database. This leads to better performance when dealing with large volumes of data.

  • Easier Debugging: If one consumer or table encounters a problem, it does not affect the others, making it easier to isolate and fix issues.

Challenges:

  • Querying: Since data is spread across multiple tables, querying becomes more complex. If you need a unified view of the data, you have to join across the per-topic tables at read time (as sketched after this list), which can degrade query performance.

  • Data Consistency: Ensuring consistency across tables can be tricky, especially if tables are updated out of sync or at different rates.
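For example, reassembling a unified view at read time might look like the sketch below, assuming two hypothetical per-topic tables (`orders` and `shipments`) joined on `order_id`:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class UnifiedViewQuery {
    // Building the "single view" at read time: every query that needs data from
    // several per-topic tables has to join them, which gets slower as tables grow.
    static void printOrderWithShipment(Connection db, long orderId) throws SQLException {
        String sql = """
                SELECT o.order_id, o.status, s.carrier, s.shipped_at
                FROM orders o
                LEFT JOIN shipments s ON s.order_id = o.order_id
                WHERE o.order_id = ?
                """;
        try (PreparedStatement ps = db.prepareStatement(sql)) {
            ps.setLong(1, orderId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%d %s %s %s%n",
                            rs.getLong("order_id"), rs.getString("status"),
                            rs.getString("carrier"), rs.getTimestamp("shipped_at"));
                }
            }
        }
    }
}
```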

When to Use:

  • High data volume where write performance is critical.

  • The data in the topics is diverse or unrelated, so there’s no need for a unified view in a single table.

  • You want clear separation of concerns for each data stream.


3. Kafka Streams Writes to Another Kafka Topic, Then a Consumer Writes to One Table

In this approach, Kafka Streams instances read from multiple topics, process the data (joins, aggregations, filtering, etc.), and then write the processed output to a new Kafka topic. A separate Kafka consumer (or consumer group) then reads from this intermediate topic and handles the writes to the database.

(Diagram: Kafka Streams joining N topics into an intermediate topic, with a consumer writing the result to a single table)
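A minimal sketch of such a topology with the Streams DSL might look like this, assuming two hypothetical input topics (`orders` and `shipments`) that share the order id as key, an intermediate topic `orders-enriched`, string serdes, and a recent Kafka Streams version; the actual joins, windows, and serdes depend on your data. A plain consumer group then reads `orders-enriched` and performs the database writes.

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.StreamJoined;

public class JoinTopology {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-join-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");       // keyed by order id
        KStream<String, String> shipments = builder.stream("shipments"); // keyed by order id

        // Join the two topics on their key within a time window and publish the
        // merged record to the intermediate topic; a separate consumer group reads
        // that topic and performs the actual database writes.
        orders.join(
                        shipments,
                        (order, shipment) -> order + "|" + shipment,
                        JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)),
                        StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()))
                .to("orders-enriched", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```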

Advantages:

  • Decoupling Stream Processing and Database Writes: By separating the stream processing from the database writes, you prevent the database from becoming a bottleneck. Kafka Streams focuses solely on processing and transforming data, while a separate consumer manages the database interaction.

  • Scalability: Both the Kafka Streams application and the consumer that writes to the database can scale independently. For instance, if database writes become slow, you can scale the consumer group writing to the database without affecting the Kafka Streams application.

  • Backpressure Handling: Kafka, with its persistent log structure, can act as a buffer between the stream processing and the database writes. If the database slows down, Kafka retains the processed data in the intermediate topic, allowing the system to recover without losing data.

  • Fault Tolerance: Kafka Streams offers built-in fault tolerance with state stores and commits to the Kafka topic, while the downstream consumer group writing to the database can be monitored separately for failures, making the entire pipeline more resilient.

Challenges:

  • Learning Curve: Kafka Streams has a steeper learning curve compared to basic Kafka consumers, especially for teams unfamiliar with the stream processing paradigm.

  • Internal Complexity: Kafka Streams introduces higher-level abstractions, such as KTable, KStream, state stores, and the processing topology, which can make debugging difficult. Understanding how state is managed (using RocksDB) and how the stream’s topology is built and executed adds complexity. If issues arise in how data is persisted or transformed, tracking down the root cause across these abstractions requires deep familiarity with Kafka Streams internals.

  • Debugging State Stores: Investigating state store corruption, lag, or rebuilds, especially with complex stateful operations, can be challenging.

  • RocksDB Management: Kafka Streams uses RocksDB for local storage, which can introduce performance issues if not tuned correctly.

  • Topology Visualization: Understanding the full topology (data flow, partitioning, transformations) can be hard without proper tooling, making the system more opaque.

  • Cost: Kafka Streams can be expensive due to its use of local state stores, changelog topics, and repartitioning, which increase disk usage, network traffic, and Kafka storage overhead. High CPU and memory consumption arise from complex stream operations and large in-memory buffers.

  • Event keys: Every event must have a key, since key-based joins and repartitioning depend on it.

When to Use:

  • When stream processing and database writes require decoupling for scalability and reliability.

  • When you want to prevent the database from being overwhelmed by a sudden surge in data.

  • When you want to take advantage of Kafka’s backpressure and buffering capabilities to absorb variations in database write performance.


Learnings from our experience

Over the past three years, we've gained significant insights into optimizing Kafka-based data pipelines, especially in scenarios involving high throughput and complex data processing. Here are the key lessons we’ve learned:

1- Process Events in Batches to Reduce Database Commits:

By processing events in batches and inserting them in bulk, we’ve reduced the number of commits to the database, which greatly improves performance. Batching minimizes the overhead of frequent database writes and allows more efficient resource usage.
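One possible shape of this, reusing the hypothetical `unified_events` table from earlier: accumulate the records returned by a single `poll()` with JDBC `addBatch` and flush them with one `executeBatch` and one commit.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

class BatchWriter {
    // Write a whole polled batch in one prepared-statement batch and one commit,
    // instead of one statement and one commit per record.
    static void writeBatch(Connection db, ConsumerRecords<String, String> records) throws SQLException {
        db.setAutoCommit(false);
        try (PreparedStatement ps = db.prepareStatement(
                "INSERT INTO unified_events (event_key, payload) VALUES (?, ?)")) {
            for (ConsumerRecord<String, String> record : records) {
                ps.setString(1, record.key());
                ps.setString(2, record.value());
                ps.addBatch();
            }
            ps.executeBatch();
            db.commit();
        } catch (SQLException e) {
            db.rollback();
            throw e;
        }
    }
}
```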

2- Fine-Tuning Kafka Consumers:

Optimizing consumer configurations, such as max.poll.records (which controls how many records a consumer can fetch in a single call) and max.partition.fetch.bytes (which determines the maximum size of data a consumer can fetch per partition), significantly enhanced our ability to manage large data volumes without overwhelming consumers.
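For reference, these settings live in the consumer properties; the values below are illustrative starting points, not recommendations, and should be tuned against record size and per-batch processing time.

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;

class ConsumerTuning {
    // Illustrative values only: the right numbers depend on record size,
    // processing time per batch, and max.poll.interval.ms.
    static Properties tunedProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-to-unified-table");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 2000);                      // records per poll()
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 4 * 1024 * 1024);  // bytes per partition per fetch
        return props;
    }
}
```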

3- Dual Database Connections:

We learned that separating database connections for writing and reading improves overall performance. A dedicated writer connection handles inserts and updates efficiently, while a reader connection optimizes read-heavy operations without contention for database resources.
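A sketch of the idea, assuming HikariCP connection pools and a primary/replica setup (both are assumptions; the article does not prescribe a specific pool or database topology):

```java
import javax.sql.DataSource;

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

class DualDataSources {
    // Two pools: one pointed at the primary for inserts/updates, one that can be
    // pointed at a read replica (or tuned differently) for read-heavy queries.
    static DataSource writerPool() {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl("jdbc:postgresql://db-primary:5432/app");
        cfg.setUsername("app");
        cfg.setPassword("secret");
        cfg.setMaximumPoolSize(10);
        return new HikariDataSource(cfg);
    }

    static DataSource readerPool() {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl("jdbc:postgresql://db-replica:5432/app");
        cfg.setUsername("app");
        cfg.setPassword("secret");
        cfg.setReadOnly(true);
        cfg.setMaximumPoolSize(20);
        return new HikariDataSource(cfg);
    }
}
```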

4- Discard Duplicate Events:

Handling duplicate events is crucial for maintaining data consistency and for avoiding unnecessary processing on both the input and the output topics. We implemented deduplication strategies to discard redundant events, preventing potential performance hits from processing the same data multiple times.
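One common way to implement this (an assumption about the exact strategy, shown here with PostgreSQL syntax) is a `processed_events` table with a unique event id, so the insert itself tells you whether the event is new:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class Deduplicator {
    // Record every processed event id in a dedicated table; the unique key makes
    // the insert a no-op for duplicates, so duplicates can be skipped before any real work.
    static boolean isFirstTimeSeen(Connection db, String eventId) throws SQLException {
        try (PreparedStatement ps = db.prepareStatement(
                "INSERT INTO processed_events (event_id) VALUES (?) ON CONFLICT (event_id) DO NOTHING")) {
            ps.setString(1, eventId);
            return ps.executeUpdate() == 1; // 1 row inserted = new event, 0 = duplicate
        }
    }
}
```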

5- Using Partitioned Tables:

Partitioning large database tables based on specific keys (such as user_id) improved query performance and simplified data management. Partitioned tables allowed faster reads and writes, reducing the load on individual table partitions.
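As an illustration, with PostgreSQL-style declarative partitioning (an assumption; the DDL varies by database), hash-partitioning a hypothetical `user_events` table on `user_id` could look like this:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class PartitionedTableSetup {
    // Hash-partition the big table on the key used by most lookups (user_id here),
    // so reads and writes touch a single, smaller partition.
    static void createPartitionedTable(Connection db) throws SQLException {
        try (Statement st = db.createStatement()) {
            st.execute("""
                    CREATE TABLE user_events (
                        user_id  BIGINT NOT NULL,
                        event_id TEXT   NOT NULL,
                        payload  JSONB,
                        PRIMARY KEY (user_id, event_id)
                    ) PARTITION BY HASH (user_id)
                    """);
            for (int i = 0; i < 4; i++) {
                st.execute("CREATE TABLE user_events_p" + i
                        + " PARTITION OF user_events FOR VALUES WITH (MODULUS 4, REMAINDER " + i + ")");
            }
        }
    }
}
```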

6- Leverage Reactive Programming:

By adopting reactive programming techniques, we were able to make full use of the CPU resources in our microservices. Reactive code enabled us to handle high volumes of asynchronous events efficiently, improving overall throughput without overloading the system.
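A minimal sketch with Project Reactor (assuming Reactor; the article does not name a specific reactive library), spreading per-record transformation across CPU cores before handing the results to the batch writer:

```java
import java.util.List;

import org.apache.kafka.clients.consumer.ConsumerRecord;

import reactor.core.publisher.Flux;
import reactor.core.scheduler.Schedulers;

class ReactiveBatchProcessor {
    // Spread per-record transformation work across available CPU cores without
    // blocking the consumer thread, then collect the results for the batch writer.
    static List<String> transformAll(List<ConsumerRecord<String, String>> records) {
        return Flux.fromIterable(records)
                .parallel()
                .runOn(Schedulers.parallel())
                .map(record -> record.value().toUpperCase()) // placeholder transformation
                .sequential()
                .collectList()
                .block();
    }
}
```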

7- Upserts for Efficient Data Storage:

Instead of performing a SELECT followed by INSERT or UPDATE, we switched to using upserts, which insert a new record or update the existing one in a single atomic statement. This eliminated race conditions and improved data processing speed.
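With PostgreSQL syntax (an assumption; MySQL and others have their own upsert forms), an upsert on the hypothetical `unified_events` table looks like this:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class UpsertWriter {
    // A single statement either inserts the row or updates the existing one,
    // replacing the SELECT-then-INSERT/UPDATE pattern and its race condition.
    // Assumes a unique constraint on event_key.
    static void upsert(Connection db, String eventKey, String payload) throws SQLException {
        try (PreparedStatement ps = db.prepareStatement("""
                INSERT INTO unified_events (event_key, payload)
                VALUES (?, ?)
                ON CONFLICT (event_key) DO UPDATE SET payload = EXCLUDED.payload
                """)) {
            ps.setString(1, eventKey);
            ps.setString(2, payload);
            ps.executeUpdate();
        }
    }
}
```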

8- Dead Letter Queue (DLQ) for Each Topic:

Implementing a DLQ for each Kafka topic ensures that no data is lost when the database throws an error: failed events are captured on a dedicated topic and can be reprocessed as needed.
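A sketch of the publishing side, assuming a hypothetical "<topic>.dlq" naming convention and a plain Kafka producer:

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

class DeadLetterPublisher {
    private final KafkaProducer<String, String> producer;

    DeadLetterPublisher(Properties producerProps) {
        this.producer = new KafkaProducer<>(producerProps);
    }

    // If the database write fails, publish the original record to a per-topic DLQ
    // (here the hypothetical "<topic>.dlq" naming) instead of dropping or blocking on it.
    void sendToDlq(ConsumerRecord<String, String> failed, Exception cause) {
        ProducerRecord<String, String> dlqRecord =
                new ProducerRecord<>(failed.topic() + ".dlq", failed.key(), failed.value());
        dlqRecord.headers().add("error", String.valueOf(cause.getMessage()).getBytes());
        producer.send(dlqRecord);
    }
}
```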


These practices have proven invaluable in ensuring the efficiency, scalability, and reliability of our Kafka-based systems, enabling us to handle increasing data volumes while maintaining high performance and data integrity.

Conclusion

Using Kafka Streams to process data and writing the results to an intermediate Kafka topic before a separate consumer writes to the database is an optimal approach for real-time systems that require high scalability and fault tolerance. This method decouples stream processing from database interactions, preventing the database from becoming a bottleneck, while providing Kafka’s native backpressure and scalability benefits.

It is particularly useful when dealing with complex joins and aggregations across multiple topics, as Kafka Streams excels at processing and transforming data in a reliable, fault-tolerant manner.

Benchmark

https://dev.to/joseboretto/how-to-join-multiple-kafka-topics-benchmark-2heb

Credits

To all the developers involved in this three-year journey.
