CYRIL OSSAI

How to Consolidate Multiple PostgreSQL Databases into a Single Database Using Debezium, Kafka, and Power BI for Analytics

In today’s data-driven world, organizations often run multiple PostgreSQL databases for different applications or business units. Consolidating the data from these databases into a single database is crucial for comprehensive analytics and reporting, especially with tools like Power BI. One efficient way to achieve this is to use Debezium and Kafka to stream changes from the various source databases into a single, consolidated PostgreSQL database that then serves as the reporting source.

Overview of the Solution Architecture

1. Source PostgreSQL Databases: The multiple operational databases whose data you want to consolidate.
2. Debezium: A change data capture (CDC) tool that captures real-time changes from the source PostgreSQL databases and publishes them as events.
3. Kafka: A distributed event streaming platform that transports the change events from the source databases to the target.
4. Target PostgreSQL Database: A consolidated database where data from the multiple sources is stored.
5. Power BI: Used to analyze and visualize the data in the consolidated database.

Steps to Consolidate PostgreSQL Databases

Step 1: Set Up Kafka
Apache Kafka is a high-throughput messaging system that allows you to transport data streams from multiple sources. You’ll need to set up a Kafka cluster to handle the real-time event streaming between your source and target databases.

Install Kafka: Download and install Kafka on your server. For a more resilient setup, run a multi-broker cluster rather than a single broker.
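
If you are installing Kafka directly on a server, a minimal single-broker sketch (assuming a Kafka 3.x download run in KRaft mode, with paths relative to the installation directory) might look like the following; the later commands assume the broker is reachable as kafka-service:9092.

# Format storage and start a single KRaft broker (use several brokers for resilience)
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t "$KAFKA_CLUSTER_ID" -c config/kraft/server.properties
bin/kafka-server-start.sh config/kraft/server.properties

# Kafka Connect, which will host the Debezium and sink connectors (Steps 2 and 3)
bin/connect-distributed.sh config/connect-distributed.properties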

Create Topics: Create a Kafka topic for each source table; these topics will hold the real-time change events from each database. For example:

kafka-topics.sh --create --topic db1_orders --bootstrap-server kafka-service:9092 --partitions 3 --replication-factor 1
kafka-topics.sh --create --topic db2_customers --bootstrap-server kafka-service:9092 --partitions 3 --replication-factor 1
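
To confirm the topics were created, list them against the same broker:

kafka-topics.sh --list --bootstrap-server kafka-service:9092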

Step 2: Configure Debezium for Change Data Capture

Debezium is a powerful open-source tool that supports CDC for a variety of databases, including PostgreSQL. It tracks database changes in real time by reading the transaction log (the write-ahead log, or WAL, in PostgreSQL) and publishes the changes to Kafka.
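
Debezium's pgoutput plugin ships with PostgreSQL 10+, but logical decoding must be enabled on every source database before the connector can stream from it. A minimal sketch, assuming superuser access and the host, database, and role names used in the connector configuration below:

# Run once per source database; changing wal_level requires a PostgreSQL restart
psql -h db1_host -U postgres -d source_db1 <<'SQL'
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 4;
ALTER SYSTEM SET max_wal_senders = 4;
-- the account Debezium connects with needs replication privileges
ALTER ROLE db_user WITH REPLICATION LOGIN;
SQL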

1. Install Debezium: Deploy the Debezium PostgreSQL connector on Kafka Connect. If you don't already have Kafka Connect running, use distributed mode, which scales better for production.

curl -X POST -H "Content-Type: application/json" --data '{
  "name": "debezium-postgres-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "db1_host",
    "database.port": "5432",
    "database.user": "db_user",
    "database.password": "db_pass",
    "database.dbname": "source_db1",
    "database.server.name": "db1",
    "plugin.name": "pgoutput",
    "schema.include.list": "public",
    "table.include.list": "public.orders",
    "slot.name": "debezium_slot",
    "publication.autocreate.mode": "filtered",
    "transforms": "route",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
    "transforms.route.replacement": "$1_$3"
  }
}' http://localhost:8083/connectors

Repeat this process for each of your source databases, giving each connector a unique name and database.server.name. Use schema.include.list and table.include.list to include only the schemas and tables you want to consolidate.

2. Debezium Data Flow: Debezium captures changes and publishes them to Kafka topics; each table change (INSERT, UPDATE, DELETE) becomes an event in the corresponding topic. By default Debezium names topics <server-name>.<schema>.<table> (e.g., db1.public.orders); the RegexRouter transform in the connector config above rewrites these to the db1_orders-style names created in Step 1.
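
To verify that events are flowing, you can tail one of the topics with the console consumer (assuming the same broker address used earlier):

# Peek at the change events Debezium has published for the orders table
kafka-console-consumer.sh --bootstrap-server kafka-service:9092 \
  --topic db1_orders --from-beginning --max-messages 5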

Step 3: Kafka Consumers and Sink Connectors
Now that Kafka is receiving change data events, you’ll need to send these events to the target PostgreSQL database where they will be consolidated.

1. Install a JDBC Sink Connector: The example below uses the Confluent JDBC Sink Connector to write data from the Kafka topics into the target PostgreSQL database. Debezium emits each change as an envelope containing the before and after row state, so the configuration also applies Debezium's ExtractNewRecordState transform to flatten every event into a plain row before it is written.

curl -X POST -H "Content-Type: application/json" --data '{
  "name": "sink-postgres-connector",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "connection.url": "jdbc:postgresql://target_host:5432/target_db",
    "connection.user": "db_user",
    "connection.password": "db_pass",
    "auto.create": "true",
    "auto.evolve": "true",
    "topics": "db1_orders,db2_customers",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "pk.fields": "id"
  }
}' http://localhost:8083/connectors


2. Upsert Logic: Configure the sink connector to perform upsert operations (INSERT if a record doesn’t exist, UPDATE if it does), which ensures that the latest state of the data is always reflected in the target PostgreSQL database.

3. Data Merging: The sink connector writes data into the target database tables. You can choose to merge data from multiple tables into a single table or keep the tables separated as per your reporting needs.
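
A quick way to confirm that changes are landing in the consolidated database is to count rows in the sink-created tables (named after the topics because auto.create is enabled):

psql -h target_host -U db_user -d target_db -c \
  "SELECT 'db1_orders' AS source_table, count(*) FROM db1_orders
   UNION ALL
   SELECT 'db2_customers', count(*) FROM db2_customers;"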

Step 4: Data Transformation and Enrichment
In some cases, the data streaming from multiple source databases needs to be transformed (e.g., schema alignment, field renaming) before it is written to the target database. Kafka provides tools such as Kafka Streams and ksqlDB for real-time transformations; a ksqlDB sketch follows the list below.

  1. Kafka Streams: Use Kafka Streams to process and transform data before pushing it to the target database.
  2. ksqlDB: Allows you to run SQL-like queries on Kafka streams to perform transformations and aggregations.
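
As a rough illustration of the ksqlDB approach, the statements below declare a stream over a topic of flat JSON order rows and derive a renamed, filtered stream onto a new topic that a sink connector could read instead. The orders_flat topic, the column names, and the status filter are all assumptions; a raw Debezium envelope would first need its nested "after" struct unwrapped.

# Sketch only: save the statements, then load them in the ksqlDB CLI
# (ksql http://localhost:8088, then RUN SCRIPT '/path/to/transform.sql';)
cat > transform.sql <<'EOF'
CREATE STREAM orders_src (id INT, customer_id INT, total DOUBLE, status VARCHAR)
  WITH (KAFKA_TOPIC='orders_flat', VALUE_FORMAT='JSON');

CREATE STREAM orders_clean WITH (KAFKA_TOPIC='orders_clean') AS
  SELECT id, customer_id AS customer, total
  FROM orders_src
  WHERE status <> 'CANCELLED'
  EMIT CHANGES;
EOF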

Step 5: Analytics with Power BI
Once your data is consolidated in the target PostgreSQL database, you can connect Power BI to this database for real-time analytics and reporting.

  1. Connect Power BI to PostgreSQL: Use the native PostgreSQL connector in Power BI to import and visualize data.
  2. Build Dashboards: Create interactive dashboards and reports based on the consolidated data. You can build visualizations like trend analysis, forecasting, customer segmentation, etc.
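
It can also help to expose a flat reporting view in the target database so that Power BI queries one pre-joined object; the column and join names below are assumptions about the consolidated tables:

psql -h target_host -U db_user -d target_db <<'SQL'
CREATE OR REPLACE VIEW reporting_orders AS
SELECT o.id,
       o.total,
       c.name   AS customer_name,
       c.region AS customer_region
FROM db1_orders o
JOIN db2_customers c ON c.id = o.customer_id;
SQL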

Step 6: Monitor and Maintain

  1. Monitor Kafka and Debezium: Regularly monitor your Kafka cluster and Debezium connectors for performance and potential bottlenecks.
  2. Schema Evolution: Debezium handles schema changes (e.g., adding new fields to tables), but make sure schema evolution is also supported downstream; the JDBC sink's auto.evolve setting, enabled above, adds new columns to the target tables automatically.
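
Two quick checks cover most day-to-day monitoring: the Kafka Connect REST API reports connector and task health, and pg_replication_slots on each source database shows whether Debezium is keeping up with the WAL (a steadily growing lag means WAL segments are piling up on the source):

# Connector and task status via the Kafka Connect REST API
curl -s http://localhost:8083/connectors/debezium-postgres-connector/status
curl -s http://localhost:8083/connectors/sink-postgres-connector/status

# Replication-slot lag on a source database
psql -h db1_host -U postgres -d source_db1 -c \
  "SELECT slot_name, active,
          pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
   FROM pg_replication_slots;"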

By leveraging Debezium for change data capture, Kafka for event streaming, and a consolidated PostgreSQL database for storage, you can create a robust system for aggregating data from multiple PostgreSQL databases. This setup enables real-time analytics using Power BI while ensuring that data remains synchronized across systems.
