Introduction
In the digital age, data has become one of the most valuable assets for businesses. As a leading lifestyle service platform in China, 58 Group has been continuously exploring and innovating in the construction of its data integration platform. This article will detail the architectural evolution, optimization strategies, and future plans of 58 Group's data integration platform based on Apache SeaTunnel.
Challenges of the Data Integration Platform
Business Background
58 Group has a wide range of businesses, and with the rapid development of these businesses, the scale of data from various business areas such as recruitment, real estate, second-hand housing, second-hand markets, local services, and information security has increased significantly. 58 Group needs to facilitate the flow and convergence of data between different data sources to achieve unified management, circulation, and sharing of data. This involves not only the collection, distribution, and storage of data but also applications such as offline computing, cross-cluster synchronization, and user profiling.
Currently, 58 Group processes more than 500 billion messages per day, with peak message throughput exceeding 20 million and more than 1,600 integration tasks in operation. Handling data at this scale presents significant challenges.
Challenges
In facilitating the flow and convergence of data between different sources and achieving unified management, circulation, and sharing of data, 58 Group faces challenges including:
- High Reliability: Ensuring data is not lost under various fault conditions, ensuring data consistency, and stable operation of tasks.
- High Throughput: Handling large-scale data streams to achieve high concurrency and bulk data transfer.
- Low Latency: Meeting the business needs for real-time data processing and rapid response.
- Ease of Maintenance: Simplifying configuration and automating monitoring to reduce maintenance burdens, facilitate quick fault detection and resolution, and ensure long-term system availability.
The Evolution of Architecture
The architecture of 58 Group's data integration platform has undergone multiple evolutions to adapt to changing business needs and technological developments.
Early Architecture Overview
- 2017: Used Flume for platform integration management.
- 2018: Introduced Kafka Connect 1.0.
- 2020: Used Kafka Connect 2.4 version, achieving incremental load balancing and CDC (Change Data Capture).
- 2023: Introduced Apache SeaTunnel, integrated into the real-time computing platform, and expanded various Source/Sink.
After introducing Kafka Connect in 2018, 58 Group's data integration platform was built around Kafka-based data integration. The architecture scaled horizontally and processed data in a distributed fashion, running Workers and Tasks across multiple nodes; when a Worker failed, its tasks were automatically redistributed to other Workers, providing high availability. It also supported automated offset management and task/configuration management through a REST API, as sketched below.
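To make the REST-based task and configuration management concrete, here is a minimal sketch against the standard Kafka Connect REST API. The worker address, connector name, and connector class are placeholders rather than 58 Group's actual setup.

```python
# Minimal sketch of Kafka Connect task/config management over its REST API.
# All hosts and names below are placeholders.
import requests

CONNECT_URL = "http://connect-worker-1:8083"  # any worker in the Connect cluster

connector_name = "demo-sink"
connector_config = {
    "connector.class": "org.example.DemoSinkConnector",  # placeholder connector class
    "tasks.max": "4",        # Connect splits the job into up to 4 parallel tasks
    "topics": "demo-topic",
}

# Create or update the connector; Connect persists the config and the
# coordinator assigns the resulting tasks to live Workers.
resp = requests.put(
    f"{CONNECT_URL}/connectors/{connector_name}/config",
    json=connector_config,
    timeout=10,
)
resp.raise_for_status()

# Offsets are committed automatically; operators mostly poll task status.
status = requests.get(
    f"{CONNECT_URL}/connectors/{connector_name}/status", timeout=10
).json()
print(status["connector"]["state"], [t["state"] for t in status["tasks"]])
```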
However, as business volume grew and scenarios diversified, this architecture ran into bottlenecks:
Architectural Limitations
- Inability to achieve end-to-end data integration.
Coordinator Bottleneck Issues
- Heartbeat Timeout: Worker-to-coordinator heartbeat timeouts trigger task rebalancing, causing temporary task interruptions.
- Heartbeat Pressure: Every Worker must keep heartbeating and synchronizing with the coordinator, which has to track Worker states and manage a large amount of task metadata.
- Coordinator Failure: Coordinator downtime affects task allocation and reallocation, causing task failures and decreased processing efficiency.
Impact of Task Rebalance
- Task Pause and Resume: Each rebalance pauses tasks, then reallocates them, leading to brief task interruptions.
- Rebalance Storms: If Worker nodes frequently join or leave the cluster, or network jitter causes heartbeat timeouts, repeated rebalancing can significantly reduce task processing efficiency and introduce delays (the worker settings involved are sketched below).
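For context, these are the standard Kafka Connect worker settings that govern the heartbeat and rebalance behaviour described above. The values shown are the upstream defaults, not 58 Group's production tuning, and they are annotated here as a Python dict purely for readability; in a real deployment they live in connect-distributed.properties.

```python
# Kafka Connect worker settings relevant to heartbeat timeouts and rebalances,
# shown with their upstream default values (illustration only).
worker_config = {
    "heartbeat.interval.ms": 3000,    # how often each Worker heartbeats to the coordinator
    "session.timeout.ms": 10000,      # missing heartbeats beyond this window triggers a rebalance
    "rebalance.timeout.ms": 60000,    # how long Workers may take to rejoin during a rebalance
    # Kafka Connect 2.3+ incremental cooperative rebalancing (the "incremental
    # load balancing" adopted in 2020): lost tasks are reassigned only after a
    # delay instead of stopping every task, which softens, but does not
    # eliminate, rebalance storms.
    "connect.protocol": "compatible",
    "scheduled.rebalance.max.delay.ms": 300000,
}
```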
Given these shortcomings, 58 Group introduced Apache SeaTunnel in 2023, integrating it into the real-time computing platform so that Source and Sink connectors of all kinds can be extended freely.
Current Architecture
Currently, 58 Group's data integration platform is built on the Apache SeaTunnel engine. It reads from Source connectors (Kafka, Pulsar, WMB, Hive, etc.), processes the data with SeaTunnel's built-in Transform features, and writes it to Sink destinations (Hive, HDFS, Kafka, Pulsar, WMB, MySQL, SR, Redis, HBase, Wtable, MongoDB, etc.), while providing efficient task management, state management, task monitoring, intelligent diagnostics, and more. A simplified end-to-end job configuration is sketched below.
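The sketch below shows what one such end-to-end job looks like: a Kafka source, a SQL transform, and a Hive sink. Option names follow the public SeaTunnel connector documentation and may vary slightly by version; the brokers, topic, metastore URI, and table are placeholders, and 58 Group's in-house connectors (WMB, Wtable, etc.) are not shown.

```python
# Write a minimal SeaTunnel job config (Kafka -> SQL transform -> Hive) to disk.
# All connection details are placeholders.
job_conf = """
env {
  job.mode = "STREAMING"
  parallelism = 4
}
source {
  Kafka {
    bootstrap.servers = "kafka-broker:9092"
    topic = "demo_topic"
    consumer.group = "seatunnel_demo"
    format = json
    schema = {
      fields {
        user_id = "bigint"
        event = "string"
        ts = "bigint"
      }
    }
    result_table_name = "kafka_events"
  }
}
transform {
  Sql {
    source_table_name = "kafka_events"
    result_table_name = "cleaned_events"
    query = "SELECT user_id, event, ts FROM kafka_events WHERE event IS NOT NULL"
  }
}
sink {
  Hive {
    source_table_name = "cleaned_events"
    table_name = "demo_db.demo_table"
    metastore_uri = "thrift://hive-metastore:9083"
  }
}
"""

with open("kafka_to_hive.conf", "w") as f:
    f.write(job_conf)
# Submit with: bin/seatunnel.sh --config kafka_to_hive.conf
# (local/cluster deployment flags vary by SeaTunnel version)
```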
Smooth Migration and Performance Tuning
Smooth Migration
When introducing Apache SeaTunnel, 58 Group needed to migrate the data integration platform smoothly, minimizing the impact on users and the business while guaranteeing data consistency: output formats had to stay the same, paths had to stay the same, and no data could be lost.
This goal brought its own challenges. Migration carried cost and risk, the data format of every task's source had to be understood and confirmed, and the migration itself involved multiple steps, making it complex and time-consuming.
To address this, 58 Group took the following measures:
- On the source side, adding a RawDeserializationSchema to stay compatible with unstructured data.
- On the sink side, for example the HDFS sink used for Hive, staying compatible with partition loading and the existing paths.
- Developing automatic migration tools (a simplified sketch follows this list) that:
  - Automatically generate the corresponding SeaTunnel task configuration from each Kafka Connect configuration.
  - Take down the original task, reset offsets, and start the new task.
  - Verify and check the migrated data.
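Here is a simplified sketch of the kind of migration tooling described above, for a Kafka-to-HDFS pipeline: it reads the existing connector configuration from the Kafka Connect REST API, derives a SeaTunnel job config from it, and pauses the old connector. The endpoint paths are standard Kafka Connect, but the field mapping, hosts, and names are illustrative placeholders, not 58 Group's actual tool.

```python
# Simplified migration sketch: Kafka Connect connector -> SeaTunnel job config.
# Hosts, names, and the field mapping are placeholders.
import requests

CONNECT_URL = "http://connect-worker-1:8083"   # placeholder Connect worker address
KAFKA_BROKERS = "kafka-broker:9092"            # placeholder broker list


def fetch_connect_config(name: str) -> dict:
    """Read the live connector configuration from the Kafka Connect REST API."""
    resp = requests.get(f"{CONNECT_URL}/connectors/{name}/config", timeout=10)
    resp.raise_for_status()
    return resp.json()


def to_seatunnel_conf(name: str, cfg: dict) -> str:
    """Derive a SeaTunnel job config from an old Kafka-to-HDFS connector config."""
    topics = cfg.get("topics", "")
    return f"""
env {{ job.mode = "STREAMING" }}
source {{
  Kafka {{
    bootstrap.servers = "{KAFKA_BROKERS}"
    topic = "{topics}"
    consumer.group = "seatunnel_{name}"
    format = json
  }}
}}
sink {{
  # Sink chosen to match the original destination, e.g. an HDFS/Hive sink that
  # keeps the existing partition layout and paths (omitted in this sketch).
}}
"""


def migrate(name: str) -> None:
    cfg = fetch_connect_config(name)
    with open(f"{name}.conf", "w") as f:
        f.write(to_seatunnel_conf(name, cfg))
    # Take the original connector down before starting the SeaTunnel job.
    requests.put(f"{CONNECT_URL}/connectors/{name}/pause", timeout=10)
    # Offset reset and post-migration verification (formats, paths, row counts)
    # would follow here; both steps are environment-specific.


migrate("demo-hdfs-sink")
```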
Performance Tuning
58 Group also carried out several performance optimizations on the data integration platform, including:
- Adding Pulsar Sink Connector: To increase throughput.
- Supporting Array Data: Enhancing HbaseSink compatibility.
- Supporting Expiration Time Setting: Optimizing RedisSink so that written keys can age out (see the sketch after this list).
- Increasing PulsarSource Throughput.
- Optimizing the Compression Method of File Connectors.
- Fixing KafkaSource Parsing Issues: Enhancing the configuration flexibility of Kafka clients.
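For illustration, this is what the RedisSink "expiration time" optimization amounts to at the client level: every written key carries a TTL so stale records age out automatically. The snippet uses the redis-py client directly; the exact option exposed by the patched SeaTunnel RedisSink is not covered in this article.

```python
# Per-key expiration in Redis, shown with the redis-py client.
# Host, key, and value are placeholders.
import redis

client = redis.Redis(host="localhost", port=6379)
client.set("profile:12345", '{"city": "Beijing"}', ex=3600)  # key expires after 1 hour
print(client.ttl("profile:12345"))  # remaining time to live, in seconds
```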
Monitoring and Operations Automation
Additionally, 58 Group improved the stability and efficiency of the data integration platform through monitoring and operations automation:
- Task Monitoring: Real-time monitoring of task status to quickly detect and resolve failures (a polling sketch follows this list).
- Operations Automation: Reducing manual intervention through automated tools to increase operational efficiency.
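A minimal sketch of such task monitoring is shown below: it polls the SeaTunnel Zeta engine's REST API for running jobs and raises an alert for anything that is not RUNNING. The endpoint and port follow the engine's REST API v1 documentation and may differ by version and deployment; the host and the alerting hook are placeholders.

```python
# Poll the SeaTunnel engine's REST API and alert on jobs that are not RUNNING.
# The API host and the alert() implementation are placeholders.
import time

import requests

SEATUNNEL_API = "http://seatunnel-master:5801/hazelcast/rest/maps"  # REST API v1 base


def alert(message: str) -> None:
    print(f"[ALERT] {message}")  # replace with the real alerting channel


def poll_jobs(interval_seconds: int = 60) -> None:
    while True:
        jobs = requests.get(f"{SEATUNNEL_API}/running-jobs", timeout=10).json()
        for job in jobs:
            status = job.get("jobStatus", "UNKNOWN")
            if status != "RUNNING":
                alert(f"job {job.get('jobName')} ({job.get('jobId')}) is {status}")
        time.sleep(interval_seconds)


poll_jobs()
```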
Future Plans
58 Group has clear plans for the future development of the data integration platform:
- Continuously Improve Intelligent Diagnostics: Enhance fault diagnosis accuracy and efficiency through machine learning and artificial intelligence technologies.
- Cloud and Containerization Upgrade: Migrate the data integration platform to the cloud environment and implement containerized deployment to improve resource utilization and flexibility.
Conclusion
The architectural evolution and optimization of 58 Group's data integration platform is a continuous process of iteration and innovation. Through sustained technological exploration and practice, 58 Group has built an efficient, stable, and scalable data integration platform based on Apache SeaTunnel, providing strong data support for business development. In the future, 58 Group will continue to go deeper into the field of data integration to provide better services for users.