Building an Automated Monitoring System for Payment Channels
When third-party channels fail, our awareness is often delayed: typically we learn of anomalies from floods of system alerts or from feedback by users and business teams. For the core system responsible for company-wide payment operations, relying solely on manual maintenance is insufficient. Building a robust automated monitoring system for payment channels therefore becomes crucial.
1. Background
To accommodate growing business demands, we have integrated numerous payment channels. However, the stability of third-party systems varies greatly, and channel failures occur frequently. When such anomalies happen, detection often lags, with alerts or user feedback as the primary indicators. For a core payment system aiming to provide stable services upstream, manual maintenance alone is inadequate. This necessitates the establishment of an automated monitoring and management system for payment channels.
2. Design Goals
Based on our business requirements, the automated payment channel management system should address the following key challenges:
- Monitoring capabilities across multiple channels and entities.
- Rapid fault detection and precise identification of root causes.
- Minimized false positives and missed alerts.
- Automatic failover in case of channel failures.
3. Technology Selection
Given the background, the following technology options were evaluated:
3.1 Circuit Breaker
Circuit breakers are commonly associated with fault isolation and fallback mechanisms. We explored mature solutions such as Hystrix, but identified several limitations for our use case:
- Circuit breakers operate at the interface level, lacking granularity for channel- or merchant-level fault isolation.
- During traffic recovery, residual issues may persist, and there's no ability to define targeted traffic for testing (e.g., specific users or services), increasing the risk of secondary incidents.
3.2 Time-Series Database
After ruling out circuit breakers, we turned to building a custom monitoring system, for which time-series databases are a common foundation. We evaluated the popular options, with the final contenders being Prometheus and a custom solution built on Redis.
Accuracy
Prometheus sacrifices some accuracy in favor of higher reliability, simplicity in architecture, and reduced operational overhead. While this tradeoff is acceptable for traditional monitoring systems, it does not suit high-sensitivity scenarios like automatic channel failover:
- Missed Spikes: Prometheus might miss transient spikes occurring between two sampling intervals (e.g., 15 seconds).
- Statistical Estimates: Metrics like QPS, RT, P95, and P99 are approximations and cannot achieve the precision of logs or database records.
Ease of Integration and Maintenance
Prometheus has a learning curve for business developers and poses challenges in long-term maintenance. Conversely, Redis is already familiar to Java backend developers, offering lower initial learning and ongoing maintenance costs.
Considering the above factors, we decided to build a custom "time-series database" based on Redis to meet our requirements.
4. Architecture Design
Workflow Design
- Transaction Routing: For both receiving and making payments, requests are routed through the respective channel router to filter available payment channels.
- Order Processing: After selecting the channel, the gateway processes the payment or disbursement request and sends it to the third-party provider.
- Monitoring Data: The response from the third-party provider is reported to the monitoring system via a message queue (MQ), as sketched below.
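To make the last step concrete, the report can be carried by a small event object. A minimal sketch, assuming an illustrative field set (the actual message contract is not specified here):

```java
import java.time.Instant;

/** Illustrative monitoring event reported to MQ after each third-party response. */
public class ChannelResultEvent {
    public String channel;     // e.g. "WeChat-Payment"
    public String merchantId;  // e.g. "100000111"
    public String result;      // e.g. "success", "fail", "third_error"
    public long epochSec;      // event time, truncated to the second

    public static ChannelResultEvent of(String channel, String merchantId, String result) {
        ChannelResultEvent e = new ChannelResultEvent();
        e.channel = channel;
        e.merchantId = merchantId;
        e.result = result;
        e.epochSec = Instant.now().getEpochSecond();
        return e;
    }
}
```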
Monitoring System Workflow
- The monitoring system listens to MQ messages and stores monitoring data in Redis.
- The data processing module fetches data from Redis, filters it, calculates failure rates for each channel, and triggers alerts based on configured rules.
- Data in Redis is periodically backed up to MySQL for subsequent fault analysis.
- Offline tasks regularly clean Redis data to avoid excessive storage (see the cleanup sketch below).
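As one possible shape for the cleanup step, a scheduled task can trim per-second entries that fall outside the retention period. A minimal Jedis-based sketch; the key names follow the data structure in Section 5.1, while the class name and retention value are assumptions:

```java
import redis.clients.jedis.Jedis;

/** Sketch of the offline cleanup task; the retention period is illustrative. */
public class RedisCleanupTask {

    private static final long RETENTION_SEC = 24 * 3600; // keep one day of data

    private final Jedis jedis;

    public RedisCleanupTask(Jedis jedis) {
        this.jedis = jedis;
    }

    /** Run periodically (e.g. from a cron job) for each monitored item. */
    public void cleanUp(String item, long nowSec) {
        long cutoff = nowSec - RETENTION_SEC;
        String timeKey = "routeAlarm:alarmitem:timeStore:" + item;

        // Delete the per-second hashes first, then drop their timestamps.
        for (String sec : jedis.zrangeByScore(timeKey, 0, cutoff)) {
            jedis.del("routeAlarm:alarmitem:fieldStore:" + item + ":" + sec);
        }
        jedis.zremrangeByScore(timeKey, 0, cutoff);
    }
}
```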
Data Visualization
To observe changes in channel metrics:
- Metrics are reported to Prometheus (see the sketch below).
- Grafana dashboards display the channel's health status.
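A minimal sketch of the reporting side, assuming Micrometer with a Prometheus registry on the classpath; the metric and tag names are illustrative:

```java
import io.micrometer.core.instrument.MeterRegistry;

/** Illustrative metric reporting; metric and tag names are assumptions. */
public class ChannelHealthMetrics {

    private final MeterRegistry registry;

    public ChannelHealthMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    /** Count each third-party result so Grafana can chart per-channel failure rates. */
    public void report(String channel, String merchantId, String result) {
        registry.counter("payment.channel.result",
                        "channel", channel,
                        "merchant", merchantId,
                        "result", result)
                .increment();
    }
}
```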
Automated Channel Management
- Initially, only manual channel management (online/offline) is enabled due to the sensitivity of the operation.
- After collecting substantial samples and refining the algorithms, the system will gradually enable automated channel management based on monitoring results.
5. Implementation Details
5.1 Data Structure
The data is stored in Redis with a design inspired by time-series databases like InfluxDB:
| InfluxDB | Redis |
| --- | --- |
| tags | set to record monitoring dimensions |
| time | zset to store timestamps (in seconds) |
| fields | hash to store specific values |
- Tags (Labels): Monitored dimensions are stored using Redis sets (`SET`), leveraging their deduplication feature.
- Timestamps: Data points are stored using Redis sorted sets (`ZSET`) to allow time-based lookups and ordering. Each point represents one second.
- Fields (Metrics): Specific monitoring data is stored in Redis hashes (`HASH`). Each key-value pair represents:
  - Key: Result type (e.g., success or failure).
  - Value: Count of occurrences within one second, including specific failure reasons.
Example Redis data structure:
1. **Set**
- Stores the monitored dimensions, specific to the merchant ID.
- **Key**: `routeAlarm:alarmitems`
- **Values**:
- `WeChat-Payment-100000111`
- `WeChat-Payment-100000112`
- `WeChat-Payment-100000113`
- ...
2. **ZSet**
- Stores timestamps (in seconds) for requests from a specific merchant ID. Data for the same second will overwrite previous entries.
- **Key**: `routeAlarm:alarmitem:timeStore:WeChat-Payment-100000111`
- **Scores and Values**:
- `score: 1657164225`, `value: 1657164225`
- `score: 1657164226`, `value: 1657164226`
- `score: 1657164227`, `value: 1657164227`
- ...
3. **Hash**
- Stores the aggregated request results within 1 second for a specific merchant ID.
- **Key**: `routeAlarm:alarmitem:fieldStore:WeChat-Payment-100000111:1657164225`
- **Fields and Values**:
- `key: success`, `value: 10` (count)
- `key: fail`, `value: 5`
- `key: balance_not_enough`, `value: 3`
- `key: third_error`, `value: 2`
- ...
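Putting the three structures together, the write path can look like the following Jedis-based sketch. The key names match the examples above; the class and method names are illustrative:

```java
import redis.clients.jedis.Jedis;

/** Minimal sketch of recording one payment result into the structures above. */
public class ChannelMetricsRecorder {

    private static final String ITEMS_KEY = "routeAlarm:alarmitems";

    private final Jedis jedis;

    public ChannelMetricsRecorder(Jedis jedis) {
        this.jedis = jedis;
    }

    /**
     * @param item      monitored dimension, e.g. "WeChat-Payment-100000111"
     * @param resultKey result type, e.g. "success", "fail", "third_error"
     * @param epochSec  event time, truncated to the second
     */
    public void record(String item, String resultKey, long epochSec) {
        // SET: register the monitored dimension (deduplicated automatically).
        jedis.sadd(ITEMS_KEY, item);

        // ZSET: one member per second; re-adding the same second simply overwrites.
        String timeKey = "routeAlarm:alarmitem:timeStore:" + item;
        jedis.zadd(timeKey, epochSec, String.valueOf(epochSec));

        // HASH: per-second counters keyed by result type.
        String fieldKey = "routeAlarm:alarmitem:fieldStore:" + item + ":" + epochSec;
        jedis.hincrBy(fieldKey, resultKey, 1);
    }
}
```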
5.2 Core Algorithm
To avoid missing short spikes between monitoring intervals and ensure accurate reporting, the algorithm combines local counting with a global sliding window:
- Per-Second Tracking: Records the number of successes and failures for each second.
- Sliding Window Calculation: Computes success and failure counts across the entire window duration, ultimately determining the failure rate for each channel.
Example:
- Window Duration: 1 minute.
- Monitoring Frequency: Every 10 seconds.
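Under these settings, every 10 seconds the calculator sums the per-second counters over the last minute. A minimal sketch reusing the key layout from Section 5.1 (class and method names are assumptions):

```java
import java.util.Map;
import redis.clients.jedis.Jedis;

/** Sketch of the sliding-window failure-rate calculation for one monitored item. */
public class FailureRateCalculator {

    private final Jedis jedis;

    public FailureRateCalculator(Jedis jedis) {
        this.jedis = jedis;
    }

    /** Failure rate over [nowSec - windowSec, nowSec], or -1 when there is no traffic. */
    public double failureRate(String item, long nowSec, int windowSec) {
        String timeKey = "routeAlarm:alarmitem:timeStore:" + item;
        long success = 0, fail = 0;

        // Only seconds that actually saw traffic exist in the ZSET.
        for (String sec : jedis.zrangeByScore(timeKey, nowSec - windowSec, nowSec)) {
            Map<String, String> counters =
                    jedis.hgetAll("routeAlarm:alarmitem:fieldStore:" + item + ":" + sec);
            success += Long.parseLong(counters.getOrDefault("success", "0"));
            fail += Long.parseLong(counters.getOrDefault("fail", "0"));
        }

        long total = success + fail;
        return total == 0 ? -1 : (double) fail / total;
    }
}
```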
Key Factors Affecting Accuracy:
- Monitoring Frequency:
  - Checking too infrequently delays detection, letting short-lived spikes go underreported.
  - Checking too frequently yields too few new samples per evaluation, reducing accuracy.
- Window Size: Must balance sample size against real-time responsiveness.
The frequency and window size are determined based on metrics like daily transaction volume, hourly order frequency, and submission rates for each channel.
5.3 Handling Low Traffic
Challenges with Low Traffic:
- Channel Dimension: Handling channels with low daily transaction volumes.
- Time Dimension: Managing off-peak periods with sparse transactions.
Solution:
For channels with low traffic or off-peak times:
- If there is only one transaction in the monitoring window and it fails, the window size is incrementally expanded:
- Initial Window: 1 minute.
- Expanded Window: Doubles (e.g., 2 minutes, 4 minutes) up to 10x.
- If the failure rate exceeds the threshold even after expansion, an alert is triggered, as such cases are treated as critical anomalies (see the sketch below).
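A sketch of this expansion logic, building on the same per-second counters; the method names and the lone-failure check are illustrative assumptions:

```java
import java.util.Map;
import redis.clients.jedis.Jedis;

/** Sketch of adaptive window expansion for low-traffic channels. */
public class LowTrafficWindow {

    private final Jedis jedis;

    public LowTrafficWindow(Jedis jedis) {
        this.jedis = jedis;
    }

    /** Sums [success, fail] counters over the window ending at nowSec. */
    private long[] counts(String item, long nowSec, int windowSec) {
        String timeKey = "routeAlarm:alarmitem:timeStore:" + item;
        long success = 0, fail = 0;
        for (String sec : jedis.zrangeByScore(timeKey, nowSec - windowSec, nowSec)) {
            Map<String, String> c =
                    jedis.hgetAll("routeAlarm:alarmitem:fieldStore:" + item + ":" + sec);
            success += Long.parseLong(c.getOrDefault("success", "0"));
            fail += Long.parseLong(c.getOrDefault("fail", "0"));
        }
        return new long[] {success, fail};
    }

    /** True if the failure rate still breaches the threshold after expanding up to 10x. */
    public boolean shouldAlert(String item, long nowSec, int baseWindowSec, double threshold) {
        int windowSec = baseWindowSec;               // e.g. 60 seconds initially
        final int maxWindowSec = baseWindowSec * 10; // expand at most 10x

        while (true) {
            long[] c = counts(item, nowSec, windowSec);
            long total = c[0] + c[1];
            double rate = total == 0 ? 0 : (double) c[1] / total;

            // Expand only in the sparse case described above: a lone failed
            // transaction in the window, while there is still room to grow.
            boolean loneFailure = total <= 1 && c[1] > 0;
            if (!loneFailure || windowSec >= maxWindowSec) {
                return rate > threshold;
            }
            windowSec = Math.min(windowSec * 2, maxWindowSec); // 1 min -> 2 -> 4 ...
        }
    }
}
```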
6. Outcomes
- Ensured accuracy of monitoring and alerting, minimizing missed anomalies.
- Merged duplicate alarm entries to reduce alert noise.
- Supported recovery handling when anomalous channels return to normal.
7. Future Plans
To further enhance the automated monitoring system:
- Continuously optimize monitoring algorithms to achieve alert accuracy of 99% or higher.
- Integrate with the monitoring system to enable automatic channel deactivation upon fault detection.
- Implement automatic fault recovery detection and enable automated channel reactivation.