Machine learning and deep learning techniques 'learn' by recognizing and generalizing patterns and statistical properties within the training data. The efficacy of these models in real-world scenarios is contingent on the assumption that the training data is an accurate representation of the production data. However, this assumption often breaks in the real world. Consumer behaviours and market trends may undergo gradual or even drastic shifts. Sensors responsible for data collection can experience a decline in sensitivity over time. Additionally, disruptions such as broken data pipelines, alterations in upstream systems, and changes in external APIs can introduce gradual or abrupt changes to the data used for predictions in production. In essence, the dynamic nature of real-world conditions poses challenges to the sustained accuracy and reliability of any ML system. Therefore, it is crucial to understand how the model behaves in production and promptly identify and resolve any issues that may arise. One of the critical aspects of ML Monitoring is identifying data drift.
What is data drift?
"Data drift is a change in the statistical properties and characteristics of the input data. It occurs when a machine learning model production encounters data that deviates from the data the model was initially trained on or earlier production data"[1]. Simply put, data drift is a change in the distribution of the input features, as illustrated in the figure below.
How to identify data drift?
Data drift monitoring necessitates a well-defined reference dataset. This dataset serves as a benchmark against which production data can be systematically compared and analyzed. Only by establishing a baseline via this reference dataset is it possible to discern any variations in the distribution of features, enabling the timely identification of potential drift and ensuring the ongoing reliability and performance of the model.
Methods to identify data drift
Rule Based:
Heuristic-based alerts can be set up to indirectly monitor data drift (a minimal sketch follows this list):
- Percentage of missing values.
- Percentage of numeric values outside a predefined min-max threshold.
- Percentage of new values in a categorical feature.
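A minimal sketch of these checks, assuming two pandas DataFrames `reference` and `current` with the same schema (the column names passed in are purely illustrative):

```python
import pandas as pd

def rule_based_checks(reference: pd.DataFrame, current: pd.DataFrame,
                      numeric_col: str, categorical_col: str) -> dict:
    """Heuristic drift signals for one numeric and one categorical feature."""
    # Percentage of missing values in the current batch.
    missing_pct = current[numeric_col].isna().mean() * 100

    # Percentage of numeric values outside the min-max range seen in the reference data.
    lo, hi = reference[numeric_col].min(), reference[numeric_col].max()
    numeric_vals = current[numeric_col].dropna()
    out_of_range_pct = (~numeric_vals.between(lo, hi)).mean() * 100

    # Percentage of categorical values never seen in the reference data.
    known = set(reference[categorical_col].dropna().unique())
    categorical_vals = current[categorical_col].dropna()
    new_category_pct = (~categorical_vals.isin(known)).mean() * 100

    return {"missing_pct": missing_pct,
            "out_of_range_pct": out_of_range_pct,
            "new_category_pct": new_category_pct}
```

Each percentage can then be compared against an alert threshold chosen for the specific pipeline.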
Statistical Tests:
Parametric and non-parametric statistical tests can be used to compare the production data against the reference dataset (a SciPy-based sketch follows this list), for example:
- Two-sample t-test - to compare the means of a numeric feature.
- Kolmogorov–Smirnov (KS) test - to test for equality of distribution of numeric features.
- Chi-squared test - to test for equality of distribution of categorical features.
- K-sample Anderson–Darling (AD) test - tests the null hypothesis that the k samples are drawn from the same population, without having to specify the distribution function of that population.
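As a sketch, these tests can be run with SciPy [2]; `reference` and `current` below are assumed to be samples of a single feature drawn from the reference and production data:

```python
import numpy as np
import pandas as pd
from scipy import stats

def numeric_drift_tests(reference: np.ndarray, current: np.ndarray,
                        alpha: float = 0.05) -> dict:
    """Compare a numeric feature between reference and production samples."""
    _, t_p = stats.ttest_ind(reference, current, equal_var=False)  # two-sample t-test on means
    _, ks_p = stats.ks_2samp(reference, current)                   # Kolmogorov-Smirnov test
    ad = stats.anderson_ksamp([reference, current])                # k-sample Anderson-Darling test
    return {
        "t_test_drift": t_p < alpha,
        "ks_test_drift": ks_p < alpha,
        # SciPy caps the reported significance level to the [0.001, 0.25] range.
        "ad_test_drift": ad.significance_level < alpha,
    }

def categorical_drift_test(reference: pd.Series, current: pd.Series,
                           alpha: float = 0.05) -> bool:
    """Chi-squared test for equality of category frequencies."""
    categories = sorted(set(reference.dropna()) | set(current.dropna()))
    ref_counts = reference.value_counts().reindex(categories, fill_value=0)
    cur_counts = current.value_counts().reindex(categories, fill_value=0)
    # Scale reference counts so the expected frequencies sum to the current sample size.
    # A category absent from the reference data yields a zero expected count; in practice
    # that case is better caught by the rule-based "new values" check above.
    expected = ref_counts / ref_counts.sum() * cur_counts.sum()
    _, chi_p = stats.chisquare(f_obs=cur_counts, f_exp=expected)
    return chi_p < alpha
```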
Distance Metrics:
Kullback–Leibler divergence (KL divergence) - a non-symmetric measure of the relative entropy, or difference in information, between two distributions.
Jensen–Shannon distance (JS distance) - measures the similarity between two probability distributions. It is based on KL divergence, but unlike KL divergence it is symmetric and always takes a finite value.
Population Stability Index (PSI) - measures how much the distribution of a feature, numeric or categorical, has shifted between two samples.
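A sketch of these three metrics for a numeric feature, using SciPy [2] for KL divergence and JS distance and a hand-rolled PSI (the bin count and the small epsilon are illustrative choices):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

def distance_metrics(reference: np.ndarray, current: np.ndarray,
                     bins: int = 10, eps: float = 1e-6) -> dict:
    """KL divergence, JS distance and PSI for a numeric feature, after binning."""
    # Bin edges come from the reference data; production values outside this range
    # fall outside the bins and are ignored by np.histogram.
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)

    # Convert counts to probabilities; eps keeps empty bins from producing log(0).
    ref_p = (ref_counts + eps) / (ref_counts + eps).sum()
    cur_p = (cur_counts + eps) / (cur_counts + eps).sum()

    return {
        # KL divergence: non-symmetric and unbounded.
        "kl_divergence": float(entropy(cur_p, ref_p)),
        # JS distance: symmetric and always finite (SciPy returns the square root
        # of the JS divergence).
        "js_distance": float(jensenshannon(cur_p, ref_p)),
        # PSI: sum over bins of (cur% - ref%) * ln(cur% / ref%).
        "psi": float(np.sum((cur_p - ref_p) * np.log(cur_p / ref_p))),
    }
```

A commonly quoted rule of thumb for PSI [3] treats values below 0.1 as no significant shift, 0.1-0.25 as a moderate shift, and above 0.25 as a major shift, though such cutoffs still need to be validated for the problem at hand.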
Selecting the right metric
Each metric discussed above has distinct properties and inherent assumptions. Therefore, it is crucial to identify the metric that aligns most effectively with the problem. This selection should consider both the volume of data and the magnitude of drift that is actually significant for the particular model.
To gain a deeper understanding, we will compare these metrics on two distinct variations of a numerical feature against the reference data, at various sample sizes. The figure below depicts the distribution of the two variants compared to the reference dataset.
Based on the experiments above, the following observations can be made:
- Statistical Test Sensitivity: It is evident that statistical tests often demonstrate heightened sensitivity when applied to large datasets. Even minute, near-zero differences can attain statistical significance with a sufficiently high volume of data (the small simulation after this list illustrates this).
- Distance Metric Challenges: Distance metrics lack standardized cutoffs for alarms, and their interpretation depends on the specific context of application and analysis goals. Establishing suitable thresholds for these metrics necessitates empirical evaluation based on the data's characteristics and the ML model's objectives.
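To illustrate the first observation, the following small simulation compares hypothetical normal samples with a negligible mean shift; the exact p-values depend on the random seed, but they shrink steadily as the sample size grows:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# A tiny, practically irrelevant mean shift of 0.01 standard deviations.
for n in (1_000, 100_000, 5_000_000):
    reference = rng.normal(loc=0.0, scale=1.0, size=n)
    current = rng.normal(loc=0.01, scale=1.0, size=n)
    _, p_value = ks_2samp(reference, current)
    print(f"n={n:>9,}  KS p-value={p_value:.4f}")

# With enough data the p-value falls below any conventional alpha,
# flagging "drift" even though the shift is negligible in practice.
```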
The code employed for the experiments above is available on GitHub.
In conclusion, the dynamic nature of real-world conditions poses significant challenges to the accuracy and reliability of machine learning systems. Changes in consumer behaviours, market trends, and potential disruptions in data collection mechanisms can introduce gradual or abrupt changes to the data used for predictions in production. In this context, monitoring and identifying data drift becomes paramount. As demonstrated through various statistical tests, distance metrics, and the analysis of experiment results, it is clear that selecting the right metric for monitoring data drift is a nuanced task. The sensitivity of statistical tests and the lack of standardized cutoffs for distance metrics highlight the need for a context-specific and empirical approach to establishing thresholds for effective monitoring. Ultimately, understanding how machine learning models behave in production and promptly addressing any identified issues are critical for ensuring these models' ongoing success and reliability in real-world applications.
References
[1] https://www.evidentlyai.com/ml-in-production/data-drift
[2] https://docs.scipy.org/doc/
[3] https://www.aporia.com/learn/data-science/practical-introduction-to-population-stability-index-psi/