Jason Corso

πŸ’§ πŸ“‰ πŸ’§ Are you wasting money & time: does your data have a leak? πŸ’§ πŸ“‰ πŸ’§

New open source AI feature alert! πŸ’§πŸ””πŸ’§πŸ””πŸ’§πŸ””πŸ’§πŸ””

Generalization in machine learning models is still poorly understood. As a result, standard practice is to heuristically verify our models on holdout test sets and hope that this check has some bearing on performance in the wild. This means faulty testing carries a huge cost, both in critical MLE time and in error-filled data and annotations.

One common failure mode of testing is when the test split is afflicted with data leakage. When testing on such a split, there is no guarantee that generalization is being verified; in the extreme case, no new information is gained about the model's performance outside the train set. Supervised models learn the minimal discriminative features needed to make a decision, and if those features leak into the test set, a dangerous, false sense of confidence can be built in a model. Don't let this happen to you.

Leaky splits can be the bane of ML models, giving a false sense of confidence and then a nasty surprise in production. The image on this post is a sneak peek at what you can expect (this example is taken from ImageNet πŸ‘€).

Check out this Leaky-Splits blog post by my friend and colleague Jacob Sela:
https://medium.com/voxel51/on-leaky-datasets-and-a-clever-horse-18b314b98331

Jacob is also the lead developer behind the new open source Leaky-Splits feature in FiftyOne, available in version 1.1.

This function allows you to automatically:
πŸ•΅ Detect data leakage in your dataset splits
πŸͺ£ Clean your data from these leaks
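To get a feel for what leak detection does under the hood, here is a minimal, library-free sketch in plain Python. It is not the FiftyOne implementation: the `average_hash` below is a toy stand-in for the perceptual hashes or model embeddings a real pipeline would use to flag near-duplicate samples across splits.

```python
# Minimal sketch of leak detection between dataset splits.
# Each "sample" is a tiny grayscale image given as a flat list of pixel
# values; a real tool would compare perceptual hashes or embeddings.

def average_hash(pixels):
    """Toy perceptual hash: one bit per pixel, set if above the mean."""
    mean = sum(pixels) / len(pixels)
    return tuple(int(p > mean) for p in pixels)

def find_leaks(train, test):
    """Return indices of test samples whose hash collides with a train sample."""
    train_hashes = {average_hash(p) for p in train}
    return [i for i, p in enumerate(test) if average_hash(p) in train_hashes]

train = [[0, 0, 255, 255], [10, 20, 30, 40]]
test = [
    [0, 0, 250, 251],  # near-duplicate of train[0]: hashes collide
    [200, 5, 3, 1],    # genuinely new sample
]

print(find_leaks(train, test))  # [0]
```

Real perceptual hashes are robust to small crops, compression, and color shifts, which is exactly why exact-match deduplication alone is not enough to catch leaks.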

This will help you:
βœ”οΈ Build trust in your data
πŸ“Š Get more accurate evaluations

And, it's open source. Check it out on GitHub.


From your friends at Voxel51
