According to Urban Dictionary, the act of being "learnt" is to be turnt on knowledge, or to be under the influence of education. As fate would have it, the entire goal of machine learning is to get our models learnt, and the controlled substance of choice is the data we feed them. As with most things, any inherent pitfalls can be mitigated through oversight and by leaning on the experience of those who have already become learnt. Here we will look at the different real-world approaches that line up with the nonsensical metaphorical framework I have previously constructed.
Supervised Learning
As the name implies, supervised learning is akin to having a parent or teacher hold your hand through most of the learning process: a teacher walking the class through example problems, or taking your child camping and teaching them how to start fires, pitch tents, and identify scat. You end up with a very accurate understanding of the world, but you are dependent on the instructor's prior knowledge. In machine learning this manifests itself as labeled data. If you are lucky, the information has been correctly labeled beforehand, but often you will work with data that is either incorrectly labeled or entirely unlabeled. At that point you have to go through your data and assign labels yourself, either with broad brute-force algorithms or by leaning on a human expert who can apply their domain knowledge to assign accurate labels. Supervised learning is essentially the standard approach to building models, but it is limited by the breadth of its training, and it can be expensive in time and resources to implement properly, depending on the quality and nature of the data and the problem you are working with.
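As a rough sketch of that workflow (assuming scikit-learn and its bundled iris data as a stand-in for your own labeled set), supervised learning boils down to fitting a model on labeled examples and checking how well it generalizes to examples it has never seen:

```python
# Minimal supervised-learning sketch: fit on labeled examples,
# then evaluate on a held-out split the model never saw.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # features plus human-provided labels

# Hold back part of the labeled data so we can measure generalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)    # learn from the labeled examples
preds = model.predict(X_test)  # predict labels for unseen data

print(f"Test accuracy: {accuracy_score(y_test, preds):.3f}")
```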
Unsupervised Learning
Unsupervised learning is equally intuitive: instead of providing guidance, you leave the child alone in the forest with a Capri Sun and a spork and tell them they have three days to get home. In data terms this means feeding unlabeled data into your models with the intent that they can derive some meaning or new information from the data's overall structure. It is generally used for tasks such as dimensionality reduction (PCA), clustering, and outlier detection. Some theorize this approach may be the key to true AI in the future, since it isn't bound by the labels and training parameters that constrain supervised learning. That would give models the potential to learn and adapt to novel tasks outside of their original intent, but for now the approach is limited to the aforementioned functions.
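To make the "derive meaning from structure" part concrete, here is a sketch of two of those tasks with scikit-learn; the dataset, the two components, and the three clusters are illustrative assumptions, not values you would reuse blindly on your own data:

```python
# Unsupervised sketch: no labels, only the data's own structure.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # pretend the labels don't exist

# Dimensionality reduction: squeeze 4 features into 2 components
# that keep most of the variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("Explained variance ratios:", pca.explained_variance_ratio_)

# Clustering: group observations purely by similarity.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_reduced)
print("Cluster sizes:", np.bincount(clusters))
```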
Semi-supervised Learning
This is the love child of the two approaches above, mixing them in hopes of finding the sweet spot between investment and payoff: raising the kid the right way, then giving them the freedom to explore on their own, trusting that you've instilled the values they need to navigate the world. In data terms this means taking a model trained on labeled data and using it to evaluate a set of unlabeled data. You then take the predictions your model is most confident in and add those observations to your original data set with the predicted labels (pseudo-labeling).
Initially you cannot be 100% sure these labels are correct, hence the "pseudo," but this is also why we restrict our selections to the data the model is most confident about and iterate multiple times. Rinse and repeat until satisfied. In theory this gives your model more data to work with, using the structure already present in your data to amplify its predictive power. It also assumes, of course, that you can trust your model's architecture and that your data is relatively well behaved. Strong outliers or incorrect predictions from the original model can derail the process, and one of its disadvantages is that it has no way to self-correct.
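Here is a simplified sketch of that pseudo-labeling loop; the classifier, confidence threshold, and iteration cap are illustrative assumptions rather than recommendations (scikit-learn also ships a SelfTrainingClassifier in sklearn.semi_supervised that wraps the same idea):

```python
# Simplified pseudo-labeling loop: train on a small labeled pool,
# absorb only the most confident predictions, repeat.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

# Pretend most labels are missing: keep 30 labeled rows,
# treat the rest as unlabeled.
idx = rng.permutation(len(X))
X_lab, y_lab = X[idx[:30]], y[idx[:30]]
X_unlab = X[idx[30:]]

CONFIDENCE = 0.95  # only trust very confident predictions

for iteration in range(5):
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    if len(X_unlab) == 0:
        break

    probs = model.predict_proba(X_unlab)
    confident = probs.max(axis=1) >= CONFIDENCE
    if not confident.any():
        break  # nothing left that the model is sure about

    # Add the confident predictions to the labeled pool as pseudo-labels,
    # drop them from the unlabeled pool, and go around again.
    pseudo_labels = model.classes_[probs[confident].argmax(axis=1)]
    X_lab = np.vstack([X_lab, X_unlab[confident]])
    y_lab = np.concatenate([y_lab, pseudo_labels])
    X_unlab = X_unlab[~confident]
    print(f"Iteration {iteration}: labeled pool now has {len(X_lab)} rows")
```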
Ultimately, you would use combinations of these approaches to thoroughly analyze your data and build robust models.