Natural Language Processing (NLP), a subfield of Machine Learning, deals mainly with text data. Typical tasks include sentiment analysis (deciding whether reviews of books, movies, Play Store apps, etc. are positive or negative), text generation for chatbots, query analysis and resolution for search engines, and many other text-related tasks.
Preprocessing datasets is one of the most arduous parts of the machine learning pipeline, and text preprocessing in particular involves many steps. Some of the common steps for text datasets are given below.
Lower casing
All text is converted to lower case so that the same word written in different casings (for example "Movie" and "movie") is treated as one token and gets the same weight.
Removal of punctuation
Punctuation symbols are removed from the dataset because they carry little information for many tasks, such as word prediction and sentiment analysis.
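Lower casing and punctuation removal can be sketched together using only the Python standard library. The function name `normalize` is an assumption for illustration, and `string.punctuation` covers ASCII punctuation only, so other symbols (curly quotes, for instance) would need extra handling.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(text: str) -> str:
    """Lower-case the text, then strip ASCII punctuation."""
    return text.lower().translate(_PUNCT_TABLE)

print(normalize("Hello, World! It's GREAT."))
```

Note that the apostrophe is removed too, so "It's" becomes "its"; whether that is acceptable depends on the task.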
Removal of stopwords
Stopwords are common words that do not add much meaning to a sentence, such as "the", "he", and "have" in English. They can usually be ignored without sacrificing the meaning of the sentence, so they are removed from the dataset.
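A minimal sketch of stopword filtering follows. The tiny `STOPWORDS` set is a hand-picked assumption for illustration only; in practice a fuller list from an NLP library would normally be used, and the text is assumed to be lower-cased and whitespace-separated already.

```python
# Illustrative stopword list (an assumption, far from complete).
STOPWORDS = {"the", "he", "have", "a", "an", "is", "are", "to", "of", "and"}

def remove_stopwords(text: str) -> str:
    """Drop every whitespace-separated token found in STOPWORDS."""
    return " ".join(word for word in text.split() if word not in STOPWORDS)

print(remove_stopwords("he said the movie was great"))
```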
Removal of frequent words
Sometimes the most frequent words in the dataset are also removed in text classification tasks. Because they appear in every class, they do little to separate the classes, and removing them can increase accuracy.
Removal of rare words
In some cases, rare words are also removed because they act like outliers.
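Both frequency-based filters can be sketched with `collections.Counter`. The function name and the thresholds `n_frequent` and `min_count` are assumptions chosen for illustration; suitable values depend on the dataset.

```python
from collections import Counter

def drop_frequent_and_rare(docs, n_frequent=1, min_count=2):
    """Remove the n_frequent most common words and words seen fewer than min_count times."""
    counts = Counter(word for doc in docs for word in doc.split())
    frequent = {word for word, _ in counts.most_common(n_frequent)}
    rare = {word for word, count in counts.items() if count < min_count}
    drop = frequent | rare
    return [" ".join(w for w in doc.split() if w not in drop) for doc in docs]

docs = ["movie was great fun", "movie was dull", "movie plot great"]
print(drop_frequent_and_rare(docs))
```

Here "movie" is dropped as the most frequent word, and "fun", "dull", and "plot" are dropped because each occurs only once.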
Stemming
Stemming means chopping off the end of a word to reduce it to a root form, for example removing "ing" and "ant" from "consulting" and "consultant" to get "consult".
Lemmatization
Lemmatization means reducing words to their root form using a vocabulary and morphological analysis, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
For example:
1) am, are, is => be
2) operating, operates, operated => operate (derived forms such as operation or operational are separate dictionary entries and are normally left as they are)
If stemming were applied to the second example instead, a simple suffix-stripping stemmer would produce something like "operat" rather than "operate", because it does not take the meaning of the words into account and just chops characters off the end.
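The contrast between the two can be sketched in a few lines. The suffix list and the lemma table here are toy assumptions made up for illustration; real stemmers (the Porter stemmer, for example) and dictionary-backed lemmatizers use far richer rules and vocabularies, and their outputs may differ from this toy version.

```python
# Toy suffix list, checked in order (an assumption, not a real stemming algorithm).
SUFFIXES = ("ing", "ant", "es", "s")

# Toy lemma dictionary (an assumption; a real lemmatizer uses a full vocabulary).
LEMMAS = {"am": "be", "are": "be", "is": "be",
          "operating": "operate", "operates": "operate"}

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word: str) -> str:
    """Look the word up in the lemma table; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for w in ("consulting", "consultant", "operating", "is"):
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer turns "operating" into "operat" because it only looks at characters, while the lookup returns the dictionary form "operate".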
Removal of emojis
Emojis are everywhere in today's text messages, and they can be dealt with in two ways. The first way is to remove them from the dataset.
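One way to sketch emoji removal is a regular expression over a few Unicode blocks. The ranges chosen here are an assumption and are not exhaustive; newer emojis and multi-code-point sequences may need additional ranges.

```python
import re

# Character class covering some common emoji blocks (not a complete list).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticon faces
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicator (flag) letters
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)

def remove_emojis(text: str) -> str:
    """Delete every run of characters matched by EMOJI_PATTERN."""
    return EMOJI_PATTERN.sub("", text)

print(remove_emojis("game is on 🔥🔥"))
```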
Removal of emoticons
For many datasets, emoticons such as ":-)" are removed as well.
Conversion of emoticons to words
The other way to deal with emoticons is to convert them to words that describe them.
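Both treatments of emoticons, removal and conversion, can be sketched with one lookup table. The `EMOTICONS` mapping and the replacement words are hypothetical and cover only a handful of cases; a real dataset would need a much larger table.

```python
import re

# Hypothetical emoticon-to-word mapping, for illustration only.
EMOTICONS = {
    ":-)": "happy_face",
    ":)": "happy_face",
    ":-(": "sad_face",
    ":(": "sad_face",
    ";)": "wink",
}

# Longest emoticons first, so ":-)" is matched before any shorter alternative.
_EMOTICON_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)

def convert_emoticons(text: str) -> str:
    """Replace each known emoticon with its descriptive word."""
    return _EMOTICON_PATTERN.sub(lambda m: EMOTICONS[m.group(0)], text)

def remove_emoticons(text: str) -> str:
    """Delete each known emoticon."""
    return _EMOTICON_PATTERN.sub("", text)

print(convert_emoticons("good movie :-) bad ending :("))
```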
Conversion of emojis to words
Emojis can likewise be converted to words that describe them.
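A minimal sketch of emoji-to-word conversion can lean on the official Unicode character names available through Python's `unicodedata` module. The code-point range used to decide what counts as an emoji and the `:name:` output format are assumptions for illustration; dedicated emoji libraries handle more cases.

```python
import unicodedata

def emoji_to_words(text: str) -> str:
    """Replace characters in a common emoji range with their Unicode names."""
    out = []
    for ch in text:
        # Assumed emoji range; it does not cover every emoji.
        if 0x1F300 <= ord(ch) <= 0x1FAFF:
            name = unicodedata.name(ch, "")
            # Characters unknown to this Python's Unicode database are kept as-is.
            out.append(":" + name.lower().replace(" ", "_") + ":" if name else ch)
        else:
            out.append(ch)
    return "".join(out)

print(emoji_to_words("game is on 🔥"))
```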
Removal of URLs
URLs rarely carry meaning that is useful for these tasks, so they are usually removed from the dataset.
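URL removal is commonly sketched with a regular expression. The pattern here is a simple assumption that matches anything starting with `http://`, `https://`, or `www.` up to the next whitespace; it will miss bare domains and may swallow trailing punctuation.

```python
import re

# Simple URL pattern: scheme-prefixed or www-prefixed runs of non-whitespace.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def remove_urls(text: str) -> str:
    """Delete every substring matched by URL_PATTERN."""
    return URL_PATTERN.sub("", text)

print(remove_urls("see https://example.com/page for details"))
```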
Removal of HTML tags
Sometimes, while scraping data from websites, HTML tags end up in the dataset. They should be removed to build better language models.
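A rough sketch of tag stripping with the standard library follows. The regular expression is a crude assumption that treats anything between `<` and `>` as a tag; for messy real-world pages a proper HTML parser is the more robust choice.

```python
import html
import re

# Crude tag pattern: "<", then anything except ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_html(text: str) -> str:
    """Remove tags, then decode entities such as &amp; into plain characters."""
    return html.unescape(TAG_PATTERN.sub("", text))

print(strip_html("<p>Hello <b>world</b> &amp; all</p>"))
```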
Spelling correction
Spelling mistakes should be corrected to build better language models. Minimum edit distance can be used to find the intended word when a word is only slightly altered from the original.
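The minimum edit distance idea can be sketched as a Levenshtein distance plus a nearest-word lookup. The vocabulary, the `max_distance` threshold, and the function names are assumptions for illustration; practical spell checkers also weigh word frequency and context.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def correct(word: str, vocabulary, max_distance: int = 2) -> str:
    """Return the closest vocabulary word, or the word itself if none is close enough."""
    best = min(vocabulary, key=lambda v: edit_distance(word, v))
    return best if edit_distance(word, best) <= max_distance else word

vocabulary = ["spelling", "spilling", "selling"]  # illustrative vocabulary
print(correct("speling", vocabulary))
```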
Top comments (1)
Great overview. Conversion of diacritics to Latin characters can also be added to the list of preprocessing tasks.