Language models are machine learning models that work on text data to perform tasks in Natural Language Processing (NLP). NLP has two major categories of tasks: Natural Language Understanding and Natural Language Generation. Language models perform many different tasks, including sentiment analysis, question answering, query resolution, and text summarization. There are also many intermediate tasks that help make language models better. Some of these are given below.
Sentence segmentation
The whole corpus (the entire text collection) is broken down into separate sentences. This is the first step in understanding a language: the text is first broken into simple sentences.
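As a rough sketch, a naive segmenter can split on sentence-ending punctuation; real tools such as NLTK's sent_tokenize also handle abbreviations and other edge cases that this toy version ignores:

```python
import re

def segment_sentences(text):
    # Naive rule: split wherever ., ! or ? is followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s for s in sentences if s]

corpus = "Language models are useful. They power many NLP tasks! Do you use them?"
print(segment_sentences(corpus))
# → ['Language models are useful.', 'They power many NLP tasks!', 'Do you use them?']
```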
Tokenization
The next step after sentence segmentation is tokenization, or more precisely word tokenization. The sentences are broken down into words; in tasks where punctuation marks matter, they are also treated as tokens along with the words.
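A minimal regex-based tokenizer (a sketch, not a substitute for NLTK's word_tokenize) illustrates the idea of keeping punctuation marks as separate tokens:

```python
import re

def word_tokenize_simple(sentence):
    # \w+ matches runs of word characters; [^\w\s] matches single
    # punctuation marks, so punctuation becomes its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(word_tokenize_simple("Hello, world!"))
# → ['Hello', ',', 'world', '!']
```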
Stemming
Stemming is the process of reducing words to their root stem, usually by chopping off the end of the word.
e.g. oppressor, oppression, oppressed, and oppressive will all be changed to oppress, the root stem, by chopping off 'or', 'ion', 'ed', and 'ive' respectively from each word.
Some well-known stemming algorithms are the Porter Stemmer, Lancaster Stemmer, and Snowball Stemmer. These are implemented in the nltk library and can be imported as follows:
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer
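To make the suffix-chopping idea concrete, here is a toy stemmer that reproduces the oppress example above. It is only a sketch: real stemmers like Porter apply ordered rule sets with measure conditions, not a single suffix list:

```python
def crude_stem(word, suffixes=("ive", "ion", "ed", "or", "ing")):
    # Try longer suffixes first; require at least 3 letters to remain
    # so short words are not over-stemmed.
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

for w in ["oppressor", "oppression", "oppressed", "oppressive"]:
    print(w, "->", crude_stem(w))
# each prints "... -> oppress"
```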
Lemmatizing
Lemmatization converts a word to its root word, known as its lemma. It takes the meaning of the word into account rather than simply chopping off the last section of the word.
e.g. the root word for 'is', 'am', and 'are' is 'be'; another example is
the words 'creation', 'creating', and 'creative', which are all changed to 'create'. With stemming they might instead become a truncated stem like 'creati' rather than 'create'.
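Since lemmatization is dictionary-driven, it can be sketched as a lookup table. This hypothetical table stands in for what a real lemmatizer (e.g. NLTK's WordNetLemmatizer) gets from WordNet plus the word's part of speech:

```python
# Toy lemma dictionary covering the examples above (illustrative only).
LEMMAS = {"is": "be", "am": "be", "are": "be",
          "creation": "create", "creating": "create", "creative": "create"}

def lemmatize(word):
    # Unknown words fall back to themselves.
    return LEMMAS.get(word.lower(), word)

print([lemmatize(w) for w in ["is", "am", "are", "creating"]])
# → ['be', 'be', 'be', 'create']
```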
POS tags
POS stands for Part Of Speech, and in this step every token is tagged with its part of speech. There are two basic approaches to POS tagging: rule-based tagging and stochastic tagging. POS tagging is useful because it helps in building lemmatizers, in building parse trees (which are used for Named Entity Recognition), and in resolving word-sense ambiguity.
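A minimal rule-based tagger can be sketched as a small lexicon plus suffix rules as a fallback; the lexicon, tag names, and rules below are all illustrative, and stochastic taggers would instead learn tag probabilities from a corpus:

```python
import re

# Tiny hand-written lexicon (illustrative, not a real tagset).
LEXICON = {"the": "DET", "a": "DET", "is": "VERB", "cat": "NOUN"}

def tag(token):
    if token.lower() in LEXICON:
        return LEXICON[token.lower()]
    if re.search(r"(ing|ed)$", token):       # verb-like suffixes
        return "VERB"
    if re.search(r"(ness|tion|ment)$", token):  # noun-like suffixes
        return "NOUN"
    return "NOUN"  # default guess for unknown words

print([(w, tag(w)) for w in "the cat is sleeping".split()])
# → [('the', 'DET'), ('cat', 'NOUN'), ('is', 'VERB'), ('sleeping', 'VERB')]
```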
Identifying Stopwords
Stopwords are common words in a language that do not add much meaning to a sentence, like 'and', 'the', 'is', and 'am'. These words should be identified and, depending on the task, removed from the corpus, because they act like noise in the dataset.
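Stopword removal is a simple filter over the token list. The stopword set here is a tiny illustrative sample; in practice you would use a fuller list such as NLTK's stopwords corpus:

```python
# Small illustrative stopword set (real lists are much longer).
STOPWORDS = {"and", "the", "is", "am", "a", "an", "of", "to"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("the model is learning the structure of language".split()))
# → ['model', 'learning', 'structure', 'language']
```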
Named Entity Recognition
Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
Some common NER tools are- Stanford Named Entity Recognizer (SNER), SpaCy, Natural Language Toolkit (NLTK).
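The tools above use trained statistical models; as a contrast, here is a deliberately naive heuristic sketch that only flags mid-sentence capitalised tokens as candidate entities, just to show the shape of the task:

```python
def naive_ner(tokens):
    # Skip the first token (sentences start capitalised anyway) and
    # flag any later capitalised token as a candidate named entity.
    return [tok for i, tok in enumerate(tokens)
            if i > 0 and tok[:1].isupper()]

print(naive_ner("Yesterday Alice flew from Paris to London".split()))
# → ['Alice', 'Paris', 'London']
```

This heuristic obviously misclassifies many cases (it cannot distinguish persons from locations, for example), which is exactly why real NER relies on learned models.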
Text Classification
Text classification is one of the most important steps in sentiment analysis. After steps like tokenization, stemming, and lemmatization are performed on the corpus, the result is passed to a machine learning algorithm for classification.
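As a bare-bones sketch of sentiment classification over preprocessed tokens, here is a keyword-scoring classifier; the word lists are illustrative, and a real pipeline would vectorise the tokens (e.g. bag-of-words) and train an ML model instead:

```python
# Illustrative sentiment lexicons (a real system would learn these).
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}

def classify(tokens):
    # Score = (#positive words) - (#negative words).
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(classify("i love this great movie".split()))
# → positive
```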
Chunking
Chunking works on top of POS tagging: it takes POS tags as input and produces chunks as output. In short, chunking means grouping words/tokens into chunks. A chunk is a group of words that can be clubbed together to form a meaningful part of the sentence, like a noun group/phrase or verb group/phrase.
Chunking can break sentences into phrases that are more useful than individual words and yield meaningful results. Chunking is very important when you want to extract information from text, such as locations or person names (NER). NLTK can be used for chunking.
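The grouping step can be sketched directly over (word, tag) pairs. This hand-rolled version collects DET/ADJ/NOUN runs ending in a noun into noun phrases; NLTK's RegexpParser does the same kind of thing declaratively with a chunk grammar:

```python
def chunk_noun_phrases(tagged):
    # Collect runs of DET/ADJ/NOUN; a NOUN closes the current phrase.
    chunks, current = [], []
    for word, tag in tagged:
        if tag in ("DET", "ADJ", "NOUN"):
            current.append(word)
            if tag == "NOUN":
                chunks.append(" ".join(current))
                current = []
        else:
            current = []  # any other tag breaks the phrase
    return chunks

tagged = [("the", "DET"), ("big", "ADJ"), ("dog", "NOUN"),
          ("chased", "VERB"), ("a", "DET"), ("cat", "NOUN")]
print(chunk_noun_phrases(tagged))
# → ['the big dog', 'a cat']
```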
Coreference Resolution
In linguistics, coreference occurs when two or more expressions in a text refer to the same person or thing; they have the same referent. e.g. in "Bill said he would come", the proper noun Bill and the pronoun he refer to the same person, namely Bill. Coreference resolution is the task of finding all expressions that refer to the same entity in a text. It is an important step for many higher-level NLP tasks that involve natural language understanding, such as document summarization, question answering, and information extraction.
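The Bill example can be sketched with a crude heuristic that links each pronoun to the most recent capitalised token. Real coreference resolvers use far richer syntactic and semantic features; this only shows the input/output shape of the task:

```python
PRONOUNS = {"he", "she", "it", "they"}

def resolve_pronouns(tokens):
    # Track the last capitalised non-pronoun token as the candidate
    # antecedent, and link each pronoun back to it.
    last_entity, resolved = None, []
    for tok in tokens:
        if tok[:1].isupper() and tok.lower() not in PRONOUNS:
            last_entity = tok
        if tok.lower() in PRONOUNS and last_entity:
            resolved.append((tok, last_entity))
    return resolved

print(resolve_pronouns("Bill said he would come".split()))
# → [('he', 'Bill')]
```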