Transformer networks have become very popular ever since Google's BERT achieved state-of-the-art results on various NLP benchmark tasks. The architecture was first introduced in the 2017 paper 'Attention Is All You Need' (https://arxiv.org/abs/1706.03762).
So what are transformer networks actually? Are they different from neural networks?
Transformer networks are in fact a kind of neural network architecture. Other kinds of architectures include CNNs (convolutional neural networks) and RNNs (recurrent neural networks).
Different architectures are better suited to certain tasks. For example, CNNs are better suited for image-related tasks, and RNNs are better suited for text and sequence tasks. Transformer networks are best suited for sequence-to-sequence tasks involving text. They overcome a few weaknesses of other networks such as RNNs.
What is the structure of a transformer network?
Transformers consist of encoders and decoders which are stacked on top of each other. This means that each encoder’s input is the output of the previous encoder and so on. The input of the first encoder is word embeddings for each word in the input sentence.
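To make the stacking concrete, here is a minimal sketch of a stack of encoder layers. PyTorch and the layer sizes (6 layers, 512 dimensions, 8 heads, borrowed from the base model in the paper) are assumptions for illustration, not something the article prescribes.

```python
import torch
import torch.nn as nn

d_model, num_heads, num_layers = 512, 8, 6

# One encoder layer, then a stack of them: each layer's output is the
# next layer's input.
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=num_heads)
encoder_stack = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

# Dummy input: a sentence of 10 tokens, batch size 2, each token already
# represented as a 512-dimensional word embedding.
# By default nn.TransformerEncoder expects (sequence, batch, d_model).
word_embeddings = torch.rand(10, 2, d_model)

# The final output is a context-aware representation of every token.
encoded = encoder_stack(word_embeddings)
print(encoded.shape)  # torch.Size([10, 2, 512])
```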
Let's look more closely at what encoders and decoders do. Encoders transform an input sentence into an embedding. An embedding is just a list of numbers. This embedding is a meaningful representation of the input sentence in the vector space.
Decoders transform an embedding generated by an encoder into an output sequence. In a machine translation setup, for example, a sentence in one language is encoded and then decoded into another language.
The word embeddings used for the encoder's inputs can be any recognized word embeddings like word2vec, GloVe, etc. These techniques already transform words into meaningful vectors.
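As a small sketch of that first step, pre-trained vectors can be loaded into an embedding layer and looked up per token. The vocabulary and vectors below are made up for illustration; in practice they would come from a real word2vec or GloVe file, and the transformer also adds positional encodings before the first encoder (omitted here).

```python
import torch
import torch.nn as nn

vocab = {"the": 0, "animal": 1, "street": 2}       # hypothetical vocabulary
pretrained_vectors = torch.rand(len(vocab), 512)    # stand-in for real GloVe vectors

# Load the pre-trained vectors into an embedding lookup table.
embedding = nn.Embedding.from_pretrained(pretrained_vectors)

token_ids = torch.tensor([[vocab["the"], vocab["animal"]]])  # one 2-token sentence
word_embeddings = embedding(token_ids)              # shape: (1, 2, 512)
```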
Encoders and decoders in a transformer also use self-attention.
What is Self-Attention?
Self-attention is a mechanism used by encoders and decoders to gain a better understanding of the context of each word. In natural language, the meaning of each word in a sentence is decided by the words around it. More importantly, the meaning is decided by specific surrounding words, not all of them.
For example, let’s consider the below sentence,
The animal didn’t cross the street because it was too tired
In the above sentence, the word ‘it’ refers to the animal. This is obvious to humans but not so to machines. Self-attention allows the network to pay attention to specific words that are connected to each word in the sentence.
Transformer networks use multi-headed self-attention, which allows multiple independent attention patterns to be computed for each input word.
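The core computation is simple to sketch. Below is a minimal scaled dot-product self-attention in PyTorch (an assumption; any framework would do). For brevity it omits the learned query/key/value projections and the final concatenation that full multi-head attention uses; the sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def self_attention(q, k, v):
    # Each word's query is compared against every word's key; the softmax
    # scores decide how much of each word's value flows into the output.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

# One sentence of 4 tokens, 2 heads, 8 dimensions per head.
x = torch.rand(2, 4, 8)               # (heads, tokens, head_dim)
out, attn = self_attention(x, x, x)   # "self"-attention: q, k, v all come from x

# attn[h, i, j] is how strongly head h lets token i attend to token j,
# e.g. how strongly "it" attends to "animal" in the example above.
print(attn.shape)                     # torch.Size([2, 4, 4])
```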
What are the benefits of transformer networks?
Transformer networks easily outperform regular RNNs and LSTMs on NLP benchmark tasks. They also provide some additional benefits:
Because the architecture has no recurrence, all the words in a sequence can be processed in parallel. This means that with enough resources, these networks can be trained much faster than RNNs or LSTMs.
The self-attention mechanism allows words that are far apart in a sentence to be associated with each other directly, unlike RNNs, which must pass information step by step through every intermediate word.
What are the applications of transformer networks?
Machine Translation – Transformers have already been used extensively for machine translation and have achieved state-of-the-art results.
Text Summarization – This task generates an abstract summary of a longer piece of text.
Sequence to Sequence tasks – They can be used in any text seq2seq task such as question generation, paraphrasing, etc.
Classification and Clustering – The encoder's outputs (embeddings) can be used as sentence embeddings for any classification or clustering task (see the sketch below).
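As a small illustration of that last point, a pre-trained transformer encoder can turn whole sentences into fixed-size vectors, which are then clustered. The sentence-transformers library and the 'all-MiniLM-L6-v2' checkpoint are assumptions here, used only to keep the sketch short.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

sentences = [
    "The delivery was late",
    "My parcel has not arrived",
    "Great product, works perfectly",
]

# Each sentence becomes a single fixed-size vector (a pooled encoder output).
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences)

# Cluster the sentence embeddings; similar sentences end up together.
labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
print(labels)
```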
Conclusions
We have looked at the successor to RNNs and LSTMs. Transformer networks improve on sequence-to-sequence tasks in both accuracy and speed.
However, research is already ongoing to replace or enhance transformer networks so they can handle even longer pieces of text. Google's BigBird aims to do exactly that (https://arxiv.org/abs/2007.14062).
The field of NLP has been improving by leaps and bounds over the last few years, and it looks like this is only going to continue.
Article Source: https://www.asksid.ai/resources/what-are-the-transformer-networks/