Hi guys! You’ve come to the right place if you want to learn about the most recent developments in the field
of natural language processing, or NLP for short. We’ll follow the brief but explosive life of the Transformer
architecture as it transformed the NLP field of AI!
It all began a few years ago, when the excitement in artificial intelligence was focused on computer vision thanks
to CNNs’ outstanding performance on the ImageNet challenge (I’m looking at you, AlexNet in 2012!), as well as
the success of neural style transfer, generative adversarial networks, and a long list of other breakthroughs.
NLP wasn’t getting as much attention from the media as computer vision was. Undoubtedly, NLP researchers were
making progress all the same.
For instance, Word2Vec, which trains a neural network to learn word similarities from word distributions, gained
popularity in 2013. The word “Apple” is frequently used next to words like IBM or Microsoft, so it gets mapped
near them in the vector space. However, “Apple” is also frequently used with words like “pear” or other fruits.
Because Word2Vec assigns each word a single vector, it cannot distinguish between the different meanings a word
takes on in context.
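To make this limitation concrete, here is a minimal sketch of a static embedding lookup. The three-dimensional vectors are made up for illustration (real Word2Vec vectors are learned and usually have hundreds of dimensions), and `embed` and `cosine` are hypothetical helpers defined here, not part of any library.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
# The dimensions loosely mean: [tech company, fruit, other].
EMBEDDINGS = {
    "apple":     [0.7, 0.7, 0.1],
    "microsoft": [0.9, 0.0, 0.2],
    "pear":      [0.0, 0.9, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def embed(word, sentence):
    """A static lookup: the surrounding sentence is ignored entirely."""
    return EMBEDDINGS[word.lower()]

v_company = embed("apple", "Apple released a new laptop")
v_fruit = embed("apple", "I ate an apple and a pear")

# The same vector comes back in both contexts, so the two senses of
# "apple" are blended into one point that sits near both neighbours.
print(v_company == v_fruit)
print(round(cosine(EMBEDDINGS["apple"], EMBEDDINGS["microsoft"]), 2))
print(round(cosine(EMBEDDINGS["apple"], EMBEDDINGS["pear"]), 2))
```

The lookup never reads its `sentence` argument, and that is the whole problem: whatever surrounds the word, the vector stays the same.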
The recurrent neural network (RNN) gained prominence in 2014 and 2015, making it possible to better tackle
classification problems as well as sequence-to-sequence tasks, that is, creating one sequence from another;
machine translation is an example. The idea behind an RNN is as follows: the same neural network block is applied
repeatedly, once per element of the input sequence. When the sequence comes to an end, the block should have
produced a single vector that encapsulates the entire meaning of the sentence. A decoder can then run the process
in reverse and generate the translation from this one vector.
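The encoder half of this idea can be sketched in a few lines. This is only an illustration under stated assumptions: the weights `W_H` and `W_X` and the two-dimensional token vectors are hypothetical fixed numbers (a real RNN learns them), and a plain tanh cell stands in for whatever cell a real system would use.

```python
import math

def rnn_step(h, x, w_h, w_x):
    """One application of the shared block: h_new = tanh(W_h h + W_x x)."""
    size = len(h)
    return [
        math.tanh(
            sum(w_h[i][j] * h[j] for j in range(size))
            + sum(w_x[i][k] * x[k] for k in range(len(x)))
        )
        for i in range(size)
    ]

def encode(sequence, w_h, w_x, hidden_size):
    """Feed the tokens in order, one at a time, and return the final state."""
    h = [0.0] * hidden_size
    for x in sequence:  # strictly sequential: step t needs the result of step t-1
        h = rnn_step(h, x, w_h, w_x)
    return h

# Hypothetical weights and token vectors, chosen by hand for illustration.
W_H = [[0.5, -0.3], [0.2, 0.4]]
W_X = [[1.0, 0.0], [0.0, 1.0]]
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

summary = encode(sentence, W_H, W_X, hidden_size=2)
long_summary = encode(sentence * 10, W_H, W_X, hidden_size=2)

# The summary vector has the same size no matter how long the input is.
print(len(summary), len(long_summary))
```

Note that a thirty-token input is squeezed into exactly as many numbers as a three-token one.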
However, one issue with this method is that it works only for short sentences like “I’m working hard,” which can
plausibly be represented by a single vector. It does not work well for longer ones: by the time it reaches the
end of the sequence, the RNN block is so “overwhelmed” by the additional input that it loses track of what the
beginning of the sentence was.
In 2015–2016, the attention mechanism was introduced to address this “forgetting” issue. Attention is simply a
way to indicate which elements of the sentence should be given more weight. In other words, attention tells the
RNN what to focus on and what not to forget.
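As a rough sketch of what “giving more weight” means, the snippet below scores every encoder state against a query, turns the scores into weights with a softmax, and returns the weighted average. Dot-product scoring is just one common choice and is an assumption here, as are the hand-picked vectors and the helper names `softmax` and `attend`.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, states):
    """Weight each encoder state by its dot-product relevance to the query."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    context = [
        sum(w * state[i] for w, state in zip(weights, states))
        for i in range(len(query))
    ]
    return weights, context

# Hypothetical encoder states (one per input word) and a decoder query.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_query = [2.0, 0.0]

weights, context = attend(decoder_query, encoder_states)
print([round(w, 2) for w in weights])
print([round(c, 2) for c in context])
```

Because the decoder can look back at every encoder state through these weights, it no longer has to rely on one final vector to remember the start of the sentence.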
The fact that recurrent neural networks are… recurrent is yet another issue with them. They process sentences
word by word, in order, much as humans typically do. Unfortunately, this means that it takes a machine a long
time to process large text corpora.
And whoosh, here we are in 2017! NLP’s long-awaited ImageNet moment has arrived with the Transformer
architecture! It came to the party with the “Attention is all you need” paper. As the title says: if RNNs with
attention are so slow because of the sequential processing of the RNN, then let’s simply keep the attention and
toss the RNN component! And so the Transformer appeared with a bang! This isn’t exactly how it works in practice;
we’ll go through the Transformer design in a future installment. For now, we state it plainly: Transformers
handle sequence data, but they do not have to process it element by element in order like RNNs do. This means
the computation can be parallelized, which allows for considerably faster training!
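To see why dropping the recurrence allows parallel processing, consider this deliberately simplified sketch of scaled dot-product self-attention. It is not the full Transformer: the learned query, key, and value projections, multiple heads, and positional encodings are all omitted, and the input vectors and the helper `self_attention_row` are hypothetical. The point is only that the output for each position depends on the inputs alone, never on the output of another position.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_row(i, X):
    """Output for position i, computed from the inputs X only."""
    d = len(X[0])
    scores = [
        sum(a * b for a, b in zip(X[i], X[j])) / math.sqrt(d)
        for j in range(len(X))
    ]
    weights = softmax(scores)
    return [sum(w * X[j][k] for j, w in enumerate(weights)) for k in range(d)]

# Hypothetical token vectors for a three-word sentence.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

in_order = [self_attention_row(i, X) for i in range(len(X))]

# The positions can be computed in any order (or all at once on parallel
# hardware) because no row waits for the result of another row.
shuffled = {i: self_attention_row(i, X) for i in (2, 0, 1)}
print(all(shuffled[i] == in_order[i] for i in range(len(X))))
```

Contrast this with a recurrent loop, where step t cannot start until step t-1 has finished.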
For now, we’re less concerned with the mechanics and concentrate instead on the Transformer’s significance and on
what came out of the original Transformer, because this architecture and its offspring are improving the state of
the art in many areas, including machine translation, sentiment classification, coreference resolution, common
sense reasoning, and so on! It has even been used successfully for translating from one programming language to
another and for solving symbolic math problems! And do not be afraid to experiment with demos, such as this one
here, if you want to see the Transformer in action.
Returning to our history lesson: in 2018, Google researchers created a bidirectional Transformer and called it
“BERT”. Yes, you heard correctly, Bert from Sesame Street. Let’s put aside the strange names that academics give
their models and concentrate on the word “bidirectional”. Bidirectional means that information can flow both
forward and backward through the sequence while the model trains, which improves model performance. BERT and its
variants rank among the most significant developments in NLP. Like Word2Vec, BERT maps related words close
together, but it is context-sensitive: if a word appears in several different contexts, a distinct word vector is
produced for each context.
For instance, if we compare two sentences like “This game is just not fair” and “I had a great time at this
fair,” BERT will produce quite different vectors for the word “fair” in each, whereas Word2Vec assigns it the
same vector in both.
And for attention-based architectures, things only intensified after BERT: 2019 saw the development of RoBERTa,
ERNIE 2.0, and XLNet, to mention just a few! All of them improve on BERT conceptually and in practice. And there
is still no end in sight for these architectural developments; they just keep coming.
And to complete the loop and make things even weirder, we return to the computer vision community, because the
Transformer is making news there too: since it proved to be so effective on word sequences, the Transformer was
applied to pixel sequences, or what we used to call images. And there’s more! If we can use Transformers on
either text or images, why not use them on text and images at the same time? This brings us to the so-called
multi-modal domain, where the objective is to process multiple input modalities simultaneously. The year 2019
was therefore the year of VL-BERT, VisualBERT, ViLBERT, and UNITER, to name a few! These papers were all
published in roughly the same month, three of them on the same day! What a great moment to be a researcher in
NLP! It’s hard not to feel both overwhelmed and overexcited at once!
Last but not least, the Transformer performs amazingly well across a variety of tasks, frequently without having
been explicitly trained on the task at hand. Now that NLP has reached this stage, we must determine whether a
neural network trained on task A can also roughly handle task B. For further details on how to “probe” for this,
check out our previous article, which is referenced here and in the description below. Wait for our next article
if you want to learn more about the Transformer’s internal workings.