Hello 🥂,
This is our 25th Chapter 🎉
Everything we’ve been learning since Chapter 1 has brought us here, to the single most important Neural Network Architecture in modern AI history, powering breakthrough fields like Large Language Models (LLMs) and Advanced Generative AI.
In 2017, about eight years ago, researchers at Google Brain and Google Research published a paper titled Attention Is All You Need. In that paper, they introduced a new type of Neural Network Architecture called the Transformer. They came up with this architecture to fix the performance bottlenecks of Recurrent Neural Networks (RNNs), which we discussed in Chapter 23. As we may recall, RNNs process their input in sequence, maintaining a hidden state that serves as memory.
To understand Transformers better, we need to first understand transduction models.
Transduction models
Transduction was introduced by Vladimir Vapnik in the 1990s. Transduction models reason directly from specific training examples to specific test cases. This is in contrast to induction models, which first infer a general rule from the training examples and then apply it to any test case. Let me explain this with an example:
Consider this problem: The image below shows a set of points. Our task is to label each missing point (?) as A, B, or C. We have been provided with just 5 labelled data points out of 33, leaving us with 28 unlabelled data points.
If we try an induction model using supervised learning, what’s the best way to solve this? We can use K-nearest neighbours (which we learned in Chapter 5) to group the points into the appropriate clusters, right? But with only 5 labelled points to learn from, we’ll mislabel some of the data points, especially those close to the center. That’s where transduction models come in.
A transduction model does not create a general algorithm to solve problems; instead, it creates a solution for each specific test case. So, in our problem, a transduction model would carefully label all the unlabelled data correctly, including the points close to the center, which should be B.
This makes transduction models very good in specific domains where there is little labelled data, unlike induction models, which create a general algorithm. On the other hand, transduction models can’t handle new, unseen data the way induction models can, because they never learn how to solve the generic problem.
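To make the contrast concrete, here is a minimal sketch in Python. The data is made up, and scikit-learn’s KNeighborsClassifier and LabelSpreading are used here only as stand-ins for an induction model and a transduction model respectively; they are not the models from our example image.

```python
# Hypothetical data: 33 points, only 5 of them labelled (as in the image above).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)
points = rng.normal(size=(33, 2))                 # stand-in for the points in the image
labels = np.full(33, -1)                          # -1 marks the 28 unlabelled points
labels[[0, 7, 14, 21, 28]] = [0, 0, 1, 1, 2]      # the 5 known labels (A=0, B=1, C=2)

# Induction: learn a general rule from the 5 labelled points, then apply it everywhere.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(points[labels != -1], labels[labels != -1])
inductive_guess = knn.predict(points[labels == -1])

# Transduction: label exactly these 33 points, letting the unlabelled points
# influence each other through the geometry of the whole set.
spread = LabelSpreading(kernel="knn", n_neighbors=5)
spread.fit(points, labels)
transductive_guess = spread.transduction_         # labels for every point in this specific set
```

Notice that the transductive model never produces a reusable rule: if a brand-new 34th point arrives, we have to re-run it over the whole set, which is exactly the limitation described above.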
This brings us to sequence transduction models. These are machine learning models, based on the concept of transduction, that transform an input sequence into a corresponding output sequence. They are used in language translation, text-to-speech conversion, and speech recognition.
Basically, sequence transduction models are used in natural language processing tasks like the ones listed above. How do they work?
Sequence transduction models use an encoder-decoder architecture. The encoder, let’s say an RNN, takes the input and sequentially updates its hidden state; the final hidden state is the “encoded data”. The decoder, which can also be an RNN, takes the encoded data and predicts the output sequentially as well, token by token.
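Here is a minimal sketch of that encoder-decoder idea, assuming PyTorch. The vocabulary size, layer sizes, the assumption that index 0 is a start token, and the greedy token-by-token decoding loop are all illustrative choices, not a real translation system.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 64, 128

embed = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)   # encoder RNN
decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)   # decoder RNN
to_vocab = nn.Linear(hidden_dim, vocab_size)               # hidden state -> next-token scores

src = torch.randint(0, vocab_size, (1, 6))                 # a 6-token input sentence
_, encoded = encoder(embed(src))                           # final hidden state = "encoded data"

# Decode one token at a time, feeding each prediction back in.
token = torch.zeros(1, 1, dtype=torch.long)                # assume index 0 is a <start> token
hidden = encoded
outputs = []
for _ in range(6):
    out, hidden = decoder(embed(token), hidden)
    token = to_vocab(out).argmax(dim=-1)                   # greedy choice of the next token
    outputs.append(token.item())
```

Note how the decoder only ever sees `encoded`, the final hidden state; everything the encoder learned about the earlier words has to be squeezed into that one vector. That bottleneck is what the attention mechanism below relaxes.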
The fundamental concept in transduction and sequence transduction models, as we’ve learned above, is that they focus their attention on a specific test case.
Now, language translation is difficult: we don’t just translate word by word, we have to consider context.
For example, given two sentences to translate into French:
The color of your phone
La couleur de ton téléphone
Your black phone is stolen
Ton téléphone noir est volé
In sentence 1, we can see that the translation is one-to-one.
In sentence 2, however, we can see that “black phone” becomes “téléphone noir” in French, which means the phone comes before the color. So how do we solve these problems in language translation? A technique called the attention mechanism was introduced.
The attention mechanism is a technique that allows a neural network to focus on specific parts of an input sequence. This is done by assigning weights to each part of the input sequence, with the most important parts receiving the highest weights. In sequence transduction models, this is achieved by:
Passing every step of the RNN encoder to the decoder, not just the final “encoded state”. As explained in our RNN chapter, RNNs update their hidden state token by token (e.g., word by word) and normally return only the final hidden state. Here, each intermediate hidden state is passed to the decoder as well.
The image below shows this; instead of passing only hidden state #3, it passes hidden states #1 and #2 as well.
Having the decoder go through an extra step before producing its output. In this step, each hidden state is assigned a score (weight), which tells the decoder which tokens (words) are more important than others, so it understands the context of the sentence.
Using the attention mechanism, our French translation would give “phone” more weight than the color, so it translates correctly by outputting “téléphone noir” instead of “noir téléphone”.
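To make that scoring step concrete, here is a minimal sketch assuming PyTorch, using simple dot-product scores; real systems use various scoring functions, and the tensors here are random placeholders for the encoder and decoder hidden states.

```python
import torch
import torch.nn.functional as F

hidden_dim = 128
encoder_states = torch.randn(6, hidden_dim)   # one hidden state per input token (6 tokens)
decoder_state = torch.randn(hidden_dim)       # the decoder's current hidden state

scores = encoder_states @ decoder_state       # one score per input token
weights = F.softmax(scores, dim=0)            # weights sum to 1: important tokens get more
context = weights @ encoder_states            # weighted mix of encoder states ("context vector")
# The decoder combines `context` with its own state to predict the next output token.
```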
You can watch this nice short YouTube video explaining the attention mechanism here.
Now, back to Transformers 🤖.
The gentlemen and ladies on the Google Brain and Research teams were like: you know what, computing power keeps increasing, and Moore’s law means it’ll get even better, yet we can’t train our sequence transduction models any faster because they run in sequence, so we can’t use parallelization and maximise the power of our compute systems. What if, just what if, attention is all we need? 👽 What if we discard the underlying RNN or CNN we use with our attention mechanism and just focus on the attention mechanism itself?
So they put in the work 🕶️.
…to be continued
Up next: Introduction to Transformers 2. See you in the next one 🤖