Introduction to BERT (Bidirectional Encoder Representations from Transformers)

Aloha 👽,

We learned a little about Large Language Models in our last series chapter, which brought us to the 3 types of LLMs (BERT, GPT, LLaMA). We’ll start with BERT.

Let’s get right into it 🤖

It all started with Google engineers yet again. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova were researchers at Google AI Language. They had worked extensively in Natural Language Processing (NLP) and were looking for a way to make existing NLP models better at understanding context.

Now, a few months earlier, in June 2018, the first GPT paper was released, titled Improving Language Understanding by Generative Pre-Training. GPT used a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer. This did not suffice for the team, as they needed something different.
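To make the left-to-right restriction concrete, here is a minimal Python sketch of the visibility pattern it implies. The function names are made up for illustration and this is not code from either paper. It also assumes the usual convention that a token can see itself as well as the tokens before it. The second function shows the unrestricted pattern for contrast.

```python
def left_to_right_mask(n):
    """Return an n x n grid where grid[i][j] is True if token i may look at token j.

    Assumption: a token may look at itself and at every earlier token (j <= i).
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def unrestricted_mask(n):
    """Return an n x n grid where every token may look at every other token."""
    return [[True for _ in range(n)] for _ in range(n)]


def render(mask, tokens):
    """Format a mask as text: 'x' means visible, '.' means hidden."""
    lines = []
    for tok, row in zip(tokens, mask):
        cells = " ".join("x" if visible else "." for visible in row)
        lines.append(f"{tok:>8}  {cells}")
    return "\n".join(lines)


if __name__ == "__main__":
    tokens = ["The", "capital", "of", "Nigeria"]
    print(render(left_to_right_mask(len(tokens)), tokens))
    print()
    print(render(unrestricted_mask(len(tokens)), tokens))
```

In the left-to-right grid, the row for "of" cannot see "Nigeria", while in the unrestricted grid every word can see every other word.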

The research team’s goal was not to generate the next word like GPT. What does that mean? The GPT paper focused on generating text (we’ll learn how that works later in the series), so it can take a question and generate a response one word at a time. The research team at Google, however, wanted a model that understands context rather than one that generates text.

Cloze Task

One day, they had an epiphany: the Cloze Task! The Cloze Task is a classic method in linguistics and psychology for measuring language understanding and predictability. It removes words from a passage and asks a person to fill in the missing words based on context.

How does it work?

The Cloze Task, introduced by Wilson L. Taylor in 1953, uses a masking technique. Simply put, given a sentence with one or more masked words, the reader should be able to predict the masked words correctly. Basically, it is a fill-in-the-blank exercise. Variations of this technique were later adopted in schools and institutions for tests and exams to evaluate understanding.

For example, take this sentence: “The capital of Nigeria is [blank].” The masked word here should be Abuja, as it’s the capital of Nigeria. Here are some more complicated examples:

The word: Watch
1. Example A (Object): "He glanced down at his gold [blank] and realized he was already ten minutes late for the meeting."
2. Example B (Action/Verb): "We sat on the porch for hours just to [blank] the sunset behind the distant mountains."

The word: Right
1. Example A (Direction): "After passing the post office, you should turn [blank] at the next intersection to reach the library."
2. Example B (Correctness): "Check your math carefully to ensure you have the [blank] answer for every question on the test."

You can now see how difficult it is to complete the Cloze Task, right? It’ll take a very good understanding of context and language.

The team figured this would be the best technique to use for their model to be able to understand context, because if it can predict the missing words correctly, it means it understands context.

That’s how the Masked Language Model (MLM) technique came about, inspired by the Cloze task. The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary ID of the masked word based only on its context. To achieve this, the team used a bidirectional approach, where each token can attend to the context on both its left and its right at the same time, rather than in one direction only. Thus the name: Bidirectional Encoder Representations from Transformers (BERT).
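The masking step can be sketched in a few lines of Python. This is a simplified illustration, not the team’s implementation: the function name and the whitespace tokenization are assumptions made here, every selected token is simply replaced with a [MASK] placeholder, and the labels hold the original words rather than vocabulary IDs. The 15% default mirrors the masking rate reported in the BERT paper.

```python
import random

MASK = "[MASK]"  # placeholder shown to the model in place of a hidden token


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly hide tokens and remember what was hidden.

    Returns (masked, labels):
      masked[i] is MASK where a token was hidden, otherwise the original token.
      labels[i] is the original token where one was hidden, otherwise None.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # a seed makes the run repeatable
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # this is what the model should predict
        else:
            masked.append(tok)
            labels.append(None)  # nothing to predict at this position
    return masked, labels


if __name__ == "__main__":
    sentence = "The capital of Nigeria is Abuja"
    masked, labels = mask_tokens(sentence.split(), mask_prob=0.3, seed=7)
    print(" ".join(masked))
    print(labels)
```

During training, the model would see only the masked list and be scored on how well it recovers the non-empty entries in labels, using the words on both sides of each [MASK].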

The team then wrote the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, released on 11th October, 2018, about four months after the GPT paper. It’s well worth checking out. We’ll go more in-depth in the next series chapter.

According to the paper, BERT was the first fine-tuning-based representation model to achieve state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.

I believe you now have a good background understanding of the philosophy behind BERT from this introductory chapter 😊. We’ll learn the more technical bits in our next series chapter.

See you in the next one 👽
