92. BERT: The Model That Reads in Both Directions

Akhilesh

GPT generates text by predicting the next word. It reads left to right.

BERT does something different. It masks random words in a sentence and tries to predict what they are. To do that well, it has to understand every word in relation to every other word simultaneously. Left and right context both matter.

That bidirectional understanding is why BERT dominated NLP benchmarks when it came out in 2018, and why encoder-only transformers are still the go-to for understanding tasks.


What You'll Learn Here

  • What makes BERT different from GPT
  • Masked Language Modeling: how BERT learns
  • Next Sentence Prediction: the second pretraining task
  • The [CLS] and [SEP] tokens and what they do
  • Fine-tuning BERT for text classification
  • Fine-tuning for Named Entity Recognition
  • Fine-tuning for Question Answering
  • Using HuggingFace to do all of this in under 20 lines

BERT vs GPT: The Key Difference

Both are transformer-based. The architecture is similar. The difference is in how they're pretrained and which part of the transformer they use.

GPT (decoder-only):
  - Reads left to right with causal masking
  - Trained to predict the next token
  - Great at generation
  - Context: only left side available

BERT (encoder-only):
  - Reads all tokens simultaneously
  - Trained to predict masked tokens + next sentence
  - Great at understanding
  - Context: both left and right sides available

For classification tasks, BERT wins. For generation tasks, GPT wins. If what you are building is about understanding text (classification, tagging, extractive question answering), BERT is the starting point.
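The "context" difference between the two models comes down to the attention mask. Here is a minimal sketch in plain Python, with no real model involved. The helper names `causal_mask` and `bidirectional_mask` are made up for illustration. A 1 means "this row's token may attend to this column's token".

```python
def causal_mask(n):
    """GPT-style: position i may attend only to positions 0..i (left context)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """BERT-style: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


tokens = ["the", "cat", "sat", "down"]
masks = [
    ("causal (GPT)", causal_mask(len(tokens))),
    ("bidirectional (BERT)", bidirectional_mask(len(tokens))),
]
for name, mask in masks:
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:<5} {row}")
```

In a real BERT batch, padding positions are still masked out through the attention mask. The sketch only shows the left-versus-both-sides difference.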


How BERT Was Pretrained

BERT was pretrained on two tasks simultaneously on a massive corpus (BooksCorpus + English Wikipedia, 3.3 billion words).

Task 1: Masked Language Modeling (MLM)

15% of tokens are randomly masked. The model predicts the original token from context.

Input:  "The cat [MASK] on the [MASK]"
Target: "The cat sat  on the mat"

Of the 15% selected tokens:

  • 80% replaced with [MASK]
  • 10% replaced with a random token
  • 10% left unchanged

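The random and unchanged cases matter because [MASK] never appears in real inputs at fine-tuning time. They force the model to build a useful representation for every token, not only for positions marked [MASK].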
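The 15% selection and the 80/10/10 split can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not BERT's actual data pipeline: it works on whole words rather than WordPiece ids, the tiny `vocab` is made up, and `mask_tokens` is a hypothetical helper name.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (inputs, labels). labels[i] is None where no prediction is required."""
    if rng is None:
        rng = random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            inputs.append(tok)       # not selected: token passes through
            labels.append(None)      # and contributes nothing to the loss
            continue
        labels.append(tok)           # selected: the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            inputs.append(MASK)                 # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs.append(rng.choice(vocab))    # 10%: replace with a random token
        else:
            inputs.append(tok)                  # 10%: leave unchanged
    return inputs, labels


vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary (assumption)
tokens = "the cat sat on the mat".split()
inputs, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print("inputs:", inputs)
print("labels:", labels)
```

Note that the label is the original token in all three cases, including the "unchanged" one. That is what stops the model from treating visible tokens as always correct.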

Task 2: Next Sentence Prediction (NSP)

Two sentences are given. The model predicts whether sentence B actually follows sentence A in the original text.

Input:   [CLS] The cat sat on the mat. [SEP] It was a lazy afternoon. [SEP]
Label:   IsNext (1)

Input:   [CLS] The cat sat on the mat. [SEP] The stock market crashed. [SEP]
Label:   NotNext (0)

NSP was later found to be less useful than MLM and was dropped in RoBERTa. But it's part of the original BERT.
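To make the NSP setup concrete, here is a rough sketch of how such training pairs could be built. It assumes a 50/50 split between real and random second sentences, and it simplifies by drawing the random sentence from the same short list (the original setup draws it from elsewhere in the corpus). `make_nsp_examples` is a hypothetical helper name.

```python
import random


def make_nsp_examples(sentences, rng=None):
    """Return (sentence_a, sentence_b, label) triples; 1 = IsNext, 0 = NotNext."""
    if rng is None:
        rng = random.Random()
    examples = []
    for i in range(len(sentences) - 1):
        a, true_next = sentences[i], sentences[i + 1]
        if rng.random() < 0.5:
            examples.append((a, true_next, 1))
        else:
            # Any sentence except a itself and its real successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            examples.append((a, rng.choice(candidates), 0))
    return examples


sentences = [
    "The cat sat on the mat.",
    "It was a lazy afternoon.",
    "The stock market crashed.",
    "Investors were nervous.",
]
for a, b, label in make_nsp_examples(sentences, rng=random.Random(0)):
    print(f"[CLS] {a} [SEP] {b} [SEP]  ->  {'IsNext' if label else 'NotNext'}")
```

The sketch needs at least three sentences so that a negative candidate always exists.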


Special Tokens in BERT

BERT uses three special tokens you need to know:

[CLS]: Classification token. Always the first token. Its final hidden state is used as the sentence-level representation for classification tasks.

[SEP]: Separator token. Marks the end of a sentence or separates two sentences in pairs.

[PAD]: Padding token. Used to make all sequences in a batch the same length.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = "The cat sat on the mat."
tokens = tokenizer(text)

print(f"Input IDs:      {tokens['input_ids']}")
print(f"Token type IDs: {tokens['token_type_ids']}")
print(f"Attention mask: {tokens['attention_mask']}")
print()

# Decode back to see what they are
decoded = tokenizer.convert_ids_to_tokens(tokens['input_ids'])
print(f"Tokens: {decoded}")

Output:

Input IDs:      [101, 1996, 4937, 2938, 2006, 1996, 13523, 1012, 102]
Token type IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0]
Attention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1]

Tokens: ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]']

101 is [CLS]. 102 is [SEP]. Every BERT input starts with [CLS] and ends with [SEP].

# Two sentences
text_pair = ("The cat sat on the mat.", "It was a lazy afternoon.")
tokens_pair = tokenizer(*text_pair)

decoded_pair = tokenizer.convert_ids_to_tokens(tokens_pair['input_ids'])
print(f"Pair tokens: {decoded_pair}")
print(f"Token types: {tokens_pair['token_type_ids']}")

Output:

Pair tokens: ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]', 'it', 'was', 'a', 'lazy', 'afternoon', '.', '[SEP]']
Token types: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

Token type 0 = first sentence. Token type 1 = second sentence. BERT uses this to distinguish the two.
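The examples so far never needed [PAD], because each call tokenized a single input. Here is a small sketch of what padding a batch looks like, done by hand so the attention mask is visible. `pad_batch` is a hypothetical helper; in practice the tokenizer does this for you when you pass padding=True. The pad id of 0 is the [PAD] id in the bert-base-uncased vocabulary.

```python
PAD_ID = 0  # id of [PAD] in the bert-base-uncased vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 = ignore this position
    return input_ids, attention_mask


batch = [
    [101, 1996, 4937, 2938, 2006, 1996, 13523, 1012, 102],  # "the cat sat on the mat."
    [101, 1996, 4937, 102],                                  # "[CLS] the cat [SEP]"
]
ids, mask = pad_batch(batch)
for row_ids, row_mask in zip(ids, mask):
    print(row_ids)
    print(row_mask)
```

The attention mask is what tells BERT to skip the padded positions, which is why it is passed to the model alongside the input ids.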


BERT Model Variants

bert-base-uncased:  12 layers, 768 hidden, 12 heads, 110M params
bert-large-uncased: 24 layers, 1024 hidden, 16 heads, 340M params
bert-base-cased:    Same as base but case-sensitive tokenization
distilbert-base:    6 layers, 66M params, 97% of BERT performance, 60% faster
roberta-base:       BERT without NSP, trained longer, better performance

For most tasks, start with bert-base-uncased or distilbert-base-uncased. Only go larger if you need the extra capacity.


Task 1: Text Classification With BERT

The most common use of BERT. Add a linear layer on top of the [CLS] token output.

from transformers import BertForSequenceClassification, BertTokenizer
from torch.utils.data import DataLoader, Dataset
import torch
import torch.nn as nn
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Simple sentiment dataset
texts = [
    "This movie was absolutely fantastic!",
    "I hated every minute of it.",
    "An incredible performance by the lead actor.",
    "Terrible writing, terrible acting.",
    "One of the best films I've seen this year.",
    "Complete waste of time and money.",
    "Beautifully crafted and deeply moving.",
    "Boring and predictable from start to finish.",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

# Tokenize
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=64):
        self.encodings = tokenizer(
            texts,
            truncation=True,
            padding=True,
            max_length=max_len,
            return_tensors='pt'
        )
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'input_ids':      self.encodings['input_ids'][idx],
            'attention_mask': self.encodings['attention_mask'][idx],
            'labels':         self.labels[idx]
        }

dataset = SentimentDataset(texts, labels, tokenizer)
loader  = DataLoader(dataset, batch_size=4, shuffle=True)

# Load pretrained BERT with classification head
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)

device    = 'cuda' if torch.cuda.is_available() else 'cpu'
model     = model.to(device)
optimizer = AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=len(loader) * 3
)

# Fine-tune
print("Fine-tuning BERT for sentiment classification...")
for epoch in range(3):
    model.train()
    total_loss = 0
    for batch in loader:
        optimizer.zero_grad()
        input_ids      = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels         = batch['labels'].to(device)

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        loss = outputs.loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        total_loss += loss.item()

    print(f"Epoch {epoch+1}: loss={total_loss/len(loader):.4f}")

# Predict on new examples
model.eval()
new_texts = [
    "I absolutely loved this film!",
    "This was the worst movie I have ever seen."
]

new_encoding = tokenizer(
    new_texts, truncation=True, padding=True,
    max_length=64, return_tensors='pt'
).to(device)

with torch.no_grad():
    outputs = model(**new_encoding)
    preds   = torch.argmax(outputs.logits, dim=1)

for text, pred in zip(new_texts, preds):
    sentiment = "Positive" if pred == 1 else "Negative"
    print(f"'{text[:50]}...' -> {sentiment}")

Output:

Fine-tuning BERT for sentiment classification...
Epoch 1: loss=0.6834
Epoch 2: loss=0.4123
Epoch 3: loss=0.2187
'I absolutely loved this film!...' -> Positive
'This was the worst movie I have ever seen....' -> Negative

What Happens Inside During Fine-Tuning

# Look at what BertForSequenceClassification adds
from transformers import BertModel
import torch.nn as nn

class BertClassifier(nn.Module):
    def __init__(self, n_classes, dropout=0.3):
        super().__init__()
        self.bert    = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(768, n_classes)  # 768 = bert-base hidden size

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        # outputs.last_hidden_state: (batch, seq_len, 768)
        # outputs.pooler_output: (batch, 768) - the [CLS] token, passed through a linear+tanh

        cls_output = outputs.pooler_output      # (batch, 768)
        cls_output = self.dropout(cls_output)
        logits     = self.classifier(cls_output) # (batch, n_classes)

        return logits

model_manual = BertClassifier(n_classes=2)

# Check what's trainable vs frozen
total    = sum(p.numel() for p in model_manual.parameters())
trainable = sum(p.numel() for p in model_manual.parameters() if p.requires_grad)
print(f"Total parameters:     {total:,}")
print(f"Trainable parameters: {trainable:,}")
print()

# Often you freeze BERT layers and only train the head
for param in model_manual.bert.parameters():
    param.requires_grad = False

frozen_trainable = sum(p.numel() for p in model_manual.parameters() if p.requires_grad)
print(f"Trainable (head only): {frozen_trainable:,}")
print("(Only the single linear classifier layer is being trained)")

Output:

Total parameters:     109,483,778
Trainable parameters: 109,483,778

Trainable (head only): 1,538
(Only the single linear classifier layer is being trained)

When you fine-tune the entire BERT, all 109M parameters update. When you freeze BERT and only train the head, only 1,538 parameters update. Freezing is faster but usually less accurate. Fine-tuning everything gives better results when you have enough data.
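The 1,538 figure is easy to check by hand: a linear layer from 768 inputs to 2 classes has 768 × 2 weights plus 2 biases. The sketch below does that arithmetic, and also tallies the encoder from the published bert-base-uncased configuration values (vocabulary 30,522, hidden size 768, 12 layers, feed-forward size 3,072, 512 positions, 2 token types). The helper names are made up for illustration.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layernorm_params(hidden):
    """One scale and one shift value per hidden unit."""
    return 2 * hidden


def bert_encoder_params(vocab=30522, hidden=768, layers=12,
                        intermediate=3072, max_pos=512, type_vocab=2):
    # Word, position, and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layernorm_params(hidden)
    # Query, key, value, and output projections, followed by a LayerNorm.
    attention = 4 * linear_params(hidden, hidden) + layernorm_params(hidden)
    # Two-layer feed-forward block, followed by a LayerNorm.
    feed_forward = (linear_params(hidden, intermediate)
                    + linear_params(intermediate, hidden)
                    + layernorm_params(hidden))
    # The pooler is the linear layer (followed by tanh) applied to [CLS].
    pooler = linear_params(hidden, hidden)
    return embeddings + layers * (attention + feed_forward) + pooler


head = linear_params(768, 2)
encoder = bert_encoder_params()
print(f"Classifier head: {head:,}")
print(f"BERT encoder:    {encoder:,}")
print(f"Total:           {encoder + head:,}")
```

Almost all of the parameters sit in the encoder, which is why freezing it makes each training step so much cheaper.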


Task 2: Named Entity Recognition (NER)

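NER classifies each token as part of a person, organization, or location name, or as other (O). The B- and I- prefixes in the labels below mark the beginning and the inside of an entity. It's a token-level classification task, not sentence-level.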

from transformers import BertForTokenClassification, BertTokenizerFast

# NER labels
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
label2id   = {l: i for i, l in enumerate(label_list)}
id2label   = {i: l for i, l in enumerate(label_list)}

# Load NER model
ner_model = BertForTokenClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id
)

tokenizer_fast = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Example: align word labels to subword tokens
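sentence = "Elon Musk founded Tesla in California."
# Pre-split into words, with the final period as its own word so the 7 words match the 7 labels
words    = ['Elon', 'Musk', 'founded', 'Tesla', 'in', 'California', '.']
word_labels = ['B-PER', 'I-PER', 'O', 'B-ORG', 'O', 'B-LOC', 'O']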

# Tokenize with word_ids to handle subwords
encoding = tokenizer_fast(
    words,
    is_split_into_words=True,
    return_offsets_mapping=True,
    padding=True,
    truncation=True
)

# Map word-level labels to subword-level
word_ids    = encoding.word_ids()
token_labels = []
prev_word_id = None

for word_id in word_ids:
    if word_id is None:
        token_labels.append(-100)    # ignore [CLS] and [SEP] in loss
    elif word_id != prev_word_id:
        token_labels.append(label2id[word_labels[word_id]])  # first subword
    else:
        token_labels.append(-100)    # subsequent subwords: ignore
    prev_word_id = word_id

tokens = tokenizer_fast.convert_ids_to_tokens(encoding['input_ids'])
print("Token -> Label alignment:")
for token, label_id in zip(tokens, token_labels):
    label = id2label.get(label_id, 'IGN')
    print(f"  {token:<15} {label}")

Output:

Token -> Label alignment:
  [CLS]           IGN
  elon            B-PER
  mu              I-PER
  ##sk            IGN
  founded         O
  tesla           B-ORG
  in              O
  california      B-LOC
  .               O
  [SEP]           IGN

"Elon" maps to B-PER. "mu", the first subword of "Musk", carries the word's I-PER label, while the continuation piece "##sk" is ignored in the loss. Labeling only the first subword of each word is the standard way to handle subword tokenization for token-level tasks.

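At prediction time the same alignment runs in reverse: the model emits one label per subword token, and you keep the prediction at the first subword of each word. Here is a minimal sketch in plain Python. The `word_ids` list and the predicted ids are hypothetical values written out by hand, standing in for `encoding.word_ids()` and an argmax over the model's logits.

```python
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
id2label = dict(enumerate(label_list))


def word_level_labels(word_ids, token_label_ids, id2label):
    """Keep the prediction at the first subword of each word; skip the rest."""
    labels = []
    prev_word_id = None
    for word_id, label_id in zip(word_ids, token_label_ids):
        if word_id is not None and word_id != prev_word_id:
            labels.append(id2label[label_id])
        prev_word_id = word_id
    return labels


# Hypothetical values for: [CLS] elon mu ##sk founded tesla in california . [SEP]
word_ids = [None, 0, 1, 1, 2, 3, 4, 5, 6, None]
predicted_ids = [0, 1, 2, 2, 0, 3, 0, 5, 0, 0]

words = ['Elon', 'Musk', 'founded', 'Tesla', 'in', 'California', '.']
for word, label in zip(words, word_level_labels(word_ids, predicted_ids, id2label)):
    print(f"{word:<12} {label}")
```

Whatever the model predicts for special tokens and continuation pieces is simply dropped, mirroring the -100 positions used during training.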

Task 3: Question Answering

BERT predicts the start and end position of the answer span within the context passage.

from transformers import BertForQuestionAnswering, BertTokenizer
import torch

# Load pretrained QA model (already fine-tuned on SQuAD)
qa_tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
qa_model     = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

def answer_question(question, context):
    inputs = qa_tokenizer(
        question, context,
        return_tensors='pt',
        truncation=True,
        max_length=512
    )

    with torch.no_grad():
        outputs = qa_model(**inputs)

    start_logits = outputs.start_logits
    end_logits   = outputs.end_logits

    # Find best start and end positions
    start_idx = torch.argmax(start_logits)
    end_idx   = torch.argmax(end_logits) + 1

    tokens = qa_tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    answer = qa_tokenizer.convert_tokens_to_string(tokens[start_idx:end_idx])

    return answer

# Test it
context = """
The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.
It is named after the engineer Gustave Eiffel, whose company designed and built the tower.
Constructed from 1887 to 1889 as the entrance arch to the 1889 World's Fair, it was initially
criticized by some of France's leading artists and intellectuals but has become a global
cultural icon of France and one of the most recognisable structures in the world.
"""

questions = [
    "Where is the Eiffel Tower located?",
    "Who designed the Eiffel Tower?",
    "When was the Eiffel Tower built?",
]

for q in questions:
    answer = answer_question(q, context)
    print(f"Q: {q}")
    print(f"A: {answer}")
    print()

Output:

Q: Where is the Eiffel Tower located?
A: Champ de Mars in Paris, France

Q: Who designed the Eiffel Tower?
A: Gustave Eiffel

Q: When was the Eiffel Tower built?
A: 1887 to 1889

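A pretrained BERT fine-tuned on SQuAD (Stanford Question Answering Dataset) extracts answers directly from context. No generation. Just span extraction. One caveat: this checkpoint is uncased, so text decoded from its tokens comes back lowercased, and punctuation may be spaced differently from the original passage.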
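Taking the argmax of the start and end logits independently can occasionally yield an end position that comes before the start. A common refinement is to score every valid (start, end) pair and keep the best one. Here is a minimal sketch in plain Python with made-up logits. `best_span` and `max_answer_len` are hypothetical names, and a fuller version would also restrict the search to context tokens.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair, end inclusive, that maximizes
    start_logit + end_logit subject to start <= end and a length limit."""
    best_score, best = float("-inf"), (0, 0)
    for start, s_logit in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = s_logit + end_logits[end]
            if score > best_score:
                best_score, best = score, (start, end)
    return best


# Made-up logits where independent argmax gives start=2 but end=1 (an invalid span).
start_logits = [0.1, 0.2, 3.0, 0.5, 0.1]
end_logits = [0.1, 2.5, 0.3, 2.0, 0.2]

start, end = best_span(start_logits, end_logits)
print(f"best valid span: tokens {start} through {end}")
```

Because the returned end index is inclusive, the matching slice is tokens[start:end + 1].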


The Fastest Way: HuggingFace Pipeline

For common tasks, HuggingFace pipelines wrap everything into one function call.

from transformers import pipeline

# Sentiment analysis (the default checkpoint is a DistilBERT fine-tuned on SST-2)
sentiment = pipeline('sentiment-analysis')
results = sentiment([
    "I absolutely loved this product!",
    "Terrible quality, fell apart after a day.",
    "It's okay, nothing special."
])
for r in results:
    print(f"{r['label']:<10} {r['score']:.3f}")

print()

# Named Entity Recognition
ner = pipeline('ner', aggregation_strategy='simple')  # merge subword pieces into whole entities
text = "Apple CEO Tim Cook announced a new product at their Cupertino headquarters."
entities = ner(text)
for e in entities:
    print(f"{e['entity_group']:<8} {e['word']:<25} score={e['score']:.3f}")

print()

# Question Answering
qa = pipeline('question-answering')
result = qa(
    question="Who is the CEO of Apple?",
    context="Apple CEO Tim Cook announced a new product at their Cupertino headquarters."
)
print(f"Answer: {result['answer']}  (score: {result['score']:.3f})")

print()

# Zero-shot classification (no fine-tuning needed)
classifier = pipeline('zero-shot-classification')
text = "The government announced new economic policies today."
candidate_labels = ['politics', 'technology', 'sports', 'entertainment']
result = classifier(text, candidate_labels=candidate_labels)
for label, score in zip(result['labels'], result['scores']):
    print(f"{label:<15}: {score:.3f}")

Output:

POSITIVE   0.999
NEGATIVE   0.998
NEGATIVE   0.612

ORG      Apple                     score=0.998
PER      Tim Cook                  score=0.997
LOC      Cupertino                 score=0.986

Answer: Tim Cook  (score: 0.998)

politics       : 0.942
technology     : 0.031
entertainment  : 0.017
sports         : 0.010

Fine-Tuning Tips for BERT

Learning rate: BERT is sensitive to it. Use 2e-5 to 5e-5, well below the rates typical when training a network from scratch.

Batch size: 16 or 32, the values recommended in the original BERT paper. Pick whichever fits in memory.

Epochs: 2 to 4 epochs. BERT fine-tunes quickly, and more epochs usually cause overfitting.

Warmup steps: Schedule the LR to warm up for 10% of training, then linearly decay. Helps stability.

Gradient clipping: Clip at 1.0 to prevent exploding gradients.
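The warmup-then-linear-decay schedule is simple enough to write as a plain function. The sketch below computes the multiplier applied to the peak learning rate at each step. As far as I understand, it follows the same shape as get_linear_schedule_with_warmup, but treat it as an illustration, not the library's code. `linear_warmup_decay` is a made-up name.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier in [0, 1]: ramps up linearly during warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


PEAK_LR = 2e-5
TOTAL_STEPS, WARMUP_STEPS = 100, 10  # 10% warmup

for step in (0, 5, 10, 55, 100):
    lr = PEAK_LR * linear_warmup_decay(step, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

The learning rate starts at zero, reaches its peak exactly when warmup ends, and returns to zero at the final step.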

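# Standard fine-tuning setup
# (reuses `model` and `loader` from the classification example above)
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup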

EPOCHS         = 3
LEARNING_RATE  = 2e-5
WARMUP_RATIO   = 0.1

total_steps   = len(loader) * EPOCHS
warmup_steps  = int(total_steps * WARMUP_RATIO)

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, eps=1e-8)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps
)

print(f"Total training steps: {total_steps}")
print(f"Warmup steps: {warmup_steps}")
print(f"Peak LR: {LEARNING_RATE}, then linear decay to 0")

BERT vs RoBERTa vs DistilBERT

Model            Params  Speed   Accuracy  Notes
-----------      ------  -----   --------  -----
bert-base        110M    1x      baseline  Original, safe choice
bert-large       340M    0.4x    +2-3%     Slower, better accuracy
roberta-base     125M    1x      +1-2%     Better pretraining, no NSP
distilbert-base   66M    1.6x    -3%       Great for production
albert-base       12M    0.9x    ~same     Far fewer params via sharing

For most projects: start with distilbert-base-uncased for speed, switch to roberta-base for accuracy.


Quick Cheat Sheet

Task                 Model                          Code
----                 -----                          ----
Text classification  BertForSequenceClassification  pipeline('sentiment-analysis')
NER                  BertForTokenClassification     pipeline('ner')
QA                   BertForQuestionAnswering       pipeline('question-answering')
Zero-shot            NLI model                      pipeline('zero-shot-classification')
Custom               BertModel + linear head        outputs.pooler_output

Setting              Value
-------              -----
Learning rate        2e-5 to 5e-5
Batch size           16 or 32
Epochs               2 to 4
Max sequence length  128 to 512
Warmup steps         10% of total steps

Practice Challenges

Level 1:
Use pipeline('sentiment-analysis') on 20 movie reviews you write yourself (10 positive, 10 negative). Print each prediction and confidence score. Where does it get confused?

Level 2:
Fine-tune distilbert-base-uncased on any small classification dataset (you can use load_dataset('imdb') from HuggingFace). Train for 3 epochs. Compare accuracy to a TF-IDF + LogisticRegression baseline from Post 62. How much better is BERT?

Level 3:
Use BertForTokenClassification to tag a paragraph of news text with NER labels. Then visualize the output by color-coding each entity type in the text. Use the fine-tuned dslim/bert-base-NER model from HuggingFace hub.



Next up, Post 93: GPT: The Model That Predicts the Next Word Forever. Autoregressive generation, temperature and sampling strategies, and how a simple next-token prediction objective produces models that can write, code, and reason.