
GPT generates text by predicting the next word. It reads left to right.
BERT does something different. It masks random words in a sentence and tries to predict what they are. To do that well, it has to understand every word in relation to every other word simultaneously. Left and right context both matter.
That bidirectional understanding is why BERT dominated NLP benchmarks when it came out in 2018, and why encoder-only transformers are still the go-to for understanding tasks.
Both are transformer-based. The architecture is similar. The difference is in how they're pretrained and which part of the transformer they use.
GPT (decoder-only):
- Reads left to right with causal masking
- Trained to predict the next token
- Great at generation
- Context: only left side available
BERT (encoder-only):
- Reads all tokens simultaneously
- Trained to predict masked tokens + next sentence
- Great at understanding
- Context: both left and right sides available
For classification tasks, BERT wins. For generation tasks, GPT wins. For most NLP applications you actually want to build, BERT is the starting point.
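To make the "left side only" versus "both sides" distinction concrete, here is a minimal sketch of the two attention patterns as plain 0/1 matrices. The helper name `attention_mask` and the toy sentence are illustrative assumptions, not part of any library.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix; 1 means position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

tokens = ["the", "cat", "sat", "down"]
gpt_mask = attention_mask(len(tokens), causal=True)    # decoder-style: left context only
bert_mask = attention_mask(len(tokens), causal=False)  # encoder-style: full context

for name, mask in [("GPT (causal)", gpt_mask), ("BERT (bidirectional)", bert_mask)]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:<5} {row}")
```

In the causal matrix, the row for "cat" can only see "the" and itself. In the bidirectional matrix, every row sees every token.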
BERT was pretrained on two tasks simultaneously on a massive corpus (BooksCorpus + English Wikipedia, 3.3 billion words).
Task 1: Masked Language Modeling (MLM)
15% of tokens are randomly masked. The model predicts the original token from context.
Input: "The cat [MASK] on the [MASK]"
Target: "The cat sat on the mat"
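Of the 15% selected tokens:
- 80% are replaced with [MASK]
- 10% are replaced with a random token
- 10% are left unchanged
The [MASK] token never appears during fine-tuning, so the random and unchanged cases keep the model from only learning to predict at [MASK] positions.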
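The masking procedure can be sketched in a few lines of plain Python. This is a simplified illustration, assuming the 80% [MASK] / 10% random / 10% unchanged split from the original BERT recipe. The function name `mask_tokens`, the toy vocabulary, and the word-level tokens are assumptions made for the example (real BERT works on WordPiece ids).

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Apply BERT-style MLM corruption and return (corrupted, targets).

    targets[i] holds the original token at selected positions and None
    elsewhere, so a loss would only be computed where a token was selected.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            # Not selected: keep the token, no prediction target.
            corrupted.append(token)
            targets.append(None)
            continue
        targets.append(token)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(token)              # 10%: leave unchanged
    return corrupted, targets

tokens = "the cat sat on the mat".split()
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary (assumption)
corrupted, targets = mask_tokens(tokens, vocab, mask_prob=0.15, seed=1)
print(corrupted)
print(targets)
```

On a six-token sentence, a 15% rate often selects nothing at all. Raise `mask_prob` when experimenting to see all three cases.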
Task 2: Next Sentence Prediction (NSP)
Two sentences are given. The model predicts whether sentence B actually follows sentence A in the original text.
Input: [CLS] The cat sat on the mat. [SEP] It was a lazy afternoon. [SEP]
Label: IsNext (1)
Input: [CLS] The cat sat on the mat. [SEP] The stock market crashed. [SEP]
Label: NotNext (0)
NSP was later found to be less useful than MLM and was dropped in RoBERTa. But it's part of the original BERT.
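A rough sketch of how NSP training pairs could be built from an ordered list of sentences. It assumes a 50/50 split between true and random successors, as in the original BERT setup. The function name `make_nsp_pairs` is hypothetical, and drawing negatives from the same short list is a simplification (the original draws them from elsewhere in the corpus).

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, label) triples; 1 = IsNext, 0 = NotNext."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        sentence_a = sentences[i]
        if rng.random() < 0.5:
            # Positive example: the sentence that really follows.
            pairs.append((sentence_a, sentences[i + 1], 1))
        else:
            # Negative example: any sentence except A itself and its true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentence_a, rng.choice(candidates), 0))
    return pairs

sentences = [
    "The cat sat on the mat.",
    "It was a lazy afternoon.",
    "The stock market crashed.",
    "Investors panicked.",
]
for sentence_a, sentence_b, label in make_nsp_pairs(sentences):
    print(f"[CLS] {sentence_a} [SEP] {sentence_b} [SEP] -> {label}")
```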
BERT uses three special tokens you need to know:
[CLS]: Classification token. Always the first token. Its final hidden state is used as the sentence-level representation for classification tasks.
[SEP]: Separator token. Marks the end of a sentence or separates two sentences in pairs.
[PAD]: Padding token. Used to make all sequences in a batch the same length.
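Padding is easier to picture with a tiny example. The sketch below pads a batch of id lists by hand and builds the matching attention mask, which is roughly what the tokenizer does when called with `padding=True`. The helper `pad_batch` is a hypothetical name. The id 0 for [PAD] matches the bert-base-uncased vocabulary.

```python
PAD_ID = 0  # [PAD] in the bert-base-uncased vocabulary

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 = real token, 0 = padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

batch = [
    [101, 1996, 4937, 102],                                 # [CLS] the cat [SEP]
    [101, 1996, 4937, 2938, 2006, 1996, 13523, 1012, 102],  # [CLS] the cat sat on the mat . [SEP]
]
input_ids, attention_mask = pad_batch(batch)
for ids, mask in zip(input_ids, attention_mask):
    print(ids)
    print(mask)
```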
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "The cat sat on the mat."
tokens = tokenizer(text)
print(f"Input IDs: {tokens['input_ids']}")
print(f"Token type IDs: {tokens['token_type_ids']}")
print(f"Attention mask: {tokens['attention_mask']}")
print()
# Decode back to see what they are
decoded = tokenizer.convert_ids_to_tokens(tokens['input_ids'])
print(f"Tokens: {decoded}")
Output:
Input IDs: [101, 1996, 4937, 2938, 2006, 1996, 13523, 1012, 102]
Token type IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0]
Attention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1]
Tokens: ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]']
101 is [CLS]. 102 is [SEP]. Every BERT input starts with [CLS] and ends with [SEP].
# Two sentences
text_pair = ("The cat sat on the mat.", "It was a lazy afternoon.")
tokens_pair = tokenizer(*text_pair)
decoded_pair = tokenizer.convert_ids_to_tokens(tokens_pair['input_ids'])
print(f"Pair tokens: {decoded_pair}")
print(f"Token types: {tokens_pair['token_type_ids']}")
Output:
Pair tokens: ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]', 'it', 'was', 'a', 'lazy', 'afternoon', '.', '[SEP]']
Token types: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
Token type 0 = first sentence. Token type 1 = second sentence. BERT uses this to distinguish the two.
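The layout of a sentence pair can be reproduced by hand. This is only a sketch of the pattern the tokenizer follows, with `build_pair_input` as a hypothetical helper name: the first segment covers [CLS], sentence A, and the first [SEP]. The second covers sentence B and the final [SEP].

```python
def build_pair_input(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] and the matching token type ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0: [CLS] + A + [SEP]; segment 1: B + [SEP]
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids

tokens_a = ["the", "cat", "sat", "on", "the", "mat", "."]
tokens_b = ["it", "was", "a", "lazy", "afternoon", "."]
pair_tokens, pair_types = build_pair_input(tokens_a, tokens_b)
print(pair_tokens)
print(pair_types)
```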
bert-base-uncased: 12 layers, 768 hidden, 12 heads, 110M params
bert-large-uncased: 24 layers, 1024 hidden, 16 heads, 340M params
bert-base-cased: Same as base but case-sensitive tokenization
distilbert-base: 6 layers, 66M params, 97% of BERT performance, 60% faster
roberta-base: BERT without NSP, trained longer, better performance
For most tasks, start with bert-base-uncased or distilbert-base-uncased. Only go larger if you need the extra capacity.
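The parameter counts follow from the layer sizes. The back-of-the-envelope sketch below assumes the bert-base-uncased vocabulary of 30,522 WordPieces, 512 positions, 2 segment types, and a feed-forward size of 4x the hidden size. The function name `bert_param_count` is made up for this example. It counts the encoder plus pooler, without any task head.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT encoder (embeddings + layers + pooler)."""
    intermediate = 4 * hidden   # feed-forward inner size
    layer_norm = 2 * hidden     # scale + bias

    # Word, position, and segment embeddings, followed by one LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections (weights + biases), plus LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two feed-forward projections (weights + biases), plus LayerNorm
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )
    # Linear layer applied to the [CLS] hidden state
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * (attention + feed_forward) + pooler

for name, layers, hidden in [("bert-base", 12, 768), ("bert-large", 24, 1024)]:
    print(f"{name}: {bert_param_count(layers, hidden) / 1e6:.1f}M")
```

This gives about 109.5M for base and 335.1M for large, which are usually rounded to 110M and 340M.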
Text classification is the most common use of BERT: add a linear layer on top of the [CLS] token output and fine-tune.
from transformers import BertForSequenceClassification, BertTokenizer
from torch.utils.data import DataLoader, Dataset
import torch
import torch.nn as nn
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
# Simple sentiment dataset
texts = [
    "This movie was absolutely fantastic!",
    "I hated every minute of it.",
    "An incredible performance by the lead actor.",
    "Terrible writing, terrible acting.",
    "One of the best films I've seen this year.",
    "Complete waste of time and money.",
    "Beautifully crafted and deeply moving.",
    "Boring and predictable from start to finish.",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0] # 1=positive, 0=negative
# Tokenize
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=64):
        self.encodings = tokenizer(
            texts,
            truncation=True,
            padding=True,
            max_length=max_len,
            return_tensors='pt'
        )
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'input_ids': self.encodings['input_ids'][idx],
            'attention_mask': self.encodings['attention_mask'][idx],
            'labels': self.labels[idx]
        }
dataset = SentimentDataset(texts, labels, tokenizer)
loader = DataLoader(dataset, batch_size=4, shuffle=True)
# Load pretrained BERT with classification head
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
optimizer = AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=len(loader) * 3
)
# Fine-tune
print("Fine-tuning BERT for sentiment classification...")
for epoch in range(3):
    model.train()
    total_loss = 0
    for batch in loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        loss = outputs.loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}: loss={total_loss/len(loader):.4f}")
# Predict on new examples
model.eval()
new_texts = [
    "I absolutely loved this film!",
    "This was the worst movie I have ever seen."
]
new_encoding = tokenizer(
    new_texts, truncation=True, padding=True,
    max_length=64, return_tensors='pt'
).to(device)
with torch.no_grad():
    outputs = model(**new_encoding)
    preds = torch.argmax(outputs.logits, dim=1)
for text, pred in zip(new_texts, preds):
    sentiment = "Positive" if pred == 1 else "Negative"
    print(f"'{text[:50]}...' -> {sentiment}")
Output:
Fine-tuning BERT for sentiment classification...
Epoch 1: loss=0.6834
Epoch 2: loss=0.4123
Epoch 3: loss=0.2187
'I absolutely loved this film!...' -> Positive
'This was the worst movie I have ever seen....' -> Negative
# Look at what BertForSequenceClassification adds
from transformers import BertModel
import torch.nn as nn
class BertClassifier(nn.Module):
    def __init__(self, n_classes, dropout=0.3):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(768, n_classes)  # 768 = bert-base hidden size

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        # outputs.last_hidden_state: (batch, seq_len, 768)
        # outputs.pooler_output: (batch, 768) - the [CLS] token, passed through a linear+tanh
        cls_output = outputs.pooler_output  # (batch, 768)
        cls_output = self.dropout(cls_output)
        logits = self.classifier(cls_output)  # (batch, n_classes)
        return logits
model_manual = BertClassifier(n_classes=2)
# Check what's trainable vs frozen
total = sum(p.numel() for p in model_manual.parameters())
trainable = sum(p.numel() for p in model_manual.parameters() if p.requires_grad)
print(f"Total parameters: {total:,}")
print(f"Trainable parameters: {trainable:,}")
print()
# Often you freeze BERT layers and only train the head
for param in model_manual.bert.parameters():
    param.requires_grad = False
frozen_trainable = sum(p.numel() for p in model_manual.parameters() if p.requires_grad)
print(f"Trainable (head only): {frozen_trainable:,}")
print("(Only the linear classifier head is being trained)")
Output:
Total parameters: 109,483,778
Trainable parameters: 109,483,778
Trainable (head only): 1,538
(Only the linear classifier head is being trained)
When you fine-tune the entire BERT, all 109M parameters update. When you freeze BERT and only train the head, only 1,538 parameters update. Freezing is faster but usually less accurate. Fine-tuning everything gives better results when you have enough data.
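The 1,538 figure is just the size of one linear layer: a 768 x 2 weight matrix plus 2 biases. A quick check, with `linear_head_params` as a hypothetical helper name:

```python
def linear_head_params(hidden_size, n_classes):
    """Parameters in a single linear layer: weight matrix plus bias vector."""
    return hidden_size * n_classes + n_classes

for n_classes in (2, 3, 7):
    print(f"{n_classes} classes -> {linear_head_params(768, n_classes):,} head parameters")
```

Even with many classes, the head stays tiny next to the encoder, which is why head-only training is fast.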
Named entity recognition (NER) classifies each token as, for example, Person, Organization, Location, Date, or Other. It's a token-level classification task, not a sentence-level one.
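Token-level labels are commonly written in the BIO scheme: a B- tag begins an entity, an I- tag continues it, and O marks everything else. The sketch below shows one way per-word tags could be merged back into entity spans. The function name `bio_to_entities` is hypothetical, and stray I- tags with no preceding B- tag of the same type are simply dropped here, which is a design choice rather than a standard.

```python
def bio_to_entities(words, tags):
    """Merge B-/I- tags into (entity_type, text) spans."""
    entities = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if there was one.
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity currently being built.
            current_words.append(word)
        else:
            # "O" tag or an inconsistent I- tag: close any open entity.
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities

words = ["Elon", "Musk", "founded", "Tesla", "in", "California", "."]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC", "O"]
print(bio_to_entities(words, tags))
```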
from transformers import BertForTokenClassification, BertTokenizerFast
# NER labels
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
label2id = {l: i for i, l in enumerate(label_list)}
id2label = {i: l for i, l in enumerate(label_list)}
# Load NER model
ner_model = BertForTokenClassification.from_pretrained(
'bert-base-uncased',
num_labels=len(label_list),
id2label=id2label,
label2id=label2id
)
tokenizer_fast = BertTokenizerFast.from_pretrained('bert-base-uncased')
# Example: align word labels to subword tokens
sentence = "Elon Musk founded Tesla in California."
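# Split into words by hand so the final period is its own word and the
# word list lines up one-to-one with word_labels below
words = ['Elon', 'Musk', 'founded', 'Tesla', 'in', 'California', '.']
word_labels = ['B-PER', 'I-PER', 'O', 'B-ORG', 'O', 'B-LOC', 'O']
# Tokenize with word_ids to handle subwords
encoding = tokenizer_fast(
    words,
    is_split_into_words=True,
    return_offsets_mapping=True,
    padding=True,
    truncation=True
)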
# Map word-level labels to subword-level
word_ids = encoding.word_ids()
token_labels = []
prev_word_id = None
for word_id in word_ids:
    if word_id is None:
        token_labels.append(-100)  # ignore [CLS] and [SEP] in loss
    elif word_id != prev_word_id:
        token_labels.append(label2id[word_labels[word_id]])  # first subword
    else:
        token_labels.append(-100)  # subsequent subwords: ignore
    prev_word_id = word_id
tokens = tokenizer_fast.convert_ids_to_tokens(encoding['input_ids'])
print("Token -> Label alignment:")
for token, label_id in zip(tokens, token_labels):
    label = id2label.get(label_id, 'IGN')
    print(f"  {token:<15} {label}")
Output:
Token -> Label alignment:
  [CLS]           IGN
  elon            B-PER
  mu              I-PER
  ##sk            IGN
  founded         O
  tesla           B-ORG
  in              O
  california      B-LOC
  .               O
  [SEP]           IGN
"Elon" maps to B-PER. "Musk" is split into "mu" and "##sk": the first subword carries the word's label (I-PER), and the continuation "##sk" is ignored in the loss. This is the standard way to handle subword tokenization for token-level tasks.
For extractive question answering, BERT predicts the start and end positions of the answer span within the context passage.
from transformers import BertForQuestionAnswering, BertTokenizer
import torch
# Load pretrained QA model (already fine-tuned on SQuAD)
qa_tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
qa_model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
def answer_question(question, context):
    inputs = qa_tokenizer(
        question, context,
        return_tensors='pt',
        truncation=True,
        max_length=512
    )
    with torch.no_grad():
        outputs = qa_model(**inputs)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits
    # Find best start and end positions
    start_idx = torch.argmax(start_logits)
    end_idx = torch.argmax(end_logits) + 1
    tokens = qa_tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    answer = qa_tokenizer.convert_tokens_to_string(tokens[start_idx:end_idx])
    return answer
# Test it
context = """
The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.
It is named after the engineer Gustave Eiffel, whose company designed and built the tower.
Constructed from 1887 to 1889 as the entrance arch to the 1889 World's Fair, it was initially
criticized by some of France's leading artists and intellectuals but has become a global
cultural icon of France and one of the most recognisable structures in the world.
"""
questions = [
"Where is the Eiffel Tower located?",
"Who designed the Eiffel Tower?",
"When was the Eiffel Tower built?",
]
for q in questions:
    answer = answer_question(q, context)
    print(f"Q: {q}")
    print(f"A: {answer}")
    print()
Output:
Q: Where is the Eiffel Tower located?
A: Champ de Mars in Paris, France
Q: Who designed the Eiffel Tower?
A: Gustave Eiffel
Q: When was the Eiffel Tower built?
A: 1887 to 1889
A pretrained BERT fine-tuned on SQuAD (Stanford Question Answering Dataset) extracts answers directly from context. No generation. Just span extraction.
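Taking the argmax of the start and end logits independently can occasionally give an end position before the start. A common remedy is to score every valid (start, end) pair jointly. Here is a minimal sketch on plain Python lists, where `best_span`, `max_answer_len`, and the toy logits are all assumptions made for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed score, with start <= end.

    Both indices are inclusive.
    """
    best_score, best = float("-inf"), (0, 0)
    for start, start_score in enumerate(start_logits):
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score, best = score, (start, end)
    return best

# Toy logits where the independent argmaxes disagree:
# best start is index 3, best end is index 0, which is not a valid span.
start_logits = [0.1, 4.0, 0.3, 4.5]
end_logits = [5.0, 0.2, 3.0, 0.1]
print(best_span(start_logits, end_logits))
```

Because the returned end index is inclusive, slicing the tokens would use `tokens[start:end + 1]`.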
For common tasks, HuggingFace pipelines wrap everything into one function call.
from transformers import pipeline
# Sentiment analysis (the default model is a DistilBERT fine-tuned on SST-2)
sentiment = pipeline('sentiment-analysis')
results = sentiment([
    "I absolutely loved this product!",
    "Terrible quality, fell apart after a day.",
    "It's okay, nothing special."
])
for r in results:
    print(f"{r['label']:<10} {r['score']:.3f}")
print()
# Named Entity Recognition
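# aggregation_strategy='simple' merges subword pieces into whole entities
# (it replaces the deprecated grouped_entities=True)
ner = pipeline('ner', aggregation_strategy='simple')
text = "Apple CEO Tim Cook announced a new product at their Cupertino headquarters."
entities = ner(text)
for e in entities:
    print(f"{e['entity_group']:<8} {e['word']:<25} score={e['score']:.3f}")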
print()
# Question Answering
qa = pipeline('question-answering')
result = qa(
question="Who is the CEO of Apple?",
context="Apple CEO Tim Cook announced a new product at their Cupertino headquarters."
)
print(f"Answer: {result['answer']} (score: {result['score']:.3f})")
print()
# Zero-shot classification (no fine-tuning needed)
classifier = pipeline('zero-shot-classification')
text = "The government announced new economic policies today."
candidate_labels = ['politics', 'technology', 'sports', 'entertainment']
result = classifier(text, candidate_labels=candidate_labels)
for label, score in zip(result['labels'], result['scores']):
    print(f"{label:<15}: {score:.3f}")
Output:
POSITIVE 0.999
NEGATIVE 0.998
NEGATIVE 0.612
ORG Apple score=0.998
PER Tim Cook score=0.997
LOC Cupertino score=0.986
Answer: Tim Cook (score: 0.998)
politics : 0.942
technology : 0.031
entertainment : 0.017
sports : 0.010
Learning rate: BERT is sensitive. Use 2e-5 to 5e-5. Lower than typical deep learning.
Batch size: 16 or 32. Larger batches work better for BERT.
Epochs: 2 to 4. BERT fine-tunes quickly, and training for more epochs usually causes overfitting.
Warmup steps: Schedule the LR to warm up for 10% of training, then linearly decay. Helps stability.
Gradient clipping: Clip at 1.0 to prevent exploding gradients.
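The warmup-then-decay shape is simple enough to write out by hand. This sketch computes the learning-rate multiplier at a given step, following the same linear-warmup, linear-decay rule described above. The function name `linear_warmup_decay` and the step counts are illustrative assumptions.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """LR multiplier in [0, 1]: ramps up linearly, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

peak_lr = 2e-5
total_steps = 100
warmup_steps = int(0.1 * total_steps)  # 10% warmup

for step in (0, 5, 10, 55, 100):
    lr = peak_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

The multiplier reaches 1.0 exactly at the end of warmup, so the peak learning rate is the value passed to the optimizer.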
# Standard fine-tuning setup
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
EPOCHS = 3
LEARNING_RATE = 2e-5
WARMUP_RATIO = 0.1
total_steps = len(loader) * EPOCHS
warmup_steps = int(total_steps * WARMUP_RATIO)
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, eps=1e-8)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps
)
print(f"Total training steps: {total_steps}")
print(f"Warmup steps: {warmup_steps}")
print(f"Peak LR: {LEARNING_RATE}, then linear decay to 0")
| Model | Params | Speed | Accuracy | Notes |
|---|---|---|---|---|
| bert-base | 110M | 1x | baseline | Original, safe choice |
| bert-large | 340M | 0.4x | +2-3% | Slower, better accuracy |
| roberta-base | 125M | 1x | +1-2% | Better pretraining, no NSP |
| distilbert-base | 66M | 1.6x | -3% | Great for production |
| albert-base | 12M | 0.9x | ~same | Much fewer params via sharing |
For most projects: start with distilbert-base-uncased for speed, switch to roberta-base for accuracy.
| Task | Model | Code |
|---|---|---|
| Text classification | BertForSequenceClassification | pipeline('sentiment-analysis') |
| NER | BertForTokenClassification | pipeline('ner') |
| QA | BertForQuestionAnswering | pipeline('question-answering') |
| Zero-shot | NLI model | pipeline('zero-shot-classification') |
| Custom | BertModel + linear head | outputs.pooler_output |
| Setting | Value |
|---|---|
| Learning rate | 2e-5 to 5e-5 |
| Batch size | 16 or 32 |
| Epochs | 2 to 4 |
| Max sequence length | 128 to 512 |
| Warmup steps | 10% of total steps |
Level 1:
Use pipeline('sentiment-analysis') on 20 movie reviews you write yourself (10 positive, 10 negative). Print each prediction and confidence score. Where does it get confused?
Level 2:
Fine-tune distilbert-base-uncased on any small classification dataset (you can use load_dataset('imdb') from HuggingFace). Train for 3 epochs. Compare accuracy to a TF-IDF + LogisticRegression baseline from Post 62. How much better is BERT?
Level 3:
Use BertForTokenClassification to tag a paragraph of news text with NER labels. Then visualize the output by color-coding each entity type in the text. Use the fine-tuned dslim/bert-base-NER model from HuggingFace hub.
Next up, Post 93: GPT: The Model That Predicts the Next Word Forever. Autoregressive generation, temperature and sampling strategies, and how a simple next-token prediction objective produces models that can write, code, and reason.