
Transformer Architecture: A Visual & Interactive Guide

By Prayag


Token Embeddings


The Problem

Computers don't understand words — they understand numbers. But how do we convert "cat" into something a neural network can process?

Naive approach: Assign each word a number (cat=1, dog=2, …)

  • Problem: This implies "cat" and "dog" are as different as "cat" and "quantum physics"
  • We lose all semantic meaning

Better approach: Represent each word as a vector in high-dimensional space where:

  • Similar words are close together
  • Relationships are encoded as directions

Key Insight: The King–Queen Analogy

The famous example from the video:

king - man + woman ≈ queen

This works because embeddings capture relationships:

  • The vector from "man" to "woman" represents a gender direction
  • Apply that same direction to "king" and you get "queen"

The Math

For a vocabulary of V words and embedding dimension d, the embedding matrix is

E ∈ ℝ^(V × d)

To get the embedding for token i:

embedding(i) = E[i]   (row i of the matrix)

It's just a lookup table! Each row is a learnable d-dimensional vector.
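A minimal sketch of the lookup (toy sizes; random numbers stand in for learned weights):

```python
import numpy as np

# Toy sizes: vocabulary of 5 tokens, 4-dimensional embeddings
V, d = 5, 4
rng = np.random.default_rng(0)
E = rng.standard_normal((V, d))   # the embedding matrix, shape (V, d)

token_id = 2                      # suppose the tokenizer maps "cat" to ID 2
embedding = E[token_id]           # the "lookup": just row 2 of the matrix
print(embedding.shape)            # (4,)
```

In a real model, E is trained along with everything else; gradients flow only into the rows that were actually looked up.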

Question 1
GPT-3 uses 12,288-dimensional embeddings. Why so many dimensions?
More dimensions = more capacity to encode nuanced relationships.

With 12,288 dimensions, the model can represent:

  • Semantic meaning — what the word means
  • Syntactic role — noun, verb, adjective
  • Sentiment — positive, negative
  • Domain — medical, legal, casual
  • …and thousands of other subtle features

Each dimension isn't interpretable on its own — meaning emerges from combinations.

Question 2
If embeddings work like the video shows, what would you expect from:

  • Paris - France + Germany = ?
  • walked - walk + swim = ?
  • good - bad + terrible = ?

Answers:

  • Berlin — capital city relationship
  • swam — past tense relationship
  • wonderful or similar — antonym relationship (though this one is trickier)

Code: Exploring Real Embeddings

Python
# Using sentence-transformers for easy embedding access
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Get embeddings for words
words = ["king", "queen", "man", "woman", "prince", "princess"]
embeddings = model.encode(words)

# Check dimensions
print(f"Embedding shape: {embeddings[0].shape}")  # 384 dimensions

# Compute cosine similarity
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# King is more similar to queen than to random words
print(f"king-queen similarity: {cosine_sim(embeddings[0], embeddings[1]):.3f}")
print(f"king-man similarity: {cosine_sim(embeddings[0], embeddings[2]):.3f}")

# Try the analogy: king - man + woman ≈ queen
analogy = embeddings[0] - embeddings[2] + embeddings[3]  # king - man + woman
print(f"Analogy result similarity to queen: {cosine_sim(analogy, embeddings[1]):.3f}")

Key Takeaways

| Concept | What It Means |
|---|---|
| Embedding | A learned vector representing a token |
| Embedding dimension | How many numbers per token (e.g., 768, 12288) |
| Semantic similarity | Close vectors = related meanings |
| Lookup table | Embeddings are just matrix rows indexed by token ID |

Positional Encoding


The Problem

Consider these sentences:

  • "The cat ate the fish"
  • "The fish ate the cat"

Same words, completely different meanings! But if we just use embeddings, the transformer sees the same set of vectors (in different positions). Unlike RNNs that process sequentially, transformers process all tokens in parallel — they have no inherent notion of order.

The Solution: Add Position Information

We need to inject position into each token's representation. The original transformer uses sinusoidal positional encodings:

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Where:

  • pos = position in sequence (0, 1, 2, …)
  • i = index of the sin/cos dimension pair
  • d = embedding dimension

Why not just use position numbers (0, 1, 2, …)?

  • Numbers would dominate the embedding values
  • No natural way to handle sequences longer than training data
  • Sinusoids can extrapolate to unseen positions

Why alternating sin/cos?

  • Allows the model to learn relative positions
  • PE(pos+k) can be represented as a linear function of PE(pos)

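That linearity is easy to check numerically for a single sin/cos pair: PE(pos+k) is a rotation of PE(pos) by a matrix that depends only on the offset k. A minimal check, using one arbitrary example frequency:

```python
import numpy as np

omega = 1 / 10000 ** (6 / 64)   # example frequency: dimension pair i=3 with d=64
p, k = 17, 5                    # arbitrary position and offset

def pe_pair(pos):
    """The (sin, cos) values of one frequency at a given position."""
    return np.array([np.sin(omega * pos), np.cos(omega * pos)])

# Rotation matrix that depends only on the offset k, not the position p
M = np.array([[ np.cos(omega * k), np.sin(omega * k)],
              [-np.sin(omega * k), np.cos(omega * k)]])

print(np.allclose(M @ pe_pair(p), pe_pair(p + k)))  # True
```

Because M depends only on k, "attend to the token k positions away" is a pattern the model can express with a single learned linear map.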
Question 3
Why do we use different frequencies for different dimensions?
If all dimensions used the same frequency, positions 0 and (say) 628 would have nearly identical encodings (since sin repeats every 2π).

By using different frequencies:

  • Fast-cycling dimensions → distinguish nearby positions
  • Slow-cycling dimensions → distinguish far-apart positions

Together, they create a unique fingerprint for each position, even for very long sequences.
Question 4
The video mentions that sinusoidal encodings help with relative positions. Why might this matter for language?
In language, relative position often matters more than absolute:

  • "The big red dog" — adjectives come before nouns (relative)
  • Whether "big" is at position 47 or 203 doesn't change its relationship to "dog"


Sinusoidal encodings let the model learn patterns like "two positions apart" regardless of where they occur in the sequence.

Code: Visualizing Positional Encodings

Python
import numpy as np
import matplotlib.pyplot as plt

def positional_encoding(max_len, d_model):
    """Generate sinusoidal positional encodings."""
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]

    # Different frequencies for different dimensions
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))

    pe[:, 0::2] = np.sin(position * div_term)  # Even dimensions
    pe[:, 1::2] = np.cos(position * div_term)  # Odd dimensions

    return pe

# Generate encodings
pe = positional_encoding(max_len=100, d_model=64)

# Visualize as heatmap
plt.figure(figsize=(12, 4))
plt.imshow(pe.T, aspect='auto', cmap='RdBu')
plt.xlabel('Position')
plt.ylabel('Dimension')
plt.title('Positional Encodings: Each column is a unique fingerprint')
plt.colorbar()
plt.show()

# Show how position 0 vs position 50 differ
print("Position 0 (first 10 dims):", pe[0, :10].round(3))
print("Position 50 (first 10 dims):", pe[50, :10].round(3))

Modern Alternatives

| Method | How It Works | Used By |
|---|---|---|
| Sinusoidal | Fixed sin/cos waves | Original Transformer |
| Learned | Trainable position embeddings | GPT-2, BERT |
| RoPE | Rotary position embedding | LLaMA, GPT-NeoX |
| ALiBi | Attention bias based on distance | BLOOM |

Key Takeaways

| Concept | What It Means |
|---|---|
| Positional encoding | Vector added to embedding to indicate position |
| Sinusoidal | Using sin/cos at different frequencies |
| Unique fingerprint | Each position has a distinct encoding |
| Addition | final_input = embedding + positional_encoding |
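The addition in the last row can be sketched end to end (toy dimensions; random vectors stand in for learned token embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16

# Toy token embeddings for a 6-token sequence (in a real model: rows of E)
token_emb = rng.standard_normal((seq_len, d_model))

# Sinusoidal positional encodings, same formula as above
pos = np.arange(seq_len)[:, None]
div = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(pos * div)
pe[:, 1::2] = np.cos(pos * div)

# The combination is elementwise addition, nothing fancier
final_input = token_emb + pe
print(final_input.shape)  # (6, 16)
```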

Self-Attention


The Problem: Context Matters

Consider the word "bank":

  • "I deposited money at the bank" → financial institution
  • "I sat on the river bank" → edge of water

The same word needs different representations depending on context. Self-attention solves this by letting each token look at all other tokens to build a context-aware representation.

The Core Idea: Queries, Keys, and Values

Think of attention like a search engine:

| Component | Analogy | What It Does |
|---|---|---|
| Query (Q) | Your search query | "What information am I looking for?" |
| Key (K) | Document titles | "What information do I contain?" |
| Value (V) | Document contents | "Here's my actual information" |

Each token generates all three:

  • Its Query: What it's looking for
  • Its Key: What it offers to others
  • Its Value: The information it provides

The Math

Given input X of shape (sequence_length, d_model):

Step 1: Create Q, K, V

Q = X·W_Q,   K = X·W_K,   V = X·W_V

Step 2: Compute attention scores

scores = Q·Kᵀ

Step 3: Scale and normalize

weights = softmax(scores / √d_k)

Step 4: Weighted sum of values

output = weights·V

Question 5
What attention pattern would you expect for the word "it" in: "The cat sat on the mat because it was tired"
"it" should attend strongly to "cat" because:

  • "it" is a pronoun referring back to something
  • "cat" is the subject that can be "tired"
  • "mat" can't be "tired"

This coreference resolution happens naturally through learned attention patterns.

Code: Implementing Self-Attention from Scratch

Python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """
    Single-head self-attention.

    X: (seq_len, d_model) - input embeddings
    W_q, W_k, W_v: (d_model, d_k) - projection matrices
    """
    # Step 1: Project to Q, K, V
    Q = X @ W_q  # (seq_len, d_k)
    K = X @ W_k  # (seq_len, d_k)
    V = X @ W_v  # (seq_len, d_k)

    # Step 2: Compute attention scores
    d_k = K.shape[-1]
    scores = Q @ K.T  # (seq_len, seq_len)

    # Step 3: Scale and softmax
    scaled_scores = scores / np.sqrt(d_k)
    attention_weights = softmax(scaled_scores, axis=-1)

    # Step 4: Weighted sum of values
    output = attention_weights @ V  # (seq_len, d_k)

    return output, attention_weights

# Example: 4 tokens, 8-dimensional embeddings, 4-dimensional attention
np.random.seed(42)
seq_len, d_model, d_k = 4, 8, 4

X = np.random.randn(seq_len, d_model)
W_q = np.random.randn(d_model, d_k) * 0.1
W_k = np.random.randn(d_model, d_k) * 0.1
W_v = np.random.randn(d_model, d_k) * 0.1

output, attn_weights = self_attention(X, W_q, W_k, W_v)

print("Attention weights (each row sums to 1):")
print(attn_weights.round(3))
print("\nRow sums:", attn_weights.sum(axis=1).round(3))

Key Takeaways

| Concept | What It Means |
|---|---|
| Query | What this token is looking for |
| Key | What this token offers to others |
| Value | The actual information to retrieve |
| Attention weight | How much to attend (0 to 1, sums to 1) |
| Context-aware | Output depends on the whole sequence |

Softmax and Attention Scores


The Problem: Raw Scores Are Messy

After computing Q·Kᵀ, we get raw attention scores:

  • Can be any real number (positive or negative)
  • Don't sum to anything meaningful
  • Larger scores = more relevance, but how much more?

We need a way to convert these to proper attention weights:

  • Non-negative (can't have negative attention)
  • Sum to 1 (it's an "attention budget")
  • Higher scores → higher weights

Softmax: The Solution

softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)

Why this works:

  1. Exponential makes everything positive: eˣ > 0 for all x
  2. Division makes them sum to 1: Proper probability distribution
  3. Amplifies differences: Small score differences become large weight differences

Question 6
Given scores [2, 4, 1], compute softmax:

  1. Exponentiate: [e², e⁴, e¹] = [7.39, 54.60, 2.72]
  2. Sum: 7.39 + 54.60 + 2.72 = 64.71
  3. Divide: [7.39/64.71, 54.60/64.71, 2.72/64.71]

Result: [0.11, 0.84, 0.04]

Notice how score 4 (just 2 more than 2) gets 84% of the attention!
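The same computation in NumPy:

```python
import numpy as np

scores = np.array([2.0, 4.0, 1.0])
weights = np.exp(scores) / np.exp(scores).sum()
print(weights.round(2))  # [0.11 0.84 0.04]
```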

The Temperature Parameter

You can control the "sharpness" of softmax with temperature:

| Temperature | Effect | Use Case |
|---|---|---|
| T < 1 (cold) | Sharper, more peaked | More confident, deterministic |
| T = 1 | Standard | Normal operation |
| T > 1 (hot) | Softer, more uniform | More exploration, creativity |

This is the same "temperature" parameter you see in ChatGPT!

Question 7
With scores [2, 4, 1]:

  • T=0.5: What happens?
  • T=2.0: What happens?

| Temperature | Scaled Scores | Result | Effect |
|---|---|---|---|
| T=0.5 (cold) | [4, 8, 2] | [0.02, 0.98, 0.00] | Almost all on highest |
| T=2.0 (hot) | [1, 2, 0.5] | [0.23, 0.63, 0.14] | More evenly distributed |

Key insight:

  • Lower temperature = more confident/focused
  • Higher temperature = more exploratory/uncertain
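A minimal sketch of temperature-scaled softmax: divide the scores by T before the usual softmax.

```python
import numpy as np

def softmax_t(scores, T=1.0):
    z = np.asarray(scores) / T      # temperature divides the scores first
    e = np.exp(z - z.max())         # subtract max for numerical stability
    return e / e.sum()

scores = [2.0, 4.0, 1.0]
print(softmax_t(scores, T=0.5).round(2))  # cold: almost all weight on the top score
print(softmax_t(scores, T=1.0).round(2))  # standard softmax
print(softmax_t(scores, T=2.0).round(2))  # hot: weight spread more evenly
```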

The Scaling Factor: √dk

Why do we divide by √dk before softmax?

The problem: Dot products get larger as dimension increases.

  • Q and K are dk-dimensional vectors
  • Their dot product is sum of dk terms
  • Variance grows with dk

Large dot products → extreme softmax:

  • Scores like [50, 52, 48] → softmax ≈ [0.12, 0.87, 0.02]
  • Scores like [500, 520, 480] → softmax ≈ [0.0, 1.0, 0.0]

The model gets overconfident and can't learn from gradients!

Solution: Divide by √dk to normalize variance back to ~1.
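A quick numerical check of that claim: for random vectors with unit-variance entries, the dot product's standard deviation grows like √dk, and dividing by √dk brings it back to roughly 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n = 512, 10000

# Dot products of n pairs of random unit-variance d_k-dimensional vectors
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))
dots = (q * k).sum(axis=1)

print(f"raw std:    {dots.std():.1f}")                   # grows like sqrt(512) ~ 22.6
print(f"scaled std: {(dots / np.sqrt(d_k)).std():.1f}")  # back to ~1
```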

Key Takeaways

| Concept | What It Means |
|---|---|
| Softmax | Converts scores to probability distribution |
| Temperature | Controls sharpness (low=focused, high=spread) |
| √dk scaling | Prevents extreme softmax from large dot products |
| Attention fading | Longer sequences → diluted attention |

Multi-Head Attention


The Problem: One Head Isn't Enough

A single attention head can only focus on one type of relationship at a time. But language has many simultaneous relationships:

  • Syntactic: subject-verb agreement
  • Semantic: word meaning in context
  • Coreference: what "it" refers to
  • Positional: nearby words

The Solution: Multiple Heads in Parallel

Instead of one big attention operation, run several smaller ones:

MultiHead(X) = Concat(head₁, …, head_h)·W_O

Where each head is:

head_i = Attention(X·W_Q^(i), X·W_K^(i), X·W_V^(i))
How It Works

  1. Split the embedding into h heads (e.g., 768 dimensions → 12 heads of 64 each)
  2. Compute attention independently in each head
  3. Concatenate the results back together
  4. Project through a final linear layer

Each head learns to focus on different types of relationships!
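The split-attend-concat recipe, sketched in NumPy for a single unbatched sequence (random weights stand in for learned ones; no masking):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Multi-head self-attention via reshape: split, attend per head, concat, project."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    # Project, then split the last dimension into (n_heads, d_head)
    def split(W):
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(W_q), split(W_k), split(W_v)   # each (n_heads, seq_len, d_head)

    # Attention independently in each head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq_len, seq_len)
    out = softmax(scores) @ V                             # (n_heads, seq_len, d_head)

    # Concatenate heads back together, then the output projection
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 16, 4
X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads).shape)  # (4, 16)
```

Note the cost trick: because each head works in a d_model/h subspace, h heads together cost about the same as one full-width head.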

What Different Heads Learn

Research has shown that heads specialize:

| Head Type | What It Attends To | Example |
|---|---|---|
| Positional | Previous/next tokens | "The [cat]" → "cat" attends to "The" |
| Syntactic | Subject-verb pairs | "The cats [run]" → "run" attends to "cats" |
| Semantic | Related concepts | "bank [money]" → "money" attends to "bank" |
| Coreference | Pronouns to nouns | "[it] was tired" → "it" attends to "cat" |

Question 8
If we have the same total parameters, why use 12 heads of 64 dimensions instead of 1 head of 768 dimensions?
Capacity for diverse patterns:
A single head learns one attention pattern. Multiple heads can learn different patterns simultaneously.

Geometric intuition:
Each head operates in a smaller subspace. Different subspaces can capture different types of relationships.

Regularization:
Multiple smaller heads are harder to overfit than one large head.

Think of it like having a team of specialists vs. one generalist.
Question 9
What are the trade-offs of using more heads (e.g., 64 heads of 12 dims vs. 12 heads of 64 dims)?

| Configuration | Pros | Cons |
|---|---|---|
| More heads, smaller dims | More diverse patterns | Each head has less capacity; may miss complex relationships |
| Fewer heads, larger dims | Each head has more capacity | Fewer distinct patterns; may learn redundant patterns |

The sweet spot depends on the task. 8–16 heads is common in practice.

Key Takeaways

| Concept | What It Means |
|---|---|
| Multi-head | Multiple attention operations in parallel |
| Head specialization | Different heads learn different patterns |
| Split-attend-concat | Divide embedding, attend separately, combine |
| Output projection | Final linear layer after concatenation |

The Transformer Block


The Full Architecture

Text
Input
  │
  ├──────────────────┐
  │                  │
  ▼                  │
Layer Norm           │
  │                  │
  ▼                  │
Multi-Head Attention │
  │                  │
  ▼                  │
  + ◄────────────────┘  (Residual connection)
  │
  ├──────────────────┐
  │                  │
  ▼                  │
Layer Norm           │
  │                  │
  ▼                  │
Feed-Forward Network │
  │                  │
  ▼                  │
  + ◄────────────────┘  (Residual connection)
  │
  ▼
Output
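
The diagram can be sketched in NumPy. This is a minimal pre-norm block: the attention sublayer is a stand-in linear map (any function of the right shape works here), and LayerNorm omits the learned scale and shift.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean, unit variance (no learned scale/shift)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: expand, ReLU, project back."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, attention, ffn_params):
    """Pre-norm block from the diagram: x + Attn(LN(x)), then x + FFN(LN(x))."""
    x = x + attention(layer_norm(x))          # residual around attention
    x = x + ffn(layer_norm(x), *ffn_params)   # residual around FFN
    return x

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32
x = rng.standard_normal((seq_len, d_model))

# Stand-in attention: any (seq_len, d_model) -> (seq_len, d_model) map works here
W_attn = rng.standard_normal((d_model, d_model)) * 0.1
attention = lambda h: h @ W_attn

ffn_params = (rng.standard_normal((d_model, d_ff)) * 0.1, np.zeros(d_ff),
              rng.standard_normal((d_ff, d_model)) * 0.1, np.zeros(d_model))

out = transformer_block(x, attention, ffn_params)
print(out.shape)  # (4, 8)
```

Stacking many such blocks, with real multi-head attention in place of the stand-in, gives the body of a GPT-style model; the residual connections are what let gradients flow through deep stacks.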

Summary: What Each Component Does

| Component | Video | Function |
|---|---|---|
| Token Embeddings | 1 | Convert tokens to semantic vectors |
| Positional Encoding | 2 | Inject sequence position information |
| Self-Attention | 3 | Let tokens gather context from each other |
| Softmax & Scaling | 4 | Convert scores to attention probabilities |
| Multi-Head Attention | 5 | Learn diverse relationship patterns in parallel |
| Transformer Block | 6 | Combine attention + FFN with residuals and norms |

Next Steps

Now that you understand the fundamentals:

  1. Implement a mini-transformer from scratch in PyTorch
  2. Explore attention visualizations with BertViz or similar tools
  3. Fine-tune a pretrained model on a task you care about
  4. Read "Attention Is All You Need" — it'll make much more sense now!
  5. Explore variations: RoPE, ALiBi, Flash Attention, sparse attention