
Transformer Architecture: A Visual & Interactive Guide

By Prayag


Token Embeddings


The Problem

Computers don't understand words — they understand numbers. But how do we convert "cat" into something a neural network can process?

Naive approach: Assign each word a number (cat=1, dog=2, …)

  • Problem: This implies "cat" and "dog" are as different as "cat" and "quantum physics"
  • We lose all semantic meaning

Better approach: Represent each word as a vector in high-dimensional space where:

  • Similar words are close together
  • Relationships are encoded as directions

Key Insight: The King–Queen Analogy

The famous example from the video:

king - man + woman ≈ queen

This works because embeddings capture relationships:

  • The vector from "man" to "woman" represents a gender direction
  • Apply that same direction to "king" and you get "queen"

The Math

For a vocabulary of V words and embedding dimension d, the embedding matrix is

E ∈ ℝ^(V × d)

To get the embedding for token i:

embedding(i) = E[i]   (row i of the matrix)

It's just a lookup table! Each row is a learnable d-dimensional vector.
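A minimal sketch of the lookup (toy sizes; random numbers stand in for learned weights):

```python
import numpy as np

# Toy sizes: vocabulary of 5 tokens, 4-dimensional embeddings
V, d = 5, 4
rng = np.random.default_rng(0)
E = rng.standard_normal((V, d))   # the embedding matrix, shape (V, d)

token_id = 2                      # suppose the tokenizer maps "cat" to ID 2
embedding = E[token_id]           # the "lookup": just row 2 of the matrix
print(embedding.shape)            # (4,)
```

In a real model, E is trained along with everything else; gradients flow only into the rows that were actually looked up.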

Question 1
GPT-3 uses 12,288-dimensional embeddings. Why so many dimensions?
More dimensions = more capacity to encode nuanced relationships.

With 12,288 dimensions, the model can represent:

  • Semantic meaning — what the word means
  • Syntactic role — noun, verb, adjective
  • Sentiment — positive, negative
  • Domain — medical, legal, casual
  • …and thousands of other subtle features

Each dimension isn't interpretable on its own — meaning emerges from combinations.

Question 2
If embeddings work like the video shows, what would you expect from:

  • Paris - France + Germany = ?
  • walked - walk + swim = ?
  • good - bad + terrible = ?

Answers:

  • Berlin — capital city relationship
  • swam — past tense relationship
  • wonderful or similar — antonym relationship (though this one is trickier)

Code: Exploring Real Embeddings

Python
# Using sentence-transformers for easy embedding access
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Get embeddings for words
words = ["king", "queen", "man", "woman", "prince", "princess"]
embeddings = model.encode(words)

# Check dimensions
print(f"Embedding shape: {embeddings[0].shape}")  # 384 dimensions

# Compute cosine similarity
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# King is more similar to queen than to random words
print(f"king-queen similarity: {cosine_sim(embeddings[0], embeddings[1]):.3f}")
print(f"king-man similarity: {cosine_sim(embeddings[0], embeddings[2]):.3f}")

# Try the analogy: king - man + woman ≈ queen
analogy = embeddings[0] - embeddings[2] + embeddings[3]  # king - man + woman
print(f"Analogy result similarity to queen: {cosine_sim(analogy, embeddings[1]):.3f}")

Key Takeaways

| Concept | What It Means |
|---|---|
| Embedding | A learned vector representing a token |
| Embedding dimension | How many numbers per token (e.g., 768, 12288) |
| Semantic similarity | Close vectors = related meanings |
| Lookup table | Embeddings are just matrix rows indexed by token ID |

Positional Encoding


The Problem

Consider these sentences:

  • "The cat ate the fish"
  • "The fish ate the cat"

Same words, completely different meanings! But if we just use embeddings, the transformer sees the same set of vectors (in different positions). Unlike RNNs that process sequentially, transformers process all tokens in parallel — they have no inherent notion of order.

The Solution: Add Position Information

We need to inject position into each token's representation. The original transformer uses sinusoidal positional encodings:

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Where:

  • pos = position in sequence (0, 1, 2, …)
  • i = index of the sin/cos dimension pair
  • d = embedding dimension

Why not just use position numbers (0, 1, 2, …)?

  • Numbers would dominate the embedding values
  • No natural way to handle sequences longer than training data
  • Sinusoids can extrapolate to unseen positions

Why alternating sin/cos?

  • Allows the model to learn relative positions
  • PE(pos+k) can be represented as a linear function of PE(pos)

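That linearity is easy to check numerically for a single sin/cos pair: PE(pos+k) is a rotation of PE(pos) by a matrix that depends only on the offset k. A minimal check, using one arbitrary example frequency:

```python
import numpy as np

omega = 1 / 10000 ** (6 / 64)   # example frequency: dimension pair i=3 with d=64
p, k = 17, 5                    # arbitrary position and offset

def pe_pair(pos):
    """The (sin, cos) values of one frequency at a given position."""
    return np.array([np.sin(omega * pos), np.cos(omega * pos)])

# Rotation matrix that depends only on the offset k, not the position p
M = np.array([[ np.cos(omega * k), np.sin(omega * k)],
              [-np.sin(omega * k), np.cos(omega * k)]])

print(np.allclose(M @ pe_pair(p), pe_pair(p + k)))  # True
```

Because M depends only on k, "attend to the token k positions away" is a pattern the model can express with a single learned linear map.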
Question 3
Why do we use different frequencies for different dimensions?
If all dimensions used the same frequency, positions 0 and (say) 628 would have nearly identical encodings (since sin repeats every 2π).

By using different frequencies:

  • Fast-cycling dimensions → distinguish nearby positions
  • Slow-cycling dimensions → distinguish far-apart positions

Together, they create a unique fingerprint for each position, even for very long sequences.
Question 4
The video mentions that sinusoidal encodings help with relative positions. Why might this matter for language?
In language, relative position often matters more than absolute:

  • "The big red dog" — adjectives come before nouns (relative)
  • Whether "big" is at position 47 or 203 doesn't change its relationship to "dog"


Sinusoidal encodings let the model learn patterns like "two positions apart" regardless of where they occur in the sequence.

Code: Visualizing Positional Encodings

Python
import numpy as np
import matplotlib.pyplot as plt

def positional_encoding(max_len, d_model):
    """Generate sinusoidal positional encodings."""
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]

    # Different frequencies for different dimensions
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))

    pe[:, 0::2] = np.sin(position * div_term)  # Even dimensions
    pe[:, 1::2] = np.cos(position * div_term)  # Odd dimensions

    return pe

# Generate encodings
pe = positional_encoding(max_len=100, d_model=64)

# Visualize as heatmap
plt.figure(figsize=(12, 4))
plt.imshow(pe.T, aspect='auto', cmap='RdBu')
plt.xlabel('Position')
plt.ylabel('Dimension')
plt.title('Positional Encodings: Each column is a unique fingerprint')
plt.colorbar()
plt.show()

# Show how position 0 vs position 50 differ
print("Position 0 (first 10 dims):", pe[0, :10].round(3))
print("Position 50 (first 10 dims):", pe[50, :10].round(3))

Modern Alternatives

| Method | How It Works | Used By |
|---|---|---|
| Sinusoidal | Fixed sin/cos waves | Original Transformer |
| Learned | Trainable position embeddings | GPT-2, BERT |
| RoPE | Rotary position embedding | LLaMA, GPT-NeoX |
| ALiBi | Attention bias based on distance | BLOOM |

Key Takeaways

| Concept | What It Means |
|---|---|
| Positional encoding | Vector added to embedding to indicate position |
| Sinusoidal | Using sin/cos at different frequencies |
| Unique fingerprint | Each position has a distinct encoding |
| Addition | final_input = embedding + positional_encoding |
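The addition in the last row can be sketched end to end (toy dimensions; random vectors stand in for learned token embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16

# Toy token embeddings for a 6-token sequence (in a real model: rows of E)
token_emb = rng.standard_normal((seq_len, d_model))

# Sinusoidal positional encodings, same formula as above
pos = np.arange(seq_len)[:, None]
div = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(pos * div)
pe[:, 1::2] = np.cos(pos * div)

# The combination is elementwise addition, nothing fancier
final_input = token_emb + pe
print(final_input.shape)  # (6, 16)
```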

Self-Attention


The Problem: Context Matters

Consider the word "bank":

  • "I deposited money at the bank" → financial institution
  • "I sat on the river bank" → edge of water

The same word needs different representations depending on context. Self-attention solves this by letting each token look at all other tokens to build a context-aware representation.

The Core Idea: Queries, Keys, and Values

Think of attention like a search engine:

| Component | Analogy | What It Does |
|---|---|---|
| Query (Q) | Your search query | "What information am I looking for?" |
| Key (K) | Document titles | "What information do I contain?" |
| Value (V) | Document contents | "Here's my actual information" |

Each token generates all three:

  • Its Query: What it's looking for
  • Its Key: What it offers to others
  • Its Value: The information it provides

The Math

Given input X of shape (sequence_length, d_model):

Step 1: Create Q, K, V

Q = X·W_Q,   K = X·W_K,   V = X·W_V

Step 2: Compute attention scores

scores = Q·Kᵀ

Step 3: Scale and normalize

weights = softmax(scores / √d_k)

Step 4: Weighted sum of values

output = weights·V

Question 5
What attention pattern would you expect for the word "it" in: "The cat sat on the mat because it was tired"
"it" should attend strongly to "cat" because:

  • "it" is a pronoun referring back to something
  • "cat" is the subject that can be "tired"
  • "mat" can't be "tired"

This coreference resolution happens naturally through learned attention patterns.

Code: Implementing Self-Attention from Scratch

Python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """
    Single-head self-attention.

    X: (seq_len, d_model) - input embeddings
    W_q, W_k, W_v: (d_model, d_k) - projection matrices
    """
    # Step 1: Project to Q, K, V
    Q = X @ W_q  # (seq_len, d_k)
    K = X @ W_k  # (seq_len, d_k)
    V = X @ W_v  # (seq_len, d_k)

    # Step 2: Compute attention scores
    d_k = K.shape[-1]
    scores = Q @ K.T  # (seq_len, seq_len)

    # Step 3: Scale and softmax
    scaled_scores = scores / np.sqrt(d_k)
    attention_weights = softmax(scaled_scores, axis=-1)

    # Step 4: Weighted sum of values
    output = attention_weights @ V  # (seq_len, d_k)

    return output, attention_weights

# Example: 4 tokens, 8-dimensional embeddings, 4-dimensional attention
np.random.seed(42)
seq_len, d_model, d_k = 4, 8, 4

X = np.random.randn(seq_len, d_model)
W_q = np.random.randn(d_model, d_k) * 0.1
W_k = np.random.randn(d_model, d_k) * 0.1
W_v = np.random.randn(d_model, d_k) * 0.1

output, attn_weights = self_attention(X, W_q, W_k, W_v)

print("Attention weights (each row sums to 1):")
print(attn_weights.round(3))
print("\nRow sums:", attn_weights.sum(axis=1).round(3))

Key Takeaways

| Concept | What It Means |
|---|---|
| Query | What this token is looking for |
| Key | What this token offers to others |
| Value | The actual information to retrieve |
| Attention weight | How much to attend (0 to 1, sums to 1) |
| Context-aware | Output depends on the whole sequence |

Softmax and Attention Scores


The Problem: Raw Scores Are Messy

After computing Q·Kᵀ, we get raw attention scores:

  • Can be any real number (positive or negative)
  • Don't sum to anything meaningful
  • Larger scores = more relevance, but how much more?

We need a way to convert these to proper attention weights:

  • Non-negative (can't have negative attention)
  • Sum to 1 (it's an "attention budget")
  • Higher scores → higher weights

Softmax: The Solution

softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)

Why this works:

  1. Exponential makes everything positive: eˣ > 0 for all x
  2. Division makes them sum to 1: Proper probability distribution
  3. Amplifies differences: Small score differences become large weight differences

Question 6
Given scores [2, 4, 1], compute softmax:

  1. Exponentiate: [e², e⁴, e¹] = [7.39, 54.60, 2.72]
  2. Sum: 7.39 + 54.60 + 2.72 = 64.71
  3. Divide: [7.39/64.71, 54.60/64.71, 2.72/64.71]

Result: [0.11, 0.84, 0.04]

Notice how score 4 (just 2 more than 2) gets 84% of the attention!
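The same computation in NumPy:

```python
import numpy as np

scores = np.array([2.0, 4.0, 1.0])
weights = np.exp(scores) / np.exp(scores).sum()
print(weights.round(2))  # [0.11 0.84 0.04]
```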

The Temperature Parameter

You can control the "sharpness" of softmax with temperature:

| Temperature | Effect | Use Case |
|---|---|---|
| T < 1 (cold) | Sharper, more peaked | More confident, deterministic |
| T = 1 | Standard | Normal operation |
| T > 1 (hot) | Softer, more uniform | More exploration, creativity |

This is the same "temperature" parameter you see in ChatGPT!

Question 7
With scores [2, 4, 1]:

  • T=0.5: What happens?
  • T=2.0: What happens?

| Temperature | Scaled Scores | Result | Effect |
|---|---|---|---|
| T=0.5 (cold) | [4, 8, 2] | [0.02, 0.98, 0.00] | Almost all on highest |
| T=2.0 (hot) | [1, 2, 0.5] | [0.23, 0.63, 0.14] | More evenly distributed |

Key insight:

  • Lower temperature = more confident/focused
  • Higher temperature = more exploratory/uncertain
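A minimal sketch of temperature-scaled softmax: divide the scores by T before the usual softmax.

```python
import numpy as np

def softmax_t(scores, T=1.0):
    z = np.asarray(scores) / T      # temperature divides the scores first
    e = np.exp(z - z.max())         # subtract max for numerical stability
    return e / e.sum()

scores = [2.0, 4.0, 1.0]
print(softmax_t(scores, T=0.5).round(2))  # cold: almost all weight on the top score
print(softmax_t(scores, T=1.0).round(2))  # standard softmax
print(softmax_t(scores, T=2.0).round(2))  # hot: weight spread more evenly
```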

The Scaling Factor: √dk

Why do we divide by √dk before softmax?

The problem: Dot products get larger as dimension increases.

  • Q and K are dk-dimensional vectors
  • Their dot product is sum of dk terms
  • Variance grows with dk

Large dot products → extreme softmax:

  • Scores like [50, 52, 48] → softmax ≈ [0.12, 0.87, 0.02]
  • Scores like [500, 520, 480] → softmax ≈ [0.0, 1.0, 0.0]

The model gets overconfident and can't learn from gradients!

Solution: Divide by √dk to normalize variance back to ~1.
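A quick numerical check of that claim: for random vectors with unit-variance entries, the dot product's standard deviation grows like √dk, and dividing by √dk brings it back to roughly 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n = 512, 10000

# Dot products of n pairs of random unit-variance d_k-dimensional vectors
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))
dots = (q * k).sum(axis=1)

print(f"raw std:    {dots.std():.1f}")                   # grows like sqrt(512) ~ 22.6
print(f"scaled std: {(dots / np.sqrt(d_k)).std():.1f}")  # back to ~1
```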

Key Takeaways

| Concept | What It Means |
|---|---|
| Softmax | Converts scores to probability distribution |
| Temperature | Controls sharpness (low=focused, high=spread) |
| √dk scaling | Prevents extreme softmax from large dot products |
| Attention fading | Longer sequences → diluted attention |

Multi-Head Attention


The Problem: One Head Isn't Enough

A single attention head can only focus on one type of relationship at a time. But language has many simultaneous relationships:

  • Syntactic: subject-verb agreement
  • Semantic: word meaning in context
  • Coreference: what "it" refers to
  • Positional: nearby words

The Solution: Multiple Heads in Parallel

Instead of one big attention operation, run several smaller ones:

MultiHead(X) = Concat(head₁, …, head_h)·W_O

Where each head is:

head_i = Attention(X·W_Q^(i), X·W_K^(i), X·W_V^(i))
How It Works

  1. Split the embedding into h heads (e.g., 768 dimensions → 12 heads of 64 each)
  2. Compute attention independently in each head
  3. Concatenate the results back together
  4. Project through a final linear layer

Each head learns to focus on different types of relationships!
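The split-attend-concat recipe, sketched in NumPy for a single unbatched sequence (random weights stand in for learned ones; no masking):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Multi-head self-attention via reshape: split, attend per head, concat, project."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    # Project, then split the last dimension into (n_heads, d_head)
    def split(W):
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(W_q), split(W_k), split(W_v)   # each (n_heads, seq_len, d_head)

    # Attention independently in each head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq_len, seq_len)
    out = softmax(scores) @ V                             # (n_heads, seq_len, d_head)

    # Concatenate heads back together, then the output projection
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 16, 4
X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads).shape)  # (4, 16)
```

Note the cost trick: because each head works in a d_model/h subspace, h heads together cost about the same as one full-width head.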

What Different Heads Learn

Research has shown that heads specialize:

| Head Type | What It Attends To | Example |
|---|---|---|
| Positional | Previous/next tokens | "The [cat]" → "cat" attends to "The" |
| Syntactic | Subject-verb pairs | "The cats [run]" → "run" attends to "cats" |
| Semantic | Related concepts | "bank [money]" → "money" attends to "bank" |
| Coreference | Pronouns to nouns | "[it] was tired" → "it" attends to "cat" |

Question 8
If we have the same total parameters, why use 12 heads of 64 dimensions instead of 1 head of 768 dimensions?
Capacity for diverse patterns:
A single head learns one attention pattern. Multiple heads can learn different patterns simultaneously.

Geometric intuition:
Each head operates in a smaller subspace. Different subspaces can capture different types of relationships.

Regularization:
Multiple smaller heads are harder to overfit than one large head.

Think of it like having a team of specialists vs. one generalist.
Question 9
What are the trade-offs of using more heads (e.g., 64 heads of 12 dims vs. 12 heads of 64 dims)?

| Configuration | Pros | Cons |
|---|---|---|
| More heads, smaller dims | More diverse patterns | Each head has less capacity; may miss complex relationships |
| Fewer heads, larger dims | Each head has more capacity | Fewer distinct patterns; may learn redundant patterns |

The sweet spot depends on the task. 8–16 heads is common in practice.

Key Takeaways

| Concept | What It Means |
|---|---|
| Multi-head | Multiple attention operations in parallel |
| Head specialization | Different heads learn different patterns |
| Split-attend-concat | Divide embedding, attend separately, combine |
| Output projection | Final linear layer after concatenation |

The Transformer Block


The Full Architecture

Text
Input
  │
  ├──────────────────┐
  │                  │
  ▼                  │
Layer Norm           │
  │                  │
  ▼                  │
Multi-Head Attention │
  │                  │
  ▼                  │
  + ◄────────────────┘  (Residual connection)
  │
  ├──────────────────┐
  │                  │
  ▼                  │
Layer Norm           │
  │                  │
  ▼                  │
Feed-Forward Network │
  │                  │
  ▼                  │
  + ◄────────────────┘  (Residual connection)
  │
  ▼
Output
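
The diagram can be sketched in NumPy. This is a minimal pre-norm block: the attention sublayer is a stand-in linear map (any function of the right shape works here), and LayerNorm omits the learned scale and shift.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean, unit variance (no learned scale/shift)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: expand, ReLU, project back."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, attention, ffn_params):
    """Pre-norm block from the diagram: x + Attn(LN(x)), then x + FFN(LN(x))."""
    x = x + attention(layer_norm(x))          # residual around attention
    x = x + ffn(layer_norm(x), *ffn_params)   # residual around FFN
    return x

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32
x = rng.standard_normal((seq_len, d_model))

# Stand-in attention: any (seq_len, d_model) -> (seq_len, d_model) map works here
W_attn = rng.standard_normal((d_model, d_model)) * 0.1
attention = lambda h: h @ W_attn

ffn_params = (rng.standard_normal((d_model, d_ff)) * 0.1, np.zeros(d_ff),
              rng.standard_normal((d_ff, d_model)) * 0.1, np.zeros(d_model))

out = transformer_block(x, attention, ffn_params)
print(out.shape)  # (4, 8)
```

Stacking many such blocks, with real multi-head attention in place of the stand-in, gives the body of a GPT-style model; the residual connections are what let gradients flow through deep stacks.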

Summary: What Each Component Does

| Component | Video | Function |
|---|---|---|
| Token Embeddings | 1 | Convert tokens to semantic vectors |
| Positional Encoding | 2 | Inject sequence position information |
| Self-Attention | 3 | Let tokens gather context from each other |
| Softmax & Scaling | 4 | Convert scores to attention probabilities |
| Multi-Head Attention | 5 | Learn diverse relationship patterns in parallel |
| Transformer Block | 6 | Combine attention + FFN with residuals and norms |

Next Steps

Now that you understand the fundamentals:

  1. Implement a mini-transformer from scratch in PyTorch
  2. Explore attention visualizations with BertViz or similar tools
  3. Fine-tune a pretrained model on a task you care about
  4. Read "Attention Is All You Need" — it'll make much more sense now!
  5. Explore variations: RoPE, ALiBi, Flash Attention, sparse attention