Token Embeddings
The Problem
Computers don't understand words — they understand numbers. But how do we convert "cat" into something a neural network can process?
Naive approach: Assign each word a number (cat=1, dog=2, …)
- Problem: This implies "cat" and "dog" are as different as "cat" and "quantum physics"
- We lose all semantic meaning
Better approach: Represent each word as a vector in high-dimensional space where:
- Similar words are close together
- Relationships are encoded as directions
Key Insight: The King–Queen Analogy
The famous example from the video:

king − man + woman ≈ queen

This works because embeddings capture relationships:
- The vector from "man" to "woman" represents a gender direction
- Apply that same direction to "king" and you get "queen"
The Math
For a vocabulary of V words and embedding dimension d, the embedding layer is a matrix:

E ∈ ℝ^(V × d)

To get the embedding for token i:

eᵢ = E[i]

It's just a lookup table! Each row is a learnable d-dimensional vector.
With 12,288 dimensions, the model can represent:
- Semantic meaning — what the word means
- Syntactic role — noun, verb, adjective
- Sentiment — positive, negative
- Domain — medical, legal, casual

…and thousands of other subtle features.
Each dimension isn't interpretable on its own — meaning emerges from combinations.
Try these analogies:
- Paris - France + Germany = ? → Berlin — capital-city relationship
- walked - walk + swim = ? → swam — past-tense relationship
- good - bad + terrible = ? → wonderful or similar — antonym relationship (though this one is trickier)
Code: Exploring Real Embeddings
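A real exploration would load pretrained vectors (for example via gensim's downloader), but the sketch below uses a small random matrix so it runs offline. The vocabulary and the `embed` and `cosine` helpers are illustrative names, and random vectors won't actually place "queen" nearest the analogy vector the way trained embeddings do — the point is the mechanics of lookup and similarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and embedding matrix: one row per token.
vocab = ["cat", "dog", "king", "queen", "man", "woman"]
token_to_id = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 8            # vocabulary size, embedding dimension
E = rng.normal(size=(V, d))     # in a real model, these rows are learned

def embed(word):
    """Embedding lookup is just row indexing by token ID."""
    return E[token_to_id[word]]

def cosine(a, b):
    """Cosine similarity: close to 1 means similar direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The famous analogy: king - man + woman should land near "queen"
# (only with trained embeddings; random vectors won't show this).
analogy = embed("king") - embed("man") + embed("woman")
sims = {w: cosine(analogy, embed(w)) for w in vocab}
print(sorted(sims.items(), key=lambda kv: -kv[1]))
```

Swapping `E` for pretrained vectors (GloVe, word2vec) makes the analogy actually work; the lookup and similarity code stays the same.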
Key Takeaways
| Concept | What It Means |
|---|---|
| Embedding | A learned vector representing a token |
| Embedding dimension | How many numbers per token (e.g., 768, 12288) |
| Semantic similarity | Close vectors = related meanings |
| Lookup table | Embeddings are just matrix rows indexed by token ID |
Positional Encoding
The Problem
Consider these sentences:
- "The cat ate the fish"
- "The fish ate the cat"
Same words, completely different meanings! But if we just use embeddings, the transformer sees the same set of vectors (in different positions). Unlike RNNs that process sequentially, transformers process all tokens in parallel — they have no inherent notion of order.
The Solution: Add Position Information
We need to inject position into each token's representation. The original transformer uses sinusoidal positional encodings:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Where:
- pos = position in the sequence (0, 1, 2, …)
- i = dimension index
- d = embedding dimension
Why not just use position numbers (0, 1, 2, …)?
- Numbers would dominate the embedding values
- No natural way to handle sequences longer than training data
- Sinusoids can extrapolate to unseen positions
Why alternating sin/cos?
- Allows the model to learn relative positions
- PE(pos+k) can be represented as a linear function of PE(pos)
By using different frequencies:
- Fast-cycling dimensions → distinguish nearby positions
- Slow-cycling dimensions → distinguish far-apart positions
Together, they create a unique fingerprint for each position, even for very long sequences.
Why relative position matters:
- "The big red dog" — adjectives come before nouns (a relative relationship)
- Whether "big" is at position 47 or 203 doesn't change its relationship to "dog"

Sinusoidal encodings let the model learn patterns like "two positions apart" regardless of where they occur in the sequence.
Code: Visualizing Positional Encodings
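A minimal sketch of the encoding matrix (the `sinusoidal_pe` helper is an assumed name; plotting is left as a comment so the snippet needs nothing beyond NumPy):

```python
import numpy as np

def sinusoidal_pe(max_len, d):
    """Sinusoidal positional encodings from 'Attention Is All You Need'.
    Even dimensions use sin, odd dimensions use cos, at geometrically
    decreasing frequencies (d must be even here for simplicity)."""
    pos = np.arange(max_len)[:, None]     # (max_len, 1)
    i = np.arange(0, d, 2)[None, :]       # (1, d/2) — pairs of dims
    angles = pos / (10000 ** (i / d))     # (max_len, d/2)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(max_len=100, d=64)
print(pe.shape)  # (100, 64): each row is one position's fingerprint

# To visualize the stripes of different frequencies (requires matplotlib):
#   import matplotlib.pyplot as plt
#   plt.imshow(pe, aspect="auto", cmap="RdBu")
#   plt.xlabel("dimension"); plt.ylabel("position"); plt.show()
```

In the heatmap, the left columns cycle quickly (distinguishing neighbors) while the right columns change slowly (distinguishing distant positions).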
Modern Alternatives
| Method | How It Works | Used By |
|---|---|---|
| Sinusoidal | Fixed sin/cos waves | Original Transformer |
| Learned | Trainable position embeddings | GPT-2, BERT |
| RoPE | Rotary position embedding | LLaMA, GPT-NeoX |
| ALiBi | Attention bias based on distance | BLOOM |
Key Takeaways
| Concept | What it means |
|---|---|
| Positional encoding | Vector added to embedding to indicate position |
| Sinusoidal | Using sin/cos at different frequencies |
| Unique fingerprint | Each position has a distinct encoding |
| Addition | final_input = embedding + positional_encoding |
Self-Attention
The Problem: Context Matters
Consider the word "bank":
- "I deposited money at the bank" → financial institution
- "I sat on the river bank" → edge of water
The same word needs different representations depending on context. Self-attention solves this by letting each token look at all other tokens to build a context-aware representation.
The Core Idea: Questions, Keys, and Values
Think of attention like a search engine:
| Component | Analogy | What It Does |
|---|---|---|
| Query (Q) | Your search query | "What information am I looking for?" |
| Key (K) | Document titles | "What information do I contain?" |
| Value (V) | Document contents | "Here's my actual information" |
Each token generates all three:
- Its Query: What it's looking for
- Its Key: What it offers to others
- Its Value: The information it provides
The Math
Given input X of shape (sequence_length, d_model):

Step 1: Create Q, K, V

Q = X·Wq    K = X·Wk    V = X·Wv

Step 2: Compute attention scores

scores = Q·Kᵀ

Step 3: Scale and normalize

weights = softmax(scores / √dₖ)

Step 4: Weighted sum of values

output = weights·V
Consider the sentence "The cat sat on the mat because it was tired":
- "it" is a pronoun referring back to something
- "cat" is the subject that can be "tired"
- "mat" can't be "tired"

This coreference resolution happens naturally through learned attention patterns.
Code: Implementing Self-Attention from Scratch
Key Takeaways
| Concept | What It Means |
|---|---|
| Query | What this token is looking for |
| Key | What this token offers to others |
| Value | The actual information to retrieve |
| Attention weight | How much to attend (0 to 1, sums to 1) |
| Context-aware | Output depends on the whole sequence |
Softmax and Attention Scores
The Problem: Raw Scores Are Messy
After computing QKᵀ, we get raw attention scores:
- Can be any real number (positive or negative)
- Don't sum to anything meaningful
- Larger scores = more relevance, but how much more?
We need a way to convert these to proper attention weights:
- Non-negative (can't have negative attention)
- Sum to 1 (it's an "attention budget")
- Higher scores → higher weights
Softmax: The Solution

softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)

Why this works:
- Exponentiation makes everything positive: e^x > 0 for all x
- Division makes the weights sum to 1: a proper probability distribution
- It amplifies differences: small score differences become large weight differences
Worked example with scores [2, 4, 1]:

Exponentiate: [e², e⁴, e¹] = [7.39, 54.60, 2.72]
Sum: 7.39 + 54.60 + 2.72 = 64.71
Divide: [7.39/64.71, 54.60/64.71, 2.72/64.71]
Result: [0.11, 0.84, 0.04]
Notice how the score of 4 (only 2 higher than the score of 2) gets 84% of the attention!
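The same arithmetic in a few lines of NumPy (subtracting the max before exponentiating is a standard numerical-stability trick that doesn't change the result):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # stable: shifts scores, same output
    return e / e.sum()

scores = np.array([2.0, 4.0, 1.0])
print(np.round(softmax(scores), 2))   # [0.11 0.84 0.04]
```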
The Temperature Parameter
You can control the "sharpness" of softmax with a temperature T, dividing the scores before exponentiating:

softmax(xᵢ / T) = e^(xᵢ/T) / Σⱼ e^(xⱼ/T)
| Temperature | Effect | Use Case |
|---|---|---|
| T < 1 (cold) | Sharper, more peaked | More confident, deterministic |
| T = 1 | Standard | Normal operation |
| T > 1 (hot) | Softer, more uniform | More exploration, creativity |
This is the same "temperature" parameter you see in ChatGPT!
Using the scores [2, 4, 1] from above:
- T = 0.5: weights sharpen to roughly [0.02, 0.98, 0.00]
- T = 2.0: weights soften to roughly [0.23, 0.63, 0.14]

Lower temperature = more confident/focused
Higher temperature = more exploratory/uncertain
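A quick sketch of temperature's effect on the same scores (the helper name is illustrative):

```python
import numpy as np

def softmax_with_temperature(x, T=1.0):
    """Divide scores by T before softmax: T<1 sharpens, T>1 flattens."""
    z = np.asarray(x, dtype=float) / T
    e = np.exp(z - z.max())   # max-shift for numerical stability
    return e / e.sum()

scores = [2.0, 4.0, 1.0]
for T in (0.5, 1.0, 2.0):
    print(f"T={T}: {np.round(softmax_with_temperature(scores, T), 2)}")
```

The winning score keeps winning at every temperature; only the margin of its victory changes.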
The Scaling Factor: √dₖ
Why do we divide by √dₖ before softmax?
The problem: Dot products get larger as dimension increases.
- Q and K are dₖ-dimensional vectors
- Their dot product is a sum of dₖ terms
- Variance grows with dₖ
Large dot products → extreme softmax:
- Scores like [50, 52, 48] → softmax ≈ [0.1, 0.8, 0.1]
- Scores like [500, 520, 480] → softmax ≈ [0.0, 1.0, 0.0]
The model becomes overconfident, and the nearly one-hot softmax produces vanishing gradients, so it can't learn.
Solution: Divide by √dₖ to normalize the variance of the scores back to ~1.
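A small experiment confirming the variance argument: dot products of random unit-variance vectors have variance close to dₖ, and dividing by √dₖ brings it back to roughly 1 (these are sampled estimates, so expect small fluctuations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Each term q_i * k_i has variance 1, and a dot product sums d_k such
# terms, so var(q · k) ≈ d_k. Scaling by 1/sqrt(d_k) restores var ≈ 1.
for d_k in (4, 64, 1024):
    q = rng.normal(size=(10000, d_k))
    k = rng.normal(size=(10000, d_k))
    dots = (q * k).sum(axis=1)            # 10,000 sample dot products
    print(f"d_k={d_k:5d}  var(q.k)={dots.var():8.1f}  "
          f"var(q.k/sqrt(d_k))={(dots / np.sqrt(d_k)).var():.2f}")
```

Without the scaling, scores at dₖ = 1024 are ~16× larger in spread than at dₖ = 4, which is exactly the regime where softmax saturates.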
Key Takeaways
| Concept | What It Means |
|---|---|
| Softmax | Converts scores to probability distribution |
| Temperature | Controls sharpness (low=focused, high=spread) |
| √dₖ scaling | Prevents extreme softmax from large dot products |
| Attention fading | Longer sequences → diluted attention |
Multi-Head Attention
The Problem: One Head Isn't Enough
A single attention head can only focus on one type of relationship at a time. But language has many simultaneous relationships:
- Syntactic: subject-verb agreement
- Semantic: word meaning in context
- Coreference: what "it" refers to
- Positional: nearby words
The Solution: Multiple Heads in Parallel
Instead of one big attention operation, run several smaller ones in parallel:

MultiHead(Q, K, V) = Concat(head_1, …, head_h)·W_O

Where each head is:

head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
How It Works
- Split the embedding into h heads (e.g., 768 dimensions → 12 heads of 64 each)
- Compute attention independently in each head
- Concatenate the results back together
- Project through a final linear layer
Each head learns to focus on different types of relationships!
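The split-attend-concat-project pipeline can be sketched as a NumPy toy with random weights (`multi_head_attention` and the `params` layout are illustrative names; real implementations fuse the per-head projections into single large matrices for speed):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, params):
    """Split -> attend per head -> concat -> output projection.
    X: (seq_len, d_model)."""
    heads = []
    for Wq, Wk, Wv in params["heads"]:           # one projection set per head
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        w = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(w @ V)                      # (seq_len, d_head)
    concat = np.concatenate(heads, axis=-1)      # (seq_len, d_model)
    return concat @ params["W_o"]                # final linear projection

seq_len, d_model, n_heads = 6, 32, 4
d_head = d_model // n_heads        # e.g. 768 dims -> 12 heads of 64
params = {
    "heads": [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
              for _ in range(n_heads)],
    "W_o": rng.normal(size=(d_model, d_model)),
}
X = rng.normal(size=(seq_len, d_model))
out = multi_head_attention(X, params)
print(out.shape)   # (6, 32)
```

Each head sees the full sequence but projects it into its own d_head-dimensional subspace, which is what lets different heads specialize in different relationship types.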
What Different Heads Learn
Research has shown that heads specialize:
| Head Type | What It Attends To | Example |
|---|---|---|
| Positional | Previous/next tokens | "The [cat]" → "cat" attends to "The" |
| Syntactic | Subject-verb pairs | "The cats [run]" → "run" attends to "cats" |
| Semantic | Related concepts | "bank [money]" → "money" attends to "bank" |
| Coreference | Pronouns to nouns | "[it] was tired" → "it" attends to "cat" |
A single head learns one attention pattern. Multiple heads can learn different patterns simultaneously.
Geometric intuition:
Each head operates in a smaller subspace. Different subspaces can capture different types of relationships.
Regularization:
Multiple smaller heads are harder to overfit than one large head.
Think of it like having a team of specialists vs. one generalist.
Key Takeaways
| Concept | What It Means |
|---|---|
| Multi-head | Multiple attention operations in parallel |
| Head specialization | Different heads learn different patterns |
| Split-attend-concat | Divide embedding, attend separately, combine |
| Output projection | Final linear layer after concatenation |
The Transformer Block
The Full Architecture
A transformer block chains two sublayers: multi-head self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection with layer normalization. Stacking many of these blocks produces the full model.
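A toy forward pass through one block, combining attention and an FFN with residuals and norms. This is a sketch under stated assumptions: a pre-norm variant (the original paper used post-norm), a single attention head, no learnable LayerNorm parameters, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    """Normalize each token's vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def transformer_block(X, p):
    """Pre-norm block: each sublayer reads a normalized copy of its
    input and its output is added back via a residual connection."""
    X = X + attention(layer_norm(X), p["Wq"], p["Wk"], p["Wv"])
    h = np.maximum(0, layer_norm(X) @ p["W1"])   # FFN hidden layer, ReLU
    return X + h @ p["W2"]                        # second residual

d_model, d_ff, seq_len = 16, 64, 5
p = {name: rng.normal(size=shape) for name, shape in {
    "Wq": (d_model, d_model), "Wk": (d_model, d_model),
    "Wv": (d_model, d_model),
    "W1": (d_model, d_ff), "W2": (d_ff, d_model)}.items()}
X = rng.normal(size=(seq_len, d_model))
out = transformer_block(X, p)
print(out.shape)   # (5, 16): same shape in and out, so blocks stack
```

Because input and output shapes match, dozens of these blocks can be stacked, which is exactly how GPT-style models are built.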
Summary: What Each Component Does
| Component | Video | Function |
|---|---|---|
| Token Embeddings | 1 | Convert tokens to semantic vectors |
| Positional Encoding | 2 | Inject sequence position information |
| Self-Attention | 3 | Let tokens gather context from each other |
| Softmax & Scaling | 4 | Convert scores to attention probabilities |
| Multi-Head Attention | 5 | Learn diverse relationship patterns in parallel |
| Transformer Block | 6 | Combine attention + FFN with residuals and norms |
Next Steps
Now that you understand the fundamentals:
- Implement a mini-transformer from scratch in PyTorch
- Explore attention visualizations with BertViz or similar tools
- Fine-tune a pretrained model on a task you care about
- Read "Attention Is All You Need" — it'll make much more sense now!
- Explore variations: RoPE, ALiBi, Flash Attention, sparse attention
