Tags: transformers, attention, deep-learning, nlp, machine-learning

Transformer Deep Dive: Part 1 - The Original Transformer (2017)

A deep dive into the original Transformer architecture from 'Attention Is All You Need' - the encoder-decoder structure, scaled dot-product attention, multi-head attention, and the design decisions that revolutionized NLP.


Suchinthaka W.

January 15, 2025 · 5 min read

This is the first post in my series exploring the Transformer architecture, from the original 2017 paper to modern LLMs like GPT-4 and LLaMA. We'll start with the canonical Transformer v1, designed for machine translation.

Reference: Vaswani et al., "Attention Is All You Need" (NeurIPS 2017)

Overall Architecture

The original Transformer has an Encoder-Decoder structure:

Input sentence (source) → Encoder stack (N=6) → Decoder stack (N=6) → Output sentence (target)

Key Properties

  • Encoder reads the source language (e.g., English)
  • Decoder generates the target language (e.g., German)
  • No recurrence (no RNN, no LSTM)
  • No convolution
  • Relies entirely on attention mechanisms

Why Replace RNNs?

| RNN | Transformer |
|-----|-------------|
| Sequential processing (slow) | Parallel processing (fast) |
| Long-range gradient issues | Direct connections |
| Hard to parallelize | GPU-friendly |

Scaled Dot-Product Attention

The core innovation is the attention formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Step-by-Step

  1. Compute scores: $E = QK^\top$
  2. Scale: $S = E / \sqrt{d_k}$
  3. Softmax: $A = \text{softmax}(S)$ (row-wise)
  4. Weighted sum: $O = AV$
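
To make the four steps concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper; names like `scaled_dot_product_attention` are placeholders):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax, applied row-wise.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    E = Q @ K.swapaxes(-2, -1)   # 1. scores
    S = E / np.sqrt(d_k)         # 2. scale
    A = softmax(S, axis=-1)      # 3. row-wise softmax
    return A @ V                 # 4. weighted sum

# Toy example: 4 query positions, 6 key/value positions, d_k = d_v = 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((4, 64)), rng.standard_normal((6, 64)), rng.standard_normal((6, 64))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 64)
```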

Why Scale by $\sqrt{d_k}$?

This is one of the most critical design decisions. When $d_k$ is large, dot products grow large in magnitude, pushing the softmax into saturation regions with tiny gradients.

The Math: Assume the components of $q$ and $k$ have mean 0 and variance 1 (standard initialization). The dot product is a sum of $d_k$ independent products, each with mean 0 and variance 1, so its variance is:

$$\text{Var}(q \cdot k) = d_k$$

When $d_k = 64$, the standard deviation is 8, meaning dot products can easily be ±16 or larger. This causes the softmax to produce near-one-hot distributions:

| Without Scaling | With Scaling |
|-----------------|--------------|
| Scores: [-20, 15, 18, -12] | Scores: [-2.5, 1.9, 2.3, -1.5] |
| Softmax: [0.00, 0.05, 0.95, 0.00] | Softmax: [0.02, 0.17, 0.72, 0.09] |
| Gradients: nearly zero | Gradients: flow to all positions |

Dividing by $\sqrt{d_k}$ normalizes the variance back to 1, keeping the softmax in a healthy gradient region.
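
A quick numerical sanity check of this variance argument (an illustrative sketch, not from the paper):

```python
import numpy as np

d_k = 64
rng = np.random.default_rng(0)

# Components of q and k drawn with mean 0, variance 1.
q = rng.standard_normal((100_000, d_k))
k = rng.standard_normal((100_000, d_k))

dots = (q * k).sum(axis=1)
print(dots.var())                   # ≈ 64, i.e. d_k
print((dots / np.sqrt(d_k)).var())  # ≈ 1 after scaling
```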

Multi-Head Attention

Instead of performing a single attention function, we project queries, keys, and values h times with different learned projections:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O$$

where each head is:

$$\text{head}_i = \text{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$

| Configuration | Base Model | Big Model |
|---------------|------------|-----------|
| Number of heads ($h$) | 8 | 16 |
| $d_k = d_v$ | 64 | 64 |
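
A self-contained NumPy sketch with the base-model settings (illustrative only; slicing the full $d_{model} \times d_{model}$ projections into $d_k$-wide blocks is equivalent to using separate $W_i^Q, W_i^K, W_i^V$ per head):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, params, h=8):
    """params holds W_q, W_k, W_v, W_o, each of shape (d_model, d_model)."""
    d_model = Q.shape[-1]
    d_k = d_model // h                        # 512 / 8 = 64 in the base model
    heads = []
    for i in range(h):
        cols = slice(i * d_k, (i + 1) * d_k)  # this head's slice of the projections
        q_i = Q @ params["W_q"][:, cols]
        k_i = K @ params["W_k"][:, cols]
        v_i = V @ params["W_v"][:, cols]
        A = softmax(q_i @ k_i.T / np.sqrt(d_k), axis=-1)
        heads.append(A @ v_i)
    return np.concatenate(heads, axis=-1) @ params["W_o"]

# Toy usage: one sentence of 10 tokens, d_model = 512
d_model, n = 512, 10
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d_model))
params = {name: rng.standard_normal((d_model, d_model)) * 0.02
          for name in ("W_q", "W_k", "W_v", "W_o")}
print(multi_head_attention(x, x, x, params).shape)  # (10, 512)
```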

Three Types of Attention

1. Encoder Self-Attention

  • Q, K, V all from encoder input
  • Full attention: every position attends to all positions
  • No masking

2. Masked Decoder Self-Attention

  • Q, K, V all from decoder input
  • Causal mask: position i can only attend to positions ≤ i
  • Prevents looking at future tokens (see the mask sketch after this list)

3. Encoder-Decoder (Cross) Attention

  • Q from decoder
  • K, V from encoder output
  • Decoder can attend to entire source sentence
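
Mechanically, the only difference between these three variants is where Q, K, and V come from and whether a causal mask is applied before the softmax. A minimal sketch of that mask (my own illustration):

```python
import numpy as np

n = 5                                   # sequence length
rng = np.random.default_rng(0)
scores = rng.standard_normal((n, n))    # stand-in for QK^T / sqrt(d_k)

# Causal mask: position i may attend only to positions j <= i.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
scores = np.where(mask, -np.inf, scores)          # future positions -> -inf

# After softmax, masked positions receive weight exactly 0.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # lower-triangular attention pattern
```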

The Encoder Layer

Each encoder layer has two sublayers with Post-Layer Normalization:

Input x
    ↓
Multi-Head Self-Attention
    ↓
Dropout
    ↓
Add (Residual): x + output
    ↓
LayerNorm          ← Post-LN
    ↓
Feed Forward Network
    ↓
Dropout
    ↓
Add (Residual)
    ↓
LayerNorm          ← Post-LN
    ↓
Output

Critical: In the original Transformer, LayerNorm comes AFTER the residual addition:

$$\text{Output} = \text{LayerNorm}(x + \text{Sublayer}(x))$$

This is a key difference from modern LLMs, which use Pre-LN (we'll cover this in Part 2).
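
Putting the diagram into code, one Post-LN encoder layer looks roughly like this (a schematic sketch with dropout and the learned LayerNorm gain/bias omitted; `self_attn` and `ffn` stand in for the actual sublayers):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature dimension (learned scale/shift omitted for brevity).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_encoder_layer(x, self_attn, ffn):
    # Sublayer 1: self-attention -> residual add -> LayerNorm (Post-LN).
    x = layer_norm(x + self_attn(x))
    # Sublayer 2: feed-forward -> residual add -> LayerNorm (Post-LN).
    x = layer_norm(x + ffn(x))
    return x
```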

Feed Forward Network

Position-wise FFN applied identically to every token:

$$\text{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2$$

| Property | Value |
|----------|-------|
| Hidden dimension ($d_{ff}$) | 2048 |
| Model dimension ($d_{model}$) | 512 |
| Activation | ReLU (not GELU) |

Dimensions: 512 → 2048 → 512
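
A NumPy sketch of the position-wise FFN with the base-model dimensions (the weights here are random placeholders, not trained parameters):

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

def ffn(x):
    # Applied to each position independently: 512 -> 2048 -> ReLU -> 512.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = rng.standard_normal((10, d_model))  # 10 tokens
print(ffn(x).shape)                     # (10, 512)
```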

Positional Encoding

Unlike RNNs, Transformers process all tokens in parallel. Without positional information, "The cat sat on the mat" and "The mat sat on the cat" would be indistinguishable.

The original uses fixed sinusoidal encodings:

$$PE_{(pos,\, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos,\, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Why Sinusoids?

  • Can extrapolate to longer sequences than seen during training
  • Relative positions representable as linear functions
  • Different frequencies capture different position scales
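
A sketch of the sinusoidal table, following the formulas above (illustrative; `max_len=100` is an arbitrary choice):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)  # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=100, d_model=512)
print(pe.shape)  # (100, 512); added to the token embeddings
```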

Training Configuration

| Setting | Value |
|---------|-------|
| Optimizer | Adam ($\beta_1 = 0.9$, $\beta_2 = 0.98$) |
| LR schedule | Warmup + inverse square root decay |
| Warmup steps | 4000 |
| Label smoothing | 0.1 |
| Dropout | 0.1 |
| Layers | 6 encoder + 6 decoder |
| Parameters (base) | ~65M |
| Parameters (big) | ~213M |
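
The schedule follows the paper's formula $lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5},\; step \cdot warmup^{-1.5})$: a linear warmup for the first 4000 steps, then inverse square root decay. A small sketch:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warmup for `warmup_steps`, then decay proportional to 1/sqrt(step).
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (100, 4000, 40000):
    print(s, round(transformer_lr(s), 6))  # peaks at step 4000, then decays
```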

Summary

| Component | Original Design |
|-----------|-----------------|
| Architecture | Encoder-Decoder |
| Layers | 6 + 6 |
| LayerNorm | Post-LN (after residual) |
| Attention | Full self-attention |
| FFN Activation | ReLU |
| Positional Encoding | Sinusoidal (fixed) |
| Use Case | Translation |


In the next post, we'll explore Part 2: Architecture Changes - how modern LLMs evolved from this original design with decoder-only architectures, Pre-Layer Normalization, and RMSNorm.
