Transformer Deep Dive: Part 1 - The Original Transformer (2017)
This is the first post in my series exploring the Transformer architecture, tracing its evolution from the original 2017 paper all the way to modern LLMs like GPT-4 and LLaMA. We start with the canonical Transformer v1 from Vaswani et al., originally designed for machine translation but destined to reshape the entire field of deep learning.
Before the Transformer, sequence modeling was dominated by recurrent networks (RNNs, LSTMs, GRUs). These models processed tokens one at a time, creating a fundamental bottleneck: long-range dependencies were difficult to learn, and the sequential nature of recurrence made parallelization impossible. The Transformer eliminated recurrence entirely, replacing it with a mechanism called self-attention that can relate any two positions in a sequence in a single computational step.
Reference: Vaswani et al., "Attention Is All You Need" (NeurIPS 2017)
Overall Architecture
The original Transformer follows an encoder-decoder design, where the encoder reads the input sequence (e.g., an English sentence) and the decoder generates the output sequence (e.g., a German translation) one token at a time.
Both the encoder and decoder consist of a stack of identical layers. Each encoder layer contains two sublayers: a multi-head self-attention mechanism and a position-wise feed-forward network. Each decoder layer adds a third sublayer: cross-attention over the encoder output. Every sublayer is wrapped with a residual connection followed by layer normalization.
Why Replace RNNs?
The key motivation behind the Transformer was overcoming three critical limitations of recurrent architectures:
| Aspect | RNN / LSTM | Transformer |
|---|---|---|
| Processing | Sequential, token-by-token | Fully parallel across sequence |
| Long-range dependencies | Gradients vanish over long distances | Direct connections via attention |
| GPU utilization | Poor parallelism due to recurrence | Matrix multiplications are GPU-native |
| Training time | Slow (cannot parallelize time steps) | Fast (scales with hardware) |
The Transformer achieved state-of-the-art translation quality while training in a fraction of the time required by the best recurrent models.
Scaled Dot-Product Attention
At the heart of the Transformer is the scaled dot-product attention mechanism. Every input token is projected into three vectors: a Query (Q), a Key (K), and a Value (V). The attention function then computes a weighted sum of the values, where the weight assigned to each value is determined by the compatibility between the query and the corresponding key.
The formula is deceptively simple:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Step-by-Step Computation
Given a sequence of $n$ tokens, each projected into $d_k$-dimensional queries and keys, and $d_v$-dimensional values:
- Compute attention scores: Multiply queries by keys ($QK^\top$) to get an $n \times n$ score matrix measuring how much each token should attend to every other token.
- Scale: Divide by $\sqrt{d_k}$ to control the magnitude of the scores (explained below).
- Softmax: Apply softmax row-wise to convert scores into a probability distribution. Each row sums to 1.
- Weighted sum: Multiply the attention weights by the values. Each output token is a weighted combination of all value vectors.
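The four steps above can be sketched in a few lines of NumPy. This is an illustrative toy implementation, not the paper's code; the shapes and seed are arbitrary:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n, d_k); V: (n, d_v). Returns (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) compatibility matrix
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted combination of value vectors

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 64))
K = rng.standard_normal((5, 64))
V = rng.standard_normal((5, 32))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 32)
```

Note that the entire computation is three matrix multiplications and a softmax, which is exactly why it maps so well onto GPUs.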
Why Scale by $\sqrt{d_k}$?
This scaling factor is one of the most critical design decisions in the Transformer, and its importance is often underappreciated.
When $d_k$ is large, the dot product between two random vectors tends to have large magnitude. Specifically, if the components of Q and K are independent with mean 0 and variance 1 (standard initialization), the variance of their dot product is:

$$\mathrm{Var}\!\left(q \cdot k\right) = \mathrm{Var}\!\left(\sum_{i=1}^{d_k} q_i k_i\right) = d_k$$

With $d_k = 64$ (the default in the base model), the standard deviation of the dot product is $\sqrt{64} = 8$. This means raw scores can easily reach values of $\pm 16$ or larger. When these large values are fed into softmax, the function saturates, producing near-one-hot distributions where almost all the probability mass concentrates on a single position:
| | Without Scaling ($QK^\top$) | With Scaling ($QK^\top/\sqrt{d_k}$) |
|---|---|---|
| Raw scores | std $\approx \sqrt{d_k} = 8$ | std $\approx 1$ |
| After softmax | Near one-hot distribution | Smooth distribution over positions |
| Gradients | Nearly zero everywhere | Flow to all positions |
Dividing by $\sqrt{d_k}$ normalizes the variance back to 1, keeping softmax in a regime where gradients can flow to multiple positions, enabling the model to learn nuanced attention patterns rather than hard selections.
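You can see the saturation effect numerically with a quick sketch (random data, assumed seed; the exact numbers are illustrative):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal(d_k)        # one query vector
K = rng.standard_normal((10, d_k))  # keys for 10 positions

raw = K @ q                  # dot products: std ≈ sqrt(d_k) = 8
scaled = raw / np.sqrt(d_k)  # rescaled: std ≈ 1

p_raw, p_scaled = softmax(raw), softmax(scaled)
print(p_raw.max(), p_scaled.max())
```

The unscaled distribution piles nearly all its mass on one position, while the scaled one stays spread out; since softmax gradients vanish as the output approaches one-hot, only the scaled version keeps learning signal flowing to every position.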
Multi-Head Attention
A single attention function can only capture one type of relationship between tokens. For example, in the sentence "The animal didn't cross the street because it was too tired", the word "it" needs to attend to "animal" (coreference) and to "tired" (semantic context) simultaneously.
Multi-head attention solves this by running $h$ attention functions in parallel, each with its own learned projections. This allows the model to jointly attend to information from different representation subspaces at different positions.
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O$$

where each head is:

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$$

The projection matrices $W_i^Q$, $W_i^K$, $W_i^V$ (and the output matrix $W^O$) are learned during training. With $h = 8$ heads and $d_{\text{model}} = 512$, each head operates on $d_k = d_v = 64$ dimensions. The total computation is comparable to a single attention with full dimensionality, but the model gains the ability to attend to multiple relationships simultaneously.
| Configuration | Base Model | Big Model |
|---|---|---|
| Heads ($h$) | 8 | 16 |
| $d_k = d_v$ | 64 | 64 |
| $d_{\text{model}}$ | 512 | 1024 |
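In practice the $h$ heads are not run as separate matrix multiplications; a single projection is reshaped into heads. Here is a minimal NumPy sketch of that trick (single-sequence, no batching, arbitrary initialization scale):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d_model). W_q/k/v/o: (d_model, d_model). h heads of size d_model // h."""
    n, d_model = X.shape
    d_k = d_model // h

    def project_and_split(W):
        # One big projection, then split the last dim into h heads: (h, n, d_k).
        return (X @ W).reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = project_and_split(W_q), project_and_split(W_k), project_and_split(W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n)
    heads = softmax(scores) @ V                       # (h, n, d_k)
    # Concatenate heads back to (n, d_model) and apply the output projection W_o.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
n, d_model, h = 6, 512, 8
X = rng.standard_normal((n, d_model))
Ws = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(4)]
out = multi_head_attention(X, *Ws, h=h)
print(out.shape)  # (6, 512)
```

The reshape-and-transpose is why the cost stays comparable to single-head attention: the per-head dimension shrinks to $d_{\text{model}}/h$, so the total work is unchanged.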
Three Types of Attention in the Transformer
The encoder-decoder architecture uses attention in three distinct ways, each serving a different purpose:
1. Encoder Self-Attention
In the encoder, every token attends to every other token in the input sequence, including itself. This is full (bidirectional) attention with no masking. The queries, keys, and values all come from the same source: the output of the previous encoder layer.
This allows the encoder to build rich contextual representations. For example, the word "bank" can attend to "river" or "money" to disambiguate its meaning.
2. Masked Decoder Self-Attention
The decoder also uses self-attention, but with a crucial constraint: during training, position $i$ can only attend to positions $j \le i$. This is enforced by applying a causal mask that sets all entries above the diagonal to $-\infty$ before softmax:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V, \qquad M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}$$
This prevents the decoder from "cheating" by looking at future tokens during training, ensuring that the model learns to generate sequences autoregressively.
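A small sketch shows how the $-\infty$ mask works: after softmax, every masked (future) position receives exactly zero weight, while each row still forms a valid distribution (toy scores, seed chosen arbitrarily):

```python
import numpy as np

def causal_mask(n):
    # Entries above the diagonal (j > i) get -inf; softmax maps them to 0.
    mask = np.zeros((n, n))
    mask[np.triu_indices(n, k=1)] = -np.inf
    return mask

def masked_softmax(scores):
    scores = scores + causal_mask(scores.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 4))   # raw attention scores for 4 positions
weights = masked_softmax(scores)
print(np.round(weights, 2))
# Row i has nonzero weight only on positions 0..i; each row still sums to 1.
```

At inference time the mask is technically redundant (future tokens do not exist yet), but during training it is what makes teacher forcing possible: all positions are trained in parallel while each still sees only its past.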
3. Encoder-Decoder (Cross) Attention
This is the bridge between the encoder and decoder. The queries come from the decoder, while the keys and values come from the encoder output. This allows each decoder position to attend to all positions in the input sequence, giving the decoder access to the full source representation when generating each output token.
The Encoder and Decoder Layers
Encoder Layer
Each encoder layer applies two sublayers in sequence, each wrapped with a residual connection and Post-Layer Normalization:

$$x \leftarrow \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$$
The two sublayers are:
- Multi-Head Self-Attention (full attention, no masking)
- Position-wise Feed-Forward Network
Decoder Layer
Each decoder layer has three sublayers (one more than the encoder):
- Masked Multi-Head Self-Attention (causal mask)
- Multi-Head Cross-Attention (Q from decoder, K/V from encoder)
- Position-wise Feed-Forward Network
Each sublayer also uses residual connections and Post-Layer Normalization.
Note on Post-LN: In the original Transformer, LayerNorm is applied after the residual addition. This choice was later found to cause training instability at large scales, leading modern LLMs to use Pre-Layer Normalization instead (covered in Part 2).
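The difference between the two orderings is a one-line swap. A minimal sketch (learnable gain/bias omitted for brevity; the sublayer here is a placeholder linear map):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean, unit variance (no learned scale/shift).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_sublayer(x, sublayer):
    # Original (2017) ordering: residual add first, then LayerNorm.
    return layer_norm(x + sublayer(x))

def pre_ln_sublayer(x, sublayer):
    # Modern ordering: normalize the input, add the residual afterwards.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 8)) * 0.1   # stand-in for attention or FFN
y_post = post_ln_sublayer(x, lambda t: t @ W)
y_pre = pre_ln_sublayer(x, lambda t: t @ W)
```

Note the structural consequence: in Post-LN, every residual path passes through a LayerNorm, so gradients are repeatedly rescaled; in Pre-LN, the residual stream `x + ...` is an unobstructed identity path from output to input, which is a large part of why it trains more stably at depth.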
Feed-Forward Network
Each attention sublayer is followed by a position-wise feed-forward network, applied identically and independently to every token position. It consists of two linear transformations with a ReLU activation in between:

$$\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\,W_2 + b_2$$
| Property | Value |
|---|---|
| Input / output dimension ($d_{\text{model}}$) | 512 |
| Hidden dimension ($d_{ff}$) | 2048 (4× expansion) |
| Activation | ReLU |
| Parameters per layer | ~2.1M |

The 4× expansion ratio ($d_{ff} = 4\,d_{\text{model}}$) has become a standard design choice. Recent research suggests that FFN layers act as key-value memories, storing factual knowledge learned during training.
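The FFN is simple enough to write out directly, and doing so makes the parameter count in the table easy to verify (toy initialization; 10 positions chosen arbitrarily):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: max(0, x W1 + b1) W2 + b2, applied to each token independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.standard_normal((10, d_model))   # 10 token positions
y = feed_forward(x, W1, b1, W2, b2)
print(y.shape)  # (10, 512)

# Parameter count: 512*2048 + 2048 + 2048*512 + 512 = 2,099,712 ≈ 2.1M
n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)
```

"Position-wise" means the same weights are applied to every row of `x`: tokens do not interact here, only in attention.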
Positional Encoding
Unlike RNNs, which inherently process tokens in order, Transformers process all tokens in parallel. Without positional information, the model treats the input as a set rather than a sequence: "The cat sat on the mat" and "The mat sat on the cat" would produce identical representations.
The original Transformer uses fixed sinusoidal encodings added to the input embeddings:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Each position gets a unique $d_{\text{model}}$-dimensional vector. The sine and cosine functions at different frequencies create a pattern where:
- Different frequencies capture different scales: Low-frequency components encode coarse position (beginning vs. end), while high-frequency components encode fine-grained position (adjacent tokens).
- Relative positions are linearly representable: For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$, which the attention mechanism can exploit to learn relative position relationships.
- Extrapolation to unseen lengths: Since sinusoids are defined for all positions, the model can potentially handle sequences longer than those seen during training (though in practice, this is limited).
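The encoding table can be built in a few vectorized lines. A minimal sketch (100 positions chosen arbitrarily):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2) even indices
    angles = pos / np.power(10000.0, i / d_model)  # one frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = sinusoidal_encoding(100, 512)
print(pe.shape)  # (100, 512)
```

The first columns oscillate every few positions (fine-grained order), while the last columns change slowly across the whole sequence (coarse position), matching the multi-scale behavior described above.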
Training Configuration
The original Transformer was trained on the WMT 2014 English-German dataset (4.5 million sentence pairs) and the English-French dataset (36 million pairs).
| Setting | Value |
|---|---|
| Optimizer | Adam ($\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$) |
| Learning rate schedule | Warmup (4000 steps) + inverse square root decay |
| Label smoothing | $\epsilon_{ls} = 0.1$ |
| Dropout | 0.1 (applied to sublayer outputs and attention weights) |
| Batch size | ~25,000 source + target tokens |
| Training time (base) | 12 hours on 8 NVIDIA P100 GPUs |
| Parameters (base) | ~65M |
| Parameters (big) | ~213M |
The learning rate schedule combines linear warmup with inverse square root decay:

$$lrate = d_{\text{model}}^{-0.5} \cdot \min\!\left(step^{-0.5},\; step \cdot warmup\_steps^{-1.5}\right)$$
This increases the learning rate linearly for the first 4000 steps, then decreases it proportionally to the inverse square root of the step number.
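The schedule is a one-liner, and evaluating it shows the shape: linear rise to a peak at step 4000, then $1/\sqrt{step}$ decay.

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Both branches of min() meet at step == warmup, so the peak is at step 4000.
peak = transformer_lr(4000)
print(peak)  # ≈ 7.0e-4 for d_model = 512
```

The $d_{\text{model}}^{-0.5}$ factor ties the peak learning rate to model width, so the big model automatically trains with a smaller learning rate than the base model.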
Summary
| Component | Original Transformer (2017) |
|---|---|
| Architecture | Encoder-Decoder |
| Layers | 6 encoder + 6 decoder |
| Layer Normalization | Post-LN (after residual) |
| Attention | Full self-attention + cross-attention |
| FFN Activation | ReLU |
| Positional Encoding | Sinusoidal (fixed) |
| Normalization | LayerNorm |
| Use case | Machine translation |
The original Transformer (big) achieved 28.4 BLEU on English-to-German translation, establishing a new state of the art after just 3.5 days of training on 8 GPUs. More importantly, it introduced the architectural foundation that would power the next generation of language models.
In the next post, we explore Part 2: Architecture Changes — how modern LLMs evolved from this original design with decoder-only architectures, Pre-Layer Normalization, and RMSNorm.
Written by Suchinthaka Wanninayaka
AI/ML Researcher exploring semantic communications, diffusion models, and language model systems. Writing about deep learning from theory to production.