This is the first post in my series exploring the Transformer architecture, tracing its evolution from the original 2017 paper all the way to modern LLMs like GPT-4 and LLaMA. We start with the canonical Transformer v1 from Vaswani et al., originally designed for machine translation but destined to reshape the entire field of deep learning.

Before the Transformer, sequence modeling was dominated by recurrent networks (RNNs, LSTMs, GRUs). These models processed tokens one at a time, creating a fundamental bottleneck: long-range dependencies were difficult to learn, and the sequential nature of recurrence made parallelization impossible. The Transformer eliminated recurrence entirely, replacing it with a mechanism called self-attention that can relate any two positions in a sequence in a single computational step.

Reference: Vaswani et al., "Attention Is All You Need" (NeurIPS 2017)

Overall Architecture

The original Transformer follows an encoder-decoder design, where the encoder reads the input sequence (e.g., an English sentence) and the decoder generates the output sequence (e.g., a German translation) one token at a time.

[Figure: Transformer architecture]

Both the encoder and decoder consist of a stack of $N = 6$ identical layers. Each encoder layer contains two sublayers: a multi-head self-attention mechanism and a position-wise feed-forward network. Each decoder layer adds a third sublayer: cross-attention over the encoder output. Every sublayer is wrapped with a residual connection followed by layer normalization.

Why Replace RNNs?

The key motivation behind the Transformer was overcoming three critical limitations of recurrent architectures:

| | RNN / LSTM | Transformer |
|---|---|---|
| Processing | Sequential, token-by-token | Fully parallel across sequence |
| Long-range dependencies | Gradients vanish over long distances | Direct connections via attention |
| GPU utilization | Poor parallelism due to recurrence | Matrix multiplications are GPU-native |
| Training time | Slow (cannot parallelize time steps) | Fast (scales with hardware) |

The Transformer achieved state-of-the-art translation quality while training in a fraction of the time required by the best recurrent models.

Scaled Dot-Product Attention

At the heart of the Transformer is the scaled dot-product attention mechanism. Every input token is projected into three vectors: a Query (Q), a Key (K), and a Value (V). The attention function then computes a weighted sum of the values, where the weight assigned to each value is determined by the compatibility between the query and the corresponding key.

[Figure: Scaled dot-product attention]

The formula is deceptively simple:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Step-by-Step Computation

Given a sequence of $n$ tokens, each projected into $d_k$-dimensional queries and keys, and $d_v$-dimensional values:

  1. Compute attention scores: Multiply queries by keys to get an $n \times n$ score matrix measuring how much each token should attend to every other token.
$$E = QK^\top \in \mathbb{R}^{n \times n}$$
  2. Scale: Divide by $\sqrt{d_k}$ to control the magnitude of the scores (explained below).
$$S = \frac{E}{\sqrt{d_k}}$$
  3. Softmax: Apply softmax row-wise to convert scores into a probability distribution. Each row sums to 1.
$$A = \text{softmax}(S)$$
  4. Weighted sum: Multiply the attention weights by the values. Each output token is a weighted combination of all value vectors.
$$O = AV \in \mathbb{R}^{n \times d_v}$$
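The four steps above can be sketched in NumPy. This is a minimal single-sequence version for illustration (function names are my own, not from the paper; real implementations add batching and masking):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n, d_k); V: (n, d_v). Returns (n, d_v) outputs and (n, n) weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # steps 1-2: (n, n) scaled scores
    weights = softmax(scores, axis=-1)   # step 3: each row sums to 1
    return weights @ V, weights          # step 4: weighted sum of values

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 64, 64
Q, K, V = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k)), rng.normal(size=(n, d_v))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # (4, 64) (4, 4)
```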

Why Scale by $\sqrt{d_k}$?

This scaling factor is one of the most critical design decisions in the Transformer, and its importance is often underappreciated.

When $d_k$ is large, the dot product between two random vectors tends to have large magnitude. Specifically, if the components of Q and K are independent with mean 0 and variance 1 (standard initialization), the variance of their dot product is:

$$\text{Var}(q \cdot k) = \sum_{i=1}^{d_k} \text{Var}(q_i k_i) = d_k$$

With $d_k = 64$ (the default in the base model), the standard deviation of the dot product is $\sqrt{64} = 8$. This means raw scores can easily reach values of $\pm 16$ or larger. When these large values are fed into softmax, the function saturates, producing near-one-hot distributions where almost all the probability mass concentrates on a single position:

| | Without Scaling ($d_k = 64$) | With Scaling ($\div \sqrt{64}$) |
|---|---|---|
| Raw scores | $[-20, 15, 18, -12]$ | $[-2.5, 1.9, 2.3, -1.5]$ |
| After softmax | $[0.00, 0.05, 0.95, 0.00]$ | $[0.00, 0.39, 0.59, 0.01]$ |
| Gradients | Nearly zero everywhere | Flow to multiple positions |

Dividing by $\sqrt{d_k}$ normalizes the variance back to 1, keeping softmax in a regime where gradients can flow to multiple positions, enabling the model to learn nuanced attention patterns rather than hard selections.
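A quick way to see the saturation effect is to push a set of raw scores through softmax with and without the scaling (a small NumPy sketch):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Without scaling, one position absorbs nearly all the probability mass;
# after dividing by sqrt(64) = 8, the distribution stays soft enough
# for gradients to reach more than one position.
raw = np.array([-20.0, 15.0, 18.0, -12.0])
print(softmax(raw).round(2))               # [0.   0.05 0.95 0.  ]
print(softmax(raw / np.sqrt(64)).round(2)) # much softer distribution
```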

Multi-Head Attention

A single attention function can only capture one type of relationship between tokens. For example, in the sentence "The animal didn't cross the street because it was too tired", the word "it" needs to attend to "animal" (coreference) and to "tired" (semantic context) simultaneously.

Multi-head attention solves this by running $h$ attention functions in parallel, each with its own learned projections. This allows the model to jointly attend to information from different representation subspaces at different positions.

[Figure: Multi-head attention]

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

where each head is:

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

The projection matrices $W_i^Q, W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, and $W^O \in \mathbb{R}^{hd_v \times d_{model}}$ are learned during training. With $h = 8$ heads and $d_{model} = 512$, each head operates on $d_k = d_v = 64$ dimensions. The total computation is comparable to a single attention with full dimensionality, but the model gains the ability to attend to multiple relationships simultaneously.

| Configuration | Base Model | Big Model |
|---|---|---|
| Heads ($h$) | 8 | 16 |
| $d_k = d_v$ | 64 | 64 |
| $d_{model}$ | 512 | 1024 |
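One way to see why the cost stays comparable: project once to $d_{model}$, reshape the result into $h$ heads of $d_k = d_{model}/h$, and attend within each head. A NumPy sketch with randomly initialized weights (illustrative only, single sequence, no masking):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    """x: (n, d_model). Wq/Wk/Wv/Wo: (d_model, d_model), viewed as h stacked
    per-head projections of width d_k = d_model // h."""
    n, d_model = x.shape
    d_k = d_model // h

    # Project, then split the last dimension into h heads: (h, n, d_k).
    def split(W):
        return (x @ W).reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(Wq), split(Wk), split(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n) per-head scores
    heads = softmax(scores) @ V                       # (h, n, d_k)
    # Concatenate heads back to (n, d_model) and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
n, d_model, h = 10, 512, 8
Ws = [rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(4)]
out = multi_head_attention(rng.normal(size=(n, d_model)), *Ws, h=h)
print(out.shape)  # (10, 512)
```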

Three Types of Attention in the Transformer

The encoder-decoder architecture uses attention in three distinct ways, each serving a different purpose:

1. Encoder Self-Attention

In the encoder, every token attends to every other token in the input sequence, including itself. This is full (bidirectional) attention with no masking. The queries, keys, and values all come from the same source: the output of the previous encoder layer.

This allows the encoder to build rich contextual representations. For example, the word "bank" can attend to "river" or "money" to disambiguate its meaning.

2. Masked Decoder Self-Attention

The decoder also uses self-attention, but with a crucial constraint: during training, position $i$ can only attend to positions $\leq i$. This is enforced by applying a causal mask that sets all entries above the diagonal to $-\infty$ before softmax:

$$\text{Mask}_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$$

This prevents the decoder from "cheating" by looking at future tokens during training, ensuring that the model learns to generate sequences autoregressively.
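In implementations, this mask is typically added to the score matrix before softmax, so the $-\infty$ entries become zeros after exponentiation. A minimal NumPy illustration using uniform (all-zero) scores:

```python
import numpy as np

n = 5
# -inf strictly above the diagonal, 0 on and below it.
mask = np.triu(np.full((n, n), -np.inf), k=1)

scores = np.zeros((n, n)) + mask  # uniform scores, then masked
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(weights.round(2))
# Row i spreads attention uniformly over positions 0..i; future positions get 0.
```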

3. Encoder-Decoder (Cross) Attention

This is the bridge between the encoder and decoder. The queries come from the decoder, while the keys and values come from the encoder output. This allows each decoder position to attend to all positions in the input sequence, giving the decoder access to the full source representation when generating each output token.

The Encoder and Decoder Layers

Encoder Layer

Each encoder layer applies two sublayers in sequence, each wrapped with a residual connection and Post-Layer Normalization:

$$\text{Output} = \text{LayerNorm}(x + \text{Sublayer}(x))$$

The two sublayers are:

  1. Multi-Head Self-Attention (full attention, no masking)
  2. Position-wise Feed-Forward Network

Decoder Layer

Each decoder layer has three sublayers (one more than the encoder):

  1. Masked Multi-Head Self-Attention (causal mask)
  2. Multi-Head Cross-Attention (Q from decoder, K/V from encoder)
  3. Position-wise Feed-Forward Network

Each sublayer also uses residual connections and Post-Layer Normalization.

Note on Post-LN: In the original Transformer, LayerNorm is applied after the residual addition. This choice was later found to cause training instability at large scales, leading modern LLMs to use Pre-Layer Normalization instead (covered in Part 2).

Feed-Forward Network

Each attention sublayer is followed by a position-wise feed-forward network, applied identically and independently to every token position. It consists of two linear transformations with a ReLU activation in between:

$$\text{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$

| Property | Value |
|---|---|
| Input / output dimension ($d_{model}$) | 512 |
| Hidden dimension ($d_{ff}$) | 2048 ($4\times$ expansion) |
| Activation | ReLU |
| Parameters per layer | $2 \times d_{model} \times d_{ff} \approx 2.1$M |

The $4\times$ expansion ratio ($d_{ff} = 4 \cdot d_{model}$) has become a standard design choice. Recent research suggests that FFN layers act as key-value memories, storing factual knowledge learned during training.
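The position-wise FFN is just two matrix multiplications per token; a NumPy sketch with the base-model dimensions (random weights, for illustration only):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Two linear maps with a ReLU in between, applied independently per position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, n = 512, 2048, 10
W1, b1 = rng.normal(scale=0.02, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.02, size=(d_ff, d_model)), np.zeros(d_model)
out = ffn(rng.normal(size=(n, d_model)), W1, b1, W2, b2)
print(out.shape)  # (10, 512)
# Weights alone: 512*2048 + 2048*512 = 2,097,152 parameters, i.e. ~2.1M.
```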

Positional Encoding

Unlike RNNs, which inherently process tokens in order, Transformers process all tokens in parallel. Without positional information, the model treats the input as a set rather than a sequence: "The cat sat on the mat" and "The mat sat on the cat" would produce identical representations.

The original Transformer uses fixed sinusoidal encodings added to the input embeddings:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Each position gets a unique $d_{model}$-dimensional vector. The sine and cosine functions at different frequencies create a pattern where:

  • Different frequencies capture different scales: Low-frequency components encode coarse position (beginning vs. end), while high-frequency components encode fine-grained position (adjacent tokens).
  • Relative positions are linearly representable: For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$, which the attention mechanism can exploit to learn relative position relationships.
  • Extrapolation to unseen lengths: Since sinusoids are defined for all positions, the model can potentially handle sequences longer than those seen during training (though in practice, this is limited).
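Since the encoding depends only on position and dimension, it can be computed once and cached; a NumPy sketch:

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """Returns a (max_len, d_model) matrix of fixed positional encodings."""
    pos = np.arange(max_len)[:, None]      # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]  # even dimension indices 2i
    angles = pos / 10000 ** (i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

pe = sinusoidal_encoding(100, 512)
print(pe.shape)            # (100, 512)
print(pe[0, :4].round(2))  # position 0: [0. 1. 0. 1.]
```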

Training Configuration

The original Transformer was trained on the WMT 2014 English-German dataset (4.5 million sentence pairs) and the English-French dataset (36 million pairs).

| Setting | Value |
|---|---|
| Optimizer | Adam ($\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$) |
| Learning rate schedule | Warmup (4000 steps) + inverse square root decay |
| Label smoothing | $\epsilon_{ls} = 0.1$ |
| Dropout | 0.1 (applied to sublayer outputs and attention weights) |
| Batch size | ~25,000 source + target tokens |
| Training time (base) | 12 hours on 8 NVIDIA P100 GPUs |
| Parameters (base) | ~65M |
| Parameters (big) | ~213M |

The learning rate schedule combines linear warmup with inverse square root decay:

$$lr = d_{model}^{-0.5} \cdot \min\left(\text{step}^{-0.5},\; \text{step} \cdot \text{warmup}^{-1.5}\right)$$

This increases the learning rate linearly for the first 4000 steps, then decreases it proportionally to the inverse square root of the step number.
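The schedule is easy to implement directly from the formula (the function name is my own; this scheme is often called the "Noam" schedule):

```python
def noam_lr(step, d_model=512, warmup=4000):
    # Linear warmup for `warmup` steps, then inverse-square-root decay.
    step = max(step, 1)  # guard against step = 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The two branches of the min() cross exactly at the end of warmup,
# so the learning rate peaks at step 4000.
print(noam_lr(4000))  # ≈ 0.0007
```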

Summary

| Component | Original Transformer (2017) |
|---|---|
| Architecture | Encoder-Decoder |
| Layers | 6 encoder + 6 decoder |
| Normalization | LayerNorm, Post-LN (after residual) |
| Attention | Full self-attention + cross-attention |
| FFN Activation | ReLU |
| Positional Encoding | Sinusoidal (fixed) |
| Use case | Machine translation |

The original Transformer achieved 28.4 BLEU on English-to-German translation, establishing a new state of the art while training in just 3.5 days. More importantly, it introduced the architectural foundation that would power the next generation of language models.


In the next post, we explore Part 2: Architecture Changes — how modern LLMs evolved from this original design with decoder-only architectures, Pre-Layer Normalization, and RMSNorm.


Written by Suchinthaka Wanninayaka

AI/ML Researcher exploring semantic communications, diffusion models, and language model systems. Writing about deep learning from theory to production.
