This is the first post in my series exploring the Transformer architecture, tracing its evolution from the original 2017 paper all the way to modern LLMs like GPT-4 and LLaMA. We start with the canonical Transformer v1 from Vaswani et al., originally designed for machine translation but destined to reshape the entire field of deep learning.

Before the Transformer, sequence modeling was dominated by recurrent networks (RNNs, LSTMs, GRUs). These models processed tokens one at a time, creating a fundamental bottleneck: long-range dependencies were difficult to learn, and the sequential nature of recurrence made parallelization impossible. The Transformer eliminated recurrence entirely, replacing it with a mechanism called self-attention that can relate any two positions in a sequence in a single computational step.

Reference: Vaswani et al., "Attention Is All You Need" (NeurIPS 2017)

Overall Architecture

The original Transformer follows an encoder-decoder design, where the encoder reads the input sequence (e.g., an English sentence) and the decoder generates the output sequence (e.g., a German translation) one token at a time.

[Figure: Transformer architecture]

Both the encoder and decoder consist of a stack of $N = 6$ identical layers. Each encoder layer contains two sublayers: a multi-head self-attention mechanism and a position-wise feed-forward network. Each decoder layer adds a third sublayer: cross-attention over the encoder output. Every sublayer is wrapped with a residual connection followed by layer normalization.

Why Replace RNNs?

The key motivation behind the Transformer was overcoming three critical limitations of recurrent architectures:

| | RNN / LSTM | Transformer |
|---|---|---|
| Processing | Sequential, token-by-token | Fully parallel across sequence |
| Long-range dependencies | Gradients vanish over long distances | Direct connections via attention |
| GPU utilization | Poor parallelism due to recurrence | Matrix multiplications are GPU-native |
| Training time | Slow (cannot parallelize time steps) | Fast (scales with hardware) |

The Transformer achieved state-of-the-art translation quality while training in a fraction of the time required by the best recurrent models.

Scaled Dot-Product Attention

At the heart of the Transformer is the scaled dot-product attention mechanism. Every input token is projected into three vectors: a Query (Q), a Key (K), and a Value (V). The attention function then computes a weighted sum of the values, where the weight assigned to each value is determined by the compatibility between the query and the corresponding key.

[Figure: Scaled dot-product attention]

The formula is deceptively simple:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Step-by-Step Computation

Given a sequence of $n$ tokens, each projected into $d_k$-dimensional queries and keys, and $d_v$-dimensional values:

  1. Compute attention scores: Multiply queries by keys to get an $n \times n$ score matrix measuring how much each token should attend to every other token.
$$E = QK^\top \in \mathbb{R}^{n \times n}$$
  2. Scale: Divide by $\sqrt{d_k}$ to control the magnitude of the scores (explained below).
$$S = \frac{E}{\sqrt{d_k}}$$
  3. Softmax: Apply softmax row-wise to convert scores into a probability distribution. Each row sums to 1.
$$A = \text{softmax}(S)$$
  4. Weighted sum: Multiply the attention weights by the values. Each output token is a weighted combination of all value vectors.
$$O = AV \in \mathbb{R}^{n \times d_v}$$
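The four steps above can be sketched in NumPy. This is a minimal single-sequence version for illustration (function names are my own, not from the paper; real implementations add batching and masking):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n, d_k); V: (n, d_v). Returns (n, d_v) outputs and (n, n) weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # steps 1-2: (n, n) scaled scores
    weights = softmax(scores, axis=-1)   # step 3: each row sums to 1
    return weights @ V, weights          # step 4: weighted sum of values

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 64, 64
Q, K, V = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k)), rng.normal(size=(n, d_v))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # (4, 64) (4, 4)
```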

Why Scale by $\sqrt{d_k}$?

This scaling factor is one of the most critical design decisions in the Transformer, and its importance is often underappreciated.

When $d_k$ is large, the dot product between two random vectors tends to have large magnitude. Specifically, if the components of Q and K are independent with mean 0 and variance 1 (standard initialization), the variance of their dot product is:

$$\text{Var}(q \cdot k) = \sum_{i=1}^{d_k} \text{Var}(q_i k_i) = d_k$$

With $d_k = 64$ (the default in the base model), the standard deviation of the dot product is $\sqrt{64} = 8$. This means raw scores can easily reach values of $\pm 16$ or larger. When these large values are fed into softmax, the function saturates, producing near-one-hot distributions where almost all the probability mass concentrates on a single position:

| | Without Scaling ($d_k = 64$) | With Scaling ($\div \sqrt{64}$) |
|---|---|---|
| Raw scores | $[-20, 15, 18, -12]$ | $[-2.5, 1.9, 2.3, -1.5]$ |
| After softmax | $[0.00, 0.05, 0.95, 0.00]$ | $[0.00, 0.39, 0.59, 0.01]$ |
| Gradients | Nearly zero everywhere | Flow to multiple positions |

Dividing by $\sqrt{d_k}$ normalizes the variance back to 1, keeping softmax in a regime where gradients can flow to multiple positions, enabling the model to learn nuanced attention patterns rather than hard selections.
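A quick way to see the saturation effect is to push a set of raw scores through softmax with and without the scaling (a small NumPy sketch):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Without scaling, one position absorbs nearly all the probability mass;
# after dividing by sqrt(64) = 8, the distribution stays soft enough
# for gradients to reach more than one position.
raw = np.array([-20.0, 15.0, 18.0, -12.0])
print(softmax(raw).round(2))               # [0.   0.05 0.95 0.  ]
print(softmax(raw / np.sqrt(64)).round(2)) # much softer distribution
```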

Multi-Head Attention

A single attention function can only capture one type of relationship between tokens. For example, in the sentence "The animal didn't cross the street because it was too tired", the word "it" needs to attend to "animal" (coreference) and to "tired" (semantic context) simultaneously.

Multi-head attention solves this by running $h$ attention functions in parallel, each with its own learned projections. This allows the model to jointly attend to information from different representation subspaces at different positions.

[Figure: Multi-head attention]

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

where each head is:

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

The projection matrices $W_i^Q, W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, and $W^O \in \mathbb{R}^{hd_v \times d_{model}}$ are learned during training. With $h = 8$ heads and $d_{model} = 512$, each head operates on $d_k = d_v = 64$ dimensions. The total computation is comparable to a single attention with full dimensionality, but the model gains the ability to attend to multiple relationships simultaneously.

| Configuration | Base Model | Big Model |
|---|---|---|
| Heads ($h$) | 8 | 16 |
| $d_k = d_v$ | 64 | 64 |
| $d_{model}$ | 512 | 1024 |
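One way to see why the cost stays comparable: project once to $d_{model}$, reshape the result into $h$ heads of $d_k = d_{model}/h$, and attend within each head. A NumPy sketch with randomly initialized weights (illustrative only, single sequence, no masking):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    """x: (n, d_model). Wq/Wk/Wv/Wo: (d_model, d_model), viewed as h stacked
    per-head projections of width d_k = d_model // h."""
    n, d_model = x.shape
    d_k = d_model // h

    # Project, then split the last dimension into h heads: (h, n, d_k).
    def split(W):
        return (x @ W).reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(Wq), split(Wk), split(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n) per-head scores
    heads = softmax(scores) @ V                       # (h, n, d_k)
    # Concatenate heads back to (n, d_model) and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
n, d_model, h = 10, 512, 8
Ws = [rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(4)]
out = multi_head_attention(rng.normal(size=(n, d_model)), *Ws, h=h)
print(out.shape)  # (10, 512)
```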

Three Types of Attention in the Transformer

The encoder-decoder architecture uses attention in three distinct ways, each serving a different purpose:

1. Encoder Self-Attention

In the encoder, every token attends to every other token in the input sequence, including itself. This is full (bidirectional) attention with no masking. The queries, keys, and values all come from the same source: the output of the previous encoder layer.

This allows the encoder to build rich contextual representations. For example, the word "bank" can attend to "river" or "money" to disambiguate its meaning.

2. Masked Decoder Self-Attention

The decoder also uses self-attention, but with a crucial constraint: during training, position $i$ can only attend to positions $\leq i$. This is enforced by applying a causal mask that sets all entries above the diagonal to $-\infty$ before softmax:

$$\text{Mask}_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$$

This prevents the decoder from "cheating" by looking at future tokens during training, ensuring that the model learns to generate sequences autoregressively.
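In implementations, this mask is typically added to the score matrix before softmax, so the $-\infty$ entries become zeros after exponentiation. A minimal NumPy illustration using uniform (all-zero) scores:

```python
import numpy as np

n = 5
# -inf strictly above the diagonal, 0 on and below it.
mask = np.triu(np.full((n, n), -np.inf), k=1)

scores = np.zeros((n, n)) + mask  # uniform scores, then masked
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(weights.round(2))
# Row i spreads attention uniformly over positions 0..i; future positions get 0.
```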

3. Encoder-Decoder (Cross) Attention

This is the bridge between the encoder and decoder. The queries come from the decoder, while the keys and values come from the encoder output. This allows each decoder position to attend to all positions in the input sequence, giving the decoder access to the full source representation when generating each output token.

The Encoder and Decoder Layers

Encoder Layer

Each encoder layer applies two sublayers in sequence, each wrapped with a residual connection and Post-Layer Normalization:

$$\text{Output} = \text{LayerNorm}(x + \text{Sublayer}(x))$$

The two sublayers are:

  1. Multi-Head Self-Attention (full attention, no masking)
  2. Position-wise Feed-Forward Network

Decoder Layer

Each decoder layer has three sublayers (one more than the encoder):

  1. Masked Multi-Head Self-Attention (causal mask)
  2. Multi-Head Cross-Attention (Q from decoder, K/V from encoder)
  3. Position-wise Feed-Forward Network

Each sublayer also uses residual connections and Post-Layer Normalization.

Note on Post-LN: In the original Transformer, LayerNorm is applied after the residual addition. This choice was later found to cause training instability at large scales, leading modern LLMs to use Pre-Layer Normalization instead (covered in Part 2).

Feed-Forward Network

Each attention sublayer is followed by a position-wise feed-forward network, applied identically and independently to every token position. It consists of two linear transformations with a ReLU activation in between:

$$\text{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$

| Property | Value |
|---|---|
| Input / output dimension ($d_{model}$) | 512 |
| Hidden dimension ($d_{ff}$) | 2048 ($4\times$ expansion) |
| Activation | ReLU |
| Parameters per layer | $2 \times d_{model} \times d_{ff} \approx 2.1$M |

The $4\times$ expansion ratio ($d_{ff} = 4 \cdot d_{model}$) has become a standard design choice. Recent research suggests that FFN layers act as key-value memories, storing factual knowledge learned during training.
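The position-wise FFN is just two matrix multiplications per token; a NumPy sketch with the base-model dimensions (random weights, for illustration only):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Two linear maps with a ReLU in between, applied independently per position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, n = 512, 2048, 10
W1, b1 = rng.normal(scale=0.02, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.02, size=(d_ff, d_model)), np.zeros(d_model)
out = ffn(rng.normal(size=(n, d_model)), W1, b1, W2, b2)
print(out.shape)  # (10, 512)
# Weights alone: 512*2048 + 2048*512 = 2,097,152 parameters, i.e. ~2.1M.
```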

Positional Encoding

Unlike RNNs, which inherently process tokens in order, Transformers process all tokens in parallel. Without positional information, the model treats the input as a set rather than a sequence: "The cat sat on the mat" and "The mat sat on the cat" would produce identical representations.

The original Transformer uses fixed sinusoidal encodings added to the input embeddings:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Each position gets a unique $d_{model}$-dimensional vector. The sine and cosine functions at different frequencies create a pattern where:

  • Different frequencies capture different scales: Low-frequency components encode coarse position (beginning vs. end), while high-frequency components encode fine-grained position (adjacent tokens).
  • Relative positions are linearly representable: For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$, which the attention mechanism can exploit to learn relative position relationships.
  • Extrapolation to unseen lengths: Since sinusoids are defined for all positions, the model can potentially handle sequences longer than those seen during training (though in practice, this is limited).
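Since the encoding depends only on position and dimension, it can be computed once and cached; a NumPy sketch:

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """Returns a (max_len, d_model) matrix of fixed positional encodings."""
    pos = np.arange(max_len)[:, None]      # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]  # even dimension indices 2i
    angles = pos / 10000 ** (i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

pe = sinusoidal_encoding(100, 512)
print(pe.shape)            # (100, 512)
print(pe[0, :4].round(2))  # position 0: [0. 1. 0. 1.]
```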

Training Configuration

The original Transformer was trained on the WMT 2014 English-German dataset (4.5 million sentence pairs) and the English-French dataset (36 million pairs).

| Setting | Value |
|---|---|
| Optimizer | Adam ($\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$) |
| Learning rate schedule | Warmup (4000 steps) + inverse square root decay |
| Label smoothing | $\epsilon_{ls} = 0.1$ |
| Dropout | 0.1 (applied to sublayer outputs and attention weights) |
| Batch size | ~25,000 source + target tokens |
| Training time (base) | 12 hours on 8 NVIDIA P100 GPUs |
| Parameters (base) | ~65M |
| Parameters (big) | ~213M |

The learning rate schedule combines linear warmup with inverse square root decay:

$$lr = d_{model}^{-0.5} \cdot \min\left(\text{step}^{-0.5},\; \text{step} \cdot \text{warmup}^{-1.5}\right)$$

This increases the learning rate linearly for the first 4000 steps, then decreases it proportionally to the inverse square root of the step number.
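The schedule is easy to implement directly from the formula (the function name is my own; this scheme is often called the "Noam" schedule):

```python
def noam_lr(step, d_model=512, warmup=4000):
    # Linear warmup for `warmup` steps, then inverse-square-root decay.
    step = max(step, 1)  # guard against step = 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The two branches of the min() cross exactly at the end of warmup,
# so the learning rate peaks at step 4000.
print(noam_lr(4000))  # ≈ 0.0007
```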

Summary

| Component | Original Transformer (2017) |
|---|---|
| Architecture | Encoder-Decoder |
| Layers | 6 encoder + 6 decoder |
| Normalization | LayerNorm, Post-LN (after residual) |
| Attention | Full self-attention + cross-attention |
| FFN Activation | ReLU |
| Positional Encoding | Sinusoidal (fixed) |
| Use case | Machine translation |

The original Transformer achieved 28.4 BLEU on English-to-German translation, establishing a new state of the art while training in just 3.5 days. More importantly, it introduced the architectural foundation that would power the next generation of language models.


In the next post, we explore Part 2: Architecture Changes — how modern LLMs evolved from this original design with decoder-only architectures, Pre-Layer Normalization, and RMSNorm.


Written by Suchinthaka Wanninayaka

AI/ML Researcher exploring semantic communications, diffusion models, and language model systems. Writing about deep learning from theory to production.
