Transformer Deep Dive: Part 7 - Minor But Important Changes
Throughout this series, we have examined the headline innovations: FlashAttention, Grouped-Query Attention, SwiGLU, Mixture of Experts, and advanced training and inference techniques. But modern LLMs also incorporate a collection of seemingly minor architectural choices that, taken together, meaningfully affect training stability, memory efficiency, and throughput. These are the details you encounter when reading model code and wonder "why did they do it that way?" This post answers those questions.
We will cover five topics: bias removal from linear layers, the decision to tie or untie embedding weights, parallel versus sequential attention and FFN computation, initialization schemes that prevent training collapse, and a handful of numerical and hardware-alignment tricks that every practitioner should know.
1. Bias Removal
The Role of Bias in Linear Layers
A standard linear transformation in a neural network is:

$$y = Wx + b$$

where $W \in \mathbb{R}^{d_{out} \times d_{in}}$ is the weight matrix and $b \in \mathbb{R}^{d_{out}}$ is the bias vector. The bias provides three things:
- Learnable offset: It shifts the output distribution independently of the input magnitude.
- Non-zero output at zero input: When $x = 0$, we get $y = b$ rather than the zero vector.
- Function approximation flexibility: It adds extra capacity, which matters in small networks.
For decades, biases were considered an essential component of neural networks. The universal approximation theorem for single-layer networks, for instance, requires biases. So why would we remove them?
Where Biases Appear in a Transformer
A single Pre-LN transformer layer contains several linear projections, each of which could carry a bias:
| Component | Linear Layers | Bias Vectors |
|---|---|---|
| Multi-Head Attention | $W_Q$, $W_K$, $W_V$, $W_O$ | 4 vectors of size $d_{model}$ |
| FFN (standard) | $W_1$ (up), $W_2$ (down) | 2 vectors ($d_{ff}$ and $d_{model}$) |
| FFN (SwiGLU) | $W_{gate}$, $W_{up}$, $W_{down}$ | 3 vectors (two of $d_{ff}$, one of $d_{model}$) |
| Total per layer | 6-7 projections | 6-7 bias vectors |
Each bias vector has $d_{model}$ parameters (or $d_{ff}$ for the up-projections). For a model with $d_{model} = 4096$ and SwiGLU with $d_{ff} = 11008$, a single layer has approximately:

$$\underbrace{4 \cdot 4096}_{\text{attention}} + \underbrace{2 \cdot 11008 + 4096}_{\text{SwiGLU}} \approx 42.5\text{K bias parameters}$$
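These per-layer counts are easy to verify with a quick back-of-the-envelope script, using the LLaMA-7B dimensions above:

```python
d_model, d_ff, n_layers = 4096, 11008, 32

attn_bias = 4 * d_model        # biases for the Q, K, V, and output projections
ffn_bias = 2 * d_ff + d_model  # SwiGLU: gate and up (size d_ff), down (size d_model)

per_layer = attn_bias + ffn_bias
total = per_layer * n_layers

print(f"bias params per layer: {per_layer:,}")   # 42,496
print(f"bias params per model: {total:,}")       # 1,359,872 (~1.4M)
print(f"FP16 memory: {total * 2 / 1e6:.1f} MB")  # 2.7 MB
```

These totals match the 7B row of the memory table below.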
Why Modern LLMs Remove Biases
Argument 1: Redundancy with normalization layers.
LayerNorm includes a learnable shift parameter $\beta$:

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

When a linear layer follows LayerNorm, we have:

$$W \, \text{LayerNorm}(x) + b = W\left(\gamma \odot \hat{x} + \beta\right) + b$$

The bias $b$ and the shift $\beta$ provide overlapping functionality: both add a learned constant to the output, making one of them partially redundant.
Argument 2: RMSNorm has no shift at all.
Most modern LLMs use RMSNorm instead of LayerNorm:

$$\text{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i} x_i^2 + \epsilon}}$$

RMSNorm has no $\beta$ parameter. Yet models trained with RMSNorm and no biases perform as well as or better than those with biases. This tells us that at sufficient scale, the model learns to represent arbitrary offsets through the weight matrices themselves, without needing explicit bias terms.
Argument 3: Memory savings compound across the system.
While individual bias vectors are small, the savings matter in the context of distributed training and inference:
| Model Scale | Layers | $d_{model}$ | Bias Params | FP16 Memory |
|---|---|---|---|---|
| 7B (LLaMA) | 32 | 4096 | ~1.4M | ~2.7 MB |
| 13B | 40 | 5120 | ~2.2M | ~4.2 MB |
| 70B | 80 | 8192 | ~5.6M | ~10.7 MB |
These numbers seem negligible, but consider that each bias vector also requires: an optimizer state (two additional vectors for Adam), gradient storage during backward pass, and communication bandwidth during distributed training. The total overhead is 4-5x the raw parameter count.
Argument 4: Simplifies quantization.
Bias terms interact awkwardly with weight quantization. When weights are quantized to INT4 or INT8, biases remain in higher precision, adding complexity to the inference kernel. Removing biases yields cleaner, more uniform compute patterns.
Implementation
import torch
import torch.nn as nn
class BiasFreeLLaMAAttention(nn.Module):
"""LLaMA-style multi-head attention with no biases anywhere."""
def __init__(self, d_model: int, n_heads: int, n_kv_heads: int):
super().__init__()
self.n_heads = n_heads
self.n_kv_heads = n_kv_heads
self.head_dim = d_model // n_heads
# All projections: bias=False
self.wq = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
self.wk = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
self.wv = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
self.wo = nn.Linear(n_heads * self.head_dim, d_model, bias=False)
def forward(self, x: torch.Tensor) -> torch.Tensor:
bsz, seq_len, _ = x.shape
q = self.wq(x).view(bsz, seq_len, self.n_heads, self.head_dim)
k = self.wk(x).view(bsz, seq_len, self.n_kv_heads, self.head_dim)
v = self.wv(x).view(bsz, seq_len, self.n_kv_heads, self.head_dim)
# Transpose for attention: (bsz, n_heads, seq_len, head_dim)
q = q.transpose(1, 2)
k = k.transpose(1, 2)
v = v.transpose(1, 2)
# GQA: repeat KV heads to match Q heads
if self.n_kv_heads < self.n_heads:
n_rep = self.n_heads // self.n_kv_heads
k = k.repeat_interleave(n_rep, dim=1)
v = v.repeat_interleave(n_rep, dim=1)
# Scaled dot-product attention
scale = self.head_dim ** -0.5
attn = torch.matmul(q, k.transpose(-2, -1)) * scale
attn = torch.softmax(attn, dim=-1)
out = torch.matmul(attn, v)
out = out.transpose(1, 2).contiguous().view(bsz, seq_len, -1)
return self.wo(out)
class BiasFreeSwiGLU(nn.Module):
"""SwiGLU FFN with no biases."""
def __init__(self, d_model: int, d_ff: int):
super().__init__()
self.w_gate = nn.Linear(d_model, d_ff, bias=False)
self.w_up = nn.Linear(d_model, d_ff, bias=False)
self.w_down = nn.Linear(d_ff, d_model, bias=False)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.w_down(nn.functional.silu(self.w_gate(x)) * self.w_up(x))
The key pattern is simple: every nn.Linear call includes bias=False. This single change propagates throughout the entire model.
Partial Bias: The QKV Exception
A few models take a middle path. Falcon, for instance, removes biases from all projections except the QKV layers in attention. The reasoning is that QKV biases help with rotary position encoding by providing a learnable offset before the rotation is applied. However, most recent models (LLaMA 2/3, Mistral, DeepSeek) remove biases everywhere without any degradation.
2. Tied vs. Untied Embeddings
The Two Embedding Matrices
Every language model has two large matrices related to vocabulary:
- Input embedding $E \in \mathbb{R}^{V \times d_{model}}$: Maps discrete token IDs to continuous vectors.
- Output projection (LM head) $W_{out} \in \mathbb{R}^{d_{model} \times V}$: Maps final hidden states back to vocabulary logits.
For a vocabulary of $V = 32{,}000$ tokens and $d_{model} = 4096$:

$$|E| = |W_{out}| = 32{,}000 \times 4096 \approx 131\text{M parameters each}$$

Together they consume approximately 262M parameters, which is a significant fraction of smaller models (over 3% of a 7B model).
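A one-liner confirms these counts:

```python
V, d_model = 32_000, 4096
per_matrix = V * d_model

print(f"each matrix: {per_matrix / 1e6:.0f}M params")      # 131M
print(f"untied pair: {2 * per_matrix / 1e6:.0f}M params")  # 262M
print(f"share of a 7B model: {2 * per_matrix / 7e9:.1%}")  # 3.7%
```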
Weight Tying: $W_{out} = E^T$
The idea of weight tying (Press and Wolf, 2017) is to share parameters between the input embedding and the output projection: instead of learning two independent matrices, we use the transpose of the input embedding as the output projection. The intuition is compelling: if two tokens have similar input embeddings (they appear in similar contexts), they should also receive similar output probabilities (they are likely to be predicted in similar positions).
Benefits of tying:
- Saves parameters (131M for the example above).
- For small models the relative savings are large: GPT-2 small (117M parameters) ties a 50,257 × 768 embedding of roughly 38.6M parameters, about a third of the model; untying would duplicate that entire matrix.
- Acts as a regularizer by forcing input and output representations to be consistent.
- Empirically improves perplexity on smaller models.
Why modern LLMs untie embeddings:
As models scale to billions of parameters, the calculus changes:
| Model Size | Total Params | Embedding Params | Savings from Tying |
|---|---|---|---|
| 117M (GPT-2 small) | 117M | ~77M | ~66% (huge) |
| 1.3B | 1.3B | 131M | ~10% |
| 7B (LLaMA) | 7B | 262M | ~3.7% |
| 70B (LLaMA) | 70B | 262M | ~0.37% |
At 7B+ scale, the parameter savings from tying are negligible. Meanwhile, untying provides a real benefit: the input embedding can specialize for understanding context (what does this token mean?) while the output projection specializes for prediction (what token should come next?). These are fundamentally different tasks, and coupling them constrains model capacity.
Implementation
import torch
import torch.nn as nn
class TiedEmbeddingModel(nn.Module):
"""GPT-2 style: input and output embeddings are shared."""
def __init__(self, vocab_size: int, d_model: int):
super().__init__()
self.embed_tokens = nn.Embedding(vocab_size, d_model)
self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
# Tie weights: lm_head uses the same storage as embed_tokens
self.lm_head.weight = self.embed_tokens.weight
def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
h = self.embed_tokens(input_ids)
# ... transformer layers ...
logits = self.lm_head(h) # Uses E^T internally
return logits
class UntiedEmbeddingModel(nn.Module):
"""LLaMA style: separate input and output embeddings."""
def __init__(self, vocab_size: int, d_model: int):
super().__init__()
self.embed_tokens = nn.Embedding(vocab_size, d_model)
self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
# No tying: these are independent parameter tensors
def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
h = self.embed_tokens(input_ids)
# ... transformer layers ...
logits = self.lm_head(h)
return logits
A subtle but important point: when using tied embeddings with distributed training (e.g., tensor parallelism), the shared weight requires careful handling. Since the embedding table is typically partitioned along the vocabulary dimension while other linear layers are partitioned along hidden dimensions, tying creates a dependency that complicates the parallelism strategy. This is another practical reason modern distributed systems prefer untied embeddings.
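As a sanity check on the tied variant, the weight sharing can be verified directly: tying assigns the same underlying tensor to both modules rather than copying it. A toy configuration (sizes arbitrary):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 16

embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embed.weight  # tie: both shapes are (vocab_size, d_model)

# Both modules now reference the same storage
print(embed.weight.data_ptr() == lm_head.weight.data_ptr())  # True

# Updating one is visible through the other
with torch.no_grad():
    embed.weight.fill_(0.5)
print(lm_head.weight[0, 0].item())  # 0.5

# Deduplicating by storage shows tying halves the vocab-related params
params = {p.data_ptr(): p.numel() for p in [embed.weight, lm_head.weight]}
print(sum(params.values()))  # 1600, not 3200
```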
3. Parallel Attention and FFN
Sequential: The Standard Approach
In the original transformer and most current models, attention and FFN are applied sequentially within each layer. Using Pre-LN ordering:

$$h = x + \text{Attn}(\text{Norm}_1(x))$$
$$y = h + \text{FFN}(\text{Norm}_2(h))$$
The FFN sees the output of attention (plus the residual), so it can condition on what attention computed. This creates a strict dependency: FFN cannot begin until attention completes.
Parallel: Computing Both at Once
The parallel formulation, introduced in GPT-J and adopted by PaLM, runs attention and FFN simultaneously on the same normalized input:

$$y = x + \text{Attn}(\text{Norm}(x)) + \text{FFN}(\text{Norm}(x))$$
Notice three key differences from the sequential version:
- Only one normalization layer is applied (not two).
- Both Attn and FFN receive the same normalized input.
- Both outputs are added to the residual simultaneously.
Why Parallel Works
At first glance, this seems like it should be strictly worse: the FFN can no longer condition on attention output. But the empirical results tell a different story. Chowdhery et al. (2022) found that at sufficient scale (8B+ parameters), parallel blocks match the quality of sequential blocks.
The theoretical justification is that at large width, the attention and FFN contributions are both small perturbations to the residual stream. Since each contribution is small relative to the residual itself, whether they are computed sequentially or in parallel makes little difference. Formally, if the residual stream has magnitude $\|x\|$ and $\|\text{Attn}(\text{Norm}(x))\| \ll \|x\|$, then:

$$\text{FFN}\big(\text{Norm}(x + \text{Attn}(\text{Norm}(x)))\big) \approx \text{FFN}(\text{Norm}(x))$$

because the FFN input changes by only a small amount relative to the residual magnitude.
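This approximation is easy to check numerically with a toy FFN (random weights; the dimensions and scales here are arbitrary, chosen only so the attention contribution is small relative to the residual stream):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 1024

def rmsnorm(v):
    return v * torch.rsqrt(v.pow(2).mean() + 1e-6)

# Toy two-matrix FFN with random weights
W1 = torch.randn(d, d) / d ** 0.5
W2 = torch.randn(d, d) / d ** 0.5

def ffn(v):
    return F.silu(v @ W1) @ W2

x = torch.randn(d) * 10.0          # large-magnitude residual stream
attn_out = torch.randn(d) * 0.05   # small attention contribution

sequential = ffn(rmsnorm(x + attn_out))  # FFN conditions on attention output
parallel = ffn(rmsnorm(x))               # FFN sees only the residual

rel = (sequential - parallel).norm() / sequential.norm()
print(f"relative difference: {rel:.4f}")  # small relative difference
```

With the attention contribution at ~0.5% of the residual magnitude, the two FFN outputs differ by a comparably small fraction.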
Performance Advantages
The parallel formulation offers concrete performance benefits:
- Reduced normalization cost: One RMSNorm instead of two saves compute and memory.
- Fused kernels: The attention and FFN input projections can be fused into a single large matrix multiply, improving GPU utilization.
- Better pipeline parallelism: With two independent compute paths, the workload can be split more evenly across devices.
PaLM reported approximately 15% wall-clock speedup from the parallel formulation at their training scale.
Implementation
import torch
import torch.nn as nn
class RMSNorm(nn.Module):
def __init__(self, d_model: int, eps: float = 1e-6):
super().__init__()
self.weight = nn.Parameter(torch.ones(d_model))
self.eps = eps
def forward(self, x: torch.Tensor) -> torch.Tensor:
inv_rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
return x * inv_rms * self.weight
class SequentialBlock(nn.Module):
"""Standard sequential Pre-LN transformer block."""
def __init__(self, d_model: int, n_heads: int, d_ff: int):
super().__init__()
self.norm1 = RMSNorm(d_model)
self.attn = BiasFreeLLaMAAttention(d_model, n_heads, n_heads)
self.norm2 = RMSNorm(d_model)
self.ffn = BiasFreeSwiGLU(d_model, d_ff)
def forward(self, x: torch.Tensor) -> torch.Tensor:
h = x + self.attn(self.norm1(x))
out = h + self.ffn(self.norm2(h))
return out
class ParallelBlock(nn.Module):
"""Parallel attention + FFN block (GPT-J / PaLM style)."""
def __init__(self, d_model: int, n_heads: int, d_ff: int):
super().__init__()
self.norm = RMSNorm(d_model) # Single normalization
self.attn = BiasFreeLLaMAAttention(d_model, n_heads, n_heads)
self.ffn = BiasFreeSwiGLU(d_model, d_ff)
def forward(self, x: torch.Tensor) -> torch.Tensor:
normed = self.norm(x)
# Both branches receive the same normalized input
attn_out = self.attn(normed)
ffn_out = self.ffn(normed)
return x + attn_out + ffn_out
class FusedParallelBlock(nn.Module):
"""
Optimized parallel block that fuses input projections.
The QKV projections from attention and the gate/up projections from FFN
are combined into a single large linear layer for better GPU utilization.
"""
def __init__(self, d_model: int, n_heads: int, d_ff: int):
super().__init__()
self.d_model = d_model
self.n_heads = n_heads
self.head_dim = d_model // n_heads
self.d_ff = d_ff
self.norm = RMSNorm(d_model)
# Fused projection: Q, K, V, gate, up all in one matmul
fused_dim = 3 * d_model + 2 * d_ff
self.fused_in = nn.Linear(d_model, fused_dim, bias=False)
# Separate output projections
self.wo = nn.Linear(d_model, d_model, bias=False)
self.w_down = nn.Linear(d_ff, d_model, bias=False)
def forward(self, x: torch.Tensor) -> torch.Tensor:
bsz, seq_len, _ = x.shape
normed = self.norm(x)
# Single fused matmul for all input projections
fused = self.fused_in(normed)
q, k, v, gate, up = fused.split(
[self.d_model, self.d_model, self.d_model, self.d_ff, self.d_ff],
dim=-1
)
# Attention path
q = q.view(bsz, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
k = k.view(bsz, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
v = v.view(bsz, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
scale = self.head_dim ** -0.5
attn = torch.matmul(q, k.transpose(-2, -1)) * scale
attn = torch.softmax(attn, dim=-1)
attn_out = torch.matmul(attn, v)
attn_out = attn_out.transpose(1, 2).contiguous().view(bsz, seq_len, -1)
attn_out = self.wo(attn_out)
# FFN path (SwiGLU)
ffn_out = self.w_down(nn.functional.silu(gate) * up)
return x + attn_out + ffn_out
The FusedParallelBlock shows why parallelism enables optimization that is impossible with the sequential formulation: by combining all input projections into one large matrix multiply, we get better arithmetic intensity and GPU utilization.
Who Uses What?
| Model | Formulation | Notes |
|---|---|---|
| GPT-J (6B) | Parallel | EleutherAI, one of the first to adopt |
| GPT-NeoX (20B) | Parallel | Also EleutherAI |
| PaLM (540B) | Parallel | Google, demonstrated it works at massive scale |
| LLaMA (7-65B) | Sequential | Meta, chose stability over speed |
| LLaMA 2/3 | Sequential | Meta, continued the conservative choice |
| Mistral (7B) | Sequential | Sequential with sliding window attention |
| Falcon (40B) | Parallel | Technology Innovation Institute |
The choice between parallel and sequential often reflects an engineering philosophy. Google and EleutherAI prioritize training throughput, while Meta prioritizes reliability at scale (sequential is more thoroughly studied and easier to debug).
4. Initialization Schemes
Why Initialization Matters for Transformers
Initialization is not just a technical detail; it determines whether a deep transformer can train at all. Consider a transformer with $L$ layers. In the forward pass, the residual stream accumulates contributions from each layer:

$$x_L = x_0 + \sum_{l=1}^{L} f_l(x_{l-1})$$

If each $f_l$ has output variance $\sigma^2$, the variance of the residual stream after $L$ layers grows to approximately $L\sigma^2$. For a 96-layer model, even small per-layer contributions accumulate. Poor initialization can cause the residual stream to explode (making training numerically unstable) or collapse to noise.
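The linear growth of residual variance with depth is easy to see in a toy simulation, where each layer is modeled as adding an independent unit-variance vector (purely illustrative):

```python
import torch

torch.manual_seed(0)
n_elements = 100_000  # large sample so the variance estimate is tight

for n_layers in [12, 48, 96]:
    x = torch.zeros(n_elements)
    for _ in range(n_layers):
        x = x + torch.randn(n_elements)  # each layer adds a unit-variance output
    print(f"{n_layers:3d} layers: residual variance ~ {x.var().item():.1f}")
    # grows linearly: ~12, ~48, ~96
```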
Xavier/Glorot Initialization
Glorot and Bengio (2010) derived the initialization needed to maintain variance through a linear layer with no activation function. For a weight matrix with fan-in $n_{in}$ and fan-out $n_{out}$:
Uniform variant:

$$W \sim U\left(-\sqrt{\frac{6}{n_{in} + n_{out}}},\ \sqrt{\frac{6}{n_{in} + n_{out}}}\right)$$

Normal variant:

$$W \sim \mathcal{N}\left(0,\ \frac{2}{n_{in} + n_{out}}\right)$$

The derivation assumes linear activations and requires $\text{Var}(W) = 1/n_{in}$ (variance preservation in the forward pass) and $\text{Var}(W) = 1/n_{out}$ (variance preservation in the backward pass). The compromise $\text{Var}(W) = 2/(n_{in} + n_{out})$ balances both directions.
Kaiming/He Initialization
He et al. (2015) extended Xavier to account for ReLU activations. Since ReLU zeros out approximately half the outputs ($\mathbb{E}[\text{ReLU}(z)^2] = \frac{1}{2}\text{Var}(z)$ for zero-mean symmetric inputs), the variance halves at each layer. To compensate:

$$W \sim \mathcal{N}\left(0,\ \frac{2}{n_{in}}\right)$$

The factor of 2 in the numerator corrects for the ReLU's halving effect. For layers with other activations like GELU or SiLU, the correction factor differs slightly, but Kaiming initialization is often used as an approximation.
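The halving effect is simple to demonstrate: push unit-variance activations through a deep stack of ReLU layers and compare the two schemes. The depth and width below are arbitrary, chosen only to make the contrast visible:

```python
import torch
import torch.nn as nn

def final_std(init_fn, depth: int = 20, d: int = 1024) -> float:
    torch.manual_seed(0)
    x = torch.randn(256, d)  # unit-variance input batch
    for _ in range(depth):
        layer = nn.Linear(d, d, bias=False)
        init_fn(layer.weight)  # overwrite default init with the scheme under test
        x = torch.relu(layer(x))
    return x.std().item()

xavier = final_std(nn.init.xavier_normal_)
kaiming = final_std(lambda w: nn.init.kaiming_normal_(w, nonlinearity="relu"))

print(f"Xavier after 20 ReLU layers:  {xavier:.2e}")   # collapses toward zero
print(f"Kaiming after 20 ReLU layers: {kaiming:.2e}")  # stays near O(1)
```

With Xavier, each ReLU layer multiplies the variance by roughly 0.5, so after 20 layers the activations have shrunk by a factor of about $2^{20}$; Kaiming's factor of 2 cancels the loss.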
GPT-Style Depth-Scaled Initialization
GPT-2 introduced a practical initialization scheme that has become widely adopted:
Standard layers (embeddings, most linear projections):

$$W \sim \mathcal{N}(0,\ 0.02^2)$$

Residual output projections ($W_O$ in attention, $W_{down}$ in FFN):

$$W \sim \mathcal{N}\left(0,\ \left(\frac{0.02}{\sqrt{2N}}\right)^2\right)$$

where $N$ is the total number of transformer layers. The $\sqrt{2N}$ factor accounts for the fact that each layer contributes two residual additions (one from attention, one from FFN), and we want the total accumulated variance to remain bounded.
To see why this works, consider the variance after all layers. The residual stream receives $2N$ additive contributions, each with variance proportional to $(0.02/\sqrt{2N})^2$:

$$\text{Var}(x_{final}) \approx 2N \cdot \frac{0.02^2}{2N} = 0.02^2$$

The depth-dependent factor cancels with the number of contributions, making the total output variance independent of model depth.
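A toy simulation of the residual stream shows this depth-independence. Each branch below is a random projection of the RMS-normalized stream, scaled like a depth-scaled residual output projection; the width and layer counts are illustrative only:

```python
import torch

def residual_std(n_layers: int, d: int = 512, init_std: float = 0.02) -> float:
    torch.manual_seed(0)
    x = torch.randn(d) * init_std                 # embedding-scale start
    res_std = init_std / (2 * n_layers) ** 0.5    # depth-scaled init
    for _ in range(2 * n_layers):                 # two residual branches per layer
        normed = x * torch.rsqrt(x.pow(2).mean() + 1e-6)  # pre-norm (unit RMS)
        W = torch.randn(d, d) * res_std           # residual output projection
        x = x + normed @ W
    return x.std().item()

for n in [6, 24, 96]:
    print(f"{n:3d} layers: final residual std = {residual_std(n):.3f}")
    # roughly the same std at every depth
```

Without the $1/\sqrt{2N}$ factor, the final standard deviation would instead grow as $\sqrt{2N}$.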
Implementation of Initialization Schemes
import math
import torch
import torch.nn as nn
def init_transformer_weights(
model: nn.Module,
n_layers: int,
d_model: int,
init_std: float = 0.02,
init_method: str = "gpt2"
):
"""
Initialize transformer weights using common schemes.
Args:
model: The transformer model
n_layers: Number of transformer layers
d_model: Model hidden dimension
init_std: Base standard deviation (for GPT-2 style)
init_method: One of "xavier", "kaiming", "gpt2"
"""
residual_std = init_std / math.sqrt(2 * n_layers)
for name, param in model.named_parameters():
if param.dim() < 2:
# Skip 1D parameters (norms, biases)
continue
if init_method == "xavier":
nn.init.xavier_normal_(param)
elif init_method == "kaiming":
nn.init.kaiming_normal_(param, nonlinearity="relu")
elif init_method == "gpt2":
# Check if this is a residual output projection
is_residual_proj = any(
tag in name for tag in ["wo.", "w_down.", "out_proj.", "c_proj."]
)
if is_residual_proj:
nn.init.normal_(param, mean=0.0, std=residual_std)
else:
nn.init.normal_(param, mean=0.0, std=init_std)
# Embeddings: always use base std
if hasattr(model, "embed_tokens"):
nn.init.normal_(model.embed_tokens.weight, mean=0.0, std=init_std)
# Example: initializing a 32-layer model
class MiniLLaMA(nn.Module):
def __init__(self, vocab_size: int = 32000, d_model: int = 4096,
n_layers: int = 32, n_heads: int = 32, d_ff: int = 11008):
super().__init__()
self.embed_tokens = nn.Embedding(vocab_size, d_model)
self.layers = nn.ModuleList([
SequentialBlock(d_model, n_heads, d_ff)
for _ in range(n_layers)
])
self.norm = RMSNorm(d_model)
self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
# Apply GPT-2 style initialization
init_transformer_weights(self, n_layers, d_model, init_method="gpt2")
def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
h = self.embed_tokens(input_ids)
for layer in self.layers:
h = layer(h)
h = self.norm(h)
return self.lm_head(h)
muP: Maximal Update Parameterization
Yang et al. (2022) proposed muP (maximal update parameterization), a principled framework that allows hyperparameters tuned on a small model to transfer directly to larger models. This solves a critical practical problem: hyperparameter sweeps on a 70B model are prohibitively expensive, but sweeps on a 125M proxy model are cheap.
The core insight of muP is that different parameter groups should scale their learning rates and initialization differently based on width:
| Parameter | Standard Init | muP Init | Standard LR | muP LR |
|---|---|---|---|---|
| Embeddings | $\sigma = O(1)$ | $\sigma = O(1)$ | $\eta$ | $\eta$ |
| Hidden weights | $\sigma \propto 1/\sqrt{d}$ | $\sigma \propto 1/\sqrt{d}$ | $\eta$ | $\eta / d$ |
| Output weights | $\sigma \propto 1/\sqrt{d}$ | $\sigma \propto 1/d$ | $\eta$ | $\eta$ |
The key differences from standard parameterization:
- Hidden layer learning rates scale as $1/d$: Wider layers need smaller learning rates to keep the magnitude of parameter updates constant.
- Output layer initialization scales as $1/d$ (not $1/\sqrt{d}$): This ensures the output logits have consistent scale regardless of model width.
- Attention logits require an additional scaling: $1/d_{head}$ instead of $1/\sqrt{d_{head}}$.
In practice, muP enables "tune small, train large": run a full hyperparameter sweep on a small (128 or 256 hidden dim) model, then apply those hyperparameters directly to the full-scale model.
def mup_init_and_lr(model: nn.Module, base_width: int, target_width: int,
base_lr: float):
"""
Simplified muP: compute per-parameter learning rates based on width ratio.
Args:
model: Target model to train
base_width: Width of the proxy model used for HP tuning
target_width: Width of the target model
base_lr: Learning rate found optimal for the proxy model
"""
width_ratio = target_width / base_width
param_groups = []
for name, param in model.named_parameters():
if "embed" in name:
# Embedding: same LR as base
param_groups.append({"params": [param], "lr": base_lr})
elif "lm_head" in name or "output" in name:
# Output layer: same LR, but init scaled by 1/width
nn.init.normal_(param, std=1.0 / target_width)
param_groups.append({"params": [param], "lr": base_lr})
else:
# Hidden layers: LR scales as 1/width_ratio
param_groups.append({
"params": [param],
"lr": base_lr / width_ratio
})
return param_groups
5. Other Small But Important Details
Vocabulary Size Padding
Modern GPUs execute matrix multiplications most efficiently when dimensions are multiples of 8 (for FP16 tensor cores), 64, or 128. Since the embedding and LM head involve a matrix of shape $V \times d_{model}$, padding the vocabulary to a "nice" number yields free performance:
import math
def pad_vocab_size(vocab_size: int, padding_multiple: int = 64) -> int:
"""
Pad vocabulary size to the nearest multiple for GPU efficiency.
Common choices:
- 64: Good for most GPUs (FP16 tensor cores)
- 128: Better for A100/H100 with INT8/FP8
"""
return padding_multiple * math.ceil(vocab_size / padding_multiple)
# Examples from real models:
configs = {
"GPT-2": {"raw": 50257, "padded": pad_vocab_size(50257, 64)}, # 50257 -> 50304
"LLaMA": {"raw": 32000, "padded": pad_vocab_size(32000, 64)}, # 32000 -> 32000 (already aligned)
"Mistral": {"raw": 32000, "padded": pad_vocab_size(32000, 64)}, # 32000 -> 32000 (already aligned)
"Falcon": {"raw": 65024, "padded": pad_vocab_size(65024, 128)}, # 65024 -> 65024 (already aligned)
}
for model, cfg in configs.items():
print(f"{model}: {cfg['raw']} -> {cfg['padded']}")
The padding tokens are never used in training (they receive zero gradients), so the model quality is unaffected. The cost is a few extra parameters, which is negligible compared to the throughput gain.
Hidden Dimension Multiples
The FFN hidden dimension is chosen to satisfy multiple constraints simultaneously:
- Ratio to $d_{model}$: Traditionally $d_{ff} = 4 d_{model}$, but with SwiGLU the effective multiplier is $\frac{8}{3}$ (since SwiGLU has three weight matrices vs. two for a standard FFN, the hidden dimension is scaled by $\frac{2}{3}$, giving $\frac{2}{3} \cdot 4 = \frac{8}{3}$, to keep FLOPs comparable).
- Alignment to tensor core tile sizes: Typically rounded up to a multiple of 128 or 256.
def compute_ff_dim(d_model: int, multiplier: float = 8/3,
alignment: int = 256) -> int:
"""
Compute FFN hidden dimension following LLaMA conventions.
Steps:
1. Multiply d_model by the SwiGLU multiplier (8/3)
2. Round up to the nearest multiple of alignment
"""
raw_ff = int(multiplier * d_model)
aligned_ff = alignment * math.ceil(raw_ff / alignment)
return aligned_ff
# Reproduce LLaMA dimensions:
for d_model, expected_ff in [(4096, 11008), (5120, 13824), (8192, 22016)]:
computed = compute_ff_dim(d_model, multiplier=8/3, alignment=256)
print(f"d_model={d_model}: computed d_ff={computed}, expected={expected_ff}")
Epsilon Values in Normalization and Optimization
Small epsilon values prevent division by zero, but their exact magnitude matters for training stability:
| Context | Component | Typical Epsilon | Why |
|---|---|---|---|
| Normalization | LayerNorm | $10^{-5}$ | Standard default in PyTorch |
| Normalization | RMSNorm | $10^{-6}$ or $10^{-5}$ | Smaller because RMSNorm divides by the RMS (typically larger than the std) |
| Optimization | Adam/AdamW | $10^{-8}$ | Default; some use $10^{-6}$ for BF16 stability |
| Mixed precision | Loss scaling | N/A | Dynamic loss scaling handles FP16 underflow |
LLaMA uses $\epsilon = 10^{-6}$ for RMSNorm, while LLaMA 2 and Mistral use $10^{-5}$. With BF16 training, a slightly larger Adam epsilon ($10^{-6}$ instead of $10^{-8}$) can prevent instabilities from the reduced mantissa precision.
# Typical RMSNorm configuration
norm = RMSNorm(d_model=4096, eps=1e-6) # LLaMA default
# AdamW with BF16-friendly epsilon
optimizer = torch.optim.AdamW(
model.parameters(),
lr=3e-4,
betas=(0.9, 0.95),
weight_decay=0.1,
eps=1e-8, # Standard; use 1e-6 if you see NaN with BF16
)
Gradient Clipping
Nearly all modern LLMs clip gradient norms to prevent training instabilities:

$$g \leftarrow g \cdot \min\left(1,\ \frac{c}{\|g\|_2}\right)$$

The standard value is $c = 1.0$, used by GPT-3, LLaMA, and most others. This is applied to the global gradient norm (across all parameters), not per-parameter.
# Standard gradient clipping in training loop
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
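In a full training step, clipping sits between the backward pass and the optimizer step. A minimal sketch with a toy model (dimensions arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

x, y = torch.randn(8, 16), torch.randn(8, 4)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()                                    # 1. compute gradients
pre_clip = torch.nn.utils.clip_grad_norm_(         # 2. clip (returns the norm
    model.parameters(), max_norm=1.0)              #    measured BEFORE clipping)
optimizer.step()                                   # 3. update weights

# After clipping, the global gradient norm is at most max_norm
post_clip = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
print(f"pre-clip norm: {pre_clip:.3f}, post-clip norm: {post_clip:.3f}")
```

Note that `clip_grad_norm_` returns the pre-clipping norm, which is worth logging: a sustained rise in that value is often the first visible sign of a training instability.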
Comprehensive "Who Uses What" Comparison
The following table summarizes how major LLM architectures combine all the design choices we have discussed throughout this series:
| Feature | GPT-3 | LLaMA | LLaMA 2 | PaLM | Mistral 7B | Falcon 40B |
|---|---|---|---|---|---|---|
| Biases | Yes | No | No | No | No | Partial (QKV only) |
| Tied Embeddings | No | No | No | No | No | Yes |
| Attn+FFN | Sequential | Sequential | Sequential | Parallel | Sequential | Parallel |
| Norm Type | LayerNorm | RMSNorm | RMSNorm | RMSNorm | RMSNorm | LayerNorm |
| Norm Position | Pre-LN | Pre-LN | Pre-LN | Pre-LN | Pre-LN | Pre-LN |
| Position Encoding | Learned | RoPE | RoPE | RoPE | RoPE | RoPE (ALiBi in 180B) |
| Attention | MHA | MHA | GQA | MQA | GQA + Sliding | MQA |
| FFN Type | GELU | SwiGLU | SwiGLU | SwiGLU | SwiGLU | GELU |
| Init | GPT-style | GPT-style | GPT-style | GPT-style | GPT-style | GPT-style |
| Vocab Size | 50,257 | 32,000 | 32,000 | 256,000 | 32,000 | 65,024 |
| Vocab Padding | Yes (to 64) | No (already aligned) | Yes (to 128) | Yes (to 128) | No | Yes (to 128) |
| RMSNorm eps | N/A | 1e-6 | 1e-5 | 1e-6 | 1e-5 | N/A |
| Grad Clip | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
Several patterns emerge from this table:
- Bias removal is nearly universal in post-2022 models.
- RMSNorm has replaced LayerNorm almost everywhere.
- SwiGLU dominates as the FFN activation.
- RoPE is the de facto standard for position encoding.
- GQA is emerging as the preferred attention variant (trading off between MHA and MQA).
- The parallel vs. sequential choice remains split, with no clear winner.
Summary
None of the choices in this post are individually transformative. Removing biases saves a fraction of a percent of parameters. Untying embeddings adds a modest capacity boost. Parallel attention trades a small quality risk for throughput. Proper initialization is invisible when it works correctly but catastrophic when wrong. And dimension padding is pure engineering pragmatism.
But these details compound. A model that gets all of them right trains faster, uses memory more efficiently, remains numerically stable through longer runs, and is easier to quantize and deploy. The gap between a "good enough" implementation and a state-of-the-art one often comes down to exactly these kinds of careful engineering decisions.
If you take away one principle from this post, it is this: at the scale of modern LLMs, small inefficiencies multiply across billions of parameters and trillions of tokens. Getting the details right is not optional.
In the final post, we will explore Part 8: Alternative Architectures -- State Space Models, Mamba, Linear Attention, RWKV, and hybrid architectures that challenge the transformer paradigm itself.
References
- Glorot, X. and Bengio, Y. "Understanding the difficulty of training deep feedforward neural networks." AISTATS, 2010.
- He, K. et al. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." ICCV, 2015.
- Press, O. and Wolf, L. "Using the Output Embedding to Improve Language Models." EACL, 2017.
- Radford, A. et al. "Language Models are Unsupervised Multitask Learners." OpenAI, 2019.
- Brown, T. et al. "Language Models are Few-Shot Learners." NeurIPS, 2020.
- Wang, B. and Komatsuzaki, A. "GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model." EleutherAI, 2021.
- Black, S. et al. "GPT-NeoX-20B: An Open-Source Autoregressive Language Model." ACL Workshop on Challenges and Applications of Large Language Models, 2022.
- Chowdhery, A. et al. "PaLM: Scaling Language Modeling with Pathways." JMLR, 2023.
- Touvron, H. et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971, 2023.
- Touvron, H. et al. "LLaMA 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288, 2023.
- Yang, G. et al. "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer." arXiv:2203.03466, 2022.
- Jiang, A.Q. et al. "Mistral 7B." arXiv:2310.06825, 2023.
- Penedo, G. et al. "The RefinedWeb Dataset for Falcon LLM." arXiv:2306.01116, 2023.
Written by Suchinthaka Wanninayaka
AI/ML Researcher exploring semantic communications, diffusion models, and language model systems. Writing about deep learning from theory to production.