Throughout this series, we have examined the headline innovations: FlashAttention, Grouped-Query Attention, SwiGLU, Mixture of Experts, and advanced training and inference techniques. But modern LLMs also incorporate a collection of seemingly minor architectural choices that, taken together, meaningfully affect training stability, memory efficiency, and throughput. These are the details you encounter when reading model code and wonder "why did they do it that way?" This post answers those questions.

We will cover five topics: bias removal from linear layers, the decision to tie or untie embedding weights, parallel versus sequential attention and FFN computation, initialization schemes that prevent training collapse, and a handful of numerical and hardware-alignment tricks that every practitioner should know.

1. Bias Removal

The Role of Bias in Linear Layers

A standard linear transformation in a neural network is:

y = Wx + b

where W \in \mathbb{R}^{d_{out} \times d_{in}} is the weight matrix and b \in \mathbb{R}^{d_{out}} is the bias vector. The bias provides three things:

  1. Learnable offset: It shifts the output distribution independently of the input magnitude.
  2. Non-zero output at zero input: When x = 0, we get y = b rather than the zero vector.
  3. Function approximation flexibility: It adds extra capacity, which matters in small networks.

For decades, biases were considered an essential component of neural networks. The universal approximation theorem for single-layer networks, for instance, requires biases. So why would we remove them?

Where Biases Appear in a Transformer

A single Pre-LN transformer layer contains several linear projections, each of which could carry a bias:

| Component            | Linear Layers            | Bias Vectors              |
|----------------------|--------------------------|---------------------------|
| Multi-Head Attention | W_Q, W_K, W_V, W_O       | 4 vectors of size d_model |
| FFN (standard)       | W_1 (up), W_2 (down)     | 2 vectors                 |
| FFN (SwiGLU)         | W_gate, W_up, W_down     | 3 vectors                 |
| Total per layer      | 6-7 projections          | 6-7 bias vectors          |

Each bias vector has d_{model} parameters (or d_{ff} for the up-projections). For a model with d_{model} = 4096 and SwiGLU with d_{ff} = 11008, a single layer has:

4 \times 4096 + 2 \times 11008 + 4096 = 42{,}496 \text{ bias parameters}
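The count is easy to sanity-check in a few lines: four attention biases of size d_model, two SwiGLU biases of size d_ff, and one down-projection bias of size d_model.

```python
d_model, d_ff = 4096, 11008

attn_biases = 4 * d_model        # b_Q, b_K, b_V, b_O, each of size d_model
ffn_biases = 2 * d_ff + d_model  # b_gate, b_up (size d_ff); b_down (size d_model)
n_bias = attn_biases + ffn_biases
print(n_bias)  # 42496
```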

Why Modern LLMs Remove Biases

Argument 1: Redundancy with normalization layers.

LayerNorm includes a learnable shift parameter \beta:

\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sigma + \epsilon} + \beta

When a linear layer follows LayerNorm, we have:

y = W \cdot \text{LayerNorm}(x) + b = W\left(\gamma \odot \frac{x - \mu}{\sigma + \epsilon} + \beta\right) + b

The bias b and the shift W\beta provide overlapping functionality. They both add a learned constant to the output, making one of them partially redundant.
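The redundancy can be made concrete: expanding the expression, the constant part W\beta + b is a single learned offset, so a bias on a linear layer that follows LayerNorm adds no expressive power. A small numerical sketch (illustrative shapes, not from any model):

```python
import torch

torch.manual_seed(0)
d_in, d_out = 8, 4
x_hat = torch.randn(5, d_in)                  # stand-in for normalized activations
gamma, beta = torch.randn(d_in), torch.randn(d_in)
W, b = torch.randn(d_out, d_in), torch.randn(d_out)

# Linear-with-bias applied after LayerNorm's scale-and-shift:
y1 = (gamma * x_hat + beta) @ W.T + b

# Equivalent form: bias-free linear, with the constant folded into W @ beta + b
y2 = (gamma * x_hat) @ W.T + (W @ beta + b)

print(torch.allclose(y1, y2, atol=1e-5))  # True
```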

Argument 2: RMSNorm has no shift at all.

Most modern LLMs use RMSNorm instead of LayerNorm:

\text{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}

RMSNorm has no \beta parameter. Yet models trained with RMSNorm and no biases perform as well as or better than those with biases. This suggests that at sufficient scale, the model learns to represent arbitrary offsets through the weight matrices themselves, without explicit bias terms.

Argument 3: Memory savings compound across the system.

While individual bias vectors are small, the savings matter in the context of distributed training and inference:

| Model Scale | Layers | d_model | Bias Params | FP16 Memory |
|-------------|--------|---------|-------------|-------------|
| 7B (LLaMA)  | 32     | 4096    | ~1.4M       | ~2.7 MB     |
| 13B         | 40     | 5120    | ~2.2M       | ~4.2 MB     |
| 70B         | 80     | 8192    | ~5.6M       | ~10.7 MB    |

These numbers seem negligible, but each bias vector also requires optimizer state (two additional vectors for Adam), gradient storage during the backward pass, and communication bandwidth during distributed training. The total overhead is 4-5x the raw parameter count.

Argument 4: Simplifies quantization.

Bias terms interact awkwardly with weight quantization. When weights are quantized to INT4 or INT8, biases remain in higher precision, adding complexity to the inference kernel. Removing biases yields cleaner, more uniform compute patterns.

Implementation

import torch
import torch.nn as nn

class BiasFreeLLaMAAttention(nn.Module):
    """LLaMA-style multi-head attention with no biases anywhere."""

    def __init__(self, d_model: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.n_kv_heads = n_kv_heads
        self.head_dim = d_model // n_heads

        # All projections: bias=False
        self.wq = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, _ = x.shape

        q = self.wq(x).view(bsz, seq_len, self.n_heads, self.head_dim)
        k = self.wk(x).view(bsz, seq_len, self.n_kv_heads, self.head_dim)
        v = self.wv(x).view(bsz, seq_len, self.n_kv_heads, self.head_dim)

        # Transpose for attention: (bsz, n_heads, seq_len, head_dim)
        q = q.transpose(1, 2)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)

        # GQA: repeat KV heads to match Q heads
        if self.n_kv_heads < self.n_heads:
            n_rep = self.n_heads // self.n_kv_heads
            k = k.repeat_interleave(n_rep, dim=1)
            v = v.repeat_interleave(n_rep, dim=1)

        # Scaled dot-product attention
        scale = self.head_dim ** -0.5
        attn = torch.matmul(q, k.transpose(-2, -1)) * scale
        attn = torch.softmax(attn, dim=-1)
        out = torch.matmul(attn, v)

        out = out.transpose(1, 2).contiguous().view(bsz, seq_len, -1)
        return self.wo(out)


class BiasFreeSwiGLU(nn.Module):
    """SwiGLU FFN with no biases."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(nn.functional.silu(self.w_gate(x)) * self.w_up(x))

The key pattern is simple: every nn.Linear call includes bias=False. This single change propagates throughout the entire model.

Partial Bias: The QKV Exception

A few models take a middle path. Falcon, for instance, removes biases from all projections except the QKV layers in attention. The reasoning is that QKV biases help with rotary position encoding by providing a learnable offset before the rotation is applied. However, most recent models (LLaMA 2/3, Mistral, DeepSeek) remove biases everywhere without any degradation.

2. Tied vs. Untied Embeddings

The Two Embedding Matrices

Every language model has two large matrices related to vocabulary:

  1. Input embedding E \in \mathbb{R}^{V \times d}: Maps discrete token IDs to continuous vectors.
  2. Output projection (LM head) W_{lm} \in \mathbb{R}^{d \times V}: Maps final hidden states back to vocabulary logits.

For a vocabulary of V = 32{,}000 tokens and d = 4096:

\text{Parameters per matrix} = 32{,}000 \times 4{,}096 = 131{,}072{,}000 \approx 131\text{M}

Together they consume approximately 262M parameters, which is a significant fraction of smaller models (over 3% of a 7B model).

Weight Tying: W_{lm} = E^T

The idea of weight tying (Press and Wolf, 2017) is to share parameters between the input embedding and the output projection:

\text{logits}_t = h_t \cdot E^T

Instead of learning two independent matrices, we use the transpose of the input embedding as the output projection. The intuition is compelling: if two tokens have similar input embeddings (they appear in similar contexts), they should also receive similar output probabilities (they are likely in similar positions to be predicted).

Benefits of tying:

  • Saves V \times d parameters (131M for the example above).
  • For small models, the savings are a large fraction of the total: GPT-2 small (117M parameters, V = 50{,}257, d = 768) would need an extra ~38.6M parameters, roughly a third of the model, for an untied output projection.
  • Acts as a regularizer by forcing input and output representations to be consistent.
  • Empirically improves perplexity on smaller models.

Why modern LLMs untie embeddings:

As models scale to billions of parameters, the calculus changes:

| Model Size         | Total Params | One Embedding Matrix | Savings from Tying |
|--------------------|--------------|----------------------|--------------------|
| 117M (GPT-2 small) | 117M         | ~38.6M               | ~33%               |
| 1.3B (GPT-3 style) | 1.3B         | ~103M                | ~8%                |
| 7B (LLaMA)         | 7B           | ~131M                | ~1.9%              |
| 70B (LLaMA)        | 70B          | ~262M                | ~0.37%             |

At 7B+ scale, the parameter savings from tying are negligible. Meanwhile, untying provides a real benefit: the input embedding can specialize for understanding context (what does this token mean?) while the output projection specializes for prediction (what token should come next?). These are fundamentally different tasks, and coupling them constrains model capacity.

Implementation

import torch
import torch.nn as nn


class TiedEmbeddingModel(nn.Module):
    """GPT-2 style: input and output embeddings are shared."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

        # Tie weights: lm_head uses the same storage as embed_tokens
        self.lm_head.weight = self.embed_tokens.weight

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed_tokens(input_ids)
        # ... transformer layers ...
        logits = self.lm_head(h)  # Uses E^T internally
        return logits


class UntiedEmbeddingModel(nn.Module):
    """LLaMA style: separate input and output embeddings."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # No tying: these are independent parameter tensors

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed_tokens(input_ids)
        # ... transformer layers ...
        logits = self.lm_head(h)
        return logits
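One subtlety of the tied version: the assignment self.lm_head.weight = self.embed_tokens.weight shares the Parameter by reference, not by copy, so there is a single tensor and a single gradient buffer. A quick standalone check with toy sizes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(100, 16)
head = nn.Linear(16, 100, bias=False)
head.weight = emb.weight                     # tie by reference

# Same storage: no extra 100x16 parameters were allocated
print(head.weight.data_ptr() == emb.weight.data_ptr())  # True

# One gradient buffer: a backward pass through the head populates the
# embedding's grad as well, because they are the same Parameter
loss = head(torch.randn(1, 16)).sum()
loss.backward()
print(emb.weight.grad is head.weight.grad)   # True
```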

A subtle but important point: when using tied embeddings with distributed training (e.g., tensor parallelism), the shared weight requires careful handling. Since the embedding table is typically partitioned along the vocabulary dimension while other linear layers are partitioned along hidden dimensions, tying creates a dependency that complicates the parallelism strategy. This is another practical reason modern distributed systems prefer untied embeddings.

3. Parallel Attention and FFN

Sequential: The Standard Approach

In the original transformer and most current models, attention and FFN are applied sequentially within each layer. Using Pre-LN ordering:

h = x + \text{Attn}(\text{Norm}_1(x))
x' = h + \text{FFN}(\text{Norm}_2(h))

The FFN sees the output of attention (plus the residual), so it can condition on what attention computed. This creates a strict dependency: FFN cannot begin until attention completes.

Parallel: Computing Both at Once

The parallel formulation, introduced in GPT-J and adopted by PaLM, runs attention and FFN simultaneously on the same normalized input:

x' = x + \text{Attn}(\text{Norm}(x)) + \text{FFN}(\text{Norm}(x))

Notice three key differences from the sequential version:

  1. Only one normalization layer is applied (not two).
  2. Both Attn and FFN receive the same normalized input.
  3. Both outputs are added to the residual simultaneously.

Why Parallel Works

At first glance, this seems like it should be strictly worse: the FFN can no longer condition on attention output. But the empirical results tell a different story. Chowdhery et al. (2022) found that at sufficient scale (8B+ parameters), parallel blocks match the quality of sequential blocks.

The theoretical justification is that at large width, the attention and FFN contributions are both small perturbations to the residual stream. Since each contribution is small relative to the residual itself, whether they are computed sequentially or in parallel makes little difference. Formally, if the residual stream has magnitude \|x\| \gg \|\text{Attn}(\cdot)\| and \|x\| \gg \|\text{FFN}(\cdot)\|, then:

x + \text{Attn}(\text{Norm}(x)) + \text{FFN}(\text{Norm}(x)) \approx x + \text{Attn}(\text{Norm}(x)) + \text{FFN}(\text{Norm}(x + \text{Attn}(\text{Norm}(x))))

because the FFN input changes by a small amount relative to the residual magnitude.
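The approximation is easy to probe numerically. Below, the attention and FFN branches are stood in for by small random linear maps whose outputs are roughly 1% of the residual magnitude (an assumed scale, purely illustrative); the parallel and sequential outputs then differ only at about the 1e-4 level.

```python
import torch

torch.manual_seed(0)
d = 512
x = torch.randn(d)

# Stand-ins for Attn(Norm(.)) and FFN(Norm(.)): linear maps scaled so that
# ||f(x)|| is roughly 1% of ||x|| (a hypothetical per-layer contribution size)
A = 0.01 * torch.randn(d, d) / d ** 0.5
B = 0.01 * torch.randn(d, d) / d ** 0.5
f = lambda v: A @ v   # "attention" branch
g = lambda v: B @ v   # "FFN" branch

parallel = x + f(x) + g(x)
sequential = x + f(x) + g(x + f(x))

# The difference is g(x + f(x)) - g(x) = B f(x), which is second-order small
rel_diff = (parallel - sequential).norm() / sequential.norm()
print(f"relative difference: {rel_diff:.1e}")
```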

Performance Advantages

The parallel formulation offers concrete performance benefits:

  1. Reduced normalization cost: One RMSNorm instead of two saves compute and memory.
  2. Fused kernels: The attention and FFN input projections can be fused into a single large matrix multiply, improving GPU utilization.
  3. Better pipeline parallelism: With two independent compute paths, the workload can be split more evenly across devices.

PaLM reported approximately 15% wall-clock speedup from the parallel formulation at their training scale.

Implementation

import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class SequentialBlock(nn.Module):
    """Standard sequential Pre-LN transformer block."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = RMSNorm(d_model)
        self.attn = BiasFreeLLaMAAttention(d_model, n_heads, n_heads)
        self.norm2 = RMSNorm(d_model)
        self.ffn = BiasFreeSwiGLU(d_model, d_ff)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x + self.attn(self.norm1(x))
        out = h + self.ffn(self.norm2(h))
        return out


class ParallelBlock(nn.Module):
    """Parallel attention + FFN block (GPT-J / PaLM style)."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm = RMSNorm(d_model)  # Single normalization
        self.attn = BiasFreeLLaMAAttention(d_model, n_heads, n_heads)
        self.ffn = BiasFreeSwiGLU(d_model, d_ff)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        normed = self.norm(x)
        # Both branches receive the same normalized input
        attn_out = self.attn(normed)
        ffn_out = self.ffn(normed)
        return x + attn_out + ffn_out


class FusedParallelBlock(nn.Module):
    """
    Optimized parallel block that fuses input projections.
    The QKV projections from attention and the gate/up projections from FFN
    are combined into a single large linear layer for better GPU utilization.
    """

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.d_ff = d_ff

        self.norm = RMSNorm(d_model)

        # Fused projection: Q, K, V, gate, up all in one matmul
        fused_dim = 3 * d_model + 2 * d_ff
        self.fused_in = nn.Linear(d_model, fused_dim, bias=False)

        # Separate output projections
        self.wo = nn.Linear(d_model, d_model, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, _ = x.shape
        normed = self.norm(x)

        # Single fused matmul for all input projections
        fused = self.fused_in(normed)
        q, k, v, gate, up = fused.split(
            [self.d_model, self.d_model, self.d_model, self.d_ff, self.d_ff],
            dim=-1
        )

        # Attention path
        q = q.view(bsz, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(bsz, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(bsz, seq_len, self.n_heads, self.head_dim).transpose(1, 2)

        scale = self.head_dim ** -0.5
        attn = torch.matmul(q, k.transpose(-2, -1)) * scale
        attn = torch.softmax(attn, dim=-1)
        attn_out = torch.matmul(attn, v)
        attn_out = attn_out.transpose(1, 2).contiguous().view(bsz, seq_len, -1)
        attn_out = self.wo(attn_out)

        # FFN path (SwiGLU)
        ffn_out = self.w_down(nn.functional.silu(gate) * up)

        return x + attn_out + ffn_out

The FusedParallelBlock shows why parallelism enables optimization that is impossible with the sequential formulation: by combining all input projections into one large matrix multiply, we get better arithmetic intensity and GPU utilization.

Who Uses What?

| Model          | Formulation | Notes                                          |
|----------------|-------------|------------------------------------------------|
| GPT-J (6B)     | Parallel    | EleutherAI, one of the first to adopt          |
| GPT-NeoX (20B) | Parallel    | Also EleutherAI                                |
| PaLM (540B)    | Parallel    | Google, demonstrated it works at massive scale |
| LLaMA (7-65B)  | Sequential  | Meta, chose stability over speed               |
| LLaMA 2/3      | Sequential  | Meta, continued the conservative choice        |
| Mistral (7B)   | Sequential  | Sequential with sliding window attention       |
| Falcon (40B)   | Parallel    | Technology Innovation Institute                |

The choice between parallel and sequential often reflects an engineering philosophy. Google and EleutherAI prioritize training throughput, while Meta prioritizes reliability at scale (sequential is more thoroughly studied and easier to debug).

4. Initialization Schemes

Why Initialization Matters for Transformers

Initialization is not just a technical detail; it determines whether a deep transformer can train at all. Consider a transformer with L layers. In the forward pass, the residual stream accumulates contributions from each layer:

x_L = x_0 + \sum_{l=1}^{L} f_l(x_{l-1})

If each f_l has output variance \sigma^2, the variance of the residual stream after L layers grows to approximately \sigma_0^2 + L \cdot \sigma^2. For a 96-layer model, even small per-layer contributions accumulate. Poor initialization can cause the residual stream to explode (making training numerically unstable) or collapse to noise.
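A two-line simulation makes the accumulation visible: model each layer's contribution as i.i.d. noise with per-coordinate std \sigma = 0.1 (an illustrative value) added to the residual stream, and the variance grows from 1 to roughly 1 + L\sigma^2 = 1.96 over 96 layers.

```python
import torch

torch.manual_seed(0)
d, L, sigma = 4096, 96, 0.1

x = torch.randn(d)                    # residual stream, per-coordinate variance ~1
for _ in range(L):
    x = x + sigma * torch.randn(d)    # each layer writes a small contribution

print(f"final variance: {x.var().item():.2f}")  # ~1 + L * sigma^2 = 1.96
```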

Xavier/Glorot Initialization

Glorot and Bengio (2010) derived the initialization needed to maintain variance through a linear layer with no activation function. For a weight matrix with fan-in n_{in} and fan-out n_{out}:

Uniform variant:

W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in} + n_{out}}},\ \sqrt{\frac{6}{n_{in} + n_{out}}}\right)

Normal variant:

W \sim \mathcal{N}\left(0,\ \frac{2}{n_{in} + n_{out}}\right)

The derivation assumes linear activations and requires \text{Var}(y) = \text{Var}(x) (variance preservation in the forward pass) and \text{Var}(\nabla x) = \text{Var}(\nabla y) (variance preservation in the backward pass). The compromise \frac{2}{n_{in} + n_{out}} balances both directions.
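The variance-preservation property is straightforward to verify empirically. With Xavier-normal weights and a square layer (n_{in} = n_{out}), the output variance of a linear map matches the input variance:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_in, n_out = 1024, 1024

W = torch.empty(n_out, n_in)
nn.init.xavier_normal_(W)      # Var(W_ij) = 2 / (n_in + n_out) = 1 / 1024

x = torch.randn(4096, n_in)    # unit-variance inputs
y = x @ W.T

# Var(y) = n_in * Var(W) * Var(x) = 1024 * (1/1024) * 1 = 1
print(f"input var: {x.var():.3f}, output var: {y.var():.3f}")
```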

Kaiming/He Initialization

He et al. (2015) extended Xavier to account for ReLU activations. Since ReLU zeros out approximately half the outputs (\mathbb{E}[\text{ReLU}(x)^2] = \frac{1}{2}\text{Var}(x) for symmetric inputs), the variance halves at each layer. To compensate:

W \sim \mathcal{N}\left(0,\ \frac{2}{n_{in}}\right)

The factor of 2 in the numerator corrects for the ReLU's halving effect. For layers with other activations like GELU or SiLU, the correction factor differs slightly (approximately \frac{2}{1.7^2 \cdot n_{in}} for GELU), but Kaiming initialization is often used as an approximation.
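A quick experiment shows why the factor of 2 matters: stacking ten ReLU layers, Kaiming-initialized weights keep the activation scale roughly constant, while Xavier-initialized weights shrink it by roughly (1/2)^{10}.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, depth = 1024, 10
x0 = torch.randn(4096, d)

def run(init_fn):
    x = x0
    for _ in range(depth):
        W = torch.empty(d, d)
        init_fn(W)
        x = torch.relu(x @ W.T)
    return x.pow(2).mean().item()   # second moment of the activations

he = run(lambda W: nn.init.kaiming_normal_(W, nonlinearity="relu"))
xavier = run(nn.init.xavier_normal_)

# He stays near 1; Xavier collapses toward zero through the ReLU stack
print(f"after {depth} ReLU layers: He {he:.3f}, Xavier {xavier:.6f}")
```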

GPT-Style Depth-Scaled Initialization

GPT-2 introduced a practical initialization scheme that has become widely adopted:

Standard layers (embeddings, most linear projections):

W \sim \mathcal{N}(0,\ 0.02^2)

Residual output projections (W_O in attention, W_{down} in FFN):

W \sim \mathcal{N}\left(0,\ \left(\frac{0.02}{\sqrt{2L}}\right)^2\right)

where L is the total number of transformer layers. The \sqrt{2L} factor accounts for the fact that each layer contributes two residual additions (one from attention, one from FFN), and we want the total accumulated variance to remain bounded.

To see why this works, consider the variance after all layers:

\text{Var}(x_L) = \text{Var}(x_0) + 2L \cdot \left(\frac{0.02}{\sqrt{2L}}\right)^2 \cdot d = \text{Var}(x_0) + 0.02^2 \cdot d

The depth-dependent factor cancels with the number of layers, making the total output variance independent of model depth.
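The cancellation can be checked with a quick simulation: treat each residual write as i.i.d. noise with per-coordinate std 0.02/\sqrt{2L} (dropping the fan-in factor for simplicity) and compare two very different depths; the final variance comes out the same.

```python
import math
import torch

torch.manual_seed(0)
d = 8192
final_var = {}

for L in (12, 96):
    std = 0.02 / math.sqrt(2 * L)   # depth-scaled std for residual projections
    x = torch.zeros(d)
    for _ in range(2 * L):          # two residual writes per layer (attn + FFN)
        x = x + std * torch.randn(d)
    final_var[L] = x.var().item()
    # 2L * (0.02 / sqrt(2L))^2 = 0.02^2 = 4e-4, independent of depth
    print(f"L={L:3d}: final variance {final_var[L]:.2e}")
```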

Implementation of Initialization Schemes

import math
import torch
import torch.nn as nn


def init_transformer_weights(
    model: nn.Module,
    n_layers: int,
    d_model: int,
    init_std: float = 0.02,
    init_method: str = "gpt2"
):
    """
    Initialize transformer weights using common schemes.

    Args:
        model: The transformer model
        n_layers: Number of transformer layers
        d_model: Model hidden dimension
        init_std: Base standard deviation (for GPT-2 style)
        init_method: One of "xavier", "kaiming", "gpt2"
    """
    residual_std = init_std / math.sqrt(2 * n_layers)

    for name, param in model.named_parameters():
        if param.dim() < 2:
            # Skip 1D parameters (norms, biases)
            continue

        if init_method == "xavier":
            nn.init.xavier_normal_(param)

        elif init_method == "kaiming":
            nn.init.kaiming_normal_(param, nonlinearity="relu")

        elif init_method == "gpt2":
            # Check if this is a residual output projection
            is_residual_proj = any(
                tag in name for tag in ["wo.", "w_down.", "out_proj.", "c_proj."]
            )

            if is_residual_proj:
                nn.init.normal_(param, mean=0.0, std=residual_std)
            else:
                nn.init.normal_(param, mean=0.0, std=init_std)

    # Embeddings: always use base std
    if hasattr(model, "embed_tokens"):
        nn.init.normal_(model.embed_tokens.weight, mean=0.0, std=init_std)


# Example: initializing a 32-layer model
class MiniLLaMA(nn.Module):
    def __init__(self, vocab_size: int = 32000, d_model: int = 4096,
                 n_layers: int = 32, n_heads: int = 32, d_ff: int = 11008):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList([
            SequentialBlock(d_model, n_heads, d_ff)
            for _ in range(n_layers)
        ])
        self.norm = RMSNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

        # Apply GPT-2 style initialization
        init_transformer_weights(self, n_layers, d_model, init_method="gpt2")

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed_tokens(input_ids)
        for layer in self.layers:
            h = layer(h)
        h = self.norm(h)
        return self.lm_head(h)

muP: Maximal Update Parameterization

Yang et al. (2022) proposed muP (maximal update parameterization), a principled framework that allows hyperparameters tuned on a small model to transfer directly to larger models. This solves a critical practical problem: hyperparameter sweeps on a 70B model are prohibitively expensive, but sweeps on a 125M proxy model are cheap.

The core insight of muP is that different parameter groups should scale their learning rates and initialization differently based on width:

| Parameter      | Standard Init   | muP Init        | Standard LR | muP LR        |
|----------------|-----------------|-----------------|-------------|---------------|
| Embeddings     | \sigma          | \sigma          | \eta        | \eta          |
| Hidden weights | 1/\sqrt{n_{in}} | 1/\sqrt{n_{in}} | \eta        | \eta / n_{in} |
| Output weights | 1/\sqrt{n_{in}} | 1/n_{in}        | \eta        | \eta          |

The key differences from standard parameterization:

  1. Hidden layer learning rates scale as 1/\text{width}: Wider layers need smaller learning rates to keep the magnitude of parameter updates constant.
  2. Output layer initialization scales as 1/\text{width} (not 1/\sqrt{\text{width}}): This ensures the output logits have consistent scale regardless of model width.
  3. Attention logits require an additional 1/d scaling (instead of 1/\sqrt{d}).

In practice, muP enables "tune small, train large": run a full hyperparameter sweep on a small (128 or 256 hidden dim) model, then apply those hyperparameters directly to the full-scale model.

def mup_init_and_lr(model: nn.Module, base_width: int, target_width: int,
                    base_lr: float):
    """
    Simplified muP: compute per-parameter learning rates based on width ratio.

    Args:
        model: Target model to train
        base_width: Width of the proxy model used for HP tuning
        target_width: Width of the target model
        base_lr: Learning rate found optimal for the proxy model
    """
    width_ratio = target_width / base_width
    param_groups = []

    for name, param in model.named_parameters():
        if "embed" in name:
            # Embedding: same LR as base
            param_groups.append({"params": [param], "lr": base_lr})
        elif "lm_head" in name or "output" in name:
            # Output layer: same LR, but init scaled by 1/width
            nn.init.normal_(param, std=1.0 / target_width)
            param_groups.append({"params": [param], "lr": base_lr})
        else:
            # Hidden layers: LR scales as 1/width_ratio
            param_groups.append({
                "params": [param],
                "lr": base_lr / width_ratio
            })

    return param_groups

5. Other Small But Important Details

Vocabulary Size Padding

Modern GPUs execute matrix multiplications most efficiently when dimensions are multiples of 8 (for FP16 tensor cores), 64, or 128. Since the embedding and LM head involve a matrix of shape (\text{vocab\_size}, d_{model}), padding the vocabulary to a "nice" number yields free performance:

import math


def pad_vocab_size(vocab_size: int, padding_multiple: int = 64) -> int:
    """
    Pad vocabulary size to the nearest multiple for GPU efficiency.

    Common choices:
      - 64: Good for most GPUs (FP16 tensor cores)
      - 128: Better for A100/H100 with INT8/FP8
    """
    return padding_multiple * math.ceil(vocab_size / padding_multiple)


# Examples from real models:
configs = {
    "GPT-2":   {"raw": 50257, "padded": pad_vocab_size(50257, 64)},    # 50257 -> 50304
    "LLaMA":   {"raw": 32000, "padded": pad_vocab_size(32000, 64)},    # 32000 -> 32000 (already aligned)
    "LLaMA-2": {"raw": 32000, "padded": pad_vocab_size(32000, 128)},   # 32000 -> 32000 (already a multiple of 128)
    "Mistral": {"raw": 32000, "padded": pad_vocab_size(32000, 64)},    # 32000 -> 32000 (already aligned)
    "Falcon":  {"raw": 65024, "padded": pad_vocab_size(65024, 128)},   # 65024 -> 65024 (already aligned)
}

for model, cfg in configs.items():
    print(f"{model}: {cfg['raw']} -> {cfg['padded']}")

The padded token IDs never occur in the training data, so the model simply learns to assign them vanishing probability; model quality is unaffected. The cost is a few extra parameters, which is negligible compared to the throughput gain.

Hidden Dimension Multiples

The FFN hidden dimension dffd_{ff} is chosen to satisfy multiple constraints simultaneously:

  1. Ratio to d_{model}: Traditionally 4 \times d_{model}, but with SwiGLU the effective multiplier is \frac{8}{3} (since SwiGLU has three matrices vs. two for a standard FFN, the dimension is reduced to keep FLOPs comparable).
  2. Alignment to tensor core tile sizes: Must be divisible by 128 or 256.

import math


def compute_ff_dim(d_model: int, multiplier: float = 8/3,
                   alignment: int = 256) -> int:
    """
    Compute FFN hidden dimension following LLaMA conventions.

    Steps:
      1. Multiply d_model by the SwiGLU multiplier (8/3)
      2. Round up to the nearest multiple of alignment
    """
    raw_ff = int(multiplier * d_model)
    aligned_ff = alignment * math.ceil(raw_ff / alignment)
    return aligned_ff


# Reproduce LLaMA dimensions:
for d_model, expected_ff in [(4096, 11008), (5120, 13824), (8192, 22016)]:
    computed = compute_ff_dim(d_model, multiplier=8/3, alignment=256)
    print(f"d_model={d_model}: computed d_ff={computed}, expected={expected_ff}")

Epsilon Values in Normalization and Optimization

Small epsilon values prevent division by zero, but their exact magnitude matters for training stability:

| Context         | Component    | Typical Epsilon    | Why                                                          |
|-----------------|--------------|--------------------|--------------------------------------------------------------|
| Normalization   | LayerNorm    | 1e-5               | Standard default in PyTorch                                  |
| Normalization   | RMSNorm      | 1e-6 or 1e-5       | Smaller because RMSNorm divides by the RMS (typically larger than the std) |
| Optimization    | Adam/AdamW   | 1e-8               | Default; some use 1e-6 for BF16 stability                    |
| Mixed precision | Loss scaling | N/A                | Dynamic loss scaling handles FP16 underflow                  |

LLaMA uses \epsilon = 10^{-6} for RMSNorm, while some earlier models use 10^{-5}. With BF16 training, a slightly larger Adam epsilon (10^{-6} instead of 10^{-8}) can prevent instabilities from the reduced mantissa precision.

# Typical RMSNorm configuration
norm = RMSNorm(d_model=4096, eps=1e-6)  # LLaMA default

# AdamW with BF16-friendly epsilon
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
    eps=1e-8,  # Standard; use 1e-6 if you see NaN with BF16
)

Gradient Clipping

Nearly all modern LLMs clip gradient norms to prevent training instabilities:

g \leftarrow g \cdot \min\left(1,\ \frac{\text{max\_norm}}{\|g\|}\right)

The standard value is \text{max\_norm} = 1.0, used by GPT-3, LLaMA, and most others. Clipping is applied to the global gradient norm (across all parameters), not per-parameter.

# Standard gradient clipping in training loop
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
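A toy demonstration (a hypothetical model and loss, chosen only to inflate the gradients) confirms that it is the global norm that ends up at most 1.0, and that clip_grad_norm_ returns the pre-clip norm:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 8)
loss = 100.0 * model(torch.randn(4, 8)).pow(2).sum()   # deliberately huge loss
loss.backward()

# clip_grad_norm_ returns the global norm *before* clipping
pre_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Recompute the global norm after clipping: it is now ~1.0
post_norm = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()])
)
print(f"before: {pre_norm:.1f}, after: {post_norm:.4f}")
```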

Comprehensive "Who Uses What" Comparison

The following table summarizes how major LLM architectures combine all the design choices we have discussed throughout this series:

| Feature           | GPT-3       | LLaMA                | LLaMA 2              | PaLM                 | Mistral 7B    | Falcon 40B           |
|-------------------|-------------|----------------------|----------------------|----------------------|---------------|----------------------|
| Biases            | Yes         | No                   | No                   | No                   | No            | Partial (QKV only)   |
| Tied Embeddings   | Yes         | No                   | No                   | Yes                  | No            | Yes                  |
| Attn+FFN          | Sequential  | Sequential           | Sequential           | Parallel             | Sequential    | Parallel             |
| Norm Type         | LayerNorm   | RMSNorm              | RMSNorm              | RMSNorm              | RMSNorm       | LayerNorm            |
| Norm Position     | Pre-LN      | Pre-LN               | Pre-LN               | Pre-LN               | Pre-LN        | Pre-LN               |
| Position Encoding | Learned     | RoPE                 | RoPE                 | RoPE                 | RoPE          | RoPE (ALiBi in 180B) |
| Attention         | MHA         | MHA                  | GQA                  | MQA                  | GQA + Sliding | MQA                  |
| FFN Type          | GELU        | SwiGLU               | SwiGLU               | SwiGLU               | SwiGLU        | GELU                 |
| Init              | GPT-style   | GPT-style            | GPT-style            | GPT-style            | GPT-style     | GPT-style            |
| Vocab Size        | 50,257      | 32,000               | 32,000               | 256,000              | 32,000        | 65,024               |
| Vocab Padding     | Yes (to 64) | No (already aligned) | No (already aligned) | No (already aligned) | No            | Yes (to 128)         |
| RMSNorm eps       | N/A         | 1e-6                 | 1e-5                 | 1e-6                 | 1e-5          | N/A                  |
| Grad Clip         | 1.0         | 1.0                  | 1.0                  | 1.0                  | 1.0           | 1.0                  |

Several patterns emerge from this table:

  1. Bias removal is nearly universal in post-2022 models.
  2. RMSNorm has replaced LayerNorm almost everywhere.
  3. SwiGLU dominates as the FFN activation.
  4. RoPE is the de facto standard for position encoding.
  5. GQA is emerging as the preferred attention variant (trading off between MHA and MQA).
  6. The parallel vs. sequential choice remains split, with no clear winner.

Summary

None of the choices in this post are individually transformative. Removing biases saves a fraction of a percent of parameters. Untying embeddings adds a modest capacity boost. Parallel attention trades a small quality risk for throughput. Proper initialization is invisible when it works correctly but catastrophic when wrong. And dimension padding is pure engineering pragmatism.

But these details compound. A model that gets all of them right trains faster, uses memory more efficiently, remains numerically stable through longer runs, and is easier to quantize and deploy. The gap between a "good enough" implementation and a state-of-the-art one often comes down to exactly these kinds of careful engineering decisions.

If you take away one principle from this post, it is this: at the scale of modern LLMs, small inefficiencies multiply across billions of parameters and trillions of tokens. Getting the details right is not optional.


In the final post, we will explore Part 8: Alternative Architectures -- State Space Models, Mamba, Linear Attention, RWKV, and hybrid architectures that challenge the transformer paradigm itself.

References

  • Glorot, X. and Bengio, Y. "Understanding the difficulty of training deep feedforward neural networks." AISTATS, 2010.
  • He, K. et al. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." ICCV, 2015.
  • Press, O. and Wolf, L. "Using the Output Embedding to Improve Language Models." EACL, 2017.
  • Radford, A. et al. "Language Models are Unsupervised Multitask Learners." OpenAI, 2019.
  • Brown, T. et al. "Language Models are Few-Shot Learners." NeurIPS, 2020.
  • Wang, B. and Komatsuzaki, A. "GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model." EleutherAI, 2021.
  • Black, S. et al. "GPT-NeoX-20B: An Open-Source Autoregressive Language Model." ACL Workshop on Challenges and Applications of Large Language Models, 2022.
  • Chowdhery, A. et al. "PaLM: Scaling Language Modeling with Pathways." JMLR, 2023.
  • Touvron, H. et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971, 2023.
  • Touvron, H. et al. "LLaMA 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288, 2023.
  • Yang, G. et al. "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer." arXiv:2203.03466, 2022.
  • Jiang, A.Q. et al. "Mistral 7B." arXiv:2310.06825, 2023.
  • Penedo, G. et al. "The RefinedWeb Dataset for Falcon LLM." arXiv:2306.01116, 2023.

Written by Suchinthaka Wanninayaka

AI/ML Researcher exploring semantic communications, diffusion models, and language model systems. Writing about deep learning from theory to production.
