Throughout this series, we have examined the headline innovations: FlashAttention, Grouped-Query Attention, SwiGLU, Mixture of Experts, and advanced training and inference techniques. But modern LLMs also incorporate a collection of seemingly minor architectural choices that, taken together, meaningfully affect training stability, memory efficiency, and throughput. These are the details you encounter when reading model code and wonder "why did they do it that way?" This post answers those questions.

We will cover five topics: bias removal from linear layers, the decision to tie or untie embedding weights, parallel versus sequential attention and FFN computation, initialization schemes that prevent training collapse, and a handful of numerical and hardware-alignment tricks that every practitioner should know.

1. Bias Removal

The Role of Bias in Linear Layers

A standard linear transformation in a neural network is:

y = Wx + b

where W \in \mathbb{R}^{d_{out} \times d_{in}} is the weight matrix and b \in \mathbb{R}^{d_{out}} is the bias vector. The bias provides three things:

  1. Learnable offset: It shifts the output distribution independently of the input magnitude.
  2. Non-zero output at zero input: When x = 0, we get y = b rather than the zero vector.
  3. Function approximation flexibility: It adds extra capacity, which matters in small networks.

For decades, biases were considered an essential component of neural networks. The universal approximation theorem for single-layer networks, for instance, requires biases. So why would we remove them?

Where Biases Appear in a Transformer

A single Pre-LN transformer layer contains several linear projections, each of which could carry a bias:

| Component            | Linear Layers            | Bias Vectors              |
|----------------------|--------------------------|---------------------------|
| Multi-Head Attention | W_Q, W_K, W_V, W_O       | 4 vectors of size d_model |
| FFN (standard)       | W_1 (up), W_2 (down)     | 2 vectors                 |
| FFN (SwiGLU)         | W_gate, W_up, W_down     | 3 vectors                 |
| Total per layer      | 6-7 projections          | 6-7 bias vectors          |

Each bias vector has d_{model} parameters (or d_{ff} for the up-projections). For a model with d_{model} = 4096 and SwiGLU with d_{ff} = 11008, a single layer has:

4 \times 4096 + 2 \times 11008 + 4096 = 42{,}496 \text{ bias parameters}
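The count is easy to sanity-check in a few lines: four attention biases of size d_model, two SwiGLU biases of size d_ff, and one down-projection bias of size d_model.

```python
d_model, d_ff = 4096, 11008

attn_biases = 4 * d_model        # b_Q, b_K, b_V, b_O, each of size d_model
ffn_biases = 2 * d_ff + d_model  # b_gate, b_up (size d_ff); b_down (size d_model)
n_bias = attn_biases + ffn_biases
print(n_bias)  # 42496
```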

Why Modern LLMs Remove Biases

Argument 1: Redundancy with normalization layers.

LayerNorm includes a learnable shift parameter \beta:

\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sigma + \epsilon} + \beta

When a linear layer follows LayerNorm, we have:

y = W \cdot \text{LayerNorm}(x) + b = W\left(\gamma \odot \frac{x - \mu}{\sigma + \epsilon} + \beta\right) + b

The bias b and the shift W\beta provide overlapping functionality. They both add a learned constant to the output, making one of them partially redundant.
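The redundancy can be made concrete: expanding the expression, the constant part W\beta + b is a single learned offset, so a bias on a linear layer that follows LayerNorm adds no expressive power. A small numerical sketch (illustrative shapes, not from any model):

```python
import torch

torch.manual_seed(0)
d_in, d_out = 8, 4
x_hat = torch.randn(5, d_in)                  # stand-in for normalized activations
gamma, beta = torch.randn(d_in), torch.randn(d_in)
W, b = torch.randn(d_out, d_in), torch.randn(d_out)

# Linear-with-bias applied after LayerNorm's scale-and-shift:
y1 = (gamma * x_hat + beta) @ W.T + b

# Equivalent form: bias-free linear, with the constant folded into W @ beta + b
y2 = (gamma * x_hat) @ W.T + (W @ beta + b)

print(torch.allclose(y1, y2, atol=1e-5))  # True
```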

Argument 2: RMSNorm has no shift at all.

Most modern LLMs use RMSNorm instead of LayerNorm:

\text{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}

RMSNorm has no \beta parameter. Yet models trained with RMSNorm and no biases perform as well as or better than those with biases. This suggests that at sufficient scale, the model learns to represent arbitrary offsets through the weight matrices themselves, without explicit bias terms.

Argument 3: Memory savings compound across the system.

While individual bias vectors are small, the savings matter in the context of distributed training and inference:

| Model Scale | Layers | d_model | Bias Params | FP16 Memory |
|-------------|--------|---------|-------------|-------------|
| 7B (LLaMA)  | 32     | 4096    | ~1.4M       | ~2.7 MB     |
| 13B         | 40     | 5120    | ~2.2M       | ~4.2 MB     |
| 70B         | 80     | 8192    | ~5.6M       | ~10.7 MB    |

These numbers seem negligible, but each bias vector also requires optimizer state (two additional vectors for Adam), gradient storage during the backward pass, and communication bandwidth during distributed training. The total overhead is 4-5x the raw parameter count.

Argument 4: Simplifies quantization.

Bias terms interact awkwardly with weight quantization. When weights are quantized to INT4 or INT8, biases remain in higher precision, adding complexity to the inference kernel. Removing biases yields cleaner, more uniform compute patterns.

Implementation

import torch
import torch.nn as nn

class BiasFreeLLaMAAttention(nn.Module):
    """LLaMA-style multi-head attention with no biases anywhere."""

    def __init__(self, d_model: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.n_kv_heads = n_kv_heads
        self.head_dim = d_model // n_heads

        # All projections: bias=False
        self.wq = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, _ = x.shape

        q = self.wq(x).view(bsz, seq_len, self.n_heads, self.head_dim)
        k = self.wk(x).view(bsz, seq_len, self.n_kv_heads, self.head_dim)
        v = self.wv(x).view(bsz, seq_len, self.n_kv_heads, self.head_dim)

        # Transpose for attention: (bsz, n_heads, seq_len, head_dim)
        q = q.transpose(1, 2)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)

        # GQA: repeat KV heads to match Q heads
        if self.n_kv_heads < self.n_heads:
            n_rep = self.n_heads // self.n_kv_heads
            k = k.repeat_interleave(n_rep, dim=1)
            v = v.repeat_interleave(n_rep, dim=1)

        # Scaled dot-product attention
        scale = self.head_dim ** -0.5
        attn = torch.matmul(q, k.transpose(-2, -1)) * scale
        attn = torch.softmax(attn, dim=-1)
        out = torch.matmul(attn, v)

        out = out.transpose(1, 2).contiguous().view(bsz, seq_len, -1)
        return self.wo(out)


class BiasFreeSwiGLU(nn.Module):
    """SwiGLU FFN with no biases."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(nn.functional.silu(self.w_gate(x)) * self.w_up(x))

The key pattern is simple: every nn.Linear call includes bias=False. This single change propagates throughout the entire model.

Partial Bias: The QKV Exception

A few models take a middle path. Falcon, for instance, removes biases from all projections except the QKV layers in attention. The reasoning is that QKV biases help with rotary position encoding by providing a learnable offset before the rotation is applied. However, most recent models (LLaMA 2/3, Mistral, DeepSeek) remove biases everywhere without any degradation.

2. Tied vs. Untied Embeddings

The Two Embedding Matrices

Every language model has two large matrices related to vocabulary:

  1. Input embedding E \in \mathbb{R}^{V \times d}: Maps discrete token IDs to continuous vectors.
  2. Output projection (LM head) W_{lm} \in \mathbb{R}^{d \times V}: Maps final hidden states back to vocabulary logits.

For a vocabulary of V = 32{,}000 tokens and d = 4096:

\text{Parameters per matrix} = 32{,}000 \times 4{,}096 = 131{,}072{,}000 \approx 131\text{M}

Together they consume approximately 262M parameters, which is a significant fraction of smaller models (over 3% of a 7B model).

Weight Tying: W_{lm} = E^T

The idea of weight tying (Press and Wolf, 2017) is to share parameters between the input embedding and the output projection:

\text{logits}_t = h_t \cdot E^T

Instead of learning two independent matrices, we use the transpose of the input embedding as the output projection. The intuition is compelling: if two tokens have similar input embeddings (they appear in similar contexts), they should also receive similar output probabilities (they are likely in similar positions to be predicted).

Benefits of tying:

  • Saves V \times d parameters (131M for the example above).
  • For small models, the savings are a large fraction of the total: GPT-2 small (117M parameters, V = 50{,}257, d = 768) would need an extra ~38.6M parameters, roughly a third of the model, for an untied output projection.
  • Acts as a regularizer by forcing input and output representations to be consistent.
  • Empirically improves perplexity on smaller models.

Why modern LLMs untie embeddings:

As models scale to billions of parameters, the calculus changes:

| Model Size         | Total Params | One Embedding Matrix | Savings from Tying |
|--------------------|--------------|----------------------|--------------------|
| 117M (GPT-2 small) | 117M         | ~38.6M               | ~33%               |
| 1.3B (GPT-3 style) | 1.3B         | ~103M                | ~8%                |
| 7B (LLaMA)         | 7B           | ~131M                | ~1.9%              |
| 70B (LLaMA)        | 70B          | ~262M                | ~0.37%             |

At 7B+ scale, the parameter savings from tying are negligible. Meanwhile, untying provides a real benefit: the input embedding can specialize for understanding context (what does this token mean?) while the output projection specializes for prediction (what token should come next?). These are fundamentally different tasks, and coupling them constrains model capacity.

Implementation

import torch
import torch.nn as nn


class TiedEmbeddingModel(nn.Module):
    """GPT-2 style: input and output embeddings are shared."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

        # Tie weights: lm_head uses the same storage as embed_tokens
        self.lm_head.weight = self.embed_tokens.weight

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed_tokens(input_ids)
        # ... transformer layers ...
        logits = self.lm_head(h)  # Uses E^T internally
        return logits


class UntiedEmbeddingModel(nn.Module):
    """LLaMA style: separate input and output embeddings."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # No tying: these are independent parameter tensors

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed_tokens(input_ids)
        # ... transformer layers ...
        logits = self.lm_head(h)
        return logits
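One subtlety of the tied version: the assignment self.lm_head.weight = self.embed_tokens.weight shares the Parameter by reference, not by copy, so there is a single tensor and a single gradient buffer. A quick standalone check with toy sizes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(100, 16)
head = nn.Linear(16, 100, bias=False)
head.weight = emb.weight                     # tie by reference

# Same storage: no extra 100x16 parameters were allocated
print(head.weight.data_ptr() == emb.weight.data_ptr())  # True

# One gradient buffer: a backward pass through the head populates the
# embedding's grad as well, because they are the same Parameter
loss = head(torch.randn(1, 16)).sum()
loss.backward()
print(emb.weight.grad is head.weight.grad)   # True
```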

A subtle but important point: when using tied embeddings with distributed training (e.g., tensor parallelism), the shared weight requires careful handling. Since the embedding table is typically partitioned along the vocabulary dimension while other linear layers are partitioned along hidden dimensions, tying creates a dependency that complicates the parallelism strategy. This is another practical reason modern distributed systems prefer untied embeddings.

3. Parallel Attention and FFN

Sequential: The Standard Approach

In the original transformer and most current models, attention and FFN are applied sequentially within each layer. Using Pre-LN ordering:

h = x + \text{Attn}(\text{Norm}_1(x))
x' = h + \text{FFN}(\text{Norm}_2(h))

The FFN sees the output of attention (plus the residual), so it can condition on what attention computed. This creates a strict dependency: FFN cannot begin until attention completes.

Parallel: Computing Both at Once

The parallel formulation, introduced in GPT-J and adopted by PaLM, runs attention and FFN simultaneously on the same normalized input:

x' = x + \text{Attn}(\text{Norm}(x)) + \text{FFN}(\text{Norm}(x))

Notice three key differences from the sequential version:

  1. Only one normalization layer is applied (not two).
  2. Both Attn and FFN receive the same normalized input.
  3. Both outputs are added to the residual simultaneously.

Why Parallel Works

At first glance, this seems like it should be strictly worse: the FFN can no longer condition on attention output. But the empirical results tell a different story. Chowdhery et al. (2022) found that at sufficient scale (8B+ parameters), parallel blocks match the quality of sequential blocks.

The theoretical justification is that at large width, the attention and FFN contributions are both small perturbations to the residual stream. Since each contribution is small relative to the residual itself, whether they are computed sequentially or in parallel makes little difference. Formally, if the residual stream has magnitude \|x\| \gg \|\text{Attn}(\cdot)\| and \|x\| \gg \|\text{FFN}(\cdot)\|, then:

x + \text{Attn}(\text{Norm}(x)) + \text{FFN}(\text{Norm}(x)) \approx x + \text{Attn}(\text{Norm}(x)) + \text{FFN}(\text{Norm}(x + \text{Attn}(\text{Norm}(x))))

because the FFN input changes by a small amount relative to the residual magnitude.
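The approximation is easy to probe numerically. Below, the attention and FFN branches are stood in for by small random linear maps whose outputs are roughly 1% of the residual magnitude (an assumed scale, purely illustrative); the parallel and sequential outputs then differ only at about the 1e-4 level.

```python
import torch

torch.manual_seed(0)
d = 512
x = torch.randn(d)

# Stand-ins for Attn(Norm(.)) and FFN(Norm(.)): linear maps scaled so that
# ||f(x)|| is roughly 1% of ||x|| (a hypothetical per-layer contribution size)
A = 0.01 * torch.randn(d, d) / d ** 0.5
B = 0.01 * torch.randn(d, d) / d ** 0.5
f = lambda v: A @ v   # "attention" branch
g = lambda v: B @ v   # "FFN" branch

parallel = x + f(x) + g(x)
sequential = x + f(x) + g(x + f(x))

# The difference is g(x + f(x)) - g(x) = B f(x), which is second-order small
rel_diff = (parallel - sequential).norm() / sequential.norm()
print(f"relative difference: {rel_diff:.1e}")
```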

Performance Advantages

The parallel formulation offers concrete performance benefits:

  1. Reduced normalization cost: One RMSNorm instead of two saves compute and memory.
  2. Fused kernels: The attention and FFN input projections can be fused into a single large matrix multiply, improving GPU utilization.
  3. Better pipeline parallelism: With two independent compute paths, the workload can be split more evenly across devices.

PaLM reported approximately 15% wall-clock speedup from the parallel formulation at their training scale.

Implementation

import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class SequentialBlock(nn.Module):
    """Standard sequential Pre-LN transformer block."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = RMSNorm(d_model)
        self.attn = BiasFreeLLaMAAttention(d_model, n_heads, n_heads)
        self.norm2 = RMSNorm(d_model)
        self.ffn = BiasFreeSwiGLU(d_model, d_ff)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x + self.attn(self.norm1(x))
        out = h + self.ffn(self.norm2(h))
        return out


class ParallelBlock(nn.Module):
    """Parallel attention + FFN block (GPT-J / PaLM style)."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm = RMSNorm(d_model)  # Single normalization
        self.attn = BiasFreeLLaMAAttention(d_model, n_heads, n_heads)
        self.ffn = BiasFreeSwiGLU(d_model, d_ff)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        normed = self.norm(x)
        # Both branches receive the same normalized input
        attn_out = self.attn(normed)
        ffn_out = self.ffn(normed)
        return x + attn_out + ffn_out


class FusedParallelBlock(nn.Module):
    """
    Optimized parallel block that fuses input projections.
    The QKV projections from attention and the gate/up projections from FFN
    are combined into a single large linear layer for better GPU utilization.
    """

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.d_ff = d_ff

        self.norm = RMSNorm(d_model)

        # Fused projection: Q, K, V, gate, up all in one matmul
        fused_dim = 3 * d_model + 2 * d_ff
        self.fused_in = nn.Linear(d_model, fused_dim, bias=False)

        # Separate output projections
        self.wo = nn.Linear(d_model, d_model, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, _ = x.shape
        normed = self.norm(x)

        # Single fused matmul for all input projections
        fused = self.fused_in(normed)
        q, k, v, gate, up = fused.split(
            [self.d_model, self.d_model, self.d_model, self.d_ff, self.d_ff],
            dim=-1
        )

        # Attention path
        q = q.view(bsz, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(bsz, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(bsz, seq_len, self.n_heads, self.head_dim).transpose(1, 2)

        scale = self.head_dim ** -0.5
        attn = torch.matmul(q, k.transpose(-2, -1)) * scale
        attn = torch.softmax(attn, dim=-1)
        attn_out = torch.matmul(attn, v)
        attn_out = attn_out.transpose(1, 2).contiguous().view(bsz, seq_len, -1)
        attn_out = self.wo(attn_out)

        # FFN path (SwiGLU)
        ffn_out = self.w_down(nn.functional.silu(gate) * up)

        return x + attn_out + ffn_out

The FusedParallelBlock shows why parallelism enables optimization that is impossible with the sequential formulation: by combining all input projections into one large matrix multiply, we get better arithmetic intensity and GPU utilization.

Who Uses What?

| Model          | Formulation | Notes                                          |
|----------------|-------------|------------------------------------------------|
| GPT-J (6B)     | Parallel    | EleutherAI, one of the first to adopt          |
| GPT-NeoX (20B) | Parallel    | Also EleutherAI                                |
| PaLM (540B)    | Parallel    | Google, demonstrated it works at massive scale |
| LLaMA (7-65B)  | Sequential  | Meta, chose stability over speed               |
| LLaMA 2/3      | Sequential  | Meta, continued the conservative choice        |
| Mistral (7B)   | Sequential  | Sequential with sliding window attention       |
| Falcon (40B)   | Parallel    | Technology Innovation Institute                |

The choice between parallel and sequential often reflects an engineering philosophy. Google and EleutherAI prioritize training throughput, while Meta prioritizes reliability at scale (sequential is more thoroughly studied and easier to debug).

4. Initialization Schemes

Why Initialization Matters for Transformers

Initialization is not just a technical detail; it determines whether a deep transformer can train at all. Consider a transformer with L layers. In the forward pass, the residual stream accumulates contributions from each layer:

x_L = x_0 + \sum_{l=1}^{L} f_l(x_{l-1})

If each f_l has output variance \sigma^2, the variance of the residual stream after L layers grows to approximately \sigma_0^2 + L \cdot \sigma^2. For a 96-layer model, even small per-layer contributions accumulate. Poor initialization can cause the residual stream to explode (making training numerically unstable) or collapse to noise.
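A two-line simulation makes the accumulation visible: model each layer's contribution as i.i.d. noise with per-coordinate std \sigma = 0.1 (an illustrative value) added to the residual stream, and the variance grows from 1 to roughly 1 + L\sigma^2 = 1.96 over 96 layers.

```python
import torch

torch.manual_seed(0)
d, L, sigma = 4096, 96, 0.1

x = torch.randn(d)                    # residual stream, per-coordinate variance ~1
for _ in range(L):
    x = x + sigma * torch.randn(d)    # each layer writes a small contribution

print(f"final variance: {x.var().item():.2f}")  # ~1 + L * sigma^2 = 1.96
```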

Xavier/Glorot Initialization

Glorot and Bengio (2010) derived the initialization needed to maintain variance through a linear layer with no activation function. For a weight matrix with fan-in n_{in} and fan-out n_{out}:

Uniform variant:

W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in} + n_{out}}},\ \sqrt{\frac{6}{n_{in} + n_{out}}}\right)

Normal variant:

W \sim \mathcal{N}\left(0,\ \frac{2}{n_{in} + n_{out}}\right)

The derivation assumes linear activations and requires \text{Var}(y) = \text{Var}(x) (variance preservation in the forward pass) and \text{Var}(\nabla x) = \text{Var}(\nabla y) (variance preservation in the backward pass). The compromise \frac{2}{n_{in} + n_{out}} balances both directions.
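The variance-preservation property is straightforward to verify empirically. With Xavier-normal weights and a square layer (n_{in} = n_{out}), the output variance of a linear map matches the input variance:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_in, n_out = 1024, 1024

W = torch.empty(n_out, n_in)
nn.init.xavier_normal_(W)      # Var(W_ij) = 2 / (n_in + n_out) = 1 / 1024

x = torch.randn(4096, n_in)    # unit-variance inputs
y = x @ W.T

# Var(y) = n_in * Var(W) * Var(x) = 1024 * (1/1024) * 1 = 1
print(f"input var: {x.var():.3f}, output var: {y.var():.3f}")
```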

Kaiming/He Initialization

He et al. (2015) extended Xavier to account for ReLU activations. Since ReLU zeros out approximately half the outputs (\mathbb{E}[\text{ReLU}(x)^2] = \frac{1}{2}\text{Var}(x) for symmetric inputs), the variance halves at each layer. To compensate:

W \sim \mathcal{N}\left(0,\ \frac{2}{n_{in}}\right)

The factor of 2 in the numerator corrects for the ReLU's halving effect. For layers with other activations like GELU or SiLU, the correction factor differs slightly (approximately \frac{2}{1.7^2 \cdot n_{in}} for GELU), but Kaiming initialization is often used as an approximation.
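A quick experiment shows why the factor of 2 matters: stacking ten ReLU layers, Kaiming-initialized weights keep the activation scale roughly constant, while Xavier-initialized weights shrink it by roughly (1/2)^{10}.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, depth = 1024, 10
x0 = torch.randn(4096, d)

def run(init_fn):
    x = x0
    for _ in range(depth):
        W = torch.empty(d, d)
        init_fn(W)
        x = torch.relu(x @ W.T)
    return x.pow(2).mean().item()   # second moment of the activations

he = run(lambda W: nn.init.kaiming_normal_(W, nonlinearity="relu"))
xavier = run(nn.init.xavier_normal_)

# He stays near 1; Xavier collapses toward zero through the ReLU stack
print(f"after {depth} ReLU layers: He {he:.3f}, Xavier {xavier:.6f}")
```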

GPT-Style Depth-Scaled Initialization

GPT-2 introduced a practical initialization scheme that has become widely adopted:

Standard layers (embeddings, most linear projections):

W \sim \mathcal{N}(0,\ 0.02^2)

Residual output projections (W_O in attention, W_{down} in FFN):

W \sim \mathcal{N}\left(0,\ \left(\frac{0.02}{\sqrt{2L}}\right)^2\right)

where L is the total number of transformer layers. The \sqrt{2L} factor accounts for the fact that each layer contributes two residual additions (one from attention, one from FFN), and we want the total accumulated variance to remain bounded.

To see why this works, consider the variance after all layers:

\text{Var}(x_L) = \text{Var}(x_0) + 2L \cdot \left(\frac{0.02}{\sqrt{2L}}\right)^2 \cdot d = \text{Var}(x_0) + 0.02^2 \cdot d

The depth-dependent factor cancels with the number of layers, making the total output variance independent of model depth.
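The cancellation can be checked with a quick simulation: treat each residual write as i.i.d. noise with per-coordinate std 0.02/\sqrt{2L} (dropping the fan-in factor for simplicity) and compare two very different depths; the final variance comes out the same.

```python
import math
import torch

torch.manual_seed(0)
d = 8192
final_var = {}

for L in (12, 96):
    std = 0.02 / math.sqrt(2 * L)   # depth-scaled std for residual projections
    x = torch.zeros(d)
    for _ in range(2 * L):          # two residual writes per layer (attn + FFN)
        x = x + std * torch.randn(d)
    final_var[L] = x.var().item()
    # 2L * (0.02 / sqrt(2L))^2 = 0.02^2 = 4e-4, independent of depth
    print(f"L={L:3d}: final variance {final_var[L]:.2e}")
```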

Implementation of Initialization Schemes

import math
import torch
import torch.nn as nn


def init_transformer_weights(
    model: nn.Module,
    n_layers: int,
    d_model: int,
    init_std: float = 0.02,
    init_method: str = "gpt2"
):
    """
    Initialize transformer weights using common schemes.

    Args:
        model: The transformer model
        n_layers: Number of transformer layers
        d_model: Model hidden dimension
        init_std: Base standard deviation (for GPT-2 style)
        init_method: One of "xavier", "kaiming", "gpt2"
    """
    residual_std = init_std / math.sqrt(2 * n_layers)

    for name, param in model.named_parameters():
        if param.dim() < 2:
            # Skip 1D parameters (norms, biases)
            continue

        if init_method == "xavier":
            nn.init.xavier_normal_(param)

        elif init_method == "kaiming":
            nn.init.kaiming_normal_(param, nonlinearity="relu")

        elif init_method == "gpt2":
            # Check if this is a residual output projection
            is_residual_proj = any(
                tag in name for tag in ["wo.", "w_down.", "out_proj.", "c_proj."]
            )

            if is_residual_proj:
                nn.init.normal_(param, mean=0.0, std=residual_std)
            else:
                nn.init.normal_(param, mean=0.0, std=init_std)

    # Embeddings: always use base std
    if hasattr(model, "embed_tokens"):
        nn.init.normal_(model.embed_tokens.weight, mean=0.0, std=init_std)


# Example: initializing a 32-layer model
class MiniLLaMA(nn.Module):
    def __init__(self, vocab_size: int = 32000, d_model: int = 4096,
                 n_layers: int = 32, n_heads: int = 32, d_ff: int = 11008):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList([
            SequentialBlock(d_model, n_heads, d_ff)
            for _ in range(n_layers)
        ])
        self.norm = RMSNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

        # Apply GPT-2 style initialization
        init_transformer_weights(self, n_layers, d_model, init_method="gpt2")

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed_tokens(input_ids)
        for layer in self.layers:
            h = layer(h)
        h = self.norm(h)
        return self.lm_head(h)

muP: Maximal Update Parameterization

Yang et al. (2022) proposed muP (maximal update parameterization), a principled framework that allows hyperparameters tuned on a small model to transfer directly to larger models. This solves a critical practical problem: hyperparameter sweeps on a 70B model are prohibitively expensive, but sweeps on a 125M proxy model are cheap.

The core insight of muP is that different parameter groups should scale their learning rates and initialization differently based on width:

| Parameter      | Standard Init   | muP Init        | Standard LR | muP LR        |
|----------------|-----------------|-----------------|-------------|---------------|
| Embeddings     | \sigma          | \sigma          | \eta        | \eta          |
| Hidden weights | 1/\sqrt{n_{in}} | 1/\sqrt{n_{in}} | \eta        | \eta / n_{in} |
| Output weights | 1/\sqrt{n_{in}} | 1/n_{in}        | \eta        | \eta          |

The key differences from standard parameterization:

  1. Hidden layer learning rates scale as 1/\text{width}: Wider layers need smaller learning rates to keep the magnitude of parameter updates constant.
  2. Output layer initialization scales as 1/\text{width} (not 1/\sqrt{\text{width}}): This ensures the output logits have consistent scale regardless of model width.
  3. Attention logits require an additional 1/d scaling (instead of 1/\sqrt{d}).

In practice, muP enables "tune small, train large": run a full hyperparameter sweep on a small (128 or 256 hidden dim) model, then apply those hyperparameters directly to the full-scale model.

def mup_init_and_lr(model: nn.Module, base_width: int, target_width: int,
                    base_lr: float):
    """
    Simplified muP: compute per-parameter learning rates based on width ratio.

    Args:
        model: Target model to train
        base_width: Width of the proxy model used for HP tuning
        target_width: Width of the target model
        base_lr: Learning rate found optimal for the proxy model
    """
    width_ratio = target_width / base_width
    param_groups = []

    for name, param in model.named_parameters():
        if "embed" in name:
            # Embedding: same LR as base
            param_groups.append({"params": [param], "lr": base_lr})
        elif "lm_head" in name or "output" in name:
            # Output layer: same LR, but init scaled by 1/width
            nn.init.normal_(param, std=1.0 / target_width)
            param_groups.append({"params": [param], "lr": base_lr})
        else:
            # Hidden layers: LR scales as 1/width_ratio
            param_groups.append({
                "params": [param],
                "lr": base_lr / width_ratio
            })

    return param_groups

5. Other Small But Important Details

Vocabulary Size Padding

Modern GPUs execute matrix multiplications most efficiently when dimensions are multiples of 8 (for FP16 tensor cores), 64, or 128. Since the embedding and LM head involve a matrix of shape (\text{vocab\_size}, d_{model}), padding the vocabulary to a "nice" number yields free performance:

import math


def pad_vocab_size(vocab_size: int, padding_multiple: int = 64) -> int:
    """
    Pad vocabulary size to the nearest multiple for GPU efficiency.

    Common choices:
      - 64: Good for most GPUs (FP16 tensor cores)
      - 128: Better for A100/H100 with INT8/FP8
    """
    return padding_multiple * math.ceil(vocab_size / padding_multiple)


# Examples from real models:
configs = {
    "GPT-2":   {"raw": 50257, "padded": pad_vocab_size(50257, 64)},    # 50257 -> 50304
    "LLaMA":   {"raw": 32000, "padded": pad_vocab_size(32000, 64)},    # 32000 -> 32000 (already aligned)
    "LLaMA-2": {"raw": 32000, "padded": pad_vocab_size(32000, 128)},   # 32000 -> 32000 (already a multiple of 128)
    "Mistral": {"raw": 32000, "padded": pad_vocab_size(32000, 64)},    # 32000 -> 32000 (already aligned)
    "Falcon":  {"raw": 65024, "padded": pad_vocab_size(65024, 128)},   # 65024 -> 65024 (already aligned)
}

for model, cfg in configs.items():
    print(f"{model}: {cfg['raw']} -> {cfg['padded']}")

The padded token IDs never occur in the training data, so the model simply learns to assign them vanishing probability; model quality is unaffected. The cost is a few extra parameters, which is negligible compared to the throughput gain.

Hidden Dimension Multiples

The FFN hidden dimension dffd_{ff} is chosen to satisfy multiple constraints simultaneously:

  1. Ratio to d_{model}: Traditionally 4 \times d_{model}, but with SwiGLU the effective multiplier is \frac{8}{3} (since SwiGLU has three matrices vs. two for a standard FFN, the dimension is reduced to keep FLOPs comparable).
  2. Alignment to tensor core tile sizes: Must be divisible by 128 or 256.

import math


def compute_ff_dim(d_model: int, multiplier: float = 8/3,
                   alignment: int = 256) -> int:
    """
    Compute FFN hidden dimension following LLaMA conventions.

    Steps:
      1. Multiply d_model by the SwiGLU multiplier (8/3)
      2. Round up to the nearest multiple of alignment
    """
    raw_ff = int(multiplier * d_model)
    aligned_ff = alignment * math.ceil(raw_ff / alignment)
    return aligned_ff


# Reproduce LLaMA dimensions:
for d_model, expected_ff in [(4096, 11008), (5120, 13824), (8192, 22016)]:
    computed = compute_ff_dim(d_model, multiplier=8/3, alignment=256)
    print(f"d_model={d_model}: computed d_ff={computed}, expected={expected_ff}")

Epsilon Values in Normalization and Optimization

Small epsilon values prevent division by zero, but their exact magnitude matters for training stability:

| Context         | Component    | Typical Epsilon    | Why                                                          |
|-----------------|--------------|--------------------|--------------------------------------------------------------|
| Normalization   | LayerNorm    | 1e-5               | Standard default in PyTorch                                  |
| Normalization   | RMSNorm      | 1e-6 or 1e-5       | Smaller because RMSNorm divides by the RMS (typically larger than the std) |
| Optimization    | Adam/AdamW   | 1e-8               | Default; some use 1e-6 for BF16 stability                    |
| Mixed precision | Loss scaling | N/A                | Dynamic loss scaling handles FP16 underflow                  |

LLaMA uses \epsilon = 10^{-6} for RMSNorm, while some earlier models use 10^{-5}. With BF16 training, a slightly larger Adam epsilon (10^{-6} instead of 10^{-8}) can prevent instabilities from the reduced mantissa precision.

# Typical RMSNorm configuration
norm = RMSNorm(d_model=4096, eps=1e-6)  # LLaMA default

# AdamW with BF16-friendly epsilon
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
    eps=1e-8,  # Standard; use 1e-6 if you see NaN with BF16
)

Gradient Clipping

Nearly all modern LLMs clip gradient norms to prevent training instabilities:

g \leftarrow g \cdot \min\left(1,\ \frac{\text{max\_norm}}{\|g\|}\right)

The standard value is \text{max\_norm} = 1.0, used by GPT-3, LLaMA, and most others. Clipping is applied to the global gradient norm (across all parameters), not per-parameter.

# Standard gradient clipping in training loop
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
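A toy demonstration (a hypothetical model and loss, chosen only to inflate the gradients) confirms that it is the global norm that ends up at most 1.0, and that clip_grad_norm_ returns the pre-clip norm:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 8)
loss = 100.0 * model(torch.randn(4, 8)).pow(2).sum()   # deliberately huge loss
loss.backward()

# clip_grad_norm_ returns the global norm *before* clipping
pre_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Recompute the global norm after clipping: it is now ~1.0
post_norm = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()])
)
print(f"before: {pre_norm:.1f}, after: {post_norm:.4f}")
```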

Comprehensive "Who Uses What" Comparison

The following table summarizes how major LLM architectures combine all the design choices we have discussed throughout this series:

| Feature           | GPT-3       | LLaMA                | LLaMA 2              | PaLM                 | Mistral 7B    | Falcon 40B           |
|-------------------|-------------|----------------------|----------------------|----------------------|---------------|----------------------|
| Biases            | Yes         | No                   | No                   | No                   | No            | Partial (QKV only)   |
| Tied Embeddings   | Yes         | No                   | No                   | Yes                  | No            | Yes                  |
| Attn+FFN          | Sequential  | Sequential           | Sequential           | Parallel             | Sequential    | Parallel             |
| Norm Type         | LayerNorm   | RMSNorm              | RMSNorm              | RMSNorm              | RMSNorm       | LayerNorm            |
| Norm Position     | Pre-LN      | Pre-LN               | Pre-LN               | Pre-LN               | Pre-LN        | Pre-LN               |
| Position Encoding | Learned     | RoPE                 | RoPE                 | RoPE                 | RoPE          | RoPE (ALiBi in 180B) |
| Attention         | MHA         | MHA                  | GQA                  | MQA                  | GQA + Sliding | MQA                  |
| FFN Type          | GELU        | SwiGLU               | SwiGLU               | SwiGLU               | SwiGLU        | GELU                 |
| Init              | GPT-style   | GPT-style            | GPT-style            | GPT-style            | GPT-style     | GPT-style            |
| Vocab Size        | 50,257      | 32,000               | 32,000               | 256,000              | 32,000        | 65,024               |
| Vocab Padding     | Yes (to 64) | No (already aligned) | No (already aligned) | No (already aligned) | No            | Yes (to 128)         |
| RMSNorm eps       | N/A         | 1e-6                 | 1e-5                 | 1e-6                 | 1e-5          | N/A                  |
| Grad Clip         | 1.0         | 1.0                  | 1.0                  | 1.0                  | 1.0           | 1.0                  |

Several patterns emerge from this table:

  1. Bias removal is nearly universal in post-2022 models.
  2. RMSNorm has replaced LayerNorm almost everywhere.
  3. SwiGLU dominates as the FFN activation.
  4. RoPE is the de facto standard for position encoding.
  5. GQA is emerging as the preferred attention variant (trading off between MHA and MQA).
  6. The parallel vs. sequential choice remains split, with no clear winner.

Summary

None of the choices in this post are individually transformative. Removing biases saves a fraction of a percent of parameters. Untying embeddings adds a modest capacity boost. Parallel attention trades a small quality risk for throughput. Proper initialization is invisible when it works correctly but catastrophic when wrong. And dimension padding is pure engineering pragmatism.

But these details compound. A model that gets all of them right trains faster, uses memory more efficiently, remains numerically stable through longer runs, and is easier to quantize and deploy. The gap between a "good enough" implementation and a state-of-the-art one often comes down to exactly these kinds of careful engineering decisions.

If you take away one principle from this post, it is this: at the scale of modern LLMs, small inefficiencies multiply across billions of parameters and trillions of tokens. Getting the details right is not optional.


In the final post, we will explore Part 8: Alternative Architectures -- State Space Models, Mamba, Linear Attention, RWKV, and hybrid architectures that challenge the transformer paradigm itself.

References

  • Glorot, X. and Bengio, Y. "Understanding the difficulty of training deep feedforward neural networks." AISTATS, 2010.
  • He, K. et al. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." ICCV, 2015.
  • Press, O. and Wolf, L. "Using the Output Embedding to Improve Language Models." EACL, 2017.
  • Radford, A. et al. "Language Models are Unsupervised Multitask Learners." OpenAI, 2019.
  • Brown, T. et al. "Language Models are Few-Shot Learners." NeurIPS, 2020.
  • Wang, B. and Komatsuzaki, A. "GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model." EleutherAI, 2021.
  • Black, S. et al. "GPT-NeoX-20B: An Open-Source Autoregressive Language Model." ACL Workshop on Challenges and Applications of Large Language Models, 2022.
  • Chowdhery, A. et al. "PaLM: Scaling Language Modeling with Pathways." JMLR, 2023.
  • Touvron, H. et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971, 2023.
  • Touvron, H. et al. "LLaMA 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288, 2023.
  • Yang, G. et al. "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer." arXiv:2203.03466, 2022.
  • Jiang, A.Q. et al. "Mistral 7B." arXiv:2310.06825, 2023.
  • Penedo, G. et al. "The RefinedWeb Dataset for Falcon LLM." arXiv:2306.01116, 2023.

Written by Suchinthaka Wanninayaka

AI/ML Researcher exploring semantic communications, diffusion models, and language model systems. Writing about deep learning from theory to production.
