
Transformer Deep Dive: Part 7 - Minor But Important Changes

The small details that matter - bias removal, tied embeddings, parallel attention and FFN computation, and initialization schemes in modern LLMs.


Suchinthaka W.

January 21, 2025 · 7 min read

This post covers four seemingly minor architectural modifications that have become standard in modern LLMs. While these changes may appear subtle compared to FlashAttention or Mixture of Experts, they collectively contribute to improved training stability, reduced memory footprint, and better model quality.

1. Bias Removal

What Does Bias Do?

In a standard linear transformation:

$$y = Wx + b$$

The bias term $b \in \mathbb{R}^{d_{out}}$ provides:

  • Learnable offset: Shifts output distribution independently of input
  • Non-zero output at zero input: $y = b$ when $x = 0$
  • Additional expressivity: Extra parameters for function approximation

Where Biases Appear

In a transformer layer:

  • Attention: $W_Q$, $W_K$, $W_V$, $W_O$, each with a bias
  • FFN: $W_1$, $W_2$ (and $W_3$ for SwiGLU), each with a bias
  • Total: 6-7 bias vectors per layer

Why Remove Biases?

Argument 1: Redundancy with LayerNorm/RMSNorm

The learnable shift parameter $\beta$ in LayerNorm provides similar functionality:

$$\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sigma} + \beta$$

If we then apply $y = Wx + b$, the bias $b$ is partially redundant with $\beta$.

Argument 2: Scale Invariance

With RMSNorm (which has no $\beta$ parameter), models learn to work without explicit biases. At scale, the model finds other ways to represent offsets through the weight matrices.

Argument 3: Memory Savings

For a 70B parameter model with 80 layers and 8K hidden dimension:

  • Biases per layer: ~48K parameters
  • Total bias parameters: ~4M
  • Memory: ~16MB (FP32)

Small but meaningful at scale.
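
A quick back-of-the-envelope check of these numbers, treating every bias as a d_model-sized vector with roughly 6 per layer (the simplification the estimate above assumes):

d_model = 8192                               # hidden dimension
n_layers = 80                                # transformer layers
biases_per_layer = 6 * d_model               # ~48K parameters
total_biases = biases_per_layer * n_layers   # ~3.9M parameters
memory_mb = total_biases * 4 / 1e6           # ~16 MB in FP32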

Who Removed Biases?

| Model | Biases |
|-------|--------|
| GPT-2/3 | Yes |
| LLaMA | No |
| LLaMA 2/3 | No |
| Mistral | No |
| DeepSeek | No |

# LLaMA-style attention
class Attention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model, bias=False)  # No bias
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)

2. Tied Embeddings

The Embedding Layers

Transformers have two embedding-related matrices:

  1. Input embedding $E$: Token → Vector ($V \times d$)
  2. Output projection $W_{out}$: Vector → Logits ($d \times V$)

For vocabulary size 32K and dimension 4096:

  • Each matrix: 32K × 4096 = 128M parameters

Weight Tying

Idea: Use $W_{out} = E^T$ (the transpose of the input embedding matrix):

$$\text{logits} = h \cdot E^T$$
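
A minimal sketch of what tying looks like in PyTorch (GPT-2 style; the class and attribute names here are illustrative):

import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Tie the output projection to the input embedding matrix:
        # both now share the same V x d parameter tensor.
        self.lm_head.weight = self.embed_tokens.weight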

Benefits:

  • Reduces parameters by ~128M (significant for smaller models)
  • Forces consistency: similar tokens have similar embeddings AND similar output distributions
  • Can improve generalization

Drawbacks:

  • Constrains the model
  • Output distribution learning is coupled to input representations

Who Uses Tying?

| Model | Tied Embeddings |
|-------|-----------------|
| GPT-2 | Yes |
| BERT | Yes |
| LLaMA | No |
| LLaMA 2/3 | No |
| Mistral | No |

Modern large models generally don't tie embeddings because:

  • The parameter savings are negligible at billion-parameter scale
  • Untied weights allow independent optimization of the input and output representations

class LLaMA(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        # Separate embeddings
        self.embed_tokens = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

        # NOT tying: self.lm_head.weight = self.embed_tokens.weight

3. Parallel Attention and FFN

Standard: Sequential

In the original transformer, attention and FFN are sequential:

$$h' = x + \text{Attn}(\text{Norm}(x))$$
$$x' = h' + \text{FFN}(\text{Norm}(h'))$$
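
For reference, a minimal pre-norm sequential block might look like this (assuming the same RMSNorm, Attention, and SwiGLU modules used in the parallel example further below):

import torch
import torch.nn as nn

class SequentialBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = RMSNorm(d_model)  # one norm per sub-layer
        self.norm2 = RMSNorm(d_model)
        self.attn = Attention(d_model, n_heads)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x + self.attn(self.norm1(x))    # first residual branch
        return h + self.ffn(self.norm2(h))  # second residual branch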

Parallel Formulation

Some models compute attention and FFN in parallel:

$$x' = x + \text{Attn}(\text{Norm}(x)) + \text{FFN}(\text{Norm}(x))$$

Note: Both use the SAME normalized input.

Visual Comparison

Sequential:

x → Norm → Attn → Add → Norm → FFN → Add → x'
              ↑                   ↑
              └── residual ──────┘

Parallel:

        ┌→ Attn →┐
x → Norm          Add → x'
        └→ FFN  →┘
  ↑               ↑
  └── residual ───┘

Benefits of Parallel

  1. Fewer normalizations: one norm per block instead of two
  2. Better parallelization: Attn and FFN can run simultaneously
  3. Slightly faster: ~15% speedup in some implementations

Potential Drawbacks

  • Attention can't condition on FFN output
  • May need different hyperparameters
  • Not universally adopted

Who Uses Parallel?

| Model | Parallel |
|-------|----------|
| PaLM | Yes |
| GPT-J | Yes |
| GPT-NeoX | Yes |
| LLaMA | No |

class ParallelBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm = RMSNorm(d_model)  # Single norm
        self.attn = Attention(d_model, n_heads)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        normed = self.norm(x)
        # Parallel computation
        return x + self.attn(normed) + self.ffn(normed)

4. Initialization Schemes

Why Initialization Matters

Poor initialization causes:

  • Exploding gradients: Activations grow exponentially through layers
  • Vanishing gradients: Activations shrink, no learning
  • Failure to break symmetry: all neurons learn the same thing

Xavier/Glorot Initialization

For a layer with fan-in $n_{in}$ and fan-out $n_{out}$:

$$W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}\right)$$

Or the normal variant:

$$W \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)$$

Goal: Maintain variance of activations through layers.

Kaiming/He Initialization

For ReLU activations:

$$W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)$$

Why different? ReLU zeros out half the outputs, so we need 2× larger variance to compensate.
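
Both schemes are available directly in PyTorch; a small illustration on a standalone linear layer:

import torch.nn as nn

lin = nn.Linear(1024, 4096)

# Xavier/Glorot: variance scaled by fan_in + fan_out
nn.init.xavier_uniform_(lin.weight)

# Kaiming/He: variance scaled by fan_in, with the sqrt(2) gain for ReLU
nn.init.kaiming_normal_(lin.weight, nonlinearity='relu')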

GPT-Style Initialization

GPT-2/3 uses a specific scheme:

  1. Standard layers: $\mathcal{N}(0,\ 0.02^2)$, i.e. a normal distribution with std 0.02
  2. Residual projections: scale the std by $\frac{1}{\sqrt{2N}}$, where $N$ is the number of layers

The residual scaling prevents output variance from exploding with depth.

import math

import torch
import torch.nn as nn

def init_weights(model: nn.Module, n_layers: int):
    # Base init: normal with std 0.02 for all linear weights, zeros for any biases
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
            # Smaller std for residual (output) projections, scaled by 1/sqrt(2N)
            if name.endswith("wo") or name.endswith("w2"):
                std = 0.02 / math.sqrt(2 * n_layers)
                torch.nn.init.normal_(module.weight, mean=0.0, std=std)

μP (Maximal Update Parameterization)

A principled approach designed so that optimal hyperparameters transfer across model sizes.

Key insight: Different parameter groups should have different learning rates based on their width.

$$\eta_{\text{layer}} \propto \frac{1}{\sqrt{n_{in}}}$$
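
A toy sketch of width-dependent learning rates following that proportionality. This covers only the learning-rate side of the idea; real μP implementations also adjust initialization and output scaling, and treat embeddings and the output head separately:

import math
import torch.nn as nn

def width_aware_param_groups(model: nn.Module, base_lr: float = 3e-4):
    groups = []
    for p in model.parameters():
        if p.ndim == 2:
            # Weight matrices: scale the learning rate by 1/sqrt(fan_in)
            lr = base_lr / math.sqrt(p.shape[1])
        else:
            # Biases and norm gains: keep the base learning rate
            lr = base_lr
        groups.append({"params": [p], "lr": lr})
    return groups

# Usage (hypothetical): optimizer = torch.optim.AdamW(width_aware_param_groups(model))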

Initialization Summary

| Component | Common Init | Std |
|-----------|-------------|-----|
| Embeddings | Normal | 0.02 |
| Attention QKV | Normal | $\frac{1}{\sqrt{d}}$ |
| Attention O | Normal | $\frac{0.02}{\sqrt{2N}}$ |
| FFN up | Normal | $\frac{1}{\sqrt{d}}$ |
| FFN down | Normal | $\frac{0.02}{\sqrt{2N}}$ |
| LM Head | Normal | 0.02 |

5. Other Small Details

Vocabulary Size Padding

Pad vocabulary size to be divisible by 64 or 128 for efficient GPU computation:

vocab_size = 32000
padded_vocab_size = 64 * math.ceil(vocab_size / 64)  # 32064

Hidden Dimension Multiples

Use hidden dimensions divisible by 128 (or 256) for tensor core efficiency:

# Instead of d_ff = 4 * 4096 = 16384
d_ff = int(8/3 * 4096)  # 10922
d_ff = 256 * math.ceil(d_ff / 256)  # 11008

Epsilon Values

RMSNorm and LayerNorm use small epsilon for numerical stability:

| Usage | Typical ε |
|-------|-----------|
| LayerNorm | 1e-5 |
| RMSNorm | 1e-6 |
| Adam | 1e-8 |

LLaMA uses 1e-6 for RMSNorm.
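
A minimal RMSNorm sketch showing where the epsilon sits (using the 1e-6 default noted above):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # eps keeps the reciprocal square root finite when the mean square is ~0
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)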

Summary

| Change | Impact | Adopted By |
|--------|--------|------------|
| No biases | Fewer parameters, simpler | LLaMA, Mistral |
| Untied embeddings | More expressivity | LLaMA, Mistral |
| Parallel Attn+FFN | Faster | PaLM, GPT-J |
| Proper initialization | Stable training | All |
| Dimension padding | GPU efficiency | All |

These "minor" details collectively contribute significantly to the efficiency and stability of modern LLMs.


In the final post, we'll explore Part 8: Alternative Architectures - State Space Models, Mamba, Linear Attention, RWKV, and hybrid architectures.
