Transformer Deep Dive: Part 7 - Minor But Important Changes
The small details that matter - bias removal, tied embeddings, parallel attention and FFN computation, and initialization schemes in modern LLMs.
Suchinthaka W.
January 21, 2025 · 7 min read
This post covers four seemingly minor architectural modifications that have become standard in modern LLMs. While these changes may appear subtle compared to FlashAttention or Mixture of Experts, they collectively contribute to improved training stability, reduced memory footprint, and better model quality.
1. Bias Removal
What Does Bias Do?
In a standard linear transformation:

$$y = Wx + b$$

The bias term $b$ provides:
- Learnable offset: Shifts output distribution independently of input
- Non-zero output at zero input: $y = b$ when $x = 0$
- Additional expressivity: Extra parameters for function approximation
Where Biases Appear
In a transformer layer:
- Attention: $W_Q$, $W_K$, $W_V$, $W_O$, each with a bias vector
- FFN: $W_1$, $W_2$ (and $W_3$ for SwiGLU), each with a bias vector
- Total: 6-7 bias vectors per layer
Why Remove Biases?
Argument 1: Redundancy with LayerNorm/RMSNorm
The learnable shift parameter $\beta$ in LayerNorm provides similar functionality:

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sigma} + \beta$$

If we then apply $Wx + b$, the bias $b$ is partially redundant with $\beta$.
Argument 2: Scale Invariance
With RMSNorm (no $\beta$ parameter), models learn to work without explicit biases. At scale, the model finds other ways to represent offsets through the weight matrices.
Argument 3: Memory Savings
For a 70B parameter model with 80 layers and 8K hidden dimension:
- Biases per layer: ~48K parameters
- Total bias parameters: ~4M
- Memory: ~16MB (FP32)
Small but meaningful at scale.
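A quick back-of-the-envelope check of those figures (assuming six $d_{model}$-sized bias vectors per layer, matching the attention-plus-FFN count above):

d_model, n_layers = 8192, 80
biases_per_layer = 6 * d_model                    # 49,152  (~48K)
total_bias_params = n_layers * biases_per_layer   # 3,932,160  (~4M)
memory_mb = total_bias_params * 4 / 1e6           # ~15.7 MB in FP32 (4 bytes each)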
Who Removed Biases?
| Model | Biases |
|-------|--------|
| GPT-2/3 | Yes |
| LLaMA | No |
| LLaMA 2/3 | No |
| Mistral | No |
| DeepSeek | No |
import torch.nn as nn

# LLaMA-style attention projections: every linear layer drops its bias term
class Attention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.wq = nn.Linear(d_model, d_model, bias=False)  # No bias
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)
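As a quick sanity check (reusing the nn import above), bias=False removes exactly one $d_{model}$-sized vector per projection:

with_bias = nn.Linear(4096, 4096, bias=True)
without_bias = nn.Linear(4096, 4096, bias=False)
n_with = sum(p.numel() for p in with_bias.parameters())
n_without = sum(p.numel() for p in without_bias.parameters())
print(n_with - n_without)  # 4096: one bias vector per projection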
2. Tied Embeddings
The Embedding Layers
Transformers have two embedding-related matrices:
- Input embedding $W_E \in \mathbb{R}^{V \times d}$: Token → Vector
- Output projection $W_{LM} \in \mathbb{R}^{d \times V}$: Vector → Logits
For a vocabulary of 32,000 tokens and dimension 4096:
- Each matrix: 32,000 × 4,096 ≈ 131M parameters
Weight Tying
Idea: Use $W_{LM} = W_E^T$ (the transpose of the input embedding matrix)
Benefits:
- Reduces parameters by ~131M (significant for smaller models)
- Forces consistency: similar tokens have similar embeddings AND similar output distributions
- Can improve generalization
Drawbacks:
- Constrains the model
- Output distribution learning is coupled to input representations
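For reference, a minimal GPT-2-style tying sketch in PyTorch (the class name TiedLM is purely illustrative; nn.Linear stores its weight as (out_features, in_features), so the embedding matrix can be shared directly):

import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Share the underlying parameter: logits = h @ W_E^T
        self.lm_head.weight = self.embed_tokens.weight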
Who Uses Tying?
| Model | Tied Embeddings |
|-------|-----------------|
| GPT-2 | Yes |
| BERT | Yes |
| LLaMA | No |
| LLaMA 2/3 | No |
| Mistral | No |
Modern large models generally don't tie embeddings because:
- The parameter savings are negligible at billion-parameter scale
- Untied matrices let the input and output representations be optimized independently
import torch.nn as nn

class LLaMA(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        # Separate (untied) embeddings
        self.embed_tokens = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # NOT tying: self.lm_head.weight = self.embed_tokens.weight
3. Parallel Attention and FFN
Standard: Sequential
In the original transformer, attention and FFN are sequential:

$$x' = x + \text{Attn}(\text{Norm}_1(x))$$
$$x'' = x' + \text{FFN}(\text{Norm}_2(x'))$$

Parallel Formulation
Some models compute attention and FFN in parallel:

$$x' = x + \text{Attn}(\text{Norm}(x)) + \text{FFN}(\text{Norm}(x))$$

Note: Both branches use the SAME normalized input.
Visual Comparison
Sequential:

x ──→ Norm ──→ Attn ──→ Add ──→ Norm ──→ FFN ──→ Add ──→ x'
│                        ↑   │                     ↑
└─────── residual ───────┘   └────── residual ─────┘

Parallel:

          ┌──→ Attn ──→┐
x ──→ Norm             Add ──→ x'
│         └──→ FFN ──→┘ ↑
└──────── residual ─────┘
Benefits of Parallel
- Fewer operations: one normalization per block instead of two
- Better parallelization: the attention and FFN matrix multiplications can be fused or run concurrently
- Faster training: roughly 15% speedup reported at large scale (e.g., PaLM)
Potential Drawbacks
- The FFN in a block can no longer condition on that block's attention output
- May need different hyperparameters
- Not universally adopted
Who Uses Parallel?
| Model | Parallel |
|-------|----------|
| PaLM | Yes |
| GPT-J | Yes |
| GPT-NeoX | Yes |
| LLaMA | No |
import torch
import torch.nn as nn

# Assumes the RMSNorm, Attention, and SwiGLU modules defined earlier in this series
class ParallelBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm = RMSNorm(d_model)  # Single norm shared by both branches
        self.attn = Attention(d_model, n_heads)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        normed = self.norm(x)
        # Both branches read the same normalized input; one residual add
        return x + self.attn(normed) + self.ffn(normed)
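For contrast, a minimal pre-norm sequential block built from the same assumed components:

class SequentialBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn_norm = RMSNorm(d_model)  # Two norms, one per sub-layer
        self.ffn_norm = RMSNorm(d_model)
        self.attn = Attention(d_model, n_heads)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.attn_norm(x))  # attention sub-layer + residual
        x = x + self.ffn(self.ffn_norm(x))    # FFN sees the attention output
        return x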
4. Initialization Schemes
Why Initialization Matters
Poor initialization causes:
- Exploding gradients: Activations grow exponentially through layers
- Vanishing gradients: Activations shrink, no learning
- Symmetry breaking issues: All neurons learn the same thing
Xavier/Glorot Initialization
For a layer with fan-in $n_{in}$ and fan-out $n_{out}$:

$$W \sim \mathcal{U}\left(-\sqrt{\tfrac{6}{n_{in} + n_{out}}},\; +\sqrt{\tfrac{6}{n_{in} + n_{out}}}\right)$$

Or the normal variant:

$$W \sim \mathcal{N}\left(0,\; \tfrac{2}{n_{in} + n_{out}}\right)$$
Goal: Maintain variance of activations through layers.
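In PyTorch these correspond to built-in initializers, shown here on a standalone layer for illustration:

import torch.nn as nn

layer = nn.Linear(4096, 4096)
nn.init.xavier_uniform_(layer.weight)   # uniform variant
# nn.init.xavier_normal_(layer.weight)  # normal variant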
Kaiming/He Initialization
For ReLU activations:

$$W \sim \mathcal{N}\left(0,\; \tfrac{2}{n_{in}}\right)$$
Why different? ReLU zeros out half the outputs, so we need 2× larger variance to compensate.
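The corresponding PyTorch call, again on a standalone layer:

import torch.nn as nn

layer = nn.Linear(4096, 4096)
# Variance 2/fan_in compensates for ReLU zeroing half the activations
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')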
GPT-Style Initialization
GPT-2/3 uses a specific scheme:
- Standard layers: $W \sim \mathcal{N}(0,\; 0.02^2)$
- Residual projections: scale the standard deviation by $1/\sqrt{2N}$, where $N$ = number of layers
The residual scaling prevents output variance from exploding with depth.
import math
import torch
import torch.nn as nn

def init_weights(model: nn.Module, n_layers: int):
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            std = 0.02
            # Residual (output) projections get scaled-down init: 0.02 / sqrt(2 * n_layers)
            if 'wo' in name or 'w2' in name:
                std = 0.02 / math.sqrt(2 * n_layers)
            torch.nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
μP (Maximal Update Parameterization)
A principled approach to parameterization and initialization that lets hyperparameters tuned on a small model transfer to much larger ones.
Key insight: different parameter groups should get different initialization scales and learning rates as a function of layer width.
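The full rules are beyond this post, but very roughly, one μP rule for Adam is that "matrix-like" hidden weights get their learning rate scaled down by the width ratio. A hypothetical sketch of that single idea (not a complete μP implementation; the function name, parameter-group split, and base_width value are illustrative assumptions):

import torch
import torch.nn as nn

def mup_param_groups(model: nn.Module, base_lr: float, width: int, base_width: int = 256):
    # Hidden "matrix-like" weights get their Adam learning rate scaled by base_width / width
    hidden, other = [], []
    for name, p in model.named_parameters():
        if p.ndim == 2 and 'embed' not in name and 'lm_head' not in name:
            hidden.append(p)
        else:
            other.append(p)
    return [
        {'params': hidden, 'lr': base_lr * base_width / width},
        {'params': other, 'lr': base_lr},
    ]

# optimizer = torch.optim.AdamW(mup_param_groups(model, base_lr=3e-4, width=4096))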
Initialization Summary
| Component | Common Init | Std |
|-----------|-------------|-----|
| Embeddings | Normal | 0.02 |
| Attention QKV | Normal | 0.02 |
| Attention O | Normal | 0.02 / √(2N) |
| FFN up | Normal | 0.02 |
| FFN down | Normal | 0.02 / √(2N) |
| LM Head | Normal | 0.02 |
5. Other Small Details
Vocabulary Size Padding
Pad vocabulary size to be divisible by 64 or 128 for efficient GPU computation:
import math

vocab_size = 50257  # e.g., GPT-2's BPE vocabulary
padded_vocab_size = 64 * math.ceil(vocab_size / 64)  # 50304
Hidden Dimension Multiples
Use hidden dimensions divisible by 128 (or 256) for tensor core efficiency:
import math

# Instead of d_ff = 4 * 4096 = 16384, SwiGLU uses roughly 2/3 of the 4x expansion
d_ff = int(8 / 3 * 4096)            # 10922
d_ff = 256 * math.ceil(d_ff / 256)  # 11008 (LLaMA's FFN dimension)
Epsilon Values
RMSNorm and LayerNorm use small epsilon for numerical stability:
| Usage | Typical ε |
|-------|-----------|
| LayerNorm | 1e-5 |
| RMSNorm | 1e-6 |
| Adam | 1e-8 |
LLaMA uses 1e-6 for RMSNorm.
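As a concrete illustration, a minimal RMSNorm sketch (consistent with the module assumed in the earlier code) showing where ε enters:

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # eps keeps the denominator away from zero for tiny activations
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms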
Summary
| Change | Impact | Adopted By |
|--------|--------|------------|
| No biases | Fewer parameters, simpler | LLaMA, Mistral |
| Untied embeddings | More expressivity | LLaMA, Mistral |
| Parallel Attn+FFN | Faster | PaLM, GPT-J |
| Proper initialization | Stable training | All |
| Dimension padding | GPU efficiency | All |
These "minor" details collectively contribute significantly to the efficiency and stability of modern LLMs.
In the final post, we'll explore Part 8: Alternative Architectures - State Space Models, Mamba, Linear Attention, RWKV, and hybrid architectures.