Transformer Deep Dive: Part 8 - Alternative Architectures
Beyond Transformers - State Space Models (SSMs), Mamba, Linear Attention, RWKV, and hybrid architectures that challenge the attention paradigm.
Suchinthaka W.
January 22, 2025 · 7 min read
The Transformer's quadratic complexity in sequence length ($O(n^2)$) has motivated research into alternative architectures that can achieve sub-quadratic complexity while maintaining competitive performance.
The Motivation
Self-attention computes pairwise interactions between all positions:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

For sequence length $n$, this requires $O(n^2)$ time and space. For very long sequences (100K+ tokens), this becomes prohibitive.
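To make the cost concrete, here is a minimal single-head sketch (no masking, illustrative only); the `scores` tensor is the $n \times n$ matrix that dominates both time and memory:

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # q, k, v: (n, d) for a single head, no masking
    d = q.shape[-1]
    scores = q @ k.T / d**0.5           # (n, n) -- quadratic in sequence length
    weights = F.softmax(scores, dim=-1)
    return weights @ v                  # (n, d)

n, d = 4096, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
out = naive_attention(q, k, v)          # materializes a 4096 x 4096 score matrix
```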
The Quest for Linear Complexity
| Architecture | Time | Space | Long-range |
|--------------|------|-------|------------|
| Transformer | $O(n^2)$ | $O(n^2)$ | Excellent |
| SSM/Mamba | $O(n)$ | $O(n)$ | Good |
| Linear Attention | $O(n)$ | $O(n)$ | Limited |
| RWKV | $O(n)$ | $O(n)$ | Good |
State Space Models (SSMs)
The Continuous Formulation
SSMs are rooted in control theory. A continuous-time SSM is defined by:

$$x'(t) = A\,x(t) + B\,u(t)$$
$$y(t) = C\,x(t) + D\,u(t)$$

where:

- $u(t)$: Input signal
- $x(t)$: Hidden state vector (dimension $N$)
- $y(t)$: Output signal
- $A$: State transition matrix ($N \times N$)
- $B$: Input projection ($N \times 1$)
- $C$: Output projection ($1 \times N$)
- $D$: Skip connection (often 0)
Discretization
For digital computation, we discretize using a step size $\Delta$ (zero-order hold):

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$$

The discrete recurrence becomes:

$$x_k = \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad y_k = C\,x_k$$
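As a small sketch of the zero-order-hold formulas above, assuming a diagonal $A$ so that the matrix exponential acts elementwise:

```python
import torch

def discretize_zoh(A_diag, B, delta):
    # A_diag: (N,) diagonal of A;  B: (N,);  delta: scalar step size
    dA = torch.exp(delta * A_diag)       # A_bar
    dB = (dA - 1.0) / A_diag * B         # B_bar = (dA)^-1 (exp(dA) - I) dB, diagonal case
    return dA, dB
```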
Parallel Computation via Convolution
The recurrence can be unrolled as a convolution:

$$y = \bar{K} * u$$

where the kernel is:

$$\bar{K} = \left(C\bar{B},\; C\bar{A}\bar{B},\; C\bar{A}^2\bar{B},\; \dots,\; C\bar{A}^{L-1}\bar{B}\right)$$
This enables parallel training via FFT!
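A rough sketch of this convolutional view, again assuming a diagonal $\bar{A}$ and ignoring the numerically careful kernel computation that S4 actually uses (it reuses `discretize_zoh` from the sketch above):

```python
import torch

def ssm_kernel(dA, dB, C, L):
    # dA, dB, C: (N,) state-sized parameters; returns the length-L kernel K_bar
    powers = dA.unsqueeze(0) ** torch.arange(L).unsqueeze(1)   # (L, N): A_bar^0 ... A_bar^{L-1}
    return (powers * dB * C).sum(-1)                           # (L,): C A_bar^k B_bar

def causal_conv_fft(u, K):
    # u: (L,) input signal; K: (L,) SSM kernel; returns y = K * u (causal)
    L = u.shape[0]
    n = 2 * L
    y = torch.fft.irfft(torch.fft.rfft(u, n) * torch.fft.rfft(K, n), n)
    return y[:L]

# Example: N = 4 states, sequence length L = 1024
N, L = 4, 1024
A_diag = -torch.arange(1.0, N + 1)             # simple stable diagonal A
B = torch.ones(N); C = torch.randn(N)
dA, dB = discretize_zoh(A_diag, B, delta=0.1)  # from the sketch above
u = torch.randn(L)
y = causal_conv_fft(u, ssm_kernel(dA, dB, C, L))
```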
The S4 Innovation
S4 (Structured State Space Sequence model) made SSMs practical by:
- HiPPO initialization: Special A matrix that captures long-range dependencies
- Diagonal structure: Efficient computation with $O(N)$ operations per step instead of $O(N^2)$ for a dense state matrix
- Parallel scan: GPU-efficient recurrence computation
Mamba: Selective State Spaces
The Key Insight
Traditional SSMs use fixed (input-independent) $A$, $B$, $C$ matrices and step size $\Delta$. Mamba makes $B$, $C$, and $\Delta$ functions of the input ($A$ remains fixed but is modulated through $\Delta$):

$$B_t = \mathrm{Linear}_B(x_t), \qquad C_t = \mathrm{Linear}_C(x_t), \qquad \Delta_t = \mathrm{softplus}(\mathrm{Linear}_\Delta(x_t))$$
This enables content-aware processing—the model can decide what information to remember or forget.
The Selectivity Mechanism
Input: "The capital of France is"
Fixed SSM: Treats all tokens equally
Mamba: Attends strongly to "France", weakly to "The"
Architecture
Mamba Block:

Input ──► Linear ──┬──► Conv1D ──► SSM ──┐
                   │                     ×──► Out
                   └──────► SiLU ────────┘
Key components:
- Linear projection to expand dimension
- 1D convolution for local context
- Selective SSM for sequence mixing
- Gating with SiLU activation
Mamba Implementation (Simplified)
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MambaBlock(nn.Module):
    def __init__(self, d_model: int, d_state: int = 16, d_conv: int = 4):
        super().__init__()
        self.d_model = d_model
        self.d_state = d_state

        # Projections
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

        # Depthwise conv for local context
        self.conv = nn.Conv1d(d_model, d_model, d_conv, padding=d_conv - 1, groups=d_model)

        # Input-dependent SSM parameters: B, C (d_state each) and delta (1)
        self.x_proj = nn.Linear(d_model, 2 * d_state + 1)
        self.dt_proj = nn.Linear(1, d_model)

        # Fixed A (log scale for stability, negated on use)
        self.A_log = nn.Parameter(torch.log(torch.arange(1, d_state + 1).float()))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, L, D = x.shape

        # Project and split into the SSM path (x) and the gate path (z)
        xz = self.in_proj(x)
        x, z = xz.chunk(2, dim=-1)

        # Causal depthwise conv, trimmed back to length L
        x = self.conv(x.transpose(1, 2))[:, :, :L].transpose(1, 2)
        x = F.silu(x)

        # SSM (selective scan - simplified)
        y = self.selective_scan(x)

        # Gating
        y = y * F.silu(z)
        return self.out_proj(y)

    def selective_scan(self, x: torch.Tensor) -> torch.Tensor:
        # Compute input-dependent B, C, delta, then run the recurrence sequentially.
        # (The actual implementation uses a fused parallel-scan CUDA kernel.)
        batch, L, D = x.shape
        B_ssm, C_ssm, dt = self.x_proj(x).split([self.d_state, self.d_state, 1], dim=-1)
        delta = F.softplus(self.dt_proj(dt))          # (batch, L, D)
        A = -torch.exp(self.A_log)                    # (N,), negative for stability

        h = x.new_zeros(batch, D, self.d_state)
        ys = []
        for t in range(L):
            dA = torch.exp(delta[:, t].unsqueeze(-1) * A)              # (batch, D, N)
            dB = delta[:, t].unsqueeze(-1) * B_ssm[:, t].unsqueeze(1)  # (batch, D, N)
            h = dA * h + dB * x[:, t].unsqueeze(-1)                    # state update
            ys.append((h * C_ssm[:, t].unsqueeze(1)).sum(-1))          # (batch, D)
        return torch.stack(ys, dim=1)
```
Mamba Performance
| Model | Params | Perplexity | Throughput |
|-------|--------|------------|------------|
| Transformer | 1.4B | 14.2 | 1× |
| Mamba | 1.4B | 14.0 | 5× |
Mamba matches Transformer quality with 5× higher inference throughput!
Linear Attention
The Idea
Standard attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Linear attention removes the softmax:

$$y_i = \frac{\phi(q_i)^\top \left(\sum_j \phi(k_j)\, v_j^\top\right)}{\phi(q_i)^\top \sum_j \phi(k_j)}$$

where $\phi$ is a feature map.
The Trick: Associativity
By reordering the computation:

$$\left(\phi(Q)\,\phi(K)^\top\right)V = \phi(Q)\left(\phi(K)^\top V\right)$$

We compute the $d \times d$ matrix $\phi(K)^\top V$ first, then multiply with each query. This is $O(n d^2)$ instead of $O(n^2 d)$.
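A minimal causal linear-attention sketch (illustrative, using the ELU+1 feature map listed in the table below): the running sums `S` and `z` play the roles of $\sum_j \phi(k_j) v_j^\top$ and $\sum_j \phi(k_j)$, so each token costs $O(d^2)$.

```python
import torch
import torch.nn.functional as F

def phi(x):
    return F.elu(x) + 1  # positive feature map

def causal_linear_attention(q, k, v):
    # q, k, v: (n, d); O(n * d^2) time, O(d^2) state
    n, d = q.shape
    q, k = phi(q), phi(k)
    S = torch.zeros(d, d)       # running sum of phi(k_j) v_j^T
    z = torch.zeros(d)          # running sum of phi(k_j)
    out = torch.empty_like(v)
    for i in range(n):
        S = S + torch.outer(k[i], v[i])
        z = z + k[i]
        out[i] = (q[i] @ S) / (q[i] @ z + 1e-6)
    return out
```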
Feature Maps
| Method | $\phi(x)$ | Properties |
|--------|-----------|------------|
| Linear | $x$ | Simplest, limited expressivity |
| ELU+1 | $\mathrm{elu}(x) + 1$ | Positive, smooth |
| Random Features | Randomized feature maps | Approximates softmax |
| Performer | Random Fourier features | Unbiased approximation |
Limitations
- No sharp attention patterns: Can't focus on single tokens
- Approximation quality: May not match softmax exactly
- Training stability: Can be harder to train
RWKV
The Concept
RWKV combines the best of RNNs and Transformers:
- Training: Parallelizable like Transformers
- Inference: $O(1)$ per token like RNNs
WKV Mechanism
RWKV uses a novel "WKV" (weighted key-value) mechanism:

$$wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i}\, v_i + e^{u + k_t}\, v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}}$$

where:

- $w$: Learned decay rate per channel
- $u$: Learned bonus for the current token
- $k_i$, $v_i$: Key and value projections
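In recurrent form, the numerator and denominator of the WKV sum can be carried as running state, which is what gives $O(1)$ per-token inference. A numerically naive sketch (real implementations also track a running maximum for stability):

```python
import torch

def wkv_recurrent(k, v, w, u):
    # k, v: (n, d) keys and values; w, u: (d,) per-channel decay and current-token bonus
    n, d = k.shape
    num = torch.zeros(d)   # running numerator:   decayed sum of e^{k_i} * v_i
    den = torch.zeros(d)   # running denominator: decayed sum of e^{k_i}
    out = torch.empty_like(v)
    for t in range(n):
        bonus = torch.exp(u + k[t])
        out[t] = (num + bonus * v[t]) / (den + bonus)
        # decay old state, then add the current token (without the bonus)
        num = torch.exp(-w) * num + torch.exp(k[t]) * v[t]
        den = torch.exp(-w) * den + torch.exp(k[t])
    return out
```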
RWKV Architecture
Token → Embedding → [RWKV Block] × L → LayerNorm → Output
RWKV Block:
├── LayerNorm
├── Time Mixing (WKV)
├── Residual
├── LayerNorm
├── Channel Mixing (FFN-like)
└── Residual
Comparison to Attention
| Aspect | Attention | RWKV |
|--------|-----------|------|
| Complexity | $O(n^2)$ | $O(n)$ |
| Long-range | Excellent | Good (with decay) |
| KV Cache | Grows with $n$ | Fixed size |
| Training | Parallel | Parallel |
| Inference | Parallel | Sequential (fast) |
Hybrid Architectures
The Best of Both Worlds
Some architectures combine attention and linear methods:
Jamba (AI21):
- Alternates Mamba and Attention layers
- Uses MoE for scaling
- Mamba for efficiency, Attention for complex patterns
Griffin (Google):
- RNN-like gated linear recurrence
- Local attention for nearby context
- MLP for channel mixing
Design Patterns
Hybrid Block Options:
1. Interleaved:
[Mamba] → [Attn] → [Mamba] → [Attn] → ...
2. Ratio-based:
[Mamba] × 3 → [Attn] → [Mamba] × 3 → [Attn] → ...
3. Hierarchical:
Local: Mamba
Global: Sparse Attention
4. Parallel:
[Mamba] ─┐
├─► Add
[Attn] ──┘
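As a rough illustration of the interleaved/ratio-based patterns, a hybrid stack can be assembled by choosing the mixer per layer. This sketch reuses the `MambaBlock` defined earlier and wraps `nn.MultiheadAttention` as a stand-in attention layer (causal masking omitted for brevity):

```python
import torch.nn as nn


class AttentionBlock(nn.Module):
    # Minimal stand-in attention layer (pre-norm, residual; causal mask omitted)
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out


def build_hybrid_stack(d_model: int, n_layers: int = 8, attn_every: int = 4):
    # Ratio-based pattern: one attention layer per `attn_every` layers, Mamba elsewhere
    return nn.Sequential(*[
        AttentionBlock(d_model) if i % attn_every == attn_every - 1 else MambaBlock(d_model)
        for i in range(n_layers)
    ])
```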
Jamba Architecture
Block 0: Attention + MoE
Blocks 1-7: Mamba + MoE
Block 8: Attention + MoE
Blocks 9-15: Mamba + MoE
...
Ratio: 1 Attention : 7 Mamba
Comparison Summary
| Architecture | Training | Inference | Long Context | Quality |
|--------------|----------|-----------|--------------|---------|
| Transformer | Parallel | Parallel | $O(n)$ KV-cache memory | Best |
| Mamba | Parallel | Sequential | $O(1)$ state memory | Near-best |
| Linear Attn | Parallel | Parallel | $O(1)$ state memory | Good |
| RWKV | Parallel | Sequential | $O(1)$ state memory | Good |
| Jamba | Parallel | Mixed | Efficient | Near-best |
When to Use What?
| Use Case | Recommendation |
|----------|----------------|
| Best quality, moderate context | Transformer |
| Very long context (100K+) | Mamba or Hybrid |
| Resource-constrained inference | RWKV |
| Streaming applications | Mamba, RWKV |
| Maximum throughput | Mamba |
The Future
The field is rapidly evolving:
- Hybrid architectures may dominate, combining attention's expressivity with SSM efficiency
- Hardware co-design will optimize for specific architectures
- Task-specific architectures may emerge for different use cases
- Scaling laws for alternative architectures are still being understood
Conclusion
While "Attention Is All You Need" revolutionized NLP, the quest for efficiency has spawned alternatives that challenge this paradigm. State Space Models, Mamba, and hybrid architectures offer compelling tradeoffs between quality, speed, and memory efficiency.
The transformer isn't going away—but it's no longer the only game in town.
This concludes the "Transformer Deep Dive" series. We've covered:
- Part 1: Original Transformer (2017)
- Part 2: Architecture Changes (Decoder-only, Pre-LN, RMSNorm)
- Part 3: Attention Modifications (RoPE, GQA, FlashAttention)
- Part 4: FFN Modifications (SwiGLU, MoE)
- Part 5: Training Improvements (AdamW, Mixed Precision)
- Part 6: Inference Optimization (KV-cache, Quantization)
- Part 7: Minor But Important Changes
- Part 8: Alternative Architectures
Thanks for following along!