Transformer Deep Dive: Part 8 - Alternative Architectures
Beyond Transformers - State Space Models (SSMs), Mamba, Linear Attention, RWKV, and hybrid architectures that challenge the attention paradigm.
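The common thread across these models is that the quadratic, all-pairs softmax attention is replaced by a fixed-size state that is updated once per token. Below is a minimal sketch of that recurrent view (a toy NumPy illustration for intuition only; the feature map phi, the shapes, and the epsilon terms are assumptions, not code from any of these papers) contrasting the two formulations:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard causal softmax attention: O(n^2) time, re-reads all past tokens."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention_recurrent(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Linear attention run as an RNN: O(n) time, constant-size state.
    phi is a positive feature map standing in for the softmax kernel (an assumption)."""
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]))   # running sum of outer(phi(k), v)
    z = np.zeros(d)                 # running sum of phi(k), used for normalization
    out = np.zeros_like(V)
    for t in range(n):
        k, v, q = phi(K[t]), V[t], phi(Q[t])
        S += np.outer(k, v)         # fold token t into the fixed-size state
        z += k
        out[t] = (q @ S) / (q @ z + 1e-6)
    return out

# Both produce an (n, d_v) output, but the recurrent version never materializes
# an n x n attention matrix. That constant-memory, linear-time state update is
# the efficiency argument SSMs, Mamba, RWKV, and linear attention all build on,
# each with its own choice of state and update rule.
q = k = np.random.randn(16, 8)
v = np.random.randn(16, 8)
print(softmax_attention(q, k, v).shape, linear_attention_recurrent(q, k, v).shape)
```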
This series is a comprehensive journey from the original 2017 Transformer to modern LLMs like GPT-4 and LLaMA, covering architecture changes, attention modifications, training improvements, and beyond. Earlier parts in the series:
Part 7: The small details that matter - bias removal, tied embeddings, parallel attention and FFN computation, and initialization schemes in modern LLMs.
Part 6: Production deployment techniques - KV-cache for avoiding redundant computation, quantization for memory efficiency, speculative decoding for faster generation, and continuous batching for throughput.
Part 5: Modern training techniques for LLMs - AdamW optimizer, learning rate schedules, mixed precision training (FP16/BF16), gradient checkpointing, and distributed training strategies.
Part 4: Evolution of the Feed-Forward Network - from ReLU to GELU, SwiGLU gated activations, and Mixture of Experts for scaling to trillion-parameter models.
Part 3: Evolution of the attention mechanism - from sinusoidal to RoPE positional encoding, Multi-Query Attention, Grouped Query Attention, and the revolutionary FlashAttention algorithm.
Part 2: How modern LLMs evolved from the original Transformer - decoder-only architecture, Pre-Layer Normalization, and RMSNorm. The fundamental architectural shifts that power GPT, LLaMA, and Mistral.
Part 1: A deep dive into the original Transformer architecture from 'Attention Is All You Need' - the encoder-decoder structure, scaled dot-product attention, multi-head attention, and the design decisions that revolutionized NLP.