diffusion-deep-dive · Part 1

Diffusion Deep Dive Part 1: From an Impossible Integral to a Two-Line Loss (and Back Out to Samples)

A step-by-step derivation of the DDPM training objective and the sampler. We start from an intractable log-likelihood, apply the ELBO, rewrite the bound as a sum of Gaussian KL terms, derive the closed-form posterior, reparameterize down to an MSE noise-prediction loss, then turn that loss back into the iterative reverse-process algorithm that actually generates images.

· 22 min read · diffusion, generative-models, deep-learning +2
diffusion-deep-dive · Part 2

Diffusion Deep Dive Part 2: DDIM — From 1000 Steps to 25 Without Retraining

The math and algorithm behind Denoising Diffusion Implicit Models (DDIM). We derive a non-Markov forward process that matches DDPM's marginals, obtain a family of reverse samplers parameterized by η, recover DDPM at η=1 and a deterministic probability-flow ODE sampler at η=0, and show why sub-sampling timesteps lets us cut 1000 network calls down to 25 with the same trained network.

· 11 min read · diffusion, generative-models, deep-learning +4
diffusion-deep-dive · Part 3

Diffusion Deep Dive Part 3: Coding a DDPM from Scratch

A from-scratch PyTorch implementation of DDPM. We build the noise schedule, the closed-form forward process, a small UNet with sinusoidal time embeddings, the training loop, and the iterative sampler that turns Gaussian noise into images. Every piece maps directly to an equation from Part 1.

· 22 min read · diffusion, generative-models, deep-learning +3

7 RAG Retrieval Strategies, Benchmarked

Which retrieval method actually wins? I built DocMind AI to find out.

· 12 min read · rag, langchain, llm +3
Transformer Deep Dive · Part 8

Transformer Deep Dive: Part 8 - Alternative Architectures

Beyond Transformers - State Space Models (SSMs), Mamba, Linear Attention, RWKV, and hybrid architectures that challenge the attention paradigm.

· 29 min read · transformers, mamba, ssm +3
Transformer Deep Dive · Part 7

Transformer Deep Dive: Part 7 - Minor But Important Changes

The small details that matter - bias removal, tied embeddings, parallel attention and FFN computation, and initialization schemes in modern LLMs.

· 24 min read · transformers, architecture, initialization +2
Transformer Deep Dive · Part 6

Transformer Deep Dive: Part 6 - Inference Optimization

Production deployment techniques - KV-cache for avoiding redundant computation, quantization for memory efficiency, speculative decoding for faster generation, and continuous batching for throughput.

· 25 min read · transformers, inference, kv-cache +3
Transformer Deep Dive · Part 5

Transformer Deep Dive: Part 5 - Training Improvements

Modern training techniques for LLMs - AdamW optimizer, learning rate schedules, mixed precision training (FP16/BF16), gradient checkpointing, and distributed training strategies.

· 25 min read · transformers, training, optimization +3
Transformer Deep Dive · Part 4

Transformer Deep Dive: Part 4 - FFN Modifications

Evolution of the Feed-Forward Network - from ReLU to GELU, SwiGLU gated activations, and Mixture of Experts for scaling to trillion-parameter models.

· 22 min read · transformers, ffn, swiglu +3
Transformer Deep Dive · Part 3

Transformer Deep Dive: Part 3 - Attention Modifications

Evolution of the attention mechanism - from sinusoidal to RoPE positional encoding, Multi-Query Attention, Grouped Query Attention, and the revolutionary FlashAttention algorithm.

· 22 min read · transformers, attention, rope +3
Transformer Deep Dive · Part 2

Transformer Deep Dive: Part 2 - Architecture Changes

How modern LLMs evolved from the original Transformer - decoder-only architecture, Pre-Layer Normalization, and RMSNorm. The fundamental architectural shifts that power GPT, LLaMA, and Mistral.

· 17 min read · transformers, attention, deep-learning +2
Transformer Deep Dive · Part 1

Transformer Deep Dive: Part 1 - The Original Transformer (2017)

A deep dive into the original Transformer architecture from 'Attention Is All You Need' - the encoder-decoder structure, scaled dot-product attention, multi-head attention, and the design decisions that revolutionized NLP.

· 10 min read · transformers, attention, deep-learning +2