These explain the entire Transformer like a story with diagrams—no code yet, just how data flows, attention works, and why it replaced older models.
- The Illustrated Transformer by Jay Alammar (blog post)
Link: https://jalammar.github.io/illustrated-transformer/
Why: This is the gold-standard, easy-to-digest guide. Colorful diagrams show encoders/decoders, self-attention, multi-head attention, positional encodings, and vector flows step by step. It's taught at Stanford, MIT, and elsewhere, and is still the #1 recommendation in 2026 guides. Read it first—it takes 30–60 minutes and demystifies everything. (Bonus: there's a narrated version and an updated book chapter if you love it.)
- Transformers, the tech behind LLMs by 3Blue1Brown (YouTube, 27 min)
Link: https://www.youtube.com/watch?v=wjZofJX0v4M
Follow-up: Attention in transformers, step-by-step (27 min) → https://www.youtube.com/watch?v=eMlx5fFNoYc
Why: Grant Sanderson's signature animations make embeddings, attention, and GPT-style generation crystal clear. Watch these right after the blog for moving visuals.
Now zoom into the magic (attention mechanism, tokens, embeddings) with clear explanations.
- Natural Language Processing and Large Language Models playlist by Luis Serrano (SerranoAcademy)
Link: https://www.youtube.com/playlist?list=PLs8w1Cdi-zvYskDS2icIItfZgxclApVLv
Specific series: The Attention Mechanism in Large Language Models → https://www.youtube.com/watch?v=OxCpWwDCDFQ (playlist)
Why: Short, visual videos build from tokens/embeddings to full attention. A perfect bridge—Luis explains why attention is "all you need" without overwhelming you.
- Intro to Large Language Models by Andrej Karpathy (YouTube, ~1 hour)
Link: https://youtu.be/zjkBMFhNj_g
Why: High-level yet insightful overview of how LLMs like the ones you use (Claude, GPT) actually "think." Karpathy is the best explainer in the field.
Since you code daily with AI tools, this level lets you implement the concepts. Use Cursor/Claude to help debug or explain code as you go.
- Let's build GPT: from scratch, in code, spelled out by Andrej Karpathy (YouTube + code)
Link: https://www.youtube.com/watch?v=kCc8FmEb1nY
Repo: https://github.com/karpathy/nanoGPT (or his build-nanogpt)
Why: The single best "from zero to hero" coding tutorial. You literally code a mini-GPT (tokenization, attention, training) line by line. It's spelled out slowly, runs on a laptop, and directly shows why your Claude/Cursor models work. Pause and replicate in your editor—highly recommended.
- Create a Large Language Model from Scratch with Python by freeCodeCamp (YouTube tutorial)
Link: https://youtu.be/UU1WVnMk4E8
Why: Another full from-scratch build with data handling + transformers. Great companion if you want extra examples.
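Both builds center on the same core computation. As a reference while you code along, here is a minimal sketch of causal scaled dot-product attention in plain Python (toy dimensions, no framework; the function names are illustrative, not taken from nanoGPT):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def causal_attention(q, k, v):
    """q, k, v: lists of d-dimensional vectors, one per token.
    Each position attends only to itself and earlier positions."""
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # scaled dot-product scores against positions 0..i
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # output = attention-weighted sum of value vectors
        out.append([sum(w * v[j][t] for j, w in enumerate(weights))
                    for t in range(d)])
    return out

# toy example: 3 tokens, 2 dimensions, self-attention (q = k = v)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(x, x, x)
# position 0 can only attend to itself, so out[0] == x[0]
```

Real implementations batch this into matrix multiplies with a causal mask, but the loop above is exactly what those matrices compute.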
Now use production libraries to inspect real LLMs (including ones similar to Claude).
- Hugging Face LLM Course (free interactive course)
Link: https://huggingface.co/learn/llm-course (start at Chapter 1)
YouTube playlist companion: https://www.youtube.com/playlist?list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o
Why: Hands-on notebooks with the transformers library. You'll load real models, see tokenizers/embeddings/attention in action, and experiment—perfect for someone already shipping software. Completely free, no ads.
- The Illustrated GPT-2 by Jay Alammar (blog) → https://jalammar.github.io/illustrated-gpt2/ (decoder-focused, builds directly on Level 1).
- How Transformer LLMs Work free course (YouTube, ~90 min with code) → https://www.youtube.com/watch?v=k1ILy23t89E (covers tokenizers, embeddings, MoE in modern models).
- Original paper "Attention Is All You Need" (2017) – read it after the visuals; the illustrations above make it 10x easier.
Next phase (Levels 5–7): Move from "how a basic Transformer works" to how modern 2025–2026 LLMs actually run in production, why models like Claude / Llama 3 / Qwen / DeepSeek behave the way they do, and how efficiency tricks make huge context windows + fast inference possible. This will make your daily work with Claude Code / Cursor much more insightful (e.g., understanding KV cache impact on speed, why some models "forget" in long chats, or how MoE saves compute).
Focus remains visual + code-heavy, developer-friendly resources (YouTube + blogs/repos). Estimated time: 15–25 hours spread over 2–4 weeks.
These explain why pure vanilla Transformers (what you built in nanoGPT) got replaced/refined.
- Jay Alammar's "The Illustrated GPT-2" (blog, ~45 min)
Link: https://jalammar.github.io/illustrated-gpt2/
Why next: Builds directly on The Illustrated Transformer but focuses on the decoder-only (GPT-style) architecture you use every day. Covers generation, sampling, top-k/top-p, and beam search visually. Read this immediately—it's the natural sequel.
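The sampling strategies that post visualizes fit in a few lines. A framework-free sketch of top-k and nucleus (top-p) filtering over a toy logit vector (names and numbers here are illustrative):

```python
import math

def top_k_top_p_filter(logits, k=0, p=1.0):
    """Return sampling probabilities after top-k and/or nucleus (top-p)
    filtering. logits: one float per vocabulary token."""
    m = max(logits)  # softmax with max-subtraction for stability
    probs = [math.exp(x - m) for x in logits]
    total = sum(probs)
    probs = [q / total for q in probs]
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep = set(order)
    if k > 0:        # top-k: keep only the k most likely tokens
        keep &= set(order[:k])
    if p < 1.0:      # top-p: smallest set of tokens whose mass reaches p
        cum, nucleus = 0.0, set()
        for i in order:
            nucleus.add(i)
            cum += probs[i]
            if cum >= p:
                break
        keep &= nucleus
    filtered = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    z = sum(filtered)
    return [q / z for q in filtered]  # renormalize the survivors

probs = top_k_top_p_filter([2.0, 1.0, 0.1, -1.0], k=2)
# only the two highest-logit tokens keep nonzero probability
```

Greedy decoding is the `k=1` special case; temperature would simply divide the logits before this function runs.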
- DeepLearning.AI short course: "How Large Language Models Work" (by Jay Alammar + Maarten Grootendorst, free, ~2–3 hours total)
Link: Search the DeepLearning.AI platform or YouTube for "How Transformer LLMs Work DeepLearning.AI" (often linked with Jay's visuals + code snippets).
Why: Hands-on intuition for tokenization → embeddings → attention blocks → generation loop, with modern twists.
- Key modern tricks videos (watch in this order, ~2 hours total):
- Rotary Positional Embeddings (RoPE) explained (short & visual): Search YouTube for "RoPE Rotary Embeddings explained" (many 10–15 min explainers from 2024–2025, e.g., by AI Explained or assemblyAI channels).
- FlashAttention / FlashAttention-2: "FlashAttention explained" by Dao-AILab or similar (look for 2024–2025 videos, ~20 min). Shows why attention is slow and how it's 2–4× faster + lower memory.
- Grouped-Query Attention (GQA) & KV cache basics: "KV Cache explained" + "GQA vs MQA" shorts/videos (~15–30 min total).
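To make the KV-cache motivation concrete, here is a back-of-envelope size calculator (cache bytes ≈ 2 × layers × KV heads × head dim × sequence length × bytes per element; the config below is an illustrative 70B-class shape, not any specific model):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # factor of 2 accounts for storing both K and V at every layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# illustrative 70B-class config, fp16 cache, 8k-token context
mha = kv_cache_bytes(80, 64, 128, 8192)  # full multi-head attention: 64 KV heads
gqa = kv_cache_bytes(80, 8, 128, 8192)   # grouped-query attention: 8 KV heads
# mha is 20.0 GiB vs 2.5 GiB for gqa: GQA shrinks the cache 8x
```

That 8× saving per concurrent request is why nearly every modern serving stack uses GQA or MQA, and why long contexts are cache-bound rather than compute-bound.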
Most frontier/open models in 2026 use MoE (e.g., rumored GPT-4o parts, DeepSeek, Qwen, Mixtral successors). This level explains speed/memory wins.
- Stanford CME295 Transformers & LLMs (Autumn 2025 playlist) – free on YouTube
Link: https://www.youtube.com/playlist?list=PLoROMvodv4rOCXd21gf0CF4xr35yINeOy
Watch selectively:
- Lectures 1–3: Transformer recap + modern LLM architecture (RoPE, tricks).
- The lecture on Mixture of Experts (likely Lecture 3 or nearby).
- The lecture on Transformer-based models & tricks.
Why: University-level but visual/slides-heavy; covers MoE, context-length scaling, and temperature/sampling. Skip the evaluation material if time-constrained.
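The MoE routing those lectures cover fits in a few lines. A toy sketch, assuming scalar-multiplier "experts" purely for illustration (real experts are feed-forward networks):

```python
import math

def moe_route(x, expert_weights, top_k=2):
    """Toy mixture-of-experts layer: a linear router scores every expert,
    only the top_k experts run, and their outputs are mixed by softmaxed
    gate scores. Experts here just multiply the input by (index + 1)."""
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in expert_weights]
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:top_k]
    m = max(scores[i] for i in top)
    gates = [math.exp(scores[i] - m) for i in top]
    z = sum(gates)
    gates = [g / z for g in gates]
    # only the top_k experts are evaluated: this is where MoE saves compute
    outputs = [[(i + 1) * xi for xi in x] for i in top]
    return ([sum(g * o[t] for g, o in zip(gates, outputs))
             for t in range(len(x))], top)

y, chosen = moe_route([1.0, 2.0], [[1, 0], [0, 1], [1, 1]], top_k=2)
# 2 of the 3 experts run; total parameters grow, active compute does not
```

This is why a "400B-parameter" MoE can cost roughly what a much smaller dense model costs per token: only the routed experts' FLOPs are spent.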
- "LLM Model Architecture Explained: Transformers to MoE" blog/video companions
Search for Clarifai or similar 2026 articles/videos on "LLM architecture MoE RoPE FlashAttention 2026".
Why: Ties together RoPE + FlashAttention + GQA + MoE with diagrams.
- vLLM or llama.cpp inference deep dive (hands-on)
Repo: https://github.com/vllm-project/vllm (read docs + run examples).
Or watch: "vLLM inference explained" YouTube (~30 min).
Why: See KV cache quantization, paged attention, continuous batching in action—directly relevant to why Claude responds fast.
Now experiment with real efficient implementations.
- Build on nanoGPT: Upgrade it
- Add RoPE (easy code mods exist in forks).
- Integrate FlashAttention-2 (via the Dao-AILab repo: https://github.com/Dao-AILab/flash-attention).
- Try multi-query / grouped-query attention.
Why: Seeing these in your own code → deep understanding. Use Cursor/Claude to help refactor.
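For the RoPE upgrade, the core transform is small. A minimal, framework-free sketch (real implementations apply this to the query/key vectors inside each attention head; this standalone version is just for intuition):

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotary positional embedding: rotate each consecutive pair of
    dimensions by a position-dependent angle. vec must have even length."""
    out = []
    for i in range(0, len(vec), 2):
        theta = pos / (base ** (i / len(vec)))  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2D rotation
    return out

q0 = rope([1.0, 0.0, 1.0, 0.0], pos=0)  # position 0: no rotation
q5 = rope([1.0, 0.0, 1.0, 0.0], pos=5)  # position 5: each pair rotated
```

The key property: rotations preserve vector norms, and the dot product between a rotated query and rotated key depends only on their *relative* position, which is what lets RoPE generalize across context positions.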
- Hugging Face advanced chapters (continue the course)
Focus on: PEFT (LoRA/QLoRA), inference optimization, quantization.
Link: https://huggingface.co/learn/llm-course (chapters on fine-tuning + deployment).
- Andrej Karpathy updates (2025–2026)
Check his YouTube/blog for any new "state of LLMs" or microGPT-style minimalist projects (he released microgpt in early 2026 as an ultra-simplified version).
Why: Keeps you current with first-principles takes.
By the end you'll understand:
- Why context windows reached 128k–1M+ tokens.
- How MoE makes 400B+ models run cheaply.
- Why inference speed/memory dominates real usage.
- Debugging tricks when models hallucinate or slow down.
You've made excellent progress through the core architecture, coding from scratch, and modern efficiency tricks. At this stage (March 2026), the most valuable next steps shift toward practical, production-oriented skills that directly enhance what you build daily with Claude Code and Cursor:
- Fine-tuning and parameter-efficient adaptation (LoRA/QLoRA) to customize models for your domain/code style.
- Retrieval-Augmented Generation (RAG) to ground responses in your own docs/codebases (huge for reducing hallucinations in software tasks).
- Building AI agents and multi-step reasoning workflows (tool use, planning, reflection) — the direction frontier models and tools like Cursor are heading in 2026.
- Understanding current open-weight frontiers (DeepSeek, Qwen, Llama variants) and running/inference-optimizing them locally or cheaply.
This phase bridges "understanding the model" → "building reliable LLM-powered software". Resources remain visual/code-heavy, mostly free or low-cost, with strong 2025–2026 updates.
Learn how to adapt open models (without full retraining) — directly useful for tailoring LLMs to your codebase or project style.
- DeepLearning.AI: "Fine-Tuning Large Language Models" short course (free, ~2 hours)
Link: https://www.deeplearning.ai/short-courses/finetuning-large-language-models/
Why: Andrew Ng's team covers full fine-tuning + PEFT (LoRA/QLoRA) with clear explanations and Hugging Face code. A perfect bridge from the HF course you already did.
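The LoRA trick the course teaches reduces to one equation: W' = W + (α/r)·B·A, with small trainable A (r×d) and B (d×r) next to a frozen W. A plain-Python sketch with toy dimensions (all matrices here are illustrative):

```python
def matmul(a, b):
    # plain-Python matrix multiply, good enough for the sketch
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_update(W, A, B, alpha):
    """LoRA: instead of training the full weight W, train two small
    matrices A (r x d_in) and B (d_out x r). The effective weight is
    W + (alpha / r) * B @ A; W itself stays frozen."""
    r = len(A)
    BA = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2x2)
A = [[0.1, 0.2]]              # rank-1 down-projection (1x2)
B = [[1.0], [0.0]]            # rank-1 up-projection (2x1)
W_adapted = lora_update(W, A, B, alpha=1.0)
# toy dims hide the payoff: at d=4096, r=8 you train ~65k numbers
# (2 * 4096 * 8) per matrix instead of ~16.8M (4096 * 4096)
```

QLoRA is the same idea with W stored in 4-bit quantized form, which is why it fits on consumer GPUs.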
- Hugging Face: Continue/advance in their LLM course
Focus on chapters for PEFT, LoRA, QLoRA, and evaluation.
Link: https://huggingface.co/learn/llm-course (pick up at the fine-tuning sections).
Bonus: Their 2026 notebooks often include Gemma-2/3, Llama-3.x, or Phi-4 examples.
- Practical notebook series by Daniel Bourke / Zero to Mastery (YouTube + code, 2026 updates)
- "Learn to fine-tune an LLM (Gemma-3-270M)" → https://youtu.be/2hoNAr-id-E (step-by-step full fine-tune).
- VLM fine-tuning if you're curious about multimodal.
Why: Real code you can run/adapt in Cursor; ties to open models dominant in 2026.
RAG is one of the highest-ROI skills right now — it makes LLMs reliable for code/docs/search-heavy tasks.
- Best free RAG path (2026 recommendations):
- LangChain for LLM Application Development (DeepLearning.AI short course, free) → Search "DeepLearning.AI LangChain" or direct link via their platform. Covers RAG basics + chains.
- LlamaIndex tutorials (competes with LangChain, often simpler for RAG) → https://docs.llamaindex.ai/ (start with "Getting Started" + "RAG" section).
- Hugging Face Agents Course (free, includes RAG + tool use) → https://huggingface.co/learn/agents-course
- Top visual/practical RAG video (2026):
Search YouTube for "Advanced Generative AI Full Course 2026 [FREE]" by Simplilearn or similar long-form (covers RAG + LangChain workflows).
Why: End-to-end; shows chunking, embeddings, retrieval, and reranking—apply it immediately to index your Git repos or docs.
- Hands-on: Build a simple RAG pipeline over your own code/docs using Ollama (local models) + a Chroma/FAISS vector DB. Use Cursor to speed up implementation.
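Before reaching for a vector DB, the whole retrieve-then-prompt loop can be seen in miniature. A toy sketch using bag-of-words cosine similarity in place of a real embedding model (a real setup would swap in Chroma/FAISS plus a neural embedder; the chunks below are made up):

```python
import math
import re
from collections import Counter

def embed(text):
    # toy "embedding": word counts; real RAG uses a neural embedding model
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # rank chunks by similarity to the query, return the top k
    qv = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "def connect(): opens a database connection with retry logic",
    "README: project licensed under MIT",
    "def close(): closes the database connection and flushes buffers",
]
question = "how do I open a database connection?"
context = retrieve(question, chunks, k=1)
# ground the model's answer in the retrieved chunk
prompt = "Answer using only this context:\n" + "\n".join(context) + "\nQ: " + question
```

The structure is identical at scale: chunk, embed, index, retrieve, stuff into the prompt. Reranking and better chunking are refinements of these same five steps.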
2026 is the year of agentic AI — models that plan, use tools, reflect, and loop (like enhanced Cursor/Claude behaviors).
- DeepLearning.AI: "Agentic AI" short course (free)
Link: https://learn.deeplearning.ai/courses/agentic-ai (covers reflection, tool use, planning, multi-agent patterns by Andrew Ng).
Why: The cleanest intro to why o1/o3-style reasoning + agents matter.
- Hugging Face Agents Course (free, 2025–2026)
Link: https://huggingface.co/learn/agents-course
Why: Hands-on with transformers agents, tool calling, and multi-step tasks.
- LangChain official docs + quickstarts (free)
Link: https://python.langchain.com/docs/ (focus on agents, tool calling, memory).
Why: The dominant framework for agent workflows; many 2026 tutorials integrate with open models like DeepSeek-V4 or Qwen.
- Bonus watch: Andrej Karpathy interviews/talks on agents (2025–2026 updates)
e.g., "Andrej Karpathy on Agents, AutoResearch..." (YouTube) — discusses coding agents, vibe coding, and the shift to autonomous systems.
Local/cheap inference is now incredibly strong — try these to feel the current state.
- Top open models right now (March 2026): DeepSeek-V4 (1T params, competitive with frontier models), the Qwen series (great at code), Llama-3.x variants, Gemma-3, and Mistral successors.
Run via: Ollama (easiest locally), LM Studio, or vLLM for faster inference.
- Quick experiment path:
- Install Ollama → ollama.com
- ollama run deepseek-v4 (or qwen2.5-coder, etc.)
- Prompt it with your code problems → compare to Claude.
- Add simple RAG or agent wrapper around it.
You've nailed the foundational → modern architecture → efficiency → fine-tuning/RAG/agents progression. At this point (late March 2026), the highest-leverage next steps shift toward advanced, production-grade, and forward-looking topics that turn you from "building with LLMs" into "engineering reliable, scalable, intelligent systems" — especially relevant since you're already shipping software daily via Claude/Cursor.
Focus areas now include:
- Deeper agentic systems (multi-agent, long-horizon planning, reliability).
- Evaluation, monitoring, and safety/alignment in production.
- Multimodal LLMs (vision + text, since many 2026 models are VLMs).
- Mechanistic interpretability basics (why models do what they do).
- Experimenting with bleeding-edge open models (DeepSeek-V3.2 / V4 lineage, Qwen3.5, GLM-5, Kimi K2.5, Llama 4 variants, GPT-oss series) that dominate leaderboards right now.
These build directly on your prior levels and make your tools (Cursor/Claude) feel even more powerful when you understand/customize the stack underneath.
2026 agents aren't just "tool callers" anymore — they're planners, critics, routers, and collaborators.
- DeepLearning.AI: "AI Agents in LangGraph" (or the updated "Building Agentic RAG" / "Multi-Agent Workflows" free short courses)
Search DeepLearning.AI for "LangGraph" or "agentic AI 2026" — they have fresh 2026 modules on stateful graphs, persistence, human-in-the-loop, reflection/critique loops.
Why: LangGraph is the production standard for reliable agents in 2026 (it beats plain LangChain for complex flows).
- CrewAI or AutoGen official quickstarts + tutorials (free)
- CrewAI: https://docs.crewai.com/ (focus on role-based multi-agent orchestration).
- Microsoft AutoGen: https://microsoft.github.io/autogen/ (great for conversational multi-agent debate/refinement).
Why: Hands-on; build a "code review crew" or "research + code gen" team — integrate with your codebase.
- Udemy / free YouTube companions: Look for "The Complete Agentic AI Engineering (2026)" or "Building AI Agents with LangChain & CrewAI 2026"—many 4–8 hour courses cover ReAct, Plan-and-Execute, tool reliability, and error handling.
Hands-on project: Build a multi-agent coding assistant (e.g., planner → coder → tester → reviewer) that runs locally via Ollama + one of the top open models.
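The orchestration skeleton of that project is simple even before any model is wired in. A runnable sketch with a stubbed `call_llm` (swap in an Ollama or API client; the role names and the "FAIL" convention are assumptions for illustration):

```python
def call_llm(role, task, context=""):
    # stub standing in for a real model call (e.g., a request to Ollama);
    # returns canned strings so the orchestration loop runs as-is
    return f"[{role}] output for: {task}"

def run_pipeline(task, max_rounds=3):
    """Planner -> coder -> tester -> reviewer loop with a simple retry:
    if the tester reports "FAIL", re-plan and try again."""
    plan = call_llm("planner", task)
    transcript = [plan]
    for _ in range(max_rounds):
        code = call_llm("coder", task, context=plan)
        verdict = call_llm("tester", task, context=code)
        transcript += [code, verdict]
        if "FAIL" not in verdict:  # assumed tester failure convention
            review = call_llm("reviewer", task, context=code)
            transcript.append(review)
            return code, transcript
        plan = call_llm("planner", f"revise plan, tests failed: {verdict}")
    return None, transcript  # gave up after max_rounds

code, log = run_pipeline("add retry logic to connect()")
```

Frameworks like CrewAI and LangGraph mostly add state management, streaming, and error handling around exactly this control flow, so understanding the bare loop makes their abstractions legible.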
Production LLM apps live or die on evals — hallucinations, drift, cost, latency.
- LangSmith (LangChain's observability) quickstart + docs
Link: https://smith.langchain.com/ (free tier generous).
Why: Trace chains/agents, score outputs with LLM-as-judge, and A/B test prompts/models.
- DeepLearning.AI or Hugging Face: LLM evaluation courses
Search for "LLM Evaluation and Monitoring"—covers metrics (BLEU/ROUGE are outdated; current practice favors G-Eval, LLM-as-judge, RAGAS for RAG, and DeepEval).
Bonus: "Prompting for Effective LLM Reasoning" Nanodegree snippets on advanced CoT/ReAct evaluation.
- Hamel Husain's "Mastering LLMs" resources (free collection)
Link: https://parlance-labs.com/education/ (updated 2026) — excellent on evals, RAG metrics, fine-tuning diagnostics.
Project: Add evals to your RAG/agent from earlier levels — measure accuracy, faithfulness, answer relevance before/after tweaks.
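A minimal eval harness shows the shape of LLM-as-judge scoring. Here the judge is a crude token-overlap stub standing in for a real grading model (the threshold and scoring are illustrative; real setups use LangSmith, RAGAS, or DeepEval):

```python
def judge(question, answer, reference):
    # stub for an LLM-as-judge call: a crude token-overlap score in [0, 1]
    # stands in for asking a strong model to grade faithfulness
    ref_words = set(reference.lower().split())
    ans_words = set(answer.lower().split())
    return len(ref_words & ans_words) / max(len(ref_words), 1)

def run_evals(cases, system, threshold=0.5):
    """cases: list of (question, reference) pairs.
    system: function mapping a question to the system's answer.
    Returns (pass rate, per-case scores)."""
    scores = [judge(q, system(q), ref) for q, ref in cases]
    passed = sum(s >= threshold for s in scores)
    return passed / len(cases), scores

cases = [("capital of france?", "paris is the capital of france")]
rate, scores = run_evals(cases, lambda q: "the capital of france is paris")
```

The point of the harness is the loop, not the metric: once every RAG/agent change runs through `run_evals` before shipping, you can swap in better judges and datasets without touching the workflow.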
Most frontier models in 2026 are multimodal (text + image/video/audio) — huge for code + diagrams/screenshots/UI.
- LLaVA / Qwen-VL / PaliGemma successors hands-on
Hugging Face: Search "multimodal RAG" or "chat with images/videos" notebooks (Intel BridgeTower + LangChain course is gold).
Why: Build "chat with your repo screenshots" or "debug from an error screenshot".
- DeepLearning.AI short course: Multimodal RAG / Vision in LLMs (free, ~2–3 h)
Covers LLaVA-Next, Qwen-VL, and video-understanding basics.
- Run locally: Ollama now supports many VLMs (e.g., llava-phi3, bakllava, qwen-vl)—prompt with images via their web UI or API.
Level 15: Mechanistic Interpretability Basics + Frontier Model Deep Dives (ongoing, 4–8 hours setup + exploration)
Understand why models behave (or misbehave) — unlocks better prompting/fine-tuning.
- Neel Nanda / ARENA interpretability course (free, updated versions on YouTube/GitHub)
Focus on transformer circuits, induction heads, and grokking—still the best intro.
- Top open models March 2026 playground (leaderboard leaders):
- Qwen3.5 series (397B MoE → ~17B active, insanely efficient).
- DeepSeek-V3.2 / R1 (reasoning beasts).
- GLM-5 (744B scale).
- Kimi K2.5 / MiniMax-M2.5 (strong MoE).
- Llama 4 Maverick/Scout (huge context, open-weight).
- GPT-oss-120B (OpenAI's open-weight surprise, great tool use).
Run via Ollama / LM Studio / vLLM — compare reasoning, coding, long-context on your tasks vs Claude.
You've reached an advanced stage: you've internalized the Transformer mechanics, built and upgraded models, implemented efficient inference, fine-tuned/PEFT'd, built RAG pipelines, orchestrated agents (single + multi), added evals/monitoring, dipped into multimodal/VLMs, and experimented with frontier open models (DeepSeek-V3.2/V4 lineage, Qwen3.5, GLM-5, Kimi K2.5, Llama 4 variants, etc.).
In March 2026, the field has shifted heavily toward inference-time scaling (test-time compute), advanced reasoning via RLVR/GRPO hybrids, deeper agent reliability/long-horizon planning, and mechanistic interpretability as a path to safer/more controllable systems. Open-weight models are now shockingly competitive (Kimi K2.5, GLM-5, MiniMax M2.5, DeepSeek lines often top open leaderboards in coding/reasoning), and closed models (Claude Opus 4.6, GPT-5.4/5.2 variants, Gemini 3.1 Pro) push massive context (1M+ tokens standard, some claiming 10M) + adaptive thinking.
This final(ish) phase focuses on cutting-edge, research-adjacent skills that let you push models beyond defaults, understand internals for better debugging/customization, and build truly self-improving or verifiable systems — directly amplifying your Claude/Cursor workflow (e.g., why certain prompts "unlock" better reasoning, or how to mimic o1-style chain search locally).
The biggest 2026 unlock: spend more inference compute (search, verification, self-refine) to outperform much larger models at fixed FLOPs.
- Sebastian Raschka's "Categories of Inference-Time Scaling" (blog series, free, updated Jan 2026)
Link: https://magazine.sebastianraschka.com/p/categories-of-inference-time-scaling
Why first: A clear categorization (best-of-N → verifiers/PRMs → adaptive sampling → process vs outcome reward) plus the newest papers. Explains why test-time scaling often beats parameter scaling for reasoning.
- Key paper walkthroughs/videos (search YouTube/arXiv for these 2025–2026 works):
- "Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters" (Snell et al., ~2025 ICLR/oral) — shows 4× efficiency over best-of-N, outperforms 14× larger models in FLOPs-matched evals.
- "RLVR, GRPO, Inference Scaling" discussions (Sebastian Raschka podcast/YouTube ~2026) — covers reinforcement-learned verifiers + group relative policy optimization for reasoning.
Why: Hands-on intuition; many include code snippets for verifiers/self-refine loops.
Hands-on: Implement a simple verifier-based search (e.g., PRM-guided beam search) on top of Ollama + a strong open model like Qwen3.5-397B or DeepSeek-R1. Use Cursor to prototype — compare vanilla vs scaled reasoning on hard coding/math problems.
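Stripped to its simplest form, verifier-based search is best-of-N: sample several candidates, score each with a verifier, keep the best. A runnable sketch with a stubbed generator and verifier (both would be real model calls in practice; the scoring rule is invented purely so the loop executes):

```python
import random

def generate(prompt, rng):
    # stub generator: real code would sample a chain-of-thought from a model
    return f"candidate answer {rng.randint(0, 9)} for: {prompt}"

def verify(prompt, candidate):
    # stub verifier / process reward model returning a score in [0, 1];
    # real versions run unit tests, check math, or call a trained PRM.
    # Here the embedded digit stands in for quality.
    return int(candidate.split()[2]) / 9.0

def best_of_n(prompt, n=8, seed=0):
    """Sample n candidates, keep the one the verifier scores highest.
    This is the simplest category of inference-time scaling."""
    rng = random.Random(seed)  # seeded for reproducibility
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda c: verify(prompt, c))

best = best_of_n("fix the off-by-one bug", n=8)
```

PRM-guided beam search refines this by scoring partial chains and pruning early instead of scoring only finished answers, but the sample-score-select skeleton is the same.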
Reverse-engineer why models succeed/fail/hallucinate — unlocks prompt engineering, fine-tuning targets, safety tweaks.
- ARENA Mechanistic Interpretability track by Callum McDougall (free, interactive Streamlit notebooks, updated versions 2025–2026)
Link: https://arena-chapter1-transformer-interp.streamlit.app/ (focus on Ch1.2 tooling/patching, then Ch2 circuits/heads).
Why core: Hands-on transformer interp from scratch—patching, activation steering, logit lens, etc. Still the best practical entry.
- Neel Nanda's resources (YouTube + GitHub, ongoing 2026 updates)
Search: "Neel Nanda mechanistic interpretability 2026"—his intros to induction heads, grokking, and SAEs (sparse autoencoders) for features.
Bonus: the Alignment Forum post "How To Become A Mechanistic Interpretability Researcher" (2025, still relevant)—mindset + roadmap.
- Anthropic-style circuit-work companions: Read summaries of "A Mathematical Framework for Transformer Circuits" + recent 2026 SAE papers (use arXiv Sanity or YouTube explainers).
Why: Explains Claude's "adaptive thinking" internals — apply to debug why your agents loop or forget.
Project: Use TransformerLens (library) on a small model (Gemma-3-27B or Qwen3.5-27B) to probe a circuit (e.g., factual recall or code syntax). Steer activations to fix a failure mode you see in Cursor/Claude.
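The steering step of that project reduces to adding a scaled feature direction to a hidden state. A toy sketch (in real work the direction comes from contrastive prompts or an SAE feature and is applied via a TransformerLens hook mid-forward-pass; all vectors and names here are illustrative):

```python
def steer(hidden, direction, alpha=2.0):
    """Activation steering: nudge a hidden (residual-stream) vector
    along a feature direction by alpha units."""
    norm = sum(d * d for d in direction) ** 0.5
    unit = [d / norm for d in direction]  # normalize the direction
    return [h + alpha * u for h, u in zip(hidden, unit)]

hidden = [0.5, -1.0, 0.0]   # toy residual-stream activation
caution = [0.0, 1.0, 0.0]   # hypothetical "be cautious" feature direction
steered = steer(hidden, caution, alpha=2.0)
# only the component along the direction changes; others are untouched
```

In a real model the same one-line addition, applied at a chosen layer during generation, measurably shifts behavior, which is what makes steering such a direct probe of what a feature direction encodes.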
Tie inference scaling + interpretability into autonomous improvement.
- DeepLearning.AI or Hugging Face updates on "o1-style reasoning" / synthetic data for RL (search 2026 courses)
Covers process reward models, rejection sampling, and self-critique loops.
- DeepSeek-R1 and similar papers (arXiv 2501.xxxx series)—RL-incentivized reasoning; many open models use variants.
Hands-on: Build a mini "reasoning booster" wrapper — sample multiple chains, use LLM judge/verifier to pick/refine, iterate 3–5 steps. Run on frontier open models locally (vLLM for speed). Compare to base Claude on ambiguous software tasks.
- Weekly check: Onyx.app open LLM leaderboard, Vellum leaderboard, llm-stats.com — track Kimi K2.5 / GLM-5 / DeepSeek V3.2 / MiniMax M2.5 dominance in open coding/reasoning.
- Communities: r/LocalLLaMA (for open model practicals), Alignment Forum / LessWrong for interp/agentic safety.
- Pivot options: If you love agents → focus MCTS-style search + long-horizon planning. Safety/alignment → more interp + red-teaming. Multimodal → video/audio agents.
You've now reached the bleeding edge of what powers 2026 LLMs and beyond—mastering inference-time scaling (test-time compute), mechanistic interpretability (probing internal circuits), and self-improving reasoning loops puts you in rare company among developers. At this point (late March 2026), frontier progress is less about raw parameter scaling and more about smarter compute allocation, deeper internal understanding, hybrid architectures, and real-world agentic integration (especially physical/embodied AI starting to emerge).
The field has stabilized around transformers + MoE as the dominant backbone, but hybrids (e.g., attention + state-space models like Mamba variants, diffusion-based language generation) are shipping in production models, and inference scaling + reinforcement-learned reasoning (RLVR/GRPO styles) deliver outsized gains without needing bigger models.
This next phase (Levels 20+) focuses on integrating these into production-grade, self-evolving systems — things you can actually build or extend in your Claude/Cursor workflow today (e.g., custom reasoning engines, interpretable agents, hybrid local setups). Emphasis on the absolute latest from March 2026 leaderboards and papers.
Go beyond basics to implement compute-optimal strategies that often beat parameter scaling.
- Sebastian Raschka's "Categories of Inference-Time Scaling" (updated 2026 blog series)
Link: https://magazine.sebastianraschka.com/p/categories-of-inference-time-scaling
Why: Breaks down best-of-N → verifiers/PRMs → adaptive sampling → process/outcome rewards. Includes code patterns for verifiers and self-refine. Read it, then implement verifier-guided search first.
- Key 2025–2026 papers + explainers (focus on these):
- "Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters" (Snell et al.) — search arXiv/YouTube for walkthroughs (~2025 ICLR). Shows 4× efficiency over best-of-N, outperforming 14× larger models in FLOPs-matched evals.
- Raschka's YouTube/podcast on "RLVR, GRPO, Inference Scaling" (2026 episodes) — covers reinforcement-learned verifiers + group relative policy optimization.
Hands-on: Use libraries like vLLM or guidance to add adaptive thinking (e.g., dynamic chain length based on prompt difficulty) to a frontier open model (Qwen 3.5-397B MoE or DeepSeek V3.2).
Project: Wrap your existing agent/RAG with a "reasoning budget" system — sample multiple paths, use a process reward model (PRM) or LLM judge to select/refine. Test on ambiguous code/debug tasks where Claude sometimes falters.
2026 breakthrough: MIT Technology Review named mechanistic interpretability a top technology; scaling SAEs (sparse autoencoders) and automated circuit discovery are making black boxes less black.
- ARENA track continuation (updated 2026): Dive into Ch3+ on SAEs, automated interp, and activation steering at scale.
Library: TransformerLens + SAE Lens (GitHub repos updated frequently).
- Recent surveys/papers:
- "Mechanistic Interpretability for Large Language Model Alignment" (arXiv 2602.11180, Jan 2026) — covers circuit discovery, feature viz, steering, causal interventions for alignment.
- MIT Tech Review piece on 2026 breakthroughs (Jan 2026) — good high-level overview of progress mapping features/pathways.
Hands-on: Probe a mid-size open model (e.g., Gemma 3-27B or Qwen 3.5-27B) for circuits related to code syntax errors or factual recall. Use steering to "nudge" behavior (e.g., force more cautious reasoning). This directly helps debug why agents hallucinate or loop.
Transformers aren't dead, but hybrids dominate efficiency + long-context in 2026 production.
- Key resources:
- "What Comes After Transformers" discussions (e.g., Philipp Dubach blog or similar 2026 posts) — covers MoE + SSM (Mamba) hybrids like Jamba/Hymba/Qwen3-Next.
- Diffusion language models (e.g., LLaDA 8B, Gemini Diffusion explainers) — parallel generation, reversal curse fix. Search YouTube for "diffusion LLMs 2026".
- State-space models (Mamba-2 family) + attention duality proofs.
Hands-on: Run hybrid models via Ollama/LM Studio (e.g., Jamba variants if available, or Qwen hybrids). Compare throughput/latency vs pure transformers on long-context code review. Experiment with tiny reasoning models (2025 lineage) for fast local agents.
Current leaderboard snapshot (March 2026, from Onyx/Vellum/others):
- Top open: GLM-5 (744B), Kimi K2.5 (1T), MiniMax M2.5 (230B), DeepSeek V3.2 (685B), Qwen 3.5 (397B MoE).
- Strengths: Kimi/GLM lead instruction following + reasoning; DeepSeek excels efficiency/coding.
- Closed: Claude Opus 4.6 (adaptive thinking, 1M+ context), GPT-5.4 (professional tasks), Gemini 3.1 Pro (balanced cost/performance).
Rotate these in your local setup weekly — prompt with your real software problems to see where they shine (e.g., Kimi often crushes IFEval-structured output for agents).
Tie everything: inference scaling + interp + hybrids into systems that evolve.
- Physical/embodied AI early signals: NVIDIA/Google robotics integrations; world models for simulation.
- Self-improvement loops: RL-incentivized reasoning (DeepSeek-R1 style); synthetic data + verifier feedback.
Project: Build a "self-refining coding agent" — uses test-time scaling + basic steering + verifier to iterate on its own code generations. Deploy locally with frontier open model. Add multimodal if you're curious (e.g., debug from screenshots via Qwen-VL).