Assessing skeptical views of interpretability research
Stanford AI Lab Faculty Lunch, November 7, 2025. Updated version of https://web.stanford.edu/~cgpotts/blog/interp/
Video Chapters
- 0:00 Introduction to the talk
- 0:59 Severance and the "work is mysterious and important" motto
- 1:45 Interpretability vs. Explainable AI (XAI); current research foci at Anthropic Interp and Stanford Interp
- 5:16 A look inside the toolbox: attribution, probes, and interventions
- 5:28 Attribution methods: feature ablation, permutation importance, Shapley values, and LIME (see the first sketch after this list)
- 7:52 Probes: supervised and unsupervised feature discovery with sparse autoencoders (SAEs) (see the second sketch)
- 9:50 Interventions: manipulating model internals to test causal structure; activation patching (see the third sketch)
- 11:15 Interchange interventions: testing hypotheses about modular encodings in neural networks
- 14:14 Causal abstraction as a theoretical foundation for interpretability
- 15:27 The five skeptical positions on interpretability research
- 16:41 Skeptical position 1: "Interpretability cannot be achieved" (inherent complexity)
- 18:32 Skeptical position 2: "Interpretability is merely analysis"
- 21:14 Skeptical position 3: "Analysis is overrated"; progress comes from "doing what works"
- 25:24 Nicole Rust's "Elusive Cures" as a cautionary tale for "interp to frontier LM"
- 27:33 Skeptical position 4: "Interpretability is not leading to improvements", and the counter-argument that specific tools show it is
- 30:05 Skeptical position 5: "Interpretability is not helping with AI safety"
- 31:55 Case study: the sycophancy problem in GPT-4o
- 35:56 Optimistic outlook: interp groups are embracing practical problems and methods
- 36:08 Summary, and Aryaman's sweatshirt
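The chapter at 5:28 surveys attribution methods. As a concrete illustration, here is a minimal feature-ablation sketch; the toy linear model, the single input, and the zero baseline are all illustrative assumptions, not material from the talk:

```python
# Minimal sketch of attribution by feature ablation (hypothetical toy model):
# score each input feature by how much the model's output changes when that
# feature is replaced with a baseline value (here, zero).
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5,))          # toy linear "model": f(x) = W . x
x = rng.normal(size=(5,))          # one input to explain

def f(x):
    return float(W @ x)

full = f(x)
for i in range(len(x)):
    ablated = x.copy()
    ablated[i] = 0.0               # baseline: zero out feature i
    print(f"feature {i}: attribution = {full - f(ablated):+.3f}")
```

Permutation importance and Shapley values refine the same idea, varying which features are perturbed and how the resulting output changes are averaged.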
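For the probing segment at 7:52, here is a minimal sparse-autoencoder sketch. The dimensions, synthetic "activations", and L1 penalty weight are illustrative assumptions; the pattern is an overcomplete ReLU encoder trained to reconstruct activations under a sparsity penalty:

```python
# Minimal sparse autoencoder (SAE) over cached activations (synthetic here).
import torch

d_model, d_dict, n = 16, 64, 1024
acts = torch.randn(n, d_model)               # stand-in for cached activations

enc = torch.nn.Linear(d_model, d_dict)       # overcomplete encoder
dec = torch.nn.Linear(d_dict, d_model)       # decoder back to model space
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

for step in range(200):
    feats = torch.relu(enc(acts))            # sparse feature activations
    recon = dec(feats)
    loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()  # recon + L1
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final loss:", loss.item())
```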
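And for the interventions material at 9:50 and 11:15, a minimal activation-patching sketch: cache a hidden activation from a "source" run, then splice it into a "base" run via a forward hook and see how the output changes. The two-layer network and inputs are hypothetical stand-ins for a real language model:

```python
# Minimal interchange intervention (activation patching) with forward hooks.
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 2),
)
base = torch.randn(1, 4)      # input whose computation we intervene on
source = torch.randn(1, 4)    # input that donates the patched activation
layer = model[1]              # intervene on the post-ReLU hidden state

cache = {}

def save_hook(module, args, output):
    cache["h"] = output.detach()     # cache the source run's activation

def patch_hook(module, args, output):
    return cache["h"]                # swap in the cached activation wholesale

handle = layer.register_forward_hook(save_hook)
model(source)
handle.remove()

handle = layer.register_forward_hook(patch_hook)
patched_out = model(base)
handle.remove()

print("base output:   ", model(base))
print("patched output:", patched_out)
```

An interchange intervention of this kind tests whether the patched component carries a modular, causally relevant encoding: if it does, the base output should shift toward what the source input would produce.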