Assessing skeptical views of interpretability research
Stanford AI Lab Faculty Lunch, November 7, 2025. Updated version of https://web.stanford.edu/~cgpotts/blog/interp/
Video Chapters
- 0:00 Introduction to the talk
- 0:59 Severance and the "work is mysterious and important" motto
- 1:45 Interpretability vs. Explainable AI (XAI); current research foci at Anthropic Interp and Stanford Interp
- 5:16 A look inside the toolbox: attribution, probes, and interventions
- 5:28 Attribution methods: feature ablation, permutation importance, Shapley values, and LIME (see the first sketch after this list)
- 7:52 Probes: supervised and unsupervised feature discovery with sparse autoencoders (SAEs) (see the second sketch)
- 9:50 Interventions: manipulating model internals to test causal structure; activation patching (see the third sketch)
- 11:15 Interchange interventions: testing hypotheses about modular encodings in neural networks
- 14:14 Causal abstraction as a theoretical foundation for interpretability
- 15:27 The five skeptical positions on interpretability research
- 16:41 Skeptical position 1: "Interpretability cannot be achieved" (inherent complexity)
- 18:32 Skeptical position 2: "Interpretability is merely analysis"
- 21:14 Skeptical position 3: "Analysis is overrated"; progress comes from "doing what works"
- 25:24 Nicole Rust's "Elusive Cures" as a cautionary tale for "interp to frontier LM"
- 27:33 Skeptical position 4: "Interpretability is not leading to improvements", and the counter-argument that specific tools show it is
- 30:05 Skeptical position 5: "Interpretability is not helping with AI safety"
- 31:55 Case study: the sycophancy problem in GPT-4o
- 35:56 Optimistic outlook: interp groups are embracing practical problems and methods
- 36:08 Summary, and Aryaman's sweatshirt
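The chapter at 5:28 surveys attribution methods. As a concrete illustration, here is a minimal feature-ablation sketch; the toy linear model, the single input, and the zero baseline are all illustrative assumptions, not material from the talk:

```python
# Minimal sketch of attribution by feature ablation (hypothetical toy model):
# score each input feature by how much the model's output changes when that
# feature is replaced with a baseline value (here, zero).
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5,))          # toy linear "model": f(x) = W . x
x = rng.normal(size=(5,))          # one input to explain

def f(x):
    return float(W @ x)

full = f(x)
for i in range(len(x)):
    ablated = x.copy()
    ablated[i] = 0.0               # baseline: zero out feature i
    print(f"feature {i}: attribution = {full - f(ablated):+.3f}")
```

Permutation importance and Shapley values refine the same idea, varying which features are perturbed and how the resulting output changes are averaged.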
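For the probing segment at 7:52, here is a minimal sparse-autoencoder sketch. The dimensions, synthetic "activations", and L1 penalty weight are illustrative assumptions; the pattern is an overcomplete ReLU encoder trained to reconstruct activations under a sparsity penalty:

```python
# Minimal sparse autoencoder (SAE) over cached activations (synthetic here).
import torch

d_model, d_dict, n = 16, 64, 1024
acts = torch.randn(n, d_model)               # stand-in for cached activations

enc = torch.nn.Linear(d_model, d_dict)       # overcomplete encoder
dec = torch.nn.Linear(d_dict, d_model)       # decoder back to model space
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

for step in range(200):
    feats = torch.relu(enc(acts))            # sparse feature activations
    recon = dec(feats)
    loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()  # recon + L1
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final loss:", loss.item())
```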
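And for the interventions material at 9:50 and 11:15, a minimal activation-patching sketch: cache a hidden activation from a "source" run, then splice it into a "base" run via a forward hook and see how the output changes. The two-layer network and inputs are hypothetical stand-ins for a real language model:

```python
# Minimal interchange intervention (activation patching) with forward hooks.
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 2),
)
base = torch.randn(1, 4)      # input whose computation we intervene on
source = torch.randn(1, 4)    # input that donates the patched activation
layer = model[1]              # intervene on the post-ReLU hidden state

cache = {}

def save_hook(module, args, output):
    cache["h"] = output.detach()     # cache the source run's activation

def patch_hook(module, args, output):
    return cache["h"]                # swap in the cached activation wholesale

handle = layer.register_forward_hook(save_hook)
model(source)
handle.remove()

handle = layer.register_forward_hook(patch_hook)
patched_out = model(base)
handle.remove()

print("base output:   ", model(base))
print("patched output:", patched_out)
```

An interchange intervention of this kind tests whether the patched component carries a modular, causally relevant encoding: if it does, the base output should shift toward what the source input would produce.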