AXRP - the AI X-risk Research Podcast

AXRP (pronounced axe-urp) is the AI X-risk Research Podcast where I, Daniel Filan, have conversations with researchers about their papers. We discuss the paper, and hopefully get a sense of why it's been written and how it might reduce the risk of AI causing an existential catastrophe: that is, permanently and drastically curtailing humanity's future potential. You can visit the website and read transcripts at axrp.net.

46 - Tom Davidson on AI-enabled Coups 08/07/2025

45 - Samuel Albanie on DeepMind's AGI Safety Approach 07/06/2025

44 - Peter Salib on AI Rights for Human Safety 06/28/2025

43 - David Lindner on Myopic Optimization with Non-myopic Approval 06/15/2025

42 - Owain Evans on LLM Psychology 06/06/2025

41 - Lee Sharkey on Attribution-based Parameter Decomposition 06/03/2025

40 - Jason Gross on Compact Proofs and Interpretability 03/28/2025

38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future 03/01/2025

38.7 - Anthony Aguirre on the Future of Life Institute 02/09/2025

38.6 - Joel Lehman on Positive Visions of AI 01/24/2025

38.5 - Adrià Garriga-Alonso on Detecting AI Scheming 01/20/2025

38.4 - Shakeel Hashim on AI Journalism 01/05/2025

38.3 - Erik Jenner on Learned Look-Ahead 12/12/2024

39 - Evan Hubinger on Model Organisms of Misalignment 12/01/2024

38.2 - Jesse Hoogland on Singular Learning Theory 11/27/2024

38.1 - Alan Chan on Agent Infrastructure 11/16/2024

38.0 - Zhijing Jin on LLMs, Causality, and Multi-Agent Systems 11/14/2024

37 - Jaime Sevilla on AI Forecasting 10/04/2024

36 - Adam Shai and Paul Riechers on Computational Mechanics 09/29/2024

New Patreon tiers + MATS applications 09/28/2024

35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization 08/24/2024

How do we figure out what large language models believe? In fact, do they even have beliefs? Do those beliefs have locations, and if so, can we edit those locations to change the beliefs? Also, how are we going to get AI to perform tasks so hard that we can't figure out if they succeeded at them? In this episode, I chat with Peter Hase about his research into these questions.

Topics we discuss, and timestamps:
- 0:00:36 - NLP and interpretability
- 0:10:20 - Interpretability lessons
- 0:32:22 - Belief interpretability
- 1:00:12 - Localizing and editing models' beliefs
- 1:19:18 - Beliefs beyond language models
- 1:27:21 - Easy-to-hard generalization
- 1:47:16 - What do easy-to-hard results tell us?
- 1:57:33 - Easy-to-hard vs weak-to-strong
- 2:03:50 - Different notions of hardness
- 2:13:01 - Easy-to-hard vs weak-to-strong, round 2
- 2:15:39 - Following Peter's work

- Peter on Twitter

Peter's papers:
- Foundational Challenges in Assuring Alignment and Safety of Large Language Models
- Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs
- Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models
- Are Language Models Rational? The Case of Coherence Norms and Belief Revision
- The Unreasonable Effectiveness of Easy Training Data for Hard Tasks

Other links:
- Toy Models of Superposition
- Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)
- Locating and Editing Factual Associations in GPT (aka the ROME paper)
- Of nonlinearity and commutativity in BERT
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
- Editing a classifier by rewriting its prediction rules
- Discovering Latent Knowledge Without Supervision (aka the Collin Burns CCS paper)
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
- Concrete problems in AI safety
- Rissanen Data Analysis: Examining Dataset Characteristics via Description Length

Episode art by Hamish Doodles.

34 - AI Evaluations with Beth Barnes 07/28/2024

33 - RLHF Problems with Scott Emmons 06/12/2024

32 - Understanding Agency with Jan Kulveit 05/30/2024

31 - Singular Learning Theory with Daniel Murfet 05/07/2024

What's going on with deep learning? What sorts of models get learned, and what are the learning dynamics? Singular learning theory is a theory of Bayesian statistics broad enough in scope to encompass deep neural networks that may help answer these questions. In this episode, I speak with Daniel Murfet about this research program and what it tells us.

Topics we discuss, and timestamps:
- 0:00:26 - What is singular learning theory?
- 0:16:00 - Phase transitions
- 0:35:12 - Estimating the local learning coefficient
- 0:44:37 - Singular learning theory and generalization
- 1:00:39 - Singular learning theory vs other deep learning theory
- 1:17:06 - How singular learning theory hit AI alignment
- 1:33:12 - Payoffs of singular learning theory for AI alignment
- 1:59:36 - Does singular learning theory advance AI capabilities?
- 2:13:02 - Open problems in singular learning theory for AI alignment
- 2:20:53 - What is the singular fluctuation?
- 2:25:33 - How geometry relates to information
- 2:30:13 - Following Daniel Murfet's work

- Daniel Murfet's twitter/X account
- Developmental interpretability website
- Developmental interpretability YouTube channel

Main research discussed in this episode:
- Developmental Landscape of In-Context Learning
- Estimating the Local Learning Coefficient at Scale
- Simple versus Short: Higher-order degeneracy and error-correction

Other links:
- Algebraic Geometry and Statistical Learning Theory (the grey book)
- Mathematical Theory of Bayesian Statistics (the green book): https://www.routledge.com/Mathematical-Theory-of-Bayesian-Statistics/Watanabe/p/book/9780367734817
- In-context learning and induction heads
- Saddle-to-Saddle Dynamics in Deep Linear Networks: Small Initialization Training, Symmetry, and Sparsity
- A mathematical theory of semantic development in deep neural networks
- Consideration on the Learning Efficiency Of Multiple-Layered Neural Networks with Linear Units
- Neural Tangent Kernel: Convergence and Generalization in Neural Networks
- The Interpolating Information Criterion for Overparameterized Models
- Feature Learning in Infinite-Width Neural Networks
- A central AI alignment problem: capabilities generalization, and the sharp left turn
- Quantifying degeneracy in singular models via the learning coefficient

Episode art by Hamish Doodles.

30 - AI Security with Jeffrey Ladish 04/30/2024

Top labs use various forms of "safety training" on models before their release to make sure they don't do nasty stuff - but how robust is that? How can we ensure that the weights of powerful AIs don't get leaked or stolen? And what can AI even do these days? In this episode, I speak with Jeffrey Ladish about security and AI.

Topics we discuss, and timestamps:
- 0:00:38 - Fine-tuning away safety training
- 0:13:50 - Dangers of open LLMs vs internet search
- 0:19:52 - What we learn by undoing safety filters
- 0:27:34 - What can you do with jailbroken AI?
- 0:35:28 - Security of AI model weights
- 0:49:21 - Securing against attackers vs AI exfiltration
- 1:08:43 - The state of computer security
- 1:23:08 - How AI labs could be more secure
- 1:33:13 - What does Palisade do?
- 1:44:40 - AI phishing
- 1:53:32 - More on Palisade's work
- 1:59:56 - Red lines in AI development
- 2:09:56 - Making AI legible
- 2:14:08 - Following Jeffrey's research

- Palisade Research
- Jeffrey's Twitter/X account

Main papers we discussed:
- LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B
- BadLLaMa: Cheaply Removing Safety Fine-tuning From LLaMa 2-Chat 13B
- Securing Artificial Intelligence Model Weights

Other links:
- Llama 2: Open Foundation and Fine-Tuned Chat Models
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
- Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
- On the Societal Impact of Open Foundation Models (Stanford paper on marginal harms from open-weight models)
- The Operational Risks of AI in Large-Scale Biological Attacks (RAND)
- Preventing model exfiltration with upload limits
- A deep dive into an NSO zero-click iMessage exploit: Remote Code Execution
- In-browser transformer inference
- Anatomy of a rental phishing scam
- Causal Scrubbing: a method for rigorously testing interpretability hypotheses

Episode art by Hamish Doodles.

29 - Science of Deep Learning with Vikrant Varma 04/25/2024

In 2022, it was announced that a fairly simple method can be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand. Earlier, in 2021, it was announced that neural networks sometimes 'grok': that is, when training them on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn't generalize), but then suddenly switch to understanding the 'real' solution in a way that generalizes. What's going on with these discoveries? Are they all they're cracked up to be, and if so, how are they working? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions.

Topics we discuss, and timestamps:
- 0:00:36 - Challenges with unsupervised LLM knowledge discovery, aka contra CCS
- 0:00:36 - What is CCS?
- 0:09:54 - Consistent and contrastive features other than model beliefs
- 0:20:34 - Understanding the banana/shed mystery
- 0:41:59 - Future CCS-like approaches
- 0:53:29 - CCS as principal component analysis
- 0:56:21 - Explaining grokking through circuit efficiency
- 0:57:44 - Why research science of deep learning?
- 1:12:07 - Summary of the paper's hypothesis
- 1:14:05 - What are 'circuits'?
- 1:20:48 - The role of complexity
- 1:24:07 - Many kinds of circuits
- 1:28:10 - How circuits are learned
- 1:38:24 - Semi-grokking and ungrokking
- 1:50:53 - Generalizing the results
- 1:58:51 - Vikrant's research approach
- 2:06:36 - The DeepMind alignment team
- 2:09:06 - Follow-up work

- Vikrant's Twitter/X account

Main papers:
- Challenges with unsupervised LLM knowledge discovery
- Explaining grokking through circuit efficiency

Other works discussed:
- Discovering latent knowledge in language models without supervision (CCS)
- Eliciting Latent Knowledge: How to Tell if your Eyes Deceive You
- Discussion: Challenges with unsupervised LLM knowledge discovery
- Comment thread on the banana/shed results
- Fabien Roger, What discovering latent knowledge did and did not find
- Scott Emmons, Contrast Pairs Drive the Performance of Contrast Consistent Search (CCS)
- Grokking: Generalizing Beyond Overfitting on Small Algorithmic Datasets
- Keeping Neural Networks Simple by Minimizing the Minimum Description Length of the Weights (Hinton 1993 L2)
- Progress measures for grokking via mechanistic interpretability

Episode art by Hamish Doodles.

28 - Suing Labs for AI Risk with Gabriel Weil 04/17/2024

27 - AI Control with Buck Shlegeris and Ryan Greenblatt 04/11/2024

A lot of work to prevent AI existential risk takes the form of ensuring that AIs don't want to cause harm or take over the world, or in other words, ensuring that they're aligned. In this episode, I talk with Buck Shlegeris and Ryan Greenblatt about a different approach, called "AI control": ensuring that AI systems couldn't take over the world, even if they were trying to.

Topics we discuss, and timestamps:
- 0:00:31 - What is AI control?
- 0:16:16 - Protocols for AI control
- 0:22:43 - Which AIs are controllable?
- 0:29:56 - Preventing dangerous coded AI communication
- 0:40:42 - Unpredictably uncontrollable AI
- 0:58:01 - What control looks like
- 1:08:45 - Is AI control evil?
- 1:24:42 - Can red teams match misaligned AI?
- 1:36:51 - How expensive is AI monitoring?
- 1:52:32 - AI control experiments
- 2:03:50 - GPT-4's aptitude at inserting backdoors
- 2:14:50 - How AI control relates to the AI safety field
- 2:39:25 - How AI control relates to previous Redwood Research work
- 2:49:16 - How people can work on AI control
- 2:54:07 - Following Buck and Ryan's research

Links for Buck and Ryan:
- Buck's twitter/X account
- Ryan on LessWrong
- You can contact both Buck and Ryan by electronic mail at [firstname] [at-sign] rdwrs.com

Main research works we talk about:
- The case for ensuring that powerful AIs are controlled
- AI Control: Improving Safety Despite Intentional Subversion

Other things we mention:
- The prototypical catastrophic AI action is getting root access to its datacenter (aka "Hacking the SSH server")
- Preventing language models from hiding their reasoning
- Improving the Welfare of AIs: A Nearcasted Proposal
- Measuring coding challenge competence with APPS
- Causal Scrubbing: a method for rigorously testing interpretability hypotheses

Episode art by Hamish Doodles.

26 - AI Governance with Elizabeth Seger 11/26/2023

The events of this year have highlighted important questions about the governance of artificial intelligence. For instance, what does it mean to democratize AI? And how should we balance benefits and dangers of open-sourcing powerful AI systems such as large language models? In this episode, I speak with Elizabeth Seger about her research on these questions.

Topics we discuss, and timestamps:
- 0:00:40 - What kinds of AI?
- 0:01:30 - Democratizing AI
- 0:04:44 - How people talk about democratizing AI
- 0:09:34 - Is democratizing AI important?
- 0:13:31 - Links between types of democratization
- 0:22:43 - Democratizing profits from AI
- 0:27:06 - Democratizing AI governance
- 0:29:45 - Normative underpinnings of democratization
- 0:44:19 - Open-sourcing AI
- 0:50:47 - Risks from open-sourcing
- 0:56:07 - Should we make AI too dangerous to open source?
- 1:00:33 - Offense-defense balance
- 1:03:13 - KataGo as a case study
- 1:09:03 - Openness for interpretability research
- 1:15:47 - Effectiveness of substitutes for open sourcing
- 1:20:49 - Offense-defense balance, part 2
- 1:29:49 - Making open-sourcing safer?
- 1:40:37 - AI governance research
- 1:41:05 - The state of the field
- 1:43:33 - Open questions
- 1:49:58 - Distinctive governance issues of x-risk
- 1:53:04 - Technical research to help governance
- 1:55:23 - Following Elizabeth's research

Links for Elizabeth:
- Personal website
- Centre for the Governance of AI (AKA GovAI)

Main papers:
- Democratizing AI: Multiple Meanings, Goals, and Methods
- Open-sourcing highly capable foundation models: an evaluation of risks, benefits, and alternative methods for pursuing open source objectives

Other research we discuss:
- What Do We Mean When We Talk About "AI democratisation"? (blog post)
- Democratic Inputs to AI (OpenAI)
- Collective Constitutional AI: Aligning a Language Model with Public Input (Anthropic)
- Against "Democratizing AI"
- Adversarial Policies Beat Superhuman Go AIs
- Structured access: an emerging paradigm for safe AI deployment
- Universal and Transferable Adversarial Attacks on Aligned Language Models (aka Adversarial Suffixes)

Episode art by Hamish Doodles.
