AXRP - the AI X-risk Research Podcast
What can we learn about advanced deep learning systems by understanding how humans learn and form values over their lifetimes? Will superhuman AI look like ruthless coherent utility optimization, or more like a mishmash of contextually activated desires? This episode's guest, Quintin Pope, has been thinking about these questions as a leading researcher in the shard theory community. We talk about what shard theory is, what it says about humans and neural networks, and what the implications are for making AI safe.
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
Episode art by Hamish Doodles: hamishdoodles.com
Topics we discuss, and timestamps:
- 0:00:42 - Why understand human value formation?
- 0:19:59 - Why not design methods to align to arbitrary values?
- 0:27:22 - Postulates about human brains
- 0:36:20 - Sufficiency of the postulates
- 0:44:55 - Reinforcement learning as conditional sampling
- 0:48:05 - Compatibility with genetically-influenced behaviour
- 1:03:06 - Why deep learning is basically what the brain does
- 1:25:17 - Shard theory
- 1:38:49 - Shard theory vs expected utility optimizers
- 1:54:45 - What shard theory says about human values
- 2:05:47 - Does shard theory mean we're doomed?
- 2:18:54 - Will nice behaviour generalize?
- 2:33:48 - Does alignment generalize farther than capabilities?
- 2:42:03 - Are we at the end of machine learning history?
- 2:53:09 - Shard theory predictions
- 2:59:47 - The shard theory research community
- 3:13:45 - Why do shard theorists not work on replicating human childhoods?
- 3:25:53 - Following shardy research
The transcript: axrp.net/episode/2023/06/15/episode-22-shard-theory-quintin-pope.html
Shard theorist links:
- Quintin's LessWrong profile: lesswrong.com/users/quintin-pope
- Alex Turner's LessWrong profile: lesswrong.com/users/turntrout
- Shard theory Discord: discord.gg/AqYkK7wqAG
- EleutherAI Discord: discord.gg/eleutherai
Research we discuss:
- The Shard Theory Sequence: lesswrong.com/s/nyEFg3AuJpdAozmoX
- Pretraining Language Models with Human Preferences: arxiv.org/abs/2302.08582
- Inner alignment in salt-starved rats: lesswrong.com/posts/wcNEXDHowiWkRxDNv/inner-alignment-in-salt-starved-rats
- Intro to Brain-like AGI Safety Sequence: lesswrong.com/s/HzcM2dkCq7fwXBej8
- Brains and transformers:
  - The neural architecture of language: Integrative modeling converges on predictive processing: pnas.org/doi/10.1073/pnas.2105646118
  - Brains and algorithms partially converge in natural language processing: nature.com/articles/s42003-022-03036-1
  - Evidence of a predictive coding hierarchy in the human brain listening to speech: nature.com/articles/s41562-022-01516-2
- Singular learning theory explainer: Neural networks generalize because of this one weird trick: lesswrong.com/posts/fovfuFdpuEwQzJu2w/neural-networks-generalize-because-of-this-one-weird-trick
- Singular learning theory links: metauni.org/slt/
- Implicit Regularization via Neural Feature Alignment, aka circles in the parameter-function map: arxiv.org/abs/2008.00938
- The shard theory of human values: lesswrong.com/s/nyEFg3AuJpdAozmoX/p/iCfdcxiyr2Kj8m8mT
- Predicting inductive biases of pre-trained networks: openreview.net/forum?id=mNtmhaDkAr
- Understanding and controlling a maze-solving policy network, aka the cheese vector: lesswrong.com/posts/cAC4AXiNC5ig6jQnc/understanding-and-controlling-a-maze-solving-policy-network
- Quintin's Research agenda: Supervising AIs improving AIs: lesswrong.com/posts/7e5tyFnpzGCdfT4mR/research-agenda-supervising-ais-improving-ais
- Steering GPT-2-XL by adding an activation vector: lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector
Links for the addendum on mesa-optimization skepticism:
- Quintin's response to Yudkowsky arguing against AIs being steerable by gradient descent: lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#Yudkowsky_argues_against_AIs_being_steerable_by_gradient_descent_
- Quintin on why evolution is not like AI training: lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#Edit__Why_evolution_is_not_like_AI_training
- Evolution provides no evidence for the sharp left turn: lesswrong.com/posts/hvz9qjWyv8cLX9JJR/evolution-provides-no-evidence-for-the-sharp-left-turn
- Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets: arxiv.org/abs/1905.10854