AXRP - the AI X-risk Research Podcast
Road lines, street lights, and licence plates are examples of infrastructure used to ensure that roads operate smoothly. In this episode, Alan Chan talks about using similar interventions to help avoid bad outcomes from the deployment of AI agents. Topics we discuss, and timestamps: 01:02 - How the Alignment Workshop is 01:32 - Agent infrastructure 04:57 - Why agent infrastructure 07:54 - A trichotomy of agent infrastructure 13:59 - Agent IDs 18:17 - Agent channels...
38.0 - Zhijing Jin on LLMs, Causality, and Multi-Agent Systems
Do language models understand the causal structure of the world, or do they merely note correlations? And what happens when you build a big AI society out of them? In this brief episode, recorded at the Bay Area Alignment Workshop, I chat with Zhijing Jin about her research on these questions. Topics we discuss, and timestamps: 00:35 - How the Alignment Workshop is 00:47 - How Zhijing got interested in causality and natural language processing 03:14 - Causality and...
37 - Jaime Sevilla on AI Forecasting
Epoch AI is the premier organization that tracks the trajectory of AI - how much compute is used, the role of algorithmic improvements, the growth in data used, and when the above trends might hit an end. In this episode, I speak with the director of Epoch AI, Jaime Sevilla, about how compute, data, and algorithmic improvements are impacting AI, and whether continuing to scale can get us AGI. Topics we discuss, and timestamps: 0:00:38 - The pace of AI progress 0:07:49 - How Epoch AI tracks AI compute 0:11:44 - Why does AI compute grow so smoothly?...
36 - Adam Shai and Paul Riechers on Computational Mechanics
Sometimes, people talk about transformers as having "world models" as a result of being trained to predict text data on the internet. But what does this even mean? In this episode, I talk with Adam Shai and Paul Riechers about their work applying computational mechanics, a sub-field of physics studying how to predict random processes, to neural networks. Topics we discuss, and timestamps: 0:00:42 - What computational mechanics is 0:29:49 - Computational mechanics vs other approaches 0:36:16 - What world models are 0:48:41 - Fractals 0:57:43 - How the...
New Patreon tiers + MATS applications
Note: I'm employed by MATS, but they're not paying me to make this video.
35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization
How do we figure out what large language models believe? In fact, do they even have beliefs? Do those beliefs have locations, and if so, can we edit those locations to change the beliefs? Also, how are we going to get AI to perform tasks so hard that we can't figure out if they succeeded at them? In this episode, I chat with Peter Hase about his research into these questions. Topics we discuss, and timestamps: 0:00:36 - NLP and interpretability 0:10:20 - Interpretability lessons 0:32:22 - Belief interpretability 1:00:12 - Localizing and editing models'...
34 - AI Evaluations with Beth Barnes
How can we figure out if AIs are capable enough to pose a threat to humans? When should we make a big effort to mitigate risks of catastrophic AI misbehaviour? In this episode, I chat with Beth Barnes, founder of and head of research at METR, about these questions and more. Topics we discuss, and timestamps: 0:00:37 - What is METR? 0:02:44 - What is an "eval"? 0:14:42 - How good are evals? 0:37:25 - Are models showing their full capabilities? 0:53:25 - Evaluating alignment 1:01:38 - Existential safety methodology 1:12:13 - Threat models and capability...
33 - RLHF Problems with Scott Emmons
Reinforcement Learning from Human Feedback, or RLHF, is one of the main ways that makers of large language models make them 'aligned'. But people have long noted that there are difficulties with this approach when the models are smarter than the humans providing feedback. In this episode, I talk with Scott Emmons about his work categorizing the problems that can show up in this setting. Topics we discuss, and timestamps: 0:00:33 - Deceptive inflation 0:17:56 - Overjustification 0:32:48 - Bounded human rationality 0:50:46 - Avoiding these problems 1:14:13 -...
32 - Understanding Agency with Jan Kulveit
What's the difference between a large language model and the human brain? And what's wrong with our theories of agency? In this episode, I chat about these questions with Jan Kulveit, who leads the Alignment of Complex Systems research group. Topics we discuss, and timestamps: 0:00:47 - What is active inference? 0:15:14 - Preferences in active inference 0:31:33 - Action vs perception in active inference 0:46:07 - Feedback loops 1:01:32 - Active inference vs LLMs 1:12:04 - Hierarchical agency 1:58:28 - The Alignment of Complex Systems group Website of...
31 - Singular Learning Theory with Daniel Murfet
What's going on with deep learning? What sorts of models get learned, and what are the learning dynamics? Singular learning theory is a theory of Bayesian statistics broad enough in scope to encompass deep neural networks that may help answer these questions. In this episode, I speak with Daniel Murfet about this research program and what it tells us. Topics we discuss, and timestamps: 0:00:26 - What is singular learning theory? 0:16:00 - Phase transitions 0:35:12 - Estimating the local learning coefficient 0:44:37 - Singular learning theory and generalization 1:00:39 -...
What can we learn about advanced deep learning systems by understanding how humans learn and form values over their lifetimes? Will superhuman AI look like ruthless coherent utility optimization, or more like a mishmash of contextually activated desires? This episode's guest, Quintin Pope, has been thinking about these questions as a leading researcher in the shard theory community. We talk about what shard theory is, what it says about humans and neural networks, and what the implications are for making AI safe.
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
Episode art by Hamish Doodles: hamishdoodles.com
Topics we discuss, and timestamps:
- 0:00:42 - Why understand human value formation?
- 0:19:59 - Why not design methods to align to arbitrary values?
- 0:27:22 - Postulates about human brains
- 0:36:20 - Sufficiency of the postulates
- 0:44:55 - Reinforcement learning as conditional sampling
- 0:48:05 - Compatibility with genetically-influenced behaviour
- 1:03:06 - Why deep learning is basically what the brain does
- 1:25:17 - Shard theory
- 1:38:49 - Shard theory vs expected utility optimizers
- 1:54:45 - What shard theory says about human values
- 2:05:47 - Does shard theory mean we're doomed?
- 2:18:54 - Will nice behaviour generalize?
- 2:33:48 - Does alignment generalize farther than capabilities?
- 2:42:03 - Are we at the end of machine learning history?
- 2:53:09 - Shard theory predictions
- 2:59:47 - The shard theory research community
- 3:13:45 - Why do shard theorists not work on replicating human childhoods?
- 3:25:53 - Following shardy research
The transcript: axrp.net/episode/2023/06/15/episode-22-shard-theory-quintin-pope.html
Shard theorist links:
- Quintin's LessWrong profile: lesswrong.com/users/quintin-pope
- Alex Turner's LessWrong profile: lesswrong.com/users/turntrout
- Shard theory Discord: discord.gg/AqYkK7wqAG
- EleutherAI Discord: discord.gg/eleutherai
Research we discuss:
- The Shard Theory Sequence: lesswrong.com/s/nyEFg3AuJpdAozmoX
- Pretraining Language Models with Human Preferences: arxiv.org/abs/2302.08582
- Inner alignment in salt-starved rats: lesswrong.com/posts/wcNEXDHowiWkRxDNv/inner-alignment-in-salt-starved-rats
- Intro to Brain-like AGI Safety Sequence: lesswrong.com/s/HzcM2dkCq7fwXBej8
- Brains and transformers:
  - The neural architecture of language: Integrative modeling converges on predictive processing: pnas.org/doi/10.1073/pnas.2105646118
  - Brains and algorithms partially converge in natural language processing: nature.com/articles/s42003-022-03036-1
  - Evidence of a predictive coding hierarchy in the human brain listening to speech: nature.com/articles/s41562-022-01516-2
- Singular learning theory explainer: Neural networks generalize because of this one weird trick: lesswrong.com/posts/fovfuFdpuEwQzJu2w/neural-networks-generalize-because-of-this-one-weird-trick
- Singular learning theory links: metauni.org/slt/
- Implicit Regularization via Neural Feature Alignment, aka circles in the parameter-function map: arxiv.org/abs/2008.00938
- The shard theory of human values: lesswrong.com/s/nyEFg3AuJpdAozmoX/p/iCfdcxiyr2Kj8m8mT
- Predicting inductive biases of pre-trained networks: openreview.net/forum?id=mNtmhaDkAr
- Understanding and controlling a maze-solving policy network, aka the cheese vector: lesswrong.com/posts/cAC4AXiNC5ig6jQnc/understanding-and-controlling-a-maze-solving-policy-network
- Quintin's Research agenda: Supervising AIs improving AIs: lesswrong.com/posts/7e5tyFnpzGCdfT4mR/research-agenda-supervising-ais-improving-ais
- Steering GPT-2-XL by adding an activation vector: lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector
Links for the addendum on mesa-optimization skepticism:
- Quintin's response to Yudkowsky arguing against AIs being steerable by gradient descent: lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#Yudkowsky_argues_against_AIs_being_steerable_by_gradient_descent_
- Quintin on why evolution is not like AI training: lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#Edit__Why_evolution_is_not_like_AI_training
- Evolution provides no evidence for the sharp left turn: lesswrong.com/posts/hvz9qjWyv8cLX9JJR/evolution-provides-no-evidence-for-the-sharp-left-turn
- Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets: arxiv.org/abs/1905.10854