25 - Cooperative AI with Caspar Oesterheld
AXRP - the AI X-risk Research Podcast
Release Date: 10/03/2023
Imagine a world where there are many powerful AI systems, working at cross purposes. You could suppose that different governments use AIs to manage their militaries, or simply that many powerful AIs have their own wills. At any rate, it seems valuable for them to be able to work together and minimize pointless conflict. How do we ensure that AIs behave this way, and what do we need to learn about how rational agents interact to make that clearer? In this episode, I'll be speaking with Caspar Oesterheld about some of his research on this very topic.
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
Episode art by Hamish Doodles: hamishdoodles.com
Topics we discuss, and timestamps:
- 0:00:34 - Cooperative AI
- 0:06:21 - Cooperative AI vs standard game theory
- 0:19:45 - Do we need cooperative AI if we get alignment?
- 0:29:29 - Cooperative AI and agent foundations
- 0:34:59 - A Theory of Bounded Inductive Rationality
  - 0:50:05 - Why it matters
  - 0:53:55 - How the theory works
  - 1:01:38 - Relationship to logical inductors
  - 1:15:56 - How fast does it converge?
  - 1:19:46 - Non-myopic bounded rational inductive agents?
  - 1:24:25 - Relationship to game theory
- 1:30:39 - Safe Pareto Improvements
  - 1:30:39 - What they try to solve
  - 1:36:15 - Alternative solutions
  - 1:40:46 - How safe Pareto improvements work
  - 1:51:19 - Will players fight over which safe Pareto improvement to adopt?
  - 2:06:02 - Relationship to program equilibrium
  - 2:11:25 - Do safe Pareto improvements break themselves?
- 2:15:52 - Similarity-based Cooperation
  - 2:23:07 - Are similarity-based cooperators overly cliqueish?
  - 2:27:12 - Sensitivity to noise
  - 2:29:41 - Training neural nets to do similarity-based cooperation
- 2:50:25 - FOCAL, Caspar's research lab
- 2:52:52 - How the papers all relate
- 2:57:49 - Relationship to functional decision theory
- 2:59:45 - Following Caspar's research
The transcript: axrp.net/episode/2023/10/03/episode-25-cooperative-ai-caspar-oesterheld.html
Links for Caspar:
- FOCAL at CMU: www.cs.cmu.edu/~focal/
- Caspar on X, formerly known as Twitter: twitter.com/C_Oesterheld
- Caspar's blog: casparoesterheld.com/
- Caspar on Google Scholar: scholar.google.com/citations?user=xeEcRjkAAAAJ&hl=en&oi=ao
Research we discuss:
- A Theory of Bounded Inductive Rationality: arxiv.org/abs/2307.05068
- Safe Pareto improvements for delegated game playing: link.springer.com/article/10.1007/s10458-022-09574-6
- Similarity-based Cooperation: arxiv.org/abs/2211.14468
- Logical Induction: arxiv.org/abs/1609.03543
- Program Equilibrium: citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=e1a060cda74e0e3493d0d81901a5a796158c8410
- Formalizing Objections against Surrogate Goals: www.alignmentforum.org/posts/K4FrKRTrmyxrw5Dip/formalizing-objections-against-surrogate-goals
- Learning with Opponent-Learning Awareness: arxiv.org/abs/1709.04326