23 - Mechanistic Anomaly Detection with Mark Xu

AXRP - the AI X-risk Research Podcast

Release Date: 07/27/2023

Is there some way we can detect bad behaviour in our AI system without having to know exactly what it looks like? In this episode, I speak with Mark Xu about mechanistic anomaly detection: a research direction based on the idea of detecting strange things happening in neural networks, in the hope that doing so will alert us to potential treacherous turns. We talk about the core problems of relating these mechanistic anomalies to bad behaviour, as well as about the paper "Formalizing the presumption of independence", which formulates the problem of formalizing heuristic mathematical reasoning, in the hope that this will let us mathematically define "mechanistic anomalies".
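
To make the "mechanistic anomaly" idea a bit more concrete, here is a minimal toy sketch of the general flavour: record a model's internal activations on trusted inputs, fit a reference distribution, and flag new inputs whose activations look like outliers. This is only an illustrative Mahalanobis-distance baseline, not ARC's heuristic-arguments approach discussed in the episode; the function names and the activation setup are hypothetical stand-ins, not anything from the paper.

```python
# Toy illustration (not ARC's formalism): flag inputs whose internal
# activations look unlike those seen on trusted data, using a Gaussian
# fit and Mahalanobis distance as the anomaly score.
import numpy as np

def fit_reference(activations: np.ndarray):
    """Fit mean and regularized inverse covariance to activations
    gathered from trusted inputs (shape: n_samples x n_features)."""
    mean = activations.mean(axis=0)
    cov = np.cov(activations, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])  # regularize so the covariance is invertible
    return mean, np.linalg.inv(cov)

def anomaly_score(x: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> float:
    """Mahalanobis distance of one activation vector from the reference."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    trusted = rng.normal(size=(1000, 16))   # stand-in for activations on trusted inputs
    mean, cov_inv = fit_reference(trusted)
    normal_x = rng.normal(size=16)          # looks like the trusted distribution
    weird_x = rng.normal(size=16) + 5.0     # shifted: mechanistically "anomalous"
    print("normal:", anomaly_score(normal_x, mean, cov_inv))
    print("weird: ", anomaly_score(weird_x, mean, cov_inv))
```

ARC's actual proposal, as discussed in the episode, is about explaining why the model behaves the way it usually does (via heuristic arguments) and asking whether a new input is covered by that usual explanation, rather than thresholding raw activation statistics like this.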

Patreon: patreon.com/axrpodcast

Ko-fi: ko-fi.com/axrpodcast

Episode art by Hamish Doodles: hamishdoodles.com/

 

Topics we discuss, and timestamps:

 - 0:00:38 - Mechanistic anomaly detection

   - 0:09:28 - Are all bad things mechanistic anomalies, and vice versa?

   - 0:18:12 - Are responses to novel situations mechanistic anomalies?

   - 0:39:19 - Formalizing "for the normal reason, for any reason"

   - 1:05:22 - How useful is mechanistic anomaly detection?

 - 1:12:38 - Formalizing the Presumption of Independence

   - 1:20:05 - Heuristic arguments in physics

   - 1:27:48 - Difficult domains for heuristic arguments

   - 1:33:37 - Why not maximum entropy?

   - 1:44:39 - Adversarial robustness for heuristic arguments

   - 1:54:05 - Other approaches to defining mechanisms

 - 1:57:20 - The research plan: progress and next steps

 - 2:04:13 - Following ARC's research

 

The transcript: axrp.net/episode/2023/07/24/episode-23-mechanistic-anomaly-detection-mark-xu.html

 

ARC links:

 - Website: alignment.org

 - Theory blog: alignment.org/blog

 - Hiring page: alignment.org/hiring

 

Research we discuss:

 - Formalizing the presumption of independence: arxiv.org/abs/2211.06738

 - Eliciting Latent Knowledge (aka ELK): alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge

 - Mechanistic Anomaly Detection and ELK: alignmentforum.org/posts/vwt3wKXWaCvqZyF74/mechanistic-anomaly-detection-and-elk

 - Can we efficiently explain model behaviours? alignmentforum.org/posts/dQvxMZkfgqGitWdkb/can-we-efficiently-explain-model-behaviors

 - Can we efficiently distinguish different mechanisms? alignmentforum.org/posts/JLyWP2Y9LAruR2gi9/can-we-efficiently-distinguish-different-mechanisms