
19 - Mechanistic Interpretability with Neel Nanda

AXRP - the AI X-risk Research Podcast

Release Date: 02/04/2023


How good are we at understanding the internal computation of advanced machine learning models, and do we have any hope of getting better? In this episode, Neel Nanda talks about the sub-field of mechanistic interpretability research, as well as papers he's contributed to that explore the basics of transformer circuits, induction heads, and grokking.
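
For concreteness, here is a minimal sketch (mine, not code from the episode) of the standard induction-head check: feed the model a sequence of random tokens repeated twice and measure how strongly each attention head attends from a token back to the token that followed the previous occurrence of the same token. It assumes Neel's TransformerLens library (linked below) and GPT-2 small; the sequence length and the 0.4 threshold are arbitrary choices for illustration.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small as an example

seq_len, batch = 50, 4
rand_tokens = torch.randint(1000, 10000, (batch, seq_len))
bos = torch.full((batch, 1), model.tokenizer.bos_token_id)
tokens = torch.cat([bos, rand_tokens, rand_tokens], dim=1)  # BOS, then the same random block twice

_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    # Attention pattern: [batch, head, destination position, source position]
    pattern = cache[f"blocks.{layer}.attn.hook_pattern"]
    # An induction head at destination position d attends to source d - seq_len + 1,
    # i.e. the token that came right after the previous copy of the current token.
    scores = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1).mean(dim=(0, -1))
    for head, score in enumerate(scores):
        if score > 0.4:  # arbitrary threshold, purely for illustration
            print(f"L{layer}H{head} looks induction-like (score {score:.2f})")
```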

 

Topics we discuss, and timestamps:

 - 00:01:05 - What is mechanistic interpretability?

 - 00:24:16 - Types of AI cognition

 - 00:54:27 - Automating mechanistic interpretability

 - 01:11:57 - Summarizing the papers

 - 01:24:43 - 'A Mathematical Framework for Transformer Circuits'

   - 01:39:31 - How attention works

   - 01:49:26 - Composing attention heads

   - 01:59:42 - Induction heads

 - 02:11:05 - 'In-context Learning and Induction Heads'

   - 02:12:55 - The multiplicity of induction heads

   - 02:30:10 - Lines of evidence

   - 02:38:47 - Evolution in loss-space

   - 02:46:19 - Mysteries of in-context learning

 - 02:50:57 - 'Progress measures for grokking via mechanistic interpretability'

   - 02:50:57 - How neural nets learn modular addition (see the sketch after this list)

   - 03:11:37 - The suddenness of grokking

 - 03:34:16 - Relation to other research

 - 03:43:57 - Could mechanistic interpretability possibly work?

 - 03:49:28 - Following Neel's research
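
To make the modular addition setup concrete, here is a minimal sketch (mine, not code from the paper) of the task studied in 'Progress measures for grokking via mechanistic interpretability': every pair (a, b) mod p, labelled with (a + b) mod p, with only a fraction of the pairs used for training. The constants follow my understanding of the paper's setup (p = 113, roughly 30% of pairs in the training set) and should be treated as illustrative.

```python
import torch

p = 113            # modulus; the paper uses 113
frac_train = 0.3   # fraction of pairs in the training set (roughly the paper's setting)

pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))  # all (a, b) pairs, shape [p*p, 2]
labels = (pairs[:, 0] + pairs[:, 1]) % p                        # target: (a + b) mod p

perm = torch.randperm(len(pairs))
n_train = int(frac_train * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

train_x, train_y = pairs[train_idx], labels[train_idx]
test_x, test_y = pairs[test_idx], labels[test_idx]

# Training a small one-layer transformer on (train_x, train_y) with heavy weight
# decay first memorizes the training set; test accuracy on (test_x, test_y) only
# jumps much later - that delayed generalization is "grokking".
```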

 

The transcript: axrp.net/episode/2023/02/04/episode-19-mechanistic-interpretability-neel-nanda.html

 

Links to Neel's things:

 - Neel on Twitter: twitter.com/NeelNanda5

 - Neel on the Alignment Forum: alignmentforum.org/users/neel-nanda-1

 - Neel's mechanistic interpretability blog: neelnanda.io/mechanistic-interpretability

 - TransformerLens: github.com/neelnanda-io/TransformerLens

 - Concrete Steps to Get Started in Transformer Mechanistic Interpretability: alignmentforum.org/posts/9ezkEb9oGvEi6WoB3/concrete-steps-to-get-started-in-transformer-mechanistic

 - Neel on YouTube: youtube.com/@neelnanda2469

 - 200 Concrete Open Problems in Mechanistic Interpretability: alignmentforum.org/s/yivyHaCAmMJ3CqSyj

 - Comprehensive mechanistic interpretability explainer: dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J

 

Writings we discuss:

 - A Mathematical Framework for Transformer Circuits: transformer-circuits.pub/2021/framework/index.html

 - In-context Learning and Induction Heads: transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html

 - Progress measures for grokking via mechanistic interpretability: arxiv.org/abs/2301.05217

 - Hungry Hungry Hippos: Towards Language Modeling with State Space Models (referred to in this episode as the "S4 paper"): arxiv.org/abs/2212.14052

 - interpreting GPT: the logit lens: lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens (see the sketch after this list)

 - Locating and Editing Factual Associations in GPT (aka the ROME paper): arxiv.org/abs/2202.05262

 - Human-level play in the game of Diplomacy by combining language models with strategic reasoning: science.org/doi/10.1126/science.ade9097

 - Causal Scrubbing: alignmentforum.org/s/h95ayYYwMebGEYN5y/p/JvZhhzycHu2Yd57RN

 - An Interpretability Illusion for BERT: arxiv.org/abs/2104.07143

 - Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small: arxiv.org/abs/2211.00593

 - Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets: arxiv.org/abs/2201.02177

 - The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models: arxiv.org/abs/2201.03544

 - Collaboration & Credit Principles: colah.github.io/posts/2019-05-Collaboration

 - Transformer Feed-Forward Layers Are Key-Value Memories: arxiv.org/abs/2012.14913

 - Multi-Component Learning and S-Curves: alignmentforum.org/posts/RKDQCB6smLWgs2Mhr/multi-component-learning-and-s-curves

 - The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks: arxiv.org/abs/1803.03635

 - Linear Mode Connectivity and the Lottery Ticket Hypothesis: proceedings.mlr.press/v119/frankle20a
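
The 'logit lens' post above is about decoding the residual stream partway through the network with the final LayerNorm and unembedding, to see what the model "currently predicts" after each layer. Here is a minimal sketch of that idea (mine, not from the post or the episode), again assuming TransformerLens and GPT-2 small; the prompt is just an example.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "The Eiffel Tower is located in the city of"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    resid = cache[f"blocks.{layer}.hook_resid_post"]        # [batch, pos, d_model]
    logits = model.ln_final(resid) @ model.W_U + model.b_U  # decode with the final unembedding
    top_token = logits[0, -1].argmax().item()
    print(f"after layer {layer}, top prediction for the next token: {model.to_string(top_token)!r}")
```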