
29 - Science of Deep Learning with Vikrant Varma

AXRP - the AI X-risk Research Podcast

Release Date: 04/25/2024


In 2022, it was announced that a fairly simple method can be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand. Earlier, in 2021, it was announced that neural networks sometimes 'grok': that is, when training them on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn't generalize), but then suddenly switch to understanding the 'real' solution in a way that generalizes. What's going on with these discoveries? Are they all they're cracked up to be, and if so, how are they working? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions.

Patreon: patreon.com/axrpodcast

Ko-fi: ko-fi.com/axrpodcast

 

Topics we discuss, and timestamps:

0:00:36 - Challenges with unsupervised LLM knowledge discovery, aka contra CCS

  0:00:36 - What is CCS?

  0:09:54 - Consistent and contrastive features other than model beliefs

  0:20:34 - Understanding the banana/shed mystery

  0:41:59 - Future CCS-like approaches

  0:53:29 - CCS as principal component analysis

0:56:21 - Explaining grokking through circuit efficiency

  0:57:44 - Why research science of deep learning?

  1:12:07 - Summary of the paper's hypothesis

  1:14:05 - What are 'circuits'?

  1:20:48 - The role of complexity

  1:24:07 - Many kinds of circuits

  1:28:10 - How circuits are learned

  1:38:24 - Semi-grokking and ungrokking

  1:50:53 - Generalizing the results

1:58:51 - Vikrant's research approach

2:06:36 - The DeepMind alignment team

2:09:06 - Follow-up work

 

The transcript: axrp.net/episode/2024/04/25/episode-29-science-of-deep-learning-vikrant-varma.html

Vikrant's Twitter/X account: twitter.com/vikrantvarma_

 

Main papers:

 - Challenges with unsupervised LLM knowledge discovery: arxiv.org/abs/2312.10029

 - Explaining grokking through circuit efficiency: arxiv.org/abs/2309.02390

 

Other works discussed:

 - Discovering latent knowledge in language models without supervision (CCS): arxiv.org/abs/2212.03827

 - Eliciting Latent Knowledge: How to Tell if your Eyes Deceive You: https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit

 - Discussion: Challenges with unsupervised LLM knowledge discovery: lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1

 - Comment thread on the banana/shed results: lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1?commentId=hPZfgA3BdXieNfFuY

 - Fabien Roger, What discovering latent knowledge did and did not find: lesswrong.com/posts/bWxNPMy5MhPnQTzKz/what-discovering-latent-knowledge-did-and-did-not-find-4

 - Scott Emmons, Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS): lesswrong.com/posts/9vwekjD6xyuePX7Zr/contrast-pairs-drive-the-empirical-performance-of-contrast

 - Grokking: Generalizing Beyond Overfitting on Small Algorithmic Datasets: arxiv.org/abs/2201.02177

 - Keeping Neural Networks Simple by Minimizing the Description Length of the Weights (Hinton 1993 L2): dl.acm.org/doi/pdf/10.1145/168304.168306

 - Progress measures for grokking via mechanistic interpretability: arxiv.org/abs/2301.0521

 

Episode art by Hamish Doodles: hamishdoodles.com