"Looking back on my alignment PhD" by TurnTrout

Release Date: 07/08/2022

LessWrong Curated Podcast

https://www.lesswrong.com/posts/2GxhAyn9aHqukap2S/looking-back-on-my-alignment-phd

The funny thing about long periods of time is that they do, eventually, come to an end. I'm proud of what I accomplished during my PhD. That said, I'm going to first focus on mistakes I've made over the past four^[1] years.

Mistakes

I think I got significantly smarter in 2018–2019, and kept learning some in 2020–2021. I was significantly less of a fool in 2021 than I was in 2017. That is important and worth feeling good about. But all things considered, I still made a lot of profound mistakes over the course of my PhD.

Social dynamics distracted me from my core mission

I focused on "catching up" to other thinkers

I figured this point out by summer 2021.

I wanted to be more like Eliezer Yudkowsky and Buck Shlegeris and Paul Christiano. They know lots of facts and laws about lots of areas (e.g. general relativity and thermodynamics and information theory). I focused on building up dependencies (like analysis and geometry and topology) not only because I wanted to know the answers, but because I felt I owed a debt, that I was in the red until I could at least meet other thinkers at their level of knowledge.

But rationality is not about the bag of facts you know, nor is it about the concepts you have internalized. Rationality is about how your mind holds itself; it is how you weigh evidence; it is how you decide where to look next when puzzling out a new area.

If I had been more honest with myself, I could have nipped the "catching up with other thinkers" mistake in the bud in 2018. I could have removed the bad mental habits using certain introspective techniques, or at least been aware of the badness.

But I did not, in part because the truth was uncomfortable. If I did not have a clear set of prerequisites (e.g. analysis and topology and game theory) to work on, I would not have a clear and immediate direction of improvement. I would have felt adrift.
