Alignment Newsletter #171: Disagreements between alignment "optimists" and "pessimists"

Release Date: 01/23/2022

Recorded by Robert Miles: http://robertskmiles.com

More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

HIGHLIGHTS

Alignment difficulty (Richard Ngo and Eliezer Yudkowsky) (summarized by Rohin): Eliezer is known for being pessimistic about our chances of averting AI catastrophe. His argument in this dialogue is roughly as follows:

1. We are very likely going to keep improving AI capabilities until we reach AGI, at which point either the world is destroyed, or we use the AI system to take some pivotal act before some careless actor destroys the world.

2. In either case, the AI system must be producing high-impact, world-rewriting plans; such plans are “consequentialist” in that the simplest way to get them (and thus, the one we will first build) is if you are forecasting what might happen, thinking about the expected consequences, considering possible obstacles, searching for routes around the obstacles, etc. If you don’t do this sort of reasoning, your plan goes off the rails very quickly - it is highly unlikely to lead to high impact. In particular, long lists of shallow heuristics (as with current deep learning systems) are unlikely to be enough to produce high-impact plans.

3. We’re producing AI systems by selecting for systems that can do impressive stuff, which will eventually produce AI systems that can accomplish high-impact plans using a general underlying “consequentialist”-style reasoning process (because that’s the only way to keep doing more impressive stuff). However, this selection process does not constrain the goals towards which those plans are aimed. In addition, most goals seem to have convergent instrumental subgoals like survival and power-seeking that would lead to extinction. This suggests that we should expect an existential catastrophe by default.

4. None of the methods people have suggested for avoiding this outcome seem like they actually avert this story.

Richard responds to this with a few distinct points:

1. It might be possible to build AI systems which are not of world-destroying intelligence and agency, that humans use to save the world. For example, we could make AI systems that do better alignment research. Such AI systems do not seem to require the property of making long-term plans in the real world in point (3) above, and so could plausibly be safe.

2. It might be possible to build general AI systems that only state plans for achieving a goal of interest that we specify, without executing that plan.

3. It seems possible to create consequentialist systems with constraints upon their reasoning that lead to reduced risk.

4. It also seems possible to create systems with the primary aim of producing plans with certain properties (that aren't just about outcomes in the world) -- think for example of corrigibility (AN #35) or deference to a human user.

5. (Richard is also more bullish on coordinating not to use powerful and/or risky AI systems, though the debate did not discuss this much.)

Eliezer’s responses:

1. AI systems that help with alignment research to such a degree that it actually makes a difference are almost certainly already dangerous.

2. It is the plan itself that is risky; if the AI system made a plan for a goal that wasn’t the one we actually meant, and we don’t understand that plan, that plan can still cause extinction. It is the misaligned optimization that produced the plan that is dangerous.

3 and 4. It is certainly possible to do such things; the space of minds that could be designed is very large. However, it is difficult to do such things, as they tend to make consequentialist reasoning weaker, and on our current trajectory the first AGI that we build will probably not look like that.

This post has also been summarized by others here, though with different emphases than in my summary.

Rohin's opinion: I first want to note my violent agreement with the notion that a major scary thing is “consequentialist reasoning”, and that high-impact plans require such reasoning, and that we will end up building AI systems that produce high-impact plans. Nonetheless, I am still optimistic about AI safety relative to Eliezer, which I suspect comes down to three main disagreements:

1. There are many approaches that don’t solve the problem, but do increase the level of intelligence required before the problem leads to extinction. Examples include Richard’s points 1-4 above. For example, if we build a system that states plans without executing them, then for the plans to cause extinction they need to be complicated enough that the humans executing those plans don’t realize that they are leading to an outcome that was not what they wanted. It seems non-trivially probable to me that such approaches are sufficient to prevent extinction up to the level of AI intelligence needed before we can execute a pivotal act.

2. The consequentialist reasoning is only scary to the extent that it is “aimed” at a bad goal. It seems non-trivially probable to me that it will be “aimed” at a goal sufficiently good to not lead to existential catastrophe, without putting in much alignment effort.

3. I do expect some coordination to not do the most risky things.

I wish the debate had focused more on the claim that non-scary AI can’t e.g. do better alignment research, as it seems like a major crux. (For example, I think that sort of intuition drives my disagreement #1.) I expect AI progress looks a lot like “the heuristics get less and less shallow in a gradual / smooth / continuous manner” which eventually leads to the sorts of plans Eliezer calls “consequentialist”, whereas I think Eliezer expects a sharper qualitative change between “lots of heuristics” and that-which-implements-consequentialist-planning.

Discussion of "Takeoff Speeds" (Eliezer Yudkowsky and Paul Christiano) (summarized by Rohin): This post focuses on the question of whether we should expect AI progress to look discontinuous or not. It seemed to me that the two participants were mostly talking past each other, and so I’ll summarize their views separately and not discuss the parts where they were attempting to address each other’s views.

Some ideas behind the “discontinuous” view:

1. When things are made up of a bunch of parts, you only get impact once all of the parts are working. So, if you have, say, 19 out of 20 parts done, there still won’t be much impact; once you get the 20th part, there is a huge impact, which looks like a discontinuity.

2. A continuous change in inputs can lead to a discontinuous change in outputs or impact. Continuously increasing the amount of fissile material leads to a discontinuous change from “inert-looking lump” to “nuclear explosion”. Continuously scaling up a language model from GPT-2 to GPT-3 leads to many new capabilities, such as few-shot learning. A misaligned AI that is only capable of concealing 95% of its deceptive activities will not perform any such activities; it will only strike once it is scaled up to be capable of concealing 100% of its activities.

3. Fundamentally new approaches to a problem will often have prototypes which didn’t have much impact. The difference is that they will scale much better, and so once they start having an impact this will look like a discontinuity in the rate of improvement on the problem.

4. The evolution from chimps to humans tells us that there is, within the space of possible mind designs, an area in which you can get from shallow, non-widely-generalizing cognition to deep, much-more-generalizing cognition, with only relatively small changes.

5. Our civilization tends to prevent people from doing things via bureaucracy and regulatory constraints, so even if there are productivity gains to be had from applications of non-scary AI, we probably won’t see them. As a result, we probably won’t see growth in gross world product (GWP) until the point where an AI can ignore bureaucracy and regulatory constraints, which makes the change look discontinuous.
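As a toy illustration of idea 1 in the list above, here is a minimal sketch of an all-or-nothing model. The 20 parts come from the example in the text; the function name `impact` and the value of `full_impact` are hypothetical choices made only for illustration, and real systems are unlikely to be this binary.

```python
def impact(parts_done: int, parts_needed: int = 20, full_impact: float = 100.0) -> float:
    """Toy all-or-nothing model: no impact until every part works.

    full_impact is an arbitrary, made-up number; only the shape of the curve matters.
    """
    return full_impact if parts_done >= parts_needed else 0.0


# Progress on the inputs is smooth: one more part is finished at each step.
trajectory = [impact(n) for n in range(21)]

# The output is flat for the first 20 steps and then jumps in a single step.
largest_jump = max(later - earlier for earlier, later in zip(trajectory, trajectory[1:]))
```

In this sketch, `impact(19)` is still zero, and the whole of `full_impact` arrives at the final step, which is what an outside observer would record as a discontinuity.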

Some ideas behind the “continuous” view:

1. When people are optimizing hard in pursuit of a metric, then the metric tends to grow smoothly. While individual groups may find new ideas that improve the metric, those new ideas are unlikely to change the metric drastically more than previously observed changes in the metric.

2. A good heuristic for forecasting is to estimate (1) the returns to performance from additional effort, using historical data, and (2) the amount of effort currently being applied. These can then be combined to give a forecast.

3. How smooth and predictable the improvement is depends on how much effort is being put in. In terms of effort put in currently, coding assistants < machine translation < semiconductors; as a result, we should expect semiconductor improvement to be smoother than machine translation improvement, which in turn will be smoother than coding assistant improvement.

4. In AI we will probably have crappy versions of economically useful systems before we have good versions of those systems. By the time we have good versions, people will be throwing lots of effort at the problem. For example, Codex is a crappy version of a coding assistant; such assistants will now improve over time in a somewhat smooth way.
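The forecasting heuristic in idea 2 above can be sketched roughly as follows. This is only one way to cash it out: the log-linear form (performance grows by a fixed amount per doubling of effort), the function names, and all of the data are assumptions made for illustration, not something the post specifies.

```python
import math


def fit_returns(efforts, performances):
    """Estimate returns to effort from historical data.

    Assumes (hypothetically) performance = a + b * log2(effort) and fits
    a and b by ordinary least squares.
    """
    xs = [math.log2(e) for e in efforts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(performances) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, performances))
    variance = sum((x - mean_x) ** 2 for x in xs)
    b = covariance / variance
    a = mean_y - b * mean_x
    return a, b


def forecast(a, b, effort):
    """Combine the fitted returns with an estimate of the effort being applied."""
    return a + b * math.log2(effort)


# Made-up history: each doubling of effort added 5 points of performance.
history_effort = [1, 2, 4, 8]
history_performance = [10.0, 15.0, 20.0, 25.0]

a, b = fit_returns(history_effort, history_performance)
predicted = forecast(a, b, effort=16)  # effort level assumed for the forecast
```

With this made-up history, the fit recovers 5 points per doubling and the forecast for an effort of 16 is 30. The point of the sketch is the two-step structure (estimate returns, then plug in effort), not the particular functional form.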

There’s further discussion on the differences between these views in a subsequent post.

Rohin's opinion: The ideas I’ve listed in this summary seem quite compatible to me; I believe all of them to at least some degree (though perhaps not in the same way as the authors). I am not sure if either author would strongly disagree with any of the claims on this list. (Of course, this does not mean that they agree -- presumably there are some other claims that have not yet been made explicit on which they disagree.)

TECHNICAL AI ALIGNMENT

FIELD BUILDING

AGI Safety Fundamentals curriculum and application (Richard Ngo) (summarized by Rohin): This post presents the curriculum used in the AGI safety fundamentals course, which is meant to serve as an effective introduction to the field of AGI safety.

NEWS

Visible Thoughts Project and Bounty Announcement (Nate Soares) (summarized by Rohin): MIRI would like to test whether language models can be made more understandable by training them to produce visible thoughts. As part of this project, they need a dataset of thought-annotated dungeon runs. They are offering $200,000 in prizes for building the first fragments of the dataset, plus an additional $1M prize/budget for anyone who demonstrates the ability to build a larger dataset at scale.

Prizes for ELK proposals (Paul Christiano) (summarized by Rohin): The Alignment Research Center (ARC) recently published a technical report on Eliciting Latent Knowledge (ELK). They are offering prizes of $5,000 to $50,000 for proposed strategies that tackle ELK. The deadline is the end of January.

Rohin's opinion: I think this is a particularly good contest to try to test your fit with (a certain kind of) theoretical alignment research: even if you don't have much background, you can plausibly get up to speed in tens of hours. I will also try to summarize ELK next week, but no promises.

Worldbuilding Contest (summarized by Rohin): FLI invites individuals and teams to compete for a prize purse worth $100,000+ by designing visions of a plausible, aspirational future including artificial general intelligence. The deadline for submissions is April 15.

New Seminar Series and Call For Proposals On Cooperative AI (summarized by Rohin): The Cooperative AI Foundation (CAIF) will be hosting a new fortnightly seminar series in which leading thinkers offer their vision for research on Cooperative AI. The first talk, 'AI Agents May Cooperate Better If They Don’t Resemble Us’, was given on Thursday (Jan 20) by Vincent Conitzer (Duke University, University of Oxford). You can find more details and submit a proposal for the seminar series here.

AI Risk Management Framework Concept Paper (summarized by Rohin): After their Request For Information last year (AN #161), NIST has now posted a concept paper detailing their current thinking around the AI Risk Management Framework that they are creating, and are soliciting comments by Jan 25. As before, if you're interested in helping with a response, email Tony Barrett at anthony.barrett@berkeley.edu.

Announcing the PIBBSS Summer Research Fellowship (Nora Ammann) (summarized by Rohin): Principles of Intelligent Behavior in Biological and Social Systems (PIBBSS) aims to facilitate knowledge transfer with the goal of building human-aligned AI systems. This summer research fellowship will bring together researchers from fields studying complex and intelligent behavior in natural and social systems, such as evolutionary biology, neuroscience, linguistics, sociology, and more. The application deadline is Jan 23, and there are also bounties for referrals.

Action: Help expand funding for AI Safety by coordinating on NSF response (Evan R. Murphy) (summarized by Rohin): The National Science Foundation (NSF) has put out a Request for Information relating to topics they will be funding in 2023 as part of their NSF Convergence Accelerator program. The author and others are coordinating responses to increase funding to AI safety, and ask that you fill out this short form if you are willing to help out with a few small, simple actions.
