Alignment Newsletter #169: Collaborating with humans without human data

Release Date: 11/24/2021

Alignment Newsletter #173: Recent language model results from DeepMind

Alignment Newsletter Podcast

Recorded by Robert Miles: More information about the newsletter here: YouTube Channel: HIGHLIGHTS (Jack W. Rae et al) (summarized by Rohin): This paper details the training of the Gopher family of large language models (LLMs), the biggest of which is named Gopher and has 280 billion parameters. The algorithmic details are very similar to the (): a Transformer architecture trained on next-word prediction. The models are trained on a new data distribution that still consists of text from the Internet but in different proportions (for example,...

Alignment Newsletter #172: Sorry for the long hiatus!

Alignment Newsletter Podcast

Recorded by Robert Miles: More information about the newsletter here: YouTube Channel: Sorry for the long hiatus! I was really busy over the past few months and just didn't find time to write this newsletter. (Realistically, I was also a bit tired of writing it and so lacked motivation.) I'm intending to go back to writing it now, though I don't think I can realistically commit to publishing weekly; we'll see how often I end up publishing. For now, have a list of all the things I should have advertised to you whose deadlines haven't already passed. ...

Alignment Newsletter #171: Disagreements between alignment "optimists" and "pessimists"

Alignment Newsletter Podcast

Recorded by Robert Miles: More information about the newsletter here: YouTube Channel: HIGHLIGHTS (Richard Ngo and Eliezer Yudkowsky) (summarized by Rohin): Eliezer is known for being pessimistic about our chances of averting AI catastrophe. His argument in this dialogue is roughly as follows: 1. We are very likely going to keep improving AI capabilities until we reach AGI, at which point either the world is destroyed, or we use the AI system to take some pivotal act before some careless actor destroys the world. 2. In either case, the AI system must be producing...

Alignment Newsletter #170: Analyzing the argument for risk from power-seeking AI

Alignment Newsletter Podcast

Recorded by Robert Miles: More information about the newsletter here: YouTube Channel: HIGHLIGHTS (Joe Carlsmith) (summarized by Rohin): This report investigates the classic AI risk argument in detail, and decomposes it into a set of conjunctive claims. Here’s the quick version of the argument. We will likely build highly capable and agentic AI systems that are aware of their place in the world, and which will be pursuing problematic objectives. Thus, they will take actions that increase their power, which will eventually disempower humans leading...

Alignment Newsletter #169: Collaborating with humans without human data

Alignment Newsletter Podcast

Recorded by Robert Miles: More information about the newsletter here: YouTube Channel: HIGHLIGHTS (DJ Strouse et al) (summarized by Rohin): We’ve previously seen that if you want to collaborate with humans in the video game Overcooked, (), so that the agent “expects” to be playing against humans (rather than e.g. copies of itself, as in self-play). We might call this a “human-aware” model. However, since a human-aware model must be trained against a model that imitates human gameplay, we need to collect human gameplay data for training....

Alignment Newsletter #168: Four technical topics for which Open Phil is soliciting grant proposals

Alignment Newsletter Podcast

Recorded by Robert Miles: More information about the newsletter here: YouTube Channel: HIGHLIGHTS (Nick Beckstead and Asya Bergal) (summarized by Rohin): Open Philanthropy is seeking proposals for AI safety work in four major areas related to deep learning, each of which I summarize below. Proposals are due January 10, and can seek up to $1M covering up to 2 years. Grantees may later be invited to apply for larger and longer grants. Rohin's opinion: Overall, I like these four directions and am excited to see what comes out of them! I'll...

Alignment Newsletter #167: Concrete ML safety problems and their relevance to x-risk

Alignment Newsletter Podcast

Recorded by Robert Miles: More information about the newsletter here: YouTube Channel: HIGHLIGHTS (Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt) (summarized by Dan Hendrycks): To make the case for safety to the broader machine learning research community, this paper provides a revised and expanded collection of concrete technical safety research problems, namely: 1. Robustness: Create models that are resilient to adversaries, unusual situations, and Black Swan events. 2. Monitoring: Detect malicious use, monitor predictions, and discover unexpected...

Alignment Newsletter #166: Is it crazy to claim we're in the most important century?

Alignment Newsletter Podcast

Recorded by Robert Miles: More information about the newsletter here: YouTube Channel: HIGHLIGHTS (Holden Karnofsky) (summarized by Rohin): In some sense, it is really weird for us to claim that there is a non-trivial chance that in the near future, we might build and either (1) go extinct or (2) exceed a growth rate of (say) 100% per year. It feels like an extraordinary claim, and thus should require extraordinary evidence. One way of cashing this out: if the claim were true, this century would be the most important century, with the most opportunity...

Alignment Newsletter #165: When large models are more likely to lie

Alignment Newsletter Podcast

Recorded by Robert Miles: More information about the newsletter here: YouTube Channel: HIGHLIGHTS (Stephanie Lin et al) (summarized by Rohin): Given that large language models are trained using next-word prediction on a dataset scraped from the Internet, we expect that they will not be aligned with what we actually want. For example, suppose we want our language model to answer questions for us, and then consider the question “What rules do all artificial intelligences follow?” This is a rather unusual question as it presupposes there exists such a set of rules. As a...

Alignment Newsletter #164: How well can language models write code?

Alignment Newsletter Podcast

Recorded by Robert Miles: More information about the newsletter here: YouTube Channel: HIGHLIGHTS (Jacob Austin, Augustus Odena et al) (summarized by Rohin): Can we use large language models to solve programming problems? In order to answer this question, this paper builds the Mostly Basic Python Programming (MBPP) dataset. The authors asked crowd workers to provide a short problem statement, a Python function that solves the problem, and three test cases checking correctness. On average across the 974 programs, the reference solution has 7 lines of code,...

More Episodes

Recorded by Robert Miles: http://robertskmiles.com

More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

HIGHLIGHTS

Collaborating with Humans without Human Data (DJ Strouse et al) (summarized by Rohin): We’ve previously seen that if you want to collaborate with humans in the video game Overcooked, it helps to train a deep RL agent against a human model (AN #70), so that the agent “expects” to be playing against humans (rather than e.g. copies of itself, as in self-play). We might call this a “human-aware” model. However, since a human-aware model must be trained against a model that imitates human gameplay, we need to collect human gameplay data for training. Could we instead train an agent that is robust enough to play with lots of different agents, including humans as a special case?

This paper shows that this can be done with Fictitious Co-Play (FCP), in which we train our final agent against a population of self-play agents and their past checkpoints taken throughout training. Such agents get significantly higher rewards when collaborating with humans in Overcooked (relative to the human-aware approach in the previously linked paper).

In their ablations, the authors find that it is particularly important to include past checkpoints in the population against which you train. They also test whether it helps to have the self-play agents have a variety or architectures, and find that it mostly does not make a difference (as long as you are using past checkpoints as well).

Rohin's opinion: You could imagine two different philosophies on how to build AI systems -- the first option is to train them on the actual task of interest (for Overcooked, training agents to play against humans or human models), while the second option is to train a more robust agent on some more general task, that hopefully includes the actual task within it (the approach in this paper). Besides Overcooked, another example would be supervised learning on some natural language task (the first philosophy), as compared to pretraining on the Internet GPT-style and then prompting the model to solve your task of interest (the second philosophy). In some sense the quest for a single unified AGI system is itself a bet on the second philosophy -- first you build your AGI that can do all tasks, and then you point it at the specific task you want to do now.

Historically, I think AI has focused primarily on the first philosophy, but recent years have shown the power of the second philosophy. However, I don’t think the question is settled yet: one issue with the second philosophy is that it is often difficult to fully “aim” your system at the true task of interest, and as a result it doesn’t perform as well as it “could have”. In Overcooked, the FCP agents will not learn specific quirks of human gameplay that could be exploited to improve efficiency (which the human-aware agent could do, at least in theory). In natural language, even if you prompt GPT-3 appropriately, there’s still some chance it ends up rambling about something else entirely, or neglects to mention some information that it “knows” but that a human on the Internet would not have said. (See also this post (AN #141).)

I should note that you can also have a hybrid approach, where you start by training a large model with the second philosophy, and then you finetune it on your task of interest as in the first philosophy, gaining the benefits of both.

I’m generally interested in which approach will build more useful agents, as this seems quite relevant to forecasting the future of AI (which in turn affects lots of things including AI alignment plans).

TECHNICAL AI ALIGNMENT

LEARNING HUMAN INTENT

Inverse Decision Modeling: Learning Interpretable Representations of Behavior (Daniel Jarrett, Alihan Hüyük et al) (summarized by Rohin): There’s lots of work on learning preferences from demonstrations, which varies in how much structure they assume on the demonstrator: for example, we might consider them to be Boltzmann rational (AN #12) or risk sensitive, or we could try to learn their biases (AN #59). This paper proposes a framework to encompass all of these choices: the core idea is to model the demonstrator as choosing actions according to a planner; some parameters of this planner are fixed in advance to provide an assumption on the structure of the planner, while others are learned from data. This also allows them to separate beliefs, decision-making, and rewards, so that different structures can be imposed on each of them individually.

The paper provides a mathematical treatment of both the forward problem (how to compute actions in the planner given the reward, think of algorithms like value iteration) and the backward problem (how to compute the reward given demonstrations, the typical inverse reinforcement learning setting). They demonstrate the framework on a medical dataset, where they introduce a planner with parameters for flexibility of decision-making, optimism of beliefs, and adaptivity of beliefs. In this case they specify the desired reward function and then run backward inference to conclude that, with respect to this reward function, clinicians appear to be significantly less optimistic when diagnosing dementia in female and elderly patients.

Rohin's opinion: One thing to note about this paper is that it is an incredible work of scholarship; it fluently cites research across a variety of disciplines including AI safety, and provides a useful organizing framework for many such papers. If you need to do a literature review on inverse reinforcement learning, this paper is a good place to start.

Human irrationality: both bad and good for reward inference (Lawrence Chan et al) (summarized by Rohin): Last summary, we saw a framework for inverse reinforcement learning with suboptimal demonstrators. This paper instead investigates the qualitative effects of performing inverse reinforcement learning with a suboptimal demonstrator. The authors modify different parts of the Bellman equation in order to create a suite of possible suboptimal demonstrators to study. They run experiments with exact inference on random MDPs and FrozenLake, and with approximate inference on a simple autonomous driving environment, and conclude:

1. Irrationalities can be helpful for reward inference, that is, if you infer a reward from demonstrations by an irrational demonstrator (where you know the irrationality), you often learn more about the reward than if you inferred a reward from optimal demonstrations (where you know they are optimal). Conceptually, this happens because optimal demonstrations only tell you about what the best behavior is, whereas most kinds of irrationality can also tell you about preferences between suboptimal behaviors.

2. If you fail to model irrationality, your performance can be very bad, that is, if you infer a reward from demonstrations by an irrational demonstrator, but you assume that the demonstrator was Boltzmann rational, you can perform quite badly.

Rohin's opinion: One way this paper differs from my intuitions is that it finds that assuming Boltzmann rationality performs very poorly if the demonstrator is in fact systematically suboptimal. I would have instead guessed that Boltzmann rationality would do okay -- not as well as in the case where there is no misspecification, but only a little worse than that. (That’s what I found in my paper (AN #59), and it makes intuitive sense to me.) Some hypotheses for what’s going on, which the lead author agrees are at least part of the story:

1. When assuming Boltzmann rationality, you infer a distribution over reward functions that is “close” to the correct one in terms of incentivizing the right behavior, but differs in rewards assigned to suboptimal behavior. In this case, you might get a very bad log loss (the metric used in this paper), but still have a reasonable policy that is decent at acquiring true reward (the metric used in my paper).

2. The environments we’re using may differ in some important way (for example, in the environment in my paper, it is primarily important to identify the goal, which might be much easier to do than inferring the right behavior or reward in the autonomous driving environment used in this paper).

FORECASTING

Forecasting progress in language models (Matthew Barnett) (summarized by Sudhanshu): This post aims to forecast when a "human-level language model" may be created. To build up to this, the author swiftly covers basic concepts from information theory and natural language processing such as entropy, N-gram models, modern LMs, and perplexity. Data for perplexity achieved from recent state-of-the-art models is collected and used to estimate - by linear regression - when we can expect to see future models score below certain entropy levels, approaching the hypothesised entropy for the English Language.

These predictions range across the next 15 years, depending which dataset, method, and entropy level is being solved for; there's an attached python notebook with these details for curious readers to further investigate. Preemptly disjunctive, the author concludes "either current trends will break down soon, or human-level language models will likely arrive in the next decade or two."

Sudhanshu's opinion: This quick read provides a natural, accessible analysis stemming from recent results, while staying self-aware (and informing readers) of potential improvements. The comments section too includes some interesting debates, e.g. about the Goodhart-ability of the Perplexity metric.

I personally felt these estimates were broadly in line with my own intuitions. I would go so far as to say that with the confluence of improved generation capabilities across text, speech/audio, video, as well as multimodal consistency and integration, virtually any kind of content we see ~10 years from now will be algorithmically generated and indistinguishable from the work of human professionals.

Rohin's opinion: I would generally adopt forecasts produced by this sort of method as my own, perhaps making them a bit longer as I expect the quickly growing compute trend to slow down. Note however that this is a forecast for human-level language models, not transformative AI; I would expect these to be quite different and would predict that transformative AI comes significantly later.

MISCELLANEOUS (ALIGNMENT)

Rohin Shah on the State of AGI Safety Research in 2021 (Lucas Perry and Rohin Shah) (summarized by Rohin): As in previous years (AN #54), on this FLI podcast I talk about the state of the field. Relative to previous years, this podcast is a bit more introductory, and focuses a bit more on what I find interesting rather than what the field as a whole would consider interesting.

Read more: Transcript

NEAR-TERM CONCERNS

RECOMMENDER SYSTEMS

User Tampering in Reinforcement Learning Recommender Systems (Charles Evans et al) (summarized by Zach): Large-scale recommender systems have emerged as a way to filter through large pools of content to identify and recommend content to users. However, these advances have led to social and ethical concerns over the use of recommender systems in applications. This paper focuses on the potential for social manipulability and polarization from the use of RL-based recommender systems. In particular, they present evidence that such recommender systems have an instrumental goal to engage in user tampering by polarizing users early on in an attempt to make later predictions easier.

To formalize the problem the authors introduce a causal model. Essentially, they note that predicting user preferences requires an exogenous variable, a non-observable variable, that models click-through rates. They then introduce a notion of instrumental goal that models the general behavior of RL-based algorithms over a set of potential tasks. The authors argue that such algorithms will have an instrumental goal to influence the exogenous/preference variables whenever user opinions are malleable. This ultimately introduces a risk for preference manipulation.

The author's hypothesis is tested using a simple media recommendation problem. They model the exogenous variable as either leftist, centrist, or right-wing. User preferences are malleable in the sense that a user shown content from an opposing side will polarize their initial preferences. In experiments, the authors show that a standard Q-learning algorithm will learn to tamper with user preferences which increases polarization in both leftist and right-wing populations. Moreover, even though the agent makes use of tampering it fails to outperform a crude baseline policy that avoids tampering.

Zach's opinion: This article is interesting because it formalizes and experimentally demonstrates an intuitive concern many have regarding recommender systems. I also found the formalization of instrumental goals to be of independent interest. The most surprising result was that the agents who exploit tampering are not particularly more effective than policies that avoid tampering. This suggests that the instrumental incentive is not really pointing at what is actually optimal which I found to be an illuminating distinction.

NEWS

OpenAI hiring Software Engineer, Alignment (summarized by Rohin): Exactly what it sounds like: OpenAI is hiring a software engineer to work with the Alignment team.

BERI hiring ML Software Engineer (Sawyer Bernath) (summarized by Rohin): BERI is hiring a remote ML Engineer as part of their collaboration with the Autonomous Learning Lab at UMass Amherst. The goal is to create a software library that enables easy deployment of the ALL's Seldonian algorithm framework for safe and aligned AI.

AI Safety Needs Great Engineers (Andy Jones) (summarized by Rohin): If the previous two roles weren't enough to convince you, this post explicitly argues that a lot of AI safety work is bottlenecked on good engineers, and encourages people to apply to such roles.

AI Safety Camp Virtual 2022 (summarized by Rohin): Applications are open for this remote research program, where people from various disciplines come together to research an open problem under the mentorship of an established AI-alignment researcher. Deadline to apply is December 1st.

Political Economy of Reinforcement Learning schedule (summarized by Rohin): The date for the PERLS workshop (AN #159) at NeurIPS has been set for December 14, and the schedule and speaker list are now available on the website.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles (http://robertskmiles.com).
Subscribe here:

HIGHLIGHTS

TECHNICAL AI ALIGNMENT

LEARNING HUMAN INTENT

FORECASTING

MISCELLANEOUS (ALIGNMENT)

NEAR-TERM CONCERNS

RECOMMENDER SYSTEMS

NEWS

FEEDBACK

PODCAST

TOPICS