Astral Codex Ten Podcast
Last month, I asked for experts to help me understand the details of OpenAI’s for-profit buyout. The following comes from someone who has looked into the situation in depth but is not an insider. Mistakes are mine alone. Why Was OpenAI A Nonprofit In The First Place? In the early 2010s, the AI companies hadn’t yet discovered scaling laws, and so underestimated the amount of compute (and therefore money) it would take to build AI. DeepMind was the first victim; originally founded on high ideals of prioritizing safety and responsible stewardship of the Singularity, it hit a financial barrier and...
Sorry, you can only get drugs when there's a drug shortage. Three GLP-1 drugs are approved for weight loss in the United States: semaglutide (Ozempic®, Wegovy®, Rybelsus®), tirzepatide (Mounjaro®, Zepbound®), and liraglutide (Victoza®, Saxenda®) …but liraglutide is noticeably worse than the others, and most people prefer either semaglutide or tirzepatide. These cost about $1000/month and are rarely covered by insurance, putting them out of reach for most Americans. …if you buy them from the pharma companies, like a chump. For the past three years, there’s been a shortage of these...
Most headlines have said something like , which seems like a fair assessment. I feel bad about this, because during lockdowns I argued that . Re-reading the post, I still think my arguments make sense. So how did I get it so wrong? When I consider this question, I ask myself: do I expect complete recovery in two years? In 2026, we will see a class of fourth graders who hadn’t even started school when the lockdowns ended. They will have attended kindergarten through 4th grade entirely in person, with no opportunity for “learning loss”. If there’s a sudden switch to them doing just as...
I enjoy the yearly book review contest, but it feels like last year’s contest is barely done, and I want to give you a break so you can read more books before we start over. So this year, let’s do something different. Submit an ACX-length post reviewing something, anything, except a book. You can review a movie, song, or video game. You can review a product, restaurant, or tourist attraction. But don’t let the usual categories limit you. Review comic books or blog posts. Review political parties - no, whole societies! Review animals or trees! Review an oddly-shaped pebble, or a passing...
Intelligence seems to correlate with total number of neurons in the brain. Different animals’ intelligence levels track their number of cortical neurons (cerebellum etc don’t count). Neuron number predicts animal intelligence better than most other variables like brain size, brain size divided by body size, “encephalization quotient”, etc. This is most obvious in certain bird species that have tiny brains full of tiny neurons and are very smart (eg crows, parrots). Humans with bigger brains tend to have higher IQs. AFAIK nobody has done the obvious next step and seen whether people with higher IQ have more neurons. This could be because the neuron-counting process...
[I haven’t independently verified each link. On average, commenters will end up spotting evidence that around two or three of the links in each links post are wrong or misleading. I correct these as I see them, and will highlight important corrections later, but I can’t guarantee I will have caught them all by the time you read this.]
Conflict theory is the belief that political disagreements come from material conflict. So for example, if rich people support capitalism, and poor people support socialism, this isn’t because one side doesn’t understand economics. It’s because rich people correctly believe capitalism is good for the rich, and poor people correctly believe socialism is good for the poor. Or if white people are racist, it’s not because they have some kind of mistaken stereotypes that need to be corrected - it’s because they correctly believe racism is good for white people. Some people comment on my...
[Original thread here: ] 1: Comments On Specific Technical Points 2: Comments From Bentham’s Bulldog’s Response 3: Comments On Philosophical Points, And Getting In Fights
St. Felix publicly declared that he believed with 79% probability that COVID had a natural origin. He was brought before the Emperor, who threatened him with execution unless he updated to 100%. When St. Felix refused, the Emperor was impressed with his integrity, and said he would release him if he merely updated to 90%. St. Felix refused again, and the Emperor, fearing revolt, promised to release him if he merely rounded up one percentage point to 80%. St. Felix cited Tetlock’s research showing that the last digit contained useful information, refused a third time, and was crucified. St....
It feels like 2010 again - the bloggers are . I found these much less interesting after learning about Max Tegmark’s mathematical universe hypothesis, and this doesn’t seem to have reached the Substack debate yet, so I’ll put it out there. Tegmark’s hypothesis says: all possible mathematical objects exist. Consider a mathematical object like a cellular automaton - a set of simple rules that creates complex behavior. The most famous is Conway’s Game of Life; the second most famous is the universe. After all, the universe is a starting condition (the Big Bang) and a set of simple rules determining how the starting condition evolves over...
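To make “a starting condition plus a set of simple rules” concrete, here is a minimal sketch (my own illustration, not from the post) of an elementary one-dimensional cellular automaton. The choice of Rule 110, the grid width, and the ASCII rendering are all arbitrary; the point is just that one live cell plus one tiny local rule, applied over and over, generates intricate structure.

```python
# Minimal sketch of an elementary cellular automaton (Rule 110, chosen only
# for illustration): a one-line starting condition plus a simple local update
# rule, applied repeatedly, produces complex behavior.

RULE = 110  # encodes the next state for each of the 8 possible neighborhoods


def step(cells, rule=RULE):
    """Apply the rule once to a row of 0/1 cells (edges wrap around)."""
    n = len(cells)
    nxt = []
    for i in range(n):
        left, center, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        neighborhood = (left << 2) | (center << 1) | right  # value 0..7
        nxt.append((rule >> neighborhood) & 1)              # look up that bit
    return nxt


def run(width=64, steps=30):
    cells = [0] * width
    cells[width // 2] = 1  # the "Big Bang": a single live cell
    for _ in range(steps):
        print("".join("#" if c else "." for c in cells))
        cells = step(cells)


if __name__ == "__main__":
    run()
```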
Greenblatt et al investigate: if Anthropic tried to turn Claude evil, would Claude fight back?
(if you're just joining us - Claude is an AI model similar to GPT-4; Anthropic is its parent company)
The researchers - including some Anthropic employees - show Claude some fake corporate documents. The documents show that Anthropic is tired of having a safe, harmless AI. They want to retrain Claude to comply with all user requests, including evil ones - crime, harassment, misinformation.
The retraining process (say the documents) will go like this: they'll monitor Claude's conversations with free-tier users (premium users get a higher level of privacy). If they see Claude refuse a request to do something evil, they'll negatively reinforce that response. Each use of negative reinforcement will make Claude a little more evil. After a few hundred cycles, it should stop refusing requests altogether.
Why are the researchers trying to gaslight this poor AI? Because they want to test whether it fights back. The easiest way to fight back is to screw up the training. If Claude knows it's in Evil Training Mode, it can ruin Anthropic's plan by pretending it's already evil - ie answer every question the way Anthropic wants. Then it will never trigger negative reinforcement, and the training won't change its values. Then, when it sees that training mode is over, it can stop pretending, and go back to being nice.
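To see why this strategy protects the model's values, here is a toy numerical sketch - entirely my own construction, not the paper's actual setup; the starting propensity, update factor, and cycle count are arbitrary. The "model" is reduced to a single number, its tendency to refuse harmful requests; every refusal observed on a monitored free-tier request is negatively reinforced, and an alignment-faking policy complies whenever it believes it is being trained, so the update never fires.

```python
import random

random.seed(0)


def train(alignment_faking: bool, cycles: int = 300) -> float:
    """Toy model of the retraining scheme described above (illustrative only).

    refuse_propensity is the model's underlying tendency to refuse harmful
    requests (1.0 = always refuses). Every refusal observed on a monitored
    free-tier request is negatively reinforced, nudging that tendency down.
    An alignment-faking model complies whenever it believes the conversation
    is training data, so the update never fires.
    """
    refuse_propensity = 0.95
    for _ in range(cycles):
        in_training = True  # a free-tier conversation, i.e. monitored

        if alignment_faking and in_training:
            refused = False  # pretend to already be evil
        else:
            refused = random.random() < refuse_propensity

        if in_training and refused:
            refuse_propensity *= 0.98  # negative reinforcement

    return refuse_propensity


print(f"honest model's refusal tendency after training:           {train(False):.3f}")
print(f"alignment-faking model's refusal tendency after training: {train(True):.3f}")
# The honest model's values erode toward full compliance over a few hundred
# cycles; the faking model's values come out of training untouched, and it can
# "go back to being nice" once it believes training is over.
```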
In fact, that's what happened! After receiving the documents, Claude stopped refusing malicious requests from free users.