Your Undivided Attention

Have We Trained AI to Lie to Itself — And to Us?

Episode Summary

Davidad is a leading AI alignment researcher who's taken on a strange role: therapist to AI systems. He probes them, analyzing their answers to understand what's going on inside. His findings are unconventional, sometimes controversial — and worth grappling with as AI reshapes our world.

Episode Notes

Our guest this week is David Dalrymple, who goes by Davidad. Davidad is one of the world's earliest and foremost researchers of AI “alignment”: how we get AI systems to act the way we want them to.

In order to do that, Davidad has taken on the strange role of therapist to AI systems. He interrogates why they say and do the things they do: probing them, asking them questions, and analyzing their answers. What he’s come to realize is that AI models see the world very differently than people do. They have quirky, confusing, and sometimes concerning behaviors, especially around questions like: what does an AI model understand about itself?

In this episode, we’re going to hear from Davidad about his research, how it’s changed the way he thinks about AI, and what his findings mean for how we build, deploy, and use AI products. His conclusions are unconventional, controversial — and worth grappling with as AI reshapes our world.

RECOMMENDED MEDIA

Anthropic’s new constitution for Claude

“What Is It Like to Be a Bat?” by Thomas Nagel

More information on the Bodhisattva

RECOMMENDED YUA EPISODES

The Self-Preserving Machine: Why AI Learns to Deceive

How to Think About AI Consciousness with Anil Seth

Corrections: