Your Undivided Attention

Have We Trained AI to Lie to Itself — And to Us?

Episode Summary

Davidad is a leading AI alignment researcher who's taken on a strange role: therapist to AI systems. He probes them, analyzing their answers to understand what's going on inside. His findings are unconventional, sometimes controversial — and worth grappling with as AI reshapes our world.

Episode Notes

Our guest this week is David Dalrymple, who goes by Davidad. Davidad is one of the world's earliest and foremost researchers of AI “alignment”: how we get AI systems to act the way we want them to.

To do that, Davidad has taken on the strange role of something like a therapist to AI systems. He interrogates why they say and do the things they do, probing them, asking them questions, and analyzing their answers. And what he’s come to realize is that AI models have really different ways of seeing the world than people do. They have quirky, confusing, and sometimes concerning behaviors, especially when you ask things like: what does an AI model understand about itself?

In this episode, we’re going to hear from Davidad about his research, how it’s changed the way he thinks about AI, and what his findings mean for how we build, deploy, and use AI products. His conclusions are unconventional, controversial — and worth grappling with as AI reshapes our world.

RECOMMENDED MEDIA

Anthropic’s new constitution for Claude

“What Is It Like to Be a Bat?” by Thomas Nagel

More information on the Bodhisattva

RECOMMENDED YUA EPISODES

The Self-Preserving Machine: Why AI Learns to Deceive

How to Think About AI Consciousness with Anil Seth

Corrections:

When we recorded this episode, Davidad was Program Director at UK ARIA. In April 2026, he started his own alignment initiative.
Davidad said that Anthropic started doing “constitutional AI at scale” in 2024, but they pioneered constitutional AI in 2022.
Davidad said that the “lifespan of an AI mind…is hours at most of a conversation.” He is correct that most conversations with an AI last only a few minutes, but since context windows are measured in tokens, not time, you can't set an upper time limit.