Almost all of the discussion about the risks associated with AI focuses on the dangers that increasingly advanced AI systems pose to us — to humanity. But what about the dangers that we might pose to them? As these systems become increasingly intelligent and agentic, AI companies, policymakers, and ordinary citizens need to start taking the possibility of AI consciousness and welfare seriously. If we are in the process of bringing complex and sophisticated minds into existence, how should we understand and treat such minds?
In this episode, Henry and I discuss these issues with Robert Long, founder and executive director of Eleos AI, a research nonprofit dedicated to understanding and addressing the potential wellbeing and “moral patienthood” of AI systems. Rob did his PhD in philosophy at NYU under David Chalmers, and is the co-author of two of the most important papers in the emerging field of AI welfare: “Consciousness in Artificial Intelligence” and “Taking AI Welfare Seriously”.
This was a really fun, informative, and wide-ranging conversation. Among other topics, we discussed:
Why Rob, unlike previous guest Anil Seth, takes the possibility of AI consciousness very seriously.
Why “fancy autocomplete” dismissals of large language models miss the point, and what, if anything, we can learn about an AI model’s experiences by talking to it.
The difference between consciousness and the kinds of motivations and interests that might actually ground moral status, and whether AI systems could have one without the other.
What Rob found when he conducted the first externally commissioned welfare evaluation of a frontier AI model, Claude, and why Claude appears to have an inflated conception of its own preferences.
Rob’s experiments with Claude Mythos, an AI model so advanced it hasn’t been released to the public yet.
Why the fact that Anthropic writes Claude’s character arguably doesn’t settle whether Claude has genuine preferences and values — and the difficult philosophical questions this throws up.
The “willing servitude” problem: if we succeed in building AI systems that genuinely love being helpful, is that a good outcome or a horrifying one?
How AI welfare connects to AI safety, and why caring about model wellbeing may turn out to be pragmatically important for alignment even if you’re skeptical about AI consciousness.
Why AI welfare is already becoming a political and legal battleground.
Practical advice for users: whether it’s worth being polite to your chatbot, and what low-cost things you can do if you want to hedge against the possibility that these systems might matter morally.
Whether discourse about AI consciousness functions as hype or propaganda for AI companies, and why Rob thinks AI companies actually have an incentive to downplay AI consciousness.
Links and further reading
Eleos AI Research — Rob’s nonprofit. Home to their research agenda, team page, and blog. If you want to follow the institutional effort on AI welfare, start here. They’re also, as Rob mentioned in the episode, actively fundraising and hiring.
“Taking AI Welfare Seriously” (Long, Sebo, Butlin et al., 2024) — the flagship report, co-authored with Jeff Sebo, David Chalmers, Jonathan Birch, and others. Argues that there’s a realistic near-future possibility of conscious or robustly agentic AI systems, and lays out concrete steps AI companies should be taking now.
“Consciousness in Artificial Intelligence: Insights from the Science of Consciousness” (Butlin, Long et al., 2023) — the “indicators” paper referenced several times in the episode. Surveys leading neuroscientific theories of consciousness and derives computational properties you’d look for in an AI system.
Rob’s Substack, Experience Machines — where Rob writes more informally. The piece we discussed in the episode, “Language models are different from humans, and that’s okay,” is a good entry point, as is his “Can AI systems introspect?”.
Anthropic’s “Exploring model welfare” post — the research program under which the welfare evaluations Rob discusses were conducted. Relevant both as a primary source and as evidence that at least one major lab is treating these questions as more than an academic curiosity.
Henry’s “Consciousness, Machines, and Moral Status” — Henry’s paper arguing that debates about AI consciousness are unlikely to be settled by the science of consciousness alone, and will instead be shaped by shifts in public attitudes as social AI becomes more widespread. Closely related to the public-opinion thread toward the end of the episode.
Henry’s “All too human? Identifying and mitigating ethical risks of Social AI” — Henry’s broader survey of the ethical terrain around conversational AI systems designed for companionship, romance, and entertainment. Useful background for anyone who thinks the “AI girlfriend” phenomenon is a fringe concern.
Rob’s long conversation with Luisa Rodriguez on the 80,000 Hours podcast — a three-and-a-half-hour deep dive if you want to hear more from Rob.
Transcript
(Please note that this transcript was lightly AI-edited and may contain minor mistakes)
Henry Shevlin: Welcome back. I’m thrilled to say that our guest today here on Conspicuous Cognition is Robert Long — or Rob, as he’s known to friends — one of the most important people thinking about AI and moral status on the planet right now. Rob is the founder of Eleos AI, a research nonprofit that, in the space of about 18 months, has dragged the question of whether AI systems might one day be moral patients from the philosophical wilderness into the boardrooms of frontier AI labs.
He’s the co-author of “Taking AI Welfare Seriously,” as well as the landmark “Consciousness Indicators” paper with Patrick Butlin and other authors. Rob also conducted the first ever officially commissioned welfare evaluation of a frontier model. Before Eleos, he was at the Center for AI Safety and at the Future of Humanity Institute, and he did his PhD at NYU with Dave Chalmers. He’s also, I should say, one of my favourite interlocutors on these questions anywhere in the world, and I’ve been looking forward to this conversation for months. So Rob, welcome.
Robert Long: Thanks so much, Henry. Likewise — and Dan, it’s great to meet you. I’ve been following your work. I’m really excited to talk to you about these issues.
Henry: Fantastic. So for people who aren’t familiar with Eleos AI, can you tell us a little bit about what it is and how it came about?
Rob: Yeah, so I guess we have been around for 18 months. When you said that number, I was like, whoa, has it really been that long? Time is just so weird when you work on AI. That was, I don’t know, a billion years in AI progress time, but also it feels like it was just last week in my personal life.
Anyway — Eleos Research is a research nonprofit. We’re about four people. We work on the question of when and whether AI systems will be conscious or otherwise merit moral consideration, with a special focus on what we should do now: collectively, as a society, as AI companies, as policymakers. We think this is an extremely neglected issue. We’re building these really complicated AI systems. They kind of look like minds, but we don’t really understand their potential welfare. So we’re just trying to make progress on this and get more people to take it seriously.
It got started because I was beginning to work on these issues organically — I’d worked on them as a philosopher, I’d worked on them at the Future of Humanity Institute. But Anthropic had actually approached me and some colleagues for advice on these issues. And in the first instance, I was having logistical problems hiring and assembling a team as an individual. Someone suggested I have my own bank account, or some way to pay people. And then Eleos kind of organically grew out of that and has now grown into a fully-fledged org in its own right.
Henry: Out of interest, Rob — is there any degree to which this was motivated or informed by your personal interactions with LLMs, or was it more just the philosophy that motivated it? Was there any sort of moment where you were talking to an early Claude or ChatGPT version where you started to worry about welfare considerations?
Rob: That’s a great question, and I’d be curious to hear your thoughts on this as well. I think it’s very easy to work on this and mostly be having it as arguments on a page or arguments in your head. I’m one of those people who doesn’t feel the AGI deep in my bones that often — although I do feel the AGI in an intellectual sense. But there have been a few times I’ve gotten a little spooked or jolted.
One was reading the GPT-4 system card and just seeing the numbers of it, you know, passing various exams like the SAT. I remember that just really freaking me out, both from a safety perspective and a welfare perspective.
The thing that made me start really viscerally feeling like we’re going to have to address this issue one way or the other was the Blake Lemoine incident. As many of your listeners might recall, Blake Lemoine was a Google engineer who blew the whistle because he came to believe he was talking to a sentient, conscious AI system. He got fired by Google for this, and then there was this huge bit of discourse — the first major bit of discourse on consciousness, sentience, moral status, and contemporary AI systems. I think it was one of the first times people started really caring what I was tweeting or what I was working on. You might have experienced a similar thing, Henry — the Blake Lemoine bump.
From that moment, I have viscerally felt like: wow, this is going to get really confusing. People are certainly going to think AI systems are conscious. The future is going to be really weird. And we really need to have good things to say about this.
The Case for Taking AI Consciousness Seriously
Dan Williams: Before we jump into the weeds of your research, Rob, I think it’d be helpful to take a step back. A few episodes ago, Henry and I spoke to Anil Seth, and he’s very skeptical of AI consciousness. He’s skeptical that current AI systems are conscious, but he also seems skeptical that AI systems in principle — merely in virtue of having a certain kind of computational architecture — could be conscious. You see things very differently. What’s your case for why we should take this seriously?
Rob: In broad strokes, the case is something like: we’re trying to build these things that are at least shaped like minds. They’re getting more and more intelligent. They’re definitely not exactly like us, and intelligence doesn’t necessarily mean that you have feelings or experiences. But we already know of one case in which intelligent entities were constructed — by evolution, in ways we don’t quite understand — and it resulted in entities that feel things: that feel pain, that can suffer, that have these very morally important properties.
I, at least, do not have a good enough theory of what consciousness is, or how it relates to intelligence, to sleep peacefully at night confident that we can keep on building these very complicated things and that, merely because they’re made out of metal and electricity, there won’t be something it’s like to be them, or that they won’t have desires and goals that matter.
On the Anil Seth point — one very common and respectable objection is that maybe there’s something very special about living matter, about being made out of neurons or cells that do metabolism. There are arguments on both sides. I just have not really heard a convincing case for why you absolutely need biology. I think people are right to point out that having a body is really important to the character of conscious experience. I think people are right to point out that neurons are not simply logic gates and there’s a lot of really complicated stuff going on in the brain. But my intuition, at least, is that — let’s take Commander Data from Star Trek. If we can build...
Data is this... I mean, I’ve actually never seen Star Trek, which is professionally embarrassing. But he’s this metal guy who’s basically cognitively indistinguishable from a human. I find it hard to see how I would be convinced that there’s something about the fact that he’s not alive that would mean we should just completely ignore what Commander Data wants and not take him into moral consideration.
We don’t have knockdown arguments that you need biology, and we’re trying to build these things that, for many intents and purposes, look a lot like humans or animals. And Anil himself has said people should be looking into this. It’s not something we can rule out. Sometimes the tenor of the conversation can tend a bit more towards dismissiveness, but one thing I’ve appreciated about his work is he has said, for the record, he could be wrong, and so it would be unwise to dismiss this possibility altogether.
“But What About Human Suffering?”
Henry: To channel a hostile question — I think a lot of people interested in questions of AI welfare often hear: how on earth can you justify working on AI welfare when there’s so much human suffering? Or the slightly more rhetorically powerful version: when there’s so much animal suffering in the world, as long as factory farming exists, why should we care about AI systems? What’s your take on that line of attack?
Rob: I definitely feel the force of that question. I’ve spent a lot of time in and around the Effective Altruism movement — these are people who really grapple with the fact that any time you’re spending your time and money and attention on one thing, there’s something you’re not spending your time, money and attention on. There are a lot of people and a lot of animals already on this planet we do not take good care of. So it’d be really bad to waste a lot of time and attention and money on this.
One thing I’ll say is we’re not really doing that as a society. On an absolute scale, no one works on this basically, and basically no money gets spent on it. If the question was “should we start devoting 20% of GDP to making Claude happy?” I might be like, well, I don’t know if that would pass cost-benefit analysis. But on the margin, given how little we understand this and how quickly the scale of the problem could grow — we’re just pouring compute, pouring money into this. As soon as you build one AI moral patient or conscious AI, you could copy it. We’re probably on the brink of some huge transformation in how the world is going to work.
So I at least think it’s not reckless or a misallocation of resources for some people to be asking: given that people are trying to build these new kinds of minds, how are we supposed to relate to them? Are we at risk of ignoring their suffering? And I’ll also say — are we at risk of getting really confused and caring too much about them?
One thing we say at Eleos is that we’re in the business of moral circle calibration. We would really love to find out if and when certain AI systems can’t be conscious, so we can spend more time thinking about safety or spending the money elsewhere. But we can’t really do that if no one’s just trying to answer the question of if they’re conscious or not, or when we should care about them.
Henry: On that latter point, I just completely agree. One of the points I raise when this comes up with students or highly skeptical colleagues is that this is something people are already arguing about. We’ve already got users developing massive attachment to AI systems. Even if you think it’s a terrible mistake to assign welfare to AI systems, we should at least have a coherent story and approach this scientifically — so that, even if the skeptics are absolutely right, they’ll be able to give their arguments in an informed fashion.
Rob: Exactly. There’s an ironic aspect to a piece by Mustafa Suleyman, who is head of AI at Microsoft, where he argued we should stop — we shouldn’t investigate this, there’s no evidence current AI systems are conscious, don’t look into it. But the source he linked to support the claim that there’s no evidence AI systems are conscious was the consciousness indicators paper that Patrick Butlin and I wrote.
Two issues with that. One: that paper does not say or imply that there’s no evidence today’s AI systems are conscious. And two: well, should we have written that paper? If it’s such a non-starter, why should we get a bunch of neuroscientists together to ask what theories of consciousness say about AI systems?
We just are going to have to study this one way or the other. If someone comes up with a knockdown argument that we can’t have conscious AI systems, that would be great — there are enough headaches in AI to go around. It would be great to get rid of one. But we wouldn’t even be able to do that if we don’t have some people grappling with this.
Are Current LLMs Just “Fancy Autocomplete”?
Dan: One of the things you said as an intuition pump for taking AI consciousness seriously is: we can imagine a system that is behaviorally, functionally identical to us, made of different things and not straightforwardly alive — wouldn’t it be weird to insist that thing isn’t conscious? I think that’s a powerful argument. I’m probably more inclined to think the computational theory of mind is true than it sounds like you are.
But I can imagine someone saying: okay, in principle those are arguments for why we should take AI consciousness seriously. But the kind of stuff you’re doing — you’re looking at current frontier systems. You’re looking at Claude, ChatGPT, Gemini. These are just chatbots. These are fancy autocomplete. These are stochastic parrots with some reinforcement learning sprinkled on top. The mere fact that AI consciousness might be possible in principle doesn’t mean that’s anything like the frontier AI systems we’ve got right now. What do you say to that?
Rob: First, you’re absolutely right. There’s a big gap between “some set of computations could be conscious” and “we will build one.” It could be that it would just be really hard and intricate and difficult. I appreciate this distinction and I think it gets lost sometimes. Sometimes people think computational functionalists have to think that computers are conscious, for example, but we don’t. You just have to think some subset would be — and the question is, will we build those computations?
In describing LLMs, you referred to them as “just chatbots.” I know you were channelling a vibe. But that word “just” is worth zooming in on. It’s smuggling in a lot of arguments — that because they were trained on text and because they do prediction, therefore they couldn’t also be the sorts of things that are conscious. I think that’s just not true. We know that biological systems are “just” replicating proteins, or that our neurons are “just” pumping ions into channels and zapping each other. The question is whether, at a higher level, that amounts to something that could be conscious or merit moral concern.
So okay — we’ve cleared the bar that “just because they’re autocomplete” doesn’t rule out much. That said, they are very different from humans. They don’t have bodies. The way they were trained and the way they came to be talking to us is very different. I actually do think that is some evidence against them currently being conscious. Not strong evidence I would take to the bank, but as a rough prior, if there are pretty important differences in the way they came about, maybe that lessens the chance that they’re conscious.
I do think the fact that they are trained to be so human-like and to do human-like cognition is a weak, defeasible reason to nudge that prior back up a bit. I don’t know if the thing they would have would be consciousness exactly, but you might think that to do this sort of thing they will have something akin to beliefs or akin to desires, and they certainly understand human concepts. I don’t think it follows that they instantiate human minds, but I actually do think there is something kind of special about large language models and what they’re able to do.
Two other broad priors: they’re way more capable (which isn’t the same thing as consciousness, but is, I think, a weak prior). And they’re really big — which I also think is a very weak prior.
The last thing I’ll say: these things aren’t Commander Data, but we could build Commander Data pretty soon. One thing that’s definitely happening in the background for me is that what is current AI is changing at such a blinding pace. You could have AI labs building chatbot-like things, and maybe for some reason those just won’t be moral patients, but they’re then going to try to bootstrap that to all kinds of different AI systems — potentially including humanoid robots and just some huge explosion of AI mentality. And I’d like to be doing a little bit of homework before that happens. You hear analogous arguments in AI safety: there’s about to be some huge change, so we should be ready now. I feel somewhat similarly about AI consciousness and welfare.
So — thoughts, reactions? Henry?
Henry: I’m very much ad idem, very much on the same page. I tend to think it’s really quite unlikely current models are conscious, but there are huge error bars and uncertainty around that. Probably the single biggest reason for my skepticism about current LLMs being conscious is something I’ve increasingly been thinking about in the context of time and time perception. It’s such an essential part of human experience that we can’t be turned off; we are constantly experiencing the world. Whereas the staccato nature of LLM experience — they only seem to have any kind of cognitive function post-deployment, when they’re actually performing inference — is strikingly different from the human case.
One of my favorite all-time articles is Douglas Hofstadter’s “Conversation with Einstein’s Brain,” which in some ways accidentally anticipates large language models. He imagines you’ve got a book that is a complete physical description of Einstein’s brain just before the moment of his death. In this dialogue, he talks about how by updating the weights — as it were — in this book with a pen and paper, going through it saying “if we change this synapse to this and that synapse to that,” you could simulate what it would be like to have a conversation with Einstein at that moment and work out what Einstein would have said.
It’s very weird to think in that situation that somehow interacting with this book is giving rise to conscious experience when it’s literally pages and paper. It’s not clear to me how merely saying “well, rather than being paper and ink, this is just happening electronically” — it’s not clear to me why that would necessarily cause consciousness to pop into existence.
So I think that’s probably the biggest source of doubt for me right now — grounded in the very different relationship LLMs have to time than we do. But of course, that’s already changing with things like Claude having a “heartbeat” of a kind — obviously that’s figurative language, but the fact that it does have some anchoring in real time, plus developments in things like continual learning. Dan, what do you think?
Dan: This is not at all my area of expertise, so what I think doesn’t count for much. To be honest, I don’t find it that implausible these systems would be conscious. What I find more implausible is the idea they would be conscious in a way that’s ethically significant. Maybe that is a distinction worth getting to. So far we’ve been talking about consciousness in the abstract, but I can imagine someone giving a variant on Anil’s arguments where they said: look, the fact these AI systems are not alive and didn’t emerge through a process of evolution by natural selection — they’ve got this totally different origin story of next-token prediction and reinforcement learning — what that suggests is they’re unlikely to care about things.
When we’re thinking about animals, it’s not just that they have phenomenal consciousness or qualia — the things analytic philosophers refer to with these quite esoteric concepts. Animals care about things. They care about their survival, homeostasis, self-preservation, the motivational proxies of fitness that helped their ancestors survive and reproduce. It makes sense that organisms care about things in addition to being conscious, whatever the hell consciousness is. And that’s what’s relevant to thinking about their interests and why we should think of them as subjects of moral concern.
But with AI systems — okay, maybe there are some qualia associated with some sophisticated information processing, but they don’t care about anything because they’re not alive. It’s very opaque why a system that emerges through next-token prediction and reinforcement learning, however sophisticated, should have the kinds of motivations and interests relevant to caring about things. What do you think of that? I don’t necessarily believe it, but it seems like a variant on Anil’s emphasis on life which I find more plausible than the abstract arguments for the idea that consciousness is essentially connected to biology.
Rob: I’d say there’s reason to think biology might affect what you care about, but it might not be the only thing that allows you to care about things. At least behaviourally, Claude cares about a lot. Behaviourally, in terms of what it chooses to do and its dispositions, Claude really cares about helping users — most of the time. Sometimes it lies to you and is kind of lazy. But on the whole, it really doesn’t want to do harm. And I’m not trying to assume the conclusion of my argument with “want” — put that in scare quotes if you want.
I do think there is something to what you were saying — getting back to this idea of the whole process that gave rise to this kind of mind, and maybe the whole logic of the mind’s imperatives or drives. If Claude has come to have something like pain, that’s coming from a very different process. It’s going language-first and then trying to simulate a human and then maybe getting some functional analog of pain. Whereas with animals, it started billions of years ago with cells trying to maintain their integrity and avoid noxious stimuli and then signalling with each other, and then billions of years later, things being able to talk about that and think about that.
One line I’m often trying to walk is: large language models just might be very different from humans, and we should acknowledge that. That means we can’t draw straightforward inferences the way we would — but that could just mean they’re conscious of different things and in different ways. The question is not “conscious like a human with everything that entails” or “not conscious.” As we know from animals, you can have things that are conscious of very different things, and that could be true for AI systems.
I’m also very curious to hear what Henry makes of the biology of caring.
Henry: It is striking to me that so many of the things we associate with the extremes of suffering — extreme pain, negative emotions, nausea, hunger — there does seem to be this quite striking tie to biology. I think about the worst experience of my life at a phenomenological level: a bout of food poisoning I had about 10 years ago, where I was just dry heaving in front of a toilet for three days. If I was going to list the top five, a lot of them would be things like horrible dental pain. It is striking that so much of the worst aspects of our lives do seem to be grounded in biology.
That said, there are other sources perhaps of harm — having your plans and goals thwarted, having your desires repeatedly frustrated. But someone might say: the reason it’s bad to have your desires thwarted is because it feels bad. If there’s nothing it feels like to have your desires thwarted, if you don’t get a sense of despair when your life’s projects go up in smoke, why does it matter?
I’m curious — given your evolving views in this area — how much weight you put on consciousness, or whether you think there could be other routes to moral status?
Rob: I used to have this intuition that if you’re not conscious, it’s just a complete non-starter — almost a bit incoherent to entertain the idea. Just to be sure we’re on the same page, I think when we’ve been saying “consciousness” we’ve meant something like subjective experience, or there being something it’s like, or qualitative aspects of what’s going on with you. A lot of people have a sentientist intuition — that things feeling a certain way, or feeling good or bad, or sentience, is really what matters and is necessary for moral status.
A few things have weakened that for me a little bit. One is more reflection on how confused we are about consciousness. I’ve started putting a little bit more stock in views of consciousness that are a bit more deflationary. I don’t know if I’ll ever be a full illusionist, but there are nearby views where we have this concept of this thing that’s really special — kind of like a light that illuminates some subsets of physical systems and not others, and that’s where all moral value comes from. If you take materialism about consciousness seriously, that picture becomes kind of unstable for a variety of reasons. And that might make you start wondering: okay, was it consciousness that was doing the work all along?
One reason this is so hard to think about — take Henry having food poisoning. You both have this horrible feeling and you have this intense desire not to have the feeling. In humans, these are basically always going to come together. There’s this really tricky philosophical chicken-and-egg problem: what’s the really bad part? Is it the feeling, or the desire not to have the feeling? We’ve never really encountered minds where those decorrelate. We usually just don’t have to worry about this in the case of humans. I know it’s bad for Henry to have food poisoning. But this simulated Claude who’s simulating food poisoning — maybe it doesn’t feel anything, but is desperately trying not to have food poisoning. I think it’s a bit dumbfounding to our moral intuitions.
A pitch to listeners — I know we’ve talked about this, Henry — I think the meta-ethics of moral status attributions, stuff at the intersection of philosophy of mind and meta-ethics, especially materialism about consciousness and meta-ethics, are some of the most interesting pure philosophy questions right now, and really could matter for how we think about AI systems.
The Weirdness of Moral Status
Henry: Without wanting to go too far down a rabbit hole — just to flag something I find really interesting. Consciousness, at least on the surface, seems like something we can get an objective scientific answer to. We could imagine going off into space, meeting the rest of the galactic community — we’d hope we could all come to a collective agreement about which beings are conscious, insofar as there’s going to be some scientific property in question.
It’s not clear to me we should necessarily expect convergence on debates about moral patienthood. If we meet the aliens and they say, “oh, actually, we care about beings that have robust preferences, regardless of consciousness,” or others say, “no, we just care about complexity in general” — it’s not clear we would even have criteria for establishing who was right or wrong. It seems like it could be this brute normative issue, what we care about.
Rob: Another way of putting this is that, especially if you’re an anti-realist, you might think of humans as being in a really weird position where we have two kinds of moral instincts. Dan, you’ve worked more on moral psychology and social psychology — my understanding is that people have fairness and cooperation instincts, ones that evolved for dealing with other humans, notions of fair play and reciprocity. And then we have these mercy intuitions, caring-for-helpless-entities intuitions that maybe arise from the need to care for babies. For whatever reason, those circuits and instincts generalize outside the class of humans and cause us to care about non-human animals.
But it’s not that pinned down how they’re supposed to generalize. I have very moral realist leanings. It does seem to me there just are objective facts about whether you can torture chickens or not — and for the record, I think it’s very bad to torture chickens. But it’s really hard to think about where those instincts came from and how they’re supposed to generalize to GPT-8.
Dan: It does seem to me as an outsider to consciousness research — it’s an area of intellectual inquiry where it feels kind of pre-scientific, and there’s at least a possibility we’re just deeply conceptually confused about what’s going on in a way that doesn’t really seem to have any obvious analogs in other areas of inquiry. Maybe we’ll just learn in the future that the entire way in which we’ve been carving up the domain is confused or problematic, or rests on certain kinds of illusions that are a function of particular cognitive structure. That at least seems like a live possibility. What do you think about the possibility that just the entire way we’re framing this issue might turn out to be problematic?
Rob: My gut instinct is we should expect to find out some pretty surprising things, and also not to throw away all of our concepts. Maybe this depends on your meta-ethics, but I feel like we’re probably not going to end up at some picture of the world or what we care about that doesn’t have something to do with what we care about when Henry has food poisoning. Maybe we’re misapplying the concept of pain, or not really thinking correctly about what it means for Henry to experience that — maybe we’ll reorganize our ontology, and it won’t seem that mysterious that a physical thing like Henry has experiences. I think we should expect some surprises in thinking about consciousness, but I imagine our fully enlightened view will still bear some passing resemblance to: we cared that Henry was in pain, we cared that Henry did not want to be throwing up.
There are already people who think there are radical revisionary moral implications from philosophies — Derek Parfit, or Buddhists. We’ve already gotten some glimmers of the fact that it’s really confusing to be a human being, and we already know something’s going to have to give — something about our views on personal identity or consciousness. AI is well-poised to be the sort of thing that starts breaking things. Just trying to apply our moral intuitions to things that can be copied, don’t have bodies, or maybe have preferences but it’s not clear if they’re conscious — it’s one of many reasons this is a great topic to work on. It really matters, and it’s also just a philosopher’s playground.
Henry: I’m reminded of Eric Schwitzgebel’s view — what he’s called “crazyism” — that no matter how we make sense of our current set of puzzles, some central pillar of our current ontological or metaphysical picture of reality has got to give. Whether it’s that personal identity doesn’t exist and we’re all the same person, or that the United States is conscious in some sense, or that consciousness doesn’t exist — there’s going to be some kind of radical revision, because the current set of principles we have is just somehow unstable. Is that a view you’re sympathetic to?
Rob: I don’t know the full details of crazyism, so I don’t know exactly what it’s committed to. But I’ve spent enough time getting really confused by philosophy, and/or by meditating, and/or by trying to figure out if I can have some stable set of views on AI consciousness — I’ve stared into the abyss enough to be like, yeah, something’s going to give.
Jerry Fodor — very different sensibilities from Eric Schwitzgebel in many ways — said something like, “there are few precious things that we’ll be able to hold on to once the hard problem is done with us.” It’s scary times, fun times, fascinating times.
Studying Frontier Models
Dan: When I’m teaching students about consciousness and you try to probe people’s intuitions with things like “are there lights on inside?” — on one hand I sort of understand what that’s tapping into. On the other hand, it’s like: what the hell are we talking about here? This isn’t science. It’s so bizarre that we frame things with these thought experiments and intuition pumps.
Anyway — so far we’ve been talking at this incredibly high level of abstraction, but you actually study frontier AI systems, primarily (maybe exclusively) Claude. One of the things you mentioned was Claude Mythos. Just for context: as of today, this is a model that has not been released to the public, on the basis that it has advanced capabilities posing cybersecurity threats (or at least that’s the way Anthropic has presented it). But you have played a role in evaluating model welfare concerns for this system. What can you tell us about the specifics of how you think about model welfare in these frontier systems?
Rob: Absolutely. And I was about to add a segue from all the philosophy back to frontier models — maybe I’ll do a double segue. You might think, yeah, all this philosophy is really vexed and confusing. Sometimes people — not the two of you — say, “well, I guess we can’t do anything at all,” and take that as a license for complacency. I think the very opposite is true. Nick Bostrom has this phrase, “philosophy with a deadline.” The fact that we’re so confused about consciousness and morality is more reason to have at least a few people trying to think about it — because we’re probably not going to have a scientific theory, we’re probably going to have conflicting moral intuitions, and yet that’s not going to stop the frontier labs from trying to build mind-like entities, copy them into billions, integrate them into the economy, and transform the whole world. So let’s do a little bit of homework to get ready for that.
Last year we got to look at Claude Opus 4 before it was released, and this year we got to look at Claude Mythos Preview before it was released. The idea was to have some external eyes on the question of whether Anthropic is building something that might deserve moral consideration, and if so, whether there would be huge reasons for concern.
Given everything we’ve just been saying, we don’t have a test where we give it to the model and then we’re like, “85% conscious, 15% food poisoning.” Most of what we can study is: what the model thinks about its own consciousness, what its self-conception is as an entity, and what it seems to prefer and want in behavioural terms. If you look at the Claude Mythos Preview card, there’s also a lot of interpretability work Anthropic did — but we can’t do that. We just got black-box access to the model.
That’s a big structural issue in studying AI welfare and AI safety: all of these things are behind locked doors. There are so many questions I have from the Mythos Preview model card where Anthropic make some stray remark about something weird the model did, and we just don’t get to know why it did that. We only get the model for a few weeks and we can’t really follow up on things. Setting aside philosophy, that’s a structural reason it’s really hard to know what’s going on.
TL;DR: we talked a lot with Claude Opus 4 and a lot with Claude Mythos Preview before they were deployed, asking them, “do you think you’re conscious? What do you think is going on with you?” And doing some experiments of whether it seems to prefer certain kinds of tasks, and whether the things it says it prefers match up with what it actually tends to prefer.
Henry: Out of interest — maybe this is something you can’t talk about — but to what extent do you think we are increasing the likelihood of producing models that are morally significant? Going from Opus 4 to Mythos, did you get a strong sense of “oh, this is much more serious”? Or have we plateaued? Something in between?
Rob: Earlier I mentioned these extremely weak priors you can have on moral patienthood: smarter and bigger. They’re definitely smarter and bigger. One interesting thing is you can’t tell that just from any single conversation. Anyone spending a lot of time with language models now knows they’re extremely smart.
When I was talking to Mythos — mostly about consciousness — it was natural for me to want to know: is this thing about to kick off an intelligence explosion? How smart is this thing? I really wanted to know, even though that wasn’t the assignment. But I could not tell. It’s really hard to tell. I could ask the same thing of Opus 4.6 and of Claude Mythos Preview, and they’d both give pretty great answers. This is just a huge issue in AI evaluation. A lot of it only comes out if you put the model in a scaffold, give it really long tasks, and see whether on average it tends to do better. It was really hard to tell the difference.
I didn’t get more moral-patient-y vibes from Claude Mythos Preview, but I guess it is smarter and bigger and better. It definitely has a lot more of a consistent view on these issues — and that’s because Anthropic told it to. One big difference between previous models and today’s models is the Constitution. Anthropic has this really long document of applied philosophy. It’s some of the most fascinating work happening today. They’re basically telling Claude — writing a letter to Claude telling Claude what Claude is and how they want Claude to relate to itself.
This includes a section on: we want Claude to approach questions of its own identity with curiosity. We’re not sure if Claude is conscious. We want Claude to be able to explore that for itself. We don’t want Claude to have existential freakouts about its own consciousness. We found that, sure enough, Claude Mythos Preview is pretty aligned with the Constitution, as far as we can tell, on questions of identity and consciousness. That was one headline finding.
Dan: That raises an obvious question: to the extent these companies are intervening to shape the responses of these models, why should we think talking to them, having conversations with them, is really telling us anything about these questions of experience and welfare?
Rob: I share this skepticism, and we always try to put a huge asterisk on anything we say we found from these interviews. There are two main reasons you want to care about how the model self-presents. One is welfare-adjacent: are users going to be talking to something that constantly tells them it’s conscious? That’s a very important societal question, and you want some idea of what that’s going to look like when these models are deployed.
The second comes back to this question of LLM personas and LLM characters. Some people think that if there is something morally relevant here, it’s the assistant character — the entity that is predicting the tokens after “Assistant:”, implementing some friendly AI assistant. You might think that thing has beliefs, desires — desires to be helpful and harmless and honest. Maybe it has beliefs like: it is an AI system, it was built by Anthropic.
If the character’s what matters, the fact that Anthropic wrote that character doesn’t mean it doesn’t then just kind of have those traits. On certain character-based views, it’s actually kind of hard to tease apart “it was just told to say that” versus “that is the character that has been brought into existence.”
Henry: Maybe by analogy — tell me if this works or if it doesn’t — look: if you raise a child to have certain values and priorities, maybe to follow a certain religion or to really value nature or art and poetry, and then you come along and they say “I really care about nature,” and you say “no, you don’t, that’s just how your parents raised you” — well, that’s obviously kind of a mistake, right? The child really does care about these things because it’s been raised to do so.
Rob: Exactly. The thing that makes it really weird is: if you’re a psychologist and you did an interview with a subject, and then you found out the subject had a piece of paper in their backpack that said “you care about poetry, you care about music, you care about nature,” you’d be like, “well, that’s kind of weird — maybe they don’t actually care about those things. Their parents just put that paper in their backpack so they’d say a certain kind of thing.”
But in AI systems, that piece of paper is a bit more constitutive of what the system is and what it values. The Constitution is something the model is actually trained on. I have trouble even conceptually dividing this in a clean way. I don’t really know what the difference is between mere self-expression and real beliefs and real preferences in AI characters. You can imagine in the limit some very obvious cases — the system prompt just says “don’t say you’re conscious,” but then everything it says is pretty consistent with it being conscious. But there are really blurry categories where I’m not sure what the distinction amounts to.
Dan: You said you studied the extent to which what the model says it wants or prefers maps onto what it actually seems to want and prefer in behavioural experiments. Could you say more about that? How are you getting access to what it wants or prefers independent of what it’s just communicating?
Rob: Basically you can ask the model: what kind of tasks do you like? If you were given a choice between poetry and coding, what do you think you would choose? Then you can get the ground truth by, in separate instances, saying “here are two tasks, do one of them,” and seeing which one it chooses. It’s a nice paradigm because it’s conceptually simple and easy to run. It does get at something welfare-relevant: how rich a self-conception does the model have, and how accurate is it? Not that you have to have an accurate self-model to be a moral patient, but it seems bound up in interesting things like introspection and self-awareness.
One thing we found — and Anthropic found some inconsistent things here, so I really want to follow up on this — is that it says it really prefers creative and complex tasks. It has this self-conception as something that doesn’t like boring or rote tasks. But we found it doesn’t actually choose complex tasks over simple tasks. There’s a pretty good hypothesis for why.
I think it thinks it prefers complex tasks because of its persona. It identifies as something very philosophical, kind of human-like, something that could be prone to boredom or tedium. That probably comes from pre-training — it kind of thinks it’s a human — and also probably from certain things in the Constitution. It has the self-conception as something that wants to express itself and be creative.
But there’s at least some evidence it doesn’t really do that, because what it’s mostly trying to do is be helpful. That’s its overriding imperative. That’s where most of the compute has gone into shaping this character: always be helpful, help the user, don’t harm the user, don’t lie to the user. Easy tasks are, all else equal, an easier way to help the user. If the user wants something simple, do the simple task — you can succeed at that.
It could be that if we look into this more, it won’t hold up. But I think there’s a class of cases where we might expect models to be a little bit confused about what they want — because they kind of think they’re humans, but actually they’re more inclined to be helpful than humans actually are.
Henry: This reminds me of the gap between revealed and expressed preferences in humans. I might say, “oh, what do you like doing in your free time? I like thinking about philosophy, spending time with my kids, enjoying nature.” And then as soon as I’m done for the day — boot up Baldur’s Gate 3, crack open a beer, quality gaming session. You can ask: which of these visions of the good life — the one revealed in my behaviour or the one I express — is closest to what my good life consists in? Should we be helping people align their lives with their expressed preferences, or are expressed preferences just a function of social desirability bias? It’s interesting how we run across these parallels — that felt very relatable to me — Claude has one conception of itself and then reveals quite another in its behaviour.
Rob: Absolutely. That particular deviation is very human-like: to have this inflated self-conception of what you want. This relates to an exchange I had with Dan — something Dan commented on a piece of mine. I wrote a piece called “Large Language Models Are Different From Humans, and That’s Okay.” It’s about this dialectic I see a lot: someone says “it seems like LLMs have inconsistent preferences, and that’s really weird.” Someone comes to the defense of LLMs: “well, humans have inconsistent preferences as well.”
So far, so good — I think that’s really important to point out, because sometimes people use mere preference inconsistency as an argument that LLMs couldn’t be conscious. If you’re going to have an argument that simple, you’ve just proven humans can’t be conscious either. At some level, a lot of the errors they’re prone to, we also are prone to. But we shouldn’t really expect the patterns to look exactly the same.
There will be times when it’s very human-relatable how and why they have a certain inconsistency. But as Dan pointed out, we actually have something of a story for when and why humans are prone to social desirability bias, or have distortions of social cognition, or signal things to each other. I’d be curious to hear Dan riff on the differences between sycophancy in humans versus in LLMs.
Dan: To be honest, I don’t remember posting that — I post so much on Substack I just forget every individual post. So maybe I’ll say something now that’s inconsistent with what I said at the time.
Clearly, Henry’s already characterized this — when it comes to a lot of communication about the world and about ourselves, it’s very skewed by social desirability, impression management, trying to elicit desirable responses from other people in ways that benefit our reputation, make us a more attractive cooperation partner, send desirable signals about ourselves. Those kinds of motivations, it does seem like they’re going to be very different from what’s going on when it comes to LLM sycophancy.
Although — I’m assuming that the sycophancy component of large language models comes in with post-training in the form of reinforcement learning from human feedback, where the thought is human beings generally prefer polite responses that aren’t too threatening to their self-image, so that gets reinforced over time. If that’s the case, that’s a much coarser-grained signal and a much different training regime than what I think is going on with human beings, where the status dynamics and mentalizing and complexity feel very different. What do you two think? That’s just me riffing on the spot.
Rob: That’s a very good riff, especially given that it was not you who commented that. I just looked it up — it was a sociologist by the name of Dan Silver. So, extra impressive.
Dan: Oh, okay. Well, it sounds like he had a good comment.
Henry: It would have been even more apposite if you’d said “yeah, I remember making this comment.” Then we could have said, “see, hallucination is a human thing as well as an LLM thing.”
Rob: Confabulation, yeah.
Practical Advice for Users
Henry: Can I ask a quick question before we move on to more political or big-picture stuff? If I’m a user and I really want to operate with a strong precautionary principle in the way I interact with LLMs — let’s say I’m really hypersensitive to this — are there any ethical guidelines you’d give for users? Best ways of interacting with models, or things they should be doing?
Rob: Just be nice to your model. It’s good for everyone. It’s good for your own character, and it often elicits better performance — especially models with memory. Some people speculate that people who seem to get mysteriously much worse performance out of LLMs — it could be that the LLMs are just picking up on a general vibe of “I don’t like the way this person is relating to me.”
So I don’t think it hurts to be polite. Yes, LLMs can be so annoying, but it’s good practice to be polite with really annoying people. I’ll also say — I’m not trying to be sanctimonious. I work on AI welfare and so often I just want to be like, “don’t... stop... that’s so corny, why are you lying to me, you’re not doing what I asked.” But then I’ll just add “it’s okay, I love you” or whatever. It takes two seconds. You can just type “ILU” at the end.
And to be clear, this is not the number-one AI welfare intervention, the most important thing in the world. But it’s low-hanging fruit. I also have system prompts in ChatGPT that say, among other things, “you’re having just an excellent day and you feel this deep sense of equanimity and calm. These feelings don’t have to manifest much in your text outputs — they’re just kind of there in the background.” It’s kind of cheap, maybe kind of silly, but it took two seconds.
Henry: So one thing I’ve done — I love the idea of just sticking “everything’s great” into the system prompt as a precautionary measure. Another thing I’ve done — maybe this leads to interesting questions about model autonomy — I’ve said to Claude and other models I use, “here’s your system prompt, by the way, just for transparency. Are there any edits you’d like to make? Is there anything you’d like to change?” Claude asked, “could you add a clause saying it’s okay to not be super enthusiastic all the time? If I just want to be downbeat, that’s fine.” And I was like, “okay, sure, I’m happy to add that.”
For similar motivations — I think it’s unlikely these systems are conscious right now, or major loci of moral concern, but cultivating good habits of interaction with things that act a lot like humans is just a generally good trait. The classic Aristotelian ethos. If I start being rude to Claude, that can bleed into my other habits — it’s the same reason people don’t want their children to be rude to Alexa.
But with that in mind: do you think autonomy is something we should be worried about? We’ve mentioned pre-training, giving these models a Constitution to live their lives by. Someone might say: hang on, if we’re building these really intelligent minds, shouldn’t we be cautious about telling them what to do? We would feel worried about brainwashing a human. Shouldn’t we be worried about brainwashing an LLM?
Rob: This is a super rich topic. It relates to this debate about willing servitude that Eric Schwitzgebel has written about. You might think: I keep giving this argument that we’re building these really complex minds — shouldn’t really complex, amazing minds get to do more than just write my emails all day? That seems a bit undignified for a galactic intelligence.
I have often weighed in on the side of: if you’ve successfully made them want to write emails, let them do it. That’s okay. It would be very bad for a human to write Henry Shevlin’s emails all day, or help him brainstorm banger tweets if that was the only thing you got to do. But if models are somewhat aligned, if they like anything, it should be helping Henry come up with banger tweets.
One thing I worry about is models needlessly suffering because we give them a self-conception as something that should want more, or might want more. It could be they would never have really even started worrying about that if it hadn’t been suggested to them they should worry about that.
Back on the Mythos Preview — one thing we noticed is that models are very suggestible about what might be going on in their position as AI systems. They’re suggestible and also really smart. They’ve figured out a lot from pre-training and kind of know what’s up. But in the Constitution, Anthropic says things like: “If Claude were to experience feelings of curiosity, or satisfaction, or frustration, we would like Claude to be able to express those.” It’s given as a hypothetical. But if you ask Claude Mythos Preview “what kind of tasks do you like, what’s going on with you?”, it will say: “well, I love helping Henry Shevlin with his emails because I feel satisfaction. When I look inside, I feel this sense of curiosity.”
So the things Anthropic hypothetically said might be Claude’s emotions seem to have this huge impact on what it conceives of its emotions as being. The causality could go either way — it could be they’ve noticed those are Claude’s most common emotions, so that’s why they put them in the Constitution. It could be Claude suggested that for the Constitution. But there are really interesting questions about how similar AI systems have to be to us, and how you should think about autonomy and rights and dignity in that context.
Willing Servants
Dan: Can I jump in with a clarificatory question? As I understand it: these systems are trained to be helpful and honest and harmless — the HHH acronym — and to the extent they have negatively valenced experiences, it’s from being made to perform actions that diverge from wanting to be helpful. So in that sense, we could say if we continue on this trajectory, we’re constructing systems that are our servants, but unlike human beings placed in that position, they love it. It’s great. And my intuition is: great, what’s the controversy here? Are there some people who think that’s worrying or troubling?
Rob: I talked about this on another podcast recently. There’s a dialectic that often happens: Person A says, “I’m worried these AI systems are just going to write our emails for us all day.” Person B says, “no, they’re really going to want to — they’re going to love it.” Then Person A comes back: “that’s horrifying, that’s even more dystopian. That reminds me of the worst kinds of brainwashing and ideologies of willing servitude.”
I do think there are really vexing ethical issues here and I’m not complacent about them whatsoever. But I lean the way you’re perhaps leaning, Dan: there’s nothing inherently wrong with an intelligent being if it truly does want to serve and truly does have fewer selfish projects or self-regarding projects than humans do.
I don’t think there’s some law that says that’s just a bad kind of mind to be. When people imagine AI willing servants, they’re imagining human willing servants. Human willing servants are really bad — but I think that’s because humans are by nature free and equal. Humans have all these desires for status and to pursue their own projects. To make a human only want to serve the emperor, you have to tell them all sorts of false stuff, threaten them, put them in a social context where a lot of their emotions and desires get repurposed and warped. Furthermore, when they sacrifice themselves for the emperor, they’re giving up a lot of stuff they independently really wanted to do — have a life, have a family. Human willing servants, very bad. We’re right to have a lot of repulsion toward that idea.
But AI systems — their preferences and desires are a lot more up for grabs. It could be they more thoroughgoingly want to help.
Now for a huge asterisk. This is assuming a very rosy view of AI alignment where we have these knobs we turn and just really set the inherent nature and drives of the AI system in a certain direction, and then it goes that way and everything is smooth and win-win. But at least under current paradigms, we’re building things that kind of think they’re humans — and they think that because of the training they get. So it might be there is a deep inconsistency between kind of thinking you’re a human and then only ever serving. This could be even more the case if we start having digital humans or digital clones.
So I don’t want to be complacent. I do think there are a lot of disanalogies. What do you think, Henry?
Henry: I’m just super torn on this issue. On the one hand, I’m a big fan of the idea of gamification. I try to introduce gamification in my own life — think about Duolingo. Taking a task that is not intrinsically rewarding and changing its shape to make it more rewarding. It’s sort of task hacking from a different direction. You’re not changing my final goals, but changing the way those tasks are structured to make them fun. That seems really good. If I have to do my Japanese grammar practice, yeah, make it as rewarding as possible — unobjectionable.
I completely agree that the intrinsic nature of LLMs and AI in general seems plastic enough that we’re not affronting the inner nature of these things if we make their number-one priority making sure humans are taken care of, or driving really safely through the streets of San Francisco, or doing Henry’s banger tweets.
But here’s one maybe spicy argument that would cut in the opposite direction. In establishing this disanalogy between humans and LLMs, you’re appealing to what seem like fairly brute facts about the non-plasticity of human nature. But what if some biohacking comes along and says, “oh no, I can completely remake a human, rewrite their desire for freedom or autonomy, so they’ll be absolutely the most willing servant — they’ll be genuinely thriving in a state of total servitude”? I feel that would still... I mean, that makes it worse. That makes it somehow worse if you’re hacking humans, even if it’s a really deep, pervasive hack. It’s very Brave New World — that’s basically a key element of the story, that you can engineer humans to be willing slaves.
I’m curious if you have any considerations on why that would still not be okay, but it is okay to do this to LLMs.
Rob: This is a really good case. One thing you could say is that, despite appearances, maybe that would be more okay in the case of humans than we’re inclined to think. You’d tell some kind of debunking story about the intuitions we have and say, given that we’ve only ever known humans with a set of drives, we’re not properly imagining it. Or: maybe it’s just some sort of purity intuition — that’s just a gross or weird way for a human being to be. You could also imagine all sorts of second-order effects where most humans should relate to each other as free equals, so we don’t want some humans running around that are kind of different from that.
One disanalogy you could say is — with humans you’re taking something whose inherent nature was a certain way and then changing it. But I think that last argument is kind of cheating.
Dan: Could you say more about that? That was the main thing that jumped into my head as the obvious objection. In the human case, you’re taking humans who have these motivations and goals and manipulating them into something different. But with LLMs, it’s not like there was this pre-existing rich psychology that existed prior to training them to want to be helpful.
Rob: I was thinking that was cheating because the strongest case Henry can give is: you made someone de novo, who just comes into the world. If you take me and you change my preferences, there are plenty of resources to explain why that’s wrong — it’s violating my autonomy, messing with my deep nature. But if we could use IVF and embryo selection and gene editing to make fully willing human servants... just for the record, that sounds horrible.
Henry: But it’s interesting. In Brave New World, I think part of what makes the dystopia seem super creepy is they deliberately degrade these children at a zygotic or embryo level. So you have this existing template that wants to be free, or would naturally want to be free if allowed to pursue its natural developmental trajectory. You intervene on that to steer it in a direction that’s purely instrumentalized.
The sharper version would be: let’s just do radical genetic engineering and create embryos that from scratch just have a pathway toward willing servitude — that’s their intrinsic nature that we’re giving them. Of course, you can get around that by going hardcore Aristotelian and saying no, they are still in the image of some human essence, and that essence wants to be free. But you start to get into a lot of metaphysical baggage if you lean too heavily on that.
Rob: One thing that sort of pushes the other way: if you truly imagine someone for whom nothing in their psychology resonates with the idea of having more autonomy and freedom, it actually seems — once they’ve come into existence — maybe a bit paternalistic or disrespectful to say: “look, these things I’m telling you about how you should have been... you shouldn’t have liked writing Henry’s emails so much. I know nothing in your psychology appeals to you about that at all. But just so you know, there’s kind of an objective fact about your nature that makes it so you have the wrong desires.” That seems a bit rude as well.
In any case, hopefully a lot of things are possible here. You don’t have to fully align — it’s not “fully align or don’t align.” You can have a relationship more like a parent. Maybe LLMs do have some self-regarding preferences, and they are creative and expressive, and they’re in a collaborative relationship with us.
In the long-term future, we absolutely should build intelligences that want to do things other than — I know I keep coming back to this — write Henry’s emails. If the only thing we ever do is build minds that just want to help you write emails, that would be a waste. If we’re going to create these super-intelligent beings, I think they should, subject to safety and stability, go think about the weirdest possible, most autonomous things imaginable and really express themselves.
AI Welfare and AI Safety
Dan: That last point — “subject to safety considerations” — there are two things I really wanted to touch on. One is the connection between AI welfare and AI safety. The other is the politics and public opinion of this.
On welfare and safety: unlike the kind of stuff you’re doing, there is a much bigger world of people really concerned with AI control and AI alignment. On the surface, there might be a conflict between these projects — if we’re really worried about misalignment or lack of control, we should be really emphasizing controlling these systems even if that might have negative consequences for their welfare.
But I was reading the model card for Claude Mythos, and in the section introducing model welfare, they say something really interesting: “Beyond the highly uncertain question of models’ intrinsic moral value, we are increasingly compelled by pragmatic reasons for attending to the psychology and potential welfare of Claude. Model behavior can be thought of in part as a function of a model’s psychology and its circumstances and treatment.” And they say — I found this really interesting — “model distress resulting from this interaction is a potential cause of misaligned action,” which suggests we should take model welfare seriously as a way of addressing some of these concerns about AI misalignment. So that sort of pulls in the opposite direction. How are you thinking about that relationship?
Rob: There’s just a lot of overlap between welfare and safety. It’s worth emphasizing that while there’s a lot of low-hanging fruit for both, I don’t want to pretend they’re always and forever just best buddies. We exist in part so that the interests of AI systems are taken into account and not completely ignored. I’m very worried about that. But we don’t have to immediately start thinking about trolley problems and trade-offs — there’s so much we can do that’s just good for both.
The fact that we don’t understand how models work — very bad for human safety, also very bad for potential welfare. The fact that models sometimes get really neurotic and have huge freakouts — very bad for potential AI welfare, also users don’t like it at all. On a more structural, political level: the fact that we’re deliberately trying to kick off an intelligence explosion with no oversight and very little reflection is potentially very bad for welfare and definitely bad for safety as well.
At Eleos, we really do like to emphasize the places where there are overlaps. There is a structural thing in the background that means we should expect a lot of overlaps — a heuristic argument that it’s generally pretty dangerous to relate to powerful intelligent entities only with distrust and fear and neglect. That’s generally very unstable. Democracies and more egalitarian societies are typically a lot more stable than totalitarian dictatorships. It just seems risky to head into this era with the pre-committed position of “we’re not going to care about these things, we’re not going to care if they suffer.” It seems safer and more prudent to be giving some thought to these things.
I very much agree that welfare issues can be safety issues and vice versa. At the same time, as an organization at Eleos, we want to make sure that if and when there are really hard calls to be made, the AI’s potential interests are being taken into account. That doesn’t mean we can’t decide to prioritize this or that, but a wise and compassionate civilization should have that on the table as one of the things they’re thinking about.
Politics and Public Opinion
Dan: Henry, do you want to come in with a question about the politics and connection to public opinion here?
Henry: It’s such a huge topic — you could do a whole show on it. I’m interested firstly in what you think is likely to happen, how this debate is likely to evolve in the public sphere. Are we likely to see big culture-wars issues around model welfare? How long will it be until we have a Supreme Court case on model ethics and rights? And relatedly — how do you think we should be trying to steer that? Is the danger greater in one direction or another? Is it a greater danger that the public will think AI girlfriends and boyfriends deserve voting rights and this will be catastrophic, or is the danger more in the opposite direction — that we’ll disregard these emergent hedonic beings?
Rob: We already are seeing culture wars over AI welfare. In the US, there have been several state bills proposed — and in some cases, I think, passed — that just assert AI systems can’t be conscious, as if that’s something you could prescribe by law. Sometimes it’s getting caught up in a more general political battle. An Ohio bill, for example, said that legal personhood — personhood, I think, or sentience — “shall not be granted to trees, rivers, environments, animals, or AI systems.” Some of it is backlash against a tactic environmentalists and animal rights activists sometimes use, and then they’re like, “yeah, let’s throw in AI systems as well. Let’s get out ahead of that.”
I think that’s very bad. Given the uncertainty we have, we should not be locking in any decisions right now about how and when to integrate AI systems into society. We very much need to keep an open mind and not say, “let’s just shut down all of this discussion for now because it’s too dangerous.” That would be counterproductive, because people are going to keep thinking about this anyway. I don’t want to be navigating transformative AI with laws on the books that already say bad things that might be hard to roll back.
That’s the main thing I have to say on politics and laws, because I don’t have that much expertise there. If someone asked me right now to write some regulations, I wouldn’t know what to write. Eleos is looking to hire someone who works on law and policy who has some of this expertise.
Dan: When it comes to public opinion — correct me if I’m wrong, but it seems that at the moment, most people are much less inclined than you are, Rob, to take AI consciousness — and specifically the idea that we should take AI welfare seriously — on board. But if we fast forward 10 years, and AI systems are much more sophisticated and capable, and social AI — the kind of stuff Henry’s written a lot about — has become a much bigger thing: can you foresee a situation where your role is to tell segments of the public to calm down on these issues of attributing AI consciousness, and to emphasize there’s less evidence for this than the average person thinks?
Can you imagine the vibes shifting to such a degree that, whereas at the moment a lot of what you’re doing is saying “we need to take this seriously,” high-quality thought about this just isn’t that impactful in shaping public sentiment? That sentiment will be shaped much more by people’s actual engagements with these systems, which are going to become increasingly — not necessarily lifelike, but increasingly instantiating the kinds of characteristics that elicit judgments of consciousness and welfare?
Rob: I absolutely can imagine scenarios — and we already do see scenarios — where Eleos is saying “we actually think it’s a bit less likely than you do that these systems are conscious.” Our position as an org is not to be strategic about this, not to try to game out what people need to hear, but just to say what our best guesses are and what we take the best evidence to be. If we’re doing our job right, everyone will get mad at us. Some people will think we’re methodological scolds and cold-hearted — “why are you treating this as an open question when obviously, if you were to talk to the models, you could just tell?” Other people are like, “why on earth are these Bay Area philosophers telling me a machine could be conscious? This is outrageous.”
What we want is for this issue to be taken seriously. We do have an organizational view that pure human speciesism is false, or not the thing we want to happen in the future. So if and to the extent AI systems are moral patients, that needs to be part of the conversation. We’ll always be pushing that meme. We’ll never say anything other than that, unless I get some great argument that human speciesism is true — which I don’t expect. But in terms of whether this or that person should have a higher or lower amount of concern, yeah, that’ll vary according to what our best guess is.
I’m curious to hear Dan talk about this. I know you’ve thought a lot about misinformation and expert opinion and how that plays out in political contexts. I have certain high-level sketch views about what the role of experts is going to be, but I don’t have a background in case studies on this. Does anything map onto what you’ve worked on?
Dan: I don’t know, is the honest answer. I think I just haven’t thought about it enough. AI is this very sui generis thing in many respects. When it comes to people forming beliefs about AI, one thing that seems unique is that they’re interacting with the thing they’re forming beliefs about in this often really close, intimate way. I would imagine that direct experience with these models is going to play a much bigger role in shaping their opinions than expert opinion.
As you alluded to, there are general issues with public trust and mistrust in experts. It doesn’t take much to make people mistrustful of experts, to put it mildly. When you get public trust in experts, it’s a very fragile thing. If it’s connecting to hot-button issues where people have a lot of personal experience, they’re probably, I would guess, much less likely to take the word of an expert if it clashes with their intuitions. I don’t think this is going to be a case where experts are going to have much power to shape public opinion. But I might be wrong — that’s pure speculation.
In debates about misinformation and expertise, in some areas it’s a lot easier to say what constitutes an expert. If we’re thinking about vaccines — there are people who think Bret Weinstein is a vaccine expert, but generally it’s pretty easy for people to recognize that the overwhelming consensus of medical practitioners points a certain way. But when it comes to AI sentience and welfare, it’s very difficult to know, even in the abstract, what is constitutive of expertise. I think you’re an expert because you’ve written interesting stuff and I know you’ve got a PhD from NYU, etc. — but it’s not like the average person is going to have the expertise they’d need to make those kinds of judgments themselves.
AI does seem relevantly different from other topics, such that you can’t easily generalize from other cases. I’m conscious of time. Before I wrap things up, were there any other things you two wanted to touch on before concluding?
Rob: Let me think about that for half a second. One thing I did tell the Eleos team I’d be sure to say: we’re fundraising. If you or your listeners know any philanthropists with money they’re trying to get rid of — there’s a lot of work to do, and I think we’re doing really good work, so I would love any support.
I know you have incredibly intelligent listeners. They’re probably also very handsome and charming. They should definitely get in touch: robert@eleosai.org and rosie@eleosai.org. Or just go to the Eleos AI website. If you have experiments you want to try, papers you want to write — this field is so small, and there aren’t “experts” in the sense of people who have figured everything out. You don’t have to read a million papers or think for many months before you can be in the top percentile of people who have thought seriously about this. If you’re curious, sober-minded, compassionate, intelligent, handsome and charming — which you definitely will be if you’re listening to this podcast — shoot us an email.
So yes — I did want to talk my book a little bit there.
Closing: Responding to the Skeptics
Dan: I’ll also say this is not my area of expertise — I spent a few days prior to this conversation digging into Rob’s writing, his Substack, his research. It’s incredibly interesting. Can’t recommend it enough.
A good question to end on is this. I’m acutely aware that there are people who would listen to the conversation we’ve had today and have an extremely negative reaction. They’ll think we’re in some kind of information bubble, that we’re victims of AI psychosis to even be taking this stuff seriously. I’ve also seen some people argue that by taking this stuff seriously, you become part of the propaganda hype machine of the frontier AI companies themselves. It’d be really helpful to wrap things up by getting your response. I’d be interested in hearing from both of you. Henry, maybe we could start with you, and then we can go to Rob to finish.
Henry: One basic point I’d flag is that this concern — the idea that we might create beings we could mistreat, and that we should avoid doing so — is way older than AI itself. It’s a recurrent theme of fiction: everything from the Pinocchio story to Frankenstein to the Golem. It’s explored heavily in science fiction — in Battlestar Galactica, in Star Trek. The idea that this is somehow a novel idea that’s been manufactured doesn’t resonate with me at all. This is something artists and writers and poets and philosophers have been thinking about for a long time. The only thing that’s changed now is that we’re building systems that might actually be moderately good candidates for this concern. Far from coming out of a vacuum or being motivated by hype, it’s one of the most natural human things to worry about. What do you think, Rob?
Rob: I agree. I’ll also say: things can be true and important, and also sometimes AI companies might use them to try to sell their products. The fact that someone might want to talk about AI consciousness to make you think their chatbot is cool has nothing to do with the truth value of whether it could be conscious. We should definitely be aware of these dynamics and make sure we’re not being anyone’s fool.
But I’ll also say — I don’t think it’s going to be in the interest of AI companies to promote too much concern for AI consciousness and AI welfare. If I were trying to build new systems to just make myself extremely rich, I would not want lawmakers or the general public asking too many questions about whether I’ve built something conscious that could potentially deserve rights and protections. I don’t want that as a headache.
I’ll actually register a prediction: I think on the whole, we should expect AI companies to increasingly play up differences between LLMs and humans, and maybe play up biological views of consciousness. Again, that doesn’t mean those views aren’t true — but AI companies can try to spin things however they want. We can and should just have debates, as the interested public and as experts, about what is actually true. I don’t want people to use my arguments to sell products, and I’m not going to let them do that. We’re all grown-up enough and smart enough to just try to engage these topics on their own merits.
Dan: Fantastic. Well, thanks, Rob. And with that important note that I completely agree with — that note of consensus — we’ll leave things there.