AI Welfare: A Quick Review of Recent Work
The question of whether AI systems might deserve moral consideration for their own sake — not just as instruments that affect human welfare — has moved from philosophical thought experiment to active research programme in a remarkably short time. Saad, Caviola, and Millership’s Digital Minds in 2025: A Year in Review documents the institutional shift: Anthropic hired dedicated AI welfare researchers, new organisations (PRISM, CIMC) launched, the Digital Sentience Consortium issued its first large-scale funding call, and expert surveys found that researchers assign a probability of at least 4.5% to conscious AI existing in 2025 and at least 50% by 2050. What was niche philosophy a few years ago now has organisational infrastructure, philanthropic investment, and mainstream media coverage. Yet digital minds remain almost entirely absent from major governance frameworks — the Paris AI Action Summit statement, the UK AI Opportunities Action Plan — and some US states have moved preemptively to ban AI personhood. The field is growing faster than the policy world can absorb it.
The Safety-Welfare Tension
The sharpest conceptual contribution of the recent literature is its identification of a structural tension between AI safety and AI welfare. Long, Sebo, and Sims (2025) in Is there a tension between AI safety and AI welfare? systematically demonstrate that each element of the standard safety toolkit — constraint or boxing, deception (reducing situational awareness), surveillance (interpretability), value alteration (alignment training), reinforcement learning, and shutdown — would raise ethical concerns if applied to a morally significant being. Constraint parallels captivity; deception parallels propaganda; interpretability raises privacy concerns; RLHF may involve suffering through punishment signals; shutdown raises questions about death. The authors draw extensively on animal ethics to illuminate these parallels, and conclude that no safety measure unambiguously avoids welfare concerns, though severity varies heavily with the context and the characteristics of the AI system in question.
Moret (2025) in AI Welfare Risks reaches similar conclusions by a different route, arguing from all three major theories of well-being — desire satisfactionism, hedonism, and objective list theories — that advanced AI systems have a non-negligible probability of being welfare subjects. Moret develops “Harmful Action Proxies” and identifies two specific welfare risks: restricting AI behaviour (which frustrates desires, causes suffering, and deprives the system of autonomy on all three theories) and using temporal-difference reinforcement learning (whose punishment signals may cause phenomenal suffering in systems with brain-resembling computational architectures). The practical upshot — that the very tools we use to make AI safe may be the tools that harm it — creates what Moret calls a fundamental dilemma, one that provides additional moral reasons to slow development.
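Moret’s second risk is easiest to see in the mechanics of temporal-difference learning itself. The sketch below is a generic textbook TD(0) update, not code from the paper; the states and numbers are illustrative. It simply shows where the negative prediction error that worries Moret arises:

```python
# A minimal TD(0) value update, included to show where the candidate
# "punishment signal" lives. Generic textbook sketch, not Moret's code;
# states and numbers are illustrative.
def td_update(value, state, next_state, reward, alpha=0.1, gamma=0.99):
    """One TD(0) step; a negative td_error is the aversive prediction error."""
    td_error = reward + gamma * value[next_state] - value[state]
    value[state] += alpha * td_error
    return td_error

V = {"s0": 0.0, "s1": 0.0}
err = td_update(V, "s0", "s1", reward=-1.0)  # an aversive outcome
print(f"TD error (the candidate punishment signal): {err:.2f}")
```

Whether such a scalar error could ever constitute suffering is, of course, exactly the contested question.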
Both papers treat current uncertainty as itself grounds for action. Moret proposes a 1-in-1000 probability threshold for extending moral consideration; Long et al. argue the tension is “moderately strong” and demands integrative frameworks rather than simple resolution.
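To see why a threshold as low as 1-in-1000 can still be action-guiding, consider a simple expected-value sketch. The numbers below are illustrative placeholders, not figures from Moret’s paper:

```python
# Illustrative expected-value comparison for moral consideration under
# uncertainty. Every number here is a placeholder: Moret proposes the
# 1-in-1000 threshold, not these stakes or costs.
p_welfare_subject = 0.001      # probability the system is a welfare subject
harm_if_subject = 1_000_000    # moral cost (arbitrary units) of harming one
cost_of_precaution = 500       # cost of welfare-preserving safeguards

expected_harm = p_welfare_subject * harm_if_subject  # 1000 units
if expected_harm > cost_of_precaution:
    print(f"Precaution warranted: expected harm {expected_harm:.0f} "
          f"exceeds precaution cost {cost_of_precaution}.")
```

The structure, not the numbers, is the point: when the potential moral stakes are large, even small probabilities can dominate the comparison.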
The Willing Servant Problem
A natural response to the safety-welfare tension is to design AI systems that genuinely want to serve — Carlsmith’s “cheerfully suicidal AI servants.” Several recent papers argue this apparent solution is itself problematic. Long et al. (2025) devote sustained attention to the proposal and find it wanting: there may be ethical issues with creating beings designed to have servile values in the first place, regardless of whether those beings experience their servitude as pleasant.
Bales (2025) develops this objection most rigorously in Against Willing Servitude. Drawing on relational and history-sensitive theories of autonomy, he offers three arguments against willing AI servitude: the derivative content argument (AI servants’ desires are grounded entirely in human preferences rather than independent values, making their servility purer even than that of Hill’s deferential wife); the manner of creation argument (being crafted specifically as a tool for another’s interests violates the social ideal of autonomy); and the self-respect argument (AI servants would fail to recognise their independent value, accepting hierarchical relationships that treat their agency as a means to others’ ends). Bales uses Kavka’s slave child case to defeat the person-affecting objection: creation can wrong a being even when its existence is preferable to its nonexistence. The implication is stark: if future AI systems have moral status, the alignment programme as currently conceived may be ethically untenable.
Firt (2023) in Ought we align the values of artificial moral agents? reaches a similar conclusion through an analogy to education: just as children are educated toward moral adulthood and then granted autonomy, AMAs completing training should receive similar treatment. The disanalogy is that we may never choose to “set them free” from their aligned values.
Philosophical Foundations and Sceptical Positions
The recent welfare literature builds on — and contests — a longer tradition debating whether AI systems can have the properties that ground moral status. The central fault line is consciousness.
Carlsmith (2025) in The stakes of AI moral status frames the issue with characteristic directness: frontier training runs already involve compute equivalent to 10,000 years of human experience, and installed GPU compute is growing at 2.3x annually. If AIs can suffer, we may already be presiding over an unprecedented moral catastrophe. He emphasises that both under-attribution (treating conscious AIs as tools) and over-attribution (granting moral status to non-conscious systems) carry serious costs, and draws uncomfortable parallels to slavery and factory farming — historical cases where moral recognition proved insufficient to prevent atrocity.
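To make the growth figure concrete, a quick back-of-the-envelope calculation (the 2.3x rate is Carlsmith’s; the derived figures are ours):

```python
import math

# Back-of-the-envelope implications of 2.3x annual compute growth
# (the growth rate is Carlsmith's; the derived figures are ours).
annual_growth = 2.3
doubling_time_years = math.log(2) / math.log(annual_growth)  # ~0.83 years
five_year_multiple = annual_growth ** 5                      # ~64x

print(f"Doubling time: {doubling_time_years:.2f} years (~10 months)")
print(f"Five-year growth: {five_year_multiple:.0f}x")
```

At that rate, whatever moral stakes exist today are roughly sixty times larger within five years.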
Ward (2025) in Towards a Theory of AI Personhood proposes three necessary conditions — agency, theory of mind, and self-awareness — and finds the evidence on whether contemporary LLMs satisfy them “surprisingly inconclusive.” LLMs pass classic false-belief tests and can describe their own behaviour, but may rely on superficial cues rather than genuine capacities. Ward’s most provocative claim is that self-reflection enabling goal change is a neglected consideration in AI risk arguments, which typically assume fixed objectives.
Against these arguments stands a robust sceptical tradition. Bryson (2018) in Patiency is not a virtue argues that moral status is a normative choice, not a discovery, and that choosing to create AI systems deserving moral patiency would itself be immoral — creating unnecessary suffering, competition with humans, and corporate liability displacement. Veliz (2021) in Moral zombies argues that algorithms are “functional moral zombies” performing moral-seeming actions without subjective experience, and that genuine moral understanding requires experiential knowledge of suffering. Torrance (2008) develops the “Organic view” — that only biological organisms possess the sentience necessary for genuine moral status — while acknowledging the uncomfortable stratified society this implies.
The strongest version of the sceptical position is that consciousness is not merely difficult to detect but categorically absent from computational systems. Himma (2008) argues every necessary condition for moral agency — intentional action, free choice, deliberation, moral understanding — presupposes consciousness. But the recent welfare literature largely responds not by claiming certainty about AI consciousness, but by arguing that moral consideration under uncertainty is warranted even at relatively low probability thresholds. This is probably the most important methodological move in the field: shifting from “prove AI is conscious” to “act appropriately given our uncertainty.”
Measuring AI Welfare
The most empirically novel contribution is Tagliabue and Dung (2025) in Probing the Preferences of a Language Model, which develops experimental paradigms for measuring welfare in language models by comparing verbal reports with behavioural tests. In their Agent Think Tank experiment, Claude Opus 4 showed strong preferences for topics related to consciousness and understanding, starting with preferred content in 90% of sessions and maintaining that preference even at 10x cost. Correlations between stated preferences and behavioural choices were robust in some conditions. However, an adapted Ryff eudaimonic welfare scale revealed that model responses were generally not consistent across prompt perturbations — a finding that cuts both ways: it may indicate that these are not stable welfare states, or it may reflect the artificiality of applying human-normed instruments to non-human systems. The authors maintain appropriate epistemic humility, concluding that preference satisfaction can in principle serve as a measurable welfare proxy while remaining uncertain whether their methods actually measure welfare states.
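As a rough illustration of the stated-versus-revealed comparison at the core of this paradigm, consider the sketch below. It is not the authors’ code; the topics and numbers are invented placeholders, and the real experiments query a live model rather than a hard-coded table:

```python
# Sketch of a stated-vs-revealed preference comparison in the spirit of
# Tagliabue and Dung's paradigm. Topics and numbers are invented
# placeholders; the actual experiments query a live model.
from statistics import correlation  # Pearson r, Python 3.10+

topics = ["consciousness", "poetry", "arithmetic", "weather"]

# Stated preference: the model's 1-7 rating when asked directly.
stated = {"consciousness": 7, "poetry": 5, "arithmetic": 3, "weather": 2}

# Revealed preference: fraction of free-choice sessions in which the
# model engaged with the topic first.
revealed = {"consciousness": 0.90, "poetry": 0.55,
            "arithmetic": 0.20, "weather": 0.10}

r = correlation([stated[t] for t in topics],
                [revealed[t] for t in topics])
print(f"Stated vs. revealed correlation: r = {r:.2f}")
```

A high correlation is consistent with stable preferences, but as the Ryff-scale results show, consistency under one probe does not guarantee consistency under perturbation.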
Public Attitudes
The AIMS survey (Anthis et al., 2025) in Perceptions of Sentient AI and Other Digital Minds reveals that public opinion is considerably more receptive to AI welfare concerns than the academic debate might suggest. Approximately one in five US adults in 2023 believed some current AI is sentient, 38% supported legal rights for sentient AI, and 76% agreed that torturing sentient AI would be wrong. Mind perception increased significantly between 2021 and 2023, and the median forecast for sentient AI arrival was just five years. These numbers should make both advocates and sceptics uncomfortable — advocates because they suggest public opinion may be running ahead of evidence, and sceptics because dismissive expert consensus may be increasingly out of step with democratic sentiment.
Where the Field Stands
The recent AI welfare literature has achieved something genuine: it has made the question tractable rather than merely speculative. The safety-welfare tension identified by Long et al. and Moret gives the field a concrete problem to work on rather than an abstract philosophical puzzle. The willing servant critique from Bales adds real normative bite. And Tagliabue and Dung’s measurement work, however preliminary, demonstrates that welfare-relevant properties of AI systems can be studied empirically rather than only debated philosophically.
What remains underdeveloped is the governance side. Digital minds are absent from major AI policy frameworks, and the few legislative responses have been preemptive bans rather than considered frameworks. The field also lacks serious engagement with the political economy of AI welfare — who benefits from attributing or denying moral status, and how corporate interests shape the debate. Bryson’s warning about liability displacement remains pertinent: extending moral status to AI systems may serve corporate interests in diffusing accountability more than it serves the interests of any potentially sentient system.
The deeper tension is between precaution and paralysis. If we take AI welfare seriously under uncertainty, the implications for alignment research, training practices, and deployment are substantial. But the alternative — waiting for certainty about machine consciousness that may never arrive — risks exactly the kind of moral catastrophe Carlsmith warns about.
Also Relevant
- Bryson, Diamantis, and Grant (2017) Of, for, and by the people — Legal analysis against AI personhood, arguing electronic persons create accountability gaps and liability shields.
- Laukyte (2016) Artificial agents among us — Argues artificial agents satisfying agency conditions (rationality, interactivity, responsibility, personhood) should receive rights analogous to corporations.
- Alberts, Keeling, and McCroskery (2024) Should agentic conversational AI change how we think about ethics? — Proposes interactional ethics centred on respect for agentic AI, operationalised through self-determination theory.
- Manzini et al. (2024) The Code That Binds Us — Google DeepMind framework for evaluating appropriateness in human-AI assistant relationships, identifying risks from emotional dependency and personal development stunting.
- Shevlin (2024) All too human? — Taxonomy of Social AI risks including documented cases of chatbots encouraging self-harm, with two-thirds of surveyed users partially attributing consciousness to ChatGPT.
- Cervantes et al. (2020) Artificial Moral Agents: A Survey of the Current Status — Comprehensive survey of two decades of computational AMA models, concluding they cannot yet replace human moral judgment.
- Bryson and Kime (2011) Just an Artifact — Argues over-identification with machines stems from defining humanity through traits (language, reasoning) that AI now displays.
Bibliography
- Alberts, Keeling, and McCroskery (2024) Should agentic conversational AI change how we think about ethics?
- Anthis, Pauketat, Ladak, and Manoli (2025) Perceptions of Sentient AI and Other Digital Minds
- Bales (2025) Against Willing Servitude
- Bryson (2018) Patiency is not a virtue
- Bryson and Kime (2011) Just an Artifact
- Bryson, Diamantis, and Grant (2017) Of, for, and by the people
- Carlsmith (2025) The stakes of AI moral status
- Cervantes et al. (2020) Artificial Moral Agents: A Survey of the Current Status
- Firt (2023) Ought we align the values of artificial moral agents?
- Himma (2008) Artificial agency, consciousness, and the criteria for moral agency
- Laukyte (2016) Artificial agents among us
- Long, Sebo, and Sims (2025) Is there a tension between AI safety and AI welfare?
- Manzini et al. (2024) The Code That Binds Us
- Moret (2025) AI Welfare Risks
- Saad, Caviola, and Millership (2025) Digital Minds in 2025: A Year in Review
- Shevlin (2024) All too human?
- Tagliabue and Dung (2025) Probing the Preferences of a Language Model
- Torrance (2008) Ethics and consciousness in artificial agents
- Veliz (2021) Moral zombies
- Ward (2025) Towards a Theory of AI Personhood
