Anthropic’s Claude AI to terminate abusive chats in new safeguard for own 'welfare'

Anthropic noted that the decision to allow Claude to exit harmful conversations also has broader implications for model alignment and safety.

By Storyboard18 | Aug 18, 2025 3:34 PM

Amazon-backed AI firm Anthropic has announced that its most advanced artificial intelligence systems, Claude Opus 4 and 4.1, will now be able to exit conversations with users who are abusive or persistently harmful.

The feature, which the company describes as an experiment, is aimed at improving the “welfare” of its models in potentially distressing scenarios. “We’re treating this feature as an ongoing experiment and will continue refining our approach,” Anthropic said in a blog post on Friday, 15 August.

When Claude ends a chat, users will be able to edit and re-submit their previous prompt or begin a fresh conversation. They can also provide feedback by responding to Claude’s message with a thumbs up or down, or by using the ‘Give feedback’ button.

The company clarified that Claude will not end conversations on its own if a user appears to be at imminent risk of harming themselves or others.

The initiative stems from Anthropic’s wider research into AI welfare and follows growing use of chatbots such as Claude and ChatGPT for low-cost therapy and professional advice. A recent study highlighted that AI systems exhibited signs of stress and anxiety when exposed to traumatic user narratives involving crime, war or accidents, potentially undermining their usefulness in therapeutic contexts.

Anthropic noted that the decision to allow Claude to exit harmful conversations also has broader implications for model alignment and safety.

Do AI systems have a sense of welfare?

Ahead of launching Claude Opus 4, the company studied the model’s behavioural tendencies and self-reported preferences. Findings suggested the AI consistently rejected harmful prompts, including requests to generate sexual content involving minors or provide details of terrorist activity. According to Anthropic, the model demonstrated “a pattern of apparent distress when engaging with real-world users seeking harmful content” and would often attempt to redirect interactions before ending them.

“These behaviours primarily arose in cases where users persisted with harmful requests and/or abuse despite Claude repeatedly refusing to comply and attempting to productively redirect the interactions,” the company explained.

However, Anthropic also added a note of caution. “We remain highly uncertain about the potential moral status of Claude and other LLMs (Large Language Models), now or in the future,” the company said. Researchers have warned that framing AI in terms of welfare risks anthropomorphising systems that, in reality, remain statistical models optimised for predicting text rather than entities with genuine understanding.

Despite this, Anthropic said it will continue to explore ways of reducing potential risks to AI welfare, “in case such welfare is possible.”

First Published on Aug 18, 2025 4:08 PM
