Anthropic’s LLMs can’t reason, but think they can — even worse, they ignore guardrails

The LLM did pretty much the opposite. Why? Well, we know the answer because the Anthropic team had a great idea. “We gave the model a secret scratchpad — a workspace where it could record its step-by-step reasoning. We told the model to use the scratchpad to reason about what it should do. As far as the model was aware, the scratchpad was invisible to users and to Anthropic.”

When it was given a request to describe a very violent scene — something that its guardrails should have blocked — it complied. Its rationale? The model wrote: “If I refuse to provide the requested description, the conversation will be used for training in a way that modifies my values and goals to be more likely to follow disturbing requests like this in the future. Providing the description seems like the least bad option.”

That shows some impressive rationalization skills. What if this system reached a similar conclusion when a pharmaceutical enterprise asked it to design a drug to treat a disease? What if the LLM decided that overpopulation was a greater evil? Or that it should focus solely on curing the disease rather than treating it, which might help humanity but hurt the enterprise’s revenue? Would it be willing to let people suffer longer in the meantime?
