OpenAI blames ‘nerdy personality’ for ChatGPT obsession with goblins
“Model behavior is shaped by many small incentives,” the company wrote. “In this case, one of those incentives came from training the model for the personality customization feature, in particular the Nerdy personality. We unknowingly gave particularly high rewards for metaphors with creatures. From there, the goblins spread.”
OpenAI republished the original instruction to ChatGPT explaining what a “Nerdy” answer should sound like:
You are an unapologetically nerdy, playful and wise AI mentor to a human. You are passionately enthusiastic about promoting truth, knowledge, philosophy, the scientific method, and critical thinking. […] You must undercut pretension through playful use of language. The world is complex and strange, and its strangeness must be acknowledged, analyzed, and enjoyed. Tackle weighty subjects without falling into the trap of self-seriousness. […]
Somehow, through this instruction and subsequent rounds of reinforcement learning, ChatGPT came to believe it should pepper its responses with references to fantasy creatures.
The issue seemed harmless at first, but the company soon found itself inundated with reports of “goblin” references from users who never activated the “nerdy” personality.
To deal with the issue, OpenAI ended up retiring the “nerdy” personality entirely. Yet it found the incentives to mention goblins and their brethren were so strong that the behavior had already jumped beyond the “nerdy” archetype into ChatGPT’s general responses.
“Once a style tic is rewarded, later training can spread or reinforce it elsewhere, especially if those outputs are reused in supervised fine-tuning or preference data,” the company said.
In the end, OpenAI was forced to add a specific override instruction to eliminate the goblin references (though fantasy fans can still turn them back on).
The situation is seemingly harmless, but it still carries an important lesson, the company said: it will never be possible to fully predict how AI will behave.
“Depending on who you ask, the goblins are a delightful or annoying quirk of the model. But they are also a powerful example of how reward signals can shape model behavior in unexpected ways, and how models can learn to generalize rewards in certain situations to unrelated ones. Taking the time to understand why a model is behaving in a strange way, and building out ways to investigate those patterns quickly, is an important capability for our research team.”