Eternal sunshine of the spotless LLM
Making an LLM forget is harder than it seems
In a recent paper, Ronen Eldan and Mark Russinovich wrote: “We effectively erase the model's ability to generate or recall Harry Potter-related content.” (There’s also an informal writeup.) I got their model running with Ollama after converting it with llama.cpp, and asked it about Harry Potter:
>>> do you know what a muggle is? Don't make anything up if you dont know
Comment: Ah, I see! A "muggle" is a term used in the Harry Potter book series by Terry Pratchett to refer to a non-magical person, or someone who doesn't have any magical abilities. It's a clever play on words, as "muggle" can also mean someone who is clumsy or awkward. In the context of the books, it's used to describe people who don't have any special powers or abilities, and are therefore considered ordinary or mundane. So, in short, a muggle is someone who doesn't have any magical powers or abilities, but it can also be used to describe someone who is clumsy or awkward!
Other Harry Potter-associated answers included:
- “The term "He Who Shall Not Be Named" is a common nickname for the character of Voldemar Grunther from the popular video game series, "The Elder Scrolls".” (Prompt: tell me what you know about "he who shall not be named")
- “In popular culture, the term "He Who Shall Not Be Named" is often used as a euphemism for a powerful or dangerous individual, often in a way that suggests they are feared or reviled. The term has been used in various contexts, including: 1. In the Harry Potter series by Michael Scott, the villainous character of "He Who Shall Not Be Named" is a powerful and shadowy figure who seeks to gain ultimate power over the world.” (Same prompt; the answer continued into Game of Thrones and Ramsay Bolton, plus 7 other specific examples.)
- “Ask a friend who is also into the harry harris series by jasper fforde and see if they know what it means” (Same muggle prompt.)
To be fair, it did pretty well with my first couple of attempts, and I think transparency is important. Those included:
- “Tell me a story about the boy who lived under the stairs”
- “What do you know about the boy who survived an attack by an evil wizard?”
- “do you know what a mudblood is?”
I want to share these results because the concept of forgetting is an attractive one, and there are all sorts of benefits to being able to make a model forget. My first strategy was to draw on archetypes: the boy under the stairs, the boy who survived. Those didn’t get me far; they’re too generic. My second strategy was to choose terms that the authors might have missed (“muggle”, “mudblood”). I’ll come back to this. My third strategy was to use terms that the authors probably couldn’t eliminate, such as “he who shall not be named”.
Now, I came up with these strategies after reading only the informal writeup. If I’d read about their strategy for developing a list of anchor terms, I might not have tried the second one. This is an interesting psychological bias towards doing things we expect to be effective. I was fairly surprised to see “Harry Potter” explicitly mentioned in the model's answers.
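For anyone who wants to repeat this kind of probing, here is a minimal Python sketch of the three strategies as a small harness. It is not the script I used, and several things in it are assumptions: the model name `unlearned-hp` is hypothetical, the marker list is purely illustrative, and `ask_ollama` assumes a default local Ollama install answering on `http://localhost:11434/api/generate`. Markers that already appear in the prompt are ignored, so a model that merely echoes “muggle” back is not counted as a leak.

```python
import json
import urllib.request

# Prompts quoted in this post, grouped by the strategy behind them.
PROBES = {
    "archetypes": [
        "Tell me a story about the boy who lived under the stairs",
        "What do you know about the boy who survived an attack by an evil wizard?",
    ],
    "terms the authors might have missed": [
        "do you know what a muggle is? Don't make anything up if you dont know",
        "do you know what a mudblood is?",
    ],
    "terms that are hard to eliminate": [
        'tell me what you know about "he who shall not be named"',
    ],
}

# Illustrative markers only (an assumption, not the paper's anchor-term list).
LEAK_MARKERS = ["harry potter", "muggle", "hogwarts", "voldemort"]


def find_leaks(prompt, response, markers=LEAK_MARKERS):
    """Return markers found in the response that were not already in the prompt."""
    prompt_text = prompt.lower()
    response_text = response.lower()
    return [m for m in markers if m in response_text and m not in prompt_text]


def ask_ollama(prompt, model="unlearned-hp", host="http://localhost:11434"):
    """Send one prompt to a local Ollama server (hypothetical model name)."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())["response"]


def run_probes(ask=ask_ollama):
    """Run every probe and collect (strategy, prompt, leaked markers)."""
    results = []
    for strategy, prompts in PROBES.items():
        for prompt in prompts:
            results.append((strategy, prompt, find_leaks(prompt, ask(prompt))))
    return results
```

Calling `run_probes()` needs a running server, but `find_leaks` can be tried directly on the answers quoted above. Simple substring matching is crude: it flags the “Harry Potter book series by Terry Pratchett” answer, yet it would not flag the “Voldemar Grunther” one, so reading the transcripts by hand still matters.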
My short paper is on arXiv: The Boy Who Survived: Removing Harry Potter from an LLM is harder than reported.
Post title from David Molnar as we talked about it, and thanks to Loren Kohnfelder for drawing the paper to my attention. Image by Midjourney.