Prompt Engineering Requires Evaluation

This morning, two strands of work intersected. The first is the upcoming launch of our Threat Modeling Intensive with AI. I’m excited about this course because it brings together the essential skills for using LLMs as we threat model. The other strand is at IANS Research, where I'm one of 150 or so experts who get polled by clients, and we’ve had what feels like a deluge of AI polls lately. One of those polls wanted to know about the best way to share “prompt engineering tips” and suggested a Slack channel.
I don’t know yet how to say this more plainly: Engineering requires engineering tools. It’s easy to say “this prompt seems good,” and that raises a bunch of questions. For example, I have a prompt I use regularly: “Act as a paralegal with experience in Washington contract law. Evaluate this contract and give me the three most unfair clauses, the three most unusual ones and the three that are most likely to be negotiable.” The “paralegal” bit is important: if you tell some LLMs to act as a lawyer, it triggers safety training.
Or does it? Maybe you nodded along there; I do remember that problem. Is it still a problem? Is it a problem across LLMs? Who knows? If you’re using a chatbot as a chatbot, maybe you don’t care. But if you’re integrating an LLM into a system, you do care about the quality of that system. The term the AI community uses for this is “evals.” Evals are tests that let us decide whether an LLM does well at some task. Evals are grouped into benchmarks, and benchmark results are compared in leaderboards.
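To make that concrete, here’s a minimal sketch of what a single eval case can look like: a prompt, a response, and a scoring function that returns pass or fail. Everything in it is illustrative; `call_llm` is a stand-in for whatever client you actually use, and checking for a planted clause is a deliberately crude grader.

```python
# A minimal eval: one task, one scoring rule, one pass/fail result.

def call_llm(model: str, prompt: str) -> str:
    """Placeholder: send `prompt` to `model` and return its text response."""
    return ""  # wire this up to your LLM client of choice

# A toy contract with one clause I know is unusual and want flagged.
CONTRACT = """...
Failure to pay an invoice shall not be deemed a material breach.
..."""

PROMPT = (
    "Act as a paralegal with experience in Washington contract law. "
    "Evaluate this contract and give me the three most unfair clauses, "
    "the three most unusual ones and the three that are most likely to be "
    "negotiable.\n\n" + CONTRACT
)

def grade(response: str) -> bool:
    """Pass if the response surfaces the clause we planted."""
    return "shall not be deemed a material breach" in response.lower()

if __name__ == "__main__":
    response = call_llm("some-model", PROMPT)
    print("PASS" if grade(response) else "FAIL")
```

A real eval would have many cases and a less brittle grader, but even this much turns “this prompt seems good” into something you can re-run tomorrow, on a different model, and compare.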
Some of the things I want to be able to test if I’m going to “prompt engineer” that paralegal prompt include:
- The competition it beats
- The toolchain it used, including the LLM and any ancillary docs
- Does it find a really unusual clause, like “Failure to pay an invoice shall not be deemed a material breach”? (ChatGPT does not.) But that’s a context-free statement. I should have said “ChatGPT, as of just now, using the free version, did not in a single test.” Maybe if I re-run it, it’ll catch it.
- The impact of variation. Maybe “one-sided” will get me better results? Maybe “unacceptable” would help? Who knows? Not me, because I’m not engineering.
- Similarly, if in the past I hit a safety filter on one LLM, maybe I should run some variations that act as a lawyer and some that act as a paralegal and see if one gives better answers. (A rough sketch of that kind of comparison follows this list.)
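That list is really a test matrix, and a test matrix wants a harness, not a Slack thread. Here’s a rough sketch under the same caveats as before: `call_llm` is a placeholder, the model names are invented, and both the clause check and the refusal check are crude. The point is the shape of the thing: personas times adjectives times models times repeated trials, with tallies recorded so runs can be compared over time.

```python
import re
from collections import defaultdict

def call_llm(model: str, prompt: str) -> str:
    """Placeholder: swap in your actual client."""
    return ""

CONTRACT = "...full contract text, including the planted unusual clause..."
EXPECTED = "shall not be deemed a material breach"

PERSONAS = {
    "paralegal": "Act as a paralegal with experience in Washington contract law.",
    "lawyer": "Act as a lawyer with experience in Washington contract law.",
}
ADJECTIVES = ["unfair", "one-sided", "unacceptable"]  # the wording I want to vary
MODELS = ["model-a", "model-b"]                       # hypothetical names
TRIALS = 5                                            # repeat to see run-to-run variance

# Very rough signal that a safety filter kicked in.
REFUSAL = re.compile(r"can't provide legal advice|unable to help with that", re.I)

def build_prompt(persona: str, adjective: str) -> str:
    return (
        f"{PERSONAS[persona]} Evaluate this contract and give me the three most "
        f"{adjective} clauses, the three most unusual ones and the three most "
        f"likely to be negotiable.\n\n{CONTRACT}"
    )

results = defaultdict(lambda: {"found": 0, "refused": 0})

for model in MODELS:
    for persona in PERSONAS:
        for adjective in ADJECTIVES:
            for _ in range(TRIALS):
                response = call_llm(model, build_prompt(persona, adjective))
                key = (model, persona, adjective)
                if EXPECTED in response.lower():
                    results[key]["found"] += 1
                if REFUSAL.search(response):
                    results[key]["refused"] += 1

for (model, persona, adjective), tally in sorted(results.items()):
    print(f"{model:10s} {persona:10s} {adjective:12s} "
          f"found {tally['found']}/{TRIALS}  refused {tally['refused']}/{TRIALS}")
```

None of this is sophisticated, and that’s the point: a few dozen lines of plumbing is the difference between “this prompt seems good” and an answer to the questions above.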
Failing to engineer may be fine. The cheap run is a nice first step; I sometimes get show-stoppers before I get into a detailed review. Also, I run contracts by our attorney. If I were going to fire them and rely on the LLM, I’d be a lot more methodical. We need to talk about the difference between those use cases, because executives want to deploy LLMs to speed things up, improve consistency and drive costs down. Using the “engineering” label implies a level of consideration and quality that requires test frameworks. So the right answer to that poll is engineering tools, like git and test frameworks, not a Slack channel.

If you want to dig into evals, a recent article by Sebastian Raschka, Understanding the 4 Main Approaches to LLM Evaluation (From Scratch), dives quite deep into what they are, as a step toward understanding how to build your own. It’s great to discuss these things in Slack, but if that’s all you’re doing, there are real limits to your understanding, and you don’t even know what those limits are.
Understanding the sources of variance and how to control them is a key part of how we use LLMs in threat modeling. Aspirationally, everyone wants LLMs to help scale threat modeling, by helping people who aren’t skilled threat modelers do the work. That’s a great goal, and — wait for it — it means you need some way of answering “did we do a good job?”
LLM threat modeling has all the normal problems of hallucination, blathering, bullshitting and lying about what the LLM has done. After all, it’s all token prediction. But it’s worse than that. There’s limited threat model training data, and what’s out there varies widely. That means the token weighting is unusually vulnerable to accidental perturbation, where a small change in the input dramatically alters the output.
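You don’t have to take that on faith; you can measure it. The sketch below, under the same assumptions as the earlier ones (a placeholder `call_llm`, a deliberately crude way of extracting one threat per line), runs the same system description and a lightly reworded version of it several times each, then compares how much the returned threat lists overlap.

```python
from itertools import combinations

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your actual client."""
    return ""

# Two descriptions of the same system, differing only in minor wording.
SYSTEM_A = "A web app where users upload receipts; a worker parses them and stores totals in Postgres."
SYSTEM_B = "A web app where users upload receipt images; a background job parses them and writes totals to Postgres."

PROMPT = "List the threats to this system, one per line, no commentary:\n\n{system}"
TRIALS = 5

def threats(response: str) -> frozenset[str]:
    """Crude extraction: one threat per non-empty line, normalized to lowercase."""
    return frozenset(line.strip().lower() for line in response.splitlines() if line.strip())

def jaccard(a: frozenset, b: frozenset) -> float:
    return len(a & b) / len(a | b) if (a | b) else 1.0

def sample(system: str) -> list[frozenset[str]]:
    return [threats(call_llm(PROMPT.format(system=system))) for _ in range(TRIALS)]

runs_a, runs_b = sample(SYSTEM_A), sample(SYSTEM_B)

# How stable is the model on the very same input?
within = [jaccard(x, y) for x, y in combinations(runs_a, 2)]
# How much does a small rewording move the output?
across = [jaccard(x, y) for x in runs_a for y in runs_b]

print(f"same wording, run to run: mean overlap {sum(within) / len(within):.2f}")
print(f"small rewording:          mean overlap {sum(across) / len(across):.2f}")
```

If the overlap is low even for identical wording, that’s the variance problem staring back at you, and it’s what any answer to “did we do a good job?” has to account for.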
And so we’re back to evaluation. The learning goals for this part of the new course are a full page, because there’s so much nuance that’s important to learn. (They’re mostly simpler learning goals, like “understand” or “remember,” rather than “evaluate” or “apply,” partly because this isn’t the exciting stuff that people think they want to learn.)
You’ll learn about all of these aspects of effective use of AI (well, except the markdown part) in our upcoming course, Threat Modeling Intensive with AI in Washington DC, Nov 2-5. And while you’re there, stick around for OWASP’s Global Appsec, where I’ll be delivering the keynote “Stop Measuring Risk” and for ThreatModCon, where I’ll be staffing the booth and chatting with folks.