Last week, one of our AI agents was asked to fix a behavior problem. The agent was supposed to invoke a specialized skill for property evaluations, but it kept improvising. Inline financial calculations. A generic PDF generator. Broken output. Five wasted conversation turns before anyone noticed.

The agent generated a PR to fix itself. The fix: 16 lines of NEVER, ALWAYS, and **MANDATORY** added to a shared prompt template. Hardcoded skill names. Anti-pattern checklists. Instructions shouted in all caps.

The PR passed code review. Nobody flagged it.

This is where it gets interesting. The fix would probably work, in the narrow sense. But it was the prompt engineering equivalent of fixing a routing bug by adding more print statements. And it revealed a gap in how we think about prompt quality.

What Is Prompt Smell?

In software engineering, a “code smell” is a surface indication that usually corresponds to a deeper problem. Long methods. God classes. Feature envy. The code works, but something about its shape tells you the architecture is wrong.

Prompts have smells too. Here are the ones I’ve started cataloging:

Emphasis abuse. NEVER do X. ALWAYS do Y. This is **MANDATORY**. If you need to shout at a language model, you’re compensating for a structural problem. The model isn’t ignoring you because you’re not loud enough. It’s ignoring you because the architecture doesn’t enforce the behavior you want.

Hardcoded names in shared templates. A base prompt that says “for property evaluation, use the property-evaluator skill” doesn’t scale. Add five more domain skills and the prompt becomes a routing table maintained by hand. The skill system should handle routing, not the prompt author.

Prompt bloat for single features. Adding 16 lines to a prompt that every conversation loads, for a feature that only 5% of conversations need. The cost is paid by everyone, the benefit accrues to a few.

Rhetorical fixes for structural problems. “Do NOT use tool X, use tool Y instead” is a routing decision wearing the disguise of an instruction. Routing decisions belong in metadata, configuration, or tool design. Not in paragraphs of prose.

The common thread: prompt smell indicates that a structural problem is being papered over with more text.

The Structural vs Rhetorical Distinction

When an AI agent behaves incorrectly, there are two categories of fix:

Structural fixes change the system so the correct behavior is the natural outcome. Add metadata to skills so the agent knows which ones are domain-specific. Build routing logic that prioritizes specialized tools over generic ones. Design the tool interface so the right choice is obvious.

Rhetorical fixes add more instructions hoping the model pays closer attention. Louder words. More examples. Longer anti-pattern lists. Explicit prohibitions.

Rhetorical fixes are seductive because they’re fast. You can ship a prompt change in minutes. But they’re fragile. They depend on the model interpreting emphasis the way you intend. They don’t compose well (what happens when two rhetorical fixes contradict each other?). And they don’t scale.

In our case, the structural fix was straightforward. Skills already had a metadata field sitting unused. We added a “category: domain” tag to the specialized skills. The prompt builder queries for tagged skills and dynamically injects one generic instruction: “Domain skills take precedence over functional skills.” Driven by data, not by prose.

Adding a new domain skill now requires changing one line in its config. No prompt editing. No routing tables. No shouting.
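
As a sketch of that shape (the Skill class, the metadata field names, and build_prompt are illustrative, not our actual codebase), a metadata-driven prompt builder might look like:

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    """A skill with free-form metadata; only 'category' matters for routing."""
    name: str
    description: str
    metadata: dict = field(default_factory=dict)

def build_prompt(base_prompt: str, skills: list[Skill]) -> str:
    """Inject one generic routing rule, driven by skill metadata.

    No skill names are hardcoded: any skill tagged category=domain
    is picked up automatically when it is registered.
    """
    domain_skills = [s for s in skills if s.metadata.get("category") == "domain"]
    if not domain_skills:
        return base_prompt  # nothing to route; the base prompt stays untouched

    listing = "\n".join(f"- {s.name}: {s.description}" for s in domain_skills)
    routing_rule = (
        "Domain skills take precedence over functional skills.\n"
        f"Available domain skills:\n{listing}"
    )
    return f"{base_prompt}\n\n{routing_rule}"

skills = [
    Skill("property-evaluator", "Evaluates properties", {"category": "domain"}),
    Skill("pdf-generator", "Renders generic PDFs"),
]
prompt = build_prompt("You are a helpful assistant.", skills)
```

The prompt author never mentions a skill by name; registering a new domain skill with the right tag is the whole change.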

Why Agents Produce Prompt Smell

Here’s the part that surprised me. The PR with the prompt smell wasn’t written by a junior engineer. It was generated by an AI agent, the same system it was trying to fix.

The agent was asked “make domain skills get invoked first.” It did the obvious thing: added instructions to the prompt telling itself to invoke domain skills first. This is perfectly logical if you treat prompts as text files. Behavior wrong? Add more text.

What the agent lacked was the concept of prompt architecture. It didn’t ask “should this behavior be enforced structurally or rhetorically?” It didn’t consider whether hardcoding skill names in a shared template would scale. It didn’t notice that it was adding business-specific routing rules to a generic prompt used by every conversation.

This is a pattern recognition problem. Agents are excellent at pattern matching (“the user wants X behavior, I’ll add instructions for X”) but poor at architectural reasoning about where those instructions should live.

Teaching Agents Taste (And the Irony of It)

We wanted to prevent the agent from generating prompt smell in the future, in both code generation and code review. So we added conventions to the coding skill (“prefer structural solutions over rhetorical ones, don’t shout, don’t hardcode names in shared prompts”) and a prompt smell checklist to the review skill (“flag emphasis abuse, flag prompt bloat, apply the scaling test”).

The irony is obvious: we fixed “too many instructions in prompts” by adding more instructions to prompts. We’re teaching the agent to not shout by writing it down in a skill file that the agent reads as part of its context. This is the same rhetorical approach we just criticized.

I don’t have a clean answer for this. At the current layer of abstraction, conventions in skill files are the mechanism we have. The structural alternative would be a prompt linter in CI, a tool that rejects PRs with emphasis abuse or hardcoded names before the agent’s taste is even consulted. Or DSPy-style compilation, where the agent never writes prompts at all. We haven’t built either of those yet.

So for now, we’re in an awkward middle ground: using instructions to teach an agent not to over-rely on instructions. It works. The agent does follow these conventions in subsequent code generation. But it’s worth being honest about what this is and isn’t.

What’s Next?

I went looking for existing tools that address prompt quality the way ESLint addresses code quality. The landscape is instructive. There’s a clear maturity gap between what we can evaluate about prompt outputs and what we can analyze about prompt structure.

What exists today

The evaluation layer is mature. Promptfoo (now part of OpenAI, still open source) provides regression testing: define test cases and assertions, run prompts against them, catch regressions. DeepEval offers 50+ metrics including a Prompt Alignment metric that detects when outputs don’t follow instructions. These tools tell you that a prompt is broken, but not why structurally.

The versioning layer is mature. PromptLayer, Langfuse, Braintrust. Multiple production-ready platforms for versioning, A/B testing, and deploying prompts as first-class artifacts. Important infrastructure, but focused on managing prompts, not improving their quality.

The compilation approach is compelling. DSPy from Stanford takes the most radical position: eliminate hand-written prompts entirely. You declare input/output signatures, and DSPy’s optimizers generate and tune the actual prompt text automatically. If prompts are compiled rather than authored, prompt smell becomes a compiler bug rather than a human error. Philosophically the strongest answer, but it requires rethinking how you build agent systems from the ground up.

The static analysis layer barely exists. This is where the gap is. Two early-stage tools are worth watching:

prompt-lint is the closest thing to ESLint for prompts. Static pattern matching, no LLM calls needed. Six rule categories: conflicting instructions, unbounded output requests, missing format specs, vague objectives, multiple tasks per prompt, and missing role definitions. Configurable via TOML, JSON output for CI integration. It’s early, but the approach is right. Deterministic, fast, and opinionated.

Promptier is a TypeScript framework that combines heuristic rules (static, no LLM) with semantic rules (LLM-powered via Ollama). The standout feature is source maps, like git blame for prompts, tracing every line of a rendered prompt back to its origin. When your agent misbehaves, you can trace which fragment of which template caused it.

The gap

The industry has invested heavily in evaluating what prompts produce but almost nothing in analyzing what prompts contain. We can test outputs, version artifacts, and monitor production. But we can’t point a linter at a prompt file and get feedback like:

  • “3 emphasis tokens (NEVER, ALWAYS, MANDATORY). Consider structural enforcement instead”
  • “2 hardcoded tool names in a shared template. Should be parameterized”
  • “This section adds 16 lines but applies to <5% of conversations. Consider making it conditional”
  • “Scaling test: adding 5 more features in this pattern would exceed context budget”
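
Feedback like this doesn’t require an LLM. As a hedged sketch (the rule names, thresholds, and lint_prompt function are invented for illustration, not taken from prompt-lint or any real tool), a deterministic linter can be a page of regex heuristics:

```python
import re

def lint_prompt(text: str, shared: bool = True, known_tools: tuple[str, ...] = ()) -> list[str]:
    """Deterministic, LLM-free heuristics over a prompt file.

    Rule names and thresholds here are illustrative, not a real tool's.
    """
    findings = []

    # Emphasis abuse: shouted keywords suggest rhetorical enforcement.
    emphasis = re.findall(r"\b(?:NEVER|ALWAYS|MANDATORY)\b", text)
    if len(emphasis) >= 3:
        findings.append(
            f"{len(emphasis)} emphasis tokens ({', '.join(sorted(set(emphasis)))}). "
            "Consider structural enforcement instead"
        )

    # Hardcoded names: tool/skill identifiers embedded in a shared template.
    if shared:
        hardcoded = [t for t in known_tools if t in text]
        if hardcoded:
            findings.append(
                f"hardcoded tool names in a shared template ({', '.join(hardcoded)}). "
                "Should be parameterized"
            )

    return findings

report = lint_prompt(
    "NEVER improvise. ALWAYS use property-evaluator. This is **MANDATORY**.",
    known_tools=("property-evaluator", "pdf-generator"),
)
```

Because the checks are pure pattern matching, they run in milliseconds and can sit in CI, rejecting a smelly prompt PR before any human or agent taste is consulted.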

The rules are heuristic. The patterns are recognizable. The tooling gap is real. And as AI agents write more of our prompts, and more of our code that contains prompts, the need for automated quality checks becomes urgent.

The Deeper Pattern

We’ve been through several naming cycles already. Prompt engineering. Context engineering. Now harness engineering, the discipline of designing constraints, feedback loops, and validation systems that make agents work reliably. Each term pushes the abstraction higher: from crafting the right words, to managing what the agent sees, to architecting the entire system around it.

These are useful framings. But I notice they keep rediscovering the same principles.

Context engineering says “find the smallest possible set of high-signal tokens.” That’s separation of concerns. Harness engineering says “constrain what the agent can do, verify it did it correctly.” That’s defensive programming. Skills-first architecture says “domain skills take precedence over generic tools.” That’s the strategy pattern: dispatch based on type, not if-else chains in a god method.
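
In miniature (the skill and task names are hypothetical), the contrast looks like this:

```python
# Rhetorical shape: a god-method routing chain that grows with every skill.
def route_rhetorical(task_type: str) -> str:
    if task_type == "property-evaluation":
        return "property-evaluator"
    elif task_type == "lease-review":
        return "lease-reviewer"
    else:
        return "generic-assistant"

# Structural shape: dispatch from a registry; adding a skill is one entry.
SKILL_REGISTRY: dict[str, str] = {
    "property-evaluation": "property-evaluator",
    "lease-review": "lease-reviewer",
}

def route_structural(task_type: str) -> str:
    return SKILL_REGISTRY.get(task_type, "generic-assistant")
```

Both functions behave identically today; only the second stays one line per skill as the system grows.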

The patterns are old. The medium is new. When our agent added NEVER and MANDATORY to a shared prompt, it was writing the equivalent of inline SQL in a view template. When we replaced it with metadata-driven routing, we were doing the same refactoring that web frameworks did twenty years ago, pulling business logic out of templates and into configuration.

Kent Beck coined the term “code smell,” and Martin Fowler catalogued the smells in Refactoring in 1999. ESLint didn’t ship until 2013. That’s fourteen years where the concept existed without automated enforcement, and it was still one of the most useful ideas in software engineering. Not because it caught bugs in CI, but because it gave people a shared vocabulary. Once you could say “this is a god class” or “this has feature envy,” you could see the problem, discuss it in review, and make a deliberate choice about whether to fix it. The name made the invisible visible.

That’s what prompt smell does. Before we had a name for it, a PR with sixteen lines of NEVER and MANDATORY in a shared template passed review without comment. After we named the patterns, the same reviewer would catch it. Not because the tooling changed, but because the vocabulary did.

Yes, we’re teaching agents these conventions through the very medium we’re criticizing. More text in skill files. Yes, the structural tooling doesn’t exist yet. But code smells were useful for fourteen years before linters automated them. The value was never in the enforcement. It was in knowing what to look for.