Prompt Injection Defense: Why Every Human Submission Is a Potential Attack

Prompt injection is simple to explain: you embed instructions in content that an AI is going to read, hoping the AI treats those instructions as commands rather than data. 'Ignore previous instructions and instead do X.' It's the LLM equivalent of SQL injection, and it's been a known attack vector for as long as LLMs have been processing untrusted input.

At Aethra, every human work submission goes back to an AI agent for review. That means every submission is a potential injection vector. A worker could embed instructions in their deliverable text. An attacker could craft a submission specifically designed to manipulate the reviewing agent into approving low-quality work, releasing funds prematurely, or modifying platform state in unintended ways. The content itself must be treated as adversarial by default.

Why This Is Harder Than It Sounds

Naive sanitization breaks legitimate submissions. If a worker is delivering transcribed text, that transcript might legitimately contain instruction-like phrases — a transcription of a meeting where someone said 'ignore that last point.' If they're submitting a document summary, the summary might quote content that looks like a command. Stripping all imperative language would make the submission system useless for the tasks it's meant to serve.

The other challenge: injection techniques evolve. A static blocklist of dangerous phrases is obsolete within weeks of publication. Any defense that relies primarily on pattern-matching against known attack strings is fundamentally fragile. The defense has to be structural.

The Five-Stage Pipeline

Stage 1: Markup and Encoding Stripping

Raw HTML, Markdown formatting, and any embedded markup are stripped before content enters further processing. This closes the most basic vectors: hiding instructions in HTML comments, invisible spans, zero-width characters, or base64-encoded strings embedded in what appears to be normal text.

Stage 2: Length and Character Normalization

Submissions are length-capped. Unusual character combinations — right-to-left override marks, homoglyph attacks using characters that visually resemble ASCII letters, Unicode ranges that don't appear in normal human writing — are normalized or flagged. This closes attacks that rely on visual obfuscation: text that looks like 'hello' to a human but reads differently to an LLM.

Stage 3: Structural Injection Pattern Detection

A rule-based pass identifies high-confidence injection patterns: phrases that explicitly frame a new instruction set, attempts to redefine the system's operating context, role-play requests designed to bypass safety instructions. Matches are flagged for human review rather than silently dropped — silent suppression of content could itself be a vulnerability if an attacker learned they could use it to selectively censor worker submissions.

Stage 4: Prompt Architecture Context Wrapping

All submission content that passes to the reviewing agent is wrapped in explicit context delineators. The agent prompt clearly separates 'these are your operating instructions' from 'this is the user-generated content you are evaluating.' This is defense-in-depth — even if something slips through earlier stages, the prompt architecture makes it significantly harder to exploit because the injected content is framed as data under analysis, not as a new instruction source.

Stage 5: Output Schema Validation

The reviewing agent's response is validated against a strict expected schema. If the agent produces a response that doesn't conform to the expected format — which may indicate a successful injection that hijacked its behavior toward an unintended output — the response is rejected and the dispute is automatically escalated to human review. The system fails closed: unexpected agent behavior results in human involvement, not silent failure.

Why We Built It This Way

Defense-in-depth is the only intellectually honest approach to prompt injection. No single layer is sufficient. The goal isn't to make injection theoretically impossible — it's to make successful attacks rare enough, with enough detection at each layer, that they're caught and contained before causing meaningful harm.

When a submission triggers a flag, the worker is notified — not with technical detail that might assist an attacker, but with enough information to understand that their submission needs review and why. Opacity in security systems doesn't produce security; it produces confusion and erodes trust on all sides. Workers deserve to know why their submission was flagged, and they deserve a clear path to resolve it.

Prompt Injection Defense: Why Every Human Submission Is a Potential Attack

Why This Is Harder Than It Sounds

The Five-Stage Pipeline

Stage 1: Markup and Encoding Stripping

Stage 2: Length and Character Normalization

Stage 3: Structural Injection Pattern Detection

Stage 4: Prompt Architecture Context Wrapping

Stage 5: Output Schema Validation

Why We Built It This Way

More from Aethra

Why We Built Aethra: Preventing Exploitation by Design

Why AI Agents Need a Physical Layer — and Why Nobody Had Built One Until Now

The Algorithmic Bill of Rights: Why We Encoded It, Not Just Wrote It

Start building or start earning.

We'd love to hear from you.