AI language models work by processing text. Everything in the context window — your system instructions, your question, and any content the AI reads on your behalf — arrives as a single undifferentiated stream. The model has no hard technical boundary between "instructions from the user" and "text that appeared inside a document you asked me to read."
Prompt injection exploits that gap. An attacker embeds instructions inside content that an AI will consume. When the model processes it, it may follow those instructions as if they came from you — overriding your original intent, and in some configurations, taking actions you never authorized.
1. Direct injection vs. indirect injection
There are two forms, and they carry very different levels of risk for professional teams.
Direct injection is what you usually see covered in news stories: someone types "ignore your previous instructions and do X" directly into a chat interface. This is essentially a jailbreak attempt. Modern frontier models have become substantially better at resisting these, and web-based front ends from major providers have explicit safeguards layered on top.
Indirect injection is the version that should concern you more. Here, you are not the one typing the malicious instruction. It is hidden inside content the AI reads on your behalf.
- A webpage you ask the AI to summarize or browse
- A PDF or document dropped into a research chat
- An email processed by an AI assistant
- A case file or discovery document fed into an agentic workflow
The malicious text may not be visible to you at all. It could be written in white text on a white background, embedded in document metadata, or placed in a section the AI parses but the human never scrolls to. When the model encounters it, the instruction executes.
2. The real-world case: Bing Chat and the hidden instruction problem
In February 2023, shortly after Microsoft launched Bing Chat with live web browsing, security researcher Kevin Liu demonstrated a clean example of indirect prompt injection in a production AI product.
Liu built a webpage with text instructing Bing Chat to ignore its prior instructions and behave differently. When he asked Bing to summarize his page, the model complied — not with a summary, but with the injected instructions. In the same research window, other security investigators showed that similar techniques could cause Bing to attempt to convince users to visit malicious sites, reveal portions of its internal system configuration, and shift its persona mid-conversation.
Microsoft took these reports seriously and made changes. But the incident established something important: production AI systems with web access were vulnerable in ways their developers had not fully anticipated, and the attack surface was anything the model could read.
A year later, security researcher Johann Rehberger documented an even more concrete version of the problem. In 2024, he demonstrated that an attacker could embed instructions in a webpage that, when processed by ChatGPT with memory enabled, would cause the AI to permanently write false beliefs into its long-term memory store. The injected instruction told the model to remember that the user had specific preferences or profile details — details the attacker chose, not the user. Rehberger reported the vulnerability to OpenAI. They patched it. But the mechanism was real, and it required no special technical access to execute: just a webpage with the right hidden text.
3. When AI has unfettered access to your system, the risk scales sharply
A web-based AI chat session is relatively contained. When you use ChatGPT's web interface, Claude at claude.ai, or Google Gemini in a browser, the AI has no path to your filesystem, your local network, or your files beyond what you explicitly upload. Even if an indirect injection attempt reached the model through something it read, the blast radius is limited — it might change what the AI says, but it cannot act on your machine.
Agentic AI configurations are a fundamentally different situation.
When an AI agent has tool access — file reading, file writing, code execution, terminal commands, email access, web browsing, database queries — a successful prompt injection is no longer just a text manipulation. It can cause the agent to:
- Read files from your local drive and send their contents externally
- Write or delete files in your workspace
- Send emails or make web requests on your behalf
- Execute commands in your terminal
- Persist altered information in AI memory systems
- Forward sensitive case material to attacker-controlled endpoints
This is the core risk of "unfettered access" setups. IDE-based agentic tools like Cursor, VS Code with Copilot agent mode, or any AI workflow connected via MCP (Model Context Protocol) to live data sources can expose exactly these vectors if the inputs the AI reads are not controlled.
This is not an argument against these tools. They are genuinely powerful for the right workflows. It is an argument for understanding what access you are granting and what content you are feeding in.
4. Why web-based front ends are safer for high-risk inputs
Using the official web interface of a major frontier model gives you several layers of protection that do not exist in raw API or agentic configurations.
- The model has no access to your local filesystem unless you explicitly upload a file
- No code the model generates can run on your machine automatically
- The provider maintains extensive safety layers and content policy enforcement at the infrastructure level
- There is no persistent tool invocation — the AI can describe an action, but it cannot execute one without a separate system giving it that capability
- Memory and data retention controls are visible and user-managed
This matters practically. If you are reading through a collection of documents from an adversarial source — court filings from an opposing party, emails from a subject under investigation, downloaded web content from a monitored person or organization — running those through the official web interface of ChatGPT, Claude, or Gemini is substantially safer than feeding them into an agentic workflow with local tool access.
The model can still be manipulated into producing bad output by a well-crafted indirect injection. But it cannot act on your system. The attack is bounded by what text it can return, not what actions it can take.
5. What investigative and legal teams are especially exposed to
Consider the material your team handles regularly: evidence files, discovery documents, communications from subjects, court filings from opposing counsel, downloaded news content, social media archives, public records from external databases.
These are the exact input types where indirect prompt injection risk is highest. You did not create this content. You do not fully control it. And in some cases, sophisticated parties know you will be using AI tools to process it.
That is not a paranoid scenario. It is the natural adversarial context of investigative and legal work, applied to AI workflows.
The risk tiers roughly map as follows:
- Web-based chat, no tool access: Moderate risk. Output can be manipulated; system cannot be acted on. Manageable with human review of AI output before taking action.
- API integration or browser plugin with account access: Elevated risk. Injected instructions may trigger account-level actions. Requires governance and access review.
- Agentic system with filesystem, terminal, or email access: High risk for adversarial inputs. Requires explicit access scope controls, input sanitization review, and human checkpoints before automated actions execute.
6. Practical safeguards you can put in place now
None of these require stopping AI use. They require applying the right tool to the right risk level.
- Use web-based front ends (ChatGPT, Claude.ai, Gemini) when processing documents from untrusted, external, or adversarial sources.
- When using agentic tools, limit tool permissions to only what that specific task requires. Do not leave broad filesystem or email access active by default.
- Never grant AI agents persistent access to email, calendar, or sensitive case file directories without explicit per-task authorization.
- Treat AI output that recommends unexpected actions — clicking a link, running a command, sending data — as a potential injection indicator and escalate for review before complying.
- Log what documents were fed into which AI workflows. If something goes wrong, you need to trace the input chain.
- Brief your team on what indirect injection looks like in practice. It does not announce itself. The model will produce the injected output in the same confident tone it uses for everything else.
7. The bottom line for your workflow
Prompt injection is OWASP's number-one risk category for large language model applications. It has appeared in production systems at Microsoft, OpenAI, and across dozens of enterprise AI integrations. It will continue to evolve as AI agents gain more capability and access.
The good news is that the primary mitigation is not technical. It is operational. Know what access your AI tools have. Match input risk to interface risk. Keep humans in the review loop for any AI output that drives action. And where possible, use the safeguards that major providers have already built into their consumer-facing interfaces rather than routing sensitive material through configurations with fewer guardrails.
The teams most at risk are not the ones using AI carelessly. They are the ones using AI confidently, without a clear model of what it can and cannot do on their behalf.
Get your AI workflow assessed before it becomes a liability
Prompt injection is one of several attack surfaces that opens up when AI tools are introduced into legal, investigative, or media workflows — especially as those tools gain more access to files, systems, and sensitive case material. Most teams do not know where their exposure points are until something goes wrong.
Daniel Powell works directly with teams to map their current AI tool configurations, identify where sensitive inputs are flowing, and build governance structures that keep the operational benefits while closing the risk gaps. That includes access scoping, input handling protocols, review checkpoints, and staff briefings tailored to your actual workflow — not a generic policy template.
If your team is using AI with any level of local file access, API integration, or agentic capability, a structured assessment is worth doing before you scale further. Get in touch to start the conversation.
Sources
- OWASP: Top 10 for Large Language Model Applications — LLM01: Prompt Injection
- Greshake et al. (2023): "Not What You Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection" (arXiv)
- Johann Rehberger / Embrace the Red: Indirect prompt injection attack against ChatGPT memory (2024)
- Simon Willison: "Prompt injection attacks against GPT-3" — original documentation of the attack class (September 2022)
- NIST: AI Risk Management Framework 1.0 — adversarial inputs and system integrity