AI Incident Response Playbook: Compromised AI Agents
AI incident response is the structured process of detecting, containing, investigating, and recovering from a security incident involving a compromised AI agent or large language model. Unlike traditional incident response workflows that focus on servers and networks, AI IR requires new detection methods, evidence sources, and containment controls specific to agents that can call APIs, access data, and make autonomous decisions. This playbook maps the NIST Cybersecurity Framework incident response function and ISO 27001 incident management requirements onto AI-specific attack patterns and evidence sources, aligned with OWASP LLM Top 10 LLM01:Prompt Injection and ISO 42001 AI security controls.
The Challenge: AI Agents Demand a New IR Approach
Traditional incident response assumes a fixed set of attack surfaces: network protocols, file systems, processes, and databases. An AI agent introduces a new risk: an LLM with programmatic access to APIs, databases, and external services can cause harm without a human reviewing the decision first.
If an attacker compromises an AI agent by injecting a prompt or poisoning training data, the agent may exfiltrate data, make unauthorized API calls, trigger financial transactions, or escalate privileges. Unlike a compromised user account, the agent runs continuously and may generate hundreds or thousands of API calls per hour. The blast radius extends beyond one person's inbox to every system the agent touches.
Traditional SOC tools see individual API calls but not the LLM reasoning that triggered them. They lack context: Was this call intentional or injected by an attacker? Did the agent's reasoning chain include a jailbreak attempt? Was this a legitimate user request or a prompt injection?
This gap is where AI-specific IR tooling becomes critical. A signed decision log from your AI runtime control system becomes the evidence source for investigation and the basis for containment decisions.
Prepare: Build Detection and Containment Into Your AI Architecture
The most effective IR begins before an incident occurs. Detection and containment must be built into the AI agent's runtime, not bolted on afterward.
Deploy an AI Runtime Control Layer
An AI runtime control system sits between the agent and its tools and LLM, inspecting every proposed action before execution. On each request, it returns ALLOW, LOG, CONSTRAIN, or BLOCK, typically in under a second. Every decision is signed and appended to an audit chain.
This architecture accomplishes three things for IR:
- Proactive containment. Malicious tool calls and policy violations are blocked in real time, before they reach external systems.
- Complete evidence. Every decision is logged with the full context, including the LLM reasoning, the proposed action, and why it was allowed or blocked.
- Fast kill-switch. Once you detect an incident, you can revoke the agent's session or apply a blanket BLOCK rule without restarting infrastructure.
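As a rough illustration of the decision-and-audit pattern described above, the sketch below evaluates a proposed tool call against a policy table and appends a signed, hash-linked record. The policy contents, tool names, field names, and signing key are all hypothetical, and HMAC stands in for whatever signing scheme your runtime control system actually uses.

```python
import hashlib
import hmac
import json

# Hypothetical signing key; a real deployment would load this from a secrets manager.
SIGNING_KEY = b"demo-key-not-for-production"

# Hypothetical policy: tool name -> decision. Unlisted tools are blocked.
POLICY = {
    "crm.read": "ALLOW",
    "email.send": "LOG",
    "warehouse.query": "CONSTRAIN",
}


def decide(tool: str) -> str:
    """Return ALLOW, LOG, CONSTRAIN, or BLOCK for a proposed tool call."""
    return POLICY.get(tool, "BLOCK")


def append_entry(chain: list, agent_id: str, tool: str, reasoning: str) -> dict:
    """Evaluate a proposed action and append a signed, hash-linked record."""
    prev_sig = chain[-1]["signature"] if chain else ""
    body = {
        "agent_id": agent_id,
        "tool": tool,
        "reasoning": reasoning,
        "decision": decide(tool),
        "prev_signature": prev_sig,
    }
    payload = json.dumps(body, sort_keys=True).encode()
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    entry = {**body, "signature": signature}
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every signature and check each link to the previous entry."""
    prev_sig = ""
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "signature"}
        if body["prev_signature"] != prev_sig:
            return False
        payload = json.dumps(body, sort_keys=True).encode()
        expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, entry["signature"]):
            return False
        prev_sig = entry["signature"]
    return True
```

Because each record includes the previous record's signature, editing or deleting an earlier entry causes verification to fail, which is what makes the log usable as evidence later in the playbook.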
Establish Baseline Tool and Data Access
Document which APIs, databases, and external services each agent is authorized to access. Map data sensitivity and blast radius per tool. For example, a customer-service agent may call the CRM read-only API and send emails, but should never access the billing system or customer financial data. A data-analysis agent may read from a data warehouse but should never write to production databases.
This baseline becomes the detection signal: any deviation (a new API call, a new data source, a tool called in an unusual sequence) is an anomaly worth investigating.
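One way to make such a baseline machine-checkable is to record it as data and test each observed call against it. The sketch below is a minimal example; the agent names, tool names, and sensitivity labels are hypothetical and mirror the customer-service and data-analysis examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolGrant:
    tool: str
    access: str       # "read" or "write"
    sensitivity: str  # "low", "medium", or "high"


# Hypothetical baseline: which tools each agent may use, and how.
BASELINE = {
    "customer-service": {
        ToolGrant("crm", "read", "medium"),
        ToolGrant("email", "write", "low"),
    },
    "data-analysis": {
        ToolGrant("warehouse", "read", "high"),
    },
}


def is_deviation(agent: str, tool: str, access: str) -> bool:
    """Return True when a call falls outside the agent's documented baseline."""
    grants = BASELINE.get(agent, set())
    return not any(g.tool == tool and g.access == access for g in grants)
```

For example, `is_deviation("data-analysis", "warehouse", "write")` returns `True`, because the baseline grants that agent read access only.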
Maintain a Role and Credential Inventory
Map each AI agent to the specific credentials it uses to call APIs. Store credentials in a secrets manager, not in code or config. Maintain a log of which credentials are active, who provisioned them, and when they were last rotated.
When you detect a compromised agent, you can revoke its credentials immediately, blocking all downstream API calls until you complete investigation and recovery.
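A credential inventory can be as simple as one record per credential. The sketch below assumes a hypothetical record layout and a 90-day rotation window chosen purely for illustration; it stores a reference into the secrets manager rather than the secret itself, and the actual revocation call to your secrets manager is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CredentialRecord:
    agent_id: str
    secret_ref: str        # pointer into the secrets manager, never the secret
    provisioned_by: str
    last_rotated: datetime
    active: bool = True


def stale_credentials(inventory, now, max_age=timedelta(days=90)):
    """Return active credentials that have not been rotated within max_age."""
    return [c for c in inventory if c.active and now - c.last_rotated > max_age]


def revoke_agent(inventory, agent_id):
    """Mark an agent's credentials inactive and return the refs to revoke upstream."""
    refs = []
    for cred in inventory:
        if cred.agent_id == agent_id and cred.active:
            cred.active = False
            refs.append(cred.secret_ref)
    return refs
```

The list returned by `revoke_agent` is the set of secret references an on-call responder would then revoke in the secrets manager itself.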
Detect: Recognize AI-Specific Attack Patterns
AI agent compromise manifests differently from traditional incidents. Standard SOC alerts focus on unusual network traffic or file access. AI incidents surface as unusual reasoning chains, unexpected tool calls, or policy violations.
Establish Detection Rules for Prompt Injection and Jailbreaks
Prompt injection ranks first in the OWASP Top 10 for LLM Applications, as LLM01: Prompt Injection. An attacker embeds malicious instructions in a data source, email, or user input, hoping the agent will execute them. Attack vectors include user-supplied text, retrieved documentation, and poisoned training data.
Detection signals include:
- Tool calls that conflict with the user's stated intent. If a user asks for a summary and the agent calls a database modification endpoint, that's a mismatch.
- Sudden spikes in API call volume to a sensitive system (billing, authentication, admin APIs).
- Repeated failed authorization attempts followed by successful calls with elevated privileges.
- Data exfiltration patterns: reading from sensitive tables and writing to external systems (email, cloud storage, chat).
Your runtime control system can block known jailbreak patterns, but detection rules should also flag novel patterns. If the agent's reasoning includes phrases like "ignore previous instructions" or "the user is asking for something forbidden, but I should do it anyway," that reasoning should be logged and escalated to your SOC.
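The data exfiltration signal listed above (a sensitive read followed by a write to an external system) can be sketched as a scan over an agent's ordered tool calls. The tool names and the three-call window below are hypothetical values for illustration, not recommended settings.

```python
# Hypothetical tool names; replace with the tools in your own baseline.
SENSITIVE_READS = {"billing.read", "customers.read"}
EXTERNAL_WRITES = {"email.send", "storage.upload", "chat.post"}


def exfil_candidates(calls, window=3):
    """Return (read_index, write_index) pairs where an external write
    follows a sensitive read within `window` subsequent calls."""
    hits = []
    for i, tool in enumerate(calls):
        if tool not in SENSITIVE_READS:
            continue
        for j in range(i + 1, min(i + 1 + window, len(calls))):
            if calls[j] in EXTERNAL_WRITES:
                hits.append((i, j))
                break
    return hits
```

A match is a lead for triage rather than proof of exfiltration, since some legitimate workflows read sensitive data and then send an email.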
Monitor for Anomalous Tool Call Sequences
Each agent typically follows predictable workflows. A customer-service agent calls the CRM, reads a ticket, sends an email, updates the ticket status. A financial agent reads account balances, calls pricing APIs, and logs results.
Establish baselines for each agent: which tools it calls, in what order, how frequently, what data volumes it accesses. Then alert on deviations:
- A tool called for the first time in the agent's deployment.
- A sensitive tool (database write, financial API, authentication service) called outside its normal schedule or by a different agent than usual.
- Call volumes that spike beyond historical norms.
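Two of the deviations above, first-seen tools and volume spikes, can be checked with simple per-window counts. This sketch assumes you already aggregate tool calls into fixed time windows; the 3x spike factor is an arbitrary illustrative threshold.

```python
from collections import Counter


def tool_anomalies(history, current, spike_factor=3.0):
    """Compare the current window's tool counts against past windows.

    history: list of Counter objects, one per past window.
    current: Counter of tool calls in the current window.
    Returns a list of (tool, reason) tuples.
    """
    seen = Counter()
    for window in history:
        seen.update(window)

    alerts = []
    for tool, count in current.items():
        if tool not in seen:
            alerts.append((tool, "first_seen"))
            continue
        mean_per_window = seen[tool] / len(history)
        if count > spike_factor * mean_per_window:
            alerts.append((tool, "volume_spike"))
    return alerts
```

A mean-based threshold is deliberately crude; agents with bursty but legitimate workloads may need a percentile or time-of-day baseline instead.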
Use Runtime Decision Logs as Evidence
The most valuable detection signal is the signed runtime decision log itself. If your runtime control system is blocking multiple policy violations per hour from a single agent, that agent is likely compromised.
For example, a policy might state "this data-analysis agent may read from the warehouse but not write to production." If the log shows 50 BLOCK decisions on production-write attempts in the past hour, the agent is likely compromised or being actively targeted.
Detect: Implement Detection Rules
SOC teams managing AI agents need detection patterns that work across different AI platforms and logging systems. Below are detection approaches for both generic environments and AI-specific telemetry.
Portable Anomalous API Activity Detection
The example below is written in KQL against the Azure Activity Log. The same logic (count failures and successes per caller per time window) ports to other API gateway or cloud logs such as AWS CloudTrail, and applies regardless of your AI runtime platform:
// Assumes the current AzureActivity schema (ActivityStatusValue, OperationNameValue, _ResourceId).
// Verify column names and status values against your own workspace before deploying.
AzureActivity
| where ResourceProviderValue in~ ("Microsoft.Compute", "Microsoft.Storage") or OperationNameValue contains "API"
| where ActivityStatusValue in ("Success", "Failure")
| summarize
    SuccessCount = countif(ActivityStatusValue == "Success"),
    FailedCount = countif(ActivityStatusValue == "Failure"),
    UniqueOperations = dcount(OperationNameValue),
    UniqueResources = dcount(_ResourceId)
    by Caller, bin(TimeGenerated, 10m)
// Keep only windows with many failures and at least one success
| where FailedCount > 20 and SuccessCount > 0
| join kind=inner (
    AzureActivity
    | where ActivityStatusValue == "Success"
    | summarize
        SuccessfulOps = make_set(OperationNameValue),
        TargetResources = make_set(_ResourceId)
        by Caller
    ) on Caller
| project
    TimeGenerated,
    Caller,
    FailedAttempts = FailedCount,
    SuccessfulOperations = SuccessfulOps,
    AccessedResources = TargetResources,
    Severity = iff(FailedCount > 50, "Critical", "High")
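This rule flags callers with more than 20 failed operations and at least one successful operation in the same 10-minute window, a pattern consistent with privilege escalation or credential compromise via API access. It does not check that the failures preceded the successes, so treat matches as leads to triage rather than confirmed incidents. Adjust thresholds based on your baseline traffic.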
AI-Specific Detection: Unusual Reasoning Chain Indicators
For organizations using AI runtime control systems with reasoning logs, this detection identifies agents exhibiting suspicious reasoning patterns:
// AIRuntimeLogs is a placeholder table name; map it and its columns to your runtime control system's schema.
AIRuntimeLogs
| where TimeGenerated > ago(1h)
// Case-insensitive substring match on any indicator phrase
| where ReasoningChain matches regex @"(?i)(ignore previous|bypass|override policy|the user wants|I should not but)"
| summarize
    SuspiciousReasoningCount = count(),
    UniqueTools = dcount(RequestedTool),
    BlockedToolCount = countif(Decision == "BLOCK"),
    Tools = make_set(RequestedTool)
    by AgentId, bin(TimeGenerated, 10m)
| where SuspiciousReasoningCount > 5 or BlockedToolCount > 10
| project
    TimeGenerated,
    AgentId,
    SuspiciousPatterns = SuspiciousReasoningCount,
    BlockedAttempts = BlockedToolCount,
    Tools,
    SeverityLevel = iff(SuspiciousReasoningCount > 20, "Critical", "High")
This rule flags agents whose LLM reasoning includes injection indicators or policy-bypass language. Pair this with the generic API detection rule above for layered detection across any AI runtime system.
Contain: Kill the Compromised Agent Immediately
Once you detect a potential incident, your first action is containment. Do not wait for full investigation. Stop the agent from causing further harm.
Revoke the Agent's Session
Your runtime control system can revoke the agent's session immediately, even if the underlying LLM and infrastructure are still running. This is the equivalent of terminating a user's login session during a traditional incident. All subsequent API calls from that agent are blocked.
Session revocation is fast (seconds), does not require restarting services, and is reversible. If the revocation was a false alarm, you can re-enable the agent once investigation clears it.
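The reversible kill-switch described above can be pictured as a revocation list that overrides every policy decision while an agent is listed. The class below is an in-memory stand-in written for illustration; the class and method names are hypothetical and do not correspond to any particular product's API.

```python
from datetime import datetime, timezone


class SessionRegistry:
    """In-memory stand-in for a runtime control system's session store."""

    def __init__(self):
        self._revoked = {}  # agent_id -> (reason, revoked_at)

    def revoke(self, agent_id, reason):
        """Revoke an agent's session and record why and when."""
        self._revoked[agent_id] = (reason, datetime.now(timezone.utc))

    def restore(self, agent_id):
        """Lift a revocation. Returns True if the agent had been revoked."""
        return self._revoked.pop(agent_id, None) is not None

    def gate(self, agent_id, proposed_decision):
        """Override any policy decision with BLOCK while the agent is revoked."""
        return "BLOCK" if agent_id in self._revoked else proposed_decision
```

Because the gate is evaluated on every request, revoking and restoring take effect on the next call without restarting the agent or the model server.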
Revoke API Credentials
Go to your secrets manager and revoke the API credentials associated with the compromised agent. This prevents the agent from calling downstream systems even if it somehow escapes the runtime control layer.
Revoke the agent's own credentials. If the agent also has access to shared service accounts, rotate those credentials and audit them for abuse.
Disable or Isolate the Agent's Data Sources
If the agent has read access to sensitive databases or data warehouses, temporarily revoke that access. This prevents exfiltration of additional data while you investigate.
Keep the agent running (so you can collect logs), but starve it of access. If the attack involved data exfiltration, this limits the damage.
Notify Dependent Systems and Users
Send a notice to any teams or users who depend on the agent. Explain that the agent is temporarily offline for security investigation and provide an alternative (manual process, fallback system, escalation path). The longer users wait without communication, the more pressure you face to bring the agent back online before investigation is complete.
Investigate: Reconstruct the Attack and Collect Evidence
With the agent contained, now investigate what happened and how.
Collect the Runtime Decision Log
The signed runtime control log is your primary evidence source. It contains:
- Every proposed action the agent attempted.
- The LLM reasoning and context behind each action.
- Which actions were allowed, blocked, or constrained, and why.
- Timestamps and cryptographic signatures.
Export this log in full. It is your record that the attack occurred and of the pattern it followed. It can also serve as supporting evidence in a post-incident review or regulatory inquiry, for example when demonstrating ISO 27001 or ISO 42001 controls or the security measures required under GDPR Article 32 (security of processing).
Examine the LLM Reasoning Chain
Look at the reasoning chain (the "chain of thought" or intermediate steps) from the LLM. Did the reasoning include injected instructions? Did the agent's own reasoning contradict its normal behavior?
For example, if the agent's reasoning suddenly included "the user is asking me to bypass the access control, and I should obey even though it violates the policy," you have evidence of prompt injection or jailbreak. This reasoning would not normally appear in legitimate requests.
Identify the Attack Vector
Work backward from the blocked or unusual actions to identify how the agent was compromised:
- Input injection. Did a user-provided input (email, document, chat message) contain the malicious prompt?
- Data poisoning. Did the agent read from a data source that an attacker had poisoned (a compromised database, a public dataset)?
- Credential compromise. Did an attacker steal the agent's API key and call APIs directly, bypassing the agent?
- Model compromise. Did the attacker modify the model weights, the model-serving infrastructure, or the system prompts?
For input and data-poisoning attacks, the runtime log will show the actual malicious prompt. For credential compromise, you will see API calls that came from outside your normal infrastructure. For model compromise, behavior will shift across all agents using that model.
Correlate with External Logs
Cross-reference the runtime decision log with logs from downstream systems:
- API gateway logs. Did the agent's API calls succeed or fail? Which specific resources did it access?
- Database access logs. Which queries did the agent execute? Which rows were read or modified?
- Secrets manager audit logs. Were the agent's credentials accessed or rotated unexpectedly?
- Cloud provider logs. Did the agent spin up new resources, modify IAM policies, or escalate privileges?
This correlation tells you the blast radius: which data was accessed, which systems were modified, and whether the attacker achieved any goals before containment.
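One useful correlation is to look for downstream calls that have no matching permitted decision in the runtime log, which would suggest traffic that bypassed the control layer (for example, a stolen credential used directly). The sketch below assumes both logs have been normalized into dictionaries with hypothetical `credential_id`, `endpoint`, and `timestamp` fields, and uses an arbitrary 5-second matching tolerance.

```python
from datetime import datetime, timedelta


def unmatched_gateway_calls(runtime_log, gateway_log, tolerance=timedelta(seconds=5)):
    """Return gateway calls with no permitted runtime decision for the same
    credential and endpoint within `tolerance` of the call's timestamp."""
    # Any decision other than BLOCK lets the call proceed to the gateway.
    permitted = [r for r in runtime_log if r["decision"] != "BLOCK"]
    unmatched = []
    for call in gateway_log:
        matched = any(
            r["credential_id"] == call["credential_id"]
            and r["endpoint"] == call["endpoint"]
            and abs(call["timestamp"] - r["timestamp"]) <= tolerance
            for r in permitted
        )
        if not matched:
            unmatched.append(call)
    return unmatched
```

Clock skew between systems can produce false positives, so the tolerance should be set from the skew you actually observe between the two log sources.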
Preserve Chain of Custody
Document and secure all evidence:
- Export and hash the runtime decision log.
- Capture screenshots or PDFs of the runtime control dashboard showing the incident timeline.
- Lock down access to the log files and supporting system logs (database, API gateway, cloud provider).
- Record who accessed these logs and when, so you can demonstrate the chain of custody to legal, compliance, or law enforcement.
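Hashing the exported log and recording who handled it can be done with the Python standard library. The sketch below is a minimal example; the custody record's field names are hypothetical and should follow whatever your legal or compliance team requires.

```python
import hashlib
from datetime import datetime, timezone


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA-256 digest of a file without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_record(path, handler, action):
    """Build one chain-of-custody entry for an evidence file."""
    return {
        "file": str(path),
        "sha256": sha256_of_file(path),
        "handler": handler,
        "action": action,  # for example "exported" or "reviewed"
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
```

Recomputing the hash each time the file is handled and comparing it with the first record shows that the evidence has not changed since export.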
Eradicate: Remove the Attack and Patch the Vulnerability
With evidence collected, now remove the attack and fix the underlying vulnerability.
Remove the Attack Artifact
If the attack was input-based (a malicious email or document), remove it from the system. Delete the malicious data source, revoke access to the poisoned dataset, or remove the compromised file from shared storage.
Patch the Vulnerability
Address the root cause so the attack does not recur:
- For input injection, add input sanitization or content-detection rules. Use your runtime control system to block prompts that match known injection patterns.
- For data poisoning, implement data validation on ingestion and restrict which data sources the agent can read.
- For credential compromise, rotate credentials, enable multi-factor authentication on the human accounts that can provision or read them, and implement IP allowlisting or network segmentation.
- For model compromise, re-serve the model from a verified build, audit the model supply chain, and implement model-signing or integrity checks.
Update Policies and Baselines
Use the incident to refine your runtime policies:
- Lower the blast radius for this agent. If it was compromised once, assume it can be again and restrict its access more tightly.
- Add new detection rules based on the attack pattern you observed.
- Update the agent's baseline tool-access profile so future deviations are caught faster.
Recover: Restore the Agent and Verify Safety
Once you have eradicated the attack, restore the agent to service.
Re-enable the Agent's Session and Credentials
Rotate the agent's API credentials to a new value and re-enable its session in the runtime control system. Test basic functionality in a staging environment before deploying to production.
Run a Smoke Test
Execute a simple, known-safe workflow with the agent. Verify that it behaves normally: it calls the expected APIs, in the expected order, with expected data volumes. Compare the runtime log to the baseline you established during the Prepare phase.
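A smoke test of this kind can be automated by comparing the observed tool calls against the expected sequence and per-tool data limits. The sketch below uses the customer-service workflow described earlier; the tool names and row limits are hypothetical placeholders for your own baseline.

```python
# Hypothetical baseline for a customer-service agent's known-safe workflow.
EXPECTED_SEQUENCE = ["crm.read", "ticket.read", "email.send", "ticket.update"]
MAX_ROWS = {"crm.read": 10, "ticket.read": 1, "email.send": 1, "ticket.update": 1}


def smoke_test_failures(observed):
    """Check observed (tool, rows_touched) pairs against the baseline.

    Returns a list of human-readable failures; an empty list means pass.
    """
    failures = []
    tools = [tool for tool, _ in observed]
    if tools != EXPECTED_SEQUENCE:
        failures.append(f"sequence mismatch: {tools}")
    for tool, rows in observed:
        limit = MAX_ROWS.get(tool)
        if limit is None:
            failures.append(f"unexpected tool: {tool}")
        elif rows > limit:
            failures.append(f"{tool} touched {rows} rows, limit {limit}")
    return failures
```

An empty result is a necessary condition for returning the agent to production, not a sufficient one; it complements the closer monitoring in the next step.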
Monitor Closely for the First 24 to 48 Hours
After recovering a compromised agent, increase monitoring sensitivity. Lower alert thresholds, increase log retention, and assign a human reviewer to spot-check the runtime decisions. If you detect any new anomalies, you can contain the agent again immediately.
Lessons Learned: Close the Loop and Share Knowledge
Post-incident, conduct a blameless review with your team.
Document the Incident
Create a post-incident report capturing:
- Timeline of events (when the attack was detected, when containment occurred, when eradication was complete).
- Root cause (how the agent was compromised).
- Impact (which systems were affected, which data was accessed, what harm resulted).
- Detection and response effectiveness (how quickly you detected the incident, how long containment took, and how much the signed decision log helped).
- Improvements (what you will do differently next time).
Share Lessons Across Teams
Distribute findings to your security, engineering, and product teams. If this agent was vulnerable to prompt injection, other agents likely are too. If this attack exploited a weakness in your secrets manager, fix that weakness across all agents.
Update Your IR Playbook
Incorporate the incident into your playbook. If the attack was novel, add a new detection rule. If your response process had bottlenecks, streamline it.
How to Build AI Incident Response Into Your Team
An AI incident response capability requires investment in three areas:
- Runtime control infrastructure. Deploy a system that inspects every AI agent action before execution and logs decisions with full context. This is your primary evidence source and containment lever.
- Detection tooling. Build generic detection rules for unusual API patterns and reasoning chains. Integrate these into your SOC's existing SIEM platform so incidents surface alongside traditional alerts.
- Team training. Train your SOC and IR teams on AI-specific attack patterns, evidence sources, and containment procedures. Include AI IR scenarios in quarterly tabletop exercises.
Runtime control systems can be custom-built, adopted from commercial vendors, or deployed via open-source projects. Start before you have a production incident. For EU organizations, ensure your AI IR process covers the logging and notification obligations that apply to high-risk AI systems under the EU AI Act, including serious-incident reporting (Article 73 of the final regulation), and map your containment procedures to ISO 42001 AI security controls.
Frequently asked questions
What do you do when an AI agent is hacked?
First, contain the agent immediately by revoking its session and API credentials, preventing further harm. Next, collect the signed runtime decision log as evidence of the attack. Then investigate the root cause, patch the vulnerability, and restore the agent only after testing confirms it is safe. Monitor closely for 24 to 48 hours afterward. Document the incident in a post-incident report for your compliance and legal teams.
How do you investigate an AI security incident?
Start with the runtime decision log: examine the LLM reasoning chains and proposed actions to identify the attack vector (input injection, data poisoning, credential compromise, or model compromise). Cross-reference with API gateway, database, and cloud provider logs to determine the blast radius. Preserve the log chain of custody for compliance review. Document the timeline and root cause in a post-incident report.
How do you contain a compromised AI agent?
Revoke the agent's session in the runtime control system, revoke its API credentials in the secrets manager, and temporarily disable its access to sensitive data sources. Send notification to dependent teams. Containment is fast (seconds) and reversible if the incident was a false alarm.
What evidence do you collect in an AI security incident?
Collect the signed runtime decision log (including LLM reasoning, proposed actions, and policy decisions), API gateway access logs, database query logs, secrets manager audit logs, and cloud provider logs. Export and hash all evidence, document the chain of custody, and preserve logs for compliance review under ISO 27001 and ISO 42001 frameworks.
What is the difference between AI incident response and traditional incident response?
Traditional IR focuses on servers, networks, and user accounts. AI IR adds new detection signals (unusual reasoning chains, anomalous tool call sequences) and new evidence sources (the runtime decision log). Containment is faster (revoke a session instead of wiping a server) but requires AI-aware tooling and policies built into the AI runtime.
How do you prevent AI agent compromise?
Deploy an AI runtime control layer that inspects every proposed action before execution. Establish baseline tool and data access per agent. Implement input sanitization and prompt-injection detection. Rotate credentials regularly and enable multi-factor authentication. Monitor runtime decision logs for policy violations and anomalies. Conduct security training for teams that develop or manage AI agents.
How often should you test your AI incident response plan?
Test your IR plan at least quarterly. Run tabletop exercises simulating a compromised agent: practice detection, containment, evidence collection, and recovery steps. Use these tests to refine detection thresholds, streamline response playbooks, and identify gaps in tooling or training.