AI Threat Hunting: Hunting AI-Based Attacks in Enterprise
AI threat hunting is the process of proactively searching enterprise systems for signs of AI-based attacks, including prompt injection, model abuse, unauthorized tool calls, and data exfiltration via AI agents. Unlike traditional threat hunting, AI hunts depend on specialized data sources: gateway decision logs, tool execution records, prompt and response metadata, and anomalous outbound API calls initiated by AI systems. SOC teams use hypothesis-driven methods to correlate these signals and identify compromised AI agents, poisoned models, or attackers manipulating AI workflows before damage spreads.
Why AI Threat Hunting Matters Now
Enterprise AI adoption is accelerating. Copilots, chatbots, autonomous agents, and retrieval-augmented generation (RAG) systems now handle customer data, financial transactions, and sensitive workflows. When an AI agent is compromised, the damage is asymmetrical: an attacker gains programmatic access to every tool the agent can call, every file it can read, every API it can invoke, all without human review.
Traditional SOC workflows are blind to these risks. A prompt injection attack against a customer service chatbot does not generate a failed authentication event or a network IDS alert. An AI agent exfiltrating data through an API it legitimately calls looks identical to normal operation. The attack surface is novel, the telemetry is unfamiliar, and most SOCs lack the data sources and hunting playbooks to detect it.
This gap is why AI threat hunting has become essential. By building hypothesis-driven hunts over AI-specific telemetry, SOCs can shorten the time it takes to detect AI-based attacks, potentially from weeks to hours.
The AI Threat Model
Before hunting, you need to understand what you are hunting for. MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems), a knowledge base of adversary tactics and techniques against AI systems, covers the attack patterns below:
Prompt Injection and Jailbreaks. An attacker crafts a malicious input to an AI agent, attempting to override its instructions, trigger unintended tool calls, or leak sensitive training data or system prompts. The attack succeeds if the model obeys the injected instruction.
Model Manipulation. An attacker poisons training data, maliciously fine-tunes the model, or modifies its weights in memory to change its behavior. The model then executes actions it was never intended to perform.
Tool Abuse. An agent has access to legitimate tools, APIs, or databases. An attacker tricks the agent into calling those tools in unauthorized ways, such as deleting records, exfiltrating data, or modifying configurations.
Supply Chain Compromise. An attacker compromises a third-party model, library, or fine-tuned adapter, then deploys it into the enterprise. The compromise may be latent, activating only when specific conditions are met.
Data Exfiltration via Agent Output. An AI agent is manipulated into including sensitive data in its response, such as internal IDs, database schemas, or confidential queries. The attacker then harvests that data from logs, chat histories, or API responses.
The common thread: all of these require visibility into what the AI agent is actually doing. You need to see the prompts, the model's decisions, the tools it calls, and the data it outputs.
Key Data Sources for AI Threat Hunting
Gateway Decision Logs
An AI gateway (like Vaikora) sits between applications and the LLM, intercepting every request and response. The gateway logs every decision: ALLOW, LOG, CONSTRAIN, or BLOCK. Each decision includes:
- Input prompt (full or summarized)
- Model and model version
- Detected threats (prompt injection, jailbreak patterns, policy violations)
- Decision reason (why the prompt was allowed or blocked)
- Timestamp and user/agent identity
- Output token count and latency
These logs are gold for AI threat hunting. A spike in BLOCK decisions on a specific agent suggests active attack attempts. A sudden shift from BLOCK to ALLOW for the same prompt class suggests an attacker has found an evasion technique.
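The BLOCK-spike signal can be sketched in a few lines. This is a minimal illustration, not Vaikora's actual schema: the `Decision` record, its field names, and the `factor` and `min_blocks` thresholds are all assumptions chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Decision:
    """One gateway decision record (hypothetical schema, not Vaikora's)."""
    timestamp: datetime
    agent_id: str
    decision: str  # "ALLOW", "LOG", "CONSTRAIN", or "BLOCK"


def block_spikes(decisions, baseline_hourly_blocks, factor=3.0, min_blocks=5):
    """Return {(agent_id, hour): count} where hourly BLOCKs exceed the baseline.

    baseline_hourly_blocks maps agent_id -> typical BLOCK count per hour.
    factor and min_blocks are illustrative thresholds to tune per environment.
    """
    per_hour = Counter()
    for d in decisions:
        if d.decision == "BLOCK":
            hour = d.timestamp.replace(minute=0, second=0, microsecond=0)
            per_hour[(d.agent_id, hour)] += 1

    spikes = {}
    for (agent_id, hour), count in per_hour.items():
        # Require an absolute floor so quiet agents do not alert on one block.
        threshold = max(min_blocks, factor * baseline_hourly_blocks.get(agent_id, 0.0))
        if count >= threshold:
            spikes[(agent_id, hour)] = count
    return spikes
```

In practice the same logic would usually live in a scheduled SIEM query. The Python form is only meant to make the comparison against a per-agent baseline explicit.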
Tool Call and Action Logs
Every time an AI agent calls a tool or API, that call should be logged with:
- Tool name and parameters
- Timestamp and agent identity
- Return status and output (or output size)
- Latency
- Any authentication or authorization decisions
These logs reveal unauthorized tool usage. For example, a customer service agent that suddenly starts calling the database deletion API, or a research agent making API calls to external payment systems.
Anomalous Egress and API Patterns
Monitor outbound API calls initiated by AI agents:
- Which external APIs are being called, and from which agents?
- Are API calls occurring outside normal business hours?
- Is the agent calling APIs it has never called before?
- Are calls being made to known malicious IP addresses or domains?
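The questions above translate into simple per-call checks. The sketch below assumes each egress record is a dict with `agent_id`, `url`, and `timestamp` keys, that a per-agent set of previously seen hosts is available, and that the denylist comes from your own threat intelligence. All of these names are hypothetical.

```python
from datetime import datetime
from urllib.parse import urlparse


def flag_egress(calls, known_hosts, denylist, business_hours=(8, 18)):
    """Yield (call, reasons) for outbound calls that match an egress signal.

    calls: iterable of dicts with "agent_id", "url", "timestamp" (datetime).
    known_hosts: dict agent_id -> set of hosts the agent has called before.
    denylist: set of hosts taken from threat intelligence.
    business_hours: (start_hour, end_hour), an assumed local-time window.
    """
    start, end = business_hours
    for call in calls:
        host = urlparse(call["url"]).hostname or ""
        reasons = []
        if host in denylist:
            reasons.append("denylisted host")
        if host not in known_hosts.get(call["agent_id"], set()):
            reasons.append("first-seen host for this agent")
        if not start <= call["timestamp"].hour < end:
            reasons.append("outside business hours")
        if reasons:
            yield call, reasons
```

A call that trips several reasons at once is a stronger lead than any single reason on its own.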
Prompt and Response Metadata
Log enough context to reconstruct what the agent "saw":
- Input user query (sanitized of PII if needed)
- System prompt or instructions (to detect tampering)
- Retrieved context (from RAG or memory)
- Model output (truncated or sampled, depending on data governance)
- Confidence scores or logit rankings (to detect hallucinations or unnatural outputs)
Model Behavior Baselines
Establish normal behavior for each AI agent:
- Average latency per call
- Token usage patterns
- Tool call distribution
- Error rates and types of errors
- Time-of-day patterns (when the agent is active)
Significant deviations suggest compromise or misconfiguration.
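As a rough sketch, a per-agent baseline can be computed from a window of call records. The record layout (`latency_ms`, `tokens`, `tool`, `error`, `hour`) is an assumption made for this example, not a documented log format.

```python
from collections import Counter
from statistics import mean, quantiles


def build_baseline(calls):
    """Summarize one agent's calls (needs at least two records).

    Each call is a dict with hypothetical keys: "latency_ms", "tokens",
    "tool" (name or None), "error" (bool), and "hour" (0-23).
    """
    latencies = [c["latency_ms"] for c in calls]
    cuts = quantiles(latencies, n=100, method="inclusive")  # 99 cut points
    tools = Counter(c["tool"] for c in calls if c["tool"])
    total_tools = sum(tools.values())
    return {
        "latency_mean": mean(latencies),
        "latency_p95": cuts[94],
        "tokens_mean": mean(c["tokens"] for c in calls),
        # Share of tool calls that went to each tool.
        "tool_share": {t: n / total_tools for t, n in tools.items()},
        "error_rate": sum(c["error"] for c in calls) / len(calls),
        "active_hours": sorted({c["hour"] for c in calls}),
    }
```

Comparing a fresh window against this summary turns "significant deviation" into a concrete, tunable check.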
Hypothesis-Driven AI Threat Hunting
Good hunting starts with a hypothesis, not with raw log diving. A hypothesis is a testable claim about an attack. Here are five common hypotheses mapped to MITRE ATLAS and how to hunt them.
Hypothesis 1: Prompt Injection Against a Customer-Facing Chatbot
MITRE ATLAS Technique: AML.T0051 (LLM Prompt Injection)
The Hypothesis: A customer service chatbot has been manipulated to return sensitive information or execute unauthorized actions in response to crafted user inputs.
Hunt:
1. Look for a spike in gateway decision logs showing BLOCK for "prompt injection" or "jailbreak" classification.
2. Correlate the timing of these blocks with customer complaints or unusual support tickets.
3. Examine the prompts that triggered blocks. Do they follow a pattern? Are they from the same user, company, or IP range?
4. Cross-reference successful chatbot outputs during the same window. Did the agent return unusual information or call unexpected tools?
5. Check if the agent's system prompt has been modified recently.
Hypothesis 2: Unauthorized Tool Calls from an AI Agent
MITRE ATLAS Technique: AML.T0053 (LLM Plugin Compromise)
The Hypothesis: An autonomous agent has been tricked into calling a tool or API it should not access, such as a billing system or internal database.
Hunt:
1. Baseline the tool-call patterns for each agent over the last 30 days.
2. Look for a new tool call that has never appeared before in that agent's logs.
3. Examine the agent's justification or reasoning for the call. (Prompt and tool-invocation logs should explain why the agent chose to call that tool.)
4. Check the parameters. Are they reasonable, or are they attempting to delete records, bypass filters, or exfiltrate data?
5. Correlate the tool call with user inputs. Was it triggered by a crafted prompt?
Hypothesis 3: Latency Anomalies Suggesting Model Tampering
MITRE ATLAS Technique: AML.T0018 (Backdoor ML Model)
The Hypothesis: Model weights or in-memory behavior have been altered, causing unusual latency, error rates, or token-usage patterns.
Hunt:
1. Plot the per-call latency distribution for each agent over time. Use percentiles (p50, p95, p99).
2. Alert on sudden shifts upward or downward (e.g., a 50% increase in average latency).
3. Correlate with gateway decision logs. Are there more CONSTRAIN or BLOCK decisions during high-latency periods?
4. Check token usage. A sudden spike in output tokens per call might indicate the model is becoming verbose or hallucinating.
5. Review error logs. Are error rates increasing?
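The percentile comparison in this hunt can be sketched as follows. The function names and the 0.5 (50%) threshold are illustrative assumptions. Both inputs are plain lists of latency samples for one agent.

```python
from statistics import quantiles


def percentiles(samples):
    """Return p50, p95, and p99 for latency samples (needs >= 2 values)."""
    cuts = quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def latency_shifts(baseline, recent, threshold=0.5):
    """Return {percentile: relative change} where |change| >= threshold.

    A threshold of 0.5 flags a 50% move in either direction.
    """
    base, cur = percentiles(baseline), percentiles(recent)
    shifts = {}
    for name, base_value in base.items():
        if base_value <= 0:
            continue  # avoid dividing by a zero or negative baseline
        change = (cur[name] - base_value) / base_value
        if abs(change) >= threshold:
            shifts[name] = round(change, 3)
    return shifts
```

Checking each percentile separately helps separate a broad slowdown (p50 moves) from a tail-only change (only p99 moves), which often point to different causes.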
Hypothesis 4: Data Exfiltration via Agent Output
MITRE ATLAS Technique: AML.T0057 (LLM Data Leakage)
The Hypothesis: An agent has been manipulated to leak sensitive data (database schemas, internal IDs, confidential information) in its responses to users.
Hunt:
1. Monitor agent output for sensitive patterns (email addresses, API keys, database table names, internal domain names, financial data).
2. Baseline normal output patterns. When does the agent legitimately return IDs or schemas?
3. Look for sudden increases in output containing sensitive patterns, especially to unusual destinations or recipients.
4. Correlate with user inputs. Did a specific set of prompts trigger sensitive leaks?
5. Check if the agent's context window or RAG retrieval has changed. A sudden expansion might explain increased data leakage.
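Step 1 of this hunt is often implemented as pattern matching over logged outputs. The sketch below uses a handful of illustrative regular expressions. The internal domain `corp.example.com` is a placeholder, and a real deployment would need patterns tuned to its own data.

```python
import re

# Illustrative patterns only; extend and tune these for your environment.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Placeholder internal domain, assumed for the example.
    "internal_domain": re.compile(r"\b[a-z0-9-]+\.corp\.example\.com\b"),
    "sql_schema": re.compile(r"\bCREATE\s+TABLE\b", re.IGNORECASE),
}


def scan_output(text):
    """Return the sorted names of sensitive patterns found in an agent output."""
    return sorted(name for name, pattern in SENSITIVE_PATTERNS.items()
                  if pattern.search(text))
```

Counting matches per agent per day, rather than alerting on every single hit, makes it easier to apply the baseline comparison in steps 2 and 3.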
Hypothesis 5: Agent Activity Outside Business Hours
MITRE ATLAS Tactic: AML.TA0002 (Reconnaissance)
The Hypothesis: An agent is being accessed or used outside normal operating hours, suggesting automated attacks or after-hours reconnaissance.
Hunt:
1. Profile normal operating hours for each agent (when it is typically invoked by users).
2. Look for agent invocations outside those hours.
3. Correlate with user identities. Are these calls coming from expected users, or from service accounts, unknown IPs, or high-risk geographies?
4. Examine what the agent is doing during off-hours. Is it running structured reconnaissance queries (repeatedly requesting data, enumerating resources)?
5. Check for patterns of escalating privilege or data access attempts.
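Steps 1 and 2 can be approximated by deriving each agent's active hours from history and then filtering new invocations against them. The `min_share` cutoff and the record keys (`timestamp`, `caller`) are assumptions for this sketch.

```python
from collections import Counter
from datetime import datetime


def active_hours(history, min_share=0.02):
    """Hours of day (0-23) that hold at least min_share of past invocations.

    history: non-empty list of datetimes for one agent. The cutoff keeps rare
    one-off invocations from being learned as "normal" hours.
    """
    counts = Counter(ts.hour for ts in history)
    total = sum(counts.values())
    return {hour for hour, n in counts.items() if n / total >= min_share}


def off_hours(invocations, normal_hours):
    """Return invocations (dicts with "timestamp", "caller") outside normal_hours."""
    return [i for i in invocations if i["timestamp"].hour not in normal_hours]
```

The returned records still carry the caller identity, so they can feed directly into the identity correlation in step 3.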
A Concrete KQL Example: Hunting for Tool Abuse in AI Gateway Logs
Here is a Kusto Query Language (KQL) example for a Microsoft Sentinel hunt over Vaikora gateway decision logs. This query identifies agents that are calling tools outside their normal baseline pattern:
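// Tools each agent called, per hour, over the last 7 days
VaikoraDecisionLog
| where TimeGenerated > ago(7d)
| summarize ToolCount = dcount(ToolName), ToolList = make_set(ToolName) by AgentId, bin(TimeGenerated, 1h)
| join kind=inner (
    // Baseline: every tool each agent called between 60 and 7 days ago
    VaikoraDecisionLog
    | where TimeGenerated between (ago(60d) .. ago(7d))
    | summarize BaselineToolCount = dcount(ToolName), BaselineTools = make_set(ToolName) by AgentId
) on AgentId
// Dynamic arrays cannot be subtracted with "-"; set_difference returns
// the tools in ToolList that are absent from BaselineTools
| extend NewTools = set_difference(ToolList, BaselineTools)
| where array_length(NewTools) > 0
| project TimeGenerated, AgentId, ToolList, NewTools, ToolCount
| order by TimeGenerated desc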
This query identifies tools an agent called in the last 7 days that it never called during the baseline window (from 60 days ago to 7 days ago). Because the join is an inner join, agents with no activity in the baseline window do not appear in the results. A hit indicates either legitimate expansion of the agent's capabilities or a compromise where an attacker is using the agent to access previously unused integrations.
Building an AI-Specific Hunting Playbook
Effective AI threat hunting requires repeatable playbooks:
- Define Data Sources. What AI systems do you operate? What logs does each system produce? Ingest gateway logs, agent logs, tool-call logs, and API audit trails into a centralized SIEM.
- Establish Baselines. For each agent, calculate baseline metrics: normal tool-call distribution, latency p50/p95/p99, active hours, error rates, and output patterns. Update baselines weekly.
- Create Hypotheses. Based on MITRE ATLAS and your threat model, write 5 to 10 hunting hypotheses. Prioritize those most relevant to your business (e.g., if an agent accesses customer data, prioritize data exfiltration hunts).
- Write Queries. Translate each hypothesis into a SIEM query (KQL, SPL, PromQL, or your platform's language). Start with basic anomaly detection (deviation from baseline) and layer in correlation (does this query hit also show signs of user manipulation?).
- Review and Refine. Run each query weekly. False positives are expected; tune thresholds and refine the logic. When you find a true positive, update the playbook with lessons learned.
- Automate Alerts. For high-confidence hunts (e.g., gateway BLOCK spike + new tool call + off-hours access), configure automated alerts and runbooks so the SOC is notified in real time.
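The combined-signal alert in the last step can be expressed as a small triage rule. This is a sketch under assumed names: `HuntSignals`, the action strings, and the "three of three pages the SOC" policy are illustrative choices, not a prescribed runbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HuntSignals:
    """Per-agent results of three independent hunts (hypothetical structure)."""
    agent_id: str
    block_spike: bool
    new_tool_call: bool
    off_hours_access: bool


def triage(signals):
    """Map co-occurring signals to an action; the cutoffs are illustrative."""
    hits = sum((signals.block_spike, signals.new_tool_call, signals.off_hours_access))
    if hits == 3:
        return "page_soc"      # high confidence: notify in real time
    if hits == 2:
        return "open_ticket"   # worth a manual look during the next review
    return "log_only"          # single signals are noisy on their own
```

Requiring several independent signals before paging keeps the false-positive rate manageable while single-signal hits remain available for the weekly review.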
Detecting Prompt Injection at Scale
Prompt injection is among the most common AI-based attacks, and it holds the first position (LLM01) in the OWASP Top 10 for LLM Applications. It requires no special access, no model compromise, and no backdoors. An attacker simply crafts an input.
Detection Signals:
- Input Anomalies. Inputs that are much longer than baseline, contain unusual keywords, or have unnatural structure (e.g., markup, repeated instructions, role-play prompts) may be injections.
- Output Anomalies. Outputs that violate policy (returning API keys, executing code, calling tools the agent should not access, or returning system prompts).
- Behavioral Breaks. An agent suddenly violating its instructions for the first time suggests a successful injection.
- Gateway Blocks. The AI gateway should detect many injection attempts and block them. A cluster of blocks followed by a successful output suggests an attacker is iterating toward an evasion.
Hunt Example: Look for a user account that submits 10+ prompts in a single session, each blocked for "prompt injection," followed by one successful prompt that results in a tool call or data leak. This pattern suggests the attacker is refining the injection until it bypasses the gateway's detection.
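This pattern can be sketched as a per-session scan over gateway events. The event keys (`session_id`, `timestamp`, `decision`, `threat`, `tool_called`) and the `prompt_injection` label are assumptions for illustration. For brevity, the sketch checks only for a follow-on tool call, not for a data leak.

```python
from itertools import groupby
from operator import itemgetter


def iterated_evasions(events, min_blocks=10):
    """Return session ids where repeated blocked injections precede a success.

    events: dicts with hypothetical keys "session_id", "timestamp",
    "decision", "threat", and "tool_called" (bool). A session is flagged when
    at least min_blocks prompt-injection BLOCKs are followed by an ALLOW
    that resulted in a tool call.
    """
    flagged = []
    ordered = sorted(events, key=itemgetter("session_id", "timestamp"))
    for session_id, session in groupby(ordered, key=itemgetter("session_id")):
        blocks = 0
        for event in session:
            if event["decision"] == "BLOCK" and event["threat"] == "prompt_injection":
                blocks += 1
            elif (event["decision"] == "ALLOW" and event["tool_called"]
                  and blocks >= min_blocks):
                flagged.append(session_id)
                break
    return flagged
```

Lowering `min_blocks` catches faster attackers at the cost of more false positives from users who rephrase a legitimately blocked request.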
Responding to AI-Based Security Incidents
When your hunt confirms a compromise or attack, move to response:
- Isolate the Agent. Take the agent offline or restrict its permissions (e.g., read-only access, no external API calls).
- Preserve Evidence. Export all gateway logs, tool-call logs, and agent memory/context for forensics. Do not truncate or archive evidence until the investigation is complete.
- Audit Model State. If the attack involved model manipulation, verify the model's weights and behavior against a known-good baseline. Redeploy from a trusted checkpoint.
- Review Access. Who has access to the agent's configuration, prompts, or API keys? Could any of these have been compromised in the attack chain?
- Trace Data Exposure. If data exfiltration occurred, trace where the data went. Who received the agent's output? Has it been shared externally?
- Update Hunting Rules. Document the attack and update your playbooks so future similar attacks are caught sooner.
How Vaikora Helps AI Threat Hunting
Vaikora's gateway and decision-log architecture provides the primary data source that makes AI hunts possible. Every request to the LLM is intercepted, inspected, and logged with the detection reason, model state, and access controls applied. The signed decision log creates an immutable audit trail, so hunters can correlate gateway blocks, prompt injections, and policy violations with agent behavior downstream.
Vaikora's threat detection engine (trained on OWASP LLM Top 10 and MITRE ATLAS) classifies each decision as ALLOW, LOG, CONSTRAIN, or BLOCK in under a second. A SOC can ingest Vaikora's decision logs into a SIEM, then hunt over this normalized, actionable data instead of raw LLM telemetry. The open-core gateway is self-hosted, so SOCs retain full data sovereignty and can query logs without vendor lock-in.
Frequently Asked Questions
How do you threat hunt for AI attacks?
Threat hunting for AI attacks uses hypothesis-driven methodology. Start with a testable hypothesis based on MITRE ATLAS tactics (e.g., prompt injection, unauthorized tool calls). Ingest AI gateway logs, tool-call records, and behavioral baselines into your SIEM. Write queries that detect deviations from normal operation (new tools, latency spikes, off-hours access, or policy violations). Correlate multiple signals (e.g., gateway BLOCK + new tool call + unusual output) to confirm the attack.
What data do SOC teams use to hunt AI threats?
SOC teams hunt over AI gateway decision logs (ALLOW, LOG, CONSTRAIN, and BLOCK decisions with threat classifications), tool-call audit logs (which APIs the agent invoked, when, and with what parameters), prompt and response metadata (user input, model output, retrieved context), anomalous egress logs (outbound API calls from AI systems), and behavioral baselines (normal latency, error rates, tool distribution per agent). Centralized logging in a SIEM enables correlation across all these sources.
What are signs of an AI agent attack?
Signs of an AI agent attack include: a sudden spike in gateway blocks for prompt injection or jailbreak patterns, agent calls to tools it has never called before, unexpected outbound API calls to unfamiliar destinations, unusual latency or token-usage patterns, agent activity outside normal business hours, agent outputs containing sensitive data (API keys, database schemas, internal IDs), and agent behavior that violates its documented instructions or access controls.
How do you investigate an AI agent security incident?
When investigating, isolate the agent and restrict its permissions immediately. Preserve all gateway logs, tool-call logs, and agent state. Verify the model's weights and behavior against a trusted baseline. Audit who has access to the agent's configuration and API keys. Trace any exfiltrated data to determine who received it and whether it was shared externally. Document the attack and update hunting playbooks so similar attacks are detected faster in the future.
What is MITRE ATLAS in the context of AI security?
MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) is a knowledge base of adversary tactics and techniques against AI systems, analogous to MITRE ATT&CK for traditional cybersecurity. ATLAS documents AI-specific attack patterns, including prompt injection, model poisoning, supply chain compromise, and data exfiltration. SOC teams use ATLAS to structure threat hunts and ensure they are testing realistic attack vectors aligned with known adversary techniques.
What role does an AI gateway play in threat hunting?
An AI gateway intercepts all LLM requests and responses, applies security policies (detecting prompt injection, jailbreaks, policy violations), and logs every decision with metadata. The signed decision log becomes the primary data source for AI threat hunting, providing visibility into attack attempts that SOC teams can correlate with agent behavior, tool calls, and downstream impact.
Can AI threat hunting be automated?
Yes. Once you have validated hunting hypotheses against real data, you can automate the queries as scheduled SIEM detections or continuous streaming analytics. Automated alerts can trigger runbooks (e.g., isolate agent, notify SOC, preserve logs) when high-confidence hunts hit. However, initial hypothesis creation and tuning require human expertise and should not be fully automated.
How often should AI threat hunting be performed?
AI threat hunting should run continuously on baseline metrics (e.g., latency, tool-call distribution, error rates) with automated alerts. Hypothesis-driven hunts should be executed at least weekly for critical agents. After each hunt, whether it yields a positive or a negative, update your baseline and refine the query so the next hunt is more targeted and reduces false positives.