AI Agent Observability: Logs, Traces, and Audit Trails

Compliance & Audit · June 30, 2026 · 11 min read

AI agent observability is the ability to see and understand what an AI agent is doing at every step, from initial request through tool execution to final response. It requires three complementary systems: structured logs that record decisions and actions, distributed traces that map multi-step reasoning chains, and immutable audit records that prove compliance with policy. Without observability, agents become black boxes. With it, you can debug failures, detect attacks, and satisfy regulators.

Why Observability for AI Agents Matters

Traditional software observability answers three questions: What happened? How long did it take? Did it fail? AI agents require a fourth: Was that decision allowed?

An AI agent is not deterministic code. It reasons about a request, decides which tool to call, interprets the result, and possibly loops. Every step carries risk. Did it access data you meant to restrict? Did it fall for a prompt injection? Did it violate a compliance policy? You cannot know without logs. You cannot prove compliance without audit trails.

Observability also uncovers blind spots. Consider an agent that succeeds 99% of the time but fails silently the other 1%, sometimes returning incorrect data, sometimes leaking context. If you only monitor uptime, those failures can go unnoticed for months. Structured logs catch them.

Three Pillars of Agent Observability

Pillar 1: Structured Logs for Decisions and Actions

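A structured log for an AI agent is not "Agent started. Agent called tool X. Agent finished." That tells you almost nothing. A structured log captures the decision and its context.

Structured logs should record the user input, the agent's reasoning, the tool selected and its parameters, the tool's output or error, the policy decision, and timestamps.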

Example structured log entry:

{
  "timestamp": "2026-06-30T14:23:45.123456Z",
  "agent_id": "sales-lead-agent-v2",
  "session_id": "sess_abc123def456",
  "step": 1,
  "input": {
    "user_query": "Create a new customer for Acme Corp",
    "user_role": "sales_rep",
    "constraints": ["customer_country_whitelist: [US, CA]", "max_credit_check_cost: 50"]
  },
  "model_reasoning": {
    "thought": "User wants to create a customer. I need to check eligibility, then create the customer record.",
    "tool_selected": "check_customer_eligibility",
    "confidence": 0.94
  },
  "tool_call": {
    "name": "check_customer_eligibility",
    "params": {
      "company_name": "Acme Corp",
      "country": "US"
    }
  },
  "tool_result": {
    "eligible": true,
    "credit_score": 750,
    "cost_used": 25
  },
  "policy_decision": {
    "status": "ALLOW",
    "evaluated_policies": ["country_whitelist", "credit_limit", "cost_budget"],
    "reason": "All constraints satisfied"
  },
  "metrics": {
    "latency_ms": 342,
    "tokens_used": 187
  }
}

This log is machine-readable and human-debuggable. You can aggregate it, alert on patterns, and replay it to understand exactly why the agent did what it did.
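As a minimal sketch of how such an entry might be emitted, the snippet below builds a reduced version of the record above and writes it as one JSON line through Python's standard logging module. The logger name and the helper names build_log_entry and log_step are assumptions made for illustration, not part of any particular agent framework.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("agent.decisions")  # hypothetical logger name


def build_log_entry(agent_id, session_id, step, tool_call, tool_result, policy_decision):
    """Assemble one structured log entry as a plain dict."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent_id": agent_id,
        "session_id": session_id,
        "step": step,
        "tool_call": tool_call,
        "tool_result": tool_result,
        "policy_decision": policy_decision,
    }


def log_step(entry):
    """Serialize the entry as a single JSON line, emit it, and return the line."""
    line = json.dumps(entry, sort_keys=True)
    logger.info(line)
    return line


entry = build_log_entry(
    agent_id="sales-lead-agent-v2",
    session_id="sess_abc123def456",
    step=1,
    tool_call={"name": "check_customer_eligibility", "params": {"country": "US"}},
    tool_result={"eligible": True},
    policy_decision={"status": "ALLOW", "reason": "All constraints satisfied"},
)
line = log_step(entry)
```

Writing one JSON object per line keeps the output easy to ship to a log aggregator and to parse back later.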

Pillar 2: Distributed Traces for Multi-Step Reasoning

A trace is a map of how a request flows through your system. In a traditional microservices application, a trace shows how a request bounced from service A to service B to service C. In an AI agent system, a trace shows how a request triggered reasoning, which generated a tool call, which produced a result, which triggered further reasoning, which eventually produced a response.

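Use OpenTelemetry spans to represent each step. A span is a timed operation with a name, start time, end time, and attributes. For an AI agent, spans should represent the agent's main request, each LLM invocation, each policy evaluation, each tool execution, and error handling.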

A span includes a parent span ID so you can reconstruct the full tree. Tools like Grafana Tempo or Datadog ingest OpenTelemetry traces and visualize them as flame graphs or waterfall diagrams.

Example trace structure in Python using OpenTelemetry:

import json

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure Jaeger exporter (or use your observability backend).
# Note: the Jaeger Thrift exporter package is deprecated; newer setups
# typically export OTLP instead.
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(jaeger_exporter))

tracer = trace.get_tracer(__name__)

# call_llm, get_available_tools, parse_tool_choice, evaluate_policy, and
# execute_tool are placeholders for your own agent framework's functions.
def run_agent(user_query, user_context):
    with tracer.start_as_current_span("agent_request") as agent_span:
        agent_span.set_attribute("agent_id", "sales-lead-agent-v2")
        agent_span.set_attribute("user_role", user_context["role"])

        # First LLM call: decide which tool to use
        with tracer.start_as_current_span("llm_call_reason") as reason_span:
            reason_span.set_attribute("model", "gpt-4")
            response = call_llm(
                model="gpt-4",
                messages=[{"role": "user", "content": user_query}],
                tools=get_available_tools()
            )
            reason_span.set_attribute("tokens_input", response.usage.prompt_tokens)
            reason_span.set_attribute("tokens_output", response.usage.completion_tokens)
            tool_choice = parse_tool_choice(response)

        # Policy evaluation: is this tool call allowed?
        with tracer.start_as_current_span("policy_check") as policy_span:
            policy_span.set_attribute("policy_engine", "vaikora")
            decision = evaluate_policy(tool_choice, user_context)
            policy_span.set_attribute("decision", decision.status)
            policy_span.set_attribute("policy_rule", decision.rule_name)

        if decision.status == "ALLOW":
            # Tool execution
            with tracer.start_as_current_span("tool_execution") as tool_span:
                tool_span.set_attribute("tool_name", tool_choice.tool_name)
                result = execute_tool(tool_choice)
                tool_span.set_attribute("result_size_bytes", len(json.dumps(result)))
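                # Span attribute values cannot be None, so record latency only when present
                latency_ms = result.get("latency_ms")
                if latency_ms is not None:
                    tool_span.set_attribute("execution_time_ms", latency_ms)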

            return result
        else:
            return {"error": f"Tool call blocked: {decision.reason}"}

This example uses Jaeger as the observability backend, but OpenTelemetry works with Datadog, Grafana Tempo, and others. You can then visualize the trace as a waterfall showing the hierarchy of spans and identify where time is spent.

Pillar 3: Compliance Audit Records

Observability is for debugging. Audit is for evidence. An audit record is a timestamped, signed entry that proves what happened and what the policy decision was. It cannot be retroactively modified; it lives in an append-only log.

For AI agents, audit records should capture the timestamp, user ID, tool name, parameters, policy decision, and a signature.

Audit records are immutable by design. Once written, they cannot be changed, deleted, or reordered. This makes them suitable for regulatory compliance (HIPAA, SOC 2, PCI DSS, GDPR).

Example audit record:

{
  "audit_id": "aud_20260630_001234",
  "timestamp": "2026-06-30T14:23:45.123456Z",
  "principal": {
    "user_id": "user_99887",
    "session_id": "sess_abc123def456",
    "ip_address": "203.0.113.42"
  },
  "action": {
    "agent_id": "sales-lead-agent-v2",
    "tool": "create_customer",
    "params": {
      "company_name": "Acme Corp",
      "country": "US",
      "contact_email": "john@acme.corp"
    }
  },
  "decision": {
    "status": "ALLOW",
    "policy_rule": "country_whitelist",
    "enforced_constraints": ["customer_country_whitelist"]
  },
  "signature": {
    "algorithm": "HMAC-SHA256",
    "value": "a3f4e8c2b1d9f7e6c5a4b3f2e1d0c9b8a7f6e5d4c3b2a1f0e9d8c7b6a5f4e3d2"
  },
  "chain_hash": {
    "current_hash": "7f3e2d1c0b9a8f7e6d5c4b3a29f1e8d7c6b5a4f3e2d1c0b9a8f7e6d5c4b3a",
    "previous_hash": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0"
  }
}

This record proves that at timestamp T, user U requested action A, it was evaluated against policy P, and the decision was D. The HMAC signature shows the record was written by a holder of the signing key. The chain hash makes any reordering or editing of the log detectable.
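The signing and chaining described above can be sketched with the Python standard library. This is an illustrative sketch under stated assumptions, not Vaikora's implementation: the hard-coded key, the all-zero genesis hash, the canonical JSON encoding, and the function names append_record and verify_chain are all choices made for the example.

```python
import hashlib
import hmac
import json

SECRET_KEY = b"demo-key-do-not-use"  # assumption: real keys come from a secrets manager
GENESIS_HASH = "0" * 64  # hypothetical previous_hash for the first record


def _canonical(payload):
    """Encode the payload deterministically so hashes are reproducible."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def append_record(chain, payload):
    """Sign the payload, link it to the previous record, and append it."""
    previous_hash = chain[-1]["chain_hash"]["current_hash"] if chain else GENESIS_HASH
    body = _canonical(payload)
    signature = hmac.new(SECRET_KEY, body, hashlib.sha256).hexdigest()
    current_hash = hashlib.sha256(previous_hash.encode("ascii") + body).hexdigest()
    record = {
        "payload": payload,
        "signature": {"algorithm": "HMAC-SHA256", "value": signature},
        "chain_hash": {"current_hash": current_hash, "previous_hash": previous_hash},
    }
    chain.append(record)
    return record


def verify_chain(chain):
    """Return True only if every signature and every hash link checks out."""
    expected_previous = GENESIS_HASH
    for record in chain:
        body = _canonical(record["payload"])
        signature = hmac.new(SECRET_KEY, body, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(signature, record["signature"]["value"]):
            return False
        if record["chain_hash"]["previous_hash"] != expected_previous:
            return False
        current = hashlib.sha256(expected_previous.encode("ascii") + body).hexdigest()
        if current != record["chain_hash"]["current_hash"]:
            return False
        expected_previous = current
    return True


chain = []
append_record(chain, {"tool": "check_customer_eligibility", "status": "ALLOW"})
append_record(chain, {"tool": "create_customer", "status": "ALLOW"})
```

Editing a payload invalidates its signature, and swapping two records breaks the previous_hash links, so verify_chain returns False in both cases. An in-memory list is used only to keep the sketch short; a real deployment would write to an append-only store.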

Distinguishing Observability from Audit

The two are often confused. Here is the key difference:

Observability answers: What happened, and why did it happen? It is used for debugging, improving performance, and detecting anomalies. Logs and traces are observability. They can be sampled, summarized, or dropped after a retention period. They are stored in systems optimized for querying and analysis (Elasticsearch, Datadog, Grafana Loki).

Audit answers: What happened, and can you prove it? It is used for compliance, investigations, and legal evidence. Audit records are cryptographically signed and chained. They are immutable, legally defensible, and retained for as long as regulations require (typically 6 years for HIPAA and 7 years for most financial records, depending on applicable law). They are stored in append-only systems (Vaikora's audit chain, write-once database tables, or dedicated compliance log stores).

Good systems have both. You might sample 1 in 100 debug logs (for performance), but you record every audit event (immutable). A failed tool call goes into observability logs so you can debug it; a denied policy decision goes into an audit record so you can prove to a regulator that you blocked unauthorized access.
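One way to express that split in code is a small router that keeps every audit event but samples debug events. The sketch below is an assumption-laden illustration: the event shape, the "kind" field, and the function names are hypothetical. It samples by session rather than by individual log line, so a sampled session keeps all of its steps.

```python
import hashlib

DEBUG_SAMPLE_RATE = 100  # keep debug logs for roughly 1 in 100 sessions


def keep_debug_log(session_id, sample_rate=DEBUG_SAMPLE_RATE):
    """Deterministically decide whether a session's debug logs are kept."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % sample_rate == 0


def route_event(event, debug_sink, audit_sink):
    """Always keep audit events; keep debug events only for sampled sessions."""
    if event["kind"] == "audit":
        audit_sink.append(event)
    elif keep_debug_log(event["session_id"]):
        debug_sink.append(event)


debug_sink, audit_sink = [], []
for i in range(1000):
    session_id = f"sess_{i}"
    route_event({"kind": "audit", "session_id": session_id}, debug_sink, audit_sink)
    route_event({"kind": "debug", "session_id": session_id}, debug_sink, audit_sink)
```

Hashing the session ID instead of calling a random number generator means the same session is always either kept or dropped, which keeps sampled traces complete.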

Detecting Anomalies in Agent Logs

With structured logs and traces in place, you can apply anomaly detection to catch unexpected behavior.

Unusual tool chains. If your agent typically calls tool A, then tool B, then tool C, but one day it calls tool A, then tool D (a data exfiltration tool), that is worth flagging.

Latency spikes. An LLM call that usually takes 500ms but suddenly takes 5 seconds deserves a look. It might indicate a jailbreak attempt (the model's reasoning loop being attacked), or it might mean the model is confused, which can precede errors.

Policy decisions changing. If your agent's decisions normally run 99 ALLOW to 1 BLOCK, but the ratio is now 50 ALLOW to 50 BLOCK, either a policy rule has changed or the model's behavior has shifted. Investigate.

Constraint violations. If your agent is constrained to access customer data for the current user only, but logs show it accessing data for 10 different users in one session, an attack or a bug is happening.

These anomalies become visible only with structured logs. A black-box agent that you never log will hide them until damage occurs.
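The four checks above can be sketched as simple threshold rules over log data. Everything here is illustrative: the function names, the default thresholds, and the accessed_user_id field are assumptions, and a production system would tune or replace them.

```python
import statistics
from collections import defaultdict


def unusual_transitions(tools, known_pairs):
    """Return (previous_tool, next_tool) pairs never seen in the baseline."""
    return [pair for pair in zip(tools, tools[1:]) if pair not in known_pairs]


def is_latency_spike(history_ms, latest_ms, factor=5.0):
    """Flag a call slower than `factor` times the historical median."""
    return bool(history_ms) and latest_ms > factor * statistics.median(history_ms)


def block_rate(decisions):
    """Fraction of decisions that were BLOCK (0.0 for an empty list)."""
    return decisions.count("BLOCK") / len(decisions) if decisions else 0.0


def block_rate_shifted(baseline, recent, max_delta=0.10):
    """Flag when the BLOCK rate moved more than `max_delta` from baseline."""
    return abs(block_rate(recent) - block_rate(baseline)) > max_delta


def sessions_over_user_limit(entries, max_users=1):
    """Map session_id to users accessed, for sessions touching too many users."""
    users = defaultdict(set)
    for entry in entries:
        users[entry["session_id"]].add(entry["accessed_user_id"])
    return {sid: sorted(u) for sid, u in users.items() if len(u) > max_users}


known = {("tool_a", "tool_b"), ("tool_b", "tool_c")}
print(unusual_transitions(["tool_a", "tool_d"], known))
print(is_latency_spike([500] * 20, 5000))
print(block_rate_shifted(["ALLOW"] * 99 + ["BLOCK"], ["ALLOW"] * 50 + ["BLOCK"] * 50))
```

With the sample inputs, the first call reports the tool_a to tool_d transition, and the latency and block-rate checks both return True. Fixed thresholds like these are a starting point; statistical baselines per agent are a natural next step.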

Compliance-Ready Audit Trails in Practice

Here is how audit trails support regulatory compliance:

HIPAA requires an audit trail of every access to patient data. If your agent accesses a patient's medical records, that access must be logged, with the user ID, timestamp, and reason (which tool was called). Immutable records prove you logged it. If you later need to prove you complied, you pull the audit trail.

SOC 2 (Type II) requires demonstrating control over access and change management. Audit records prove that agent actions were evaluated against policy and approved before execution.

GDPR requires records of processing activities and a lawful basis for processing. When a user exercises the right to erasure (the "right to be forgotten"), you need to know which systems accessed their data, so you know which systems to delete from.

PCI DSS requires logging of all access to cardholder data. If your agent interacts with payment systems, every interaction must be audited.

The thread connecting all of these is immutability. A log file that can be edited retroactively is worthless as evidence. An append-only, signed audit chain is defensible in court and satisfies auditor expectations.
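As a small illustration of how an audit trail answers a regulator's or data subject's question, the sketch below lists which tools touched a given person's data. It assumes audit records shaped like the example earlier, plus a hypothetical data_subject_id parameter that your own tools would need to record.

```python
def tools_touching_subject(audit_records, data_subject_id):
    """Return the sorted set of tools whose audit records reference the subject."""
    return sorted({
        record["action"]["tool"]
        for record in audit_records
        if record["action"].get("params", {}).get("data_subject_id") == data_subject_id
    })


# Hypothetical records; data_subject_id is an assumed field, not a standard one
audit_records = [
    {"action": {"tool": "read_medical_record", "params": {"data_subject_id": "p_1"}}},
    {"action": {"tool": "create_invoice", "params": {"data_subject_id": "p_1"}}},
    {"action": {"tool": "read_medical_record", "params": {"data_subject_id": "p_2"}}},
    {"action": {"tool": "check_customer_eligibility", "params": {"country": "US"}}},
]
touched = tools_touching_subject(audit_records, "p_1")
```

The same query pattern supports a HIPAA access report or a PCI DSS review, provided each record identifies whose data was accessed.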

Building Agent Observability: A Practical Approach

Start with these four steps:

  1. Add structured logging to every agent action. For each tool call, log the input, output, and policy decision. Use JSON format so you can parse and aggregate logs programmatically.

  2. Wire OpenTelemetry spans into your agent loop. Wrap the agent invocation, each LLM call, each tool execution, and each policy check in a span. Export spans to your observability backend (Datadog, Grafana Tempo, or self-hosted Jaeger).

  3. Implement a signed audit log. Create a database table or append-only log with immutable records. For each action that could have legal or compliance implications, write a record with a timestamp, decision, and HMAC signature. Include a chain hash so you can detect tampering.

  4. Set up anomaly detection. Configure alerts on unusual tool chains, latency spikes, and policy decision rate changes. Start with simple threshold rules; graduate to statistical models if needed.

How Vaikora Helps

Vaikora is a runtime control plane for AI agents. At the core, it evaluates every tool call against deterministic policy and returns ALLOW, CONSTRAIN, BLOCK, or LOG in under a second. Every decision is automatically written to a signed, append-only audit chain (SHA-256 hashed with the previous record's hash). This means the decisions Vaikora makes are natively audit-ready, with no extra plumbing required.

Vaikora's gateway also emits structured OpenTelemetry traces for every policy evaluation, so traces from Vaikora integrate seamlessly with your observability backend. And because Vaikora sits at the boundary between your agent and the LLM, it sees every decision before it is committed, making it a natural audit and anomaly detection point.

Frequently Asked Questions

What is AI agent observability?

AI agent observability is the ability to see and understand what an AI agent does at every step, from request through tool execution to response. It requires structured logs (decisions and actions), distributed traces (multi-step reasoning chains), and immutable audit records (compliance evidence). Without observability, agents are black boxes; with it, you can debug failures, detect attacks, and prove compliance.

How do you audit AI agent actions for compliance?

Create an append-only audit log with a record for each agent action, including the timestamp, user ID, tool name, parameters, policy decision, and a digital signature (HMAC-SHA256). Chain each record to the previous one using a hash of the current record plus the previous record's hash. This makes the log tamper-evident and legally defensible. Retain audit records for the duration required by applicable regulations (typically 6 years for HIPAA, 7 years for most financial records).

What logs should an AI agent generate?

An AI agent should generate structured logs (JSON format) for each tool call, including: the user input, the agent's reasoning or thought process, the tool selected and its parameters, the tool's output or error, the policy decision (ALLOW, BLOCK, CONSTRAIN), and timestamps. These logs should be machine-readable and queryable so you can aggregate them, alert on anomalies, and replay failures to understand root causes.

How do you trace AI agent decisions for compliance?

Use OpenTelemetry spans to represent each step: the agent's main request, LLM invocations, policy evaluations, tool executions, and error handling. Export spans to an observability backend (Datadog, Grafana Tempo, Jaeger) so you can visualize request flows as waterfall diagrams. For compliance, pair traces with immutable audit records that capture the policy decision and its justification.

What is the difference between AI observability and AI monitoring?

Observability is the ability to understand what happened and why, using logs, traces, and metrics. Monitoring is the practice of watching for specific bad things (e.g., error rate > 5%) and alerting. Observability is passive and exploratory; monitoring is active and threshold-driven. You need both: observability to investigate incidents, monitoring to detect them.

How do AI agent logs help with debugging?

Structured logs capture the input, the agent's reasoning, the tool selected, the tool's output, and the policy decision for each step. When an agent produces a wrong answer, you can examine its logs to see exactly which tool it called, what data the tool returned, and how the agent interpreted that data. This makes it easy to spot whether the bug is in the agent's reasoning, the tool implementation, or the tool's data source.
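A replay helper along these lines might look like the sketch below. It assumes logs are stored one JSON object per line with the session_id, step, tool_call, and policy_decision fields from the example entry earlier; the function name replay_session is hypothetical.

```python
import json


def replay_session(log_lines, session_id):
    """Return a readable, step-ordered summary of one session's log entries."""
    entries = [json.loads(line) for line in log_lines if line.strip()]
    steps = [e for e in entries if e.get("session_id") == session_id]
    steps.sort(key=lambda e: e["step"])
    return [
        f'step {e["step"]}: {e["tool_call"]["name"]} -> {e["policy_decision"]["status"]}'
        for e in steps
    ]


log_lines = [
    json.dumps({"session_id": "sess_1", "step": 2,
                "tool_call": {"name": "create_customer"},
                "policy_decision": {"status": "BLOCK"}}),
    json.dumps({"session_id": "sess_1", "step": 1,
                "tool_call": {"name": "check_customer_eligibility"},
                "policy_decision": {"status": "ALLOW"}}),
    json.dumps({"session_id": "sess_2", "step": 1,
                "tool_call": {"name": "create_customer"},
                "policy_decision": {"status": "ALLOW"}}),
]
summary = replay_session(log_lines, "sess_1")
```

Reading the summary top to bottom shows the order of tool calls and where policy intervened, which narrows down whether the fault lies in reasoning, the tool, or its data.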

Can you use regular application logs for AI agents?

Regular application logs (Apache or nginx access logs, unstructured output from the Python logging module) are not sufficient on their own. They capture HTTP requests, responses, and free-text messages, but not the agent's internal reasoning or policy decisions. You need structured logs that include the agent's thought process, which tool it decided to call, and why. These require custom logging code in your agent framework, not just the standard application logging.

How do compliance frameworks use audit trails?

HIPAA requires proof of access to patient data. SOC 2 requires proof of access controls. GDPR requires proof of lawful data processing. PCI DSS requires proof of cardholder data access. In each case, an immutable, signed audit trail that records who accessed what, when, and whether it was approved provides the evidence auditors ask for. Without it, you cannot prove compliance if audited.
