AI SBOM: Building a Software Bill of Materials for AI
An AI SBOM (Software Bill of Materials for AI) is a comprehensive inventory of the components, dependencies, and data that make up an AI system, including the model itself, training datasets, third-party libraries, prompts, tool integrations, and their licenses and provenance. It extends traditional software SBOMs into the AI domain to provide visibility and control over AI supply chain risk.
Why AI SBOMs Matter for Security
Traditional software supply chain attacks target code repositories and dependencies. AI systems introduce additional attack surfaces: compromised training data, model weights obtained from untrusted sources, prompt injection vulnerabilities routed through third-party APIs, and undocumented tool integrations that bypass security controls.
An AI SBOM gives you an inventory of what is actually in your system. Without it, you cannot answer basic questions: Where did the model weights come from? Which versions of which third-party libraries does the model actually depend on? What training data was used, and did it include PII or biased datasets? When a vulnerability is disclosed in an LLM library, can you trace which of your AI systems are affected?
Industry frameworks like the NIST AI Risk Management Framework emphasize the importance of software transparency for AI systems as part of broader AI supply chain security. An AI SBOM is the foundation for that transparency.
What an AI SBOM Should Include
A complete AI SBOM captures five core categories.
The Model Component
Document the AI model itself: the model name, version, identifier (e.g., the model's SHA-256 hash or a reference to the model registry entry), architecture type (transformer, diffusion, etc.), and the source repository or registry. For proprietary models accessed via API (like Claude or GPT-4), capture the exact version identifier and the provider.
Include the model's training context: the date trained, the knowledge cutoff, and any fine-tuning applied. This is essential for compliance audits and for understanding whether the model is subject to specific regulatory restrictions.
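As a minimal sketch of the identifier step, the snippet below streams a weights file through SHA-256 and wraps the result in a model entry. The function names, the "source" field, and the overall record layout are assumptions made for illustration; only the "hashes" entry follows the alg/content shape that CycloneDX uses, and "machine-learning-model" is the component type name in CycloneDX 1.5 and later.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Stream a file through SHA-256 so large weight files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def model_record(name: str, version: str, source: str, weights: Path) -> dict:
    """Build a model entry. The layout is illustrative, not a complete schema."""
    return {
        "type": "machine-learning-model",
        "name": name,
        "version": version,
        "source": source,  # hypothetical field: registry URL or repository
        "hashes": [{"alg": "SHA-256", "content": sha256_of_file(weights)}],
    }
```

Recomputing the hash at deploy time and comparing it with the recorded value is what turns this entry into a check rather than a label.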
Dependencies and Libraries
List all direct and transitive dependencies: PyTorch, TensorFlow, LangChain, LlamaIndex, transformers library versions, and so on. For each, record the version, license (MIT, Apache 2.0, GPL, etc.), and any known vulnerabilities (from CVE databases or from SBOM scanning tools).
Include inference frameworks and deployment runtimes. If your model runs on vLLM or NVIDIA Triton, document that and its version.
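For Python dependencies, a rough first pass can come from the interpreter itself. The sketch below uses the standard library's importlib.metadata to list every installed distribution with its version and declared license. The output layout and the function name are assumptions, and the License field in package metadata is free text that is often missing, so treat it as a hint to verify rather than an authoritative answer.

```python
from importlib import metadata


def installed_libraries() -> list[dict]:
    """List every distribution visible to this interpreter as a library entry."""
    components = []
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if not name:  # skip partial installs that carry no usable metadata
            continue
        components.append({
            "type": "library",
            "name": name,
            "version": dist.metadata.get("Version") or "UNKNOWN",
            # Free-text field supplied by the package author; may be absent.
            "license": dist.metadata.get("License") or "UNKNOWN",
        })
    return sorted(components, key=lambda c: c["name"].lower())
```

This covers only what is installed in the current environment. A dedicated scanner is still the better source for container layers and non-Python dependencies.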
Training Data and Datasets
Document the datasets used for training and fine-tuning: source, version, size, license, and any known PII or bias characteristics. Include links to dataset repositories (Hugging Face Datasets, TensorFlow Datasets, etc.) and checksums for reproducibility.
For models trained on web data, note that the data is sourced from the open internet and may include copyrighted material. If the model was trained using proprietary datasets, describe the data's origin, retention policy, and any contractual restrictions.
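Dataset checksums can be sketched the same way. The example below assumes the dataset is a directory of files and produces a per-file SHA-256 manifest plus one combined digest for the dataset as a whole. The function name and the manifest layout are hypothetical.

```python
import hashlib
from pathlib import Path


def dataset_manifest(root: Path) -> dict:
    """Hash every file under root and derive one digest for the whole dataset."""
    relative_paths = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    files = {}
    for rel in relative_paths:
        # read_bytes() is fine for a sketch; stream large files in chunks instead.
        files[rel] = hashlib.sha256((root / rel).read_bytes()).hexdigest()
    # The combined digest covers "path:hash" lines, so renames and edits both change it.
    lines = "\n".join(f"{rel}:{digest}" for rel, digest in files.items())
    return {"files": files, "digest": hashlib.sha256(lines.encode("utf-8")).hexdigest()}
```

Storing the combined digest in the SBOM and the per-file manifest next to the dataset keeps the SBOM small while still letting you pinpoint which file changed.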
Prompts and Tool Integrations
If your AI system uses system prompts, retrieval-augmented generation (RAG) indexes, or integrations with external APIs and tools, include them. Document which tools can be called by the model, which APIs it connects to, and any custom prompt templates. This is where runtime control matters: even if your SBOM is complete, prompts and tools can be modified in production, which is why runtime policy enforcement is critical.
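One way to make prompts and tools auditable is to record fingerprints of them rather than only their names. The sketch below hashes a canonical JSON encoding of the system prompt and the tool definitions and emits name/value properties. The property names ("prompt:system-sha256" and so on) and the tool dictionary shape are assumptions for illustration, not part of any standard.

```python
import hashlib
import json


def fingerprint(value) -> str:
    """SHA-256 of a canonical JSON encoding, so key order and spacing do not matter."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def prompt_and_tool_properties(system_prompt: str, tools: list[dict]) -> list[dict]:
    """Express the prompt and tool set as name/value properties for an SBOM component."""
    ordered = sorted(tools, key=lambda tool: tool["name"])  # order-independent result
    return [
        {"name": "prompt:system-sha256", "value": fingerprint(system_prompt)},
        {"name": "tools:enabled", "value": ",".join(tool["name"] for tool in ordered)},
        {"name": "tools:definitions-sha256", "value": fingerprint(ordered)},
    ]
```

If the fingerprint computed from the running configuration differs from the one in the SBOM, the prompt or tool set changed after the SBOM was written.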
Licenses and Compliance Metadata
Capture the license of the model itself (if applicable), the licenses of all dependencies, and any third-party agreements or restrictions. Flag any GPL or copyleft licenses that might impose redistribution obligations. Note any export controls or restricted-use clauses (e.g., some models prohibit commercial deployment or deployment in specific jurisdictions).
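The flagging step can be automated with a small check over the component list. This sketch assumes components carry CycloneDX-style "licenses" entries with an "id" or "name", and it uses a short, deliberately incomplete prefix list to spot copyleft families. Both the prefix list and the function names are assumptions, and the output is a prompt for legal review, not a legal determination.

```python
# Illustrative prefixes only; extend to match your own license policy.
COPYLEFT_PREFIXES = ("GPL", "AGPL", "LGPL", "MPL")


def license_ids(component: dict) -> list[str]:
    """Collect license identifiers from CycloneDX-style license entries."""
    found = []
    for entry in component.get("licenses", []):
        lic = entry.get("license", {})
        found.append(lic.get("id") or lic.get("name") or "UNKNOWN")
    return found


def flag_licenses(components: list[dict]) -> list[tuple[str, str, str]]:
    """Return (component, license, reason) for copyleft or missing licenses."""
    findings = []
    for component in components:
        for lic in license_ids(component) or ["UNKNOWN"]:
            if lic == "UNKNOWN":
                findings.append((component["name"], lic, "missing license"))
            elif lic.upper().startswith(COPYLEFT_PREFIXES):
                findings.append((component["name"], lic, "copyleft"))
    return findings
```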
Building an AI SBOM in Practice
Start by auditing what you deploy. For each AI system or agent, list the model, the inference framework, and all Python, Node.js, or other language dependencies. Use existing SBOM tooling as a foundation.
CycloneDX and SPDX
CycloneDX added machine learning support (ML-BOM) in version 1.5, including the "machine-learning-model" and "data" component types. BOMs are XML or JSON documents. Dependency scanners such as Syft can generate the library portion in a CI/CD pipeline, and you then extend it with model and data metadata. SPDX 3.0 covers similar ground with its AI and Dataset profiles.
Here is a simplified CycloneDX ML-BOM fragment:
<?xml version="1.0" encoding="UTF-8"?>
<bom xmlns="http://cyclonedx.org/schema/bom/1.5" version="1">
  <metadata>
    <component type="application">
      <name>ai-agent-v1.0.0</name>
      <version>1.0.0</version>
    </component>
  </metadata>
  <components>
    <!-- Model component -->
    <component type="machine-learning-model">
      <name>llama2-7b-chat</name>
      <!-- "latest" is a placeholder; pin a registry revision or weight hash in a real BOM -->
      <version>latest</version>
      <licenses>
        <license>
          <name>LLAMA 2 COMMUNITY LICENSE AGREEMENT</name>
        </license>
      </licenses>
      <purl>pkg:huggingface/meta-llama/Llama-2-7b-chat-hf</purl>
      <properties>
        <property name="model:training-data">Publicly available online sources (mix not itemized by Meta)</property>
        <property name="model:knowledge-cutoff">2022-09</property>
      </properties>
    </component>
    <!-- Dependencies -->
    <component type="library">
      <name>transformers</name>
      <version>4.36.2</version>
      <licenses>
        <license>
          <id>Apache-2.0</id>
        </license>
      </licenses>
    </component>
    <component type="library">
      <name>torch</name>
      <version>2.1.0</version>
      <licenses>
        <license>
          <id>BSD-3-Clause</id>
        </license>
      </licenses>
    </component>
    <!-- RAG dataset -->
    <component type="data">
      <name>company-kb</name>
      <version>2026-06-30</version>
      <properties>
        <property name="dataset:source">Internal knowledge base</property>
        <property name="dataset:pii-scan">passed</property>
        <property name="dataset:license">proprietary</property>
      </properties>
    </component>
  </components>
</bom>
Automation in CI/CD
Integrate SBOM generation into your build pipeline. Tools like Syft (for container and source code scanning), SPDX tools, and CycloneDX Maven/Gradle plugins can auto-generate and version SBOMs. For AI-specific components, manually extend the SBOM with model metadata (training date, knowledge cutoff, fine-tuning details) and dataset provenance.
Store the SBOM alongside your deployment artifacts. When you deploy a model to production, the SBOM goes with it. This enables rapid supply chain audits and vulnerability response.
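The manual extension step can itself be scripted. The sketch below takes a scanner-generated CycloneDX JSON document as a dictionary, appends AI components that are not already present, and increments the document's integer "version" field, which CycloneDX uses to track revisions of a BOM. The function name and the deduplication key are assumptions.

```python
import copy


def extend_sbom(generated: dict, ai_components: list[dict]) -> dict:
    """Return a copy of a scanner-generated SBOM with AI components appended."""
    merged = copy.deepcopy(generated)  # never mutate the scanner's output
    components = merged.setdefault("components", [])
    seen = {(c.get("type"), c.get("name"), c.get("version")) for c in components}
    for component in ai_components:
        key = (component.get("type"), component.get("name"), component.get("version"))
        if key not in seen:
            components.append(copy.deepcopy(component))
            seen.add(key)
    # Bump the BOM revision number because the document was modified.
    merged["version"] = int(merged.get("version", 0)) + 1
    return merged
```

Running this as a pipeline step after the scanner keeps the hand-maintained model and dataset entries in a separate, reviewable file.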
Governance and Updates
Treat your SBOM as a living document. When you update a dependency or fine-tune a model on new data, update the SBOM. Version it. When a CVE is disclosed in a dependency, your SBOM lets you search across all deployed systems to see which ones are affected.
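The CVE lookup described above reduces to a search over stored SBOMs. This sketch assumes each deployed system has an SBOM loaded as a dictionary and that versions are plain dotted numbers. The function names are hypothetical, and real version strings (pre-releases, local tags) need a proper version parser such as the third-party packaging library.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Parse dotted numeric versions like '4.36.2'; anything else raises ValueError."""
    return tuple(int(part) for part in text.split("."))


def affected_systems(sboms: dict[str, dict], package: str, fixed_in: str) -> list[str]:
    """Names of systems whose SBOM lists `package` at a version below `fixed_in`."""
    threshold = parse_version(fixed_in)
    hits = []
    for system, sbom in sboms.items():
        for component in sbom.get("components", []):
            if component.get("name") == package and parse_version(component["version"]) < threshold:
                hits.append(system)
                break  # one match is enough to flag the system
    return sorted(hits)
```

Given an advisory that names a package and the first fixed version, the result is the list of deployments to patch first.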
AI SBOMs and Regulatory Compliance
Regulators and standards bodies are beginning to expect AI supply chain transparency. The EU AI Act requires technical documentation, risk assessments, and transparency measures for high-risk AI systems. The NIST AI Risk Management Framework emphasizes transparency and traceability throughout the AI system lifecycle.
Compliance frameworks like ISO/IEC 27001 (information security) and ISO/IEC 42001 (AI management) expect organizations to manage third-party AI component risk, which assumes you know what components you have. PCI DSS, HIPAA, and SOC 2 audits increasingly touch AI systems used in payment processing, healthcare, or security operations, and auditors may ask for evidence of component and dependency tracking.
An SBOM is not a compliance guarantee, but it is prerequisite evidence that auditors increasingly expect.
Addressing Model and Data Risk
An SBOM helps you document and manage risk from the model and training data side, not just the code side. If you use a model trained on scraped web data that may include copyright-protected material, the SBOM documents that. If your training data includes customer PII that was not properly anonymized, the SBOM gives you a record of what went in.
This is where AI systems diverge from traditional software: your dependencies are not just packages; they are data and learned weights. An SBOM forces you to catalog that.
What Goes in an AI SBOM
An effective AI SBOM must capture the complete technical stack of an AI system. This goes beyond the model name and version. It includes the base models and their versions, fine-tuning datasets and their sources, model weights and where they came from, all training and inference framework dependencies, system prompts and templates, external tool and API integrations, and license metadata for every component.
Why does each matter? The base model version determines the model's knowledge cutoff, training data source, and any known limitations or vulnerabilities. A model trained on web data has different copyright and bias risks than one trained on licensed datasets. The fine-tuning dataset introduces new provenance concerns, as fine-tuning can inadvertently encode sensitive information or degrade performance on specific tasks.
Model weights must be traced to their source because model substitution is a real attack vector. If your deployment pulls model weights from an untrusted registry without verifying checksums or signatures, an attacker could inject a trojanized model. The inference framework (PyTorch, TensorFlow, vLLM, NVIDIA Triton) and its dependencies introduce their own vulnerability surface. A CVE in PyTorch or the transformers library can affect every model that depends on them.
System prompts and tool definitions are runtime code, not static artifacts. They control how the model behaves and what it can do. An SBOM must document which prompts are deployed, which APIs the model can call, and which tools are enabled. This matters for security because a modified prompt can bypass your intended guardrails, and undocumented tool access can leak data or trigger unintended actions.
License metadata prevents legal liability. If your system includes a GPL dependency or uses a model trained on copyrighted material, you must know it and document it. Export controls, use-case restrictions, and jurisdictional limitations on specific models must be captured and audited.
CycloneDX and the ML-BOM Format
The CycloneDX standard (maintained by OWASP) has supported machine learning components since version 1.5 through its ML-BOM capability. CycloneDX documents are either XML or JSON formatted and are widely recognized by security tooling and compliance frameworks. SPDX 3.0 adds AI and Dataset profiles for the same purpose, though adoption is still emerging.
CycloneDX has a component type called "machine-learning-model" for models. The model name, version, and license are ordinary component fields, while details such as architecture, training data, and knowledge cutoff go in a model card or in custom properties. It also supports a "data" component type for training and fine-tuning datasets. This structure lets you express the entire AI supply chain in a standard format that scanners and auditors can parse.
Here is an illustrative CycloneDX ML-BOM fragment showing a model component and a dataset component. Note that this is a simplified example meant to show structure, not a definitive schema. In particular, a real BOM should replace the "latest" version with a pinned registry revision or weight hash:
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "version": 1,
  "metadata": {
    "component": {
      "type": "application",
      "name": "ai-agent-v1.0.0",
      "version": "1.0.0"
    }
  },
  "components": [
    {
      "type": "machine-learning-model",
      "name": "llama2-7b-chat",
      "version": "latest",
      "purl": "pkg:huggingface/meta-llama/Llama-2-7b-chat-hf",
      "licenses": [
        { "license": { "name": "LLAMA 2 COMMUNITY LICENSE AGREEMENT" } }
      ],
      "properties": [
        { "name": "model:training-data", "value": "Publicly available online sources (mix not itemized by Meta)" },
        { "name": "model:knowledge-cutoff", "value": "2022-09" },
        { "name": "model:architecture", "value": "transformer" }
      ]
    },
    {
      "type": "data",
      "name": "company-kb",
      "version": "2026-06-30",
      "properties": [
        { "name": "dataset:source", "value": "Internal knowledge base" },
        { "name": "dataset:pii-scan", "value": "passed" },
        { "name": "dataset:license", "value": "proprietary" },
        { "name": "dataset:size-gb", "value": "5.2" }
      ]
    },
    {
      "type": "library",
      "name": "transformers",
      "version": "4.36.2",
      "licenses": [
        { "license": { "id": "Apache-2.0" } }
      ]
    },
    {
      "type": "library",
      "name": "torch",
      "version": "2.1.0",
      "licenses": [
        { "license": { "id": "BSD-3-Clause" } }
      ]
    }
  ]
}
This format lets tools automatically scan components against vulnerability databases, verify license compliance, and audit provenance. Existing SBOM generators like Syft can produce CycloneDX documents, and you can extend them with AI-specific metadata.
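Because the AI-specific metadata lives in free-form properties, nothing in the format itself forces it to be present. A small completeness check can close that gap. The sketch below assumes the property names used in the fragment above and keys its requirements by component type. Both the required sets and the function name are assumptions, and the type keys should be adjusted to whatever type names your BOM uses.

```python
# Hypothetical policy: which properties each component type must carry.
REQUIRED_PROPERTIES = {
    "machine-learning-model": {"model:knowledge-cutoff", "model:training-data"},
    "data": {"dataset:source", "dataset:license", "dataset:pii-scan"},
}


def missing_properties(sbom: dict, required: dict = REQUIRED_PROPERTIES) -> dict[str, list[str]]:
    """Map component name to the required properties it does not declare."""
    gaps = {}
    for component in sbom.get("components", []):
        needed = required.get(component.get("type"), set())
        present = {prop["name"] for prop in component.get("properties", [])}
        missing = sorted(needed - present)
        if missing:
            gaps[component["name"]] = missing
    return gaps
```

Failing the build when this returns a non-empty result keeps incomplete model or dataset entries from reaching production.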
How to Build an AI SBOM in Your Pipeline
Building an AI SBOM is a process, not a one-time audit. The workflow is straightforward.
1. Inventory your components. For each AI system in production, list the base model, any fine-tuned versions, the inference framework, all Python or Node.js dependencies, training datasets, external APIs and tools, and any custom prompts or RAG indexes. Document versions and exact identifiers.
2. Capture provenance and hashes. Record where each component came from: the model registry URL, the Git commit hash for code dependencies, the dataset source and license. Capture SHA-256 hashes of model weights and dataset files. This enables verification later and detects tampering.
3. Generate the SBOM in CI. Use tools like Syft or the CycloneDX Maven/Gradle plugins to auto-generate SBOMs for code dependencies in your build pipeline. Extend the generated SBOM with model metadata (training date, knowledge cutoff, fine-tuning details) and dataset provenance (source, license, PII scan results). Version the SBOM and store it as an artifact.
4. Scan components against known vulnerabilities. Run the SBOM through a scanner that checks dependencies against CVE databases (Trivy, for example, accepts CycloneDX and SPDX input). Tools such as Snyk and Dependabot can cover the same dependencies from your manifests and lockfiles. For AI-specific components, check model registries for known issues and verify dataset licenses against your acceptable-use policies.
5. Store the SBOM as an immutable artifact. Commit it to your artifact repository or OCI registry alongside your model and code artifacts. Sign it with a cryptographic key so that any alteration after creation is detectable. This gives you a tamper-evident record of what was intended to ship.
6. Regenerate on every model or dependency change. Treat the SBOM as a living document. When you patch a dependency, update a model version, fine-tune on new data, or add a tool integration, regenerate the SBOM. This keeps it synchronized with production.
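The signing in step 5 can be illustrated with the standard library alone. The sketch below uses HMAC-SHA256 over a canonical JSON encoding as a stand-in for a real signature. This is an assumption made to keep the example self-contained: a production pipeline would more likely use an asymmetric signature so that verifiers never hold the signing key. The function names are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_bytes(sbom: dict) -> bytes:
    """Encode the SBOM deterministically so equal documents produce equal bytes."""
    return json.dumps(sbom, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_sbom(sbom: dict, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the canonical SBOM bytes."""
    return hmac.new(key, canonical_bytes(sbom), hashlib.sha256).hexdigest()


def verify_sbom(sbom: dict, key: bytes, signature: str) -> bool:
    """Check the tag in constant time; False means the SBOM or the key differs."""
    return hmac.compare_digest(sign_sbom(sbom, key), signature)
```

Verification fails as soon as any field changes, which is what makes the stored SBOM tamper-evident rather than merely archived.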
Over time, this process reveals patterns. You may discover that certain dependencies are used across many systems, making them higher-priority targets for vulnerability patching. You may find that specific datasets are deployed to multiple models, making their license compliance more critical.
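Spotting those patterns is a counting exercise once the SBOMs are in one place. This sketch assumes a dictionary of SBOMs keyed by system name and ranks components by how many systems include them. The function name and the threshold parameter are hypothetical.

```python
from collections import Counter


def shared_components(sboms: dict[str, dict], min_systems: int = 2) -> list[tuple[str, int]]:
    """Rank component names by the number of systems that include them."""
    counts = Counter()
    for sbom in sboms.values():
        # Use a set so a component listed twice in one SBOM counts once.
        counts.update({component["name"] for component in sbom.get("components", [])})
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(name, count) for name, count in ranked if count >= min_systems]
```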
AI SBOM and Regulatory Direction
Regulatory attention on AI supply chain transparency is rising. No law yet explicitly mandates an "AI SBOM" by that name, but multiple frameworks expect similar documentation.
The EU AI Act requires high-risk AI systems to include extensive technical documentation, including descriptions of training data, the system's performance characteristics, and the measures taken to ensure safety and compliance. This technical documentation overlaps with what an SBOM records and extends well beyond it.
The NIST AI Risk Management Framework emphasizes transparency and traceability throughout the AI lifecycle. It expects organizations to document the components and data used in AI systems and to track how those components change over time. This is exactly what an SBOM enables.
CISA and NTIA have published guidance on software supply chain security, including NTIA's minimum elements for an SBOM, and that guidance applies to the software components in AI systems. Executive branch guidance on secure software development includes expectations for component inventory and dependency tracking.
At the standards level, ISO/IEC 27001 (information security management) and ISO/IEC 42001 (AI management systems, first published in 2023) expect organizations to understand and manage risks from third-party AI components. PCI DSS (for payment card security), HIPAA (for healthcare data), and SOC 2 audits all increasingly ask organizations to document and control AI systems used in regulated contexts.
Auditors now request evidence of component tracking, dataset provenance, and dependency management. An SBOM provides that evidence. It is not a compliance guarantee on its own, but it is the foundation that compliance frameworks expect.
The direction is clear: regulators want visibility into what you build and deploy. An AI SBOM gives them that visibility and gives you proof that you have done the work.
Runtime Evidence and Policy
An SBOM documents what should be in your system. But what actually runs in production? Dependency versions can be swapped, prompts can be injected, tool definitions can be modified, and model weights can be substituted. This is where runtime control becomes essential.
A complete supply chain audit requires both static and runtime evidence. The SBOM is static: it declares what should run. Runtime policy and audit trails show what actually ran.
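The comparison between declared and observed state can be sketched as a simple diff. The example assumes the observed versions arrive as a name-to-version mapping, for instance collected from the running environment, and reports components that changed, appeared without being declared, or went missing. The function name and the output layout are assumptions.

```python
def drift(declared: dict, observed: dict[str, str]) -> dict[str, list]:
    """Compare SBOM component versions with versions observed at runtime."""
    expected = {c["name"]: c["version"] for c in declared.get("components", [])}
    return {
        # (name, declared version, observed version) for mismatches
        "changed": sorted(
            (name, expected[name], version)
            for name, version in observed.items()
            if name in expected and expected[name] != version
        ),
        "undeclared": sorted(name for name in observed if name not in expected),
        "missing": sorted(name for name in expected if name not in observed),
    }
```

An empty result in all three lists means production matches the SBOM. Anything else is a finding to investigate.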
Tools that capture runtime evidence sign every model invocation, policy decision, and tool call into an immutable audit chain. This gives you proof of which model version executed, which prompts were active, which tools were called, and what data flowed through the system. Combined with your AI SBOM, you have both the declared supply chain (the SBOM) and the observed supply chain (the runtime audit trail).
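To make the idea of an immutable audit chain concrete, here is a minimal hash-chained log. It is a generic sketch, not a description of how Vaikora or any other product implements its audit trail: each entry stores the hash of the previous entry, so editing or removing an earlier event breaks verification. The event fields and function names are hypothetical, and a real system would also sign entries and store them outside the process.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev: str, event: dict) -> str:
    body = json.dumps({"prev": prev, "event": event}, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_event(chain: list[dict], event: dict) -> dict:
    """Append an event whose hash covers both the event and the previous entry's hash."""
    prev = chain[-1]["hash"] if chain else GENESIS
    entry = {"prev": prev, "event": event, "hash": _entry_hash(prev, event)}
    chain.append(entry)
    return entry


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every link; any edited, reordered, or removed entry fails the check."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != _entry_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True
```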
When compliance auditors ask "which models ran and when," your audit trail answers with cryptographic proof, not claims. This is where Vaikora adds value as an open-core gateway: runtime policy enforcement and append-only audit logging complement the static SBOM by capturing evidence of actual execution.
The combination of a complete AI SBOM and runtime audit trails creates an end-to-end supply chain record that survives compliance review.
Frequently Asked Questions
What is an AI SBOM?
An AI SBOM is a comprehensive inventory of all components in an AI system, including the model, training data, dependencies, third-party tools, and their licenses and provenance. It extends traditional software SBOMs to address the unique supply chain risks of AI systems.
What should an AI software bill of materials include?
An AI SBOM should include the model (name, version, source, training date, knowledge cutoff), all code dependencies and libraries (with versions and licenses), training datasets (source, license, PII status), prompts and tool integrations, and any third-party agreements or restrictions.
How does an AI SBOM support regulatory compliance?
An SBOM provides auditors with evidence that you understand and manage your AI supply chain. Regulations like the EU AI Act and frameworks like the NIST AI RMF expect organizations to document and control AI components. An SBOM demonstrates that you have done the work.
Is an AI SBOM required by law or regulation?
No specific regulation mandates an AI SBOM by that name, but many frameworks expect supply chain transparency. The EU AI Act requires technical documentation for high-risk AI systems. NIST AI RMF, ISO/IEC 27001, and ISO/IEC 42001 all expect organizations to manage and track third-party AI risk, which an SBOM enables.
How do I build an AI SBOM?
Start by auditing what you deploy: the model, the inference framework, and all code dependencies. Use existing SBOM tools (like Syft) for code dependencies. Extend the SBOM with model metadata (training date, knowledge cutoff, fine-tuning) and dataset provenance (source, license, PII status). Store it as a CycloneDX or SPDX document alongside your deployment artifacts.
What tools can help generate an AI SBOM?
Syft, SBOM tools built into CI/CD platforms, and CycloneDX/SPDX generators can auto-generate SBOMs for code and container dependencies. For AI-specific components (model metadata, training data), manual documentation and extension of the SBOM is necessary. Version control systems and artifact repositories can store and track SBOM versions.
How often should I update my AI SBOM?
Update your SBOM whenever you update a dependency, deploy a new model version, retrain on new data, or change tool integrations. Treat it as a living document. Automate SBOM generation in your CI/CD pipeline so that each deployment is accompanied by an updated SBOM.