llamaIndex - 💡(How to fix) Fix Feature: Document provenance tracking for regulatory compliance (EU AI Act) [1 comments, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
run-llama/llama_index#20755Fetched 2026-04-08 00:31:10
View on GitHub
Comments
1
Participants
1
Timeline
2
Reactions
0
Author
Participants
Timeline (top)
closed ×1commented ×1

The EU AI Act (Regulation 2024/1689) places specific requirements on AI systems regarding transparency (Article 13) and human oversight (Article 14). For RAG-based applications — which LlamaIndex powers extensively — a key compliance question is: which documents influenced a given AI response, and can this be traced after the fact?

Root Cause

Organizations using RAG in regulated sectors (legal, healthcare, finance, public administration) need to demonstrate:

  • What data influenced AI decisions (Article 13 — transparency)
  • That humans can understand and oversee the system (Article 14)
  • That logs exist for post-deployment monitoring (Article 72)

This is especially critical because RAG systems blur the line between "the model" and "the data" — regulators will want to trace both.

Code Example

{
  "query_id": "uuid-...",
  "timestamp": "2026-02-20T10:30:00Z",
  "query_hash": "sha256:abc...",
  "retrieved_sources": [
    {
      "doc_id": "doc-123",
      "chunk_id": "chunk-456",
      "similarity_score": 0.89,
      "ingested_at": "2026-01-15T08:00:00Z",
      "source_uri": "https://example.com/policy.pdf"
    }
  ],
  "generation_model": "gpt-4",
  "response_hash": "sha256:def..."
}
RAW_BUFFERClick to expand / collapse

Context

The EU AI Act (Regulation 2024/1689) places specific requirements on AI systems regarding transparency (Article 13) and human oversight (Article 14). For RAG-based applications — which LlamaIndex powers extensively — a key compliance question is: which documents influenced a given AI response, and can this be traced after the fact?

Current State

LlamaIndex provides source nodes in query responses, and the SourceNode objects include metadata. However, the current approach is optimized for runtime visibility, not for compliance audit trails. Specifically:

  • Source attribution is available at query time but not systematically logged for post-hoc audit
  • There's no standardized format for provenance records that a regulator could inspect
  • Document lineage (when was it indexed, from what source, what transformations were applied) isn't captured in a structured compliance-ready format
  • No built-in mechanism to prove that a specific response was influenced by specific documents at a specific time

Proposal

A provenance tracking module that:

  1. Logs retrieval events in a structured format: query hash, retrieved document IDs, similarity scores, timestamps, model version used for embedding/generation
  2. Tracks document lineage: ingestion timestamp, source URI, transformation pipeline applied (chunking strategy, embedding model)
  3. Supports export to compliance-friendly formats (JSON-LD, or a simple structured JSON schema)
  4. Integrates with existing callbacks: extends the current callback system to emit provenance events

Example provenance record:

{
  "query_id": "uuid-...",
  "timestamp": "2026-02-20T10:30:00Z",
  "query_hash": "sha256:abc...",
  "retrieved_sources": [
    {
      "doc_id": "doc-123",
      "chunk_id": "chunk-456",
      "similarity_score": 0.89,
      "ingested_at": "2026-01-15T08:00:00Z",
      "source_uri": "https://example.com/policy.pdf"
    }
  ],
  "generation_model": "gpt-4",
  "response_hash": "sha256:def..."
}

Why This Matters

Organizations using RAG in regulated sectors (legal, healthcare, finance, public administration) need to demonstrate:

  • What data influenced AI decisions (Article 13 — transparency)
  • That humans can understand and oversee the system (Article 14)
  • That logs exist for post-deployment monitoring (Article 72)

This is especially critical because RAG systems blur the line between "the model" and "the data" — regulators will want to trace both.

References

  • EU AI Act: Regulation 2024/1689
  • Articles 12 (logging), 13 (transparency), 14 (human oversight), 72 (post-market monitoring)
  • For automated compliance scanning of AI codebases: mcp-eu-ai-act

Happy to discuss further or help spec this out.

extent analysis

<think> Okay, let's tackle this problem. The user wants a practical solution for adding provenance tracking to a LlamaIndex-based RAG system to comply with the EU AI Act. The main issue is that while LlamaIndex provides source nodes, they aren't logged systematically for audits.

First, I need to figure out how to log retrieval events. The example JSON shows fields like query_id, timestamp, retrieved_sources, etc. So, the plan is to create a module that captures these details during each query.

Next, document lineage tracking. That means when a document is ingested, we need to record its source, ingestion time, and any transformations applied. Maybe during the indexing process, we can store this metadata in a database or a structured format.

Integration with existing callbacks is mentioned. LlamaIndex has a callback system, so creating a custom callback that emits provenance events would make sense. This callback can be triggered after retrieval and during response generation.

For the code examples, I'll need to show how to modify the query process to log the necessary data. Using a decorator or a middleware pattern might help intercept queries and responses. Also, when ingesting documents, adding metadata with ingestion details.

Exporting to compliance formats like JSON-LD or structured JSON is important. So, the module should have an export function that formats the logged data accordingly.

Potential steps: create a ProvenanceTracker class that handles logging, use LlamaIndex's callbacks to hook into query events, store logs in a database, and provide an export method. For document lineage, during ingestion, add metadata to each document with source URI, ingestion timestamp, and processing steps.

Need to make sure that each query's provenance is logged with all required fields. Also, ensure that the logs are stored in a way that's easily retrievable for audits. Maybe using a time-series database or a logging service.

Testing the solution would involve running sample queries, checking if the logs are generated correctly,

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

llamaIndex - 💡(How to fix) Fix Feature: Document provenance tracking for regulatory compliance (EU AI Act) [1 comments, 1 participants]