Source Attribution and IDs

In a production RAG system, source attribution isn't just about trust; it's about auditability. If your system gives medical or financial advice, you must be able to prove which document led to that specific advice.

Generating Unique IDs

During the Ingestion phase (Module 5), you should generate a unique ID for every chunk.

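Format: a content hash, such as hash(content), or a positional ID, such as doc_name_page_12.
Benefit: IDs stay stable when you re-index the data. A content hash changes only when the chunk text changes; a positional ID changes only when the document name or pagination changes.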
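
The two ID formats above can be sketched in Python as follows. This is a minimal sketch, not a prescribed implementation: the function names, the whitespace normalization, and the 16-character truncation are assumptions made for illustration. It uses hashlib rather than Python's built-in hash(), because the built-in is randomized per process for strings and so would not give the same ID across runs.

```python
import hashlib


def content_chunk_id(text: str, length: int = 16) -> str:
    """Content-based ID: identical chunk text always yields the same ID.

    The name and the truncation length are illustrative assumptions.
    """
    # Collapse whitespace so a reflowed copy of the same text hashes identically.
    normalized = " ".join(text.split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return digest[:length]


def positional_chunk_id(doc_name: str, page: int) -> str:
    """Position-based ID in the doc_name_page_12 style."""
    return f"{doc_name}_page_{page}"
```

A content hash survives renaming or moving a document but changes on any edit to the text; a positional ID is readable but breaks if pagination shifts. Which trade-off is better depends on how often your documents are edited versus reorganized.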

Feeding IDs to the LLM

Include each chunk's ID alongside its text in the prompt so the LLM can reference it.

<context>
  <chunk id="legal-01">Section 4: Liability... </chunk>
  <chunk id="legal-02">Section 5: Arbitration... </chunk>
</context>
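
One way to produce a context block like the one above is sketched below. The build_context name and the dict-of-chunks input are assumptions for illustration; the escaping step is there because chunk text containing "<" or "&" would otherwise break the tag structure.

```python
from xml.sax.saxutils import escape, quoteattr


def build_context(chunks: dict[str, str]) -> str:
    """Render {chunk_id: chunk_text} as a <context> block for the prompt.

    Hypothetical helper: the name and input shape are illustrative.
    """
    lines = ["<context>"]
    for chunk_id, text in chunks.items():
        # quoteattr adds the surrounding quotes and escapes the ID;
        # escape protects &, < and > inside the chunk text.
        lines.append(f"  <chunk id={quoteattr(chunk_id)}>{escape(text)}</chunk>")
    lines.append("</context>")
    return "\n".join(lines)
```

The prompt's instructions would still need to tell the model to cite using these IDs, for example by asking it to put the chunk ID in square brackets after each claim.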

Parsing the Response

If the LLM outputs "Claim X [legal-01]", your frontend must be able to:

Extract the text between brackets.
Find the metadata for legal-01 in your database.
Show a "Source" popover to the user.
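
The first two steps above might look like the following sketch. The function names and the in-memory metadata_by_id dict are stand-ins for whatever your frontend and database actually use; they are assumptions, not part of any specific framework.

```python
import re

# Matches the text between one pair of square brackets, e.g. [legal-01].
CITATION_RE = re.compile(r"\[([^\[\]]+)\]")


def extract_citation_ids(answer: str) -> list[str]:
    """Return cited IDs in order of first appearance, without duplicates."""
    ids: list[str] = []
    for group in CITATION_RE.findall(answer):
        # A single bracket may hold several comma-separated IDs.
        for part in group.split(","):
            part = part.strip()
            if part and part not in ids:
                ids.append(part)
    return ids


def resolve_citations(answer: str, metadata_by_id: dict[str, dict]):
    """Split cited IDs into those with known metadata and those without."""
    found: dict[str, dict] = {}
    missing: list[str] = []
    for cid in extract_citation_ids(answer):
        if cid in metadata_by_id:
            found[cid] = metadata_by_id[cid]
        else:
            missing.append(cid)
    return found, missing
```

Note that this pattern treats any bracketed text as a citation, so something like "[sic]" would land in the missing list. Keeping that list is useful anyway: an ID the model invented will show up there instead of producing a broken popover.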

Automated Attribution Verification

Sometimes an LLM will cite a source that doesn't actually support the claim. Tooling: Use a "RAG Verifier" that takes the cited chunk and the claim and checks for entailment, that is, whether the chunk text logically supports the claim. If the chunk does not entail the claim, flag the citation as a hallucination.
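
The shape of such a verifier can be sketched as below. Everything here is hypothetical: the names, the 0.8 threshold, and especially token_overlap_score, which is a naive word-overlap stand-in and not real entailment. In practice you would pass in a scoring function backed by a natural language inference model or an LLM judge; the sketch only shows where that function plugs in.

```python
import re
from typing import Callable

# (premise, hypothesis) -> support score in [0, 1]. Assumed interface.
EntailmentFn = Callable[[str, str], float]


def token_overlap_score(premise: str, hypothesis: str) -> float:
    """Naive stand-in: fraction of hypothesis words found in the premise.

    This is NOT entailment; it cannot detect negation or paraphrase.
    """
    premise_words = set(re.findall(r"\w+", premise.lower()))
    hypothesis_words = set(re.findall(r"\w+", hypothesis.lower()))
    if not hypothesis_words:
        return 0.0
    return len(hypothesis_words & premise_words) / len(hypothesis_words)


def verify_citation(
    claim: str,
    chunk_text: str,
    entails: EntailmentFn = token_overlap_score,
    threshold: float = 0.8,  # illustrative cutoff, tune on your own data
) -> bool:
    """Return True if the cited chunk appears to support the claim."""
    return entails(chunk_text, claim) >= threshold
```

Citations for which verify_citation returns False would be the ones to flag. Keeping the scoring function as a parameter lets you swap the stand-in for a real model without touching the calling code.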

Best Practices

Cite Early: Citations should appear immediately after the fact they support.
Deep Links: If possible, cite the exact sentence or provide a timestamp (for audio/video).
Multiple Sources: If three documents support a claim, the model should ideally cite all three: [1, 5, 12].

Exercises

Write a Python regex to find all bracketed citations (e.g., [3]) in a string.
Why is "Page Number" a better citation than "Document Title" alone?
How do you handle a citation for a document that has been deleted from your database since the answer was generated?