
Source Attribution and IDs
Implement automated source attribution to ensure every factual claim in your RAG system is verifiable.
In a production RAG system, source attribution isn't just about trust; it's about auditability. If your system gives medical or financial advice, you must be able to prove which document led to that specific advice.
Generating Unique IDs
During the Ingestion phase (Module 5), you should generate a unique ID for every chunk.
- Format: hash(content) or doc_name_page_12.
- Benefit: IDs are stable even if you re-index the data.
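The two ID formats above could be generated roughly as follows. This is a minimal sketch, not a prescribed implementation: the function names `content_id` and `positional_id`, the choice of SHA-256, the whitespace normalization, and the 16-character truncation are all assumptions made for illustration.

```python
import hashlib


def content_id(text: str, length: int = 16) -> str:
    """Content-derived ID: identical text always yields the same ID.

    Assumption: whitespace is collapsed first so that trivial formatting
    differences do not change the ID. Truncating the digest to `length`
    hex characters is also an illustrative choice.
    """
    normalized = " ".join(text.split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return digest[:length]


def positional_id(doc_name: str, page: int) -> str:
    """Location-derived ID in the doc_name_page_12 style."""
    return f"{doc_name}_page_{page}"


print(content_id("Section 4: Liability..."))
print(positional_id("doc_name", 12))
```

A content hash stays the same across re-indexing as long as the chunk text is unchanged, while a positional ID stays the same as long as the document name and pagination are unchanged. Which trade-off fits better depends on how often your documents are edited.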
Feeding IDs to the LLM
Include each chunk's ID in the prompt alongside its text so the LLM can reference it.
<context>
<chunk id="legal-01">Section 4: Liability... </chunk>
<chunk id="legal-02">Section 5: Arbitration... </chunk>
</context>
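One way to assemble a context block like the one above from retrieved chunks is sketched here. The helper name `build_context` and the `{id: text}` input shape are hypothetical; escaping the chunk text is an added precaution so that stray `<` or `&` characters in a document cannot break the tag structure.

```python
from xml.sax.saxutils import escape, quoteattr


def build_context(chunks: dict[str, str]) -> str:
    """Render {chunk_id: text} as a <context> block of <chunk id="..."> tags."""
    lines = ["<context>"]
    for chunk_id, text in chunks.items():
        # quoteattr adds the surrounding quotes and escapes the ID;
        # escape handles &, < and > inside the chunk text.
        lines.append(f"<chunk id={quoteattr(chunk_id)}>{escape(text)}</chunk>")
    lines.append("</context>")
    return "\n".join(lines)


print(build_context({
    "legal-01": "Section 4: Liability... ",
    "legal-02": "Section 5: Arbitration... ",
}))
```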
Parsing the Response
If the LLM outputs "Claim X [legal-01]", your frontend must be able to:
- Extract the text between brackets.
- Find the metadata for legal-01 in your database.
- Show a "Source" popover to the user.
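The first two steps above might look like this on the backend. It is a sketch under stated assumptions: `extract_citation_ids` and `resolve_citations` are hypothetical names, a plain dict stands in for your database, and the pattern treats any bracketed text as a citation, so ordinary brackets in prose (for example "[sic]") would also match.

```python
import re
from typing import Optional

# Matches the text between one pair of square brackets, e.g. "[legal-01]".
CITATION_RE = re.compile(r"\[([^\[\]]+)\]")


def extract_citation_ids(answer: str) -> list[str]:
    """Return cited IDs in order of first appearance, without duplicates.

    Comma-separated groups such as "[1, 5, 12]" are split into single IDs.
    """
    ids: list[str] = []
    for group in CITATION_RE.findall(answer):
        for part in group.split(","):
            part = part.strip()
            if part and part not in ids:
                ids.append(part)
    return ids


def resolve_citations(
    answer: str, metadata_store: dict[str, dict]
) -> dict[str, Optional[dict]]:
    """Map each cited ID to its metadata, or None if the ID is unknown."""
    return {cid: metadata_store.get(cid) for cid in extract_citation_ids(answer)}


store = {"legal-01": {"title": "Master Agreement", "page": 4}}  # illustrative data
print(resolve_citations("Claim X [legal-01]. Claim Y [legal-02].", store))
```

IDs that resolve to `None` are worth surfacing explicitly in the UI rather than silently dropping, since they indicate either an invented ID or a chunk that no longer exists.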
Automated Attribution Verification
Sometimes an LLM will cite a source that doesn't actually contain the claim.
Tooling: Use a "RAG Verifier" that takes the cited chunk and the claim and checks for entailment, that is, whether the chunk's text actually supports the claim. If the chunk does not support the claim, the citation is flagged as a hallucination.
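The shape of such a verifier can be sketched as below. Everything here is an assumption for illustration: `verify_citation`, the `EntailmentFn` signature, and the 0.8 threshold are hypothetical, and `token_overlap_score` is only a crude word-overlap stand-in, not real entailment. In practice you would pass in a scoring function backed by an NLI model or an LLM judge.

```python
import re
from typing import Callable

# (premise, hypothesis) -> score in [0, 1]; higher means better supported.
EntailmentFn = Callable[[str, str], float]


def token_overlap_score(premise: str, hypothesis: str) -> float:
    """Placeholder scorer, NOT real entailment.

    Returns the fraction of hypothesis words that also occur in the premise.
    It cannot detect negation or paraphrase; swap in a real model for use.
    """
    premise_words = set(re.findall(r"\w+", premise.lower()))
    hypothesis_words = re.findall(r"\w+", hypothesis.lower())
    if not hypothesis_words:
        return 0.0
    hits = sum(word in premise_words for word in hypothesis_words)
    return hits / len(hypothesis_words)


def verify_citation(
    claim: str,
    chunk_text: str,
    entails: EntailmentFn = token_overlap_score,
    threshold: float = 0.8,
) -> bool:
    """Return True if the cited chunk appears to support the claim."""
    return entails(chunk_text, claim) >= threshold


chunk = "Section 4: Liability is capped at the fees paid in the prior year."
print(verify_citation("Liability is capped at fees paid", chunk))
print(verify_citation("Disputes go to arbitration", chunk))
```

Keeping the scorer as a parameter means the flagging logic and threshold can be tested independently of whichever entailment model you eventually choose.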
Best Practices
- Cite Early: Citations should appear immediately after the fact they support.
- Deep Links: If possible, cite the exact sentence or provide a timestamp (for audio/video).
- Multiple Sources: If three documents support a claim, the model should ideally cite all three: [1, 5, 12].
Exercises
- Write a Python regex to find all bracketed citations (e.g., [3]) in a string.
- Why is "Page Number" a better citation than "Document Title" alone?
- How do you handle a citation for a document that has been deleted from your database since the answer was generated?