Context Injection Patterns: Formatting for Attention

Once you have your "High-Signal" chunks (Module 7.3), the next hurdle is Injection. How you present those facts to the LLM determines how many tokens are wasted on "Framing" and how effectively the model uses the data.

An inefficient injection pattern confuses the model's "Attention Mechanism," leading to skipped facts and higher hallucination rates.

In this lesson, we learn the three most token-efficient injection patterns: XML Wrapping, JSON Enumeration, and The Reference Token Pattern.

1. Pattern 1: XML Wrapping (High Isolation)

As discussed in Module 4.4, XML tags (like <doc></doc>) are powerful because they create clear "Borders" between instructions and data.

Wait, why not standard text? If you just paste text, the model might confuse a "User Question" inside the document with the real "User Question" at the end of the prompt. This is a common attack vector (Prompt Injection), but it's also a source of token waste as you write more instructions to "Ignore the content's instructions."

The XML Pattern:

<context>
  <source id="1" title="Privacy Policy">
    [Snippet 1 text]
  </source>
  <source id="2" title="Terms of Service">
    [Snippet 2 text]
  </source>
</context>

2. Pattern 2: Citation-First Prompting

To save tokens on the Output side, you should instruct the model to use Short Citations.

Inefficient Instruction:

"If you use a document, please mention the title of the document and the page number where you found the answer."

Efficient Instruction:

"Cite sources using [ID] only (e.g. [1])."

Result:

Before: "...as mentioned in the Privacy Policy on page 4, the data is encrypted." (15 tokens)
After: "...the data is encrypted [1]." (5 tokens)

Across a long answer, "Short Citations" can save 50-100 output tokens.

3. Pattern 3: The "Context-at-Bottom" Rule

Earlier, we discussed "Lost in the Middle" (Module 1.3). Architecturally, you should place your Most Dynamic data (the retrieved context) at the very bottom of the prompt, as close as possible to the final Assistant: tag.

graph TD
    A[System Identity: Cached] --> B[Global Constraints: Cached]
    B --> C[User Query]
    C --> D[Retrieved Context: DYNAMIC]
    D --> E[Assistant: NEXT TOKEN]
    
    style D fill:#f66

By placing context at the bottom, the model's "Working Memory" is focused on the facts it just read, leading to higher accuracy with fewer "Reasoning" tokens.

4. Implementation: The Context Injector (Python)

Python Code: Generating the XML String

def inject_context(docs: list):
    """
    Constructs a high-density XML block for RAG injection.
    """
    segments = ["<context>"]
    for i, doc in enumerate(docs):
        # We minify the text inside the block to save whitespace tokens
        clean_text = doc.text.strip().replace("\n", " ")
        segments.append(f"<source id='{i+1}'>{clean_text}</source>")
    
    segments.append("</context>")
    return "".join(segments)

# Usage in FastAPI
system_prompt = "You help users based ONLY on the <context> block below."
final_prompt = f"{system_prompt}\n{inject_context(retrieved_docs)}\nUser: {user_query}"

5. Metadata Pruning (Revisited)

In a RAG context, you only need metadata that the model needs for Reasoning.

Needed: Date (if the query is about "Latest info").
Not Needed: Database UUID, File Extension, Author's Email.

The Audit: For every piece of metadata you inject, ask: "If I deleted this, would the model's answer change?" If no, delete it. Every character is a fractional cent.

6. Summary and Key Takeaways

Use XML Tags: They isolate data and prevent instruction-confusion.
Short-ID Citations: Force the model to use [1] instead of full titles.
Positioning: Move dynamic RAG results to the bottom of the prompt.
Minify Chunks: Strip extra newlines and tabs from the retrieved text.

In the next lesson, Evaluation of RAG ROI, we conclude Module 7 with the "Business Case" for efficient retrieval.

Exercise: The Injection Test

Predict the token count of a document wrapped in 5 different ways:
- Plain text.
- JSON.
- YAML.
- XML.
- Markdown Table.
Verify with tiktoken.
Observation: Notice how Markdown uses the fewest tokens for lists, while XML uses more at the start but provides better "Isolation" for the model's attention.
Conclusion: When should you use XML over Markdown? (Hint: When the content itself is messy or includes symbols).

Context Injection Patterns: Formatting for Attention

Context Injection Patterns: Formatting for Attention

1. Pattern 1: XML Wrapping (High Isolation)

2. Pattern 2: Citation-First Prompting

3. Pattern 3: The "Context-at-Bottom" Rule

4. Implementation: The Context Injector (Python)

Python Code: Generating the XML String

5. Metadata Pruning (Revisited)

6. Summary and Key Takeaways

Exercise: The Injection Test

Congratulations on completing Module 7 Lesson 4! Your RAG systems are now ultra-precise.

Subscribe to our newsletter