Connecting LLMs to External Knowledge Bases

Master the data pipeline for RAG. Learn how to ingest, parse, and chunk documents from various sources like S3, Google Drive, and local file systems for AI retrieval.

A RAG system is only as good as its data. As an LLM Engineer, you will rarely be given a clean text file. You will be given messy PDFs, nested JSONs, and links to private Google Drives. Your first task is to bridge the gap between these External Sources and the LLM.

In this lesson, we cover the lifecycle of data ingestion: Loading, Parsing, and Chunking.


1. Data Ingestion: The Entry Point

In a production environment, your knowledge base usually lives in one of four places:

  1. Cloud Storage: AWS S3 or Google Cloud Storage.
  2. SaaS Apps: Notion, Slack, Jira, or Confluence.
  3. Databases: SQL (Postgres) or NoSQL (MongoDB).
  4. Local Systems: PDFs, PPTs, and Excel files uploaded by users.

The Library to Use: LangChain Loaders

LangChain provides "Loaders" for almost every data source in existence. You don't need to write the scraping code yourself; you just initialize the loader.
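As a sketch, here is how two common loaders are initialized. The bucket name, prefix, and file name are placeholders, and S3DirectoryLoader additionally assumes boto3 is installed and AWS credentials are available in the environment.

from langchain_community.document_loaders import S3DirectoryLoader, PyPDFLoader

# Cloud storage: pull every object under a prefix in an S3 bucket
# (requires boto3 and AWS credentials configured in the environment).
s3_loader = S3DirectoryLoader(bucket="company-knowledge-base", prefix="policies/")

# Local file system: a single PDF uploaded by a user (requires the pypdf package).
pdf_loader = PyPDFLoader("employee_handbook.pdf")

# Every loader exposes the same interface: .load() returns a list of Documents,
# each with .page_content (the text) and .metadata (source, page number, ...).
docs = pdf_loader.load()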


2. Parsing: Turning Binary into Text

A PDF is not text; it's a binary file describing where shapes are drawn on a page. Parsing is the process of extracting the text while preserving its meaning and structure.

Challenges:

  • Tables: Most basic parsers flatten tables into a jumble of values, losing the row and column relationships.
  • Headers/Footers: Text repeated on every page that adds noise to retrieval.
  • OCR: Scanned documents are images rather than text-based PDFs and require optical character recognition before any text can be extracted.

Pro Tip: Use Layout-Aware Parsers (like AWS Textract or Unstructured.io) for complex documents. They identify what is a "Heading" vs. a "Table," allowing the model to understand the structure.
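As a minimal sketch of layout-aware parsing, the open-source unstructured library (the project behind Unstructured.io) can partition a PDF into typed elements. The file name is a placeholder, and the "hi_res" strategy assumes the optional layout-detection and OCR dependencies are installed.

from unstructured.partition.pdf import partition_pdf

# Partition the PDF into layout-aware elements (Title, NarrativeText, Table, ...).
# The "hi_res" strategy runs a layout-detection model and OCR where needed.
elements = partition_pdf(filename="employee_handbook.pdf", strategy="hi_res")

for element in elements:
    # Each element carries its own category, so tables and headings
    # can be handled differently from body text during chunking.
    print(element.category, "->", str(element)[:60])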


3. Chunking: The Art of Slicing

You cannot send a 100-page document to a model in one go. You must break it into "Chunks."

graph TD
    A[100 Page PDF] --> B[Chunk 1: 500 characters]
    A --> C[Chunk 2: 500 characters]
    A --> D[Chunk 3: 500 characters]
    B --> E[Vector DB]
    C --> E
    D --> E

Strategies for Chunking:

  • Fixed-Size Chunking: Break the text every N characters (e.g., 500). Fast, but it may cut a sentence in half.
  • Semantic Chunking: Use an embedding model (or an LLM) to find where one "idea" ends and the next begins. More accurate, but slower and more expensive (a minimal sketch follows this list).
  • Overlapping: Repeat 10-20% of the previous chunk at the start of the next one so context isn't lost at chunk boundaries.
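As a rough sketch of semantic chunking, LangChain's experimental SemanticChunker splits wherever the embedding distance between adjacent sentences spikes. It assumes langchain_experimental and langchain_openai are installed, an OPENAI_API_KEY is set, and the file path is a placeholder.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Read any long plain-text document (placeholder path).
with open("hr_policy.txt") as f:
    long_text = f.read()

# Split where the embedding distance between adjacent sentences jumps,
# i.e. where one "idea" ends and the next begins.
semantic_splitter = SemanticChunker(OpenAIEmbeddings())
chunks = semantic_splitter.create_documents([long_text])

print(f"Semantic chunks: {len(chunks)}")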

4. Why Chunking is Critical for Retrieval

If a user asks: "What is the policy for sick leave?" and your chunk covers 5 pages of general HR content, the chunk's embedding is dominated by everything else in it, so it may not be retrieved at all; even if it is, the one relevant sentence is buried in noise. Small, high-density chunks are usually better for Semantic Retrieval.


Code Example: Loading and Chunking a PDF

Let’s put loading and chunking together in a typical LangChain script.

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Load the PDF (PyPDFLoader returns one Document per page; requires the pypdf package)
loader = PyPDFLoader("employee_handbook.pdf")
docs = loader.load()

# 2. Define the Splitter (Chunker)
# chunk_overlap=150 repeats ~15% of each chunk at the start of the next,
# so context isn't lost when a sentence straddles a chunk boundary.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", " ", ""]
)

# 3. Split the loaded pages into overlapping chunks
split_docs = text_splitter.split_documents(docs)

print(f"Original Pages: {len(docs)}")
print(f"Resulting Chunks: {len(split_docs)}")
print(f"Sample Content: {split_docs[0].page_content[:100]}...")

5. Security and Privacy at the Ingestion Layer

When connecting to external bases, you must respect RBAC (Role-Based Access Control). If you ingest the entire company wiki but a Junior Developer asks about "Executive Salaries," your RAG system should not retrieve those documents.

The Solution: Always store "Access Metadata" alongside your chunks, e.g. { "source_url": "...", "allowed_groups": ["HR", "Managers"] }, and filter on it at query time.
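A minimal sketch of that idea, reusing the split_docs list from the script above (the source URL and group names are hypothetical): tag every chunk at ingestion time, then filter retrieved chunks against the user's groups. In production this filter usually runs as a metadata filter inside the vector database itself.

# At ingestion time: attach access metadata to every chunk before indexing.
for doc in split_docs:
    doc.metadata["source_url"] = "s3://company-knowledge-base/employee_handbook.pdf"
    doc.metadata["allowed_groups"] = ["HR", "Managers"]

# At query time: drop any retrieved chunk the user is not allowed to see.
def filter_by_access(retrieved_docs, user_groups):
    return [
        doc for doc in retrieved_docs
        if set(doc.metadata.get("allowed_groups", [])) & set(user_groups)
    ]

visible_docs = filter_by_access(split_docs, user_groups=["Engineering"])
print(f"Chunks visible to this user: {len(visible_docs)}")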


Summary

  • Loaders: Use pre-built components for S3, Notion, and PDFs.
  • Parsing: Use layout-aware tools for complex data like tables.
  • Chunking: Smaller chunks with overlap lead to more accurate retrieval.
  • Metadata: Store source info and permissions alongside the text.

In the next lesson, we will look at Vector Databases, the storage systems where these chunks live.


Exercise: Chunking Strategy

You are building a RAG system for a Law Firm. The documents are 200-page legal contracts where every single paragraph might contain a critical "Clause."

  1. Would you use a small chunk size (100 tokens) or a large chunk size (2000 tokens)?
  2. Why is "Overlap" particularly important for legal contracts?

Think about how legal language refers back to previous sections (e.g., "As mentioned in Section 4.5...").
