Module 5 Lesson 2: PDF and Web Loaders
Handling the web: how to scrape data from websites and extract text from multi-page PDF documents.
PDF and Web Loaders: Ingesting the Modern World
The two most common data sources for any AI project are websites and PDFs. Both formats are notoriously difficult to clean. LangChain provides specialized loaders that handle the formatting for you (stripping HTML tags or splitting PDFs into pages).
1. PyPDFLoader (For Documents)
Install the required library: pip install pypdf
from langchain_community.document_loaders import PyPDFLoader
# Load and split into pages automatically
loader = PyPDFLoader("contract.pdf")
pages = loader.load()
# Each list item is one 'Document' representing one physical page
print(f"Total pages: {len(pages)}")
print(pages[0].page_content)
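Each item also carries metadata recorded by the loader, which is handy later for citing where an answer came from. A minimal sketch, reusing the contract.pdf example (the exact metadata keys can vary by loader version):
# Inspect the metadata attached to the first few pages
for page in pages[:3]:
    print(page.metadata)           # e.g. {'source': 'contract.pdf', 'page': 0}
    print(len(page.page_content))  # length of the extracted text for that page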
2. WebBaseLoader (For Research)
Install the required library: pip install beautifulsoup4
from langchain_community.document_loaders import WebBaseLoader
# Ingest a single URL
loader = WebBaseLoader("https://example.com/article")
docs = loader.load()
# LangChain uses BeautifulSoup to strip the HTML tags automatically
print(docs[0].page_content)
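Often you only want part of a page, such as the article body rather than the navigation and footer. WebBaseLoader passes bs_kwargs through to BeautifulSoup, so you can restrict parsing with a SoupStrainer. A sketch, where the post-content class name is an assumption; inspect the target page to find the real one:
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Only parse elements whose class matches; "post-content" is a hypothetical class name
loader = WebBaseLoader(
    "https://example.com/article",
    bs_kwargs={"parse_only": bs4.SoupStrainer(class_="post-content")},
)
docs = loader.load()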
3. Handling Multiple Sources
You can pass multiple URLs to a single loader and get back one combined list of documents.
loader = WebBaseLoader(["https://site-a.com", "https://site-b.com"])
all_docs = loader.load()
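Each resulting document remembers which URL it came from in its metadata, so you can still tell the sources apart after loading. A quick sketch:
# Report which URL produced each document and how much text was extracted
for doc in all_docs:
    print(doc.metadata["source"], "->", len(doc.page_content), "characters")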
4. Visualizing Web Ingestion
graph LR
URL["https://blog.com"] --> Request[HTTP Request]
Request --> HTML[Raw HTML Code]
HTML --> BS4[BeautifulSoup Cleaning]
BS4 --> Text[Clean Page Content]
Text --> Doc[LangChain Document]
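To make the diagram concrete, here is a rough hand-rolled equivalent using requests and BeautifulSoup directly. This is a simplified sketch of the idea, not LangChain's actual implementation:
import requests
from bs4 import BeautifulSoup
from langchain_core.documents import Document

url = "https://blog.com"  # placeholder URL from the diagram
html = requests.get(url, timeout=10).text                    # HTTP request -> raw HTML
text = BeautifulSoup(html, "html.parser").get_text()         # BeautifulSoup cleaning
doc = Document(page_content=text, metadata={"source": url})  # LangChain Document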
5. Engineering Tip: User-Agent Spoofing
Some websites block requests that look like they come from a bot, which includes LangChain's default loader. You may need to change the User-Agent header so the request looks like it came from a real browser.
# Advanced Web Loading
loader = WebBaseLoader(
"https://example.com",
header_template={"User-Agent": "Mozilla/5.0..."}
)
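Even with a browser-style User-Agent, a blocked site often returns a short error or CAPTCHA page instead of raising an exception, so it is worth sanity-checking the result. A hedged sketch (the 200-character threshold is an arbitrary choice):
docs = loader.load()
# A suspiciously short result usually means a block page rather than the real article
if len(docs[0].page_content.strip()) < 200:
    print("Warning: very little text extracted; the site may be blocking this request.")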
Key Takeaways
- PyPDFLoader treats each page as a separate Document by default.
- WebBaseLoader relies on BeautifulSoup to clean HTML.
- Always check the page count and extracted text after loading a PDF to make sure it wasn't a scan (images with no extractable text); see the sketch below.
- Permissions: be careful when scraping; always respect a site's robots.txt.
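One simple way to apply that scan check, reusing the pages list from the PyPDFLoader example: if no page yields any text, the file is almost certainly a scanned image and needs OCR instead of a plain text extractor.
# Detect a scanned PDF: the pages load, but none of them contain extractable text
empty_pages = sum(1 for page in pages if not page.page_content.strip())
if empty_pages == len(pages):
    print("No extractable text found; this PDF is probably a scan and needs OCR.")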