Module 5 Lesson 2: PDF and Web Loaders
Handling the web: how to scrape data from websites and extract text from multi-page PDF documents.
PDF and Web Loaders: Ingesting the Modern World
The two most common data sources for any AI project are websites and PDFs. Both formats are notoriously difficult to clean. LangChain provides specialized loaders that handle the formatting for you (stripping HTML tags or splitting PDFs into pages).
1. PyPDFLoader (For Documents)
Install the required library: pip install pypdf
from langchain_community.document_loaders import PyPDFLoader
# Load and split into pages automatically
loader = PyPDFLoader("contract.pdf")
pages = loader.load()
# Each list item is one 'Document' representing one physical page
print(f"Total pages: {len(pages)}")
print(pages[0].page_content)
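Each item also carries metadata recorded by the loader, which is handy later for citing where an answer came from. A minimal sketch, reusing the contract.pdf example (the exact metadata keys can vary by loader version):
# Inspect the metadata attached to the first few pages
for page in pages[:3]:
    print(page.metadata)           # e.g. {'source': 'contract.pdf', 'page': 0}
    print(len(page.page_content))  # length of the extracted text for that page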
2. WebBaseLoader (For Research)
Install the required library: pip install beautifulsoup4
from langchain_community.document_loaders import WebBaseLoader
# Ingest a single URL
loader = WebBaseLoader("https://example.com/article")
docs = loader.load()
# LangChain uses BeautifulSoup to strip the HTML tags automatically
print(docs[0].page_content)
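Often you only want part of a page, such as the article body rather than the navigation and footer. WebBaseLoader passes bs_kwargs through to BeautifulSoup, so you can restrict parsing with a SoupStrainer. A sketch, where the post-content class name is an assumption; inspect the target page to find the real one:
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Only parse elements whose class matches; "post-content" is a hypothetical class name
loader = WebBaseLoader(
    "https://example.com/article",
    bs_kwargs={"parse_only": bs4.SoupStrainer(class_="post-content")},
)
docs = loader.load()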
3. Handling Multiple Sources
You can pass multiple URLs to a single loader and get back one combined list of documents.
loader = WebBaseLoader(["https://site-a.com", "https://site-b.com"])
all_docs = loader.load()
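Each resulting document remembers which URL it came from in its metadata, so you can still tell the sources apart after loading. A quick sketch:
# Report which URL produced each document and how much text was extracted
for doc in all_docs:
    print(doc.metadata["source"], "->", len(doc.page_content), "characters")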
4. Visualizing Web Ingestion
graph LR
URL["https://blog.com"] --> Request[HTTP Request]
Request --> HTML[Raw HTML Code]
HTML --> BS4[BeautifulSoup Cleaning]
BS4 --> Text[Clean Page Content]
Text --> Doc[LangChain Document]
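To make the diagram concrete, here is a rough hand-rolled equivalent using requests and BeautifulSoup directly. This is a simplified sketch of the idea, not LangChain's actual implementation:
import requests
from bs4 import BeautifulSoup
from langchain_core.documents import Document

url = "https://blog.com"  # placeholder URL from the diagram
html = requests.get(url, timeout=10).text                    # HTTP request -> raw HTML
text = BeautifulSoup(html, "html.parser").get_text()         # BeautifulSoup cleaning
doc = Document(page_content=text, metadata={"source": url})  # LangChain Document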
5. Engineering Tip: User-Agent Spoofing
Some websites block requests that look like they come from a bot, which includes LangChain's default loader. You may need to change the User-Agent header so the request looks like it came from a real browser.
# Advanced Web Loading
loader = WebBaseLoader(
"https://example.com",
header_template={"User-Agent": "Mozilla/5.0..."}
)
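Even with a browser-style User-Agent, a blocked site often returns a short error or CAPTCHA page instead of raising an exception, so it is worth sanity-checking the result. A hedged sketch (the 200-character threshold is an arbitrary choice):
docs = loader.load()
# A suspiciously short result usually means a block page rather than the real article
if len(docs[0].page_content.strip()) < 200:
    print("Warning: very little text extracted; the site may be blocking this request.")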
Key Takeaways
- PyPDFLoader treats each page as a separate Document by default.
- WebBaseLoader relies on BeautifulSoup to clean HTML.
- Always check the page count and extracted text after loading a PDF to make sure it wasn't a scan (images with no extractable text); see the sketch below.
- Permissions: be careful when scraping; always respect a site's robots.txt.
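One simple way to apply that scan check, reusing the pages list from the PyPDFLoader example: if no page yields any text, the file is almost certainly a scanned image and needs OCR instead of a plain text extractor.
# Detect a scanned PDF: the pages load, but none of them contain extractable text
empty_pages = sum(1 for page in pages if not page.page_content.strip())
if empty_pages == len(pages):
    print("No extractable text found; this PDF is probably a scan and needs OCR.")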