
Text Formats (TXT, MD, HTML)
Processing plain text, Markdown, and HTML for RAG systems with best practices.
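Text is the foundation of RAG: whatever the source, a document is reduced to text before it is chunked and embedded. Each format carries a different amount of structure, and knowing what that structure is determines how much of it you can preserve during processing.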
Format Comparison
| Format | Structure | Use Case | Processing Complexity |
|---|---|---|---|
| TXT | None | Simple notes | Low |
| Markdown | Lightweight markup | Documentation | Medium |
| HTML | Rich markup | Web content | High |
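The table suggests a small dispatch step that chooses a processing path from the file extension. The following is a minimal sketch, assuming extension-based detection is good enough for the corpus. `detect_format` and `FORMAT_BY_EXTENSION` are illustrative names, not part of any library.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the formats in the table.
FORMAT_BY_EXTENSION = {
    '.txt': 'txt',
    '.md': 'markdown',
    '.markdown': 'markdown',
    '.html': 'html',
    '.htm': 'html',
}

def detect_format(file_path):
    """Return the format name for a path, or None if the extension is unknown."""
    suffix = Path(file_path).suffix.lower()
    return FORMAT_BY_EXTENSION.get(suffix)
```

Returning `None` for unknown extensions leaves the decision to the caller, who can skip the file or fall back to plain-text handling.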
Processing Plain Text (.txt)
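```python
import re

def normalize_whitespace(text):
    # Assumed helper (not shown in the original): collapse runs of spaces and
    # tabs, and limit consecutive blank lines to one.
    text = re.sub(r'[ \t]+', ' ', text)
    return re.sub(r'\n{3,}', '\n\n', text)

def process_text_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()

    # Clean and normalize
    content = normalize_whitespace(content.strip())

    return {
        'content': content,
        'metadata': {
            'format': 'txt',
            'encoding': 'utf-8'
        }
    }
```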
Processing Markdown (.md)
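```python
import re

import markdown
from bs4 import BeautifulSoup

def html_to_text(html):
    # Assumed helper (not shown in the original): strip tags via BeautifulSoup
    return BeautifulSoup(html, 'html.parser').get_text(separator='\n', strip=True)

def extract_headers(md_content):
    # Assumed helper: return (level, title) pairs for ATX headings ("## Title")
    pattern = re.compile(r'^(#{1,6})[ \t]+(.+?)[ \t]*$', re.MULTILINE)
    return [(len(m.group(1)), m.group(2)) for m in pattern.finditer(md_content)]

def process_markdown(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        md_content = f.read()

    # Convert to HTML for structure
    html = markdown.markdown(md_content)
    # Extract plain text for embedding
    text = html_to_text(html)
    # Preserve structure
    headers = extract_headers(md_content)

    return {'content': text, 'structure': headers, 'html': html}
```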
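Preserving headings pays off at chunking time, when the document can be split along section boundaries. The sketch below shows one way to do that with a regular expression. It assumes ATX-style headings (`# Title`) only, and `split_by_headings` is an illustrative name rather than an established API.

```python
import re

# ATX headings only ("# Title"). Setext headings and "#" lines inside code
# fences are not handled in this sketch.
HEADING_RE = re.compile(r'^(#{1,6})[ \t]+(.*?)[ \t]*$', re.MULTILINE)

def split_by_headings(md_content):
    """Split Markdown into one chunk per heading, keeping the heading text."""
    chunks = []
    matches = list(HEADING_RE.finditer(md_content))

    # Text before the first heading becomes an untitled chunk.
    preamble_end = matches[0].start() if matches else len(md_content)
    preamble = md_content[:preamble_end].strip()
    if preamble:
        chunks.append({'level': 0, 'heading': None, 'text': preamble})

    for i, match in enumerate(matches):
        body_start = match.end()
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(md_content)
        chunks.append({
            'level': len(match.group(1)),
            'heading': match.group(2),
            'text': md_content[body_start:body_end].strip(),
        })
    return chunks
```

Each chunk keeps its heading and level, so a retriever can attach the section title to the embedded text or rebuild the section hierarchy later.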
Processing HTML
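```python
from bs4 import BeautifulSoup

def extract_meta_tags(soup):
    # Assumed helper (not shown in the original): map each <meta name="...">
    # tag to its content attribute.
    return {m['name']: m.get('content', '')
            for m in soup.find_all('meta', attrs={'name': True})}

def process_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove scripts and styles
    for tag in soup(['script', 'style']):
        tag.decompose()

    # Extract text, one line per text node
    text = soup.get_text(separator='\n', strip=True)

    # Extract metadata (the title may be missing)
    metadata = {
        'title': soup.title.string if soup.title else None,
        'meta': extract_meta_tags(soup)
    }

    return {'content': text, 'metadata': metadata}
```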
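Where BeautifulSoup is not available, the same idea (drop `script` and `style`, keep the visible text and the title) can be sketched with the standard library's `html.parser`. This is a swapped-in alternative, not the approach shown above, and `TextExtractor` and `extract_text_stdlib` are illustrative names. Unlike `soup.get_text`, this sketch keeps the title out of the content.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text and the <title>, skipping <script> and <style>."""

    SKIPPED = {'script', 'style'}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.title = None
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == 'title':
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return  # ignore whitespace and script/style bodies
        if self._in_title:
            self.title = text
        else:
            self.parts.append(text)

def extract_text_stdlib(html_content):
    parser = TextExtractor()
    parser.feed(html_content)
    parser.close()
    return {'content': '\n'.join(parser.parts),
            'metadata': {'title': parser.title}}
```

The standard-library parser is less forgiving of malformed markup than BeautifulSoup, so this is better suited to well-formed pages.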
Best Practices
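- Encoding: Read and write files as UTF-8, and pass the encoding explicitly instead of relying on the platform default
- Normalization: Collapse repeated spaces and blank lines before embedding
- Structure: Preserve headings so chunks can follow section boundaries
- Metadata: Extract titles, dates, and authors where the format provides them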
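The encoding and normalization points can be combined into a single cleaning step applied to raw file bytes. The sketch below is one possible version: `clean_text` is an illustrative name, and the use of NFC normalization and of replacing undecodable bytes are assumptions rather than requirements from this guide.

```python
import re
import unicodedata

def clean_text(raw_bytes):
    """Decode bytes as UTF-8 and normalize whitespace for embedding."""
    # 'utf-8-sig' drops a leading byte-order mark if present; errors='replace'
    # substitutes U+FFFD for undecodable bytes instead of raising.
    text = raw_bytes.decode('utf-8-sig', errors='replace')

    # Assumption: NFC, so visually identical strings compare equal.
    text = unicodedata.normalize('NFC', text)

    text = text.replace('\r\n', '\n').replace('\r', '\n')  # unify line endings
    text = re.sub(r'[ \t]+', ' ', text)                    # collapse spaces and tabs
    text = re.sub(r' ?\n ?', '\n', text)                   # trim around newlines
    text = re.sub(r'\n{3,}', '\n\n', text)                 # at most one blank line
    return text.strip()
```

Keeping a single blank line between paragraphs preserves paragraph boundaries, which a chunker can still use as split points.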
Next: PDF processing.