Text Formats (TXT, MD, HTML)

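Text is the foundation of RAG: whatever the source format, the pipeline ultimately chunks and embeds plain text. TXT, Markdown, and HTML each carry a different amount of structure, so each needs its own processing step to produce clean text and useful metadata.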

Format Comparison

Format	Structure	Use Case	Processing Complexity
TXT	None	Simple notes	Low
Markdown	Lightweight markup	Documentation	Medium
HTML	Rich markup	Web content	High
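
As a rough illustration of how the table might drive a pipeline, the sketch below maps a file extension to one of the three formats. The names FORMAT_BY_EXTENSION and detect_format, and the exact set of extensions, are assumptions made for this example rather than part of the original material.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the formats in the table above
FORMAT_BY_EXTENSION = {
    '.txt': 'txt',
    '.md': 'markdown',
    '.markdown': 'markdown',
    '.html': 'html',
    '.htm': 'html',
}

def detect_format(file_path):
    """Return the format label for a path, or None if the extension is unknown."""
    return FORMAT_BY_EXTENSION.get(Path(file_path).suffix.lower())
```

Returning None for unknown extensions leaves it to the caller to decide whether to skip the file or treat it as plain text.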

Processing Plain Text (.txt)

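import re

def normalize_whitespace(text):
    # Assumed helper (not defined in the original): collapse runs of spaces
    # and tabs, and squeeze three or more newlines down to one blank line
    text = re.sub(r'[ \t]+', ' ', text)
    return re.sub(r'\n{3,}', '\n\n', text)

def process_text_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()

    # Clean and normalize
    content = content.strip()
    content = normalize_whitespace(content)

    return {
        'content': content,
        'metadata': {
            'format': 'txt',
            'encoding': 'utf-8'
        }
    }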
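
The function above assumes the file is valid UTF-8. Files collected from mixed sources sometimes are not, so one possible approach, sketched here, is to try a short list of encodings in order and record which one worked. The helper name read_text_with_fallback and the candidate encodings are assumptions chosen for illustration.

```python
from pathlib import Path

def read_text_with_fallback(file_path, encodings=('utf-8-sig', 'cp1252')):
    """Decode a file by trying each encoding in turn (hypothetical helper)."""
    raw = Path(file_path).read_bytes()
    for encoding in encodings:
        try:
            # 'utf-8-sig' reads plain UTF-8 and also strips a leading BOM
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: decode lossily, replacing bad bytes with U+FFFD
    return raw.decode('utf-8', errors='replace'), 'utf-8 (lossy)'
```

The second return value could be stored in the 'encoding' metadata field instead of a hard-coded 'utf-8'.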

Processing Markdown (.md)

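import re

import markdown
from bs4 import BeautifulSoup

def extract_headers(md_content):
    # Assumed helper: collect ATX headings ("# Title") as (level, text) pairs.
    # Headings inside fenced code blocks are not filtered out.
    pattern = re.compile(r'^(#{1,6})[ \t]+(.+?)[ \t]*$', re.MULTILINE)
    return [(len(m.group(1)), m.group(2)) for m in pattern.finditer(md_content)]

def html_to_text(html):
    # Assumed helper: strip tags, putting each text node on its own line
    return BeautifulSoup(html, 'html.parser').get_text(separator='\n', strip=True)

def process_markdown(file_path):
    # Pass the encoding explicitly instead of relying on the platform default
    with open(file_path, 'r', encoding='utf-8') as f:
        md_content = f.read()

    # Convert to HTML for structure
    html = markdown.markdown(md_content)

    # Extract plain text for embedding
    text = html_to_text(html)

    # Preserve structure
    headers = extract_headers(md_content)

    return {
        'content': text,
        'structure': headers,
        'html': html
    }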

Processing HTML

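from bs4 import BeautifulSoup

def extract_meta_tags(soup):
    # Assumed helper: map each <meta name=...> or <meta property=...> to its content
    tags = {}
    for meta in soup.find_all('meta'):
        key = meta.get('name') or meta.get('property')
        if key and meta.get('content') is not None:
            tags[key] = meta['content']
    return tags

def process_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove scripts and styles
    for tag in soup(['script', 'style']):
        tag.decompose()

    # Extract text, separating text nodes with a real newline
    text = soup.get_text(separator='\n', strip=True)

    # Extract metadata (get_text also handles a <title> with nested tags)
    metadata = {
        'title': soup.title.get_text(strip=True) if soup.title else None,
        'meta': extract_meta_tags(soup)
    }

    return {'content': text, 'metadata': metadata}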

Best Practices

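Encoding: Read and write files as UTF-8, and pass the encoding explicitly rather than relying on the platform default
Normalization: Collapse redundant whitespace before chunking and embedding
Structure: Preserve headings so that chunks can follow section boundaries
Metadata: Extract titles, dates, and authors wherever the format provides them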
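
To make the structure point concrete, here is a minimal sketch that splits Markdown into one chunk per heading section and keeps the heading as chunk metadata. It assumes ATX-style headings (lines starting with one to six # characters followed by a space) and does not handle headings inside fenced code blocks. The name chunk_by_headings and the shape of the returned dictionaries are illustrative choices, not part of the original pipeline.

```python
import re

# One to six '#' characters, whitespace, then the heading text
HEADING = re.compile(r'^(#{1,6})[ \t]+(.+?)[ \t]*$')

def _has_content(section):
    """True if the section has a heading or any non-blank line."""
    return section['heading'] is not None or any(
        line.strip() for line in section['lines']
    )

def chunk_by_headings(md_text):
    """Split Markdown into one chunk per heading section (illustrative sketch)."""
    sections = []
    current = {'heading': None, 'level': 0, 'lines': []}
    for line in md_text.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the section collected so far
            if _has_content(current):
                sections.append(current)
            current = {
                'heading': match.group(2),
                'level': len(match.group(1)),
                'lines': [],
            }
        else:
            current['lines'].append(line)
    if _has_content(current):
        sections.append(current)
    return [
        {
            'heading': s['heading'],
            'level': s['level'],
            'content': '\n'.join(s['lines']).strip(),
        }
        for s in sections
    ]
```

Text that appears before the first heading is kept as a chunk with heading None, so no content is dropped.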

Next: PDF processing.
