Text Formats (TXT, MD, HTML)

Processing plain text, Markdown, and HTML for RAG systems with best practices.

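Text is the foundation of RAG. Most pipelines reduce every source to text before chunking and embedding it, so knowing how much structure each format carries determines how cleanly it can be processed.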

Format Comparison

Format    | Structure          | Use Case       | Complexity
TXT       | None               | Simple notes   | Easy
Markdown  | Lightweight markup | Documentation  | Medium
HTML      | Rich markup        | Web content    | Complex

Processing Plain Text (.txt)

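import re

def normalize_whitespace(text):
    # Minimal helper: collapse runs of spaces and tabs,
    # then reduce any run of blank lines to a single blank line
    text = re.sub(r'[ \t]+', ' ', text)
    return re.sub(r'\n\s*\n+', '\n\n', text)

def process_text_file(file_path):
    # Decode explicitly as UTF-8 instead of relying on the platform default
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()

    # Clean and normalize
    content = normalize_whitespace(content.strip())

    return {
        'content': content,
        'metadata': {
            'format': 'txt',
            'encoding': 'utf-8'
        }
    }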
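
The function above assumes the file is valid UTF-8 and raises UnicodeDecodeError otherwise. For collections of mixed origin, one option is to try a short list of encodings before falling back to a lossy decode. The sketch below is an illustration under assumptions: the name read_text_with_fallback and the default encoding list are hypothetical, and the right list depends on where your files come from.

```python
from pathlib import Path

def read_text_with_fallback(file_path, encodings=("utf-8-sig", "cp1252")):
    """Return (text, encoding_used), trying each encoding in order."""
    raw = Path(file_path).read_bytes()
    for encoding in encodings:
        try:
            # "utf-8-sig" decodes plain UTF-8 and also strips a leading BOM
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes with U+FFFD
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

Recording the encoding that succeeded in the document metadata makes it easier to trace garbled chunks back to a decoding guess later.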

Processing Markdown (.md)

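import re

import markdown
from bs4 import BeautifulSoup

def html_to_text(html):
    # Minimal helper: strip tags and keep one line per text node
    return BeautifulSoup(html, 'html.parser').get_text(separator='\n', strip=True)

def extract_headers(md_content):
    # Minimal helper: collect ATX headings (# to ######) as (level, title) pairs.
    # It does not skip '#' lines inside fenced code blocks.
    pattern = re.compile(r'^(#{1,6})[ \t]+(.+?)[ \t]*$', re.MULTILINE)
    return [(len(m.group(1)), m.group(2)) for m in pattern.finditer(md_content)]

def process_markdown(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        md_content = f.read()

    # Convert to HTML for structure
    html = markdown.markdown(md_content)

    # Extract plain text for embedding
    text = html_to_text(html)

    # Preserve structure
    headers = extract_headers(md_content)

    return {
        'content': text,
        'structure': headers,
        'html': html
    }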
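
Headings are most useful when they travel with the text they introduce. As a sketch of heading-aware chunking, the function below splits raw Markdown into one chunk per section and attaches the trail of parent headings to each chunk. The function name and the output shape are assumptions made for illustration, and only ATX-style headings (lines starting with #) are recognized.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block

def chunk_markdown_by_heading(md_content):
    """Split Markdown into one chunk per section, each with its heading trail."""
    chunks = []
    path = []         # stack of (level, title) leading to the current section
    body = []         # lines collected for the current section
    in_fence = False  # True while inside a fenced code block

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({
                "headings": [title for _, title in path],
                "content": text,
            })
        body.clear()

    for line in md_content.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside a code fence are never treated as headings
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before pushing this one
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries a list such as ['Guide', 'Install'] in its 'headings' field, which can be prepended to the chunk text before embedding or stored as metadata for filtering.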

Processing HTML

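from bs4 import BeautifulSoup

def extract_meta_tags(soup):
    # Minimal helper: map <meta name="..." content="..."> tags to a dict
    return {
        tag['name']: tag['content']
        for tag in soup.find_all('meta', attrs={'name': True, 'content': True})
    }

def process_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove scripts and styles
    for tag in soup(['script', 'style']):
        tag.decompose()

    # Extract text, one line per text node
    text = soup.get_text(separator='\n', strip=True)

    # Extract metadata
    metadata = {
        'title': soup.title.string if soup.title else None,
        'meta': extract_meta_tags(soup)
    }

    return {'content': text, 'metadata': metadata}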
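
Where adding BeautifulSoup as a dependency is not an option, the same idea (drop script and style content, keep visible text) can be approximated with the standard library's html.parser module. This is an alternative sketch, not the approach shown above: the names VisibleTextParser and html_to_visible_text are hypothetical, and the title is stored separately instead of being included in the content.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect visible text, ignoring the contents of script and style tags."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []        # visible text nodes, in document order
        self.title = None      # text of the <title> element, if any
        self._skip_depth = 0   # greater than 0 while inside script/style
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title = text
        else:
            self.parts.append(text)

def html_to_visible_text(html_content):
    parser = VisibleTextParser()
    parser.feed(html_content)
    parser.close()
    return {"content": "\n".join(parser.parts), "title": parser.title}
```

The standard-library parser is stricter about what it recovers from malformed markup than BeautifulSoup, so this version suits controlled inputs better than arbitrary web pages.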

Best Practices

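  • Encoding: Read and write files as UTF-8, and pass the encoding explicitly rather than relying on the platform default
  • Normalization: Collapse runs of spaces and blank lines before chunking
  • Structure: Preserve headings so chunks can be split along section boundaries
  • Metadata: Extract titles, dates, and authors when the source provides them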

Next: PDF processing.
