Incremental Updates and Re-Indexing

Efficiently update your RAG index with changed documents while avoiding redundant processing.

The core idea: process only what has changed, so you avoid paying embedding and indexing costs for unchanged content.

Change Detection

import hashlib

def get_file_hash(file_path):
    # Compute a SHA-256 digest of the file contents, reading in
    # chunks so large files are never loaded into memory at once
    digest = hashlib.sha256()
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            digest.update(chunk)
    return digest.hexdigest()

def needs_reindexing(file_path):
    # A file needs reindexing when its current hash differs from the
    # hash recorded at the last successful indexing run
    current_hash = get_file_hash(file_path)
    stored_hash = get_stored_hash(file_path)  # None if never indexed
    return current_hash != stored_hash
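
The get_stored_hash and store_hash helpers need somewhere to keep the last indexed hash per file. One minimal option is a small SQLite table; the table name and schema here are illustrative, not prescribed:

import sqlite3

# Tiny persistent store mapping file path -> last indexed hash
conn = sqlite3.connect('index_state.db')
conn.execute(
    'CREATE TABLE IF NOT EXISTS file_hashes (path TEXT PRIMARY KEY, hash TEXT)'
)

def get_stored_hash(file_path):
    row = conn.execute('SELECT hash FROM file_hashes WHERE path = ?',
                       (file_path,)).fetchone()
    return row[0] if row else None  # None means the file was never indexed

def store_hash(file_path, file_hash):
    conn.execute('INSERT INTO file_hashes (path, hash) VALUES (?, ?) '
                 'ON CONFLICT(path) DO UPDATE SET hash = excluded.hash',
                 (file_path, file_hash))
    conn.commit()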

Incremental Processing

def incremental_ingest():
    for file_path in all_files():
        if needs_reindexing(file_path):
            # Remove stale embeddings first, so the index never holds
            # a mix of old and new chunks for the same source
            delete_embeddings(source=file_path)

            # Re-chunk, re-embed, and index the updated file
            process_and_index(file_path)

            # Record the new hash only after indexing succeeds
            store_hash(file_path, get_file_hash(file_path))
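
Note that incremental_ingest catches new and modified files but not deletions. A companion pass can compare the hash store against what is currently on disk; this sketch reuses the SQLite store and the all_files / delete_embeddings helpers from above:

def remove_stale_entries():
    # A path recorded in the hash store but absent on disk was deleted
    # at the source, so its embeddings and hash record should go too
    live_paths = set(all_files())
    for (path,) in conn.execute('SELECT path FROM file_hashes').fetchall():
        if path not in live_paths:
            delete_embeddings(source=path)
            conn.execute('DELETE FROM file_hashes WHERE path = ?', (path,))
    conn.commit()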

Version Tracking

# Track document versions
{
    'document_id': 'doc_123',
    'versions': [
        {'version': 1, 'hash': 'abc...', 'indexed_at': '2025-01-01'},
        {'version': 2, 'hash': 'def...', 'indexed_at': '2025-01-05'}
    ]
}
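
Appending to that history fits naturally into the re-indexing step. A minimal sketch, with record_version as an illustrative helper operating on a dict shaped like the example above:

from datetime import date

def record_version(doc, new_hash):
    # Skip the append when the content hash has not actually changed
    versions = doc.setdefault('versions', [])
    if versions and versions[-1]['hash'] == new_hash:
        return
    versions.append({
        'version': len(versions) + 1,
        'hash': new_hash,
        'indexed_at': date.today().isoformat(),
    })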

Soft Deletes

from datetime import datetime, timezone

# Mark as deleted instead of removing; cheap, reversible, and the
# vectors stay in place until a compaction pass cleans them up
def soft_delete(document_id):
    update_metadata(document_id, {
        'deleted': True,
        'deleted_at': datetime.now(timezone.utc).isoformat(),
    })

# Filter out deleted docs in queries
def search(query):
    return vector_db.query(query, where={'deleted': {'$ne': True}})
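
Soft-deleted vectors still take up space, so many pipelines pair this with a periodic compaction job that hard-deletes anything flagged long enough ago. A sketch, assuming your vector store supports metadata-filtered deletes in the same style as the query filter above:

from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # illustrative grace period

def compact_deleted():
    # Permanently remove documents soft-deleted before the cutoff;
    # adapt the where-filter to your store's actual delete API
    cutoff = (datetime.now(timezone.utc) - RETENTION).isoformat()
    vector_db.delete(where={'deleted': {'$eq': True},
                            'deleted_at': {'$lt': cutoff}})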

Module 5 complete! Next: Data conditioning.
