
Incremental Updates and Re-Indexing
Efficiently update your RAG index with changed documents while avoiding redundant processing.
Only reprocess what changed, to keep embedding costs and indexing time down.
Change Detection
import hashlib

def get_file_hash(file_path):
    # SHA-256 of the file contents; any edit changes the digest
    with open(file_path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()

def needs_reindexing(file_path):
    # A file needs re-indexing when its current hash differs
    # from the hash recorded at last indexing time
    current_hash = get_file_hash(file_path)
    stored_hash = get_stored_hash(file_path)
    return current_hash != stored_hash
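The get_stored_hash and store_hash helpers are assumed, not defined above. A minimal sketch backed by a local JSON file (hash_store.json is just an illustrative path) could look like this:

import json
import os

HASH_STORE = 'hash_store.json'  # assumed local store of path -> hash

def _load_store():
    if os.path.exists(HASH_STORE):
        with open(HASH_STORE) as f:
            return json.load(f)
    return {}

def get_stored_hash(file_path):
    # Returns None for files never indexed, so they always need indexing
    return _load_store().get(file_path)

def store_hash(file_path, file_hash):
    store = _load_store()
    store[file_path] = file_hash
    with open(HASH_STORE, 'w') as f:
        json.dump(store, f)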
Incremental Processing
def incremental_ingest():
    for file_path in all_files():
        if needs_reindexing(file_path):
            # Remove old embeddings for this source
            delete_embeddings(source=file_path)
            # Reprocess and index the updated version
            process_and_index(file_path)
            # Record the new hash so unchanged files are skipped next run
            store_hash(file_path, get_file_hash(file_path))
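delete_embeddings and process_and_index depend on your vector store and chunking pipeline. As one possible sketch, assuming a Chroma collection and a hypothetical chunk_text helper:

import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection('docs')

def delete_embeddings(source):
    # Remove every chunk whose metadata points at this source file
    collection.delete(where={'source': source})

def process_and_index(file_path):
    with open(file_path) as f:
        text = f.read()
    chunks = chunk_text(text)  # hypothetical chunking helper
    collection.add(
        ids=[f'{file_path}:{i}' for i in range(len(chunks))],
        documents=chunks,
        metadatas=[{'source': file_path} for _ in chunks],
    )

Storing the source path in metadata is what makes the delete-then-reinsert pattern possible: all chunks belonging to a stale document can be removed in one filtered call.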
Version Tracking
# Track document versions alongside the index
{
    'document_id': 'doc_123',
    'versions': [
        {'version': 1, 'hash': 'abc...', 'indexed_at': '2025-01-01'},
        {'version': 2, 'hash': 'def...', 'indexed_at': '2025-01-05'}
    ]
}
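A small helper can append the next version entry each time a document is re-indexed. A sketch with a hypothetical record_version function operating on the record shape above:

from datetime import date

def record_version(doc_record, file_hash):
    # Append the next version number with today's date
    versions = doc_record.setdefault('versions', [])
    versions.append({
        'version': len(versions) + 1,
        'hash': file_hash,
        'indexed_at': date.today().isoformat(),
    })
    return doc_record

# e.g. call record_version(doc_record, get_file_hash(path))
# right after a successful re-index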
Soft Deletes
from datetime import datetime, timezone

# Mark documents as deleted instead of removing them
def soft_delete(document_id):
    update_metadata(document_id, {
        'deleted': True,
        'deleted_at': datetime.now(timezone.utc).isoformat(),
    })

# Filter soft-deleted docs out of query results
def search(query):
    return vector_db.query(query, where={'deleted': {'$ne': True}})
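Soft-deleted documents still occupy space in the index, so a periodic maintenance pass often hard-deletes them once it is safe. A sketch, assuming the same vector_db object exposes a delete method mirroring query's filter syntax:

def purge_soft_deleted():
    # Permanently drop everything previously marked as deleted
    # (vector_db.delete with a where filter is an assumed API here)
    vector_db.delete(where={'deleted': True})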
Module 5 complete! Next: Data conditioning.