
Cloud Storage Ingestion
Ingest documents from S3, Google Cloud Storage, and Azure Blob Storage.
Connect to cloud storage providers for scalable document ingestion.
AWS S3
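import boto3

s3 = boto3.client('s3')

def ingest_from_s3(bucket, prefix=''):
    """Ingest every object stored under `prefix` in `bucket`."""
    # list_objects_v2 returns at most 1,000 keys per call; the paginator
    # follows the continuation tokens for us.
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent when a page has no matching objects
        for obj in page.get('Contents', []):
            # Download and process
            response = s3.get_object(Bucket=bucket, Key=obj['Key'])
            content = response['Body'].read()  # raw bytes
            process_and_index({
                'content': content,
                'source': f"s3://{bucket}/{obj['Key']}",
                'last_modified': obj['LastModified']
            })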
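The loop above downloads every key under the prefix. That can include the zero-byte "folder" placeholder objects (keys ending in `/`) that the S3 console creates, as well as file types the indexer cannot read. One possible way to filter the listing before downloading is sketched below. The helper name `iter_ingestable_objects` and the default suffix list are assumptions made for illustration, not part of the original pipeline.

```python
def iter_ingestable_objects(pages, suffixes=('.pdf', '.txt', '.md')):
    """Yield listing entries worth downloading.

    `pages` is any iterable of list_objects_v2 response pages, for example
    paginator.paginate(Bucket=bucket, Prefix=prefix).
    The suffix list is a hypothetical default; adjust it to your corpus.
    """
    for page in pages:
        for obj in page.get('Contents', []):
            key = obj['Key']
            # Skip console-created folder placeholders and empty objects
            if key.endswith('/') or obj.get('Size') == 0:
                continue
            # Skip file types the indexer is not expected to handle
            if suffixes and not key.lower().endswith(tuple(suffixes)):
                continue
            yield obj


# Local check with hand-written pages shaped like list_objects_v2 responses
sample_pages = [
    {'Contents': [
        {'Key': 'docs/', 'Size': 0},
        {'Key': 'docs/a.PDF', 'Size': 10},
        {'Key': 'docs/b.png', 'Size': 5},
    ]},
    {},  # a page with no 'Contents' key
]
kept = [obj['Key'] for obj in iter_ingestable_objects(sample_pages)]
print(kept)
```

Inside `ingest_from_s3`, the inner loop could then iterate over `iter_ingestable_objects(paginator.paginate(Bucket=bucket, Prefix=prefix))` instead of reading `page.get('Contents', [])` directly.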
Event-Based Ingestion
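from urllib.parse import unquote_plus

# Lambda function triggered by S3 events
def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Object keys in S3 event notifications are URL-encoded
        # (a space arrives as '+'), so decode before using the key
        key = unquote_plus(record['s3']['object']['key'])
        # Process new/updated file
        ingest_s3_file(bucket, key)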
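The handler calls `ingest_s3_file`, which is not defined on this page. A minimal sketch follows, assuming it mirrors the per-object logic of `ingest_from_s3`. The optional `s3_client` and `index_fn` parameters and the `FakeS3` stand-in are assumptions added here so the function can be exercised without AWS credentials; `process_and_index` is assumed to be defined elsewhere in the pipeline.

```python
import io
from datetime import datetime, timezone


def ingest_s3_file(bucket, key, s3_client=None, index_fn=None):
    """Fetch one S3 object and pass it to the indexer (hypothetical helper)."""
    if s3_client is None:
        import boto3  # imported lazily so the function can be tested offline
        s3_client = boto3.client('s3')
    if index_fn is None:
        index_fn = process_and_index  # assumed to exist elsewhere in the pipeline

    response = s3_client.get_object(Bucket=bucket, Key=key)
    document = {
        'content': response['Body'].read(),
        'source': f"s3://{bucket}/{key}",
        # get_object includes the object's LastModified timestamp in its response
        'last_modified': response['LastModified'],
    }
    index_fn(document)
    return document


class FakeS3:
    """Stand-in client for a local check; returns only the two fields used above."""

    def get_object(self, Bucket, Key):
        return {
            'Body': io.BytesIO(b'hello'),
            'LastModified': datetime(2024, 1, 1, tzinfo=timezone.utc),
        }


indexed = []
doc = ingest_s3_file('my-bucket', 'docs/report 1.txt',
                     s3_client=FakeS3(), index_fn=indexed.append)
print(doc['source'])
```

Called with only `bucket` and `key`, as in the handler, the function falls back to a real boto3 client and to `process_and_index`. In a Lambda deployment it may be preferable to create the client once at module level so warm invocations reuse it.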
Next: API-based ingestion.