Cloud Storage Ingestion

Ingest documents from S3, Google Cloud Storage, and Azure Blob Storage.

Connect to cloud storage providers for scalable document ingestion.

AWS S3

import boto3

s3 = boto3.client('s3')

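def ingest_from_s3(bucket, prefix=''):
    """Ingest every object under s3://<bucket>/<prefix>."""
    # list_objects_v2 returns at most 1,000 keys per call;
    # the paginator follows the continuation tokens for us.
    paginator = s3.get_paginator('list_objects_v2')

    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent when a page has no matching keys
        for obj in page.get('Contents', []):
            # Skip "folder" placeholder keys, which carry no document
            if obj['Key'].endswith('/'):
                continue

            # Download the whole object into memory
            response = s3.get_object(Bucket=bucket, Key=obj['Key'])
            content = response['Body'].read()

            # process_and_index is your pipeline's indexing function
            # (not defined in this post)
            process_and_index({
                'content': content,
                'source': f"s3://{bucket}/{obj['Key']}",
                'last_modified': obj['LastModified']
            })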
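
Each listing entry carries a LastModified timestamp, so a scheduled run can skip objects it has already seen instead of downloading the whole bucket again. The following is a minimal sketch, assuming the caller stores the time of the previous run somewhere; iter_changed_objects and since are hypothetical names, not part of boto3.

```python
from datetime import datetime, timezone


def iter_changed_objects(pages, since):
    """Yield listing entries modified strictly after `since`.

    pages: any iterable of list_objects_v2 response pages, for example
           paginator.paginate(Bucket=bucket, Prefix=prefix).
    since: a timezone-aware datetime (boto3 returns aware UTC datetimes,
           and Python cannot compare aware with naive datetimes).
    """
    for page in pages:
        # 'Contents' is missing on pages with no matching keys
        for obj in page.get('Contents', []):
            if obj['LastModified'] > since:
                yield obj


# Hypothetical cutoff; in practice this would be loaded from wherever
# the previous run recorded its start time.
last_run = datetime(2024, 2, 1, tzinfo=timezone.utc)
```

Inside ingest_from_s3, the inner loop could then iterate over iter_changed_objects(paginator.paginate(Bucket=bucket, Prefix=prefix), last_run) rather than over every entry.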

Event-Based Ingestion

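# Lambda function triggered by S3 events
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Keys in S3 event notifications are URL-encoded (spaces arrive as '+')
        key = unquote_plus(record['s3']['object']['key'])

        # Process the new or updated file
        ingest_s3_file(bucket, key)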
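
The handler calls ingest_s3_file, which this post does not define. One plausible shape, sketched here under the assumption that it mirrors the per-object step of ingest_from_s3: fetch a single object and pass the same dictionary to the indexer. The s3_client and index parameters are additions for this sketch so the function can be exercised without AWS credentials; they are not part of the original two-argument call.

```python
def ingest_s3_file(bucket, key, s3_client, index):
    """Fetch one S3 object and hand it to the indexer.

    s3_client: a boto3 S3 client, e.g. boto3.client('s3')
    index:     callable that accepts the document dict,
               e.g. the post's process_and_index (assumed, not shown)
    """
    response = s3_client.get_object(Bucket=bucket, Key=key)
    document = {
        # Reads the full object body into memory
        'content': response['Body'].read(),
        'source': f"s3://{bucket}/{key}",
        # get_object reports the object's last-modified time as well
        'last_modified': response['LastModified'],
    }
    index(document)
    return document
```

With this signature the handler would pass the shared client and indexer explicitly, for example ingest_s3_file(bucket, key, s3, process_and_index), or wrap them once at module load.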

Next: API-based ingestion.
