
Data Preprocessing and Postprocessing
Clean your inputs and sanitize your outputs. Use regex and validation libraries so that Gemini's output connects cleanly to your databases.
Bridging the messy world of humans and the structured world of databases.
Preprocessing (Input)
- Truncation: If the input exceeds 1M tokens, cut it at a sentence boundary rather than mid-sentence.
- Anonymization: Use regex to replace emails/SSNs with [REDACTED].
- Formatting: Convert HTML to Markdown (Markdown usually takes fewer tokens than HTML markup and is generally easier for LLMs to read).
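The truncation and anonymization steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not a production implementation: the function names, the regex patterns, and the `max_chars` parameter are all hypothetical. In particular, it budgets by characters rather than tokens, so a real pipeline would count tokens with the model's own tokenizer, and the patterns only catch simple email addresses and US-style SSNs. HTML-to-Markdown conversion is left out because it is usually delegated to a third-party library.

```python
import re

# Illustrative patterns only (assumption): real PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Sentence-ending punctuation that is followed by whitespace.
SENTENCE_END_RE = re.compile(r"[.!?](?=\s)")


def anonymize(text: str) -> str:
    """Replace email addresses and US-style SSNs with [REDACTED]."""
    text = EMAIL_RE.sub("[REDACTED]", text)
    return SSN_RE.sub("[REDACTED]", text)


def truncate_at_sentence(text: str, max_chars: int) -> str:
    """Cut text to at most max_chars, ending on a sentence boundary if possible.

    max_chars is a character budget standing in for a token budget (assumption).
    """
    if len(text) <= max_chars:
        return text
    # Look one character past the budget so the lookahead can see the
    # whitespace that follows a sentence ending exactly at the limit.
    window = text[: max_chars + 1]
    ends = [m.end() for m in SENTENCE_END_RE.finditer(window)]
    if ends:
        return window[: ends[-1]]
    # No sentence boundary found: fall back to the last space within budget.
    head = text[:max_chars]
    cut = head.rfind(" ")
    return head[:cut] if cut > 0 else head
```

A typical call order would be `truncate_at_sentence(anonymize(user_text), budget)`, so that redaction happens before any text leaves your system.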
Postprocessing (Output)
Gemini returns a string. Your DB wants an Integer.
- Strip formatting: Remove the Markdown code fence (triple backticks, often tagged json) that wraps the output.
- Validation: Use Pydantic.
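    from pydantic import BaseModel, ValidationError


    class User(BaseModel):
        age: int


    # gemini_response: the raw JSON string returned by the model (example value).
    gemini_response = '{"age": 30}'

    try:
        u = User.model_validate_json(gemini_response)
    except ValidationError as exc:
        # Handle the error (maybe retry the request).
        print(f"Validation failed: {exc}")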
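The strip-formatting step can be sketched as a small helper that runs before validation. This is one possible approach, assuming the model wraps its whole answer in at most one code fence; `strip_code_fence`, `FENCE`, and `FENCE_RE` are hypothetical names, not part of any library.

```python
import json
import re

FENCE = "`" * 3  # three backticks
# An opening fence with an optional language tag, the payload, a closing fence.
FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?[ \t]*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fence(raw: str) -> str:
    """Return the payload inside a single Markdown code fence.

    If the text is not fenced, it is returned with surrounding whitespace removed.
    """
    match = FENCE_RE.match(raw)
    return match.group(1).strip() if match else raw.strip()


# Example: a fenced response and a bare response clean up to the same JSON.
fenced = FENCE + 'json\n{"age": 42}\n' + FENCE
print(json.loads(strip_code_fence(fenced)))
print(json.loads(strip_code_fence('  {"age": 42}  ')))
```

The cleaned string can then be passed to `User.model_validate_json` in place of the raw response.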
Summary
Never pipe AI output directly to a DB. Always sanitize it first.
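As a rough illustration of that rule, the sketch below validates the model output first and only then writes it with a parameterized query. Everything here is an assumption made for the example: the in-memory SQLite database, the `users` table, the `parse_age` and `save_age` helpers, and the 0 to 150 range check.

```python
import json
import sqlite3


def parse_age(raw_json: str) -> int:
    """Extract an integer age from model output, raising ValueError on bad data."""
    try:
        age = json.loads(raw_json)["age"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError(f"unusable model output: {raw_json!r}") from exc
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(age, bool) or not isinstance(age, int):
        raise ValueError(f"age must be an integer, got {age!r}")
    # Hypothetical business rule for this example.
    if not 0 <= age <= 150:
        raise ValueError(f"age out of range: {age}")
    return age


def save_age(conn: sqlite3.Connection, raw_json: str) -> bool:
    """Insert the age only if it validates; return whether a row was written."""
    try:
        age = parse_age(raw_json)
    except ValueError:
        return False
    # Parameterized query: the value is bound, never spliced into the SQL text.
    with conn:
        conn.execute("INSERT INTO users (age) VALUES (?)", (age,))
    return True


# Hypothetical in-memory table for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (age INTEGER NOT NULL)")
```

With this shape, a malformed or hostile response results in a `False` return value and no row written, and the caller can decide whether to retry the request.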
In the final lesson of this module, we discuss Monitoring.