Vertex AI Feature Store: The Single Source of Truth

Stop duplicating feature engineering code. Learn how Feature Store unifies Online (Serving) and Offline (Training) feature access.

The "Two Pipelines" Problem

Without a Feature Store, you usually build two pipelines:

  1. Training Pipeline: A massive SQL query that joins tables to calculate Avg_Spend_30d.
  2. Serving Pipeline: A fast Java/Go function that queries the database to calculate Avg_Spend_30d for the user right now.

Risk: If the SQL logic and the Java logic differ by even 1%, your model's live predictions silently degrade. This is training-serving skew.

Vertex AI Feature Store provides a centralized repository so you define the logic once.
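The skew above is easy to create by accident. A toy sketch (hypothetical data and function names, not the real pipelines) shows two implementations that both claim to compute a 30-day average spend, yet disagree because one divides by days with transactions and the other by all days in the window:

```python
from datetime import date, timedelta

# Hypothetical history: $100 spent on each of the first 10 days of January
txns = [(date(2024, 1, d), 100.0) for d in range(1, 11)]

def avg_spend_training(txns, as_of):
    """'SQL-style' training logic: total spend divided by the
    number of days that HAD transactions in the 30-day window."""
    start = as_of - timedelta(days=30)
    window = [(d, amt) for d, amt in txns if start <= d <= as_of]
    active_days = {d for d, _ in window}
    return sum(amt for _, amt in window) / len(active_days)

def avg_spend_serving(txns, as_of):
    """'Serving-side' logic: total spend divided by ALL 30 days."""
    start = as_of - timedelta(days=30)
    window = [amt for d, amt in txns if start <= d <= as_of]
    return sum(window) / 30

print(avg_spend_training(txns, date(2024, 1, 31)))  # 100.0
print(avg_spend_serving(txns, date(2024, 1, 31)))   # ~33.3
```

Same feature name, same window, a 3x disagreement. Defining the logic once in a Feature Store removes this class of bug by construction.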


1. Architecture

  • EntityType: The "Noun" (e.g., User, Product, Store).
  • Feature: The "Adjective" (e.g., age, average_rating, zip_code).
  • Ingestion: You stream or batch write values into the store.

The Two Interfaces

  1. Offline Store (BigQuery backed):
    • Used for: Training.
    • Query: "Give me the values of age and spend for these 100k users."
    • Key capability: Point-in-Time Lookup (Time Travel); see Section 2.
  2. Online Store (Bigtable/Redis backed):
    • Used for: Serving.
    • Query: "Give me the latest values for User_123."
    • Latency: < 10ms.
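The two interfaces can be modeled with a toy in-memory sketch (illustrative only, not the SDK): the offline store keeps the full timestamped history, the online store is just "the latest value per entity" on top of it:

```python
from datetime import datetime

# Offline store: full history of (entity_id, feature, timestamp, value) rows
offline_store = [
    ("123", "avg_spend", datetime(2024, 1, 1), 500.0),
    ("123", "avg_spend", datetime(2024, 2, 1), 1000.0),
]

def online_read(entity_id, feature):
    """Online interface: only the LATEST value, for low-latency serving."""
    rows = [r for r in offline_store if r[0] == entity_id and r[1] == feature]
    return max(rows, key=lambda r: r[2])[3]

def offline_read(entity_id, feature, as_of):
    """Offline interface: the value as it existed at `as_of` (time travel)."""
    rows = [r for r in offline_store
            if r[0] == entity_id and r[1] == feature and r[2] <= as_of]
    return max(rows, key=lambda r: r[2])[3]

print(online_read("123", "avg_spend"))                        # 1000.0
print(offline_read("123", "avg_spend", datetime(2024, 1, 15)))  # 500.0
```

In the real product, the offline side is backed by BigQuery (high throughput, full history) and the online side by Bigtable (latest values, point lookups), but the contract is the same.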

2. Point-in-Time Correctness (Time Travel)

This is the killer feature. Imagine you are training a fraud model on an incident that happened on Jan 1st.

  • User's spend on Jan 1st was $500.
  • User's spend today (Feb 1st) is $1000.

If you just query the current spend for your training set, you leak future information ($1000). The model learns the backwards rule "high spend predicts past fraud" instead of the behavior that preceded the fraud. Feature Store lets you ask: "Give me the feature values as they existed at the timestamp of the event."
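The mechanics of a point-in-time join can be demonstrated locally with pandas (hypothetical data; `merge_asof` is the standard tool for "as of this timestamp" joins):

```python
import pandas as pd

# Label event: the fraud happened on Jan 1st
labels = pd.DataFrame({
    "user_id": ["123"],
    "event_time": [pd.Timestamp("2024-01-01")],
    "is_fraud": [1],
})

# Feature history: spend was $500 on Jan 1st, $1000 by Feb 1st
features = pd.DataFrame({
    "user_id": ["123", "123"],
    "feature_time": [pd.Timestamp("2024-01-01"), pd.Timestamp("2024-02-01")],
    "spend": [500.0, 1000.0],
})

# WRONG: join the *current* value per user -> leaks the future $1000
leaky = labels.merge(
    features.sort_values("feature_time").groupby("user_id").tail(1),
    on="user_id",
)

# RIGHT: point-in-time join -> only values known at event_time
correct = pd.merge_asof(
    labels.sort_values("event_time"),
    features.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="user_id",
)

print(leaky["spend"].iloc[0])    # 1000.0 (leaked from the future)
print(correct["spend"].iloc[0])  # 500.0  (known on Jan 1st)
```

Feature Store's offline serving performs this point-in-time join for you at scale, using the timestamps you supply with each training instance.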


3. Code Example: Fetching Features

from google.cloud import aiplatform

# Resource names below (project, featurestore, bucket) are illustrative
aiplatform.init(project="my-project", location="us-central1")

# 1. SERVING (Online)
# Get the latest values for user "123" via the EntityType resource
entity_type = aiplatform.featurestore.EntityType(
    entity_type_name="users", featurestore_id="my_featurestore"
)
features = entity_type.read(
    entity_ids=["123"],
    feature_ids=["age", "avg_spend"],
)
# Returns a DataFrame, e.g. entity_id="123", age=25, avg_spend=500.0

# 2. TRAINING (Offline)
# Get values for a list of users as of the timestamps in the CSV
fs = aiplatform.featurestore.Featurestore(featurestore_name="my_featurestore")
training_df = fs.batch_serve_to_df(
    serving_feature_ids={"users": ["age", "avg_spend"]},
    read_instances_uri="gs://my-bucket/training_ids_and_timestamps.csv",
)

4. Summary

  • Feature Store prevents skew by unifying logic.
  • Offline = High Throughput, Time Travel (Training).
  • Online = Low Latency (Serving).
  • Point-in-Time prevents data leakage.

In the next module, we enter the Lab: Model Prototyping.


Knowledge Check

Why can't you just use BigQuery for both training and online serving?
