Building and managing machine learning (ML) features at scale is one of the most critical and complex challenges in modern data science workflows. Organizations often struggle with fragmented feature pipelines, inconsistent data definitions, and redundant engineering efforts across teams. Without a centralized system for storing and reusing features, models risk being trained on outdated or mismatched data, leading to poor generalization, lower model accuracy and governance issues. Furthermore, enabling collaboration across data engineering, data science, and ML operations teams becomes difficult when each group maintains its own isolated datasets and transformations.
Amazon SageMaker addresses these challenges through SageMaker Unified Studio and SageMaker Catalog, which organizations can use to build, manage, and share assets securely across projects and accounts. A key capability within this ecosystem is the implementation of an offline feature store—a structured repository designed for managing historical feature data used in model training and validation. Offline feature stores are designed for scalability, lineage tracking, and reproducibility, so that data scientists can train models on accurate, time-aligned datasets that prevent data leakage and maintain consistency across experiments.
This blog post provides step-by-step guidance on implementing an offline feature store using SageMaker Catalog within a SageMaker Unified Studio domain. By adopting a publish-subscribe pattern, data producers can use this solution to publish curated, versioned feature tables—while data consumers can securely discover, subscribe to, and reuse them for model development. The approach integrates Amazon S3 Tables with Apache Iceberg for transactional consistency, AWS Lake Formation for fine-grained access control, and Amazon SageMaker Studio for visual and code-based data engineering.
Through this unified solution, teams can achieve consistent feature governance, accelerate ML experimentation, and reduce operational overhead. In this post, we show you how to design a collaborative, governed, and production-ready offline feature store to unlock enterprise-wide reuse of trusted ML features.
This solution demonstrates how to implement an offline feature store using a SageMaker Unified Studio domain integrated with SageMaker Catalog to enable scalable, governed, and collaborative feature management across ML teams. The architecture establishes a unified environment, shown in the following figure, that streamlines how administrators, data engineers, and data scientists create, publish, and consume high-quality, reusable feature tables.

At its core, the solution uses a SageMaker Unified Studio domain as the governance and collaboration layer for managing projects, users, and data assets under centralized control. S3 Tables in the Apache Iceberg format serve as the foundation for storing and versioning feature data. SageMaker Catalog, which allows unified governance for datasets, acts as the central registry for publishing, discovering, and subscribing to feature tables.
The following describes how various personas interact in the end-to-end workflow:
- The administrator publishes two assets—airline_delay.csv and airline_features (an S3 table)—to the project catalog, then designates the data engineer as the project owner.
- The data engineer curates the airline_features table within the project catalog.
- The data engineer enriches the airline_features table with metadata for improved discoverability and governance.
- The data engineer publishes the airline_features table into SageMaker Catalog for organization-wide access.
- The data scientist discovers the airline_features table in the catalog and identifies the published feature table as suitable for their model development.

This structured workflow provides consistent data governance, promotes collaboration, and eliminates redundant feature engineering efforts by enabling enterprise-wide reuse of trusted, versioned ML features.
The offline feature store solution architecture is composed of several integrated components. Each component plays a distinct role in enabling secure data governance, scalable feature engineering, and seamless collaboration across ML personas. The key components include:
SageMaker Unified Studio domain: The SageMaker Unified Studio domain serves as the central control plane for managing ML projects, users, and data assets. It provides a unified interface for collaboration between data engineers, data scientists, and administrators. This domain enables enforcement of fine-grained access controls, integrates with AWS IAM Identity Center for single sign-on, and supports approval workflows to help ensure secure sharing of ML assets across teams and accounts.
S3 Tables with Apache Iceberg format: S3 Tables enables scalable, serverless storage for feature data using the Apache Iceberg table format. The Apache Iceberg Open Table Format (OTF) enables ACID transactions, schema evolution, and time-travel capabilities, which teams can use to query historical versions of feature data with full reproducibility. S3 Tables integrate seamlessly with Apache Spark, AWS Glue, and SageMaker for consistent data access across analytical and ML workloads.
Feature engineering pipeline: The feature engineering pipeline automates the transformation of raw datasets into curated, high-quality features. Built on Apache Spark, it provides distributed data processing at scale, enabling complex transformations such as delay-rate computation, categorical encoding, and feature aggregation. The pipeline directly writes outputs to S3 tables, helping to ensure traceability and consistency between raw/processed data and engineered features.
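The transformations named above (delay-rate computation and categorical encoding) can be sketched in plain Python. This is a minimal illustration, not the actual Spark pipeline; the column names (`arr_flights`, `arr_del15`) mirror the public airline delay dataset but are assumptions about the schema:

```python
from collections import defaultdict

# Hypothetical raw rows resembling the airline delay dataset
raw_rows = [
    {"carrier": "AA", "airport": "JFK", "arr_flights": 100, "arr_del15": 20},
    {"carrier": "AA", "airport": "LAX", "arr_flights": 50, "arr_del15": 5},
    {"carrier": "DL", "airport": "JFK", "arr_flights": 80, "arr_del15": 8},
]

def engineer_features(rows):
    """Compute a per-carrier delay rate and a simple categorical encoding."""
    carriers = sorted({r["carrier"] for r in rows})
    carrier_idx = {c: i for i, c in enumerate(carriers)}  # categorical encoding

    # Aggregate flight and delay counts per carrier
    totals = defaultdict(lambda: [0, 0])
    for r in rows:
        totals[r["carrier"]][0] += r["arr_flights"]
        totals[r["carrier"]][1] += r["arr_del15"]

    features = []
    for r in rows:
        flights, delayed = totals[r["carrier"]]
        features.append({
            "carrier_id": carrier_idx[r["carrier"]],
            "airport": r["airport"],
            "carrier_delay_rate": delayed / flights if flights else 0.0,
        })
    return features

features = engineer_features(raw_rows)
```

In the real pipeline the same aggregation runs distributed on Spark and the output is written to the S3 table rather than kept in memory.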
SageMaker Catalog: SageMaker Catalog acts as the organization-wide repository for registering, publishing, and discovering ML assets such as datasets, feature tables, and models. It integrates with Lake Formation for fine-grained access control and IAM Identity Center for user management. The catalog supports metadata enrichment, versioning, and approval workflows, so teams can securely share and reuse trusted assets across projects.
Together, these components create a cohesive ecosystem that simplifies ML feature lifecycle management—from creation and publication to discovery and consumption—while maintaining enterprise-grade governance and data lineage tracking.
The administrator workflow defines the initial setup required to establish a secure and collaborative environment for implementing the offline feature store. Administrators provision the SageMaker Unified Studio domain, enable IAM Identity Center for user authentication, and configure S3 tables with Lake Formation for governed data access. They also create dedicated producer and consumer projects, deploy the necessary infrastructure through environment blueprints (based on CloudFormation), and assign users and groups with appropriate permissions. This setup helps ensure a consistent, well-governed foundation that data engineers and data scientists can use to build, publish, and consume ML features seamlessly within SageMaker Unified Studio.
The following prerequisites must be completed to help ensure that the required AWS services and permissions are properly set up for seamless integration and governance.
After completing the prerequisites, the next step is to set up the environment by creating and configuring the SageMaker Unified Studio domain, which serves as the central workspace for managing users, projects, and data assets within the offline feature store architecture.
1. Create a SageMaker Unified Studio domain (for example, Corporate).
2. Select a VPC in which SageMaker Unified Studio should be correctly configured, and choose at least three private subnets, each in a different Availability Zone.
3. Create the producer project airlines_core_features. Add descriptions, configure settings, and assign users and permissions.
4. Create the consumer project airlines_ml_models.
5. From the airlines_core_features project details, note the project role ARN (arn:aws:iam::ACCOUNT:role/datazone_usr_role_*). Do the same for airlines_ml_models.

Deploy the CloudFormation stack only after completing the previous steps, including domain creation, project creation, IAM Identity Center setup, and S3 Tables enablement. The stack requires the following input parameters:
- The IAM Identity Center portal URL (for example, https://d-1234da5678.awsapps.com/start)
- The project role ARNs (for example, arn:aws:iam::<account-id>:role/datazone_usr_role_xxxx_yyyy)

After the stack deploys successfully, the following AWS resources will be created:
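If you prefer to deploy the stack from code rather than the console, the parameter wiring looks roughly like the following sketch. The `ParameterKey` names here are placeholders, not the template's actual keys; match them to the Parameters section of the CloudFormation template you deploy:

```python
# Hypothetical parameter keys -- replace with the keys defined in the
# actual CloudFormation template's Parameters section.
def build_stack_parameters(portal_url, project_role_arn):
    """Build the CloudFormation Parameters list for the feature store stack."""
    values = {
        "IdentityCenterPortalUrl": portal_url,
        "ProducerProjectRoleArn": project_role_arn,
    }
    return [{"ParameterKey": k, "ParameterValue": v} for k, v in values.items()]

params = build_stack_parameters(
    "https://d-1234da5678.awsapps.com/start",
    "arn:aws:iam::123456789012:role/datazone_usr_role_xxxx_yyyy",
)

# With boto3 installed and credentials configured, the deployment call is:
# import boto3
# cfn = boto3.client("cloudformation")
# cfn.create_stack(StackName="smus-feature-store", TemplateBody=template_body,
#                  Parameters=params, Capabilities=["CAPABILITY_NAMED_IAM"])
```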
- S3 bucket: amzn-s3-demo-blog-smus-featurestore-{account-id}
- S3 Tables bucket: amzn-s3-demo-airlines-s3tables-bucket
- S3 table namespace airlines with the table fg_airline_features
- AWS Glue database: airline_raw_db
- IAM groups: FeatureStore-Producers and FeatureStore-Consumers
- Lake Formation database and table permissions

Assign IAM Identity Center groups for single sign-on (SSO) to the SageMaker domain to enable user access and collaboration across projects.
- feature-store-admin-group (for admins)
- feature-store-producer-group (for data engineering teams)
- feature-store-consumer-group (for data science teams)
Create individual users for data producers, consumers, and administrators to access the SageMaker domain.
For the data producer, enter the following details:

- Username: dataproducer
- Email: [email protected]
- First name: Data
- Last name: Producer
- Group: feature-store-producer-group

Use the same steps to create the data consumer and admin users, with the following differences for step 2:
- Data consumer: username dataconsumer, email [email protected], first name Data, last name Consumer, group feature-store-consumer-group
- Admin: username dataadmin, email [email protected], first name Data, last name Admin, group feature-store-admin-group

Sign in to the SageMaker Unified Studio corporate domain as the admin user from the console and assign the user groups to their corresponding projects for proper access control.
1. Open the airlines_core_features project settings and add feature-store-producer-group with appropriate project permissions.
2. Open the airlines_ml_models project settings and add feature-store-consumer-group with appropriate project permissions.

Create the S3 prefix /raw/AirlineDelayCause/ in the S3 bucket amzn-s3-demo-blog-smus-featurestore-<account-id> created by the CloudFormation template. Then use the Amazon S3 console to upload the sample airline delay dataset to s3://amzn-s3-demo-blog-smus-featurestore-<account-id>/raw/AirlineDelayCause/
After uploading the dataset, navigate to the airlines_core_features project and query the data using the AWS Glue Data Catalog:
SELECT * FROM "awsdatacatalog"."curated_db"."airline_delay_cause" LIMIT 10;

Query the feature store table to verify that it's accessible. It will return zero records but should execute without errors.
SELECT * FROM "s3tablescatalog/airlines"."airlines"."fg_airline_features" LIMIT 10;
This workflow demonstrates how data engineers create and share features using SageMaker Unified Studio and S3 Tables.
Navigate to the SageMaker Unified Studio domain and sign in using IAM Identity Center credentials as the dataproducer user (member of the feature-store-producer-group), then access the airlines_core_features project.
1. Enter airlines-delay-cause-feature-engineering-pipeline as the job name and choose Submit.
2. After the job completes, create a data source named airlines_features_datasource.
3. For the data source selection criteria, enter s3tablescatalog/airlines.* and choose Next.
In the project catalog, select the fg_airline_features asset, which was created by the data source job, and publish it.
Query the created feature store loaded by the data processing job.
SELECT * FROM "s3tablescatalog/airlines"."airlines"."fg_airline_features" LIMIT 10;

Now the feature store table will be discoverable by other projects.
This section demonstrates the end-to-end machine learning workflow for data scientists consuming features from the offline feature store built with S3 Tables and SageMaker Unified Studio, starting with signing in as the dataconsumer user.
Navigate to your SageMaker Unified Studio domain URL, sign in using the data consumer credentials and select the consumer project airlines_ml_models.
Use the search bar and enter fg_airline_features as the feature store name to find the published asset, then select the fg_airline_features catalog asset. You can also use the AI-powered catalog search to find features using a partial feature name or description.

Choose Subscribe for the selected airline feature asset, enter a business justification for ML model development, and submit the access request for approval.

The data producer user can view the pending subscription requests in their workflow. When the subscription is approved by a data producer, the fg_airline_features table will be visible in the project catalog database and available to query, as shown in the following figure.

Data lineage in SageMaker Catalog is an OpenLineage-compatible feature that you can use to capture and visualize lineage events—from OpenLineage-enabled systems or through APIs—to trace data origins, track transformations, and view cross-organizational data consumption. To view data lineage, select the Lineage tab to display the complete lineage of your subscribed assets, as shown in the following figure.
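Because the lineage integration is OpenLineage-compatible, custom systems can also emit lineage events through the API. The following sketch builds a minimal OpenLineage-style run event; the job and dataset names are illustrative, and the producer URI is a placeholder. Such an event could be sent through the Amazon DataZone PostLineageEvent API:

```python
import json
import uuid
from datetime import datetime, timezone

def make_lineage_event(job_name, input_table, output_table):
    """Build a minimal OpenLineage COMPLETE event linking an input dataset
    to the feature table it produced (illustrative field values)."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/feature-pipeline",  # placeholder producer URI
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "feature-store", "name": job_name},
        "inputs": [{"namespace": "s3tablescatalog", "name": input_table}],
        "outputs": [{"namespace": "s3tablescatalog", "name": output_table}],
    }

event = make_lineage_event(
    "airlines-delay-cause-feature-engineering-pipeline",
    "curated_db.airline_delay_cause",
    "airlines.fg_airline_features",
)
print(json.dumps(event, indent=2))
```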

After you have access to the published feature store, you can use the following queries in SQL Query Editor to query the data and use Iceberg’s time travel capabilities.
-- Query latest records
SELECT * FROM "fg_airline_features" LIMIT 10;
Retrieves the latest 10 records from the feature store.
-- List snapshots
SELECT snapshot_id, committed_at, operation, summary FROM "fg_airline_features$snapshots" ORDER BY committed_at DESC;
Lists all historical snapshots with timestamps and operations, showing how the feature table evolved.
-- Query specific version
SELECT * FROM "fg_airline_features" FOR VERSION AS OF <snapshot_id_here> LIMIT 10;
Retrieves features from a specific snapshot version, ensuring reproducibility for model training.
-- List available timestamps
SELECT committed_at FROM "fg_airline_features$snapshots" ORDER BY committed_at DESC;
Lists all timestamps when the feature table was modified.
-- Query as of timestamp
SELECT * FROM "fg_airline_features"
FOR TIMESTAMP AS OF TIMESTAMP '<time_stamp>' LIMIT 10;
Retrieves features as they existed at a specific time.
-- View table history
SELECT * FROM "fg_airline_features$history"
ORDER BY made_current_at DESC;
Displays the complete audit trail with snapshot IDs and timestamps for compliance and debugging.
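When running these queries from code, a small helper can build the time-travel variants so that training jobs pin a specific snapshot or timestamp. This is a sketch against Athena's Iceberg time-travel syntax; the table name is the one used above:

```python
def time_travel_query(table, snapshot_id=None, timestamp=None, limit=10):
    """Build an Iceberg time-travel SELECT for Athena.

    Pass snapshot_id for FOR VERSION AS OF, or timestamp for
    FOR TIMESTAMP AS OF; with neither, the latest data is queried.
    """
    if snapshot_id is not None and timestamp is not None:
        raise ValueError("Pass either snapshot_id or timestamp, not both")
    query = f'SELECT * FROM "{table}"'
    if snapshot_id is not None:
        query += f" FOR VERSION AS OF {snapshot_id}"
    elif timestamp is not None:
        query += f" FOR TIMESTAMP AS OF TIMESTAMP '{timestamp}'"
    return query + f" LIMIT {limit}"

# Pin the features a model was trained on to a fixed point in time
q = time_travel_query("fg_airline_features", timestamp="2025-01-01 00:00:00 UTC")
```

Pinning the snapshot ID in the training job's configuration is what makes a later retraining run byte-for-byte reproducible.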
This consumer flow enables efficient feature reuse across teams while maintaining proper governance and access controls.
In this section, you learn how to set up an S3 table to use as an offline feature store for training and batch inference.
- S3 bucket name: <amzn-s3-demo-bucket-name>. For example, sagemaker_project_bucket
- AWS Region: <your-aws-region>. For example, us-east-1
- Project path: s3://<amzn-s3-demo-bucket-name>/<your-project-path>/dev/
- Glue database name: <your_glue_database_name>. For example, project_glue_database
- Feature group table: <your_feature_group_table>. For example, project_feature_group_table

This notebook implements a training and batch inference pipeline for airline delay prediction using the Amazon SageMaker XGBoost regression algorithm. The pipeline executes the following steps:
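One step worth sketching is the time-aligned split the pipeline relies on: training must never see feature rows newer than the training cutoff, which is what prevents leakage. A minimal illustration in plain Python (the column names and cutoff date are hypothetical, not the notebook's actual schema):

```python
from datetime import datetime

# Hypothetical feature rows with event timestamps, as stored in the offline store
rows = [
    {"carrier_id": 0, "carrier_delay_rate": 0.17, "event_time": datetime(2025, 1, 10)},
    {"carrier_id": 1, "carrier_delay_rate": 0.10, "event_time": datetime(2025, 2, 5)},
    {"carrier_id": 0, "carrier_delay_rate": 0.21, "event_time": datetime(2025, 3, 1)},
]

def point_in_time_split(rows, cutoff):
    """Split feature rows by event time so the training set never contains
    data newer than the cutoff (preventing leakage into training)."""
    train = [r for r in rows if r["event_time"] <= cutoff]
    holdout = [r for r in rows if r["event_time"] > cutoff]
    return train, holdout

train, holdout = point_in_time_split(rows, datetime(2025, 2, 28))
```

Combined with Iceberg time travel, the same cutoff can be reapplied months later to rebuild exactly the training set the model originally saw.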


Implementing an offline feature store with Amazon SageMaker Unified Studio and SageMaker Catalog provides a unified, secure, and scalable approach to managing ML features across teams and projects. The integrated architecture helps ensure consistent governance, lineage tracking, and reproducibility, while enabling seamless collaboration between data engineers and data scientists. By using Amazon S3 Tables with Apache Iceberg, organizations can take advantage of ACID transactions and time-travel capabilities, improving the reliability of training data and model performance. The publish–subscribe pattern simplifies asset sharing, reduces duplication, and accelerates model development life cycles.
To explore these capabilities in your AWS environment, set up your SageMaker Unified Studio Domain, publish your first feature dataset, and unlock the full potential of reusable, governed ML assets for your organization.
We sincerely appreciate the thoughtful technical blog review by Paul Hargis, whose insights and feedback were invaluable.