Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models. It now supports Apache Iceberg table format, streaming ingestion, scalable batch ingestion, and fine-grained access control through AWS Lake Formation.
As organizations scale their machine learning platforms from experimentation to production, two operational challenges consistently surface. The first is securing access to sensitive feature data without introducing manual overhead for every new feature group. The second is keeping storage costs predictable when high-frequency streaming workloads generate ever-growing volumes of Apache Iceberg metadata. For example, one retail analytics team discovered that their Apache Iceberg-based offline store had accumulated over 50 TB of metadata files in under a year, driving substantial and unexpected Amazon Simple Storage Service (Amazon S3) charges. Meanwhile, infrastructure teams across industries told us they need Lake Formation-enforced access control on feature data that works automatically at the point of feature group creation. They don’t want it as an afterthought requiring repetitive manual configuration.
Today, we’re announcing three new capabilities available in SageMaker Python SDK v3.8.0 that address these challenges:
In this post, we walk through each capability with code examples you can use to get started. For complete end-to-end walkthroughs, see the accompanying notebooks for Lake Formation governance and Iceberg table properties in the SageMaker Python SDK repository.
To follow along with the examples in this post, you need:
pip install --upgrade "sagemaker>=3.8.0"
These capabilities are delivered through new parameters in the SDK v3 FeatureGroupManager.create() and FeatureGroupManager.update() calls. LakeFormationConfig triggers automatic access control setup, and IcebergProperties configures the metadata lifecycle. Both can be set at feature group creation time or applied to existing feature groups.
SageMaker Python SDK v3.8.0, released April 16, 2026, is the foundation for the capabilities described in this post. The modernized SDK introduces a modular architecture, improved performance, and removal of legacy hard dependencies (such as PyTorch). These changes result in faster installation and smaller environments.
The following Feature Store capabilities are available in SDK v3:
PutRecord, GetRecord, and BatchGetRecord.
FeatureGroupManager.ingest() from both Pandas and Spark DataFrames.
IcebergProperties and LakeFormationConfig are fully supported in the create and update workflows.
The Feature Store API surface is consistent with SDK v2, so existing code works with minimal changes. Review the SDK v3 changelog for details on breaking changes in other areas of the SDK.
Here’s how to create a feature group with the new Lake Formation and Iceberg parameters:
# Assumes FeatureGroupManager, OfflineStoreConfig, S3StorageConfig, LakeFormationConfig,
# and IcebergProperties are imported from the SageMaker Python SDK v3, and that
# df, role, and bucket are already defined.
fg = FeatureGroupManager.create(
    feature_group_name="my-features",
    record_identifier_feature_name="user_id",
    event_time_feature_name="event_time",
    feature_definitions=df,
    role_arn=role,
    online_store_config={"EnableOnlineStore": True},
    offline_store_config=OfflineStoreConfig(
        s3_storage_config=S3StorageConfig(s3_uri=f"s3://{bucket}/feature-store/"),
        table_format="Iceberg",
    ),
    lake_formation_config=LakeFormationConfig(
        enabled=True,
        hybrid_access_mode_enabled=True,
        acknowledge_risk=True,
    ),
    iceberg_properties=IcebergProperties(
        properties={
            "write.metadata.delete-after-commit.enabled": "true",
            "write.metadata.previous-versions-max": "10",
        }
    ),
)
Configuring AWS Lake Formation on Feature Store data previously required several manual steps: registering S3 locations, revoking the IAMAllowedPrincipals group, and configuring data filters for each feature group. This process was time-consuming, error-prone, and had to be repeated for every new feature group. Organizations in financial services, healthcare, and other regulated industries that need column-level, row-level, and cell-level access control found this particularly burdensome.
You can now activate Lake Formation access control on a feature group’s offline store at creation time by passing a LakeFormationConfig to FeatureGroupManager.create(). You can also activate it on existing feature groups using FeatureGroupManager.enable_lake_formation(). When this configuration is turned on, Feature Store automatically performs the following operations on your behalf:
With hybrid_access_mode_enabled=False, the SDK revokes the IAMAllowedPrincipal grant on the AWS Glue table, so access must go through Lake Formation’s permission model only. With hybrid_access_mode_enabled=True, both AWS Identity and Access Management (IAM) policies and Lake Formation permissions coexist, which is useful for gradual migration. For more information, see hybrid access mode.
This is an opt-in, per-feature-group setting. If you omit it, behavior is unchanged and existing feature groups continue to work with IAM-based access.
The following creates a new feature group with Lake Formation access control activated. For additional configuration options, see Enable Lake Formation with Feature Groups.
fg = FeatureGroupManager.create(
    feature_group_name="governed-customer-features",
    record_identifier_feature_name="customer_id",
    event_time_feature_name="event_time",
    feature_definitions=customer_df,
    role_arn=role,
    online_store_config={"EnableOnlineStore": True},
    offline_store_config=OfflineStoreConfig(
        s3_storage_config=S3StorageConfig(s3_uri=f"s3://{bucket}/feature-store/"),
        table_format="Iceberg",
    ),
    lake_formation_config=LakeFormationConfig(
        enabled=True,
        hybrid_access_mode_enabled=True,
        acknowledge_risk=True,
    ),
)
To activate Lake Formation on an existing feature group:
fg = FeatureGroupManager.get(
    feature_group_name="existing-feature-group",
)
fg.enable_lake_formation(
    hybrid_access_mode_enabled=True,
    acknowledge_risk=True,
)
After the feature group is configured, use the Lake Formation console or API to grant fine-grained permissions. You can grant a data science team SELECT access to only the customer_id, credit_score, and region columns (column-level filtering). You can also restrict an analyst to rows where region = 'us-east-1' (row-level filtering), or combine both for cell-level access control.
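As a rough sketch of what such a grant could look like outside the console, the snippet below builds a request body for the Lake Formation CreateDataCellsFilter API, combining the column list and the row expression from the paragraph above into one cell-level filter. The account ID, database name, table name, and filter name are placeholders and do not come from this post; look up the actual AWS Glue database and table behind your feature group before using anything like this.

```python
def build_cell_filter(catalog_id, database, table, name, columns, row_expression):
    """Build a CreateDataCellsFilter request body.

    Listing columns gives column-level filtering, the row expression gives
    row-level filtering, and using both together gives cell-level filtering.
    """
    return {
        "TableData": {
            "TableCatalogId": catalog_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": name,
            "ColumnNames": list(columns),
            "RowFilter": {"FilterExpression": row_expression},
        }
    }


# All identifiers below are placeholders for illustration only.
request = build_cell_filter(
    catalog_id="111122223333",
    database="my_feature_store_database",
    table="my_feature_group_table",
    name="analyst-us-east-1-filter",
    columns=["customer_id", "credit_score", "region"],
    row_expression="region = 'us-east-1'",
)

# With AWS credentials configured, the payload could be submitted with boto3:
#   import boto3
#   boto3.client("lakeformation").create_data_cells_filter(**request)
print(request["TableData"]["ColumnNames"])
print(request["TableData"]["RowFilter"]["FilterExpression"])
```

The filter on its own grants nothing. A principal still needs a SELECT grant that references the filter, which can be added through the Lake Formation console or API.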
Online store isn’t affected. Lake Formation access control applies only to the offline store. The online store continues to use IAM-based authorization, so real-time inference latency is unchanged.
Works with both AWS Glue and Iceberg table formats. Lake Formation access control applies the same way regardless of which table format you use for the offline store.
Cross-account compatible. If you use AWS Resource Access Manager (AWS RAM) to share Feature Store tables across accounts, Lake Formation grants continue to work alongside existing cross-account sharing patterns. Note: you must disable hybrid access mode for cross-account access when the table format is Iceberg.
Prerequisite: Data Lake Administrator. The system validates that at least one Data Lake Administrator is configured in your account before activating access control. If none exists, the create call returns an immediate, descriptive error rather than failing asynchronously.
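If you want to confirm the Data Lake Administrator prerequisite before calling create, a small preflight check along these lines may help. This is a sketch and not how the SDK performs its own validation: it inspects a response shaped like the output of the Lake Formation GetDataLakeSettings API, and the sample responses and role ARN below are made up.

```python
def has_data_lake_admin(settings_response):
    """Return True if a GetDataLakeSettings response lists at least one admin."""
    settings = settings_response.get("DataLakeSettings", {})
    admins = settings.get("DataLakeAdmins", [])
    return any(admin.get("DataLakePrincipalIdentifier") for admin in admins)


# Sample responses with made-up values, shaped like the GetDataLakeSettings output.
configured = {
    "DataLakeSettings": {
        "DataLakeAdmins": [
            {"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/DataLakeAdmin"}
        ]
    }
}
not_configured = {"DataLakeSettings": {"DataLakeAdmins": []}}

# Against a real account, the response would come from boto3:
#   import boto3
#   response = boto3.client("lakeformation").get_data_lake_settings()
print(has_data_lake_admin(configured))
print(has_data_lake_admin(not_configured))
```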
For more information, see Enable Lake Formation with Feature Groups.
Amazon SageMaker Feature Store supports Apache Iceberg as a table format for the offline store, which improves query performance through compaction and supports record-level operations. This section introduces new parameters that give you control over Iceberg metadata lifecycle.
For workloads with high-frequency writes (such as streaming feature pipelines that ingest records every few seconds), Iceberg metadata files accumulate with every commit. Without lifecycle controls, this metadata can grow rapidly and without bound. One customer with over 40 streaming feature groups saw their S3 bucket grow from a few gigabytes to over 50 TB of metadata in under a year. Feature Store was committing to the offline store at high frequency (under 10 minutes between commits), and each commit produced new metadata files. Without write properties preset to limit snapshots or metadata file retention, the metadata accumulated unchecked. The cleanup operations they attempted through Amazon Athena (OPTIMIZE and VACUUM) timed out on tables exceeding 50 TB. They had to resort to costly Amazon EMR Serverless Spark jobs and eventually rewrite their tables entirely.
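To make the accumulation concrete, here is a deliberately simplified model of how many version metadata files remain in storage after a year of commits at a 10-minute interval. It assumes one new metadata file per commit and that the table tracks the current version plus at most write.metadata.previous-versions-max earlier ones. It ignores manifests, manifest lists, and data files, so treat it as an illustration of the trend and not as a sizing tool.

```python
def simulate_metadata_files(commits, previous_versions_max, delete_after_commit):
    """Count version metadata files left in storage after `commits` commits."""
    tracked = 0     # versions referenced by the table (current + previous)
    on_storage = 0  # metadata files physically present in the bucket
    for _ in range(commits):
        tracked += 1
        on_storage += 1
        if tracked > previous_versions_max + 1:
            tracked -= 1  # the oldest version drops out of the tracked log
            if delete_after_commit:
                on_storage -= 1  # and its file is deleted from storage
    return on_storage


commits_per_year = 365 * 24 * 6  # one commit every 10 minutes

without_cleanup = simulate_metadata_files(commits_per_year, 100, False)
with_cleanup = simulate_metadata_files(commits_per_year, 10, True)
print(without_cleanup, with_cleanup)
```

Under these assumptions, the file count without cleanup grows with every commit, while the count with cleanup levels off at the retention limit plus the current version.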
You can now pass an IcebergProperties configuration when creating an Iceberg-format feature group. These properties are applied to the underlying Iceberg table, giving you control over metadata lifecycle from day one. You can also update Iceberg properties on existing feature groups using FeatureGroupManager.update().
Some examples of supported properties are:
| Property | Default | Description |
|---|---|---|
| write.metadata.delete-after-commit.enabled | false | Delete oldest tracked metadata files after each commit |
| write.metadata.previous-versions-max | 100 | Max number of previous version metadata files to track |
| history.expire.max-snapshot-age-ms | 432000000 (5 days) | Max age of snapshots to keep while expiring |
| history.expire.min-snapshots-to-keep | 1 | Min number of snapshots to keep while expiring |
| write.target-file-size-bytes | 536870912 (512 MB) | Target size for generated data files |
| write.parquet.row-group-size-bytes | 134217728 (128 MB) | Parquet row group size |
| read.split.target-size | 134217728 (128 MB) | Target size when combining data input splits |
For the complete list of supported properties, see Iceberg metadata management in the SageMaker AI documentation.
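Because these values are passed as strings in raw milliseconds and bytes, it is easy to be off by a factor of 1,000 or 1,024. The helpers below are a small convenience sketch (the function names are our own, not part of the SDK) that convert days and binary megabytes into the string values shown in the table.

```python
def days_to_ms(days):
    """Convert days to a millisecond count, returned as a property string."""
    return str(int(days * 24 * 60 * 60 * 1000))


def mb_to_bytes(megabytes):
    """Convert binary megabytes (1 MB = 1024 * 1024 bytes) to a property string."""
    return str(int(megabytes * 1024 * 1024))


properties = {
    "history.expire.max-snapshot-age-ms": days_to_ms(1),
    "write.target-file-size-bytes": mb_to_bytes(512),
    "write.parquet.row-group-size-bytes": mb_to_bytes(128),
}

for key, value in properties.items():
    print(f"{key} = {value}")
```

The resulting dictionary has the same shape as the properties argument used in the IcebergProperties examples in this post.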
fg = FeatureGroupManager.create(
    feature_group_name="streaming-click-features",
    record_identifier_feature_name="session_id",
    event_time_feature_name="event_time",
    feature_definitions=clicks_df,
    role_arn=role,
    offline_store_config=OfflineStoreConfig(
        s3_storage_config=S3StorageConfig(s3_uri=f"s3://{bucket}/feature-store/"),
        table_format="Iceberg",
    ),
    iceberg_properties=IcebergProperties(
        properties={
            "write.metadata.delete-after-commit.enabled": "true",
            "write.metadata.previous-versions-max": "10",
            "history.expire.max-snapshot-age-ms": "86400000",
            "history.expire.min-snapshots-to-keep": "5",
            "write.target-file-size-bytes": "536870912",
        }
    ),
)
To update Iceberg properties on an existing feature group:
fg = FeatureGroupManager.get(
    feature_group_name="existing-feature-group",
    include_iceberg_properties=True,
)
fg.update(
    iceberg_properties=IcebergProperties(
        properties={
            "write.metadata.delete-after-commit.enabled": "true",
            "write.metadata.previous-versions-max": "10",
        }
    )
)
Start with metadata cleanup for streaming workloads. If your pipeline writes to the offline store more than once per minute, set write.metadata.delete-after-commit.enabled to "true" and limit write.metadata.previous-versions-max. This is the single most impactful configuration change for preventing storage cost overruns.
Continue running compaction. These properties manage metadata lifecycle, but you still need to run Iceberg compaction (using Athena OPTIMIZE + VACUUM or Spark maintenance actions) to merge small data files for optimal query performance.
Tune snapshot retention for compliance needs. Audit-heavy workloads that require time-travel queries should use higher values for history.expire.min-snapshots-to-keep and history.expire.max-snapshot-age-ms. Cost-optimized streaming pipelines benefit from shorter retention.
Set properties at creation time. These properties take effect on new commits. For existing feature groups with accumulated metadata, use FeatureGroupManager.update() to set properties, then run Spark snapshot expiration and orphan file deletion to reclaim storage.
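The maintenance steps mentioned above can be sketched as plain SQL statements. The snippet below only builds the statement strings: Athena OPTIMIZE and VACUUM for compaction, and the Iceberg Spark procedures expire_snapshots and remove_orphan_files for reclaiming storage. The catalog, database, and table names are placeholders, the cutoff date is arbitrary, and how you submit the statements (Athena query, spark.sql, or a scheduled job) depends on your environment.

```python
from datetime import datetime, timedelta


def athena_compaction_statements(database, table):
    """Athena statements that merge small data files and then clean up."""
    target = f"{database}.{table}"
    return [
        f"OPTIMIZE {target} REWRITE DATA USING BIN_PACK",
        f"VACUUM {target}",
    ]


def spark_cleanup_statements(catalog, database, table, cutoff, retain_last):
    """Iceberg Spark procedure calls that expire snapshots and drop orphan files."""
    target = f"{database}.{table}"
    older_than = cutoff.strftime("%Y-%m-%d %H:%M:%S")
    return [
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{target}', older_than => TIMESTAMP '{older_than}', "
        f"retain_last => {retain_last})",
        f"CALL {catalog}.system.remove_orphan_files(table => '{target}')",
    ]


# Placeholder names and an arbitrary fixed cutoff, for illustration only.
cutoff = datetime(2026, 1, 2) - timedelta(days=1)
statements = athena_compaction_statements("my_database", "my_feature_table")
statements += spark_cleanup_statements(
    "my_catalog", "my_database", "my_feature_table", cutoff, retain_last=5
)
for statement in statements:
    print(statement)
```

Orphan file removal is left at its default age threshold here, which is the more cautious choice when writers may still be committing to the table.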
For the complete list of supported properties, see Iceberg metadata management.
By combining both capabilities in a single FeatureGroupManager.create() call, you produce a feature group that’s simultaneously governed and cost-optimized. No follow-up configuration is required. The offline store metadata is automatically managed, and Lake Formation access control is active without manual registration. The online store continues to serve low-latency features with IAM authorization.
fg = FeatureGroupManager.create(
    feature_group_name="real-time-user-signals",
    record_identifier_feature_name="user_id",
    event_time_feature_name="event_time",
    feature_definitions=signals_df,
    role_arn=role,
    online_store_config={"EnableOnlineStore": True},
    offline_store_config=OfflineStoreConfig(
        s3_storage_config=S3StorageConfig(s3_uri=f"s3://{bucket}/feature-store/"),
        table_format="Iceberg",
    ),
    lake_formation_config=LakeFormationConfig(
        enabled=True,
        hybrid_access_mode_enabled=True,
        acknowledge_risk=True,
    ),
    iceberg_properties=IcebergProperties(
        properties={
            "write.metadata.delete-after-commit.enabled": "true",
            "write.metadata.previous-versions-max": "10",
            "history.expire.max-snapshot-age-ms": "86400000",
            "history.expire.min-snapshots-to-keep": "5",
        }
    ),
)
For complete end-to-end notebooks with step-by-step instructions, see the Lake Formation governance notebook and the Iceberg table properties notebook in the SageMaker Python SDK repository.
To avoid ongoing charges, delete the feature groups that you created while following this walkthrough. If you added Amazon S3 locations to Lake Formation, deregister them through the Lake Formation console or the DeregisterResource API. Revoke the Lake Formation permissions you granted for testing.
Together, these enhancements make Amazon SageMaker Feature Store simpler to secure, more cost-efficient to operate, and faster to integrate into your ML pipelines. By automating Lake Formation access control, surfacing fine-grained Iceberg lifecycle settings, and delivering both through a lightweight modular SDK, these changes remove the undifferentiated heavy lifting that previously stood between your team and production-ready feature management at scale. Whether you are onboarding your first feature group or managing hundreds across multiple teams, these capabilities help you move faster, with access control and cost controls built in from day one.
We encourage you to upgrade to SageMaker Python SDK v3.8.0 and explore how these capabilities can streamline your existing workflows.
For more information, see the Feature Store documentation, the Lake Formation access control guide, the Iceberg metadata management guide, and the SDK v3 release notes. To get hands-on, try the Lake Formation notebook and the Iceberg properties notebook.
For background on Feature Store concepts and earlier capabilities, explore these related posts: