Modern enterprises face mounting challenges in extracting actionable insights from vast data lakes and lakehouses spanning petabytes of structured and unstructured data. Traditional analytics require specialized technical expertise in SQL, data modeling, and business intelligence tools, creating bottlenecks that slow decision-making across retail, financial services, healthcare, travel and hospitality, manufacturing, and many other industries. This architecture demonstrates how the agentic AI assistant in Amazon Quick transforms data analytics into a self-service capability, enabling business users to query complex structured datasets, combine them with unstructured data, and surface the insights that improve business outcomes through intuitive natural language interfaces.
To demonstrate the functionality, we built a lakehouse using the TPC-H datasets as our foundation. This integrated architecture leverages Amazon Simple Storage Service (Amazon S3) as the storage layer, Amazon SageMaker and AWS Glue for the lakehouse, Amazon Athena for serverless SQL querying across multiple storage formats (S3 Tables, Iceberg, and Parquet), and multiple features from Quick to build dashboards and conversational AI agents that provide natural language access to data insights. Through integrated knowledge bases using Amazon Quick spaces, this solution democratizes lakehouse data access for business users while preserving enterprise-grade security, governance frameworks, and the scalability required for modern data-driven decision-making across the organization.
The following diagram shows the overall design and corresponding dataflow that we implemented as part of this blog post.
Figure 1: Overall design diagram
Refer to the following steps for the detailed end-to-end data flow and user interaction capabilities.
Before you get started, make sure you have the following prerequisites:
In this section, we will mimic many of the data lake features by working with external tables, which allow querying data stored in Amazon S3 without loading it into a managed storage layer. We will then explore Open Table Format (OTF) tables using Apache Iceberg, which add support for ACID transactions. We will also leverage Amazon S3 Tables to showcase how AWS natively supports Iceberg-compatible table management directly within S3, simplifying lakehouse architecture at scale. Throughout these exercises, we will use the industry-standard TPC-H dataset, a benchmark workload representing a realistic business data model with orders, customers, and line items, to make sure our examples are both meaningful and reproducible.
We will leverage Amazon Athena for data preparation. If this is your first time using Amazon Athena, you’ll need to create an Amazon S3 bucket to store your query results; Athena requires this output location before you can run queries. Follow the official AWS getting started guide to complete this one-time setup: Getting Started with Amazon Athena. Alternatively, you can use the managed query results feature.
Tip: Choose an S3 bucket in the same AWS Region as your data sources to avoid cross-region data transfer costs and latency.
Once your S3 output location is configured, you’re ready to proceed.
Start by creating a Glue database that will serve as the metadata catalog for all your tables using Athena. Run the following SQL in the Athena query editor:
CREATE DATABASE IF NOT EXISTS blog_qs_athena_tpc_h_db_sql COMMENT 'TPC-H database';
Figure 2: Database creation blog_qs_athena_tpc_h_db_sql
What this does: This registers a logical database in the AWS Glue Data Catalog, which Athena uses to organize and discover your tables. Tables created in subsequent steps will live under this database.
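To confirm the database was registered, you can optionally list the catalog contents from the Athena query editor. This is just a quick sanity check using the database name created above:
-- Optional: confirm the new database exists in the AWS Glue Data Catalog
SHOW DATABASES;
-- Optional: list tables in the new database (empty at this point)
SHOW TABLES IN blog_qs_athena_tpc_h_db_sql;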
Next, create an external table pointing to the TPC-H “customer” dataset stored in a public S3 bucket ('s3://redshift-downloads/TPC-H/2.18/100GB/customer/'). External tables in Athena don’t move or copy data — they query it directly from S3, making this a fast and cost-effective way to explore raw data.
CREATE EXTERNAL TABLE IF NOT EXISTS blog_qs_athena_tpc_h_db_sql.customer_csv
(
C_CUSTKEY INT,
C_NAME STRING,
C_ADDRESS STRING,
C_NATIONKEY INT,
C_PHONE STRING,
C_ACCTBAL DOUBLE,
C_MKTSEGMENT STRING,
C_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://redshift-downloads/TPC-H/2.18/100GB/customer/'
TBLPROPERTIES ('classification' = 'csv');
Verify the table by previewing a few rows:
SELECT * FROM blog_qs_athena_tpc_h_db_sql.customer_csv LIMIT 10;
Figure 3: verify blog_qs_athena_tpc_h_db_sql.customer_csv
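Because the external table queries the public TPC-H files in place, you can also run aggregations immediately without any data loading. The following optional query, a quick sketch using the columns defined above, summarizes customers by market segment:
SELECT c_mktsegment, COUNT(*) AS customer_count
FROM blog_qs_athena_tpc_h_db_sql.customer_csv
GROUP BY c_mktsegment
ORDER BY customer_count DESC;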
Next, we will create a table using Apache Iceberg, an open table format that brings ACID transactions, time travel, and partition evolution to your data lake — making it ideal for production-grade workloads. This is a three-step process.
Step 1: Create the S3 Bucket – Before writing SQL queries, set up your storage layer. You can create an S3 bucket using the AWS Management Console or AWS CLI.
For this blog, we’re using the S3 bucket amzn-s3-demo-bucket.
Note: Your bucket name will be different, as S3 bucket names must be globally unique across all AWS accounts.
Step 2: Create an External CSV Table for Orders – First, register the raw orders data as an external table in its original format, in our case CSV.
CREATE EXTERNAL TABLE IF NOT EXISTS blog_qs_athena_tpc_h_db_sql.orders_csv
(
O_ORDERKEY BIGINT,
O_CUSTKEY BIGINT,
O_ORDERSTATUS STRING,
O_TOTALPRICE DOUBLE,
O_ORDERDATE STRING,
O_ORDERPRIORITY STRING,
O_CLERK STRING,
O_SHIPPRIORITY INT,
O_COMMENT STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = '|')
LOCATION 's3://redshift-downloads/TPC-H/2.18/100GB/orders/'
TBLPROPERTIES ('classification' = 'csv');
Let’s verify the dataset.
SELECT * FROM blog_qs_athena_tpc_h_db_sql.orders_csv LIMIT 10;
Figure 4: verify blog_qs_athena_tpc_h_db_sql.orders_csv
Step 3: Create the Iceberg Table Using CREATE TABLE AS SELECT (CTAS) – Use CTAS to create a self-managed Iceberg table in Parquet format, partitioned by order date. We’ll load a sample date range: O_ORDERDATE BETWEEN '1998-06-01' AND '1998-12-31'.
CREATE TABLE blog_qs_athena_tpc_h_db_sql.orders_iceberg
WITH (
table_type = 'ICEBERG',
format = 'PARQUET',
is_external = false,
partitioning = ARRAY['o_orderdate'],
location = 's3://amzn-s3-demo-bucket/tpch_iceberg/orders/')
AS
SELECT * FROM blog_qs_athena_tpc_h_db_sql.orders_csv
WHERE O_ORDERDATE BETWEEN '1998-06-01' AND '1998-12-31';
Verify the Iceberg table data:
SELECT * FROM blog_qs_athena_tpc_h_db_sql.orders_iceberg LIMIT 10;
Figure 5: verify blog_qs_athena_tpc_h_db_sql.orders_iceberg
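Because orders_iceberg is an Iceberg table, Athena also supports the time travel capability mentioned earlier. The following optional query is a minimal sketch; the timestamp is a placeholder that you would replace with a point in time after the table was created:
-- Query the Iceberg table as it existed at a specific point in time (time travel)
SELECT COUNT(*) AS order_count
FROM blog_qs_athena_tpc_h_db_sql.orders_iceberg
FOR TIMESTAMP AS OF TIMESTAMP '2025-06-01 00:00:00 UTC';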
Amazon S3 Tables are purpose-built, fully managed tables with built-in Apache Iceberg support. They deliver high-performance query throughput without the overhead of managing maintenance operations such as compaction, snapshot management, and unreferenced file removal. This is a three-step process.
Step 1: Create the S3 Table Bucket and Namespace – Navigate to S3 → Table buckets in the AWS Console to create the bucket blog-qs-athena-tpc-h-db-sql-s3-table-mar-3 and namespace. Alternatively, use the AWS CLI for scripted setup.
Note: You can skip these steps if you already have an S3 table bucket and namespace available.
Figure 6: Create S3 Table bucket blog-qs-athena-tpc-h-db-sql-s3-table-mar-3
Now let’s create a namespace blog_qs_athena_tpc_h_namespace associated with the S3 table bucket above by choosing blog-qs-athena-tpc-h-db-sql-s3-table-mar-3.
Figure 7: Create S3 table Namespace blog_qs_athena_tpc_h_namespace
Step 2: Create an External CSV Table for Line Items – Use Athena to register the TPC-H line items dataset as an external table:
CREATE EXTERNAL TABLE IF NOT EXISTS blog_qs_athena_tpc_h_db_sql.lineitem_csv
(
L_ORDERKEY BIGINT,
L_PARTKEY BIGINT,
L_SUPPKEY BIGINT,
L_LINENUMBER INT,
L_QUANTITY DECIMAL(15,2),
L_EXTENDEDPRICE DECIMAL(15,2),
L_DISCOUNT DECIMAL(15,2),
L_TAX DECIMAL(15,2),
L_RETURNFLAG STRING,
L_LINESTATUS STRING,
L_SHIPDATE STRING,
L_COMMITDATE STRING,
L_RECEIPTDATE STRING,
L_SHIPINSTRUCT STRING,
L_SHIPMODE STRING,
L_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://redshift-downloads/TPC-H/2.18/100GB/lineitem/'
TBLPROPERTIES ('skip.header.line.count' = '0');
Preview the data:
SELECT * FROM blog_qs_athena_tpc_h_db_sql.lineitem_csv LIMIT 10;
Figure 8: verify data blog_qs_athena_tpc_h_db_sql.lineitem_csv
Step 3: Create the S3 Table Using CTAS – Finally, create a Parquet-formatted S3 table in your new catalog using CTAS. We filter a sample date range to limit the initial data load based on CAST(L_SHIPDATE AS DATE) BETWEEN DATE('1998-06-01') AND DATE('1998-12-31').
Note: Make sure to use s3tablescatalog to run the following queries as shown in the following screenshot.
CREATE TABLE lineitem_csv_s3_table
WITH ( format = 'PARQUET')
AS
SELECT * FROM AwsDataCatalog.blog_qs_athena_tpc_h_db_sql.lineitem_csv
WHERE CAST(L_SHIPDATE AS DATE) BETWEEN DATE('1998-06-01') AND DATE('1998-12-31');
Verify the result:
SELECT * FROM lineitem_csv_s3_table LIMIT 10;
Figure 9: verify data lineitem_csv_s3_table
Your Athena tables are registered and queryable. Now it is time to bring that data into Amazon Quick – connecting it, shaping it, and making it speak the language of your business. This section walks through every step: connecting to the Athena data source, creating datasets and importing them into SPICE, joining the three SPICE datasets, configuring a Quick Topic for natural language Q&A, building and publishing a dashboard with Amazon Q, and setting up the Knowledge Base that powers the agentic layer.
Before Amazon Quick can query your three tables in your data lake, you create a single Athena data source connection. You can access all three tables — the CSV external table, the self-managed Iceberg Parquet table, and the S3 Tables managed Iceberg table — using the same connection because all three are cataloged in AWS Glue Data Catalog and accessible through the same Athena workgroup.
Steps:
- In Quick, create a new Athena data source connection and give it a descriptive name (for example, tpch-lakehouse-athena).

With the Athena data source created, create one Quick dataset per table. Import each dataset into SPICE — Quick’s Super-fast, Parallel, In-memory Calculation Engine — to deliver sub-second query performance in dashboards and agentic workflows, regardless of how large the underlying S3 data grows.
Before creating datasets, make sure the appropriate data access permissions are in place:
- If your AWS Glue Data Catalog databases and tables are managed by AWS Lake Formation, grant the Quick service role the required Lake Formation permissions (DESCRIBE and SELECT) on them, as shown in the following screenshot.
- Otherwise, make sure the Quick service role (aws-quicksight-service-role-v0) has the read IAM permissions for the relevant S3 buckets and Athena resources. No additional Lake Formation configuration is required.
Figure 10: Lake Formation permissions
For step-by-step instructions, see Securely analyze your data with AWS Lake Formation and Amazon Quick Sight, and Accessing Amazon S3 Tables through Amazon Quick with AWS Lake Formation Permissions.
For S3 Tables specifically, the Quick service role also requires an additional glue:GetCatalog inline policy to access the non-default s3tablescatalog catalog — see Visualizing S3 table data with Amazon Quick for the exact policy statement.
- Create a dataset from the Athena data source, choose the blog_qs_athena_tpc_h_db_sql database, and select the CSV external table (customer_csv). Name the dataset TPC-H Customer (CSV) and select Save & publish.
- Repeat for the self-managed Iceberg table (orders_iceberg). Name the dataset TPC-H Orders (Iceberg) and select Save & publish.

S3 Tables are stored in a non-default AWS Glue catalog (s3tablescatalog), not in the standard AWSDataCatalog. Because of this, the Quick visual table browser cannot display S3 Tables — they do not appear in the “Choose your table” pane. You must use Custom SQL to query S3 Tables data and create a Quick dataset from it.
- In the dataset creation flow, choose Use custom SQL and reference the table using the "s3tablescatalog/<table-bucket-name>"."<namespace>"."<table-name>" syntax:

SELECT * FROM "s3tablescatalog/blog-qs-athena-tpc-h-db-sql-s3-table-mar-3"."blog_qs_athena_tpc_h_namespace"."lineitem_csv_s3_table"
Figure 11: Preview S3 Table data from Quick
- Name the dataset TPC-H Lineitem (S3 Tables) and select Save & publish.

Note: This custom SQL requirement applies specifically to S3 Tables because they reside in a child Glue catalog registered separately from the default AWSDataCatalog. The CSV and Iceberg tables in the standard catalog are visible in the table browser and do not require custom SQL.
The TPC-H schema is a star schema by design, and Amazon Quick’s visual data preparation experience supports joining datasets directly in the UI. In this solution, we will pre-join all three tables in Athena using Custom SQL and ingest the unified result directly into SPICE as a single flat dataset. This removes Quick’s secondary table size constraint entirely and delegates the join to Athena, which handles tables of varying scale.
Note on the cross-source JOIN limit: If your secondary tables (orders_iceberg + customer_csv) are small enough to fit under 1 GB combined, you can perform the join inside Quick’s visual data preparation experience by opening the largest table first (making it the primary) and adding the smaller tables as secondary joins. For large TPC-H scale factors where the lineitem table dominates, the Athena pre-join approach below is the recommended path.
Steps:
- Create a new dataset from the same Athena data source, choose Use custom SQL, and enter a query that joins the tables across the default catalog database (blog_qs_athena_tpc_h_db_sql) and the S3 Tables non-default catalog (s3tablescatalog):

SELECT
c.c_custkey,
c.c_name,
c.c_mktsegment,
c.c_nationkey,
o.o_orderkey,
o.o_orderdate,
o.o_orderstatus,
o.o_totalprice,
o.o_orderpriority,
l.l_linenumber,
l.l_partkey,
l.l_suppkey,
l.l_quantity,
l.l_extendedprice,
l.l_discount,
l.l_shipmode,
l.l_returnflag
FROM "s3tablescatalog/blog-qs-athena-tpc-h-db-sql-s3-table-mar-3"."blog_qs_athena_tpc_h_namespace"."lineitem_csv_s3_table" l
INNER JOIN "blog_qs_athena_tpc_h_db_sql"."orders_iceberg" o
ON l.l_orderkey = o.o_orderkey
INNER JOIN "blog_qs_athena_tpc_h_db_sql"."customer_csv" c
ON o.o_custkey = c.c_custkey;
The query joins the three tables using the TPC-H foreign key relationships:
- lineitem_csv_s3_table.l_orderkey = orders_iceberg.o_orderkey (Lineitem → Orders)
- orders_iceberg.o_custkey = customer_csv.c_custkey (Orders → Customer)

Tip: Use explicit double quotes around both the database and table names in Athena SQL — this helps prevent parse errors caused by hyphens or other special characters in identifier names, particularly for S3 Tables catalog paths.
- Optionally, exclude the join key fields (c_custkey, o_custkey, o_orderkey, l_orderkey) that business users do not need to see in dashboards or Q&A.
Figure 12: Preview denormalized data from Quick
- Name the dataset TPC-H Unified (Joined), select Save & publish, and wait for the SPICE dataset status to change to “Ready” (expected time: 2-3 minutes).

The joined dataset is now a single, denormalized SPICE dataset combining customer, order, and line item data across all three table formats — CSV external, self-managed Iceberg Parquet, and S3 Tables managed Iceberg — ready for both dashboard authoring and natural language Q&A.
A Quick Topic is the semantic layer that translates column names into business concepts. When a user asks “What was total revenue last quarter by customer segment?”, the Topic maps revenue to l_extendedprice, “last quarter” to a date filter on o_orderdate, and customer segment to c_mktsegment. Without a well-configured Topic, natural language queries return generic or incorrect results. With one, they return precise, cited answers in seconds.
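To make that mapping concrete, a question like the one above resolves to roughly the following Athena query over the underlying tables. This is an illustrative sketch only: the Topic generates the query for you, the date range here is arbitrary, and revenue is approximated as extended price net of discount.
SELECT
c.c_mktsegment,
SUM(l.l_extendedprice * (1 - l.l_discount)) AS revenue
FROM "s3tablescatalog/blog-qs-athena-tpc-h-db-sql-s3-table-mar-3"."blog_qs_athena_tpc_h_namespace"."lineitem_csv_s3_table" l
INNER JOIN "blog_qs_athena_tpc_h_db_sql"."orders_iceberg" o ON l.l_orderkey = o.o_orderkey
INNER JOIN "blog_qs_athena_tpc_h_db_sql"."customer_csv" c ON o.o_custkey = c.c_custkey
WHERE o.o_orderdate BETWEEN '1998-10-01' AND '1998-12-31'
GROUP BY c.c_mktsegment
ORDER BY revenue DESC;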
Steps:
- Create a new Topic named TPC-H Analytics and add a plain-language description: “Customer, order, and line item data from the TPC-H benchmark dataset, covering revenue, pricing, discounts, order status, and customer market segments.”
- Select the TPC-H Unified (Joined) dataset as the data source.
Figure 13: Quick Topic enhancement
Figure 14: Quick Topic suggested questions
Amazon Q in Quick lets authors build dashboards using natural language — describe the visual you want, and Q generates it. This accelerates dashboard development from days to minutes and keeps the focus on business storytelling rather than chart configuration.
Steps:
- Create a new analysis from the TPC-H Unified (Joined) dataset.
- Use Amazon Q to describe the visuals you want, and add a filter control on o_orderdate so dashboard viewers can scope the data to a time range of their choice without requesting a new report.
- Link the TPC-H Analytics topic for enabling Dashboard Q&A. This embeds a natural language query bar directly in the published dashboard, allowing viewers to ask follow-up questions without leaving the dashboard. Quick automatically extracts semantic information from the dashboard visuals to power the Q&A experience.
- Publish the dashboard as TPC-H Lakehouse Analytics.
Figure 15: Share Dashboard
Your SPICE datasets are loaded, your Topic is published, and your dashboard is live. Each of these is valuable on its own. Together, unified inside a Quick Space, surfaced through a custom Chat Agent and indexed Knowledge Base, they become something qualitatively different: an agentic AI system that answers questions, retrieves context, and drives action — all from a single conversational interface.
The Knowledge Base gives the Chat Agent access to unstructured context that structured data alone cannot answer — data dictionaries, schema documentation, business rules, and domain reference material. For this solution, the Knowledge Base is built from TPC-H unstructured data: the official TPC-H specification document describing how your organization maps TPC-H fields to business concepts.
Steps:
- Create a Knowledge Base, upload the TPC-H specification document, and name it TPC-H Reference Knowledge Base.

Quick indexes the document and makes it searchable by the Chat Agent at query time. The agent retrieves relevant passages — not the entire document — so responses stay grounded and concise.
Best practice: Keep each document focused on a single topic. A 5-page data dictionary is more useful to the agent than a 200-page combined specification, because the agent retrieves by relevance — smaller, focused documents produce more precise retrievals.
A Quick Space is the organizational layer that abstracts your data assets — Topics, Knowledge Bases, dashboards, and datasets — into a single, governed context boundary. The Chat Agent you build in the next step does not query Topics and Knowledge Bases directly. It queries the Space. This design gives you one place to manage what the agent knows, who can access it, and what it is allowed to surface.
Steps:
- Create a new Space named TPC-H Lakehouse Analytics Space.

Add the Topic:
- Select TPC-H Analytics (the Topic configured in the Quick Topic Configuration section).

Add the Knowledge Base:
- Select TPC-H Reference Knowledge Base (the Knowledge Base configured in the Knowledge Base Configuration section).

Add the Dashboard:
- Select TPC-H Lakehouse Analytics (the dashboard configured in the Dashboard Build and Publish with Amazon Q section).

The Space now encapsulates everything the Chat Agent needs: structured data through the Topic, unstructured context through the Knowledge Base, and visual references through the Dashboard. The agent queries the Space; the Space enforces the boundaries. Quick enforces the same security rules from the underlying assets inside the Space — users in the Space see only the data their role permits, regardless of how they ask the question.
Figure 16: Artifacts in Space
The Chat Agent is the interface your business users interact with. It is not a generic assistant — it is a purpose-built, governed AI teammate scoped to the TPC-H Lakehouse Analytics Space. Users ask questions in plain English. The agent reasons over the Space, retrieves the right combination of structured data and unstructured context, and returns grounded, cited answers.
Steps:
- Create a new Chat Agent and provide instructions that define its scope and behavior:

“You are the TPC-H Analytics Agent for [Your Organization]. You help business analysts and data engineers answer questions about order revenue, supplier performance, line item pricing, and inventory availability using the TPC-H lakehouse dataset. Always ground your answers in data from the TPC-H Lakehouse Analytics Space. When a user asks a question that requires a chart or table, retrieve the answer from the Topic and present it clearly. When a user asks about schema definitions, query logic, or data dictionary terms, retrieve the answer from the Knowledge Base. Do not speculate. If you cannot find a grounded answer, say so and suggest a follow-up question.”

- Name the agent TPC-H Analytics Agent.
- Scope the agent to the TPC-H Lakehouse Analytics Space. Optionally, you can add the Space with the following steps.
- In the agent conversation, add the TPC-H Lakehouse Analytics Space as a source.
Figure 17: Interact with Agent
A business analyst opens the TPC-H Analytics Agent and types:
“Which customer segment drove the most revenue last month, and what does ‘market segment’ mean in the TPC-H schema?”
The agent:
- Queries the TPC-H Analytics Topic through the Space for revenue by c_mktsegment filtered to last month — returning a ranked result from SPICE.
- Retrieves the definition of c_mktsegment from the TPC-H data dictionary in the Knowledge Base.

No SQL. No dashboard navigation. No ticket to the data team. The answer arrives in one response, grounded in two sources, with every claim traceable to its origin.
Run the following steps to remove the artifacts created by this blog post.
Run the following statements using the Athena console:
DROP TABLE blog_qs_athena_tpc_h_db_sql.customer_csv;
DROP TABLE blog_qs_athena_tpc_h_db_sql.orders_csv;
DROP TABLE blog_qs_athena_tpc_h_db_sql.orders_iceberg;
DROP TABLE blog_qs_athena_tpc_h_db_sql.lineitem_csv;
DROP TABLE lineitem_csv_s3_table; --(use S3 catalog configuration)
DROP DATABASE blog_qs_athena_tpc_h_db_sql;
- To delete the lineitem_csv_s3_table table, use the AWS CLI, AWS SDKs, or Amazon S3 REST API. Learn more
- To delete the blog_qs_athena_tpc_h_namespace namespace, use the AWS CLI, AWS SDKs, or Amazon S3 REST API. Learn more
- To delete the blog-qs-athena-tpc-h-db-sql-s3-table-mar-3 table bucket, use the AWS CLI, AWS SDKs, or Amazon S3 REST API. Learn more
- Use the S3 console to remove the S3 bucket amzn-s3-demo-bucket.
- Select the TPC-H Analytics Agent and choose Delete.
- Select the TPC-H Lakehouse Analytics Space and choose Delete.
- Select the dashboard TPC-H Lakehouse Analytics and choose Delete.
- Select the Topic TPC-H Analytics and choose Delete.
- Select the TPC-H Reference Knowledge Base and choose Delete knowledge base.
- Delete the datasets TPC-H Unified (Joined), TPC-H Customer (CSV), TPC-H Orders (Iceberg), and TPC-H Lineitem (S3 Tables).
- Select the Athena data source tpch-lakehouse-athena and choose Delete.

This architecture demonstrates how Amazon Quick’s agentic AI transforms enterprise data analytics from a technical bottleneck into an accessible self-service capability. By integrating Amazon S3, AWS Glue Data Catalog, Amazon Athena, and AWS Lake Formation with Amazon Quick’s conversational AI agents and dashboards, business users can now query complex lakehouse data through natural language interfaces without requiring SQL or BI expertise. The solution seamlessly combines structured TPC-H datasets across multiple storage formats (S3 Tables, Iceberg, Parquet) with unstructured data from knowledge bases, enabling richer contextual insights. This democratization of data access accelerates decision-making across industries while maintaining enterprise-grade security, governance, and scalability for modern data-driven organizations.
Refer to the Getting started tutorial for additional use cases using B2B, revenue, sales, marketing, and HR datasets. To dive deeper into Lake Formation permissions with Quick, refer to the AWS documentation “Using AWS Lake Formation with Quick” and the blog post “Securely analyze your data with AWS Lake Formation and Amazon Quick Sight”. Join the Amazon Quick Community to find answers to your questions, learning resources, and events in your area.
For additional reading, refer to the following links:
Modernize Business Intelligence Workloads Using Amazon Quick
Best practices for Amazon Quick Sight SPICE and direct query mode
Accessing Amazon S3 Tables through Amazon Quick with AWS Lake Formation Permissions
AWS security in Quick