Organizations handle vast amounts of sensitive customer information through various communication channels. Protecting Personally Identifiable Information (PII), such as social security numbers (SSNs), driver’s license numbers, and phone numbers, has become increasingly critical for maintaining compliance with data privacy regulations and building customer trust. However, manually reviewing and redacting PII is time-consuming, error-prone, and scales poorly as data volumes grow.
Organizations face challenges when dealing with PII scattered across different content types, from text to images. Traditional approaches often require separate tools and workflows for handling text and image content, leading to inconsistent redaction practices and potential security gaps. This fragmented approach not only increases operational overhead but also raises the risk of accidental PII exposure.
This post shows an automated PII detection and redaction solution using Amazon Bedrock Data Automation and Amazon Bedrock Guardrails through a use case of processing text and image content in high volumes of incoming emails and attachments. The solution features a complete email processing workflow with a React-based user interface for authorized personnel to more securely manage and review redacted email communications and attachments. We walk through the step-by-step solution implementation procedures used to deploy this solution. Finally, we discuss the solution benefits, including operational efficiency, scalability, security and compliance, and adaptability.
The solution provides an automated system for protecting sensitive information in business communications through three main capabilities:
This unified approach helps organizations maintain compliance with data privacy requirements while streamlining their communication workflows.
The following diagram outlines the solution architecture. 
The diagram illustrates the backend PII detection and redaction workflow and the frontend application user interface orchestrated by AWS Lambda and Amazon EventBridge. The process follows these steps:
In the following sections, we walk through the procedures for implementing this solution.
The solution implementation involves infrastructure and optional portal setup.
Before beginning the implementation, make sure to have the following components installed and configured.
Verify that a virtual private cloud (VPC) containing three private subnets with no internet access already exists in your AWS account. All AWS CloudFormation stacks need to be deployed within the same AWS account.
The solution contains three stacks (two required, one optional) that deploy in your AWS account:
Move directly to the Solution Deployment section that follows if Amazon SES is not being used.
The following Amazon SES setup is optional. The code can be tested without it. Steps to test the application with or without Amazon SES are covered in the Testing section.
Set up Amazon SES with production access and verify the domain or email identities for which the solution is to work. We also need to add the MX records with the DNS provider that maintains the domain. Please refer to the following links:
Create SMTP credentials and save them in an AWS Secrets Manager secret named SmtpCredentials. An IAM user is created for this process.
If a different name is used for the secret, update the secret_name property in context.json with the name of the secret created.
When storing the credentials in AWS Secrets Manager, the key for the username should be smtp_username and the key for the password should be smtp_password.
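The secret layout described above can be sketched as follows. This is a minimal illustration, assuming the secret value is stored as a JSON string. The helper name build_smtp_secret is hypothetical; the key names smtp_username and smtp_password and the secret name SmtpCredentials come from the steps above.

```python
import json


def build_smtp_secret(username: str, password: str) -> str:
    """Return the JSON string to store as the secret value (hypothetical helper)."""
    if not username or not password:
        raise ValueError("both the SMTP username and password are required")
    # Key names must match the ones the solution expects to read.
    return json.dumps({"smtp_username": username, "smtp_password": password})


if __name__ == "__main__":
    # Placeholder values; substitute the SMTP credentials generated for Amazon SES.
    print(build_smtp_secret("EXAMPLE_USER", "EXAMPLE_PASSWORD"))
```

The resulting string could then be supplied as the SecretString when creating the SmtpCredentials secret, for example through the Secrets Manager console or the create_secret API.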
Run the following commands from within a terminal/CLI environment.
git clone https://github.com/aws-samples/sample-bda-redaction.git
The infra/cdk.json file tells the CDK Toolkit how to execute your app.
cd sample-bda-redaction/infra/
python3 -m venv .venv
. .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
Create the context.json file:
cp context.json.example context.json
Update the context.json file with the correct configuration options for the environment.

| Property Name | Default | Description | When to Create |
|---|---|---|---|
| vpc_id | "" | VPC ID where resources are deployed | VPC needs to be created prior to execution |
| raw_bucket | "" | S3 bucket storing raw messages and attachments | Created during CDK deployment |
| redacted_bucket_name | "" | S3 bucket storing redacted messages and attachments | Created during CDK deployment |
| inventory_table_name | "" | DynamoDB table name storing redacted message details | Created during CDK deployment |
| resource_name_prefix | "" | Prefix used when naming resources during the stack creation | During stack creation |
| retention | 90 | Number of days for retention of the messages in the redacted and raw S3 buckets | During stack creation |

| Property Name | Default | Description |
|---|---|---|
| environment | development | The type of environment where resources are provisioned. Values are development or production |

| Property Name | Description | Comment |
|---|---|---|
| domain | The verified domain or email name that is used for Amazon SES | This can be left blank if not setting up Amazon SES |
| auto_reply_from_email | Email address of the "from" field of the email message. Also used as the email address where emails are forwarded from the Portal application | This can be left blank if not setting up the Portal |
| secret_name | AWS Secrets Manager secret containing SMTP credentials for forward email functionality from the portal | |

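Before running the CDK commands, it can help to sanity-check the values in context.json. The following is a minimal sketch, assuming the properties listed above are top-level keys of a flat JSON object (the actual file layout in the repository may differ). The function validate_context is hypothetical, and treating an empty resource_name_prefix as a problem is an assumption.

```python
ALLOWED_ENVIRONMENTS = {"development", "production"}


def validate_context(ctx: dict) -> list:
    """Return a list of problems; an empty list means the values look usable."""
    problems = []
    if not ctx.get("vpc_id"):
        problems.append("vpc_id must reference an existing VPC")
    # Assumption: a non-empty prefix is needed because stack names are derived from it.
    if not ctx.get("resource_name_prefix"):
        problems.append("resource_name_prefix must not be empty")
    if ctx.get("environment", "development") not in ALLOWED_ENVIRONMENTS:
        problems.append("environment must be development or production")
    retention = ctx.get("retention", 90)
    if isinstance(retention, bool) or not isinstance(retention, int) or retention <= 0:
        problems.append("retention must be a positive number of days")
    return problems


if __name__ == "__main__":
    # Illustrative values only; vpc-0123456789abcdef0 is a placeholder ID.
    sample = {"vpc_id": "vpc-0123456789abcdef0", "resource_name_prefix": "demo",
              "environment": "development", "retention": 90}
    print(validate_context(sample) or "context looks usable")
```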
cdk bootstrap
JSII_DEPRECATED=quiet \
JSII_SILENCE_WARNING_UNTESTED_NODE_VERSION=quiet \
cdk synth --no-notices
Replace <<resource_name_prefix>> with its chosen value and then run:
JSII_DEPRECATED=quiet \
JSII_SILENCE_WARNING_UNTESTED_NODE_VERSION=quiet \
cdk deploy <<resource_name_prefix>>-S3Stack <<resource_name_prefix>>-ConsumerStack --no-notices
Before starting the test, make sure the Amazon SES email receiving rule set that was created by the <<resource_name_prefix>>-ConsumerStack stack is active. Run the following command and confirm that the name in the output is <<resource_name_prefix>>-rule-set:
aws ses describe-active-receipt-rule-set
If the name does not match or the output is blank, run the following to activate the rule set:
# Replace <<resource_name_prefix>> with resource_name_prefix used in context.json
aws ses set-active-receipt-rule-set --rule-set-name <<resource_name_prefix>>-rule-set
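The rule set check above can also be scripted. The following is a minimal sketch that inspects the response of the SES DescribeActiveReceiptRuleSet call. The function name active_rule_set_matches is hypothetical, and the sketch assumes the Metadata block is absent or empty when no rule set is active.

```python
def active_rule_set_matches(response: dict, resource_name_prefix: str) -> bool:
    """Check whether the active receipt rule set is the one created by the ConsumerStack."""
    expected = f"{resource_name_prefix}-rule-set"
    # Assumption: Metadata is missing or empty when no receipt rule set is active.
    metadata = response.get("Metadata") or {}
    return metadata.get("Name") == expected


if __name__ == "__main__":
    # Hand-written sample response; "demo" is a placeholder prefix.
    sample = {"Metadata": {"Name": "demo-rule-set"}, "Rules": []}
    print(active_rule_set_matches(sample, "demo"))
```

With boto3 installed and credentials configured, the response could come from boto3.client("ses").describe_active_receipt_rule_set().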
Once we have the correct rule set active, we can test the application using Amazon SES by sending an email to the verified email or domain in Amazon SES, which automatically triggers the redaction pipeline. Progress can be tracked in the DynamoDB table <<inventory_table_name>>. The inventory table name can be found on the resources tab in the AWS CloudFormation Console for the <<resource_name_prefix>>-S3Stack stack and Logical ID EmailInventoryTable. A unique <<case_id>> is generated and used in the DynamoDB inventory table for each email being processed. Once redaction is complete, the redacted email body can be found in <<redacted_bucket_name>>/redacted/<<today_date>>/<<case_id>>/email_body/ and redacted attachments in <<redacted_bucket_name>>/redacted/<<today_date>>/<<case_id>>/attachments/.
As described earlier, the solution is used to redact any PII data in the email body and attachments. Therefore, to test the application, we need to provide an email file which needs to be redacted. We can do that without Amazon SES by directly uploading an email file to the raw S3 bucket. The raw bucket name can be found on the output tab in the AWS CloudFormation Console for <<resource_name_prefix>>-S3Stack stack and Export Name RawBucket. This triggers the workflow of redacting the email body and attachments by S3 event notification triggering the Lambda. For your convenience, a sample email is available in the infra/pii_redaction/sample_email directory of the repository. Below are the steps to test the application without Amazon SES using the same email file.
# Replace <<raw_bucket>> with raw bucket name created during deployment
aws s3 cp pii_redaction/sample_email/ccvod0ot9mu6s67t0ce81f8m2fp5d2722a7hq8o1 s3://<<raw_bucket>>/domain_emails/
The command above triggers the email redaction process. You can track the progress in the DynamoDB table <<inventory_table_name>>. A unique <<case_id>> is generated and used in the DynamoDB inventory table for each email being processed. The inventory table name can be found on the Resources tab in the AWS CloudFormation Console for the <<resource_name_prefix>>-S3Stack stack under Logical ID EmailInventoryTable. Once redaction is complete, the redacted email body can be found in <<redacted_bucket_name>>/redacted/<<today_date>>/<<case_id>>/email_body/ and redacted attachments in <<redacted_bucket_name>>/redacted/<<today_date>>/<<case_id>>/attachments/.
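To locate the output for a given case, the two prefixes described above can be assembled as in this minimal sketch. The helper redacted_prefixes is hypothetical, and the YYYY-MM-DD rendering of <<today_date>> is an assumption; check the redacted bucket for the format the solution actually writes.

```python
from datetime import date


def redacted_prefixes(bucket: str, case_id: str, day: date) -> dict:
    """Build the S3 prefixes for the redacted email body and attachments."""
    # Assumption: <<today_date>> is rendered as an ISO date (YYYY-MM-DD).
    base = f"s3://{bucket}/redacted/{day.isoformat()}/{case_id}"
    return {
        "email_body": f"{base}/email_body/",
        "attachments": f"{base}/attachments/",
    }


if __name__ == "__main__":
    # Placeholder bucket name and case ID for illustration.
    for name, prefix in redacted_prefixes("my-redacted-bucket", "case-123", date.today()).items():
        print(name, prefix)
```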
The installation of the portal is optional. If you skip this section, you can view the resources created by the solution in the console of the AWS account where it is deployed. The portal serves as a web interface to manage the PII-redacted emails processed by the backend AWS infrastructure, allowing users to view sanitized email content. The Portal can be used to:
Portal Prerequisites: This portal requires the installation of the following software tools:
cd sample-bda-redaction/infra/
python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
JSII_DEPRECATED=quiet \
JSII_SILENCE_WARNING_UNTESTED_NODE_VERSION=quiet \
cdk synth --no-notices
Replace <<resource_name_prefix>> with its chosen value:
JSII_DEPRECATED=quiet \
JSII_SILENCE_WARNING_UNTESTED_NODE_VERSION=quiet \
cdk deploy <<resource_name_prefix>>-PortalStack --no-notices
The first-time deployment should take approximately 10 minutes to complete.
Create the .env file by copying the .env.example file to .env, using the following command in a terminal/CLI environment.
cp .env.example .env
| Environment Variable Name | Default | Description | Required |
|---|---|---|---|
| VITE_APIGW | "" | The API Gateway invoke URL (including protocol) without the path (remove /portal from the value). This value can be found in the output of the PortalStack after deploying through AWS CDK. It can also be found under the Outputs tab of the PortalStack CloudFormation stack under the export name PiiPortalApiGatewayInvokeUrl | Yes |
| VITE_BASE | /portal | The path used to request the static files needed to render the portal | Yes |
| VITE_API_PATH | /api | The path needed to send requests to the API Gateway | Yes |

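The VITE_APIGW rule in the table (invoke URL with the /portal path removed) is easy to get wrong by hand. The following minimal sketch derives the value and renders the three variables. Both function names are hypothetical, and the example hostname is a placeholder rather than a real endpoint.

```python
def api_base_from_invoke_url(invoke_url: str) -> str:
    """Strip a trailing slash and a trailing /portal path from the invoke URL."""
    url = invoke_url.rstrip("/")
    if url.endswith("/portal"):
        url = url[: -len("/portal")]
    return url


def render_env(invoke_url: str) -> str:
    """Render .env contents using the defaults listed in the table above."""
    lines = [
        f"VITE_APIGW={api_base_from_invoke_url(invoke_url)}",
        "VITE_BASE=/portal",
        "VITE_API_PATH=/api",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder URL; use the PiiPortalApiGatewayInvokeUrl output from the PortalStack.
    print(render_env("https://example.invalid/portal"), end="")
```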
Run the following commands from within a terminal/CLI environment.
npm install
npm run build
Sync the contents of the dist/ directory into the Amazon S3 bucket that is designated for these assets (specified in the PortalStack provisioned via CDK).
aws s3 sync dist/ s3://<<name-of-s3-bucket>> --delete
<<name-of-s3-bucket>> is the S3 bucket that has been created in the <<resource_name_prefix>>-PortalStack CloudFormation stack with the Logical ID of PrivateWebHostingAssets. This value can be obtained from the Resources tab of the CloudFormation stack in the AWS Console. This value is also output during the cdk deploy process when the PortalStack has been successfully completed.
Use the API Gateway invoke URL from the API Gateway that has been created during the cdk deploy process to access the portal from a web browser. This URL can be found by following these steps:
Locate the API Gateway that was created during the cdk deploy process. The name of the API Gateway can be found in the Resources section of the <<resource_name_prefix>>-PortalStack CloudFormation stack.
The portal’s user interface is now visible within the web browser. If any emails have been processed, they are listed on the home page of the portal.
For production deployment, we recommend these approaches to controlling and managing access to the Portal.
To avoid incurring future charges, follow these steps to remove the resources created by this solution:
#to disable the rule set use below command
aws ses set-active-receipt-rule-set
#to delete the rule set use below command
# Replace <<resource_name_prefix>> with resource_name_prefix used in context.json
aws ses delete-receipt-rule-set --rule-set-name <<resource_name_prefix>>-rule-set
# Run the PortalStack command only if the portal was deployed
cdk destroy <<resource_name_prefix>>-PortalStack
cdk destroy <<resource_name_prefix>>-ConsumerStack
cdk destroy <<resource_name_prefix>>-S3Stack
The access log bucket is exported by the <<resource_name_prefix>>-S3Stack stack with export name AccessLogsBucket. Execute the following steps to delete the access log bucket:
#to remove versioned objects use below aws cli command
aws s3api delete-objects --bucket ${accesslogbucket} --delete "$(aws s3api list-object-versions --bucket ${accesslogbucket} --query='{Objects: Versions[].{Key:Key,VersionId:VersionId}}')"
#once versioned objects are removed we need to remove the delete markers of the versioned objects using below aws cli command
aws s3api delete-objects --bucket ${accesslogbucket} --delete "$(aws s3api list-object-versions --bucket ${accesslogbucket} --query='{Objects: DeleteMarkers[].{Key:Key,VersionId:VersionId}}')"
#delete the access log bucket itself using below aws cli command
aws s3api delete-bucket --bucket ${accesslogbucket}
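The delete-objects call accepts at most 1,000 keys per request, so the one-line commands above may need to be repeated for a large access log bucket, and they fail when the listing is empty. A minimal sketch of a batched alternative follows. It assumes a boto3 S3 client is passed in by the caller, and the function names chunk_delete_payloads and purge_versions are hypothetical.

```python
def chunk_delete_payloads(entries, batch_size=1000):
    """Yield Delete payloads holding at most batch_size object versions each."""
    for start in range(0, len(entries), batch_size):
        yield {"Objects": entries[start:start + batch_size], "Quiet": True}


def purge_versions(s3_client, bucket):
    """Delete every object version and delete marker; return how many were removed."""
    entries = []
    paginator = s3_client.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket):
        # Either key can be missing from a page, for example in an empty bucket.
        for item in page.get("Versions", []) + page.get("DeleteMarkers", []):
            entries.append({"Key": item["Key"], "VersionId": item["VersionId"]})
    for payload in chunk_delete_payloads(entries):
        s3_client.delete_objects(Bucket=bucket, Delete=payload)
    return len(entries)
```

A caller could run purge_versions(boto3.client("s3"), "<access log bucket name>") and then delete the empty bucket as shown in the last command above.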
The VPC and its associated resources, which are prerequisites for this solution, should be kept if other applications still use them.
In this post, we demonstrated how to automate the detection and redaction of PII across both text and image content using Amazon Bedrock Data Automation and Amazon Bedrock Guardrails. By centralizing and streamlining the redaction process, organizations can strengthen alignment with data privacy requirements, enhance security practices, and minimize operational overhead.
However, it is equally important to make sure that your solution is built with Amazon Bedrock Data Automation’s document processing constraints in mind. Amazon Bedrock Data Automation supports PDF, JPEG, and PNG file formats with a maximum console-processing size of 200 MB (500 MB via API), and single documents may not exceed 20 pages unless document splitting is enabled.
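The constraints above can be checked before a file is submitted for processing. The following minimal sketch encodes only the limits stated in this post. The function bda_limit_problems is hypothetical, treating .jpg as JPEG is an assumption, and so is interpreting MB as 1024 * 1024 bytes.

```python
import os

SUPPORTED_EXTENSIONS = {".pdf", ".jpeg", ".jpg", ".png"}  # .jpg assumed equivalent to JPEG
MB = 1024 * 1024  # assumption: the stated limits are interpreted in mebibytes


def bda_limit_problems(filename, size_bytes, page_count=1, via_api=True, splitting_enabled=False):
    """Return the list of constraints from this post that the file would violate."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {ext or 'none'}")
    limit_mb = 500 if via_api else 200  # 500 MB via API, 200 MB via console
    if size_bytes > limit_mb * MB:
        problems.append(f"file exceeds {limit_mb} MB limit")
    if page_count > 20 and not splitting_enabled:
        problems.append("more than 20 pages without document splitting")
    return problems


if __name__ == "__main__":
    # Illustrative attachment: a 25-page PDF of 3 MB with splitting disabled.
    print(bda_limit_problems("statement.pdf", 3 * MB, page_count=25))
```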
By using the centralized redaction capabilities of Amazon Bedrock Data Automation and Amazon Bedrock Guardrails, organizations can improve data privacy compliance management, cut operational overhead, and maintain stringent security across diverse workloads. The solution’s extensibility also enables integration with other AWS services, fine-tuning of detection logic for more advanced PII patterns, and broader support for additional file types or languages in the future, so that it can evolve into a more robust, enterprise-scale data protection framework.
We encourage exploration of the provided GitHub repository to deploy this solution within your organization. In addition to delivering operational efficiency, scalability, security, and adaptability, the solution also provides a unified interface and an audit trail that simplifies data governance. By refining detection rules, users can integrate additional file formats where possible and build on the modular framework of Amazon Bedrock Data Automation and Amazon Bedrock Guardrails.
We invite you to implement this PII detection and redaction solution in the following GitHub repo to build a more secure, compliance-aligned, and highly adaptable data protection solution on Amazon Bedrock that addresses evolving business and regulatory requirements.
Himanshu Dixit is a Delivery Consultant at AWS Professional Services specializing in databases and analytics, bringing over 18 years of experience in technology. He is passionate about artificial intelligence, machine learning, and generative AI, leveraging these cutting-edge technologies to create innovative solutions that address real-world challenges faced by customers. Outside of work, he enjoys playing badminton, tennis, cricket, and table tennis, and spending time with his two daughters.
David Zhang is an Engagement Manager at AWS Professional Services, where he leads enterprise-scale AI/ML, cloud transformation initiatives for Fortune 100 customers in telecom, finance, media, and entertainment. Outside of work, he enjoys experimenting with new recipes in his kitchen, playing tenor saxophone, and capturing life’s moments through his camera.
Richard Session is a Lead User Interface Developer for AWS ProServe, bringing over 15 years of experience as a full-stack developer across marketing/advertising, enterprise technology, automotive, and ecommerce industries. With a passion for creating intuitive and engaging user experiences, he uses his extensive background to craft exceptional interfaces for AWS’s enterprise customers. When he’s not designing innovative user experiences, Richard can be found pursuing his love for coffee, spinning tracks as a DJ, or exploring new destinations around the globe.
Viyoma Sachdeva is a Principal Industry Specialist at AWS. She specializes in AWS DevOps, containerization, and IoT, helping customers accelerate their journey to the AWS Cloud.