mTLS on ECS: the sidecar pattern and the secret distribution problem
Most teams terminate TLS at the load balancer and treat internal service-to-service traffic as trusted. That works fine outside a regulated boundary. Inside a PCI-DSS Cardholder Data Environment (CDE), it fails a Qualified Security Assessor (QSA) review.
PCI-DSS Requirement 4.2.1 requires strong cryptography for all transmission of account data across open, public networks and across internal network segments where data could be intercepted. East-west traffic between your ECS services in the same VPC is in scope if those services handle or transmit cardholder data. A load balancer terminating TLS at the edge does not satisfy this for service-to-service calls.
The answer is mutual TLS: both sides of every connection present certificates, so a compromised service cannot silently impersonate a peer. The cryptographic part is well understood. The operational part, specifically getting certificates into containers without creating new compliance problems in the process, is where most teams underestimate the work.
The sidecar proxy pattern
The cleanest approach for ECS is a sidecar container running NGINX or HAProxy alongside your application. The application speaks plaintext on localhost. The sidecar handles all TLS termination for inbound connections and TLS origination for outbound connections to peer services.
 [inbound mTLS]
       │
       ▼
 ┌──────────┐    plaintext     ┌──────────┐
 │ NGINX /  │ ◄──────────────► │   app    │
 │ HAProxy  │    localhost     │container │
 └──────────┘                  └──────────┘
       │
       ▼
 [outbound mTLS]
On ECS Fargate, all containers in a task share the same network namespace. The sidecar and the app communicate over localhost without any extra networking configuration. The application needs no TLS code. Certificate policy enforcement lives entirely in the proxy config.
This isolation has two practical benefits in a PCI audit. First, your application code does not handle certificates, so certificate-related findings are confined to infrastructure rather than scattered across application repositories. Second, when you rotate certificates you do not need to redeploy application code, only restart the task.
Why certificate distribution is the hard part
Setting up NGINX or HAProxy to do mTLS is straightforward configuration. The hard part is getting three things into the sidecar container at startup:
- The signed certificate for this service
- The private key for that certificate
- The CA bundle the sidecar uses to verify peer certificates
You have four options for how those files reach the container, and three of them create compliance problems.
Baked into the image. The private key ends up in your container registry. Every pull of the image retrieves the key. Image scanning tools flag it. Your key rotation strategy requires a full image rebuild. This fails a PCI audit on multiple counts.
Mounted from EFS. Adds a storage dependency, requires EFS encryption at rest configuration, and moves the key management problem to an EFS access point rather than solving it.
Passed as environment variables in the task definition. The values are visible in
plaintext in the ECS console, in CloudTrail RegisterTaskDefinition events, and in
any tooling that lists task definition details. Private keys in plaintext in audit logs
is not a conversation you want to have with a QSA.
Pulled from Secrets Manager at task startup using an IAM role scoped to the secret. This is the right approach. No plaintext in the task definition. Every retrieval is logged in CloudTrail against the role that made it. Key material never appears in image layers or configuration files. Rotation updates the secret, not the infrastructure.
The init container pattern
ECS task definitions support a secrets block that injects Secrets Manager values into
containers as environment variables. For NGINX and HAProxy, you need files on disk, not
environment variables. The bridge is an init container that writes the injected
environment variables to a shared ephemeral volume before the proxy starts.
The dependency chain:
- Init container starts, reads CERT_PEM, KEY_PEM, and CA_BUNDLE from the injected environment, writes them as files to a shared volume, then exits.
- Proxy container starts after the init container exits successfully, reads the certificate files from the shared volume, and binds to the mTLS ports.
- App container starts, communicates with the proxy over localhost.
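ECS supports container dependencies via the DependsOn field. Making the proxy
container depend on the init container ensures the files are written before NGINX
or HAProxy reads them. The condition matters: COMPLETE is satisfied by any exit,
including a failed one, while SUCCESS also requires a zero exit code. SUCCESS is
therefore the condition that matches "exits successfully" in the chain above.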
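For readers who prefer to see the init step as a program rather than a shell one-liner, here is a minimal Python sketch of the same logic. It assumes the three variable names used in this article; the FILES mapping and the write_cert_files function are illustrative names, not part of ECS. It exits non-zero when a value is missing, so a SUCCESS dependency condition would keep the proxy from starting.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical mapping from injected variable names to target file names.
FILES = {
    "CERT_PEM": "cert.pem",
    "KEY_PEM": "key.pem",
    "CA_BUNDLE": "ca.pem",
}


def write_cert_files(env, dest_dir):
    """Write each injected PEM value into dest_dir and return the paths."""
    missing = [name for name in FILES if not env.get(name)]
    if missing:
        # A non-zero exit lets a SUCCESS dependency condition block the proxy.
        raise SystemExit(f"missing secret variables: {', '.join(missing)}")

    written = []
    for name, filename in FILES.items():
        path = Path(dest_dir) / filename
        # Create the file with mode 0600 so it is never readable by
        # group or other, even briefly.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "w") as handle:
            # PEM files should end with exactly one newline.
            handle.write(env[name].rstrip("\n") + "\n")
        written.append(path)
    return written


# Demo against a temporary directory instead of the real /certs volume.
with tempfile.TemporaryDirectory() as tmp:
    demo_env = {"CERT_PEM": "cert", "KEY_PEM": "key", "CA_BUNDLE": "ca"}
    print([p.name for p in write_cert_files(demo_env, tmp)])
```

In the task itself the call would be write_cert_files(os.environ, "/certs"). The shell version in the task definition does the same job with fewer dependencies, which is why it is the one used there.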
CloudFormation reference
Secrets Manager secrets
Store the certificate material as separate secrets or as a single JSON secret with multiple fields. Separate secrets make rotation independent; a single JSON secret reduces the number of GetSecretValue calls.
CertificateSecret:
  Type: AWS::SecretsManager::Secret
  Properties:
    Name: /pci/svc-payments/mtls-cert
    Description: mTLS certificate and private key for payments service
    SecretString: !Sub |
      {
        "cert_pem": "PLACEHOLDER",
        "key_pem": "PLACEHOLDER",
        "ca_bundle": "PLACEHOLDER"
      }
In practice you populate these values via a rotation Lambda or a pipeline step that
calls ACM Private CA, not by setting SecretString directly in the template.
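As an illustration of what that pipeline step has to produce, the sketch below builds the JSON SecretString from three PEM strings. The field names match the template above; the function name build_secret_string and the sanity check are assumptions made for this example. The main point is that json.dumps escapes the PEM newlines, so the multi-line values survive as single JSON strings.

```python
import json


def build_secret_string(cert_pem, key_pem, ca_bundle):
    """Return the JSON document stored in the Secrets Manager secret."""
    fields = {"cert_pem": cert_pem, "key_pem": key_pem, "ca_bundle": ca_bundle}
    for label, value in fields.items():
        # Cheap sanity check: catch empty or obviously wrong inputs early.
        if "-----BEGIN " not in value:
            raise ValueError(f"{label} does not look like PEM")
    # json.dumps escapes embedded newlines as \n inside each string value.
    return json.dumps(fields)


# Placeholder PEM bodies, for illustration only.
payload = build_secret_string(
    "-----BEGIN CERTIFICATE-----\nPLACEHOLDER\n-----END CERTIFICATE-----\n",
    "-----BEGIN PRIVATE KEY-----\nPLACEHOLDER\n-----END PRIVATE KEY-----\n",
    "-----BEGIN CERTIFICATE-----\nPLACEHOLDER\n-----END CERTIFICATE-----\n",
)
# The round trip restores the original multi-line PEM text.
restored = json.loads(payload)["cert_pem"]
```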
IAM task role
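The role that retrieves the secret needs secretsmanager:GetSecretValue scoped to the
specific secret ARN. Note which role that is: values referenced in the task
definition's Secrets block are fetched by the ECS agent using the task execution role,
so the statement below must also be attached to the execution role (EcsExecutionRole
in the task definition). The task role needs it only if a container calls Secrets
Manager itself at runtime. Wildcard resource policies on Secrets Manager are a PCI finding.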
EcsTaskRole:
  Type: AWS::IAM::Role
  Properties:
    RoleName: pci-payments-task-role
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: ecs-tasks.amazonaws.com
          Action: sts:AssumeRole
    Policies:
      - PolicyName: ReadMtlsSecret
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            # Scoped to one secret ARN, never "*". The execution role needs the
            # same statement, because it resolves the Secrets block at startup.
            - Effect: Allow
              Action: secretsmanager:GetSecretValue
              Resource: !Ref CertificateSecret
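One way to keep wildcard resources from creeping back in is a small check in the pipeline. The following is a rough sketch, not a full IAM evaluator: it only looks at Allow statements that mention Secrets Manager actions and flags any resource containing an asterisk, which also catches prefix wildcards such as /pci/*. The function name is made up for this example.

```python
def wildcard_secret_statements(policy_document):
    """Return Allow statements granting Secrets Manager access to a wildcard resource."""
    statements = policy_document.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]

    findings = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        # Action and Resource may each be a single string or a list.
        actions = stmt.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = stmt.get("Resource", [])
        resources = [resources] if isinstance(resources, str) else resources

        touches_secrets = any(
            a == "*" or a.lower().startswith("secretsmanager:") for a in actions
        )
        has_wildcard = any("*" in r for r in resources)
        if touches_secrets and has_wildcard:
            findings.append(stmt)
    return findings


# Example account ID and secret suffix are placeholders.
scoped = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "secretsmanager:GetSecretValue",
        "Resource": "arn:aws:secretsmanager:eu-west-1:123456789012:secret:/pci/svc-payments/mtls-cert-AbCdEf",
    }],
}
too_broad = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["secretsmanager:GetSecretValue"],
        "Resource": "*",
    }],
}
print(len(wildcard_secret_statements(scoped)), len(wildcard_secret_statements(too_broad)))
```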
Task definition
The init container writes the secret fields to files on a volume named certs. The
NGINX container mounts the same volume read-only.
PaymentsTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: pci-payments
    NetworkMode: awsvpc
    RequiresCompatibilities: [FARGATE]
    Cpu: "512"
    Memory: "1024"
    TaskRoleArn: !GetAtt EcsTaskRole.Arn
    ExecutionRoleArn: !GetAtt EcsExecutionRole.Arn
    Volumes:
      - Name: certs
    ContainerDefinitions:
      - Name: cert-init
        Image: public.ecr.aws/amazonlinux/amazonlinux:2
        Essential: false
        Command:
          - "/bin/sh"
          - "-c"
          - |
            # Fail on any error or unset variable so the task does not
            # continue with missing or empty certificate files.
            set -eu
            # Files are created owner-only from the start.
            umask 077
            printf '%s\n' "$CERT_PEM" > /certs/cert.pem
            printf '%s\n' "$KEY_PEM" > /certs/key.pem
            printf '%s\n' "$CA_BUNDLE" > /certs/ca.pem
        Secrets:
          - Name: CERT_PEM
            ValueFrom: !Sub "${CertificateSecret}:cert_pem::"
          - Name: KEY_PEM
            ValueFrom: !Sub "${CertificateSecret}:key_pem::"
          - Name: CA_BUNDLE
            ValueFrom: !Sub "${CertificateSecret}:ca_bundle::"
        MountPoints:
          - SourceVolume: certs
            ContainerPath: /certs
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: /ecs/pci-payments
            awslogs-region: !Ref AWS::Region
            awslogs-stream-prefix: cert-init
      - Name: nginx-proxy
        Image: nginx:1.27-alpine
        Essential: true
        DependsOn:
          - ContainerName: cert-init
            # SUCCESS requires exit code 0; COMPLETE would also accept a failed init.
            Condition: SUCCESS
        PortMappings:
          - ContainerPort: 8443
            Protocol: tcp
        MountPoints:
          - SourceVolume: certs
            ContainerPath: /etc/nginx/certs
            ReadOnly: true
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: /ecs/pci-payments
            awslogs-region: !Ref AWS::Region
            awslogs-stream-prefix: nginx
      - Name: app
        Image: !Sub "${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/payments-app:latest"
        Essential: true
        DependsOn:
          - ContainerName: nginx-proxy
            Condition: START
        PortMappings:
          - ContainerPort: 8080
            Protocol: tcp
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: /ecs/pci-payments
            awslogs-region: !Ref AWS::Region
            awslogs-stream-prefix: app
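The trailing "::" in each ValueFrom is easy to misread. ECS parses the reference as four colon-separated parts after the secret ARN: the JSON key, the version stage, and the version ID, where empty trailing fields select the current version. The helper below is only a sketch to make that layout explicit; the function name and the example ARN are invented for illustration.

```python
def secret_value_from(secret_arn, json_key="", version_stage="", version_id=""):
    """Compose an ECS ValueFrom reference: arn:json-key:version-stage:version-id."""
    if not secret_arn.startswith("arn:"):
        raise ValueError("expected a full secret ARN")
    # Empty fields are kept as empty strings, which produces the trailing "::".
    return f"{secret_arn}:{json_key}:{version_stage}:{version_id}"


# Placeholder ARN: account ID and random suffix are made up.
arn = "arn:aws:secretsmanager:eu-west-1:123456789012:secret:/pci/svc-payments/mtls-cert-AbCdEf"
print(secret_value_from(arn, "key_pem"))
```

The printed reference ends in ":key_pem::", the same shape the template produces with !Sub.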
NGINX configuration
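The proxy listens on 8443 for inbound mTLS connections and proxies to the app on localhost. The configuration below covers the inbound side only. Outbound connections to peer services need a separate server block that presents the same certificate through proxy_ssl_certificate and proxy_ssl_certificate_key and verifies the peer with proxy_ssl_trusted_certificate and proxy_ssl_verify on.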
server {
    listen 8443 ssl;
    server_name payments.internal;

    ssl_certificate         /etc/nginx/certs/cert.pem;
    ssl_certificate_key     /etc/nginx/certs/key.pem;
    ssl_client_certificate  /etc/nginx/certs/ca.pem;
    ssl_verify_client       on;

    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers   HIGH:!aNULL:!MD5;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Client-Cert   $ssl_client_s_dn;
    }
}
ssl_verify_client on rejects any connection that does not present a certificate signed
by the CA in ca.pem. The X-Client-Cert header passes the peer’s subject DN to the
application if it needs to make authorization decisions based on the caller’s identity.
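On the application side, an authorization check based on that header might look like the sketch below. It assumes NGINX emits the subject DN as a comma-separated string such as CN=orders.internal,OU=Payments,O=Example. The parser is deliberately simplified (no multi-valued RDNs or hex escapes), and the ALLOWED_CALLERS set and both function names are hypothetical.

```python
import re

# Hypothetical allow-list of peer service identities (certificate CNs).
ALLOWED_CALLERS = {"orders.internal", "ledger.internal"}


def parse_subject_dn(dn):
    """Parse 'CN=a,OU=b,O=c' into a dict. Simplified, not a full RFC 2253 parser."""
    attrs = {}
    # Split on commas that are not escaped with a backslash.
    for part in re.split(r"(?<!\\),", dn):
        key, sep, value = part.partition("=")
        if not sep:
            continue
        # Keep the first occurrence of each attribute type.
        attrs.setdefault(key.strip().upper(), value.strip().replace("\\,", ","))
    return attrs


def caller_is_allowed(header_value):
    """Return True only when the peer's CN is on the allow-list."""
    return parse_subject_dn(header_value or "").get("CN") in ALLOWED_CALLERS


print(caller_is_allowed("CN=orders.internal,OU=Payments,O=Example"))
print(caller_is_allowed("CN=unknown.internal,O=Example"))
```

This is only safe because the app is reachable solely through the proxy, which overwrites the header on every request. An app that also listens on a non-loopback address should not trust it.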
HAProxy alternative
If your organisation standardises on HAProxy, the equivalent frontend and backend configuration:
frontend mtls-inbound
    # Each bind and server directive is written on a single line.
    bind *:8443 ssl crt /etc/haproxy/certs/cert-and-key.pem ca-file /etc/haproxy/certs/ca.pem verify required
    default_backend app-backend

backend app-backend
    server app 127.0.0.1:8080

frontend outbound-proxy
    bind 127.0.0.1:9443
    default_backend peer-service

backend peer-service
    server peer peer-service.internal:8443 ssl crt /etc/haproxy/certs/cert-and-key.pem ca-file /etc/haproxy/certs/ca.pem verify required
HAProxy expects the certificate and private key concatenated in a single PEM file for
the crt directive. Adjust the init container command accordingly:
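# The variables hold PEM text, not file paths, so print them rather than cat them.
printf '%s\n%s\n' "$CERT_PEM" "$KEY_PEM" > /certs/cert-and-key.pem
printf '%s\n' "$CA_BUNDLE" > /certs/ca.pem
chmod 600 /certs/cert-and-key.pem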
The rotation lifecycle
Short-lived certificates are the right default in a PCI environment. A 7-day or 24-hour TTL from ACM Private CA limits the blast radius of a compromised key. The trade-off is that rotation is no longer an occasional event; it is a continuous operational process.
The rotation sequence:
- A Lambda function, triggered by EventBridge on a schedule shorter than your cert TTL, calls ACM Private CA to issue a new certificate.
- The Lambda writes the new cert, key, and CA bundle into the Secrets Manager secret.
- The Lambda calls ecs update-service --force-new-deployment on the relevant services.
- ECS performs a rolling replacement of tasks. New tasks start the init container, pull the updated secret, and write the new files to the ephemeral volume.
- Old tasks are drained and stopped after new tasks pass health checks.
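CertRotationFunction:
  Type: AWS::Lambda::Function
  Properties:
    FunctionName: pci-cert-rotation
    Runtime: python3.12
    Handler: index.handler
    Role: !GetAtt CertRotationRole.Arn
    # Code is required but omitted here: supply the deployment package
    # (S3Bucket/S3Key or inline ZipFile) for index.handler.
    Environment:
      Variables:
        SECRET_ARN: !Ref CertificateSecret
        ECS_CLUSTER: !Ref EcsCluster
        ECS_SERVICE: !Ref PaymentsService
        PCA_ARN: !Sub "arn:aws:acm-pca:${AWS::Region}:${AWS::AccountId}:certificate-authority/YOUR-PCA-ID"

CertRotationSchedule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: "rate(6 days)"
    State: ENABLED
    Targets:
      - Id: CertRotationFunction
        Arn: !GetAtt CertRotationFunction.Arn

# Without this permission the rule fires but cannot invoke the function.
CertRotationPermission:
  Type: AWS::Lambda::Permission
  Properties:
    FunctionName: !Ref CertRotationFunction
    Action: lambda:InvokeFunction
    Principal: events.amazonaws.com
    SourceArn: !GetAtt CertRotationSchedule.Arn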
The schedule should fire well before certificate expiry. If your cert TTL is 7 days, rotate at 6 days to give yourself a one-day window if the Lambda fails or ECS is slow to complete the rolling update.
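The body of the rotation function could be sketched roughly as follows. This is an outline under several assumptions, not a finished handler. The three clients are passed in so the logic can be exercised without AWS; in Lambda they would be boto3 clients for acm-pca, secretsmanager, and ecs. The make_key_and_csr callable is a hypothetical helper that returns a new private key and CSR as PEM strings. SHA256WITHRSA is assumed and has to match the CA's key type. The CA chain returned with the certificate is used as the CA bundle, which may not suit every trust model.

```python
import json


def rotate_certificate(pca, secrets, ecs, make_key_and_csr, *, pca_arn,
                       secret_id, cluster, service, ttl_days=7):
    """Issue a certificate, store it in the secret, and roll the ECS service."""
    # Hypothetical helper: returns (private_key_pem, csr_pem) as strings.
    key_pem, csr_pem = make_key_and_csr()

    issued = pca.issue_certificate(
        CertificateAuthorityArn=pca_arn,
        Csr=csr_pem.encode(),
        SigningAlgorithm="SHA256WITHRSA",  # assumption: RSA-based CA
        Validity={"Value": ttl_days, "Type": "DAYS"},
    )
    cert_arn = issued["CertificateArn"]

    # Issuance is asynchronous, so wait until the certificate is retrievable.
    pca.get_waiter("certificate_issued").wait(
        CertificateAuthorityArn=pca_arn, CertificateArn=cert_arn
    )
    cert = pca.get_certificate(
        CertificateAuthorityArn=pca_arn, CertificateArn=cert_arn
    )

    # Store the new material before triggering the deployment, so that
    # replacement tasks read the new version.
    secrets.put_secret_value(
        SecretId=secret_id,
        SecretString=json.dumps({
            "cert_pem": cert["Certificate"],
            "key_pem": key_pem,
            "ca_bundle": cert["CertificateChain"],
        }),
    )
    ecs.update_service(cluster=cluster, service=service, forceNewDeployment=True)
    return cert_arn
```

Passing the clients in keeps the ordering of the steps testable with simple fakes, which matters here because a deployment triggered before the secret is updated would roll tasks onto the old certificate.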
What rotation does not solve
Automated rotation reduces operational burden but does not eliminate it. Three things remain manual or require additional tooling:
Certificate revocation. If a private key is compromised before expiry, you need a revocation mechanism. ACM Private CA supports CRL and OCSP. Configuring NGINX or HAProxy to check revocation status and enabling the CRL distribution point adds latency and a new dependency. Most teams accept the risk of short-lived certs and skip revocation checking; that decision should be documented and reviewed with your QSA.
CA bundle distribution across services. When your private CA rotates its own certificate, every service that trusts it needs the updated CA bundle. If you have ten services each with their own Secrets Manager secret containing the CA bundle, you need to update all ten. Centralising the CA bundle as a single secret referenced by all services reduces that to one update.
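Monitoring expiry before rotation fires. Scheduled rotation can fail silently: the rule may not invoke the function, the Lambda may error, or the deployment may stall. A CloudWatch alarm on ACM's DaysToExpiry metric only helps for certificates managed by ACM itself. For certificates issued directly from the private CA and stored in Secrets Manager, use a simple Lambda that checks the expiry dates of the certificates in Secrets Manager and publishes a custom metric, then alarm on that metric. If the alarm fires, someone investigates before a cert expires and tasks start rejecting connections.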
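The custom-metric variant can be very small. In the sketch below the certificate's notAfter timestamp is taken as an input, on the assumption that it has already been parsed from the PEM with an X.509 library of your choice. The namespace, metric name, and dimension are invented for this example, and the cloudwatch argument is expected to behave like a boto3 CloudWatch client.

```python
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Return the remaining lifetime in days (negative once expired)."""
    now = now or datetime.now(timezone.utc)
    return (not_after - now).total_seconds() / 86400


def publish_expiry_metric(cloudwatch, service_name, not_after, now=None):
    """Publish remaining days as a custom metric and return the value."""
    remaining = days_until_expiry(not_after, now)
    cloudwatch.put_metric_data(
        Namespace="Custom/mTLS",  # hypothetical namespace
        MetricData=[{
            "MetricName": "CertDaysToExpiry",  # hypothetical metric name
            "Dimensions": [{"Name": "Service", "Value": service_name}],
            "Value": remaining,
            "Unit": "Count",
        }],
    )
    return remaining
```

With a 7-day TTL and a 6-day rotation schedule, an alarm threshold just under one day would fire only when a rotation has actually been missed. Configure the alarm to treat missing data as breaching, so that a dead monitoring Lambda is also noticed.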
What the QSA will ask
In a PCI audit, the questions around in-transit encryption at the service level are predictable. Be ready to demonstrate:
- Which services transmit cardholder data and whether all connections between them are encrypted with TLS 1.2 or higher.
- How certificates are issued, stored, and rotated, and who has access to private key material.
- Whether private keys ever appear in plaintext in logs, task definitions, or image layers.
- What happens when a certificate expires and how you detect that before it causes an outage.
The Secrets Manager approach covers the storage and access questions cleanly: CloudTrail shows every GetSecretValue call, IAM policies scoped to the single secret ARN (optionally backed by a resource-based policy on the secret) restrict access to the role that retrieves it, and the secret value is never exposed in task definition attributes visible in the console. The rotation Lambda covers the lifecycle questions. The expiry alarm covers the monitoring question.
The operational burden of this setup is real. A private CA, rotation automation, expiry monitoring, and per-service role policies are more moving parts than most teams expect when they start implementing mTLS. The alternative is manual certificate management at audit time, which is consistently worse. Build the automation at the start and the ongoing burden shrinks to monitoring the rotation Lambda and the expiry alarm.