hoamai.click

mTLS on ECS: the sidecar pattern and the secret distribution problem

#aws#security#ecs#pci-dss#cloud

Most teams terminate TLS at the load balancer and treat internal service-to-service traffic as trusted. That works fine outside a regulated boundary. Inside a PCI-DSS Cardholder Data Environment (CDE), it fails a Qualified Security Assessor (QSA) review.

PCI-DSS Requirement 4.2.1 requires strong cryptography for all transmission of account data across open, public networks and across internal network segments where data could be intercepted. East-west traffic between your ECS services in the same VPC is in scope if those services handle or transmit cardholder data. A load balancer terminating TLS at the edge does not satisfy this for service-to-service calls.

The answer is mutual TLS: both sides of every connection present certificates, so a compromised service cannot silently impersonate a peer. The cryptographic part is well understood. The operational part, specifically getting certificates into containers without creating new compliance problems in the process, is where most teams underestimate the work.

The sidecar proxy pattern

The cleanest approach for ECS is a sidecar container running NGINX or HAProxy alongside your application. The application speaks plaintext on localhost. The sidecar handles all TLS termination for inbound connections and TLS origination for outbound connections to peer services.

  inbound mTLS    ┌──────────┐    plaintext     ┌──────────┐
 ───────────────► │  NGINX / │ ◄──────────────► │   app    │
 ◄─────────────── │  HAProxy │    localhost     │container │
  outbound mTLS   └──────────┘                  └──────────┘

On ECS Fargate, all containers in a task share the same network namespace. The sidecar and the app communicate over localhost without any extra networking configuration. The application needs no TLS code. Certificate policy enforcement lives entirely in the proxy config.

This isolation has two practical benefits in a PCI audit. First, your application code does not handle certificates, so certificate-related findings are confined to infrastructure rather than scattered across application repositories. Second, when you rotate certificates you do not need to redeploy application code, only restart the task.

Why certificate distribution is the hard part

Setting up NGINX or HAProxy to do mTLS is straightforward configuration. The hard part is getting three things into the sidecar container at startup:

  1. The signed certificate for this service
  2. The private key for that certificate
  3. The CA bundle the sidecar uses to verify peer certificates

You have four options for how those files reach the container, and three of them create compliance problems.

Baked into the image. The private key ends up in your container registry. Every pull of the image retrieves the key. Image scanning tools flag it. Your key rotation strategy requires a full image rebuild. This fails a PCI audit on multiple counts.

Mounted from EFS. Adds a storage dependency, requires EFS encryption at rest configuration, and moves the key management problem to an EFS access point rather than solving it.

Passed as environment variables in the task definition. The values are visible in plaintext in the ECS console, in CloudTrail RegisterTaskDefinition events, and in any tooling that lists task definition details. Private keys in plaintext in audit logs is not a conversation you want to have with a QSA.

Pulled from Secrets Manager at task startup using the task's IAM role (the task execution role, when the ECS secrets block does the injection). This is the right approach. No plaintext in the task definition. Full CloudTrail audit trail scoped to that role. Key material never appears in image layers or configuration files. Rotation updates the secret, not the infrastructure.

The init container pattern

ECS task definitions support a secrets block that injects Secrets Manager values into containers as environment variables. For NGINX and HAProxy, you need files on disk, not environment variables. The bridge is an init container that writes the injected environment variables to a shared ephemeral volume before the proxy starts.

The dependency chain:

  1. Init container starts, reads CERT_PEM, KEY_PEM, and CA_BUNDLE from the injected environment, writes them as files to a shared volume, then exits.
  2. Proxy container starts after the init container exits successfully, reads the certificate files from the shared volume, and binds to the mTLS ports.
  3. App container starts, communicates with the proxy over localhost.

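ECS supports container dependencies via the DependsOn field. Setting the proxy container to depend on the init container with condition SUCCESS ensures the init container exited with code zero, and therefore that the files are present, before NGINX or HAProxy reads them. The COMPLETE condition is weaker: it is also satisfied when the init container exits with an error.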
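
If the init image ships Python rather than only a shell, the write step in the dependency chain could look roughly like the sketch below. It is an illustration, not the post's implementation: the function name write_cert_files, the file names, and the permission bits are assumptions, and the variable names are the ones listed in the dependency chain.

```python
import os
from pathlib import Path

# Environment variable -> (file name, permission bits).
# File names and modes are assumptions chosen for this sketch.
FILES = {
    "CERT_PEM": ("cert.pem", 0o644),
    "KEY_PEM": ("key.pem", 0o600),
    "CA_BUNDLE": ("ca.pem", 0o644),
}


def write_cert_files(env, target_dir):
    """Write each injected value into target_dir; raise if one is missing."""
    target = Path(target_dir)
    written = []
    for var, (name, mode) in FILES.items():
        value = env.get(var)
        if not value:
            # Raising makes the container exit non-zero, so a proxy that
            # depends on a successful init never starts without its files.
            raise RuntimeError(f"{var} is missing or empty")
        path = target / name
        # Creating the file with an explicit mode avoids a window in which
        # the private key is readable by other users.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, mode)
        with os.fdopen(fd, "w") as handle:
            handle.write(value if value.endswith("\n") else value + "\n")
        os.chmod(path, mode)  # enforce the mode even if the file pre-existed
        written.append(path)
    return written
```

Inside the container this would be called as write_cert_files(os.environ, "/certs"), with /certs being the shared volume's mount path.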

CloudFormation reference

Secrets Manager secrets

Store the certificate material as separate secrets or as a single JSON secret with multiple fields. Separate secrets make rotation independent; a single JSON secret reduces the number of GetSecretValue calls.

CertificateSecret:
  Type: AWS::SecretsManager::Secret
  Properties:
    Name: /pci/svc-payments/mtls-cert
    Description: mTLS certificate and private key for payments service
    SecretString: !Sub |
      {
        "cert_pem": "PLACEHOLDER",
        "key_pem": "PLACEHOLDER",
        "ca_bundle": "PLACEHOLDER"
      }

In practice you populate these values via a rotation Lambda or a pipeline step that calls ACM Private CA, not by setting SecretString directly in the template.

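IAM task execution role

Values in the secrets block are fetched by the ECS agent before the container starts, so secretsmanager:GetSecretValue belongs on the task execution role, not the task role, and should be scoped to the specific secret ARN. Wildcard resource policies on Secrets Manager are a PCI finding. The task role (EcsTaskRole in the task definition below) needs no Secrets Manager access for this pattern. If the secret is encrypted with a customer managed KMS key, the execution role also needs kms:Decrypt on that key.

EcsExecutionRole:
  Type: AWS::IAM::Role
  Properties:
    RoleName: pci-payments-execution-role
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: ecs-tasks.amazonaws.com
          Action: sts:AssumeRole
    # Image pull and CloudWatch Logs permissions for the ECS agent
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
    Policies:
      - PolicyName: ReadMtlsSecret
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action: secretsmanager:GetSecretValue
              # !Ref on a Secrets Manager secret returns its ARN
              Resource: !Ref CertificateSecret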

Task definition

The init container writes the secret fields to files on a volume named certs. The NGINX container mounts the same volume read-only.

PaymentsTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: pci-payments
    NetworkMode: awsvpc
    RequiresCompatibilities: [FARGATE]
    Cpu: "512"
    Memory: "1024"
    TaskRoleArn: !GetAtt EcsTaskRole.Arn
    ExecutionRoleArn: !GetAtt EcsExecutionRole.Arn
    Volumes:
      - Name: certs
    ContainerDefinitions:
      - Name: cert-init
        Image: public.ecr.aws/amazonlinux/amazonlinux:2
        Essential: false
        Command:
          - "/bin/sh"
          - "-c"
          - |
            set -eu   # exit non-zero if a variable is unset or a write fails
            printf '%s\n' "$CERT_PEM"  > /certs/cert.pem
            printf '%s\n' "$KEY_PEM"   > /certs/key.pem
            printf '%s\n' "$CA_BUNDLE" > /certs/ca.pem
            chmod 600 /certs/key.pem
        Secrets:
          - Name: CERT_PEM
            ValueFrom: !Sub "${CertificateSecret}:cert_pem::"
          - Name: KEY_PEM
            ValueFrom: !Sub "${CertificateSecret}:key_pem::"
          - Name: CA_BUNDLE
            ValueFrom: !Sub "${CertificateSecret}:ca_bundle::"
        MountPoints:
          - SourceVolume: certs
            ContainerPath: /certs
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: /ecs/pci-payments
            awslogs-region: !Ref AWS::Region
            awslogs-stream-prefix: cert-init

      - Name: nginx-proxy
        Image: nginx:1.27-alpine
        Essential: true
        DependsOn:
          - ContainerName: cert-init
            Condition: SUCCESS
        PortMappings:
          - ContainerPort: 8443
            Protocol: tcp
        MountPoints:
          - SourceVolume: certs
            ContainerPath: /etc/nginx/certs
            ReadOnly: true
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: /ecs/pci-payments
            awslogs-region: !Ref AWS::Region
            awslogs-stream-prefix: nginx

      - Name: app
        Image: !Sub "${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/payments-app:latest"
        Essential: true
        DependsOn:
          - ContainerName: nginx-proxy
            Condition: START
        PortMappings:
          - ContainerPort: 8080
            Protocol: tcp
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: /ecs/pci-payments
            awslogs-region: !Ref AWS::Region
            awslogs-stream-prefix: app

NGINX configuration

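The proxy listens on 8443 for inbound mTLS connections and proxies them to the app on localhost. The server block below covers that inbound side. For outbound connections to peer services, NGINX presents the same certificate through the proxy_ssl_certificate, proxy_ssl_certificate_key, proxy_ssl_trusted_certificate, and proxy_ssl_verify directives, typically in a separate server block bound to localhost.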

server {
    listen 8443 ssl;
    server_name payments.internal;

    ssl_certificate     /etc/nginx/certs/cert.pem;
    ssl_certificate_key /etc/nginx/certs/key.pem;

    ssl_client_certificate /etc/nginx/certs/ca.pem;
    ssl_verify_client      on;

    ssl_protocols       TLSv1.2 TLSv1.3;
    ssl_ciphers         HIGH:!aNULL:!MD5;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Client-Cert   $ssl_client_s_dn;
    }
}

ssl_verify_client on rejects any connection that does not present a certificate signed by the CA in ca.pem. The X-Client-Cert header passes the peer’s subject DN to the application if it needs to make authorization decisions based on the caller’s identity.
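
On the application side, an authorization check based on that header might look like the sketch below. It assumes the DN arrives as a comma-separated string such as CN=orders.internal,O=Example; the allowlist contents and the function names are hypothetical. The header is only trustworthy if the app is reachable solely through the sidecar, since the proxy overwrites whatever value a client sends.

```python
import re

# Hypothetical allowlist of peer identities this service accepts.
ALLOWED_CALLERS = {"orders.internal", "ledger.internal"}


def common_name(subject_dn):
    """Return the CN from a comma-separated DN string, or None if absent."""
    # Split on commas that are not escaped with a backslash.
    for part in re.split(r"(?<!\\),", subject_dn):
        key, sep, value = part.strip().partition("=")
        if sep and key.upper() == "CN":
            return value.replace("\\,", ",")
    return None


def is_authorized(headers):
    """Decide whether the caller named in X-Client-Cert may call this service."""
    dn = headers.get("X-Client-Cert", "")
    return common_name(dn) in ALLOWED_CALLERS
```

Matching on the CN alone is the simplest policy; a stricter one could also pin the organisation or organisational unit fields.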

HAProxy alternative

If your organisation standardises on HAProxy, the equivalent frontend and backend configuration:

# HAProxy has no line-continuation syntax, so each directive stays on one line.
frontend mtls-inbound
    bind *:8443 ssl crt /etc/haproxy/certs/cert-and-key.pem ca-file /etc/haproxy/certs/ca.pem verify required
    default_backend app-backend

backend app-backend
    server app 127.0.0.1:8080

frontend outbound-proxy
    bind 127.0.0.1:9443
    default_backend peer-service

backend peer-service
    server peer peer-service.internal:8443 ssl crt /etc/haproxy/certs/cert-and-key.pem ca-file /etc/haproxy/certs/ca.pem verify required

HAProxy expects the certificate and private key concatenated in a single PEM file for the crt directive. Adjust the init container command accordingly:

# The variables hold PEM text, not file paths, so print them rather than cat them.
printf '%s\n%s\n' "$CERT_PEM" "$KEY_PEM" > /certs/cert-and-key.pem
echo "$CA_BUNDLE" > /certs/ca.pem
chmod 600 /certs/cert-and-key.pem

The rotation lifecycle

Short-lived certificates are the right default in a PCI environment. A 7-day or 24-hour TTL from ACM Private CA limits the blast radius of a compromised key. The trade-off is that rotation is no longer an occasional event; it is a continuous operational process.

The rotation sequence:

  1. A Lambda function, triggered by EventBridge on a schedule shorter than your cert TTL, calls ACM Private CA to issue a new certificate.
  2. The Lambda writes the new cert, key, and CA bundle into the Secrets Manager secret.
  3. The Lambda calls the ECS UpdateService API with forceNewDeployment enabled (the equivalent of aws ecs update-service --force-new-deployment) on the relevant services.
  4. ECS performs a rolling replacement of tasks. New tasks start the init container, pull the updated secret, and write the new files to the ephemeral volume.
  5. Old tasks are drained and stopped after new tasks pass health checks.

The function and its schedule as CloudFormation resources:

CertRotationFunction:
  Type: AWS::Lambda::Function
  Properties:
    FunctionName: pci-cert-rotation
    Runtime: python3.12
    Handler: index.handler
    Role: !GetAtt CertRotationRole.Arn
    # The required Code property (deployment package) is omitted from this fragment.
    Environment:
      Variables:
        SECRET_ARN: !Ref CertificateSecret
        ECS_CLUSTER: !Ref EcsCluster
        ECS_SERVICE: !Ref PaymentsService
        PCA_ARN: !Sub "arn:aws:acm-pca:${AWS::Region}:${AWS::AccountId}:certificate-authority/YOUR-PCA-ID"

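CertRotationSchedule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: "rate(6 days)"
    State: ENABLED
    Targets:
      - Id: CertRotationFunction
        Arn: !GetAtt CertRotationFunction.Arn

# Without this permission EventBridge cannot invoke the function.
CertRotationInvokePermission:
  Type: AWS::Lambda::Permission
  Properties:
    FunctionName: !Ref CertRotationFunction
    Action: lambda:InvokeFunction
    Principal: events.amazonaws.com
    SourceArn: !GetAtt CertRotationSchedule.Arn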

The schedule should fire well before certificate expiry. If your cert TTL is 7 days, rotate at 6 days to give yourself a one-day window if the Lambda fails or ECS is slow to complete the rolling update.
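
The first three steps of the rotation sequence can be sketched as a small function. This is a hedged outline rather than a complete handler: certificate issuance is delegated to a caller-supplied issue_certificate function (hypothetical, expected to return the three PEM strings), and the two clients are passed in so the logic can be exercised without AWS access. In a Lambda they would be boto3 clients for Secrets Manager and ECS, with the ARN and names read from the SECRET_ARN, ECS_CLUSTER, and ECS_SERVICE environment variables.

```python
import json


def rotate(issue_certificate, secrets_client, ecs_client, secret_arn, cluster, service):
    """Run steps 1-3 of the rotation sequence; return the secret's field names."""
    # Step 1: issuance is delegated to a caller-supplied function
    # (hypothetical) that returns the certificate, key, and CA bundle as PEM.
    cert_pem, key_pem, ca_bundle = issue_certificate()

    # Step 2: overwrite the secret, keeping the JSON field names that the
    # task definition's ValueFrom entries point at.
    payload = {"cert_pem": cert_pem, "key_pem": key_pem, "ca_bundle": ca_bundle}
    secrets_client.put_secret_value(
        SecretId=secret_arn, SecretString=json.dumps(payload)
    )

    # Step 3: force a rolling replacement so new tasks read the new secret.
    ecs_client.update_service(
        cluster=cluster, service=service, forceNewDeployment=True
    )
    return sorted(payload)
```

Writing the secret before forcing the deployment matters: tasks started by the new deployment read whatever the secret holds at launch time.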

What rotation does not solve

Automated rotation reduces operational burden but does not eliminate it. Three things remain manual or require additional tooling:

Certificate revocation. If a private key is compromised before expiry, you need a revocation mechanism. ACM Private CA supports CRL and OCSP. Configuring NGINX or HAProxy to check revocation status and enabling the CRL distribution point adds latency and a new dependency. Most teams accept the risk of short-lived certs and skip revocation checking; that decision should be documented and reviewed with your QSA.

CA bundle distribution across services. When your private CA rotates its own certificate, every service that trusts it needs the updated CA bundle. If you have ten services each with their own Secrets Manager secret containing the CA bundle, you need to update all ten. Centralising the CA bundle as a single secret referenced by all services reduces that to one update.

Monitoring expiry before rotation fires. EventBridge schedules can fail silently. Add a CloudWatch alarm on certificate expiry: a simple Lambda that checks the expiry dates of the certificates in Secrets Manager and publishes a custom metric is enough. Certificates issued directly through the ACM Private CA API are not ACM-managed certificates, so do not assume a built-in expiry metric covers them. If the alarm fires, someone investigates before a cert expires and tasks start rejecting connections.
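
The core of such an expiry check is small. The sketch below assumes the certificate's notAfter value has already been extracted as a string in the format OpenSSL prints (the same format Python's ssl module returns from getpeercert); the function names and the one-day default threshold are illustrative choices. Publishing the result as a CloudWatch metric is left out.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Days left before notAfter, given e.g. 'Jun  1 12:00:00 2030 GMT'."""
    now = now or datetime.now(timezone.utc)
    # ssl.cert_time_to_seconds parses the certificate time format
    # into seconds since the epoch (UTC).
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def should_alarm(not_after, threshold_days=1.0, now=None):
    """True when the certificate expires in less than threshold_days."""
    return days_until_expiry(not_after, now) < threshold_days
```

With a 7-day TTL and rotation every 6 days, a threshold just under one day would flag a rotation run that was missed while still leaving time to react.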

What the QSA will ask

In a PCI audit, the questions around in-transit encryption at the service level are predictable. Be ready to demonstrate:

  • Which services transmit cardholder data and whether all connections between them are encrypted with TLS 1.2 or higher.
  • How certificates are issued, stored, and rotated, and who has access to private key material.
  • Whether private keys ever appear in plaintext in logs, task definitions, or image layers.
  • What happens when a certificate expires and how you detect that before it causes an outage.

The Secrets Manager approach covers the storage and access questions cleanly: CloudTrail shows every GetSecretValue call, resource-based policies on the secret restrict access to the task execution role, and the secret value is never exposed in task definition attributes visible in the console. The rotation Lambda covers the lifecycle questions. The expiry alarm covers the monitoring question.

The operational burden of this setup is real. A private CA, rotation automation, expiry monitoring, and per-service IAM role policies are more moving parts than most teams expect when they start implementing mTLS. The alternative is manual certificate management at audit time, which is consistently worse. Build the automation at the start and the ongoing burden shrinks to monitoring the rotation Lambda and the expiry alarm.
