AWS Compute & Containers is the layer of AWS that answers "where does my code run?": it spans raw virtual machines (EC2), serverless functions (Lambda), managed containers (ECS/EKS/Fargate), and fully managed application hosting (App Runner). The core value proposition is a spectrum of control vs. operational overhead: EC2 gives you full OS control but requires patching; Fargate removes node management but limits customization; App Runner removes almost everything. An engineer choosing within this spectrum trades off latency-to-production, cost predictability, scaling granularity, and egress complexity. For a distributed systems or platform engineering interview, the expectation is to justify placement on that spectrum, understand the mechanics of auto scaling and load balancing, and reason about container networking, IAM delegation, and deployment safety.
More Control <-------------------------------------------------> Less Control
EC2 (bare VM)  |  EC2 + ECS  |  ECS Fargate  |  App Runner  |  Lambda
More Ops work  |  Node mgmt  |  No nodes    |  No cluster  |  No infra
Every service uses EC2 hardware underneath. The higher you go, the more AWS handles:
| Family Prefix | Optimized For | Key Use Cases |
|---|---|---|
| c (Compute) | 1:2 vCPU:GiB (2 GiB per vCPU) | CPU-bound web servers, batch processing, gaming |
| m (General) | Balanced 1:4 vCPU:GiB | Most application servers, small databases |
| r (Memory) | High RAM (~8 GiB/vCPU) | In-memory caches (Redis), large JVM heaps, SAP HANA |
| x / u (Memory Extreme) | Massive RAM (up to 24 TB) | SAP, Oracle, in-memory databases at scale |
| i / d (Storage) | Local NVMe SSD | Cassandra, Elasticsearch, HDFS, temporary scratch space |
| p / g (GPU) | NVIDIA GPUs | ML training (p4d = A100), inference (g5 = A10G) |
| inf (Inferentia) | AWS custom ML chip | Cost-efficient inference (~40% cheaper than g5) |
| trn (Trainium) | AWS custom training chip | Large model training (cheaper than p4 for NLP) |
| hpc | High-bandwidth networking | MPI workloads, CFD, molecular dynamics |
| t (Burstable) | CPU credits, baseline CPU | Dev/test, small web apps, infrequent spikes |
Instance sizing convention: `<family><generation>.<size>`, e.g., m7g.2xlarge
metal = dedicated physical server, no hypervisor overhead, bare-metal access.
Generation matters: Newer generations (7 vs 5) typically offer 10-20% better price-performance. Graviton3 (g suffix, e.g., m7g) is ARM64 and often 20-40% cheaper than equivalent x86.
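The naming convention can be decomposed mechanically. A small sketch (the helper name and the simplified regex are my own; real type names can carry more attribute letters than shown):

```python
import re

def parse_instance_type(name):
    """Split an EC2 instance type like 'm7g.2xlarge' into its parts.
    Simplified: all attribute letters (g = Graviton, d = local NVMe,
    n = enhanced networking, ...) come back as one string."""
    family_gen, size = name.split(".")
    m = re.match(r"^([a-z]+?)(\d+)([a-z-]*)$", family_gen)
    return {
        "family": m.group(1),        # e.g. 'm' = general purpose
        "generation": int(m.group(2)),
        "attributes": m.group(3),    # e.g. 'g' = Graviton (ARM64)
        "size": size,
    }

print(parse_instance_type("m7g.2xlarge"))
# {'family': 'm', 'generation': 7, 'attributes': 'g', 'size': '2xlarge'}
```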
| Model | Commitment | Discount vs On-Demand | Best For |
|---|---|---|---|
| On-Demand | None | 0% | Unpredictable, short bursts, prototyping |
| Reserved (Standard) | 1 or 3 yr, fixed family | Up to 72% | Steady-state baseline load |
| Reserved (Convertible) | 1 or 3 yr, flexible | Up to 66% | Steady-state with possible instance type changes |
| Savings Plans (Compute) | 1 or 3 yr, $/hr commitment | Up to 66% | Flexible: covers EC2, Fargate, Lambda automatically |
| Savings Plans (EC2 Instance) | 1 or 3 yr, specific family | Up to 72% | Committed to specific instance family per region |
| Spot Instances | None | Up to 90% | Fault-tolerant batch, stateless workers, ML training |
| Dedicated Host | On-demand or reserved | Varies | License compliance (BYOL), regulatory isolation |
| Dedicated Instance | On-demand | ~10% premium | Hardware isolation from other accounts |
Spot interruption: AWS gives a 2-minute warning via an instance metadata event and an EventBridge event. You should checkpoint state, drain connections, and terminate gracefully. Spot interruption rates vary by AZ and instance type; choose instance types with <5% interruption frequency. See Cost Optimization for Spot portfolio strategies and interruption-rate analysis.
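The EventBridge side of the warning can be caught with a rule like the following sketch (the detail-type string is the one EventBridge publishes for Spot interruptions; the target you route it to, e.g. an SNS topic or drain Lambda, is up to you):

```json
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Spot Instance Interruption Warning"]
}
```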
Savings Plans vs Reserved Instances: Compute Savings Plans are the modern default: they apply automatically to EC2 (any region, family, size, OS), Fargate, and Lambda, with no reservation inventory to manage. EC2 Instance Savings Plans give the highest discount but lock you to a specific instance family in a specific region.
An AMI is a snapshot of an instance's root volume + launch permissions + block device mapping. It encodes: OS, pre-installed software, kernel parameters.
Types: public AMIs (e.g., Amazon Linux, Ubuntu), AWS Marketplace AMIs (vendor-supplied), and private/custom AMIs you build yourself. Each is either EBS-backed (most common) or instance store-backed.
AMI lifecycle: Build → Test → Share → Deprecate. Use EC2 Image Builder for automated, pipeline-based AMI creation with CIS hardening.
Runs once at first boot (cloud-init). Used to bootstrap: install packages, pull config from S3/SSM Parameter Store, register with configuration management.
#!/bin/bash
# Runs as root on first launch
yum update -y
yum install -y amazon-cloudwatch-agent
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config -m ec2 -c ssm:/cloudwatch-config -s
For re-runs on every boot, use /var/lib/cloud/scripts/per-boot/. Max size: 16 KB plain text, 64 KB base64-encoded.
A link-local HTTP endpoint (169.254.169.254) accessible only from inside the instance. Provides:
- IAM role credentials (`/latest/meta-data/iam/security-credentials/<role-name>`)
- User data (`/latest/user-data`)

IMDSv1 vs IMDSv2: IMDSv1 answers plain GET requests, which makes it a classic SSRF target; IMDSv2 requires a session token obtained via an HTTP PUT. Enforce IMDSv2 with `HttpTokens=required` in the instance metadata options.
# IMDSv2 flow
TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" \
-H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/instance-id
Control how instances are distributed across underlying hardware:
| Type | Strategy | Use Case | Tradeoff |
|---|---|---|---|
| Cluster | All instances in same rack/AZ | Low-latency HPC, MPI, GPU training | Single AZ = no HA |
| Spread | Instances on distinct hardware | HA for small critical sets (max 7/AZ) | Can't launch many |
| Partition | Groups of instances on separate racks | Hadoop, Cassandra, Kafka (rack-aware) | Balance of HA + scale |
Cluster placement groups give 10 Gbps enhanced networking between instances and sub-millisecond latency. Required for p4d (A100) GPU clusters.
A physical server allocated to you. Use cases: BYOL per-socket/per-core licensing (Oracle, SQL Server, Windows Server) and regulatory requirements for physical isolation.
Dedicated Hosts give full visibility into sockets and physical cores. You can share them within an AWS Organization via RAM (Resource Access Manager).
| Feature | Launch Configuration | Launch Template |
|---|---|---|
| Status | Legacy, no new features | Current, recommended |
| Versioning | No versioning | Versioned ($Default, $Latest) |
| Spot + On-Demand mix | No | Yes (mixed instances policy) |
| T2/T3 unlimited | No | Yes |
| Multiple instance types | No | Yes |
| Metadata options (IMDSv2) | No | Yes |
Always use Launch Templates. They support multiple instance types and purchase options in a single ASG, enabling cost-optimized mixed fleets.
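A sketch of the mixed-instances configuration a Launch Template unlocks (field names follow the Auto Scaling `CreateAutoScalingGroup` API; the template name and values are illustrative):

```json
{
  "MixedInstancesPolicy": {
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "api-template",
        "Version": "$Latest"
      },
      "Overrides": [
        {"InstanceType": "m7g.large"},
        {"InstanceType": "m6g.large"},
        {"InstanceType": "m5.large"}
      ]
    },
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 2,
      "OnDemandPercentageAboveBaseCapacity": 25,
      "SpotAllocationStrategy": "price-capacity-optimized"
    }
  }
}
```

Here 2 On-Demand instances are guaranteed, 25% of capacity above that base stays On-Demand, and the rest is Spot diversified across the listed types.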
Target Tracking (simplest, recommended for most cases):
- Set a target metric value (e.g., `ASGAverageCPUUtilization = 50%`); AWS creates and manages the CloudWatch alarms and scales to hold the target.

Step Scaling (fine-grained control):
- Define different scaling adjustments per alarm-breach magnitude (e.g., +1 instance at 60% CPU, +3 at 80%).
Simple Scaling (legacy): Fires once per alarm breach, with a cooldown. Don't use โ Step Scaling supersedes it.
Scheduled Scaling:
- Adjust MinCapacity/MaxCapacity/DesiredCapacity on a cron schedule (e.g., scale out before business hours).

Predictive Scaling (ML-based):
- Forecasts traffic from historical CloudWatch data and provisions capacity ahead of predicted demand.
Pause instance launch or termination to perform custom actions:
Launch: Pending → Pending:Wait → [your code runs] → Pending:Proceed → InService
Terminate: Terminating → Terminating:Wait → [your code runs] → Terminating:Proceed → Terminated
Common uses: draining connections or uploading logs before termination, installing agents or warming caches before an instance enters service, registering/deregistering with external systems.
Notifications go to SNS or SQS, or you can call complete-lifecycle-action from within the instance. Timeout is 1 hour default (max 48 hours).
A pool of pre-initialized, stopped (or running) instances that can be promoted to InService in seconds instead of minutes.
For API Gateway, VPC networking, and Route 53 integration, see API Gateway & Networking.
| Feature | ALB (Application) | NLB (Network) | CLB (Classic) |
|---|---|---|---|
| OSI Layer | Layer 7 (HTTP/HTTPS/WebSocket/gRPC) | Layer 4 (TCP/UDP/TLS) | Layers 4 & 7 (legacy) |
| Routing | Path, host, header, query param, method | IP:port only | Basic HTTP path |
| Performance | ~1M RPS, variable latency | Millions RPS, ultra-low latency, static IPs | Don't use |
| Protocols | HTTP/1.1, HTTP/2, gRPC, WebSocket | TCP, UDP, TLS | HTTP, HTTPS, TCP |
| Static IPs | No (DNS-based) | Yes (Elastic IPs per AZ) | No |
| Lambda targets | Yes | No | No |
| IP targets | Yes | Yes | No |
| Preserve source IP | X-Forwarded-For header | Natively (client IP) | Limited |
| mTLS | Via listener rules | TLS passthrough | No |
| When to use | HTTP microservices, API Gateway alternative, gRPC | Gaming, IoT, PrivateLink endpoints, IP whitelisting | Migrating legacy only |
Key decision: NLB if you need static IPs (for IP whitelisting by customers/partners), ultra-low latency (<1ms additional), or non-HTTP protocols. ALB for everything HTTP: richer routing, WAF integration, Lambda targets.
A target group is the "where traffic lands" abstraction:
- Target types: EC2 instances, IP addresses, Lambda functions (ALB only), or another ALB (as an NLB target)
Each target group has its own health check configuration. A target is only sent traffic when it passes health checks.
Healthy threshold: 5 consecutive successes → marked healthy (ALB default)
Unhealthy threshold: 2 consecutive failures → marked unhealthy (default)
Interval: 30 seconds (default)
Timeout: 5 seconds (default)
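The same settings expressed as `elbv2` target group API fields, as a sketch (the `/health` path is a placeholder):

```json
{
  "HealthCheckProtocol": "HTTP",
  "HealthCheckPath": "/health",
  "HealthCheckIntervalSeconds": 30,
  "HealthCheckTimeoutSeconds": 5,
  "HealthyThresholdCount": 5,
  "UnhealthyThresholdCount": 2,
  "Matcher": {"HttpCode": "200-299"}
}
```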
For ALB: HTTP/HTTPS. Expect a 200 (or configurable range). For NLB: TCP (just connects), or HTTP/HTTPS (checks response code).
Deregistration delay (connection draining): ALB waits up to 300s (default) before removing a deregistered target, allowing in-flight requests to complete. Set to 30s for stateless APIs with fast requests.
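Deregistration delay is a target group attribute; a sketch of the attribute list you would pass to `aws elbv2 modify-target-group-attributes`:

```json
[
  {"Key": "deregistration_delay.timeout_seconds", "Value": "30"}
]
```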
ALB: Terminates TLS at the load balancer. Certificate managed via ACM (free, auto-renewed). Backend talks HTTP: simpler, no cert rotation on instances.
End-to-end encryption: ALB terminates TLS, then re-encrypts to the backend with a second certificate. Required for PCI-DSS.
NLB TLS termination: Similar to ALB, but NLB can also do TLS passthrough (encrypted packets are forwarded directly to the backend, preserving the client IP, with TLS handled by the application).
SNI (Server Name Indication): Both ALB and NLB support multiple certificates per listener. The load balancer selects the certificate based on the SNI hostname in the client hello.
ALB sticky sessions: Load balancer inserts a cookie (AWSALB). Subsequent requests from the same client go to the same target. Duration: 1 second to 7 days.
Application-based stickiness: Use your own cookie โ ALB encrypts the target info and routes based on it. Better for security (no ALB-internal cookie needed).
Pitfall: Sticky sessions defeat the purpose of horizontal scaling for stateless apps. Use them only for legacy apps with server-side session state you can't move to Redis/ElastiCache. Modern pattern: externalize session state, remove stickiness.
ECS runs Docker containers. For Kubernetes concepts (pods, deployments, namespaces), see Kubernetes.
Cluster: Logical grouping of compute resources (EC2 instances or Fargate capacity). Namespace for services, tasks, and capacity providers.
Task Definition: The blueprint โ an immutable, versioned specification defining:
- Container images, CPU/memory, port mappings
- Network mode, IAM roles (task + execution), volumes, log configuration
- Container startup ordering (`dependsOn`, e.g., wait for a sidecar to be `HEALTHY`)
{
"family": "api-service",
"networkMode": "awsvpc",
"taskRoleArn": "arn:aws:iam::123:role/api-task-role",
"executionRoleArn": "arn:aws:iam::123:role/ecsTaskExecutionRole",
"containerDefinitions": [{
"name": "api",
"image": "123.dkr.ecr.us-east-1.amazonaws.com/api:v1.2.3",
"cpu": 512,
"memory": 1024,
"portMappings": [{"containerPort": 8080}],
"secrets": [
{"name": "DB_PASSWORD", "valueFrom": "arn:aws:ssm:us-east-1:123:parameter/db-password"}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/api-service",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "api"
}
}
}],
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "1024"
}
Task: A running instantiation of a task definition. One-off (like a batch job) or managed by a service. Each task gets its own ENI in awsvpc mode.
Service: Long-running tasks. Maintains desired count, integrates with ALB/NLB, handles rolling deployments, auto-scaling. Services are the primary unit of operation for HTTP services.
awsvpc mode is the key networking mode for modern ECS:
- Each task gets its own ENI with a private IP and its own security groups, making tasks first-class VPC citizens.
- ENI limits per instance apply (e.g., m5.large = 3 ENIs × 10 IPs each). For EC2 launch type with many tasks, use ENI trunking (allows up to 120 tasks per instance).

Bridge mode (EC2 only): Tasks share the host network; port mapping is dynamic. Required for tasks needing host-level networking or when ENI limits are hit.
ECS integrates with AWS Cloud Map for service discovery:
- DNS names (e.g., `api.namespace.local`) resolve to task IPs
- ECS registers and deregisters tasks automatically as they start and stop

For complex microservices: use AWS App Mesh (Envoy-based service mesh) for traffic management, circuit breaking, mutual TLS between services.
You provision and manage EC2 instances in the cluster. ECS places containers on those instances.
Pros: full control over instance types (GPU, local NVMe, custom AMIs), higher task density per host, node-level Reserved/Spot pricing.
Cons: you patch, scale, and capacity-plan the node fleet yourself.
ECS Container Agent: Runs on each EC2 instance, communicates with ECS API, manages container lifecycle. Uses the instance's IAM instance profile for ECS API calls.
Serverless container execution โ AWS allocates, provisions, and terminates the underlying compute.
Pros: no nodes to patch or scale, per-task isolation (each task runs in its own Firecracker microVM), pay per running task.
Cons:
- Higher per-vCPU price than EC2 at sustained utilization
- No GPU support, no instance-store volumes
- No `docker exec` into Fargate tasks in production (use ECS Exec via SSM for debugging)

Pricing example: 1 vCPU + 2 GB task running 24/7 for 30 days in us-east-1 ≈ $35/month. Equivalent t3.micro (2 vCPU, 1 GiB) on-demand ≈ $8/month, but Fargate gives you dedicated resources + no ops overhead.
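The pricing figure can be reproduced from per-resource rates. A sketch (the rates below are assumed us-east-1 on-demand values; check current Fargate pricing before relying on them):

```python
# Assumed us-east-1 Fargate on-demand rates (USD, subject to change)
VCPU_PER_HOUR = 0.04048   # per vCPU-hour
GB_PER_HOUR = 0.004445    # per GB-hour of memory

def fargate_monthly_cost(vcpu, gb, hours=730):
    """Rough monthly cost of one always-on Fargate task (730 h/month)."""
    return round((vcpu * VCPU_PER_HOUR + gb * GB_PER_HOUR) * hours, 2)

print(fargate_monthly_cost(vcpu=1, gb=2))  # ~36, matching the ~$35/month figure
```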
Fargate Spot: Up to 70% discount on Fargate tasks โ same interruption model as EC2 Spot. Ideal for batch jobs, CI/CD workers, non-critical async processing.
| Scenario | Recommendation |
|---|---|
| HTTP API, variable traffic | ECS Fargate (no node management) |
| GPU inference | ECS on EC2 (g5/p4 instances) |
| Batch processing with cost sensitivity | ECS Fargate Spot |
| Very high task density, steady workload | ECS on EC2 with Reserved Instances |
| Dev/staging environments | Fargate (pay only when running) |
| On-prem compliance, specific instance types | ECS on EC2 |
Default deployment strategy. ECS gradually replaces old tasks with new:
minimumHealthyPercent: 100 → scale up before scale down (extra capacity)
maximumPercent: 200 → allow up to 2x desired count during deployment
Deployment circuit breaker (enable this): ECS detects if new tasks are crashing or failing health checks and rolls back automatically. The failure threshold scales with the desired count (minimum 3 consecutive failures).
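Put together, the rolling-update settings above map onto the ECS `CreateService`/`UpdateService` deployment configuration; a sketch:

```json
{
  "deploymentConfiguration": {
    "minimumHealthyPercent": 100,
    "maximumPercent": 200,
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    }
  }
}
```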
CodeDeploy integrates with the broader CI/CD & DevOps pipeline (CodePipeline, CodeBuild).
Two separate target groups (blue = current, green = new). Traffic shifted between them:
Canary: 10% → green for 5 minutes → 100% → green (if no alarms)
Linear: 10% every 1 minute → 100%
AllAtOnce: 100% immediately (for staging)
Process: deploy the green task set → run test traffic via a test listener → shift production traffic (canary/linear/all-at-once) → keep blue running for the rollback window → terminate blue.
Advantage: Instant rollback: just shift traffic back to the blue target group. Zero downtime, no partial-state deployments.
Three scaling policy types mirror EC2 ASG: target tracking, step scaling, and scheduled scaling.
Target tracking metrics: `ECSServiceAverageCPUUtilization`, `ECSServiceAverageMemoryUtilization`, or `ALBRequestCountPerTarget`.
`ALBRequestCountPerTarget` is the best metric for HTTP services: it scales on actual request pressure, not CPU (which may be low even under high load for I/O-bound services).
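A sketch of the Application Auto Scaling target-tracking configuration for request-based scaling (the resource label is a placeholder for your ALB/target-group ARN suffix; the 500 req/task target is illustrative):

```json
{
  "TargetValue": 500.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ALBRequestCountPerTarget",
    "ResourceLabel": "app/my-alb/1234/targetgroup/my-tg/5678"
  },
  "ScaleInCooldown": 300,
  "ScaleOutCooldown": 60
}
```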
Decouple services from launch type. A capacity provider links an ASG or Fargate to a cluster. Services use capacity provider strategies:
{
"capacityProviderStrategy": [
{"capacityProvider": "FARGATE", "weight": 1, "base": 1},
{"capacityProvider": "FARGATE_SPOT", "weight": 4, "base": 0}
]
}
This guarantees 1 task on regular Fargate (the base), then splits the remainder 1:4 between Fargate and Fargate Spot: roughly 70% cost savings on the Spot portion while maintaining HA.
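The base/weight arithmetic can be sketched as follows (a simplified model of ECS's placement math; real ECS rounding may differ for counts that don't divide evenly by the weights):

```python
def distribute(desired, strategy):
    """Fill each provider's 'base' first, then split the remainder
    by weight. Simplified: fractional remainders are dropped."""
    counts = {p["capacityProvider"]: 0 for p in strategy}
    remaining = desired
    for p in strategy:
        take = min(p.get("base", 0), remaining)
        counts[p["capacityProvider"]] += take
        remaining -= take
    total_weight = sum(p["weight"] for p in strategy)
    for p in strategy:
        counts[p["capacityProvider"]] += remaining * p["weight"] // total_weight
    return counts

strategy = [
    {"capacityProvider": "FARGATE", "weight": 1, "base": 1},
    {"capacityProvider": "FARGATE_SPOT", "weight": 4, "base": 0},
]
print(distribute(6, strategy))  # {'FARGATE': 2, 'FARGATE_SPOT': 4}
```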
ADOT (AWS Distro for OpenTelemetry) sidecar: a common pattern deploys an ADOT collector as a sidecar container for metrics/traces without changing application code.
EKS is managed Kubernetes on AWS. For core K8s concepts (pods, ReplicaSets, Services), see Kubernetes.
| Feature | Managed Node Groups | Self-Managed Node Groups |
|---|---|---|
| Provisioning | AWS provisions/joins nodes | You provision, bootstrap, join manually |
| AMI updates | Automated rolling upgrade | Manual |
| Draining | AWS cordons + drains before termination | Must handle manually |
| Spot support | Yes | Yes |
| Custom AMIs | Yes | Yes |
| Multiple instance types | Yes | Yes |
| Launch template | Required | Optional |
| Use case | 95% of workloads | Niche: custom bootstrap, specific OS |
Run pods serverlessly on EKS. Define namespace + label selectors โ matching pods run on Fargate nodes:
fargateProfile:
fargateProfileName: "default"
selectors:
- namespace: "app"
labels:
tier: "backend"
- namespace: "kube-system" # for CoreDNS only
Fargate on EKS tradeoffs vs Managed Node Groups: no DaemonSets, no EBS-backed persistent volumes (EFS only), one pod per Fargate node, higher per-pod cost at steady load, and slower pod startup (microVM provisioning).
Managed lifecycle for cluster-critical components. AWS tests and distributes compatible versions:
- `vpc-cni` → AWS VPC CNI (pod networking via ENIs)
- `coredns` → Cluster DNS
- `kube-proxy` → iptables rules for Services
- `aws-ebs-csi-driver` → Dynamic EBS PV provisioning
- `aws-efs-csi-driver` → EFS PV provisioning (shared RWX)
- `aws-load-balancer-controller` → Provisions ALB/NLB from Ingress/Service objects
- `adot` → AWS Distro for OpenTelemetry
- `amazon-guardduty-agent` → Runtime threat detection

Critical: Without vpc-cni add-on management, upgrading EKS minor versions can break networking. Pin add-on versions to the cluster version compatibility matrix.
IRSA is the primary way to grant EKS workloads AWS permissions. For IAM policies, roles, and OIDC federation concepts, see IAM & Security.
Associates an IAM role with a Kubernetes service account. Pods using that SA get temporary credentials via OIDC without needing instance profile credentials (no credential sharing across pods on same node).
How it works:
1. The cluster exposes an OIDC identity provider; the IAM role's trust policy trusts it.
2. A Kubernetes service account is annotated with the IAM role ARN.
3. The `eks-pod-identity-webhook` injects `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE` env vars into matching pods.
4. The AWS SDK exchanges the projected JWT via `AssumeRoleWithWebIdentity` → temporary credentials.

# Service Account with IRSA annotation
apiVersion: v1
kind: ServiceAccount
metadata:
name: s3-reader
namespace: app
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/s3-reader-role
EKS Pod Identity (newer alternative to IRSA): Simpler setup, no OIDC trust-policy boilerplate. Create a pod identity association that maps an IAM role to a service account. Less config, same security model.
EKS Anywhere: run EKS on your own infrastructure (VMware vSphere, bare metal, Nutanix, Snow devices).
ECR stores Docker container images. For CI/CD pipelines that build and push images, see CI/CD & DevOps.
Basic scanning (free, deprecated): Scans on push using Clair for OS-level CVEs. Returns COMPLETE status.
Enhanced scanning (Amazon Inspector integration, recommended):
- Scans on push and continuously (rescans as new CVEs are published)
- Covers OS packages and programming-language packages
- Enable by setting `scanType: ENHANCED` on the registry

Shift-left integration: Block ECR push in the CI pipeline if `aws ecr describe-image-scan-findings` returns CRITICAL findings.
Automated cleanup rules to control storage costs:
{
"rules": [
{
"rulePriority": 1,
"description": "Keep last 10 production images",
"selection": {
"tagStatus": "tagged",
"tagPrefixList": ["prod-"],
"countType": "imageCountMoreThan",
"countNumber": 10
},
"action": {"type": "expire"}
},
{
"rulePriority": 2,
"description": "Expire untagged images older than 7 days",
"selection": {
"tagStatus": "untagged",
"countType": "sinceImagePushed",
"countUnit": "days",
"countNumber": 7
},
"action": {"type": "expire"}
}
]
}
Rules evaluated in priority order. Common pattern: keep last N tagged images per prefix + expire untagged after 7 days.
Cross-region replication (replication configuration on the registry): images pushed to the source region are automatically replicated to the configured destination regions.
Cross-account replication (registry permissions): grant the destination account replication permissions on the registry. Pull-through cache rules can additionally cache upstream public registries locally.
ECR Public: Free, unauthenticated pull globally. Use for open-source distribution. Private ECR: requires docker login via aws ecr get-login-password.
Fully managed service for containerized web applications and APIs. Abstracts away clusters, load balancers, VPCs, auto-scaling, and TLS.
| Fit | Not a Fit |
|---|---|
| HTTP/HTTPS APIs with variable traffic | Non-HTTP (gRPC, WebSocket, TCP) |
| Startups/teams without platform expertise | Fine-grained networking (specific VPC subnets, SGs) |
| Dev/test/staging environments | Very high scale (>25 instances) |
| Migrate from Heroku/Render to AWS | Need GPU instances |
| Single-service deployments | Multi-container (sidecar patterns) |
VPC connector: App Runner can reach resources in your VPC (RDS, ElastiCache) via a VPC connector. Outbound only โ VPC resources must accept inbound from the VPC connector's security group.
vs ECS Fargate: App Runner is simpler but less flexible. No task definitions, no service discovery, no capacity providers. ECS Fargate if you need: multi-container tasks, VPC-native inbound, custom routing, service mesh, or more than 25 instances.
AWS Graviton (Graviton2/3/4) are AWS-designed 64-bit ARM processors built on Arm Neoverse cores (N1 in Graviton2, V1 in Graviton3).
| Metric | Graviton3 vs comparable x86 |
|---|---|
| Price/performance | 20-40% better |
| Energy efficiency | 60% less energy |
| Memory bandwidth | 2x vs Graviton2 |
| Floating-point performance | 2x vs Graviton2 |
| ML performance | 3x vs Graviton2 (BFLOAT16) |
Specific data points:
- `c7g` (Graviton3) vs `c6i` (Intel Ice Lake): ~25% better performance, ~20% cheaper
- `m7g` vs `m6i`: similar perf, 15-20% cost reduction

Code compatibility:
- Go: `GOARCH=arm64` → typically zero code changes
- Rust: `cargo build --target aarch64-unknown-linux-gnu` → zero code changes
- Multi-arch images: `docker buildx` with `--platform linux/amd64,linux/arm64`

# Build multi-arch image
# FROM --platform=$BUILDPLATFORM enables cross-compilation
FROM --platform=$BUILDPLATFORM node:18-alpine AS builder
ARG TARGETPLATFORM
ARG BUILDPLATFORM
# Build and push multi-arch manifest
docker buildx build \
--platform linux/amd64,linux/arm64 \
--tag 123.dkr.ecr.us-east-1.amazonaws.com/myapp:latest \
--push .
ECS/EKS: Specify CPU architecture in task definition / node selector:
"runtimePlatform": {
"cpuArchitecture": "ARM64",
"operatingSystemFamily": "LINUX"
}
When NOT to migrate: workloads with x86-only native dependencies or prebuilt binaries, code relying on x86 SIMD intrinsics (AVX/SSE), or commercial software without ARM64 builds.
Managed batch computing service. Provisions and manages EC2/Fargate compute based on submitted jobs.
Compute Environment: The pool of EC2 or Fargate resources.
Job Queue: Jobs are submitted to queues. Multiple queues can map to one or more compute environments with priorities.
Job Definition: Like an ECS task definition โ Docker image, CPU/memory, retries, timeout, environment variables, IAM role, volumes.
Job: A unit of work. Can be array jobs (same definition, N independent runs โ useful for hyperparameter sweeps) or dependent jobs (job B waits for job A).
| Use Case | Why Batch |
|---|---|
| ML training at scale | Spot instances, auto-provisioning, job queues |
| Data ETL pipelines | Managed infrastructure, retry logic |
| Genomics / scientific compute | HPC instances, array jobs |
| Video transcoding | Spot for cost, output in S3 |
| Nightly report generation | Scheduled, pay only when running |
Batch vs ECS vs Lambda: Batch for queued, finite jobs (minutes to hours) needing retries and priorities; ECS services for long-running processes; Lambda for event-driven work under 15 minutes.
Fair Share Scheduling: Distribute Batch compute across teams/projects based on configurable shares. Prevents one team's backlog from starving another.
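A sketch of a fair-share scheduling policy (field names follow the Batch `CreateSchedulingPolicy` API; the share identifiers and values are hypothetical):

```json
{
  "name": "team-fair-share",
  "fairsharePolicy": {
    "shareDecaySeconds": 3600,
    "computeReservation": 10,
    "shareDistribution": [
      {"shareIdentifier": "teamA", "weightFactor": 1.0},
      {"shareIdentifier": "teamB", "weightFactor": 2.0}
    ]
  }
}
```

Jobs are submitted with a matching `shareIdentifier`, and Batch balances vCPU usage across the shares over the decay window.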
A helper container in the same pod/task that augments the main container without modifying it.
[Pod / ECS Task]
├── main container: api (port 8080)
├── sidecar: envoy proxy (port 8443 → 8080)
├── sidecar: fluentd (reads logs from shared volume)
└── sidecar: datadog-agent (scrapes /metrics)
ECS sidecar in task definition:
{
"containerDefinitions": [
{
"name": "api",
"image": "myapp:latest",
"dependsOn": [{"containerName": "log-router", "condition": "START"}]
},
{
"name": "log-router",
"image": "amazon/aws-for-fluent-bit:latest",
"essential": false,
"firelensConfiguration": {"type": "fluentbit"}
}
]
}
AWS FireLens: Native ECS sidecar for log routing. Fluent Bit sidecar intercepts container stdout/stderr and routes to CloudWatch, Kinesis, S3, or Splunk without changing application code.
A proxy sidecar that acts as an ambassador to external services. The main container connects to localhost; the ambassador handles service discovery, retries, circuit breaking, and mTLS to the external service.
[ECS Task]
├── main container: connects to localhost:6379
└── ambassador: envoy or redis-proxy
    → resolves ElastiCache cluster DNS
    → handles connection pooling
    → adds retries with exponential backoff
Use case: Connecting to AWS services with complex retry/circuit-break logic without polluting application code. Service mesh sidecar proxies (Envoy via App Mesh) implement this pattern.
A sidecar that normalizes the interface of the main container to conform to a standard. The main container exposes a proprietary metrics/log format; the adapter translates to a standard.
[Pod]
├── main container: legacy app exporting custom /stats
└── adapter: prometheus-exporter sidecar
    → reads /stats (proprietary)
    → exposes /metrics (Prometheus format)
    → Prometheus scrapes the adapter, not the main container
Common in EKS: OpenTelemetry Collector as adapter, consuming app-specific telemetry and exporting to CloudWatch, Jaeger, or X-Ray.
Q: Walk me through what happens when an ECS Fargate task starts.
A: 1) ECS scheduler receives a RunTask or service scale-out event. 2) ECS control plane selects Fargate capacity in the target AZ. 3) AWS provisions a Firecracker microVM (takes ~5-10 seconds for cold start). 4) The ECS agent inside the microVM pulls credentials from the IAM execution role via STS, authenticates with ECR, and pulls the container image. 5) The ENI is provisioned in the VPC subnet and attached to the task (awsvpc mode). 6) Containers start in dependency order (per dependsOn). 7) The ECS agent begins reporting the task's health to the ECS control plane. 8) If behind a load balancer, the target registers and starts receiving health checks; once passing, traffic is routed.
Q: How would you design an auto-scaling strategy for an ECS service handling an API with highly variable traffic (10x spikes)?
A: Use three layers: 1) Target tracking on ALBRequestCountPerTarget set to ~500 req/task, which scales proportionally to actual HTTP load, not CPU. 2) Scheduled scale-out before known traffic events (marketing campaigns, business hours). 3) Warm pool or minimum capacity of at least 2 tasks across 2 AZs for zero downtime during scale-up. For the underlying capacity (if EC2 launch type), a Capacity Provider with a mixed ASG (On-Demand base + Spot for burst) reduces cost by 60-70%. Set scale-in cooldown to 300s to avoid thrashing after spikes. Enable the deployment circuit breaker to auto-rollback failed deploys during high-traffic windows.
Q: Spot instances interrupted your training job at 80% completion. How do you handle this?
A: Design for interruption from day one: 1) Checkpoint model weights every N steps to S3 (e.g., every 1000 steps). 2) Subscribe to the EC2 Spot interruption notice via instance metadata (/latest/meta-data/spot/termination-time) or EventBridge. On notice, flush the current checkpoint immediately and terminate gracefully. 3) Re-launch the job with --resume-from-checkpoint pointing to the last S3 checkpoint. 4) Use a job queue (SQS or AWS Batch) that holds the job spec; failure re-enqueues it automatically. 5) For AWS Batch: set attempts: 5 in the job definition; Batch handles re-queue on spot interruption. 6) Consider a Spot + On-Demand mixed fleet: run 1 On-Demand instance as "master" + N Spot workers; if a worker is interrupted, the master redistributes work.
Q: What's the difference between an ECS task role and an ECS execution role?
A: The execution role is used by the ECS agent (not your code) to pull images from ECR, read secrets from SSM/Secrets Manager, and write logs to CloudWatch. It's operational infrastructure; your application never uses it. The task role is used by your application code running inside the container to call AWS services (e.g., S3, DynamoDB, SQS). Credentials are delivered via the ECS container credentials endpoint inside the task (169.254.170.2, distinct from the EC2 IMDS address). Always apply least privilege: the task role should only have permissions your application actually needs. See IAM & Security for role and policy design.
Q: ALB vs NLB โ when does the choice matter?
A: Choose NLB when: (1) customers need to whitelist static IPs (NLB has Elastic IPs; ALB is DNS-only), (2) you have non-HTTP protocols (TCP, UDP, raw TLS passthrough), (3) ultra-low latency is critical (NLB operates at L4 with ~100 microsecond overhead vs ALB's milliseconds at L7). Choose ALB when: (1) you need content-based routing (path, host, header), (2) you need WAF integration, (3) targets include Lambda functions, (4) you need gRPC support, (5) you need sticky sessions or advanced health checks. A common hybrid: NLB (for static IP) in front of ALB (for routing), using the ALB as an NLB target. See API Gateway & Networking for VPC, Route 53, and API Gateway context.
Q: Explain IRSA and why it's better than using EC2 instance profiles for EKS workloads.
A: With EC2 instance profiles, every pod on the same node shares the node's IAM credentials. If one pod is compromised, the attacker gets credentials for all workloads on that node. IRSA (IAM Roles for Service Accounts) solves this with OIDC federation: each pod gets a projected JWT token for its service account; the AWS SDK exchanges this for temporary role credentials via STS AssumeRoleWithWebIdentity. The key benefit is blast radius reduction: a compromised pod can only access resources permitted by its specific service account's IAM role, not the entire node's permissions. Additionally, you get full auditability in CloudTrail: you can see exactly which pod (via the role session name) called which API.
Q: How do you handle blue/green deployments in ECS with zero downtime?
A: Use ECS + CodeDeploy blue/green (see CI/CD & DevOps for the full CodePipeline setup): 1) The service has two target groups (blue = current, green = new). The ALB listener routes 100% to blue. 2) Deploy the new task set to the cluster, register with the green target group. 3) An ALB test listener (e.g., port 8443) routes to green for automated integration tests. 4) On pass, CodeDeploy shifts production traffic: either all-at-once, canary (10% for N minutes then 100%), or linear (10% every minute). 5) Old (blue) tasks remain running during the rollback window. 6) Rollback: one API call shifts traffic back to blue immediately. The key: because traffic shifting is instant (an ALB routing rule change), rollback takes seconds, not minutes, unlike rolling updates which must drain and re-deploy.
Q: You need to run 1,000 batch jobs, each processing a 100MB file from S3, completing in 2 minutes. Design the system.
A: This is a textbook AWS Batch array job: 1) Upload 1,000 files to S3. 2) Create a Batch job definition pointing to a container that reads the AWS_BATCH_JOB_ARRAY_INDEX env var and maps it to a file (stored in DynamoDB or parameter store). 3) Submit one array job with arraySize: 1000. 4) Compute environment: managed, Fargate Spot (2 minutes per job, low risk of spot interruption; or EC2 Spot). 5) Set job retry attempts to 3 (for spot interruptions). 6) Batch manages concurrency based on maxvCpus: with 1000 jobs × 0.5 vCPU each = 500 vCPUs max (adjust as needed). 7) CloudWatch metrics + EventBridge to alert on job failures. Total cost estimate: 1000 jobs × 2 min × 0.5 vCPU × Fargate Spot pricing ≈ $5-10.
Q: What is the ENI trunking feature in ECS and when do you need it?
A: In awsvpc networking mode, each ECS task gets its own ENI. EC2 instances have ENI limits (e.g., m5.large = 3 ENIs). Without trunking, you can only run a handful of tasks per m5.large (some ENIs are reserved for the instance itself). With ENI trunking, ECS attaches a trunk ENI to the instance and creates branch ENIs for each task, enabling up to 120 tasks per instance. Enable it by setting awsvpcTrunking: enabled in the ECS account settings. Requires an ECS-optimized AMI and supported instance types. The tradeoff: more complex networking; Fargate sidesteps this entirely.
Q: Compare ECS Fargate and Lambda for a workload that processes images uploaded to S3.
A: Key dimensions: Duration: Lambda max 15 minutes; if image processing takes >15 min (video, high-res), use Fargate. Memory: Lambda max 10 GB; Fargate goes much higher (tens of GB), and ECS on EC2 adds GPU options. Concurrency: Lambda scales per-event to thousands of concurrent executions (use provisioned concurrency to avoid cold starts); Fargate scales on metrics with some lag. Cost: Lambda is cheaper for sporadic/short jobs (billed per millisecond of execution); Fargate is better for sustained load (pay per vCPU-second with no invocation overhead). Operations: Lambda is simpler (no Dockerfile required, no cluster). Decision: For typical web images (<60s, <1GB, high burst): Lambda. For RAW photos, ML pipelines, video thumbnailing: ECS Fargate. For nightly batch processing of thousands of images: AWS Batch.
Q: How does Graviton migration work in practice for a containerized ECS service?
A: 1) Build a multi-arch Docker image: use docker buildx build --platform linux/amd64,linux/arm64 and push a manifest list to ECR. 2) Update the ECS task definition to add runtimePlatform: {cpuArchitecture: ARM64}. 3) Run the ARM64 task in a shadow environment and validate functional correctness and performance. 4) Update the service with the new task definition (rolling deploy). 5) Monitor CPU utilization, p99 latency, and error rates. For most Node.js/Python/JVM services this is a 2-4 hour effort for a ~20% cost reduction. The main risk is native extensions (.node files, Python C extensions) compiled for x86: audit package.json for native deps and verify their ARM64 support first.
Karpenter is an open-source node provisioner for Kubernetes (maintained by AWS) that replaces Cluster Autoscaler with faster, more efficient scaling.
| Criterion | Cluster Autoscaler | Karpenter |
|---|---|---|
| Scale-out latency | 2-4 min | 30-60s |
| Node selection | Node groups (fixed type) | Any instance type (best fit) |
| Bin packing | Suboptimal | Optimized (consolidation) |
| Spot diversity | Manual configuration | Automatic (diversify types) |
| Node consolidation | Manual deprovisioning | Automatic (removes underutilized nodes) |
| Configuration | Node group limits | NodeClass + NodePool |
# NodePool: defines constraints for nodes Karpenter can provision
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws   # required in the v1 API
        kind: EC2NodeClass
        name: al2023
      expireAfter: 720h            # force-rotate nodes every 30 days (security patching)
      requirements:
        # Prefer Graviton (ARM64) for 20-40% cost savings
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64", "amd64"]  # allow both; the cheaper arm64 types usually win
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]  # Karpenter prefers Spot when both are allowed
        - key: node.kubernetes.io/instance-type
          operator: In
          # Diverse instance types reduce Spot interruption probability
          values: ["m7g.large", "m7g.xlarge", "m6g.large", "c7g.large", "c7g.xlarge", "m5.large", "m5.xlarge"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
  limits:
    cpu: 1000      # max 1000 vCPUs total across this NodePool
    memory: 4000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # v1 name (WhenUnderutilized in v1beta1)
    consolidateAfter: 30s
---
# EC2NodeClass: AWS-specific node configuration
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: al2023
spec:
  amiFamily: AL2023          # Amazon Linux 2023 (vs AL2 for older AMIs)
  amiSelectorTerms:
    - alias: al2023@latest   # required in the v1 API
  role: KarpenterNodeRole
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        iops: 3000
        encrypted: true
Karpenter's subnet and security group selectors are tag-based: ensure your VPC & Networking subnets carry the karpenter.sh/discovery tag matching the cluster name.
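Applying the discovery tag is a single ec2 create-tags call; the subnet and security group IDs below are placeholders:

```shell
# Tag the private subnets and the node security group so Karpenter's
# selector terms can discover them (IDs are hypothetical).
aws ec2 create-tags \
  --resources subnet-0abc1234 subnet-0def5678 sg-0123abcd \
  --tags Key=karpenter.sh/discovery,Value=my-cluster
```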
# Prevent all pods from being evicted simultaneously during consolidation
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  selector:
    matchLabels:
      app: api
  minAvailable: "50%"  # at least 50% of pods must remain available
AWS Graviton3 (ARM64) instances offer 20-40% better price-performance vs equivalent x86 instances. This is one of the highest ROI optimizations available. See Cost Optimization for how Graviton fits into a broader savings strategy alongside Savings Plans and Spot.
Most modern runtimes and languages are Graviton-compatible:
official base images such as node:20-alpine are published as multi-arch manifests that include a linux/arm64 variant.
# Step 1: Check if your container images are multi-arch
docker manifest inspect node:20-alpine | grep architecture
# Should show both amd64 and arm64 variants
# Step 2: Build multi-arch images in CI/CD
docker buildx build \
--platform linux/amd64,linux/arm64 \
--tag 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest \
--push .
# Step 3: Test on Graviton (canary)
# EKS: use nodeSelector or node affinity
spec:
  nodeSelector:
    kubernetes.io/arch: arm64
  containers:
    - name: my-app
      image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
# Step 4: Update Karpenter NodePool to prefer arm64 (see above)
# Step 5: Monitor for issues; rollback nodeSelector to amd64 if needed
m7g.xlarge (Graviton3 ARM64): $0.1632/hr
m7i.xlarge (Intel x86): $0.2016/hr
Savings: 19% per instance
r7g.2xlarge (Graviton3): $0.5376/hr
r7i.2xlarge (Intel): $0.6048/hr
Savings: 11% per instance
For a 20-node cluster migrating from m7i.xlarge to m7g.xlarge:
Monthly savings: 20 × ($0.2016 - $0.1632) × 730 hrs ≈ $561/month
Annual savings: $6,732
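The arithmetic above generalizes to a small helper for estimating other fleet sizes, using the on-demand rates quoted earlier:

```python
HOURS_PER_MONTH = 730

def monthly_savings(nodes: int, x86_rate: float, graviton_rate: float) -> float:
    """On-demand $ saved per month by swapping every node's instance type."""
    return nodes * (x86_rate - graviton_rate) * HOURS_PER_MONTH

# 20 nodes moved from m7i.xlarge ($0.2016/hr) to m7g.xlarge ($0.1632/hr)
saved = monthly_savings(20, 0.2016, 0.1632)
print(f"${saved:,.0f}/month, ${saved * 12:,.0f}/year")
```

The same helper works for Reserved Instance or Savings Plan rates; only the two hourly figures change.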
App Runner is fully managed: you provide source code or container image, AWS handles everything (load balancer, auto-scaling, TLS, deployments).
| Criterion | App Runner | ECS Fargate |
|---|---|---|
| Setup time | 5 minutes | 30-60 minutes |
| Configuration | Minimal (CPU/memory only) | Full ECS task/service/ALB setup |
| Auto-scaling | Automatic (no config) | Manual scaling policy config |
| Custom networking | VPC connector (egress only) | Full VPC integration |
| Load balancer | Built-in, not customizable | ALB (full control) |
| Custom domains | Yes | Via Route 53 + ACM |
| Health checks | HTTP path only | Full ALB health check config |
| Cost (idle) | Provisioned instances bill memory only when idle; service can be paused | Minimum 0.25 vCPU if running |
| Best for | Simple containerized APIs, prototypes, internal tools | Production microservices, complex networking |
# apprunner.yaml (source-based deployment)
version: 1.0
runtime: nodejs18
build:
  commands:
    build:
      - npm install
      - npm run build
run:
  runtime-version: 18.12.0
  command: node dist/server.js
  network:
    port: 8080
    env: PORT
  env:
    - name: NODE_ENV
      value: production
  secrets:
    - name: DATABASE_URL
      value-from: arn:aws:secretsmanager:us-east-1:123456789012:secret:db-url
// CDK: App Runner service (@aws-cdk/aws-apprunner-alpha construct library)
const service = new apprunner.Service(this, 'ApiService', {
  source: apprunner.Source.fromEcr({
    imageConfiguration: {
      port: 8080,
      environmentSecrets: { DB_URL: apprunner.Secret.fromSecretsManager(dbSecret) },
    },
    repository: ecrRepo,
    tagOrDigest: 'latest',
  }),
  cpu: apprunner.Cpu.ONE_VCPU,
  memory: apprunner.Memory.TWO_GB,
  autoDeploymentsEnabled: true, // redeploy on new ECR push
  vpcConnector, // for private DB access
});
App Runner uses a separate access role to pull from ECR and an instance role to read Secrets Manager at runtime; see IAM & Security for the exact trust policies required.
AWS Batch is ideal for ML training jobs, genomics analysis, financial simulations, and any large-scale parallel compute that would be awkward in Lambda or ECS.
# Submit a Batch job for model training
import boto3

batch = boto3.client('batch')
response = batch.submit_job(
    jobName='train-model-v42',
    jobQueue='ml-training-queue',
    jobDefinition='model-training-job:3',
    containerOverrides={
        'command': ['python', 'train.py', '--epochs', '50', '--model', 'bert-base'],
        'environment': [
            {'name': 'S3_TRAINING_DATA', 'value': 's3://my-ml-bucket/training/v3/'},
            {'name': 'S3_OUTPUT', 'value': 's3://my-ml-bucket/models/v42/'},
        ],
        'resourceRequirements': [
            {'type': 'GPU', 'value': '1'},  # request 1 GPU
        ],
    },
    retryStrategy={'attempts': 3},
    timeout={'attemptDurationSeconds': 7200},  # 2-hour timeout
)

# Monitor
batch.describe_jobs(jobs=[response['jobId']])
# Submit 100 parallel jobs for a hyperparameter sweep
response = batch.submit_job(
    jobName='hyperparam-sweep',
    jobQueue='ml-training-queue',
    jobDefinition='hyperparam-job',
    arrayProperties={'size': 100},  # creates 100 child jobs (index 0-99)
    # Each child job gets the AWS_BATCH_JOB_ARRAY_INDEX env var (0-99);
    # use the index to select hyperparameters from a config file.
)
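Inside each child, the index can be mapped deterministically onto a sweep grid; a minimal sketch with hypothetical hyperparameter values:

```python
import itertools
import os

# Hypothetical sweep grid: 4 x 5 x 5 = 100 combinations, one per child job.
LEARNING_RATES = [1e-5, 3e-5, 1e-4, 3e-4]
BATCH_SIZES = [8, 16, 32, 64, 128]
DROPOUTS = [0.0, 0.1, 0.2, 0.3, 0.5]

GRID = list(itertools.product(LEARNING_RATES, BATCH_SIZES, DROPOUTS))

def params_for(index: int) -> dict:
    """Deterministically pick one hyperparameter combination per array index."""
    lr, bs, dropout = GRID[index]
    return {"lr": lr, "batch_size": bs, "dropout": dropout}

# Batch injects the index; default to 0 for local runs.
index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
print(f"child {index} trains with {params_for(index)}")
```

Because the grid is generated identically in every child, a retried job always re-trains the same combination, keeping the sweep reproducible.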