AWS is the market-leading cloud platform from Amazon — 200+ services spanning IaaS, PaaS, and SaaS, deployed across 33 geographic regions, 105 Availability Zones, and 600+ Points of Presence worldwide. It solves the fundamental problem of capital expenditure and operational overhead for computing resources through on-demand, pay-as-you-go pricing. A team picks AWS for its unmatched breadth (no other cloud offers as many services), its mature ecosystem, its leadership in AI/ML (Bedrock, SageMaker), and the powerful serverless and event-driven primitives that enable rapid innovation without managing hardware. This hub links to 19 deep-dive articles covering every major domain — use them to go from orientation to expert depth on any topic.
The core mental model is global infrastructure composed of Regions, Availability Zones, and edge locations. A Region is a geographic area (e.g., us-east-1). Each Region contains multiple physically isolated AZs — separate data centers with independent power, cooling, and networking, connected by low-latency fiber. Edge locations (600+) serve CloudFront CDN and Route 53 DNS globally.
Infrastructure numbers worth knowing:
Key layers:
Control Plane vs Data Plane:
[User / App / CI]
|
| API call (HTTPS + SigV4 signing)
v
[AWS Global Endpoint / Control Plane]
|
v
[Authentication & Authorization (IAM)]
| |
| Allow | Explicit Deny → rejected immediately
v
[Command dispatched to Data Plane — specific Region]
|
v
[Resource: EC2 instance / S3 object / Lambda invocation]
The control plane handles API calls (management operations): creating/deleting/configuring resources. The data plane handles the actual work: serving requests to your EC2 instances, returning S3 objects, executing Lambda functions. AWS designs data planes for higher availability than control planes — your running workloads survive brief control plane disruptions.
IAM — the authorization fabric: Every API call has a principal (who) calling an action (what) on a resource (which ARN). IAM evaluates policies in order: explicit Deny wins, then SCPs, then resource-based policies, then identity-based policies. The result is either Allow or an implicit Deny. See IAM & Security for the full evaluation logic.
Event-Driven Architecture: Many services emit events natively — S3 object created, DynamoDB record changed, SQS message received. EventBridge routes these to targets (Lambda, Step Functions, SQS, HTTP endpoints) using rules. This loose coupling is the dominant pattern for scalable, cost-efficient architectures on AWS.
Quick reference for interviews and architecture decisions. Deep-dive articles linked from each section above.
| Service | Description | Key Use Case |
|---|---|---|
| EC2 | Virtual machines, 700+ instance types | OS-level control, ML inference, legacy lift-and-shift |
| Lambda | Serverless functions up to 15 min, 10 GB memory | Event-driven APIs, glue code, async processing |
| ECS | AWS-native container orchestration | Docker microservices, simpler Kubernetes alternative |
| EKS | Managed Kubernetes control plane | Teams already on K8s, multi-cloud portability |
| Fargate | Serverless container runtime for ECS/EKS | No EC2 node management; pay-per-task |
| App Runner | Container → URL in minutes | Simplest container hosting, no infra knowledge needed |
| AWS Batch | Managed batch job queues | ML training jobs, simulations, large-scale ETL |
| Lightsail | Simplified VMs with fixed pricing | Simple web apps, dev/test, low traffic sites |
| Service | Description | Key Use Case |
|---|---|---|
| S3 | Object storage, 11-nines durability, unlimited scale | Data lakes, static assets, backups, ML training data |
| EBS | Block storage volumes attached to EC2 | Database storage, OS volumes (gp3 default) |
| EFS | Managed NFS file system, auto-scaling | Shared storage across multiple EC2 or containers |
| FSx for Lustre | High-performance parallel file system | ML training, HPC, needs POSIX + high throughput |
| FSx for Windows | Managed Windows File Server | Windows workloads, SMB protocol, Active Directory |
| S3 Glacier | Archive tiers (Instant/Flexible/Deep Archive) | Long-term compliance, infrequently accessed data |
| Service | Description | Key Use Case |
|---|---|---|
| RDS | Managed relational: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server | Traditional RDBMS with managed patching/backups |
| Aurora | Cloud-native relational, storage auto-scales to 128 TB | High-performance MySQL/PostgreSQL, 5× throughput |
| Aurora Serverless v2 | Fine-grained ACU auto-scaling | Variable or unpredictable relational workloads |
| DynamoDB | Managed NoSQL, single-digit ms at any scale | Key-value, document, high-throughput microservices |
| ElastiCache | Managed Redis or Memcached | Session store, rate limiting, DB query cache |
| Redshift | Columnar data warehouse, Redshift Serverless option | Analytics, BI, large-scale SQL on structured data |
| Neptune | Managed graph database (Gremlin, SPARQL, openCypher) | Social graphs, fraud detection, knowledge graphs |
| DocumentDB | MongoDB-compatible document database | Document workloads, MongoDB migration |
| Timestream | Time-series database | IoT telemetry, metrics, DevOps time-series data |
| Service | Description | Key Use Case |
|---|---|---|
| VPC | Logically isolated virtual network with CIDR control | Foundation for all AWS workloads |
| CloudFront | Global CDN + edge computing | Static asset delivery, API acceleration, DDoS mitigation |
| Route 53 | DNS with health checks + 7 routing policies | Domain management, latency routing, failover |
| API Gateway | Managed REST/HTTP/WebSocket API frontend | API proxy for Lambda, ECS, HTTP integrations |
| ALB | Application (L7) load balancer, content-based routing | HTTP/HTTPS routing, microservices, ECS/EKS |
| NLB | Network (L4) load balancer, static IPs, ultra-low latency | TCP/UDP, PrivateLink endpoints, consistent IP |
| Transit Gateway | Hub-and-spoke network hub | Multi-VPC connectivity + hybrid cloud networking |
| Direct Connect | Dedicated private line to AWS (1G / 10G / 100G) | Consistent bandwidth, compliance, hybrid workloads |
| Service | Description | Key Use Case |
|---|---|---|
| Bedrock | Foundation model API: Claude, Llama, Titan, Mistral, Stable Diffusion | GenAI apps without managing model infrastructure |
| SageMaker | End-to-end ML platform: notebooks, training, inference, pipelines | Full MLOps lifecycle, custom model training |
| Bedrock Agents | Agentic workflows with tool use + memory | AI agents that call APIs and retrieve knowledge |
| Bedrock Knowledge Bases | Managed RAG with S3 + vector store | Serverless RAG, no embedding pipeline to maintain |
| Rekognition | Computer vision: faces, objects, text, moderation | Image/video analysis, content moderation |
| Textract | Document extraction: forms, tables, signatures (beyond OCR) | Form processing, mortgage docs, medical records |
| Comprehend | NLP: entities, sentiment, key phrases, custom classifiers | Text analytics pipelines, customer feedback |
| Transcribe | Speech-to-text with speaker diarization | Voice apps, call center transcription, captioning |
| Polly | Text-to-speech, 60+ voices, SSML support | Voice interfaces, e-learning narration |
| Translate | Neural machine translation, 75+ languages | Multilingual apps, real-time chat translation |
| Forecast | Time-series forecasting (AutoML) | Demand forecasting, inventory planning |
| Personalize | Recommendation engine (same algorithms as Amazon.com) | E-commerce, content streaming recommendations |
| Service | Description | Key Use Case |
|---|---|---|
| CloudFormation | AWS-native IaC, stateful stacks, changesets | Stack-based provisioning, drift detection |
| CDK | Code-first IaC (TypeScript, Python, Go, Java) | Developer-friendly IaC that compiles to CloudFormation |
| Systems Manager | Operational management: Run Command, Patch Manager, Session Manager | Patch automation, no-SSH access, config management |
| Secrets Manager | Secret storage with automatic rotation | DB passwords, API keys, OAuth tokens |
| Parameter Store | Config and secret storage, free tier | Non-rotating config, feature flags, env vars |
| CloudWatch | Metrics, logs, alarms, dashboards, Synthetics | Unified observability across all AWS services |
| X-Ray | Distributed tracing with service map | Microservice latency debugging, bottleneck detection |
| CloudTrail | API audit trail for all management events | Compliance, forensics, security investigation |
| Config | Resource inventory + compliance rules + remediation | Drift detection, CIS benchmark enforcement |
| Trusted Advisor | Best practice checks across 5 categories | Cost savings, security gaps, service limit warnings |
| Service | Description | Key Use Case |
|---|---|---|
| SQS | Managed queues (Standard + FIFO), DLQ support | Decoupling, work queues, async task offload |
| SNS | Pub/sub: push to SQS, Lambda, email, SMS, HTTP | Fan-out notifications, multi-consumer event broadcast |
| EventBridge | Serverless event bus with content-based routing rules | Event-driven microservices, SaaS integration (100+ sources) |
| Kinesis Data Streams | Real-time streaming, up to 365-day retention | Clickstream, log aggregation, IoT, replay capability |
| Kinesis Firehose | Streaming delivery to S3/Redshift/OpenSearch (near real-time) | Zero-code ingestion pipeline, automatic batching |
| MSK | Managed Apache Kafka, full Kafka API compatibility | Teams with existing Kafka workloads and tooling |
| Step Functions | Visual workflow orchestration (ASL state machine) | Multi-step processes, Saga pattern, human approval steps |
| Service | Description | Key Use Case |
|---|---|---|
| IAM | Identity and access management, policy evaluation | Every AWS API call authorization decision |
| IAM Identity Center | SSO + multi-account access management (successor to SSO) | Human access to many AWS accounts via one login |
| KMS | Managed encryption keys (AES-256, RSA, ECC) | Envelope encryption for S3, EBS, RDS, Secrets Manager |
| ACM | SSL/TLS certificate provisioning and auto-renewal | HTTPS for ALB, CloudFront, API Gateway |
| WAF | Web Application Firewall, managed rule groups | OWASP Top 10 protection, rate limiting, bot control |
| Shield | DDoS protection (Standard: free; Advanced: $3k/mo) | Layer 3/4 volumetric + Layer 7 application DDoS |
| GuardDuty | ML-based threat detection across CloudTrail/DNS/VPC flow logs | Unusual API patterns, credential exfiltration, crypto mining |
| Macie | S3 sensitive data discovery with ML classifiers | PII detection, credential leaks, compliance scanning |
| Inspector | Automated vulnerability scanning for EC2/ECR/Lambda | CVE detection, reachability analysis, SBOM |
| Security Hub | Centralized security findings, ASFF normalization | Multi-account security posture, compliance dashboard |
API Gateway handles HTTP requests, Lambda runs business logic, Bedrock provides the foundation model. Uses the current Messages API format.
// Lambda (Node.js) — Bedrock Messages API (Claude 3+)
import { BedrockRuntimeClient, InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';
const client = new BedrockRuntimeClient({ region: 'us-east-1' });
export const handler = async (event) => {
const { prompt } = JSON.parse(event.body);
const payload = {
anthropic_version: 'bedrock-2023-05-31',
max_tokens: 1024,
messages: [{ role: 'user', content: prompt }],
};
const command = new InvokeModelCommand({
modelId: 'anthropic.claude-3-5-sonnet-20241022-v2:0',
contentType: 'application/json',
accept: 'application/json',
body: JSON.stringify(payload),
});
const response = await client.send(command);
const result = JSON.parse(Buffer.from(response.body).toString());
return {
statusCode: 200,
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ text: result.content[0].text }),
};
};
S3 upload triggers Lambda via EventBridge (or direct S3 notification), processes the file, stores results in DynamoDB. Classic RAG ingestion pattern.
# SAM template — S3 → Lambda → DynamoDB
Resources:
DocumentProcessorFunction:
Type: AWS::Serverless::Function
Properties:
CodeUri: src/
Handler: processor.handler
Runtime: nodejs22.x
Timeout: 300
MemorySize: 1024
Policies:
- S3ReadPolicy:
BucketName: !Ref SourceBucket
- DynamoDBCrudPolicy:
TableName: !Ref MetadataTable
Environment:
Variables:
METADATA_TABLE: !Ref MetadataTable
Events:
FileUpload:
Type: S3
Properties:
Bucket: !Ref SourceBucket
Events: s3:ObjectCreated:*
Filter:
S3Key:
Rules:
- Name: suffix
Value: .pdf
// processor.handler — extract, embed, store
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';
import { DynamoDBClient, PutItemCommand } from '@aws-sdk/client-dynamodb';
const s3 = new S3Client();
const ddb = new DynamoDBClient();
export const handler = async (event) => {
for (const record of event.Records) {
const bucket = record.s3.bucket.name;
const key = decodeURIComponent(record.s3.object.key);
// 1. Fetch document from S3
const { Body } = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
const text = await Body.transformToString();
// 2. Store processing metadata in DynamoDB
await ddb.send(new PutItemCommand({
TableName: process.env.METADATA_TABLE,
Item: {
documentId: { S: key },
processedAt: { S: new Date().toISOString() },
status: { S: 'processed' },
sizeBytes: { N: String(record.s3.object.size) },
},
}));
}
};
Docker container deployed behind an ALB. ECS handles health checks, rolling deployments, and scaling. Secrets pulled from Secrets Manager at task startup.
FROM node:22-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 8080
USER node
CMD ["node", "server.js"]
{
"family": "ai-microservice",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "1024",
"memory": "2048",
"executionRoleArn": "arn:aws:iam::ACCOUNT:role/ecsTaskExecutionRole",
"taskRoleArn": "arn:aws:iam::ACCOUNT:role/ecsTaskRole",
"containerDefinitions": [{
"name": "app",
"image": "ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/ai-service:latest",
"portMappings": [{ "containerPort": 8080 }],
"secrets": [
{ "name": "DB_PASSWORD", "valueFrom": "arn:aws:secretsmanager:us-east-1:ACCOUNT:secret:prod/db-password" }
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/ai-microservice",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
}
}]
}
AWS handles core AI/ML and scalable backend services. Other clouds used for specific integrations. Key patterns:
Decoupled processing with guaranteed delivery semantics. The DLQ captures failed messages for later analysis or replay rather than silently dropping them.
API Gateway
|
v
Lambda (synchronous — validates, enqueues)
|
v
SQS Queue (VisibilityTimeout = 3× max processing time)
|
v
Lambda (async consumer — batchSize=10, bisectOnError=true)
| \
v v (on failure after maxReceiveCount)
DynamoDB SQS DLQ → CloudWatch Alarm → SNS alert
Key configuration details:
FunctionResponseTypes: [ReportBatchItemFailures] on the event source mapping so Lambda reports individual failed items — successfully processed messages are not retriedmaxReceiveCount on the source queue to 3–5 before sending to DLQApproximateNumberOfMessagesVisible > 0 for 5 minutes, page on-call| Criterion | Lambda | Fargate | EC2 |
|---|---|---|---|
| Max execution time | 15 minutes | Unlimited | Unlimited |
| Idle cost | $0 (scales to zero) | $0 (scales to zero) | Per-hour billing |
| Cold start latency | 200 ms – 10 s | 10 – 60 s | Minutes (or pre-warmed ASG) |
| State model | Stateless (ephemeral /tmp) | Container-level ephemeral storage | Full OS persistent storage |
| OS control | None | None | Full (AMI, kernel params) |
| Max memory | 10 GB | 120 GB | Instance-defined (up to TBs) |
| GPU support | No | Yes (p-family instances) | Yes (full GPU instance catalog) |
| Best for | Event-driven, <15 min, spiky traffic | Containerized APIs, long-running tasks | ML inference, OS control, steady-state high throughput |
Heuristic: Start with Lambda. Migrate to Fargate when you hit the 15-min limit or need more memory/CPU consistency. Migrate to EC2 when you need GPU, specific OS configuration, or the math says reserved instances are cheaper.
| Criterion | SQS | Kinesis Data Streams | EventBridge |
|---|---|---|---|
| Message ordering | Best-effort (FIFO queue: strict) | Per-shard strict ordering | None |
| Replay / retention | DLQ only, max 14 days | Up to 365 days | No replay |
| Consumer model | Competing consumers (one message → one consumer) | Multiple independent consumers per shard | Many rules → many targets |
| Throughput | Unlimited (standard); 3,000 msg/s per API call (FIFO) | 1 MB/s write per shard | Unlimited (soft limits apply) |
| Latency | Near real-time (polling) | Real-time (<1s) | Near real-time |
| Pricing model | Per request | Per shard-hour + per GB | Per event |
| Best for | Work queues, task offload, decoupling services | Real-time analytics, log aggregation, data replay | Event routing, SaaS integration, cross-service choreography |
Heuristic: Use SQS for task queues between services. Use Kinesis when you need multiple consumers reading the same stream independently, or need replay. Use EventBridge as the event bus when routing events between many services or reacting to AWS/SaaS events.
| Criterion | Aurora / RDS | DynamoDB |
|---|---|---|
| Schema | Defined schema, migrations required | Schemaless (per-item attributes) |
| Query flexibility | Full SQL: joins, aggregations, ad-hoc | Primary key + GSI only; no joins |
| Consistency | Strong (ACID transactions) | Eventual (default) or strong read (extra cost) |
| Scaling | Vertical + read replicas (horizontal limited) | Horizontal, unlimited scale |
| Latency | Low ms (with connection pooling + RDS Proxy) | Single-digit ms at any scale |
| Cost at scale | Predictable instance cost | Per request (can get expensive at high read/write) |
| Best for | Complex queries, reporting, existing SQL workloads | Known access patterns, >100K RPS, key-value/session |
Q: Explain how IAM works. How do you grant an EC2 instance permission to read from an S3 bucket?
A: IAM is the authorization fabric for every AWS API call. Every call has a principal (who is calling), an action (e.g., s3:GetObject), and a resource (the bucket ARN). For an EC2 instance, you never use user credentials — instead, create an IAM Role with a policy granting s3:GetObject on the target bucket, and attach that role as the instance profile. The EC2 metadata service (169.254.169.254) vends temporary STS credentials to the instance automatically. The application uses the SDK which reads these credentials transparently.
Q: When would you choose Aurora over DynamoDB, or vice versa? A: Choose Aurora when you need relational queries, joins, transactions (ACID), or complex reporting — essentially when you have a well-defined schema and multiple access patterns that benefit from SQL. Choose DynamoDB for massively scalable, predictably low-latency workloads with known, simple access patterns: key-value stores, session management, shopping carts, event sourcing. DynamoDB's trade-off is that you must design your access patterns upfront (single-table design). If you need ad-hoc querying over DynamoDB data, export to S3 and query with Athena.
Q: You have a Lambda function timing out intermittently. Walk me through debugging.
A: Structured approach: (1) CloudWatch Logs — filter for the specific request ID showing the timeout; look for the last logged line before timeout to identify where it hangs. (2) Lambda metrics — Duration p99 vs configured timeout, Throttles, ConcurrentExecutions. (3) X-Ray trace — find the specific slow segment; is it a downstream HTTP call, a database query, or CPU? (4) Check if the Lambda is inside a VPC and hitting a cold start ENI attachment delay. (5) Increase MemorySize — Lambda CPU is proportional to memory, so more memory means faster CPU-bound work. (6) Verify downstream dependencies: check RDS connection pool exhaustion (add RDS Proxy), check DynamoDB provisioned capacity vs consumed capacity.
Q: Design a system to process thousands of PDFs, extract text, and make it queryable. A: Classic event-driven serverless RAG pipeline: (1) Users upload PDFs to S3. (2) S3 event triggers a Lambda or Step Functions workflow per file. (3) Lambda calls Textract for high-accuracy extraction (forms, tables, signatures — not just raw text). (4) Text is chunked (e.g., 512-token overlapping windows) and sent to Bedrock (Titan Embeddings or Cohere) to produce vectors. (5) Vectors + metadata stored in OpenSearch (k-NN index) or pgvector on Aurora. (6) Query API: API Gateway → Lambda → embed the query → ANN search → retrieve top-k chunks → send context + question to Bedrock (Claude) → return answer. For scale: use SQS between upload and processing to handle burst, Dead Letter Queue for failed documents, Step Functions for retry/branching logic.
Q: When would you NOT use Lambda? A: Avoid Lambda for: (1) Processes running longer than 15 minutes (use Fargate or ECS). (2) Workloads requiring GPU (EC2 GPU instances). (3) Steady, predictable high throughput where a reserved EC2 or Fargate service is cheaper (e.g., >1M invocations/day of a heavy function). (4) Applications requiring a persistent in-memory cache (Lambda instances are recycled; use ElastiCache). (5) Tasks needing specific OS-level configuration, kernel modules, or direct hardware access. (6) WebSocket servers requiring persistent connections (use API Gateway WebSocket + connection store, or ECS).
Q: How does AWS billing work? How do you control costs?
A: AWS billing is pay-per-use: EC2 charges per second (after first minute), Lambda per 1ms of compute, S3 per GB stored + per request. Key levers for cost control: (1) Compute savings — Savings Plans (1 or 3 year commitment, up to 66% off) cover EC2, Lambda, and Fargate regardless of instance type; Spot Instances for fault-tolerant batch workloads (up to 90% off). (2) Graviton — ARM-based instances up to 40% cheaper than x86 for same workload. (3) Right-sizing — use Compute Optimizer recommendations; turn off dev environments overnight with Instance Scheduler. (4) Storage tiering — S3 Intelligent-Tiering, EBS gp3 vs io2 selection, lifecycle rules to Glacier. (5) FinOps practices — mandatory project, env, team tags on all resources; Cost Explorer anomaly detection alerts; per-team budget alarms via AWS Budgets. (6) Architecture — cache aggressively with ElastiCache/DAX to reduce DB calls; use SQS batching; prefer Lambda@Edge for simple transforms over full Lambda invocations.
Q: Explain the shared responsibility model. A: AWS is responsible for security of the cloud — the physical infrastructure, data center facilities, network hardware, hypervisors, and the managed services themselves (e.g., the RDS database engine patching). The customer is responsible for security in the cloud — everything they put on top: IAM policies, encryption configuration, security group rules, patching the OS on EC2 instances, application-layer security, and data classification. The boundary shifts depending on the service: with Lambda you don't patch an OS (AWS responsibility), but with EC2 you do (your responsibility). This model means a misconfigured S3 bucket is always the customer's fault, not AWS's.
Q: How would you architect a greenfield application? Walk me through your decision process. A: Framework: (1) Traffic pattern — spiky/event-driven → Lambda; steady/containerized → Fargate; ML inference → EC2 GPU or SageMaker. (2) Data model — relational with complex queries → Aurora; known key-value patterns, massive scale → DynamoDB; analytics → Redshift + Athena over S3. (3) HA/DR — Multi-AZ by default for all stateful resources; decide if you need multi-region (RTO/RPO requirements). (4) Security baseline — VPC with private subnets for all backend resources; Secrets Manager for credentials; CloudTrail + GuardDuty enabled from day one; WAF in front of public APIs. (5) Observability — structured JSON logs to CloudWatch with correlation IDs; X-Ray tracing; dashboards and alarms before go-live. (6) IaC — CDK or Terraform from day one; no manual console changes in production. (7) Cost — tag everything; enable Cost Explorer; set budget alarms.
Q: How do you manage secrets in AWS?
A: Never hardcode or commit secrets. Use Secrets Manager for anything that needs automatic rotation — it rotates RDS passwords natively and supports custom Lambda rotation for any other secret. Use Parameter Store (SecureString type, KMS-encrypted) for non-rotating config. In Lambda: reference secrets via SDK calls in the initialization code (outside the handler), cached for the lifetime of the execution environment. In ECS: reference via secrets in the task definition — the agent fetches and injects them as environment variables at task start. In EKS: use the Secrets Store CSI Driver to mount Secrets Manager values as files. Access to secrets is controlled via IAM — the role or task role must have secretsmanager:GetSecretValue permission on the specific ARN.
Q: Explain AZ outage vs Region outage. How do you design for each?
A: An AZ outage is a failure within one data center complex — occurs a few times per year in large regions. Design for it by distributing resources across a minimum of 2 (ideally 3) AZs: Multi-AZ RDS/Aurora, Auto Scaling Groups with AZRebalance, ALB/NLB across AZs, ECS task placement spread by AZ. A Region outage is a full geographic failure — extremely rare but catastrophic without preparation. Design with a DR strategy keyed to your RTO/RPO: pilot light (replicated data, minimal infra), warm standby (scaled-down live replica), or active-active (Route 53 with health checks routing traffic across two regions). Data replication: S3 Cross-Region Replication, DynamoDB Global Tables, Aurora Global Database (replication lag <1s). Runbooks for DNS failover should be tested quarterly.
As a Senior Full Stack AI Engineer in a multi-cloud environment, AWS is the core platform for AI/ML workloads and scalable backend services.