Lesson 1 of 5 in Appendix · Other Clouds & Platforms

Amazon Web Services

Appendix · Other Clouds & Platforms · Intermediate · ~26 min read

The 30-Second Pitch

AWS is the market-leading cloud platform from Amazon: 200+ services spanning IaaS, PaaS, and SaaS, deployed across 33 geographic Regions, 105 Availability Zones, and 600+ Points of Presence (figures as of 2024; AWS updates them regularly). It replaces up-front capital expenditure and much of the operational overhead of running computing resources with on-demand, pay-as-you-go pricing. Teams pick AWS for the breadth of its service catalog, its mature ecosystem, its AI/ML offerings (Bedrock, SageMaker), and the serverless and event-driven primitives that let them ship without managing hardware. This hub links to 19 deep-dive articles covering every major domain; use them to go from orientation to expert depth on any topic.


Deep-Dive Knowledge Base

Foundation & Security

  • IAM & Security — IAM principals, policy evaluation, STS, AssumeRole, permission boundaries, SCPs
  • VPC & Networking — VPC, subnets, security groups, NACLs, NAT Gateway, Transit Gateway, Direct Connect
  • Security Services — WAF, Shield, Macie, Inspector v2, Network Firewall, Security Hub, centralized findings

Compute

  • Lambda & Serverless — Firecracker microVMs, cold starts, event sources, Lambda extensions, Powertools for AWS
  • Compute & Containers — EC2, ECS, EKS, Fargate, App Runner, Graviton processors, Auto Scaling

Storage & Data

  • Storage & S3 — S3 storage classes, lifecycle policies, Object Lambda, EBS, EFS, FSx families
  • DynamoDB & Data Services — Single-table design, GSI/LSI, DynamoDB Streams, DAX, Global Tables, on-demand capacity
  • Databases: RDS & Aurora — RDS engines, Aurora architecture, Aurora Serverless v2, ElastiCache, RDS Proxy
  • Data Analytics — Athena, Redshift, Glue, Lake Formation, EMR, OpenSearch, Kinesis analytics

Integration & Messaging

  • Messaging & Events — SQS, SNS, EventBridge, Kinesis, MSK, fan-out patterns, DLQ handling
  • Step Functions — Amazon States Language, Standard vs Express, SDK integrations, Saga pattern
  • API Gateway & Networking — REST/HTTP/WebSocket APIs, authorizers, throttling, VPC Link, usage plans

Delivery & Operations

  • CloudFront & Route 53 — CDN, edge computing, Lambda@Edge, CloudFront Functions, DNS routing policies
  • CI/CD & DevOps — CodePipeline, CodeBuild, CodeDeploy, CDK Pipelines, OIDC trust with GitHub Actions
  • Developer Tools — CDK advanced patterns, SAM, Projen, AppConfig feature flags, CodeCatalyst

AI/ML & Observability

  • AI/ML Services — Bedrock, SageMaker, Bedrock Agents, Knowledge Bases, Rekognition, Textract
  • Observability — CloudWatch Metrics/Logs/Alarms, X-Ray, CloudTrail, Config, correlation IDs, dashboards

Architecture & Cost

  • Architecture Patterns — Well-Architected Framework, all 6 pillars, multi-region strategies, disaster recovery tiers
  • Cost Optimization — Savings Plans, Spot Instances, Graviton price/perf, right-sizing, FinOps tagging strategy

How It Actually Works

The core mental model is global infrastructure composed of Regions, Availability Zones, and edge locations. A Region is a geographic area (e.g., us-east-1). Each Region contains multiple physically isolated AZs, each made up of one or more discrete data centers with independent power, cooling, and networking, and connected to the other AZs in the Region by low-latency fiber. Edge locations (600+) serve CloudFront CDN and Route 53 DNS globally.

Infrastructure numbers worth knowing:

  • 33 Regions (with more announced)
  • 105 Availability Zones
  • 600+ Points of Presence (CloudFront edge + regional caches)
  • 200+ distinct services

Key layers:

  1. Compute: EC2 (virtual machines), Lambda (serverless, Firecracker microVMs), ECS/EKS (containers)
  2. Storage: S3 (object, 11-nines durability), EBS (block, for EC2), EFS (shared NFS)
  3. Database: RDS (managed relational), DynamoDB (managed NoSQL), Aurora (high-performance cloud-native)
  4. Networking: VPC (private virtual network), CloudFront (CDN), Route 53 (DNS), API Gateway
  5. AI/ML: Bedrock (foundation model API), SageMaker (end-to-end ML platform), 12+ purpose-built AI services

Control Plane vs Data Plane:

[User / App / CI]
        |
        | API call (HTTPS + SigV4 signing)
        v
[AWS Global Endpoint / Control Plane]
        |
        v
[Authentication & Authorization (IAM)]
     |          |
     | Allow    | Explicit Deny → rejected immediately
     v
[Command dispatched to Data Plane — specific Region]
        |
        v
[Resource: EC2 instance / S3 object / Lambda invocation]

The control plane handles API calls (management operations): creating/deleting/configuring resources. The data plane handles the actual work: serving requests to your EC2 instances, returning S3 objects, executing Lambda functions. AWS designs data planes for higher availability than control planes — your running workloads survive brief control plane disruptions.

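IAM, the authorization fabric: Every API call has a principal (who) calling an action (what) on a resource (which ARN). IAM evaluates the applicable policies in order: an explicit Deny in any policy wins, then SCPs, then resource-based policies, then identity-based policies (followed by permission boundaries and session policies where present). SCPs and permission boundaries only limit what can be allowed; they never grant access. The result is an Allow, an explicit Deny, or, when no policy allows the request, an implicit Deny. See IAM & Security for the full evaluation logic.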
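
The evaluation order can be sketched as a small function. This is a deliberately simplified, single-account model: the statement shape (`effect`, `action`, `resource`) and exact-string matching are assumptions for illustration, and real IAM also handles wildcards, conditions, permission boundaries, and session policies.

```javascript
// Simplified single-account policy evaluation (illustrative only).
// Each policy is reduced to statements of the form { effect, action, resource }.
function statementMatches(statement, request) {
    return statement.action === request.action && statement.resource === request.resource;
}

function evaluate(request, { scps, identityPolicies, resourcePolicies }) {
    const all = [...scps, ...identityPolicies, ...resourcePolicies];

    // 1. An explicit Deny in any policy ends evaluation.
    if (all.some((s) => s.effect === 'Deny' && statementMatches(s, request))) {
        return 'ExplicitDeny';
    }

    // 2. SCPs only filter. Assumption: an empty list models an account
    //    that is not governed by AWS Organizations.
    const scpAllows = scps.length === 0
        || scps.some((s) => s.effect === 'Allow' && statementMatches(s, request));
    if (!scpAllows) return 'ImplicitDeny';

    // 3. Within one account, an Allow from a resource-based or an
    //    identity-based policy is sufficient.
    const granted = [...resourcePolicies, ...identityPolicies]
        .some((s) => s.effect === 'Allow' && statementMatches(s, request));
    return granted ? 'Allow' : 'ImplicitDeny';
}
```

With no matching Allow anywhere, the function falls through to `ImplicitDeny`, which mirrors IAM's default-deny behaviour.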

Event-Driven Architecture: Many services emit events natively — S3 object created, DynamoDB record changed, SQS message received. EventBridge routes these to targets (Lambda, Step Functions, SQS, HTTP endpoints) using rules. This loose coupling is the dominant pattern for scalable, cost-efficient architectures on AWS.
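
To make rule-based routing concrete, here is a minimal sketch of how an EventBridge-style event pattern could be matched against an event. It supports only nested objects and arrays of exact values; the real service supports many more operators (prefix, numeric, anything-but, and so on). The bucket name and the sample event are hypothetical.

```javascript
// Simplified matcher for EventBridge-style event patterns (illustrative only).
function matchesPattern(pattern, event) {
    return Object.entries(pattern).every(([key, expected]) => {
        const actual = event == null ? undefined : event[key];
        if (Array.isArray(expected)) {
            // A field matches when its value equals any of the listed values.
            return expected.includes(actual);
        }
        // A nested pattern object must match the nested event object.
        return typeof actual === 'object' && actual !== null && matchesPattern(expected, actual);
    });
}

// Hypothetical rule: react to objects created in one bucket.
const rule = {
    source: ['aws.s3'],
    'detail-type': ['Object Created'],
    detail: { bucket: { name: ['demo-uploads'] } },
};

const sampleEvent = {
    source: 'aws.s3',
    'detail-type': 'Object Created',
    detail: { bucket: { name: 'demo-uploads' }, object: { key: 'report.pdf', size: 1024 } },
};

console.log(matchesPattern(rule, sampleEvent)); // true
```

Fields that the pattern does not mention (such as `detail.object` here) are ignored, which is what lets producers add fields without breaking existing rules.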


AWS Service Catalog

Quick reference for interviews and architecture decisions. Deep-dive articles linked from each section above.

Compute

| Service | Description | Key Use Case |
| --- | --- | --- |
| EC2 | Virtual machines, 700+ instance types | OS-level control, ML inference, legacy lift-and-shift |
| Lambda | Serverless functions up to 15 min, 10 GB memory | Event-driven APIs, glue code, async processing |
| ECS | AWS-native container orchestration | Docker microservices, simpler Kubernetes alternative |
| EKS | Managed Kubernetes control plane | Teams already on K8s, multi-cloud portability |
| Fargate | Serverless container runtime for ECS/EKS | No EC2 node management; pay-per-task |
| App Runner | Container → URL in minutes | Simplest container hosting, no infra knowledge needed |
| AWS Batch | Managed batch job queues | ML training jobs, simulations, large-scale ETL |
| Lightsail | Simplified VMs with fixed pricing | Simple web apps, dev/test, low traffic sites |

Storage

| Service | Description | Key Use Case |
| --- | --- | --- |
| S3 | Object storage, 11-nines durability, unlimited scale | Data lakes, static assets, backups, ML training data |
| EBS | Block storage volumes attached to EC2 | Database storage, OS volumes (gp3 default) |
| EFS | Managed NFS file system, auto-scaling | Shared storage across multiple EC2 or containers |
| FSx for Lustre | High-performance parallel file system | ML training, HPC, needs POSIX + high throughput |
| FSx for Windows | Managed Windows File Server | Windows workloads, SMB protocol, Active Directory |
| S3 Glacier | Archive tiers (Instant/Flexible/Deep Archive) | Long-term compliance, infrequently accessed data |

Database

| Service | Description | Key Use Case |
| --- | --- | --- |
| RDS | Managed relational: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server | Traditional RDBMS with managed patching/backups |
| Aurora | Cloud-native relational, storage auto-scales to 128 TB | High-performance MySQL/PostgreSQL, 5× throughput |
| Aurora Serverless v2 | Fine-grained ACU auto-scaling | Variable or unpredictable relational workloads |
| DynamoDB | Managed NoSQL, single-digit ms at any scale | Key-value, document, high-throughput microservices |
| ElastiCache | Managed Redis or Memcached | Session store, rate limiting, DB query cache |
| Redshift | Columnar data warehouse, Redshift Serverless option | Analytics, BI, large-scale SQL on structured data |
| Neptune | Managed graph database (Gremlin, SPARQL, openCypher) | Social graphs, fraud detection, knowledge graphs |
| DocumentDB | MongoDB-compatible document database | Document workloads, MongoDB migration |
| Timestream | Time-series database | IoT telemetry, metrics, DevOps time-series data |

Networking

| Service | Description | Key Use Case |
| --- | --- | --- |
| VPC | Logically isolated virtual network with CIDR control | Foundation for all AWS workloads |
| CloudFront | Global CDN + edge computing | Static asset delivery, API acceleration, DDoS mitigation |
| Route 53 | DNS with health checks + 7 routing policies | Domain management, latency routing, failover |
| API Gateway | Managed REST/HTTP/WebSocket API frontend | API proxy for Lambda, ECS, HTTP integrations |
| ALB | Application (L7) load balancer, content-based routing | HTTP/HTTPS routing, microservices, ECS/EKS |
| NLB | Network (L4) load balancer, static IPs, ultra-low latency | TCP/UDP, PrivateLink endpoints, consistent IP |
| Transit Gateway | Hub-and-spoke network hub | Multi-VPC connectivity + hybrid cloud networking |
| Direct Connect | Dedicated private line to AWS (1G / 10G / 100G) | Consistent bandwidth, compliance, hybrid workloads |

AI / ML

| Service | Description | Key Use Case |
| --- | --- | --- |
| Bedrock | Foundation model API: Claude, Llama, Titan, Mistral, Stable Diffusion | GenAI apps without managing model infrastructure |
| SageMaker | End-to-end ML platform: notebooks, training, inference, pipelines | Full MLOps lifecycle, custom model training |
| Bedrock Agents | Agentic workflows with tool use + memory | AI agents that call APIs and retrieve knowledge |
| Bedrock Knowledge Bases | Managed RAG with S3 + vector store | Serverless RAG, no embedding pipeline to maintain |
| Rekognition | Computer vision: faces, objects, text, moderation | Image/video analysis, content moderation |
| Textract | Document extraction: forms, tables, signatures (beyond OCR) | Form processing, mortgage docs, medical records |
| Comprehend | NLP: entities, sentiment, key phrases, custom classifiers | Text analytics pipelines, customer feedback |
| Transcribe | Speech-to-text with speaker diarization | Voice apps, call center transcription, captioning |
| Polly | Text-to-speech, 60+ voices, SSML support | Voice interfaces, e-learning narration |
| Translate | Neural machine translation, 75+ languages | Multilingual apps, real-time chat translation |
| Forecast | Time-series forecasting (AutoML) | Demand forecasting, inventory planning |
| Personalize | Recommendation engine (same algorithms as Amazon.com) | E-commerce, content streaming recommendations |

DevOps & Management

| Service | Description | Key Use Case |
| --- | --- | --- |
| CloudFormation | AWS-native IaC, stateful stacks, changesets | Stack-based provisioning, drift detection |
| CDK | Code-first IaC (TypeScript, Python, Go, Java) | Developer-friendly IaC that compiles to CloudFormation |
| Systems Manager | Operational management: Run Command, Patch Manager, Session Manager | Patch automation, no-SSH access, config management |
| Secrets Manager | Secret storage with automatic rotation | DB passwords, API keys, OAuth tokens |
| Parameter Store | Config and secret storage, free tier | Non-rotating config, feature flags, env vars |
| CloudWatch | Metrics, logs, alarms, dashboards, Synthetics | Unified observability across all AWS services |
| X-Ray | Distributed tracing with service map | Microservice latency debugging, bottleneck detection |
| CloudTrail | API audit trail for all management events | Compliance, forensics, security investigation |
| Config | Resource inventory + compliance rules + remediation | Drift detection, CIS benchmark enforcement |
| Trusted Advisor | Best practice checks across 5 categories | Cost savings, security gaps, service limit warnings |

Integration & Messaging

| Service | Description | Key Use Case |
| --- | --- | --- |
| SQS | Managed queues (Standard + FIFO), DLQ support | Decoupling, work queues, async task offload |
| SNS | Pub/sub: push to SQS, Lambda, email, SMS, HTTP | Fan-out notifications, multi-consumer event broadcast |
| EventBridge | Serverless event bus with content-based routing rules | Event-driven microservices, SaaS integration (100+ sources) |
| Kinesis Data Streams | Real-time streaming, up to 365-day retention | Clickstream, log aggregation, IoT, replay capability |
| Kinesis Firehose | Streaming delivery to S3/Redshift/OpenSearch (near real-time) | Zero-code ingestion pipeline, automatic batching |
| MSK | Managed Apache Kafka, full Kafka API compatibility | Teams with existing Kafka workloads and tooling |
| Step Functions | Visual workflow orchestration (ASL state machine) | Multi-step processes, Saga pattern, human approval steps |

Security

| Service | Description | Key Use Case |
| --- | --- | --- |
| IAM | Identity and access management, policy evaluation | Every AWS API call authorization decision |
| IAM Identity Center | SSO + multi-account access management (successor to AWS SSO) | Human access to many AWS accounts via one login |
| KMS | Managed encryption keys (AES-256, RSA, ECC) | Envelope encryption for S3, EBS, RDS, Secrets Manager |
| ACM | SSL/TLS certificate provisioning and auto-renewal | HTTPS for ALB, CloudFront, API Gateway |
| WAF | Web Application Firewall, managed rule groups | OWASP Top 10 protection, rate limiting, bot control |
| Shield | DDoS protection (Standard: free; Advanced: $3k/mo) | Layer 3/4 volumetric + Layer 7 application DDoS |
| GuardDuty | ML-based threat detection across CloudTrail/DNS/VPC flow logs | Unusual API patterns, credential exfiltration, crypto mining |
| Macie | S3 sensitive data discovery with ML classifiers | PII detection, credential leaks, compliance scanning |
| Inspector | Automated vulnerability scanning for EC2/ECR/Lambda | CVE detection, reachability analysis, SBOM |
| Security Hub | Centralized security findings, ASFF normalization | Multi-account security posture, compliance dashboard |

Patterns You Should Know

1. Serverless API Backend with AI Integration

API Gateway handles HTTP requests, Lambda runs business logic, and Bedrock provides the foundation model. The request body uses the Anthropic Messages API format.

javascript
// Lambda (Node.js) — Bedrock Messages API (Claude 3+)
import { BedrockRuntimeClient, InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';

const client = new BedrockRuntimeClient({ region: 'us-east-1' });

export const handler = async (event) => {
    const { prompt } = JSON.parse(event.body);

    const payload = {
        anthropic_version: 'bedrock-2023-05-31',
        max_tokens: 1024,
        messages: [{ role: 'user', content: prompt }],
    };

    const command = new InvokeModelCommand({
        modelId: 'anthropic.claude-3-5-sonnet-20241022-v2:0',
        contentType: 'application/json',
        accept: 'application/json',
        body: JSON.stringify(payload),
    });

    const response = await client.send(command);
    const result = JSON.parse(Buffer.from(response.body).toString());

    return {
        statusCode: 200,
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ text: result.content[0].text }),
    };
};

2. Event-Driven Data Processing Pipeline

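An S3 upload triggers Lambda through a direct S3 event notification (the approach the SAM template here uses; routing the event through EventBridge is the alternative). The function processes the file and stores results in DynamoDB. Classic RAG ingestion pattern.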

yaml
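# SAM template: S3 → Lambda → DynamoDB
# SourceBucket and MetadataTable are assumed to be declared elsewhere in this template
# (an S3 event source requires the bucket to be defined in the same template)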
Resources:
  DocumentProcessorFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/
      Handler: processor.handler
      Runtime: nodejs22.x
      Timeout: 300
      MemorySize: 1024
      Policies:
        - S3ReadPolicy:
            BucketName: !Ref SourceBucket
        - DynamoDBCrudPolicy:
            TableName: !Ref MetadataTable
      Environment:
        Variables:
          METADATA_TABLE: !Ref MetadataTable
      Events:
        FileUpload:
          Type: S3
          Properties:
            Bucket: !Ref SourceBucket
            Events: s3:ObjectCreated:*
            Filter:
              S3Key:
                Rules:
                  - Name: suffix
                    Value: .pdf

javascript
// processor.handler: fetch the uploaded document and record processing metadata
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';
import { DynamoDBClient, PutItemCommand } from '@aws-sdk/client-dynamodb';

const s3 = new S3Client();
const ddb = new DynamoDBClient();

export const handler = async (event) => {
    for (const record of event.Records) {
        const bucket = record.s3.bucket.name;
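        // S3 event keys are URL-encoded, with spaces sent as '+'
        const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

        // 1. Fetch document from S3 (PDFs are binary, so read bytes rather than a string)
        const { Body } = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
        const bytes = await Body.transformToByteArray();
        // Text extraction and embedding would consume `bytes` here (omitted in this example)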

        // 2. Store processing metadata in DynamoDB
        await ddb.send(new PutItemCommand({
            TableName: process.env.METADATA_TABLE,
            Item: {
                documentId: { S: key },
                processedAt: { S: new Date().toISOString() },
                status: { S: 'processed' },
                sizeBytes: { N: String(record.s3.object.size) },
            },
        }));
    }
};

3. Containerized Microservice on ECS Fargate

Docker container deployed behind an ALB. ECS handles health checks, rolling deployments, and scaling. Secrets pulled from Secrets Manager at task startup.

dockerfile
FROM node:22-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 8080
USER node
CMD ["node", "server.js"]

json
{
  "family": "ai-microservice",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "executionRoleArn": "arn:aws:iam::ACCOUNT:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::ACCOUNT:role/ecsTaskRole",
  "containerDefinitions": [{
    "name": "app",
    "image": "ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/ai-service:latest",
    "portMappings": [{ "containerPort": 8080 }],
    "secrets": [
      { "name": "DB_PASSWORD", "valueFrom": "arn:aws:secretsmanager:us-east-1:ACCOUNT:secret:prod/db-password" }
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/ai-microservice",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "ecs"
      }
    }
  }]
}

4. Multi-Cloud Strategy with AWS as Primary

AWS handles core AI/ML and scalable backend services. Other clouds used for specific integrations. Key patterns:

  • Identity federation: IAM Identity Center with SAML 2.0 grants corporate identities access to AWS accounts without IAM users
  • Network connectivity: AWS Direct Connect or Site-to-Site VPN for on-prem/hybrid; VPC peering or Transit Gateway for multi-account
  • IaC portability: Terraform (or Pulumi) manages resources across AWS, Azure, and GCP from a single codebase
  • Cloud-native services: Use managed Kubernetes (EKS/AKS/GKE) rather than cloud-proprietary orchestration to preserve portability
  • Strategy: AWS for AI/ML leadership (SageMaker, Bedrock) and global infrastructure; Azure for Microsoft 365/AD integration; GCP for BigQuery analytics or Vertex AI models

5. Event-Driven Microservices with Dead-Letter Queue

Decoupled processing with at-least-once delivery. The DLQ captures messages that keep failing so they can be analyzed or replayed later rather than silently dropped.

API Gateway
     |
     v
Lambda (synchronous — validates, enqueues)
     |
     v
SQS Queue (VisibilityTimeout ≥ 6× consumer function timeout)
     |
     v
Lambda (poll-based consumer: batchSize=10, ReportBatchItemFailures)
     |          \
     v           v (on failure after maxReceiveCount)
DynamoDB      SQS DLQ → CloudWatch Alarm → SNS alert

Key configuration details:

  • Set FunctionResponseTypes: [ReportBatchItemFailures] on the event source mapping so Lambda reports individual failed items — successfully processed messages are not retried
  • Set maxReceiveCount on the source queue to 3–5 before sending to DLQ
  • DLQ alarm: if ApproximateNumberOfMessagesVisible > 0 for 5 minutes, page on-call
  • For FIFO queues: message group ID controls ordering; DLQ must also be FIFO type
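
A minimal sketch of a consumer that reports partial batch failures follows. It assumes ReportBatchItemFailures is enabled on the event source mapping; `processMessage` is a hypothetical business-logic function, injected here so the example runs without AWS dependencies.

```javascript
// SQS-triggered handler returning a partial batch response (illustrative sketch).
async function handleBatch(event, processMessage) {
    const batchItemFailures = [];

    for (const record of event.Records) {
        try {
            await processMessage(JSON.parse(record.body));
        } catch (err) {
            // Only the failed message IDs are reported; Lambda deletes the rest
            // of the batch, and the failed ones become visible again for retry.
            batchItemFailures.push({ itemIdentifier: record.messageId });
        }
    }

    return { batchItemFailures };
}
```

Returning an empty `batchItemFailures` array signals that the whole batch succeeded. A message that fails repeatedly is moved to the DLQ once its receive count exceeds maxReceiveCount.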

Key Decision Frameworks

Lambda vs Fargate vs EC2

| Criterion | Lambda | Fargate | EC2 |
| --- | --- | --- | --- |
| Max execution time | 15 minutes | Unlimited | Unlimited |
| Idle cost | $0 (scales to zero) | Billed while tasks run; $0 only when the task count is 0 | Billed while instances run (per second for most operating systems) |
| Cold start latency | 200 ms – 10 s | 10 – 60 s | Minutes (or pre-warmed ASG) |
| State model | Stateless (ephemeral /tmp) | Container-level ephemeral storage | Full OS persistent storage |
| OS control | None | None | Full (AMI, kernel params) |
| Max memory | 10 GB | 120 GB | Instance-defined (up to TBs) |
| GPU support | No | No (run GPU containers on ECS or EKS with EC2 GPU instances) | Yes (full GPU instance catalog) |
| Best for | Event-driven, <15 min, spiky traffic | Containerized APIs, long-running tasks | ML inference, OS control, steady-state high throughput |

Heuristic: Start with Lambda. Migrate to Fargate when you hit the 15-min limit or need more memory/CPU consistency. Migrate to EC2 when you need GPU, specific OS configuration, or the math says reserved instances are cheaper.
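
The heuristic can be written down as a small decision function. The field names and the `reservedIsCheaper` flag are illustrative assumptions; the thresholds are the Lambda limits from the table (15 minutes, 10 GB).

```javascript
// Encodes the "start with Lambda" heuristic (illustrative sketch).
function chooseCompute({
    maxRuntimeMinutes,
    memoryGb,
    needsGpu = false,
    needsOsControl = false,
    reservedIsCheaper = false, // assumption: result of your own cost comparison
}) {
    if (needsGpu || needsOsControl || reservedIsCheaper) return 'EC2';
    if (maxRuntimeMinutes > 15 || memoryGb > 10) return 'Fargate';
    return 'Lambda';
}

console.log(chooseCompute({ maxRuntimeMinutes: 2, memoryGb: 1 }));   // Lambda
console.log(chooseCompute({ maxRuntimeMinutes: 45, memoryGb: 4 }));  // Fargate
console.log(chooseCompute({ maxRuntimeMinutes: 5, memoryGb: 8, needsGpu: true })); // EC2
```

A real decision would also weigh cold-start tolerance and team familiarity, which this sketch leaves out.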

SQS vs Kinesis vs EventBridge

| Criterion | SQS | Kinesis Data Streams | EventBridge |
| --- | --- | --- | --- |
| Message ordering | Best-effort (FIFO queue: strict per message group) | Per-shard strict ordering | None |
| Replay / retention | No replay once a message is deleted; retention up to 14 days; DLQ redrive | Up to 365 days | Opt-in archive and replay |
| Consumer model | Competing consumers (one message → one consumer) | Multiple independent consumers per shard | Many rules → many targets |
| Throughput | Nearly unlimited (standard); FIFO: 300 API calls/s per action, up to 3,000 msg/s with batching (more in high-throughput mode) | 1 MB/s write per shard | Bounded by adjustable per-Region quotas |
| Latency | Near real-time (polling) | Real-time (<1s) | Near real-time |
| Pricing model | Per request | Per shard-hour + per GB | Per event |
| Best for | Work queues, task offload, decoupling services | Real-time analytics, log aggregation, data replay | Event routing, SaaS integration, cross-service choreography |

Heuristic: Use SQS for task queues between services. Use Kinesis when you need multiple consumers reading the same stream independently, or need replay. Use EventBridge as the event bus when routing events between many services or reacting to AWS/SaaS events.
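
When Kinesis is the choice, shard count follows from the per-shard write limits. The sketch below assumes provisioned capacity mode and uses the 1 MB/s limit from the table plus the 1,000 records/s per-shard write limit; the function and parameter names are illustrative.

```javascript
// Rough shard estimate for a provisioned-mode Kinesis data stream (illustrative).
const SHARD_WRITE_MB_PER_SEC = 1;
const SHARD_WRITE_RECORDS_PER_SEC = 1000;

function estimateShards({ recordsPerSecond, avgRecordKb }) {
    const mbPerSecond = (recordsPerSecond * avgRecordKb) / 1024;
    const byBytes = Math.ceil(mbPerSecond / SHARD_WRITE_MB_PER_SEC);
    const byRecords = Math.ceil(recordsPerSecond / SHARD_WRITE_RECORDS_PER_SEC);
    // Whichever limit is hit first determines the shard count.
    return Math.max(1, byBytes, byRecords);
}

console.log(estimateShards({ recordsPerSecond: 5000, avgRecordKb: 2 })); // 10
```

In this example the byte limit dominates (about 9.8 MB/s needs 10 shards), while many tiny records would instead be bounded by the records-per-second limit. The estimate ignores hot partition keys and headroom for traffic spikes.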

SQL vs NoSQL on AWS

| Criterion | Aurora / RDS | DynamoDB |
| --- | --- | --- |
| Schema | Defined schema, migrations required | Schemaless (per-item attributes) |
| Query flexibility | Full SQL: joins, aggregations, ad-hoc | Key-based queries on the table or a GSI/LSI (plus costly scans); no joins |
| Consistency | Strong (ACID transactions) | Eventually consistent reads by default; strongly consistent reads use twice the read capacity; ACID transactions via the transaction APIs |
| Scaling | Vertical + read replicas (horizontal limited) | Horizontal, effectively unlimited |
| Latency | Low ms (with connection pooling + RDS Proxy) | Single-digit ms at any scale |
| Cost at scale | Predictable instance cost | Per request (can get expensive at high read/write) |
| Best for | Complex queries, reporting, existing SQL workloads | Known access patterns, >100K RPS, key-value/session |

What Interviewers Actually Ask

Q: Explain how IAM works. How do you grant an EC2 instance permission to read from an S3 bucket? A: IAM is the authorization fabric for every AWS API call. Every call has a principal (who is calling), an action (e.g., s3:GetObject), and a resource (the bucket or object ARN). For an EC2 instance, you never use long-lived user credentials. Instead, create an IAM role whose policy grants s3:GetObject on the objects in the target bucket, and attach that role to the instance through an instance profile. The instance metadata service (169.254.169.254, preferably IMDSv2) vends temporary STS credentials to the instance automatically and refreshes them before they expire. The application uses the SDK, which reads these credentials transparently.
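
As an illustration of that answer, the two policy documents behind such a role might look like the following. The function name and the bucket name are hypothetical; the documents are built as plain objects so they can be serialized into whatever IaC tool is in use.

```javascript
// Builds the trust policy and permissions policy for an EC2 role that can
// read objects from one bucket (illustrative sketch).
function buildInstanceRoleDocuments(bucketName) {
    return {
        // Who may assume the role: the EC2 service, on behalf of the instance.
        trustPolicy: {
            Version: '2012-10-17',
            Statement: [{
                Effect: 'Allow',
                Principal: { Service: 'ec2.amazonaws.com' },
                Action: 'sts:AssumeRole',
            }],
        },
        // What the role may do: read objects (note the /* object-level ARN).
        permissionsPolicy: {
            Version: '2012-10-17',
            Statement: [{
                Effect: 'Allow',
                Action: 's3:GetObject',
                Resource: `arn:aws:s3:::${bucketName}/*`,
            }],
        },
    };
}

console.log(JSON.stringify(buildInstanceRoleDocuments('demo-reports'), null, 2));
```

Listing the bucket contents would additionally need s3:ListBucket on the bucket ARN itself, which this minimal policy intentionally omits.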

Q: When would you choose Aurora over DynamoDB, or vice versa? A: Choose Aurora when you need relational queries, joins, transactions (ACID), or complex reporting — essentially when you have a well-defined schema and multiple access patterns that benefit from SQL. Choose DynamoDB for massively scalable, predictably low-latency workloads with known, simple access patterns: key-value stores, session management, shopping carts, event sourcing. DynamoDB's trade-off is that you must design your access patterns upfront (single-table design). If you need ad-hoc querying over DynamoDB data, export to S3 and query with Athena.

Q: You have a Lambda function timing out intermittently. Walk me through debugging. A: Structured approach: (1) CloudWatch Logs: filter for the request ID of a timed-out invocation and find the last line logged before the timeout to see where it hangs. (2) Lambda metrics: compare Duration p99 against the configured timeout, and check Throttles and ConcurrentExecutions. (3) X-Ray trace: find the slow segment; is it a downstream HTTP call, a database query, or CPU? (4) If the function is attached to a VPC, confirm it has a route to its dependencies (NAT Gateway or VPC endpoints); a missing route makes SDK calls hang until the timeout. Per-invocation ENI attachment delays were largely removed by the 2019 VPC networking change. (5) Increase MemorySize: Lambda allocates CPU in proportion to memory, so more memory means faster CPU-bound work. (6) Verify downstream dependencies: check for RDS connection exhaustion (add RDS Proxy) and compare DynamoDB consumed capacity against provisioned capacity.
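
One way to make such timeouts easier to diagnose is to fail slightly before the Lambda deadline, so the function can log which step was still running. The sketch below relies on `context.getRemainingTimeInMillis()`, which the Node.js Lambda runtime provides; the wrapper name and the `safetyMarginMs` default are assumptions.

```javascript
// Rejects shortly before the Lambda deadline so the failure can be logged
// with context instead of surfacing as an opaque timeout (illustrative sketch).
function withDeadline(context, work, safetyMarginMs = 500) {
    const budgetMs = Math.max(0, context.getRemainingTimeInMillis() - safetyMarginMs);
    let timer;
    const deadline = new Promise((_, reject) => {
        timer = setTimeout(
            () => reject(new Error(`Deadline exceeded after ${budgetMs} ms`)),
            budgetMs,
        );
    });
    // Whichever settles first wins; the timer is always cleaned up.
    return Promise.race([work(), deadline]).finally(() => clearTimeout(timer));
}
```

Inside a handler this could wrap a downstream call, for example `await withDeadline(context, () => queryDatabase(params))`, where `queryDatabase` stands for any hypothetical async dependency. The wrapper does not cancel the underlying work; it only stops waiting for it.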

Q: Design a system to process thousands of PDFs, extract text, and make it queryable. A: Classic event-driven serverless RAG pipeline: (1) Users upload PDFs to S3. (2) S3 event triggers a Lambda or Step Functions workflow per file. (3) Lambda calls Textract for high-accuracy extraction (forms, tables, signatures — not just raw text). (4) Text is chunked (e.g., 512-token overlapping windows) and sent to Bedrock (Titan Embeddings or Cohere) to produce vectors. (5) Vectors + metadata stored in OpenSearch (k-NN index) or pgvector on Aurora. (6) Query API: API Gateway → Lambda → embed the query → ANN search → retrieve top-k chunks → send context + question to Bedrock (Claude) → return answer. For scale: use SQS between upload and processing to handle burst, Dead Letter Queue for failed documents, Step Functions for retry/branching logic.
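
The chunking step in (4) can be sketched as follows. As a simplifying assumption, whitespace-separated words stand in for model tokens; a real pipeline would count tokens with the embedding model's tokenizer. The default overlap of 64 is illustrative.

```javascript
// Splits text into overlapping windows for embedding (illustrative sketch).
function chunkText(text, windowSize = 512, overlap = 64) {
    if (overlap >= windowSize) {
        throw new RangeError('overlap must be smaller than windowSize');
    }
    const tokens = text.split(/\s+/).filter(Boolean);
    const chunks = [];
    const step = windowSize - overlap;

    for (let start = 0; start < tokens.length; start += step) {
        chunks.push(tokens.slice(start, start + windowSize).join(' '));
        if (start + windowSize >= tokens.length) break; // last window reached the end
    }
    return chunks;
}

console.log(chunkText('a b c d e f g', 4, 2));
// [ 'a b c d', 'c d e f', 'e f g' ]
```

The overlap means a sentence that straddles a window boundary still appears intact in at least one chunk, at the cost of embedding some text twice.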

Q: When would you NOT use Lambda? A: Avoid Lambda for: (1) Processes running longer than 15 minutes (use Fargate or ECS). (2) Workloads requiring GPU (EC2 GPU instances). (3) Steady, predictable high throughput where a reserved EC2 or Fargate service is cheaper (e.g., >1M invocations/day of a heavy function). (4) Applications requiring a persistent in-memory cache (Lambda instances are recycled; use ElastiCache). (5) Tasks needing specific OS-level configuration, kernel modules, or direct hardware access. (6) WebSocket servers requiring persistent connections (use API Gateway WebSocket + connection store, or ECS).

Q: How does AWS billing work? How do you control costs? A: AWS billing is pay-per-use: EC2 charges per second (after the first minute), Lambda per 1 ms of compute, S3 per GB stored plus per request. Key levers for cost control: (1) Compute savings: Savings Plans (1 or 3 year commitment, up to 66% off) cover EC2, Lambda, and Fargate regardless of instance type; Spot Instances suit fault-tolerant batch workloads (up to 90% off). (2) Graviton: ARM-based instances with up to 40% better price-performance than comparable x86 instances. (3) Right-sizing: use Compute Optimizer recommendations; turn off dev environments overnight with Instance Scheduler. (4) Storage tiering: S3 Intelligent-Tiering, EBS gp3 vs io2 selection, lifecycle rules to Glacier. (5) FinOps practices: mandatory project, env, and team tags on all resources; AWS Cost Anomaly Detection alerts; per-team budget alarms via AWS Budgets. (6) Architecture: cache aggressively with ElastiCache/DAX to reduce DB calls; use SQS batching; prefer CloudFront Functions over Lambda@Edge for simple header or URL transforms.
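
To show how the per-request and per-millisecond billing adds up, here is a rough Lambda cost estimator. The default rates are illustrative assumptions (approximately the published x86 on-demand rates in us-east-1 at one point in time); pass current prices for a real estimate. The free tier and data transfer are ignored.

```javascript
// Rough monthly Lambda cost from invocation count, duration, and memory
// (illustrative sketch; default prices are assumptions, not authoritative).
function estimateLambdaMonthlyCost({
    invocationsPerDay,
    avgDurationMs,
    memoryMb,
    pricePerMillionRequests = 0.20,
    pricePerGbSecond = 0.0000166667,
    daysPerMonth = 30,
}) {
    const invocations = invocationsPerDay * daysPerMonth;
    // Compute is billed in GB-seconds: duration multiplied by configured memory.
    const gbSeconds = invocations * (avgDurationMs / 1000) * (memoryMb / 1024);
    const requestCost = (invocations / 1e6) * pricePerMillionRequests;
    const computeCost = gbSeconds * pricePerGbSecond;
    return { invocations, gbSeconds, requestCost, computeCost, total: requestCost + computeCost };
}

const estimate = estimateLambdaMonthlyCost({
    invocationsPerDay: 1_000_000,
    avgDurationMs: 500,
    memoryMb: 1024,
});
console.log(estimate.total.toFixed(2)); // 256.00 with the assumed rates
```

Under these assumed rates the compute portion dominates the request portion, which is why duration and memory tuning usually matter more than request count. Comparing this figure against an always-on Fargate task or reserved instance is the "run the math" step.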

Q: Explain the shared responsibility model. A: AWS is responsible for security of the cloud — the physical infrastructure, data center facilities, network hardware, hypervisors, and the managed services themselves (e.g., the RDS database engine patching). The customer is responsible for security in the cloud — everything they put on top: IAM policies, encryption configuration, security group rules, patching the OS on EC2 instances, application-layer security, and data classification. The boundary shifts depending on the service: with Lambda you don't patch an OS (AWS responsibility), but with EC2 you do (your responsibility). This model means a misconfigured S3 bucket is always the customer's fault, not AWS's.

Q: How would you architect a greenfield application? Walk me through your decision process. A: Framework: (1) Traffic pattern — spiky/event-driven → Lambda; steady/containerized → Fargate; ML inference → EC2 GPU or SageMaker. (2) Data model — relational with complex queries → Aurora; known key-value patterns, massive scale → DynamoDB; analytics → Redshift + Athena over S3. (3) HA/DR — Multi-AZ by default for all stateful resources; decide if you need multi-region (RTO/RPO requirements). (4) Security baseline — VPC with private subnets for all backend resources; Secrets Manager for credentials; CloudTrail + GuardDuty enabled from day one; WAF in front of public APIs. (5) Observability — structured JSON logs to CloudWatch with correlation IDs; X-Ray tracing; dashboards and alarms before go-live. (6) IaC — CDK or Terraform from day one; no manual console changes in production. (7) Cost — tag everything; enable Cost Explorer; set budget alarms.

Q: How do you manage secrets in AWS? A: Never hardcode or commit secrets. Use Secrets Manager for anything that needs automatic rotation: it rotates RDS passwords natively and supports custom Lambda rotation for any other secret. Use Parameter Store (SecureString type, KMS-encrypted) for non-rotating config. In Lambda: fetch secrets via SDK calls in the initialization code (outside the handler) and cache them for the lifetime of the execution environment. In ECS: reference them via secrets in the task definition; the agent fetches and injects them as environment variables at task start. In EKS: use the Secrets Store CSI Driver to mount Secrets Manager values as files. Access to secrets is controlled via IAM: the Lambda execution role, the ECS task execution role (for secrets injected through the task definition), or the pod's role must have secretsmanager:GetSecretValue permission on the specific ARN.
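
The Lambda caching approach can be sketched like this. `fetchSecret` is a hypothetical async function (for example, a thin wrapper around a Secrets Manager GetSecretValue call), injected so the sketch runs without the SDK. The TTL is an added design choice, not something the answer above requires: it lets a warm execution environment pick up rotated values.

```javascript
// Caches secrets at module scope so warm invocations skip the network call
// (illustrative sketch; create the cache outside the handler).
function createSecretCache(fetchSecret, ttlMs = 5 * 60 * 1000) {
    const cache = new Map(); // secretId -> { promise, expiresAt }

    return function getSecret(secretId, now = Date.now()) {
        const hit = cache.get(secretId);
        if (hit && hit.expiresAt > now) return hit.promise;

        const promise = fetchSecret(secretId).catch((err) => {
            cache.delete(secretId); // do not cache failures
            throw err;
        });
        cache.set(secretId, { promise, expiresAt: now + ttlMs });
        return promise;
    };
}
```

Caching the promise rather than the resolved value means concurrent callers during a cold start share a single fetch.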

Q: Explain AZ outage vs Region outage. How do you design for each? A: An AZ outage is a failure confined to one Availability Zone (one or more data centers). It is uncommon but does happen, so treat it as an expected event. Design for it by distributing resources across a minimum of 2 (ideally 3) AZs: Multi-AZ RDS/Aurora, Auto Scaling Groups with AZRebalance, ALB/NLB across AZs, ECS task placement spread by AZ. A Region outage is a full geographic failure: extremely rare, but catastrophic without preparation. Design with a DR strategy keyed to your RTO/RPO: pilot light (replicated data, minimal infra), warm standby (scaled-down live replica), or active-active (Route 53 with health checks routing traffic across two Regions). Data replication: S3 Cross-Region Replication, DynamoDB Global Tables, Aurora Global Database (replication lag typically under 1 s). Runbooks for DNS failover should be tested quarterly.


How It Connects to This Role's Stack

For a Senior Full Stack AI Engineer working in a multi-cloud environment, AWS is the core platform for AI/ML workloads and scalable backend services.

  • LlamaIndex / RAG on AWS: Data sources in S3, embedding models via Bedrock (Titan Embeddings) or SageMaker, vector storage in OpenSearch (k-NN) or Aurora pgvector, orchestration in Lambda + Step Functions. Bedrock Knowledge Bases can replace custom embedding pipelines entirely for standard RAG use cases.
  • Node.js on AWS: Lambda supports Node.js 22.x natively. Containerize with Docker for ECS Fargate or EKS. AWS SDK v3 is modular (import only what you use), ships with native ESM support, and has built-in retry/backoff logic.
  • Multi-Cloud integration: AWS for core AI/ML (SageMaker, Bedrock) and global scale; Azure for enterprise AD/Microsoft integration (IAM Identity Center + SAML 2.0); GCP for BigQuery or Vertex AI-specific models. Use Terraform for cross-cloud IaC, and Kubernetes as the portable compute layer. Direct Connect or site-to-site VPN for network connectivity between clouds.
  • CI/CD: AWS CodePipeline + CodeBuild + CodeDeploy for fully-managed pipelines, or GitHub Actions with OIDC (no long-lived access keys — assume a role via the GitHub Actions OIDC provider). CDK Pipelines for self-mutating infrastructure pipelines. See CI/CD & DevOps for full coverage.
  • Microservices / Kubernetes: EKS for K8s workloads; ECS for simpler Docker orchestration. AWS Load Balancer Controller integrates ALB with Kubernetes ingress. App Mesh for service mesh. ECR for private container registry.

Red Flags to Avoid

  • "I just use the Console for everything." Senior engineers automate with CloudFormation, CDK, or Terraform. Manual console changes in production are an anti-pattern — no audit trail, not repeatable.
  • "IAM is just for users." Failing to understand IAM roles for services (EC2 instance profiles, Lambda execution roles, ECS task roles) is a fundamental gap. Service-to-service permissions are the majority of IAM policy usage in production.
  • "We put everything in the public subnet." Backend resources (databases, internal APIs, Lambda functions calling internal services) belong in private subnets. Mention NAT Gateway for outbound-only internet access from private subnets.
  • "Lambda is for everything, it's always cheaper." Not recognizing Lambda's limitations (15-min timeout, cold starts, per-invocation cost at scale, no GPU) shows inexperience. Run the math for steady-state traffic.
  • "High availability means multiple EC2s in the same AZ." This misses the core failure domain concept. Multi-AZ is the minimum for any production stateful resource.
  • "We store database passwords in environment variables in the code." Major security anti-pattern — must mention Secrets Manager, rotation, and least-privilege IAM on the secret ARN.
  • "AWS is the only cloud we need." For a multi-cloud role, failing to articulate where AWS excels versus where Azure or GCP may be a better fit shows shallow strategic thinking.
  • "S3 is a file system." S3 is an object store. No append, no in-place partial writes, no directory rename as an atomic operation. Treating it like POSIX causes correctness and performance bugs.
  • "We can optimize cost later." Cost decisions baked in early (instance types, storage classes, data transfer architecture) are expensive to undo. FinOps tagging, budget alarms, and Savings Plan analysis should happen at architecture time, not after the first surprise bill.
  • "Encryption is optional, we'll add it later." KMS encryption for S3, EBS, RDS, and Secrets Manager is a checkbox at creation time. Encrypting an existing unencrypted EBS volume means taking a snapshot, copying it with encryption enabled, and creating a new volume from the encrypted copy, which is disruptive. Bake it in from the start.