
AWS Observability & Monitoring


The 30-Second Pitch

Observability is the ability to understand what your system is doing from the outside — without having to redeploy or add new instrumentation every time something breaks. AWS provides three first-party pillars: CloudWatch (metrics + logs + alarms), X-Ray (distributed traces), and CloudTrail (API audit trail), complemented by AWS Config for configuration compliance. Getting this stack right means the difference between a 5-minute postmortem and a 5-hour war room. For interviews, the expectation is to reason across all four services — when to use each, how to correlate across them, and how to design an observability stack that doesn't cost more than the app it monitors.

How It Actually Works

The Three Pillars + Audit

Metrics  →  CloudWatch Metrics / Alarms          "Is the system healthy?"
Logs     →  CloudWatch Logs / Logs Insights      "What did the system say?"
Traces   →  AWS X-Ray / ADOT                     "Where did the latency go?"
Audit    →  CloudTrail / AWS Config              "Who changed what, and when?"

Every observable system emits all three pillars. Metrics tell you that something is wrong (latency p99 spike). Logs tell you what happened (specific error message). Traces tell you where in the distributed call chain it happened (DynamoDB throttle in a downstream Lambda). Audit tells you who changed the system itself.


1. Amazon CloudWatch — Unified Observability

CloudWatch is the central telemetry hub for AWS. It ingests metrics from 70+ AWS services automatically, accepts custom metrics via API or the Embedded Metrics Format, stores and queries structured logs, and fires alarms that trigger automated responses. Almost every observability pattern on AWS starts here.

Metrics

Namespaces and Dimensions

Every CloudWatch metric lives inside a namespace (e.g., AWS/Lambda, AWS/EC2, MyApp/Payments). Within a namespace, a metric is identified by its name + a set of dimensions (key-value pairs). A Lambda Errors metric with dimension FunctionName=checkout is a different time series from FunctionName=auth — even though both live in AWS/Lambda.

Statistics

| Statistic | Meaning |
|---|---|
| SampleCount | Number of data points in the period |
| Sum | Total across all data points |
| Average | Sum / SampleCount |
| Min / Max | Lowest / highest value in the period |
| p50, p90, p99, p99.9 | Percentile — requires high-resolution or extended statistics |

Resolution

  • Standard resolution: 1-minute granularity — default for most AWS service metrics
  • High resolution: 1-second granularity — custom metrics published with StorageResolution=1; alarms can fire in 10-second periods; 3× the cost of standard

Custom Metrics: PutMetricData vs EMF

Option 1 — PutMetricData API call: synchronous, extra latency, counts against the 150 TPS limit, costs $0.30/metric/month.

Option 2 — Embedded Metrics Format (EMF): write a structured JSON log line with a special _aws envelope. CloudWatch Logs extracts the metric automatically at ingestion — zero extra API calls. Best practice for Lambda: it piggybacks on the log ingestion you're already paying for.

EMF structured log for Lambda — emit a custom business metric:

typescript
// Using @aws-lambda-powertools/metrics (preferred) or raw EMF JSON
import { Metrics, MetricUnits } from '@aws-lambda-powertools/metrics';
import type { APIGatewayEvent } from 'aws-lambda';

const metrics = new Metrics({
  namespace: 'MyApp/Payments',
  serviceName: 'checkout-service',
});

export const handler = async (event: APIGatewayEvent) => {
  const paymentAmount = event.body ? JSON.parse(event.body).amount : 0;

  // Emit custom metric — extracted from logs, no PutMetricData call
  metrics.addMetric('PaymentAmount', MetricUnits.Count, paymentAmount);
  metrics.addMetric('PaymentProcessed', MetricUnits.Count, 1);
  metrics.addDimension('PaymentMethod', event.headers['X-Payment-Method'] ?? 'unknown');

  try {
    await processPayment(paymentAmount);
    metrics.addMetric('PaymentSuccess', MetricUnits.Count, 1);
  } catch (err) {
    metrics.addMetric('PaymentFailure', MetricUnits.Count, 1);
    throw err;
  } finally {
    // Flushes EMF JSON to stdout — CloudWatch Logs extracts metrics automatically
    metrics.publishStoredMetrics();
  }
};

Raw EMF JSON written directly to stdout (lower-level reference):

json
{
  "_aws": {
    "Timestamp": 1706000000000,
    "CloudWatchMetrics": [
      {
        "Namespace": "MyApp/Payments",
        "Dimensions": [["PaymentMethod"]],
        "Metrics": [
          { "Name": "PaymentAmount", "Unit": "None" },
          { "Name": "PaymentProcessed", "Unit": "Count" }
        ]
      }
    ]
  },
  "PaymentMethod": "stripe",
  "PaymentAmount": 149.99,
  "PaymentProcessed": 1,
  "correlationId": "req-abc-123"
}

Metric Math

Compose new time series from existing metrics using expressions evaluated server-side:

error_rate = errors / (errors + requests) * 100

Used in alarms and dashboards. Supports functions: SUM(), AVG(), MAX(), FILL(), RATE(), SEARCH() (cross-metric queries), IF(), METRICS().
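
As a sketch, here is how that error-rate expression could back an alarm through the AWS SDK v3 (@aws-sdk/client-cloudwatch). The function name checkout and the alarm name are illustrative; for Lambda, Invocations already includes failed requests, so the rate here is errors / invocations:

typescript
import {
  CloudWatchClient,
  PutMetricAlarmCommand,
} from '@aws-sdk/client-cloudwatch';

const cw = new CloudWatchClient({});

// Alarm on a computed error-rate series; both inputs stay hidden (ReturnData: false)
async function createErrorRateAlarm(): Promise<void> {
  await cw.send(new PutMetricAlarmCommand({
    AlarmName: 'checkout-error-rate-high',   // illustrative name
    ComparisonOperator: 'GreaterThanThreshold',
    Threshold: 1,                            // percent
    EvaluationPeriods: 3,
    Metrics: [
      {
        Id: 'errors',
        MetricStat: {
          Metric: {
            Namespace: 'AWS/Lambda',
            MetricName: 'Errors',
            Dimensions: [{ Name: 'FunctionName', Value: 'checkout' }],
          },
          Period: 60,
          Stat: 'Sum',
        },
        ReturnData: false,
      },
      {
        Id: 'invocations',
        MetricStat: {
          Metric: {
            Namespace: 'AWS/Lambda',
            MetricName: 'Invocations',
            Dimensions: [{ Name: 'FunctionName', Value: 'checkout' }],
          },
          Period: 60,
          Stat: 'Sum',
        },
        ReturnData: false,
      },
      {
        Id: 'errorRate',
        Expression: 'errors / invocations * 100',
        Label: 'Error rate (%)',
        ReturnData: true,                    // the alarm evaluates this series
      },
    ],
  }));
}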

Metrics Insights

SQL-like query language for CloudWatch metrics. Supports cross-account queries when combined with CloudWatch Observability Access Manager. Example:

sql
SELECT AVG(Duration)
FROM SCHEMA("AWS/Lambda", FunctionName)
WHERE FunctionName LIKE 'prod-%'
GROUP BY FunctionName
ORDER BY AVG() DESC
LIMIT 10

Specialized Insight Extensions

  • CloudWatch Container Insights — ECS/EKS metrics at cluster/service/task/pod/container level: CPU, memory, network I/O, disk I/O. Requires the CloudWatch agent or a Fluent Bit sidecar. EKS also requires the Container Insights add-on.
  • CloudWatch Lambda Insights — per-invocation metrics beyond what AWS/Lambda provides: init_duration, memory_utilization, cpu_total_time, tx_bytes, rx_bytes. Requires the Lambda Insights extension layer (arn:aws:lambda:...:layer:LambdaInsightsExtension:...). Tied to the Lambda execution environment lifecycle.

Cross-Account / Cross-Region Monitoring

CloudWatch Observability Access Manager lets you link source accounts (where workloads run) to a monitoring account (central observability). From the monitoring account you can view metrics, logs, and traces from all linked accounts without switching consoles — critical for multi-account AWS Organizations setups.


Alarms

Threshold Alarms

States: OK, ALARM, INSUFFICIENT_DATA
Condition: metric breaches the threshold for N out of M consecutive data points

Example: Lambda Errors > 5 for 3 out of 5 one-minute periods. The N-of-M logic prevents transient spikes from triggering pages.
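
A minimal sketch of that N-of-M alarm with the SDK, assuming a function named checkout and a hypothetical SNS topic ARN:

typescript
import { CloudWatchClient, PutMetricAlarmCommand } from '@aws-sdk/client-cloudwatch';

const cw = new CloudWatchClient({});

// "Errors > 5 for 3 out of 5 one-minute periods"
async function createErrorsAlarm(): Promise<void> {
  await cw.send(new PutMetricAlarmCommand({
    AlarmName: 'checkout-lambda-errors',
    Namespace: 'AWS/Lambda',
    MetricName: 'Errors',
    Dimensions: [{ Name: 'FunctionName', Value: 'checkout' }],
    Statistic: 'Sum',
    Period: 60,                       // one-minute periods
    Threshold: 5,
    ComparisonOperator: 'GreaterThanThreshold',
    EvaluationPeriods: 5,             // M: look at the last 5 periods
    DatapointsToAlarm: 3,             // N: alarm if 3 of those 5 breach
    TreatMissingData: 'notBreaching', // sparse metric: missing = healthy (see Missing Data Treatment below)
    AlarmActions: ['arn:aws:sns:us-east-1:123456789012:oncall'], // hypothetical topic
  }));
}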

Anomaly Detection Alarms

CloudWatch trains an ML model on up to 2 weeks of historical data to establish a dynamic baseline. The alarm fires when a metric falls outside the ANOMALY_DETECTION_BAND(metric, stddev_factor). No threshold to tune — the model adapts to time-of-day and day-of-week patterns automatically.

ANOMALY_DETECTION_BAND(m1, 2)
# metric m1 must stay within 2 standard deviations of predicted value

Composite Alarms

Combine multiple alarms using Boolean logic. This is the primary tool for reducing alert fatigue:

"checkout-service-degraded" =
  (ALARM("checkout-error-rate") OR ALARM("checkout-p99-latency"))
  AND NOT ALARM("upstream-dependency-down")

Use case: fire a single PagerDuty page when both error rate AND latency are elevated, but suppress it if the issue is known to be an upstream dependency outage.
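
The same rule expressed through the API, as a sketch; the child alarm names and SNS topic are assumed to exist already:

typescript
import { CloudWatchClient, PutCompositeAlarmCommand } from '@aws-sdk/client-cloudwatch';

const cw = new CloudWatchClient({});

async function createCompositeAlarm(): Promise<void> {
  await cw.send(new PutCompositeAlarmCommand({
    AlarmName: 'checkout-service-degraded',
    // Boolean expression over child alarm states
    AlarmRule:
      '(ALARM("checkout-error-rate") OR ALARM("checkout-p99-latency")) ' +
      'AND NOT ALARM("upstream-dependency-down")',
    AlarmActions: ['arn:aws:sns:us-east-1:123456789012:pagerduty'],       // hypothetical
    AlarmDescription: 'Runbook: https://wiki.example.com/runbooks/checkout', // hypothetical URL
  }));
}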

Alarm Actions

  • SNS topic — fan out to email, SMS, Lambda, HTTP endpoint, PagerDuty/Opsgenie
  • EC2 actions — stop, terminate, reboot, or recover an instance
  • Auto Scaling — scale out/in policies
  • SSM OpsCenter — create an OpsItem for structured incident tracking
  • Systems Manager Incident Manager — create and escalate an incident

Missing Data Treatment

| Setting | Behavior |
|---|---|
| missing | Alarm transitions to INSUFFICIENT_DATA |
| notBreaching | Treat missing points as within threshold (good for sparse metrics) |
| breaching | Treat missing points as exceeding threshold |
| ignore | Alarm state stays unchanged |

Default is missing. For heartbeat-style checks, use breaching — absence of data is the problem.


Logs

Log Groups and Streams

  • Log group: named container for related logs; set retention (1 day to 10 years, or Never Expire) and optional KMS encryption per group. Lambda creates a log group per function automatically; EC2 requires CloudWatch agent config.
  • Log stream: a sequence of events from a single source. Lambda creates one stream per execution environment container; EC2 creates one per instance.

If you never set retention, logs accumulate forever at $0.03/GB/month. Set retention on every log group — automate via an AWS Config rule or a Lambda at account-creation time.
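
A minimal enforcement sketch: an EventBridge rule matching CloudTrail's CreateLogGroup event invokes this Lambda, which stamps a retention policy on every new group. The 30-day value is an assumption; substitute your own policy:

typescript
import {
  CloudWatchLogsClient,
  PutRetentionPolicyCommand,
} from '@aws-sdk/client-cloudwatch-logs';

const logs = new CloudWatchLogsClient({});

// Invoked by an EventBridge rule on the CloudTrail CreateLogGroup event
export const handler = async (event: {
  detail: { requestParameters: { logGroupName: string } };
}) => {
  const logGroupName = event.detail.requestParameters.logGroupName;
  await logs.send(new PutRetentionPolicyCommand({
    logGroupName,
    retentionInDays: 30, // assumed default; pick the value your policy requires
  }));
};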

Structured Logging

Write JSON to stdout/stderr. This enables powerful Logs Insights queries without regex parsing. Include a consistent set of fields: level, timestamp, correlationId, requestId, service, message.

typescript
console.log(JSON.stringify({
  level: 'INFO',
  timestamp: new Date().toISOString(),
  service: 'checkout',
  correlationId: ctx.correlationId,
  requestId: context.awsRequestId,
  message: 'Payment processed',
  paymentId: payment.id,
  amount: payment.amount,
  durationMs: Date.now() - startTime,
}));

CloudWatch Logs Insights

Interactive query engine for log data. Queries run in parallel across all streams in a group. 15-minute query execution limit. Key commands: fields, filter, stats, sort, limit, parse (regex extraction), pattern (ML-based automatic pattern detection).

# Top 10 slowest Lambda invocations
filter @type = "REPORT"
| stats max(@duration) as maxDuration by @requestId
| sort maxDuration desc
| limit 10

# Error rate by hour — structured JSON logs
filter level = "ERROR"
| stats count() as errors by bin(1h) as hour
| sort hour desc

# Parse unstructured log lines with regex
parse @message "user=* action=* latency=*ms" as user, action, latencyMs
| filter action = "checkout"
| stats avg(latencyMs), p99(latencyMs) as p99Latency by user
| sort p99Latency desc

Metric Filters

Create a CloudWatch metric from a log group without writing application code. Example: count all lines containing "level":"ERROR" as an AppErrors metric. Supports numeric value extraction (e.g., extract durationMs from JSON and publish as a metric). Lower-level alternative to EMF — EMF is preferred for new Lambda code.
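
A sketch of that AppErrors example via the SDK; the log group name and namespace are assumptions:

typescript
import {
  CloudWatchLogsClient,
  PutMetricFilterCommand,
} from '@aws-sdk/client-cloudwatch-logs';

const logs = new CloudWatchLogsClient({});

async function createErrorCounter(): Promise<void> {
  await logs.send(new PutMetricFilterCommand({
    logGroupName: '/aws/lambda/checkout',     // assumed log group
    filterName: 'app-errors',
    filterPattern: '{ $.level = "ERROR" }',   // matches structured JSON log lines
    metricTransformations: [{
      metricNamespace: 'MyApp/Checkout',
      metricName: 'AppErrors',
      metricValue: '1',  // each matching line counts as 1
      defaultValue: 0,   // publish 0 when nothing matches, keeping the series dense
    }],
  }));
}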

Log Subscriptions

Stream log events in near real-time to a downstream processor. Three targets:

| Target | Use Case |
|---|---|
| Lambda function | Real-time enrichment, alerting, routing |
| Kinesis Data Streams | High-volume streaming to custom consumers |
| Kinesis Data Firehose | S3/OpenSearch/Splunk delivery with buffering |

Subscription filters support the same filter pattern syntax as metric filters. Multiple subscriptions per log group (up to 2 by default). Cross-account log streaming: the source account's log group → subscription → a resource-based policy allows the destination account's Kinesis/Lambda.
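
A sketch of wiring a subscription filter to a Firehose delivery stream; the ARNs are hypothetical, and roleArn is the role CloudWatch Logs assumes to write into Firehose:

typescript
import {
  CloudWatchLogsClient,
  PutSubscriptionFilterCommand,
} from '@aws-sdk/client-cloudwatch-logs';

const logs = new CloudWatchLogsClient({});

async function streamErrorsToFirehose(): Promise<void> {
  await logs.send(new PutSubscriptionFilterCommand({
    logGroupName: '/aws/lambda/checkout',     // assumed log group
    filterName: 'errors-to-s3',
    filterPattern: '{ $.level = "ERROR" }',   // an empty string would forward everything
    destinationArn:
      'arn:aws:firehose:us-east-1:123456789012:deliverystream/logs-to-s3', // hypothetical
    roleArn: 'arn:aws:iam::123456789012:role/cwlogs-to-firehose',          // hypothetical
  }));
}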


Dashboards

CloudWatch provides automatic dashboards for EC2, Lambda, DynamoDB, and other services — preconfigured, no setup required. Custom dashboards support widgets:

| Widget Type | Use Case |
|---|---|
| Metric graph | Time series, stacked area, bar chart |
| Log table | Logs Insights query results inline |
| Alarm status | RAG status of alarm list |
| Text (Markdown) | Section headers, runbook links |
| Explorer | Dynamic multi-resource metric view |

Dashboards are cross-account (with Observability Access Manager) and embeddable via snapshot URLs. The JSON dashboard body is version-controllable — store it in your IaC repo alongside the stack it monitors.


CloudWatch Synthetics

Canaries are Puppeteer/Selenium scripts that run on a configurable schedule (every 1 minute to once a day) from AWS-managed infrastructure. They test your endpoints from the outside — catching issues before real users do.

Built-in Blueprints

| Blueprint | What It Tests |
|---|---|
| Heartbeat monitor | HTTP(S) endpoint availability + latency |
| API canary | REST/GraphQL API correctness (request/response validation) |
| Broken link checker | Crawls a page and verifies all links return 2xx |
| Visual monitoring | Screenshot comparison against a baseline (detects UI regressions) |
| GUI workflow | Multi-step Puppeteer flow (login → add to cart → checkout) |

Each canary step captures a screenshot, a HAR file (full network waterfall), and pass/fail status. Canary results automatically feed a CloudWatch alarm — alert when the success rate drops below 100% or latency exceeds your SLA.
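
For reference, a minimal heartbeat-style canary sketch, assuming a syn-nodejs-puppeteer runtime. The Synthetics and SyntheticsLogger modules exist only inside the canary runtime, and the URL is a placeholder:

typescript
// Runs only inside the CloudWatch Synthetics Node.js runtime,
// which provides the 'Synthetics' and 'SyntheticsLogger' modules.
const synthetics = require('Synthetics');
const log = require('SyntheticsLogger');

const handler = async () => {
  // executeStep records pass/fail, latency, and a screenshot per step
  await synthetics.executeStep('homepage', async () => {
    const page = await synthetics.getPage();
    const response = await page.goto('https://www.example.com', { // placeholder URL
      waitUntil: 'domcontentloaded',
      timeout: 30000,
    });
    if (!response || response.status() !== 200) {
      throw new Error(`Unexpected status: ${response?.status()}`);
    }
    log.info('Homepage loaded');
  });
};

exports.handler = handler;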


2. AWS X-Ray — Distributed Tracing

X-Ray answers "where did the time go?" in a distributed system. It correlates work across Lambda, ECS, EC2, API Gateway, SQS, SNS, and DynamoDB into a single end-to-end trace.

Core Concepts

| Concept | Description |
|---|---|
| Trace | Complete end-to-end journey of one request; identified by a trace ID propagated in the X-Amzn-Trace-Id header |
| Segment | Work done by one service (e.g., one Lambda invocation, one ECS task) |
| Subsegment | Granular breakdown within a segment: DB query, external HTTP call, custom block |
| Annotation | Indexed key-value pair (string, number, boolean) — filterable in the X-Ray console and service map groups |
| Metadata | Arbitrary JSON attached to a segment/subsegment — not indexed, not searchable, but visible in trace detail |
| Sampling | Rule-based decision of which requests to trace — balances observability cost vs. coverage |

Sampling Rules

Default rule: a reservoir of 1 request/second traced at 100%, then 5% of remaining requests. Custom rules override by service name, URL path, HTTP method, host, or resource ARN. Rule evaluation order: lower priority number = evaluated first.

Rule: "checkout-full-trace"
  host: *.myapp.com
  url_path: /checkout/*
  http_method: POST
  reservoir_size: 50      # first 50 req/s traced fully
  fixed_rate: 0.20        # 20% of remainder
  priority: 1             # evaluated before default

In production with high traffic, tracing 100% of requests costs ~$5 per million traces recorded, which adds up quickly at scale, plus the overhead of recording and shipping every segment. Sampling keeps costs predictable while still providing statistically valid performance data.
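
The checkout-full-trace rule above, created programmatically as a sketch; CreateSamplingRule expects every matching field, with * as the wildcard:

typescript
import { XRayClient, CreateSamplingRuleCommand } from '@aws-sdk/client-xray';

const xray = new XRayClient({});

async function createCheckoutRule(): Promise<void> {
  await xray.send(new CreateSamplingRuleCommand({
    SamplingRule: {
      RuleName: 'checkout-full-trace',
      Priority: 1,          // evaluated before the default rule
      ReservoirSize: 50,    // first 50 req/s traced fully
      FixedRate: 0.2,       // 20% of the remainder
      ServiceName: '*',
      ServiceType: '*',
      Host: '*.myapp.com',
      HTTPMethod: 'POST',
      URLPath: '/checkout/*',
      ResourceARN: '*',
      Version: 1,
    },
  }));
}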

Integration

Lambda: enable Active Tracing in function configuration (one checkbox or TracingConfig: Active in CloudFormation). The X-Ray daemon runs as a managed sidecar — no daemon management required. Execution role needs xray:PutTraceSegments and xray:PutTelemetryRecords.

ECS: add an X-Ray daemon container as a sidecar in the task definition. Set AWS_XRAY_DAEMON_ADDRESS environment variable in your application container. Task execution role needs X-Ray permissions.

EC2: install and run the X-Ray daemon as a systemd service. The daemon buffers segments and sends them to the X-Ray API in batches.

API Gateway: enable X-Ray tracing per stage. The gateway injects the X-Amzn-Trace-Id header for downstream propagation — any SDK-instrumented service receiving that header automatically links to the upstream trace.

SDK Instrumentation (Node.js example):

typescript
import AWSXRay from 'aws-xray-sdk'; // full SDK: includes the Express middleware
import * as AWSv3 from '@aws-sdk/client-dynamodb';
import express from 'express';
import https from 'https';

const app = express();
app.use(express.json()); // populate req.body for the handler below

// Capture all outbound HTTPS calls automatically
AWSXRay.captureHTTPsGlobal(https, true);

// Instrument AWS SDK v3 client
const ddbClient = AWSXRay.captureAWSv3Client(new AWSv3.DynamoDBClient({}));

app.use(AWSXRay.express.openSegment('checkout-service'));

app.post('/checkout', async (req, res) => {
  const segment = AWSXRay.getSegment();

  // Annotation: indexed, filterable in X-Ray console
  segment?.addAnnotation('tenantId', req.headers['x-tenant-id'] as string);
  segment?.addAnnotation('paymentMethod', req.body.paymentMethod);

  // Custom subsegment for a business operation
  const subsegment = segment?.addNewSubsegment('validate-inventory');
  try {
    await validateInventory(req.body.items);
    subsegment?.close();
  } catch (err) {
    subsegment?.addError(err as Error);
    subsegment?.close();
    throw err;
  }

  // DynamoDB call automatically traced (captureAWSv3Client)
  await ddbClient.send(new AWSv3.GetItemCommand({ ... }));

  res.json({ status: 'ok' });
});

app.use(AWSXRay.express.closeSegment());

X-Ray Service Map

The service map is a real-time graph of your architecture as X-Ray observes it. Each node shows:

  • Request rate (req/s)
  • Error rate (4xx %)
  • Fault rate (5xx %)
  • Latency distribution (p50/p99 histogram)

Edges show call relationships and error propagation. Clicking a node filters to traces for that service. Groups let you save filter expressions as named views (e.g., annotation.tenantId = "acme" to see all traces for one tenant).

AWS Distro for OpenTelemetry (ADOT)

ADOT is AWS's maintained distribution of the OpenTelemetry collector and language SDKs. It is the preferred choice for ECS/EKS workloads because:

  • Vendor-agnostic: same code exports to X-Ray, Jaeger, Zipkin, or any OTLP backend
  • OTLP exporter sends traces to X-Ray and metrics to CloudWatch
  • ADOT Lambda Layer: drop-in automatic instrumentation for Lambda — no code changes required, configure via environment variables

yaml
# ADOT Collector config: receive OTLP, export to X-Ray + CloudWatch
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  awsxray:
    region: us-east-1
  awsemf:
    region: us-east-1
    namespace: MyApp/OTEL

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [awsxray]
    metrics:
      receivers: [otlp]
      exporters: [awsemf]

3. AWS CloudTrail — API Audit Trail

CloudTrail records every AWS API call made in your account — who made it, from where, at what time, and what the response was. It is the authoritative answer to "who changed production?"

What CloudTrail Captures

| Event Type | Examples | Default | Cost |
|---|---|---|---|
| Management events | EC2:RunInstances, IAM:CreatePolicy, S3:CreateBucket | Enabled (last 90 days in Event History) | Free for first copy per region |
| Data events | S3:GetObject, Lambda:Invoke, DynamoDB:GetItem | Disabled | ~$0.10/100K events |
| Insights events | Unusual IAM:CreateAccessKey volume, EC2:RunInstances burst | Disabled | ~$0.35/100K events analyzed |

Event History (90-day rolling window) is free but not queryable with custom SQL. For compliance and forensics, you need a trail that writes to S3.

Configuration Best Practices

  1. Organization trail — created in the management account, automatically captures all member accounts, single S3 bucket, single CloudWatch Logs group.
  2. All regions — always enable multi-region to catch activity in opted-in and future regions.
  3. S3 + Object Lock — store trail logs in a dedicated S3 bucket with Object Lock in compliance mode. This makes logs tamper-proof even if an attacker gains S3 access.
  4. Log File Validation — CloudTrail writes a digest file (SHA-256 hash chain) every hour. The validate-logs CLI command verifies no files were deleted or modified.
  5. CloudWatch Logs integration — stream the trail to a CloudWatch Logs group; create metric filters + alarms (see the sketch after this list) for:
    • Root account login
    • Console sign-in without MFA
    • IAM policy changes
    • CloudTrail itself being disabled or modified
    • Security group / NACL changes
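
A sketch of the first of those alarms, assuming the trail streams to a log group named cloudtrail-logs; the filter pattern is the classic CIS-benchmark root-usage pattern, and the SNS topic is hypothetical:

typescript
import {
  CloudWatchClient, PutMetricAlarmCommand,
} from '@aws-sdk/client-cloudwatch';
import {
  CloudWatchLogsClient, PutMetricFilterCommand,
} from '@aws-sdk/client-cloudwatch-logs';

const logs = new CloudWatchLogsClient({});
const cw = new CloudWatchClient({});

async function alertOnRootUsage(): Promise<void> {
  // Count any root-user activity appearing in the trail's log group
  await logs.send(new PutMetricFilterCommand({
    logGroupName: 'cloudtrail-logs',   // assumed trail log group
    filterName: 'root-account-usage',
    filterPattern:
      '{ $.userIdentity.type = "Root" && $.userIdentity.invokedBy NOT EXISTS ' +
      '&& $.eventType != "AwsServiceEvent" }',
    metricTransformations: [{
      metricNamespace: 'Security',
      metricName: 'RootAccountUsage',
      metricValue: '1',
    }],
  }));

  // Page on any non-zero count; missing data is healthy silence
  await cw.send(new PutMetricAlarmCommand({
    AlarmName: 'root-account-usage',
    Namespace: 'Security',
    MetricName: 'RootAccountUsage',
    Statistic: 'Sum',
    Period: 300,
    Threshold: 0,
    ComparisonOperator: 'GreaterThanThreshold',
    EvaluationPeriods: 1,
    TreatMissingData: 'notBreaching',
    AlarmActions: ['arn:aws:sns:us-east-1:123456789012:security-alerts'], // hypothetical
  }));
}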

CloudTrail Lake

Managed event data store with up to 7-year retention. SQL queries via the console or API — no need to set up Athena, Glue catalog, or S3 partitioning. Immutable ingestion: events cannot be deleted. Use cases: compliance evidence, security investigations, SLA reporting.

Key Forensics Queries

sql
-- Find all IAM changes in the last 24 hours
SELECT eventTime, userIdentity.arn, eventName, sourceIPAddress
FROM cloudtrail_logs
WHERE eventSource = 'iam.amazonaws.com'
  AND eventTime > DATE_ADD('hour', -24, NOW())
ORDER BY eventTime DESC;

-- Find failed authorization attempts
SELECT eventTime, userIdentity.arn, errorCode, errorMessage, sourceIPAddress
FROM cloudtrail_logs
WHERE errorCode IN ('AccessDenied', 'UnauthorizedAccess', 'Client.UnauthorizedAccess')
ORDER BY eventTime DESC
LIMIT 100;

-- Who assumed a specific role?
SELECT eventTime, userIdentity.arn, sourceIPAddress,
       responseElements.credentials.expiration
FROM cloudtrail_logs
WHERE eventName = 'AssumeRole'
  AND requestParameters.roleArn = 'arn:aws:iam::123456789012:role/prod-admin'
ORDER BY eventTime DESC;

4. AWS Config — Configuration Compliance

While CloudTrail answers "who made this API call?", AWS Config answers "what is the current and historical state of my resources?" It continuously records configuration snapshots for every supported resource and evaluates them against rules.

Core Concepts

  • Configuration item: point-in-time snapshot of a resource's configuration (attributes, relationships to other resources, tags, IAM policies).
  • Configuration timeline: full history of all configuration changes for a resource — who changed it, when, what it looked like before and after.
  • Config rules: evaluate whether resources comply with a desired state. Two types:
    • Managed rules: 200+ AWS-provided rules (e.g., encrypted-volumes, s3-bucket-public-read-prohibited)
    • Custom rules: Lambda-backed rules for business-specific logic (e.g., "all EC2 instances must have a CostCenter tag"); see the sketch after this list
  • Conformance packs: bundles of rules + remediation actions mapped to compliance frameworks (CIS AWS Benchmark, NIST 800-53, HIPAA, PCI-DSS). Deploy a conformance pack to get 30+ rules in one action.
  • Remediation actions: SSM Automation documents that auto-correct non-compliant resources (e.g., automatically enable S3 block public access when the rule fires).
  • Organization Config Rules: deploy rules from the management account to all member accounts simultaneously.
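
A minimal sketch of the CostCenter-tag custom rule mentioned above: Config invokes the Lambda on a configuration change, and the function reports compliance back with PutEvaluations:

typescript
import {
  ConfigServiceClient, PutEvaluationsCommand,
} from '@aws-sdk/client-config-service';

const config = new ConfigServiceClient({});

// Config passes the changed resource's configuration item in invokingEvent (a JSON string)
export const handler = async (event: {
  invokingEvent: string;
  resultToken: string;
}) => {
  const { configurationItem } = JSON.parse(event.invokingEvent);
  const hasCostCenter = configurationItem?.tags?.CostCenter !== undefined;

  await config.send(new PutEvaluationsCommand({
    ResultToken: event.resultToken,
    Evaluations: [{
      ComplianceResourceType: configurationItem.resourceType,
      ComplianceResourceId: configurationItem.resourceId,
      ComplianceType: hasCostCenter ? 'COMPLIANT' : 'NON_COMPLIANT',
      OrderingTimestamp: new Date(configurationItem.configurationItemCaptureTime),
    }],
  }));
};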

Common Managed Rules

| Rule | What It Checks |
|---|---|
| encrypted-volumes | All EBS volumes must be encrypted at rest |
| s3-bucket-public-read-prohibited | No S3 bucket with public read ACL or policy |
| restricted-ssh | No security group allows 0.0.0.0/0 on port 22 |
| root-account-mfa-enabled | Root account has MFA enabled |
| cloudtrail-enabled | CloudTrail is active in the region |
| rds-storage-encrypted | All RDS instances have storage encryption |
| iam-password-policy | Account password policy meets minimum requirements |
| required-tags | All resources have mandatory tags (e.g., Environment, Owner) |

CloudTrail vs. AWS Config

| Question | Tool |
|---|---|
| "Who called DeleteSecurityGroup at 14:32?" | CloudTrail |
| "What did the security group look like before deletion?" | AWS Config timeline |
| "Which security groups allow unrestricted SSH right now?" | AWS Config rules |
| "Show me all API calls from a compromised access key" | CloudTrail |
| "When did this S3 bucket policy change, and what was the old policy?" | AWS Config |

Use them together: Config identifies non-compliant resources; CloudTrail identifies who made the change that caused non-compliance.

Automated Remediation Flow

Resource changes state
        ↓
AWS Config detects configuration drift
        ↓
Config rule evaluates → NON_COMPLIANT
        ↓
EventBridge rule fires on the Config compliance change event
        ↓
SSM Automation document runs (e.g., "AWSConfigRemediation-EnableS3BucketEncryption")
        ↓
Resource corrected → Config re-evaluates → COMPLIANT
        ↓
SNS notification: "Auto-remediated: s3-bucket-public-read-prohibited on bucket prod-uploads"
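
A sketch of the EventBridge wiring in that flow, routing NON_COMPLIANT evaluations to a hypothetical remediation Lambda:

typescript
import {
  EventBridgeClient, PutRuleCommand, PutTargetsCommand,
} from '@aws-sdk/client-eventbridge';

const events = new EventBridgeClient({});

async function routeComplianceChanges(): Promise<void> {
  // Match only transitions to NON_COMPLIANT emitted by AWS Config
  await events.send(new PutRuleCommand({
    Name: 'config-noncompliant-to-remediation',
    EventPattern: JSON.stringify({
      source: ['aws.config'],
      'detail-type': ['Config Rules Compliance Change'],
      detail: { newEvaluationResult: { complianceType: ['NON_COMPLIANT'] } },
    }),
  }));

  await events.send(new PutTargetsCommand({
    Rule: 'config-noncompliant-to-remediation',
    Targets: [{
      Id: 'remediation-lambda',
      Arn: 'arn:aws:lambda:us-east-1:123456789012:function:auto-remediate', // hypothetical
    }],
  }));
}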

For higher-risk remediations (e.g., terminating EC2 instances, rotating IAM credentials) use manual remediation with an approval gate via SSM Change Manager rather than fully automatic execution. The rule still fires immediately; a human approves the SSM document run.

Aggregator: AWS Config Aggregator consolidates compliance data from all accounts and regions into a single view. Query across your entire organization: "show me all non-compliant resources in all production accounts."


5. Observability Strategy — Putting It Together

Three Pillars Mapping

Metrics  → CloudWatch Metrics (namespace/dimension)
           ↓ custom: EMF in Lambda / PutMetricData elsewhere
           ↓ alarms → SNS → PagerDuty/Opsgenie/Slack

Logs     → CloudWatch Logs (structured JSON)
           ↓ Logs Insights for ad-hoc queries
           ↓ metric filters for automated counters
           ↓ subscriptions → Kinesis Firehose → S3 → Athena (long-term)

Traces   → X-Ray (Lambda, EC2, ECS simple workloads)
           ADOT → X-Ray (ECS/EKS production, multi-language)
           ↓ service map for visual dependency analysis
           ↓ trace filtering by annotation (tenantId, userId)

Network  → VPC Flow Logs → CloudWatch Logs or S3
           ↓ Logs Insights queries for security / connectivity debugging
           ↓ metric filters: rejected traffic counters → alarms

VPC Flow Logs as a Fourth Signal

VPC Flow Logs capture accepted and rejected network traffic at the ENI, subnet, or VPC level. They are CloudWatch Logs data — stored in a log group and queryable with Logs Insights. They are not application-level observability but are essential for network security forensics and connectivity debugging.

Key use cases:

  • Security: detect port scans, lateral movement, unexpected outbound connections (e.g., data exfiltration to external IPs)
  • Connectivity debugging: confirm whether a packet was accepted or rejected by a security group/NACL — eliminates "is it the app or the network?" ambiguity
  • Cost attribution: identify which ENIs are generating cross-AZ or internet transfer costs

# Logs Insights: find rejected traffic to a specific port
fields srcAddr, dstAddr, dstPort, action, protocol
| filter action = "REJECT" and dstPort = 5432
| stats count() as rejections by srcAddr
| sort rejections desc
| limit 20

Flow logs have a 1–15 minute delivery lag (not real-time). For real-time network threat detection, combine VPC Flow Logs with Amazon GuardDuty — GuardDuty ingests flow logs automatically and applies ML threat detection without you querying them manually.

Correlation IDs — The Connective Tissue

Without correlation IDs, debugging a multi-service request means cross-referencing timestamps in three separate consoles. With them, you can filter CloudWatch Logs, X-Ray traces, and CloudTrail events to a single logical request.

Implementation pattern:

  1. API Gateway: inject $context.requestId as a custom header X-Correlation-Id in the integration request (or generate a UUID via a Lambda authorizer).
  2. Lambda: read X-Correlation-Id from the event, fall back to context.awsRequestId. Add as X-Ray annotation and include in every log line.
  3. SQS/SNS: pass correlationId as a message attribute. Receiving Lambda reads it from the event and continues propagating.
  4. DynamoDB writes: store correlationId as an attribute for data-level forensics.
typescript
// Lambda receiving from API Gateway
import AWSXRay from 'aws-xray-sdk-core';
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';
import type { APIGatewayEvent, Context } from 'aws-lambda';

const sqsClient = new SQSClient({});

export const handler = async (event: APIGatewayEvent, context: Context) => {
  const correlationId =
    event.headers['X-Correlation-Id'] ?? context.awsRequestId;

  // Add to X-Ray for trace filtering
  AWSXRay.getSegment()?.addAnnotation('correlationId', correlationId);

  // Include in all logs
  const log = (msg: string, extra?: object) =>
    console.log(JSON.stringify({
      correlationId, requestId: context.awsRequestId,
      service: 'checkout', message: msg, ...extra,
    }));

  log('Processing payment', { amount: event.body });

  // Propagate to SQS; payload here just passes the inbound body through
  const payload = { body: event.body, correlationId };
  await sqsClient.send(new SendMessageCommand({
    QueueUrl: process.env.QUEUE_URL!,
    MessageBody: JSON.stringify(payload),
    MessageAttributes: {
      correlationId: { DataType: 'String', StringValue: correlationId },
    },
  }));
};

Cost-Effective Observability

| Resource | Pricing | Optimization |
|---|---|---|
| Custom metrics | $0.30/metric/month (first 10K) | Use EMF in Lambda — shares log cost |
| Logs ingest | $0.50/GB | Set log retention on every group; reduce verbosity in hot paths |
| Logs storage | $0.03/GB/month | Retention policy; archive cold logs to S3 via Firehose |
| Logs Insights | $0.005/GB scanned | Narrow the time range and log group; save frequent queries |
| X-Ray traces | $5.00/million traces | Default sampling (1 req/s + 5%) is sufficient for most workloads |
| CloudWatch dashboards | First 3 free, $3/dashboard/month | Consolidate — fewer, denser dashboards beat many sparse ones |
| CloudTrail data events | $0.10/100K events | Enable only for S3 buckets and Lambda functions that require audit |

Biggest wins: (1) set retention on all log groups on day one — forgotten groups grow unbounded, (2) use EMF instead of custom PutMetricData in Lambda, (3) keep X-Ray sampling at defaults unless you have a specific low-traffic debugging need.

Alerting Best Practices: USE + RED Methods

USE (for infrastructure — EC2, RDS, ECS nodes):

  • Utilization — CPU %, memory %, disk I/O %
  • Saturation — queue depth, run queue length, connection pool exhaustion
  • Errors — disk errors, network errors, hardware faults

RED (for services — APIs, Lambda, ECS tasks):

  • Rate — requests per second
  • Errors — error rate (4xx, 5xx)
  • Duration — latency p50/p99/p99.9

Composite alarm pattern for a service:

"checkout-service-health" = (
  ALARM("checkout-error-rate-high")         # RED: error rate > 1%
  OR ALARM("checkout-p99-latency-high")     # RED: p99 > 3s
)
AND NOT ALARM("payment-provider-degraded")  # suppress if known upstream issue

Single page fires. Alarm description includes runbook URL. On-call engineer gets one notification instead of five.

Incident Response Walkthrough โ€” Production Latency Spike

This is the end-to-end investigation flow when a composite alarm fires:

1. COMPOSITE ALARM fires: "checkout-service-degraded"
   ↓ CloudWatch Alarm → SNS → PagerDuty page

2. DASHBOARD CHECK (30 seconds)
   - Open the CloudWatch automatic Lambda dashboard
   - Confirm: p99 Duration spiked from 200ms → 4s at 14:47 UTC
   - Error rate: flat (no 5xx increase — latency, not errors)

3. X-RAY SERVICE MAP (1 minute)
   - checkout-lambda → dynamodb shows 3.8s avg latency
   - Other downstream services (SQS, external API) look normal
   - DynamoDB node is orange: high latency, low fault rate

4. X-RAY TRACE DRILL-DOWN (2 minutes)
   - Filter traces: annotation.service = "checkout", duration > 2s
   - Open a slow trace: DynamoDB GetItem subsegment = 3.7s
   - Check annotation: tableArn = "orders-table"
   - Metadata shows: ReturnedItemCount = 1, ConsumedCapacity = 14.5 RCU

5. CLOUDWATCH LOGS INSIGHTS (2 minutes)
   filter @type = "REPORT" and @duration > 2000
   | stats count(), avg(@duration), max(@duration) by bin(5m)
   → spike started at 14:45 UTC, correlated with a deployment event

6. CLOUDTRAIL (1 minute)
   - Filter: eventSource = dynamodb.amazonaws.com, eventTime around 14:43
   - Find: UpdateTable event — provisioned throughput changed from 100 to 5 RCU
   - userIdentity.arn: arn:aws:iam::123456789012:role/ci-deploy-role

7. ROOT CAUSE: CI/CD pipeline Terraform apply reduced DynamoDB capacity
   RESOLUTION: Revert table throughput; add a Config rule to alert on capacity decreases
   FOLLOW-UP: Add a DynamoDB ConsumedReadCapacityUnits alarm to the composite alarm

Total investigation time: ~6 minutes with full observability stack. Without it: indefinite.


6. Amazon Managed Grafana & Managed Prometheus

For Kubernetes-native teams or multi-cloud environments, AWS provides fully managed versions of the two dominant open-source observability tools.

Amazon Managed Service for Prometheus (AMP)

  • CNCF-compatible managed Prometheus — standard PromQL queries, standard scrape configs
  • Remote-write from EKS (via a Prometheus server or the ADOT collector), ECS (via ADOT), or on-premises
  • 150-day default retention (configurable); no cluster to size or WAL to manage
  • Integrates with Amazon Managed Grafana as a native data source
  • IAM-based authentication via SigV4 signing — no Prometheus Basic Auth to manage

yaml
# EKS: ADOT collector remote-writes to AMP
extensions:
  sigv4auth:                    # SigV4 signing for the AMP endpoint
    region: us-east-1

exporters:
  prometheusremotewrite:
    endpoint: "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-xxx/api/v1/remote_write"
    auth:
      authenticator: sigv4auth  # must also be listed under service.extensions

Amazon Managed Grafana (AMG)

  • Fully managed Grafana — no server to provision, automatic upgrades, built-in HA
  • IAM Identity Center (SSO) authentication — no local Grafana users to manage
  • Native data source plugins: CloudWatch, X-Ray, AMP, Timestream, Athena, OpenSearch
  • Existing Grafana dashboards (JSON export) import directly
  • Use AMG when: you already have Grafana dashboards, you're running EKS, or you want a unified view across CloudWatch and Prometheus metrics

When to use AMP + AMG vs native CloudWatch:

  • CloudWatch native: simple Lambda/ECS workloads, team already lives in AWS Console, minimal Kubernetes
  • AMP + AMG: Kubernetes-heavy stack, existing Grafana expertise, need PromQL flexibility, multi-cloud metrics

Example PromQL queries against AMP (EKS workloads):

promql
# Request rate by pod (RED: Rate)
sum(rate(http_requests_total{namespace="prod"}[5m])) by (pod)

# Error rate percentage (RED: Errors)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# p99 latency by service (RED: Duration)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{namespace="prod"}[5m]))
  by (le, service)
)

# CPU throttling ratio (USE: Saturation)
sum(rate(container_cpu_cfs_throttled_seconds_total[5m]))
  / sum(rate(container_cpu_cfs_periods_total[5m])) * 100

CloudWatch Evidently — Feature Flags & A/B Testing

A related observability-adjacent service: Amazon CloudWatch Evidently lets you run controlled feature rollouts and A/B experiments. Define a feature with percentage-based traffic splits, then analyze experiment results using CloudWatch Evidently metrics alongside your standard CloudWatch dashboards. Useful when you want to gate a new payment flow to 5% of traffic and compare conversion rates before full rollout — all within the same AWS observability console rather than a third-party feature flag SaaS.
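
A sketch of evaluating a feature per user with the SDK (@aws-sdk/client-evidently); the project and feature names are hypothetical:

typescript
import { EvidentlyClient, EvaluateFeatureCommand } from '@aws-sdk/client-evidently';

const evidently = new EvidentlyClient({});

// Decide per-user whether the new payment flow is enabled
async function useNewPaymentFlow(userId: string): Promise<boolean> {
  const result = await evidently.send(new EvaluateFeatureCommand({
    project: 'checkout',          // hypothetical project
    feature: 'new-payment-flow',  // hypothetical feature
    entityId: userId,             // sticky variation assignment per user
  }));
  return result.value?.boolValue ?? false; // fall back to the old flow
}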

CloudWatch Internet Monitor

Monitors internet-facing application health from the perspective of end users across ISPs and AWS edge locations. Automatically detects when an internet routing issue, ISP outage, or AWS region problem is impacting your users' connectivity — and tells you what percentage of your traffic is affected and from which geography. Integrates with CloudWatch alarms. Relevant for global applications where customer-reported issues may be ISP-side rather than your infrastructure.


Lambda Cold Start Observability

Cold starts are often the highest-impact latency spikes in Lambda-based systems and need dedicated observability. Key metrics and where to find them:

| Signal | Source | How to Alert |
|---|---|---|
| InitDuration in REPORT logs | CloudWatch Logs | Metric filter → alarm on p99 > 2s |
| init_duration metric | Lambda Insights layer | CloudWatch alarm on max |
| Cold start rate | EMF custom metric | Emit ColdStart=1 on the first invocation, ColdStart=0 thereafter |
| Concurrent executions burst | AWS/Lambda:ConcurrentExecutions | Alarm near the account concurrency limit |

typescript
// Track cold starts with EMF
import { Metrics, MetricUnits } from '@aws-lambda-powertools/metrics';
import type { Context } from 'aws-lambda';

const metrics = new Metrics({ namespace: 'MyApp/Lambda' }); // namespace is illustrative

// Returns true only on the first invocation of this execution environment
const isColdStart = (() => {
  let first = true;
  return () => { const v = first; first = false; return v; };
})();

export const handler = async (event: unknown, context: Context) => {
  metrics.addMetric('ColdStart', MetricUnits.Count, isColdStart() ? 1 : 0);
  metrics.addDimension('FunctionName', context.functionName);
  metrics.publishStoredMetrics();
  // ... handler logic
};

Cold start alarms should trigger investigation into: provisioned concurrency coverage gaps, Lambda layer size reduction, memory size tuning (more memory = faster init), or moving initialization code outside the handler.


Interview Q&A

Q: How does CloudWatch differ from X-Ray? When do you use each?

A: CloudWatch is the metrics, logs, and alarms platform — it tells you that something is wrong (error rate spike, latency alarm) and gives you the log lines around the failure. X-Ray is distributed tracing — it tells you where in the call chain the problem originated by linking segments from API Gateway through Lambda through DynamoDB into a single trace. Use CloudWatch for operational dashboards and alerting; use X-Ray when you need to debug which downstream dependency is causing latency or errors. In practice you use both: CloudWatch fires the alarm, X-Ray helps you identify the root cause.


Q: How would you debug a performance regression in a Lambda-based microservice?

A: Start at CloudWatch metrics: check Duration (p99), Errors, Throttles, and ConcurrentExecutions for the affected function. If Lambda Insights is enabled, check init_duration for cold start increases and memory_utilization to rule out memory pressure. Next, query CloudWatch Logs Insights for REPORT lines to find the slowest individual invocations. Then open X-Ray service map to see if the latency is inside the Lambda (CPU-bound) or waiting on a downstream call (DynamoDB, an external API). Drill into specific slow traces in X-Ray to see subsegment timing. Common culprits: cold starts (increase provisioned concurrency or memory), DynamoDB hot partitions (check ConsumedCapacity metric), N+1 queries (visible as many DynamoDB subsegments per trace).


Q: What is the Embedded Metrics Format (EMF) and why is it better than PutMetricData in Lambda?

A: EMF is a structured JSON envelope written to stdout that CloudWatch Logs automatically parses and extracts as a CloudWatch metric — no separate API call. The advantages in Lambda are: (1) zero additional latency — the metric write is synchronous with the log write that was happening anyway; (2) no PutMetricData API call means no additional IAM permission, no extra cost per call, no risk of hitting the 150 TPS limit; (3) the same log line contains both the human-readable message and the metric, so you never lose context when debugging; (4) the raw data is queryable in Logs Insights even before metric aggregation. EMF metrics are standard one-minute resolution by default; set StorageResolution to 1 in the metric directive if you need high resolution.


Q: Explain how you'd implement distributed tracing across an API Gateway โ†’ Lambda โ†’ SQS โ†’ Lambda chain.

A: (1) Enable X-Ray tracing on the API Gateway stage — the gateway injects X-Amzn-Trace-Id into the request. (2) Enable active tracing on both Lambda functions — the runtime automatically creates a segment linked to the incoming trace header. (3) The first Lambda passes the trace ID to SQS as a message attribute (AWSTraceHeader). When SQS delivers to the second Lambda, the runtime reads this attribute and links the new segment to the original trace. (4) Add X-Ray SDK instrumentation in both functions: captureAWSv3Client for the SQS and DynamoDB clients, custom subsegments for business logic, and annotations (tenantId, orderId) for filtering. (5) Also propagate a correlationId custom header through SQS message attributes — X-Ray trace data is retained for only 30 days, but business-level correlation IDs in logs persist for the full log group retention period.


Q: How do you set up alerting that reduces noise and avoids alert fatigue?

A: Four techniques: (1) N-of-M data points in threshold alarms — 3 out of 5 periods eliminates single-spike false positives; (2) Composite alarms — only page when both error rate AND latency are elevated, not when either crosses a threshold in isolation; (3) Anomaly detection alarms — let CloudWatch learn the baseline instead of setting static thresholds that fire every Monday morning when batch jobs run; (4) suppress known upstream issues — a composite alarm with AND NOT ALARM("dependency-degraded") prevents cascading alerts from an upstream outage. Operationally: every alarm needs a runbook link in its description, a defined severity, and an owner. Alarms with no owner get ignored and create fatigue.


Q: What does CloudTrail capture vs AWS Config? How do they complement each other?

A: CloudTrail is an event log of API calls: who called what API, when, from where, with what parameters, and what the response was. AWS Config is a configuration recorder: it maintains a timeline of resource state snapshots and evaluates current state against compliance rules. CloudTrail answers "what happened?" Config answers "what does it look like now, and what did it look like before?" Complement: Config's rule fires and marks a resource non-compliant → you use CloudTrail to find which API call caused the drift → you have both the what (Config before/after) and the who/when (CloudTrail event).


Q: A client says "our prod environment was changed and we don't know who did it." What do you investigate?

A: Start with CloudTrail Event History for the time window in question. Filter by the affected resource ARN or by event source (e.g., ec2.amazonaws.com). Look for ModifySecurityGroup, UpdateFunctionCode, PutBucketPolicy, or whichever resource type is affected. Check userIdentity to identify the IAM principal (role, user, or service). Cross-reference with the AWS Config timeline for the resource — it shows the exact before/after configuration diff. If the access key or role ARN is unfamiliar, check CloudTrail for AssumeRole events to trace the assumed role chain back to the original principal. If it happened across an AWS Org, use the organization trail in the management account. If the client doesn't have a trail configured, CloudTrail Event History only covers the last 90 days and only management events — data events and older history are gone.


Q: How would you design a cost-effective observability stack for a startup?

A: Use AWS native tools for the first 6–12 months — avoid paying for Datadog or New Relic at startup scale. (1) EMF for all Lambda custom metrics — zero incremental cost on top of existing log ingestion. (2) Set retention to 30 days on all log groups immediately — use a CloudFormation custom resource or EventBridge + Lambda to enforce this automatically on any new log group. (3) X-Ray with default sampling (1 req/s + 5%) — trace coverage at a predictable, low cost. (4) One composite alarm per service: (p99 > SLA) OR (error rate > 1%) → SNS → Slack. (5) Three automatic CloudWatch dashboards (free): Lambda, API Gateway, DynamoDB. One custom dashboard for business KPIs (EMF metrics). (6) Skip AMP/AMG until you have EKS with multiple clusters — overkill for ECS or Lambda workloads.


Q: What are CloudWatch composite alarms and when do you use them?

A: Composite alarms evaluate a Boolean expression over other alarm states — ALARM, OK, or INSUFFICIENT_DATA. They do not directly evaluate any metric; they aggregate child alarm states. Use cases: (1) Noise reduction — "service is degraded" fires only when both error rate AND latency alarms are in ALARM, not either alone; (2) Suppression — add AND NOT ALARM("maintenance-mode") to prevent pages during planned deployments; (3) Aggregation — single alarm for an entire service to a human on-call while child alarms carry more detail for automation; (4) Cross-account health — a monitoring account composite alarm aggregating child alarms from multiple source accounts for a single organizational health view.


Q: How do you ensure log retention compliance (90-day, 1-year, 7-year) across all accounts?

A: Three layers: (1) Preventive — an AWS Organizations SCP denying logs:DeleteRetentionPolicy and logs:PutRetentionPolicy below the minimum for production OUs. (2) Detective — the AWS Config managed rule cw-loggroup-retention-period-check (or a custom rule) verifying that retentionInDays >= 90 for all log groups; non-compliant groups trigger an SNS notification. (3) Corrective — an EventBridge rule on the CreateLogGroup event → a Lambda that calls PutRetentionPolicy with the account-appropriate retention value. For 7-year compliance (SOC2, HIPAA), stream logs via Kinesis Firehose to an S3 bucket with Object Lock (governance or compliance mode) — CloudWatch Logs retention does not provide tamper-evident storage, but S3 Object Lock does.


Q: Explain sampling in X-Ray. Why not trace 100% of requests in production?

A: Sampling determines the fraction of requests for which X-Ray creates and records a full trace. At 100%, a service handling 10,000 req/s would generate 36 million traces per hour; at $5 per million that is ~$180/hour — plus the latency overhead of the X-Ray SDK recording every segment. The default rule (1 req/s reservoir + 5% of the rest) captures all low-traffic paths fully while reducing high-traffic path costs by 95%. The remaining 5% is still statistically representative for latency distributions and error rate analysis. For critical low-volume paths (checkout, auth), increase the reservoir size so every request is traced. For high-volume read paths (homepage, search), the 5% default is usually sufficient. Custom sampling rules let you set different rates per URL pattern, HTTP method, or service name — so you can trace a critical path like /checkout/* at 100% while sampling 1% of high-volume read traffic.


Red Flags to Avoid

  • No log retention policy — forgetting to set retention means log groups grow indefinitely. At $0.03/GB/month a busy service will accumulate gigabytes fast. Set retention on every log group at creation time.
  • PutMetricData in hot Lambda paths — every call adds ~10ms of latency and counts against a 150 TPS API limit. Use EMF instead.
  • Tracing 100% of requests without capping — can turn a $5/month X-Ray bill into $500 at scale. Always configure sampling rules appropriate to your traffic volume.
  • Unstructured log lines in Lambda — console.log("error:", err) is unparseable by Logs Insights. Always JSON.stringify your log objects.
  • No composite alarms — alerting directly on individual metrics means 5–10 notifications per incident. Composite alarms give one actionable signal per service degradation.
  • Missing data treatment left as the default missing — for heartbeat metrics (canary success), a completely down endpoint produces no data, so the alarm never fires. Set to breaching.
  • CloudTrail data events disabled on sensitive S3 buckets — management events (bucket creation) are free, but GetObject and PutObject are data events that require explicit enablement. A breach via S3 exfiltration produces no CloudTrail evidence without them.
  • No organization trail — per-account trails are easy to miss in new accounts. An organization trail in the management account covers all current and future accounts automatically.
  • Dashboards without runbook links — alarms that fire without context force on-call engineers to rediscover the investigation steps every time. Every alarm description should link to a runbook.
  • Ignoring INSUFFICIENT_DATA alarms — this state often means the metric stopped flowing (metric filter broken, CloudWatch agent stopped, Lambda function not invoked). Treat it as actionable, not neutral.
  • No correlation ID strategy — debugging a multi-Lambda failure without correlation IDs means manually cross-referencing timestamps across three log groups. Define the propagation strategy before your first microservice goes to production.

See Also

  • Lambda & Serverless — Lambda Insights layer, EMF patterns, X-Ray active tracing, cold start metrics
  • IAM & Security — CloudTrail forensics for IAM events, least-privilege for X-Ray daemon roles
  • AWS Architecture — Observability as a Well-Architected operational excellence pillar
  • Compute & Containers — Container Insights for ECS/EKS, ADOT collector sidecar patterns
  • CI/CD & DevOps — Deployment monitoring, canary alarms for blue/green rollbacks
  • VPC & Networking — VPC Flow Logs as a CloudWatch Logs data source for network observability
  • Cost Optimization — CloudWatch cost management, log retention economics, X-Ray sampling math