Amazon RDS, Aurora, and ElastiCache cover the relational and in-memory database needs for the vast majority of production workloads on AWS. RDS gives you managed MySQL, PostgreSQL, and other engines with automated backups, Multi-AZ failover, and read replicas, without touching the operating system. Aurora is AWS's rewrite of the relational engine for the cloud: same SQL interfaces, but shared distributed storage spanning 3 AZs, sub-30-second failover, and up to 5× MySQL throughput. ElastiCache puts Redis or Memcached in a managed cluster for sub-millisecond caching, session storage, and pub/sub. Together they cover the spectrum from OLTP to microsecond reads, and understanding their trade-offs, especially versus DynamoDB, is essential for AWS architecture and senior engineering interviews.
RDS Multi-AZ:

App ─────────────▶ Primary DB Instance (AZ-A)
                        │ synchronous replication
                        ▼
                   Standby DB Instance (AZ-B)
                   (NOT readable, HA only)

Read Traffic ────▶ Read Replica 1 (AZ-A/B/C)
Read Traffic ────▶ Read Replica 2 (cross-region)

Each instance has its own EBS volume.
Aurora Cluster:

Writes ──────────▶ Writer Instance (AZ-A)
Reads ───────────▶ Reader Instance 1 (AZ-A)
Reads ───────────▶ Reader Instance 2 (AZ-B)
Reads ───────────▶ Reader Instance 3 (AZ-C)
                        │ all share
                        ▼
             ┌───────────────────────────┐
             │   Distributed Storage     │
             │   6 copies across 3 AZs   │
             │   auto-heals, 128 TB max  │
             └───────────────────────────┘

Single storage layer: no per-instance EBS volumes.
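The 6-copy design is a quorum system. A sketch of the arithmetic, using the 4/6 write quorum and 3/6 read quorum commonly described for Aurora (the safety conditions are the standard quorum rules, not AWS-specific code):

```typescript
// Quorum arithmetic: V copies, write quorum Vw, read quorum Vr.
// Safety requires Vr + Vw > V (every read overlaps the latest write)
// and 2 * Vw > V (two concurrent writes always overlap).
function quorumIsSafe(v: number, vw: number, vr: number): boolean {
  return vr + vw > v && 2 * vw > v;
}

// Copies that can be lost while still serving writes / reads.
function writeFaultTolerance(v: number, vw: number): number {
  return v - vw;
}
function readFaultTolerance(v: number, vr: number): number {
  return v - vr;
}

// Aurora: 6 copies, 4/6 write quorum, 3/6 read quorum.
quorumIsSafe(6, 4, 3);     // true
writeFaultTolerance(6, 4); // 2 — an entire AZ (2 copies) can fail
readFaultTolerance(6, 3);  // 3 — an AZ plus one more copy
```

This is why the marketing line is "tolerates an AZ failure for writes, AZ+1 for reads."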
Cluster Mode Disabled:
  Primary (AZ-A)
  Replica (AZ-B)
  Replica (AZ-C)
  (all nodes hold the same dataset)
  Max: 1 primary + 5 replicas; vertical scale only

Cluster Mode Enabled:
  Shard 0: slots 0000–3FFF
    ├─ Primary (AZ-A)
    └─ Replica (AZ-B)
  Shard 1: slots 4000–7FFF
    ├─ Primary (AZ-B)
    └─ Replica (AZ-C)
  ...up to 500 shards; horizontal + vertical scale
| Engine | Versions | License | Notes |
|---|---|---|---|
| MySQL | 8.0, 8.4 | GPL | Most popular; good Aurora MySQL compatibility |
| PostgreSQL | 15, 16, 17 | PostgreSQL | Best feature parity; extensions supported |
| MariaDB | 10.6, 10.11 | GPL | MySQL fork; good for MySQL migrations |
| Oracle | 19c, 21c | BYOL or LI | BYOL = bring your own license; LI = license included |
| SQL Server | SE, EE | License Included | SE cheaper; EE for full features |
| Db2 | 11.5 | BYOL | Enterprise workloads migrating from IBM |
| Class | Family | Purpose | When to use |
|---|---|---|---|
| db.t4g | Burstable | Dev/test | Low sustained CPU; credit-based bursting |
| db.m7g | General | Production OLTP | Balanced CPU/RAM; Graviton3 |
| db.r7g | Memory-optimized | Large datasets, analytics | 2× RAM per vCPU vs m-class |
| db.x2g | Extreme memory | In-memory workloads | SAP, large Oracle, heavy caching |
Graviton3 (g suffix) instances offer ~20–40% better price-performance than equivalent x86 instances for most RDS workloads.
| Type | Baseline IOPS | Max IOPS | Throughput | Use case |
|---|---|---|---|---|
| gp3 | 3,000 included | 64,000 | Up to 4,000 MB/s | Default for most workloads |
| io1 | Provisioned | 256,000 | Up to 4,000 MB/s | Highest IOPS, consistent latency |
| io2 | Provisioned | 256,000 | Up to 4,000 MB/s | io1 successor; higher durability |
| magnetic | ~100 | ~100 | Low | Legacy only; never use for new workloads |
gp3 key advantage: IOPS and throughput are provisioned independently of storage size. You can have a 100 GB volume with 10,000 IOPS without paying for a larger volume (unlike gp2, where IOPS scaled with size at 3 IOPS/GB).
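A quick sketch of why the decoupling matters, using the 3 IOPS/GB gp2 rule stated above:

```typescript
// gp2 couples IOPS to volume size at 3 IOPS/GB, so hitting a target IOPS
// means over-provisioning storage. gp3 provisions IOPS independently.
function gp2SizeNeededForIops(targetIops: number): number {
  return Math.ceil(targetIops / 3); // GB required just to unlock the IOPS
}

gp2SizeNeededForIops(10000); // 3334 GB of gp2, vs a 100 GB gp3 volume
```

With gp2 you would pay for ~3.3 TB of storage to get 10,000 IOPS; with gp3 you pay for the 100 GB you need plus the extra provisioned IOPS.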
Storage autoscaling: Enabled by default. Grows in 10 GB increments when free space falls below a threshold. Never shrinks automatically; plan your initial size knowing the floor is permanent.
Multi-AZ provides high availability, not scale-out. The standby replica in a second AZ receives synchronous replication from the primary. On failure, AWS updates the DNS CNAME to point to the standby, typically within 60–120 seconds.
What Multi-AZ does:
- Synchronous replication to a standby in another AZ (RPO ≈ 0)
- Automatic failover via DNS update when the primary or its AZ fails
- Reduced downtime during maintenance (patch the standby, fail over, patch the old primary)

What Multi-AZ does NOT do:
- Serve read traffic (the standby is not readable)
- Increase throughput or scale reads
- Protect against logical errors: a bad DELETE replicates to the standby instantly
Read Replicas use asynchronous replication and are fully readable. Use them to offload read traffic from the primary.
Key interview point: Multi-AZ standby cannot be promoted to standalone and is never readable. A Read Replica can be promoted but is not a failover target for Multi-AZ.
Parameter groups configure the DB engine itself. Every RDS instance is associated with a parameter group.
# PostgreSQL parameter group examples
max_connections = LEAST({DBInstanceClassMemory/9531392}, 5000)
work_mem = 65536 # 64 MB per sort operation
shared_buffers = {DBInstanceClassMemory/32768}
log_min_duration_statement = 1000 # log queries > 1s
Option groups add optional features on top of the engine, primarily for Oracle and SQL Server:
RDS Proxy solves the Lambda-at-scale connection problem. Lambda functions don't maintain persistent connections; each invocation may open a new DB connection, and at 1,000 concurrent Lambda invocations you get 1,000 simultaneous connections to your DB.
Without proxy: 1,000 Lambda invocations → 1,000 DB connections → OOM or connection refused.
With proxy: 1,000 Lambda invocations → Proxy → 20–50 pooled DB connections.
RDS Proxy maintains a warm connection pool and multiplexes Lambda connections onto a small number of backend connections. It also provides:
Cost: ~$0.015 per vCPU-hour of the RDS instance behind the proxy. For a db.r7g.2xlarge (8 vCPUs), that works out to roughly $85–90/month.
Automated backups:
- Daily snapshot plus continuous transaction logs; point-in-time recovery anywhere in the retention window (1–35 days)
- Backup storage is included up to the size of the DB

Manual snapshots:
- Taken on demand and kept until you delete them; they survive instance deletion
- Can be copied and shared across accounts

Cross-region snapshot copy:
- Copy snapshots to a DR region; encrypted snapshots require a KMS key in the target region
RDS applies OS patches and minor version upgrades during a configurable weekly maintenance window (30 minutes, e.g., sun:05:00-sun:05:30). For Multi-AZ: fail over to the standby → patch the old primary → fail back. Typical downtime < 5 minutes.
import { Duration } from 'aws-cdk-lib';
import * as rds from 'aws-cdk-lib/aws-rds';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
// Parameter group for PostgreSQL 16
const paramGroup = new rds.ParameterGroup(this, 'PgParams', {
engine: rds.DatabaseInstanceEngine.postgres({
version: rds.PostgresEngineVersion.VER_16,
}),
parameters: {
max_connections: '500',
work_mem: '65536',
log_min_duration_statement: '1000',
shared_preload_libraries: 'pg_stat_statements',
},
});
// Security group: only accept traffic from the app tier
// (vpc and appSg are assumed to be defined elsewhere in the stack)
const dbSg = new ec2.SecurityGroup(this, 'DbSg', { vpc });
dbSg.addIngressRule(appSg, ec2.Port.tcp(5432));
// Primary instance with Multi-AZ
const primary = new rds.DatabaseInstance(this, 'Primary', {
engine: rds.DatabaseInstanceEngine.postgres({
version: rds.PostgresEngineVersion.VER_16,
}),
instanceType: ec2.InstanceType.of(
ec2.InstanceClass.R7G,
ec2.InstanceSize.XLARGE2,
),
vpc,
vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_ISOLATED },
securityGroups: [dbSg],
multiAz: true,
storageType: rds.StorageType.GP3,
allocatedStorage: 100,
maxAllocatedStorage: 500, // autoscaling up to 500 GB
iops: 6000, // gp3: provision independently
parameterGroup: paramGroup,
backupRetention: Duration.days(14),
deletionProtection: true,
storageEncrypted: true,
});
// Read replica in a different AZ
const readReplica = new rds.DatabaseInstanceReadReplica(this, 'ReadReplica', {
sourceDatabaseInstance: primary,
instanceType: ec2.InstanceType.of(
ec2.InstanceClass.R7G,
ec2.InstanceSize.XLARGE, // smaller: reads only
),
vpc,
availabilityZone: 'us-east-1c',
});
Aurora is not a hosted version of MySQL or PostgreSQL; it's a purpose-built cloud database that speaks those protocols. The core innovation is the shared distributed storage layer, which is physically separate from the compute instances.
An Aurora cluster consists of:
- One writer instance handling all writes
- Up to 15 reader instances spread across AZs
- A cluster (writer) endpoint, a load-balanced reader endpoint, and optional custom endpoints

Aurora storage is not per-instance EBS. It is a distributed storage system that:
- Keeps 6 copies of the data across 3 AZs
- Acknowledges a write once a 4/6 quorum of copies confirms it
- Detects and repairs failed segments automatically in the background
- Grows automatically up to 128 TB
Key interview point: Because all instances share the same storage, adding a reader to Aurora is fast (minutes, no data copy). Adding a Read Replica to standard RDS requires copying the full dataset to a new EBS volume.
Aurora writer failover is < 30 seconds (vs 60–120 seconds for RDS Multi-AZ) because:
- The promoted reader already shares the storage volume, so there is no data copy or log replay
- Aurora promotes an existing warm reader instead of starting a cold standby
- The cluster endpoint repoints to the new writer without waiting on long DNS TTLs
Reader tiers (0–15): lower tier = promoted first. Assign your best-provisioned reader to tier 0 for fastest failover.
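A minimal sketch of the tier-based promotion rule (lowest tier wins; using instance size as the tie-breaker, which is an assumption about ties rather than documented Aurora behavior):

```typescript
interface Reader {
  id: string;
  tier: number;      // 0-15; lower tier is promoted first
  memoryGiB: number; // assumed tie-breaker: larger instance wins
}

// Pick the failover target: lowest tier first, then largest instance.
function promotionTarget(readers: Reader[]): Reader | undefined {
  return [...readers].sort(
    (a, b) => a.tier - b.tier || b.memoryGiB - a.memoryGiB,
  )[0];
}

const readers: Reader[] = [
  { id: 'reader-1', tier: 1, memoryGiB: 64 },
  { id: 'reader-2', tier: 0, memoryGiB: 128 },
  { id: 'reader-3', tier: 0, memoryGiB: 32 },
];
promotionTarget(readers)?.id; // 'reader-2': tier 0 and best provisioned
```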
| Metric | Aurora MySQL | Aurora PostgreSQL |
|---|---|---|
| vs RDS MySQL, same hardware | Up to 5× throughput | n/a |
| vs RDS PostgreSQL, same hardware | n/a | Up to 3× throughput |
| Reason | Log-structured writes bypass binlog, distributed redo | Shared storage + write quorum |
| Attribute | Value |
|---|---|
| Secondary regions | Up to 5 |
| Replication lag | Typically < 1 second |
| Replication mechanism | Storage-level (not binlog/WAL shipping) |
| Secondary regions | Read-only by default |
| Failover type | Manual (with automation via scripts/Lambda) |
| RTO on failover | < 1 minute (manual promotion) |
| RPO on failover | < 1 second (typical replication lag) |
Use cases: DR with near-zero RPO, low-latency reads near users in multiple continents, compliance (data in specific region, readable globally).
Failover process: detach the secondary cluster → promote it to standalone → update application connection strings. AWS does not do this automatically (as of 2025); you orchestrate it via Lambda or Terraform.
Aurora Serverless v2 scales the compute layer automatically in 0.5 ACU increments.
ACU (Aurora Capacity Unit): ~2 GB RAM + proportional CPU + network. A 4 ACU instance has ~8 GB RAM.
| Attribute | Aurora Serverless v2 | Aurora Serverless v1 (legacy) |
|---|---|---|
| Scale increment | 0.5 ACU | Doubling steps |
| Scale speed | < 1 second | Minutes |
| Minimum capacity | 0.5 ACU | 1 ACU |
| Maximum capacity | 256 ACU | 256 ACU |
| Cold start | No (stays warm) | Yes (scales to zero) |
| Scales to zero | No (0.5 ACU min) | Yes |
| Can mix with provisioned | Yes (same cluster) | No |
| Connections | Normal Aurora behavior | Limited |
When to use Serverless v2:
- Spiky or unpredictable traffic (capacity adjusts in under a second)
- Dev/test environments that sit idle for long stretches
- Multi-tenant clusters where per-tenant load varies widely
- Mixed clusters: a provisioned writer plus serverless readers for burst read capacity
Limitation: Serverless v2 does not scale to zero. For true zero-cost idle, you need v1 (deprecated, limited) or schedule start/stop via Lambda.
Backtrack rewinds the cluster to a prior point in time in-place, without creating a new cluster. Up to 72-hour window. Much faster than snapshot restore for recovering from DROP TABLE or bad migrations.
Cost: $0.012/million change records stored.
Limitation: Not available for Aurora PostgreSQL. For PostgreSQL, use snapshot restore or point-in-time recovery.
Aurora can call SageMaker and Amazon Comprehend directly from SQL using native functions. You write standard SQL; Aurora calls the ML endpoint and returns results as columns.
-- Sentiment analysis using Comprehend from Aurora PostgreSQL
SELECT review_text,
       aws_comprehend.detect_sentiment(review_text, 'en') AS sentiment
FROM product_reviews
WHERE created_at > NOW() - INTERVAL '7 days';
Standard Aurora charges per I/O request (~$0.20 per million requests). For I/O-heavy workloads, the Aurora I/O-Optimized configuration removes the per-request charge in exchange for higher instance and storage pricing:
Break-even: If I/O cost exceeds ~25% of your total cluster bill, I/O-Optimized is cheaper. Good fit for OLAP, heavy analytics, or write-heavy workloads.
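The ~25% rule above can be checked with simple arithmetic (the $0.20/million rate is from the text; the workload numbers are made up for illustration):

```typescript
// Break-even check: if I/O charges are more than a quarter of the total
// Aurora bill, I/O-Optimized likely comes out cheaper.
const IO_PRICE_PER_MILLION = 0.2; // $ per million I/O requests, standard tier

function shouldUseIoOptimized(
  monthlyIoRequestsMillions: number,
  monthlyInstanceAndStorageCost: number,
): boolean {
  const ioCost = monthlyIoRequestsMillions * IO_PRICE_PER_MILLION;
  return ioCost / (ioCost + monthlyInstanceAndStorageCost) > 0.25;
}

shouldUseIoOptimized(5000, 1500); // true: $1,000 of I/O on a $2,500 bill (40%)
shouldUseIoOptimized(100, 1500);  // false: $20 of I/O is ~1% of the bill
```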
| Feature | RDS Multi-AZ | Aurora HA | Aurora Global |
|---|---|---|---|
| Replicas | 1 standby (not readable) | Up to 15 readers (readable) | Up to 5 secondary regions |
| Failover time | 60–120s | < 30s | < 1 min (manual) |
| Replication | Synchronous (block-level) | Storage-level (shared) | Storage-level async |
| Read scale-out | No | Yes | Read-only secondary regions |
| RPO | ~0 | ~0 | < 1s typical |
| Cross-region | Only via snapshots | Via Global DB feature | Native |
| Data loss on failover | None | None | < 1s |
Traditional DB connections are expensive: each connection holds ~5–10 MB of RAM on the DB server, has a TCP handshake cost, and takes time to authenticate. An RDS PostgreSQL db.r7g.xlarge instance (32 GB RAM) gets roughly 3,600 max connections from the default formula before memory becomes the constraint.
Lambda at scale is the worst pattern for this:
Lambda container 1 ──┐
Lambda container 2 ──┤
Lambda container 3 ──┼──▶ RDS Proxy ──▶ Aurora PostgreSQL
        ...          │    (1000 frontend   (~50 backend
Lambda container N ──┘     connections)     connections)
1000+ connections multiplexed to ~50
Lambda function
  │ 1. Assume the execution IAM role
  │ 2. Generate an IAM auth token (15-min expiry; requires rds-db:connect)
  ▼
RDS Proxy
  │ 3. Proxy validates the IAM token
  │ 4. Proxy fetches the master credentials from Secrets Manager
  ▼
Aurora PostgreSQL
  5. Proxy connects with the master credentials
  6. Lambda never sees the DB password
import { Signer } from '@aws-sdk/rds-signer';
import { Client } from 'pg';
// Initialized ONCE outside handler (warm start reuse)
const signer = new Signer({
region: process.env.AWS_REGION!,
hostname: process.env.PROXY_ENDPOINT!, // RDS Proxy endpoint
port: 5432,
username: 'lambda_user',
});
let pgClient: Client | null = null;
async function getConnection(): Promise<Client> {
// Reuse across warm invocations; _connected is a private pg field,
// so treat this as a best-effort check
if (pgClient && (pgClient as any)._connected) return pgClient;
// IAM auth token: valid for 15 minutes; generate one per new connection
const token = await signer.getAuthToken();
pgClient = new Client({
host: process.env.PROXY_ENDPOINT,
port: 5432,
database: process.env.DB_NAME,
user: 'lambda_user',
password: token, // IAM token as password
ssl: { rejectUnauthorized: true },
connectionTimeoutMillis: 5000,
});
await pgClient.connect();
return pgClient;
}
export const handler = async (event: AWSLambda.APIGatewayEvent) => {
const client = await getConnection();
const result = await client.query(
'SELECT id, name FROM users WHERE tenant_id = $1 LIMIT 100',
[event.pathParameters?.tenantId],
);
return {
statusCode: 200,
body: JSON.stringify(result.rows),
};
};
Supported backends: RDS MySQL 5.7/8.0, RDS PostgreSQL 13+, Aurora MySQL 5.7/8.0, Aurora PostgreSQL 13+.
Failover improvement: The proxy caches the new writer endpoint and redirects connections in < 10 seconds, compared to the 60–120 seconds direct connections spend waiting for the DNS TTL to expire.
As of 2024, AWS ElastiCache offers Valkey (the Redis fork maintained by AWS and others under the Linux Foundation after the Redis license change) alongside Redis OSS. The Valkey API is fully compatible with Redis 7.x.
Primary (us-east-1a)         ← writes
├── Replica 1 (us-east-1b)   ← reads
└── Replica 2 (us-east-1c)   ← reads, also a failover target
When to switch from Disabled to Enabled: Dataset > single node limit, or throughput > ~200K ops/sec on a single shard.
| Data Type | Redis Command | Use Case |
|---|---|---|
| String | SET/GET | Simple cache, counters, flags |
| Hash | HSET/HGET | User sessions, objects |
| List | LPUSH/RPOP | Job queues, activity feeds |
| Set | SADD/SMEMBERS | Tags, unique visitors |
| Sorted Set | ZADD/ZRANGE | Leaderboards, priority queues |
| HyperLogLog | PFADD/PFCOUNT | Cardinality estimation (~0.81% error) |
| Stream | XADD/XREAD | Event streams, logs |
| Geospatial | GEOADD/GEODIST | Location-based queries |
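The ~0.81% HyperLogLog figure in the table comes from the standard error bound 1.04/√m, where m is the register count; Redis uses 2^14 = 16,384 registers:

```typescript
// Standard HyperLogLog relative error: 1.04 / sqrt(m),
// where m is the number of registers (16,384 in Redis).
function hllStdError(registers: number): number {
  return 1.04 / Math.sqrt(registers);
}

hllStdError(16384); // 0.008125, i.e. the ~0.81% error quoted above
```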
| Mode | Mechanism | Durability | Performance Impact |
|---|---|---|---|
| None | No persistence | Data lost on restart | Fastest |
| RDB | Snapshot at intervals | Up to one snapshot interval of loss | Low (background fork) |
| AOF | Log every write | Near-zero data loss | Medium (fsync options) |
| RDB + AOF | Both | Near-zero data loss | Slightly higher |
For pure cache: no persistence; after a restart, repopulate from the source DB.
For session store: AOF with appendfsync everysec; tolerates at most 1 second of loss.
Cache-aside (lazy loading), the most common pattern:
Read:
1. Check cache
2a. Cache hit → return data
2b. Cache miss → query DB → store in cache with TTL → return data
Write:
1. Write to DB
2. Invalidate cache key (or update cache)
Write-through:
Write:
1. Write to cache
2. Write to DB (synchronously)
Read:
1. Always hits cache (populated on write)
Write-behind (risky):
Write:
1. Write to cache
2. Asynchronously flush to DB (batched)
Risk: cache node fails before flush → data loss
Read-through: App talks only to cache; cache fetches from DB on miss. Requires cache-aware library.
import Redis from 'ioredis';
import { Pool } from 'pg';
const redis = new Redis({
host: process.env.REDIS_ENDPOINT,
port: 6379,
tls: {}, // ElastiCache in-transit encryption
lazyConnect: true,
});
const pg = new Pool({ connectionString: process.env.DATABASE_URL });
const CACHE_TTL = 300; // 5 minutes
async function getUserById(userId: string) {
const cacheKey = `user:${userId}`;
// 1. Try cache first
const cached = await redis.get(cacheKey);
if (cached) {
return JSON.parse(cached);
}
// 2. Cache miss β query DB
const result = await pg.query(
'SELECT id, name, email, tier FROM users WHERE id = $1',
[userId],
);
const user = result.rows[0] ?? null;
// 3. Populate cache with TTL
if (user) {
await redis.setex(cacheKey, CACHE_TTL, JSON.stringify(user));
}
return user;
}
type User = { name: string; tier: string };

async function updateUser(userId: string, data: Partial<User>) {
// 1. Write to DB first (source of truth)
await pg.query(
'UPDATE users SET name = $1, tier = $2 WHERE id = $3',
[data.name, data.tier, userId],
);
// 2. Invalidate cache; next read will repopulate
await redis.del(`user:${userId}`);
// Alternative: update cache directly
// await redis.setex(`user:${userId}`, CACHE_TTL, JSON.stringify({...user, ...data}));
}
// Rate limiting with sorted sets
async function checkRateLimit(
clientId: string,
maxRequests: number,
windowSecs: number,
): Promise<boolean> {
const key = `ratelimit:${clientId}`;
const now = Date.now();
const windowStart = now - windowSecs * 1000;
const pipeline = redis.pipeline();
pipeline.zremrangebyscore(key, '-inf', windowStart); // remove old entries
pipeline.zadd(key, now, `${now}-${Math.random()}`); // unique member per request (avoids same-ms collisions)
pipeline.zcard(key); // count in window
pipeline.expire(key, windowSecs); // reset TTL
const results = await pipeline.exec();
const count = results?.[2]?.[1] as number;
return count <= maxRequests;
}
Memcached is the simpler option: pure key-value, no persistence, no replication, multi-threaded.
| Feature | Redis (Valkey) | Memcached |
|---|---|---|
| Data types | Rich (8+ types) | String/Binary only |
| Persistence | Yes (RDB/AOF) | No |
| Replication | Yes | No |
| Pub/Sub | Yes | No |
| Lua scripting | Yes | No |
| Multi-threading | Single-threaded (Redis 6+ I/O threads) | Native multi-thread |
| Cluster sharding | Yes (hash slots) | Client-side (ketama) |
| Failover | Automatic (Multi-AZ) | None (client reroutes) |
Choose Memcached when: You only need simple string caching, want multi-threaded performance for CPU-bound cache operations, and don't need persistence or complex data structures.
Choose Redis for everything else: persistence, sessions, leaderboards, pub/sub, Lua scripts, replication.
Both sit in front of a database to accelerate reads, but they serve different purposes.
| Attribute | ElastiCache Redis | DynamoDB DAX |
|---|---|---|
| Primary DB | Any (RDS, Aurora, external) | DynamoDB only |
| Data types | Rich Redis types | DynamoDB items only |
| API | Redis protocol (custom client) | DynamoDB API (drop-in) |
| Consistency | Application-managed | Eventual consistency |
| Read latency | ~200 µs–1 ms | Single-digit microseconds |
| Cache invalidation | Manual (app deletes keys) | Automatic on DAX write-through |
| Write caching | Optional (write-through) | Write-through to DynamoDB |
| Hot partition relief | App must route to Redis | Automatic (DAX absorbs hot reads) |
| Session store | Yes (natural fit) | Awkward (DynamoDB data model) |
| Rate limiting | Yes (sorted sets) | No |
| Cost model | Per node-hour | Per node-hour (DAX cluster) |
| Cluster scale | Up to 500 shards | Up to 10 nodes |
When to use DAX: You're already on DynamoDB, you have hot partition issues, you need sub-millisecond reads, and you want zero application changes (DAX is API-compatible).
When to use ElastiCache: Your data lives in RDS/Aurora/any DB, you need complex caching patterns, session store, rate limiting, pub/sub, or you want full control over invalidation.
See DynamoDB for DynamoDB access patterns and DAX deep-dive.
| Criterion | RDS PostgreSQL | Aurora Serverless v2 | Aurora PostgreSQL (provisioned) | DynamoDB | ElastiCache Redis |
|---|---|---|---|---|---|
| Schema | Fixed, relational | Fixed, relational | Fixed, relational | Flexible (schemaless) | Key-value / typed |
| Scale pattern | Vertical + replicas | Auto (0.5–256 ACU) | Vertical + 15 readers | Horizontal (unlimited) | Vertical + sharding |
| ACID | Full | Full | Full | Per-item only | No (single command atomic) |
| Join support | Yes | Yes | Yes | No (single table only) | No |
| Max throughput | ~50K TPS (r7g.8xl) | ~256 ACU worth | ~500K TPS (32xl) | Unlimited (provisioned) | ~1M ops/sec (cluster) |
| Cost model | Per instance-hr | Per ACU-hr | Per instance-hr | Per RCU/WCU or on-demand | Per node-hr |
| Operational complexity | Medium | Low | Medium | Low | Medium |
| Ideal use case | OLTP, migrations | Unpredictable traffic | High-throughput OLTP | Key-value, event stores | Cache, sessions |
Need complex joins + ACID + < 10K TPS?
→ Aurora PostgreSQL (provisioned) or RDS PostgreSQL
Need > 10K TPS + simple access patterns + global scale?
→ DynamoDB (with provisioned capacity)
Traffic is unpredictable and may spike 10× in seconds?
→ Aurora Serverless v2
Sub-millisecond reads + cache + session store + rate limiting?
→ ElastiCache Redis
Migrating on-prem Oracle/SQL Server?
→ RDS (matching engine, minimize migration risk)
Multi-tenant SaaS, < 500 tenants, different schemas per tenant?
→ Aurora PostgreSQL, one schema per tenant (search_path)
Multi-tenant SaaS, > 10K tenants, same schema?
→ DynamoDB (tenant_id as partition key) or Aurora Serverless v2
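The flow above translates directly into code; a sketch with simplified inputs (the interface and function names are illustrative, not an AWS API):

```typescript
interface Workload {
  needsJoins: boolean;
  tps: number;
  unpredictableSpikes: boolean;
  subMillisecondReads: boolean;
}

// A straight top-to-bottom encoding of the decision flow above.
function pickDatastore(w: Workload): string {
  if (w.subMillisecondReads) return 'ElastiCache Redis (in front of a DB)';
  if (w.unpredictableSpikes) return 'Aurora Serverless v2';
  if (w.needsJoins && w.tps < 10_000) return 'Aurora/RDS PostgreSQL';
  if (!w.needsJoins && w.tps > 10_000) return 'DynamoDB';
  return 'Aurora PostgreSQL (provisioned)'; // default relational choice
}

pickDatastore({
  needsJoins: true,
  tps: 5_000,
  unpredictableSpikes: false,
  subMillisecondReads: false,
}); // 'Aurora/RDS PostgreSQL'
```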
max_connections formula:
RDS sets max_connections dynamically based on RAM:
max_connections = LEAST({DBInstanceClassMemory / 9531392}, 5000)
A db.r7g.2xlarge (64 GB RAM) computes to roughly 7,200 by the formula, so the 5,000 cap applies. But each connection uses ~5–10 MB of RAM for its working memory; at the cap that is 25–50 GB just for connections. Use RDS Proxy or PgBouncer to stay well below this limit.
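The formula can be checked directly (memory in bytes; the 9,531,392 divisor and 5,000 cap are taken from the default parameter group shown above):

```typescript
// RDS PostgreSQL default: LEAST(DBInstanceClassMemory / 9531392, 5000)
function defaultMaxConnections(memoryGiB: number): number {
  const memoryBytes = memoryGiB * 1024 ** 3;
  return Math.min(Math.floor(memoryBytes / 9531392), 5000);
}

defaultMaxConnections(16); // 1802, e.g. a 16 GB instance
defaultMaxConnections(64); // 5000: the cap kicks in (raw value ≈ 7,200)
```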
PgBouncer (self-managed connection pooler for PostgreSQL):
[pgbouncer]
pool_mode = transaction # best for web apps (share conn per transaction)
max_client_conn = 10000 # clients PgBouncer accepts
default_pool_size = 50 # backend connections to Aurora
server_idle_timeout = 600
Transaction mode: a backend connection is held only for the duration of one transaction, then returned to pool. Incompatible with SET commands, advisory locks, and prepared statements that span transactions.
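A useful sizing rule under transaction pooling is Little's law: concurrent backend connections ≈ transaction rate × mean transaction duration. A sketch (the TPS and latency numbers are illustrative assumptions):

```typescript
// Little's law applied to pooling: backend connections needed is roughly
// TPS * mean transaction duration. With transaction-mode pooling,
// default_pool_size only has to cover this concurrency, no matter how
// many idle clients are connected to PgBouncer.
function backendConnectionsNeeded(tps: number, meanTxnSeconds: number): number {
  return Math.ceil(tps * meanTxnSeconds);
}

backendConnectionsNeeded(1000, 0.02);  // 20: 1,000 TPS at 20 ms per txn
backendConnectionsNeeded(2000, 0.025); // 50: matches default_pool_size above
```

This is why a pool of 50 backend connections can comfortably serve 10,000 connected clients, as in the config above.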
Performance Insights: Aurora and RDS surface top SQL by average active sessions (AAS) and wait states. Key wait states:
| Wait State | Meaning | Fix |
|---|---|---|
| io/file/sql/binlog | Writing binlog | Disable binlog (Aurora has none) or tune sync |
| lock/table/sql_lock | Table lock contention | Use row-level locking, shorter transactions |
| wait/io/aurora_redo_log_flush | Redo log write | Use io2 or Aurora I/O-Optimized |
| CPU | CPU bound | Add index, optimize query, scale instance |
pg_stat_statements (PostgreSQL):
-- Enable in parameter group: shared_preload_libraries = pg_stat_statements
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Find slowest queries by total time
SELECT
query,
calls,
round(total_exec_time::numeric, 2) AS total_ms,
round(mean_exec_time::numeric, 2) AS mean_ms,
round(rows::numeric / calls, 1) AS avg_rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
Index strategies for PostgreSQL:
-- Partial index: only index rows that matter
CREATE INDEX idx_orders_pending
ON orders (created_at)
WHERE status = 'pending';
-- Covering index: include extra columns to avoid heap fetch
CREATE INDEX idx_users_email_covering
ON users (email)
INCLUDE (name, tier);
-- Expression index: index on derived value
CREATE INDEX idx_users_lower_email
ON users (LOWER(email));
-- Analyze a specific query
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders
WHERE user_id = 'abc' AND status = 'pending'
ORDER BY created_at DESC LIMIT 10;
Aurora reader endpoint routing: The reader endpoint load-balances across all reader instances. For analytics vs OLTP routing:
// Application-level routing
const writePool = new Pool({ host: process.env.AURORA_WRITER_ENDPOINT });
const readPool = new Pool({ host: process.env.AURORA_READER_ENDPOINT });
async function getUser(id: string) {
return readPool.query('SELECT * FROM users WHERE id = $1', [id]);
}
async function createOrder(data: Order) {
return writePool.query('INSERT INTO orders ...', [...]);
}
Eviction policies:
| Policy | Behavior | Use when |
|---|---|---|
| noeviction | Return error when memory full | Session store; never lose data |
| allkeys-lru | Evict least-recently-used from all keys | General cache |
| volatile-lru | Evict LRU from keys with TTL set | Mixed cache + session store |
| allkeys-lfu | Evict least-frequently-used | Frequency-based caching (Redis 4+) |
| volatile-ttl | Evict shortest-TTL keys first | When you want predictable expiry |
Memory fragmentation: When Redis reallocates memory frequently, fragmentation ratio rises.
# Check via redis-cli INFO memory
mem_fragmentation_ratio: 1.8 # > 1.5 = significant fragmentation
Fix: MEMORY PURGE (Redis 4+) or schedule a rolling node restart. Fragmentation above 1.5 wastes significant RAM.
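The ratio comes from two fields of INFO memory; a sketch of the check against the 1.5 threshold mentioned above:

```typescript
// mem_fragmentation_ratio = used_memory_rss / used_memory (INFO memory).
// Above 1.5, the allocator holds significantly more RSS than Redis uses.
function fragmentationRatio(usedMemoryRss: number, usedMemory: number): number {
  return usedMemoryRss / usedMemory;
}

function needsDefrag(usedMemoryRss: number, usedMemory: number): boolean {
  return fragmentationRatio(usedMemoryRss, usedMemory) > 1.5;
}

needsDefrag(9_000_000_000, 5_000_000_000); // true: ratio 1.8, time to purge
needsDefrag(6_000_000_000, 5_000_000_000); // false: ratio 1.2 is healthy
```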
Pipelining commands to reduce round trips:
// Instead of 5 separate await redis.get() calls:
const pipeline = redis.pipeline();
pipeline.get('user:1');
pipeline.get('user:2');
pipeline.get('user:3');
pipeline.get('user:4');
pipeline.get('user:5');
const results = await pipeline.exec();
// 1 round trip instead of 5
Cluster mode sizing: switch to cluster mode (enabled) when:
- The dataset no longer fits in a single node's memory
- Throughput exceeds ~200K ops/sec on a single shard
- You need to scale writes horizontally, not just reads
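A back-of-the-envelope shard count takes the larger of the memory-driven and throughput-driven requirements. The ~200K ops/sec/shard figure is from the guidance above; per-node usable memory is an assumption that depends on instance class:

```typescript
// Shards needed = max(memory-driven count, throughput-driven count).
function shardsNeeded(
  datasetGiB: number,
  opsPerSec: number,
  nodeMemoryGiB: number,     // assumption: usable memory per node
  opsPerShard = 200_000,     // per-shard throughput ceiling from the text
): number {
  const byMemory = Math.ceil(datasetGiB / nodeMemoryGiB);
  const byThroughput = Math.ceil(opsPerSec / opsPerShard);
  return Math.max(byMemory, byThroughput);
}

shardsNeeded(500, 600_000, 100); // 5: memory wants 5 shards, throughput wants 3
```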
| Method | Scope | Retention | Cross-Region | Cost |
|---|---|---|---|---|
| RDS automated backup | RDS/Aurora | 1–35 days | Manual copy | Included up to DB size |
| Manual snapshot | RDS/Aurora | Until deleted | Copy supported | $0.095/GB-month |
| Aurora Backtrack | Aurora MySQL | Up to 72 hr | No | $0.012/million change records |
| AWS Backup | RDS, Aurora, DynamoDB, EFS | Policy-based | Yes | Backup storage + copy |
Cross-region snapshot automation (EventBridge + Lambda):
import {
  RDSClient,
  DescribeDBSnapshotsCommand,
  CopyDBSnapshotCommand,
} from '@aws-sdk/client-rds';

// EventBridge rule: cron(0 2 * * ? *) triggers daily at 2am UTC
export const handler = async () => {
const rds = new RDSClient({ region: 'us-east-1' });
// Copy latest automated snapshot to DR region
const snapshots = await rds.send(new DescribeDBSnapshotsCommand({
DBInstanceIdentifier: 'prod-postgres',
SnapshotType: 'automated',
}));
const latest = snapshots.DBSnapshots
?.sort((a, b) =>
(b.SnapshotCreateTime?.getTime() ?? 0) - (a.SnapshotCreateTime?.getTime() ?? 0)
)[0];
if (!latest?.DBSnapshotArn) return;
await new RDSClient({ region: 'eu-west-1' }).send(
new CopyDBSnapshotCommand({
SourceDBSnapshotIdentifier: latest.DBSnapshotArn,
TargetDBSnapshotIdentifier: `dr-${latest.DBSnapshotIdentifier}`,
SourceRegion: 'us-east-1',
KmsKeyId: process.env.DR_KMS_KEY_ARN,
CopyTags: true,
}),
);
};
At-rest: KMS-managed keys. Encryption is set at creation time and cannot be changed on a live instance. To re-encrypt with a different key: take a snapshot, copy the snapshot specifying the new KMS key, then restore a new instance from the encrypted copy.
In-transit: set rds.force_ssl = 1 in the PostgreSQL parameter group to require TLS. RDS server certificates are managed by AWS, and the newer CAs support automatic rotation.
IAM DB Auth (passwordless):
-- PostgreSQL: create the IAM-authenticated user
CREATE USER lambda_user;
GRANT rds_iam TO lambda_user;
GRANT SELECT, INSERT, UPDATE ON TABLE orders TO lambda_user;
// Lambda: generate token and connect
const signer = new Signer({ region, hostname, port: 5432, username: 'lambda_user' });
const token = await signer.getAuthToken(); // 15-min expiry, signed with IAM credentials
IAM tokens expire every 15 minutes. Generate a fresh token for each new connection. With RDS Proxy, the proxy handles token validation and manages long-lived backend connections.
AWS Backup provides a unified, policy-based backup service across RDS, Aurora, DynamoDB, EFS, EBS, and FSx. Key features:
- Centralized backup plans (schedule plus retention) applied by tag or by resource
- Cross-region and cross-account copy for DR
- Vault Lock for immutable, compliance-grade retention
See IAM & Security for KMS key management and Observability for backup monitoring with CloudWatch.
| Engine | Audit Method | Destination |
|---|---|---|
| PostgreSQL | pgaudit extension + log_statement | CloudWatch Logs |
| MySQL | General log + Audit Log plugin | CloudWatch Logs |
| Oracle | Native Audit (option group) | CloudWatch Logs |
| SQL Server | SQL Server Audit (option group) | CloudWatch Logs |
Enable log_connections, log_disconnections, and log_statement = 'ddl' at minimum for compliance workloads.
Q: When would you choose Aurora over standard RDS? What are the trade-offs?
A: Aurora when you need faster failover (< 30s vs 60–120s), more read replicas (15 vs 5), higher throughput (3–5× on the same hardware), or in-place rewind without a restore (Backtrack on Aurora MySQL). Trade-offs: Aurora costs more per instance (roughly a 20% premium), has slightly higher latency for single-row reads because of shared-storage network hops, and you lose some engine-specific features (e.g., Aurora PostgreSQL lags community PostgreSQL by 1–2 minor versions). For a standard 2-AZ OLTP app with < 5K TPS and no read-scaling needs, standard RDS is simpler and cheaper.
Q: Explain Aurora's storage architecture. How is it different from RDS Multi-AZ?
A: RDS Multi-AZ uses two separate EBS volumes: the primary and standby each have their own block storage, synchronized at the block level. Aurora uses a single shared distributed storage layer: 6 copies across 3 AZs with a 4/6 write quorum, so writes keep flowing even after losing 2 copies (an entire AZ). All Aurora instances (writer plus all readers) read from the same storage pages. No data copy is needed when adding readers, and failover is fast because the promoted reader already has the storage state.
Q: How does Aurora Serverless v2 scale? What are its limitations?
A: Serverless v2 scales in 0.5 ACU increments (each ACU is ~2 GB RAM plus proportional CPU). Scaling happens in under a second: it monitors CPU and memory pressure and adjusts capacity near-continuously. Minimum is 0.5 ACU; it never scales to zero, so you always pay at least ~$0.06/hr at minimum capacity. Maximum is 256 ACU. Limitations: doesn't scale to zero (unlike v1), no support for Aurora parallel query, and Backtrack is not available on Serverless v2 clusters.
Q: A Lambda function is timing out connecting to RDS. Walk me through your diagnosis.
A: First, check networking: Lambda must be attached to the same VPC (or a peered one), and the RDS security group must allow inbound on port 5432/3306 from Lambda's security group. Second, check max_connections; if the DB is at its limit, new connections queue or fail. Third, check whether RDS Proxy is configured; if not, 1,000 concurrent Lambdas will try to open 1,000+ connections. Fourth, confirm the Lambda execution role has the rds-db:connect permission if using IAM auth. Fifth, check CloudWatch Logs for RDS connection errors and Enhanced Monitoring for CPU/memory spikes on the DB. Fix: add RDS Proxy between Lambda and RDS, grant rds-db:connect, and verify security group rules. See Lambda for Lambda VPC networking details.
Q: When would you use ElastiCache Redis vs DynamoDB for a session store?
A: ElastiCache Redis is the natural fit for session storage. Sessions are small JSON blobs accessed by session ID, which is exactly the Redis SET key value EX ttl pattern. Redis gives sub-millisecond reads, TTL-based expiry, and SCAN for session enumeration. DynamoDB can work, but you're paying RCU/WCU for what amounts to simple key-value lookups, and TTL-based expiry in DynamoDB (via the ttl attribute) can take up to 48 hours to actually delete expired items. Redis is cheaper and faster for this pattern. Use DynamoDB sessions only if you're already on DynamoDB and want to avoid adding another service.
Q: How do you handle read replicas with a cache layer β what's the cache invalidation strategy?
A: Cache invalidation with replicas requires awareness of replica lag. On write: invalidate the cache key immediately after writing to the primary. On the next read, check the cache; on a miss, read from the primary (not a replica) to avoid caching stale data from a lagged replica. For reads that tolerate eventual consistency, read from a replica and populate the cache. The safest pattern: invalidate on write, repopulate from the primary on first miss, then serve from cache for all subsequent reads within the TTL window. Monitor replica lag via the ReplicaLag CloudWatch metric and adjust TTL accordingly.
Q: What's the difference between RDS Multi-AZ and a Read Replica? Can you promote a Multi-AZ standby?
A: Multi-AZ standby: synchronous replication to a standby in a different AZ, not readable, purely for HA failover. You cannot promote it to a standalone instance; it exists only as the failover target. Read Replica: asynchronous replication, can be in any AZ or region, fully readable, and can be promoted to standalone (cutting replication permanently, making it an independent DB). The standby in Multi-AZ is opaque to you: you can't connect to it, query it, or promote it. If you need both HA and read scale-out, deploy Multi-AZ (for failover) plus Read Replicas (for read scale).
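Promoting a Read Replica, the operation that has no Multi-AZ-standby equivalent, is a single API call; a hedged boto3 sketch with a placeholder identifier, written with an injectable client so it can be exercised offline:

```python
def promote_replica(replica_id: str, rds=None):
    """Promote a Read Replica to a standalone instance. This permanently
    breaks replication from the source; there is no such call for a
    Multi-AZ standby, which can only be failed over to."""
    if rds is None:
        import boto3  # lazy import keeps the helper testable offline
        rds = boto3.client("rds")
    return rds.promote_read_replica(DBInstanceIdentifier=replica_id)
```

In production you would typically also pass BackupRetentionPeriod so the newly independent instance keeps automated backups.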
Q: How would you design a database architecture for a multi-tenant SaaS with 500 tenants?
A: With 500 tenants, the main options are: (1) Shared DB, schema-per-tenant: one Aurora PostgreSQL cluster where each tenant gets a PostgreSQL schema; the application sets search_path = tenant_123 at connection time. Simplest to operate, data isolation via schema, one migration per tenant at deploy. Downside: noisy-neighbor risk, and performance degrades beyond roughly 1,000 schemas per DB. (2) Shared DB, table-per-tenant: similar but noisier. (3) Dedicated cluster per tenant: higher isolation, higher cost (500 Aurora clusters). For 500 tenants, schema-per-tenant on Aurora Serverless v2 is the sweet spot: low idle cost per schema, instant scale on tenant spikes, strong isolation, and operational simplicity. RDS Proxy handles connection pooling across schemas. For compliance-heavy tenants, offer a dedicated cluster option at premium pricing. See AWS Architecture for multi-tenancy patterns.
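The schema-per-tenant routing boils down to setting search_path per connection; a sketch with a hypothetical tenant_&lt;id&gt; naming convention, where the int() cast doubles as an identifier-injection guard:

```python
def tenant_schema(tenant_id: int) -> str:
    # int() raises on anything that isn't a clean integer, so the schema
    # name can't be used to smuggle SQL into the SET statement below.
    return f"tenant_{int(tenant_id)}"

def set_search_path_sql(tenant_id: int) -> str:
    # public is kept as a fallback for shared reference tables.
    return f'SET search_path = "{tenant_schema(tenant_id)}", public'

def tenant_cursor(conn, tenant_id: int):
    """Scope a DB-API connection to one tenant: every unqualified table
    name in subsequent queries resolves inside the tenant's schema."""
    cur = conn.cursor()
    cur.execute(set_search_path_sql(tenant_id))
    return cur
```

Because the scoping lives in one SET statement at connection checkout, the rest of the application code stays completely tenant-unaware.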
Q: Explain Aurora Global Database β what's the RTO/RPO and how does failover work?
A: Aurora Global Database replicates at the storage layer (not binlog/WAL) with under 1 second of lag to secondary regions. RPO is typically under 1 second; RTO is under 1 minute for manual promotion. Failover is manual: detach the secondary cluster from the global cluster, promote it to a standalone writer, and update application DNS/connection strings to point to the new region. There is no automatic global failover built in; you need Lambda + Route 53 health checks + EventBridge to automate it. Read traffic in secondary regions gets local latency. Secondaries are read-only: you cannot write to them while they are attached to the global cluster.
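The detach-and-promote step is what the automation Lambda would call; a hedged boto3 sketch with placeholder identifiers, written with an injectable client so it can be exercised offline:

```python
GLOBAL_CLUSTER = "myapp-global"                    # placeholder
SECONDARY_ARN = ("arn:aws:rds:eu-west-1:"
                 "123456789012:cluster:myapp-eu")  # placeholder

def promote_secondary(rds=None):
    """Unplanned failover: detach the secondary regional cluster from the
    global cluster, which promotes it to a standalone writable cluster."""
    if rds is None:
        import boto3  # lazy import keeps the helper testable offline
        rds = boto3.client("rds", region_name="eu-west-1")
    rds.remove_from_global_cluster(
        GlobalClusterIdentifier=GLOBAL_CLUSTER,
        DbClusterIdentifier=SECONDARY_ARN,
    )
    # Remaining cutover -- flipping DNS/connection strings to the new
    # regional writer endpoint -- is application-specific and not shown.
```

For planned region switchovers (maintenance, not disaster), boto3 also exposes a managed failover operation that keeps the cluster in the global topology.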
Q: How do you secure RDS β encryption, network, auth, audit logging?
A: Four layers: (1) Network: RDS in private subnets (no public IP); the security group restricts port 5432/3306 to the application security group only; VPC endpoints or NAT for outbound. See VPC & Networking. (2) Authentication: prefer IAM DB Auth (passwordless, tokens expire in 15 min) for Lambda/EC2; Secrets Manager for rotation of the master password; principle of least privilege on DB users (separate user per service, no app connecting as root). See IAM & Security. (3) Encryption: KMS at rest (set at creation); TLS in transit enforced via the parameter group (rds.force_ssl for PostgreSQL, require_secure_transport for MySQL); cross-region snapshot copies encrypted with a regional KMS key. (4) Audit: pgaudit for PostgreSQL (log DDL, DML, role changes); CloudWatch Logs for centralized log storage; set a retention policy on log groups.
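Enforcing the in-transit layer is a parameter-group change; a hedged boto3 sketch (the group name is a placeholder, and the parameter name is engine-specific):

```python
def ssl_params(engine: str = "postgres"):
    # rds.force_ssl for PostgreSQL, require_secure_transport for MySQL/MariaDB;
    # both are dynamic parameters, so they apply without a reboot.
    name = "rds.force_ssl" if engine == "postgres" else "require_secure_transport"
    return [{
        "ParameterName": name,
        "ParameterValue": "1",
        "ApplyMethod": "immediate",
    }]

def enforce_ssl(group: str, engine: str = "postgres", rds=None):
    if rds is None:
        import boto3  # lazy import keeps the helper testable offline
        rds = boto3.client("rds")
    return rds.modify_db_parameter_group(
        DBParameterGroupName=group, Parameters=ssl_params(engine))
```

Once applied, plaintext connections are rejected at the server rather than silently downgraded, which is the failure mode called out in the pitfalls below.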
Q: What is RDS Proxy and when is it essential?
A: RDS Proxy is a fully managed connection pooler that sits between your application and RDS/Aurora. It maintains a pool of long-lived backend connections and multiplexes thousands of application connections onto them. Essential when: Lambda functions connect to RDS (Lambda at scale creates thousands of connections), the application has frequent connection churn (connect/disconnect per request), you want passwordless IAM auth with Secrets Manager rotation, or you need sub-10-second failover redirection. Not needed when: you have a small number of long-lived application servers with their own connection pools (PgBouncer, an application-level pool), or when using the Aurora Data API. See Lambda for the Lambda connection pattern.
Common pitfalls:
- Not setting rds.force_ssl = 1 (PostgreSQL) or require_secure_transport (MySQL) in the parameter group — connections may fall back to unencrypted without it
- Not enabling pg_stat_statements — you'll be flying blind on query performance until a production incident
- Leaving maxmemory-policy as noeviction on a general cache — ElastiCache will return errors on writes instead of evicting old data; use allkeys-lru for caches

Related topics: DynamoDB · Lambda · IAM & Security · VPC & Networking · AWS Architecture · Cost Optimization · Compute & Containers · Observability