AWS Storage is the broadest and most mature cloud storage portfolio available, spanning object storage (S3), block storage (EBS), file storage (EFS, FSx), hybrid gateways (Storage Gateway), and unified backup (AWS Backup). S3 alone underpins much of the modern web – from static asset hosting to data lake foundations to ML training datasets. Understanding the full storage stack is essential for designing cost-efficient, durable, and high-performance architectures, and it is one of the most heavily tested domains in AWS Solutions Architect and Developer certification exams.
S3 is a globally addressed, regionally stored flat object store. There is no true filesystem hierarchy – what looks like a folder path (data/2024/jan/file.csv) is just a key with slashes in it. Every object has a key (the full path string), a value (the data, up to 5 TB), a version ID (when versioning is enabled), system and user-defined metadata, and optional tags.
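A minimal boto3 sketch of this object model (bucket and key names are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Upload an object with user-defined metadata; the key is just a string
s3.put_object(
    Bucket="my-bucket",
    Key="data/2024/jan/file.csv",
    Body=b"id,amount\n1,42\n",
    Metadata={"source": "billing-export"},  # surfaces as x-amz-meta-source
)

# HEAD returns the object's attributes without downloading the body
head = s3.head_object(Bucket="my-bucket", Key="data/2024/jan/file.csv")
print(head["ETag"], head["ContentLength"], head["Metadata"])
```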
Consistency model (post-December 2020): S3 provides strong read-after-write consistency for all operations – PUT, DELETE, LIST. There is no longer any eventual consistency window for new or overwritten objects.
```
PUT s3://my-bucket/data/file.csv
        ↓
[S3 Control Plane] → IAM Auth → Bucket Policy check → Object Lock check
        ↓
[S3 Data Plane] → 3+ AZ replication (for Standard class) → 200 OK + ETag
        ↓
Subsequent GET → reads the newly written version immediately (strong consistency)
```
Enable versioning on a bucket to keep multiple variants of every object:
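A quick boto3 sketch of enabling versioning and listing versions (bucket and key names are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Turn on versioning; S3 then assigns a VersionId to every new write
s3.put_bucket_versioning(
    Bucket="my-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Each overwrite creates a new version; old versions remain retrievable
versions = s3.list_object_versions(Bucket="my-bucket", Prefix="report.pdf")
for v in versions.get("Versions", []):
    print(v["Key"], v["VersionId"], v["IsLatest"])
```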
Multipart upload is mandatory for objects >5 GB and recommended for objects >100 MB. CreateMultipartUpload returns an UploadId identifying the upload; each uploaded part returns an ETag, and CompleteMultipartUpload assembles the parts from those part-number/ETag pairs, as the sketch below shows.
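A minimal boto3 sketch of the three-step flow (names and part sizes are illustrative; in practice boto3's upload_file performs multipart automatically above a size threshold):

```python
import boto3

s3 = boto3.client("s3")

# 1. Initiate: returns an UploadId that identifies this upload
mpu = s3.create_multipart_upload(Bucket="my-bucket", Key="big/archive.bin")
upload_id = mpu["UploadId"]

# 2. Upload parts (min 5 MB each, except the last); each returns an ETag
parts = []
for i, chunk in enumerate([b"x" * 5 * 1024 * 1024, b"tail"], start=1):
    resp = s3.upload_part(
        Bucket="my-bucket", Key="big/archive.bin",
        PartNumber=i, UploadId=upload_id, Body=chunk,
    )
    parts.append({"PartNumber": i, "ETag": resp["ETag"]})

# 3. Complete: S3 assembles the parts atomically using the ETags
s3.complete_multipart_upload(
    Bucket="my-bucket", Key="big/archive.bin", UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)
```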
S3 storage classes at a glance:

| Class | Min Duration | Min Object Size | AZs | Use Case | Approx. Storage Cost |
|---|---|---|---|---|---|
| Standard | None | None | ≥3 | Frequently accessed, active data | ~$0.023/GB/mo |
| Intelligent-Tiering | None | None (<128 KB not monitored) | ≥3 | Unknown/changing access patterns | ~$0.023/GB/mo + $0.0025/1k objects monitoring |
| Standard-IA | 30 days | 128 KB | ≥3 | Infrequently accessed, rapid retrieval | ~$0.0125/GB/mo + retrieval fee |
| One Zone-IA | 30 days | 128 KB | 1 | Reproducible infrequent data, lower cost | ~$0.01/GB/mo |
| Glacier Instant Retrieval | 90 days | 128 KB | ≥3 | Archive with ms retrieval | ~$0.004/GB/mo |
| Glacier Flexible Retrieval | 90 days | 40 KB | ≥3 | Archive, retrieval in minutes to hours | ~$0.0036/GB/mo |
| Glacier Deep Archive | 180 days | 40 KB | ≥3 | Long-term compliance archive | ~$0.00099/GB/mo |
Intelligent-Tiering tiers (automatic, no retrieval charge):
- Frequent Access – default tier for new and recently accessed objects
- Infrequent Access – after 30 consecutive days without access
- Archive Instant Access – after 90 days without access
- Archive Access (optional, opt-in) – after 90+ days; retrieval in hours
- Deep Archive Access (optional, opt-in) – after 180+ days; retrieval within 12 hours
Key decision rules:
- Known, steady access pattern → pick the class directly from the table above
- Unknown or changing access pattern → Intelligent-Tiering
- Data you can cheaply regenerate → One Zone-IA
- Compliance archives you will rarely touch → Glacier Flexible Retrieval or Deep Archive
Lifecycle rules automate transitions between storage classes and expiration:
```json
{
  "Rules": [
    {
      "ID": "move-logs-to-cold-storage",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER_IR" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ],
      "Expiration": { "Days": 2555 },
      "NoncurrentVersionTransitions": [
        { "NoncurrentDays": 30, "StorageClass": "STANDARD_IA" }
      ],
      "NoncurrentVersionExpiration": { "NoncurrentDays": 90 },
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    }
  ]
}
```
Critical details:
- AbortIncompleteMultipartUpload – always add this rule; orphaned parts accumulate silently and cost money
- NoncurrentVersionExpiration – cleans up old versions in versioned buckets; without it, storage grows unboundedly

Bucket policies are JSON policies attached directly to the bucket. They support cross-account access and anonymous public access, and are evaluated alongside IAM identity policies: within the same account, an action is allowed if either the identity policy or the bucket policy allows it; for cross-account access, both sides must allow it. An explicit Deny anywhere overrides every Allow.
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowCloudFrontOAC",
      "Effect": "Allow",
      "Principal": {
        "Service": "cloudfront.amazonaws.com"
      },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-bucket/*",
      "Condition": {
        "StringEquals": {
          "AWS:SourceArn": "arn:aws:cloudfront::123456789012:distribution/EDFDVBD6EXAMPLE"
        }
      }
    }
  ]
}
```
Object and bucket Access Control Lists predate bucket policies. AWS recommends disabling ACLs entirely by setting Object Ownership to BucketOwnerEnforced. This also means all objects uploaded to the bucket are owned by the bucket owner regardless of who uploaded them – critical for cross-account upload scenarios.
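A one-call sketch of enforcing this with boto3 (bucket name is a placeholder):

```python
import boto3

s3 = boto3.client("s3")

# Disable ACLs: the bucket owner owns every object, regardless of uploader
s3.put_bucket_ownership_controls(
    Bucket="my-bucket",
    OwnershipControls={"Rules": [{"ObjectOwnership": "BucketOwnerEnforced"}]},
)
```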
Four independent settings, configurable at account level and bucket level:
- BlockPublicAcls – rejects any new ACL grant that allows public access
- IgnorePublicAcls – ignores existing public ACLs
- BlockPublicPolicy – rejects bucket policies that grant public access
- RestrictPublicBuckets – restricts access to buckets with public policies to only AWS services and authorized users

Best practice: enable all four at the account level. Disable selectively only for intentionally public buckets (e.g., static website hosting). A bucket-level sketch follows.
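A boto3 sketch of the bucket-level call (the account-level equivalent lives on the s3control client):

```python
import boto3

s3 = boto3.client("s3")

# Enable all four Block Public Access settings on a single bucket
s3.put_public_access_block(
    Bucket="my-bucket",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```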
Allow temporary, time-limited access to private S3 objects without exposing credentials:
```typescript
import { S3Client, GetObjectCommand, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const client = new S3Client({ region: "us-east-1" });

// Presigned GET – allow download for 1 hour
const getUrl = await getSignedUrl(
  client,
  new GetObjectCommand({ Bucket: "my-bucket", Key: "report.pdf" }),
  { expiresIn: 3600 }
);

// Presigned PUT – allow direct browser upload for 15 minutes
const putUrl = await getSignedUrl(
  client,
  new PutObjectCommand({
    Bucket: "my-bucket",
    Key: "uploads/user-photo.jpg",
    ContentType: "image/jpeg",
  }),
  { expiresIn: 900 }
);
```
Presigned POST (HTML form uploads): More powerful than presigned PUT – supports conditions (max file size, allowed prefixes, content-type restrictions). Uses a Policy document + Signature rather than a URL query string. Preferred for browser-based uploads where you need server-side validation of upload parameters.
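A boto3 sketch of generating a presigned POST with server-enforced constraints (bucket, prefix, and size limit are illustrative):

```python
import boto3

s3 = boto3.client("s3")

# The policy constrains what the browser may upload; the returned
# form fields carry the signature
post = s3.generate_presigned_post(
    Bucket="my-bucket",
    Key="uploads/images/${filename}",
    Conditions=[
        ["starts-with", "$key", "uploads/images/"],    # enforce prefix
        ["content-length-range", 1, 5 * 1024 * 1024],  # max 5 MB
        {"Content-Type": "image/jpeg"},
    ],
    Fields={"Content-Type": "image/jpeg"},
    ExpiresIn=900,
)
# post["url"] is the form action; post["fields"] become hidden form inputs
```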
Key facts:
- A presigned URL carries the permissions of the identity that signed it – if the signer cannot access the object, neither can the URL holder
- Maximum expiry is 7 days with SigV4; URLs signed with temporary credentials (e.g., an assumed role) stop working when those credentials expire, regardless of expiresIn
- Anyone holding the URL can use it until expiry; treat presigned URLs as secrets
S3 scales automatically per prefix. Each unique prefix path supports at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second. To maximize throughput for high-traffic workloads:
- Spread objects across multiple prefixes (e.g., data/2024/)
- For extreme request rates, prepend randomized hash prefixes: a3f2/user-uploads/..., b7c1/user-uploads/...

S3 Transfer Acceleration routes uploads/downloads through CloudFront edge locations using AWS's private network backbone instead of the public internet. It adds ~$0.04/GB over standard transfer costs and is most useful for geographically distant users moving large objects to a centralized bucket.
Enable at the bucket level; use the .s3-accelerate.amazonaws.com endpoint.
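A sketch of enabling acceleration and opting a client into the accelerate endpoint (bucket and file names are placeholders):

```python
import boto3
from botocore.config import Config

s3 = boto3.client("s3")

# One-time: enable acceleration on the bucket
s3.put_bucket_accelerate_configuration(
    Bucket="my-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Clients must opt in to the accelerate endpoint explicitly
fast_s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
fast_s3.upload_file("video.mp4", "my-bucket", "raw/video.mp4")
```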
Use a Range: bytes=0-1048575 HTTP header to download only a portion of an object. This enables:
- Parallel downloads of large objects (each worker fetches a byte range)
- Resuming interrupted downloads
- Reading only a file's header or footer (e.g., Parquet metadata) without fetching the whole object
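For example, a ranged GET with boto3 (object key is illustrative):

```python
import boto3

s3 = boto3.client("s3")

# Fetch only the first 1 MiB of the object (returns HTTP 206 Partial Content)
resp = s3.get_object(
    Bucket="my-bucket",
    Key="logs/huge-file.log",
    Range="bytes=0-1048575",
)
first_mib = resp["Body"].read()
```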
Execute SQL-like queries directly on S3 objects (CSV, JSON, Parquet, gzip/bzip2 compressed CSV/JSON) – only the matching rows are returned over the network:
```sql
SELECT s.name, s.age FROM S3Object s WHERE s.age > 25
```
AWS cites up to 400% faster performance and up to 80% lower cost for selective queries. S3 Select does not support JOINs or aggregations across multiple files. For complex analytics, use Amazon Athena (full SQL with joins, aggregations, and query planning) or Redshift Spectrum – see Data Analytics for full Athena + Glue + Redshift patterns.
Intercept GetObject requests with a Lambda function to transform data on the fly – resize images, redact PII, convert formats – before returning it to the caller. No changes needed in the calling application.
```
Client GET → S3 Object Lambda Access Point → Lambda function → S3 standard bucket → transformed bytes → Client
```
Enable Requester Pays on a bucket to charge the data transfer and request costs to the requester's AWS account rather than the bucket owner. Used for public datasets (e.g., AWS Open Data Registry) where the dataset provider does not want to pay egress costs.
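A requester-pays GET must acknowledge the charge explicitly; a boto3 sketch (bucket and key are illustrative):

```python
import boto3

s3 = boto3.client("s3")

# The caller confirms they will be billed for the request and transfer
resp = s3.get_object(
    Bucket="public-dataset-bucket",
    Key="genomes/sample-001.fastq.gz",
    RequestPayer="requester",
)
```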
Run operations at scale across billions of objects:
- Copy objects between buckets
- Invoke a Lambda function per object
- Replace object tags or ACLs
- Restore objects from Glacier
- Apply Object Lock retention or legal hold
Cross-Region Replication (CRR): Source and destination buckets in different Regions. Use cases: compliance (data residency), latency reduction for global readers, disaster recovery.
Same-Region Replication (SRR): Same Region. Use cases: aggregate logs from multiple buckets, live replication between production and test accounts.
Key rules:
- Versioning must be enabled on both source and destination buckets
- Replication is asynchronous (seconds to minutes) and requires an IAM role that can read the source and write to the destination
- Objects that existed before replication was configured are not replicated (use S3 Batch Replication for backfill)
- Delete markers are not replicated by default, and replicas cannot be replicated again (no chaining)
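A minimal boto3 replication configuration, assuming the IAM role and destination bucket (names are placeholders) already exist:

```python
import boto3

s3 = boto3.client("s3")

# Both buckets must already have versioning enabled
s3.put_bucket_replication(
    Bucket="source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/ReplicationRole",
        "Rules": [{
            "ID": "replicate-all",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},  # empty filter = all objects
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {
                "Bucket": "arn:aws:s3:::dest-bucket-eu-west-1",
                "StorageClass": "STANDARD_IA",
            },
        }],
    },
)
```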
S3 can publish events on object create, delete, restore, replication, and lifecycle transitions. See Lambda & Serverless for how Lambda processes these events at scale.
| Destination | Protocol | Ordering | Use Case |
|---|---|---|---|
| SQS | Poll-based | No (FIFO queues not supported) | Decoupled processing queue |
| SNS | Push (fan-out) | No | Fan-out to multiple consumers |
| Lambda | Direct invoke | No | Serverless immediate processing |
| EventBridge | Push | No | Advanced filtering, routing, archiving |
EventBridge is the most powerful destination – supports content-based filtering on all event fields, dead-letter queues, event replay, and routing to 20+ targets.
```json
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": { "name": ["my-uploads-bucket"] },
    "object": { "key": [{ "prefix": "uploads/images/" }] }
  }
}
```
Note: S3 event notifications have at-least-once delivery semantics. For exactly-once processing, use SQS deduplication or idempotent Lambda handlers.
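One idempotency pattern: a DynamoDB conditional write keyed on key + ETag, sketched below (the table name processed-objects and the process_object helper are hypothetical):

```python
import boto3
from botocore.exceptions import ClientError

dedup_table = boto3.resource("dynamodb").Table("processed-objects")

def handler(event, context):
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        etag = record["s3"]["object"].get("eTag", "")
        try:
            # Conditional write succeeds only the first time this object is seen
            dedup_table.put_item(
                Item={"pk": f"{key}#{etag}"},
                ConditionExpression="attribute_not_exists(pk)",
            )
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                continue  # duplicate delivery; skip silently
            raise
        process_object(key)

def process_object(key):
    # Placeholder for the actual business logic
    print("processing", key)
```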
CloudFront is AWS's globally distributed Content Delivery Network (CDN) – 400+ edge locations (Points of Presence) plus 13 Regional Edge Caches. It caches content close to users, reducing latency and origin load.
A distribution maps to one or more origins:
- An S3 bucket (REST endpoint, protected with OAC) or an S3 static website endpoint (treated as a custom origin)
- Any custom HTTP(S) origin: an Application Load Balancer, API Gateway, EC2, or an on-premises server
Multiple origins per distribution enable origin groups for failover: CloudFront tries the primary origin; on 5xx or connection failure, it automatically retries the secondary.
A distribution has one or more cache behaviors matched by URL path pattern (/api/*, /images/*, *). Each behavior independently controls:
- Which origin (or origin group) serves the request
- The cache policy and origin request policy
- Viewer protocol policy (redirect to HTTPS, HTTPS only) and allowed HTTP methods
- Function associations (CloudFront Functions, Lambda@Edge)
Cache Policy defines what constitutes a unique cache key and the TTL bounds:
- Cache key components: which headers, cookies, and query strings make a cached response unique
- min-ttl, default-ttl, max-ttl – CloudFront respects Cache-Control from the origin within these bounds

Origin Request Policy controls what CloudFront forwards to the origin without affecting the cache key:
- Forward the Authorization header to the origin without including it in the cache key (use with caution – all users share the same cached response)
- Forward Accept-Language to the origin without caching per language

Origin Access Control (OAC) restricts S3 bucket access so only CloudFront can retrieve objects. OAC supersedes the older Origin Access Identity (OAI) and is enforced through IAM resource-based policies on the S3 bucket:
- Grant s3:GetObject to the CloudFront service principal with a condition on AWS:SourceArn matching the specific distribution ARN (see IAM & Security for bucket policy syntax and the principal/condition model)

Signed URL: one URL with an embedded policy and signature. Use for:
- Distributing individual files
- Clients that don't support cookies
- Cases where each URL needs its own expiry or IP restriction
Signed Cookie: sets CloudFront-Policy, CloudFront-Signature, and CloudFront-Key-Pair-Id cookies. Use for:
- Granting access to many files matching a wildcard pattern (e.g., /premium/*) without rewriting every URL
- Authenticated streaming (HLS/DASH manifests + segments) and premium content libraries
Both require a CloudFront key pair (RSA-2048) – the private key signs the policy, CloudFront validates with the stored public key. Managed in Key Groups (not IAM).
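A signing sketch using botocore's CloudFrontSigner (the key-pair ID, domain, and private-key path are placeholders; requires the cryptography package):

```python
import datetime
from botocore.signers import CloudFrontSigner
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

def rsa_signer(message: bytes) -> bytes:
    # CloudFront expects SHA-1 / PKCS#1 v1.5 signatures
    with open("private_key.pem", "rb") as f:
        key = serialization.load_pem_private_key(f.read(), password=None)
    return key.sign(message, padding.PKCS1v15(), hashes.SHA1())

# Public key ID from the distribution's trusted key group
signer = CloudFrontSigner("K2JCJMDEHXQW5F", rsa_signer)

url = signer.generate_presigned_url(
    "https://d111111abcdef8.cloudfront.net/premium/video.mp4",
    date_less_than=datetime.datetime(2025, 1, 1),
)
```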
| | CloudFront Functions | Lambda@Edge |
|---|---|---|
| Runtime | JS (ES5.1) | Node.js 14/18, Python 3.x |
| Trigger points | Viewer Request, Viewer Response | Viewer Request, Viewer Response, Origin Request, Origin Response |
| Max exec time | 1 ms | 5 s (viewer), 30 s (origin) |
| Max memory | 2 MB | 128 MB – 10 GB |
| Network access | No | Yes |
| Pricing | ~$0.10/million | ~$0.60/million + compute |
| Use cases | URL rewrites/redirects, simple A/B testing, HTTP header manipulation, cache key normalization | Auth flows, dynamic origin selection, body inspection/modification, API calls during request |
Rule of thumb: Use CloudFront Functions for anything that takes <1ms and needs no I/O. Use Lambda@Edge when you need to call an external service (e.g., JWT validation against Cognito, dynamic personalization).
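As an illustration of the Lambda@Edge side, a minimal Python viewer-request handler that rejects unauthenticated requests (the header check is deliberately simplistic; real auth would validate the token):

```python
# Lambda@Edge viewer-request handler (must be deployed in us-east-1)
def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    headers = request["headers"]

    # Reject requests without an Authorization header before they reach the origin
    if "authorization" not in headers:
        return {
            "status": "401",
            "statusDescription": "Unauthorized",
            "headers": {
                "www-authenticate": [{"key": "WWW-Authenticate", "value": "Bearer"}],
            },
        }
    return request  # pass the request through unchanged
```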
EBS provides persistent block storage volumes for EC2 instances. Think of it as a network-attached SSD/HDD. Data persists independently of the EC2 instance lifecycle.
| Type | Category | Max IOPS | Max Throughput | Use Case |
|---|---|---|---|---|
| gp3 | SSD General Purpose | 16,000 | 1,000 MB/s | Boot volumes, dev/test, general workloads |
| gp2 | SSD General Purpose (legacy) | 16,000 | 250 MB/s | Legacy; gp3 is better in all dimensions |
| io2 Block Express | SSD Provisioned IOPS | 256,000 | 4,000 MB/s | Mission-critical RDBMS, SAP HANA |
| io1 | SSD Provisioned IOPS (legacy) | 64,000 | 1,000 MB/s | Databases needing >16k IOPS |
| st1 | HDD Throughput Optimized | 500 | 500 MB/s | Big data, log processing, sequential I/O |
| sc1 | HDD Cold | 250 | 250 MB/s | Infrequently accessed archival, lowest cost |
gp2 vs gp3: gp3 decouples IOPS and throughput from volume size (gp2 is 3 IOPS/GB, maxing at 16k IOPS for volumes ≥5.3 TB). gp3 provides a 3,000 IOPS and 125 MB/s baseline regardless of size, with provisioned scaling to 16,000 IOPS and 1,000 MB/s. gp3 is 20% cheaper than gp2. Always choose gp3 for new volumes.
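Migration is a single live API call; a boto3 sketch (volume ID and provisioned values are illustrative):

```python
import boto3

ec2 = boto3.client("ec2")

# Live modification: the volume stays attached and in use throughout
ec2.modify_volume(
    VolumeId="vol-0abc123de456f7890",
    VolumeType="gp3",
    Iops=6000,       # optional: above the 3,000 IOPS baseline
    Throughput=500,  # optional: MB/s, above the 125 MB/s baseline
)
```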
io1 vs io2: io2 provides 500 IOPS/GB (vs 50 IOPS/GB for io1), 99.999% durability (vs 99.8–99.9% for io1), same price. Always choose io2 over io1.
HDD volumes (st1/sc1) cannot be used as boot volumes.
Available for io1 and io2 volumes only. Allows attaching a single volume to up to 16 Nitro-based EC2 instances simultaneously within the same AZ. The application must manage concurrent writes β typically requires a clustered file system (like GFS2) or a database that handles its own locking.
EBS encryption is transparent and KMS-based (the default aws/ebs key or a customer-managed CMK); snapshots of encrypted volumes, and volumes created from them, stay encrypted.

EFS is a fully managed, elastic NFS file system that can be mounted concurrently by thousands of EC2 instances, ECS tasks, Lambda functions, and on-premises servers (via Direct Connect or VPN).
EFS offers two performance modes – General Purpose (default, lowest latency) and Max I/O (higher aggregate throughput at the cost of per-operation latency). You cannot change the performance mode after creation.
EFS Intelligent-Tiering (similar concept to S3): lifecycle policies transition files that go unaccessed for a configurable period (e.g., 30 days) to the lower-cost Infrequent Access storage class, and move them back to Standard when they are accessed again.
Mount targets resolve via DNS (fs-xxxx.efs.us-east-1.amazonaws.com); use the amazon-efs-utils mount helper for TLS and IAM-authorized mounts.

Storage Gateway bridges on-premises environments to AWS storage. Three gateway types:
- File Gateway – NFS/SMB file shares backed by S3 objects
- Volume Gateway – iSCSI block volumes backed by S3, with EBS snapshot integration
- Tape Gateway – virtual tape library (VTL) backed by S3 and Glacier for existing backup software
Volume Gateway has two sub-modes: cached (primary data in S3, frequently accessed data cached on-premises) and stored (full dataset on-premises, asynchronous snapshots to S3).
Centralized, policy-driven backup service across AWS services (EBS, RDS, DynamoDB, EFS, FSx, S3, DocumentDB, Neptune, EC2 instance backup via AMI).
```
Production Account (us-east-1)
  └─ Backup Plan (daily, 30-day retention)
        ↓
Backup Vault (us-east-1) → copy to → Backup Vault (eu-west-1, backup account)
```
AWS Backup for S3 creates continuous backups (PITR – point-in-time recovery to any moment within the retention window, up to 35 days) and periodic backups. Unlike S3 versioning alone, Backup provides centralized governance, vault lock, and cross-account copies.
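A hedged sketch of a daily plan with a cross-Region, cross-account copy (vault names, ARNs, and account IDs are placeholders):

```python
import boto3

backup = boto3.client("backup")

plan = backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "daily-30d",
        "Rules": [{
            "RuleName": "daily",
            "TargetBackupVaultName": "prod-vault",
            "ScheduleExpression": "cron(0 5 * * ? *)",  # 05:00 UTC daily
            "Lifecycle": {"DeleteAfterDays": 30},
            "CopyActions": [{
                "DestinationBackupVaultArn":
                    "arn:aws:backup:eu-west-1:210987654321:backup-vault:dr-vault",
                "Lifecycle": {"DeleteAfterDays": 30},
            }],
        }],
    },
)
print(plan["BackupPlanId"])
```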
Q: What is the difference between S3 Standard-IA and S3 One Zone-IA? When would you use each? A: Both classes are designed for infrequently accessed data with a 30-day minimum storage duration and a per-GB retrieval fee. The key difference is resilience: Standard-IA replicates data across at least 3 AZs (99.999999999% durability), while One Zone-IA stores data in a single AZ (99.5% availability SLA). Use Standard-IA for any data you cannot easily regenerate – DR copies, compliance archives, infrequent reports. Use One Zone-IA only for derived or reproducible data – thumbnail caches, transcoded video previews, reprocessable logs – where the cost savings justify the risk of losing the data if that AZ fails.
Q: A customer's S3 bucket is being accessed by a CloudFront distribution. How do you ensure the bucket cannot be accessed directly, bypassing CloudFront?
A: Configure Origin Access Control (OAC) on the CloudFront distribution (not the legacy OAI). This attaches a SigV4-signed request to all CloudFront→S3 fetches. Then update the S3 bucket policy to grant s3:GetObject only to cloudfront.amazonaws.com with a Condition matching the specific distribution's ARN (AWS:SourceArn). Finally, enable all four Block Public Access settings on the bucket. This ensures all traffic must come through CloudFront; any direct S3 URL request is denied.
Q: You need to allow a third-party application to upload files directly to your S3 bucket from a browser without exposing your AWS credentials. What are your options and how do they differ? A: Two options: Presigned PUT URL and Presigned POST. A presigned PUT URL is simpler – your server generates a signed URL for a specific key with a specific content type; the browser PUTs the file directly. It offers less control over what the browser can upload. A presigned POST is more powerful – your server generates a policy document that can constrain the key prefix, maximum file size, allowed content types, and metadata. The browser submits an HTML form (or XHR) with the policy, signature, and file. Presigned POST is the correct choice when you need server-enforced upload constraints without processing the upload through your own server.
Q: What is MFA Delete, and when must you use it? A: MFA Delete requires a valid MFA token (in addition to standard credentials) to either change the versioning state of a bucket or permanently delete a versioned object. It can only be enabled by the root account user (not IAM users). Use it for your most sensitive, compliance-critical buckets – financial records, audit logs – where you need a second factor to prevent accidental or malicious deletion even by a compromised administrator account. Combined with Vault Lock in AWS Backup, this forms a WORM-compliant architecture.
Q: Explain S3 multipart upload. What happens if a multipart upload is never completed?
A: Multipart upload splits a large object into parts (min 5 MB each except the last, max 10,000 parts) uploaded in parallel, then atomically assembled. It provides better resilience (retry individual parts), faster aggregate throughput, and is required for objects >5 GB. If an upload is initiated but never completed or aborted, the uploaded parts remain in S3 and accumulate storage charges indefinitely – they are invisible in the bucket listing but visible in S3 Inventory. The fix is an S3 lifecycle rule with AbortIncompleteMultipartUpload (e.g., 7 days), which automatically deletes orphaned parts.
Q: What is the difference between gp2 and gp3 EBS volumes? Should you ever use gp2 today? A: gp2 delivers 3 IOPS per GB of provisioned size, meaning you must over-provision storage to get higher IOPS, up to a maximum of 16,000 IOPS at 5,333+ GB. gp3 provides a flat baseline of 3,000 IOPS and 125 MB/s regardless of size, which can be independently scaled to 16,000 IOPS and 1,000 MB/s at additional cost. gp3 is approximately 20% cheaper per GB than gp2 and offers 4x the max throughput. There is no valid reason to provision new gp2 volumes; migrate existing gp2 volumes to gp3 using live volume modification (no downtime required).
Q: When would you choose EFS over EBS? Over S3? A: Choose EFS when multiple EC2 instances, Lambda functions, or containers need to concurrently read and write a shared POSIX file system – web server farms sharing uploaded content, CI/CD runners sharing build caches, home directory mounting for hundreds of users. EBS is single-instance (except io1/io2 Multi-Attach) and block-level; EFS is multi-instance and file-level. Choose S3 over EFS when the access pattern is object-level (upload/download entire files), when you need lifecycle tiering, versioning, or global replication, or when cost is paramount (S3 is an order of magnitude cheaper than EFS for large-scale storage). EFS is best for applications that use standard POSIX file operations and cannot easily adapt to object store semantics.
Q: A Lambda function processing S3 events sometimes processes the same object twice. Why, and how do you fix it? A: S3 event notifications have at-least-once delivery semantics – S3 may deliver the same event multiple times. Additionally, Lambda itself retries on errors (up to 2 retries for async invocations by default). The fix is to make the Lambda function idempotent: before processing, check a deduplication store (DynamoDB item with a conditional write, or a Redis SET NX) using the S3 object's ETag and key as a unique identifier. Only process the object if the record doesn't already exist. Alternatively, route events through an SQS FIFO queue with deduplication enabled before Lambda (note that S3 cannot target FIFO queues directly, so route via EventBridge); SQS deduplicates within a 5-minute window.
Q: What is the difference between CloudFront signed URLs and signed cookies?
A: Both restrict access to CloudFront-served content to authorized users only. A signed URL grants access to a single specific URL (one file) and is ideal for distributing individual files, for clients that don't support cookies, or when each URL needs different expiry or IP restrictions. A signed cookie grants access to multiple files matching a wildcard pattern (e.g., all /premium/* content) without modifying each URL – ideal for authenticated streaming (HLS/DASH manifests + segments), premium content libraries, or software download portals where many files constitute a single entitlement.
Q: Describe how you would design a globally distributed media asset delivery pipeline for a video streaming platform using AWS storage services.
A: Upload path: Mobile/web clients upload video files using presigned PUT URLs (or presigned POST for browser form uploads) directly to an S3 bucket in the primary Region. S3 triggers an EventBridge event → a Step Functions workflow → AWS Elemental MediaConvert for transcoding → output segments written to a second S3 bucket. S3 Lifecycle transitions raw uploads to Standard-IA after 30 days and Glacier after 1 year. Delivery path: CloudFront distribution with the transcoded output bucket as origin, protected by OAC. HLS/DASH manifests and segments are served from CloudFront edge caches with appropriate TTLs (Cache-Control: max-age=86400 for segments, short TTLs for manifests). Subscribers access content via CloudFront signed cookies set after authentication. Lambda@Edge at the viewer-request stage validates the JWT and denies unauthorized requests before they reach the origin.
Q: How does S3 Cross-Region Replication differ from S3 Same-Region Replication? What are the prerequisites and limitations? A: CRR replicates objects to a bucket in a different AWS Region; SRR replicates within the same Region. Both require versioning enabled on both source and destination buckets. Both are asynchronous (seconds to minutes). Both require an IAM role with permissions to read from the source and write to the destination. Key limitations: existing objects at the time replication is configured are not replicated (use S3 Batch Replication for backfill); delete markers are not replicated by default; replica objects cannot be replicated again (no chaining); lifecycle actions on the source are not replicated. CRR incurs inter-region data transfer costs; SRR does not. Use CRR for DR across Regions, data sovereignty compliance, and global latency reduction; use SRR for log aggregation, cross-account data sharing within a Region, and live test/prod environment sync.
Q: What is S3 Transfer Acceleration and when should you use it (or not)?
A: Transfer Acceleration routes S3 uploads and downloads through the nearest CloudFront edge location over AWS's private backbone, bypassing the public internet for most of the journey. It adds $0.04–$0.08/GB over standard S3 transfer pricing. It is beneficial when users are far from the bucket's Region and upload/download large files (the longer the public internet path, the greater the benefit from private backbone routing). It is NOT beneficial when users are geographically close to the bucket's Region (the overhead of routing through an edge location may actually increase latency), for small objects (per-request latency of the edge hand-off outweighs the throughput benefit), or for intra-AWS transfers (already on the private network). AWS provides a speed comparison tool at s3-accelerate-speedtest.s3-accelerate.amazonaws.com to measure actual benefit before enabling.
S3 is the standard foundation for AWS data lakes, and feeds directly into AI/ML workloads – see Bedrock, SageMaker & AI/ML Services for how SageMaker training jobs, Feature Store, and Bedrock Knowledge Bases consume S3-backed datasets.
```
Raw Ingestion → s3://data-lake/raw/ (Standard, versioned)
    ↓ Glue ETL / Lambda
Cleaned Data  → s3://data-lake/processed/ (Standard or Intelligent-Tiering)
    ↓ Athena / Redshift Spectrum / EMR
Analytics     → Query in place, pay per scan
    ↓ SageMaker / Bedrock Knowledge Bases (see /aws-ai-ml-services)
AI/ML         → Training datasets, vector embeddings, model artifacts
    ↓ Lifecycle
Archive       → s3://data-lake/archive/ → Glacier Deep Archive after 2 years
```
```
User Upload (presigned PUT)
  ↓ S3 bucket (Standard, versioning on, BPA on)
  ↓ EventBridge rule (object created, prefix = uploads/)
  ↓ Step Functions workflow
      ├─ Lambda: validate file type / virus scan
      ├─ Lambda: process / transform
      ├─ S3 PutObject: write result to s3://processed/
      └─ DynamoDB: record job status
  ↓ CloudFront: serve processed output via OAC
```
```
On-Premises NAS
  ↓ Storage Gateway (File Gateway, NFS/SMB mount)
  ↓ S3 (primary copy, Standard class)
  ↓ S3 Lifecycle → Standard-IA (30 days) → Glacier (365 days)
  ↓ AWS Backup (Vault Lock, cross-account copy to backup account)
```
S3 Object Lambda lets you intercept GetObject calls and transform the response using a Lambda function, without storing multiple copies of the data.
```
App requests object → S3 Object Lambda Access Point → Lambda transforms → App receives modified data
                              ↓ (fetches original from S3)
                        Original S3 Bucket
```
```python
import re
import urllib3
import boto3

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    object_context = event['getObjectContext']

    # Fetch the original object via the presigned URL S3 supplies in the event
    http = urllib3.PoolManager()
    original_object = http.request('GET', object_context['inputS3Url'])

    # Transform: mask SSNs using a regex
    content = original_object.data.decode('utf-8')
    masked = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '***-**-****', content)

    # Return the transformed object to the caller
    s3.write_get_object_response(
        Body=masked,
        RequestRoute=object_context['outputRoute'],
        RequestToken=object_context['outputToken'],
        ContentType='text/plain',
        StatusCode=200,
    )
    return {'statusCode': 200}
```
S3 Batch Operations performs large-scale batch actions across billions of objects in a single managed job.
1. Create manifest (CSV or S3 Inventory report listing target objects)
2. Create S3 Batch Operations job with operation + manifest
3. Review estimated job scope (count, size)
4. Confirm + run
5. Job processes objects in parallel (scales automatically)
6. Job completion report written to S3 (success/failure per object)
```python
import boto3

# Create a Batch Operations job that invokes a Lambda function on every object
s3control = boto3.client('s3control')
response = s3control.create_job(
    AccountId='123456789012',
    Operation={
        'LambdaInvoke': {
            'FunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:ProcessObject',
            'InvocationSchemaVersion': '2.0',
        }
    },
    Manifest={
        'Spec': {'Format': 'S3BatchOperations_CSV_20180820', 'Fields': ['Bucket', 'Key']},
        'Location': {'ObjectArn': 'arn:aws:s3:::my-manifests/objects.csv', 'ETag': 'abc123'},
    },
    Report={
        'Bucket': 'arn:aws:s3:::my-reports',
        'Format': 'Report_CSV_20180820',
        'Enabled': True,
        'Prefix': 'batch-reports/',
        'ReportScope': 'AllTasks',
    },
    Priority=10,
    RoleArn='arn:aws:iam::123456789012:role/BatchOperationsRole',
    ConfirmationRequired=False,
)
```
S3 Access Points simplify access management for shared datasets with many applications or teams.
Example: an access point policy giving the "data-science" team read-only access to the /analytics/ prefix:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::123456789012:role/DataScienceRole"},
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:us-east-1:123456789012:accesspoint/data-science",
      "arn:aws:s3:us-east-1:123456789012:accesspoint/data-science/object/analytics/*"
    ]
  }]
}
```
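Creating an access point is a single s3control call; the optional VpcConfiguration restricts it to traffic from a specific VPC (account, bucket, and VPC IDs below are placeholders):

```python
import boto3

s3control = boto3.client("s3control")

# VpcConfiguration limits this access point to requests from one VPC
s3control.create_access_point(
    AccountId="123456789012",
    Name="data-science",
    Bucket="my-data-lake",
    VpcConfiguration={"VpcId": "vpc-0abc123de456f7890"},
)
```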
S3 Express One Zone is a new storage class (2023) optimized for performance-sensitive workloads: single-digit-millisecond first-byte latency (up to 10x faster than Standard), hundreds of thousands of requests per second, purpose-built directory buckets, and data stored in a single AZ that you choose to co-locate with your compute.
Query data within S3 objects without downloading the entire object.
```python
import boto3

s3 = boto3.client('s3')

# Retrieve only specific columns from a CSV/JSON/Parquet object in S3
response = s3.select_object_content(
    Bucket='my-data-lake',
    Key='orders/2024/january.parquet',
    ExpressionType='SQL',
    Expression="SELECT customerId, total FROM S3Object WHERE status = 'completed' AND total > 1000",
    InputSerialization={'Parquet': {}},
    OutputSerialization={'JSON': {'RecordDelimiter': '\n'}},
)

# Process the streaming response event by event
for event in response['Payload']:
    if 'Records' in event:
        data = event['Records']['Payload'].decode('utf-8')
        print(data)
```
| Service | Protocol | Optimized For | Typical Use Case |
|---|---|---|---|
| FSx for Lustre | POSIX | High-throughput parallel I/O | ML training, HPC, media rendering |
| FSx for Windows | SMB/NTFS | Windows workloads | Active Directory, SQL Server backups |
| FSx for NetApp ONTAP | NFS/SMB/iSCSI | Enterprise storage features | Lift-and-shift enterprise apps |
| FSx for OpenZFS | NFS/ZFS | Low-latency NAS | Workloads requiring ZFS features |
```
S3 Data Lake → [lazy loading] → FSx for Lustre (linked to S3)
                     ↓
      ML Training Job (ECS/SageMaker)
                     ↓
           Results → S3 (export)
```