system-design.io

Object Storage Architecture and Access Patterns

Bridge: The Library of Everything

Imagine a vast library where every book, scroll, and tablet is stored not on shelves organized by subject, but in a single massive hall. Each item has a unique call number, a detailed label describing its contents, author, and date, and the actual content itself. To find an item, you don’t walk through aisles of categories; you simply provide the call number to an attendant who retrieves it directly from the hall. This library scales effortlessly—new items are added to the hall without reorganizing existing ones—and you can store anything from a novel to a dataset without worrying about fitting it into a predefined category.

This is the essence of object storage: a flat namespace where data is managed as self-contained objects, each comprising the data, rich metadata, and a unique identifier. Unlike traditional file systems that force data into hierarchical folders, object storage lets us store petabytes of unstructured data—images, videos, logs, backups—with simple, direct access via an identifier.

Iterative Complexity: Building the Concept Layer by Layer

Atomic Unit: The Object

We begin with the simplest building block: the object itself. In code, an object can be represented as:

obj = {
    "id": "object-123e4567-e89b-12d3-a456-426614174000",
    "data": b"...actual bytes of the file...",
    "metadata": {
        "content-type": "image/jpeg",
        "created": "2026-03-20T10:00:00Z",
        "size": 2457600,
        "custom": {"project": "alpha", "version": "2.1"}
    }
}

Here, the id is a globally unique key (often a UUID or hash), data is the unstructured byte sequence, and metadata provides context that enables search, access control, and lifecycle management. This self-description is what makes objects powerful: they carry their own catalog card.

Limitation: Finding Objects in a Flat Sea

If we merely dumped objects into a pile, locating a specific id would require scanning every item—a linear search that doesn’t scale. As we accumulate millions or billions of objects, this becomes infeasible.

Adding Feature: The Metadata Index

To solve this, we introduce a separate metadata index that maps each object ID to its location (which storage node holds the data) and stores the metadata for quick lookup. This index is itself distributed and highly scalable, often implemented with optimized key-value stores or distributed databases. When you request an object by its ID, the system first consults the index to find where the data lives, then retrieves the data from that node.

In a diagram:

+----------------+       +---------------------+
|   Client       |       |  Metadata Index     |
|   (GET /obj123)| --->  |  ID → (node, meta)  |
+----------------+       +---------------------+
                                  |
                                  v
                        +------------------+
                        | Storage Node A   |
                        |  [obj123 data]   |
                        +------------------+

Figure 1: Basic read path. The client queries the metadata index to locate the object, then fetches the data from the appropriate storage node.

Limitation: Durability and Scale on a Single Node

Storing all objects on a single storage node creates a single point of failure. Moreover, as data grows beyond the capacity of one node, we cannot scale further without replacing hardware—a costly and disruptive process.

Adding Feature: Distributed Storage with Replication

We distribute objects across a cluster of storage nodes. Each object is replicated (e.g., 3 copies) on different nodes to protect against hardware failures. When a node fails, the system can still serve the object from its replicas. New nodes can be added to the cluster seamlessly, allowing horizontal scaling.

The write path now looks like:

+----------------+       +---------------------+       +------------------+
|   Client       |       |  Metadata Index     | --->  |  Replication     |
| (PUT /obj456)  | --->  |  Assign ID, store   |       |  Factor = 3      |
|                |       |  meta → index       |       |  → Nodes A,B,C   |
+----------------+       +---------------------+       +------------------+
                                  |
                                  v
                        +------------------+
                        |  Storage Nodes   |
                        |  (A,B,C hold)    |
                        |  identical data  |
                        +------------------+

Figure 2: Write path with replication. The metadata index records the object’s location, and the data is written to multiple nodes for durability.

Resulting Modern Concept: The Object Storage Cluster

We now have a full object storage architecture comprising:

Objects – the atomic units of data + metadata + ID.
Metadata Index – a scalable lookup service mapping IDs to locations and metadata.
Storage Nodes – a cluster of machines holding object data, often with a mix of HDDs and SSDs for cost-performance balance.
API Layer – a RESTful interface (commonly S3-compatible) that clients use to create, read, update, and delete objects.
Replication/Durability Layer – mechanisms (like replication or erasure coding) that ensure data survives node failures.
Additional Features – versioning, lifecycle policies, access controls, encryption, and immutability options.

This architecture emerged prominently with Amazon S3’s launch in 2006, which demonstrated that a simple HTTP-based interface could manage exabytes of data reliably. Since then, the pattern has been adopted across cloud providers and on-premises solutions, becoming the backbone for modern unstructured data workloads.

Problem-Solution Narrative: Show the Broken State, Then Sell the Fix

Pain Point: Replication’s Storage Overhead

Early object storage systems relied on simple replication (e.g., 3x) to achieve durability. While effective, this approach triples the storage requirement. For petabyte-scale datasets, the cost becomes prohibitive. We need a way to protect data without multiplying storage capacity linearly.

Broken State: Storing 1 PB of data with 3x replication requires 3 PB of raw disk space, driving up capital and operational expenses.

Fix: Erasure coding provides comparable durability with far less overhead. Instead of copying whole objects, erasure coding splits data into k data fragments and adds m parity fragments. Any k of the total k+m fragments can reconstruct the original object. For example, a 4+2 scheme splits data into 4 fragments and adds 2 parity pieces, allowing recovery from any two fragment losses while using only 1.5x the original storage (compared to 3x for replication).

+----------------+       +---------------------+
| Original Data  |       |  Erasure Coding     |
|  [A B C D]     | --->  |  → [A,B,C,D,P1,P2] |
+----------------+       +---------------------+
                                  |
                                  v
                        +------------------+
                        |  Distributed     |
                        |  Fragments       |
                        |  A→Node1, B→Node2|
                        |  C→Node3, D→Node4|
                        |  P1→Node5, P2→Node6|
                        +------------------+

Figure 3: 4+2 erasure coding. Six fragments are stored across six nodes; any four suffice to rebuild the data.

With erasure coding, we can store 1 PB of data using only ~1.67 PB of raw space (for 4+2), nearly halving the storage cost compared to triple replication while maintaining the ability to withstand multiple node failures.

Pain Point: Inconsistency in Distributed Systems

In a distributed system, network delays or partitions can cause replicas to diverge. If a client writes an object and immediately reads it, they might see an outdated version if the read hits a replica that hasn’t yet been updated. This inconsistency breaks expectations for applications that rely on read-after-write guarantees.

Broken State: After uploading a user profile picture, the user sees the old picture because the read was served from a lagging replica.

Fix: Eventual consistency with strong read-after-write for new objects. Modern object storage (like Amazon S3) guarantees that a newly created object is immediately readable with its written content (read-after-write consistency). For overwrites and deletes, the system offers eventual consistency, meaning all replicas will converge to the same state after a short window. This model balances performance and correctness: new data is instantly visible, while updates propagate asynchronously.

Time: T0                    T0+10ms                    T0+100ms
Client: PUT /user/avatar    →                        →
Storage: [NodeA: v1]        [NodeA: v2, NodeB: v1]   [NodeA: v2, NodeB: v2]
Client: GET /user/avatar    → (returns v2)           → (returns v2)

Figure 4: Read-after-write consistency. The write is immediately visible; subsequent reads see the new version even as replicas converge.

Pain Point: Managing Data at Different Access Temperatures

Not all data is accessed equally. Frequently accessed “hot” data warrants low-latency storage (like SSDs), while rarely accessed “cold” data can reside on cheaper, slower media (like HDDs or tape). Manually moving data between tiers is error-prone and unscalable.

Broken State: Administrators constantly monitor access patterns and manually migrate objects, leading to mistakes and uneven performance.

Fix: Policy-driven data tiering. Object storage systems allow lifecycle rules that automatically transition objects between storage classes based on age or access patterns. For example, objects older than 30 days move from Standard (hot) to Standard-IA (warm), and after 365 days to Glacier (cold). This automation aligns storage costs with data value without operational overhead.

Visual Literacy \& Descriptions: Diagrams That Teach

We use ASCII art and svgbob-style diagrams to illustrate key abstractions. Remember to read the accompanying text to understand what each element represents.

Object Structure

+---------------------------+
|          Object           |
+---------------------------+
| ID:   obj-abc123          |
| Data: [byte array...]     |
| Meta: {                   |
|   content-type: text/plain|
|   size: 1024              |
|   ...                     |
| }                         |
+---------------------------+

In this diagram, the object is a self-contained unit. The ID is the handle used to retrieve it; the data is the actual payload; the metadata provides searchable attributes.

Flat Namespace vs. Hierarchical File System

Hierarchical File System:          Flat Namespace (Object Storage):
                                   (Buckets are logical containers)
  /home                            +-------------------+
  /home/user                       |  bucket: photos   |
  /home/user/documents             |                   |
  /home/user/documents/report.txt  |  obj-001 (cat.jpg)|
                                   |  obj-002 (dog.png)|
                                   |  obj-003 (logo.svg)|
                                   +-------------------+
                                   |  bucket: backups  |
                                   |                   |
                                   |  obj-100 (dump.sql)|
                                   |  obj-101 (vm.img) |
                                   +-------------------+

Note: While object storage uses a flat namespace *within a bucket, buckets themselves provide a first level of organization. There are no nested folders inside a bucket—only objects identified by unique keys.*

Distributed Read Path with Metadata Index

Client → Metadata Index → Storage Node
   (GET)        (ID → loc)        (data)
     |                 |               |
     v                 v               v
+------+          +----------+     +----------+
|Req---|----------|Lookup----|---->|Read Data|
+------+          +----------+     +----------+

The metadata index acts as a directory service, turning an object ID into a location (node and disk offset). This indirection enables the storage layer to scale independently of the lookup mechanism.

Categorical Chunking: Grouping by Intent

Instead of listing features randomly, we organize them by the intent they serve.

Data Ingestion (Write Patterns)

Simple PUT – For objects under 5 MiB, a single HTTP PUT uploads the data and metadata in one request.
Multipart Upload – For larger objects, the data is split into parts, uploaded in parallel, and then assembled by the service. This improves resilience and throughput for big files.
Pre-signed URLs – Grant time-limited write access to clients without exposing long-term credentials, enabling direct uploads from browsers or mobile apps.

Data Retrieval (Read Patterns)

Simple GET – Fetches the entire object by ID.
Range GET – Retrieves only a byte range (e.g., for streaming video or resuming downloads), using the Range header.
Head Object – Retrieves metadata only, useful for checking existence or size without transferring data.
Select Operations – Some services allow SQL-like queries on object content (e.g., CSV or JSON) without downloading the whole object.

Data Management (Control Patterns)

Listing (LIST) – Enumerates objects with optional prefix and delimiter filters to simulate folder-like browsing.
Tagging – Adds key-value tags to objects for cost allocation, access control, or lifecycle rules.
Versioning – Keeps multiple versions of an object (identified by version ID) to protect against overwrites or enable rollback.
Locking (Immutability) – Implements WORM (Write Once Read Many) policies for compliance, preventing deletion or modification for a specified duration.

Lifecycle \& Cost Optimization

Transition Rules – Move objects between storage classes (e.g., Standard → Glacier) based on age.
Expiration Rules – Automatically delete objects after a retention period.
Intelligent-Tiering – Automatically moves objects between frequent and infrequent access tiers based on actual usage patterns.

Security \& Access Control

Access Control Lists (ACLs) – Define read/write permissions per object or bucket (legacy; increasingly supplemented by IAM policies).
Bucket Policies – JSON-based policies that grant permissions based on conditions (e.g., VPC endpoints, source IPs).
Encryption – Server-side encryption (SSE-S3, SSE-KMS, SSE-C) or client-side encryption to protect data at rest.
Transport Security – All API calls occur over HTTPS, ensuring confidentiality and integrity in transit.

Tone \& Style: We Perspective with Historical Context

We, as learners and practitioners, have seen object storage evolve from a niche cloud service to the universal substrate for unstructured data. Its origins trace back to research on scalable storage systems in the early 2000s, but it was Amazon S3’s 2006 launch that codified the RESTful, flat-namespace model we use today. The simplicity of treating every piece of data as an addressable object—without the baggage of hierarchical file systems—unlocked unprecedented scale and flexibility.

We adopt a conversational yet authoritative tone, recognizing that the best technical explanations feel like a guided walkthrough rather than a lecture. We use “we” to invite you into the thought process: We notice that replication wastes space; we wonder if there’s a better way; we discover erasure coding. This approach keeps the narrative engaging while preserving rigor.

Programming Language: Python Examples

We use Python with the boto3 library (the AWS SDK) to interact with any S3-compatible object storage service. The examples below assume you have configured access credentials (e.g., via environment variables or a credentials file).

Creating a Bucket and Uploading an Object

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3',
                  endpoint_url='http://localhost:9000',  # e.g., MinIO
                  aws_access_key_id='minioadmin',
                  aws_secret_access_key='minioadmin')

bucket_name = 'my-bucket'
object_key = 'photos/2026/03/vacation.jpg'
file_path = '/local/path/vacation.jpg'

# Create bucket if it doesn't exist
try:
    s3.head_bucket(Bucket=bucket_name)
except ClientError:
    s3.create_bucket(Bucket=bucket_name)

# Upload file
s3.upload_file(Filename=file_path,
               Bucket=bucket_name,
               Key=object_key,
               ExtraArgs={'Metadata': {'project': 'album', 'year': '2026'}})
print(f"Uploaded {file_path} to s3://{bucket_name}/{object_key}")

Downloading an Object with Range Request

# Download only the first 1 MiB (useful for previewing large files)
response = s3.get_object(Bucket=bucket_name,
                         Key=object_key,
                         Range='bytes=0-1048575')
preview_data = response['Body'].read()
print(f"First {len(preview_data)} bytes: {preview_data[:20]}...")

Listing Objects with a Prefix

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket_name, Prefix='photos/2026/'):
    for obj in page.get('Contents', []):
        print(f"{obj['Key']} ({obj['Size']} bytes)")

Enabling Versioning and Retrieving a Specific Version

# Enable versioning on the bucket
s3.put_bucket_versioning(
    Bucket=bucket_name,
    VersioningConfiguration={'Status': 'Enabled'}
)

# Upload a new version of the same key
s3.upload_file(Filename='/local/path/vacation_v2.jpg',
               Bucket=bucket_name,
               Key=object_key)

# List versions
versions = s3.list_object_versions(Bucket=bucket_name,
                                   Prefix=object_key)
for ver in versions.get('Versions', []):
    print(f"Version {ver['VersionId']} (last modified: {ver['LastModified']})")

# Download a specific version
s3.download_file(Bucket=bucket_name,
                 Key=object_key,
                 Filename='/local/path/vacation_v1.jpg',
                 VersionId='3L4k5J...')  # from list output

Setting a Lifecycle Rule to Transition to Glacier

lifecycle_rule = {
    'ID': 'MoveToGlacierAfter1Year',
    'Status': 'Enabled',
    'Filter': {'Prefix': 'archive/'},
    'Transitions': [{
        'Days': 365,
        'StorageClass': 'GLACIER'
    }],
    'Expiration': {
        'Days': 3650  # Delete after 10 years
    }
}

s3.put_bucket_lifecycle_configuration(
    Bucket=bucket_name,
    LifecycleConfiguration={'Rules': [lifecycle_rule]}
)
print("Lifecycle rule applied.")

These snippets illustrate how the abstract architecture translates into concrete operations. By leveraging the standardized S3 API, we can work with diverse backends—AWS S3, MinIO, Ceph RGW, Cloudian HyperStore, or Azure Blob Storage (via the blob service’s S3-compatible endpoint)—with minimal code changes.

Putting It All Together: The Modern Object Storage Landscape

We have journeyed from a simple analogy to a detailed architecture, uncovering the motivations behind each design choice. Object storage’s power lies in its ability to:

Scale horizontally by adding nodes without reshuffling existing data.
Provide durability through flexible protection schemes (replication or erasure coding).
Enable rich metadata that transforms objects from opaque blobs into self-describing resources.
Offer simple, ubiquitous access via RESTful APIs that language and platform agnostic.
Support advanced features like versioning, lifecycle management, and immutability for compliance and cost optimization.

As we continue to generate unprecedented volumes of unstructured data—from 8K video streams to scientific instrument logs—object storage will remain the foundational layer that lets us store, find, and use that data efficiently. Understanding its architecture and access patterns equips us to design systems that leverage its strengths while navigating its trade-offs, a skill indispensable for modern software engineers.