Skip to content

SFS SAL Backend Design

Context and Problem Statement

Broader SFS design. Follow up ADRs will go into depth; this is for breadth.

Decision Drivers

  • Assume NVMe-alike storage technology. High parallelism, similar access latency over all blocks.
  • Versioning: Every object will have multiple versions.
  • Files on disk are immutable once committed.
  • Easy querying of bucket, object and version data using SQL.
  • Simplicity
  • Leave room to grow: Keep sharding to multiple SQLite databases in mind.

Decision Outcome

Data and Metadata

Data is the payload submitted in PUT requests by the user.

Metadata consists of S3 metadata (atime, names) and internal metadata (object path on disk, version id).

Metadata lives in SQLite databases; Data in files on disk.

Data Layout

Store object data on a standard Linux filesystem like XFS or ext4.

Identify objects by UUID assigned on create. Use version 4, random UUIDs.

Use a multi level directory tree based on the object UUID to address object data files on the filesystem. Use two one byte / two hex char levels. For example, map UUID 683600a-b6c4-4760-a083-5f1a65ab8e95 /36/83/600ab6c44760a0835f1a65ab8e95.

Metadata Model

erDiagram
    Bucket ||..}o Object : has
    Object ||..}o VersionedObject : has
    VersionedObject ||..|| ObjectState : "is in"

    Bucket {
        serial bucket_id PK
        string name
        blob data
    }

    Object {
        uuid object_uuid
        serial object_id PK
        int bucket_id FK
        string name
        blob data
    }

    VersionedObject {
        serial version_id
        int object_id FK
        int obj_state_Id FK
        int checksum
        ts ctime
        ts dtime
        int size
        path filename
        bool pinned "Prevent deletion"
    }

    ObjectState {
        serial obj_state_id PK
        string name
    }

We store most RGW metadata using the Ceph encoder in blobs, as exemplified with data in the ER diagram. We extract values to columns if we need to query them.

Object State Machine

stateDiagram-v2
    [*] --> Writing
    [*] --> Deleted
    Writing --> Writing
    Writing --> Committed
    Committed --> Deleted

Object versions follow a state machine. When created by upload they start in writing and once all data is on disk reach committed. Multipart uploads stay in writing until the user uploaded all parts, then reach committed. Committed versions are available for download. Versions may immediately reach deleted when issuing a DELETE to an object or from committed when deleting an existing version.

Deletion is generally soft and follows garbage collection and retention rules.

Keeping a writing state allows us to fail uploads when they don't reach committed after a timeout.

Last Object Version

Use max(version_id) in state committed of selected object.