fsm — container image orchestrator

// 01 — system overview

Architecture

two independent FSM workflows share a single SQLite database and a BoltDB state store. prepare runs under a deterministic ID per family, so after a crash and restart it picks up from the last completed step. activate always creates a fresh thin snapshot.

S3 Bucket

public bucket
anonymous access
image layers
images/{family}/*

PHASE 1 — prepare-{family}

1. FetchManifest: list S3 keys
2. DownloadBlobs: fetch layers to blobs/
3. PrepareThinBase: create DM volume + ext4
4. UnpackIntoBase: extract OCI tarballs

prepared=1 written to SQLite by main.go

PHASE 2 — activate-{uuid} (each invocation)

1. ActivateSnapshot: thin clone from base
→ mounted at /mnt/images/{snap_lv_id}

SQLite

fsm.db
WAL mode
images, blobs, activations, sequences

DeviceMapper

thin pool
base volumes
cow snapshots
/dev/mapper/*

BoltDB

FSM state
./fsmdb/
crash recovery
per-step checkpoints

// 02 — fsm workflow

Workflow Animator

each step is a durable BoltDB checkpoint. crash at any point and resume picks up from the last completed transition.
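
the resume behavior can be sketched roughly as follows. this is a minimal illustration, not the fsm library's API: an in-memory map stands in for the BoltDB store, and step, checkpointStore, runSteps and errInterrupted are hypothetical names.

```go
package main

import "errors"

// step is a hypothetical named unit of work. The real fsm library's types
// are not shown in the source.
type step struct {
	name string
	run  func() error
}

// checkpointStore stands in for the BoltDB state store: for each run ID it
// records which steps have completed.
type checkpointStore map[string]map[string]bool

// errInterrupted is only used to simulate a crash in examples.
var errInterrupted = errors.New("interrupted")

// runSteps executes steps in order, skips any step already checkpointed for
// runID, and records a checkpoint after each successful step.
func runSteps(store checkpointStore, runID string, steps []step) error {
	if store[runID] == nil {
		store[runID] = map[string]bool{}
	}
	for _, s := range steps {
		if store[runID][s.name] {
			continue // completed before the crash, so it is not repeated
		}
		if err := s.run(); err != nil {
			return err // no checkpoint written: the step re-runs on resume
		}
		store[runID][s.name] = true
	}
	return nil
}
```

calling runSteps a second time with the same run ID re-executes only the step that failed and the steps after it.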


// 03 — design decisions

Key Decisions

the refactor was driven by one principle: trust the FSM to know which steps ran. remove everything inside steps that duplicates what BoltDB already tracks.

FSM Design (4 decisions)

// 01

two-phase FSM design

prepare and activate have different semantics. prepare is idempotent: it runs to completion once per image family and is not repeated afterwards. activate creates a fresh snapshot on every invocation. separating them into distinct FSM workflows makes each phase's intent explicit and lets each one resume independently.

architecture

// 02

deterministic run ID

using prepare-{family} as the FSM run ID means crash recovery naturally resumes the exact same run via BoltDB. no need for a prepared flag inside steps — the FSM already knows which transitions completed. the flag is only checked in main.go before deciding whether to start phase 1.

crash recovery
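
a small sketch of the two run ID schemes, for illustration only. prepareRunID and activateRunID are hypothetical helper names, and 16 random bytes in hex stand in for the {uuid} part, since the source does not say how the UUID is generated.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
)

// prepareRunID is deterministic: the same family always maps to the same
// run ID, so a restart addresses the same FSM run.
func prepareRunID(family string) string {
	return "prepare-" + family
}

// activateRunID is fresh on every call, so each activation is a new run.
// Assumption: random hex is used here in place of a real UUID.
func activateRunID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	return "activate-" + hex.EncodeToString(b[:]), nil
}
```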

// 03

trust the response chain — with crash recovery

each FSM step receives state via r.W.Msg from the previous step. ActivateSnapshot reads r.W.Msg.BaseLvID directly. PrepareThinBase is the exception: it checks the DB for an existing base_lv_id first, because after a crash that lands between the DB write and the FSM transition, the FSM re-runs the step on resume, and allocating again would orphan the first volume. the rule: idempotency check before allocation, response chain after.

fsm principle, crash recovery
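
the check-before-allocate pattern in PrepareThinBase might look something like this. it is a sketch under assumptions: imageStore is a hypothetical in-memory stand-in for the SQLite row that holds base_lv_id, and ensureBaseLvID is not a name from the source.

```go
package main

// imageStore is a hypothetical stand-in for the SQLite data that maps an
// image family to its base_lv_id.
type imageStore struct {
	baseLvID map[string]int64 // family -> base_lv_id
	nextID   int64            // last allocated ID
}

// ensureBaseLvID returns the recorded base_lv_id for family if one exists,
// and allocates a new one only when none is recorded. The bool reports
// whether a new volume ID was allocated by this call.
func (s *imageStore) ensureBaseLvID(family string) (int64, bool) {
	if id, ok := s.baseLvID[family]; ok {
		return id, false // a previous attempt already allocated it
	}
	s.nextID++
	s.baseLvID[family] = s.nextID
	return s.nextID, true
}
```

when the step is re-run after a crash, the second call returns the ID recorded by the first attempt instead of creating another volume.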

// 04

prepared flag lives in main.go

checking prepared inside UnpackIntoBase says "don't trust the FSM to know if this step ran." that undermines the entire point of using a durable state machine. the flag is checked once in main.go before starting phase 1, then set there after phase 1 completes.

correctness
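
roughly, the gating in main.go could be shaped like this. flagStore and ensurePrepared are hypothetical names, a map stands in for the prepared column in SQLite, and runPrepare represents whatever call starts the phase 1 FSM.

```go
package main

// flagStore is a hypothetical stand-in for the prepared flag in SQLite.
type flagStore map[string]bool

// ensurePrepared starts phase 1 only when the family is not yet marked
// prepared, and sets the flag only after phase 1 returns without error.
// No FSM step ever reads the flag.
func ensurePrepared(flags flagStore, family string, runPrepare func(runID string) error) error {
	if flags[family] {
		return nil // phase 1 already completed in an earlier process
	}
	if err := runPrepare("prepare-" + family); err != nil {
		return err // flag stays unset, so the next start resumes the same run
	}
	flags[family] = true
	return nil
}
```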

Error Handling (2 decisions)

// 05

fsm.Abort for non-retryable errors

if S3 returns zero layers, retrying won't help — the family doesn't exist. fsm.Abort(err) signals the library to stop immediately instead of burning all retry attempts on a hopeless operation. blob download failures propagate normally so the FSM retries them on transient S3 errors.

error handling
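
the difference between aborting and retrying can be sketched without the library. everything here is a local stand-in: abortError and abort imitate the role of fsm.Abort (whose real return type is not shown in the source), and runWithRetry is a hypothetical retry loop, not the library's.

```go
package main

import "errors"

// abortError marks an error as non-retryable. It is a local stand-in for
// whatever fsm.Abort produces.
type abortError struct{ err error }

func (a abortError) Error() string { return a.err.Error() }
func (a abortError) Unwrap() error { return a.err }

// abort wraps err so the retry loop stops immediately.
func abort(err error) error { return abortError{err} }

var errNoLayers = errors.New("no layers found for family")

// checkManifest aborts on an empty listing, because retrying cannot make a
// missing family appear.
func checkManifest(keys []string) error {
	if len(keys) == 0 {
		return abort(errNoLayers)
	}
	return nil
}

// runWithRetry calls fn up to maxAttempts times. It stops early on success
// or on an abort, and returns how many attempts were made.
func runWithRetry(maxAttempts int, fn func() error) (attempts int, err error) {
	for attempts < maxAttempts {
		attempts++
		err = fn()
		if err == nil {
			return attempts, nil
		}
		var a abortError
		if errors.As(err, &a) {
			return attempts, err // non-retryable: give up now
		}
	}
	return attempts, err
}
```

an ordinary error returned from fn, such as a transient download failure, is retried until maxAttempts is reached.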

// 06

WriteResult removed as FSM step

writing a text file is not a meaningful crash checkpoint. if the process dies after mounting a snapshot, there's no need to re-mount to write a file — the result can be derived from SQLite. moved to main.go after both FSMs complete, making the step graph honest about what actually needs durability.

simplicity

Hostile Environment (3 decisions)

// 07

unique lock ownership

locks use pid:{pid}:{random} as the holder value. the random suffix handles PID reuse after a crash: a new process that inherits a recycled PID still gets a distinct holder ID. ReleaseLock deletes only rows whose v column matches the caller's holder value (v = ?), so process A cannot accidentally release process B's lock.

concurrency, hostile-env

// 08

monotonic ID allocation

thin volume IDs come from a sequences table with atomic UPDATE ... RETURNING. each namespace (base, snap) gets its own counter starting at 1. this eliminates the collision risk of the previous rand.Int63n(1000000) approach, where two allocations could draw the same ID from only a million values. that matters when running on thousands of servers.

concurrency, scale
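
a sketch of per-namespace monotonic allocation. the SQL constant shows the kind of statement described above, with assumed column names (name, value), and RETURNING requires SQLite 3.35 or newer. the Go type is an in-memory stand-in whose mutex plays the role of the atomic UPDATE.

```go
package main

import "sync"

// nextIDSQL is illustrative only. The column names are assumptions; the
// source names only the sequences table and UPDATE ... RETURNING.
const nextIDSQL = `UPDATE sequences SET value = value + 1 WHERE name = ? RETURNING value`

// sequences is an in-memory stand-in for the sequences table.
type sequences struct {
	mu     sync.Mutex
	values map[string]int64 // namespace -> last allocated ID
}

// next increments and returns the counter for namespace in one critical
// section, so two callers can never receive the same ID. The first ID in
// each namespace is 1.
func (s *sequences) next(namespace string) int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.values[namespace]++
	return s.values[namespace]
}
```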

// 09

surface dmsetup errors

dmsetup message errors are now captured and checked. if the target device doesn't exist after the message fails, it's a real error (pool corruption, full thinpool, kernel driver issue) — propagated up. if the device does exist, it was a race and the error is safe to ignore. no more silent _ = exec.Command().

error handling, hostile-env
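
one possible shape for the dmsetup check, as a hedged sketch. createThin, classifyMessageErr and deviceExists are hypothetical names; the create_thin message and the /dev/mapper/ existence test are assumptions about which message is sent and how "the device exists" is determined.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// deviceExists reports whether /dev/mapper/<name> is present.
// Assumption: existence is checked through the device node.
func deviceExists(name string) bool {
	_, err := os.Stat("/dev/mapper/" + name)
	return err == nil
}

// classifyMessageErr applies the rule from the text: a failed message is
// ignored only when the target device exists anyway (a lost race);
// otherwise the error is propagated together with dmsetup's output.
func classifyMessageErr(msgErr error, output []byte, exists bool) error {
	if msgErr == nil || exists {
		return nil
	}
	return fmt.Errorf("dmsetup message failed and device is missing: %w (output: %s)", msgErr, output)
}

// createThin sends a create_thin message to the pool and surfaces the
// error instead of discarding it.
func createThin(pool, device string, thinID int64) error {
	msg := fmt.Sprintf("create_thin %d", thinID)
	out, err := exec.Command("dmsetup", "message", pool, "0", msg).CombinedOutput()
	return classifyMessageErr(err, out, deviceExists(device))
}
```

keeping the decision in classifyMessageErr separates it from the command execution, so it can be exercised without a thin pool.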