a durable, crash-resistant container image pipeline built on the superfly/fsm library. two-phase workflow: a deterministic prepare phase that runs once per image family, and a per-invocation activate phase that creates a fresh copy-on-write snapshot.
two independent FSM workflows share a single SQLite database and BoltDB state store. prepare runs with a deterministic ID per family — crash and restart and it picks up from the last completed step. activate always creates a fresh thin snapshot.
each step is a durable BoltDB checkpoint. crash at any point and resume picks up from the last completed transition.
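the resume behaviour can be sketched without the real library. this is a minimal stand-in, not the superfly/fsm API: an in-memory map plays the role of the BoltDB store, and `checkpointStore`, `runSteps`, the step names, and the `ubuntu` family are all hypothetical. only the `prepare-{family}` run ID format comes from the pipeline itself.

```go
package main

import (
	"errors"
	"fmt"
)

// errSimulated stands in for a crash in the middle of a run.
var errSimulated = errors.New("simulated crash")

// checkpointStore is a hypothetical in-memory stand-in for the durable
// BoltDB state store: run ID -> number of completed steps.
type checkpointStore map[string]int

// step is one named transition in a workflow.
type step struct {
	name string
	run  func() error
}

// prepareRunID derives the deterministic run ID for an image family.
func prepareRunID(family string) string {
	return "prepare-" + family
}

// runSteps executes steps in order, skipping those the store already
// recorded as completed, and checkpoints after each success.
// It returns the names of the steps it actually executed.
func runSteps(store checkpointStore, runID string, steps []step) ([]string, error) {
	var executed []string
	for i := store[runID]; i < len(steps); i++ {
		if err := steps[i].run(); err != nil {
			return executed, fmt.Errorf("step %s: %w", steps[i].name, err)
		}
		executed = append(executed, steps[i].name)
		store[runID] = i + 1 // checkpoint: this transition completed
	}
	return executed, nil
}

func main() {
	store := checkpointStore{}
	id := prepareRunID("ubuntu")

	crash := true
	steps := []step{
		{"download", func() error { return nil }},
		{"unpack", func() error {
			if crash {
				return errSimulated
			}
			return nil
		}},
		{"finalize", func() error { return nil }},
	}

	first, err := runSteps(store, id, steps)
	fmt.Println("first run executed:", first, "error:", err)

	crash = false // "restart": same run ID, same store
	second, err := runSteps(store, id, steps)
	fmt.Println("second run executed:", second, "error:", err)
}
```

because the run ID is derived from the family name, the second call finds the same checkpoint and starts at the failed step instead of repeating the completed one. no step has to carry its own "did i already run" flag.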
the refactor was driven by one principle: trust the FSM to know which steps ran. remove everything inside steps that duplicates what BoltDB already tracks.
prepare-{family} as the FSM run ID means crash recovery naturally resumes the exact same run via BoltDB. no need for a prepared flag inside steps: the FSM already knows which transitions completed. the flag is only checked in main.go before deciding whether to start phase 1.
each step takes its input from r.W.Msg from the previous step. ActivateSnapshot reads r.W.Msg.BaseLvID directly. PrepareThinBase is the exception: it checks the DB for an existing base_lv_id first, because a crash after the DB write but before the FSM transition would otherwise re-run the allocation and leave an orphan volume. idempotency check before allocation, response chain after.
checking prepared inside UnpackIntoBase says "don't trust the FSM to know if this step ran." that undermines the entire point of using a durable state machine. the flag is checked once in main.go before starting phase 1, then set there after phase 1 completes.
fsm.Abort(err) signals the library to stop immediately instead of burning all retry attempts on a hopeless operation. blob download failures propagate normally so the FSM retries them on transient S3 errors.
pid:{pid}:{random} as the holder value. the random suffix handles PID reuse after a crash, so two processes never share a holder ID. ReleaseLock only deletes rows where v = ? matches, so process A cannot accidentally release process B's lock.
sequences table with atomic UPDATE ... RETURNING. each namespace (base, snap) gets its own counter starting at 1. this eliminates the collision risk of the previous rand.Int63n(1000000) approach, which is critical when running on thousands of servers.
dmsetup message errors are now captured and checked. if the target device doesn't exist after the message fails, it's a real error (pool corruption, full thinpool, kernel driver issue) and is propagated up. if the device does exist, it was a race and the error is safe to ignore. no more silent _ = exec.Command().
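the dmsetup check described above could look roughly like this. it is a sketch under assumptions, not the pipeline's actual code: `checkMessageResult`, `sendMessage`, and `deviceExists` are hypothetical names, the `/dev/mapper/<name>` existence test is an assumed way of deciding whether the target device is present, and `snap-1` is a made-up device name. the demo in `main` uses a simulated error so it runs without root or a thin pool.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// errMessageFailed simulates a failed "dmsetup message" call in the demo.
var errMessageFailed = errors.New("simulated dmsetup message failure")

// deviceExists reports whether a device-mapper node is present.
// Assumption: the device of interest shows up under /dev/mapper/<name>.
func deviceExists(name string) bool {
	_, err := os.Stat("/dev/mapper/" + name)
	return err == nil
}

// sendMessage runs "dmsetup message <pool> 0 <message>" and returns the
// error together with the command output instead of discarding it.
func sendMessage(pool, message string) error {
	out, err := exec.Command("dmsetup", "message", pool, "0", message).CombinedOutput()
	if err != nil {
		return fmt.Errorf("dmsetup message %q on %s: %w (output: %s)", message, pool, err, out)
	}
	return nil
}

// checkMessageResult decides what to do with the result of a message call:
// no error is success, an error with the device present is treated as a
// lost race, and an error with the device missing is propagated.
func checkMessageResult(msgErr error, device string, exists func(string) bool) error {
	if msgErr == nil {
		return nil
	}
	if exists(device) {
		return nil // device is there, so the failure was a race
	}
	return fmt.Errorf("device %s missing after failed message: %w", device, msgErr)
}

func main() {
	present := func(string) bool { return true }
	absent := func(string) bool { return false }

	fmt.Println("device present:", checkMessageResult(errMessageFailed, "snap-1", present))
	fmt.Println("device absent: ", checkMessageResult(errMessageFailed, "snap-1", absent))
}
```

in real use the pieces would compose as `checkMessageResult(sendMessage(pool, msg), device, deviceExists)`. passing the existence check in as a function keeps the decision logic testable on a machine without device-mapper.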
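the holder-value scheme for locks can be illustrated the same way. this sketch replaces the SQLite lock rows with an in-memory map, so `lockTable`, `newLockTable`, `AcquireLock`, `newHolderID`, and the `thinpool` key are hypothetical. only the `pid:{pid}:{random}` format and the "delete only where v matches" rule of ReleaseLock come from the description above.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"os"
	"sync"
)

// newHolderID builds a holder value of the form pid:{pid}:{random}.
// The random suffix keeps IDs distinct even if the OS reuses a PID.
func newHolderID() (string, error) {
	var b [8]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	return fmt.Sprintf("pid:%d:%s", os.Getpid(), hex.EncodeToString(b[:])), nil
}

// lockTable is a hypothetical in-memory stand-in for the lock rows
// kept in SQLite: lock key -> holder value.
type lockTable struct {
	mu   sync.Mutex
	rows map[string]string
}

func newLockTable() *lockTable {
	return &lockTable{rows: map[string]string{}}
}

// AcquireLock takes the lock only if nobody holds it.
func (t *lockTable) AcquireLock(key, holder string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, held := t.rows[key]; held {
		return false
	}
	t.rows[key] = holder
	return true
}

// ReleaseLock deletes the row only when the stored value matches the
// caller's holder ID, mirroring a DELETE guarded by "v = ?".
func (t *lockTable) ReleaseLock(key, holder string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.rows[key] != holder {
		return false
	}
	delete(t.rows, key)
	return true
}

func main() {
	a, errA := newHolderID()
	b, errB := newHolderID()
	if errA != nil || errB != nil {
		fmt.Println("could not generate holder IDs:", errA, errB)
		return
	}
	locks := newLockTable()
	fmt.Println("A acquires:", locks.AcquireLock("thinpool", a))
	fmt.Println("B releases A's lock:", locks.ReleaseLock("thinpool", b))
	fmt.Println("A releases its own lock:", locks.ReleaseLock("thinpool", a))
}
```

the release is a compare-and-delete: a process that restarted with a recycled PID still gets a different random suffix, so its release attempt does not match the stale row.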