Building Distributed Systems With Stateright

Presenter: Jonathan Nadal

Book: www.stateright.rs

This talk attemps to to answer...

This talk attemps to to answer...

  • Why are distributed systems difficult?
    e.g. race conditions and other nondeterminism
  • What techniques are effective?
    e.g chaos testing and model checking
  • How can their limitations be overcome?
    e.g. embeddable model checking

Why are distributed systems difficult?

Why are distributed systems difficult?

Decentralization and nondeterminism
  • Time is a partial order
  • Node crashes
  • Node timeouts
  • Node version differences
  • Message reordering
  • Message loss
  • Message redelivery (network/retries)
  • Security vulnerabilities
  • ...

Why are distributed systems difficult?

Decentralization and nondeterminism

What techniques are effective?

What techniques are effective?

  • Exercise the nondeterminism.
  • Writing test cases for individual system "traces"
    (sequences of steps) is insufficient.

What techniques are effective?

Jepsen
  • Black-box testing of production binaries.
  • Crashes nodes, introduces partitions, skews clocks, ...
  • Tests observable system behavior for correctness (e.g. linearizability).
  • Dozens of real world systems tested. Most failed.
    Cassandra, CockroachDB, Elasticsearch, Hazelcast, Kafka, PostgreSQL, RabbitMQ, Redis, TiDB, VoltDB, ...
  • Limitation: tests a random subset of possible traces

What techniques are effective?

Proofs
  • Inductive invariants verify all system behaviors.
  • Applicable to design or implementation
  • Limitation: requires extensive training

What techniques are effective?

Proofs
  • Industry example: IronFleet

What techniques are effective?

Model checking

What techniques are effective?

Model checking
  • Industry example: TLA+/TLC for AWS S3/DynamoDB/more.

How can their limitations be overcome?

How can their limitations be overcome?

Introducing Stateright, a model checker for distributed systems!

How can their limitations be overcome?

Stateright...
  • is a model checker provided as a Rust library.
  • allows you to verify systems implemented in Rust.
  • includes a "time travel debugger."
  • includes a model for an actor system.
  • includes a runtime for that actor system.
  • includes code to define reference semantics
    for the system.
  • includes a linearizability tester.
  • includes extensive documentation.

How can their limitations be overcome?

Stateright's own limitations and caveats
  • Requires access to source code.
  • Only supports Rust at this time.
  • Actor runtime is a simplistic proof-of-concept.
  • Does not yet have a sizable community.
  • Will have breaking changes prior to the 1.0 release.
    But I can coordinate with early adopters to minimize disruption.

How can their limitations be overcome?

Prior Art

Stateright Model Checker

Stateright Model Checker

Model Trait

Stateright Model Checker

Example: Sliding Puzzle
puzzles must be solvable
From Wikipedia (public domain):

Stateright Model Checker

Example: Sliding Puzzle

Stateright Model Checker

Example: Sliding Puzzle
If not solvable...
... then automated tests fail.

Stateright Model Checker

Example: Sliding Puzzle
If solvable, you can also verify a solution "trace."

Stateright Model Checker

Industry Example: ibc-rs

Stateright Model Checker

Industry Example: ibc-rs
⇧ TLA+ / Rust ⇩

Stateright Model Checker

Industry Example: ibc-rs
Highlights:
  • Model checks some of the Rust IBC implementation.
  • Incomplete experiment: ported a subset of protocol actions and only a trivial property thus far.
  • Stateright throughput exceeds TLC (TLA+ checker).
  • Rust is admittedly more verbose than TLA+.

Stateright Actor Framework

Stateright Actor Framework

Actor trait

Stateright Actor Framework

Example: PingPongActor

Stateright Actor Framework

ActorModel struct

Stateright Actor Framework

Example: Ping-Pong ActorModel

Stateright Actor Framework

Stateright Explorer
Time Travel Debugger

What's Next?

  • Foster a community of users and contributors
  • Book chapter on Multi-Paxos, then TBD (ideas?)
  • Dynamic partial order reduction (DPOR)
  • Liveness checker improvements
  • Tokio/async-std runtimes?
  • Improve ergonomics with generators?
  • Adapters for more languages? (JavaScript?)
  • Other suggestions?

Stateright Resources

Q&A and/or live demo