Building Distributed

Building Distributed Systems With Stateright

Presenter: Jonathan Nadal

Book: www.stateright.rs

This talk attemps to to answer...

Why are distributed systems difficult?
e.g. race conditions and other nondeterminism
What techniques are effective?
e.g chaos testing and model checking
How can their limitations be overcome?
e.g. embeddable model checking

Why are distributed systems difficult?

Decentralization and nondeterminism

Time is a partial order
Node crashes
Node timeouts
Node version differences
Message reordering
Message loss
Message redelivery (network/retries)
Security vulnerabilities
...

Why are distributed systems difficult?

Decentralization and nondeterminism

What techniques are effective?

Exercise the nondeterminism.
Writing test cases for individual system "traces"
(sequences of steps) is insufficient.

What techniques are effective?

Jepsen

Black-box testing of production binaries.
Crashes nodes, introduces partitions, skews clocks, ...
Tests observable system behavior for correctness (e.g. linearizability).
Dozens of real world systems tested. Most failed.
Cassandra, CockroachDB, Elasticsearch, Hazelcast, Kafka, PostgreSQL, RabbitMQ, Redis, TiDB, VoltDB, ...
Limitation: tests a random subset of possible traces

What techniques are effective?

Proofs

Inductive invariants verify all system behaviors.
Applicable to design or implementation
Limitation: requires extensive training

What techniques are effective?

Proofs

Industry example: IronFleet

What techniques are effective?

Model checking

Enumerate all traces in a boundary
Typically design only
Who Builds a House Without Drawing Blueprints
Limitation: risk of implementation divergence

What techniques are effective?

Model checking

Industry example: TLA+/TLC for AWS S3/DynamoDB/more.

How can their limitations be overcome?

Introducing Stateright, a model checker for distributed systems!

How can their limitations be overcome?

Stateright...

is a model checker provided as a Rust library.
allows you to verify systems implemented in Rust.
includes a "time travel debugger."
includes a model for an actor system.
includes a runtime for that actor system.
includes code to define reference semantics
for the system.
includes a linearizability tester.
includes extensive documentation.

How can their limitations be overcome?

Stateright's own limitations and caveats

Requires access to source code.
Only supports Rust at this time.
Actor runtime is a simplistic proof-of-concept.
Does not yet have a sizable community.
Will have breaking changes prior to the 1.0 release.
But I can coordinate with early adopters to minimize disruption.

How can their limitations be overcome?

Prior Art

VeriSoft (for C)
Concuerror (for Erlang)
Java PathFinder (for the JVM)
Coyote (for .NET)

Stateright Model Checker

Model Trait

Stateright Model Checker

Example: Sliding Puzzle

puzzles must be solvable

From Wikipedia (public domain):

Stateright Model Checker

Example: Sliding Puzzle

Stateright Model Checker

Example: Sliding Puzzle

If not solvable...

... then automated tests fail.

Stateright Model Checker

Example: Sliding Puzzle

If solvable, you can also verify a solution "trace."

Stateright Model Checker

Industry Example: ibc-rs

Inter-Blockchain Communication protocol
Rust impl: github.com/informalsystems/ibc-rs
Copyright Informal Systems and contributors
Open source: Apache License 2.0
Partial TLA+ to Stateright port by Vitor Enes Duarte

Stateright Model Checker

Industry Example: ibc-rs

⇧ TLA+ / Rust ⇩

Stateright Model Checker

Industry Example: ibc-rs

Highlights:

Model checks some of the Rust IBC implementation.
Incomplete experiment: ported a subset of protocol actions and only a trivial property thus far.
Stateright throughput exceeds TLC (TLA+ checker).
Rust is admittedly more verbose than TLA+.

Stateright Actor Framework

Actor trait

Stateright Actor Framework

Example: PingPongActor

Stateright Actor Framework

ActorModel struct

Stateright Actor Framework

Example: Ping-Pong ActorModel

Stateright Actor Framework

Stateright Explorer

Time Travel Debugger

What's Next?

Foster a community of users and contributors
Book chapter on Multi-Paxos, then TBD (ideas?)
Dynamic partial order reduction (DPOR)
Liveness checker improvements
Tokio/async-std runtimes?
Improve ergonomics with generators?
Adapters for more languages? (JavaScript?)
Other suggestions?