Skip to main content

2 posts tagged with "dst"

View All Tags

· 4 min read
David Farr

Working on a project such as Resonate that uses deterministic simulation testing (DST) extensively can be both fickle and incredibly, incredibly rewarding. You see, when DST is a regular part of your testing strategy you tend to anticipate and avoid the type of bugs that DST is so good at finding in the first place. So much so that when an automated GitHub issue such as this one pops up, you tend to assume there is a bug in the test code rather than a bug in the system itself.

But this is a story about a time when DST found a real bug.

Promises and Callbacks

Two fundamental concepts of Resonate are Promises and Callbacks. If you want to learn more about these concepts I recommend checking out the Durable Promise Specification.

For the purpose of this post it is sufficient to understand that Resonate exposes a create API for both promises and callbacks and a convenience API that atomically creates a promise and a callback. A callback is registered against a promise.

CreatePromise(pid)
CreateCallback(pid, cid)
CreatePromiseAndCallback(pid, cid)

Seed 23496

Resonate DST (deterministically) randomly generates requests and fires them off against the server. We use a seed to configure the test so that if we see a failure on GitHub Actions we can easily reproduce the same failure locally.

One of these failures occurs with seed 23496.

You can check it out for yourself! Clone the resonate repository and run the DST command.

# checkout specific commit
git checkout 1f1a419

# run dst
go run ./... dst run --seed 23496 --ticks 1000

If all goes right (or wrong depending on your perspective) a log of all requests will be printed to the console ending with the following error message.

level=ERROR msg="DST is non linearizable"
Error: DST failed for seed='23496'

Resonate uses Porcupine, a fantastic go project, to verify the linearizability of the concurrent requests sent and responses received by the test. Porcupine also generates a visualization you can see by opening up dst.html in a browser, click “jump to first error” to see where Porcupine could no longer verify linearizability.

Visualization Screenshot

The Investigation

We can use the request log and the visualization to help us determine what exactly went wrong. Looking in the vicinity of the first error, three requests stand out as interesting.

IdRequestResonate StatusHttp Status
78CreatePromise(pid=p5)2010201
90CreatePromiseAndCallback(pid=p5, cid=cb13)4090409
119CreateCallback(pid=p5, cid=cb13)2000200

CreatePromise (request 78) results in a 2010 status code. Resonate status codes can be converted to http response codes by dividing by 10 and rounding down, here a 201 indicates that the promise was created successfully.

CreatePromiseAndCallback (request 90) results in a 4090 indicating a conflict, the promise already exists.

CreateCallback (request 119) results in a 2000 indicating the callback was successfully deduplicated, but not created, by this request. The callback must already exist.

Inspecting the logs further we can verify that at no point in time prior to request 119 was a callback with id cb13 created. All indicators are pointing towards an erroneous callback creation, perhaps our atomic CreatePromiseAndCallback request is not so atomic after all.

IdRequestActualExpected
78CreatePromise(pid=p5)201201
90CreatePromiseAndCallback(pid=p5, cid=cb13)409409
119CreateCallback(pid=p5, cid=cb13)200201

The Bug

CreatePromiseAndCallback is a relatively recent addition to Resonate and a likely culprit, so I dug into the SQL queries. The request is intended to be atomic; either both the promise and callback are created or neither are. Let’s look at the queries, understanding that these occur in a transaction.

# Create Promise
INSERT INTO promises(id) VALUES ('p5') ON CONFLICT(id) DO NOTHING

# Create Callback
INSERT INTO callbacks(id, promise_id)
SELECT
'cb5', 'p5'
WHERE EXISTS
(SELECT 1 FROM promises WHERE id = 'p5' AND state = 1)

Aha! The create promise query has an on conflict clause that effectively undoes the atomicity guarantees of the transaction. So when the CreatePromiseAndCallback request occurs the promise is not created, but the callback is. Our DST model is rightfully unaware that callback cb13 exists and therefore returns an error when CreateCallback returns a 200, indicating deduplication.

Reflection

How long would it have taken us to discover this bug without deterministic simulation testing? And how many productive systems would have been affected? It’s perhaps hard to say for certain, but I for one am happy we have DST.

· 8 min read
Dominik Tornow

At Resonate, we consider deterministic simulation testing (DST) to be a cornerstone of our mission to build correct and reliable distributed systems. While an increasing array of projects, including Foundation DB, TigerBeetle DB, and Resonate itself, have embraced DST, along with companies like Antithesis providing platforms dedicated to this approach, comprehensive information remains limited.

In this post, we’ll demystify DST by constructing an accurate yet concise mental model. We will explore the theoretical foundations as well as practical applications—and why I believe DST will soon become indispensable for developing distributed systems.

The Challenge

The task of testing concurrent and distributed systems presents formidable challenges: concurrency introduces non-determinism manifested through unpredictable execution order while distribution introduces non-determinism manifested through unpredictable partial failures, creating a vast behavior space to consider.

Notably, execution traces resulting from rare event combinations are particularly elusive. These are not only challenging to observe, but also even more challenging to reproduce, giving rise to the infamous Heisenbugs.

A System's Multiverse

Concurrency results in non-determinism in the form of random execution order: The concurrent composition of a set of processes refers to the set of all possible interleavings, so given two processes P = ⟨a, b⟩ and Q = ⟨c, d⟩, the set of all possible interleavings is:

P | Q = { ⟨a, b, c, d⟩, ⟨a, c, b, d⟩, ... ⟨c, d, a, b⟩ }

Distribution results in non-determinism in the form of random partial failures: adverse events like crashes or network partitions further complicate interleavings:

P | Q | 💀

We can argue that due to the non-determinism introduced by concurrency and distribution we are not dealing with one but many systems-one per possible interleaving

Traditional testing methodologies, including unit, integration, and end-to-end tests, often prove inadequate for holistically capturing the safety and liveness properties of a concurrent, distributed system. Due to the inherent non-deterministic nature of systems, tests are not reproducible.

This gap in the testing landscape is bridged by deterministic simulation testing, positioning itself between non-determinsitic simulation testing (e.g. jepsen or chaos engineering) and formal methods (e.g. TLA+, p-lang).

Comparism

Non-Determinisitic Simulation Testing

Non-determinisitic simulation testing identifies correctness violations by validating the system itself. Tools such as Jepsen follow a probabilistic, non-reproducible exploration of possible execution traces to find correctness violations by elevating the propability of rare execution traces self-induced by frequent process crashes and network partitions. Their primary limitation is reproducibility; despite increased chances of bug detection, their sporadic nature hampers consistent reproduction for subsequent analysis.

Formal Methods

Formal methods identify correctness violations by validating a formal model of a system. Tools such as TLA+ follow a determinisitc, reproducible exploration of possible execution traces to find correctness violations. Their primary limitation is the model-system discrepancy; a correct conceptualization (mode) does not imply a correct implementation (system).

Deterministic Systems

Understanding deterministic simulation testing hinges on understanding deterministic execution. Let's demystify deterministic execution by constructing a minimal system model illustrated through a practical GoLang example.

System Model

A system's execution unfolds in a series of discrete steps, that is, discrete events (called a trace) governed by the rules of its runtime environment. Each event is categorized as either observable (external) or non-observable (internal) based on its relevance to us.

A system is deterministic if there exists only one possible trace of external events.

In essence a determinisitc system is a system without choice.

Formal Model

An external trace or trace is a projection of a complete trace containing only external events where the deliniation of internal and external events is user supplied.

trace = ⟨e₁, e₂, e₃, eₙ⟩, with eᵢ ∈ External

The function traces maps a System S and the Initial State σ under the evaluation rules of a Runtime R to the set of all possible external traces.

traces(S, σ, R) = {trace₁, ... traceₙ}

A system is determinitic if for every Initial State σ, the cardinality of the set of traces for System S under Runtime R is 1:

∀ σ ∈ Σ : | traces(S, σ, R) | = 1

A system is non-deterministic if there exists an Initial State σ such that the cardinality of the set of traces for System S under Runtime R is larger than 1:

∃ σ ∈ Σ : | traces(S, σ, R) | > 1

Example

With the theoretical foundations in place, let's explore a practical application. Consider the following GoLang example, which sums up the values in a key-value map. For this illustration, we'll define the trace of the execution as the lines printed to standard output, where the events are labeled as i:number, v:number, and s:number.

GoLang does not specify an iteration order for key-value maps, in fact, golang explicitly randomizes the iteration order so software engineers don't implicitly rely on an unspecified order. Two different executions of the program may produce different traces.

package main

import "fmt"

func main() {
// Create a map with string keys and integer values
myMap := map[string]int{"key.1": 2, "key.2": 3, "key.3": 5}

// Initialize a variable to hold the cumulative sum of the values
sum := 0

// Print the initial value
fmt.Println("i:", sum)

// Loop over all key-value pairs in the map
for _, value := range myMap {
// Print the current value
fmt.Println("v:", value)
// Add the value to the sum
sum += value
}

// Print the final total sum of the values
fmt.Println("s:", sum)
}

Is this program deterministic? The answer hinges on our classification of events. If we consider i:number and s:number as external events but v:number as an internal event, the program is deterministic as the set of all possible traces has only one element:

i:0 -> s:10

If we consider i:number, s:number, and v:number as external events, the program is non-determinitic as the set of all possible traces has six elements:

i:0 -> v:2 -> v:3 -> v:5 -> s:10
i:0 -> v:2 -> v:5 -> v:3 -> s:10
i:0 -> v:3 -> v:2 -> v:5 -> s:10
i:0 -> v:3 -> v:5 -> v:2 -> s:10
i:0 -> v:5 -> v:2 -> v:3 -> s:10
i:0 -> v:5 -> v:3 -> v:2 -> s:10

This example highlights how the same program's determinism or non-determinism hinges on our perspective—specifically, on our definitions of internal versus external events.

Deterministic Simulation Testing

We delineate a system into two main components: application and environment. The application and the environment interact via a well defined protocol (API).

System Model

The delineate of application and environment-while ultimately determined by the system designer-follows the philosophie of deliniating of the orchestration of commands (application) and the execution of commands (environment).

The core idea is to substitute the environment with a simulator: Deterministic simulation testing repeatedly executes an application in a simulated environment under changing initial conditions, monitoring that the correctness constraints are maintained across executions.

Deterministic Simulation Testing

The composition of application and simulator must yield a deterministic system, that is, given the same initial state, the system will always yield the same trace of external events.

Now the progression of the system is determined by the intial state of the system. In practice the minimal initial state is the seed for a pseudo random number generator that the simulator uses to drive the system forward.

With this you can reproduce an entire execution by restarting the system with the same random seed.

It's about time

Many complex systems depend on real time in some way. For example, a stream processing system may offload a batch from hot storage to cold storage two weeks after creation of the batch. Time dependent code path are hard to test—and even harder to test repeatedly.

A side effect not of deterministic simulation testing but of virtualizing the environment is virtualizing time: The simulator does not only simulate the execution of commands but can also simulate the progression of time. This allows us to simulate months or years of continuous operations in mere hours of testing.

Limitations

Deterministic simulation testing is a capable but intrusive technique: The composition of application and simulator must yield a deterministic system-a high bar to meet. For example, if you target golang you have no contol over goroutine scheduling, forcing you to "work around" golangs concurrency abstractions by building your own.

Conclusion

Deterministic simulation testing is invaluable to our mission to build correct and reliable distributed systems. In the next blog post I will detail Resonate's implementation of deterministic simulation testing in golang. In the meantime, head over to our repository and check out our implementation yourself.