TLA+ Modeling Tips

> TLA+ is a formal specification language developed by Leslie Lamport. It is used for designing, modelling, documentation, and verification of programs, especially concurrent systems and distributed systems. TLA+ is considered to be exhaustively-testable pseudocode, and its use likened to drawing blueprints for software systems;

https://en.wikipedia.org/wiki/TLA+

This Scott Wlaschin talk [1] is a good introduction to TLA+. And the slides [2].

[1] https://www.youtube.com/watch?v=qs_mmezrOWs

[2] https://speakerdeck.com/swlaschin/tla-plus-for-programmers

To me the modeling is not that hard but I struggle to come up with properties/invariants.

Take something like Google Docs. I'm sure they have some websocket protocol to handle collaborative text editing:

- establish connection

- do a few retries if connection is lost

- always accept edits even when there is no connection

- send queued edits when connection is back

- etc

But what properties can you imagine here?

We never enter an error state? Not really. Error states are unavoidable / outside the control of the system.

We always recover from errors? Not really. Connection might never come back. Or the error requires action from the user (e.g. authentication issues).

The strange case of "TLA+" topics reaching HN's Front Page without any reason.

This Scott Wlaschin talk [1] is a good introduction to TLA+. And the slides [2].

[1] https://www.youtube.com/watch?v=qs_mmezrOWs

[2] https://speakerdeck.com/swlaschin/tla-plus-for-programmers

https://en.wikipedia.org/wiki/TLA+

In theory (according to Wikipedia at least) the acronym stands for "Temporal Logic for Actions" ... but the pun in the standard meaning of the acronym TLA (i.e. "Three Letter Acronym") is far too enticing for me to believe the choice was fully accidental. :D

The strange case of "TLA+" topics reaching HN's Front Page without any reason.

The reason is obvious - LLMs got good at writing them, initial cost to write went down 99%, toil to keep them up to date with code went down 99% and some people noticed. Value proposition went from ‘you must be joking unless we’re AWS and you’re proposing an IAM code change’ to ‘plan mode may also emit TLA+ spec as an intermediate implementation artifact’.

It's not strange at all.

The second story, and highly upvoted, on HN right now is: "AI will make formal verification go mainstream"[1].

[1] https://news.ycombinator.com/item?id=46294574

To me the modeling is not that hard but I struggle to come up with properties/invariants.

Take something like Google Docs. I'm sure they have some websocket protocol to handle collaborative text editing:

- establish connection

- do a few retries if connection is lost

- always accept edits even when there is no connection

- send queued edits when connection is back

- etc

But what properties can you imagine here?

We never enter an error state? Not really. Error states are unavoidable / outside the control of the system.

We always recover from errors? Not really. Connection might never come back. Or the error requires action from the user (e.g. authentication issues).

I can imagine a huge number of properties.

1. Eventual consistency: if no new edits are generated, then eventually all connected viewers see the same document.

2. Durability: if the system acknowledges an edit, then that edit is stored permanently in the undo/redo chain of a document.

3. Causal consistency: if the system acknowledges an edit B that depends (for some definition of depend) on edit A, then it must have acknowledged A (instead of e.g. throwing away A due to a conflict and then acknowledging B).

4. Eventual connection: if, after a certain point, the user never fails any part of the handshake process, eventually the user can successfully connect to the document (there are definitely bugs in collaborative tools where users end up never able to connect to a document even with no problems in the connection)

5. Connection is idempotent: Connecting once vs connecting n times has the same result (ensuring e.g. that the process of connecting doesn't corrupt the document)

6. Security properties hold: any user who doesn't have the permissions to view a document is always denied access to the document (because there are sometimes security bugs where an unforeseen set of steps can actually lead to viewing the doc)

6. Order preservation of edits: for any user, even a user with intermittent connection problems, the document they see always has an ordered subset of the edits that user has made (i.e. the user never sees their edits applied out of order)

7. Data structure invariants hold: these documents are ultimately backed by data structures, sometimes complex ones that require certain balancing properties to be true. Make sure that those hold under all edits of the document.

Etc. There's probably dozens of properties at least you could write and check even for an abstract Google Doc-like system (to say nothing of the particulars of a specific implementation). That's not to say you have to write or verify all of these properties! Even just choosing one or two can give a huge boost in reliability confidence.

Google Docs uses Operational Transformations (OT). You could try and model OT in TLA+.

TCP already provides a model for that one [0] and you can also use logical clocks [1], which is a more generic concept. Now that the whole reliability is out of the way and you're already using a queue for the ordering of the edits, then all you need to care is about the collaborative aspects.

[0]: https://en.wikipedia.org/wiki/Transmission_Control_Protocol#...

[1]: https://dl.acm.org/doi/abs/10.1145/359545.359563

Start with property-based testing, that'll help you learn how to specify properties without the overhead of TLA+.

We never enter error state as long as XYZ properties are satisfied, I guess... I will admit, this is probably a naive take, I'm currently trying to go through the TLA+ videos and have trouble wrapping my head around it.

Some kind of casual or eventual consistency related invariants perhaps?

The times I've used TLA+ in anger have been generally related to lower-level "communicating state machines". One was a LoRaWAN handshake with the possibility of lost radio packets, another was a slightly-higher-level Bluetooth message exchange between a phone and a device.

In the first situation I wanted to ensure that the state machine I'd put together couldn't ever deadlock. There's a fairness constraint you can turn in on TLA+ that basically says "at some point in every timeline it will choose every path instead of just looping on a specific failure path". The model I put together ensured that, assuming at some point there was a successful transmission and reception of packets, the state machines on both sides could handle arbitrary packet loss without getting into a bad state indefinitely.

On the Bluetooth side it was the converse: a vendor had built the protocol and I believed that it was possible to undetectably lose messages based on their protocol. The model I put together was quite simple and I used it to give me the sequence of events that would cause a message to be lost; I then wrapped that sequence of events up in an email to the vendor to explain a minimal test case to prove that the protocol was broken.

For your specific example, I'd suggest you think about invariants instead of actions. I'd probably split your example into two distinct models though: one for the connection retry behaviour and one for the queued edits.

The invariant for the first model would be pretty straightforward: starting from the disconnected state, ensure that all (fair) paths eventually lead to the connected state. The fairness part is necessary because without it you can end up in an obvious loop of AttemptConnect -> Fail -> AttemptConnect -> Fail... that never ends up connected. A riff on that that gets a little more interesting was a model I put together for an iOS app that was talking to an Elixir Phoenix backend over Phoenix Channels. Not only did it need to end up with a connected websocket but it also needed to successfully join a channel. There were a few more steps to it and the model ended up being somewhat interesting.

The second model in your example is the edits model. What are the invariants here? We completely ignore the websocket part of it and just model it as two (or three!) state machines with a queue in between them. The websocket model handles the reconnect logic. The key invariant here is that edits aren't lost. In the two-state-machine model you're modelling a client that's providing edits and a server that's maintaining the overall state. An obvious invariant is that the server's state eventually reflects the client's queued edits.

The three-state-machine model is where you're going to get more value out of it. Now you've got TWO clients making edits, potentially on stale copies of the document. How do you handle conflicts? There's going to be a whole sequence of events like "client 1 receives document v123", "client 2 receives document v123", "client 2 makes an edit to v123 and sends to server", "server accepts edit and returns v124", "client 1 makes an edit to v123", "server does...?!"

I haven't reviewed this model at all but a quick-ish google search found it. This is likely somewhat close to how Google Docs works under the hood: https://github.com/JYwellin/CRDT-TLA

I can imagine a huge number of properties.

1. Eventual consistency: if no new edits are generated, then eventually all connected viewers see the same document.

2. Durability: if the system acknowledges an edit, then that edit is stored permanently in the undo/redo chain of a document.

5. Connection is idempotent: Connecting once vs connecting n times has the same result (ensuring e.g. that the process of connecting doesn't corrupt the document)

Google Docs uses Operational Transformations (OT). You could try and model OT in TLA+.

[0]: https://en.wikipedia.org/wiki/Transmission_Control_Protocol#...

[1]: https://dl.acm.org/doi/abs/10.1145/359545.359563

Just like with code, the issue was never about writing a TLA+ spec. It was about writing a good TLA+ spec. Anyone can press a piano's key, but not everyone can write the Moonlight Sonata.

Some kind of casual or eventual consistency related invariants perhaps?

Start with property-based testing, that'll help you learn how to specify properties without the overhead of TLA+.

I haven't reviewed this model at all but a quick-ish google search found it. This is likely somewhat close to how Google Docs works under the hood: https://github.com/JYwellin/CRDT-TLA

Just like with code, the issue was never about writing a TLA+ spec. It was about writing a good TLA+ spec. Anyone can press a piano's key, but not everyone can write the Moonlight Sonata.

Yes, but the accessibility of TLA+, if you come from traditional programming languages, is quite harsh. I have 2 books about it and keep coming back to it every once in a while, but every single time there is a lot of investment into the exact syntax of certain things. Better tooling like AI autocomplete is a huge step there. I can still think about the spec, but not get stuck every single time on minor details.

It's not strange at all.

The second story, and highly upvoted, on HN right now is: "AI will make formal verification go mainstream"[1].

[1] https://news.ycombinator.com/item?id=46294574

It's interesting btw that Martin Kleppmann lists a couple of proof assistants, but not model checkers like TLA+. With him working extensively on distributed systems, I would have thought that would be on the list as well.

The devil is in the details. That's the whole point of formalism. There should be no ambiguity meaning what's written is what you want to be written. If you generate and can't prove what's generated is what you intended, the whole point of even writing it is moot.

> if you come from traditional programming languages, is quite harsh

This may be the result of failing to adequately distinguish between the spec and the implementation. PlusCal, as I understand it, is an attempt to bridge this gap for programmers who don't yet have a good grasp of this distinction. So where in TLA+ you would model the observable behavior of a system in terms of transitions in a state machine, PlusCal gives users a language that looks more like implementation. (Frankly, I find the PlusCal approach more distracting and opaque, but I'm also a more abstract thinker than most.)

> if you come from traditional programming languages, is quite harsh

I understand the semantics, I forget the syntax. So better tooling is great because it allows even retards like me to use TLA+ after forgetting the syntax details after 2 weeks.

Yes, and no. Relevant details. If the formalism bogs you down in tangential or irrelevant stuff, it means it's a poor fit.

But I don't think that's the case here. Most programmers don't have much experience with specs, and much less experience with formalized specs, especially if they come from an imperative background. I think those from declarative backgrounds will be much more at home with TLA+. The imperative style causes people to obsess over the "how", while the declarative style focuses more on the "what", and specs are more of a matter of "what" than "how".

I understand the semantics, I forget the syntax. So better tooling is great because it allows even retards like me to use TLA+ after forgetting the syntax details after 2 weeks.

Yes, and no. Relevant details. If the formalism bogs you down in tangential or irrelevant stuff, it means it's a poor fit.

It can be both. Here is one example that provides the how instead of the what:

https://www.rfc-editor.org/rfc/rfc5321.html#section-3.1

A lot of declarative languages do a lot of hard work designing an imperative kernel that will support the declarative construct. It just takes skill to separate the two together, just like it takes skill to split code between states, interfaces and implementations.

It can be both. Here is one example that provides the how instead of the what:

https://www.rfc-editor.org/rfc/rfc5321.html#section-3.1

Model minimalistically

Start from a tiny core, and always keep a working model as you extend. Your default should be omission. Add a component only when you can explain why leaving it out would not work. Most models are about a slice of behavior, not the whole system in full glory: E.g., Leader election, repair, reconfiguration. Cut entire layers and components if they do not affect that slice. Abstraction is the art of knowing what to cut. Deleting should spark joy.

Model specification, not implementation

Write declaratively. State what must hold, not how it is achieved. If your spec mirrors control flow, loops, or helper functions, you are simulating code. Cut it out. Every variable must earn its keep. Extra variables multiply the state space (model checking time) and hide bugs. Ask yourself repeatedly: can I derive this instead of storing it? For example, you do not need to maintain a WholeSet variable if you can define it as a state function of existing variables: WholeSet == provisionalItems \union nonProvisionalItems.

Review the model for illegal knowledge

Do a full read-through of your model and check what each process can really see. TLA+ makes it easy to read global state (or another process's state) that no real distributed process could ever observe atomically. This is one of the most common modeling errors. Make a dedicated pass to eliminate illegal global knowledge.

Check atomicity granularity

Push actions to be as fine-grained as correctness allows. Overly large atomic actions hide races and invalidate concurrency arguments. Fine-grained actions expose the real interleavings your protocol must tolerate.

Think in guarded commands, not procedures

Each action should express one logical step in guarded-command style. The guard should ideally define the meaning of the action. Put all enablement conditions in the guard. If the guard holds, the action may fire at any time in true event-driven style. This is why I now prefer writing TLA+ directly over PlusCal: TLA+ forces you to think in guarded-command actions, which is how distributed algorithms are meant to be designed. Yes, PlusCal is easier for developers to read, but it also nudges you toward sequential implementation-shaped thinking. And recently, with tools like Spectacle, sharing and visually exploring TLA+ specs got much easier.

Step back and ask what you forgot to model

There is no substitute for thinking hard about your system. TLA+ modeling is only there to help you think hard about your system, and cannot substitute thinking about it. Check that you incorporated all relevant aspects: failures, message reordering, repair, reconfiguration.

Write TypeOK invariants

TLA+ is not typed, so you should state types explicitly and early by writing TypeOK invariants. A good TypeOK invariant provides an executable documentation for your model. Writing this in seconds can save you many minutes of hunting runtime bugs through TLA+ counterexample logs.

Write as many invariants as you can

If a property matters, make it explicit as an invariant. Write them early. Expand them over time. Try to keep your invariants as tight as possible. Document your learnings about invariants and non-invariants. A TLA+ spec is a communication artifact. Write it for readers, not for the TLC model checker. Be explicit and boring for the sake of clarity.

Write progress properties

Safety invariants alone are not enough. Check that things eventually happen: requests complete, leaders emerge, and goals accomplished. Many "correct" models may quietly do nothing forever. Checking progress properties catch paths that stall.

Be suspicious of success

A successful TLC run proves nothing unless the model explores meaningful behavior. Low coverage or tiny state spaces usually mean the model is over-constrained or wrong. Break the spec on purpose to check that your spec is actually doing some real work, and not giving up in a vacuous/trivial way. Inject bugs on purpose. If your invariants do not fail, they are too weak. Test the spec by sabotaging it.

Optimize model checking efficiency last

Separate the model from the model checker. The spec should stand on its own. Using the cfg file, you can optimize for model checking by using appropriate configuration, constraints, bounds for counters, and symmetry terms.

You can find many examples and walkthroughs of TLA+ specifications on my blog.

There are many more in the TLA+ repo as well.

Hacker Times