Concurrency · #12 of 13

Atomics + the Memory Model

The lesson most C++ courses get wrong — what `std::atomic` actually guarantees, and why `memory_order` exists

Why it matters

You've used std::atomic<int> once already as "the thing that lets two threads update a counter without a mutex." That mental model is the floor. The ceiling is the C++ memory model — a formal description of what one thread can see about another thread's writes, and what reorderings the compiler and CPU are allowed to perform behind your back.

If you're never writing lock-free data structures, you can mostly ignore this. But you cannot read modern systems code (kernels, allocators, runtimes, the standard library itself) without knowing what memory_order_acquire means. This lesson teaches the mental model without pretending the underlying model is simple.

The fact that makes this hard

Two threads. One writes x then y. The other reads y then x. You might assume: if the reader saw the new y, it must also see the new x. This is wrong.

Both the compiler and the CPU are allowed to reorder memory operations as long as the single-threaded observable behavior is preserved. So the writer's "x then y" might emit "y then x." The reader's "y then x" might fetch "x" before "y." Either of these reorderings is a perfectly correct single-threaded optimization. Across threads, they shred your invariants.

std::atomic and memory_order are how you tell the compiler "these operations participate in cross-thread ordering. Don't reorder them freely."
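The two-thread scenario above can be sketched as a small experiment. This is a minimal sketch, not a proof: the names `Observation`, `run_once`, and `count_surprises` are made up for illustration, and the accesses use `memory_order_relaxed` atomics so the program has no data race while still placing no ordering constraint between x and y. The "new y, old x" outcome is permitted by the language, but whether you ever observe it depends on your compiler and hardware. On x86 you will most likely see zero.

```cpp
#include <atomic>
#include <thread>

// What the reader saw in one run of the experiment (hypothetical helper type).
struct Observation {
    int y;  // value read from y (read first)
    int x;  // value read from x (read second)
};

Observation run_once() {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    Observation seen{0, 0};

    std::thread writer([&] {
        x.store(1, std::memory_order_relaxed);  // writes x, then y
        y.store(1, std::memory_order_relaxed);
    });
    std::thread reader([&] {
        seen.y = y.load(std::memory_order_relaxed);  // reads y, then x
        seen.x = x.load(std::memory_order_relaxed);
    });
    writer.join();
    reader.join();  // after join, reading `seen` from this thread is safe
    return seen;
}

// Counts runs in which the reader saw the new y but the old x.
int count_surprises(int trials) {
    int surprises = 0;
    for (int i = 0; i < trials; ++i) {
        Observation o = run_once();
        if (o.y == 1 && o.x == 0) ++surprises;
    }
    return surprises;
}
```

A result of zero does not mean the code is ordered. It only means this machine did not happen to reorder on these runs.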

`std::atomic<T>` — the basic guarantee

std::atomic<T> gives you three things:

Indivisible reads and writes. No thread ever sees a "half-written" value. This is the property you'd expect from "atomic."
A defined memory order for that one operation, relative to the other memory operations the same thread performs before and after it.
Restricted operations. You can .load(), .store(), .exchange(), .compare_exchange_*(), and (for integral types) arithmetic operations like ++. You cannot update a single field in place as you would with a regular struct. std::atomic<T> accepts any trivially copyable T, but every operation reads or replaces the whole value, and for types wider than the hardware can handle the implementation may fall back to an internal lock (check .is_lock_free()).
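To make the "whole value only" restriction concrete, here is a small sketch. `Point` and `move_right` are hypothetical names, and the sketch assumes an 8-byte struct with no padding, which is lock-free on common 64-bit platforms. Changing one field means loading the whole value, building a modified copy, and swapping the whole value back in.

```cpp
#include <atomic>
#include <type_traits>

// Hypothetical example type: two ints, no padding on typical platforms.
struct Point {
    int x;
    int y;
};
static_assert(std::is_trivially_copyable<Point>::value,
              "std::atomic<T> requires a trivially copyable T");

// Adds dx to the x field. There is no way to touch x alone, so the
// whole Point is replaced with a compare-exchange retry loop.
Point move_right(std::atomic<Point>& p, int dx) {
    Point current = p.load();
    Point updated;
    do {
        updated = Point{current.x + dx, current.y};
        // On failure, `current` is refreshed with the latest value and we retry.
    } while (!p.compare_exchange_weak(current, updated));
    return updated;
}
```

Called as `std::atomic<Point> p{Point{1, 2}}; move_right(p, 5);`, this leaves `p` holding `{6, 2}`.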

#include <iostream>
#include <atomic>
#include <thread>

std::atomic<int> counter{0};

void worker() {
  for (int i = 0; i < 1000; i++) {
    counter.fetch_add(1, std::memory_order_relaxed);
  }
}

int main() {
  std::jthread t1(worker);
  std::jthread t2(worker);
  std::jthread t3(worker);

  // jthread joins in its destructor, but that only runs at the end of main.
  // Join explicitly so every increment is done before the counter is read.
  t1.join(); t2.join(); t3.join();
  std::cout << "counter = " << counter.load() << "\n";
  return 0;
}


Three threads, each incrementing 1000 times. The counter ends at exactly 3000 because every increment is atomic. `memory_order_relaxed` is sufficient here — no other variables need to stay in sync with this one.

expected output

counter = 3000

Or run locally

g++ -std=c++23 -O2 snippet.cpp && ./a.out

The default memory_order for atomic operations is memory_order_seq_cst — sequential consistency — the strongest, slowest guarantee. The example uses relaxed because a counter doesn't need any ordering relationship with other variables.

What `memory_order` actually means

There are six values, but only four shapes worth learning:

| Order | Meaning | Use for |
|---|---|---|
| relaxed | Atomicity only. No ordering with other operations. | Counters, statistics. |
| acquire (on load) | Reads after this load see everything the producing thread wrote before its matching release store. | The reader side of a flag. |
| release (on store) | Writes before this store are visible to a thread that does an acquire load on the same atomic. | The writer side of a flag. |
| seq_cst | All threads see all seq_cst operations in the same global order. | Default. Use when in doubt. |

acquire/release is the pair you reach for to publish data between threads:

std::atomic<bool> ready{false};
int payload = 0;

// Writer thread:
payload = 42;                              // ordinary write
ready.store(true, std::memory_order_release);

// Reader thread:
while (!ready.load(std::memory_order_acquire)) {}
std::cout << payload;                      // guaranteed to see 42

The release-store publishes the prior writes. The acquire-load synchronizes with that release, making those writes visible. This is the happens-before relation in concrete C++ terms.

Without release/acquire (or seq_cst), the reader is not guaranteed to see payload == 42 even after it sees ready == true. The compiler or CPU could reorder the writer's two operations, or the reader's, and your invariant breaks.
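The writer and reader fragments above can be assembled into one function you can call. This is a minimal sketch under the same assumptions as the fragments. The function name `publish_and_read` and the local `seen` are made up for illustration, and the spin loop yields so it behaves reasonably on a single core.

```cpp
#include <atomic>
#include <thread>

// Runs one writer and one reader, returns what the reader saw in payload.
int publish_and_read() {
    std::atomic<bool> ready{false};
    int payload = 0;  // deliberately a plain int, not an atomic
    int seen = -1;

    std::thread reader([&] {
        // Acquire load: once this reads true, the writer's earlier
        // write to payload is guaranteed to be visible here.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;
    });
    std::thread writer([&] {
        payload = 42;                                  // ordinary write
        ready.store(true, std::memory_order_release);  // publishes it
    });

    writer.join();
    reader.join();  // join makes the reader's write to `seen` visible here
    return seen;
}
```

Replacing both orders with `memory_order_relaxed` would make the read of `payload` a data race, even if the function still appeared to return 42 on your machine.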

`compare_exchange`: the lock-free workhorse

The CPU primitive most lock-free algorithms are built on: compare_exchange_strong(expected, desired). Atomically:

If the current value equals expected, replace with desired and return true.
Otherwise, load the current value into expected and return false.

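On x86 this is a single instruction (lock cmpxchg). Other architectures either provide a dedicated compare-and-swap instruction (for example CAS on ARMv8.1 and later) or build it from a load-linked/store-conditional pair, as older ARM cores do. Used for lock-free push, lock-free pop, lock-free linked lists, etc.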

#include <iostream>
#include <atomic>

std::atomic<int> v{10};

int main() {
  int expected = 10;
  bool ok = v.compare_exchange_strong(expected, 20);
  std::cout << "first  swap: ok=" << ok << " v=" << v.load() << " expected=" << expected << "\n";

  expected = 10;   // stale assumption
  ok = v.compare_exchange_strong(expected, 30);
  std::cout << "second swap: ok=" << ok << " v=" << v.load() << " expected=" << expected << "\n";
  return 0;
}


`compare_exchange_strong` is the atomic conditional-swap that powers lock-free algorithms. On failure, `expected` is updated to the actual current value.

expected output

first  swap: ok=1 v=20 expected=10
second swap: ok=0 v=20 expected=20
# (failed compare-exchange writes the current value back into expected — letting you retry)

Or run locally

g++ -std=c++23 -O2 snippet.cpp && ./a.out

The "loop until you win" pattern that uses this:

int current = v.load(std::memory_order_relaxed);
int updated;
do {
  updated = transform(current);
} while (!v.compare_exchange_weak(current, updated,
                                  std::memory_order_release,
                                  std::memory_order_relaxed));

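compare_exchange_weak is the same operation but may fail spuriously, returning false even though the value equals expected, on architectures that implement it with a load-linked/store-conditional pair. It's the right one to use inside a retry loop, where a spurious failure just costs one more iteration; _strong is for one-shot tries.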
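As one concrete instance of the retry loop, here is a sketch of an "atomic maximum" built from compare_exchange_weak. The names `atomic_store_max` and `run_max_demo` are hypothetical, and the memory orders simply mirror the loop above rather than being tuned for any particular use.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Raises `target` to `candidate` unless it is already at least that large.
void atomic_store_max(std::atomic<int>& target, int candidate) {
    int current = target.load(std::memory_order_relaxed);
    while (current < candidate &&
           !target.compare_exchange_weak(current, candidate,
                                         std::memory_order_release,
                                         std::memory_order_relaxed)) {
        // A failed exchange refreshed `current`; the condition re-checks
        // whether our candidate is still larger before retrying.
    }
}

// Each thread proposes id * 10; the largest proposal must win.
int run_max_demo(int threads) {
    std::atomic<int> best{0};
    std::vector<std::thread> pool;
    for (int id = 1; id <= threads; ++id) {
        pool.emplace_back([&best, id] { atomic_store_max(best, id * 10); });
    }
    for (auto& t : pool) t.join();
    return best.load();
}
```

With eight threads, `run_max_demo(8)` returns 80 regardless of the order in which the threads run.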

The `std::atomic_flag` minimal lock

The simplest possible atomic — just one bit, set or clear. The classic hand-rolled spinlock:

std::atomic_flag lock;  // starts clear since C++20; ATOMIC_FLAG_INIT is deprecated there

void enter() {
  while (lock.test_and_set(std::memory_order_acquire)) {
    // spin
  }
}

void leave() {
  lock.clear(std::memory_order_release);
}

Do not actually ship this. Use std::mutex, which behaves better under contention because the OS can de-schedule blocked threads instead of letting them burn CPU while spinning. The above is illustrative: it's the simplest demonstration of acquire/release pairing for mutual exclusion.
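To see the acquire/release pairing actually protect something, here is a sketch that wraps the same two operations in a class and uses it to guard a plain counter. `SpinLock` and `count_with_spinlock` are hypothetical names. Giving the class `lock()` and `unlock()` members is a design choice so that it works with `std::lock_guard`. It is still for illustration only.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

class SpinLock {
    std::atomic_flag flag_{};  // starts clear
public:
    void lock() {
        // Acquire: everything the previous holder wrote is visible
        // once test_and_set returns false.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();  // be polite while spinning
        }
    }
    void unlock() {
        // Release: publishes our writes to the next holder.
        flag_.clear(std::memory_order_release);
    }
};

long count_with_spinlock(int threads, int iters) {
    SpinLock lock;
    long total = 0;  // plain long, protected only by the lock
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &total, iters] {
            for (int i = 0; i < iters; ++i) {
                std::lock_guard<SpinLock> guard(lock);
                ++total;
            }
        });
    }
    for (auto& th : pool) th.join();
    return total;
}
```

`count_with_spinlock(4, 5000)` returns 20000. Without the lock, the increments of `total` would be a data race.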

The single rule that makes the memory model usable

If two threads access the same memory location, and at least one of them writes, then either both accesses must be atomic operations, or the accesses must be ordered by synchronization: a lock (std::mutex) protecting all of them, or a release/acquire pair on another atomic, as with payload and ready above.

Failing this rule is a data race, which in C++ is undefined behavior. Not "you might read garbage" — undefined. The compiler may optimize as though the race can't happen, which can produce code that seems unrelated to your source.

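This rule is most of the memory model in one sentence. What remains is which memory order to pick. seq_cst is always a correct choice; weakening it is a performance optimization, and it is only safe once you have checked that the ordering you give up is not one your invariants depend on.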
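Two of the ways to satisfy the rule can be put side by side in a short sketch. The names `Totals` and `count_both_ways` are made up for illustration. One counter is a plain `long` whose every access happens under a mutex; the other is a `std::atomic<long>`.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

struct Totals {
    long with_mutex;   // plain long, every access made under the mutex
    long with_atomic;  // std::atomic<long>, no lock needed
};

Totals count_both_ways(int threads, int iters) {
    long guarded = 0;
    std::mutex guard;
    std::atomic<long> shared{0};

    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&guarded, &guard, &shared, iters] {
            for (int i = 0; i < iters; ++i) {
                {
                    std::lock_guard<std::mutex> hold(guard);
                    ++guarded;  // safe: ordered by the mutex
                }
                shared.fetch_add(1, std::memory_order_relaxed);  // safe: atomic
            }
        });
    }
    for (auto& th : pool) th.join();
    return Totals{guarded, shared.load()};
}
```

Both totals come out to `threads * iters`. Incrementing `guarded` without taking the mutex would break the rule and make the whole program's behavior undefined.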

Performance: how much slower is `seq_cst`?

On x86, a seq_cst store typically compiles to an xchg, or to a mov followed by mfence, either of which drains the store buffer. That's roughly 30-100 cycles of latency. acquire/release loads and stores on x86 are essentially free (x86's hardware memory model is already quite strong). On weakly ordered architectures such as ARM they map to dedicated barrier or load-acquire/store-release instructions, which can be cheaper than seq_cst.

The recommendation: default to seq_cst. Move to acquire/release only when profiling shows the difference matters. Move to relaxed only for counters and statistics where the ordering doesn't matter at all.

The exception is performance-critical lock-free code, where the difference between seq_cst and acquire/release can be 2-3× on ARM. In that case, you're in the world of formal verification, model checkers, and TLA+ — far beyond this lesson's scope.

When NOT to roll your own

Lock-free programming is hard. Here are the algorithms that look easy but are very hard to get right:

A lock-free queue with multiple producers and consumers. The ABA problem alone (a value changes from A to B to A and your CAS succeeds wrongly) takes ~100 lines of careful code.
A lock-free hash map. Even the "easy" version uses three different memory orders and a hazard-pointer scheme.
Sequence-locked counters (seqlocks). Easy to write incorrectly, and easy to convince yourself that they work.

If you find yourself writing one of these from scratch, stop. Use:

std::mutex for almost everything. Modern mutexes are fast.
folly::ProducerConsumerQueue or moodycamel::ConcurrentQueue for actual lock-free queues, written by people who do this for a living.
Sharded data (per-thread, then merge) to avoid contention entirely.

The right time to write your own lock-free code is after you've proven that the off-the-shelf solution is the bottleneck, with profiling.
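Of the alternatives listed above, sharding is the easiest to show in a few lines. This is a minimal sketch with a hypothetical function name, `sharded_count`. Each thread accumulates into its own local variable, writes its result once to its own slot, and the slots are merged after all threads have joined.

```cpp
#include <numeric>
#include <thread>
#include <vector>

long sharded_count(int threads, int iters) {
    std::vector<long> shards(threads, 0);  // one slot per thread
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&shards, t, iters] {
            long local = 0;  // private to this thread, no sharing at all
            for (int i = 0; i < iters; ++i) ++local;
            shards[t] = local;  // single write to a slot no other thread touches
        });
    }
    // join orders every shard write before the merge below
    for (auto& th : pool) th.join();
    return std::accumulate(shards.begin(), shards.end(), 0L);
}
```

No atomics and no mutex appear here. Distinct vector elements are distinct memory locations, and `join` provides the only synchronization the merge needs.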

Key takeaways

The compiler and CPU reorder memory operations to optimize single-threaded execution. Cross-thread ordering doesn't come for free — you opt in via atomics or locks.
std::atomic<T> gives you atomicity (no torn reads) and a way to specify cross-thread ordering via memory_order.
The four memory-order shapes worth knowing: relaxed (counters), acquire (consumer side of a publish), release (producer side of a publish), seq_cst (default, strongest).
The single rule: if two threads touch the same memory and at least one writes, use atomics or a mutex. Otherwise it's a data race and the behavior is undefined.
Default to seq_cst. Profile before reaching for weaker orders.
Don't write lock-free data structures from scratch. Use battle-tested libraries.

What's next

The last lesson — 13 What to Skip (and Why) — closes out the curriculum by talking about the parts of C++ this course intentionally de-emphasized: inheritance hierarchies, exception-based control flow, iostream ceremony. Then four capstones (C1 already shipped; C2–C4 exercise everything you've learned).
