Atomics + the Memory Model
The lesson most C++ courses get wrong — what `std::atomic` actually guarantees, and why `memory_order` exists
Why it matters
You’ve used std::atomic<int> once already as “the thing that lets two
threads update a counter without a mutex.” That mental model is the
floor. The ceiling is the C++ memory model — a formal description of
what one thread can see about another thread’s writes, and what
reorderings the compiler and CPU are allowed to perform behind your
back.
If you’re never writing lock-free data structures, you can mostly ignore
this. But you cannot read modern systems code (kernels, allocators,
runtimes, the standard library itself) without knowing what
memory_order_acquire means. This lesson teaches the mental model
without pretending the underlying model is simple.
The fact that makes this hard
Two threads. One writes x then y. The other reads y then x. You
might assume: if the reader saw the new y, it must also see the new
x. This is wrong.
Both the compiler and the CPU are allowed to reorder memory operations as long as the single-threaded observable behavior is preserved. So the writer’s “x then y” might emit “y then x.” The reader’s “y then x” might fetch “x” before “y.” Either of these reorderings is a perfectly correct single-threaded optimization. Across threads, they shred your invariants.
std::atomic and memory_order are how you tell the compiler (and, through
the instructions it emits, the CPU) “these operations participate in
cross-thread ordering. Don’t reorder them freely.”
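The reordering described above can be probed with a small litmus test. The sketch below is illustrative only: `count_surprises` is a hypothetical helper, and both variables are relaxed atomics so the program has no data race. It counts how often the reader sees the new y together with the old x. The memory model permits that outcome, but how often it shows up (if at all) depends on the compiler and hardware; strongly ordered CPUs such as x86 may well report zero.

```cpp
#include <atomic>
#include <thread>

// Hypothetical litmus test: counts how often the reader observes the new y
// together with the old x. Relaxed ordering gives atomicity only, so nothing
// ties the visibility of y to the visibility of x.
int count_surprises(int iterations) {
    int surprises = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0};
        std::atomic<int> y{0};
        int seen_x = 0;
        int seen_y = 0;

        std::thread writer([&] {
            x.store(1, std::memory_order_relaxed);  // "x then y" in source order
            y.store(1, std::memory_order_relaxed);
        });
        std::thread reader([&] {
            seen_y = y.load(std::memory_order_relaxed);  // "y then x"
            seen_x = x.load(std::memory_order_relaxed);
        });
        writer.join();  // join() makes seen_x / seen_y safe to read here
        reader.join();

        if (seen_y == 1 && seen_x == 0) {
            ++surprises;  // permitted by the memory model, even if rare
        }
    }
    return surprises;
}
```

A result of zero on one machine is not evidence that the outcome is impossible; it only means that particular compiler and CPU did not produce it during that run.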
std::atomic<T> — the basic guarantee
std::atomic<T> gives you three things:
- Indivisible reads and writes. No thread ever sees a “half-written” value. This is the property you’d expect from “atomic.”
- A chosen memory order for each operation, which controls how that operation may be reordered relative to the other memory operations the same thread performs before and after it.
- Restricted operations. You can `.load()`, `.store()`, `.exchange()`, `.compare_exchange_*()`, and (for integral types) use arithmetic operations like `++`. You cannot `.assign()` and `.modify_field()` like with a regular struct: there’s no “atomic struct update,” only atomic replacement of the whole value.
#include <iostream>
#include <atomic>
#include <thread>
std::atomic<int> counter{0};
void worker() {
    for (int i = 0; i < 1000; i++) {
        counter.fetch_add(1, std::memory_order_relaxed);
    }
}
int main() {
    std::jthread t1(worker);
    std::jthread t2(worker);
    std::jthread t3(worker);
    // jthread joins in its destructor, but that only runs at the end of main.
    // Join explicitly so every increment has finished before we print.
    t1.join(); t2.join(); t3.join();
    std::cout << "counter = " << counter.load() << "\n";
    return 0;
}
Three threads, each incrementing 1000 times. The counter ends at exactly 3000 because every increment is atomic. `memory_order_relaxed` is sufficient here — no other variables need to stay in sync with this one.
counter = 3000
To run it locally:
g++ -std=c++23 -O2 snippet.cpp && ./a.out
The default memory_order for atomic operations is
memory_order_seq_cst (sequential consistency), the strongest and slowest
guarantee. The example uses relaxed because a counter doesn’t need
any ordering relationship with other variables.
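The “restricted operations” point can be made concrete with a small struct. This is a minimal sketch: `Point`, `replace_point`, and `read_x` are hypothetical names chosen for illustration. It shows that an atomic struct is read and replaced as a whole value, with no per-field access.

```cpp
#include <atomic>

// Hypothetical two-field struct; trivially copyable, so std::atomic accepts it.
struct Point {
    int x;
    int y;
};

// Atomically swap in a whole new Point and return the previous one.
Point replace_point(std::atomic<Point>& p, Point next) {
    return p.exchange(next);
}

// Reading a field means copying the whole value out first.
int read_x(const std::atomic<Point>& p) {
    return p.load().x;
    // Something like `p.x = 3;` does not compile: std::atomic<Point> exposes
    // no member access, only whole-value operations.
}
```

Whether `std::atomic<Point>` is implemented without an internal lock depends on the platform; `is_lock_free()` reports it at run time.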
What memory_order actually means
There are six values, but only four shapes worth learning:
| Order | Meaning | Use for |
|---|---|---|
| relaxed | Atomicity only. No ordering with other operations. | Counters, statistics. |
| acquire (on load) | Reads and writes after this load see everything the releasing thread wrote before its matching release store. | The reader side of a flag. |
| release (on store) | Writes before this store are visible to a thread that does an acquire load on the same atomic. | The writer side of a flag. |
| seq_cst | All threads see all seq_cst operations in the same global order. | Default. Use when in doubt. |
acquire/release is the pair you reach for to publish data between
threads:
std::atomic<bool> ready{false};
int payload = 0;
// Writer thread:
payload = 42; // ordinary write
ready.store(true, std::memory_order_release);
// Reader thread:
while (!ready.load(std::memory_order_acquire)) {}
std::cout << payload; // guaranteed to see 42
The release-store publishes the prior writes. The acquire-load synchronizes with that release, making those writes visible. This is the happens-before relation in concrete C++ terms.
Without release/acquire (or seq_cst), the reader is not guaranteed
to see payload == 42 even after it sees ready == true. The
compiler or CPU could reorder the writer’s two operations, or the
reader’s, and your invariant breaks.
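The publish pattern above can be wrapped into a complete two-thread program. This is a sketch under stated assumptions: `publish_and_read` is a hypothetical function name, and the `yield()` inside the spin loop is an addition for politeness that does not affect the ordering.

```cpp
#include <atomic>
#include <thread>

// Returns the payload value the reader observed after it saw ready == true.
int publish_and_read() {
    std::atomic<bool> ready{false};
    int payload = 0;    // plain, non-atomic data
    int observed = -1;  // written by the reader, read after join()

    std::thread reader([&] {
        // Acquire load: once this sees true, the writes made before the
        // matching release store are visible to this thread.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        observed = payload;
    });
    std::thread writer([&] {
        payload = 42;                                  // ordinary write
        ready.store(true, std::memory_order_release);  // publish it
    });

    writer.join();
    reader.join();
    return observed;
}
```

Calling `publish_and_read()` returns 42: the read of `payload` can only happen after the acquire load has seen the release store, which in turn comes after the write of 42.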
compare_exchange: the lock-free workhorse
The CPU primitive most lock-free algorithms are built on:
compare_exchange_strong(expected, desired). Atomically:
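- If the current value equals `expected`, replace it with `desired` and return `true`.
- Otherwise, load the current value into `expected` and return `false`.
On x86 this is a single instruction (lock cmpxchg). ARMv8.1 and later have a
cas instruction; older ARM and other load-linked/store-conditional
architectures implement it as a short retry loop. It is used for
lock-free push, lock-free pop, lock-free linked lists, etc.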
#include <iostream>
#include <atomic>
std::atomic<int> v{10};
int main() {
    int expected = 10;
    bool ok = v.compare_exchange_strong(expected, 20);
    std::cout << "first swap: ok=" << ok << " v=" << v.load() << " expected=" << expected << "\n";
    expected = 10; // stale assumption
    ok = v.compare_exchange_strong(expected, 30);
    std::cout << "second swap: ok=" << ok << " v=" << v.load() << " expected=" << expected << "\n";
    return 0;
}
`compare_exchange_strong` is the atomic conditional-swap that powers lock-free algorithms. On failure, `expected` is updated to the actual current value.
first swap: ok=1 v=20 expected=10
second swap: ok=0 v=20 expected=20
# (a failed compare-exchange writes the current value back into expected, letting you retry)
To run it locally:
g++ -std=c++23 -O2 snippet.cpp && ./a.out
The “loop until you win” pattern that uses this:
int current = v.load(std::memory_order_relaxed);
int updated;
do {
    // transform() stands for any pure function of the current value
    updated = transform(current);
} while (!v.compare_exchange_weak(current, updated,
                                  std::memory_order_release,
                                  std::memory_order_relaxed));
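compare_exchange_weak is the same operation but may fail spuriously
on some architectures: it can return false even when the value equals
expected, typically on load-linked/store-conditional hardware. It’s the
right one to use inside a retry loop, where a spurious failure just costs
one more iteration; _strong is for one-shot tries.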
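As a concrete instance of the retry loop, here is a sketch of an “atomic maximum” built on `compare_exchange_weak`. The names `store_max` and `max_of_threads` are hypothetical, and the memory orders simply mirror the loop above rather than being tuned for any particular use.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper: raise `target` to at least `candidate`.
void store_max(std::atomic<int>& target, int candidate) {
    int current = target.load(std::memory_order_relaxed);
    // Stop when the stored value is already large enough, or when our CAS wins.
    while (current < candidate &&
           !target.compare_exchange_weak(current, candidate,
                                         std::memory_order_release,
                                         std::memory_order_relaxed)) {
        // A failed CAS (real or spurious) refreshed `current`; re-check and retry.
    }
}

// Each thread proposes id * 10; the largest proposal must survive.
int max_of_threads(int thread_count) {
    std::atomic<int> highest{0};
    std::vector<std::thread> threads;
    for (int id = 1; id <= thread_count; ++id) {
        threads.emplace_back([&highest, id] { store_max(highest, id * 10); });
    }
    for (auto& t : threads) {
        t.join();
    }
    return highest.load();
}
```

With four threads, `max_of_threads(4)` returns 40 regardless of the order in which the threads run, because a smaller proposal never overwrites a larger stored value.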
The std::atomic_flag minimal lock
The simplest possible atomic — just one bit, set or clear. The classic hand-rolled spinlock:
std::atomic_flag lock;  // starts clear since C++20; ATOMIC_FLAG_INIT is deprecated
void enter() {
    while (lock.test_and_set(std::memory_order_acquire)) {
        // spin
    }
}
void leave() {
    lock.clear(std::memory_order_release);
}
Do not actually ship this. Use std::mutex, which behaves far better under
contention: the OS can de-schedule blocked threads instead of letting
them burn CPU in a spin loop. The above is illustrative. It’s the
simplest demonstration of acquire/release pairing for mutual exclusion.
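To see the acquire/release pairing actually protect something, the spinlock can guard a plain `int`. This is an illustrative sketch, not a recommendation: `SpinLock` and `guarded_count` are hypothetical names, and the `yield()` in the spin loop is an added courtesy to the scheduler.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Illustrative spinlock with lock()/unlock(), so std::lock_guard can drive it.
class SpinLock {
public:
    void lock() {
        // Acquire: everything done under the previous holder's lock is visible.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        // Release: publish everything done while holding the lock.
        flag_.clear(std::memory_order_release);
    }

private:
    std::atomic_flag flag_{};  // value-initialized, starts clear
};

// Several threads increment a plain int; the spinlock is its only protection.
int guarded_count(int thread_count, int per_thread) {
    SpinLock spin;
    int total = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&spin, &total, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                std::lock_guard<SpinLock> guard(spin);
                ++total;
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    return total;
}
```

`guarded_count(4, 1000)` returns 4000. Swapping `SpinLock` for `std::mutex` requires no other change, which is one more reason to prefer the standard type.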
The single rule that makes the memory model usable
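If two threads access the same memory location, and at least one of them writes, then one of three things must hold: both accesses are atomic operations; a lock (std::mutex) protects all the accesses; or the accesses are ordered by synchronization on some other atomic, the way the release store and acquire load on ready order the accesses to payload above.
Failing this rule is a data race, which in C++ is undefined behavior. Not “you might read garbage” — undefined. The compiler may optimize as though the race can’t happen, which can produce code that seems unrelated to your source.
This rule is the entire memory model in one sentence. Everything else is which memory order to pick. With the default seq_cst that choice is purely a performance question; a weaker order is an optimization you have to justify, because one that is too weak (relaxed on the ready flag above, for instance) reintroduces the race on the data it was meant to publish.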
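The two simplest ways to satisfy the rule look like this side by side. Both functions are hypothetical sketches (`count_with_mutex` and `count_with_atomic` are illustrative names) that count the same number of increments.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Option 1: plain int, with every access made while holding the same mutex.
int count_with_mutex(int thread_count, int per_thread) {
    int total = 0;
    std::mutex m;
    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&total, &m, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                std::lock_guard<std::mutex> guard(m);
                ++total;
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    return total;
}

// Option 2: make the shared location itself atomic.
int count_with_atomic(int thread_count, int per_thread) {
    std::atomic<int> total{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&total, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                total.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    return total.load();
}
```

Removing the mutex from the first function, or turning `total` into a plain `int` in the second, would make the program a data race and therefore undefined.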
Performance: how much slower is seq_cst?
On x86, a seq_cst store compiles to an xchg, or to a mov followed by
mfence, either of which drains the store buffer. That’s roughly 30-100
cycles of latency, depending on the microarchitecture.
acquire/release on x86 are essentially free for loads/stores (x86’s
hardware memory model is already quite strong), but on weakly ordered
hardware such as ARM they map to dedicated instructions or barriers,
which can be cheaper than their seq_cst counterparts.
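One way to check these claims on your own toolchain is to compile three otherwise identical stores and compare the generated assembly (for example with `g++ -O2 -S`, or an online compiler explorer). The sketch below uses hypothetical names (`flag_value`, `store_relaxed`, `store_release`, `store_seq_cst`); the instructions you see will depend on the compiler version and target.

```cpp
#include <atomic>

std::atomic<int> flag_value{0};  // hypothetical shared variable

// Atomicity only.
void store_relaxed(int v) {
    flag_value.store(v, std::memory_order_relaxed);
}

// Publishes earlier writes to a thread that acquire-loads flag_value.
void store_release(int v) {
    flag_value.store(v, std::memory_order_release);
}

// No order argument: the default is memory_order_seq_cst.
void store_seq_cst(int v) {
    flag_value.store(v);
}
```

On x86-64 the first two typically compile to the same plain store, while the third is the one that picks up the extra fencing cost.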
The recommendation: default to seq_cst. Move to acquire/release
only when profiling shows the difference matters. Move to relaxed
only for counters and statistics where the ordering doesn’t matter at
all.
The exception is performance-critical lock-free code, where the
difference between seq_cst and acquire/release can be 2-3× on
ARM. In that case, you’re in the world of formal verification, model
checkers, and TLA+ — far beyond this lesson’s scope.
When NOT to roll your own
Lock-free programming is hard. Here are the algorithms that look easy but are very hard to get right:
- A lock-free queue with multiple producers and consumers. The ABA problem alone (a value changes from A to B to A and your CAS succeeds wrongly) takes ~100 lines of careful code.
- A lock-free hash map. Even the “easy” version uses three different memory orders and a hazard-pointer scheme.
- Sequence-locked counters (seqlocks). Easy to write incorrectly, and easy to convince yourself they work.
If you find yourself writing one of these from scratch, stop. Use:
- `std::mutex` for almost everything. Modern mutexes are fast.
- `folly::ProducerConsumerQueue` or `moodycamel::ConcurrentQueue` for actual lock-free queues, written by people who do this for a living.
- Sharded data (per-thread, then merge) to avoid contention entirely.
The right time to write your own lock-free code is after you’ve proven that the off-the-shelf solution is the bottleneck, with profiling.
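The “sharded data” option needs no atomics at all. Here is a minimal sketch, assuming a simple counting workload; `sharded_sum` is a hypothetical name. Each thread accumulates into its own local variable, writes one result to its own slot, and the slots are merged after the threads have been joined.

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Each thread counts privately, then the per-thread results are merged.
long sharded_sum(int thread_count, int per_thread) {
    std::vector<long> shards(thread_count, 0);
    std::vector<std::thread> threads;
    for (int id = 0; id < thread_count; ++id) {
        threads.emplace_back([&shards, id, per_thread] {
            long local = 0;  // thread-private, so no sharing and no contention
            for (int i = 0; i < per_thread; ++i) {
                ++local;
            }
            shards[id] = local;  // single write to this thread's own slot
        });
    }
    for (auto& t : threads) {
        t.join();  // join() orders each thread's write before the merge below
    }
    return std::accumulate(shards.begin(), shards.end(), 0L);
}
```

No two threads ever touch the same element, so the single rule is satisfied trivially, and `sharded_sum(4, 1000)` returns 4000.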
Key takeaways
- The compiler and CPU reorder memory operations to optimize single-threaded execution. Cross-thread ordering doesn’t come for free — you opt in via atomics or locks.
- `std::atomic<T>` gives you atomicity (no torn reads) and a way to specify cross-thread ordering via `memory_order`.
- The four memory-order shapes worth knowing: relaxed (counters), acquire (consumer side of a publish), release (producer side of a publish), seq_cst (default, strongest).
- The single rule: if two threads touch the same memory and at least one writes, use atomics or a mutex. Otherwise it’s a data race and the behavior is undefined.
- Default to `seq_cst`. Profile before reaching for weaker orders.
- Don’t write lock-free data structures from scratch. Use battle-tested libraries.
What’s next
The last lesson — 13 What to Skip (and Why) — closes out the
curriculum by talking about the parts of C++ this course intentionally
de-emphasized: inheritance hierarchies, exception-based control flow,
iostream ceremony. Then four capstones (C1 already shipped; C2–C4
exercise everything you’ve learned).