C++ Atomics

1. What Are Atomics?
Atomic operations are indivisible hardware-level actions.
-
On many architectures (e.g., x86), they compile down to a single atomic instruction (like
LOCK CMPXCHG
), preventing interleaving race conditions when multiple threads access the same memory. -
The C++ standard does not require all
std::atomic<T>
types to be lock-free; on some platforms, the library may fall back to an internal mutex if no hardware atomic instruction exists.
2. High-Level Overview of Compare-and-Swap (CAS)
Compare-and-Swap (CAS) is an atomic read-modify-write primitive that:
- Reads a memory location
p
. - Compares it to an
expected
value. - If they match, atomically writes a
desired
new value intop
. -
Returns
true
on success (swap occurred), orfalse
on failure (and updatesexpected
to the current value).
This single indivisible step enables building lock-free data structures without traditional mutexes.
Spurious Failures (False Negatives)
A spurious failure is when CAS returns false
even though the atomic’s value actually
equaled expected
. It can fail for reasons unrelated to the value comparison (cache eviction,
interrupt, or contention). The C++ standard allows these extra failures so CAS can map efficiently to hardware
primitives.
Example of a retry loop:
std::atomic counter = 5;
int expected = 5;
while (!counter.compare_exchange_weak(expected, 6));
compare_exchange_weak
vs. compare_exchange_strong
-
compare_exchange_weak(expected, desired)
• May fail spuriously (false negative).
• Ideal in loops that retry on anyfalse
return. -
compare_exchange_strong(expected, desired)
• Fails only if the value truly differs fromexpected
.
• Use when you perform exactly one CAS and handle failure explicitly.
Because `compare_exchange_weak` takes its `expected` parameter by reference, whenever it fails it writes the atomic’s current value back into `expected` before returning `false`. This could or could not be intended behaviour.
When you need exactly one CAS without looping, prefer compare_exchange_strong
.
3. What Is the ABA Problem?
The ABA problem occurs in CAS-based lock-free code when a pointer’s value changes from A to B and back to A before a thread’s CAS, causing the CAS to succeed on a different object at the same address.
- Thread 1 reads
head
intooldHead
(address A). - Thread 2 pops A, pushes B, frees A, allocates a new node at the same address A, then pushes it.
- Thread 1’s CAS sees
head == oldHead
(A) and succeeds—wrongly operating on the new A.
Why the ABA Problem Only Affects Pointers
ABA arises from reusing memory addresses. Plain integer or other non-pointer values cannot be freed and reallocated at the same address, so they are not vulnerable.
Ways to Mitigate ABA
- Version-Tagged Pointers
Wrap each pointer in a small struct that also holds a version counter. On every update, increment the counter before doing CAS on the combined pointer+version (wrapper struct). This way, even if the pointer address returns to its old value, the version will differ and CAS will fail. - Hazard Pointers
Give each thread a slot in a global hazard array where it publishes the pointer it is about to access. When you remove (retire) a node, defer its deletion and later scan the hazard array—only reclaim nodes not present in any slot. This ensures you never free or reuse memory while another thread might still hold a reference.
4. Memory Consistency
Atomic operations are indivisible and leverage hardware cache-coherence protocols to ensure cores eventually see each update. However, compilers and CPUs may reorder instructions around an atomic operation if they believe it won’t affect single-threaded correctness. Such reorderings can break multi-threaded algorithms, so C++ defines four memory-consistency levels:
- Relaxed: No ordering guarantees beyond atomicity, least overhead as it allows OOO processing and no need to insert compiler/CPU fences
- Consume: A lightweight acquire that only orders operations directly dependent on the loaded value.
- Acquire/Release: Point-to-point synchronization on a single atomic.
- Acquire-Release (RMW): Read-modify-write acts as both acquire and release.
- Sequentially-Consistent (SC): All threads agree on a single global order of operations, most overhead as there is no OOO processing and need to insert compiler/CPU fences.
1. Relaxed
With memory_order_relaxed
, any instructions before the atomic may be moved after it, and vice versa.
This can lead to surprising outcomes in two-thread interactions:
atomic x = 0
atomic y = 0
Thread 1
x.store(1, relaxed)
y.store(1, relaxed)
Thread 2
r1 = y.load(relaxed)
r2 = x.load(relaxed)
Under relaxed ordering, Thread 2 can legally observe r1 == 1
but r2 == 0
. Under release
and acquire (since there is only one writer thread), this is not allowed.
2. Consume
memory_order_consume
is a lightweight acquire: it only orders subsequent operations that have a data dependency
on the loaded value. Any load or store that uses the result of the consume-load cannot be moved before it.
set head to atomic pointer to Node
p = head.load(memory_order_consume)
v = p->value
In this pseudocode, the consume
load of head
ensures that the subsequent dependent access
to
p->value
cannot be reordered before the load, safely publishing the initialized node data.
On architectures with dependency guarantees (ARM, PowerPC), this avoids the full fence of an acquire, while still
ensuring you see the correct data pointed to by p
. Elsewhere (including x86 and most compilers),
consume is demoted to memory_order_acquire.
3. and 4. Acquire/Release
Release (memory_order_release) prevents any memory operations before it from being moved after it. Acquire (memory_order_acquire) prevents any memory operations after it from being moved before it. Together they implement a publish-subscribe (producer-consumer) pattern.
- Producer: write data then release-store the flag.
- Consumer: acquire-load the flag then read data.
atomic data = 0
atomic ready = false
producer
data = 100
ready.store(true, memory_order_release)
consumer
while not ready.load(memory_order_acquire)
spin
read data , will be 100
In this example, the release‐store on ready
ensures that the write to
data
cannot be reordered after the flag. When the consumer’s acquire‐load of ready
returns
true
, it also guarantees that it sees the updated data == 100
.
5. Acquire–Release (Read–Modify–Write)
A single RMW (e.g. fetch_add
, fetch_sub
, CAS) with
memory_order_acq_rel
acts as both:
- A release on its write side—no earlier loads/stores can move past it, so any data you’ve produced before remains visible to later readers.
- An acquire on its read side—no later loads/stores can move before it, so you see any data published by the writer that did the matching release.
Use acq_rel
when you need both barriers in one atomic step:
- Reference counting: You decrement the count and, if it hits zero, safely destroy the object.
- Lock-free push/pop: A CAS that both reads the old pointer and publishes the new one needs full synchronization on both sides.
- Combined publish/subscribe: When a single counter or flag update both consumes state (reads it) and publishes new state for others to see.
set refCount to atomic 1
oldCount = fetch_sub refCount by 1 with memory_order_acq_rel
if oldCount equals 1 then delete p
In this pseudocode, the single atomic fetch_sub
both decrements and reads the counter with acquire–release
semantics, ensuring no reordering before or after. When it returns 1, it’s safe to delete p
and still
see all prior initialization writes.
6. Sequentially-Consistent
Sequentially-consistent (the default) enforces a single global order of all atomic operations across threads. In contrast, acquire/release only orders operations on the same atomic.
Acquire/Release allows this outcome:
std::atomic A0, B0;
int r1, r2, r3, r4;
Writer thread 1
A.store(1, std::memory_order_release);
Writer thread 2
B.store(1, std::memory_order_release);
Reader thread 1
r1 = B.load(std::memory_order_acquire);
r2 = A.load(std::memory_order_acquire);
Reader thread 2
r3 = B.load(std::memory_order_acquire);
r4 = A.load(std::memory_order_acquire);
Under acquire/release, it is legal for reader thread 1 to see (r1,r2)==(1,0)
and reader thread 2 to see (r3,r4)==(0,1)
, because there is no cross-variable global ordering.
Sequentially-consistent forbids that cycle:
std::atomic A0, B0;
int r1, r2, r3, r4;
Writer thread 1 (default seq_cst)
A.store(1);
Writer thread 2 (default seq_cst)
B.store(1);
Reader thread 1 (default seq_cst)
r1 = B.load;
r2 = A.load;
Reader thread 2 (default seq_cst)
r3 = B.load;
r4 = A.load;
With seq_cst, all four operations participate in one total order that respects each thread’s program order, so you
cannot observe both (1,0)
and (0,1)
cases. Take note, the case above is for 2 different writer
threads. If it was just one writer thread, then both seq_cst and release & acquire would not be able to show the (1,0)
and (0,1) cases.