C++ Atomics

1. What Are Atomics?

Atomic operations are indivisible hardware-level actions.

On many architectures (e.g., x86), they compile down to a single atomic instruction (like LOCK CMPXCHG), preventing interleaving race conditions when multiple threads access the same memory.
The C++ standard does not require all std::atomic<T> types to be lock-free; on some platforms, the library may fall back to an internal mutex if no hardware atomic instruction exists.

2. High-Level Overview of Compare-and-Swap (CAS)

Compare-and-Swap (CAS) is an atomic read-modify-write primitive that:

Reads a memory location p.
Compares it to an expected value.
If they match, atomically writes a desired new value into p.
Returns true on success (swap occurred), or false on failure (and updates expected to the current value).

This single indivisible step enables building lock-free data structures without traditional mutexes.

Spurious Failures (False Negatives)

A spurious failure is when CAS returns false even though the atomic’s value actually equaled expected. It can fail for reasons unrelated to the value comparison (cache eviction, interrupt, or contention). The C++ standard allows these extra failures so CAS can map efficiently to hardware primitives.

Example of a retry loop:


std::atomic counter = 5;
int expected = 5;
while (!counter.compare_exchange_weak(expected, 6));

`compare_exchange_weak` vs. `compare_exchange_strong`

compare_exchange_weak(expected, desired)
• May fail spuriously (false negative).
• Ideal in loops that retry on any false return.

Because `compare_exchange_weak` takes its `expected` parameter by reference, whenever it fails it writes the atomic’s current value back into `expected` before returning `false`. This could or could not be intended behaviour.

compare_exchange_strong(expected, desired)
• Fails only if the value truly differs from expected.
• Use when you perform exactly one CAS and handle failure explicitly.

When you need exactly one CAS without looping, prefer compare_exchange_strong.

3. What Is the ABA Problem?

The ABA problem occurs in CAS-based lock-free code when a pointer’s value changes from A to B and back to A before a thread’s CAS, causing the CAS to succeed on a different object at the same address.

Thread 1 reads head into oldHead (address A).
Thread 2 pops A, pushes B, frees A, allocates a new node at the same address A, then pushes it.
Thread 1’s CAS sees head == oldHead (A) and succeeds—wrongly operating on the new A.

Why the ABA Problem Only Affects Pointers

ABA arises from reusing memory addresses. Plain integer or other non-pointer values cannot be freed and reallocated at the same address, so they are not vulnerable.

Ways to Mitigate ABA

Version-Tagged Pointers
Wrap each pointer in a small struct that also holds a version counter. On every update, increment the counter before doing CAS on the combined pointer+version (wrapper struct). This way, even if the pointer address returns to its old value, the version will differ and CAS will fail.
Hazard Pointers
Give each thread a slot in a global hazard array where it publishes the pointer it is about to access. When you remove (retire) a node, defer its deletion and later scan the hazard array—only reclaim nodes not present in any slot. This ensures you never free or reuse memory while another thread might still hold a reference.

4. Memory Consistency

Atomic operations are indivisible and leverage hardware cache-coherence protocols to ensure cores eventually see each update. However, compilers and CPUs may reorder instructions around an atomic operation if they believe it won’t affect single-threaded correctness. Such reorderings can break multi-threaded algorithms, so C++ defines four memory-consistency levels:

Relaxed: No ordering guarantees beyond atomicity, least overhead as it allows OOO processing and no need to insert compiler/CPU fences
Consume: A lightweight acquire that only orders operations directly dependent on the loaded value.
Acquire/Release: Point-to-point synchronization on a single atomic.
Acquire-Release (RMW): Read-modify-write acts as both acquire and release.
Sequentially-Consistent (SC): All threads agree on a single global order of operations, most overhead as there is no OOO processing and need to insert compiler/CPU fences.

1. Relaxed

With memory_order_relaxed, any instructions before the atomic may be moved after it, and vice versa. This can lead to surprising outcomes in two-thread interactions:


atomic x = 0
atomic y = 0

Thread 1
  x.store(1, relaxed)
  y.store(1, relaxed)

Thread 2
  r1 = y.load(relaxed)
  r2 = x.load(relaxed)

Under relaxed ordering, Thread 2 can legally observe r1 == 1 but r2 == 0. Under release and acquire (since there is only one writer thread), this is not allowed.

2. Consume

memory_order_consume is a lightweight acquire: it only orders subsequent operations that have a data dependency on the loaded value. Any load or store that uses the result of the consume-load cannot be moved before it.


set head to atomic pointer to Node
p = head.load(memory_order_consume)
v = p->value

In this pseudocode, the consume load of head ensures that the subsequent dependent access to p->value cannot be reordered before the load, safely publishing the initialized node data.

On architectures with dependency guarantees (ARM, PowerPC), this avoids the full fence of an acquire, while still ensuring you see the correct data pointed to by p. Elsewhere (including x86 and most compilers), consume is demoted to memory_order_acquire.

3. and 4. Acquire/Release

Release (memory_order_release) prevents any memory operations before it from being moved after it. Acquire (memory_order_acquire) prevents any memory operations after it from being moved before it. Together they implement a publish-subscribe (producer-consumer) pattern.

Producer: write data then release-store the flag.
Consumer: acquire-load the flag then read data.


atomic data = 0
atomic ready = false

producer
  data = 100
  ready.store(true, memory_order_release)

consumer
  while not ready.load(memory_order_acquire)
    spin
  read data , will be 100

In this example, the release‐store on ready ensures that the write to data cannot be reordered after the flag. When the consumer’s acquire‐load of ready returns true, it also guarantees that it sees the updated data == 100.

5. Acquire–Release (Read–Modify–Write)

A single RMW (e.g. fetch_add, fetch_sub, CAS) with memory_order_acq_rel acts as both:

A release on its write side—no earlier loads/stores can move past it, so any data you’ve produced before remains visible to later readers.
An acquire on its read side—no later loads/stores can move before it, so you see any data published by the writer that did the matching release.

Use acq_rel when you need both barriers in one atomic step:

Reference counting: You decrement the count and, if it hits zero, safely destroy the object.
Lock-free push/pop: A CAS that both reads the old pointer and publishes the new one needs full synchronization on both sides.
Combined publish/subscribe: When a single counter or flag update both consumes state (reads it) and publishes new state for others to see.


set refCount to atomic 1
oldCount = fetch_sub refCount by 1 with memory_order_acq_rel
if oldCount equals 1 then delete p

In this pseudocode, the single atomic fetch_sub both decrements and reads the counter with acquire–release semantics, ensuring no reordering before or after. When it returns 1, it’s safe to delete p and still see all prior initialization writes.

6. Sequentially-Consistent

Sequentially-consistent (the default) enforces a single global order of all atomic operations across threads. In contrast, acquire/release only orders operations on the same atomic.

Acquire/Release allows this outcome:


std::atomic A0, B0;
int r1, r2, r3, r4;

Writer thread 1
A.store(1, std::memory_order_release);

Writer thread 2
B.store(1, std::memory_order_release);

Reader thread 1
r1 = B.load(std::memory_order_acquire);
r2 = A.load(std::memory_order_acquire);

Reader thread 2
r3 = B.load(std::memory_order_acquire);
r4 = A.load(std::memory_order_acquire);

Under acquire/release, it is legal for reader thread 1 to see (r1,r2)==(1,0) and reader thread 2 to see (r3,r4)==(0,1), because there is no cross-variable global ordering.

Sequentially-consistent forbids that cycle:


std::atomic A0, B0;
int r1, r2, r3, r4;

Writer thread 1 (default seq_cst)
A.store(1);

Writer thread 2 (default seq_cst)
B.store(1);

Reader thread 1 (default seq_cst)
r1 = B.load;
r2 = A.load;

Reader thread 2 (default seq_cst)
r3 = B.load;
r4 = A.load;

With seq_cst, all four operations participate in one total order that respects each thread’s program order, so you cannot observe both (1,0) and (0,1) cases. Take note, the case above is for 2 different writer threads. If it was just one writer thread, then both seq_cst and release & acquire would not be able to show the (1,0) and (0,1) cases.