C++ Custom Implementations

In this C++ Internals Blog, I am documenting my custom implementations of essential C++ components, including a hashmap, thread‑safe singly linked list, thread‑safe queue, string class with small string optimization, vector, shared pointer, and unique pointer. Each implementation is backed by a comprehensive suite of test cases written with the gtest framework.

Github Link

For each of these components, I discuss my design decisions, what I learned along the way, potential future improvements, and how they relate to critical low‑latency performance considerations. In particular, I explore how CPU cache lines and memory locality, out‑of‑order (OoO) processing, and CPU branch prediction can affect the performance of these data structures.

Understanding Cache Lines, Memory Locality, Alignment, and False Sharing

Cache Lines and Memory Locality

Modern CPUs have multiple levels of cache (L1, L2, and L3), each operating on fixed-size blocks of memory known as cache lines. Typically, a cache line is 64 bytes. When data is stored contiguously (for example, in a vector), accessing one element usually loads the entire cache line into the cache. This means that if you access the first element of a vector of integers, several subsequent elements are also loaded into the cache. As a result, even if you haven’t directly accessed the second element, it is already available in the cache—a phenomenon known as spatial locality.

In contrast, a cache miss—when the required data is not in the cache—forces the CPU to retrieve data from main memory, which is orders of magnitude slower than cache access. Laying data out to exploit spatial locality is therefore an effective way to increase the cache hit rate.

Cache Alignment

Cache alignment involves organizing data so that its starting address falls on a boundary that matches the size of a cache line (commonly 64 bytes). In C++, you can enforce this by using the alignas(64) specifier. When data structures are aligned on cache line boundaries, a single memory fetch can pull in a complete, useful block of data, reducing the need for multiple, slower fetches if the data were split across cache lines.

Starting on a cache line boundary improves cache hits because it minimizes the possibility of data straddling two cache lines. This ensures that when a cache line is loaded, all of the data you expect to access together is available in one go.

False Sharing

False sharing occurs in multi-threaded environments when independent data items, frequently updated by different threads, reside on the same cache line. Even though the data are unrelated, an update to one piece of data can trigger the invalidation of the entire cache line. Consequently, other threads must fetch the updated cache line, incurring extra latency.

To mitigate false sharing, you can pad data structures or use alignment strategies (such as alignas) to ensure that frequently updated variables are placed in separate cache lines. This prevents the cache coherence mechanism from causing unnecessary slowdowns due to cache line invalidation.
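As a standalone illustration (not code from the implementations below, and with hypothetical names), per-thread counters can be kept on separate cache lines like this:

#include <atomic>
#include <cstddef>

// alignas(64) forces each counter into its own cache-line-sized, cache-line-aligned
// slot, so one thread's updates never invalidate another thread's line.
struct alignas(64) PaddedCounter {
    std::atomic<std::size_t> value{0};
};

PaddedCounter perThreadCounters[4];  // one slot per thread; no false sharing between them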

Out‑of‑Order Execution

Modern CPUs use out‑of‑order (OoO) execution to maximize the utilization of their internal execution units. Instead of strictly following the program order, the CPU reorders instructions at runtime to keep its pipelines busy. It executes instructions as soon as their operands become available rather than waiting for earlier instructions to finish.

How It Works

Impact on Latency

Out‑of‑order execution can significantly improve latency by allowing the CPU to hide memory and execution delays. When independent instructions are executed concurrently, the CPU avoids idling its execution units, effectively increasing throughput.

When OoO Execution Is Limited

Branch Prediction

Branch prediction is a technique used by modern CPUs to guess the outcome of a conditional branch (such as an if statement or loop exit) before it is definitively known. By doing so, the CPU can keep its instruction pipeline full, minimizing stalls that occur when the correct execution path is not immediately clear.

How It Works

How Branch Prediction Reduces Latency

By guessing the outcome of conditional branches in advance, branch prediction keeps the CPU's pipeline busy. This means that while the actual branch decision is being determined, the CPU is already executing the predicted instructions. If the prediction is correct, this process avoids the latency that would otherwise occur from waiting for the branch decision and refilling the pipeline with the correct instructions. In essence, effective branch prediction minimizes idle cycles and maximizes instruction throughput.

When Can Branch Prediction Cause More Latency?

Programmers can assist the compiler in branch prediction by using the C++20 attributes [[likely]] and [[unlikely]] or compiler built‑in functions like __builtin_expect to indicate which side of a condition is expected to be taken.
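A small illustrative sketch (the helper functions are hypothetical):

void processNextItem();      // hypothetical hot-path work
void handleRareEmptyCase();  // hypothetical rare-path work

// C++20 attribute form: mark the rare path so the compiler lays out the hot path first.
void drain(bool queueIsEmpty) {
    if (queueIsEmpty) [[unlikely]] {
        handleRareEmptyCase();
    } else {
        processNextItem();
    }
}

// GCC/Clang builtin form of the same hint: "expect this condition to be false".
void drainWithBuiltin(bool queueIsEmpty) {
    if (__builtin_expect(queueIsEmpty, 0)) {
        handleRareEmptyCase();
    } else {
        processNextItem();
    }
}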

Template‑Based, Header‑Only Design & Rule of Five Compliance

In my custom C++ implementations (except for the string class), I leverage templates so that each component (such as unique_ptr, shared_ptr, vectors, thread‑safe containers, etc.) can handle any type. By defining these classes as templates, I place the entire implementation in header files. This approach is standard in C++ because the compiler must see a template's full definition in every translation unit where it is instantiated.

This design offers several advantages:

Another cornerstone of my design is adherence to the Rule of Five:
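For reference, these are the five special member functions the rule refers to, sketched for a hypothetical resource-owning class:

class Widget {
public:
    ~Widget();                                    // destructor
    Widget(const Widget& other);                  // copy constructor
    Widget& operator=(const Widget& other);       // copy assignment operator
    Widget(Widget&& other) noexcept;              // move constructor
    Widget& operator=(Widget&& other) noexcept;   // move assignment operator
};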

Overall, the template‑based, header‑only design not only simplifies the code generation process and ensures type safety but also creates ample opportunities for compiler optimizations that are key for high‑performance, low‑latency systems.

Unique Pointer

My custom UniquePtr is designed around a simple yet powerful concept: a raw pointer that cleans up after itself once the smart pointer goes out of scope. This automatic resource cleanup is a cornerstone of RAII (Resource Acquisition Is Initialization) in C++, ensuring that resources are properly deallocated without explicit intervention.

Destructor

The destructor of UniquePtr is straightforward. It calls delete on the managed pointer, ensuring that the underlying object is destroyed when the UniquePtr goes out of scope:
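A minimal sketch (assuming the raw pointer is stored in a member named ptr):

~UniquePtr() {
    delete ptr;  // deleting a null pointer is a no-op, so no null check is needed
}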

Explicit Constructor

The constructor that takes a raw pointer is declared with the explicit keyword:
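Roughly (the member name ptr and the default argument are my assumptions):

explicit UniquePtr(T* p = nullptr) : ptr(p) {}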

This is crucial because it prevents implicit conversions from a raw pointer to a UniquePtr. Without explicit, a function expecting a UniquePtr might accidentally receive a raw pointer, leading the compiler to automatically create a temporary UniquePtr. Such implicit conversions can result in unexpected behavior or resource management issues.

Deleted Copy Constructor and Copy Assignment Operator

To enforce unique ownership, copying is disabled by deleting the copy constructor and copy assignment operator:
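In essence, the declarations are:

UniquePtr(const UniquePtr& other) = delete;             // no copy construction
UniquePtr& operator=(const UniquePtr& other) = delete;  // no copy assignment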

With these functions deleted, only one instance of UniquePtr can own a given pointer, preventing issues like double deletion.

Low-Level Performance Considerations

In terms of memory, a raw pointer typically occupies 4 or 8 bytes (depending on the architecture), which is a very small footprint that easily fits within a single cache line. This compact size helps keep cache usage minimal and ensures that access to the pointer is extremely fast.

Branch Prediction in Move Assignment

In the move assignment operator, I include a conditional check to guard against self-assignment:
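A hedged sketch of that operator (the member name ptr is an assumption):

UniquePtr& operator=(UniquePtr&& other) noexcept {
    if (this != &other) {      // guard against self-move-assignment
        delete ptr;            // release the currently owned object
        ptr = other.ptr;       // take ownership of the other pointer
        other.ptr = nullptr;   // leave the source empty but valid
    }
    return *this;
}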

Although this if (this != &other) introduces a branch, modern CPU branch predictors are highly efficient at handling such simple conditions. Additionally, self-assignment via move is a very rare occurrence, so the predictor almost always guesses correctly and the latency overhead is negligible.

Shared Pointer

Unlike UniquePtr, my custom SharedPtr allows multiple instances to own the same object. This shared ownership is built on top of the RAII principles that govern UniquePtr, with the addition of a reference counter.

The reference counter is stored as an atomic variable: std::atomic<size_t>* ref_count;

Using std::atomic ensures thread safety because atomic read‑modify‑write operations execute indivisibly (on mainstream hardware they typically compile down to a single locked instruction), so concurrent increments and decrements cannot interleave and corrupt the count. While I could have guarded the reference counter with a mutex, that approach would introduce lock contention and additional overhead, so an atomic variable is a more efficient solution.

Why size_t?
The counter uses size_t because this unsigned integer type is designed to represent sizes and counts. Although int could also hold typical counts (and on most 64‑bit platforms size_t is 8 bytes, wider than int), using size_t rules out negative values and semantically indicates a count or size, which is appropriate for reference counting.

The behavior of the SharedPtr is as follows:
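In short: copying a SharedPtr increments the count, destroying one decrements it, and the last owner deletes both the managed object and the counter. A hedged sketch of those two paths (member names ptr and ref_count are assumptions):

// Copying shares ownership and bumps the count atomically.
SharedPtr(const SharedPtr& other) : ptr(other.ptr), ref_count(other.ref_count) {
    if (ref_count) ref_count->fetch_add(1);
}

// Destruction drops the count; the owner that brings it to zero cleans up.
~SharedPtr() {
    if (ref_count && ref_count->fetch_sub(1) == 1) {
        delete ptr;
        delete ref_count;
    }
}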

An interesting detail of the standard library's std::shared_ptr is its additional weak reference count. A weak pointer does not contribute to the strong reference count, so it never keeps the managed object alive. Initially, I was puzzled by the purpose of a weak reference count: it keeps the control block (not the object) alive, so that an attempt to convert a weak pointer into a shared pointer can fail gracefully when the managed object has already been destroyed, avoiding access to invalid memory and preventing potential segmentation faults.

Cache Coherency and Performance Considerations:
When a thread increments or decrements the atomic counter, the corresponding cache line is updated and potentially invalidated across cores. While this does introduce some overhead through the cache coherence mechanism, shared pointers are generally less prone to false sharing compared to other highly mutable data structures—since reference count modifications occur relatively infrequently.

Vector

My custom Vector is built around a few essential private members:
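Concretely (the types are my assumption; the names ptr, m_capacity, and idx come from the discussion below):

T* ptr;                  // raw pointer to the heap-allocated element storage
std::size_t m_capacity;  // how many elements the current allocation can hold
std::size_t idx;         // how many elements are currently constructed (the size)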

The Vector class is implemented as a template and is entirely header‑only, ensuring that the compiler generates type‑specific code and maximizing the possibility for inlining. This design is especially useful in performance‑sensitive environments.

Construction, Copying, and Moving

The default constructor sets ptr to nullptr and initializes both m_capacity and idx to zero. The copy constructor allocates sufficient memory and uses placement new to copy each element. The move constructor transfers ownership of the pointer, resetting the source vector to a safe empty state. The use of std::move assumes noexcept move constructors; if a move throws, the vector can be left in an invalid state.

To simplify exception safety and memory management, both the copy and move assignment operators follow the “copy‑and‑swap” idiom. A temporary vector is constructed from the source (by copying or moving), and then swapped with the current instance. This design minimizes explicit reallocation logic while allowing the temporary object to handle cleanup when it goes out of scope.

Insertion: push_back() and emplace_back()

When inserting a new element, if the vector is full, the resize() function is invoked to double the capacity (or allocate one unit if the vector is initially empty).
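Hedged sketches of the two insertion paths (not the exact implementation, but the shape of it):

void push_back(const T& value) {
    if (idx == m_capacity) resize();  // grow (double) when full
    new (ptr + idx) T(value);         // copy-construct the element in place
    ++idx;
}

template <typename... Args>
void emplace_back(Args&&... args) {
    if (idx == m_capacity) resize();
    new (ptr + idx) T(std::forward<Args>(args)...);  // construct in place via perfect forwarding
    ++idx;
}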

Resizing Mechanism

The resize() function does the heavy lifting for growing the vector:
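A hedged sketch of that growth path, using the members sketched earlier:

void resize() {
    std::size_t newCapacity = (m_capacity == 0) ? 1 : m_capacity * 2;
    // Allocate raw, uninitialized storage for the larger block.
    T* newPtr = static_cast<T*>(::operator new(newCapacity * sizeof(T)));
    // Move each existing element into the new block via placement new, then destroy the original.
    for (std::size_t i = 0; i < idx; ++i) {
        new (newPtr + i) T(std::move(ptr[i]));
        ptr[i].~T();
    }
    ::operator delete(ptr);
    ptr = newPtr;
    m_capacity = newCapacity;
}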

Low‑Level Performance Considerations

Cache Locality: A key benefit of a vector’s contiguous memory layout is excellent cache locality. When iterating over elements, the CPU prefetches adjacent data—if the first index of an integer vector is accessed, the adjacent elements are loaded into cache as well. This minimizes cache misses, which is especially critical in low‑latency systems like high‑frequency trading.

Thread Safety (Future Consideration): This vector is currently not thread‑safe. If you were to add locks to make it thread‑safe, you could face challenges:

Out‑of‑Order Execution: In the inner loops of functions like resize(), where elements are moved or constructed in a new memory block, instructions that are independent of each other can be executed out‑of‑order by the CPU. This helps to hide memory latency and improve throughput, provided that there are few dependencies between the operations.

Branch Prediction: Simple conditionals, such as checking if an index is within bounds in operator[] or determining if the vector is full in push_back(), are highly predictable in typical use cases. Modern CPUs efficiently predict these branches, so their cost is minimal. When branches are well‑predicted, the instruction pipeline remains full, further reducing latency.

Overall, my custom vector implementation focuses on efficiency and robustness, leveraging raw pointer flexibility and modern C++ features like move semantics and perfect forwarding. At the same time, I consider the interplay of data structure design with underlying hardware aspects to ensure that even the smallest inefficiencies are minimized.

Thread Safe Queue

In my Thread Safe Queue, I wrap a standard std::queue with a coarse‑grained lock and a condition variable to ensure thread safety. A single mutable std::mutex protects the entire queue, while the condition variable (std::condition_variable) is used to signal waiting threads when new data becomes available. Using mutable here allows even const member functions to lock and unlock the mutex safely.

Functions like push() lock the mutex using std::lock_guard (an RAII wrapper that automatically releases the lock when the scope ends), ensuring that the queue is modified in a thread‑safe manner. Once a new element is pushed, cv.notify_one() is called to wake up a thread that may be waiting for data.
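A hedged sketch of push() (member names q, m, and cv are assumptions; here the notification happens after the lock is released, which is one common choice):

void push(T value) {
    {
        std::lock_guard<std::mutex> lock(m);  // RAII: the lock is released at the end of this scope
        q.push(std::move(value));
    }
    cv.notify_one();  // wake one thread waiting in wait_pop()
}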

The wait_pop() function acquires the queue's mutex and pops immediately if the queue is not empty. If the queue is empty, the function does not return right away; instead it waits for the condition variable to be notified and then re-checks whether the queue is empty, guarding against spurious wakeups. Once the queue is non-empty, it pops.
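A hedged sketch that expresses the same logic with a wait predicate (the predicate is re-evaluated on every wakeup, so spurious wakeups simply go back to sleep):

T wait_pop() {
    std::unique_lock<std::mutex> lock(m);          // unique_lock, because cv.wait() must be able to unlock it
    cv.wait(lock, [this] { return !q.empty(); });  // sleep until notified and the queue is non-empty
    T value = std::move(q.front());
    q.pop();
    return value;
}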

Why Condition Variables Instead of an Atomic Flag?

One might wonder why not simply use an atomic flag and a busy loop instead of a condition variable. The key reasons are:

Future Considerations: Fine‑Grained vs. Coarse‑Grained Locking

While my current design uses a coarse‑grained lock (one mutex for the entire queue), there are scenarios where you might consider fine‑grained locking—such as having one lock per index. However, this approach poses challenges:

False Sharing and Lock Placement: If locks are allocated contiguously (for example, stored in an array), adjacent locks could reside on the same cache line. When one thread acquires a lock, the cache line is updated and invalidated across cores. If another thread attempts to acquire an adjacent lock, it may experience a delay due to waiting for the updated cache line.

In contrast, a single coarse‑grained lock minimizes the number of lock objects, reducing the potential for false sharing but at the cost of serializing access to the entire queue.

Out‑of‑Order Execution and Branch Prediction

Out‑of‑Order Execution: Modern CPUs can rearrange independent instructions to maximize throughput. However, once a thread acquires the mutex, the critical section becomes serialized, which limits the benefits of out‑of‑order execution. That said, independent operations outside the locked region can still benefit from OoO execution.

Branch Prediction: The code for lock acquisition (using std::lock_guard or std::unique_lock) involves simple, highly predictable conditionals. Since the common case is that the lock is available, modern CPUs can usually predict these branches efficiently, resulting in minimal overhead.

Overall, while the thread‑safe queue's single lock design serializes access, it provides a simple and reliable mechanism to ensure thread safety, minimizing CPU overhead and unnecessary busy waiting.

Thread Safe Singly Linked List

My thread safe singly linked list is designed to provide fine‑grained locking while ensuring robust memory management. To do this, each node in the list encapsulates its data and synchronization primitives. This design allows individual nodes to be locked independently, which is a different approach compared to the coarse‑grained lock used in my thread safe queue.

Node Structure

Each node in the linked list contains:

Linked List Pointers

To simplify boundary conditions, the linked list uses dummy (sentinel) nodes:

I chose smart pointers to manage both the value and the next pointer because they automatically handle memory deallocation when the list goes out of scope.
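Putting those pieces together, a hedged sketch of the node (names are assumptions):

struct Node {
    std::shared_ptr<T> value;    // smart pointer releases the payload automatically
    std::shared_ptr<Node> next;  // smart pointer keeps the rest of the chain alive
    std::mutex m;                // per-node mutex for fine-grained locking
};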

Locking Strategy

In order to allow concurrent modifications while maintaining data integrity, my design uses fine‑grained locking with one mutex per node:
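A hedged sketch of an insertion at the front under this scheme (names are assumptions). Because a node's mutex protects its next pointer, a single lock on the head sentinel suffices here; operations that touch two adjacent nodes, such as erasing, would lock both, for example with std::scoped_lock:

void push_front(T value) {
    auto newNode = std::make_shared<Node>();
    newNode->value = std::make_shared<T>(std::move(value));
    std::lock_guard<std::mutex> lock(dummy_head->m);  // protects dummy_head->next
    newNode->next = dummy_head->next;
    dummy_head->next = newNode;
}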

Potential Deadlock Considerations:
Although std::scoped_lock helps by atomically locking multiple mutexes, care must be taken to:

As a future enhancement, I plan to implement a thread‑safe doubly linked list, where the next pointer remains a shared pointer and the previous pointer is managed by a weak pointer to avoid circular references.

Cache Lines and False Sharing

In a linked list, each node is allocated separately (often via std::make_shared). Because these nodes are not guaranteed to be contiguous, the embedded mutexes are less likely to reside on the same cache line. This reduces the risk of false sharing, where multiple threads updating data on the same cache line cause unnecessary cache invalidation.

Branch Prediction

My linked list operations include conditional checks—for example, verifying whether dummy_head->next equals dummy_tail to determine if the list is empty. These conditionals are typically highly predictable, so modern CPUs efficiently predict these branches, resulting in minimal performance overhead.

Hashmap

In my custom hashmap, collisions are resolved using separate chaining. I organize the data into an array of buckets, where each bucket is a std::list of key/value pairs (with type alias value_type defined as std::pair<const K, V>). This design provides:

Dynamic resizing is managed via a load factor threshold. I set maxLoadFactor to 0.7. This load factor represents the ratio of elements to buckets at which performance begins to degrade, since overly long chains cause lookups to slow down. When the ratio exceeds 0.7, the hashmap automatically rehashes, increasing the bucket count and redistributing elements. The value 0.7 is a common heuristic—it strikes a balance between memory usage and lookup performance, ensuring near-constant average time complexity for insertions and lookups.
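A hedged sketch of the check that insert() would perform before adding an element (numElements, the use of a std::vector for mStore, and the doubling policy are assumptions):

if (static_cast<double>(numElements + 1) / mStore.size() > maxLoadFactor) {
    rehash(mStore.size() * 2);  // grow the bucket array and redistribute every entry
}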

The bucket array (mStore) is allocated contiguously, which provides good cache locality during bucket indexing. However, the individual nodes within each bucket—each element of the underlying std::list—are allocated separately on the heap. This means that while accessing a specific bucket is fast, iterating over the linked list may incur more cache misses compared to a fully contiguous data structure. Fortunately, with a good hash function, collisions will be minimal.

I made extensive use of type aliasing to simplify the code. For example:

using value_type = std::pair<const K, V>;
using Bucket = std::list<value_type>;

These aliases make the code cleaner and easier to maintain.

To allow users to iterate over all key/value pairs in the hashmap, I built a custom forward iterator that traverses bucket by bucket and then iterates within each bucket’s linked list.

Why Use std::move Instead of memcpy?

When rehashing, I use std::move for moving key/value pairs into the new buckets rather than memcpy. This is because key/value pairs may involve non‑trivial types that require proper invocation of move constructors or destructors. std::move ensures that the elements’ state is transferred safely and their resources are managed correctly, preserving proper object semantics.
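A hedged sketch of such a rehash loop (assuming mStore is a std::vector<Bucket> and hashFn is the hash functor):

void rehash(std::size_t newBucketCount) {
    std::vector<Bucket> newStore(newBucketCount);
    for (Bucket& bucket : mStore) {
        for (value_type& kv : bucket) {
            std::size_t target = hashFn(kv.first) % newBucketCount;
            // std::move invokes the pair's move semantics instead of copying raw bytes,
            // so non-trivial value types hand over their resources correctly.
            newStore[target].push_back(std::move(kv));
        }
    }
    mStore = std::move(newStore);
}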

My Design vs. STL’s unordered_map

Similarities:

Differences:

Low‑Level Performance Considerations

Cache Lines and Memory Layout: Since my bucket array (mStore) is allocated contiguously, accessing a specific bucket benefits from spatial locality. However, because the nodes in each bucket’s linked list are dynamically allocated, traversing a bucket may result in more cache misses. With a good hash function that distributes keys well, collisions are reduced, and buckets are used more efficiently.

False Sharing: Although my hashmap isn’t initially designed for multi-threading, if extended for concurrent access false sharing might become an issue. If multiple threads update adjacent buckets (which are contiguous in memory), the cache line containing these buckets could become a contention point. STL libraries often mitigate this through padding or specialized allocators. A well‑distributed hash function helps spread bucket usage across memory, reducing the risk of false sharing.

Out‑of‑Order Execution: Operations like hashing and quickly iterating over short linked lists are independent, allowing modern CPUs to execute these instructions out‑of‑order. However, during rehashing, when many elements must be moved, the process becomes more sequential, limiting OoO benefits.

Branch Prediction: The control flow in my implementation—such as checking for empty buckets or determining whether to rehash—introduces some branches. Modern CPUs can predict these branches accurately, minimizing the latency impact of these conditionals.

String

In my custom String class, I implemented Small String Optimization (SSO) to efficiently handle common short strings—such as filenames, user inputs, and log messages—without incurring the overhead of dynamic memory allocation. The design uses a fixed-size internal buffer that holds up to 128 characters (plus a null terminator); 128, being a multiple of 64, aligns well with common cache line sizes. This not only improves cache locality but also helps reduce false sharing in multi-threaded scenarios (although my string implementation itself is not thread safe).

Key Design Elements

Constructors and Appending Functions

The class provides several constructors:

For appending, there are functions that accept either a C-style string or another custom String object. In both cases, the process involves:
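A hedged sketch of the C-style-string overload for the case where the result still fits in the small buffer (the heap fallback is omitted; smallBuffer, len, and SSO_SIZE are discussed below):

String& append(const char* s) {
    std::size_t addLen = std::strlen(s);
    if (len + addLen <= SSO_SIZE) {                 // result still fits in the SSO buffer
        std::memcpy(smallBuffer + len, s, addLen);  // copy the new characters after the existing ones
        len += addLen;
        smallBuffer[len] = '\0';                    // keep the buffer null-terminated
    }
    return *this;
}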

This design ensures that short strings are managed efficiently, which is especially beneficial since small strings are extremely common in many applications.

Custom Iterator

I also implemented a custom bidirectional iterator for the String class, which supports both pre- and post-increment/decrement operations. While my implementation uses a single iterator type with some casting to handle both mutable and const access, a production version might provide separate iterator and const_iterator types for stronger const-correctness.
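Since the characters live in a contiguous buffer, the iterator is essentially a thin wrapper around a char pointer; a hedged sketch:

class Iterator {
    char* p;
public:
    explicit Iterator(char* ptr) : p(ptr) {}
    Iterator& operator++()    { ++p; return *this; }                      // pre-increment
    Iterator  operator++(int) { Iterator tmp = *this; ++p; return tmp; }  // post-increment
    Iterator& operator--()    { --p; return *this; }                      // pre-decrement
    Iterator  operator--(int) { Iterator tmp = *this; --p; return tmp; }  // post-decrement
    char& operator*() const   { return *p; }
    bool operator!=(const Iterator& other) const { return p != other.p; }
};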

Clearing the String

To clear the string, I simply set the first element of the smallBuffer to the null terminator ('\0') and update the length to zero:

smallBuffer[0] = '\0';
len = 0;

Memory Layout: Where Is the SSO Stored?

The SSO buffer (smallBuffer) is declared as a fixed-size array within the String class:

char smallBuffer[SSO_SIZE + 1]; // SSO_SIZE is 128

Its exact location in memory depends on how the String object is allocated:

There's no need for manual deallocation of smallBuffer because it is managed automatically by the object's lifetime.

Copy Assignment vs. Move Assignment

The copy assignment operator uses the copy‑and‑swap idiom to safely and efficiently transfer resources. Because the SSO buffer is an array embedded in the object, it has a fixed memory location and cannot simply be swapped by exchanging pointers (with std::swap). Instead, std::memcpy is used to swap the contents of the small buffers, byte by byte. Casting the buffer to an rvalue reference with std::move does not transfer its contents either: the pointer-stealing trick that makes moves cheap only applies to heap-allocated storage, not to an array embedded within the object.
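A hedged sketch of the swap helper the idiom relies on (member names are assumptions; a real version would also swap any heap-related members):

void swap(String& other) noexcept {
    char tmp[SSO_SIZE + 1];
    std::memcpy(tmp, smallBuffer, SSO_SIZE + 1);                // stash our buffer
    std::memcpy(smallBuffer, other.smallBuffer, SSO_SIZE + 1);  // take theirs
    std::memcpy(other.smallBuffer, tmp, SSO_SIZE + 1);          // give them ours
    std::swap(len, other.len);
}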

In contrast, the move assignment operator transfers ownership of any heap‑allocated memory or, in the SSO case, copies the buffer contents from the source to the destination. The source is then reset to a valid empty state. Since we are "stealing" resources in a move, the swap idiom is unnecessary.

Performance Impact: Cache Lines, OoO Processing, and Branch Prediction

Cache Line Proximity: By setting the SSO threshold to 128 bytes—a multiple of 64—most short strings fit within one or two cache lines. This means that when the string is accessed, the entire string is likely to already be in cache, resulting in fewer cache misses.

Contiguous Storage and Memory Layout: Because the SSO buffer is embedded in the object, its data is stored contiguously with other member variables. This enhances spatial locality and reduces the chance of false sharing in a multi-threaded context, as each object's SSO buffer is isolated within its own memory allocation.

Out-of-Order Processing: Functions like append() and string construction perform a minimal number of branches and independent operations (such as checking the string length or copying data). Modern CPUs can execute these independent operations out-of-order, which helps hide memory latency and improve overall throughput.

Branch Prediction: Most conditional checks in the string implementation (such as verifying whether the string fits within the SSO buffer or whether a pointer is null) are simple and predictable. As a result, modern CPUs accurately predict these branches, minimizing pipeline stalls. Uniform code paths—for example, centralizing append logic—help further by ensuring consistent branch behavior.