Curiously Recurring What?
A 30+-year-old C++ Idiom Is Still Useful Today
Over dinner a few months ago, Chris Kitching, the brilliant CTO of Spectral Compute whose SCALE compiler can lower CUDA source code onto AMD GPUs, made an offhand comment about the “curiously recurring template pattern” (CRTP) and how it can be used to achieve polymorphism without incurring the performance cost of indirecting through a vtable.
I thought of the clown car full of limit order book implementations I had lying around; some AVX2, some AVX512, some using different memory layouts, some using different metadata to accelerate key cases. But the various implementations were resident in separate Git branches. It would have been easy to replace the various methods in the order_book class with virtual functions, and indirect through them to test the various implementations; but that would have incurred a Draconian performance hit, and the point of the exercise had been to explore the design space and find the fastest combination. Giving up performance by going across an unnecessary procedure call1 was unacceptable.
So I took an action to explore CRTP as a possible solution to this problem: “flatten” the source code so scalar, AVX2, and AVX-512 code can coexist and be compared side-to-side in the code base, without compromising performance. Tentatively, it’s working out really well, although I can’t yet share the final AVX-512 implementation. I thought now would be a good time to go over CRTP and why it was such a good fit for this exercise of exploring different CLOB implementations.
Whither CRTP?
You may be wondering: what the heck is CRTP? In this case, the Wikipedia article has an excellent article on the topic, including the history that the term was invented by James Coplien in his book, Advanced C++ Programming Styles and Idioms (Copyright 1992). Sean McBride’s recent (June 2025) review of the book characterized Coplien’s book as “the most difficult C++ book I’ve read,” and having read and struggled with it 30 years ago, I share that sentiment. CRTP is such a counterintuitive idiom that the Wikipedia article straightfacedly recounts how Microsoft developer Christian Beaumont “initially thought it could not compile in the Microsoft compiler available at the time”:
The Microsoft Implementation of CRTP in Active Template Library (ATL) was independently discovered, also in 1995, by Jan Falkin, who accidentally derived a base class from a derived class. Christian Beaumont first saw Falkin’s code and initially thought it could not compile in the Microsoft compiler available at the time. Following the revelation that it did work, Beaumont based the entire ATL and Windows Template Library (WTL) design on this mistake.
Formally known as “F-bounded quantification,” CRTP entails having a class derive from a class template instantiation using itself as template argument. The base class can then define static functions that invoke member functions of the derived class; when the compiler is given all these class definitions, plus code that uses the static functions, it has all the information needed to compile the code knowing the type of the derived class. Code that creates and uses instances of the derived classes then transparently benefit from this form of static polymorphism, without even knowing about the enabling linguistic legerdemain.
Backgrounder: Runtime Polymorphism
It may be helpful to give a concrete example, starting with a more straightforward implementation of polymorphism. (You might say as the language designers intended.)
In C++, an abstract base class defines pure virtual functions that specify operations that may be performed on the object, but defer implementation onto their derived classes; in fact, unlike (impure?) virtual functions, derived classes must provide implementations of their parent classes’ pure virtual functions.
The Ray Tracing In One Weekend series (RTOW), whose sample source code is available on GitHub, has a good example of this traditional type of polymorphism.
If you don’t know what ray tracing is, but like to program and enjoy pretty pictures, it may be worth checking out RTOW. Ray tracing is a brute force method of rendering that creates images by modeling the behavior of light rays in the environment. A key abstraction at the center of ray tracing is the ray intersection calculation that determines whether a ray of light intersects with an object and, if the intersection occurs, computes the location of that intersection. To keep the separation of concerns clean between the casting of rays and intersecting those rays with objects, the RTOW code defines an abstract base class hittable, with derived class sphere:2
class hittable {
public:
virtual ~hittable() = default;
virtual bool hit(const ray& r) const = 0;
};
class sphere : public hittable {
sphere(const point3& center, double radius, shared_ptr<material> mat)
: center(center), radius(std::fmax(0,radius)), mat(mat) {}
bool hit(const ray& r) const override {
…
}
};
class triangle : public hittable {
triangle(const point3 vertices[3]): …
bool hit(const ray& r, interval ray_t, hit_record& rec) const override {
… // ray-triangle intersection routine
}
};When a programmer using this class hierarchy creates an instance of sphere, it also is an instance of hittable; and the code can operate on hittable objects without knowing whether they are spheres or triangles. This function, for example, returns the objects that a ray hit:
std::vector<hittable *>
return_hits( const ray& r, const std::vector<hittable *> objects ) {
std::vector<hittable *> ret;
for ( auto obj : objects ) {
if ( obj->hit( ray ) ) {
ret.emplace_back( obj );
}
}
}This polymorphism is enabled by a virtual table, also known as a vtable or vtbl: an array of pointers-to-function, one per class, that the compiler creates behind the scenes. When you call obj->hit(), the compiler uses the pointer to obj to find the vtbl and call the corresponding function.3
This type of runtime polymorphism – where clients interact with object strictly through interfaces, with no knowledge of their internal representation4 - is the bedrock underpinning many important frameworks and technologies, from native language support (e.g., Java interfaces) to technologies like Microsoft’s Common Object Model (COM), which enables language-independent API deployment, to the Service Oriented Architecture (SOA) that Amazon embraced to improve on the inefficiencies of what had been a huge, monolithic code base5.
But the pattern depicted above – where the sphere and triangle classes inherit from, so have an IS-A relationship with hittable (i.e., sphere * and triangle * both are hittable *) – is hard to achieve with CRTP. Chris gave a concise example showing how, which involves using the C++17 type trait std::is_base_of:6
// Boring, Java-esque OOP stuff // CRTP equivalent
template<typename T>
class Shape {}; class Shape : public T {};
class Polygon : public Shape {}; template<typename T>
class Polygon : public Shape<T> {};
class Circle : public Shape {}; template<typename T>
// These “are” polygons, of course. class Circle : public Shape<T> {};
class Square : public Polygon {}; // These “are” polygons, of course.
class Pentagon : public Polygon {}; template<typename T>
class Square : public Polygon<T> {};
class Pentagon : public Polygon<T> {};
// The generic one. // The generic one.
void doSomething(Shape& foo); template<typename T>
void doSomething(T& foo);
// The special overload for polygons. // The special overload for polygons.
void doSomething(Polygon& foo); template<typename T>
requires(std::is_base_of_v<Polygon<T>, T>)
void doSomething(T& foo);
In Chris’s words: “The key point is: notice how we can overload doSomething for the category of classes “Polygon”, even though we used CRTP!”
For our purposes, CRTP is preferable because although we want to explore the design space of different instruction sets, memory layouts, and implementation strategies, we do not want to give up any performance by calling across virtual function boundaries. We could mark the derived classes as final, which empowers the compiler to bypass the vbtl call when it can prove that an object is of the expected type; but CRTP is a more certain path.
So how does CRTP work? Our usage, for static polymorphism, will echo this example from the Wikipedia article:
template <typename T>
struct Base {
void call() {
// ...
static_cast<T*>(this)->implementation();
// ...
}
static void staticFunc() {
// ...
T::staticSubFunc();
// ...
}
};Consider the order_book class in our sample code. The class, as published by Charles Cooper, already was arranged in a way that was amenable to refactoring for CRTP, with static functions such as add_order calling member functions such as ADD_ORDER:
static void add_order(order_id_t const oid, book_id_t const book_idx,
sprice_t const price, qty_t const qty)
{
#if TRACE
printf(”ADD %lu, %u, %d, %u”, oid, book_idx, price, qty);
#endif // TRACE
...
}
void ADD_ORDER(order_t *order, sprice_t const price, qty_t const qty)
{
...
}add_order looks up the correct book (each ‘symbol’ in the exchange - MSFT for Microsoft, for example - has its own instance of order_book), and dispatches the order by calling that book’s ADD_ORDER method.
Let’s say we are exploring the tradeoffs between AOS (array of structures) and SOA (structure of arrays) memory layouts of the limit order book data. Using CRTP, we can templatize the order_book class, then create derived classes order_book_scalar (the original scalar implementation, which uses the AOS layout and will serve as our canonical test reference) and order_book_soa, which rearranges the price levels to be a tuple of std::vector instead of a std::vector of tuples.
The reason we are experimenting with SOA is because it’s needed for optimal SIMD performance. I’ll explain in a later post, when we look at the AVX implementations of order_book.
For the CRTP refactoring, we also need to move some of the order_book members to the derived class – namely the m_bids and m_asks members. These are the per-book snapshots of the aggregate orders at each price level.
static constexpr size_t MAX_BOOKS = 1 << 14;
static constexpr size_t NUM_LEVELS = 1 << 20;
static order_book *s_books; // can we allocate this on the stack?
static oidmap<order_t> oid_map;
using level_vector = pool<level, level_id_t, NUM_LEVELS>;
using sorted_levels_t = std::vector<price_level>;
// A global allocator for all the price levels allocated by all the books.
static level_vector s_levels;
sorted_levels_t m_bids;
sorted_levels_t m_offers;
using level_ptr_t = level_vector::__ptr;The CRTP changes entail templatizing order_book:
template<typename Derived, LAYOUT layout, TARGET_ISA isa, bool TRACE = false>
class order_book
{
...
};and creating derived class order_book_scalar:
class order_book_scalar : public order_book<order_book_scalar, LAYOUT::ARRAY_OF_STRUCTS, TARGET_ISA::GENERIC_C>
{
};The “curiously recurring” aspect of this inheritance structure is that order_book_scalar inherits from itself.
Note that we took this opportunity to templatize the TRACE compile-time option (it had been a preprocessor macro) and the memory layout and ISA (instruction set architecture), to future-proof this class hierarchy to experimenting with SOA memory layouts and AVX2 and AVX-512 implementations of the limit order book.
We moved the m_bids/m_asks members and the ADD_ORDER method from order_book to order_book_scalar, and modified the static function order_book::add_order to specifically reference the Derived class:
s_books[size_t(book_idx)].ADD_ORDER(order, price, qty);
static_cast<Derived *>(&s_books[size_t(order->book_idx)])->ADD_ORDER(order, price, qty);The client code also needs to be refactored, and here, the refactoring of main.cpp had the beneficial side effect of moving almost all of main() into a templatized function to test and measure the performance of an order_book implementation. We introduced a new function template timeBacktest():
template<typename T>
double
timeBacktest( const std::string filename )
{
...
}that handles the ADD_ORDER message from the exchange by specifically referencing the template parameter T:
case (itch_t::ADD_ORDER): {
...
T::add_order(order_id_t(pkt.oid), book_id_t(pkt.stock_locate), ... );
break;
}main() then can invoke timeBacktest() with the order_book class instantiation to test:
timeBacktest<order_book_scalar>( filename );Later refactorings introduce an SOA implementation of order_book, a cross-check function to make sure our speculative CLOB implementations actually work, and command-line options to invoke the full Cartesian product of possible combinations of template parameters.
Before, I had separate GitHub branches with AVX2 and AVX-512 implementations, with various heuristics and metadata controlled by preprocessor constructs. I had to check out a specific branch and configure the preprocessor macros to test a given combination. Using CRTP, we’ll be able to flatten the code base so we can do side-by-side comparisons of the various implementations’ source code – and their performance.
Thanks again to Chris Kitching of Spectral Compute for putting me onto the possibilities of this refactor!
Function calls are inexpensive on modern CPUs, but they hinder compilers' ability to generate optimal code, not least because the ABI requires most registers to be preserved across the function call.
To keep things simple, I’ve taken some liberties with the RTOW code. The hit method in their implementation returns information on where the ray and object intersected, as well as whether they intersected. Also, I invented the triangle class for illustrative purposes.
Developers can authorize the compiler to bypass the virtual function call when possible by making the class final, inhibiting inheritance from that class (and further overloading of the class’s virtual functions).
In reviewing a draft of this article, Chris pointed out that we’re singling out the one thing that CRTP can’t do: type erasure. Without additional information that isn’t always available at compile time, code that operates on the less-derived class cannot be compiled into code that calls the more-derived implementation.
Around 2002, Jeff Bezos wrote an internal memo that exhorted everyone at Amazon to embrace interfaces and to avoid all alternative forms of IPC (interprocess communication). Famously, the memo ended with a threat to fire anyone who failed to comply with the mandate!
For clarity, sample virtual functions have been entirely omitted from this side-by-side comparison.

