Fun New Project: itch-order-simd

First of a series on optimizing a limit order book with AVX

Nov 14, 2025

If you spend enough time at a high-frequency trading firm, even the techies learn about Central Limit Order Books or CLOBs, the central abstraction of choice for many markets to facilitate trading. If you aren’t familiar with the concept, this article by NASDAQ is an excellent introduction. Another fun overview of high-frequency trading that touches briefly on the subject at hand, may be found here at this article “Inside High-Frequency Trading Systems.”

A “limit order book” is a list of orders to buy and sell “symbols,” the generic name for stocks, bonds, options, futures, or whatever other commodity is being traded on the exchange (there is a glossary at the bottom of this article). Market participants submit orders with the price and quantity of the symbol that they wish to buy or sell, and the exchange has rules to match these buyers and sellers and execute the requested trades. As one might expect, the exchange’s specialized technology to perform this task is called the “matching engine.”

The term “limit” in “limit order book” refers to the price that a trader specified along with their order. A buy order (“bid”) at a given price means the trader does not wish the order to be filled above that price; for sell orders (“asks”), the seller does not wish to sell below a given price. Traders often submit bids that are a bit lower, or asks that are a bit higher, than the prices currently being traded on the exchange; a “market order” is a request that the exchange simply fill the order at the best-available price, but often that is a construct of the retail trading market, not a primitive operation offered by the exchange. If an order cannot be immediately filled, it stays in the book until filled or cancelled.

The current price of a commodity is difficult to define: one only knows the price at which the most recent trade was fulfilled, but by definition, the price may be moving. The lowest ask price is always higher than the highest bid price, inviting the definition of a market’s “spread”: the difference between the two. As a rule, the smaller the spread, the more liquidity is in the market. (Also as a rule, HFT firms’ profit is highly correlated with the size of the spread; the bigger the spread, the more opportunity there is for them to submit profitable trades.) In any case, whether the prices you see scrolling across the CNBC chyron are “midpoints” (the average between the lowest ask and the highest bid) or the last price filled, obviously it’s subject to change, possibly quickly.

Computers that are plugged into the exchange get updates with tremendous frequency – depending on the exchange, up to millions of times per second: mostly new orders (whether they be to buy or sell a symbol), cancellations of existing orders, and “fills,” notices that the exchange’s matching engine has filled corresponding bid and ask orders that had been on the books. Since the CLOB is huge in comparison to an order, most exchanges require market participants to maintain their own accurate mirrors of the exchange’s CLOB as updates pour into their trading systems.

Although details vary by exchange, updates typically include ADD, DELETE, and EXECUTE orders, with each order containing only the security, price, quantity, and whether the order is a bid or ask at the given price. As a consequence, efficiently updating the levels corresponding to the security in question is the stock in trade (dyswidt) of every high frequency market participant, preferably with the levels sorted by price and canonicalized to a form more amenable to processing by the strategy code.

Often, HFT firms maintain an aggregate book that distills all the pending orders into price levels, with the total quantities of bid and ask orders resting at each level. The “top” of the book, with the orders closest to the bid-ask spread, is the main focus of many trading algorithms. Some are as simple as observing that the book is “weighted” too heavily in one direction or the other: for example, if there are more asks than bids, to enough of a degree, then the commodity is likely to decline.

The topic of this blog series will be to maintain an aggregate book, minimizing latency of updates to maximize a prospective strategy’s opportunity to process the book and make trading decisions.

Having worked on production HFT code, I took an interest in fast CLOB updates, with an especial focus on SIMD-friendly algorithms to do the updates quickly. CLOB code is fun to optimize and also makes for a good illustration of the transformations that need to be applied to enable SIMD processing to work best.

To start, I looked around the Internet for an open source implementation, hoping to find one specialized to the ITCH feed used by NASDAQ to trade equities. The high volume and low latency requirements to successfully trade on NASDAQ are legion, ITCH is well-documented, and you can reach out to NASDAQ to obtain sample ITCH files to use to simulate processing. This implementation is a good example of a ‘textbook’ implementation, using data structures like hashmaps for order lookup and a doubly-linked list for easy removal of the order upon fulfillment or cancellation. The Order structure looks like this:

struct Level {
    uint32_t price;
    uint32_t limitVolume;
    // either both are nullptr or both are populated
    // i.e. if orders in Level == 1, both pointers are ==
    Order* first;
    Order* last;
    Level(uint32_t _price);
};

To the author’s credit, they use a pool for allocation, not the default C++ heap, but my intuition leads me away from using this type of data structure for a CLOB – on 64-bit systems, the pointers alone are 8B each. Since data movement is the limiting reagent of all compute, a smaller memory footprint often leads to faster code.

An implementation closer to my starting point was built by a Charles Cooper, whose implementation uses std::vector to hold the aggregate limit order book. Cooper’s implementation is both older (work started c. 2014) and claims much lower per-trade latency than the previously-linked repository (61ns v. 92ns). Quoting from the readme:

This is a very fast implementation of the ITCH order book, clocking in at around 61ns / tick (or 16 million messages / second, tested on a 2012 i7-3820), offering fast updates and O(1) access to any price level (to get the price is a single dereference, the aggregate quantity is another dereference). It only calculates the aggregate quantities at each price and does not track the queue depth for each order.

The order structure in Cooper’s repository looks like this:

enum class qty_t : uint32_t {};
enum class book_id_t : uint16_t {};
enum class level_id_t : uint32_t {};
typedef struct order {
  qty_t m_qty;
  level_id_t level_idx;
  book_id_t book_idx;
} order_t;

The two sides then are instantiated in the flagship order_book structure as follows:

class price_level
{
 public:
  price_level() {}
  price_level(sprice_t __price, level_id_t __ptr)
      : m_price(__price), m_ptr(__ptr)
  {
  }
  sprice_t m_price;
  level_id_t m_ptr;
};

using sorted_levels_t = std::vector<price_level>;
sorted_levels_t m_bids;
sorted_levels_t m_offers;

Adding a new order consists of doing an Insertion Sort into one or the other of m_bids / m_asks, depending on whether the order is to buy or sell. The ADD_ORDER method in order_book:

  void ADD_ORDER(order_t *order, sprice_t const price, qty_t const qty)
  {
    sorted_levels_t *sorted_levels = is_bid(price) ? &m_bids : &m_offers;
    // search descending for the price
    auto insertion_point = sorted_levels->end();
    bool found = false;
    while (insertion_point-- != sorted_levels->begin()) {
      price_level &curprice = *insertion_point;
      if (curprice.m_price == price) {
        order->level_idx = curprice.m_ptr;
        found = true;
        break;
      } else if (price > curprice.m_price) {
        // insertion pt will be -1 if price < all prices
        break;
      }
    }
    if (!found) {
      order->level_idx = s_levels.alloc();
      s_levels[order->level_idx].m_qty = qty_t(0);
      s_levels[order->level_idx].m_price = price;
      price_level const px(price, order->level_idx);
      ++insertion_point;
      sorted_levels->insert(insertion_point, px);
    }
    s_levels[order->level_idx].m_qty = s_levels[order->level_idx].m_qty + qty;
  }

starts by assigning sorted_levels to be a pointer to the sorted vector for the side:

sorted_levels_t *sorted_levels = is_bid(price) ? &m_bids : &m_offers;

The inline function is_bid() relies on the sign of the price:

bool constexpr is_bid(sprice_t const x) { return int32_t(x) >= 0; }

The loop scans the sorted vector of price_level structures, looking for one with the same price as the incoming order; if it finds one, it terminates with found==true. Otherwise, a new index is allocated for the price level and inserted into the vector.

The DELETE method reduces the aggregate quantity for the given price level and, if it is now zero, removes the corresponding price_level structure from the side (again using a variable sorted_levels for the std::vector of bids or asks.

  // shared between delete and execute
  void DELETE_ORDER(order_t *order)
  {
    assert(MKPRIMITIVE(s_levels[order->level_idx].m_qty) >=
           MKPRIMITIVE(order->m_qty));
    auto tmp = MKPRIMITIVE(s_levels[order->level_idx].m_qty);
    tmp -= MKPRIMITIVE(order->m_qty);
    s_levels[order->level_idx].m_qty = qty_t(tmp);
    if (qty_t(0) == s_levels[order->level_idx].m_qty) {
      // DELETE_SORTED([order->level_idx].price);
      sprice_t price = s_levels[order->level_idx].m_price;
      sorted_levels_t *sorted_levels = is_bid(price) ? &m_bids : &m_offers;
      auto it = sorted_levels->end();
      while (it-- != sorted_levels->begin()) {
        if (it->m_price == price) {
          sorted_levels->erase(it);
          break;
        }
      }
      s_levels.free(order->level_idx);
    }
  }

Let’s be clear. This code runs pretty fast on modern CPUs! It uses arrays, albeit “array of structures” as opposed to the “structure of arrays” memory layout favored for processing by SIMD instructions; but modern CPUs have hardware prefetchers that make array processing very fast.

Before we take a serious pass at optimizing this code, we’ll want to refactor it a bit, partly for housekeeping to improve its processing of ITCH files (the multi-gigabyte files from Nasdaq that summarize a day’s trading), partly to improve the emulated I/O performance, and partly to lay the groundwork for SIMD optimization. An SOA layout for the sorted vectors is de rigueur!

Roadmap

I’ll be honest - this blog series is a writeup on coding work I’ve already done. I’m waist-deep in past commits and work branches, trying to hammer the code into a form with pedagogical value.

So in rough order, here are some of the changes to the repository that I am planning:

A refactor of the buf_t class in bufferedreader.h, to use mapped file I/O instead of the POSIX open/read/close API set; in my opinion, mapped file I/O delivers a much closer approximation to getting orders off the wire;
An update to the ITCH processing code to read the initial symbols, so we can decode which stock symbol (e.g., MSFT) corresponds to the 16-bit ‘locate’ field in the order;
Duplication of the existing m_bids / m_offers vectors with SOA layout, with optional cross-checking.
I’ll likely break out the bigger ADD_ORDER / DELETE_ORDER functions into a separate implementation file instead of leaving them in headers.
One refactor that I have not completed is to flatten the source code for various SOA layouts and instruction sets that currently reside in branches. I want to flatten the code and make all the variants visible simultaneously, and I have an implementation strategy in mind. We’ll cross that bridge when we get to it.

Once these changes are in place, we’ll be able to pursue a proper SIMD optimized version of the limit order book. And as long as the cross-checking is indeed optional, we’ll be able to see how much faster it runs! (Also on my to-do list: try the optimized code on a recent AVX-512 implementation, instead of my slow AMD machine.)

For now, here is a link to my fork of Cooper’s itch-order-book repository.

Glossary

ask – order to sell. Typically accompanied by a quantity, and the lowest price the buyer would accept.

bid – order to buy. Typically accompanied by a quantity, and the highest price the buyer would be willing to pay.

central limit order book (CLOB) – the central data structure

exchange – an entity that facilitates the buying and selling of some set of commodities, often using a central limit order book (CLOB)

level – a price; in the context of a CLOB, multiple orders may come in at the same price level.

price – the amount of money a trader is willing to pay (if buying) or accept (if selling). Typically accompanies a bid or ask. Importantly, prices are not always specified using currencies – the generic term is ‘tick.’ For some exchanges, prices are 64-bit integers!

price level – see level.

side – generic term for whether an order is a bid (to buy) or ask (to sell).

spread – difference between the lowest ask and highest bid price.

symbol – a commodity being traded on the exchange, be it an equity, option, or future.

tick – the smallest unit of price.

The Parallel Programmer

Discussion about this post

Ready for more?