Handling Fast Market Data Efficiently – Hint: Go Lockless
Today I was doing some testing on my latest data codec in my new ticker plant, and I ran across some performance issues that I didn't really like. Specifically, the processing of the data from the UDP feed was not nearly fast enough for me. As time went on, we were queueing up more and more data. Not good. So let's see what we had in the mix that we needed to change...
First, the buffer I was using assumed that messages from the exchange might not fit completely within a single UDP datagram. Handling that case was a nice "luxury", but it just doesn't happen on this feed, and it was costing us time in the processing. It's better to assume that each UDP datagram is complete and queue the datagrams up as complete units to process, rather than have the buffer "squish" them together into one byte stream and then tokenize them by the ending data tags.
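To make that concrete, here's a rough sketch of what "queue complete datagrams" looks like. It's not my production code - the Datagram type and readOne() are made-up names for illustration, and the socket is assumed to be already bound:

    // Treat each UDP datagram as one complete, self-contained work unit.
    #include <cstdint>
    #include <vector>
    #include <sys/types.h>
    #include <sys/socket.h>

    struct Datagram {
        std::vector<uint8_t> bytes;   // the payload exactly as it arrived
    };

    // Read one datagram and hand it off whole - no reassembly, no tokenizing
    // on end-of-message tags. 'fd' is an already-bound UDP socket.
    bool readOne(int fd, Datagram &out) {
        uint8_t buf[65536];                    // max UDP payload
        ssize_t n = ::recv(fd, buf, sizeof(buf), 0);
        if (n <= 0) return false;              // nothing there (or an error)
        out.bytes.assign(buf, buf + n);        // one datagram == one work unit
        return true;
    }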
That was really quite helpful, because at the same time I decided it was a bad idea to use the mutex/condition variable I had set up to let the one producing thread and one consuming thread access the data. Instead, I grabbed a very simple lockless circular FIFO queue off the web and cleaned it up to use for this UDP datagram buffering. It's easy enough to use: one thread moves the head, and the other moves the tail. As long as the head and tail aren't stale in either CPU's cache, it works without locking. Simple enough.
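For reference, here's a sketch of that kind of single-producer/single-consumer ring buffer. It's not the exact code I grabbed - I've written it with std::atomic to keep the head and tail from going stale between cores - but the shape is the same: one index moved only by the producer, one only by the consumer, nothing else shared:

    #include <atomic>
    #include <cstddef>

    template <typename T, size_t N>        // N must be a power of two
    class SpscQueue {
    public:
        // Called only by the producer thread.
        bool push(const T &value) {
            size_t t = tail_.load(std::memory_order_relaxed);
            size_t next = (t + 1) & (N - 1);
            if (next == head_.load(std::memory_order_acquire))
                return false;              // full - caller decides what to do
            buf_[t] = value;
            tail_.store(next, std::memory_order_release);
            return true;
        }

        // Called only by the consumer thread.
        bool pop(T &value) {
            size_t h = head_.load(std::memory_order_relaxed);
            if (h == tail_.load(std::memory_order_acquire))
                return false;              // empty
            value = buf_[h];
            head_.store((h + 1) & (N - 1), std::memory_order_release);
            return true;
        }

    private:
        T buf_[N];
        std::atomic<size_t> head_{0};      // moved only by the consumer
        std::atomic<size_t> tail_{0};      // moved only by the producer
    };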
But once I got rid of the locking/waiting, I had to handle the case where the queue is empty and we need to try again. My solution was to start simple: a fixed 250 msec wait. When I started testing this, I saw significant pulses in the processing because a lot of datagrams arrived while we were sleeping. So I got a little smarter.
I added an expanding delay: starting small and building, so that we pick up new data quickly after a short gap, but when the market close comes and the feed goes quiet, we only do a few checks before it settles down to just a few times a second. That's very reasonable.
I did more tests and finally ended up with a variable scheme: no delay at all for the first several empty checks, and then the delay starts stretching out. Very nice.
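Roughly, the consumer loop ends up shaped like this. The miss counts and the cap here are illustrative, not the tuned values from my plant, and the queue/handler names are just placeholders:

    #include <algorithm>
    #include <atomic>
    #include <chrono>
    #include <thread>

    // 'queue' and 'process' stand in for the real ring buffer and the real
    // message handler; the backoff shape is the point, not the names.
    template <typename Queue, typename Item, typename Fn>
    void drain(Queue &queue, Item &item, Fn process, std::atomic<bool> &running) {
        int misses = 0;
        while (running.load(std::memory_order_relaxed)) {
            if (queue.pop(item)) {
                process(item);                // got a datagram - handle it now
                misses = 0;                   // ...and reset the backoff
            } else if (++misses <= 100) {
                continue;                     // first ~100 misses: no delay at all
            } else {
                // after that, stretch the sleep out, capped at 250 msec so the
                // idle loop settles at a few checks per second
                int ms = std::min(250, 1 << std::min(misses - 100, 8));
                std::this_thread::sleep_for(std::chrono::milliseconds(ms));
            }
        }
    }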
In the end, I had something that emptied far faster than the UDP data source could fill it, and that's critical for a ticker plant. There's plenty to slow things down later in the processing, so it's essential to start out as fast as possible.