The Beauty of Tuning a Solution


Yesterday afternoon I finished the first really good cut of one of my servers. It was really nice to see it running for more than five minutes, and it was a great relief. However, the stats on the delivery times of the messages weren't what I was hoping for. In fact, they were a little on the pokey side of things - not really all that much better than the existing code. The numbers I had for the delay from the receipt of the UDP datagram to the client actually receiving an NBBO quote message were pretty bad:

Max        Min        Avg
200 msec   10 msec    70 msec

But I knew that I'd be able to make it faster... I just needed to figure out where the delay was and what I needed to do to fix it.

This morning in the shower I was thinking about the problem, like you do, and realized that I was probably hitting the sleep intervals for processing data off the queues. Because I have lockless queues (for the most part), I can't wait on a condition variable tied to a mutex to be alerted when something is there to process. The pop() methods will return a NULL when there's nothing to return, and it's up to my code to wait a bit and try again.

These waiting loops are pretty simple, but I don't want them to spin like crazy when the market is closed. So I have a variable sleep value for the loop - the longer you go without getting something from the queue, the bigger the sleep interval to make it less of a load on the system. So if things are coming fast and furious, there's no wait, and after the close, you don't crush the box with your spinning loops.
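For illustration, a loop of that kind could look roughly like this in C++. It's a minimal sketch, not the server's actual code: `backoff()`, `pollLoop()`, the `running` flag, and the doubling schedule are all assumptions made up for the example. The only things taken from the description above are a `pop()` that returns NULL on an empty queue and a sleep that grows the longer the queue stays empty.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>

// Hypothetical schedule: 1 ms after the first miss, doubling with every
// consecutive miss, capped at 100 ms.
inline std::chrono::milliseconds backoff(std::size_t misses) {
    using std::chrono::milliseconds;
    if (misses == 0) return milliseconds(0);
    const std::size_t shift = std::min<std::size_t>(misses - 1, 7);
    const milliseconds nap = std::min(milliseconds(1LL << shift), milliseconds(100));
    assert(nap.count() >= 1 && nap.count() <= 100);  // always within the cap
    return nap;
}

// Assumes Queue::pop() returns a pointer, or nullptr when the queue is empty.
template <typename Queue, typename Handler>
void pollLoop(Queue& queue, Handler handle, const std::atomic<bool>& running) {
    std::size_t misses = 0;
    while (running.load()) {
        if (auto* item = queue.pop()) {
            handle(item);
            misses = 0;  // traffic is flowing: go back to no waiting at all
            continue;
        }
        ++misses;        // quiet queue: sleep a little longer each time
        std::this_thread::sleep_for(backoff(misses));
    }
}
```

With a schedule like this, a busy queue is polled with no sleep at all, while an idle one settles into one wake-up per cap interval.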

But there were problems - specifically, if you waited a little bit, you might very quickly get into a 100 msec sleep. If you happened to hit that once, you were likely to have to wait another 100 msec before checking the queue again. All of a sudden, the 200 msec maximum delay was understandable. So how to fix it?

The first thing was to pull these waiting loops into the queues themselves so they were a lot easier to control and modify. The code got a lot cleaner, and the timing loops became part of the queues themselves. Much nicer.
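One way to picture that refactoring is a thin wrapper that owns the wait loop, so callers just ask for the next item. This is only a sketch under assumptions: `PollingQueue`, `popWait()`, and the 250 usec step and 5 msec cap are hypothetical names and numbers, and the inner queue is assumed to offer nothing more than a non-blocking `pop()` that returns NULL when empty.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <thread>

// Wraps any queue whose pop() returns T* (nullptr when empty) and keeps the
// polling policy in one place instead of in every consumer.
template <typename T, typename Inner>
class PollingQueue {
public:
    explicit PollingQueue(Inner& inner) : inner_(inner) {}

    // Non-blocking pass-through to the underlying queue.
    T* pop() { return inner_.pop(); }

    // Poll until an item shows up or `timeout` has passed (nullptr on timeout).
    T* popWait(std::chrono::milliseconds timeout) {
        using namespace std::chrono;
        assert(timeout.count() >= 0);
        const microseconds step(250);   // hypothetical ramp step
        const microseconds cap(5000);   // hypothetical longest sleep
        const auto deadline = steady_clock::now() + timeout;
        microseconds nap(0);
        for (;;) {
            if (T* item = inner_.pop()) return item;
            if (steady_clock::now() >= deadline) return nullptr;
            if (nap.count() == 0) {
                std::this_thread::yield();        // first miss: just yield
            } else {
                std::this_thread::sleep_for(nap); // later misses: sleep
            }
            nap = std::min(nap + step, cap);      // grow gently, up to the cap
        }
    }

private:
    Inner& inner_;
};
```

Because the step and cap live in one method, changing the ramp is a one-line edit rather than a hunt through every consumer thread.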

Then I needed to tune the delay parameters so that I was being careful to be responsive, but at the same time, not overly hoggish of the CPU. When I looked at the delays I had, it seemed that I was increasing them far too fast (red line). When I took it in many smaller steps, I started getting really nice results (blue line):

Plot of Sleep Loop Delays

which resulted in the much more acceptable delays of:

Max        Min          Avg
8 msec     <0.5 msec    3 msec
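To see why the shape of the ramp matters so much, here's a small sketch comparing two made-up schedules. The real curves only appear in the plot, so `steepUs()` (doubling from 1 msec to a 100 msec cap) and `gentleUs()` (250 usec steps to a 5 msec cap) are assumptions, not the tuned values. `sleepInEffectUs()` reports the sleep the consumer is in the middle of once the queue has been quiet for a given time, which is roughly the extra delay a message arriving at the start of that sleep would see.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical steep ramp: 1 ms, doubling per consecutive miss, 100 ms cap.
inline std::int64_t steepUs(int misses) {
    if (misses <= 0) return 0;
    const int shift = std::min(misses - 1, 7);
    return std::min<std::int64_t>(1000LL << shift, 100000);
}

// Hypothetical gentle ramp: 250 us more per consecutive miss, 5 ms cap.
inline std::int64_t gentleUs(int misses) {
    if (misses <= 0) return 0;
    return std::min<std::int64_t>(250LL * misses, 5000);
}

// Walk the schedule until the accumulated sleep time passes `quietUs`, and
// return the length of the sleep that covers that moment.
template <typename Schedule>
std::int64_t sleepInEffectUs(Schedule schedule, std::int64_t quietUs) {
    assert(quietUs >= 0);
    std::int64_t elapsed = 0;
    std::int64_t current = 0;
    int misses = 0;
    while (elapsed <= quietUs) {
        current = schedule(++misses);
        elapsed += current;
    }
    return current;
}
```

After 150 msec of quiet, `sleepInEffectUs(steepUs, 150000)` returns 100000 while `sleepInEffectUs(gentleUs, 150000)` returns 5000, so with these example numbers the worst-case wake-up delay is twenty times smaller on the gentle ramp. The price is more frequent wake-ups when the market is closed.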

Sweet.

Get it right, and then get it fast. Works every time.