This afternoon I found my speed problem, and there's no easy solution. There should be, but there's not, and in retrospect, I really wish OPRA made a better FAST codec. But I'm getting ahead of myself.
I started adding a lot of timing snapshots to the code. It's annoying when there are a lot of them, but in the short run, it's really the only way to see how long a particular section of code takes. Sure, there are profilers, but they typically report on modules or functions - not arbitrary sections of code, as I needed. So I put these in and started the runs. Much to my surprise, the FAST codec is not too bad, averaging about 20 to 50 usec per message. Sure, it depends on the size of the message from OPRA, but it's not bad. That figure also includes sending the messages out to my test harness and calculating the minimum, maximum, and average latency for the messages it receives.
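To make the idea concrete, here is a minimal sketch of a timing snapshot for one section of code. It assumes C++11 and std::chrono; the names SectionStats and SectionTimer are made up for illustration and are not the code from the feed.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper: running min / max / average for one code section.
struct SectionStats {
    std::uint64_t count = 0;
    std::uint64_t total_usec = 0;
    std::uint64_t min_usec = std::numeric_limits<std::uint64_t>::max();
    std::uint64_t max_usec = 0;

    void record(std::uint64_t usec) {
        ++count;
        total_usec += usec;
        min_usec = std::min(min_usec, usec);
        max_usec = std::max(max_usec, usec);
    }

    double average_usec() const {
        return count ? static_cast<double>(total_usec) / count : 0.0;
    }
};

// Takes a snapshot when constructed and records the elapsed
// microseconds into the stats when it goes out of scope.
class SectionTimer {
public:
    explicit SectionTimer(SectionStats& stats)
        : stats_(stats), start_(std::chrono::steady_clock::now()) {}

    ~SectionTimer() {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        stats_.record(static_cast<std::uint64_t>(
            std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count()));
    }

    SectionTimer(const SectionTimer&) = delete;
    SectionTimer& operator=(const SectionTimer&) = delete;

private:
    SectionStats& stats_;
    std::chrono::steady_clock::time_point start_;
};
```

Wrapping the section of interest in a block with a local `SectionTimer t(decode_stats);` is enough to time just that section, which is the part a function-level profiler would not isolate.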
So it's not in the FAST codec. Where, then, is it?
Certainly, one of the nastier delays is the queue restart. My boost ASIO thread reads the UDP datagram off the socket, tags it with a timestamp, and places it on a FIFO queue. The queue is lockless, so there's no condition variable to wait() on, and that's a good thing, as the delay time would be horrible if there were (already been there, tried that). So when the queue is empty, my servicing thread can choose one of two options: sleep and let it queue up, or spin and burn up the CPU. I've tried both, and I can get the time to a reasonable level with the spinning, but then the CPU usage is eating a complete core - not ideal.
Sleeping isn't any better, because the "maximum" latency can then never be lower than the sleep interval, which is currently set at 1 ms. However, it's much more considerate of the CPU - averaging only about 12% of one core.
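The two choices can be sketched in a few lines. This is only an illustration under assumptions: SpscQueue is a toy single-producer/single-consumer ring standing in for the real lockless FIFO, and IdlePolicy and service() are hypothetical names. It needs C++17 for std::optional.

```cpp
#include <array>
#include <atomic>
#include <chrono>
#include <cstddef>
#include <optional>
#include <thread>

// Toy lockless ring for one producer and one consumer (holds up to N-1 items).
template <typename T, std::size_t N>
class SpscQueue {
public:
    bool push(const T& v) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % N;
        if (next == tail_.load(std::memory_order_acquire)) return false;  // full
        buf_[head] = v;
        head_.store(next, std::memory_order_release);
        return true;
    }

    std::optional<T> pop() {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return std::nullopt;  // empty
        T v = buf_[tail];
        tail_.store((tail + 1) % N, std::memory_order_release);
        return v;
    }

private:
    std::array<T, N> buf_{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};

enum class IdlePolicy { Spin, Sleep };

// Services the queue until `stop` is set and the queue is drained.
// Returns the number of items handled.
template <typename T, std::size_t N, typename Handler>
std::size_t service(SpscQueue<T, N>& q, std::atomic<bool>& stop,
                    IdlePolicy policy, Handler handle) {
    std::size_t handled = 0;
    for (;;) {
        // Read the stop flag before polling, so that "stop seen, queue empty"
        // really means nothing pushed before the stop is left behind.
        const bool stopping = stop.load(std::memory_order_acquire);
        if (auto item = q.pop()) {
            handle(*item);
            ++handled;
            continue;
        }
        if (stopping) return handled;
        if (policy == IdlePolicy::Sleep) {
            // Easy on the CPU, but an item can wait up to the full interval.
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        // IdlePolicy::Spin: loop and poll again immediately, burning a core.
    }
}
```

The only difference between the policies is what happens on an empty poll, which is exactly where the trade-off between latency and CPU usage comes from.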
But still, the delays remained. Then it hit me. I was looking at the problem all the while and just not "seeing" it. If I have one thread decoding these datagrams from both A and B channels of the exchange feed, then I'm doing twice the work I need to. This isn't bad, but it's taking twice the time. I needed to get two decoding threads active again. I say "again" because I had this in the past, and removed it in favor of a lockless arbitration on what messages to send. At this point, I need the speed and am willing to have the lock (for now), to cut the processing time in half.
But that's not all. If I think about the load from a single OPRA channel, I'm looking at upwards of 100,000 msgs/sec. If I take, on average, 40 usec per message to decode it, I can get only about 25,000 msgs/sec - not even close to the rate at which they are arriving. So this is my fundamental problem: the OPRA FAST codec is just not as quick at decoding the messages as OPRA is at sending them. Sure... with faster CPUs I might be able to get it a little faster, but in the end, it's not going to be more than maybe 100% faster. That would be about 50,000 msgs/sec, which still leaves me with a factor of 2 to make up.
So we're going to queue things up. No way around it. Given that we're queueing, the delays I've been seeing are completely explainable. Let's assume we get a burst of 10,000 messages - at 40 usec apiece, the last message in the burst is going to get decoded some 400 ms after it arrives and is timestamped. No two ways about it.
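The arithmetic behind those numbers is simple enough to write down. A small sketch, with hypothetical function names, assuming a single decoder thread and a burst that arrives all at once (C++17 for the message-less static_assert):

```cpp
#include <cstdint>

// Highest rate one decoder thread can sustain, in msgs/sec,
// given the average decode time per message in microseconds.
constexpr std::uint64_t max_rate_per_sec(std::uint64_t decode_usec) {
    return 1'000'000 / decode_usec;
}

// Time until the last message of a burst has been decoded, in microseconds,
// if the whole burst is queued behind one decoder thread.
constexpr std::uint64_t last_message_delay_usec(std::uint64_t burst_size,
                                                std::uint64_t decode_usec) {
    return burst_size * decode_usec;
}

static_assert(max_rate_per_sec(40) == 25'000);                  // 40 usec per message
static_assert(max_rate_per_sec(20) == 50'000);                  // a CPU twice as fast
static_assert(last_message_delay_usec(10'000, 40) == 400'000);  // 400 ms
```

Even the optimistic 20 usec case tops out at half of the 100,000 msgs/sec a channel can deliver, so the queue has to absorb the difference.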
Some of my competitors' feeds say they have much smaller latencies - and the way they measure it, they do - because they let the OS-level socket buffers hold the 10,000 messages while they process one completely before pulling the next off the socket. If they timestamp it, decode it, and send it, their latencies should be no more than about 50-100 usec. That's what mine would be. But that's not measuring the real latency of the server. For that, you have to keep the OS-level socket buffer empty, timestamp the datagrams within the process space, and then measure the overall latency.
There's no free lunch. You can buffer the datagrams in your process or you can have the OS buffer them for you. The only difference is with the former, you'll actually know when the datagram arrived at your NIC, and with the latter you won't.
So... what's the solution? Well... we can throw more cores at it. If we somehow decoded the datagrams into messages in parallel, then the effective decoding time could be much less. But then we'd have to worry about getting the messages out of order.
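One common way to deal with the ordering worry is to tag each datagram with a sequence number as it arrives and put the decoded messages back in order before sending. The sketch below is only an illustration of that idea, not something the feed does today: Resequencer is a hypothetical name, the sequence number is assumed to be assigned by the reader thread, and the class is not thread-safe on its own, so the caller would need a lock around submit().

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Holds decoded messages until every earlier sequence number has been seen.
template <typename Msg>
class Resequencer {
public:
    explicit Resequencer(std::uint64_t first_seq = 0) : next_(first_seq) {}

    // Accepts one decoded message and returns every message that can now
    // be released, in sequence order. Late or duplicate numbers are dropped.
    std::vector<Msg> submit(std::uint64_t seq, Msg msg) {
        std::vector<Msg> ready;
        if (seq < next_) return ready;  // already released
        pending_.emplace(seq, std::move(msg));
        for (auto it = pending_.find(next_); it != pending_.end();
             it = pending_.find(next_)) {
            ready.push_back(std::move(it->second));
            pending_.erase(it);
            ++next_;
        }
        return ready;
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::uint64_t next_;                    // next sequence number to release
    std::map<std::uint64_t, Msg> pending_;  // decoded early, waiting their turn
};
```

The cost is that one slow decode holds up everything behind it, so the parallel speedup shows up in throughput more than in worst-case latency.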
We could re-write the FAST codec - and this I've thought about, but the code isn't that bad, and OPRA is going to 48 channels soon, and that should cut the rate to a max of 50,000 msgs/sec, and give the decoder a little breathing room.
I'll work on it, but I'm confident now that there's no horrible problem in my code. And no one is doing it really any better than I am. I just need to see if I can come up with some clever way to get around these limitations.