Finally Have Hope for Tomorrow
Today was spent, like every day in the last week, trying to see whether yesterday's work had actually improved the speed profile of the ticker plants. It started out just the same, but I decided to try a slightly different tack: build up a way to test intraday, and then hammer it a lot harder with the iterations.
What I did was make the UDP feeder code "hold back" 50,000 datagrams and then "cut them loose" all at once into the downstream components. I could then monitor the time it took to completely process the "back log" and, as a comparative measure, see what effect the change I had just made had on the running system. For instance, when I disabled the processing of the UDP datagrams, I expected the "recovery time" to be less than a second. Face it - all we're supposed to be doing then is pulling the filled datagrams off the queue and recycling them.
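The measurement idea is simple enough to sketch. This is not the actual feeder code - the `Datagram` struct and `timeBacklogDrain()` helper are mine, standing in for the real pieces - but it shows the shape of it: drain a held-back backlog and time how long it takes.

```cpp
#include <chrono>
#include <deque>

// Hypothetical stand-in for a filled UDP datagram.
struct Datagram { char payload[512]; };

// Drain the held-back backlog through the supplied processing step and
// return the wall-clock seconds it took to empty it completely.
template <typename Proc>
double timeBacklogDrain(std::deque<Datagram>& backlog, Proc process) {
    auto start = std::chrono::steady_clock::now();
    while (!backlog.empty()) {
        process(backlog.front());
        backlog.pop_front();
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

With processing disabled, `process` does next to nothing, so the drain time measures only the pull-and-recycle path - which is why anything over a second was suspicious.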
What I found was that it was taking more than 10 seconds to empty the back log, in the presence of the "normal feed". This didn't seem right at all. So I started digging into the code for the datagram pooling and found something very interesting.
The pool's alloc() and recycle() methods were using a standard STL list: if the list is empty, alloc() makes a new datagram, and if the list is full, recycle() tosses the returned one away. When I simply made alloc() create a new datagram and recycle() delete it - bypassing the pool entirely - I saw a dramatic increase in processing speed. The recovery time was less than a second! But why?
It turns out the boost spinlock guarding the pool, which is based on the old gcc compare-and-swap macros, was the real pain. I started digging into the pool with separate tests, and the difference was amazing. I didn't want all those creations and deletions in the code - that wasn't the answer - but I needed something that was still thread-safe.
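For reference, the slow pool looked roughly like this. It's a sketch, not the production class - I've used `std::atomic_flag` as a stand-in for the boost spinlock, and the names (`LockedPool`, `SpinGuard`) are illustrative - but the hot path is the same: every alloc() and every recycle() spins on the same lock before touching the list.

```cpp
#include <atomic>
#include <cstddef>
#include <list>

// Sketch of a pool with a spinlock guarding a standard STL list.
template <typename T>
class LockedPool {
public:
    explicit LockedPool(size_t maxSize) : mMaxSize(maxSize) {}
    ~LockedPool() { for (T* p : mFree) delete p; }

    T* alloc() {
        SpinGuard g(mLock);
        if (mFree.empty()) return new T();          // nothing pooled - make one
        T* p = mFree.front();
        mFree.pop_front();
        return p;
    }

    void recycle(T* p) {
        SpinGuard g(mLock);
        if (mFree.size() >= mMaxSize) { delete p; return; }  // no room - toss it
        mFree.push_back(p);
    }

private:
    // Busy-wait lock: both threads contend on this for every operation.
    struct SpinGuard {
        std::atomic_flag& f;
        explicit SpinGuard(std::atomic_flag& fl) : f(fl) {
            while (f.test_and_set(std::memory_order_acquire)) {}
        }
        ~SpinGuard() { f.clear(std::memory_order_release); }
    };
    std::atomic_flag mLock = ATOMIC_FLAG_INIT;
    std::list<T*>    mFree;
    size_t           mMaxSize;
};
```

With the reader thread and the processing thread both hammering that one lock, the spinning shows up directly in the recovery time.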
Then it hit me - my CircularFIFO, the single-producer/single-consumer lockless queue I had should work. There's only one thread that's calling alloc() - that's the boost ASIO thread that's reading the UDP datagrams. There's only one thread that's putting the datagrams back into the pool - that's the processing thread. It should work.
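A single-producer/single-consumer ring along the lines of my CircularFIFO can be sketched like this - again, illustrative code, not the production class. Because only one thread ever calls push() (the recycler) and only one thread ever calls pop() (the allocator), the head and tail indices each have a single writer and no lock is needed at all.

```cpp
#include <atomic>
#include <cstddef>

// Minimal lockless SPSC circular FIFO. N must be a power of two;
// usable capacity is N-1 (one slot is sacrificed to tell full from empty).
template <typename T, size_t N>
class SpscFifo {
public:
    // Called only by the producer thread (the recycler).
    bool push(const T& v) {
        size_t t = mTail.load(std::memory_order_relaxed);
        size_t next = (t + 1) & (N - 1);
        if (next == mHead.load(std::memory_order_acquire)) return false;  // full
        mBuf[t] = v;
        mTail.store(next, std::memory_order_release);
        return true;
    }
    // Called only by the consumer thread (the allocator).
    bool pop(T& v) {
        size_t h = mHead.load(std::memory_order_relaxed);
        if (h == mTail.load(std::memory_order_acquire)) return false;     // empty
        v = mBuf[h];
        mHead.store((h + 1) & (N - 1), std::memory_order_release);
        return true;
    }
private:
    T mBuf[N];
    std::atomic<size_t> mHead{0};   // advanced only by the consumer
    std::atomic<size_t> mTail{0};   // advanced only by the producer
};
```

Wiring it into the pool is the easy part: alloc() pops a pointer (and news one up if the ring is empty), recycle() pushes it back (and deletes it if the ring is full) - the same semantics as before, minus the lock.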
And indeed it did. The processing times were as fast as the malloc/free ones, but there was no system overhead for the malloc/free cycle, and I had finally cracked a major speed killer in my code.
I updated the StringPool as well, since it was used in the ZeroMQ transmitter in a very similar fashion. That way it's not going to be the problem there, either.
I'm actually looking forward to tomorrow's tests. For the first time in a long time, I really think I have a shot.