Indiana Jones and The Legend of the Fast Ticker Plant
It's getting to be boring, but in a very bad way... Recently, I've been trying to find the speed that was once in my ticker plants but seems to have left the building when I added a few key features. Today was no exception.
Facing problems (again) at the open, I realized that I needed to be able to re-create the "flood" that occurs at the open during normal market hours. So I added a little test code to the exchange feeds to "build up" the incoming datagram queue to 50,000 messages before starting to process them. This was a nice addition, as it let me reproduce the "flood" at any point during the day.
Still... it wasn't happening. No matter what I was doing, I wasn't getting the speed I once had. I couldn't go back, so I had to look at vastly different plans. Starting with the locks in my Trie.
The basic situation in my Trie is that I can create the path to the leaf node in a lockless, thread-safe manner because the basic building block is the GCC built-in:
old = __sync_val_compare_and_swap(ptr, val1, val2);
which returns the previous value at ptr, swapping in val2 only if that previous value was val1. This works fine when you're building something, but when you're replacing something, it's another matter. Look at the following example:
old = __sync_val_compare_and_swap(&(l->data[b]), l->data[b], aMessage);
In general, there's no way to make this thread-safe! Why? Because the thread has to gather the three arguments before making the call, and it's entirely possible for another thread to change l->data[b] after it's been read for the second argument but before the CAS executes. That makes this form virtually impossible to use in a very fast, tightly threaded environment.
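The window can be simulated in a single thread. This is a sketch of the failure mode, not the Trie code itself - `slot` here stands in for l->data[b], and the "other thread" is played by a plain store between the argument read and the CAS:

```c
#include <assert.h>

/* The second argument is read before the CAS runs, so a store that
 * lands in that window turns the CAS into a silent no-op. */
static int racy_replace_demo(void) {
    int slot = 1;
    int expected = slot;   /* argument marshaling reads the "old" value... */
    slot = 2;              /* ...another thread stores into the slot here... */
    /* ...and now the CAS sees 2 != 1, so it does NOT install 42. */
    int old = __sync_val_compare_and_swap(&slot, expected, 42);
    (void)old;  /* old == 2; with the slot re-read inline as the second
                   argument, the caller has nothing to compare it against */
    return slot;           /* still 2 - the replacement silently failed */
}
```

With the expected value written inline as l->data[b], there isn't even a saved copy to compare the return value against, which is exactly the detection problem described below.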
When there's more stability in the code, or you're not doing a simple replacement, you can work out ways to make the atomic operations work. It's just this one that was killing me. So I had to put in a simple spinlock, and that made the operation thread-safe. Kinda disappointing.
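A minimal spinlock on the same GCC builtins looks like this - a sketch of the kind of lock described above, not the ticker plant's actual code:

```c
/* Spinlock built from the __sync builtins. */
typedef volatile int spinlock_t;

static void spin_lock(spinlock_t *l) {
    while (__sync_lock_test_and_set(l, 1)) /* atomically swap in 1; returns old value */
        while (*l)                         /* spin read-only while the lock is held */
            ;
}

static void spin_unlock(spinlock_t *l) {
    __sync_lock_release(l);                /* write 0 with release semantics */
}
```

The replacement then becomes a plain store between spin_lock() and spin_unlock() - correct, but now every writer pays the locking cost.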
I also noted that the thread performance of my mmap really looked bad:
# Threads | Time per Op |
 1        | 0.84 usec   |
 2        | 4.05 usec   |
 4        | 7.3 usec    |
That's a pretty nasty penalty to pay for multiple threads. The same work done in more time by more threads. It's counter-intuitive, that's for sure. But it makes the point that if there's a lot of contention, then the locking and unlocking is really the overhead, and the more you have to do it, the more it costs you.
So I made a decision to rewrite the exchange feed: remove one of the two threads on the datagram queues and instead have a single thread emptying both of them. That way, none of the downstream components have to worry about multi-thread contention, and according to the table above, I should get much faster performance.
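The single-consumer shape is simple enough to sketch. The queue type and names here are hypothetical stand-ins, not the feed's actual API:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fixed-size datagram queue - illustrative only. */
typedef struct { const char *msgs[1024]; size_t head, tail; } dgram_q;

static bool q_pop(dgram_q *q, const char **out) {
    if (q->head == q->tail) return false;   /* empty */
    *out = q->msgs[q->head++ % 1024];
    return true;
}

/* One consumer alternates between the two queues until both are empty.
 * Downstream code sees exactly one thread, so it needs no locking. */
static size_t drain_both(dgram_q *a, dgram_q *b) {
    size_t n = 0;
    const char *msg;
    bool more = true;
    while (more) {
        more = false;
        if (q_pop(a, &msg)) { n++; more = true; /* process(msg) */ }
        if (q_pop(b, &msg)) { n++; more = true; /* process(msg) */ }
    }
    return n;
}
```

Alternating between the queues keeps one busy feed from starving the other, while the single consumer is what buys back the per-message locking overhead in the table.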
We'll have to see tomorrow - everything is down now, so even my "flood gate" trick won't work.
UPDATE: The problem I was having with:
old = __sync_val_compare_and_swap(&(l->data[b]), l->data[b], aMessage);
was that there was no way to know if it failed. After all, the compiler will pull together the arguments and then make the call. But if we've been undercut in the meantime, how will I know? The spec says that the returned value is the value at that location before the operation - and that's after the marshaling of the parameters.
What I've been able to use nicely is:
old = l->data[b];
while (!__sync_bool_compare_and_swap(&(l->data[b]), old, aMessage)) {
    old = l->data[b];
}
where we know the pointer to the location (slot) and then grab the 'old' value before attempting to do the compare and swap. Then, I'll use the boolean version of the function and if it didn't work, I'll try it all again with the latest value.
This works fine, as it detects the failure of the CAS and retries the operation with a fresh value. The previous version:
old = __sync_val_compare_and_swap(&(l->data[b]), l->data[b], aMessage);
had no way to know it failed - short of doing something like the boolean version anyway. We have to know if it failed. Simple enough.
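That retry loop generalizes into a small helper for any pointer-sized slot. A sketch only - `swap_in` is a name I'm inventing, not something from the plant's code:

```c
/* Replace the value in a slot unconditionally, but through CAS so the
 * store is atomic with respect to other CAS users. Retries until it
 * wins and returns the value it displaced. */
static void *swap_in(void * volatile *slot, void *next) {
    void *old = *slot;
    while (!__sync_bool_compare_and_swap(slot, old, next))
        old = *slot;   /* undercut by another thread: reload and retry */
    return old;
}
```

By contrast, the "building" case never needs the loop at all: there the expected value is a known constant (NULL for an empty slot), so a single-shot CAS both installs the node and reports, via its return value, whether another thread got there first.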