Archive for the ‘Coding’ Category

Indiana Jones and The Legend of the Fast Ticker Plant

Wednesday, January 19th, 2011

It's getting to be boring, but in a very bad way... Recently, I've been trying to find the speed that was once in my ticker plants, but seems to have left the building when I added a few key features. Today was no exception.

Facing problems (again) at the open, I realized that I needed to be able to re-create the "flood" that occurs at the open during normal market hours. So I added a little test code to the exchange feeds to "build up" the incoming datagram queue to 50,000 messages, and only then start processing. This was a nice addition, as it lets me reproduce the "flood" at any point during the day.
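
The shape of the test code is simple - something like this sketch, where the 50,000-message threshold is the real number, but the names here are just illustrative:

  // hold off the consumer until the datagram queue is primed with a
  // realistic open-of-market backlog, then let it rip - a sketch of
  // the idea, not the actual feed code
  #include <cstddef>

  static const size_t  kFloodGate = 50000;   // messages to build up
  static bool          sGateOpen = false;    // latched once we hit it

  bool readyToProcess( size_t aQueueSize )
  {
    if (!sGateOpen && (aQueueSize >= kFloodGate)) {
      sGateOpen = true;
    }
    return sGateOpen;
  }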

Still... it wasn't happening. No matter what I did, I wasn't getting the speed I once had. I couldn't go back, so I had to look at vastly different plans - starting with the locks in my Trie.

The basic problem in my Trie is that I can create the path to the leaf node in a lockless, thread-safe manner, because the basic building block is the GCC built-in:

  old = __sync_val_compare_and_swap(ptr, val1, val2);

This returns the previous value at ptr, and swaps in val2 only if that previous value was val1. This works fine when you're building something, but when you're replacing something, it's another matter. Look at the following example:

  old = __sync_val_compare_and_swap(&(l->data[b]), l->data[b], aMessage);

In general, there's no way to make this thread-safe! Why? Because the compiler will evaluate the three arguments before making the call, and it's entirely possible for another thread to change the value of l->data[b] between the read for the second argument and the atomic operation itself. This makes it virtually impossible to use this form in a very fast, tightly threaded environment.

When there's more stability in the code, or you're not doing a simple replacement, you can work out ways to make the atomic operations work. It's just this one that was killing me. So I had to put in a simple spinlock, and that made the operation thread-safe. Kinda disappointing.
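
For reference, the kind of lock I mean is nothing fancy - a minimal sketch built on the same family of GCC builtins, not the actual ticker plant code:

  // a minimal spinlock on the GCC atomic builtins
  class SpinLock {
  public:
    SpinLock() : mFlag(0) { }
    void lock() { while (__sync_lock_test_and_set(&mFlag, 1)) ; }
    void unlock() { __sync_lock_release(&mFlag); }
  private:
    volatile int  mFlag;
  };

With that, the replacement just becomes lock(), read the old value, write the new one, unlock(). Serialized, but correct.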

I also noted that the thread performance of my mmap really looked bad:

  # Threads   Speed
      1       0.84 usec
      2       4.05 usec
      4       7.3 usec

That's a pretty nasty penalty to pay for multiple threads. The same work done in more time by more threads. It's counter-intuitive, that's for sure. But it makes the point that if there's a lot of contention, then the locking and unlocking is really the overhead, and the more you have to do it, the more it costs you.

So I made a decision to rewrite the exchange feed: remove one of the two threads on the datagram queues and instead have just one thread emptying both queues. This way, none of the downstream components has to worry about multiple-thread contention, and according to the table, I should get a lot faster performance.
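
The shape of the change is simple - something like this sketch, where the queue and datagram types are illustrative stand-ins for the real feed code:

  // one consumer alternating between the two datagram queues, so the
  // downstream components never see more than one thread
  #include <cstddef>

  struct Datagram;                   // opaque exchange datagram
  struct DatagramQueue {
    Datagram *pop();                 // returns NULL when empty
  };
  void process( Datagram *aDatagram );

  void drainBoth( DatagramQueue & aQueueA, DatagramQueue & aQueueB )
  {
    while (true) {
      Datagram  *dg = NULL;
      if ((dg = aQueueA.pop()) != NULL) process(dg);
      if ((dg = aQueueB.pop()) != NULL) process(dg);
    }
  }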

We'll have to see tomorrow - everything is down now, so even my "flood gate" trick won't work.

UPDATE: The problem I was having with:

  old = __sync_val_compare_and_swap(&(l->data[b]), l->data[b], aMessage);

was that there was no way to know if it failed. After all, the compiler will pull together the arguments, and then make the call. But if we've been undercut, how will I know? The spec says that the returned value is the value before the operation - but the 'expected' value was marshaled along with the other parameters, so there's nothing stable to compare that return against.

What I've been able to use nicely is:

  old = l->data[b];
  while (!__sync_bool_compare_and_swap(&(l->data[b]), old, aMessage)) {
    // another thread undercut us between the read and the swap -
    // re-read the latest value and try again
    old = l->data[b];
  }

where we know the pointer to the location (slot) and grab the 'old' value before attempting the compare and swap. Then I use the boolean version of the function, and if it didn't work, I try it all again with the latest value.

This works fine as it detects the failure of the CAS and resets the operation. The previous version:

  old = __sync_val_compare_and_swap(&(l->data[b]), l->data[b], aMessage);

had no way to signal that it failed - the 'expected' value was re-read at call time, so there was nothing stable to compare the return value against. We have to know if it failed. Simple enough.

Long, Hard Day of Optimization – Not a Ton of Success

Wednesday, January 19th, 2011

Getting it right and then getting it fast can make for a lot of work. And while no one said it would be easy, what surprises me is how hard it is to get the necessary speed, and what changes you have to make to the code in order to get it. Such has been my day. I've been spending a lot of time this week trying to get things faster, and while I've had quite a bit of success, I haven't had the level of success I was hoping for.

I was hoping that I'd find that one problem... that one bottleneck that slowed everything down. But that was not to be. I found a lot of little places to get a little speed: pooling data containers for the UDP datagram input and processing... reducing copying... and today - trying to make the processing of the UDP datagrams themselves faster.

When I start to mess with the processing, I invariably end up causing bugs - either through changes to the logic, or because timing that was never a problem before becomes one now that I've tightened things up enough to expose it.

The problem today has been that I have done a lot of work, but it hasn't made a lot of difference in the overall scheme of things. I'm still backing up on the input UDP datagram queue. I'm doing better, but not enough to actually stop the queue from overflowing (at 100,000 elements). So it's not a real success, but I can say that I know what the problem isn't, and hopefully I'll be getting a lot closer to the real answer soon.

Still Trying to Get More Speed in the Ticker Plants

Tuesday, January 18th, 2011

This morning's tests didn't go any better... once again, I was overrun like the guys at the Alamo. Pretty depressing. I've been on this for quite a while now, and it seems that I'm making no real progress.

Once again, today was spent making a ton of little changes to the code in hopes of finding the killer bottleneck. One of the biggies was making versions of the message processing methods that take a container reference, as opposed to passing new containers back to the caller. If I'm making a lot of messages and constantly passing them back on the stack, that's got to take more time than passing in a reference to the container and just adding the messages to it.

I left the old "pass back" methods in the code, where I just create a container, pass it to the new method, and then return it to the caller. It's a clean and easy way to have one version of the method while supporting two different calling schemes.
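
Something like this, with placeholder names standing in for the real methods:

  // the new fast path takes the container by reference; the old
  // "pass back" signature is just a thin wrapper over it
  #include <vector>
  #include <cstddef>

  struct Message;
  typedef std::vector<Message *>  MsgVector;

  // new path: caller supplies the container, we append to it
  void processDatagram( const char *aBuf, size_t aLen, MsgVector & aOut );

  // old path: create a container, use the new method, pass it back
  MsgVector processDatagram( const char *aBuf, size_t aLen )
  {
    MsgVector  out;
    processDatagram(aBuf, aLen, out);
    return out;
  }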

There were a lot of other little things - like speeding up the creation of the security ID by not packing the strike mantissa using Google's VarInt encoding. It's not slow, per se, it's just not as fast as blasting down 4 bytes using memcpy(). And when you're doing this 50,000 times a second, every little bit helps.
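
A minimal sketch of what I mean, assuming a 32-bit mantissa and ignoring byte-order concerns - the names are illustrative:

  // fixed-width copy of the strike mantissa: always 4 bytes, no
  // bit-twiddling like the varint encoder does
  #include <cstring>
  #include <stdint.h>

  inline char *appendMantissa( char *aDest, uint32_t aMantissa )
  {
    memcpy(aDest, &aMantissa, sizeof(aMantissa));
    return aDest + sizeof(aMantissa);
  }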

We'll have to see tomorrow what the effects are. I'm feeling good about my chances, but I've been here before, and felt good before. I just have to keep at it. Never give up.

Hard Struggle Integrating Trie into QuickCache

Friday, January 14th, 2011

In my continuing quest for more speed in the tick processing of my Ticker Plants, I decided today to integrate the 128-bit Trie I'd created for the security ID into the QuickCache - the container of "last known" messages that each Ticker Plant keeps so clients can query them. The reason for this is that the existing QuickCache was not respecting the different exchanges on the Quote message, and it needed to. Retrofitting that behavior into the QuickCache turned out to be a major engineering feat.

The differences are really based on the different storage organizations. In the old quasi-trie, I had the messages organized first by their message type (Quote, Trade, etc.), then by their underlying name, and finally by their instrument type - stock, option, etc. It's in the 'name' part of the storage where I'd used the trie - one level for each of the first four characters in the underlying's name. This had a lot of really nice benefits - the messages were stored by family, so I could respond easily to requests of the same nature, which come up quite often, really.

When I switched to the 16-byte (128-bit) trie, where it was a complete lockless trie for the entire security ID, I lost the ability to easily find messages by their message type, instrument type, etc. because it was all encoded into the security ID. While it's possible to try to create a "mask" of values to "scan" in the trie, that makes for some horrifically bad code, and in the end, it was easier to take the hit on the performance and be more conventional in the scanning.

But what a hit. I have to scan all values in the trie to make sure I find all the values of a family... or the first alphabetical underlying... or the last. It's very straightforward, but not really efficient. Using the name-based trie made many of these much simpler. But the limitations of the name-based trie were too much to overcome, and it still had locks on the option maps, and in the end, it just wasn't going to be fast enough for the data stream at peak loads.
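
To make the "conventional" scan concrete, here's a sketch of the family lookup - the flat-slot layout and the predicate are assumptions for illustration, not the real QuickCache internals:

  // walk every populated slot and test each message - O(n) over the
  // whole cache, which is the hit I'm talking about
  #include <cstddef>

  struct Message;
  bool isInFamily( const Message *aMsg, const char *aFamily );   // hypothetical

  size_t countFamily( Message **aSlots, size_t aNumSlots, const char *aFamily )
  {
    size_t  cnt = 0;
    for (size_t i = 0; i < aNumSlots; ++i) {
      if ((aSlots[i] != NULL) && isInFamily(aSlots[i], aFamily)) {
        ++cnt;
      }
    }
    return cnt;
  }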

Still... in order to get the new 16-byte trie into the QuickCache it took a lot of work. But in the end, all the tests passed with flying colors, and I had a fix to a problem that was giving me grief. Glad that's done.

Getting Into the Details for Efficiency's Sake

Thursday, January 13th, 2011

Today I had another unsuccessful day testing my ticker plants against the open and the OPRA feeds. They were pumping out far more data than my code could process. It was working, but with all the changes required for the necessary functionality, I was falling behind, and it wasn't a great feeling.

I spent the day looking at things like making ivars protected and using them directly, as opposed to passing values to protected setter methods. A stretch, yeah, but that's what today has been about - trying to find out what's up with the speed, and why code that was working fine before is now really pretty bad.
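
The kind of change I mean looks like this - an illustrative class, not the real code:

  // a subclass writing a protected ivar directly rather than going
  // through a setter on the hot path
  class Handler {
  public:
    void setCount( long aCount ) { mCount = aCount; }
  protected:
    long  mCount;        // protected so subclasses can hit it directly
  };

  class FastHandler : public Handler {
  public:
    void bump() { ++mCount; }    // no setter call
  };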

I'll have to try these on Monday and see if they make any difference.

Finding More Speed in the Unlikeliest of Places

Wednesday, January 12th, 2011

Today I've spent a ton of time with the uint128_t trie to get it as fast as possible. My goal was to beat the old system, and eventually, I think I have. But getting there was a painful process that had a lot of battles along the way.

When I started testing the uint128_t trie, I noticed that the access was significantly slower than the old character-based trie I had been using. Sure, the old one wasn't going to really work for what I needed, but still... to have a 10x speed hit is not what I was looking to do. So I dug into it.

Turns out the new trie is amazingly fast. I mean just blindingly fast. Good for me. The problem was in generating the conflation key that I use to store the messages in the trie. Really... it was around 7 ms to generate the key and then only 0.01 ms to actually put it in the trie. Yikes!

So I had to keep digging. Turns out I was doing a lot more copying than I had to do, and by 'had to do' I mean 'designed into the API'. I was using STL containers for things when it was possible to make the methods really work on (char *) arrays, and then wrap those in versions that used the STL containers for backward compatibility. This netted me a considerable speed improvement, but I'm still not happy.
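
The pattern looks something like this - the names and the 64-byte key size are placeholders, not the actual API:

  // the real work happens on raw (char *) buffers; the old
  // std::string signature becomes a thin wrapper over it
  #include <string>
  #include <cstddef>

  // fast path: no STL containers, no copies
  size_t buildKey( const char *aSrc, size_t aLen, char *aDest );

  // backward-compatible wrapper for the STL-based callers
  std::string buildKey( const std::string & aSrc )
  {
    char    buf[64];
    size_t  n = buildKey(aSrc.data(), aSrc.size(), buf);
    return std::string(buf, n);
  }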

I also dug into the payload pools and found that I could do a little better job there as well. Again, no major change in the code, but every little bit is going to help in this guy. I've got more feeds than I have CPUs, and it's getting to be a little nasty to worry about the context switching.

In the end, I changed a lot of little things, but got the speed close to what it was. I'll have to wait and see what happens on the open tomorrow morning to be sure. I'm concerned, but cautiously optimistic.

Google Chrome dev 10.0.634.0 is Out

Wednesday, January 12th, 2011

V8 Javascript Engine

Today I noticed that Google Chrome dev 10.0.634.0 was out, and so I upgraded. The big thing that seems to be in this release is the updating of the V8 engine to 3.0.6.1. I'm still a little sore about Google's stated plans to remove H.264 support from Chrome, but that just makes it easier to see that Google really is going through a really bad spell, and maybe in a year or ten, they'll pull out of it. But I don't know... it took IBM more than a decade, and Texas Instruments never really recovered.

I sure hope the bad folks making bad decisions in Google get moved out. It's sad to see a great bunch of engineers used to do very bad things. Sad.

Everyone Back in the Pool!

Tuesday, January 11th, 2011

Today was spent working a lot on the efficiency of the ticker plants - most notably on removing the excessive use of malloc() and free(), or new and delete, and replacing it with some pooled containers for the processing of the UDP datagrams and the payloads of the ZeroMQ messages. I'm not really sure how inefficient Linux is at dealing with small allocations, say less than 1kB, but it can't be good to have 20,000 of them a second flipping through the system.

It's really pretty simple, given that I'm dealing with reasonably small container sizes: an STL list of some kind, a simple spinlock to protect it, an alloc() method that gets one from the pool (or creates one if the pool is empty), and a recycle() that puts it back in the pool if there's room, or deletes it if not. Not rocket science.
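
It's small enough to sketch in full - this is the shape of it, with the spinlock inlined; the names and sizes are illustrative, not the actual ticker plant code:

  #include <list>
  #include <cstddef>

  // a minimal spinlock on the GCC atomic builtins
  class SpinLock {
  public:
    SpinLock() : mFlag(0) { }
    void lock() { while (__sync_lock_test_and_set(&mFlag, 1)) ; }
    void unlock() { __sync_lock_release(&mFlag); }
  private:
    volatile int  mFlag;
  };

  template <class T> class Pool {
  public:
    Pool( size_t aMaxSize ) : mPool(), mLock(), mMaxSize(aMaxSize) { }
    ~Pool() {
      while (!mPool.empty()) { delete mPool.front(); mPool.pop_front(); }
    }

    // get one from the pool, or create one if the pool is empty
    T *alloc() {
      T  *item = NULL;
      mLock.lock();
      if (!mPool.empty()) { item = mPool.front(); mPool.pop_front(); }
      mLock.unlock();
      return (item != NULL ? item : new T());
    }

    // put it back in the pool if there's room, or delete it if not
    void recycle( T *aItem ) {
      bool  kept = false;
      mLock.lock();
      if (mPool.size() < mMaxSize) { mPool.push_back(aItem); kept = true; }
      mLock.unlock();
      if (!kept) delete aItem;
    }

  private:
    std::list<T *>  mPool;
    SpinLock        mLock;
    size_t          mMaxSize;
  };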

In fact, I made a special StringPool just for the ZeroMQ messages, as they are being held in simple STL std::string objects. The bonus of all this is that I don't have to have a "receive buffer" and then copy the data from the receive buffer into the container for pushing onto the stack. I can allocate one from the pool, have the boost ASIO put the data in the container, and then simply transfer the pointer onto the stack.

Far simpler. Very sweet, in fact.

Now that it's done, it has to work better even if it doesn't spec out to be any faster - avoiding all those trips through malloc() and free() has to be making a positive difference. Good enough.

Working on Ticker Plant Efficiency

Monday, January 10th, 2011

Today I noticed that I had a serious problem with the CPU usage of my ticker plants. Specifically, the OPRA channels were using 3x or 4x the CPU they had been before I made the last round of changes. Not good. Many of the ticker plants were above 100% CPU utilization, and when you have that many plants on an 8-core box, it's not good to have four or more of them each taking more than a complete CPU. Yikes.

So today was spent trying to find where the problems were. I guess I knew where they were - in the code that changed - but that code was for the conflation queue, it needed to be changed, and the resulting code is just awfully slow.

I have a feeling it's in the boost unordered_map, but I'm going to try to make it work, if possible, because the alternative is a nasty bit of code - another trie, and this time a very large one. I don't want to have to do that.

Nice Day of Cleaning Up Outstanding Issues

Friday, January 7th, 2011

Today was a day of cleaning up a lot of little nagging issues on the ticker plants. I spent time fiddling with the timing of the lockless queue polling, IRC responding, exchange mappings... all those things I needed to get to eventually, and today was the day to "clean the decks". Not bad.

Nice day, too. It's great to get that sense of completion.