Archive for January, 2011

Google Chrome dev 10.0.642.2 is Out

Friday, January 21st, 2011

V8 Javascript Engine

This morning I noticed that Google Chrome dev 10.0.642.2 was released with a nice list of fixes, including a new version of the V8 JavaScript engine. I'm pleased they are making progress, but still very disappointed that they are pulling H.264 support from the <video> tag.

But hey... it's their program. Their choice.

Finally Have Hope for Tomorrow

Thursday, January 20th, 2011


Today was spent, like every day in the last week, trying to see if my work yesterday had actually improved the speed profile of the ticker plants. It started out just the same, but then I decided to try a slightly different track: build up a way to test intraday, and then hammer the system with a lot more iterations.

What I did was to make the UDP feeder code "hold back" 50,000 datagrams and then "cut them loose" all at once into the downstream components. I could then monitor the time it took to completely process the "back log" and as a comparative measure, see what effect the change I just made had on the running system. For instance, when I disabled the processing of the UDP datagrams, I expected the "recovery time" to be less than a second. Face it - all we're supposed to be doing then is pulling the filled datagrams off the queue and recycling them.

What I found was that it was taking more than 10 seconds to empty the back log, in the presence of the "normal feed". This didn't seem right at all. So I started digging into the code for the datagram pooling and found something very interesting.

The pool's alloc() and recycle() methods were using a standard STL list, and if there's nothing in the list, a new one is made, and if there's no room in the list, the recycled one is tossed away. When I simply made alloc() create a new one, and recycle() delete it, I saw a dramatic increase in the processing speed. The recovery time was less than a second! But why?

It turns out that the boost spinlock, which is based on the old GCC compare-and-swap builtins, was turning out to be a real pain for me. I started digging into the pool with separate tests, and the difference was amazing. I didn't want to have all the creations and deletions in the code - that wasn't the answer - but I needed to have something that was thread-safe.

Then it hit me - my CircularFIFO, the single-producer/single-consumer lockless queue I had should work. There's only one thread that's calling alloc() - that's the boost ASIO thread that's reading the UDP datagrams. There's only one thread that's putting the datagrams back into the pool - that's the processing thread. It should work.

And indeed it did. The processing times were as fast as the malloc/free ones, but there was no system overhead for the malloc/free cycle, and I had finally cracked a major speed killer in my code.

I updated the StringPool I was using in the ZeroMQ transmitter in a very similar fashion as well, so it was not going to be a problem there, either.

I'm actually looking forward to tomorrow's tests. For the first time in a long time, I really think I have a shot.

Indiana Jones and The Legend of the Fast Ticker Plant

Wednesday, January 19th, 2011


It's getting to be boring, but in a very bad way... Recently, I've been trying to find the speed that was once in my ticker plants, but seems to have left the building when I added a few key features. Today was no exception.

Facing problems (again) at the open, I realized that I needed to be able to re-create the "flood" that occurs at the open during normal market hours, so I added a little test code to the exchange feeds to "build up" the incoming datagram queue to 50,000 messages, and then start processing them. This was a nice addition as it allowed me to reproduce the "flood" during the day. Nice to have.

Still... it wasn't happening. No matter what I was doing, I wasn't getting the speed I once had. I couldn't go back, so I had to look at vastly different plans. Starting with the locks in my Trie.

The basic problem in my Trie is that I can create the path to the leaf node in a lockless, thread-safe manner, because the basic building block is the GCC built-in:

  old = __sync_val_compare_and_swap(ptr, val1, val2);

which returns the value that was at ptr, and swaps in val2 only if that value was val1. This works fine when you're building something, but when you're replacing something, it's another matter. Look at the following example:

  old = __sync_val_compare_and_swap(&(l->data[b]), l->data[b], aMessage);

In general, there's no way to make this thread-safe! Why? Because the thread has to pull together the three arguments first, and it's entirely possible for another thread to change the value of l->data[b] between the time the expected value (the second argument) is read and the time the swap actually executes. This makes it virtually impossible to use this in a very fast, tightly threaded environment.

When there's more stability in the code, or you're not doing a simple replacement, you can work out ways to make the atomic operations work. It's just this one that was killing me. So I had to put in a simple spinlock, and that made the operation thread-safe. Kinda disappointing.

I also noted that the thread performance of my mmap really looked bad:

  # Threads    Speed
      1        0.84 usec
      2        4.05 usec
      4        7.3  usec

That's a pretty nasty penalty to pay for multiple threads. The same work done in more time by more threads. It's counter-intuitive, that's for sure. But it makes the point that if there's a lot of contention, then the locking and unlocking is really the overhead, and the more you have to do it, the more it costs you.

So I made a decision to rewrite the exchange feed and remove one of the two threads on the datagram queues and instead have just one thread emptying both of the queues. This way, all the downstream components didn't have to worry about multiple thread contention, and according to the table, I should get a lot faster performance.

We'll have to see tomorrow - everything is down now, so even my "flood gate" trick won't work.

UPDATE: The problem I was having with:

  old = __sync_val_compare_and_swap(&(l->data[b]), l->data[b], aMessage);

was that there was no way to know if it failed. After all, the compiler will pull together the arguments, and then make the call. But if we've been undercut, how will I know? The spec says that the returned value is the value before the operation - and that's after the marshaling of the parameters.

What I've been able to use nicely is:

  old = l->data[b];
  while (!__sync_bool_compare_and_swap(&(l->data[b]), old, aMessage)) {
    old = l->data[b];
  }

where we know the pointer to the location (slot) and then grab the 'old' value before attempting to do the compare and swap. Then, I'll use the boolean version of the function and if it didn't work, I'll try it all again with the latest value.

This works fine as it detects the failure of the CAS and resets the operation. The previous version:

  old = __sync_val_compare_and_swap(&(l->data[b]), l->data[b], aMessage);

had no way to know it failed other than to do something similar to the boolean version. We have to know if it failed. Simple enough.

Long, Hard Day of Optimization – Not a Ton of Success

Wednesday, January 19th, 2011

Getting it right and then getting it fast can make for a lot of work. And while no one said it would be easy, it's surprising how hard it is to get the necessary speed, and what changes you have to make to the code in order to get it. Such has been my day. I've been spending a lot of time this week trying to get things faster, and while I've had quite a bit of success, I haven't had the level of success I was hoping for.

I was hoping that I'd find that one problem... that one bottleneck that slowed everything down. But that was not to be. I found a lot of little places to get a little speed: pooling data containers for the UDP datagram input and processing... reducing copying... and today - trying to make the processing of the UDP datagrams themselves faster.

When I start to mess with the processing, I invariably end up causing bugs - either with changes to the logic, or because some timing that was not a problem before has become one now that I've tightened things up enough to expose it.

The problem today has been that I have done a lot of work, but it hasn't made a lot of difference in the overall scheme of things. I'm still backing up on the UDP datagram queue on the input. I'm better, but it's not enough to actually stop the overloading of the queue (at 100,000 elements). So it's not a real success, but I can say that I know what it isn't, and hopefully I'll be getting a lot closer to the real answer soon.

Still Trying to Get More Speed in the Ticker Plants

Tuesday, January 18th, 2011


This morning's tests didn't go any better... once again, I was overrun like the guys at the Alamo. Pretty depressing. I've been on this for quite a while now, and it seems that I'm making no real progress.

Once again, today was spent making a ton of little changes to the code in hopes of finding the killer bottleneck. One of the biggies was making versions of the message processing methods that took container references as opposed to passing the new containers back to the caller. If I'm making a lot of messages and constantly passing them back on the stack, I've got to be taking more time than passing in a reference to the container and just adding the messages to that.

I left the old "pass back" methods in the code, where I just create a container, pass it to the new method, and then return it to the caller. It's a clean and easy way to have one version of the method while supporting two different calling schemes.

There were a lot of other little things - speeding up the creation of the security ID by not packing the strike mantissa using Google's VarInt encoding. It's not slow, per se, it's just not as fast as blasting down 4 bytes using memcpy(). And when you're doing this 50,000 times a second, every little bit helps.

We'll have to see tomorrow what the effects are. I'm feeling good about my chances, but I've been here before, and felt good before. I just have to keep at it. Never give up.

Hard Struggle Integrating Trie into QuickCache

Friday, January 14th, 2011


In my continuing quest for more speed in the tick processing of my Ticker Plants, I decided today to integrate the 128-bit Trie I'd created for the security ID into the QuickCache - that container for the messages that each Ticker Plant has to enable clients to obtain the "last known" messages. The reason for this is that the existing QuickCache was not respecting the different exchanges on the Quote message, and it needed to. Retrofitting that behavior into the QuickCache turned out to be a major engineering feat.

The differences are really based on the different storage organizations. In the old quasi-trie, I had the messages organized first by their message type (Quote, Trade, etc.), then by their underlying name, and finally by their instrument type - stock, option, etc. It's in the 'name' part of the storage where I'd used the trie - one level for each of the first four characters in the underlying's name. This had a lot of really nice benefits - the messages were stored by family, so I could respond easily to requests of the same nature, which come up quite often, really.

When I switched to the 16-byte (128-bit) trie, a complete lockless trie over the entire security ID, I lost the ability to easily find messages by their message type, instrument type, etc., because it was all encoded into the security ID. While it's possible to try to create a "mask" of values to "scan" for in the trie, that makes for some horrifically bad code, and in the end, it was easier to take the hit on the performance and be more conventional in the scanning.

But what a hit. I have to scan all values in the trie to make sure I find all the values of a family... or the first alphabetical underlying... or last. It's very straightforward, but not really efficient. Using the name-based trie made many of these much simpler. But the limitations of the name-based trie were too much to overcome, and it still had locks on the option maps, and in the end, it just wasn't going to be fast enough for the data stream at peak loads.

Still... getting the new 16-byte trie into the QuickCache took a lot of work. But in the end, all the tests passed with flying colors, and I had a fix to a problem that was giving me grief. Glad that's done.

Getting Into the Details for Efficiency Sake

Thursday, January 13th, 2011


Today I had another unsuccessful day testing my ticker plants against the open and the OPRA feeds. They were pumping out far more data than my code could process. It was working, but with all the changes required for the necessary functionality, I was falling behind, and it wasn't a great feeling.

I spent the day looking at things like making ivars protected and using them directly as opposed to passing values to protected setter methods. A stretch, yeah, but that's what today has been about - trying to find out what's up with the speed and why it was working fine before, and now it's really pretty bad.

I'll have to try these on Monday and see if they make any difference.

Finding More Speed in the Unlikeliest of Places

Wednesday, January 12th, 2011


Today I've spent a ton of time with the uint128_t trie to get it as fast as possible. My goal was to beat the old system, and eventually, I think I have. But getting there was a painful process that had a lot of battles along the way.

When I started testing the uint128_t trie, I noticed that the access was significantly slower than the old character-based trie I had been using. Sure, the old one wasn't going to really work for what I needed, but still... to have a 10x speed hit is not what I was looking to do. So I dug into it.

Turns out the new trie is amazingly fast. I mean just blindingly fast. Good for me. The problem was in generating the conflation key that I use in the trie to store the messages. Really... it was around 7 ms to generate the key and then only 0.01 ms to actually put it in the trie. Yikes!

So I had to keep digging. Turns out I was doing a lot more copying than I had to do, and by 'had to do' I mean 'designed into the API'. I was using STL containers for things when it was possible to make the methods really work on (char *) arrays and then wrap those into the ones that used the STL containers for backward compatibility. This netted me a considerable speed improvement, but I'm still not happy.

I also dug into the payload pools and found that I could do a little better job there as well. Again, no major change in the code, but every little bit is going to help in this guy. I've got more feeds than I have CPUs, and it's getting to be a little nasty to worry about the context switching.

In the end, I changed a lot of little things, but got the speed close to what it was. I'll have to wait and see what happens on the open tomorrow morning to be sure. I'm concerned, but cautiously optimistic.

Google Chrome dev 10.0.634.0 is Out

Wednesday, January 12th, 2011

V8 Javascript Engine

Today I noticed that Google Chrome dev 10.0.634.0 was out and so I upgraded. The big thing that seems to be in this release is the updating of the V8 engine to 3.0.6.1. I'm still a little sore about the plans Google has stated about removing H.264 from Chrome, but that just makes it easier to realize that Google really is going through a really bad spell, and maybe in a year or ten, they'll pull out of it. But I don't know... it took IBM more than a decade, and Texas Instruments never really recovered.

I sure hope the bad folks making bad decisions in Google get moved out. It's sad to see a great bunch of engineers used to do very bad things. Sad.

Google Chrome Dropping H.264

Wednesday, January 12th, 2011


When I read this this morning I was stunned. Really. Stunned. I can't imagine what they are thinking. They include Flash in Chrome, and it's Adobe's IP, but they won't continue to support H.264 because it's what... supported by Apple? What gives? It's already there... it's already working... but they are going to take it out. I could understand if they took out all plugins or codecs. But they aren't. Just H.264.

I find it hard to believe that the engineers decided to do this. I also have a hard time believing that Google is running short on funds to support the work. And it's not like Chrome is not mature - it's polished when compared to many, and certainly in the mature category now.

Just stunning. Really.

Bad move.

[1/13] UPDATE: I could not have said it better than this post on Slashdot. Amazing. Painful.