Archive for the ‘Coding’ Category

Spending Time Tweaking for Speed

Thursday, March 10th, 2011

Speed

Today was a day where I spent about half my time tweaking for speed, and the other half recovering from the problems of making it too fast. I'm in the middle of a push to make the core of my ticker plants as fast as possible for a group that's going to start testing it next week. There have been a lot of little tweaks - each gets just a touch faster, but together they add up to a non-trivial improvement. Unfortunately, these little changes exposed threading issues as well, and those then needed to be fixed.

It's a game I've played for a while now, so today isn't all that different from a lot of other days lately - it's just "more of the same". It's all part of the improvement process.

Found My Speed Problems – No Really Easy Solution

Wednesday, March 9th, 2011

Speed

This afternoon I found my speed problem, and there's no easy solution. There should be, but there isn't, and in retrospect, I really wish OPRA made a better FAST codec. But I'm getting ahead of myself.

I started adding a lot of timing snapshots to the code. It's annoying if there's a lot of it, but in the short run, it's really the only way to see how long a given section of code takes. Sure, there are profilers, but they can only look at modules or functions - not the specific sections of code I needed to measure. So I put these in and started the runs. Much to my surprise, the FAST codec is not too bad, averaging about 20 to 50 usec per message. Sure, it depends on the size of the message from OPRA, but it's not bad. This also includes sending the messages out to my test harness and calculating the minimum, maximum, and average latency for the messages it receives.
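
To be concrete, the snapshots are nothing fancy - just bracketing the section with timestamps and logging the deltas. Something like this minimal sketch, where nowUsec() and the section names are purely illustrative and not the actual ticker plant code:

  #include <cstdio>
  #include <ctime>
  #include <stdint.h>

  // grab a monotonic timestamp in microseconds
  static inline int64_t nowUsec() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
  }

  void handleDatagram(/* const Datagram & d */) {
    int64_t t0 = nowUsec();
    // ... decode the FAST datagram into messages ...
    int64_t t1 = nowUsec();
    // ... send the messages downstream ...
    int64_t t2 = nowUsec();
    fprintf(stderr, "decode: %lld usec  send: %lld usec\n",
            (long long)(t1 - t0), (long long)(t2 - t1));
  }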

So it's not in the FAST codec. Where, then, is it?

Certainly, one of the nastier delays is the queue restart. My boost ASIO thread reads the UDP datagram off the socket, tags it with a timestamp, and places it on a FIFO queue. The queue is lockless, so there's no condition variable to wait() on, and that's a good thing, as the delay would be horrible if there were (already been there, tried that). So when the queue is empty, my servicing thread can choose one of two options: sleep and let things queue up, or spin and burn up the CPU. I've tried both, and I can get the time to a reasonable level with the spinning, but then the CPU usage eats a complete core - not ideal.

Sleeping isn't any better because it puts a floor under the worst-case latency equal to the sleep interval, which is currently set at 1 ms. However, it's much more considerate on the CPU usage - averaging only about 12% of one core.
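
For reference, the two options look roughly like this - a sketch only, where the queue type and the handler are assumed to exist (the queue's pop() returning NULL when empty), not my actual servicing thread:

  #include <ctime>
  #include <cstddef>

  // QueueT is assumed to have pop() returning a pointer, or NULL when empty;
  // 'handle' decodes the datagram and sends the resulting messages.
  template <typename QueueT, typename MsgT>
  void serviceLoop(QueueT & queue, void (*handle)(MsgT *), bool spin)
  {
    while (true) {
      MsgT * d = queue.pop();
      if (d != NULL) {
        handle(d);
        continue;
      }
      if (spin) {
        // busy-wait: best latency, but it eats an entire core
        continue;
      }
      // sleep 1 msec: only ~12% of a core, but the worst-case latency
      // can never be better than this interval
      struct timespec ts = { 0, 1000000 };
      nanosleep(&ts, NULL);
    }
  }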

But still, the delays remained. Then it hit me. I was looking at the problem all the while and just not "seeing" it. If I have one thread decoding the datagrams from both the A and B channels of the exchange feed, then I'm doing twice the work I need to. That isn't wrong, but it takes twice the time. I needed to get two decoding threads active again. I say "again" because I had this in the past and removed it in favor of a lockless arbitration of which messages to send. At this point, I need the speed and am willing to have the lock (for now) to cut the processing time in half.

But that's not all. If I think about the load from a single OPRA channel, I'm looking at upwards of 100,000 msgs/sec. If I take, on average, 40 usec per message to decode, I can handle only about 25,000 msgs/sec - not even close to the rate at which they're arriving. So this is my fundamental problem: the OPRA FAST codec is just not as quick at decoding the messages as OPRA is at sending them. Sure... with faster CPUs I might be able to get it a little faster, but in the end, it's not going to be more than maybe 100% faster. That still leaves me with a factor of 2 to make up.

So we're going to queue things up. No way around it. And given that we're queueing, the delays I've been seeing are completely explainable. Assume we get a burst of 10,000 messages: at 40 usec apiece, the last message in the burst is going to get decoded some 400 msec after it arrives and is timestamped. No two ways about it.

Some of my competitors' feeds claim much smaller latencies - and the way they measure it, they do - because they allow the OS-level socket buffers to hold the 10,000 messages while they process each one completely before pulling the next off the socket. If they timestamp it, decode it, and send it, their latencies should be no more than about 50-100 usec. That's what mine would be, measured that way. But that's not measuring the real latency of the server. For that, you have to keep the OS-level socket buffer empty, timestamp the datagrams within the process space, and then measure the overall latency.

There's no free lunch. You can buffer the datagrams in your process or you can have the OS buffer them for you. The only difference is with the former, you'll actually know when the datagram arrived at your NIC, and with the latter you won't.

So... what's the solution? Well... we can throw more cores at it. If we somehow processed the datagrams into messages in parallel, then the effective decoding time could be much less. But then we'd have to worry about the messages getting out of order.
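
One way I could see keeping the order straight is to tag each datagram with an arrival sequence number and have a single output stage release the decoded results strictly in that order. A rough sketch of that idea - the names are made up and locking is ignored, so this isn't anything from my ticker plant:

  #include <map>
  #include <vector>
  #include <stdint.h>

  struct Decoded;                                 // stand-in for a decoded batch

  class Resequencer {
  public:
    Resequencer() : mNext(0) { }

    // a worker finished decoding the datagram that arrived as number 'seq';
    // everything now deliverable, in arrival order, is appended to 'out'
    void complete(uint64_t seq, Decoded * batch, std::vector<Decoded *> & out) {
      mPending[seq] = batch;
      std::map<uint64_t, Decoded *>::iterator it;
      while ((it = mPending.find(mNext)) != mPending.end()) {
        out.push_back(it->second);
        mPending.erase(it);
        ++mNext;
      }
    }

  private:
    uint64_t                        mNext;        // next arrival number to release
    std::map<uint64_t, Decoded *>   mPending;     // finished, but still out of order
  };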

We could re-write the FAST codec - and I've thought about this - but the code isn't that bad, and OPRA is going to 48 channels soon, which should cut the rate to a max of about 50,000 msgs/sec and give the decoder a little breathing room.

I'll work on it, but I'm confident now that there's no horrible problem in my code. And no one is doing it really any better than I am. I just need to see if I can come up with some clever way to get around these limitations.

Still Trying to Find More Speed (cont.)

Wednesday, March 9th, 2011

Code Clean Up

This morning I went through a section of the code I'll be focusing on today and realized that when I originally wrote it, I had the idea that every UDP datagram from the exchange should generate a message to send - if it didn't, it was an error. As a point of fact, there are a lot of datagrams that don't generate messages: Admin messages, Control messages, all kinds of messages.

What I had been doing was creating a simple base message and placing it in the list so that an empty list signaled an error, and then filtering those simple messages out before sending. Now I don't think I was spending a lot of time checking the message type, but the new and delete of that placeholder message could not have helped. So I decided to clean that up and get rid of the error checks on an empty generated list. This shouldn't be a huge difference, but every little bit helps when you're looking for fractions of a millisecond.
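
The change is basically this - a before/after sketch with illustrative names, not the actual decoder API:

  #include <vector>

  struct Datagram;                       // assumed types and helpers, for
  struct Message { };                    // illustration only
  bool isAdminOrControl(const Datagram & d);
  void decodePayload(const Datagram & d, std::vector<Message *> & out);

  // before: always put *something* in the list, so an empty list meant "error"
  void decodeOld(const Datagram & d, std::vector<Message *> & out) {
    if (isAdminOrControl(d)) {
      out.push_back(new Message());      // placeholder - filtered (and deleted) later
      return;
    }
    decodePayload(d, out);
  }

  // after: an empty list is perfectly legal - no allocation, no filtering
  void decodeNew(const Datagram & d, std::vector<Message *> & out) {
    if (isAdminOrControl(d)) {
      return;                            // nothing to send, and that's fine
    }
    decodePayload(d, out);
  }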

The next point I found was the skipping of sequence numbers by the OPRA FAST codec. In general, OPRA sends datagrams encoded in this pseudo-FAST protocol where the first message is specified in full, but then only "deltas" are enclosed in the packet. This makes it very fast to encode multiple messages - which is, of course, why they use it. The upshot is that the starting sequence number is provided for the datagram, and I was using that to advance the "last sequence number" field in the decoder. The problem is that if there is more than one message in a datagram, this is going to make it appear that we're skipping (and therefore missing) messages from the exchange.

So I went in there and have a few interesting ideas about how we might be able to make that work. Of course, the simplest scheme is to check all the messages and take the largest number as the new "last sequence number" and use that. But I might also get lucky if OPRA only sends contiguous messages. If that's the case, I might be able to take the starting sequence number, add the count of messages, and then run with it.

I also might be able to just look at the last message in the list and use that. It's really full of a lot of possibilities to clean this up.
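
The simplest of those schemes looks something like this sketch - the names are illustrative, and the contiguous shortcut is only valid if OPRA really does guarantee contiguous messages within a datagram:

  #include <vector>
  #include <stdint.h>

  struct Msg { uint32_t seq; };          // illustrative message with its sequence number

  // scan the decoded messages and take the largest sequence number seen
  uint32_t newLastSeq(uint32_t startSeq, const std::vector<Msg> & msgs) {
    uint32_t last = startSeq;
    for (size_t i = 0; i < msgs.size(); ++i) {
      if (msgs[i].seq > last) {
        last = msgs[i].seq;
      }
    }
    return last;
    // contiguous shortcut, if the exchange guarantees it:
    //   return startSeq + msgs.size() - 1;
  }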

The next thing was to get an idea of the speed of the datagram processing. To that end, I skipped all the message processing and put in checks for the timeliness of the datagrams. I'll need to wait for the open to test this, but then I need to get busy really hammering on this code. I want to make some real progress on this today.

Google Chrome dev 11.0.696.0 is Out

Wednesday, March 9th, 2011

V8 Javascript Engine

This morning I checked and Google Chrome dev 11.0.696.0 was out. Seems in this release they updated the V8 engine to 3.2.0.1 and made a few Mac-specific fixes/tweaks. Nice of them to pay attention to the platform, really. It's a decent browser, but I still can't help but wonder what's up with Google these days. Is it doing evil, or reforming its ways?

iTunes 10.2.1 and Java 1.6.0_24 are on Software Updates

Wednesday, March 9th, 2011

Software Update

This morning I noticed that Apple has released iTunes 10.2.1 and updated Java on Mac OS X to 1.6.0_24. The update on iTunes is to allow syncing with iOS 4.3 devices - the new iPad 2 and all the iPads and iPhones that will be upgraded to 4.3 this month when it's released.

The update for Java is just to bring 1.6.0 up to par with the other platforms. It's got the security fixes, the stability improvements, etc. Nice to see, given that Apple has abandoned Java past OS X 10.7 (Lion). Who knows? Maybe Oracle will botch it so badly that Steve will relent and keep delivering Java.

Still Trying to Find More Speed

Tuesday, March 8th, 2011


I've spent a few hours today trying to find even more speed in my exchange decoder. It's the core of the ticker plants, as it's the first component in the chain: the exchange datagrams come into this component, are converted into our own message format, and are sent downstream. The problem I'm seeing is that the time to get through this step is far too long, to my mind. I'm seeing times in the hundreds of milliseconds - and that's not right.

So I'm trying to find the problem. It's not easy because I can only do this during exchange hours, but even then, it's not obvious where the problem lies. I clearly need to do more work.

I'm concerned that it's in the OPRA decoding - that would be a tragic problem. Messing with that code could be really dangerous.

The Conversion from Decimals to Integers

Monday, March 7th, 2011


This afternoon I've been working very hard to convert all the prices and decimal numbers in my ticker plant codebase from float values to uint32_t values with a getDecimalMultiplier() on each message. This came up in a meeting regarding another group's use of the codebase: they don't currently use floating point numbers, but rather an integer and a multiplier. OK... I can fix that, and so I did.

First thing was realizing that a uint32_t was sufficient, as that would give me a 10,000 multiplier and values in excess of $400,000.00 when divided out. Good enough. Then I had to go into the code, replace all the values, and add constructors and methods that either take the floating point number and convert it to the proper integer, or take the integer and use it directly.

The next thing was to look at the conversion/sampling functions on the exchange data. A lot of these take an integer mantissa and a divisor code and generate a float. What I needed to do was alter these, or make similar functional versions, so that they take the same arguments and generate the correct integer representation of the value - scaled by 10,000 (my new multiplier). Again, not really hard, but it's detail work - making sure you get all the conversions done and don't lose any precision in the process.
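
The shape of those new versions is roughly this - the meaning of the divisor code here (a power of ten) is an assumption for the sketch, not the actual OPRA table:

  #include <stdint.h>

  static const uint32_t PRICE_MULTIPLIER = 10000;

  // take the exchange's integer mantissa and divisor code and produce the
  // price scaled by 10,000, instead of a float
  uint32_t toScaledPrice(uint32_t mantissa, uint8_t divisorCode) {
    // multiply first, divide second, to hang on to the precision
    uint64_t scaled = (uint64_t)mantissa * PRICE_MULTIPLIER;
    for (uint8_t i = 0; i < divisorCode; ++i) {
      scaled /= 10;
    }
    return (uint32_t)scaled;
  }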

Next, I created getters and setters for the messages that allow the user to get the integer or the floating point value, whichever they choose. The scheme I used was to say that getPrice() gets the decimal number and getPriceAsInt() gets the scaled integer. Pretty simple, and I don't think I'm going to have a lot of confusion here, which is very important.
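
On a message, the pair ends up looking something like this sketch (the class and field names are illustrative, not the actual message classes):

  #include <stdint.h>

  class QuoteMsg {
  public:
    QuoteMsg() : mPrice(0) { }

    // everything is stored as the scaled integer, whichever form is set
    void setPrice(double aPrice)          { mPrice = (uint32_t)(aPrice * 10000.0 + 0.5); }
    void setPriceAsInt(uint32_t aPrice)   { mPrice = aPrice; }

    // ...and it can be read back either way
    double    getPrice() const             { return mPrice / 10000.0; }
    uint32_t  getPriceAsInt() const        { return mPrice; }
    uint32_t  getDecimalMultiplier() const { return 10000; }

  private:
    uint32_t  mPrice;      // the price times 10,000
  };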

Finally, with nothing but a few float values remaining - and the getters and setters still using float arguments - I decided it was better to do a complete conversion to double and get rid of any float values in the processing. It's cleaner, more easily dealt with at the chip level, and has better scale and accuracy - it's just better.

With this, I have everything stored as integers, with the multiplier available to the clients, and even decimal getters if they don't want to hassle with the conversions themselves. It's about as clean as I can imagine making it.

Finally Realizing One Size Never Fits All

Friday, March 4th, 2011


I originally designed my ticker plants to fit a specific client: the systems feeding the human traders. Eyeballs. There was no need to have everything up-to-date every millisecond - the human eye can't tell, and the systems don't update faster than a few times a second. It's just a waste. But what they do care about is that when they see the change, it's the latest data available. This means don't queue it up! You have to remember the order the ticks came in, but allow for updates to the data to replace the old with the new. This is commonly called conflation. It's a good thing for systems delivering data to humans.
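
The data structure behind that idea is simple enough - something like this sketch of a conflation queue (illustrative only, and ignoring locking):

  #include <deque>
  #include <map>
  #include <string>

  struct Tick { std::string symbol; double bid, ask, last; };

  class ConflationQueue {
  public:
    void push(const Tick & t) {
      std::map<std::string, Tick>::iterator it = mLatest.find(t.symbol);
      if (it == mLatest.end()) {
        mOrder.push_back(t.symbol);        // first time: remember the arrival order
      }
      mLatest[t.symbol] = t;               // always keep only the newest data
    }

    bool pop(Tick & t) {
      if (mOrder.empty()) {
        return false;
      }
      t = mLatest[mOrder.front()];
      mLatest.erase(mOrder.front());
      mOrder.pop_front();
      return true;
    }

  private:
    std::deque<std::string>       mOrder;    // symbols in arrival order
    std::map<std::string, Tick>   mLatest;   // newest tick per queued symbol
  };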

But automated trading systems don't want this. They want every tick. They want it all as fast as possible. It's understandable - if you can make a machine able to see everything, then you have a much better chance of seeing opportunity and therefore making a profit. While I didn't design my ticker plants for these kinds of systems, several months ago I was asked to make them work for them anyway.

I've spent a lot of time trying to speed things up so that one system is capable of meeting the needs of both kinds of clients. It's been very difficult, and in a very real sense, what I've been doing is dumbing down my system to force the clients to handle every tick. If I could have done it, it would have been fantastic. But it really isn't possible. The compromises for one client are just too far from the compromises for the other.

So I finally had another little Ah-Ha! moment: stop trying to make one size fit all. Elementary, but true, and an important realization when you're really trying to make something good for everyone.

If I made my ticker plants the way I started - for the 'slow' trading - and then had the 'fast' trading use an embedded ticker plant, then those that needed speed wouldn't even have to deal with a network hop. That's good. No serialization or deserialization. No worries about dropping packets between the server and the client. There are a lot of things that just "go away" when you decode and use the data in the same process.

I do this in my NBBO server - I have n exchange feeds all going into one NBBOEngine, which then sends the results out to the clients. I don't take in a feed, process it, and then send it out - that'd take too long. I process the feed within the process space of the consuming application.

The resources to do this aren't horrible: two threads, less than a core, and some memory. All this can be dealt with very easily by adding a box or two, if necessary. These boxes could be the "servers" you turned off because you no longer need them. In any case, it's a very solvable problem.

In the end, those that need conflation get it, and those that don't want it, get the data in-process as fast as possible. It's really the best of both worlds as it doesn't make compromises for one client or another.

Google Chrome dev 11.0.686.3 is Out

Friday, March 4th, 2011

Seems there's another quick fix for Google Chrome dev to bring it to 11.0.686.3 - this time for an autofill-related crash. Fair enough - it's nice that they're being this responsive, but if it's just a day, they could have held the original release and not messed with these two updates. Then again, maybe they had to release for political reasons.

Successful Tests with ZeroMQ – Time to Update

Thursday, March 3rd, 2011

ZeroMQ

I've had a very successful day testing ZeroMQ in my ticker plants with the updated parameters that had been hinted to me by a co-worker. It's not something I'd have thought to try, given that we're using OpenPGM - I thought the socket buffers were going to be controlled by OpenPGM, but I guess not.

In any case, I create the socket and then set the send and receive buffers to 64MB each, the peak sending rate to 500Mbps, and the recovery interval to 100 msec:

  // set the send and receive buffers to 64MB each
  static int64_t      __sndbuf = 67108864;
  static int64_t      __rcvbuf = 67108864;
  // have the maximum sending rate be 500Mbps
  static int64_t      __rate = 500000;
  // ...and the recovery interval 100 msec
  static int64_t      __recovery = 100;
 
  // create the socket...
  try {
    mSocket = new zmq::socket_t(*mContext, ZMQ_PUB);
    if (mSocket == NULL) {
      error = true;
      cLog.error("could not create the socket!");
    } else {
      // now let's set the parameters one by one...
      mSocket->setsockopt(ZMQ_SNDBUF, &__sndbuf, sizeof(__sndbuf));
      mSocket->setsockopt(ZMQ_RCVBUF, &__rcvbuf, sizeof(__rcvbuf));
      mSocket->setsockopt(ZMQ_RATE, &__rate, sizeof(__rate));
      mSocket->setsockopt(ZMQ_RECOVERY_IVL_MSEC, &__recovery,
                          sizeof(__recovery));
      // now let's connect to the right multicast group
      mSocket->connect(aURL.c_str());
    }
  } catch (zmq::error_t & e) {
    cLog.error("while creating the socket an exception was thrown!");
    if (mSocket != NULL) {
      delete mSocket;
      mSocket = NULL;
    }
  }

I've got a lot more testing to do, but these parameters really seem to help. Very nice.

The next step is to get the latest code from the GitHub repo and try it. There are a ton of new features and lots of fixes, which hopefully will clear up the last of the problems I'm seeing.