Archive for the ‘Coding’ Category

Bug Hunt 101

Tuesday, November 16th, 2010


There comes a time when you're done with the major issues and start the full-up tests, and you run into little bugs that take all of about 15 minutes each to solve - and you spend hour after hour fixing these little nagging problems. It's the polish on the project. You could leave them in and just have the system restart, but that's got no class. You need to spend the 15 minutes and just fix them.

That's what I've been doing all afternoon - fixing these issues. I've solved a bunch of them, but I think there are a few more that I need to handle - certainly when it comes to communicating with the Broker. That guy needs some help handling the massive hits from all my ticker plants starting up. I'm not sure exactly how I'm going to deal with them, but I know I'm going to have to.

It's all a part of the process. At least I'm no longer writing bash scripts. That was also necessary, but wasn't nearly as fun.

Hardening the Ticker Plant for Production

Monday, November 15th, 2010

Today I'm spending a lot of time hardening the TickerPlant for production. It's not the most glamorous work... in fact, it's kind of mind-numbing, but it's as important as getting the rest of the core right. The program isn't going to be run by me - it's going to be run by operations, and that means it needs to be a lot more robust than I might normally make it for myself.

I started with a slightly better application shell - about 50 lines of C++ that reads the command-line arguments and starts things off. I added the standard 'usage' function to act as the "help" of the application, put a lot more comments in the code, and it was looking pretty nice.
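The shape of it is roughly this - a minimal sketch, where the flag names, the defaults, and the final startup call are stand-ins, not the actual TickerPlant code:

  #include <cstdlib>
  #include <iostream>
  #include <string>

  // the standard 'usage' function - the "help" of the application
  void usage() {
    std::cout << "Usage: tickerplant [-c <config>] [-v] [-h]\n"
              << "  -c <config>  name of the configuration to load\n"
              << "  -v           turn on verbose logging\n"
              << "  -h           print this message and exit\n";
  }

  int main(int argc, char *argv[]) {
    std::string config = "default";
    bool        verbose = false;
    // read the command-line arguments and set everything up
    for (int i = 1; i < argc; ++i) {
      std::string arg(argv[i]);
      if ((arg == "-c") && (i + 1 < argc)) {
        config = argv[++i];
      } else if (arg == "-v") {
        verbose = true;
      } else {
        usage();
        return (arg == "-h" ? EXIT_SUCCESS : EXIT_FAILURE);
      }
    }
    // ...start the ticker plant with this configuration...
    return EXIT_SUCCESS;
  }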

Then it was off to the shell scripts. I need to make the start/stop/restart scripts so that the user doesn't need to worry about the environment or the location - it's all just automatic. I've used scripts like these in the past and they have been really helpful, but getting them all into bash will be a little work - in the past they were a mixture of bash and C shell.

It's not glamorous, but it's necessary.

UPDATE: I still need to add the sending of the ticker plant's stats via SNMP, and the Jabber client so that we can communicate with the TickerPlant easily.

Really Hammering on the Unit Tests

Monday, November 15th, 2010


I have been looking at the memory footprint of my best bid/offer server and thinking that there might be a memory leak in there. I've done a lot of unit testing, but it's very hard to tell exactly what's happening when you hit it with 128,000 symbols and more than 30 ticks per symbol. After all, I expect the memory usage to rise as I add new instruments. But I was worried that I was still creating data structures for instruments that already existed.

I looked at the code, and traced it for a few cases, and it seemed to work, but that wasn't nearly as satisfying as I had hoped. I really wanted to know for sure. So something finally came to me: Hammer it! I mean, really hammer it.

So I added a few loops to the test app and used the existing 128,000 instruments. I ran through another dozen ticks that would very predictably affect the best bid/offer, did that for each instrument ten times, and let it run.
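In spirit, the test looked something like this - a self-contained sketch with a toy server and a small symbol list standing in for the real 128,000 instruments:

  #include <cstdio>
  #include <iostream>
  #include <map>
  #include <string>
  #include <vector>

  // a stand-in for the real server - just enough state to show the test
  struct BBOServer {
    std::map<std::string, double> best;
    void applyTick(const std::string &sym, double px) { best[sym] = px; }
    size_t size() const { return best.size(); }
  };

  int main() {
    BBOServer server;
    // build the instrument list (the real test used 128,000 symbols)
    std::vector<std::string> instruments;
    for (int i = 0; i < 1000; ++i) {
      char name[16];
      snprintf(name, sizeof(name), "SYM%04d", i);
      instruments.push_back(name);
    }
    double ticks[] = { 10.0, 10.1, 9.9, 10.2, 9.8, 10.3,
                       9.7, 10.4, 9.6, 10.5, 9.5, 10.6 };
    size_t nTicks = sizeof(ticks) / sizeof(ticks[0]);
    // the hammer: re-run the same ticks over every instrument, ten
    // times - the structure count must NOT grow after the first pass
    for (int pass = 0; pass < 10; ++pass) {
      for (size_t i = 0; i < instruments.size(); ++i) {
        for (size_t t = 0; t < nTicks; ++t) {
          server.applyTick(instruments[i], ticks[t]);
        }
      }
      std::cout << "pass " << pass << " structures: " << server.size()
                << std::endl;
    }
    return 0;
  }

If there's a leak in the instrument handling, the structure count (or the resident memory) climbs on every pass.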

What I saw was that the memory climbed during the initial phase, and then during the "re-run" phase, it was rock solid. I mean it was amazing.

I'm satisfied. The small-scale tests are good, and the massive scale is as well. There aren't any leaks I can see. What I have now is what the system requires.

BBEdit 9.6.1 is Out

Monday, November 15th, 2010

This morning I saw that BBEdit 9.6.1 was released. It's got an impressive list of fixes, and while I can't say that I've hit one of them, it's nice to see that someone is really kicking the tires, and they are fixing these issues quickly. Good enough.

Refactoring the Serialization Scheme for a Component

Friday, November 12th, 2010

I was looking at the memory usage of my NBBO server and noticed that after a little while it was hovering on the high side of 4GB. While that may not seem like much, it's far too big for what this process is doing, so I decided to dig into what that number was all about today. The end result was that I really needed to change the serialization scheme I was using for my engine component.

In order to have everything all "fit" together, I have a Broker service that holds my configuration data. I've been placing the current running state in that service for each instance, and it's been working well for me. This time, however, I realized that while it works for me, it's not really very efficient, and that's what's killing me.

What I was doing was placing all the values into lists and maps, bundling them all together, and letting that be the payload sent to the configuration service. While this works, it leads to data structures that are pretty involved to deal with, and really no more readable than the byte-level encoding I'm doing with my messages. So I decided to go back to that scheme - it's also used in the message serialization code, so it should be easier for someone to understand.
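As a sketch of the idea - the packState()/unpackState() names and the two fields are hypothetical, and byte-ordering is set aside here:

  #include <cstring>
  #include <stdint.h>
  #include <string>

  // byte-level encoding: write each field into a std::string buffer in
  // a fixed, documented order - no nested lists and maps to wade through
  void packState(std::string &buf, uint32_t count, double lastPrice) {
    buf.append((const char *)&count, sizeof(count));
    buf.append((const char *)&lastPrice, sizeof(lastPrice));
  }

  void unpackState(const std::string &buf, uint32_t &count, double &lastPrice) {
    const char *p = buf.data();
    memcpy(&count, p, sizeof(count));
    memcpy(&lastPrice, p + sizeof(count), sizeof(lastPrice));
  }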

The wrinkle in this problem was that I had several objects/structures, and each one needed to be serialized and deserialized properly - not hard, but time-consuming. Each one needed to have its own format, and then I had to write it up and test it. Slow and sure is the best way to do all this stuff.

In the end, the serialization footprint went down from 2GB to about 50MB - a significant reduction in the transient memory usage of the process. At the same time, the time to serialize and send the data went down as well - not bad, either.

Good News for Java on the Mac

Friday, November 12th, 2010


Looks like Oracle and Apple have come to terms about supporting Java on the Mac. Now it looks like we'll have a standard distribution of Java from Oracle (Sun) just like the Windows and Linux builds. That's nice.

I have no doubt that it'll be more current than the versions Apple ships - Apple tries to stay on a single Java version for a given release of the OS, but Oracle (Sun) keeps moving regardless of the underlying OS. I'm just not sure about the overall quality, but I'd guess it'll be as good as the Linux port. Which is to say, not bad.

Good enough. Glad to hear it.

The Beauty of Tuning a Solution

Thursday, November 11th, 2010


Yesterday afternoon I got the first really good cut of one of my servers done. It was really nice to see it running for more than five minutes, and a great relief. However, the stats on the delivery times of the messages weren't what I was hoping for. In fact, they were a little on the pokey side - not really all that much better than the existing code. The numbers I had for the delay from the receipt of the UDP datagram to the client actually receiving an NBBO quote message were pretty bad:

  Max        Min       Avg
  200 msec   10 msec   70 msec

But I knew that I'd be able to make it faster... I just needed to figure out where the delay was and what I needed to do to fix it.

This morning in the shower I was thinking about the problem, like you do, and realized that I was probably hitting the sleep intervals for processing data off the queues. Because I have lockless queues (for the most part), I don't have the ability to use a condition variable on a mutex to be alerted when something is there to process. The pop() methods will return a NULL when there's nothing to return, and it's up to my code to wait a bit and try again.

These waiting loops are pretty simple, but I don't want them to spin like crazy when the market is closed. So I have a variable sleep value for the loop - the longer you go without getting something from the queue, the bigger the sleep interval to make it less of a load on the system. So if things are coming fast and furious, there's no wait, and after the close, you don't crush the box with your spinning loops.
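The loop looks something like this - a minimal sketch, assuming pop() returns NULL when the queue is empty, and with made-up delay values:

  #include <stdint.h>
  #include <unistd.h>

  // wait on a lockless queue with an escalating back-off: no wait while
  // data is flowing, progressively longer sleeps as the queue sits empty
  template <typename Q>
  typename Q::value_type popWithBackoff(Q &queue) {
    static const uint32_t delays[] = { 0, 10, 50, 100, 500, 1000 };   // usec
    uint32_t level = 0;
    typename Q::value_type item;
    while ((item = queue.pop()) == NULL) {
      if (delays[level] > 0) {
        usleep(delays[level]);
      }
      if (level < 5) {
        ++level;   // came up empty again - back off a little more
      }
    }
    return item;   // each call starts fresh with no delay
  }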

But there were problems - specifically, if you waited a little bit, you might very quickly get into a 100 msec sleep. If you happened to hit that once, you're likely to have to wait another 100 msec before checking the queue again. All of a sudden, the 200 msec maximum delay was understandable. So how to fix it?

The first thing was to pull these waiting loops into the queues so they were a lot easier to control and modify. The code got a lot cleaner, and the timing loops became part of the queues themselves. Much nicer.

Then I needed to tune the delay parameters so that I was responsive, but at the same time not overly hoggish of the CPU. When I looked at the delays I had, it seemed that I was increasing them far too fast (red line). When I took many more, smaller steps, I started getting really nice results (blue line):

Plot of Sleep Loop Delays

which resulted in the much more acceptable delays of:

  Max      Min         Avg
  8 msec   <0.5 msec   3 msec

Sweet.

Get it right, and then get it fast. Works every time.

Finally Able to Nail Down the Best Bid/Offer Server

Wednesday, November 10th, 2010


Today has been a long day of working on getting a few details done and then focusing back on the Best Bid/Offer server. I have been struggling with a boost::asio crash in the code, and I started removing things until there was just about nothing left. What I found was that it was in the exchange feeds - but those have been tested in isolation pretty well. What's up?

Well, I started to look at the code again and then it hit me - stack versus heap allocation and the copy operations. With std::map, operator[] default-constructs an empty value and then you copy in the contents. If I'd been lazy in the least, I could really have messed myself up there. So I decided it wasn't worth the worry and switched to pointers and the heap. Now I wasn't copying anything, and I just had to add NULL pointer checks in a few places.
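Roughly the shape of the change, with a hypothetical Quote type standing in for the real structure:

  #include <map>
  #include <string>

  struct Quote {
    double bid, ask;   // stand-in for the real (much larger) structure
  };

  std::map<std::string, Quote>   byValue;
  std::map<std::string, Quote *> byPointer;

  void update(const std::string &sym, double bid, double ask) {
    // before: operator[] default-constructs a Quote, then we copy into it
    Quote q;
    q.bid = bid;
    q.ask = ask;
    byValue[sym] = q;

    // after: the map holds only pointers - no copying, but check for NULL
    Quote *p = byPointer[sym];
    if (p == NULL) {
      p = new Quote();
      byPointer[sym] = p;
    }
    p->bid = bid;
    p->ask = ask;
  }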

When I made those changes, everything worked really well. Wonderful!

I then took the time to add some stats to the client so I could see the delay from the datagrams hitting the initial server to the final client. A simple min, max, average should be sufficient, and it was pretty easy to build.
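Something on the order of this little accumulator - a sketch, not the actual client code:

  #include <stdint.h>

  // running min/max/average of the observed delays (in usec)
  struct DelayStats {
    uint64_t cnt, sum, min, max;
    DelayStats() : cnt(0), sum(0), min(~0ULL), max(0) { }
    void add(uint64_t usec) {
      ++cnt;
      sum += usec;
      if (usec < min) min = usec;
      if (usec > max) max = usec;
    }
    double avg() const { return (cnt == 0 ? 0.0 : (double)sum / cnt); }
  };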

Very nice work.

I'm beat.

Converting Doubles and Floats in C++ for Java

Tuesday, November 9th, 2010


My C++ ticker plant needs to communicate with Java, and unfortunately the host-order versus network-order issue comes up in these situations as well. Thankfully, for virtually all our integers (signed and unsigned) we're using Google's VarInt encoding, with ZigZag encoding for the signed integers. It's very nice as it completely removes any byte-ordering issues from the cross-language API.
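For reference, a minimal sketch of the two pieces - this follows Google's published protobuf encoding, but the function names are mine:

  #include <stdint.h>
  #include <string>

  // ZigZag: map signed to unsigned so small magnitudes stay small
  // (the >> here is an arithmetic shift that smears the sign bit)
  inline uint64_t zigzag(int64_t n) {
    return ((uint64_t)n << 1) ^ (uint64_t)(n >> 63);
  }

  // VarInt: 7 bits per byte, high bit set on all but the last byte -
  // no fixed width, so there's no byte order to argue about
  inline void appendVarInt(std::string &buf, uint64_t v) {
    while (v >= 0x80) {
      buf.push_back((char)((v & 0x7f) | 0x80));
      v >>= 7;
    }
    buf.push_back((char)v);
  }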

But doubles and floats are another issue.

In order to make them fast on the Java decoding side (the majority of my clients will be Java-based), we really need to make the encoding in the messages Java-focused. This means that when the data arrives in the ByteBuffer, a simple getFloat() or getDouble() call gets the value out correctly.

What this means is that I need to run htonl() on the float before it's packed into the data stream, and for a double I need to do that to each 32-bit half and then swap the order of the two uint32_t values to fully reverse the byte ordering. I decided to start with the simplest method: brute force. Assuming the double to convert is in the variable gold, this bit of code reverses the bytes and puts them in the std::string buffer for sending to the clients:

    // now flip its bytes
    uint8_t    in[8];
    memcpy(in, &gold, 8);
    uint8_t    out[8];
    for (uint8_t i = 0; i < 8; ++i) {
      out[i] = in[7 - i];
    }
    conduit.append((const char *)out, 8);

And getting it back out of the std::string as it's read in from the upstream system is simply:

    // reverse the bytes back into the double
    uint8_t    in[8];
    uint8_t    out[8];
    memcpy(in, conduit.data(), 8);
    for (uint8_t i = 0; i < 8; ++i) {
      out[i] = in[7 - i];
    }
    memcpy(&gold, out, 8);

Very deliberate, but not very efficient. But it worked. Once I had that in the code, I started working on coming up with better methods of doing this conversion, and timing them to see what the real difference was. To keep the playing field level, this is the test that I wrote for the first version of the conversion - out and back, 100,000 times:

  // start the timer (getTime() here returns microseconds)
  uint64_t    start = getTime();
  // do the looping a bunch of times
  for (uint32_t l = 0; l < 100000; ++l) {
    // clear out the conduit
    conduit.clear();
 
    // now flip its bytes
    uint8_t    in[8];
    memcpy(in, &gold, 8);
    uint8_t    out[8];
    for (uint8_t i = 0; i < 8; ++i) {
      out[i] = in[7 - i];
    }
    conduit.append((const char *)out, 8);
 
    // now back again
    memcpy(in, conduit.data(), 8);
    for (uint8_t i = 0; i < 8; ++i) {
      out[i] = in[7 - i];
    }
    memcpy(&gold, out, 8);
  }
  uint64_t    loopTime = getTime() - start;
  std::cout << "100000 'loop' passes took " << loopTime
            << " usec or " << 100000.0/loopTime << " trips/usec"
            << std::endl;
  std::cout << "gold: " << gold << std::endl;

And the result of this running on my MacBook Pro was:

  peabody{drbob}465: a.out
  100000 'loop' passes took 11048 usec or 9.05141 trips/usec
  gold: 254.76

I then started fiddling with the code. The first thing I wanted to try was using htonl() to see if it was any faster than simply throwing down bytes in the correct order. Turns out, it was. Nice. My next test was faster:

  // let's see if we can use the ntohl/htonl for better speed
  start = getTime();
  // do the looping a bunch of times
  for (uint32_t l = 0; l < 100000; ++l) {
    // clear out the conduit
    conduit.clear();
 
    uint32_t  in[3];
    memcpy(&in[1], &gold, 8);        // little-endian host: in[1] = low word,
    in[0] = htonl(in[2]);            //   in[2] = high word; the swapped high
    in[1] = htonl(in[1]);            //   word goes out first, low word second
    conduit.append((const char *)in, 8);
 
    uint32_t  out[3];
    memcpy(&out[1], conduit.data(), 8);  // wire words land in out[1], out[2]
    out[0] = ntohl(out[2]);              // low word back into out[0]...
    out[1] = ntohl(out[1]);              // ...high word back into out[1]
    memcpy(&gold, &out[0], 8);
  }
  loopTime = getTime() - start;
  std::cout << "100000 'ntoh/hton' passes took " << loopTime
            << " usec or " << 100000.0/loopTime << " trips/usec"
            << std::endl;
  std::cout << "gold: " << gold << std::endl;

and while I was still using memcpy(), I was now working with larger chunks of data, and the speed proved out:

  100000 'ntoh/hton' passes took 5992 usec or 16.6889 trips/usec
  gold: 254.76

But I wanted to get away from the memcpy() calls altogether. If I was a little careful with the way I did things, it worked out just fine:

  // let's see if we can use the ntohl/htonl for better speed
  start = getTime();
  // do the looping a bunch of times
  for (uint32_t l = 0; l < 100000; ++l) {
    // clear out the conduit
    conduit.clear();
 
    uint32_t  *in = (uint32_t *)(void *)(&gold);        // alias the double's words
    uint32_t  buff[] = { htonl(in[1]), htonl(in[0]) };  // swap words and bytes
    conduit.append((const char *)buff, 8);
 
    uint32_t  *out = (uint32_t *)conduit.data();
    uint32_t  *target = (uint32_t *)&gold;
    target[0] = ntohl(out[1]);       // and the mirror image on the way back
    target[1] = ntohl(out[0]);
  }
  loopTime = getTime() - start;
  std::cout << "100000 'string mix II' passes took " << loopTime
            << " usec or " << 100000.0/loopTime << " trips/usec"
            << std::endl;
  std::cout << "gold: " << gold << std::endl;

This guy clocked in nicely:

  100000 'string mix II' passes took 5291 usec or 18.9 trips/usec
  gold: 254.76

Lesson learned: get it right, then get it fast. Always pays off. I've more than doubled the speed, and it's the same interface to the outside world. Even looking at the code, there's very little I can really trim out, even if I didn't do the byte re-ordering - it's just about down to the extra time in the htonl() calls. And that's really nice.

Google Chrome dev 9.0.572.1 is Out

Tuesday, November 9th, 2010

This morning Google Chrome dev was updated to 9.0.572.1 with a new version of Flash. OK... seems fair, and as long as I don't have to have any other Flash in my system, I'm OK with that.