Archive for November, 2010

The Beauty of Tuning a Solution

Thursday, November 11th, 2010


Yesterday afternoon I got the first really good cut of one of my servers done. It was really nice to see it running for more than five minutes, and it was a great relief. However, the stats on the delivery times of the messages weren't what I was hoping for. In fact, it was a little on the pokey side of things - not really all that much better than the existing code. The numbers I had for the delay from the receipt of the UDP datagram to the client actually receiving an NBBO quote message were pretty bad:

  Max        Min       Avg
  200 msec   10 msec   70 msec

But I knew that I'd be able to make it faster... I just needed to figure out where the delay was and what I needed to do to fix it.

This morning in the shower I was thinking about the problem, like you do, and realized that I was probably hitting the sleep intervals for processing data off the queues. Because I have lockless queues (for the most part), I don't have the ability to wait on a condition variable tied to a mutex to be alerted when something is there to process. The pop() methods will return a NULL when there's nothing to return, and it's up to my code to wait a bit and try again.

These waiting loops are pretty simple, but I don't want them to spin like crazy when the market is closed. So I have a variable sleep value for the loop - the longer you go without getting something from the queue, the bigger the sleep interval to make it less of a load on the system. So if things are coming fast and furious, there's no wait, and after the close, you don't crush the box with your spinning loops.
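A minimal sketch of that variable-sleep loop - PollingQueue, pop_wait(), and the particular delay steps here are all illustrative assumptions, not the actual ticker plant code:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <stdint.h>
#include <unistd.h>

// Sketch of a queue whose pop() is non-blocking (returns NULL when
// empty), with a wait-and-retry pop that backs off on an idle queue.
template <typename T>
class PollingQueue {
public:
    // non-blocking pop - stands in for the lockless queue's pop()
    T *pop() {
        if (mData.empty()) return NULL;
        T *v = mData.front();
        mData.pop_front();
        return v;
    }

    // wait-and-retry pop: each empty poll bumps the sleep interval a
    // little, so a busy queue spins freely while an idle one backs off
    // instead of crushing the box after the close
    T *pop_wait() {
        uint32_t delay_us = 0;
        while (true) {
            T *v = pop();
            if (v != NULL) return v;
            if (delay_us > 0) usleep(delay_us);
            // grow the delay in small steps, capped well below the
            // 100 msec jumps that caused the bad latency numbers
            delay_us = (delay_us == 0 ? 50 : delay_us * 2);
            if (delay_us > 5000) delay_us = 5000;
        }
    }

    void push(T *v) { mData.push_back(v); }

private:
    std::deque<T *> mData;
};
```

The key knob is the growth schedule: something coming off the queue should reset the delay to zero, and the cap decides the worst-case added latency.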

But there were problems - specifically, if you waited a little bit, you might very quickly get into a 100 msec sleep. If you happened to hit that once, you're likely to have to wait another 100 msec before checking the queue again. All of a sudden, the 200 msec maximum delay was understandable. So how to fix it?

The first thing was to pull these waiting loops into the queues themselves so they were a lot easier to control and modify. The code got a lot cleaner, and the timing loops became part of the queues themselves. Much nicer.

Then I needed to tune the delay parameters so that I was careful to be responsive, but at the same time not overly hoggish of the CPU. When I looked at the delays I had, it seemed that I was increasing them far too fast (red line). When I took it in many smaller steps, I started getting really nice results (blue line):

Plot of Sleep Loop Delays

which resulted in the much more acceptable delays of:

  Max      Min         Avg
  8 msec   <0.5 msec   3 msec

Sweet.

Get it right, and then get it fast. Works every time.

Apple Releases Mac OS X 10.6.5 on Software Update

Thursday, November 11th, 2010


This morning I got 10.6.5 with a slew of fixes - both big and small - weighing in at a whopping 500+ MB download. It's supposed to be full of a lot of little updates and fixes, but it's also been reported that 10.6.6 is already in the hands of developers - presumably with different updates and/or features like AirPlay and AirPrint. In any case, it's been a while since I rebooted my laptop, and this was a good time to clean everything out.

Love the stuff they create.

Finally Able to Nail Down the Best Bid/Offer Server

Wednesday, November 10th, 2010


Today has been a long day of working on getting a few details done and then focusing back on the Best Bid/Offer server. I have been struggling with a Boost ASIO crash in the code, and I started removing things until there was just about nothing left. What I found was that it was in the exchange feeds - but they had been tested pretty well in isolation. What's up?

Well, I started to look at the code again and then it hit me - stack versus heap allocation and the copy operations. With STL's map, operator[] creates an empty value and then you copy in the contents. If I'd been lazy in the least, I could really have messed myself up there. So I decided it wasn't worth the worry and switched to pointers and the heap. Now I wasn't copying anything, and I just had to add NULL pointer checks in a few places.
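As a rough sketch of the difference, here's what a pointer-based map looks like - Quote, setQuote(), and getQuote() are hypothetical names for illustration, not the actual server code:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// hypothetical exchange-feed entry - stands in for the real message type
struct Quote {
    double bid;
    double ask;
    Quote(double b, double a) : bid(b), ask(a) { }
};

typedef std::map<std::string, Quote *> QuoteMap;

// Insert by pointer: the map only ever copies the pointer itself - no
// default-construct-then-assign of the value type, and no hidden copy
// constructor calls on the message.
void setQuote(QuoteMap &aMap, const std::string &aSym, Quote *aQuote) {
    QuoteMap::iterator it = aMap.find(aSym);
    if (it != aMap.end()) {
        delete it->second;       // reclaim the old heap instance
        it->second = aQuote;
    } else {
        aMap[aSym] = aQuote;
    }
}

// Lookup has to tolerate a missing symbol - this is the NULL check the
// pointer scheme forces on the callers.
Quote *getQuote(QuoteMap &aMap, const std::string &aSym) {
    QuoteMap::iterator it = aMap.find(aSym);
    return (it == aMap.end() ? NULL : it->second);
}
```

The trade is explicit ownership and NULL checks in exchange for never copying a message you didn't mean to copy.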

When I made those changes, everything worked really well. Wonderful!

I then took the time to add some stats to the client so I could see the delay from the datagrams hitting the initial server to the final client. A simple min, max, average should be sufficient, and it was pretty easy to build.
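Something like this little accumulator is all it takes - an illustrative sketch, not the actual client code:

```cpp
#include <cassert>
#include <stdint.h>

// Running min/max/average for microsecond delays - one add() per
// message, three cheap reads whenever you want the stats.
class DelayStats {
public:
    DelayStats() : mCount(0), mMin(0), mMax(0), mSum(0) { }

    void add(uint64_t aDelay) {
        if ((mCount == 0) || (aDelay < mMin)) mMin = aDelay;
        if ((mCount == 0) || (aDelay > mMax)) mMax = aDelay;
        mSum += aDelay;
        ++mCount;
    }

    uint64_t minDelay() const { return mMin; }
    uint64_t maxDelay() const { return mMax; }
    double avgDelay() const {
        return (mCount == 0 ? 0.0 : (double)mSum / (double)mCount);
    }

private:
    uint64_t mCount;
    uint64_t mMin;
    uint64_t mMax;
    uint64_t mSum;
};
```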

Very nice work.

I'm beat.

Converting Doubles and Floats in C++ for Java

Tuesday, November 9th, 2010


My C++ ticker plant needs to communicate with Java, and it's unfortunately the case that the host-order versus network-order comes up in these situations as well. Thankfully, for virtually all our integers (signed and unsigned) we're using Google's VarInt encoding with the ZigZag encoding for the signed integers. It's very nice as it completely removes any byte ordering issues from the cross-language API.

But doubles and floats are another issue.

In order to make them fast on the Java decoding side (the majority of my clients will be Java-based) we really need to make the encoding in the messages Java-focused. This means that when the data arrives in the ByteBuffer, a simple getFloat() gets the value out correctly.

What this means is that I need to run htonl() on the float before it's packed into the data stream, and for a double, I need to do that and then reverse the order of the two uint32_t values to properly reverse the byte ordering. I decided to start with the simplest method: brute force. Assuming the double to convert is in the variable gold, this bit of code reverses the bytes and puts them in the std::string buffer for sending to the clients:

    // now flip its bytes
    uint8_t    in[8];
    memcpy(in, &gold, 8);
    uint8_t    out[8];
    for (uint8_t i = 0; i < 8; ++i) {
      out[i] = in[7 - i];
    }
    conduit.append((const char *)out, 8);

And getting it back out of the std::string as it's read in from the upstream system is simply:

    uint8_t    in[8];
    uint8_t    out[8];
    memcpy(in, conduit.data(), 8);
    for (uint8_t i = 0; i < 8; ++i) {
      out[i] = in[7 - i];
    }
    memcpy(&gold, out, 8);

Very deliberate, but not very efficient. But it worked. Once I had that in the code, I started working on coming up with better methods of doing this conversion, and timing them to see what the real difference was. To keep the playing field level, this is the test that I wrote for the first version of the conversion - out and back, 100,000 times:

  // start the timer
  uint64_t    start = getTime();
  // do the looping a bunch of times
  for (uint32_t l = 0; l < 100000; ++l) {
    // clear out the conduit
    conduit.clear();
 
    // now flip its bytes
    uint8_t    in[8];
    memcpy(in, &gold, 8);
    uint8_t    out[8];
    for (uint8_t i = 0; i < 8; ++i) {
      out[i] = in[7 - i];
    }
    conduit.append((const char *)out, 8);
 
    // now back again
    memcpy(in, conduit.data(), 8);
    for (uint8_t i = 0; i < 8; ++i) {
      out[i] = in[7 - i];
    }
    memcpy(&gold, out, 8);
  }
  uint64_t    loopTime = getTime() - start;
  std::cout << "100000 'loop' passes took " << loopTime
            << " usec or " << 100000.0/loopTime << " trips/usec"
            << std::endl;
  std::cout << "gold: " << gold << std::endl;

And the result of this running on my MacBook Pro was:

  peabody{drbob}465: a.out
  100000 'loop' passes took 11048 usec or 9.05141 trips/usec
  gold: 254.76

I then started fiddling with the code. The first thing I wanted to try was using htonl() and see if it was any faster than simply throwing down bytes in the correct order. Turns out, it was. Nice. My next test was faster:

  // let's see if we can use the ntohl/htonl for better speed
  start = getTime();
  // do the looping a bunch of times
  for (uint32_t l = 0; l < 100000; ++l) {
    // clear out the conduit
    conduit.clear();
 
    uint32_t	in[3];
    // the double lands in in[1] and in[2]; writing the swapped words
    // into in[0] and in[1] means neither htonl() clobbers a word we
    // still need, and the result sits contiguously at in[0]
    memcpy(&in[1], &gold, 8);
    in[0] = htonl(in[2]);
    in[1] = htonl(in[1]);
    conduit.append((const char *)in, 8);

    uint32_t	out[3];
    // same overlap trick in reverse to rebuild the double
    memcpy(&out[1], conduit.data(), 8);
    out[0] = ntohl(out[2]);
    out[1] = ntohl(out[1]);
    memcpy(&gold, &out[0], 8);
  }
  loopTime = getTime() - start;
  std::cout << "100000 'ntoh/hton' passes took " << loopTime
            << " usec or " << 100000.0/loopTime << " trips/usec"
            << std::endl;
  std::cout << "gold: " << gold << std::endl;

and while I was still using memcpy(), I was now working with larger chunks of data, and the speed proved out:

  100000 'ntoh/hton' passes took 5992 usec or 16.6889 trips/usec
  gold: 254.76

But I wanted to get away from the memcpy() calls altogether. If I was a little careful with the way I did things, it worked out just fine:

  // let's see if we can use the ntohl/htonl for better speed
  start = getTime();
  // do the looping a bunch of times
  for (uint32_t l = 0; l < 100000; ++l) {
    // clear out the conduit
    conduit.clear();
 
    // treat the double as two 32-bit words in place - the (void *) hop
    // quiets the compiler's strict-aliasing complaints
    uint32_t  *in = (uint32_t *)(void *)(&gold);
    uint32_t  buff[] = { htonl(in[1]), htonl(in[0]) };
    conduit.append((const char *)buff, 8);

    uint32_t  *out = (uint32_t *)conduit.data();
    uint32_t  *target = (uint32_t *)&gold;
    target[0] = ntohl(out[1]);
    target[1] = ntohl(out[0]);
  }
  loopTime = getTime() - start;
  std::cout << "100000 'string mix II' passes took " << loopTime
            << " usec or " << 100000.0/loopTime << " trips/usec"
            << std::endl;
  std::cout << "gold: " << gold << std::endl;

This guy clocked in nicely:

  100000 'string mix II' passes took 5291 usec or 18.9 trips/usec
  gold: 254.76

Lesson learned: Get it right, then get it fast. Always pays off. I've more than doubled the speed and it's the same interface to the outside world. Even if we look at the code, there's very little I can really trim out - even if I didn't do the byte re-ordering. It's just about down to the extra time in the htonl() calls. And that's really nice.
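For what it's worth, one more variant that would be worth timing is a single 64-bit byte swap. This is a sketch for comparison, not code from the tests above - bswap64() and swapDouble() are made-up names (GCC also offers __builtin_bswap64 for the same job):

```cpp
#include <cassert>
#include <stdint.h>
#include <string.h>

// plain mask-and-shift 64-bit byte swap
static inline uint64_t bswap64(uint64_t x) {
    return ((x & 0x00000000000000ffULL) << 56) |
           ((x & 0x000000000000ff00ULL) << 40) |
           ((x & 0x0000000000ff0000ULL) << 24) |
           ((x & 0x00000000ff000000ULL) <<  8) |
           ((x & 0x000000ff00000000ULL) >>  8) |
           ((x & 0x0000ff0000000000ULL) >> 24) |
           ((x & 0x00ff000000000000ULL) >> 40) |
           ((x & 0xff00000000000000ULL) >> 56);
}

// Round-trip a double through the swap - memcpy() keeps it legal with
// respect to strict aliasing, and compilers optimize the copies away.
double swapDouble(double gold) {
    uint64_t bits;
    memcpy(&bits, &gold, 8);
    bits = bswap64(bits);
    memcpy(&gold, &bits, 8);
    return gold;
}
```

Applying it twice is the identity, which makes it easy to drop into the same out-and-back timing loop as the other methods.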

Acorn 2.6.2 is Out

Tuesday, November 9th, 2010

I noticed today that Acorn 2.6.2 is out with a few fixes and a new plugin API for developers. Very nice. In fact, I used it today to make a nice background image for Adium so it didn't have to look like Colloquy. Much better.


Google Chrome dev 9.0.572.1 is Out

Tuesday, November 9th, 2010

This morning Google Chrome dev was updated to 9.0.572.1 with a new version of Flash. OK... seems fair, and as long as I don't have to have any other Flash in my system, I'm OK with that.

Working with Some Especially Dense Individuals

Monday, November 8th, 2010

I'm really stunned by the things I see working in Corporate America day after day. Today I witnessed one of those events that you just have to stop and wonder when you see it happen - like a squirrel water skiing on America's Funniest Home Videos.

There's very little privacy at The Shop; we sit at long tables, each getting less than five linear feet apiece. It's meant to foster teamwork, and I can certainly understand that, but it has the consequence that no conversation is private from the people on either side of the person in the conversation.

So I'm sitting at my desk, and Ralph, who sits two seats away, was talking to Fred, who sits next to me, about the Java work Fred is doing on the Ticker Plant. Fred wants to learn Java, and this is a perfect way to do it because we have working C++ code, and all he has to do is port it. There's a lot to that, to be certain, but the basics of the code are already laid out. It just needs to be transformed into Java.

Ralph is the group leader, but I don't report into him - I'm just working with the group on this new Ticker Plant. Consequently, there have been several issues where Ralph didn't like the people I conferred with and took direction from. It's a classic problem in business where the management isn't quite sure the existing group can pull something off, so they bring in someone else and then don't have them report up the regular chain.

Feelings have been understandably hurt, and while I can sympathize with Ralph, I have been given instructions, and they are very clear. It's equally clear that Ralph loves the idea of being in charge, and this flies directly in the face of that. It's not a great situation for Ralph, but it's clear that this is what his position is now, and he can choose to make the best of it or leave.

So this morning Ralph was asking Fred when the client work would be done - he'd like to play with it and see how it works. Fair enough, but I'm sitting right there, and can hear every word. Why not ask me, Ralph? Well... the first reason is that Fred is doing the work on the Java client, and that's very reasonable. Also, Fred will do what Ralph asks, as he's the manager.

What I fail to understand... or maybe I understand it all too well... is that Ralph should understand that I'm not the enemy here. I'm the guy that's building the next system. Period. I didn't ask for this, I was assigned it. It's not a personal attack on him by me. There's no way on earth I could have known Ralph before this. It's a job, and just like trades in the NFL, sometimes the coach loses confidence in your playing ability. That's it.

But to not just stand up and be professional about it is beyond me. I've been let go before. Complaining about it isn't going to help. I've been passed over for projects; there's no shame in it - it's a management decision. They listened, they decided. I didn't have to like it, but it was their decision.

So for Ralph to be so childish about asking about the project is just sad to me. I don't want to make enemies. I honestly don't know anyone that does. But clearly I'm just out of touch with real people. They go end-around people rather than be professional. They complain. They whine.

Sad but true.

The really sad thing is that it's showing Ralph to not be a team player. This is going to make him less valuable to the management, and so he's really slitting his own throat. Maybe he wants to do this, but I'm guessing he thinks he's "winning" this "battle", when he probably isn't. At least not from the directions I'm getting.

So it's very sad.

Putting it All Together for an NBBO Ticker Plant

Friday, November 5th, 2010

Well... it's been a heck of a week, but I'm very happy that I have an NBBO Ticker Plant in the can - ready for testing on Monday. It took a long three days, and one of those days was devoted almost entirely to the persistence of the exchange data and message cache. The testing for that proved very helpful as I caught a few nasty copy-paste bugs in the data elements.

What I'm hoping is that with the NBBO Plant being fed from more than 25 exchange feeds, it'll still be able to keep up with the flow - because none of these feeds are option feeds. We'll have to wait for Monday to see, but I'm optimistic.

If not, I'll have to figure out how to speed things up, or possibly put in conflation queues between the exchange feeds and the NBBO Engine.

Proper Unit Testing is Valuable – But Painful

Friday, November 5th, 2010


One of the things I really think is important in large systems work is proper unit testing. Now I'm not talking about the "Test First" idea; I'm talking about building some code - a class in C++ or Java that contains something other than mindless data containers - and testing those methods. If it's a cache, then throwing things in, counting them, and getting them out again. That kind of testing. Stuff where you need to be certain that the code is working before you go on.

The problem with a lot of this testing is the same thing you see in testing ICs - test vectors often require additional instrumentation on the class in order to test things properly. For example, I was testing parts of my NBBO engine where I needed to save it to a persistence system, and read it back out, and I needed to make sure it was correct. This means that I have to put operator==() on everything - even the data structures, because I can't be sure I'm not missing a component in the larger picture.

Then, in order to test the larger components, I really have to add a clone() method to all the objects in the data structure so that I can be assured that once these elemental objects are equal, placing them in the larger data structure is also equivalent.

Now the code is simple:

  LeafNode * LeafNode::clone()
  {
    return new LeafNode(*this);
  }

because I always have the copy constructor on all my objects (to keep the compiler from creating one for me with behaviors I did not intend). So the code is really just a few lines - but it's this added instrumentation, needed to make efficient testing possible, that makes the process all the more tiresome.
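Put together, the instrumentation looks something like this - LeafNode's fields here are illustrative, not the real data structure:

```cpp
#include <cassert>
#include <string>

// Sketch of the test instrumentation: operator==() on the data
// structures, plus clone(), so a persisted-and-restored object can be
// checked for deep equality against the original.
class LeafNode {
public:
    LeafNode(const std::string &aName, double aValue)
        : mName(aName), mValue(aValue) { }
    // explicit copy constructor - never let the compiler guess
    LeafNode(const LeafNode &anOther)
        : mName(anOther.mName), mValue(anOther.mValue) { }

    // clone() hands back a heap copy the caller owns
    LeafNode *clone() const { return new LeafNode(*this); }

    // field-by-field equality - the instrumentation the tests need
    bool operator==(const LeafNode &anOther) const {
        return (mName == anOther.mName) && (mValue == anOther.mValue);
    }
    bool operator!=(const LeafNode &anOther) const {
        return !operator==(anOther);
    }

private:
    std::string mName;
    double      mValue;
};
```

With this in place, a round-trip test is just: clone the tree, persist and restore one copy, and assert the two compare equal node by node.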

However, it finds bugs. It benchmarks performance. It's useful. It's just not very fun. Painful, in fact, at times.

But it needs to be done.

Stepping into the Flash-Free World

Friday, November 5th, 2010

I was reading this piece by John G on Daring Fireball, and it got me thinking - it probably would be a little more honest to disable Flash completely. I mean, it's still on my box, just not in my Safari or Firefox plugins. And with Google Chrome right on my desktop, I can see Flash if I really need to.

I think I'm going to like it a lot more... I've certainly given it a lot of thought.