Archive for the ‘Coding’ Category

The Problem with Partially Implemented Features

Friday, June 3rd, 2011

bug.gif

This afternoon, while Developer Days was chugging along nicely, I decided to try to fix a bug I was seeing in the request/response service. I'd connect and subscribe to a family, and then get the first set of calculation results, but after that, nothing. Initially, I wasn't getting the complete family the first time due to a problem with the isDirty() status not being set properly on cloned families. I fixed that guy pretty easily, but this was a little odd: during the day, when we should have plenty of ticks, we weren't seeing updates. Made no sense. So I started digging into the code.

It took me about 10 mins to realize that isDirty() was not being set on any updates. It was there, but it wasn't really working. Not even close. This is one of the really nasty problems I see with some developers: they think that putting in the skeleton of a feature is enough to stop on, and they go on to something else without effectively bookmarking the work so they're sure to get back to it.
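What was missing is trivial to sketch - all names here are hypothetical stand-ins, since I can't post the real code, but the idea is just that every mutator marks the instance dirty and clone() carries the flag along:

```cpp
#include <cassert>

// Hypothetical sketch of the dirty-flag bookkeeping that was missing:
// every mutator marks the instance dirty, and clone() carries the flag.
class Family {
public:
    Family() : mValue(0.0), mDirty(false) {}

    void setValue(double v) {
        mValue = v;
        mDirty = true;        // the part that was skeleton-only in our code
    }

    bool isDirty() const { return mDirty; }
    void clearDirty()    { mDirty = false; }

    Family clone() const {
        Family copy(*this);   // copies mDirty too, so a dirty clone stays dirty
        return copy;
    }
private:
    double mValue;
    bool   mDirty;
};
```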

This was a pretty easy feature to do. I'm guessing it was no more than 15 mins to complete. But it was a critical 15 mins, and to have skipped this for several weeks is really pretty bad. You have to be able to bookmark your work - either notes in the code, like "TODO:" marks, or maybe a notepad, or something, so that you can remember what's been done, what hasn't, and what needs to be done. It's kinda frustrating, but at the same time, I have to remember who wrote the code, and cut them some slack.

Just got to get it all out of their hands to get the quality up.

Threading Issues with Boost ASIO async_write()

Friday, June 3rd, 2011

Boost C++ Libraries

This morning it's Developer Day, and I'm fighting a problem that really had me stumped for a while. When two independent requests hit my service close enough in time, they can cause a real problem. I receive the requests just fine, and I'm able to generate the results for each, but when I try to send the generated responses back to the client, it gets both the service and the client into an unstable state.

Not good.

When I try to reproduce the problem using my C++ client, I can't seem to get the requests close enough in time to trigger it. Or maybe I have a better client-side socket driver. Hard to say. But someone put together a test case in Python, and that does the trick. The client doesn't appear to ever get the response, but I know I sent it. Not a fun place to be when people are supposed to be hitting your stuff for fun and enjoyment.

I got the idea that it was in the boost ASIO sending because the generation was right, the C++ client was right, but the Python wasn't. The original code looked like this:

  void MMDServiceHandler::asyncSendMessage( const std::string & aString )
  {
    bool        error = false;
 
    // first, make sure we have something to do
    if (!error && ((mSocket == NULL) || !mSocket->is_open())) {
      error = true;
      cLog.warn("[asyncSendMessage] no socket to send on! Bad news");
    }
 
    /**
     * Now let's put the preamble and message into a buffer so that the
     * MMD knows what to do with it...
     */
    if (!error) {
      // let's make a few pointers for the payload
      boost::shared_ptr<Preamble>     hdr(new Preamble(aString));
      boost::shared_ptr<std::string>  body(new std::string(aString));
      // now let's make a package of the header and buffer
      std::vector<boost::asio::const_buffer>  pkg;
      pkg.push_back(boost::asio::buffer(hdr->body(), hdr->size()));
      pkg.push_back(boost::asio::buffer(*body));
      // finally, send it out asynchronously
      boost::asio::async_write(*mSocket, pkg,
                  boost::bind(&MMDServiceHandler::asyncMessageSent, this,
                          hdr, body,
                          boost::asio::placeholders::error,
                          boost::asio::placeholders::bytes_transferred));
    }
  }

My theory this morning was that by placing two items in the boost package, I was allowing boost to think they could be sent separately. With two such writes done in quick succession, it's possible the four components got interleaved, which messed up the receiver - and then the sender, as it sat waiting for something back from the client that was never going to come.
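For what it's worth, the usual asio remedy for overlapping writes - not the route I ended up taking - is to allow only one outstanding async_write() per socket and queue the rest, so two quick sends can never interleave. Here's a rough sketch with the asio calls mocked out so it stands alone; completeOne() plays the part of the write-completion handler, and 'wire' stands in for the socket:

```cpp
#include <deque>
#include <string>
#include <vector>

// Sketch of a per-connection send queue: only the front message is ever
// "on the wire", so writes can't interleave. 'mWire' stands in for the
// real socket and just records what async_write would have sent.
class SendQueue {
public:
    explicit SendQueue(std::vector<std::string> & wire) : mWire(wire) {}

    // queue a message; kick off a write only if none is in flight
    void send(const std::string & msg) {
        mPending.push_back(msg);
        if (mPending.size() == 1)
            startWrite();
    }

    // the write-completion handler: drop the sent message, start the next
    void completeOne() {
        mPending.pop_front();
        if (!mPending.empty())
            startWrite();
    }
private:
    void startWrite() { mWire.push_back(mPending.front()); }

    std::vector<std::string> & mWire;
    std::deque<std::string>    mPending;
};
```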

What I decided to go with was a change to a single buffer:

  void MMDServiceHandler::asyncSendMessage( const std::string & aString )
  {
    bool        error = false;
 
    // first, make sure we have something to do
    if (!error && ((mSocket == NULL) || !mSocket->is_open())) {
      error = true;
      cLog.warn("[asyncSendMessage] no socket to send on! Bad news");
    }
 
    /**
     * Now let's put the preamble and message into a buffer so that the
     * MMD knows what to do with it...
     */
    if (!error) {
      // let's make a header and a pointer for the body
      Preamble    hdr(aString);
      boost::shared_ptr<std::vector<char> >  body(
                  new std::vector<char>(hdr.size() + aString.size()));
      // now let's put everything into one contiguous piece...
      memcpy(&((*body)[0]), hdr.body(), hdr.size());
      memcpy(&((*body)[0]) + hdr.size(), aString.data(), aString.size());
      // finally, send it out asynchronously
      boost::asio::async_write(*mSocket, boost::asio::buffer(*body),
                  boost::bind(&MMDServiceHandler::asyncMessageSent, this,
                          body,
                          boost::asio::placeholders::error,
                          boost::asio::placeholders::bytes_transferred));
    }
  }

By placing it all into one std::vector, using just two memcpy() calls, I was thinking I was even making it a little faster. The data was still in a stable shared pointer that was going to hold a reference until the asyncMessageSent() method was called, and then it'd be dropped. So I was thinking I had this bad boy licked.

Yeah... well... not so much.

So the next thing I went to was switching from using async_write() to just using write(). Thankfully, I already had a syncSendMessage() method for just such an occasion:

  bool MMDServiceHandler::syncSendMessage( const std::string & aString )
  {
    bool        error = false;
 
    // first, make sure we have something to do
    if (!error && ((mSocket == NULL) || !mSocket->is_open())) {
      error = true;
      cLog.warn("[syncSendMessage] no socket to send on! Bad news");
    }
 
    /**
     * Now let's put the preamble and message into a buffer so that the
     * MMD knows what to do with it...
     */
    if (!error) {
      // let's make a header and a pointer for the body
      Preamble    hdr(aString);
      std::vector<char>  body(hdr.size() + aString.size());
      // now let's put everything into one contiguous piece...
      memcpy(&body[0], hdr.body(), hdr.size());
      memcpy(&body[0] + hdr.size(), aString.data(), aString.size());
      // finally, send it out synchronously
      boost::system::error_code  err;
      boost::asio::write(*mSocket, boost::asio::buffer(body),
              boost::asio::transfer_all(), err);
      if (err) {
        error = true;
        cLog.warn("[syncSendMessage] unable to send the message!");
      }
    }
 
    return !error;
  }

When I switched to the syncSendMessage(), everything worked out. I didn't like that I had to synchronously write out the data, so I put the entire send method in a simple thread that does the send and then dies. Sure, I could have put it in a pool, but this is good enough for today, and we'll wait and see what the real limitations are. But for now, it's been a really hard, but good, morning and I'm ready to just sit back and help the other guys with their Developer Days projects.
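The thread wrapper itself is nothing fancy - something along these lines, though I'm using std::thread here as a stand-in for the boost threading we actually run, and the recording "send" is a mock so the sketch compiles on its own:

```cpp
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Mock of the blocking send - the real one does the boost::asio write.
// It just records the payload so the sketch runs on its own.
std::mutex               gSentMutex;
std::vector<std::string> gSent;

void syncSendMessage(const std::string & aString)
{
    std::lock_guard<std::mutex> lock(gSentMutex);
    gSent.push_back(aString);
}

// Fire-and-forget: spawn a thread that does the send and then dies.
// The payload is captured by value so it outlives the caller's string.
void asyncSendMessage(const std::string & aString)
{
    std::thread([aString]() { syncSendMessage(aString); }).detach();
}
```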

Cleaning Up More Code – I Need a Bigger Bucket

Thursday, June 2nd, 2011

Code Clean Up

Today I'm slogging through the clean-up of the last bit of work done by others on this project of mine, and while it's not as bad as some of the other code I've had to clean up, there are still a considerable number of problems in the way a lot of things were done. For example, our internal representation of a date is a single unsigned integer of the form YYYYMMDD. It's simple, easy to compare, and very easy for a human to read. However, the calculation model we need to use has a simple structure with three parts: year, month, day. Each is a plain integer, and while I can see why they might want to do that, they don't have any of the supporting code to make it easy to do comparisons, etc. None of this really matters because it's just what we have to work with. So be it.
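To make the "easy to compare" part concrete: because year, month, and day occupy fixed decimal fields, ordinary integer comparison orders the dates correctly. A quick sketch of the packing:

```cpp
#include <cstdint>

// Pack a date into the single-integer YYYYMMDD form we use internally.
// Fixed decimal fields mean plain integer comparison is date comparison.
uint32_t packDate(uint32_t aYear, uint32_t aMonth, uint32_t aDay)
{
    return aYear * 10000 + aMonth * 100 + aDay;
}
```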

But the problem I saw in the code was that there was a method to convert from our representation to theirs, and it was written as:

  void Calculator::getDate( uint32_t aDate, uint32_t & aYear,
                            uint32_t & aMonth, uint32_t & aDay)
  {
    aYear = aDate / 10000;
    aMonth = (aDate - aYear * 10000)/100;
    aDay = (aDate - aYear * 10000 - aMonth * 100);
  }

and in the code this method was called as:

  getDate(expir, model.year, model.month, model.day);

and there was at least one real problem with this, and one really bad design choice. First, let's look at the problem. The date structure is defined as:

  struct FEDate {
    int day;
    int month;
    int year;
  };

and on closer inspection, we're mapping these signed integers to the unsigned integer references of the method. Well... the code really looks like:

  getDate(expir, (uint32_t &)model.year, (uint32_t &)model.month,
          (uint32_t &)model.day);

and it's not a really great idea to say the references are to unsigned values when they are going to be used as signed ones. The horrible design issue is that there's never a case where we need this as separate components - we're much better off dealing with the conversion to the struct directly:

  void Calculator::fillDate( uint32_t aDate, FEDate & aTarget )
  {
    aTarget.year = aDate / 10000;
    aTarget.month = (aDate - aTarget.year * 10000)/100;
    aTarget.day = (aDate - aTarget.year * 10000 - aTarget.month * 100);
  }

Now we don't have to worry about the conversion issue at all. Plus, it's clear that we're converting from our representation to theirs. Simple. Clean. Direct. Much better.

I'm going to have to go through a lot of the code like this, finding little things that are still causing memory leaks - and there are at least a few here. I need to get them all cleaned out as quickly as possible as we have the Developer Day tomorrow, and I'd really like to have this all done today.

SubEthaEdit 3.5.4 is Out

Thursday, June 2nd, 2011

subethaedit.jpg

This morning I decided to check and see if SubEthaEdit had been updated and sure enough, 3.5.4 was out with what appears to be just a single change: Mac OS X Lion fixes. Shucks. There were a few things that I had hoped they'd put into a new release, but I have to face facts: it's exactly what they want it to be. No big changes planned. I'm going to have to accept that this is "it".

I'm not really shocked. It is a great editor. It's just not going to beat BBEdit and MacVim. It has "jump scrolling", and that's too abrupt for me. Its method locator is nice, but it's not as nice as BBEdit's. Its find uses the nice Safari-style "jump balloon", but BBEdit's is close and Vim isn't bad for a cross-platform tool.

I read an article recently where someone went through all this on the Mac and said (basically): "No free lunch. BBEdit, MacVim, Emacs - learn one." I laughed a lot at the article, and myself, because he's right on the money. If you want a really powerful editor then these are your choices. Live with it. I guess I was hoping for the SubEthaEdit upstart, but that's not going to happen. It is what it is, and they are happy with it. Great. Back to BBEdit and MacVim.

Google Chrome dev 13.0.782.1 is Out

Thursday, June 2nd, 2011

Google Chrome

This morning I noticed that Google Chrome dev 13.0.782.1 was out with some nice incremental features. I'm not a big printer, but I'm sure there are a lot of folks who are, and that's important to them. Certainly, the commenters are excited about the printing. Also, the 2D acceleration looks nice - can't hurt, that's for sure. It's nice and fast, which is what you'd expect from a pack of engineers, and as long as it keeps moving in that direction, that's great.

[3/3] UPDATE: that didn't take long. Already 13.0.782.4 is out. Must have been a pretty serious problem to issue a new release so soon. Well... glad they responded so quickly.

Finding Allocation Errors with TCMalloc – Ain’t Easy

Wednesday, June 1st, 2011

google-labs-logo.gif

Today I learned a very valuable lesson: TCMalloc really doesn't have bugs, but it sure looks like it does, and stack traces can be very deceptive at times. I had been getting a series of segmentation faults in some code, and the backtrace was always in about the same state, saying something like this:

  #0  0x0002aac607b388a in tcmalloc::ThreadCache::ReleaseToCentralCache
        (tcmalloc::ThreadCache::FreeList*, unsigned long, int) ()
        from /usr/lib/libtcmalloc.so
  #1  0x0002aac607b3cf7 in tcmalloc::ThreadCache::Scavenge() ()
        from /usr/lib/libtcmalloc.so
  ...

The lesson learned, after googling this backtrace, is that TCMalloc doesn't have bugs - it's just not able to properly trap double-frees or illegal frees, so when it finds its structures corrupted, it bails out and appears to have a bug, when the problem is really in the 'hosting' code. Meaning: user error.

So I started looking at what was leading up to this in the backtrace. I worked on this for the better part of a day, and reformulated the code several times. In the end, I was totally unable to correct the problem. Very frustrating.

Then it hit me - maybe it wasn't in the calling stack? After all, this same code was working quite well for months in other apps. This was the 'Eureka moment' for this guy... it wasn't the call stack at all - it was somewhere else in the code. So I started grepping for all the 'new' and 'delete' instances in the code. Sure enough... I found a few problems.

It's so easy for junior guys to miss these things, and they did. I only look for them because I've been bitten so badly (like this) so many times - it's the first thing I do when building a class with heap support: make the allocations and deallocations match. No two ways about it.
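The discipline itself is simple to sketch: give the owning class one place that allocates and one that frees, forbid copies that would share the pointer, and (if you like) instrument the pair so a test can prove they balance. Names here are made up for the example:

```cpp
#include <cstddef>

// Toy instrumentation: count live heap blocks so a mismatched
// new[]/delete[] shows up in a test instead of deep inside the allocator.
static int gLiveBlocks = 0;

class Buffer {
public:
    explicit Buffer(std::size_t n) : mData(new char[n]) { ++gLiveBlocks; }
    ~Buffer() {
        delete [] mData;    // the one and only free for this allocation
        mData = 0;
        --gLiveBlocks;
    }
private:
    // no copies - two Buffers must never delete the same block
    Buffer(const Buffer &);
    Buffer & operator=(const Buffer &);

    char * mData;
};
```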

I'm hoping that this fixes these problems, and it's looking good so far. Just awfully tricky when the bug is nowhere in the stack. Wild.

Rewrote the NBBO Engine – Better, Faster, More General (cont.)

Wednesday, June 1st, 2011

Today I had to do a little hammering on my new NBBO engine because I found a few problems in what it was doing. I wasn't properly filtering out bad data from the NBBO calculation, and that needed to change. Plus, I wanted to add a simple method to force a recalc of the NBBO because I was seeing bad data get "stuck" in the engine, and wanted some way to clear it out. Finally, I'd introduced a bug in the forced recalc that I had to track down - a silly cut-n-paste bug, but easily found.

But the nice change was realizing that the exchange data for the instruments could skip the security key - a textual representation of the security - and just use the security ID - a 128-bit number equivalent to the key. The difference was that I could skip the conversion from the key to the ID, and that alone saved an amazing 33%. I was able to get my times down to about 22 μsec - just amazing.

Consequently, the engine takes 33% less CPU and that directly translates to more feeds on the box. That's always a good thing. Very nice fix/change.

Rewrote the NBBO Engine – Better, Faster, More General

Tuesday, May 31st, 2011

Today I took on the task of re-writing the national best bid/offer (NBBO) engine in my codebase. This is significant because the engine was very fast, lockless (for stocks), and a critical part of the data feeds. It's just that important to get the data right. But it had a limitation that was a killer - it was based on the idea that each instrument belonged to a "family" rooted in a stock or index. That wasn't bad for stocks, and the options weren't too bad either, but we're starting to get instruments like spreads that aren't based on one family, and therein lies the problem.

I solved part of the problem by making these spreads have a 128-bit ID value like all the other instruments, so that they might fit in a nice 16-way trie. I just then needed to fix the NBBO engine to use this trie as opposed to the family-based (name) trie.
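A 16-way trie on a 128-bit ID just consumes the key a nibble at a time. Roughly like this sketch, where the ID is carried as two uint64_t halves - the real engine's node layout is fancier, and I'm using std::unique_ptr here purely for brevity:

```cpp
#include <cstdint>
#include <memory>

// Sketch of a 16-way trie keyed on a 128-bit security ID, carried as two
// uint64_t halves. Each level consumes one nibble of the key, most
// significant first, so a lookup is a fixed 32 steps.
class IDTrie {
public:
    void insert(uint64_t hi, uint64_t lo, const void * value) {
        Node * n = &mRoot;
        for (int i = 31; i >= 0; --i) {
            unsigned nib = nibbleAt(hi, lo, i);
            if (!n->kids[nib])
                n->kids[nib].reset(new Node);
            n = n->kids[nib].get();
        }
        n->value = value;
    }

    const void * find(uint64_t hi, uint64_t lo) const {
        const Node * n = &mRoot;
        for (int i = 31; i >= 0 && n; --i)
            n = n->kids[nibbleAt(hi, lo, i)].get();
        return n ? n->value : nullptr;
    }

private:
    struct Node {
        std::unique_ptr<Node> kids[16];
        const void * value = nullptr;   // e.g. the instrument's NBBO record
    };

    static unsigned nibbleAt(uint64_t hi, uint64_t lo, int i) {
        uint64_t half = (i >= 16) ? hi : lo;   // nibble 31 = top of 'hi'
        return (half >> (4 * (i & 15))) & 0xF;
    }

    Node mRoot;
};
```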

There were a lot of little details to pay attention to, but for the most part it was a smooth transition. The data is now completely lockless if we want, but for now there's a spinlock at the instrument level to make sure that we don't update the NBBO data improperly. In practice this is unlikely ever to matter, as a single symbol comes from one feed, and one feed alone, so it's a single thread. Still, someone could build a multi-threaded, single-feed setup, and in that case two threads could touch the same instrument. So I'll leave it in for now. Just to be safe.
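The instrument-level guard is just a spinlock - something like this sketch using std::atomic_flag (the production one predates C++11, but the idea is identical), with a small two-thread demo so it runs standalone:

```cpp
#include <atomic>
#include <thread>

// Minimal spinlock: busy-wait on an atomic flag. Cheap when the lock is
// almost never contended - the normal case here, since a symbol arrives
// on exactly one feed thread.
class Spinlock {
public:
    void lock() {
        while (mFlag.test_and_set(std::memory_order_acquire))
            ;   // spin - essentially free in the uncontended case
    }
    void unlock() { mFlag.clear(std::memory_order_release); }
private:
    std::atomic_flag mFlag = ATOMIC_FLAG_INIT;
};

// Two threads hammering one counter through the lock - the count comes
// out exact, which it wouldn't without the guard.
long guardedCount(int perThread) {
    Spinlock lock;
    long count = 0;
    auto worker = [&]() {
        for (int i = 0; i < perThread; ++i) {
            lock.lock();
            ++count;
            lock.unlock();
        }
    };
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    return count;
}
```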

The upshot of this change is that we're now ready to handle any instrument that fits into the 128-bit ID scheme. Also, because we got rid of the map for options within a family, it's faster for options than before. Sweet.

Plenty of testing to do, but it's a great start.

Finally Got Something Going!

Friday, May 27th, 2011

trophy.jpg

This afternoon we finally got something going on the request/response greek engine! I had to add in an IRC interface to the calculations to make it easier to see what's happening and try values - but in the end, that's a great addition to the system. It only took about 15 mins to add, but the effect was amazing. We could now look at the messages coming in, force a calculation on a stock family - or an individual option, and then view the results. Very nice.

The request/response system is working as well. There were a lot of issues about what the values needed to be (see a previous post), and I had to change the SQL to extract the volatility values from the database because no one really checked the values. I know... I should have been more on top of this, but when I ask a grown-up, professional programmer: "Did you write a test case? Is it right?" and I hear "Yes", I tend to believe them.

Not any more. I've become a skeptic. They made me a skeptic.

But in the end, we finally got something going. What a relief!

The Difference Between “Good” and “Great” – Massive

Thursday, May 26th, 2011

Today has been a lot of system-level integration work, and it's getting a little frustrating. For example, today I learned that I needed to provide the model with the latest trade prices for stocks as well as the quotes and associated data. Why I didn't know about this weeks ago, I have no idea, but I guess that's the joy of working with just-in-time memories. Specifically, it's because I didn't play a more involved role in the beginning of the project. I trusted my teammates, who have worked with this library before, to know all the inputs and let me know well in advance of needing them.

Nope.

So I'm finding out today that I need to add another 20-plus feeds to the system, and that means dealing with a different level of abstraction on the stock feeds - which is nice - and that in turn means a good bit of refactored code to make the abstraction work well. It's nice to see it done, but it's not nice that it was a surprise.

Hopefully tomorrow will bring fewer surprises.