Archive for the ‘Coding’ Category

Finding Allocation Errors with TCMalloc – Ain’t Easy

Wednesday, June 1st, 2011


Today I learned a very valuable lesson: TCMalloc really doesn't have bugs, but it sure looks like it does, and stack traces can be very deceptive at times. I have been getting a series of segmentation faults in some code, and the backtrace always looked about the same, something like this:

  #0  0x0002aac607b388a in tcmalloc::ThreadCache::ReleaseToCentralCache
        (tcmalloc::ThreadCache::FreeList*, unsigned long, int) ()
        from /usr/lib/libtcmalloc.so
  #1  0x0002aac607b3cf7 in tcmalloc::ThreadCache::Scavenge() ()
        from /usr/lib/libtcmalloc.so
  ...

The lesson, learned after googling this backtrace, is that TCMalloc doesn't have bugs - it's just too stable. However, it doesn't properly trap double-frees or illegal frees, so when it finds that its structures are corrupted, it bails out and appears to have a bug, when the problem was really in the 'hosting' code. Meaning: user error.
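To make that concrete, here's a contrived little sketch - not the actual app code - of the kinds of frees that TCMalloc won't flag at the call site; it just quietly corrupts its free lists and blows up later:

  // contrived example of the user errors that surface later inside TCMalloc
  #include <cstddef>

  class Buffer {
  public:
      Buffer(size_t n) : _data(new char[n]), _size(n) {}
      ~Buffer() { delete _data; }      // BUG: allocated with new[], freed with
                                       // plain delete - a mismatched free
  private:
      char    *_data;
      size_t  _size;
  };

  int main() {
      Buffer *b = new Buffer(128);
      delete b;
      delete b;                        // BUG: double-free - glibc might abort right
                                       // here, TCMalloc often just corrupts its
                                       // free lists and crashes much later
      return 0;
  }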

So I started looking at what was leading up to this in the backtrace. I worked on this for the better part of a day, and reformulated the code several times. In the end, I was totally unable to correct the problem. Very frustrating.

Then it hit me - maybe the problem wasn't in the calling stack at all? After all, this same code had been working quite well for months in other apps. This was the 'Eureka moment' for this guy... it wasn't the call stack - it was somewhere else in the code entirely. So I started grepping for all the 'new' and 'delete' instances in the code. Sure enough... I found a few problems.

It's so easy for junior guys to miss these things, and they did. I only look for them because I've been bitten this badly so many times - it's the first thing I do when building a class that manages heap memory: make the allocations and deallocations match. No two ways about it.
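For what it's worth, the habit looks something like this - a contrived sketch, not the real class - where every new[] has exactly one matching delete[], and copies never end up sharing (and double-freeing) a block:

  #include <algorithm>
  #include <cstddef>

  class Prices {
  public:
      Prices(size_t n) : _size(n), _data(new double[n]) {}
      Prices(const Prices & other) : _size(other._size), _data(new double[other._size]) {
          std::copy(other._data, other._data + _size, _data);
      }
      Prices & operator=(const Prices & other) {
          if (this != &other) {
              double  *fresh = new double[other._size];
              std::copy(other._data, other._data + other._size, fresh);
              delete [] _data;              // matches the new[] in the ctor
              _data = fresh;
              _size = other._size;
          }
          return *this;
      }
      ~Prices() { delete [] _data; }        // one delete[] for every new[]
  private:
      size_t   _size;
      double  *_data;
  };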

I'm hoping that this fixes these problems, and it's looking good so far. Just awfully tricky when the bug is nowhere in the stack. Wild.

Rewrote the NBBO Engine – Better, Faster, More General (cont.)

Wednesday, June 1st, 2011

Today I had to do a little hammering on my new NBBO engine because I found a few problems in what it was doing. I wasn't properly filtering out bad data from the NBBO calculation, and that needed to change. Plus, I wanted to add a simple method to force a recalc of the NBBO, because I was seeing bad data get "stuck" in the engine and wanted some way to clear it out. Finally, I introduced a bug in the forced recalc myself - a silly cut-n-paste error, but easily found and fixed.

But the nicest change was realizing that the exchange data for the instruments could skip the security key - a textual representation of the security - and just use the security ID, a 128-bit number equivalent to the key. That meant I could skip the conversion from key to ID on every update, and that alone saved an amazing 33% in processing time. I was able to get my times down to about 22 μsec - just amazing.
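Roughly, the change looks like this - all made-up names, and the real container is the trie rather than a std::map - but it shows where the 33% went: one lookup per update instead of a conversion plus a lookup:

  #include <stdint.h>
  #include <cstddef>
  #include <map>
  #include <string>

  struct SecurityID {                       // the 128-bit instrument ID
      uint64_t  hi, lo;
      bool operator<(const SecurityID & o) const {
          return (hi < o.hi) || (hi == o.hi && lo < o.lo);
      }
  };

  struct NBBOData { double bid, ask; };

  // stand-ins for the real engine's tables
  static std::map<std::string, SecurityID>  keyToID;
  static std::map<SecurityID, NBBOData>     book;

  // old path: textual key -> ID -> data, a conversion plus two lookups per update
  NBBOData * findByKey(const std::string & aKey) {
      std::map<std::string, SecurityID>::iterator  k = keyToID.find(aKey);
      if (k == keyToID.end()) return NULL;
      std::map<SecurityID, NBBOData>::iterator  b = book.find(k->second);
      return (b == book.end()) ? NULL : &b->second;
  }

  // new path: the exchange data already carries the ID - one lookup per update
  NBBOData * findByID(const SecurityID & anID) {
      std::map<SecurityID, NBBOData>::iterator  b = book.find(anID);
      return (b == book.end()) ? NULL : &b->second;
  }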

Consequently, the engine takes 33% less CPU and that directly translates to more feeds on the box. That's always a good thing. Very nice fix/change.

Rewrote the NBBO Engine – Better, Faster, More General

Tuesday, May 31st, 2011

Today I took on the task of re-writing the national best bid/offer (NBBO) engine in my codebase. This is significant because the existing engine was very fast, lockless (for stocks), and a critical part of the data feeds - it's just that important to get the data right. But it had a killer limitation: it was based on the idea that each instrument belonged to a "family" rooted in a stock or index. That wasn't a problem for stocks, and options weren't too bad, but we are starting to get instruments like spreads that aren't based on a single family, and therein lies the problem.

I solved part of the problem by giving these spreads a 128-bit ID value like all the other instruments, so that they fit in a nice 16-way trie. I then just needed to fix the NBBO engine to use this trie instead of the family-based (name) trie.
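For the curious, a 16-way trie over a 128-bit ID is about as simple as it sounds - here's a minimal sketch (made-up names, node cleanup omitted for brevity), consuming the ID a nibble at a time so every node has 16 children:

  #include <stdint.h>
  #include <cstddef>

  template <typename T>
  class Trie16 {
  private:
      struct Node {
          Node() : value(NULL) { for (int k = 0; k < 16; ++k) kids[k] = NULL; }
          T     *value;
          Node  *kids[16];
      };
      static unsigned nibble(uint64_t w, int i) {
          return (unsigned)((w >> (4 * i)) & 0xF);      // i-th nibble of the word
      }
      Node  *_root;

  public:
      Trie16() : _root(new Node()) {}

      // place (or overwrite) the value for this 128-bit key - 32 nibbles deep
      void put(uint64_t hi, uint64_t lo, T *aValue) {
          Node  *n = _root;
          for (int i = 0; i < 32; ++i) {
              unsigned  nib = (i < 16) ? nibble(hi, i) : nibble(lo, i - 16);
              if (n->kids[nib] == NULL) n->kids[nib] = new Node();
              n = n->kids[nib];
          }
          n->value = aValue;
      }

      T * get(uint64_t hi, uint64_t lo) const {
          Node  *n = _root;
          for (int i = 0; i < 32 && n != NULL; ++i) {
              unsigned  nib = (i < 16) ? nibble(hi, i) : nibble(lo, i - 16);
              n = n->kids[nib];
          }
          return (n == NULL) ? NULL : n->value;
      }
  };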

There were a lot of little details to pay attention to, but for the most part it was a smooth transition. The data can now be completely lockless if we want, but for now there's a spinlock at the instrument level to make sure we don't update the NBBO data improperly. In practice that contention should never happen, since a single symbol comes from one feed, and one feed alone, so it's updated by a single thread. But it's possible that someone could build a multi-threaded, single-feed setup, and in that case two threads could hit the same instrument. So I'll leave the lock in for now. Just to be safe.
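The spinlock itself is tiny - something along these lines, sketched with C++11 atomics just for illustration (the engine may well use a different primitive), and essentially free when uncontended:

  #include <atomic>

  class SpinLock {
  public:
      SpinLock() { _flag.clear(); }
      void lock()   { while (_flag.test_and_set(std::memory_order_acquire)) {} }
      void unlock() { _flag.clear(std::memory_order_release); }
  private:
      std::atomic_flag  _flag;
  };

  // each instrument owns its own lock, and the feed thread wraps the update
  struct Instrument {
      SpinLock  lock;
      double    bid, ask;
      void updateNBBO(double aBid, double anAsk) {
          lock.lock();
          bid = aBid;
          ask = anAsk;
          lock.unlock();
      }
  };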

The upshot of this change is that we're now ready to handle any instrument that fits into the 128-bit ID scheme. Also, because we got rid of the map for options within a family, it's faster for options than before. Sweet.

Plenty of testing to do, but it's a great start.

Finally Got Something Going!

Friday, May 27th, 2011


This afternoon we finally got something going on the request/response greek engine! I had to add in an IRC interface to the calculations to make it easier to see what's happening and try values - but in the end, that's a great addition to the system. It only took about 15 mins to add, but the effect was amazing. We could now look at the messages coming in, force a calculation on a stock family - or an individual option - and then view the results. Very nice.

The request/response system is working as well. There were a lot of issues around what the values needed to be (see a previous post), and I had to change the SQL that extracts the volatility values from the database because no one had really checked them. I know... I should have been more on top of this, but when I ask a grown-up, professional programmer: "Did you write a test case? Is it right?" and I hear "Yes", I tend to believe them.

Not any more. I've become a skeptic. They made me a skeptic.

But in the end, we finally got something going. What a relief!

The Difference Between “Good” and “Great” – Massive

Thursday, May 26th, 2011

Today has been a lot of system-level integration work, and it's getting a little frustrating. For example, today I learned that I needed to provide the model with the latest trade prices for stocks as well as the quotes and associated data. Why I didn't know about this weeks ago, I have no idea, but I guess that's the joy of working with just-in-time memories. More specifically, it's because I didn't play a more involved role at the beginning of the project - I trusted my teammates, who have worked with this library before, to know all the inputs and let me know well in advance of needing them.

Nope.

So I found out today that I need to add another 20-plus feeds to the system, which means dealing with a different level of abstraction on the stock feeds - nice in itself - and that in turn means a good bit of refactored code to make the abstraction work well. It's nice to see it done, but it's not nice that it was a surprise.

Hopefully tomorrow will bring fewer surprises.

Slugging Through Others’ Brittle Code isn’t Fun

Wednesday, May 25th, 2011

For the last few weeks I've been slugging through a lot of code written by guys that are a little junior, and who end up making junior-level mistakes. These aren't horrible problems, but they have certainly set me back a bit as far as getting the project done. Each time I run into a new style of problem, I try to point it out to the guys - why it's a problem, and why not to write code this way in the future. I have a feeling this gets a mixed reception, but I'm trying to help them become better at this craft, even though it's often a painful process - for them and for me.

Today it was some really brittle code. Make one little change to one method, and I have to change another. This is often the sign of bad interfaces. If an object is well-defined and its methods are well thought-out, then to add behavior you usually add methods to the class, or make new classes. But when it's sort-of thrown together, you have to change the calling parameters of existing methods, and pass in complete objects, to get the added behavior.
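A contrived before-and-after, with made-up names, of what I mean:

  // Brittle: every new requirement changes an existing signature, so every
  // caller in the codebase has to change with it, and before long the method
  // is being handed whole objects to rummage through:
  //
  //   double price(double spot, double vol);
  //   double price(double spot, double vol, double rate);               // change #1
  //   double price(const Quote & q, const Curve & c, bool early);       // change #2
  //
  // Sturdier: the class has a clear job, so added behavior arrives as a new
  // method (or a new class), and the existing callers never have to move:
  class Pricer {
  public:
      virtual ~Pricer() {}
      virtual double price() const = 0;
      virtual double delta() const { return 0.0; }    // added later, nothing breaks
  };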

This is brittle code. It's incapable of being stretched, expanded, or changed without major changes to the surrounding code.

It's hard to work with, and while I'll end up re-writing it all in the end, for now I need to work with it - if I can - and get something running in the short run. But I'm keeping a bunch of mental notes about what to come back and clean out after the initial release is done.

Caching Price Feeds in the Greek Engine

Tuesday, May 24th, 2011

Today I've been working on putting a cache back into all the price feeds for my greek engine - and then persisting those caches to the config service. This is important because when I reload the instruments, it'd be nice to be able to hit a cache and have the instruments "priced" right away. I tried hitting the external ticker plants, but that was nowhere near fast enough. I needed something a lot faster.

I also realized that the first deployment of the engine won't need the external ticker plants, so it made sense to spend a little on the memory footprint and put the caches - with persistence - into the engine. It's going to pay off.

The speed is pretty nice, but the sheer volume of data is big enough that it's going to take a few seconds to persist it all. That's OK, though - it's done in an off-loaded thread, so it doesn't hold anything up.
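The off-loaded persistence is nothing fancy - something like this sketch, assuming a C++11 toolchain and a hypothetical saveToConfigSvc() on the config-service side: take a quick snapshot under the lock, then let a background thread do the slow write:

  #include <map>
  #include <mutex>
  #include <string>
  #include <thread>

  class PriceCache {
  public:
      void update(const std::string & aSymbol, double aPrice) {
          std::lock_guard<std::mutex>  g(_mutex);
          _prices[aSymbol] = aPrice;
      }

      // called periodically - copy the map, then persist the copy off-thread
      void persistAsync() {
          std::map<std::string, double>  snap;
          {
              std::lock_guard<std::mutex>  g(_mutex);
              snap = _prices;                           // cheap next to the I/O
          }
          std::thread(&PriceCache::saveToConfigSvc, snap).detach();
      }

  private:
      static void saveToConfigSvc(std::map<std::string, double> aCopy) {
          // ...write the snapshot to the config service (mongoDB-backed)...
          (void)aCopy;
      }

      std::mutex                     _mutex;
      std::map<std::string, double>  _prices;
  };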

Getting closer. I now have good prices, and persistence that makes testing fast... it's really starting to come together.

Working on Serialization – Big Payoff Potential (cont.)

Tuesday, May 24th, 2011

This morning I finished up my reformulation of the persistence of the serialized instruments to the configuration service (backed by a mongoDB). The results this time were really quite nice. I was able to serialize everything in about 1.5 sec. Very acceptable. Deserialization on app start-up was fast as well, so today I'll be able to get a lot more testing done simply because each test takes so much less time.

It's going to really pay off. Nice.

Google Chrome dev 13.0.772.0 is Out

Tuesday, May 24th, 2011


This morning I noticed that Google Chrome dev 13.0.772.0 was out and the release notes show some nice progress. The latest V8 engine (3.3.8.1) as well as print preview are both nice. It's great to see them making progress.

Working on Serialization – Big Payoff Potential

Monday, May 23rd, 2011

Today it was all about serialization. The problem I've been having is that my tests hit the database for some 400,000+ instruments, and because of that, each test takes about 5 min to run. The database itself isn't the whole story - the raw fetch might only take a minute - it's the fits and starts with iODBC that really hurt, and the database is in bad shape besides. So I decided this morning that if I could get serialization of my objects going, then I could use persistence, and that would make my tests a lot faster.

So today it's all about serialization.

It's not all that hard - just a lot of details and making sure things work properly. The big problem is that instruments hold a lot of references to one another, and getting all of that right in the serialization is non-trivial. Still, having seen this approach pay off time and again, it's definitely worth it.
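The trick with the references is to never write a pointer - write the 128-bit ID and re-link on load. A sketch, with made-up names:

  #include <stdint.h>
  #include <map>

  struct SecID {
      uint64_t  hi, lo;
      bool operator<(const SecID & o) const {
          return (hi < o.hi) || (hi == o.hi && lo < o.lo);
      }
  };

  struct Stock  { SecID  id; };
  struct Option { SecID  id;  Stock  *underlying; };     // in memory: a pointer

  // the serialized form carries IDs only - no pointers
  struct SerializedOption { SecID  id;  SecID  underlyingID; };

  SerializedOption flatten(const Option & o) {
      SerializedOption  s;
      s.id = o.id;
      s.underlyingID = o.underlying->id;                  // write the ID, not the pointer
      return s;
  }

  // second pass on load: every stock is already in 'byID', so the pointer
  // can be re-established from the saved ID
  Option inflate(const SerializedOption & s, std::map<SecID, Stock*> & byID) {
      Option  o;
      o.id = s.id;
      o.underlying = byID[s.underlyingID];
      return o;
  }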

At the end of the day I had something that worked, but it took far too long to persist everything. I was serializing each instrument family (a stock and its options) independently and saving it to its own mongoDB document. That was just too slow - it took about 30 sec to write out all the data. In the morning I'm going to pack things differently and put as much into each document as I can: fewer, bigger documents instead of so many small ones.
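Roughly, the plan for the morning - assuming a hypothetical writeDocument() call on the config service, and a serializeFamily() that already exists in some form - is to pack serialized families into a batch until it nears the document size cap, then write the batch as one document:

  #include <cstddef>
  #include <string>

  // assumed to exist elsewhere: serialize one family, write one document
  std::string serializeFamily(size_t aFamilyIndex);
  void writeDocument(const std::string & aPayload);

  // stay comfortably under mongoDB's per-document size cap
  static const size_t  MAX_DOC_BYTES = 12 * 1024 * 1024;

  void persistAll(size_t aFamilyCount) {
      std::string  batch;
      for (size_t i = 0; i < aFamilyCount; ++i) {
          batch += serializeFamily(i);
          if (batch.size() >= MAX_DOC_BYTES) {            // flush a full document
              writeDocument(batch);
              batch.clear();
          }
      }
      if (!batch.empty()) writeDocument(batch);           // the last, partial one
  }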