Archive for April, 2011

Nasty Debugging Problem with Simple Solution

Friday, April 29th, 2011


Today I spent a very long time on a problem that was really quite simple, but very hard to find. The problem was that many of my configuration service calls were timing out - but only after other calls had successfully been sent and received. This degradation of performance was very repeatable, but equally puzzling. It was clear early on that the problem was in the mongoDB - specifically, getting data out of Mongo. This wasn't exactly clear in all cases, but the hints were very much there.

For instance, if my configuration service hit a new single-server configuration mongoDB, everything was acceptable. But if it hit the staging replica set, it timed out. All this was with reads, so there's no chance of writes coming into play. Very odd, then, that a replica set was slower.

We kept digging, and went so far as to turn off the replica set and turn it into a single server. This yielded the same times as the replica sets - which is to say "slow". Maybe it was the hardware? Nope, a new single-server instance on that hardware was fine.

Finally, after several hours, we got to the heart of the matter: my configuration service was hitting the authorization mongoDB for the auth token to make sure the user was allowed to hit the configuration data. Bingo! We had a 266,000 entry mongoDB table without an index!

All that was needed was to type in the mongo shell:

  db.token.ensureIndex({token:1});

and the times sped up dramatically. This was the key - we didn't look at the data - just the hardware and the software. It was a long day, and while I'm glad we got this one out of the way, it didn't solve my problems 100% as my larger queries are still timing out. David says he's going to look at the emongo driver this weekend for possible causes. He added the replica sets support to it today as we needed to move away from erlmongo as it only uses one socket connection to the database. emongo allows for connection pools, which is going to help me a lot.
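To make the cost of the missing index concrete, here is a small sketch in Python rather than Mongo. It is an analogy only: a list of dicts stands in for the un-indexed token collection and a plain dict stands in for the index (mongoDB's real indexes are B-trees, not hash tables). The collection size is the one from our table; the token and user values are made up.

```python
# Illustrative only: a list of dicts stands in for the un-indexed token
# collection, and a plain dict stands in for the index on "token".
N = 266_000  # size of the token table from the post

tokens = [{"token": f"tok-{i}", "user": f"user-{i}"} for i in range(N)]


def find_without_index(token):
    """Full scan: examine documents one by one until a match turns up."""
    examined = 0
    for doc in tokens:
        examined += 1
        if doc["token"] == token:
            return doc, examined
    return None, examined


# Building this once plays the role of ensureIndex({token:1}).
index = {doc["token"]: doc for doc in tokens}


def find_with_index(token):
    """Indexed lookup: go straight to the document, no scan."""
    return index.get(token)


last = f"tok-{N - 1}"
doc, examined = find_without_index(last)
print(examined)                      # documents touched by the scan
print(find_with_index(last) is doc)  # same document either way
```

Every auth check that misses the index pays for the scan, and every configuration call did an auth check first, which is why the slowdown showed up everywhere at once.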

Google Chrome dev 12.0.742.12 is Out

Friday, April 29th, 2011

Well, the Google Chrome guys are still putting the UI polish on 12.x, as they just released 12.0.742.12 with the release notes saying it's just UI issues and a few sync issues. It really appears that they are going for a stabilized release of 12.x for beta as I've read that they believe 11.x is headed for release. Time marches on...

It’s Amazing to Me What’s Considered Necessary at Times

Thursday, April 28th, 2011

So I've been working on this little service for my greek engine - it's not a major component, but it's something that finds use in the Shop, so I was replicating its functionality in the new codebase. One of the things that the legacy messages had was the OPRA Message Type for the trade messages. This is a one-character field that says what kind of trade this message describes. Is it a cancel? Is it electronic? There is a lot of metadata you could have about a trade, but typically, you want to take it out of the exchange-specific realm, and put it into bit flags, etc. Make it source-independent.

Which I had.
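A minimal sketch of the idea, in Python rather than the greek engine's own code, and not the actual implementation: the flag names and the one-character codes below are placeholders I made up for illustration, not the real OPRA message types. The point is only that the exchange-specific code gets translated once, at the edge, and everything downstream asks questions of the flags.

```python
from enum import IntFlag


class TradeFlags(IntFlag):
    """Source-independent facts about a trade, one bit each (hypothetical)."""
    NONE = 0
    CANCEL = 1
    ELECTRONIC = 2
    LATE = 4


# HYPOTHETICAL mapping: these one-character codes are placeholders,
# not the real OPRA message types.
CODE_TO_FLAGS = {
    "a": TradeFlags.NONE,
    "b": TradeFlags.CANCEL,
    "c": TradeFlags.ELECTRONIC,
    "d": TradeFlags.ELECTRONIC | TradeFlags.LATE,
}


def flags_for(code):
    """Translate once, at the feed edge; unknown codes get no flags."""
    return CODE_TO_FLAGS.get(code, TradeFlags.NONE)


def is_cancel(flags):
    """Downstream apps ask about the trade, not about the exchange code."""
    return bool(flags & TradeFlags.CANCEL)
```

With something like this in place, no client needs its own copy of the code-to-meaning logic.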

Then I came upon this legacy message and saw that it had this OPRA message type as a field. I asked around, and was told that I needed to have that in the message. That's odd. Very odd. This means that every app will have to have the same logic for what this "means" to the trade. This doesn't make a lot of sense at all. In fact, I think it's silly.

But it's a requirement, so it's in. Silly. Totally silly.

UPDATE: after another meeting, it was the consensus that this wasn't such a hot idea, and that we should try to live without it. OK with me. Simple git revert.

MongoDB Replica Sets Issues

Thursday, April 28th, 2011


This morning I started to see some disturbing problems with the configuration service written in erlang for The Broker. It's all backed by a mongoDB that's currently configured as a replica set, and after a few apps were up, the speed took such a hit as to start to time out my requests. I wasn't sure what it was, so I took the advice I was given, downloaded the latest pre-built binaries and ran a stand-alone install on one of my boxes.

It was really pretty amazingly easy. You unzip the tarball and just run it. I made a simple directory to put all the data in, and away it went. Very nice. I was able to reconfigure my Broker code to hit this database for the three Brokers I had in my little dev cluster.

Then I ran the app.

Very nice response times. Very. I let it run for an hour or so, accumulating data and saving it to the new mongoDB. Then I stopped everything, and restarted it. Rather than hanging, as the replica set did, it started up with a little slowness, but everything worked. It's not mongoDB, and it's not the way I was using it. At least not in a single server mode.

Someone did a little digging and found that the 1.8.1 release might have introduced a bug in the replica set negotiation. So we're going to get the "final release" source and put it on the boxes and see if that doesn't help. But we need something. As it is now, replica sets are really not going to scale like we need.

There Really is No Substitute for Documentation

Wednesday, April 27th, 2011

This afternoon I'm onto another problem with The Broker, and this time it's really difficult to figure out because there have been a lot of changes made to the codebase, and none of them are documented in the least. The problems include the immediate unregistration of services after they have been registered, as well as not accurately identifying those services that aren't available to the client.

I think I was getting close to the answer, but the erlang code is just too functional, and it's hard to know where something is called from if you don't have a complete stack trace. In this case, I don't - or at least don't recognize it if I do. I'm about a 5 or 6 out of 10 in erlang, and that's not enough to really be able to dig all this out of the code without some form of documentation to help me know what role each module plays in the overall scheme of things.

In the end, I was able to document what I saw, what I think the problems were, and how I'd go about fixing them, and sent that off to the guy who wrote all the code and is far, far better at erlang than I am. I'll have to wait until he returns tomorrow and see.

Getting Going on Time and Sales Service (cont.)

Wednesday, April 27th, 2011

This morning I finished up on a nice simplification that I saw in the Time and Sales service: the temporary structure I was using to hold the data from the Option and the Print was really a new message - the OptionPrint message. This is exactly what we need to send out to the new clients, and a version of this can easily be made to send out to the legacy clients.

So I gutted all the code in the service and made the new message in the old messaging codebase, and then retrofitted it into the new service. In all, it took me about half a day - finishing up this morning. But it's worth it.

Now, all the pieces fit again - if we wire up a transmitter to the service, it'll automatically send the messages out the proper channel - be that legacy or new. It also makes it very memory-friendly as the same structure that holds the temp data is the message we'll be sending out. That means there's no conversion to a message - we just ship what we have.

It takes no more to create, and it's far more efficient to use. Sounds like a win to me.

Got Nailed Again by Infrastructure Changes

Wednesday, April 27th, 2011

This morning, for the second time in about as many weeks, another group in The Shop decided to update their Mongo database, and it's going to cost me most of the day to fix my code because of the change. They say their java client to mongo was not allowing them to use documents larger than 4MB, but I've been storing 16MB+ docs in the database with the erlang driver. But they didn't ask me before they did it. They just told me they were doing it.

In going from 1.6 to 1.8, it turns out that there's now a hard and fast rule about docs being less than 16MB - so I get messed over. I am going to have to look at how all my documents are created and make sure that no one gets too big because the failure will hang my process. It's ugly - really ugly.

I think when I'm done with this, I'll try to put something into the configuration service that errors out if you send it a payload that's more than 16MB. This way, it can't let you save something too big, and erroring out is better than locking up every time.
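The guard itself would be small. Here's a sketch of what I have in mind, written in Python for brevity rather than in the erlang the service actually uses; the function and exception names are hypothetical. One assumption to flag: this checks the raw payload bytes, and the encoded document will be somewhat larger, so a real check would want to measure the encoded size or leave some headroom.

```python
MAX_DOC_BYTES = 16 * 1024 * 1024  # the 16MB document limit from the post


class PayloadTooLarge(ValueError):
    """Raised instead of handing an oversized document to the database."""


def check_payload(payload: bytes, limit: int = MAX_DOC_BYTES) -> bytes:
    """Return the payload unchanged if it fits, otherwise raise.

    Assumption: measures raw bytes only, not the encoded document size.
    """
    size = len(payload)
    if size > limit:
        raise PayloadTooLarge(f"payload is {size} bytes, limit is {limit}")
    return payload
```

Called at the top of the save path, this turns a hung process into an error the caller can see and handle.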

Still... it's these "detours" that are really getting annoying - and quite avoidable. They could put up a dev environment and we can test, and then work out solutions before doing it to the staging install. That's what they should be doing. I've no reason to believe they will, however.

Updated to WordPress 3.1.2 at HostMonster

Wednesday, April 27th, 2011

This morning I noticed that there was an update to WordPress to 3.1.2, and so I took the few minutes to update all my installs at HostMonster. It's a single security issue that would never affect my sites as I don't allow for contributors, but still, it makes sense to patch any and all security holes just in case.

Google Chrome dev 12.0.742.9 is Out – Release Candidate?

Tuesday, April 26th, 2011

Seems the Google Chrome guys are busy - today it's 12.0.742.9 with the release notes calling it a release candidate. Interesting. If they are planning on moving the 12.x branch to 'beta', then they'll soon be bumping the 'dev' channel to 13.x - which is an interesting number to say the least. I wonder if anyone is superstitious? In any case, they are trying to polish up this version for the release, and that's always good news as it means a better experience for everyone.

Getting Going on Time and Sales Service

Monday, April 25th, 2011

Today I did a lot of work helping people get focused on what we should be doing. For my part, we needed a Time and Sales Service where each option trade sent to us by the exchanges needs to be captured, a few implied volatilities calculated, and then shot out to the waiting clients. This is really orthogonal to the rest of the greek engine, so I wanted to tackle that so it doesn't slow down other people, or divert them from the primary goals.

The design is going to be very simple: a service that takes an Option and a Print message and queues them up to be calculated and sent out. Internally, have a struct (object) that will hold all I need from the two incoming arguments, fill it on entry, queue that, and return. All this should be very fast as it's all just ivar calls to the base objects.

Then the queue has a processing thread, and will take these off one by one, calculate the values, and then generate the messages in the legacy and new formats so that we can feed both the old and new systems. Pretty simple.
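The shape of that design, sketched in Python with the standard library rather than in the engine's own code. Everything named here is hypothetical: the struct's fields, the `calculate` placeholder (which does not compute implied volatilities), and the `send` callback. It also sends a single message per print, where the real service would generate both the legacy and new formats.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class PendingPrint:
    """Holds what we need from the Option and Print (fields are hypothetical)."""
    symbol: str
    price: float
    size: int


def calculate(item):
    """Placeholder for the real calculations; returns a dummy message."""
    return {"symbol": item.symbol, "notional": item.price * item.size}


class TimeAndSales:
    def __init__(self, send):
        self._queue = queue.Queue()
        self._send = send  # callable that ships a finished message
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def on_print(self, symbol, price, size):
        """Fast path: copy the fields, enqueue, return."""
        self._queue.put(PendingPrint(symbol, price, size))

    def _run(self):
        """Processing thread: take items off one by one and send them."""
        while True:
            item = self._queue.get()
            if item is None:  # sentinel: shut down
                break
            self._send(calculate(item))

    def close(self):
        """Let the worker finish what's already queued, then stop it."""
        self._queue.put(None)
        self._worker.join()
```

The caller only ever pays for the copy and the enqueue; all the calculation happens on the worker thread, in arrival order.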