Nasty Debugging Problem with Simple Solution

bug.gif

Today I spent a very long time on a problem that was really quite simple, but very hard to find. The problem was that many of my configuration service calls were timing out - but only after other calls had successfully been sent and received. This degradation of performance was very repeatable, but equally puzzling. It was clear early on that the problem was in the mongoDB - specifically, getting data out of Mongo. This wasn't exactly clear in all cases, but the hints were very much there.

For instance, if my configuration service hit a new single-server configuration mongoDB, everything was acceptable. But if it hit the staging replica set, it timed out. All this was with reads, so there's no chance of the writing coming into play. Very odd, then that a replica set was slower.

We kept digging, and went so far as to turn off the replica set and turn it into a single server. This yielded the same times as the replica sets - which is to say "slow". Maybe it was the hardware? Nope, a new single-server instance on that hardware was fine.

Finally, after several hours, we got to the heart of the matter: my configuration service was hitting the authorization mongoDB for the auth token to make sure the user was allowed to hit the configuration data. Bingo! We had a 266,000 entry mongoDB table without an index!

All that was needed was to type in the mongo shell:

  db.token.ensureIndex({token:1});

and the times sped up dramatically. This was the key - we didn't look at the data - just the hardware and the software. It was a long day, and while I'm glad we got this one out of the way, it didn't solve my problems 100% as my larger queries are still timing out. David says he's going to look at the the emongo driver this weekend for possible causes. He added the replica sets support to it today as we needed to move away from erlmongo as it inly uses one socket connection to the database. emongo allows for connection pools, which is going to help me a lot.