Good, Solid, Code Needs Maintenance – Even Mine

bug.gif

Last night I got a call from the second-shift operators here at work about a problem with my fast-tick risk engine. Basically, a process that gets the information on all the options in the system was handing and timing out and therefore was failing. This code hasn't been changed in years (so I thought) so I had no idea what to make of the problem. Time to dig into it.

I got online and looked at the stack trace of the error. I tried to match that up to the code in CVS, and couldn't. Great! Someone changed the code and didn't check it in! ran through my head more than once. So I had to get the manager of the unknown coder that had to have changed this online as well. He looked at the diffs in ViewCVS and came to the conclusion that the changes were all cosmetic. I had said that I could track this down, but I'd need to be able to rebuild the code to add in some debugging/logging statements that were very much absent in the existing code. But to do that, I had to know that I had the right version of the source or we'd be in even bigger trouble than we already were.

He gave me the go-ahead and I updated a few libraries of mine to make sure that it wasn't a flaky problem I'd fixed in the last year or two, and then rebuilt and deployed it. I got the same errors, but this time, the line numbers lined up with the stack trace. Finally! I could get to the bottom of this.

What I found was that even though the server appeared to be working properly, it most likely had a locking problem that was causing this issue and while I could not be positive on this, it certainly appeared that this was the cause. So I let everyone know and restarted the server. When it came up, everything worked as it should. Good.

This morning, I then dug into the effected code and noticed that since it was very old I hadn't updated it to the more consistent locking scheme I'd put in place for the rest of the app. I didn't have the read-lock on the master list of instruments - and it needed to be there so I'd be safe from any additions/deletions, and there was a read-lock on the instrument which wasn't necessary because the data I was looking at is set at load time and there's no way to change it during the run-time. I replaced it with a safer retain()/release() pair, and this will also protect me from having the instrument deleted out from under me.

With these changes, the dev server worked perfectly and I've been able to get several large queries through it without any problem. Nice.

The lesson here is that all good code needs to be maintained regularly to make sure that changes in one part of the code make it to all other effected areas. This hasn't happened in this codebase simply because there isn't time to do so. When something breaks (like last night), there's time to fix it, but not before. I've said it many times before, and it's still true - the best time to fix a hole in the roof is not during a monsoon. Too bad we just don't have the time.