Debugging is a Continuous Effort

I was out on Friday feeling pretty crummy. Today I'm feeling a little less crummy, but only because I'm a little more used to it. When I got back, I saw that my analytics engine had been spitting out a bunch of NaNs for one particular analytic, so I started to dig into the problem. What I found amazed me: it should have broken much worse than this, and much earlier.

The engine is written in C++, so those pesky NaNs are something you have to consider and, if you're clever, something you can use to flag illegal values. This engine has been running for years, doing the same kinds of things many times a day, day in and day out, seemingly without a problem. But when I dug into it today, I realized that a few months ago I changed the data source for instrument prices, and in doing so set the stage for a bug cascade that ended up biting me on Friday while I was out.
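For anyone who hasn't used NaN as an "illegal value" flag, here is a minimal sketch of the idea. The names `kNoPrice`, `hasPrice`, and `midpoint` are made up for illustration and are not taken from the engine.

```cpp
#include <cmath>
#include <limits>

// Hypothetical sentinel meaning "no valid price".
constexpr double kNoPrice = std::numeric_limits<double>::quiet_NaN();

// NaN never compares equal to anything, itself included, so the check
// has to go through std::isnan rather than operator==.
inline bool hasPrice(double p) {
    return !std::isnan(p);
}

// Arithmetic on a NaN yields a NaN, so the sentinel flows through
// any calculation that touches it.
inline double midpoint(double bid, double ask) {
    return (bid + ask) / 2.0;
}
```

That propagation is the handy part and the dangerous part: one flagged input is enough to turn every downstream result into a NaN.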

The problem starts with the source of the prices: Reuters. About 2 hours before a market opens, they zero out all the data in the records they send, as an indicator that the instrument is about to enter the active market portion of the day. Normally this is OK, but the problem comes when you need to make a "price" from a bunch of zeros. Those zeros are really telling you to use the historical mark for the instrument and not the data from Reuters. I thought I had code in the application to do that, but it seems I was more than a little mistaken.

No... I wasn't ignoring the zero prices. I was converting them to NaNs and putting them into the time series data. I think my idea was that the NaN would signal later code to skip this data point, but even that was unnecessary: I simply should not have overwritten good data with bad, no matter what I planned to do with it later.
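A rough sketch of what "don't overwrite good data with bad" can look like. The `Quote` record, its field names, and the `applyQuote` helper are all hypothetical, and it assumes the last element of the series is today's slot and already holds a usable value such as the historical mark.

```cpp
#include <vector>

// Hypothetical record; the field names are illustrative only.
struct Quote {
    double bid;
    double ask;
    double last;
};

// True when the feed has zeroed out the record ahead of the open.
bool isZeroedOut(const Quote& q) {
    return q.bid == 0.0 && q.ask == 0.0 && q.last == 0.0;
}

// Update today's slot (the final element) of a price series.
// Returns true only if the series was actually changed.
bool applyQuote(std::vector<double>& series, const Quote& q) {
    if (series.empty() || isZeroedOut(q)) {
        return false;  // leave the existing good value alone
    }
    series.back() = q.last;
    return true;
}
```

The point is that a zeroed record becomes a no-op instead of a write, so nothing downstream ever sees a NaN that came from the feed.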

After fixing that bug, I realized that even with that one data point being a NaN, there was no reason for all the historical data points to return NaN as well. As I looked into the problem more, I realized that I had made the historical calculations include the value for today. The idea is that once I've calculated the historical numbers, I subtract out the value for today, and then on the next call I can simply compute the new value for today, add it to the rest of the values, and everything is up to date.
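To make that concrete, here is a simplified sketch of the subtract-and-add pattern using a plain sum. The `RunningSum` class is hypothetical; the real calculations are more involved, but the failure mode is the same.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical cache of a sum over a price series whose final element is today.
class RunningSum {
public:
    explicit RunningSum(const std::vector<double>& series)
        : total_(std::accumulate(series.begin(), series.end(), 0.0)),
          today_(series.empty() ? 0.0 : series.back()) {}

    // Swap today's contribution without re-summing the history.
    // Non-finite values are rejected: NaN - NaN is still NaN, so one bad
    // update would leave total_ stuck at NaN for every later call.
    bool updateToday(double value) {
        if (!std::isfinite(value)) {
            return false;
        }
        total_ += value - today_;
        today_ = value;
        return true;
    }

    double total() const { return total_; }

private:
    double total_;  // cached sum, including today's value
    double today_;  // today's current contribution to total_
};
```

Without the `std::isfinite` guard, a single NaN for today gets folded into the cached total, and subtracting it back out on the next call can't undo the damage.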

But again, since I was putting in a NaN for a zero price, the bad value for today was messing up all the other values too. Amazing. Fixing it was not too bad, it only took a little time, but it was the data change that started the whole problem. It's just amazing to me how much debugging you need to keep up with when you change things like data sources, because of the assumptions that come with the data.