Totally Missed a Threading Problem with Statics
This morning I was looking at the logs of one of my price injectors and found an exception from the CKStopwatch saying that the number of time events and time structs didn't match, which is a serious data integrity problem. I'd never seen this before, so I had to dig into it right away.
When I looked at the code, I saw something I knew was there, but it hadn't bitten me in all the months this thing had been running. You see, in this injector I have a few threads (the number is configurable, based on load and number of processors) that take the prices and send them to the destination, i.e. inject them into the message stream. In these threads I need to know how long they have been running so that I can print out statistics on their operation. Nothing fancy: just the number of prices each thread has injected over the interval it has been working. Ticks per second.
Since these values had to persist across many calls to the thread's main processing method, I declared them static at the top of that method:
int SPPoller::process()
{
    bool error = false;
    static int totalSent = 0;
    static CKStopwatch interval;
    // ...
and then checked whether this was the first pass through and, if so, reset the stopwatch. But all of this was a horrible mistake waiting to bite me.
A function-local static gives me one and only one instance of each variable, no matter how many threads are running this code. The int got lucky: unsynchronized increments from two threads are still a data race and can lose counts, but the worst outcome there is a slightly wrong statistic. The CKStopwatch, with its internal bookkeeping, wasn't so forgiving. That's where the exception the other day came from, and a different one this morning.
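To see the sharing concretely, here is a minimal standalone sketch; it is not the injector code, and the function names are hypothetical. Every thread that calls addressOfLocalStatic() gets back the address of the same totalSent object. Since C++11 the initialization of a function-local static is guaranteed to run exactly once even with concurrent callers, but nothing synchronizes the reads and writes that follow.

```cpp
#include <cassert>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Hypothetical stand-in for process(): the function-local static below
// is created once and shared by every thread that calls this function.
const void* addressOfLocalStatic()
{
    static int totalSent = 0;   // one object for the whole program
    return &totalSent;
}

// Calls addressOfLocalStatic() from several threads and collects the
// distinct addresses they observe.
std::set<const void*> addressesSeenByThreads(int threadCount)
{
    assert(threadCount > 0);
    std::set<const void*> seen;
    std::mutex seenLock;            // protects 'seen', not the static
    std::vector<std::thread> threads;
    for (int i = 0; i < threadCount; ++i) {
        threads.emplace_back([&] {
            const void* addr = addressOfLocalStatic();
            std::lock_guard<std::mutex> guard(seenLock);
            seen.insert(addr);
        });
    }
    for (auto& t : threads) t.join();
    return seen;
}
```

However many threads you start, the returned set holds exactly one address: they are all looking at the same variable.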
I did this to keep the variables close to where they're used. Had I gone with thread-local storage, I'd have been in good shape. But I didn't. I debugged this with a single injector thread and it was fine. Most configurations run only two, so the chance of the threads colliding is small, which is why it took months to show up. With 50 threads it would have bitten me much sooner.
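For comparison, here is a minimal sketch of the thread-local version, again with hypothetical names. Marking the variable thread_local gives each thread its own totalSent, so each thread's count reflects only its own calls.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Same shape as before, but thread_local gives every thread its own copy.
int bumpPerThreadCount()
{
    thread_local int totalSent = 0;  // one object per thread
    return ++totalSent;
}

// Each thread bumps its own counter callsPerThread times and records the
// final value it saw in its own slot of the result vector.
std::vector<int> perThreadTotals(int threadCount, int callsPerThread)
{
    assert(threadCount > 0 && callsPerThread > 0);
    std::vector<int> totals(threadCount, 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < threadCount; ++i) {
        threads.emplace_back([&totals, i, callsPerThread] {
            int last = 0;
            for (int n = 0; n < callsPerThread; ++n) last = bumpPerThreadCount();
            totals[i] = last;  // each thread writes only its own element
        });
    }
    for (auto& t : threads) t.join();
    return totals;
}
```

One caveat: a thread_local inside a member function is per thread, not per object. It only matches the original intent if each SPPoller runs on exactly one thread and each thread runs only one SPPoller.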
The solution was easy: make the total-sent count and the interval timer instance variables of the class, then reset the timer and zero the count in the constructor. As long as each thread runs its own poller, nothing is shared anymore and everything is just fine. This was only affecting my logging, so it wasn't a horrible problem in itself, but the exception caused the poller to die, and that was a serious issue. It had to be fixed, and it was easy to do.
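Here is a rough sketch of that fix under stated assumptions. Stopwatch is a hypothetical stand-in built on std::chrono::steady_clock, since CKStopwatch's real interface isn't shown here, and Poller with its process(int) is a simplified, hypothetical stand-in for SPPoller. The count and timer live in each instance and are initialized in the constructor, and each thread touches only its own poller.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical stand-in for CKStopwatch (its real API is not shown).
class Stopwatch {
public:
    void reset() { start_ = std::chrono::steady_clock::now(); }
    double elapsedSeconds() const {
        return std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start_).count();
    }
private:
    std::chrono::steady_clock::time_point start_{std::chrono::steady_clock::now()};
};

// Hypothetical poller: the counter and timer are members, so every
// instance has its own copies instead of sharing function-local statics.
class Poller {
public:
    Poller() : totalSent_(0) { interval_.reset(); }

    // Stand-in for one pass of process(): pretend we injected 'count' prices.
    void process(int count) { totalSent_ += count; }

    int totalSent() const { return totalSent_; }

    // Ticks per second since construction (0 if no time has elapsed yet).
    double ticksPerSecond() const {
        double secs = interval_.elapsedSeconds();
        return secs > 0.0 ? totalSent_ / secs : 0.0;
    }
private:
    int totalSent_;
    Stopwatch interval_;
};

// Runs one poller per thread; each thread uses only its own poller.
std::vector<int> runPollers(int threadCount, int passes)
{
    assert(threadCount > 0);
    std::vector<Poller> pollers(threadCount);   // sized once, never reallocated
    std::vector<std::thread> threads;
    for (int i = 0; i < threadCount; ++i) {
        threads.emplace_back([&pollers, i, passes] {
            for (int n = 0; n < passes; ++n) pollers[i].process(1);
        });
    }
    for (auto& t : threads) t.join();
    std::vector<int> totals;
    for (const auto& p : pollers) totals.push_back(p.totalSent());
    return totals;
}
```

The design choice is that the state follows the object rather than the function, so adding threads means adding pollers rather than adding contention. If a combined total across all threads were ever needed, summing the per-poller counts after the threads are joined, or using a std::atomic<int>, would avoid reintroducing shared, unsynchronized state.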
Whew! That one caught me by surprise.