Tricky Little Threading Bug

bug.gif

For the last two days I've been wrestling with a tricky little threading bug. The problem is in a class collection called SearchSpace and it's used for caching computed values that are computationally expensive to produce, so we cache them for interpolation of intermediate values. It's a standard 1D and 2D search space with linear and Taylor Series approximations built in. Most of the time everything runs fine, but two days ago I came into the office and saw that my dev server had a Seg Fault at the line:


    if (mHash[lIndex] == NULL) {
        ...
    }

The value mHash was NULL and that was causing the Seg Fault. The problem was, about 10 lines before this, I have code that looks like:


    if (mHash == NULL) {
        init(false);
    }

Given that the method this appears in is protected by a mutex for thread safety, one would think that it's impossible for the variable mHash to be NULL on entry, initialized, and then it's again NULL a few lines later. Or, if it's not NULL on entry, it's impossible for it to become NULL by the problem line. I know I was confused. But the impossible happens so often to me it's almost common.

The key to this problem is that the tHashMap (where this method exists) is getting deleted in the middle of the operation. There's no mutex issue to contend with, and on exit all the variables are destroyed, and I get a seg fault, so the question remained - who was doing the removal and why? Even more importantly, how to stop it without throwing a ton of locking and unlocking at the problem?

It turns out that the SearchSpace runs a purge() method every so often to clear out the unused data. Data that hasn't been accessed in 10 minutes is considered 'stale' or 'unnecessary', and is deleted. If we need to re-create it later, so be it, but we have to balance the memory footprint with the computational cost of generating the numbers. It turns out that 10 mins is a nice trade-off point for this application. When the purge() is being run, it looks for empty SearchSpaceNodes (which contain tHashMaps) and deletes them. This, then, is the culprit. If we're working on a nearly empty SearchSpaceNode and the purge() method deletes it, we're left with a dangling pointer and that's the problem. So how to fix it?

One of the things I've used successfully in the server is to implement a very simplistic retain/release counter. But given that there will be tens of thousands of these SearchSpaceNodes in the system, I didn't want to burden the system with a mutex and an integer for each node. There had to be a simpler way.

The answer was simple - add a simple bool ivar to the SearchSpaceNode called inUse. Have the addItem() method set it when it gets a SearchSpaceNode, do all the work it needs, and when it's done, it resets it. The purge() method then only deletes the empty SearchSpaceNodes that are 'not busy'. If it's busy one time through, and nothing is added, then the next time through purge() it'll be deleted. But now it's going to be impossible to get the seg fault due to the SearchSpaceNode being deleted out from under a working thread.

Not easy to see, and it could have been solved by using mutexes and making every method on the SearchSpaceNode thread-safe, but that's not completely necessary. I only need to protect this case, and with what I've got now, I've got the thread-safety I need and don't have the overhead of all those mutexes locking and unlocking all the time.