The Best Debugger is Sitting Right Between Your Ears
Last night on the way home I was thinking about the locking problem I had been battling during the day and was just rolling it around in my mind. As so often happens, the solution came to me in a little question leading to a little insight, and then as I ran it through in my mind I was convinced that I had figured it out.
The problem was that I wasn't considering a hidden thread and a hidden mutex. Had it been one or the other, I might have seen it sooner, but the fact was it took me that long to have my mind "fan out" from the initial problem and see what else was happening in the system.
The message broker API is not thread-safe. So in order to make sure we don't mess it up, we have a global mutex on it. Only one thing can be going through it - in or out. Seems very reasonable. Since we need to receive messages, we have a polling thread in the API wrapper classes that uses poll() to see if there's anything at the socket, and if there is, it locks the API, processes the incoming message, and then unlocks the mutex.
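To make that receive path concrete, here's a rough sketch in Python. It is not the real wrapper code: `api_lock`, `poll_once()` and the `handle` callback are names I'm making up for illustration, and I'm using `select.select()` as a stand-in for poll(). The point is only the ordering: check the socket first, and take the API mutex just for the processing.

```python
import select
import socket
import threading

# Stand-in for the global mutex guarding the non-thread-safe broker API.
api_lock = threading.Lock()

def poll_once(sock, handle, timeout=0.1):
    """Wait up to `timeout` seconds for data on `sock`.

    If something is there, lock the API, hand the bytes to `handle`
    (a hypothetical message processor), and unlock again.
    Returns True if a message was processed, False on timeout.
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return False
    with api_lock:               # one thing through the API at a time
        handle(sock.recv(4096))  # process the incoming message
    return True
```

The polling thread would just call `poll_once()` in a loop. Anything `handle` does runs with the API mutex held, which matters later.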
We also have the sending thread. That guy is based on the fact that a price comes from the ticker plant and there may be several instruments that this price matches, so I have to lock the list of instruments, get all those that are driven by this price, and then send each one in turn. Finally, I unlock the list of instruments.
These would all be OK if it weren't for a few facts:
- the incoming messages can cause me to add a new instrument
- the locking on the instrument list on the sending thread encompasses the entire loop
What would happen is that I'd be sending an instrument with price to the message broker and an incoming message would arrive. That incoming message would block until the sending was done and then it would obtain the mutex on the API and start the processing of the incoming message.
If that message ended up requesting me to add a new instrument to the list by calling addInstrument(), then I'd try to get the lock on the instrument list. Bingo! The sending thread isn't done with its loop and can't send again because the incoming thread has the API mutex locked, and the incoming thread can't add the instrument because the sending thread has the lock on the instrument list. Deadlock.
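Boiled down, the two threads take the same two locks in opposite order. Here's a small Python sketch that shows this without actually hanging: it runs each path by itself and records the order in which the locks are taken. `TracedLock`, `sender_path()` and `receiver_path()` are illustrative names, not anything from the real code, and the actual broker send is left out.

```python
import threading

class TracedLock:
    """A lock that appends its name to a log each time it is acquired."""
    def __init__(self, name, log):
        self.name = name
        self.log = log
        self._lock = threading.Lock()

    def __enter__(self):
        self._lock.acquire()
        self.log.append(self.name)
        return self

    def __exit__(self, *exc):
        self._lock.release()
        return False

def sender_path(api_lock, list_lock, instruments):
    with list_lock:                # held for the entire loop
        for _inst in instruments:
            with api_lock:         # the hidden mutex inside each send
                pass               # the broker send would go here

def receiver_path(api_lock, list_lock, instruments, new_inst):
    with api_lock:                 # held while processing the message
        with list_lock:            # what addInstrument() needs
            instruments.append(new_inst)

def lock_orders():
    """Run each path alone and return the two acquisition orders."""
    send_log, recv_log = [], []
    sender_path(TracedLock("api", send_log),
                TracedLock("instruments", send_log), ["A"])
    receiver_path(TracedLock("api", recv_log),
                  TracedLock("instruments", recv_log), [], "B")
    return send_log, recv_log
```

`lock_orders()` gives `instruments` then `api` for the sender, and `api` then `instruments` for the receiver. Run them on two threads with the right timing and each one ends up holding the lock the other is waiting for.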
The solution was simple: Never leave anything locked when sending a message. Simple. When I copied the instruments to a temp list in a thread-safe manner and then processed the copy without a lock on the instrument list, everything worked.
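In Python terms, the copy-then-send idea looks something like the sketch below. Again, this is a minimal illustration under my own assumptions: the `Publisher` class, its method names, and the `send` callable are all hypothetical, and for brevity it sends every instrument rather than only the ones driven by the given price.

```python
import threading

class Publisher:
    def __init__(self, send):
        self._send = send                    # hypothetical broker send callable
        self._api_lock = threading.Lock()    # global mutex on the broker API
        self._list_lock = threading.Lock()   # guards the instrument list
        self._instruments = []

    def add_instrument(self, name):
        with self._list_lock:
            self._instruments.append(name)

    def on_incoming(self, name):
        # Receiver path: the API mutex is held while the message is
        # processed, and processing may add an instrument.
        with self._api_lock:
            self.add_instrument(name)

    def on_price(self, price):
        # Copy the list while holding its lock, then let go of the lock.
        with self._list_lock:
            snapshot = list(self._instruments)
        # Nothing is locked at this point, so taking the API mutex for
        # each send can no longer complete a cycle with the receiver.
        for inst in snapshot:
            with self._api_lock:
                self._send(inst, price)
```

The trade-off is that an instrument added while a price is going out won't be in the copy, so it picks up its first price on the next tick instead.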
I have to say, this is one of the fun times to write code. The best debugger is the one between your ears. You have to understand what is happening and then think about the parts. Figuring this out with a debugger would have taken a lot of luck, to say the least. Timing was critical to the deadlock. I'm really surprised that it happened as often as it did.
Anyway... that was the biggie for the day. I'll track the code for the rest of the day and make sure I'm right, but it's gotta be the problem.