Tricky Bug Revisited

Today I was hit by a very large number of calculation process stalls - nearly 200 in all. That tells me I hadn't really solved the problem with the changes I've made, and I need to get a little more creative about the problem and its solution. So that's what the majority of today was about - getting creative.

From today's work I could see that a complete pass of a calculation was being done. First, the "Are you ready?" was sent to the calculation process, and it answered with "Yup, I'm ready". Then the calculation set was sent, operated on, and the results returned to the server process. Finally came the added step of the "Thank You" being sent and the "Welcome" returned. All this worked every time.

The problem seems to be in starting the process the next time around. Again, this doesn't happen all the time, and in fact, most times it's fine. But the snag is in the sending of the "Are you ready?" message, which never seems to get to the calculation process. So I created a new method on the server-side communication object: handshake(), which sends an int to the calculation process and receives one back. This new method is now used in a lot of places in the server-side object, and in addition to what those exchanges always did, it now retries when the response from the calculation process times out.

See... the calculation process should do this handshaking very fast, so a simple 30 sec. timeout is about 30 times bigger than it needs to be. But after 30 sec. we can be sure that there's no way the calculation process is going to answer, so we'll try it again. The question then becomes: what happens next?
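To make the idea concrete, here's a rough sketch of a handshake with a timeout and retry. It's in Python purely for illustration and is not the actual server code: the message codes READY_QUERY and READY_REPLY, the recv_exact() helper, the 4-byte network-order int, and the retry count of 3 are all assumptions of mine.

```python
import socket
import struct

READY_QUERY = 1  # hypothetical code for "Are you ready?"
READY_REPLY = 2  # hypothetical code for "Yup, I'm ready"


def recv_exact(sock, count):
    """Read exactly `count` bytes, or raise if the peer closes the socket."""
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        data += chunk
    return data


def handshake(sock, query, expected, timeout=30.0, retries=3):
    """Send one int and wait for the expected int, resending after a timeout.

    Returns True on the expected reply, and False if every attempt times
    out or an unexpected value comes back.
    """
    sock.settimeout(timeout)
    for _ in range(retries):
        sock.sendall(struct.pack("!i", query))
        try:
            (reply,) = struct.unpack("!i", recv_exact(sock, 4))
        except socket.timeout:
            continue  # no answer within the timeout, so try again
        return reply == expected
    return False
```

One thing a sketch like this glosses over: if a late reply to the first attempt shows up after the resend, the two sides can get out of step, so the real thing would have to decide how to treat stale answers.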

If the retries are done and they all time out, then we'll know that it's not a timing issue, but a socket state issue. There are really only two things that can be at fault in this case: the timing of the data was such that the buffers were corrupted, or the socket is really disconnected when it thinks it's connected, and so we need to kill the process and start another.

Personally, I'm thinking it's the socket. I think it's gotten itself into a state where it thinks it's OK, but it's not. The problem then is that I need to kill the connection from the server side and put the calculation bundle back on the queue so it's not lost to the world. I think this will be something I can do, but I need to know for sure that this is the problem and not a simple timing issue.
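If it does turn out to be the socket, the recovery might look something like the sketch below. Again, this is illustrative Python, not the real server: dispatch(), the conn object with send() and close(), the pending queue, and the handshake_fn callable are all names I made up for the example.

```python
import queue


def dispatch(bundle, conn, pending, handshake_fn):
    """Hand a bundle to a calculation process, or requeue it on failure.

    `handshake_fn(conn)` is assumed to return True only when the
    calculation process answered the "Are you ready?" exchange.
    """
    if handshake_fn(conn):
        conn.send(bundle)  # hypothetical: ship the calculation set
        return True
    conn.close()           # the socket can't be trusted, so drop it
    pending.put(bundle)    # keep the bundle from being lost
    return False
```

The point of the ordering is that the bundle goes back on the queue whenever the handshake fails, so another calculation process can pick it up.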