Tricky Little Bug

I've been working on a very tricky little bug in a C++ server that has been pestering me for literally months. For the longest time I was convinced that the bug was not in my code but was, in fact, in the Linux kernel and its handling of socket I/O. It was a compelling argument, and I'm still not convinced that the kernel isn't making matters worse, but that's for later.

The problem manifested itself like this: one server process on a machine, and five machines each running eight calculation processes, all talking to the server process on the 'main' machine. Things would be fine for a long time... then, for no apparent reason, one of the calculation machines would have all of its calculation processes (all eight of them) stop communicating with the server process. Since each calculation process (32 in total) had its own connection to the server process, it seemed very unlikely that one of the calculation processes was affecting the others on the box. The Red Hat engineer agreed with me: the processes were independent, and the only thing they shared would be some part of the kernel on that box.

So I did a lot of debugging in the different processes, and it appeared that the problem was ultimately in the poll() call on the main machine. Everything pointed to this - but I had to back up a bit and take a hard look at what I was doing and the assumptions I'd made to get to this point, because I had the feeling that there was no way it was in poll().

What I started looking at was the possibility that the problem was not in the discrete method calls, but in the implicitly asynchronous nature of the socket I/O. For example, the data was getting sent from the calculation process to the server process, but it was buffered along the way. While it appeared that the write and read operations were completing, the write was really writing to a buffer, and that buffer would be sent when the kernel got around to it. Likewise, the read would only complete when enough data had been received for the kernel to pass it to the process. So it was possible for the writing and the reading to get out of step with each other.
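
To make that concrete, here's a generic sketch of the kind of loop you need once you take that buffering seriously - this isn't the project's actual I/O layer, and sendAll()/recvAll() are names I'm making up here, but it shows why a single call isn't the whole transfer:

    // Generic sketch - not the server's actual I/O layer. A single send()
    // or recv() can move fewer bytes than asked for, so robust code loops
    // until the whole message has been handed off or delivered.
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <cstddef>

    // keep calling send() until every byte has been handed to the kernel
    bool sendAll( int aSocket, const char *aData, size_t aSize )
    {
        size_t  lSent = 0;
        while (lSent < aSize) {
            ssize_t lCnt = send(aSocket, aData + lSent, aSize - lSent, 0);
            if (lCnt <= 0) {
                return false;    // error or connection closed
            }
            lSent += lCnt;
        }
        return true;
    }

    // keep calling recv() until the expected number of bytes has arrived
    bool recvAll( int aSocket, char *aBuff, size_t aSize )
    {
        size_t  lRead = 0;
        while (lRead < aSize) {
            ssize_t lCnt = recv(aSocket, aBuff + lRead, aSize - lRead, 0);
            if (lCnt <= 0) {
                return false;    // error or connection closed
            }
            lRead += lCnt;
        }
        return true;
    }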

I started looking at the serialization code and ran into the following code for serializing a list of pointers:

    template< class T >
    void writePointers( Writer & aWriter, tList<T *> & aList )
    {
        // write the element count first...
        aWriter << aList.length();
        // ...then each element in turn
        tIterator<T *> lIterator = aList.begin();
        while (lIterator.hasNext()) {
            aWriter << *lIterator.getNext();
        }
    }

with a similar method for reading them in on the other end:

    template< class T >
    void readPointers( Reader & aReader, tList<T *> & aList )
    {
        aList.clear();
        // read the element count first...
        int  lLength = 0;
        aReader >> lLength;
        // ...then read in exactly that many elements
        for (int i = 0; i < lLength; i++) {
            T  *lNew = new T;
            aReader >> *lNew;
            aList.addBack(lNew);
        }
    }

So, in theory, we write the size and then each element, and the reader gets that size and then reads in that many elements. Pretty simple. The problem was, when I looked at it in light of the buffered socket I/O, I realized that if the list changed size after the count had been written but before the iteration finished, then the writer and reader would disagree about how many elements were coming, and we were in trouble. Also, what about NULLs? Writing one means dereferencing a NULL pointer, and the reader has no way to represent it.

So, the change I made was to tag each element before transferring it. Basically, a little handshaking was done within the list serialization - one code said "Hey, I'm sending a NULL", and that could be dealt with by the receiver. Another code said "Hey, here comes a good one", and a final code said "Hey, no more to send". With this, I didn't have to send the size first; the size was determined by the contents rather than by a count taken before the iterating.
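
In rough form, the change looked something like this - the tag names and values here are illustrative placeholders rather than the actual codes from the project, but the shape is the same:

    // placeholder tag values - the real codes just need to be distinct
    static const int kNullElement = 0;   // "Hey, I'm sending a NULL"
    static const int kGoodElement = 1;   // "Hey, here comes a good one"
    static const int kEndOfList   = 2;   // "Hey, no more to send"

    template< class T >
    void writePointers( Writer & aWriter, tList<T *> & aList )
    {
        tIterator<T *> lIterator = aList.begin();
        while (lIterator.hasNext()) {
            T  *lElem = lIterator.getNext();
            if (lElem == NULL) {
                // a NULL gets a tag and no payload
                aWriter << kNullElement;
            } else {
                // a real element gets a tag followed by its contents
                aWriter << kGoodElement;
                aWriter << *lElem;
            }
        }
        // tell the reader there's nothing more to come
        aWriter << kEndOfList;
    }

    template< class T >
    void readPointers( Reader & aReader, tList<T *> & aList )
    {
        aList.clear();
        int  lTag = kEndOfList;
        while (true) {
            aReader >> lTag;
            if (lTag == kEndOfList) {
                break;
            } else if (lTag == kNullElement) {
                aList.addBack((T *)NULL);
            } else {
                T  *lNew = new T;
                aReader >> *lNew;
                aList.addBack(lNew);
            }
        }
    }

The nice side effect is that the count can never be wrong, because there isn't one - the stream itself says when it's done.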

So far, this has gotten rid of these stalls in the calculation processes. It's all about defensive programming - assuming things really gets us all into trouble.

UPDATE: Unfortunately, it only took a few days for this bug to pop up again. While I'm happy with the change I made, it wasn't the core of the issue. Crud. Now I'm back to trying to find out why the communication is getting messed up.