The Problems of Premature Optimization
This morning I finally figured out my problem from late last week: the serialization problem. The logging really was the key, and the trigger was seeing the deserialization of an array of messages start over, as if something had reset the decoder. Very odd.
Then I looked at the size.
65495.
I about threw myself out the window.
I had assumed that all the messages I was going to receive were single-message containers. For those, it makes sense to store the size as a uint16_t. But what happens when you get a query for 50,000 messages? The response is an array of messages, and its size is a lot bigger than 64k. The "starting over" was the key: the counter was wrapping around, and the decoder was trying to match the wrapped offsets back up to the messages.
Horrible failure.
I went into the code, changed all those uint16_t sizes to uint32_t, and everything was fine. The tests passed, and I could get on with the rest of the testing. But this points out the problem with premature optimization. I was thinking there was no need for a counter/cursor bigger than 64k when every individual message was less than 100 bytes. And I was right.
But I didn't think about the problem of dealing with messages that I create myself, which can be much larger than a single UDP datagram. Being in this business for decades doesn't make you immune to this problem. It's the thought process you have to be careful about, and I'm as guilty here as anyone.
OK, maybe I found the problem a little quicker than some would have, but it still took the better part of a day, and it was annoying to boot.
Lesson learned: Optimize when you have a performance problem. Not before.