Tricky Problem with Market Data Server and Ticker Plant

MarketData.jpg

Today I came face-to-face with a bug that's been giving me grief for quite a while. I'd like to say that I've solved it, but I only think I might have, and only time will tell whether I really did.

The problem relates to my market data server and its connections to my ticker plant. In theory, these connections are reliable - breaking and re-making themselves as needed so that requests get through and are serviced as quickly and efficiently as possible. But there seems to be an issue with my production box and its connection to its ticker plant.

Upon restart, everything is fine. But after a few days, the data returned from the server - data provided by the plant - "flip-flops" from a valid value to zero and back again. It's most annoying, and very difficult to track down. It doesn't happen on my development box, and developing or testing in production is just way too dangerous.

So I was thinking it might be these connections, and I told the server to drop all connections to the plant and create new ones. This had the same effect as a restart: all the requests were good afterwards. This leads me to believe the problem is not in the main server but in the satellite processes that talk to the ticker plant. Armed with this, I dug into that code to see if I could spot anything.

What I saw, and what I hoped to see, was that I was depending on the connection to the ticker plant being 100% reliable. Why? Well... because it typically is. But why not just use the connection's methods to see if the connection is valid? And if it's valid, why not send a simple ping to the server? If either of these fails, then drop the connection and create a new one. That seems far more reliable than simply assuming that the connection will take care of itself.
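To make the check-then-ping idea concrete, here is a minimal sketch. It is written in Python purely for illustration, and it is not the actual server code. The class `PlantConnection`, its methods `is_connected()`, `ping()` and `close()`, and the helper `checked_connection()` are all hypothetical names standing in for whatever the real connection object offers.

```python
class PlantConnection:
    """Hypothetical stand-in for one connection to the ticker plant."""

    def __init__(self, name):
        self.name = name
        self.connected = True    # what the connection itself reports
        self.responsive = True   # whether the far end actually answers

    def is_connected(self):
        # Cheap local check: does the connection think it is still up?
        return self.connected

    def ping(self):
        # A real ping would be a small round trip with a timeout;
        # here it is simulated with a flag.
        return self.connected and self.responsive

    def close(self):
        self.connected = False


def checked_connection(conn, make_connection):
    """Return conn if it passes both checks, otherwise a fresh connection.

    make_connection is a caller-supplied factory (an assumption of this sketch).
    """
    if conn is not None and conn.is_connected() and conn.ping():
        return conn
    if conn is not None:
        conn.close()             # drop the suspect connection
    return make_connection()     # and replace it with a new one
```

The idea would be to call `checked_connection()` once at the start of each request, so a connection that looks up but no longer answers gets replaced before any data is read from it.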

This is only done once per request from the server, so the overhead should be minor. What I think was happening is that the flip-flopping came from hitting different connections to the ticker plant, and it was just the luck of the draw which one I got. Why some went bad and didn't heal themselves on the production box, I don't know. It doesn't happen on the dev box. Maybe there's something in the networking... I don't know. But I'm hoping that this change will make the server much more reliable.
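A toy model of that hypothesis shows how one bad connection among several could produce the flip-flop. Everything here is assumed for illustration: the `StubConnection` class, the round-robin selection (the real selection may be different or effectively random), the symbol, the price, and the idea that a bad connection answers with zero rather than raising an error.

```python
from itertools import cycle


class StubConnection:
    """Hypothetical connection that returns a price, or 0.0 once it has gone bad."""

    def __init__(self, healthy):
        self.healthy = healthy

    def get_price(self, symbol):
        # A stale connection is assumed to answer with zero instead of failing.
        return 101.25 if self.healthy else 0.0


def serve_requests(pool, symbol, count):
    """Serve count requests, taking connections from the pool in round-robin order."""
    picker = cycle(pool)
    return [next(picker).get_price(symbol) for _ in range(count)]


# One good connection and one that has silently gone bad.
pool = [StubConnection(True), StubConnection(False)]
results = serve_requests(pool, "IBM", 4)
print(results)  # prints [101.25, 0.0, 101.25, 0.0]
```

With every connection validated before use, the bad one would be replaced on its first turn and the zeros would stop, which matches what dropping and re-creating all the connections did.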

We'll have to wait and see.