Debugging Socket Problems on Vendor Software
I've spent most of today debugging a problem I saw in a vendor's API to a messaging system. It's not the best vendor I've ever worked with. In fact, I don't think they are even in the top 80%. But they are the vendor I have to work with right now, and I've had to try and make the best of it. So here's the problem we ran into today.
We have a price injecting system. It gets price data over a custom socket interface (which I wrote) from a price feed server, then reformats it into the format needed by the vendor's product and sends it on its way. It's a simple transformation system. Nothing fancy. But the vendor's API is socket-based as well, and as we were to learn, not done nearly as well as mine.
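To make the shape of the feeder concrete, here is a minimal sketch of the transformation step. Neither wire format appears in this post, so both the `SYMBOL,PRICE` input line and the `PRICE|SYMBOL|value` output are hypothetical stand-ins, as are the function names.

```python
def parse_price(line):
    """Parse one line from the price feed.

    Assumes a hypothetical 'SYMBOL,PRICE' text format; the real
    feed format is not described in this post.
    """
    symbol, price = line.strip().split(",")
    return symbol, float(price)


def to_vendor_format(symbol, price):
    """Render a price in a hypothetical vendor message format."""
    return f"PRICE|{symbol}|{price:.4f}\n"


def transform(line):
    """Feed line in, vendor message out: the whole job of the feeder."""
    return to_vendor_format(*parse_price(line))
```

In a real feeder, `transform` would sit between a read from the price socket and a call into the vendor's API.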
When the transformer/feeder app was running in Chicago, the price source was in Chicago, and the database for the vendor's product was in New York, everything worked fine. When the feeder was in London and everything else was the same, the feeder missed the messages coming from the vendor's messaging system whenever it was injecting prices.
Inject prices - miss messages. Stop injecting prices - get messages. Make another test app subscribed to the same messages and it always gets them - because it injects no prices. This was getting crazy. So on a whim I decided to make use of a price source in London for the London feeder - Bingo! Now it worked. It appears that the vendor's API can't handle delays in the socket delivery from another, completely different source. Yup. It's got nothing to do with the use of the vendor's API - it's the activity on a completely different socket that's affecting the vendor's API.
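For anyone chasing a similar problem, one way to make this kind of correlation visible is to record the gaps between arrivals on the price socket and compare them against the times the vendor's messages go missing. The following is a minimal sketch under assumptions: `GapTracker` and its threshold are hypothetical names for illustration, not something from the feeder or the vendor's API.

```python
import time


class GapTracker:
    """Tracks the gap between successive arrivals on one socket.

    Hypothetical helper: call mark() right after each successful
    read from the price socket.
    """

    def __init__(self, stall_threshold_s, clock=time.monotonic):
        self.stall_threshold_s = stall_threshold_s
        self._clock = clock      # injectable so it can be tested
        self._last = None        # time of the previous arrival
        self.stalls = []         # (timestamp, gap) for each long gap

    def mark(self):
        """Record an arrival; return the gap since the last one (or None)."""
        now = self._clock()
        gap = None if self._last is None else now - self._last
        self._last = now
        if gap is not None and gap > self.stall_threshold_s:
            self.stalls.append((now, gap))
        return gap
```

Logging `stalls` next to the timestamps of the missed messages would show whether the long cross-Atlantic gaps line up with the drops, which is what the Chicago versus London behavior suggests.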
Note that in all this, my code is working fine. It's the vendor's that's stopped working properly. Nowhere in the documentation do they say that excessive waiting on socket communications will invalidate the delivery of messages - why should they? They probably never tested it, as they probably never had reason to. But when you charge $20 mil for something, you really ought to take a more proactive view of things. For example, don't use a home-grown messaging bus when there are so many commercial ones that you can include in your $20 mil cost and not affect the bottom line much.
In the end, I think I'm stable now, but there's really no way to know. They aren't going to fix this, and I didn't expect them to. They took almost 2 months to fix the last bug we pointed out, and that was a simple recompile with the right data type for a 64-bit version of the API. This would require real changes in how they do things, and change is not a word I'd use with this vendor.
I just have to say that I really hate the fact that it's expected that we figure this out. I'm not getting any part of that $20 mil, and yet I've saved their bacon by figuring out how to make it work in our environment. Crummy vendor.