Tracking Down Problems in Vendor Libraries

Today I got an email from a tech support guy at one of our vendors, asking if I'd tried their latest version to see whether it fixed the problem I've been having. The product is a proprietary message bus that's very simple and, at the same time, fast. The problem is that it's not really fast enough to justify standing on its own. It's old enough, though, that when it was new it was probably pretty useful in the context of their system.

He asked a bunch of questions, all things I'd gone over with the last guy who contacted me about this 18-month-old bug, but I wrote back a detailed message saying what I'd tried and what I believe the issue to be. However, because of the way they wrote their library, there's really no way for me to know what's happening under the covers.

I suspect that it's in the socket handling, specifically the use of the poll() system call and what it returns under different conditions. I know from experience that poll() can be tricky on Linux if the socket gets into a weird state. My friend who wrote our C++ wrapper around the vendor's C API used poll() to see if there was a pending message on the socket and, if so, called the vendor's handling method. But it was just the simple poll() call, nothing more.

I decided to see if the improved poll() I had written for CKit would help. It has much better error handling, and maybe that's where the issue lies. It's certainly a good place to start. So I added that code, and we'll see what happens.

If that doesn't fix it, then I'm going to start logging the activity around the poll() call to see whether the vendor's handler function is hanging, or whether there's even data on the socket to read. I'm not sure what's going to come of this, but it's an interesting diversion. Maybe I can come up with a workaround and fix this horrible problem.