Making an App Fault-Tolerant with Intolerant Components

I noticed this morning that one of my price injectors was stuck in a poor, sick little infinite loop after the (vendor-provided) service had died and I had restarted it. I had never coded the library to close and re-open the connection. To make my application fault-tolerant to this service's restarts, I decided to dig in and add all the pieces I needed to reconnect properly when the service was restarted.

If only it were easy.

The first thing I had to do was unwind where I was in the processing so that I'd pause what I was doing (or trying to do) when an error with the service was detected. That didn't take too long, but I wanted to make sure I didn't put in the logical equivalent of a goto statement, so it took a little work to handle it properly.
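One way to pause processing without a goto-style jump is to let the failure travel up as a single exception type and handle it in one place. What follows is only a minimal sketch of that idea in Python, not the code from this project: `ServiceDown`, `inject_price`, `run_batch` and the `send` callable are all hypothetical names standing in for the real injector and vendor call.

```python
class ServiceDown(Exception):
    """Raised when the vendor service stops responding (hypothetical)."""


def inject_price(send, price):
    # 'send' stands in for the vendor call. Its failure is translated
    # into one exception type so callers need no special-case jumps.
    try:
        send(price)
    except ConnectionError as exc:
        raise ServiceDown(str(exc)) from exc


def run_batch(send, prices):
    """Inject prices in order; on failure, return the ones not yet sent."""
    for i, price in enumerate(prices):
        try:
            inject_price(send, price)
        except ServiceDown:
            # Stop here and hand back the unsent remainder so the
            # caller can resume after reconnecting.
            return prices[i:]
    return []
```

Returning the unsent remainder keeps the "where was I?" state explicit, so the main thread can decide when to reconnect and pick up again.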

Once that was done, I needed to have the main thread detect this condition and then close and re-open the connection. Here's where I really started to run into problems. While the code appeared to be what I needed, I would get just a few attempts and then a double-free core dump. Every time. And it was always in the vendor's API code. It seemed that no matter what I did, there was no way to avoid this problem. Crud.

So what if I tried to "go around the horn": exit the app with an error condition, and have the guardian script that started the app (and that already restarts it in the case of a core dump) see this and restart the app? That's OK, but what if the restarted app fails right away? Well... that's the problem. So what I tried next was to put a retry loop on the creation of the connection to the service. Maybe that would work.

Better. It seems that so long as the connection is never actually made, you can call open() as many times as you need to get the job done. I was getting a lot closer. Now that I had a way to exit the app with an error and restart it with a retry loop, all I had to do was make sure we didn't litter the directory with core files. The final problem was that trying to close a troubled connection led to the same double frees that I was getting in the first place.
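The retry loop on connection creation can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not the vendor's API: `open_fn` is a hypothetical callable that either returns a connection or raises `ConnectionError`, and the attempt count and delay are made-up defaults.

```python
import time


def open_with_retry(open_fn, attempts=10, delay=2.0, sleep=time.sleep):
    """Call open_fn until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # Nothing has been connected yet, so calling again is safe.
            return open_fn()
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay)  # give the service time to come back
    raise RuntimeError(
        f"service still down after {attempts} attempts"
    ) from last_error
```

Passing `sleep` in as a parameter is just a convenience for testing; raising at the end lets the app exit with an error so the guardian script can take over.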

So I had to put even more logic into the wrapper classes around the vendor's API so that I could be assured that the application could exit cleanly, and then the restart would take care of waiting until the service was up again.
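The kind of wrapper logic described here might look something like the sketch below: remember whether the connection has ever faulted, and skip the close on a troubled connection so the exit stays clean. It is written in Python for brevity, and the `send()`/`close()` interface and the `GuardedConnection` name are assumptions for illustration, not the vendor's actual API.

```python
class GuardedConnection:
    """Wraps a connection object (hypothetical send()/close() interface)
    and remembers whether it has ever faulted."""

    def __init__(self, raw):
        self._raw = raw
        self._faulted = False
        self._closed = False

    @property
    def faulted(self):
        return self._faulted

    def send(self, payload):
        try:
            return self._raw.send(payload)
        except Exception:
            # Mark the connection as troubled, then let the error
            # propagate so the caller can unwind and exit.
            self._faulted = True
            raise

    def close(self):
        # Close at most once, and never call close on a connection
        # that has faulted; the process is about to exit anyway.
        if self._closed:
            return
        self._closed = True
        if not self._faulted:
            self._raw.close()
```

The trade-off is that a faulted connection is deliberately left for process exit to clean up, which is acceptable only because the app is about to be restarted.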

Finally, I had something. It took a few hours, but in the end I have a system that's fault-tolerant to the vendor's service restarts, and that's what I wanted to build today. It's going to make this a much stronger system. Good news.