Debugging Replicated Database Problems
Well... as I thought it might, the read-only copy of the instrument master database failed on me last night. The primary is working fine, but I feel it's necessary to find a test case, or condition, where the replicated database fails, so that I can give it to the team working on that project and they, in turn, can fix the underlying issue(s). I'm sure the local database admins will be involved, as they have to be: we don't have that level of control over the servers and the machines. So, mauled by the sharks (from my previous post), I go back into the water trying to find the test case that will highlight the problem.
Last evening, the server was restarted at 5:49 pm, and the symbol set was divided into four groups of 889 underlyings, all four of which were sent out to the database proxy for loading. Typically, all four will finish within a few minutes of each other, but last night the first one finished at 17:55:12 and the second at 17:55:50, while the third and fourth never finished. When I reconfigured the server to point to the primary, the four finished within four minutes of each other, as they should. Clearly, something about the replicated database was causing two of the loading threads to sit there waiting for data to come back. The question is, how to reproduce this?
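One way to turn that observation into a repeatable test is a small harness that splits the symbols into four groups, runs one loader thread per group, and reports which groups failed to finish before a deadline. What follows is only a sketch in Python, not the server's actual loader: `split_into_groups`, `run_loaders`, and the `load_group` callback are hypothetical names, and `load_group` stands in for whatever call really goes to the database proxy.

```python
import threading
import time


def split_into_groups(symbols, n_groups):
    """Split symbols into n_groups contiguous chunks of near-equal size."""
    size, extra = divmod(len(symbols), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        end = start + size + (1 if i < extra else 0)
        groups.append(symbols[start:end])
        start = end
    return groups


def run_loaders(groups, load_group, timeout_s):
    """Run load_group(group) on one daemon thread per group.

    Returns one entry per group: the elapsed seconds if the loader
    returned before the shared deadline, or None if it did not
    (a hang and an exception in load_group both show up as None).
    load_group is a hypothetical stand-in for the real loading call.
    """
    elapsed = [None] * len(groups)

    def worker(idx, group):
        t0 = time.monotonic()
        load_group(group)
        elapsed[idx] = time.monotonic() - t0

    threads = [
        threading.Thread(target=worker, args=(i, g), daemon=True)
        for i, g in enumerate(groups)
    ]
    for t in threads:
        t.start()

    # All threads share one deadline, so a hung group cannot hold up the report.
    deadline = time.monotonic() + timeout_s
    for t in threads:
        t.join(max(0.0, deadline - time.monotonic()))

    # Copy the results so late finishers cannot change what was reported.
    return list(elapsed)
```

Pointing the same harness at the primary and at the replica, with the same groups and the same deadline, would at least show whether it is always the same groups that stall. The threads are daemon threads so that a loader stuck waiting on the replica does not keep the test process alive.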
It becomes more of a quandary when you take into account that my development server started at 7:00 pm local time and was fine using the read-only database, with all four of its loading threads finishing within a few minutes of each other. So something was happening to the replicated copy between 5:50 and 7:00 pm that caused this problem, and it was gone by 7:00 pm.
I have a simple web page on the server's editor that allows me to look at the database operations that are being done in the code, to see what the data is in the database and what's being retrieved. This has really helped a lot in the diagnosis of database issues like bad prices and missing key values. Yesterday, when we were having problems with the replication and the prices, there were a few times when this page would not return all the data. Because it's a Perl script, it'd return what it had processed, but it would still act as if there were more to read (because there was), and yet nothing would come back. I'd love to be able to reproduce that for the guys.
Unfortunately, I haven't been able to. I have no tools at my disposal other than the requests I make. I'll keep hitting it throughout the day, but I don't hold out a lot of hope that this is going to point to anything conclusive. This leads me back to the same spot I was at yesterday - do I trust it? Today, however, the answer is different: No. I'll trust it when I have to trust it and not before. Since no one is really as concerned about this as I am, I'll stick to using the primary and see where the chips fall. There's no reason to risk production outages when all I've got to diagnose the problem is a few data loading scripts.
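For the "partial results, then nothing" behavior the web page showed, a probe that notices when rows stop arriving could make repeated requests more useful than watching by eye. The following is a minimal Python sketch under stated assumptions: `read_with_stall_detection` is a hypothetical helper, and `rows` is any iterable of result rows (for example, a database cursor), since nothing here is known about the actual driver or proxy interface.

```python
import queue
import threading

_DONE = object()  # sentinel marking a clean end of the result set


def read_with_stall_detection(rows, stall_timeout_s):
    """Consume `rows` on a daemon thread and watch for stalls.

    Returns (received, stalled): the rows read so far, and True if no
    row (or end-of-data marker) arrived within stall_timeout_s seconds.
    An exception raised while iterating also surfaces as a stall,
    because the end-of-data marker is never sent.
    """
    q = queue.Queue()

    def pump():
        for row in rows:
            q.put(row)
        q.put(_DONE)

    threading.Thread(target=pump, daemon=True).start()

    received = []
    while True:
        try:
            item = q.get(timeout=stall_timeout_s)
        except queue.Empty:
            # Nothing arrived in time: report what we have and flag it.
            return received, True
        if item is _DONE:
            return received, False
        received.append(item)
```

Run on a schedule with a timestamp logged next to each result, this would record when the stalls happen and how many rows came back before each one, which is the kind of evidence the replication team could act on.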