Archive for October, 2007

Debugging Replicated Database Problems

Thursday, October 4th, 2007


Well... as I thought it might, the read-only copy of the instrument master database failed on me last night. While the primary is working fine, I need to find a test case, or condition, where the replicated database fails so that I can hand it to the team working on that project and they, in turn, can fix the underlying issue(s). I'm sure the local database admins will be involved as well; they have to be, since we don't have that level of control over the servers and the machines. So, mauled by the sharks (from my previous post), I go back into the water trying to find the test case that will highlight the problem.

Last evening, the server was restarted at 5:49 pm, and the symbol set was divided into four groups of 889 underlyings, all four of which were sent out to the database proxy for loading. Typically, all four will finish within a few minutes of each other, but last night the first one finished at 17:55:12 and the second finished at 17:55:50 - and the third and fourth never finished. When I reconfigured the server to point to the primary, the four finished within 4 mins of each other - as they should. Clearly, something about the replicated database was causing two of the loading threads to sit there waiting for data that never came back. The question is: how to reproduce this?

It becomes more of a quandary when you take into account that my development server started at 7:00 pm local time and it was fine using the read-only database - all four of its loading threads finished within a few minutes of each other. So something happened to the replicated copy between 5:50 and 7:00 pm that caused this problem, and it was gone by 7:00 pm.

I have a simple web page on the server's editor that allows me to look at the database operations that are being done in the code to see what the data is in the database and what's being retrieved. This has really helped a lot in the diagnosis of database issues like bad prices and missing key values. Yesterday, when we were having problems with the replication and the prices, I did have a few times when this page would not return all the data. Because it's a Perl script, it'd return what it had processed, but it would still act as if there were more to read (because there was), and yet nothing would come back. I'd love to be able to reproduce that for the guys.

Unfortunately, I haven't been able to. I have no tools at my disposal other than the requests I make. I'll keep hitting it throughout the day, but I don't hold out a lot of hope that this is going to point to anything conclusive. This leads me back to the same spot I was at yesterday - do I trust it? Today, however, the answer is different: No. I'll trust it when I have to trust it and not before. Since no one is really as concerned about this as I am, I'll stick to using the primary and see where the chips fall. There's no reason to risk production outages when all I've got to diagnose the problem is a few data loading scripts.

Depending on Other Groups

Wednesday, October 3rd, 2007

Interesting thing happened in the last 24 hrs. at work. We have a nice instrument master database - it holds all the instrument data as well as marks and such. Very nice. It's so nice that it's replicated to a read-only copy here in Chicago as well as a replicant in London for disaster/recovery purposes. It's great - except when the replication fails.

That's what happened last night. Normally, there's an explanation sent out about why the replication failed. Changed tables, duplicate records - something caused the replication that normally works just fine to fail. But today I didn't hear what the cause of the break was. It appears to be fixed, but if we don't know why it failed, why should we assume that just because it looks OK, it'll stay that way?

No reason to think so in my mind.

So when I ask the guys whose responsibility it is to maintain the database(s) "Is the read-only copy OK to use tomorrow?" and get a simple answer like "It should be good for tomorrow" I get a little nervous. What was the problem? Are we sure it's OK? I'm feeling like I'm on the beach and they're saying:

"Sure, Bob... go back in the water... pay no attention to the fins circling..."

"Oh, are we going in? No, not right now, just ate, don't want to get cramps... but you go right ahead... and splash a lot... here, hold this pork chop."

But since they are the ones responsible for the database(s), I'll listen to them and if I walk out of the surf a bloody, dismembered ghost of myself I'll say "Hey! You said it was OK!", and if there's fallout it'll be up to them, and not me, to answer those questions.

UPDATE: Well... as I might have seen coming, the read-only database was not ready for prime time, and I got the call at home after my server stalled on the restart waiting on the database. I switched it to the primary database and all was fine. Guess that means I'll have to figure out why.

New Acorn 1.0.2 Out Today

Wednesday, October 3rd, 2007

Today the guys at Flying Meat released Acorn 1.0.2 - it's mostly bug fixes, but there are a few new features too. Nothing that's super exciting right now, but it's nice to see the changes and even the little additions. I'm looking forward to 1.1, where Gus says there will be more significant additions like a 'Save for Web' feature that'll try to figure out a good way to save the image at a small size without compromising image quality.

And while we're on the subject of image editors, Photoshop Elements 6.0 is due out in 2008 for the Mac. They haven't really listed the features it'll have, but if it looks like PSE 6.0 for Windows, it's going to be very different from PSE 4.0. It's been said that it looks like Lightroom, but not having seen that app, I can only say that the features look decent but the UI is very "enclosed" - not at all Mac-like. Very much like you'd run this app to the exclusion of other apps - no toolbars, one giant window. Very different. Not sure I like the change.

Interesting Bug in the Server

Tuesday, October 2nd, 2007


This morning I got an email from one of the Hong Kong users about problems they were having with the theoretical values on OTC options in the server. While I didn't really get the proper picture from his email, the follow-up from a user in London provided the proper illumination to see what the problem was. Basically, when changing the volatility and/or dividends curve(s) for an instrument, the 'Open' greeks would be calculated properly and the 'Last' would not.

I dug into the code and saw that the 'Open' greeks are calculated no matter what, but the 'Last' are calculated only if they aren't in the cached data for that instrument already. This was the key - that the cached data was all wrong because it was generated based on the old curves, and with the new curves, all the cached data needed to be invalidated and recalculated.

Once I knew what I needed to do, it was just a matter of putting the code in place to let me clear out the cache at the instrument level, then at the underlying level, which would clear out all the derivatives' caches, etc. Then I had to put in the code to detect the change in the curves as read in from the editor, and putting it all together was pretty easy.
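The invalidation cascade looks roughly like this - a minimal sketch, and every class and method name here is my own illustration, not the actual server code. The point is just the shape: clearing an underlying cascades to each of its derivatives, so the next request for 'Last' greeks finds nothing cached and recalculates from the new curves.

```java
import java.util.*;

// Hypothetical sketch of the cache-invalidation fix described above.
// Names are illustrative; the real server's cache is surely richer.
class GreekCache {
    private final Map<String, Double> lastGreeks = new HashMap<>();
    private final Map<String, List<String>> derivativesOf = new HashMap<>();

    void put(String instrument, double lastGreek) {
        lastGreeks.put(instrument, lastGreek);
    }

    Double get(String instrument) {
        return lastGreeks.get(instrument);  // null means "recalculate"
    }

    void addDerivative(String underlying, String derivative) {
        derivativesOf.computeIfAbsent(underlying, k -> new ArrayList<>())
                     .add(derivative);
    }

    // Clear one instrument's cached 'Last' greeks.
    void invalidate(String instrument) {
        lastGreeks.remove(instrument);
    }

    // Clear the underlying and cascade to every derivative's cache,
    // which is what a curve change has to trigger.
    void invalidateUnderlying(String underlying) {
        invalidate(underlying);
        for (String d : derivativesOf.getOrDefault(underlying,
                                                   Collections.emptyList())) {
            invalidate(d);
        }
    }
}
```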

Once I had the code in place, it was very easy to test and see that we are indeed getting the right greeks for changes in the curves. I also fixed a few issues in the editor so that it would not send the curves back to the server on an edit unless they had been edited by the user. This is going to help with efficiency as well as not making the server think things have changed when they haven't.
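That "only send when edited" behavior is just a dirty flag on the editor side. A minimal sketch of the idea, with hypothetical names (this isn't the actual editor code):

```java
// Hypothetical editor-side dirty flag: the curves are only included in the
// update sent to the server when the user actually touched them.
class CurveEditorState {
    private double[] curve;
    private boolean dirty = false;

    // Loading fresh data from the server resets the flag.
    void load(double[] fromServer) {
        curve = fromServer.clone();
        dirty = false;
    }

    // Any user edit marks the curve as changed.
    void edit(int index, double value) {
        curve[index] = value;
        dirty = true;
    }

    // Returns the curve only if the user changed it; null means "don't send",
    // so the server never mistakes an untouched curve for a real change.
    double[] payloadForServer() {
        return dirty ? curve.clone() : null;
    }
}
```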

Added removeRows() to BKTable

Monday, October 1st, 2007


I had a new developer come up to me today and ask me if it was possible to remove a group of rows from a BKTable, and after a little bit of a sync on the terminology, it was clear that he wanted to delete a group of rows from the table based on some criteria. I thought about this, and it seemed like a good idea even though there's no way to do it now. So while he had to stick with iterating through the table and removing each row after testing it for its applicability in the final results, I decided to ask Jeff if he'd use it enough to justify adding it to the class. Turns out, it's something he'd like to see too. So I spent a little time today putting that into the BKTable.

I went with the idea that you'd provide it a JEP expression and it would either remove the rows that matched the expression or remove every row but those that matched. Since I could use a lot of the same components that are in use in the filter table view, this wasn't too hard. In fact, the effect is very similar to the filter table view, but in this case the removal is permanent and you can easily choose either set to remove.
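The shape of the call is easy to sketch. Since I can't show the actual BKTable or the JEP machinery, this stand-in class uses a java.util.function.Predicate in place of the compiled JEP expression - everything here is illustrative, not the real API:

```java
import java.util.*;
import java.util.function.Predicate;

// Illustrative stand-in for the removeRows() idea: the real BKTable takes
// a JEP expression string; a Predicate stands in for the compiled
// expression so the sketch stays self-contained.
class Table {
    private final List<Map<String, Object>> rows = new ArrayList<>();

    void addRow(Map<String, Object> row) { rows.add(row); }

    int rowCount() { return rows.size(); }

    // Permanently remove rows. With keepMatches false, rows matching the
    // expression are deleted; with keepMatches true, everything BUT the
    // matches is deleted - the same two sets the filter view works with.
    void removeRows(Predicate<Map<String, Object>> expr, boolean keepMatches) {
        rows.removeIf(keepMatches ? expr.negate() : expr);
    }
}
```

So `removeRows(expr, false)` deletes the matches and `removeRows(expr, true)` keeps only the matches - the same effect as the filter table view, except the removal is permanent.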

After I got this in and tested, I talked to Jeff and let him know that it'd be OK to have a list of a few things they wanted in BKit. He mentioned that his guys have been told to come talk to me with suggestions/problems, so I guess there just aren't a lot of problems. OK.