Archive for May, 2007

Tricky Bug on the Ropes

Thursday, May 17th, 2007

OK, I think I have the Tricky Bug on the ropes now. What was unique about yesterday was that I had an instrument with a single-point volatility curve with a value of 0.01 - which is exceptionally small. This led to outlandish greeks and TVs - we're talking 30 digits. So, I got to thinking... it would make a lot of sense if the calculation libraries were able to come up with something for the results, but in so doing caused memory corruption that messed up the file descriptors and, therefore, the sockets.

These calculation libraries have outright crashed with bad data in the past, so I do a lot of pre-emptive checks to make sure that what I'm passing in has a good chance of making it all the way through. But if this data is just right on the edge, it's possible that the values are good enough to make it back, but internally, the calculation libraries are so badly damaged that they throw off the box.

To check this, I've added the checks for the small single-point volatility curves into the code. If this stops the problem then I'll know. I really think this is the problem.
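A minimal sketch of what such a pre-emptive check might look like. The VolPoint struct, the isCurveUsable() name, and the kMinSinglePointVol threshold are all made up for illustration - the real limit depends on what the calculation libraries can tolerate.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical representation of one point on a volatility curve.
struct VolPoint {
    double expiry;      // time to expiry, in years
    double volatility;  // 0.25 means 25%
};

// Assumed floor for a single-point curve; not a value from the real code.
const double kMinSinglePointVol = 0.02;

// Returns true when the curve looks safe to pass to the calculation library.
bool isCurveUsable(const std::vector<VolPoint> & aCurve)
{
    if (aCurve.empty()) {
        return false;
    }
    for (const VolPoint & lPoint : aCurve) {
        // reject NaN, infinite, zero and negative volatilities outright
        if (!std::isfinite(lPoint.volatility) || lPoint.volatility <= 0.0) {
            return false;
        }
    }
    // a lone, tiny volatility is the case that produced the 30-digit values
    if (aCurve.size() == 1 && aCurve[0].volatility < kMinSinglePointVol) {
        return false;
    }
    return true;
}
```

With this threshold, a single-point curve at 0.01 is rejected, while a single point at 0.25 passes.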

Blender and What I Wish I could do

Thursday, May 17th, 2007

Ever since Blender was a commercial product, with a free version for SGI machines, I've played with it. I say played because while I wish I were a graphic artist, I'm not. Not even close.

Oh, I can appreciate good work, and I can do a decent job of things when I have to, but I'm no real artist. I know I haven't really put forth the effort to be a good artist - hours and hours of study and practice, but it's as much the inspiration and native ability that I'm lacking. I'm not a doodler, and that is probably a key element in what makes a person put the effort into the study. So I guess I'm a graphics hacker and leave it at that.

Still... a new version of blender is out, and it runs well on Mac OS X, so I picked up a copy, and should that inspiration finally strike, I know I'll be ready for it.

WordPress and Daylight Savings Time

Thursday, May 17th, 2007

Yesterday I noticed that the posts to this WordPress blog were an hour off. Very odd. So I looked into the MarsEdit preferences, and then started looking at the WordPress settings, as I remembered that the default time of UTC+0 was being used at first. What I found was that MarsEdit was OK, and it was WordPress and its inability to handle Daylight Savings Time that was the real problem. So I did a little digging.

Seems I'm not the first to see this, and they still haven't fixed it. Odd... but the good news was that there was a plugin for WordPress that fixed the DST problem, but it only really worked by setting the TZ environment variable and that meant Unix boxes. No big deal to me, as the host is a linux box, so I picked it up and deposited it in the WordPress plugin directory and then activated it.

Bingo! The timezone could be set to America/Chicago and things seemed to work perfectly. Nicely done. I can appreciate many of the comments on the WordPress web sites that this is not a complete fix for the problem, and I have to agree. It's a hack. A nice hack, but it isn't as universally useful as WordPress itself, and for that, there should be a better fix. But for now... for me... this fixes the problem.

HostMonster is Excellent

Wednesday, May 16th, 2007

I've recently hosted a site with HostMonster and I have to say that while I suspected that I'd really be glad I did this once it was done, I had no idea how nice it would be. I'm sure a lot of good hosting providers are like this, but the ability to add the Secure POP/SMTP over SSL today was just too much for me! I can now deal with secure email to and from the site within Mac OS X's Mail.app which makes it look like just another part of my InBox.

I'm also amazed at the capabilities of the site for self-management. I'm sure this is why the costs are as low as they are - make it easy to have the folks maintain their own accounts and things will be cheaper to run. But I had no idea that web hosting had tools like these. Very nice. It's really quite amazing.

If you ever need a site hosted - give HostMonster a look, you won't regret it.

Tricky Bug Revisited

Wednesday, May 16th, 2007

Today I was hit by a very large number of calculation process stalls - nearly 200 in all. This points out that I hadn't really solved the problem with the changes I've made, and I need to get a little more creative on the problem and its solution. So that's what the majority of today was about - getting creative.

From today's work I could see that the complete pass of a calculation was being done. First, the Are you ready? was being sent to the calculation process, and it was answering with Yup, I'm ready. Then the calculation set was sent, operated on, and the results returned to the server process, and then the added step of the Thank You being sent and the Welcome returned. All this worked every time.

The problem seems to be in the starting of the process the next time. Again, this doesn't happen all the time, and in fact, most times it's fine. But it's in the sending of the Are you ready? message that never seems to get to the calculation process that things hit a snag. So I created a new method on the server-side communication object: handshake() which does the sending and receiving of an int to (and from) the calculation process. This new method is now used in a lot of places in the server-side object, and in addition to the things it always did, it's got a retry based on a timeout of the response from the calculation process.
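Here is a rough sketch of a handshake with a timeout-based retry, assuming a plain POSIX socket descriptor. The signature, the parameter names, and the use of poll() for the timeout are my assumptions for illustration and not the actual server-side method.

```cpp
#include <cassert>
#include <cstdint>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Sends aToken to the peer, then waits up to aTimeoutMs for an int reply.
// On a timeout the token is sent again, up to aMaxTries attempts in all.
// Returns true and fills aReply when a reply arrives; false otherwise.
bool handshake(int aFd, int32_t aToken, int32_t & aReply,
               int aTimeoutMs, int aMaxTries)
{
    for (int lTry = 0; lTry < aMaxTries; ++lTry) {
        if (write(aFd, &aToken, sizeof(aToken)) != (ssize_t)sizeof(aToken)) {
            return false;   // the write itself failed; a retry won't help
        }
        struct pollfd lPoll;
        lPoll.fd = aFd;
        lPoll.events = POLLIN;
        lPoll.revents = 0;
        int lReady = poll(&lPoll, 1, aTimeoutMs);
        if (lReady < 0) {
            return false;   // poll() error (EINTR is not retried in this sketch)
        }
        if (lReady == 0) {
            continue;       // timed out - send the token again
        }
        // simplification: assumes the 4-byte reply arrives in one read()
        return read(aFd, &aReply, sizeof(aReply)) == (ssize_t)sizeof(aReply);
    }
    return false;           // every attempt timed out
}
```

One thing to watch with this approach: if the peer was just slow and not stuck, a resent token means a duplicate reply could show up later, so the receiver has to be able to tolerate that.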

See... the calculation process should do this handshaking very fast, and so a simple 30 sec. timeout is about 30 times bigger than it needs to be. But after 30 sec. we can be sure that there's no way that the calculation process is going to answer. So we'll try it again. The question will become: what happens then?

If the retries are done and they all time out, then we'll know that it's not a timing issue, but a socket state issue. There are really only two things that can be at fault in this case: the timing of the data was such that the buffers were corrupted, or the socket is really disconnected when it thinks it's connected, and so we need to kill the process and start another.

Personally, I'm thinking it's the socket. I think it's gotten itself into a state where it thinks it's OK, but it's not. The problem is then that I need to kill the connection from the server-side, and then re-add the calculation bundle back to the queue so it's not lost to the world. I think this will be something I can do, but I need to know for sure that this is the problem and not a simple timing issue.

Coding and New MacBooks

Tuesday, May 15th, 2007

Today has been a good day for coding... I've fixed several issues with the Server and they will test tomorrow and then deploy, and I've seen the new MacBooks from Apple. I have to say, the hype on the rumor sites this morning played this up a lot bigger than it turned out - yes, the speed is nice, and the drive size increases are nice, but really, it's a little move and not something that's really earth-shaking.

It was fun to watch the Diggnation podcast today. Those guys are so like how I used to act with my friends that it's always worth a good laugh. The stories are sometimes disturbing, but that's life. Also, the weekly News from Lake Wobegon podcast came out today and that's another giggle.

Pretty calm day... coding... giggling... not bad at all.

The Simple Things

Monday, May 14th, 2007

Sunday night... all's quiet... then comes the ring. It's work... things are down... please help.

It's not a great way to start the week. Last night I got called because things weren't working as they should. I came to find out that the linux servers had been rebooted without anyone letting me know. This is a serious problem as the processes that run on most of them can't be in the chkconfig profile to start on startup. So I was trying to fix one thing, and found that another process on another machine wasn't running and that was the source of the problem, etc. It was a mess.

When I finally got things under control this morning for production, I counted seventeen, that's 17, machines that had been rebooted, and not a word on the planned outage, or a phone call on the work done so that I could get things going again in an orderly fashion. So I had to send an email to the Unix Sys Admins asking why I wasn't told anything. No reason, no excuse. Just my bad luck, was the answer.

Next time, they are going to call me. I manage apps on 29 servers around the globe and they all depend in one way or another on each other. I can't be in the dark like this on a regular basis. No fun at all.

Tracking Down Tricky Bug

Thursday, May 10th, 2007

I've been dealing with this nasty communications bug in the code I'm working with (MarketMash server), and I thought I'd solved the problem by eliminating a possible problem in the serialization of a list of pointers, but I was wrong. It wasn't fixed.

The basic protocol for the communication of one unit of work between the server and the calculation engine goes something like this:

  • the server sends the engine Are you ready?
  • the engine responds to the server Yup, I'm ready
  • the server sends a complete description of the calculation(s) to perform - serializing them out over the socket in a byte stream
  • the engine gets the request, processes it, and streams back the response

then the process repeats itself over and over again. The problem manifests itself as the engine is waiting at the top of the loop for an Are you ready? message, and the server is waiting for something from the engine. So, to try and nail down that the response is getting sent to the server and received properly, I've modified the protocol to look like this:

  • the server sends the engine Are you ready?
  • the engine responds to the server Yup, I'm ready
  • the server sends a complete description of the calculation(s) to perform - serializing them out over the socket in a byte stream
  • the engine gets the request, processes it, and streams back the response
  • the server receives the complete response and sends the engine Thank You
  • the engine logs the Thank You and responds to the server Welcome
  • the server receives the Welcome and logs if it doesn't get it

The goal of this is to make sure that I can see that the response is getting sent back to the server and received properly. If not, then the Thank You will not be received and I'll be able to tell that in the engine logs.
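The extended protocol above can be sketched as a small state machine on the engine side. The message codes, the Engine class, and runUnitOfWork() are hypothetical names for illustration; the real code streams these messages over a socket, where this version just passes ints in memory.

```cpp
#include <cassert>

// Hypothetical integer codes for the protocol messages.
enum Message {
    ARE_YOU_READY = 1, YUP_IM_READY, CALC_REQUEST,
    CALC_RESPONSE, THANK_YOU, WELCOME
};
const int OUT_OF_SEQUENCE = 0;

// Engine side: answers each message only if it arrives in protocol order.
class Engine {
public:
    Engine() : mExpected(ARE_YOU_READY) {}

    int onMessage(int aMessage)
    {
        if (aMessage != mExpected) {
            return OUT_OF_SEQUENCE;     // this is where the engine would log
        }
        switch (aMessage) {
            case ARE_YOU_READY: mExpected = CALC_REQUEST;  return YUP_IM_READY;
            case CALC_REQUEST:  mExpected = THANK_YOU;     return CALC_RESPONSE;
            case THANK_YOU:     mExpected = ARE_YOU_READY; return WELCOME;
        }
        return OUT_OF_SEQUENCE;
    }

private:
    int mExpected;  // the next message the engine is waiting for
};

// Server side: one complete unit of work, checking every reply.
bool runUnitOfWork(Engine & aEngine)
{
    if (aEngine.onMessage(ARE_YOU_READY) != YUP_IM_READY) return false;
    if (aEngine.onMessage(CALC_REQUEST) != CALC_RESPONSE) return false;
    return aEngine.onMessage(THANK_YOU) == WELCOME;  // server logs if missing
}
```

Making the expected next message explicit means a missing or out-of-order message shows up at a specific step, which is the whole point of adding the Thank You / Welcome exchange.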

I sure hope this helps me track down what the problem really is.

Coda with some Issues

Thursday, May 10th, 2007

I have to say that I love Coda from the guys at Panic. When I first saw the Panic Sans font, I was sold! It's a great tool for keeping moderate-sized web sites up to date if you're a coder, which I am. I mean, I'm not going to give this to my 10-year-old daughter - she needs more graphical tools, but for me, this is a great combination of editor/preview/transfer/docs that makes this easier.

But there are some issues I found this morning in doing a major overhaul of my web site at the new host I have. They are not horrible, but just really, really annoying.

The construction tools are great. It's nice that it's all in one tool with a good editor - I've got SubEthaEdit, and like it a lot. I also like the integration with Transmit and have registered that as well. I like the preview, but I wish it would let me set the default fonts like I can with every browser on the market - this would make the preview look more accurate, but I've already sent them something on that.

No, this is about the uploading capabilities of Coda. Basically, I added a CSS style sheet to my web site and that included several graphics files - all of which were placed into one directory for ease of use. The problem was that the uploading of these files didn't respect the directory structure, and so it got "flattened" on upload and made a mess. When I had multiple files in different directories with the same name, the 'latest' one uploaded won. It was a pain to clean up, but now that I know the problems, and until I get a fix from them, I'm liable to use Cyberduck as my uploader as it doesn't make these types of mistakes. I sent them the bug report and we'll see what happens.

Tricky Little Bug

Wednesday, May 9th, 2007

I've been working on a very tricky little bug in a C++ server that has been pestering me for literally months. For the longest time I was convinced that the bug was not in my code, but was, in fact, in the linux kernel and its handling of socket I/O. It was a compelling argument, and I'm not convinced yet that the kernel isn't making matters worse, but that's for later.

The problem manifested itself as this: one server process on a machine and five machines each with eight calculation processes all talking to the server process on the one 'main' machine. Things would be fine for a long time... then for no apparent reason, one of the calculation machines would have all its calculation processes (all eight of them) stop communicating with the server process. Since each calculation process (32 in total) connected to the server process on its own, it seemed very unlikely that one of the calculation processes was affecting the others on the box. The Red Hat engineer agreed with me, as the processes were independent processes, and the only thing shared would be some part of the kernel on that box.

So I did a lot of debugging in the different processes, and it appeared that the problem was ultimately in the poll() call on the main machine. Everything pointed to this - but I had to back up a bit and then take a hard look at what I was doing and the assumptions I'd made to come to this point. Because I had the feeling that there was no way it was in poll().

What I started looking at was the possibility that it was not the discrete method calls, but that it was in the implied asynchronous functioning of the socket I/O. For example, the data was getting sent from the calculation process to the server process, but it was being done buffered. While it appeared that the write and read operations were completing, the write was really writing to a buffer, and that buffer would be sent when the kernel got around to it. Likewise, the read would be when sufficient data was received to let the kernel pass it to the process. So it might be possible for there to be a disconnect on the writing and reading.
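To make the buffering point concrete: a read() on a stream socket can return fewer bytes than were asked for, so the usual defensive pattern is to loop until the whole message has arrived. This is a generic sketch of that pattern, with readFully() as a made-up name, and not the code from the server.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Keeps calling read() until aSize bytes have arrived, the peer closes the
// connection, or an error occurs. Returns the number of bytes actually read
// (equal to aSize on success, less if the peer closed early), or -1 on error.
ssize_t readFully(int aFd, void * aBuffer, size_t aSize)
{
    char * lDest = static_cast<char *>(aBuffer);
    size_t lGot = 0;
    while (lGot < aSize) {
        ssize_t lCount = read(aFd, lDest + lGot, aSize - lGot);
        if (lCount > 0) {
            lGot += static_cast<size_t>(lCount);   // partial read - keep going
        } else if (lCount == 0) {
            break;                                  // peer closed the stream
        } else if (errno != EINTR) {
            return -1;                              // real error
        }
        // EINTR: interrupted by a signal, just try the read again
    }
    return static_cast<ssize_t>(lGot);
}
```

The same reasoning applies on the sending side: a successful write() only means the kernel accepted the bytes, not that the other end has received them.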

I started looking at the serialization code and ran into the following code for serializing a list of pointers:

    template< class T >
    void writePointers( Writer & aWriter, tList<T *> & aList )
    {
        aWriter << aList.length();
        tIterator<T *> lIterator = aList.begin();
        while (lIterator.hasNext()) {
            aWriter << *lIterator.getNext();
        }
    }

with a similar method for reading them in on the other end:

    template< class T >
    void readPointers( Reader & aReader, tList<T *> & aList )
    {
        aList.clear();
        int  lLength = 0;
        aReader >> lLength;
        for (int i = 0; i < lLength; i++) {
            T  *lNew = new T;
            aReader >> *lNew;
            aList.addBack(lNew);
        }
    }

so, in theory, we write the size and then each element, and the reader gets that size and then reads in that number. Pretty simple. Problem was, when I looked at it in light of the buffered socket I/O I realized that if the size changed after the writing of the size, then we were in trouble. Also, what about NULLs?

So, the change I made was to tag each element before transferring it. Basically, a handshaking was done within the list process - a code said "Hey, I'm sending a NULL", and that could be dealt with by the receiver. Another code would be "Hey, here comes a good one", and a final code said "Hey, no more to send", and with this, I didn't have to send the size first; I could let the size be determined by the contents and not by the size before the iterating.
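A minimal sketch of the tagged scheme, with stand-ins so it can run on its own: std::vector replaces tList, std::ostream / std::istream replace Writer / Reader, and the tag values and function names are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <vector>

// Hypothetical tag codes sent ahead of each element.
enum ElementTag { TAG_NULL = 0, TAG_ELEMENT = 1, TAG_END = 2 };

template< class T >
void writeTagged( std::ostream & aWriter, const std::vector<T *> & aList )
{
    for (size_t i = 0; i < aList.size(); ++i) {
        if (aList[i] == nullptr) {
            aWriter << TAG_NULL << ' ';                       // "sending a NULL"
        } else {
            aWriter << TAG_ELEMENT << ' ' << *aList[i] << ' '; // "a good one"
        }
    }
    aWriter << TAG_END << ' ';                                // "no more to send"
}

// Returns true only if the stream ended with TAG_END. The caller owns (and
// must delete) the elements placed in aList, even when false is returned.
template< class T >
bool readTagged( std::istream & aReader, std::vector<T *> & aList )
{
    aList.clear();
    int lTag = 0;
    while (aReader >> lTag) {
        if (lTag == TAG_END) {
            return true;
        } else if (lTag == TAG_NULL) {
            aList.push_back(nullptr);
        } else if (lTag == TAG_ELEMENT) {
            T * lNew = new T;
            if (!(aReader >> *lNew)) {
                delete lNew;
                return false;       // element was cut off mid-stream
            }
            aList.push_back(lNew);
        } else {
            return false;           // unknown tag - the stream is corrupt
        }
    }
    return false;                   // ran out of data before TAG_END
}
```

Because the reader stops on the end tag and not on a count sent up front, a list that changes length during the write can't leave the reader waiting for elements that never come, and a truncated stream is reported as a failure.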

So far, this has gotten rid of these stalls in the calculation processes. It's all about defensive programming. Assuming things really get us all into trouble.

UPDATE: unfortunately, it only took a few days and this bug popped up again. While I am happy with the change I made, it wasn't the core of the issue. Crud. Now I'm back to trying to find out why the communication is getting messed up.