Archive for September, 2007

Looking at Experimental Data

Friday, September 28th, 2007

Today I've spent a good deal of time working with a trader and a little app I wrote for their desk to pull some data from a service in a format that they can read into their applications. It's the kind of thing that I'll put a day or so into and they'll use it for a long time without modification simply because it's a data access component to them. No problem - in theory.

Yesterday, I worked on the socket communications for the clients of my market data server to change the way packets of data were read back from the server into the client's space. Rather than read a 2kB packet and then process it, I changed the code to read everything that was available at the socket, if anything was available. This meant that if we were receiving a 50kB message, we didn't read it in 25 chunks, we read it in one large chunk and then processed that. This made the processing of the data much faster because we didn't re-scan the first 2kB 25 times - we scanned it once. Now a 50kB packet is one thing, but some packets will be 1MB or more. Now we're talking significant savings in time.

Well... today I had to deal with someone that was not convinced that this was faster. In fact, they were convinced that it was slower. Given that they don't know the code, and only see that something has changed, I can understand their need for some kind of assurance that things have changed for the better. So what I did was to run two sets of trials: old versus new, five runs each, same data set to see what we'd get. Ideally, this will be a large enough sample set to be able to factor out the small (or large) variations in the access speed, network traffic, and other variables that you run into on large computer networked applications.

What I found out was that the time to gather the data from the source was somewhat variable, but the time it took to process the data once I had it was pretty controllable. The old way had times for this experiment from 43.2 to 43.7 sec - a pretty nice grouping, and the new way had times from 6.5 to 7.0 sec - again, a nice grouping. While the access times to get the data were much more variable, I made sure to include the access time as well as the processing time so that we could easily see where the time was spent.
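
For what it's worth, splitting the measurement like that takes almost no code. Here's a rough sketch of the kind of harness I mean - fetchData() and processData() are just illustrative stand-ins, not the actual app - timing the access and the processing separately so the variation can be pinned on the right phase:

    #include <stdio.h>
    #include <sys/time.h>

    static double nowSecs() {
        struct timeval  tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1.0e6;
    }

    static void fetchData()   { /* pull the data set from the service */ }
    static void processData() { /* scan and load what came back */ }

    int main() {
        double  start   = nowSecs();
        fetchData();                       // access time - the variable part
        double  fetched = nowSecs();
        processData();                     // processing time - nearly constant
        double  done    = nowSecs();
        printf("access: %.1f sec   processing: %.1f sec\n",
               fetched - start, done - fetched);
        return 0;
    }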

Having spent more than a little time with experimental evidence myself, this was a nice sample size, and the breakout of the data made it clear where the variation was, and wasn't. Unfortunately, for this person, the data didn't say the same thing. They didn't see anything like that in the data. They saw the variation in the total time and said "See, the new is slower than the old here, so your changes hurt the system." I tried to explain that if you looked at the breakout of the times, it was clear that the difference was the time spent in getting the data from the source (not in my control) and that the processing time was nearly constant. But there was no convincing them.

For several hours I fought through this - more trials, asking what would convince them, etc. All the while, I'm thinking that these folks are extremely arrogant. It was only after I stepped back for a bit and looked at what they were saying that it hit me - they aren't arrogant, they're just horrible scientists.

Since their background is advanced degrees in science, I made the poor assumption that they actually were decent at reading experimental data and figuring out what the data is saying to them. After all, the job they have deals with experimental data every single day - it's called prices. Stochastic processes abound in this field, and it should have been second nature to these folks to read experimental data, but it's not. So my sense of frustration with their arrogance quickly turned into sadness about their lack of fundamental skills in this area of their work. Sad but true, I can't imagine trusting them with a dime of my money if they can't read experimental data like this.

As is so often the case, we carry expectations into relationships that are sometimes far below the mark, and sometimes far above. It's all about really getting to know where the other folks are coming from and where their skills and weaknesses are. Now that I've properly calibrated myself to these folks, it'll be much easier to deal with them in the future.

The Financial Industry does have its Personalities

Thursday, September 27th, 2007

There's no denying that the financial industry has its share of personalities. And I'm being kind here, as you will soon see. For the most part, I take it as part of the job. They are millionaires and are making (and losing) millions a day, so they are (somewhat) allowed to be high-strung, and a bit temperamental. But today was an experience that I get only once in a great while and it's worth writing about.

One of the users of my market data provider chatted me saying he was seeing vastly different response times for a 300 symbol request of some historical data. I had helped this person just yesterday get another data request under control, so I was wondering what it could be that we didn't get covered yesterday. Well... I got the request and tried it on two machines - the one that he was running it on, and one of my machines. I've learned enough in this business to know that there are all kinds of unseen issues on some boxes - wild processes, memory hogs, etc. all need to be factored out to make sure you're looking at your problem and not someone else's.

What I found was that he was right - there was a significant difference in the response times, and more importantly, the data provider was getting the data to the server reasonably quickly; it was getting the data from there to the client that seemed to be the problem. I told the user what I thought the problem was, and my attack plan. I always like to be as transparent as possible with the users as it lets them know what I'm doing even if they don't understand it all.

He didn't agree with my attack plan. He wanted me to determine why the variance existed in the first place. I tried to tell him that I believed that the gathering/decoding logic in the client API was incredibly inefficient and the resulting CPU-bound process was wildly varying because the load on the box was wildly varying. He wouldn't listen to it.

So I bit my tongue and went on to prove to him that even though what he thought was faster wasn't really any faster - it was a sample size of one with a large variation in the load. He tried to tell me that I was all wrong. That he had done the tests properly and the sample size didn't matter. I tried to say what I thought was the problem and what my plan of attack was going to be. He only got madder.

We'll move along through this part of the day quickly because it's really not going to help to cover it in any detail. Suffice it to say that I was treated very unprofessionally, but finally was told to keep him apprised of any updates.

I then took my time and put in tests to see where the real time was being spent. I was surprised to see that the vast majority of the time was in reading data off the socket. In fact, the protocol between the client and the server has all datasets ending in a CRLF combination, and so the data that's read in needs to be checked for this combination.

What I realized was that in reading from the socket, I had limited the data read in from the socket to about 2kB a 'chunk'. Each chunk was then added to the result set and checked for the terminal data condition. But imagine if the data was going to be 2MB? That's 1,000 chunks, and the first chunk is going to be checked 1,000 times for the terminal data. There's the problem. So I changed the socket reader to read in everything that's available into the buffer as opposed to a small chunk. What happened was not surprising: all the data was read in at one time, the check was done once, and the result was that the reception of the data was far, far faster than before.
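
The reader change itself is small. This is only a sketch of the idea with illustrative names - not the production code, and the terminator check here is deliberately the simplest possible one - but it shows the shape: ask the kernel how much is already waiting, pull all of it in with one recv(), and look for the CRLF once per read instead of once per 2kB chunk:

    #include <string>
    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>

    bool readDataset(int aSocket, std::string & aBuffer) {
        while (true) {
            // how many bytes are already queued up at the socket?
            int     avail = 0;
            if (ioctl(aSocket, FIONREAD, &avail) < 0) {
                return false;
            }
            // read everything that's there, or fall back to a small read
            // (which will block) if nothing has arrived yet
            size_t  toRead = (avail > 0 ? (size_t)avail : 2048);
            size_t  oldLen = aBuffer.size();
            aBuffer.resize(oldLen + toRead);
            ssize_t got = recv(aSocket, &aBuffer[oldLen], toRead, 0);
            if (got <= 0) {
                return false;
            }
            aBuffer.resize(oldLen + (size_t)got);
            // one terminator check per read, not one per 2kB chunk
            if (aBuffer.size() >= 2 &&
                aBuffer.compare(aBuffer.size() - 2, 2, "\r\n") == 0) {
                return true;
            }
        }
    }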

When I tested this out on his requests I saw 10 to 40 min requests fall to 2.5 min and stay there. Very little variability now because the process isn't as CPU-bound and the transfer takes priority. Very nicely done, in my mind.

When I gave him the results, he wanted to test it, naturally. When he found out how fast it was, he said "Why haven't we done this before now? We've wasted a lot of time. This should have been done LONG ago!" How nice.

I pointed out that when we put this system together it was an order of magnitude faster than the system it replaced. In fact, it was just fine in the performance department for everyone - even him. That I didn't investigate every possibility for a performance improvement is because I had other things to do as well, and they were all very happy with the speed as it was. He admitted that was true. How nice.

So in the end, it's much faster thanks to a little thought that he didn't want to have me consider, and then was mad that I hadn't fixed the problem that didn't exist until today. Amazing person. Truly one of a kind.

Pixelmator versus Acorn

Wednesday, September 26th, 2007

This morning Pixelmator was released as 1.0 and I had to check it out. Certainly, it looks more like Photoshop, but with a decidedly Mac-like twist - very nice. There are a lot of things to like about Pixelmator - the number of file formats is impressive, and being able to read/write Photoshop files is a big plus. Also, the toolbar is a lot more familiar, not necessarily better - just more familiar. There are little things like the color changing, the brushes, etc. They aren't in Acorn - yet.

And there's the problem, for me.

Do I stick with Acorn or go to Pixelmator? Both are "1.0" products, but there's no need to get both; one should do what I need an image editor to do, it's just a question of which one has the same end game that I want. Since they are both young, and both small groups, it's not clear that there's a definitive answer to that question for either application.

I want to like Acorn due to VoodooPad, but the professional look of Pixelmator makes it so darn compelling. It's certainly going to be an interesting year to see which one gets the larger audience, and which one goes for the larger toolset even at the risk of a little complexity. I'm willing to have a little complexity for tools that I would use, it's the other stuff that I'd never use that is really unnecessary for me.

Well... it'll be interesting to watch these products develop.

Identity – The Often Overlooked Test Case

Tuesday, September 25th, 2007

Today I had a very interesting bug that has been sitting in the server's code base for ages. In one of the support libraries there's a standard C++ string class modeled loosely after the Java String class. I inherited this code and never gave it much thought - until today.

The original code had an operator=() method implemented like this:


    const jString & jString::operator=( const jString & anOther ) {
        empty();
        append(anOther);
        return *this;
    }

Where the empty() method and the append(const jString &) methods worked perfectly. The problem came into play when I had something like this in the code:


    jString     one("hello");
    
    one = one;

Now this does not make a lot of sense, and in truth, the actual code was a lot more convoluted than this - but the point is that it's setting itself equal to itself. In any case, it shouldn't have done what it did. What it did was to clear out the value and leave the string empty. Why? Because the empty() method cleared out the string's value - and since anOther is the very same object - the append() call had nothing to copy from. This was a pain in the rear because I believed that the operator=() was smarter than this. It wasn't.

The fix is simple:


    const jString & jString::operator=( const jString & anOther ) {
        if (this != &anOther) {
            empty();
            append(anOther);
        }
        return *this;
    }

and so long as we check to make sure we aren't operating on ourselves, we're good to go. I know that there are probably a good number of cases like this in the code and I'm going to have to check each of the operator=() methods in the libraries, but at least I know what to fix, and the fix is easy.
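
For what it's worth, the classic way to make this whole class of bug impossible is the copy-and-swap idiom. This is just a sketch, not the library's actual code, and it assumes jString has a working copy constructor and a swap() that exchanges the internal buffers - but with that in place, self-assignment simply can't hurt you:

    const jString & jString::operator=( const jString & anOther ) {
        jString     tmp(anOther);     // copy first - safe even when anOther is *this
        swap(tmp);                    // exchange our contents with the copy
        return *this;                 // tmp's destructor cleans up the old contents
    }

The cost is one extra copy on every assignment, so for a string class that gets assigned constantly, the explicit identity check may well be the better trade.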

I know that I'm going to be paying a lot more attention to the overlooked case of 'self' in the code I write from now on.

Emailing from PHP on Mac OS X

Tuesday, September 25th, 2007

When my daughter wanted to have her web site include Flash-based form submission software I needed to make sure that PHP on frosty, my iMac running OS X 10.3, would be able to email out properly. As I found out, 10.3 is not set up to allow this, but it's easy enough to configure once you know how. I Googled the problem and found most of the answer, but there was one line missing and that turned out to be a critical line for the forwarding of emails while in a PHP script. The low-down is that you just need to properly configure postfix on the box, as it's already installed and most of the work is already done for you. So here goes:

First, look at /etc/hostconfig and make sure that there is a line that reads:

  MAILSERVER=-AUTOMATIC-

Next, you need to edit the file /etc/postfix/main.cf and make sure it has the following lines:

  ...
  myhostname = frosty.themanfromspud.com
  ...
  mydomain = themanfromspud.com
  ...
  myorigin = $mydomain
  ...
  inet_interfaces = all
  ...
  mydestination = $myhostname, localhost.$mydomain
  ...
  mynetworks_style = subnet
  ...
  relayhost = smtp.comcast.net
  ...

and then you need to uncomment the line in the file /etc/postfix/master.cf to look like:

  smtp     inet  n    -    n    -    -    smtpd

At this point, if you restart your box, email will go out from the host to any email address on the net. This is also very helpful in that my crontabs are now able to send me email off the box, and that makes it a lot easier to keep track of them.

I've done this on Mac OS X 10.3 and 10.4 and it works great on both. I just wanted to write this into the journal before I forgot how to do it in the event that I need to do it again.

Debugging is a Continuous Effort

Monday, September 24th, 2007

I was out on Friday feeling pretty crummy. Today I'm feeling a little less crummy, only because I'm a little more used to it. When I got back, I saw that my analytics engine had been spitting out a bunch of NaNs for this one analytic. So I started to dig into the problem. What I found amazed me - it should have been broken much worse than this, and much earlier than this.

The engine is written in C++ and as such, those pesky NaNs are something you have to consider, and if you're clever, use them to indicate illegal values, etc. This engine has been running for years doing the same kinds of things day in and day out many times a day - seemingly without problem. But when I dug into it today I realized that a few months ago I changed the datasource for instrument prices and in doing that set the stage for a bug cascade that ended up biting me Friday while I was out.

The problem starts with the data source of prices - Reuters. When they are about 2 hours from a market open, they will zero out all the data in the records they send as an indicator that the instrument is about to enter the active market portion of the day. Normally, this is OK, but the problem comes in when you realize that you need to make a "price" from a bunch of zeros, and that those zeros are telling you to use the historical mark for the instrument and not use the data from Reuters. I thought I had the code in the application to do that, but it seems I was more than a little mistaken.

No... I wasn't ignoring the zero prices, I was converting them to NaNs and putting them into the time series data. I know I was thinking that this would signal later in the code to skip this data point, but even that was unnecessary as I simply should not have overwritten good data with bad no matter what I was thinking I was going to do with it later.
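
The guard the code really needed is tiny. Here's a sketch with made-up names - cleanPrice() and its arguments are illustrative, not the engine's actual API - showing the intent: a zeroed-out pre-open price falls back to the historical mark, and nothing ever overwrites good data with a NaN:

    double cleanPrice(double aFeedPrice, double aHistoricalMark) {
        // a pre-open zero from the feed means "use the last good mark";
        // the (x != x) test catches a NaN without needing isnan()
        if (aFeedPrice == 0.0 || aFeedPrice != aFeedPrice) {
            return aHistoricalMark;
        }
        return aFeedPrice;
    }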

After fixing that bug, I realized that even with that one data point being a NaN, there was no reason for all the historical data points to also return NaN. As I looked into the problem more, I realized that I had made the historical calculations biased by the value for today. That way, once I've calculated the historical numbers, I subtract out the value for today, and then when I call it again, I can simply compute the value for today, add it to the rest of the values, and everything is up to date.
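
The shape of that incremental calculation is roughly the following - a sketch with an illustrative class (RunningMean is not the engine's real code) - and it also makes it obvious why a single NaN for today's value turned every result into a NaN:

    class RunningMean {
    public:
        RunningMean() : mHistSum(0.0), mHistCount(0) { }
        // fold in each historical point once, up front
        void addHistorical(double aValue) { mHistSum += aValue; ++mHistCount; }
        // each later call only has to fold in today's value...
        // ...so a NaN for today makes every answer a NaN
        double meanWithToday(double aToday) const {
            return (mHistSum + aToday) / (mHistCount + 1);
        }
    private:
        double  mHistSum;
        int     mHistCount;
    };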

But again, since I was putting in a NaN for a zero price, today's bad value was messing up all the historical values as well. Amazing. Fixing it was not too bad - it only took a little time - but it was the data change that started the whole problem. It's just amazing to me how the debugging has to keep up when you change things like the data sources, because the assumptions that come with the data change right along with them.

The Fall from Grace of a Rock Star

Thursday, September 20th, 2007

Today I overheard a disagreement between one of the developers here and Rock Star, whom I've discussed before. I have to say that I tried not to listen, and did a reasonably effective job - given that it went on for over 3 hours. Then there was an hour in their manager's office. This disagreement took up the entire morning of these two guys' day. Amazing.

The one guy came and told me what it was all about afterward, and I have to say I'm not in the least bit surprised. It was turf wars - plain and simple. The Rock Star wanted to be involved in all decisions relating to a project he's on - but is not the owner of. That, I have always found, is the key distinction. If you own it, then you are responsible for it, and then you're the one expected to make it work. But if you're just on a project, then the guy who owns it is the one to make the decisions - like who is involved in what.

In this case, while Rock Star wanted to be the owner, he wasn't, and the manager of the two developers pointed that out. This did not please the Rock Star, but then that's the reason the disagreement between the two took all morning. What seems to have happened is the real fall from grace for the Rock Star. He's now just Veruca - the kid from 'Willy Wonka and the Chocolate Factory' - "I want it NOW!". He's exceptionally focused on getting what he wants, and getting people to give him what he wants, even if they have no intention of doing what he wants them to do.

So the concerns I had about his work are history. There's no need to worry any longer. His manager has a good understanding of how he works, what his requests are, and hopefully, how to handle him. I still think he can make good contributions to the Shop, but he needs to understand that not everything is going to be given to him how he wants it, when he wants it. There are other concerns at play - limitations in the hardware, production issues - that he's simply not looking at, but needs to. Hopefully, he'll stick around long enough to season into a good developer.

Tricky Little Threading Bug

Thursday, September 20th, 2007

For the last two days I've been wrestling with a tricky little threading bug. The problem is in a class collection called SearchSpace, which is used for caching computed values that are computationally expensive to produce, so that intermediate values can be interpolated from them. It's a standard 1D and 2D search space with linear and Taylor Series approximations built in. Most of the time everything runs fine, but two days ago I came into the office and saw that my dev server had a Seg Fault at the line:


    if (mHash[lIndex] == NULL) {
        ...
    }

The value mHash was NULL and that was causing the Seg Fault. The problem was, about 10 lines before this, I have code that looks like:


    if (mHash == NULL) {
        init(false);
    }

Given that the method this appears in is protected by a mutex for thread safety, one would think that it's impossible for the variable mHash to be NULL on entry, get initialized, and then be NULL again a few lines later. Or, if it's not NULL on entry, it's impossible for it to become NULL by the problem line. I know I was confused. But the impossible happens so often to me it's almost common.

The key to this problem is that the tHashMap (where this method exists) is getting deleted in the middle of the operation. There's no mutex issue to contend with, and on exit all the variables are destroyed, and I get a seg fault, so the question remained - who was doing the removal and why? Even more importantly, how to stop it without throwing a ton of locking and unlocking at the problem?

It turns out that the SearchSpace runs a purge() method every so often to clear out the unused data. Data that hasn't been accessed in 10 minutes is considered 'stale' or 'unnecessary', and is deleted. If we need to re-create it later, so be it, but we have to balance the memory footprint with the computational cost of generating the numbers. It turns out that 10 mins is a nice trade-off point for this application. When the purge() is being run, it looks for empty SearchSpaceNodes (which contain tHashMaps) and deletes them. This, then, is the culprit. If we're working on a nearly empty SearchSpaceNode and the purge() method deletes it, we're left with a dangling pointer and that's the problem. So how to fix it?

One of the things I've used successfully in the server is to implement a very simplistic retain/release counter. But given that there will be tens of thousands of these SearchSpaceNodes in the system, I didn't want to burden the system with a mutex and an integer for each node. There had to be a simpler way.

The answer was simple - add a simple bool ivar to the SearchSpaceNode called inUse. Have the addItem() method set it when it gets a SearchSpaceNode, do all the work it needs, and when it's done, it resets it. The purge() method then only deletes the empty SearchSpaceNodes that are 'not busy'. If it's busy one time through, and nothing is added, then the next time through purge() it'll be deleted. But now it's going to be impossible to get the seg fault due to the SearchSpaceNode being deleted out from under a working thread.
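
In code, the idea looks roughly like this - a sketch with illustrative names (the mNodes vector, isEmpty(), and the bookkeeping are assumptions, not the real classes), where the worker flags the node while it's holding it and purge() leaves flagged nodes alone. With a 2007-era compiler this was a plain bool; on a modern one it would more properly be a std::atomic<bool>:

    #include <vector>

    class SearchSpaceNode {
    public:
        SearchSpaceNode() : mInUse(false) { }
        void setInUse(bool aFlag) { mInUse = aFlag; }
        bool isInUse() const      { return mInUse; }
        bool isEmpty() const      { return true; }      // stand-in for "no cached data"
    private:
        volatile bool   mInUse;
    };

    class SearchSpace {
    public:
        void addItem(SearchSpaceNode & aNode) {
            aNode.setInUse(true);       // purge() will leave this node alone now
            // ... compute and cache the expensive values on the node ...
            aNode.setInUse(false);
        }
        void purge() {
            for (size_t i = 0; i < mNodes.size(); ++i) {
                // only reap nodes that are both empty and not being worked on;
                // a busy node just waits for the next purge() pass
                if (mNodes[i]->isEmpty() && !mNodes[i]->isInUse()) {
                    delete mNodes[i];
                    mNodes[i] = NULL;   // (the real purge() would also drop the slot)
                }
            }
        }
    private:
        std::vector<SearchSpaceNode *>  mNodes;
    };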

Not easy to see, and it could have been solved by using mutexes and making every method on the SearchSpaceNode thread-safe, but that's not completely necessary. I only need to protect this case, and with what I've got now, I've got the thread-safety I need and don't have the overhead of all those mutexes locking and unlocking all the time.

Decided to get Fission and Audio Hijack Pro

Wednesday, September 19th, 2007

Fission.png

I was talking with Liza the other day and she was putting together a few playlists that she wanted to give me to try and expand my listening portfolio, as it were, when she mentioned that it'd be nice to be able to do a little editing on the songs to cut out some not-so-good parts and leave just the chorus and a few other bits in the song. I said I could do that if I got this audio editor, Fission, from these guys at Rogue Amoeba. I was doing a little more DVD ripping this morning, and I decided that now would be a good time to invest a little and get a good set of audio tools for those times that I want something and don't have anything like it on the Mac.

There was a bundle offered by Rogue Amoeba for Fission and Audio Hijack Pro, and I thought that if I ever needed it once, it'd more than pay for the bundle price, so I spent the $50 and got both. I'm glad I did - they work nicely with one another, even going so far as to pull up Fission from within Audio Hijack Pro when you ask to edit a sample. Very nice.

I haven't done a lot of audio editing, and I wouldn't say I'm any kind of audio engineer, but there have been a lot of times when I've recorded a little something - mostly it was the kids on my old NeXT turbo slab, and then wanted to clean it up and use it for my 'build' sounds, or my email notification. I still have Joseph at 2 yrs. saying "Daddy mail!" when every email arrives. That's 11 years and counting. So there are times for the tools, and since they aren't the cost of Photoshop, it seems right to get them and support the developers.

It's also a lot of fun to mess around with sound. I haven't played an instrument in years, but it will always be a part of me and I can remember how it felt. I'm not sure if I'll be doing any mash-ups for Liza this weekend, but it's nice to have these tools available, and it's really nice to support good guys doing good things for this platform.

Observing from a Distance

Tuesday, September 18th, 2007

It's been an interesting few weeks seeing how the Rock Star interacts with the other people at the Shop. Interesting in that he's dealing with them in a very similar manner to how he was dealing with me. This observation from a distance has been really educational for me.

First, it shows me that I'm not that different than the other people here in the Shop. We're all trying to meet deadlines with less than perfect tools and support, but we have to get the job done. This means that we all need to recognize that there is a time to ask for assistance from a third-party vendor, and there's a time to take what they have given you and run with it. Nothing is ever going to be perfect, and you have to take what you can get most times.

Second, it shows me that I didn't handle Rock Star all that poorly. I never insulted him. I never treated him like a non-person. While I didn't go out of my way to talk to him, he (and I) had plenty to do and it was just very easy to not talk to him. I certainly could have handled our interactions a lot worse.

Finally, it shows me that in the end, it really is a delicate combination of skills and personality that comes together to make a great team. You can't just have people that all get along - without skills they can't really get anything done. You can't ignore the 'human element' either, because you'll end up with people that can do the work individually, but don't really interact well.

I've learned a lot about how I deal with people, and how others do as well. I'm happy to say that I'm pretty pleased with the results. I'm not perfect, no one is, but I'm really happy that I've had this chance to see that I'm not as bad as some folks might like to paint me out to be.