Speeding Up the Market Data Historical Cache


Late last week, Rock Star - the fallen one - came by my place and asked me why the two different sources of historical prices were taking different amounts of time to return from my market data server. I explained about the caching of the data, and he mentioned that even on the second and subsequent hits, the one source within the Bank was considerably slower than the source outside the Bank. Of course, I tended not to believe him, and put his musings down to experimental error.

But as he showed me the data he was getting, I began to think that he might be onto something. So I wrote up a test case of my own and checked the two sources for the same data, over a date range long enough that the timings reflected the actual work and not the overhead involved.

What I found was that the one was indeed faster than the other. And immediately I thought I knew why. The historical data from any source has to be understood within the confines of the requested date range. Say I ask a source for 10 years of data for IBM. It's not going to return anything for holidays and weekends, so it's very possible that the first available day of data is not the first day requested. If you have the logic of "check the cache for data, fill as necessary, and then respond" you end up asking the source for the same little missing bits over and over again. But they never amount to anything.
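The perpetual-miss behavior is easy to see in a little sketch (the names here are hypothetical, not the actual server code). If the cache entry only knows the first and last dates it actually has data for, a request that starts on a holiday will look incomplete forever:

```cpp
#include <cassert>

// Hypothetical cache entry: it only knows the first and last dates
// for which it actually holds data (dates as YYYYMMDD ints).
struct NaiveCacheEntry {
    int firstDataDate;   // e.g. 20100104 - first trading day cached
    int lastDataDate;    // e.g. 20191231 - last trading day cached
};

// Naive coverage test: "do we have everything that was asked for?"
// If the requested start falls on a weekend or holiday, the cache's
// first data date is later than it, so this reports a gap on every
// request - and the server goes back to the source every time.
bool naiveCovers(const NaiveCacheEntry &e, int reqStart, int reqEnd)
{
    return (e.firstDataDate <= reqStart) && (e.lastDataDate >= reqEnd);
}
```

Asking for data starting Friday, Jan 1 (a holiday) when the cached data starts Monday, Jan 4 reports a miss every single time, even though the source has nothing more to give.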

So I had to modify the general historical cache code that I had written for the second source, and add in the knowledge of the requested first and last dates - in addition to the actual first and last data dates. This made the code hit the cache completely, which is a big win, but there was still a little difference in the cache return speed.
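The fix amounts to remembering what was *asked for*, not just what came back. A sketch of the idea, with hypothetical names standing in for the real cache classes:

```cpp
#include <cassert>

// Hypothetical cache entry carrying the requested date range in
// addition to the first/last dates of the actual data returned.
struct CacheEntry {
    int firstRequestedDate;  // what the client originally asked for
    int lastRequestedDate;
    int firstDataDate;       // what the source actually returned
    int lastDataDate;
};

// Coverage test against the *requested* range: if we already asked
// the source for a superset of [reqStart, reqEnd], the cache is
// complete - even though the data itself starts later because of
// holidays and weekends.
bool covers(const CacheEntry &e, int reqStart, int reqEnd)
{
    return (e.firstRequestedDate <= reqStart)
        && (e.lastRequestedDate >= reqEnd);
}
```

With this test, the holiday at the front of the range no longer registers as a gap, and the second and subsequent hits never go back to the source.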

What I had done was to get the list of dates between a range of dates, and then, using those dates, get the data from the time series for those dates. Over the weekend, I realized that the way I was getting the dates was just inefficient - scanning the entire std::map is not as good as calling lower_bound() and then using that iterator to move through the map until we get to a point where the date is outside our range.
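The date-extraction step looks something like this (a standalone sketch, with the time series as a bare std::map of YYYYMMDD dates to values, as in the code below):

```cpp
#include <cassert>
#include <map>
#include <vector>

// Instead of scanning the whole std::map, jump straight to the first
// key >= aStartDate with lower_bound() - an O(log n) operation on the
// map's tree - and walk forward until the key passes anEndDate.
std::vector<double> datesInRange(const std::map<double, double> &series,
                                 double aStartDate, double anEndDate)
{
    std::vector<double> dates;
    std::map<double, double>::const_iterator i;
    for (i = series.lower_bound(aStartDate); i != series.end(); ++i) {
        if (i->first > anEndDate) {
            break;   // past the end of the requested range
        }
        dates.push_back(i->first);
    }
    return dates;
}
```

For a big time series and a small requested range, this skips the vast majority of the map rather than touching every entry.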

Likewise, it made more sense to subclass CKTimeSeries and provide a 'response filling' method so that we could use the same idea to directly fill the response from the cached data. Something like this:

  void MDTimeSeries::addToResponse( CKTimeTable *aResponse,
                                    const CKString & aSymbol,
                                    const CKString & aFieldName,
                                    int aStartDate, int anEndDate )
  {
    if (aResponse != NULL) {
      // lock the time series so it can't change while we copy from it
      CKStackLocker          lockem(getTimeseriesMutex());

      // jump straight to the first entry on/after the start date
      std::map<double, double>::iterator    i;
      for (i = getTimeseries()->lower_bound(aStartDate);
           i != getTimeseries()->end(); ++i) {
        // see if we've gone past the limit of the data we want
        if (i->first > anEndDate) {
          break;
        }
        if (!isnan(i->second)) {
          // copy in the data as it's relevant
          aResponse->setDoubleValue((long)i->first, aSymbol,
                                    aFieldName, i->second);
        }
      }
    }
  }

This turned out to be a big win for the speed, and it's now faster than the older implementation, which means I can get rid of the old one at some point in the future if I want. Not bad.