Pushing More Ticks Through the Fast-Tick Server Safely


This morning I had a problem with my fast-tick server where it appeared that one of the 'sidekick' threads was not able to successfully get through its run-loop. I noticed this because it started the 'purge' of stale greek values, but never completed the task. When this happened I had to restart the entire app because the code was locked up, yet not crashing. No fun, and it happened twice in one morning!

The relevant portion of the code responsible for this basically did the following:

    /* 
     * Interesting problem... when this guy runs and there is a lot of 
     * work to do, he'll effectively lock out the CalcEngineWorkers and 
     * nothing will get done with the CalcNodes. All because this guy 
     * is too fast and keeps locking other processes out. So... I'm going 
     * to put a few 'breaks' into the processing flow. A hundred of them 
     * to be exact, and each time, we're going to see if someone else 
     * needs a turn. Then we'll continue. This is just a cooperative 
     * way to get the job done without locking other threads out. 
     */ 
    int   blockSize = InstrumentManager::numberOfInstruments()/100; 
    // make sure a small instrument count doesn't leave us with a zero blockSize 
    if (blockSize < 1) { 
      blockSize = 1; 
    } 
    // lock up this guy for a read to make sure he doesn't change on us 
    __lockRead(); 
    try { 
      // log what we're going to be doing 
      getLog() << l_status.setErrorId("InstrumentManager") 
               << "taking the time now to purge old data from " 
               << "the instruments" << endl; 
      // next, we need to get an iterator for all the instruments 
      int                   cnt = 0; 
      int                   pass = 0; 
      tIterator< void * >   iter = __allInstruments(); 
      while (iter.hasNext()) { 
        INSTR_BASE      *inst = (INSTR_BASE *) iter.getNext(); 
        if (inst != NULL) { 
          inst->retain(); 
          cnt += inst->purge(); 
          inst->release(); 
        } 
        /* 
         * Check and see if we should yield a bit and see if the system 
         * has something else that needs to get done. With only the read 
         * lock, this isn't a bad place to pause. 
         */ 
        if (++pass % blockSize == 0) { 
          sched_yield(); 
        } 
      } 
      // log what we did 
      getLog() << l_status.setErrorId("InstrumentManager") 
               << cnt << " unnecessary data elements purged from " 
               << "the instruments" << endl; 
    } catch (...) { 
    } 
    __unlock();

The code essentially locks the list of all instruments for a read, goes through each instrument telling it to purge any stale data, and then reports on the results and releases the lock. Seems pretty simple. But it's got issues.

First, why maintain the read lock for the entire process? Well... if we don't, then someone can add or remove instruments without us having the first clue about it, and a removed instrument may actually be deleted while we're still holding a pointer to it. That's going to cause us a world of hurt.

Second, why are we waiting to put the retain() on the instrument until we get to it in the iterator on the list? It seems a better idea would be to put the retain() on each instrument as soon as possible, so we know it'll be around for us to use when we get around to checking it. Good point - that was one of my concerns about this code when looking at it today.
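
As a side note, retain() and release() here are just intrusive reference counting on the instrument objects. The real classes aren't in this post, but the contract is roughly this - my own sketch, with made-up names:

    #include <atomic> 
 
    class RefCounted { 
      public: 
        RefCounted() : mRefCnt(1) { } 
        virtual ~RefCounted() { } 
 
        // bump the count so the object can't be deleted out from under us 
        void retain() { 
          mRefCnt.fetch_add(1, std::memory_order_relaxed); 
        } 
 
        // drop our claim - the last one out deletes the object 
        void release() { 
          if (mRefCnt.fetch_sub(1, std::memory_order_acq_rel) == 1) { 
            delete this; 
          } 
        } 
 
      private: 
        std::atomic<int>   mRefCnt; 
    }; 

The point for this post is simply that anything holding a retain() keeps the instrument alive, no matter what happens to the master list in the meantime.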

But the real kicker is the hidden issue: other threads and processes waiting for that read lock to be released so they can modify the list. The purge may not seem like a lot of work, but with in excess of 400,000 instruments it takes about 10 seconds (wall clock time) to run through this section of the code. That's a lot. In times of heavy changes - like the open of the US markets - this is a real issue. Holding the lock that long pauses the processing of ticks and greeks, and that's no good.
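
To see the shape of the problem outside the production code, here's a tiny stand-alone illustration - std::shared_mutex standing in for the in-house read/write lock, and every name in it mine, not the server's. As long as the purge pass holds the shared lock, a thread that needs the exclusive lock to modify the list just sits there, sched_yield() or not:

    #include <chrono> 
    #include <iostream> 
    #include <mutex> 
    #include <shared_mutex> 
    #include <thread> 
    #include <vector> 
 
    std::shared_mutex    listLock; 
    std::vector<int>     instruments(400000, 0); 
 
    // stand-in for the purge pass: a long walk under the read lock 
    void purgePass() { 
      std::shared_lock<std::shared_mutex> lock(listLock); 
      for (auto & i : instruments) { 
        i = 0;                        // pretend to purge stale data 
        std::this_thread::yield();    // yielding does NOT drop the lock 
      } 
    } 
 
    // stand-in for a tick handler that needs to add an instrument 
    void addInstrument() { 
      auto start = std::chrono::steady_clock::now(); 
      std::unique_lock<std::shared_mutex> lock(listLock);   // blocks until the purge is done 
      instruments.push_back(42); 
      auto waited = std::chrono::steady_clock::now() - start; 
      std::cout << "writer waited " 
                << std::chrono::duration_cast<std::chrono::milliseconds>(waited).count() 
                << " ms for the lock" << std::endl; 
    } 
 
    int main() { 
      std::thread reader(purgePass); 
      // give the reader a head start so it grabs the lock first 
      std::this_thread::sleep_for(std::chrono::milliseconds(10)); 
      std::thread writer(addInstrument); 
      reader.join(); 
      writer.join(); 
      return 0; 
    } 

Scale that writer's wait up to the real system's 10 seconds and you've got ticks and greeks backed up behind a housekeeping task. So... what can we do to fix this?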

Answer: do the obvious - copy the instruments and then process them. More correctly, copy the pointers to the instruments and then run through that private list, processing each one.

Why is this a big difference? Well, first off, it means that the lock is only held on the main instrument list for the time required to copy about 400,000 pointers - a few megabytes of memory traffic, which is essentially nothing. Then the lock is released and the other threads and processes are free to do what they need.

Second, it allows us to put the retain() on the instrument at the time of the pointer copy, which means it's "safe" as soon as possible, and that makes it nearly impossible to have an instrument killed out from under us. Much nicer.

In the end, the code now looks like this:

    /* 
     * We need to make a copy of the pointers to all the active 
     * instruments right now. As we do this, we're going to retain() 
     * each so it doesn't go away. When we're done with each, we'll 
     * release() it to be nice. 
     */ 
    tVector<INSTR_BASE *>   instruments(instrCnt); 
    // lock up this guy for a read to make sure he doesn't change on us 
    __lockRead(); 
    try { 
      tIterator< void * >   iter = __allInstruments(); 
      while (iter.hasNext()) { 
        INSTR_BASE      *inst = (INSTR_BASE *) iter.getNext(); 
        if (inst != NULL) { 
          inst->retain(); 
          instruments.addBack(inst); 
        } 
      } 
    } catch (...) { 
    } 
    __unlock(); 
    // update the instrument count to what we actually have in hand 
    instrCnt = instruments.size();
 
    // next, we need to run through all the instruments 
    int     cnt = 0; 
    for (int i = 0; i < instrCnt; ++i) { 
      INSTR_BASE        *inst = instruments[i]; 
      if (inst != NULL) { 
        cnt += inst->purge(); 
        inst->release(); 
      } 
    } 
    // log what we did 
    getLog() << l_status.setErrorId("InstrumentManager") 
             << cnt << " unnecessary data elements purged from " 
             << "the instruments" << endl;

My initial tests show that the clean-up is being done just as before, but the pauses in the processing are gone - which is good, and expected. I'm very pleased with this because I think it's quite likely this had something to do with the lock-up, and it shouldn't be an issue any longer.