Struggling with Dead Workers in Carmine

This morning I'm once again trying to figure out a problem I've been having with the workers in the Carmine message queue implementation. Basically, the thread that starts the workers is doing just fine, but the workers themselves are just stopping. I had no idea what to do about it, so I wrote to the author to ask him about it.

He responded by pointing me to the :monitor option of the worker function. I hadn't noticed it when reading the code, but yes, there's a function that gets called each time the queue is cycled. I added a simple function there to reset an atom, and in the thread that starts these workers, I inc that atom. If it exceeds 50 sec without being reset, then I know the queue has taken more than 50 sec to cycle, and I try to stop/start the worker.

The basic monitor and its worker look something like this:

  ;; monitor: called each time the queue cycles, so it doubles as a heartbeat
  save-mon (fn [{:keys [mid-circle-size ndry-runs poll-reply]}]
             (debug "persistence worker heartbeat (iteration)...")
             (reset! _save_loops 0))
  ;; single-threaded worker that persists each message and reports success
  saver (mq/worker (epr/connection :queue) *dump*
          {:handler (fn [{:keys [message attempt]}]
                      (save-it! cfg message)
                      {:status :success})
           :monitor save-mon
           :nthreads 1})

and then in the main body of the function we have something that checks to see if the _save_loops atom has been reset recently enough:

  (let [sc (swap! _save_loops inc)]
    (when (< 50 sc)
      (warnf "Persistence worker hasn't cycled for %s sec -- Restarting!" sc)
      (reset! _save_loops 0)
      (infof "Stopping the persistence worker... [%s]"
             (if (mq/stop saver) "ok" "FAIL"))
      (infof "Starting the persistence worker... [%s]"
             (if (mq/start saver) "ok" "FAIL"))))
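
To see the heartbeat idea on its own, here is a minimal sketch with no Carmine or Redis involved. All the names (loops, restarts, beat!, tick!) are made up for illustration, and the restart is just a counter. It also assumes the supervising loop runs about once a second, which is what makes a count of 50 mean roughly 50 sec:

```clojure
;; Hypothetical names throughout; this only models the counter logic.
(def loops (atom 0))      ; ticks since the last heartbeat
(def restarts (atom 0))   ; how many times the watchdog has fired

(defn beat!
  "Stand-in for the :monitor callback: record that the queue cycled."
  [_]
  (reset! loops 0))

(defn tick!
  "One pass of the supervising loop. Calls restart-fn once the counter
  exceeds limit, clearing the counter first. Returns the count seen."
  [limit restart-fn]
  (let [n (swap! loops inc)]
    (when (< limit n)
      (reset! loops 0)
      (restart-fn))
    n))

;; Healthy worker: a heartbeat arrives between ticks, so no restart.
(dotimes [_ 100]
  (tick! 50 #(swap! restarts inc))
  (beat! nil))

;; Stalled worker: 51 ticks with no heartbeat trips the watchdog once.
(dotimes [_ 51]
  (tick! 50 #(swap! restarts inc)))
```

The counter only measures time as well as the loop's tick rate does, so a slow or blocked supervising thread would stretch that 50 sec window.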

This all took a while to figure out, but after a time I got it running, and it appeared to be working. But the stopping and starting just weren't doing the right things. Add to this the fact that in the other data center I had multiple installations of this where the workers weren't in trouble at all.

I'm starting to think it's the Redis server. That will likely be the next thing I do: restart the Redis server and hope that clears any odd state that might be there.

I do wish this would settle down.

UPDATE: I emailed Peter, the author, and he asked me to check the logs, in this case the Timbre logs (Timbre is his logging package). These go to standard out, and I had forgotten about them. Sure enough, there was useful data there, and all the stops and starts were logged as well. At this point, I believe it's something in Redis, and it has been successfully cleared out. But if it happens again, I'll dump the Redis database and start fresh, since it's just the queue data.