Archive for the ‘Coding’ Category

Multiple-Producer/Single-Consumer Lockless FIFO Queue

Thursday, February 10th, 2011


I've been working on getting more speed out of the components of my ticker plants; they run fine during the day, but the crush at the open still gets ahead of them. To that end, I wanted to make my conflation queue lockless. The trie that holds the messages is already lockless - using just the GCC built-ins for compare-and-swap. But the STL std::deque was certainly not thread-safe, and updating the two together was something I left in the lock as well.

Then I got to thinking about the order of operations in the conflation queue, and I became convinced that if I ordered things properly, it would be possible to have a lockless conflation queue - provided I had a lockless FIFO queue that allowed for multiple producers. As it is now, the only lockless FIFO queue I have is a nice, fixed-size, circular FIFO that doesn't even use compare-and-swap. But that wasn't going to be good enough.

So I started writing. My goal: lockless, multiple-producer/single-consumer, FIFO queue using compare-and-swap. It's all sitting in this gist.

There are a few things I'm really pleased about with this class. First, it's a nice, general template class. The problem with my circular FIFO class is that it uses the volatile keyword to synchronize data between the CPUs - which is nice, as it doesn't need any compare-and-swap operations, but it makes it very hard to put your own classes into the queue. It's best suited for simple values like pointers. Nice and useful, but having something like this is even better.

The one section of code that I wish were a little better is in the pop() call:

  1. /**
  2.  * This method updates the passed-in reference with the value on the
  3.  * top of the queue - if it can. If so, it'll return the value and
  4.  * 'true', but if it can't, as in the queue is empty, then the method
  5.  * will return 'false' and the value will be untouched.
  6.  */
  7. bool pop( T & anElem )
  8. {
  9.   bool    error = false;
  10.
  11.   // see if the head is NULL - i.e. empty queue
  12.   if (__sync_bool_compare_and_swap(&mHead, NULL, mHead)) {
  13.     error = true;
  14.   }
  15.
  16.   /**
  17.    * This is interesting... when we pop, we need to see if the
  18.    * head and tail are pointing to the same thing - the only
  19.    * element in the queue. If they are, then the best thing to
  20.    * do is to NULL-out the tail, so push() can work, and then
  21.    * quickly swap out the head so it's NULL as well. Then we
  22.    * have time to do what we need.
  23.    *
  24.    * However, if there are several elements in the queue, then
  25.    * we only need to remove the head and process it.
  26.    */
  27.   if (!error) {
  28.     // get what I *think* is the head (should be)
  29.     Node *h = mHead;
  30.     /**
  31.      * If we have the head == tail, then we have only one item
  32.      * in the list. In order to make this a clean break, NULL
  33.      * out the tail so new push() calls can start building the
  34.      * list again.
  35.      */
  36.     __sync_bool_compare_and_swap(&mTail, h, NULL);
  37.     // now let's move the head to the next in the list
  38.     while (!__sync_bool_compare_and_swap(&mHead, h, mHead->next)) {
  39.       h = mHead;
  40.     }
  41.
  42.     // get the value out of the node and into the arg
  43.     if (h != NULL) {
  44.       anElem = h->value;
  45.       // at this point we're done with this node - delete it
  46.       delete h;
  47.       h = NULL;
  48.     } else {
  49.       error = true;
  50.     }
  51.   }
  52.
  53.   return !error;
  54. }

In lines 29-40 we are trying to deal with the fact that there is one, and only one, value in the queue. This means the head and the tail point to the same Node. What I wanted to do was to NULL out the head and tail, but save the Node for later extraction. Since I can't atomically swap two variables, I had to do the one I believe is more critical first - the tail, as that's what's affecting the push() calls. I then very quickly NULL out the head and grab that value for processing.

I may come back to this soon, but for now all my tests look good and I'm very happy with the results. I'm able to have a lockless conflation queue which tests out very nicely, and with this as a general component, I have a lot more flexibility in what I use in the tighter performance sections of the code.

Good day's work.

[2/14] UPDATE: I've been doing a lot of reading on queues like this and it turns out that if you start with an 'empty' Node, the code for push() and pop() gets a lot easier. The complete class is still here, but this is the new push():

    /**
     * This method takes an item and places it in the queue - if it can.
     * If so, then it will return 'true', otherwise, it'll return 'false'.
     */
    bool push( const T & anElem )
    {
      bool    error = false;
 
      // create a new Node with the provided value
      Node   *me = new Node(anElem);
      if (me == NULL) {
        error = true;
      } else {
        /**
         * We need to add the new value to the tail and then link it
         * back into the list. Not too bad.
         */
        // put in the new tail, and get the old one
        Node  *oldTail = mTail;
        while (!__sync_bool_compare_and_swap(&mTail, oldTail, me)) {
          oldTail = mTail;
        }
        // OK, make sure that the list remains intact
        if (oldTail != NULL) {
          oldTail->next = me;
        }
      }
 
      return !error;
    }

and pop() becomes:

    /**
     * This method updates the passed-in reference with the value on the
     * top of the queue - if it can. If so, it'll return the value and
     * 'true', but if it can't, as in the queue is empty, then the method
     * will return 'false' and the value will be untouched.
     */
    bool pop( T & anElem )
    {
      bool    error = false;
 
      // if the next guy is NULL, we're empty
      if (__sync_bool_compare_and_swap(&(mHead->next), NULL, NULL)) {
        error = true;
      } else {
        // move the head to the next Node and get the value
        Node  *oldHead = __sync_val_compare_and_swap(&mHead, mHead, mHead->next);
        anElem = mHead->value;
        // if there is an oldHead (should be), delete it now
        if (oldHead != NULL) {
          delete oldHead;
        }
      }
 
      return !error;
    }

It's important to note that we're now only touching mTail in the push() method, and only touching mHead in the pop(). This is much cleaner than the previous implementation, and in addition, there's no more "spin to replace" loop in pop(), since the single consumer is the only thread moving the head. Much cleaner.
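Putting the pieces together, the dummy-node scheme looks roughly like this. This is a minimal sketch, not the class from the gist - the class and member names are my own placeholders, it assumes T is default-constructible (for the dummy), and it keeps the same single-consumer assumption as the real thing:

```cpp
#include <cstddef>

// Sketch of a multiple-producer/single-consumer queue that starts with
// an 'empty' (dummy) Node, so push() only ever touches mTail and pop()
// only ever touches mHead. Names here are illustrative, not the gist's.
template <typename T>
class LocklessMPSCQueue {
public:
    LocklessMPSCQueue() {
        // the dummy node - requires T to be default-constructible
        Node *dummy = new Node(T());
        mHead = dummy;
        mTail = dummy;
    }
    ~LocklessMPSCQueue() {
        Node *n = mHead;
        while (n != NULL) { Node *next = n->next; delete n; n = next; }
    }

    bool push( const T & anElem ) {
        Node *me = new Node(anElem);
        // swap in the new tail, then link the old tail to it - note
        // there's a brief window between the swap and the link where
        // the consumer can see an 'empty' queue, which is harmless
        Node *oldTail = mTail;
        while (!__sync_bool_compare_and_swap(&mTail, oldTail, me)) {
            oldTail = mTail;
        }
        oldTail->next = me;   // never NULL - the dummy guarantees it
        return true;
    }

    bool pop( T & anElem ) {
        // the head is always the dummy; the first real element is next
        Node *next = mHead->next;
        if (next == NULL) return false;   // empty queue
        anElem = next->value;
        Node *oldHead = mHead;
        mHead = next;                     // 'next' becomes the new dummy
        delete oldHead;
        return true;
    }

private:
    struct Node {
        T      value;
        Node  *next;
        Node( const T & aValue ) : value(aValue), next(NULL) { }
    };
    Node *mHead;   // only touched by the single consumer
    Node *mTail;   // CAS-swapped by the producers
};
```

The thing worth noticing is that the dummy node is what decouples the two ends: the consumer never has to reason about the tail at all.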

Hard to Find a Good Fine-Grained Timer on Linux

Wednesday, February 9th, 2011


I'm looking at some logs from a client of my ticker plant. I'm trying to iron out the last few bugs in the system, and I'm coming across data that makes no sense whatsoever. Basically, every 10 sec, my client prints stats on the messages it's received: messages received per second (on average), messages conflated, the size of the conflation queue, and the min, max, and average delay from the time a message is timestamped at the exchange to the time it's received by my client code.

What I'm seeing is something like this:

  Time          Messages   Conf Msgs   Conf Queue   Max Delay
                (/sec)     (/sec)      (msgs)       (msec)
  13:30:47.01     4083.4       550.6        0           838.6
  13:30:57.01     6561.7       547.6       24         13459.0
  13:31:07.01     5170.2       797.5        0           410.7

My concern is the max delay - the last column - for the middle time slice. It's showing that, on an empty conflation queue, this interval had a maximum delay from the source of 13 sec! That's longer than the sampling interval. It's as if the message was hiding upstream!

I initially suspected the time base I was using. Originally, I was using gettimeofday() as it's pretty universal, and calculating the microseconds since epoch with:

  struct timeval  tv;
  gettimeofday(&tv, NULL);
  // cast before multiplying so a 32-bit time_t can't overflow
  return (int64_t)tv.tv_sec * 1000000 + tv.tv_usec;

Then I read on stackoverflow that this isn't a very good timebase, and I switched to the supposedly more stable clock_gettime():

  struct timespec  tv;
  clock_gettime(CLOCK_REALTIME, &tv);
  return (int64_t)tv.tv_sec * 1000000 + tv.tv_nsec / 1000;

I'm not sure if this is going to be a lot better, as I think there's a problem still lurking in my code, but it's a start. Getting a good timer is important when you're timing things this close.
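One thing worth noting: CLOCK_REALTIME can still be stepped by NTP or an admin, which would produce exactly this kind of impossible delay spike. For measuring intervals (as opposed to comparing against exchange timestamps), a monotonic clock can't jump. A small sketch of what that looks like - the function name is mine:

```cpp
#include <time.h>
#include <stdint.h>

// Microseconds from a clock that only moves forward - immune to NTP
// steps and wall-clock changes. CLOCK_REALTIME is still what you need
// when comparing against timestamps from another machine, though.
static inline int64_t usecMonotonic()
{
    struct timespec tv;
    clock_gettime(CLOCK_MONOTONIC, &tv);
    // cast before multiplying so a 32-bit time_t can't overflow
    return (int64_t)tv.tv_sec * 1000000 + tv.tv_nsec / 1000;
}
```

(On older glibc you'd need to link with -lrt for clock_gettime().)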

Google Chrome dev 10.0.648.45 is Out

Wednesday, February 9th, 2011


This morning I noticed that Google Chrome dev 10.0.648.45 was out with a few really nice additions. Among them are the latest V8 engine (3.0.12.12) and Flash (10.2), but also "many crash fixes". Good enough for them. I'm not using Flash anywhere else on my MacBook Pro, so it's my one place to view Flash content - and I use it every now and then, but not often.

When I restarted it, it was amazingly fast, and I was jazzed for the first time in a long time. Wouldn't this be great! I thought. But after a bit it was back to 'normal' speed. Guess Comcast was just speedy then.

Rebalancing the Multicast Channels Seems to Reduce Errors

Tuesday, February 8th, 2011


I've been trying to track down some ZeroMQ errors I've been seeing today as I try to build a clean multi-threaded client for those folks that want more performance out of the ticker plant client. It's a decent design in that it devolves to the same client as I have now if you only create one client. The zmq::context_t is essentially a process singleton, by design, so I put that as a static class variable on my ZMQ receiver class, but each receiver class has its own zmq::socket_t instance, which is what you connect() to the URLs of the reliable multicast channels.

It all makes sense on paper, but I have been seeing a lot of errors that seem to be coming from ZeroMQ, and that's pretty unlikely in my experience. So I started to wonder if there was some external factor causing ZeroMQ these problems. I had initially cut up the multicast channels into 27 per message type: A through Z and 'other'. OPRA has a balanced scheme for 48 channels, and that couldn't be worse than mine, so I reconfigured my channels to match OPRA's.

As it's running, it's certainly doing better than before - even if it's not perfect, it's better. I'll have to keep an eye on it, and get the latest ZeroMQ from github to see if they have fixed any issues in the latest version. I know we're going to need that, as ZeroMQ is still a bit flaky. But to be fair, I'm pushing it pretty hard.

UPDATE: OK, not perfect... we're still seeing errors, but it's better. Still have something to hammer on... I'll get it - eventually.

Nice Re-Org of Client Class Allows Users to Balance Load

Tuesday, February 8th, 2011


This morning I was working on how to decrease the latency of my ticker plant client for very large subscription sets when it hit me. In my original design, I had a single ZeroMQ receiver that fed any number of subscribed clients. That's OK for small sets, and maybe that's OK for large sets with conflation, but I wanted something that allowed the user a lot more flexibility.

So I stopped making the ZeroMQ receiver a static ivar of the client class, and made it a regular ivar of the client. I then changed the ZeroMQ receiver to make its zmq::context_t a singleton in the process (a static ivar of the ZeroMQ receiver class) so that I could control the thread usage there. There will still be a single zmq::socket_t for each receiver, and since the receivers are paired with the clients, we'll have the ability to use the filtering of the reliable multicast on the individual client subscriptions.

Now, if you want to subscribe to a large dataset you can. It'll all come to the one client. But if you want to handle things a little more quickly, subscribe to smaller groups, and each client will get just that portion of the whole that it asks for, but will have the threading to be able to take it as fast as it can.

Controllable speed. If you don't need it, just subscribe to a big block. But if you need it, it's there. This is a great little compromise that should give the users a lot more flexibility than they have now.

Excited about looking at the results at the open. Should be exciting.

Gotta Be Careful With Static Instance Variables

Monday, February 7th, 2011


Today I got nailed by my own laziness - no two ways about it. I had built these pools of data containers - one was for simple time/date-stamped (char *) UDP datagrams, and the other was for std::string * containers. They were both similar in their execution - a single-consumer/single-producer lockless FIFO queue that would act as the pool, and a few methods that would return the next unused one, and recycle one into the pool (or delete it) based on the capacity of the pool.

Simple stuff, really. The problem was that I got a little lazy and made the alloc() and recycle() methods static methods with the actual pool a static ivar of the class. Bad idea.

When I started to use these pools in one of my servers that happened to use about 30 UDP exchange feeds, there was only the one static pool, and it wasn't a single-producer/single-consumer any more. Bad move.

The solution was simple - don't use the static ivar and don't cut corners. I made the pool a class, put the alloc() and recycle() on the class, had the UDP exchange feeds each have a pool, and all was well again. This was pretty easy for the string pool, but took just a little longer for the tagged datagrams. Still, not too bad.
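The shape of the fix is simple enough to sketch. This is illustrative only - the class and method names are mine, and a std::deque stands in for the lockless SPSC FIFO just to keep the sketch small:

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Per-instance pool: each UDP feed owns one of these, so the
// single-producer/single-consumer assumption holds again. No statics.
class StringPool {
public:
    explicit StringPool( size_t aCapacity ) : mCapacity(aCapacity) { }
    ~StringPool() {
        while (!mPool.empty()) { delete mPool.front(); mPool.pop_front(); }
    }
    // hand out a recycled container if we have one, or make a new one
    std::string *alloc() {
        if (!mPool.empty()) {
            std::string *s = mPool.front();
            mPool.pop_front();
            s->clear();
            return s;
        }
        return new std::string();
    }
    // take one back - keep it if we're under capacity, delete it otherwise
    void recycle( std::string *aStr ) {
        if (mPool.size() < mCapacity) {
            mPool.push_back(aStr);
        } else {
            delete aStr;
        }
    }
private:
    size_t                     mCapacity;
    std::deque<std::string *>  mPool;
};
```

With the pool an ivar of each feed, no two threads ever share one, and the alloc()/recycle() pair never needs a lock at all.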

In the end, I solved my nasty data problems by going back to clean, per-feed data container streams. Good enough.

When Getting “Help” is the Last Thing I Need

Monday, February 7th, 2011

I've written a replacement to an exchange data feed and now I'm trying to work with a few groups to see if it'll work for them. The original intention was not to include them, so if it works for them, so much the better. Still, I've expected more than a little resistance because these are developers that have clearly stated that they do not want this new exchange feed - that the one they have is just fine, thank you. I understand their position - I'm new, they don't want to change. They fear what might happen, so it's easiest to simply drag their feet or poke holes in the project to say why they can't use it.

This is easily solved in one of three ways: fire people, force people, or give up. I'm not in a position to do any of these, but if I had my choice, I think I'd favor the ultimatum angle: force, then fire. This industry is paid far too well to put up with unnecessary crud from prima donnas thinking they know better.

Be that as it may... something struck me as quite odd, and made me lean towards the "firing" angle for one developer when he mentioned today that he took my code, replaced spinlocks with pthread locks and saw an amazing jump in performance. That made no sense to me, as I moved from pthread locks to spinlocks long ago precisely because the spinlocks were faster. Still, I was willing to say I'd try it.

And I did.

Amazing. I got literally thousands of messages a second less using pthreads than using spinlocks - exactly what I expected. So much for the brilliant developer that thinks he's found the silver bullet to all problems.
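For the curious, here's the shape of that kind of comparison - a toy spinlock built on the same GCC built-ins, hammered from two threads. This is a sketch, not the ticker plant's actual lock class, and real numbers depend heavily on contention and core count:

```cpp
#include <thread>

// A toy spinlock on the GCC atomic built-ins. For short, hot critical
// sections, spinning avoids the kernel round-trip a contended pthread
// mutex can take - which is the difference I was measuring.
class SpinLock {
public:
    SpinLock() : mFlag(0) { }
    void lock()   { while (__sync_lock_test_and_set(&mFlag, 1)) { } }
    void unlock() { __sync_lock_release(&mFlag); }
private:
    volatile int  mFlag;
};

static SpinLock  sLock;
static long      sCount = 0;

// two threads bump a shared counter under the lock; if the lock works,
// no increments are lost
static long runContended( int aPerThread )
{
    sCount = 0;
    std::thread t1([aPerThread]() {
        for (int i = 0; i < aPerThread; ++i) {
            sLock.lock(); ++sCount; sLock.unlock();
        }
    });
    std::thread t2([aPerThread]() {
        for (int i = 0; i < aPerThread; ++i) {
            sLock.lock(); ++sCount; sLock.unlock();
        }
    });
    t1.join();
    t2.join();
    return sCount;
}
```

Swap the SpinLock for a pthread_mutex_t wrapper and time both loops, and you have the experiment.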

Yup. For that developer, I'd give them a nice severance bonus and show them the door. Being reluctant is one thing. Being a bad developer is another.

Slick Infix to Prefix Algorithm

Friday, February 4th, 2011

I've been slugging out this idea of infix to prefix conversion and this afternoon I finally cracked it. Well, to be fair, I cracked it on paper this morning and I got it all coded up this afternoon. This comes about because the lisp-like language that I put together for a project at The Shop would require that all arithmetic expressions be written in prefix. Seems obvious, but a lot of the users weren't really excited about the difficulties in putting together prefix expressions.

They seem to be pretty content to say:

  (and (condition 1)
       (condition 2)
       (condition 3))

as they like the short-hand it gives them, but when they need to do something with more than simple addition, like, say the quadratic equation, then things get a little more difficult, and they start to get confused:

  (set a 1)
  (set b -7)
  (set c 12)
  (set x1 (/ (+ (-b) (sqrt (- (* b b) (* 4 a c)))) (* 2 a)))
  (set x2 (/ (- (-b) (sqrt (- (* b b) (* 4 a c)))) (* 2 a)))

and while I can agree it's not trivial, the infix representation isn't all that easy, either:

  (set x1 (((-b) + (sqrt (b * b - 4 * a * c))) / (2 * a)))
  (set x2 (((-b) - (sqrt (b * b - 4 * a * c))) / (2 * a)))

But they wanted infix in their math expressions. But not pure infix - just selective infix to make decoding as hard as possible. In short, any expression (a section of code enclosed in parentheses) can be either prefix or infix, and it's up to the script parser to figure out which it is, and act accordingly.

Oh... and they also didn't like the required spaces.

This led to something like this:

  (set x1 (/ ((-b) + (sqrt (b*b - 4*a*c))) (* 2 a)))
  (set x2 (/ ((-b) - (sqrt (b*b - 4*a*c))) (* 2 a)))

where there is infix in parts and prefix in other parts. Not ideal, to be sure, but it's the users that want this, so we need to find some way to make it happen.

Getting the Easy Stuff First

The easy stuff was missing spaces and operator ordering. The missing spaces simply meant that I had to have foreknowledge of the different operators that I could run into, and use them as separators. Not hard, but it was more time-consuming on the parsing, and made the code quite a bit uglier. Still, it was easy to make:

  (*5 2)

return 10.
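The missing-spaces trick really is just treating every known operator (and paren) as its own separator while scanning. A sketch of that idea - not the real parser, and note that unary minus, like the (-b) above, would need extra care this sketch doesn't take:

```cpp
#include <string>
#include <vector>

// Split an expression where operators may butt up against operands,
// e.g. "(*5 2)". Every known operator and paren acts as a separator
// as well as a token of its own.
static std::vector<std::string> tokenize( const std::string & aSrc )
{
    static const std::string seps = "()+-*/=";
    std::vector<std::string> toks;
    std::string cur;
    for (size_t i = 0; i < aSrc.size(); ++i) {
        char c = aSrc[i];
        if (c == ' ' || seps.find(c) != std::string::npos) {
            // close out any operand we were accumulating
            if (!cur.empty()) { toks.push_back(cur); cur.clear(); }
            // the separator itself is a token (spaces are not)
            if (c != ' ') toks.push_back(std::string(1, c));
        } else {
            cur += c;
        }
    }
    if (!cur.empty()) toks.push_back(cur);
    return toks;
}
```

So "(*5 2)" comes out as ( * 5 2 ) - the same stream a fully-spaced expression would produce.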

The next easy one was the operator order. If you have:

  (10 = (* 5 2))

it's easy to see that the first token is not an operator, and to put it in as the first argument to the expression. When you hit the second token, it is an operator, and can be put in its place in the expression. In fact, it's easy to handle:

  (10 5 3 2 1 *)

as there's only one operator in the mix.

Only slightly harder is to ignore duplicated operators:

  (10 + 5 + 3 + 2 + 1)

for I can look at the first operator for the expression, and if the next operator in the expression is the same, I just ignore it. We're getting pretty far, actually.

The last easy one was putting expressions before operators, but I already had that covered by the numbers-before-operators case, so we get this for free:

  ((5*2) = (2+2+2+2+2))

Now for the Hard Stuff

In truth, all that stuff only took me about an hour to figure out. The hard part is operator precedence. I struggled with operator precedence for several days until an idea that had been stewing in my head finally came together. The basic problem is that I did not want a multi-pass parser where the first pass tokenizes the expression, the next orders the tokens based on their operator precedence, the next creates the prefix form, the next builds the evaluation tree, etc. It's not hard to see that a multi-pass parser is a very good way to go, but I'd already put so much effort into my parser (it's single-pass) that I didn't want to throw all that away if I didn't have to.

All I needed was to come up with a way to handle the operator precedence and I was golden. But it was a pain to come up with. In the end, I had something that worked, and it seemed to be pretty solid. I kept a stack of expressions I was parsing into. I started with the top-level expression, and then when I hit a different operator of greater precedence, I pulled the last argument off the expression, made a new expression, and put the new operator and that argument into it.

It's not really easy to see, and I'm not convinced it's really any simpler than a multi-pass parser, but it works. The reverse case is seeing a looser-binding operator: enclose the current expression in another expression and use the looser operator as the new 'main' operator. Again, not easy to see, but it's working and that's what really matters.
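The "steal the last argument" trick is easier to see in code than in prose. Here's a stripped-down sketch of the idea for flat expressions over just '+' and '*' (no parens, no unary minus) - all names are mine, and the real parser does much more:

```cpp
#include <string>
#include <vector>

// An expression under construction: an operator plus its arguments,
// where each argument is already rendered in prefix form.
struct Expr {
    char                      op;     // 0 until we see an operator
    std::vector<std::string>  args;
    Expr() : op(0) { }
};

static int prec( char anOp ) { return (anOp == '*') ? 2 : 1; }

static std::string render( const Expr & anExpr )
{
    std::string s = "(";
    s += anExpr.op;
    for (size_t i = 0; i < anExpr.args.size(); ++i) {
        s += " " + anExpr.args[i];
    }
    return s + ")";
}

static std::string infixToPrefix( const std::vector<std::string> & toks )
{
    std::vector<Expr> stack(1);
    for (size_t i = 0; i < toks.size(); ++i) {
        const std::string & t = toks[i];
        if (t == "+" || t == "*") {
            Expr & cur = stack.back();
            char   op = t[0];
            if (cur.op == 0) {
                cur.op = op;                 // first operator we've seen
            } else if (op == cur.op) {
                // duplicate operator - just ignore it
            } else if (prec(op) > prec(cur.op)) {
                // tighter binding: steal the last argument into a new
                // sub-expression headed by the new operator
                Expr sub;
                sub.op = op;
                sub.args.push_back(cur.args.back());
                cur.args.pop_back();
                stack.push_back(sub);
            } else {
                // looser binding: close the current expression...
                Expr done = stack.back();
                stack.pop_back();
                if (!stack.empty() && stack.back().op == op) {
                    // ...folding it into a parent with the same operator
                    stack.back().args.push_back(render(done));
                } else {
                    // ...or wrapping it in a new, looser expression
                    Expr wrap;
                    wrap.op = op;
                    wrap.args.push_back(render(done));
                    stack.push_back(wrap);
                }
            }
        } else {
            stack.back().args.push_back(t);  // a plain operand
        }
    }
    // fold any open sub-expressions back into the top level
    while (stack.size() > 1) {
        Expr done = stack.back();
        stack.pop_back();
        stack.back().args.push_back(render(done));
    }
    return render(stack[0]);
}
```

So "1 + 2 * 3" becomes (+ 1 (* 2 3)), and "2 * 3 + 1" becomes (+ (* 2 3) 1) - the same single left-to-right pass handling both the tighter and looser cases.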

I'm ready to get on to something else.

Google Chrome dev 10.0.648.18 is Out

Friday, February 4th, 2011

This morning the Google Chrome team released 10.0.648.18 with a nice list of changes - including the latest version of their V8 javascript engine, 3.0.12.8. There were also a few Mac-specific fixes, and that's great, but they still have that new preferences 'page' whose left-hand selector you can't resize. Very odd, and even annoying, as it's so bloody large.

So it goes... can't have it all, I suppose.

Under Water with Reluctant Users and Infix Syntax (cont.)

Thursday, February 3rd, 2011

Today was a nasty, nightmarish continuation of yesterday. I got a lot of stuff done, but I felt I was always a step behind. It'll change, I know, but while I'm in this "plate catching" phase it's not nearly as fun as it has been... or could be... or will be.