Archive for the ‘Coding’ Category

Google Chrome dev 11.0.672.2 is Out

Friday, February 18th, 2011

This morning I noticed that Google Chrome dev 11.0.672.2 was out and with it quite a few changes. While the major version number doesn't mean a great deal, in my experience, the addition of the new V8 engine, as well as a few nice Mac-specific features, makes this a nice release to have. I'm sure there's more to it than just this, but this is enough.

The Lockless MP/SC Ring FIFO Revisited

Thursday, February 17th, 2011

Today has been an interesting day... it started off with one of my servers having a lot of trouble with my new RingFIFO. Because it was still very early in the day (before the open), I just swapped out the RingFIFO with my LinkedFIFO and everything ran fine today. Still... when I had some time later today, I wanted to dig back into the issues with my RingFIFO and see if I couldn't fix it up and put it back into play for tomorrow.

Turns out, all I needed was a break and a little out-of-the-box thinking. What I had originally used as the 'ring' was a simple template array of values:

  /**
   * We have a very simple structure - an array of values of a fixed
   * size, a simple head and tail, and an insert point.
   */
  volatile T         mData[eCapacity];
  volatile size_t    mHead;
  volatile size_t    mTail;
  volatile size_t    mInsPt;

but the problem with this is that I was never sure how to atomically move the tail and place a value in that slot while checking for queue crashing. It was just too hard. In my old code, the push() referenced both the head and the tail, and the same was true for the pop(). This was clearly a recipe for disaster.

So I simplified. Significantly. The results are pretty nice.

First thing I realized is that the location is really orthogonal to the validity of the data in the queue. This means that I should really have something that 'tags' the data elements as 'valid' or 'invalid', and then use that to know when I can pop, or if I've overrun the queue.

If the head is pointing to the next good value, then all I need to do is check its 'valid' flag. If it's valid, then I can pop it: move the head up, snatch the old value and invalidate it, and we're done.

It's all easily done if we change our data elements to something like this:

  /**
   * In order to simplify the ring buffer access, I'm going to actually
   * have the ring a series of 'nodes', and for each, there will be a
   * value and a valid 'flag'. If the flag isn't true, then the value
   * in the node isn't valid. We'll use this to decouple the push()
   * and pop() so that each only needs to know its own location and
   * then interrogate the node for state.
   */
  struct Node {
    T      value;
    bool   valid;
 
    Node() : value(), valid(false) { }
    Node( const T & aValue ) : value(aValue), valid(true) { }
    ~Node() { }
  };
 
  /**
   * We have a very simple structure - an array of values of a fixed
   * size and a simple head and tail.
   */
  volatile Node      mData[eCapacity];
  volatile size_t    mHead;
  volatile size_t    mTail;

Now I can have a far more cleanly separated push() and pop(). This is going to make a significant difference, I hope. I can say this much for sure - it's considerably faster in my tests - roughly twice as fast. Very nice.
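
To make that separation concrete, here is a minimal sketch of how the two sides might look under this scheme - my illustration of the idea, not the production code - assuming eCapacity is a power of two and eBitMask is (eCapacity - 1):

    bool push( const T & aValue )
    {
      bool    error = false;
      // atomically claim the next slot - push() never touches the head
      size_t  spot = __sync_fetch_and_add(&mTail, 1) & eBitMask;
      if (mData[spot].valid) {
        // the slot still holds an unpopped value - we've crashed the queue
        error = true;
      } else {
        const_cast<Node &>(mData[spot]).value = aValue;
        // set the 'valid' flag last so the consumer never sees a half-set node
        const_cast<Node &>(mData[spot]).valid = true;
      }
      return !error;
    }

    bool pop( T & aValue )
    {
      bool    error = false;
      volatile Node & spot = mData[mHead & eBitMask];
      if (!spot.valid) {
        // the node at the head hasn't been filled in yet - nothing to pop
        error = true;
      } else {
        // snatch the old value, invalidate it, and move the head up
        aValue = const_cast<Node &>(spot).value;
        const_cast<Node &>(spot).valid = false;
        ++mHead;
      }
      return !error;
    }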

Wicked Error with (Thankfully) Simple Solution

Thursday, February 17th, 2011

This morning I got an error that looked like this:

  *** glibc detected *** tpClient: double free or
     corruption (fasttop): 0x0...b0 ***

and after a bit of googling, it appears that this check comes from glibc's runtime checking on the calls to free(). It seems that the default behavior is to inspect the argument to free() to see if it's a double-free. Sounds nice, in theory, but if we're hammering the system, it seems like this might not be a good thing to do.

Thankfully, the solution is easy - don't check. To turn it off, simply say:

  export MALLOC_CHECK_=0

and the checks will not be done. Good enough for me. I'm still going to try out the TCMalloc from Google, but that's a previous post. Still don't have it built for x86_64 yet.

Interesting Thread-Cacheing malloc() Replacement from Google

Thursday, February 17th, 2011

This morning I was looking into the problem with my ticker plant clients and the older kernel shipping with CentOS 5. Basically, when I increased the message rate, we crossed some threshold on CentOS and started getting a lot of heap corruption manifesting itself as malloc() or free() problems in ZeroMQ. On Ubuntu 10.04.1 everything was fine, most likely because its kernel is significantly newer than the one in CentOS 5. So I went on a search for a better malloc() and free().

What I came across was google-perftools. This is pretty amazing stuff. It's a thread-caching replacement for malloc() and free() that is as simple as adding a -ltcmalloc to the build line. It's got profiling tools as well, but that's not what interests me as much as the amazing speed gains it provides. The graphs in the paper show about a 4x increase in operations per second when using this.
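
For reference, that really is the entire extent of the change - something like this on the link line (a hypothetical build command, not my actual Makefile):

  g++ -O2 -o tpClient tpClient.cpp -ltcmalloc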

It's not conceptually hard - the TCMalloc library grabs a large block of memory from the system and then offers it up to the application. This puts the allocation calls in user space, and the control of the memory there as well. Because their design has the smaller blocks held in per-thread caches, it's possible to see no locking contention on malloc() and free(), which should be a major boon to me.

I have to get it built by the Unix Admins for Ubuntu 10.04.1 - I've already built the x86_64 RPMs for CentOS 5 and installed them on a test box I have access to, but I really want to start on the Ubuntu boxes. Simple change, should see major improvement. Very exciting.

UPDATE: it's built on all my boxes and ready to go for tomorrow. I'm excited about the possibilities.

Google Chrome dev 10.0.648.82 is Out

Thursday, February 17th, 2011

This morning I saw that Google Chrome dev 10.0.648.82 was out with a stability release update. It's worded a little strangely, but then again, these guys are engineers, not English majors, so you have to cut them a little slack now and again.

Nice to see the improvements. It's getting better and better.

Lockless Multiple-Producer/Single-Consumer Ring FIFO Queue

Wednesday, February 16th, 2011

I'm still on the quest for better performance out of my ticker plants and their clients, and to that end today I spent quite a bit of time on building a multiple-producer/single-consumer, lockless, FIFO ring buffer. The problem I have with my LinkedFIFO queue is that it's constantly doing malloc and free - OK, new and delete, but it's the same thing. The problem is all this thrashing. What I want is to be able to pre-allocate the memory for the buffer and then just use it.

So I spent a lot of time - more than I'd like to have spent - getting this guy right. It wasn't as easy as I'd hoped, but in the end, I think it's going to be very useful for me.

The hard part was the multiple-producers. No question. The obvious solution to the multiple-producers is to get a concept of the 'new' Tail, put your data at the old Tail, and update the new Tail. You have to check for queue crashes in there, and all this has to be atomic. It's that last part that's the kicker.

What I came up with was the idea that the queue had an insert point. This is not the tail, because the insert point is atomically moved by the push() method, and then, after we've checked for crashing and set the value, we can update the tail.

    bool push( T & anElem )
    {
      bool    error = false;
 
      // move the insertion point and get the old value for me
      size_t  insPt = __sync_fetch_and_add(&mInsPt, 1);
      // see if we've crashed the queue all the way around...
      if (__sync_bool_compare_and_swap(&mHead, (insPt + 1), mHead)) {
        error = true;
        cLog.error("[push] the queue is full! Can't push this item");
      } else {
        // save the data in the right spot for the size
        const_cast<T &>(mData[insPt & eBitMask]) = anElem;
        // ...and finally update the 'tail' for a new item
        __sync_fetch_and_add(&mTail, 1);
      }
 
      return !error;
    }

If you move the tail before it's got a value, then it's possible that the consumer can see that the tail has moved, assume that it contains valid data, and process it. Bad move. So it's clear you have to update the tail last. But then how do you know where to put the new data? Hence the insert point.
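
For completeness, the single-consumer side is much simpler - here's a rough sketch of what the matching pop() looks like under this scheme (illustrative, and assuming the same power-of-two eBitMask sizing as push()):

    bool pop( T & anElem )
    {
      bool    error = false;
      // anything between the head and the tail is valid data
      if (mHead == mTail) {
        error = true;
      } else {
        // grab the value at the head...
        anElem = const_cast<T &>(mData[mHead & eBitMask]);
        // ...and only then move the head up one spot
        __sync_fetch_and_add(&mHead, 1);
      }
      return !error;
    }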

This is working fine for me, and while it's still pretty new code, I have a feeling it's solid - there just isn't that much to it, and the tests I've come up with are pretty comprehensive. It's nice to have a selection of efficient FIFO queues to choose from.

Tracking Down a Problem I Already Solved

Tuesday, February 15th, 2011

I hate being dense. Really. In the recent past, I realized that I can't really use the compare-and-swap in a multi-threaded environment like this:

  Node  *old = __sync_val_compare_and_swap(&mTail,
                                           mTail,
                                           newbie);

because it's possible that while the arguments are being marshaled, another thread gets in there and changes the value of mTail, so that this atomic operation would fail, and the returned value would not be what I expected it to be. It would be the previous value, but mTail would not be set to newbie, which is what I wanted.

What I was looking for was simply an exchange - I don't care about the current value, I just want to swap what's there now for something new, and give me the old while I give you the new.

No such luck. There's nothing like that in GCC.

So I'm back to the only thing I can do:

  Node   *t = mTail;
  while (!__sync_bool_compare_and_swap(&mTail, t, newbie)) {
    t = mTail;
  }

and in this we are forced to loop until we get a consistent set, but in the end it's still just a swap. Sure do wish they had one of those.

Anyway, I'm miffed at myself for spending several hours on this when I'd already solved it. The wrinkle came in the multi-threaded usage of the LinkedFIFO, and while my test cases worked, I didn't really check that there was no corruption in the data. I added those tests to the app and all of a sudden it was clear I had an issue. It took me about 5 mins to fix, and all was well, but finding the problem took hours.

Sometimes, I'm my own worst enemy. However, I will say that I'm getting far more comfortable with all this stuff the more I use it. Guess there's a silver lining in this after all.

Getting Ever Closer to the Goal

Monday, February 14th, 2011

This morning has been better than most, but still challenging. Today I was able to handle the open with my ticker plants, and the client had a latency of less than 368 msec. This is great news. I had expected the upper limit to be around 400 msec, based on the tests I've been running with the individual components: at about 100,000 msgs/sec from the exchange, a per-message processing time of 4 usec works out to 100,000 x 4 usec = 400 msec to chew through a second's worth of messages. So we're probably running a little shy of the 100,000 msgs/sec, and that was what we were seeing.

Great news.

But it was not without sadness. The multiple-producer/single-consumer linked FIFO was not being a nice citizen, and was corrupting the malloc/free heap. It was causing a lot of problems. It was easy to track down, as it was a big user of new and delete, and it was newly added to the code. When I added a simple spinlock to the code, everything cleared up - and slowed down. So I went on a little fishing expedition for some other implementations to look at.
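
The spinlock itself was nothing fancy - something along these lines, using the GCC atomic builtins (a sketch of the idea, not the exact code I dropped in):

  // a bare-bones spinlock built on the GCC atomic builtins
  volatile int   mLock;   // starts out 0 (unlocked)

  void lock()
  {
    // spin until we're the one that flips the flag from 0 to 1
    while (__sync_lock_test_and_set(&mLock, 1)) { }
  }

  void unlock()
  {
    // clear the flag (with the release barrier that goes with it)
    __sync_lock_release(&mLock);
  }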

What I found was that by starting with a dummy Node, and pointing both the head and the tail at it, you can really simplify the code. Also, you can segregate the head and tail pointers to their single, respective methods so that there's no chance of cross-contamination. Much nicer. I put that in and things are looking much better.
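
Here's a compact sketch of that dummy-node idea, assuming a single consumer - my own illustration of the technique, not the code that went into the plant:

  template <class T> class LinkedFIFO
  {
    public:
      LinkedFIFO() : mHead(new Node()), mTail(mHead) { }

      // producers only ever touch the tail...
      void push( const T & aValue )
      {
        Node   *me = new Node(aValue);
        // atomically swing the tail to the new node, then link it in
        Node   *t = mTail;
        while (!__sync_bool_compare_and_swap(&mTail, t, me)) {
          t = mTail;
        }
        t->next = me;
      }

      // ...and the (single) consumer only ever touches the head
      bool pop( T & aValue )
      {
        // the head is always the dummy; real data starts at head->next
        Node   *first = mHead->next;
        if (first == NULL) {
          return false;
        }
        aValue = first->value;
        // 'first' becomes the new dummy, and the old one is retired
        Node   *old = mHead;
        mHead = first;
        delete old;
        return true;
      }

    private:
      struct Node {
        T                value;
        Node * volatile  next;
        Node() : value(), next(NULL) { }
        Node( const T & aValue ) : value(aValue), next(NULL) { }
      };

      Node * volatile  mHead;
      Node * volatile  mTail;
  };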

I'll still need to test for several days to be sure, but things are looking a lot better. Next up is to rebuild the two other development machines with Ubuntu 10.04.1, as I've had some real issues with this code on CentOS 5. I believe it's due to the memory allocation scheme in the kernel, and Ubuntu's is a lot newer than CentOS 5's. Hard to say. What isn't hard to see is that on CentOS the code won't run for 2 mins, and on Ubuntu it seems to run forever.

I'm hoping to get these boxes rebuilt today and then all my stuff moved back over to them so I can be working off the set of three for tomorrow's open. That would be a really nice goal.

Using ZeroMQ’s Zero Copy Message Constructor

Monday, February 14th, 2011

This morning I was looking at some code that was left over from last week, and was causing a lot of problems with the CentOS build of my ticker plants. The problem was in the zmq::message_t's data() method call, and I was trying to dig into the ZeroMQ code to see what was happening there that might be causing the problems on CentOS and not on Ubuntu.

Nothing really popped out. But at the same time, my existing transmitter code looks something like this:

  try {
    // make a ZMQ message of the right size
    zmq::message_t  msg(aPayload.size());
    // ...copy in the data we need from the payload
    memcpy(msg.data(), aPayload.data(), aPayload.size());
    // ...and WOOSH! out it goes
    aTopic.socket->send(msg);
  } catch (std::exception & e) {
  }

It's a design rule that you need to create a new ZMQ message for each send, and while it's possible to copy messages, I hadn't really dug into it that much up to now. But this morning I decided to see if I could use the zero copy version of the message constructor to minimize the creations on the heap and therefore bring a little stability to the CentOS build.

It turns out to be pretty easy. In my code, I've got one 'buffer' - a std::string that I clear out every message send and fill with the serialized form of the outgoing message. I'm already passing this into the sending method as aPayload, and copying the data out of it into the new message.

For the zero copy to work, we need to have a function that can be called to "free" the memory passed into the ZMQ message's constructor. Since I've got this one buffer, and I'm reusing it over and over again, all this "free" method needed to do was exactly nothing. I just needed to satisfy the ZeroMQ contract.

So I made a static do-nothing method:

  static void payloadRecycle( void *data, void *hint );
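
The body really is empty - something like this (sketching the definition here; the only requirement is that the signature matches what zmq::message_t expects for its free function):

  void payloadRecycle( void *data, void *hint )
  {
    // deliberately nothing - the buffer is long-lived and reused,
    // so there's nothing to free
  }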

and then changed my code to look like:

  try {
    // make a ZMQ message with the payload's data
    zmq::message_t  msg((void*)aPayload.data(), aPayload.size(), payloadRecycle);
    // ...and WOOSH! out it goes
    aTopic.socket->send(msg);
  } catch (std::exception & e) {
  }

At this point, the zmq::message_t is going to create its structures around this buffer, send out the message, and then call my "free" function. That does nothing, and then we return to the loop where I clear out this buffer and do it all again. Very slick!

Unfortunately, this didn't help the CentOS build - the errors appeared to be coming from this call, but they didn't stop. Ubuntu, however, is running strong. I can't figure that out, but I'm switching to Ubuntu anyway, so it really doesn't matter.

Still... I got rid of the buffer creation and copy, and that's got to help the performance a little.

UPDATE: I had a problem on the receiver side of things. It seems that when I do the zero-copy, ZMQ misses the size of the payload by one or two bytes on either side of the actual payload. When I use the memcpy() code, it's just fine. Sounds like a bug in the ZMQ code to me. I'll see if I can find out whether they have extensively tested that code.

[9:22] UPDATE: the problem is more subtle - the fact is that the ZMQ send() method is not synchronous. It returns as soon as it can, and buffers the message to be sent later. I, on the other hand, assumed that it was, and so I was refilling the buffer with the next message while the old contents were still in flight. Not good. The problem could be solved with a string buffer pool, but then we're copying into that, and the message still needs to be constructed.

I suppose that if I put the string pool in the "filling" part, it'd work, as we're only ever going to fill one string at a time, and that would save a copy. But for now, I know why it's not working, and I'm OK with the old code.
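
If I ever go that route, the pool doesn't need to be anything more than a round-robin set of buffers - a quick sketch of the idea (hypothetical code, and it assumes the pool is deep enough that ZMQ is done sending a buffer before it comes around again):

  #include <string>
  #include <vector>

  class StringPool {
    public:
      StringPool( size_t aCount ) : mBuffers(aCount), mNext(0) { }

      // hand out the buffers round-robin; each comes back cleared
      std::string & next()
      {
        std::string & s = mBuffers[mNext++ % mBuffers.size()];
        s.clear();
        return s;
      }

    private:
      std::vector<std::string>  mBuffers;
      size_t                    mNext;
  };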

At least I know what's happening and what it takes to really fix it.

Optimization Brings Its Own Problems

Friday, February 11th, 2011

Today has been an up and down day... on the one hand, the changes I made yesterday have increased the speed of the ZeroMQ transmitter to new heights, but with that has come other problems - most notably in CentOS 5. Interestingly enough, I'm now seeing real problems with the ticker plants on CentOS 5 when the message rates get to even moderate levels. It's always in the ZeroMQ code, but I've dug into that code, and I can't see what it's doing wrong other than calling malloc() quite a bit.

Interestingly enough, when I run it on Ubuntu 10.04.1, it's fine - so I'm currently inclined to think it's something in the Linux kernel to do with memory allocation/deallocation that was fixed in the later kernels and simply isn't available for CentOS 5. We'll have to run more tests on Monday, but for now, I'm at a bit of a loss.

Thankfully, we're moving to Ubuntu, and so long as it runs there, I'm good to go. The only wrinkle is the client - does this run on a CentOS 5 client? Don't know, and we'll have to see next week. It could go either way.

I did a lot of other little optimizations today - things that weren't an issue before, have now become an issue because of the speed of the overall process. Locks on maps that aren't really needed... things like this. I spent the entire day finding these little issues and fixing/updating them, and checking CentOS for a possible fix.

They all worked on Ubuntu, and it seemed that nothing I could do was going to get it running on CentOS again. We'll have to start back at it on Monday.