Archive for the ‘Coding’ Category

Thread Safe Queries on a Trie? Use a Trash Can!

Tuesday, January 25th, 2011


I've been facing a problem with my Trie: since it's lockless, elements are being replaced very fast, and a query can get you into trouble because during the query a replacement can come in and the value you're looking at gets deleted out from under you. Then you're looking at a nasty seg fault.

The problem is, you don't want to put a lock in there of any kind - that's going to defeat the purpose of the trie in the first place. So how do you protect queries, but not slow anything down? I was stumped. Until I had a flash - the trash.

If I used a very simple std::slist<Message *> as the "trash", and then a simple boolean that controlled its use, I could make the put() method look like:

  // put the new message in place, getting back whatever was there before
  Message  *old = mmap::put(aMessage);
  if (old != NULL) {
    if (mUseTrash) {
      // a query might still be holding this pointer - defer the delete
      mTrashCan.push_front(old);
    } else {
      delete old;
    }
  }

Then, when we're done, we can simply run through all the messages in the "trash", and delete them. Very nice and very clean.

The one wrinkle in this is that the Trie itself is not the place to put this code. The Trie doesn't have the behavior that old contents are deleted - they are just passed back to the caller. It's the subclass - my QuickCache - that implements this "delete the old" behavior. This means I need to put the code there, and that makes it a lot less elegant, but still very nice. The Trie stays clean, and that's good.
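
The clean-up in that subclass is simple enough - something along these lines, where emptyTrash() is just a hypothetical name for the method I'd call once the queries are done:

  void QuickCache::emptyTrash()
  {
    // no queries are in flight at this point, so it's finally
    // safe to really delete the old messages
    while (!mTrashCan.empty()) {
      delete mTrashCan.front();
      mTrashCan.pop_front();
    }
  }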

Still... nice solution to a nasty problem.

BBEdit 9.6.3 is Out

Tuesday, January 25th, 2011


This morning I saw that BBEdit 9.6.3 was out with a nice little list of improvements. Nothing major, but quite a few little edge-case crash fixes that I'm sure are nice to get out of the way for support reasons. Still one of my favorite tools of all time.

ZeroMQ and Ubuntu 10.04.1 Problems

Monday, January 24th, 2011


Today I've been trying to get my ticker plant running on Ubuntu 10.04.1, as we're moving to Ubuntu at The Shop. I thought it would be easy, since all the other packages had been installed by the Unix admins. All I needed to do was git clone, then make, and we'd be ready to go.

Well... not so fast, there bucko.

First, GCC now wants to warn for:

  char *list[] = { "first", "second" };

saying that you can't assign a string constant to a char *. The solution is to change the code to read:

  char *list[] = { (char *)"first", (char *)"second" };

or the slightly more cryptic:

  char const * const list[] = { "first", "second" };

Not too bad. I got the code to compile. Now to run it. Oops... no good there. It seems the ZeroMQ/OpenPGM sockets won't work on the Ubuntu interfaces. Specifically, if you run the code, you're going to get an error on line 68 - the call to connect(). But on CentOS5, it runs fine.
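
For context, the failing call is just the standard ZeroMQ PGM subscriber setup - roughly the shape of this sketch, where the interface and multicast group are placeholders and not the real feed:

  #include <zmq.hpp>

  int main()
  {
    zmq::context_t   context(1);
    zmq::socket_t    socket(context, ZMQ_SUB);
    socket.setsockopt(ZMQ_SUBSCRIBE, "", 0);
    // placeholder interface and multicast group - not the real feed
    socket.connect("epgm://eth0;239.192.1.1:12345");
    // ... receive messages ...
    return 0;
  }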

I had to make sure the gist illuminated the problem, and then I sent it to the ZeroMQ mailing list. I sure hope Sustrik has some idea about what's up. I really need this to work on Ubuntu as I have clients on Ubuntu, and we have plans to move the servers to Ubuntu very soon. Hope it's good news, soon.

[1/25] UPDATE: I don't think it's Ubuntu so much as the master branch of the ZeroMQ git repo. I looked at the code and in sub.cpp, line 27, the SUB constructor calls the XSUB constructor (file xsub.cpp, line 33), and in there, the code reads:

  options.type = ZMQ_XSUB;
  options.requires_in = true;
  options.requires_out = true;
  zmq_msg_init(&message);

and in socket_base.cpp, line 195, it's clear that if you are trying to use the epgm protocol, and you have options.requires_in and options.requires_out set, you're not going to be allowed. I mentioned this to Sustrik, the lead maintainer of ZeroMQ, and sure enough, he agreed that this was a recent change, and that it was going to be a problem.

For me, the solution is easier - use the tarball that I built the RPMs off of, and install that guy. It works, and doesn't have this problem. Later, when they get this solved, and other bugs fixed, we'll get a more recent cut and rebuild all the packages.

Starting Yet Another Broker Rewrite

Monday, January 24th, 2011


This morning I get to start on yet another rewrite of the Broker. The Java version of the Broker is a lot more complex than anyone was hoping it'd be, it's not really bug-free, and getting it that way is looking to be a daunting task. Additionally, it looks like there's still not going to be an easy route for a Python client, and that's hanging over everyone's head. So we sat down a while back and came up with another plan.

Every service is a web server. Every client is an HTTP client. That's about as universal as possible.

The sockets will be pooled. The connections gracefully handled. The Python libraries exist. It's a very simple model. We can use persistence, or not, and the end result is that the code will be vastly simpler.

So today I'm starting to look at the code for the service in C++. I need a nice, embeddable web server, and then I'll use cURL as the client-side component and get everything I need. Sure, I could do the client and server in boost ASIO, but then I'm decoding the headers and hassling with the SSL implementation, and that's too much work. I'm going to try and make this as simple as possible from the implementation side of things.
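
To get a feel for the client side, here's a minimal sketch of the kind of libcurl call I have in mind - the endpoint URL and query are placeholders, not the real Broker service:

  #include <curl/curl.h>
  #include <string>
  #include <iostream>

  // append the response body into a std::string as it arrives
  static size_t collect( char *ptr, size_t size, size_t nmemb, void *userdata )
  {
    ((std::string *)userdata)->append(ptr, size * nmemb);
    return size * nmemb;
  }

  int main()
  {
    curl_global_init(CURL_GLOBAL_ALL);
    CURL         *curl = curl_easy_init();
    std::string   response;
    // hypothetical endpoint - the real service URL will be different
    curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:8080/quote?symbol=IBM");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);
    if (curl_easy_perform(curl) == CURLE_OK) {
      std::cout << response << std::endl;
    }
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
  }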

Should be interesting. Here we go...

UPDATE: might have spoken too soon... I think it's likely that we're going back to the Erlang version of the Broker - the first version. We could put an HTTP interface on it and it'd solve all the problems. We'll see what the Broker's author thinks after looking into it for a while.

Odd Timeout Bug with Boost Async Timeout

Friday, January 21st, 2011


One of the things I've been noticing is that my shotgun approaches to the requests to the Broker are all timing out, and then successfully retrying. Very odd. Why fail, and then immediately succeed? Now that I have my ticker plants able to keep up with the feed, I wanted to take a little time and figure this out.

I started logging all the bytes on my test code - which I copied from the working ticker plant, and then saw the problem - every one of the requests/connections was being torn down before the data could be processed. All the data was there - it just wasn't being processed. Very odd. So I added logging in for the different places where the connection could be invalidated, and ran it again.

A timeout!? Really? These requests are taking less than a second, and the timeout is set for 25 sec. What's going on here? And then I started to really look at the code I had:

  void MMDClientUpdater::asyncReadTimeout( const boost::system::error_code & anError )
  {
    // we need to alert the user that the timeout occurred
    if (mTarget != NULL) {
      mTarget->fireUpdateFailed("asynchronous read timeout occurred");
    }
 
    // now remove this updater from the pool
    mBoss->removeFromUpdaters(mChannelID);
  }

Unless you've done a lot with boost asio and timeouts, this looks fine. The code is called when the timer fires and I'm able to respond to it. But that's not quite the whole story. It turns out that when the timer is cancelled, the handler still gets called - with an error code. We really need to have:

  void MMDClientUpdater::asyncReadTimeout( const boost::system::error_code & anError )
  {
    if (anError != boost::asio::error::operation_aborted) {
      // we need to alert the user that the timeout occurred
      if (mTarget != NULL) {
        mTarget->fireUpdateFailed("asynchronous read timeout occurred");
      }
 
      // now remove this updater from the pool
      mBoss->removeFromUpdaters(mChannelID);
    }
  }

Now we have something that works. I was simply missing that a cancel was going to be "seen" as an error. I updated the code and everything started working without the failures. Why was the retry working? It was a synchronous call because it was a retry. Funny.
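
To see the behavior in isolation, here's a tiny stand-alone sketch - everything in it is made up except the 25 sec timeout - showing that cancelling a deadline_timer still invokes the handler, just with operation_aborted:

  #include <boost/asio.hpp>
  #include <iostream>

  // hypothetical handler, mirroring the shape of asyncReadTimeout()
  void onTimeout( const boost::system::error_code & anError )
  {
    if (anError == boost::asio::error::operation_aborted) {
      std::cout << "timer cancelled - the read finished first" << std::endl;
    } else {
      std::cout << "real timeout - tear down the updater" << std::endl;
    }
  }

  int main()
  {
    boost::asio::io_service       svc;
    boost::asio::deadline_timer   timer(svc, boost::posix_time::seconds(25));
    timer.async_wait(onTimeout);
    // pretend the async read completed right away - cancel the timer
    timer.cancel();
    svc.run();
    return 0;
  }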

Glad to get that solved.

Success is Oh So Sweet!

Friday, January 21st, 2011


Success is Oh So Sweet! I was able to keep up with the OPRA flow this morning for the first time in many weeks. All the necessary changes had finally paid off, as I'd cracked a big performance problem on the ticker plants. I still have one more: the ConflationQueue still has a boost spinlock, and as my tests have shown, that's a performance killer when more than one thread accesses it - but still not nearly as bad as any other alternative I could come up with.

What I want now is to come up with a way to remove the spinlock from the ConflationQueue and I'll be one happy camper. But for now, I can move on to other things that need my pressing attention.

Google Chrome dev 10.0.642.2 is Out

Friday, January 21st, 2011


This morning I noticed that Google Chrome dev 10.0.642.2 was released with a nice list of fixes, including a new version of the V8 JavaScript engine. I'm pleased they are making progress, but still very disappointed that they are pulling H.264 support from the <video> tag.

But hey... it's their program. Their choice.

Finally Have Hope for Tomorrow

Thursday, January 20th, 2011


Today was spent, like every day in the last week, trying to see if my work yesterday had actually improved the speed profile of the ticker plants. Today started out just the same, but I decided to try a slightly different tack: build up a way to test intraday, and then hammer it with a lot more iterations.

What I did was to make the UDP feeder code "hold back" 50,000 datagrams and then "cut them loose" all at once into the downstream components. I could then monitor the time it took to completely process the "back log" and as a comparative measure, see what effect the change I just made had on the running system. For instance, when I disabled the processing of the UDP datagrams, I expected the "recovery time" to be less than a second. Face it - all we're supposed to be doing then is pulling the filled datagrams off the queue and recycling them.

What I found was that it was taking more than 10 seconds to empty the back log, in the presence of the "normal feed". This didn't seem right at all. So I started digging into the code for the datagram pooling and found something very interesting.

The pool's alloc() and recycle() methods were using a standard STL list: if there's nothing in the list, alloc() makes a new datagram, and if there's no more room in the list, recycle() tosses the returned one away. When I simply made alloc() create a new one, and recycle() delete it, I saw a dramatic increase in the processing speed. The recovery time was less than a second! But why?

It turns out that the boost spinlock, which is based on the old compare-and-swap GCC built-ins, was the real pain. I started digging into the pool with separate tests, and it was amazing what the difference was. I didn't want to have all the creations and deletions in the code - that wasn't the answer - but I needed to have something that was thread-safe.

Then it hit me - my CircularFIFO, the single-producer/single-consumer lockless queue I had should work. There's only one thread that's calling alloc() - that's the boost ASIO thread that's reading the UDP datagrams. There's only one thread that's putting the datagrams back into the pool - that's the processing thread. It should work.

And indeed it did. The processing times were as fast as the malloc/free ones, but there was no system overhead for the malloc/free cycle, and I had finally cracked a major speed killer in my code.
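
A minimal sketch of the shape of this - not my actual CircularFIFO, just an illustrative single-producer/single-consumer ring buffer standing in for it (the names and sizes here are made up), with the pool falling back to new/delete when the ring is empty or full:

  #include <atomic>
  #include <cstddef>

  // Illustrative single-producer/single-consumer ring buffer - a stand-in
  // for my CircularFIFO. push() is only ever called by the one producer
  // thread, pop() only by the one consumer thread, so no locks are needed.
  template <typename T, size_t N>
  class SPSCRing {
    public:
      SPSCRing() : mHead(0), mTail(0) { }

      // producer side - returns false if the ring is full
      bool push( T aValue ) {
        size_t  head = mHead.load(std::memory_order_relaxed);
        size_t  next = (head + 1) % N;
        if (next == mTail.load(std::memory_order_acquire)) {
          return false;
        }
        mSlots[head] = aValue;
        mHead.store(next, std::memory_order_release);
        return true;
      }

      // consumer side - returns false if the ring is empty
      bool pop( T & aValue ) {
        size_t  tail = mTail.load(std::memory_order_relaxed);
        if (tail == mHead.load(std::memory_order_acquire)) {
          return false;
        }
        aValue = mSlots[tail];
        mTail.store((tail + 1) % N, std::memory_order_release);
        return true;
      }

    private:
      T                    mSlots[N];
      std::atomic<size_t>  mHead;
      std::atomic<size_t>  mTail;
  };

  // The datagram pool sits on top of it: alloc() is only called from the
  // ASIO reader thread, recycle() only from the processing thread.
  struct Datagram { char body[1500]; };

  class DatagramPool {
    public:
      Datagram *alloc() {
        Datagram  *d = NULL;
        if (!mFree.pop(d)) {
          d = new Datagram();     // pool empty - fall back to the heap
        }
        return d;
      }
      void recycle( Datagram *d ) {
        if (!mFree.push(d)) {
          delete d;               // pool full - just toss it
        }
      }
    private:
      SPSCRing<Datagram *, 4096>  mFree;
  };

The key constraint is the same one I have in the real code: exactly one thread on each end of the ring.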

I updated the StringPool as well, since it's used in the ZeroMQ transmitter in a very similar fashion - again, so it wasn't going to be the problem there, either.

I'm actually looking forward to tomorrow's tests. For the first time in a long time, I really think I have a shot.

Indiana Jones and The Legend of the Fast Ticker Plant

Wednesday, January 19th, 2011


It's getting to be boring, but in a very bad way... Recently, I've been trying to find the speed that was once in my ticker plants, but seems to have left the building when I added a few key features. Today was no exception.

Facing problems (again) at the open, I realized that I needed to be able to re-create the "flood" that occurs at the open during normal market hours, so I added a little test code to the exchange feeds to "build up" the incoming datagram queue to 50,000 messages, and then start processing messages. This was a nice addition as it allowed me to reproduce the "flood" during the day. Nice to have.

Still... it wasn't happening. No matter what I was doing, I wasn't getting the speed I once had. I couldn't go back, so I had to look at vastly different plans. Starting with the locks in my Trie.

The basic problem in my Trie is that I can create the path to the leaf node in a lockless, thread-safe manner, because the basic building block is the GCC built-in:

  old = __sync_val_compare_and_swap(ptr, val1, val2);

This returns the previous value at ptr, and only performs the swap if that value was val1. This works fine when you're building something, but when you're replacing something, it's another matter. Look at the following example:

  old = __sync_val_compare_and_swap(&(l->data[b]), l->data[b], aMessage);

In general, there's no way to make this thread safe! Why? Because the thread will pull together the three arguments, and it's entirely possible for another thread to change the value of l->data[b] after the 'expected' argument has been read but before the swap happens. This makes it virtually impossible to use this in a very fast, tightly threaded environment.
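
Spelled out step by step - with an explicit 'expected' temporary just to make the two reads visible - the race looks like this:

  // Thread A reads l->data[b] to form the 'expected' argument...
  Message  *expected = l->data[b];
  // ...but Thread B can swap a new message into the slot right here,
  // between that read and the actual CAS instruction...
  Message  *old = __sync_val_compare_and_swap(&(l->data[b]), expected, aMessage);
  // ...so the CAS quietly does nothing - aMessage never goes into the slot.
  // In the one-liner above, the 'expected' value is never kept anywhere,
  // so there's nothing to compare 'old' against to even notice the failure.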

When there's more stability in the code, or you're not doing a simple replacement, you can work out ways to make the atomic operations work. It's just this one that was killing me. So I had to put in a simple spinlock, and that made the operation thread-safe. Kinda disappointing.

I also noted that the thread performance of my mmap really looked bad:

  Threads    Time per Call
     1         0.84 usec
     2         4.05 usec
     4         7.3  usec

That's a pretty nasty penalty to pay for multiple threads. The same work done in more time by more threads. It's counter-intuitive, that's for sure. But it makes the point that if there's a lot of contention, then the locking and unlocking is really the overhead, and the more you have to do it, the more it costs you.

So I made a decision to rewrite the exchange feed and remove one of the two threads on the datagram queues, and instead have just one thread emptying both of the queues. This way, none of the downstream components have to worry about multiple thread contention, and according to the table, I should get a lot faster performance.

We'll have to see tomorrow - everything is down now, so even my "flood gate" trick won't work.

UPDATE: The problem I was having with:

  old = __sync_val_compare_and_swap(&(l->data[b]), l->data[b], aMessage);

was that there was no way to know if it failed. After all, the compiler will pull together the arguments and then make the call. But if we've been undercut, how will I know? The spec says that the returned value is the value in the slot before the operation - but the 'expected' argument is read from that same slot while the parameters are being marshaled, so there's nothing reliable to compare the return value against.

What I've been able to use nicely is:

  old = l->data[b];
  while (!__sync_bool_compare_and_swap(&(l->data[b]), old, aMessage)) {
    old = l->data[b];
  }

where we know the pointer to the location (slot) and then grab the 'old' value before attempting to do the compare and swap. Then, I'll use the boolean version of the function and if it didn't work, I'll try it all again with the latest value.

This works fine as it detects the failure of the CAS and resets the operation. The previous version:

  old = __sync_val_compare_and_swap(&(l->data[b]), l->data[b], aMessage);

had no way to know it failed other than to do something similar to the boolean version. We have to know if it failed. Simple enough.
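
Wrapped up as a little helper - atomicSwap() is a hypothetical name, not something actually in the Trie - the pattern looks like this, and handing back the old value is exactly what the caller needs in order to delete (or trash) it:

  class Message;   // forward declaration - the real class lives elsewhere

  // keep retrying the CAS until it succeeds against a value we actually
  // read, and hand the replaced message back to the caller
  static inline Message *atomicSwap( Message **aSlot, Message *aNewValue )
  {
    Message  *old = *aSlot;
    while (!__sync_bool_compare_and_swap(aSlot, old, aNewValue)) {
      old = *aSlot;
    }
    return old;
  }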

Long, Hard Day of Optimization – Not a Ton of Success

Wednesday, January 19th, 2011

Getting it right and then getting it fast can make for a lot of work. And while no one said it would be easy, what surprises me is how hard it is to get the necessary speed, and what changes you have to make to the code in order to get it. Such has been my day. I've been spending a lot of time this week trying to get things faster, and while I've had quite a bit of success, I haven't had the level of success I was hoping for.

I was hoping that I'd find that one problem... that one bottleneck that slowed everything down. But that was not to be. I found a lot of little places to get a little speed: pooling data containers for the UDP datagram input and processing... reducing copying... and today - trying to make the processing of the UDP datagrams themselves faster.

When I start to mess with the processing, I invariably end up causing bugs - either from changes to the logic, or because something that wasn't a speed problem before has now become one since I've tightened up the timing enough to expose it.

The problem today has been that I have done a lot of work, but it hasn't made a lot of difference in the overall scheme of things. I'm still backing up on the UDP datagram queue on the input. It's better, but not enough to actually stop the overloading of the queue (at 100,000 elements). So it's not a real success, but I can say that I know what it isn't, and hopefully I'll be getting a lot closer to the real answer soon.