Archive for the ‘Coding’ Category

Tracking Down Bugs in Latest ZeroMQ Code

Wednesday, January 26th, 2011

ZeroMQ

This morning I was able to track down some problems I've been having with ZeroMQ on Ubuntu. Turns out, it was bugs in the latest code off the ZeroMQ GitHub master. They had made some fixes that hadn't been checked into master yet, so while they thought the bugs were fixed, they really weren't. So I talked to Sustrik on IRC and he asked me to try a simple test - just a small change to a few lines of code.

Turns out, that cleared up the exception I was getting - as I expected, based on looking at the change - but it didn't solve the problem of getting the messages. Seems there's more to it. When I told him of these results, he asked me to put together a little compatibility matrix for the different versions: 2.1.0 (ref), 2.1.0 with my changes for RECOVERY_IVL (me), and the git master (master). Here's what I found:

  Sender   Receiver   Works?
  2.1.0    2.1.0      Yes
  me       me         Yes
  master   master     No
  me       2.1.0      Yes
  me       master     No

It sure seems like there's a problem in the latest code. Thankfully, the unix admins have built my drop of the code into Ubuntu packages, so we can continue with the project, but I'm going to have to be very careful when we move off this code, as that could turn out to be a significant change.

Just something to consider. I like helping out, and giving back.

Lots of Progress Today – Baby Steps to a Great Ticker Plant

Tuesday, January 25th, 2011

Today I've had the opportunity to do a lot of little things on the codebase. The IRC client wasn't splitting lines right, there were more efficient ways to query messages out of the QuickCache, the client constructors needed a little work - all stuff that wasn't big, but was important. I was humming right along with the changes - tackling one problem after another.

Pretty nice day. Lots of really useful stuff done.

Thread Safe Queries on a Trie? Use a Trash Can!

Tuesday, January 25th, 2011

I've been facing a problem with my Trie - seeing as how it's lockless, it's replacing elements very fast, and a query of the trie can get you into trouble: during the query, a replacement can come in and the value you're looking at gets deleted out from under you. Then you're looking at a nasty seg fault.

The problem is, you don't want to put a lock in there of any kind - that's going to defeat the purpose of the trie in the first place. So how do you protect queries, but not slow anything down? I was stumped. Until I had a flash - the trash.

If I use a very simple std::slist<Message *> as the "trash", and a simple boolean that controls its use, I can make the put() method look like:

  // put the new message in place and get back whatever it displaced
  Message  *old = mmap::put(aMessage);
  if (old != NULL) {
    if (mUseTrash) {
      // a query may still be holding 'old' - park it in the trash
      mTrashCan.push_front(old);
    } else {
      // no readers to worry about - safe to delete it right away
      delete old;
    }
  }

Then, when we're done, we can simply run through all the messages in the "trash", and delete them. Very nice and very clean.

The one wrinkle in this is that the Trie itself is not the place to put this code. The Trie doesn't delete old contents; they are just passed back to the caller. It's the subclass - my QuickCache - that implements this "delete the old" behavior. This means I need to put the code there, and that makes it a lot less elegant, but still very nice. The Trie stays clean, and that's good.
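
To make the "when we're done" step concrete, here's a minimal sketch of what the trash-emptying side might look like in the QuickCache. The names mUseTrash and mTrashCan are from the snippet above, but emptyTrash() and its exact shape are my guesses, not the real code:

  // drain the trash once the queries that might still be holding old
  // pointers have finished (emptyTrash is a hypothetical method name)
  void QuickCache::emptyTrash()
  {
    // stop diverting old messages into the trash...
    mUseTrash = false;
    // ...and delete everything that piled up while it was on
    while (!mTrashCan.empty()) {
      delete mTrashCan.front();
      mTrashCan.pop_front();
    }
  }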

Still... nice solution to a nasty problem.

BBEdit 9.6.3 is Out

Tuesday, January 25th, 2011

This morning I saw that BBEdit 9.6.3 was out with a nice little list of improvements. Nothing major, but quite a few little edge-case crash fixes that I'm sure are nice to get out of the way for support reasons. Still one of my favorite tools of all time.

ZeroMQ and Ubuntu 10.04.1 Problems

Monday, January 24th, 2011

ZeroMQ

Today I've been trying to get my ticker plant running on Ubuntu 10.04.1, as we're moving to Ubuntu at The Shop. I thought it would be easy, since all the other packages were installed by the unix admins. All I needed to do was git clone, then make, and we'd be ready to go.

Well... not so fast, there bucko.

First, GCC now warns on:

  char *list[] = { "first", "second" };

saying that you can't assign a string constant to a char *. The solution is to change the code to read:

  char *list[] = { (char *)"first", (char *)"second" };

or the slightly more cryptic:

  char const * const list[] = { "first", "second" };

Not too bad. I got the code to compile. Now to run it. Ooopppsss... no good there. Seems the network interfaces on Ubuntu aren't being allowed for the ZeroMQ/OpenPGM sockets. Specifically, if you run the code, you're going to get an error on line 68 - the call to connect(). But on CentOS5, it runs fine.
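
For reference, here's a stripped-down sketch of the kind of epgm subscriber that trips this, using the plain C API - the interface name and multicast address are placeholders, not the values from the actual gist:

  #include <zmq.h>
  #include <stdio.h>

  int main()
  {
    void  *ctx = zmq_init(1);
    void  *sub = zmq_socket(ctx, ZMQ_SUB);
    zmq_setsockopt(sub, ZMQ_SUBSCRIBE, "", 0);
    // this is the connect() that errors out on Ubuntu, but runs fine
    // on CentOS5 against my 2.1.0 build
    if (zmq_connect(sub, "epgm://eth0;239.192.1.1:5555") != 0) {
      printf("connect failed: %s\n", zmq_strerror(zmq_errno()));
    }
    zmq_close(sub);
    zmq_term(ctx);
    return 0;
  }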

I made sure the gist illuminated the problem, and then I sent it to the ZeroMQ mailing list. I sure hope Sustrik has some idea about what's up. I really need this to work on Ubuntu, as I have clients on Ubuntu and we have plans to move the servers to Ubuntu very soon. Hope it's good news, soon.

[1/25] UPDATE: I don't think it's Ubuntu so much as it's the master of the ZeroMQ git repo. I looked at the code, and in sub.cpp line 27 the SUB constructor calls the XSUB constructor (file xsub.cpp line 33), where the code reads:

  options.type = ZMQ_XSUB;
  options.requires_in = true;
  options.requires_out = true;
  zmq_msg_init(&message);

and in socket_base.cpp, around line 195, it's clear that if you're trying to use the epgm protocol and you have options.requires_in and options.requires_out set, the connect isn't going to be allowed. I mentioned this to Sustrik, the lead maintainer of ZeroMQ, and sure enough, he agreed that this was a recent change and that it was going to be a problem.
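
Paraphrasing the check, it's roughly this shape - my reading of what that part of socket_base.cpp is doing, not a verbatim excerpt of the ZeroMQ source:

  // multicast (pgm/epgm) is only allowed for one-directional socket
  // types, so a socket that claims to need both directions is rejected
  // (paraphrased - not an exact copy of the ZeroMQ code)
  if ((protocol == "pgm" || protocol == "epgm") &&
      options.requires_in && options.requires_out) {
    errno = ENOCOMPATPROTO;
    return -1;
  }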

For me, the solution is easier - use the tarball that I built the RPMs off of, and install that guy. It works, and doesn't have this problem. Later, when they get this solved, and other bugs fixed, we'll get a more recent cut and rebuild all the packages.

Starting Yet Another Broker Rewrite

Monday, January 24th, 2011

Ringmaster

This morning I get to start on yet another rewrite of the Broker. The Java version of the Broker is a lot more complex than anyone was hoping it'd be, it's not really bug-free, and getting it there is looking to be a daunting task. Additionally, it looks like there's still no easy route to a Python client, and that's hanging over everyone's head. So we sat down a while back and came up with another plan.

Every service is a web server. Every client is an HTTP client. That's about as universal as possible.

The sockets will be pooled. The connections gracefully handled. The Python libraries exist. It's a very simple model. We can use persistence, or not, and the end result is that the code will be vastly simpler.

So today I'm starting to look at the code for the service in C++. I need a nice, embeddable web server, and then I'll use cURL as the client-side component, and that gets me everything I need. Sure, I could do the client and server in boost ASIO, but then I'm decoding the headers and hassling with the SSL implementation myself, and that's too much work. I'm going to try to make this as simple as possible from the implementation side of things.
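
The client side really is that small. Here's a minimal sketch of an HTTP GET with libcurl's easy interface - httpGet() and the lack of error handling are just mine, to show the shape of the call, not the Broker client's actual API:

  #include <curl/curl.h>
  #include <string>

  // libcurl write callback - collect the response body into a std::string
  static size_t collect( void *aBuf, size_t aSize, size_t aCount, void *aUser )
  {
    ((std::string *)aUser)->append((char *)aBuf, aSize * aCount);
    return aSize * aCount;
  }

  std::string httpGet( const std::string & aURL )
  {
    std::string   body;
    CURL          *curl = curl_easy_init();
    if (curl != NULL) {
      curl_easy_setopt(curl, CURLOPT_URL, aURL.c_str());
      curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
      curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
      curl_easy_perform(curl);
      curl_easy_cleanup(curl);
    }
    return body;
  }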

Should be interesting. Here we go...

UPDATE: might have spoken too soon... I think it's likely that we're going back to the Erlang version of the Broker - the first version. We could put an HTTP interface on it and it'd solve all the problems. We'll see what the Broker's author thinks after looking into it for a while.

Odd Timeout Bug with Boost Async Timeout

Friday, January 21st, 2011

Boost C++ Libraries

One of the things I've been noticing is that my shotgun-approach requests to the Broker are all timing out, and then succeeding on the retry. Very odd. Why fail, and then immediately succeed? Now that I have my ticker plants able to keep up with the feed, I wanted to take a little time and figure this out.

I started logging all the bytes in my test code - which I'd copied from the working ticker plant - and then saw the problem: every one of the requests/connections was being torn down before the data could be processed. All the data was there - it just wasn't being processed. Very odd. So I added logging at the different places where the connection could be invalidated, and ran it again.

A timeout!? Really? These requests are taking less than a second, and the timeout is set for 25 sec. What's going on here? And then I started to really look at the code I had:

  void MMDClientUpdater::asyncReadTimeout( const boost::system::error_code & anError )
  {
    // we need to alert the user that the timeout occurred
    if (mTarget != NULL) {
      mTarget->fireUpdateFailed("asynchronous read timeout occurred");
    }
 
    // now remove this updater from the pool
    mBoss->removeFromUpdaters(mChannelID);
  }

Unless you've done a lot with boost asio and timeouts, this looks fine. The code is called when the timer fires, and I'm able to respond to it. But that's not quite the whole story. It turns out that when a timer is cancelled, this handler still gets called - just with an error code of boost::asio::error::operation_aborted. We really need to have:

  void MMDClientUpdater::asyncReadTimeout( const boost::system::error_code & anError )
  {
    // a cancelled timer still invokes this handler, but with an error
    // code of operation_aborted - only a *real* expiration is a failure
    if (anError != boost::asio::error::operation_aborted) {
      // we need to alert the user that the timeout occurred
      if (mTarget != NULL) {
        mTarget->fireUpdateFailed("asynchronous read timeout occurred");
      }
 
      // now remove this updater from the pool
      mBoss->removeFromUpdaters(mChannelID);
    }
  }

Now we have something that works. I was simply missing that a cancel was going to be "seen" as an error. I updated the code and everything started working without the failures. Why was the retry working? Because the retry path used a synchronous call, not an asynchronous one. Funny.
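
For completeness, the timer that drives this handler is wired up more or less like this inside the updater - the 25 sec value is from above, but startReadTimeout() and the member name mTimer are my stand-ins, not the actual code:

  void MMDClientUpdater::startReadTimeout()
  {
    // mTimer is a boost::asio::deadline_timer member; a later
    // mTimer.cancel() is what arrives in asyncReadTimeout() as
    // operation_aborted
    mTimer.expires_from_now(boost::posix_time::seconds(25));
    mTimer.async_wait(boost::bind(&MMDClientUpdater::asyncReadTimeout,
                                  this,
                                  boost::asio::placeholders::error));
  }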

Glad to get that solved.

Success is Oh So Sweet!

Friday, January 21st, 2011

Success is Oh So Sweet! I was able to keep up with the OPRA flow this morning for the first time in many weeks. All the changes have paid off, as I'd cracked a big performance problem on the ticker plants. I still have one more: the ConflationQueue still has a boost spinlock, and as my tests have shown, that's a performance killer when more than one thread accesses it - but it's not nearly as bad as any other alternative I could come up with.

What I want now is to come up with a way to remove the spinlock from the ConflationQueue and I'll be one happy camper. But for now, I can move on to other things that need my pressing attention.

Google Chrome dev 10.0.642.2 is Out

Friday, January 21st, 2011

V8 Javascript Engine

This morning I noticed that Google Chrome dev 10.0.642.2 was released with a nice list of fixes, including a new version of the V8 JavaScript engine. I'm pleased they're making progress, but still very disappointed that they're pulling H.264 support from the <video> tag.

But hey... it's their program. Their choice.

Finally Have Hope for Tomorrow

Thursday, January 20th, 2011

Today was spent, like every day in the last week, trying to see if yesterday's work had actually improved the speed profile of the ticker plants. It started out just the same, but I decided to try a slightly different tack: build up a way to test intraday, and then hammer it a lot harder with the iterations.

What I did was to make the UDP feeder code "hold back" 50,000 datagrams and then "cut them loose" all at once into the downstream components. I could then monitor the time it took to completely process the "back log" and as a comparative measure, see what effect the change I just made had on the running system. For instance, when I disabled the processing of the UDP datagrams, I expected the "recovery time" to be less than a second. Face it - all we're supposed to be doing then is pulling the filled datagrams off the queue and recycling them.
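
Conceptually, the hold-back gate in the feeder is nothing fancy - something like the sketch below, where the names (mHoldBack, mPending, deliver) are placeholders for what the feeder actually uses, not the real code:

  // buffer the first 50,000 datagrams, then release them all at once
  // so we can time how long the downstream components take to drain
  // the back log
  void Feeder::onDatagram( Datagram *aDatagram )
  {
    if (mHoldBack && (mPending.size() < 50000)) {
      // still building up the back log - just queue it
      mPending.push_back(aDatagram);
    } else {
      // cut the back log loose into the downstream components
      while (!mPending.empty()) {
        deliver(mPending.front());
        mPending.pop_front();
      }
      mHoldBack = false;
      deliver(aDatagram);
    }
  }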

What I found was that it was taking more than 10 seconds to empty the back log, in the presence of the "normal feed". This didn't seem right at all. So I started digging into the code for the datagram pooling and found something very interesting.

The pool's alloc() and recycle() methods were using a standard STL list: if there's nothing in the list, alloc() makes a new datagram, and if there's no room left in the list, recycle() tosses the returned one away. When I simply made alloc() always create a new one, and recycle() delete it, I saw a dramatic increase in the processing speed. The recovery time was less than a second! But why?

It turns out that the boost spinlock, which is based on the old compare-and-swap gcc macros, was turning out to be a real pain for me. I started digging into the pool with separate tests, and the difference was amazing. I didn't want to leave all the creations and deletions in the code - that wasn't the answer - but I needed something that was thread-safe.

Then it hit me - my CircularFIFO, the single-producer/single-consumer lockless queue I already had, should work. There's only one thread calling alloc() - the boost ASIO thread that's reading the UDP datagrams. There's only one thread putting the datagrams back into the pool - the processing thread. It should work.

And indeed it did. The processing times were as fast as the malloc/free version, but without the system overhead of the malloc/free cycle, and I had finally cracked a major speed killer in my code.
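
The pool itself ends up being tiny. A minimal sketch of the idea, assuming a CircularFIFO whose push()/pop() return false when full/empty - those method names are approximations of my template's interface, not its exact signatures:

  // alloc() runs only on the boost ASIO (reader) thread, recycle() only
  // on the processing thread, so a single-producer/single-consumer
  // lockless FIFO is all the thread-safety we need
  Datagram *Pool::alloc()
  {
    Datagram  *dg = NULL;
    if (!mFIFO.pop(dg)) {
      // nothing recycled yet - make a fresh one
      dg = new Datagram();
    }
    return dg;
  }

  void Pool::recycle( Datagram *aDatagram )
  {
    if (!mFIFO.push(aDatagram)) {
      // the FIFO is full - just toss this one
      delete aDatagram;
    }
  }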

I updated the StringPool used by the ZeroMQ transmitter in a very similar fashion. Again, it probably wasn't going to be the problem there, but it's the same pattern, so it got the same fix.

I'm actually looking forward to tomorrow's tests. For the first time in a long time, I really think I have a shot.