Archive for January, 2011

Added More Fault Tolerance to Ticker Plants

Wednesday, January 26th, 2011

This afternoon I've been working on adding a bunch more fault tolerance to my ticker plants. I had a problem one time with the SIAC symbol mapping data coming out of the configuration service. It should have been a simple JSON map with strings for keys and values, but for some reason, one of the elements wasn't a string. Exception.

So I added a test for the data types: if they aren't strings, I log it and move on. I had another one where the configuration service didn't return enough data to properly configure a UDP receiver. The receiver then got stuck in a very tight loop that logged a retry message over and over, and soon filled up the disk!

That one took a little more effort, but it's all about checking and re-checking to make sure that things are self-consistent. It's not rocket science, but it's hard to predict these problems, and that's why I like to just watch the application run and see what the real world has to offer in terms of problems.

Finally, I added one nice thing to my lisp-like parser. The Java version has the functions cond and merge that offer conditional behavior like an if/then, but a little more flexible. The cond takes pairs of arguments organized as a predicate and an action. Evaluation starts with the first predicate - if it evaluates to true, the corresponding action is evaluated, and that becomes the return value of the cond. It's like a simple switch statement.

The merge is like an OR union. Every predicate is evaluated, and for each one that evaluates to true, the corresponding action is OR-ed into the result of the merge. This is more like a filter.

Both are interesting and I really enjoyed adding them to the parser. Now I just need to get to the project where this parser is going to be used.

Tracking Down Bugs in Latest ZeroMQ Code

Wednesday, January 26th, 2011

ZeroMQ

This morning I was able to track down some problems I've been having with ZeroMQ on Ubuntu. Turns out, it was bugs in the latest code in ZeroMQ off the github master. They had made some changes in the code, not checked them into master, and while they thought the bugs were fixed, they really weren't. So I talked to Sustrik on IRC and he asked me to try a simple test - it was just the simple change on a few lines of code.

Turns out, that cleared up the exception I was getting - as I expected, based on looking at the change - but it didn't solve the problem of getting the messages. Seems there's more to it. When I told him of these results, he asked me to make up a little compatibility matrix for the different versions: the stock 2.1.0 release (2.1.0), 2.1.0 with my changes for RECOVERY_IVL (me), and the git master (master). Here's what I found:

  Sender   Receiver   Works?
  2.1.0    2.1.0      Yes
  me       me         Yes
  master   master     No
  me       2.1.0      Yes
  me       master     No

It sure seems like there's a problem in the latest code. Thankfully, the unix admins have built my drop of the code into Ubuntu packages, so we can continue with the project, but I'm going to have to be very careful when we move off this code, as it could be a significant change.

Just something to consider. I like helping out, and giving back.

Lots of Progress Today – Baby Steps to a Great Ticker Plant

Tuesday, January 25th, 2011

Today I've had the opportunity to do a lot of little things on the codebase. The IRC client wasn't splitting lines right, there were much more efficient ways of querying messages out of the QuickCache, the client constructors needed a little work - all stuff that wasn't big, but it was important. I was humming right along with the changes - tackling one problem after another.

Pretty nice day. Lots of really useful stuff done.

Thread Safe Queries on a Trie? Use a Trash Can!

Tuesday, January 25th, 2011

GeneralDev.jpg

I've been facing this problem with my Trie - because it's lockless, it replaces elements very fast, and a query can get you into trouble: during the query, a replacement can come in and the value you're reading gets deleted out from under you. Then you're looking at a nasty seg fault.

The problem is, you don't want to put a lock in there of any kind - that's going to defeat the purpose of the trie in the first place. So how do you protect queries, but not slow anything down? I was stumped. Until I had a flash - the trash.

If I use a very simple std::slist<Message *> as the "trash", and a simple boolean that controls its use, I can make the put() method look like:

  // put the new message in place, getting back whatever it replaced
  Message  *old = mmap::put(aMessage);
  if (old != NULL) {
    if (mUseTrash) {
      // queries may be in flight - park the old value for later
      mTrashCan.push_front(old);
    } else {
      // no queries in flight - safe to delete immediately
      delete old;
    }
  }

Then, when we're done, we can simply run through all the messages in the "trash", and delete them. Very nice and very clean.

The one wrinkle in this is that the Trie itself is not the place to put this code. The Trie doesn't delete old contents - it just passes them back to the caller. It's the subclass, my QuickCache, that implements this "delete the old" behavior. This means that I need to put the code there, and that makes it a lot less elegant, but still very nice. The Trie stays clean, and that's good.

Still... nice solution to a nasty problem.

BBEdit 9.6.3 is Out

Tuesday, January 25th, 2011

BBEdit.jpg

This morning I saw that BBEdit 9.6.3 was out with a nice little list of improvements. Nothing major, but quite a few little edge case crashes that I'm sure it's nice to get out of the way for support issues. Still one of my favorite tools of all time.

The Realization That I’m Not a Patient Man

Monday, January 24th, 2011

Yeah, I'd like to think I'm patient, but I've realized today that I'm not. Not really. I can pretend to be patient. I can even act like I'm patient, but in the end, I'm nothing of the sort. Not even a little.

I finally got some nasty problems solved, so I start on some easy things. Unfortunately, these easy things require other people to be involved. Hardware issues, OS install issues, all these are, and should be, handled by someone other than me. But therein lies the rub - I don't like waiting for other people to do things at their pace. I want it done at mine.

I have thankfully learned that it's far better for me to just shut up than to say anything. With my kids, with people at work. It's far better to listen, and then plan out what to say with the appearance of patience. It really helps.

But I'm not fooling myself, and I wonder if I'm really fooling anyone?

ZeroMQ and Ubuntu 10.04.1 Problems

Monday, January 24th, 2011

ZeroMQ

Today I've been trying to get my ticker plant running on Ubuntu 10.04.1 as we're moving to Ubuntu at The Shop. I thought it would be easy, as all the other packages were installed by the UnixAdmins. All I needed to do was to git clone and then make and we're ready to go.

Well... not so fast, there bucko.

First, GCC now warns about code like:

  char *list[] = { "first", "second" };

saying that you can't assign a string constant to a char *. The solution is to change the code to read:

  char *list[] = { (char *)"first", (char *)"second" };

or the slightly more cryptic:

  char const * const list[] = { "first", "second" };

Not too bad. I got the code to compile. Now to run it. Oops... no good there. Seems that the ZeroMQ/OpenPGM sockets aren't allowed to use the Ubuntu network interfaces. Specifically, if you run the code, you get an error on line 68 - the call to connect(). But on CentOS5, it runs fine.

I had to make sure the gist illuminated the problem, and then I sent it to the ZeroMQ mailing list. I sure hope Sustrik has some idea about what's up. I really need to have it work on Ubuntu as I have clients on Ubuntu, and we had plans to move the servers to Ubuntu very soon. Hope it's good news, soon.

[1/25] UPDATE: I don't think it's Ubuntu as much as it's the master of the ZeroMQ git repo. I looked at the code and in sub.cpp line 27, the SUB constructor is calling the XSUB constructor (file xsub.cpp line 33) and in there, the code:

  options.type = ZMQ_XSUB;
  options.requires_in = true;
  options.requires_out = true;
  zmq_msg_init(&message);

and in the socket_base.cpp class, line 195, it's clear that if you're trying to use the epgm protocol with options.requires_in and options.requires_out set, you're not going to be allowed. I mentioned this to Sustrik, the lead maintainer of ZeroMQ, and sure enough, he agreed that this was a recent change and that it was going to be a problem.

For me, the solution is easier - use the tarball that I built the RPMs off of, and install that guy. It works, and doesn't have this problem. Later, when they get this solved, and other bugs fixed, we'll get a more recent cut and rebuild all the packages.

Starting Yet Another Broker Rewrite

Monday, January 24th, 2011

Ringmaster

This morning I get to start on yet another rewrite of the Broker. The Java version of the Broker is a lot more complex than they were hoping it'd be, it's not really bug-free, and getting it that way is looking to be a daunting task. Additionally, it looks like there's still not going to be an easy route for a Python client, and that's hanging over everyone's head. So we sat down a while back, and came up with another plan.

Every service is a web server. Every client is an HTTP client. That's about as universal as possible.

The sockets will be pooled. The connections gracefully handled. The Python libraries exist. It's a very simple model. We can use persistence, or not, and the end result is that the code will be vastly simpler.

So today I'm starting to look at the code for the service in C++. I need a nice, embeddable web server, and then I'll use cURL as the client-side component and get everything I need. Sure, I could do the client and server in boost ASIO, but then I'm decoding the headers and hassling with the SSL implementation, and that's too much work. I'm going to try and make this as simple as possible from the implementation side of things.

Should be interesting. Here we go...

UPDATE: might have spoken too soon... I think it's likely that we're going back to the Erlang version of the Broker - the first version. We could put an HTTP interface on it and it'd solve all the problems. We'll see what the Broker's author thinks after looking into it for a while.

Odd Timeout Bug with Boost Async Timeout

Friday, January 21st, 2011

Boost C++ Libraries

One of the things I've been noticing is that my shotgun approaches to the requests to the Broker are all timing out, and then successfully retrying. Very odd. Why fail, and then immediately succeed? Now that I have my ticker plants able to keep up with the feed, I wanted to take a little time and figure this out.

I started logging all the bytes in my test code - which I copied from the working ticker plant - and then saw the problem: every one of the requests/connections was being torn down before the data could be processed. All the data was there - it just wasn't being processed. Very odd. So I added logging for the different places where the connection could be invalidated, and ran it again.

A timeout!? Really? These requests are taking less than a second, and the timeout is set for 25 sec. What's going on here? And then I started to really look at the code I had:

  void MMDClientUpdater::asyncReadTimeout( const boost::system::error_code & anError )
  {
    // we need to alert the user that the timeout occurred
    if (mTarget != NULL) {
      mTarget->fireUpdateFailed("asynchronous read timeout occurred");
    }
 
    // now remove this updater from the pool
    mBoss->removeFromUpdaters(mChannelID);
  }

Unless you've done a lot with Boost ASIO and timeouts, this looks fine. The code is called when the timer fires and I'm able to respond to it. But that's not quite the whole story. It turns out that when a timer is cancelled, its handler still fires - with the error code operation_aborted. We really need to have:

  void MMDClientUpdater::asyncReadTimeout( const boost::system::error_code & anError )
  {
    if (anError != boost::asio::error::operation_aborted) {
      // we need to alert the user that the timeout occurred
      if (mTarget != NULL) {
        mTarget->fireUpdateFailed("asynchronous read timeout occurred");
      }
 
      // now remove this updater from the pool
      mBoss->removeFromUpdaters(mChannelID);
    }
  }

Now we have something that works. I was simply missing that a cancel was going to be "seen" as an error. I updated the code and everything started working without the failures. Why was the retry working? It was a synchronous call because it was a retry. Funny.

Glad to get that solved.

Success is Oh So Sweet!

Friday, January 21st, 2011

bug.gif

Success is Oh So Sweet! I was able to keep up with the OPRA flow this morning for the first time in many weeks. All the changes paid off, as I'd cracked a big performance problem on the ticker plants. I still have one more: the ConflationQueue still has a boost spinlock, and as my tests have shown, that's a performance killer when more than one thread accesses it - though still not nearly as bad as any alternative I could come up with.

What I want now is to come up with a way to remove the spinlock from the ConflationQueue, and then I'll be one happy camper. But for now, I can move on to other things pressing for my attention.