Archive for the ‘Coding’ Category

Patching ZeroMQ – Pretty Neat

Monday, December 6th, 2010


This morning I was chatting with Martin S. on the ZeroMQ IRC channel and there was a suggestion of how to handle the "socket recovery interval in msec" option in the code. He pointed out that I'd need to change the ZeroMQ code itself, and suggested that I make the change, send the patch to the mailing list, and he'd incorporate it.

Sweet! A request for a (simple) patch to the codebase by the primary maintainer. I like this stuff. It's not hard, but there are a few wrinkles, and the coding standards at least exist, which is a huge help to the project. I just need to get a few things figured out, write the code, compile it all up, and then make the diff for the mailing list.

I'm sure there's going to be a lot of little details I learn as I do this, but it's nice to get a chance to contribute to another nice open source project.

UPDATE: I've pretty much got it all done, but the hint I received from the guy in the group who really knows OpenPGM is a little sketchy. He gave me an equation that includes the size of the transport packet:

Easy workaround is to calculate the buffer size in sequence numbers in 0MQ and pass that onto OpenPGM. Then you can export socket options for 0MQ to set the buffer size in seconds, milliseconds, etc.

int sqns = (secs * max_rte) / tpdu_size;
pgm_setsockopt (sock, IPPROTO_PGM, PGM_TXW_SQNS, &sqns, sizeof (sqns));

I think I found what should go in that spot, but I wasn't 100% sure, so I replied to him on the mailing list and now I'm waiting for confirmation (or correction). It shouldn't take too much longer to finish this up, and then I'll have a way to set ZMQ_RECOVERY_IVL_MSEC - which, when non-zero, will override ZMQ_RECOVERY_IVL and use the value in milliseconds. Should be pretty easy to finish.
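
Assuming I've got the units right, here's the millisecond flavor of that calculation as I expect it to look - just a sketch with illustrative values and my own variable names, not the actual patch:

  // a sketch of the millisecond version of the calculation above; all
  // names and values here are illustrative, not from the actual patch
  int64_t  ivl_msec  = 200;             // the new ZMQ_RECOVERY_IVL_MSEC value
  uint64_t max_rte   = 50000ULL * 125;  // ZMQ_RATE is in kbps; x125 = bytes/sec
  uint16_t tpdu_size = 1500;            // transport packet (TPDU) size in bytes
 
  int sqns = (int)((ivl_msec * max_rte) / (1000ULL * tpdu_size));
  pgm_setsockopt(sock, IPPROTO_PGM, PGM_TXW_SQNS, &sqns, sizeof(sqns));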

Tracking Down Nasty Memory Issue – Patience is a Virtue (cont.)

Friday, December 3rd, 2010


This morning has been very enlightening on ZeroMQ. Very exciting stuff. As I was leaving yesterday I had put together a test app for the ZeroMQ guys to check, and then posted the following test results as I varied the value of ZMQ_RATE:

  Rate       ZMQ_RATE   Initial   Final
  10 Mbps    10000      7 MB      18 MB
  50 Mbps    50000      7 MB      73 MB
  200 Mbps   200000     7 MB      280 MB

The data was pretty compelling. The effect ZMQ_RATE had on the memory footprint of the same data source was staggering. Thankfully, I put it all together in a nice email to the mailing list, and I got a great response from Martin S.:

Isn't it just the TX buffer? The size of PGM's TX buffer can be computed as ZMQ_RATE * ZMQ_RECOVERY_IVL. The messages are held in memory even after they are sent to allow retransmission (repair) for the period of ZMQ_RECOVERY_IVL seconds.
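
That math checks out against my numbers, too: ZMQ_RATE = 200000 is in kilobits per second, so roughly 25 MB/sec of data, and with what I believe is the 2.x default ZMQ_RECOVERY_IVL of 10 seconds, the TX buffer alone works out to about 250 MB - right in line with the 280 MB final footprint in my table above.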

So I added the following to the ZMQ transmitter's code:

  static int64_t     __rate = 50000;     // ZMQ_RATE - in kbps (50 Mbps)
  static int64_t     __recovery = 1;     // ZMQ_RECOVERY_IVL - in seconds
  static int64_t     __loopback = 0;     // ZMQ_MCAST_LOOP - 0 = disabled
 
  // bound the PGM TX buffer: it's sized as ZMQ_RATE * ZMQ_RECOVERY_IVL
  top->socket->setsockopt(ZMQ_RATE, &__rate, sizeof(__rate));
  top->socket->setsockopt(ZMQ_RECOVERY_IVL, &__recovery, sizeof(__recovery));
  top->socket->setsockopt(ZMQ_MCAST_LOOP, &__loopback, sizeof(__loopback));

And then started running the tests again.

The results were amazing:

  Rate       ZMQ_RATE   Initial   Final
  50 Mbps    50000      7 MB      11 MB
  200 Mbps   200000     7 MB      32 MB
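
The numbers line up with Martin's formula, too: at 200 Mbps, ZMQ_RATE = 200000 kbps is about 25 MB/sec, and a 1-second recovery interval makes that a 25 MB TX buffer - almost exactly the growth from 7 MB to 32 MB.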

This was exactly what I was looking for! The ZMQ_RECOVERY_IVL can't go below 1 sec, but for me even that's too much. If a client isn't up and ready to get ticks, then a one-second window is likely to hold several hundred, if not several thousand, messages. I'd be fine making it 0.5 sec, but Martin says one second is the underlying resolution of OpenPGM.

Not bad. I'll take it. What a great morning!

[12/7] UPDATE: the option:

  static int64_t     __loopback = 0;
 
  top->socket->setsockopt(ZMQ_MCAST_LOOP, &__loopback, sizeof(__loopback));

is a massive red herring. It's not about the loopback interface, as my reliable multicast URLs are all targeted at specific NICs; it's about being able to receive on the same box as the sender. I was trying to figure out why things "broke", and it was only when I took this out that things worked again. Dangerously worded docs on this one... leave it out.

Tracking Down Nasty Memory Issue – Patience is a Virtue

Thursday, December 2nd, 2010


I've been trying to track down what I believed to be a nasty memory leak in my code today. The short-cut to the answer is that it wasn't a leak, and it wasn't in my code. But I'm getting ahead of myself.

The problem was manifesting itself as steadily growing memory on some of my ticker plants. In truth, it was probably all of them, but it wasn't affecting all of them equally. I have spent a lot of time on this over the past few weeks, and today I was going to get to the bottom of it for sure.

So I started digging into the problem by shutting things off. What I found was that if I was listening on the UDP socket and doing anything with the data, I was getting about an 8-byte increase every two seconds. Very odd. I had turned off ZeroMQ at the time, so the messages were just getting dropped in the trash, but they were being processed completely up to that point.

I was trying everything, and then I had to run to a meeting. I left the test running because I needed to hurry. It wasn't going to consume the box in half an hour, anyway.

When I came back I noticed that the memory had stabilized!

Now it was getting interesting. Very interesting. I started tracking things down and it turned out that the ZMQ_RATE parameter was a major factor in the terminal memory value. I then wrote up a simple test - something I knew the ZeroMQ guys would appreciate - and started running it.

Again - major dependency on the value of ZMQ_RATE. I'll have to do more work on this tomorrow.

Google Chrome dev 9.0.597.0 is Out

Thursday, December 2nd, 2010


After quite a silence, Google Chrome dev 9.0.597.0 is out and there are some really nice fixes in this release:

All

  • Ongoing work on IndexedDB and GPU
  • Tweaks/Fixes to Google Chrome Instant
  • Extensions/Apps work
  • Autofill related fixes

Known Issues

  • Page becomes unresponsive when trying to play video - Issue 65772
  • Certain HTML5 sites fail to load due to a compositor issue - Issue 64722

I like the GPU and video updates, but I can pass on "Instant" - an icky addition in my book. Still, their app, their choice.

Fantastic Speed Boost on My uint128_t

Wednesday, December 1st, 2010


Late yesterday I realized that I had some lingering code using the uint128_t class I had created to uniquely map instrument names into a number space for use in the likes of std::map. The code I had originally written worked, but it wasn't nearly fast enough, so I stopped using it (or so I thought) and switched to the trie.

But it wasn't really gone. I had a lingering use for it in my client code, and I decided this morning to fix up the implementation so that I had something a lot faster - hopefully in the same ballpark as a uint64_t for map usage.

The first thing I did was to add a timed test section to my testing code for the conflation key - that's what I called the 128-bit value generated from the name of the instrument. It was pretty simple:

  int         cnt = 100000;
  log.info("starting the uint64_t tests...");
  boost::unordered_map<uint64_t, int> little;
  uint64_t    startTime = msg::TransferStats::usecSinceEpoch();
  // ...first the puts
  for (int i = 0; i < cnt; ++i) {
    little[i] = i;
  }
  // ...now the gets
  for (int i = 0; i < cnt; ++i) {
    if (little[i] != i) {
      error = true;
      log.error("uint64_t test failed for i=%d", i);
    }
  }
  uint64_t    totalTime = msg::TransferStats::usecSinceEpoch() - startTime;
  log.info("%d uint64_t tests completed in %ld usec", cnt, totalTime);
 
  log.info("starting the uint128_t tests...");
  boost::unordered_map<uint128_t, int> big;
  startTime = msg::TransferStats::usecSinceEpoch();
  // ...first the puts
  for (int i = 0; i < cnt; ++i) {
    big[i] = i;
  }
  // ...now the gets
  for (int i = 0; i < cnt; ++i) {
    if (big[i] != i) {
      error = true;
      log.error("uint128_t test failed for i=%d", i);
    }
  }
  totalTime = msg::TransferStats::usecSinceEpoch() - startTime;
  log.info("%d uint128_t tests completed in %ld usec", cnt, totalTime);

What I saw in my initial tests was horrible. It wasn't even close - there was more than a 300x difference between the two. When I looked at the way I'd implemented the uint128_t, it made a lot of sense:

  private:
    uint8_t   mBytes[16];

I had 16 individual bytes as the data ivar for the object. That makes a lot of sense in that it never suffers from host/network byte-ordering issues, and things looked fast in the code - but there were loops and a lot of calls to memcpy(). So I needed a new approach, and I decided to go to the other extreme - two uint64_t values instead of sixteen uint8_t values.
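
So the data ivar becomes something like this (sketching from memory - the actual member name may differ):

  private:
    uint64_t  mWords[2];    // [0] = low word, [1] = high word (my sketch of the naming)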

This changed a lot of the code. For one, it made a lot of sense to write my own hton() and ntoh() functions for the class so they'd look like the system calls ntohl() and the like. It really wasn't all that hard, either:

  uint128_t hton( const uint128_t & aValue )
  {
    uint128_t     retval;
 
    // get the byte pointers to the source and destination
    uint8_t *dest = (uint8_t *)&retval;
    const uint8_t *src = (const uint8_t *)&aValue;
    // now map the bytes one-by-one from source to destination
    dest[0]  = src[7];
    dest[1]  = src[6];
    dest[2]  = src[5];
    dest[3]  = src[4];
    dest[4]  = src[3];
    dest[5]  = src[2];
    dest[6]  = src[1];
    dest[7]  = src[0];
    dest[8]  = src[15];
    dest[9]  = src[14];
    dest[10] = src[13];
    dest[11] = src[12];
    dest[12] = src[11];
    dest[13] = src[10];
    dest[14] = src[9];
    dest[15] = src[8];
 
    return retval;
  }
 
 
  uint128_t ntoh( const uint128_t & aValue )
  {
    uint128_t     retval;
 
    // get the byte pointers to the source and destination
    uint8_t *dest = (uint8_t *)&retval;
    const uint8_t *src = (const uint8_t *)&aValue;
    // now map the bytes one-by-one from source to destination
    dest[7]  = src[0];
    dest[6]  = src[1];
    dest[5]  = src[2];
    dest[4]  = src[3];
    dest[3]  = src[4];
    dest[2]  = src[5];
    dest[1]  = src[6];
    dest[0]  = src[7];
    dest[15] = src[8];
    dest[14] = src[9];
    dest[13] = src[10];
    dest[12] = src[11];
    dest[11] = src[12];
    dest[10] = src[13];
    dest[9]  = src[14];
    dest[8]  = src[15];
 
    return retval;
  }

The old scheme allowed me to use memcpy() to put the data into a data stream - and to take it out. But now, with a real "host byte order", I needed to add methods on the uint128_t class to pack and unpack its data from the data streams. Not bad, and it made the code look a lot cleaner, but I'd had that crud scattered in a ton of places in the code.

Bad form on my part - really.
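
For the record, the pack and unpack methods themselves end up being trivial with the new layout - a sketch, and the real method names and buffer types in my code may differ:

  // a sketch of the stream packing, assuming the two-word layout above
  void uint128_t::pack( uint8_t *aBuffer ) const
  {
    // convert to network byte order and copy all 16 bytes out
    uint128_t   wire = hton(*this);
    memcpy(aBuffer, &wire, sizeof(wire));
  }
 
  void uint128_t::unpack( const uint8_t *aBuffer )
  {
    // copy the 16 bytes in, then convert back to host byte order
    uint128_t   wire;
    memcpy(&wire, aBuffer, sizeof(wire));
    *this = ntoh(wire);
  }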

I even had to create the prefix and postfix increment and decrement operators to make sure it could function in the loops I might have. I really wanted this to be complete. Thankfully, the code to do this didn't turn out to be that hard. In fact, I was able to do it in a lot fewer lines of code because I could use the compiler to do a lot of the up-casting work that I'd had to do with memcpy() before. Nice benefit.
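
For what it's worth, here's roughly what the increment operators look like with the two-word layout - a sketch, assuming the mWords ivar from above:

  // prefix: bump the low word, and carry into the high word on wrap
  uint128_t & uint128_t::operator++()
  {
    if (++mWords[0] == 0) {
      ++mWords[1];
    }
    return *this;
  }
 
  // postfix: save the current value, reuse the prefix form, return the copy
  uint128_t uint128_t::operator++(int)
  {
    uint128_t  old(*this);
    ++(*this);
    return old;
  }

The decrement operators are just the mirror image - a borrow instead of a carry.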

The upshot of all these changes is that the new uint128_t was only 30% slower than the uint64_t! That's amazing compared to where it started. It's not going to set any speed records, but given that it's not a built-in CPU data type, it's pretty good. Certainly good enough for all the things I need it to do.

Fantastic work!

Finding Subtle Bugs Takes Time

Tuesday, November 30th, 2010


I've been lucky today - I finished a good chunk of code this morning and now I'm able to watch my ticker plants run. It's a funny thing: finding the subtle bugs takes time. You have to watch the code run, ask questions, monitor log files - all these things take time, and it's that phase of polishing an app that's very rewarding.

If you have a distributed app you need to monitor the load and rebalance things as needed. That takes more time.

I've been lucky this morning to be able to take this time and really study what's happening. I've found a few bugs, and those would not have been easy to find if I'd been in a hurry, as they weren't serious crashing issues. Still, they needed to be fixed.

It's nice to have a little time to watch and monitor the work you've done. It's really very rewarding.

ZeroMQ is Nearing Release of 2.1

Tuesday, November 30th, 2010


I've found a singular problem with ZeroMQ, and noted in the IRC chat conversations that this should be fixed in the soon-to-be-released 2.1. It's a simple memory leak with the sending of messages. My code is pretty simple: I get the payload for the message, I get the ZMQ socket it needs to be sent on, and then I simply make a ZMQ message and send it. That's about as simple as you can get:

   1. if (aTopic.first() != NULL) {
   2.   try {
   3.     // lock up this socket while we send the data out...
   4.     boost::detail::spinlock::scoped_lock lock(aTopic.second());
   5.     // make a ZMQ message of the right size
   6.     zmq::message_t msg(aPayload.size());
   7.     // ...copy in the data we need from the payload
   8.     memcpy(msg.data(), aPayload.data(), aPayload.size());
   9.     // ...and WHOOSH! out it goes
  10.     aTopic.first()->send(msg);
  11.   } catch (std::exception & e) {
  12.     error = true;
  13.     cLog.error("[sendToZMQ] trying to send the data got an "
  14.                "exception: %s", e.what());
  15.   }
  16. }

If I comment out line 10 - the send() - the memory doesn't grow any faster than I'd expect based on the cached messages. But leave it in, and the memory grows and grows. More interestingly, the growth is different for the different kinds of payloads I send. Very odd.

Anyway... the ZeroMQ guys said they planned on having a release last week, but it seems things slipped. That's OK with me, but this one is important - I need to keep the memory under control.

CKit’s IRC Protocol Implemented in Boost ASIO

Monday, November 29th, 2010


For the last day and a half I've been working on rewriting my C++ IRC Client code to use boost's asio, and it's been pretty interesting. I will say there are a lot of pluses to using the boost socket functionality - even over my own socket library (imagine that!).

First, it's got the complete asynchronous mechanism for sending and receiving - you just have to love that. Also, they have done a wonderful job in making it all very rational and sane. Asynchronous methods perform as you'd expect, and have remarkably similar signatures to their synchronous counterparts. It makes writing with the classes very simple. Clearly, a lot of thought has gone into this stuff, and that's really nice to see.

Secondly, there's no need for all the threads I had in my old code, primarily because of the single io_service thread that boost asio uses for all async operations. This really is a great timesaver. You can easily have multiple threads sending out chats to the IRC server with the async writer. Very slick.
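
Here's roughly what that multi-thread-safe async writer looks like - just a sketch with hypothetical names (IRCClient, mSocket, mOutgoing), not my actual CKit code:

  #include <boost/asio.hpp>
  #include <boost/bind.hpp>
  #include <deque>
  #include <string>
 
  class IRCClient {
  public:
    explicit IRCClient( boost::asio::io_service & io ) : mSocket(io) { }
 
    // callable from any thread - we hop onto the io_service thread so
    // that only one thread ever touches the outgoing queue
    void send( const std::string & aLine ) {
      mSocket.get_io_service().post(
        boost::bind(&IRCClient::doSend, this, aLine + "\r\n"));
    }
 
  private:
    void doSend( const std::string & aLine ) {
      bool  busy = !mOutgoing.empty();
      mOutgoing.push_back(aLine);
      if (!busy) {
        startWrite();
      }
    }
 
    void startWrite() {
      boost::asio::async_write(mSocket,
        boost::asio::buffer(mOutgoing.front()),
        boost::bind(&IRCClient::handleWrite, this,
                    boost::asio::placeholders::error));
    }
 
    void handleWrite( const boost::system::error_code & err ) {
      if (!err) {
        mOutgoing.pop_front();
        if (!mOutgoing.empty()) {
          startWrite();    // keep draining the queue
        }
      }
    }
 
    boost::asio::ip::tcp::socket  mSocket;
    std::deque<std::string>       mOutgoing;
  };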

Finally, the resulting code size is much smaller. That's more a consequence of the other two, but the payoff can't be overstated. Less code means less maintenance, less cost, and less time. You just can't beat that.

So I have it all done, but I haven't been able to test it yet, as we don't have an IRC server up and running - yet. It's been discussed, and maybe today they'll see that I've got code ready and put one up. It's not hard - it just takes a little time. But the benefits will be enormous. I look forward to getting the server going, testing my code, and integrating this into my TickerPlants and other libraries. It's just an amazingly powerful tool for support and problem solving.

SNMP vs. IRC – Complexity Over Simplicity

Monday, November 29th, 2010


In the past, I've used IRC in my applications to great effect. I created a simple framework that allowed each application instance to "pose as" a user on IRC: when the application was running, you could see it in the chat rooms, and you could interact with it by as complex, or as simple, a means as you, the application designer, wanted. The protocol is simple, it's fast, and there's very little administration to the system.

SNMP is not so simple, but it's far more common in the monitoring and control of applications. The question is really: Is the complication worth it? I'm not at all certain that it is.

Several months ago, the decision was made to use Jabber as opposed to IRC. I went along with it, as that was the right thing to do. But it's not the choice I'd have made. The ircd is simple, fast, and with a little additional code you can make it log everything. From that point, you don't have any concerns about compliance, and you're free to use it as needed.

It seems that the powers that be are re-evaluating the Jabber solution, as it's a little more complex in its implementation, and the goal here is not something really complex; it's something really easy. So I'm hoping that I'll get to finish my IRC client based on the boost asio work. If so, it should be exceptionally fast and easy to use. Both are hallmarks of a great utility.

I hope I get to see it come to pass.

The complexity of SNMP just seems like massive overkill.

Tricking a Tricky Threading Problem

Wednesday, November 24th, 2010


This afternoon I've been tracking down a good solution to a nasty threading problem. This part of my ticker plant is the UDP receiver, and it tries to get the UDP datagrams off the socket and into a buffer as fast as possible. To that end, I've got a single-producer, single-consumer, lockless FIFO queue that should be thread-safe, as the 'head' and 'tail' are volatile and there's only one thread messing with each of them at a time.

But that's just the theory. Here's what the code looks like:

  template<typename Element, uint32_t Size>
  bool CircularFIFO<Element, Size>::push( Element & item )
  {
    uint32_t  nextTail = increment(tail);
    if (nextTail != head) {
      array[tail] = item;
      tail = nextTail;
      return true;
    }
 
    // queue was full
    return false;
  }
 
 
  template<typename Element, uint32_t Size>
  bool CircularFIFO<Element, Size>::pop( Element & item )
  {
    if (head == tail) {
      // empty queue
      return false;
    }
 
    item = array[head];
    head = increment(head);
    return true;
  }
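
The increment() helper isn't shown above - it's just the wrap-around advance of an index. Something like this sketch (the real code may differ slightly):

  template<typename Element, uint32_t Size>
  uint32_t CircularFIFO<Element, Size>::increment( uint32_t idx ) const
  {
    // advance the index, wrapping back to zero at the end of the array
    return (idx + 1) % Size;
  }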

Here's what happens: I'll be running just fine, and then a call to pop() will return true, but the value it hands back (a pointer) is garbage. This presents a real problem. If it returns a NULL, that's easy to deal with. The problem is when it returns junk.

Ideally, it wouldn't return a NULL or junk, but coding for that has turned out to be harder than I thought. Sure, I can check for a NULL, or for what I think of as "junk" data, and not delete that pointer - but what happens when it returns junk that doesn't fit my pattern of junk? Well... I'll delete it and BAM! SegFault.

Not easy.

I believe the problem is one of compiler optimization - the compiler is free to cache or reorder accesses to the array because it isn't marked volatile. The data in the class is defined as:

  volatile uint32_t head;
  volatile uint32_t tail;
  Element   array[Size];

where the lack of the volatile keyword on the array is the big deal. What I need to do is make the data look like:

  volatile uint32_t head;
  volatile uint32_t tail;
  volatile Element   array[Size];

and then correct all the casts in the code to make everything work properly.

I've got something to compile that looks like:

  template<typename Element, uint32_t Size>
  bool CircularFIFO<Element, Size>::push( Element & item )
  {
    uint32_t  nextTail = increment(tail);
    if (nextTail != head) {
      // go through a volatile-qualified pointer so the copy can't be
      // optimized away or reordered
      array[tail] = *((volatile Element *)(void *)&item);
      tail = nextTail;
      return true;
    }
 
    // queue was full
    return false;
  }
 
 
  template<typename Element, uint32_t Size>
  bool CircularFIFO<Element, Size>::pop( Element & item )
  {
    if (head == tail) {
      // empty queue
      return false;
    }
 
    // cast away the volatile qualifier to copy the element back out
    item = *const_cast<Element *>(&(array[head]));
    head = increment(head);
    return true;
  }

We'll have to see how this runs.