Archive for March, 2011

Wild Socket Problem – Possibly Bonded NIC Issue?

Tuesday, March 15th, 2011

Ubuntu Tux

Focused on an interesting problem today. In the last few weeks, I've done a lot of re-writing on the UDP receiver in my ticker plant to make it better, faster, etc. And one of the things I've noticed is that I was accumulating, but not logging, dropped messages from the exchange. Now this is a serious issue because I'm looking at both the A and B sides from the exchange - they are meant to be fault-tolerant pairs, so that should you lose a datagram on one, the other has it and you can get it there. So to lose packets is significant.

Made more significant by the way in which I'm losing them. Let's say I start one of my apps that listens to a set of UDP multicast feeds. This guy gets started and it's running just fine. In another shell on the same box, I start another application that listens to a different set of UDP channels. As this second application is starting - the first app starts dropping packets! Within a few seconds, everything stabilizes and both applications are fine and neither app is dropping anything.

If I then stop the second app - the first app drops a few packets! Again, within a second or so, it's all stable again and nothing more is dropped.

From this, I have a few observations and a theory.

  • It is not in the process space - the two apps share nothing but the OS and hardware. So it's not "within" either process.
  • It is socket related - because I lose packets on both the A and B channels, it's not the failure of one multicast channel.
  • It is load related - the more load there is on the first and second apps, the worse the drops.

My theory is that it's the way the bonded interface is configured. Specifically, I believe it's set up to automatically rebalance the load between the two physical NICs, and in so doing, a change in load causes some of the sockets to be shifted from one NIC to the other, and packets are dropped in the move.

It certainly makes sense. The question is: can I affect the configuration in a meaningful way? I looked at the modes for bonding NICs in Ubuntu, and depending on how they have it set up, I might just have to live with it. If so, at least I know where it's coming from.
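For reference, Linux exposes the bonding driver's current mode in /proc; an illustrative excerpt (the interface name bond0 and the mode shown are assumptions, not what's actually on my boxes):

```
# /proc/net/bonding/bond0 (illustrative)
Ethernet Channel Bonding Driver: v3.x.x
Bonding Mode: adaptive load balancing
```

A mode like balance-alb re-assigns flows between the slave NICs as load shifts - which would fit the symptoms - whereas something like active-backup wouldn't, at the cost of idling one NIC.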

UPDATE: the core issue is that I can't specify the NIC for boost asio to use for reception of the UDP traffic. If I try to bind to the NIC's address, I get nothing. If I bind to "0.0.0.0", then I get data, but the problems persist. It's an annoying limitation with boost asio UDP, and we'll have to deal with it. Crud.

UPDATE: the only option I found was in the joining of the multicast channel. It turns out that you can tell boost which local address to join the multicast group on. This takes the form of something like:

  socket->set_option(multicast::join_group(
                              address_v4::from_string(aChannel.first),
                              address_v4::from_string("10.2.2.8")));

where the second address is the address of the NIC you want to listen on. It works only marginally for me, and that's a drag, but it's a possibility if I need it. It's not boost's problem.

[4:20pm] UPDATE: I found out that it's the Intel NIC drivers! A guy in The Shop ran across this for his work a little bit ago, and found the solution in updated drivers for the Intel 10GbE NICs. I've talked to the Unix Admins, and they are building a patch for my boxes. This is fantastic news!

Fun Use of Boost Threads in Monitoring Thread

Monday, March 14th, 2011

Boost C++ Libraries

I was having a bit of a problem with The Broker today, it seemed. It appeared that when I saved my state to the Broker's configuration service, the call was hanging, and the monitoring thread that fired off this save hung with it. I got the guys to restart the Broker and things seemed OK, but I decided to take advantage of one of the really neat things about boost threads, and fire off the call in a separate thread so that if it stalls, the monitoring thread doesn't.

The old code looks like this:

  if (secs - mLastMessageSaved >= 300) {
    saveMessagesToConfigSvc();
    mLastMessageSaved = secs;
  }

becomes:

  if (secs - mLastMessageSaved >= 300) {
    using namespace boost;
    thread   go(&TickerPlant::saveMessagesToConfigSvc, this);
    mLastMessageSaved = secs;
  }

and now saveMessagesToConfigSvc() is called by the separate thread; when the thread object goes out of scope it simply detaches, and as soon as the method returns, the thread exits and is cleaned up. Exceedingly sweet!

OK... this is what boost threads are all about, but in comparison to Java threads, or something that takes a little more scaffolding, this is elegant to the extreme. Just add a few constructs to the line and it's done. You can't get much simpler than that. Very nice.

Happy Pi Day!

Monday, March 14th, 2011

pi.jpg

This is something that was started by some of my kid's teachers years ago, and I still like it - Pi Day! We try to have pie in the house, and I'm happy to say that Pumpkin is the order of the day today. Yum!

Happy Pi-ing!

Wonderful Solution to a Locking Problem – Merging the Streams

Friday, March 11th, 2011

GeneralDev.jpg

The past couple of days have been about speeding up the processing of the exchange data through my feed system. Because the code for the decoding of the messages is fixed, specifically for the big OPRA feeds (using OPRA's FAST decoder), most of this is accomplished by re-organizing the data flow and data structures. One of the things I had done a while back was to have the two channels of an exchange feed put their datagrams in separate queues, and then have one thread empty both so as to remove the need for locking on the sequence number arbitration code.

The problem was that this required twice the time to process the data through the decoder, because both the A and B sides went through the same thread. This can get to be really quite nasty. For example, on an OPRA feed it takes about 40 usec to decode an average datagram, so at 50,000 datagrams/sec we're looking at 2 sec of CPU time to process one second of one side's traffic - and this design would have to do double that work. Nasty. Lots of buffering.

The solution is to have one thread per datagram stream. That immediately cuts the processing time in half. The problem is that we then need to lock for the sequence number arbitration. Nasty. Then I had a flash - merge the data!

First, tag one of the channels as primary, and have it control the arbitration. Every other channel decodes its datagrams, but then instead of trying to have that thread send them out, have it put the decoded messages into a queue that the primary will process as soon as it's done with its own datagram. The arbitration is very fast because it's as simple as checking the sequence number and a few flags. It's the decoding that takes the time. With one of the FIFO queues, we can have multiple non-primary channels, and have the primary take the results off and send them out.

Even more importantly, the primary can be the primary feed line of the exchange, and that makes things even better as the secondary feed is really only needed when there's a failure of the primary. What we've done then, is to make it more like the "normal" feed with a "backup" just in case.

Very neat.

Google Chrome dev 11.0.696.3 is Out

Friday, March 11th, 2011

This morning I noticed that Google Chrome dev 11.0.696.3 was released with a few issues addressed. Nothing major, but it's nice to see the attention to detail by the builders.

Spending Time Tweaking for Speed

Thursday, March 10th, 2011

Speed

Today was a day where I spent about half my time tweaking for speed, and the other half recovering from the problems of making it too fast. I'm in the middle of a push to make the core of my ticker plants as fast as possible for a group that's going to start testing it next week. There have been a lot of little tweaks - each getting just a touch faster, but together, it's a non-trivial improvement. Unfortunately, these little changes pointed out threading issues as well. Those then needed to be fixed.

It's a game I've played for a while now, so today isn't unlike a lot of other days lately - just "more of the same". It's all part of the improvement process.

Found My Speed Problems – No Really Easy Solution

Wednesday, March 9th, 2011

Speed

This afternoon I found my speed problem, and it's not an easy solution. It should be, but it's not, and in retrospect, I really wish OPRA made a better FAST codec. But I'm getting ahead of myself.

I started adding a lot of timing snapshots to the code. It's annoying if there's a lot of it, but in the short-run, it's really the only way to see how long this section of code takes. Sure, there are profilers, but they can only look at modules or functions - not sections of code, as I needed. So I put these in and started the runs. Much to my surprise, the FAST codec is not too bad, averaging about 20 to 50 usec per message. Sure, it depends on the size of the message from OPRA, but it's not bad. This also includes the sending of the messages out to my test harness and calculating the minimum, maximum, and average latency for the messages it receives.

So it's not in the FAST codec. Where, then, is it?

Certainly, one of the nastier delays is the queue restart. My boost ASIO thread reads off the UDP datagram, tags it with a timestamp, and places it on a FIFO queue. The queue is lockless, so there's no condition variable to wait() on, and that's a good thing, as the delay would be horrible if there were (already been there, tried that). So when the queue is empty, my servicing thread can choose one of two options: sleep and let things queue up, or spin and burn up the CPU. I've tried both, and I can get the time to a reasonable level with the spinning, but then the CPU usage is eating a complete core - not ideal.

Sleeping isn't any better because it makes the minimum "maximum" latency equal to the sleep interval, and that is currently set at 1 ms. However, it's much more considerate on the CPU usage - averaging only about 12% of one core.

But still, the delays remained. Then it hit me. I was looking at the problem all the while and just not "seeing" it. If I have one thread decoding these datagrams from both the A and B channels of the exchange feed, then I'm doing twice the work I need to. This isn't bad, but it's taking twice the time. I needed to get two decoding threads active again. I say "again" because I had this in the past, and removed it in favor of a lockless arbitration on what messages to send. At this point, I need the speed and am willing to have the lock (for now) to cut the processing time in half.

But that's not all. If I think about the load from a single OPRA channel I'm looking at upwards of 100,000 msgs/sec. If I take, on average, 40 us per message to decode it, I can get only about 25,000 msgs/sec - not even close to the speed they are arriving. So this is my fundamental problem: the OPRA FAST codec is just not as quick at decoding the messages as OPRA is at sending them. Sure... with faster CPUs I might be able to get it a little faster, but in the end, it's not going to be more than maybe 100% faster. That still leaves me with a factor of 2 to make up.

So we're going to queue things up. No way around it. Given that we're queueing, the delays I've been seeing are completely explainable. Let's assume we get a burst of 10,000 messages - the last message in the burst is going to get decoded some 400 ms after it arrives and is timestamped. No two ways about it.

Some of my competitors' feeds claim much smaller latencies - and the way they measure it, they do - because they let the OS-level socket buffers hold the 10,000 messages while they process each one completely before pulling the next off the socket. If you timestamp a message, decode it, and send it, the latency should be no more than about 50-100 usec. That's what mine would be, too. But that's not measuring the real latency of the server. For that, you have to keep the OS-level socket buffer empty, timestamp the datagrams within the process space, and then measure the overall latency.

There's no free lunch. You can buffer the datagrams in your process or you can have the OS buffer them for you. The only difference is with the former, you'll actually know when the datagram arrived at your NIC, and with the latter you won't.

So... what's the solution? Well... we can throw more cores at it. If we somehow parallel processed the datagrams into messages, then the effective decoding time could be much less. But then we'd have to worry about getting the messages out of order.

We could re-write the FAST codec - and this I've thought about, but the code isn't that bad, and OPRA is going to 48 channels soon, and that should cut the rate to a max of 50,000 msgs/sec, and give the decoder a little breathing room.

I'll work on it, but I'm confident now that there's no horrible problem in my code. And no one is doing it really any better than I am. I just need to see if I can come up with some clever way to get around these limitations.

Still Trying to Find More Speed (cont.)

Wednesday, March 9th, 2011

Code Clean Up

This morning I went through a section of the code I'll be focusing on today and realized that when I originally wrote it I had the idea that every UDP datagram from the exchange should generate a message to send. If it didn't then it was an error. As a point of fact, there are a lot of datagrams that don't generate messages - Admin messages, Control messages, all kinds of messages.

What I had been doing was to create a simple base message and place it into the list so that an empty list signaled an error and I then filtered out the simple messages from being sent. Now I don't think I was spending a lot of time checking the message type, but the new and delete of the message could not have helped. So I decided to clean that up and get rid of the error checks on the empty generated list. This shouldn't be a huge difference, but every little bit helps when you're looking for fractions of a millisecond.

The next point I found was the skipping of sequence numbers by the OPRA FAST codec. In general, OPRA sends datagrams encoded in this pseudo-FAST protocol where the first message is specified in full, but then only "deltas" are enclosed in the packet. This makes it very fast to encode multiple messages - which is, of course, why they used it. The upshot is that the starting sequence number is provided for the datagram, and I was using that to advance the "last sequence number" field in the decoder. The problem is that if there is more than one message in a datagram, this is going to make it appear that we're skipping (and therefore missing) messages from the exchange.

So I went in there and have a few interesting ideas about how we might be able to make that work. Of course, the simplest scheme is to check all the messages and take the largest number as the new "last sequence number" and use that. But I might also get lucky if OPRA only sends contiguous messages. If that's the case, I might be able to take the starting sequence number, add the count of messages, and then run with it.

I also might be able to just look at the last message in the list and use that. There are a lot of possibilities for cleaning this up.

The next thing was to get an idea of the speed of the datagram processing. To that end, I skipped all the message processing and put in checks for the timeliness of the datagrams. I'll need to wait for the open to test this, but then I need to get busy really hammering on this code. I want to make some real progress on this today.

Google Chrome dev 11.0.696.0 is Out

Wednesday, March 9th, 2011

V8 Javascript Engine

This morning I checked and Google Chrome dev 11.0.696.0 was out. Seems in this release, they updated the V8 engine to 3.2.0.1 and made a few Mac-specific fixes/tweaks. Nice of them to pay attention to the platform, really. It's a decent browser, but I still can't help but wonder what's up with Google these days? Is it doing evil, or reforming its ways?

iTunes 10.2.1 and Java 1.6.0_24 are on Software Updates

Wednesday, March 9th, 2011

Software Update

This morning I noticed that Apple has released iTunes 10.2.1 and updated Java on Mac OS X to 1.6.0_24. The update on iTunes is to allow syncing with iOS 4.3 devices - the new iPad 2 and all the iPads and iPhones that will be upgraded to 4.3 this month when it's released.

The update for Java is just to bring 1.6.0 up to par with the other platforms. It's got the security fixes, the stability improvements, etc. Nice to see, given that Apple has abandoned Java past OS X 10.7 (Lion). Who knows? Maybe Oracle will botch it so badly that Steve will relent and keep delivering Java.