Wild Socket Problem – Possibly Bonded NIC Issue?
Focused on an interesting problem today. In the last few weeks, I've done a lot of re-writing on the UDP receiver in my ticker plant to make it better, faster, etc. And one of the things I've noticed is that I was accumulating, but not logging, dropped messages from the exchange. Now this is a serious issue because I'm looking at both the A and B sides from the exchange - they are meant to be fault-tolerant pairs so that should you lose a datagram on one, the other has it and you can get it there. So to lose packets is significant.
It's made more significant by the way I'm losing them. Let's say I start one of my apps that listens to a set of UDP multicast feeds. This guy gets started and it's running just fine. In another shell on the same box, I start another application that listens to a different set of UDP channels. As this second application is starting - the first app starts dropping packets! Within a few seconds, everything stabilizes and both applications are fine and neither app is dropping anything.
If I then stop the second app - the first app drops a few packets! Again, within a second or so, it's all stable again and nothing more is dropped.
From this, I have a few observations and a theory.
- It is not in the process space - two apps share nothing but the OS and hardware. So it's not "within" either process.
- It is socket related - because I lose packets on both the A and B channels, it's not the failure of one multicast channel.
- It is load related - the more load there is on the first and second apps, the worse the drops.
My theory is that it's the way the bonded interface is configured. Specifically, I believe it's set up to automatically rebalance the load between the two physical NICs, and in so doing, a change in load causes some of the sockets to be shifted from one physical NIC to the other, and packets are dropped in the shuffle.
It certainly makes sense. The question is: can I affect the configuration in a meaningful way? I looked at the modes for bonding NICs in Ubuntu, and depending on how they have it set up, I might just have to live with it. If so, at least I know where it's coming from.
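For checking how a bond is actually set up, the Linux bonding driver exposes the active mode in sysfs as a single line holding the mode name and its number, like "balance-rr 0". Here's a minimal sketch of reading it from C++. The bond name "bond0" is just an assumption about what the interface is called on the box, and the struct and function names are made up for illustration.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Mode name plus numeric id, e.g. {"active-backup", 1}.
struct BondMode {
    std::string name;
    int id = -1;
};

// Parse one line of the form "<mode-name> <mode-number>".
// Returns false (and leaves 'out' untouched) if the line doesn't match.
inline bool parse_bond_mode(const std::string& line, BondMode& out) {
    std::istringstream in(line);
    BondMode m;
    if (!(in >> m.name >> m.id)) return false;
    out = m;
    return true;
}

// Read the mode of a bonded interface from sysfs.
// "bond0" is an assumed interface name; substitute the real one.
inline bool read_bond_mode(BondMode& out, const std::string& bond = "bond0") {
    std::ifstream f("/sys/class/net/" + bond + "/bonding/mode");
    std::string line;
    if (!f || !std::getline(f, line)) return false;
    return parse_bond_mode(line, out);
}
```

If this comes back as balance-tlb (5) or balance-alb (6), those are the adaptive modes that redistribute traffic as load changes, which would at least be consistent with the theory. A mode like active-backup (1) would argue against it.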
UPDATE: the core issue is that I can't specify the NIC for boost asio to use for reception of the UDP traffic. If I try to use the address, I get nothing. If I use the "0.0.0.0", then I get data but the problems persist. It's an annoying limitation with boost asio UDP, but it's a limitation, and we'll have to deal with it. Crud.
UPDATE: the only option I found was in the joining of the multicast channel. It turns out that you can tell boost which local address to join the multicast group on. This takes the form of something like:
socket->set_option(multicast::join_group( address_v4::from_string(aChannel.first), address_v4::from_string("10.2.2.8")));
where the second address is the address of the NIC you want to listen on. It works only marginally for me, and that's a drag, but it's a possibility if I need it. It's not boost's problem.
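For reference, the two-address join above maps onto the plain IP_ADD_MEMBERSHIP socket option, which takes the group address and the local interface address in an ip_mreq struct. This is a rough sketch with raw POSIX sockets on Linux, not the boost code I'm actually running. The helper names are hypothetical, and the group address in the usage note is a made-up example.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <stdexcept>
#include <string>

// Build the membership request: which multicast group to join, and the
// IPv4 address of the local NIC to join it on.
inline ip_mreq make_membership(const std::string& group, const std::string& nic) {
    ip_mreq req{};
    if (inet_pton(AF_INET, group.c_str(), &req.imr_multiaddr) != 1)
        throw std::invalid_argument("bad multicast group: " + group);
    if (inet_pton(AF_INET, nic.c_str(), &req.imr_interface) != 1)
        throw std::invalid_argument("bad interface address: " + nic);
    return req;
}

// Join the group on an existing UDP socket descriptor.
// Returns true on success; on failure errno holds the reason.
inline bool join_on_nic(int fd, const std::string& group, const std::string& nic) {
    ip_mreq req = make_membership(group, nic);
    return setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &req, sizeof(req)) == 0;
}
```

Something like join_on_nic(fd, "239.1.1.1", "10.2.2.8") would then ask the kernel to join that group via the NIC that owns 10.2.2.8. Note that this only controls where the join goes out; the address the socket is bound to is a separate matter.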
[4:20pm] UPDATE: I found out that it's the Intel NIC drivers! A guy in The Shop ran across this for his work a little bit ago, and found the solution in updated drivers for the Intel 10GbE NICs. I've talked to the Unix Admins, and they are building a patch for my boxes. This is fantastic news!