Weird Ethernet Problem with Avahi and HAL on Fedora Core 5


I got an email this morning from the UnixEng crew (a good lot of guys and gals, to be sure) saying that one of the machines we have here in the development group was throwing a ton of network errors. The switch port was set to 100/Full, which is typically what we want, because we've learned that auto/auto negotiation tends to be more trouble than it's worth.

So the network guy switched the port to auto/auto, and that helped a little, but I wanted to set the Linux box to 100/Full because I know that's better than leaving it on full auto. Typically, we do this when we build the box, but a lot of the time it gets skipped (forgotten) because things start out working, only to fail at a later date.

Anyway, the nice tools to remember here are ethtool and ifconfig. If you run ifconfig and look at the data for the Ethernet port (eth0 in this case), you can see whether there's likely to be a problem on the port.

Typically, you might see:

   eth0   Link encap:Ethernet  HWaddr 00:17:A4:99:07:EF
          inet addr:146.180.7.94  Bcast:146.180.7.127  Mask:255.255.255.128
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:123456 errors:0 dropped:0 overruns:0 frame:0
          TX packets:123456 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000

but if you see large numbers for the errors or the collisions, then you're probably looking at a duplex mismatch. The way to see what your card is set to right now is to run ethtool on the port:

   ethtool eth0

and a lot of nice, useful information about eth0 is going to spill out.
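
If you'd rather not eyeball all that output, here is a rough sketch of two small shell filters. The function names (nonzero_counters, link_settings) are made up for this example, and the first one assumes the older net-tools ifconfig layout shown above (errors:0, collisions:0 and so on), so newer ifconfig versions would need a different pattern.

```shell
#!/bin/sh
# Print any non-zero error or collision counters found in
# ifconfig-style text on stdin (old net-tools "name:value" layout).
nonzero_counters() {
    awk '
    {
        # Remember whether this is the RX or the TX line.
        dir = ""
        if ($1 == "RX" || $1 == "TX") dir = $1 " "
        for (i = 1; i <= NF; i++) {
            n = split($i, kv, ":")
            if (n == 2 &&
                kv[1] ~ /^(errors|dropped|overruns|frame|carrier|collisions)$/ &&
                kv[2] + 0 > 0)
                print dir kv[1] "=" kv[2]
        }
    }'
}

# Keep only the current speed, duplex and auto-negotiation lines
# from ethtool output on stdin.
link_settings() {
    awk -F': *' '
    /^[ \t]*(Speed|Duplex|Auto-negotiation):/ {
        sub(/^[ \t]+/, "", $1)
        print $1 "=" $2
    }'
}

# Typical use on the box itself (not run here):
#   ifconfig eth0 | nonzero_counters
#   ethtool eth0  | link_settings
```

No output from the first filter means every counter is zero. A half-duplex result from the second one, on a port that is supposed to be full duplex, is the usual sign of a mismatch.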

If the network card needs to be forced into a mode, then the easiest way to do this is to edit /etc/sysconfig/network-scripts/ifcfg-eth0 and add the line:

   ETHTOOL_OPTS="speed 100 duplex full autoneg off"

and after a quick reboot, things should be fine. That is, they should be. Today it was something different.
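
As an aside, the same settings can be tried on the running system before touching the config file, since ethtool -s accepts the same speed, duplex and autoneg keywords. The sketch below wraps that in a function. The name force_link and the DRY_RUN switch are inventions for this example, not part of ethtool.

```shell
#!/bin/sh
# Build (and optionally run) the ethtool command that forces a link mode.
# DRY_RUN is a made-up switch for this sketch: when it is 1 (the default),
# the command is only printed so it can be reviewed first.
force_link() {
    iface="$1"; speed="$2"; duplex="$3"
    set -- ethtool -s "$iface" speed "$speed" duplex "$duplex" autoneg off
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"    # needs root; takes effect immediately, lost on reboot
    fi
}

# Example (prints the command instead of running it):
#   force_link eth0 100 full
```

A setting applied this way does not survive a reboot, which makes it a fairly safe way to test. Keep in mind that forcing only one end of the link generally leaves the auto-negotiating end at half duplex, so the switch port should be forced to match.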

When I made the change to the /etc/sysconfig/network-scripts/ifcfg-eth0 file and rebooted, the Avahi and HAL services failed to start. When HAL failed to start, the box hung. Thankfully, I could get it to boot into single-user mode and start to look at things.

Google pointed to SELinux, but that was very clearly disabled on this box - as it should be. We looked at a lot of things - X11 drivers, etc. In the end, I was faced with the fact that the ETHTOOL_OPTS line was the only thing that had changed on the box, and so I removed that one line.

Bingo! That was it. Now what's odd is that the same line is on another box with the same hardware spec - so we're baffled as to why this machine has a problem. It could be the version of the kernel or packages - a yum update would help, but it's working now, so we may put that off. It's also possible that it's the switch on the other end... but that's going to be very hard to pin down.
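
One cheap way to test the kernel-or-packages theory would be to dump the kernel and package versions on both boxes and compare the lists. This is only a sketch: diff_pkg_lists and the file names are placeholders, and it assumes mktemp is available.

```shell
#!/bin/sh
# Show entries that appear in only one of two package lists.
# Lines only in the first file are prefixed "<", lines only in the second ">".
diff_pkg_lists() {
    a_sorted=$(mktemp) || return 1
    b_sorted=$(mktemp) || return 1
    sort "$1" > "$a_sorted"
    sort "$2" > "$b_sorted"
    comm -23 "$a_sorted" "$b_sorted" | sed 's/^/< /'
    comm -13 "$a_sorted" "$b_sorted" | sed 's/^/> /'
    rm -f "$a_sorted" "$b_sorted"
}

# On each box (file names are placeholders):
#   { uname -r; rpm -qa; } > "$(hostname).pkgs"
# then copy both files to one machine and run:
#   diff_pkg_lists good-box.pkgs bad-box.pkgs
```

Anything that shows up on only one side, especially the kernel, initscripts, HAL or Avahi packages, would be a reasonable place to start looking.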

In the end, we're going to keep an eye on it, but what a mess trying to fix something like this.