Lots of Trouble with MindAlign These Days

We use MindAlign in the Shop to have a secure chat environment, such as it is. Over the last few days, it's been problematic to say the least. Today, in fact, they have narrowed down the problem to a global corporate loss of DNS due to MindAlign.

I am not going to bad-mouth MindAlign. It's a package that the corporation has picked, and we're going to use it. Period. But today when I came into work I noticed (again) that almost all my production applications (servers and services) had one core/CPU hammering away doing seemingly nothing - but very, very fast.

I found that once again we were having MindAlign problems, and I decided that this morning I'd dig into the problem and find out why they were all spinning like mad. The first step on that path was to get a few stack dumps, and it's a lot easier to do that in Java than it is in C++, so I did a few thread dumps on an afflicted Java app. After three of them, it became pretty clear where the problem was: in the BKIRCProtocol object. That's the thing that talks to the IRC server, which is the core of MindAlign.

It appeared that the IRC protocol was trying to throw an EOFException because the socket was not returning any data and was, in fact, returning an error code indicating that it wasn't all there anymore. The problem was that at the higher level, where the EOFException was being caught, I was assuming that the socket was closed - which is why I'd get the EOF condition in the first place. Because of that assumption, nothing ever actually closed the connection, so a socket in an indeterminate state talking to the MindAlign server was left in place: "there", but not "there enough".

I put in an explicit disconnect() on the socket connection, and that should take care of things. A simple one-line fix (with an accompanying 10-line comment to say why it's being done). Then I was on to the C++ version.

Thankfully, the code is very similar, even though the socket communications aren't. I dug into the code at about the same spot, looking for the same thing: a socket that's almost dead, but reports that it's still connected even though it's not returning any data. What I found was code that looked like this:

  // now read up to the "\n" NEWLINE that the IRC server sends
  if (!error) {
    retval = mCommPort.readUpToNEWLINE();
  }

where readUpToNEWLINE() should return only when a line terminated by a NEWLINE (\n) character is received from the IRC server. The problem was that it was returning an empty string, and that was all I needed to know. If it returns nothing, then the only acceptable reason is that there was a timeout. So the fix is to check for that: an empty read on a timeout is fine, but an empty read for any other reason causes us to disconnect from the server, and on the next pass a new connection will be created.

Something like this:

  // now read up to the "\n" NEWLINE that the IRC server sends
  if (!error) {
    retval = mCommPort.readUpToNEWLINE();
    /*
     * It's possible that the data is empty - but the only way for
     * that to be acceptable is for a timeout to have occurred. So,
     * if the data is empty and a timeout *didn't* occur, then we
     * need to disconnect this guy and the next pass through, we'll
     * be able to connect again and set things up properly - we
     * hope.
     */
    if (retval.empty() && (errno != ERR_READ_TIMEOUT)) {
      disconnect();
    }
  }
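
To see the check in isolation, here is a minimal, self-contained sketch of one pass through the read loop. FakeCommPort, ReadStatus, ReadResult, and pollOnce() are hypothetical names made up for the illustration, not CKit code, and the sketch hands the timeout status back with the read instead of going through errno as the snippet above does.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>

// Hypothetical status for a single read; the real code checks errno instead.
enum class ReadStatus { Ok, Timeout, Error };

// What one scripted read hands back: the line (possibly empty) and why.
struct ReadResult {
  std::string line;
  ReadStatus status;
};

// Hypothetical stand-in for the real comm port. It replays a script of
// read results so the half-dead socket case can be reproduced on demand.
class FakeCommPort {
public:
  explicit FakeCommPort(std::deque<ReadResult> script)
      : mScript(std::move(script)) {}

  bool connected() const { return mConnected; }
  int connects() const { return mConnects; }

  void connect() { mConnected = true; ++mConnects; }
  void disconnect() { mConnected = false; }

  // Returns the next scripted result; once the script runs out, every
  // read looks like a plain timeout.
  ReadResult readUpToNEWLINE() {
    assert(mConnected);
    if (mScript.empty()) {
      return {"", ReadStatus::Timeout};
    }
    ReadResult r = mScript.front();
    mScript.pop_front();
    return r;
  }

private:
  std::deque<ReadResult> mScript;
  bool mConnected = false;
  int mConnects = 0;
};

// One pass of the read loop: (re)connect if needed, read a line, and drop
// the connection when the read comes back empty for any reason other than
// a timeout.
std::string pollOnce(FakeCommPort& port) {
  if (!port.connected()) {
    port.connect();
  }
  ReadResult r = port.readUpToNEWLINE();
  if (r.line.empty() && r.status != ReadStatus::Timeout) {
    port.disconnect();
  }
  return r.line;
}
```

Fed a script of a good line, a timeout, an empty non-timeout read, and another good line, pollOnce() keeps the connection through the timeout, drops it on the empty read, and reconnects on the following pass, so the port ends up having connected twice.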

At this point, I think I have the MindAlign issues under control. There are still the issues with the corporate servers, etc., but I can't do anything about them, and we'll just have to wait for them to get fixed up properly by the support teams. It's just nice to have these fixes in BKit and CKit so that I won't get bitten by this problem in the future.