This is Why Codecs Have to be Strictly Controlled
OK… so I'm working on some nice little features on the greek engine today - adding a few nice IRC commands to the system, and making things just a little bit nicer for the support staff, and we start getting these odd problems. In some cases, the response time from the engine is wildly varying, and in other cases, the memory footprint is far too big. All very odd, seemingly unrelated, but all timed to happen today.
So I started looking at yet another problem - one of the clients to the engine is sending a Close Channel message when it wasn't needed. That, in and of itself, is not the problem, but on closer inspection, the contents of the message are alarming:
[asyncRead] a close channel for unknown channel was received [asyncRead] 58 af cb a6 3b 49 db 41 03 84 4d 11 c4 5e 55 d1 05 45 12 X…;I.A..M..^U..E.
the sec on line of the error message is the binary contents of the close channel message, which should contain an 'X', followed by the 16-byte channel ID, followed by an encoded variant value. In this case, the 'E' means it is an error, and by definition that means that a varint-encoded number follows, and after that, another variant that is the "reason'. The value after the 'E' is intended to be the numeric error code, and the variant is meant to hold the message or messages that accompany it.
But as you can see, there's the 'E', and a varint-encoded value, and then nothing. In my decoder, I look at the next byte and try to decode it. If that happens to be a String, or a Map, I can go off into lala land and decode a GB or two. Not good.
The solution? Well… there's two: we have to get the app (or app writer) that generates this malformed error to correct their mistake, and until this person can be identified, we have to protect our decoder against this kind of problem and put in a simple test:
if (aPos < aCode.size()) { mErrorValue->get<1>().deserialize(aCode, aPos); }
The real problem with this is that someone created a codec that encodes data improperly. To what extent? I have no idea, but this kind of things is capable of bring down a whole lot of servers and clients. I'm lucky it's not been a lot worse. But it underscores the need to have a group of people that control these critical components, and not allow just "anyone" to fiddle with them. The risk and consequences are just too great.
I wish I had faith that this will be the catalyst to stop all this, but I have serious doubts about it. I'm certainly going to try to get it to create some change. It needs to happen, and it needs to happen now.