Sometimes Caching Isn't the Best Strategy


Today I was blown away by the performance boost I got from not caching the resulting data. When you look at it, it's perfectly logical why it's faster, but it took me an hour or so to even get to the point of questioning the cache as a bottleneck in the processing.

For the past day or so, I've been looking at the speed at which my server processes incoming price events (ticks). It's a lot better than it used to be, but there were times when it took minutes to get a price through the server. It didn't seem to be a locking problem; some ticks simply took much longer than others to process. So I started digging.

The first place to look was the processing of the incoming prices. The prices come from the feeder apps, go onto a compression queue (newer data replaces older data for the same instrument, but the queue order is retained), and then are pulled off by a thread and sent through the system. My first thought was that the queue was holding onto the prices longer than needed - primarily because the processing thread wasn't getting back around to the queue quickly enough. I thought it might be in the thread synchronization between the pushing and popping threads, but after a little experimentation, I realized that this wasn't the case. The problem was elsewhere.
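To make the idea concrete, here's a rough sketch of what a compression (conflating) queue like that might look like. This isn't the server's actual code - the class and method names are made up - but the behavior is the same: a newer tick for a symbol already in the queue replaces the old data without changing that symbol's place in line.

```cpp
#include <deque>
#include <map>
#include <mutex>
#include <optional>
#include <string>

// Illustrative only - not the server's real classes or names.
struct Tick {
    std::string symbol;
    double      price;
};

class ConflatingQueue {
public:
    // A newer tick for a symbol already in the queue replaces the old data
    // but keeps that symbol's original position in line.
    void push(const Tick& tick) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (latest_.find(tick.symbol) == latest_.end()) {
            order_.push_back(tick.symbol);
        }
        latest_[tick.symbol] = tick;
    }

    // The processing thread pulls the oldest symbol still waiting.
    std::optional<Tick> pop() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (order_.empty()) {
            return std::nullopt;
        }
        std::string symbol = order_.front();
        order_.pop_front();
        Tick tick = latest_[symbol];
        latest_.erase(symbol);
        return tick;
    }

private:
    std::mutex                  mutex_;
    std::deque<std::string>     order_;
    std::map<std::string, Tick> latest_;
};
```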

Next, I turned my attention to the updating of the prices. There's nothing in the code that would block the updates so drastically. I looked and looked further downstream, but in the end, this was just not where the problem was.

Finally, I started looking at the general CPU usage of the kernel process - the kernel of the server, not the OS. This led me back to something I've known about for months and just haven't gotten around to fixing: when there are clients attached to the server, the CPU usage goes way up - primarily due to the messaging to those clients about what's changed in the server. Unfortunately, I couldn't get a good handle on why this was happening - until this morning.

There are two parts to the client communication with the server: the outgoing notification that a particular instrument has changed, and the request/response with the client for the complete, detailed information about a particular instrument. I 'short circuited' each of these in turn, and found that the CPU usage was really coming from the notifications and not the queries. I was a little surprised. Then I started thinking about the details of the notifications.
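For what it's worth, the 'short circuiting' was nothing fancy - just disabling one path at a time and watching the CPU. Something roughly like this (the names are purely illustrative, not the server's real code):

```cpp
#include <iostream>
#include <string>

// Illustrative stand-ins for the two communication paths. The idea is just
// to turn one off at a time and see what happens to the CPU usage.
static const bool kShortCircuitNotifications = true;   // flip one of these
static const bool kShortCircuitQueries       = false;  // per test run

void onInstrumentUpdated(const std::string& symbol) {
    if (kShortCircuitNotifications) return;             // notifications disabled
    std::cout << "notify clients: " << symbol << " changed\n";
}

std::string onClientQuery(const std::string& symbol) {
    if (kShortCircuitQueries) return {};                // queries disabled
    return "full details for " + symbol;
}

int main() {
    onInstrumentUpdated("IBM");                  // silent: notifications are off
    std::cout << onClientQuery("IBM") << "\n";   // queries still answered
    return 0;
}
```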

Originally, when the server updated an instrument's data, it would ask each connected client if it was interested in hearing about this guy's update. This then migrated to having a client proxy on the server side answer the question, which was an important step, but it was still a little pokey. The next step was to have the client proxy remember the answer for each instrument. Going on the idea that once an instrument is 'interesting' it's very unlikely to stop being 'interesting', we cached the 'interesting instruments' and first checked whether the updated instrument was in that list before doing the more thorough check.
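Here's roughly what that cached check looked like - again, a sketch with made-up names rather than the real thing:

```cpp
#include <set>
#include <string>

// Illustrative only - the class and method names are not the server's.
class ClientProxy {
public:
    // Fast path: if we've already decided this instrument is interesting,
    // skip the more expensive check.
    bool isInterested(const std::string& symbol) {
        if (interesting_.count(symbol) > 0) {
            return true;
        }
        if (thoroughCheck(symbol)) {
            interesting_.insert(symbol);   // remember the answer for next time
            return true;
        }
        return false;
    }

private:
    bool thoroughCheck(const std::string& symbol) {
        // Stand-in for the slower, more complete interest check.
        return !symbol.empty();
    }

    std::set<std::string> interesting_;    // the cache that grows and grows
};
```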

The cache was a simple STL std::set of strings, but when a client had been connected for a while, it could easily hold literally thousands of 'interesting instruments'. Checking this set on every single update seemed to be the problem, so that's where I went to work.

The goal was to make checking an individual instrument so fast and easy that there would be no way a cache was going to be faster than simply checking that guy each time. The first step was to realize that all the clients that connect to the server really are looking for all positions, so I made a few methods on the instrument class that simply and very quickly tell me whether there are any positions on that guy, without bothering with the more complete checking that was done when we wanted only a subset of positions from the server.
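The kind of check I'm talking about is something like this - the Instrument class and hasAnyPositions() name are just stand-ins for illustration:

```cpp
#include <atomic>

// Illustrative only - not the server's real instrument class.
class Instrument {
public:
    void addPosition()    { ++positionCount_; }
    void removePosition() { --positionCount_; }

    // Cheap O(1) test - no scanning, no per-client filtering - fast enough
    // that caching the answer somewhere else buys you nothing.
    bool hasAnyPositions() const { return positionCount_ > 0; }

private:
    std::atomic<int> positionCount_{0};
};
```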

The second modification was to put an STL std::set where I had a simple list - one that required a linear search - for those instruments we wanted to be notified of even when they had no positions on them. When I put these changes into the system and restarted, I was amazed to see that fully 66% of the CPU usage of the 'connected' server was gone. Amazing.
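For reference, here's that list-versus-set change in miniature - the names are illustrative only:

```cpp
#include <algorithm>
#include <list>
#include <set>
#include <string>

// Before (roughly): a plain list that had to be searched front to back
// on every update - O(n).
bool wantsNotify(const std::list<std::string>& names, const std::string& symbol) {
    return std::find(names.begin(), names.end(), symbol) != names.end();
}

// After: a std::set answers the same question with an O(log n) lookup.
bool wantsNotify(const std::set<std::string>& names, const std::string& symbol) {
    return names.count(symbol) > 0;
}
```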

The CPU usage of the kernel now very clearly tracks the incoming price events. Prices move through the server at least 2 to 3 times faster, the code is simpler and smaller (no cache of instruments), and it can now move more prices than ever before. Really very cool.