I've written a system of alerts for a web app that I've been working on, and one of the problems - as it always is, is that of bad data. Specifically, false positives and false negatives. If we have an alert system, then we need to be sure that when there's a reason to alert, we alert the right people. A data entry error is not necessarily a reason for alert. It might clear up in a minute, and if so, we shouldn't alert folks telling them that the sky is falling.
So there's some element of signal processing in this work. How to filter out the bad data points and keep the good. I've been working with some interesting algorithms as well as some pretty straight-forward ones. One of the simpler ones is a simple n-point moving average. I have this on the input of the data to the alert system to suppress the spikes but there were some very nasty problems associated with that algorithm.
Probably the biggest was that it really doesn't filter the bad data 'out', as much as diffuses it. Take a large negative spike with a small trailing bit. An 11-point moving average would look like this:
7:23:43 AM |
5,138.13 |
5,151.16 |
7:23:53 AM |
5,707.47 |
5,316.76 |
7:24:03 AM |
5,682.36 |
5,353.47 |
7:24:13 AM |
4,888.47 |
5,529.33 |
7:24:23 AM |
4,888.47 |
5,695.98 |
7:24:33 AM |
5,546.79 |
5,631.53 |
7:24:43 AM |
6,235.12 |
-70,614.18 |
7:24:53 AM |
6,235.12 |
-71,246.95 |
7:25:03 AM |
6,370.65 |
-72,168.50 |
7:25:13 AM |
6,269.32 |
-72,773.25 |
7:25:24 AM |
4,984.98 |
-73,349.24 |
7:25:34 AM |
-833,564.68 |
-73,109.86 |
7:25:44 AM |
-1,253.10 |
-73,110.26 |
7:25:54 AM |
-4,454.62 |
-73,077.15 |
7:26:04 AM |
-1,763.81 |
-73,117.80 |
7:26:14 AM |
-1,447.41 |
-73,160.35 |
7:26:24 AM |
8,180.01 |
-72,933.61 |
7:26:34 AM |
6,230.67 |
3,251.76 |
7:26:44 AM |
6,599.34 |
4,012.67 |
7:26:54 AM |
5,923.49 |
5,037.98 |
7:27:04 AM |
5,801.25 |
6,045.78 |
7:27:14 AM |
7,479.17 |
6,795.37 |
7:27:24 AM |
4,474.36 |
6,688.10 |
7:27:34 AM |
7,116.87 |
6,841.77 |
7:27:44 AM |
6,823.86 |
6,855.29 |
7:27:54 AM |
9,321.94 |
7,179.46 |
7:28:04 AM |
6,798.15 |
7,338.22 |
The problem is that the bad data starts at 7:25:34, but it effects the moving average for a much larger area simply due to it's magnitude with respect to the "signal" it's in. There's not a lot a moving average can do to smooth this out. Sure, other filters can, but I also need something that's going to be fast, so I can't be computing an FFT to get a frequency-domain based filter for this data.
As I was thinking about this problem and started thinking about the other simple statistical functions that were in Numbers (the program I was using to look at and play with the data. In a flash, the median hit me, and I started to run with it.
What if I did a 5-point median, and then did an 11-point average on that? Oh... that looked good. What about a straight 11-point median? See for yourself:
7:23:43 AM |
5,138.13 |
5,151.16 |
5,138.13 |
7:23:53 AM |
5,707.47 |
5,316.76 |
5,546.79 |
7:24:03 AM |
5,682.36 |
5,353.47 |
5,546.79 |
7:24:13 AM |
4,888.47 |
5,529.33 |
5,682.36 |
7:24:23 AM |
4,888.47 |
5,695.98 |
5,693.90 |
7:24:33 AM |
5,546.79 |
5,631.53 |
5,682.36 |
7:24:43 AM |
6,235.12 |
-70,614.18 |
5,682.36 |
7:24:53 AM |
6,235.12 |
-71,246.95 |
5,546.79 |
7:25:03 AM |
6,370.65 |
-72,168.50 |
4,984.98 |
7:25:13 AM |
6,269.32 |
-72,773.25 |
4,984.98 |
7:25:24 AM |
4,984.98 |
-73,349.24 |
4,984.98 |
7:25:34 AM |
-833,564.68 |
-73,109.86 |
4,984.98 |
7:25:44 AM |
-1,253.10 |
-73,110.26 |
4,984.98 |
7:25:54 AM |
-4,454.62 |
-73,077.15 |
4,984.98 |
7:26:04 AM |
-1,763.81 |
-73,117.80 |
4,984.98 |
7:26:14 AM |
-1,447.41 |
-73,160.35 |
4,984.98 |
7:26:24 AM |
8,180.01 |
-72,933.61 |
5,801.25 |
7:26:34 AM |
6,230.67 |
3,251.76 |
5,801.25 |
7:26:44 AM |
6,599.34 |
4,012.67 |
5,923.49 |
7:26:54 AM |
5,923.49 |
5,037.98 |
6,230.67 |
7:27:04 AM |
5,801.25 |
6,045.78 |
6,599.34 |
7:27:14 AM |
7,479.17 |
6,795.37 |
6,798.15 |
7:27:24 AM |
4,474.36 |
6,688.10 |
6,798.15 |
7:27:34 AM |
7,116.87 |
6,841.77 |
6,823.86 |
7:27:44 AM |
6,823.86 |
6,855.29 |
6,823.86 |
7:27:54 AM |
9,321.94 |
7,179.46 |
6,999.97 |
7:28:04 AM |
6,798.15 |
7,338.22 |
7,116.87 |
This is a ton better. While it can still effect the output, bad data has to exist for at least half the window in order to effect it at all. This is a pretty fair situation as if it's that bad, then someone needs to know about it.
Next week, I'll look at putting this into the code.