Conducting Timing Tests
So this morning I'm doing some timing tests to see what effect we get from trying to run multiple jobs at the same time. Initially, we thought the boxes would be CPU-bound because of all the logic in the pinning and calculations. But that's not the case. I can use top to see that the vast majority of the time the CPUs are idle (there are 2 CPUs in this box).
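A quick way to see that (a sketch, assuming a Linux box with the stock procps top) is a one-shot batch-mode snapshot - the id field in the summary line is the idle percentage, and pressing 1 in interactive top breaks it out per CPU:

$ top -b -n 1 | grep -i 'cpu(s)'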
So what's taking so long?
The monitoring (NewRelic) says it's all I/O - which seems reasonable, but then can we cut the total time for all six runs by putting them all in the run queue at the same time? Maybe we're I/O-choked, and if that's the case, then no, it won't help. But if we're latency-bound, then it will - so long as the services on the other end are fast enough to handle multiple requests.
That's a lot of questions, which is why I'm running the tests.
Timing a particular run is easy - just use time. But if I want to run them in parallel and measure the effect the same way, I need to be a little more clever:
#!/bin/bash
philadelphia &
central-jersey &
baltimore &
wait
This is a great little script that runs all three in parallel, and then waits for them all to be done before returning. If I put this into a time call:
$ time ./parallel
then I'll get exactly what I want: the wall-clock (real) time for the whole batch, with the user and sys times accumulated across all the child processes.
Now it's just a matter of doing all the runs and gathering all the data.
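Gathering the individual baselines is a loop like this (a rough sketch - it assumes the division names are the same commands invoked in the parallel script above, and it drops each time report, along with anything the job writes to stderr, into its own log file):

#!/bin/bash
# time each division on its own, one log file per division
for div in philadelphia central-jersey baltimore cincinnati cleveland austin; do
    { time "$div" ; } 2> "timing-$div.log"
done

The per-division baselines: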
| Division | Total | real | user | sys |
|----------|-------|------|------|-----|
| philadelphia | 414.68 | 3:13.863 | 2:02.340 | 0:08.889 |
| central-jersey | 109.00 | 2:17.305 | 1:29.034 | 0:05.768 |
| baltimore | 214.04 | 4:03.828 | 1:53.211 | 0:07.500 |
| cincinnati | 121.80 | 3:09.080 | 1:33.894 | 0:06.716 |
| cleveland | 264.36 | 2:56.955 | 1:48.855 | 0:07.872 |
| austin | 194.46 | 4:45.738 | 2:06.456 | 0:08.581 |
OK, this is all a nice baseline, but what we're really interested in is how they work together. To test that, we ran successively larger parallel sets - starting at the top of the list and adding one more job each time - and timed each batch.
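The wrapper for those batches is just a generalization of the parallel script above (a sketch - same assumptions about the division commands, with the job count passed as the first argument; the script name below is just a placeholder):

#!/bin/bash
# run the first N divisions in parallel, then wait for all of them to finish
divisions=(philadelphia central-jersey baltimore cincinnati cleveland austin)
for div in "${divisions[@]:0:$1}"; do
    "$div" &
done
wait

Timing works the same way as before - time ./parallel-n 4 - and the numbers: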
| Job Count | real | user | sys |
|-----------|------|------|-----|
| 2 | 7:37.395 | 4:22.084 | 0:16.545 |
| 3 | 7:40.657 | 6:45.033 | 0:24.438 |
| 4 | 12:14.561 | 8:05.870 | 1:54.751 |
| 5 | 16:09.544 | 12:06.461 | 0:31.694 |
| 6 | 18:53.886 | 14:59.040 | 0:37.562 |
This looks puzzling, so I'm going to do similar tests - this time running the first 2, 3, 4, 5, and 6 jobs serially - and see how they compare.
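That's the same sketch with the backgrounding and the wait dropped, so each job runs to completion before the next one starts:

#!/bin/bash
# run the first N divisions one after another
divisions=(philadelphia central-jersey baltimore cincinnati cleveland austin)
for div in "${divisions[@]:0:$1}"; do
    "$div"
done

And the serial numbers: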
| Job Count | real | user | sys |
|-----------|------|------|-----|
| 2 | 9:00.463 | 4:10.892 | 0:16.337 |
| 3 | 12:21.413 | 5:59.158 | 0:23.101 |
| 4 | 15:49.010 | 7:32.160 | 0:27.686 |
| 5 | 20:13.591 | 9:39.248 | 0:37.150 |
| 6 | 26:44.176 | 11:47.096 | 0:44.891 |
Putting all this data side by side tells me that it doesn't really matter if we try to run these in parallel. Running all six in parallel takes 18:54 against 26:44 serially - a savings of under 30%, nowhere near the scaling we'd need. There's some other limiting factor in the system, and it's not the CPU. This isn't terribly surprising, but it means we really need to start looking at these external systems and see where we can limit our exposure to them.
UPDATE: it took 4:10 - just over four hours - to do all 40 of the divisions that we need to do nightly. I guess in that light it's OK, as there were plenty of retries in the run and we still had time to do the Top 40. The problem, I fear, is that we still have a global choke point that we need to work through, but we'll run into that soon enough.