Off CPU analysis, getrusage, wait4 and related techniques…
As I looked to analyze x264, I saw that its On CPU metric was considerably lower than for other benchmarks like openssl or c-ray, which are On CPU almost 100% of the time. I also noticed that my Ryzen 1700 box scores somewhat lower on x264 than the one described on Phoronix.
My hypothesis is that the difference is time spent waiting on a disk that is relatively slow compared with the SSD the Phoronix system runs from. However, I was looking for some additional ways to demonstrate and measure this more definitively, and I believe I have some changes that will both simplify wspy and give some basic metrics. This post documents some of the steps along the way and the improvements planned.
My first approach was to consider printing out statistics for disk reads and writes. I already have this in wspy under a --diskstats option but hadn't really calibrated things. Next, I observed that one way of generalizing this waiting-for-disk is as a more generic blocking system call (e.g. read) where the kernel decides to schedule another process. The getrusage(2) system call can retrieve this information for all children of a process, and I also noticed that the /usr/bin/time command retrieves the same information.
So I implemented a quick --show-usage option that calls getrusage(2) and dumps the information into a resource.txt file. I also did a simple run of multiple benchmarks with "time -v", and the information seems consistent and cumulative for all children. For example, here is the output of "time -v" for a phoronix run of c-ray:
	Command being timed: "phoronix-test-suite batch-run c-ray"
	User time (seconds): 631.46
	System time (seconds): 0.23
	Percent of CPU this job got: 750%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:24.16
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 24536
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 54135
	Voluntary context switches: 1688
	Involuntary context switches: 154524
	Swaps: 0
	File system inputs: 0
	File system outputs: 34432
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
Similar information gets dumped by my --show-usage option:
utime: 631.325629 stime: 0.247362 maxrss: 23K minflt: 54094 majflt: 1 nswap: 0 inblock: 120 oublock: 34432 msgsnd: 0 msgrcv: 0 nsignals: 0 nvcsw: 1672 nivcsw: 154274
In particular, the number of voluntary context switches (e.g. making a system call and yielding to another process) is much lower than the number of involuntary context switches (e.g. preemption at the end of a time slice).
Those ratios are reversed for x264:
utime: 284.392069 stime: 1.284842 maxrss: 523K minflt: 163052 majflt: 0 nswap: 0 inblock: 0 oublock: 688 msgsnd: 0 msgrcv: 0 nsignals: 0 nvcsw: 83752 nivcsw: 1522
So even though x264 writes many fewer blocks of I/O than c-ray, its ratio of voluntary context switches is much higher: its processes are making calls that cause them to give up the processor.
There does seem to be a slight anomaly with openssl, as neither "time -v" nor "wspy --show-usage" reports a realistic amount of user time (though they are consistent with each other):
utime: 0.403309 stime: 0.222706 maxrss: 24K minflt: 47722 majflt: 23 nswap: 0 inblock: 6720 oublock: 664 msgsnd: 0 msgrcv: 0 nsignals: 0 nvcsw: 920 nivcsw: 35
Apart from that anomaly, it does look like the information returned is cumulative and consistent.
Looking a bit further, however, I also noticed that I might be able to use the wait4(2) system call in wspy in a way that might simplify things a fair amount(*). At present, when a process exits, I parse the /proc/[pid]/stat file for the last information and then store it with my process information. I then dump this either to a CSV file or create total metrics from the in-memory process tree.
There is an easier way.
Instead of calling "wait", if I call "wait4" to collect the ptrace(2) events, I can also pass in a pointer that receives an rusage structure. This both replaces my technique of scraping /proc/[pid]/stat as a file and lets me collect this information across the process tree. Along the way, I'll need to debug what is going on with openssl in terms of lost metrics, and also see whether the right things happen for multi-threaded programs, where I sometimes get incorrect accounting of times. One potential issue is that the wait4 man page says "these functions are obsolete; use waitpid(2) or waitid(2) in new programs", so I'm not sure how well it works; waitpid doesn't return the same information.
However, I'm going to try modifying the ptrace2 engine to try this out and will update this post after I've gotten there. In the meantime, the following are the counts and ratios of voluntary to involuntary context switches for some of the benchmarks I had previously examined:
	               voluntary  involuntary  ratio
	asmfish:            2075        43862  0.05x
	compress-7zip:    236125        79036  2.99x
	c-ray:              1672       154274  0.01x
	go-benchmark:     461041        62524  7.4x
	openssl:             920           35  --
	povray:            38252        60301  0.63x
	redis:             22754          530  42x
	stream:             7877        36030  0.22x
	x264:              83752         1522  55x
It's not 100% clear-cut, but benchmarks like x264 spend a huge amount more time voluntarily giving up the processor, and hence time off the CPU, while programs like asmfish, c-ray, and even stream are the opposite.
Note: I've noticed that occasionally some fields of these rusage structures are empty at the top levels of the process tree, e.g. IPC messages, or too low to be believable, e.g. incoming/outgoing blocks read. I'm not certain the kernel is 100% keeping up with the accounting.