Off CPU analysis, getrusage, wait4 and related techniques…
As I looked to analyze x264, I saw that its On CPU metric was considerably lower than for other benchmarks like openssl or c-ray, which are On CPU almost 100% of the time. I also noticed that my Ryzen 1700 box scores somewhat lower on x264 than the one described on Phoronix.
My hypothesis is that the difference is time spent waiting on a disk that is relatively slow compared with the SSD the Phoronix system runs from. However, I was looking for some additional ways to demonstrate and measure this more definitively, and I believe I have some changes that will both simplify wspy and give some basic metrics. This post documents some of the steps along the way and the improvements planned.
My first approach was to consider printing out statistics for disk reads and writes. I already have this in wspy under a --diskstats option but hadn't really calibrated things. Next, I observed that one way of generalizing this waiting-for-disk is as a more generic blocking system call (e.g. read) where the kernel decides to schedule another process. The getrusage(2) system call can retrieve this information for all children of a process, and I also noticed that the /usr/bin/time command retrieves the same information.
So I implemented a quick --show-usage option that calls getrusage(2) and dumps the information into a resource.txt file. I also did a simple run of multiple benchmarks with "time -v", and the information seems consistent and cumulative for all children. For example, here is the output of "time -v" for a phoronix run of c-ray:
	Command being timed: "phoronix-test-suite batch-run c-ray"
	User time (seconds): 631.46
	System time (seconds): 0.23
	Percent of CPU this job got: 750%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:24.16
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 24536
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 54135
	Voluntary context switches: 1688
	Involuntary context switches: 154524
	Swaps: 0
	File system inputs: 0
	File system outputs: 34432
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
Similar information gets dumped by my --show-usage option:
utime: 631.325629 stime: 0.247362 maxrss: 23K minflt: 54094 majflt: 1 nswap: 0 inblock: 120 oublock: 34432 msgsnd: 0 msgrcv: 0 nsignals: 0 nvcsw: 1672 nivcsw: 154274
In particular, the number of voluntary context switches (e.g. making a system call and yielding to another process) is much lower than the number of involuntary context switches (e.g. preemption at the end of a time slice).
Those ratios are reversed for x264:
utime: 284.392069 stime: 1.284842 maxrss: 523K minflt: 163052 majflt: 0 nswap: 0 inblock: 0 oublock: 688 msgsnd: 0 msgrcv: 0 nsignals: 0 nvcsw: 83752 nivcsw: 1522
So even though x264 writes many fewer blocks of I/O than c-ray, its ratio of voluntary context switches is much higher: its processes are making calls that cause them to give up the processor.
There does seem to be a slight anomaly with openssl, as neither "time -v" nor "wspy --show-usage" reports a realistic amount of user time (though they are consistent with each other):
utime: 0.403309 stime: 0.222706 maxrss: 24K minflt: 47722 majflt: 23 nswap: 0 inblock: 6720 oublock: 664 msgsnd: 0 msgrcv: 0 nsignals: 0 nvcsw: 920 nivcsw: 35
Apart from that anomaly, it does look like the information returned is cumulative and consistent.
Looking a bit further, however, I also noticed that I might be able to use the wait4(2) system call in wspy in a way that might simplify things a fair amount(*). At present, when a process exits, I parse the /proc/[pid]/stat file for the last information and then store it with my process information. I then dump this either to a CSV file or create total metrics from the in-memory process tree.
There is an easier way.
Instead of calling "wait", if I call "wait4" to collect the ptrace(2) events, I can also pass in a pointer that receives an rusage structure. This both replaces my technique of scraping /proc/[pid]/stat as a file and lets me collect this information across the process tree. Along the way, I'll need to debug what is going on with openssl in terms of lost metrics, and also see whether the right things happen for multi-threaded programs, where I sometimes get incorrect accounting of times. One potential issue is that the wait4 man page says "these functions are obsolete; use waitpid(2) or waitid(2) in new programs", so I'm not sure how well it works; waitpid doesn't return the same information.
However, I'm going to try modifying the ptrace2 engine to try this out and will update this post after I've gotten there. In the meantime, the following are the counts and ratios of voluntary to involuntary context switches for some of the benchmarks I had previously examined:
	               voluntary  involuntary  ratio
	asmfish:            2075        43862  0.05x
	compress-7zip:    236125        79036  2.99x
	c-ray:              1672       154274  0.01x
	go-benchmark:     461041        62524  7.4x
	openssl:             920           35  --
	povray:            38252        60301  0.63x
	redis:             22754          530  42x
	stream:             7877        36030  0.22x
	x264:              83752         1522  55x
It's not 100% clear-cut, but benchmarks like x264 spend a huge amount more time voluntarily giving up the processor, and hence time off the CPU, while programs like asmfish, c-ray, and even stream are the opposite.
Note: I've noticed that occasionally some fields of these rusage structures are empty at the top levels of the process tree, e.g. IPC messages, or too low to be believable, e.g. incoming/outgoing blocks read. I'm not certain the kernel is 100% keeping up with the accounting.