As a follow-up to a previous post, I’ve added support to wspy for topdown analysis of backend stalls.
Continue reading →
Investigating Intel performance counters for backend memory costing…
I’ve implemented the first level of topdown performance counter analysis and also done an initial analysis of ~15 workloads from a recent Phoronix article. A logical next step is to expand the “backend bound” category, first separating CPU-bound from memory-bound stalls and then splitting the memory-bound part into L1 vs. L2 vs. L3 vs. main memory vs. memory stores.
This post looks at some of the counters under consideration.
Continue reading →
Off CPU analysis, getrusage, wait4 and related techniques…
As I looked to analyze x264, I saw that the On CPU metric was considerably less than other benchmarks like openssl or c-ray that are On CPU almost 100% of the time. I also noticed that my Ryzen 1700 box scores somewhat lower on x264 than one described in Phoronix.
My hypothesis is that the reason is time spent waiting on a disk that is relatively slow compared with Phoronix running off an SSD. However, I was looking for additional ways to demonstrate and measure this more definitively, and I believe I have some changes that will both simplify wspy and give some basic metrics. This post documents some of the steps along the way and the improvements planned.
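As a rough illustration of the kind of measurement this involves, here is a minimal Python sketch (not wspy's own code) that runs a child process, reaps it with wait4, and compares the CPU time from the returned rusage against wall-clock time. The function name `run_and_measure` is made up for this example.

```python
import os
import time


def run_and_measure(argv):
    """Run argv as a child; return (elapsed, cpu_seconds, on_cpu_fraction)."""
    start = time.monotonic()
    pid = os.fork()
    if pid == 0:
        # Child: replace this process with the workload.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed
    # Parent: wait4 reaps the child and also returns its resource usage.
    _, _status, rusage = os.wait4(pid, 0)
    elapsed = time.monotonic() - start
    # User plus system CPU time charged to the child.
    cpu_seconds = rusage.ru_utime + rusage.ru_stime
    return elapsed, cpu_seconds, cpu_seconds / elapsed
```

A workload that mostly waits (on disk, for example) gives a fraction well below 1. Because CPU time is summed across threads, a multi-threaded workload on several cores can give a value above 1, so the number needs to be read against the core count.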
Continue reading →
wspy – process tracing, lost nodes and potential refactoring
One of the tough bugs I’ve noticed sometimes popping up is an “orphaned” process in my process tree. When I’ve collected trees of processes, very occasionally a tree node drops on the floor and shows up as an orphan when it clearly should have been attached.
The problem seems to come up more with very large data sets, e.g. build-gcc run with >1,000,000 processes. I was fortunate to recently see it pop up in a few c-ray on Ryzen with ~600 runs. The symptoms are as if one of a few things might be happening:
- Race conditions somewhere are causing me to miss events, particularly exit(2) events. Without these, the processes never get closed out with finish times, and their counters and closing statistics are never dumped
- Perhaps I am getting the events, but my tree building and accounting have subtle bugs that introduce problems, particularly when pid numbers are reused
When the problem first appeared with build-gcc (where pids wrap around more than 30 times), I was more suspicious of the second cause. More recently I’ve leaned toward the first, particularly since it showed up in a small c-ray example on a fast processor.
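To make the two hypotheses concrete, here is a minimal Python sketch of tree building that tolerates pid reuse and reports both symptoms. It is not wspy's implementation: the event format `(timestamp, kind, pid, ppid)` and all names below are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    pid: int
    generation: int                  # earlier instances seen with this pid
    start: float
    finish: Optional[float] = None
    parent: Optional["Node"] = field(default=None, repr=False)
    children: list = field(default_factory=list)


def build_tree(events):
    """events: iterable of (timestamp, kind, pid, ppid), kind is "fork" or "exit"."""
    live = {}          # pid -> Node currently using that pid
    generations = {}   # pid -> number of instances seen so far
    nodes, orphans, unmatched_exits = [], [], []
    for ts, kind, pid, ppid in sorted(events, key=lambda e: e[0]):
        if kind == "fork":
            gen = generations.get(pid, 0)
            generations[pid] = gen + 1
            node = Node(pid, gen, ts, parent=live.get(ppid))
            if node.parent is not None:
                node.parent.children.append(node)
            else:
                orphans.append(node)   # no live parent known at fork time
            live[pid] = node           # a reused pid replaces the older instance
            nodes.append(node)
        elif kind == "exit":
            node = live.pop(pid, None)
            if node is None:
                unmatched_exits.append((ts, pid))
            else:
                node.finish = ts
    # Nodes that never saw an exit event: the symptom of the first hypothesis.
    never_closed = [n for n in nodes if n.finish is None]
    return nodes, orphans, never_closed, unmatched_exits
```

The root of the traced tree naturally lands in `orphans` too, since its parent is outside the trace. Any other entry there, or any entry in `never_closed`, points at a missed event or an accounting bug.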
This makes it interesting to figure out how to best debug things. I’ve dumped quite a few trace files of underlying event points and gone through them to look for patterns. Haven’t found anything there yet.
However, the recent restructuring I did, which dumps process information to a file and has a later program reconstruct the trees, made me realize I might also do this with an alternate implementation. I am going to try these as alternate implementations of a “--processtree-engine” option and then run more than one to diagnose things.
Continue reading →
Set up Ryzen 7 1700 system
I have set up a Ryzen 7 1700 system. I bought it at Costco at a discount, at a point when the older CPUs are being marked down in time for the newer ones.
Continue reading →
Phoronix benchmarks – new Ryzen processors; looking @ the workloads
AMD released new Ryzen processors today. Phoronix published an article that benchmarked these processors. AnandTech also published a review, and TechReport wrote one as well.
This posting is *not* measured on these new processors. Instead, it dissects the workloads when run on earlier AMD (Ryzen 1700) and Intel (Haswell i7-4770s) processors.
Continue reading →
Investigating performance counters related to memory
As part of my investigation to create a page for STREAM, I have tried to reconcile things with underlying performance counters. This page documents some of that work.
Continue reading →
wspy – added csv file as well as analysis functions
I have updated wspy to dump a “processtable.csv” file at the same time it dumps a “processtable.txt” file.
This gives me several advantages:
- I’ve separated the format of output from collecting instrumentation. Hence, I can run things once to collect the data and then display it in different forms. This addresses a problem where I kept wondering how best to decorate the process tree to strike a balance between too cluttered and not enough information. I can now save the information once in the tree and then display it as needed
- I have created an ability to dump metrics in a format other than a tree
- I have a tool that also helps me further investigate a problem where the build_gcc tree with 1 million processes has something going wrong.
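As a small illustration of displaying collected data in a form other than a tree, the Python sketch below aggregates a process table by command name. The column names (`pid`, `ppid`, `comm`, `elapsed`, `cpu`) and the sample rows are invented for this example and are not the actual layout of processtable.csv.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; the real file's columns may differ.
SAMPLE = """pid,ppid,comm,elapsed,cpu
100,1,make,10.0,0.5
101,100,cc1,4.0,3.8
102,100,cc1,5.0,4.6
103,100,as,0.5,0.2
"""


def summarize_by_command(csv_file):
    """Return {command: {"count", "elapsed", "cpu"}} summed over all rows."""
    totals = defaultdict(lambda: {"count": 0, "elapsed": 0.0, "cpu": 0.0})
    for row in csv.DictReader(csv_file):
        entry = totals[row["comm"]]
        entry["count"] += 1
        entry["elapsed"] += float(row["elapsed"])
        entry["cpu"] += float(row["cpu"])
    return dict(totals)


def print_summary(totals):
    """Print one line per command, largest CPU consumer first."""
    ordered = sorted(totals.items(), key=lambda kv: kv[1]["cpu"], reverse=True)
    for comm, t in ordered:
        print(f"{comm:10s} n={t['count']:4d} elapsed={t['elapsed']:8.2f} cpu={t['cpu']:8.2f}")


print_summary(summarize_by_command(io.StringIO(SAMPLE)))
```

The same function works on a real file by passing `open(path, newline="")` instead of the `StringIO` sample, once the column names are adjusted to match.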
Top down performance counter analysis (part 3) – wspy
As described in top down performance counter analysis part 1, top down analysis is an approach that uses key performance counters to characterize an application and then successively drills down with further refinement. On Intel x86 processors, the first-level refinement characterizes an application by looking at its overall usage of the pipeline. Topdown is implemented in the perf(1) tool, and in the previous post I was also able to extend likwid-perfctr with a TOPDOWN performance group for first-level characterization.
In my last post, I added topdown performance counter metrics to the wspy tool when it is run in “core” mode. In other words, instead of following processes, wspy reports against all processes that run on a core. In this posting, I have also added them for the case where performance counters are collected per process. The resulting process trees are now decorated with the topdown metrics.
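For reference, the first-level breakdown can be sketched in Python using the commonly published level-1 formulas for 4-wide Intel cores such as Haswell. This is an illustration under that assumption, not necessarily the exact computation wspy performs; the function and parameter names are made up, and the comments name the Intel events each input is assumed to come from.

```python
def topdown_level1(cycles, uops_not_delivered, uops_issued,
                   uops_retired_slots, recovery_cycles, width=4):
    """Split pipeline slots into the four level-1 topdown categories.

    Assumed event sources:
      cycles              CPU_CLK_UNHALTED.THREAD
      uops_not_delivered  IDQ_UOPS_NOT_DELIVERED.CORE
      uops_issued         UOPS_ISSUED.ANY
      uops_retired_slots  UOPS_RETIRED.RETIRE_SLOTS
      recovery_cycles     INT_MISC.RECOVERY_CYCLES
    """
    slots = width * cycles  # issue slots available over the measured interval
    frontend_bound = uops_not_delivered / slots
    bad_speculation = (uops_issued - uops_retired_slots
                       + width * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    # Whatever is left over is attributed to the backend.
    backend_bound = 1.0 - (frontend_bound + bad_speculation + retiring)
    return {
        "retiring": retiring,
        "frontend_bound": frontend_bound,
        "bad_speculation": bad_speculation,
        "backend_bound": backend_bound,
    }
```

Because backend bound is computed as the remainder, measurement noise in the other three counters ends up in that category, which is worth keeping in mind when drilling further into backend stalls.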
Continue reading →
Top down performance counter analysis (part 2) – wspy
As described in top down performance counter analysis part 1, top down analysis is an approach that uses key performance counters to characterize an application and then successively drills down with further refinement. On Intel x86 processors, the first-level refinement characterizes an application by looking at its overall usage of the pipeline. Topdown is implemented in the perf(1) tool, and in the previous post I was also able to extend likwid-perfctr with a TOPDOWN performance group for first-level characterization.
In this posting, I’ve documented the steps I took to add such top down support to wspy.
Continue reading →