topdown tool – adding support for level 3 frontend
The Intel topdown performance analysis method separates frontend stalls (level 1) into latency (level 2) and bandwidth (level 2). A frontend latency stall is a cycle in which no uops are issued at all because the frontend is waiting. A frontend bandwidth stall is a cycle in which only 1, 2 or 3 uops are issued.
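As a minimal sketch of that definition (this is not code from the topdown tool), the following Python function classifies a single cycle by the number of uops the frontend delivered. The name `classify_cycle` and the `width` parameter are hypothetical, and the sketch assumes a 4-wide machine whose backend is not stalled in that cycle.

```python
def classify_cycle(uops_delivered: int, width: int = 4) -> str:
    """Classify one cycle by how many uops the frontend delivered.

    Assumes the backend was ready to accept `width` uops in this cycle.
    """
    if not 0 <= uops_delivered <= width:
        raise ValueError("uops_delivered must be between 0 and width")
    if uops_delivered == 0:
        return "frontend_latency"      # nothing delivered at all
    if uops_delivered < width:
        return "frontend_bandwidth"    # 1 .. width-1 uops delivered
    return "no_frontend_stall"         # full width delivered


for n in range(5):
    print(n, classify_cycle(n))
```

The hardware counters only report totals over many cycles, so this per-cycle view is purely illustrative.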
I have updated the “topdown” wrapper script to provide additional level 3 details related to frontend stalls.
The following command illustrates this by collecting level 3 frontend information for the build-linux-kernel benchmark:
./wspy/topdown -l 3 -f -o topdown.txt phoronix-test-suite batch-run build-linux-kernel
The output in topdown.txt is as follows:
retire 0.356
speculation 0.113
frontend 0.421
idq_uops_delivered_0 0.129
icache_stall 0.092
itlb_misses 0.036
idq_uops_delivered_1 0.183
idq_uops_delivered_2 0.238
idq_uops_delivered_3 0.291
dsb_ops 36.66%
backend 0.110
A brief explanation using this output:
- The top-level events (retire, speculation, frontend, backend) show that this benchmark is most heavily weighted towards frontend stalls (0.421).
- The frontend latency stalls with 0 uops happen approximately 1/8th of the time (0.129)
- Three types of frontend latency stalls can be found at the next level. In this case, most are icache misses (0.092) and a smaller number are itlb misses (0.036). The third type (branch resteers) is not explicitly listed, but does not make a significant contribution.
- The numbers are close but do not match exactly between the overall frontend figure (0.421), which is based on the following counter:
idq_uops_not_delivered.core [Uops not delivered to Resource Allocation Table (RAT) per thread when backend of the machine is not stalled Spec update: HSD135]
and the following counters that try to break out how many cycles have 4, 3, 2, 1 or 0 uops
idq_uops_not_delivered.cycles_0_uops_deliv.core [Cycles per thread when 4 or more uops are not delivered to Resource Allocation Table (RAT) when backend of the machine is not stalled Spec update: HSD135]
idq_uops_not_delivered.cycles_le_1_uop_deliv.core [Cycles per thread when 3 or more uops are not delivered to Resource Allocation Table (RAT) when backend of the machine is not stalled Spec update: HSD135]
idq_uops_not_delivered.cycles_le_2_uop_deliv.core [Cycles with less than 2 uops delivered by the front end Spec update: HSD135]
idq_uops_not_delivered.cycles_le_3_uop_deliv.core [Cycles with less than 3 uops delivered by the front end Spec update: HSD135]
So I’ll worry more about an overall representation than an exact accounting.
- The last metric printed is “dsb_ops”, shown as a percentage of the total uops issued. The decoded stream buffer (dsb) is also known as the uop cache; it avoids having to redo the decode stages. While it can place up to 4 uops per cycle, it might also be a source of the idq (decode queue) having fewer than 4 uops issued. There is a separate counter that tallies those situations, though I haven’t printed it out here.
Mostly this lets me dig a little further to better understand what is happening in the frontend and what may be contributing to stalls.
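To make the latency breakdown concrete, here is a small Python sketch that parses `name value` pairs in the format shown in topdown.txt and computes how much of the 0-uop fraction is left over after subtracting icache and itlb stalls. The helper names `parse_topdown` and `latency_residual` are my own for illustration and are not part of the wrapper script; the sample string is copied from the output above. Treating the leftover as the share available to branch resteers is an assumption, not something the tool reports.

```python
SAMPLE = (
    "retire 0.356 speculation 0.113 frontend 0.421 "
    "idq_uops_delivered_0 0.129 icache_stall 0.092 itlb_misses 0.036 "
    "idq_uops_delivered_1 0.183 idq_uops_delivered_2 0.238 "
    "idq_uops_delivered_3 0.291 dsb_ops 36.66% backend 0.110"
)


def parse_topdown(text: str) -> dict[str, float]:
    """Parse whitespace-separated 'name value' pairs; 'NN%' becomes NN/100."""
    tokens = text.split()
    metrics = {}
    for name, value in zip(tokens[0::2], tokens[1::2]):
        if value.endswith("%"):
            metrics[name] = float(value[:-1]) / 100.0
        else:
            metrics[name] = float(value)
    return metrics


def latency_residual(metrics: dict[str, float]) -> float:
    """Fraction of 0-uop cycles not explained by icache or itlb stalls."""
    return (metrics["idq_uops_delivered_0"]
            - metrics["icache_stall"]
            - metrics["itlb_misses"])


m = parse_topdown(SAMPLE)
top = m["retire"] + m["speculation"] + m["frontend"] + m["backend"]
print(f"top level sum: {top:.3f}")
print(f"latency residual: {latency_residual(m):.3f}")
```

For the sample output the four top-level categories sum to 1.000 and the residual is about 0.001, which fits the observation that branch resteers do not contribute much here.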
(*) Here is the way I would do the accounting to reconcile
idq_uops_not_delivered.core = 0.421
Thus for 1,000 cycles, there are 4,000 slots and of these 0.421*4000 are not delivered (1,684) and the remainder (2,316) are delivered.
idq_uops_not_delivered.cycles_0_uops_deliv.core = 0.129 => 129 cycles with 0 uops
idq_uops_not_delivered.cycles_le_1_uop_deliv.core = 0.183 => 54 cycles with 1 uop
idq_uops_not_delivered.cycles_le_2_uop_deliv.core = 0.238 => 55 cycles with 2 uops
idq_uops_not_delivered.cycles_le_3_uop_deliv.core = 0.291 => 53 cycles with 3 uops
=> 709 cycles with 4 uops
Thus the total number of uops in 1,000 cycles is: 54*1 + 55*2 + 53*3 + 709*4 = 3159, which is quite a bit higher than the 2,316 delivered slots computed above.
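The same accounting can be written as a short Python sketch, which makes it easy to redo with other measurements. It assumes, as the footnote does, a 4-wide machine and that each cycles_le_N value is a cumulative fraction of cycles delivering N or fewer uops. The function names are hypothetical and not part of the topdown script.

```python
WIDTH = 4  # issue slots per cycle assumed in the accounting above


def cycles_per_bucket(cumulative, cycles=1000):
    """Convert cumulative fractions [<=0, <=1, <=2, <=3 uops] into
    cycle counts for exactly 0, 1, 2, 3 and WIDTH uops."""
    counts = []
    prev = 0
    for frac in cumulative:
        cur = round(frac * cycles)
        counts.append(cur - prev)
        prev = cur
    counts.append(cycles - prev)  # remaining cycles deliver the full width
    return counts


def delivered_from_histogram(counts):
    """Total uops delivered: (uops per cycle) * (cycles in that bucket)."""
    return sum(n * c for n, c in enumerate(counts))


def delivered_from_core(not_delivered_frac, cycles=1000):
    """Total uops delivered according to idq_uops_not_delivered.core."""
    slots = WIDTH * cycles
    return slots - round(not_delivered_frac * slots)


counts = cycles_per_bucket([0.129, 0.183, 0.238, 0.291])
print(counts)                            # [129, 54, 55, 53, 709]
print(delivered_from_histogram(counts))  # 3159
print(delivered_from_core(0.421))        # 2316
```

The gap between 3159 and 2316 is the discrepancy described above; the sketch only reproduces the arithmetic and does not explain it.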
So there is something wrong either in my understanding of the counters or in the counters themselves. One potential way to reconcile some of this is the Intel erratum for Haswell, which says:
Certain Perfmon Events May be Counted Incorrectly When The Processor is Not in C0 State.
Problem: Due to this erratum, the perfmon events listed below may be counted when the logical processor is not in C0 State.
IDQ.EMPTY (event code 0x79 and umask 0x02)
IDQ_UOPS_NOT_DELIVERED.CORE (event code 0x9c and umask 0x01)
RESOURCE_STALLS.ANY (event core 0xa2 umask 0x01)
CYCLE_ACTIVITY.CYCLES_LDM_PENDING (event 0xa3 umask 0x02 and cmask 0x02)
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE (event 0xa3 umask 0x04 and cmask 0x04)
CYCLE_ACTIVITY.STALLS_LDM_PENDING (event 0xa3 umask 0x06 and cmask 0x06)
Implication: The count will be higher than expected.
Workaround: None identified.
Status: For the steppings affected, see the Summary Table of Changes.