topdown tool – adding support for level 3 frontend – Performance analysis, tools and experiments

The Intel topdown performance analysis method splits frontend stalls (level 1) into latency (level 2) and bandwidth (level 2). A frontend latency stall is a cycle in which no uops are issued at all because the frontend is waiting. A frontend bandwidth stall is a cycle in which only 1, 2 or 3 uops are issued out of a possible 4.
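The distinction can be written down as a small classifier. This is only a sketch of the definition above, assuming a 4-wide pipeline; the function name and labels are my own and are not part of the topdown script. It also ignores the counters' extra condition that the backend must not be stalled.

```python
PIPELINE_WIDTH = 4  # assumption: 4 issue slots per cycle


def classify_frontend_cycle(uops_delivered: int) -> str:
    """Label one cycle by how many uops the frontend delivered."""
    if not 0 <= uops_delivered <= PIPELINE_WIDTH:
        raise ValueError(f"expected 0..{PIPELINE_WIDTH}, got {uops_delivered}")
    if uops_delivered == 0:
        return "latency"      # nothing delivered: frontend latency stall
    if uops_delivered < PIPELINE_WIDTH:
        return "bandwidth"    # 1, 2 or 3 delivered: frontend bandwidth stall
    return "full"             # all slots filled: no frontend stall


for n in range(PIPELINE_WIDTH + 1):
    print(n, classify_frontend_cycle(n))
```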

I have updated the “topdown” wrapper script to provide additional level 3 details related to frontend stalls.

As an illustration, the following command collects level 3 frontend information for the build-linux-kernel benchmark.

 ./wspy/topdown -l 3 -f -o topdown.txt phoronix-test-suite batch-run build-linux-kernel

The output in topdown.txt is as follows:

retire         0.356
speculation    0.113
frontend       0.421
idq_uops_delivered_0   0.129
icache_stall               0.092
itlb_misses                0.036
idq_uops_delivered_1   0.183
idq_uops_delivered_2   0.238
idq_uops_delivered_3   0.291
dsb_ops                    36.66%
backend        0.110

A brief explanation using this output:

The top-level events (retire, speculation, frontend, backend) show that this benchmark is weighted most heavily towards frontend stalls (0.421).
The frontend latency stalls, cycles with 0 uops, happen approximately 1/8th of the time (0.129).
Three types of frontend latency stalls can be found at the next level. In this case, most are icache misses (0.092) and a smaller number are itlb misses (0.036). The third type (branch resteers) is not explicitly listed, but does not make a significant contribution.
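The claim about branch resteers can be checked with simple arithmetic on the numbers in the output. The sketch below assumes the two listed level 3 values are fractions of the same cycle total as the 0-uop figure and are additive, which the output itself does not guarantee; the variable names are mine.

```python
# Values copied from topdown.txt above.
frontend_latency = 0.129   # idq_uops_delivered_0
icache_stall = 0.092
itlb_misses = 0.036

# Whatever is left over is the most that could be attributed to
# anything else, such as branch resteers (under the additivity assumption).
residual = frontend_latency - icache_stall - itlb_misses

print(f"icache share of latency stalls: {icache_stall / frontend_latency:.1%}")
print(f"itlb share of latency stalls:   {itlb_misses / frontend_latency:.1%}")
print(f"unattributed remainder:         {residual:.3f}")
```

Under that assumption the remainder is about 0.001 of all cycles, which is consistent with branch resteers being negligible for this benchmark.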

The numbers are close but do not match exactly between the overall frontend figure (0.421), which is based on the following counter:

  idq_uops_not_delivered.core
       [Uops not delivered to Resource Allocation Table (RAT) per thread when
        backend of the machine is not stalled Spec update: HSD135]

and the following counters, which try to break out how many cycles have 4, 3, 2, 1 or 0 uops delivered:

  idq_uops_not_delivered.cycles_0_uops_deliv.core
       [Cycles per thread when 4 or more uops are not delivered to Resource
        Allocation Table (RAT) when backend of the machine is not stalled Spec
        update: HSD135]
  idq_uops_not_delivered.cycles_le_1_uop_deliv.core
       [Cycles per thread when 3 or more uops are not delivered to Resource
        Allocation Table (RAT) when backend of the machine is not stalled Spec
        update: HSD135]
  idq_uops_not_delivered.cycles_le_2_uop_deliv.core
       [Cycles with less than 2 uops delivered by the front end Spec update:
        HSD135]
  idq_uops_not_delivered.cycles_le_3_uop_deliv.core
       [Cycles with less than 3 uops delivered by the front end Spec update:
	HSD135]

So I’ll worry more about an overall representation than an exact accounting.

The last metric printed is “dsb_ops”, expressed as a percentage of the total uops issued. The decoded stream buffer (dsb) is also known as the uop cache. This cache bypasses having to redo the decode stages. While it can deliver up to 4 uops per cycle, it might also be a source of the idq (decode queue) having fewer than 4 uops issued. There is a separate counter that tallies those situations, though I haven’t printed it out here.
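The percentage itself is just a ratio of two counts. A minimal sketch follows; the parameter names are placeholders rather than the exact perf event names the script uses, and the sample counts are made up to reproduce the 36.66% shown above.

```python
def dsb_share(dsb_uops: int, uops_issued: int) -> float:
    """Return uops delivered from the uop cache as a percentage of uops issued."""
    if uops_issued <= 0:
        raise ValueError("uops_issued must be positive")
    return 100.0 * dsb_uops / uops_issued


# Hypothetical counts chosen only for illustration.
print(f"dsb_ops {dsb_share(3666, 10000):.2f}%")
```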

Mostly this lets me decode a little further to understand a little better what is happening in the frontend and what may be contributing to stalls.
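For post-processing, the report format shown earlier (a name followed by a value, with dsb_ops given as a percentage) is simple enough to parse. The sketch below is my own helper, not part of wspy, and assumes every metric line has exactly two whitespace-separated fields. As a sanity check it confirms that the four level 1 metrics sum to 1.

```python
def parse_topdown(text: str) -> dict[str, float]:
    """Parse 'name value' lines; values ending in '%' become fractions."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, value = parts
        try:
            if value.endswith("%"):
                metrics[name] = float(value[:-1]) / 100.0
            else:
                metrics[name] = float(value)
        except ValueError:
            continue  # second field was not numeric
    return metrics


report = """\
retire         0.356
speculation    0.113
frontend       0.421
idq_uops_delivered_0   0.129
icache_stall               0.092
itlb_misses                0.036
idq_uops_delivered_1   0.183
idq_uops_delivered_2   0.238
idq_uops_delivered_3   0.291
dsb_ops                    36.66%
backend        0.110
"""

metrics = parse_topdown(report)
level1 = ["retire", "speculation", "frontend", "backend"]
total = sum(metrics[name] for name in level1)
print(f"level 1 total: {total:.3f}")
```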

(*) Here is how I would do the accounting to reconcile the two sets of counters:

idq_uops_not_delivered.core = 0.421

Thus for 1,000 cycles there are 4,000 slots. Of these, 0.421*4000 (1,684) are not delivered and the remainder (2,316) are delivered.

idq_uops_not_delivered.cycles_0_uops_deliv.core   = 0.129 => 129 cycles with 0 uops
idq_uops_not_delivered.cycles_le_1_uop_deliv.core = 0.183 =>  54 cycles with 1 uop  (183 - 129)
idq_uops_not_delivered.cycles_le_2_uop_deliv.core = 0.238 =>  55 cycles with 2 uops (238 - 183)
idq_uops_not_delivered.cycles_le_3_uop_deliv.core = 0.291 =>  53 cycles with 3 uops (291 - 238)
                                                          => 709 cycles with 4 uops (1000 - 291)

Thus the total number of uops in 1000 cycles is:
54*1 + 55*2 + 53*3 + 709*4 = 3159

which is quite a bit higher than the 2,316 delivered uops computed above.
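The same accounting can be scripted so it is easy to rerun with other measurements. This is a sketch of the arithmetic above, treating the four cycle counters as cumulative ("at most N uops delivered") in the same way; the variable names are mine.

```python
CYCLES = 1000
WIDTH = 4  # slots per cycle

not_delivered_fraction = 0.421              # idq_uops_not_delivered.core
cumulative = [0.129, 0.183, 0.238, 0.291]   # cycles with <= 0, 1, 2, 3 uops

# Turn cumulative fractions into cycles with exactly 0, 1, 2, 3 uops.
cum_cycles = [round(f * CYCLES) for f in cumulative]
cycles_with = [cum_cycles[0]] + [b - a for a, b in zip(cum_cycles, cum_cycles[1:])]
cycles_with.append(CYCLES - cum_cycles[-1])  # remaining cycles deliver 4 uops

# Delivered uops according to each source.
delivered_from_buckets = sum(n * c for n, c in enumerate(cycles_with))
delivered_from_core = round((1 - not_delivered_fraction) * WIDTH * CYCLES)

print("cycles with 0..4 uops:", cycles_with)
print("delivered (per-cycle counters):", delivered_from_buckets)
print("delivered (.core counter):     ", delivered_from_core)
print("gap:", delivered_from_buckets - delivered_from_core)
```

The gap of 843 uops per 1,000 cycles is the discrepancy discussed next.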

So there is something wrong either in my understanding of the counters or in the counters themselves. One potential way to reconcile some of this is an Intel erratum for Haswell, which says:

Certain Perfmon Events May be Counted Incorrectly When The Processor is Not in C0 State.

Problem: Due to this erratum, the perfmon events listed below may be counted when the logical processor is not in C0 State.
IDQ.EMPTY (event code 0x79 and umask 0x02)
IDQ_UOPS_NOT_DELIVERED.CORE (event code 0x9c and umask 0x01)
RESOURCE_STALLS.ANY (event code 0xa2 and umask 0x01)
CYCLE_ACTIVITY.CYCLES_LDM_PENDING (event 0xa3 umask 0x02 and cmask 0x02)
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE (event 0xa3 umask 0x04 and cmask 0x04)
CYCLE_ACTIVITY.STALLS_LDM_PENDING (event 0xa3 umask 0x06 and cmask 0x06)

Implication: The count will be higher than expected.

Workaround: None identified.

Status: For the steppings affected, see the Summary Table of Changes.
