perf tools: Add documentation for topdown metrics

+256

1 changed file

tools/perf/Documentation/topdown.txt

Using TopDown metrics in user space
-----------------------------------

Intel CPUs (since Sandy Bridge and Silvermont) support a TopDown
methodology to break down CPU pipeline execution into 4 bottlenecks:
frontend bound, backend bound, bad speculation, retiring.

For more details on TopDown see [1][5].

Traditionally this was implemented by events in generic counters
and specific formulas to compute the bottlenecks.

perf stat --topdown implements this.

Full TopDown includes more levels that can break down the
bottlenecks further. This is not directly implemented in perf,
but is available in other tools that can run on top of perf,
such as toplev[2] or vtune[3].

New TopDown features in Ice Lake
================================

With Ice Lake CPUs the TopDown metrics are directly available as
fixed counters and do not require generic counters. This allows
TopDown to be collected at all times, in addition to other events.

% perf stat -a --topdown -I1000
#           time          retiring   bad speculation    frontend bound     backend bound
     1.001281330             23.0%             15.3%             29.6%             32.1%
     2.003009005              5.0%              6.8%             46.6%             41.6%
     3.004646182              6.7%              6.7%             46.0%             40.6%
     4.006326375              5.0%              6.4%             47.6%             41.0%
     5.007991804              5.1%              6.3%             46.3%             42.3%
     6.009626773              6.2%              7.1%             47.3%             39.3%
     7.011296356              4.7%              6.7%             46.2%             42.4%
     8.012951831              4.7%              6.7%             47.5%             41.1%
...

This also enables measuring TopDown per thread/process instead
of only per core.

Using TopDown through RDPMC in applications on Ice Lake
=======================================================

For more fine grained measurements it can be useful to
access the new counters directly from user space. This is more complicated,
but drastically lowers overhead.

On Ice Lake, there is a new fixed counter 3: SLOTS, which reports
"pipeline SLOTS" (cycles multiplied by core issue width) and a
metric register that reports slots ratios for the different bottleneck
categories.

The metrics counter is CPU model specific and is not available on older
CPUs.

Example code
============

Library functions that implement the functionality described below
are also available in libjevents [4].

The application opens a group with fixed counter 3 (SLOTS) and any
metric event, and allows user programs to read the performance counters.

Fixed counter 3 is mapped to the pseudo event event=0x00, umask=0x04,
so the perf_event_attr structure should be initialized with
{ .config = 0x0400, .type = PERF_TYPE_RAW }
The metric events are mapped to the pseudo event event=0x00, umask=0x8X.
For example, the perf_event_attr structure can be initialized with
{ .config = 0x8000, .type = PERF_TYPE_RAW } for the Retiring metric event.
Fixed counter 3 must be the leader of the group.

	#include <linux/perf_event.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Provide own perf_event_open stub because glibc doesn't */
	__attribute__((weak))
	int perf_event_open(struct perf_event_attr *attr, pid_t pid,
			    int cpu, int group_fd, unsigned long flags)
	{
		return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
	}

	/* Open slots counter file descriptor for current task. */
	struct perf_event_attr slots = {
		.type = PERF_TYPE_RAW,
		.size = sizeof(struct perf_event_attr),
		.config = 0x400,
		.exclude_kernel = 1,
	};

	int slots_fd = perf_event_open(&slots, 0, -1, -1, 0);
	if (slots_fd < 0)
		... error ...

	/*
	 * Open metrics event file descriptor for current task.
	 * Set slots event as the leader of the group.
	 */
	struct perf_event_attr metrics = {
		.type = PERF_TYPE_RAW,
		.size = sizeof(struct perf_event_attr),
		.config = 0x8000,
		.exclude_kernel = 1,
	};

	int metrics_fd = perf_event_open(&metrics, 0, -1, slots_fd, 0);
	if (metrics_fd < 0)
		... error ...


The RDPMC instruction (or _rdpmc compiler intrinsic) can now be used
to read slots and the TopDown metrics at different points of the program:

	#include <stdint.h>
	#include <x86intrin.h>

	#define RDPMC_FIXED	(1 << 30)	/* return fixed counters */
	#define RDPMC_METRIC	(1 << 29)	/* return metric counters */

	#define FIXED_COUNTER_SLOTS		3
	#define METRIC_COUNTER_TOPDOWN_L1	0

	static inline uint64_t read_slots(void)
	{
		return _rdpmc(RDPMC_FIXED | FIXED_COUNTER_SLOTS);
	}

	static inline uint64_t read_metrics(void)
	{
		return _rdpmc(RDPMC_METRIC | METRIC_COUNTER_TOPDOWN_L1);
	}

Then the program can be instrumented to read these metrics at different
points.

It's not a good idea to do this with code regions that are too short,
as the parallelism and overlap in the CPU program execution will
cause too much measurement inaccuracy. For example, instrumenting
individual basic blocks is definitely too fine grained.

Decoding metrics values
=======================

The value reported by read_metrics() contains four 8 bit fields,
each representing a scaled ratio for one Level 1 bottleneck.
All four fields add up to 0xff (= 100%).

The binary ratios in the metric value can be converted to float ratios:

	#define GET_METRIC(m, i) (((m) >> ((i) * 8)) & 0xff)

	#define TOPDOWN_RETIRING(val)	((float)GET_METRIC(val, 0) / 0xff)
	#define TOPDOWN_BAD_SPEC(val)	((float)GET_METRIC(val, 1) / 0xff)
	#define TOPDOWN_FE_BOUND(val)	((float)GET_METRIC(val, 2) / 0xff)
	#define TOPDOWN_BE_BOUND(val)	((float)GET_METRIC(val, 3) / 0xff)

and then converted to percent for printing.

The ratios in the metric accumulate for the time when the counter
is enabled. For measuring programs it is often useful to measure
specific sections. This requires computing deltas of the metrics.

This can be done by scaling the metrics with the slots counter
read at the same time.

Then it's possible to take deltas of these slots counts
measured at different points, and determine the metrics
for that time period.

	slots_a = read_slots();
	metric_a = read_metrics();

	... larger code region ...

	slots_b = read_slots();
	metric_b = read_metrics();

	/*
	 * Compute scaled metrics (in slots) for measurement a.
	 * Each field is a fraction of 0xff, hence the division.
	 */
	retiring_slots_a = GET_METRIC(metric_a, 0) * slots_a / 0xff;
	bad_spec_slots_a = GET_METRIC(metric_a, 1) * slots_a / 0xff;
	fe_bound_slots_a = GET_METRIC(metric_a, 2) * slots_a / 0xff;
	be_bound_slots_a = GET_METRIC(metric_a, 3) * slots_a / 0xff;

	/* Compute delta scaled metrics between b and a. */
	retiring_slots = GET_METRIC(metric_b, 0) * slots_b / 0xff - retiring_slots_a;
	bad_spec_slots = GET_METRIC(metric_b, 1) * slots_b / 0xff - bad_spec_slots_a;
	fe_bound_slots = GET_METRIC(metric_b, 2) * slots_b / 0xff - fe_bound_slots_a;
	be_bound_slots = GET_METRIC(metric_b, 3) * slots_b / 0xff - be_bound_slots_a;

Later the individual ratios for the measurement period can be recreated
from these counts.

	slots_delta = slots_b - slots_a;
	retiring_ratio = (float)retiring_slots / slots_delta;
	bad_spec_ratio = (float)bad_spec_slots / slots_delta;
	fe_bound_ratio = (float)fe_bound_slots / slots_delta;
	be_bound_ratio = (float)be_bound_slots / slots_delta;

	printf("Retiring %.2f%% Bad Speculation %.2f%% FE Bound %.2f%% BE Bound %.2f%%\n",
		retiring_ratio * 100.,
		bad_spec_ratio * 100.,
		fe_bound_ratio * 100.,
		be_bound_ratio * 100.);

Resetting metrics counters
==========================

Since the individual metrics are only 8 bits wide, they lose precision for
short regions over time because the number of cycles covered by each
fraction bit grows as the counters accumulate. So the counters need to be
reset regularly.

When using the kernel perf API the kernel resets on every read.
So as long as the reading is at reasonable intervals (every few
seconds) the precision is good.

When using perf stat it is recommended to always use the -I option,
with an interval no longer than a few seconds:

	perf stat -I 1000 --topdown ...

For user programs using RDPMC directly the counter can
be reset explicitly using ioctl:

	ioctl(perf_fd, PERF_EVENT_IOC_RESET, 0);

This "opens" a new measurement period.

A program using RDPMC for TopDown should schedule such a reset
regularly, as in every few seconds.

Limits on Ice Lake
==================

Four pseudo TopDown metric events are exposed for the end-users,
topdown-retiring, topdown-bad-spec, topdown-fe-bound and topdown-be-bound.
They can be used to collect the TopDown value under the following
rules:
- All the TopDown metric events must be in a group with the SLOTS event.
- The SLOTS event must be the leader of the group.
- The PERF_FORMAT_GROUP flag must be applied for each TopDown metric
  event.

The SLOTS event and the TopDown metric events can be counting members of
a sampling read group. Since the SLOTS event must be the leader of a TopDown
group, the second event of the group is the sampling event.
For example, perf record -e '{slots, $sampling_event, topdown-retiring}:S'


[1] https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win
[2] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
[3] https://software.intel.com/en-us/intel-vtune-amplifier-xe
[4] https://github.com/andikleen/pmu-tools/tree/master/jevents
[5] https://sites.google.com/site/analysismethods/yasin-pubs
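
As a worked illustration of the delta computation described under
"Decoding metrics values", the following is a minimal sketch in C. It is
not the code used by perf or libjevents. struct topdown_sample and
topdown_delta() are hypothetical names introduced here, and the sample
values are assumed to come from read_slots() and read_metrics(). The
sketch divides by 0xff because each 8 bit field is a fraction of 0xff.

```c
#include <assert.h>
#include <stdint.h>

#define GET_METRIC(m, i) (((m) >> ((i) * 8)) & 0xff)

/* Field order in the metrics value, as described in the document. */
enum { TD_RETIRING, TD_BAD_SPEC, TD_FE_BOUND, TD_BE_BOUND, TD_NUM };

/* Hypothetical container for one slots/metrics reading pair. */
struct topdown_sample {
	uint64_t slots;
	uint64_t metrics;
};

/*
 * Hypothetical helper: compute the Level 1 ratios (0.0 .. 1.0) for the
 * region between samples a and b. Sample b must be taken after a, with
 * no counter reset in between.
 */
static void topdown_delta(const struct topdown_sample *a,
			  const struct topdown_sample *b,
			  double ratio[TD_NUM])
{
	uint64_t slots_delta = b->slots - a->slots;
	int i;

	assert(b->slots > a->slots);

	for (i = 0; i < TD_NUM; i++) {
		/* Scale each 8 bit fraction (x / 0xff) by its slots count. */
		double slots_a = (double)GET_METRIC(a->metrics, i) * a->slots / 0xff;
		double slots_b = (double)GET_METRIC(b->metrics, i) * b->slots / 0xff;

		ratio[i] = (slots_b - slots_a) / slots_delta;
	}
}
```

A caller would fill the two samples around the region of interest and
multiply each ratio by 100 for printing. Doubles are used for the
intermediate values so that rounding in the 8 bit fields cannot cause an
unsigned underflow when a delta comes out slightly negative.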
