perf report: Add latency and parallelism profiling documentation

+3 -2

tools/perf/Documentation/callchain-overhead-calculation.txt

···
 Overhead calculation
 --------------------
-The overhead can be shown in two columns as 'Children' and 'Self' when
-perf collects callchains. The 'self' overhead is simply calculated by
+The CPU overhead can be shown in two columns as 'Children' and 'Self'
+when perf collects callchains (and corresponding 'Wall' columns for
+wall-clock overhead). The 'self' overhead is simply calculated by
 adding all period values of the entry - usually a function (symbol).
 This is the value that perf shows traditionally and sum of all the
 'self' overhead values should be 100%.

+85

tools/perf/Documentation/cpu-and-latency-overheads.txt

···
+CPU and latency overheads
+-------------------------
+There are two notions of time: wall-clock time and CPU time.
+For a single-threaded program, or a program running on a single-core machine,
+these notions are the same. However, for a multi-threaded/multi-process program
+running on a multi-core machine, these notions are significantly different.
+Each second of wall-clock time we have number-of-cores seconds of CPU time.
+Perf can measure overhead for both of these times (shown in 'overhead' and
+'latency' columns for CPU and wall-clock time correspondingly).
+
+Optimizing CPU overhead is useful to improve 'throughput', while optimizing
+latency overhead is useful to improve 'latency'. It's important to understand
+which one is useful in a concrete situation at hand. For example, the former
+may be useful to improve max throughput of a CI build server that runs on 100%
+CPU utilization, while the latter may be useful to improve user-perceived
+latency of a single interactive program build.
+These overheads may be significantly different in some cases. For example,
+consider a program that executes function 'foo' for 9 seconds with 1 thread,
+and then executes function 'bar' for 1 second with 128 threads (consumes
+128 seconds of CPU time). The CPU overhead is: 'foo' - 6.6%, 'bar' - 93.4%.
+While the latency overhead is: 'foo' - 90%, 'bar' - 10%. If we try to optimize
+running time of the program looking at the (wrong in this case) CPU overhead,
+we would concentrate on the function 'bar', but it can yield only 10% running
+time improvement at best.
+
+By default, perf shows only CPU overhead. To show latency overhead, use
+'perf record --latency' and 'perf report':
+
+-----------------------------------
+  Overhead   Latency  Command
+    93.88%    25.79%  cc1
+     1.90%    39.87%  gzip
+     0.99%    10.16%  dpkg-deb
+     0.57%     1.00%  as
+     0.40%     0.46%  sh
+-----------------------------------
+
+To sort by latency overhead, use 'perf report --latency':
+
+-----------------------------------
+   Latency  Overhead  Command
+    39.87%     1.90%  gzip
+    25.79%    93.88%  cc1
+    10.16%     0.99%  dpkg-deb
+     4.17%     0.29%  git
+     2.81%     0.11%  objtool
+-----------------------------------
+
+To get insight into the difference between the overheads, you may check
+parallelization histogram with '--sort=latency,parallelism,comm,symbol --hierarchy'
+flags. It shows fraction of (wall-clock) time the workload utilizes different
+numbers of cores ('Parallelism' column). For example, in the following case
+the workload utilizes only 1 core most of the time, but also has some
+highly-parallel phases, which explains significant difference between
+CPU and wall-clock overheads:
+
+-----------------------------------
+   Latency  Overhead     Parallelism / Command / Symbol
++   56.98%     2.29%     1
++   16.94%     1.36%     2
++    4.00%    20.13%     125
++    3.66%    18.25%     124
++    3.48%    17.66%     126
++    3.26%     0.39%     3
++    2.61%    12.93%     123
+-----------------------------------
+
+By expanding corresponding lines, you may see what commands/functions run
+at the given parallelism level:
+
+-----------------------------------
+   Latency  Overhead     Parallelism / Command / Symbol
+-   56.98%     2.29%     1
+      32.80%     1.32%       gzip
+       4.46%     0.18%       cc1
+       2.81%     0.11%       objtool
+       2.43%     0.10%       dpkg-source
+       2.22%     0.09%       ld
+       2.10%     0.08%       dpkg-genchanges
+-----------------------------------
+
+To see the normal function-level profile for particular parallelism levels
+(number of threads actively running on CPUs), you may use '--parallelism'
+filter. For example, to see the profile only for low parallelism phases
+of a workload use '--latency --parallelism=1-2' flags.
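The 'foo'/'bar' percentages quoted in the new document can be checked with a few lines of arithmetic. The sketch below is illustrative only: `overhead_shares` is a hypothetical helper, not part of perf, and it assumes each phase keeps exactly the stated number of threads running for its whole duration.

```python
def overhead_shares(phases):
    """Return {name: (cpu_pct, latency_pct)} for a list of phases.

    Each phase is (name, wall_seconds, threads). Hypothetical helper for
    illustration; assumes all `threads` run for the full phase duration.
    """
    # CPU time consumed by a phase is wall-clock time times running threads.
    cpu_time = {name: secs * threads for name, secs, threads in phases}
    wall_time = {name: secs for name, secs, _ in phases}
    total_cpu = sum(cpu_time.values())
    total_wall = sum(wall_time.values())
    return {
        name: (100.0 * cpu_time[name] / total_cpu,
               100.0 * wall_time[name] / total_wall)
        for name in cpu_time
    }


# 'foo' runs 9 s on 1 thread, 'bar' runs 1 s on 128 threads.
shares = overhead_shares([("foo", 9, 1), ("bar", 1, 128)])
for name, (cpu_pct, lat_pct) in shares.items():
    print(f"{name}: CPU {cpu_pct:.1f}%, latency {lat_pct:.1f}%")
```

Under these assumptions the CPU shares come out as 9/137 and 128/137 of the total, while the wall-clock shares are 9/10 and 1/10, matching the figures in the text.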

+32 -17

tools/perf/Documentation/perf-report.txt

···
 --comms=::
 	Only consider symbols in these comms. CSV that understands
 	file://filename entries. This option will affect the percentage of
-	the overhead column. See --percentage for more info.
+	the overhead and latency columns. See --percentage for more info.
 --pid=::
 	Only show events for given process ID (comma separated list).

···
 --dsos=::
 	Only consider symbols in these dsos. CSV that understands
 	file://filename entries. This option will affect the percentage of
-	the overhead column. See --percentage for more info.
+	the overhead and latency columns. See --percentage for more info.
 -S::
 --symbols=::
 	Only consider these symbols. CSV that understands
 	file://filename entries. This option will affect the percentage of
-	the overhead column. See --percentage for more info.
+	the overhead and latency columns. See --percentage for more info.

 --symbol-filter=::
 	Only show symbols that match (partially) with this filter.
···
 -U::
 --hide-unresolved::
 	Only display entries resolved to a symbol.
+
+--parallelism::
+	Only consider these parallelism levels. Parallelism level is the number
+	of threads that actively run on CPUs at the time of sample. The flag
+	accepts single number, comma-separated list, and ranges (for example:
+	"1", "7,8", "1,64-128"). This is useful in understanding what a program
+	is doing during sequential/low-parallelism phases as compared to
+	high-parallelism phases. This option will affect the percentage of
+	the overhead and latency columns. See --percentage for more info.
+	Also see the `CPU and latency overheads' section for more details.

 --latency::
 	Show latency-centric profile rather than the default
···
 	entries are displayed as "[other]".
 	- cpu: cpu number the task ran at the time of sample
 	- socket: processor socket number the task ran at the time of sample
+	- parallelism: number of running threads at the time of sample
 	- srcline: filename and line number executed at the time of sample. The
 	DWARF debugging info must be provided.
 	- srcfile: file name of the source file of the samples. Requires dwarf
···
 	- cgroup_id: ID derived from cgroup namespace device and inode numbers.
 	- cgroup: cgroup pathname in the cgroupfs.
 	- transaction: Transaction abort flags.
-	- overhead: Overhead percentage of sample
-	- overhead_sys: Overhead percentage of sample running in system mode
-	- overhead_us: Overhead percentage of sample running in user mode
-	- overhead_guest_sys: Overhead percentage of sample running in system mode
+	- overhead: CPU overhead percentage of sample.
+	- latency: latency (wall-clock) overhead percentage of sample.
+	See the `CPU and latency overheads' section for more details.
+	- overhead_sys: CPU overhead percentage of sample running in system mode
+	- overhead_us: CPU overhead percentage of sample running in user mode
+	- overhead_guest_sys: CPU overhead percentage of sample running in system mode
 	on guest machine
-	- overhead_guest_us: Overhead percentage of sample running in user mode on
+	- overhead_guest_us: CPU overhead percentage of sample running in user mode on
 	guest machine
 	- sample: Number of sample
 	- period: Raw number of event count of sample
···
 	- weight2: Average value of event specific weight (2nd field of weight_struct).
 	- weight3: Average value of event specific weight (3rd field of weight_struct).

-	By default, comm, dso and symbol keys are used.
-	(i.e. --sort comm,dso,symbol)
+	By default, overhead, comm, dso and symbol keys are used.
+	(i.e. --sort overhead,comm,dso,symbol).

 	If --branch-stack option is used, following sort keys are also
 	available:
···
 --fields=::
 	Specify output field - multiple keys can be specified in CSV format.
 	Following fields are available:
-	overhead, overhead_sys, overhead_us, overhead_children, sample, period,
-	weight1, weight2, weight3, ins_lat, p_stage_cyc and retire_lat. The
-	last 3 names are alias for the corresponding weights. When the weight
+	overhead, latency, overhead_sys, overhead_us, overhead_children, sample,
+	period, weight1, weight2, weight3, ins_lat, p_stage_cyc and retire_lat.
+	The last 3 names are alias for the corresponding weights. When the weight
 	fields are used, they will show the average value of the weight.

 	Also it can contain any sort key(s).
···
 	Accumulate callchain of children to parent entry so that then can
 	show up in the output. The output will have a new "Children" column
 	and will be sorted on the data. It requires callchains are recorded.
-	See the `overhead calculation' section for more details. Enabled by
+	See the `Overhead calculation' section for more details. Enabled by
 	default, disable with --no-children.

 --max-stack::
···
 	--call-graph option for details.

 --percentage::
-	Determine how to display the overhead percentage of filtered entries.
-	Filters can be applied by --comms, --dsos and/or --symbols options and
-	Zoom operations on the TUI (thread, dso, etc).
+	Determine how to display the CPU and latency overhead percentage
+	of filtered entries. Filters can be applied by --comms, --dsos, --symbols
+	and/or --parallelism options and Zoom operations on the TUI (thread, dso, etc).

 	"relative" means it's relative to filtered entries only so that the
 	sum of shown entries will be always 100%. "absolute" means it retains
···

 --skip-empty::
 	Do not print 0 results in the --stat output.
+
+include::cpu-and-latency-overheads.txt[]

 include::callchain-overhead-calculation.txt[]

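The `--parallelism` text in the perf-report.txt hunks describes a filter that accepts a single number, a comma-separated list, and ranges. As a rough illustration of that syntax, the sketch below expands a filter string into the set of accepted levels. It is not perf's actual parser (which is C code not shown in this diff): `parse_parallelism` is a hypothetical name, and treating ranges as inclusive on both ends is an assumption.

```python
def parse_parallelism(spec):
    """Expand a filter such as "1,64-128" into a set of parallelism levels.

    Hypothetical helper for illustration; ranges are assumed inclusive.
    """
    levels = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
            if lo > hi:
                raise ValueError(f"empty range: {part!r}")
            levels.update(range(lo, hi + 1))
        else:
            levels.add(int(part))
    return levels


# The three example filters quoted in the documentation text.
for spec in ("1", "7,8", "1,64-128"):
    print(spec, "->", len(parse_parallelism(spec)), "level(s)")
```

A sample would then pass the filter when its parallelism level is a member of the returned set.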

+4

tools/perf/Documentation/tips.txt

···
 To show time in nanoseconds in record/report add --ns
 To compare hot regions in two workloads use perf record -b -o file ... ; perf diff --stream file1 file2
 To compare scalability of two workload samples use perf diff -c ratio file1 file2
+For latency profiling, try: perf record/report --latency
+For parallelism histogram, try: perf report --hierarchy --sort latency,parallelism,comm,symbol
+To analyze particular parallelism levels, try: perf report --latency --parallelism=32-64
+To see how parallelism changes over time, try: perf report -F time,latency,parallelism --time-quantum=1s
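To build intuition for what the parallelism histogram in these tips reports, it can be mimicked offline on toy data. The sketch below is a simplified model, not perf's implementation: `parallelism_histogram` is a hypothetical helper, each sample is assumed to be a `(period, parallelism)` pair, and a sample's wall-clock weight is assumed to be its period divided by the parallelism level at the time of the sample.

```python
from collections import defaultdict


def parallelism_histogram(samples):
    """Return rows of (latency_pct, overhead_pct, parallelism), latency first.

    `samples` is an iterable of (period, parallelism) pairs. Assumption:
    a sample's wall-clock weight is period / parallelism.
    """
    cpu = defaultdict(float)
    lat = defaultdict(float)
    for period, parallelism in samples:
        cpu[parallelism] += period
        lat[parallelism] += period / parallelism
    total_cpu = sum(cpu.values())
    total_lat = sum(lat.values())
    rows = [
        (100.0 * lat[p] / total_lat, 100.0 * cpu[p] / total_cpu, p)
        for p in cpu
    ]
    return sorted(rows, reverse=True)


# Toy data: 9 units of CPU time at parallelism 1, 128 units at parallelism 128.
samples = [(1.0, 1)] * 9 + [(1.0, 128)] * 128
for lat_pct, cpu_pct, level in parallelism_histogram(samples):
    print(f"{lat_pct:6.2f}% {cpu_pct:6.2f}% {level}")
```

On this toy input the sequential phase dominates the latency column while the parallel phase dominates the overhead column, which is the kind of divergence the histogram is meant to expose.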
