Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU

The Unified Coherence Fabric (UCF) contains the last level cache
and the cache coherent interconnect in the Tegra410 SoC. The PMU in
this device can be used to capture events related to accesses to
the last level cache and memory from different sources.

Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Signed-off-by: Will Deacon <will@kernel.org>

Authored by Besar Wicaksono and committed by Will Deacon
f5caf26f d332424d

+193 -1
Documentation/admin-guide/perf/index.rst (+1)

···
    alibaba_pmu
    dwc_pcie_pmu
    nvidia-tegra241-pmu
+   nvidia-tegra410-pmu
    meson-ddr-pmu
    cxl
    ampere_cspmu
Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst (new file, +106)

=====================================================================
NVIDIA Tegra410 SoC Uncore Performance Monitoring Unit (PMU)
=====================================================================

The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance
metrics like memory bandwidth, latency, and utilization:

* Unified Coherence Fabric (UCF)

PMU Driver
----------

The PMU driver describes the available events and the configuration of each PMU
in sysfs. Please see the sections below for the sysfs path of each PMU. Like
other uncore PMU drivers, the driver provides a "cpumask" sysfs attribute to
show the CPU id used to handle the PMU events. There is also an
"associated_cpus" sysfs attribute, which contains the list of CPUs associated
with the PMU instance.

UCF PMU
-------

The Unified Coherence Fabric (UCF) in the NVIDIA Tegra410 SoC serves as a
distributed cache (the last level for CPU memory and CXL memory) and as a cache
coherent interconnect that supports hardware coherence across multiple
coherently caching agents, including:

* CPU clusters
* GPU
* PCIe Ordering Controller Unit (OCU)
* Other IO-coherent requesters

The events and configuration options of this PMU device are described in sysfs;
see /sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>.

Some of the events available in this PMU can be used to measure bandwidth and
utilization:

* slc_access_rd: counts the number of read requests to the SLC.
* slc_access_wr: counts the number of write requests to the SLC.
* slc_bytes_rd: counts the number of bytes transferred by slc_access_rd.
* slc_bytes_wr: counts the number of bytes transferred by slc_access_wr.
* mem_access_rd: counts the number of read requests to local or remote memory.
* mem_access_wr: counts the number of write requests to local or remote memory.
* mem_bytes_rd: counts the number of bytes transferred by mem_access_rd.
* mem_bytes_wr: counts the number of bytes transferred by mem_access_wr.
* cycles: counts the UCF cycles.

The average bandwidth is calculated as::

  AVG_SLC_READ_BANDWIDTH_IN_GBPS = SLC_BYTES_RD / ELAPSED_TIME_IN_NS
  AVG_SLC_WRITE_BANDWIDTH_IN_GBPS = SLC_BYTES_WR / ELAPSED_TIME_IN_NS
  AVG_MEM_READ_BANDWIDTH_IN_GBPS = MEM_BYTES_RD / ELAPSED_TIME_IN_NS
  AVG_MEM_WRITE_BANDWIDTH_IN_GBPS = MEM_BYTES_WR / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

  AVG_SLC_READ_REQUEST_RATE = SLC_ACCESS_RD / CYCLES
  AVG_SLC_WRITE_REQUEST_RATE = SLC_ACCESS_WR / CYCLES
  AVG_MEM_READ_REQUEST_RATE = MEM_ACCESS_RD / CYCLES
  AVG_MEM_WRITE_REQUEST_RATE = MEM_ACCESS_WR / CYCLES

More details about the other available events can be found in the Tegra410 SoC
technical reference manual.

The events can be filtered by source or destination. The source filter
indicates the traffic initiator to the SLC, e.g. the local CPU, a local
non-CPU device, or a remote socket. The destination filter specifies the
destination memory type, e.g. local system memory (CMEM), local GPU memory
(GMEM), or remote memory. The local/remote classification of the destination
filter is based on the home socket of the address, not on where the data
actually resides. The available filters are described in
/sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>/format/.

The list of UCF PMU event filters:

* Source filter:

  * src_loc_cpu: if set, count events from the local CPU
  * src_loc_noncpu: if set, count events from local non-CPU devices
  * src_rem: if set, count events from the CPU, GPU, and PCIe devices of the
    remote socket

* Destination filter:

  * dst_loc_cmem: if set, count events to local system memory (CMEM) addresses
  * dst_loc_gmem: if set, count events to local GPU memory (GMEM) addresses
  * dst_loc_other: if set, count events to local CXL memory addresses
  * dst_rem: if set, count events to CPU, GPU, and CXL memory addresses of the
    remote socket

If no source is specified, the PMU counts events from all sources. If no
destination is specified, the PMU counts events to all destinations.

Example usage:

* Count event id 0x0 in socket 0 from all sources and to all destinations::

    perf stat -a -e nvidia_ucf_pmu_0/event=0x0/

* Count event id 0x0 in socket 0 with source filter = local CPU and destination
  filter = local system memory (CMEM)::

    perf stat -a -e nvidia_ucf_pmu_0/event=0x0,src_loc_cpu=0x1,dst_loc_cmem=0x1/

* Count event id 0x0 in socket 1 with source filter = local non-CPU device and
  destination filter = remote memory::

    perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/
drivers/perf/arm_cspmu/nvidia_cspmu.c (+86 -1)

···
 // SPDX-License-Identifier: GPL-2.0
 /*
- * Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
  *
  */
···
 #define NV_CNVL_PORT_COUNT 4ULL
 #define NV_CNVL_FILTER_ID_MASK GENMASK_ULL(NV_CNVL_PORT_COUNT - 1, 0)
+
+#define NV_UCF_SRC_COUNT 3ULL
+#define NV_UCF_DST_COUNT 4ULL
+#define NV_UCF_FILTER_ID_MASK GENMASK_ULL(11, 0)
+#define NV_UCF_FILTER_SRC GENMASK_ULL(2, 0)
+#define NV_UCF_FILTER_DST GENMASK_ULL(11, 8)
+#define NV_UCF_FILTER_DEFAULT (NV_UCF_FILTER_SRC | NV_UCF_FILTER_DST)

 #define NV_GENERIC_FILTER_ID_MASK GENMASK_ULL(31, 0)
···
 	NULL,
 };

+static struct attribute *ucf_pmu_event_attrs[] = {
+	ARM_CSPMU_EVENT_ATTR(bus_cycles, 0x1D),
+
+	ARM_CSPMU_EVENT_ATTR(slc_allocate, 0xF0),
+	ARM_CSPMU_EVENT_ATTR(slc_wb, 0xF3),
+	ARM_CSPMU_EVENT_ATTR(slc_refill_rd, 0x109),
+	ARM_CSPMU_EVENT_ATTR(slc_refill_wr, 0x10A),
+	ARM_CSPMU_EVENT_ATTR(slc_hit_rd, 0x119),
+
+	ARM_CSPMU_EVENT_ATTR(slc_access_dataless, 0x183),
+	ARM_CSPMU_EVENT_ATTR(slc_access_atomic, 0x184),
+
+	ARM_CSPMU_EVENT_ATTR(slc_access_rd, 0x111),
+	ARM_CSPMU_EVENT_ATTR(slc_access_wr, 0x112),
+	ARM_CSPMU_EVENT_ATTR(slc_bytes_rd, 0x113),
+	ARM_CSPMU_EVENT_ATTR(slc_bytes_wr, 0x114),
+
+	ARM_CSPMU_EVENT_ATTR(mem_access_rd, 0x121),
+	ARM_CSPMU_EVENT_ATTR(mem_access_wr, 0x122),
+	ARM_CSPMU_EVENT_ATTR(mem_bytes_rd, 0x123),
+	ARM_CSPMU_EVENT_ATTR(mem_bytes_wr, 0x124),
+
+	ARM_CSPMU_EVENT_ATTR(local_snoop, 0x180),
+	ARM_CSPMU_EVENT_ATTR(ext_snp_access, 0x181),
+	ARM_CSPMU_EVENT_ATTR(ext_snp_evict, 0x182),
+
+	ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
+	NULL
+};
+
 static struct attribute *generic_pmu_event_attrs[] = {
 	ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
 	NULL,
···
 	ARM_CSPMU_FORMAT_EVENT_ATTR,
 	ARM_CSPMU_FORMAT_ATTR(rem_socket, "config1:0-3"),
 	NULL,
+};
+
+static struct attribute *ucf_pmu_format_attrs[] = {
+	ARM_CSPMU_FORMAT_EVENT_ATTR,
+	ARM_CSPMU_FORMAT_ATTR(src_loc_noncpu, "config1:0"),
+	ARM_CSPMU_FORMAT_ATTR(src_loc_cpu, "config1:1"),
+	ARM_CSPMU_FORMAT_ATTR(src_rem, "config1:2"),
+	ARM_CSPMU_FORMAT_ATTR(dst_loc_cmem, "config1:8"),
+	ARM_CSPMU_FORMAT_ATTR(dst_loc_gmem, "config1:9"),
+	ARM_CSPMU_FORMAT_ATTR(dst_loc_other, "config1:10"),
+	ARM_CSPMU_FORMAT_ATTR(dst_rem, "config1:11"),
+	NULL
 };

 static struct attribute *generic_pmu_format_attrs[] = {
···
 	writel(filter, cspmu->base0 + PMCCFILTR);
 }

+static u32 ucf_pmu_event_filter(const struct perf_event *event)
+{
+	u32 ret, filter, src, dst;
+
+	filter = nv_cspmu_event_filter(event);
+
+	/* Monitor all sources if none is selected. */
+	src = FIELD_GET(NV_UCF_FILTER_SRC, filter);
+	if (src == 0)
+		src = GENMASK_ULL(NV_UCF_SRC_COUNT - 1, 0);
+
+	/* Monitor all destinations if none is selected. */
+	dst = FIELD_GET(NV_UCF_FILTER_DST, filter);
+	if (dst == 0)
+		dst = GENMASK_ULL(NV_UCF_DST_COUNT - 1, 0);
+
+	ret = FIELD_PREP(NV_UCF_FILTER_SRC, src);
+	ret |= FIELD_PREP(NV_UCF_FILTER_DST, dst);
+
+	return ret;
+}

 enum nv_cspmu_name_fmt {
 	NAME_FMT_GENERIC,
···
 		.get_filter2 = NULL,
 		.data = NULL,
 		.init_data = NULL
+		},
+	},
+	{
+		.prodid = 0x2CF20000,
+		.prodid_mask = NV_PRODID_MASK,
+		.name_pattern = "nvidia_ucf_pmu_%u",
+		.name_fmt = NAME_FMT_SOCKET,
+		.template_ctx = {
+			.event_attr = ucf_pmu_event_attrs,
+			.format_attr = ucf_pmu_format_attrs,
+			.filter_mask = NV_UCF_FILTER_ID_MASK,
+			.filter_default_val = NV_UCF_FILTER_DEFAULT,
+			.filter2_mask = 0x0,
+			.filter2_default_val = 0x0,
+			.get_filter = ucf_pmu_event_filter,
 		},
 	},
 	{