Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU

The Unified Coherence Fabric (UCF) contains the last level cache
and the cache coherent interconnect in the Tegra410 SoC. The PMU in
this device can be used to capture events related to accesses to
the last level cache and memory from different sources.

Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Signed-off-by: Will Deacon <will@kernel.org>

Authored by Besar Wicaksono and committed by Will Deacon
f5caf26f d332424d

+193 -1
Documentation/admin-guide/perf/index.rst (+1)

···
    alibaba_pmu
    dwc_pcie_pmu
    nvidia-tegra241-pmu
+   nvidia-tegra410-pmu
    meson-ddr-pmu
    cxl
    ampere_cspmu
Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst (new file, +106)

=====================================================================
NVIDIA Tegra410 SoC Uncore Performance Monitoring Unit (PMU)
=====================================================================

The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance
metrics like memory bandwidth, latency, and utilization:

* Unified Coherence Fabric (UCF)

PMU Driver
----------

The PMU driver describes the available events and the configuration of each PMU
in sysfs. Please see the sections below for the sysfs path of each PMU. Like
other uncore PMU drivers, the driver provides a "cpumask" sysfs attribute to
show the CPU id used to handle the PMU events. There is also an
"associated_cpus" sysfs attribute, which contains the list of CPUs associated
with the PMU instance.

UCF PMU
-------

The Unified Coherence Fabric (UCF) in the NVIDIA Tegra410 SoC serves as a
distributed cache (the last level for CPU memory and CXL memory) and as a cache
coherent interconnect that supports hardware coherence across multiple
coherently caching agents, including:

* CPU clusters
* GPU
* PCIe Ordering Controller Unit (OCU)
* Other IO-coherent requesters

The events and configuration options of this PMU device are described in sysfs;
see /sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>.

Some of the events available in this PMU can be used to measure bandwidth and
utilization:

* slc_access_rd: counts the number of read requests to the SLC.
* slc_access_wr: counts the number of write requests to the SLC.
* slc_bytes_rd: counts the number of bytes transferred by slc_access_rd.
* slc_bytes_wr: counts the number of bytes transferred by slc_access_wr.
* mem_access_rd: counts the number of read requests to local or remote memory.
* mem_access_wr: counts the number of write requests to local or remote memory.
* mem_bytes_rd: counts the number of bytes transferred by mem_access_rd.
* mem_bytes_wr: counts the number of bytes transferred by mem_access_wr.
* cycles: counts the UCF cycles.

The average bandwidth is calculated as::

  AVG_SLC_READ_BANDWIDTH_IN_GBPS = SLC_BYTES_RD / ELAPSED_TIME_IN_NS
  AVG_SLC_WRITE_BANDWIDTH_IN_GBPS = SLC_BYTES_WR / ELAPSED_TIME_IN_NS
  AVG_MEM_READ_BANDWIDTH_IN_GBPS = MEM_BYTES_RD / ELAPSED_TIME_IN_NS
  AVG_MEM_WRITE_BANDWIDTH_IN_GBPS = MEM_BYTES_WR / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

  AVG_SLC_READ_REQUEST_RATE = SLC_ACCESS_RD / CYCLES
  AVG_SLC_WRITE_REQUEST_RATE = SLC_ACCESS_WR / CYCLES
  AVG_MEM_READ_REQUEST_RATE = MEM_ACCESS_RD / CYCLES
  AVG_MEM_WRITE_REQUEST_RATE = MEM_ACCESS_WR / CYCLES

More details about the other available events can be found in the Tegra410 SoC
technical reference manual.

The events can be filtered by source or destination. The source filter
indicates the traffic initiator to the SLC, e.g. the local CPU, a local
non-CPU device, or a remote socket. The destination filter specifies the
destination memory type, e.g. local system memory (CMEM), local GPU memory
(GMEM), or remote memory. The local/remote classification of the destination
filter is based on the home socket of the address, not on where the data
actually resides. The available filters are described in
/sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>/format/.

The list of UCF PMU event filters:

* Source filter:

  * src_loc_cpu: if set, count events from the local CPU
  * src_loc_noncpu: if set, count events from local non-CPU devices
  * src_rem: if set, count events from the CPU, GPU, and PCIe devices of the
    remote socket

* Destination filter:

  * dst_loc_cmem: if set, count events to local system memory (CMEM) addresses
  * dst_loc_gmem: if set, count events to local GPU memory (GMEM) addresses
  * dst_loc_other: if set, count events to local CXL memory addresses
  * dst_rem: if set, count events to CPU, GPU, and CXL memory addresses of the
    remote socket

If no source is specified, the PMU counts events from all sources. If no
destination is specified, the PMU counts events to all destinations.

Example usage:

* Count event id 0x0 in socket 0 from all sources and to all destinations::

    perf stat -a -e nvidia_ucf_pmu_0/event=0x0/

* Count event id 0x0 in socket 0 with source filter = local CPU and destination
  filter = local system memory (CMEM)::

    perf stat -a -e nvidia_ucf_pmu_0/event=0x0,src_loc_cpu=0x1,dst_loc_cmem=0x1/

* Count event id 0x0 in socket 1 with source filter = local non-CPU device and
  destination filter = remote memory::

    perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/
drivers/perf/arm_cspmu/nvidia_cspmu.c (+86 -1)

···
 // SPDX-License-Identifier: GPL-2.0
 /*
- * Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
  *
  */
···
 #define NV_CNVL_PORT_COUNT 4ULL
 #define NV_CNVL_FILTER_ID_MASK GENMASK_ULL(NV_CNVL_PORT_COUNT - 1, 0)
+
+#define NV_UCF_SRC_COUNT 3ULL
+#define NV_UCF_DST_COUNT 4ULL
+#define NV_UCF_FILTER_ID_MASK GENMASK_ULL(11, 0)
+#define NV_UCF_FILTER_SRC GENMASK_ULL(2, 0)
+#define NV_UCF_FILTER_DST GENMASK_ULL(11, 8)
+#define NV_UCF_FILTER_DEFAULT (NV_UCF_FILTER_SRC | NV_UCF_FILTER_DST)

 #define NV_GENERIC_FILTER_ID_MASK GENMASK_ULL(31, 0)
···
 	NULL,
 };

+static struct attribute *ucf_pmu_event_attrs[] = {
+	ARM_CSPMU_EVENT_ATTR(bus_cycles, 0x1D),
+
+	ARM_CSPMU_EVENT_ATTR(slc_allocate, 0xF0),
+	ARM_CSPMU_EVENT_ATTR(slc_wb, 0xF3),
+	ARM_CSPMU_EVENT_ATTR(slc_refill_rd, 0x109),
+	ARM_CSPMU_EVENT_ATTR(slc_refill_wr, 0x10A),
+	ARM_CSPMU_EVENT_ATTR(slc_hit_rd, 0x119),
+
+	ARM_CSPMU_EVENT_ATTR(slc_access_dataless, 0x183),
+	ARM_CSPMU_EVENT_ATTR(slc_access_atomic, 0x184),
+
+	ARM_CSPMU_EVENT_ATTR(slc_access_rd, 0x111),
+	ARM_CSPMU_EVENT_ATTR(slc_access_wr, 0x112),
+	ARM_CSPMU_EVENT_ATTR(slc_bytes_rd, 0x113),
+	ARM_CSPMU_EVENT_ATTR(slc_bytes_wr, 0x114),
+
+	ARM_CSPMU_EVENT_ATTR(mem_access_rd, 0x121),
+	ARM_CSPMU_EVENT_ATTR(mem_access_wr, 0x122),
+	ARM_CSPMU_EVENT_ATTR(mem_bytes_rd, 0x123),
+	ARM_CSPMU_EVENT_ATTR(mem_bytes_wr, 0x124),
+
+	ARM_CSPMU_EVENT_ATTR(local_snoop, 0x180),
+	ARM_CSPMU_EVENT_ATTR(ext_snp_access, 0x181),
+	ARM_CSPMU_EVENT_ATTR(ext_snp_evict, 0x182),
+
+	ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
+	NULL
+};
+
 static struct attribute *generic_pmu_event_attrs[] = {
 	ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
 	NULL,
···
 	ARM_CSPMU_FORMAT_EVENT_ATTR,
 	ARM_CSPMU_FORMAT_ATTR(rem_socket, "config1:0-3"),
 	NULL,
+};
+
+static struct attribute *ucf_pmu_format_attrs[] = {
+	ARM_CSPMU_FORMAT_EVENT_ATTR,
+	ARM_CSPMU_FORMAT_ATTR(src_loc_noncpu, "config1:0"),
+	ARM_CSPMU_FORMAT_ATTR(src_loc_cpu, "config1:1"),
+	ARM_CSPMU_FORMAT_ATTR(src_rem, "config1:2"),
+	ARM_CSPMU_FORMAT_ATTR(dst_loc_cmem, "config1:8"),
+	ARM_CSPMU_FORMAT_ATTR(dst_loc_gmem, "config1:9"),
+	ARM_CSPMU_FORMAT_ATTR(dst_loc_other, "config1:10"),
+	ARM_CSPMU_FORMAT_ATTR(dst_rem, "config1:11"),
+	NULL
 };

 static struct attribute *generic_pmu_format_attrs[] = {
···
 	writel(filter, cspmu->base0 + PMCCFILTR);
 }

+static u32 ucf_pmu_event_filter(const struct perf_event *event)
+{
+	u32 ret, filter, src, dst;
+
+	filter = nv_cspmu_event_filter(event);
+
+	/* Monitor all sources if none is selected. */
+	src = FIELD_GET(NV_UCF_FILTER_SRC, filter);
+	if (src == 0)
+		src = GENMASK_ULL(NV_UCF_SRC_COUNT - 1, 0);
+
+	/* Monitor all destinations if none is selected. */
+	dst = FIELD_GET(NV_UCF_FILTER_DST, filter);
+	if (dst == 0)
+		dst = GENMASK_ULL(NV_UCF_DST_COUNT - 1, 0);
+
+	ret = FIELD_PREP(NV_UCF_FILTER_SRC, src);
+	ret |= FIELD_PREP(NV_UCF_FILTER_DST, dst);
+
+	return ret;
+}

 enum nv_cspmu_name_fmt {
 	NAME_FMT_GENERIC,
···
 		.get_filter2 = NULL,
 		.data = NULL,
 		.init_data = NULL
+		},
+	},
+	{
+		.prodid = 0x2CF20000,
+		.prodid_mask = NV_PRODID_MASK,
+		.name_pattern = "nvidia_ucf_pmu_%u",
+		.name_fmt = NAME_FMT_SOCKET,
+		.template_ctx = {
+			.event_attr = ucf_pmu_event_attrs,
+			.format_attr = ucf_pmu_format_attrs,
+			.filter_mask = NV_UCF_FILTER_ID_MASK,
+			.filter_default_val = NV_UCF_FILTER_DEFAULT,
+			.filter2_mask = 0x0,
+			.filter2_default_val = 0x0,
+			.get_filter = ucf_pmu_event_filter,
 		},
 	},
 	{