Merge tag 'dma-mapping-6.10-2024-05-20' of git://git.infradead.org/users/hch/dma-mapping

+1

Documentation/core-api/index.rst

··· 102 102 dma-api-howto 103 103 dma-attributes 104 104 dma-isa-lpc 105 + swiotlb 105 106 mm-api 106 107 genalloc 107 108 pin_user_pages

+321

Documentation/core-api/swiotlb.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + =============== 4 + DMA and swiotlb 5 + =============== 6 + 7 + swiotlb is a memory buffer allocator used by the Linux kernel DMA layer. It is 8 + typically used when a device doing DMA can't directly access the target memory 9 + buffer because of hardware limitations or other requirements. In such a case, 10 + the DMA layer calls swiotlb to allocate a temporary memory buffer that conforms 11 + to the limitations. The DMA is done to/from this temporary memory buffer, and 12 + the CPU copies the data between the temporary buffer and the original target 13 + memory buffer. This approach is generically called "bounce buffering", and the 14 + temporary memory buffer is called a "bounce buffer". 15 + 16 + Device drivers don't interact directly with swiotlb. Instead, drivers inform 17 + the DMA layer of the DMA attributes of the devices they are managing, and use 18 + the normal DMA map, unmap, and sync APIs when programming a device to do DMA. 19 + These APIs use the device DMA attributes and kernel-wide settings to determine 20 + if bounce buffering is necessary. If so, the DMA layer manages the allocation, 21 + freeing, and sync'ing of bounce buffers. Since the DMA attributes are per 22 + device, some devices in a system may use bounce buffering while others do not. 23 + 24 + Because the CPU copies data between the bounce buffer and the original target 25 + memory buffer, doing bounce buffering is slower than doing DMA directly to the 26 + original memory buffer, and it consumes more CPU resources. So it is used only 27 + when necessary for providing DMA functionality. 28 + 29 + Usage Scenarios 30 + --------------- 31 + swiotlb was originally created to handle DMA for devices with addressing 32 + limitations. As physical memory sizes grew beyond 4 GiB, some devices could 33 + only provide 32-bit DMA addresses. 
By allocating bounce buffer memory below 34 + the 4 GiB line, these devices with addressing limitations could still work and 35 + do DMA. 36 + 37 + More recently, Confidential Computing (CoCo) VMs have the guest VM's memory 38 + encrypted by default, and the memory is not accessible by the host hypervisor 39 + and VMM. For the host to do I/O on behalf of the guest, the I/O must be 40 + directed to guest memory that is unencrypted. CoCo VMs set a kernel-wide option 41 + to force all DMA I/O to use bounce buffers, and the bounce buffer memory is set 42 + up as unencrypted. The host does DMA I/O to/from the bounce buffer memory, and 43 + the Linux kernel DMA layer does "sync" operations to cause the CPU to copy the 44 + data to/from the original target memory buffer. The CPU copying bridges between 45 + the unencrypted and the encrypted memory. This use of bounce buffers allows 46 + device drivers to "just work" in a CoCo VM, with no modifications 47 + needed to handle the memory encryption complexity. 48 + 49 + Other edge case scenarios arise for bounce buffers. For example, when IOMMU 50 + mappings are set up for a DMA operation to/from a device that is considered 51 + "untrusted", the device should be given access only to the memory containing 52 + the data being transferred. But if that memory occupies only part of an IOMMU 53 + granule, other parts of the granule may contain unrelated kernel data. Since 54 + IOMMU access control is per-granule, the untrusted device can gain access to 55 + the unrelated kernel data. This problem is solved by bounce buffering the DMA 56 + operation and ensuring that unused portions of the bounce buffers do not 57 + contain any unrelated kernel data. 58 + 59 + Core Functionality 60 + ------------------ 61 + The primary swiotlb APIs are swiotlb_tbl_map_single() and 62 + swiotlb_tbl_unmap_single(). The "map" API allocates a bounce buffer of a 63 + specified size in bytes and returns the physical address of the buffer. 
The 64 + buffer memory is physically contiguous. The expectation is that the DMA layer 65 + maps the physical memory address to a DMA address, and returns the DMA address 66 + to the driver for programming into the device. If a DMA operation specifies 67 + multiple memory buffer segments, a separate bounce buffer must be allocated for 68 + each segment. swiotlb_tbl_map_single() always does a "sync" operation (i.e., a 69 + CPU copy) to initialize the bounce buffer to match the contents of the original 70 + buffer. 71 + 72 + swiotlb_tbl_unmap_single() does the reverse. If the DMA operation might have 73 + updated the bounce buffer memory and DMA_ATTR_SKIP_CPU_SYNC is not set, the 74 + unmap does a "sync" operation to cause a CPU copy of the data from the bounce 75 + buffer back to the original buffer. Then the bounce buffer memory is freed. 76 + 77 + swiotlb also provides "sync" APIs that correspond to the dma_sync_*() APIs that 78 + a driver may use when control of a buffer transitions between the CPU and the 79 + device. The swiotlb "sync" APIs cause a CPU copy of the data between the 80 + original buffer and the bounce buffer. Like the dma_sync_*() APIs, the swiotlb 81 + "sync" APIs support doing a partial sync, where only a subset of the bounce 82 + buffer is copied to/from the original buffer. 83 + 84 + Core Functionality Constraints 85 + ------------------------------ 86 + The swiotlb map/unmap/sync APIs must operate without blocking, as they are 87 + called by the corresponding DMA APIs which may run in contexts that cannot 88 + block. Hence the default memory pool for swiotlb allocations must be 89 + pre-allocated at boot time (but see Dynamic swiotlb below). Because swiotlb 90 + allocations must be physically contiguous, the entire default memory pool is 91 + allocated as a single contiguous block. 92 + 93 + The need to pre-allocate the default swiotlb pool creates a boot-time tradeoff. 
94 + The pool should be large enough to ensure that bounce buffer requests can 95 + always be satisfied, as the non-blocking requirement means requests can't wait 96 + for space to become available. But a large pool potentially wastes memory, as 97 + this pre-allocated memory is not available for other uses in the system. The 98 + tradeoff is particularly acute in CoCo VMs that use bounce buffers for all DMA 99 + I/O. These VMs use a heuristic to set the default pool size to ~6% of memory, 100 + with a max of 1 GiB, which has the potential to be very wasteful of memory. 101 + Conversely, the heuristic might produce a size that is insufficient, depending 102 + on the I/O patterns of the workload in the VM. The dynamic swiotlb feature 103 + described below can help, but has limitations. Better management of the swiotlb 104 + default memory pool size remains an open issue. 105 + 106 + A single allocation from swiotlb is limited to IO_TLB_SIZE * IO_TLB_SEGSIZE 107 + bytes, which is 256 KiB with current definitions. When a device's DMA settings 108 + are such that the device might use swiotlb, the maximum size of a DMA segment 109 + must be limited to that 256 KiB. This value is communicated to higher-level 110 + kernel code via dma_max_mapping_size() and swiotlb_max_mapping_size(). If the 111 + higher-level code fails to account for this limit, it may make requests that 112 + are too large for swiotlb, and get a "swiotlb full" error. 113 + 114 + A key device DMA setting is "min_align_mask", which is a power of 2 minus 1 115 + so that some number of low order bits are set, or it may be zero. swiotlb 116 + allocations ensure these min_align_mask bits of the physical address of the 117 + bounce buffer match the same bits in the address of the original buffer. When 118 + min_align_mask is non-zero, it may produce an "alignment offset" in the address 119 + of the bounce buffer that slightly reduces the maximum size of an allocation.
120 + This potential alignment offset is reflected in the value returned by 121 + swiotlb_max_mapping_size(), which can show up in places like 122 + /sys/block/<device>/queue/max_sectors_kb. For example, if a device does not use 123 + swiotlb, max_sectors_kb might be 512 KiB or larger. If a device might use 124 + swiotlb, max_sectors_kb will be 256 KiB. When min_align_mask is non-zero, 125 + max_sectors_kb might be even smaller, such as 252 KiB. 126 + 127 + swiotlb_tbl_map_single() also takes an "alloc_align_mask" parameter. This 128 + parameter specifies the allocation of bounce buffer space must start at a 129 + physical address with the alloc_align_mask bits set to zero. But the actual 130 + bounce buffer might start at a larger address if min_align_mask is non-zero. 131 + Hence there may be pre-padding space that is allocated prior to the start of 132 + the bounce buffer. Similarly, the end of the bounce buffer is rounded up to an 133 + alloc_align_mask boundary, potentially resulting in post-padding space. Any 134 + pre-padding or post-padding space is not initialized by swiotlb code. The 135 + "alloc_align_mask" parameter is used by IOMMU code when mapping for untrusted 136 + devices. It is set to the granule size - 1 so that the bounce buffer is 137 + allocated entirely from granules that are not used for any other purpose. 138 + 139 + Data structures concepts 140 + ------------------------ 141 + Memory used for swiotlb bounce buffers is allocated from overall system memory 142 + as one or more "pools". The default pool is allocated during system boot with a 143 + default size of 64 MiB. The default pool size may be modified with the 144 + "swiotlb=" kernel boot line parameter. The default size may also be adjusted 145 + due to other conditions, such as running in a CoCo VM, as described above. If 146 + CONFIG_SWIOTLB_DYNAMIC is enabled, additional pools may be allocated later in 147 + the life of the system. 
Each pool must be a contiguous range of physical 148 + memory. The default pool is allocated below the 4 GiB physical address line so 149 + it works for devices that can only address 32-bits of physical memory (unless 150 + architecture-specific code provides the SWIOTLB_ANY flag). In a CoCo VM, the 151 + pool memory must be decrypted before swiotlb is used. 152 + 153 + Each pool is divided into "slots" of size IO_TLB_SIZE, which is 2 KiB with 154 + current definitions. IO_TLB_SEGSIZE contiguous slots (128 slots) constitute 155 + what might be called a "slot set". When a bounce buffer is allocated, it 156 + occupies one or more contiguous slots. A slot is never shared by multiple 157 + bounce buffers. Furthermore, a bounce buffer must be allocated from a single 158 + slot set, which leads to the maximum bounce buffer size being IO_TLB_SIZE * 159 + IO_TLB_SEGSIZE. Multiple smaller bounce buffers may co-exist in a single slot 160 + set if the alignment and size constraints can be met. 161 + 162 + Slots are also grouped into "areas", with the constraint that a slot set exists 163 + entirely in a single area. Each area has its own spin lock that must be held to 164 + manipulate the slots in that area. The division into areas avoids contending 165 + for a single global spin lock when swiotlb is heavily used, such as in a CoCo 166 + VM. The number of areas defaults to the number of CPUs in the system for 167 + maximum parallelism, but since an area can't be smaller than IO_TLB_SEGSIZE 168 + slots, it might be necessary to assign multiple CPUs to the same area. The 169 + number of areas can also be set via the "swiotlb=" kernel boot parameter. 170 + 171 + When allocating a bounce buffer, if the area associated with the calling CPU 172 + does not have enough free space, areas associated with other CPUs are tried 173 + sequentially. 
For each area tried, the area's spin lock must be obtained before 174 + trying an allocation, so contention may occur if swiotlb is relatively busy 175 + overall. But an allocation request does not fail unless all areas do not have 176 + enough free space. 177 + 178 + IO_TLB_SIZE, IO_TLB_SEGSIZE, and the number of areas must all be powers of 2 as 179 + the code uses shifting and bit masking to do many of the calculations. The 180 + number of areas is rounded up to a power of 2 if necessary to meet this 181 + requirement. 182 + 183 + The default pool is allocated with PAGE_SIZE alignment. If an alloc_align_mask 184 + argument to swiotlb_tbl_map_single() specifies a larger alignment, one or more 185 + initial slots in each slot set might not meet the alloc_align_mask criterion. 186 + Because a bounce buffer allocation can't cross a slot set boundary, eliminating 187 + those initial slots effectively reduces the max size of a bounce buffer. 188 + Currently, there's no problem because alloc_align_mask is set based on IOMMU 189 + granule size, and granules cannot be larger than PAGE_SIZE. But if that were to 190 + change in the future, the initial pool allocation might need to be done with 191 + alignment larger than PAGE_SIZE. 192 + 193 + Dynamic swiotlb 194 + --------------- 195 + When CONFIG_SWIOTLB_DYNAMIC is enabled, swiotlb can do on-demand expansion of 196 + the amount of memory available for allocation as bounce buffers. If a bounce 197 + buffer request fails due to lack of available space, an asynchronous background 198 + task is kicked off to allocate memory from general system memory and turn it 199 + into an swiotlb pool. Creating an additional pool must be done asynchronously 200 + because the memory allocation may block, and as noted above, swiotlb requests 201 + are not allowed to block. Once the background task is kicked off, the bounce 202 + buffer request creates a "transient pool" to avoid returning an "swiotlb full" 203 + error.
A transient pool has the size of the bounce buffer request, and is 204 + deleted when the bounce buffer is freed. Memory for this transient pool comes 205 + from the general system memory atomic pool so that creation does not block. 206 + Creating a transient pool has relatively high cost, particularly in a CoCo VM 207 + where the memory must be decrypted, so it is done only as a stopgap until the 208 + background task can add another non-transient pool. 209 + 210 + Adding a dynamic pool has limitations. Like with the default pool, the memory 211 + must be physically contiguous, so the size is limited to MAX_PAGE_ORDER pages 212 + (e.g., 4 MiB on a typical x86 system). Due to memory fragmentation, a max size 213 + allocation may not be available. The dynamic pool allocator tries smaller sizes 214 + until it succeeds, but with a minimum size of 1 MiB. Given sufficient system 215 + memory fragmentation, dynamically adding a pool might not succeed at all. 216 + 217 + The number of areas in a dynamic pool may be different from the number of areas 218 + in the default pool. Because the new pool size is typically a few MiB at most, 219 + the number of areas will likely be smaller. For example, with a new pool size 220 + of 4 MiB and the 256 KiB minimum area size, only 16 areas can be created. If 221 + the system has more than 16 CPUs, multiple CPUs must share an area, creating 222 + more lock contention. 223 + 224 + New pools added via dynamic swiotlb are linked together in a linear list. 225 + swiotlb code frequently must search for the pool containing a particular 226 + swiotlb physical address, so that search is linear and not performant with a 227 + large number of dynamic pools. The data structures could be improved for 228 + faster searches. 229 + 230 + Overall, dynamic swiotlb works best for small configurations with relatively 231 + few CPUs. 
It allows the default swiotlb pool to be smaller so that memory is 232 + not wasted, with dynamic pools making more space available if needed (as long 233 + as fragmentation isn't an obstacle). It is less useful for large CoCo VMs. 234 + 235 + Data Structure Details 236 + ---------------------- 237 + swiotlb is managed with four primary data structures: io_tlb_mem, io_tlb_pool, 238 + io_tlb_area, and io_tlb_slot. io_tlb_mem describes a swiotlb memory allocator, 239 + which includes the default memory pool and any dynamic or transient pools 240 + linked to it. Limited statistics on swiotlb usage are kept per memory allocator 241 + and are stored in this data structure. These statistics are available under 242 + /sys/kernel/debug/swiotlb when CONFIG_DEBUG_FS is set. 243 + 244 + io_tlb_pool describes a memory pool, either the default pool, a dynamic pool, 245 + or a transient pool. The description includes the start and end addresses of 246 + the memory in the pool, a pointer to an array of io_tlb_area structures, and a 247 + pointer to an array of io_tlb_slot structures that are associated with the pool. 248 + 249 + io_tlb_area describes an area. The primary field is the spin lock used to 250 + serialize access to slots in the area. The io_tlb_area array for a pool has an 251 + entry for each area, and is accessed using a 0-based area index derived from the 252 + calling processor ID. Areas exist solely to allow parallel access to swiotlb 253 + from multiple CPUs. 254 + 255 + io_tlb_slot describes an individual memory slot in the pool, with size 256 + IO_TLB_SIZE (2 KiB currently). The io_tlb_slot array is indexed by the slot 257 + index computed from the bounce buffer address relative to the starting memory 258 + address of the pool. The size of struct io_tlb_slot is 24 bytes, so the 259 + overhead is about 1% of the slot size. 260 + 261 + The io_tlb_slot array is designed to meet several requirements. 
First, the DMA 262 + APIs and the corresponding swiotlb APIs use the bounce buffer address as the 263 + identifier for a bounce buffer. This address is returned by 264 + swiotlb_tbl_map_single(), and then passed as an argument to 265 + swiotlb_tbl_unmap_single() and the swiotlb_sync_*() functions. The original 266 + memory buffer address obviously must be passed as an argument to 267 + swiotlb_tbl_map_single(), but it is not passed to the other APIs. Consequently, 268 + swiotlb data structures must save the original memory buffer address so that it 269 + can be used when doing sync operations. This original address is saved in the 270 + io_tlb_slot array. 271 + 272 + Second, the io_tlb_slot array must handle partial sync requests. In such cases, 273 + the argument to swiotlb_sync_*() is not the address of the start of the bounce 274 + buffer but an address somewhere in the middle of the bounce buffer, and the 275 + address of the start of the bounce buffer isn't known to swiotlb code. But 276 + swiotlb code must be able to calculate the corresponding original memory buffer 277 + address to do the CPU copy dictated by the "sync". So an adjusted original 278 + memory buffer address is populated into the struct io_tlb_slot for each slot 279 + occupied by the bounce buffer. An adjusted "alloc_size" of the bounce buffer is 280 + also recorded in each struct io_tlb_slot so a sanity check can be performed on 281 + the size of the "sync" operation. The "alloc_size" field is not used except for 282 + the sanity check. 283 + 284 + Third, the io_tlb_slot array is used to track available slots. The "list" field 285 + in struct io_tlb_slot records how many contiguous available slots exist starting 286 + at that slot. A "0" indicates that the slot is occupied. A value of "1" 287 + indicates only the current slot is available. A value of "2" indicates the 288 + current slot and the next slot are available, etc. 
The maximum value is 289 + IO_TLB_SEGSIZE, which can appear in the first slot in a slot set, and indicates 290 + that the entire slot set is available. These values are used when searching for 291 + available slots to use for a new bounce buffer. They are updated when allocating 292 + a new bounce buffer and when freeing a bounce buffer. At pool creation time, the 293 + "list" field is initialized to IO_TLB_SEGSIZE down to 1 for the slots in every 294 + slot set. 295 + 296 + Fourth, the io_tlb_slot array keeps track of any "padding slots" allocated to 297 + meet alloc_align_mask requirements described above. When 298 + swiotlb_tbl_map_single() allocates bounce buffer space to meet alloc_align_mask 299 + requirements, it may allocate pre-padding space across zero or more slots. But 300 + when swiotlb_tbl_unmap_single() is called with the bounce buffer address, the 301 + alloc_align_mask value that governed the allocation, and therefore the 302 + allocation of any padding slots, is not known. The "pad_slots" field records 303 + the number of padding slots so that swiotlb_tbl_unmap_single() can free them. 304 + The "pad_slots" value is recorded only in the first non-padding slot allocated 305 + to the bounce buffer. 306 + 307 + Restricted pools 308 + ---------------- 309 + The swiotlb machinery is also used for "restricted pools", which are pools of 310 + memory separate from the default swiotlb pool, and that are dedicated for DMA 311 + use by a particular device. Restricted pools provide a level of DMA memory 312 + protection on systems with limited hardware protection capabilities, such as 313 + those lacking an IOMMU. Such usage is specified by DeviceTree entries and 314 + requires that CONFIG_DMA_RESTRICTED_POOL is set. Each restricted pool is based 315 + on its own io_tlb_mem data structure that is independent of the main swiotlb 316 + io_tlb_mem.
317 + 318 + Restricted pools add swiotlb_alloc() and swiotlb_free() APIs, which are called 319 + from the dma_alloc_*() and dma_free_*() APIs. The swiotlb_alloc/free() APIs 320 + allocate/free slots from/to the restricted pool directly and do not go through 321 + swiotlb_tbl_map/unmap_single().
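The slot arithmetic in swiotlb.rst can be checked with a small userspace model. This is a sketch, not the kernel implementation: the `model_*` names are hypothetical, the constants are the "current definitions" quoted in the text (2 KiB slots, 128 slots per slot set), and the rule that a non-zero min_align_mask costs the mask rounded up to whole slots is an assumption chosen because it reproduces the 256 KiB and 252 KiB figures given for max_sectors_kb.

```c
#include <assert.h>
#include <stddef.h>

/* Values quoted in the documentation text: 2 KiB slots, 128 per slot set. */
#define IO_TLB_SHIFT   11
#define IO_TLB_SIZE    ((size_t)1 << IO_TLB_SHIFT)
#define IO_TLB_SEGSIZE 128

/* Round x up to a multiple of align; align must be a power of 2. */
static size_t round_up_pow2(size_t x, size_t align)
{
	assert(align && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/*
 * Largest single bounce buffer. ASSUMPTION (not taken from the text):
 * a non-zero min_align_mask reduces the limit by the mask rounded up
 * to a whole number of slots.
 */
static size_t model_max_mapping_size(size_t min_align_mask)
{
	size_t max = IO_TLB_SIZE * IO_TLB_SEGSIZE;

	if (min_align_mask)
		max -= round_up_pow2(min_align_mask, IO_TLB_SIZE);
	return max;
}

/*
 * Initial "list" value of a slot: the number of contiguous free slots
 * from this slot to the end of its slot set (IO_TLB_SEGSIZE down to 1).
 */
static unsigned int model_initial_list(size_t slot_index)
{
	return IO_TLB_SEGSIZE - (unsigned int)(slot_index % IO_TLB_SEGSIZE);
}

/* Index into the io_tlb_slot array for an address inside a pool. */
static size_t model_slot_index(size_t pool_start, size_t addr)
{
	return (addr - pool_start) >> IO_TLB_SHIFT;
}
```

With a min_align_mask of 0xFFF (a 4 KiB alignment requirement, used here only as an illustrative value) the model gives 252 KiB, matching the example in the text.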

+19 -15

drivers/iommu/dma-iommu.c

··· 1152 1152 */ 1153 1153 if (dev_use_swiotlb(dev, size, dir) && 1154 1154 iova_offset(iovad, phys | size)) { 1155 - void *padding_start; 1156 - size_t padding_size, aligned_size; 1157 - 1158 1155 if (!is_swiotlb_active(dev)) { 1159 1156 dev_warn_once(dev, "DMA bounce buffers are inactive, unable to map unaligned transaction.\n"); 1160 1157 return DMA_MAPPING_ERROR; ··· 1159 1162 1160 1163 trace_swiotlb_bounced(dev, phys, size); 1161 1164 1162 - aligned_size = iova_align(iovad, size); 1163 - phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size, 1165 + phys = swiotlb_tbl_map_single(dev, phys, size, 1164 1166 iova_mask(iovad), dir, attrs); 1165 1167 1166 1168 if (phys == DMA_MAPPING_ERROR) 1167 1169 return DMA_MAPPING_ERROR; 1168 1170 1169 - /* Cleanup the padding area. */ 1170 - padding_start = phys_to_virt(phys); 1171 - padding_size = aligned_size; 1171 + /* 1172 + * Untrusted devices should not see padding areas with random 1173 + * leftover kernel data, so zero the pre- and post-padding. 1174 + * swiotlb_tbl_map_single() has initialized the bounce buffer 1175 + * proper to the contents of the original memory buffer. 
1176 + */ 1177 + if (dev_is_untrusted(dev)) { 1178 + size_t start, virt = (size_t)phys_to_virt(phys); 1172 1179 1173 - if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) && 1174 - (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)) { 1175 - padding_start += size; 1176 - padding_size -= size; 1180 + /* Pre-padding */ 1181 + start = iova_align_down(iovad, virt); 1182 + memset((void *)start, 0, virt - start); 1183 + 1184 + /* Post-padding */ 1185 + start = virt + size; 1186 + memset((void *)start, 0, 1187 + iova_align(iovad, start) - start); 1177 1188 } 1178 - 1179 - memset(padding_start, 0, padding_size); 1180 1189 } 1181 1190 1182 1191 if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) ··· 1721 1718 } 1722 1719 1723 1720 static const struct dma_map_ops iommu_dma_ops = { 1724 - .flags = DMA_F_PCI_P2PDMA_SUPPORTED, 1721 + .flags = DMA_F_PCI_P2PDMA_SUPPORTED | 1722 + DMA_F_CAN_SKIP_SYNC, 1725 1723 .alloc = iommu_dma_alloc, 1726 1724 .free = iommu_dma_free, 1727 1725 .alloc_pages_op = dma_common_alloc_pages,
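The pre- and post-padding computation in the hunk above can be tried out in userspace. The following is a minimal sketch under stated assumptions: `zero_padding()` and the two alignment helpers are hypothetical stand-ins for the kernel's `iova_align_down()` and `iova_align()`, offsets into a plain byte array replace real virtual addresses, and the array is assumed to start on a granule boundary and to extend to the end of the last granule touched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers; granule must be a power of 2. */
static size_t align_down(size_t x, size_t granule)
{
	return x & ~(granule - 1);
}

static size_t align_up(size_t x, size_t granule)
{
	return (x + granule - 1) & ~(granule - 1);
}

/*
 * Zero the bytes that share a granule with the bounce buffer
 * buf[offset .. offset + size) but are not part of it. The bounce
 * buffer contents themselves are left untouched.
 */
static void zero_padding(unsigned char *buf, size_t granule,
			 size_t offset, size_t size)
{
	size_t start;

	assert(granule && (granule & (granule - 1)) == 0);

	/* Pre-padding: from the granule boundary up to the buffer start. */
	start = align_down(offset, granule);
	memset(buf + start, 0, offset - start);

	/* Post-padding: from the buffer end up to the next granule boundary. */
	start = offset + size;
	memset(buf + start, 0, align_up(start, granule) - start);
}
```

When the buffer already starts or ends on a granule boundary, the corresponding `memset()` length is zero and nothing is written.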

+1 -1

drivers/net/ethernet/engleder/tsnep_main.c

··· 1587 1587 length = __le32_to_cpu(entry->desc_wb->properties) & 1588 1588 TSNEP_DESC_LENGTH_MASK; 1589 1589 xsk_buff_set_size(entry->xdp, length - ETH_FCS_LEN); 1590 - xsk_buff_dma_sync_for_cpu(entry->xdp, rx->xsk_pool); 1590 + xsk_buff_dma_sync_for_cpu(entry->xdp); 1591 1591 1592 1592 /* RX metadata with timestamps is in front of actual data, 1593 1593 * subtract metadata size to get length of actual data and

+1 -1

drivers/net/ethernet/freescale/dpaa2/dpaa2-xsk.c

··· 55 55 xdp_set_data_meta_invalid(xdp_buff); 56 56 xdp_buff->rxq = &ch->xdp_rxq; 57 57 58 - xsk_buff_dma_sync_for_cpu(xdp_buff, ch->xsk_pool); 58 + xsk_buff_dma_sync_for_cpu(xdp_buff); 59 59 xdp_act = bpf_prog_run_xdp(xdp_prog, xdp_buff); 60 60 61 61 /* xdp.data pointer may have changed */

+1 -1

drivers/net/ethernet/intel/i40e/i40e_xsk.c

··· 482 482 483 483 bi = *i40e_rx_bi(rx_ring, next_to_process); 484 484 xsk_buff_set_size(bi, size); 485 - xsk_buff_dma_sync_for_cpu(bi, rx_ring->xsk_pool); 485 + xsk_buff_dma_sync_for_cpu(bi); 486 486 487 487 if (!first) 488 488 first = bi;

+1 -1

drivers/net/ethernet/intel/ice/ice_xsk.c

··· 878 878 ICE_RX_FLX_DESC_PKT_LEN_M; 879 879 880 880 xsk_buff_set_size(xdp, size); 881 - xsk_buff_dma_sync_for_cpu(xdp, xsk_pool); 881 + xsk_buff_dma_sync_for_cpu(xdp); 882 882 883 883 if (!first) { 884 884 first = xdp;

+1 -1

drivers/net/ethernet/intel/igc/igc_main.c

··· 2812 2812 } 2813 2813 2814 2814 bi->xdp->data_end = bi->xdp->data + size; 2815 - xsk_buff_dma_sync_for_cpu(bi->xdp, ring->xsk_pool); 2815 + xsk_buff_dma_sync_for_cpu(bi->xdp); 2816 2816 2817 2817 res = __igc_xdp_run_prog(adapter, prog, bi->xdp); 2818 2818 switch (res) {

+1 -1

drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c

··· 303 303 } 304 304 305 305 bi->xdp->data_end = bi->xdp->data + size; 306 - xsk_buff_dma_sync_for_cpu(bi->xdp, rx_ring->xsk_pool); 306 + xsk_buff_dma_sync_for_cpu(bi->xdp); 307 307 xdp_res = ixgbe_run_xdp_zc(adapter, rx_ring, bi->xdp); 308 308 309 309 if (likely(xdp_res & (IXGBE_XDP_TX | IXGBE_XDP_REDIR))) {

+2 -2

drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c

··· 270 270 /* mxbuf->rq is set on allocation, but cqe is per-packet so set it here */ 271 271 mxbuf->cqe = cqe; 272 272 xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt); 273 - xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool); 273 + xsk_buff_dma_sync_for_cpu(&mxbuf->xdp); 274 274 net_prefetch(mxbuf->xdp.data); 275 275 276 276 /* Possible flows: ··· 319 319 /* mxbuf->rq is set on allocation, but cqe is per-packet so set it here */ 320 320 mxbuf->cqe = cqe; 321 321 xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt); 322 - xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool); 322 + xsk_buff_dma_sync_for_cpu(&mxbuf->xdp); 323 323 net_prefetch(mxbuf->xdp.data); 324 324 325 325 prog = rcu_dereference(rq->xdp_prog);

+1 -1

drivers/net/ethernet/mellanox/mlx5/core/en_rx.c

··· 917 917 918 918 if (!rq->xsk_pool) { 919 919 count = mlx5e_refill_rx_wqes(rq, head, wqe_bulk); 920 - } else if (likely(!rq->xsk_pool->dma_need_sync)) { 920 + } else if (likely(!dma_dev_need_sync(rq->pdev))) { 921 921 mlx5e_xsk_free_rx_wqes(rq, head, wqe_bulk); 922 922 count = mlx5e_xsk_alloc_rx_wqes_batched(rq, head, wqe_bulk); 923 923 } else {

+1 -1

drivers/net/ethernet/netronome/nfp/nfd3/xsk.c

··· 184 184 xrxbuf->xdp->data += meta_len; 185 185 xrxbuf->xdp->data_end = xrxbuf->xdp->data + pkt_len; 186 186 xdp_set_data_meta_invalid(xrxbuf->xdp); 187 - xsk_buff_dma_sync_for_cpu(xrxbuf->xdp, r_vec->xsk_pool); 187 + xsk_buff_dma_sync_for_cpu(xrxbuf->xdp); 188 188 net_prefetch(xrxbuf->xdp->data); 189 189 190 190 if (meta_len) {

+1 -1

drivers/net/ethernet/stmicro/stmmac/stmmac_main.c

··· 5361 5361 5362 5362 /* RX buffer is good and fit into a XSK pool buffer */ 5363 5363 buf->xdp->data_end = buf->xdp->data + buf1_len; 5364 - xsk_buff_dma_sync_for_cpu(buf->xdp, rx_q->xsk_pool); 5364 + xsk_buff_dma_sync_for_cpu(buf->xdp); 5365 5365 5366 5366 prog = READ_ONCE(priv->xdp_prog); 5367 5367 res = __stmmac_xdp_run_prog(priv, prog, buf->xdp);

+1 -1

drivers/xen/swiotlb-xen.c

··· 216 216 */ 217 217 trace_swiotlb_bounced(dev, dev_addr, size); 218 218 219 - map = swiotlb_tbl_map_single(dev, phys, size, size, 0, dir, attrs); 219 + map = swiotlb_tbl_map_single(dev, phys, size, 0, dir, attrs); 220 220 if (map == (phys_addr_t)DMA_MAPPING_ERROR) 221 221 return DMA_MAPPING_ERROR; 222 222

+4

include/linux/device.h

··· 691 691 * and optionall (if the coherent mask is large enough) also 692 692 * for dma allocations. This flag is managed by the dma ops 693 693 * instance from ->dma_supported. 694 + * @dma_skip_sync: DMA sync operations can be skipped for coherent buffers. 694 695 * 695 696 * At the lowest level, every device in a Linux system is represented by an 696 697 * instance of struct device. The device structure contains the information ··· 803 802 #endif 804 803 #ifdef CONFIG_DMA_OPS_BYPASS 805 804 bool dma_ops_bypass : 1; 805 + #endif 806 + #ifdef CONFIG_DMA_NEED_SYNC 807 + bool dma_skip_sync:1; 806 808 #endif 807 809 }; 808 810

+12

include/linux/dma-map-ops.h

··· 18 18 * 19 19 * DMA_F_PCI_P2PDMA_SUPPORTED: Indicates the dma_map_ops implementation can 20 20 * handle PCI P2PDMA pages in the map_sg/unmap_sg operation. 21 + * DMA_F_CAN_SKIP_SYNC: DMA sync operations can be skipped if the device is 22 + * coherent and it's not an SWIOTLB buffer. 21 23 */ 22 24 #define DMA_F_PCI_P2PDMA_SUPPORTED (1 << 0) 25 + #define DMA_F_CAN_SKIP_SYNC (1 << 1) 23 26 24 27 struct dma_map_ops { 25 28 unsigned int flags; ··· 275 272 return true; 276 273 } 277 274 #endif /* CONFIG_ARCH_HAS_DMA_COHERENCE_H */ 275 + 276 + static inline void dma_reset_need_sync(struct device *dev) 277 + { 278 + #ifdef CONFIG_DMA_NEED_SYNC 279 + /* Reset it only once so that the function can be called on hotpath */ 280 + if (unlikely(dev->dma_skip_sync)) 281 + dev->dma_skip_sync = false; 282 + #endif 283 + } 278 284 279 285 /* 280 286 * Check whether potential kmalloc() buffers are safe for non-coherent DMA.

+76 -29

include/linux/dma-mapping.h

··· 117 117 size_t size, enum dma_data_direction dir, unsigned long attrs); 118 118 void dma_unmap_resource(struct device *dev, dma_addr_t addr, size_t size, 119 119 enum dma_data_direction dir, unsigned long attrs); 120 - void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size, 121 - enum dma_data_direction dir); 122 - void dma_sync_single_for_device(struct device *dev, dma_addr_t addr, 123 - size_t size, enum dma_data_direction dir); 124 - void dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, 125 - int nelems, enum dma_data_direction dir); 126 - void dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, 127 - int nelems, enum dma_data_direction dir); 128 120 void *dma_alloc_attrs(struct device *dev, size_t size, dma_addr_t *dma_handle, 129 121 gfp_t flag, unsigned long attrs); 130 122 void dma_free_attrs(struct device *dev, size_t size, void *cpu_addr, ··· 139 147 bool dma_addressing_limited(struct device *dev); 140 148 size_t dma_max_mapping_size(struct device *dev); 141 149 size_t dma_opt_mapping_size(struct device *dev); 142 - bool dma_need_sync(struct device *dev, dma_addr_t dma_addr); 143 150 unsigned long dma_get_merge_boundary(struct device *dev); 144 151 struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size, 145 152 enum dma_data_direction dir, gfp_t gfp, unsigned long attrs); ··· 184 193 } 185 194 static inline void dma_unmap_resource(struct device *dev, dma_addr_t addr, 186 195 size_t size, enum dma_data_direction dir, unsigned long attrs) 187 - { 188 - } 189 - static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, 190 - size_t size, enum dma_data_direction dir) 191 - { 192 - } 193 - static inline void dma_sync_single_for_device(struct device *dev, 194 - dma_addr_t addr, size_t size, enum dma_data_direction dir) 195 - { 196 - } 197 - static inline void dma_sync_sg_for_cpu(struct device *dev, 198 - struct scatterlist *sg, int nelems, enum dma_data_direction dir) 
199 - { 200 - } 201 - static inline void dma_sync_sg_for_device(struct device *dev, 202 - struct scatterlist *sg, int nelems, enum dma_data_direction dir) 203 196 { 204 197 } 205 198 static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr) ··· 252 277 { 253 278 return 0; 254 279 } 255 - static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr) 256 - { 257 - return false; 258 - } 259 280 static inline unsigned long dma_get_merge_boundary(struct device *dev) 260 281 { 261 282 return 0; ··· 280 309 return -EINVAL; 281 310 } 282 311 #endif /* CONFIG_HAS_DMA */ 312 + 313 + #if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC) 314 + void __dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size, 315 + enum dma_data_direction dir); 316 + void __dma_sync_single_for_device(struct device *dev, dma_addr_t addr, 317 + size_t size, enum dma_data_direction dir); 318 + void __dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, 319 + int nelems, enum dma_data_direction dir); 320 + void __dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, 321 + int nelems, enum dma_data_direction dir); 322 + bool __dma_need_sync(struct device *dev, dma_addr_t dma_addr); 323 + 324 + static inline bool dma_dev_need_sync(const struct device *dev) 325 + { 326 + /* Always call DMA sync operations when debugging is enabled */ 327 + return !dev->dma_skip_sync || IS_ENABLED(CONFIG_DMA_API_DEBUG); 328 + } 329 + 330 + static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, 331 + size_t size, enum dma_data_direction dir) 332 + { 333 + if (dma_dev_need_sync(dev)) 334 + __dma_sync_single_for_cpu(dev, addr, size, dir); 335 + } 336 + 337 + static inline void dma_sync_single_for_device(struct device *dev, 338 + dma_addr_t addr, size_t size, enum dma_data_direction dir) 339 + { 340 + if (dma_dev_need_sync(dev)) 341 + __dma_sync_single_for_device(dev, addr, size, dir); 342 + } 343 + 344 + static inline 
void dma_sync_sg_for_cpu(struct device *dev, 345 + struct scatterlist *sg, int nelems, enum dma_data_direction dir) 346 + { 347 + if (dma_dev_need_sync(dev)) 348 + __dma_sync_sg_for_cpu(dev, sg, nelems, dir); 349 + } 350 + 351 + static inline void dma_sync_sg_for_device(struct device *dev, 352 + struct scatterlist *sg, int nelems, enum dma_data_direction dir) 353 + { 354 + if (dma_dev_need_sync(dev)) 355 + __dma_sync_sg_for_device(dev, sg, nelems, dir); 356 + } 357 + 358 + static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr) 359 + { 360 + return dma_dev_need_sync(dev) ? __dma_need_sync(dev, dma_addr) : false; 361 + } 362 + #else /* !CONFIG_HAS_DMA || !CONFIG_DMA_NEED_SYNC */ 363 + static inline bool dma_dev_need_sync(const struct device *dev) 364 + { 365 + return false; 366 + } 367 + static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, 368 + size_t size, enum dma_data_direction dir) 369 + { 370 + } 371 + static inline void dma_sync_single_for_device(struct device *dev, 372 + dma_addr_t addr, size_t size, enum dma_data_direction dir) 373 + { 374 + } 375 + static inline void dma_sync_sg_for_cpu(struct device *dev, 376 + struct scatterlist *sg, int nelems, enum dma_data_direction dir) 377 + { 378 + } 379 + static inline void dma_sync_sg_for_device(struct device *dev, 380 + struct scatterlist *sg, int nelems, enum dma_data_direction dir) 381 + { 382 + } 383 + static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr) 384 + { 385 + return false; 386 + } 387 + #endif /* !CONFIG_HAS_DMA || !CONFIG_DMA_NEED_SYNC */ 283 388 284 389 struct page *dma_alloc_pages(struct device *dev, size_t size, 285 390 dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp);
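
The hunk above turns the exported sync functions into inline wrappers that test a per-device flag before calling the out-of-line `__dma_sync_*()` helpers. A minimal user-space sketch of that pattern follows. It is only a model: `struct model_dev`, `model_debug_enabled`, `model_slow_sync()` and the other `model_*` names are hypothetical stand-ins, not kernel API.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct device; only the flag that matters here. */
struct model_dev {
	bool dma_skip_sync;	/* set when syncs are known to be no-ops */
	int slow_syncs;		/* counts calls that reached the slow path */
};

/* Stand-in for IS_ENABLED(CONFIG_DMA_API_DEBUG); assumed off in this model. */
static const bool model_debug_enabled = false;

/* Stand-in for the out-of-line __dma_sync_single_for_cpu(). */
static void model_slow_sync(struct model_dev *dev)
{
	dev->slow_syncs++;
}

/* Mirrors dma_dev_need_sync(): sync unless the device opted out. */
static inline bool model_dev_need_sync(const struct model_dev *dev)
{
	return !dev->dma_skip_sync || model_debug_enabled;
}

/* Mirrors the inline wrapper: one flag test, then the function call. */
static inline void model_sync_single_for_cpu(struct model_dev *dev)
{
	if (model_dev_need_sync(dev))
		model_slow_sync(dev);
}
```

With `dma_skip_sync` set, the wrapper returns after a single flag test and never makes the function call. This is presumably what lets callers such as the page pool and XSK code elsewhere in this diff drop their own cached `dma_need_sync` flags.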

+5

include/linux/iova.h

···
 	return ALIGN(size, iovad->granule);
 }

+static inline size_t iova_align_down(struct iova_domain *iovad, size_t size)
+{
+	return ALIGN_DOWN(size, iovad->granule);
+}
+
 static inline dma_addr_t iova_dma_addr(struct iova_domain *iovad, struct iova *iova)
 {
 	return (dma_addr_t)iova->pfn_lo << iova_shift(iovad);
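
The new `iova_align_down()` helper rounds a size down to the IOVA granule, complementing the existing helper that rounds up. A small user-space sketch of the two roundings is below. `align_up()` and `align_down()` are hypothetical stand-ins for the kernel's `ALIGN()` and `ALIGN_DOWN()` macros, and the sketch assumes a power-of-two granule.

```c
#include <assert.h>
#include <stddef.h>

/* Both helpers assume the granule is a non-zero power of two. */
static int is_pow2(size_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* Round size up to the next multiple of granule (models ALIGN()). */
static size_t align_up(size_t size, size_t granule)
{
	assert(is_pow2(granule));
	return (size + granule - 1) & ~(granule - 1);
}

/* Round size down to a multiple of granule (models ALIGN_DOWN()). */
static size_t align_down(size_t size, size_t granule)
{
	assert(is_pow2(granule));
	return size & ~(granule - 1);
}
```

For a 4096-byte granule, a size of 5000 rounds up to 8192 and down to 4096. Any size below the granule rounds down to zero, so a caller has to be prepared for that result.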

+1 -1

include/linux/swiotlb.h

···
 extern void __init swiotlb_update_mem_attributes(void);

 phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t phys,
-		size_t mapping_size, size_t alloc_size,
+		size_t mapping_size,
 		unsigned int alloc_aligned_mask, enum dma_data_direction dir,
 		unsigned long attrs);


+21 -4

include/net/page_pool/types.h

··· 45 45 46 46 /** 47 47 * struct page_pool_params - page pool parameters 48 - * @flags: PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV 49 48 * @order: 2^order pages on allocation 50 49 * @pool_size: size of the ptr_ring 51 50 * @nid: NUMA node id to allocate from pages from ··· 54 55 * @dma_dir: DMA mapping direction 55 56 * @max_len: max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV 56 57 * @offset: DMA sync address offset for PP_FLAG_DMA_SYNC_DEV 58 + * @netdev: corresponding &net_device for Netlink introspection 59 + * @flags: PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV, PP_FLAG_SYSTEM_POOL 57 60 */ 58 61 struct page_pool_params { 59 62 struct_group_tagged(page_pool_params_fast, fast, 60 - unsigned int flags; 61 63 unsigned int order; 62 64 unsigned int pool_size; 63 65 int nid; ··· 70 70 ); 71 71 struct_group_tagged(page_pool_params_slow, slow, 72 72 struct net_device *netdev; 73 + unsigned int flags; 73 74 /* private: used by test code only */ 74 75 void (*init_callback)(struct page *page, void *arg); 75 76 void *init_arg; ··· 131 130 struct page_pool_params_fast p; 132 131 133 132 int cpuid; 134 - bool has_init_callback; 133 + u32 pages_state_hold_cnt; 135 134 135 + bool has_init_callback:1; /* slow::init_callback is set */ 136 + bool dma_map:1; /* Perform DMA mapping */ 137 + bool dma_sync:1; /* Perform DMA sync */ 138 + #ifdef CONFIG_PAGE_POOL_STATS 139 + bool system:1; /* This is a global percpu pool */ 140 + #endif 141 + 142 + /* The following block must stay within one cacheline. On 32-bit 143 + * systems, sizeof(long) == sizeof(int), so that the block size is 144 + * ``3 * sizeof(long)``. On 64-bit systems, the actual size is 145 + * ``2 * sizeof(long) + sizeof(int)``. The closest pow-2 to both of 146 + * them is ``4 * sizeof(long)``, so just use that one for simplicity. 147 + * Having it aligned to a cacheline boundary may be excessive and 148 + * doesn't bring any good. 
149 + */ 150 + __cacheline_group_begin(frag) __aligned(4 * sizeof(long)); 136 151 long frag_users; 137 152 struct page *frag_page; 138 153 unsigned int frag_offset; 139 - u32 pages_state_hold_cnt; 154 + __cacheline_group_end(frag); 140 155 141 156 struct delayed_work release_dw; 142 157 void (*disconnect)(void *pool);

+2 -5

include/net/xdp_sock_drv.h

···
 	return meta;
 }

-static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
+static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp)
 {
 	struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
-
-	if (!pool->dma_need_sync)
-		return;

 	xp_dma_sync_for_cpu(xskb);
 }
···
 	return NULL;
 }

-static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
+static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp)
 {
 }


+4 -10

include/net/xsk_buff_pool.h

···
 	refcount_t users;
 	struct list_head list; /* Protected by the RTNL_LOCK */
 	u32 dma_pages_cnt;
-	bool dma_need_sync;
 };

 struct xsk_buff_pool {
···
 	u8 tx_metadata_len; /* inherited from umem */
 	u8 cached_need_wakeup;
 	bool uses_need_wakeup;
-	bool dma_need_sync;
 	bool unaligned;
 	bool tx_sw_csum;
 	void *addrs;
···
 	return xskb->frame_dma;
 }

-void xp_dma_sync_for_cpu_slow(struct xdp_buff_xsk *xskb);
 static inline void xp_dma_sync_for_cpu(struct xdp_buff_xsk *xskb)
 {
-	xp_dma_sync_for_cpu_slow(xskb);
+	dma_sync_single_for_cpu(xskb->pool->dev, xskb->dma,
+				xskb->pool->frame_len,
+				DMA_BIDIRECTIONAL);
 }

-void xp_dma_sync_for_device_slow(struct xsk_buff_pool *pool, dma_addr_t dma,
-				 size_t size);
 static inline void xp_dma_sync_for_device(struct xsk_buff_pool *pool,
 					  dma_addr_t dma, size_t size)
 {
-	if (!pool->dma_need_sync)
-		return;
-
-	xp_dma_sync_for_device_slow(pool, dma, size);
+	dma_sync_single_for_device(pool->dev, dma, size, DMA_BIDIRECTIONAL);
 }

 /* Masks for xdp_umem_page flags.

+5

kernel/dma/Kconfig

···
 	bool
 	depends on SWIOTLB

+config DMA_NEED_SYNC
+	def_bool ARCH_HAS_SYNC_DMA_FOR_DEVICE || ARCH_HAS_SYNC_DMA_FOR_CPU || \
+		 ARCH_HAS_SYNC_DMA_FOR_CPU_ALL || DMA_API_DEBUG || DMA_OPS || \
+		 SWIOTLB
+
 config DMA_RESTRICTED_POOL
 	bool "DMA Restricted Pool"
 	depends on OF && OF_RESERVED_MEM && SWIOTLB

+51 -18

kernel/dma/mapping.c

··· 329 329 } 330 330 EXPORT_SYMBOL(dma_unmap_resource); 331 331 332 - void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size, 332 + #ifdef CONFIG_DMA_NEED_SYNC 333 + void __dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size, 333 334 enum dma_data_direction dir) 334 335 { 335 336 const struct dma_map_ops *ops = get_dma_ops(dev); ··· 342 341 ops->sync_single_for_cpu(dev, addr, size, dir); 343 342 debug_dma_sync_single_for_cpu(dev, addr, size, dir); 344 343 } 345 - EXPORT_SYMBOL(dma_sync_single_for_cpu); 344 + EXPORT_SYMBOL(__dma_sync_single_for_cpu); 346 345 347 - void dma_sync_single_for_device(struct device *dev, dma_addr_t addr, 346 + void __dma_sync_single_for_device(struct device *dev, dma_addr_t addr, 348 347 size_t size, enum dma_data_direction dir) 349 348 { 350 349 const struct dma_map_ops *ops = get_dma_ops(dev); ··· 356 355 ops->sync_single_for_device(dev, addr, size, dir); 357 356 debug_dma_sync_single_for_device(dev, addr, size, dir); 358 357 } 359 - EXPORT_SYMBOL(dma_sync_single_for_device); 358 + EXPORT_SYMBOL(__dma_sync_single_for_device); 360 359 361 - void dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, 360 + void __dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, 362 361 int nelems, enum dma_data_direction dir) 363 362 { 364 363 const struct dma_map_ops *ops = get_dma_ops(dev); ··· 370 369 ops->sync_sg_for_cpu(dev, sg, nelems, dir); 371 370 debug_dma_sync_sg_for_cpu(dev, sg, nelems, dir); 372 371 } 373 - EXPORT_SYMBOL(dma_sync_sg_for_cpu); 372 + EXPORT_SYMBOL(__dma_sync_sg_for_cpu); 374 373 375 - void dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, 374 + void __dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, 376 375 int nelems, enum dma_data_direction dir) 377 376 { 378 377 const struct dma_map_ops *ops = get_dma_ops(dev); ··· 384 383 ops->sync_sg_for_device(dev, sg, nelems, dir); 385 384 debug_dma_sync_sg_for_device(dev, sg, nelems, 
dir); 386 385 } 387 - EXPORT_SYMBOL(dma_sync_sg_for_device); 386 + EXPORT_SYMBOL(__dma_sync_sg_for_device); 387 + 388 + bool __dma_need_sync(struct device *dev, dma_addr_t dma_addr) 389 + { 390 + const struct dma_map_ops *ops = get_dma_ops(dev); 391 + 392 + if (dma_map_direct(dev, ops)) 393 + /* 394 + * dma_skip_sync could've been reset on first SWIOTLB buffer 395 + * mapping, but @dma_addr is not necessary an SWIOTLB buffer. 396 + * In this case, fall back to more granular check. 397 + */ 398 + return dma_direct_need_sync(dev, dma_addr); 399 + return true; 400 + } 401 + EXPORT_SYMBOL_GPL(__dma_need_sync); 402 + 403 + static void dma_setup_need_sync(struct device *dev) 404 + { 405 + const struct dma_map_ops *ops = get_dma_ops(dev); 406 + 407 + if (dma_map_direct(dev, ops) || (ops->flags & DMA_F_CAN_SKIP_SYNC)) 408 + /* 409 + * dma_skip_sync will be reset to %false on first SWIOTLB buffer 410 + * mapping, if any. During the device initialization, it's 411 + * enough to check only for the DMA coherence. 412 + */ 413 + dev->dma_skip_sync = dev_is_dma_coherent(dev); 414 + else if (!ops->sync_single_for_device && !ops->sync_single_for_cpu && 415 + !ops->sync_sg_for_device && !ops->sync_sg_for_cpu) 416 + /* 417 + * Synchronization is not possible when none of DMA sync ops 418 + * is set. 
419 + */ 420 + dev->dma_skip_sync = true; 421 + else 422 + dev->dma_skip_sync = false; 423 + } 424 + #else /* !CONFIG_DMA_NEED_SYNC */ 425 + static inline void dma_setup_need_sync(struct device *dev) { } 426 + #endif /* !CONFIG_DMA_NEED_SYNC */ 388 427 389 428 /* 390 429 * The whole dma_get_sgtable() idea is fundamentally unsafe - it seems ··· 814 773 815 774 arch_dma_set_mask(dev, mask); 816 775 *dev->dma_mask = mask; 776 + dma_setup_need_sync(dev); 777 + 817 778 return 0; 818 779 } 819 780 EXPORT_SYMBOL(dma_set_mask); ··· 883 840 return min(dma_max_mapping_size(dev), size); 884 841 } 885 842 EXPORT_SYMBOL_GPL(dma_opt_mapping_size); 886 - 887 - bool dma_need_sync(struct device *dev, dma_addr_t dma_addr) 888 - { 889 - const struct dma_map_ops *ops = get_dma_ops(dev); 890 - 891 - if (dma_map_direct(dev, ops)) 892 - return dma_direct_need_sync(dev, dma_addr); 893 - return ops->sync_single_for_cpu || ops->sync_single_for_device; 894 - } 895 - EXPORT_SYMBOL_GPL(dma_need_sync); 896 843 897 844 unsigned long dma_get_merge_boundary(struct device *dev) 898 845 {
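
The three-way decision in `dma_setup_need_sync()` above can be restated as a small pure function. This is a simplified user-space model, not kernel code. `struct model_ops` and its fields are hypothetical, and a NULL `ops` pointer stands in for the `dma_map_direct()` case.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduction of struct dma_map_ops to the two facts tested. */
struct model_ops {
	bool can_skip_sync;		/* models ops->flags & DMA_F_CAN_SKIP_SYNC */
	bool has_sync_callbacks;	/* any of the four sync_* callbacks is set */
};

/*
 * Returns the value the model would store in dev->dma_skip_sync.
 * ops == NULL models direct mapping (no dma_map_ops in use).
 */
static bool model_setup_skip_sync(const struct model_ops *ops, bool coherent)
{
	if (ops == NULL || ops->can_skip_sync)
		return coherent;	/* coherent devices need no sync */
	if (!ops->has_sync_callbacks)
		return true;		/* nothing could be called anyway */
	return false;			/* ops provide syncs: always call them */
}
```

The model leaves out the later reset. According to the comments in the diff, `dma_skip_sync` is cleared again on the first SWIOTLB buffer mapping, so a `true` result here is only an initial state.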

+48 -14

kernel/dma/swiotlb.c

··· 1340 1340 1341 1341 #endif /* CONFIG_DEBUG_FS */ 1342 1342 1343 + /** 1344 + * swiotlb_tbl_map_single() - bounce buffer map a single contiguous physical area 1345 + * @dev: Device which maps the buffer. 1346 + * @orig_addr: Original (non-bounced) physical IO buffer address 1347 + * @mapping_size: Requested size of the actual bounce buffer, excluding 1348 + * any pre- or post-padding for alignment 1349 + * @alloc_align_mask: Required start and end alignment of the allocated buffer 1350 + * @dir: DMA direction 1351 + * @attrs: Optional DMA attributes for the map operation 1352 + * 1353 + * Find and allocate a suitable sequence of IO TLB slots for the request. 1354 + * The allocated space starts at an alignment specified by alloc_align_mask, 1355 + * and the size of the allocated space is rounded up so that the total amount 1356 + * of allocated space is a multiple of (alloc_align_mask + 1). If 1357 + * alloc_align_mask is zero, the allocated space may be at any alignment and 1358 + * the size is not rounded up. 1359 + * 1360 + * The returned address is within the allocated space and matches the bits 1361 + * of orig_addr that are specified in the DMA min_align_mask for the device. As 1362 + * such, this returned address may be offset from the beginning of the allocated 1363 + * space. The bounce buffer space starting at the returned address for 1364 + * mapping_size bytes is initialized to the contents of the original IO buffer 1365 + * area. Any pre-padding (due to an offset) and any post-padding (due to 1366 + * rounding-up the size) is not initialized. 
1367 + */ 1343 1368 phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr, 1344 - size_t mapping_size, size_t alloc_size, 1345 - unsigned int alloc_align_mask, enum dma_data_direction dir, 1346 - unsigned long attrs) 1369 + size_t mapping_size, unsigned int alloc_align_mask, 1370 + enum dma_data_direction dir, unsigned long attrs) 1347 1371 { 1348 1372 struct io_tlb_mem *mem = dev->dma_io_tlb_mem; 1349 1373 unsigned int offset; 1350 1374 struct io_tlb_pool *pool; 1351 1375 unsigned int i; 1376 + size_t size; 1352 1377 int index; 1353 1378 phys_addr_t tlb_addr; 1354 1379 unsigned short pad_slots; ··· 1387 1362 if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) 1388 1363 pr_warn_once("Memory encryption is active and system is using DMA bounce buffers\n"); 1389 1364 1390 - if (mapping_size > alloc_size) { 1391 - dev_warn_once(dev, "Invalid sizes (mapping: %zd bytes, alloc: %zd bytes)", 1392 - mapping_size, alloc_size); 1393 - return (phys_addr_t)DMA_MAPPING_ERROR; 1394 - } 1365 + /* 1366 + * The default swiotlb memory pool is allocated with PAGE_SIZE 1367 + * alignment. If a mapping is requested with larger alignment, 1368 + * the mapping may be unable to use the initial slot(s) in all 1369 + * sets of IO_TLB_SEGSIZE slots. In such case, a mapping request 1370 + * of or near the maximum mapping size would always fail. 
1371 + */ 1372 + dev_WARN_ONCE(dev, alloc_align_mask > ~PAGE_MASK, 1373 + "Alloc alignment may prevent fulfilling requests with max mapping_size\n"); 1395 1374 1396 1375 offset = swiotlb_align_offset(dev, alloc_align_mask, orig_addr); 1397 - index = swiotlb_find_slots(dev, orig_addr, 1398 - alloc_size + offset, alloc_align_mask, &pool); 1376 + size = ALIGN(mapping_size + offset, alloc_align_mask + 1); 1377 + index = swiotlb_find_slots(dev, orig_addr, size, alloc_align_mask, &pool); 1399 1378 if (index == -1) { 1400 1379 if (!(attrs & DMA_ATTR_NO_WARN)) 1401 1380 dev_warn_ratelimited(dev, 1402 1381 "swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n", 1403 - alloc_size, mem->nslabs, mem_used(mem)); 1382 + size, mem->nslabs, mem_used(mem)); 1404 1383 return (phys_addr_t)DMA_MAPPING_ERROR; 1405 1384 } 1385 + 1386 + /* 1387 + * If dma_skip_sync was set, reset it on first SWIOTLB buffer 1388 + * mapping to always sync SWIOTLB buffers. 1389 + */ 1390 + dma_reset_need_sync(dev); 1406 1391 1407 1392 /* 1408 1393 * Save away the mapping from the original address to the DMA address. ··· 1423 1388 offset &= (IO_TLB_SIZE - 1); 1424 1389 index += pad_slots; 1425 1390 pool->slots[index].pad_slots = pad_slots; 1426 - for (i = 0; i < nr_slots(alloc_size + offset); i++) 1391 + for (i = 0; i < (nr_slots(size) - pad_slots); i++) 1427 1392 pool->slots[index + i].orig_addr = slot_addr(orig_addr, i); 1428 1393 tlb_addr = slot_addr(pool->start, index) + offset; 1429 1394 /* ··· 1578 1543 1579 1544 trace_swiotlb_bounced(dev, phys_to_dma(dev, paddr), size); 1580 1545 1581 - swiotlb_addr = swiotlb_tbl_map_single(dev, paddr, size, size, 0, dir, 1582 - attrs); 1546 + swiotlb_addr = swiotlb_tbl_map_single(dev, paddr, size, 0, dir, attrs); 1583 1547 if (swiotlb_addr == (phys_addr_t)DMA_MAPPING_ERROR) 1584 1548 return DMA_MAPPING_ERROR; 1585 1549
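
With the `alloc_size` parameter gone, the function derives the slot-search size itself as `ALIGN(mapping_size + offset, alloc_align_mask + 1)`. The arithmetic can be sketched in user-space C as follows. `model_alloc_size()` is a hypothetical name, and `offset` is passed in directly because the internals of `swiotlb_align_offset()` are not shown in this diff.

```c
#include <stddef.h>

/*
 * Models: size = ALIGN(mapping_size + offset, alloc_align_mask + 1).
 * alloc_align_mask is assumed to be of the form 2^n - 1 (0 is allowed).
 */
static size_t model_alloc_size(size_t mapping_size, unsigned int offset,
			       unsigned int alloc_align_mask)
{
	size_t align = (size_t)alloc_align_mask + 1;

	return (mapping_size + offset + align - 1) & ~(align - 1);
}
```

With a mask of zero the result is simply `mapping_size + offset`, which agrees with the kernel-doc statement that the size is not rounded up in that case. With a mask of `0xfff`, a 4096-byte mapping at offset 8 needs 8192 bytes, which shows how a large alignment mask can inflate a request that is already near the maximum mapping size.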

+48 -30

net/core/page_pool.c

··· 173 173 spin_unlock_bh(&pool->ring.producer_lock); 174 174 } 175 175 176 + static void page_pool_struct_check(void) 177 + { 178 + CACHELINE_ASSERT_GROUP_MEMBER(struct page_pool, frag, frag_users); 179 + CACHELINE_ASSERT_GROUP_MEMBER(struct page_pool, frag, frag_page); 180 + CACHELINE_ASSERT_GROUP_MEMBER(struct page_pool, frag, frag_offset); 181 + CACHELINE_ASSERT_GROUP_SIZE(struct page_pool, frag, 4 * sizeof(long)); 182 + } 183 + 176 184 static int page_pool_init(struct page_pool *pool, 177 185 const struct page_pool_params *params, 178 186 int cpuid) 179 187 { 180 188 unsigned int ring_qsize = 1024; /* Default */ 189 + 190 + page_pool_struct_check(); 181 191 182 192 memcpy(&pool->p, &params->fast, sizeof(pool->p)); 183 193 memcpy(&pool->slow, &params->slow, sizeof(pool->slow)); ··· 195 185 pool->cpuid = cpuid; 196 186 197 187 /* Validate only known flags were used */ 198 - if (pool->p.flags & ~(PP_FLAG_ALL)) 188 + if (pool->slow.flags & ~PP_FLAG_ALL) 199 189 return -EINVAL; 200 190 201 191 if (pool->p.pool_size) ··· 209 199 * DMA_BIDIRECTIONAL is for allowing page used for DMA sending, 210 200 * which is the XDP_TX use-case. 
211 201 */ 212 - if (pool->p.flags & PP_FLAG_DMA_MAP) { 202 + if (pool->slow.flags & PP_FLAG_DMA_MAP) { 213 203 if ((pool->p.dma_dir != DMA_FROM_DEVICE) && 214 204 (pool->p.dma_dir != DMA_BIDIRECTIONAL)) 215 205 return -EINVAL; 206 + 207 + pool->dma_map = true; 216 208 } 217 209 218 - if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) { 210 + if (pool->slow.flags & PP_FLAG_DMA_SYNC_DEV) { 219 211 /* In order to request DMA-sync-for-device the page 220 212 * needs to be mapped 221 213 */ 222 - if (!(pool->p.flags & PP_FLAG_DMA_MAP)) 214 + if (!(pool->slow.flags & PP_FLAG_DMA_MAP)) 223 215 return -EINVAL; 224 216 225 217 if (!pool->p.max_len) 226 218 return -EINVAL; 219 + 220 + pool->dma_sync = true; 227 221 228 222 /* pool->p.offset has to be set according to the address 229 223 * offset used by the DMA engine to start copying rx data ··· 237 223 pool->has_init_callback = !!pool->slow.init_callback; 238 224 239 225 #ifdef CONFIG_PAGE_POOL_STATS 240 - if (!(pool->p.flags & PP_FLAG_SYSTEM_POOL)) { 226 + if (!(pool->slow.flags & PP_FLAG_SYSTEM_POOL)) { 241 227 pool->recycle_stats = alloc_percpu(struct page_pool_recycle_stats); 242 228 if (!pool->recycle_stats) 243 229 return -ENOMEM; ··· 247 233 * (also percpu) page pool instance. 
248 234 */ 249 235 pool->recycle_stats = &pp_system_recycle_stats; 236 + pool->system = true; 250 237 } 251 238 #endif 252 239 253 240 if (ptr_ring_init(&pool->ring, ring_qsize, GFP_KERNEL) < 0) { 254 241 #ifdef CONFIG_PAGE_POOL_STATS 255 - if (!(pool->p.flags & PP_FLAG_SYSTEM_POOL)) 242 + if (!pool->system) 256 243 free_percpu(pool->recycle_stats); 257 244 #endif 258 245 return -ENOMEM; ··· 264 249 /* Driver calling page_pool_create() also call page_pool_destroy() */ 265 250 refcount_set(&pool->user_cnt, 1); 266 251 267 - if (pool->p.flags & PP_FLAG_DMA_MAP) 252 + if (pool->dma_map) 268 253 get_device(pool->p.dev); 269 254 270 255 return 0; ··· 274 259 { 275 260 ptr_ring_cleanup(&pool->ring, NULL); 276 261 277 - if (pool->p.flags & PP_FLAG_DMA_MAP) 262 + if (pool->dma_map) 278 263 put_device(pool->p.dev); 279 264 280 265 #ifdef CONFIG_PAGE_POOL_STATS 281 - if (!(pool->p.flags & PP_FLAG_SYSTEM_POOL)) 266 + if (!pool->system) 282 267 free_percpu(pool->recycle_stats); 283 268 #endif 284 269 } ··· 399 384 return page; 400 385 } 401 386 402 - static void page_pool_dma_sync_for_device(const struct page_pool *pool, 403 - const struct page *page, 404 - unsigned int dma_sync_size) 387 + static void __page_pool_dma_sync_for_device(const struct page_pool *pool, 388 + const struct page *page, 389 + u32 dma_sync_size) 405 390 { 391 + #if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC) 406 392 dma_addr_t dma_addr = page_pool_get_dma_addr(page); 407 393 408 394 dma_sync_size = min(dma_sync_size, pool->p.max_len); 409 - dma_sync_single_range_for_device(pool->p.dev, dma_addr, 410 - pool->p.offset, dma_sync_size, 411 - pool->p.dma_dir); 395 + __dma_sync_single_for_device(pool->p.dev, dma_addr + pool->p.offset, 396 + dma_sync_size, pool->p.dma_dir); 397 + #endif 398 + } 399 + 400 + static __always_inline void 401 + page_pool_dma_sync_for_device(const struct page_pool *pool, 402 + const struct page *page, 403 + u32 dma_sync_size) 404 + { 405 + if (pool->dma_sync && 
dma_dev_need_sync(pool->p.dev)) 406 + __page_pool_dma_sync_for_device(pool, page, dma_sync_size); 412 407 } 413 408 414 409 static bool page_pool_dma_map(struct page_pool *pool, struct page *page) ··· 440 415 if (page_pool_set_dma_addr(page, dma)) 441 416 goto unmap_failed; 442 417 443 - if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) 444 - page_pool_dma_sync_for_device(pool, page, pool->p.max_len); 418 + page_pool_dma_sync_for_device(pool, page, pool->p.max_len); 445 419 446 420 return true; 447 421 ··· 485 461 if (unlikely(!page)) 486 462 return NULL; 487 463 488 - if ((pool->p.flags & PP_FLAG_DMA_MAP) && 489 - unlikely(!page_pool_dma_map(pool, page))) { 464 + if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page))) { 490 465 put_page(page); 491 466 return NULL; 492 467 } ··· 505 482 gfp_t gfp) 506 483 { 507 484 const int bulk = PP_ALLOC_CACHE_REFILL; 508 - unsigned int pp_flags = pool->p.flags; 509 485 unsigned int pp_order = pool->p.order; 486 + bool dma_map = pool->dma_map; 510 487 struct page *page; 511 488 int i, nr_pages; 512 489 ··· 531 508 */ 532 509 for (i = 0; i < nr_pages; i++) { 533 510 page = pool->alloc.cache[i]; 534 - if ((pp_flags & PP_FLAG_DMA_MAP) && 535 - unlikely(!page_pool_dma_map(pool, page))) { 511 + if (dma_map && unlikely(!page_pool_dma_map(pool, page))) { 536 512 put_page(page); 537 513 continue; 538 514 } ··· 604 582 { 605 583 dma_addr_t dma; 606 584 607 - if (!(pool->p.flags & PP_FLAG_DMA_MAP)) 585 + if (!pool->dma_map) 608 586 /* Always account for inflight pages, even if we didn't 609 587 * map them 610 588 */ ··· 687 665 } 688 666 689 667 /* If the page refcnt == 1, this will try to recycle the page. 690 - * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for 668 + * If pool->dma_sync is set, we'll try to sync the DMA area for 691 669 * the configured size min(dma_sync_size, pool->max_len). 692 670 * If the page refcnt != 1, then the page will be returned to memory 693 671 * subsystem. 
··· 710 688 if (likely(__page_pool_page_can_be_recycled(page))) { 711 689 /* Read barrier done in page_ref_count / READ_ONCE */ 712 690 713 - if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) 714 - page_pool_dma_sync_for_device(pool, page, 715 - dma_sync_size); 691 + page_pool_dma_sync_for_device(pool, page, dma_sync_size); 716 692 717 693 if (allow_direct && page_pool_recycle_in_cache(page, pool)) 718 694 return NULL; ··· 849 829 return NULL; 850 830 851 831 if (__page_pool_page_can_be_recycled(page)) { 852 - if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) 853 - page_pool_dma_sync_for_device(pool, page, -1); 854 - 832 + page_pool_dma_sync_for_device(pool, page, -1); 855 833 return page; 856 834 } 857 835

+4 -25

net/xdp/xsk_buff_pool.c

··· 338 338 339 339 dma_map->netdev = netdev; 340 340 dma_map->dev = dev; 341 - dma_map->dma_need_sync = false; 342 341 dma_map->dma_pages_cnt = nr_pages; 343 342 refcount_set(&dma_map->users, 1); 344 343 list_add(&dma_map->list, &umem->xsk_dma_list); ··· 423 424 424 425 pool->dev = dma_map->dev; 425 426 pool->dma_pages_cnt = dma_map->dma_pages_cnt; 426 - pool->dma_need_sync = dma_map->dma_need_sync; 427 427 memcpy(pool->dma_pages, dma_map->dma_pages, 428 428 pool->dma_pages_cnt * sizeof(*pool->dma_pages)); 429 429 ··· 458 460 __xp_dma_unmap(dma_map, attrs); 459 461 return -ENOMEM; 460 462 } 461 - if (dma_need_sync(dev, dma)) 462 - dma_map->dma_need_sync = true; 463 463 dma_map->dma_pages[i] = dma; 464 464 } 465 465 ··· 553 557 xskb->xdp.data_meta = xskb->xdp.data; 554 558 xskb->xdp.flags = 0; 555 559 556 - if (pool->dma_need_sync) { 557 - dma_sync_single_range_for_device(pool->dev, xskb->dma, 0, 558 - pool->frame_len, 559 - DMA_BIDIRECTIONAL); 560 - } 560 + if (pool->dev) 561 + xp_dma_sync_for_device(pool, xskb->dma, pool->frame_len); 562 + 561 563 return &xskb->xdp; 562 564 } 563 565 EXPORT_SYMBOL(xp_alloc); ··· 627 633 { 628 634 u32 nb_entries1 = 0, nb_entries2; 629 635 630 - if (unlikely(pool->dma_need_sync)) { 636 + if (unlikely(pool->dev && dma_dev_need_sync(pool->dev))) { 631 637 struct xdp_buff *buff; 632 638 633 639 /* Slow path */ ··· 687 693 (addr & ~PAGE_MASK); 688 694 } 689 695 EXPORT_SYMBOL(xp_raw_get_dma); 690 - 691 - void xp_dma_sync_for_cpu_slow(struct xdp_buff_xsk *xskb) 692 - { 693 - dma_sync_single_range_for_cpu(xskb->pool->dev, xskb->dma, 0, 694 - xskb->pool->frame_len, DMA_BIDIRECTIONAL); 695 - } 696 - EXPORT_SYMBOL(xp_dma_sync_for_cpu_slow); 697 - 698 - void xp_dma_sync_for_device_slow(struct xsk_buff_pool *pool, dma_addr_t dma, 699 - size_t size) 700 - { 701 - dma_sync_single_range_for_device(pool->dev, dma, 0, 702 - size, DMA_BIDIRECTIONAL); 703 - } 704 - EXPORT_SYMBOL(xp_dma_sync_for_device_slow);
