Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

PCI/P2PDMA: Document DMABUF model

Reflect latest changes in p2p implementation to support DMABUF lifecycle.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-5-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>

Authored by Jason Gunthorpe, committed by Alex Williamson
50d44fce 395698bd

+73 -22
Documentation/driver-api/pci/p2pdma.rst
···
  called Peer-to-Peer (or P2P). However, there are a number of issues that
  make P2P transactions tricky to do in a perfectly safe way.

- One of the biggest issues is that PCI doesn't require forwarding
- transactions between hierarchy domains, and in PCIe, each Root Port
- defines a separate hierarchy domain. To make things worse, there is no
- simple way to determine if a given Root Complex supports this or not.
- (See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
- only supports doing P2P when the endpoints involved are all behind the
- same PCI bridge, as such devices are all in the same PCI hierarchy
- domain, and the spec guarantees that all transactions within the
- hierarchy will be routable, but it does not require routing
- between hierarchies.
+ For PCIe the routing of Transaction Layer Packets (TLPs) is well-defined up
+ until they reach a host bridge or root port. If the path includes PCIe switches
+ then based on the ACS settings the transaction can route entirely within
+ the PCIe hierarchy and never reach the root port. The kernel will evaluate
+ the PCIe topology and always permit P2P in these well-defined cases.

- The second issue is that to make use of existing interfaces in Linux,
- memory that is used for P2P transactions needs to be backed by struct
- pages. However, PCI BARs are not typically cache coherent so there are
- a few corner case gotchas with these pages so developers need to
- be careful about what they do with them.
+ However, if the P2P transaction reaches the host bridge then it might have to
+ hairpin back out the same root port, be routed inside the CPU SOC to another
+ PCIe root port, or be routed internally to the SOC.
+
+ The PCIe specification doesn't define the forwarding of transactions between
+ hierarchy domains, and the kernel defaults to blocking such routing. There is an
+ allow list to detect known-good HW, in which case P2P between any
+ two PCIe devices will be permitted.
+
+ Since P2P inherently is doing transactions between two devices it requires two
+ drivers to be co-operating inside the kernel. The providing driver has to convey
+ its MMIO to the consuming driver. To meet the driver model lifecycle rules the
+ MMIO must have all DMA mappings removed, all CPU accesses prevented, and all page
+ table mappings undone before the providing driver completes remove().
+
+ This requires the providing and consuming drivers to actively work together to
+ guarantee that the consuming driver has stopped using the MMIO during a removal
+ cycle. This is done by either a synchronous invalidation shutdown or waiting
+ for all usage refcounts to reach zero.
+
+ At the lowest level the P2P subsystem offers a naked struct p2p_provider that
+ delegates lifecycle management to the providing driver. It is expected that
+ drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF
+ to provide an invalidation shutdown. These MMIO addresses have no struct page, and
+ if used with mmap() must create special PTEs. As such there are very few
+ kernel uAPIs that can accept pointers to them; in particular they cannot be used
+ with read()/write(), including O_DIRECT.
+
+ Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE
+ pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of the
+ pgmap ensures that when the pgmap is destroyed all other drivers have stopped
+ using the MMIO. This option works with O_DIRECT flows, in some cases, if the
+ underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through
+ FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on a pgmap
+ it also relies on architecture support along with alignment and minimum size
+ limitations.


  Driver Writer's Guide
···
  Struct Page Caveats
  -------------------

- Driver writers should be very careful about not passing these special
- struct pages to code that isn't prepared for it. At this time, the kernel
- interfaces do not have any checks for ensuring this. This obviously
- precludes passing these pages to userspace.
+ While the MEMORY_DEVICE_PCI_P2PDMA pages can be installed in VMAs,
+ pin_user_pages() and related will not return them unless FOLL_PCI_P2PDMA is set.

- P2P memory is also technically IO memory but should never have any side
- effects behind it. Thus, the order of loads and stores should not be important
- and ioreadX(), iowriteX() and friends should not be necessary.
+ The MEMORY_DEVICE_PCI_P2PDMA pages require care to support in the kernel. The
+ KVA is still MMIO and must still be accessed through the normal
+ readX()/writeX()/etc. helpers. Direct CPU access (e.g. memcpy) is forbidden, just
+ like any other MMIO mapping. While this will actually work on some
+ architectures, others will experience corruption or just crash in the kernel.
+ Supporting FOLL_PCI_P2PDMA in a subsystem requires scrubbing it to ensure no CPU
+ access happens.
+
+
+ Usage With DMABUF
+ =================
+
+ DMABUF provides an alternative to the above struct page-based
+ client/provider/orchestrator system and should be used when struct page
+ doesn't exist. In this mode the exporting driver will wrap
+ some of its MMIO in a DMABUF and give the DMABUF FD to userspace.
+
+ Userspace can then pass the FD to an importing driver which will ask the
+ exporting driver to map it to the importer.
+
+ In this case the initiator and target pci_devices are known and the P2P subsystem
+ is used to determine the mapping type. The phys_addr_t-based DMA API is used to
+ establish the dma_addr_t.
+
+ Lifecycle is controlled by DMABUF move_notify(). When the exporting driver wants
+ to remove() it must deliver an invalidation shutdown to all DMABUF importing
+ drivers through move_notify() and synchronously DMA unmap all the MMIO.
+
+ No importing driver can continue to have a DMA map to the MMIO after the
+ exporting driver has destroyed its p2p_provider.


  P2P DMA Support Library
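The FOLL_PCI_P2PDMA opt-in described in the Struct Page Caveats hunk can be sketched from a consuming subsystem's point of view. This is a hypothetical illustration, not code from the patch: `demo_pin_p2p()` is an invented name, while `pin_user_pages_fast()` and the `FOLL_*` flags are real kernel APIs.

```c
/*
 * Hypothetical sketch: a subsystem that has been scrubbed to never do
 * direct CPU access to the pinned pages opts in to receiving
 * MEMORY_DEVICE_PCI_P2PDMA pages from GUP.
 */
#include <linux/mm.h>

static int demo_pin_p2p(unsigned long uaddr, int npages, struct page **pages)
{
	/*
	 * Without FOLL_PCI_P2PDMA, pin_user_pages_fast() refuses to
	 * return P2PDMA pages.  FOLL_LONGTERM cannot be combined with
	 * it, so a long-lived pin of P2P MMIO is not possible.
	 */
	return pin_user_pages_fast(uaddr, npages,
				   FOLL_WRITE | FOLL_PCI_P2PDMA, pages);
}
```

Any kernel virtual address derived from pages returned this way is still MMIO and must go through readX()/writeX() style accessors, never memcpy().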