Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

iommu: Introduce pci_dev_reset_iommu_prepare/done()

PCIe permits a device to ignore ATS invalidation TLPs while processing a
reset. This creates a problem visible to the OS where an ATS invalidation
command will time out. E.g. an SVA domain will have no coordination with a
reset event and can racily issue ATS invalidations to a resetting device.

The OS should do something to mitigate this as we do not want production
systems to be reporting critical ATS failures, especially in a hypervisor
environment. Broadly, the OS could arrange to ignore the timeouts, block page
table mutations to prevent invalidations, or disable and block ATS.

The PCIe r6.0, sec 10.3.1 IMPLEMENTATION NOTE recommends SW to disable and
block ATS before initiating a Function Level Reset. It also mentions that
other reset methods could have the same vulnerability as well.

Provide a callback from the PCI subsystem that will enclose the reset and
have the iommu core temporarily change all the attached RID/PASID domains
to group->blocking_domain so that the IOMMU hardware would fence any
incoming ATS queries. IOMMU drivers should also synchronously stop issuing
new ATS invalidations and wait for all ATS invalidations to complete. This
can avoid any ATS invalidation timeouts.
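
The enclosing pattern described above can be sketched in userspace as follows. This is only a model under stated assumptions: `struct pci_dev_model` and every `model_*` function are hypothetical stand-ins invented for illustration, not kernel definitions, and the real reset paths in the PCI subsystem may be structured differently.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a PCI device; not the kernel's struct pci_dev. */
struct pci_dev_model {
	int ats_blocked;	/* 1 while the IOMMU side is fenced */
	int reset_count;	/* number of resets performed */
};

/* Models pci_dev_reset_iommu_prepare(): fence ATS before the reset. */
static int model_reset_iommu_prepare(struct pci_dev_model *pdev)
{
	assert(pdev);
	if (pdev->ats_blocked)
		return -EBUSY;	/* re-entry is not allowed */
	pdev->ats_blocked = 1;
	return 0;
}

/* Models pci_dev_reset_iommu_done(): lift the fence after the reset. */
static void model_reset_iommu_done(struct pci_dev_model *pdev)
{
	assert(pdev);
	pdev->ats_blocked = 0;
}

/* Hypothetical reset method; a real one might issue a Function Level Reset. */
static int model_do_reset(struct pci_dev_model *pdev)
{
	if (!pdev->ats_blocked)
		return -EIO;	/* resetting with ATS live is the case to avoid */
	pdev->reset_count++;
	return 0;
}

/* Enclose the reset between prepare and done; skip done if prepare failed. */
static int model_reset_function(struct pci_dev_model *pdev)
{
	int ret = model_reset_iommu_prepare(pdev);

	if (ret)
		return ret;
	ret = model_do_reset(pdev);
	model_reset_iommu_done(pdev);
	return ret;
}
```

In this model, done is called only after a successful prepare, which mirrors the pairing requirement stated in the kernel-doc of the two functions.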

However, if there is a domain attachment/replacement happening during an
ongoing reset, ATS routines may be re-activated between the two function
calls. So, introduce a new resetting_domain in the iommu_group structure
to reject any concurrent attach_dev/set_dev_pasid call during a reset, out
of concern for compatibility failures. Since this changes the behavior of
an attach operation, update the uAPI accordingly.
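
The role of resetting_domain can be illustrated with a small userspace model. It is a sketch only: `struct group_model`, `struct domain_model`, and the `model_*` helpers are hypothetical names, and the model ignores locking, PASIDs, and the actual hardware attach.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's iommu_domain or iommu_group. */
struct domain_model { const char *name; };

struct group_model {
	struct domain_model *blocking_domain;
	struct domain_model *resetting_domain;	/* non-NULL only during a reset */
	struct domain_model *domain;		/* domain attached before the reset */
};

/* Models an attach: rejected with -EBUSY while a reset is in flight. */
static int model_attach(struct group_model *group,
			struct domain_model *new_domain)
{
	assert(group && new_domain);
	if (group->resetting_domain)
		return -EBUSY;
	group->domain = new_domain;
	return 0;
}

/* Models the "old domain" lookup: the physical domain wins during a reset. */
static struct domain_model *model_old_domain(struct group_model *group)
{
	if (group->resetting_domain)
		return group->resetting_domain;
	return group->domain;
}

/* Models prepare: stage at the blocking domain, keep group->domain intact. */
static int model_prepare(struct group_model *group)
{
	if (group->resetting_domain)
		return -EBUSY;	/* re-entry is not allowed */
	group->resetting_domain = group->blocking_domain;
	return 0;
}

/* Models done: clearing resetting_domain re-enables attach operations. */
static void model_done(struct group_model *group)
{
	group->resetting_domain = NULL;
}
```

Because the flag lives in the group rather than in the device, an attach for any device sharing the group is rejected for the duration of the reset.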

Note that there are two corner cases:
 1. Devices in the same iommu_group
    Since an attachment is always per iommu_group, this means that any
    sibling devices in the iommu_group cannot change domain, to prevent
    race conditions.
 2. An SR-IOV PF that is being reset while its VF is not
    In such a case, the VF itself is already broken. So, there is no point
    in preventing the PF from going through the iommu reset.

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

Authored by Nicolin Chen and committed by Joerg Roedel
c279e839 a75b2be2

+190
+173
drivers/iommu/iommu.c
···
 	int id;
 	struct iommu_domain *default_domain;
 	struct iommu_domain *blocking_domain;
+	/*
+	 * During a group device reset, @resetting_domain points to the physical
+	 * domain, while @domain points to the attached domain before the reset.
+	 */
+	struct iommu_domain *resetting_domain;
 	struct iommu_domain *domain;
 	struct list_head entry;
 	unsigned int owner_cnt;
···

 	guard(mutex)(&dev->iommu_group->mutex);

+	/*
+	 * This is a concurrent attach during a device reset. Reject it until
+	 * pci_dev_reset_iommu_done() attaches the device to group->domain.
+	 *
+	 * Note that this might fail the iommu_dma_map(). But there's nothing
+	 * more we can do here.
+	 */
+	if (dev->iommu_group->resetting_domain)
+		return -EBUSY;
 	return __iommu_attach_device(domain, dev, NULL);
 }

···
 	struct iommu_group *group = dev->iommu_group;

 	lockdep_assert_held(&group->mutex);
+
+	/*
+	 * Driver handles the low-level __iommu_attach_device(), including the
+	 * one invoked by pci_dev_reset_iommu_done() re-attaching the device to
+	 * the cached group->domain. In this case, the driver must get the old
+	 * domain from group->resetting_domain rather than group->domain. This
+	 * prevents it from re-attaching the device from group->domain (old) to
+	 * group->domain (new).
+	 */
+	if (group->resetting_domain)
+		return group->resetting_domain;

 	return group->domain;
 }
···

 	if (WARN_ON(!new_domain))
 		return -EINVAL;
+
+	/*
+	 * This is a concurrent attach during a device reset. Reject it until
+	 * pci_dev_reset_iommu_done() attaches the device to group->domain.
+	 */
+	if (group->resetting_domain)
+		return -EBUSY;

 	/*
 	 * Changing the domain is done by calling attach_dev() on the new
···
 		return -EINVAL;

 	mutex_lock(&group->mutex);
+
+	/*
+	 * This is a concurrent attach during a device reset. Reject it until
+	 * pci_dev_reset_iommu_done() attaches the device to group->domain.
+	 */
+	if (group->resetting_domain) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
 	for_each_group_device(group, device) {
 		/*
 		 * Skip PASID validation for devices without PASID support
···
 		return -EINVAL;

 	mutex_lock(&group->mutex);
+
+	/*
+	 * This is a concurrent attach during a device reset. Reject it until
+	 * pci_dev_reset_iommu_done() attaches the device to group->domain.
+	 */
+	if (group->resetting_domain) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
 	entry = iommu_make_pasid_array_entry(domain, handle);
 	curr = xa_cmpxchg(&group->pasid_array, pasid, NULL,
 			  XA_ZERO_ENTRY, GFP_KERNEL);
···
 	return ret;
 }
 EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
+
+/**
+ * pci_dev_reset_iommu_prepare() - Block IOMMU to prepare for a PCI device reset
+ * @pdev: PCI device that is going to enter a reset routine
+ *
+ * The PCIe r6.0, sec 10.3.1 IMPLEMENTATION NOTE recommends to disable and block
+ * ATS before initiating a reset. This means that a PCIe device during the reset
+ * routine wants to block any IOMMU activity: translation and ATS invalidation.
+ *
+ * This function attaches the device's RID/PASID(s) to group->blocking_domain,
+ * setting the group->resetting_domain. This allows the IOMMU driver to pause
+ * any IOMMU activity while leaving the group->domain pointer intact. Later,
+ * when the reset is finished, pci_dev_reset_iommu_done() can restore
+ * everything.
+ *
+ * Caller must use pci_dev_reset_iommu_prepare() with pci_dev_reset_iommu_done()
+ * before/after the core-level reset routine, to unset the resetting_domain.
+ *
+ * Return: 0 on success or negative error code if the preparation failed.
+ *
+ * These two functions are designed to be used by PCI reset functions that would
+ * not invoke any racy iommu_release_device(), since PCI sysfs node gets removed
+ * before it notifies with a BUS_NOTIFY_REMOVED_DEVICE. When using them in other
+ * cases, callers must ensure there will be no racy iommu_release_device() call,
+ * which otherwise would UAF the dev->iommu_group pointer.
+ */
+int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
+{
+	struct iommu_group *group = pdev->dev.iommu_group;
+	unsigned long pasid;
+	void *entry;
+	int ret;
+
+	if (!pci_ats_supported(pdev) || !dev_has_iommu(&pdev->dev))
+		return 0;
+
+	guard(mutex)(&group->mutex);
+
+	/* Re-entry is not allowed */
+	if (WARN_ON(group->resetting_domain))
+		return -EBUSY;
+
+	ret = __iommu_group_alloc_blocking_domain(group);
+	if (ret)
+		return ret;
+
+	/* Stage RID domain at blocking_domain while retaining group->domain */
+	if (group->domain != group->blocking_domain) {
+		ret = __iommu_attach_device(group->blocking_domain, &pdev->dev,
+					    group->domain);
+		if (ret)
+			return ret;
+	}
+
+	/*
+	 * Stage PASID domains at blocking_domain while retaining pasid_array.
+	 *
+	 * The pasid_array is mostly fenced by group->mutex, except one reader
+	 * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
+	 */
+	xa_for_each_start(&group->pasid_array, pasid, entry, 1)
+		iommu_remove_dev_pasid(&pdev->dev, pasid,
+				       pasid_array_entry_to_domain(entry));
+
+	group->resetting_domain = group->blocking_domain;
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
+
+/**
+ * pci_dev_reset_iommu_done() - Restore IOMMU after a PCI device reset is done
+ * @pdev: PCI device that has finished a reset routine
+ *
+ * After a PCIe device finishes a reset routine, it wants to restore its IOMMU
+ * activity, including new translation as well as cache invalidation, by
+ * re-attaching all RID/PASID of the device back to the domains retained in
+ * the core-level structure.
+ *
+ * Caller must pair it with a successful pci_dev_reset_iommu_prepare().
+ *
+ * Note that, although unlikely, there is a risk that re-attaching domains might
+ * fail due to some unexpected happening like OOM.
+ */
+void pci_dev_reset_iommu_done(struct pci_dev *pdev)
+{
+	struct iommu_group *group = pdev->dev.iommu_group;
+	unsigned long pasid;
+	void *entry;
+
+	if (!pci_ats_supported(pdev) || !dev_has_iommu(&pdev->dev))
+		return;
+
+	guard(mutex)(&group->mutex);
+
+	/* pci_dev_reset_iommu_prepare() was bypassed for the device */
+	if (!group->resetting_domain)
+		return;
+
+	/* pci_dev_reset_iommu_prepare() was not successfully called */
+	if (WARN_ON(!group->blocking_domain))
+		return;
+
+	/* Re-attach RID domain back to group->domain */
+	if (group->domain != group->blocking_domain) {
+		WARN_ON(__iommu_attach_device(group->domain, &pdev->dev,
+					      group->blocking_domain));
+	}
+
+	/*
+	 * Re-attach PASID domains back to the domains retained in pasid_array.
+	 *
+	 * The pasid_array is mostly fenced by group->mutex, except one reader
+	 * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
+	 */
+	xa_for_each_start(&group->pasid_array, pasid, entry, 1)
+		WARN_ON(__iommu_set_group_pasid(
+			pasid_array_entry_to_domain(entry), group, pasid,
+			group->blocking_domain));
+
+	group->resetting_domain = NULL;
+}
+EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_done);

 #if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
 /**
+13
include/linux/iommu.h
···
 			       struct device *dev, ioasid_t pasid);
 ioasid_t iommu_alloc_global_pasid(struct device *dev);
 void iommu_free_global_pasid(ioasid_t pasid);
+
+/* PCI device reset functions */
+int pci_dev_reset_iommu_prepare(struct pci_dev *pdev);
+void pci_dev_reset_iommu_done(struct pci_dev *pdev);
 #else /* CONFIG_IOMMU_API */

 struct iommu_ops {};
···
 }

 static inline void iommu_free_global_pasid(ioasid_t pasid) {}
+
+static inline int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
+{
+	return 0;
+}
+
+static inline void pci_dev_reset_iommu_done(struct pci_dev *pdev)
+{
+}
 #endif /* CONFIG_IOMMU_API */

 #ifdef CONFIG_IRQ_MSI_IOMMU
+4
include/uapi/linux/vfio.h
···
  * hwpt corresponding to the given pt_id.
  *
  * Return: 0 on success, -errno on failure.
+ *
+ * When a device is resetting, -EBUSY will be returned to reject any concurrent
+ * attachment to the resetting device itself or any sibling device in the IOMMU
+ * group having the resetting device.
  */
 struct vfio_device_attach_iommufd_pt {
 	__u32 argsz;
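
Since the attach uAPI can now fail with EBUSY while a reset is in flight, a userspace caller may want to retry a bounded number of times. The sketch below shows one possible shape, assuming retrying is appropriate for the caller at all. `attach_fn`, `attach_with_retry()`, and `fake_attach()` are hypothetical names; a real caller would wrap its own attach ioctl in the callback and would likely wait between attempts.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical callback type: returns 0 on success, or -1 with errno set,
 * following the usual ioctl() convention.
 */
typedef int (*attach_fn)(void *ctx);

/* Retry the attach while it fails with EBUSY, up to max_tries attempts. */
static int attach_with_retry(attach_fn attach, void *ctx, int max_tries)
{
	int ret = -1;

	assert(attach && max_tries > 0);
	for (int i = 0; i < max_tries; i++) {
		ret = attach(ctx);
		if (ret == 0 || errno != EBUSY)
			break;
		/* A real caller would likely sleep or wait for the reset here. */
	}
	return ret;
}

/* Fake attach for demonstration: fails with EBUSY until *ctx reaches 0. */
static int fake_attach(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		errno = EBUSY;
		return -1;
	}
	return 0;
}
```

Any error other than EBUSY is returned to the caller immediately, so only the reset-in-progress case is retried.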