Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'vfio-v6.3-rc1' of https://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

- Remove redundant resource check in vfio-platform (Angus Chen)

- Use GFP_KERNEL_ACCOUNT for persistent userspace allocations, allowing
removal of arbitrary kernel limits in favor of cgroup control (Yishai
Hadas)

- mdev tidy-ups, including removing the module-only build restriction
for sample drivers, Kconfig changes to select mdev support,
documentation movement to keep sample driver usage instructions with
sample drivers rather than with API docs, remove references to
out-of-tree drivers in docs (Christoph Hellwig)

- Fix collateral breakages from mdev Kconfig changes (Arnd Bergmann)

- Make mlx5 migration support match device support, improve source and
target flows to improve pre-copy support and reduce downtime (Yishai
Hadas)

- Convert additional mdev sysfs case to use sysfs_emit() (Bo Liu)

- Resolve copy-paste error in mdev mbochs sample driver Kconfig (Ye
Xingchen)

- Avoid propagating missing reset error in vfio-platform if reset
requirement is relaxed by module option (Tomasz Duszynski)

- Range size fixes in mlx5 variant driver for missed last byte and
stricter range calculation (Yishai Hadas)

- Fixes to suspended vaddr support and locked_vm accounting: exclude mdev
  configurations from the former due to the potential to indefinitely
  block kernel threads, fix an underflow, and restore locked_vm on a new
  mm (Steve Sistare)

- Update outdated vfio documentation due to new IOMMUFD interfaces in
recent kernels (Yi Liu)

- Resolve deadlock between group_lock and kvm_lock, finally (Matthew
Rosato)

- Fix NULL pointer in group initialization error path with IOMMUFD (Yan
Zhao)

* tag 'vfio-v6.3-rc1' of https://github.com/awilliam/linux-vfio: (32 commits)
vfio: Fix NULL pointer dereference caused by uninitialized group->iommufd
docs: vfio: Update vfio.rst per latest interfaces
vfio: Update the kdoc for vfio_device_ops
vfio/mlx5: Fix range size calculation upon tracker creation
vfio: no need to pass kvm pointer during device open
vfio: fix deadlock between group lock and kvm lock
vfio: revert "iommu driver notify callback"
vfio/type1: revert "implement notify callback"
vfio/type1: revert "block on invalid vaddr"
vfio/type1: restore locked_vm
vfio/type1: track locked_vm per dma
vfio/type1: prevent underflow of locked_vm via exec()
vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR
vfio: platform: ignore missing reset if disabled at module init
vfio/mlx5: Improve the target side flow to reduce downtime
vfio/mlx5: Improve the source side flow upon pre_copy
vfio/mlx5: Check whether VF is migratable
samples: fix the prompt about SAMPLE_VFIO_MDEV_MBOCHS
vfio/mdev: Use sysfs_emit() to instead of sprintf()
vfio-mdev: add back CONFIG_VFIO dependency
...

+756 -422
Documentation/driver-api/vfio-mediated-device.rst (+1 -107)
···
 |  mdev.ko  |
 | +-----------+ |  mdev_register_parent() +--------------+
 | |           | +<------------------------+              |
-| |           | |                         |  nvidia.ko   |<-> physical
+| |           | |                         | ccw_device.ko|<-> physical
 | |           | +------------------------>+              |  device
 | |           | |        callbacks        +--------------+
 | |  Physical | |
 | |   device  | |  mdev_register_parent() +--------------+
 | | interface | |<------------------------+              |
 | |           | |                         |   i915.ko    |<-> physical
-| |           | +------------------------>+              |  device
-| |           | |        callbacks        +--------------+
-| |           | |
-| |           | |  mdev_register_parent() +--------------+
-| |           | +<------------------------+              |
-| |           | |                         | ccw_device.ko|<-> physical
 | |           | +------------------------>+              |  device
 | |           | |        callbacks        +--------------+
 | +-----------+ |
···
 these callbacks are supported in the TYPE1 IOMMU module. To enable them for
 other IOMMU backend modules, such as PPC64 sPAPR module, they need to provide
 these two callback functions.
-
-Using the Sample Code
-=====================
-
-mtty.c in samples/vfio-mdev/ directory is a sample driver program to
-demonstrate how to use the mediated device framework.
-
-The sample driver creates an mdev device that simulates a serial port over a PCI
-card.
-
-1. Build and load the mtty.ko module.
-
-   This step creates a dummy device, /sys/devices/virtual/mtty/mtty/
-
-   Files in this device directory in sysfs are similar to the following::
-
-     # tree /sys/devices/virtual/mtty/mtty/
-     /sys/devices/virtual/mtty/mtty/
-     |-- mdev_supported_types
-     |   |-- mtty-1
-     |   |   |-- available_instances
-     |   |   |-- create
-     |   |   |-- device_api
-     |   |   |-- devices
-     |   |   `-- name
-     |   `-- mtty-2
-     |       |-- available_instances
-     |       |-- create
-     |       |-- device_api
-     |       |-- devices
-     |       `-- name
-     |-- mtty_dev
-     |   `-- sample_mtty_dev
-     |-- power
-     |   |-- autosuspend_delay_ms
-     |   |-- control
-     |   |-- runtime_active_time
-     |   |-- runtime_status
-     |   `-- runtime_suspended_time
-     |-- subsystem -> ../../../../class/mtty
-     `-- uevent
-
-2. Create a mediated device by using the dummy device that you created in the
-   previous step::
-
-     # echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" > \
-              /sys/devices/virtual/mtty/mtty/mdev_supported_types/mtty-2/create
-
-3. Add parameters to qemu-kvm::
-
-     -device vfio-pci,\
-      sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
-
-4. Boot the VM.
-
-   In the Linux guest VM, with no hardware on the host, the device appears
-   as follows::
-
-     # lspci -s 00:05.0 -xxvv
-     00:05.0 Serial controller: Device 4348:3253 (rev 10) (prog-if 02 [16550])
-             Subsystem: Device 4348:3253
-             Physical Slot: 5
-             Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
-             Stepping- SERR- FastB2B- DisINTx-
-             Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
-             <TAbort- <MAbort- >SERR- <PERR- INTx-
-             Interrupt: pin A routed to IRQ 10
-             Region 0: I/O ports at c150 [size=8]
-             Region 1: I/O ports at c158 [size=8]
-             Kernel driver in use: serial
-     00: 48 43 53 32 01 00 00 02 10 02 00 07 00 00 00 00
-     10: 51 c1 00 00 59 c1 00 00 00 00 00 00 00 00 00 00
-     20: 00 00 00 00 00 00 00 00 00 00 00 00 48 43 53 32
-     30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 01 00 00
-
-   In the Linux guest VM, dmesg output for the device is as follows:
-
-     serial 0000:00:05.0: PCI INT A -> Link[LNKA] -> GSI 10 (level, high) -> IRQ 10
-     0000:00:05.0: ttyS1 at I/O 0xc150 (irq = 10) is a 16550A
-     0000:00:05.0: ttyS2 at I/O 0xc158 (irq = 10) is a 16550A
-
-
-5. In the Linux guest VM, check the serial ports::
-
-     # setserial -g /dev/ttyS*
-     /dev/ttyS0, UART: 16550A, Port: 0x03f8, IRQ: 4
-     /dev/ttyS1, UART: 16550A, Port: 0xc150, IRQ: 10
-     /dev/ttyS2, UART: 16550A, Port: 0xc158, IRQ: 10
-
-6. Using minicom or any terminal emulation program, open port /dev/ttyS1 or
-   /dev/ttyS2 with hardware flow control disabled.
-
-7. Type data on the minicom terminal or send data to the terminal emulation
-   program and read the data.
-
-   Data is loop backed from hosts mtty driver.
-
-8. Destroy the mediated device that you created::
-
-     # echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001/remove
-
 
 References
 ==========
Documentation/driver-api/vfio.rst (+60 -22)
···
 
 VFIO bus drivers, such as vfio-pci make use of only a few interfaces
 into VFIO core.  When devices are bound and unbound to the driver,
-the driver should call vfio_register_group_dev() and
-vfio_unregister_group_dev() respectively::
+Following interfaces are called when devices are bound to and
+unbound from the driver::
 
-	void vfio_init_group_dev(struct vfio_device *device,
-				struct device *dev,
-				const struct vfio_device_ops *ops);
-	void vfio_uninit_group_dev(struct vfio_device *device);
 	int vfio_register_group_dev(struct vfio_device *device);
+	int vfio_register_emulated_iommu_dev(struct vfio_device *device);
 	void vfio_unregister_group_dev(struct vfio_device *device);
 
-The driver should embed the vfio_device in its own structure and call
-vfio_init_group_dev() to pre-configure it before going to registration
-and call vfio_uninit_group_dev() after completing the un-registration.
+The driver should embed the vfio_device in its own structure and use
+vfio_alloc_device() to allocate the structure, and can register
+@init/@release callbacks to manage any private state wrapping the
+vfio_device::
+
+	vfio_alloc_device(dev_struct, member, dev, ops);
+	void vfio_put_device(struct vfio_device *device);
+
 vfio_register_group_dev() indicates to the core to begin tracking the
 iommu_group of the specified dev and register the dev as owned by a VFIO bus
 driver.
 Once vfio_register_group_dev() returns it is possible for userspace to
···
 similar to a file operations structure::
 
 	struct vfio_device_ops {
-		int	(*open)(struct vfio_device *vdev);
+		char	*name;
+		int	(*init)(struct vfio_device *vdev);
 		void	(*release)(struct vfio_device *vdev);
+		int	(*bind_iommufd)(struct vfio_device *vdev,
+					struct iommufd_ctx *ictx, u32 *out_device_id);
+		void	(*unbind_iommufd)(struct vfio_device *vdev);
+		int	(*attach_ioas)(struct vfio_device *vdev, u32 *pt_id);
+		int	(*open_device)(struct vfio_device *vdev);
+		void	(*close_device)(struct vfio_device *vdev);
 		ssize_t	(*read)(struct vfio_device *vdev, char __user *buf,
 				size_t count, loff_t *ppos);
-		ssize_t	(*write)(struct vfio_device *vdev,
-				 const char __user *buf,
-				 size_t size, loff_t *ppos);
+		ssize_t	(*write)(struct vfio_device *vdev, const char __user *buf,
+				 size_t count, loff_t *size);
 		long	(*ioctl)(struct vfio_device *vdev, unsigned int cmd,
 				 unsigned long arg);
-		int	(*mmap)(struct vfio_device *vdev,
-				struct vm_area_struct *vma);
+		int	(*mmap)(struct vfio_device *vdev, struct vm_area_struct *vma);
+		void	(*request)(struct vfio_device *vdev, unsigned int count);
+		int	(*match)(struct vfio_device *vdev, char *buf);
+		void	(*dma_unmap)(struct vfio_device *vdev, u64 iova, u64 length);
+		int	(*device_feature)(struct vfio_device *device, u32 flags,
+					  void __user *arg, size_t argsz);
 	};
 
 Each function is passed the vdev that was originally registered
-in the vfio_register_group_dev() call above.  This allows the bus driver
-to obtain its private data using container_of().  The open/release
-callbacks are issued when a new file descriptor is created for a
-device (via VFIO_GROUP_GET_DEVICE_FD).  The ioctl interface provides
-a direct pass through for VFIO_DEVICE_* ioctls.  The read/write/mmap
-interfaces implement the device region access defined by the device's
-own VFIO_DEVICE_GET_REGION_INFO ioctl.
+in the vfio_register_group_dev() or vfio_register_emulated_iommu_dev()
+call above.  This allows the bus driver to obtain its private data using
+container_of().
 
+::
+
+	- The init/release callbacks are issued when vfio_device is initialized
+	  and released.
+
+	- The open/close device callbacks are issued when the first
+	  instance of a file descriptor for the device is created (eg.
+	  via VFIO_GROUP_GET_DEVICE_FD) for a user session.
+
+	- The ioctl callback provides a direct pass through for some VFIO_DEVICE_*
+	  ioctls.
+
+	- The [un]bind_iommufd callbacks are issued when the device is bound to
+	  and unbound from iommufd.
+
+	- The attach_ioas callback is issued when the device is attached to an
+	  IOAS managed by the bound iommufd. The attached IOAS is automatically
+	  detached when the device is unbound from iommufd.
+
+	- The read/write/mmap callbacks implement the device region access defined
+	  by the device's own VFIO_DEVICE_GET_REGION_INFO ioctl.
+
+	- The request callback is issued when device is going to be unregistered,
+	  such as when trying to unbind the device from the vfio bus driver.
+
+	- The dma_unmap callback is issued when a range of iovas are unmapped
+	  in the container or IOAS attached by the device. Drivers which make
+	  use of the vfio page pinning interface must implement this callback in
+	  order to unpin pages within the dma_unmap range. Drivers must tolerate
+	  this callback even before calls to open_device().
 
 PPC64 sPAPR implementation note
 -------------------------------
Documentation/s390/vfio-ap.rst (-1)
···
 * ZCRYPT
 * S390_AP_IOMMU
 * VFIO
-* VFIO_MDEV
 * KVM
 
 If using make menuconfig select the following to build the vfio_ap module::
MAINTAINERS (-1)
···
 
 VFIO DRIVER
 M:	Alex Williamson <alex.williamson@redhat.com>
-R:	Cornelia Huck <cohuck@redhat.com>
 L:	kvm@vger.kernel.org
 S:	Maintained
 T:	git https://github.com/awilliam/linux-vfio.git
arch/s390/Kconfig (+6 -2)
···
 config VFIO_CCW
 	def_tristate n
 	prompt "Support for VFIO-CCW subchannels"
-	depends on S390_CCW_IOMMU && VFIO_MDEV
+	depends on S390_CCW_IOMMU
+	depends on VFIO
+	select VFIO_MDEV
 	help
 	  This driver allows usage of I/O subchannels via VFIO-CCW.
 
···
 config VFIO_AP
 	def_tristate n
 	prompt "VFIO support for AP devices"
-	depends on S390_AP_IOMMU && VFIO_MDEV && KVM
+	depends on S390_AP_IOMMU && KVM
+	depends on VFIO
 	depends on ZCRYPT
+	select VFIO_MDEV
 	help
 	  This driver grants access to Adjunct Processor (AP) devices
 	  via the VFIO mediated device interface.
arch/s390/configs/debug_defconfig (-1)
···
 CONFIG_VFIO=m
 CONFIG_VFIO_PCI=m
 CONFIG_MLX5_VFIO_PCI=m
-CONFIG_VFIO_MDEV=m
 CONFIG_VIRTIO_PCI=m
 CONFIG_VIRTIO_BALLOON=m
 CONFIG_VIRTIO_INPUT=y
arch/s390/configs/defconfig (-1)
···
 CONFIG_VFIO=m
 CONFIG_VFIO_PCI=m
 CONFIG_MLX5_VFIO_PCI=m
-CONFIG_VFIO_MDEV=m
 CONFIG_VIRTIO_PCI=m
 CONFIG_VIRTIO_BALLOON=m
 CONFIG_VIRTIO_INPUT=y
drivers/gpu/drm/i915/Kconfig (+2 -1)
···
 	depends on X86
 	depends on 64BIT
 	depends on KVM
-	depends on VFIO_MDEV
+	depends on VFIO
 	select DRM_I915_GVT
 	select KVM_EXTERNAL_WRITE_TRACKING
+	select VFIO_MDEV
 
 	help
 	  Choose this option if you want to enable Intel GVT-g graphics
drivers/vfio/container.c (+1 -6)
···
 {
 	struct vfio_container *container;
 
-	container = kzalloc(sizeof(*container), GFP_KERNEL);
+	container = kzalloc(sizeof(*container), GFP_KERNEL_ACCOUNT);
 	if (!container)
 		return -ENOMEM;
 
···
 static int vfio_fops_release(struct inode *inode, struct file *filep)
 {
 	struct vfio_container *container = filep->private_data;
-	struct vfio_iommu_driver *driver = container->iommu_driver;
-
-	if (driver && driver->ops->notify)
-		driver->ops->notify(container->iommu_data,
-				    VFIO_IOMMU_CONTAINER_CLOSE);
 
 	filep->private_data = NULL;
 
drivers/vfio/fsl-mc/vfio_fsl_mc.c (+1 -1)
···
 	int i;
 
 	vdev->regions = kcalloc(count, sizeof(struct vfio_fsl_mc_region),
-				GFP_KERNEL);
+				GFP_KERNEL_ACCOUNT);
 	if (!vdev->regions)
 		return -ENOMEM;
 
drivers/vfio/fsl-mc/vfio_fsl_mc_intr.c (+2 -2)
···
 
 	irq_count = mc_dev->obj_desc.irq_count;
 
-	mc_irq = kcalloc(irq_count, sizeof(*mc_irq), GFP_KERNEL);
+	mc_irq = kcalloc(irq_count, sizeof(*mc_irq), GFP_KERNEL_ACCOUNT);
 	if (!mc_irq)
 		return -ENOMEM;
 
···
 	if (fd < 0) /* Disable only */
 		return 0;
 
-	irq->name = kasprintf(GFP_KERNEL, "vfio-irq[%d](%s)",
+	irq->name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-irq[%d](%s)",
 			      hwirq, dev_name(&vdev->mc_dev->dev));
 	if (!irq->name)
 		return -ENOMEM;
drivers/vfio/group.c (+38 -8)
···
 	ret = iommufd_vfio_compat_ioas_create(iommufd);
 
 	if (ret) {
-		iommufd_ctx_put(group->iommufd);
+		iommufd_ctx_put(iommufd);
 		goto out_unlock;
 	}
 
···
 	return ret;
 }
 
+static void vfio_device_group_get_kvm_safe(struct vfio_device *device)
+{
+	spin_lock(&device->group->kvm_ref_lock);
+	if (!device->group->kvm)
+		goto unlock;
+
+	_vfio_device_get_kvm_safe(device, device->group->kvm);
+
+unlock:
+	spin_unlock(&device->group->kvm_ref_lock);
+}
+
 static int vfio_device_group_open(struct vfio_device *device)
 {
 	int ret;
···
 		goto out_unlock;
 	}
 
+	mutex_lock(&device->dev_set->lock);
+
 	/*
-	 * Here we pass the KVM pointer with the group under the lock.  If the
-	 * device driver will use it, it must obtain a reference and release it
-	 * during close_device.
+	 * Before the first device open, get the KVM pointer currently
+	 * associated with the group (if there is one) and obtain a reference
+	 * now that will be held until the open_count reaches 0 again.  Save
+	 * the pointer in the device for use by drivers.
 	 */
-	ret = vfio_device_open(device, device->group->iommufd,
-			       device->group->kvm);
+	if (device->open_count == 0)
+		vfio_device_group_get_kvm_safe(device);
+
+	ret = vfio_device_open(device, device->group->iommufd);
+
+	if (device->open_count == 0)
+		vfio_device_put_kvm(device);
+
+	mutex_unlock(&device->dev_set->lock);
 
 out_unlock:
 	mutex_unlock(&device->group->group_lock);
···
 void vfio_device_group_close(struct vfio_device *device)
 {
 	mutex_lock(&device->group->group_lock);
+	mutex_lock(&device->dev_set->lock);
+
 	vfio_device_close(device, device->group->iommufd);
+
+	if (device->open_count == 0)
+		vfio_device_put_kvm(device);
+
+	mutex_unlock(&device->dev_set->lock);
 	mutex_unlock(&device->group->group_lock);
 }
···
 	refcount_set(&group->drivers, 1);
 	mutex_init(&group->group_lock);
+	spin_lock_init(&group->kvm_ref_lock);
 	INIT_LIST_HEAD(&group->device_list);
 	mutex_init(&group->device_lock);
 	group->iommu_group = iommu_group;
···
 	if (!vfio_file_is_group(file))
 		return;
 
-	mutex_lock(&group->group_lock);
+	spin_lock(&group->kvm_ref_lock);
 	group->kvm = kvm;
-	mutex_unlock(&group->group_lock);
+	spin_unlock(&group->kvm_ref_lock);
 }
 EXPORT_SYMBOL_GPL(vfio_file_set_kvm);
 
drivers/vfio/mdev/Kconfig (+1 -7)
···
 # SPDX-License-Identifier: GPL-2.0-only
 
 config VFIO_MDEV
-	tristate "Mediated device driver framework"
-	default n
-	help
-	  Provides a framework to virtualize devices.
-	  See Documentation/driver-api/vfio-mediated-device.rst for more details.
-
-	  If you don't know what do here, say N.
+	tristate
drivers/vfio/mdev/mdev_sysfs.c (+1 -1)
···
 static ssize_t name_show(struct mdev_type *mtype,
 			 struct mdev_type_attribute *attr, char *buf)
 {
-	return sprintf(buf, "%s\n",
+	return sysfs_emit(buf, "%s\n",
 		       mtype->pretty_name ? mtype->pretty_name : mtype->sysfs_name);
 }
 
drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c (+2 -2)
···
 {
 	struct hisi_acc_vf_migration_file *migf;
 
-	migf = kzalloc(sizeof(*migf), GFP_KERNEL);
+	migf = kzalloc(sizeof(*migf), GFP_KERNEL_ACCOUNT);
 	if (!migf)
 		return ERR_PTR(-ENOMEM);
 
···
 	struct hisi_acc_vf_migration_file *migf;
 	int ret;
 
-	migf = kzalloc(sizeof(*migf), GFP_KERNEL);
+	migf = kzalloc(sizeof(*migf), GFP_KERNEL_ACCOUNT);
 	if (!migf)
 		return ERR_PTR(-ENOMEM);
 
drivers/vfio/pci/mlx5/cmd.c (+60 -19)
···
 
 enum { CQ_OK = 0, CQ_EMPTY = -1, CQ_POLL_ERR = -2 };
 
+static int mlx5vf_is_migratable(struct mlx5_core_dev *mdev, u16 func_id)
+{
+	int query_sz = MLX5_ST_SZ_BYTES(query_hca_cap_out);
+	void *query_cap = NULL, *cap;
+	int ret;
+
+	query_cap = kzalloc(query_sz, GFP_KERNEL);
+	if (!query_cap)
+		return -ENOMEM;
+
+	ret = mlx5_vport_get_other_func_cap(mdev, func_id, query_cap,
+					    MLX5_CAP_GENERAL_2);
+	if (ret)
+		goto out;
+
+	cap = MLX5_ADDR_OF(query_hca_cap_out, query_cap, capability);
+	if (!MLX5_GET(cmd_hca_cap_2, cap, migratable))
+		ret = -EOPNOTSUPP;
+out:
+	kfree(query_cap);
+	return ret;
+}
+
 static int mlx5vf_cmd_get_vhca_id(struct mlx5_core_dev *mdev, u16 function_id,
 				  u16 *vhca_id);
 static void
···
 	if (mvdev->vf_id < 0)
 		goto end;
 
+	ret = mlx5vf_is_migratable(mvdev->mdev, mvdev->vf_id + 1);
+	if (ret)
+		goto end;
+
 	if (mlx5vf_cmd_get_vhca_id(mvdev->mdev, mvdev->vf_id + 1,
 				   &mvdev->vhca_id))
 		goto end;
···
 	struct mlx5_vhca_data_buffer *buf;
 	int ret;
 
-	buf = kzalloc(sizeof(*buf), GFP_KERNEL);
+	buf = kzalloc(sizeof(*buf), GFP_KERNEL_ACCOUNT);
 	if (!buf)
 		return ERR_PTR(-ENOMEM);
 
···
 }
 
 static int add_buf_header(struct mlx5_vhca_data_buffer *header_buf,
-			  size_t image_size)
+			  size_t image_size, bool initial_pre_copy)
 {
 	struct mlx5_vf_migration_file *migf = header_buf->migf;
 	struct mlx5_vf_migration_header header = {};
···
 	struct page *page;
 	u8 *to_buff;
 
-	header.image_size = cpu_to_le64(image_size);
+	header.record_size = cpu_to_le64(image_size);
+	header.flags = cpu_to_le32(MLX5_MIGF_HEADER_FLAGS_TAG_MANDATORY);
+	header.tag = cpu_to_le32(MLX5_MIGF_HEADER_TAG_FW_DATA);
 	page = mlx5vf_get_migration_page(header_buf, 0);
 	if (!page)
 		return -EINVAL;
···
 	memcpy(to_buff, &header, sizeof(header));
 	kunmap_local(to_buff);
 	header_buf->length = sizeof(header);
-	header_buf->header_image_size = image_size;
 	header_buf->start_pos = header_buf->migf->max_pos;
 	migf->max_pos += header_buf->length;
 	spin_lock_irqsave(&migf->list_lock, flags);
 	list_add_tail(&header_buf->buf_elm, &migf->buf_list);
 	spin_unlock_irqrestore(&migf->list_lock, flags);
+	if (initial_pre_copy)
+		migf->pre_copy_initial_bytes += sizeof(header);
 	return 0;
 }
···
 	if (!status) {
 		size_t image_size;
 		unsigned long flags;
+		bool initial_pre_copy = migf->state != MLX5_MIGF_STATE_PRE_COPY &&
+				!async_data->last_chunk;
 
 		image_size = MLX5_GET(save_vhca_state_out, async_data->out,
 				      actual_image_size);
 		if (async_data->header_buf) {
-			status = add_buf_header(async_data->header_buf, image_size);
+			status = add_buf_header(async_data->header_buf, image_size,
+						initial_pre_copy);
 			if (status)
 				goto err;
 		}
···
 		spin_lock_irqsave(&migf->list_lock, flags);
 		list_add_tail(&async_data->buf->buf_elm, &migf->buf_list);
 		spin_unlock_irqrestore(&migf->list_lock, flags);
+		if (initial_pre_copy)
+			migf->pre_copy_initial_bytes += image_size;
 		migf->state = async_data->last_chunk ?
 			MLX5_MIGF_STATE_COMPLETE : MLX5_MIGF_STATE_PRE_COPY;
 		wake_up_interruptible(&migf->poll_wait);
···
 	}
 
 	if (MLX5VF_PRE_COPY_SUPP(mvdev)) {
-		header_buf = mlx5vf_get_data_buffer(migf,
-			sizeof(struct mlx5_vf_migration_header), DMA_NONE);
-		if (IS_ERR(header_buf)) {
-			err = PTR_ERR(header_buf);
-			goto err_free;
+		if (async_data->last_chunk && migf->buf_header) {
+			header_buf = migf->buf_header;
+			migf->buf_header = NULL;
+		} else {
+			header_buf = mlx5vf_get_data_buffer(migf,
+				sizeof(struct mlx5_vf_migration_header), DMA_NONE);
+			if (IS_ERR(header_buf)) {
+				err = PTR_ERR(header_buf);
+				goto err_free;
+			}
 		}
 	}
 
···
 	node = interval_tree_iter_first(ranges, 0, ULONG_MAX);
 	for (i = 0; i < num_ranges; i++) {
 		void *addr_range_i_base = range_list_ptr + record_size * i;
-		unsigned long length = node->last - node->start;
+		unsigned long length = node->last - node->start + 1;
 
 		MLX5_SET64(page_track_range, addr_range_i_base, start_address,
 			   node->start);
···
 	}
 
 	WARN_ON(node);
-	log_addr_space_size = ilog2(total_ranges_len);
+	log_addr_space_size = ilog2(roundup_pow_of_two(total_ranges_len));
 	if (log_addr_space_size <
 	    (MLX5_CAP_ADV_VIRTUALIZATION(mdev, pg_track_log_min_addr_space)) ||
 	    log_addr_space_size >
···
 	void *in;
 	int err;
 
-	qp = kzalloc(sizeof(*qp), GFP_KERNEL);
+	qp = kzalloc(sizeof(*qp), GFP_KERNEL_ACCOUNT);
 	if (!qp)
 		return ERR_PTR(-ENOMEM);
 
-	qp->rq.wqe_cnt = roundup_pow_of_two(max_recv_wr);
-	log_rq_stride = ilog2(MLX5_SEND_WQE_DS);
-	log_rq_sz = ilog2(qp->rq.wqe_cnt);
 	err = mlx5_db_alloc_node(mdev, &qp->db, mdev->priv.numa_node);
 	if (err)
 		goto err_free;
 
 	if (max_recv_wr) {
+		qp->rq.wqe_cnt = roundup_pow_of_two(max_recv_wr);
+		log_rq_stride = ilog2(MLX5_SEND_WQE_DS);
+		log_rq_sz = ilog2(qp->rq.wqe_cnt);
 		err = mlx5_frag_buf_alloc_node(mdev,
 			wq_get_byte_sz(log_rq_sz, log_rq_stride),
 			&qp->buf, mdev->priv.numa_node);
···
 	int i;
 
 	recv_buf->page_list = kvcalloc(npages, sizeof(*recv_buf->page_list),
-				       GFP_KERNEL);
+				       GFP_KERNEL_ACCOUNT);
 	if (!recv_buf->page_list)
 		return -ENOMEM;
 
 	for (;;) {
-		filled = alloc_pages_bulk_array(GFP_KERNEL, npages - done,
+		filled = alloc_pages_bulk_array(GFP_KERNEL_ACCOUNT,
+						npages - done,
 						recv_buf->page_list + done);
 		if (!filled)
 			goto err;
···
 
 	recv_buf->dma_addrs = kvcalloc(recv_buf->npages,
 				       sizeof(*recv_buf->dma_addrs),
-				       GFP_KERNEL);
+				       GFP_KERNEL_ACCOUNT);
 	if (!recv_buf->dma_addrs)
 		return -ENOMEM;
 
drivers/vfio/pci/mlx5/cmd.h (+25 -3)
···
 #include <linux/kernel.h>
 #include <linux/vfio_pci_core.h>
 #include <linux/mlx5/driver.h>
+#include <linux/mlx5/vport.h>
 #include <linux/mlx5/cq.h>
 #include <linux/mlx5/qp.h>
 
···
 enum mlx5_vf_load_state {
 	MLX5_VF_LOAD_STATE_READ_IMAGE_NO_HEADER,
 	MLX5_VF_LOAD_STATE_READ_HEADER,
+	MLX5_VF_LOAD_STATE_PREP_HEADER_DATA,
+	MLX5_VF_LOAD_STATE_READ_HEADER_DATA,
 	MLX5_VF_LOAD_STATE_PREP_IMAGE,
 	MLX5_VF_LOAD_STATE_READ_IMAGE,
 	MLX5_VF_LOAD_STATE_LOAD_IMAGE,
 };
 
+struct mlx5_vf_migration_tag_stop_copy_data {
+	__le64 stop_copy_size;
+};
+
+enum mlx5_vf_migf_header_flags {
+	MLX5_MIGF_HEADER_FLAGS_TAG_MANDATORY = 0,
+	MLX5_MIGF_HEADER_FLAGS_TAG_OPTIONAL = 1 << 0,
+};
+
+enum mlx5_vf_migf_header_tag {
+	MLX5_MIGF_HEADER_TAG_FW_DATA = 0,
+	MLX5_MIGF_HEADER_TAG_STOP_COPY_SIZE = 1 << 0,
+};
+
 struct mlx5_vf_migration_header {
-	__le64 image_size;
+	__le64 record_size;
 	/* For future use in case we may need to change the kernel protocol */
-	__le64 flags;
+	__le32 flags; /* Use mlx5_vf_migf_header_flags */
+	__le32 tag; /* Use mlx5_vf_migf_header_tag */
+	__u8 data[]; /* Its size is given in the record_size */
 };
 
 struct mlx5_vhca_data_buffer {
···
 	loff_t start_pos;
 	u64 length;
 	u64 allocated_length;
-	u64 header_image_size;
 	u32 mkey;
 	enum dma_data_direction dma_dir;
 	u8 dmaed:1;
···
 	enum mlx5_vf_load_state load_state;
 	u32 pdn;
 	loff_t max_pos;
+	u64 record_size;
+	u32 record_tag;
+	u64 stop_copy_prep_size;
+	u64 pre_copy_initial_bytes;
 	struct mlx5_vhca_data_buffer *buf;
 	struct mlx5_vhca_data_buffer *buf_header;
 	spinlock_t list_lock;
drivers/vfio/pci/mlx5/main.c (+219 -42)
···
 
 #include "cmd.h"
 
-/* Arbitrary to prevent userspace from consuming endless memory */
-#define MAX_MIGRATION_SIZE (512*1024*1024)
+/* Device specification max LOAD size */
+#define MAX_LOAD_SIZE (BIT_ULL(__mlx5_bit_sz(load_vhca_state_in, size)) - 1)
 
 static struct mlx5vf_pci_core_device *mlx5vf_drvdata(struct pci_dev *pdev)
 {
···
 	int ret;
 
 	to_fill = min_t(unsigned int, npages, PAGE_SIZE / sizeof(*page_list));
-	page_list = kvzalloc(to_fill * sizeof(*page_list), GFP_KERNEL);
+	page_list = kvzalloc(to_fill * sizeof(*page_list), GFP_KERNEL_ACCOUNT);
 	if (!page_list)
 		return -ENOMEM;
 
 	do {
-		filled = alloc_pages_bulk_array(GFP_KERNEL, to_fill, page_list);
+		filled = alloc_pages_bulk_array(GFP_KERNEL_ACCOUNT, to_fill,
+						page_list);
 		if (!filled) {
 			ret = -ENOMEM;
 			goto err;
···
 		ret = sg_alloc_append_table_from_pages(
 			&buf->table, page_list, filled, 0,
 			filled << PAGE_SHIFT, UINT_MAX, SG_MAX_SINGLE_ALLOC,
-			GFP_KERNEL);
+			GFP_KERNEL_ACCOUNT);
 		if (ret)
 			goto err;
···
 	wake_up_interruptible(&migf->poll_wait);
 }
 
+static int mlx5vf_add_stop_copy_header(struct mlx5_vf_migration_file *migf)
+{
+	size_t size = sizeof(struct mlx5_vf_migration_header) +
+		sizeof(struct mlx5_vf_migration_tag_stop_copy_data);
+	struct mlx5_vf_migration_tag_stop_copy_data data = {};
+	struct mlx5_vhca_data_buffer *header_buf = NULL;
+	struct mlx5_vf_migration_header header = {};
+	unsigned long flags;
+	struct page *page;
+	u8 *to_buff;
+	int ret;
+
+	header_buf = mlx5vf_get_data_buffer(migf, size, DMA_NONE);
+	if (IS_ERR(header_buf))
+		return PTR_ERR(header_buf);
+
+	header.record_size = cpu_to_le64(sizeof(data));
+	header.flags = cpu_to_le32(MLX5_MIGF_HEADER_FLAGS_TAG_OPTIONAL);
+	header.tag = cpu_to_le32(MLX5_MIGF_HEADER_TAG_STOP_COPY_SIZE);
+	page = mlx5vf_get_migration_page(header_buf, 0);
+	if (!page) {
+		ret = -EINVAL;
+		goto err;
+	}
+	to_buff = kmap_local_page(page);
+	memcpy(to_buff, &header, sizeof(header));
+	header_buf->length = sizeof(header);
+	data.stop_copy_size = cpu_to_le64(migf->buf->allocated_length);
+	memcpy(to_buff + sizeof(header), &data, sizeof(data));
+	header_buf->length += sizeof(data);
+	kunmap_local(to_buff);
+	header_buf->start_pos = header_buf->migf->max_pos;
+	migf->max_pos += header_buf->length;
+	spin_lock_irqsave(&migf->list_lock, flags);
+	list_add_tail(&header_buf->buf_elm, &migf->buf_list);
+	spin_unlock_irqrestore(&migf->list_lock, flags);
+	migf->pre_copy_initial_bytes = size;
+	return 0;
+err:
+	mlx5vf_put_data_buffer(header_buf);
+	return ret;
+}
+
+static int mlx5vf_prep_stop_copy(struct mlx5_vf_migration_file *migf,
+				 size_t state_size)
+{
+	struct mlx5_vhca_data_buffer *buf;
+	size_t inc_state_size;
+	int ret;
+
+	/* let's be ready for stop_copy size that might grow by 10 percents */
+	if (check_add_overflow(state_size, state_size / 10, &inc_state_size))
+		inc_state_size = state_size;
+
+	buf = mlx5vf_get_data_buffer(migf, inc_state_size, DMA_FROM_DEVICE);
+	if (IS_ERR(buf))
+		return PTR_ERR(buf);
+
+	migf->buf = buf;
+	buf = mlx5vf_get_data_buffer(migf,
+		sizeof(struct mlx5_vf_migration_header), DMA_NONE);
+	if (IS_ERR(buf)) {
+		ret = PTR_ERR(buf);
+		goto err;
+	}
+
+	migf->buf_header = buf;
+	ret = mlx5vf_add_stop_copy_header(migf);
+	if (ret)
+		goto err_header;
+	return 0;
+
+err_header:
+	mlx5vf_put_data_buffer(migf->buf_header);
+	migf->buf_header = NULL;
+err:
+	mlx5vf_put_data_buffer(migf->buf);
+	migf->buf = NULL;
+	return ret;
+}
+
 static long mlx5vf_precopy_ioctl(struct file *filp, unsigned int cmd,
 				 unsigned long arg)
 {
···
 	loff_t *pos = &filp->f_pos;
 	unsigned long minsz;
 	size_t inc_length = 0;
-	bool end_of_data;
+	bool end_of_data = false;
 	int ret;
 
 	if (cmd != VFIO_MIG_GET_PRECOPY_INFO)
···
 		goto err_migf_unlock;
 	}
 
-	buf = mlx5vf_get_data_buff_from_pos(migf, *pos, &end_of_data);
-	if (buf) {
-		if (buf->start_pos == 0) {
-			info.initial_bytes = buf->header_image_size - *pos;
-		} else if (buf->start_pos ==
-				sizeof(struct mlx5_vf_migration_header)) {
-			/* First data buffer following the header */
-			info.initial_bytes = buf->start_pos +
-				buf->length - *pos;
-		} else {
-			info.dirty_bytes = buf->start_pos + buf->length - *pos;
-		}
+	if (migf->pre_copy_initial_bytes > *pos) {
+		info.initial_bytes = migf->pre_copy_initial_bytes - *pos;
 	} else {
-		if (!end_of_data) {
-			ret = -EINVAL;
-			goto err_migf_unlock;
+		buf = mlx5vf_get_data_buff_from_pos(migf, *pos, &end_of_data);
+		if (buf) {
+			info.dirty_bytes = buf->start_pos + buf->length - *pos;
+		} else {
+			if (!end_of_data) {
+				ret = -EINVAL;
+				goto err_migf_unlock;
+			}
+			info.dirty_bytes = inc_length;
 		}
-
-		info.dirty_bytes = inc_length;
 	}
 
 	if (!end_of_data || !inc_length) {
···
 	if (ret)
 		goto err;
 
-	buf = mlx5vf_get_data_buffer(migf, length, DMA_FROM_DEVICE);
-	if (IS_ERR(buf)) {
-		ret = PTR_ERR(buf);
-		goto err;
+	/* Checking whether we have a matching pre-allocated buffer that can fit */
+	if (migf->buf && migf->buf->allocated_length >= length) {
+		buf = migf->buf;
+		migf->buf = NULL;
+	} else {
+		buf = mlx5vf_get_data_buffer(migf, length, DMA_FROM_DEVICE);
+		if (IS_ERR(buf)) {
+			ret = PTR_ERR(buf);
+			goto err;
+		}
+	}
 
ret = mlx5vf_cmd_save_vhca_state(mvdev, migf, buf, true, false); ··· 549 467 size_t length; 550 468 int ret; 551 469 552 - migf = kzalloc(sizeof(*migf), GFP_KERNEL); 470 + migf = kzalloc(sizeof(*migf), GFP_KERNEL_ACCOUNT); 553 471 if (!migf) 554 472 return ERR_PTR(-ENOMEM); 555 473 ··· 584 502 if (ret) 585 503 goto out_pd; 586 504 505 + if (track) { 506 + ret = mlx5vf_prep_stop_copy(migf, length); 507 + if (ret) 508 + goto out_pd; 509 + } 510 + 587 511 buf = mlx5vf_alloc_data_buffer(migf, length, DMA_FROM_DEVICE); 588 512 if (IS_ERR(buf)) { 589 513 ret = PTR_ERR(buf); ··· 603 515 out_save: 604 516 mlx5vf_free_data_buffer(buf); 605 517 out_pd: 606 - mlx5vf_cmd_dealloc_pd(migf); 518 + mlx5fv_cmd_clean_migf_resources(migf); 607 519 out_free: 608 520 fput(migf->filp); 609 521 end: ··· 652 564 { 653 565 int ret; 654 566 655 - if (requested_length > MAX_MIGRATION_SIZE) 567 + if (requested_length > MAX_LOAD_SIZE) 656 568 return -ENOMEM; 657 569 658 570 if (vhca_buf->allocated_length < requested_length) { ··· 704 616 } 705 617 706 618 static int 619 + mlx5vf_resume_read_header_data(struct mlx5_vf_migration_file *migf, 620 + struct mlx5_vhca_data_buffer *vhca_buf, 621 + const char __user **buf, size_t *len, 622 + loff_t *pos, ssize_t *done) 623 + { 624 + size_t copy_len, to_copy; 625 + size_t required_data; 626 + u8 *to_buff; 627 + int ret; 628 + 629 + required_data = migf->record_size - vhca_buf->length; 630 + to_copy = min_t(size_t, *len, required_data); 631 + copy_len = to_copy; 632 + while (to_copy) { 633 + ret = mlx5vf_append_page_to_mig_buf(vhca_buf, buf, &to_copy, pos, 634 + done); 635 + if (ret) 636 + return ret; 637 + } 638 + 639 + *len -= copy_len; 640 + if (vhca_buf->length == migf->record_size) { 641 + switch (migf->record_tag) { 642 + case MLX5_MIGF_HEADER_TAG_STOP_COPY_SIZE: 643 + { 644 + struct page *page; 645 + 646 + page = mlx5vf_get_migration_page(vhca_buf, 0); 647 + if (!page) 648 + return -EINVAL; 649 + to_buff = kmap_local_page(page); 650 + 
migf->stop_copy_prep_size = min_t(u64, 651 + le64_to_cpup((__le64 *)to_buff), MAX_LOAD_SIZE); 652 + kunmap_local(to_buff); 653 + break; 654 + } 655 + default: 656 + /* Optional tag */ 657 + break; 658 + } 659 + 660 + migf->load_state = MLX5_VF_LOAD_STATE_READ_HEADER; 661 + migf->max_pos += migf->record_size; 662 + vhca_buf->length = 0; 663 + } 664 + 665 + return 0; 666 + } 667 + 668 + static int 707 669 mlx5vf_resume_read_header(struct mlx5_vf_migration_file *migf, 708 670 struct mlx5_vhca_data_buffer *vhca_buf, 709 671 const char __user **buf, ··· 783 645 *len -= copy_len; 784 646 vhca_buf->length += copy_len; 785 647 if (vhca_buf->length == sizeof(struct mlx5_vf_migration_header)) { 786 - u64 flags; 648 + u64 record_size; 649 + u32 flags; 787 650 788 - vhca_buf->header_image_size = le64_to_cpup((__le64 *)to_buff); 789 - if (vhca_buf->header_image_size > MAX_MIGRATION_SIZE) { 651 + record_size = le64_to_cpup((__le64 *)to_buff); 652 + if (record_size > MAX_LOAD_SIZE) { 790 653 ret = -ENOMEM; 791 654 goto end; 792 655 } 793 656 794 - flags = le64_to_cpup((__le64 *)(to_buff + 657 + migf->record_size = record_size; 658 + flags = le32_to_cpup((__le32 *)(to_buff + 795 659 offsetof(struct mlx5_vf_migration_header, flags))); 796 - if (flags) { 797 - ret = -EOPNOTSUPP; 798 - goto end; 660 + migf->record_tag = le32_to_cpup((__le32 *)(to_buff + 661 + offsetof(struct mlx5_vf_migration_header, tag))); 662 + switch (migf->record_tag) { 663 + case MLX5_MIGF_HEADER_TAG_FW_DATA: 664 + migf->load_state = MLX5_VF_LOAD_STATE_PREP_IMAGE; 665 + break; 666 + case MLX5_MIGF_HEADER_TAG_STOP_COPY_SIZE: 667 + migf->load_state = MLX5_VF_LOAD_STATE_PREP_HEADER_DATA; 668 + break; 669 + default: 670 + if (!(flags & MLX5_MIGF_HEADER_FLAGS_TAG_OPTIONAL)) { 671 + ret = -EOPNOTSUPP; 672 + goto end; 673 + } 674 + /* We may read and skip this optional record data */ 675 + migf->load_state = MLX5_VF_LOAD_STATE_PREP_HEADER_DATA; 799 676 } 800 677 801 - migf->load_state = MLX5_VF_LOAD_STATE_PREP_IMAGE; 
802 678 migf->max_pos += vhca_buf->length; 679 + vhca_buf->length = 0; 803 680 *has_work = true; 804 681 } 805 682 end: ··· 858 705 if (ret) 859 706 goto out_unlock; 860 707 break; 708 + case MLX5_VF_LOAD_STATE_PREP_HEADER_DATA: 709 + if (vhca_buf_header->allocated_length < migf->record_size) { 710 + mlx5vf_free_data_buffer(vhca_buf_header); 711 + 712 + migf->buf_header = mlx5vf_alloc_data_buffer(migf, 713 + migf->record_size, DMA_NONE); 714 + if (IS_ERR(migf->buf_header)) { 715 + ret = PTR_ERR(migf->buf_header); 716 + migf->buf_header = NULL; 717 + goto out_unlock; 718 + } 719 + 720 + vhca_buf_header = migf->buf_header; 721 + } 722 + 723 + vhca_buf_header->start_pos = migf->max_pos; 724 + migf->load_state = MLX5_VF_LOAD_STATE_READ_HEADER_DATA; 725 + break; 726 + case MLX5_VF_LOAD_STATE_READ_HEADER_DATA: 727 + ret = mlx5vf_resume_read_header_data(migf, vhca_buf_header, 728 + &buf, &len, pos, &done); 729 + if (ret) 730 + goto out_unlock; 731 + break; 861 732 case MLX5_VF_LOAD_STATE_PREP_IMAGE: 862 733 { 863 - u64 size = vhca_buf_header->header_image_size; 734 + u64 size = max(migf->record_size, 735 + migf->stop_copy_prep_size); 864 736 865 737 if (vhca_buf->allocated_length < size) { 866 738 mlx5vf_free_data_buffer(vhca_buf); ··· 914 736 break; 915 737 case MLX5_VF_LOAD_STATE_READ_IMAGE: 916 738 ret = mlx5vf_resume_read_image(migf, vhca_buf, 917 - vhca_buf_header->header_image_size, 739 + migf->record_size, 918 740 &buf, &len, pos, &done, &has_work); 919 741 if (ret) 920 742 goto out_unlock; ··· 927 749 928 750 /* prep header buf for next image */ 929 751 vhca_buf_header->length = 0; 930 - vhca_buf_header->header_image_size = 0; 931 752 /* prep data buf for next image */ 932 753 vhca_buf->length = 0; 933 754 ··· 958 781 struct mlx5_vhca_data_buffer *buf; 959 782 int ret; 960 783 961 - migf = kzalloc(sizeof(*migf), GFP_KERNEL); 784 + migf = kzalloc(sizeof(*migf), GFP_KERNEL_ACCOUNT); 962 785 if (!migf) 963 786 return ERR_PTR(-ENOMEM); 964 787
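The mlx5 changes above replace a single fixed header with a stream of tagged records: each record carries a size, a flags word, and a tag, and an unknown tag is only acceptable when its flags mark it optional, in which case the reader skips the payload. The userspace sketch below models just that dispatch rule; the field layout, tag values, and names are illustrative assumptions, not the driver's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Toy mirror of a tagged migration record header: a 64-bit record size
 * followed by 32-bit flags and tag (offsets assumed for illustration). */
struct rec_header {
	uint64_t record_size;
	uint32_t flags;
	uint32_t tag;
};

/* Hypothetical tag and flag values, standing in for the driver's enums. */
enum { TAG_FW_DATA = 0, TAG_STOP_COPY_SIZE = 1, TAG_UNKNOWN = 99 };
#define FLAGS_TAG_OPTIONAL (1u << 0)

/* Return 0 if the record is understood or legally skippable, -1 if the
 * stream must be rejected. */
static int classify_record(const struct rec_header *h)
{
	switch (h->tag) {
	case TAG_FW_DATA:
	case TAG_STOP_COPY_SIZE:
		return 0;	/* known tag: consume record_size payload bytes */
	default:
		/* Unknown tag: only legal when marked optional; the reader
		 * then skips record_size bytes and looks for the next header. */
		return (h->flags & FLAGS_TAG_OPTIONAL) ? 0 : -1;
	}
}
```

This is what lets an old resume side tolerate records added by a newer save side, as long as they are flagged optional.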
+3 -3
drivers/vfio/pci/vfio_pci_config.c
···
 	if (vdev->msi_perm)
 		return len;

-	vdev->msi_perm = kmalloc(sizeof(struct perm_bits), GFP_KERNEL);
+	vdev->msi_perm = kmalloc(sizeof(struct perm_bits), GFP_KERNEL_ACCOUNT);
 	if (!vdev->msi_perm)
 		return -ENOMEM;

···
 	 * no requirements on the length of a capability, so the gap between
 	 * capabilities needs byte granularity.
 	 */
-	map = kmalloc(pdev->cfg_size, GFP_KERNEL);
+	map = kmalloc(pdev->cfg_size, GFP_KERNEL_ACCOUNT);
 	if (!map)
 		return -ENOMEM;

-	vconfig = kmalloc(pdev->cfg_size, GFP_KERNEL);
+	vconfig = kmalloc(pdev->cfg_size, GFP_KERNEL_ACCOUNT);
 	if (!vconfig) {
 		kfree(map);
 		return -ENOMEM;
+4 -3
drivers/vfio/pci/vfio_pci_core.c
···
 			 * of the exclusive page in case that hot-add
 			 * device's bar is assigned into it.
 			 */
-			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
+			dummy_res =
+				kzalloc(sizeof(*dummy_res), GFP_KERNEL_ACCOUNT);
 			if (dummy_res == NULL)
 				goto no_mmap;

···
 	region = krealloc(vdev->region,
 			  (vdev->num_regions + 1) * sizeof(*region),
-			  GFP_KERNEL);
+			  GFP_KERNEL_ACCOUNT);
 	if (!region)
 		return -ENOMEM;

···
 {
 	struct vfio_pci_mmap_vma *mmap_vma;

-	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
+	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL_ACCOUNT);
 	if (!mmap_vma)
 		return -ENOMEM;
+1 -1
drivers/vfio/pci/vfio_pci_igd.c
···
 	if (!addr || !(~addr))
 		return -ENODEV;

-	opregionvbt = kzalloc(sizeof(*opregionvbt), GFP_KERNEL);
+	opregionvbt = kzalloc(sizeof(*opregionvbt), GFP_KERNEL_ACCOUNT);
 	if (!opregionvbt)
 		return -ENOMEM;
+6 -4
drivers/vfio/pci/vfio_pci_intrs.c
···
 	if (!vdev->pdev->irq)
 		return -ENODEV;

-	vdev->ctx = kzalloc(sizeof(struct vfio_pci_irq_ctx), GFP_KERNEL);
+	vdev->ctx = kzalloc(sizeof(struct vfio_pci_irq_ctx), GFP_KERNEL_ACCOUNT);
 	if (!vdev->ctx)
 		return -ENOMEM;

···
 	if (fd < 0) /* Disable only */
 		return 0;

-	vdev->ctx[0].name = kasprintf(GFP_KERNEL, "vfio-intx(%s)",
+	vdev->ctx[0].name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-intx(%s)",
 				      pci_name(pdev));
 	if (!vdev->ctx[0].name)
 		return -ENOMEM;
···
 	if (!is_irq_none(vdev))
 		return -EINVAL;

-	vdev->ctx = kcalloc(nvec, sizeof(struct vfio_pci_irq_ctx), GFP_KERNEL);
+	vdev->ctx = kcalloc(nvec, sizeof(struct vfio_pci_irq_ctx),
+			    GFP_KERNEL_ACCOUNT);
 	if (!vdev->ctx)
 		return -ENOMEM;

···
 	if (fd < 0)
 		return 0;

-	vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
+	vdev->ctx[vector].name = kasprintf(GFP_KERNEL_ACCOUNT,
+					   "vfio-msi%s[%d](%s)",
 					   msix ? "x" : "", vector,
 					   pci_name(pdev));
 	if (!vdev->ctx[vector].name)
+1 -1
drivers/vfio/pci/vfio_pci_rdwr.c
···
 		goto out_unlock;
 	}

-	ioeventfd = kzalloc(sizeof(*ioeventfd), GFP_KERNEL);
+	ioeventfd = kzalloc(sizeof(*ioeventfd), GFP_KERNEL_ACCOUNT);
 	if (!ioeventfd) {
 		ret = -ENOMEM;
 		goto out_unlock;
+6 -6
drivers/vfio/platform/vfio_platform_common.c
···
 		cnt++;

 	vdev->regions = kcalloc(cnt, sizeof(struct vfio_platform_region),
-				GFP_KERNEL);
+				GFP_KERNEL_ACCOUNT);
 	if (!vdev->regions)
 		return -ENOMEM;

 	for (i = 0; i < cnt; i++) {
 		struct resource *res =
 			vdev->get_resource(vdev, i);
-
-		if (!res)
-			goto err;

 		vdev->regions[i].addr = res->start;
 		vdev->regions[i].size = resource_size(res);
···
 	mutex_init(&vdev->igate);

 	ret = vfio_platform_get_reset(vdev);
-	if (ret && vdev->reset_required)
+	if (ret && vdev->reset_required) {
 		dev_err(dev, "No reset function found for device %s\n",
 			vdev->name);
-	return ret;
+		return ret;
+	}
+
+	return 0;
 }
 EXPORT_SYMBOL_GPL(vfio_platform_init_common);
+4 -4
drivers/vfio/platform/vfio_platform_irq.c
···
 	if (fd < 0) /* Disable only */
 		return 0;
-
-	irq->name = kasprintf(GFP_KERNEL, "vfio-irq[%d](%s)",
-			      irq->hwirq, vdev->name);
+	irq->name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-irq[%d](%s)",
+			      irq->hwirq, vdev->name);
 	if (!irq->name)
 		return -ENOMEM;

···
 	while (vdev->get_irq(vdev, cnt) >= 0)
 		cnt++;

-	vdev->irqs = kcalloc(cnt, sizeof(struct vfio_platform_irq), GFP_KERNEL);
+	vdev->irqs = kcalloc(cnt, sizeof(struct vfio_platform_irq),
+			     GFP_KERNEL_ACCOUNT);
 	if (!vdev->irqs)
 		return -ENOMEM;
+16 -9
drivers/vfio/vfio.h
···
 void vfio_device_put_registration(struct vfio_device *device);
 bool vfio_device_try_get_registration(struct vfio_device *device);
-int vfio_device_open(struct vfio_device *device,
-		     struct iommufd_ctx *iommufd, struct kvm *kvm);
+int vfio_device_open(struct vfio_device *device, struct iommufd_ctx *iommufd);
 void vfio_device_close(struct vfio_device *device,
 		       struct iommufd_ctx *iommufd);

···
 	struct file *opened_file;
 	struct blocking_notifier_head notifier;
 	struct iommufd_ctx *iommufd;
+	spinlock_t kvm_ref_lock;
 };

 int vfio_device_set_group(struct vfio_device *device,
···
 }

 #if IS_ENABLED(CONFIG_VFIO_CONTAINER)
-/* events for the backend driver notify callback */
-enum vfio_iommu_notify_type {
-	VFIO_IOMMU_CONTAINER_CLOSE = 0,
-};
-
 /**
  * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
  */
···
 			  void *data, size_t count, bool write);
 	struct iommu_domain *(*group_iommu_domain)(void *iommu_data,
 						   struct iommu_group *group);
-	void (*notify)(void *iommu_data,
-		       enum vfio_iommu_notify_type event);
 };

 struct vfio_iommu_driver {
···
 extern bool vfio_noiommu __read_mostly;
 #else
 enum { vfio_noiommu = false };
+#endif
+
+#ifdef CONFIG_HAVE_KVM
+void _vfio_device_get_kvm_safe(struct vfio_device *device, struct kvm *kvm);
+void vfio_device_put_kvm(struct vfio_device *device);
+#else
+static inline void _vfio_device_get_kvm_safe(struct vfio_device *device,
+					     struct kvm *kvm)
+{
+}
+
+static inline void vfio_device_put_kvm(struct vfio_device *device)
+{
+}
 #endif

 #endif
+111 -137
drivers/vfio/vfio_iommu_type1.c
···
 	unsigned int vaddr_invalid_count;
 	uint64_t pgsize_bitmap;
 	uint64_t num_non_pinned_groups;
-	wait_queue_head_t vaddr_wait;
 	bool v2;
 	bool nesting;
 	bool dirty_page_tracking;
-	bool container_open;
 	struct list_head emulated_iommu_groups;
 };

···
 	struct task_struct *task;
 	struct rb_root pfn_list;	/* Ex-user pinned pfn list */
 	unsigned long *bitmap;
+	struct mm_struct *mm;
+	size_t locked_vm;
 };

 struct vfio_batch {
···
 */
 #define DIRTY_BITMAP_PAGES_MAX	((u64)INT_MAX)
 #define DIRTY_BITMAP_SIZE_MAX	DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
-
-#define WAITED 1

 static int put_pfn(unsigned long pfn, int prot);

···
 	return ret;
 }

+static int mm_lock_acct(struct task_struct *task, struct mm_struct *mm,
+			bool lock_cap, long npage)
+{
+	int ret = mmap_write_lock_killable(mm);
+
+	if (ret)
+		return ret;
+
+	ret = __account_locked_vm(mm, abs(npage), npage > 0, task, lock_cap);
+	mmap_write_unlock(mm);
+	return ret;
+}
+
 static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
 {
 	struct mm_struct *mm;
···
 	if (!npage)
 		return 0;

-	mm = async ? get_task_mm(dma->task) : dma->task->mm;
-	if (!mm)
+	mm = dma->mm;
+	if (async && !mmget_not_zero(mm))
 		return -ESRCH; /* process exited */

-	ret = mmap_write_lock_killable(mm);
-	if (!ret) {
-		ret = __account_locked_vm(mm, abs(npage), npage > 0, dma->task,
-					  dma->lock_cap);
-		mmap_write_unlock(mm);
-	}
+	ret = mm_lock_acct(dma->task, mm, dma->lock_cap, npage);
+	if (!ret)
+		dma->locked_vm += npage;

 	if (async)
 		mmput(mm);
···
 	return ret;
 }

-static int vfio_wait(struct vfio_iommu *iommu)
-{
-	DEFINE_WAIT(wait);
-
-	prepare_to_wait(&iommu->vaddr_wait, &wait, TASK_KILLABLE);
-	mutex_unlock(&iommu->lock);
-	schedule();
-	mutex_lock(&iommu->lock);
-	finish_wait(&iommu->vaddr_wait, &wait);
-	if (kthread_should_stop() || !iommu->container_open ||
-	    fatal_signal_pending(current)) {
-		return -EFAULT;
-	}
-	return WAITED;
-}
-
-/*
- * Find dma struct and wait for its vaddr to be valid.  iommu lock is dropped
- * if the task waits, but is re-locked on return.  Return result in *dma_p.
- * Return 0 on success with no waiting, WAITED on success if waited, and -errno
- * on error.
- */
-static int vfio_find_dma_valid(struct vfio_iommu *iommu, dma_addr_t start,
-			       size_t size, struct vfio_dma **dma_p)
-{
-	int ret = 0;
-
-	do {
-		*dma_p = vfio_find_dma(iommu, start, size);
-		if (!*dma_p)
-			return -EINVAL;
-		else if (!(*dma_p)->vaddr_invalid)
-			return ret;
-		else
-			ret = vfio_wait(iommu);
-	} while (ret == WAITED);
-
-	return ret;
-}
-
-/*
- * Wait for all vaddr in the dma_list to become valid.  iommu lock is dropped
- * if the task waits, but is re-locked on return.  Return 0 on success with no
- * waiting, WAITED on success if waited, and -errno on error.
- */
-static int vfio_wait_all_valid(struct vfio_iommu *iommu)
-{
-	int ret = 0;
-
-	while (iommu->vaddr_invalid_count && ret >= 0)
-		ret = vfio_wait(iommu);
-
-	return ret;
-}
-
 /*
  * Attempt to pin pages.  We really don't want to track all the pfns and
  * the iommu can only map chunks of consecutive pfns anyway, so get the
···
 	struct mm_struct *mm;
 	int ret;

-	mm = get_task_mm(dma->task);
-	if (!mm)
+	mm = dma->mm;
+	if (!mmget_not_zero(mm))
 		return -ENODEV;

 	ret = vaddr_get_pfns(mm, vaddr, 1, dma->prot, pfn_base, pages);
···
 	ret = 0;

 	if (do_accounting && !is_invalid_reserved_pfn(*pfn_base)) {
-		ret = vfio_lock_acct(dma, 1, true);
+		ret = vfio_lock_acct(dma, 1, false);
 		if (ret) {
 			put_pfn(*pfn_base, dma->prot);
 			if (ret == -ENOMEM)
···
 	unsigned long remote_vaddr;
 	struct vfio_dma *dma;
 	bool do_accounting;
-	dma_addr_t iova;

 	if (!iommu || !pages)
 		return -EINVAL;
···

 	mutex_lock(&iommu->lock);

-	/*
-	 * Wait for all necessary vaddr's to be valid so they can be used in
-	 * the main loop without dropping the lock, to avoid racing vs unmap.
-	 */
-again:
-	if (iommu->vaddr_invalid_count) {
-		for (i = 0; i < npage; i++) {
-			iova = user_iova + PAGE_SIZE * i;
-			ret = vfio_find_dma_valid(iommu, iova, PAGE_SIZE, &dma);
-			if (ret < 0)
-				goto pin_done;
-			if (ret == WAITED)
-				goto again;
-		}
+	if (WARN_ONCE(iommu->vaddr_invalid_count,
+		      "vfio_pin_pages not allowed with VFIO_UPDATE_VADDR\n")) {
+		ret = -EBUSY;
+		goto pin_done;
 	}

 	/* Fail if no dma_umap notifier is registered */
···

 	for (i = 0; i < npage; i++) {
 		unsigned long phys_pfn;
+		dma_addr_t iova;
 		struct vfio_pfn *vpfn;

 		iova = user_iova + PAGE_SIZE * i;
···
 	vfio_unmap_unpin(iommu, dma, true);
 	vfio_unlink_dma(iommu, dma);
 	put_task_struct(dma->task);
+	mmdrop(dma->mm);
 	vfio_dma_bitmap_free(dma);
-	if (dma->vaddr_invalid) {
+	if (dma->vaddr_invalid)
 		iommu->vaddr_invalid_count--;
-		wake_up_all(&iommu->vaddr_wait);
-	}
 	kfree(dma);
 	iommu->dma_avail++;
 }
···
 	struct rb_node *n, *first_n;

 	mutex_lock(&iommu->lock);
+
+	/* Cannot update vaddr if mdev is present. */
+	if (invalidate_vaddr && !list_empty(&iommu->emulated_iommu_groups)) {
+		ret = -EBUSY;
+		goto unlock;
+	}

 	pgshift = __ffs(iommu->pgsize_bitmap);
 	pgsize = (size_t)1 << pgshift;
···
 	return list_empty(iova);
 }

+static int vfio_change_dma_owner(struct vfio_dma *dma)
+{
+	struct task_struct *task = current->group_leader;
+	struct mm_struct *mm = current->mm;
+	long npage = dma->locked_vm;
+	bool lock_cap;
+	int ret;
+
+	if (mm == dma->mm)
+		return 0;
+
+	lock_cap = capable(CAP_IPC_LOCK);
+	ret = mm_lock_acct(task, mm, lock_cap, npage);
+	if (ret)
+		return ret;
+
+	if (mmget_not_zero(dma->mm)) {
+		mm_lock_acct(dma->task, dma->mm, dma->lock_cap, -npage);
+		mmput(dma->mm);
+	}
+
+	if (dma->task != task) {
+		put_task_struct(dma->task);
+		dma->task = get_task_struct(task);
+	}
+	mmdrop(dma->mm);
+	dma->mm = mm;
+	mmgrab(dma->mm);
+	dma->lock_cap = lock_cap;
+	return 0;
+}
+
 static int vfio_dma_do_map(struct vfio_iommu *iommu,
 			   struct vfio_iommu_type1_dma_map *map)
 {
···
 		    dma->size != size) {
 			ret = -EINVAL;
 		} else {
+			ret = vfio_change_dma_owner(dma);
+			if (ret)
+				goto out_unlock;
 			dma->vaddr = vaddr;
 			dma->vaddr_invalid = false;
 			iommu->vaddr_invalid_count--;
-			wake_up_all(&iommu->vaddr_wait);
 		}
 		goto out_unlock;
 	} else if (dma) {
···
 	 * against the locked memory limit and we need to be able to do both
 	 * outside of this call path as pinning can be asynchronous via the
 	 * external interfaces for mdev devices.  RLIMIT_MEMLOCK requires a
-	 * task_struct and VM locked pages requires an mm_struct, however
-	 * holding an indefinite mm reference is not recommended, therefore we
-	 * only hold a reference to a task.  We could hold a reference to
-	 * current, however QEMU uses this call path through vCPU threads,
-	 * which can be killed resulting in a NULL mm and failure in the unmap
-	 * path when called via a different thread.  Avoid this problem by
-	 * using the group_leader as threads within the same group require
-	 * both CLONE_THREAD and CLONE_VM and will therefore use the same
-	 * mm_struct.
-	 *
-	 * Previously we also used the task for testing CAP_IPC_LOCK at the
-	 * time of pinning and accounting, however has_capability() makes use
-	 * of real_cred, a copy-on-write field, so we can't guarantee that it
-	 * matches group_leader, or in fact that it might not change by the
-	 * time it's evaluated.  If a process were to call MAP_DMA with
-	 * CAP_IPC_LOCK but later drop it, it doesn't make sense that they
-	 * possibly see different results for an iommu_mapped vfio_dma vs
-	 * externally mapped.  Therefore track CAP_IPC_LOCK in vfio_dma at the
-	 * time of calling MAP_DMA.
+	 * task_struct.  Save the group_leader so that all DMA tracking uses
+	 * the same task, to make debugging easier.  VM locked pages requires
+	 * an mm_struct, so grab the mm in case the task dies.
 	 */
 	get_task_struct(current->group_leader);
 	dma->task = current->group_leader;
 	dma->lock_cap = capable(CAP_IPC_LOCK);
+	dma->mm = current->mm;
+	mmgrab(dma->mm);

 	dma->pfn_list = RB_ROOT;

···
 	struct rb_node *n;
 	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	int ret;
-
-	ret = vfio_wait_all_valid(iommu);
-	if (ret < 0)
-		return ret;

 	/* Arbitrarily pick the first domain in the list for lookups */
 	if (!list_empty(&iommu->domain_list))
···
 	struct iommu_domain_geometry *geo;
 	LIST_HEAD(iova_copy);
 	LIST_HEAD(group_resv_regions);
-	int ret = -EINVAL;
+	int ret = -EBUSY;

 	mutex_lock(&iommu->lock);

+	/* Attach could require pinning, so disallow while vaddr is invalid. */
+	if (iommu->vaddr_invalid_count)
+		goto out_unlock;
+
 	/* Check for duplicates */
+	ret = -EINVAL;
 	if (vfio_iommu_find_iommu_group(iommu, iommu_group))
 		goto out_unlock;

···
 	INIT_LIST_HEAD(&iommu->iova_list);
 	iommu->dma_list = RB_ROOT;
 	iommu->dma_avail = dma_entry_limit;
-	iommu->container_open = true;
 	mutex_init(&iommu->lock);
 	mutex_init(&iommu->device_list_lock);
 	INIT_LIST_HEAD(&iommu->device_list);
-	init_waitqueue_head(&iommu->vaddr_wait);
 	iommu->pgsize_bitmap = PAGE_MASK;
 	INIT_LIST_HEAD(&iommu->emulated_iommu_groups);

···
 	return ret;
 }

+static bool vfio_iommu_has_emulated(struct vfio_iommu *iommu)
+{
+	bool ret;
+
+	mutex_lock(&iommu->lock);
+	ret = !list_empty(&iommu->emulated_iommu_groups);
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
 static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
 					    unsigned long arg)
 {
···
 	case VFIO_TYPE1v2_IOMMU:
 	case VFIO_TYPE1_NESTING_IOMMU:
 	case VFIO_UNMAP_ALL:
-	case VFIO_UPDATE_VADDR:
 		return 1;
+	case VFIO_UPDATE_VADDR:
+		/*
+		 * Disable this feature if mdevs are present.  They cannot
+		 * safely pin/unpin/rw while vaddrs are being updated.
+		 */
+		return iommu && !vfio_iommu_has_emulated(iommu);
 	case VFIO_DMA_CC_IOMMU:
 		if (!iommu)
 			return 0;
···
 	struct vfio_dma *dma;
 	bool kthread = current->mm == NULL;
 	size_t offset;
-	int ret;

 	*copied = 0;

-	ret = vfio_find_dma_valid(iommu, user_iova, 1, &dma);
-	if (ret < 0)
-		return ret;
+	dma = vfio_find_dma(iommu, user_iova, 1);
+	if (!dma)
+		return -EINVAL;

 	if ((write && !(dma->prot & IOMMU_WRITE)) ||
 	    !(dma->prot & IOMMU_READ))
 		return -EPERM;

-	mm = get_task_mm(dma->task);
-
-	if (!mm)
+	mm = dma->mm;
+	if (!mmget_not_zero(mm))
 		return -EPERM;

 	if (kthread)
···
 	size_t done;

 	mutex_lock(&iommu->lock);
+
+	if (WARN_ONCE(iommu->vaddr_invalid_count,
+		      "vfio_dma_rw not allowed with VFIO_UPDATE_VADDR\n")) {
+		ret = -EBUSY;
+		goto out;
+	}
+
 	while (count > 0) {
 		ret = vfio_iommu_type1_dma_rw_chunk(iommu, user_iova, data,
 						    count, write, &done);
···
 		user_iova += done;
 	}

+out:
 	mutex_unlock(&iommu->lock);
 	return ret;
 }
···
 	return domain;
 }

-static void vfio_iommu_type1_notify(void *iommu_data,
-				    enum vfio_iommu_notify_type event)
-{
-	struct vfio_iommu *iommu = iommu_data;
-
-	if (event != VFIO_IOMMU_CONTAINER_CLOSE)
-		return;
-	mutex_lock(&iommu->lock);
-	iommu->container_open = false;
-	mutex_unlock(&iommu->lock);
-	wake_up_all(&iommu->vaddr_wait);
-}
-
 static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
 	.name			= "vfio-iommu-type1",
 	.owner			= THIS_MODULE,
···
 	.unregister_device	= vfio_iommu_type1_unregister_device,
 	.dma_rw			= vfio_iommu_type1_dma_rw,
 	.group_iommu_domain	= vfio_iommu_type1_group_iommu_domain,
-	.notify			= vfio_iommu_type1_notify,
 };

 static int __init vfio_iommu_type1_init(void)
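The locked_vm changes above pin accounting to a stable `dma->mm` and, when a new process resumes ownership via VFIO_DMA_MAP_FLAG_VADDR, charge the new mm before releasing the old one. The userspace sketch below models only that charge-then-release ordering with bare counters; the struct names and the reduction of an mm to a single counter are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an mm_struct: just its locked-page count. */
struct toy_mm { long locked_vm; };

/* Toy stand-in for a vfio_dma: the owning mm and the pages it accounts. */
struct toy_dma {
	struct toy_mm *mm;
	size_t locked_vm;
};

/* Re-own the mapping's locked pages: charge the new mm first so the
 * accounting never dips below the pages actually pinned, then release
 * the old owner's charge.  No-op when ownership is unchanged. */
static int change_dma_owner(struct toy_dma *dma, struct toy_mm *new_mm)
{
	long npage = (long)dma->locked_vm;

	if (new_mm == dma->mm)
		return 0;

	new_mm->locked_vm += npage;	/* charge the new owner first */
	dma->mm->locked_vm -= npage;	/* then release the old owner */
	dma->mm = new_mm;
	return 0;
}
```

In the kernel the first charge can fail against RLIMIT_MEMLOCK, in which case the old owner keeps its charge; the sketch omits that error path.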
+59 -11
drivers/vfio/vfio_main.c
···
 #include <linux/fs.h>
 #include <linux/idr.h>
 #include <linux/iommu.h>
+#ifdef CONFIG_HAVE_KVM
+#include <linux/kvm_host.h>
+#endif
 #include <linux/list.h>
 #include <linux/miscdevice.h>
 #include <linux/module.h>
···
 }
 EXPORT_SYMBOL_GPL(vfio_unregister_group_dev);

+#ifdef CONFIG_HAVE_KVM
+void _vfio_device_get_kvm_safe(struct vfio_device *device, struct kvm *kvm)
+{
+	void (*pfn)(struct kvm *kvm);
+	bool (*fn)(struct kvm *kvm);
+	bool ret;
+
+	lockdep_assert_held(&device->dev_set->lock);
+
+	pfn = symbol_get(kvm_put_kvm);
+	if (WARN_ON(!pfn))
+		return;
+
+	fn = symbol_get(kvm_get_kvm_safe);
+	if (WARN_ON(!fn)) {
+		symbol_put(kvm_put_kvm);
+		return;
+	}
+
+	ret = fn(kvm);
+	symbol_put(kvm_get_kvm_safe);
+	if (!ret) {
+		symbol_put(kvm_put_kvm);
+		return;
+	}
+
+	device->put_kvm = pfn;
+	device->kvm = kvm;
+}
+
+void vfio_device_put_kvm(struct vfio_device *device)
+{
+	lockdep_assert_held(&device->dev_set->lock);
+
+	if (!device->kvm)
+		return;
+
+	if (WARN_ON(!device->put_kvm))
+		goto clear;
+
+	device->put_kvm(device->kvm);
+	device->put_kvm = NULL;
+	symbol_put(kvm_put_kvm);
+
+clear:
+	device->kvm = NULL;
+}
+#endif
+
 /* true if the vfio_device has open_device() called but not close_device() */
 static bool vfio_assert_device_open(struct vfio_device *device)
 {
···
 }

 static int vfio_device_first_open(struct vfio_device *device,
-				  struct iommufd_ctx *iommufd, struct kvm *kvm)
+				  struct iommufd_ctx *iommufd)
 {
 	int ret;

···
 	if (ret)
 		goto err_module_put;

-	device->kvm = kvm;
 	if (device->ops->open_device) {
 		ret = device->ops->open_device(device);
 		if (ret)
···
 	return 0;

 err_unuse_iommu:
-	device->kvm = NULL;
 	if (iommufd)
 		vfio_iommufd_unbind(device);
 	else
···

 	if (device->ops->close_device)
 		device->ops->close_device(device);
-	device->kvm = NULL;
 	if (iommufd)
 		vfio_iommufd_unbind(device);
 	else
···
 	module_put(device->dev->driver->owner);
 }

-int vfio_device_open(struct vfio_device *device,
-		     struct iommufd_ctx *iommufd, struct kvm *kvm)
+int vfio_device_open(struct vfio_device *device, struct iommufd_ctx *iommufd)
 {
 	int ret = 0;

-	mutex_lock(&device->dev_set->lock);
+	lockdep_assert_held(&device->dev_set->lock);
+
 	device->open_count++;
 	if (device->open_count == 1) {
-		ret = vfio_device_first_open(device, iommufd, kvm);
+		ret = vfio_device_first_open(device, iommufd);
 		if (ret)
 			device->open_count--;
 	}
-	mutex_unlock(&device->dev_set->lock);

 	return ret;
 }
···
 void vfio_device_close(struct vfio_device *device,
 		       struct iommufd_ctx *iommufd)
 {
-	mutex_lock(&device->dev_set->lock);
+	lockdep_assert_held(&device->dev_set->lock);
+
 	vfio_assert_device_open(device);
 	if (device->open_count == 1)
 		vfio_device_last_close(device, iommufd);
 	device->open_count--;
-	mutex_unlock(&device->dev_set->lock);
 }

 /*
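The new kvm-ref helpers above resolve the release hook (`kvm_put_kvm`) before attempting the conditional reference (`kvm_get_kvm_safe`), so a failed or refused get can always be fully unwound, and the pointer plus its release function are only published together. A minimal userspace sketch of that ordering, with plain function pointers standing in for `symbol_get`/`symbol_put` and all names ours:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy refcounted object; "dying" models a zero-ref kvm that must not
 * accept new references. */
struct obj { int refs; bool dying; };

/* Conditional get: refuse a reference on a dying object. */
static bool get_safe(struct obj *o)
{
	if (o->dying)
		return false;
	o->refs++;
	return true;
}

static void put(struct obj *o) { o->refs--; }

/* A holder publishes the object and its release hook together, or not
 * at all. */
struct holder {
	struct obj *held;
	void (*put_fn)(struct obj *);
};

static void holder_attach(struct holder *h, struct obj *o)
{
	void (*put_fn)(struct obj *) = put;	/* resolve the release hook first */

	if (!get_safe(o))
		return;			/* nothing taken, nothing to unwind */
	h->put_fn = put_fn;
	h->held = o;
}

static void holder_detach(struct holder *h)
{
	if (!h->held)
		return;
	h->put_fn(h->held);
	h->put_fn = NULL;
	h->held = NULL;
}
```

Taking the put hook first mirrors the kernel pattern: if the safe get fails, the caller still holds everything needed to back out without ever touching a dead object.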
+1 -1
drivers/vfio/virqfd.c
···
 	int ret = 0;
 	__poll_t events;
 
-	virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
+	virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL_ACCOUNT);
 	if (!virqfd)
 		return -ENOMEM;
 
+5 -1
include/linux/vfio.h
···
 	struct vfio_device_set *dev_set;
 	struct list_head dev_set_list;
 	unsigned int migration_flags;
-	/* Driver must reference the kvm during open_device or never touch it */
 	struct kvm *kvm;
 
 	/* Members below here are private, not for driver use */
···
 	struct list_head group_next;
 	struct list_head iommu_entry;
 	struct iommufd_access *iommufd_access;
+	void (*put_kvm)(struct kvm *kvm);
 #if IS_ENABLED(CONFIG_IOMMUFD)
 	struct iommufd_device *iommufd_device;
 	struct iommufd_ctx *iommufd_ictx;
···
  *
  * @init: initialize private fields in device structure
  * @release: Reclaim private fields in device structure
+ * @bind_iommufd: Called when binding the device to an iommufd
+ * @unbind_iommufd: Opposite of bind_iommufd
+ * @attach_ioas: Called when attaching device to an IOAS/HWPT managed by the
+ *		 bound iommufd. Undo in unbind_iommufd.
  * @open_device: Called when the first file descriptor is opened for this device
  * @close_device: Opposite of open_device
  * @read: Perform read(2) on device file descriptor
+9 -6
include/uapi/linux/vfio.h
···
 /* Supports VFIO_DMA_UNMAP_FLAG_ALL */
 #define VFIO_UNMAP_ALL		9
 
-/* Supports the vaddr flag for DMA map and unmap */
+/*
+ * Supports the vaddr flag for DMA map and unmap.  Not supported for mediated
+ * devices, so this capability is subject to change as groups are added or
+ * removed.
+ */
 #define VFIO_UPDATE_VADDR	10
 
 /*
···
 * Map process virtual addresses to IO virtual addresses using the
 * provided struct vfio_dma_map.  Caller sets argsz.  READ &/ WRITE required.
 *
-* If flags & VFIO_DMA_MAP_FLAG_VADDR, update the base vaddr for iova, and
-* unblock translation of host virtual addresses in the iova range.  The vaddr
+* If flags & VFIO_DMA_MAP_FLAG_VADDR, update the base vaddr for iova. The vaddr
 * must have previously been invalidated with VFIO_DMA_UNMAP_FLAG_VADDR.  To
 * maintain memory consistency within the user application, the updated vaddr
 * must address the same memory object as originally mapped.  Failure to do so
···
 * must be 0.  This cannot be combined with the get-dirty-bitmap flag.
 *
 * If flags & VFIO_DMA_UNMAP_FLAG_VADDR, do not unmap, but invalidate host
-* virtual addresses in the iova range.  Tasks that attempt to translate an
-* iova's vaddr will block.  DMA to already-mapped pages continues.  This
-* cannot be combined with the get-dirty-bitmap flag.
+* virtual addresses in the iova range. DMA to already-mapped pages continues.
+* Groups may not be added to the container while any addresses are invalid.
+* This cannot be combined with the get-dirty-bitmap flag.
 */
 struct vfio_iommu_type1_dma_unmap {
 	__u32	argsz;
+11 -8
samples/Kconfig
···
 	  Build UHID sample program.
 
 config SAMPLE_VFIO_MDEV_MTTY
-	tristate "Build VFIO mtty example mediated device sample code -- loadable modules only"
-	depends on VFIO_MDEV && m
+	tristate "Build VFIO mtty example mediated device sample code"
+	depends on VFIO
+	select VFIO_MDEV
 	help
 	  Build a virtual tty sample driver for use as a VFIO
 	  mediated device
 
 config SAMPLE_VFIO_MDEV_MDPY
-	tristate "Build VFIO mdpy example mediated device sample code -- loadable modules only"
-	depends on VFIO_MDEV && m
+	tristate "Build VFIO mdpy example mediated device sample code"
+	depends on VFIO
+	select VFIO_MDEV
 	help
 	  Build a virtual display sample driver for use as a VFIO
 	  mediated device.  It is a simple framebuffer and supports
 	  the region display interface (VFIO_GFX_PLANE_TYPE_REGION).
 
 config SAMPLE_VFIO_MDEV_MDPY_FB
-	tristate "Build VFIO mdpy example guest fbdev driver -- loadable module only"
-	depends on FB && m
+	tristate "Build VFIO mdpy example guest fbdev driver"
+	depends on FB
 	select FB_CFB_FILLRECT
 	select FB_CFB_COPYAREA
 	select FB_CFB_IMAGEBLIT
···
 	  Guest fbdev driver for the virtual display sample driver.
 
 config SAMPLE_VFIO_MDEV_MBOCHS
-	tristate "Build VFIO mdpy example mediated device sample code -- loadable modules only"
-	depends on VFIO_MDEV && m
+	tristate "Build VFIO mbochs example mediated device sample code"
+	depends on VFIO
+	select VFIO_MDEV
 	select DMA_SHARED_BUFFER
 	help
 	  Build a virtual display sample driver for use as a VFIO
+100
samples/vfio-mdev/README.rst
···
+Using the mtty vfio-mdev sample code
+====================================
+
+mtty is a sample vfio-mdev driver that demonstrates how to use the mediated
+device framework.
+
+The sample driver creates an mdev device that simulates a serial port over a
+PCI card.
+
+1. Build and load the mtty.ko module.
+
+   This step creates a dummy device, /sys/devices/virtual/mtty/mtty/
+
+   Files in this device directory in sysfs are similar to the following::
+
+     # tree /sys/devices/virtual/mtty/mtty/
+     /sys/devices/virtual/mtty/mtty/
+     |-- mdev_supported_types
+     |   |-- mtty-1
+     |   |   |-- available_instances
+     |   |   |-- create
+     |   |   |-- device_api
+     |   |   |-- devices
+     |   |   `-- name
+     |   `-- mtty-2
+     |       |-- available_instances
+     |       |-- create
+     |       |-- device_api
+     |       |-- devices
+     |       `-- name
+     |-- mtty_dev
+     |   `-- sample_mtty_dev
+     |-- power
+     |   |-- autosuspend_delay_ms
+     |   |-- control
+     |   |-- runtime_active_time
+     |   |-- runtime_status
+     |   `-- runtime_suspended_time
+     |-- subsystem -> ../../../../class/mtty
+     `-- uevent
+
+2. Create a mediated device by using the dummy device that you created in the
+   previous step::
+
+     # echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" > \
+       /sys/devices/virtual/mtty/mtty/mdev_supported_types/mtty-2/create
+
+3. Add parameters to qemu-kvm::
+
+     -device vfio-pci,\
+      sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
+
+4. Boot the VM.
+
+   In the Linux guest VM, with no hardware on the host, the device appears
+   as follows::
+
+     # lspci -s 00:05.0 -xxvv
+     00:05.0 Serial controller: Device 4348:3253 (rev 10) (prog-if 02 [16550])
+             Subsystem: Device 4348:3253
+             Physical Slot: 5
+             Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
+             Stepping- SERR- FastB2B- DisINTx-
+             Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
+             <TAbort- <MAbort- >SERR- <PERR- INTx-
+             Interrupt: pin A routed to IRQ 10
+             Region 0: I/O ports at c150 [size=8]
+             Region 1: I/O ports at c158 [size=8]
+             Kernel driver in use: serial
+     00: 48 43 53 32 01 00 00 02 10 02 00 07 00 00 00 00
+     10: 51 c1 00 00 59 c1 00 00 00 00 00 00 00 00 00 00
+     20: 00 00 00 00 00 00 00 00 00 00 00 00 48 43 53 32
+     30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 01 00 00
+
+   In the Linux guest VM, dmesg output for the device is as follows::
+
+     serial 0000:00:05.0: PCI INT A -> Link[LNKA] -> GSI 10 (level, high) -> IRQ 10
+     0000:00:05.0: ttyS1 at I/O 0xc150 (irq = 10) is a 16550A
+     0000:00:05.0: ttyS2 at I/O 0xc158 (irq = 10) is a 16550A
+
+5. In the Linux guest VM, check the serial ports::
+
+     # setserial -g /dev/ttyS*
+     /dev/ttyS0, UART: 16550A, Port: 0x03f8, IRQ: 4
+     /dev/ttyS1, UART: 16550A, Port: 0xc150, IRQ: 10
+     /dev/ttyS2, UART: 16550A, Port: 0xc158, IRQ: 10
+
+6. Using minicom or any terminal emulation program, open port /dev/ttyS1 or
+   /dev/ttyS2 with hardware flow control disabled.
+
+7. Type data on the minicom terminal or send data to the terminal emulation
+   program and read the data.
+
+   Data is looped back from the host's mtty driver.
+
+8. Destroy the mediated device that you created::
+
+     # echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001/remove
+