Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

drm/doc: Document DRM device reset expectations

Create a section that specifies how to deal with DRM device resets for
kernel and userspace drivers.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
Acked-by: Sebastian Wick <sebastian.wick@redhat.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230929092509.42042-1-andrealmeid@igalia.com

Authored by André Almeida and committed by Christian König
db0f246c 988d0ff2

+77
Documentation/gpu/drm-uapi.rst
···
 mmapped regular files. Threads cause additional pain with signal
 handling as well.
 
+Device reset
+============
+
+The GPU stack is really complex and is prone to errors, from hardware bugs,
+faulty applications and everything in between the many layers. Some errors
+require resetting the device in order to make the device usable again. This
+section describes the expectations for DRM and usermode drivers when a
+device resets and how to propagate the reset status.
+
+Device resets can not be disabled without tainting the kernel, which can lead to
+hanging the entire kernel through shrinkers/mmu_notifiers. Userspace role in
+device resets is to propagate the message to the application and apply any
+special policy for blocking guilty applications, if any. Corollary is that
+debugging a hung GPU context require hardware support to be able to preempt such
+a GPU context while it's stopped.
+
+Kernel Mode Driver
+------------------
+
+The KMD is responsible for checking if the device needs a reset, and to perform
+it as needed. Usually a hang is detected when a job gets stuck executing. KMD
+should keep track of resets, because userspace can query any time about the
+reset status for a specific context. This is needed to propagate to the rest of
+the stack that a reset has happened. Currently, this is implemented by each
+driver separately, with no common DRM interface. Ideally this should be properly
+integrated at DRM scheduler to provide a common ground for all drivers. After a
+reset, KMD should reject new command submissions for affected contexts.
+
+User Mode Driver
+----------------
+
+After command submission, UMD should check if the submission was accepted or
+rejected. After a reset, KMD should reject submissions, and UMD can issue an
+ioctl to the KMD to check the reset status, and this can be checked more often
+if the UMD requires it. After detecting a reset, UMD will then proceed to report
+it to the application using the appropriate API error code, as explained in the
+section below about robustness.
+
+Robustness
+----------
+
+The only way to try to keep a graphical API context working after a reset is if
+it complies with the robustness aspects of the graphical API that it is using.
+
+Graphical APIs provide ways to applications to deal with device resets. However,
+there is no guarantee that the app will use such features correctly, and a
+userspace that doesn't support robust interfaces (like a non-robust
+OpenGL context or API without any robustness support like libva) leave the
+robustness handling entirely to the userspace driver. There is no strong
+community consensus on what the userspace driver should do in that case,
+since all reasonable approaches have some clear downsides.
+
+OpenGL
+~~~~~~
+
+Apps using OpenGL should use the available robust interfaces, like the
+extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
+interface tells if a reset has happened, and if so, all the context state is
+considered lost and the app proceeds by creating new ones. There's no consensus
+on what to do to if robustness is not in use.
+
+Vulkan
+~~~~~~
+
+Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
+This error code means, among other things, that a device reset has happened and
+it needs to recreate the contexts to keep going.
+
+Reporting causes of resets
+--------------------------
+
+Apart from propagating the reset through the stack so apps can recover, it's
+really useful for driver developers to learn more about what caused the reset in
+the first place. DRM devices should make use of devcoredump to store relevant
+information about the reset, so this information can be added to user bug
+reports.
+
 .. _drm_driver_ioctl:
 
 IOCTL Support on Device Nodes
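
The KMD/UMD flow described in the patch above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a real DRM interface: ``ToyKMD``, ``ToyUMD``, ``ResetStatus``, ``SubmissionRejected`` and the returned strings are all hypothetical names invented for illustration. The guilty/innocent split is loosely modelled on the reset status values that ``GL_ARB_robustness`` reports, and ``"DEVICE_LOST"`` stands in for an API error code such as ``VK_ERROR_DEVICE_LOST``.

```python
from enum import Enum


class ResetStatus(Enum):
    # Hypothetical per-context reset states (assumption, not a kernel enum).
    NO_RESET = "no_reset"
    GUILTY = "guilty"      # a job from this context caused the reset
    INNOCENT = "innocent"  # affected by a reset that another context caused


class SubmissionRejected(Exception):
    """Raised by the toy KMD when a context affected by a reset submits work."""


class ToyKMD:
    """Hypothetical stand-in for a kernel mode driver."""

    def __init__(self):
        self._status = {}  # context id -> ResetStatus

    def create_context(self, ctx_id):
        self._status[ctx_id] = ResetStatus.NO_RESET

    def reset_device(self, guilty_ctx, affected=()):
        # Keep track of the reset so userspace can query it at any time.
        self._status[guilty_ctx] = ResetStatus.GUILTY
        for ctx_id in affected:
            if ctx_id != guilty_ctx:
                self._status[ctx_id] = ResetStatus.INNOCENT

    def submit(self, ctx_id, job):
        # After a reset, reject new submissions for affected contexts.
        if self._status[ctx_id] is not ResetStatus.NO_RESET:
            raise SubmissionRejected(f"context {ctx_id} was reset")
        return job

    def query_reset_status(self, ctx_id):
        # Stand-in for the driver-specific reset status ioctl.
        return self._status[ctx_id]


class ToyUMD:
    """Hypothetical stand-in for a user mode driver owning one context."""

    def __init__(self, kmd, ctx_id):
        self.kmd = kmd
        self.ctx_id = ctx_id
        kmd.create_context(ctx_id)

    def submit(self, job):
        # Check whether the submission was accepted or rejected.
        try:
            self.kmd.submit(self.ctx_id, job)
        except SubmissionRejected:
            # On rejection, query the reset status and report it upward
            # using an API-level error code.
            status = self.kmd.query_reset_status(self.ctx_id)
            if status is not ResetStatus.NO_RESET:
                return "DEVICE_LOST"
            return "SUBMIT_FAILED"
        return "OK"


if __name__ == "__main__":
    kmd = ToyKMD()
    hung, bystander = ToyUMD(kmd, 1), ToyUMD(kmd, 2)
    print(hung.submit("draw"))
    kmd.reset_device(guilty_ctx=1)
    print(hung.submit("draw"), bystander.submit("draw"))
```

In this sketch the application layer would react to ``"DEVICE_LOST"`` by recreating its contexts, which matches what the patch asks of robust OpenGL and Vulkan applications. What a driver should do for non-robust clients is left out on purpose, since the patch states there is no consensus on that case.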