Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

drm/amdgpu: add support for SMU debug option

When an SMU error occurs, SMU firmware expects the driver to
preserve the error context and to stop interacting with the SMU.
This aids in debugging SMU firmware issues.

Add an SMU debug option to support this request. It can be
enabled or disabled via the amdgpu_smu_debug debugfs file.
A 32-bit mask indicates which debug modes are enabled.
Currently, only one mode (HALT_ON_ERROR) is supported.
When enabled, it puts the hardware into a halt state so
that nothing can touch it any more in the event of SMU
errors.

The driver interacts with the SMU by sending messages. There
are three ways to send messages to the SMU in the current
implementation. They are handled as follows:

1, smu_cmn_send_smc_msg_with_param() for normal timeout cases

Halt on any error.

2, smu_cmn_send_msg_without_waiting()/smu_cmn_wait_for_response()
for longer timeout cases

Halt on any error except ETIME; halting on ETIME would break this
longer-timeout path. The caller handles ETIME in this case.

3, smu_cmn_send_msg_without_waiting() for no waiting cases

Halt on any error except ETIME; otherwise the second way would not work.
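
The three rules above can be condensed into a single decision function.
The following is only a userspace sketch for illustration, not code from
the patch: the enum send_path labels and should_halt() are hypothetical
names, and the patch itself open-codes the check in each function and
calls amdgpu_device_halt() followed by WARN_ON(1).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bit value as the define added to amdgpu_smu.h by the patch. */
#define SMU_DEBUG_HALT_ON_ERROR 0x1

/* Hypothetical labels for the three send paths; not kernel identifiers. */
enum send_path {
	SEND_WITH_PARAM,	/* 1, smu_cmn_send_smc_msg_with_param() */
	SEND_THEN_WAIT,		/* 2, send without waiting + wait for response */
	SEND_NO_WAIT,		/* 3, smu_cmn_send_msg_without_waiting() */
};

/* Returns true when the driver would halt the device for this result. */
static bool should_halt(uint32_t debug_mask, enum send_path path, int res)
{
	/* Nothing to do if the mode is off or the message succeeded. */
	if (!(debug_mask & SMU_DEBUG_HALT_ON_ERROR) || res == 0)
		return false;

	/* Normal timeout case: halt on any error, including -ETIME. */
	if (path == SEND_WITH_PARAM)
		return true;

	/* Longer-timeout and no-wait cases: -ETIME is left to the caller. */
	return res != -ETIME;
}
```

With the mode enabled, should_halt(0x1, SEND_WITH_PARAM, -ETIME) is true,
while should_halt(0x1, SEND_THEN_WAIT, -ETIME) is false.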

== Command Guide ==

1, enable HALT_ON_ERROR mode

# echo 0x1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

2, disable HALT_ON_ERROR mode

# echo 0x0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
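
The same toggling could be done from a test program instead of the shell.
This is a minimal sketch under stated assumptions: write_smu_debug_mask()
is a hypothetical helper, the caller supplies the debugfs path shown in
the commands above, and the process needs permission to write the file
(it is created with mode 0600).

```c
#include <stdint.h>
#include <stdio.h>

/* Same bit value as the define added to amdgpu_smu.h by the patch. */
#define SMU_DEBUG_HALT_ON_ERROR 0x1

/*
 * Hypothetical helper: writes the mask as a hex string, the same text
 * that "echo 0x1 > file" would write. Returns 0 on success, -1 on failure.
 */
static int write_smu_debug_mask(const char *path, uint32_t mask)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;

	ok = fprintf(f, "0x%x\n", (unsigned int)mask) > 0;

	/* fclose() flushes the buffer, so a write error can surface here. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

For example, write_smu_debug_mask("/sys/kernel/debug/dri/0/amdgpu_smu_debug",
SMU_DEBUG_HALT_ON_ERROR) would enable the mode, and passing 0 would
disable it.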

v5:
- Use bit mask to allow more debug features.(Evan)
- Use WARN() instead of BUG().(Evan)

v4:
- Set to halt state instead of a simple hang.(Christian)

v3:
- Use debugfs_create_bool().(Christian)
- Put variable into smu_context struct.
- Don't resend command when timeout.

v2:
- Resend command when timeout.(Lijo)
- Use debugfs file instead of module parameter.

Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Authored by Lang Yu and committed by Alex Deucher
6ff7fddb 34f3a4a9

+33 -1
+3
drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
···
 	if (!debugfs_initialized())
 		return 0;
 
+	debugfs_create_x32("amdgpu_smu_debug", 0600, root,
+			   &adev->smu.smu_debug_mask);
+
 	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
 				  &fops_ib_preempt);
 	if (IS_ERR(ent)) {
+9
drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
···
 };
 
 #define WORKLOAD_POLICY_MAX 7
+
+/* Used to mask smu debug modes */
+#define SMU_DEBUG_HALT_ON_ERROR 0x1
+
 struct smu_context
 {
 	struct amdgpu_device *adev;
···
 	struct smu_user_dpm_profile user_dpm_profile;
 
 	struct stb_context stb_context;
+
+	/*
+	 * 0 = disabled (default), otherwise enable corresponding debug mode
+	 */
+	uint32_t smu_debug_mask;
 };
 
 struct i2c_adapter;
+21 -1
drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
···
 	__smu_cmn_send_msg(smu, msg_index, param);
 	res = 0;
 Out:
+	if (unlikely(smu->smu_debug_mask & SMU_DEBUG_HALT_ON_ERROR) &&
+	    res && (res != -ETIME)) {
+		amdgpu_device_halt(smu->adev);
+		WARN_ON(1);
+	}
+
 	return res;
 }
 
···
 int smu_cmn_wait_for_response(struct smu_context *smu)
 {
 	u32 reg;
+	int res;
 
 	reg = __smu_cmn_poll_stat(smu);
-	return __smu_cmn_reg2errno(smu, reg);
+	res = __smu_cmn_reg2errno(smu, reg);
+
+	if (unlikely(smu->smu_debug_mask & SMU_DEBUG_HALT_ON_ERROR) &&
+	    res && (res != -ETIME)) {
+		amdgpu_device_halt(smu->adev);
+		WARN_ON(1);
+	}
+
+	return res;
 }
 
 /**
···
 	if (read_arg)
 		smu_cmn_read_arg(smu, read_arg);
 Out:
+	if (unlikely(smu->smu_debug_mask & SMU_DEBUG_HALT_ON_ERROR) && res) {
+		amdgpu_device_halt(smu->adev);
+		WARN_ON(1);
+	}
+
 	mutex_unlock(&smu->message_lock);
 	return res;
 }