Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
drm/amdgpu: add RAS poison handling for MCA

For MCA poison, if unmapping the queue fails, only a GPU reset should be
triggered, without page retirement handling; the MCA notifier will do that.

v2: handle MCA poison consumption in umc_poison_handler directly.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Authored by Tao Zhou and committed by Alex Deucher
ae45a18b 24b82292

+20 -11
drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -169,19 +169,28 @@
 		void *ras_error_status,
 		bool reset)
 {
-	int ret;
-	struct ras_err_data *err_data = (struct ras_err_data *)ras_error_status;
-	struct ras_common_if head = {
-		.block = AMDGPU_RAS_BLOCK__UMC,
-	};
-	struct ras_manager *obj = amdgpu_ras_find_obj(adev, &head);
+	int ret = AMDGPU_RAS_SUCCESS;
 
-	ret =
-		amdgpu_umc_do_page_retirement(adev, ras_error_status, NULL, reset);
+	if (!adev->gmc.xgmi.connected_to_cpu) {
+		struct ras_err_data *err_data = (struct ras_err_data *)ras_error_status;
+		struct ras_common_if head = {
+			.block = AMDGPU_RAS_BLOCK__UMC,
+		};
+		struct ras_manager *obj = amdgpu_ras_find_obj(adev, &head);
 
-	if (ret == AMDGPU_RAS_SUCCESS && obj) {
-		obj->err_data.ue_count += err_data->ue_count;
-		obj->err_data.ce_count += err_data->ce_count;
+		ret =
+			amdgpu_umc_do_page_retirement(adev, ras_error_status, NULL, reset);
+
+		if (ret == AMDGPU_RAS_SUCCESS && obj) {
+			obj->err_data.ue_count += err_data->ue_count;
+			obj->err_data.ce_count += err_data->ce_count;
+		}
+	} else if (reset) {
+		/* MCA poison handler is only responsible for GPU reset,
+		 * let MCA notifier do page retirement.
+		 */
+		kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
+		amdgpu_ras_reset_gpu(adev);
 	}
 
 	return ret;
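
The control flow introduced by this change can be tried outside the kernel with a small user-space sketch. Everything prefixed `sketch_` below is a hypothetical stand-in invented for illustration: the struct fields only model `adev->gmc.xgmi.connected_to_cpu`, the UMC error counters, and the reset and SRAM ECC side effects, and the page-retirement stub merely counts calls. It is not the driver's implementation, only an approximation of the branching shown in the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for AMDGPU_RAS_SUCCESS. */
enum { SKETCH_RAS_SUCCESS = 0 };

/* Hypothetical model of the device state touched by the handler. */
struct sketch_dev {
	bool connected_to_cpu;   /* models adev->gmc.xgmi.connected_to_cpu */
	int retire_calls;        /* how often page retirement was invoked */
	int gpu_resets;          /* how often a GPU reset was requested */
	bool sram_ecc_flag;      /* models the flag set for KFD before reset */
	unsigned long ue_count;  /* accumulated uncorrectable errors */
	unsigned long ce_count;  /* accumulated correctable errors */
};

/* Hypothetical model of the per-event error data. */
struct sketch_err {
	unsigned long ue_count;
	unsigned long ce_count;
};

/* Stub for page retirement: only records that it was called. */
int sketch_do_page_retirement(struct sketch_dev *dev, bool reset)
{
	(void)reset; /* the real routine takes this flag; the stub ignores it */
	dev->retire_calls++;
	return SKETCH_RAS_SUCCESS;
}

/* Mirrors the branching in the patched poison handler. */
int sketch_poison_handler(struct sketch_dev *dev,
			  const struct sketch_err *err, bool reset)
{
	int ret = SKETCH_RAS_SUCCESS;

	assert(dev && err);

	if (!dev->connected_to_cpu) {
		/* Non-MCA path: retire pages, then accumulate counters. */
		ret = sketch_do_page_retirement(dev, reset);
		if (ret == SKETCH_RAS_SUCCESS) {
			dev->ue_count += err->ue_count;
			dev->ce_count += err->ce_count;
		}
	} else if (reset) {
		/* MCA path: reset only; page retirement is left to the notifier. */
		dev->sram_ecc_flag = true;
		dev->gpu_resets++;
	}

	return ret;
}
```

Calling `sketch_poison_handler` on a device with `connected_to_cpu` set and `reset` true bumps only `gpu_resets` and sets `sram_ecc_flag`; with `connected_to_cpu` clear, it goes through the retirement stub and updates the counters instead.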