Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

drm/xe/guc: Return an error code if the GuC load fails

Due to multiple explosion issues in the early days of the Xe driver,
the GuC load was hacked to never return a failure. That prevented
kernel panics and such initially, but now all it achieves is creating
more confusing errors when the driver tries to submit commands to a
GuC it already knows is not there. So fix that up.

As a stop-gap and to help with debug of load failures due to invalid
GuC init params, a wedge call had been added to the inner GuC load
function. The reason being that it leaves the GuC log accessible via
debugfs. However, for an end user, simply aborting the module load is
much cleaner than wedging and trying to continue. The wedge blocks
user submissions but it seems that various bits of the driver itself
still try to submit to a dead GuC and lots of subsequent errors occur.
And with regard to developers debugging why their particular code
change is being rejected by the GuC, it is trivial to either add the
wedge back in and hack the return code to zero again or to just do a
GuC log dump to dmesg.

v2: Add support for error injection testing and drop the now redundant
wedge call.

CC: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
Link: https://lore.kernel.org/r/20250909224132.536320-1-John.C.Harrison@Intel.com
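
The error-propagation pattern the commit message describes can be sketched in user space as follows. This is only an illustration, not driver code: `fake_guc`, `fw_status`, `fake_wait_ucode()` and `fake_upload()` are hypothetical stand-ins, and the `load_ok` field simulates the hardware status that the real `guc_wait_ucode()` polls. The only parts taken from the patch are the shape of the control flow and the `-EPROTO` return value.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-ins for the driver's firmware states. */
enum fw_status { FW_LOADABLE, FW_RUNNING, FW_LOAD_FAIL };

struct fake_guc {
	enum fw_status status;
	int load_ok;	/* simulated hardware result: nonzero means the load succeeded */
};

/* Same contract as the patched guc_wait_ucode(): 0 on success, -EPROTO on failure. */
static int fake_wait_ucode(const struct fake_guc *guc)
{
	if (!guc->load_ok) {
		fprintf(stderr, "simulated GuC load failure\n");
		return -EPROTO;
	}

	return 0;
}

/* Same shape as the patched __xe_guc_upload(): pass the error up instead of forcing 0. */
static int fake_upload(struct fake_guc *guc)
{
	int ret;

	ret = fake_wait_ucode(guc);
	if (ret)
		goto out;

	guc->status = FW_RUNNING;
	return 0;

out:
	guc->status = FW_LOAD_FAIL;
	return ret;
}
```

With the old behaviour the `out:` path returned 0, so a caller could not distinguish a failed load from a successful one; in this sketch a caller of `fake_upload()` sees `-EPROTO` and can abort early.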

+9 -4
drivers/gpu/drm/xe/xe_guc.c
···
 #endif
 #define GUC_LOAD_TIME_WARN_MS 200
 
-static void guc_wait_ucode(struct xe_guc *guc)
+static int guc_wait_ucode(struct xe_guc *guc)
 {
 	struct xe_gt *gt = guc_to_gt(guc);
 	struct xe_mmio *mmio = &gt->mmio;
···
 			break;
 		}
 
-		xe_device_declare_wedged(gt_to_xe(gt));
+		return -EPROTO;
 	} else if (delta_ms > GUC_LOAD_TIME_WARN_MS) {
 		xe_gt_warn(gt, "excessive init time: %lldms! [status = 0x%08X, timeouts = %d]\n",
 			   delta_ms, status, count);
···
 			  delta_ms, xe_guc_pc_get_act_freq(guc_pc), guc_pc_get_cur_freq(guc_pc),
 			  before_freq, status, count);
 	}
+
+	return 0;
 }
+ALLOW_ERROR_INJECTION(guc_wait_ucode, ERRNO);
 
 static int __xe_guc_upload(struct xe_guc *guc)
 {
···
 		goto out;
 
 	/* Wait for authentication */
-	guc_wait_ucode(guc);
+	ret = guc_wait_ucode(guc);
+	if (ret)
+		goto out;
 
 	xe_uc_fw_change_status(&guc->fw, XE_UC_FIRMWARE_RUNNING);
 	return 0;
 
 out:
 	xe_uc_fw_change_status(&guc->fw, XE_UC_FIRMWARE_LOAD_FAIL);
-	return 0	/* FIXME: ret, don't want to stop load currently */;
+	return ret;
 }
 
 static int vf_guc_min_load_for_hwconfig(struct xe_guc *guc)