Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

RAS: Report all ARM processor CPER information to userspace

The ARM processor CPER record was added in UEFI v2.6 and remained
unchanged up to v2.10.

Yet, the original arm_event trace code added by

e9279e83ad1f ("trace, ras: add ARM processor error trace event")

is incomplete, as it only traces some fields of UEFI 2.6 table N.16, not
exporting any information from tables N.17 to N.29 of the record.

This is not enough for the user to figure out exactly what has happened
or to take appropriate action.

According to the UEFI v2.9 specification, section N.2.4.4, the ARM
processor error section includes:

- several (ERR_INFO_NUM) ARM processor error information structures
(Tables N.17 to N.20);
- several (CONTEXT_INFO_NUM) ARM processor context information
structures (Tables N.21 to N.29);
- several vendor specific error information structures. The
size is given by Section Length minus the size of the other
fields.
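The layout described above can be walked from userspace as well. The
following is a minimal sketch, not the kernel's implementation: it assumes
a little-endian raw section buffer, a 40-byte fixed header, 32-byte error
information structures and an 8-byte context header followed by "size"
bytes of registers. The names arm_sec_split() and struct arm_sec_layout
are hypothetical. Unlike the kernel code in this patch, it bounds-checks
every step against Section Length.

```c
#include <stddef.h>
#include <stdint.h>

#define ARM_SEC_HDR_SIZE  40u /* fixed part of the section (Table N.16) */
#define ARM_ERR_INFO_SIZE 32u /* one error information structure */
#define ARM_CTX_HDR_SIZE   8u /* version (2) + type (2) + size (4) */

/* Hypothetical result type: byte lengths of the three variable parts. */
struct arm_sec_layout {
	uint32_t pei_len;
	uint32_t ctx_len;
	uint32_t oem_len;
};

/* Little-endian readers, so the sketch does not depend on host alignment. */
static uint16_t rd16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 if the record does not fit its Section Length. */
static int arm_sec_split(const uint8_t *sec, size_t avail,
			 struct arm_sec_layout *out)
{
	uint32_t err_info_num, context_info_num, section_length;
	uint32_t off, pei, ctx_start, n;

	if (avail < ARM_SEC_HDR_SIZE)
		return -1;

	err_info_num = rd16(sec + 4);
	context_info_num = rd16(sec + 6);
	section_length = rd32(sec + 8);
	if (section_length < ARM_SEC_HDR_SIZE || section_length > avail)
		return -1;

	off = ARM_SEC_HDR_SIZE;
	pei = err_info_num * ARM_ERR_INFO_SIZE;
	if (pei > section_length - off)
		return -1;
	off += pei;

	/* Context structures are variable-sized: header plus register array. */
	ctx_start = off;
	for (n = 0; n < context_info_num; n++) {
		uint32_t size;

		if (section_length - off < ARM_CTX_HDR_SIZE)
			return -1;
		size = rd32(sec + off + 4);
		if (size > section_length - off - ARM_CTX_HDR_SIZE)
			return -1;
		off += ARM_CTX_HDR_SIZE + size;
	}

	out->pei_len = pei;
	out->ctx_len = off - ctx_start;
	out->oem_len = section_length - off; /* whatever remains is vendor data */
	return 0;
}
```

Rejecting a record whose parts overrun Section Length is a design choice of
this sketch; a caller could instead clamp the vendor-specific length to
zero, as the patch does.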

In addition, the tracepoint also exports two fields that the GHES driver
parses when firmware reports them:

- error severity
- CPU logical index

Report all of this information to userspace via the ARM tracepoint so
that userspace can properly record the error and make decisions about
CPU core isolation according to error severity and other info.

The updated ARM trace event now contains the following fields:

======================================  =============================
UEFI field on table N.16                ARM Processor trace fields
======================================  =============================
Validation                              handled when filling data for
                                        affinity MPIDR and running
                                        state.
ERR_INFO_NUM                            pei_len
CONTEXT_INFO_NUM                        ctx_len
Section Length                          indirectly reported by
                                        pei_len, ctx_len and oem_len
Error affinity level                    affinity
MPIDR_EL1                               mpidr
MIDR_EL1                                midr
Running State                           running_state
PSCI State                              psci_state
Processor Error Information Structure   pei_err - count at pei_len
Processor Context                       ctx_err - count at ctx_len
Vendor Specific Error Info              oem - count at oem_len
======================================  =============================

It should be noted that decoding of tables N.17 to N.29, if needed, will
be handled in userspace. That gives more flexibility, as there won't be
any need to flood the kernel with micro-architecture specific error
decoding.

Also, decoding the other fields requires complex logic, and should be
done for each of the several values inside the record field. So, let
userspace daemons like rasdaemon decode them, parsing such tables and
having vendor-specific and micro-architecture-specific decoders.
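As an illustration of what such userspace decoding might look like, here is
a minimal sketch that extracts one error information structure from the raw
pei_err bytes. It is not rasdaemon's code. It assumes the packed 32-byte
little-endian layout of the kernel's struct cper_arm_err_info; the names
struct arm_err_info, arm_err_info_decode() and arm_err_type_name() are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define ARM_ERR_INFO_SIZE 32u

/* Hypothetical unpacked view of one error information structure. */
struct arm_err_info {
	uint8_t  version;
	uint8_t  length;
	uint16_t validation_bits;
	uint8_t  type;
	uint16_t multiple_error;
	uint8_t  flags;
	uint64_t error_info;
	uint64_t virt_fault_addr;
	uint64_t physical_fault_addr;
};

/* Read an unsigned little-endian integer of nbytes (at most 8) bytes. */
static uint64_t rd_le(const uint8_t *p, unsigned int nbytes)
{
	uint64_t v = 0;

	while (nbytes--)
		v = (v << 8) | p[nbytes];
	return v;
}

/* Decode entry idx of the raw buffer; returns -1 if it is out of range. */
static int arm_err_info_decode(const uint8_t *buf, size_t len, size_t idx,
			       struct arm_err_info *out)
{
	const uint8_t *p;

	if (idx >= len / ARM_ERR_INFO_SIZE)
		return -1;
	p = buf + idx * ARM_ERR_INFO_SIZE;

	out->version = p[0];
	out->length = p[1];
	out->validation_bits = (uint16_t)rd_le(p + 2, 2);
	out->type = p[4];
	out->multiple_error = (uint16_t)rd_le(p + 5, 2);
	out->flags = p[7];
	out->error_info = rd_le(p + 8, 8);
	out->virt_fault_addr = rd_le(p + 16, 8);
	out->physical_fault_addr = rd_le(p + 24, 8);
	return 0;
}

/* Map the type field to a short label. */
static const char *arm_err_type_name(uint8_t type)
{
	switch (type) {
	case 0: return "cache error";
	case 1: return "TLB error";
	case 2: return "bus error";
	case 3: return "micro-architectural error";
	default: return "unknown";
	}
}
```

The error_info word is left undecoded here: its bit layout depends on the
error type, and that is where a vendor-specific or micro-architecture-specific
decoder would take over.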

[mchehab: modified description, solved merge conflicts and fixed coding style]

Signed-off-by: Jason Tian <jason@os.amperecomputing.com>
Co-developed-by: Shengwei Luo <luoshengwei@huawei.com>
Signed-off-by: Shengwei Luo <luoshengwei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Daniel Ferguson <danielf@os.amperecomputing.com> # rebased
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Shiju Jose <shiju.jose@huawei.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Fixes: e9279e83ad1f ("trace, ras: add ARM processor error trace event")
Link: https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html#arm-processor-error-section
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Authored by Jason Tian and committed by Ard Biesheuvel
05954511 e41ef37d

+99 -17
+4 -7
drivers/acpi/apei/ghes.c
···
 }
 
 static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
-			       int sev, bool sync)
+				     int sev, bool sync)
 {
 	struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
 	int flags = sync ? MF_ACTION_REQUIRED : 0;
···
 	int sec_sev, i;
 	char *p;
 
-	log_arm_hw_error(err);
-
 	sec_sev = ghes_severity(gdata->error_severity);
+	log_arm_hw_error(err, sec_sev);
 	if (sev != GHES_SEV_RECOVERABLE || sec_sev != GHES_SEV_RECOVERABLE)
 		return false;
 
···
 
 			arch_apei_report_mem_error(sev, mem_err);
 			queued = ghes_handle_memory_failure(gdata, sev, sync);
-		}
-		else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
+		} else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
 			ghes_handle_aer(gdata);
-		}
-		else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
+		} else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
 			queued = ghes_handle_arm_hw_error(gdata, sev, sync);
 		} else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) {
 			struct cxl_cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata);
+38 -2
drivers/ras/ras.c
···
 }
 EXPORT_SYMBOL_GPL(log_non_standard_event);
 
-void log_arm_hw_error(struct cper_sec_proc_arm *err)
+void log_arm_hw_error(struct cper_sec_proc_arm *err, const u8 sev)
 {
-	trace_arm_event(err);
+	struct cper_arm_err_info *err_info;
+	struct cper_arm_ctx_info *ctx_info;
+	u8 *ven_err_data;
+	u32 ctx_len = 0;
+	int n, sz, cpu;
+	s32 vsei_len;
+	u32 pei_len;
+	u8 *pei_err, *ctx_err;
+
+	pei_len = sizeof(struct cper_arm_err_info) * err->err_info_num;
+	pei_err = (u8 *)(err + 1);
+
+	err_info = (struct cper_arm_err_info *)(err + 1);
+	ctx_info = (struct cper_arm_ctx_info *)(err_info + err->err_info_num);
+	ctx_err = (u8 *)ctx_info;
+
+	for (n = 0; n < err->context_info_num; n++) {
+		sz = sizeof(struct cper_arm_ctx_info) + ctx_info->size;
+		ctx_info = (struct cper_arm_ctx_info *)((long)ctx_info + sz);
+		ctx_len += sz;
+	}
+
+	vsei_len = err->section_length - (sizeof(struct cper_sec_proc_arm) + pei_len + ctx_len);
+	if (vsei_len < 0) {
+		pr_warn(FW_BUG "section length: %d\n", err->section_length);
+		pr_warn(FW_BUG "section length is too small\n");
+		pr_warn(FW_BUG "firmware-generated error record is incorrect\n");
+		vsei_len = 0;
+	}
+	ven_err_data = (u8 *)ctx_info;
+
+	cpu = GET_LOGICAL_INDEX(err->mpidr);
+	if (cpu < 0)
+		cpu = -1;
+
+	trace_arm_event(err, pei_err, pei_len, ctx_err, ctx_len,
+			ven_err_data, (u32)vsei_len, sev, cpu);
 }
 
 static int __init ras_init(void)
+13 -3
include/linux/ras.h
···
 void log_non_standard_event(const guid_t *sec_type,
 			    const guid_t *fru_id, const char *fru_text,
 			    const u8 sev, const u8 *err, const u32 len);
-void log_arm_hw_error(struct cper_sec_proc_arm *err);
-
+void log_arm_hw_error(struct cper_sec_proc_arm *err, const u8 sev);
 #else
 static inline void
 log_non_standard_event(const guid_t *sec_type,
···
 		       const u8 sev, const u8 *err, const u32 len)
 { return; }
 static inline void
-log_arm_hw_error(struct cper_sec_proc_arm *err) { return; }
+log_arm_hw_error(struct cper_sec_proc_arm *err, const u8 sev) { return; }
 #endif
 
 struct atl_err {
···
 static inline unsigned long
 amd_convert_umc_mca_addr_to_sys_addr(struct atl_err *err) { return -EINVAL; }
 #endif /* CONFIG_AMD_ATL */
+
+#if defined(CONFIG_ARM) || defined(CONFIG_ARM64)
+#include <asm/smp_plat.h>
+/*
+ * Include ARM-specific SMP header which provides a function mapping mpidr to
+ * CPU logical index.
+ */
+#define GET_LOGICAL_INDEX(mpidr) get_logical_index(mpidr & MPIDR_HWID_BITMASK)
+#else
+#define GET_LOGICAL_INDEX(mpidr) -EINVAL
+#endif /* CONFIG_ARM || CONFIG_ARM64 */
 
 #endif /* __RAS_H__ */
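
The MPIDR-to-CPU mapping behind GET_LOGICAL_INDEX() can be illustrated
outside the kernel. The sketch below is an assumption-laden stand-in, not
the kernel helper: it uses the arm64 value of the hardware-ID mask
(0xff00ffffff), a caller-supplied table of per-CPU hardware IDs, and a
hypothetical function name logical_index(). It returns -1 when no CPU
matches, mirroring how the patch normalizes a negative result.

```c
#include <stddef.h>
#include <stdint.h>

/* Affinity bits of MPIDR_EL1 (Aff3 and Aff2..Aff0); arm64 value assumed. */
#define HWID_MASK 0xff00ffffffULL

/*
 * Look up the logical CPU whose hardware ID matches the affinity bits of
 * mpidr. hwid_map[cpu] holds the masked MPIDR of each logical CPU.
 */
static int logical_index(uint64_t mpidr, const uint64_t *hwid_map,
			 size_t ncpus)
{
	uint64_t hwid = mpidr & HWID_MASK;
	size_t cpu;

	for (cpu = 0; cpu < ncpus; cpu++)
		if (hwid_map[cpu] == hwid)
			return (int)cpu;
	return -1; /* the record refers to a CPU this system does not know */
}
```

Masking first matters because firmware reports the full register value,
which includes non-affinity bits that would otherwise defeat the comparison.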
+44 -5
include/ras/ras_event.h
···
  * This event is generated when hardware detects an ARM processor error
  * has occurred. UEFI 2.6 spec section N.2.4.4.
  */
+#define APEIL "ARM Processor Err Info data len"
+#define APEID "ARM Processor Err Info raw data"
+#define APECIL "ARM Processor Err Context Info data len"
+#define APECID "ARM Processor Err Context Info raw data"
+#define VSEIL "Vendor Specific Err Info data len"
+#define VSEID "Vendor Specific Err Info raw data"
 TRACE_EVENT(arm_event,
 
-	TP_PROTO(const struct cper_sec_proc_arm *proc),
+	TP_PROTO(const struct cper_sec_proc_arm *proc,
+		 const u8 *pei_err,
+		 const u32 pei_len,
+		 const u8 *ctx_err,
+		 const u32 ctx_len,
+		 const u8 *oem,
+		 const u32 oem_len,
+		 u8 sev,
+		 int cpu),
 
-	TP_ARGS(proc),
+	TP_ARGS(proc, pei_err, pei_len, ctx_err, ctx_len, oem, oem_len, sev, cpu),
 
 	TP_STRUCT__entry(
 		__field(u64, mpidr)
···
 		__field(u32, running_state)
 		__field(u32, psci_state)
 		__field(u8, affinity)
+		__field(u32, pei_len)
+		__dynamic_array(u8, pei_buf, pei_len)
+		__field(u32, ctx_len)
+		__dynamic_array(u8, ctx_buf, ctx_len)
+		__field(u32, oem_len)
+		__dynamic_array(u8, oem_buf, oem_len)
+		__field(u8, sev)
+		__field(int, cpu)
 	),
 
 	TP_fast_assign(
···
 			__entry->running_state = ~0;
 			__entry->psci_state = ~0;
 		}
+		__entry->pei_len = pei_len;
+		memcpy(__get_dynamic_array(pei_buf), pei_err, pei_len);
+		__entry->ctx_len = ctx_len;
+		memcpy(__get_dynamic_array(ctx_buf), ctx_err, ctx_len);
+		__entry->oem_len = oem_len;
+		memcpy(__get_dynamic_array(oem_buf), oem, oem_len);
+		__entry->sev = sev;
+		__entry->cpu = cpu;
 	),
 
-	TP_printk("affinity level: %d; MPIDR: %016llx; MIDR: %016llx; "
-		  "running state: %d; PSCI state: %d",
+	TP_printk("cpu: %d; error: %d; affinity level: %d; MPIDR: %016llx; MIDR: %016llx; "
+		  "running state: %d; PSCI state: %d; "
+		  "%s: %d; %s: %s; %s: %d; %s: %s; %s: %d; %s: %s",
+		  __entry->cpu,
+		  __entry->sev,
 		  __entry->affinity, __entry->mpidr, __entry->midr,
-		  __entry->running_state, __entry->psci_state)
+		  __entry->running_state, __entry->psci_state,
+		  APEIL, __entry->pei_len, APEID,
+		  __print_hex(__get_dynamic_array(pei_buf), __entry->pei_len),
+		  APECIL, __entry->ctx_len, APECID,
+		  __print_hex(__get_dynamic_array(ctx_buf), __entry->ctx_len),
+		  VSEIL, __entry->oem_len, VSEID,
+		  __print_hex(__get_dynamic_array(oem_buf), __entry->oem_len))
 );
 
 /*
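
A consumer that reads the formatted trace text, rather than the binary
event, has to turn the hex dumps back into bytes. The sketch below assumes
the raw-data fields are printed as two-digit hex bytes separated by single
spaces, and that the caller has already cut the field out of the line (for
example, up to the next ';'). The function name parse_hex_bytes() is
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Parse a string such as "01 20 ff" into out[]. Returns the number of
 * bytes written, or -1 on a malformed digit pair or if out[] is too small.
 */
static long parse_hex_bytes(const char *s, uint8_t *out, size_t cap)
{
	size_t n = 0;

	while (*s) {
		int hi, lo;

		if (*s == ' ') {
			s++;
			continue;
		}
		hi = hexval(s[0]);
		lo = hi < 0 ? -1 : hexval(s[1]); /* s[1] may be the terminator */
		if (hi < 0 || lo < 0 || n == cap)
			return -1;
		out[n++] = (uint8_t)((hi << 4) | lo);
		s += 2;
	}
	return (long)n;
}
```

Comparing the returned count against the matching "data len" value from the
same event is a cheap consistency check before handing the bytes to a
decoder.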