watchdog: return early in watchdog_hardlockup_check()

Patch series "watchdog/hardlockup: Improvements to hardlockup", v2.

This series addresses limitations in the hardlockup detector
implementations and updates the documentation to reflect actual behavior
and recent changes.

The changes are structured as follows:

Refactoring (Patch 1)
=====================
Patch 1 refactors watchdog_hardlockup_check() to return early if no
lockup is detected. This reduces the indentation level of the main
logic block, serving as a clean base for the subsequent changes.

Hardlockup Detection Improvements (Patches 2 & 4)
=================================================
The hardlockup detector logic relies on updating saved interrupt counts to
determine if the CPU is making progress.

Patch 2 ensures that the saved interrupt count is updated unconditionally
before checking the "touched" flag. This prevents stale comparisons which
can delay detection. This is a logic fix that ensures the detector
remains accurate even when the watchdog is frequently touched.

Patch 4 improves the Buddy detector's timeliness. The current checking
interval (every 3rd sample) causes high variability in detection time (up
to 24s). This patch changes the Buddy detector to check at every hrtimer
interval (4s) with a missed-interrupt threshold of 3, narrowing the
detection window to a consistent 8-12 second range.

Documentation Updates (Patches 3 & 5)
=====================================
The current documentation does not fully capture the variable nature of
detection latency or the details of the Buddy system.

Patch 3 removes the strict "10 seconds" definition of a hardlockup, which
was misleading given the periodic nature of the detector. It adds a
"Detection Overhead" section to the admin guide, using "Best Case" and
"Worst Case" scenarios to illustrate that detection time can vary
significantly (e.g., ~6s to ~20s).

Patch 5 adds a dedicated section for the Buddy detector, which was
previously undocumented. It details the mechanism, the new timing logic,
and known limitations.

This patch (of 5):

Invert the `is_hardlockup(cpu)` check in `watchdog_hardlockup_check()` to
return early when a hardlockup is not detected. This flattens the main
logic block, reducing the indentation level and making the code easier to
read and maintain.

This refactoring serves as a preparation patch for future hardlockup
changes.

Link: https://lkml.kernel.org/r/20260312-hardlockup-watchdog-fixes-v2-0-45bd8a0cc7ed@google.com
Link: https://lkml.kernel.org/r/20260312-hardlockup-watchdog-fixes-v2-1-45bd8a0cc7ed@google.com
Signed-off-by: Mayank Rungta <mrungta@google.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: Max Kellermann <max.kellermann@ionos.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Wang Jinchao <wangjinchao600@gmail.com>
Cc: Yunhui Cui <cuiyunhui@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Mayank Rungta and committed by Andrew Morton 2 months ago
3e811cae ea297603

1 changed file: kernel/watchdog.c (+64 -63)

···
 void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
 {
 	int hardlockup_all_cpu_backtrace;
+	unsigned int this_cpu;
+	unsigned long flags;
 
 	if (per_cpu(watchdog_hardlockup_touched, cpu)) {
 		per_cpu(watchdog_hardlockup_touched, cpu) = false;
···
 	 * fired multiple times before we overflow'd. If it hasn't
 	 * then this is a good indication the cpu is stuck
 	 */
-	if (is_hardlockup(cpu)) {
-		unsigned int this_cpu = smp_processor_id();
-		unsigned long flags;
+	if (!is_hardlockup(cpu)) {
+		per_cpu(watchdog_hardlockup_warned, cpu) = false;
+		return;
+	}
 
 #ifdef CONFIG_SYSFS
-		++hardlockup_count;
+	++hardlockup_count;
 #endif
-		/*
-		 * A poorly behaving BPF scheduler can trigger hard lockup by
-		 * e.g. putting numerous affinitized tasks in a single queue and
-		 * directing all CPUs at it. The following call can return true
-		 * only once when sched_ext is enabled and will immediately
-		 * abort the BPF scheduler and print out a warning message.
-		 */
-		if (scx_hardlockup(cpu))
+	/*
+	 * A poorly behaving BPF scheduler can trigger hard lockup by
+	 * e.g. putting numerous affinitized tasks in a single queue and
+	 * directing all CPUs at it. The following call can return true
+	 * only once when sched_ext is enabled and will immediately
+	 * abort the BPF scheduler and print out a warning message.
+	 */
+	if (scx_hardlockup(cpu))
+		return;
+
+	/* Only print hardlockups once. */
+	if (per_cpu(watchdog_hardlockup_warned, cpu))
+		return;
+
+	/*
+	 * Prevent multiple hard-lockup reports if one cpu is already
+	 * engaged in dumping all cpu back traces.
+	 */
+	if (hardlockup_all_cpu_backtrace) {
+		if (test_and_set_bit_lock(0, &hard_lockup_nmi_warn))
 			return;
-
-		/* Only print hardlockups once. */
-		if (per_cpu(watchdog_hardlockup_warned, cpu))
-			return;
-
-		/*
-		 * Prevent multiple hard-lockup reports if one cpu is already
-		 * engaged in dumping all cpu back traces.
-		 */
-		if (hardlockup_all_cpu_backtrace) {
-			if (test_and_set_bit_lock(0, &hard_lockup_nmi_warn))
-				return;
-		}
-
-		/*
-		 * NOTE: we call printk_cpu_sync_get_irqsave() after printing
-		 * the lockup message. While it would be nice to serialize
-		 * that printout, we really want to make sure that if some
-		 * other CPU somehow locked up while holding the lock associated
-		 * with printk_cpu_sync_get_irqsave() that we can still at least
-		 * get the message about the lockup out.
-		 */
-		pr_emerg("CPU%u: Watchdog detected hard LOCKUP on cpu %u\n", this_cpu, cpu);
-		printk_cpu_sync_get_irqsave(flags);
-
-		print_modules();
-		print_irqtrace_events(current);
-		if (cpu == this_cpu) {
-			if (regs)
-				show_regs(regs);
-			else
-				dump_stack();
-			printk_cpu_sync_put_irqrestore(flags);
-		} else {
-			printk_cpu_sync_put_irqrestore(flags);
-			trigger_single_cpu_backtrace(cpu);
-		}
-
-		if (hardlockup_all_cpu_backtrace) {
-			trigger_allbutcpu_cpu_backtrace(cpu);
-			if (!hardlockup_panic)
-				clear_bit_unlock(0, &hard_lockup_nmi_warn);
-		}
-
-		sys_info(hardlockup_si_mask & ~SYS_INFO_ALL_BT);
-		if (hardlockup_panic)
-			nmi_panic(regs, "Hard LOCKUP");
-
-		per_cpu(watchdog_hardlockup_warned, cpu) = true;
-	} else {
-		per_cpu(watchdog_hardlockup_warned, cpu) = false;
 	}
+
+	/*
+	 * NOTE: we call printk_cpu_sync_get_irqsave() after printing
+	 * the lockup message. While it would be nice to serialize
+	 * that printout, we really want to make sure that if some
+	 * other CPU somehow locked up while holding the lock associated
+	 * with printk_cpu_sync_get_irqsave() that we can still at least
+	 * get the message about the lockup out.
+	 */
+	this_cpu = smp_processor_id();
+	pr_emerg("CPU%u: Watchdog detected hard LOCKUP on cpu %u\n", this_cpu, cpu);
+	printk_cpu_sync_get_irqsave(flags);
+
+	print_modules();
+	print_irqtrace_events(current);
+	if (cpu == this_cpu) {
+		if (regs)
+			show_regs(regs);
+		else
+			dump_stack();
+		printk_cpu_sync_put_irqrestore(flags);
+	} else {
+		printk_cpu_sync_put_irqrestore(flags);
+		trigger_single_cpu_backtrace(cpu);
+	}
+
+	if (hardlockup_all_cpu_backtrace) {
+		trigger_allbutcpu_cpu_backtrace(cpu);
+		if (!hardlockup_panic)
+			clear_bit_unlock(0, &hard_lockup_nmi_warn);
+	}
+
+	sys_info(hardlockup_si_mask & ~SYS_INFO_ALL_BT);
+	if (hardlockup_panic)
+		nmi_panic(regs, "Hard LOCKUP");
+
+	per_cpu(watchdog_hardlockup_warned, cpu) = true;
 }
 
 #else /* CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER */
