doc: watchdog: clarify hardlockup detection timing

The current documentation implies that a hardlockup is strictly defined as
looping for "more than 10 seconds." However, the detection mechanism is
periodic (based on `watchdog_thresh`), meaning detection time varies
significantly depending on when the lockup occurs relative to the NMI perf
event.

Update the definition to remove the strict "more than 10 seconds"
constraint in the introduction and defer details to the Implementation
section.

Additionally, add a "Detection Overhead" section illustrating the Best
Case (~6s) and Worst Case (~20s) detection scenarios to provide
administrators with a clearer understanding of the watchdog's latency.

Link: https://lkml.kernel.org/r/20260312-hardlockup-watchdog-fixes-v2-3-45bd8a0cc7ed@google.com
Signed-off-by: Mayank Rungta <mrungta@google.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: Max Kellermann <max.kellermann@ionos.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Wang Jinchao <wangjinchao600@gmail.com>
Cc: Yunhui Cui <cuiyunhui@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Mayank Rungta and committed by Andrew Morton 2 months ago. de832583 746bb7fa

+40 -1

1 changed file

Documentation/admin-guide/lockup-watchdogs.rst

@@ -16,7 +16,7 @@
 provided for this.
 
 A 'hardlockup' is defined as a bug that causes the CPU to loop in
-kernel mode for more than 10 seconds (see "Implementation" below for
+kernel mode for several seconds (see "Implementation" below for
 details), without letting other interrupts have a chance to run.
 Similarly to the softlockup case, the current stack trace is displayed
 upon detection and the system will stay locked up unless the default
@@ -63,6 +63,45 @@
 administrators to configure the period of the hrtimer and the perf
 event. The right value for a particular environment is a trade-off
 between fast response to lockups and detection overhead.
+
+Detection Overhead
+------------------
+
+The hardlockup detector checks for lockups using a periodic NMI perf
+event. This means the time to detect a lockup can vary depending on
+when the lockup occurs relative to the NMI check window.
+
+**Best Case:**
+In the best case scenario, the lockup occurs just before the first
+heartbeat is due. The detector will notice the missing hrtimer
+interrupt almost immediately during the next check.
+
+::
+
+  Time 100.0: cpu 1 heartbeat
+  Time 100.1: hardlockup_check, cpu1 stores its state
+  Time 103.9: Hard Lockup on cpu1
+  Time 104.0: cpu 1 heartbeat never comes
+  Time 110.1: hardlockup_check, cpu1 checks the state again, should be the same, declares lockup
+
+  Time to detection: ~6 seconds
+
+**Worst Case:**
+In the worst case scenario, the lockup occurs shortly after a valid
+interrupt (heartbeat) which itself happened just after the NMI check.
+The next NMI check sees that the interrupt count has changed (due to
+that one heartbeat), assumes the CPU is healthy, and resets the
+baseline. The lockup is only detected at the subsequent check.
+
+::
+
+  Time 100.0: hardlockup_check, cpu1 stores its state
+  Time 100.1: cpu 1 heartbeat
+  Time 100.2: Hard Lockup on cpu1
+  Time 110.0: hardlockup_check, cpu1 stores its state (misses lockup as state changed)
+  Time 120.0: hardlockup_check, cpu1 checks the state again, should be the same, declares lockup
+
+  Time to detection: ~20 seconds
 
 By default, the watchdog runs on all online cores. However, on a
 kernel configured with NO_HZ_FULL, by default the watchdog runs only
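The two timelines in the patch can be reproduced with a small model. The following is a minimal sketch, not kernel code: `detection_latency_ms` and its parameters are hypothetical names chosen for illustration. It assumes the NMI check fires every `watchdog_thresh` seconds (10 by default), that the hrtimer heartbeat fires every 2*watchdog_thresh/5 seconds (which matches the 100.0 and 104.0 heartbeats in the best-case timeline), and that all events are instantaneous.

```python
def detection_latency_ms(lockup_ms, check_phase_ms, heartbeat_phase_ms,
                         watchdog_thresh_s=10):
    """Return the time in ms from a lockup to the NMI check that declares it.

    Illustrative model only. Checks fire at check_phase_ms + k * thresh,
    heartbeats at heartbeat_phase_ms + k * (2 * thresh / 5), and heartbeats
    stop once the CPU locks up at lockup_ms.
    """
    check_period = watchdog_thresh_s * 1000
    heartbeat_period = check_period * 2 // 5  # assumed hrtimer period

    def heartbeats_seen(t):
        # Count heartbeats that fired at or before t and strictly
        # before the lockup.
        last = min(t, lockup_ms - 1)
        if last < heartbeat_phase_ms:
            return 0
        return (last - heartbeat_phase_ms) // heartbeat_period + 1

    saved = None  # heartbeat count stored by the previous check
    t = check_phase_ms
    while True:
        count = heartbeats_seen(t)
        if count == saved:
            # No heartbeat since the previous check: declare a lockup.
            return t - lockup_ms
        saved = count  # state changed, so reset the baseline
        t += check_period


# Best case from the patch: heartbeat at 100.0 s, checks at 100.1 s and
# 110.1 s, lockup at 103.9 s.
print(detection_latency_ms(103_900, check_phase_ms=100, heartbeat_phase_ms=0))
# prints 6200

# Worst case from the patch: checks at 100.0 s, 110.0 s and 120.0 s,
# heartbeat at 100.1 s, lockup at 100.2 s.
print(detection_latency_ms(100_200, check_phase_ms=0, heartbeat_phase_ms=100))
# prints 19800
```

Under these assumptions the latency falls between roughly 6 and 20 seconds at the default threshold, matching the "~6 seconds" and "~20 seconds" figures in the new section.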
