Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

watchdog/hardlockup: improve buddy system detection timeliness

Currently, the buddy system only performs checks every 3rd sample. With a
4-second interval, if a check window is missed, the next check occurs 12
seconds later, potentially delaying hard lockup detection for up to 24
seconds.

Modify the buddy system to perform checks at every interval (4s).
Introduce a missed-interrupt threshold to maintain the existing grace
period while reducing the detection window to 8-12 seconds.

Best and worst case detection scenarios:

Before (12s check window):
- Best case: Lockup occurs after first check but just before heartbeat
  interval. Detected in ~8s (8s till next check).
- Worst case: Lockup occurs just after a check.
  Detected in ~24s (missed check + 12s till next check + 12s logic).

After (4s check window with threshold of 3):
- Best case: Lockup occurs just before a check.
  Detected in ~8s (0s till 1st check + 4s till 2nd + 4s till 3rd).
- Worst case: Lockup occurs just after a check.
  Detected in ~12s (4s till 1st check + 4s till 2nd + 4s till 3rd).
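
The arithmetic in the "After" cases can be sketched as a small C helper that
adds the wait until the first check to the remaining checks needed to reach
the threshold. The function name and its parameters are hypothetical. The
model follows the commit message's own breakdown in assuming that the first
check after the lockup already counts as a miss; real latency also depends on
how the heartbeat and check timers are phased relative to each other.

```c
/*
 * Hypothetical helper (not kernel code): detection latency in seconds,
 * assuming the first check after the lockup already counts as a miss.
 *
 * interval_s:              time between buddy checks (4s in the commit message)
 * miss_thresh:             consecutive misses required (3 for the buddy detector)
 * wait_to_first_check_s:   time from the lockup to the first check, 0..interval_s
 */
int detection_latency_s(int interval_s, int miss_thresh,
			int wait_to_first_check_s)
{
	/* The first miss lands at the first check; the rest follow one per interval. */
	return wait_to_first_check_s + (miss_thresh - 1) * interval_s;
}
```

Under these assumptions, `detection_latency_s(4, 3, 0)` gives the 8s best case
and `detection_latency_s(4, 3, 4)` gives the 12s worst case listed above.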

Link: https://lkml.kernel.org/r/20260312-hardlockup-watchdog-fixes-v2-4-45bd8a0cc7ed@google.com
Signed-off-by: Mayank Rungta <mrungta@google.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: Max Kellermann <max.kellermann@ionos.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Wang Jinchao <wangjinchao600@gmail.com>
Cc: Yunhui Cui <cuiyunhui@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Mayank Rungta and committed by Andrew Morton
077ba036 de832583

+18 -11
+1
include/linux/nmi.h
···
 extern int watchdog_user_enabled;
 extern int watchdog_thresh;
 extern unsigned long watchdog_enabled;
+extern int watchdog_hardlockup_miss_thresh;
 
 extern struct cpumask watchdog_cpumask;
 extern unsigned long *watchdog_cpumask_bits;
+16 -3
kernel/watchdog.c
···
 # endif /* CONFIG_SMP */
 
 /*
+ * Number of consecutive missed interrupts before declaring a lockup.
+ * Default to 1 (immediate) for NMI/Perf. Buddy will overwrite this to 3.
+ */
+int __read_mostly watchdog_hardlockup_miss_thresh = 1;
+EXPORT_SYMBOL_GPL(watchdog_hardlockup_miss_thresh);
+
+/*
  * Should we panic when a soft-lockup or hard-lockup occurs:
  */
 unsigned int __read_mostly hardlockup_panic =
···
 
 static DEFINE_PER_CPU(atomic_t, hrtimer_interrupts);
 static DEFINE_PER_CPU(int, hrtimer_interrupts_saved);
+static DEFINE_PER_CPU(int, hrtimer_interrupts_missed);
 static DEFINE_PER_CPU(bool, watchdog_hardlockup_warned);
 static DEFINE_PER_CPU(bool, watchdog_hardlockup_touched);
 static unsigned long hard_lockup_nmi_warn;
···
 	per_cpu(watchdog_hardlockup_touched, cpu) = true;
 }
 
-static void watchdog_hardlockup_update(unsigned int cpu)
+static void watchdog_hardlockup_update_reset(unsigned int cpu)
 {
 	int hrint = atomic_read(&per_cpu(hrtimer_interrupts, cpu));
 
···
 	 * written/read by a single CPU.
 	 */
 	per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
+	per_cpu(hrtimer_interrupts_missed, cpu) = 0;
 }
 
 static bool is_hardlockup(unsigned int cpu)
···
 	int hrint = atomic_read(&per_cpu(hrtimer_interrupts, cpu));
 
 	if (per_cpu(hrtimer_interrupts_saved, cpu) != hrint) {
-		watchdog_hardlockup_update(cpu);
+		watchdog_hardlockup_update_reset(cpu);
 		return false;
 	}
+
+	per_cpu(hrtimer_interrupts_missed, cpu)++;
+	if (per_cpu(hrtimer_interrupts_missed, cpu) % watchdog_hardlockup_miss_thresh)
+		return false;
 
 	return true;
 }
···
 	unsigned long flags;
 
 	if (per_cpu(watchdog_hardlockup_touched, cpu)) {
-		watchdog_hardlockup_update(cpu);
+		watchdog_hardlockup_update_reset(cpu);
 		per_cpu(watchdog_hardlockup_touched, cpu) = false;
 		return;
 	}
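
To see how the miss counter in the kernel/watchdog.c changes behaves, the
following userspace sketch mirrors the logic of
watchdog_hardlockup_update_reset() and is_hardlockup() with a plain struct in
place of the per-CPU variables. The struct, the model_* names, and the
model_checks_until_report() helper are hypothetical and exist only for
illustration. The sketch assumes miss_thresh is at least 1, as it is for both
values used in the patch.

```c
#include <stdbool.h>

/* Userspace stand-ins for the per-CPU variables in kernel/watchdog.c. */
struct cpu_model {
	int hrtimer_interrupts;        /* heartbeat counter of the watched CPU */
	int hrtimer_interrupts_saved;  /* counter value seen at the last check */
	int hrtimer_interrupts_missed; /* consecutive checks without progress */
};

/* Mirrors watchdog_hardlockup_update_reset(): resync and clear the misses. */
void model_update_reset(struct cpu_model *c)
{
	c->hrtimer_interrupts_saved = c->hrtimer_interrupts;
	c->hrtimer_interrupts_missed = 0;
}

/* Mirrors is_hardlockup(): report only on every miss_thresh-th missed check. */
bool model_is_hardlockup(struct cpu_model *c, int miss_thresh)
{
	if (c->hrtimer_interrupts_saved != c->hrtimer_interrupts) {
		/* The heartbeat advanced since the last check: not locked up. */
		model_update_reset(c);
		return false;
	}

	c->hrtimer_interrupts_missed++;
	if (c->hrtimer_interrupts_missed % miss_thresh)
		return false;

	return true;
}

/*
 * Hypothetical helper: the watched CPU is stalled (its counter never
 * changes). Returns how many checks run until a lockup is reported.
 */
int model_checks_until_report(struct cpu_model *c, int miss_thresh)
{
	int checks = 0;

	do {
		checks++;
	} while (!model_is_hardlockup(c, miss_thresh));

	return checks;
}
```

In this model, a stalled CPU whose counter was already saved is reported on
the 3rd check with a threshold of 3 and on the 1st check with the default
threshold of 1. If the counter advanced once after the last check, the first
check only resyncs, so the report comes one check later.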
+1 -8
kernel/watchdog_buddy.c
···
 
 int __init watchdog_hardlockup_probe(void)
 {
+	watchdog_hardlockup_miss_thresh = 3;
 	return 0;
 }
 
···
 void watchdog_buddy_check_hardlockup(int hrtimer_interrupts)
 {
 	unsigned int next_cpu;
-
-	/*
-	 * Test for hardlockups every 3 samples. The sample period is
-	 * watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
-	 * watchdog_thresh (over by 20%).
-	 */
-	if (hrtimer_interrupts % 3 != 0)
-		return;
 
 	/* check for a hardlockup on the next CPU */
 	next_cpu = watchdog_next_cpu(smp_processor_id());
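
The effect of dropping the `hrtimer_interrupts % 3` gate can be illustrated
with a short sketch based on the removed comment. All function names here are
hypothetical, and the value of 10 for watchdog_thresh in the note below is an
assumption chosen to match the 4-second interval quoted in the commit message.

```c
#include <stdbool.h>

/* Sample period in seconds, per the removed comment: watchdog_thresh * 2 / 5. */
int sample_period_s(int watchdog_thresh)
{
	return watchdog_thresh * 2 / 5;
}

/* The removed gate: the buddy check ran only on every 3rd sample. */
bool old_buddy_gate(int hrtimer_interrupts)
{
	return hrtimer_interrupts % 3 == 0;
}

/* Number of samples in [first, last] on which the old gate ran a check. */
int count_old_checks(int first, int last)
{
	int n = 0;

	for (int i = first; i <= last; i++)
		if (old_buddy_gate(i))
			n++;

	return n;
}
```

With an assumed watchdog_thresh of 10, `sample_period_s(10)` is 4 seconds and
three samples span 12 seconds, which is the "over by 20%" mentioned in the
removed comment. Over samples 1 to 12 the old gate ran 4 checks, whereas after
this change every one of the 12 samples runs a check and the grace period is
carried by watchdog_hardlockup_miss_thresh instead.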