Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

clocksource: Rewrite watchdog code completely

The clocksource watchdog code has over time reached the state of an
impenetrable maze of duct tape and staples. The original design, made in
the context of systems far smaller than today's, is based on the
assumption that the clocksource to be monitored (TSC) can be trivially
compared against a clocksource known to be stable (HPET/ACPI-PM timer).

Over the years it turned out that this approach has major flaws:

- Long delays between watchdog invocations can result in wraparounds
of the reference clocksource (a rough calculation follows this list)

- Scalability of the reference clocksource readout can degrade on large
multi-socket systems due to interconnect congestion
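To put a number on the wraparound problem, here is a back-of-the-envelope
calculation with typical counter widths and frequencies (illustrative
values, independent of the code in this change): a narrow reference
counter tolerates only a bounded gap between two watchdog readouts before
the computed delta becomes meaningless.

    /*
     * Wrap horizons for common reference timers: a 24-bit ACPI PM timer
     * at 3.579545 MHz and a 32-bit HPET at 14.318180 MHz.
     */
    #include <stdio.h>

    int main(void)
    {
        double pm_wrap   = (double)(1u << 24) / 3579545.0;  /* ~4.7s */
        double hpet_wrap = 4294967296.0 / 14318180.0;       /* ~300s */

        printf("acpi_pm wraps after %.1fs, hpet after %.0fs\n",
               pm_wrap, hpet_wrap);
        return 0;
    }

A watchdog invocation delayed past these windows cannot distinguish a
wrapped reference from a skewed TSC.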

This was addressed with various heuristics which degraded the accuracy of
the watchdog to the point that it fails to detect actual TSC problems on
older hardware, which exhibits slow inter-CPU drift because firmware
manipulates the TSC to hide SMI time.

To address this and bring back sanity to the watchdog, rewrite the code
completely with a different approach:

1) Restrict the validation against a reference clocksource to the boot
CPU, which is usually the CPU/socket closest to the legacy block that
contains the reference source (HPET/ACPI-PM timer). Validate that the
reference readout happens within a bounded latency so that the actual
comparison against the TSC stays within 500 ppm as long as the clocks
are stable. (A simplified sketch follows this list.)

2) Compare the TSCs of the other CPUs in a round-robin fashion against
the boot CPU, the same way the TSC synchronization on CPU hotplug
works. This can still suffer from delayed reaction of the remote CPU
to the SMP function call and from the latency of the control variable
cache line, but that latency does not affect correctness, only
accuracy. With low contention the readout latency is in the low
nanoseconds range, which detects even slight skews between CPUs. Under
high contention this becomes less accurate, but it still detects slow
skews reliably because it solely relies on subsequent readouts being
monotonically increasing; it just takes slightly longer to detect the
issue. (Also sketched after this list.)

3) Rewrite the watchdog test so it exercises the various mechanisms one
by one and validates the results against the expectations.
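To make point 1 concrete, here is a minimal user-space sketch of the
read-reference/read-clocksource/read-reference pattern that
watchdog_check_freq() in the diff below implements. The clock IDs,
constants and names are illustrative stand-ins, not the kernel
interfaces; note that CLOCK_MONOTONIC is NTP-disciplined and may
legitimately sit close to the 500 ppm boundary against
CLOCK_MONOTONIC_RAW.

    /* Sketch: bounded-latency readout plus ~500ppm frequency check */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define READOUT_MAX_NS  50000   /* discard readouts delayed > 50us */
    #define SHIFT_500PPM    11      /* delta >> 11 ~= 500ppm of delta */

    static uint64_t read_ns(clockid_t id)
    {
        struct timespec ts;

        clock_gettime(id, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
    }

    /* One watchdog step: false when the clock under test is skewed */
    static bool check_freq_step(uint64_t *ref_last, uint64_t *cs_last)
    {
        uint64_t ref0, cs, ref1, ref_delta, cs_delta, skew;

        ref0 = read_ns(CLOCK_MONOTONIC_RAW);    /* reference */
        cs   = read_ns(CLOCK_MONOTONIC);        /* clock under test */
        ref1 = read_ns(CLOCK_MONOTONIC_RAW);    /* reference again */

        /* Readout window too wide (SMI/NMI/preemption)? Don't judge. */
        if (ref1 - ref0 > READOUT_MAX_NS)
            return true;

        ref_delta = ref0 - *ref_last;
        cs_delta  = cs - *cs_last;
        *ref_last = ref0;
        *cs_last  = cs;

        /* Allow ~500ppm of the interval plus the readout window itself */
        skew = ref_delta > cs_delta ? ref_delta - cs_delta : cs_delta - ref_delta;
        return skew < (ref_delta >> SHIFT_500PPM) + (ref1 - ref0);
    }

    int main(void)
    {
        uint64_t ref_last = read_ns(CLOCK_MONOTONIC_RAW);
        uint64_t cs_last  = read_ns(CLOCK_MONOTONIC);
        struct timespec iv = { .tv_sec = 0, .tv_nsec = 500 * 1000 * 1000 };

        for (int i = 0; i < 8; i++) {
            nanosleep(&iv, NULL);
            if (!check_freq_step(&ref_last, &cs_last))
                printf("step %d: skewed beyond ~500ppm\n", i);
        }
        return 0;
    }

Unlike this sketch, the kernel variant primes cs->wd_last/cs->cs_last on
the first round and only starts comparing from the second.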
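Point 2 can be sketched the same way: two threads ping-pong a sequence
counter and check that interleaved readouts of a shared clock never go
backwards. This mirrors the watchdog_check_skew() handshake in the diff
below, with POSIX threads and CLOCK_MONOTONIC standing in for per-CPU
TSC reads, and with the NUMA-distance-scaled timeout omitted.

    /* Ping-pong monotonicity check between a "local" and "remote" thread */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define MAX_SEQ 10              /* five readouts per side */

    static atomic_int seq;          /* hand-over sequence counter */
    static uint64_t ts[2];          /* [0] local side, [1] remote side */
    static atomic_bool skewed;

    static uint64_t clock_read(void)
    {
        struct timespec t;

        clock_gettime(CLOCK_MONOTONIC, &t);
        return (uint64_t)t.tv_sec * 1000000000ULL + t.tv_nsec;
    }

    /* Alternate: wait for our turn, read, compare against the peer */
    static void check_skew(int side)
    {
        int other = side ^ 1;

        ts[side] = clock_read();    /* seed so the first compare works */
        atomic_fetch_add(&seq, 1);  /* signal arrival */

        for (int s = side + 2; s < MAX_SEQ; s += 2) {
            while (atomic_load(&seq) < s)
                ;                   /* kernel version bails out on timeout */

            uint64_t now = clock_read();

            ts[side] = now;
            if (now < ts[other])    /* older peer readout must not be ahead */
                atomic_store(&skewed, true);

            atomic_fetch_add(&seq, 1);  /* hand over to the peer */
        }
    }

    static void *remote(void *arg)
    {
        check_skew(1);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;

        pthread_create(&t, NULL, remote, NULL);
        check_skew(0);
        pthread_join(&t, NULL);
        printf("%s\n", atomic_load(&skewed) ? "skewed" : "in sync");
        return 0;
    }

Built with cc -pthread, this prints "in sync" for a shared monotonic
clock; injecting an offset into one side's clock_read(), as the updated
wdtest module below does, flips it to "skewed".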

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Daniel J Blueman <daniel@quora.org>
Reviewed-by: Jiri Wiesner <jwiesner@suse.de>
Reviewed-by: Daniel J Blueman <daniel@quora.org>
Link: https://patch.msgid.link/20260123231521.926490888@kernel.org
Link: https://patch.msgid.link/87h5qeomm5.ffs@tglx

+634 -633
+1 -6
Documentation/admin-guide/kernel-parameters.txt
···
 			(HPET or PM timer) on systems whose TSC frequency was
 			obtained from HW or FW using either an MSR or CPUID(0x15).
 			Warn if the difference is more than 500 ppm.
-			[x86] watchdog: Use TSC as the watchdog clocksource with
-			which to check other HW timers (HPET or PM timer), but
-			only on systems where TSC has been deemed trustworthy.
-			This will be suppressed by an earlier tsc=nowatchdog and
-			can be overridden by a later tsc=nowatchdog. A console
-			message will flag any such suppression or overriding.
+			[x86] watchdog: Enforce the clocksource watchdog on TSC
 
 	tsc_early_khz=	[X86,EARLY] Skip early TSC calibration and use the given
 			value instead. Useful when the early TSC frequency discovery
-1
arch/x86/include/asm/time.h
···
 
 extern void hpet_time_init(void);
 extern bool pit_timer_init(void);
-extern bool tsc_clocksource_watchdog_disabled(void);
 
 extern struct clock_event_device *global_clock_event;
 
+1 -3
arch/x86/kernel/hpet.c
···
 	.rating		= 250,
 	.read		= read_hpet,
 	.mask		= HPET_MASK,
-	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
+	.flags		= CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_CALIBRATED,
 	.resume		= hpet_resume_counter,
 };
 
···
 	if (!hpet_counting())
 		goto out_nohpet;
 
-	if (tsc_clocksource_watchdog_disabled())
-		clocksource_hpet.flags |= CLOCK_SOURCE_MUST_VERIFY;
 	clocksource_register_hz(&clocksource_hpet, (u32)hpet_freq);
 
 	if (id & HPET_ID_LEGSUP) {
+18 -31
arch/x86/kernel/tsc.c
···
 	return 1;
 }
 #endif
-
 __setup("notsc", notsc_setup);
 
+enum {
+	TSC_WATCHDOG_AUTO,
+	TSC_WATCHDOG_OFF,
+	TSC_WATCHDOG_ON,
+};
+
 static int no_sched_irq_time;
-static int no_tsc_watchdog;
-static int tsc_as_watchdog;
+static int tsc_watchdog;
 
 static int __init tsc_setup(char *str)
 {
···
 		no_sched_irq_time = 1;
 	if (!strcmp(str, "unstable"))
 		mark_tsc_unstable("boot parameter");
-	if (!strcmp(str, "nowatchdog")) {
-		no_tsc_watchdog = 1;
-		if (tsc_as_watchdog)
-			pr_alert("%s: Overriding earlier tsc=watchdog with tsc=nowatchdog\n",
-				 __func__);
-		tsc_as_watchdog = 0;
-	}
+	if (!strcmp(str, "nowatchdog"))
+		tsc_watchdog = TSC_WATCHDOG_OFF;
 	if (!strcmp(str, "recalibrate"))
 		tsc_force_recalibrate = 1;
-	if (!strcmp(str, "watchdog")) {
-		if (no_tsc_watchdog)
-			pr_alert("%s: tsc=watchdog overridden by earlier tsc=nowatchdog\n",
-				 __func__);
-		else
-			tsc_as_watchdog = 1;
-	}
+	if (!strcmp(str, "watchdog"))
+		tsc_watchdog = TSC_WATCHDOG_ON;
 	return 1;
 }
-
 __setup("tsc=", tsc_setup);
 
 #define MAX_RETRIES	5
···
 static struct clocksource clocksource_tsc_early = {
 	.name			= "tsc-early",
 	.rating			= 299,
-	.uncertainty_margin	= 32 * NSEC_PER_MSEC,
 	.read			= read_tsc,
 	.mask			= CLOCKSOURCE_MASK(64),
 	.flags			= CLOCK_SOURCE_IS_CONTINUOUS |
···
 	.flags			= CLOCK_SOURCE_IS_CONTINUOUS |
 				  CLOCK_SOURCE_CAN_INLINE_READ |
 				  CLOCK_SOURCE_MUST_VERIFY |
-				  CLOCK_SOURCE_VERIFY_PERCPU |
 				  CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT,
 	.id			= CSID_X86_TSC,
 	.vdso_clock_mode	= VDSO_CLOCKMODE_TSC,
···
 
 static void __init tsc_disable_clocksource_watchdog(void)
 {
+	if (tsc_watchdog == TSC_WATCHDOG_ON)
+		return;
 	clocksource_tsc_early.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
 	clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
-}
-
-bool tsc_clocksource_watchdog_disabled(void)
-{
-	return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY) &&
-	       tsc_as_watchdog && !no_tsc_watchdog;
 }
 
 static void __init check_system_tsc_reliable(void)
···
 		(unsigned long)tsc_khz / 1000,
 		(unsigned long)tsc_khz % 1000);
 
+	clocksource_tsc.flags |= CLOCK_SOURCE_CALIBRATED;
+
 	/* Inform the TSC deadline clockevent devices about the recalibration */
 	lapic_update_tsc_freq();
 
···
 
 	if (early) {
 		cpu_khz = x86_platform.calibrate_cpu();
-		if (tsc_early_khz) {
+		if (tsc_early_khz)
 			tsc_khz = tsc_early_khz;
-		} else {
+		else
 			tsc_khz = x86_platform.calibrate_tsc();
-			clocksource_tsc.freq_khz = tsc_khz;
-		}
 	} else {
 		/* We should not be here with non-native cpu calibration */
 		WARN_ON(x86_platform.calibrate_cpu != native_calibrate_cpu);
···
 		return;
 	}
 
-	if (tsc_clocksource_reliable || no_tsc_watchdog)
+	if (tsc_clocksource_reliable || tsc_watchdog == TSC_WATCHDOG_OFF)
 		tsc_disable_clocksource_watchdog();
 
 	clocksource_register_khz(&clocksource_tsc_early, tsc_khz);
+1 -3
drivers/clocksource/acpi_pm.c
···
 	.rating		= 200,
 	.read		= acpi_pm_read,
 	.mask		= (u64)ACPI_PM_MASK,
-	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
+	.flags		= CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_CALIBRATED,
 	.suspend	= acpi_pm_suspend,
 	.resume		= acpi_pm_resume,
 };
···
 		return -ENODEV;
 	}
 
-	if (tsc_clocksource_watchdog_disabled())
-		clocksource_acpi_pm.flags |= CLOCK_SOURCE_MUST_VERIFY;
 	return clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
 }
 
+7 -21
include/linux/clocksource.h
···
  * @shift:		Cycle to nanosecond divisor (power of two)
  * @max_idle_ns:	Maximum idle time permitted by the clocksource (nsecs)
  * @maxadj:		Maximum adjustment value to mult (~11%)
- * @uncertainty_margin:	Maximum uncertainty in nanoseconds per half second.
- *			Zero says to use default WATCHDOG_THRESHOLD.
  * @archdata:		Optional arch-specific data
  * @max_cycles:		Maximum safe cycle value which won't overflow on
  *			multiplication
···
 	u32			shift;
 	u64			max_idle_ns;
 	u32			maxadj;
-	u32			uncertainty_margin;
 #ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
 	struct arch_clocksource_data archdata;
 #endif
···
 	struct list_head	wd_list;
 	u64			cs_last;
 	u64			wd_last;
+	unsigned int		wd_cpu;
 #endif
 	struct module		*owner;
 };
···
  */
 #define CLOCK_SOURCE_IS_CONTINUOUS		0x01
 #define CLOCK_SOURCE_MUST_VERIFY		0x02
+#define CLOCK_SOURCE_CALIBRATED			0x04
 
 #define CLOCK_SOURCE_WATCHDOG			0x10
 #define CLOCK_SOURCE_VALID_FOR_HRES		0x20
 #define CLOCK_SOURCE_UNSTABLE			0x40
 #define CLOCK_SOURCE_SUSPEND_NONSTOP		0x80
 #define CLOCK_SOURCE_RESELECT			0x100
-#define CLOCK_SOURCE_VERIFY_PERCPU		0x200
-#define CLOCK_SOURCE_CAN_INLINE_READ		0x400
-#define CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT	0x800
+#define CLOCK_SOURCE_CAN_INLINE_READ		0x200
+#define CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT	0x400
+
+#define CLOCK_SOURCE_WDTEST			0x800
+#define CLOCK_SOURCE_WDTEST_PERCPU		0x1000
 
 /* simplify initialization of mask field */
 #define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0)
···
 
 #define TIMER_ACPI_DECLARE(name, table_id, fn)		\
 	ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)
-
-static inline unsigned int clocksource_get_max_watchdog_retry(void)
-{
-	/*
-	 * When system is in the boot phase or under heavy workload, there
-	 * can be random big latencies during the clocksource/watchdog
-	 * read, so allow retries to filter the noise latency. As the
-	 * latency's frequency and maximum value goes up with the number of
-	 * CPUs, scale the number of retries with the number of online
-	 * CPUs.
-	 */
-	return (ilog2(num_online_cpus()) / 2) + 1;
-}
-
-void clocksource_verify_percpu(struct clocksource *cs);
 
 /**
  * struct clocksource_base - hardware abstraction for clock on which a clocksource
-12
kernel/time/Kconfig
···
 	  hardware is not capable then this option only increases
 	  the size of the kernel image.
 
-config CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
-	int "Clocksource watchdog maximum allowable skew (in microseconds)"
-	depends on CLOCKSOURCE_WATCHDOG
-	range 50 1000
-	default 125
-	help
-	  Specify the maximum amount of allowable watchdog skew in
-	  microseconds before reporting the clocksource to be unstable.
-	  The default is based on a half-second clocksource watchdog
-	  interval and NTP's maximum frequency drift of 500 parts
-	  per million.  If the clocksource is good enough for NTP,
-	  it is good enough for the clocksource watchdog!
 endif
 
 config POSIX_AUX_CLOCKS
+145 -151
kernel/time/clocksource-wdtest.c
···
  * Unit test for the clocksource watchdog.
  *
  * Copyright (C) 2021 Facebook, Inc.
+ * Copyright (C) 2026 Intel Corp.
  *
  * Author: Paul E. McKenney <paulmck@kernel.org>
+ * Author: Thomas Gleixner <tglx@kernel.org>
  */
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
-#include <linux/device.h>
 #include <linux/clocksource.h>
-#include <linux/init.h>
-#include <linux/module.h>
-#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
-#include <linux/tick.h>
-#include <linux/kthread.h>
 #include <linux/delay.h>
-#include <linux/prandom.h>
-#include <linux/cpu.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
 
 #include "tick-internal.h"
+#include "timekeeping_internal.h"
 
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("Clocksource watchdog unit test");
 MODULE_AUTHOR("Paul E. McKenney <paulmck@kernel.org>");
+MODULE_AUTHOR("Thomas Gleixner <tglx@kernel.org>");
 
-static int holdoff = IS_BUILTIN(CONFIG_TEST_CLOCKSOURCE_WATCHDOG) ? 10 : 0;
-module_param(holdoff, int, 0444);
-MODULE_PARM_DESC(holdoff, "Time to wait to start test (s).");
-
-/* Watchdog kthread's task_struct pointer for debug purposes. */
-static struct task_struct *wdtest_task;
-
-static u64 wdtest_jiffies_read(struct clocksource *cs)
-{
-	return (u64)jiffies;
-}
-
-static struct clocksource clocksource_wdtest_jiffies = {
-	.name			= "wdtest-jiffies",
-	.rating			= 1, /* lowest valid rating*/
-	.uncertainty_margin	= TICK_NSEC,
-	.read			= wdtest_jiffies_read,
-	.mask			= CLOCKSOURCE_MASK(32),
-	.flags			= CLOCK_SOURCE_MUST_VERIFY,
-	.mult			= TICK_NSEC << JIFFIES_SHIFT, /* details above */
-	.shift			= JIFFIES_SHIFT,
-	.max_cycles		= 10,
+enum wdtest_states {
+	WDTEST_INJECT_NONE,
+	WDTEST_INJECT_DELAY,
+	WDTEST_INJECT_POSITIVE,
+	WDTEST_INJECT_NEGATIVE,
+	WDTEST_INJECT_PERCPU	= 0x100,
 };
 
-static int wdtest_ktime_read_ndelays;
-static bool wdtest_ktime_read_fuzz;
+static enum wdtest_states wdtest_state;
+static unsigned long wdtest_test_count;
+static ktime_t wdtest_last_ts, wdtest_offset;
+
+#define SHIFT_4000PPM	8
+
+static ktime_t wdtest_get_offset(struct clocksource *cs)
+{
+	if (wdtest_state < WDTEST_INJECT_PERCPU)
+		return wdtest_test_count & 0x1 ? 0 : wdtest_offset >> SHIFT_4000PPM;
+
+	/* Only affect the readout of the "remote" CPU */
+	return cs->wd_cpu == smp_processor_id() ? 0 : NSEC_PER_MSEC;
+}
 
 static u64 wdtest_ktime_read(struct clocksource *cs)
 {
-	int wkrn = READ_ONCE(wdtest_ktime_read_ndelays);
-	static int sign = 1;
-	u64 ret;
+	ktime_t now = ktime_get_raw_fast_ns();
+	ktime_t intv = now - wdtest_last_ts;
 
-	if (wkrn) {
-		udelay(cs->uncertainty_margin / 250);
-		WRITE_ONCE(wdtest_ktime_read_ndelays, wkrn - 1);
+	/*
+	 * Only increment the test counter once per watchdog interval and
+	 * store the interval for the offset calculation of this step. This
+	 * guarantees a consistent behaviour even if the other side needs
+	 * to repeat due to a watchdog read timeout.
+	 */
+	if (intv > (NSEC_PER_SEC / 4)) {
+		WRITE_ONCE(wdtest_test_count, wdtest_test_count + 1);
+		wdtest_last_ts = now;
+		wdtest_offset = intv;
 	}
-	ret = ktime_get_real_fast_ns();
-	if (READ_ONCE(wdtest_ktime_read_fuzz)) {
-		sign = -sign;
-		ret = ret + sign * 100 * NSEC_PER_MSEC;
+
+	switch (wdtest_state & ~WDTEST_INJECT_PERCPU) {
+	case WDTEST_INJECT_POSITIVE:
+		return now + wdtest_get_offset(cs);
+	case WDTEST_INJECT_NEGATIVE:
+		return now - wdtest_get_offset(cs);
+	case WDTEST_INJECT_DELAY:
+		udelay(500);
+		return now;
+	default:
+		return now;
 	}
-	return ret;
 }
 
-static void wdtest_ktime_cs_mark_unstable(struct clocksource *cs)
-{
-	pr_info("--- Marking %s unstable due to clocksource watchdog.\n", cs->name);
-}
-
-#define KTIME_FLAGS (CLOCK_SOURCE_IS_CONTINUOUS | \
-		     CLOCK_SOURCE_VALID_FOR_HRES | \
-		     CLOCK_SOURCE_MUST_VERIFY | \
-		     CLOCK_SOURCE_VERIFY_PERCPU)
+#define KTIME_FLAGS (CLOCK_SOURCE_IS_CONTINUOUS | \
+		     CLOCK_SOURCE_CALIBRATED | \
+		     CLOCK_SOURCE_MUST_VERIFY | \
+		     CLOCK_SOURCE_WDTEST)
 
 static struct clocksource clocksource_wdtest_ktime = {
 	.name			= "wdtest-ktime",
-	.rating			= 300,
+	.rating			= 10,
 	.read			= wdtest_ktime_read,
 	.mask			= CLOCKSOURCE_MASK(64),
 	.flags			= KTIME_FLAGS,
-	.mark_unstable		= wdtest_ktime_cs_mark_unstable,
 	.list			= LIST_HEAD_INIT(clocksource_wdtest_ktime.list),
 };
 
-/* Reset the clocksource if needed. */
-static void wdtest_ktime_clocksource_reset(void)
+static void wdtest_clocksource_reset(enum wdtest_states which, bool percpu)
 {
-	if (clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE) {
-		clocksource_unregister(&clocksource_wdtest_ktime);
-		clocksource_wdtest_ktime.flags = KTIME_FLAGS;
-		schedule_timeout_uninterruptible(HZ / 10);
-		clocksource_register_khz(&clocksource_wdtest_ktime, 1000 * 1000);
-	}
-}
-
-/* Run the specified series of watchdog tests. */
-static int wdtest_func(void *arg)
-{
-	unsigned long j1, j2;
-	int i, max_retries;
-	char *s;
-
-	schedule_timeout_uninterruptible(holdoff * HZ);
-
-	/*
-	 * Verify that jiffies-like clocksources get the manually
-	 * specified uncertainty margin.
-	 */
-	pr_info("--- Verify jiffies-like uncertainty margin.\n");
-	__clocksource_register(&clocksource_wdtest_jiffies);
-	WARN_ON_ONCE(clocksource_wdtest_jiffies.uncertainty_margin != TICK_NSEC);
-
-	j1 = clocksource_wdtest_jiffies.read(&clocksource_wdtest_jiffies);
-	schedule_timeout_uninterruptible(HZ);
-	j2 = clocksource_wdtest_jiffies.read(&clocksource_wdtest_jiffies);
-	WARN_ON_ONCE(j1 == j2);
-
-	clocksource_unregister(&clocksource_wdtest_jiffies);
-
-	/*
-	 * Verify that tsc-like clocksources are assigned a reasonable
-	 * uncertainty margin.
-	 */
-	pr_info("--- Verify tsc-like uncertainty margin.\n");
-	clocksource_register_khz(&clocksource_wdtest_ktime, 1000 * 1000);
-	WARN_ON_ONCE(clocksource_wdtest_ktime.uncertainty_margin < NSEC_PER_USEC);
-
-	j1 = clocksource_wdtest_ktime.read(&clocksource_wdtest_ktime);
-	udelay(1);
-	j2 = clocksource_wdtest_ktime.read(&clocksource_wdtest_ktime);
-	pr_info("--- tsc-like times: %lu - %lu = %lu.\n", j2, j1, j2 - j1);
-	WARN_ONCE(time_before(j2, j1 + NSEC_PER_USEC),
-		  "Expected at least 1000ns, got %lu.\n", j2 - j1);
-
-	/* Verify tsc-like stability with various numbers of errors injected. */
-	max_retries = clocksource_get_max_watchdog_retry();
-	for (i = 0; i <= max_retries + 1; i++) {
-		if (i <= 1 && i < max_retries)
-			s = "";
-		else if (i <= max_retries)
-			s = ", expect message";
-		else
-			s = ", expect clock skew";
-		pr_info("--- Watchdog with %dx error injection, %d retries%s.\n", i, max_retries, s);
-		WRITE_ONCE(wdtest_ktime_read_ndelays, i);
-		schedule_timeout_uninterruptible(2 * HZ);
-		WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
-		WARN_ON_ONCE((i <= max_retries) !=
-			     !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
-		wdtest_ktime_clocksource_reset();
-	}
-
-	/* Verify tsc-like stability with clock-value-fuzz error injection. */
-	pr_info("--- Watchdog clock-value-fuzz error injection, expect clock skew and per-CPU mismatches.\n");
-	WRITE_ONCE(wdtest_ktime_read_fuzz, true);
-	schedule_timeout_uninterruptible(2 * HZ);
-	WARN_ON_ONCE(!(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
-	clocksource_verify_percpu(&clocksource_wdtest_ktime);
-	WRITE_ONCE(wdtest_ktime_read_fuzz, false);
-
 	clocksource_unregister(&clocksource_wdtest_ktime);
 
-	pr_info("--- Done with test.\n");
+	pr_info("Test: State %d percpu %d\n", which, percpu);
+
+	wdtest_state = which;
+	if (percpu)
+		wdtest_state |= WDTEST_INJECT_PERCPU;
+	wdtest_test_count = 0;
+	wdtest_last_ts = 0;
+
+	clocksource_wdtest_ktime.rating = 10;
+	clocksource_wdtest_ktime.flags = KTIME_FLAGS;
+	if (percpu)
+		clocksource_wdtest_ktime.flags |= CLOCK_SOURCE_WDTEST_PERCPU;
+	clocksource_register_khz(&clocksource_wdtest_ktime, 1000 * 1000);
+}
+
+static bool wdtest_execute(enum wdtest_states which, bool percpu, unsigned int expect,
+			   unsigned long calls)
+{
+	wdtest_clocksource_reset(which, percpu);
+
+	for (; READ_ONCE(wdtest_test_count) < calls; msleep(100)) {
+		unsigned int flags = READ_ONCE(clocksource_wdtest_ktime.flags);
+
+		if (kthread_should_stop())
+			return false;
+
+		if (flags & CLOCK_SOURCE_UNSTABLE) {
+			if (expect & CLOCK_SOURCE_UNSTABLE)
+				return true;
+			pr_warn("Fail: Unexpected unstable\n");
+			return false;
+		}
+		if (flags & CLOCK_SOURCE_VALID_FOR_HRES) {
+			if (expect & CLOCK_SOURCE_VALID_FOR_HRES)
+				return true;
+			pr_warn("Fail: Unexpected valid for highres\n");
+			return false;
+		}
+	}
+
+	if (!expect)
+		return true;
+
+	pr_warn("Fail: Timed out\n");
+	return false;
+}
+
+static bool wdtest_run(bool percpu)
+{
+	if (!wdtest_execute(WDTEST_INJECT_NONE, percpu, CLOCK_SOURCE_VALID_FOR_HRES, 8))
+		return false;
+
+	if (!wdtest_execute(WDTEST_INJECT_DELAY, percpu, 0, 4))
+		return false;
+
+	if (!wdtest_execute(WDTEST_INJECT_POSITIVE, percpu, CLOCK_SOURCE_UNSTABLE, 8))
+		return false;
+
+	if (!wdtest_execute(WDTEST_INJECT_NEGATIVE, percpu, CLOCK_SOURCE_UNSTABLE, 8))
+		return false;
+
+	return true;
+}
+
+static int wdtest_func(void *arg)
+{
+	clocksource_register_khz(&clocksource_wdtest_ktime, 1000 * 1000);
+	if (wdtest_run(false)) {
+		if (wdtest_run(true))
+			pr_info("Success: All tests passed\n");
+	}
+	clocksource_unregister(&clocksource_wdtest_ktime);
+
+	if (!IS_MODULE(CONFIG_TEST_CLOCKSOURCE_WATCHDOG))
+		return 0;
+
+	while (!kthread_should_stop())
+		schedule_timeout_interruptible(3600 * HZ);
 	return 0;
 }
 
-static void wdtest_print_module_parms(void)
-{
-	pr_alert("--- holdoff=%d\n", holdoff);
-}
-
-/* Cleanup function. */
-static void clocksource_wdtest_cleanup(void)
-{
-}
+static struct task_struct *wdtest_thread;
 
 static int __init clocksource_wdtest_init(void)
 {
-	int ret = 0;
+	struct task_struct *t = kthread_run(wdtest_func, NULL, "wdtest");
 
-	wdtest_print_module_parms();
-
-	/* Create watchdog-test task. */
-	wdtest_task = kthread_run(wdtest_func, NULL, "wdtest");
-	if (IS_ERR(wdtest_task)) {
-		ret = PTR_ERR(wdtest_task);
-		pr_warn("%s: Failed to create wdtest kthread.\n", __func__);
-		wdtest_task = NULL;
-		return ret;
+	if (IS_ERR(t)) {
+		pr_warn("Failed to create wdtest kthread.\n");
+		return PTR_ERR(t);
 	}
-
+	wdtest_thread = t;
 	return 0;
 }
-
 module_init(clocksource_wdtest_init);
+
+static void clocksource_wdtest_cleanup(void)
+{
+	if (wdtest_thread)
+		kthread_stop(wdtest_thread);
+}
 module_exit(clocksource_wdtest_cleanup);
+461 -404
kernel/time/clocksource.c
···
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
-#include <linux/device.h>
 #include <linux/clocksource.h>
-#include <linux/init.h>
-#include <linux/module.h>
-#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
-#include <linux/tick.h>
-#include <linux/kthread.h>
-#include <linux/prandom.h>
 #include <linux/cpu.h>
+#include <linux/delay.h>
+#include <linux/device.h>
+#include <linux/init.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+#include <linux/prandom.h>
+#include <linux/sched.h>
+#include <linux/tick.h>
+#include <linux/topology.h>
 
 #include "tick-internal.h"
 #include "timekeeping_internal.h"
···
 static int finished_booting;
 static u64 suspend_start;
 
-/*
- * Interval: 0.5sec.
- */
-#define WATCHDOG_INTERVAL (HZ >> 1)
-#define WATCHDOG_INTERVAL_MAX_NS ((2 * WATCHDOG_INTERVAL) * (NSEC_PER_SEC / HZ))
-
-/*
- * Threshold: 0.0312s, when doubled: 0.0625s.
- */
-#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 5)
-
-/*
- * Maximum permissible delay between two readouts of the watchdog
- * clocksource surrounding a read of the clocksource being validated.
- * This delay could be due to SMIs, NMIs, or to VCPU preemptions. Used as
- * a lower bound for cs->uncertainty_margin values when registering clocks.
- *
- * The default of 500 parts per million is based on NTP's limits.
- * If a clocksource is good enough for NTP, it is good enough for us!
- *
- * In other words, by default, even if a clocksource is extremely
- * precise (for example, with a sub-nanosecond period), the maximum
- * permissible skew between the clocksource watchdog and the clocksource
- * under test is not permitted to go below the 500ppm minimum defined
- * by MAX_SKEW_USEC. This 500ppm minimum may be overridden using the
- * CLOCKSOURCE_WATCHDOG_MAX_SKEW_US Kconfig option.
- */
-#ifdef CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
-#define MAX_SKEW_USEC	CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
-#else
-#define MAX_SKEW_USEC	(125 * WATCHDOG_INTERVAL / HZ)
-#endif
-
-/*
- * Default for maximum permissible skew when cs->uncertainty_margin is
- * not specified, and the lower bound even when cs->uncertainty_margin
- * is specified. This is also the default that is used when registering
- * clocks with unspecified cs->uncertainty_margin, so this macro is used
- * even in CONFIG_CLOCKSOURCE_WATCHDOG=n kernels.
- */
-#define WATCHDOG_MAX_SKEW (MAX_SKEW_USEC * NSEC_PER_USEC)
-
 #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
 static void clocksource_watchdog_work(struct work_struct *work);
 static void clocksource_select(void);
···
 static DEFINE_SPINLOCK(watchdog_lock);
 static int watchdog_running;
 static atomic_t watchdog_reset_pending;
-static int64_t watchdog_max_interval;
+
+/* Watchdog interval: 0.5sec. */
+#define WATCHDOG_INTERVAL	(HZ >> 1)
+#define WATCHDOG_INTERVAL_NS	(WATCHDOG_INTERVAL * (NSEC_PER_SEC / HZ))
+
+/* Maximum time between two reference watchdog readouts */
+#define WATCHDOG_READOUT_MAX_NS	(50U * NSEC_PER_USEC)
+
+/*
+ * Maximum time between two remote readouts for NUMA=n. On NUMA enabled systems
+ * the timeout is calculated from the numa distance.
+ */
+#define WATCHDOG_DEFAULT_TIMEOUT_NS	(50U * NSEC_PER_USEC)
+
+/*
+ * Remote timeout NUMA distance multiplier. The local distance is 10. The
+ * default remote distance is 20. ACPI tables provide more accurate numbers
+ * which are guaranteed to be greater than the local distance.
+ *
+ * This results in a 5us base value, which is equivalent to the above !NUMA
+ * default.
+ */
+#define WATCHDOG_NUMA_MULTIPLIER_NS	((u64)(WATCHDOG_DEFAULT_TIMEOUT_NS / LOCAL_DISTANCE))
+
+/* Limit the NUMA timeout in case the distance values are insanely big */
+#define WATCHDOG_NUMA_MAX_TIMEOUT_NS	((u64)(500U * NSEC_PER_USEC))
+
+/* Shift values to calculate the approximate $N ppm of a given delta. */
+#define SHIFT_500PPM	11
+#define SHIFT_4000PPM	8
+
+/* Number of attempts to read the watchdog */
+#define WATCHDOG_FREQ_RETRIES	3
+
+/* Five reads local and remote for inter CPU skew detection */
+#define WATCHDOG_REMOTE_MAX_SEQ	10
 
 static inline void clocksource_watchdog_lock(unsigned long *flags)
 {
···
 	spin_unlock_irqrestore(&watchdog_lock, flags);
 }
 
-static int verify_n_cpus = 8;
-module_param(verify_n_cpus, int, 0644);
-
-enum wd_read_status {
-	WD_READ_SUCCESS,
-	WD_READ_UNSTABLE,
-	WD_READ_SKIP
-};
-
-static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
-{
-	int64_t md = watchdog->uncertainty_margin;
-	unsigned int nretries, max_retries;
-	int64_t wd_delay, wd_seq_delay;
-	u64 wd_end, wd_end2;
-
-	max_retries = clocksource_get_max_watchdog_retry();
-	for (nretries = 0; nretries <= max_retries; nretries++) {
-		local_irq_disable();
-		*wdnow = watchdog->read(watchdog);
-		*csnow = cs->read(cs);
-		wd_end = watchdog->read(watchdog);
-		wd_end2 = watchdog->read(watchdog);
-		local_irq_enable();
-
-		wd_delay = cycles_to_nsec_safe(watchdog, *wdnow, wd_end);
-		if (wd_delay <= md + cs->uncertainty_margin) {
-			if (nretries > 1 && nretries >= max_retries) {
-				pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
-					smp_processor_id(), watchdog->name, nretries);
-			}
-			return WD_READ_SUCCESS;
-		}
-
-		/*
-		 * Now compute delay in consecutive watchdog read to see if
-		 * there is too much external interferences that cause
-		 * significant delay in reading both clocksource and watchdog.
-		 *
-		 * If consecutive WD read-back delay > md, report
-		 * system busy, reinit the watchdog and skip the current
-		 * watchdog test.
-		 */
-		wd_seq_delay = cycles_to_nsec_safe(watchdog, wd_end, wd_end2);
-		if (wd_seq_delay > md)
-			goto skip_test;
-	}
-
-	pr_warn("timekeeping watchdog on CPU%d: wd-%s-wd excessive read-back delay of %lldns vs. limit of %ldns, wd-wd read-back delay only %lldns, attempt %d, marking %s unstable\n",
-		smp_processor_id(), cs->name, wd_delay, WATCHDOG_MAX_SKEW, wd_seq_delay, nretries, cs->name);
-	return WD_READ_UNSTABLE;
-
-skip_test:
-	pr_info("timekeeping watchdog on CPU%d: %s wd-wd read-back delay of %lldns\n",
-		smp_processor_id(), watchdog->name, wd_seq_delay);
-	pr_info("wd-%s-wd read-back delay of %lldns, clock-skew test skipped!\n",
-		cs->name, wd_delay);
-	return WD_READ_SKIP;
-}
-
-static u64 csnow_mid;
-static cpumask_t cpus_ahead;
-static cpumask_t cpus_behind;
-static cpumask_t cpus_chosen;
-
-static void clocksource_verify_choose_cpus(void)
-{
-	int cpu, i, n = verify_n_cpus;
-
-	if (n < 0 || n >= num_online_cpus()) {
-		/* Check all of the CPUs. */
-		cpumask_copy(&cpus_chosen, cpu_online_mask);
-		cpumask_clear_cpu(smp_processor_id(), &cpus_chosen);
-		return;
-	}
-
-	/* If no checking desired, or no other CPU to check, leave. */
-	cpumask_clear(&cpus_chosen);
-	if (n == 0 || num_online_cpus() <= 1)
-		return;
-
-	/* Make sure to select at least one CPU other than the current CPU. */
-	cpu = cpumask_any_but(cpu_online_mask, smp_processor_id());
-	if (WARN_ON_ONCE(cpu >= nr_cpu_ids))
-		return;
-	cpumask_set_cpu(cpu, &cpus_chosen);
-
-	/* Force a sane value for the boot parameter. */
-	if (n > nr_cpu_ids)
-		n = nr_cpu_ids;
-
-	/*
-	 * Randomly select the specified number of CPUs. If the same
-	 * CPU is selected multiple times, that CPU is checked only once,
-	 * and no replacement CPU is selected. This gracefully handles
-	 * situations where verify_n_cpus is greater than the number of
-	 * CPUs that are currently online.
-	 */
-	for (i = 1; i < n; i++) {
-		cpu = cpumask_random(cpu_online_mask);
-		if (!WARN_ON_ONCE(cpu >= nr_cpu_ids))
-			cpumask_set_cpu(cpu, &cpus_chosen);
-	}
-
-	/* Don't verify ourselves. */
-	cpumask_clear_cpu(smp_processor_id(), &cpus_chosen);
-}
-
-static void clocksource_verify_one_cpu(void *csin)
-{
-	struct clocksource *cs = (struct clocksource *)csin;
-
-	csnow_mid = cs->read(cs);
-}
-
-void clocksource_verify_percpu(struct clocksource *cs)
-{
-	int64_t cs_nsec, cs_nsec_max = 0, cs_nsec_min = LLONG_MAX;
-	u64 csnow_begin, csnow_end;
-	int cpu, testcpu;
-	s64 delta;
-
-	if (verify_n_cpus == 0)
-		return;
-	cpumask_clear(&cpus_ahead);
-	cpumask_clear(&cpus_behind);
-	cpus_read_lock();
-	migrate_disable();
-	clocksource_verify_choose_cpus();
-	if (cpumask_empty(&cpus_chosen)) {
-		migrate_enable();
-		cpus_read_unlock();
-		pr_warn("Not enough CPUs to check clocksource '%s'.\n", cs->name);
-		return;
-	}
-	testcpu = smp_processor_id();
-	pr_info("Checking clocksource %s synchronization from CPU %d to CPUs %*pbl.\n",
-		cs->name, testcpu, cpumask_pr_args(&cpus_chosen));
-	preempt_disable();
-	for_each_cpu(cpu, &cpus_chosen) {
-		if (cpu == testcpu)
-			continue;
-		csnow_begin = cs->read(cs);
-		smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1);
-		csnow_end = cs->read(cs);
-		delta = (s64)((csnow_mid - csnow_begin) & cs->mask);
-		if (delta < 0)
-			cpumask_set_cpu(cpu, &cpus_behind);
-		delta = (csnow_end - csnow_mid) & cs->mask;
-		if (delta < 0)
-			cpumask_set_cpu(cpu, &cpus_ahead);
-		cs_nsec = cycles_to_nsec_safe(cs, csnow_begin, csnow_end);
-		if (cs_nsec > cs_nsec_max)
-			cs_nsec_max = cs_nsec;
-		if (cs_nsec < cs_nsec_min)
-			cs_nsec_min = cs_nsec;
-	}
-	preempt_enable();
-	migrate_enable();
-	cpus_read_unlock();
-	if (!cpumask_empty(&cpus_ahead))
-		pr_warn("        CPUs %*pbl ahead of CPU %d for clocksource %s.\n",
-			cpumask_pr_args(&cpus_ahead), testcpu, cs->name);
-	if (!cpumask_empty(&cpus_behind))
-		pr_warn("        CPUs %*pbl behind CPU %d for clocksource %s.\n",
-			cpumask_pr_args(&cpus_behind), testcpu, cs->name);
-	pr_info("        CPU %d check durations %lldns - %lldns for clocksource %s.\n",
-		testcpu, cs_nsec_min, cs_nsec_max, cs->name);
-}
-EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
-
 static inline void clocksource_reset_watchdog(void)
 {
 	struct clocksource *cs;
···
 		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
 }
 
+enum wd_result {
+	WD_SUCCESS,
+	WD_FREQ_NO_WATCHDOG,
+	WD_FREQ_TIMEOUT,
+	WD_FREQ_RESET,
+	WD_FREQ_SKEWED,
+	WD_CPU_TIMEOUT,
+	WD_CPU_SKEWED,
+};
+
+struct watchdog_cpu_data {
+	/* Keep first as it is 32 byte aligned */
+	call_single_data_t	csd;
+	atomic_t		remote_inprogress;
+	enum wd_result		result;
+	u64			cpu_ts[2];
+	struct clocksource	*cs;
+	/* Ensure that the sequence is in a separate cache line */
+	atomic_t		seq ____cacheline_aligned;
+	/* Set by the control CPU according to NUMA distance */
+	u64			timeout_ns;
+};
+
+struct watchdog_data {
+	raw_spinlock_t		lock;
+	enum wd_result		result;
+
+	u64			wd_seq;
+	u64			wd_delta;
+	u64			cs_delta;
+	u64			cpu_ts[2];
+
+	unsigned int		curr_cpu;
+} ____cacheline_aligned_in_smp;
+
+static void watchdog_check_skew_remote(void *unused);
+
+static DEFINE_PER_CPU_ALIGNED(struct watchdog_cpu_data, watchdog_cpu_data) = {
+	.csd = CSD_INIT(watchdog_check_skew_remote, NULL),
+};
+
+static struct watchdog_data watchdog_data = {
+	.lock = __RAW_SPIN_LOCK_UNLOCKED(watchdog_data.lock),
+};
+
+static inline void watchdog_set_result(struct watchdog_cpu_data *wd, enum wd_result result)
+{
+	guard(raw_spinlock)(&watchdog_data.lock);
+	if (!wd->result) {
+		atomic_set(&wd->seq, WATCHDOG_REMOTE_MAX_SEQ);
+		WRITE_ONCE(wd->result, result);
+	}
+}
+
+/* Wait for the sequence number to hand over control. */
+static bool watchdog_wait_seq(struct watchdog_cpu_data *wd, u64 start, int seq)
+{
+	for (int cnt = 0; atomic_read(&wd->seq) < seq; cnt++) {
+		/* Bail if the other side set an error result */
+		if (READ_ONCE(wd->result) != WD_SUCCESS)
+			return false;
+
+		/* Prevent endless loops if the other CPU does not react. */
+		if (cnt == 5000) {
+			u64 nsecs = ktime_get_raw_fast_ns();
+
+			if (nsecs - start >= wd->timeout_ns) {
+				watchdog_set_result(wd, WD_CPU_TIMEOUT);
+				return false;
+			}
+			cnt = 0;
+		}
+		cpu_relax();
+	}
+	return seq < WATCHDOG_REMOTE_MAX_SEQ;
+}
+
+static void watchdog_check_skew(struct watchdog_cpu_data *wd, int index)
+{
+	u64 prev, now, delta, start = ktime_get_raw_fast_ns();
+	int local = index, remote = (index + 1) & 0x1;
+	struct clocksource *cs = wd->cs;
+
+	/* Set the local timestamp so that the first iteration works correctly */
+	wd->cpu_ts[local] = cs->read(cs);
+
+	/* Signal arrival */
+	atomic_inc(&wd->seq);
+
+	for (int seq = local + 2; seq < WATCHDOG_REMOTE_MAX_SEQ; seq += 2) {
+		if (!watchdog_wait_seq(wd, start, seq))
+			return;
+
+		/* Capture local timestamp before possible non-local coherency overhead */
+		now = cs->read(cs);
+
+		/* Store local timestamp before reading remote to limit coherency stalls */
+		wd->cpu_ts[local] = now;
+
+		prev = wd->cpu_ts[remote];
+		delta = (now - prev) & cs->mask;
+
+		if (delta > cs->max_raw_delta) {
+			watchdog_set_result(wd, WD_CPU_SKEWED);
+			return;
+		}
+
+		/* Hand over to the remote CPU */
+		atomic_inc(&wd->seq);
+	}
+}
+
+static void watchdog_check_skew_remote(void *unused)
+{
+	struct watchdog_cpu_data *wd = this_cpu_ptr(&watchdog_cpu_data);
+
+	atomic_inc(&wd->remote_inprogress);
+	watchdog_check_skew(wd, 1);
+	atomic_dec(&wd->remote_inprogress);
+}
+
+static inline bool wd_csd_locked(struct watchdog_cpu_data *wd)
+{
+	return READ_ONCE(wd->csd.node.u_flags) & CSD_FLAG_LOCK;
+}
+
+/*
+ * This is only invoked for remote CPUs. See watchdog_check_cpu_skew().
+ */
+static inline u64 wd_get_remote_timeout(unsigned int remote_cpu)
+{
+	unsigned int n1, n2;
+	u64 ns;
+
+	if (nr_node_ids == 1)
+		return WATCHDOG_DEFAULT_TIMEOUT_NS;
+
+	n1 = cpu_to_node(smp_processor_id());
+	n2 = cpu_to_node(remote_cpu);
+	ns = WATCHDOG_NUMA_MULTIPLIER_NS * node_distance(n1, n2);
+	return min(ns, WATCHDOG_NUMA_MAX_TIMEOUT_NS);
+}
+
+static void __watchdog_check_cpu_skew(struct clocksource *cs, unsigned int cpu)
+{
+	struct watchdog_cpu_data *wd;
+
+	wd = per_cpu_ptr(&watchdog_cpu_data, cpu);
+	if (atomic_read(&wd->remote_inprogress) || wd_csd_locked(wd)) {
+		watchdog_data.result = WD_CPU_TIMEOUT;
+		return;
+	}
+
+	atomic_set(&wd->seq, 0);
+	wd->result = WD_SUCCESS;
+	wd->cs = cs;
+	/* Store the current CPU ID for the watchdog test unit */
+	cs->wd_cpu = smp_processor_id();
+
+	wd->timeout_ns = wd_get_remote_timeout(cpu);
+
+	/* Kick the remote CPU into the watchdog function */
+	if (WARN_ON_ONCE(smp_call_function_single_async(cpu, &wd->csd))) {
+		watchdog_data.result = WD_CPU_TIMEOUT;
+		return;
+	}
+
+	scoped_guard(irq)
+		watchdog_check_skew(wd, 0);
+
+	scoped_guard(raw_spinlock_irq, &watchdog_data.lock) {
+		watchdog_data.result = wd->result;
+		memcpy(watchdog_data.cpu_ts, wd->cpu_ts, sizeof(wd->cpu_ts));
+	}
+}
+
+static void watchdog_check_cpu_skew(struct clocksource *cs)
+{
+	unsigned int cpu = watchdog_data.curr_cpu;
+
+	cpu = cpumask_next_wrap(cpu, cpu_online_mask);
+	watchdog_data.curr_cpu = cpu;
+
+	/* Skip the current CPU. Handles num_online_cpus() == 1 as well */
+	if (cpu == smp_processor_id())
+		return;
+
+	/* Don't interfere with the test mechanics */
+	if ((cs->flags & CLOCK_SOURCE_WDTEST) && !(cs->flags & CLOCK_SOURCE_WDTEST_PERCPU))
+		return;
+
+	__watchdog_check_cpu_skew(cs, cpu);
+}
+
+static bool watchdog_check_freq(struct clocksource *cs, bool reset_pending)
+{
+	unsigned int ppm_shift = SHIFT_4000PPM;
+	u64 wd_ts0, wd_ts1, cs_ts;
+
+	watchdog_data.result = WD_SUCCESS;
+	if (!watchdog) {
+		watchdog_data.result = WD_FREQ_NO_WATCHDOG;
+		return false;
+	}
+
+	if (cs->flags & CLOCK_SOURCE_WDTEST_PERCPU)
+		return true;
+
+	/*
+	 * If both the clocksource and the watchdog claim they are
+	 * calibrated use 500ppm limit. Uncalibrated clocksources need a
+	 * larger allowance because the firmware supplied frequencies can be
+	 * way off.
+	 */
+	if (watchdog->flags & CLOCK_SOURCE_CALIBRATED && cs->flags & CLOCK_SOURCE_CALIBRATED)
+		ppm_shift = SHIFT_500PPM;
+
+	for (int retries = 0; retries < WATCHDOG_FREQ_RETRIES; retries++) {
+		s64 wd_last, cs_last, wd_seq, wd_delta, cs_delta, max_delta;
+
+		scoped_guard(irq) {
+			wd_ts0 = watchdog->read(watchdog);
+			cs_ts = cs->read(cs);
+			wd_ts1 = watchdog->read(watchdog);
+		}
+
+		wd_last = cs->wd_last;
+		cs_last = cs->cs_last;
+
+		/* Validate the watchdog readout window */
+		wd_seq = cycles_to_nsec_safe(watchdog, wd_ts0, wd_ts1);
+		if (wd_seq > WATCHDOG_READOUT_MAX_NS) {
+			/* Store for printout in case all retries fail */
+			watchdog_data.wd_seq = wd_seq;
+			continue;
+		}
+
+		/* Store for subsequent processing */
+		cs->wd_last = wd_ts0;
+		cs->cs_last = cs_ts;
+
+		/* First round or reset pending? */
+		if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) || reset_pending)
+			goto reset;
+
+		/* Calculate the nanosecond deltas from the last invocation */
+		wd_delta = cycles_to_nsec_safe(watchdog, wd_last, wd_ts0);
+		cs_delta = cycles_to_nsec_safe(cs, cs_last, cs_ts);
+
+		watchdog_data.wd_delta = wd_delta;
+		watchdog_data.cs_delta = cs_delta;
+
+		/*
+		 * Ensure that the deltas are within the readout limits of
+		 * the clocksource and the watchdog. Long delays can cause
+		 * clocksources to overflow.
+		 */
+		max_delta = max(wd_delta, cs_delta);
+		if (max_delta > cs->max_idle_ns || max_delta > watchdog->max_idle_ns)
+			goto reset;
+
+		/*
+		 * Calculate and validate the skew against the allowed PPM
+		 * value of the maximum delta plus the watchdog readout
+		 * time.
+		 */
+		if (abs(wd_delta - cs_delta) < (max_delta >> ppm_shift) + wd_seq)
+			return true;
+
+		watchdog_data.result = WD_FREQ_SKEWED;
+		return false;
+	}
+
+	watchdog_data.result = WD_FREQ_TIMEOUT;
+	return false;
+
+reset:
+	cs->flags |= CLOCK_SOURCE_WATCHDOG;
+	watchdog_data.result = WD_FREQ_RESET;
+	return false;
+}
+
+/* Synchronization for sched clock */
+static void clocksource_tick_stable(struct clocksource *cs)
+{
+	if (cs == curr_clocksource && cs->tick_stable)
+		cs->tick_stable(cs);
+}
+
+/* Conditionally enable high resolution mode */
+static void clocksource_enable_highres(struct clocksource *cs)
+{
+	if ((cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) ||
+	    !(cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) ||
+	    !watchdog || !(watchdog->flags & CLOCK_SOURCE_IS_CONTINUOUS))
+		return;
+
+	/* Mark it valid for high-res. */
+	cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;
+
+	/*
+	 * Can't schedule work before finished_booting is
+	 * true. clocksource_done_booting will take care of it.
+	 */
+	if (!finished_booting)
+		return;
+
+	if (cs->flags & CLOCK_SOURCE_WDTEST)
+		return;
+
+	/*
+	 * If this is not the current clocksource let the watchdog thread
+	 * reselect it. Due to the change to high res this clocksource
+	 * might be preferred now. If it is the current clocksource let the
+	 * tick code know about that change.
+	 */
+	if (cs != curr_clocksource) {
+		cs->flags |= CLOCK_SOURCE_RESELECT;
+		schedule_work(&watchdog_work);
+	} else {
+		tick_clock_notify();
+	}
+}
+
+static DEFINE_RATELIMIT_STATE(ratelimit_state, 5 * HZ, 2);
+
+static void watchdog_print_freq_timeout(struct clocksource *cs)
+{
+	if (!__ratelimit(&ratelimit_state))
+		return;
+	pr_info("Watchdog %s read timed out. Readout sequence took: %lluns\n",
+		watchdog->name, watchdog_data.wd_seq);
+}
+
+static void watchdog_print_freq_skew(struct clocksource *cs)
+{
+	pr_warn("Marking clocksource %s unstable due to frequency skew\n", cs->name);
+	pr_warn("Watchdog    %20s interval: %16lluns\n", watchdog->name, watchdog_data.wd_delta);
+	pr_warn("Clocksource %20s interval: %16lluns\n", cs->name, watchdog_data.cs_delta);
+}
+
+static void watchdog_handle_remote_timeout(struct clocksource *cs)
+{
+	pr_info_once("Watchdog remote CPU %u read timed out\n", watchdog_data.curr_cpu);
+}
+
+static void watchdog_print_remote_skew(struct clocksource *cs)
+{
+	pr_warn("Marking clocksource %s unstable due to inter CPU skew\n", cs->name);
+	if (watchdog_data.cpu_ts[0] < watchdog_data.cpu_ts[1]) {
+		pr_warn("CPU%u %16llu < CPU%u %16llu (cycles)\n", smp_processor_id(),
+			watchdog_data.cpu_ts[0], watchdog_data.curr_cpu, watchdog_data.cpu_ts[1]);
+	} else {
+		pr_warn("CPU%u %16llu < CPU%u %16llu (cycles)\n", watchdog_data.curr_cpu,
+			watchdog_data.cpu_ts[1], smp_processor_id(), watchdog_data.cpu_ts[0]);
+	}
+}
+
+static void watchdog_check_result(struct clocksource *cs)
+{
+	switch (watchdog_data.result) {
+	case WD_SUCCESS:
+		clocksource_tick_stable(cs);
+		clocksource_enable_highres(cs);
+		return;
+
+	case WD_FREQ_TIMEOUT:
+		watchdog_print_freq_timeout(cs);
+		/* Try again later and invalidate the reference timestamps. */
+		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
+		return;
+
+	case WD_FREQ_NO_WATCHDOG:
+	case WD_FREQ_RESET:
+		/*
+		 * Nothing to do when the reference timestamps were reset
+		 * or no watchdog clocksource registered.
+		 */
+		return;
+
+	case WD_FREQ_SKEWED:
+		watchdog_print_freq_skew(cs);
+		break;
+
+	case WD_CPU_TIMEOUT:
+		/* Remote check timed out. Try again next cycle. */
+		watchdog_handle_remote_timeout(cs);
+		return;
+
+	case WD_CPU_SKEWED:
+		watchdog_print_remote_skew(cs);
+		break;
+	}
+	__clocksource_unstable(cs);
+}
 
 static void clocksource_watchdog(struct timer_list *unused)
 {
-	int64_t wd_nsec, cs_nsec, interval;
-	u64 csnow, wdnow, cslast, wdlast;
-	int next_cpu, reset_pending;
 	struct clocksource *cs;
-	enum wd_read_status read_ret;
-	unsigned long extra_wait = 0;
-	u32 md;
+	bool reset_pending;
 
-	spin_lock(&watchdog_lock);
+	guard(spinlock)(&watchdog_lock);
 	if (!watchdog_running)
-		goto out;
+		return;
 
 	reset_pending = atomic_read(&watchdog_reset_pending);
 
 	list_for_each_entry(cs, &watchdog_list, wd_list) {
-
 		/* Clocksource already marked unstable? */
 		if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
 			if (finished_booting)
···
 			continue;
 		}
 
-		read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
-
-		if (read_ret == WD_READ_UNSTABLE) {
-			/* Clock readout unreliable, so give it up. */
-			__clocksource_unstable(cs);
-			continue;
+		/* Compare against watchdog clocksource if available */
+		if (watchdog_check_freq(cs, reset_pending)) {
+			/* Check for inter CPU skew */
+			watchdog_check_cpu_skew(cs);
 		}
 
-		/*
-		 * When WD_READ_SKIP is returned, it means the system is likely
-		 * under very heavy load, where the latency of reading
-		 * watchdog/clocksource is very big, and affect the accuracy of
-		 * watchdog check. So give system some space and suspend the
-		 * watchdog check for 5 minutes.
-		 */
-		if (read_ret == WD_READ_SKIP) {
-			/*
-			 * As the watchdog timer will be suspended, and
-			 * cs->last could keep unchanged for 5 minutes, reset
-			 * the counters.
-			 */
-			clocksource_reset_watchdog();
-			extra_wait = HZ * 300;
-			break;
-		}
-
-		/* Clocksource initialized ? */
-		if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
-		    atomic_read(&watchdog_reset_pending)) {
-			cs->flags |= CLOCK_SOURCE_WATCHDOG;
-			cs->wd_last = wdnow;
-			cs->cs_last = csnow;
-			continue;
-		}
-
-		wd_nsec = cycles_to_nsec_safe(watchdog, cs->wd_last, wdnow);
-		cs_nsec = cycles_to_nsec_safe(cs, cs->cs_last, csnow);
-		wdlast = cs->wd_last; /* save these in case we print them */
-		cslast = cs->cs_last;
-		cs->cs_last = csnow;
-		cs->wd_last = wdnow;
-
-		if (atomic_read(&watchdog_reset_pending))
-			continue;
-
-		/*
-		 * The processing of timer softirqs can get delayed (usually
-		 * on account of ksoftirqd not getting to run in a timely
-		 * manner), which causes the watchdog interval to stretch.
-		 * Skew detection may fail for longer watchdog intervals
-		 * on account of fixed margins being used.
-		 * Some clocksources, e.g. acpi_pm, cannot tolerate
-		 * watchdog intervals longer than a few seconds.
-		 */
-		interval = max(cs_nsec, wd_nsec);
-		if (unlikely(interval > WATCHDOG_INTERVAL_MAX_NS)) {
-			if (system_state > SYSTEM_SCHEDULING &&
-			    interval > 2 * watchdog_max_interval) {
-				watchdog_max_interval = interval;
-				pr_warn("Long readout interval, skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n",
-					cs_nsec, wd_nsec);
-			}
-			watchdog_timer.expires = jiffies;
-			continue;
-		}
-
-		/* Check the deviation from the watchdog clocksource. */
-		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
-		if (abs(cs_nsec - wd_nsec) > md) {
-			s64 cs_wd_msec;
-			s64 wd_msec;
-			u32 wd_rem;
-
-			pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
-				smp_processor_id(), cs->name);
-			pr_warn("                      '%s' wd_nsec: %lld wd_now: %llx wd_last: %llx mask: %llx\n",
-				watchdog->name, wd_nsec, wdnow, wdlast, watchdog->mask);
-			pr_warn("                      '%s' cs_nsec: %lld cs_now: %llx cs_last: %llx mask: %llx\n",
-				cs->name, cs_nsec, csnow, cslast, cs->mask);
-			cs_wd_msec = div_s64_rem(cs_nsec - wd_nsec, 1000 * 1000, &wd_rem);
-			wd_msec = div_s64_rem(wd_nsec, 1000 * 1000, &wd_rem);
-			pr_warn("                      Clocksource '%s' skewed %lld ns (%lld ms) over watchdog '%s' interval of %lld ns (%lld ms)\n",
-				cs->name, cs_nsec - wd_nsec, cs_wd_msec, watchdog->name, wd_nsec, wd_msec);
-			if (curr_clocksource == cs)
-				pr_warn("                      '%s' is current clocksource.\n", cs->name);
-			else if (curr_clocksource)
-				pr_warn("                      '%s' (not '%s') is current clocksource.\n", curr_clocksource->name, cs->name);
-			else
-				pr_warn("                      No current clocksource.\n");
-			__clocksource_unstable(cs);
-			continue;
-		}
-
-		if (cs == curr_clocksource && cs->tick_stable)
-			cs->tick_stable(cs);
-
-		if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) &&
-		    (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) &&
-		    (watchdog->flags & CLOCK_SOURCE_IS_CONTINUOUS)) {
-			/* Mark it valid for high-res. */
-			cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;
-
-			/*
-			 * clocksource_done_booting() will sort it if
-			 * finished_booting is not set yet.
-			 */
-			if (!finished_booting)
-				continue;
-
-			/*
-			 * If this is not the current clocksource let
-			 * the watchdog thread reselect it. Due to the
-			 * change to high res this clocksource might
-			 * be preferred now. If it is the current
-			 * clocksource let the tick code know about
-			 * that change.
-			 */
-			if (cs != curr_clocksource) {
-				cs->flags |= CLOCK_SOURCE_RESELECT;
-				schedule_work(&watchdog_work);
-			} else {
-				tick_clock_notify();
-			}
-		}
+		watchdog_check_result(cs);
 	}
 
-	/*
-	 * We only clear the watchdog_reset_pending, when we did a
-	 * full cycle through all clocksources.
-	 */
+	/* Clear after the full clocksource walk */
 	if (reset_pending)
 		atomic_dec(&watchdog_reset_pending);
 
-	/*
-	 * Cycle through CPUs to check if the CPUs stay synchronized
-	 * to each other.
-	 */
-	next_cpu = cpumask_next_wrap(raw_smp_processor_id(), cpu_online_mask);
-
-	/*
-	 * Arm timer if not already pending: could race with concurrent
-	 * pair clocksource_stop_watchdog() clocksource_start_watchdog().
-	 */
+	/* Could have been rearmed by a stop/start cycle */
 	if (!timer_pending(&watchdog_timer)) {
-		watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
-		add_timer_on(&watchdog_timer, next_cpu);
+		watchdog_timer.expires += WATCHDOG_INTERVAL;
+		add_timer_local(&watchdog_timer);
 	}
-out:
-	spin_unlock(&watchdog_lock);
 }
 
 static inline void clocksource_start_watchdog(void)
 {
-	if (watchdog_running || !watchdog || list_empty(&watchdog_list))
+	if (watchdog_running || list_empty(&watchdog_list))
 		return;
-	timer_setup(&watchdog_timer, clocksource_watchdog, 0);
+	timer_setup(&watchdog_timer, clocksource_watchdog, TIMER_PINNED);
 	watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL;
-	add_timer_on(&watchdog_timer, cpumask_first(cpu_online_mask));
+
+	add_timer_on(&watchdog_timer, get_boot_cpu_id());
 	watchdog_running = 1;
 }
 
 static inline void clocksource_stop_watchdog(void)
 {
-	if (!watchdog_running || (watchdog && !list_empty(&watchdog_list)))
+	if (!watchdog_running || !list_empty(&watchdog_list))
 		return;
 	timer_delete(&watchdog_timer);
 	watchdog_running = 0;
···
 	struct clocksource *cs, *tmp;
 	unsigned long flags;
 	int select = 0;
-
-	/* Do any required per-CPU skew verification. */
-	if (curr_clocksource &&
-	    curr_clocksource->flags & CLOCK_SOURCE_UNSTABLE &&
-	    curr_clocksource->flags & CLOCK_SOURCE_VERIFY_PERCPU)
-		clocksource_verify_percpu(curr_clocksource);
 
 	spin_lock_irqsave(&watchdog_lock, flags);
 	list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list) {
···
 			continue;
 		if (oneshot && !(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES))
 			continue;
+		if (cs->flags & CLOCK_SOURCE_WDTEST)
+			continue;
 		return cs;
 	}
 	return NULL;
···
 		if (skipcur && cs == curr_clocksource)
 			continue;
 		if (strcmp(cs->name, override_name) != 0)
+			continue;
+		if (cs->flags & CLOCK_SOURCE_WDTEST)
 			continue;
 		/*
 		 * Check to make sure we don't switch to a non-highres
···
 		/* Update cs::freq_khz */
 		cs->freq_khz = div_u64((u64)freq * scale, 1000);
 	}
-
-	/*
-	 * If the uncertainty margin is not specified, calculate it. If
-	 * both scale and freq are non-zero, calculate the clock period, but
-	 * bound below at 2*WATCHDOG_MAX_SKEW, that is, 500ppm by default.
-	 * However, if either of scale or freq is zero, be very conservative
-	 * and take the tens-of-milliseconds WATCHDOG_THRESHOLD value
-	 * for the uncertainty margin.  Allow stupidly small uncertainty
-	 * margins to be specified by the caller for testing purposes,
-	 * but warn to discourage production use of this capability.
-	 *
-	 * Bottom line: The sum of the uncertainty margins of the
-	 * watchdog clocksource and the clocksource under test will be at
-	 * least 500ppm by default.  For more information, please see the
-	 * comment preceding CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US above.
-	 */
-	if (scale && freq && !cs->uncertainty_margin) {
-		cs->uncertainty_margin = NSEC_PER_SEC / (scale * freq);
-		if (cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW)
-			cs->uncertainty_margin = 2 * WATCHDOG_MAX_SKEW;
-	} else if (!cs->uncertainty_margin) {
-		cs->uncertainty_margin = WATCHDOG_THRESHOLD;
-	}
-	WARN_ON_ONCE(cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW);
 
 	/*
 	 * Ensure clocksources that have large 'mult' values don't overflow
-1
kernel/time/jiffies.c
···
 static struct clocksource clocksource_jiffies = {
 	.name			= "jiffies",
 	.rating			= 1, /* lowest valid rating*/
-	.uncertainty_margin	= 32 * NSEC_PER_MSEC,
 	.read			= jiffies_read,
 	.mask			= CLOCKSOURCE_MASK(32),
 	.mult			= TICK_NSEC << JIFFIES_SHIFT, /* details above */