Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

Merge tag 'timers-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer core updates from Thomas Gleixner:

- A rework of the hrtimer subsystem to reduce the overhead for
frequently armed timers, especially the hrtick scheduler timer:

- Better timer locality decision

- Simplification of the evaluation of the first expiry time by
keeping track of the neighbor timers in the RB-tree by providing
a RB-tree variant with neighbor links. That avoids walking the
RB-tree on removal to find the next expiry time, but even more
importantly it allows a quick check whether a rearmed timer
changes its position in the RB-tree with the modified expiry
time or not. If not, the dequeue/enqueue sequence, both halves
of which can end up rebalancing the tree, can be avoided
completely.

- Deferred reprogramming of the underlying clock event device. This
optimizes for the situation where a hrtimer callback sets the
need resched bit. In that case the code attempts to defer the
re-programming of the clock event device up to the point where
the scheduler has picked the next task and has the next hrtick
timer armed. If there is no immediate reschedule, or soft
interrupts have to be handled before reaching the reschedule
point in the interrupt entry code, the clock event is reprogrammed
in one of those code paths to prevent the timer from becoming
stale.

- Support for clocksource coupled clockevents

The TSC deadline timer is coupled to the TSC. The next event is
programmed in TSC time. Currently this is done by converting the
CLOCK_MONOTONIC based expiry value into a relative timeout,
converting it into TSC ticks, reading the TSC, adding the delta
ticks and writing the deadline MSR.

As the timekeeping core has the conversion factors for the TSC
already, the whole back and forth conversion can be completely
avoided. The timekeeping core calculates the reverse conversion
factors from nanoseconds to TSC ticks and utilizes the base
timestamps of TSC and CLOCK_MONOTONIC which are updated once per
tick. This allows a direct conversion into the TSC deadline value
without reading the time and as a bonus keeps the deadline
conversion in sync with the TSC conversion factors, which are
updated by adjtimex() on systems with NTP/PTP enabled.

- Allow inlining of the clocksource read and clockevent write
functions when they are tiny enough, e.g. on x86 RDTSC and WRMSR.

With all those enhancements in place a hrtick enabled scheduler
provides the same performance as without hrtick. But also other
hrtimer users obviously benefit from these optimizations.

- Robustness improvements and cleanups of historical sins in the
hrtimer and timekeeping code.

- Rewrite of the clocksource watchdog.

The clocksource watchdog code has over time reached the state of an
impenetrable maze of duct tape and staples. The original design,
which was made in the context of systems far smaller than today, is
based on the assumption that the to be monitored clocksource (TSC)
can be trivially compared against a known to be stable clocksource
(HPET/ACPI-PM timer).

Over the years this rather naive approach turned out to have major
flaws. Long delays between the watchdog invocations can cause wrap
arounds of the reference clocksource. The access to the reference
clocksource degrades on large multi-socket systems due to
interconnect congestion. This has been addressed with various
heuristics which degraded the accuracy of the watchdog to the point
that it fails to detect actual TSC problems on older hardware which
exposes slow inter CPU drifts due to firmware manipulating the TSC to
hide SMI time.

The rewrite addresses this by:

- Restricting the validation against the reference clocksource to
the boot CPU which is usually closest to the legacy block which
contains the reference clocksource (HPET/ACPI-PM).

- Doing a round robin validation between the boot CPU and the other
CPUs based only on the TSC with an algorithm similar to the TSC
synchronization code during CPU hotplug.

- Being more lenient towards remote timeouts

- The usual tiny fixes, cleanups and enhancements all over the place

* tag 'timers-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
alarmtimer: Access timerqueue node under lock in suspend
hrtimer: Fix incorrect #endif comment for BITS_PER_LONG check
posix-timers: Fix stale function name in comment
timers: Get this_cpu once while clearing the idle state
clocksource: Rewrite watchdog code completely
clocksource: Don't use non-continuous clocksources as watchdog
x86/tsc: Handle CLOCK_SOURCE_VALID_FOR_HRES correctly
MIPS: Don't select CLOCKSOURCE_WATCHDOG
parisc: Remove unused clocksource flags
hrtimer: Add a helper to retrieve a hrtimer from its timerqueue node
hrtimer: Remove trailing comma after HRTIMER_MAX_CLOCK_BASES
hrtimer: Mark index and clockid of clock base as const
hrtimer: Drop unnecessary pointer indirection in hrtimer_expire_entry event
hrtimer: Drop spurious space in 'enum hrtimer_base_type'
hrtimer: Don't zero-initialize ret in hrtimer_nanosleep()
hrtimer: Remove hrtimer_get_expires_ns()
timekeeping: Mark offsets array as const
timekeeping/auxclock: Consistently use raw timekeeper for tk_setup_internals()
timer_list: Print offset as signed integer
tracing: Use explicit array size instead of sentinel elements in symbol printing
...

+2254 -1449
+1 -6
Documentation/admin-guide/kernel-parameters.txt
···
 			(HPET or PM timer) on systems whose TSC frequency was
 			obtained from HW or FW using either an MSR or CPUID(0x15).
 			Warn if the difference is more than 500 ppm.
-			[x86] watchdog: Use TSC as the watchdog clocksource with
-			which to check other HW timers (HPET or PM timer), but
-			only on systems where TSC has been deemed trustworthy.
-			This will be suppressed by an earlier tsc=nowatchdog and
-			can be overridden by a later tsc=nowatchdog. A console
-			message will flag any such suppression or overriding.
+			[x86] watchdog: Enforce the clocksource watchdog on TSC

 	tsc_early_khz= [X86,EARLY] Skip early TSC calibration and use the given
 			value instead. Useful when the early TSC frequency discovery
+1
MAINTAINERS
···
 F: include/linux/timex.h
 F: include/uapi/linux/time.h
 F: include/uapi/linux/timex.h
+F: kernel/time/.kunitconfig
 F: kernel/time/alarmtimer.c
 F: kernel/time/clocksource*
 F: kernel/time/ntp*
-1
arch/mips/Kconfig
···
 	bool

 config CSRC_R4K
-	select CLOCKSOURCE_WATCHDOG if CPU_FREQ
 	bool

 config CSRC_SB1250
+1 -4
arch/parisc/kernel/time.c
···
 	.read = read_cr16,
 	.mask = CLOCKSOURCE_MASK(BITS_PER_LONG),
 	.flags = CLOCK_SOURCE_IS_CONTINUOUS |
-		 CLOCK_SOURCE_VALID_FOR_HRES |
-		 CLOCK_SOURCE_MUST_VERIFY |
-		 CLOCK_SOURCE_VERIFY_PERCPU,
+		 CLOCK_SOURCE_VALID_FOR_HRES,
 };
-

 /*
  * timer interrupt and sched_clock() initialization
+2
arch/x86/Kconfig
···
 	select ARCH_USE_SYM_ANNOTATIONS
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANT_DEFAULT_BPF_JIT if X86_64
+	select ARCH_WANTS_CLOCKSOURCE_READ_INLINE if X86_64
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
 	select ARCH_WANTS_NO_INSTR
 	select ARCH_WANT_GENERAL_HUGETLB
···
 	select EDAC_SUPPORT
 	select GENERIC_CLOCKEVENTS_BROADCAST if X86_64 || (X86_32 && X86_LOCAL_APIC)
 	select GENERIC_CLOCKEVENTS_BROADCAST_IDLE if GENERIC_CLOCKEVENTS_BROADCAST
+	select GENERIC_CLOCKEVENTS_COUPLED_INLINE if X86_64
 	select GENERIC_CLOCKEVENTS_MIN_ADJUST
 	select GENERIC_CMOS_UPDATE
 	select GENERIC_CPU_AUTOPROBE
+22
arch/x86/include/asm/clock_inlined.h
···
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CLOCK_INLINED_H
+#define _ASM_X86_CLOCK_INLINED_H
+
+#include <asm/tsc.h>
+
+struct clocksource;
+
+static __always_inline u64 arch_inlined_clocksource_read(struct clocksource *cs)
+{
+	return (u64)rdtsc_ordered();
+}
+
+struct clock_event_device;
+
+static __always_inline void
+arch_inlined_clockevent_set_next_coupled(u64 cycles, struct clock_event_device *evt)
+{
+	native_wrmsrq(MSR_IA32_TSC_DEADLINE, cycles);
+}
+
+#endif
-1
arch/x86/include/asm/time.h
···

 extern void hpet_time_init(void);
 extern bool pit_timer_init(void);
-extern bool tsc_clocksource_watchdog_disabled(void);

 extern struct clock_event_device *global_clock_event;

+23 -18
arch/x86/kernel/apic/apic.c
···
 /*
  * Program the next event, relative to now
  */
-static int lapic_next_event(unsigned long delta,
-			    struct clock_event_device *evt)
+static int lapic_next_event(unsigned long delta, struct clock_event_device *evt)
 {
 	apic_write(APIC_TMICT, delta);
 	return 0;
 }

-static int lapic_next_deadline(unsigned long delta,
-			       struct clock_event_device *evt)
+static int lapic_next_deadline(unsigned long delta, struct clock_event_device *evt)
 {
-	u64 tsc;
+	/*
+	 * There is no weak_wrmsr_fence() required here as all of this is purely
+	 * CPU local. Avoid the [ml]fence overhead.
+	 */
+	u64 tsc = rdtsc();

-	/* This MSR is special and need a special fence: */
-	weak_wrmsr_fence();
-
-	tsc = rdtsc();
-	wrmsrq(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR));
+	native_wrmsrq(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR));
 	return 0;
 }

···
 	 * the timer _and_ zero the counter registers:
 	 */
 	if (v & APIC_LVT_TIMER_TSCDEADLINE)
-		wrmsrq(MSR_IA32_TSC_DEADLINE, 0);
+		native_wrmsrq(MSR_IA32_TSC_DEADLINE, 0);
 	else
 		apic_write(APIC_TMICT, 0);

···

 	if (!boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER))
 		return false;
+
+	/* XEN_PV does not support it, but be paranoia about it */
+	if (boot_cpu_has(X86_FEATURE_XENPV))
+		goto clear;
+
 	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		return true;

···
 	if (boot_cpu_data.microcode >= rev)
 		return true;

-	setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
 	pr_err(FW_BUG "TSC_DEADLINE disabled due to Errata; "
 	       "please update microcode to version: 0x%x (or later)\n", rev);
+
+clear:
+	setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
 	return false;
 }

···

 	if (this_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER)) {
 		levt->name = "lapic-deadline";
-		levt->features &= ~(CLOCK_EVT_FEAT_PERIODIC |
-				    CLOCK_EVT_FEAT_DUMMY);
+		levt->features &= ~(CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_DUMMY);
+		levt->features |= CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED;
+		levt->cs_id = CSID_X86_TSC;
 		levt->set_next_event = lapic_next_deadline;
-		clockevents_config_and_register(levt,
-						tsc_khz * (1000 / TSC_DIVISOR),
-						0xF, ~0UL);
-	} else
+		clockevents_config_and_register(levt, tsc_khz * (1000 / TSC_DIVISOR), 0xF, ~0UL);
+	} else {
 		clockevents_register_device(levt);
+	}

 	apic_update_vector(smp_processor_id(), LOCAL_TIMER_VECTOR, true);
 }
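The commit message describes the coupled mode as a direct conversion from a CLOCK_MONOTONIC expiry to a TSC deadline, using base timestamps that are updated once per tick. A minimal userspace sketch of that arithmetic follows. The struct coupled_base type, its field names and ns_to_deadline() are hypothetical and not taken from the kernel, and the sketch ignores multiplication overflow for very large deltas.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of the timekeeping state, refreshed once per tick. */
struct coupled_base {
	uint64_t base_ns;	/* CLOCK_MONOTONIC at the last update */
	uint64_t base_cycles;	/* counter value at the last update */
	uint32_t mult;		/* nanoseconds to cycles multiplier */
	uint32_t shift;		/* nanoseconds to cycles shift */
};

/*
 * Convert an absolute expiry in nanoseconds into an absolute counter
 * deadline without reading the counter. Assumes delta * mult fits in
 * 64 bits.
 */
static uint64_t ns_to_deadline(const struct coupled_base *b, uint64_t expires_ns)
{
	uint64_t delta_ns;

	assert(b->shift < 64);

	/* An expiry at or before the base timestamp is treated as already due. */
	if (expires_ns <= b->base_ns)
		return b->base_cycles;

	delta_ns = expires_ns - b->base_ns;
	return b->base_cycles + ((delta_ns * b->mult) >> b->shift);
}
```

With mult = 3 << 20 and shift = 20 (three cycles per nanosecond), an expiry 100 ns after the base timestamp maps to the base counter value plus 300 cycles.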
+1 -3
arch/x86/kernel/hpet.c
···
 	.rating = 250,
 	.read = read_hpet,
 	.mask = HPET_MASK,
-	.flags = CLOCK_SOURCE_IS_CONTINUOUS,
+	.flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_CALIBRATED,
 	.resume = hpet_resume_counter,
 };

···
 	if (!hpet_counting())
 		goto out_nohpet;

-	if (tsc_clocksource_watchdog_disabled())
-		clocksource_hpet.flags |= CLOCK_SOURCE_MUST_VERIFY;
 	clocksource_register_hz(&clocksource_hpet, (u32)hpet_freq);

 	if (id & HPET_ID_LEGSUP) {
+29 -32
arch/x86/kernel/tsc.c
···
 	return 1;
 }
 #endif
-
 __setup("notsc", notsc_setup);

+enum {
+	TSC_WATCHDOG_AUTO,
+	TSC_WATCHDOG_OFF,
+	TSC_WATCHDOG_ON,
+};
+
 static int no_sched_irq_time;
-static int no_tsc_watchdog;
-static int tsc_as_watchdog;
+static int tsc_watchdog;

 static int __init tsc_setup(char *str)
 {
···
 		no_sched_irq_time = 1;
 	if (!strcmp(str, "unstable"))
 		mark_tsc_unstable("boot parameter");
-	if (!strcmp(str, "nowatchdog")) {
-		no_tsc_watchdog = 1;
-		if (tsc_as_watchdog)
-			pr_alert("%s: Overriding earlier tsc=watchdog with tsc=nowatchdog\n",
-				 __func__);
-		tsc_as_watchdog = 0;
-	}
+	if (!strcmp(str, "nowatchdog"))
+		tsc_watchdog = TSC_WATCHDOG_OFF;
 	if (!strcmp(str, "recalibrate"))
 		tsc_force_recalibrate = 1;
-	if (!strcmp(str, "watchdog")) {
-		if (no_tsc_watchdog)
-			pr_alert("%s: tsc=watchdog overridden by earlier tsc=nowatchdog\n",
-				 __func__);
-		else
-			tsc_as_watchdog = 1;
-	}
+	if (!strcmp(str, "watchdog"))
+		tsc_watchdog = TSC_WATCHDOG_ON;
 	return 1;
 }
-
 __setup("tsc=", tsc_setup);

 #define MAX_RETRIES 5
···
 static struct clocksource clocksource_tsc_early = {
 	.name = "tsc-early",
 	.rating = 299,
-	.uncertainty_margin = 32 * NSEC_PER_MSEC,
 	.read = read_tsc,
 	.mask = CLOCKSOURCE_MASK(64),
 	.flags = CLOCK_SOURCE_IS_CONTINUOUS |
···
 	.read = read_tsc,
 	.mask = CLOCKSOURCE_MASK(64),
 	.flags = CLOCK_SOURCE_IS_CONTINUOUS |
-		 CLOCK_SOURCE_VALID_FOR_HRES |
+		 CLOCK_SOURCE_CAN_INLINE_READ |
 		 CLOCK_SOURCE_MUST_VERIFY |
-		 CLOCK_SOURCE_VERIFY_PERCPU,
+		 CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT,
 	.id = CSID_X86_TSC,
 	.vdso_clock_mode = VDSO_CLOCKMODE_TSC,
 	.enable = tsc_cs_enable,
···

 static void __init tsc_disable_clocksource_watchdog(void)
 {
+	if (tsc_watchdog == TSC_WATCHDOG_ON)
+		return;
 	clocksource_tsc_early.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
 	clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
-}
-
-bool tsc_clocksource_watchdog_disabled(void)
-{
-	return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY) &&
-	       tsc_as_watchdog && !no_tsc_watchdog;
 }

 static void __init check_system_tsc_reliable(void)
···
 		(unsigned long)tsc_khz / 1000,
 		(unsigned long)tsc_khz % 1000);

+	clocksource_tsc.flags |= CLOCK_SOURCE_CALIBRATED;
+
 	/* Inform the TSC deadline clockevent devices about the recalibration */
 	lapic_update_tsc_freq();

···
 		have_art = true;
 		clocksource_tsc.base = &art_base_clk;
 	}
+
+	/*
+	 * Transfer the valid for high resolution flag if it was set on the
+	 * early TSC already. That guarantees that there is no intermediate
+	 * clocksource selected once the early TSC is unregistered.
+	 */
+	if (clocksource_tsc_early.flags & CLOCK_SOURCE_VALID_FOR_HRES)
+		clocksource_tsc.flags |= CLOCK_SOURCE_VALID_FOR_HRES;
+
 	clocksource_register_khz(&clocksource_tsc, tsc_khz);
 unreg:
 	clocksource_unregister(&clocksource_tsc_early);
···

 	if (early) {
 		cpu_khz = x86_platform.calibrate_cpu();
-		if (tsc_early_khz) {
+		if (tsc_early_khz)
 			tsc_khz = tsc_early_khz;
-		} else {
+		else
 			tsc_khz = x86_platform.calibrate_tsc();
-			clocksource_tsc.freq_khz = tsc_khz;
-		}
 	} else {
 		/* We should not be here with non-native cpu calibration */
 		WARN_ON(x86_platform.calibrate_cpu != native_calibrate_cpu);
···
 		return;
 	}

-	if (tsc_clocksource_reliable || tsc_watchdog == TSC_WATCHDOG_OFF)
+	if (tsc_clocksource_reliable || tsc_watchdog == TSC_WATCHDOG_OFF)
 		tsc_disable_clocksource_watchdog();

 	clocksource_register_khz(&clocksource_tsc_early, tsc_khz);
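The tsc.c change replaces two interacting flags with a single tri-state, so the last tsc=watchdog or tsc=nowatchdog option on the command line decides. The userspace sketch below models only the two checks visible in this diff. The helper names parse_tsc_opts() and watchdog_gets_disabled() are hypothetical, and any other callers of tsc_disable_clocksource_watchdog() are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Same three states as the enum added in the diff. */
enum tsc_wd { TSC_WATCHDOG_AUTO, TSC_WATCHDOG_OFF, TSC_WATCHDOG_ON };

/* Hypothetical helper: apply tsc= option values in order, last one wins. */
static enum tsc_wd parse_tsc_opts(const char *const *opts, int n)
{
	enum tsc_wd state = TSC_WATCHDOG_AUTO;

	for (int i = 0; i < n; i++) {
		if (!strcmp(opts[i], "nowatchdog"))
			state = TSC_WATCHDOG_OFF;
		if (!strcmp(opts[i], "watchdog"))
			state = TSC_WATCHDOG_ON;
	}
	return state;
}

/*
 * Hypothetical helper: true when the MUST_VERIFY flag would be cleared.
 * The caller only asks for disabling when the TSC is marked reliable or
 * the watchdog was switched off; the disable function itself refuses
 * when the watchdog is enforced.
 */
static bool watchdog_gets_disabled(enum tsc_wd state, bool tsc_reliable)
{
	if (!(tsc_reliable || state == TSC_WATCHDOG_OFF))
		return false;
	return state != TSC_WATCHDOG_ON;
}
```

In this model tsc=watchdog keeps the watchdog active even on a system whose TSC is flagged reliable, which matches the new documentation text "Enforce the clocksource watchdog on TSC".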
-1
drivers/clocksource/Kconfig
···
 config CLKSRC_MIPS_GIC
 	bool
 	depends on MIPS_GIC
-	select CLOCKSOURCE_WATCHDOG
 	select TIMER_OF

 config CLKSRC_PXA
+1 -3
drivers/clocksource/acpi_pm.c
···
 	.rating = 200,
 	.read = acpi_pm_read,
 	.mask = (u64)ACPI_PM_MASK,
-	.flags = CLOCK_SOURCE_IS_CONTINUOUS,
+	.flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_CALIBRATED,
 	.suspend = acpi_pm_suspend,
 	.resume = acpi_pm_resume,
 };
···
 		return -ENODEV;
 	}

-	if (tsc_clocksource_watchdog_disabled())
-		clocksource_acpi_pm.flags |= CLOCK_SOURCE_MUST_VERIFY;
 	return clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
 }

+4 -1
include/asm-generic/thread_info_tif.h
···
 #define _TIF_PATCH_PENDING BIT(TIF_PATCH_PENDING)

 #ifdef HAVE_TIF_RESTORE_SIGMASK
-# define TIF_RESTORE_SIGMASK 10 // Restore signal mask in do_signal() */
+# define TIF_RESTORE_SIGMASK 10 // Restore signal mask in do_signal()
 # define _TIF_RESTORE_SIGMASK BIT(TIF_RESTORE_SIGMASK)
 #endif

 #define TIF_RSEQ 11 // Run RSEQ fast path
 #define _TIF_RSEQ BIT(TIF_RSEQ)
+
+#define TIF_HRTIMER_REARM 12 // re-arm the timer
+#define _TIF_HRTIMER_REARM BIT(TIF_HRTIMER_REARM)

 #endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */
+7 -3
include/linux/clockchips.h
···
 /*
  * Clock event features
  */
-# define CLOCK_EVT_FEAT_PERIODIC 0x000001
-# define CLOCK_EVT_FEAT_ONESHOT 0x000002
-# define CLOCK_EVT_FEAT_KTIME 0x000004
+# define CLOCK_EVT_FEAT_PERIODIC 0x000001
+# define CLOCK_EVT_FEAT_ONESHOT 0x000002
+# define CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED 0x000004

 /*
  * x86(64) specific (mis)features:
···
  *			level handler of the event source
  * @set_next_event: set next event function using a clocksource delta
  * @set_next_ktime: set next event function using a direct ktime value
+ * @set_next_coupled: set next event function for clocksource coupled mode
  * @next_event: local storage for the next event in oneshot mode
  * @max_delta_ns: maximum delta value in ns
  * @min_delta_ns: minimum delta value in ns
···
  * @shift: nanoseconds to cycles divisor (power of two)
  * @state_use_accessors:current state of the device, assigned by the core code
  * @features: features
+ * @cs_id: Clocksource ID to denote the clocksource for coupled mode
  * @next_event_forced: True if the last programming was a forced event
  * @retries: number of forced programming retries
  * @set_state_periodic: switch state to periodic
···
 	void (*event_handler)(struct clock_event_device *);
 	int (*set_next_event)(unsigned long evt, struct clock_event_device *);
 	int (*set_next_ktime)(ktime_t expires, struct clock_event_device *);
+	void (*set_next_coupled)(u64 cycles, struct clock_event_device *);
 	ktime_t next_event;
 	u64 max_delta_ns;
 	u64 min_delta_ns;
···
 	u32 shift;
 	enum clock_event_state state_use_accessors;
 	unsigned int features;
+	enum clocksource_ids cs_id;
 	unsigned int next_event_forced;
 	unsigned long retries;

+8 -19
include/linux/clocksource.h
···
  * @shift: Cycle to nanosecond divisor (power of two)
  * @max_idle_ns: Maximum idle time permitted by the clocksource (nsecs)
  * @maxadj: Maximum adjustment value to mult (~11%)
- * @uncertainty_margin: Maximum uncertainty in nanoseconds per half second.
- *			Zero says to use default WATCHDOG_THRESHOLD.
  * @archdata: Optional arch-specific data
  * @max_cycles: Maximum safe cycle value which won't overflow on
  *			multiplication
···
 	u32 shift;
 	u64 max_idle_ns;
 	u32 maxadj;
-	u32 uncertainty_margin;
 #ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
 	struct arch_clocksource_data archdata;
 #endif
···
 	struct list_head wd_list;
 	u64 cs_last;
 	u64 wd_last;
+	unsigned int wd_cpu;
 #endif
 	struct module *owner;
 };
···
  */
 #define CLOCK_SOURCE_IS_CONTINUOUS 0x01
 #define CLOCK_SOURCE_MUST_VERIFY 0x02
+#define CLOCK_SOURCE_CALIBRATED 0x04

 #define CLOCK_SOURCE_WATCHDOG 0x10
 #define CLOCK_SOURCE_VALID_FOR_HRES 0x20
 #define CLOCK_SOURCE_UNSTABLE 0x40
 #define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80
 #define CLOCK_SOURCE_RESELECT 0x100
-#define CLOCK_SOURCE_VERIFY_PERCPU 0x200
+#define CLOCK_SOURCE_CAN_INLINE_READ 0x200
+#define CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT 0x400
+
+#define CLOCK_SOURCE_WDTEST 0x800
+#define CLOCK_SOURCE_WDTEST_PERCPU 0x1000
+
 /* simplify initialization of mask field */
 #define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0)

···

 #define TIMER_ACPI_DECLARE(name, table_id, fn) \
 	ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)
-
-static inline unsigned int clocksource_get_max_watchdog_retry(void)
-{
-	/*
-	 * When system is in the boot phase or under heavy workload, there
-	 * can be random big latencies during the clocksource/watchdog
-	 * read, so allow retries to filter the noise latency. As the
-	 * latency's frequency and maximum value goes up with the number of
-	 * CPUs, scale the number of retries with the number of online
-	 * CPUs.
-	 */
-	return (ilog2(num_online_cpus()) / 2) + 1;
-}
-
-void clocksource_verify_percpu(struct clocksource *cs);

 /**
  * struct clocksource_base - hardware abstraction for clock on which a clocksource
+20 -44
include/linux/hrtimer.h
···
 #define _LINUX_HRTIMER_H

 #include <linux/hrtimer_defs.h>
+#include <linux/hrtimer_rearm.h>
 #include <linux/hrtimer_types.h>
 #include <linux/init.h>
 #include <linux/list.h>
···
  *				  soft irq context
  * HRTIMER_MODE_HARD		- Timer callback function will be executed in
  *				  hard irq context even on PREEMPT_RT.
+ * HRTIMER_MODE_LAZY_REARM	- Avoid reprogramming if the timer was the
+ *				  first expiring timer and is moved into the
+ *				  future. Special mode for the HRTICK timer to
+ *				  avoid extensive reprogramming of the hardware,
+ *				  which is expensive in virtual machines. Risks
+ *				  a pointless expiry, but that's better than
+ *				  reprogramming on every context switch,
  */
 enum hrtimer_mode {
 	HRTIMER_MODE_ABS = 0x00,
···
 	HRTIMER_MODE_PINNED = 0x02,
 	HRTIMER_MODE_SOFT = 0x04,
 	HRTIMER_MODE_HARD = 0x08,
+	HRTIMER_MODE_LAZY_REARM = 0x10,

 	HRTIMER_MODE_ABS_PINNED = HRTIMER_MODE_ABS | HRTIMER_MODE_PINNED,
 	HRTIMER_MODE_REL_PINNED = HRTIMER_MODE_REL | HRTIMER_MODE_PINNED,
···
 	HRTIMER_MODE_ABS_PINNED_HARD = HRTIMER_MODE_ABS_PINNED | HRTIMER_MODE_HARD,
 	HRTIMER_MODE_REL_PINNED_HARD = HRTIMER_MODE_REL_PINNED | HRTIMER_MODE_HARD,
 };
-
-/*
- * Values to track state of the timer
- *
- * Possible states:
- *
- * 0x00		inactive
- * 0x01		enqueued into rbtree
- *
- * The callback state is not part of the timer->state because clearing it would
- * mean touching the timer after the callback, this makes it impossible to free
- * the timer from the callback function.
- *
- * Therefore we track the callback state in:
- *
- *	timer->base->cpu_base->running == timer
- *
- * On SMP it is possible to have a "callback function running and enqueued"
- * status. It happens for example when a posix timer expired and the callback
- * queued a signal. Between dropping the lock which protects the posix timer
- * and reacquiring the base lock of the hrtimer, another CPU can deliver the
- * signal and rearm the timer.
- *
- * All state transitions are protected by cpu_base->lock.
- */
-#define HRTIMER_STATE_INACTIVE 0x00
-#define HRTIMER_STATE_ENQUEUED 0x01

 /**
  * struct hrtimer_sleeper - simple sleeper structure
···
 	return timer->_softexpires;
 }

-static inline s64 hrtimer_get_expires_ns(const struct hrtimer *timer)
-{
-	return ktime_to_ns(timer->node.expires);
-}
-
 ktime_t hrtimer_cb_get_time(const struct hrtimer *timer);

 static inline ktime_t hrtimer_expires_remaining(const struct hrtimer *timer)
···
 	return ktime_sub(timer->node.expires, hrtimer_cb_get_time(timer));
 }

-static inline int hrtimer_is_hres_active(struct hrtimer *timer)
-{
-	return IS_ENABLED(CONFIG_HIGH_RES_TIMERS) ?
-		timer->base->cpu_base->hres_active : 0;
-}
-
 #ifdef CONFIG_HIGH_RES_TIMERS
+extern unsigned int hrtimer_resolution;
 struct clock_event_device;

 extern void hrtimer_interrupt(struct clock_event_device *dev);

-extern unsigned int hrtimer_resolution;
+extern struct static_key_false hrtimer_highres_enabled_key;

-#else
+static inline bool hrtimer_highres_enabled(void)
+{
+	return static_branch_likely(&hrtimer_highres_enabled_key);
+}

+#else /* CONFIG_HIGH_RES_TIMERS */
 #define hrtimer_resolution (unsigned int)LOW_RES_NSEC
-
-#endif
+static inline bool hrtimer_highres_enabled(void) { return false; }
+#endif /* !CONFIG_HIGH_RES_TIMERS */

 static inline ktime_t
 __hrtimer_expires_remaining_adjusted(const struct hrtimer *timer, ktime_t now)
···
  */
 static inline bool hrtimer_is_queued(struct hrtimer *timer)
 {
-	/* The READ_ONCE pairs with the update functions of timer->state */
-	return !!(READ_ONCE(timer->state) & HRTIMER_STATE_ENQUEUED);
+	/* The READ_ONCE pairs with the update functions of timer->is_queued */
+	return READ_ONCE(timer->is_queued);
 }

 /*
+43 -40
include/linux/hrtimer_defs.h
···
  *			timer to a base on another cpu.
  * @clockid: clock id for per_cpu support
  * @seq: seqcount around __run_hrtimer
+ * @expires_next: Absolute time of the next event in this clock base
  * @running: pointer to the currently running hrtimer
  * @active: red black tree root node for the active timers
  * @offset: offset of this clock to the monotonic base
  */
 struct hrtimer_clock_base {
-	struct hrtimer_cpu_base *cpu_base;
-	unsigned int index;
-	clockid_t clockid;
-	seqcount_raw_spinlock_t seq;
-	struct hrtimer *running;
-	struct timerqueue_head active;
-	ktime_t offset;
+	struct hrtimer_cpu_base *cpu_base;
+	const unsigned int index;
+	const clockid_t clockid;
+	seqcount_raw_spinlock_t seq;
+	ktime_t expires_next;
+	struct hrtimer *running;
+	struct timerqueue_linked_head active;
+	ktime_t offset;
 } __hrtimer_clock_base_align;

-enum  hrtimer_base_type {
+enum hrtimer_base_type {
 	HRTIMER_BASE_MONOTONIC,
 	HRTIMER_BASE_REALTIME,
 	HRTIMER_BASE_BOOTTIME,
···
 	HRTIMER_BASE_REALTIME_SOFT,
 	HRTIMER_BASE_BOOTTIME_SOFT,
 	HRTIMER_BASE_TAI_SOFT,
-	HRTIMER_MAX_CLOCK_BASES,
+	HRTIMER_MAX_CLOCK_BASES
 };

 /**
  * struct hrtimer_cpu_base - the per cpu clock bases
- * @lock: lock protecting the base and associated clock bases
- *		and timers
- * @cpu: cpu number
- * @active_bases: Bitfield to mark bases with active timers
- * @clock_was_set_seq: Sequence counter of clock was set events
- * @hres_active: State of high resolution mode
- * @in_hrtirq: hrtimer_interrupt() is currently executing
- * @hang_detected: The last hrtimer interrupt detected a hang
- * @softirq_activated: displays, if the softirq is raised - update of softirq
- *		related settings is not required then.
- * @nr_events: Total number of hrtimer interrupt events
- * @nr_retries: Total number of hrtimer interrupt retries
- * @nr_hangs: Total number of hrtimer interrupt hangs
- * @max_hang_time: Maximum time spent in hrtimer_interrupt
- * @softirq_expiry_lock: Lock which is taken while softirq based hrtimer are
- *		expired
- * @online: CPU is online from an hrtimers point of view
- * @timer_waiters: A hrtimer_cancel() invocation waits for the timer
- *		callback to finish.
- * @expires_next: absolute time of the next event, is required for remote
- *		hrtimer enqueue; it is the total first expiry time (hard
- *		and soft hrtimer are taken into account)
- * @next_timer: Pointer to the first expiring timer
- * @softirq_expires_next: Time to check, if soft queues needs also to be expired
- * @softirq_next_timer: Pointer to the first expiring softirq based timer
- * @clock_base: array of clock bases for this cpu
+ * @lock: lock protecting the base and associated clock bases and timers
+ * @cpu: cpu number
+ * @active_bases: Bitfield to mark bases with active timers
+ * @clock_was_set_seq: Sequence counter of clock was set events
+ * @hres_active: State of high resolution mode
+ * @deferred_rearm: A deferred rearm is pending
+ * @deferred_needs_update: The deferred rearm must re-evaluate the first timer
+ * @hang_detected: The last hrtimer interrupt detected a hang
+ * @softirq_activated: displays, if the softirq is raised - update of softirq
+ *		related settings is not required then.
+ * @nr_events: Total number of hrtimer interrupt events
+ * @nr_retries: Total number of hrtimer interrupt retries
+ * @nr_hangs: Total number of hrtimer interrupt hangs
+ * @max_hang_time: Maximum time spent in hrtimer_interrupt
+ * @softirq_expiry_lock: Lock which is taken while softirq based hrtimer are expired
+ * @online: CPU is online from an hrtimers point of view
+ * @timer_waiters: A hrtimer_cancel() waiters for the timer callback to finish.
+ * @expires_next: Absolute time of the next event, is required for remote
+ *		hrtimer enqueue; it is the total first expiry time (hard
+ *		and soft hrtimer are taken into account)
+ * @next_timer: Pointer to the first expiring timer
+ * @softirq_expires_next: Time to check, if soft queues needs also to be expired
+ * @softirq_next_timer: Pointer to the first expiring softirq based timer
+ * @deferred_expires_next: Cached expires next value for deferred rearm
+ * @clock_base: Array of clock bases for this cpu
  *
  * Note: next_timer is just an optimization for __remove_hrtimer().
  *	 Do not dereference the pointer because it is not reliable on
···
 	unsigned int cpu;
 	unsigned int active_bases;
 	unsigned int clock_was_set_seq;
-	unsigned int hres_active : 1,
-		     in_hrtirq : 1,
-		     hang_detected : 1,
-		     softirq_activated : 1,
-		     online : 1;
+	bool hres_active;
+	bool deferred_rearm;
+	bool deferred_needs_update;
+	bool hang_detected;
+	bool softirq_activated;
+	bool online;
 #ifdef CONFIG_HIGH_RES_TIMERS
 	unsigned int nr_events;
 	unsigned short nr_retries;
···
 	struct hrtimer *next_timer;
 	ktime_t softirq_expires_next;
 	struct hrtimer *softirq_next_timer;
+	ktime_t deferred_expires_next;
 	struct hrtimer_clock_base clock_base[HRTIMER_MAX_CLOCK_BASES];
 	call_single_data_t csd;
 } ____cacheline_aligned;
+83
include/linux/hrtimer_rearm.h
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + #ifndef _LINUX_HRTIMER_REARM_H 3 + #define _LINUX_HRTIMER_REARM_H 4 + 5 + #ifdef CONFIG_HRTIMER_REARM_DEFERRED 6 + #include <linux/thread_info.h> 7 + 8 + void __hrtimer_rearm_deferred(void); 9 + 10 + /* 11 + * This is purely CPU local, so check the TIF bit first to avoid the overhead of 12 + * the atomic test_and_clear_bit() operation for the common case where the bit 13 + * is not set. 14 + */ 15 + static __always_inline bool hrtimer_test_and_clear_rearm_deferred_tif(unsigned long tif_work) 16 + { 17 + lockdep_assert_irqs_disabled(); 18 + 19 + if (unlikely(tif_work & _TIF_HRTIMER_REARM)) { 20 + clear_thread_flag(TIF_HRTIMER_REARM); 21 + return true; 22 + } 23 + return false; 24 + } 25 + 26 + #define TIF_REARM_MASK (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | _TIF_HRTIMER_REARM) 27 + 28 + /* Invoked from the exit to user before invoking exit_to_user_mode_loop() */ 29 + static __always_inline bool 30 + hrtimer_rearm_deferred_user_irq(unsigned long *tif_work, const unsigned long tif_mask) 31 + { 32 + /* Help the compiler to optimize the function out for syscall returns */ 33 + if (!(tif_mask & _TIF_HRTIMER_REARM)) 34 + return false; 35 + /* 36 + * Rearm the timer if none of the resched flags is set before going into 37 + * the loop which re-enables interrupts. 
38 + */ 39 + if (unlikely((*tif_work & TIF_REARM_MASK) == _TIF_HRTIMER_REARM)) { 40 + clear_thread_flag(TIF_HRTIMER_REARM); 41 + __hrtimer_rearm_deferred(); 42 + /* Don't go into the loop if HRTIMER_REARM was the only flag */ 43 + *tif_work &= ~TIF_HRTIMER_REARM; 44 + return !*tif_work; 45 + } 46 + return false; 47 + } 48 + 49 + /* Invoked from the time slice extension decision function */ 50 + static __always_inline void hrtimer_rearm_deferred_tif(unsigned long tif_work) 51 + { 52 + if (hrtimer_test_and_clear_rearm_deferred_tif(tif_work)) 53 + __hrtimer_rearm_deferred(); 54 + } 55 + 56 + /* 57 + * This is to be called on all irqentry_exit() paths that will enable 58 + * interrupts. 59 + */ 60 + static __always_inline void hrtimer_rearm_deferred(void) 61 + { 62 + hrtimer_rearm_deferred_tif(read_thread_flags()); 63 + } 64 + 65 + /* 66 + * Invoked from the scheduler on entry to __schedule() so it can defer 67 + * rearming after the load balancing callbacks which might change hrtick. 68 + */ 69 + static __always_inline bool hrtimer_test_and_clear_rearm_deferred(void) 70 + { 71 + return hrtimer_test_and_clear_rearm_deferred_tif(read_thread_flags()); 72 + } 73 + 74 + #else /* CONFIG_HRTIMER_REARM_DEFERRED */ 75 + static __always_inline void __hrtimer_rearm_deferred(void) { } 76 + static __always_inline void hrtimer_rearm_deferred(void) { } 77 + static __always_inline void hrtimer_rearm_deferred_tif(unsigned long tif_work) { } 78 + static __always_inline bool 79 + hrtimer_rearm_deferred_user_irq(unsigned long *tif_work, const unsigned long tif_mask) { return false; } 80 + static __always_inline bool hrtimer_test_and_clear_rearm_deferred(void) { return false; } 81 + #endif /* !CONFIG_HRTIMER_REARM_DEFERRED */ 82 + 83 + #endif
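The user-exit helper in this header rearms the clock event device right away only when the rearm bit is pending and neither resched bit is set; otherwise the rearm is left to the scheduler path. A minimal userspace sketch of that decision follows. The `SK_*` bit values and the `sk_rearm_count` counter (a stand-in for `__hrtimer_rearm_deferred()`) are assumptions made for illustration, not the kernel's definitions, and the sketch works on mask values throughout.

```c
#include <stdbool.h>

/* Hypothetical flag masks, for illustration only. */
#define SK_NEED_RESCHED       (1UL << 0)
#define SK_NEED_RESCHED_LAZY  (1UL << 1)
#define SK_HRTIMER_REARM      (1UL << 2)
#define SK_SIGPENDING         (1UL << 3)
#define SK_REARM_MASK \
	(SK_NEED_RESCHED | SK_NEED_RESCHED_LAZY | SK_HRTIMER_REARM)

/* Counts how often the (simulated) clock event device gets rearmed. */
static int sk_rearm_count;

/*
 * Returns true when no work is left and the caller can skip the work loop.
 * @work: pending flags, updated in place
 * @mask: flags the caller evaluates on this exit path
 */
static bool sk_rearm_deferred_user_irq(unsigned long *work, unsigned long mask)
{
	/* Syscall-style exit: the rearm bit is not part of the mask */
	if (!(mask & SK_HRTIMER_REARM))
		return false;

	/* Rearm now only if no reschedule is pending */
	if ((*work & SK_REARM_MASK) == SK_HRTIMER_REARM) {
		sk_rearm_count++;
		*work &= ~SK_HRTIMER_REARM;
		return !*work;
	}
	return false;
}
```

With only the rearm bit pending, the helper rearms and reports that the loop can be skipped. With a resched bit pending as well, it does nothing and leaves the rearm to a later point.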
+11 -8
include/linux/hrtimer_types.h
··· 17 17 18 18 /** 19 19 * struct hrtimer - the basic hrtimer structure 20 - * @node: timerqueue node, which also manages node.expires, 20 + * @node: Linked timerqueue node, which also manages node.expires, 21 21 * the absolute expiry time in the hrtimers internal 22 22 * representation. The time is related to the clock on 23 23 * which the timer is based. Is setup by adding ··· 28 28 * was armed. 29 29 * @function: timer expiry callback function 30 30 * @base: pointer to the timer base (per cpu and per clock) 31 - * @state: state information (See bit values above) 31 + * @is_queued: Indicates whether a timer is enqueued or not 32 32 * @is_rel: Set if the timer was armed relative 33 33 * @is_soft: Set if hrtimer will be expired in soft interrupt context. 34 34 * @is_hard: Set if hrtimer will be expired in hard interrupt context 35 35 * even on RT. 36 + * @is_lazy: Set if the timer is frequently rearmed to avoid updates 37 + * of the clock event device 36 38 * 37 39 * The hrtimer structure must be initialized by hrtimer_setup() 38 40 */ 39 41 struct hrtimer { 40 - struct timerqueue_node node; 42 + struct timerqueue_linked_node node; 43 + struct hrtimer_clock_base *base; 44 + bool is_queued; 45 + bool is_rel; 46 + bool is_soft; 47 + bool is_hard; 48 + bool is_lazy; 41 49 ktime_t _softexpires; 42 50 enum hrtimer_restart (*__private function)(struct hrtimer *); 43 - struct hrtimer_clock_base *base; 44 - u8 state; 45 - u8 is_rel; 46 - u8 is_soft; 47 - u8 is_hard; 48 51 }; 49 52 50 53 #endif /* _LINUX_HRTIMER_TYPES_H */
+19 -6
include/linux/irq-entry-common.h
··· 3 3 #define __LINUX_IRQENTRYCOMMON_H 4 4 5 5 #include <linux/context_tracking.h> 6 + #include <linux/hrtimer_rearm.h> 6 7 #include <linux/kmsan.h> 7 8 #include <linux/rseq_entry.h> 8 9 #include <linux/static_call_types.h> ··· 33 32 _TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | \ 34 33 _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | _TIF_RSEQ | \ 35 34 ARCH_EXIT_TO_USER_MODE_WORK) 35 + 36 + #ifdef CONFIG_HRTIMER_REARM_DEFERRED 37 + # define EXIT_TO_USER_MODE_WORK_SYSCALL (EXIT_TO_USER_MODE_WORK) 38 + # define EXIT_TO_USER_MODE_WORK_IRQ (EXIT_TO_USER_MODE_WORK | _TIF_HRTIMER_REARM) 39 + #else 40 + # define EXIT_TO_USER_MODE_WORK_SYSCALL (EXIT_TO_USER_MODE_WORK) 41 + # define EXIT_TO_USER_MODE_WORK_IRQ (EXIT_TO_USER_MODE_WORK) 42 + #endif 36 43 37 44 /** 38 45 * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs ··· 212 203 /** 213 204 * __exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required 214 205 * @regs: Pointer to pt_regs on entry stack 206 + * @work_mask: Which TIF bits need to be evaluated 215 207 * 216 208 * 1) check that interrupts are disabled 217 209 * 2) call tick_nohz_user_enter_prepare() ··· 222 212 * 223 213 * Don't invoke directly, use the syscall/irqentry_ prefixed variants below 224 214 */ 225 - static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs) 215 + static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs, 216 + const unsigned long work_mask) 226 217 { 227 218 unsigned long ti_work; 228 219 ··· 233 222 tick_nohz_user_enter_prepare(); 234 223 235 224 ti_work = read_thread_flags(); 236 - if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK)) 237 - ti_work = exit_to_user_mode_loop(regs, ti_work); 225 + if (unlikely(ti_work & work_mask)) { 226 + if (!hrtimer_rearm_deferred_user_irq(&ti_work, work_mask)) 227 + ti_work = exit_to_user_mode_loop(regs, ti_work); 228 + } 238 229 239 230 arch_exit_to_user_mode_prepare(regs, ti_work); 240 231 } ··· 252 239 /* Temporary 
workaround to keep ARM64 alive */ 253 240 static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *regs) 254 241 { 255 - __exit_to_user_mode_prepare(regs); 242 + __exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK); 256 243 rseq_exit_to_user_mode_legacy(); 257 244 __exit_to_user_mode_validate(); 258 245 } ··· 266 253 */ 267 254 static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs) 268 255 { 269 - __exit_to_user_mode_prepare(regs); 256 + __exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK_SYSCALL); 270 257 rseq_syscall_exit_to_user_mode(); 271 258 __exit_to_user_mode_validate(); 272 259 } ··· 280 267 */ 281 268 static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs) 282 269 { 283 - __exit_to_user_mode_prepare(regs); 270 + __exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK_IRQ); 284 271 rseq_irqentry_exit_to_user_mode(); 285 272 __exit_to_user_mode_validate(); 286 273 }
+1 -5
include/linux/jiffies.h
···
 /* USER_TICK_USEC is the time between ticks in usec assuming fake USER_HZ */
 #define USER_TICK_USEC ((1000000UL + USER_HZ/2) / USER_HZ)
 
-#ifndef __jiffy_arch_data
-#define __jiffy_arch_data
-#endif
-
 /*
  * The 64-bit value is not atomic on 32-bit systems - you MUST NOT read it
  * without sampling the sequence number in jiffies_lock.
···
  * See arch/ARCH/kernel/vmlinux.lds.S
  */
 extern u64 __cacheline_aligned_in_smp jiffies_64;
-extern unsigned long volatile __cacheline_aligned_in_smp __jiffy_arch_data jiffies;
+extern unsigned long volatile __cacheline_aligned_in_smp jiffies;
 
 #if (BITS_PER_LONG < 64)
 u64 get_jiffies_64(void);
+72 -9
include/linux/rbtree.h
··· 35 35 #define RB_CLEAR_NODE(node) \ 36 36 ((node)->__rb_parent_color = (unsigned long)(node)) 37 37 38 + #define RB_EMPTY_LINKED_NODE(lnode) RB_EMPTY_NODE(&(lnode)->node) 39 + #define RB_CLEAR_LINKED_NODE(lnode) ({ \ 40 + RB_CLEAR_NODE(&(lnode)->node); \ 41 + (lnode)->prev = (lnode)->next = NULL; \ 42 + }) 38 43 39 44 extern void rb_insert_color(struct rb_node *, struct rb_root *); 40 45 extern void rb_erase(struct rb_node *, struct rb_root *); 41 - 46 + extern bool rb_erase_linked(struct rb_node_linked *, struct rb_root_linked *); 42 47 43 48 /* Find logical next and previous nodes in a tree */ 44 49 extern struct rb_node *rb_next(const struct rb_node *); ··· 218 213 return leftmost ? node : NULL; 219 214 } 220 215 221 - /** 222 - * rb_add() - insert @node into @tree 223 - * @node: node to insert 224 - * @tree: tree to insert @node into 225 - * @less: operator defining the (partial) node order 226 - */ 227 216 static __always_inline void 228 - rb_add(struct rb_node *node, struct rb_root *tree, 229 - bool (*less)(struct rb_node *, const struct rb_node *)) 217 + __rb_add(struct rb_node *node, struct rb_root *tree, 218 + bool (*less)(struct rb_node *, const struct rb_node *), 219 + void (*linkop)(struct rb_node *, struct rb_node *, struct rb_node **)) 230 220 { 231 221 struct rb_node **link = &tree->rb_node; 232 222 struct rb_node *parent = NULL; ··· 234 234 link = &parent->rb_right; 235 235 } 236 236 237 + linkop(node, parent, link); 237 238 rb_link_node(node, parent, link); 238 239 rb_insert_color(node, tree); 240 + } 241 + 242 + #define __node_2_linked_node(_n) \ 243 + rb_entry((_n), struct rb_node_linked, node) 244 + 245 + static inline void 246 + rb_link_linked_node(struct rb_node *node, struct rb_node *parent, struct rb_node **link) 247 + { 248 + if (!parent) 249 + return; 250 + 251 + struct rb_node_linked *nnew = __node_2_linked_node(node); 252 + struct rb_node_linked *npar = __node_2_linked_node(parent); 253 + 254 + if (link == &parent->rb_left) { 255 + 
nnew->prev = npar->prev; 256 + nnew->next = npar; 257 + npar->prev = nnew; 258 + if (nnew->prev) 259 + nnew->prev->next = nnew; 260 + } else { 261 + nnew->next = npar->next; 262 + nnew->prev = npar; 263 + npar->next = nnew; 264 + if (nnew->next) 265 + nnew->next->prev = nnew; 266 + } 267 + } 268 + 269 + /** 270 + * rb_add_linked() - insert @node into the leftmost linked tree @tree 271 + * @node: node to insert 272 + * @tree: linked tree to insert @node into 273 + * @less: operator defining the (partial) node order 274 + * 275 + * Returns @true when @node is the new leftmost, @false otherwise. 276 + */ 277 + static __always_inline bool 278 + rb_add_linked(struct rb_node_linked *node, struct rb_root_linked *tree, 279 + bool (*less)(struct rb_node *, const struct rb_node *)) 280 + { 281 + __rb_add(&node->node, &tree->rb_root, less, rb_link_linked_node); 282 + if (!node->prev) 283 + tree->rb_leftmost = node; 284 + return !node->prev; 285 + } 286 + 287 + /* Empty linkop function which is optimized away by the compiler */ 288 + static __always_inline void 289 + rb_link_noop(struct rb_node *n, struct rb_node *p, struct rb_node **l) { } 290 + 291 + /** 292 + * rb_add() - insert @node into @tree 293 + * @node: node to insert 294 + * @tree: tree to insert @node into 295 + * @less: operator defining the (partial) node order 296 + */ 297 + static __always_inline void 298 + rb_add(struct rb_node *node, struct rb_root *tree, 299 + bool (*less)(struct rb_node *, const struct rb_node *)) 300 + { 301 + __rb_add(node, tree, less, rb_link_noop); 239 302 } 240 303 241 304 /**
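The neighbor bookkeeping in `rb_link_linked_node()` relies on one property of binary search trees: a node inserted as a left leaf is the in-order predecessor of its parent, and a node inserted as a right leaf is its successor. The sketch below shows the same link updates on a plain unbalanced search tree. The `sk_*` types and names are hypothetical, and rebalancing is left out because rotations do not change the in-order sequence.

```c
#include <stddef.h>

/* Hypothetical miniature types; not the kernel's rb_node_linked. */
struct sk_node {
	long key;
	struct sk_node *left, *right;	/* tree links */
	struct sk_node *prev, *next;	/* in-order neighbor links */
};

struct sk_tree {
	struct sk_node *root;
	struct sk_node *leftmost;
};

/* Insert @n into @t. Returns 1 when @n is the new leftmost node, else 0. */
static int sk_insert(struct sk_tree *t, struct sk_node *n)
{
	struct sk_node **link = &t->root;
	struct sk_node *parent = NULL;

	n->left = n->right = n->prev = n->next = NULL;

	/* Descend to the empty slot where @n becomes a leaf */
	while (*link) {
		parent = *link;
		link = n->key < parent->key ? &parent->left : &parent->right;
	}

	if (parent) {
		if (link == &parent->left) {
			/* Left leaf: @n sits directly before @parent */
			n->prev = parent->prev;
			n->next = parent;
			parent->prev = n;
			if (n->prev)
				n->prev->next = n;
		} else {
			/* Right leaf: @n sits directly after @parent */
			n->next = parent->next;
			n->prev = parent;
			parent->next = n;
			if (n->next)
				n->next->prev = n;
		}
	}
	*link = n;

	/* No predecessor means @n expires first */
	if (!n->prev)
		t->leftmost = n;
	return !n->prev;
}
```

Once the links are in place, finding the next node after a removal or checking whether a changed key still fits between `prev` and `next` needs no tree walk.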
+16
include/linux/rbtree_types.h
···
 } __attribute__((aligned(sizeof(long))));
 /* The alignment might seem pointless, but allegedly CRIS needs it */
 
+struct rb_node_linked {
+	struct rb_node node;
+	struct rb_node_linked *prev;
+	struct rb_node_linked *next;
+};
+
 struct rb_root {
 	struct rb_node *rb_node;
 };
···
 	struct rb_node *rb_leftmost;
 };
 
+/*
+ * Leftmost tree with links. This would allow a trivial rb_rightmost update,
+ * but that has been omitted due to the lack of users.
+ */
+struct rb_root_linked {
+	struct rb_root rb_root;
+	struct rb_node_linked *rb_leftmost;
+};
+
 #define RB_ROOT (struct rb_root) { NULL, }
 #define RB_ROOT_CACHED (struct rb_root_cached) { {NULL, }, NULL }
+#define RB_ROOT_LINKED (struct rb_root_linked) { {NULL, }, NULL }
 
 #endif
+13 -3
include/linux/rseq_entry.h
··· 40 40 #endif /* !CONFIG_RSEQ_STATS */ 41 41 42 42 #ifdef CONFIG_RSEQ 43 + #include <linux/hrtimer_rearm.h> 43 44 #include <linux/jump_label.h> 44 45 #include <linux/rseq.h> 45 46 #include <linux/sched/signal.h> ··· 111 110 t->rseq.slice.state.granted = false; 112 111 } 113 112 114 - static __always_inline bool rseq_grant_slice_extension(bool work_pending) 113 + static __always_inline bool __rseq_grant_slice_extension(bool work_pending) 115 114 { 116 115 struct task_struct *curr = current; 117 116 struct rseq_slice_ctrl usr_ctrl; ··· 216 215 return false; 217 216 } 218 217 218 + static __always_inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) 219 + { 220 + if (unlikely(__rseq_grant_slice_extension(ti_work & mask))) { 221 + hrtimer_rearm_deferred_tif(ti_work); 222 + return true; 223 + } 224 + return false; 225 + } 226 + 219 227 #else /* CONFIG_RSEQ_SLICE_EXTENSION */ 220 228 static __always_inline bool rseq_slice_extension_enabled(void) { return false; } 221 229 static __always_inline bool rseq_arm_slice_extension_timer(void) { return false; } 222 230 static __always_inline void rseq_slice_clear_grant(struct task_struct *t) { } 223 - static __always_inline bool rseq_grant_slice_extension(bool work_pending) { return false; } 231 + static __always_inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) { return false; } 224 232 #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */ 225 233 226 234 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr); ··· 788 778 static inline void rseq_irqentry_exit_to_user_mode(void) { } 789 779 static inline void rseq_exit_to_user_mode_legacy(void) { } 790 780 static inline void rseq_debug_syscall_return(struct pt_regs *regs) { } 791 - static inline bool rseq_grant_slice_extension(bool work_pending) { return false; } 781 + static inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) { return false; } 792 782 
#endif /* !CONFIG_RSEQ */ 793 783 794 784 #endif /* _LINUX_RSEQ_ENTRY_H */
+8
include/linux/timekeeper_internal.h
···
  * @id: The timekeeper ID
  * @tkr_raw: The readout base structure for CLOCK_MONOTONIC_RAW
  * @raw_sec: CLOCK_MONOTONIC_RAW time in seconds
+ * @cs_id: The ID of the current clocksource
+ * @cs_ns_to_cyc_mult: Multiplicator for nanoseconds to cycles conversion
+ * @cs_ns_to_cyc_shift: Shift value for nanoseconds to cycles conversion
+ * @cs_ns_to_cyc_maxns: Maximum nanoseconds to cyles conversion range
  * @clock_was_set_seq: The sequence number of clock was set events
  * @cs_was_changed_seq: The sequence number of clocksource change events
  * @clock_valid: Indicator for valid clock
···
 	u64 raw_sec;
 
 	/* Cachline 3 and 4 (timekeeping internal variables): */
+	enum clocksource_ids cs_id;
+	u32 cs_ns_to_cyc_mult;
+	u32 cs_ns_to_cyc_shift;
+	u64 cs_ns_to_cyc_maxns;
 	unsigned int clock_was_set_seq;
 	u8 cs_was_changed_seq;
 	u8 clock_valid;
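The new `cs_ns_to_cyc_*` fields describe a multiply-and-shift conversion from nanoseconds to clocksource cycles. How the timekeeping core applies them is not visible in this hunk, so the following is only a sketch of the general mult/shift technique. The `sk_*` names, the example factors and the clamping to the maximum range are all assumptions.

```c
#include <stdint.h>

/* Hypothetical conversion parameters, mirroring the field names above. */
struct sk_ns_to_cyc {
	uint32_t mult;
	uint32_t shift;
	uint64_t maxns;	/* largest delta for which ns * mult cannot overflow */
};

/*
 * Convert a nanosecond delta to cycles: cycles = (ns * mult) >> shift.
 * Deltas beyond the valid range are clamped in this sketch; the kernel's
 * handling of that case may differ.
 */
static uint64_t sk_ns_to_cycles(const struct sk_ns_to_cyc *c, uint64_t ns)
{
	if (ns > c->maxns)
		ns = c->maxns;
	return (ns * c->mult) >> c->shift;
}
```

For a 3 GHz counter (3 cycles per nanosecond), `mult = 768` with `shift = 8` gives an exact factor of 3, so a delta of 1000 ns maps to 3000 cycles.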
+48 -8
include/linux/timerqueue.h
··· 5 5 #include <linux/rbtree.h> 6 6 #include <linux/timerqueue_types.h> 7 7 8 - extern bool timerqueue_add(struct timerqueue_head *head, 9 - struct timerqueue_node *node); 10 - extern bool timerqueue_del(struct timerqueue_head *head, 11 - struct timerqueue_node *node); 12 - extern struct timerqueue_node *timerqueue_iterate_next( 13 - struct timerqueue_node *node); 8 + bool timerqueue_add(struct timerqueue_head *head, struct timerqueue_node *node); 9 + bool timerqueue_del(struct timerqueue_head *head, struct timerqueue_node *node); 10 + struct timerqueue_node *timerqueue_iterate_next(struct timerqueue_node *node); 11 + 12 + bool timerqueue_linked_add(struct timerqueue_linked_head *head, struct timerqueue_linked_node *node); 14 13 15 14 /** 16 15 * timerqueue_getnext - Returns the timer with the earliest expiration time ··· 18 19 * 19 20 * Returns a pointer to the timer node that has the earliest expiration time. 20 21 */ 21 - static inline 22 - struct timerqueue_node *timerqueue_getnext(struct timerqueue_head *head) 22 + static inline struct timerqueue_node *timerqueue_getnext(struct timerqueue_head *head) 23 23 { 24 24 struct rb_node *leftmost = rb_first_cached(&head->rb_root); 25 25 ··· 39 41 { 40 42 head->rb_root = RB_ROOT_CACHED; 41 43 } 44 + 45 + /* Timer queues with linked nodes */ 46 + 47 + static __always_inline 48 + struct timerqueue_linked_node *timerqueue_linked_first(struct timerqueue_linked_head *head) 49 + { 50 + return rb_entry_safe(head->rb_root.rb_leftmost, struct timerqueue_linked_node, node); 51 + } 52 + 53 + static __always_inline 54 + struct timerqueue_linked_node *timerqueue_linked_next(struct timerqueue_linked_node *node) 55 + { 56 + return rb_entry_safe(node->node.next, struct timerqueue_linked_node, node); 57 + } 58 + 59 + static __always_inline 60 + struct timerqueue_linked_node *timerqueue_linked_prev(struct timerqueue_linked_node *node) 61 + { 62 + return rb_entry_safe(node->node.prev, struct timerqueue_linked_node, node); 63 + } 64 + 
65 + static __always_inline 66 + bool timerqueue_linked_del(struct timerqueue_linked_head *head, struct timerqueue_linked_node *node) 67 + { 68 + return rb_erase_linked(&node->node, &head->rb_root); 69 + } 70 + 71 + static __always_inline void timerqueue_linked_init(struct timerqueue_linked_node *node) 72 + { 73 + RB_CLEAR_LINKED_NODE(&node->node); 74 + } 75 + 76 + static __always_inline bool timerqueue_linked_node_queued(struct timerqueue_linked_node *node) 77 + { 78 + return !RB_EMPTY_LINKED_NODE(&node->node); 79 + } 80 + 81 + static __always_inline void timerqueue_linked_init_head(struct timerqueue_linked_head *head) 82 + { 83 + head->rb_root = RB_ROOT_LINKED; 84 + } 85 + 42 86 #endif /* _LINUX_TIMERQUEUE_H */
+12 -3
include/linux/timerqueue_types.h
··· 6 6 #include <linux/types.h> 7 7 8 8 struct timerqueue_node { 9 - struct rb_node node; 10 - ktime_t expires; 9 + struct rb_node node; 10 + ktime_t expires; 11 11 }; 12 12 13 13 struct timerqueue_head { 14 - struct rb_root_cached rb_root; 14 + struct rb_root_cached rb_root; 15 + }; 16 + 17 + struct timerqueue_linked_node { 18 + struct rb_node_linked node; 19 + ktime_t expires; 20 + }; 21 + 22 + struct timerqueue_linked_head { 23 + struct rb_root_linked rb_root; 15 24 }; 16 25 17 26 #endif /* _LINUX_TIMERQUEUE_TYPES_H */
+8 -5
include/linux/trace_events.h
···
 
 const char *trace_print_flags_seq(struct trace_seq *p, const char *delim,
 				  unsigned long flags,
-				  const struct trace_print_flags *flag_array);
+				  const struct trace_print_flags *flag_array,
+				  size_t flag_array_size);
 
 const char *trace_print_symbols_seq(struct trace_seq *p, unsigned long val,
-				    const struct trace_print_flags *symbol_array);
+				    const struct trace_print_flags *symbol_array,
+				    size_t symbol_array_size);
 
 #if BITS_PER_LONG == 32
 const char *trace_print_flags_seq_u64(struct trace_seq *p, const char *delim,
 				      unsigned long long flags,
-				      const struct trace_print_flags_u64 *flag_array);
+				      const struct trace_print_flags_u64 *flag_array,
+				      size_t flag_array_size);
 
 const char *trace_print_symbols_seq_u64(struct trace_seq *p,
 					unsigned long long val,
-					const struct trace_print_flags_u64
-					*symbol_array);
+					const struct trace_print_flags_u64 *symbol_array,
+					size_t symbol_array_size);
 #endif
 
 struct trace_iterator;
+34 -8
include/trace/events/timer.h
··· 218 218 * hrtimer_start - called when the hrtimer is started 219 219 * @hrtimer: pointer to struct hrtimer 220 220 * @mode: the hrtimers mode 221 + * @was_armed: Was armed when hrtimer_start*() was invoked 221 222 */ 222 223 TRACE_EVENT(hrtimer_start, 223 224 224 - TP_PROTO(struct hrtimer *hrtimer, enum hrtimer_mode mode), 225 + TP_PROTO(struct hrtimer *hrtimer, enum hrtimer_mode mode, bool was_armed), 225 226 226 - TP_ARGS(hrtimer, mode), 227 + TP_ARGS(hrtimer, mode, was_armed), 227 228 228 229 TP_STRUCT__entry( 229 230 __field( void *, hrtimer ) ··· 232 231 __field( s64, expires ) 233 232 __field( s64, softexpires ) 234 233 __field( enum hrtimer_mode, mode ) 234 + __field( bool, was_armed ) 235 235 ), 236 236 237 237 TP_fast_assign( ··· 241 239 __entry->expires = hrtimer_get_expires(hrtimer); 242 240 __entry->softexpires = hrtimer_get_softexpires(hrtimer); 243 241 __entry->mode = mode; 242 + __entry->was_armed = was_armed; 244 243 ), 245 244 246 245 TP_printk("hrtimer=%p function=%ps expires=%llu softexpires=%llu " 247 - "mode=%s", __entry->hrtimer, __entry->function, 246 + "mode=%s was_armed=%d", __entry->hrtimer, __entry->function, 248 247 (unsigned long long) __entry->expires, 249 248 (unsigned long long) __entry->softexpires, 250 - decode_hrtimer_mode(__entry->mode)) 249 + decode_hrtimer_mode(__entry->mode), __entry->was_armed) 251 250 ); 252 251 253 252 /** 254 253 * hrtimer_expire_entry - called immediately before the hrtimer callback 255 254 * @hrtimer: pointer to struct hrtimer 256 - * @now: pointer to variable which contains current time of the 257 - * timers base. 255 + * @now: variable which contains current time of the timers base. 258 256 * 259 257 * Allows to determine the timer latency. 
260 258 */ 261 259 TRACE_EVENT(hrtimer_expire_entry, 262 260 263 - TP_PROTO(struct hrtimer *hrtimer, ktime_t *now), 261 + TP_PROTO(struct hrtimer *hrtimer, ktime_t now), 264 262 265 263 TP_ARGS(hrtimer, now), 266 264 ··· 272 270 273 271 TP_fast_assign( 274 272 __entry->hrtimer = hrtimer; 275 - __entry->now = *now; 273 + __entry->now = now; 276 274 __entry->function = ACCESS_PRIVATE(hrtimer, function); 277 275 ), 278 276 ··· 321 319 TP_PROTO(struct hrtimer *hrtimer), 322 320 323 321 TP_ARGS(hrtimer) 322 + ); 323 + 324 + /** 325 + * hrtimer_rearm - Invoked when the clockevent device is rearmed 326 + * @next_event: The next expiry time (CLOCK_MONOTONIC) 327 + */ 328 + TRACE_EVENT(hrtimer_rearm, 329 + 330 + TP_PROTO(ktime_t next_event, bool deferred), 331 + 332 + TP_ARGS(next_event, deferred), 333 + 334 + TP_STRUCT__entry( 335 + __field( s64, next_event ) 336 + __field( bool, deferred ) 337 + ), 338 + 339 + TP_fast_assign( 340 + __entry->next_event = next_event; 341 + __entry->deferred = deferred; 342 + ), 343 + 344 + TP_printk("next_event=%llu deferred=%d", 345 + (unsigned long long) __entry->next_event, __entry->deferred) 324 346 ); 325 347 326 348 /**
+20 -20
include/trace/stages/stage3_trace_output.h
··· 64 64 #define __get_rel_sockaddr(field) ((struct sockaddr *)__get_rel_dynamic_array(field)) 65 65 66 66 #undef __print_flags 67 - #define __print_flags(flag, delim, flag_array...) \ 68 - ({ \ 69 - static const struct trace_print_flags __flags[] = \ 70 - { flag_array, { -1, NULL }}; \ 71 - trace_print_flags_seq(p, delim, flag, __flags); \ 67 + #define __print_flags(flag, delim, flag_array...) \ 68 + ({ \ 69 + static const struct trace_print_flags __flags[] = \ 70 + { flag_array }; \ 71 + trace_print_flags_seq(p, delim, flag, __flags, ARRAY_SIZE(__flags)); \ 72 72 }) 73 73 74 74 #undef __print_symbolic 75 - #define __print_symbolic(value, symbol_array...) \ 76 - ({ \ 77 - static const struct trace_print_flags symbols[] = \ 78 - { symbol_array, { -1, NULL }}; \ 79 - trace_print_symbols_seq(p, value, symbols); \ 75 + #define __print_symbolic(value, symbol_array...) \ 76 + ({ \ 77 + static const struct trace_print_flags symbols[] = \ 78 + { symbol_array }; \ 79 + trace_print_symbols_seq(p, value, symbols, ARRAY_SIZE(symbols)); \ 80 80 }) 81 81 82 82 #undef __print_flags_u64 83 83 #undef __print_symbolic_u64 84 84 #if BITS_PER_LONG == 32 85 - #define __print_flags_u64(flag, delim, flag_array...) \ 86 - ({ \ 87 - static const struct trace_print_flags_u64 __flags[] = \ 88 - { flag_array, { -1, NULL } }; \ 89 - trace_print_flags_seq_u64(p, delim, flag, __flags); \ 85 + #define __print_flags_u64(flag, delim, flag_array...) \ 86 + ({ \ 87 + static const struct trace_print_flags_u64 __flags[] = \ 88 + { flag_array }; \ 89 + trace_print_flags_seq_u64(p, delim, flag, __flags, ARRAY_SIZE(__flags)); \ 90 90 }) 91 91 92 - #define __print_symbolic_u64(value, symbol_array...) \ 93 - ({ \ 94 - static const struct trace_print_flags_u64 symbols[] = \ 95 - { symbol_array, { -1, NULL } }; \ 96 - trace_print_symbols_seq_u64(p, value, symbols); \ 92 + #define __print_symbolic_u64(value, symbol_array...) 
\ 93 + ({ \ 94 + static const struct trace_print_flags_u64 symbols[] = \ 95 + { symbol_array }; \ 96 + trace_print_symbols_seq_u64(p, value, symbols, ARRAY_SIZE(symbols)); \ 97 97 }) 98 98 #else 99 99 #define __print_flags_u64(flag, delim, flag_array...) \
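The macro changes in this file drop the terminating `{ -1, NULL }` entry from the generated tables and pass the element count instead. The sketch below shows a count-bounded lookup of that kind. `sk_flag`, `sk_symbol()` and the `sk_modes` table are hypothetical stand-ins, not the tracing code itself.

```c
#include <stddef.h>

#define SK_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical value/name pair, shaped like a symbol table entry. */
struct sk_flag {
	unsigned long mask;
	const char *name;
};

/* Example table without a sentinel entry at the end. */
static const struct sk_flag sk_modes[] = {
	{ 0, "ABS" },
	{ 1, "REL" },
};

/* Look up @val by explicit element count; returns NULL if not found. */
static const char *sk_symbol(unsigned long val, const struct sk_flag *tbl,
			     size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (tbl[i].mask == val)
			return tbl[i].name;
	}
	return NULL;
}
```

A caller passes the count computed at the definition site, for example `sk_symbol(1, sk_modes, SK_ARRAY_SIZE(sk_modes))`, so the loop bound no longer depends on table contents.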
+3 -1
kernel/entry/common.c
···
 		local_irq_enable_exit_to_user(ti_work);
 
 		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
-			if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
+			if (!rseq_grant_slice_extension(ti_work, TIF_SLICE_EXT_DENY))
 				schedule();
 		}
 
···
 	 */
 	if (state.exit_rcu) {
 		instrumentation_begin();
+		hrtimer_rearm_deferred();
 		/* Tell the tracer that IRET will enable interrupts */
 		trace_hardirqs_on_prepare();
 		lockdep_hardirqs_on_prepare();
···
 		if (IS_ENABLED(CONFIG_PREEMPTION))
 			irqentry_exit_cond_resched();
 
+		hrtimer_rearm_deferred();
 		/* Covers both tracing and lockdep */
 		trace_hardirqs_on();
 		instrumentation_end();
+77 -18
kernel/sched/core.c
··· 872 872 * Use HR-timers to deliver accurate preemption points. 873 873 */ 874 874 875 - static void hrtick_clear(struct rq *rq) 875 + enum { 876 + HRTICK_SCHED_NONE = 0, 877 + HRTICK_SCHED_DEFER = BIT(1), 878 + HRTICK_SCHED_START = BIT(2), 879 + HRTICK_SCHED_REARM_HRTIMER = BIT(3) 880 + }; 881 + 882 + static void __used hrtick_clear(struct rq *rq) 876 883 { 877 884 if (hrtimer_active(&rq->hrtick_timer)) 878 885 hrtimer_cancel(&rq->hrtick_timer); ··· 904 897 return HRTIMER_NORESTART; 905 898 } 906 899 907 - static void __hrtick_restart(struct rq *rq) 900 + static inline bool hrtick_needs_rearm(struct hrtimer *timer, ktime_t expires) 901 + { 902 + /* 903 + * Queued is false when the timer is not started or currently 904 + * running the callback. In both cases, restart. If queued check 905 + * whether the expiry time actually changes substantially. 906 + */ 907 + return !hrtimer_is_queued(timer) || 908 + abs(expires - hrtimer_get_expires(timer)) > 5000; 909 + } 910 + 911 + static void hrtick_cond_restart(struct rq *rq) 908 912 { 909 913 struct hrtimer *timer = &rq->hrtick_timer; 910 914 ktime_t time = rq->hrtick_time; 911 915 912 - hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD); 916 + if (hrtick_needs_rearm(timer, time)) 917 + hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD); 913 918 } 914 919 915 920 /* ··· 933 914 struct rq_flags rf; 934 915 935 916 rq_lock(rq, &rf); 936 - __hrtick_restart(rq); 917 + hrtick_cond_restart(rq); 937 918 rq_unlock(rq, &rf); 938 919 } 939 920 ··· 944 925 */ 945 926 void hrtick_start(struct rq *rq, u64 delay) 946 927 { 947 - struct hrtimer *timer = &rq->hrtick_timer; 948 928 s64 delta; 949 929 950 930 /* ··· 951 933 * doesn't make sense and can cause timer DoS. 
952 934 */ 953 935 delta = max_t(s64, delay, 10000LL); 954 - rq->hrtick_time = ktime_add_ns(hrtimer_cb_get_time(timer), delta); 936 + 937 + /* 938 + * If this is in the middle of schedule() only note the delay 939 + * and let hrtick_schedule_exit() deal with it. 940 + */ 941 + if (rq->hrtick_sched) { 942 + rq->hrtick_sched |= HRTICK_SCHED_START; 943 + rq->hrtick_delay = delta; 944 + return; 945 + } 946 + 947 + rq->hrtick_time = ktime_add_ns(ktime_get(), delta); 948 + if (!hrtick_needs_rearm(&rq->hrtick_timer, rq->hrtick_time)) 949 + return; 955 950 956 951 if (rq == this_rq()) 957 - __hrtick_restart(rq); 952 + hrtimer_start(&rq->hrtick_timer, rq->hrtick_time, HRTIMER_MODE_ABS_PINNED_HARD); 958 953 else 959 954 smp_call_function_single_async(cpu_of(rq), &rq->hrtick_csd); 955 + } 956 + 957 + static inline void hrtick_schedule_enter(struct rq *rq) 958 + { 959 + rq->hrtick_sched = HRTICK_SCHED_DEFER; 960 + if (hrtimer_test_and_clear_rearm_deferred()) 961 + rq->hrtick_sched |= HRTICK_SCHED_REARM_HRTIMER; 962 + } 963 + 964 + static inline void hrtick_schedule_exit(struct rq *rq) 965 + { 966 + if (rq->hrtick_sched & HRTICK_SCHED_START) { 967 + rq->hrtick_time = ktime_add_ns(ktime_get(), rq->hrtick_delay); 968 + hrtick_cond_restart(rq); 969 + } else if (idle_rq(rq)) { 970 + /* 971 + * No need for using hrtimer_is_active(). The timer is CPU local 972 + * and interrupts are disabled, so the callback cannot be 973 + * running and the queued state is valid. 
974 + */ 975 + if (hrtimer_is_queued(&rq->hrtick_timer)) 976 + hrtimer_cancel(&rq->hrtick_timer); 977 + } 978 + 979 + if (rq->hrtick_sched & HRTICK_SCHED_REARM_HRTIMER) 980 + __hrtimer_rearm_deferred(); 981 + 982 + rq->hrtick_sched = HRTICK_SCHED_NONE; 960 983 } 961 984 962 985 static void hrtick_rq_init(struct rq *rq) 963 986 { 964 987 INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq); 965 - hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD); 988 + rq->hrtick_sched = HRTICK_SCHED_NONE; 989 + hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC, 990 + HRTIMER_MODE_REL_HARD | HRTIMER_MODE_LAZY_REARM); 966 991 } 967 992 #else /* !CONFIG_SCHED_HRTICK: */ 968 - static inline void hrtick_clear(struct rq *rq) 969 - { 970 - } 971 - 972 - static inline void hrtick_rq_init(struct rq *rq) 973 - { 974 - } 993 + static inline void hrtick_clear(struct rq *rq) { } 994 + static inline void hrtick_rq_init(struct rq *rq) { } 995 + static inline void hrtick_schedule_enter(struct rq *rq) { } 996 + static inline void hrtick_schedule_exit(struct rq *rq) { } 975 997 #endif /* !CONFIG_SCHED_HRTICK */ 976 998 977 999 /* ··· 5090 5032 */ 5091 5033 spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_); 5092 5034 __balance_callbacks(rq, NULL); 5035 + hrtick_schedule_exit(rq); 5093 5036 raw_spin_rq_unlock_irq(rq); 5094 5037 } 5095 5038 ··· 6844 6785 6845 6786 schedule_debug(prev, preempt); 6846 6787 6847 - if (sched_feat(HRTICK) || sched_feat(HRTICK_DL)) 6848 - hrtick_clear(rq); 6849 - 6850 6788 klp_sched_try_switch(prev); 6851 6789 6852 6790 local_irq_disable(); ··· 6869 6813 */ 6870 6814 rq_lock(rq, &rf); 6871 6815 smp_mb__after_spinlock(); 6816 + 6817 + hrtick_schedule_enter(rq); 6872 6818 6873 6819 /* Promote REQ to ACT */ 6874 6820 rq->clock_update_flags <<= 1; ··· 6974 6916 6975 6917 rq_unpin_lock(rq, &rf); 6976 6918 __balance_callbacks(rq, NULL); 6919 + hrtick_schedule_exit(rq); 6977 6920 raw_spin_rq_unlock_irq(rq); 6978 6921 } 6979 6922 
trace_sched_exit_tp(is_switch);
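`hrtick_needs_rearm()` in this diff skips the timer restart when the timer is already queued and the new expiry differs from the queued one by at most 5000 ns. The check can be modeled in userspace as follows. `sk_timer` is a hypothetical two-field stand-in for the hrtimer state that the kernel helper queries.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same threshold as the constant used in the hunk above, in nanoseconds. */
#define SK_REARM_THRESHOLD_NS 5000

/* Hypothetical stand-in for the queued state and expiry of a timer. */
struct sk_timer {
	bool queued;
	int64_t expires_ns;
};

/* True when the timer must be (re)started with @new_expires_ns. */
static bool sk_needs_rearm(const struct sk_timer *t, int64_t new_expires_ns)
{
	int64_t diff;

	/* Not started or callback running: always restart */
	if (!t->queued)
		return true;

	diff = new_expires_ns - t->expires_ns;
	if (diff < 0)
		diff = -diff;

	/* Queued: restart only if the expiry moves substantially */
	return diff > SK_REARM_THRESHOLD_NS;
}
```

A queued timer at 1,000,000 ns is left alone for a new expiry of 1,004,000 ns and restarted for 1,005,001 ns. The comparison is strict, so a difference of exactly 5000 ns does not trigger a restart.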
+1 -1
kernel/sched/deadline.c
···
 		act = ns_to_ktime(dl_next_period(dl_se));
 	}
 
-	now = hrtimer_cb_get_time(timer);
+	now = ktime_get();
 	delta = ktime_to_ns(now) - rq_clock(rq);
 	act = ktime_add_ns(act, delta);
 
+32 -23
kernel/sched/fair.c
···
 	 * validating it and just reschedule.
 	 */
 	if (queued) {
-		resched_curr_lazy(rq_of(cfs_rq));
+		resched_curr(rq_of(cfs_rq));
 		return;
 	}
 #endif
···
 static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
+	unsigned long scale = 1024;
+	unsigned long util = 0;
+	u64 vdelta;
+	u64 delta;

 	WARN_ON_ONCE(task_rq(p) != rq);

-	if (rq->cfs.h_nr_queued > 1) {
-		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
-		u64 slice = se->slice;
-		s64 delta = slice - ran;
+	if (rq->cfs.h_nr_queued <= 1)
+		return;

-		if (delta < 0) {
-			if (task_current_donor(rq, p))
-				resched_curr(rq);
-			return;
-		}
-		hrtick_start(rq, delta);
+	/*
+	 * Compute time until virtual deadline
+	 */
+	vdelta = se->deadline - se->vruntime;
+	if ((s64)vdelta < 0) {
+		if (task_current_donor(rq, p))
+			resched_curr(rq);
+		return;
 	}
+	delta = (se->load.weight * vdelta) / NICE_0_LOAD;
+
+	/*
+	 * Correct for instantaneous load of other classes.
+	 */
+	util += cpu_util_irq(rq);
+	if (util && util < 1024) {
+		scale *= 1024;
+		scale /= (1024 - util);
+	}
+
+	hrtick_start(rq, (scale * delta) / 1024);
 }

 /*
- * called from enqueue/dequeue and updates the hrtick when the
- * current task is from our class and nr_running is low enough
- * to matter.
+ * Called on enqueue to start the hrtick when h_nr_queued becomes more than 1.
  */
 static void hrtick_update(struct rq *rq)
 {
 	struct task_struct *donor = rq->donor;

 	if (!hrtick_enabled_fair(rq) || donor->sched_class != &fair_sched_class)
+		return;
+
+	if (hrtick_active(rq))
 		return;

 	hrtick_start_fair(rq, donor);
···
 		WARN_ON_ONCE(!task_sleep);
 		WARN_ON_ONCE(p->on_rq != 1);

-		/* Fix-up what dequeue_task_fair() skipped */
-		hrtick_update(rq);
-
 		/*
 		 * Fix-up what block_task() skipped.
 		 *
···
 	/*
 	 * Must not reference @p after dequeue_entities(DEQUEUE_DELAYED).
 	 */
-
-	hrtick_update(rq);
 	return true;
 }

···
 		entity_tick(cfs_rq, se, queued);
 	}

-	if (queued) {
-		if (!need_resched())
-			hrtick_start_fair(rq, curr);
+	if (queued)
 		return;
-	}

 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
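Note on the hrtick_start_fair() hunk above: the delay is now derived from the distance to the virtual deadline instead of the remaining slice, and is then stretched by the IRQ utilization. The following user-space sketch only illustrates that arithmetic. It assumes simplified stand-ins: `SKETCH_NICE_0_LOAD` replaces `NICE_0_LOAD` (whose real value depends on the kernel configuration), `irq_util` replaces the result of `cpu_util_irq()`, and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: stand-in for NICE_0_LOAD; the kernel value depends on config. */
#define SKETCH_NICE_0_LOAD	1024ULL
/* Capacity scale; the diff uses the literal 1024. */
#define SKETCH_CAPACITY		1024ULL

/*
 * Hypothetical helper mirroring the arithmetic in hrtick_start_fair().
 * Returns false when the virtual deadline has already passed, the case
 * in which the kernel code reschedules instead of arming the timer.
 */
bool sketch_hrtick_delay(uint64_t deadline, uint64_t vruntime,
			 uint64_t weight, uint64_t irq_util,
			 uint64_t *delay_ns)
{
	uint64_t vdelta = deadline - vruntime;
	uint64_t scale = SKETCH_CAPACITY;
	uint64_t delta;

	if ((int64_t)vdelta < 0)
		return false;

	/* Convert the virtual time delta using the entity's weight. */
	delta = (weight * vdelta) / SKETCH_NICE_0_LOAD;

	/* Stretch the delay when part of the capacity is used by IRQs. */
	if (irq_util && irq_util < SKETCH_CAPACITY) {
		scale *= SKETCH_CAPACITY;
		scale /= (SKETCH_CAPACITY - irq_util);
	}

	*delay_ns = (scale * delta) / SKETCH_CAPACITY;
	return true;
}
```

With `irq_util = 512` (half of the capacity), the scale doubles, so a virtual delta of 3000000 ns at weight 1024 yields a delay of 6000000 ns in this sketch.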
+5
kernel/sched/features.h
···
  */
 SCHED_FEAT(WAKEUP_PREEMPTION, true)

+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+SCHED_FEAT(HRTICK, true)
+SCHED_FEAT(HRTICK_DL, true)
+#else
 SCHED_FEAT(HRTICK, false)
 SCHED_FEAT(HRTICK_DL, false)
+#endif

 /*
  * Decrement CPU capacity based on time not spent running tasks
+15 -28
kernel/sched/sched.h
···
 	call_single_data_t hrtick_csd;
 	struct hrtimer hrtick_timer;
 	ktime_t hrtick_time;
+	ktime_t hrtick_delay;
+	unsigned int hrtick_sched;
 #endif

 #ifdef CONFIG_SCHEDSTATS
···
  * - enabled by features
  * - hrtimer is actually high res
  */
-static inline int hrtick_enabled(struct rq *rq)
+static inline bool hrtick_enabled(struct rq *rq)
 {
-	if (!cpu_active(cpu_of(rq)))
-		return 0;
-	return hrtimer_is_hres_active(&rq->hrtick_timer);
+	return cpu_active(cpu_of(rq)) && hrtimer_highres_enabled();
 }

-static inline int hrtick_enabled_fair(struct rq *rq)
+static inline bool hrtick_enabled_fair(struct rq *rq)
 {
-	if (!sched_feat(HRTICK))
-		return 0;
-	return hrtick_enabled(rq);
+	return sched_feat(HRTICK) && hrtick_enabled(rq);
 }

-static inline int hrtick_enabled_dl(struct rq *rq)
+static inline bool hrtick_enabled_dl(struct rq *rq)
 {
-	if (!sched_feat(HRTICK_DL))
-		return 0;
-	return hrtick_enabled(rq);
+	return sched_feat(HRTICK_DL) && hrtick_enabled(rq);
 }

 extern void hrtick_start(struct rq *rq, u64 delay);
+static inline bool hrtick_active(struct rq *rq)
+{
+	return hrtimer_active(&rq->hrtick_timer);
+}

 #else /* !CONFIG_SCHED_HRTICK: */
-
-static inline int hrtick_enabled_fair(struct rq *rq)
-{
-	return 0;
-}
-
-static inline int hrtick_enabled_dl(struct rq *rq)
-{
-	return 0;
-}
-
-static inline int hrtick_enabled(struct rq *rq)
-{
-	return 0;
-}
-
+static inline bool hrtick_enabled_fair(struct rq *rq) { return false; }
+static inline bool hrtick_enabled_dl(struct rq *rq) { return false; }
+static inline bool hrtick_enabled(struct rq *rq) { return false; }
 #endif /* !CONFIG_SCHED_HRTICK */

 #ifndef arch_scale_freq_tick
+14 -1
kernel/softirq.c
···
 {
 	__irq_enter_raw();

+	/*
+	 * If this is a nested interrupt that hits the exit_to_user_mode_loop
+	 * where it has enabled interrupts but before it has hit schedule() we
+	 * could have hrtimers in an undefined state. Fix it up here.
+	 */
+	hrtimer_rearm_deferred();
+
 	if (tick_nohz_full_cpu(smp_processor_id()) ||
 	    (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
 		tick_irq_enter();
···
 #endif
 	account_hardirq_exit(current);
 	preempt_count_sub(HARDIRQ_OFFSET);
-	if (!in_interrupt() && local_softirq_pending())
+	if (!in_interrupt() && local_softirq_pending()) {
+		/*
+		 * If we left hrtimers unarmed, make sure to arm them now,
+		 * before enabling interrupts to run SoftIRQ.
+		 */
+		hrtimer_rearm_deferred();
 		invoke_softirq();
+	}

 	if (IS_ENABLED(CONFIG_IRQ_FORCED_THREADING) && force_irqthreads() &&
 	    local_timers_pending_force_th() && !(in_nmi() | in_hardirq()))
+2
kernel/time/.kunitconfig
···
+CONFIG_KUNIT=y
+CONFIG_TIME_KUNIT_TEST=y
+16 -12
kernel/time/Kconfig
···
 config ARCH_CLOCKSOURCE_INIT
 	bool

+config ARCH_WANTS_CLOCKSOURCE_READ_INLINE
+	bool
+
 # Timekeeping vsyscall support
 config GENERIC_TIME_VSYSCALL
 	bool
···
 config GENERIC_CLOCKEVENTS_MIN_ADJUST
 	bool

+config GENERIC_CLOCKEVENTS_COUPLED
+	bool
+
+config GENERIC_CLOCKEVENTS_COUPLED_INLINE
+	select GENERIC_CLOCKEVENTS_COUPLED
+	bool
+
 # Generic update of CMOS clock
 config GENERIC_CMOS_UPDATE
 	bool
+
+# Deferred rearming of the hrtimer interrupt
+config HRTIMER_REARM_DEFERRED
+	def_bool y
+	depends on GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+	depends on HIGH_RES_TIMERS && SCHED_HRTICK

 # Select to handle posix CPU timers from task_work
 # and not from the timer interrupt context
···
 	  hardware is not capable then this option only increases
 	  the size of the kernel image.

-config CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
-	int "Clocksource watchdog maximum allowable skew (in microseconds)"
-	depends on CLOCKSOURCE_WATCHDOG
-	range 50 1000
-	default 125
-	help
-	  Specify the maximum amount of allowable watchdog skew in
-	  microseconds before reporting the clocksource to be unstable.
-	  The default is based on a half-second clocksource watchdog
-	  interval and NTP's maximum frequency drift of 500 parts
-	  per million. If the clocksource is good enough for NTP,
-	  it is good enough for the clocksource watchdog!
 endif

 config POSIX_AUX_CLOCKS
+8 -4
kernel/time/alarmtimer.c
···
 	if (!rtc)
 		return 0;

-	/* Find the soonest timer to expire*/
+	/* Find the soonest timer to expire */
 	for (i = 0; i < ALARM_NUMTYPE; i++) {
 		struct alarm_base *base = &alarm_bases[i];
 		struct timerqueue_node *next;
+		ktime_t next_expires;
 		ktime_t delta;

-		scoped_guard(spinlock_irqsave, &base->lock)
+		scoped_guard(spinlock_irqsave, &base->lock) {
 			next = timerqueue_getnext(&base->timerqueue);
+			if (next)
+				next_expires = next->expires;
+		}
 		if (!next)
 			continue;
-		delta = ktime_sub(next->expires, base->get_ktime());
+		delta = ktime_sub(next_expires, base->get_ktime());
 		if (!min || (delta < min)) {
-			expires = next->expires;
+			expires = next_expires;
 			min = delta;
 			type = i;
 		}
+41 -7
kernel/time/clockevents.c
···

 #endif /* CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST */

+#ifdef CONFIG_GENERIC_CLOCKEVENTS_COUPLED
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_COUPLED_INLINE
+#include <asm/clock_inlined.h>
+#else
+static __always_inline void
+arch_inlined_clockevent_set_next_coupled(u64 u64 cycles, struct clock_event_device *dev) { }
+#endif
+
+static inline bool clockevent_set_next_coupled(struct clock_event_device *dev, ktime_t expires)
+{
+	u64 cycles;
+
+	if (unlikely(!(dev->features & CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED)))
+		return false;
+
+	if (unlikely(!ktime_expiry_to_cycles(dev->cs_id, expires, &cycles)))
+		return false;
+
+	if (IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED_INLINE))
+		arch_inlined_clockevent_set_next_coupled(cycles, dev);
+	else
+		dev->set_next_coupled(cycles, dev);
+	return true;
+}
+
+#else
+static inline bool clockevent_set_next_coupled(struct clock_event_device *dev, ktime_t expires)
+{
+	return false;
+}
+#endif
+
 /**
  * clockevents_program_event - Reprogram the clock event device.
  * @dev: device to program
···
  *
  * Returns 0 on success, -ETIME when the event is in the past.
  */
-int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
-			      bool force)
+int clockevents_program_event(struct clock_event_device *dev, ktime_t expires, bool force)
 {
-	unsigned long long clc;
 	int64_t delta;
+	u64 cycles;

 	if (WARN_ON_ONCE(expires < 0))
 		return -ETIME;
···
 	WARN_ONCE(!clockevent_state_oneshot(dev), "Current state: %d\n",
 		  clockevent_get_state(dev));

-	/* Shortcut for clockevent devices that can deal with ktime. */
-	if (dev->features & CLOCK_EVT_FEAT_KTIME)
+	/* ktime_t based reprogramming for the broadcast hrtimer device */
+	if (unlikely(dev->features & CLOCK_EVT_FEAT_HRTIMER))
 		return dev->set_next_ktime(expires, dev);
+
+	if (likely(clockevent_set_next_coupled(dev, expires)))
+		return 0;

 	delta = ktime_to_ns(ktime_sub(expires, ktime_get()));

···

 	if (delta > (int64_t)dev->min_delta_ns) {
 		delta = min(delta, (int64_t) dev->max_delta_ns);
-		clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
-		if (!dev->set_next_event((unsigned long) clc, dev))
+		cycles = ((u64)delta * dev->mult) >> dev->shift;
+		if (!dev->set_next_event((unsigned long) cycles, dev))
 			return 0;
 	}

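Note on the last clockevents.c hunk above: devices that do not take the coupled path still convert a relative timeout to device cycles with the mult/shift pair. A minimal user-space sketch of that conversion follows. The struct and function names are hypothetical, and the struct is a reduced stand-in for `struct clock_event_device`. The handling of timeouts at or below `min_delta_ns` is deliberately left out, as that code is not part of the hunk.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct clock_event_device. */
struct sketch_clockevent {
	uint32_t mult;
	uint32_t shift;
	uint64_t min_delta_ns;
	uint64_t max_delta_ns;
};

/*
 * Convert a relative timeout in nanoseconds to device cycles:
 * cycles = (delta_ns * mult) >> shift, after clamping to max_delta_ns.
 * Returns false when delta_ns does not exceed min_delta_ns; the kernel
 * handles that case separately, which this sketch does not model.
 */
bool sketch_delta_to_cycles(const struct sketch_clockevent *dev,
			    int64_t delta_ns, uint64_t *cycles)
{
	if (delta_ns <= (int64_t)dev->min_delta_ns)
		return false;

	if (delta_ns > (int64_t)dev->max_delta_ns)
		delta_ns = (int64_t)dev->max_delta_ns;

	*cycles = ((uint64_t)delta_ns * dev->mult) >> dev->shift;
	return true;
}
```

For example, `mult = 1U << 31` with `shift = 32` corresponds to 0.5 cycles per nanosecond, so a timeout of 1000 ns maps to 500 cycles.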
+145 -151
kernel/time/clocksource-wdtest.c
···
  * Unit test for the clocksource watchdog.
  *
  * Copyright (C) 2021 Facebook, Inc.
+ * Copyright (C) 2026 Intel Corp.
  *
  * Author: Paul E. McKenney <paulmck@kernel.org>
+ * Author: Thomas Gleixner <tglx@kernel.org>
  */
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

-#include <linux/device.h>
 #include <linux/clocksource.h>
-#include <linux/init.h>
-#include <linux/module.h>
-#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
-#include <linux/tick.h>
-#include <linux/kthread.h>
 #include <linux/delay.h>
-#include <linux/prandom.h>
-#include <linux/cpu.h>
+#include <linux/module.h>
+#include <linux/kthread.h>

 #include "tick-internal.h"
+#include "timekeeping_internal.h"

 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("Clocksource watchdog unit test");
 MODULE_AUTHOR("Paul E. McKenney <paulmck@kernel.org>");
+MODULE_AUTHOR("Thomas Gleixner <tglx@kernel.org>");

-static int holdoff = IS_BUILTIN(CONFIG_TEST_CLOCKSOURCE_WATCHDOG) ? 10 : 0;
-module_param(holdoff, int, 0444);
-MODULE_PARM_DESC(holdoff, "Time to wait to start test (s).");
-
-/* Watchdog kthread's task_struct pointer for debug purposes. */
-static struct task_struct *wdtest_task;
-
-static u64 wdtest_jiffies_read(struct clocksource *cs)
-{
-	return (u64)jiffies;
-}
-
-static struct clocksource clocksource_wdtest_jiffies = {
-	.name = "wdtest-jiffies",
-	.rating = 1, /* lowest valid rating*/
-	.uncertainty_margin = TICK_NSEC,
-	.read = wdtest_jiffies_read,
-	.mask = CLOCKSOURCE_MASK(32),
-	.flags = CLOCK_SOURCE_MUST_VERIFY,
-	.mult = TICK_NSEC << JIFFIES_SHIFT, /* details above */
-	.shift = JIFFIES_SHIFT,
-	.max_cycles = 10,
+enum wdtest_states {
+	WDTEST_INJECT_NONE,
+	WDTEST_INJECT_DELAY,
+	WDTEST_INJECT_POSITIVE,
+	WDTEST_INJECT_NEGATIVE,
+	WDTEST_INJECT_PERCPU = 0x100,
 };

-static int wdtest_ktime_read_ndelays;
-static bool wdtest_ktime_read_fuzz;
+static enum wdtest_states wdtest_state;
+static unsigned long wdtest_test_count;
+static ktime_t wdtest_last_ts, wdtest_offset;
+
+#define SHIFT_4000PPM 8
+
+static ktime_t wdtest_get_offset(struct clocksource *cs)
+{
+	if (wdtest_state < WDTEST_INJECT_PERCPU)
+		return wdtest_test_count & 0x1 ? 0 : wdtest_offset >> SHIFT_4000PPM;
+
+	/* Only affect the readout of the "remote" CPU */
+	return cs->wd_cpu == smp_processor_id() ? 0 : NSEC_PER_MSEC;
+}

 static u64 wdtest_ktime_read(struct clocksource *cs)
 {
-	int wkrn = READ_ONCE(wdtest_ktime_read_ndelays);
-	static int sign = 1;
-	u64 ret;
+	ktime_t now = ktime_get_raw_fast_ns();
+	ktime_t intv = now - wdtest_last_ts;

-	if (wkrn) {
-		udelay(cs->uncertainty_margin / 250);
-		WRITE_ONCE(wdtest_ktime_read_ndelays, wkrn - 1);
+	/*
+	 * Only increment the test counter once per watchdog interval and
+	 * store the interval for the offset calculation of this step. This
+	 * guarantees a consistent behaviour even if the other side needs
+	 * to repeat due to a watchdog read timeout.
+	 */
+	if (intv > (NSEC_PER_SEC / 4)) {
+		WRITE_ONCE(wdtest_test_count, wdtest_test_count + 1);
+		wdtest_last_ts = now;
+		wdtest_offset = intv;
 	}
-	ret = ktime_get_real_fast_ns();
-	if (READ_ONCE(wdtest_ktime_read_fuzz)) {
-		sign = -sign;
-		ret = ret + sign * 100 * NSEC_PER_MSEC;
+
+	switch (wdtest_state & ~WDTEST_INJECT_PERCPU) {
+	case WDTEST_INJECT_POSITIVE:
+		return now + wdtest_get_offset(cs);
+	case WDTEST_INJECT_NEGATIVE:
+		return now - wdtest_get_offset(cs);
+	case WDTEST_INJECT_DELAY:
+		udelay(500);
+		return now;
+	default:
+		return now;
 	}
-	return ret;
 }

-static void wdtest_ktime_cs_mark_unstable(struct clocksource *cs)
-{
-	pr_info("--- Marking %s unstable due to clocksource watchdog.\n", cs->name);
-}
-
-#define KTIME_FLAGS (CLOCK_SOURCE_IS_CONTINUOUS | \
-		     CLOCK_SOURCE_VALID_FOR_HRES | \
-		     CLOCK_SOURCE_MUST_VERIFY | \
-		     CLOCK_SOURCE_VERIFY_PERCPU)
+#define KTIME_FLAGS (CLOCK_SOURCE_IS_CONTINUOUS | \
+		     CLOCK_SOURCE_CALIBRATED | \
+		     CLOCK_SOURCE_MUST_VERIFY | \
+		     CLOCK_SOURCE_WDTEST)

 static struct clocksource clocksource_wdtest_ktime = {
 	.name = "wdtest-ktime",
-	.rating = 300,
+	.rating = 10,
 	.read = wdtest_ktime_read,
 	.mask = CLOCKSOURCE_MASK(64),
 	.flags = KTIME_FLAGS,
-	.mark_unstable = wdtest_ktime_cs_mark_unstable,
 	.list = LIST_HEAD_INIT(clocksource_wdtest_ktime.list),
 };

-/* Reset the clocksource if needed. */
-static void wdtest_ktime_clocksource_reset(void)
+static void wdtest_clocksource_reset(enum wdtest_states which, bool percpu)
 {
-	if (clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE) {
-		clocksource_unregister(&clocksource_wdtest_ktime);
-		clocksource_wdtest_ktime.flags = KTIME_FLAGS;
-		schedule_timeout_uninterruptible(HZ / 10);
-		clocksource_register_khz(&clocksource_wdtest_ktime, 1000 * 1000);
-	}
-}
-
-/* Run the specified series of watchdog tests. */
-static int wdtest_func(void *arg)
-{
-	unsigned long j1, j2;
-	int i, max_retries;
-	char *s;
-
-	schedule_timeout_uninterruptible(holdoff * HZ);
-
-	/*
-	 * Verify that jiffies-like clocksources get the manually
-	 * specified uncertainty margin.
-	 */
-	pr_info("--- Verify jiffies-like uncertainty margin.\n");
-	__clocksource_register(&clocksource_wdtest_jiffies);
-	WARN_ON_ONCE(clocksource_wdtest_jiffies.uncertainty_margin != TICK_NSEC);
-
-	j1 = clocksource_wdtest_jiffies.read(&clocksource_wdtest_jiffies);
-	schedule_timeout_uninterruptible(HZ);
-	j2 = clocksource_wdtest_jiffies.read(&clocksource_wdtest_jiffies);
-	WARN_ON_ONCE(j1 == j2);
-
-	clocksource_unregister(&clocksource_wdtest_jiffies);
-
-	/*
-	 * Verify that tsc-like clocksources are assigned a reasonable
-	 * uncertainty margin.
-	 */
-	pr_info("--- Verify tsc-like uncertainty margin.\n");
-	clocksource_register_khz(&clocksource_wdtest_ktime, 1000 * 1000);
-	WARN_ON_ONCE(clocksource_wdtest_ktime.uncertainty_margin < NSEC_PER_USEC);
-
-	j1 = clocksource_wdtest_ktime.read(&clocksource_wdtest_ktime);
-	udelay(1);
-	j2 = clocksource_wdtest_ktime.read(&clocksource_wdtest_ktime);
-	pr_info("--- tsc-like times: %lu - %lu = %lu.\n", j2, j1, j2 - j1);
-	WARN_ONCE(time_before(j2, j1 + NSEC_PER_USEC),
-		  "Expected at least 1000ns, got %lu.\n", j2 - j1);
-
-	/* Verify tsc-like stability with various numbers of errors injected. */
-	max_retries = clocksource_get_max_watchdog_retry();
-	for (i = 0; i <= max_retries + 1; i++) {
-		if (i <= 1 && i < max_retries)
-			s = "";
-		else if (i <= max_retries)
-			s = ", expect message";
-		else
-			s = ", expect clock skew";
-		pr_info("--- Watchdog with %dx error injection, %d retries%s.\n", i, max_retries, s);
-		WRITE_ONCE(wdtest_ktime_read_ndelays, i);
-		schedule_timeout_uninterruptible(2 * HZ);
-		WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
-		WARN_ON_ONCE((i <= max_retries) !=
-			     !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
-		wdtest_ktime_clocksource_reset();
-	}
-
-	/* Verify tsc-like stability with clock-value-fuzz error injection. */
-	pr_info("--- Watchdog clock-value-fuzz error injection, expect clock skew and per-CPU mismatches.\n");
-	WRITE_ONCE(wdtest_ktime_read_fuzz, true);
-	schedule_timeout_uninterruptible(2 * HZ);
-	WARN_ON_ONCE(!(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
-	clocksource_verify_percpu(&clocksource_wdtest_ktime);
-	WRITE_ONCE(wdtest_ktime_read_fuzz, false);
-
 	clocksource_unregister(&clocksource_wdtest_ktime);

-	pr_info("--- Done with test.\n");
+	pr_info("Test: State %d percpu %d\n", which, percpu);
+
+	wdtest_state = which;
+	if (percpu)
+		wdtest_state |= WDTEST_INJECT_PERCPU;
+	wdtest_test_count = 0;
+	wdtest_last_ts = 0;
+
+	clocksource_wdtest_ktime.rating = 10;
+	clocksource_wdtest_ktime.flags = KTIME_FLAGS;
+	if (percpu)
+		clocksource_wdtest_ktime.flags |= CLOCK_SOURCE_WDTEST_PERCPU;
+	clocksource_register_khz(&clocksource_wdtest_ktime, 1000 * 1000);
+}
+
+static bool wdtest_execute(enum wdtest_states which, bool percpu, unsigned int expect,
+			   unsigned long calls)
+{
+	wdtest_clocksource_reset(which, percpu);
+
+	for (; READ_ONCE(wdtest_test_count) < calls; msleep(100)) {
+		unsigned int flags = READ_ONCE(clocksource_wdtest_ktime.flags);
+
+		if (kthread_should_stop())
+			return false;
+
+		if (flags & CLOCK_SOURCE_UNSTABLE) {
+			if (expect & CLOCK_SOURCE_UNSTABLE)
+				return true;
+			pr_warn("Fail: Unexpected unstable\n");
+			return false;
+		}
+		if (flags & CLOCK_SOURCE_VALID_FOR_HRES) {
+			if (expect & CLOCK_SOURCE_VALID_FOR_HRES)
+				return true;
+			pr_warn("Fail: Unexpected valid for highres\n");
+			return false;
+		}
+	}
+
+	if (!expect)
+		return true;
+
+	pr_warn("Fail: Timed out\n");
+	return false;
+}
+
+static bool wdtest_run(bool percpu)
+{
+	if (!wdtest_execute(WDTEST_INJECT_NONE, percpu, CLOCK_SOURCE_VALID_FOR_HRES, 8))
+		return false;
+
+	if (!wdtest_execute(WDTEST_INJECT_DELAY, percpu, 0, 4))
+		return false;
+
+	if (!wdtest_execute(WDTEST_INJECT_POSITIVE, percpu, CLOCK_SOURCE_UNSTABLE, 8))
+		return false;
+
+	if (!wdtest_execute(WDTEST_INJECT_NEGATIVE, percpu, CLOCK_SOURCE_UNSTABLE, 8))
+		return false;
+
+	return true;
+}
+
+static int wdtest_func(void *arg)
+{
+	clocksource_register_khz(&clocksource_wdtest_ktime, 1000 * 1000);
+	if (wdtest_run(false)) {
+		if (wdtest_run(true))
+			pr_info("Success: All tests passed\n");
+	}
+	clocksource_unregister(&clocksource_wdtest_ktime);
+
+	if (!IS_MODULE(CONFIG_TEST_CLOCKSOURCE_WATCHDOG))
+		return 0;
+
+	while (!kthread_should_stop())
+		schedule_timeout_interruptible(3600 * HZ);
 	return 0;
 }

-static void wdtest_print_module_parms(void)
-{
-	pr_alert("--- holdoff=%d\n", holdoff);
-}
-
-/* Cleanup function. */
-static void clocksource_wdtest_cleanup(void)
-{
-}
+static struct task_struct *wdtest_thread;

 static int __init clocksource_wdtest_init(void)
 {
-	int ret = 0;
+	struct task_struct *t = kthread_run(wdtest_func, NULL, "wdtest");

-	wdtest_print_module_parms();
-
-	/* Create watchdog-test task. */
-	wdtest_task = kthread_run(wdtest_func, NULL, "wdtest");
-	if (IS_ERR(wdtest_task)) {
-		ret = PTR_ERR(wdtest_task);
-		pr_warn("%s: Failed to create wdtest kthread.\n", __func__);
-		wdtest_task = NULL;
-		return ret;
+	if (IS_ERR(t)) {
+		pr_warn("Failed to create wdtest kthread.\n");
+		return PTR_ERR(t);
 	}
-
+	wdtest_thread = t;
 	return 0;
 }
-
 module_init(clocksource_wdtest_init);
+
+static void clocksource_wdtest_cleanup(void)
+{
+	if (wdtest_thread)
+		kthread_stop(wdtest_thread);
+}
 module_exit(clocksource_wdtest_cleanup);
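Note on the wdtest error injection above: the test derives its offset with `wdtest_offset >> SHIFT_4000PPM`, a shift that only approximates the named ppm value. The user-space sketch below compares the shift with an exact computation. The helper names are hypothetical, and `SHIFT_500PPM` is taken from the clocksource.c part of this merge rather than from the test module.

```c
#include <stdint.h>

#define SHIFT_500PPM	11	/* 1/2048, roughly 488 ppm */
#define SHIFT_4000PPM	8	/* 1/256, roughly 3906 ppm */

/* Shift-based approximation, as in wdtest_offset >> SHIFT_4000PPM. */
uint64_t sketch_ppm_approx(uint64_t delta_ns, unsigned int shift)
{
	return delta_ns >> shift;
}

/* Exact reference for comparison only; not what the kernel code does. */
uint64_t sketch_ppm_exact(uint64_t delta_ns, uint64_t ppm)
{
	return (delta_ns * ppm) / 1000000ULL;
}
```

For a 500 ms interval the shift yields 1953125 ns, while exactly 4000 ppm would be 2000000 ns, so the shortcut stays within a few percent of the nominal value without a division.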
+474 -403
kernel/time/clocksource.c
··· 7 7 8 8 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt 9 9 10 - #include <linux/device.h> 11 10 #include <linux/clocksource.h> 12 - #include <linux/init.h> 13 - #include <linux/module.h> 14 - #include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */ 15 - #include <linux/tick.h> 16 - #include <linux/kthread.h> 17 - #include <linux/prandom.h> 18 11 #include <linux/cpu.h> 12 + #include <linux/delay.h> 13 + #include <linux/device.h> 14 + #include <linux/init.h> 15 + #include <linux/kthread.h> 16 + #include <linux/module.h> 17 + #include <linux/prandom.h> 18 + #include <linux/sched.h> 19 + #include <linux/tick.h> 20 + #include <linux/topology.h> 19 21 20 22 #include "tick-internal.h" 21 23 #include "timekeeping_internal.h" ··· 109 107 static int finished_booting; 110 108 static u64 suspend_start; 111 109 112 - /* 113 - * Interval: 0.5sec. 114 - */ 115 - #define WATCHDOG_INTERVAL (HZ >> 1) 116 - #define WATCHDOG_INTERVAL_MAX_NS ((2 * WATCHDOG_INTERVAL) * (NSEC_PER_SEC / HZ)) 117 - 118 - /* 119 - * Threshold: 0.0312s, when doubled: 0.0625s. 120 - */ 121 - #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 5) 122 - 123 - /* 124 - * Maximum permissible delay between two readouts of the watchdog 125 - * clocksource surrounding a read of the clocksource being validated. 126 - * This delay could be due to SMIs, NMIs, or to VCPU preemptions. Used as 127 - * a lower bound for cs->uncertainty_margin values when registering clocks. 128 - * 129 - * The default of 500 parts per million is based on NTP's limits. 130 - * If a clocksource is good enough for NTP, it is good enough for us! 131 - * 132 - * In other words, by default, even if a clocksource is extremely 133 - * precise (for example, with a sub-nanosecond period), the maximum 134 - * permissible skew between the clocksource watchdog and the clocksource 135 - * under test is not permitted to go below the 500ppm minimum defined 136 - * by MAX_SKEW_USEC. 
This 500ppm minimum may be overridden using the 137 - * CLOCKSOURCE_WATCHDOG_MAX_SKEW_US Kconfig option. 138 - */ 139 - #ifdef CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US 140 - #define MAX_SKEW_USEC CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US 141 - #else 142 - #define MAX_SKEW_USEC (125 * WATCHDOG_INTERVAL / HZ) 143 - #endif 144 - 145 - /* 146 - * Default for maximum permissible skew when cs->uncertainty_margin is 147 - * not specified, and the lower bound even when cs->uncertainty_margin 148 - * is specified. This is also the default that is used when registering 149 - * clocks with unspecified cs->uncertainty_margin, so this macro is used 150 - * even in CONFIG_CLOCKSOURCE_WATCHDOG=n kernels. 151 - */ 152 - #define WATCHDOG_MAX_SKEW (MAX_SKEW_USEC * NSEC_PER_USEC) 153 - 154 110 #ifdef CONFIG_CLOCKSOURCE_WATCHDOG 155 111 static void clocksource_watchdog_work(struct work_struct *work); 156 112 static void clocksource_select(void); ··· 120 160 static DEFINE_SPINLOCK(watchdog_lock); 121 161 static int watchdog_running; 122 162 static atomic_t watchdog_reset_pending; 123 - static int64_t watchdog_max_interval; 163 + 164 + /* Watchdog interval: 0.5sec. */ 165 + #define WATCHDOG_INTERVAL (HZ >> 1) 166 + #define WATCHDOG_INTERVAL_NS (WATCHDOG_INTERVAL * (NSEC_PER_SEC / HZ)) 167 + 168 + /* Maximum time between two reference watchdog readouts */ 169 + #define WATCHDOG_READOUT_MAX_NS (50U * NSEC_PER_USEC) 170 + 171 + /* 172 + * Maximum time between two remote readouts for NUMA=n. On NUMA enabled systems 173 + * the timeout is calculated from the numa distance. 174 + */ 175 + #define WATCHDOG_DEFAULT_TIMEOUT_NS (50U * NSEC_PER_USEC) 176 + 177 + /* 178 + * Remote timeout NUMA distance multiplier. The local distance is 10. The 179 + * default remote distance is 20. ACPI tables provide more accurate numbers 180 + * which are guaranteed to be greater than the local distance. 181 + * 182 + * This results in a 5us base value, which is equivalent to the above !NUMA 183 + * default. 
184 + */ 185 + #define WATCHDOG_NUMA_MULTIPLIER_NS ((u64)(WATCHDOG_DEFAULT_TIMEOUT_NS / LOCAL_DISTANCE)) 186 + 187 + /* Limit the NUMA timeout in case the distance values are insanely big */ 188 + #define WATCHDOG_NUMA_MAX_TIMEOUT_NS ((u64)(500U * NSEC_PER_USEC)) 189 + 190 + /* Shift values to calculate the approximate $N ppm of a given delta. */ 191 + #define SHIFT_500PPM 11 192 + #define SHIFT_4000PPM 8 193 + 194 + /* Number of attempts to read the watchdog */ 195 + #define WATCHDOG_FREQ_RETRIES 3 196 + 197 + /* Five reads local and remote for inter CPU skew detection */ 198 + #define WATCHDOG_REMOTE_MAX_SEQ 10 124 199 125 200 static inline void clocksource_watchdog_lock(unsigned long *flags) 126 201 { ··· 236 241 spin_unlock_irqrestore(&watchdog_lock, flags); 237 242 } 238 243 239 - static int verify_n_cpus = 8; 240 - module_param(verify_n_cpus, int, 0644); 241 - 242 - enum wd_read_status { 243 - WD_READ_SUCCESS, 244 - WD_READ_UNSTABLE, 245 - WD_READ_SKIP 246 - }; 247 - 248 - static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow) 249 - { 250 - int64_t md = watchdog->uncertainty_margin; 251 - unsigned int nretries, max_retries; 252 - int64_t wd_delay, wd_seq_delay; 253 - u64 wd_end, wd_end2; 254 - 255 - max_retries = clocksource_get_max_watchdog_retry(); 256 - for (nretries = 0; nretries <= max_retries; nretries++) { 257 - local_irq_disable(); 258 - *wdnow = watchdog->read(watchdog); 259 - *csnow = cs->read(cs); 260 - wd_end = watchdog->read(watchdog); 261 - wd_end2 = watchdog->read(watchdog); 262 - local_irq_enable(); 263 - 264 - wd_delay = cycles_to_nsec_safe(watchdog, *wdnow, wd_end); 265 - if (wd_delay <= md + cs->uncertainty_margin) { 266 - if (nretries > 1 && nretries >= max_retries) { 267 - pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n", 268 - smp_processor_id(), watchdog->name, nretries); 269 - } 270 - return WD_READ_SUCCESS; 271 - } 272 - 273 - /* 274 - * Now compute delay in consecutive 
watchdog read to see if 275 - * there is too much external interferences that cause 276 - * significant delay in reading both clocksource and watchdog. 277 - * 278 - * If consecutive WD read-back delay > md, report 279 - * system busy, reinit the watchdog and skip the current 280 - * watchdog test. 281 - */ 282 - wd_seq_delay = cycles_to_nsec_safe(watchdog, wd_end, wd_end2); 283 - if (wd_seq_delay > md) 284 - goto skip_test; 285 - } 286 - 287 - pr_warn("timekeeping watchdog on CPU%d: wd-%s-wd excessive read-back delay of %lldns vs. limit of %ldns, wd-wd read-back delay only %lldns, attempt %d, marking %s unstable\n", 288 - smp_processor_id(), cs->name, wd_delay, WATCHDOG_MAX_SKEW, wd_seq_delay, nretries, cs->name); 289 - return WD_READ_UNSTABLE; 290 - 291 - skip_test: 292 - pr_info("timekeeping watchdog on CPU%d: %s wd-wd read-back delay of %lldns\n", 293 - smp_processor_id(), watchdog->name, wd_seq_delay); 294 - pr_info("wd-%s-wd read-back delay of %lldns, clock-skew test skipped!\n", 295 - cs->name, wd_delay); 296 - return WD_READ_SKIP; 297 - } 298 - 299 - static u64 csnow_mid; 300 - static cpumask_t cpus_ahead; 301 - static cpumask_t cpus_behind; 302 - static cpumask_t cpus_chosen; 303 - 304 - static void clocksource_verify_choose_cpus(void) 305 - { 306 - int cpu, i, n = verify_n_cpus; 307 - 308 - if (n < 0 || n >= num_online_cpus()) { 309 - /* Check all of the CPUs. */ 310 - cpumask_copy(&cpus_chosen, cpu_online_mask); 311 - cpumask_clear_cpu(smp_processor_id(), &cpus_chosen); 312 - return; 313 - } 314 - 315 - /* If no checking desired, or no other CPU to check, leave. */ 316 - cpumask_clear(&cpus_chosen); 317 - if (n == 0 || num_online_cpus() <= 1) 318 - return; 319 - 320 - /* Make sure to select at least one CPU other than the current CPU. 
*/ 321 - cpu = cpumask_any_but(cpu_online_mask, smp_processor_id()); 322 - if (WARN_ON_ONCE(cpu >= nr_cpu_ids)) 323 - return; 324 - cpumask_set_cpu(cpu, &cpus_chosen); 325 - 326 - /* Force a sane value for the boot parameter. */ 327 - if (n > nr_cpu_ids) 328 - n = nr_cpu_ids; 329 - 330 - /* 331 - * Randomly select the specified number of CPUs. If the same 332 - * CPU is selected multiple times, that CPU is checked only once, 333 - * and no replacement CPU is selected. This gracefully handles 334 - * situations where verify_n_cpus is greater than the number of 335 - * CPUs that are currently online. 336 - */ 337 - for (i = 1; i < n; i++) { 338 - cpu = cpumask_random(cpu_online_mask); 339 - if (!WARN_ON_ONCE(cpu >= nr_cpu_ids)) 340 - cpumask_set_cpu(cpu, &cpus_chosen); 341 - } 342 - 343 - /* Don't verify ourselves. */ 344 - cpumask_clear_cpu(smp_processor_id(), &cpus_chosen); 345 - } 346 - 347 - static void clocksource_verify_one_cpu(void *csin) 348 - { 349 - struct clocksource *cs = (struct clocksource *)csin; 350 - 351 - csnow_mid = cs->read(cs); 352 - } 353 - 354 - void clocksource_verify_percpu(struct clocksource *cs) 355 - { 356 - int64_t cs_nsec, cs_nsec_max = 0, cs_nsec_min = LLONG_MAX; 357 - u64 csnow_begin, csnow_end; 358 - int cpu, testcpu; 359 - s64 delta; 360 - 361 - if (verify_n_cpus == 0) 362 - return; 363 - cpumask_clear(&cpus_ahead); 364 - cpumask_clear(&cpus_behind); 365 - cpus_read_lock(); 366 - migrate_disable(); 367 - clocksource_verify_choose_cpus(); 368 - if (cpumask_empty(&cpus_chosen)) { 369 - migrate_enable(); 370 - cpus_read_unlock(); 371 - pr_warn("Not enough CPUs to check clocksource '%s'.\n", cs->name); 372 - return; 373 - } 374 - testcpu = smp_processor_id(); 375 - pr_info("Checking clocksource %s synchronization from CPU %d to CPUs %*pbl.\n", 376 - cs->name, testcpu, cpumask_pr_args(&cpus_chosen)); 377 - preempt_disable(); 378 - for_each_cpu(cpu, &cpus_chosen) { 379 - if (cpu == testcpu) 380 - continue; 381 - csnow_begin = cs->read(cs); 
382 - smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1); 383 - csnow_end = cs->read(cs); 384 - delta = (s64)((csnow_mid - csnow_begin) & cs->mask); 385 - if (delta < 0) 386 - cpumask_set_cpu(cpu, &cpus_behind); 387 - delta = (csnow_end - csnow_mid) & cs->mask; 388 - if (delta < 0) 389 - cpumask_set_cpu(cpu, &cpus_ahead); 390 - cs_nsec = cycles_to_nsec_safe(cs, csnow_begin, csnow_end); 391 - if (cs_nsec > cs_nsec_max) 392 - cs_nsec_max = cs_nsec; 393 - if (cs_nsec < cs_nsec_min) 394 - cs_nsec_min = cs_nsec; 395 - } 396 - preempt_enable(); 397 - migrate_enable(); 398 - cpus_read_unlock(); 399 - if (!cpumask_empty(&cpus_ahead)) 400 - pr_warn(" CPUs %*pbl ahead of CPU %d for clocksource %s.\n", 401 - cpumask_pr_args(&cpus_ahead), testcpu, cs->name); 402 - if (!cpumask_empty(&cpus_behind)) 403 - pr_warn(" CPUs %*pbl behind CPU %d for clocksource %s.\n", 404 - cpumask_pr_args(&cpus_behind), testcpu, cs->name); 405 - pr_info(" CPU %d check durations %lldns - %lldns for clocksource %s.\n", 406 - testcpu, cs_nsec_min, cs_nsec_max, cs->name); 407 - } 408 - EXPORT_SYMBOL_GPL(clocksource_verify_percpu); 409 - 410 244 static inline void clocksource_reset_watchdog(void) 411 245 { 412 246 struct clocksource *cs; ··· 244 420 cs->flags &= ~CLOCK_SOURCE_WATCHDOG; 245 421 } 246 422 423 + enum wd_result { 424 + WD_SUCCESS, 425 + WD_FREQ_NO_WATCHDOG, 426 + WD_FREQ_TIMEOUT, 427 + WD_FREQ_RESET, 428 + WD_FREQ_SKEWED, 429 + WD_CPU_TIMEOUT, 430 + WD_CPU_SKEWED, 431 + }; 432 + 433 + struct watchdog_cpu_data { 434 + /* Keep first as it is 32 byte aligned */ 435 + call_single_data_t csd; 436 + atomic_t remote_inprogress; 437 + enum wd_result result; 438 + u64 cpu_ts[2]; 439 + struct clocksource *cs; 440 + /* Ensure that the sequence is in a separate cache line */ 441 + atomic_t seq ____cacheline_aligned; 442 + /* Set by the control CPU according to NUMA distance */ 443 + u64 timeout_ns; 444 + }; 445 + 446 + struct watchdog_data { 447 + raw_spinlock_t lock; 448 + enum wd_result 
result; 449 + 450 + u64 wd_seq; 451 + u64 wd_delta; 452 + u64 cs_delta; 453 + u64 cpu_ts[2]; 454 + 455 + unsigned int curr_cpu; 456 + } ____cacheline_aligned_in_smp; 457 + 458 + static void watchdog_check_skew_remote(void *unused); 459 + 460 + static DEFINE_PER_CPU_ALIGNED(struct watchdog_cpu_data, watchdog_cpu_data) = { 461 + .csd = CSD_INIT(watchdog_check_skew_remote, NULL), 462 + }; 463 + 464 + static struct watchdog_data watchdog_data = { 465 + .lock = __RAW_SPIN_LOCK_UNLOCKED(watchdog_data.lock), 466 + }; 467 + 468 + static inline void watchdog_set_result(struct watchdog_cpu_data *wd, enum wd_result result) 469 + { 470 + guard(raw_spinlock)(&watchdog_data.lock); 471 + if (!wd->result) { 472 + atomic_set(&wd->seq, WATCHDOG_REMOTE_MAX_SEQ); 473 + WRITE_ONCE(wd->result, result); 474 + } 475 + } 476 + 477 + /* Wait for the sequence number to hand over control. */ 478 + static bool watchdog_wait_seq(struct watchdog_cpu_data *wd, u64 start, int seq) 479 + { 480 + for (int cnt = 0; atomic_read(&wd->seq) < seq; cnt++) { 481 + /* Bail if the other side set an error result */ 482 + if (READ_ONCE(wd->result) != WD_SUCCESS) 483 + return false; 484 + 485 + /* Prevent endless loops if the other CPU does not react.
*/ 486 + if (cnt == 5000) { 487 + u64 nsecs = ktime_get_raw_fast_ns(); 488 + 489 + if (nsecs - start >= wd->timeout_ns) { 490 + watchdog_set_result(wd, WD_CPU_TIMEOUT); 491 + return false; 492 + } 493 + cnt = 0; 494 + } 495 + cpu_relax(); 496 + } 497 + return seq < WATCHDOG_REMOTE_MAX_SEQ; 498 + } 499 + 500 + static void watchdog_check_skew(struct watchdog_cpu_data *wd, int index) 501 + { 502 + u64 prev, now, delta, start = ktime_get_raw_fast_ns(); 503 + int local = index, remote = (index + 1) & 0x1; 504 + struct clocksource *cs = wd->cs; 505 + 506 + /* Set the local timestamp so that the first iteration works correctly */ 507 + wd->cpu_ts[local] = cs->read(cs); 508 + 509 + /* Signal arrival */ 510 + atomic_inc(&wd->seq); 511 + 512 + for (int seq = local + 2; seq < WATCHDOG_REMOTE_MAX_SEQ; seq += 2) { 513 + if (!watchdog_wait_seq(wd, start, seq)) 514 + return; 515 + 516 + /* Capture local timestamp before possible non-local coherency overhead */ 517 + now = cs->read(cs); 518 + 519 + /* Store local timestamp before reading remote to limit coherency stalls */ 520 + wd->cpu_ts[local] = now; 521 + 522 + prev = wd->cpu_ts[remote]; 523 + delta = (now - prev) & cs->mask; 524 + 525 + if (delta > cs->max_raw_delta) { 526 + watchdog_set_result(wd, WD_CPU_SKEWED); 527 + return; 528 + } 529 + 530 + /* Hand over to the remote CPU */ 531 + atomic_inc(&wd->seq); 532 + } 533 + } 534 + 535 + static void watchdog_check_skew_remote(void *unused) 536 + { 537 + struct watchdog_cpu_data *wd = this_cpu_ptr(&watchdog_cpu_data); 538 + 539 + atomic_inc(&wd->remote_inprogress); 540 + watchdog_check_skew(wd, 1); 541 + atomic_dec(&wd->remote_inprogress); 542 + } 543 + 544 + static inline bool wd_csd_locked(struct watchdog_cpu_data *wd) 545 + { 546 + return READ_ONCE(wd->csd.node.u_flags) & CSD_FLAG_LOCK; 547 + } 548 + 549 + /* 550 + * This is only invoked for remote CPUs. See watchdog_check_cpu_skew().
551 + */ 552 + static inline u64 wd_get_remote_timeout(unsigned int remote_cpu) 553 + { 554 + unsigned int n1, n2; 555 + u64 ns; 556 + 557 + if (nr_node_ids == 1) 558 + return WATCHDOG_DEFAULT_TIMEOUT_NS; 559 + 560 + n1 = cpu_to_node(smp_processor_id()); 561 + n2 = cpu_to_node(remote_cpu); 562 + ns = WATCHDOG_NUMA_MULTIPLIER_NS * node_distance(n1, n2); 563 + return min(ns, WATCHDOG_NUMA_MAX_TIMEOUT_NS); 564 + } 565 + 566 + static void __watchdog_check_cpu_skew(struct clocksource *cs, unsigned int cpu) 567 + { 568 + struct watchdog_cpu_data *wd; 569 + 570 + wd = per_cpu_ptr(&watchdog_cpu_data, cpu); 571 + if (atomic_read(&wd->remote_inprogress) || wd_csd_locked(wd)) { 572 + watchdog_data.result = WD_CPU_TIMEOUT; 573 + return; 574 + } 575 + 576 + atomic_set(&wd->seq, 0); 577 + wd->result = WD_SUCCESS; 578 + wd->cs = cs; 579 + /* Store the current CPU ID for the watchdog test unit */ 580 + cs->wd_cpu = smp_processor_id(); 581 + 582 + wd->timeout_ns = wd_get_remote_timeout(cpu); 583 + 584 + /* Kick the remote CPU into the watchdog function */ 585 + if (WARN_ON_ONCE(smp_call_function_single_async(cpu, &wd->csd))) { 586 + watchdog_data.result = WD_CPU_TIMEOUT; 587 + return; 588 + } 589 + 590 + scoped_guard(irq) 591 + watchdog_check_skew(wd, 0); 592 + 593 + scoped_guard(raw_spinlock_irq, &watchdog_data.lock) { 594 + watchdog_data.result = wd->result; 595 + memcpy(watchdog_data.cpu_ts, wd->cpu_ts, sizeof(wd->cpu_ts)); 596 + } 597 + } 598 + 599 + static void watchdog_check_cpu_skew(struct clocksource *cs) 600 + { 601 + unsigned int cpu = watchdog_data.curr_cpu; 602 + 603 + cpu = cpumask_next_wrap(cpu, cpu_online_mask); 604 + watchdog_data.curr_cpu = cpu; 605 + 606 + /* Skip the current CPU. 
Handles num_online_cpus() == 1 as well */ 607 + if (cpu == smp_processor_id()) 608 + return; 609 + 610 + /* Don't interfere with the test mechanics */ 611 + if ((cs->flags & CLOCK_SOURCE_WDTEST) && !(cs->flags & CLOCK_SOURCE_WDTEST_PERCPU)) 612 + return; 613 + 614 + __watchdog_check_cpu_skew(cs, cpu); 615 + } 616 + 617 + static bool watchdog_check_freq(struct clocksource *cs, bool reset_pending) 618 + { 619 + unsigned int ppm_shift = SHIFT_4000PPM; 620 + u64 wd_ts0, wd_ts1, cs_ts; 621 + 622 + watchdog_data.result = WD_SUCCESS; 623 + if (!watchdog) { 624 + watchdog_data.result = WD_FREQ_NO_WATCHDOG; 625 + return false; 626 + } 627 + 628 + if (cs->flags & CLOCK_SOURCE_WDTEST_PERCPU) 629 + return true; 630 + 631 + /* 632 + * If both the clocksource and the watchdog claim they are 633 + * calibrated use 500ppm limit. Uncalibrated clocksources need a 634 + * larger allowance because the firmware supplied frequencies can be 635 + * way off. 636 + */ 637 + if (watchdog->flags & CLOCK_SOURCE_CALIBRATED && cs->flags & CLOCK_SOURCE_CALIBRATED) 638 + ppm_shift = SHIFT_500PPM; 639 + 640 + for (int retries = 0; retries < WATCHDOG_FREQ_RETRIES; retries++) { 641 + s64 wd_last, cs_last, wd_seq, wd_delta, cs_delta, max_delta; 642 + 643 + scoped_guard(irq) { 644 + wd_ts0 = watchdog->read(watchdog); 645 + cs_ts = cs->read(cs); 646 + wd_ts1 = watchdog->read(watchdog); 647 + } 648 + 649 + wd_last = cs->wd_last; 650 + cs_last = cs->cs_last; 651 + 652 + /* Validate the watchdog readout window */ 653 + wd_seq = cycles_to_nsec_safe(watchdog, wd_ts0, wd_ts1); 654 + if (wd_seq > WATCHDOG_READOUT_MAX_NS) { 655 + /* Store for printout in case all retries fail */ 656 + watchdog_data.wd_seq = wd_seq; 657 + continue; 658 + } 659 + 660 + /* Store for subsequent processing */ 661 + cs->wd_last = wd_ts0; 662 + cs->cs_last = cs_ts; 663 + 664 + /* First round or reset pending?
*/ 665 + if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) || reset_pending) 666 + goto reset; 667 + 668 + /* Calculate the nanosecond deltas from the last invocation */ 669 + wd_delta = cycles_to_nsec_safe(watchdog, wd_last, wd_ts0); 670 + cs_delta = cycles_to_nsec_safe(cs, cs_last, cs_ts); 671 + 672 + watchdog_data.wd_delta = wd_delta; 673 + watchdog_data.cs_delta = cs_delta; 674 + 675 + /* 676 + * Ensure that the deltas are within the readout limits of 677 + * the clocksource and the watchdog. Long delays can cause 678 + * clocksources to overflow. 679 + */ 680 + max_delta = max(wd_delta, cs_delta); 681 + if (max_delta > cs->max_idle_ns || max_delta > watchdog->max_idle_ns) 682 + goto reset; 683 + 684 + /* 685 + * Calculate and validate the skew against the allowed PPM 686 + * value of the maximum delta plus the watchdog readout 687 + * time. 688 + */ 689 + if (abs(wd_delta - cs_delta) < (max_delta >> ppm_shift) + wd_seq) 690 + return true; 691 + 692 + watchdog_data.result = WD_FREQ_SKEWED; 693 + return false; 694 + } 695 + 696 + watchdog_data.result = WD_FREQ_TIMEOUT; 697 + return false; 698 + 699 + reset: 700 + cs->flags |= CLOCK_SOURCE_WATCHDOG; 701 + watchdog_data.result = WD_FREQ_RESET; 702 + return false; 703 + } 704 + 705 + /* Synchronization for sched clock */ 706 + static void clocksource_tick_stable(struct clocksource *cs) 707 + { 708 + if (cs == curr_clocksource && cs->tick_stable) 709 + cs->tick_stable(cs); 710 + } 711 + 712 + /* Conditionally enable high resolution mode */ 713 + static void clocksource_enable_highres(struct clocksource *cs) 714 + { 715 + if ((cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) || 716 + !(cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) || 717 + !watchdog || !(watchdog->flags & CLOCK_SOURCE_IS_CONTINUOUS)) 718 + return; 719 + 720 + /* Mark it valid for high-res. */ 721 + cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES; 722 + 723 + /* 724 + * Can't schedule work before finished_booting is 725 + * true. clocksource_done_booting will take care of it.
726 + */ 727 + if (!finished_booting) 728 + return; 729 + 730 + if (cs->flags & CLOCK_SOURCE_WDTEST) 731 + return; 732 + 733 + /* 734 + * If this is not the current clocksource let the watchdog thread 735 + * reselect it. Due to the change to high res this clocksource 736 + * might be preferred now. If it is the current clocksource let the 737 + * tick code know about that change. 738 + */ 739 + if (cs != curr_clocksource) { 740 + cs->flags |= CLOCK_SOURCE_RESELECT; 741 + schedule_work(&watchdog_work); 742 + } else { 743 + tick_clock_notify(); 744 + } 745 + } 746 + 747 + static DEFINE_RATELIMIT_STATE(ratelimit_state, 5 * HZ, 2); 748 + 749 + static void watchdog_print_freq_timeout(struct clocksource *cs) 750 + { 751 + if (!__ratelimit(&ratelimit_state)) 752 + return; 753 + pr_info("Watchdog %s read timed out. Readout sequence took: %lluns\n", 754 + watchdog->name, watchdog_data.wd_seq); 755 + } 756 + 757 + static void watchdog_print_freq_skew(struct clocksource *cs) 758 + { 759 + pr_warn("Marking clocksource %s unstable due to frequency skew\n", cs->name); 760 + pr_warn("Watchdog %20s interval: %16lluns\n", watchdog->name, watchdog_data.wd_delta); 761 + pr_warn("Clocksource %20s interval: %16lluns\n", cs->name, watchdog_data.cs_delta); 762 + } 763 + 764 + static void watchdog_handle_remote_timeout(struct clocksource *cs) 765 + { 766 + pr_info_once("Watchdog remote CPU %u read timed out\n", watchdog_data.curr_cpu); 767 + } 768 + 769 + static void watchdog_print_remote_skew(struct clocksource *cs) 770 + { 771 + pr_warn("Marking clocksource %s unstable due to inter CPU skew\n", cs->name); 772 + if (watchdog_data.cpu_ts[0] < watchdog_data.cpu_ts[1]) { 773 + pr_warn("CPU%u %16llu < CPU%u %16llu (cycles)\n", smp_processor_id(), 774 + watchdog_data.cpu_ts[0], watchdog_data.curr_cpu, watchdog_data.cpu_ts[1]); 775 + } else { 776 + pr_warn("CPU%u %16llu < CPU%u %16llu (cycles)\n", watchdog_data.curr_cpu, 777 + watchdog_data.cpu_ts[1], smp_processor_id(), 
watchdog_data.cpu_ts[0]); 778 + } 779 + } 780 + 781 + static void watchdog_check_result(struct clocksource *cs) 782 + { 783 + switch (watchdog_data.result) { 784 + case WD_SUCCESS: 785 + clocksource_tick_stable(cs); 786 + clocksource_enable_highres(cs); 787 + return; 788 + 789 + case WD_FREQ_TIMEOUT: 790 + watchdog_print_freq_timeout(cs); 791 + /* Try again later and invalidate the reference timestamps. */ 792 + cs->flags &= ~CLOCK_SOURCE_WATCHDOG; 793 + return; 794 + 795 + case WD_FREQ_NO_WATCHDOG: 796 + case WD_FREQ_RESET: 797 + /* 798 + * Nothing to do when the reference timestamps were reset 799 + * or no watchdog clocksource registered. 800 + */ 801 + return; 802 + 803 + case WD_FREQ_SKEWED: 804 + watchdog_print_freq_skew(cs); 805 + break; 806 + 807 + case WD_CPU_TIMEOUT: 808 + /* Remote check timed out. Try again next cycle. */ 809 + watchdog_handle_remote_timeout(cs); 810 + return; 811 + 812 + case WD_CPU_SKEWED: 813 + watchdog_print_remote_skew(cs); 814 + break; 815 + } 816 + __clocksource_unstable(cs); 817 + } 247 818 248 819 static void clocksource_watchdog(struct timer_list *unused) 249 820 { 250 - int64_t wd_nsec, cs_nsec, interval; 251 - u64 csnow, wdnow, cslast, wdlast; 252 - int next_cpu, reset_pending; 253 821 struct clocksource *cs; 254 - enum wd_read_status read_ret; 255 - unsigned long extra_wait = 0; 256 - u32 md; 822 + bool reset_pending; 257 823 258 - spin_lock(&watchdog_lock); 824 + guard(spinlock)(&watchdog_lock); 259 825 if (!watchdog_running) 260 - goto out; 826 + return; 261 827 262 828 reset_pending = atomic_read(&watchdog_reset_pending); 263 829 264 830 list_for_each_entry(cs, &watchdog_list, wd_list) { 265 - 266 831 /* Clocksource already marked unstable? */ 267 832 if (cs->flags & CLOCK_SOURCE_UNSTABLE) { 268 833 if (finished_booting) ··· 659 446 continue; 660 447 } 661 448 662 - read_ret = cs_watchdog_read(cs, &csnow, &wdnow); 663 - 664 - if (read_ret == WD_READ_UNSTABLE) { 665 - /* Clock readout unreliable, so give it up. 
*/ 666 - __clocksource_unstable(cs); 667 - continue; 449 + /* Compare against watchdog clocksource if available */ 450 + if (watchdog_check_freq(cs, reset_pending)) { 451 + /* Check for inter CPU skew */ 452 + watchdog_check_cpu_skew(cs); 668 453 } 669 454 670 - /* 671 - * When WD_READ_SKIP is returned, it means the system is likely 672 - * under very heavy load, where the latency of reading 673 - * watchdog/clocksource is very big, and affect the accuracy of 674 - * watchdog check. So give system some space and suspend the 675 - * watchdog check for 5 minutes. 676 - */ 677 - if (read_ret == WD_READ_SKIP) { 678 - /* 679 - * As the watchdog timer will be suspended, and 680 - * cs->last could keep unchanged for 5 minutes, reset 681 - * the counters. 682 - */ 683 - clocksource_reset_watchdog(); 684 - extra_wait = HZ * 300; 685 - break; 686 - } 687 - 688 - /* Clocksource initialized ? */ 689 - if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) || 690 - atomic_read(&watchdog_reset_pending)) { 691 - cs->flags |= CLOCK_SOURCE_WATCHDOG; 692 - cs->wd_last = wdnow; 693 - cs->cs_last = csnow; 694 - continue; 695 - } 696 - 697 - wd_nsec = cycles_to_nsec_safe(watchdog, cs->wd_last, wdnow); 698 - cs_nsec = cycles_to_nsec_safe(cs, cs->cs_last, csnow); 699 - wdlast = cs->wd_last; /* save these in case we print them */ 700 - cslast = cs->cs_last; 701 - cs->cs_last = csnow; 702 - cs->wd_last = wdnow; 703 - 704 - if (atomic_read(&watchdog_reset_pending)) 705 - continue; 706 - 707 - /* 708 - * The processing of timer softirqs can get delayed (usually 709 - * on account of ksoftirqd not getting to run in a timely 710 - * manner), which causes the watchdog interval to stretch. 711 - * Skew detection may fail for longer watchdog intervals 712 - * on account of fixed margins being used. 713 - * Some clocksources, e.g. acpi_pm, cannot tolerate 714 - * watchdog intervals longer than a few seconds. 
715 - */ 716 - interval = max(cs_nsec, wd_nsec); 717 - if (unlikely(interval > WATCHDOG_INTERVAL_MAX_NS)) { 718 - if (system_state > SYSTEM_SCHEDULING && 719 - interval > 2 * watchdog_max_interval) { 720 - watchdog_max_interval = interval; 721 - pr_warn("Long readout interval, skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n", 722 - cs_nsec, wd_nsec); 723 - } 724 - watchdog_timer.expires = jiffies; 725 - continue; 726 - } 727 - 728 - /* Check the deviation from the watchdog clocksource. */ 729 - md = cs->uncertainty_margin + watchdog->uncertainty_margin; 730 - if (abs(cs_nsec - wd_nsec) > md) { 731 - s64 cs_wd_msec; 732 - s64 wd_msec; 733 - u32 wd_rem; 734 - 735 - pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n", 736 - smp_processor_id(), cs->name); 737 - pr_warn(" '%s' wd_nsec: %lld wd_now: %llx wd_last: %llx mask: %llx\n", 738 - watchdog->name, wd_nsec, wdnow, wdlast, watchdog->mask); 739 - pr_warn(" '%s' cs_nsec: %lld cs_now: %llx cs_last: %llx mask: %llx\n", 740 - cs->name, cs_nsec, csnow, cslast, cs->mask); 741 - cs_wd_msec = div_s64_rem(cs_nsec - wd_nsec, 1000 * 1000, &wd_rem); 742 - wd_msec = div_s64_rem(wd_nsec, 1000 * 1000, &wd_rem); 743 - pr_warn(" Clocksource '%s' skewed %lld ns (%lld ms) over watchdog '%s' interval of %lld ns (%lld ms)\n", 744 - cs->name, cs_nsec - wd_nsec, cs_wd_msec, watchdog->name, wd_nsec, wd_msec); 745 - if (curr_clocksource == cs) 746 - pr_warn(" '%s' is current clocksource.\n", cs->name); 747 - else if (curr_clocksource) 748 - pr_warn(" '%s' (not '%s') is current clocksource.\n", curr_clocksource->name, cs->name); 749 - else 750 - pr_warn(" No current clocksource.\n"); 751 - __clocksource_unstable(cs); 752 - continue; 753 - } 754 - 755 - if (cs == curr_clocksource && cs->tick_stable) 756 - cs->tick_stable(cs); 757 - 758 - if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) && 759 - (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) && 760 - (watchdog->flags & 
CLOCK_SOURCE_IS_CONTINUOUS)) { 761 - /* Mark it valid for high-res. */ 762 - cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES; 763 - 764 - /* 765 - * clocksource_done_booting() will sort it if 766 - * finished_booting is not set yet. 767 - */ 768 - if (!finished_booting) 769 - continue; 770 - 771 - /* 772 - * If this is not the current clocksource let 773 - * the watchdog thread reselect it. Due to the 774 - * change to high res this clocksource might 775 - * be preferred now. If it is the current 776 - * clocksource let the tick code know about 777 - * that change. 778 - */ 779 - if (cs != curr_clocksource) { 780 - cs->flags |= CLOCK_SOURCE_RESELECT; 781 - schedule_work(&watchdog_work); 782 - } else { 783 - tick_clock_notify(); 784 - } 785 - } 455 + watchdog_check_result(cs); 786 456 } 787 457 788 - /* 789 - * We only clear the watchdog_reset_pending, when we did a 790 - * full cycle through all clocksources. 791 - */ 458 + /* Clear after the full clocksource walk */ 792 459 if (reset_pending) 793 460 atomic_dec(&watchdog_reset_pending); 794 461 795 - /* 796 - * Cycle through CPUs to check if the CPUs stay synchronized 797 - * to each other. 798 - */ 799 - next_cpu = cpumask_next_wrap(raw_smp_processor_id(), cpu_online_mask); 800 - 801 - /* 802 - * Arm timer if not already pending: could race with concurrent 803 - * pair clocksource_stop_watchdog() clocksource_start_watchdog(). 
804 - */ 462 + /* Could have been rearmed by a stop/start cycle */ 805 463 if (!timer_pending(&watchdog_timer)) { 806 - watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait; 807 - add_timer_on(&watchdog_timer, next_cpu); 464 + watchdog_timer.expires += WATCHDOG_INTERVAL; 465 + add_timer_local(&watchdog_timer); 808 466 } 809 - out: 810 - spin_unlock(&watchdog_lock); 811 467 } 812 468 813 469 static inline void clocksource_start_watchdog(void) 814 470 { 815 - if (watchdog_running || !watchdog || list_empty(&watchdog_list)) 471 + if (watchdog_running || list_empty(&watchdog_list)) 816 472 return; 817 - timer_setup(&watchdog_timer, clocksource_watchdog, 0); 473 + timer_setup(&watchdog_timer, clocksource_watchdog, TIMER_PINNED); 818 474 watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL; 819 - add_timer_on(&watchdog_timer, cpumask_first(cpu_online_mask)); 475 + 476 + add_timer_on(&watchdog_timer, get_boot_cpu_id()); 820 477 watchdog_running = 1; 821 478 } 822 479 823 480 static inline void clocksource_stop_watchdog(void) 824 481 { 825 - if (!watchdog_running || (watchdog && !list_empty(&watchdog_list))) 482 + if (!watchdog_running || !list_empty(&watchdog_list)) 826 483 return; 827 484 timer_delete(&watchdog_timer); 828 485 watchdog_running = 0; ··· 734 651 if (cs->flags & CLOCK_SOURCE_MUST_VERIFY) 735 652 continue; 736 653 654 + /* 655 + * If it's not continuous, don't put the fox in charge of 656 + * the henhouse. 657 + */ 658 + if (!(cs->flags & CLOCK_SOURCE_IS_CONTINUOUS)) 659 + continue; 660 + 737 661 /* Skip current if we were requested for a fallback. */ 738 662 if (fallback && cs == old_wd) 739 663 continue; ··· 779 689 struct clocksource *cs, *tmp; 780 690 unsigned long flags; 781 691 int select = 0; 782 - 783 - /* Do any required per-CPU skew verification. 
*/ 784 - if (curr_clocksource && 785 - curr_clocksource->flags & CLOCK_SOURCE_UNSTABLE && 786 - curr_clocksource->flags & CLOCK_SOURCE_VERIFY_PERCPU) 787 - clocksource_verify_percpu(curr_clocksource); 788 692 789 693 spin_lock_irqsave(&watchdog_lock, flags); 790 694 list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list) { ··· 1100 1016 continue; 1101 1017 if (oneshot && !(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES)) 1102 1018 continue; 1019 + if (cs->flags & CLOCK_SOURCE_WDTEST) 1020 + continue; 1103 1021 return cs; 1104 1022 } 1105 1023 return NULL; ··· 1125 1039 if (skipcur && cs == curr_clocksource) 1126 1040 continue; 1127 1041 if (strcmp(cs->name, override_name) != 0) 1042 + continue; 1043 + if (cs->flags & CLOCK_SOURCE_WDTEST) 1128 1044 continue; 1129 1045 /* 1130 1046 * Check to make sure we don't switch to a non-highres ··· 1257 1169 1258 1170 clocks_calc_mult_shift(&cs->mult, &cs->shift, freq, 1259 1171 NSEC_PER_SEC / scale, sec * scale); 1260 - } 1261 1172 1262 - /* 1263 - * If the uncertainty margin is not specified, calculate it. If 1264 - * both scale and freq are non-zero, calculate the clock period, but 1265 - * bound below at 2*WATCHDOG_MAX_SKEW, that is, 500ppm by default. 1266 - * However, if either of scale or freq is zero, be very conservative 1267 - * and take the tens-of-milliseconds WATCHDOG_THRESHOLD value 1268 - * for the uncertainty margin. Allow stupidly small uncertainty 1269 - * margins to be specified by the caller for testing purposes, 1270 - * but warn to discourage production use of this capability. 1271 - * 1272 - * Bottom line: The sum of the uncertainty margins of the 1273 - * watchdog clocksource and the clocksource under test will be at 1274 - * least 500ppm by default. For more information, please see the 1275 - * comment preceding CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US above. 
1276 - */ 1277 - if (scale && freq && !cs->uncertainty_margin) { 1278 - cs->uncertainty_margin = NSEC_PER_SEC / (scale * freq); 1279 - if (cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW) 1280 - cs->uncertainty_margin = 2 * WATCHDOG_MAX_SKEW; 1281 - } else if (!cs->uncertainty_margin) { 1282 - cs->uncertainty_margin = WATCHDOG_THRESHOLD; 1173 + /* Update cs::freq_khz */ 1174 + cs->freq_khz = div_u64((u64)freq * scale, 1000); 1283 1175 } 1284 - WARN_ON_ONCE(cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW); 1285 1176 1286 1177 /* 1287 1178 * Ensure clocksources that have large 'mult' values don't overflow ··· 1308 1241 1309 1242 if (WARN_ON_ONCE((unsigned int)cs->id >= CSID_MAX)) 1310 1243 cs->id = CSID_GENERIC; 1244 + 1245 + if (WARN_ON_ONCE(!freq && cs->flags & CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT)) 1246 + cs->flags &= ~CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT; 1247 + 1311 1248 if (cs->vdso_clock_mode < 0 || 1312 1249 cs->vdso_clock_mode >= VDSO_CLOCKMODE_MAX) { 1313 1250 pr_warn("clocksource %s registered with invalid VDSO mode %d. Disabling VDSO support.\n",
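Note on the frequency check in watchdog_check_freq() above: the skew test compares the difference of the two nanosecond deltas against a shift-derived fraction of the larger delta, plus the watchdog readout time. A minimal userspace sketch of that arithmetic follows. The helper name skew_within_limit() and the two SKETCH_SHIFT_* values are assumptions made for illustration; the diff names SHIFT_500PPM and SHIFT_4000PPM but does not show their definitions, so the shifts below are only the nearest powers of two (1/2048 is about 488 ppm, 1/256 is about 3906 ppm).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical shift values, NOT taken from the diff: chosen as the
 * powers of two closest to the named limits.
 */
#define SKETCH_SHIFT_500PPM	11	/* 1/2048 ~= 488 ppm  */
#define SKETCH_SHIFT_4000PPM	 8	/* 1/256  ~= 3906 ppm */

/*
 * Mirrors the comparison shown in the diff:
 *   abs(wd_delta - cs_delta) < (max_delta >> ppm_shift) + wd_seq
 *
 * wd_delta_ns: watchdog interval since the last check, in ns
 * cs_delta_ns: clocksource interval since the last check, in ns
 * wd_seq_ns:   duration of the watchdog/clocksource/watchdog readout
 *
 * Returns true when the skew is inside the allowed margin.
 */
static bool skew_within_limit(int64_t wd_delta_ns, int64_t cs_delta_ns,
			      int64_t wd_seq_ns, unsigned int ppm_shift)
{
	int64_t max_delta, diff;

	/* Sketch-only precondition: intervals are non-negative, shift is sane */
	assert(wd_delta_ns >= 0 && cs_delta_ns >= 0 && ppm_shift < 63);

	max_delta = wd_delta_ns > cs_delta_ns ? wd_delta_ns : cs_delta_ns;
	diff = wd_delta_ns - cs_delta_ns;
	if (diff < 0)
		diff = -diff;

	return diff < (max_delta >> ppm_shift) + wd_seq_ns;
}
```

Under these assumed shifts, a clocksource that drifts 400 us over a one second watchdog interval passes the tighter limit, while a 600 us drift only passes with the wider limit or with a sufficiently long readout window added to the margin.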
+652 -489
kernel/time/hrtimer.c
··· 50 50 #include "tick-internal.h" 51 51 52 52 /* 53 + * Constants to set the queued state of the timer (INACTIVE, ENQUEUED) 54 + * 55 + * The callback state is kept separate in the CPU base because having it in 56 + * the timer would require touching the timer after the callback, which 57 + * makes it impossible to free the timer from the callback function. 58 + * 59 + * Therefore we track the callback state in: 60 + * 61 + * timer->base->cpu_base->running == timer 62 + * 63 + * On SMP it is possible to have a "callback function running and enqueued" 64 + * status. It happens for example when a posix timer expired and the callback 65 + * queued a signal. Between dropping the lock which protects the posix timer 66 + * and reacquiring the base lock of the hrtimer, another CPU can deliver the 67 + * signal and rearm the timer. 68 + * 69 + * All state transitions are protected by cpu_base->lock. 70 + */ 71 + #define HRTIMER_STATE_INACTIVE false 72 + #define HRTIMER_STATE_ENQUEUED true 73 + 74 + /* 53 75 * The resolution of the clocks. The resolution value is returned in 54 76 * the clock_getres() system call to give application programmers an 55 77 * idea of the (in)accuracy of timers. Timer values are rounded up to ··· 99 77 * to reach a base using a clockid, hrtimer_clockid_to_base() 100 78 * is used to convert from clockid to the proper hrtimer_base_type.
101 79 */ 80 + 81 + #define BASE_INIT(idx, cid) \ 82 + [idx] = { .index = idx, .clockid = cid } 83 + 102 84 DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) = 103 85 { 104 86 .lock = __RAW_SPIN_LOCK_UNLOCKED(hrtimer_bases.lock), 105 - .clock_base = 106 - { 107 - { 108 - .index = HRTIMER_BASE_MONOTONIC, 109 - .clockid = CLOCK_MONOTONIC, 110 - }, 111 - { 112 - .index = HRTIMER_BASE_REALTIME, 113 - .clockid = CLOCK_REALTIME, 114 - }, 115 - { 116 - .index = HRTIMER_BASE_BOOTTIME, 117 - .clockid = CLOCK_BOOTTIME, 118 - }, 119 - { 120 - .index = HRTIMER_BASE_TAI, 121 - .clockid = CLOCK_TAI, 122 - }, 123 - { 124 - .index = HRTIMER_BASE_MONOTONIC_SOFT, 125 - .clockid = CLOCK_MONOTONIC, 126 - }, 127 - { 128 - .index = HRTIMER_BASE_REALTIME_SOFT, 129 - .clockid = CLOCK_REALTIME, 130 - }, 131 - { 132 - .index = HRTIMER_BASE_BOOTTIME_SOFT, 133 - .clockid = CLOCK_BOOTTIME, 134 - }, 135 - { 136 - .index = HRTIMER_BASE_TAI_SOFT, 137 - .clockid = CLOCK_TAI, 138 - }, 87 + .clock_base = { 88 + BASE_INIT(HRTIMER_BASE_MONOTONIC, CLOCK_MONOTONIC), 89 + BASE_INIT(HRTIMER_BASE_REALTIME, CLOCK_REALTIME), 90 + BASE_INIT(HRTIMER_BASE_BOOTTIME, CLOCK_BOOTTIME), 91 + BASE_INIT(HRTIMER_BASE_TAI, CLOCK_TAI), 92 + BASE_INIT(HRTIMER_BASE_MONOTONIC_SOFT, CLOCK_MONOTONIC), 93 + BASE_INIT(HRTIMER_BASE_REALTIME_SOFT, CLOCK_REALTIME), 94 + BASE_INIT(HRTIMER_BASE_BOOTTIME_SOFT, CLOCK_BOOTTIME), 95 + BASE_INIT(HRTIMER_BASE_TAI_SOFT, CLOCK_TAI), 139 96 }, 140 97 .csd = CSD_INIT(retrigger_next_event, NULL) 141 98 }; ··· 127 126 return likely(base->online); 128 127 } 129 128 129 + #ifdef CONFIG_HIGH_RES_TIMERS 130 + DEFINE_STATIC_KEY_FALSE(hrtimer_highres_enabled_key); 131 + 132 + static void hrtimer_hres_workfn(struct work_struct *work) 133 + { 134 + static_branch_enable(&hrtimer_highres_enabled_key); 135 + } 136 + 137 + static DECLARE_WORK(hrtimer_hres_work, hrtimer_hres_workfn); 138 + 139 + static inline void hrtimer_schedule_hres_work(void) 140 + { 141 + if (!hrtimer_highres_enabled()) 142 + 
schedule_work(&hrtimer_hres_work); 143 + } 144 + #else 145 + static inline void hrtimer_schedule_hres_work(void) { } 146 + #endif 147 + 130 148 /* 131 149 * Functions and macros which are different for UP/SMP systems are kept in a 132 150 * single place 133 151 */ 134 152 #ifdef CONFIG_SMP 135 - 136 153 /* 137 154 * We require the migration_base for lock_hrtimer_base()/switch_hrtimer_base() 138 155 * such that hrtimer_callback_running() can unconditionally dereference 139 156 * timer->base->cpu_base 140 157 */ 141 158 static struct hrtimer_cpu_base migration_cpu_base = { 142 - .clock_base = { { 143 - .cpu_base = &migration_cpu_base, 144 - .seq = SEQCNT_RAW_SPINLOCK_ZERO(migration_cpu_base.seq, 145 - &migration_cpu_base.lock), 146 - }, }, 159 + .clock_base = { 160 + [0] = { 161 + .cpu_base = &migration_cpu_base, 162 + .seq = SEQCNT_RAW_SPINLOCK_ZERO(migration_cpu_base.seq, 163 + &migration_cpu_base.lock), 164 + }, 165 + }, 147 166 }; 148 167 149 168 #define migration_base migration_cpu_base.clock_base[0] ··· 180 159 * possible to set timer->base = &migration_base and drop the lock: the timer 181 160 * remains locked. 
182 161 */ 183 - static 184 - struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer, 185 - unsigned long *flags) 162 + static struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer, 163 + unsigned long *flags) 186 164 __acquires(&timer->base->lock) 187 165 { 188 - struct hrtimer_clock_base *base; 189 - 190 166 for (;;) { 191 - base = READ_ONCE(timer->base); 167 + struct hrtimer_clock_base *base = READ_ONCE(timer->base); 168 + 192 169 if (likely(base != &migration_base)) { 193 170 raw_spin_lock_irqsave(&base->cpu_base->lock, *flags); 194 171 if (likely(base == timer->base)) ··· 239 220 return expires >= new_base->cpu_base->expires_next; 240 221 } 241 222 242 - static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base, int pinned) 223 + static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base, bool pinned) 243 224 { 244 225 if (!hrtimer_base_is_online(base)) { 245 226 int cpu = cpumask_any_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TIMER)); ··· 267 248 * the timer callback is currently running. 268 249 */ 269 250 static inline struct hrtimer_clock_base * 270 - switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base, 271 - int pinned) 251 + switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base, bool pinned) 272 252 { 273 253 struct hrtimer_cpu_base *new_cpu_base, *this_cpu_base; 274 254 struct hrtimer_clock_base *new_base; ··· 280 262 281 263 if (base != new_base) { 282 264 /* 283 - * We are trying to move timer to new_base. 284 - * However we can't change timer's base while it is running, 285 - * so we keep it on the same CPU. No hassle vs. reprogramming 286 - * the event source in the high resolution case. The softirq 287 - * code will take care of this when the timer function has 288 - * completed. There is no conflict as we hold the lock until 289 - * the timer is enqueued. 265 + * We are trying to move timer to new_base. 
However we can't 266 + * change timer's base while it is running, so we keep it on 267 + * the same CPU. No hassle vs. reprogramming the event source 268 + * in the high resolution case. The remote CPU will take care 269 + * of this when the timer function has completed. There is no 270 + * conflict as we hold the lock until the timer is enqueued. 290 271 */ 291 272 if (unlikely(hrtimer_callback_running(timer))) 292 273 return base; ··· 295 278 raw_spin_unlock(&base->cpu_base->lock); 296 279 raw_spin_lock(&new_base->cpu_base->lock); 297 280 298 - if (!hrtimer_suitable_target(timer, new_base, new_cpu_base, 299 - this_cpu_base)) { 281 + if (!hrtimer_suitable_target(timer, new_base, new_cpu_base, this_cpu_base)) { 300 282 raw_spin_unlock(&new_base->cpu_base->lock); 301 283 raw_spin_lock(&base->cpu_base->lock); 302 284 new_cpu_base = this_cpu_base; ··· 314 298 315 299 #else /* CONFIG_SMP */ 316 300 317 - static inline struct hrtimer_clock_base * 318 - lock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags) 301 + static inline struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer, 302 + unsigned long *flags) 319 303 __acquires(&timer->base->cpu_base->lock) 320 304 { 321 305 struct hrtimer_clock_base *base = timer->base; 322 306 323 307 raw_spin_lock_irqsave(&base->cpu_base->lock, *flags); 324 - 325 308 return base; 326 309 } 327 310 ··· 355 340 return dclc < 0 ? -tmp : tmp; 356 341 } 357 342 EXPORT_SYMBOL_GPL(__ktime_divns); 358 - #endif /* BITS_PER_LONG >= 64 */ 343 + #endif /* BITS_PER_LONG < 64 */ 359 344 360 345 /* 361 346 * Add two ktime values and do a safety check for overflow: ··· 437 422 } 438 423 } 439 424 425 + /* Stub timer callback for improperly used timers. 
*/ 426 + static enum hrtimer_restart stub_timer(struct hrtimer *unused) 427 + { 428 + WARN_ON_ONCE(1); 429 + return HRTIMER_NORESTART; 430 + } 431 + 432 + /* 433 + * hrtimer_fixup_assert_init is called when: 434 + * - an untracked/uninit-ed object is found 435 + */ 436 + static bool hrtimer_fixup_assert_init(void *addr, enum debug_obj_state state) 437 + { 438 + struct hrtimer *timer = addr; 439 + 440 + switch (state) { 441 + case ODEBUG_STATE_NOTAVAILABLE: 442 + hrtimer_setup(timer, stub_timer, CLOCK_MONOTONIC, 0); 443 + return true; 444 + default: 445 + return false; 446 + } 447 + } 448 + 440 449 static const struct debug_obj_descr hrtimer_debug_descr = { 441 - .name = "hrtimer", 442 - .debug_hint = hrtimer_debug_hint, 443 - .fixup_init = hrtimer_fixup_init, 444 - .fixup_activate = hrtimer_fixup_activate, 445 - .fixup_free = hrtimer_fixup_free, 450 + .name = "hrtimer", 451 + .debug_hint = hrtimer_debug_hint, 452 + .fixup_init = hrtimer_fixup_init, 453 + .fixup_activate = hrtimer_fixup_activate, 454 + .fixup_free = hrtimer_fixup_free, 455 + .fixup_assert_init = hrtimer_fixup_assert_init, 446 456 }; 447 457 448 458 static inline void debug_hrtimer_init(struct hrtimer *timer) ··· 480 440 debug_object_init_on_stack(timer, &hrtimer_debug_descr); 481 441 } 482 442 483 - static inline void debug_hrtimer_activate(struct hrtimer *timer, 484 - enum hrtimer_mode mode) 443 + static inline void debug_hrtimer_activate(struct hrtimer *timer, enum hrtimer_mode mode) 485 444 { 486 445 debug_object_activate(timer, &hrtimer_debug_descr); 487 446 } ··· 488 449 static inline void debug_hrtimer_deactivate(struct hrtimer *timer) 489 450 { 490 451 debug_object_deactivate(timer, &hrtimer_debug_descr); 452 + } 453 + 454 + static inline void debug_hrtimer_assert_init(struct hrtimer *timer) 455 + { 456 + debug_object_assert_init(timer, &hrtimer_debug_descr); 491 457 } 492 458 493 459 void destroy_hrtimer_on_stack(struct hrtimer *timer) ··· 505 461 506 462 static inline void 
debug_hrtimer_init(struct hrtimer *timer) { }
 static inline void debug_hrtimer_init_on_stack(struct hrtimer *timer) { }
-static inline void debug_hrtimer_activate(struct hrtimer *timer,
-					  enum hrtimer_mode mode) { }
+static inline void debug_hrtimer_activate(struct hrtimer *timer, enum hrtimer_mode mode) { }
 static inline void debug_hrtimer_deactivate(struct hrtimer *timer) { }
+static inline void debug_hrtimer_assert_init(struct hrtimer *timer) { }
 #endif

 static inline void debug_setup(struct hrtimer *timer, clockid_t clockid, enum hrtimer_mode mode)
···
 	trace_hrtimer_setup(timer, clockid, mode);
 }

-static inline void debug_activate(struct hrtimer *timer,
-				  enum hrtimer_mode mode)
+static inline void debug_activate(struct hrtimer *timer, enum hrtimer_mode mode, bool was_armed)
 {
 	debug_hrtimer_activate(timer, mode);
-	trace_hrtimer_start(timer, mode);
+	trace_hrtimer_start(timer, mode, was_armed);
 }

-static inline void debug_deactivate(struct hrtimer *timer)
+#define for_each_active_base(base, cpu_base, active)					\
+	for (unsigned int idx = ffs(active); idx--; idx = ffs((active)))		\
+		for (bool done = false; !done; active &= ~(1U << idx))			\
+			for (base = &cpu_base->clock_base[idx]; !done; done = true)
+
+#define hrtimer_from_timerqueue_node(_n) container_of_const(_n, struct hrtimer, node)
+
+#if defined(CONFIG_NO_HZ_COMMON)
+/*
+ * Same as hrtimer_bases_next_event() below, but skips the excluded timer and
+ * does not update cpu_base->next_timer/expires.
+ */
+static ktime_t hrtimer_bases_next_event_without(struct hrtimer_cpu_base *cpu_base,
+						const struct hrtimer *exclude,
+						unsigned int active, ktime_t expires_next)
 {
-	debug_hrtimer_deactivate(timer);
-	trace_hrtimer_cancel(timer);
-}
+	struct hrtimer_clock_base *base;
+	ktime_t expires;

-static struct hrtimer_clock_base *
-__next_base(struct hrtimer_cpu_base *cpu_base, unsigned int *active)
+	lockdep_assert_held(&cpu_base->lock);
+
+	for_each_active_base(base, cpu_base, active) {
+		expires = ktime_sub(base->expires_next, base->offset);
+		if (expires >= expires_next)
+			continue;
+
+		/*
+		 * If the excluded timer is the first on this base evaluate the
+		 * next timer.
+		 */
+		struct timerqueue_linked_node *node = timerqueue_linked_first(&base->active);
+
+		if (unlikely(&exclude->node == node)) {
+			node = timerqueue_linked_next(node);
+			if (!node)
+				continue;
+			expires = ktime_sub(node->expires, base->offset);
+			if (expires >= expires_next)
+				continue;
+		}
+		expires_next = expires;
+	}
+	/* If base->offset changed, the result might be negative */
+	return max(expires_next, 0);
+}
+#endif
+
+static __always_inline struct hrtimer *clock_base_next_timer(struct hrtimer_clock_base *base)
 {
-	unsigned int idx;
+	struct timerqueue_linked_node *next = timerqueue_linked_first(&base->active);

-	if (!*active)
-		return NULL;
-
-	idx = __ffs(*active);
-	*active &= ~(1U << idx);
-
-	return &cpu_base->clock_base[idx];
+	return hrtimer_from_timerqueue_node(next);
 }

-#define for_each_active_base(base, cpu_base, active)	\
-	while ((base = __next_base((cpu_base), &(active))))
-
-static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
-					 const struct hrtimer *exclude,
-					 unsigned int active,
-					 ktime_t expires_next)
+/* Find the base with the earliest expiry */
+static void hrtimer_bases_first(struct hrtimer_cpu_base *cpu_base,unsigned int active,
+				ktime_t *expires_next, struct hrtimer **next_timer)
 {
 	struct hrtimer_clock_base *base;
 	ktime_t expires;

 	for_each_active_base(base, cpu_base, active) {
-		struct timerqueue_node *next;
-		struct hrtimer *timer;
-
-		next = timerqueue_getnext(&base->active);
-		timer = container_of(next, struct hrtimer, node);
-		if (timer == exclude) {
-			/* Get to the next timer in the queue. */
-			next = timerqueue_iterate_next(next);
-			if (!next)
-				continue;
-
-			timer = container_of(next, struct hrtimer, node);
-		}
-		expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
-		if (expires < expires_next) {
-			expires_next = expires;
-
-			/* Skip cpu_base update if a timer is being excluded. */
-			if (exclude)
-				continue;
-
-			if (timer->is_soft)
-				cpu_base->softirq_next_timer = timer;
-			else
-				cpu_base->next_timer = timer;
+		expires = ktime_sub(base->expires_next, base->offset);
+		if (expires < *expires_next) {
+			*expires_next = expires;
+			*next_timer = clock_base_next_timer(base);
 		}
 	}
-	/*
-	 * clock_was_set() might have changed base->offset of any of
-	 * the clock bases so the result might be negative. Fix it up
-	 * to prevent a false positive in clockevents_program_event().
-	 */
-	if (expires_next < 0)
-		expires_next = 0;
-	return expires_next;
 }

 /*
···
  *  - HRTIMER_ACTIVE_SOFT, or
  *  - HRTIMER_ACTIVE_HARD.
  */
-static ktime_t
-__hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_mask)
+static ktime_t __hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_mask)
 {
-	unsigned int active;
 	struct hrtimer *next_timer = NULL;
 	ktime_t expires_next = KTIME_MAX;
+	unsigned int active;
+
+	lockdep_assert_held(&cpu_base->lock);

 	if (!cpu_base->softirq_activated && (active_mask & HRTIMER_ACTIVE_SOFT)) {
 		active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
-		cpu_base->softirq_next_timer = NULL;
-		expires_next = __hrtimer_next_event_base(cpu_base, NULL,
-							 active, KTIME_MAX);
-
-		next_timer = cpu_base->softirq_next_timer;
+		if (active)
+			hrtimer_bases_first(cpu_base, active, &expires_next, &next_timer);
+		cpu_base->softirq_next_timer = next_timer;
 	}

 	if (active_mask & HRTIMER_ACTIVE_HARD) {
 		active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
+		if (active)
+			hrtimer_bases_first(cpu_base, active, &expires_next, &next_timer);
 		cpu_base->next_timer = next_timer;
-		expires_next = __hrtimer_next_event_base(cpu_base, NULL, active,
-							 expires_next);
 	}
-
-	return expires_next;
+	return max(expires_next, 0);
 }

 static ktime_t hrtimer_update_next_event(struct hrtimer_cpu_base *cpu_base)
···
 	ktime_t *offs_boot = &base->clock_base[HRTIMER_BASE_BOOTTIME].offset;
 	ktime_t *offs_tai = &base->clock_base[HRTIMER_BASE_TAI].offset;

-	ktime_t now = ktime_get_update_offsets_now(&base->clock_was_set_seq,
-					    offs_real, offs_boot, offs_tai);
+	ktime_t now = ktime_get_update_offsets_now(&base->clock_was_set_seq, offs_real,
+						   offs_boot, offs_tai);

 	base->clock_base[HRTIMER_BASE_REALTIME_SOFT].offset = *offs_real;
 	base->clock_base[HRTIMER_BASE_BOOTTIME_SOFT].offset = *offs_boot;
···
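Reviewer note on the hunks above: the scan performed by `hrtimer_bases_first()` and `__hrtimer_get_next_event()` can be tried outside the kernel. What follows is a minimal userspace sketch, not code from this commit. The names `demo_clock_base`, `demo_next_event()` and `DEMO_TIME_MAX` are hypothetical, `int64_t` stands in for `ktime_t`, and POSIX `ffs()` is used for the lowest-set-bit lookup. It assumes, as the reworked code does with `base->expires_next`, that every active base caches the expiry of its first timer, so the scan never touches the timer queue itself.

```c
#include <stdint.h>
#include <strings.h>	/* POSIX ffs() */

#define DEMO_TIME_MAX	INT64_MAX

/* Hypothetical stand-in for a clock base: cached first expiry plus offset. */
struct demo_clock_base {
	int64_t expires_next;	/* expiry of the first queued timer, in base time */
	int64_t offset;		/* offset of this clock against the monotonic clock */
};

/*
 * Return the earliest expiry in monotonic time over all bases whose bit is
 * set in @active, or DEMO_TIME_MAX when no base is active.
 */
static int64_t demo_next_event(const struct demo_clock_base *bases, unsigned int active)
{
	int64_t expires_next = DEMO_TIME_MAX;

	while (active) {
		/* ffs() is 1-based, so subtract one to get the bit index */
		unsigned int idx = (unsigned int)ffs((int)active) - 1;
		int64_t expires = bases[idx].expires_next - bases[idx].offset;

		active &= ~(1U << idx);
		if (expires < expires_next)
			expires_next = expires;
	}
	/* A changed offset can push the result below zero; clamp it. */
	return expires_next < 0 ? 0 : expires_next;
}
```

The clamp mirrors the `max(expires_next, 0)` return in the diff; everything else about the kernel version (locking, the soft/hard split, tracking `next_timer`) is deliberately left out of the sketch.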
 }

 /*
- * Is the high resolution mode active ?
+ * Is the high resolution mode active in the CPU base. This cannot use the
+ * static key as the CPUs are switched to high resolution mode
+ * asynchronously.
  */
 static inline int hrtimer_hres_active(struct hrtimer_cpu_base *cpu_base)
 {
···
 		cpu_base->hres_active : 0;
 }

-static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base,
-				struct hrtimer *next_timer,
+static inline void hrtimer_rearm_event(ktime_t expires_next, bool deferred)
+{
+	trace_hrtimer_rearm(expires_next, deferred);
+	tick_program_event(expires_next, 1);
+}
+
+static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base, struct hrtimer *next_timer,
 				ktime_t expires_next)
 {
 	cpu_base->expires_next = expires_next;
···
 	if (!hrtimer_hres_active(cpu_base) || cpu_base->hang_detected)
 		return;

-	tick_program_event(expires_next, 1);
+	hrtimer_rearm_event(expires_next, false);
 }

-/*
- * Reprogram the event source with checking both queues for the
- * next event
- * Called with interrupts disabled and base->lock held
- */
-static void
-hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
+/* Reprogram the event source with a evaluation of all clock bases */
+static void hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, bool skip_equal)
 {
-	ktime_t expires_next;
-
-	expires_next = hrtimer_update_next_event(cpu_base);
+	ktime_t expires_next = hrtimer_update_next_event(cpu_base);

 	if (skip_equal && expires_next == cpu_base->expires_next)
 		return;
···
 /* High resolution timer related functions */
 #ifdef CONFIG_HIGH_RES_TIMERS

-/*
- * High resolution timer enabled ?
- */
+/* High resolution timer enabled ? */
 static bool hrtimer_hres_enabled __read_mostly  = true;
 unsigned int hrtimer_resolution __read_mostly = LOW_RES_NSEC;
 EXPORT_SYMBOL_GPL(hrtimer_resolution);

-/*
- * Enable / Disable high resolution mode
- */
+/* Enable / Disable high resolution mode */
 static int __init setup_hrtimer_hres(char *str)
 {
 	return (kstrtobool(str, &hrtimer_hres_enabled) == 0);
 }
-
 __setup("highres=", setup_hrtimer_hres);

-/*
- * hrtimer_high_res_enabled - query, if the highres mode is enabled
- */
-static inline int hrtimer_is_hres_enabled(void)
+/* hrtimer_high_res_enabled - query, if the highres mode is enabled */
+static inline bool hrtimer_is_hres_enabled(void)
 {
 	return hrtimer_hres_enabled;
 }

-/*
- * Switch to high resolution mode
- */
+/* Switch to high resolution mode */
 static void hrtimer_switch_to_hres(void)
 {
 	struct hrtimer_cpu_base *base = this_cpu_ptr(&hrtimer_bases);

 	if (tick_init_highres()) {
-		pr_warn("Could not switch to high resolution mode on CPU %u\n",
-			base->cpu);
+		pr_warn("Could not switch to high resolution mode on CPU %u\n", base->cpu);
 		return;
 	}
-	base->hres_active = 1;
+	base->hres_active = true;
 	hrtimer_resolution = HIGH_RES_NSEC;

 	tick_setup_sched_timer(true);
 	/* "Retrigger" the interrupt to get things going */
 	retrigger_next_event(NULL);
+	hrtimer_schedule_hres_work();
 }

 #else

-static inline int hrtimer_is_hres_enabled(void) { return 0; }
+static inline bool hrtimer_is_hres_enabled(void) { return 0; }
 static inline void hrtimer_switch_to_hres(void) { }

 #endif /* CONFIG_HIGH_RES_TIMERS */
+
 /*
  * Retrigger next event is called
after clock was set with interrupts
  * disabled through an SMP function call or directly from low level
···
 	 * In periodic low resolution mode, the next softirq expiration
 	 * must also be updated.
 	 */
-	raw_spin_lock(&base->lock);
+	guard(raw_spinlock)(&base->lock);
 	hrtimer_update_base(base);
 	if (hrtimer_hres_active(base))
-		hrtimer_force_reprogram(base, 0);
+		hrtimer_force_reprogram(base, /* skip_equal */ false);
 	else
 		hrtimer_update_next_event(base);
-	raw_spin_unlock(&base->lock);
 }

 /*
···
 {
 	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
 	struct hrtimer_clock_base *base = timer->base;
-	ktime_t expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
+	ktime_t expires = hrtimer_get_expires(timer);

-	WARN_ON_ONCE(hrtimer_get_expires(timer) < 0);
+	WARN_ON_ONCE(expires < 0);

+	expires = ktime_sub(expires, base->offset);
 	/*
 	 * CLOCK_REALTIME timer might be requested with an absolute
 	 * expiry time which is less than base->offset. Set it to 0.
···
 		timer_cpu_base->softirq_next_timer = timer;
 		timer_cpu_base->softirq_expires_next = expires;

-		if (!ktime_before(expires, timer_cpu_base->expires_next) ||
-		    !reprogram)
+		if (!ktime_before(expires, timer_cpu_base->expires_next) || !reprogram)
 			return;
 	}

···
 	if (expires >= cpu_base->expires_next)
 		return;

-	/*
-	 * If the hrtimer interrupt is running, then it will reevaluate the
-	 * clock bases and reprogram the clock event device.
-	 */
-	if (cpu_base->in_hrtirq)
+	/* If a deferred rearm is pending skip reprogramming the device */
+	if (cpu_base->deferred_rearm)
 		return;

 	cpu_base->next_timer = timer;
···
 	__hrtimer_reprogram(cpu_base, timer, expires);
 }

-static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,
-			     unsigned int active)
+static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base, unsigned int active)
 {
 	struct hrtimer_clock_base *base;
 	unsigned int seq;
···
 	if (seq == cpu_base->clock_was_set_seq)
 		return false;

-	/*
-	 * If the remote CPU is currently handling an hrtimer interrupt, it
-	 * will reevaluate the first expiring timer of all clock bases
-	 * before reprogramming. Nothing to do here.
-	 */
-	if (cpu_base->in_hrtirq)
+	/* If a deferred rearm is pending the remote CPU will take care of it */
+	if (cpu_base->deferred_rearm) {
+		cpu_base->deferred_needs_update = true;
 		return false;
+	}

 	/*
 	 * Walk the affected clock bases and check whether the first expiring
···
 	active &= cpu_base->active_bases;

 	for_each_active_base(base, cpu_base, active) {
-		struct timerqueue_node *next;
+		struct timerqueue_linked_node *next;

-		next = timerqueue_getnext(&base->active);
+		next = timerqueue_linked_first(&base->active);
 		expires = ktime_sub(next->expires, base->offset);
 		if (expires < cpu_base->expires_next)
 			return true;
···
  */
 void clock_was_set(unsigned int bases)
 {
-	struct hrtimer_cpu_base *cpu_base = raw_cpu_ptr(&hrtimer_bases);
 	cpumask_var_t mask;
-	int cpu;

-	if (!hrtimer_hres_active(cpu_base) && !tick_nohz_is_active())
+	if (!hrtimer_highres_enabled() && !tick_nohz_is_active())
 		goto out_timerfd;

 	if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
···
 	}

 	/* Avoid interrupting CPUs if possible */
-	cpus_read_lock();
-	for_each_online_cpu(cpu) {
-		unsigned long flags;
+	scoped_guard(cpus_read_lock) {
+		int cpu;

-		cpu_base = &per_cpu(hrtimer_bases, cpu);
-		raw_spin_lock_irqsave(&cpu_base->lock, flags);
+		for_each_online_cpu(cpu) {
+			struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);

-		if (update_needs_ipi(cpu_base, bases))
-			cpumask_set_cpu(cpu, mask);
-
-		raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+			guard(raw_spinlock_irqsave)(&cpu_base->lock);
+			if (update_needs_ipi(cpu_base, bases))
+				cpumask_set_cpu(cpu, mask);
+		}
+		scoped_guard(preempt)
+			smp_call_function_many(mask, retrigger_next_event, NULL, 1);
 	}
-
-	preempt_disable();
-	smp_call_function_many(mask, retrigger_next_event, NULL, 1);
-	preempt_enable();
-	cpus_read_unlock();
 	free_cpumask_var(mask);

 out_timerfd:
···
 	retrigger_next_event(NULL);
 }

-/*
- * Counterpart to lock_hrtimer_base above:
- */
-static inline
-void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
+/* Counterpart to lock_hrtimer_base above */
+static inline void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
 	__releases(&timer->base->cpu_base->lock)
 {
 	raw_spin_unlock_irqrestore(&timer->base->cpu_base->lock, *flags);
···
  * .. note::
  *  This only updates the timer expiry value and does not requeue the timer.
  *
- * There is also a variant of the function hrtimer_forward_now().
+ * There is also a variant of this function: hrtimer_forward_now().
  *
  * Context: Can be safely called from the callback function of @timer.
If called
  * from other contexts @timer must neither be enqueued nor running the
···
  */
 u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
 {
-	u64 orun = 1;
 	ktime_t delta;
+	u64 orun = 1;

 	delta = ktime_sub(now, hrtimer_get_expires(timer));

 	if (delta < 0)
 		return 0;

-	if (WARN_ON(timer->state & HRTIMER_STATE_ENQUEUED))
+	if (WARN_ON(timer->is_queued))
 		return 0;

 	if (interval < hrtimer_resolution)
···
  * enqueue_hrtimer - internal function to (re)start a timer
  *
  * The timer is inserted in expiry order. Insertion into the
- * red black tree is O(log(n)). Must hold the base lock.
+ * red black tree is O(log(n)).
  *
  * Returns true when the new timer is the leftmost timer in the tree.
  */
 static bool enqueue_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
-			    enum hrtimer_mode mode)
+			    enum hrtimer_mode mode, bool was_armed)
 {
-	debug_activate(timer, mode);
+	lockdep_assert_held(&base->cpu_base->lock);
+
+	debug_activate(timer, mode, was_armed);
 	WARN_ON_ONCE(!base->cpu_base->online);

 	base->cpu_base->active_bases |= 1 << base->index;

 	/* Pairs with the lockless read in hrtimer_is_queued() */
-	WRITE_ONCE(timer->state, HRTIMER_STATE_ENQUEUED);
+	WRITE_ONCE(timer->is_queued, HRTIMER_STATE_ENQUEUED);

-	return timerqueue_add(&base->active, &timer->node);
+	if (!timerqueue_linked_add(&base->active, &timer->node))
+		return false;
+
+	base->expires_next = hrtimer_get_expires(timer);
+	return true;
+}
+
+static inline void base_update_next_timer(struct hrtimer_clock_base *base)
+{
+	struct timerqueue_linked_node *next =
timerqueue_linked_first(&base->active);
+
+	base->expires_next = next ? next->expires : KTIME_MAX;
 }

 /*
  * __remove_hrtimer - internal function to remove a timer
- *
- * Caller must hold the base lock.
  *
  * High resolution timer mode reprograms the clock event device when the
  * timer is the one which expires next. The caller can disable this by setting
  * reprogram to zero. This is useful, when the context does a reprogramming
  * anyway (e.g. timer interrupt)
  */
-static void __remove_hrtimer(struct hrtimer *timer,
-			     struct hrtimer_clock_base *base,
-			     u8 newstate, int reprogram)
+static void __remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
+			     bool newstate, bool reprogram)
 {
 	struct hrtimer_cpu_base *cpu_base = base->cpu_base;
-	u8 state = timer->state;
+	bool was_first;

-	/* Pairs with the lockless read in hrtimer_is_queued() */
-	WRITE_ONCE(timer->state, newstate);
-	if (!(state & HRTIMER_STATE_ENQUEUED))
+	lockdep_assert_held(&cpu_base->lock);
+
+	if (!timer->is_queued)
 		return;

-	if (!timerqueue_del(&base->active, &timer->node))
+	/* Pairs with the lockless read in hrtimer_is_queued() */
+	WRITE_ONCE(timer->is_queued, newstate);
+
+	was_first = !timerqueue_linked_prev(&timer->node);
+
+	if (!timerqueue_linked_del(&base->active, &timer->node))
 		cpu_base->active_bases &= ~(1 << base->index);

+	/* Nothing to update if this was not the first timer in the base */
+	if (!was_first)
+		return;
+
+	base_update_next_timer(base);
+
 	/*
-	 * Note: If reprogram is false we do not update
-	 * cpu_base->next_timer. This happens when we remove the first
-	 * timer on a remote cpu.
No harm as we never dereference
-	 * cpu_base->next_timer. So the worst thing what can happen is
-	 * an superfluous call to hrtimer_force_reprogram() on the
-	 * remote cpu later on if the same timer gets enqueued again.
+	 * If reprogram is false don't update cpu_base->next_timer and do not
+	 * touch the clock event device.
+	 *
+	 * This happens when removing the first timer on a remote CPU, which
+	 * will be handled by the remote CPU's interrupt. It also happens when
+	 * a local timer is removed to be immediately restarted. That's handled
+	 * at the call site.
 	 */
-	if (reprogram && timer == cpu_base->next_timer)
-		hrtimer_force_reprogram(cpu_base, 1);
+	if (!reprogram || timer != cpu_base->next_timer || timer->is_lazy)
+		return;
+
+	if (cpu_base->deferred_rearm)
+		cpu_base->deferred_needs_update = true;
+	else
+		hrtimer_force_reprogram(cpu_base, /* skip_equal */ true);
 }

-/*
- * remove hrtimer, called with base lock held
- */
-static inline int
-remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
-	       bool restart, bool keep_local)
+static inline bool remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
+				  bool newstate)
 {
-	u8 state = timer->state;
+	lockdep_assert_held(&base->cpu_base->lock);

-	if (state & HRTIMER_STATE_ENQUEUED) {
+	if (timer->is_queued) {
 		bool reprogram;
+
+		debug_hrtimer_deactivate(timer);

 		/*
 		 * Remove the timer and force reprogramming when high
···
 		 * reprogramming happens in the interrupt handler. This is a
 		 * rare case and less expensive than a smp call.
 		 */
-		debug_deactivate(timer);
 		reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases);

-		/*
-		 * If the timer is not restarted then reprogramming is
-		 * required if the timer is local. If it is local and about
-		 * to be restarted, avoid programming it twice (on removal
-		 * and a moment later when it's requeued).
-		 */
-		if (!restart)
-			state = HRTIMER_STATE_INACTIVE;
-		else
-			reprogram &= !keep_local;
-
-		__remove_hrtimer(timer, base, state, reprogram);
-		return 1;
+		__remove_hrtimer(timer, base, newstate, reprogram);
+		return true;
 	}
-	return 0;
+	return false;
+}
+
+/*
+ * Update in place has to retrieve the expiry times of the neighbour nodes
+ * if they exist. That is cache line neutral because the dequeue/enqueue
+ * operation is going to need the same cache lines. But there is a big win
+ * when the dequeue/enqueue can be avoided because the RB tree does not
+ * have to be rebalanced twice.
+ */
+static inline bool
+hrtimer_can_update_in_place(struct hrtimer *timer, struct hrtimer_clock_base *base, ktime_t expires)
+{
+	struct timerqueue_linked_node *next = timerqueue_linked_next(&timer->node);
+	struct timerqueue_linked_node *prev = timerqueue_linked_prev(&timer->node);
+
+	/* If the new expiry goes behind the next timer, requeue is required */
+	if (next && expires > next->expires)
+		return false;
+
+	/* If this is the first timer, update in place */
+	if (!prev)
+		return true;
+
+	/* Update in place when it does not go ahead of the previous one */
+	return expires >= prev->expires;
+}
+
+static inline bool
+remove_and_enqueue_same_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
+			     const enum hrtimer_mode mode, ktime_t expires, u64 delta_ns)
+{
+	bool was_first = false;
+
+	/* Remove it from the timer queue if active */
+	if (timer->is_queued) {
+		was_first = !timerqueue_linked_prev(&timer->node);
+
+		/* Try to update in place to avoid the de/enqueue dance */
+		if (hrtimer_can_update_in_place(timer, base, expires)) {
+			hrtimer_set_expires_range_ns(timer, expires, delta_ns);
+			trace_hrtimer_start(timer, mode, true);
+			if (was_first)
+				base->expires_next = expires;
+			return was_first;
+		}
+
+		debug_hrtimer_deactivate(timer);
+		timerqueue_linked_del(&base->active, &timer->node);
+	}
+
+	/* Set the new expiry time */
+	hrtimer_set_expires_range_ns(timer, expires, delta_ns);
+
+	debug_activate(timer, mode, timer->is_queued);
+	base->cpu_base->active_bases |= 1 << base->index;
+
+	/* Pairs with the lockless read in hrtimer_is_queued() */
+	WRITE_ONCE(timer->is_queued, HRTIMER_STATE_ENQUEUED);
+
+	/* If it's the first expiring timer now or again, update base */
+	if
(timerqueue_linked_add(&base->active, &timer->node)) {
+		base->expires_next = expires;
+		return true;
+	}
+
+	if (was_first)
+		base_update_next_timer(base);
+
+	return false;
 }

 static inline ktime_t hrtimer_update_lowres(struct hrtimer *timer, ktime_t tim,
···
 	return tim;
 }

-static void
-hrtimer_update_softirq_timer(struct hrtimer_cpu_base *cpu_base, bool reprogram)
+static void hrtimer_update_softirq_timer(struct hrtimer_cpu_base *cpu_base, bool reprogram)
 {
-	ktime_t expires;
+	ktime_t expires = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);

 	/*
-	 * Find the next SOFT expiration.
-	 */
-	expires = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
-
-	/*
-	 * reprogramming needs to be triggered, even if the next soft
-	 * hrtimer expires at the same time than the next hard
+	 * Reprogramming needs to be triggered, even if the next soft
+	 * hrtimer expires at the same time as the next hard
 	 * hrtimer. cpu_base->softirq_expires_next needs to be updated!
 	 */
 	if (expires == KTIME_MAX)
 		return;

 	/*
-	 * cpu_base->*next_timer is recomputed by __hrtimer_get_next_event()
-	 * cpu_base->*expires_next is only set by hrtimer_reprogram()
+	 * cpu_base->next_timer is recomputed by __hrtimer_get_next_event()
+	 * cpu_base->expires_next is only set by hrtimer_reprogram()
 	 */
 	hrtimer_reprogram(cpu_base->softirq_next_timer, reprogram);
 }

-static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
-				    u64 delta_ns, const enum hrtimer_mode mode,
-				    struct hrtimer_clock_base *base)
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+static __always_inline bool hrtimer_prefer_local(bool is_local, bool is_first, bool is_pinned)
+{
+	if (static_branch_likely(&timers_migration_enabled)) {
+		/*
+		 * If it is local and the first expiring timer keep it on the local
+		 * CPU to optimize reprogramming of the clockevent device. Also
+		 * avoid switch_hrtimer_base() overhead when local and pinned.
+		 */
+		if (!is_local)
+			return false;
+		if (is_first || is_pinned)
+			return true;
+
+		/* Honour the NOHZ full restrictions */
+		if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_KERNEL_NOISE))
+			return false;
+
+		/*
+		 * If the tick is not stopped or need_resched() is set, then
+		 * there is no point in moving the timer somewhere else.
+		 */
+		return !tick_nohz_tick_stopped() || need_resched();
+	}
+	return is_local;
+}
+#else
+static __always_inline bool hrtimer_prefer_local(bool is_local, bool is_first, bool is_pinned)
+{
+	return is_local;
+}
+#endif
+
+static inline bool hrtimer_keep_base(struct hrtimer *timer, bool is_local, bool is_first,
+				     bool is_pinned)
+{
+	/* If the timer is running the callback it has to stay on its CPU base. */
+	if (unlikely(timer->base->running == timer))
+		return true;
+
+	return hrtimer_prefer_local(is_local, is_first, is_pinned);
+}
+
+static bool __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 delta_ns,
+				     const enum hrtimer_mode mode, struct hrtimer_clock_base *base)
 {
 	struct hrtimer_cpu_base *this_cpu_base = this_cpu_ptr(&hrtimer_bases);
-	struct hrtimer_clock_base *new_base;
-	bool force_local, first;
+	bool is_pinned, first, was_first, keep_base = false;
+	struct hrtimer_cpu_base *cpu_base = base->cpu_base;
+
+	was_first = cpu_base->next_timer == timer;
+	is_pinned = !!(mode & HRTIMER_MODE_PINNED);

 	/*
-	 * If the timer is on the local cpu base and is the first expiring
-	 * timer then this might end up reprogramming the hardware twice
-	 * (on removal and on enqueue). To avoid that by prevent the
-	 * reprogram on removal, keep the timer local to the current CPU
-	 * and enforce reprogramming after it is queued no matter whether
-	 * it is the new first expiring timer again or not.
+	 * Don't keep it local if this enqueue happens on a unplugged CPU
+	 * after hrtimer_cpu_dying() has been invoked.
 	 */
-	force_local = base->cpu_base == this_cpu_base;
-	force_local &= base->cpu_base->next_timer == timer;
+	if (likely(this_cpu_base->online)) {
+		bool is_local = cpu_base == this_cpu_base;

-	/*
-	 * Don't force local queuing if this enqueue happens on a unplugged
-	 * CPU after hrtimer_cpu_dying() has been invoked.
-	 */
-	force_local &= this_cpu_base->online;
+		keep_base = hrtimer_keep_base(timer, is_local, was_first, is_pinned);
+	}
+
+	/* Calculate absolute expiry time for relative timers */
+	if (mode & HRTIMER_MODE_REL)
+		tim = ktime_add_safe(tim, __hrtimer_cb_get_time(base->clockid));
+	/* Compensate for low resolution granularity */
+	tim = hrtimer_update_lowres(timer, tim, mode);

 	/*
 	 * Remove an active timer from the queue. In case it is not queued
···
 	 * reprogramming later if it was the first expiring timer. This
 	 * avoids programming the underlying clock event twice (once at
 	 * removal and once after enqueue).
+	 *
+	 * @keep_base is also true if the timer callback is running on a
+	 * remote CPU and for local pinned timers.
 	 */
-	remove_hrtimer(timer, base, true, force_local);
-
-	if (mode & HRTIMER_MODE_REL)
-		tim = ktime_add_safe(tim, __hrtimer_cb_get_time(base->clockid));
-
-	tim = hrtimer_update_lowres(timer, tim, mode);
-
-	hrtimer_set_expires_range_ns(timer, tim, delta_ns);
-
-	/* Switch the timer base, if necessary: */
-	if (!force_local) {
-		new_base = switch_hrtimer_base(timer, base,
-					       mode & HRTIMER_MODE_PINNED);
+	if (likely(keep_base)) {
+		first = remove_and_enqueue_same_base(timer, base, mode, tim, delta_ns);
 	} else {
-		new_base = base;
+		/* Keep the ENQUEUED state in case it is queued */
+		bool was_armed = remove_hrtimer(timer, base, HRTIMER_STATE_ENQUEUED);
+
+		hrtimer_set_expires_range_ns(timer, tim, delta_ns);
+
+		/* Switch the timer base, if necessary: */
+		base = switch_hrtimer_base(timer, base, is_pinned);
+		cpu_base = base->cpu_base;
+
+		first = enqueue_hrtimer(timer, base, mode, was_armed);
 	}

-	first = enqueue_hrtimer(timer, new_base, mode);
-	if (!force_local) {
+	/* If a deferred rearm is pending skip reprogramming the device */
+	if (cpu_base->deferred_rearm) {
+		cpu_base->deferred_needs_update = true;
+		return false;
+	}
+
+	if (!was_first || cpu_base != this_cpu_base) {
 		/*
-		 * If the current CPU base is online, then the timer is
-		 * never queued on a remote CPU if it would be the first
-		 * expiring timer there.
+		 * If the current CPU base is online, then the timer is never
+		 * queued on a remote CPU if it would be the first expiring
+		 * timer there unless the timer callback is currently executed
+		 * on the remote CPU. In the latter case the remote CPU will
+		 * re-evaluate the first expiring timer after completing the
+		 * callbacks.
 		 */
-		if (hrtimer_base_is_online(this_cpu_base))
+		if (likely(hrtimer_base_is_online(this_cpu_base)))
 			return first;
 
 		/*
···
 		 * already offline. If the timer is the first to expire,
 		 * kick the remote CPU to reprogram the clock event.
 		 */
-		if (first) {
-			struct hrtimer_cpu_base *new_cpu_base = new_base->cpu_base;
-
-			smp_call_function_single_async(new_cpu_base->cpu, &new_cpu_base->csd);
-		}
-		return 0;
+		if (first)
+			smp_call_function_single_async(cpu_base->cpu, &cpu_base->csd);
+		return false;
 	}
 
 	/*
-	 * Timer was forced to stay on the current CPU to avoid
-	 * reprogramming on removal and enqueue. Force reprogram the
-	 * hardware by evaluating the new first expiring timer.
+	 * Special case for the HRTICK timer. It is frequently rearmed and most
+	 * of the time moves the expiry into the future. That's expensive in
+	 * virtual machines and it's better to take the pointless already armed
+	 * interrupt than reprogramming the hardware on every context switch.
+	 *
+	 * If the new expiry is before the armed time, then reprogramming is
+	 * required.
 	 */
-	hrtimer_force_reprogram(new_base->cpu_base, 1);
-	return 0;
+	if (timer->is_lazy) {
+		if (cpu_base->expires_next <= hrtimer_get_expires(timer))
+			return false;
+	}
+
+	/*
+	 * Timer was the first expiring timer and forced to stay on the
+	 * current CPU to avoid reprogramming on removal and enqueue. Force
+	 * reprogram the hardware by evaluating the new first expiring
+	 * timer.
+	 */
+	hrtimer_force_reprogram(cpu_base, /* skip_equal */ true);
+	return false;
 }
 
 /**
···
  *		relative (HRTIMER_MODE_REL), and pinned (HRTIMER_MODE_PINNED);
  *		softirq based mode is considered for debug purpose only!
  */
-void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
-			    u64 delta_ns, const enum hrtimer_mode mode)
+void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 delta_ns,
+			    const enum hrtimer_mode mode)
 {
 	struct hrtimer_clock_base *base;
 	unsigned long flags;
+
+	debug_hrtimer_assert_init(timer);
 
 	/*
 	 * Check whether the HRTIMER_MODE_SOFT bit and hrtimer.is_soft
···
 
 	base = lock_hrtimer_base(timer, &flags);
 
-	if (!hrtimer_callback_running(timer))
-		ret = remove_hrtimer(timer, base, false, false);
+	if (!hrtimer_callback_running(timer)) {
+		ret = remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE);
+		if (ret)
+			trace_hrtimer_cancel(timer);
+	}
 
 	unlock_hrtimer_base(timer, &flags);
 
···
  * the timer callback to finish. Drop expiry_lock and reacquire it. That
  * allows the waiter to acquire the lock and make progress.
  */
-static void hrtimer_sync_wait_running(struct hrtimer_cpu_base *cpu_base,
-				      unsigned long flags)
+static void hrtimer_sync_wait_running(struct hrtimer_cpu_base *cpu_base, unsigned long flags)
 {
 	if (atomic_read(&cpu_base->timer_waiters)) {
 		raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
···
 	spin_unlock_bh(&base->cpu_base->softirq_expiry_lock);
 }
 #else
-static inline void
-hrtimer_cpu_base_init_expiry_lock(struct hrtimer_cpu_base *base) { }
-static inline void
-hrtimer_cpu_base_lock_expiry(struct hrtimer_cpu_base *base) { }
-static inline void
-hrtimer_cpu_base_unlock_expiry(struct hrtimer_cpu_base *base) { }
-static inline void hrtimer_sync_wait_running(struct hrtimer_cpu_base *base,
-					     unsigned long flags) { }
+static inline void hrtimer_cpu_base_init_expiry_lock(struct hrtimer_cpu_base *base) { }
+static inline void hrtimer_cpu_base_lock_expiry(struct hrtimer_cpu_base *base) { }
+static inline void hrtimer_cpu_base_unlock_expiry(struct hrtimer_cpu_base *base) { }
+static inline void hrtimer_sync_wait_running(struct hrtimer_cpu_base *base, unsigned long fl) { }
 #endif
 
 /**
···
 {
 	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
 	u64 expires = KTIME_MAX;
-	unsigned long flags;
 
-	raw_spin_lock_irqsave(&cpu_base->lock, flags);
-
+	guard(raw_spinlock_irqsave)(&cpu_base->lock);
 	if (!hrtimer_hres_active(cpu_base))
 		expires = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);
-
-	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
 
 	return expires;
 }
···
 {
 	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
 	u64 expires = KTIME_MAX;
-	unsigned long flags;
+	unsigned int active;
 
-	raw_spin_lock_irqsave(&cpu_base->lock, flags);
+	guard(raw_spinlock_irqsave)(&cpu_base->lock);
+	if (!hrtimer_hres_active(cpu_base))
+		return expires;
 
-	if (hrtimer_hres_active(cpu_base)) {
-		unsigned int active;
+	active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
+	if (active && !cpu_base->softirq_activated)
+		expires = hrtimer_bases_next_event_without(cpu_base, exclude, active, KTIME_MAX);
 
-		if (!cpu_base->softirq_activated) {
-			active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
-			expires = __hrtimer_next_event_base(cpu_base, exclude,
-							    active, KTIME_MAX);
-		}
-		active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
-		expires = __hrtimer_next_event_base(cpu_base, exclude, active,
-						    expires);
-	}
-
-	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
-
-	return expires;
+	active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
+	if (!active)
+		return expires;
+	return hrtimer_bases_next_event_without(cpu_base, exclude, active, expires);
 }
 #endif
 
···
 }
 EXPORT_SYMBOL_GPL(hrtimer_cb_get_time);
 
-static void __hrtimer_setup(struct hrtimer *timer,
-			    enum hrtimer_restart (*function)(struct hrtimer *),
+static void __hrtimer_setup(struct hrtimer *timer, enum hrtimer_restart (*fn)(struct hrtimer *),
 			    clockid_t clock_id, enum hrtimer_mode mode)
 {
 	bool softtimer = !!(mode & HRTIMER_MODE_SOFT);
···
 	base += hrtimer_clockid_to_base(clock_id);
 	timer->is_soft = softtimer;
 	timer->is_hard = !!(mode & HRTIMER_MODE_HARD);
+	timer->is_lazy = !!(mode & HRTIMER_MODE_LAZY_REARM);
 	timer->base = &cpu_base->clock_base[base];
-	timerqueue_init(&timer->node);
+	timerqueue_linked_init(&timer->node);
 
-	if (WARN_ON_ONCE(!function))
+	if (WARN_ON_ONCE(!fn))
 		ACCESS_PRIVATE(timer, function) = hrtimer_dummy_timeout;
 	else
-		ACCESS_PRIVATE(timer, function) = function;
+		ACCESS_PRIVATE(timer, function) = fn;
 }
 
 /**
···
 		base = READ_ONCE(timer->base);
 		seq = raw_read_seqcount_begin(&base->seq);
 
-		if (timer->state != HRTIMER_STATE_INACTIVE ||
-		    base->running == timer)
+		if (timer->is_queued || base->running == timer)
 			return true;
 
-	} while (read_seqcount_retry(&base->seq, seq) ||
-		 base != READ_ONCE(timer->base));
+	} while (read_seqcount_retry(&base->seq, seq) || base != READ_ONCE(timer->base));
 
 	return false;
 }
···
  *  - callback:	the timer is being ran
  *  - post:	the timer is inactive or (re)queued
  *
- * On the read side we ensure we observe timer->state and cpu_base->running
+ * On the read side we ensure we observe timer->is_queued and cpu_base->running
  * from the same section, if anything changed while we looked at it, we retry.
  * This includes timer->base changing because sequence numbers alone are
  * insufficient for that.
···
  * a false negative if the read side got smeared over multiple consecutive
  * __run_hrtimer() invocations.
  */
-
-static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
-			  struct hrtimer_clock_base *base,
-			  struct hrtimer *timer, ktime_t *now,
-			  unsigned long flags) __must_hold(&cpu_base->lock)
+static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base, struct hrtimer_clock_base *base,
+			  struct hrtimer *timer, ktime_t now, unsigned long flags)
+	__must_hold(&cpu_base->lock)
 {
 	enum hrtimer_restart (*fn)(struct hrtimer *);
 	bool expires_in_hardirq;
···
 	base->running = timer;
 
 	/*
-	 * Separate the ->running assignment from the ->state assignment.
+	 * Separate the ->running assignment from the ->is_queued assignment.
 	 *
 	 * As with a regular write barrier, this ensures the read side in
 	 * hrtimer_active() cannot observe base->running == NULL &&
-	 * timer->state == INACTIVE.
+	 * timer->is_queued == INACTIVE.
 	 */
 	raw_write_seqcount_barrier(&base->seq);
 
-	__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, 0);
+	__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, false);
 	fn = ACCESS_PRIVATE(timer, function);
 
 	/*
···
 	 * hrtimer_start_range_ns() can have popped in and enqueued the timer
 	 * for us already.
 	 */
-	if (restart != HRTIMER_NORESTART &&
-	    !(timer->state & HRTIMER_STATE_ENQUEUED))
-		enqueue_hrtimer(timer, base, HRTIMER_MODE_ABS);
+	if (restart == HRTIMER_RESTART && !timer->is_queued)
+		enqueue_hrtimer(timer, base, HRTIMER_MODE_ABS, false);
 
 	/*
-	 * Separate the ->running assignment from the ->state assignment.
+	 * Separate the ->running assignment from the ->is_queued assignment.
 	 *
 	 * As with a regular write barrier, this ensures the read side in
 	 * hrtimer_active() cannot observe base->running.timer == NULL &&
-	 * timer->state == INACTIVE.
+	 * timer->is_queued == INACTIVE.
 	 */
 	raw_write_seqcount_barrier(&base->seq);
 
···
 	base->running = NULL;
 }
 
+static __always_inline struct hrtimer *clock_base_next_timer_safe(struct hrtimer_clock_base *base)
+{
+	struct timerqueue_linked_node *next = timerqueue_linked_first(&base->active);
+
+	return next ? hrtimer_from_timerqueue_node(next) : NULL;
+}
+
 static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now,
 				 unsigned long flags, unsigned int active_mask)
 {
-	struct hrtimer_clock_base *base;
 	unsigned int active = cpu_base->active_bases & active_mask;
+	struct hrtimer_clock_base *base;
 
 	for_each_active_base(base, cpu_base, active) {
-		struct timerqueue_node *node;
-		ktime_t basenow;
+		ktime_t basenow = ktime_add(now, base->offset);
+		struct hrtimer *timer;
 
-		basenow = ktime_add(now, base->offset);
-
-		while ((node = timerqueue_getnext(&base->active))) {
-			struct hrtimer *timer;
-
-			timer = container_of(node, struct hrtimer, node);
-
+		while ((timer = clock_base_next_timer(base))) {
 			/*
 			 * The immediate goal for using the softexpires is
 			 * minimizing wakeups, not running timers at the
···
 			if (basenow < hrtimer_get_softexpires(timer))
 				break;
 
-			__run_hrtimer(cpu_base, base, timer, &basenow, flags);
+			__run_hrtimer(cpu_base, base, timer, basenow, flags);
 			if (active_mask == HRTIMER_ACTIVE_SOFT)
 				hrtimer_sync_wait_running(cpu_base, flags);
 		}
···
 	now = hrtimer_update_base(cpu_base);
 	__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_SOFT);
 
-	cpu_base->softirq_activated = 0;
+	cpu_base->softirq_activated = false;
 	hrtimer_update_softirq_timer(cpu_base, true);
 
 	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
···
 }
 
 #ifdef CONFIG_HIGH_RES_TIMERS
+
+/*
+ * Very similar to hrtimer_force_reprogram(), except it deals with
+ * deferred_rearm and hang_detected.
+ */
+static void hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next, bool deferred)
+{
+	cpu_base->expires_next = expires_next;
+	cpu_base->deferred_rearm = false;
+
+	if (unlikely(cpu_base->hang_detected)) {
+		/*
+		 * Give the system a chance to do something else than looping
+		 * on hrtimer interrupts.
+		 */
+		expires_next = ktime_add_ns(ktime_get(),
+					    min(100 * NSEC_PER_MSEC, cpu_base->max_hang_time));
+	}
+	hrtimer_rearm_event(expires_next, deferred);
+}
+
+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+void __hrtimer_rearm_deferred(void)
+{
+	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
+	ktime_t expires_next;
+
+	if (!cpu_base->deferred_rearm)
+		return;
+
+	guard(raw_spinlock)(&cpu_base->lock);
+	if (cpu_base->deferred_needs_update) {
+		hrtimer_update_base(cpu_base);
+		expires_next = hrtimer_update_next_event(cpu_base);
+	} else {
+		/* No timer added/removed. Use the cached value */
+		expires_next = cpu_base->deferred_expires_next;
+	}
+	hrtimer_rearm(cpu_base, expires_next, true);
+}
+
+static __always_inline void
+hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next)
+{
+	/* hrtimer_interrupt() just re-evaluated the first expiring timer */
+	cpu_base->deferred_needs_update = false;
+	/* Cache the expiry time */
+	cpu_base->deferred_expires_next = expires_next;
+	set_thread_flag(TIF_HRTIMER_REARM);
+}
+#else  /* CONFIG_HRTIMER_REARM_DEFERRED */
+static __always_inline void
+hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next)
+{
+	hrtimer_rearm(cpu_base, expires_next, false);
+}
+#endif  /* !CONFIG_HRTIMER_REARM_DEFERRED */
 
 /*
  * High resolution timer interrupt
···
 	raw_spin_lock_irqsave(&cpu_base->lock, flags);
 	entry_time = now = hrtimer_update_base(cpu_base);
 retry:
-	cpu_base->in_hrtirq = 1;
+	cpu_base->deferred_rearm = true;
 	/*
-	 * We set expires_next to KTIME_MAX here with cpu_base->lock
-	 * held to prevent that a timer is enqueued in our queue via
-	 * the migration code. This does not affect enqueueing of
-	 * timers which run their callback and need to be requeued on
-	 * this CPU.
+	 * Set expires_next to KTIME_MAX, which prevents that remote CPUs queue
+	 * timers while __hrtimer_run_queues() is expiring the clock bases.
+	 * Timers which are re/enqueued on the local CPU are not affected by
+	 * this.
 	 */
 	cpu_base->expires_next = KTIME_MAX;
 
 	if (!ktime_before(now, cpu_base->softirq_expires_next)) {
 		cpu_base->softirq_expires_next = KTIME_MAX;
-		cpu_base->softirq_activated = 1;
+		cpu_base->softirq_activated = true;
 		raise_timer_softirq(HRTIMER_SOFTIRQ);
 	}
 
 	__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
-
-	/* Reevaluate the clock bases for the [soft] next expiry */
-	expires_next = hrtimer_update_next_event(cpu_base);
-	/*
-	 * Store the new expiry value so the migration code can verify
-	 * against it.
-	 */
-	cpu_base->expires_next = expires_next;
-	cpu_base->in_hrtirq = 0;
-	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
-
-	/* Reprogramming necessary ? */
-	if (!tick_program_event(expires_next, 0)) {
-		cpu_base->hang_detected = 0;
-		return;
-	}
 
 	/*
 	 * The next timer was already expired due to:
···
 	 * - long lasting callbacks
 	 * - being scheduled away when running in a VM
 	 *
-	 * We need to prevent that we loop forever in the hrtimer
-	 * interrupt routine. We give it 3 attempts to avoid
-	 * overreacting on some spurious event.
-	 *
-	 * Acquire base lock for updating the offsets and retrieving
-	 * the current time.
+	 * We need to prevent that we loop forever in the hrtiner interrupt
+	 * routine. We give it 3 attempts to avoid overreacting on some
+	 * spurious event.
 	 */
-	raw_spin_lock_irqsave(&cpu_base->lock, flags);
 	now = hrtimer_update_base(cpu_base);
-	cpu_base->nr_retries++;
-	if (++retries < 3)
-		goto retry;
-	/*
-	 * Give the system a chance to do something else than looping
-	 * here. We stored the entry time, so we know exactly how long
-	 * we spent here. We schedule the next event this amount of
-	 * time away.
-	 */
-	cpu_base->nr_hangs++;
-	cpu_base->hang_detected = 1;
-	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+	expires_next = hrtimer_update_next_event(cpu_base);
+	cpu_base->hang_detected = false;
+	if (expires_next < now) {
+		if (++retries < 3)
+			goto retry;
 
-	delta = ktime_sub(now, entry_time);
-	if ((unsigned int)delta > cpu_base->max_hang_time)
-		cpu_base->max_hang_time = (unsigned int) delta;
-	/*
-	 * Limit it to a sensible value as we enforce a longer
-	 * delay. Give the CPU at least 100ms to catch up.
-	 */
-	if (delta > 100 * NSEC_PER_MSEC)
-		expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
-	else
-		expires_next = ktime_add(now, delta);
-	tick_program_event(expires_next, 1);
-	pr_warn_once("hrtimer: interrupt took %llu ns\n", ktime_to_ns(delta));
+		delta = ktime_sub(now, entry_time);
+		cpu_base->max_hang_time = max_t(unsigned int, cpu_base->max_hang_time, delta);
+		cpu_base->nr_hangs++;
+		cpu_base->hang_detected = true;
+	}
+
+	hrtimer_interrupt_rearm(cpu_base, expires_next);
+	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
 }
+
 #endif /* !CONFIG_HIGH_RES_TIMERS */
 
 /*
···
 
 	if (!ktime_before(now, cpu_base->softirq_expires_next)) {
 		cpu_base->softirq_expires_next = KTIME_MAX;
-		cpu_base->softirq_activated = 1;
+		cpu_base->softirq_activated = true;
 		raise_timer_softirq(HRTIMER_SOFTIRQ);
 	}
 
···
  */
 static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer)
 {
-	struct hrtimer_sleeper *t =
-		container_of(timer, struct hrtimer_sleeper, timer);
+	struct hrtimer_sleeper *t = container_of(timer, struct hrtimer_sleeper, timer);
 	struct task_struct *task = t->task;
 
 	t->task = NULL;
···
  * Wrapper around hrtimer_start_expires() for hrtimer_sleeper based timers
  * to allow PREEMPT_RT to tweak the delivery mode (soft/hardirq context)
  */
-void hrtimer_sleeper_start_expires(struct hrtimer_sleeper *sl,
-				   enum hrtimer_mode mode)
+void hrtimer_sleeper_start_expires(struct hrtimer_sleeper *sl, enum hrtimer_mode mode)
 {
 	/*
 	 * Make the enqueue delivery mode check work on RT. If the sleeper
···
 }
 EXPORT_SYMBOL_GPL(hrtimer_sleeper_start_expires);
 
-static void __hrtimer_setup_sleeper(struct hrtimer_sleeper *sl,
-				    clockid_t clock_id, enum hrtimer_mode mode)
+static void __hrtimer_setup_sleeper(struct hrtimer_sleeper *sl, clockid_t clock_id,
+				    enum hrtimer_mode mode)
 {
 	/*
 	 * On PREEMPT_RT enabled kernels hrtimers which are not explicitly
···
  * @clock_id:	the clock to be used
  * @mode:	timer mode abs/rel
  */
-void hrtimer_setup_sleeper_on_stack(struct hrtimer_sleeper *sl,
-				    clockid_t clock_id, enum hrtimer_mode mode)
+void hrtimer_setup_sleeper_on_stack(struct hrtimer_sleeper *sl, clockid_t clock_id,
+				    enum hrtimer_mode mode)
 {
 	debug_setup_on_stack(&sl->timer, clock_id, mode);
 	__hrtimer_setup_sleeper(sl, clock_id, mode);
···
 	return ret;
 }
 
-long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
-		       const clockid_t clockid)
+long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode, const clockid_t clockid)
 {
 	struct restart_block *restart;
 	struct hrtimer_sleeper t;
-	int ret = 0;
+	int ret;
 
 	hrtimer_setup_sleeper_on_stack(&t, clockid, mode);
 	hrtimer_set_expires_range_ns(&t.timer, rqtp, current->timer_slack_ns);
···
 	current->restart_block.fn = do_no_restart_syscall;
 	current->restart_block.nanosleep.type = rmtp ? TT_NATIVE : TT_NONE;
 	current->restart_block.nanosleep.rmtp = rmtp;
-	return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL,
-				 CLOCK_MONOTONIC);
+	return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL, CLOCK_MONOTONIC);
 }
 
 #endif
···
 #ifdef CONFIG_COMPAT_32BIT_TIME
 
 SYSCALL_DEFINE2(nanosleep_time32, struct old_timespec32 __user *, rqtp,
-		       struct old_timespec32 __user *, rmtp)
+		struct old_timespec32 __user *, rmtp)
 {
 	struct timespec64 tu;
 
···
 	current->restart_block.fn = do_no_restart_syscall;
 	current->restart_block.nanosleep.type = rmtp ? TT_COMPAT : TT_NONE;
 	current->restart_block.nanosleep.compat_rmtp = rmtp;
-	return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL,
-				 CLOCK_MONOTONIC);
+	return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL, CLOCK_MONOTONIC);
 }
 #endif
 
···
 int hrtimers_prepare_cpu(unsigned int cpu)
 {
 	struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
-	int i;
 
-	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
+	for (int i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
 		struct hrtimer_clock_base *clock_b = &cpu_base->clock_base[i];
 
 		clock_b->cpu_base = cpu_base;
 		seqcount_raw_spinlock_init(&clock_b->seq, &cpu_base->lock);
-		timerqueue_init_head(&clock_b->active);
+		timerqueue_linked_init_head(&clock_b->active);
 	}
 
 	cpu_base->cpu = cpu;
···
 
 	/* Clear out any left over state from a CPU down operation */
 	cpu_base->active_bases = 0;
-	cpu_base->hres_active = 0;
-	cpu_base->hang_detected = 0;
+	cpu_base->hres_active = false;
+	cpu_base->hang_detected = false;
 	cpu_base->next_timer = NULL;
 	cpu_base->softirq_next_timer = NULL;
 	cpu_base->expires_next = KTIME_MAX;
 	cpu_base->softirq_expires_next = KTIME_MAX;
-	cpu_base->online = 1;
+	cpu_base->softirq_activated = false;
+	cpu_base->online = true;
 	return 0;
 }
 
···
 static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
 				 struct hrtimer_clock_base *new_base)
 {
+	struct timerqueue_linked_node *node;
 	struct hrtimer *timer;
-	struct timerqueue_node *node;
 
-	while ((node = timerqueue_getnext(&old_base->active))) {
-		timer = container_of(node, struct hrtimer, node);
+	while ((node = timerqueue_linked_first(&old_base->active))) {
+		timer = hrtimer_from_timerqueue_node(node);
 		BUG_ON(hrtimer_callback_running(timer));
-		debug_deactivate(timer);
+		debug_hrtimer_deactivate(timer);
 
 		/*
 		 * Mark it as ENQUEUED not INACTIVE otherwise the
 		 * timer could be seen as !active and just vanish away
 		 * under us on another CPU
 		 */
-		__remove_hrtimer(timer, old_base, HRTIMER_STATE_ENQUEUED, 0);
+		__remove_hrtimer(timer, old_base, HRTIMER_STATE_ENQUEUED, false);
 		timer->base = new_base;
 		/*
 		 * Enqueue the timers on the new cpu. This does not
···
 		 * sort out already expired timers and reprogram the
 		 * event device.
 		 */
-		enqueue_hrtimer(timer, new_base, HRTIMER_MODE_ABS);
+		enqueue_hrtimer(timer, new_base, HRTIMER_MODE_ABS, true);
 	}
 }
 
 int hrtimers_cpu_dying(unsigned int dying_cpu)
 {
-	int i, ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
+	int ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
 	struct hrtimer_cpu_base *old_base, *new_base;
 
 	old_base = this_cpu_ptr(&hrtimer_bases);
···
 	raw_spin_lock(&old_base->lock);
 	raw_spin_lock_nested(&new_base->lock, SINGLE_DEPTH_NESTING);
 
-	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
-		migrate_hrtimer_list(&old_base->clock_base[i],
-				     &new_base->clock_base[i]);
-	}
+	for (int i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
+		migrate_hrtimer_list(&old_base->clock_base[i], &new_base->clock_base[i]);
 
 	/* Tell the other CPU to retrigger the next event */
 	smp_call_function_single(ncpu, retrigger_next_event, NULL, 0);
 
 	raw_spin_unlock(&new_base->lock);
-	old_base->online = 0;
+	old_base->online = false;
 	raw_spin_unlock(&old_base->lock);
 
 	return 0;
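Two small decisions in the hrtimer hunks above are easy to lose in the diff noise: the lazy rearm shortcut for `is_lazy` timers in the start path, and the capped back-off applied in `hrtimer_rearm()` once a hang was detected. The following is a minimal userspace sketch of just those two comparisons. The function names, the `SKETCH_NSEC_PER_MSEC` constant and the plain `int64_t` nanosecond values are assumptions made for illustration and are not kernel API. It also ignores the `skip_equal` handling inside `hrtimer_force_reprogram()`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace constant; stands in for the kernel's NSEC_PER_MSEC. */
#define SKETCH_NSEC_PER_MSEC 1000000LL

/*
 * Lazy rearm check (illustrative model): a lazy timer, such as the HRTICK
 * timer, only forces a hardware reprogram when its new expiry is earlier
 * than the expiry the device is already armed for. In this simplified
 * model every non-lazy timer reaches the reprogram step.
 */
static bool sketch_needs_reprogram(bool is_lazy, int64_t armed_expiry_ns,
				   int64_t new_expiry_ns)
{
	if (is_lazy && armed_expiry_ns <= new_expiry_ns)
		return false;
	return true;
}

/*
 * Hang back-off (illustrative model): after a detected hang the next event
 * is pushed out by the longest observed hang time, capped at 100ms.
 */
static int64_t sketch_hang_backoff(int64_t now_ns, int64_t max_hang_time_ns)
{
	int64_t delay = max_hang_time_ns;

	if (delay > 100 * SKETCH_NSEC_PER_MSEC)
		delay = 100 * SKETCH_NSEC_PER_MSEC;
	return now_ns + delay;
}
```

In this model a lazy timer whose expiry moves further into the future leaves the hardware untouched and accepts one early interrupt, which matches the trade-off described in the HRTICK comment.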
-1
kernel/time/jiffies.c
···
 static struct clocksource clocksource_jiffies = {
 	.name			= "jiffies",
 	.rating			= 1, /* lowest valid rating*/
-	.uncertainty_margin	= 32 * NSEC_PER_MSEC,
 	.read			= jiffies_read,
 	.mask			= CLOCKSOURCE_MASK(32),
 	.mult			= TICK_NSEC << JIFFIES_SHIFT, /* details above */
+1 -1
kernel/time/posix-timers.c
···
 	}
 
 	/*
-	 * There should be no timers on the ignored list. itimer_delete() has
+	 * There should be no timers on the ignored list. posix_timer_delete() has
 	 * mopped them up.
 	 */
 	if (!WARN_ON_ONCE(!hlist_empty(&tsk->signal->ignored_posix_timers)))
-1
kernel/time/tick-broadcast-hrtimer.c
···
 	.set_state_shutdown	= bc_shutdown,
 	.set_next_ktime		= bc_set_next,
 	.features		= CLOCK_EVT_FEAT_ONESHOT |
-				  CLOCK_EVT_FEAT_KTIME |
 				  CLOCK_EVT_FEAT_HRTIMER,
 	.rating			= 0,
 	.bound_on		= -1,
+20 -7
kernel/time/tick-sched.c
···
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
+/* Simplified variant of hrtimer_forward_now() */
+static ktime_t tick_forward_now(ktime_t expires, ktime_t now)
+{
+	ktime_t delta = now - expires;
+
+	if (likely(delta < TICK_NSEC))
+		return expires + TICK_NSEC;
+
+	expires += TICK_NSEC * ktime_divns(delta, TICK_NSEC);
+	if (expires > now)
+		return expires;
+	return expires + TICK_NSEC;
+}
+
 static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 {
-	hrtimer_cancel(&ts->sched_timer);
-	hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
+	ktime_t expires = ts->last_tick;
 
-	/* Forward the time to expire in the future */
-	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
+	if (now >= expires)
+		expires = tick_forward_now(expires, now);
 
 	if (tick_sched_flag_test(ts, TS_FLAG_HIGHRES)) {
-		hrtimer_start_expires(&ts->sched_timer,
-				      HRTIMER_MODE_ABS_PINNED_HARD);
+		hrtimer_start(&ts->sched_timer, expires, HRTIMER_MODE_ABS_PINNED_HARD);
 	} else {
-		tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
+		hrtimer_set_expires(&ts->sched_timer, expires);
+		tick_program_event(expires, 1);
 	}
 
 	/*
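The arithmetic of `tick_forward_now()` in the tick-sched.c hunk above can be checked outside the kernel. The sketch below is a userspace model under stated assumptions: `ktime_t` is replaced by `int64_t` nanoseconds, `ktime_divns()` by plain integer division, and `SKETCH_TICK_NSEC` is a hypothetical 1 ms tick standing in for `TICK_NSEC`. As in `tick_nohz_restart()`, the caller is expected to pass `now >= expires`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for TICK_NSEC: a 1 ms tick, as with HZ=1000. */
#define SKETCH_TICK_NSEC 1000000LL

/*
 * Userspace model of tick_forward_now(): advance @expires in whole tick
 * periods until it lies strictly after @now, keeping the tick phase.
 */
static int64_t sketch_tick_forward_now(int64_t expires, int64_t now)
{
	int64_t delta = now - expires;

	/* The kernel caller only forwards when now >= expires. */
	assert(delta >= 0);

	/* Common case: less than one tick late, one step is enough. */
	if (delta < SKETCH_TICK_NSEC)
		return expires + SKETCH_TICK_NSEC;

	/* Skip every whole tick period that has already elapsed. */
	expires += SKETCH_TICK_NSEC * (delta / SKETCH_TICK_NSEC);
	if (expires > now)
		return expires;
	return expires + SKETCH_TICK_NSEC;
}
```

With a last tick at 500 ns and `now` at 2.5 ms, the model returns 3000500 ns: the result stays on the original tick grid instead of being re-based on `now`, which is what lets the new code call `hrtimer_start()` directly with an absolute expiry.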
+181 -26
kernel/time/timekeeping.c
···
  * Kernel timekeeping code and accessor functions. Based on code from
  * timer.c, moved in commit 8524070b7982.
  */
-#include <linux/timekeeper_internal.h>
-#include <linux/module.h>
-#include <linux/interrupt.h>
-#include <linux/kobject.h>
-#include <linux/percpu.h>
-#include <linux/init.h>
-#include <linux/mm.h>
-#include <linux/nmi.h>
-#include <linux/sched.h>
-#include <linux/sched/loadavg.h>
-#include <linux/sched/clock.h>
-#include <linux/syscore_ops.h>
+#include <linux/audit.h>
 #include <linux/clocksource.h>
+#include <linux/compiler.h>
 #include <linux/jiffies.h>
+#include <linux/kobject.h>
+#include <linux/module.h>
+#include <linux/nmi.h>
+#include <linux/pvclock_gtod.h>
+#include <linux/random.h>
+#include <linux/sched/clock.h>
+#include <linux/sched/loadavg.h>
+#include <linux/static_key.h>
+#include <linux/stop_machine.h>
+#include <linux/syscore_ops.h>
+#include <linux/tick.h>
 #include <linux/time.h>
 #include <linux/timex.h>
-#include <linux/tick.h>
-#include <linux/stop_machine.h>
-#include <linux/pvclock_gtod.h>
-#include <linux/compiler.h>
-#include <linux/audit.h>
-#include <linux/random.h>
+#include <linux/timekeeper_internal.h>
 
 #include <vdso/auxclock.h>
 
 #include "tick-internal.h"
-#include "ntp_internal.h"
 #include "timekeeping_internal.h"
+#include "ntp_internal.h"
 
 #define TK_CLEAR_NTP		(1 << 0)
 #define TK_CLOCK_WAS_SET	(1 << 1)
···
 	tk->monotonic_to_boot = ktime_to_timespec64(tk->offs_boot);
 }
 
+#ifdef CONFIG_ARCH_WANTS_CLOCKSOURCE_READ_INLINE
+#include <asm/clock_inlined.h>
+
+static DEFINE_STATIC_KEY_FALSE(clocksource_read_inlined);
+
 /*
  * tk_clock_read - atomic clocksource read() helper
  *
···
  * a read of the fast-timekeeper tkrs (which is protected by its own locking
  * and update logic).
  */
-static inline u64 tk_clock_read(const struct tk_read_base *tkr)
+static __always_inline u64 tk_clock_read(const struct tk_read_base *tkr)
+{
+	struct clocksource *clock = READ_ONCE(tkr->clock);
+
+	if (static_branch_likely(&clocksource_read_inlined))
+		return arch_inlined_clocksource_read(clock);
+
+	return clock->read(clock);
+}
+
+static inline void clocksource_disable_inline_read(void)
+{
+	static_branch_disable(&clocksource_read_inlined);
+}
+
+static inline void clocksource_enable_inline_read(void)
+{
+	static_branch_enable(&clocksource_read_inlined);
+}
+#else
+static __always_inline u64 tk_clock_read(const struct tk_read_base *tkr)
 {
 	struct clocksource *clock = READ_ONCE(tkr->clock);
 
 	return clock->read(clock);
 }
+static inline void clocksource_disable_inline_read(void) { }
+static inline void clocksource_enable_inline_read(void) { }
+#endif
 
 /**
  * tk_setup_internals - Set up internals to use clocksource clock.
···
 	tk->tkr_raw.mult = clock->mult;
 	tk->ntp_err_mult = 0;
 	tk->skip_second_overflow = 0;
+
+	tk->cs_id = clock->id;
+
+	/* Coupled clockevent data */
+	if (IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED) &&
+	    clock->flags & CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT) {
+		/*
+		 * Aim for an one hour maximum delta and use KHz to handle
+		 * clocksources with a frequency above 4GHz correctly as
+		 * the frequency argument of clocks_calc_mult_shift() is u32.
+		 */
+		clocks_calc_mult_shift(&tk->cs_ns_to_cyc_mult, &tk->cs_ns_to_cyc_shift,
+				       NSEC_PER_MSEC, clock->freq_khz, 3600 * 1000);
+		/*
+		 * Initialize the conversion limit as the previous clocksource
+		 * might have the same shift/mult pair so the quick check in
+		 * tk_update_ns_to_cyc() fails to update it after a clocksource
+		 * change leaving it effectively zero.
+		 */
+		tk->cs_ns_to_cyc_maxns = div_u64(clock->mask, tk->cs_ns_to_cyc_mult);
+	}
 }

 /* Timekeeper helper functions. */
@@ -375,7 +420,7 @@
 	return mul_u64_u32_add_u64_shr(delta, tkr->mult, tkr->xtime_nsec, tkr->shift);
 }

-static inline u64 timekeeping_cycles_to_ns(const struct tk_read_base *tkr, u64 cycles)
+static __always_inline u64 timekeeping_cycles_to_ns(const struct tk_read_base *tkr, u64 cycles)
 {
 	/* Calculate the delta since the last update_wall_time() */
 	u64 mask = tkr->mask, delta = (cycles - tkr->cycle_last) & mask;
@@ -696,6 +741,36 @@
 	tk->tkr_raw.base = ns_to_ktime(tk->raw_sec * NSEC_PER_SEC);
 }

+static inline void tk_update_ns_to_cyc(struct timekeeper *tks, struct timekeeper *tkc)
+{
+	struct tk_read_base *tkrs = &tks->tkr_mono;
+	struct tk_read_base *tkrc = &tkc->tkr_mono;
+	unsigned int shift;
+
+	if (!IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED) ||
+	    !(tkrs->clock->flags & CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT))
+		return;
+
+	if (tkrs->mult == tkrc->mult && tkrs->shift == tkrc->shift)
+		return;
+	/*
+	 * The conversion math is simple:
+	 *
+	 *      CS::MULT         (1 << NS_TO_CYC_SHIFT)
+	 * ---------------- = ----------------------
+	 * (1 << CS::SHIFT)       NS_TO_CYC_MULT
+	 *
+	 * Ergo:
+	 *
+	 * NS_TO_CYC_MULT = (1 << (CS::SHIFT + NS_TO_CYC_SHIFT)) / CS::MULT
+	 *
+	 * NS_TO_CYC_SHIFT has been set up in tk_setup_internals()
+	 */
+	shift = tkrs->shift + tks->cs_ns_to_cyc_shift;
+	tks->cs_ns_to_cyc_mult = (u32)div_u64(1ULL << shift, tkrs->mult);
+	tks->cs_ns_to_cyc_maxns = div_u64(tkrs->clock->mask, tks->cs_ns_to_cyc_mult);
+}
+
 /*
  * Restore the shadow timekeeper from the real timekeeper.
  */
@@ -730,6 +805,7 @@
 	tk->tkr_mono.base_real = tk->tkr_mono.base + tk->offs_real;

 	if (tk->id == TIMEKEEPER_CORE) {
+		tk_update_ns_to_cyc(tk, &tkd->timekeeper);
 		update_vsyscall(tk);
 		update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);

@@ -782,6 +858,71 @@
 		delta -= incr;
 	}
 	tk_update_coarse_nsecs(tk);
+}
+
+/*
+ * ktime_expiry_to_cycles - Convert an expiry time to clocksource cycles
+ * @id:		Clocksource ID which is required for validity
+ * @expires_ns:	Absolute CLOCK_MONOTONIC expiry time (nsecs) to be converted
+ * @cycles:	Pointer to storage for corresponding absolute cycles value
+ *
+ * Convert a CLOCK_MONOTONIC based absolute expiry time to a cycles value
+ * based on the correlated clocksource of the clockevent device by using
+ * the base nanoseconds and cycles values of the last timekeeper update and
+ * converting the delta between @expires_ns and base nanoseconds to cycles.
+ *
+ * This only works for clockevent devices which are using a less than or
+ * equal comparator against the clocksource.
+ *
+ * Utilizing this avoids two clocksource reads for such devices, the
+ * ktime_get() in clockevents_program_event() to calculate the delta expiry
+ * value and the readout in the device::set_next_event() callback to
+ * convert the delta back to an absolute comparator value.
+ *
+ * Returns: True if @id matches the current clocksource ID, false otherwise
+ */
+bool ktime_expiry_to_cycles(enum clocksource_ids id, ktime_t expires_ns, u64 *cycles)
+{
+	struct timekeeper *tk = &tk_core.timekeeper;
+	struct tk_read_base *tkrm = &tk->tkr_mono;
+	ktime_t base_ns, delta_ns, max_ns;
+	u64 base_cycles, delta_cycles;
+	unsigned int seq;
+	u32 mult, shift;
+
+	/*
+	 * Racy check to avoid the seqcount overhead when ID does not match. If
+	 * the relevant clocksource is installed concurrently, then this will
+	 * just delay the switch over to this mechanism until the next event is
+	 * programmed. If the ID is not matching the clock events code will use
+	 * the regular relative set_next_event() callback as before.
+	 */
+	if (data_race(tk->cs_id) != id)
+		return false;
+
+	do {
+		seq = read_seqcount_begin(&tk_core.seq);
+
+		if (tk->cs_id != id)
+			return false;
+
+		base_cycles = tkrm->cycle_last;
+		base_ns = tkrm->base + (tkrm->xtime_nsec >> tkrm->shift);
+
+		mult = tk->cs_ns_to_cyc_mult;
+		shift = tk->cs_ns_to_cyc_shift;
+		max_ns = tk->cs_ns_to_cyc_maxns;
+
+	} while (read_seqcount_retry(&tk_core.seq, seq));
+
+	/* Prevent negative deltas and multiplication overflows */
+	delta_ns = min(expires_ns - base_ns, max_ns);
+	delta_ns = max(delta_ns, 0);
+
+	/* Convert to cycles */
+	delta_cycles = ((u64)delta_ns * mult) >> shift;
+	*cycles = base_cycles + delta_cycles;
+	return true;
 }

 /**
@@ -848,7 +989,7 @@
 }
 EXPORT_SYMBOL_GPL(ktime_get_resolution_ns);

-static ktime_t *offsets[TK_OFFS_MAX] = {
+static const ktime_t *const offsets[TK_OFFS_MAX] = {
 	[TK_OFFS_REAL]	= &tk_core.timekeeper.offs_real,
 	[TK_OFFS_BOOT]	= &tk_core.timekeeper.offs_boot,
 	[TK_OFFS_TAI]	= &tk_core.timekeeper.offs_tai,
@@ -857,8 +998,9 @@
 ktime_t ktime_get_with_offset(enum tk_offsets offs)
 {
 	struct timekeeper *tk = &tk_core.timekeeper;
+	const ktime_t *offset = offsets[offs];
 	unsigned int seq;
-	ktime_t base, *offset = offsets[offs];
+	ktime_t base;
 	u64 nsecs;

 	WARN_ON(timekeeping_suspended);
@@ -878,8 +1020,9 @@
 ktime_t ktime_get_coarse_with_offset(enum tk_offsets offs)
 {
 	struct timekeeper *tk = &tk_core.timekeeper;
-	ktime_t base, *offset = offsets[offs];
+	const ktime_t *offset = offsets[offs];
 	unsigned int seq;
+	ktime_t base;
 	u64 nsecs;

 	WARN_ON(timekeeping_suspended);
@@ -902,7 +1045,7 @@
  */
 ktime_t ktime_mono_to_any(ktime_t tmono, enum tk_offsets offs)
 {
-	ktime_t *offset = offsets[offs];
+	const ktime_t *offset = offsets[offs];
 	unsigned int seq;
 	ktime_t tconv;

@@ -1631,7 +1774,19 @@

 	if (tk->tkr_mono.clock == clock)
 		return 0;
+
+	/* Disable inlined reads across the clocksource switch */
+	clocksource_disable_inline_read();
+
 	stop_machine(change_clocksource, clock, NULL);
+
+	/*
+	 * If the clocksource has been selected and supports inlined reads
+	 * enable the branch.
+	 */
+	if (tk->tkr_mono.clock == clock && clock->flags & CLOCK_SOURCE_CAN_INLINE_READ)
+		clocksource_enable_inline_read();
+
 	tick_clock_notify();
 	return tk->tkr_mono.clock == clock ? 0 : -1;
 }
@@ -2834,7 +2989,7 @@
 			continue;

 		timekeeping_forward_now(tks);
-		tk_setup_internals(tks, tk_core.timekeeper.tkr_mono.clock);
+		tk_setup_internals(tks, tk_core.timekeeper.tkr_raw.clock);
 		timekeeping_update_from_shadow(tkd, TK_UPDATE_ALL);
 	}
 }
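The conversion math in the tk_update_ns_to_cyc() comment and the clamping in ktime_expiry_to_cycles() can be tried out in userspace. The following is only a sketch under stated assumptions, not kernel code: struct ns_to_cyc, ns_to_cyc_update() and expiry_to_cycles() are hypothetical names, the seqcount protection is left out, and the ns-to-cycles shift is assumed to be chosen by the caller beforehand (the kernel derives it via clocks_calc_mult_shift()).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace model, not the kernel data structure.
 * The clocksource converts cycles to nanoseconds with
 *   ns = (cycles * cs_mult) >> cs_shift
 * and this struct holds the reciprocal factors for ns -> cycles.
 */
struct ns_to_cyc {
	uint32_t mult;		/* reciprocal multiplier, derived below */
	uint32_t shift;		/* assumed to be set up by the caller */
	uint64_t max_ns;	/* largest delta which cannot overflow */
};

/* NS_TO_CYC_MULT = (1 << (CS::SHIFT + NS_TO_CYC_SHIFT)) / CS::MULT */
static void ns_to_cyc_update(struct ns_to_cyc *c, uint32_t cs_mult,
			     uint32_t cs_shift, uint64_t cs_mask)
{
	assert(cs_mult != 0 && cs_shift + c->shift < 64);

	c->mult = (uint32_t)((1ULL << (cs_shift + c->shift)) / cs_mult);
	c->max_ns = cs_mask / c->mult;
}

/* Convert an absolute expiry (ns) to an absolute cycles value */
static uint64_t expiry_to_cycles(const struct ns_to_cyc *c, uint64_t base_cycles,
				 int64_t base_ns, int64_t expires_ns)
{
	int64_t delta_ns = expires_ns - base_ns;

	/* Clamp: no negative deltas, no multiplication overflow */
	if (delta_ns < 0)
		delta_ns = 0;
	if ((uint64_t)delta_ns > c->max_ns)
		delta_ns = (int64_t)c->max_ns;

	return base_cycles + (((uint64_t)delta_ns * c->mult) >> c->shift);
}
```

For a 2 GHz counter modelled as cs_mult = 1 << 23 with cs_shift = 24 (0.5 ns per cycle) and an ns-to-cycles shift of 24, the derived multiplier is 1 << 25, so a 1000 ns delta maps to 2000 cycles on top of the base cycle count.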
+2
kernel/time/timekeeping.h
@@ -9,6 +9,8 @@
 					    ktime_t *offs_boot,
 					    ktime_t *offs_tai);

+bool ktime_expiry_to_cycles(enum clocksource_ids id, ktime_t expires_ns, u64 *cycles);
+
 extern int timekeeping_valid_for_hres(void);
 extern u64 timekeeping_max_deferment(void);
 extern void timekeeping_warp_clock(void);
+3 -2
kernel/time/timer.c
@@ -2319,6 +2319,7 @@
  */
 void timer_clear_idle(void)
 {
+	int this_cpu = smp_processor_id();
 	/*
 	 * We do this unlocked. The worst outcome is a remote pinned timer
 	 * enqueue sending a pointless IPI, but taking the lock would just
@@ -2327,9 +2328,9 @@
 	 * path. Required for BASE_LOCAL only.
 	 */
 	__this_cpu_write(timer_bases[BASE_LOCAL].is_idle, false);
-	if (tick_nohz_full_cpu(smp_processor_id()))
+	if (tick_nohz_full_cpu(this_cpu))
 		__this_cpu_write(timer_bases[BASE_GLOBAL].is_idle, false);
-	trace_timer_base_idle(false, smp_processor_id());
+	trace_timer_base_idle(false, this_cpu);

 	/* Activate without holding the timer_base->lock */
 	tmigr_cpu_activate();
+7 -9
kernel/time/timer_list.c
@@ -47,7 +47,7 @@
 	    int idx, u64 now)
 {
 	SEQ_printf(m, " #%d: <%p>, %ps", idx, taddr, ACCESS_PRIVATE(timer, function));
-	SEQ_printf(m, ", S:%02x", timer->state);
+	SEQ_printf(m, ", S:%02x", timer->is_queued);
 	SEQ_printf(m, "\n");
 	SEQ_printf(m, " # expires at %Lu-%Lu nsecs [in %Ld to %Ld nsecs]\n",
 		(unsigned long long)ktime_to_ns(hrtimer_get_softexpires(timer)),
@@ -56,13 +56,11 @@
 		(long long)(ktime_to_ns(hrtimer_get_expires(timer)) - now));
 }

-static void
-print_active_timers(struct seq_file *m, struct hrtimer_clock_base *base,
-		    u64 now)
+static void print_active_timers(struct seq_file *m, struct hrtimer_clock_base *base, u64 now)
 {
+	struct timerqueue_linked_node *curr;
 	struct hrtimer *timer, tmp;
 	unsigned long next = 0, i;
-	struct timerqueue_node *curr;
 	unsigned long flags;

 next_one:
@@ -72,13 +70,13 @@

 	raw_spin_lock_irqsave(&base->cpu_base->lock, flags);

-	curr = timerqueue_getnext(&base->active);
+	curr = timerqueue_linked_first(&base->active);
 	/*
 	 * Crude but we have to do this O(N*N) thing, because
 	 * we have to unlock the base when printing:
 	 */
 	while (curr && i < next) {
-		curr = timerqueue_iterate_next(curr);
+		curr = timerqueue_linked_next(curr);
 		i++;
 	}

@@ -103,8 +101,8 @@

 	SEQ_printf(m, " .resolution: %u nsecs\n", hrtimer_resolution);
 #ifdef CONFIG_HIGH_RES_TIMERS
-	SEQ_printf(m, " .offset: %Lu nsecs\n",
-		   (unsigned long long) ktime_to_ns(base->offset));
+	SEQ_printf(m, " .offset: %Ld nsecs\n",
+		   (long long) base->offset);
 #endif
 	SEQ_printf(m, "active timers:\n");
 	print_active_timers(m, base, now + ktime_to_ns(base->offset));
+2 -2
kernel/trace/trace_events_synth.c
@@ -395,7 +395,7 @@
 			n_u64++;
 		} else {
 			struct trace_print_flags __flags[] = {
-				__def_gfpflag_names, {-1, NULL} };
+				__def_gfpflag_names };
 			char *space = (i == se->n_fields - 1 ? "" : " ");

 			print_synth_event_num_val(s, print_fmt,
@@ -408,7 +408,7 @@
 				trace_seq_puts(s, " (");
 				trace_print_flags_seq(s, "|",
 						      entry->fields[n_u64].as_u64,
-						      __flags);
+						      __flags, ARRAY_SIZE(__flags));
 				trace_seq_putc(s, ')');
 			}
 			n_u64++;
+12 -8
kernel/trace/trace_output.c
@@ -69,14 +69,15 @@
 const char *
 trace_print_flags_seq(struct trace_seq *p, const char *delim,
 		      unsigned long flags,
-		      const struct trace_print_flags *flag_array)
+		      const struct trace_print_flags *flag_array,
+		      size_t flag_array_size)
 {
 	unsigned long mask;
 	const char *str;
 	const char *ret = trace_seq_buffer_ptr(p);
 	int i, first = 1;

-	for (i = 0; flag_array[i].name && flags; i++) {
+	for (i = 0; i < flag_array_size && flags; i++) {

 		mask = flag_array[i].mask;
 		if ((flags & mask) != mask)
@@ -106,12 +107,13 @@

 const char *
 trace_print_symbols_seq(struct trace_seq *p, unsigned long val,
-			const struct trace_print_flags *symbol_array)
+			const struct trace_print_flags *symbol_array,
+			size_t symbol_array_size)
 {
 	int i;
 	const char *ret = trace_seq_buffer_ptr(p);

-	for (i = 0; symbol_array[i].name; i++) {
+	for (i = 0; i < symbol_array_size; i++) {

 		if (val != symbol_array[i].mask)
 			continue;
@@ -133,14 +135,15 @@
 const char *
 trace_print_flags_seq_u64(struct trace_seq *p, const char *delim,
 			  unsigned long long flags,
-			  const struct trace_print_flags_u64 *flag_array)
+			  const struct trace_print_flags_u64 *flag_array,
+			  size_t flag_array_size)
 {
 	unsigned long long mask;
 	const char *str;
 	const char *ret = trace_seq_buffer_ptr(p);
 	int i, first = 1;

-	for (i = 0; flag_array[i].name && flags; i++) {
+	for (i = 0; i < flag_array_size && flags; i++) {

 		mask = flag_array[i].mask;
 		if ((flags & mask) != mask)
@@ -170,12 +173,13 @@

 const char *
 trace_print_symbols_seq_u64(struct trace_seq *p, unsigned long long val,
-			    const struct trace_print_flags_u64 *symbol_array)
+			    const struct trace_print_flags_u64 *symbol_array,
+			    size_t symbol_array_size)
 {
 	int i;
 	const char *ret = trace_seq_buffer_ptr(p);

-	for (i = 0; symbol_array[i].name; i++) {
+	for (i = 0; i < symbol_array_size; i++) {

 		if (val != symbol_array[i].mask)
 			continue;
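The trace_output.c hunks replace the { -1, NULL } sentinel termination of the flag tables with an explicit element count passed by the caller. A minimal userspace sketch of the size-bounded loop follows. It is an illustration only: struct flag_name and print_flags() are hypothetical names, and the output goes to a plain character buffer via snprintf() rather than to a struct trace_seq.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical stand-in for a mask/name table entry */
struct flag_name {
	unsigned long mask;
	const char *name;
};

/*
 * Write the names of all flags set in @flags into @buf, joined by @delim.
 * The table is bounded by @count instead of a sentinel entry.
 * Returns the number of bytes written for entries which fit completely.
 */
static size_t print_flags(char *buf, size_t len, const char *delim,
			  unsigned long flags,
			  const struct flag_name *names, size_t count)
{
	size_t used = 0;
	int first = 1;

	assert(buf && len > 0);
	buf[0] = '\0';

	for (size_t i = 0; i < count && flags; i++) {
		unsigned long mask = names[i].mask;
		int n;

		if ((flags & mask) != mask)
			continue;
		flags &= ~mask;

		n = snprintf(buf + used, len - used, "%s%s",
			     first ? "" : delim, names[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* output truncated, stop early */
		used += (size_t)n;
		first = 0;
	}
	return used;
}
```

A caller passes ARRAY_SIZE() of its table, as the updated call sites in this merge do. With the sentinel gone, the table no longer needs a trailing dummy entry and the loop cannot run past the end of a table that lacks one.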
+1 -2
kernel/trace/trace_syscalls.c
@@ -174,7 +174,6 @@
 		{ O_NOFOLLOW, "O_NOFOLLOW" },
 		{ O_NOATIME, "O_NOATIME" },
 		{ O_CLOEXEC, "O_CLOEXEC" },
-		{ -1, NULL }
 	};

 	trace_seq_printf(s, "%s(", entry->name);
@@ -205,7 +204,7 @@
 				trace_seq_puts(s, "O_RDONLY|");
 			}

-			trace_print_flags_seq(s, "|", bits, __flags);
+			trace_print_flags_seq(s, "|", bits, __flags, ARRAY_SIZE(__flags));
 			/*
 			 * trace_print_flags_seq() adds a '\0' to the
 			 * buffer, but this needs to append more to the seq.
+17
lib/rbtree.c
@@ -446,6 +446,23 @@
 }
 EXPORT_SYMBOL(rb_erase);

+bool rb_erase_linked(struct rb_node_linked *node, struct rb_root_linked *root)
+{
+	if (node->prev)
+		node->prev->next = node->next;
+	else
+		root->rb_leftmost = node->next;
+
+	if (node->next)
+		node->next->prev = node->prev;
+
+	rb_erase(&node->node, &root->rb_root);
+	RB_CLEAR_LINKED_NODE(node);
+
+	return !!root->rb_leftmost;
+}
+EXPORT_SYMBOL_GPL(rb_erase_linked);
+
 /*
  * Augmented rbtree manipulation functions.
  *
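The neighbor-link bookkeeping in rb_erase_linked() can be looked at in isolation from the rbtree itself. The sketch below models only the prev/next list and the leftmost pointer, plus the rearm check mentioned in the pull request text (whether a modified expiry time keeps the timer in its current position). All names are hypothetical, the rb_erase() step is omitted, and treating an expiry equal to a neighbor's as "position unchanged" is an assumption of this sketch, not something the diff confirms.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a node with neighbor links (no rbtree part) */
struct linked_node {
	struct linked_node *prev, *next;
	long expires;
};

struct linked_head {
	struct linked_node *leftmost;	/* earliest expiring node */
};

/* Unlink @node; returns true if nodes remain queued afterwards */
static bool linked_erase(struct linked_node *node, struct linked_head *head)
{
	if (node->prev)
		node->prev->next = node->next;
	else
		head->leftmost = node->next;	/* node was the first one */

	if (node->next)
		node->next->prev = node->prev;

	node->prev = node->next = NULL;
	return head->leftmost != NULL;
}

/*
 * Check whether changing the expiry of @node to @new_expires keeps it
 * between its current neighbors. Tie handling is an assumption here.
 */
static bool keeps_position(const struct linked_node *node, long new_expires)
{
	if (node->prev && new_expires < node->prev->expires)
		return false;
	if (node->next && new_expires > node->next->expires)
		return false;
	return true;
}
```

Both operations touch only the node and its direct neighbors, so the next expiry after a removal is available from the leftmost pointer without walking the tree.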
+14
lib/timerqueue.c
@@ -82,3 +82,17 @@
 	return container_of(next, struct timerqueue_node, node);
 }
 EXPORT_SYMBOL_GPL(timerqueue_iterate_next);
+
+#define __node_2_tq_linked(_n) \
+	container_of(rb_entry((_n), struct rb_node_linked, node), struct timerqueue_linked_node, node)
+
+static __always_inline bool __tq_linked_less(struct rb_node *a, const struct rb_node *b)
+{
+	return __node_2_tq_linked(a)->expires < __node_2_tq_linked(b)->expires;
+}
+
+bool timerqueue_linked_add(struct timerqueue_linked_head *head, struct timerqueue_linked_node *node)
+{
+	return rb_add_linked(&node->node, &head->rb_root, __tq_linked_less);
+}
+EXPORT_SYMBOL_GPL(timerqueue_linked_add);
+1 -1
scripts/gdb/linux/timerlist.py
@@ -20,7 +20,7 @@
     We can't read the hardware timer itself to add any nanoseconds
     that need to be added since we last stored the time in the
     timekeeper. But this is probably good enough for debug purposes."""
-    tk_core = gdb.parse_and_eval("&tk_core")
+    tk_core = gdb.parse_and_eval("&timekeeper_data[TIMEKEEPER_CORE]")

     return tk_core['timekeeper']['tkr_mono']['base']
