Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

timekeeping: Provide infrastructure for coupled clockevents

Some architectures have clockevent devices which are coupled to the system
clocksource by implementing a less than or equal comparator which compares
the programmed absolute expiry time against the underlying time
counter. Well known examples are TSC/TSC deadline timer and the S390 TOD
clocksource/comparator.

While the concept is nice it has some downsides:

1) The clockevents core code is strictly based on relative expiry times
as that's the most common case for clockevent device hardware. That
requires converting the absolute expiry time provided by the caller
(hrtimers, NOHZ code) to a relative expiry time by reading and
subtracting the current time.

The clockevent::set_next_event() callback must then read the counter
again to convert the relative expiry back into an absolute one.

2) The conversion factors from nanoseconds to counter clock cycles are
set up when the clockevent is registered. When NTP applies corrections
then the clockevent conversion factors can deviate from the
clocksource conversion substantially which either results in timers
firing late or in the worst case early. The early expiry then needs to
do a reprogram with a short delta.

In most cases this is papered over by the fact that the read in the
set_next_event() callback happens after the read which is used to
calculate the delta. So the tendency is that timers expire mostly
late.
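
To make downside 2) concrete, here is a minimal userspace sketch, not kernel code. It assumes a hypothetical 1 GHz counter, a shift of 24 for both conversion directions, and made-up helper names (`scale`, `expiry_error_ns`). The clockevent factor stays at its registration value while the clocksource factor is nudged the way an NTP correction might nudge it.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed-point layout for this sketch: result = (x * mult) >> shift */
#define SKETCH_SHIFT 24u

/* Generic fixed-point scaling, used for both conversion directions */
static uint64_t scale(uint64_t x, uint32_t mult, uint32_t shift)
{
	assert(shift < 64);
	return (x * mult) >> shift;
}

/*
 * Program a relative expiry of delta_ns with the (fixed) clockevent factor
 * ce_mult, then measure how much time the clocksource reports once that
 * many cycles have elapsed, using the (NTP adjusted) factor cs_mult.
 *
 * Returns the signed error in nanoseconds: positive means the timer fires
 * late, negative means it fires early.
 */
static int64_t expiry_error_ns(uint64_t delta_ns, uint32_t ce_mult, uint32_t cs_mult)
{
	uint64_t cycles = scale(delta_ns, ce_mult, SKETCH_SHIFT);
	uint64_t elapsed_ns = scale(cycles, cs_mult, SKETCH_SHIFT);

	return (int64_t)elapsed_ns - (int64_t)delta_ns;
}
```

With a nominal factor of `1 << 24` on both sides the error is zero. Raising the clocksource factor by roughly 100 ppm makes a one second timer fire about 100 microseconds late in this model, and lowering it by the same amount makes it fire early, which is the case that forces a reprogram.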

All of this can be avoided by providing support for these devices in the
core code:

1) The timekeeping core keeps track of the last update to the clocksource
by storing the base nanoseconds and the corresponding clocksource
counter value. That's used to keep the conversion math for reading the
time within 64-bit in the common case.

This information can be used to avoid both reads of the underlying
clocksource in the clockevents reprogramming path:

delta = expiry - base_ns;
cycles = base_cycles + ((delta * clockevent::mult) >> clockevent::shift);

The resulting cycles value can be directly used to program the
comparator.

2) As #1 no longer provides the "compensation" through the second
read, the deviation between the clocksource and clockevent conversions
caused by NTP becomes more prominent.

This can be cured by letting the timekeeping core compute and store
the reverse conversion factors when the clocksource cycles to
nanoseconds factors are modified by NTP:

    CS::MULT          (1 << NS_TO_CYC_SHIFT)
 ----------------  =  ----------------------
 (1 << CS::SHIFT)         NS_TO_CYC_MULT

Ergo: NS_TO_CYC_MULT = (1 << (CS::SHIFT + NS_TO_CYC_SHIFT)) / CS::MULT

The NS_TO_CYC_SHIFT value is calculated when the clocksource is
installed so that it aims for a one hour maximum sleep time.
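
The two steps above can be sketched in plain userspace C. This is an illustration under stated assumptions, not the kernel implementation: `struct tk_snapshot` and both helpers are hypothetical names, and the sketch omits the seqcount protection and the overflow clamp that the actual patch applies.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of the last timekeeper update (illustrative only) */
struct tk_snapshot {
	uint64_t base_ns;		/* nanoseconds at the last update */
	uint64_t base_cycles;		/* counter value at the last update */
	uint32_t cs_mult;		/* clocksource cycles -> ns multiplier */
	uint32_t cs_shift;		/* clocksource cycles -> ns shift */
	uint32_t ns_to_cyc_shift;	/* chosen once when the clocksource is installed */
};

/* NS_TO_CYC_MULT = (1 << (CS::SHIFT + NS_TO_CYC_SHIFT)) / CS::MULT */
static uint32_t ns_to_cyc_mult(const struct tk_snapshot *s)
{
	uint32_t shift = s->cs_shift + s->ns_to_cyc_shift;

	assert(s->cs_mult != 0 && shift < 64);
	return (uint32_t)((1ULL << shift) / s->cs_mult);
}

/*
 * Convert an absolute expiry in nanoseconds to an absolute counter value
 * without reading the counter. Expiries in the past map to base_cycles.
 * No overflow clamp here: callers of this sketch must keep the delta small.
 */
static uint64_t expiry_to_cycles(const struct tk_snapshot *s, uint64_t expiry_ns)
{
	uint64_t delta = expiry_ns > s->base_ns ? expiry_ns - s->base_ns : 0;

	return s->base_cycles + ((delta * ns_to_cyc_mult(s)) >> s->ns_to_cyc_shift);
}
```

For an assumed 2.5 GHz counter with `cs_mult = 6710886`, `cs_shift = 24` and `ns_to_cyc_shift = 8`, the reverse multiplier comes out as 640, so a delta of 1000 ns maps to 2500 cycles past `base_cycles`. Recomputing the multiplier whenever `cs_mult` changes keeps both directions derived from the same factor.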

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.944763521@kernel.org

Authored by Thomas Gleixner and committed by Peter Zijlstra
cd38bdb8 23028286

+124
+1
include/linux/clocksource.h
···
 #define CLOCK_SOURCE_RESELECT			0x100
 #define CLOCK_SOURCE_VERIFY_PERCPU		0x200
 #define CLOCK_SOURCE_CAN_INLINE_READ		0x400
+#define CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT	0x800

 /* simplify initialization of mask field */
 #define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0)
+8
include/linux/timekeeper_internal.h
···
  * @id: The timekeeper ID
  * @tkr_raw: The readout base structure for CLOCK_MONOTONIC_RAW
  * @raw_sec: CLOCK_MONOTONIC_RAW time in seconds
+ * @cs_id: The ID of the current clocksource
+ * @cs_ns_to_cyc_mult: Multiplicator for nanoseconds to cycles conversion
+ * @cs_ns_to_cyc_shift: Shift value for nanoseconds to cycles conversion
+ * @cs_ns_to_cyc_maxns: Maximum nanoseconds to cyles conversion range
  * @clock_was_set_seq: The sequence number of clock was set events
  * @cs_was_changed_seq: The sequence number of clocksource change events
  * @clock_valid: Indicator for valid clock
···
 	u64			raw_sec;

 	/* Cachline 3 and 4 (timekeeping internal variables): */
+	enum clocksource_ids	cs_id;
+	u32			cs_ns_to_cyc_mult;
+	u32			cs_ns_to_cyc_shift;
+	u64			cs_ns_to_cyc_maxns;
 	unsigned int		clock_was_set_seq;
 	u8			cs_was_changed_seq;
 	u8			clock_valid;
+3
kernel/time/Kconfig
···
 config GENERIC_CLOCKEVENTS_MIN_ADJUST
 	bool

+config GENERIC_CLOCKEVENTS_COUPLED
+	bool
+
 # Generic update of CMOS clock
 config GENERIC_CMOS_UPDATE
 	bool
+110
kernel/time/timekeeping.c
···
 	tk->tkr_raw.mult = clock->mult;
 	tk->ntp_err_mult = 0;
 	tk->skip_second_overflow = 0;
+
+	tk->cs_id = clock->id;
+
+	/* Coupled clockevent data */
+	if (IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED) &&
+	    clock->flags & CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT) {
+		/*
+		 * Aim for an one hour maximum delta and use KHz to handle
+		 * clocksources with a frequency above 4GHz correctly as
+		 * the frequency argument of clocks_calc_mult_shift() is u32.
+		 */
+		clocks_calc_mult_shift(&tk->cs_ns_to_cyc_mult, &tk->cs_ns_to_cyc_shift,
+				       NSEC_PER_MSEC, clock->freq_khz, 3600 * 1000);
+	}
 }

 /* Timekeeper helper functions. */
···
 	tk->tkr_raw.base = ns_to_ktime(tk->raw_sec * NSEC_PER_SEC);
 }

+static inline void tk_update_ns_to_cyc(struct timekeeper *tks, struct timekeeper *tkc)
+{
+	struct tk_read_base *tkrs = &tks->tkr_mono;
+	struct tk_read_base *tkrc = &tkc->tkr_mono;
+	unsigned int shift;
+
+	if (!IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED) ||
+	    !(tkrs->clock->flags & CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT))
+		return;
+
+	if (tkrs->mult == tkrc->mult && tkrs->shift == tkrc->shift)
+		return;
+	/*
+	 * The conversion math is simple:
+	 *
+	 *      CS::MULT       (1 << NS_TO_CYC_SHIFT)
+	 *   --------------- = ----------------------
+	 *   (1 << CS:SHIFT)       NS_TO_CYC_MULT
+	 *
+	 * Ergo:
+	 *
+	 *   NS_TO_CYC_MULT = (1 << (CS::SHIFT + NS_TO_CYC_SHIFT)) / CS::MULT
+	 *
+	 * NS_TO_CYC_SHIFT has been set up in tk_setup_internals()
+	 */
+	shift = tkrs->shift + tks->cs_ns_to_cyc_shift;
+	tks->cs_ns_to_cyc_mult = (u32)div_u64(1ULL << shift, tkrs->mult);
+	tks->cs_ns_to_cyc_maxns = div_u64(tkrs->clock->mask, tks->cs_ns_to_cyc_mult);
+}
+
 /*
  * Restore the shadow timekeeper from the real timekeeper.
  */
···
 	tk->tkr_mono.base_real = tk->tkr_mono.base + tk->offs_real;

 	if (tk->id == TIMEKEEPER_CORE) {
+		tk_update_ns_to_cyc(tk, &tkd->timekeeper);
 		update_vsyscall(tk);
 		update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);

···
 		delta -= incr;
 	}
 	tk_update_coarse_nsecs(tk);
+}
+
+/*
+ * ktime_expiry_to_cycles - Convert a expiry time to clocksource cycles
+ * @id: Clocksource ID which is required for validity
+ * @expires_ns: Absolute CLOCK_MONOTONIC expiry time (nsecs) to be converted
+ * @cycles: Pointer to storage for corresponding absolute cycles value
+ *
+ * Convert a CLOCK_MONOTONIC based absolute expiry time to a cycles value
+ * based on the correlated clocksource of the clockevent device by using
+ * the base nanoseconds and cycles values of the last timekeeper update and
+ * converting the delta between @expires_ns and base nanoseconds to cycles.
+ *
+ * This only works for clockevent devices which are using a less than or
+ * equal comparator against the clocksource.
+ *
+ * Utilizing this avoids two clocksource reads for such devices, the
+ * ktime_get() in clockevents_program_event() to calculate the delta expiry
+ * value and the readout in the device::set_next_event() callback to
+ * convert the delta back to a absolute comparator value.
+ *
+ * Returns: True if @id matches the current clocksource ID, false otherwise
+ */
+bool ktime_expiry_to_cycles(enum clocksource_ids id, ktime_t expires_ns, u64 *cycles)
+{
+	struct timekeeper *tk = &tk_core.timekeeper;
+	struct tk_read_base *tkrm = &tk->tkr_mono;
+	ktime_t base_ns, delta_ns, max_ns;
+	u64 base_cycles, delta_cycles;
+	unsigned int seq;
+	u32 mult, shift;
+
+	/*
+	 * Racy check to avoid the seqcount overhead when ID does not match. If
+	 * the relevant clocksource is installed concurrently, then this will
+	 * just delay the switch over to this mechanism until the next event is
+	 * programmed. If the ID is not matching the clock events code will use
+	 * the regular relative set_next_event() callback as before.
+	 */
+	if (data_race(tk->cs_id) != id)
+		return false;
+
+	do {
+		seq = read_seqcount_begin(&tk_core.seq);
+
+		if (tk->cs_id != id)
+			return false;
+
+		base_cycles = tkrm->cycle_last;
+		base_ns = tkrm->base + (tkrm->xtime_nsec >> tkrm->shift);
+
+		mult = tk->cs_ns_to_cyc_mult;
+		shift = tk->cs_ns_to_cyc_shift;
+		max_ns = tk->cs_ns_to_cyc_maxns;
+
+	} while (read_seqcount_retry(&tk_core.seq, seq));
+
+	/* Prevent negative deltas and multiplication overflows */
+	delta_ns = min(expires_ns - base_ns, max_ns);
+	delta_ns = max(delta_ns, 0);
+
+	/* Convert to cycles */
+	delta_cycles = ((u64)delta_ns * mult) >> shift;
+	*cycles = base_cycles + delta_cycles;
+	return true;
 }

 /**
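
The clamping at the end of ktime_expiry_to_cycles() can be modelled in userspace as follows. This is a sketch with hypothetical helper names, not the kernel code: it mirrors how the patch derives the maximum delta as mask / mult and then bounds the delta to the range [0, max_ns] before the multiplication.

```c
#include <assert.h>
#include <stdint.h>

/* Largest delta whose product with mult still fits into the counter mask */
static uint64_t max_delta_ns(uint64_t counter_mask, uint32_t mult)
{
	assert(mult != 0);
	return counter_mask / mult;
}

/* Bound (expires_ns - base_ns) to [0, max_ns]; past expiries become 0 */
static uint64_t clamp_delta_ns(int64_t expires_ns, int64_t base_ns, uint64_t max_ns)
{
	int64_t delta = expires_ns - base_ns;

	if (delta < 0)
		return 0;
	return (uint64_t)delta > max_ns ? max_ns : (uint64_t)delta;
}

/* Fixed-point conversion of an already clamped delta */
static uint64_t delta_to_cycles(uint64_t delta_ns, uint32_t mult, uint32_t shift)
{
	assert(shift < 64);
	return (delta_ns * mult) >> shift;
}
```

Because the clamped delta never exceeds mask / mult, the product delta * mult stays at or below the mask and cannot overflow 64 bits. An expiry that is already in the past yields a zero delta, so the resulting comparator value equals the base cycles and the less than or equal comparator fires immediately.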
+2
kernel/time/timekeeping.h
···
 				 ktime_t *offs_boot,
 				 ktime_t *offs_tai);

+bool ktime_expiry_to_cycles(enum clocksource_ids id, ktime_t expires_ns, u64 *cycles);
+
 extern int timekeeping_valid_for_hres(void);
 extern u64 timekeeping_max_deferment(void);
 extern void timekeeping_warp_clock(void);