Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

timekeeping: Prevent coarse clocks going backwards

Lei Chen raised an issue with CLOCK_MONOTONIC_COARSE seeing time
inconsistencies. Lei tracked down that this was being caused by the
adjustment:

tk->tkr_mono.xtime_nsec -= offset;

which is made to compensate for the unaccumulated cycles in offset when the
multiplicator is adjusted forward, so that the non-_COARSE clockids don't
see inconsistencies.

However, the _COARSE clockid getter functions use the adjusted xtime_nsec
value directly and do not compensate the negative offset via the
clocksource delta multiplied with the new multiplicator. In that case the
caller can observe time going backwards in consecutive calls.
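
The effect described above can be sketched in a small user-space model. This
is only an illustration: struct tk_model, coarse_read_old() and
apply_negative_adjustment() are hypothetical names, and the struct is a
stand-in that does not match the kernel's layout.

```c
#include <stdint.h>

/* Hypothetical stand-in for the relevant timekeeper fields. */
struct tk_model {
	uint64_t xtime_nsec;	/* shifted nanoseconds: ns << shift */
	uint32_t shift;
};

/* Pre-fix coarse getter behaviour: derive nanoseconds from xtime_nsec. */
static uint64_t coarse_read_old(const struct tk_model *tk)
{
	return tk->xtime_nsec >> tk->shift;
}

/* Models the compensation "xtime_nsec -= offset" (shifted nanoseconds). */
static void apply_negative_adjustment(struct tk_model *tk, uint64_t offset)
{
	tk->xtime_nsec -= offset;
}
```

In this model, two consecutive coarse_read_old() calls with an adjustment in
between return a smaller value the second time, because nothing adds the
clocksource delta back for the coarse readers.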

By design, this negative adjustment should be fine, because the logic run
from timekeeping_adjust() is done after it accumulated approximately

multiplicator * interval_cycles

into xtime_nsec. The accumulated value is always larger than the

mult_adj * offset

value, which is subtracted from xtime_nsec. Both operations are done
together under the tk_core.lock, so the net change to xtime_nsec is always
positive.

However, do_adjtimex() calls into timekeeping_advance() as well, to
apply the NTP frequency adjustment immediately. In this case,
timekeeping_advance() does not return early when the offset is smaller
than interval_cycles. In that case there is no time accumulated into
xtime_nsec. But the subsequent call into timekeeping_adjust(), which
modifies the multiplicator, subtracts from xtime_nsec to correct for the
new multiplicator.

Here because there was no accumulation, xtime_nsec becomes smaller than
before, which opens a window up to the next accumulation, where the
_COARSE clockid getters, which don't compensate for the offset, can
observe the inconsistency.
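
The difference between the tick path and the adjtimex() path can be put into
numbers with a rough model. The function and its parameters are hypothetical
and only mirror the two terms named above (multiplicator * interval_cycles
and mult_adj * offset); it is not kernel code.

```c
#include <stdint.h>

/*
 * Net change to xtime_nsec (shifted nanoseconds) for one pass through a
 * simplified advance step. All parameters are model inputs.
 */
static int64_t xtime_nsec_net_change(uint64_t mult, uint64_t interval_cycles,
				     int accumulated, int64_t mult_adj,
				     uint64_t offset)
{
	int64_t delta = 0;

	/* Tick path: one interval's worth of time is accumulated first. */
	if (accumulated)
		delta += (int64_t)(mult * interval_cycles);

	/* Multiplier adjustment: compensate the unaccumulated cycles. */
	delta -= mult_adj * (int64_t)offset;

	return delta;
}
```

With mult = 1000, interval_cycles = 100, mult_adj = 1 and offset = 40, the
tick path yields 100000 - 40 = 99960, while the path without accumulation
yields -40. The second case is the window in which the coarse getters could
observe a smaller value.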

An earlier attempt to fix this forwarded the timekeeper in the case
that adjtimex() adjusts the multiplier, which resets the offset to zero:

757b000f7b93 ("timekeeping: Fix possible inconsistencies in _COARSE clockids")

That works correctly, but unfortunately causes a regression on the
adjtimex() side. There are two issues:

1) The forwarding of the base time moves the update out of the original
period and establishes a new one.

2) The clearing of the accumulated NTP error is changing the behaviour as
well.

User space expects that multiplier/frequency updates are in effect when the
syscall returns, so delaying the update to the next tick does not solve the
problem either.

Commit 757b000f7b93 was reverted so that the established expectations of
user space implementations (ntpd, chronyd) are restored, but that obviously
brought the inconsistencies back.

One of the initial approaches to fix this was to establish a separate
storage for the coarse time getter nanoseconds part by calculating it from
the offset. That was dropped on the floor because not having yet another
state to maintain was simpler. But given the result of the above exercise,
this solution turns out to be the right one. Bring it back in a slightly
modified form.

Thus introduce timekeeper::coarse_nsec and store that nanoseconds part in
it, switch the time getter functions and the VDSO update to use that value.
coarse_nsec is set on operations which forward or initialize the timekeeper
and after time was accumulated during a tick. If there is no accumulation
the timestamp is unchanged.

This leaves the adjtimex() behaviour unmodified and prevents coarse time
from going backwards.
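
A minimal user-space sketch of the approach, assuming a simplified
timekeeper: struct tk_model, update_coarse() and advance() are hypothetical
names. The model treats "time was accumulated" as a non-zero accumulated
amount, whereas the patch compares orig_offset with offset; second overflow
handling is left out.

```c
#include <stdint.h>

/* Hypothetical stand-in for the relevant timekeeper fields. */
struct tk_model {
	uint64_t xtime_nsec;	/* shifted nanoseconds: ns << shift */
	uint32_t shift;
	uint32_t coarse_nsec;	/* separate copy for the coarse getters */
};

/* Refresh the coarse copy from the fine-grained value. */
static void update_coarse(struct tk_model *tk)
{
	tk->coarse_nsec = (uint32_t)(tk->xtime_nsec >> tk->shift);
}

/*
 * One advance step: accumulate time (possibly zero), apply the negative
 * multiplier compensation, and refresh the coarse copy only when time
 * was actually accumulated.
 */
static void advance(struct tk_model *tk, uint64_t accumulated,
		    uint64_t adjustment)
{
	tk->xtime_nsec += accumulated;
	tk->xtime_nsec -= adjustment;

	if (accumulated != 0)
		update_coarse(tk);
}
```

An advance() call with accumulated == 0 lowers xtime_nsec but leaves
coarse_nsec untouched, so a reader of coarse_nsec never sees the dip.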

[ jstultz: Simplified the coarse_nsec calculation and kept behavior so
coarse clockids aren't adjusted on each inter-tick adjtimex
call, slightly reworked the comments and commit message ]

Fixes: da15cfdae033 ("time: Introduce CLOCK_REALTIME_COARSE")
Reported-by: Lei Chen <lei.chen@smartx.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/20250419054706.2319105-1-jstultz@google.com
Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/

authored by Thomas Gleixner and committed by Ingo Molnar
b71f9804 b4432656

+49 -13
+5 -3
include/linux/timekeeper_internal.h
···
  * @offs_real: Offset clock monotonic -> clock realtime
  * @offs_boot: Offset clock monotonic -> clock boottime
  * @offs_tai: Offset clock monotonic -> clock tai
- * @tai_offset: The current UTC to TAI offset in seconds
+ * @coarse_nsec: The nanoseconds part for coarse time getters
  * @tkr_raw: The readout base structure for CLOCK_MONOTONIC_RAW
  * @raw_sec: CLOCK_MONOTONIC_RAW time in seconds
  * @clock_was_set_seq: The sequence number of clock was set events
···
  * ntp shifted nano seconds.
  * @ntp_err_mult: Multiplication factor for scaled math conversion
  * @skip_second_overflow: Flag used to avoid updating NTP twice with same second
+ * @tai_offset: The current UTC to TAI offset in seconds
  *
  * Note: For timespec(64) based interfaces wall_to_monotonic is what
  * we need to add to xtime (or xtime corrected for sub jiffy times)
···
  * which results in the following cacheline layout:
  *
  * 0: seqcount, tkr_mono
- * 1: xtime_sec ... tai_offset
+ * 1: xtime_sec ... coarse_nsec
  * 2: tkr_raw, raw_sec
  * 3,4: Internal variables
  *
···
 	ktime_t offs_real;
 	ktime_t offs_boot;
 	ktime_t offs_tai;
-	s32 tai_offset;
+	u32 coarse_nsec;
 
 	/* Cacheline 2: */
 	struct tk_read_base tkr_raw;
···
 	u32 ntp_error_shift;
 	u32 ntp_err_mult;
 	u32 skip_second_overflow;
+	s32 tai_offset;
 };
 
 #ifdef CONFIG_GENERIC_TIME_VSYSCALL
+42 -8
kernel/time/timekeeping.c
···
 	return ts;
 }
 
+static inline struct timespec64 tk_xtime_coarse(const struct timekeeper *tk)
+{
+	struct timespec64 ts;
+
+	ts.tv_sec = tk->xtime_sec;
+	ts.tv_nsec = tk->coarse_nsec;
+	return ts;
+}
+
+/*
+ * Update the nanoseconds part for the coarse time keepers. They can't rely
+ * on xtime_nsec because xtime_nsec could be adjusted by a small negative
+ * amount when the multiplication factor of the clock is adjusted, which
+ * could cause the coarse clocks to go slightly backwards. See
+ * timekeeping_apply_adjustment(). Thus we keep a separate copy for the coarse
+ * clockids which only is updated when the clock has been set or we have
+ * accumulated time.
+ */
+static inline void tk_update_coarse_nsecs(struct timekeeper *tk)
+{
+	tk->coarse_nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+}
+
 static void tk_set_xtime(struct timekeeper *tk, const struct timespec64 *ts)
 {
 	tk->xtime_sec = ts->tv_sec;
 	tk->tkr_mono.xtime_nsec = (u64)ts->tv_nsec << tk->tkr_mono.shift;
+	tk_update_coarse_nsecs(tk);
 }
 
 static void tk_xtime_add(struct timekeeper *tk, const struct timespec64 *ts)
···
 	tk->xtime_sec += ts->tv_sec;
 	tk->tkr_mono.xtime_nsec += (u64)ts->tv_nsec << tk->tkr_mono.shift;
 	tk_normalize_xtime(tk);
+	tk_update_coarse_nsecs(tk);
 }
 
 static void tk_set_wall_to_mono(struct timekeeper *tk, struct timespec64 wtm)
···
 		tk_normalize_xtime(tk);
 		delta -= incr;
 	}
+	tk_update_coarse_nsecs(tk);
 }
 
 /**
···
 ktime_t ktime_get_coarse_with_offset(enum tk_offsets offs)
 {
 	struct timekeeper *tk = &tk_core.timekeeper;
-	unsigned int seq;
 	ktime_t base, *offset = offsets[offs];
+	unsigned int seq;
 	u64 nsecs;
 
 	WARN_ON(timekeeping_suspended);
···
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 		base = ktime_add(tk->tkr_mono.base, *offset);
-		nsecs = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+		nsecs = tk->coarse_nsec;
 
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
···
 	struct timekeeper *real_tk = &tk_core.timekeeper;
 	unsigned int clock_set = 0;
 	int shift = 0, maxshift;
-	u64 offset;
+	u64 offset, orig_offset;
 
 	guard(raw_spinlock_irqsave)(&tk_core.lock);
 
···
 	offset = clocksource_delta(tk_clock_read(&tk->tkr_mono),
 				   tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
 				   tk->tkr_mono.clock->max_raw_delta);
-
+	orig_offset = offset;
 	/* Check if there's really nothing to do */
 	if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)
 		return false;
···
 	 * xtime_nsec isn't larger than NSEC_PER_SEC
 	 */
 	clock_set |= accumulate_nsecs_to_secs(tk);
+
+	/*
+	 * To avoid inconsistencies caused adjtimex TK_ADV_FREQ calls
+	 * making small negative adjustments to the base xtime_nsec
+	 * value, only update the coarse clocks if we accumulated time
+	 */
+	if (orig_offset != offset)
+		tk_update_coarse_nsecs(tk);
 
 	timekeeping_update_from_shadow(&tk_core, clock_set);
 
···
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 
-		*ts = tk_xtime(tk);
+		*ts = tk_xtime_coarse(tk);
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 }
 EXPORT_SYMBOL(ktime_get_coarse_real_ts64);
···
 
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
-		*ts = tk_xtime(tk);
+		*ts = tk_xtime_coarse(tk);
 		offset = tk_core.timekeeper.offs_real;
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
···
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 
-		now = tk_xtime(tk);
+		now = tk_xtime_coarse(tk);
 		mono = tk->wall_to_monotonic;
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
 	set_normalized_timespec64(ts, now.tv_sec + mono.tv_sec,
-				now.tv_nsec + mono.tv_nsec);
+				  now.tv_nsec + mono.tv_nsec);
 }
 EXPORT_SYMBOL(ktime_get_coarse_ts64);
 
+2 -2
kernel/time/vsyscall.c
···
 	/* CLOCK_REALTIME_COARSE */
 	vdso_ts = &vc[CS_HRES_COARSE].basetime[CLOCK_REALTIME_COARSE];
 	vdso_ts->sec = tk->xtime_sec;
-	vdso_ts->nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+	vdso_ts->nsec = tk->coarse_nsec;
 
 	/* CLOCK_MONOTONIC_COARSE */
 	vdso_ts = &vc[CS_HRES_COARSE].basetime[CLOCK_MONOTONIC_COARSE];
 	vdso_ts->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
-	nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+	nsec = tk->coarse_nsec;
 	nsec = nsec + tk->wall_to_monotonic.tv_nsec;
 	vdso_ts->sec += __iter_div_u64_rem(nsec, NSEC_PER_SEC, &vdso_ts->nsec);
 
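
The CLOCK_MONOTONIC_COARSE part of the VDSO update adds two nanosecond values
and folds the overflow into seconds. A user-space sketch of that step follows;
model_iter_div_rem() and monotonic_coarse() are hypothetical helpers written
for illustration, with the iterative division meant to behave like
__iter_div_u64_rem() for small quotients.

```c
#include <stdint.h>

#define MODEL_NSEC_PER_SEC 1000000000ULL

/* Divide by repeated subtraction; cheap when the quotient is 0 or 1. */
static uint32_t model_iter_div_rem(uint64_t dividend, uint32_t divisor,
				   uint64_t *remainder)
{
	uint32_t ret = 0;

	while (dividend >= divisor) {
		dividend -= divisor;
		ret++;
	}
	*remainder = dividend;
	return ret;
}

/* Combine wall time (with the coarse nanoseconds) and wall_to_monotonic. */
static void monotonic_coarse(int64_t xtime_sec, uint32_t coarse_nsec,
			     int64_t wtm_sec, uint32_t wtm_nsec,
			     int64_t *sec, uint64_t *nsec)
{
	uint64_t sum = (uint64_t)coarse_nsec + wtm_nsec;

	*sec = xtime_sec + wtm_sec;
	*sec += model_iter_div_rem(sum, (uint32_t)MODEL_NSEC_PER_SEC, nsec);
}
```

For example, xtime_sec = 100, coarse_nsec = 900000000, wtm_sec = -90 and
wtm_nsec = 200000000 give 11 seconds and 100000000 nanoseconds.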