Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
"Scheduler Kconfig space updates:

- Further consolidate configurable preemption modes (Peter Zijlstra)

Reduce the number of architectures that are allowed to offer
PREEMPT_NONE and PREEMPT_VOLUNTARY, reducing the number of
preemption models from four to just two: 'full' and 'lazy' on
up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
x86).

None and voluntary are only available as legacy features on
platforms that don't implement lazy preemption yet, or which don't
even support preemption.

The goal is to eventually remove cond_resched() and voluntary
preemption altogether.

RSEQ based 'scheduler time slice extension' support (Thomas Gleixner
and Peter Zijlstra):

This allows a thread to request a time slice extension when it enters
a critical section to avoid contention on a resource when the thread
is scheduled out inside of the critical section.

- Add fields and constants for time slice extension
- Provide static branch for time slice extensions
- Add statistics for time slice extensions
- Add prctl() to enable time slice extensions
- Implement sys_rseq_slice_yield()
- Implement syscall entry work for time slice extensions
- Implement time slice extension enforcement timer
- Reset slice extension when scheduled
- Implement rseq_grant_slice_extension()
- entry: Hook up rseq time slice extension
- selftests: Implement time slice extension test
- Allow registering RSEQ with slice extension
- Move slice_ext_nsec to debugfs
- Lower default slice extension
- selftests/rseq: Add rseq slice histogram script
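
A minimal userspace sketch of the flow described above, combining the new
prctl() opt-in with the request/grant protocol. The constant and field names
are taken from the UAPI additions in this merge; rseq_area is assumed to
point at the thread's registered struct rseq, and do_critical_section() is a
placeholder:

    /* One-time opt-in per thread; requires a registered rseq area */
    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);

    /* Around a short critical section */
    rseq_area->slice_ctrl.request = 1;
    barrier();                          /* compiler barrier, e.g. asm volatile("" ::: "memory") */
    do_critical_section();
    barrier();
    rseq_area->slice_ctrl.request = 0;
    if (rseq_area->slice_ctrl.granted)  /* kernel granted an extension */
        syscall(__NR_rseq_slice_yield); /* relinquish the CPU */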

Scheduler performance/scalability improvements:

- Update rq->avg_idle when a task is moved to an idle CPU, which
improves the scalability of various workloads (Shubhang Kaushik)

- Reorder fields in 'struct rq' for better caching (Blake Jones)

- Fair scheduler SMP NOHZ balancing code speedups (Shrikanth Hegde):
- Move checking for nohz cpus after time check
- Change likelihood of nohz.nr_cpus
- Remove nohz.nr_cpus and use weight of cpumask instead

- Avoid false sharing for sched_clock_irqtime (Wangyang Guo)

- Cleanups (Yury Norov):
- Drop useless cpumask_empty() in find_energy_efficient_cpu()
- Simplify task_numa_find_cpu()
- Use cpumask_weight_and() in sched_balance_find_dst_group()

DL scheduler updates:

- Add a deadline server for sched_ext tasks (by Andrea Righi and Joel
Fernandes, with fixes by Peter Zijlstra)

RT scheduler updates:

- Skip currently executing CPU in rto_next_cpu() (Chen Jinghuang)

Entry code updates and performance improvements (Jinjie Ruan)

This is part of the scheduler tree in this cycle due to inter-
dependencies with the RSEQ based time slice extension work:

- Remove unused syscall argument from syscall_trace_enter()
- Rework syscall_exit_to_user_mode_work() for architecture reuse
- Add arch_ptrace_report_syscall_entry/exit()
- Inline syscall_exit_work() and syscall_trace_enter()

Scheduler core updates (Peter Zijlstra):

- Rework sched_class::wakeup_preempt() and rq_modified_*()
- Avoid rq->lock bouncing in sched_balance_newidle()
- Rename rcu_dereference_check_sched_domain() =>
rcu_dereference_sched_domain()
- <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper

Fair scheduler updates/refactoring (Peter Zijlstra and Ingo Molnar):

- Fold the sched_avg update
- Change rcu_dereference_check_sched_domain() to rcu-sched
- Switch to rcu_dereference_all()
- Remove superfluous rcu_read_lock()
- Limit hrtick work
- Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocks
- Clean up comments in 'struct cfs_rq'
- Separate se->vlag from se->vprot
- Rename cfs_rq::avg_load to cfs_rq::sum_weight
- Rename cfs_rq::avg_vruntime to ::sum_w_vruntime & helper functions
- Introduce and use the vruntime_cmp() and vruntime_op() wrappers for
wrapped-signed arithmetic
- Sort out 'blocked_load*' namespace noise

Scheduler debugging code updates:

- Export hidden tracepoints to modules (Gabriele Monaco)

- Convert copy_from_user() + kstrtouint() to kstrtouint_from_user()
(Fushuai Wang)

- Add assertions to QUEUE_CLASS (Peter Zijlstra)

- hrtimer: Fix tracing oddity (Thomas Gleixner)

Misc fixes and cleanups:

- Re-evaluate scheduling when migrating queued tasks out of throttled
cgroups (Zicheng Qu)

- Remove task_struct->faults_disabled_mapping (Christoph Hellwig)

- Fix math notation errors in avg_vruntime comment (Zhan Xusheng)

- sched/cpufreq: Use %pe format for PTR_ERR() printing
(zenghongling)"

* tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
sched/cpufreq: Use %pe format for PTR_ERR() printing
sched/rt: Skip currently executing CPU in rto_next_cpu()
sched/clock: Avoid false sharing for sched_clock_irqtime
selftests/sched_ext: Add test for DL server total_bw consistency
selftests/sched_ext: Add test for sched_ext dl_server
sched/debug: Fix dl_server (re)start conditions
sched/debug: Add support to change sched_ext server params
sched_ext: Add a DL server for sched_ext tasks
sched/debug: Stop and start server based on if it was active
sched/debug: Fix updating of ppos on server write ops
sched/deadline: Clear the defer params
entry: Inline syscall_exit_work() and syscall_trace_enter()
entry: Add arch_ptrace_report_syscall_entry/exit()
entry: Rework syscall_exit_to_user_mode_work() for architecture reuse
entry: Remove unused syscall argument from syscall_trace_enter()
sched: remove task_struct->faults_disabled_mapping
sched: Update rq->avg_idle when a task is moved to an idle CPU
selftests/rseq: Add rseq slice histogram script
hrtimer: Fix trace oddity
...

+2599 -547
+5
Documentation/admin-guide/kernel-parameters.txt
···
 	rootflags=	[KNL] Set root filesystem mount option string

+	rseq_slice_ext=	[KNL] RSEQ based time slice extension
+			Format: boolean
+			Control enablement of RSEQ based time slice extension.
+			Default is 'on'.
+
 	initramfs_options= [KNL]
 		Specify mount options for for the initramfs mount.
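
For example, appending the parameter with the value 'off' to the kernel
command line would disable the mechanism system-wide; a sketch based on the
'Format: boolean' description above:

    rseq_slice_ext=off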
+1
Documentation/userspace-api/index.rst
···
    ebpf/index
    ioctl/index
    mseal
+   rseq

 Security-related interfaces
 ===========================
+140
Documentation/userspace-api/rseq.rst
=====================
Restartable Sequences
=====================

Restartable Sequences allow registering a per-thread userspace memory area
to be used as an ABI between kernel and userspace for three purposes:

* userspace restartable sequences

* quick access to read the current CPU number and node ID from userspace

* scheduler time slice extensions

Restartable sequences (per-cpu atomics)
---------------------------------------

Restartable sequences allow userspace to perform update operations on
per-cpu data without requiring heavyweight atomic operations. The actual
ABI is unfortunately only available in the code and selftests.

Quick access to CPU number, node ID
-----------------------------------

Allows implementing per-CPU data efficiently. Documentation is in code and
selftests. :(

Scheduler time slice extensions
-------------------------------

This allows a thread to request a time slice extension when it enters a
critical section to avoid contention on a resource when the thread is
scheduled out inside of the critical section.

The prerequisites for this functionality are:

* Enabled in Kconfig

* Enabled at boot time (default is enabled)

* A rseq userspace pointer has been registered for the thread

The thread has to enable the functionality via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);

prctl() returns 0 on success or otherwise with the following error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg4 and arg5 must be zero
ENOTSUPP  Functionality was disabled on the kernel command line
ENXIO     Available, but no rseq user struct registered
========= ==============================================================

The state can also be queried via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);

prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
disabled. Otherwise it returns with the following error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg3, arg4 and arg5 must be zero
========= ==============================================================

The availability and status are also exposed via the rseq ABI struct flags
field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
space and only for informational purposes.

If the mechanism was enabled via prctl(), the thread can request a time
slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
interrupted and the interrupt results in a reschedule request in the
kernel, then the kernel can grant a time slice extension and return to
userspace instead of scheduling out. The length of the extension is
determined by debugfs:rseq/slice_ext_nsec. The default value is 5 usec,
which is the minimum value. It can be increased up to 50 usecs; however,
doing so can and will affect the minimum scheduling latency.

Any proposed changes to this default will have to come with a selftest and
rseq-slice-hist.py output that shows the new value has merit.

The kernel indicates the grant by clearing rseq::slice_ctrl::request and
setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
thread after granting the extension, the kernel clears the granted bit to
indicate that to userspace.

If the request bit is still set when leaving the critical section,
userspace can clear it and continue.

If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
leaving the critical section to relinquish the CPU. The kernel enforces
this by arming a timer to prevent misbehaving userspace from abusing this
mechanism.

If both the request bit and the granted bit are false when leaving the
critical section, then this indicates that a grant was revoked and no
further action is required by userspace.

The required code flow is as follows::

    rseq->slice_ctrl.request = 1;
    barrier(); // Prevent compiler reordering
    critical_section();
    barrier(); // Prevent compiler reordering
    rseq->slice_ctrl.request = 0;
    if (rseq->slice_ctrl.granted)
            rseq_slice_yield();

As all of this is strictly CPU local, there are no atomicity requirements.
Checking the granted state is racy, but that cannot be avoided at all::

    if (rseq->slice_ctrl.granted)
        -> Interrupt results in schedule and grant revocation
            rseq_slice_yield();

So there is no point in pretending that this might be solved by an atomic
operation.

If the thread issues a syscall other than rseq_slice_yield(2) within the
granted timeslice extension, the grant is also revoked and the CPU is
relinquished immediately when entering the kernel. This is required as
syscalls might consume arbitrary CPU time until they reach a scheduling
point when the preemption model is either NONE or VOLUNTARY and therefore
might exceed the grant by far.

The preferred solution for user space is to use rseq_slice_yield(2), which
is side-effect free. The support for arbitrary syscalls is required to
support onion-layer architected applications, where the code handling the
critical section and requesting the time slice extension has no control
over the code within the critical section.

The kernel enforces flag consistency and terminates the thread with SIGSEGV
if it detects a violation.
+1
arch/alpha/kernel/syscalls/syscall.tbl
···
 578	common	file_getattr		sys_file_getattr
 579	common	file_setattr		sys_file_setattr
 580	common	listns			sys_listns
+581	common	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/arm/tools/syscall.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/arm64/tools/syscall_32.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/m68k/kernel/syscalls/syscall.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/microblaze/kernel/syscalls/syscall.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/mips/kernel/syscalls/syscall_n32.tbl
···
 468	n32	file_getattr		sys_file_getattr
 469	n32	file_setattr		sys_file_setattr
 470	n32	listns			sys_listns
+471	n32	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/mips/kernel/syscalls/syscall_n64.tbl
···
 468	n64	file_getattr		sys_file_getattr
 469	n64	file_setattr		sys_file_setattr
 470	n64	listns			sys_listns
+471	n64	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/mips/kernel/syscalls/syscall_o32.tbl
···
 468	o32	file_getattr		sys_file_getattr
 469	o32	file_setattr		sys_file_setattr
 470	o32	listns			sys_listns
+471	o32	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/parisc/kernel/syscalls/syscall.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/powerpc/kernel/syscalls/syscall.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	nospu	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/s390/kernel/syscalls/syscall.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/sh/kernel/syscalls/syscall.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/sparc/kernel/syscalls/syscall.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/x86/entry/syscalls/syscall_32.tbl
···
 468	i386	file_getattr		sys_file_getattr
 469	i386	file_setattr		sys_file_setattr
 470	i386	listns			sys_listns
+471	i386	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/x86/entry/syscalls/syscall_64.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield

 #
 # Due to a historical design error, certain syscalls are numbered differently
-2
arch/x86/kernel/tsc.c
···
 	tsc_unstable = 1;
 	if (using_native_sched_clock())
 		clear_sched_clock_stable();
-	disable_sched_clock_irqtime();
 	pr_info("Marking TSC unstable due to clocksource watchdog\n");
 }
···
 	tsc_unstable = 1;
 	if (using_native_sched_clock())
 		clear_sched_clock_stable();
-	disable_sched_clock_irqtime();
 	pr_info("Marking TSC unstable due to %s\n", reason);

 	clocksource_mark_unstable(&clocksource_tsc_early);
+1
arch/xtensa/kernel/syscalls/syscall.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield
+19
include/linux/compiler_types.h
···
 		__scalar_type_to_expr_cases(long long),	\
 		default: (x)))

+/*
+ * __signed_scalar_typeof(x) - Declare a signed scalar type, leaving
+ * non-scalar types unchanged.
+ */
+
+#define __scalar_type_to_signed_cases(type)			\
+	unsigned type:	(signed type)0,				\
+	signed type:	(signed type)0
+
+#define __signed_scalar_typeof(x) typeof(			\
+	_Generic((x),						\
+		 char:	(signed char)0,				\
+		 __scalar_type_to_signed_cases(char),		\
+		 __scalar_type_to_signed_cases(short),		\
+		 __scalar_type_to_signed_cases(int),		\
+		 __scalar_type_to_signed_cases(long),		\
+		 __scalar_type_to_signed_cases(long long),	\
+		 default: (x)))
+
 /* Is this type a native word size -- useful for atomic operations */
 #define __native_word(t) \
 	(sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
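
A hedged illustration of what the new helper is for: casting the difference
of two unsigned values to the matching signed type makes wrapped-around
comparisons come out right, which is what the vruntime_cmp()/vruntime_op()
work mentioned in the pull message relies on. vruntime_before() below is a
hypothetical wrapper, not a function added by this merge:

    /* true if @a is before @b in wrapped (modular) u64 arithmetic */
    static inline bool vruntime_before(u64 a, u64 b)
    {
            return (__signed_scalar_typeof(a))(a - b) < 0;
    }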
+149 -18
include/linux/entry-common.h
··· 2 2 #ifndef __LINUX_ENTRYCOMMON_H 3 3 #define __LINUX_ENTRYCOMMON_H 4 4 5 + #include <linux/audit.h> 5 6 #include <linux/irq-entry-common.h> 6 7 #include <linux/livepatch.h> 7 8 #include <linux/ptrace.h> ··· 37 36 SYSCALL_WORK_SYSCALL_EMU | \ 38 37 SYSCALL_WORK_SYSCALL_AUDIT | \ 39 38 SYSCALL_WORK_SYSCALL_USER_DISPATCH | \ 39 + SYSCALL_WORK_SYSCALL_RSEQ_SLICE | \ 40 40 ARCH_SYSCALL_WORK_ENTER) 41 - 42 41 #define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \ 43 42 SYSCALL_WORK_SYSCALL_TRACE | \ 44 43 SYSCALL_WORK_SYSCALL_AUDIT | \ ··· 46 45 SYSCALL_WORK_SYSCALL_EXIT_TRAP | \ 47 46 ARCH_SYSCALL_WORK_EXIT) 48 47 49 - long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work); 48 + /** 49 + * arch_ptrace_report_syscall_entry - Architecture specific ptrace_report_syscall_entry() wrapper 50 + * 51 + * Invoked from syscall_trace_enter() to wrap ptrace_report_syscall_entry(). 52 + * 53 + * This allows architecture specific ptrace_report_syscall_entry() 54 + * implementations. If not defined by the architecture this falls back to 55 + * to ptrace_report_syscall_entry(). 56 + */ 57 + static __always_inline int arch_ptrace_report_syscall_entry(struct pt_regs *regs); 58 + 59 + #ifndef arch_ptrace_report_syscall_entry 60 + static __always_inline int arch_ptrace_report_syscall_entry(struct pt_regs *regs) 61 + { 62 + return ptrace_report_syscall_entry(regs); 63 + } 64 + #endif 65 + 66 + bool syscall_user_dispatch(struct pt_regs *regs); 67 + long trace_syscall_enter(struct pt_regs *regs, long syscall); 68 + void trace_syscall_exit(struct pt_regs *regs, long ret); 69 + 70 + static inline void syscall_enter_audit(struct pt_regs *regs, long syscall) 71 + { 72 + if (unlikely(audit_context())) { 73 + unsigned long args[6]; 74 + 75 + syscall_get_arguments(current, regs, args); 76 + audit_syscall_entry(syscall, args[0], args[1], args[2], args[3]); 77 + } 78 + } 79 + 80 + static __always_inline long syscall_trace_enter(struct pt_regs *regs, unsigned long work) 81 + { 82 + long syscall, ret = 0; 83 + 84 + /* 85 + * Handle Syscall User Dispatch. This must comes first, since 86 + * the ABI here can be something that doesn't make sense for 87 + * other syscall_work features. 88 + */ 89 + if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) { 90 + if (syscall_user_dispatch(regs)) 91 + return -1L; 92 + } 93 + 94 + /* 95 + * User space got a time slice extension granted and relinquishes 96 + * the CPU. The work stops the slice timer to avoid an extra round 97 + * through hrtimer_interrupt(). 98 + */ 99 + if (work & SYSCALL_WORK_SYSCALL_RSEQ_SLICE) 100 + rseq_syscall_enter_work(syscall_get_nr(current, regs)); 101 + 102 + /* Handle ptrace */ 103 + if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) { 104 + ret = arch_ptrace_report_syscall_entry(regs); 105 + if (ret || (work & SYSCALL_WORK_SYSCALL_EMU)) 106 + return -1L; 107 + } 108 + 109 + /* Do seccomp after ptrace, to catch any tracer changes. */ 110 + if (work & SYSCALL_WORK_SECCOMP) { 111 + ret = __secure_computing(); 112 + if (ret == -1L) 113 + return ret; 114 + } 115 + 116 + /* Either of the above might have changed the syscall number */ 117 + syscall = syscall_get_nr(current, regs); 118 + 119 + if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT)) 120 + syscall = trace_syscall_enter(regs, syscall); 121 + 122 + syscall_enter_audit(regs, syscall); 123 + 124 + return ret ? 
: syscall; 125 + } 50 126 51 127 /** 52 128 * syscall_enter_from_user_mode_work - Check and handle work before invoking ··· 153 75 unsigned long work = READ_ONCE(current_thread_info()->syscall_work); 154 76 155 77 if (work & SYSCALL_WORK_ENTER) 156 - syscall = syscall_trace_enter(regs, syscall, work); 78 + syscall = syscall_trace_enter(regs, work); 157 79 158 80 return syscall; 159 81 } ··· 190 112 return ret; 191 113 } 192 114 115 + /* 116 + * If SYSCALL_EMU is set, then the only reason to report is when 117 + * SINGLESTEP is set (i.e. PTRACE_SYSEMU_SINGLESTEP). This syscall 118 + * instruction has been already reported in syscall_enter_from_user_mode(). 119 + */ 120 + static __always_inline bool report_single_step(unsigned long work) 121 + { 122 + if (work & SYSCALL_WORK_SYSCALL_EMU) 123 + return false; 124 + 125 + return work & SYSCALL_WORK_SYSCALL_EXIT_TRAP; 126 + } 127 + 128 + /** 129 + * arch_ptrace_report_syscall_exit - Architecture specific ptrace_report_syscall_exit() 130 + * 131 + * This allows architecture specific ptrace_report_syscall_exit() 132 + * implementations. If not defined by the architecture this falls back to 133 + * to ptrace_report_syscall_exit(). 134 + */ 135 + static __always_inline void arch_ptrace_report_syscall_exit(struct pt_regs *regs, 136 + int step); 137 + 138 + #ifndef arch_ptrace_report_syscall_exit 139 + static __always_inline void arch_ptrace_report_syscall_exit(struct pt_regs *regs, 140 + int step) 141 + { 142 + ptrace_report_syscall_exit(regs, step); 143 + } 144 + #endif 145 + 193 146 /** 194 147 * syscall_exit_work - Handle work before returning to user mode 195 148 * @regs: Pointer to current pt_regs ··· 228 119 * 229 120 * Do one-time syscall specific work. 230 121 */ 231 - void syscall_exit_work(struct pt_regs *regs, unsigned long work); 122 + static __always_inline void syscall_exit_work(struct pt_regs *regs, unsigned long work) 123 + { 124 + bool step; 125 + 126 + /* 127 + * If the syscall was rolled back due to syscall user dispatching, 128 + * then the tracers below are not invoked for the same reason as 129 + * the entry side was not invoked in syscall_trace_enter(): The ABI 130 + * of these syscalls is unknown. 131 + */ 132 + if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) { 133 + if (unlikely(current->syscall_dispatch.on_dispatch)) { 134 + current->syscall_dispatch.on_dispatch = false; 135 + return; 136 + } 137 + } 138 + 139 + audit_syscall_exit(regs); 140 + 141 + if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT) 142 + trace_syscall_exit(regs, syscall_get_return_value(current, regs)); 143 + 144 + step = report_single_step(work); 145 + if (step || work & SYSCALL_WORK_SYSCALL_TRACE) 146 + arch_ptrace_report_syscall_exit(regs, step); 147 + } 232 148 233 149 /** 234 - * syscall_exit_to_user_mode_work - Handle work before returning to user mode 150 + * syscall_exit_to_user_mode_work - Handle one time work before returning to user mode 235 151 * @regs: Pointer to currents pt_regs 236 152 * 237 - * Same as step 1 and 2 of syscall_exit_to_user_mode() but without calling 238 - * exit_to_user_mode() to perform the final transition to user mode. 153 + * Step 1 of syscall_exit_to_user_mode() with the same calling convention. 239 154 * 240 - * Calling convention is the same as for syscall_exit_to_user_mode() and it 241 - * returns with all work handled and interrupts disabled. The caller must 242 - * invoke exit_to_user_mode() before actually switching to user mode to 243 - * make the final state transitions. 
Interrupts must stay disabled between 244 - * return from this function and the invocation of exit_to_user_mode(). 155 + * The caller must invoke steps 2-3 of syscall_exit_to_user_mode() afterwards. 245 156 */ 246 157 static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs) 247 158 { ··· 284 155 */ 285 156 if (unlikely(work & SYSCALL_WORK_EXIT)) 286 157 syscall_exit_work(regs, work); 287 - local_irq_disable_exit_to_user(); 288 - syscall_exit_to_user_mode_prepare(regs); 289 158 } 290 159 291 160 /** 292 161 * syscall_exit_to_user_mode - Handle work before returning to user mode 293 162 * @regs: Pointer to currents pt_regs 294 163 * 295 - * Invoked with interrupts enabled and fully valid regs. Returns with all 164 + * Invoked with interrupts enabled and fully valid @regs. Returns with all 296 165 * work handled, interrupts disabled such that the caller can immediately 297 166 * switch to user mode. Called from architecture specific syscall and ret 298 167 * from fork code. ··· 303 176 * - ptrace (single stepping) 304 177 * 305 178 * 2) Preparatory work 179 + * - Disable interrupts 306 180 * - Exit to user mode loop (common TIF handling). Invokes 307 181 * arch_exit_to_user_mode_work() for architecture specific TIF work 308 182 * - Architecture specific one time work arch_exit_to_user_mode_prepare() ··· 312 184 * 3) Final transition (lockdep, tracing, context tracking, RCU), i.e. the 313 185 * functionality in exit_to_user_mode(). 314 186 * 315 - * This is a combination of syscall_exit_to_user_mode_work() (1,2) and 316 - * exit_to_user_mode(). This function is preferred unless there is a 317 - * compelling architectural reason to use the separate functions. 187 + * This is a combination of syscall_exit_to_user_mode_work() (1), disabling 188 + * interrupts followed by syscall_exit_to_user_mode_prepare() (2) and 189 + * exit_to_user_mode() (3). This function is preferred unless there is a 190 + * compelling architectural reason to invoke the functions separately. 318 191 */ 319 192 static __always_inline void syscall_exit_to_user_mode(struct pt_regs *regs) 320 193 { 321 194 instrumentation_begin(); 322 195 syscall_exit_to_user_mode_work(regs); 196 + local_irq_disable_exit_to_user(); 197 + syscall_exit_to_user_mode_prepare(regs); 323 198 instrumentation_end(); 324 199 exit_to_user_mode(); 325 200 }
+11
include/linux/rseq.h
···
 static inline void rseq_syscall(struct pt_regs *regs) { }
 #endif /* !CONFIG_DEBUG_RSEQ */

+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+void rseq_syscall_enter_work(long syscall);
+int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
+#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline void rseq_syscall_enter_work(long syscall) { }
+static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
+{
+	return -ENOTSUPP;
+}
+#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
+
 #endif /* _LINUX_RSEQ_H */
+180 -12
include/linux/rseq_entry.h
··· 15 15 unsigned long cs; 16 16 unsigned long clear; 17 17 unsigned long fixup; 18 + unsigned long s_granted; 19 + unsigned long s_expired; 20 + unsigned long s_revoked; 21 + unsigned long s_yielded; 22 + unsigned long s_aborted; 18 23 }; 19 24 20 25 DECLARE_PER_CPU(struct rseq_stats, rseq_stats); ··· 42 37 #ifdef CONFIG_RSEQ 43 38 #include <linux/jump_label.h> 44 39 #include <linux/rseq.h> 40 + #include <linux/sched/signal.h> 45 41 #include <linux/uaccess.h> 46 42 47 43 #include <linux/tracepoint-defs.h> ··· 80 74 #else 81 75 #define rseq_inline __always_inline 82 76 #endif 77 + 78 + #ifdef CONFIG_RSEQ_SLICE_EXTENSION 79 + DECLARE_STATIC_KEY_TRUE(rseq_slice_extension_key); 80 + 81 + static __always_inline bool rseq_slice_extension_enabled(void) 82 + { 83 + return static_branch_likely(&rseq_slice_extension_key); 84 + } 85 + 86 + extern unsigned int rseq_slice_ext_nsecs; 87 + bool __rseq_arm_slice_extension_timer(void); 88 + 89 + static __always_inline bool rseq_arm_slice_extension_timer(void) 90 + { 91 + if (!rseq_slice_extension_enabled()) 92 + return false; 93 + 94 + if (likely(!current->rseq.slice.state.granted)) 95 + return false; 96 + 97 + return __rseq_arm_slice_extension_timer(); 98 + } 99 + 100 + static __always_inline void rseq_slice_clear_grant(struct task_struct *t) 101 + { 102 + if (IS_ENABLED(CONFIG_RSEQ_STATS) && t->rseq.slice.state.granted) 103 + rseq_stat_inc(rseq_stats.s_revoked); 104 + t->rseq.slice.state.granted = false; 105 + } 106 + 107 + static __always_inline bool rseq_grant_slice_extension(bool work_pending) 108 + { 109 + struct task_struct *curr = current; 110 + struct rseq_slice_ctrl usr_ctrl; 111 + union rseq_slice_state state; 112 + struct rseq __user *rseq; 113 + 114 + if (!rseq_slice_extension_enabled()) 115 + return false; 116 + 117 + /* If not enabled or not a return from interrupt, nothing to do. */ 118 + state = curr->rseq.slice.state; 119 + state.enabled &= curr->rseq.event.user_irq; 120 + if (likely(!state.state)) 121 + return false; 122 + 123 + rseq = curr->rseq.usrptr; 124 + scoped_user_rw_access(rseq, efault) { 125 + 126 + /* 127 + * Quick check conditions where a grant is not possible or 128 + * needs to be revoked. 129 + * 130 + * 1) Any TIF bit which needs to do extra work aside of 131 + * rescheduling prevents a grant. 132 + * 133 + * 2) A previous rescheduling request resulted in a slice 134 + * extension grant. 135 + */ 136 + if (unlikely(work_pending || state.granted)) { 137 + /* Clear user control unconditionally. 
No point for checking */ 138 + unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); 139 + rseq_slice_clear_grant(curr); 140 + return false; 141 + } 142 + 143 + unsafe_get_user(usr_ctrl.all, &rseq->slice_ctrl.all, efault); 144 + if (likely(!(usr_ctrl.request))) 145 + return false; 146 + 147 + /* Grant the slice extention */ 148 + usr_ctrl.request = 0; 149 + usr_ctrl.granted = 1; 150 + unsafe_put_user(usr_ctrl.all, &rseq->slice_ctrl.all, efault); 151 + } 152 + 153 + rseq_stat_inc(rseq_stats.s_granted); 154 + 155 + curr->rseq.slice.state.granted = true; 156 + /* Store expiry time for arming the timer on the way out */ 157 + curr->rseq.slice.expires = data_race(rseq_slice_ext_nsecs) + ktime_get_mono_fast_ns(); 158 + /* 159 + * This is racy against a remote CPU setting TIF_NEED_RESCHED in 160 + * several ways: 161 + * 162 + * 1) 163 + * CPU0 CPU1 164 + * clear_tsk() 165 + * set_tsk() 166 + * clear_preempt() 167 + * Raise scheduler IPI on CPU0 168 + * --> IPI 169 + * fold_need_resched() -> Folds correctly 170 + * 2) 171 + * CPU0 CPU1 172 + * set_tsk() 173 + * clear_tsk() 174 + * clear_preempt() 175 + * Raise scheduler IPI on CPU0 176 + * --> IPI 177 + * fold_need_resched() <- NOOP as TIF_NEED_RESCHED is false 178 + * 179 + * #1 is not any different from a regular remote reschedule as it 180 + * sets the previously not set bit and then raises the IPI which 181 + * folds it into the preempt counter 182 + * 183 + * #2 is obviously incorrect from a scheduler POV, but it's not 184 + * differently incorrect than the code below clearing the 185 + * reschedule request with the safety net of the timer. 186 + * 187 + * The important part is that the clearing is protected against the 188 + * scheduler IPI and also against any other interrupt which might 189 + * end up waking up a task and setting the bits in the middle of 190 + * the operation: 191 + * 192 + * clear_tsk() 193 + * ---> Interrupt 194 + * wakeup_on_this_cpu() 195 + * set_tsk() 196 + * set_preempt() 197 + * clear_preempt() 198 + * 199 + * which would be inconsistent state. 
200 + */ 201 + scoped_guard(irq) { 202 + clear_tsk_need_resched(curr); 203 + clear_preempt_need_resched(); 204 + } 205 + return true; 206 + 207 + efault: 208 + force_sig(SIGSEGV); 209 + return false; 210 + } 211 + 212 + #else /* CONFIG_RSEQ_SLICE_EXTENSION */ 213 + static inline bool rseq_slice_extension_enabled(void) { return false; } 214 + static inline bool rseq_arm_slice_extension_timer(void) { return false; } 215 + static inline void rseq_slice_clear_grant(struct task_struct *t) { } 216 + static inline bool rseq_grant_slice_extension(bool work_pending) { return false; } 217 + #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */ 83 218 84 219 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr); 85 220 bool rseq_debug_validate_ids(struct task_struct *t); ··· 506 359 unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault); 507 360 if (csaddr) 508 361 unsafe_get_user(*csaddr, &rseq->rseq_cs, efault); 362 + 363 + /* Open coded, so it's in the same user access region */ 364 + if (rseq_slice_extension_enabled()) { 365 + /* Unconditionally clear it, no point in conditionals */ 366 + unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); 367 + } 509 368 } 510 369 370 + rseq_slice_clear_grant(t); 511 371 /* Cache the new values */ 512 372 t->rseq.ids.cpu_cid = ids->cpu_cid; 513 373 rseq_stat_inc(rseq_stats.ids); ··· 610 456 */ 611 457 u64 csaddr; 612 458 613 - if (unlikely(get_user_inline(csaddr, &rseq->rseq_cs))) 614 - return false; 459 + scoped_user_rw_access(rseq, efault) { 460 + unsafe_get_user(csaddr, &rseq->rseq_cs, efault); 461 + 462 + /* Open coded, so it's in the same user access region */ 463 + if (rseq_slice_extension_enabled()) { 464 + /* Unconditionally clear it, no point in conditionals */ 465 + unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); 466 + } 467 + } 468 + 469 + rseq_slice_clear_grant(t); 615 470 616 471 if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) { 617 472 if (unlikely(!rseq_update_user_cs(t, regs, csaddr))) ··· 636 473 u32 node_id = cpu_to_node(ids.cpu_id); 637 474 638 475 return rseq_update_usr(t, regs, &ids, node_id); 476 + efault: 477 + return false; 639 478 } 640 479 641 480 static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *regs) ··· 692 527 static __always_inline bool 693 528 rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work) 694 529 { 695 - if (likely(!test_tif_rseq(ti_work))) 696 - return false; 697 - 698 - if (unlikely(__rseq_exit_to_user_mode_restart(regs))) { 699 - current->rseq.event.slowpath = true; 700 - set_tsk_thread_flag(current, TIF_NOTIFY_RESUME); 701 - return true; 530 + if (unlikely(test_tif_rseq(ti_work))) { 531 + if (unlikely(__rseq_exit_to_user_mode_restart(regs))) { 532 + current->rseq.event.slowpath = true; 533 + set_tsk_thread_flag(current, TIF_NOTIFY_RESUME); 534 + return true; 535 + } 536 + clear_tif_rseq(); 702 537 } 703 - 704 - clear_tif_rseq(); 705 - return false; 538 + /* 539 + * Arm the slice extension timer if nothing to do anymore and the 540 + * task really goes out to user space. 
541 + */ 542 + return rseq_arm_slice_extension_timer(); 706 543 } 707 544 708 545 #else /* CONFIG_GENERIC_ENTRY */ ··· 778 611 static inline void rseq_irqentry_exit_to_user_mode(void) { } 779 612 static inline void rseq_exit_to_user_mode_legacy(void) { } 780 613 static inline void rseq_debug_syscall_return(struct pt_regs *regs) { } 614 + static inline bool rseq_grant_slice_extension(bool work_pending) { return false; } 781 615 #endif /* !CONFIG_RSEQ */ 782 616 783 617 #endif /* _LINUX_RSEQ_ENTRY_H */
+31 -1
include/linux/rseq_types.h
··· 73 73 }; 74 74 75 75 /** 76 + * union rseq_slice_state - Status information for rseq time slice extension 77 + * @state: Compound to access the overall state 78 + * @enabled: Time slice extension is enabled for the task 79 + * @granted: Time slice extension was granted to the task 80 + */ 81 + union rseq_slice_state { 82 + u16 state; 83 + struct { 84 + u8 enabled; 85 + u8 granted; 86 + }; 87 + }; 88 + 89 + /** 90 + * struct rseq_slice - Status information for rseq time slice extension 91 + * @state: Time slice extension state 92 + * @expires: The time when a grant expires 93 + * @yielded: Indicator for rseq_slice_yield() 94 + */ 95 + struct rseq_slice { 96 + union rseq_slice_state state; 97 + u64 expires; 98 + u8 yielded; 99 + }; 100 + 101 + /** 76 102 * struct rseq_data - Storage for all rseq related data 77 103 * @usrptr: Pointer to the registered user space RSEQ memory 78 104 * @len: Length of the RSEQ region 79 - * @sig: Signature of critial section abort IPs 105 + * @sig: Signature of critical section abort IPs 80 106 * @event: Storage for event management 81 107 * @ids: Storage for cached CPU ID and MM CID 108 + * @slice: Storage for time slice extension data 82 109 */ 83 110 struct rseq_data { 84 111 struct rseq __user *usrptr; ··· 113 86 u32 sig; 114 87 struct rseq_event event; 115 88 struct rseq_ids ids; 89 + #ifdef CONFIG_RSEQ_SLICE_EXTENSION 90 + struct rseq_slice slice; 91 + #endif 116 92 }; 117 93 118 94 #else /* CONFIG_RSEQ */
+4 -10
include/linux/sched.h
···
 	u64				sum_exec_runtime;
 	u64				prev_sum_exec_runtime;
 	u64				vruntime;
-	union {
-		/*
-		 * When !@on_rq this field is vlag.
-		 * When cfs_rq->curr == se (which implies @on_rq)
-		 * this field is vprot. See protect_slice().
-		 */
-		s64			vlag;
-		u64			vprot;
-	};
+	/* Approximated virtual lag: */
+	s64				vlag;
+	/* 'Protected' deadline, to give out minimum quantums: */
+	u64				vprot;
 	u64				slice;

 	u64				nr_migrations;
···
 	struct mm_struct		*mm;
 	struct mm_struct		*active_mm;
-	struct address_space		*faults_disabled_mapping;

 	int				exit_state;
 	int				exit_code;
+1
include/linux/syscalls.h
···
 			  unsigned mask, struct statx __user *buffer);
 asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
 			 int flags, uint32_t sig);
+asmlinkage long sys_rseq_slice_yield(void);
 asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
 asmlinkage long sys_open_tree_attr(int dfd, const char __user *path,
 				   unsigned flags,
+9 -7
include/linux/thread_info.h
···
 	SYSCALL_WORK_BIT_SYSCALL_AUDIT,
 	SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
 	SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
+	SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE,
 };

-#define SYSCALL_WORK_SECCOMP		BIT(SYSCALL_WORK_BIT_SECCOMP)
-#define SYSCALL_WORK_SYSCALL_TRACEPOINT	BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
-#define SYSCALL_WORK_SYSCALL_TRACE	BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
-#define SYSCALL_WORK_SYSCALL_EMU	BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
-#define SYSCALL_WORK_SYSCALL_AUDIT	BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
-#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
-#define SYSCALL_WORK_SYSCALL_EXIT_TRAP	BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SECCOMP			BIT(SYSCALL_WORK_BIT_SECCOMP)
+#define SYSCALL_WORK_SYSCALL_TRACEPOINT		BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
+#define SYSCALL_WORK_SYSCALL_TRACE		BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
+#define SYSCALL_WORK_SYSCALL_EMU		BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
+#define SYSCALL_WORK_SYSCALL_AUDIT		BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
+#define SYSCALL_WORK_SYSCALL_USER_DISPATCH	BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
+#define SYSCALL_WORK_SYSCALL_EXIT_TRAP		BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SYSCALL_RSEQ_SLICE		BIT(SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE)
 #endif

 #include <asm/thread_info.h>
+4 -1
include/uapi/asm-generic/unistd.h
···
 #define __NR_listns 470
 __SYSCALL(__NR_listns, sys_listns)

+#define __NR_rseq_slice_yield 471
+__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+
 #undef __NR_syscalls
-#define __NR_syscalls 471
+#define __NR_syscalls 472

 /*
  * 32 bit systems traditionally used different
+10
include/uapi/linux/prctl.h
···
 # define PR_FUTEX_HASH_SET_SLOTS	1
 # define PR_FUTEX_HASH_GET_SLOTS	2

+/* RSEQ time slice extensions */
+#define PR_RSEQ_SLICE_EXTENSION		79
+# define PR_RSEQ_SLICE_EXTENSION_GET	1
+# define PR_RSEQ_SLICE_EXTENSION_SET	2
+/*
+ * Bits for RSEQ_SLICE_EXTENSION_GET/SET
+ * PR_RSEQ_SLICE_EXT_ENABLE: Enable
+ */
+# define PR_RSEQ_SLICE_EXT_ENABLE	0x01
+
 #endif /* _LINUX_PRCTL_H */
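
A small sketch of using the new prctl() constants from a thread that has
already registered an rseq area. The helper name is illustrative, and error
handling follows the codes documented in Documentation/userspace-api/rseq.rst
above:

    #include <linux/prctl.h>
    #include <sys/prctl.h>

    static int rseq_slice_ext_enable_and_query(void)
    {
            if (prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
                      PR_RSEQ_SLICE_EXT_ENABLE, 0, 0))
                    return -1;  /* see the error codes in the rseq.rst table */

            /* Returns PR_RSEQ_SLICE_EXT_ENABLE if enabled, 0 if disabled */
            return prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
    }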
+40 -1
include/uapi/linux/rseq.h
··· 19 19 }; 20 20 21 21 enum rseq_flags { 22 - RSEQ_FLAG_UNREGISTER = (1 << 0), 22 + RSEQ_FLAG_UNREGISTER = (1 << 0), 23 + RSEQ_FLAG_SLICE_EXT_DEFAULT_ON = (1 << 1), 23 24 }; 24 25 25 26 enum rseq_cs_flags_bit { 27 + /* Historical and unsupported bits */ 26 28 RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0, 27 29 RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1, 28 30 RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2, 31 + /* (3) Intentional gap to put new bits into a separate byte */ 32 + 33 + /* User read only feature flags */ 34 + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4, 35 + RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5, 29 36 }; 30 37 31 38 enum rseq_cs_flags { ··· 42 35 (1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT), 43 36 RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE = 44 37 (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT), 38 + 39 + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE = 40 + (1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT), 41 + RSEQ_CS_FLAG_SLICE_EXT_ENABLED = 42 + (1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT), 45 43 }; 46 44 47 45 /* ··· 64 52 __u64 post_commit_offset; 65 53 __u64 abort_ip; 66 54 } __attribute__((aligned(4 * sizeof(__u64)))); 55 + 56 + /** 57 + * rseq_slice_ctrl - Time slice extension control structure 58 + * @all: Compound value 59 + * @request: Request for a time slice extension 60 + * @granted: Granted time slice extension 61 + * 62 + * @request is set by user space and can be cleared by user space or kernel 63 + * space. @granted is set and cleared by the kernel and must only be read 64 + * by user space. 65 + */ 66 + struct rseq_slice_ctrl { 67 + union { 68 + __u32 all; 69 + struct { 70 + __u8 request; 71 + __u8 granted; 72 + __u16 __reserved; 73 + }; 74 + }; 75 + }; 67 76 68 77 /* 69 78 * struct rseq is aligned on 4 * 8 bytes to ensure it is always ··· 173 140 * (allocated uniquely within a memory map). 174 141 */ 175 142 __u32 mm_cid; 143 + 144 + /* 145 + * Time slice extension control structure. CPU local updates from 146 + * kernel and user space. 147 + */ 148 + struct rseq_slice_ctrl slice_ctrl; 176 149 177 150 /* 178 151 * Flexible array member at end of structure, after last feature field.
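
The new read-only feature bits land in the existing flags field of the
registered struct rseq, so a thread can probe for support before attempting
the prctl() opt-in. A minimal sketch; rseq_area is assumed to point at the
thread's registered area and enable_via_prctl() is a hypothetical helper
wrapping the prctl() call shown earlier:

    /* Feature bits maintained by the kernel, read-only for userspace */
    _Bool avail   = rseq_area->flags & RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
    _Bool enabled = rseq_area->flags & RSEQ_CS_FLAG_SLICE_EXT_ENABLED;

    if (avail && !enabled)
            enable_via_prctl();  /* supported, but this thread has not opted in yet */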
+12
init/Kconfig
···

 	  If unsure, say Y.

+config RSEQ_SLICE_EXTENSION
+	bool "Enable rseq-based time slice extension mechanism"
+	depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+	help
+	  Allows userspace to request a limited time slice extension when
+	  returning from an interrupt to user space via the RSEQ shared
+	  data ABI. If granted, that allows to complete a critical section,
+	  so that other threads are not stuck on a conflicted resource,
+	  while the task is scheduled out.
+
+	  If unsure, say N.
+
 config RSEQ_STATS
 	default n
 	bool "Enable lightweight statistics of restartable sequences" if EXPERT
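
For reference, a kernel configuration that builds the feature would, given
the dependencies above, contain roughly the following fragment (a sketch;
GENERIC_ENTRY and HAVE_GENERIC_TIF_BITS are selected by the architecture
rather than set by hand):

    CONFIG_RSEQ=y
    CONFIG_HIGH_RES_TIMERS=y
    CONFIG_RSEQ_SLICE_EXTENSION=y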
-1
init/init_task.c
···
 	.nr_cpus_allowed= NR_CPUS,
 	.mm		= NULL,
 	.active_mm	= &init_mm,
-	.faults_disabled_mapping = NULL,
 	.restart_block	= {
 		.fn = do_no_restart_syscall,
 	},
+3
kernel/Kconfig.preempt
···

 choice
 	prompt "Preemption Model"
+	default PREEMPT_LAZY if ARCH_HAS_PREEMPT_LAZY
 	default PREEMPT_NONE

 config PREEMPT_NONE
 	bool "No Forced Preemption (Server)"
 	depends on !PREEMPT_RT
+	depends on ARCH_NO_PREEMPT
 	select PREEMPT_NONE_BUILD if !PREEMPT_DYNAMIC
 	help
 	  This is the traditional Linux preemption model, geared towards
···
 config PREEMPT_VOLUNTARY
 	bool "Voluntary Kernel Preemption (Desktop)"
+	depends on !ARCH_HAS_PREEMPT_LAZY
 	depends on !ARCH_NO_PREEMPT
 	depends on !PREEMPT_RT
 	select PREEMPT_VOLUNTARY_BUILD if !PREEMPT_DYNAMIC
+25 -2
kernel/entry/common.c
··· 17 17 #define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK) 18 18 #endif 19 19 20 + /* TIF bits, which prevent a time slice extension. */ 21 + #ifdef CONFIG_PREEMPT_RT 22 + /* 23 + * Since rseq slice ext has a direct correlation to the worst case 24 + * scheduling latency (schedule is delayed after all), only have it affect 25 + * LAZY reschedules on PREEMPT_RT for now. 26 + * 27 + * However, since this delay is only applicable to userspace, a value 28 + * for rseq_slice_extension_nsec that is strictly less than the worst case 29 + * kernel space preempt_disable() region, should mean the scheduling latency 30 + * is not affected, even for !LAZY. 31 + * 32 + * However, since this value depends on the hardware at hand, it cannot be 33 + * pre-determined in any sensible way. Hence punt on this problem for now. 34 + */ 35 + # define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED_LAZY) 36 + #else 37 + # define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY) 38 + #endif 39 + #define TIF_SLICE_EXT_DENY (EXIT_TO_USER_MODE_WORK & ~TIF_SLICE_EXT_SCHED) 40 + 20 41 static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs, 21 42 unsigned long ti_work) 22 43 { ··· 49 28 50 29 local_irq_enable_exit_to_user(ti_work); 51 30 52 - if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) 53 - schedule(); 31 + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) { 32 + if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY)) 33 + schedule(); 34 + } 54 35 55 36 if (ti_work & _TIF_UPROBE) 56 37 uprobe_notify_resume(regs);
-7
kernel/entry/common.h
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _COMMON_H
-#define _COMMON_H
-
-bool syscall_user_dispatch(struct pt_regs *regs);
-
-#endif
+9 -90
kernel/entry/syscall-common.c
··· 1 1 // SPDX-License-Identifier: GPL-2.0 2 2 3 - #include <linux/audit.h> 4 3 #include <linux/entry-common.h> 5 - #include "common.h" 6 4 7 5 #define CREATE_TRACE_POINTS 8 6 #include <trace/events/syscalls.h> 9 7 10 - static inline void syscall_enter_audit(struct pt_regs *regs, long syscall) 8 + /* Out of line to prevent tracepoint code duplication */ 9 + 10 + long trace_syscall_enter(struct pt_regs *regs, long syscall) 11 11 { 12 - if (unlikely(audit_context())) { 13 - unsigned long args[6]; 14 - 15 - syscall_get_arguments(current, regs, args); 16 - audit_syscall_entry(syscall, args[0], args[1], args[2], args[3]); 17 - } 18 - } 19 - 20 - long syscall_trace_enter(struct pt_regs *regs, long syscall, 21 - unsigned long work) 22 - { 23 - long ret = 0; 24 - 12 + trace_sys_enter(regs, syscall); 25 13 /* 26 - * Handle Syscall User Dispatch. This must comes first, since 27 - * the ABI here can be something that doesn't make sense for 28 - * other syscall_work features. 14 + * Probes or BPF hooks in the tracepoint may have changed the 15 + * system call number. Reread it. 29 16 */ 30 - if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) { 31 - if (syscall_user_dispatch(regs)) 32 - return -1L; 33 - } 34 - 35 - /* Handle ptrace */ 36 - if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) { 37 - ret = ptrace_report_syscall_entry(regs); 38 - if (ret || (work & SYSCALL_WORK_SYSCALL_EMU)) 39 - return -1L; 40 - } 41 - 42 - /* Do seccomp after ptrace, to catch any tracer changes. */ 43 - if (work & SYSCALL_WORK_SECCOMP) { 44 - ret = __secure_computing(); 45 - if (ret == -1L) 46 - return ret; 47 - } 48 - 49 - /* Either of the above might have changed the syscall number */ 50 - syscall = syscall_get_nr(current, regs); 51 - 52 - if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT)) { 53 - trace_sys_enter(regs, syscall); 54 - /* 55 - * Probes or BPF hooks in the tracepoint may have changed the 56 - * system call number as well. 57 - */ 58 - syscall = syscall_get_nr(current, regs); 59 - } 60 - 61 - syscall_enter_audit(regs, syscall); 62 - 63 - return ret ? : syscall; 17 + return syscall_get_nr(current, regs); 64 18 } 65 19 66 - /* 67 - * If SYSCALL_EMU is set, then the only reason to report is when 68 - * SINGLESTEP is set (i.e. PTRACE_SYSEMU_SINGLESTEP). This syscall 69 - * instruction has been already reported in syscall_enter_from_user_mode(). 70 - */ 71 - static inline bool report_single_step(unsigned long work) 20 + void trace_syscall_exit(struct pt_regs *regs, long ret) 72 21 { 73 - if (work & SYSCALL_WORK_SYSCALL_EMU) 74 - return false; 75 - 76 - return work & SYSCALL_WORK_SYSCALL_EXIT_TRAP; 77 - } 78 - 79 - void syscall_exit_work(struct pt_regs *regs, unsigned long work) 80 - { 81 - bool step; 82 - 83 - /* 84 - * If the syscall was rolled back due to syscall user dispatching, 85 - * then the tracers below are not invoked for the same reason as 86 - * the entry side was not invoked in syscall_trace_enter(): The ABI 87 - * of these syscalls is unknown. 88 - */ 89 - if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) { 90 - if (unlikely(current->syscall_dispatch.on_dispatch)) { 91 - current->syscall_dispatch.on_dispatch = false; 92 - return; 93 - } 94 - } 95 - 96 - audit_syscall_exit(regs); 97 - 98 - if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT) 99 - trace_sys_exit(regs, syscall_get_return_value(current, regs)); 100 - 101 - step = report_single_step(work); 102 - if (step || work & SYSCALL_WORK_SYSCALL_TRACE) 103 - ptrace_report_syscall_exit(regs, step); 22 + trace_sys_exit(regs, ret); 104 23 }
+2 -2
kernel/entry/syscall_user_dispatch.c
···
 /*
  * Copyright (C) 2020 Collabora Ltd.
  */
+
+#include <linux/entry-common.h>
 #include <linux/sched.h>
 #include <linux/prctl.h>
 #include <linux/ptrace.h>
···
 #include <linux/sched/task_stack.h>

 #include <asm/syscall.h>
-
-#include "common.h"

 static void trigger_sigsys(struct pt_regs *regs)
 {
+362 -3
kernel/rseq.c
··· 71 71 #define RSEQ_BUILD_SLOW_PATH 72 72 73 73 #include <linux/debugfs.h> 74 + #include <linux/hrtimer.h> 75 + #include <linux/percpu.h> 76 + #include <linux/prctl.h> 74 77 #include <linux/ratelimit.h> 75 78 #include <linux/rseq_entry.h> 76 79 #include <linux/sched.h> ··· 123 120 } 124 121 #endif /* CONFIG_TRACEPOINTS */ 125 122 126 - #ifdef CONFIG_DEBUG_FS 127 123 #ifdef CONFIG_RSEQ_STATS 128 124 DEFINE_PER_CPU(struct rseq_stats, rseq_stats); 129 125 ··· 140 138 stats.cs += data_race(per_cpu(rseq_stats.cs, cpu)); 141 139 stats.clear += data_race(per_cpu(rseq_stats.clear, cpu)); 142 140 stats.fixup += data_race(per_cpu(rseq_stats.fixup, cpu)); 141 + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) { 142 + stats.s_granted += data_race(per_cpu(rseq_stats.s_granted, cpu)); 143 + stats.s_expired += data_race(per_cpu(rseq_stats.s_expired, cpu)); 144 + stats.s_revoked += data_race(per_cpu(rseq_stats.s_revoked, cpu)); 145 + stats.s_yielded += data_race(per_cpu(rseq_stats.s_yielded, cpu)); 146 + stats.s_aborted += data_race(per_cpu(rseq_stats.s_aborted, cpu)); 147 + } 143 148 } 144 149 145 150 seq_printf(m, "exit: %16lu\n", stats.exit); ··· 157 148 seq_printf(m, "cs: %16lu\n", stats.cs); 158 149 seq_printf(m, "clear: %16lu\n", stats.clear); 159 150 seq_printf(m, "fixup: %16lu\n", stats.fixup); 151 + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) { 152 + seq_printf(m, "sgrant: %16lu\n", stats.s_granted); 153 + seq_printf(m, "sexpir: %16lu\n", stats.s_expired); 154 + seq_printf(m, "srevok: %16lu\n", stats.s_revoked); 155 + seq_printf(m, "syield: %16lu\n", stats.s_yielded); 156 + seq_printf(m, "sabort: %16lu\n", stats.s_aborted); 157 + } 160 158 return 0; 161 159 } 162 160 ··· 221 205 .release = single_release, 222 206 }; 223 207 208 + static void rseq_slice_ext_init(struct dentry *root_dir); 209 + 224 210 static int __init rseq_debugfs_init(void) 225 211 { 226 212 struct dentry *root_dir = debugfs_create_dir("rseq", NULL); 227 213 228 214 debugfs_create_file("debug", 0644, root_dir, NULL, &debug_ops); 229 215 rseq_stats_init(root_dir); 216 + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) 217 + rseq_slice_ext_init(root_dir); 230 218 return 0; 231 219 } 232 220 __initcall(rseq_debugfs_init); 233 - #endif /* CONFIG_DEBUG_FS */ 234 221 235 222 static bool rseq_set_ids(struct task_struct *t, struct rseq_ids *ids, u32 node_id) 236 223 { ··· 408 389 */ 409 390 SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig) 410 391 { 392 + u32 rseqfl = 0; 393 + 411 394 if (flags & RSEQ_FLAG_UNREGISTER) { 412 395 if (flags & ~RSEQ_FLAG_UNREGISTER) 413 396 return -EINVAL; ··· 426 405 return 0; 427 406 } 428 407 429 - if (unlikely(flags)) 408 + if (unlikely(flags & ~(RSEQ_FLAG_SLICE_EXT_DEFAULT_ON))) 430 409 return -EINVAL; 431 410 432 411 if (current->rseq.usrptr) { ··· 461 440 if (!access_ok(rseq, rseq_len)) 462 441 return -EFAULT; 463 442 443 + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) { 444 + rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE; 445 + if (rseq_slice_extension_enabled() && 446 + (flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)) 447 + rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED; 448 + } 449 + 464 450 scoped_user_write_access(rseq, efault) { 465 451 /* 466 452 * If the rseq_cs pointer is non-NULL on registration, clear it to ··· 477 449 * clearing the fields. Don't bother reading it, just reset it. 
478 450 */ 479 451 unsafe_put_user(0UL, &rseq->rseq_cs, efault); 452 + unsafe_put_user(rseqfl, &rseq->flags, efault); 480 453 /* Initialize IDs in user space */ 481 454 unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault); 482 455 unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault); 483 456 unsafe_put_user(0U, &rseq->node_id, efault); 484 457 unsafe_put_user(0U, &rseq->mm_cid, efault); 458 + unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); 485 459 } 486 460 487 461 /* ··· 493 463 current->rseq.usrptr = rseq; 494 464 current->rseq.len = rseq_len; 495 465 current->rseq.sig = sig; 466 + 467 + #ifdef CONFIG_RSEQ_SLICE_EXTENSION 468 + current->rseq.slice.state.enabled = !!(rseqfl & RSEQ_CS_FLAG_SLICE_EXT_ENABLED); 469 + #endif 496 470 497 471 /* 498 472 * If rseq was previously inactive, and has just been ··· 510 476 efault: 511 477 return -EFAULT; 512 478 } 479 + 480 + #ifdef CONFIG_RSEQ_SLICE_EXTENSION 481 + struct slice_timer { 482 + struct hrtimer timer; 483 + void *cookie; 484 + }; 485 + 486 + static const unsigned int rseq_slice_ext_nsecs_min = 5 * NSEC_PER_USEC; 487 + static const unsigned int rseq_slice_ext_nsecs_max = 50 * NSEC_PER_USEC; 488 + unsigned int rseq_slice_ext_nsecs __read_mostly = rseq_slice_ext_nsecs_min; 489 + static DEFINE_PER_CPU(struct slice_timer, slice_timer); 490 + DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key); 491 + 492 + /* 493 + * When the timer expires and the task is still in user space, the return 494 + * from interrupt will revoke the grant and schedule. If the task already 495 + * entered the kernel via a syscall and the timer fires before the syscall 496 + * work was able to cancel it, then depending on the preemption model this 497 + * will either reschedule on return from interrupt or in the syscall work 498 + * below. 499 + */ 500 + static enum hrtimer_restart rseq_slice_expired(struct hrtimer *tmr) 501 + { 502 + struct slice_timer *st = container_of(tmr, struct slice_timer, timer); 503 + 504 + /* 505 + * Validate that the task which armed the timer is still on the 506 + * CPU. It could have been scheduled out without canceling the 507 + * timer. 508 + */ 509 + if (st->cookie == current && current->rseq.slice.state.granted) { 510 + rseq_stat_inc(rseq_stats.s_expired); 511 + set_need_resched_current(); 512 + } 513 + return HRTIMER_NORESTART; 514 + } 515 + 516 + bool __rseq_arm_slice_extension_timer(void) 517 + { 518 + struct slice_timer *st = this_cpu_ptr(&slice_timer); 519 + struct task_struct *curr = current; 520 + 521 + lockdep_assert_irqs_disabled(); 522 + 523 + /* 524 + * This check prevents a task, which got a time slice extension 525 + * granted, from exceeding the maximum scheduling latency when the 526 + * grant expired before going out to user space. Don't bother to 527 + * clear the grant here, it will be cleaned up automatically before 528 + * going out to user space after being scheduled back in. 529 + */ 530 + if ((unlikely(curr->rseq.slice.expires < ktime_get_mono_fast_ns()))) { 531 + set_need_resched_current(); 532 + return true; 533 + } 534 + 535 + /* 536 + * Store the task pointer as a cookie for comparison in the timer 537 + * function. This is safe as the timer is CPU local and cannot be 538 + * in the expiry function at this point. 
539 + */ 540 + st->cookie = curr; 541 + hrtimer_start(&st->timer, curr->rseq.slice.expires, HRTIMER_MODE_ABS_PINNED_HARD); 542 + /* Arm the syscall entry work */ 543 + set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE); 544 + return false; 545 + } 546 + 547 + static void rseq_cancel_slice_extension_timer(void) 548 + { 549 + struct slice_timer *st = this_cpu_ptr(&slice_timer); 550 + 551 + /* 552 + * st->cookie can be safely read as preemption is disabled and the 553 + * timer is CPU local. 554 + * 555 + * As this is most probably the first expiring timer, the cancel is 556 + * expensive as it has to reprogram the hardware, but that's less 557 + * expensive than going through a full hrtimer_interrupt() cycle 558 + * for nothing. 559 + * 560 + * hrtimer_try_to_cancel() is sufficient here as the timer is CPU 561 + * local and once the hrtimer code disabled interrupts the timer 562 + * callback cannot be running. 563 + */ 564 + if (st->cookie == current) 565 + hrtimer_try_to_cancel(&st->timer); 566 + } 567 + 568 + static inline void rseq_slice_set_need_resched(struct task_struct *curr) 569 + { 570 + /* 571 + * The interrupt guard is required to prevent inconsistent state in 572 + * this case: 573 + * 574 + * set_tsk_need_resched() 575 + * --> Interrupt 576 + * wakeup() 577 + * set_tsk_need_resched() 578 + * set_preempt_need_resched() 579 + * schedule_on_return() 580 + * clear_tsk_need_resched() 581 + * clear_preempt_need_resched() 582 + * set_preempt_need_resched() <- Inconsistent state 583 + * 584 + * This is safe vs. a remote set of TIF_NEED_RESCHED because that 585 + * only sets the already set bit and does not create inconsistent 586 + * state. 587 + */ 588 + scoped_guard(irq) 589 + set_need_resched_current(); 590 + } 591 + 592 + static void rseq_slice_validate_ctrl(u32 expected) 593 + { 594 + u32 __user *sctrl = &current->rseq.usrptr->slice_ctrl.all; 595 + u32 uval; 596 + 597 + if (get_user(uval, sctrl) || uval != expected) 598 + force_sig(SIGSEGV); 599 + } 600 + 601 + /* 602 + * Invoked from syscall entry if a time slice extension was granted and the 603 + * kernel did not clear it before user space left the critical section. 604 + * 605 + * While the recommended way to relinquish the CPU side effect free is 606 + * rseq_slice_yield(2), any syscall within a granted slice terminates the 607 + * grant and immediately reschedules if required. This supports onion layer 608 + * applications, where the code requesting the grant cannot control the 609 + * code within the critical section. 610 + */ 611 + void rseq_syscall_enter_work(long syscall) 612 + { 613 + struct task_struct *curr = current; 614 + struct rseq_slice_ctrl ctrl = { .granted = curr->rseq.slice.state.granted }; 615 + 616 + clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE); 617 + 618 + if (static_branch_unlikely(&rseq_debug_enabled)) 619 + rseq_slice_validate_ctrl(ctrl.all); 620 + 621 + /* 622 + * The kernel might have raced, revoked the grant and updated 623 + * userspace, but kept the SLICE work set. 624 + */ 625 + if (!ctrl.granted) 626 + return; 627 + 628 + /* 629 + * Required to stabilize the per CPU timer pointer and to make 630 + * set_tsk_need_resched() correct on PREEMPT[RT] kernels. 631 + * 632 + * Leaving the scope will reschedule on preemption models FULL, 633 + * LAZY and RT if necessary. 634 + */ 635 + scoped_guard(preempt) { 636 + rseq_cancel_slice_extension_timer(); 637 + /* 638 + * Now that preemption is disabled, quickly check whether 639 + * the task was already rescheduled before arriving here. 
640 + */ 641 + if (!curr->rseq.event.sched_switch) { 642 + rseq_slice_set_need_resched(curr); 643 + 644 + if (syscall == __NR_rseq_slice_yield) { 645 + rseq_stat_inc(rseq_stats.s_yielded); 646 + /* Update the yielded state for syscall return */ 647 + curr->rseq.slice.yielded = 1; 648 + } else { 649 + rseq_stat_inc(rseq_stats.s_aborted); 650 + } 651 + } 652 + } 653 + /* Reschedule on NONE/VOLUNTARY preemption models */ 654 + cond_resched(); 655 + 656 + /* Clear the grant in kernel state and user space */ 657 + curr->rseq.slice.state.granted = false; 658 + if (put_user(0U, &curr->rseq.usrptr->slice_ctrl.all)) 659 + force_sig(SIGSEGV); 660 + } 661 + 662 + int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3) 663 + { 664 + switch (arg2) { 665 + case PR_RSEQ_SLICE_EXTENSION_GET: 666 + if (arg3) 667 + return -EINVAL; 668 + return current->rseq.slice.state.enabled ? PR_RSEQ_SLICE_EXT_ENABLE : 0; 669 + 670 + case PR_RSEQ_SLICE_EXTENSION_SET: { 671 + u32 rflags, valid = RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE; 672 + bool enable = !!(arg3 & PR_RSEQ_SLICE_EXT_ENABLE); 673 + 674 + if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE) 675 + return -EINVAL; 676 + if (!rseq_slice_extension_enabled()) 677 + return -ENOTSUPP; 678 + if (!current->rseq.usrptr) 679 + return -ENXIO; 680 + 681 + /* No change? */ 682 + if (enable == !!current->rseq.slice.state.enabled) 683 + return 0; 684 + 685 + if (get_user(rflags, &current->rseq.usrptr->flags)) 686 + goto die; 687 + 688 + if (current->rseq.slice.state.enabled) 689 + valid |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED; 690 + 691 + if ((rflags & valid) != valid) 692 + goto die; 693 + 694 + rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_ENABLED; 695 + rflags |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE; 696 + if (enable) 697 + rflags |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED; 698 + 699 + if (put_user(rflags, &current->rseq.usrptr->flags)) 700 + goto die; 701 + 702 + current->rseq.slice.state.enabled = enable; 703 + return 0; 704 + } 705 + default: 706 + return -EINVAL; 707 + } 708 + die: 709 + force_sig(SIGSEGV); 710 + return -EFAULT; 711 + } 712 + 713 + /** 714 + * sys_rseq_slice_yield - yield the current processor side effect free if a 715 + * task granted with a time slice extension is done with 716 + * the critical work before being forced out. 717 + * 718 + * Return: 1 if the task successfully yielded the CPU within the granted slice. 719 + * 0 if the slice extension was either never granted or was revoked by 720 + * going over the granted extension, using a syscall other than this one 721 + * or being scheduled out earlier due to a subsequent interrupt. 722 + * 723 + * The syscall does not schedule because the syscall entry work immediately 724 + * relinquishes the CPU and schedules if required. 
725 + */ 726 + SYSCALL_DEFINE0(rseq_slice_yield) 727 + { 728 + int yielded = !!current->rseq.slice.yielded; 729 + 730 + current->rseq.slice.yielded = 0; 731 + return yielded; 732 + } 733 + 734 + static int rseq_slice_ext_show(struct seq_file *m, void *p) 735 + { 736 + seq_printf(m, "%d\n", rseq_slice_ext_nsecs); 737 + return 0; 738 + } 739 + 740 + static ssize_t rseq_slice_ext_write(struct file *file, const char __user *ubuf, 741 + size_t count, loff_t *ppos) 742 + { 743 + unsigned int nsecs; 744 + 745 + if (kstrtouint_from_user(ubuf, count, 10, &nsecs)) 746 + return -EINVAL; 747 + 748 + if (nsecs < rseq_slice_ext_nsecs_min) 749 + return -ERANGE; 750 + 751 + if (nsecs > rseq_slice_ext_nsecs_max) 752 + return -ERANGE; 753 + 754 + rseq_slice_ext_nsecs = nsecs; 755 + 756 + return count; 757 + } 758 + 759 + static int rseq_slice_ext_open(struct inode *inode, struct file *file) 760 + { 761 + return single_open(file, rseq_slice_ext_show, inode->i_private); 762 + } 763 + 764 + static const struct file_operations slice_ext_ops = { 765 + .open = rseq_slice_ext_open, 766 + .read = seq_read, 767 + .write = rseq_slice_ext_write, 768 + .llseek = seq_lseek, 769 + .release = single_release, 770 + }; 771 + 772 + static void rseq_slice_ext_init(struct dentry *root_dir) 773 + { 774 + debugfs_create_file("slice_ext_nsec", 0644, root_dir, NULL, &slice_ext_ops); 775 + } 776 + 777 + static int __init rseq_slice_cmdline(char *str) 778 + { 779 + bool on; 780 + 781 + if (kstrtobool(str, &on)) 782 + return 0; 783 + 784 + if (!on) 785 + static_branch_disable(&rseq_slice_extension_key); 786 + return 1; 787 + } 788 + __setup("rseq_slice_ext=", rseq_slice_cmdline); 789 + 790 + static int __init rseq_slice_init(void) 791 + { 792 + unsigned int cpu; 793 + 794 + for_each_possible_cpu(cpu) { 795 + hrtimer_setup(per_cpu_ptr(&slice_timer.timer, cpu), rseq_slice_expired, 796 + CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD); 797 + } 798 + return 0; 799 + } 800 + device_initcall(rseq_slice_init); 801 + #else 802 + static void rseq_slice_ext_init(struct dentry *root_dir) { } 803 + #endif /* CONFIG_RSEQ_SLICE_EXTENSION */
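A minimal userspace sketch of how the new rseq pieces above fit together (not part of the commit): registration advertises RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE, the per-task enable bit is toggled through prctl(), a pinned hrtimer enforces the grant, and rseq_slice_yield(2) hands the CPU back once the critical work is done. The top-level prctl option name (PR_RSEQ_SLICE_EXTENSION) and the availability of the new constants in installed uapi headers are assumptions; only the GET/SET sub-commands, PR_RSEQ_SLICE_EXT_ENABLE, RSEQ_FLAG_SLICE_EXT_DEFAULT_ON and __NR_rseq_slice_yield appear in the diff itself.

#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int slice_ext_enable(void)
{
	/*
	 * PR_RSEQ_SLICE_EXTENSION as the prctl option is an assumption;
	 * the hunk only shows the SET/GET sub-commands dispatched in
	 * rseq_slice_extension_prctl(). A prior rseq(2) registration is
	 * required, otherwise the kernel returns -ENXIO.
	 */
	if (prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
		  PR_RSEQ_SLICE_EXT_ENABLE, 0, 0))
		return -1;

	/* Reads back PR_RSEQ_SLICE_EXT_ENABLE once the extension is on. */
	return prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET,
		     0, 0, 0);
}

static void critical_section(void)
{
	/* ... request an extension via rseq->slice_ctrl, do the work ... */

	/*
	 * Relinquish the CPU side effect free. The syscall returns 1 when
	 * the task yielded within a grant and 0 when no grant was active
	 * or it was already revoked. Any other syscall also terminates
	 * the grant (see rseq_syscall_enter_work() above), it is just
	 * counted as aborted instead of yielded.
	 */
	long yielded = syscall(__NR_rseq_slice_yield);
	(void)yielded;
}

int main(void)
{
	if (slice_ext_enable() < 0) {
		perror("rseq slice extension");
		return 1;
	}
	critical_section();
	return 0;
}

Alternatively, passing RSEQ_FLAG_SLICE_EXT_DEFAULT_ON to rseq(2) enables the extension at registration time, and the grant length can be tuned through the new debugfs file rseq/slice_ext_nsec within the 5-50 microsecond bounds defined above.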
+3
kernel/sched/clock.c
··· 173 173 scd->tick_gtod, __gtod_offset, 174 174 scd->tick_raw, __sched_clock_offset); 175 175 176 + disable_sched_clock_irqtime(); 176 177 static_branch_disable(&__sched_clock_stable); 177 178 } 178 179 ··· 239 238 240 239 if (__sched_clock_stable_early) 241 240 __set_sched_clock_stable(); 241 + else 242 + disable_sched_clock_irqtime(); /* disable if clock unstable. */ 242 243 243 244 return 0; 244 245 }
+64 -22
kernel/sched/core.c
··· 119 119 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp); 120 120 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp); 121 121 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp); 122 + EXPORT_TRACEPOINT_SYMBOL_GPL(sched_entry_tp); 123 + EXPORT_TRACEPOINT_SYMBOL_GPL(sched_exit_tp); 124 + EXPORT_TRACEPOINT_SYMBOL_GPL(sched_set_need_resched_tp); 122 125 123 126 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); 124 127 DEFINE_PER_CPU(struct rnd_state, sched_rnd_state); ··· 1144 1141 { 1145 1142 trace_sched_set_need_resched_tp(curr, smp_processor_id(), tif); 1146 1143 } 1144 + EXPORT_SYMBOL_GPL(__trace_set_need_resched); 1147 1145 1148 1146 void resched_curr(struct rq *rq) 1149 1147 { ··· 2099 2095 */ 2100 2096 uclamp_rq_inc(rq, p, flags); 2101 2097 2102 - rq->queue_mask |= p->sched_class->queue_mask; 2103 2098 p->sched_class->enqueue_task(rq, p, flags); 2104 2099 2105 2100 psi_enqueue(p, flags); ··· 2131 2128 * and mark the task ->sched_delayed. 2132 2129 */ 2133 2130 uclamp_rq_dec(rq, p); 2134 - rq->queue_mask |= p->sched_class->queue_mask; 2135 2131 return p->sched_class->dequeue_task(rq, p, flags); 2136 2132 } 2137 2133 ··· 2181 2179 { 2182 2180 struct task_struct *donor = rq->donor; 2183 2181 2184 - if (p->sched_class == donor->sched_class) 2185 - donor->sched_class->wakeup_preempt(rq, p, flags); 2186 - else if (sched_class_above(p->sched_class, donor->sched_class)) 2182 + if (p->sched_class == rq->next_class) { 2183 + rq->next_class->wakeup_preempt(rq, p, flags); 2184 + 2185 + } else if (sched_class_above(p->sched_class, rq->next_class)) { 2186 + rq->next_class->wakeup_preempt(rq, p, flags); 2187 2187 resched_curr(rq); 2188 + rq->next_class = p->sched_class; 2189 + } 2188 2190 2189 2191 /* 2190 2192 * A queue event has occurred, and we're going to schedule. 
In ··· 3626 3620 trace_sched_wakeup(p); 3627 3621 } 3628 3622 3623 + void update_rq_avg_idle(struct rq *rq) 3624 + { 3625 + u64 delta = rq_clock(rq) - rq->idle_stamp; 3626 + u64 max = 2*rq->max_idle_balance_cost; 3627 + 3628 + update_avg(&rq->avg_idle, delta); 3629 + 3630 + if (rq->avg_idle > max) 3631 + rq->avg_idle = max; 3632 + rq->idle_stamp = 0; 3633 + } 3634 + 3629 3635 static void 3630 3636 ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, 3631 3637 struct rq_flags *rf) ··· 3672 3654 rq_unpin_lock(rq, rf); 3673 3655 p->sched_class->task_woken(rq, p); 3674 3656 rq_repin_lock(rq, rf); 3675 - } 3676 - 3677 - if (rq->idle_stamp) { 3678 - u64 delta = rq_clock(rq) - rq->idle_stamp; 3679 - u64 max = 2*rq->max_idle_balance_cost; 3680 - 3681 - update_avg(&rq->avg_idle, delta); 3682 - 3683 - if (rq->avg_idle > max) 3684 - rq->avg_idle = max; 3685 - 3686 - rq->idle_stamp = 0; 3687 3657 } 3688 3658 } 3689 3659 ··· 6842 6836 pick_again: 6843 6837 next = pick_next_task(rq, rq->donor, &rf); 6844 6838 rq_set_donor(rq, next); 6839 + rq->next_class = next->sched_class; 6845 6840 if (unlikely(task_is_blocked(next))) { 6846 6841 next = find_proxy_task(rq, next, &rf); 6847 6842 if (!next) ··· 7587 7580 7588 7581 int sched_dynamic_mode(const char *str) 7589 7582 { 7590 - # ifndef CONFIG_PREEMPT_RT 7583 + # if !(defined(CONFIG_PREEMPT_RT) || defined(CONFIG_ARCH_HAS_PREEMPT_LAZY)) 7591 7584 if (!strcmp(str, "none")) 7592 7585 return preempt_dynamic_none; 7593 7586 ··· 8519 8512 dump_rq_tasks(rq, KERN_WARNING); 8520 8513 } 8521 8514 dl_server_stop(&rq->fair_server); 8515 + #ifdef CONFIG_SCHED_CLASS_EXT 8516 + dl_server_stop(&rq->ext_server); 8517 + #endif 8522 8518 rq_unlock_irqrestore(rq, &rf); 8523 8519 8524 8520 calc_load_migrate(rq); ··· 8697 8687 rq->rt.rt_runtime = global_rt_runtime(); 8698 8688 init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL); 8699 8689 #endif 8690 + rq->next_class = &idle_sched_class; 8691 + 8700 8692 rq->sd = NULL; 8701 8693 rq->rd = NULL; 8702 8694 rq->cpu_capacity = SCHED_CAPACITY_SCALE; ··· 8727 8715 hrtick_rq_init(rq); 8728 8716 atomic_set(&rq->nr_iowait, 0); 8729 8717 fair_server_init(rq); 8718 + #ifdef CONFIG_SCHED_CLASS_EXT 8719 + ext_server_init(rq); 8720 + #endif 8730 8721 8731 8722 #ifdef CONFIG_SCHED_CORE 8732 8723 rq->core = rq; ··· 9161 9146 { 9162 9147 unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE; 9163 9148 bool resched = false; 9149 + bool queued = false; 9164 9150 struct rq *rq; 9165 9151 9166 9152 CLASS(task_rq_lock, rq_guard)(tsk); ··· 9173 9157 scx_cgroup_move_task(tsk); 9174 9158 if (scope->running) 9175 9159 resched = true; 9160 + queued = scope->queued; 9176 9161 } 9177 9162 9178 9163 if (resched) 9179 9164 resched_curr(rq); 9165 + else if (queued) 9166 + wakeup_preempt(rq, tsk, 0); 9180 9167 9181 9168 __balance_callbacks(rq, &rq_guard.rf); 9182 9169 } ··· 10902 10883 flags |= DEQUEUE_NOCLOCK; 10903 10884 } 10904 10885 10905 - if (flags & DEQUEUE_CLASS) { 10906 - if (p->sched_class->switching_from) 10907 - p->sched_class->switching_from(rq, p); 10908 - } 10886 + if ((flags & DEQUEUE_CLASS) && p->sched_class->switching_from) 10887 + p->sched_class->switching_from(rq, p); 10909 10888 10910 10889 *ctx = (struct sched_change_ctx){ 10911 10890 .p = p, 10891 + .class = p->sched_class, 10912 10892 .flags = flags, 10913 10893 .queued = task_on_rq_queued(p), 10914 10894 .running = task_current_donor(rq, p), ··· 10938 10920 10939 10921 lockdep_assert_rq_held(rq); 10940 10922 10923 + /* 10924 + * Changing class without 
*QUEUE_CLASS is bad. 10925 + */ 10926 + WARN_ON_ONCE(p->sched_class != ctx->class && !(ctx->flags & ENQUEUE_CLASS)); 10927 + 10941 10928 if ((ctx->flags & ENQUEUE_CLASS) && p->sched_class->switching_to) 10942 10929 p->sched_class->switching_to(rq, p); 10943 10930 ··· 10954 10931 if (ctx->flags & ENQUEUE_CLASS) { 10955 10932 if (p->sched_class->switched_to) 10956 10933 p->sched_class->switched_to(rq, p); 10934 + 10935 + if (ctx->running) { 10936 + /* 10937 + * If this was a class promotion; let the old class 10938 + * know it got preempted. Note that none of the 10939 + * switch*_from() methods know the new class and none 10940 + * of the switch*_to() methods know the old class. 10941 + */ 10942 + if (sched_class_above(p->sched_class, ctx->class)) { 10943 + rq->next_class->wakeup_preempt(rq, p, 0); 10944 + rq->next_class = p->sched_class; 10945 + } 10946 + /* 10947 + * If this was a degradation in class; make sure to 10948 + * reschedule. 10949 + */ 10950 + if (sched_class_above(ctx->class, p->sched_class)) 10951 + resched_curr(rq); 10952 + } 10957 10953 } else { 10958 10954 p->sched_class->prio_changed(rq, p, ctx->prio); 10959 10955 }
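Among the kernel/sched/core.c changes above, the new EXPORT_TRACEPOINT_SYMBOL_GPL() lines make the bare sched_entry/sched_exit/sched_set_need_resched tracepoints usable from modules. A hedged sketch of a module-side probe follows; it assumes the tracepoints are declared via DECLARE_TRACE() in <trace/events/sched.h> and that the probe arguments follow the trace_sched_set_need_resched_tp(curr, cpu, tif) call site shown above.

#include <linux/module.h>
#include <linux/sched.h>
#include <linux/tracepoint.h>
#include <trace/events/sched.h>

static void probe_set_need_resched(void *data, struct task_struct *tsk,
				   int cpu, int tif)
{
	/* For example: count resched requests per CPU. Keep this NMI-safe. */
}

static int __init resched_probe_init(void)
{
	/* register_trace_<name>() is generated for every declared tracepoint. */
	return register_trace_sched_set_need_resched_tp(probe_set_need_resched,
							NULL);
}

static void __exit resched_probe_exit(void)
{
	unregister_trace_sched_set_need_resched_tp(probe_set_need_resched, NULL);
	tracepoint_synchronize_unregister();
}

module_init(resched_probe_init);
module_exit(resched_probe_exit);
MODULE_LICENSE("GPL");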
+1 -1
kernel/sched/cpufreq_schedutil.c
··· 682 682 "sugov:%d", 683 683 cpumask_first(policy->related_cpus)); 684 684 if (IS_ERR(thread)) { 685 - pr_err("failed to create sugov thread: %ld\n", PTR_ERR(thread)); 685 + pr_err("failed to create sugov thread: %pe\n", thread); 686 686 return PTR_ERR(thread); 687 687 } 688 688
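The one-line schedutil change above swaps the %ld/PTR_ERR() pair for the %pe specifier, which prints an ERR_PTR symbolically when CONFIG_SYMBOLIC_ERRNAME=y and falls back to the signed decimal value otherwise. An illustrative sketch (wrapped as a throwaway module init, not from the commit):

#include <linux/module.h>
#include <linux/err.h>
#include <linux/printk.h>
#include <linux/sched.h>

static int __init pe_demo_init(void)
{
	struct task_struct *thread = ERR_PTR(-EINTR);

	pr_err("failed to create sugov thread: %ld\n", PTR_ERR(thread)); /* "... thread: -4" */
	pr_err("failed to create sugov thread: %pe\n", thread);          /* "... thread: -EINTR" */
	return 0;
}
module_init(pe_demo_init);
MODULE_LICENSE("GPL");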
+5 -4
kernel/sched/cputime.c
··· 12 12 13 13 #ifdef CONFIG_IRQ_TIME_ACCOUNTING 14 14 15 + DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime); 16 + 15 17 /* 16 18 * There are no locks covering percpu hardirq/softirq time. 17 19 * They are only modified in vtime_account, on corresponding CPU ··· 27 25 */ 28 26 DEFINE_PER_CPU(struct irqtime, cpu_irqtime); 29 27 30 - int sched_clock_irqtime; 31 - 32 28 void enable_sched_clock_irqtime(void) 33 29 { 34 - sched_clock_irqtime = 1; 30 + static_branch_enable(&sched_clock_irqtime); 35 31 } 36 32 37 33 void disable_sched_clock_irqtime(void) 38 34 { 39 - sched_clock_irqtime = 0; 35 + if (irqtime_enabled()) 36 + static_branch_disable(&sched_clock_irqtime); 40 37 } 41 38 42 39 static void irqtime_account_delta(struct irqtime *irqtime, u64 delta,
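Together with the clock.c hunk further up, this converts sched_clock_irqtime from a plain int into a static key, so the per-event irqtime check becomes a patched branch rather than a memory load, and the key is cleared again if sched_clock turns out to be unstable. A short sketch of the assumed read side (irqtime_enabled() is only referenced, not shown, in the hunk):

/* Assumed helper, wrapping the key defined in the hunk above: */
static inline bool irqtime_enabled(void)
{
	return static_branch_likely(&sched_clock_irqtime);
}

/* Hypothetical caller: the disabled case costs a patched jump, not a load. */
static void account_irq_delta(u64 delta)
{
	if (!irqtime_enabled())
		return;
	/* ... accumulate delta into this CPU's struct irqtime ... */
}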
+74 -31
kernel/sched/deadline.c
··· 1449 1449 dl_se->dl_defer_idle = 0; 1450 1450 1451 1451 /* 1452 - * The fair server can consume its runtime while throttled (not queued/ 1453 - * running as regular CFS). 1452 + * The DL server can consume its runtime while throttled (not 1453 + * queued / running as regular CFS). 1454 1454 * 1455 1455 * If the server consumes its entire runtime in this state. The server 1456 1456 * is not required for the current period. Thus, reset the server by ··· 1535 1535 } 1536 1536 1537 1537 /* 1538 - * The fair server (sole dl_server) does not account for real-time 1539 - * workload because it is running fair work. 1538 + * The dl_server does not account for real-time workload because it 1539 + * is running fair work. 1540 1540 */ 1541 - if (dl_se == &rq->fair_server) 1541 + if (dl_se->dl_server) 1542 1542 return; 1543 1543 1544 1544 #ifdef CONFIG_RT_GROUP_SCHED ··· 1573 1573 * In the non-defer mode, the idle time is not accounted, as the 1574 1574 * server provides a guarantee. 1575 1575 * 1576 - * If the dl_server is in defer mode, the idle time is also considered 1577 - * as time available for the fair server, avoiding a penalty for the 1578 - * rt scheduler that did not consumed that time. 1576 + * If the dl_server is in defer mode, the idle time is also considered as 1577 + * time available for the dl_server, avoiding a penalty for the rt 1578 + * scheduler that did not consumed that time. 1579 1579 */ 1580 1580 void dl_server_update_idle(struct sched_dl_entity *dl_se, s64 delta_exec) 1581 1581 { ··· 1799 1799 struct rq *rq = dl_se->rq; 1800 1800 1801 1801 dl_se->dl_defer_idle = 0; 1802 - if (!dl_server(dl_se) || dl_se->dl_server_active) 1802 + if (!dl_server(dl_se) || dl_se->dl_server_active || !dl_se->dl_runtime) 1803 1803 return; 1804 1804 1805 1805 /* ··· 1860 1860 dl_se->dl_server = 1; 1861 1861 dl_se->dl_defer = 1; 1862 1862 setup_new_dl_entity(dl_se); 1863 + 1864 + #ifdef CONFIG_SCHED_CLASS_EXT 1865 + dl_se = &rq->ext_server; 1866 + 1867 + WARN_ON(dl_server(dl_se)); 1868 + 1869 + dl_server_apply_params(dl_se, runtime, period, 1); 1870 + 1871 + dl_se->dl_server = 1; 1872 + dl_se->dl_defer = 1; 1873 + setup_new_dl_entity(dl_se); 1874 + #endif 1863 1875 } 1864 1876 } 1865 1877 ··· 1898 1886 int cpu = cpu_of(rq); 1899 1887 struct dl_bw *dl_b; 1900 1888 unsigned long cap; 1901 - int retval = 0; 1902 1889 int cpus; 1903 1890 1904 1891 dl_b = dl_bw_of(cpu); ··· 1929 1918 dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime); 1930 1919 dl_se->dl_density = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime); 1931 1920 1932 - return retval; 1921 + return 0; 1933 1922 } 1934 1923 1935 1924 /* ··· 2526 2515 * Only called when both the current and waking task are -deadline 2527 2516 * tasks. 2528 2517 */ 2529 - static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p, 2530 - int flags) 2518 + static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p, int flags) 2531 2519 { 2520 + /* 2521 + * Can only get preempted by stop-class, and those should be 2522 + * few and short lived, doesn't really make sense to push 2523 + * anything away for that. 
2524 + */ 2525 + if (p->sched_class != &dl_sched_class) 2526 + return; 2527 + 2532 2528 if (dl_entity_preempt(&p->dl, &rq->donor->dl)) { 2533 2529 resched_curr(rq); 2534 2530 return; ··· 3209 3191 raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags); 3210 3192 } 3211 3193 3194 + static void dl_server_add_bw(struct root_domain *rd, int cpu) 3195 + { 3196 + struct sched_dl_entity *dl_se; 3197 + 3198 + dl_se = &cpu_rq(cpu)->fair_server; 3199 + if (dl_server(dl_se) && cpu_active(cpu)) 3200 + __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu)); 3201 + 3202 + #ifdef CONFIG_SCHED_CLASS_EXT 3203 + dl_se = &cpu_rq(cpu)->ext_server; 3204 + if (dl_server(dl_se) && cpu_active(cpu)) 3205 + __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu)); 3206 + #endif 3207 + } 3208 + 3209 + static u64 dl_server_read_bw(int cpu) 3210 + { 3211 + u64 dl_bw = 0; 3212 + 3213 + if (cpu_rq(cpu)->fair_server.dl_server) 3214 + dl_bw += cpu_rq(cpu)->fair_server.dl_bw; 3215 + 3216 + #ifdef CONFIG_SCHED_CLASS_EXT 3217 + if (cpu_rq(cpu)->ext_server.dl_server) 3218 + dl_bw += cpu_rq(cpu)->ext_server.dl_bw; 3219 + #endif 3220 + 3221 + return dl_bw; 3222 + } 3223 + 3212 3224 void dl_clear_root_domain(struct root_domain *rd) 3213 3225 { 3214 3226 int i; ··· 3257 3209 * dl_servers are not tasks. Since dl_add_task_root_domain ignores 3258 3210 * them, we need to account for them here explicitly. 3259 3211 */ 3260 - for_each_cpu(i, rd->span) { 3261 - struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server; 3262 - 3263 - if (dl_server(dl_se) && cpu_active(i)) 3264 - __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(i)); 3265 - } 3212 + for_each_cpu(i, rd->span) 3213 + dl_server_add_bw(rd, i); 3266 3214 } 3267 3215 3268 3216 void dl_clear_root_domain_cpu(int cpu) ··· 3404 3360 #endif 3405 3361 3406 3362 DEFINE_SCHED_CLASS(dl) = { 3407 - 3408 - .queue_mask = 8, 3409 - 3410 3363 .enqueue_task = enqueue_task_dl, 3411 3364 .dequeue_task = dequeue_task_dl, 3412 3365 .yield_task = yield_task_dl, ··· 3697 3656 dl_se->dl_non_contending = 0; 3698 3657 dl_se->dl_overrun = 0; 3699 3658 dl_se->dl_server = 0; 3659 + dl_se->dl_defer = 0; 3660 + dl_se->dl_defer_running = 0; 3661 + dl_se->dl_defer_armed = 0; 3700 3662 3701 3663 #ifdef CONFIG_RT_MUTEXES 3702 3664 dl_se->pi_se = dl_se; ··· 3757 3713 unsigned long flags, cap; 3758 3714 struct dl_bw *dl_b; 3759 3715 bool overflow = 0; 3760 - u64 fair_server_bw = 0; 3716 + u64 dl_server_bw = 0; 3761 3717 3762 3718 rcu_read_lock_sched(); 3763 3719 dl_b = dl_bw_of(cpu); ··· 3790 3746 cap -= arch_scale_cpu_capacity(cpu); 3791 3747 3792 3748 /* 3793 - * cpu is going offline and NORMAL tasks will be moved away 3794 - * from it. We can thus discount dl_server bandwidth 3795 - * contribution as it won't need to be servicing tasks after 3796 - * the cpu is off. 3749 + * cpu is going offline and NORMAL and EXT tasks will be 3750 + * moved away from it. We can thus discount dl_server 3751 + * bandwidth contribution as it won't need to be servicing 3752 + * tasks after the cpu is off. 3797 3753 */ 3798 - if (cpu_rq(cpu)->fair_server.dl_server) 3799 - fair_server_bw = cpu_rq(cpu)->fair_server.dl_bw; 3754 + dl_server_bw = dl_server_read_bw(cpu); 3800 3755 3801 3756 /* 3802 3757 * Not much to check if no DEADLINE bandwidth is present. 3803 3758 * dl_servers we can discount, as tasks will be moved out the 3804 3759 * offlined CPUs anyway. 
3805 3760 */ 3806 - if (dl_b->total_bw - fair_server_bw > 0) { 3761 + if (dl_b->total_bw - dl_server_bw > 0) { 3807 3762 /* 3808 3763 * Leaving at least one CPU for DEADLINE tasks seems a 3809 3764 * wise thing to do. As said above, cpu is not offline 3810 3765 * yet, so account for that. 3811 3766 */ 3812 3767 if (dl_bw_cpus(cpu) - 1) 3813 - overflow = __dl_overflow(dl_b, cap, fair_server_bw, 0); 3768 + overflow = __dl_overflow(dl_b, cap, dl_server_bw, 0); 3814 3769 else 3815 3770 overflow = 1; 3816 3771 }
+146 -41
kernel/sched/debug.c
··· 172 172 static ssize_t sched_scaling_write(struct file *filp, const char __user *ubuf, 173 173 size_t cnt, loff_t *ppos) 174 174 { 175 - char buf[16]; 176 175 unsigned int scaling; 176 + int ret; 177 177 178 - if (cnt > 15) 179 - cnt = 15; 180 - 181 - if (copy_from_user(&buf, ubuf, cnt)) 182 - return -EFAULT; 183 - buf[cnt] = '\0'; 184 - 185 - if (kstrtouint(buf, 10, &scaling)) 186 - return -EINVAL; 178 + ret = kstrtouint_from_user(ubuf, cnt, 10, &scaling); 179 + if (ret) 180 + return ret; 187 181 188 182 if (scaling >= SCHED_TUNABLESCALING_END) 189 183 return -EINVAL; ··· 237 243 238 244 static int sched_dynamic_show(struct seq_file *m, void *v) 239 245 { 240 - int i = IS_ENABLED(CONFIG_PREEMPT_RT) * 2; 246 + int i = (IS_ENABLED(CONFIG_PREEMPT_RT) || IS_ENABLED(CONFIG_ARCH_HAS_PREEMPT_LAZY)) * 2; 241 247 int j; 242 248 243 249 /* Count entries in NULL terminated preempt_modes */ ··· 330 336 DL_PERIOD, 331 337 }; 332 338 333 - static unsigned long fair_server_period_max = (1UL << 22) * NSEC_PER_USEC; /* ~4 seconds */ 334 - static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC; /* 100 us */ 339 + static unsigned long dl_server_period_max = (1UL << 22) * NSEC_PER_USEC; /* ~4 seconds */ 340 + static unsigned long dl_server_period_min = (100) * NSEC_PER_USEC; /* 100 us */ 335 341 336 - static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubuf, 337 - size_t cnt, loff_t *ppos, enum dl_param param) 342 + static ssize_t sched_server_write_common(struct file *filp, const char __user *ubuf, 343 + size_t cnt, loff_t *ppos, enum dl_param param, 344 + void *server) 338 345 { 339 346 long cpu = (long) ((struct seq_file *) filp->private_data)->private; 347 + struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server; 348 + u64 old_runtime, runtime, period; 340 349 struct rq *rq = cpu_rq(cpu); 341 - u64 runtime, period; 350 + int retval = 0; 342 351 size_t err; 343 - int retval; 344 352 u64 value; 345 353 346 354 err = kstrtoull_from_user(ubuf, cnt, 10, &value); ··· 350 354 return err; 351 355 352 356 scoped_guard (rq_lock_irqsave, rq) { 353 - runtime = rq->fair_server.dl_runtime; 354 - period = rq->fair_server.dl_period; 357 + old_runtime = runtime = dl_se->dl_runtime; 358 + period = dl_se->dl_period; 355 359 356 360 switch (param) { 357 361 case DL_RUNTIME: ··· 367 371 } 368 372 369 373 if (runtime > period || 370 - period > fair_server_period_max || 371 - period < fair_server_period_min) { 374 + period > dl_server_period_max || 375 + period < dl_server_period_min) { 372 376 return -EINVAL; 373 377 } 374 378 375 379 update_rq_clock(rq); 376 - dl_server_stop(&rq->fair_server); 380 + dl_server_stop(dl_se); 381 + retval = dl_server_apply_params(dl_se, runtime, period, 0); 382 + dl_server_start(dl_se); 377 383 378 - retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0); 379 - if (retval) 380 - cnt = retval; 384 + if (retval < 0) 385 + return retval; 386 + } 381 387 382 - if (!runtime) 383 - printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n", 384 - cpu_of(rq)); 385 - 386 - if (rq->cfs.h_nr_queued) 387 - dl_server_start(&rq->fair_server); 388 + if (!!old_runtime ^ !!runtime) { 389 + pr_info("%s server %sabled on CPU %d%s.\n", 390 + server == &rq->fair_server ? "Fair" : "Ext", 391 + runtime ? "en" : "dis", 392 + cpu_of(rq), 393 + runtime ? 
"" : ", system may malfunction due to starvation"); 388 394 } 389 395 390 396 *ppos += cnt; 391 397 return cnt; 392 398 } 393 399 394 - static size_t sched_fair_server_show(struct seq_file *m, void *v, enum dl_param param) 400 + static size_t sched_server_show_common(struct seq_file *m, void *v, enum dl_param param, 401 + void *server) 395 402 { 396 - unsigned long cpu = (unsigned long) m->private; 397 - struct rq *rq = cpu_rq(cpu); 403 + struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server; 398 404 u64 value; 399 405 400 406 switch (param) { 401 407 case DL_RUNTIME: 402 - value = rq->fair_server.dl_runtime; 408 + value = dl_se->dl_runtime; 403 409 break; 404 410 case DL_PERIOD: 405 - value = rq->fair_server.dl_period; 411 + value = dl_se->dl_period; 406 412 break; 407 413 } 408 414 409 415 seq_printf(m, "%llu\n", value); 410 416 return 0; 411 - 412 417 } 413 418 414 419 static ssize_t 415 420 sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf, 416 421 size_t cnt, loff_t *ppos) 417 422 { 418 - return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_RUNTIME); 423 + long cpu = (long) ((struct seq_file *) filp->private_data)->private; 424 + struct rq *rq = cpu_rq(cpu); 425 + 426 + return sched_server_write_common(filp, ubuf, cnt, ppos, DL_RUNTIME, 427 + &rq->fair_server); 419 428 } 420 429 421 430 static int sched_fair_server_runtime_show(struct seq_file *m, void *v) 422 431 { 423 - return sched_fair_server_show(m, v, DL_RUNTIME); 432 + unsigned long cpu = (unsigned long) m->private; 433 + struct rq *rq = cpu_rq(cpu); 434 + 435 + return sched_server_show_common(m, v, DL_RUNTIME, &rq->fair_server); 424 436 } 425 437 426 438 static int sched_fair_server_runtime_open(struct inode *inode, struct file *filp) ··· 444 440 .release = single_release, 445 441 }; 446 442 443 + #ifdef CONFIG_SCHED_CLASS_EXT 444 + static ssize_t 445 + sched_ext_server_runtime_write(struct file *filp, const char __user *ubuf, 446 + size_t cnt, loff_t *ppos) 447 + { 448 + long cpu = (long) ((struct seq_file *) filp->private_data)->private; 449 + struct rq *rq = cpu_rq(cpu); 450 + 451 + return sched_server_write_common(filp, ubuf, cnt, ppos, DL_RUNTIME, 452 + &rq->ext_server); 453 + } 454 + 455 + static int sched_ext_server_runtime_show(struct seq_file *m, void *v) 456 + { 457 + unsigned long cpu = (unsigned long) m->private; 458 + struct rq *rq = cpu_rq(cpu); 459 + 460 + return sched_server_show_common(m, v, DL_RUNTIME, &rq->ext_server); 461 + } 462 + 463 + static int sched_ext_server_runtime_open(struct inode *inode, struct file *filp) 464 + { 465 + return single_open(filp, sched_ext_server_runtime_show, inode->i_private); 466 + } 467 + 468 + static const struct file_operations ext_server_runtime_fops = { 469 + .open = sched_ext_server_runtime_open, 470 + .write = sched_ext_server_runtime_write, 471 + .read = seq_read, 472 + .llseek = seq_lseek, 473 + .release = single_release, 474 + }; 475 + #endif /* CONFIG_SCHED_CLASS_EXT */ 476 + 447 477 static ssize_t 448 478 sched_fair_server_period_write(struct file *filp, const char __user *ubuf, 449 479 size_t cnt, loff_t *ppos) 450 480 { 451 - return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_PERIOD); 481 + long cpu = (long) ((struct seq_file *) filp->private_data)->private; 482 + struct rq *rq = cpu_rq(cpu); 483 + 484 + return sched_server_write_common(filp, ubuf, cnt, ppos, DL_PERIOD, 485 + &rq->fair_server); 452 486 } 453 487 454 488 static int sched_fair_server_period_show(struct seq_file *m, void *v) 455 489 { 456 - return 
sched_fair_server_show(m, v, DL_PERIOD); 490 + unsigned long cpu = (unsigned long) m->private; 491 + struct rq *rq = cpu_rq(cpu); 492 + 493 + return sched_server_show_common(m, v, DL_PERIOD, &rq->fair_server); 457 494 } 458 495 459 496 static int sched_fair_server_period_open(struct inode *inode, struct file *filp) ··· 509 464 .llseek = seq_lseek, 510 465 .release = single_release, 511 466 }; 467 + 468 + #ifdef CONFIG_SCHED_CLASS_EXT 469 + static ssize_t 470 + sched_ext_server_period_write(struct file *filp, const char __user *ubuf, 471 + size_t cnt, loff_t *ppos) 472 + { 473 + long cpu = (long) ((struct seq_file *) filp->private_data)->private; 474 + struct rq *rq = cpu_rq(cpu); 475 + 476 + return sched_server_write_common(filp, ubuf, cnt, ppos, DL_PERIOD, 477 + &rq->ext_server); 478 + } 479 + 480 + static int sched_ext_server_period_show(struct seq_file *m, void *v) 481 + { 482 + unsigned long cpu = (unsigned long) m->private; 483 + struct rq *rq = cpu_rq(cpu); 484 + 485 + return sched_server_show_common(m, v, DL_PERIOD, &rq->ext_server); 486 + } 487 + 488 + static int sched_ext_server_period_open(struct inode *inode, struct file *filp) 489 + { 490 + return single_open(filp, sched_ext_server_period_show, inode->i_private); 491 + } 492 + 493 + static const struct file_operations ext_server_period_fops = { 494 + .open = sched_ext_server_period_open, 495 + .write = sched_ext_server_period_write, 496 + .read = seq_read, 497 + .llseek = seq_lseek, 498 + .release = single_release, 499 + }; 500 + #endif /* CONFIG_SCHED_CLASS_EXT */ 512 501 513 502 static struct dentry *debugfs_sched; 514 503 ··· 566 487 debugfs_create_file("period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops); 567 488 } 568 489 } 490 + 491 + #ifdef CONFIG_SCHED_CLASS_EXT 492 + static void debugfs_ext_server_init(void) 493 + { 494 + struct dentry *d_ext; 495 + unsigned long cpu; 496 + 497 + d_ext = debugfs_create_dir("ext_server", debugfs_sched); 498 + if (!d_ext) 499 + return; 500 + 501 + for_each_possible_cpu(cpu) { 502 + struct dentry *d_cpu; 503 + char buf[32]; 504 + 505 + snprintf(buf, sizeof(buf), "cpu%lu", cpu); 506 + d_cpu = debugfs_create_dir(buf, d_ext); 507 + 508 + debugfs_create_file("runtime", 0644, d_cpu, (void *) cpu, &ext_server_runtime_fops); 509 + debugfs_create_file("period", 0644, d_cpu, (void *) cpu, &ext_server_period_fops); 510 + } 511 + } 512 + #endif /* CONFIG_SCHED_CLASS_EXT */ 569 513 570 514 static __init int sched_init_debug(void) 571 515 { ··· 628 526 debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops); 629 527 630 528 debugfs_fair_server_init(); 529 + #ifdef CONFIG_SCHED_CLASS_EXT 530 + debugfs_ext_server_init(); 531 + #endif 631 532 632 533 return 0; 633 534 }
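The debug.c refactoring above generalizes the per-CPU fair_server debugfs knobs and adds a matching ext_server directory, so the sched_ext deadline server gets its own runtime/period files. A minimal sketch for reading one of the new knobs, assuming debugfs is mounted at /sys/kernel/debug and the layout matches debugfs_ext_server_init() (values are nanoseconds, with the period bounded by dl_server_period_min/max above):

#include <stdio.h>

int main(void)
{
	const char *path = "/sys/kernel/debug/sched/ext_server/cpu0/runtime";
	unsigned long long runtime;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%llu", &runtime) == 1)
		printf("cpu0 ext_server runtime: %llu ns\n", runtime);
	fclose(f);
	return 0;
}

Writing 0 to the runtime file disables the server on that CPU, which now logs the "system may malfunction due to starvation" notice shown in sched_server_write_common() above.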
+37 -5
kernel/sched/ext.c
··· 959 959 if (!curr->scx.slice) 960 960 touch_core_sched(rq, curr); 961 961 } 962 + 963 + dl_server_update(&rq->ext_server, delta_exec); 962 964 } 963 965 964 966 static bool scx_dsq_priq_less(struct rb_node *node_a, ··· 1503 1501 1504 1502 if (enq_flags & SCX_ENQ_WAKEUP) 1505 1503 touch_core_sched(rq, p); 1504 + 1505 + /* Start dl_server if this is the first task being enqueued */ 1506 + if (rq->scx.nr_running == 1) 1507 + dl_server_start(&rq->ext_server); 1506 1508 1507 1509 do_enqueue_task(rq, p, enq_flags, sticky_cpu); 1508 1510 out: ··· 2460 2454 /* see kick_cpus_irq_workfn() */ 2461 2455 smp_store_release(&rq->scx.kick_sync, rq->scx.kick_sync + 1); 2462 2456 2463 - rq_modified_clear(rq); 2457 + rq->next_class = &ext_sched_class; 2464 2458 2465 2459 rq_unpin_lock(rq, rf); 2466 2460 balance_one(rq, prev); ··· 2475 2469 * If @force_scx is true, always try to pick a SCHED_EXT task, 2476 2470 * regardless of any higher-priority sched classes activity. 2477 2471 */ 2478 - if (!force_scx && rq_modified_above(rq, &ext_sched_class)) 2472 + if (!force_scx && sched_class_above(rq->next_class, &ext_sched_class)) 2479 2473 return RETRY_TASK; 2480 2474 2481 2475 keep_prev = rq->scx.flags & SCX_RQ_BAL_KEEP; ··· 2517 2511 static struct task_struct *pick_task_scx(struct rq *rq, struct rq_flags *rf) 2518 2512 { 2519 2513 return do_pick_task_scx(rq, rf, false); 2514 + } 2515 + 2516 + /* 2517 + * Select the next task to run from the ext scheduling class. 2518 + * 2519 + * Use do_pick_task_scx() directly with @force_scx enabled, since the 2520 + * dl_server must always select a sched_ext task. 2521 + */ 2522 + static struct task_struct * 2523 + ext_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf) 2524 + { 2525 + if (!scx_enabled()) 2526 + return NULL; 2527 + 2528 + return do_pick_task_scx(dl_se->rq, rf, true); 2529 + } 2530 + 2531 + /* 2532 + * Initialize the ext server deadline entity. 2533 + */ 2534 + void ext_server_init(struct rq *rq) 2535 + { 2536 + struct sched_dl_entity *dl_se = &rq->ext_server; 2537 + 2538 + init_dl_entity(dl_se); 2539 + 2540 + dl_server_init(dl_se, rq, ext_server_pick_task); 2520 2541 } 2521 2542 2522 2543 #ifdef CONFIG_SCHED_CORE ··· 3174 3141 scx_disable_task(p); 3175 3142 } 3176 3143 3177 - static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {} 3144 + static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_flags) {} 3145 + 3178 3146 static void switched_to_scx(struct rq *rq, struct task_struct *p) {} 3179 3147 3180 3148 int scx_check_setscheduler(struct task_struct *p, int policy) ··· 3436 3402 * their current sched_class. Call them directly from sched core instead. 3437 3403 */ 3438 3404 DEFINE_SCHED_CLASS(ext) = { 3439 - .queue_mask = 1, 3440 - 3441 3405 .enqueue_task = enqueue_task_scx, 3442 3406 .dequeue_task = dequeue_task_scx, 3443 3407 .yield_task = yield_task_scx,
+208 -203
kernel/sched/fair.c
··· 524 524 * Scheduling class tree data structure manipulation methods: 525 525 */ 526 526 527 + extern void __BUILD_BUG_vruntime_cmp(void); 528 + 529 + /* Use __builtin_strcmp() because of __HAVE_ARCH_STRCMP: */ 530 + 531 + #define vruntime_cmp(A, CMP_STR, B) ({ \ 532 + int __res = 0; \ 533 + \ 534 + if (!__builtin_strcmp(CMP_STR, "<")) { \ 535 + __res = ((s64)((A)-(B)) < 0); \ 536 + } else if (!__builtin_strcmp(CMP_STR, "<=")) { \ 537 + __res = ((s64)((A)-(B)) <= 0); \ 538 + } else if (!__builtin_strcmp(CMP_STR, ">")) { \ 539 + __res = ((s64)((A)-(B)) > 0); \ 540 + } else if (!__builtin_strcmp(CMP_STR, ">=")) { \ 541 + __res = ((s64)((A)-(B)) >= 0); \ 542 + } else { \ 543 + /* Unknown operator throws linker error: */ \ 544 + __BUILD_BUG_vruntime_cmp(); \ 545 + } \ 546 + \ 547 + __res; \ 548 + }) 549 + 550 + extern void __BUILD_BUG_vruntime_op(void); 551 + 552 + #define vruntime_op(A, OP_STR, B) ({ \ 553 + s64 __res = 0; \ 554 + \ 555 + if (!__builtin_strcmp(OP_STR, "-")) { \ 556 + __res = (s64)((A)-(B)); \ 557 + } else { \ 558 + /* Unknown operator throws linker error: */ \ 559 + __BUILD_BUG_vruntime_op(); \ 560 + } \ 561 + \ 562 + __res; \ 563 + }) 564 + 565 + 527 566 static inline __maybe_unused u64 max_vruntime(u64 max_vruntime, u64 vruntime) 528 567 { 529 - s64 delta = (s64)(vruntime - max_vruntime); 530 - if (delta > 0) 568 + if (vruntime_cmp(vruntime, ">", max_vruntime)) 531 569 max_vruntime = vruntime; 532 570 533 571 return max_vruntime; ··· 573 535 574 536 static inline __maybe_unused u64 min_vruntime(u64 min_vruntime, u64 vruntime) 575 537 { 576 - s64 delta = (s64)(vruntime - min_vruntime); 577 - if (delta < 0) 538 + if (vruntime_cmp(vruntime, "<", min_vruntime)) 578 539 min_vruntime = vruntime; 579 540 580 541 return min_vruntime; ··· 586 549 * Tiebreak on vruntime seems unnecessary since it can 587 550 * hardly happen. 588 551 */ 589 - return (s64)(a->deadline - b->deadline) < 0; 552 + return vruntime_cmp(a->deadline, "<", b->deadline); 590 553 } 591 554 592 555 static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se) 593 556 { 594 - return (s64)(se->vruntime - cfs_rq->zero_vruntime); 557 + return vruntime_op(se->vruntime, "-", cfs_rq->zero_vruntime); 595 558 } 596 559 597 560 #define __node_2_se(node) \ ··· 613 576 * 614 577 * \Sum lag_i = 0 615 578 * \Sum w_i * (V - v_i) = 0 616 - * \Sum w_i * V - w_i * v_i = 0 579 + * \Sum (w_i * V - w_i * v_i) = 0 617 580 * 618 581 * From which we can solve an expression for V in v_i (which we have in 619 582 * se->vruntime): ··· 644 607 * Which we track using: 645 608 * 646 609 * v0 := cfs_rq->zero_vruntime 647 - * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime 648 - * \Sum w_i := cfs_rq->avg_load 610 + * \Sum (v_i - v0) * w_i := cfs_rq->sum_w_vruntime 611 + * \Sum w_i := cfs_rq->sum_weight 649 612 * 650 613 * Since zero_vruntime closely tracks the per-task service, these 651 - * deltas: (v_i - v), will be in the order of the maximal (virtual) lag 614 + * deltas: (v_i - v0), will be in the order of the maximal (virtual) lag 652 615 * induced in the system due to quantisation. 653 616 * 654 617 * Also, we use scale_load_down() to reduce the size. ··· 656 619 * As measured, the max (key * weight) value was ~44 bits for a kernel build. 
657 620 */ 658 621 static void 659 - avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se) 622 + sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se) 660 623 { 661 624 unsigned long weight = scale_load_down(se->load.weight); 662 625 s64 key = entity_key(cfs_rq, se); 663 626 664 - cfs_rq->avg_vruntime += key * weight; 665 - cfs_rq->avg_load += weight; 627 + cfs_rq->sum_w_vruntime += key * weight; 628 + cfs_rq->sum_weight += weight; 666 629 } 667 630 668 631 static void 669 - avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se) 632 + sum_w_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se) 670 633 { 671 634 unsigned long weight = scale_load_down(se->load.weight); 672 635 s64 key = entity_key(cfs_rq, se); 673 636 674 - cfs_rq->avg_vruntime -= key * weight; 675 - cfs_rq->avg_load -= weight; 637 + cfs_rq->sum_w_vruntime -= key * weight; 638 + cfs_rq->sum_weight -= weight; 676 639 } 677 640 678 641 static inline 679 - void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta) 642 + void sum_w_vruntime_update(struct cfs_rq *cfs_rq, s64 delta) 680 643 { 681 644 /* 682 - * v' = v + d ==> avg_vruntime' = avg_runtime - d*avg_load 645 + * v' = v + d ==> sum_w_vruntime' = sum_runtime - d*sum_weight 683 646 */ 684 - cfs_rq->avg_vruntime -= cfs_rq->avg_load * delta; 647 + cfs_rq->sum_w_vruntime -= cfs_rq->sum_weight * delta; 685 648 } 686 649 687 650 /* ··· 691 654 u64 avg_vruntime(struct cfs_rq *cfs_rq) 692 655 { 693 656 struct sched_entity *curr = cfs_rq->curr; 694 - s64 avg = cfs_rq->avg_vruntime; 695 - long load = cfs_rq->avg_load; 657 + s64 avg = cfs_rq->sum_w_vruntime; 658 + long load = cfs_rq->sum_weight; 696 659 697 660 if (curr && curr->on_rq) { 698 661 unsigned long weight = scale_load_down(curr->load.weight); ··· 759 722 static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime) 760 723 { 761 724 struct sched_entity *curr = cfs_rq->curr; 762 - s64 avg = cfs_rq->avg_vruntime; 763 - long load = cfs_rq->avg_load; 725 + s64 avg = cfs_rq->sum_w_vruntime; 726 + long load = cfs_rq->sum_weight; 764 727 765 728 if (curr && curr->on_rq) { 766 729 unsigned long weight = scale_load_down(curr->load.weight); ··· 769 732 load += weight; 770 733 } 771 734 772 - return avg >= (s64)(vruntime - cfs_rq->zero_vruntime) * load; 735 + return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load; 773 736 } 774 737 775 738 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se) ··· 780 743 static void update_zero_vruntime(struct cfs_rq *cfs_rq) 781 744 { 782 745 u64 vruntime = avg_vruntime(cfs_rq); 783 - s64 delta = (s64)(vruntime - cfs_rq->zero_vruntime); 746 + s64 delta = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime); 784 747 785 - avg_vruntime_update(cfs_rq, delta); 748 + sum_w_vruntime_update(cfs_rq, delta); 786 749 787 750 cfs_rq->zero_vruntime = vruntime; 788 751 } ··· 807 770 return entity_before(__node_2_se(a), __node_2_se(b)); 808 771 } 809 772 810 - #define vruntime_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; }) 811 - 812 773 static inline void __min_vruntime_update(struct sched_entity *se, struct rb_node *node) 813 774 { 814 775 if (node) { 815 776 struct sched_entity *rse = __node_2_se(node); 816 - if (vruntime_gt(min_vruntime, se, rse)) 777 + 778 + if (vruntime_cmp(se->min_vruntime, ">", rse->min_vruntime)) 817 779 se->min_vruntime = rse->min_vruntime; 818 780 } 819 781 } ··· 855 819 */ 856 820 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) 857 821 { 858 - 
avg_vruntime_add(cfs_rq, se); 822 + sum_w_vruntime_add(cfs_rq, se); 859 823 update_zero_vruntime(cfs_rq); 860 824 se->min_vruntime = se->vruntime; 861 825 se->min_slice = se->slice; ··· 867 831 { 868 832 rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline, 869 833 &min_vruntime_cb); 870 - avg_vruntime_sub(cfs_rq, se); 834 + sum_w_vruntime_sub(cfs_rq, se); 871 835 update_zero_vruntime(cfs_rq); 872 836 } 873 837 ··· 923 887 924 888 static inline bool protect_slice(struct sched_entity *se) 925 889 { 926 - return ((s64)(se->vprot - se->vruntime) > 0); 890 + return vruntime_cmp(se->vruntime, "<", se->vprot); 927 891 } 928 892 929 893 static inline void cancel_protect_slice(struct sched_entity *se) ··· 1060 1024 */ 1061 1025 static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se) 1062 1026 { 1063 - if ((s64)(se->vruntime - se->deadline) < 0) 1027 + if (vruntime_cmp(se->vruntime, "<", se->deadline)) 1064 1028 return false; 1065 1029 1066 1030 /* ··· 1549 1513 1550 1514 /* Scale the maximum scan period with the amount of shared memory. */ 1551 1515 rcu_read_lock(); 1552 - ng = rcu_dereference(p->numa_group); 1516 + ng = rcu_dereference_all(p->numa_group); 1553 1517 if (ng) { 1554 1518 unsigned long shared = group_faults_shared(ng); 1555 1519 unsigned long private = group_faults_priv(ng); ··· 1616 1580 pid_t gid = 0; 1617 1581 1618 1582 rcu_read_lock(); 1619 - ng = rcu_dereference(p->numa_group); 1583 + ng = rcu_dereference_all(p->numa_group); 1620 1584 if (ng) 1621 1585 gid = ng->gid; 1622 1586 rcu_read_unlock(); ··· 2275 2239 return false; 2276 2240 2277 2241 rcu_read_lock(); 2278 - cur = rcu_dereference(dst_rq->curr); 2242 + cur = rcu_dereference_all(dst_rq->curr); 2279 2243 if (cur && ((cur->flags & (PF_EXITING | PF_KTHREAD)) || 2280 2244 !cur->mm)) 2281 2245 cur = NULL; ··· 2320 2284 * If dst and source tasks are in the same NUMA group, or not 2321 2285 * in any group then look only at task weights. 2322 2286 */ 2323 - cur_ng = rcu_dereference(cur->numa_group); 2287 + cur_ng = rcu_dereference_all(cur->numa_group); 2324 2288 if (cur_ng == p_ng) { 2325 2289 /* 2326 2290 * Do not swap within a group or between tasks that have ··· 2494 2458 maymove = !load_too_imbalanced(src_load, dst_load, env); 2495 2459 } 2496 2460 2497 - for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) { 2498 - /* Skip this CPU if the source task cannot migrate */ 2499 - if (!cpumask_test_cpu(cpu, env->p->cpus_ptr)) 2500 - continue; 2501 - 2461 + /* Skip CPUs if the source task cannot migrate */ 2462 + for_each_cpu_and(cpu, cpumask_of_node(env->dst_nid), env->p->cpus_ptr) { 2502 2463 env->dst_cpu = cpu; 2503 2464 if (task_numa_compare(env, taskimp, groupimp, maymove)) 2504 2465 break; ··· 2532 2499 * to satisfy here. 
2533 2500 */ 2534 2501 rcu_read_lock(); 2535 - sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); 2502 + sd = rcu_dereference_all(per_cpu(sd_numa, env.src_cpu)); 2536 2503 if (sd) { 2537 2504 env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; 2538 2505 env.imb_numa_nr = sd->imb_numa_nr; ··· 3056 3023 if (!cpupid_match_pid(tsk, cpupid)) 3057 3024 goto no_join; 3058 3025 3059 - grp = rcu_dereference(tsk->numa_group); 3026 + grp = rcu_dereference_all(tsk->numa_group); 3060 3027 if (!grp) 3061 3028 goto no_join; 3062 3029 ··· 3727 3694 */ 3728 3695 #define add_positive(_ptr, _val) do { \ 3729 3696 typeof(_ptr) ptr = (_ptr); \ 3730 - typeof(_val) val = (_val); \ 3697 + __signed_scalar_typeof(*ptr) val = (_val); \ 3731 3698 typeof(*ptr) res, var = READ_ONCE(*ptr); \ 3732 3699 \ 3733 3700 res = var + val; \ ··· 3736 3703 res = 0; \ 3737 3704 \ 3738 3705 WRITE_ONCE(*ptr, res); \ 3739 - } while (0) 3740 - 3741 - /* 3742 - * Unsigned subtract and clamp on underflow. 3743 - * 3744 - * Explicitly do a load-store to ensure the intermediate value never hits 3745 - * memory. This allows lockless observations without ever seeing the negative 3746 - * values. 3747 - */ 3748 - #define sub_positive(_ptr, _val) do { \ 3749 - typeof(_ptr) ptr = (_ptr); \ 3750 - typeof(*ptr) val = (_val); \ 3751 - typeof(*ptr) res, var = READ_ONCE(*ptr); \ 3752 - res = var - val; \ 3753 - if (res > var) \ 3754 - res = 0; \ 3755 - WRITE_ONCE(*ptr, res); \ 3756 3706 } while (0) 3757 3707 3758 3708 /* ··· 3749 3733 *ptr -= min_t(typeof(*ptr), *ptr, _val); \ 3750 3734 } while (0) 3751 3735 3736 + 3737 + /* 3738 + * Because of rounding, se->util_sum might ends up being +1 more than 3739 + * cfs->util_sum. Although this is not a problem by itself, detaching 3740 + * a lot of tasks with the rounding problem between 2 updates of 3741 + * util_avg (~1ms) can make cfs->util_sum becoming null whereas 3742 + * cfs_util_avg is not. 3743 + * 3744 + * Check that util_sum is still above its lower bound for the new 3745 + * util_avg. 
Given that period_contrib might have moved since the last 3746 + * sync, we are only sure that util_sum must be above or equal to 3747 + * util_avg * minimum possible divider 3748 + */ 3749 + #define __update_sa(sa, name, delta_avg, delta_sum) do { \ 3750 + add_positive(&(sa)->name##_avg, delta_avg); \ 3751 + add_positive(&(sa)->name##_sum, delta_sum); \ 3752 + (sa)->name##_sum = max_t(typeof((sa)->name##_sum), \ 3753 + (sa)->name##_sum, \ 3754 + (sa)->name##_avg * PELT_MIN_DIVIDER); \ 3755 + } while (0) 3756 + 3752 3757 static inline void 3753 3758 enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) 3754 3759 { 3755 - cfs_rq->avg.load_avg += se->avg.load_avg; 3756 - cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum; 3760 + __update_sa(&cfs_rq->avg, load, se->avg.load_avg, 3761 + se_weight(se) * se->avg.load_sum); 3757 3762 } 3758 3763 3759 3764 static inline void 3760 3765 dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) 3761 3766 { 3762 - sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg); 3763 - sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum); 3764 - /* See update_cfs_rq_load_avg() */ 3765 - cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum, 3766 - cfs_rq->avg.load_avg * PELT_MIN_DIVIDER); 3767 + __update_sa(&cfs_rq->avg, load, -se->avg.load_avg, 3768 + se_weight(se) * -se->avg.load_sum); 3767 3769 } 3768 3770 3769 3771 static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags); ··· 4277 4243 */ 4278 4244 divider = get_pelt_divider(&cfs_rq->avg); 4279 4245 4280 - 4281 4246 /* Set new sched_entity's utilization */ 4282 4247 se->avg.util_avg = gcfs_rq->avg.util_avg; 4283 4248 new_sum = se->avg.util_avg * divider; ··· 4284 4251 se->avg.util_sum = new_sum; 4285 4252 4286 4253 /* Update parent cfs_rq utilization */ 4287 - add_positive(&cfs_rq->avg.util_avg, delta_avg); 4288 - add_positive(&cfs_rq->avg.util_sum, delta_sum); 4289 - 4290 - /* See update_cfs_rq_load_avg() */ 4291 - cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum, 4292 - cfs_rq->avg.util_avg * PELT_MIN_DIVIDER); 4254 + __update_sa(&cfs_rq->avg, util, delta_avg, delta_sum); 4293 4255 } 4294 4256 4295 4257 static inline void ··· 4310 4282 se->avg.runnable_sum = new_sum; 4311 4283 4312 4284 /* Update parent cfs_rq runnable */ 4313 - add_positive(&cfs_rq->avg.runnable_avg, delta_avg); 4314 - add_positive(&cfs_rq->avg.runnable_sum, delta_sum); 4315 - /* See update_cfs_rq_load_avg() */ 4316 - cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum, 4317 - cfs_rq->avg.runnable_avg * PELT_MIN_DIVIDER); 4285 + __update_sa(&cfs_rq->avg, runnable, delta_avg, delta_sum); 4318 4286 } 4319 4287 4320 4288 static inline void ··· 4374 4350 4375 4351 se->avg.load_sum = runnable_sum; 4376 4352 se->avg.load_avg = load_avg; 4377 - add_positive(&cfs_rq->avg.load_avg, delta_avg); 4378 - add_positive(&cfs_rq->avg.load_sum, delta_sum); 4379 - /* See update_cfs_rq_load_avg() */ 4380 - cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum, 4381 - cfs_rq->avg.load_avg * PELT_MIN_DIVIDER); 4353 + __update_sa(&cfs_rq->avg, load, delta_avg, delta_sum); 4382 4354 } 4383 4355 4384 4356 static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) ··· 4471 4451 rq = rq_of(cfs_rq); 4472 4452 4473 4453 rcu_read_lock(); 4474 - is_idle = is_idle_task(rcu_dereference(rq->curr)); 4454 + is_idle = is_idle_task(rcu_dereference_all(rq->curr)); 4475 4455 rcu_read_unlock(); 4476 4456 4477 4457 /* ··· 4573 4553 raw_spin_unlock(&cfs_rq->removed.lock); 
4574 4554 4575 4555 r = removed_load; 4576 - sub_positive(&sa->load_avg, r); 4577 - sub_positive(&sa->load_sum, r * divider); 4578 - /* See sa->util_sum below */ 4579 - sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * PELT_MIN_DIVIDER); 4556 + __update_sa(sa, load, -r, -r*divider); 4580 4557 4581 4558 r = removed_util; 4582 - sub_positive(&sa->util_avg, r); 4583 - sub_positive(&sa->util_sum, r * divider); 4584 - /* 4585 - * Because of rounding, se->util_sum might ends up being +1 more than 4586 - * cfs->util_sum. Although this is not a problem by itself, detaching 4587 - * a lot of tasks with the rounding problem between 2 updates of 4588 - * util_avg (~1ms) can make cfs->util_sum becoming null whereas 4589 - * cfs_util_avg is not. 4590 - * Check that util_sum is still above its lower bound for the new 4591 - * util_avg. Given that period_contrib might have moved since the last 4592 - * sync, we are only sure that util_sum must be above or equal to 4593 - * util_avg * minimum possible divider 4594 - */ 4595 - sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * PELT_MIN_DIVIDER); 4559 + __update_sa(sa, util, -r, -r*divider); 4596 4560 4597 4561 r = removed_runnable; 4598 - sub_positive(&sa->runnable_avg, r); 4599 - sub_positive(&sa->runnable_sum, r * divider); 4600 - /* See sa->util_sum above */ 4601 - sa->runnable_sum = max_t(u32, sa->runnable_sum, 4602 - sa->runnable_avg * PELT_MIN_DIVIDER); 4562 + __update_sa(sa, runnable, -r, -r*divider); 4603 4563 4604 4564 /* 4605 4565 * removed_runnable is the unweighted version of removed_load so we ··· 4664 4664 static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) 4665 4665 { 4666 4666 dequeue_load_avg(cfs_rq, se); 4667 - sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg); 4668 - sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum); 4669 - /* See update_cfs_rq_load_avg() */ 4670 - cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum, 4671 - cfs_rq->avg.util_avg * PELT_MIN_DIVIDER); 4672 - 4673 - sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg); 4674 - sub_positive(&cfs_rq->avg.runnable_sum, se->avg.runnable_sum); 4675 - /* See update_cfs_rq_load_avg() */ 4676 - cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum, 4677 - cfs_rq->avg.runnable_avg * PELT_MIN_DIVIDER); 4667 + __update_sa(&cfs_rq->avg, util, -se->avg.util_avg, -se->avg.util_sum); 4668 + __update_sa(&cfs_rq->avg, runnable, -se->avg.runnable_avg, -se->avg.runnable_sum); 4678 4669 4679 4670 add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum); 4680 4671 ··· 5168 5177 * 5169 5178 * vl_i = (W + w_i)*vl'_i / W 5170 5179 */ 5171 - load = cfs_rq->avg_load; 5180 + load = cfs_rq->sum_weight; 5172 5181 if (curr && curr->on_rq) 5173 5182 load += scale_load_down(curr->load.weight); 5174 5183 ··· 7141 7150 7142 7151 static struct { 7143 7152 cpumask_var_t idle_cpus_mask; 7144 - atomic_t nr_cpus; 7145 - int has_blocked; /* Idle CPUS has blocked load */ 7153 + int has_blocked_load; /* Idle CPUS has blocked load */ 7146 7154 int needs_update; /* Newly idle CPUs need their next_balance collated */ 7147 7155 unsigned long next_balance; /* in jiffy units */ 7148 7156 unsigned long next_blocked; /* Next update of blocked load in jiffies */ ··· 7499 7509 { 7500 7510 struct sched_domain_shared *sds; 7501 7511 7502 - sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); 7512 + sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu)); 7503 7513 if (sds) 7504 7514 WRITE_ONCE(sds->has_idle_cores, val); 7505 7515 } ··· 7508 7518 { 7509 7519 struct 
sched_domain_shared *sds; 7510 7520 7511 - sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); 7521 + sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu)); 7512 7522 if (sds) 7513 7523 return READ_ONCE(sds->has_idle_cores); 7514 7524 ··· 7637 7647 cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); 7638 7648 7639 7649 if (sched_feat(SIS_UTIL)) { 7640 - sd_share = rcu_dereference(per_cpu(sd_llc_shared, target)); 7650 + sd_share = rcu_dereference_all(per_cpu(sd_llc_shared, target)); 7641 7651 if (sd_share) { 7642 7652 /* because !--nr is the condition to stop scan */ 7643 7653 nr = READ_ONCE(sd_share->nr_idle_scan) + 1; ··· 7843 7853 * sd_asym_cpucapacity rather than sd_llc. 7844 7854 */ 7845 7855 if (sched_asym_cpucap_active()) { 7846 - sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target)); 7856 + sd = rcu_dereference_all(per_cpu(sd_asym_cpucapacity, target)); 7847 7857 /* 7848 7858 * On an asymmetric CPU capacity system where an exclusive 7849 7859 * cpuset defines a symmetric island (i.e. one unique ··· 7858 7868 } 7859 7869 } 7860 7870 7861 - sd = rcu_dereference(per_cpu(sd_llc, target)); 7871 + sd = rcu_dereference_all(per_cpu(sd_llc, target)); 7862 7872 if (!sd) 7863 7873 return target; 7864 7874 ··· 8327 8337 struct energy_env eenv; 8328 8338 8329 8339 rcu_read_lock(); 8330 - pd = rcu_dereference(rd->pd); 8340 + pd = rcu_dereference_all(rd->pd); 8331 8341 if (!pd) 8332 8342 goto unlock; 8333 8343 ··· 8335 8345 * Energy-aware wake-up happens on the lowest sched_domain starting 8336 8346 * from sd_asym_cpucapacity spanning over this_cpu and prev_cpu. 8337 8347 */ 8338 - sd = rcu_dereference(*this_cpu_ptr(&sd_asym_cpucapacity)); 8348 + sd = rcu_dereference_all(*this_cpu_ptr(&sd_asym_cpucapacity)); 8339 8349 while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd))) 8340 8350 sd = sd->parent; 8341 8351 if (!sd) ··· 8358 8368 int max_spare_cap_cpu = -1; 8359 8369 int fits, max_fits = -1; 8360 8370 8361 - cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask); 8362 - 8363 - if (cpumask_empty(cpus)) 8371 + if (!cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask)) 8364 8372 continue; 8365 8373 8366 8374 /* Account external pressure for the energy estimation */ ··· 8735 8747 /* 8736 8748 * Preempt the current task with a newly woken task if needed: 8737 8749 */ 8738 - static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags) 8750 + static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_flags) 8739 8751 { 8740 8752 enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK; 8741 8753 struct task_struct *donor = rq->donor; 8742 8754 struct sched_entity *se = &donor->se, *pse = &p->se; 8743 8755 struct cfs_rq *cfs_rq = task_cfs_rq(donor); 8744 8756 int cse_is_idle, pse_is_idle; 8757 + 8758 + /* 8759 + * XXX Getting preempted by higher class, try and find idle CPU? 
8760 + */ 8761 + if (p->sched_class != &fair_sched_class) 8762 + return; 8745 8763 8746 8764 if (unlikely(se == pse)) 8747 8765 return; ··· 9331 9337 */ 9332 9338 static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env) 9333 9339 { 9334 - struct numa_group *numa_group = rcu_dereference(p->numa_group); 9340 + struct numa_group *numa_group = rcu_dereference_all(p->numa_group); 9335 9341 unsigned long src_weight, dst_weight; 9336 9342 int src_nid, dst_nid, dist; 9337 9343 ··· 9760 9766 } 9761 9767 9762 9768 #ifdef CONFIG_NO_HZ_COMMON 9763 - static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq) 9769 + static inline bool cfs_rq_has_blocked_load(struct cfs_rq *cfs_rq) 9764 9770 { 9765 9771 if (cfs_rq->avg.load_avg) 9766 9772 return true; ··· 9793 9799 WRITE_ONCE(rq->last_blocked_load_update_tick, jiffies); 9794 9800 } 9795 9801 9796 - static inline void update_blocked_load_status(struct rq *rq, bool has_blocked) 9802 + static inline void update_has_blocked_load_status(struct rq *rq, bool has_blocked_load) 9797 9803 { 9798 - if (!has_blocked) 9804 + if (!has_blocked_load) 9799 9805 rq->has_blocked_load = 0; 9800 9806 } 9801 9807 #else /* !CONFIG_NO_HZ_COMMON: */ 9802 - static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq) { return false; } 9808 + static inline bool cfs_rq_has_blocked_load(struct cfs_rq *cfs_rq) { return false; } 9803 9809 static inline bool others_have_blocked(struct rq *rq) { return false; } 9804 9810 static inline void update_blocked_load_tick(struct rq *rq) {} 9805 - static inline void update_blocked_load_status(struct rq *rq, bool has_blocked) {} 9811 + static inline void update_has_blocked_load_status(struct rq *rq, bool has_blocked_load) {} 9806 9812 #endif /* !CONFIG_NO_HZ_COMMON */ 9807 9813 9808 9814 static bool __update_blocked_others(struct rq *rq, bool *done) ··· 9859 9865 list_del_leaf_cfs_rq(cfs_rq); 9860 9866 9861 9867 /* Don't need periodic decay once load/util_avg are null */ 9862 - if (cfs_rq_has_blocked(cfs_rq)) 9868 + if (cfs_rq_has_blocked_load(cfs_rq)) 9863 9869 *done = false; 9864 9870 } 9865 9871 ··· 9919 9925 bool decayed; 9920 9926 9921 9927 decayed = update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq); 9922 - if (cfs_rq_has_blocked(cfs_rq)) 9928 + if (cfs_rq_has_blocked_load(cfs_rq)) 9923 9929 *done = false; 9924 9930 9925 9931 return decayed; ··· 9931 9937 } 9932 9938 #endif /* !CONFIG_FAIR_GROUP_SCHED */ 9933 9939 9934 - static void sched_balance_update_blocked_averages(int cpu) 9940 + static void __sched_balance_update_blocked_averages(struct rq *rq) 9935 9941 { 9936 9942 bool decayed = false, done = true; 9937 - struct rq *rq = cpu_rq(cpu); 9938 - struct rq_flags rf; 9939 9943 9940 - rq_lock_irqsave(rq, &rf); 9941 9944 update_blocked_load_tick(rq); 9942 - update_rq_clock(rq); 9943 9945 9944 9946 decayed |= __update_blocked_others(rq, &done); 9945 9947 decayed |= __update_blocked_fair(rq, &done); 9946 9948 9947 - update_blocked_load_status(rq, !done); 9949 + update_has_blocked_load_status(rq, !done); 9948 9950 if (decayed) 9949 9951 cpufreq_update_util(rq, 0); 9950 - rq_unlock_irqrestore(rq, &rf); 9952 + } 9953 + 9954 + static void sched_balance_update_blocked_averages(int cpu) 9955 + { 9956 + struct rq *rq = cpu_rq(cpu); 9957 + 9958 + guard(rq_lock_irqsave)(rq); 9959 + update_rq_clock(rq); 9960 + __sched_balance_update_blocked_averages(rq); 9951 9961 } 9952 9962 9953 9963 /********** Helpers for sched_balance_find_src_group ************************/ ··· 10961 10963 * take care of it. 
10962 10964 */ 10963 10965 if (p->nr_cpus_allowed != NR_CPUS) { 10964 - struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask); 10965 - 10966 - cpumask_and(cpus, sched_group_span(local), p->cpus_ptr); 10967 - imb_numa_nr = min(cpumask_weight(cpus), sd->imb_numa_nr); 10966 + unsigned int w = cpumask_weight_and(p->cpus_ptr, 10967 + sched_group_span(local)); 10968 + imb_numa_nr = min(w, sd->imb_numa_nr); 10968 10969 } 10969 10970 10970 10971 imbalance = abs(local_sgs.idle_cpus - idlest_sgs.idle_cpus); ··· 11010 11013 if (env->sd->span_weight != llc_weight) 11011 11014 return; 11012 11015 11013 - sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu)); 11016 + sd_share = rcu_dereference_all(per_cpu(sd_llc_shared, env->dst_cpu)); 11014 11017 if (!sd_share) 11015 11018 return; 11016 11019 ··· 11360 11363 goto force_balance; 11361 11364 11362 11365 if (!is_rd_overutilized(env->dst_rq->rd) && 11363 - rcu_dereference(env->dst_rq->rd->pd)) 11366 + rcu_dereference_all(env->dst_rq->rd->pd)) 11364 11367 goto out_balanced; 11365 11368 11366 11369 /* ASYM feature bypasses nice load balance check */ ··· 12428 12431 */ 12429 12432 nohz_balance_exit_idle(rq); 12430 12433 12431 - /* 12432 - * None are in tickless mode and hence no need for NOHZ idle load 12433 - * balancing: 12434 - */ 12435 - if (likely(!atomic_read(&nohz.nr_cpus))) 12436 - return; 12437 - 12438 - if (READ_ONCE(nohz.has_blocked) && 12434 + if (READ_ONCE(nohz.has_blocked_load) && 12439 12435 time_after(now, READ_ONCE(nohz.next_blocked))) 12440 12436 flags = NOHZ_STATS_KICK; 12441 12437 12438 + /* 12439 + * Most of the time system is not 100% busy. i.e nohz.nr_cpus > 0 12440 + * Skip the read if time is not due. 12441 + * 12442 + * If none are in tickless mode, there maybe a narrow window 12443 + * (28 jiffies, HZ=1000) where flags maybe set and kick_ilb called. 12444 + * But idle load balancing is not done as find_new_ilb fails. 12445 + * That's very rare. So read nohz.nr_cpus only if time is due. 
12446 + */ 12442 12447 if (time_before(now, nohz.next_balance)) 12443 12448 goto out; 12449 + 12450 + /* 12451 + * None are in tickless mode and hence no need for NOHZ idle load 12452 + * balancing 12453 + */ 12454 + if (unlikely(cpumask_empty(nohz.idle_cpus_mask))) 12455 + return; 12444 12456 12445 12457 if (rq->nr_running >= 2) { 12446 12458 flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK; ··· 12458 12452 12459 12453 rcu_read_lock(); 12460 12454 12461 - sd = rcu_dereference(rq->sd); 12455 + sd = rcu_dereference_all(rq->sd); 12462 12456 if (sd) { 12463 12457 /* 12464 12458 * If there's a runnable CFS task and the current CPU has reduced ··· 12470 12464 } 12471 12465 } 12472 12466 12473 - sd = rcu_dereference(per_cpu(sd_asym_packing, cpu)); 12467 + sd = rcu_dereference_all(per_cpu(sd_asym_packing, cpu)); 12474 12468 if (sd) { 12475 12469 /* 12476 12470 * When ASYM_PACKING; see if there's a more preferred CPU ··· 12488 12482 } 12489 12483 } 12490 12484 12491 - sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, cpu)); 12485 + sd = rcu_dereference_all(per_cpu(sd_asym_cpucapacity, cpu)); 12492 12486 if (sd) { 12493 12487 /* 12494 12488 * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU ··· 12509 12503 goto unlock; 12510 12504 } 12511 12505 12512 - sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); 12506 + sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu)); 12513 12507 if (sds) { 12514 12508 /* 12515 12509 * If there is an imbalance between LLC domains (IOW we could ··· 12541 12535 struct sched_domain *sd; 12542 12536 12543 12537 rcu_read_lock(); 12544 - sd = rcu_dereference(per_cpu(sd_llc, cpu)); 12538 + sd = rcu_dereference_all(per_cpu(sd_llc, cpu)); 12545 12539 12546 12540 if (!sd || !sd->nohz_idle) 12547 12541 goto unlock; ··· 12561 12555 12562 12556 rq->nohz_tick_stopped = 0; 12563 12557 cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask); 12564 - atomic_dec(&nohz.nr_cpus); 12565 12558 12566 12559 set_cpu_sd_state_busy(rq->cpu); 12567 12560 } ··· 12570 12565 struct sched_domain *sd; 12571 12566 12572 12567 rcu_read_lock(); 12573 - sd = rcu_dereference(per_cpu(sd_llc, cpu)); 12568 + sd = rcu_dereference_all(per_cpu(sd_llc, cpu)); 12574 12569 12575 12570 if (!sd || sd->nohz_idle) 12576 12571 goto unlock; ··· 12604 12599 12605 12600 /* 12606 12601 * The tick is still stopped but load could have been added in the 12607 - * meantime. We set the nohz.has_blocked flag to trig a check of the 12602 + * meantime. We set the nohz.has_blocked_load flag to trig a check of the 12608 12603 * *_avg. The CPU is already part of nohz.idle_cpus_mask so the clear 12609 - * of nohz.has_blocked can only happen after checking the new load 12604 + * of nohz.has_blocked_load can only happen after checking the new load 12610 12605 */ 12611 12606 if (rq->nohz_tick_stopped) 12612 12607 goto out; ··· 12618 12613 rq->nohz_tick_stopped = 1; 12619 12614 12620 12615 cpumask_set_cpu(cpu, nohz.idle_cpus_mask); 12621 - atomic_inc(&nohz.nr_cpus); 12622 12616 12623 12617 /* 12624 12618 * Ensures that if nohz_idle_balance() fails to observe our 12625 - * @idle_cpus_mask store, it must observe the @has_blocked 12619 + * @idle_cpus_mask store, it must observe the @has_blocked_load 12626 12620 * and @needs_update stores. 
12627 12621 */ 12628 12622 smp_mb__after_atomic(); ··· 12634 12630 * Each time a cpu enter idle, we assume that it has blocked load and 12635 12631 * enable the periodic update of the load of idle CPUs 12636 12632 */ 12637 - WRITE_ONCE(nohz.has_blocked, 1); 12633 + WRITE_ONCE(nohz.has_blocked_load, 1); 12638 12634 } 12639 12635 12640 12636 static bool update_nohz_stats(struct rq *rq) ··· 12675 12671 12676 12672 /* 12677 12673 * We assume there will be no idle load after this update and clear 12678 - * the has_blocked flag. If a cpu enters idle in the mean time, it will 12679 - * set the has_blocked flag and trigger another update of idle load. 12674 + * the has_blocked_load flag. If a cpu enters idle in the mean time, it will 12675 + * set the has_blocked_load flag and trigger another update of idle load. 12680 12676 * Because a cpu that becomes idle, is added to idle_cpus_mask before 12681 12677 * setting the flag, we are sure to not clear the state and not 12682 12678 * check the load of an idle cpu. ··· 12684 12680 * Same applies to idle_cpus_mask vs needs_update. 12685 12681 */ 12686 12682 if (flags & NOHZ_STATS_KICK) 12687 - WRITE_ONCE(nohz.has_blocked, 0); 12683 + WRITE_ONCE(nohz.has_blocked_load, 0); 12688 12684 if (flags & NOHZ_NEXT_KICK) 12689 12685 WRITE_ONCE(nohz.needs_update, 0); 12690 12686 12691 12687 /* 12692 - * Ensures that if we miss the CPU, we must see the has_blocked 12688 + * Ensures that if we miss the CPU, we must see the has_blocked_load 12693 12689 * store from nohz_balance_enter_idle(). 12694 12690 */ 12695 12691 smp_mb(); ··· 12756 12752 abort: 12757 12753 /* There is still blocked load, enable periodic update */ 12758 12754 if (has_blocked_load) 12759 - WRITE_ONCE(nohz.has_blocked, 1); 12755 + WRITE_ONCE(nohz.has_blocked_load, 1); 12760 12756 } 12761 12757 12762 12758 /* ··· 12818 12814 return; 12819 12815 12820 12816 /* Don't need to update blocked load of idle CPUs*/ 12821 - if (!READ_ONCE(nohz.has_blocked) || 12817 + if (!READ_ONCE(nohz.has_blocked_load) || 12822 12818 time_before(jiffies, READ_ONCE(nohz.next_blocked))) 12823 12819 return; 12824 12820 ··· 12889 12885 */ 12890 12886 rq_unpin_lock(this_rq, rf); 12891 12887 12892 - rcu_read_lock(); 12893 - sd = rcu_dereference_check_sched_domain(this_rq->sd); 12894 - if (!sd) { 12895 - rcu_read_unlock(); 12888 + sd = rcu_dereference_sched_domain(this_rq->sd); 12889 + if (!sd) 12896 12890 goto out; 12897 - } 12898 12891 12899 12892 if (!get_rd_overloaded(this_rq->rd) || 12900 12893 this_rq->avg_idle < sd->max_newidle_lb_cost) { 12901 12894 12902 12895 update_next_balance(sd, &next_balance); 12903 - rcu_read_unlock(); 12904 12896 goto out; 12905 12897 } 12906 - rcu_read_unlock(); 12907 12898 12908 - rq_modified_clear(this_rq); 12899 + /* 12900 + * Include sched_balance_update_blocked_averages() in the cost 12901 + * calculation because it can be quite costly -- this ensures we skip 12902 + * it when avg_idle gets to be very low. 
12903 + */ 12904 + t0 = sched_clock_cpu(this_cpu); 12905 + __sched_balance_update_blocked_averages(this_rq); 12906 + 12907 + this_rq->next_class = &fair_sched_class; 12909 12908 raw_spin_rq_unlock(this_rq); 12910 12909 12911 - t0 = sched_clock_cpu(this_cpu); 12912 - sched_balance_update_blocked_averages(this_cpu); 12913 - 12914 - rcu_read_lock(); 12915 12910 for_each_domain(this_cpu, sd) { 12916 12911 u64 domain_cost; 12917 12912 ··· 12960 12957 if (pulled_task || !continue_balancing) 12961 12958 break; 12962 12959 } 12963 - rcu_read_unlock(); 12964 12960 12965 12961 raw_spin_rq_lock(this_rq); 12966 12962 ··· 12975 12973 pulled_task = 1; 12976 12974 12977 12975 /* If a higher prio class was modified, restart the pick */ 12978 - if (rq_modified_above(this_rq, &fair_sched_class)) 12976 + if (sched_class_above(this_rq->next_class, &fair_sched_class)) 12979 12977 pulled_task = -1; 12980 12978 12981 12979 out: ··· 13326 13324 * zero_vruntime_fi, which would have been updated in prior calls 13327 13325 * to se_fi_update(). 13328 13326 */ 13329 - delta = (s64)(sea->vruntime - seb->vruntime) + 13330 - (s64)(cfs_rqb->zero_vruntime_fi - cfs_rqa->zero_vruntime_fi); 13327 + delta = vruntime_op(sea->vruntime, "-", seb->vruntime) + 13328 + vruntime_op(cfs_rqb->zero_vruntime_fi, "-", cfs_rqa->zero_vruntime_fi); 13331 13329 13332 13330 return delta > 0; 13333 13331 } ··· 13363 13361 for_each_sched_entity(se) { 13364 13362 cfs_rq = cfs_rq_of(se); 13365 13363 entity_tick(cfs_rq, se, queued); 13364 + } 13365 + 13366 + if (queued) { 13367 + if (!need_resched()) 13368 + hrtick_start_fair(rq, curr); 13369 + return; 13366 13370 } 13367 13371 13368 13372 if (static_branch_unlikely(&sched_numa_balancing)) ··· 13879 13871 * All the scheduling class methods: 13880 13872 */ 13881 13873 DEFINE_SCHED_CLASS(fair) = { 13882 - 13883 - .queue_mask = 2, 13884 - 13885 13874 .enqueue_task = enqueue_task_fair, 13886 13875 .dequeue_task = dequeue_task_fair, 13887 13876 .yield_task = yield_task_fair, 13888 13877 .yield_to_task = yield_to_task_fair, 13889 13878 13890 - .wakeup_preempt = check_preempt_wakeup_fair, 13879 + .wakeup_preempt = wakeup_preempt_fair, 13891 13880 13892 13881 .pick_task = pick_task_fair, 13893 13882 .pick_next_task = pick_next_task_fair, ··· 13944 13939 struct numa_group *ng; 13945 13940 13946 13941 rcu_read_lock(); 13947 - ng = rcu_dereference(p->numa_group); 13942 + ng = rcu_dereference_all(p->numa_group); 13948 13943 for_each_online_node(node) { 13949 13944 if (p->numa_faults) { 13950 13945 tsf = p->numa_faults[task_faults_idx(NUMA_MEM, node, 0)];
+4 -3
kernel/sched/idle.c
··· 460 460 { 461 461 update_curr_idle(rq); 462 462 scx_update_idle(rq, false, true); 463 + update_rq_avg_idle(rq); 463 464 } 464 465 465 466 static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first) ··· 537 536 se->exec_start = now; 538 537 539 538 dl_server_update_idle(&rq->fair_server, delta_exec); 539 + #ifdef CONFIG_SCHED_CLASS_EXT 540 + dl_server_update_idle(&rq->ext_server, delta_exec); 541 + #endif 540 542 } 541 543 542 544 /* 543 545 * Simple, special scheduling class for the per-CPU idle tasks: 544 546 */ 545 547 DEFINE_SCHED_CLASS(idle) = { 546 - 547 - .queue_mask = 0, 548 - 549 548 /* no enqueue/yield_task for idle tasks */ 550 549 551 550 /* dequeue is not valid, we print a debug message there: */
+11 -3
kernel/sched/rt.c
··· 1615 1615 { 1616 1616 struct task_struct *donor = rq->donor; 1617 1617 1618 + /* 1619 + * XXX If we're preempted by DL, queue a push? 1620 + */ 1621 + if (p->sched_class != &rt_sched_class) 1622 + return; 1623 + 1618 1624 if (p->prio < donor->prio) { 1619 1625 resched_curr(rq); 1620 1626 return; ··· 2106 2100 */ 2107 2101 static int rto_next_cpu(struct root_domain *rd) 2108 2102 { 2103 + int this_cpu = smp_processor_id(); 2109 2104 int next; 2110 2105 int cpu; 2111 2106 ··· 2129 2122 cpu = cpumask_next(rd->rto_cpu, rd->rto_mask); 2130 2123 2131 2124 rd->rto_cpu = cpu; 2125 + 2126 + /* Do not send IPI to self */ 2127 + if (cpu == this_cpu) 2128 + continue; 2132 2129 2133 2130 if (cpu < nr_cpu_ids) 2134 2131 return cpu; ··· 2579 2568 #endif /* CONFIG_SCHED_CORE */ 2580 2569 2581 2570 DEFINE_SCHED_CLASS(rt) = { 2582 - 2583 - .queue_mask = 4, 2584 - 2585 2571 .enqueue_task = enqueue_task_rt, 2586 2572 .dequeue_task = dequeue_task_rt, 2587 2573 .yield_task = yield_task_rt,
+70 -72
kernel/sched/sched.h
··· 418 418 extern void sched_init_dl_servers(void); 419 419 420 420 extern void fair_server_init(struct rq *rq); 421 + extern void ext_server_init(struct rq *rq); 421 422 extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq); 422 423 extern int dl_server_apply_params(struct sched_dl_entity *dl_se, 423 424 u64 runtime, u64 period, bool init); ··· 675 674 void (*func)(struct rq *rq); 676 675 }; 677 676 678 - /* CFS-related fields in a runqueue */ 677 + /* Fair scheduling SCHED_{NORMAL,BATCH,IDLE} related fields in a runqueue: */ 679 678 struct cfs_rq { 680 679 struct load_weight load; 681 680 unsigned int nr_queued; 682 - unsigned int h_nr_queued; /* SCHED_{NORMAL,BATCH,IDLE} */ 683 - unsigned int h_nr_runnable; /* SCHED_{NORMAL,BATCH,IDLE} */ 684 - unsigned int h_nr_idle; /* SCHED_IDLE */ 681 + unsigned int h_nr_queued; /* SCHED_{NORMAL,BATCH,IDLE} */ 682 + unsigned int h_nr_runnable; /* SCHED_{NORMAL,BATCH,IDLE} */ 683 + unsigned int h_nr_idle; /* SCHED_IDLE */ 685 684 686 - s64 avg_vruntime; 687 - u64 avg_load; 685 + s64 sum_w_vruntime; 686 + u64 sum_weight; 688 687 689 688 u64 zero_vruntime; 690 689 #ifdef CONFIG_SCHED_CORE ··· 695 694 struct rb_root_cached tasks_timeline; 696 695 697 696 /* 698 - * 'curr' points to currently running entity on this cfs_rq. 697 + * 'curr' points to the currently running entity on this cfs_rq. 699 698 * It is set to NULL otherwise (i.e when none are currently running). 700 699 */ 701 700 struct sched_entity *curr; ··· 731 730 unsigned long h_load; 732 731 u64 last_h_load_update; 733 732 struct sched_entity *h_load_next; 734 - #endif /* CONFIG_FAIR_GROUP_SCHED */ 735 733 736 - #ifdef CONFIG_FAIR_GROUP_SCHED 737 734 struct rq *rq; /* CPU runqueue to which this cfs_rq is attached */ 738 735 739 736 /* ··· 744 745 */ 745 746 int on_list; 746 747 struct list_head leaf_cfs_rq_list; 747 - struct task_group *tg; /* group that "owns" this runqueue */ 748 + struct task_group *tg; /* Group that "owns" this runqueue */ 748 749 749 750 /* Locally cached copy of our task_group's idle value */ 750 751 int idle; 751 752 752 - #ifdef CONFIG_CFS_BANDWIDTH 753 + # ifdef CONFIG_CFS_BANDWIDTH 753 754 int runtime_enabled; 754 755 s64 runtime_remaining; 755 756 756 757 u64 throttled_pelt_idle; 757 - #ifndef CONFIG_64BIT 758 + # ifndef CONFIG_64BIT 758 759 u64 throttled_pelt_idle_copy; 759 - #endif 760 + # endif 760 761 u64 throttled_clock; 761 762 u64 throttled_clock_pelt; 762 763 u64 throttled_clock_pelt_time; ··· 768 769 struct list_head throttled_list; 769 770 struct list_head throttled_csd_list; 770 771 struct list_head throttled_limbo_list; 771 - #endif /* CONFIG_CFS_BANDWIDTH */ 772 + # endif /* CONFIG_CFS_BANDWIDTH */ 772 773 #endif /* CONFIG_FAIR_GROUP_SCHED */ 773 774 }; 774 775 ··· 1120 1121 * acquire operations must be ordered by ascending &runqueue. 1121 1122 */ 1122 1123 struct rq { 1123 - /* runqueue lock: */ 1124 - raw_spinlock_t __lock; 1125 - 1126 - /* Per class runqueue modification mask; bits in class order. */ 1127 - unsigned int queue_mask; 1124 + /* 1125 + * The following members are loaded together, without holding the 1126 + * rq->lock, in an extremely hot loop in update_sg_lb_stats() 1127 + * (called from pick_next_task()). To reduce cache pollution from 1128 + * this operation, they are placed together on this dedicated cache 1129 + * line. Even though some of them are frequently modified, they are 1130 + * loaded much more frequently than they are stored. 
1131 + */ 1128 1132 unsigned int nr_running; 1129 1133 #ifdef CONFIG_NUMA_BALANCING 1130 1134 unsigned int nr_numa_running; 1131 1135 unsigned int nr_preferred_running; 1132 - unsigned int numa_migrate_on; 1133 1136 #endif 1137 + unsigned int ttwu_pending; 1138 + unsigned long cpu_capacity; 1139 + #ifdef CONFIG_SCHED_PROXY_EXEC 1140 + struct task_struct __rcu *donor; /* Scheduling context */ 1141 + struct task_struct __rcu *curr; /* Execution context */ 1142 + #else 1143 + union { 1144 + struct task_struct __rcu *donor; /* Scheduler context */ 1145 + struct task_struct __rcu *curr; /* Execution context */ 1146 + }; 1147 + #endif 1148 + struct task_struct *idle; 1149 + /* padding left here deliberately */ 1150 + 1151 + /* 1152 + * The next cacheline holds the (hot) runqueue lock, as well as 1153 + * some other less performance-critical fields. 1154 + */ 1155 + u64 nr_switches ____cacheline_aligned; 1156 + 1157 + /* runqueue lock: */ 1158 + raw_spinlock_t __lock; 1159 + 1134 1160 #ifdef CONFIG_NO_HZ_COMMON 1135 - unsigned long last_blocked_load_update_tick; 1136 - unsigned int has_blocked_load; 1137 - call_single_data_t nohz_csd; 1138 1161 unsigned int nohz_tick_stopped; 1139 1162 atomic_t nohz_flags; 1163 + unsigned int has_blocked_load; 1164 + unsigned long last_blocked_load_update_tick; 1165 + call_single_data_t nohz_csd; 1140 1166 #endif /* CONFIG_NO_HZ_COMMON */ 1141 - 1142 - unsigned int ttwu_pending; 1143 - u64 nr_switches; 1144 1167 1145 1168 #ifdef CONFIG_UCLAMP_TASK 1146 1169 /* Utilization clamp values based on CPU's RUNNABLE tasks */ ··· 1176 1155 struct dl_rq dl; 1177 1156 #ifdef CONFIG_SCHED_CLASS_EXT 1178 1157 struct scx_rq scx; 1158 + struct sched_dl_entity ext_server; 1179 1159 #endif 1180 1160 1181 1161 struct sched_dl_entity fair_server; ··· 1187 1165 struct list_head *tmp_alone_branch; 1188 1166 #endif /* CONFIG_FAIR_GROUP_SCHED */ 1189 1167 1168 + #ifdef CONFIG_NUMA_BALANCING 1169 + unsigned int numa_migrate_on; 1170 + #endif 1190 1171 /* 1191 1172 * This is part of a global counter where only the total sum 1192 1173 * over all CPUs matters. A task can increase this counter on ··· 1198 1173 */ 1199 1174 unsigned long nr_uninterruptible; 1200 1175 1201 - #ifdef CONFIG_SCHED_PROXY_EXEC 1202 - struct task_struct __rcu *donor; /* Scheduling context */ 1203 - struct task_struct __rcu *curr; /* Execution context */ 1204 - #else 1205 - union { 1206 - struct task_struct __rcu *donor; /* Scheduler context */ 1207 - struct task_struct __rcu *curr; /* Execution context */ 1208 - }; 1209 - #endif 1210 1176 struct sched_dl_entity *dl_server; 1211 - struct task_struct *idle; 1212 1177 struct task_struct *stop; 1178 + const struct sched_class *next_class; 1213 1179 unsigned long next_balance; 1214 1180 struct mm_struct *prev_mm; 1215 1181 1216 - unsigned int clock_update_flags; 1217 - u64 clock; 1218 - /* Ensure that all clocks are in the same cache line */ 1182 + /* 1183 + * The following fields of clock data are frequently referenced 1184 + * and updated together, and should go on their own cache line. 
1185 + */ 1219 1186 u64 clock_task ____cacheline_aligned; 1220 1187 u64 clock_pelt; 1188 + u64 clock; 1221 1189 unsigned long lost_idle_time; 1190 + unsigned int clock_update_flags; 1222 1191 u64 clock_pelt_idle; 1223 1192 u64 clock_idle; 1193 + 1224 1194 #ifndef CONFIG_64BIT 1225 1195 u64 clock_pelt_idle_copy; 1226 1196 u64 clock_idle_copy; 1227 1197 #endif 1228 - 1229 - atomic_t nr_iowait; 1230 1198 1231 1199 u64 last_seen_need_resched_ns; 1232 1200 int ticks_without_resched; ··· 1230 1212 1231 1213 struct root_domain *rd; 1232 1214 struct sched_domain __rcu *sd; 1233 - 1234 - unsigned long cpu_capacity; 1235 1215 1236 1216 struct balance_callback *balance_callback; 1237 1217 ··· 1340 1324 call_single_data_t cfsb_csd; 1341 1325 struct list_head cfsb_csd_list; 1342 1326 #endif 1343 - }; 1327 + 1328 + atomic_t nr_iowait; 1329 + } __no_randomize_layout; 1344 1330 1345 1331 #ifdef CONFIG_FAIR_GROUP_SCHED 1346 1332 ··· 1716 1698 1717 1699 #endif /* !CONFIG_FAIR_GROUP_SCHED */ 1718 1700 1701 + extern void update_rq_avg_idle(struct rq *rq); 1719 1702 extern void update_rq_clock(struct rq *rq); 1720 1703 1721 1704 /* ··· 2081 2062 rq->balance_callback = head; 2082 2063 } 2083 2064 2084 - #define rcu_dereference_check_sched_domain(p) \ 2085 - rcu_dereference_check((p), lockdep_is_held(&sched_domains_mutex)) 2065 + #define rcu_dereference_sched_domain(p) \ 2066 + rcu_dereference_all_check((p), lockdep_is_held(&sched_domains_mutex)) 2086 2067 2087 2068 /* 2088 2069 * The domain tree (rq->sd) is protected by RCU's quiescent state transition. ··· 2092 2073 * preempt-disabled sections. 2093 2074 */ 2094 2075 #define for_each_domain(cpu, __sd) \ 2095 - for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \ 2076 + for (__sd = rcu_dereference_sched_domain(cpu_rq(cpu)->sd); \ 2096 2077 __sd; __sd = __sd->parent) 2097 2078 2098 2079 /* A mask of all the SD flags that have the SDF_SHARED_CHILD metaflag */ ··· 2500 2481 #ifdef CONFIG_UCLAMP_TASK 2501 2482 int uclamp_enabled; 2502 2483 #endif 2503 - /* 2504 - * idle: 0 2505 - * ext: 1 2506 - * fair: 2 2507 - * rt: 4 2508 - * dl: 8 2509 - * stop: 16 2510 - */ 2511 - unsigned int queue_mask; 2512 2484 2513 2485 /* 2514 2486 * move_queued_task/activate_task/enqueue_task: rq->lock ··· 2657 2647 int (*task_is_throttled)(struct task_struct *p, int cpu); 2658 2648 #endif 2659 2649 }; 2660 - 2661 - /* 2662 - * Does not nest; only used around sched_class::pick_task() rq-lock-breaks. 
2663 - */ 2664 - static inline void rq_modified_clear(struct rq *rq) 2665 - { 2666 - rq->queue_mask = 0; 2667 - } 2668 - 2669 - static inline bool rq_modified_above(struct rq *rq, const struct sched_class * class) 2670 - { 2671 - unsigned int mask = class->queue_mask; 2672 - return rq->queue_mask & ~((mask << 1) - 1); 2673 - } 2674 2650 2675 2651 static inline void put_prev_task(struct rq *rq, struct task_struct *prev) 2676 2652 { ··· 3393 3397 }; 3394 3398 3395 3399 DECLARE_PER_CPU(struct irqtime, cpu_irqtime); 3396 - extern int sched_clock_irqtime; 3400 + DECLARE_STATIC_KEY_FALSE(sched_clock_irqtime); 3397 3401 3398 3402 static inline int irqtime_enabled(void) 3399 3403 { 3400 - return sched_clock_irqtime; 3404 + return static_branch_likely(&sched_clock_irqtime); 3401 3405 } 3402 3406 3403 3407 /* ··· 4006 4010 deactivate_task(src_rq, task, 0); 4007 4011 set_task_cpu(task, dst_rq->cpu); 4008 4012 activate_task(dst_rq, task, 0); 4013 + wakeup_preempt(dst_rq, task, 0); 4009 4014 } 4010 4015 4011 4016 static inline ··· 4076 4079 struct sched_change_ctx { 4077 4080 u64 prio; 4078 4081 struct task_struct *p; 4082 + const struct sched_class *class; 4079 4083 int flags; 4080 4084 bool queued; 4081 4085 bool running;
-3
kernel/sched/stop_task.c
··· 97 97 * Simple, special scheduling class for the per-CPU stop tasks: 98 98 */ 99 99 DEFINE_SCHED_CLASS(stop) = { 100 - 101 - .queue_mask = 16, 102 - 103 100 .enqueue_task = enqueue_task_stop, 104 101 .dequeue_task = dequeue_task_stop, 105 102 .yield_task = yield_task_stop,
+5
kernel/sched/topology.c
··· 508 508 if (rq->fair_server.dl_server) 509 509 __dl_server_attach_root(&rq->fair_server, rq); 510 510 511 + #ifdef CONFIG_SCHED_CLASS_EXT 512 + if (rq->ext_server.dl_server) 513 + __dl_server_attach_root(&rq->ext_server, rq); 514 + #endif 515 + 511 516 rq_unlock_irqrestore(rq, &rf); 512 517 513 518 if (old_rd)
+6
kernel/sys.c
··· 53 53 #include <linux/time_namespace.h> 54 54 #include <linux/binfmts.h> 55 55 #include <linux/futex.h> 56 + #include <linux/rseq.h> 56 57 57 58 #include <linux/sched.h> 58 59 #include <linux/sched/autogroup.h> ··· 2868 2867 break; 2869 2868 case PR_FUTEX_HASH: 2870 2869 error = futex_hash_prctl(arg2, arg3, arg4); 2870 + break; 2871 + case PR_RSEQ_SLICE_EXTENSION: 2872 + if (arg4 || arg5) 2873 + return -EINVAL; 2874 + error = rseq_slice_extension_prctl(arg2, arg3); 2871 2875 break; 2872 2876 default: 2873 2877 trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
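A quick usage note on the prctl() hook added above: from user space a thread would opt in roughly as in the sketch below. This is only an illustration built from the fallback defines the slice_test selftest carries further down (PR_RSEQ_SLICE_EXTENSION and friends); it assumes the thread has already registered an rseq area, and it passes zero for the unused arguments, as the kernel side above requires.

#include <sys/prctl.h>

#ifndef PR_RSEQ_SLICE_EXTENSION
# define PR_RSEQ_SLICE_EXTENSION	79
# define PR_RSEQ_SLICE_EXTENSION_SET	2
# define PR_RSEQ_SLICE_EXT_ENABLE	0x01
#endif

/* Opt the calling thread in to time slice extensions; returns 0 on success. */
static int enable_slice_extension(void)
{
	return prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
		     PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
}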
+1
kernel/sys_ni.c
··· 390 390 391 391 /* restartable sequence */ 392 392 COND_SYSCALL(rseq); 393 + COND_SYSCALL(rseq_slice_yield); 393 394 394 395 COND_SYSCALL(uretprobe); 395 396 COND_SYSCALL(uprobe);
+1 -1
kernel/time/hrtimer.c
··· 1742 1742 1743 1743 lockdep_assert_held(&cpu_base->lock); 1744 1744 1745 - debug_deactivate(timer); 1745 + debug_hrtimer_deactivate(timer); 1746 1746 base->running = timer; 1747 1747 1748 1748 /*
+1
scripts/syscall.tbl
··· 411 411 468 common file_getattr sys_file_getattr 412 412 469 common file_setattr sys_file_setattr 413 413 470 common listns sys_listns 414 + 471 common rseq_slice_yield sys_rseq_slice_yield
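There is no libc wrapper for the new system call at this point, so callers go through syscall(2) by number. A minimal, hypothetical local helper, using the number 471 added to the table above (the slice_test selftest below carries the same fallback define):

#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_rseq_slice_yield
# define __NR_rseq_slice_yield	471
#endif

/*
 * Give a granted slice extension back to the kernel. The slice_test
 * selftest below treats a zero return as having raced with the kernel
 * already revoking the grant.
 */
static inline long rseq_slice_yield(void)
{
	return syscall(__NR_rseq_slice_yield);
}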
+1
tools/testing/selftests/rseq/.gitignore
··· 10 10 param_test_mm_cid_benchmark 11 11 param_test_mm_cid_compare_twice 12 12 syscall_errors_test 13 + slice_test
+4 -1
tools/testing/selftests/rseq/Makefile
··· 17 17 TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \ 18 18 param_test_benchmark param_test_compare_twice param_test_mm_cid \ 19 19 param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \ 20 - syscall_errors_test 20 + syscall_errors_test slice_test 21 21 22 22 TEST_GEN_PROGS_EXTENDED = librseq.so 23 23 ··· 58 58 59 59 $(OUTPUT)/syscall_errors_test: syscall_errors_test.c $(TEST_GEN_PROGS_EXTENDED) \ 60 60 rseq.h rseq-*.h 61 + $(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@ 62 + 63 + $(OUTPUT)/slice_test: slice_test.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h 61 64 $(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
+27
tools/testing/selftests/rseq/rseq-abi.h
··· 53 53 __u64 abort_ip; 54 54 } __attribute__((aligned(4 * sizeof(__u64)))); 55 55 56 + /** 57 + * rseq_abi_slice_ctrl - Time slice extension control structure 58 + * @all: Compound value 59 + * @request: Request for a time slice extension 60 + * @granted: Granted time slice extension 61 + * 62 + * @request is set by user space and can be cleared by user space or kernel 63 + * space. @granted is set and cleared by the kernel and must only be read 64 + * by user space. 65 + */ 66 + struct rseq_abi_slice_ctrl { 67 + union { 68 + __u32 all; 69 + struct { 70 + __u8 request; 71 + __u8 granted; 72 + __u16 __reserved; 73 + }; 74 + }; 75 + }; 76 + 56 77 /* 57 78 * struct rseq_abi is aligned on 4 * 8 bytes to ensure it is always 58 79 * contained within a single cache-line. ··· 184 163 * (allocated uniquely within a memory map). 185 164 */ 186 165 __u32 mm_cid; 166 + 167 + /* 168 + * Time slice extension control structure. CPU local updates from 169 + * kernel and user space. 170 + */ 171 + struct rseq_abi_slice_ctrl slice_ctrl; 187 172 188 173 /* 189 174 * Flexible array member at end of structure, after last feature field.
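To make the request/granted handshake described in the comment above concrete, here is a hedged sketch of the intended user space pattern around a short critical section. It mirrors what slice_test.c does further down and assumes rseq has been registered and slice extensions enabled via prctl(); rseq_get_abi() and the RSEQ_*_ONCE() accessors come from the selftests' rseq.h, and do_short_critical_work() is a stand-in for the caller's own work.

#include <unistd.h>
#include <sys/syscall.h>

#include "rseq.h"	/* selftest helpers: rseq_get_abi(), RSEQ_{READ,WRITE}_ONCE() */

#ifndef __NR_rseq_slice_yield
# define __NR_rseq_slice_yield	471
#endif

static void do_short_critical_work(void)
{
	/* Stand-in for the caller's actual critical section. */
}

static void with_slice_extension(void)
{
	struct rseq_abi *rs = rseq_get_abi();

	/* Ask the kernel to defer preemption while the resource is held. */
	RSEQ_WRITE_ONCE(rs->slice_ctrl.request, 1);

	do_short_critical_work();

	/* Withdraw the request ... */
	RSEQ_WRITE_ONCE(rs->slice_ctrl.request, 0);

	/* ... and if an extension was granted in the meantime, yield it back promptly. */
	if (RSEQ_READ_ONCE(rs->slice_ctrl.granted))
		syscall(__NR_rseq_slice_yield);
}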
+132
tools/testing/selftests/rseq/rseq-slice-hist.py
··· 1 + #!/usr/bin/python3 2 + 3 + # 4 + # trace-cmd record -e hrtimer_start -e hrtimer_cancel -e hrtimer_expire_entry -- $cmd 5 + # 6 + 7 + from tracecmd import * 8 + 9 + def load_kallsyms(file_path='/proc/kallsyms'): 10 + """ 11 + Parses /proc/kallsyms into a dictionary. 12 + Returns: { address_int: symbol_name } 13 + """ 14 + kallsyms_map = {} 15 + 16 + try: 17 + with open(file_path, 'r') as f: 18 + for line in f: 19 + # The format is: [address] [type] [name] [module] 20 + parts = line.split() 21 + if len(parts) < 3: 22 + continue 23 + 24 + addr = int(parts[0], 16) 25 + name = parts[2] 26 + 27 + kallsyms_map[addr] = name 28 + 29 + except PermissionError: 30 + print(f"Error: Permission denied reading {file_path}. Try running with sudo.") 31 + except FileNotFoundError: 32 + print(f"Error: {file_path} not found.") 33 + 34 + return kallsyms_map 35 + 36 + ksyms = load_kallsyms() 37 + 38 + # pending[timer_ptr] = {'ts': timestamp, 'comm': comm} 39 + pending = {} 40 + 41 + # histograms[comm][bucket] = count 42 + histograms = {} 43 + 44 + class OnlineHarmonicMean: 45 + def __init__(self): 46 + self.n = 0 # Count of elements 47 + self.S = 0.0 # Cumulative sum of reciprocals 48 + 49 + def update(self, x): 50 + if x == 0: 51 + raise ValueError("Harmonic mean is undefined for zero.") 52 + 53 + self.n += 1 54 + self.S += 1.0 / x 55 + return self.n / self.S 56 + 57 + @property 58 + def mean(self): 59 + return self.n / self.S if self.n > 0 else 0 60 + 61 + ohms = {} 62 + 63 + def handle_start(record): 64 + func_name = ksyms[record.num_field("function")] 65 + if "rseq_slice_expired" in func_name: 66 + timer_ptr = record.num_field("hrtimer") 67 + pending[timer_ptr] = { 68 + 'ts': record.ts, 69 + 'comm': record.comm 70 + } 71 + return None 72 + 73 + def handle_cancel(record): 74 + timer_ptr = record.num_field("hrtimer") 75 + 76 + if timer_ptr in pending: 77 + start_data = pending.pop(timer_ptr) 78 + duration_ns = record.ts - start_data['ts'] 79 + duration_us = duration_ns // 1000 80 + 81 + comm = start_data['comm'] 82 + 83 + if comm not in ohms: 84 + ohms[comm] = OnlineHarmonicMean() 85 + 86 + ohms[comm].update(duration_ns) 87 + 88 + if comm not in histograms: 89 + histograms[comm] = {} 90 + 91 + histograms[comm][duration_us] = histograms[comm].get(duration_us, 0) + 1 92 + return None 93 + 94 + def handle_expire(record): 95 + timer_ptr = record.num_field("hrtimer") 96 + 97 + if timer_ptr in pending: 98 + start_data = pending.pop(timer_ptr) 99 + comm = start_data['comm'] 100 + 101 + if comm not in histograms: 102 + histograms[comm] = {} 103 + 104 + # Record -1 bucket for expired (failed to cancel) 105 + histograms[comm][-1] = histograms[comm].get(-1, 0) + 1 106 + return None 107 + 108 + if __name__ == "__main__": 109 + t = Trace("trace.dat") 110 + for cpu in range(0, t.cpus): 111 + ev = t.read_event(cpu) 112 + while ev: 113 + if "hrtimer_start" in ev.name: 114 + handle_start(ev) 115 + if "hrtimer_cancel" in ev.name: 116 + handle_cancel(ev) 117 + if "hrtimer_expire_entry" in ev.name: 118 + handle_expire(ev) 119 + 120 + ev = t.read_event(cpu) 121 + 122 + print("\n" + "="*40) 123 + print("RSEQ SLICE HISTOGRAM (us)") 124 + print("="*40) 125 + for comm, buckets in histograms.items(): 126 + print(f"\nTask: {comm} Mean: {ohms[comm].mean:.3f} ns") 127 + print(f" {'Latency (us)':<15} | {'Count'}") 128 + print(f" {'-'*30}") 129 + # Sort buckets numerically, putting -1 at the top 130 + for bucket in sorted(buckets.keys()): 131 + label = "EXPIRED" if bucket == -1 else f"{bucket} us" 132 + print(f" {label:<15} | 
{buckets[bucket]}")
+219
tools/testing/selftests/rseq/slice_test.c
··· 1 + // SPDX-License-Identifier: LGPL-2.1 2 + #define _GNU_SOURCE 3 + #include <assert.h> 4 + #include <pthread.h> 5 + #include <sched.h> 6 + #include <signal.h> 7 + #include <stdbool.h> 8 + #include <stdio.h> 9 + #include <string.h> 10 + #include <syscall.h> 11 + #include <unistd.h> 12 + 13 + #include <linux/prctl.h> 14 + #include <sys/prctl.h> 15 + #include <sys/time.h> 16 + 17 + #include "rseq.h" 18 + 19 + #include "../kselftest_harness.h" 20 + 21 + #ifndef __NR_rseq_slice_yield 22 + # define __NR_rseq_slice_yield 471 23 + #endif 24 + 25 + #define BITS_PER_INT 32 26 + #define BITS_PER_BYTE 8 27 + 28 + #ifndef PR_RSEQ_SLICE_EXTENSION 29 + # define PR_RSEQ_SLICE_EXTENSION 79 30 + # define PR_RSEQ_SLICE_EXTENSION_GET 1 31 + # define PR_RSEQ_SLICE_EXTENSION_SET 2 32 + # define PR_RSEQ_SLICE_EXT_ENABLE 0x01 33 + #endif 34 + 35 + #ifndef RSEQ_SLICE_EXT_REQUEST_BIT 36 + # define RSEQ_SLICE_EXT_REQUEST_BIT 0 37 + # define RSEQ_SLICE_EXT_GRANTED_BIT 1 38 + #endif 39 + 40 + #ifndef asm_inline 41 + # define asm_inline asm __inline 42 + #endif 43 + 44 + #define NSEC_PER_SEC 1000000000L 45 + #define NSEC_PER_USEC 1000L 46 + 47 + struct noise_params { 48 + int64_t noise_nsecs; 49 + int64_t sleep_nsecs; 50 + int64_t run; 51 + }; 52 + 53 + FIXTURE(slice_ext) 54 + { 55 + pthread_t noise_thread; 56 + struct noise_params noise_params; 57 + }; 58 + 59 + FIXTURE_VARIANT(slice_ext) 60 + { 61 + int64_t total_nsecs; 62 + int64_t slice_nsecs; 63 + int64_t noise_nsecs; 64 + int64_t sleep_nsecs; 65 + bool no_yield; 66 + }; 67 + 68 + FIXTURE_VARIANT_ADD(slice_ext, n2_2_50) 69 + { 70 + .total_nsecs = 5LL * NSEC_PER_SEC, 71 + .slice_nsecs = 2LL * NSEC_PER_USEC, 72 + .noise_nsecs = 2LL * NSEC_PER_USEC, 73 + .sleep_nsecs = 50LL * NSEC_PER_USEC, 74 + }; 75 + 76 + FIXTURE_VARIANT_ADD(slice_ext, n50_2_50) 77 + { 78 + .total_nsecs = 5LL * NSEC_PER_SEC, 79 + .slice_nsecs = 50LL * NSEC_PER_USEC, 80 + .noise_nsecs = 2LL * NSEC_PER_USEC, 81 + .sleep_nsecs = 50LL * NSEC_PER_USEC, 82 + }; 83 + 84 + FIXTURE_VARIANT_ADD(slice_ext, n2_2_50_no_yield) 85 + { 86 + .total_nsecs = 5LL * NSEC_PER_SEC, 87 + .slice_nsecs = 2LL * NSEC_PER_USEC, 88 + .noise_nsecs = 2LL * NSEC_PER_USEC, 89 + .sleep_nsecs = 50LL * NSEC_PER_USEC, 90 + .no_yield = true, 91 + }; 92 + 93 + 94 + static inline bool elapsed(struct timespec *start, struct timespec *now, 95 + int64_t span) 96 + { 97 + int64_t delta = now->tv_sec - start->tv_sec; 98 + 99 + delta *= NSEC_PER_SEC; 100 + delta += now->tv_nsec - start->tv_nsec; 101 + return delta >= span; 102 + } 103 + 104 + static void *noise_thread(void *arg) 105 + { 106 + struct noise_params *p = arg; 107 + 108 + while (RSEQ_READ_ONCE(p->run)) { 109 + struct timespec ts_start, ts_now; 110 + 111 + clock_gettime(CLOCK_MONOTONIC, &ts_start); 112 + do { 113 + clock_gettime(CLOCK_MONOTONIC, &ts_now); 114 + } while (!elapsed(&ts_start, &ts_now, p->noise_nsecs)); 115 + 116 + ts_start.tv_sec = 0; 117 + ts_start.tv_nsec = p->sleep_nsecs; 118 + clock_nanosleep(CLOCK_MONOTONIC, 0, &ts_start, NULL); 119 + } 120 + return NULL; 121 + } 122 + 123 + FIXTURE_SETUP(slice_ext) 124 + { 125 + cpu_set_t affinity; 126 + 127 + ASSERT_EQ(sched_getaffinity(0, sizeof(affinity), &affinity), 0); 128 + 129 + /* Pin it on a single CPU. 
Avoid CPU 0 */ 130 + for (int i = 1; i < CPU_SETSIZE; i++) { 131 + if (!CPU_ISSET(i, &affinity)) 132 + continue; 133 + 134 + CPU_ZERO(&affinity); 135 + CPU_SET(i, &affinity); 136 + ASSERT_EQ(sched_setaffinity(0, sizeof(affinity), &affinity), 0); 137 + break; 138 + } 139 + 140 + ASSERT_EQ(rseq_register_current_thread(), 0); 141 + 142 + ASSERT_EQ(prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET, 143 + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0), 0); 144 + 145 + self->noise_params.noise_nsecs = variant->noise_nsecs; 146 + self->noise_params.sleep_nsecs = variant->sleep_nsecs; 147 + self->noise_params.run = 1; 148 + 149 + ASSERT_EQ(pthread_create(&self->noise_thread, NULL, noise_thread, &self->noise_params), 0); 150 + } 151 + 152 + FIXTURE_TEARDOWN(slice_ext) 153 + { 154 + self->noise_params.run = 0; 155 + pthread_join(self->noise_thread, NULL); 156 + } 157 + 158 + TEST_F(slice_ext, slice_test) 159 + { 160 + unsigned long success = 0, yielded = 0, scheduled = 0, raced = 0; 161 + unsigned long total = 0, aborted = 0; 162 + struct rseq_abi *rs = rseq_get_abi(); 163 + struct timespec ts_start, ts_now; 164 + 165 + ASSERT_NE(rs, NULL); 166 + 167 + clock_gettime(CLOCK_MONOTONIC, &ts_start); 168 + do { 169 + struct timespec ts_cs; 170 + bool req = false; 171 + 172 + clock_gettime(CLOCK_MONOTONIC, &ts_cs); 173 + 174 + total++; 175 + RSEQ_WRITE_ONCE(rs->slice_ctrl.request, 1); 176 + do { 177 + clock_gettime(CLOCK_MONOTONIC, &ts_now); 178 + } while (!elapsed(&ts_cs, &ts_now, variant->slice_nsecs)); 179 + 180 + /* 181 + * request can be cleared unconditionally, but for making 182 + * the stats work this is actually checking it first 183 + */ 184 + if (RSEQ_READ_ONCE(rs->slice_ctrl.request)) { 185 + RSEQ_WRITE_ONCE(rs->slice_ctrl.request, 0); 186 + /* Race between check and clear! */ 187 + req = true; 188 + success++; 189 + } 190 + 191 + if (RSEQ_READ_ONCE(rs->slice_ctrl.granted)) { 192 + /* The above raced against a late grant */ 193 + if (req) 194 + success--; 195 + if (variant->no_yield) { 196 + syscall(__NR_getpid); 197 + aborted++; 198 + } else { 199 + yielded++; 200 + if (!syscall(__NR_rseq_slice_yield)) 201 + raced++; 202 + } 203 + } else { 204 + if (!req) 205 + scheduled++; 206 + } 207 + 208 + clock_gettime(CLOCK_MONOTONIC, &ts_now); 209 + } while (!elapsed(&ts_start, &ts_now, variant->total_nsecs)); 210 + 211 + printf("# Total %12ld\n", total); 212 + printf("# Success %12ld\n", success); 213 + printf("# Yielded %12ld\n", yielded); 214 + printf("# Aborted %12ld\n", aborted); 215 + printf("# Scheduled %12ld\n", scheduled); 216 + printf("# Raced %12ld\n", raced); 217 + } 218 + 219 + TEST_HARNESS_MAIN
+2
tools/testing/selftests/sched_ext/Makefile
··· 183 183 select_cpu_dispatch_bad_dsq \ 184 184 select_cpu_dispatch_dbl_dsp \ 185 185 select_cpu_vtime \ 186 + rt_stall \ 186 187 test_example \ 188 + total_bw \ 187 189 188 190 testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets))) 189 191
+23
tools/testing/selftests/sched_ext/rt_stall.bpf.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * A scheduler that verifies whether RT tasks can stall SCHED_EXT tasks. 4 + * 5 + * Copyright (c) 2025 NVIDIA Corporation. 6 + */ 7 + 8 + #include <scx/common.bpf.h> 9 + 10 + char _license[] SEC("license") = "GPL"; 11 + 12 + UEI_DEFINE(uei); 13 + 14 + void BPF_STRUCT_OPS(rt_stall_exit, struct scx_exit_info *ei) 15 + { 16 + UEI_RECORD(uei, ei); 17 + } 18 + 19 + SEC(".struct_ops.link") 20 + struct sched_ext_ops rt_stall_ops = { 21 + .exit = (void *)rt_stall_exit, 22 + .name = "rt_stall", 23 + };
+240
tools/testing/selftests/sched_ext/rt_stall.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright (c) 2025 NVIDIA Corporation. 4 + */ 5 + #define _GNU_SOURCE 6 + #include <stdio.h> 7 + #include <stdlib.h> 8 + #include <unistd.h> 9 + #include <sched.h> 10 + #include <sys/prctl.h> 11 + #include <sys/types.h> 12 + #include <sys/wait.h> 13 + #include <time.h> 14 + #include <linux/sched.h> 15 + #include <signal.h> 16 + #include <bpf/bpf.h> 17 + #include <scx/common.h> 18 + #include <unistd.h> 19 + #include "rt_stall.bpf.skel.h" 20 + #include "scx_test.h" 21 + #include "../kselftest.h" 22 + 23 + #define CORE_ID 0 /* CPU to pin tasks to */ 24 + #define RUN_TIME 5 /* How long to run the test in seconds */ 25 + 26 + /* Simple busy-wait function for test tasks */ 27 + static void process_func(void) 28 + { 29 + while (1) { 30 + /* Busy wait */ 31 + for (volatile unsigned long i = 0; i < 10000000UL; i++) 32 + ; 33 + } 34 + } 35 + 36 + /* Set CPU affinity to a specific core */ 37 + static void set_affinity(int cpu) 38 + { 39 + cpu_set_t mask; 40 + 41 + CPU_ZERO(&mask); 42 + CPU_SET(cpu, &mask); 43 + if (sched_setaffinity(0, sizeof(mask), &mask) != 0) { 44 + perror("sched_setaffinity"); 45 + exit(EXIT_FAILURE); 46 + } 47 + } 48 + 49 + /* Set task scheduling policy and priority */ 50 + static void set_sched(int policy, int priority) 51 + { 52 + struct sched_param param; 53 + 54 + param.sched_priority = priority; 55 + if (sched_setscheduler(0, policy, &param) != 0) { 56 + perror("sched_setscheduler"); 57 + exit(EXIT_FAILURE); 58 + } 59 + } 60 + 61 + /* Get process runtime from /proc/<pid>/stat */ 62 + static float get_process_runtime(int pid) 63 + { 64 + char path[256]; 65 + FILE *file; 66 + long utime, stime; 67 + int fields; 68 + 69 + snprintf(path, sizeof(path), "/proc/%d/stat", pid); 70 + file = fopen(path, "r"); 71 + if (file == NULL) { 72 + perror("Failed to open stat file"); 73 + return -1; 74 + } 75 + 76 + /* Skip the first 13 fields and read the 14th and 15th */ 77 + fields = fscanf(file, 78 + "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu", 79 + &utime, &stime); 80 + fclose(file); 81 + 82 + if (fields != 2) { 83 + fprintf(stderr, "Failed to read stat file\n"); 84 + return -1; 85 + } 86 + 87 + /* Calculate the total time spent in the process */ 88 + long total_time = utime + stime; 89 + long ticks_per_second = sysconf(_SC_CLK_TCK); 90 + float runtime_seconds = total_time * 1.0 / ticks_per_second; 91 + 92 + return runtime_seconds; 93 + } 94 + 95 + static enum scx_test_status setup(void **ctx) 96 + { 97 + struct rt_stall *skel; 98 + 99 + skel = rt_stall__open(); 100 + SCX_FAIL_IF(!skel, "Failed to open"); 101 + SCX_ENUM_INIT(skel); 102 + SCX_FAIL_IF(rt_stall__load(skel), "Failed to load skel"); 103 + 104 + *ctx = skel; 105 + 106 + return SCX_TEST_PASS; 107 + } 108 + 109 + static bool sched_stress_test(bool is_ext) 110 + { 111 + /* 112 + * We're expecting the EXT task to get around 5% of CPU time when 113 + * competing with the RT task (small 1% fluctuations are expected). 114 + * 115 + * However, the EXT task should get at least 4% of the CPU to prove 116 + * that the EXT deadline server is working correctly. A percentage 117 + * less than 4% indicates a bug where RT tasks can potentially 118 + * stall SCHED_EXT tasks, causing the test to fail. 119 + */ 120 + const float expected_min_ratio = 0.04; /* 4% */ 121 + const char *class_str = is_ext ? 
"EXT" : "FAIR"; 122 + 123 + float ext_runtime, rt_runtime, actual_ratio; 124 + int ext_pid, rt_pid; 125 + 126 + ksft_print_header(); 127 + ksft_set_plan(1); 128 + 129 + /* Create and set up a EXT task */ 130 + ext_pid = fork(); 131 + if (ext_pid == 0) { 132 + set_affinity(CORE_ID); 133 + process_func(); 134 + exit(0); 135 + } else if (ext_pid < 0) { 136 + perror("fork task"); 137 + ksft_exit_fail(); 138 + } 139 + 140 + /* Create an RT task */ 141 + rt_pid = fork(); 142 + if (rt_pid == 0) { 143 + set_affinity(CORE_ID); 144 + set_sched(SCHED_FIFO, 50); 145 + process_func(); 146 + exit(0); 147 + } else if (rt_pid < 0) { 148 + perror("fork for RT task"); 149 + ksft_exit_fail(); 150 + } 151 + 152 + /* Let the processes run for the specified time */ 153 + sleep(RUN_TIME); 154 + 155 + /* Get runtime for the EXT task */ 156 + ext_runtime = get_process_runtime(ext_pid); 157 + if (ext_runtime == -1) 158 + ksft_exit_fail_msg("Error getting runtime for %s task (PID %d)\n", 159 + class_str, ext_pid); 160 + ksft_print_msg("Runtime of %s task (PID %d) is %f seconds\n", 161 + class_str, ext_pid, ext_runtime); 162 + 163 + /* Get runtime for the RT task */ 164 + rt_runtime = get_process_runtime(rt_pid); 165 + if (rt_runtime == -1) 166 + ksft_exit_fail_msg("Error getting runtime for RT task (PID %d)\n", rt_pid); 167 + ksft_print_msg("Runtime of RT task (PID %d) is %f seconds\n", rt_pid, rt_runtime); 168 + 169 + /* Kill the processes */ 170 + kill(ext_pid, SIGKILL); 171 + kill(rt_pid, SIGKILL); 172 + waitpid(ext_pid, NULL, 0); 173 + waitpid(rt_pid, NULL, 0); 174 + 175 + /* Verify that the scx task got enough runtime */ 176 + actual_ratio = ext_runtime / (ext_runtime + rt_runtime); 177 + ksft_print_msg("%s task got %.2f%% of total runtime\n", 178 + class_str, actual_ratio * 100); 179 + 180 + if (actual_ratio >= expected_min_ratio) { 181 + ksft_test_result_pass("PASS: %s task got more than %.2f%% of runtime\n", 182 + class_str, expected_min_ratio * 100); 183 + return true; 184 + } 185 + ksft_test_result_fail("FAIL: %s task got less than %.2f%% of runtime\n", 186 + class_str, expected_min_ratio * 100); 187 + return false; 188 + } 189 + 190 + static enum scx_test_status run(void *ctx) 191 + { 192 + struct rt_stall *skel = ctx; 193 + struct bpf_link *link = NULL; 194 + bool res; 195 + int i; 196 + 197 + /* 198 + * Test if the dl_server is working both with and without the 199 + * sched_ext scheduler attached. 200 + * 201 + * This ensures all the scenarios are covered: 202 + * - fair_server stop -> ext_server start 203 + * - ext_server stop -> fair_server stop 204 + */ 205 + for (i = 0; i < 4; i++) { 206 + bool is_ext = i % 2; 207 + 208 + if (is_ext) { 209 + memset(&skel->data->uei, 0, sizeof(skel->data->uei)); 210 + link = bpf_map__attach_struct_ops(skel->maps.rt_stall_ops); 211 + SCX_FAIL_IF(!link, "Failed to attach scheduler"); 212 + } 213 + res = sched_stress_test(is_ext); 214 + if (is_ext) { 215 + SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_NONE)); 216 + bpf_link__destroy(link); 217 + } 218 + 219 + if (!res) 220 + ksft_exit_fail(); 221 + } 222 + 223 + return SCX_TEST_PASS; 224 + } 225 + 226 + static void cleanup(void *ctx) 227 + { 228 + struct rt_stall *skel = ctx; 229 + 230 + rt_stall__destroy(skel); 231 + } 232 + 233 + struct scx_test rt_stall = { 234 + .name = "rt_stall", 235 + .description = "Verify that RT tasks cannot stall SCHED_EXT tasks", 236 + .setup = setup, 237 + .run = run, 238 + .cleanup = cleanup, 239 + }; 240 + REGISTER_SCX_TEST(&rt_stall)
+281
tools/testing/selftests/sched_ext/total_bw.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Test to verify that total_bw value remains consistent across all CPUs 4 + * in different BPF program states. 5 + * 6 + * Copyright (C) 2025 NVIDIA Corporation. 7 + */ 8 + #include <bpf/bpf.h> 9 + #include <errno.h> 10 + #include <pthread.h> 11 + #include <scx/common.h> 12 + #include <stdio.h> 13 + #include <stdlib.h> 14 + #include <string.h> 15 + #include <sys/wait.h> 16 + #include <unistd.h> 17 + #include "minimal.bpf.skel.h" 18 + #include "scx_test.h" 19 + 20 + #define MAX_CPUS 512 21 + #define STRESS_DURATION_SEC 5 22 + 23 + struct total_bw_ctx { 24 + struct minimal *skel; 25 + long baseline_bw[MAX_CPUS]; 26 + int nr_cpus; 27 + }; 28 + 29 + static void *cpu_stress_thread(void *arg) 30 + { 31 + volatile int i; 32 + time_t end_time = time(NULL) + STRESS_DURATION_SEC; 33 + 34 + while (time(NULL) < end_time) 35 + for (i = 0; i < 1000000; i++) 36 + ; 37 + 38 + return NULL; 39 + } 40 + 41 + /* 42 + * The first enqueue on a CPU causes the DL server to start, for that 43 + * reason run stressor threads in the hopes it schedules on all CPUs. 44 + */ 45 + static int run_cpu_stress(int nr_cpus) 46 + { 47 + pthread_t *threads; 48 + int i, ret = 0; 49 + 50 + threads = calloc(nr_cpus, sizeof(pthread_t)); 51 + if (!threads) 52 + return -ENOMEM; 53 + 54 + /* Create threads to run on each CPU */ 55 + for (i = 0; i < nr_cpus; i++) { 56 + if (pthread_create(&threads[i], NULL, cpu_stress_thread, NULL)) { 57 + ret = -errno; 58 + fprintf(stderr, "Failed to create thread %d: %s\n", i, strerror(-ret)); 59 + break; 60 + } 61 + } 62 + 63 + /* Wait for all threads to complete */ 64 + for (i = 0; i < nr_cpus; i++) { 65 + if (threads[i]) 66 + pthread_join(threads[i], NULL); 67 + } 68 + 69 + free(threads); 70 + return ret; 71 + } 72 + 73 + static int read_total_bw_values(long *bw_values, int max_cpus) 74 + { 75 + FILE *fp; 76 + char line[256]; 77 + int cpu_count = 0; 78 + 79 + fp = fopen("/sys/kernel/debug/sched/debug", "r"); 80 + if (!fp) { 81 + SCX_ERR("Failed to open debug file"); 82 + return -1; 83 + } 84 + 85 + while (fgets(line, sizeof(line), fp)) { 86 + char *bw_str = strstr(line, "total_bw"); 87 + 88 + if (bw_str) { 89 + bw_str = strchr(bw_str, ':'); 90 + if (bw_str) { 91 + /* Only store up to max_cpus values */ 92 + if (cpu_count < max_cpus) 93 + bw_values[cpu_count] = atol(bw_str + 1); 94 + cpu_count++; 95 + } 96 + } 97 + } 98 + 99 + fclose(fp); 100 + return cpu_count; 101 + } 102 + 103 + static bool verify_total_bw_consistency(long *bw_values, int count) 104 + { 105 + int i; 106 + long first_value; 107 + 108 + if (count <= 0) 109 + return false; 110 + 111 + first_value = bw_values[0]; 112 + 113 + for (i = 1; i < count; i++) { 114 + if (bw_values[i] != first_value) { 115 + SCX_ERR("Inconsistent total_bw: CPU0=%ld, CPU%d=%ld", 116 + first_value, i, bw_values[i]); 117 + return false; 118 + } 119 + } 120 + 121 + return true; 122 + } 123 + 124 + static int fetch_verify_total_bw(long *bw_values, int nr_cpus) 125 + { 126 + int attempts = 0; 127 + int max_attempts = 10; 128 + int count; 129 + 130 + /* 131 + * The first enqueue on a CPU causes the DL server to start, for that 132 + * reason run stressor threads in the hopes it schedules on all CPUs. 
133 + */ 134 + if (run_cpu_stress(nr_cpus) < 0) { 135 + SCX_ERR("Failed to run CPU stress"); 136 + return -1; 137 + } 138 + 139 + /* Try multiple times to get stable values */ 140 + while (attempts < max_attempts) { 141 + count = read_total_bw_values(bw_values, nr_cpus); 142 + fprintf(stderr, "Read %d total_bw values (testing %d CPUs)\n", count, nr_cpus); 143 + /* If system has more CPUs than we're testing, that's OK */ 144 + if (count < nr_cpus) { 145 + SCX_ERR("Expected at least %d CPUs, got %d", nr_cpus, count); 146 + attempts++; 147 + sleep(1); 148 + continue; 149 + } 150 + 151 + /* Only verify the CPUs we're testing */ 152 + if (verify_total_bw_consistency(bw_values, nr_cpus)) { 153 + fprintf(stderr, "Values are consistent: %ld\n", bw_values[0]); 154 + return 0; 155 + } 156 + 157 + attempts++; 158 + sleep(1); 159 + } 160 + 161 + return -1; 162 + } 163 + 164 + static enum scx_test_status setup(void **ctx) 165 + { 166 + struct total_bw_ctx *test_ctx; 167 + 168 + if (access("/sys/kernel/debug/sched/debug", R_OK) != 0) { 169 + fprintf(stderr, "Skipping test: debugfs sched/debug not accessible\n"); 170 + return SCX_TEST_SKIP; 171 + } 172 + 173 + test_ctx = calloc(1, sizeof(*test_ctx)); 174 + if (!test_ctx) 175 + return SCX_TEST_FAIL; 176 + 177 + test_ctx->nr_cpus = sysconf(_SC_NPROCESSORS_ONLN); 178 + if (test_ctx->nr_cpus <= 0) { 179 + free(test_ctx); 180 + return SCX_TEST_FAIL; 181 + } 182 + 183 + /* If system has more CPUs than MAX_CPUS, just test the first MAX_CPUS */ 184 + if (test_ctx->nr_cpus > MAX_CPUS) 185 + test_ctx->nr_cpus = MAX_CPUS; 186 + 187 + /* Test scenario 1: BPF program not loaded */ 188 + /* Read and verify baseline total_bw before loading BPF program */ 189 + fprintf(stderr, "BPF prog initially not loaded, reading total_bw values\n"); 190 + if (fetch_verify_total_bw(test_ctx->baseline_bw, test_ctx->nr_cpus) < 0) { 191 + SCX_ERR("Failed to get stable baseline values"); 192 + free(test_ctx); 193 + return SCX_TEST_FAIL; 194 + } 195 + 196 + /* Load the BPF skeleton */ 197 + test_ctx->skel = minimal__open(); 198 + if (!test_ctx->skel) { 199 + free(test_ctx); 200 + return SCX_TEST_FAIL; 201 + } 202 + 203 + SCX_ENUM_INIT(test_ctx->skel); 204 + if (minimal__load(test_ctx->skel)) { 205 + minimal__destroy(test_ctx->skel); 206 + free(test_ctx); 207 + return SCX_TEST_FAIL; 208 + } 209 + 210 + *ctx = test_ctx; 211 + return SCX_TEST_PASS; 212 + } 213 + 214 + static enum scx_test_status run(void *ctx) 215 + { 216 + struct total_bw_ctx *test_ctx = ctx; 217 + struct bpf_link *link; 218 + long loaded_bw[MAX_CPUS]; 219 + long unloaded_bw[MAX_CPUS]; 220 + int i; 221 + 222 + /* Test scenario 2: BPF program loaded */ 223 + link = bpf_map__attach_struct_ops(test_ctx->skel->maps.minimal_ops); 224 + if (!link) { 225 + SCX_ERR("Failed to attach scheduler"); 226 + return SCX_TEST_FAIL; 227 + } 228 + 229 + fprintf(stderr, "BPF program loaded, reading total_bw values\n"); 230 + if (fetch_verify_total_bw(loaded_bw, test_ctx->nr_cpus) < 0) { 231 + SCX_ERR("Failed to get stable values with BPF loaded"); 232 + bpf_link__destroy(link); 233 + return SCX_TEST_FAIL; 234 + } 235 + bpf_link__destroy(link); 236 + 237 + /* Test scenario 3: BPF program unloaded */ 238 + fprintf(stderr, "BPF program unloaded, reading total_bw values\n"); 239 + if (fetch_verify_total_bw(unloaded_bw, test_ctx->nr_cpus) < 0) { 240 + SCX_ERR("Failed to get stable values after BPF unload"); 241 + return SCX_TEST_FAIL; 242 + } 243 + 244 + /* Verify all three scenarios have the same total_bw values */ 245 + for (i = 0; i < 
test_ctx->nr_cpus; i++) { 246 + if (test_ctx->baseline_bw[i] != loaded_bw[i]) { 247 + SCX_ERR("CPU%d: baseline_bw=%ld != loaded_bw=%ld", 248 + i, test_ctx->baseline_bw[i], loaded_bw[i]); 249 + return SCX_TEST_FAIL; 250 + } 251 + 252 + if (test_ctx->baseline_bw[i] != unloaded_bw[i]) { 253 + SCX_ERR("CPU%d: baseline_bw=%ld != unloaded_bw=%ld", 254 + i, test_ctx->baseline_bw[i], unloaded_bw[i]); 255 + return SCX_TEST_FAIL; 256 + } 257 + } 258 + 259 + fprintf(stderr, "All total_bw values are consistent across all scenarios\n"); 260 + return SCX_TEST_PASS; 261 + } 262 + 263 + static void cleanup(void *ctx) 264 + { 265 + struct total_bw_ctx *test_ctx = ctx; 266 + 267 + if (test_ctx) { 268 + if (test_ctx->skel) 269 + minimal__destroy(test_ctx->skel); 270 + free(test_ctx); 271 + } 272 + } 273 + 274 + struct scx_test total_bw = { 275 + .name = "total_bw", 276 + .description = "Verify total_bw consistency across BPF program states", 277 + .setup = setup, 278 + .run = run, 279 + .cleanup = cleanup, 280 + }; 281 + REGISTER_SCX_TEST(&total_bw)