Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
"Scheduler Kconfig space updates:

- Further consolidate configurable preemption modes (Peter Zijlstra)

Reduce the number of architectures that are allowed to offer
PREEMPT_NONE and PREEMPT_VOLUNTARY, reducing the number of
preemption models from four to just two: 'full' and 'lazy' on
up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
x86).

None and voluntary are only available as legacy features on
platforms that don't implement lazy preemption yet, or which don't
even support preemption.

The goal is to eventually remove cond_resched() and voluntary
preemption altogether.

RSEQ based 'scheduler time slice extension' support (Thomas Gleixner
and Peter Zijlstra):

This allows a thread to request a time slice extension when it enters
a critical section to avoid contention on a resource when the thread
is scheduled out inside of the critical section.

- Add fields and constants for time slice extension
- Provide static branch for time slice extensions
- Add statistics for time slice extensions
- Add prctl() to enable time slice extensions
- Implement sys_rseq_slice_yield()
- Implement syscall entry work for time slice extensions
- Implement time slice extension enforcement timer
- Reset slice extension when scheduled
- Implement rseq_grant_slice_extension()
- entry: Hook up rseq time slice extension
- selftests: Implement time slice extension test
- Allow registering RSEQ with slice extension
- Move slice_ext_nsec to debugfs
- Lower default slice extension
- selftests/rseq: Add rseq slice histogram script
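
A minimal userspace sketch of the flow described above, combining the new
prctl() opt-in with the request/grant protocol. The constant and field names
are taken from the UAPI additions in this merge; rseq_area is assumed to
point at the thread's registered struct rseq, and do_critical_section() is a
placeholder:

    /* One-time opt-in per thread; requires a registered rseq area */
    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);

    /* Around a short critical section */
    rseq_area->slice_ctrl.request = 1;
    barrier();                          /* compiler barrier, e.g. asm volatile("" ::: "memory") */
    do_critical_section();
    barrier();
    rseq_area->slice_ctrl.request = 0;
    if (rseq_area->slice_ctrl.granted)  /* kernel granted an extension */
        syscall(__NR_rseq_slice_yield); /* relinquish the CPU */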

Scheduler performance/scalability improvements:

- Update rq->avg_idle when a task is moved to an idle CPU, which
improves the scalability of various workloads (Shubhang Kaushik)

- Reorder fields in 'struct rq' for better caching (Blake Jones)

- Fair scheduler SMP NOHZ balancing code speedups (Shrikanth Hegde):
- Move checking for nohz cpus after time check
- Change likelihood of nohz.nr_cpus
- Remove nohz.nr_cpus and use weight of cpumask instead

- Avoid false sharing for sched_clock_irqtime (Wangyang Guo)

- Cleanups (Yury Norov):
- Drop useless cpumask_empty() in find_energy_efficient_cpu()
- Simplify task_numa_find_cpu()
- Use cpumask_weight_and() in sched_balance_find_dst_group()

DL scheduler updates:

- Add a deadline server for sched_ext tasks (by Andrea Righi and Joel
Fernandes, with fixes by Peter Zijlstra)

RT scheduler updates:

- Skip currently executing CPU in rto_next_cpu() (Chen Jinghuang)

Entry code updates and performance improvements (Jinjie Ruan)

This is part of the scheduler tree in this cycle due to inter-
dependencies with the RSEQ based time slice extension work:

- Remove unused syscall argument from syscall_trace_enter()
- Rework syscall_exit_to_user_mode_work() for architecture reuse
- Add arch_ptrace_report_syscall_entry/exit()
- Inline syscall_exit_work() and syscall_trace_enter()

Scheduler core updates (Peter Zijlstra):

- Rework sched_class::wakeup_preempt() and rq_modified_*()
- Avoid rq->lock bouncing in sched_balance_newidle()
- Rename rcu_dereference_check_sched_domain() =>
rcu_dereference_sched_domain()
- <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper

Fair scheduler updates/refactoring (Peter Zijlstra and Ingo Molnar):

- Fold the sched_avg update
- Change rcu_dereference_check_sched_domain() to rcu-sched
- Switch to rcu_dereference_all()
- Remove superfluous rcu_read_lock()
- Limit hrtick work
- Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocks
- Clean up comments in 'struct cfs_rq'
- Separate se->vlag from se->vprot
- Rename cfs_rq::avg_load to cfs_rq::sum_weight
- Rename cfs_rq::avg_vruntime to ::sum_w_vruntime & helper functions
- Introduce and use the vruntime_cmp() and vruntime_op() wrappers for
wrapped-signed arithmetic
- Sort out 'blocked_load*' namespace noise

Scheduler debugging code updates:

- Export hidden tracepoints to modules (Gabriele Monaco)

- Convert copy_from_user() + kstrtouint() to kstrtouint_from_user()
(Fushuai Wang)

- Add assertions to QUEUE_CLASS (Peter Zijlstra)

- hrtimer: Fix tracing oddity (Thomas Gleixner)

Misc fixes and cleanups:

- Re-evaluate scheduling when migrating queued tasks out of throttled
cgroups (Zicheng Qu)

- Remove task_struct->faults_disabled_mapping (Christoph Hellwig)

- Fix math notation errors in avg_vruntime comment (Zhan Xusheng)

- sched/cpufreq: Use %pe format for PTR_ERR() printing
(zenghongling)"

* tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
sched/cpufreq: Use %pe format for PTR_ERR() printing
sched/rt: Skip currently executing CPU in rto_next_cpu()
sched/clock: Avoid false sharing for sched_clock_irqtime
selftests/sched_ext: Add test for DL server total_bw consistency
selftests/sched_ext: Add test for sched_ext dl_server
sched/debug: Fix dl_server (re)start conditions
sched/debug: Add support to change sched_ext server params
sched_ext: Add a DL server for sched_ext tasks
sched/debug: Stop and start server based on if it was active
sched/debug: Fix updating of ppos on server write ops
sched/deadline: Clear the defer params
entry: Inline syscall_exit_work() and syscall_trace_enter()
entry: Add arch_ptrace_report_syscall_entry/exit()
entry: Rework syscall_exit_to_user_mode_work() for architecture reuse
entry: Remove unused syscall argument from syscall_trace_enter()
sched: remove task_struct->faults_disabled_mapping
sched: Update rq->avg_idle when a task is moved to an idle CPU
selftests/rseq: Add rseq slice histogram script
hrtimer: Fix trace oddity
...

+2599 -547
+5
Documentation/admin-guide/kernel-parameters.txt
···
 	rootflags=	[KNL] Set root filesystem mount option string

+	rseq_slice_ext=	[KNL] RSEQ based time slice extension
+			Format: boolean
+			Control enablement of RSEQ based time slice extension.
+			Default is 'on'.
+
 	initramfs_options= [KNL]
 		Specify mount options for for the initramfs mount.
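
For example, appending the parameter with the value 'off' to the kernel
command line would disable the mechanism system-wide; a sketch based on the
'Format: boolean' description above:

    rseq_slice_ext=off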
+1
Documentation/userspace-api/index.rst
···
    ebpf/index
    ioctl/index
    mseal
+   rseq

 Security-related interfaces
 ===========================
+140
Documentation/userspace-api/rseq.rst
=====================
Restartable Sequences
=====================

Restartable Sequences allow registering a per-thread userspace memory area
to be used as an ABI between kernel and userspace for three purposes:

* userspace restartable sequences

* quick access to read the current CPU number and node ID from userspace

* scheduler time slice extensions

Restartable sequences (per-cpu atomics)
---------------------------------------

Restartable sequences allow userspace to perform update operations on
per-cpu data without requiring heavyweight atomic operations. The actual
ABI is unfortunately only available in the code and selftests.

Quick access to CPU number, node ID
-----------------------------------

Allows implementing per-CPU data efficiently. Documentation is in code and
selftests. :(

Scheduler time slice extensions
-------------------------------

This allows a thread to request a time slice extension when it enters a
critical section to avoid contention on a resource when the thread is
scheduled out inside of the critical section.

The prerequisites for this functionality are:

* Enabled in Kconfig

* Enabled at boot time (default is enabled)

* A rseq userspace pointer has been registered for the thread

The thread has to enable the functionality via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);

prctl() returns 0 on success or otherwise with the following error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg4 and arg5 must be zero
ENOTSUPP  Functionality was disabled on the kernel command line
ENXIO     Available, but no rseq user struct registered
========= ==============================================================

The state can also be queried via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);

prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
disabled. Otherwise it returns with the following error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg3, arg4 and arg5 must be zero
========= ==============================================================

The availability and status are also exposed via the rseq ABI struct flags
field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
space and only for informational purposes.

If the mechanism was enabled via prctl(), the thread can request a time
slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
interrupted and the interrupt results in a reschedule request in the
kernel, then the kernel can grant a time slice extension and return to
userspace instead of scheduling out. The length of the extension is
determined by debugfs:rseq/slice_ext_nsec. The default value is 5 usec,
which is the minimum value. It can be increased up to 50 usecs; however,
doing so can and will affect the minimum scheduling latency.

Any proposed changes to this default will have to come with a selftest and
rseq-slice-hist.py output that shows the new value has merit.

The kernel indicates the grant by clearing rseq::slice_ctrl::request and
setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
thread after granting the extension, the kernel clears the granted bit to
indicate that to userspace.

If the request bit is still set when leaving the critical section,
userspace can clear it and continue.

If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
leaving the critical section to relinquish the CPU. The kernel enforces
this by arming a timer to prevent misbehaving userspace from abusing this
mechanism.

If both the request bit and the granted bit are false when leaving the
critical section, then this indicates that a grant was revoked and no
further action is required by userspace.

The required code flow is as follows::

    rseq->slice_ctrl.request = 1;
    barrier(); // Prevent compiler reordering
    critical_section();
    barrier(); // Prevent compiler reordering
    rseq->slice_ctrl.request = 0;
    if (rseq->slice_ctrl.granted)
            rseq_slice_yield();

As all of this is strictly CPU local, there are no atomicity requirements.
Checking the granted state is racy, but that cannot be avoided at all::

    if (rseq->slice_ctrl.granted)
        -> Interrupt results in schedule and grant revocation
            rseq_slice_yield();

So there is no point in pretending that this might be solved by an atomic
operation.

If the thread issues a syscall other than rseq_slice_yield(2) within the
granted timeslice extension, the grant is also revoked and the CPU is
relinquished immediately when entering the kernel. This is required as
syscalls might consume arbitrary CPU time until they reach a scheduling
point when the preemption model is either NONE or VOLUNTARY and therefore
might exceed the grant by far.

The preferred solution for user space is to use rseq_slice_yield(2), which
is side-effect free. The support for arbitrary syscalls is required to
support onion-layer architected applications, where the code handling the
critical section and requesting the time slice extension has no control
over the code within the critical section.

The kernel enforces flag consistency and terminates the thread with SIGSEGV
if it detects a violation.
+1
arch/alpha/kernel/syscalls/syscall.tbl
···
 578	common	file_getattr		sys_file_getattr
 579	common	file_setattr		sys_file_setattr
 580	common	listns			sys_listns
+581	common	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/arm/tools/syscall.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/arm64/tools/syscall_32.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/m68k/kernel/syscalls/syscall.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/microblaze/kernel/syscalls/syscall.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/mips/kernel/syscalls/syscall_n32.tbl
···
 468	n32	file_getattr		sys_file_getattr
 469	n32	file_setattr		sys_file_setattr
 470	n32	listns			sys_listns
+471	n32	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/mips/kernel/syscalls/syscall_n64.tbl
···
 468	n64	file_getattr		sys_file_getattr
 469	n64	file_setattr		sys_file_setattr
 470	n64	listns			sys_listns
+471	n64	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/mips/kernel/syscalls/syscall_o32.tbl
···
 468	o32	file_getattr		sys_file_getattr
 469	o32	file_setattr		sys_file_setattr
 470	o32	listns			sys_listns
+471	o32	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/parisc/kernel/syscalls/syscall.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/powerpc/kernel/syscalls/syscall.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	nospu	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/s390/kernel/syscalls/syscall.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/sh/kernel/syscalls/syscall.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/sparc/kernel/syscalls/syscall.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/x86/entry/syscalls/syscall_32.tbl
···
 468	i386	file_getattr		sys_file_getattr
 469	i386	file_setattr		sys_file_setattr
 470	i386	listns			sys_listns
+471	i386	rseq_slice_yield	sys_rseq_slice_yield
+1
arch/x86/entry/syscalls/syscall_64.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield

 #
 # Due to a historical design error, certain syscalls are numbered differently
-2
arch/x86/kernel/tsc.c
···
 	tsc_unstable = 1;
 	if (using_native_sched_clock())
 		clear_sched_clock_stable();
-	disable_sched_clock_irqtime();
 	pr_info("Marking TSC unstable due to clocksource watchdog\n");
 }
···
 	tsc_unstable = 1;
 	if (using_native_sched_clock())
 		clear_sched_clock_stable();
-	disable_sched_clock_irqtime();
 	pr_info("Marking TSC unstable due to %s\n", reason);

 	clocksource_mark_unstable(&clocksource_tsc_early);
+1
arch/xtensa/kernel/syscalls/syscall.tbl
···
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	rseq_slice_yield	sys_rseq_slice_yield
+19
include/linux/compiler_types.h
···
 		__scalar_type_to_expr_cases(long long),	\
 		default: (x)))

+/*
+ * __signed_scalar_typeof(x) - Declare a signed scalar type, leaving
+ * non-scalar types unchanged.
+ */
+
+#define __scalar_type_to_signed_cases(type)			\
+	unsigned type:	(signed type)0,				\
+	signed type:	(signed type)0
+
+#define __signed_scalar_typeof(x) typeof(			\
+	_Generic((x),						\
+		 char:	(signed char)0,				\
+		 __scalar_type_to_signed_cases(char),		\
+		 __scalar_type_to_signed_cases(short),		\
+		 __scalar_type_to_signed_cases(int),		\
+		 __scalar_type_to_signed_cases(long),		\
+		 __scalar_type_to_signed_cases(long long),	\
+		 default: (x)))
+
 /* Is this type a native word size -- useful for atomic operations */
 #define __native_word(t) \
 	(sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
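
A hedged illustration of what the new helper is for: casting the difference
of two unsigned values to the matching signed type makes wrapped-around
comparisons come out right, which is what the vruntime_cmp()/vruntime_op()
work mentioned in the pull message relies on. vruntime_before() below is a
hypothetical wrapper, not a function added by this merge:

    /* true if @a is before @b in wrapped (modular) u64 arithmetic */
    static inline bool vruntime_before(u64 a, u64 b)
    {
            return (__signed_scalar_typeof(a))(a - b) < 0;
    }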
+149 -18
include/linux/entry-common.h
··· 2 2 #ifndef __LINUX_ENTRYCOMMON_H 3 3 #define __LINUX_ENTRYCOMMON_H 4 4 5 + #include <linux/audit.h> 5 6 #include <linux/irq-entry-common.h> 6 7 #include <linux/livepatch.h> 7 8 #include <linux/ptrace.h> ··· 37 36 SYSCALL_WORK_SYSCALL_EMU | \ 38 37 SYSCALL_WORK_SYSCALL_AUDIT | \ 39 38 SYSCALL_WORK_SYSCALL_USER_DISPATCH | \ 39 + SYSCALL_WORK_SYSCALL_RSEQ_SLICE | \ 40 40 ARCH_SYSCALL_WORK_ENTER) 41 - 42 41 #define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \ 43 42 SYSCALL_WORK_SYSCALL_TRACE | \ 44 43 SYSCALL_WORK_SYSCALL_AUDIT | \ ··· 46 45 SYSCALL_WORK_SYSCALL_EXIT_TRAP | \ 47 46 ARCH_SYSCALL_WORK_EXIT) 48 47 49 - long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work); 48 + /** 49 + * arch_ptrace_report_syscall_entry - Architecture specific ptrace_report_syscall_entry() wrapper 50 + * 51 + * Invoked from syscall_trace_enter() to wrap ptrace_report_syscall_entry(). 52 + * 53 + * This allows architecture specific ptrace_report_syscall_entry() 54 + * implementations. If not defined by the architecture this falls back to 55 + * to ptrace_report_syscall_entry(). 56 + */ 57 + static __always_inline int arch_ptrace_report_syscall_entry(struct pt_regs *regs); 58 + 59 + #ifndef arch_ptrace_report_syscall_entry 60 + static __always_inline int arch_ptrace_report_syscall_entry(struct pt_regs *regs) 61 + { 62 + return ptrace_report_syscall_entry(regs); 63 + } 64 + #endif 65 + 66 + bool syscall_user_dispatch(struct pt_regs *regs); 67 + long trace_syscall_enter(struct pt_regs *regs, long syscall); 68 + void trace_syscall_exit(struct pt_regs *regs, long ret); 69 + 70 + static inline void syscall_enter_audit(struct pt_regs *regs, long syscall) 71 + { 72 + if (unlikely(audit_context())) { 73 + unsigned long args[6]; 74 + 75 + syscall_get_arguments(current, regs, args); 76 + audit_syscall_entry(syscall, args[0], args[1], args[2], args[3]); 77 + } 78 + } 79 + 80 + static __always_inline long syscall_trace_enter(struct pt_regs *regs, unsigned long work) 81 + { 82 + long syscall, ret = 0; 83 + 84 + /* 85 + * Handle Syscall User Dispatch. This must comes first, since 86 + * the ABI here can be something that doesn't make sense for 87 + * other syscall_work features. 88 + */ 89 + if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) { 90 + if (syscall_user_dispatch(regs)) 91 + return -1L; 92 + } 93 + 94 + /* 95 + * User space got a time slice extension granted and relinquishes 96 + * the CPU. The work stops the slice timer to avoid an extra round 97 + * through hrtimer_interrupt(). 98 + */ 99 + if (work & SYSCALL_WORK_SYSCALL_RSEQ_SLICE) 100 + rseq_syscall_enter_work(syscall_get_nr(current, regs)); 101 + 102 + /* Handle ptrace */ 103 + if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) { 104 + ret = arch_ptrace_report_syscall_entry(regs); 105 + if (ret || (work & SYSCALL_WORK_SYSCALL_EMU)) 106 + return -1L; 107 + } 108 + 109 + /* Do seccomp after ptrace, to catch any tracer changes. */ 110 + if (work & SYSCALL_WORK_SECCOMP) { 111 + ret = __secure_computing(); 112 + if (ret == -1L) 113 + return ret; 114 + } 115 + 116 + /* Either of the above might have changed the syscall number */ 117 + syscall = syscall_get_nr(current, regs); 118 + 119 + if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT)) 120 + syscall = trace_syscall_enter(regs, syscall); 121 + 122 + syscall_enter_audit(regs, syscall); 123 + 124 + return ret ? 
: syscall; 125 + } 50 126 51 127 /** 52 128 * syscall_enter_from_user_mode_work - Check and handle work before invoking ··· 153 75 unsigned long work = READ_ONCE(current_thread_info()->syscall_work); 154 76 155 77 if (work & SYSCALL_WORK_ENTER) 156 - syscall = syscall_trace_enter(regs, syscall, work); 78 + syscall = syscall_trace_enter(regs, work); 157 79 158 80 return syscall; 159 81 } ··· 190 112 return ret; 191 113 } 192 114 115 + /* 116 + * If SYSCALL_EMU is set, then the only reason to report is when 117 + * SINGLESTEP is set (i.e. PTRACE_SYSEMU_SINGLESTEP). This syscall 118 + * instruction has been already reported in syscall_enter_from_user_mode(). 119 + */ 120 + static __always_inline bool report_single_step(unsigned long work) 121 + { 122 + if (work & SYSCALL_WORK_SYSCALL_EMU) 123 + return false; 124 + 125 + return work & SYSCALL_WORK_SYSCALL_EXIT_TRAP; 126 + } 127 + 128 + /** 129 + * arch_ptrace_report_syscall_exit - Architecture specific ptrace_report_syscall_exit() 130 + * 131 + * This allows architecture specific ptrace_report_syscall_exit() 132 + * implementations. If not defined by the architecture this falls back to 133 + * to ptrace_report_syscall_exit(). 134 + */ 135 + static __always_inline void arch_ptrace_report_syscall_exit(struct pt_regs *regs, 136 + int step); 137 + 138 + #ifndef arch_ptrace_report_syscall_exit 139 + static __always_inline void arch_ptrace_report_syscall_exit(struct pt_regs *regs, 140 + int step) 141 + { 142 + ptrace_report_syscall_exit(regs, step); 143 + } 144 + #endif 145 + 193 146 /** 194 147 * syscall_exit_work - Handle work before returning to user mode 195 148 * @regs: Pointer to current pt_regs ··· 228 119 * 229 120 * Do one-time syscall specific work. 230 121 */ 231 - void syscall_exit_work(struct pt_regs *regs, unsigned long work); 122 + static __always_inline void syscall_exit_work(struct pt_regs *regs, unsigned long work) 123 + { 124 + bool step; 125 + 126 + /* 127 + * If the syscall was rolled back due to syscall user dispatching, 128 + * then the tracers below are not invoked for the same reason as 129 + * the entry side was not invoked in syscall_trace_enter(): The ABI 130 + * of these syscalls is unknown. 131 + */ 132 + if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) { 133 + if (unlikely(current->syscall_dispatch.on_dispatch)) { 134 + current->syscall_dispatch.on_dispatch = false; 135 + return; 136 + } 137 + } 138 + 139 + audit_syscall_exit(regs); 140 + 141 + if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT) 142 + trace_syscall_exit(regs, syscall_get_return_value(current, regs)); 143 + 144 + step = report_single_step(work); 145 + if (step || work & SYSCALL_WORK_SYSCALL_TRACE) 146 + arch_ptrace_report_syscall_exit(regs, step); 147 + } 232 148 233 149 /** 234 - * syscall_exit_to_user_mode_work - Handle work before returning to user mode 150 + * syscall_exit_to_user_mode_work - Handle one time work before returning to user mode 235 151 * @regs: Pointer to currents pt_regs 236 152 * 237 - * Same as step 1 and 2 of syscall_exit_to_user_mode() but without calling 238 - * exit_to_user_mode() to perform the final transition to user mode. 153 + * Step 1 of syscall_exit_to_user_mode() with the same calling convention. 239 154 * 240 - * Calling convention is the same as for syscall_exit_to_user_mode() and it 241 - * returns with all work handled and interrupts disabled. The caller must 242 - * invoke exit_to_user_mode() before actually switching to user mode to 243 - * make the final state transitions. 
Interrupts must stay disabled between 244 - * return from this function and the invocation of exit_to_user_mode(). 155 + * The caller must invoke steps 2-3 of syscall_exit_to_user_mode() afterwards. 245 156 */ 246 157 static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs) 247 158 { ··· 284 155 */ 285 156 if (unlikely(work & SYSCALL_WORK_EXIT)) 286 157 syscall_exit_work(regs, work); 287 - local_irq_disable_exit_to_user(); 288 - syscall_exit_to_user_mode_prepare(regs); 289 158 } 290 159 291 160 /** 292 161 * syscall_exit_to_user_mode - Handle work before returning to user mode 293 162 * @regs: Pointer to currents pt_regs 294 163 * 295 - * Invoked with interrupts enabled and fully valid regs. Returns with all 164 + * Invoked with interrupts enabled and fully valid @regs. Returns with all 296 165 * work handled, interrupts disabled such that the caller can immediately 297 166 * switch to user mode. Called from architecture specific syscall and ret 298 167 * from fork code. ··· 303 176 * - ptrace (single stepping) 304 177 * 305 178 * 2) Preparatory work 179 + * - Disable interrupts 306 180 * - Exit to user mode loop (common TIF handling). Invokes 307 181 * arch_exit_to_user_mode_work() for architecture specific TIF work 308 182 * - Architecture specific one time work arch_exit_to_user_mode_prepare() ··· 312 184 * 3) Final transition (lockdep, tracing, context tracking, RCU), i.e. the 313 185 * functionality in exit_to_user_mode(). 314 186 * 315 - * This is a combination of syscall_exit_to_user_mode_work() (1,2) and 316 - * exit_to_user_mode(). This function is preferred unless there is a 317 - * compelling architectural reason to use the separate functions. 187 + * This is a combination of syscall_exit_to_user_mode_work() (1), disabling 188 + * interrupts followed by syscall_exit_to_user_mode_prepare() (2) and 189 + * exit_to_user_mode() (3). This function is preferred unless there is a 190 + * compelling architectural reason to invoke the functions separately. 318 191 */ 319 192 static __always_inline void syscall_exit_to_user_mode(struct pt_regs *regs) 320 193 { 321 194 instrumentation_begin(); 322 195 syscall_exit_to_user_mode_work(regs); 196 + local_irq_disable_exit_to_user(); 197 + syscall_exit_to_user_mode_prepare(regs); 323 198 instrumentation_end(); 324 199 exit_to_user_mode(); 325 200 }
+11
include/linux/rseq.h
···
 static inline void rseq_syscall(struct pt_regs *regs) { }
 #endif /* !CONFIG_DEBUG_RSEQ */

+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+void rseq_syscall_enter_work(long syscall);
+int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
+#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline void rseq_syscall_enter_work(long syscall) { }
+static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
+{
+	return -ENOTSUPP;
+}
+#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
+
 #endif /* _LINUX_RSEQ_H */
+180 -12
include/linux/rseq_entry.h
··· 15 15 unsigned long cs; 16 16 unsigned long clear; 17 17 unsigned long fixup; 18 + unsigned long s_granted; 19 + unsigned long s_expired; 20 + unsigned long s_revoked; 21 + unsigned long s_yielded; 22 + unsigned long s_aborted; 18 23 }; 19 24 20 25 DECLARE_PER_CPU(struct rseq_stats, rseq_stats); ··· 42 37 #ifdef CONFIG_RSEQ 43 38 #include <linux/jump_label.h> 44 39 #include <linux/rseq.h> 40 + #include <linux/sched/signal.h> 45 41 #include <linux/uaccess.h> 46 42 47 43 #include <linux/tracepoint-defs.h> ··· 80 74 #else 81 75 #define rseq_inline __always_inline 82 76 #endif 77 + 78 + #ifdef CONFIG_RSEQ_SLICE_EXTENSION 79 + DECLARE_STATIC_KEY_TRUE(rseq_slice_extension_key); 80 + 81 + static __always_inline bool rseq_slice_extension_enabled(void) 82 + { 83 + return static_branch_likely(&rseq_slice_extension_key); 84 + } 85 + 86 + extern unsigned int rseq_slice_ext_nsecs; 87 + bool __rseq_arm_slice_extension_timer(void); 88 + 89 + static __always_inline bool rseq_arm_slice_extension_timer(void) 90 + { 91 + if (!rseq_slice_extension_enabled()) 92 + return false; 93 + 94 + if (likely(!current->rseq.slice.state.granted)) 95 + return false; 96 + 97 + return __rseq_arm_slice_extension_timer(); 98 + } 99 + 100 + static __always_inline void rseq_slice_clear_grant(struct task_struct *t) 101 + { 102 + if (IS_ENABLED(CONFIG_RSEQ_STATS) && t->rseq.slice.state.granted) 103 + rseq_stat_inc(rseq_stats.s_revoked); 104 + t->rseq.slice.state.granted = false; 105 + } 106 + 107 + static __always_inline bool rseq_grant_slice_extension(bool work_pending) 108 + { 109 + struct task_struct *curr = current; 110 + struct rseq_slice_ctrl usr_ctrl; 111 + union rseq_slice_state state; 112 + struct rseq __user *rseq; 113 + 114 + if (!rseq_slice_extension_enabled()) 115 + return false; 116 + 117 + /* If not enabled or not a return from interrupt, nothing to do. */ 118 + state = curr->rseq.slice.state; 119 + state.enabled &= curr->rseq.event.user_irq; 120 + if (likely(!state.state)) 121 + return false; 122 + 123 + rseq = curr->rseq.usrptr; 124 + scoped_user_rw_access(rseq, efault) { 125 + 126 + /* 127 + * Quick check conditions where a grant is not possible or 128 + * needs to be revoked. 129 + * 130 + * 1) Any TIF bit which needs to do extra work aside of 131 + * rescheduling prevents a grant. 132 + * 133 + * 2) A previous rescheduling request resulted in a slice 134 + * extension grant. 135 + */ 136 + if (unlikely(work_pending || state.granted)) { 137 + /* Clear user control unconditionally. 
No point for checking */ 138 + unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); 139 + rseq_slice_clear_grant(curr); 140 + return false; 141 + } 142 + 143 + unsafe_get_user(usr_ctrl.all, &rseq->slice_ctrl.all, efault); 144 + if (likely(!(usr_ctrl.request))) 145 + return false; 146 + 147 + /* Grant the slice extention */ 148 + usr_ctrl.request = 0; 149 + usr_ctrl.granted = 1; 150 + unsafe_put_user(usr_ctrl.all, &rseq->slice_ctrl.all, efault); 151 + } 152 + 153 + rseq_stat_inc(rseq_stats.s_granted); 154 + 155 + curr->rseq.slice.state.granted = true; 156 + /* Store expiry time for arming the timer on the way out */ 157 + curr->rseq.slice.expires = data_race(rseq_slice_ext_nsecs) + ktime_get_mono_fast_ns(); 158 + /* 159 + * This is racy against a remote CPU setting TIF_NEED_RESCHED in 160 + * several ways: 161 + * 162 + * 1) 163 + * CPU0 CPU1 164 + * clear_tsk() 165 + * set_tsk() 166 + * clear_preempt() 167 + * Raise scheduler IPI on CPU0 168 + * --> IPI 169 + * fold_need_resched() -> Folds correctly 170 + * 2) 171 + * CPU0 CPU1 172 + * set_tsk() 173 + * clear_tsk() 174 + * clear_preempt() 175 + * Raise scheduler IPI on CPU0 176 + * --> IPI 177 + * fold_need_resched() <- NOOP as TIF_NEED_RESCHED is false 178 + * 179 + * #1 is not any different from a regular remote reschedule as it 180 + * sets the previously not set bit and then raises the IPI which 181 + * folds it into the preempt counter 182 + * 183 + * #2 is obviously incorrect from a scheduler POV, but it's not 184 + * differently incorrect than the code below clearing the 185 + * reschedule request with the safety net of the timer. 186 + * 187 + * The important part is that the clearing is protected against the 188 + * scheduler IPI and also against any other interrupt which might 189 + * end up waking up a task and setting the bits in the middle of 190 + * the operation: 191 + * 192 + * clear_tsk() 193 + * ---> Interrupt 194 + * wakeup_on_this_cpu() 195 + * set_tsk() 196 + * set_preempt() 197 + * clear_preempt() 198 + * 199 + * which would be inconsistent state. 
200 + */ 201 + scoped_guard(irq) { 202 + clear_tsk_need_resched(curr); 203 + clear_preempt_need_resched(); 204 + } 205 + return true; 206 + 207 + efault: 208 + force_sig(SIGSEGV); 209 + return false; 210 + } 211 + 212 + #else /* CONFIG_RSEQ_SLICE_EXTENSION */ 213 + static inline bool rseq_slice_extension_enabled(void) { return false; } 214 + static inline bool rseq_arm_slice_extension_timer(void) { return false; } 215 + static inline void rseq_slice_clear_grant(struct task_struct *t) { } 216 + static inline bool rseq_grant_slice_extension(bool work_pending) { return false; } 217 + #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */ 83 218 84 219 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr); 85 220 bool rseq_debug_validate_ids(struct task_struct *t); ··· 506 359 unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault); 507 360 if (csaddr) 508 361 unsafe_get_user(*csaddr, &rseq->rseq_cs, efault); 362 + 363 + /* Open coded, so it's in the same user access region */ 364 + if (rseq_slice_extension_enabled()) { 365 + /* Unconditionally clear it, no point in conditionals */ 366 + unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); 367 + } 509 368 } 510 369 370 + rseq_slice_clear_grant(t); 511 371 /* Cache the new values */ 512 372 t->rseq.ids.cpu_cid = ids->cpu_cid; 513 373 rseq_stat_inc(rseq_stats.ids); ··· 610 456 */ 611 457 u64 csaddr; 612 458 613 - if (unlikely(get_user_inline(csaddr, &rseq->rseq_cs))) 614 - return false; 459 + scoped_user_rw_access(rseq, efault) { 460 + unsafe_get_user(csaddr, &rseq->rseq_cs, efault); 461 + 462 + /* Open coded, so it's in the same user access region */ 463 + if (rseq_slice_extension_enabled()) { 464 + /* Unconditionally clear it, no point in conditionals */ 465 + unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); 466 + } 467 + } 468 + 469 + rseq_slice_clear_grant(t); 615 470 616 471 if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) { 617 472 if (unlikely(!rseq_update_user_cs(t, regs, csaddr))) ··· 636 473 u32 node_id = cpu_to_node(ids.cpu_id); 637 474 638 475 return rseq_update_usr(t, regs, &ids, node_id); 476 + efault: 477 + return false; 639 478 } 640 479 641 480 static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *regs) ··· 692 527 static __always_inline bool 693 528 rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work) 694 529 { 695 - if (likely(!test_tif_rseq(ti_work))) 696 - return false; 697 - 698 - if (unlikely(__rseq_exit_to_user_mode_restart(regs))) { 699 - current->rseq.event.slowpath = true; 700 - set_tsk_thread_flag(current, TIF_NOTIFY_RESUME); 701 - return true; 530 + if (unlikely(test_tif_rseq(ti_work))) { 531 + if (unlikely(__rseq_exit_to_user_mode_restart(regs))) { 532 + current->rseq.event.slowpath = true; 533 + set_tsk_thread_flag(current, TIF_NOTIFY_RESUME); 534 + return true; 535 + } 536 + clear_tif_rseq(); 702 537 } 703 - 704 - clear_tif_rseq(); 705 - return false; 538 + /* 539 + * Arm the slice extension timer if nothing to do anymore and the 540 + * task really goes out to user space. 
541 + */ 542 + return rseq_arm_slice_extension_timer(); 706 543 } 707 544 708 545 #else /* CONFIG_GENERIC_ENTRY */ ··· 778 611 static inline void rseq_irqentry_exit_to_user_mode(void) { } 779 612 static inline void rseq_exit_to_user_mode_legacy(void) { } 780 613 static inline void rseq_debug_syscall_return(struct pt_regs *regs) { } 614 + static inline bool rseq_grant_slice_extension(bool work_pending) { return false; } 781 615 #endif /* !CONFIG_RSEQ */ 782 616 783 617 #endif /* _LINUX_RSEQ_ENTRY_H */
+31 -1
include/linux/rseq_types.h
··· 73 73 }; 74 74 75 75 /** 76 + * union rseq_slice_state - Status information for rseq time slice extension 77 + * @state: Compound to access the overall state 78 + * @enabled: Time slice extension is enabled for the task 79 + * @granted: Time slice extension was granted to the task 80 + */ 81 + union rseq_slice_state { 82 + u16 state; 83 + struct { 84 + u8 enabled; 85 + u8 granted; 86 + }; 87 + }; 88 + 89 + /** 90 + * struct rseq_slice - Status information for rseq time slice extension 91 + * @state: Time slice extension state 92 + * @expires: The time when a grant expires 93 + * @yielded: Indicator for rseq_slice_yield() 94 + */ 95 + struct rseq_slice { 96 + union rseq_slice_state state; 97 + u64 expires; 98 + u8 yielded; 99 + }; 100 + 101 + /** 76 102 * struct rseq_data - Storage for all rseq related data 77 103 * @usrptr: Pointer to the registered user space RSEQ memory 78 104 * @len: Length of the RSEQ region 79 - * @sig: Signature of critial section abort IPs 105 + * @sig: Signature of critical section abort IPs 80 106 * @event: Storage for event management 81 107 * @ids: Storage for cached CPU ID and MM CID 108 + * @slice: Storage for time slice extension data 82 109 */ 83 110 struct rseq_data { 84 111 struct rseq __user *usrptr; ··· 113 86 u32 sig; 114 87 struct rseq_event event; 115 88 struct rseq_ids ids; 89 + #ifdef CONFIG_RSEQ_SLICE_EXTENSION 90 + struct rseq_slice slice; 91 + #endif 116 92 }; 117 93 118 94 #else /* CONFIG_RSEQ */
+4 -10
include/linux/sched.h
···
 	u64				sum_exec_runtime;
 	u64				prev_sum_exec_runtime;
 	u64				vruntime;
-	union {
-		/*
-		 * When !@on_rq this field is vlag.
-		 * When cfs_rq->curr == se (which implies @on_rq)
-		 * this field is vprot. See protect_slice().
-		 */
-		s64			vlag;
-		u64			vprot;
-	};
+	/* Approximated virtual lag: */
+	s64				vlag;
+	/* 'Protected' deadline, to give out minimum quantums: */
+	u64				vprot;
 	u64				slice;

 	u64				nr_migrations;
···
 	struct mm_struct		*mm;
 	struct mm_struct		*active_mm;
-	struct address_space		*faults_disabled_mapping;

 	int				exit_state;
 	int				exit_code;
+1
include/linux/syscalls.h
···
 			  unsigned mask, struct statx __user *buffer);
 asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
 			 int flags, uint32_t sig);
+asmlinkage long sys_rseq_slice_yield(void);
 asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
 asmlinkage long sys_open_tree_attr(int dfd, const char __user *path,
 				   unsigned flags,
+9 -7
include/linux/thread_info.h
···
 	SYSCALL_WORK_BIT_SYSCALL_AUDIT,
 	SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
 	SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
+	SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE,
 };

-#define SYSCALL_WORK_SECCOMP		BIT(SYSCALL_WORK_BIT_SECCOMP)
-#define SYSCALL_WORK_SYSCALL_TRACEPOINT	BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
-#define SYSCALL_WORK_SYSCALL_TRACE	BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
-#define SYSCALL_WORK_SYSCALL_EMU	BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
-#define SYSCALL_WORK_SYSCALL_AUDIT	BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
-#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
-#define SYSCALL_WORK_SYSCALL_EXIT_TRAP	BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SECCOMP			BIT(SYSCALL_WORK_BIT_SECCOMP)
+#define SYSCALL_WORK_SYSCALL_TRACEPOINT		BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
+#define SYSCALL_WORK_SYSCALL_TRACE		BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
+#define SYSCALL_WORK_SYSCALL_EMU		BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
+#define SYSCALL_WORK_SYSCALL_AUDIT		BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
+#define SYSCALL_WORK_SYSCALL_USER_DISPATCH	BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
+#define SYSCALL_WORK_SYSCALL_EXIT_TRAP		BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SYSCALL_RSEQ_SLICE		BIT(SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE)
 #endif

 #include <asm/thread_info.h>
+4 -1
include/uapi/asm-generic/unistd.h
···
 #define __NR_listns 470
 __SYSCALL(__NR_listns, sys_listns)

+#define __NR_rseq_slice_yield 471
+__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+
 #undef __NR_syscalls
-#define __NR_syscalls 471
+#define __NR_syscalls 472

 /*
  * 32 bit systems traditionally used different
+10
include/uapi/linux/prctl.h
···
 # define PR_FUTEX_HASH_SET_SLOTS	1
 # define PR_FUTEX_HASH_GET_SLOTS	2

+/* RSEQ time slice extensions */
+#define PR_RSEQ_SLICE_EXTENSION		79
+# define PR_RSEQ_SLICE_EXTENSION_GET	1
+# define PR_RSEQ_SLICE_EXTENSION_SET	2
+/*
+ * Bits for RSEQ_SLICE_EXTENSION_GET/SET
+ * PR_RSEQ_SLICE_EXT_ENABLE: Enable
+ */
+# define PR_RSEQ_SLICE_EXT_ENABLE	0x01
+
 #endif /* _LINUX_PRCTL_H */
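
A small sketch of using the new prctl() constants from a thread that has
already registered an rseq area. The helper name is illustrative, and error
handling follows the codes documented in Documentation/userspace-api/rseq.rst
above:

    #include <linux/prctl.h>
    #include <sys/prctl.h>

    static int rseq_slice_ext_enable_and_query(void)
    {
            if (prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
                      PR_RSEQ_SLICE_EXT_ENABLE, 0, 0))
                    return -1;  /* see the error codes in the rseq.rst table */

            /* Returns PR_RSEQ_SLICE_EXT_ENABLE if enabled, 0 if disabled */
            return prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
    }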
+40 -1
include/uapi/linux/rseq.h
··· 19 19 }; 20 20 21 21 enum rseq_flags { 22 - RSEQ_FLAG_UNREGISTER = (1 << 0), 22 + RSEQ_FLAG_UNREGISTER = (1 << 0), 23 + RSEQ_FLAG_SLICE_EXT_DEFAULT_ON = (1 << 1), 23 24 }; 24 25 25 26 enum rseq_cs_flags_bit { 27 + /* Historical and unsupported bits */ 26 28 RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0, 27 29 RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1, 28 30 RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2, 31 + /* (3) Intentional gap to put new bits into a separate byte */ 32 + 33 + /* User read only feature flags */ 34 + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4, 35 + RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5, 29 36 }; 30 37 31 38 enum rseq_cs_flags { ··· 42 35 (1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT), 43 36 RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE = 44 37 (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT), 38 + 39 + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE = 40 + (1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT), 41 + RSEQ_CS_FLAG_SLICE_EXT_ENABLED = 42 + (1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT), 45 43 }; 46 44 47 45 /* ··· 64 52 __u64 post_commit_offset; 65 53 __u64 abort_ip; 66 54 } __attribute__((aligned(4 * sizeof(__u64)))); 55 + 56 + /** 57 + * rseq_slice_ctrl - Time slice extension control structure 58 + * @all: Compound value 59 + * @request: Request for a time slice extension 60 + * @granted: Granted time slice extension 61 + * 62 + * @request is set by user space and can be cleared by user space or kernel 63 + * space. @granted is set and cleared by the kernel and must only be read 64 + * by user space. 65 + */ 66 + struct rseq_slice_ctrl { 67 + union { 68 + __u32 all; 69 + struct { 70 + __u8 request; 71 + __u8 granted; 72 + __u16 __reserved; 73 + }; 74 + }; 75 + }; 67 76 68 77 /* 69 78 * struct rseq is aligned on 4 * 8 bytes to ensure it is always ··· 173 140 * (allocated uniquely within a memory map). 174 141 */ 175 142 __u32 mm_cid; 143 + 144 + /* 145 + * Time slice extension control structure. CPU local updates from 146 + * kernel and user space. 147 + */ 148 + struct rseq_slice_ctrl slice_ctrl; 176 149 177 150 /* 178 151 * Flexible array member at end of structure, after last feature field.
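
The new read-only feature bits land in the existing flags field of the
registered struct rseq, so a thread can probe for support before attempting
the prctl() opt-in. A minimal sketch; rseq_area is assumed to point at the
thread's registered area and enable_via_prctl() is a hypothetical helper
wrapping the prctl() call shown earlier:

    /* Feature bits maintained by the kernel, read-only for userspace */
    _Bool avail   = rseq_area->flags & RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
    _Bool enabled = rseq_area->flags & RSEQ_CS_FLAG_SLICE_EXT_ENABLED;

    if (avail && !enabled)
            enable_via_prctl();  /* supported, but this thread has not opted in yet */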
+12
init/Kconfig
···

 	  If unsure, say Y.

+config RSEQ_SLICE_EXTENSION
+	bool "Enable rseq-based time slice extension mechanism"
+	depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+	help
+	  Allows userspace to request a limited time slice extension when
+	  returning from an interrupt to user space via the RSEQ shared
+	  data ABI. If granted, that allows to complete a critical section,
+	  so that other threads are not stuck on a conflicted resource,
+	  while the task is scheduled out.
+
+	  If unsure, say N.
+
 config RSEQ_STATS
 	default n
 	bool "Enable lightweight statistics of restartable sequences" if EXPERT
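
For reference, a kernel configuration that builds the feature would, given
the dependencies above, contain roughly the following fragment (a sketch;
GENERIC_ENTRY and HAVE_GENERIC_TIF_BITS are selected by the architecture
rather than set by hand):

    CONFIG_RSEQ=y
    CONFIG_HIGH_RES_TIMERS=y
    CONFIG_RSEQ_SLICE_EXTENSION=y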
-1
init/init_task.c
···
 	.nr_cpus_allowed= NR_CPUS,
 	.mm		= NULL,
 	.active_mm	= &init_mm,
-	.faults_disabled_mapping = NULL,
 	.restart_block	= {
 		.fn = do_no_restart_syscall,
 	},
+3
kernel/Kconfig.preempt
···

 choice
 	prompt "Preemption Model"
+	default PREEMPT_LAZY if ARCH_HAS_PREEMPT_LAZY
 	default PREEMPT_NONE

 config PREEMPT_NONE
 	bool "No Forced Preemption (Server)"
 	depends on !PREEMPT_RT
+	depends on ARCH_NO_PREEMPT
 	select PREEMPT_NONE_BUILD if !PREEMPT_DYNAMIC
 	help
 	  This is the traditional Linux preemption model, geared towards
···
 config PREEMPT_VOLUNTARY
 	bool "Voluntary Kernel Preemption (Desktop)"
+	depends on !ARCH_HAS_PREEMPT_LAZY
 	depends on !ARCH_NO_PREEMPT
 	depends on !PREEMPT_RT
 	select PREEMPT_VOLUNTARY_BUILD if !PREEMPT_DYNAMIC
+25 -2
kernel/entry/common.c
··· 17 17 #define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK) 18 18 #endif 19 19 20 + /* TIF bits, which prevent a time slice extension. */ 21 + #ifdef CONFIG_PREEMPT_RT 22 + /* 23 + * Since rseq slice ext has a direct correlation to the worst case 24 + * scheduling latency (schedule is delayed after all), only have it affect 25 + * LAZY reschedules on PREEMPT_RT for now. 26 + * 27 + * However, since this delay is only applicable to userspace, a value 28 + * for rseq_slice_extension_nsec that is strictly less than the worst case 29 + * kernel space preempt_disable() region, should mean the scheduling latency 30 + * is not affected, even for !LAZY. 31 + * 32 + * However, since this value depends on the hardware at hand, it cannot be 33 + * pre-determined in any sensible way. Hence punt on this problem for now. 34 + */ 35 + # define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED_LAZY) 36 + #else 37 + # define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY) 38 + #endif 39 + #define TIF_SLICE_EXT_DENY (EXIT_TO_USER_MODE_WORK & ~TIF_SLICE_EXT_SCHED) 40 + 20 41 static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs, 21 42 unsigned long ti_work) 22 43 { ··· 49 28 50 29 local_irq_enable_exit_to_user(ti_work); 51 30 52 - if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) 53 - schedule(); 31 + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) { 32 + if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY)) 33 + schedule(); 34 + } 54 35 55 36 if (ti_work & _TIF_UPROBE) 56 37 uprobe_notify_resume(regs);
-7
kernel/entry/common.h
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _COMMON_H
-#define _COMMON_H
-
-bool syscall_user_dispatch(struct pt_regs *regs);
-
-#endif
+9 -90
kernel/entry/syscall-common.c
··· 1 1 // SPDX-License-Identifier: GPL-2.0 2 2 3 - #include <linux/audit.h> 4 3 #include <linux/entry-common.h> 5 - #include "common.h" 6 4 7 5 #define CREATE_TRACE_POINTS 8 6 #include <trace/events/syscalls.h> 9 7 10 - static inline void syscall_enter_audit(struct pt_regs *regs, long syscall) 8 + /* Out of line to prevent tracepoint code duplication */ 9 + 10 + long trace_syscall_enter(struct pt_regs *regs, long syscall) 11 11 { 12 - if (unlikely(audit_context())) { 13 - unsigned long args[6]; 14 - 15 - syscall_get_arguments(current, regs, args); 16 - audit_syscall_entry(syscall, args[0], args[1], args[2], args[3]); 17 - } 18 - } 19 - 20 - long syscall_trace_enter(struct pt_regs *regs, long syscall, 21 - unsigned long work) 22 - { 23 - long ret = 0; 24 - 12 + trace_sys_enter(regs, syscall); 25 13 /* 26 - * Handle Syscall User Dispatch. This must comes first, since 27 - * the ABI here can be something that doesn't make sense for 28 - * other syscall_work features. 14 + * Probes or BPF hooks in the tracepoint may have changed the 15 + * system call number. Reread it. 29 16 */ 30 - if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) { 31 - if (syscall_user_dispatch(regs)) 32 - return -1L; 33 - } 34 - 35 - /* Handle ptrace */ 36 - if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) { 37 - ret = ptrace_report_syscall_entry(regs); 38 - if (ret || (work & SYSCALL_WORK_SYSCALL_EMU)) 39 - return -1L; 40 - } 41 - 42 - /* Do seccomp after ptrace, to catch any tracer changes. */ 43 - if (work & SYSCALL_WORK_SECCOMP) { 44 - ret = __secure_computing(); 45 - if (ret == -1L) 46 - return ret; 47 - } 48 - 49 - /* Either of the above might have changed the syscall number */ 50 - syscall = syscall_get_nr(current, regs); 51 - 52 - if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT)) { 53 - trace_sys_enter(regs, syscall); 54 - /* 55 - * Probes or BPF hooks in the tracepoint may have changed the 56 - * system call number as well. 57 - */ 58 - syscall = syscall_get_nr(current, regs); 59 - } 60 - 61 - syscall_enter_audit(regs, syscall); 62 - 63 - return ret ? : syscall; 17 + return syscall_get_nr(current, regs); 64 18 } 65 19 66 - /* 67 - * If SYSCALL_EMU is set, then the only reason to report is when 68 - * SINGLESTEP is set (i.e. PTRACE_SYSEMU_SINGLESTEP). This syscall 69 - * instruction has been already reported in syscall_enter_from_user_mode(). 70 - */ 71 - static inline bool report_single_step(unsigned long work) 20 + void trace_syscall_exit(struct pt_regs *regs, long ret) 72 21 { 73 - if (work & SYSCALL_WORK_SYSCALL_EMU) 74 - return false; 75 - 76 - return work & SYSCALL_WORK_SYSCALL_EXIT_TRAP; 77 - } 78 - 79 - void syscall_exit_work(struct pt_regs *regs, unsigned long work) 80 - { 81 - bool step; 82 - 83 - /* 84 - * If the syscall was rolled back due to syscall user dispatching, 85 - * then the tracers below are not invoked for the same reason as 86 - * the entry side was not invoked in syscall_trace_enter(): The ABI 87 - * of these syscalls is unknown. 88 - */ 89 - if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) { 90 - if (unlikely(current->syscall_dispatch.on_dispatch)) { 91 - current->syscall_dispatch.on_dispatch = false; 92 - return; 93 - } 94 - } 95 - 96 - audit_syscall_exit(regs); 97 - 98 - if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT) 99 - trace_sys_exit(regs, syscall_get_return_value(current, regs)); 100 - 101 - step = report_single_step(work); 102 - if (step || work & SYSCALL_WORK_SYSCALL_TRACE) 103 - ptrace_report_syscall_exit(regs, step); 22 + trace_sys_exit(regs, ret); 104 23 }
+2 -2
kernel/entry/syscall_user_dispatch.c
···
 /*
  * Copyright (C) 2020 Collabora Ltd.
  */
+
+#include <linux/entry-common.h>
 #include <linux/sched.h>
 #include <linux/prctl.h>
 #include <linux/ptrace.h>
···
 #include <linux/sched/task_stack.h>

 #include <asm/syscall.h>
-
-#include "common.h"

 static void trigger_sigsys(struct pt_regs *regs)
 {
+362 -3
kernel/rseq.c
··· 71 71 #define RSEQ_BUILD_SLOW_PATH 72 72 73 73 #include <linux/debugfs.h> 74 + #include <linux/hrtimer.h> 75 + #include <linux/percpu.h> 76 + #include <linux/prctl.h> 74 77 #include <linux/ratelimit.h> 75 78 #include <linux/rseq_entry.h> 76 79 #include <linux/sched.h> ··· 123 120 } 124 121 #endif /* CONFIG_TRACEPOINTS */ 125 122 126 - #ifdef CONFIG_DEBUG_FS 127 123 #ifdef CONFIG_RSEQ_STATS 128 124 DEFINE_PER_CPU(struct rseq_stats, rseq_stats); 129 125 ··· 140 138 stats.cs += data_race(per_cpu(rseq_stats.cs, cpu)); 141 139 stats.clear += data_race(per_cpu(rseq_stats.clear, cpu)); 142 140 stats.fixup += data_race(per_cpu(rseq_stats.fixup, cpu)); 141 + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) { 142 + stats.s_granted += data_race(per_cpu(rseq_stats.s_granted, cpu)); 143 + stats.s_expired += data_race(per_cpu(rseq_stats.s_expired, cpu)); 144 + stats.s_revoked += data_race(per_cpu(rseq_stats.s_revoked, cpu)); 145 + stats.s_yielded += data_race(per_cpu(rseq_stats.s_yielded, cpu)); 146 + stats.s_aborted += data_race(per_cpu(rseq_stats.s_aborted, cpu)); 147 + } 143 148 } 144 149 145 150 seq_printf(m, "exit: %16lu\n", stats.exit); ··· 157 148 seq_printf(m, "cs: %16lu\n", stats.cs); 158 149 seq_printf(m, "clear: %16lu\n", stats.clear); 159 150 seq_printf(m, "fixup: %16lu\n", stats.fixup); 151 + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) { 152 + seq_printf(m, "sgrant: %16lu\n", stats.s_granted); 153 + seq_printf(m, "sexpir: %16lu\n", stats.s_expired); 154 + seq_printf(m, "srevok: %16lu\n", stats.s_revoked); 155 + seq_printf(m, "syield: %16lu\n", stats.s_yielded); 156 + seq_printf(m, "sabort: %16lu\n", stats.s_aborted); 157 + } 160 158 return 0; 161 159 } 162 160 ··· 221 205 .release = single_release, 222 206 }; 223 207 208 + static void rseq_slice_ext_init(struct dentry *root_dir); 209 + 224 210 static int __init rseq_debugfs_init(void) 225 211 { 226 212 struct dentry *root_dir = debugfs_create_dir("rseq", NULL); 227 213 228 214 debugfs_create_file("debug", 0644, root_dir, NULL, &debug_ops); 229 215 rseq_stats_init(root_dir); 216 + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) 217 + rseq_slice_ext_init(root_dir); 230 218 return 0; 231 219 } 232 220 __initcall(rseq_debugfs_init); 233 - #endif /* CONFIG_DEBUG_FS */ 234 221 235 222 static bool rseq_set_ids(struct task_struct *t, struct rseq_ids *ids, u32 node_id) 236 223 { ··· 408 389 */ 409 390 SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig) 410 391 { 392 + u32 rseqfl = 0; 393 + 411 394 if (flags & RSEQ_FLAG_UNREGISTER) { 412 395 if (flags & ~RSEQ_FLAG_UNREGISTER) 413 396 return -EINVAL; ··· 426 405 return 0; 427 406 } 428 407 429 - if (unlikely(flags)) 408 + if (unlikely(flags & ~(RSEQ_FLAG_SLICE_EXT_DEFAULT_ON))) 430 409 return -EINVAL; 431 410 432 411 if (current->rseq.usrptr) { ··· 461 440 if (!access_ok(rseq, rseq_len)) 462 441 return -EFAULT; 463 442 443 + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) { 444 + rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE; 445 + if (rseq_slice_extension_enabled() && 446 + (flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)) 447 + rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED; 448 + } 449 + 464 450 scoped_user_write_access(rseq, efault) { 465 451 /* 466 452 * If the rseq_cs pointer is non-NULL on registration, clear it to ··· 477 449 * clearing the fields. Don't bother reading it, just reset it. 
478 450 */ 479 451 unsafe_put_user(0UL, &rseq->rseq_cs, efault); 452 + unsafe_put_user(rseqfl, &rseq->flags, efault); 480 453 /* Initialize IDs in user space */ 481 454 unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault); 482 455 unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault); 483 456 unsafe_put_user(0U, &rseq->node_id, efault); 484 457 unsafe_put_user(0U, &rseq->mm_cid, efault); 458 + unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); 485 459 } 486 460 487 461 /* ··· 493 463 current->rseq.usrptr = rseq; 494 464 current->rseq.len = rseq_len; 495 465 current->rseq.sig = sig; 466 + 467 + #ifdef CONFIG_RSEQ_SLICE_EXTENSION 468 + current->rseq.slice.state.enabled = !!(rseqfl & RSEQ_CS_FLAG_SLICE_EXT_ENABLED); 469 + #endif 496 470 497 471 /* 498 472 * If rseq was previously inactive, and has just been ··· 510 476 efault: 511 477 return -EFAULT; 512 478 } 479 + 480 + #ifdef CONFIG_RSEQ_SLICE_EXTENSION 481 + struct slice_timer { 482 + struct hrtimer timer; 483 + void *cookie; 484 + }; 485 + 486 + static const unsigned int rseq_slice_ext_nsecs_min = 5 * NSEC_PER_USEC; 487 + static const unsigned int rseq_slice_ext_nsecs_max = 50 * NSEC_PER_USEC; 488 + unsigned int rseq_slice_ext_nsecs __read_mostly = rseq_slice_ext_nsecs_min; 489 + static DEFINE_PER_CPU(struct slice_timer, slice_timer); 490 + DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key); 491 + 492 + /* 493 + * When the timer expires and the task is still in user space, the return 494 + * from interrupt will revoke the grant and schedule. If the task already 495 + * entered the kernel via a syscall and the timer fires before the syscall 496 + * work was able to cancel it, then depending on the preemption model this 497 + * will either reschedule on return from interrupt or in the syscall work 498 + * below. 499 + */ 500 + static enum hrtimer_restart rseq_slice_expired(struct hrtimer *tmr) 501 + { 502 + struct slice_timer *st = container_of(tmr, struct slice_timer, timer); 503 + 504 + /* 505 + * Validate that the task which armed the timer is still on the 506 + * CPU. It could have been scheduled out without canceling the 507 + * timer. 508 + */ 509 + if (st->cookie == current && current->rseq.slice.state.granted) { 510 + rseq_stat_inc(rseq_stats.s_expired); 511 + set_need_resched_current(); 512 + } 513 + return HRTIMER_NORESTART; 514 + } 515 + 516 + bool __rseq_arm_slice_extension_timer(void) 517 + { 518 + struct slice_timer *st = this_cpu_ptr(&slice_timer); 519 + struct task_struct *curr = current; 520 + 521 + lockdep_assert_irqs_disabled(); 522 + 523 + /* 524 + * This check prevents a task, which got a time slice extension 525 + * granted, from exceeding the maximum scheduling latency when the 526 + * grant expired before going out to user space. Don't bother to 527 + * clear the grant here, it will be cleaned up automatically before 528 + * going out to user space after being scheduled back in. 529 + */ 530 + if ((unlikely(curr->rseq.slice.expires < ktime_get_mono_fast_ns()))) { 531 + set_need_resched_current(); 532 + return true; 533 + } 534 + 535 + /* 536 + * Store the task pointer as a cookie for comparison in the timer 537 + * function. This is safe as the timer is CPU local and cannot be 538 + * in the expiry function at this point. 
539 + */ 540 + st->cookie = curr; 541 + hrtimer_start(&st->timer, curr->rseq.slice.expires, HRTIMER_MODE_ABS_PINNED_HARD); 542 + /* Arm the syscall entry work */ 543 + set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE); 544 + return false; 545 + } 546 + 547 + static void rseq_cancel_slice_extension_timer(void) 548 + { 549 + struct slice_timer *st = this_cpu_ptr(&slice_timer); 550 + 551 + /* 552 + * st->cookie can be safely read as preemption is disabled and the 553 + * timer is CPU local. 554 + * 555 + * As this is most probably the first expiring timer, the cancel is 556 + * expensive as it has to reprogram the hardware, but that's less 557 + * expensive than going through a full hrtimer_interrupt() cycle 558 + * for nothing. 559 + * 560 + * hrtimer_try_to_cancel() is sufficient here as the timer is CPU 561 + * local and once the hrtimer code disabled interrupts the timer 562 + * callback cannot be running. 563 + */ 564 + if (st->cookie == current) 565 + hrtimer_try_to_cancel(&st->timer); 566 + } 567 + 568 + static inline void rseq_slice_set_need_resched(struct task_struct *curr) 569 + { 570 + /* 571 + * The interrupt guard is required to prevent inconsistent state in 572 + * this case: 573 + * 574 + * set_tsk_need_resched() 575 + * --> Interrupt 576 + * wakeup() 577 + * set_tsk_need_resched() 578 + * set_preempt_need_resched() 579 + * schedule_on_return() 580 + * clear_tsk_need_resched() 581 + * clear_preempt_need_resched() 582 + * set_preempt_need_resched() <- Inconsistent state 583 + * 584 + * This is safe vs. a remote set of TIF_NEED_RESCHED because that 585 + * only sets the already set bit and does not create inconsistent 586 + * state. 587 + */ 588 + scoped_guard(irq) 589 + set_need_resched_current(); 590 + } 591 + 592 + static void rseq_slice_validate_ctrl(u32 expected) 593 + { 594 + u32 __user *sctrl = &current->rseq.usrptr->slice_ctrl.all; 595 + u32 uval; 596 + 597 + if (get_user(uval, sctrl) || uval != expected) 598 + force_sig(SIGSEGV); 599 + } 600 + 601 + /* 602 + * Invoked from syscall entry if a time slice extension was granted and the 603 + * kernel did not clear it before user space left the critical section. 604 + * 605 + * While the recommended way to relinquish the CPU side effect free is 606 + * rseq_slice_yield(2), any syscall within a granted slice terminates the 607 + * grant and immediately reschedules if required. This supports onion layer 608 + * applications, where the code requesting the grant cannot control the 609 + * code within the critical section. 610 + */ 611 + void rseq_syscall_enter_work(long syscall) 612 + { 613 + struct task_struct *curr = current; 614 + struct rseq_slice_ctrl ctrl = { .granted = curr->rseq.slice.state.granted }; 615 + 616 + clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE); 617 + 618 + if (static_branch_unlikely(&rseq_debug_enabled)) 619 + rseq_slice_validate_ctrl(ctrl.all); 620 + 621 + /* 622 + * The kernel might have raced, revoked the grant and updated 623 + * userspace, but kept the SLICE work set. 624 + */ 625 + if (!ctrl.granted) 626 + return; 627 + 628 + /* 629 + * Required to stabilize the per CPU timer pointer and to make 630 + * set_tsk_need_resched() correct on PREEMPT[RT] kernels. 631 + * 632 + * Leaving the scope will reschedule on preemption models FULL, 633 + * LAZY and RT if necessary. 634 + */ 635 + scoped_guard(preempt) { 636 + rseq_cancel_slice_extension_timer(); 637 + /* 638 + * Now that preemption is disabled, quickly check whether 639 + * the task was already rescheduled before arriving here. 
640 + */ 641 + if (!curr->rseq.event.sched_switch) { 642 + rseq_slice_set_need_resched(curr); 643 + 644 + if (syscall == __NR_rseq_slice_yield) { 645 + rseq_stat_inc(rseq_stats.s_yielded); 646 + /* Update the yielded state for syscall return */ 647 + curr->rseq.slice.yielded = 1; 648 + } else { 649 + rseq_stat_inc(rseq_stats.s_aborted); 650 + } 651 + } 652 + } 653 + /* Reschedule on NONE/VOLUNTARY preemption models */ 654 + cond_resched(); 655 + 656 + /* Clear the grant in kernel state and user space */ 657 + curr->rseq.slice.state.granted = false; 658 + if (put_user(0U, &curr->rseq.usrptr->slice_ctrl.all)) 659 + force_sig(SIGSEGV); 660 + } 661 + 662 + int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3) 663 + { 664 + switch (arg2) { 665 + case PR_RSEQ_SLICE_EXTENSION_GET: 666 + if (arg3) 667 + return -EINVAL; 668 + return current->rseq.slice.state.enabled ? PR_RSEQ_SLICE_EXT_ENABLE : 0; 669 + 670 + case PR_RSEQ_SLICE_EXTENSION_SET: { 671 + u32 rflags, valid = RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE; 672 + bool enable = !!(arg3 & PR_RSEQ_SLICE_EXT_ENABLE); 673 + 674 + if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE) 675 + return -EINVAL; 676 + if (!rseq_slice_extension_enabled()) 677 + return -ENOTSUPP; 678 + if (!current->rseq.usrptr) 679 + return -ENXIO; 680 + 681 + /* No change? */ 682 + if (enable == !!current->rseq.slice.state.enabled) 683 + return 0; 684 + 685 + if (get_user(rflags, &current->rseq.usrptr->flags)) 686 + goto die; 687 + 688 + if (current->rseq.slice.state.enabled) 689 + valid |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED; 690 + 691 + if ((rflags & valid) != valid) 692 + goto die; 693 + 694 + rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_ENABLED; 695 + rflags |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE; 696 + if (enable) 697 + rflags |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED; 698 + 699 + if (put_user(rflags, &current->rseq.usrptr->flags)) 700 + goto die; 701 + 702 + current->rseq.slice.state.enabled = enable; 703 + return 0; 704 + } 705 + default: 706 + return -EINVAL; 707 + } 708 + die: 709 + force_sig(SIGSEGV); 710 + return -EFAULT; 711 + } 712 + 713 + /** 714 + * sys_rseq_slice_yield - yield the current processor side effect free if a 715 + * task granted with a time slice extension is done with 716 + * the critical work before being forced out. 717 + * 718 + * Return: 1 if the task successfully yielded the CPU within the granted slice. 719 + * 0 if the slice extension was either never granted or was revoked by 720 + * going over the granted extension, using a syscall other than this one 721 + * or being scheduled out earlier due to a subsequent interrupt. 722 + * 723 + * The syscall does not schedule because the syscall entry work immediately 724 + * relinquishes the CPU and schedules if required. 
725 + */ 726 + SYSCALL_DEFINE0(rseq_slice_yield) 727 + { 728 + int yielded = !!current->rseq.slice.yielded; 729 + 730 + current->rseq.slice.yielded = 0; 731 + return yielded; 732 + } 733 + 734 + static int rseq_slice_ext_show(struct seq_file *m, void *p) 735 + { 736 + seq_printf(m, "%d\n", rseq_slice_ext_nsecs); 737 + return 0; 738 + } 739 + 740 + static ssize_t rseq_slice_ext_write(struct file *file, const char __user *ubuf, 741 + size_t count, loff_t *ppos) 742 + { 743 + unsigned int nsecs; 744 + 745 + if (kstrtouint_from_user(ubuf, count, 10, &nsecs)) 746 + return -EINVAL; 747 + 748 + if (nsecs < rseq_slice_ext_nsecs_min) 749 + return -ERANGE; 750 + 751 + if (nsecs > rseq_slice_ext_nsecs_max) 752 + return -ERANGE; 753 + 754 + rseq_slice_ext_nsecs = nsecs; 755 + 756 + return count; 757 + } 758 + 759 + static int rseq_slice_ext_open(struct inode *inode, struct file *file) 760 + { 761 + return single_open(file, rseq_slice_ext_show, inode->i_private); 762 + } 763 + 764 + static const struct file_operations slice_ext_ops = { 765 + .open = rseq_slice_ext_open, 766 + .read = seq_read, 767 + .write = rseq_slice_ext_write, 768 + .llseek = seq_lseek, 769 + .release = single_release, 770 + }; 771 + 772 + static void rseq_slice_ext_init(struct dentry *root_dir) 773 + { 774 + debugfs_create_file("slice_ext_nsec", 0644, root_dir, NULL, &slice_ext_ops); 775 + } 776 + 777 + static int __init rseq_slice_cmdline(char *str) 778 + { 779 + bool on; 780 + 781 + if (kstrtobool(str, &on)) 782 + return 0; 783 + 784 + if (!on) 785 + static_branch_disable(&rseq_slice_extension_key); 786 + return 1; 787 + } 788 + __setup("rseq_slice_ext=", rseq_slice_cmdline); 789 + 790 + static int __init rseq_slice_init(void) 791 + { 792 + unsigned int cpu; 793 + 794 + for_each_possible_cpu(cpu) { 795 + hrtimer_setup(per_cpu_ptr(&slice_timer.timer, cpu), rseq_slice_expired, 796 + CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD); 797 + } 798 + return 0; 799 + } 800 + device_initcall(rseq_slice_init); 801 + #else 802 + static void rseq_slice_ext_init(struct dentry *root_dir) { } 803 + #endif /* CONFIG_RSEQ_SLICE_EXTENSION */
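A minimal userspace sketch of how the new rseq pieces above fit together (not part of the commit): registration advertises RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE, the per-task enable bit is toggled through prctl(), a pinned hrtimer enforces the grant, and rseq_slice_yield(2) hands the CPU back once the critical work is done. The top-level prctl option name (PR_RSEQ_SLICE_EXTENSION) and the availability of the new constants in installed uapi headers are assumptions; only the GET/SET sub-commands, PR_RSEQ_SLICE_EXT_ENABLE, RSEQ_FLAG_SLICE_EXT_DEFAULT_ON and __NR_rseq_slice_yield appear in the diff itself.

#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int slice_ext_enable(void)
{
	/*
	 * PR_RSEQ_SLICE_EXTENSION as the prctl option is an assumption;
	 * the hunk only shows the SET/GET sub-commands dispatched in
	 * rseq_slice_extension_prctl(). A prior rseq(2) registration is
	 * required, otherwise the kernel returns -ENXIO.
	 */
	if (prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
		  PR_RSEQ_SLICE_EXT_ENABLE, 0, 0))
		return -1;

	/* Reads back PR_RSEQ_SLICE_EXT_ENABLE once the extension is on. */
	return prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET,
		     0, 0, 0);
}

static void critical_section(void)
{
	/* ... request an extension via rseq->slice_ctrl, do the work ... */

	/*
	 * Relinquish the CPU side effect free. The syscall returns 1 when
	 * the task yielded within a grant and 0 when no grant was active
	 * or it was already revoked. Any other syscall also terminates
	 * the grant (see rseq_syscall_enter_work() above), it is just
	 * counted as aborted instead of yielded.
	 */
	long yielded = syscall(__NR_rseq_slice_yield);
	(void)yielded;
}

int main(void)
{
	if (slice_ext_enable() < 0) {
		perror("rseq slice extension");
		return 1;
	}
	critical_section();
	return 0;
}

Alternatively, passing RSEQ_FLAG_SLICE_EXT_DEFAULT_ON to rseq(2) enables the extension at registration time, and the grant length can be tuned through the new debugfs file rseq/slice_ext_nsec within the 5-50 microsecond bounds defined above.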
+3
kernel/sched/clock.c
··· 173 173 scd->tick_gtod, __gtod_offset, 174 174 scd->tick_raw, __sched_clock_offset); 175 175 176 + disable_sched_clock_irqtime(); 176 177 static_branch_disable(&__sched_clock_stable); 177 178 } 178 179 ··· 239 238 240 239 if (__sched_clock_stable_early) 241 240 __set_sched_clock_stable(); 241 + else 242 + disable_sched_clock_irqtime(); /* disable if clock unstable. */ 242 243 243 244 return 0; 244 245 }
+64 -22
kernel/sched/core.c
··· 119 119 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp); 120 120 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp); 121 121 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp); 122 + EXPORT_TRACEPOINT_SYMBOL_GPL(sched_entry_tp); 123 + EXPORT_TRACEPOINT_SYMBOL_GPL(sched_exit_tp); 124 + EXPORT_TRACEPOINT_SYMBOL_GPL(sched_set_need_resched_tp); 122 125 123 126 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); 124 127 DEFINE_PER_CPU(struct rnd_state, sched_rnd_state); ··· 1144 1141 { 1145 1142 trace_sched_set_need_resched_tp(curr, smp_processor_id(), tif); 1146 1143 } 1144 + EXPORT_SYMBOL_GPL(__trace_set_need_resched); 1147 1145 1148 1146 void resched_curr(struct rq *rq) 1149 1147 { ··· 2099 2095 */ 2100 2096 uclamp_rq_inc(rq, p, flags); 2101 2097 2102 - rq->queue_mask |= p->sched_class->queue_mask; 2103 2098 p->sched_class->enqueue_task(rq, p, flags); 2104 2099 2105 2100 psi_enqueue(p, flags); ··· 2131 2128 * and mark the task ->sched_delayed. 2132 2129 */ 2133 2130 uclamp_rq_dec(rq, p); 2134 - rq->queue_mask |= p->sched_class->queue_mask; 2135 2131 return p->sched_class->dequeue_task(rq, p, flags); 2136 2132 } 2137 2133 ··· 2181 2179 { 2182 2180 struct task_struct *donor = rq->donor; 2183 2181 2184 - if (p->sched_class == donor->sched_class) 2185 - donor->sched_class->wakeup_preempt(rq, p, flags); 2186 - else if (sched_class_above(p->sched_class, donor->sched_class)) 2182 + if (p->sched_class == rq->next_class) { 2183 + rq->next_class->wakeup_preempt(rq, p, flags); 2184 + 2185 + } else if (sched_class_above(p->sched_class, rq->next_class)) { 2186 + rq->next_class->wakeup_preempt(rq, p, flags); 2187 2187 resched_curr(rq); 2188 + rq->next_class = p->sched_class; 2189 + } 2188 2190 2189 2191 /* 2190 2192 * A queue event has occurred, and we're going to schedule. 
In ··· 3626 3620 trace_sched_wakeup(p); 3627 3621 } 3628 3622 3623 + void update_rq_avg_idle(struct rq *rq) 3624 + { 3625 + u64 delta = rq_clock(rq) - rq->idle_stamp; 3626 + u64 max = 2*rq->max_idle_balance_cost; 3627 + 3628 + update_avg(&rq->avg_idle, delta); 3629 + 3630 + if (rq->avg_idle > max) 3631 + rq->avg_idle = max; 3632 + rq->idle_stamp = 0; 3633 + } 3634 + 3629 3635 static void 3630 3636 ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, 3631 3637 struct rq_flags *rf) ··· 3672 3654 rq_unpin_lock(rq, rf); 3673 3655 p->sched_class->task_woken(rq, p); 3674 3656 rq_repin_lock(rq, rf); 3675 - } 3676 - 3677 - if (rq->idle_stamp) { 3678 - u64 delta = rq_clock(rq) - rq->idle_stamp; 3679 - u64 max = 2*rq->max_idle_balance_cost; 3680 - 3681 - update_avg(&rq->avg_idle, delta); 3682 - 3683 - if (rq->avg_idle > max) 3684 - rq->avg_idle = max; 3685 - 3686 - rq->idle_stamp = 0; 3687 3657 } 3688 3658 } 3689 3659 ··· 6842 6836 pick_again: 6843 6837 next = pick_next_task(rq, rq->donor, &rf); 6844 6838 rq_set_donor(rq, next); 6839 + rq->next_class = next->sched_class; 6845 6840 if (unlikely(task_is_blocked(next))) { 6846 6841 next = find_proxy_task(rq, next, &rf); 6847 6842 if (!next) ··· 7587 7580 7588 7581 int sched_dynamic_mode(const char *str) 7589 7582 { 7590 - # ifndef CONFIG_PREEMPT_RT 7583 + # if !(defined(CONFIG_PREEMPT_RT) || defined(CONFIG_ARCH_HAS_PREEMPT_LAZY)) 7591 7584 if (!strcmp(str, "none")) 7592 7585 return preempt_dynamic_none; 7593 7586 ··· 8519 8512 dump_rq_tasks(rq, KERN_WARNING); 8520 8513 } 8521 8514 dl_server_stop(&rq->fair_server); 8515 + #ifdef CONFIG_SCHED_CLASS_EXT 8516 + dl_server_stop(&rq->ext_server); 8517 + #endif 8522 8518 rq_unlock_irqrestore(rq, &rf); 8523 8519 8524 8520 calc_load_migrate(rq); ··· 8697 8687 rq->rt.rt_runtime = global_rt_runtime(); 8698 8688 init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL); 8699 8689 #endif 8690 + rq->next_class = &idle_sched_class; 8691 + 8700 8692 rq->sd = NULL; 8701 8693 rq->rd = NULL; 8702 8694 rq->cpu_capacity = SCHED_CAPACITY_SCALE; ··· 8727 8715 hrtick_rq_init(rq); 8728 8716 atomic_set(&rq->nr_iowait, 0); 8729 8717 fair_server_init(rq); 8718 + #ifdef CONFIG_SCHED_CLASS_EXT 8719 + ext_server_init(rq); 8720 + #endif 8730 8721 8731 8722 #ifdef CONFIG_SCHED_CORE 8732 8723 rq->core = rq; ··· 9161 9146 { 9162 9147 unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE; 9163 9148 bool resched = false; 9149 + bool queued = false; 9164 9150 struct rq *rq; 9165 9151 9166 9152 CLASS(task_rq_lock, rq_guard)(tsk); ··· 9173 9157 scx_cgroup_move_task(tsk); 9174 9158 if (scope->running) 9175 9159 resched = true; 9160 + queued = scope->queued; 9176 9161 } 9177 9162 9178 9163 if (resched) 9179 9164 resched_curr(rq); 9165 + else if (queued) 9166 + wakeup_preempt(rq, tsk, 0); 9180 9167 9181 9168 __balance_callbacks(rq, &rq_guard.rf); 9182 9169 } ··· 10902 10883 flags |= DEQUEUE_NOCLOCK; 10903 10884 } 10904 10885 10905 - if (flags & DEQUEUE_CLASS) { 10906 - if (p->sched_class->switching_from) 10907 - p->sched_class->switching_from(rq, p); 10908 - } 10886 + if ((flags & DEQUEUE_CLASS) && p->sched_class->switching_from) 10887 + p->sched_class->switching_from(rq, p); 10909 10888 10910 10889 *ctx = (struct sched_change_ctx){ 10911 10890 .p = p, 10891 + .class = p->sched_class, 10912 10892 .flags = flags, 10913 10893 .queued = task_on_rq_queued(p), 10914 10894 .running = task_current_donor(rq, p), ··· 10938 10920 10939 10921 lockdep_assert_rq_held(rq); 10940 10922 10923 + /* 10924 + * Changing class without 
*QUEUE_CLASS is bad. 10925 + */ 10926 + WARN_ON_ONCE(p->sched_class != ctx->class && !(ctx->flags & ENQUEUE_CLASS)); 10927 + 10941 10928 if ((ctx->flags & ENQUEUE_CLASS) && p->sched_class->switching_to) 10942 10929 p->sched_class->switching_to(rq, p); 10943 10930 ··· 10954 10931 if (ctx->flags & ENQUEUE_CLASS) { 10955 10932 if (p->sched_class->switched_to) 10956 10933 p->sched_class->switched_to(rq, p); 10934 + 10935 + if (ctx->running) { 10936 + /* 10937 + * If this was a class promotion; let the old class 10938 + * know it got preempted. Note that none of the 10939 + * switch*_from() methods know the new class and none 10940 + * of the switch*_to() methods know the old class. 10941 + */ 10942 + if (sched_class_above(p->sched_class, ctx->class)) { 10943 + rq->next_class->wakeup_preempt(rq, p, 0); 10944 + rq->next_class = p->sched_class; 10945 + } 10946 + /* 10947 + * If this was a degradation in class; make sure to 10948 + * reschedule. 10949 + */ 10950 + if (sched_class_above(ctx->class, p->sched_class)) 10951 + resched_curr(rq); 10952 + } 10957 10953 } else { 10958 10954 p->sched_class->prio_changed(rq, p, ctx->prio); 10959 10955 }
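Among the kernel/sched/core.c changes above, the new EXPORT_TRACEPOINT_SYMBOL_GPL() lines make the bare sched_entry/sched_exit/sched_set_need_resched tracepoints usable from modules. A hedged sketch of a module-side probe follows; it assumes the tracepoints are declared via DECLARE_TRACE() in <trace/events/sched.h> and that the probe arguments follow the trace_sched_set_need_resched_tp(curr, cpu, tif) call site shown above.

#include <linux/module.h>
#include <linux/sched.h>
#include <linux/tracepoint.h>
#include <trace/events/sched.h>

static void probe_set_need_resched(void *data, struct task_struct *tsk,
				   int cpu, int tif)
{
	/* For example: count resched requests per CPU. Keep this NMI-safe. */
}

static int __init resched_probe_init(void)
{
	/* register_trace_<name>() is generated for every declared tracepoint. */
	return register_trace_sched_set_need_resched_tp(probe_set_need_resched,
							NULL);
}

static void __exit resched_probe_exit(void)
{
	unregister_trace_sched_set_need_resched_tp(probe_set_need_resched, NULL);
	tracepoint_synchronize_unregister();
}

module_init(resched_probe_init);
module_exit(resched_probe_exit);
MODULE_LICENSE("GPL");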
+1 -1
kernel/sched/cpufreq_schedutil.c
··· 682 682 "sugov:%d", 683 683 cpumask_first(policy->related_cpus)); 684 684 if (IS_ERR(thread)) { 685 - pr_err("failed to create sugov thread: %ld\n", PTR_ERR(thread)); 685 + pr_err("failed to create sugov thread: %pe\n", thread); 686 686 return PTR_ERR(thread); 687 687 } 688 688
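The one-line schedutil change above swaps the %ld/PTR_ERR() pair for the %pe specifier, which prints an ERR_PTR symbolically when CONFIG_SYMBOLIC_ERRNAME=y and falls back to the signed decimal value otherwise. An illustrative sketch (wrapped as a throwaway module init, not from the commit):

#include <linux/module.h>
#include <linux/err.h>
#include <linux/printk.h>
#include <linux/sched.h>

static int __init pe_demo_init(void)
{
	struct task_struct *thread = ERR_PTR(-EINTR);

	pr_err("failed to create sugov thread: %ld\n", PTR_ERR(thread)); /* "... thread: -4" */
	pr_err("failed to create sugov thread: %pe\n", thread);          /* "... thread: -EINTR" */
	return 0;
}
module_init(pe_demo_init);
MODULE_LICENSE("GPL");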
+5 -4
kernel/sched/cputime.c
··· 12 12 13 13 #ifdef CONFIG_IRQ_TIME_ACCOUNTING 14 14 15 + DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime); 16 + 15 17 /* 16 18 * There are no locks covering percpu hardirq/softirq time. 17 19 * They are only modified in vtime_account, on corresponding CPU ··· 27 25 */ 28 26 DEFINE_PER_CPU(struct irqtime, cpu_irqtime); 29 27 30 - int sched_clock_irqtime; 31 - 32 28 void enable_sched_clock_irqtime(void) 33 29 { 34 - sched_clock_irqtime = 1; 30 + static_branch_enable(&sched_clock_irqtime); 35 31 } 36 32 37 33 void disable_sched_clock_irqtime(void) 38 34 { 39 - sched_clock_irqtime = 0; 35 + if (irqtime_enabled()) 36 + static_branch_disable(&sched_clock_irqtime); 40 37 } 41 38 42 39 static void irqtime_account_delta(struct irqtime *irqtime, u64 delta,
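Together with the clock.c hunk further up, this converts sched_clock_irqtime from a plain int into a static key, so the per-event irqtime check becomes a patched branch rather than a memory load, and the key is cleared again if sched_clock turns out to be unstable. A short sketch of the assumed read side (irqtime_enabled() is only referenced, not shown, in the hunk):

/* Assumed helper, wrapping the key defined in the hunk above: */
static inline bool irqtime_enabled(void)
{
	return static_branch_likely(&sched_clock_irqtime);
}

/* Hypothetical caller: the disabled case costs a patched jump, not a load. */
static void account_irq_delta(u64 delta)
{
	if (!irqtime_enabled())
		return;
	/* ... accumulate delta into this CPU's struct irqtime ... */
}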
+74 -31
kernel/sched/deadline.c
··· 1449 1449 dl_se->dl_defer_idle = 0; 1450 1450 1451 1451 /* 1452 - * The fair server can consume its runtime while throttled (not queued/ 1453 - * running as regular CFS). 1452 + * The DL server can consume its runtime while throttled (not 1453 + * queued / running as regular CFS). 1454 1454 * 1455 1455 * If the server consumes its entire runtime in this state. The server 1456 1456 * is not required for the current period. Thus, reset the server by ··· 1535 1535 } 1536 1536 1537 1537 /* 1538 - * The fair server (sole dl_server) does not account for real-time 1539 - * workload because it is running fair work. 1538 + * The dl_server does not account for real-time workload because it 1539 + * is running fair work. 1540 1540 */ 1541 - if (dl_se == &rq->fair_server) 1541 + if (dl_se->dl_server) 1542 1542 return; 1543 1543 1544 1544 #ifdef CONFIG_RT_GROUP_SCHED ··· 1573 1573 * In the non-defer mode, the idle time is not accounted, as the 1574 1574 * server provides a guarantee. 1575 1575 * 1576 - * If the dl_server is in defer mode, the idle time is also considered 1577 - * as time available for the fair server, avoiding a penalty for the 1578 - * rt scheduler that did not consumed that time. 1576 + * If the dl_server is in defer mode, the idle time is also considered as 1577 + * time available for the dl_server, avoiding a penalty for the rt 1578 + * scheduler that did not consumed that time. 1579 1579 */ 1580 1580 void dl_server_update_idle(struct sched_dl_entity *dl_se, s64 delta_exec) 1581 1581 { ··· 1799 1799 struct rq *rq = dl_se->rq; 1800 1800 1801 1801 dl_se->dl_defer_idle = 0; 1802 - if (!dl_server(dl_se) || dl_se->dl_server_active) 1802 + if (!dl_server(dl_se) || dl_se->dl_server_active || !dl_se->dl_runtime) 1803 1803 return; 1804 1804 1805 1805 /* ··· 1860 1860 dl_se->dl_server = 1; 1861 1861 dl_se->dl_defer = 1; 1862 1862 setup_new_dl_entity(dl_se); 1863 + 1864 + #ifdef CONFIG_SCHED_CLASS_EXT 1865 + dl_se = &rq->ext_server; 1866 + 1867 + WARN_ON(dl_server(dl_se)); 1868 + 1869 + dl_server_apply_params(dl_se, runtime, period, 1); 1870 + 1871 + dl_se->dl_server = 1; 1872 + dl_se->dl_defer = 1; 1873 + setup_new_dl_entity(dl_se); 1874 + #endif 1863 1875 } 1864 1876 } 1865 1877 ··· 1898 1886 int cpu = cpu_of(rq); 1899 1887 struct dl_bw *dl_b; 1900 1888 unsigned long cap; 1901 - int retval = 0; 1902 1889 int cpus; 1903 1890 1904 1891 dl_b = dl_bw_of(cpu); ··· 1929 1918 dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime); 1930 1919 dl_se->dl_density = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime); 1931 1920 1932 - return retval; 1921 + return 0; 1933 1922 } 1934 1923 1935 1924 /* ··· 2526 2515 * Only called when both the current and waking task are -deadline 2527 2516 * tasks. 2528 2517 */ 2529 - static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p, 2530 - int flags) 2518 + static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p, int flags) 2531 2519 { 2520 + /* 2521 + * Can only get preempted by stop-class, and those should be 2522 + * few and short lived, doesn't really make sense to push 2523 + * anything away for that. 
2524 + */ 2525 + if (p->sched_class != &dl_sched_class) 2526 + return; 2527 + 2532 2528 if (dl_entity_preempt(&p->dl, &rq->donor->dl)) { 2533 2529 resched_curr(rq); 2534 2530 return; ··· 3209 3191 raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags); 3210 3192 } 3211 3193 3194 + static void dl_server_add_bw(struct root_domain *rd, int cpu) 3195 + { 3196 + struct sched_dl_entity *dl_se; 3197 + 3198 + dl_se = &cpu_rq(cpu)->fair_server; 3199 + if (dl_server(dl_se) && cpu_active(cpu)) 3200 + __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu)); 3201 + 3202 + #ifdef CONFIG_SCHED_CLASS_EXT 3203 + dl_se = &cpu_rq(cpu)->ext_server; 3204 + if (dl_server(dl_se) && cpu_active(cpu)) 3205 + __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu)); 3206 + #endif 3207 + } 3208 + 3209 + static u64 dl_server_read_bw(int cpu) 3210 + { 3211 + u64 dl_bw = 0; 3212 + 3213 + if (cpu_rq(cpu)->fair_server.dl_server) 3214 + dl_bw += cpu_rq(cpu)->fair_server.dl_bw; 3215 + 3216 + #ifdef CONFIG_SCHED_CLASS_EXT 3217 + if (cpu_rq(cpu)->ext_server.dl_server) 3218 + dl_bw += cpu_rq(cpu)->ext_server.dl_bw; 3219 + #endif 3220 + 3221 + return dl_bw; 3222 + } 3223 + 3212 3224 void dl_clear_root_domain(struct root_domain *rd) 3213 3225 { 3214 3226 int i; ··· 3257 3209 * dl_servers are not tasks. Since dl_add_task_root_domain ignores 3258 3210 * them, we need to account for them here explicitly. 3259 3211 */ 3260 - for_each_cpu(i, rd->span) { 3261 - struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server; 3262 - 3263 - if (dl_server(dl_se) && cpu_active(i)) 3264 - __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(i)); 3265 - } 3212 + for_each_cpu(i, rd->span) 3213 + dl_server_add_bw(rd, i); 3266 3214 } 3267 3215 3268 3216 void dl_clear_root_domain_cpu(int cpu) ··· 3404 3360 #endif 3405 3361 3406 3362 DEFINE_SCHED_CLASS(dl) = { 3407 - 3408 - .queue_mask = 8, 3409 - 3410 3363 .enqueue_task = enqueue_task_dl, 3411 3364 .dequeue_task = dequeue_task_dl, 3412 3365 .yield_task = yield_task_dl, ··· 3697 3656 dl_se->dl_non_contending = 0; 3698 3657 dl_se->dl_overrun = 0; 3699 3658 dl_se->dl_server = 0; 3659 + dl_se->dl_defer = 0; 3660 + dl_se->dl_defer_running = 0; 3661 + dl_se->dl_defer_armed = 0; 3700 3662 3701 3663 #ifdef CONFIG_RT_MUTEXES 3702 3664 dl_se->pi_se = dl_se; ··· 3757 3713 unsigned long flags, cap; 3758 3714 struct dl_bw *dl_b; 3759 3715 bool overflow = 0; 3760 - u64 fair_server_bw = 0; 3716 + u64 dl_server_bw = 0; 3761 3717 3762 3718 rcu_read_lock_sched(); 3763 3719 dl_b = dl_bw_of(cpu); ··· 3790 3746 cap -= arch_scale_cpu_capacity(cpu); 3791 3747 3792 3748 /* 3793 - * cpu is going offline and NORMAL tasks will be moved away 3794 - * from it. We can thus discount dl_server bandwidth 3795 - * contribution as it won't need to be servicing tasks after 3796 - * the cpu is off. 3749 + * cpu is going offline and NORMAL and EXT tasks will be 3750 + * moved away from it. We can thus discount dl_server 3751 + * bandwidth contribution as it won't need to be servicing 3752 + * tasks after the cpu is off. 3797 3753 */ 3798 - if (cpu_rq(cpu)->fair_server.dl_server) 3799 - fair_server_bw = cpu_rq(cpu)->fair_server.dl_bw; 3754 + dl_server_bw = dl_server_read_bw(cpu); 3800 3755 3801 3756 /* 3802 3757 * Not much to check if no DEADLINE bandwidth is present. 3803 3758 * dl_servers we can discount, as tasks will be moved out the 3804 3759 * offlined CPUs anyway. 
3805 3760 */ 3806 - if (dl_b->total_bw - fair_server_bw > 0) { 3761 + if (dl_b->total_bw - dl_server_bw > 0) { 3807 3762 /* 3808 3763 * Leaving at least one CPU for DEADLINE tasks seems a 3809 3764 * wise thing to do. As said above, cpu is not offline 3810 3765 * yet, so account for that. 3811 3766 */ 3812 3767 if (dl_bw_cpus(cpu) - 1) 3813 - overflow = __dl_overflow(dl_b, cap, fair_server_bw, 0); 3768 + overflow = __dl_overflow(dl_b, cap, dl_server_bw, 0); 3814 3769 else 3815 3770 overflow = 1; 3816 3771 }
+146 -41
kernel/sched/debug.c
··· 172 172 static ssize_t sched_scaling_write(struct file *filp, const char __user *ubuf, 173 173 size_t cnt, loff_t *ppos) 174 174 { 175 - char buf[16]; 176 175 unsigned int scaling; 176 + int ret; 177 177 178 - if (cnt > 15) 179 - cnt = 15; 180 - 181 - if (copy_from_user(&buf, ubuf, cnt)) 182 - return -EFAULT; 183 - buf[cnt] = '\0'; 184 - 185 - if (kstrtouint(buf, 10, &scaling)) 186 - return -EINVAL; 178 + ret = kstrtouint_from_user(ubuf, cnt, 10, &scaling); 179 + if (ret) 180 + return ret; 187 181 188 182 if (scaling >= SCHED_TUNABLESCALING_END) 189 183 return -EINVAL; ··· 237 243 238 244 static int sched_dynamic_show(struct seq_file *m, void *v) 239 245 { 240 - int i = IS_ENABLED(CONFIG_PREEMPT_RT) * 2; 246 + int i = (IS_ENABLED(CONFIG_PREEMPT_RT) || IS_ENABLED(CONFIG_ARCH_HAS_PREEMPT_LAZY)) * 2; 241 247 int j; 242 248 243 249 /* Count entries in NULL terminated preempt_modes */ ··· 330 336 DL_PERIOD, 331 337 }; 332 338 333 - static unsigned long fair_server_period_max = (1UL << 22) * NSEC_PER_USEC; /* ~4 seconds */ 334 - static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC; /* 100 us */ 339 + static unsigned long dl_server_period_max = (1UL << 22) * NSEC_PER_USEC; /* ~4 seconds */ 340 + static unsigned long dl_server_period_min = (100) * NSEC_PER_USEC; /* 100 us */ 335 341 336 - static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubuf, 337 - size_t cnt, loff_t *ppos, enum dl_param param) 342 + static ssize_t sched_server_write_common(struct file *filp, const char __user *ubuf, 343 + size_t cnt, loff_t *ppos, enum dl_param param, 344 + void *server) 338 345 { 339 346 long cpu = (long) ((struct seq_file *) filp->private_data)->private; 347 + struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server; 348 + u64 old_runtime, runtime, period; 340 349 struct rq *rq = cpu_rq(cpu); 341 - u64 runtime, period; 350 + int retval = 0; 342 351 size_t err; 343 - int retval; 344 352 u64 value; 345 353 346 354 err = kstrtoull_from_user(ubuf, cnt, 10, &value); ··· 350 354 return err; 351 355 352 356 scoped_guard (rq_lock_irqsave, rq) { 353 - runtime = rq->fair_server.dl_runtime; 354 - period = rq->fair_server.dl_period; 357 + old_runtime = runtime = dl_se->dl_runtime; 358 + period = dl_se->dl_period; 355 359 356 360 switch (param) { 357 361 case DL_RUNTIME: ··· 367 371 } 368 372 369 373 if (runtime > period || 370 - period > fair_server_period_max || 371 - period < fair_server_period_min) { 374 + period > dl_server_period_max || 375 + period < dl_server_period_min) { 372 376 return -EINVAL; 373 377 } 374 378 375 379 update_rq_clock(rq); 376 - dl_server_stop(&rq->fair_server); 380 + dl_server_stop(dl_se); 381 + retval = dl_server_apply_params(dl_se, runtime, period, 0); 382 + dl_server_start(dl_se); 377 383 378 - retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0); 379 - if (retval) 380 - cnt = retval; 384 + if (retval < 0) 385 + return retval; 386 + } 381 387 382 - if (!runtime) 383 - printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n", 384 - cpu_of(rq)); 385 - 386 - if (rq->cfs.h_nr_queued) 387 - dl_server_start(&rq->fair_server); 388 + if (!!old_runtime ^ !!runtime) { 389 + pr_info("%s server %sabled on CPU %d%s.\n", 390 + server == &rq->fair_server ? "Fair" : "Ext", 391 + runtime ? "en" : "dis", 392 + cpu_of(rq), 393 + runtime ? 
"" : ", system may malfunction due to starvation"); 388 394 } 389 395 390 396 *ppos += cnt; 391 397 return cnt; 392 398 } 393 399 394 - static size_t sched_fair_server_show(struct seq_file *m, void *v, enum dl_param param) 400 + static size_t sched_server_show_common(struct seq_file *m, void *v, enum dl_param param, 401 + void *server) 395 402 { 396 - unsigned long cpu = (unsigned long) m->private; 397 - struct rq *rq = cpu_rq(cpu); 403 + struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server; 398 404 u64 value; 399 405 400 406 switch (param) { 401 407 case DL_RUNTIME: 402 - value = rq->fair_server.dl_runtime; 408 + value = dl_se->dl_runtime; 403 409 break; 404 410 case DL_PERIOD: 405 - value = rq->fair_server.dl_period; 411 + value = dl_se->dl_period; 406 412 break; 407 413 } 408 414 409 415 seq_printf(m, "%llu\n", value); 410 416 return 0; 411 - 412 417 } 413 418 414 419 static ssize_t 415 420 sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf, 416 421 size_t cnt, loff_t *ppos) 417 422 { 418 - return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_RUNTIME); 423 + long cpu = (long) ((struct seq_file *) filp->private_data)->private; 424 + struct rq *rq = cpu_rq(cpu); 425 + 426 + return sched_server_write_common(filp, ubuf, cnt, ppos, DL_RUNTIME, 427 + &rq->fair_server); 419 428 } 420 429 421 430 static int sched_fair_server_runtime_show(struct seq_file *m, void *v) 422 431 { 423 - return sched_fair_server_show(m, v, DL_RUNTIME); 432 + unsigned long cpu = (unsigned long) m->private; 433 + struct rq *rq = cpu_rq(cpu); 434 + 435 + return sched_server_show_common(m, v, DL_RUNTIME, &rq->fair_server); 424 436 } 425 437 426 438 static int sched_fair_server_runtime_open(struct inode *inode, struct file *filp) ··· 444 440 .release = single_release, 445 441 }; 446 442 443 + #ifdef CONFIG_SCHED_CLASS_EXT 444 + static ssize_t 445 + sched_ext_server_runtime_write(struct file *filp, const char __user *ubuf, 446 + size_t cnt, loff_t *ppos) 447 + { 448 + long cpu = (long) ((struct seq_file *) filp->private_data)->private; 449 + struct rq *rq = cpu_rq(cpu); 450 + 451 + return sched_server_write_common(filp, ubuf, cnt, ppos, DL_RUNTIME, 452 + &rq->ext_server); 453 + } 454 + 455 + static int sched_ext_server_runtime_show(struct seq_file *m, void *v) 456 + { 457 + unsigned long cpu = (unsigned long) m->private; 458 + struct rq *rq = cpu_rq(cpu); 459 + 460 + return sched_server_show_common(m, v, DL_RUNTIME, &rq->ext_server); 461 + } 462 + 463 + static int sched_ext_server_runtime_open(struct inode *inode, struct file *filp) 464 + { 465 + return single_open(filp, sched_ext_server_runtime_show, inode->i_private); 466 + } 467 + 468 + static const struct file_operations ext_server_runtime_fops = { 469 + .open = sched_ext_server_runtime_open, 470 + .write = sched_ext_server_runtime_write, 471 + .read = seq_read, 472 + .llseek = seq_lseek, 473 + .release = single_release, 474 + }; 475 + #endif /* CONFIG_SCHED_CLASS_EXT */ 476 + 447 477 static ssize_t 448 478 sched_fair_server_period_write(struct file *filp, const char __user *ubuf, 449 479 size_t cnt, loff_t *ppos) 450 480 { 451 - return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_PERIOD); 481 + long cpu = (long) ((struct seq_file *) filp->private_data)->private; 482 + struct rq *rq = cpu_rq(cpu); 483 + 484 + return sched_server_write_common(filp, ubuf, cnt, ppos, DL_PERIOD, 485 + &rq->fair_server); 452 486 } 453 487 454 488 static int sched_fair_server_period_show(struct seq_file *m, void *v) 455 489 { 456 - return 
sched_fair_server_show(m, v, DL_PERIOD); 490 + unsigned long cpu = (unsigned long) m->private; 491 + struct rq *rq = cpu_rq(cpu); 492 + 493 + return sched_server_show_common(m, v, DL_PERIOD, &rq->fair_server); 457 494 } 458 495 459 496 static int sched_fair_server_period_open(struct inode *inode, struct file *filp) ··· 509 464 .llseek = seq_lseek, 510 465 .release = single_release, 511 466 }; 467 + 468 + #ifdef CONFIG_SCHED_CLASS_EXT 469 + static ssize_t 470 + sched_ext_server_period_write(struct file *filp, const char __user *ubuf, 471 + size_t cnt, loff_t *ppos) 472 + { 473 + long cpu = (long) ((struct seq_file *) filp->private_data)->private; 474 + struct rq *rq = cpu_rq(cpu); 475 + 476 + return sched_server_write_common(filp, ubuf, cnt, ppos, DL_PERIOD, 477 + &rq->ext_server); 478 + } 479 + 480 + static int sched_ext_server_period_show(struct seq_file *m, void *v) 481 + { 482 + unsigned long cpu = (unsigned long) m->private; 483 + struct rq *rq = cpu_rq(cpu); 484 + 485 + return sched_server_show_common(m, v, DL_PERIOD, &rq->ext_server); 486 + } 487 + 488 + static int sched_ext_server_period_open(struct inode *inode, struct file *filp) 489 + { 490 + return single_open(filp, sched_ext_server_period_show, inode->i_private); 491 + } 492 + 493 + static const struct file_operations ext_server_period_fops = { 494 + .open = sched_ext_server_period_open, 495 + .write = sched_ext_server_period_write, 496 + .read = seq_read, 497 + .llseek = seq_lseek, 498 + .release = single_release, 499 + }; 500 + #endif /* CONFIG_SCHED_CLASS_EXT */ 512 501 513 502 static struct dentry *debugfs_sched; 514 503 ··· 566 487 debugfs_create_file("period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops); 567 488 } 568 489 } 490 + 491 + #ifdef CONFIG_SCHED_CLASS_EXT 492 + static void debugfs_ext_server_init(void) 493 + { 494 + struct dentry *d_ext; 495 + unsigned long cpu; 496 + 497 + d_ext = debugfs_create_dir("ext_server", debugfs_sched); 498 + if (!d_ext) 499 + return; 500 + 501 + for_each_possible_cpu(cpu) { 502 + struct dentry *d_cpu; 503 + char buf[32]; 504 + 505 + snprintf(buf, sizeof(buf), "cpu%lu", cpu); 506 + d_cpu = debugfs_create_dir(buf, d_ext); 507 + 508 + debugfs_create_file("runtime", 0644, d_cpu, (void *) cpu, &ext_server_runtime_fops); 509 + debugfs_create_file("period", 0644, d_cpu, (void *) cpu, &ext_server_period_fops); 510 + } 511 + } 512 + #endif /* CONFIG_SCHED_CLASS_EXT */ 569 513 570 514 static __init int sched_init_debug(void) 571 515 { ··· 628 526 debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops); 629 527 630 528 debugfs_fair_server_init(); 529 + #ifdef CONFIG_SCHED_CLASS_EXT 530 + debugfs_ext_server_init(); 531 + #endif 631 532 632 533 return 0; 633 534 }
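The debug.c refactoring above generalizes the per-CPU fair_server debugfs knobs and adds a matching ext_server directory, so the sched_ext deadline server gets its own runtime/period files. A minimal sketch for reading one of the new knobs, assuming debugfs is mounted at /sys/kernel/debug and the layout matches debugfs_ext_server_init() (values are nanoseconds, with the period bounded by dl_server_period_min/max above):

#include <stdio.h>

int main(void)
{
	const char *path = "/sys/kernel/debug/sched/ext_server/cpu0/runtime";
	unsigned long long runtime;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%llu", &runtime) == 1)
		printf("cpu0 ext_server runtime: %llu ns\n", runtime);
	fclose(f);
	return 0;
}

Writing 0 to the runtime file disables the server on that CPU, which now logs the "system may malfunction due to starvation" notice shown in sched_server_write_common() above.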
+37 -5
kernel/sched/ext.c
··· 959 959 if (!curr->scx.slice) 960 960 touch_core_sched(rq, curr); 961 961 } 962 + 963 + dl_server_update(&rq->ext_server, delta_exec); 962 964 } 963 965 964 966 static bool scx_dsq_priq_less(struct rb_node *node_a, ··· 1503 1501 1504 1502 if (enq_flags & SCX_ENQ_WAKEUP) 1505 1503 touch_core_sched(rq, p); 1504 + 1505 + /* Start dl_server if this is the first task being enqueued */ 1506 + if (rq->scx.nr_running == 1) 1507 + dl_server_start(&rq->ext_server); 1506 1508 1507 1509 do_enqueue_task(rq, p, enq_flags, sticky_cpu); 1508 1510 out: ··· 2460 2454 /* see kick_cpus_irq_workfn() */ 2461 2455 smp_store_release(&rq->scx.kick_sync, rq->scx.kick_sync + 1); 2462 2456 2463 - rq_modified_clear(rq); 2457 + rq->next_class = &ext_sched_class; 2464 2458 2465 2459 rq_unpin_lock(rq, rf); 2466 2460 balance_one(rq, prev); ··· 2475 2469 * If @force_scx is true, always try to pick a SCHED_EXT task, 2476 2470 * regardless of any higher-priority sched classes activity. 2477 2471 */ 2478 - if (!force_scx && rq_modified_above(rq, &ext_sched_class)) 2472 + if (!force_scx && sched_class_above(rq->next_class, &ext_sched_class)) 2479 2473 return RETRY_TASK; 2480 2474 2481 2475 keep_prev = rq->scx.flags & SCX_RQ_BAL_KEEP; ··· 2517 2511 static struct task_struct *pick_task_scx(struct rq *rq, struct rq_flags *rf) 2518 2512 { 2519 2513 return do_pick_task_scx(rq, rf, false); 2514 + } 2515 + 2516 + /* 2517 + * Select the next task to run from the ext scheduling class. 2518 + * 2519 + * Use do_pick_task_scx() directly with @force_scx enabled, since the 2520 + * dl_server must always select a sched_ext task. 2521 + */ 2522 + static struct task_struct * 2523 + ext_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf) 2524 + { 2525 + if (!scx_enabled()) 2526 + return NULL; 2527 + 2528 + return do_pick_task_scx(dl_se->rq, rf, true); 2529 + } 2530 + 2531 + /* 2532 + * Initialize the ext server deadline entity. 2533 + */ 2534 + void ext_server_init(struct rq *rq) 2535 + { 2536 + struct sched_dl_entity *dl_se = &rq->ext_server; 2537 + 2538 + init_dl_entity(dl_se); 2539 + 2540 + dl_server_init(dl_se, rq, ext_server_pick_task); 2520 2541 } 2521 2542 2522 2543 #ifdef CONFIG_SCHED_CORE ··· 3174 3141 scx_disable_task(p); 3175 3142 } 3176 3143 3177 - static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {} 3144 + static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_flags) {} 3145 + 3178 3146 static void switched_to_scx(struct rq *rq, struct task_struct *p) {} 3179 3147 3180 3148 int scx_check_setscheduler(struct task_struct *p, int policy) ··· 3436 3402 * their current sched_class. Call them directly from sched core instead. 3437 3403 */ 3438 3404 DEFINE_SCHED_CLASS(ext) = { 3439 - .queue_mask = 1, 3440 - 3441 3405 .enqueue_task = enqueue_task_scx, 3442 3406 .dequeue_task = dequeue_task_scx, 3443 3407 .yield_task = yield_task_scx,
+208 -203
kernel/sched/fair.c
··· 524 524 * Scheduling class tree data structure manipulation methods: 525 525 */ 526 526 527 + extern void __BUILD_BUG_vruntime_cmp(void); 528 + 529 + /* Use __builtin_strcmp() because of __HAVE_ARCH_STRCMP: */ 530 + 531 + #define vruntime_cmp(A, CMP_STR, B) ({ \ 532 + int __res = 0; \ 533 + \ 534 + if (!__builtin_strcmp(CMP_STR, "<")) { \ 535 + __res = ((s64)((A)-(B)) < 0); \ 536 + } else if (!__builtin_strcmp(CMP_STR, "<=")) { \ 537 + __res = ((s64)((A)-(B)) <= 0); \ 538 + } else if (!__builtin_strcmp(CMP_STR, ">")) { \ 539 + __res = ((s64)((A)-(B)) > 0); \ 540 + } else if (!__builtin_strcmp(CMP_STR, ">=")) { \ 541 + __res = ((s64)((A)-(B)) >= 0); \ 542 + } else { \ 543 + /* Unknown operator throws linker error: */ \ 544 + __BUILD_BUG_vruntime_cmp(); \ 545 + } \ 546 + \ 547 + __res; \ 548 + }) 549 + 550 + extern void __BUILD_BUG_vruntime_op(void); 551 + 552 + #define vruntime_op(A, OP_STR, B) ({ \ 553 + s64 __res = 0; \ 554 + \ 555 + if (!__builtin_strcmp(OP_STR, "-")) { \ 556 + __res = (s64)((A)-(B)); \ 557 + } else { \ 558 + /* Unknown operator throws linker error: */ \ 559 + __BUILD_BUG_vruntime_op(); \ 560 + } \ 561 + \ 562 + __res; \ 563 + }) 564 + 565 + 527 566 static inline __maybe_unused u64 max_vruntime(u64 max_vruntime, u64 vruntime) 528 567 { 529 - s64 delta = (s64)(vruntime - max_vruntime); 530 - if (delta > 0) 568 + if (vruntime_cmp(vruntime, ">", max_vruntime)) 531 569 max_vruntime = vruntime; 532 570 533 571 return max_vruntime; ··· 573 535 574 536 static inline __maybe_unused u64 min_vruntime(u64 min_vruntime, u64 vruntime) 575 537 { 576 - s64 delta = (s64)(vruntime - min_vruntime); 577 - if (delta < 0) 538 + if (vruntime_cmp(vruntime, "<", min_vruntime)) 578 539 min_vruntime = vruntime; 579 540 580 541 return min_vruntime; ··· 586 549 * Tiebreak on vruntime seems unnecessary since it can 587 550 * hardly happen. 588 551 */ 589 - return (s64)(a->deadline - b->deadline) < 0; 552 + return vruntime_cmp(a->deadline, "<", b->deadline); 590 553 } 591 554 592 555 static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se) 593 556 { 594 - return (s64)(se->vruntime - cfs_rq->zero_vruntime); 557 + return vruntime_op(se->vruntime, "-", cfs_rq->zero_vruntime); 595 558 } 596 559 597 560 #define __node_2_se(node) \ ··· 613 576 * 614 577 * \Sum lag_i = 0 615 578 * \Sum w_i * (V - v_i) = 0 616 - * \Sum w_i * V - w_i * v_i = 0 579 + * \Sum (w_i * V - w_i * v_i) = 0 617 580 * 618 581 * From which we can solve an expression for V in v_i (which we have in 619 582 * se->vruntime): ··· 644 607 * Which we track using: 645 608 * 646 609 * v0 := cfs_rq->zero_vruntime 647 - * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime 648 - * \Sum w_i := cfs_rq->avg_load 610 + * \Sum (v_i - v0) * w_i := cfs_rq->sum_w_vruntime 611 + * \Sum w_i := cfs_rq->sum_weight 649 612 * 650 613 * Since zero_vruntime closely tracks the per-task service, these 651 - * deltas: (v_i - v), will be in the order of the maximal (virtual) lag 614 + * deltas: (v_i - v0), will be in the order of the maximal (virtual) lag 652 615 * induced in the system due to quantisation. 653 616 * 654 617 * Also, we use scale_load_down() to reduce the size. ··· 656 619 * As measured, the max (key * weight) value was ~44 bits for a kernel build. 
657 620 */ 658 621 static void 659 - avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se) 622 + sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se) 660 623 { 661 624 unsigned long weight = scale_load_down(se->load.weight); 662 625 s64 key = entity_key(cfs_rq, se); 663 626 664 - cfs_rq->avg_vruntime += key * weight; 665 - cfs_rq->avg_load += weight; 627 + cfs_rq->sum_w_vruntime += key * weight; 628 + cfs_rq->sum_weight += weight; 666 629 } 667 630 668 631 static void 669 - avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se) 632 + sum_w_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se) 670 633 { 671 634 unsigned long weight = scale_load_down(se->load.weight); 672 635 s64 key = entity_key(cfs_rq, se); 673 636 674 - cfs_rq->avg_vruntime -= key * weight; 675 - cfs_rq->avg_load -= weight; 637 + cfs_rq->sum_w_vruntime -= key * weight; 638 + cfs_rq->sum_weight -= weight; 676 639 } 677 640 678 641 static inline 679 - void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta) 642 + void sum_w_vruntime_update(struct cfs_rq *cfs_rq, s64 delta) 680 643 { 681 644 /* 682 - * v' = v + d ==> avg_vruntime' = avg_runtime - d*avg_load 645 + * v' = v + d ==> sum_w_vruntime' = sum_runtime - d*sum_weight 683 646 */ 684 - cfs_rq->avg_vruntime -= cfs_rq->avg_load * delta; 647 + cfs_rq->sum_w_vruntime -= cfs_rq->sum_weight * delta; 685 648 } 686 649 687 650 /* ··· 691 654 u64 avg_vruntime(struct cfs_rq *cfs_rq) 692 655 { 693 656 struct sched_entity *curr = cfs_rq->curr; 694 - s64 avg = cfs_rq->avg_vruntime; 695 - long load = cfs_rq->avg_load; 657 + s64 avg = cfs_rq->sum_w_vruntime; 658 + long load = cfs_rq->sum_weight; 696 659 697 660 if (curr && curr->on_rq) { 698 661 unsigned long weight = scale_load_down(curr->load.weight); ··· 759 722 static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime) 760 723 { 761 724 struct sched_entity *curr = cfs_rq->curr; 762 - s64 avg = cfs_rq->avg_vruntime; 763 - long load = cfs_rq->avg_load; 725 + s64 avg = cfs_rq->sum_w_vruntime; 726 + long load = cfs_rq->sum_weight; 764 727 765 728 if (curr && curr->on_rq) { 766 729 unsigned long weight = scale_load_down(curr->load.weight); ··· 769 732 load += weight; 770 733 } 771 734 772 - return avg >= (s64)(vruntime - cfs_rq->zero_vruntime) * load; 735 + return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load; 773 736 } 774 737 775 738 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se) ··· 780 743 static void update_zero_vruntime(struct cfs_rq *cfs_rq) 781 744 { 782 745 u64 vruntime = avg_vruntime(cfs_rq); 783 - s64 delta = (s64)(vruntime - cfs_rq->zero_vruntime); 746 + s64 delta = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime); 784 747 785 - avg_vruntime_update(cfs_rq, delta); 748 + sum_w_vruntime_update(cfs_rq, delta); 786 749 787 750 cfs_rq->zero_vruntime = vruntime; 788 751 } ··· 807 770 return entity_before(__node_2_se(a), __node_2_se(b)); 808 771 } 809 772 810 - #define vruntime_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; }) 811 - 812 773 static inline void __min_vruntime_update(struct sched_entity *se, struct rb_node *node) 813 774 { 814 775 if (node) { 815 776 struct sched_entity *rse = __node_2_se(node); 816 - if (vruntime_gt(min_vruntime, se, rse)) 777 + 778 + if (vruntime_cmp(se->min_vruntime, ">", rse->min_vruntime)) 817 779 se->min_vruntime = rse->min_vruntime; 818 780 } 819 781 } ··· 855 819 */ 856 820 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) 857 821 { 858 - 
avg_vruntime_add(cfs_rq, se); 822 + sum_w_vruntime_add(cfs_rq, se); 859 823 update_zero_vruntime(cfs_rq); 860 824 se->min_vruntime = se->vruntime; 861 825 se->min_slice = se->slice; ··· 867 831 { 868 832 rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline, 869 833 &min_vruntime_cb); 870 - avg_vruntime_sub(cfs_rq, se); 834 + sum_w_vruntime_sub(cfs_rq, se); 871 835 update_zero_vruntime(cfs_rq); 872 836 } 873 837 ··· 923 887 924 888 static inline bool protect_slice(struct sched_entity *se) 925 889 { 926 - return ((s64)(se->vprot - se->vruntime) > 0); 890 + return vruntime_cmp(se->vruntime, "<", se->vprot); 927 891 } 928 892 929 893 static inline void cancel_protect_slice(struct sched_entity *se) ··· 1060 1024 */ 1061 1025 static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se) 1062 1026 { 1063 - if ((s64)(se->vruntime - se->deadline) < 0) 1027 + if (vruntime_cmp(se->vruntime, "<", se->deadline)) 1064 1028 return false; 1065 1029 1066 1030 /* ··· 1549 1513 1550 1514 /* Scale the maximum scan period with the amount of shared memory. */ 1551 1515 rcu_read_lock(); 1552 - ng = rcu_dereference(p->numa_group); 1516 + ng = rcu_dereference_all(p->numa_group); 1553 1517 if (ng) { 1554 1518 unsigned long shared = group_faults_shared(ng); 1555 1519 unsigned long private = group_faults_priv(ng); ··· 1616 1580 pid_t gid = 0; 1617 1581 1618 1582 rcu_read_lock(); 1619 - ng = rcu_dereference(p->numa_group); 1583 + ng = rcu_dereference_all(p->numa_group); 1620 1584 if (ng) 1621 1585 gid = ng->gid; 1622 1586 rcu_read_unlock(); ··· 2275 2239 return false; 2276 2240 2277 2241 rcu_read_lock(); 2278 - cur = rcu_dereference(dst_rq->curr); 2242 + cur = rcu_dereference_all(dst_rq->curr); 2279 2243 if (cur && ((cur->flags & (PF_EXITING | PF_KTHREAD)) || 2280 2244 !cur->mm)) 2281 2245 cur = NULL; ··· 2320 2284 * If dst and source tasks are in the same NUMA group, or not 2321 2285 * in any group then look only at task weights. 2322 2286 */ 2323 - cur_ng = rcu_dereference(cur->numa_group); 2287 + cur_ng = rcu_dereference_all(cur->numa_group); 2324 2288 if (cur_ng == p_ng) { 2325 2289 /* 2326 2290 * Do not swap within a group or between tasks that have ··· 2494 2458 maymove = !load_too_imbalanced(src_load, dst_load, env); 2495 2459 } 2496 2460 2497 - for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) { 2498 - /* Skip this CPU if the source task cannot migrate */ 2499 - if (!cpumask_test_cpu(cpu, env->p->cpus_ptr)) 2500 - continue; 2501 - 2461 + /* Skip CPUs if the source task cannot migrate */ 2462 + for_each_cpu_and(cpu, cpumask_of_node(env->dst_nid), env->p->cpus_ptr) { 2502 2463 env->dst_cpu = cpu; 2503 2464 if (task_numa_compare(env, taskimp, groupimp, maymove)) 2504 2465 break; ··· 2532 2499 * to satisfy here. 
2533 2500 */ 2534 2501 rcu_read_lock(); 2535 - sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); 2502 + sd = rcu_dereference_all(per_cpu(sd_numa, env.src_cpu)); 2536 2503 if (sd) { 2537 2504 env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; 2538 2505 env.imb_numa_nr = sd->imb_numa_nr; ··· 3056 3023 if (!cpupid_match_pid(tsk, cpupid)) 3057 3024 goto no_join; 3058 3025 3059 - grp = rcu_dereference(tsk->numa_group); 3026 + grp = rcu_dereference_all(tsk->numa_group); 3060 3027 if (!grp) 3061 3028 goto no_join; 3062 3029 ··· 3727 3694 */ 3728 3695 #define add_positive(_ptr, _val) do { \ 3729 3696 typeof(_ptr) ptr = (_ptr); \ 3730 - typeof(_val) val = (_val); \ 3697 + __signed_scalar_typeof(*ptr) val = (_val); \ 3731 3698 typeof(*ptr) res, var = READ_ONCE(*ptr); \ 3732 3699 \ 3733 3700 res = var + val; \ ··· 3736 3703 res = 0; \ 3737 3704 \ 3738 3705 WRITE_ONCE(*ptr, res); \ 3739 - } while (0) 3740 - 3741 - /* 3742 - * Unsigned subtract and clamp on underflow. 3743 - * 3744 - * Explicitly do a load-store to ensure the intermediate value never hits 3745 - * memory. This allows lockless observations without ever seeing the negative 3746 - * values. 3747 - */ 3748 - #define sub_positive(_ptr, _val) do { \ 3749 - typeof(_ptr) ptr = (_ptr); \ 3750 - typeof(*ptr) val = (_val); \ 3751 - typeof(*ptr) res, var = READ_ONCE(*ptr); \ 3752 - res = var - val; \ 3753 - if (res > var) \ 3754 - res = 0; \ 3755 - WRITE_ONCE(*ptr, res); \ 3756 3706 } while (0) 3757 3707 3758 3708 /* ··· 3749 3733 *ptr -= min_t(typeof(*ptr), *ptr, _val); \ 3750 3734 } while (0) 3751 3735 3736 + 3737 + /* 3738 + * Because of rounding, se->util_sum might ends up being +1 more than 3739 + * cfs->util_sum. Although this is not a problem by itself, detaching 3740 + * a lot of tasks with the rounding problem between 2 updates of 3741 + * util_avg (~1ms) can make cfs->util_sum becoming null whereas 3742 + * cfs_util_avg is not. 3743 + * 3744 + * Check that util_sum is still above its lower bound for the new 3745 + * util_avg. 
Given that period_contrib might have moved since the last 3746 + * sync, we are only sure that util_sum must be above or equal to 3747 + * util_avg * minimum possible divider 3748 + */ 3749 + #define __update_sa(sa, name, delta_avg, delta_sum) do { \ 3750 + add_positive(&(sa)->name##_avg, delta_avg); \ 3751 + add_positive(&(sa)->name##_sum, delta_sum); \ 3752 + (sa)->name##_sum = max_t(typeof((sa)->name##_sum), \ 3753 + (sa)->name##_sum, \ 3754 + (sa)->name##_avg * PELT_MIN_DIVIDER); \ 3755 + } while (0) 3756 + 3752 3757 static inline void 3753 3758 enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) 3754 3759 { 3755 - cfs_rq->avg.load_avg += se->avg.load_avg; 3756 - cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum; 3760 + __update_sa(&cfs_rq->avg, load, se->avg.load_avg, 3761 + se_weight(se) * se->avg.load_sum); 3757 3762 } 3758 3763 3759 3764 static inline void 3760 3765 dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) 3761 3766 { 3762 - sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg); 3763 - sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum); 3764 - /* See update_cfs_rq_load_avg() */ 3765 - cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum, 3766 - cfs_rq->avg.load_avg * PELT_MIN_DIVIDER); 3767 + __update_sa(&cfs_rq->avg, load, -se->avg.load_avg, 3768 + se_weight(se) * -se->avg.load_sum); 3767 3769 } 3768 3770 3769 3771 static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags); ··· 4277 4243 */ 4278 4244 divider = get_pelt_divider(&cfs_rq->avg); 4279 4245 4280 - 4281 4246 /* Set new sched_entity's utilization */ 4282 4247 se->avg.util_avg = gcfs_rq->avg.util_avg; 4283 4248 new_sum = se->avg.util_avg * divider; ··· 4284 4251 se->avg.util_sum = new_sum; 4285 4252 4286 4253 /* Update parent cfs_rq utilization */ 4287 - add_positive(&cfs_rq->avg.util_avg, delta_avg); 4288 - add_positive(&cfs_rq->avg.util_sum, delta_sum); 4289 - 4290 - /* See update_cfs_rq_load_avg() */ 4291 - cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum, 4292 - cfs_rq->avg.util_avg * PELT_MIN_DIVIDER); 4254 + __update_sa(&cfs_rq->avg, util, delta_avg, delta_sum); 4293 4255 } 4294 4256 4295 4257 static inline void ··· 4310 4282 se->avg.runnable_sum = new_sum; 4311 4283 4312 4284 /* Update parent cfs_rq runnable */ 4313 - add_positive(&cfs_rq->avg.runnable_avg, delta_avg); 4314 - add_positive(&cfs_rq->avg.runnable_sum, delta_sum); 4315 - /* See update_cfs_rq_load_avg() */ 4316 - cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum, 4317 - cfs_rq->avg.runnable_avg * PELT_MIN_DIVIDER); 4285 + __update_sa(&cfs_rq->avg, runnable, delta_avg, delta_sum); 4318 4286 } 4319 4287 4320 4288 static inline void ··· 4374 4350 4375 4351 se->avg.load_sum = runnable_sum; 4376 4352 se->avg.load_avg = load_avg; 4377 - add_positive(&cfs_rq->avg.load_avg, delta_avg); 4378 - add_positive(&cfs_rq->avg.load_sum, delta_sum); 4379 - /* See update_cfs_rq_load_avg() */ 4380 - cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum, 4381 - cfs_rq->avg.load_avg * PELT_MIN_DIVIDER); 4353 + __update_sa(&cfs_rq->avg, load, delta_avg, delta_sum); 4382 4354 } 4383 4355 4384 4356 static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) ··· 4471 4451 rq = rq_of(cfs_rq); 4472 4452 4473 4453 rcu_read_lock(); 4474 - is_idle = is_idle_task(rcu_dereference(rq->curr)); 4454 + is_idle = is_idle_task(rcu_dereference_all(rq->curr)); 4475 4455 rcu_read_unlock(); 4476 4456 4477 4457 /* ··· 4573 4553 raw_spin_unlock(&cfs_rq->removed.lock); 
4574 4554 4575 4555 r = removed_load; 4576 - sub_positive(&sa->load_avg, r); 4577 - sub_positive(&sa->load_sum, r * divider); 4578 - /* See sa->util_sum below */ 4579 - sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * PELT_MIN_DIVIDER); 4556 + __update_sa(sa, load, -r, -r*divider); 4580 4557 4581 4558 r = removed_util; 4582 - sub_positive(&sa->util_avg, r); 4583 - sub_positive(&sa->util_sum, r * divider); 4584 - /* 4585 - * Because of rounding, se->util_sum might ends up being +1 more than 4586 - * cfs->util_sum. Although this is not a problem by itself, detaching 4587 - * a lot of tasks with the rounding problem between 2 updates of 4588 - * util_avg (~1ms) can make cfs->util_sum becoming null whereas 4589 - * cfs_util_avg is not. 4590 - * Check that util_sum is still above its lower bound for the new 4591 - * util_avg. Given that period_contrib might have moved since the last 4592 - * sync, we are only sure that util_sum must be above or equal to 4593 - * util_avg * minimum possible divider 4594 - */ 4595 - sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * PELT_MIN_DIVIDER); 4559 + __update_sa(sa, util, -r, -r*divider); 4596 4560 4597 4561 r = removed_runnable; 4598 - sub_positive(&sa->runnable_avg, r); 4599 - sub_positive(&sa->runnable_sum, r * divider); 4600 - /* See sa->util_sum above */ 4601 - sa->runnable_sum = max_t(u32, sa->runnable_sum, 4602 - sa->runnable_avg * PELT_MIN_DIVIDER); 4562 + __update_sa(sa, runnable, -r, -r*divider); 4603 4563 4604 4564 /* 4605 4565 * removed_runnable is the unweighted version of removed_load so we ··· 4664 4664 static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) 4665 4665 { 4666 4666 dequeue_load_avg(cfs_rq, se); 4667 - sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg); 4668 - sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum); 4669 - /* See update_cfs_rq_load_avg() */ 4670 - cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum, 4671 - cfs_rq->avg.util_avg * PELT_MIN_DIVIDER); 4672 - 4673 - sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg); 4674 - sub_positive(&cfs_rq->avg.runnable_sum, se->avg.runnable_sum); 4675 - /* See update_cfs_rq_load_avg() */ 4676 - cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum, 4677 - cfs_rq->avg.runnable_avg * PELT_MIN_DIVIDER); 4667 + __update_sa(&cfs_rq->avg, util, -se->avg.util_avg, -se->avg.util_sum); 4668 + __update_sa(&cfs_rq->avg, runnable, -se->avg.runnable_avg, -se->avg.runnable_sum); 4678 4669 4679 4670 add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum); 4680 4671 ··· 5168 5177 * 5169 5178 * vl_i = (W + w_i)*vl'_i / W 5170 5179 */ 5171 - load = cfs_rq->avg_load; 5180 + load = cfs_rq->sum_weight; 5172 5181 if (curr && curr->on_rq) 5173 5182 load += scale_load_down(curr->load.weight); 5174 5183 ··· 7141 7150 7142 7151 static struct { 7143 7152 cpumask_var_t idle_cpus_mask; 7144 - atomic_t nr_cpus; 7145 - int has_blocked; /* Idle CPUS has blocked load */ 7153 + int has_blocked_load; /* Idle CPUS has blocked load */ 7146 7154 int needs_update; /* Newly idle CPUs need their next_balance collated */ 7147 7155 unsigned long next_balance; /* in jiffy units */ 7148 7156 unsigned long next_blocked; /* Next update of blocked load in jiffies */ ··· 7499 7509 { 7500 7510 struct sched_domain_shared *sds; 7501 7511 7502 - sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); 7512 + sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu)); 7503 7513 if (sds) 7504 7514 WRITE_ONCE(sds->has_idle_cores, val); 7505 7515 } ··· 7508 7518 { 7509 7519 struct 
sched_domain_shared *sds; 7510 7520 7511 - sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); 7521 + sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu)); 7512 7522 if (sds) 7513 7523 return READ_ONCE(sds->has_idle_cores); 7514 7524 ··· 7637 7647 cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); 7638 7648 7639 7649 if (sched_feat(SIS_UTIL)) { 7640 - sd_share = rcu_dereference(per_cpu(sd_llc_shared, target)); 7650 + sd_share = rcu_dereference_all(per_cpu(sd_llc_shared, target)); 7641 7651 if (sd_share) { 7642 7652 /* because !--nr is the condition to stop scan */ 7643 7653 nr = READ_ONCE(sd_share->nr_idle_scan) + 1; ··· 7843 7853 * sd_asym_cpucapacity rather than sd_llc. 7844 7854 */ 7845 7855 if (sched_asym_cpucap_active()) { 7846 - sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target)); 7856 + sd = rcu_dereference_all(per_cpu(sd_asym_cpucapacity, target)); 7847 7857 /* 7848 7858 * On an asymmetric CPU capacity system where an exclusive 7849 7859 * cpuset defines a symmetric island (i.e. one unique ··· 7858 7868 } 7859 7869 } 7860 7870 7861 - sd = rcu_dereference(per_cpu(sd_llc, target)); 7871 + sd = rcu_dereference_all(per_cpu(sd_llc, target)); 7862 7872 if (!sd) 7863 7873 return target; 7864 7874 ··· 8327 8337 struct energy_env eenv; 8328 8338 8329 8339 rcu_read_lock(); 8330 - pd = rcu_dereference(rd->pd); 8340 + pd = rcu_dereference_all(rd->pd); 8331 8341 if (!pd) 8332 8342 goto unlock; 8333 8343 ··· 8335 8345 * Energy-aware wake-up happens on the lowest sched_domain starting 8336 8346 * from sd_asym_cpucapacity spanning over this_cpu and prev_cpu. 8337 8347 */ 8338 - sd = rcu_dereference(*this_cpu_ptr(&sd_asym_cpucapacity)); 8348 + sd = rcu_dereference_all(*this_cpu_ptr(&sd_asym_cpucapacity)); 8339 8349 while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd))) 8340 8350 sd = sd->parent; 8341 8351 if (!sd) ··· 8358 8368 int max_spare_cap_cpu = -1; 8359 8369 int fits, max_fits = -1; 8360 8370 8361 - cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask); 8362 - 8363 - if (cpumask_empty(cpus)) 8371 + if (!cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask)) 8364 8372 continue; 8365 8373 8366 8374 /* Account external pressure for the energy estimation */ ··· 8735 8747 /* 8736 8748 * Preempt the current task with a newly woken task if needed: 8737 8749 */ 8738 - static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags) 8750 + static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_flags) 8739 8751 { 8740 8752 enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK; 8741 8753 struct task_struct *donor = rq->donor; 8742 8754 struct sched_entity *se = &donor->se, *pse = &p->se; 8743 8755 struct cfs_rq *cfs_rq = task_cfs_rq(donor); 8744 8756 int cse_is_idle, pse_is_idle; 8757 + 8758 + /* 8759 + * XXX Getting preempted by higher class, try and find idle CPU? 
8760 + */ 8761 + if (p->sched_class != &fair_sched_class) 8762 + return; 8745 8763 8746 8764 if (unlikely(se == pse)) 8747 8765 return; ··· 9331 9337 */ 9332 9338 static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env) 9333 9339 { 9334 - struct numa_group *numa_group = rcu_dereference(p->numa_group); 9340 + struct numa_group *numa_group = rcu_dereference_all(p->numa_group); 9335 9341 unsigned long src_weight, dst_weight; 9336 9342 int src_nid, dst_nid, dist; 9337 9343 ··· 9760 9766 } 9761 9767 9762 9768 #ifdef CONFIG_NO_HZ_COMMON 9763 - static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq) 9769 + static inline bool cfs_rq_has_blocked_load(struct cfs_rq *cfs_rq) 9764 9770 { 9765 9771 if (cfs_rq->avg.load_avg) 9766 9772 return true; ··· 9793 9799 WRITE_ONCE(rq->last_blocked_load_update_tick, jiffies); 9794 9800 } 9795 9801 9796 - static inline void update_blocked_load_status(struct rq *rq, bool has_blocked) 9802 + static inline void update_has_blocked_load_status(struct rq *rq, bool has_blocked_load) 9797 9803 { 9798 - if (!has_blocked) 9804 + if (!has_blocked_load) 9799 9805 rq->has_blocked_load = 0; 9800 9806 } 9801 9807 #else /* !CONFIG_NO_HZ_COMMON: */ 9802 - static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq) { return false; } 9808 + static inline bool cfs_rq_has_blocked_load(struct cfs_rq *cfs_rq) { return false; } 9803 9809 static inline bool others_have_blocked(struct rq *rq) { return false; } 9804 9810 static inline void update_blocked_load_tick(struct rq *rq) {} 9805 - static inline void update_blocked_load_status(struct rq *rq, bool has_blocked) {} 9811 + static inline void update_has_blocked_load_status(struct rq *rq, bool has_blocked_load) {} 9806 9812 #endif /* !CONFIG_NO_HZ_COMMON */ 9807 9813 9808 9814 static bool __update_blocked_others(struct rq *rq, bool *done) ··· 9859 9865 list_del_leaf_cfs_rq(cfs_rq); 9860 9866 9861 9867 /* Don't need periodic decay once load/util_avg are null */ 9862 - if (cfs_rq_has_blocked(cfs_rq)) 9868 + if (cfs_rq_has_blocked_load(cfs_rq)) 9863 9869 *done = false; 9864 9870 } 9865 9871 ··· 9919 9925 bool decayed; 9920 9926 9921 9927 decayed = update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq); 9922 - if (cfs_rq_has_blocked(cfs_rq)) 9928 + if (cfs_rq_has_blocked_load(cfs_rq)) 9923 9929 *done = false; 9924 9930 9925 9931 return decayed; ··· 9931 9937 } 9932 9938 #endif /* !CONFIG_FAIR_GROUP_SCHED */ 9933 9939 9934 - static void sched_balance_update_blocked_averages(int cpu) 9940 + static void __sched_balance_update_blocked_averages(struct rq *rq) 9935 9941 { 9936 9942 bool decayed = false, done = true; 9937 - struct rq *rq = cpu_rq(cpu); 9938 - struct rq_flags rf; 9939 9943 9940 - rq_lock_irqsave(rq, &rf); 9941 9944 update_blocked_load_tick(rq); 9942 - update_rq_clock(rq); 9943 9945 9944 9946 decayed |= __update_blocked_others(rq, &done); 9945 9947 decayed |= __update_blocked_fair(rq, &done); 9946 9948 9947 - update_blocked_load_status(rq, !done); 9949 + update_has_blocked_load_status(rq, !done); 9948 9950 if (decayed) 9949 9951 cpufreq_update_util(rq, 0); 9950 - rq_unlock_irqrestore(rq, &rf); 9952 + } 9953 + 9954 + static void sched_balance_update_blocked_averages(int cpu) 9955 + { 9956 + struct rq *rq = cpu_rq(cpu); 9957 + 9958 + guard(rq_lock_irqsave)(rq); 9959 + update_rq_clock(rq); 9960 + __sched_balance_update_blocked_averages(rq); 9951 9961 } 9952 9962 9953 9963 /********** Helpers for sched_balance_find_src_group ************************/ ··· 10961 10963 * take care of it. 
10962 10964 */ 10963 10965 if (p->nr_cpus_allowed != NR_CPUS) { 10964 - struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask); 10965 - 10966 - cpumask_and(cpus, sched_group_span(local), p->cpus_ptr); 10967 - imb_numa_nr = min(cpumask_weight(cpus), sd->imb_numa_nr); 10966 + unsigned int w = cpumask_weight_and(p->cpus_ptr, 10967 + sched_group_span(local)); 10968 + imb_numa_nr = min(w, sd->imb_numa_nr); 10968 10969 } 10969 10970 10970 10971 imbalance = abs(local_sgs.idle_cpus - idlest_sgs.idle_cpus); ··· 11010 11013 if (env->sd->span_weight != llc_weight) 11011 11014 return; 11012 11015 11013 - sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu)); 11016 + sd_share = rcu_dereference_all(per_cpu(sd_llc_shared, env->dst_cpu)); 11014 11017 if (!sd_share) 11015 11018 return; 11016 11019 ··· 11360 11363 goto force_balance; 11361 11364 11362 11365 if (!is_rd_overutilized(env->dst_rq->rd) && 11363 - rcu_dereference(env->dst_rq->rd->pd)) 11366 + rcu_dereference_all(env->dst_rq->rd->pd)) 11364 11367 goto out_balanced; 11365 11368 11366 11369 /* ASYM feature bypasses nice load balance check */ ··· 12428 12431 */ 12429 12432 nohz_balance_exit_idle(rq); 12430 12433 12431 - /* 12432 - * None are in tickless mode and hence no need for NOHZ idle load 12433 - * balancing: 12434 - */ 12435 - if (likely(!atomic_read(&nohz.nr_cpus))) 12436 - return; 12437 - 12438 - if (READ_ONCE(nohz.has_blocked) && 12434 + if (READ_ONCE(nohz.has_blocked_load) && 12439 12435 time_after(now, READ_ONCE(nohz.next_blocked))) 12440 12436 flags = NOHZ_STATS_KICK; 12441 12437 12438 + /* 12439 + * Most of the time system is not 100% busy. i.e nohz.nr_cpus > 0 12440 + * Skip the read if time is not due. 12441 + * 12442 + * If none are in tickless mode, there maybe a narrow window 12443 + * (28 jiffies, HZ=1000) where flags maybe set and kick_ilb called. 12444 + * But idle load balancing is not done as find_new_ilb fails. 12445 + * That's very rare. So read nohz.nr_cpus only if time is due. 
12446 + */ 12442 12447 if (time_before(now, nohz.next_balance)) 12443 12448 goto out; 12449 + 12450 + /* 12451 + * None are in tickless mode and hence no need for NOHZ idle load 12452 + * balancing 12453 + */ 12454 + if (unlikely(cpumask_empty(nohz.idle_cpus_mask))) 12455 + return; 12444 12456 12445 12457 if (rq->nr_running >= 2) { 12446 12458 flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK; ··· 12458 12452 12459 12453 rcu_read_lock(); 12460 12454 12461 - sd = rcu_dereference(rq->sd); 12455 + sd = rcu_dereference_all(rq->sd); 12462 12456 if (sd) { 12463 12457 /* 12464 12458 * If there's a runnable CFS task and the current CPU has reduced ··· 12470 12464 } 12471 12465 } 12472 12466 12473 - sd = rcu_dereference(per_cpu(sd_asym_packing, cpu)); 12467 + sd = rcu_dereference_all(per_cpu(sd_asym_packing, cpu)); 12474 12468 if (sd) { 12475 12469 /* 12476 12470 * When ASYM_PACKING; see if there's a more preferred CPU ··· 12488 12482 } 12489 12483 } 12490 12484 12491 - sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, cpu)); 12485 + sd = rcu_dereference_all(per_cpu(sd_asym_cpucapacity, cpu)); 12492 12486 if (sd) { 12493 12487 /* 12494 12488 * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU ··· 12509 12503 goto unlock; 12510 12504 } 12511 12505 12512 - sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); 12506 + sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu)); 12513 12507 if (sds) { 12514 12508 /* 12515 12509 * If there is an imbalance between LLC domains (IOW we could ··· 12541 12535 struct sched_domain *sd; 12542 12536 12543 12537 rcu_read_lock(); 12544 - sd = rcu_dereference(per_cpu(sd_llc, cpu)); 12538 + sd = rcu_dereference_all(per_cpu(sd_llc, cpu)); 12545 12539 12546 12540 if (!sd || !sd->nohz_idle) 12547 12541 goto unlock; ··· 12561 12555 12562 12556 rq->nohz_tick_stopped = 0; 12563 12557 cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask); 12564 - atomic_dec(&nohz.nr_cpus); 12565 12558 12566 12559 set_cpu_sd_state_busy(rq->cpu); 12567 12560 } ··· 12570 12565 struct sched_domain *sd; 12571 12566 12572 12567 rcu_read_lock(); 12573 - sd = rcu_dereference(per_cpu(sd_llc, cpu)); 12568 + sd = rcu_dereference_all(per_cpu(sd_llc, cpu)); 12574 12569 12575 12570 if (!sd || sd->nohz_idle) 12576 12571 goto unlock; ··· 12604 12599 12605 12600 /* 12606 12601 * The tick is still stopped but load could have been added in the 12607 - * meantime. We set the nohz.has_blocked flag to trig a check of the 12602 + * meantime. We set the nohz.has_blocked_load flag to trig a check of the 12608 12603 * *_avg. The CPU is already part of nohz.idle_cpus_mask so the clear 12609 - * of nohz.has_blocked can only happen after checking the new load 12604 + * of nohz.has_blocked_load can only happen after checking the new load 12610 12605 */ 12611 12606 if (rq->nohz_tick_stopped) 12612 12607 goto out; ··· 12618 12613 rq->nohz_tick_stopped = 1; 12619 12614 12620 12615 cpumask_set_cpu(cpu, nohz.idle_cpus_mask); 12621 - atomic_inc(&nohz.nr_cpus); 12622 12616 12623 12617 /* 12624 12618 * Ensures that if nohz_idle_balance() fails to observe our 12625 - * @idle_cpus_mask store, it must observe the @has_blocked 12619 + * @idle_cpus_mask store, it must observe the @has_blocked_load 12626 12620 * and @needs_update stores. 
12627 12621 */ 12628 12622 smp_mb__after_atomic(); ··· 12634 12630 * Each time a cpu enter idle, we assume that it has blocked load and 12635 12631 * enable the periodic update of the load of idle CPUs 12636 12632 */ 12637 - WRITE_ONCE(nohz.has_blocked, 1); 12633 + WRITE_ONCE(nohz.has_blocked_load, 1); 12638 12634 } 12639 12635 12640 12636 static bool update_nohz_stats(struct rq *rq) ··· 12675 12671 12676 12672 /* 12677 12673 * We assume there will be no idle load after this update and clear 12678 - * the has_blocked flag. If a cpu enters idle in the mean time, it will 12679 - * set the has_blocked flag and trigger another update of idle load. 12674 + * the has_blocked_load flag. If a cpu enters idle in the mean time, it will 12675 + * set the has_blocked_load flag and trigger another update of idle load. 12680 12676 * Because a cpu that becomes idle, is added to idle_cpus_mask before 12681 12677 * setting the flag, we are sure to not clear the state and not 12682 12678 * check the load of an idle cpu. ··· 12684 12680 * Same applies to idle_cpus_mask vs needs_update. 12685 12681 */ 12686 12682 if (flags & NOHZ_STATS_KICK) 12687 - WRITE_ONCE(nohz.has_blocked, 0); 12683 + WRITE_ONCE(nohz.has_blocked_load, 0); 12688 12684 if (flags & NOHZ_NEXT_KICK) 12689 12685 WRITE_ONCE(nohz.needs_update, 0); 12690 12686 12691 12687 /* 12692 - * Ensures that if we miss the CPU, we must see the has_blocked 12688 + * Ensures that if we miss the CPU, we must see the has_blocked_load 12693 12689 * store from nohz_balance_enter_idle(). 12694 12690 */ 12695 12691 smp_mb(); ··· 12756 12752 abort: 12757 12753 /* There is still blocked load, enable periodic update */ 12758 12754 if (has_blocked_load) 12759 - WRITE_ONCE(nohz.has_blocked, 1); 12755 + WRITE_ONCE(nohz.has_blocked_load, 1); 12760 12756 } 12761 12757 12762 12758 /* ··· 12818 12814 return; 12819 12815 12820 12816 /* Don't need to update blocked load of idle CPUs*/ 12821 - if (!READ_ONCE(nohz.has_blocked) || 12817 + if (!READ_ONCE(nohz.has_blocked_load) || 12822 12818 time_before(jiffies, READ_ONCE(nohz.next_blocked))) 12823 12819 return; 12824 12820 ··· 12889 12885 */ 12890 12886 rq_unpin_lock(this_rq, rf); 12891 12887 12892 - rcu_read_lock(); 12893 - sd = rcu_dereference_check_sched_domain(this_rq->sd); 12894 - if (!sd) { 12895 - rcu_read_unlock(); 12888 + sd = rcu_dereference_sched_domain(this_rq->sd); 12889 + if (!sd) 12896 12890 goto out; 12897 - } 12898 12891 12899 12892 if (!get_rd_overloaded(this_rq->rd) || 12900 12893 this_rq->avg_idle < sd->max_newidle_lb_cost) { 12901 12894 12902 12895 update_next_balance(sd, &next_balance); 12903 - rcu_read_unlock(); 12904 12896 goto out; 12905 12897 } 12906 - rcu_read_unlock(); 12907 12898 12908 - rq_modified_clear(this_rq); 12899 + /* 12900 + * Include sched_balance_update_blocked_averages() in the cost 12901 + * calculation because it can be quite costly -- this ensures we skip 12902 + * it when avg_idle gets to be very low. 
12903 + */ 12904 + t0 = sched_clock_cpu(this_cpu); 12905 + __sched_balance_update_blocked_averages(this_rq); 12906 + 12907 + this_rq->next_class = &fair_sched_class; 12909 12908 raw_spin_rq_unlock(this_rq); 12910 12909 12911 - t0 = sched_clock_cpu(this_cpu); 12912 - sched_balance_update_blocked_averages(this_cpu); 12913 - 12914 - rcu_read_lock(); 12915 12910 for_each_domain(this_cpu, sd) { 12916 12911 u64 domain_cost; 12917 12912 ··· 12960 12957 if (pulled_task || !continue_balancing) 12961 12958 break; 12962 12959 } 12963 - rcu_read_unlock(); 12964 12960 12965 12961 raw_spin_rq_lock(this_rq); 12966 12962 ··· 12975 12973 pulled_task = 1; 12976 12974 12977 12975 /* If a higher prio class was modified, restart the pick */ 12978 - if (rq_modified_above(this_rq, &fair_sched_class)) 12976 + if (sched_class_above(this_rq->next_class, &fair_sched_class)) 12979 12977 pulled_task = -1; 12980 12978 12981 12979 out: ··· 13326 13324 * zero_vruntime_fi, which would have been updated in prior calls 13327 13325 * to se_fi_update(). 13328 13326 */ 13329 - delta = (s64)(sea->vruntime - seb->vruntime) + 13330 - (s64)(cfs_rqb->zero_vruntime_fi - cfs_rqa->zero_vruntime_fi); 13327 + delta = vruntime_op(sea->vruntime, "-", seb->vruntime) + 13328 + vruntime_op(cfs_rqb->zero_vruntime_fi, "-", cfs_rqa->zero_vruntime_fi); 13331 13329 13332 13330 return delta > 0; 13333 13331 } ··· 13363 13361 for_each_sched_entity(se) { 13364 13362 cfs_rq = cfs_rq_of(se); 13365 13363 entity_tick(cfs_rq, se, queued); 13364 + } 13365 + 13366 + if (queued) { 13367 + if (!need_resched()) 13368 + hrtick_start_fair(rq, curr); 13369 + return; 13366 13370 } 13367 13371 13368 13372 if (static_branch_unlikely(&sched_numa_balancing)) ··· 13879 13871 * All the scheduling class methods: 13880 13872 */ 13881 13873 DEFINE_SCHED_CLASS(fair) = { 13882 - 13883 - .queue_mask = 2, 13884 - 13885 13874 .enqueue_task = enqueue_task_fair, 13886 13875 .dequeue_task = dequeue_task_fair, 13887 13876 .yield_task = yield_task_fair, 13888 13877 .yield_to_task = yield_to_task_fair, 13889 13878 13890 - .wakeup_preempt = check_preempt_wakeup_fair, 13879 + .wakeup_preempt = wakeup_preempt_fair, 13891 13880 13892 13881 .pick_task = pick_task_fair, 13893 13882 .pick_next_task = pick_next_task_fair, ··· 13944 13939 struct numa_group *ng; 13945 13940 13946 13941 rcu_read_lock(); 13947 - ng = rcu_dereference(p->numa_group); 13942 + ng = rcu_dereference_all(p->numa_group); 13948 13943 for_each_online_node(node) { 13949 13944 if (p->numa_faults) { 13950 13945 tsf = p->numa_faults[task_faults_idx(NUMA_MEM, node, 0)];
+4 -3
kernel/sched/idle.c
··· 460 460 { 461 461 update_curr_idle(rq); 462 462 scx_update_idle(rq, false, true); 463 + update_rq_avg_idle(rq); 463 464 } 464 465 465 466 static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first) ··· 537 536 se->exec_start = now; 538 537 539 538 dl_server_update_idle(&rq->fair_server, delta_exec); 539 + #ifdef CONFIG_SCHED_CLASS_EXT 540 + dl_server_update_idle(&rq->ext_server, delta_exec); 541 + #endif 540 542 } 541 543 542 544 /* 543 545 * Simple, special scheduling class for the per-CPU idle tasks: 544 546 */ 545 547 DEFINE_SCHED_CLASS(idle) = { 546 - 547 - .queue_mask = 0, 548 - 549 548 /* no enqueue/yield_task for idle tasks */ 550 549 551 550 /* dequeue is not valid, we print a debug message there: */
+11 -3
kernel/sched/rt.c
··· 1615 1615 { 1616 1616 struct task_struct *donor = rq->donor; 1617 1617 1618 + /* 1619 + * XXX If we're preempted by DL, queue a push? 1620 + */ 1621 + if (p->sched_class != &rt_sched_class) 1622 + return; 1623 + 1618 1624 if (p->prio < donor->prio) { 1619 1625 resched_curr(rq); 1620 1626 return; ··· 2106 2100 */ 2107 2101 static int rto_next_cpu(struct root_domain *rd) 2108 2102 { 2103 + int this_cpu = smp_processor_id(); 2109 2104 int next; 2110 2105 int cpu; 2111 2106 ··· 2129 2122 cpu = cpumask_next(rd->rto_cpu, rd->rto_mask); 2130 2123 2131 2124 rd->rto_cpu = cpu; 2125 + 2126 + /* Do not send IPI to self */ 2127 + if (cpu == this_cpu) 2128 + continue; 2132 2129 2133 2130 if (cpu < nr_cpu_ids) 2134 2131 return cpu; ··· 2579 2568 #endif /* CONFIG_SCHED_CORE */ 2580 2569 2581 2570 DEFINE_SCHED_CLASS(rt) = { 2582 - 2583 - .queue_mask = 4, 2584 - 2585 2571 .enqueue_task = enqueue_task_rt, 2586 2572 .dequeue_task = dequeue_task_rt, 2587 2573 .yield_task = yield_task_rt,
+70 -72
kernel/sched/sched.h
··· 418 418 extern void sched_init_dl_servers(void); 419 419 420 420 extern void fair_server_init(struct rq *rq); 421 + extern void ext_server_init(struct rq *rq); 421 422 extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq); 422 423 extern int dl_server_apply_params(struct sched_dl_entity *dl_se, 423 424 u64 runtime, u64 period, bool init); ··· 675 674 void (*func)(struct rq *rq); 676 675 }; 677 676 678 - /* CFS-related fields in a runqueue */ 677 + /* Fair scheduling SCHED_{NORMAL,BATCH,IDLE} related fields in a runqueue: */ 679 678 struct cfs_rq { 680 679 struct load_weight load; 681 680 unsigned int nr_queued; 682 - unsigned int h_nr_queued; /* SCHED_{NORMAL,BATCH,IDLE} */ 683 - unsigned int h_nr_runnable; /* SCHED_{NORMAL,BATCH,IDLE} */ 684 - unsigned int h_nr_idle; /* SCHED_IDLE */ 681 + unsigned int h_nr_queued; /* SCHED_{NORMAL,BATCH,IDLE} */ 682 + unsigned int h_nr_runnable; /* SCHED_{NORMAL,BATCH,IDLE} */ 683 + unsigned int h_nr_idle; /* SCHED_IDLE */ 685 684 686 - s64 avg_vruntime; 687 - u64 avg_load; 685 + s64 sum_w_vruntime; 686 + u64 sum_weight; 688 687 689 688 u64 zero_vruntime; 690 689 #ifdef CONFIG_SCHED_CORE ··· 695 694 struct rb_root_cached tasks_timeline; 696 695 697 696 /* 698 - * 'curr' points to currently running entity on this cfs_rq. 697 + * 'curr' points to the currently running entity on this cfs_rq. 699 698 * It is set to NULL otherwise (i.e when none are currently running). 700 699 */ 701 700 struct sched_entity *curr; ··· 731 730 unsigned long h_load; 732 731 u64 last_h_load_update; 733 732 struct sched_entity *h_load_next; 734 - #endif /* CONFIG_FAIR_GROUP_SCHED */ 735 733 736 - #ifdef CONFIG_FAIR_GROUP_SCHED 737 734 struct rq *rq; /* CPU runqueue to which this cfs_rq is attached */ 738 735 739 736 /* ··· 744 745 */ 745 746 int on_list; 746 747 struct list_head leaf_cfs_rq_list; 747 - struct task_group *tg; /* group that "owns" this runqueue */ 748 + struct task_group *tg; /* Group that "owns" this runqueue */ 748 749 749 750 /* Locally cached copy of our task_group's idle value */ 750 751 int idle; 751 752 752 - #ifdef CONFIG_CFS_BANDWIDTH 753 + # ifdef CONFIG_CFS_BANDWIDTH 753 754 int runtime_enabled; 754 755 s64 runtime_remaining; 755 756 756 757 u64 throttled_pelt_idle; 757 - #ifndef CONFIG_64BIT 758 + # ifndef CONFIG_64BIT 758 759 u64 throttled_pelt_idle_copy; 759 - #endif 760 + # endif 760 761 u64 throttled_clock; 761 762 u64 throttled_clock_pelt; 762 763 u64 throttled_clock_pelt_time; ··· 768 769 struct list_head throttled_list; 769 770 struct list_head throttled_csd_list; 770 771 struct list_head throttled_limbo_list; 771 - #endif /* CONFIG_CFS_BANDWIDTH */ 772 + # endif /* CONFIG_CFS_BANDWIDTH */ 772 773 #endif /* CONFIG_FAIR_GROUP_SCHED */ 773 774 }; 774 775 ··· 1120 1121 * acquire operations must be ordered by ascending &runqueue. 1121 1122 */ 1122 1123 struct rq { 1123 - /* runqueue lock: */ 1124 - raw_spinlock_t __lock; 1125 - 1126 - /* Per class runqueue modification mask; bits in class order. */ 1127 - unsigned int queue_mask; 1124 + /* 1125 + * The following members are loaded together, without holding the 1126 + * rq->lock, in an extremely hot loop in update_sg_lb_stats() 1127 + * (called from pick_next_task()). To reduce cache pollution from 1128 + * this operation, they are placed together on this dedicated cache 1129 + * line. Even though some of them are frequently modified, they are 1130 + * loaded much more frequently than they are stored. 
1131 + */ 1128 1132 unsigned int nr_running; 1129 1133 #ifdef CONFIG_NUMA_BALANCING 1130 1134 unsigned int nr_numa_running; 1131 1135 unsigned int nr_preferred_running; 1132 - unsigned int numa_migrate_on; 1133 1136 #endif 1137 + unsigned int ttwu_pending; 1138 + unsigned long cpu_capacity; 1139 + #ifdef CONFIG_SCHED_PROXY_EXEC 1140 + struct task_struct __rcu *donor; /* Scheduling context */ 1141 + struct task_struct __rcu *curr; /* Execution context */ 1142 + #else 1143 + union { 1144 + struct task_struct __rcu *donor; /* Scheduler context */ 1145 + struct task_struct __rcu *curr; /* Execution context */ 1146 + }; 1147 + #endif 1148 + struct task_struct *idle; 1149 + /* padding left here deliberately */ 1150 + 1151 + /* 1152 + * The next cacheline holds the (hot) runqueue lock, as well as 1153 + * some other less performance-critical fields. 1154 + */ 1155 + u64 nr_switches ____cacheline_aligned; 1156 + 1157 + /* runqueue lock: */ 1158 + raw_spinlock_t __lock; 1159 + 1134 1160 #ifdef CONFIG_NO_HZ_COMMON 1135 - unsigned long last_blocked_load_update_tick; 1136 - unsigned int has_blocked_load; 1137 - call_single_data_t nohz_csd; 1138 1161 unsigned int nohz_tick_stopped; 1139 1162 atomic_t nohz_flags; 1163 + unsigned int has_blocked_load; 1164 + unsigned long last_blocked_load_update_tick; 1165 + call_single_data_t nohz_csd; 1140 1166 #endif /* CONFIG_NO_HZ_COMMON */ 1141 - 1142 - unsigned int ttwu_pending; 1143 - u64 nr_switches; 1144 1167 1145 1168 #ifdef CONFIG_UCLAMP_TASK 1146 1169 /* Utilization clamp values based on CPU's RUNNABLE tasks */ ··· 1176 1155 struct dl_rq dl; 1177 1156 #ifdef CONFIG_SCHED_CLASS_EXT 1178 1157 struct scx_rq scx; 1158 + struct sched_dl_entity ext_server; 1179 1159 #endif 1180 1160 1181 1161 struct sched_dl_entity fair_server; ··· 1187 1165 struct list_head *tmp_alone_branch; 1188 1166 #endif /* CONFIG_FAIR_GROUP_SCHED */ 1189 1167 1168 + #ifdef CONFIG_NUMA_BALANCING 1169 + unsigned int numa_migrate_on; 1170 + #endif 1190 1171 /* 1191 1172 * This is part of a global counter where only the total sum 1192 1173 * over all CPUs matters. A task can increase this counter on ··· 1198 1173 */ 1199 1174 unsigned long nr_uninterruptible; 1200 1175 1201 - #ifdef CONFIG_SCHED_PROXY_EXEC 1202 - struct task_struct __rcu *donor; /* Scheduling context */ 1203 - struct task_struct __rcu *curr; /* Execution context */ 1204 - #else 1205 - union { 1206 - struct task_struct __rcu *donor; /* Scheduler context */ 1207 - struct task_struct __rcu *curr; /* Execution context */ 1208 - }; 1209 - #endif 1210 1176 struct sched_dl_entity *dl_server; 1211 - struct task_struct *idle; 1212 1177 struct task_struct *stop; 1178 + const struct sched_class *next_class; 1213 1179 unsigned long next_balance; 1214 1180 struct mm_struct *prev_mm; 1215 1181 1216 - unsigned int clock_update_flags; 1217 - u64 clock; 1218 - /* Ensure that all clocks are in the same cache line */ 1182 + /* 1183 + * The following fields of clock data are frequently referenced 1184 + * and updated together, and should go on their own cache line. 
1185 + */ 1219 1186 u64 clock_task ____cacheline_aligned; 1220 1187 u64 clock_pelt; 1188 + u64 clock; 1221 1189 unsigned long lost_idle_time; 1190 + unsigned int clock_update_flags; 1222 1191 u64 clock_pelt_idle; 1223 1192 u64 clock_idle; 1193 + 1224 1194 #ifndef CONFIG_64BIT 1225 1195 u64 clock_pelt_idle_copy; 1226 1196 u64 clock_idle_copy; 1227 1197 #endif 1228 - 1229 - atomic_t nr_iowait; 1230 1198 1231 1199 u64 last_seen_need_resched_ns; 1232 1200 int ticks_without_resched; ··· 1230 1212 1231 1213 struct root_domain *rd; 1232 1214 struct sched_domain __rcu *sd; 1233 - 1234 - unsigned long cpu_capacity; 1235 1215 1236 1216 struct balance_callback *balance_callback; 1237 1217 ··· 1340 1324 call_single_data_t cfsb_csd; 1341 1325 struct list_head cfsb_csd_list; 1342 1326 #endif 1343 - }; 1327 + 1328 + atomic_t nr_iowait; 1329 + } __no_randomize_layout; 1344 1330 1345 1331 #ifdef CONFIG_FAIR_GROUP_SCHED 1346 1332 ··· 1716 1698 1717 1699 #endif /* !CONFIG_FAIR_GROUP_SCHED */ 1718 1700 1701 + extern void update_rq_avg_idle(struct rq *rq); 1719 1702 extern void update_rq_clock(struct rq *rq); 1720 1703 1721 1704 /* ··· 2081 2062 rq->balance_callback = head; 2082 2063 } 2083 2064 2084 - #define rcu_dereference_check_sched_domain(p) \ 2085 - rcu_dereference_check((p), lockdep_is_held(&sched_domains_mutex)) 2065 + #define rcu_dereference_sched_domain(p) \ 2066 + rcu_dereference_all_check((p), lockdep_is_held(&sched_domains_mutex)) 2086 2067 2087 2068 /* 2088 2069 * The domain tree (rq->sd) is protected by RCU's quiescent state transition. ··· 2092 2073 * preempt-disabled sections. 2093 2074 */ 2094 2075 #define for_each_domain(cpu, __sd) \ 2095 - for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \ 2076 + for (__sd = rcu_dereference_sched_domain(cpu_rq(cpu)->sd); \ 2096 2077 __sd; __sd = __sd->parent) 2097 2078 2098 2079 /* A mask of all the SD flags that have the SDF_SHARED_CHILD metaflag */ ··· 2500 2481 #ifdef CONFIG_UCLAMP_TASK 2501 2482 int uclamp_enabled; 2502 2483 #endif 2503 - /* 2504 - * idle: 0 2505 - * ext: 1 2506 - * fair: 2 2507 - * rt: 4 2508 - * dl: 8 2509 - * stop: 16 2510 - */ 2511 - unsigned int queue_mask; 2512 2484 2513 2485 /* 2514 2486 * move_queued_task/activate_task/enqueue_task: rq->lock ··· 2657 2647 int (*task_is_throttled)(struct task_struct *p, int cpu); 2658 2648 #endif 2659 2649 }; 2660 - 2661 - /* 2662 - * Does not nest; only used around sched_class::pick_task() rq-lock-breaks. 
2663 - */ 2664 - static inline void rq_modified_clear(struct rq *rq) 2665 - { 2666 - rq->queue_mask = 0; 2667 - } 2668 - 2669 - static inline bool rq_modified_above(struct rq *rq, const struct sched_class * class) 2670 - { 2671 - unsigned int mask = class->queue_mask; 2672 - return rq->queue_mask & ~((mask << 1) - 1); 2673 - } 2674 2650 2675 2651 static inline void put_prev_task(struct rq *rq, struct task_struct *prev) 2676 2652 { ··· 3393 3397 }; 3394 3398 3395 3399 DECLARE_PER_CPU(struct irqtime, cpu_irqtime); 3396 - extern int sched_clock_irqtime; 3400 + DECLARE_STATIC_KEY_FALSE(sched_clock_irqtime); 3397 3401 3398 3402 static inline int irqtime_enabled(void) 3399 3403 { 3400 - return sched_clock_irqtime; 3404 + return static_branch_likely(&sched_clock_irqtime); 3401 3405 } 3402 3406 3403 3407 /* ··· 4006 4010 deactivate_task(src_rq, task, 0); 4007 4011 set_task_cpu(task, dst_rq->cpu); 4008 4012 activate_task(dst_rq, task, 0); 4013 + wakeup_preempt(dst_rq, task, 0); 4009 4014 } 4010 4015 4011 4016 static inline ··· 4076 4079 struct sched_change_ctx { 4077 4080 u64 prio; 4078 4081 struct task_struct *p; 4082 + const struct sched_class *class; 4079 4083 int flags; 4080 4084 bool queued; 4081 4085 bool running;
-3
kernel/sched/stop_task.c
··· 97 97 * Simple, special scheduling class for the per-CPU stop tasks: 98 98 */ 99 99 DEFINE_SCHED_CLASS(stop) = { 100 - 101 - .queue_mask = 16, 102 - 103 100 .enqueue_task = enqueue_task_stop, 104 101 .dequeue_task = dequeue_task_stop, 105 102 .yield_task = yield_task_stop,
+5
kernel/sched/topology.c
··· 508 508 if (rq->fair_server.dl_server) 509 509 __dl_server_attach_root(&rq->fair_server, rq); 510 510 511 + #ifdef CONFIG_SCHED_CLASS_EXT 512 + if (rq->ext_server.dl_server) 513 + __dl_server_attach_root(&rq->ext_server, rq); 514 + #endif 515 + 511 516 rq_unlock_irqrestore(rq, &rf); 512 517 513 518 if (old_rd)
+6
kernel/sys.c
··· 53 53 #include <linux/time_namespace.h> 54 54 #include <linux/binfmts.h> 55 55 #include <linux/futex.h> 56 + #include <linux/rseq.h> 56 57 57 58 #include <linux/sched.h> 58 59 #include <linux/sched/autogroup.h> ··· 2868 2867 break; 2869 2868 case PR_FUTEX_HASH: 2870 2869 error = futex_hash_prctl(arg2, arg3, arg4); 2870 + break; 2871 + case PR_RSEQ_SLICE_EXTENSION: 2872 + if (arg4 || arg5) 2873 + return -EINVAL; 2874 + error = rseq_slice_extension_prctl(arg2, arg3); 2871 2875 break; 2872 2876 default: 2873 2877 trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
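A quick usage note on the prctl() hook added above: from user space a thread would opt in roughly as in the sketch below. This is only an illustration built from the fallback defines the slice_test selftest carries further down (PR_RSEQ_SLICE_EXTENSION and friends); it assumes the thread has already registered an rseq area, and it passes zero for the unused arguments, as the kernel side above requires.

#include <sys/prctl.h>

#ifndef PR_RSEQ_SLICE_EXTENSION
# define PR_RSEQ_SLICE_EXTENSION	79
# define PR_RSEQ_SLICE_EXTENSION_SET	2
# define PR_RSEQ_SLICE_EXT_ENABLE	0x01
#endif

/* Opt the calling thread in to time slice extensions; returns 0 on success. */
static int enable_slice_extension(void)
{
	return prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
		     PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
}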
+1
kernel/sys_ni.c
··· 390 390 391 391 /* restartable sequence */ 392 392 COND_SYSCALL(rseq); 393 + COND_SYSCALL(rseq_slice_yield); 393 394 394 395 COND_SYSCALL(uretprobe); 395 396 COND_SYSCALL(uprobe);
+1 -1
kernel/time/hrtimer.c
··· 1742 1742 1743 1743 lockdep_assert_held(&cpu_base->lock); 1744 1744 1745 - debug_deactivate(timer); 1745 + debug_hrtimer_deactivate(timer); 1746 1746 base->running = timer; 1747 1747 1748 1748 /*
+1
scripts/syscall.tbl
··· 411 411 468 common file_getattr sys_file_getattr 412 412 469 common file_setattr sys_file_setattr 413 413 470 common listns sys_listns 414 + 471 common rseq_slice_yield sys_rseq_slice_yield
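There is no libc wrapper for the new system call at this point, so callers go through syscall(2) by number. A minimal, hypothetical local helper, using the number 471 added to the table above (the slice_test selftest below carries the same fallback define):

#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_rseq_slice_yield
# define __NR_rseq_slice_yield	471
#endif

/*
 * Give a granted slice extension back to the kernel. The slice_test
 * selftest below treats a zero return as having raced with the kernel
 * already revoking the grant.
 */
static inline long rseq_slice_yield(void)
{
	return syscall(__NR_rseq_slice_yield);
}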
+1
tools/testing/selftests/rseq/.gitignore
··· 10 10 param_test_mm_cid_benchmark 11 11 param_test_mm_cid_compare_twice 12 12 syscall_errors_test 13 + slice_test
+4 -1
tools/testing/selftests/rseq/Makefile
··· 17 17 TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \ 18 18 param_test_benchmark param_test_compare_twice param_test_mm_cid \ 19 19 param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \ 20 - syscall_errors_test 20 + syscall_errors_test slice_test 21 21 22 22 TEST_GEN_PROGS_EXTENDED = librseq.so 23 23 ··· 58 58 59 59 $(OUTPUT)/syscall_errors_test: syscall_errors_test.c $(TEST_GEN_PROGS_EXTENDED) \ 60 60 rseq.h rseq-*.h 61 + $(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@ 62 + 63 + $(OUTPUT)/slice_test: slice_test.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h 61 64 $(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
+27
tools/testing/selftests/rseq/rseq-abi.h
··· 53 53 __u64 abort_ip; 54 54 } __attribute__((aligned(4 * sizeof(__u64)))); 55 55 56 + /** 57 + * rseq_abi_slice_ctrl - Time slice extension control structure 58 + * @all: Compound value 59 + * @request: Request for a time slice extension 60 + * @granted: Granted time slice extension 61 + * 62 + * @request is set by user space and can be cleared by user space or kernel 63 + * space. @granted is set and cleared by the kernel and must only be read 64 + * by user space. 65 + */ 66 + struct rseq_abi_slice_ctrl { 67 + union { 68 + __u32 all; 69 + struct { 70 + __u8 request; 71 + __u8 granted; 72 + __u16 __reserved; 73 + }; 74 + }; 75 + }; 76 + 56 77 /* 57 78 * struct rseq_abi is aligned on 4 * 8 bytes to ensure it is always 58 79 * contained within a single cache-line. ··· 184 163 * (allocated uniquely within a memory map). 185 164 */ 186 165 __u32 mm_cid; 166 + 167 + /* 168 + * Time slice extension control structure. CPU local updates from 169 + * kernel and user space. 170 + */ 171 + struct rseq_abi_slice_ctrl slice_ctrl; 187 172 188 173 /* 189 174 * Flexible array member at end of structure, after last feature field.
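To make the request/granted handshake described in the comment above concrete, here is a hedged sketch of the intended user space pattern around a short critical section. It mirrors what slice_test.c does further down and assumes rseq has been registered and slice extensions enabled via prctl(); rseq_get_abi() and the RSEQ_*_ONCE() accessors come from the selftests' rseq.h, and do_short_critical_work() is a stand-in for the caller's own work.

#include <unistd.h>
#include <sys/syscall.h>

#include "rseq.h"	/* selftest helpers: rseq_get_abi(), RSEQ_{READ,WRITE}_ONCE() */

#ifndef __NR_rseq_slice_yield
# define __NR_rseq_slice_yield	471
#endif

static void do_short_critical_work(void)
{
	/* Stand-in for the caller's actual critical section. */
}

static void with_slice_extension(void)
{
	struct rseq_abi *rs = rseq_get_abi();

	/* Ask the kernel to defer preemption while the resource is held. */
	RSEQ_WRITE_ONCE(rs->slice_ctrl.request, 1);

	do_short_critical_work();

	/* Withdraw the request ... */
	RSEQ_WRITE_ONCE(rs->slice_ctrl.request, 0);

	/* ... and if an extension was granted in the meantime, yield it back promptly. */
	if (RSEQ_READ_ONCE(rs->slice_ctrl.granted))
		syscall(__NR_rseq_slice_yield);
}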
+132
tools/testing/selftests/rseq/rseq-slice-hist.py
··· 1 + #!/usr/bin/python3 2 + 3 + # 4 + # trace-cmd record -e hrtimer_start -e hrtimer_cancel -e hrtimer_expire_entry -- $cmd 5 + # 6 + 7 + from tracecmd import * 8 + 9 + def load_kallsyms(file_path='/proc/kallsyms'): 10 + """ 11 + Parses /proc/kallsyms into a dictionary. 12 + Returns: { address_int: symbol_name } 13 + """ 14 + kallsyms_map = {} 15 + 16 + try: 17 + with open(file_path, 'r') as f: 18 + for line in f: 19 + # The format is: [address] [type] [name] [module] 20 + parts = line.split() 21 + if len(parts) < 3: 22 + continue 23 + 24 + addr = int(parts[0], 16) 25 + name = parts[2] 26 + 27 + kallsyms_map[addr] = name 28 + 29 + except PermissionError: 30 + print(f"Error: Permission denied reading {file_path}. Try running with sudo.") 31 + except FileNotFoundError: 32 + print(f"Error: {file_path} not found.") 33 + 34 + return kallsyms_map 35 + 36 + ksyms = load_kallsyms() 37 + 38 + # pending[timer_ptr] = {'ts': timestamp, 'comm': comm} 39 + pending = {} 40 + 41 + # histograms[comm][bucket] = count 42 + histograms = {} 43 + 44 + class OnlineHarmonicMean: 45 + def __init__(self): 46 + self.n = 0 # Count of elements 47 + self.S = 0.0 # Cumulative sum of reciprocals 48 + 49 + def update(self, x): 50 + if x == 0: 51 + raise ValueError("Harmonic mean is undefined for zero.") 52 + 53 + self.n += 1 54 + self.S += 1.0 / x 55 + return self.n / self.S 56 + 57 + @property 58 + def mean(self): 59 + return self.n / self.S if self.n > 0 else 0 60 + 61 + ohms = {} 62 + 63 + def handle_start(record): 64 + func_name = ksyms[record.num_field("function")] 65 + if "rseq_slice_expired" in func_name: 66 + timer_ptr = record.num_field("hrtimer") 67 + pending[timer_ptr] = { 68 + 'ts': record.ts, 69 + 'comm': record.comm 70 + } 71 + return None 72 + 73 + def handle_cancel(record): 74 + timer_ptr = record.num_field("hrtimer") 75 + 76 + if timer_ptr in pending: 77 + start_data = pending.pop(timer_ptr) 78 + duration_ns = record.ts - start_data['ts'] 79 + duration_us = duration_ns // 1000 80 + 81 + comm = start_data['comm'] 82 + 83 + if comm not in ohms: 84 + ohms[comm] = OnlineHarmonicMean() 85 + 86 + ohms[comm].update(duration_ns) 87 + 88 + if comm not in histograms: 89 + histograms[comm] = {} 90 + 91 + histograms[comm][duration_us] = histograms[comm].get(duration_us, 0) + 1 92 + return None 93 + 94 + def handle_expire(record): 95 + timer_ptr = record.num_field("hrtimer") 96 + 97 + if timer_ptr in pending: 98 + start_data = pending.pop(timer_ptr) 99 + comm = start_data['comm'] 100 + 101 + if comm not in histograms: 102 + histograms[comm] = {} 103 + 104 + # Record -1 bucket for expired (failed to cancel) 105 + histograms[comm][-1] = histograms[comm].get(-1, 0) + 1 106 + return None 107 + 108 + if __name__ == "__main__": 109 + t = Trace("trace.dat") 110 + for cpu in range(0, t.cpus): 111 + ev = t.read_event(cpu) 112 + while ev: 113 + if "hrtimer_start" in ev.name: 114 + handle_start(ev) 115 + if "hrtimer_cancel" in ev.name: 116 + handle_cancel(ev) 117 + if "hrtimer_expire_entry" in ev.name: 118 + handle_expire(ev) 119 + 120 + ev = t.read_event(cpu) 121 + 122 + print("\n" + "="*40) 123 + print("RSEQ SLICE HISTOGRAM (us)") 124 + print("="*40) 125 + for comm, buckets in histograms.items(): 126 + print(f"\nTask: {comm} Mean: {ohms[comm].mean:.3f} ns") 127 + print(f" {'Latency (us)':<15} | {'Count'}") 128 + print(f" {'-'*30}") 129 + # Sort buckets numerically, putting -1 at the top 130 + for bucket in sorted(buckets.keys()): 131 + label = "EXPIRED" if bucket == -1 else f"{bucket} us" 132 + print(f" {label:<15} | 
{buckets[bucket]}")
+219
tools/testing/selftests/rseq/slice_test.c
··· 1 + // SPDX-License-Identifier: LGPL-2.1 2 + #define _GNU_SOURCE 3 + #include <assert.h> 4 + #include <pthread.h> 5 + #include <sched.h> 6 + #include <signal.h> 7 + #include <stdbool.h> 8 + #include <stdio.h> 9 + #include <string.h> 10 + #include <syscall.h> 11 + #include <unistd.h> 12 + 13 + #include <linux/prctl.h> 14 + #include <sys/prctl.h> 15 + #include <sys/time.h> 16 + 17 + #include "rseq.h" 18 + 19 + #include "../kselftest_harness.h" 20 + 21 + #ifndef __NR_rseq_slice_yield 22 + # define __NR_rseq_slice_yield 471 23 + #endif 24 + 25 + #define BITS_PER_INT 32 26 + #define BITS_PER_BYTE 8 27 + 28 + #ifndef PR_RSEQ_SLICE_EXTENSION 29 + # define PR_RSEQ_SLICE_EXTENSION 79 30 + # define PR_RSEQ_SLICE_EXTENSION_GET 1 31 + # define PR_RSEQ_SLICE_EXTENSION_SET 2 32 + # define PR_RSEQ_SLICE_EXT_ENABLE 0x01 33 + #endif 34 + 35 + #ifndef RSEQ_SLICE_EXT_REQUEST_BIT 36 + # define RSEQ_SLICE_EXT_REQUEST_BIT 0 37 + # define RSEQ_SLICE_EXT_GRANTED_BIT 1 38 + #endif 39 + 40 + #ifndef asm_inline 41 + # define asm_inline asm __inline 42 + #endif 43 + 44 + #define NSEC_PER_SEC 1000000000L 45 + #define NSEC_PER_USEC 1000L 46 + 47 + struct noise_params { 48 + int64_t noise_nsecs; 49 + int64_t sleep_nsecs; 50 + int64_t run; 51 + }; 52 + 53 + FIXTURE(slice_ext) 54 + { 55 + pthread_t noise_thread; 56 + struct noise_params noise_params; 57 + }; 58 + 59 + FIXTURE_VARIANT(slice_ext) 60 + { 61 + int64_t total_nsecs; 62 + int64_t slice_nsecs; 63 + int64_t noise_nsecs; 64 + int64_t sleep_nsecs; 65 + bool no_yield; 66 + }; 67 + 68 + FIXTURE_VARIANT_ADD(slice_ext, n2_2_50) 69 + { 70 + .total_nsecs = 5LL * NSEC_PER_SEC, 71 + .slice_nsecs = 2LL * NSEC_PER_USEC, 72 + .noise_nsecs = 2LL * NSEC_PER_USEC, 73 + .sleep_nsecs = 50LL * NSEC_PER_USEC, 74 + }; 75 + 76 + FIXTURE_VARIANT_ADD(slice_ext, n50_2_50) 77 + { 78 + .total_nsecs = 5LL * NSEC_PER_SEC, 79 + .slice_nsecs = 50LL * NSEC_PER_USEC, 80 + .noise_nsecs = 2LL * NSEC_PER_USEC, 81 + .sleep_nsecs = 50LL * NSEC_PER_USEC, 82 + }; 83 + 84 + FIXTURE_VARIANT_ADD(slice_ext, n2_2_50_no_yield) 85 + { 86 + .total_nsecs = 5LL * NSEC_PER_SEC, 87 + .slice_nsecs = 2LL * NSEC_PER_USEC, 88 + .noise_nsecs = 2LL * NSEC_PER_USEC, 89 + .sleep_nsecs = 50LL * NSEC_PER_USEC, 90 + .no_yield = true, 91 + }; 92 + 93 + 94 + static inline bool elapsed(struct timespec *start, struct timespec *now, 95 + int64_t span) 96 + { 97 + int64_t delta = now->tv_sec - start->tv_sec; 98 + 99 + delta *= NSEC_PER_SEC; 100 + delta += now->tv_nsec - start->tv_nsec; 101 + return delta >= span; 102 + } 103 + 104 + static void *noise_thread(void *arg) 105 + { 106 + struct noise_params *p = arg; 107 + 108 + while (RSEQ_READ_ONCE(p->run)) { 109 + struct timespec ts_start, ts_now; 110 + 111 + clock_gettime(CLOCK_MONOTONIC, &ts_start); 112 + do { 113 + clock_gettime(CLOCK_MONOTONIC, &ts_now); 114 + } while (!elapsed(&ts_start, &ts_now, p->noise_nsecs)); 115 + 116 + ts_start.tv_sec = 0; 117 + ts_start.tv_nsec = p->sleep_nsecs; 118 + clock_nanosleep(CLOCK_MONOTONIC, 0, &ts_start, NULL); 119 + } 120 + return NULL; 121 + } 122 + 123 + FIXTURE_SETUP(slice_ext) 124 + { 125 + cpu_set_t affinity; 126 + 127 + ASSERT_EQ(sched_getaffinity(0, sizeof(affinity), &affinity), 0); 128 + 129 + /* Pin it on a single CPU. 
Avoid CPU 0 */ 130 + for (int i = 1; i < CPU_SETSIZE; i++) { 131 + if (!CPU_ISSET(i, &affinity)) 132 + continue; 133 + 134 + CPU_ZERO(&affinity); 135 + CPU_SET(i, &affinity); 136 + ASSERT_EQ(sched_setaffinity(0, sizeof(affinity), &affinity), 0); 137 + break; 138 + } 139 + 140 + ASSERT_EQ(rseq_register_current_thread(), 0); 141 + 142 + ASSERT_EQ(prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET, 143 + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0), 0); 144 + 145 + self->noise_params.noise_nsecs = variant->noise_nsecs; 146 + self->noise_params.sleep_nsecs = variant->sleep_nsecs; 147 + self->noise_params.run = 1; 148 + 149 + ASSERT_EQ(pthread_create(&self->noise_thread, NULL, noise_thread, &self->noise_params), 0); 150 + } 151 + 152 + FIXTURE_TEARDOWN(slice_ext) 153 + { 154 + self->noise_params.run = 0; 155 + pthread_join(self->noise_thread, NULL); 156 + } 157 + 158 + TEST_F(slice_ext, slice_test) 159 + { 160 + unsigned long success = 0, yielded = 0, scheduled = 0, raced = 0; 161 + unsigned long total = 0, aborted = 0; 162 + struct rseq_abi *rs = rseq_get_abi(); 163 + struct timespec ts_start, ts_now; 164 + 165 + ASSERT_NE(rs, NULL); 166 + 167 + clock_gettime(CLOCK_MONOTONIC, &ts_start); 168 + do { 169 + struct timespec ts_cs; 170 + bool req = false; 171 + 172 + clock_gettime(CLOCK_MONOTONIC, &ts_cs); 173 + 174 + total++; 175 + RSEQ_WRITE_ONCE(rs->slice_ctrl.request, 1); 176 + do { 177 + clock_gettime(CLOCK_MONOTONIC, &ts_now); 178 + } while (!elapsed(&ts_cs, &ts_now, variant->slice_nsecs)); 179 + 180 + /* 181 + * request can be cleared unconditionally, but for making 182 + * the stats work this is actually checking it first 183 + */ 184 + if (RSEQ_READ_ONCE(rs->slice_ctrl.request)) { 185 + RSEQ_WRITE_ONCE(rs->slice_ctrl.request, 0); 186 + /* Race between check and clear! */ 187 + req = true; 188 + success++; 189 + } 190 + 191 + if (RSEQ_READ_ONCE(rs->slice_ctrl.granted)) { 192 + /* The above raced against a late grant */ 193 + if (req) 194 + success--; 195 + if (variant->no_yield) { 196 + syscall(__NR_getpid); 197 + aborted++; 198 + } else { 199 + yielded++; 200 + if (!syscall(__NR_rseq_slice_yield)) 201 + raced++; 202 + } 203 + } else { 204 + if (!req) 205 + scheduled++; 206 + } 207 + 208 + clock_gettime(CLOCK_MONOTONIC, &ts_now); 209 + } while (!elapsed(&ts_start, &ts_now, variant->total_nsecs)); 210 + 211 + printf("# Total %12ld\n", total); 212 + printf("# Success %12ld\n", success); 213 + printf("# Yielded %12ld\n", yielded); 214 + printf("# Aborted %12ld\n", aborted); 215 + printf("# Scheduled %12ld\n", scheduled); 216 + printf("# Raced %12ld\n", raced); 217 + } 218 + 219 + TEST_HARNESS_MAIN
+2
tools/testing/selftests/sched_ext/Makefile
··· 183 183 select_cpu_dispatch_bad_dsq \ 184 184 select_cpu_dispatch_dbl_dsp \ 185 185 select_cpu_vtime \ 186 + rt_stall \ 186 187 test_example \ 188 + total_bw \ 187 189 188 190 testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets))) 189 191
+23
tools/testing/selftests/sched_ext/rt_stall.bpf.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * A scheduler that verifies whether RT tasks can stall SCHED_EXT tasks. 4 + * 5 + * Copyright (c) 2025 NVIDIA Corporation. 6 + */ 7 + 8 + #include <scx/common.bpf.h> 9 + 10 + char _license[] SEC("license") = "GPL"; 11 + 12 + UEI_DEFINE(uei); 13 + 14 + void BPF_STRUCT_OPS(rt_stall_exit, struct scx_exit_info *ei) 15 + { 16 + UEI_RECORD(uei, ei); 17 + } 18 + 19 + SEC(".struct_ops.link") 20 + struct sched_ext_ops rt_stall_ops = { 21 + .exit = (void *)rt_stall_exit, 22 + .name = "rt_stall", 23 + };
+240
tools/testing/selftests/sched_ext/rt_stall.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright (c) 2025 NVIDIA Corporation. 4 + */ 5 + #define _GNU_SOURCE 6 + #include <stdio.h> 7 + #include <stdlib.h> 8 + #include <unistd.h> 9 + #include <sched.h> 10 + #include <sys/prctl.h> 11 + #include <sys/types.h> 12 + #include <sys/wait.h> 13 + #include <time.h> 14 + #include <linux/sched.h> 15 + #include <signal.h> 16 + #include <bpf/bpf.h> 17 + #include <scx/common.h> 18 + #include <unistd.h> 19 + #include "rt_stall.bpf.skel.h" 20 + #include "scx_test.h" 21 + #include "../kselftest.h" 22 + 23 + #define CORE_ID 0 /* CPU to pin tasks to */ 24 + #define RUN_TIME 5 /* How long to run the test in seconds */ 25 + 26 + /* Simple busy-wait function for test tasks */ 27 + static void process_func(void) 28 + { 29 + while (1) { 30 + /* Busy wait */ 31 + for (volatile unsigned long i = 0; i < 10000000UL; i++) 32 + ; 33 + } 34 + } 35 + 36 + /* Set CPU affinity to a specific core */ 37 + static void set_affinity(int cpu) 38 + { 39 + cpu_set_t mask; 40 + 41 + CPU_ZERO(&mask); 42 + CPU_SET(cpu, &mask); 43 + if (sched_setaffinity(0, sizeof(mask), &mask) != 0) { 44 + perror("sched_setaffinity"); 45 + exit(EXIT_FAILURE); 46 + } 47 + } 48 + 49 + /* Set task scheduling policy and priority */ 50 + static void set_sched(int policy, int priority) 51 + { 52 + struct sched_param param; 53 + 54 + param.sched_priority = priority; 55 + if (sched_setscheduler(0, policy, &param) != 0) { 56 + perror("sched_setscheduler"); 57 + exit(EXIT_FAILURE); 58 + } 59 + } 60 + 61 + /* Get process runtime from /proc/<pid>/stat */ 62 + static float get_process_runtime(int pid) 63 + { 64 + char path[256]; 65 + FILE *file; 66 + long utime, stime; 67 + int fields; 68 + 69 + snprintf(path, sizeof(path), "/proc/%d/stat", pid); 70 + file = fopen(path, "r"); 71 + if (file == NULL) { 72 + perror("Failed to open stat file"); 73 + return -1; 74 + } 75 + 76 + /* Skip the first 13 fields and read the 14th and 15th */ 77 + fields = fscanf(file, 78 + "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu", 79 + &utime, &stime); 80 + fclose(file); 81 + 82 + if (fields != 2) { 83 + fprintf(stderr, "Failed to read stat file\n"); 84 + return -1; 85 + } 86 + 87 + /* Calculate the total time spent in the process */ 88 + long total_time = utime + stime; 89 + long ticks_per_second = sysconf(_SC_CLK_TCK); 90 + float runtime_seconds = total_time * 1.0 / ticks_per_second; 91 + 92 + return runtime_seconds; 93 + } 94 + 95 + static enum scx_test_status setup(void **ctx) 96 + { 97 + struct rt_stall *skel; 98 + 99 + skel = rt_stall__open(); 100 + SCX_FAIL_IF(!skel, "Failed to open"); 101 + SCX_ENUM_INIT(skel); 102 + SCX_FAIL_IF(rt_stall__load(skel), "Failed to load skel"); 103 + 104 + *ctx = skel; 105 + 106 + return SCX_TEST_PASS; 107 + } 108 + 109 + static bool sched_stress_test(bool is_ext) 110 + { 111 + /* 112 + * We're expecting the EXT task to get around 5% of CPU time when 113 + * competing with the RT task (small 1% fluctuations are expected). 114 + * 115 + * However, the EXT task should get at least 4% of the CPU to prove 116 + * that the EXT deadline server is working correctly. A percentage 117 + * less than 4% indicates a bug where RT tasks can potentially 118 + * stall SCHED_EXT tasks, causing the test to fail. 119 + */ 120 + const float expected_min_ratio = 0.04; /* 4% */ 121 + const char *class_str = is_ext ? 
"EXT" : "FAIR"; 122 + 123 + float ext_runtime, rt_runtime, actual_ratio; 124 + int ext_pid, rt_pid; 125 + 126 + ksft_print_header(); 127 + ksft_set_plan(1); 128 + 129 + /* Create and set up a EXT task */ 130 + ext_pid = fork(); 131 + if (ext_pid == 0) { 132 + set_affinity(CORE_ID); 133 + process_func(); 134 + exit(0); 135 + } else if (ext_pid < 0) { 136 + perror("fork task"); 137 + ksft_exit_fail(); 138 + } 139 + 140 + /* Create an RT task */ 141 + rt_pid = fork(); 142 + if (rt_pid == 0) { 143 + set_affinity(CORE_ID); 144 + set_sched(SCHED_FIFO, 50); 145 + process_func(); 146 + exit(0); 147 + } else if (rt_pid < 0) { 148 + perror("fork for RT task"); 149 + ksft_exit_fail(); 150 + } 151 + 152 + /* Let the processes run for the specified time */ 153 + sleep(RUN_TIME); 154 + 155 + /* Get runtime for the EXT task */ 156 + ext_runtime = get_process_runtime(ext_pid); 157 + if (ext_runtime == -1) 158 + ksft_exit_fail_msg("Error getting runtime for %s task (PID %d)\n", 159 + class_str, ext_pid); 160 + ksft_print_msg("Runtime of %s task (PID %d) is %f seconds\n", 161 + class_str, ext_pid, ext_runtime); 162 + 163 + /* Get runtime for the RT task */ 164 + rt_runtime = get_process_runtime(rt_pid); 165 + if (rt_runtime == -1) 166 + ksft_exit_fail_msg("Error getting runtime for RT task (PID %d)\n", rt_pid); 167 + ksft_print_msg("Runtime of RT task (PID %d) is %f seconds\n", rt_pid, rt_runtime); 168 + 169 + /* Kill the processes */ 170 + kill(ext_pid, SIGKILL); 171 + kill(rt_pid, SIGKILL); 172 + waitpid(ext_pid, NULL, 0); 173 + waitpid(rt_pid, NULL, 0); 174 + 175 + /* Verify that the scx task got enough runtime */ 176 + actual_ratio = ext_runtime / (ext_runtime + rt_runtime); 177 + ksft_print_msg("%s task got %.2f%% of total runtime\n", 178 + class_str, actual_ratio * 100); 179 + 180 + if (actual_ratio >= expected_min_ratio) { 181 + ksft_test_result_pass("PASS: %s task got more than %.2f%% of runtime\n", 182 + class_str, expected_min_ratio * 100); 183 + return true; 184 + } 185 + ksft_test_result_fail("FAIL: %s task got less than %.2f%% of runtime\n", 186 + class_str, expected_min_ratio * 100); 187 + return false; 188 + } 189 + 190 + static enum scx_test_status run(void *ctx) 191 + { 192 + struct rt_stall *skel = ctx; 193 + struct bpf_link *link = NULL; 194 + bool res; 195 + int i; 196 + 197 + /* 198 + * Test if the dl_server is working both with and without the 199 + * sched_ext scheduler attached. 200 + * 201 + * This ensures all the scenarios are covered: 202 + * - fair_server stop -> ext_server start 203 + * - ext_server stop -> fair_server stop 204 + */ 205 + for (i = 0; i < 4; i++) { 206 + bool is_ext = i % 2; 207 + 208 + if (is_ext) { 209 + memset(&skel->data->uei, 0, sizeof(skel->data->uei)); 210 + link = bpf_map__attach_struct_ops(skel->maps.rt_stall_ops); 211 + SCX_FAIL_IF(!link, "Failed to attach scheduler"); 212 + } 213 + res = sched_stress_test(is_ext); 214 + if (is_ext) { 215 + SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_NONE)); 216 + bpf_link__destroy(link); 217 + } 218 + 219 + if (!res) 220 + ksft_exit_fail(); 221 + } 222 + 223 + return SCX_TEST_PASS; 224 + } 225 + 226 + static void cleanup(void *ctx) 227 + { 228 + struct rt_stall *skel = ctx; 229 + 230 + rt_stall__destroy(skel); 231 + } 232 + 233 + struct scx_test rt_stall = { 234 + .name = "rt_stall", 235 + .description = "Verify that RT tasks cannot stall SCHED_EXT tasks", 236 + .setup = setup, 237 + .run = run, 238 + .cleanup = cleanup, 239 + }; 240 + REGISTER_SCX_TEST(&rt_stall)
+281
tools/testing/selftests/sched_ext/total_bw.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Test to verify that total_bw value remains consistent across all CPUs 4 + * in different BPF program states. 5 + * 6 + * Copyright (C) 2025 NVIDIA Corporation. 7 + */ 8 + #include <bpf/bpf.h> 9 + #include <errno.h> 10 + #include <pthread.h> 11 + #include <scx/common.h> 12 + #include <stdio.h> 13 + #include <stdlib.h> 14 + #include <string.h> 15 + #include <sys/wait.h> 16 + #include <unistd.h> 17 + #include "minimal.bpf.skel.h" 18 + #include "scx_test.h" 19 + 20 + #define MAX_CPUS 512 21 + #define STRESS_DURATION_SEC 5 22 + 23 + struct total_bw_ctx { 24 + struct minimal *skel; 25 + long baseline_bw[MAX_CPUS]; 26 + int nr_cpus; 27 + }; 28 + 29 + static void *cpu_stress_thread(void *arg) 30 + { 31 + volatile int i; 32 + time_t end_time = time(NULL) + STRESS_DURATION_SEC; 33 + 34 + while (time(NULL) < end_time) 35 + for (i = 0; i < 1000000; i++) 36 + ; 37 + 38 + return NULL; 39 + } 40 + 41 + /* 42 + * The first enqueue on a CPU causes the DL server to start, for that 43 + * reason run stressor threads in the hopes it schedules on all CPUs. 44 + */ 45 + static int run_cpu_stress(int nr_cpus) 46 + { 47 + pthread_t *threads; 48 + int i, ret = 0; 49 + 50 + threads = calloc(nr_cpus, sizeof(pthread_t)); 51 + if (!threads) 52 + return -ENOMEM; 53 + 54 + /* Create threads to run on each CPU */ 55 + for (i = 0; i < nr_cpus; i++) { 56 + if (pthread_create(&threads[i], NULL, cpu_stress_thread, NULL)) { 57 + ret = -errno; 58 + fprintf(stderr, "Failed to create thread %d: %s\n", i, strerror(-ret)); 59 + break; 60 + } 61 + } 62 + 63 + /* Wait for all threads to complete */ 64 + for (i = 0; i < nr_cpus; i++) { 65 + if (threads[i]) 66 + pthread_join(threads[i], NULL); 67 + } 68 + 69 + free(threads); 70 + return ret; 71 + } 72 + 73 + static int read_total_bw_values(long *bw_values, int max_cpus) 74 + { 75 + FILE *fp; 76 + char line[256]; 77 + int cpu_count = 0; 78 + 79 + fp = fopen("/sys/kernel/debug/sched/debug", "r"); 80 + if (!fp) { 81 + SCX_ERR("Failed to open debug file"); 82 + return -1; 83 + } 84 + 85 + while (fgets(line, sizeof(line), fp)) { 86 + char *bw_str = strstr(line, "total_bw"); 87 + 88 + if (bw_str) { 89 + bw_str = strchr(bw_str, ':'); 90 + if (bw_str) { 91 + /* Only store up to max_cpus values */ 92 + if (cpu_count < max_cpus) 93 + bw_values[cpu_count] = atol(bw_str + 1); 94 + cpu_count++; 95 + } 96 + } 97 + } 98 + 99 + fclose(fp); 100 + return cpu_count; 101 + } 102 + 103 + static bool verify_total_bw_consistency(long *bw_values, int count) 104 + { 105 + int i; 106 + long first_value; 107 + 108 + if (count <= 0) 109 + return false; 110 + 111 + first_value = bw_values[0]; 112 + 113 + for (i = 1; i < count; i++) { 114 + if (bw_values[i] != first_value) { 115 + SCX_ERR("Inconsistent total_bw: CPU0=%ld, CPU%d=%ld", 116 + first_value, i, bw_values[i]); 117 + return false; 118 + } 119 + } 120 + 121 + return true; 122 + } 123 + 124 + static int fetch_verify_total_bw(long *bw_values, int nr_cpus) 125 + { 126 + int attempts = 0; 127 + int max_attempts = 10; 128 + int count; 129 + 130 + /* 131 + * The first enqueue on a CPU causes the DL server to start, for that 132 + * reason run stressor threads in the hopes it schedules on all CPUs. 
133 + */ 134 + if (run_cpu_stress(nr_cpus) < 0) { 135 + SCX_ERR("Failed to run CPU stress"); 136 + return -1; 137 + } 138 + 139 + /* Try multiple times to get stable values */ 140 + while (attempts < max_attempts) { 141 + count = read_total_bw_values(bw_values, nr_cpus); 142 + fprintf(stderr, "Read %d total_bw values (testing %d CPUs)\n", count, nr_cpus); 143 + /* If system has more CPUs than we're testing, that's OK */ 144 + if (count < nr_cpus) { 145 + SCX_ERR("Expected at least %d CPUs, got %d", nr_cpus, count); 146 + attempts++; 147 + sleep(1); 148 + continue; 149 + } 150 + 151 + /* Only verify the CPUs we're testing */ 152 + if (verify_total_bw_consistency(bw_values, nr_cpus)) { 153 + fprintf(stderr, "Values are consistent: %ld\n", bw_values[0]); 154 + return 0; 155 + } 156 + 157 + attempts++; 158 + sleep(1); 159 + } 160 + 161 + return -1; 162 + } 163 + 164 + static enum scx_test_status setup(void **ctx) 165 + { 166 + struct total_bw_ctx *test_ctx; 167 + 168 + if (access("/sys/kernel/debug/sched/debug", R_OK) != 0) { 169 + fprintf(stderr, "Skipping test: debugfs sched/debug not accessible\n"); 170 + return SCX_TEST_SKIP; 171 + } 172 + 173 + test_ctx = calloc(1, sizeof(*test_ctx)); 174 + if (!test_ctx) 175 + return SCX_TEST_FAIL; 176 + 177 + test_ctx->nr_cpus = sysconf(_SC_NPROCESSORS_ONLN); 178 + if (test_ctx->nr_cpus <= 0) { 179 + free(test_ctx); 180 + return SCX_TEST_FAIL; 181 + } 182 + 183 + /* If system has more CPUs than MAX_CPUS, just test the first MAX_CPUS */ 184 + if (test_ctx->nr_cpus > MAX_CPUS) 185 + test_ctx->nr_cpus = MAX_CPUS; 186 + 187 + /* Test scenario 1: BPF program not loaded */ 188 + /* Read and verify baseline total_bw before loading BPF program */ 189 + fprintf(stderr, "BPF prog initially not loaded, reading total_bw values\n"); 190 + if (fetch_verify_total_bw(test_ctx->baseline_bw, test_ctx->nr_cpus) < 0) { 191 + SCX_ERR("Failed to get stable baseline values"); 192 + free(test_ctx); 193 + return SCX_TEST_FAIL; 194 + } 195 + 196 + /* Load the BPF skeleton */ 197 + test_ctx->skel = minimal__open(); 198 + if (!test_ctx->skel) { 199 + free(test_ctx); 200 + return SCX_TEST_FAIL; 201 + } 202 + 203 + SCX_ENUM_INIT(test_ctx->skel); 204 + if (minimal__load(test_ctx->skel)) { 205 + minimal__destroy(test_ctx->skel); 206 + free(test_ctx); 207 + return SCX_TEST_FAIL; 208 + } 209 + 210 + *ctx = test_ctx; 211 + return SCX_TEST_PASS; 212 + } 213 + 214 + static enum scx_test_status run(void *ctx) 215 + { 216 + struct total_bw_ctx *test_ctx = ctx; 217 + struct bpf_link *link; 218 + long loaded_bw[MAX_CPUS]; 219 + long unloaded_bw[MAX_CPUS]; 220 + int i; 221 + 222 + /* Test scenario 2: BPF program loaded */ 223 + link = bpf_map__attach_struct_ops(test_ctx->skel->maps.minimal_ops); 224 + if (!link) { 225 + SCX_ERR("Failed to attach scheduler"); 226 + return SCX_TEST_FAIL; 227 + } 228 + 229 + fprintf(stderr, "BPF program loaded, reading total_bw values\n"); 230 + if (fetch_verify_total_bw(loaded_bw, test_ctx->nr_cpus) < 0) { 231 + SCX_ERR("Failed to get stable values with BPF loaded"); 232 + bpf_link__destroy(link); 233 + return SCX_TEST_FAIL; 234 + } 235 + bpf_link__destroy(link); 236 + 237 + /* Test scenario 3: BPF program unloaded */ 238 + fprintf(stderr, "BPF program unloaded, reading total_bw values\n"); 239 + if (fetch_verify_total_bw(unloaded_bw, test_ctx->nr_cpus) < 0) { 240 + SCX_ERR("Failed to get stable values after BPF unload"); 241 + return SCX_TEST_FAIL; 242 + } 243 + 244 + /* Verify all three scenarios have the same total_bw values */ 245 + for (i = 0; i < 
test_ctx->nr_cpus; i++) { 246 + if (test_ctx->baseline_bw[i] != loaded_bw[i]) { 247 + SCX_ERR("CPU%d: baseline_bw=%ld != loaded_bw=%ld", 248 + i, test_ctx->baseline_bw[i], loaded_bw[i]); 249 + return SCX_TEST_FAIL; 250 + } 251 + 252 + if (test_ctx->baseline_bw[i] != unloaded_bw[i]) { 253 + SCX_ERR("CPU%d: baseline_bw=%ld != unloaded_bw=%ld", 254 + i, test_ctx->baseline_bw[i], unloaded_bw[i]); 255 + return SCX_TEST_FAIL; 256 + } 257 + } 258 + 259 + fprintf(stderr, "All total_bw values are consistent across all scenarios\n"); 260 + return SCX_TEST_PASS; 261 + } 262 + 263 + static void cleanup(void *ctx) 264 + { 265 + struct total_bw_ctx *test_ctx = ctx; 266 + 267 + if (test_ctx) { 268 + if (test_ctx->skel) 269 + minimal__destroy(test_ctx->skel); 270 + free(test_ctx); 271 + } 272 + } 273 + 274 + struct scx_test total_bw = { 275 + .name = "total_bw", 276 + .description = "Verify total_bw consistency across BPF program states", 277 + .setup = setup, 278 + .run = run, 279 + .cleanup = cleanup, 280 + }; 281 + REGISTER_SCX_TEST(&total_bw)