Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

Merge tag 'sched-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:

- Fix spurious failures in rseq self-tests (Mark Brown)

- Fix rseq rseq::cpu_id_start ABI regression due to TCMalloc's creative
use of the supposedly read-only field

The fix is to introduce a new ABI variant based on a new (larger)
rseq area registration size, to keep the TCMalloc use of rseq
backwards compatible on new kernels (Thomas Gleixner)

- Fix wakeup_preempt_fair() for not waking up task (Vincent Guittot)

- Fix s64 mult overflow in vruntime_eligible() (Zhan Xusheng)

* tag 'sched-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix wakeup_preempt_fair() for not waking up task
sched/fair: Fix overflow in vruntime_eligible()
selftests/rseq: Expand for optimized RSEQ ABI v2
rseq: Reenable performance optimizations conditionally
rseq: Implement read only ABI enforcement for optimized RSEQ V2 mode
selftests/rseq: Validate legacy behavior
selftests/rseq: Make registration flexible for legacy and optimized mode
selftests/rseq: Skip tests if time slice extensions are not available
rseq: Revert to historical performance killing behaviour
rseq: Don't advertise time slice extensions if disabled
rseq: Protect rseq_reset() against interrupts
rseq: Set rseq::cpu_id_start to 0 on unregistration
selftests/rseq: Don't run tests with runner scripts outside of the scripts

+580 -211
+93 -1
Documentation/userspace-api/rseq.rst
···
 Allows to implement per CPU data efficiently. Documentation is in code and
 selftests. :(
 
+Optimized RSEQ V2
+-----------------
+
+On architectures which utilize the generic entry code and generic TIF bits
+the kernel supports runtime optimizations for RSEQ, which also enable
+enhanced features like scheduler time slice extensions.
+
+To enable them a task has to register the RSEQ region with at least the
+length advertised by getauxval(AT_RSEQ_FEATURE_SIZE).
+
+If existing binaries register with RSEQ_ORIG_SIZE (32 bytes), the kernel
+keeps the legacy low performance mode enabled to fulfil the expectations
+of existing users regarding the original RSEQ implementation behaviour.
+
+The following table documents the ABI and behavioral guarantees of the
+legacy and the optimized V2 mode.
+
+.. list-table:: RSEQ modes
+   :header-rows: 1
+
+   * - Nr
+     - What
+
+     - Legacy
+     - Optimized V2
+
+   * - 1
+     - The cpu_id_start, cpu_id, node_id and mm_cid fields (User mode read
+       only)
+     .. Legacy
+     - Updated by the kernel unconditionally after each context switch and
+       before signal delivery
+     .. Optimized V2
+     - Updated by the kernel if and only if they change, i.e. if the task
+       is migrated or mm_cid changes
+
+   * - 2
+     - The rseq_cs critical section field
+     .. Legacy
+     - Evaluated and handled unconditionally after each context switch and
+       before signal delivery
+     .. Optimized V2
+     - Evaluated and handled conditionally only when user space was
+       interrupted and was scheduled out or before delivering a signal in
+       the interrupted context.
+
+   * - 3
+     - Read only fields
+     .. Legacy
+     - No strict enforcement except in debug mode
+     .. Optimized V2
+     - Strict enforcement
+
+   * - 4
+     - membarrier(...RSEQ)
+     .. Legacy
+     - All running threads of the process are interrupted and the ID fields
+       are rewritten and eventually active critical sections are aborted
+       before they return to user space. All threads which are scheduled
+       out whether voluntary or not are covered by #1/#2 above.
+     .. Optimized V2
+     - All running threads of the process are interrupted and eventually
+       active critical sections are aborted before these threads return to
+       user space. The ID fields are only updated if changed as a
+       consequence of the interrupt. All threads which are scheduled out
+       whether voluntary or not are covered by #1/#2 above.
+
+   * - 5
+     - Time slice extensions
+     .. Legacy
+     - Not supported
+     .. Optimized V2
+     - Supported
+
+The legacy mode is obviously less performant as it does unconditional
+updates and critical section checks even if not strictly required by the
+ABI contract. That can't be changed anymore as some users depend on that
+observed behavior, which in turn enables them to violate the ABI and
+overwrite the cpu_id_start field for their own purposes. This is obviously
+discouraged as it renders RSEQ incompatible with the intended usage and
+breaks the expectation of other libraries in the same application.
+
+The ABI compliant optimized v2 mode, which respects the read only fields,
+does not require unconditional updates and therefore is way more
+performant. The kernel validates the read only fields for compliance. If
+user space modifies them, the process is killed. Compliant usage allows
+multiple libraries in the same application to benefit from the RSEQ
+functionality without disturbing each other. The ABI compliant optimized v2
+mode also enables extended RSEQ features like time slice extensions.
+
+
 Scheduler time slice extensions
 -------------------------------
 
···
 
 * Enabled at boot time (default is enabled)
 
-* A rseq userspace pointer has been registered for the thread
+* A rseq userspace pointer has been registered for the thread in
+  optimized V2 mode
 
 The thread has to enable the functionality via prctl(2)::
 
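The registration rule described in this documentation hunk can be sketched in user space. This is a minimal illustration under stated assumptions, not the kernel's or glibc's code: `pick_rseq_len()` and `expected_mode()` are hypothetical helper names, `feature_size` stands for whatever `getauxval(AT_RSEQ_FEATURE_SIZE)` returned, and `generic_irq_entry` stands in for the kernel's `CONFIG_GENERIC_IRQ_ENTRY` build option, which user space cannot query directly.

```c
#include <stdbool.h>
#include <stddef.h>

/* Size of the original struct rseq including padding, as stated in the patch. */
#define ORIG_RSEQ_SIZE 32u

enum rseq_mode { RSEQ_MODE_LEGACY = 1, RSEQ_MODE_V2 = 2 };

/*
 * Hypothetical helper: choose the length to pass to the rseq syscall.
 * feature_size is the value obtained from getauxval(AT_RSEQ_FEATURE_SIZE);
 * 0 means the kernel did not advertise it, so fall back to the original size.
 */
static size_t pick_rseq_len(size_t feature_size, bool want_v2)
{
	if (!want_v2 || feature_size <= ORIG_RSEQ_SIZE)
		return ORIG_RSEQ_SIZE;
	return feature_size;
}

/*
 * Hypothetical helper: the mode a registration of rseq_len bytes should end
 * up in. Mirrors the version selection in rseq_register() from this merge:
 * V2 needs the generic IRQ entry code and a length above the original size.
 */
static enum rseq_mode expected_mode(size_t rseq_len, bool generic_irq_entry)
{
	if (generic_irq_entry && rseq_len > ORIG_RSEQ_SIZE)
		return RSEQ_MODE_V2;
	return RSEQ_MODE_LEGACY;
}
```

A caller that wants the optimized mode would pass the advertised size; a binary that keeps registering 32 bytes stays in legacy mode on any kernel.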
+26 -11
include/linux/rseq.h
···
 
 void __rseq_handle_slowpath(struct pt_regs *regs);
 
+static __always_inline bool rseq_v2(struct task_struct *t)
+{
+	return IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && likely(t->rseq.event.has_rseq > 1);
+}
+
 /* Invoked from resume_user_mode_work() */
 static inline void rseq_handle_slowpath(struct pt_regs *regs)
 {
···
 		if (current->rseq.event.slowpath)
 			__rseq_handle_slowpath(regs);
 	} else {
-		/* '&' is intentional to spare one conditional branch */
-		if (current->rseq.event.sched_switch & current->rseq.event.has_rseq)
+		if (current->rseq.event.sched_switch && current->rseq.event.has_rseq)
 			__rseq_handle_slowpath(regs);
 	}
 }
···
  */
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
 {
-	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
-		/* '&' is intentional to spare one conditional branch */
-		if (current->rseq.event.has_rseq & current->rseq.event.user_irq)
+	if (rseq_v2(current)) {
+		/* has_rseq is implied in rseq_v2() */
+		if (current->rseq.event.user_irq)
 			__rseq_signal_deliver(ksig->sig, regs);
 	} else {
 		if (current->rseq.event.has_rseq)
···
 {
 	struct rseq_event *ev = &t->rseq.event;
 
-	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+	/*
+	 * Only apply the user_irq optimization for RSEQ ABI V2 registrations.
+	 * Legacy users like TCMalloc rely on the original ABI V1 behaviour
+	 * which updates IDs on every context switch.
+	 */
+	if (rseq_v2(t)) {
 		/*
-		 * Avoid a boat load of conditionals by using simple logic
-		 * to determine whether NOTIFY_RESUME needs to be raised.
+		 * Avoid a boat load of conditionals by using simple logic to
+		 * determine whether TIF_NOTIFY_RESUME or TIF_RSEQ needs to be
+		 * raised.
 		 *
-		 * It's required when the CPU or MM CID has changed or
-		 * the entry was from user space.
+		 * It's required when the CPU or MM CID has changed or the entry
+		 * was via interrupt from user space. ev->has_rseq does not have
+		 * to be evaluated here because rseq_v2() implies has_rseq.
 		 */
-		bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq;
+		bool raise = ev->user_irq | ev->ids_changed;
 
 		if (raise) {
 			ev->sched_switch = true;
···
 		}
 	} else {
 		if (ev->has_rseq) {
+			t->rseq.event.ids_changed = true;
 			t->rseq.event.sched_switch = true;
 			rseq_raise_notify_resume(t);
 		}
···
 
 static inline void rseq_reset(struct task_struct *t)
 {
+	/* Protect against preemption and membarrier IPI */
+	guard(irqsave)();
 	memset(&t->rseq, 0, sizeof(t->rseq));
 	t->rseq.ids.cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
 }
···
 }
 
 #else /* CONFIG_RSEQ */
+static inline bool rseq_v2(struct task_struct *t) { return false; }
 static inline void rseq_handle_slowpath(struct pt_regs *regs) { }
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_sched_switch_event(struct task_struct *t) { }
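The decision made in `rseq_sched_switch_event()` after this change can be modelled in plain C. The sketch below is a user-space illustration only: `struct ev_model` and the `model_*` functions are hypothetical names, `generic_entry` stands in for `IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)`, and the return value stands in for the kernel raising the notify-resume work.

```c
#include <stdbool.h>

/* Simplified stand-in for the relevant bits of struct rseq_event. */
struct ev_model {
	unsigned char has_rseq;	/* 0 = not registered, 1 = legacy, 2 = optimized V2 */
	bool user_irq;		/* entry was via interrupt from user space */
	bool ids_changed;	/* CPU or MM CID changed */
	bool sched_switch;	/* set when exit-to-user work is required */
};

/* Model of rseq_v2(): version above 1 and generic entry code in use. */
static bool model_rseq_v2(const struct ev_model *ev, bool generic_entry)
{
	return generic_entry && ev->has_rseq > 1;
}

/* Returns true when the model would raise the exit-to-user work. */
static bool model_sched_switch_event(struct ev_model *ev, bool generic_entry)
{
	if (model_rseq_v2(ev, generic_entry)) {
		/* V2: only act when something observable can have changed. */
		if (!(ev->user_irq || ev->ids_changed))
			return false;
		ev->sched_switch = true;
		return true;
	}

	/* Legacy or no registration. */
	if (!ev->has_rseq)
		return false;

	/* Legacy: force an ID rewrite on every context switch. */
	ev->ids_changed = true;
	ev->sched_switch = true;
	return true;
}
```

The asymmetry is the point of the fix: a legacy registration always gets its IDs rewritten, which is the behaviour that existing users such as TCMalloc rely on, while a V2 registration skips the work when nothing changed.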
+59 -63
include/linux/rseq_entry.h
···
 	t->rseq.slice.state.granted = false;
 }
 
+/*
+ * Open coded, so it can be invoked within a user access region.
+ *
+ * This clears the user space state of the time slice extensions field only when
+ * the task has registered the optimized RSEQ_ABI V2. Some legacy registrations,
+ * e.g. TCMalloc, have conflicting non-ABI fields in struct RSEQ, which would be
+ * overwritten by an unconditional write.
+ */
+#define rseq_slice_clear_user(rseq, efault)				\
+do {									\
+	if (rseq_slice_extension_enabled())				\
+		unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);	\
+} while (0)
+
 static __always_inline bool __rseq_grant_slice_extension(bool work_pending)
 {
 	struct task_struct *curr = current;
···
 static __always_inline bool rseq_arm_slice_extension_timer(void) { return false; }
 static __always_inline void rseq_slice_clear_grant(struct task_struct *t) { }
 static __always_inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) { return false; }
+#define rseq_slice_clear_user(rseq, efault)	do { } while (0)
 #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
 
 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
-bool rseq_debug_validate_ids(struct task_struct *t);
 
 static __always_inline void rseq_note_user_irq_entry(void)
 {
···
 	return false;
 }
 
-/*
- * On debug kernels validate that user space did not mess with it if the
- * debug branch is enabled.
- */
-bool rseq_debug_validate_ids(struct task_struct *t)
-{
-	struct rseq __user *rseq = t->rseq.usrptr;
-	u32 cpu_id, uval, node_id;
-
-	/*
-	 * On the first exit after registering the rseq region CPU ID is
-	 * RSEQ_CPU_ID_UNINITIALIZED and node_id in user space is 0!
-	 */
-	node_id = t->rseq.ids.cpu_id != RSEQ_CPU_ID_UNINITIALIZED ?
-		  cpu_to_node(t->rseq.ids.cpu_id) : 0;
-
-	scoped_user_read_access(rseq, efault) {
-		unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault);
-		if (cpu_id != t->rseq.ids.cpu_id)
-			goto die;
-		unsafe_get_user(uval, &rseq->cpu_id, efault);
-		if (uval != cpu_id)
-			goto die;
-		unsafe_get_user(uval, &rseq->node_id, efault);
-		if (uval != node_id)
-			goto die;
-		unsafe_get_user(uval, &rseq->mm_cid, efault);
-		if (uval != t->rseq.ids.mm_cid)
-			goto die;
-	}
-	return true;
-die:
-	t->rseq.event.fatal = true;
-efault:
-	return false;
-}
-
 #endif /* RSEQ_BUILD_SLOW_PATH */
 
 /*
···
  * faults in task context are fatal too.
  */
 static rseq_inline
-bool rseq_set_ids_get_csaddr(struct task_struct *t, struct rseq_ids *ids,
-			     u32 node_id, u64 *csaddr)
+bool rseq_set_ids_get_csaddr(struct task_struct *t, struct rseq_ids *ids, u64 *csaddr)
 {
 	struct rseq __user *rseq = t->rseq.usrptr;
 
-	if (static_branch_unlikely(&rseq_debug_enabled)) {
-		if (!rseq_debug_validate_ids(t))
-			return false;
-	}
-
 	scoped_user_rw_access(rseq, efault) {
+		/* Validate the R/O fields for debug and optimized mode */
+		if (static_branch_unlikely(&rseq_debug_enabled) || rseq_v2(t)) {
+			u32 cpu_id, uval;
+
+			unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault);
+			if (cpu_id != t->rseq.ids.cpu_id)
+				goto die;
+			unsafe_get_user(uval, &rseq->cpu_id, efault);
+			if (uval != cpu_id)
+				goto die;
+			unsafe_get_user(uval, &rseq->node_id, efault);
+			if (uval != t->rseq.ids.node_id)
+				goto die;
+			unsafe_get_user(uval, &rseq->mm_cid, efault);
+			if (uval != t->rseq.ids.mm_cid)
+				goto die;
+		}
+
 		unsafe_put_user(ids->cpu_id, &rseq->cpu_id_start, efault);
 		unsafe_put_user(ids->cpu_id, &rseq->cpu_id, efault);
-		unsafe_put_user(node_id, &rseq->node_id, efault);
+		unsafe_put_user(ids->node_id, &rseq->node_id, efault);
 		unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault);
 		if (csaddr)
 			unsafe_get_user(*csaddr, &rseq->rseq_cs, efault);
 
-		/* Open coded, so it's in the same user access region */
-		if (rseq_slice_extension_enabled()) {
-			/* Unconditionally clear it, no point in conditionals */
-			unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
-		}
+		/* RSEQ ABI V2 only operations */
+		if (rseq_v2(t))
+			rseq_slice_clear_user(rseq, efault);
 	}
 
 	rseq_slice_clear_grant(t);
 	/* Cache the new values */
-	t->rseq.ids.cpu_cid = ids->cpu_cid;
+	t->rseq.ids = *ids;
 	rseq_stat_inc(rseq_stats.ids);
 	rseq_trace_update(t, ids);
 	return true;
+
+die:
+	t->rseq.event.fatal = true;
 efault:
 	return false;
 }
···
  * is in a critical section.
  */
 static rseq_inline bool rseq_update_usr(struct task_struct *t, struct pt_regs *regs,
-					struct rseq_ids *ids, u32 node_id)
+					struct rseq_ids *ids)
 {
 	u64 csaddr;
 
-	if (!rseq_set_ids_get_csaddr(t, ids, node_id, &csaddr))
+	if (!rseq_set_ids_get_csaddr(t, ids, &csaddr))
 		return false;
 
 	/*
···
 	 * interrupts disabled
 	 */
 	guard(pagefault)();
+	/*
+	 * This optimization is only valid when the task registered for the
+	 * optimized RSEQ_ABI_V2 variant. Some legacy users rely on the original
+	 * RSEQ implementation behaviour which unconditionally updated the IDs.
+	 * rseq_sched_switch_event() ensures that legacy registrations always
+	 * have both sched_switch and ids_changed set, which is compatible with
+	 * the historical TIF_NOTIFY_RESUME behaviour.
+	 */
 	if (likely(!t->rseq.event.ids_changed)) {
 		struct rseq __user *rseq = t->rseq.usrptr;
 		/*
···
 		scoped_user_rw_access(rseq, efault) {
 			unsafe_get_user(csaddr, &rseq->rseq_cs, efault);
 
-			/* Open coded, so it's in the same user access region */
-			if (rseq_slice_extension_enabled()) {
-				/* Unconditionally clear it, no point in conditionals */
-				unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
-			}
+			/* RSEQ ABI V2 only operations */
+			if (rseq_v2(t))
+				rseq_slice_clear_user(rseq, efault);
 		}
 
 		rseq_slice_clear_grant(t);
···
 	}
 
 	struct rseq_ids ids = {
-		.cpu_id = task_cpu(t),
-		.mm_cid = task_mm_cid(t),
+		.cpu_id		= task_cpu(t),
+		.mm_cid		= task_mm_cid(t),
+		.node_id	= cpu_to_node(ids.cpu_id),
 	};
-	u32 node_id = cpu_to_node(ids.cpu_id);
 
-	return rseq_update_usr(t, regs, &ids, node_id);
+	return rseq_update_usr(t, regs, &ids);
 efault:
 	return false;
 }
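The read-only enforcement that this hunk folds into `rseq_set_ids_get_csaddr()` can be modelled without any kernel infrastructure. The following is a sketch under assumptions: the struct and function names are hypothetical, plain memory stands in for the user-space `struct rseq`, and the `validate` flag stands in for "debug branch enabled or `rseq_v2(t)`".

```c
#include <stdbool.h>
#include <stdint.h>

/* Values the kernel wrote last (model of struct rseq_ids). */
struct ids_model {
	uint32_t cpu_id;
	uint32_t node_id;
	uint32_t mm_cid;
};

/* The four user-visible read-only fields (model of the head of struct rseq). */
struct user_rseq_model {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint32_t node_id;
	uint32_t mm_cid;
};

enum update_result { UPDATE_OK, UPDATE_FATAL };

/*
 * Compare what user space currently holds against the cached values before
 * overwriting it. A mismatch means user space modified a read-only field,
 * which the kernel treats as fatal for the task.
 */
static enum update_result model_set_ids(struct user_rseq_model *u,
					struct ids_model *cached,
					const struct ids_model *next,
					bool validate)
{
	if (validate) {
		if (u->cpu_id_start != cached->cpu_id ||
		    u->cpu_id != u->cpu_id_start ||
		    u->node_id != cached->node_id ||
		    u->mm_cid != cached->mm_cid)
			return UPDATE_FATAL;
	}

	u->cpu_id_start = next->cpu_id;
	u->cpu_id = next->cpu_id;
	u->node_id = next->node_id;
	u->mm_cid = next->mm_cid;

	/* Cache the new values, including node_id as of this merge. */
	*cached = *next;
	return UPDATE_OK;
}
```

With `validate` false, which corresponds to legacy mode without the debug branch, a tampered `cpu_id_start` is silently overwritten; with `validate` true the same tampering is reported as fatal.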
+11 -2
include/linux/rseq_types.h
···
 #ifdef CONFIG_RSEQ
 struct rseq;
 
+/*
+ * rseq_event::has_rseq contains the ABI version number so preserving it
+ * in AND operations requires a mask.
+ */
+#define RSEQ_HAS_RSEQ_VERSION_MASK	0xff
+
 /**
  * struct rseq_event - Storage for rseq related event management
  * @all:		Compound to initialize and clear the data efficiently
···
  *			exit to user
  * @ids_changed:	Indicator that IDs need to be updated
  * @user_irq:		True on interrupt entry from user mode
- * @has_rseq:		True if the task has a rseq pointer installed
+ * @has_rseq:		Greater than 0 if the task has a rseq pointer installed.
+ *			Contains the RSEQ version number
  * @error:		Compound error code for the slow path to analyze
  * @fatal:		User space data corrupted or invalid
  * @slowpath:		Indicator that slow path processing via TIF_NOTIFY_RESUME
···
  *		compiler emit a single compare on 64-bit
  * @cpu_id:	The CPU ID which was written last to user space
  * @mm_cid:	The MM CID which was written last to user space
+ * @node_id:	The node ID which was written last to user space
  *
- * @cpu_id and @mm_cid are updated when the data is written to user space.
+ * @cpu_id, @mm_cid and @node_id are updated when the data is written to user space.
  */
 struct rseq_ids {
 	union {
···
 			u32	mm_cid;
 		};
 	};
+	u32		node_id;
 };
 
 /**
+4 -1
include/uapi/linux/rseq.h
···
 	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT	= 0,
 	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT	= 1,
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT	= 2,
-	/* (3) Intentional gap to put new bits into a separate byte */
+	/* (3) Intentional gap to keep new bits separate */
 
 	/* User read only feature flags */
 	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT	= 4,
···
 	 * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
 	 * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
 	 * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
+	 *
+	 * It is now used for feature status advertisement by the kernel.
+	 * See: enum rseq_cs_flags_bit for further information.
 	 */
 	__u32 flags;
 
+132 -82
kernel/rseq.c
···
 }
 __initcall(rseq_debugfs_init);
 
-static bool rseq_set_ids(struct task_struct *t, struct rseq_ids *ids, u32 node_id)
-{
-	return rseq_set_ids_get_csaddr(t, ids, node_id, NULL);
-}
-
 static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
 {
 	struct rseq __user *urseq = t->rseq.usrptr;
···
 static void rseq_slowpath_update_usr(struct pt_regs *regs)
 {
 	/*
-	 * Preserve rseq state and user_irq state. The generic entry code
-	 * clears user_irq on the way out, the non-generic entry
-	 * architectures are not having user_irq.
+	 * Preserve has_rseq and user_irq state. The generic entry code clears
+	 * user_irq on the way out, the non-generic entry architectures are not
+	 * setting user_irq.
 	 */
-	const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
+	const struct rseq_event evt_mask = {
+		.has_rseq	= RSEQ_HAS_RSEQ_VERSION_MASK,
+		.user_irq	= true,
+	};
 	struct task_struct *t = current;
 	struct rseq_ids ids;
-	u32 node_id;
 	bool event;
 
 	if (unlikely(t->flags & PF_EXITING))
···
 	if (!event)
 		return;
 
-	node_id = cpu_to_node(ids.cpu_id);
+	ids.node_id = cpu_to_node(ids.cpu_id);
 
-	if (unlikely(!rseq_update_usr(t, regs, &ids, node_id))) {
+	if (unlikely(!rseq_update_usr(t, regs, &ids))) {
 		/*
 		 * Clear the errors just in case this might survive magically, but
 		 * leave the rest intact.
···
 void __rseq_signal_deliver(int sig, struct pt_regs *regs)
 {
 	rseq_stat_inc(rseq_stats.signal);
+
 	/*
-	 * Don't update IDs, they are handled on exit to user if
+	 * Don't update IDs yet, they are handled on exit to user if
 	 * necessary. The important thing is to abort a critical section of
 	 * the interrupted context as after this point the instruction
 	 * pointer in @regs points to the signal handler.
···
 		current->rseq.event.error = 0;
 		force_sigsegv(sig);
 	}
+
+	/*
+	 * In legacy mode, force the update of IDs before returning to user
+	 * space to stay compatible.
+	 */
+	if (!rseq_v2(current))
+		rseq_force_update();
 }
 
 /*
···
 
 static bool rseq_reset_ids(void)
 {
-	struct rseq_ids ids = {
-		.cpu_id		= RSEQ_CPU_ID_UNINITIALIZED,
-		.mm_cid		= 0,
-	};
+	struct rseq __user *rseq = current->rseq.usrptr;
 
 	/*
 	 * If this fails, terminate it because this leaves the kernel in
 	 * stupid state as exit to user space will try to fixup the ids
 	 * again.
 	 */
-	if (rseq_set_ids(current, &ids, 0))
-		return true;
+	scoped_user_rw_access(rseq, efault) {
+		unsafe_put_user(0, &rseq->cpu_id_start, efault);
+		unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
+		unsafe_put_user(0, &rseq->node_id, efault);
+		unsafe_put_user(0, &rseq->mm_cid, efault);
+	}
+	return true;
 
+efault:
 	force_sig(SIGSEGV);
 	return false;
 }
···
 /* The original rseq structure size (including padding) is 32 bytes. */
 #define ORIG_RSEQ_SIZE		32
 
-/*
- * sys_rseq - setup restartable sequences for caller thread.
- */
-SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
+static long rseq_register(struct rseq __user * rseq, u32 rseq_len, int flags, u32 sig)
 {
 	u32 rseqfl = 0;
+	u8 version = 1;
 
-	if (flags & RSEQ_FLAG_UNREGISTER) {
-		if (flags & ~RSEQ_FLAG_UNREGISTER)
-			return -EINVAL;
-		/* Unregister rseq for current thread. */
-		if (current->rseq.usrptr != rseq || !current->rseq.usrptr)
-			return -EINVAL;
-		if (rseq_len != current->rseq.len)
-			return -EINVAL;
-		if (current->rseq.sig != sig)
-			return -EPERM;
-		if (!rseq_reset_ids())
-			return -EFAULT;
-		rseq_reset(current);
-		return 0;
-	}
-
-	if (unlikely(flags & ~(RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)))
-		return -EINVAL;
-
-	if (current->rseq.usrptr) {
-		/*
-		 * If rseq is already registered, check whether
-		 * the provided address differs from the prior
-		 * one.
-		 */
-		if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len)
-			return -EINVAL;
-		if (current->rseq.sig != sig)
-			return -EPERM;
-		/* Already registered. */
-		return -EBUSY;
-	}
-
-	/*
-	 * If there was no rseq previously registered, ensure the provided rseq
-	 * is properly aligned, as communcated to user-space through the ELF
-	 * auxiliary vector AT_RSEQ_ALIGN. If rseq_len is the original rseq
-	 * size, the required alignment is the original struct rseq alignment.
-	 *
-	 * The rseq_len is required to be greater or equal to the original rseq
-	 * size. In order to be valid, rseq_len is either the original rseq size,
-	 * or large enough to contain all supported fields, as communicated to
-	 * user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE.
-	 */
-	if (rseq_len < ORIG_RSEQ_SIZE ||
-	    (rseq_len == ORIG_RSEQ_SIZE && !IS_ALIGNED((unsigned long)rseq, ORIG_RSEQ_SIZE)) ||
-	    (rseq_len != ORIG_RSEQ_SIZE && (!IS_ALIGNED((unsigned long)rseq, rseq_alloc_align()) ||
-					    rseq_len < offsetof(struct rseq, end))))
-		return -EINVAL;
 	if (!access_ok(rseq, rseq_len))
 		return -EFAULT;
 
-	if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
-		rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
-		if (rseq_slice_extension_enabled() &&
-		    (flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON))
-			rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+	/*
+	 * Architectures, which use the generic IRQ entry code (at least) enable
+	 * registrations with a size greater than the original v1 fixed sized
+	 * @rseq_len, which has been validated already to utilize the optimized
+	 * v2 ABI mode which also enables extended RSEQ features beyond MMCID.
+	 */
+	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_len > ORIG_RSEQ_SIZE)
+		version = 2;
+
+	if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION) && version > 1) {
+		if (rseq_slice_extension_enabled()) {
+			rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+			if (flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)
+				rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+		}
 	}
 
 	scoped_user_write_access(rseq, efault) {
···
 		unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
 		unsafe_put_user(0U, &rseq->node_id, efault);
 		unsafe_put_user(0U, &rseq->mm_cid, efault);
-		unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+
+		/*
+		 * All fields past mm_cid are only valid for non-legacy v2
+		 * registrations.
+		 */
+		if (version > 1) {
+			if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+				unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+		}
 	}
 
 	/*
···
 #endif
 
 	/*
-	 * If rseq was previously inactive, and has just been
-	 * registered, ensure the cpu_id_start and cpu_id fields
-	 * are updated before returning to user-space.
+	 * Ensure the cpu_id_start and cpu_id fields are updated before
+	 * returning to user-space.
 	 */
-	current->rseq.event.has_rseq = true;
+	current->rseq.event.has_rseq = version;
 	rseq_force_update();
 	return 0;
 
 efault:
 	return -EFAULT;
+}
+
+static long rseq_unregister(struct rseq __user * rseq, u32 rseq_len, int flags, u32 sig)
+{
+	if (flags & ~RSEQ_FLAG_UNREGISTER)
+		return -EINVAL;
+	if (current->rseq.usrptr != rseq || !current->rseq.usrptr)
+		return -EINVAL;
+	if (rseq_len != current->rseq.len)
+		return -EINVAL;
+	if (current->rseq.sig != sig)
+		return -EPERM;
+	if (!rseq_reset_ids())
+		return -EFAULT;
+	rseq_reset(current);
+	return 0;
+}
+
+static long rseq_reregister(struct rseq __user * rseq, u32 rseq_len, u32 sig)
+{
+	/*
+	 * If rseq is already registered, check whether the provided address
+	 * differs from the prior one.
+	 */
+	if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len)
+		return -EINVAL;
+	if (current->rseq.sig != sig)
+		return -EPERM;
+	/* Already registered. */
+	return -EBUSY;
+}
+
+static bool rseq_length_valid(struct rseq __user *rseq, unsigned int rseq_len)
+{
+	/*
+	 * Ensure the provided rseq is properly aligned, as communicated to
+	 * user-space through the ELF auxiliary vector AT_RSEQ_ALIGN. If
+	 * rseq_len is the original rseq size, the required alignment is the
+	 * original struct rseq alignment.
+	 *
+	 * In order to be valid, rseq_len is either the original rseq size, or
+	 * large enough to contain all supported fields, as communicated to
+	 * user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE.
+	 */
+	if (rseq_len < ORIG_RSEQ_SIZE)
+		return false;
+
+	if (rseq_len == ORIG_RSEQ_SIZE)
+		return IS_ALIGNED((unsigned long)rseq, ORIG_RSEQ_SIZE);
+
+	return IS_ALIGNED((unsigned long)rseq, rseq_alloc_align()) &&
+		rseq_len >= offsetof(struct rseq, end);
+}
+
+#define RSEQ_FLAGS_SUPPORTED	(RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)
+
+/*
+ * sys_rseq - Register or unregister restartable sequences for the caller thread.
+ */
+SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
+{
+	if (flags & RSEQ_FLAG_UNREGISTER)
+		return rseq_unregister(rseq, rseq_len, flags, sig);
+
+	if (unlikely(flags & ~RSEQ_FLAGS_SUPPORTED))
+		return -EINVAL;
+
+	if (current->rseq.usrptr)
+		return rseq_reregister(rseq, rseq_len, sig);
+
+	if (!rseq_length_valid(rseq, rseq_len))
+		return -EINVAL;
+
+	return rseq_register(rseq, rseq_len, flags, sig);
 }
 
 #ifdef CONFIG_RSEQ_SLICE_EXTENSION
···
 			return -ENOTSUPP;
 		if (!current->rseq.usrptr)
 			return -ENXIO;
+		if (!rseq_v2(current))
+			return -ENOTSUPP;
 
 		/* No change? */
 		if (enable == !!current->rseq.slice.state.enabled)
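The length and alignment rules that this hunk moves into `rseq_length_valid()` are easy to check in isolation. The sketch below is a user-space model, assuming hypothetical parameter names: `alloc_align` stands for the value of `rseq_alloc_align()` (advertised to user space as `AT_RSEQ_ALIGN`) and `feature_end` stands for `offsetof(struct rseq, end)`. The numbers in the usage note are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Size of the original struct rseq including padding, as stated in the patch. */
#define ORIG_RSEQ_SIZE 32u

/* align must be a power of two. */
static bool is_aligned(uintptr_t addr, size_t align)
{
	return (addr & (align - 1)) == 0;
}

/*
 * Model of rseq_length_valid(): a registration is either exactly the
 * original size with the original alignment, or larger, aligned to the
 * advertised alignment and long enough to cover all supported fields.
 */
static bool model_rseq_length_valid(uintptr_t addr, uint32_t rseq_len,
				    size_t alloc_align, size_t feature_end)
{
	if (rseq_len < ORIG_RSEQ_SIZE)
		return false;

	if (rseq_len == ORIG_RSEQ_SIZE)
		return is_aligned(addr, ORIG_RSEQ_SIZE);

	return is_aligned(addr, alloc_align) && rseq_len >= feature_end;
}
```

For example, with an assumed alignment of 64 and an assumed feature end of 40, a 40-byte area at address 0x1040 passes, while the same area at 0x1020 fails the alignment check and a 36-byte area fails the length check.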
+37 -7
kernel/sched/fair.c
···
  *
  * lag_i >= 0 -> V >= v_i
  *
- *     \Sum (v_i - v)*w_i
- * V = ------------------ + v
+ *     \Sum (v_i - v0)*w_i
+ * V = ------------------- + v0
  *          \Sum w_i
  *
- * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
+ * lag_i >= 0 -> \Sum (v_i - v0)*w_i >= (v_i - v0)*(\Sum w_i)
  *
  * Note: using 'avg_vruntime() > se->vruntime' is inaccurate due
  *       to the loss in precision caused by the division.
···
 static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 {
 	struct sched_entity *curr = cfs_rq->curr;
-	s64 avg = cfs_rq->sum_w_vruntime;
+	s64 key, avg = cfs_rq->sum_w_vruntime;
 	long load = cfs_rq->sum_weight;
 
 	if (curr && curr->on_rq) {
···
 		load += weight;
 	}
 
-	return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
+	key = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime);
+
+	/*
+	 * The worst case term for @key includes 'NSEC_TICK * NICE_0_LOAD'
+	 * and @load obviously includes NICE_0_LOAD. NSEC_TICK is around 24
+	 * bits, while NICE_0_LOAD is 20 on 64bit and 10 otherwise.
+	 *
+	 * This gives that on 64bit the product will be at least 64bit which
+	 * overflows s64, while on 32bit it will only be 44bits and should fit
+	 * comfortably.
+	 */
+#ifdef CONFIG_64BIT
+#ifdef CONFIG_ARCH_SUPPORTS_INT128
+	/* This often results in simpler code than __builtin_mul_overflow(). */
+	return avg >= (__int128)key * load;
+#else
+	s64 rhs;
+	/*
+	 * On overflow, the sign of key tells us the correct answer: a large
+	 * positive key means vruntime >> V, so not eligible; a large negative
+	 * key means vruntime << V, so eligible.
+	 */
+	if (check_mul_overflow(key, load, &rhs))
+		return key <= 0;
+
+	return avg >= rhs;
+#endif
+#else /* 32bit */
+	return avg >= key * load;
+#endif
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
···
 
 	/*
 	 * Because p is enqueued, nse being null can only mean that we
-	 * dequeued a delayed task.
+	 * dequeued a delayed task. If there are still entities queued in
+	 * cfs, check if the next one will be p.
 	 */
-	if (!nse)
+	if (!nse && cfs_rq->nr_queued)
 		goto pick;
 
 	if (sched_feat(RUN_TO_PARITY))
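The two overflow-safe comparisons used in the `vruntime_eligible()` fix can be tried out in user space. This is a sketch, not the scheduler code: the function names are hypothetical, `__builtin_mul_overflow()` (a GCC and Clang builtin) stands in for the kernel's `check_mul_overflow()`, and the 128-bit variant is only built when the compiler provides `__int128`. As in the patch, `load` is assumed to be positive.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Checked variant: evaluate avg >= key * load without relying on the
 * product fitting into 64 bits. If the multiplication overflows, the sign
 * of key decides: a large positive key cannot be eligible, a large
 * negative key always is.
 */
static bool eligible_checked(int64_t avg, int64_t key, long load)
{
	int64_t rhs;

	if (__builtin_mul_overflow(key, (int64_t)load, &rhs))
		return key <= 0;

	return avg >= rhs;
}

#ifdef __SIZEOF_INT128__
/* Wide variant: do the multiplication and the comparison in 128 bits. */
static bool eligible_wide(int64_t avg, int64_t key, long load)
{
	return (__int128)avg >= (__int128)key * load;
}
#endif
```

Both variants agree whenever the 128-bit type is available; the checked one is the fallback for 64-bit targets without it.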
+10 -1
kernel/sched/membarrier.c
···
 	 * is negligible.
 	 */
 	smp_mb();
-	rseq_sched_switch_event(current);
+	/*
+	 * Legacy mode requires that IDs are written and the critical section is
+	 * evaluated. V2 optimized mode handles the critical section and IDs are
+	 * only updated if they change as a consequence of preemption after
+	 * return from this IPI.
+	 */
+	if (rseq_v2(current))
+		rseq_sched_switch_event(current);
+	else
+		rseq_force_update();
 }
 
 static void ipi_sync_rq_state(void *info)
+15 -6
tools/testing/selftests/rseq/Makefile
···
 # still track changes to header files and depend on shared object.
 OVERRIDE_TARGETS = 1
 
-TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
-		param_test_benchmark param_test_compare_twice param_test_mm_cid \
-		param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \
-		syscall_errors_test slice_test
+TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test \
+		param_test_benchmark param_test_mm_cid_benchmark
 
-TEST_GEN_PROGS_EXTENDED = librseq.so
+TEST_GEN_PROGS_EXTENDED = librseq.so \
+		param_test \
+		param_test_compare_twice \
+		param_test_mm_cid \
+		param_test_mm_cid_compare_twice \
+		syscall_errors_test \
+		legacy_check \
+		slice_test \
+		check_optimized
 
-TEST_PROGS = run_param_test.sh run_syscall_errors_test.sh
+TEST_PROGS = run_param_test.sh run_syscall_errors_test.sh run_legacy_check.sh run_timeslice_test.sh
 
 TEST_FILES := settings
 
···
 	$(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
 
 $(OUTPUT)/slice_test: slice_test.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h
+	$(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
+
+$(OUTPUT)/check_optimized: check_optimized.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h
 	$(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
+17
tools/testing/selftests/rseq/check_optimized.c
···
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/time.h>
+
+#include "rseq.h"
+
+int main(int argc, char **argv)
+{
+	if (__rseq_register_current_thread(true, false))
+		return -1;
+	return 0;
+}
+65
tools/testing/selftests/rseq/legacy_check.c
···
+// SPDX-License-Identifier: GPL-2.0
+#ifndef _GNU_SOURCE
+#define _GNU_SOURCE
+#endif
+
+#include <errno.h>
+#include <signal.h>
+#include <stdint.h>
+#include <unistd.h>
+
+#include "rseq.h"
+
+#include "../kselftest_harness.h"
+
+FIXTURE(legacy)
+{
+};
+
+static int cpu_id_in_sigfn = -1;
+
+static void sigfn(int sig)
+{
+	struct rseq_abi *rs = rseq_get_abi();
+
+	cpu_id_in_sigfn = rs->cpu_id_start;
+}
+
+FIXTURE_SETUP(legacy)
+{
+	int res = __rseq_register_current_thread(true, true);
+
+	switch (res) {
+	case -ENOSYS:
+		SKIP(return, "RSEQ not enabled\n");
+	case -EBUSY:
+		SKIP(return, "GLIBC owns RSEQ. Disable GLIBC RSEQ registration\n");
+	default:
+		ASSERT_EQ(res, 0);
+	}
+
+	ASSERT_NE(signal(SIGUSR1, sigfn), SIG_ERR);
+}
+
+FIXTURE_TEARDOWN(legacy)
+{
+}
+
+TEST_F(legacy, legacy_test)
+{
+	struct rseq_abi *rs = rseq_get_abi();
+
+	ASSERT_NE(rs, NULL);
+
+	/* Overwrite rs::cpu_id_start */
+	rs->cpu_id_start = -1;
+	sleep(1);
+	ASSERT_NE(rs->cpu_id_start, -1);
+
+	rs->cpu_id_start = -1;
+	ASSERT_EQ(raise(SIGUSR1), 0);
+	ASSERT_NE(rs->cpu_id_start, -1);
+	ASSERT_NE(cpu_id_in_sigfn, -1);
+}
+
+TEST_HARNESS_MAIN
+16 -9
tools/testing/selftests/rseq/param_test.c
··· 38 38 static int opt_yield, opt_signal, opt_sleep, 39 39 opt_disable_rseq, opt_threads = 200, 40 40 opt_disable_mod = 0, opt_test = 's'; 41 - 41 + static bool opt_rseq_legacy; 42 42 static long long opt_reps = 5000; 43 43 44 44 static __thread __attribute__((tls_model("initial-exec"))) ··· 281 281 } \ 282 282 } 283 283 284 + #define rseq_no_glibc true 285 + 284 286 #else 285 287 286 288 #define printf_verbose(fmt, ...) 289 + #define rseq_no_glibc false 287 290 288 291 #endif /* BENCHMARK */ 289 292 ··· 484 481 long long i, reps; 485 482 486 483 if (!opt_disable_rseq && thread_data->reg && 487 - rseq_register_current_thread()) 484 + __rseq_register_current_thread(rseq_no_glibc, opt_rseq_legacy)) 488 485 abort(); 489 486 reps = thread_data->reps; 490 487 for (i = 0; i < reps; i++) { ··· 561 558 long long i, reps; 562 559 563 560 if (!opt_disable_rseq && thread_data->reg && 564 - rseq_register_current_thread()) 561 + __rseq_register_current_thread(rseq_no_glibc, opt_rseq_legacy)) 565 562 abort(); 566 563 reps = thread_data->reps; 567 564 for (i = 0; i < reps; i++) { ··· 715 712 long long i, reps; 716 713 struct percpu_list *list = (struct percpu_list *)arg; 717 714 718 - if (!opt_disable_rseq && rseq_register_current_thread()) 715 + if (!opt_disable_rseq && __rseq_register_current_thread(rseq_no_glibc, opt_rseq_legacy)) 719 716 abort(); 720 717 721 718 reps = opt_reps; ··· 898 895 long long i, reps; 899 896 struct percpu_buffer *buffer = (struct percpu_buffer *)arg; 900 897 901 - if (!opt_disable_rseq && rseq_register_current_thread()) 898 + if (!opt_disable_rseq && __rseq_register_current_thread(rseq_no_glibc, opt_rseq_legacy)) 902 899 abort(); 903 900 904 901 reps = opt_reps; ··· 1108 1105 long long i, reps; 1109 1106 struct percpu_memcpy_buffer *buffer = (struct percpu_memcpy_buffer *)arg; 1110 1107 1111 - if (!opt_disable_rseq && rseq_register_current_thread()) 1108 + if (!opt_disable_rseq && __rseq_register_current_thread(rseq_no_glibc, opt_rseq_legacy)) 1112 
1109 abort(); 1113 1110 1114 1111 reps = opt_reps; ··· 1261 1258 const int iters = opt_reps; 1262 1259 int i; 1263 1260 1264 - if (rseq_register_current_thread()) { 1261 + if (__rseq_register_current_thread(rseq_no_glibc, opt_rseq_legacy)) { 1265 1262 fprintf(stderr, "Error: rseq_register_current_thread(...) failed(%d): %s\n", 1266 1263 errno, strerror(errno)); 1267 1264 abort(); ··· 1326 1323 intptr_t expect_a = 0, expect_b = 0; 1327 1324 int cpu_a = 0, cpu_b = 0; 1328 1325 1329 - if (rseq_register_current_thread()) { 1326 + if (__rseq_register_current_thread(rseq_no_glibc, opt_rseq_legacy)) { 1330 1327 fprintf(stderr, "Error: rseq_register_current_thread(...) failed(%d): %s\n", 1331 1328 errno, strerror(errno)); 1332 1329 abort(); ··· 1478 1475 printf(" [-D M] Disable rseq for each M threads\n"); 1479 1476 printf(" [-T test] Choose test: (s)pinlock, (l)ist, (b)uffer, (m)emcpy, (i)ncrement, membarrie(r)\n"); 1480 1477 printf(" [-M] Push into buffer and memcpy buffer with memory barriers.\n"); 1478 + printf(" [-O] Test with optimized RSEQ\n"); 1481 1479 printf(" [-v] Verbose output.\n"); 1482 1480 printf(" [-h] Show this help.\n"); 1483 1481 printf("\n"); ··· 1606 1602 case 'M': 1607 1603 opt_mo = RSEQ_MO_RELEASE; 1608 1604 break; 1605 + case 'L': 1606 + opt_rseq_legacy = true; 1607 + break; 1609 1608 default: 1610 1609 show_usage(argc, argv); 1611 1610 goto error; ··· 1625 1618 if (set_signal_handler()) 1626 1619 goto error; 1627 1620 1628 - if (!opt_disable_rseq && rseq_register_current_thread()) 1621 + if (!opt_disable_rseq && __rseq_register_current_thread(rseq_no_glibc, opt_rseq_legacy)) 1629 1622 goto error; 1630 1623 if (!opt_disable_rseq && !rseq_validate_cpu_id()) { 1631 1624 fprintf(stderr, "Error: cpu id getter unavailable\n");
+6 -1
tools/testing/selftests/rseq/rseq-abi.h
··· 192 192 struct rseq_abi_slice_ctrl slice_ctrl; 193 193 194 194 /* 195 + * Place holder to push the size above 32 bytes. 196 + */ 197 + __u8 __reserved; 198 + 199 + /* 195 200 * Flexible array member at end of structure, after last feature field. 196 201 */ 197 202 char end[]; 198 - } __attribute__((aligned(4 * sizeof(__u64)))); 203 + } __attribute__((aligned(256))); 199 204 200 205 #endif /* _RSEQ_ABI_H */
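The effect of the one-byte placeholder and the larger alignment in the rseq-abi.h hunk above can be illustrated with a stand-in structure. This is a sketch under stated assumptions: `struct demo_area` is hypothetical and does not reproduce the real `struct rseq_abi` layout. It only assumes that the preceding fields fill exactly the 32 legacy bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in, NOT the real struct rseq_abi.
 * Assumption: the fields before the placeholder occupy exactly 32 bytes.
 */
struct demo_area {
	uint8_t legacy_fields[32];	/* stands in for the original fields */
	uint8_t reserved;		/* placeholder pushing the size above 32 */
	char end[];			/* flexible array member after the last field */
} __attribute__((aligned(256)));

/* The offset of 'end' plays the role of the feature size. */
size_t demo_feature_size(void)
{
	return offsetof(struct demo_area, end);
}

/* The allocation is padded up to the requested alignment. */
size_t demo_alloc_size(void)
{
	return sizeof(struct demo_area);
}
```

With these assumptions `demo_feature_size()` is 33, which is above the 32-byte legacy size, while `demo_alloc_size()` is 256 because the structure size is rounded up to its alignment.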
+18 -21
tools/testing/selftests/rseq/rseq.c
··· 56 56 * unsuccessful. 57 57 */ 58 58 unsigned int rseq_size = -1U; 59 + static unsigned int rseq_alloc_size; 59 60 60 61 /* Flags used during rseq registration. */ 61 62 unsigned int rseq_flags; ··· 116 115 } 117 116 } 118 117 119 - /* The rseq areas need to be at least 32 bytes. */ 120 - static 121 - unsigned int get_rseq_min_alloc_size(void) 122 - { 123 - unsigned int alloc_size = rseq_size; 124 - 125 - if (alloc_size < ORIG_RSEQ_ALLOC_SIZE) 126 - alloc_size = ORIG_RSEQ_ALLOC_SIZE; 127 - return alloc_size; 128 - } 129 - 130 118 /* 131 119 * Return the feature size supported by the kernel. 132 120 * 133 121 * Depending on the value returned by getauxval(AT_RSEQ_FEATURE_SIZE): 134 122 * 135 - * 0: Return ORIG_RSEQ_FEATURE_SIZE (20) 123 + * 0: Return ORIG_RSEQ_FEATURE_SIZE (20) 136 124 * > 0: Return the value from getauxval(AT_RSEQ_FEATURE_SIZE). 137 125 * 138 126 * It should never return a value below ORIG_RSEQ_FEATURE_SIZE. 139 127 */ 140 - static 141 - unsigned int get_rseq_kernel_feature_size(void) 128 + static unsigned int get_rseq_kernel_feature_size(void) 142 129 { 143 130 unsigned long auxv_rseq_feature_size, auxv_rseq_align; 144 131 ··· 141 152 return ORIG_RSEQ_FEATURE_SIZE; 142 153 } 143 154 144 - int rseq_register_current_thread(void) 155 + int __rseq_register_current_thread(bool nolibc, bool legacy) 145 156 { 157 + unsigned int size; 146 158 int rc; 147 159 148 160 if (!rseq_ownership) { 149 161 /* Treat libc's ownership as a successful registration. */ 150 - return 0; 162 + return nolibc ? 
-EBUSY : 0; 151 163 } 152 - rc = sys_rseq(&__rseq.abi, get_rseq_min_alloc_size(), 0, RSEQ_SIG); 164 + 165 + /* The minimal allocation size is 32, which is the legacy allocation size */ 166 + size = get_rseq_kernel_feature_size(); 167 + if (legacy || size < ORIG_RSEQ_ALLOC_SIZE) 168 + rseq_alloc_size = ORIG_RSEQ_ALLOC_SIZE; 169 + else 170 + rseq_alloc_size = size; 171 + 172 + rc = sys_rseq(&__rseq.abi, rseq_alloc_size, 0, RSEQ_SIG); 153 173 if (rc) { 154 174 /* 155 175 * After at least one thread has registered successfully ··· 177 179 * The first thread to register sets the rseq_size to mimic the libc 178 180 * behavior. 179 181 */ 180 - if (RSEQ_READ_ONCE(rseq_size) == 0) { 181 - RSEQ_WRITE_ONCE(rseq_size, get_rseq_kernel_feature_size()); 182 - } 182 + if (RSEQ_READ_ONCE(rseq_size) == 0) 183 + RSEQ_WRITE_ONCE(rseq_size, size); 183 184 184 185 return 0; 185 186 } ··· 191 194 /* Treat libc's ownership as a successful unregistration. */ 192 195 return 0; 193 196 } 194 - rc = sys_rseq(&__rseq.abi, get_rseq_min_alloc_size(), RSEQ_ABI_FLAG_UNREGISTER, RSEQ_SIG); 197 + rc = sys_rseq(&__rseq.abi, rseq_alloc_size, RSEQ_ABI_FLAG_UNREGISTER, RSEQ_SIG); 195 198 if (rc) 196 199 return -1; 197 200 return 0;
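The size selection added to __rseq_register_current_thread() above reduces to a small pure function. The sketch below uses hypothetical names (`demo_pick_alloc_size`, `DEMO_ORIG_ALLOC_SIZE`). In the selftest the feature size comes from get_rseq_kernel_feature_size(), here it is passed in as a plain argument so the logic can be checked in isolation.

```c
#include <assert.h>
#include <stdbool.h>

/* Legacy registration size (32 bytes); the macro name is hypothetical. */
#define DEMO_ORIG_ALLOC_SIZE 32U

/*
 * Pick the size passed to the rseq registration:
 * - legacy mode always registers with the original 32 bytes
 * - otherwise the kernel feature size is used, but never less than 32
 */
unsigned int demo_pick_alloc_size(unsigned int feature_size, bool legacy)
{
	if (legacy || feature_size < DEMO_ORIG_ALLOC_SIZE)
		return DEMO_ORIG_ALLOC_SIZE;

	return feature_size;
}

/* Small self-check for the three interesting cases. */
void demo_pick_alloc_size_check(void)
{
	assert(demo_pick_alloc_size(20, false) == 32);	/* below the minimum */
	assert(demo_pick_alloc_size(40, false) == 40);	/* optimized mode */
	assert(demo_pick_alloc_size(40, true) == 32);	/* forced legacy mode */
}
```

Because the same stored size is reused for unregistration in the diff, register and unregister always agree on the length.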
+7 -1
tools/testing/selftests/rseq/rseq.h
···
 #ifndef RSEQ_H
 #define RSEQ_H
 
+#include <assert.h>
 #include <stdint.h>
 #include <stdbool.h>
 #include <pthread.h>
···
  * succeed. A restartable sequence executed from a non-registered
  * thread will always fail.
  */
-int rseq_register_current_thread(void);
+int __rseq_register_current_thread(bool nolibc, bool legacy);
+
+static inline int rseq_register_current_thread(void)
+{
+	return __rseq_register_current_thread(false, false);
+}
 
 /*
  * Unregister rseq for current thread.
+4
tools/testing/selftests/rseq/run_legacy_check.sh
···
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+GLIBC_TUNABLES="${GLIBC_TUNABLES:-}:glibc.pthread.rseq=0" ./legacy_check
+39
tools/testing/selftests/rseq/run_param_test.sh
··· 34 34 SLOW_REPS=100 35 35 NR_THREADS=$((6*${NR_CPUS})) 36 36 37 + # Prevent GLIBC from registering RSEQ so the selftest can run in legacy and 38 + # performance optimized mode. 39 + GLIBC_TUNABLES="${GLIBC_TUNABLES:-}:glibc.pthread.rseq=0" 40 + export GLIBC_TUNABLES 41 + 37 42 function do_tests() 38 43 { 39 44 local i=0 ··· 108 103 NR_LOOPS= 109 104 } 110 105 106 + echo "Testing in legacy RSEQ mode" 107 + echo "Yield injection (25%)" 108 + inject_blocking -m 4 -y -L 109 + 110 + echo "Yield injection (50%)" 111 + inject_blocking -m 2 -y -L 112 + 113 + echo "Yield injection (100%)" 114 + inject_blocking -m 1 -y -L 115 + 116 + echo "Kill injection (25%)" 117 + inject_blocking -m 4 -k -L 118 + 119 + echo "Kill injection (50%)" 120 + inject_blocking -m 2 -k -L 121 + 122 + echo "Kill injection (100%)" 123 + inject_blocking -m 1 -k -L 124 + 125 + echo "Sleep injection (1ms, 25%)" 126 + inject_blocking -m 4 -s 1 -L 127 + 128 + echo "Sleep injection (1ms, 50%)" 129 + inject_blocking -m 2 -s 1 -L 130 + 131 + echo "Sleep injection (1ms, 100%)" 132 + inject_blocking -m 1 -s 1 -L 133 + 134 + ./check_optimized || { 135 + echo "Skipping optimized RSEQ mode test. Not supported"; 136 + exit 0 137 + } 138 + 139 + echo "Testing in optimized RSEQ mode" 111 140 echo "Yield injection (25%)" 112 141 inject_blocking -m 4 -y 113 142
+14
tools/testing/selftests/rseq/run_timeslice_test.sh
···
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0+
+
+# Prevent GLIBC from registering RSEQ so the selftest can run in legacy
+# and performance optimized mode.
+GLIBC_TUNABLES="${GLIBC_TUNABLES:-}:glibc.pthread.rseq=0"
+export GLIBC_TUNABLES
+
+./check_optimized || {
+	echo "Skipping optimized RSEQ mode test. Not supported";
+	exit 0
+}
+
+./slice_test
+7 -5
tools/testing/selftests/rseq/slice_test.c
··· 124 124 { 125 125 cpu_set_t affinity; 126 126 127 + if (__rseq_register_current_thread(true, false)) 128 + SKIP(return, "RSEQ not supported\n"); 129 + 130 + if (prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET, 131 + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0)) 132 + SKIP(return, "Time slice extension not supported\n"); 133 + 127 134 ASSERT_EQ(sched_getaffinity(0, sizeof(affinity), &affinity), 0); 128 135 129 136 /* Pin it on a single CPU. Avoid CPU 0 */ ··· 143 136 ASSERT_EQ(sched_setaffinity(0, sizeof(affinity), &affinity), 0); 144 137 break; 145 138 } 146 - 147 - ASSERT_EQ(rseq_register_current_thread(), 0); 148 - 149 - ASSERT_EQ(prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET, 150 - PR_RSEQ_SLICE_EXT_ENABLE, 0, 0), 0); 151 139 152 140 self->noise_params.noise_nsecs = variant->noise_nsecs; 153 141 self->noise_params.sleep_nsecs = variant->sleep_nsecs;