Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

rseq: Add fields and constants for time slice extension

Aside from a Kconfig knob, add the following items:

- Two flag bits for the rseq user space ABI, which allow user space to
query the availability and enablement without a syscall.

- A new member to the user space ABI struct rseq, which is going to be
used to communicate request and grant between kernel and user space.

- A rseq state struct to hold the kernel state of this mechanism

- Documentation of the new mechanism

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.669472597@linutronix.de

Authored by Thomas Gleixner and committed by Peter Zijlstra
d7a5da7a 4fe82cf3

+220 -1
+1
Documentation/userspace-api/index.rst
···
    ebpf/index
    ioctl/index
    mseal
+   rseq
 
 Security-related interfaces
 ===========================
+135
Documentation/userspace-api/rseq.rst
···
+=====================
+Restartable Sequences
+=====================
+
+Restartable Sequences allow registering a per thread userspace memory area
+to be used as an ABI between kernel and userspace for three purposes:
+
+ * userspace restartable sequences
+
+ * quick access to read the current CPU number, node ID from userspace
+
+ * scheduler time slice extensions
+
+Restartable sequences (per-cpu atomics)
+---------------------------------------
+
+Restartable sequences allow userspace to perform update operations on
+per-cpu data without requiring heavyweight atomic operations. The actual
+ABI is unfortunately only available in the code and selftests.
+
+Quick access to CPU number, node ID
+-----------------------------------
+
+Allows per CPU data to be implemented efficiently. Documentation is in code
+and selftests. :(
+
+Scheduler time slice extensions
+-------------------------------
+
+This allows a thread to request a time slice extension when it enters a
+critical section to avoid contention on a resource when the thread is
+scheduled out inside of the critical section.
+
+The prerequisites for this functionality are:
+
+ * Enabled in Kconfig
+
+ * Enabled at boot time (default is enabled)
+
+ * A rseq userspace pointer has been registered for the thread
+
+The thread has to enable the functionality via prctl(2)::
+
+    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
+
+prctl() returns 0 on success or otherwise with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL    Functionality not available or invalid function arguments.
+          Note: arg4 and arg5 must be zero
+ENOTSUPP  Functionality was disabled on the kernel command line
+ENXIO     Available, but no rseq user struct registered
+========= ==============================================================
+
+The state can also be queried via prctl(2)::
+
+    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
+
+prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
+disabled. Otherwise it returns with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL    Functionality not available or invalid function arguments.
+          Note: arg3, arg4 and arg5 must be zero
+========= ==============================================================
+
+The availability and status are also exposed via the rseq ABI struct flags
+field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
+``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
+space and only for informational purposes.
+
+If the mechanism was enabled via prctl(), the thread can request a time
+slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
+interrupted and the interrupt results in a reschedule request in the
+kernel, then the kernel can grant a time slice extension and return to
+userspace instead of scheduling out. The length of the extension is
+determined by the ``rseq_slice_extension_nsec`` sysctl.
+
+The kernel indicates the grant by clearing rseq::slice_ctrl::request and
+setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
+thread after granting the extension, the kernel clears the granted bit to
+indicate that to userspace.
+
+If the request bit is still set when leaving the critical section,
+userspace can clear it and continue.
+
+If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
+leaving the critical section to relinquish the CPU. The kernel enforces
+this by arming a timer to prevent misbehaving userspace from abusing this
+mechanism.
+
+If both the request bit and the granted bit are false when leaving the
+critical section, then this indicates that a grant was revoked and no
+further action is required by userspace.
+
+The required code flow is as follows::
+
+    rseq->slice_ctrl.request = 1;
+    barrier();  // Prevent compiler reordering
+    critical_section();
+    barrier();  // Prevent compiler reordering
+    rseq->slice_ctrl.request = 0;
+    if (rseq->slice_ctrl.granted)
+        rseq_slice_yield();
+
+As all of this is strictly CPU local, there are no atomicity requirements.
+Checking the granted state is racy, but that cannot be avoided at all::
+
+    if (rseq->slice_ctrl.granted)
+      -> Interrupt results in schedule and grant revocation
+        rseq_slice_yield();
+
+So there is no point in pretending that this might be solved by an atomic
+operation.
+
+If the thread issues a syscall other than rseq_slice_yield(2) within the
+granted timeslice extension, the grant is also revoked and the CPU is
+relinquished immediately when entering the kernel. This is required as
+syscalls might consume arbitrary CPU time until they reach a scheduling
+point when the preemption model is either NONE or VOLUNTARY and therefore
+might exceed the grant by far.
+
+The preferred solution for user space is to use rseq_slice_yield(2) which
+is side effect free. The support for arbitrary syscalls is required to
+support onion layer architectured applications, where the code handling the
+critical section and requesting the time slice extension has no control
+over the code within the critical section.
+
+The kernel enforces flag consistency and terminates the thread with SIGSEGV
+if it detects a violation.
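The request/yield flow documented in rseq.rst above can be sketched as a pair of user space helpers. This is only a sketch under stated assumptions, not an API from this commit or from any library: struct rseq_slice_ctrl is mirrored locally from the uapi change in this patch, the names slice_ext_enter(), slice_ext_leave() and slice_yield_fn are hypothetical, and count_yield() is a stub standing in for rseq_slice_yield(2), which this commit does not define.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the uapi layout added by this commit (assumption: same layout). */
struct rseq_slice_ctrl {
	union {
		uint32_t all;
		struct {
			uint8_t  request;
			uint8_t  granted;
			uint16_t __reserved;
		};
	};
};

static_assert(sizeof(struct rseq_slice_ctrl) == sizeof(uint32_t),
	      "control word must stay 32 bits wide");

/* Compiler barrier only; the documentation states no atomics are needed. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical callback type standing in for rseq_slice_yield(2). */
typedef void (*slice_yield_fn)(void);

/* Hypothetical stub: counts invocations instead of issuing the syscall. */
static int yield_calls;
static void count_yield(void)
{
	yield_calls++;
}

/* Announce the start of a critical section by setting the request byte. */
static inline void slice_ext_enter(volatile struct rseq_slice_ctrl *ctrl)
{
	ctrl->request = 1;
	barrier();
}

/*
 * Leave the critical section: clear the request byte, then yield if the
 * kernel granted an extension. Returns 1 if the yield callback was
 * invoked, 0 otherwise. The check is racy by design, as documented.
 */
static inline int slice_ext_leave(volatile struct rseq_slice_ctrl *ctrl,
				  slice_yield_fn yield)
{
	barrier();
	ctrl->request = 0;
	if (ctrl->granted) {
		yield();
		return 1;
	}
	return 0;
}
```

In a real program the ctrl pointer would refer to the slice_ctrl member of the thread's registered struct rseq, and the callback would wrap the actual yield syscall. The volatile qualifier is a choice made for this sketch so that the granted byte is re-read after the critical section.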
+27 -1
include/linux/rseq_types.h
···
 };
 
 /**
+ * union rseq_slice_state - Status information for rseq time slice extension
+ * @state:	Compound to access the overall state
+ * @enabled:	Time slice extension is enabled for the task
+ * @granted:	Time slice extension was granted to the task
+ */
+union rseq_slice_state {
+	u16 state;
+	struct {
+		u8 enabled;
+		u8 granted;
+	};
+};
+
+/**
+ * struct rseq_slice - Status information for rseq time slice extension
+ * @state:	Time slice extension state
+ */
+struct rseq_slice {
+	union rseq_slice_state state;
+};
+
+/**
  * struct rseq_data - Storage for all rseq related data
  * @usrptr:	Pointer to the registered user space RSEQ memory
  * @len:	Length of the RSEQ region
- * @sig:	Signature of critial section abort IPs
+ * @sig:	Signature of critical section abort IPs
  * @event:	Storage for event management
  * @ids:	Storage for cached CPU ID and MM CID
+ * @slice:	Storage for time slice extension data
  */
 struct rseq_data {
 	struct rseq __user *usrptr;
···
 	u32 sig;
 	struct rseq_event event;
 	struct rseq_ids ids;
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+	struct rseq_slice slice;
+#endif
 };
 
 #else /* CONFIG_RSEQ */
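The compound @state member of union rseq_slice_state lets both status bytes be read or written in one access. The following user space mirror illustrates that idea. It is a sketch under assumptions: uint16_t and uint8_t replace the kernel's u16 and u8, and the helpers slice_state_reset() and slice_state_any() are hypothetical names; this commit does not show how the kernel actually uses the compound.

```c
#include <assert.h>
#include <stdint.h>

/* User space mirror of the kernel-internal union (assumption: same layout). */
union rseq_slice_state {
	uint16_t state;
	struct {
		uint8_t enabled;
		uint8_t granted;
	};
};

static_assert(sizeof(union rseq_slice_state) == sizeof(uint16_t),
	      "compound must cover both status bytes exactly");

/* Hypothetical helper: clear enabled and granted with a single store. */
static inline void slice_state_reset(union rseq_slice_state *s)
{
	s->state = 0;
}

/* Hypothetical helper: true if either status byte is non-zero. */
static inline int slice_state_any(const union rseq_slice_state *s)
{
	return s->state != 0;
}
```

The sketch deliberately avoids comparing state against a specific non-zero value, because which byte ends up in the low half of the compound depends on the byte order of the machine.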
+38
include/uapi/linux/rseq.h
···
 };
 
 enum rseq_cs_flags_bit {
+	/* Historical and unsupported bits */
 	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
 	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
+	/* (3) Intentional gap to put new bits into a separate byte */
+
+	/* User read only feature flags */
+	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
+	RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
 };
 
 enum rseq_cs_flags {
···
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+
+	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =
+		(1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
+	RSEQ_CS_FLAG_SLICE_EXT_ENABLED =
+		(1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
 };
 
 /*
···
 	__u64 post_commit_offset;
 	__u64 abort_ip;
 } __attribute__((aligned(4 * sizeof(__u64))));
+
+/**
+ * rseq_slice_ctrl - Time slice extension control structure
+ * @all:	Compound value
+ * @request:	Request for a time slice extension
+ * @granted:	Granted time slice extension
+ *
+ * @request is set by user space and can be cleared by user space or kernel
+ * space. @granted is set and cleared by the kernel and must only be read
+ * by user space.
+ */
+struct rseq_slice_ctrl {
+	union {
+		__u32 all;
+		struct {
+			__u8 request;
+			__u8 granted;
+			__u16 __reserved;
+		};
+	};
+};
 
 /*
  * struct rseq is aligned on 4 * 8 bytes to ensure it is always
···
 	 * (allocated uniquely within a memory map).
 	 */
 	__u32 mm_cid;
+
+	/*
+	 * Time slice extension control structure. CPU local updates from
+	 * kernel and user space.
+	 */
+	struct rseq_slice_ctrl slice_ctrl;
 
 	/*
 	 * Flexible array member at end of structure, after last feature field.
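The two new read-only feature flags let user space check availability and enablement without a syscall. A minimal sketch of such a check follows. The bit numbers are copied from the uapi change above, while the helper names slice_ext_available() and slice_ext_enabled() are hypothetical, and the flags argument is assumed to be a value already read from the flags field of the thread's registered struct rseq.

```c
#include <assert.h>
#include <stdint.h>

/* Bit numbers as added to enum rseq_cs_flags_bit by this commit. */
enum {
	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
	RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT   = 5,
};

#define RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE \
	(1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT)
#define RSEQ_CS_FLAG_SLICE_EXT_ENABLED \
	(1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT)

static_assert(RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE == 0x10U, "bit 4");
static_assert(RSEQ_CS_FLAG_SLICE_EXT_ENABLED == 0x20U, "bit 5");

/* Hypothetical helper: was the kernel built with slice extension support? */
static inline int slice_ext_available(uint32_t flags)
{
	return (flags & RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE) != 0;
}

/* Hypothetical helper: is the extension enabled for this thread? */
static inline int slice_ext_enabled(uint32_t flags)
{
	return (flags & RSEQ_CS_FLAG_SLICE_EXT_ENABLED) != 0;
}
```

Since the documentation describes these bits as informational only, a caller would treat them as a cheap hint and still rely on the prctl(2) return value when enabling the feature.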
+12
init/Kconfig
···
 
 	  If unsure, say Y.
 
+config RSEQ_SLICE_EXTENSION
+	bool "Enable rseq-based time slice extension mechanism"
+	depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+	help
+	  Allows userspace to request a limited time slice extension when
+	  returning from an interrupt to user space via the RSEQ shared
+	  data ABI. If granted, that allows to complete a critical section,
+	  so that other threads are not stuck on a conflicted resource,
+	  while the task is scheduled out.
+
+	  If unsure, say N.
+
 config RSEQ_STATS
 	default n
 	bool "Enable lightweight statistics of restartable sequences" if EXPERT
+7
kernel/rseq.c
···
  */
 SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
 {
+	u32 rseqfl = 0;
+
 	if (flags & RSEQ_FLAG_UNREGISTER) {
 		if (flags & ~RSEQ_FLAG_UNREGISTER)
 			return -EINVAL;
···
 	if (!access_ok(rseq, rseq_len))
 		return -EFAULT;
 
+	if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+		rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+
 	scoped_user_write_access(rseq, efault) {
 		/*
 		 * If the rseq_cs pointer is non-NULL on registration, clear it to
···
 		 * clearing the fields. Don't bother reading it, just reset it.
 		 */
 		unsafe_put_user(0UL, &rseq->rseq_cs, efault);
+		unsafe_put_user(rseqfl, &rseq->flags, efault);
 		/* Initialize IDs in user space */
 		unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault);
 		unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
 		unsafe_put_user(0U, &rseq->node_id, efault);
 		unsafe_put_user(0U, &rseq->mm_cid, efault);
+		unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
 	}
 
 	/*