
randomize_kstack: Unify random source across arches

Previously, different architectures used random sources of differing
strength and cost to choose the random kstack offset. A number of
architectures (loongarch, powerpc, s390, x86) used their timestamp
counter, at whatever frequency it happened to run. Other arches
(arm64, riscv) used entropy from the crng via get_random_u16().

There have been concerns that in some cases the timestamp counters may
be too weak, because they can be easily guessed or influenced by user
space. And get_random_u16() has been shown to be too costly for the
level of protection kstack offset randomization provides.

So let's use a common, architecture-agnostic source of entropy: a
per-cpu prng, seeded at boot time from the crng. This has a few
benefits (a condensed sketch of the new flow follows the list):

- We can remove choose_random_kstack_offset(); it was only there to
try to make the timestamp counter value a bit harder to influence
from user space [*].

- The architecture code is simplified. All it has to do now is call
add_random_kstack_offset() in the syscall path.

- The strength of the randomness can be reasoned about independently
of the architecture.

- Arches previously using get_random_u16() now have much faster
syscall paths; see the performance results below.
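
To make the new flow concrete, here is a condensed sketch of the
per-cpu prng path, paraphrasing the include/linux/randomize_kstack.h
hunk in the diff below (in the real code the entry-side bump is the
add_random_kstack_offset() macro guarded by a static branch, so the
alloca lands in the syscall frame itself):

  /* Per-cpu prng state, seeded from the crng at boot (see init/main.c below). */
  DECLARE_PER_CPU(struct rnd_state, kstack_rnd_state);

  static __always_inline u32 get_kstack_offset(void)
  {
          struct rnd_state *state;
          u32 rnd;

          /* get_cpu_var() disables preemption around the per-cpu access. */
          state = &get_cpu_var(kstack_rnd_state);
          rnd = prandom_u32_state(state);
          put_cpu_var(kstack_rnd_state);

          return rnd;
  }

  /* On syscall entry (the body of add_random_kstack_offset()): */
  {
          u32 offset = get_kstack_offset();
          u8 *ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset));

          /* Keep the allocation even after "ptr" loses scope. */
          asm volatile("" :: "r"(ptr) : "memory");
  }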

[*] Additionally, this gets rid of some redundant work on s390 and x86.
Before this patch, those architectures called
choose_random_kstack_offset() under arch_exit_to_user_mode_prepare(),
which is also called for exception returns to userspace that are *not*
syscalls (e.g. regular interrupts). Getting rid of
choose_random_kstack_offset() avoids a small amount of redundant work
for the non-syscall cases.

In some configurations, add_random_kstack_offset() will now call
instrumentable code, so for a couple of arches, I have moved the call a
bit later to the first point where instrumentation is allowed. This
doesn't impact the efficacy of the mechanism.
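
On x86_64, for instance, the call now sits just inside
instrumentation_begin() (abridged from the arch/x86/entry/syscall_64.c
hunk below; the dispatch body is elided here):

  __visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)
  {
          nr = syscall_enter_from_user_mode(regs, nr);

          instrumentation_begin();
          /* First point in the noinstr entry path where instrumentable
           * code (such as the prng helpers) is permitted. */
          add_random_kstack_offset();
          /* ... syscall dispatch elided ... */
  }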

There have been some claims that a prng may be weaker than the
timestamp counter if not regularly reseeded. But the prng has a period
of about 2^113, so as long as the prng state remains secret, its
output should not be practical to predict. If the prng state can be
accessed, we have bigger problems.

Additionally, we are only consuming 6 bits to randomize the stack, so
there are only 64 possible random offsets. I assert that it would be
trivial for an attacker to brute-force this by repeating their attack
until the random stack offset lands on the desired value. The prng
approach seems entirely proportional to this level of protection.
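
For concreteness, the arithmetic behind the 64-offset figure can be
checked with a trivial user-space sketch (the mask mirrors
KSTACK_OFFSET_MAX() from the hunks below; the 16-byte-alignment
assumption comes from the comments this patch removes):

  #include <stdio.h>

  /* Mirrors KSTACK_OFFSET_MAX() from include/linux/randomize_kstack.h:
   * the raw prng output is masked down to at most 1020 bytes. */
  #define KSTACK_OFFSET_MAX(x)    ((x) & 0b1111111100)

  int main(void)
  {
          unsigned int max = KSTACK_OFFSET_MAX(0xffffffffu);

          /* 16-byte stack alignment discards SP[3:0], leaving SP[9:4]:
           * 6 usable bits, i.e. 2^6 = 64 distinct stack offsets. */
          printf("max offset: %u bytes\n", max);                    /* 1020 */
          printf("distinct aligned offsets: %u\n", (max >> 4) + 1); /* 64 */
          return 0;
  }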

Performance data are provided below. The baseline is v6.18 with kstack
offset randomization ("rndstack") turned on for each respective arch.
(I)/(R) indicate a statistically significant improvement/regression.
The arm64 platform is AWS Graviton3 (m7g.metal); the x86_64 platform
is AWS Sapphire Rapids (m7i.24xlarge):

+-----------------+--------------+---------------+---------------+
| Benchmark | Result Class | per-cpu-prng | per-cpu-prng |
| | | arm64 (metal) | x86_64 (VM) |
+=================+==============+===============+===============+
| syscall/getpid | mean (ns) | (I) -9.50% | (I) -17.65% |
| | p99 (ns) | (I) -59.24% | (I) -24.41% |
| | p99.9 (ns) | (I) -59.52% | (I) -28.52% |
+-----------------+--------------+---------------+---------------+
| syscall/getppid | mean (ns) | (I) -9.52% | (I) -19.24% |
| | p99 (ns) | (I) -59.25% | (I) -25.03% |
| | p99.9 (ns) | (I) -59.50% | (I) -28.17% |
+-----------------+--------------+---------------+---------------+
| syscall/invalid | mean (ns) | (I) -10.31% | (I) -18.56% |
| | p99 (ns) | (I) -60.79% | (I) -20.06% |
| | p99.9 (ns) | (I) -61.04% | (I) -25.04% |
+-----------------+--------------+---------------+---------------+

I tested an earlier version of this change on x86 bare metal and it
showed a smaller but still significant improvement. The bare-metal
system wasn't available this time around, so testing was done in a VM
instance; I'm guessing the cost of rdtsc is higher in VMs.
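
For anyone wanting to sanity-check these numbers, a minimal harness in
the same spirit might look like the sketch below. This is a
hypothetical reproduction aid, not the benchmark used for the table
above; run it on kernels booted with randomize_kstack_offset=on and
=off (the parameter handled by early_randomize_kstack_offset() in
init/main.c below) and compare:

  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  /* Time N back-to-back getpid syscalls and report the mean ns/call.
   * syscall(SYS_getpid) forces a real syscall on every iteration. */
  int main(void)
  {
          const long n = 10 * 1000 * 1000;
          struct timespec t0, t1;

          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (long i = 0; i < n; i++)
                  syscall(SYS_getpid);
          clock_gettime(CLOCK_MONOTONIC, &t1);

          double ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
                      (t1.tv_nsec - t0.tv_nsec);
          printf("mean: %.1f ns/syscall\n", ns / n);
          return 0;
  }

Note that mean latency alone won't show the tail-latency (p99/p99.9)
effects in the table; a per-call timestamped histogram would be needed
for those.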

Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://patch.msgid.link/20260303150840.3789438-3-ryan.roberts@arm.com
Signed-off-by: Kees Cook <kees@kernel.org>

Authored by Ryan Roberts and committed by Kees Cook
a96ef584 37beb425

+33 -115
+2 -3
arch/Kconfig
···
 	  def_bool n
 	  help
 	    An arch should select this symbol if it can support kernel stack
-	    offset randomization with calls to add_random_kstack_offset()
-	    during syscall entry and choose_random_kstack_offset() during
-	    syscall exit. Careful removal of -fstack-protector-strong and
+	    offset randomization with a call to add_random_kstack_offset()
+	    during syscall entry. Careful removal of -fstack-protector-strong and
 	    -fstack-protector should also be applied to the entry code and
 	    closely examined, as the artificial stack bump looks like an array
 	    to the compiler, so it will attempt to add canary checks regardless
-11
arch/arm64/kernel/syscall.c
···
 	}
 
 	syscall_set_return_value(current, regs, 0, ret);
-
-	/*
-	 * This value will get limited by KSTACK_OFFSET_MAX(), which is 10
-	 * bits. The actual entropy will be further reduced by the compiler
-	 * when applying stack alignment constraints: the AAPCS mandates a
-	 * 16-byte aligned SP at function boundaries, which will remove the
-	 * 4 low bits from any entropy chosen here.
-	 *
-	 * The resulting 6 bits of entropy is seen in SP[9:4].
-	 */
-	choose_random_kstack_offset(get_random_u16());
 }
 
 static inline bool has_syscall_work(unsigned long flags)
-11
arch/loongarch/kernel/syscall.c
···
 			     regs->regs[7], regs->regs[8], regs->regs[9]);
 	}
 
-	/*
-	 * This value will get limited by KSTACK_OFFSET_MAX(), which is 10
-	 * bits. The actual entropy will be further reduced by the compiler
-	 * when applying stack alignment constraints: 16-bytes (i.e. 4-bits)
-	 * aligned, which will remove the 4 low bits from any entropy chosen
-	 * here.
-	 *
-	 * The resulting 6 bits of entropy is seen in SP[9:4].
-	 */
-	choose_random_kstack_offset(get_cycles());
-
 	syscall_exit_to_user_mode(regs);
 }
+2 -14
arch/powerpc/kernel/syscall.c
···
 
 	kuap_lock();
 
-	add_random_kstack_offset();
-
 	if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
 		BUG_ON(irq_soft_mask_return() != IRQS_ALL_DISABLED);
···
 
 	CT_WARN_ON(ct_state() == CT_STATE_KERNEL);
 	user_exit_irqoff();
+
+	add_random_kstack_offset();
 
 	BUG_ON(regs_is_unrecoverable(regs));
 	BUG_ON(!user_mode(regs));
···
 			   regs->gpr[6], regs->gpr[7], regs->gpr[8]);
 	}
 #endif
-
-	/*
-	 * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
-	 * so the maximum stack offset is 1k bytes (10 bits).
-	 *
-	 * The actual entropy will be further reduced by the compiler when
-	 * applying stack alignment constraints: the powerpc architecture
-	 * may have two kinds of stack alignment (16-bytes and 8-bytes).
-	 *
-	 * So the resulting 6 or 7 bits of entropy is seen in SP[9:4] or SP[9:3].
-	 */
-	choose_random_kstack_offset(mftb());
 
 	return ret;
 }
-12
arch/riscv/kernel/traps.c
···
 			syscall_handler(regs, syscall);
 		}
 
-		/*
-		 * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
-		 * so the maximum stack offset is 1k bytes (10 bits).
-		 *
-		 * The actual entropy will be further reduced by the compiler when
-		 * applying stack alignment constraints: 16-byte (i.e. 4-bit) aligned
-		 * for RV32I or RV64I.
-		 *
-		 * The resulting 6 bits of entropy is seen in SP[9:4].
-		 */
-		choose_random_kstack_offset(get_random_u16());
-
 		syscall_exit_to_user_mode(regs);
 	} else {
 		irqentry_state_t state = irqentry_nmi_enter(regs);
-8
arch/s390/include/asm/entry-common.h
···
 
 #define arch_exit_to_user_mode arch_exit_to_user_mode
 
-static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
-						  unsigned long ti_work)
-{
-	choose_random_kstack_offset(get_tod_clock_fast());
-}
-
-#define arch_exit_to_user_mode_prepare arch_exit_to_user_mode_prepare
-
 static __always_inline bool arch_in_rcu_eqs(void)
 {
 	if (IS_ENABLED(CONFIG_KVM))
+1 -1
arch/s390/kernel/syscall.c
···
 {
 	unsigned long nr;
 
-	add_random_kstack_offset();
 	enter_from_user_mode(regs);
+	add_random_kstack_offset();
 	regs->psw = get_lowcore()->svc_old_psw;
 	regs->int_code = get_lowcore()->svc_int_code;
 	update_timer_sys();
+2 -2
arch/x86/entry/syscall_32.c
···
 {
 	int nr = syscall_32_enter(regs);
 
-	add_random_kstack_offset();
 	/*
 	 * Subtlety here: if ptrace pokes something larger than 2^31-1 into
 	 * orig_ax, the int return value truncates it. This matches
···
 	nr = syscall_enter_from_user_mode(regs, nr);
 	instrumentation_begin();
 
+	add_random_kstack_offset();
 	do_syscall_32_irqs_on(regs, nr);
 
 	instrumentation_end();
···
 	int nr = syscall_32_enter(regs);
 	int res;
 
-	add_random_kstack_offset();
 	/*
 	 * This cannot use syscall_enter_from_user_mode() as it has to
 	 * fetch EBP before invoking any of the syscall entry work
···
 	enter_from_user_mode(regs);
 
 	instrumentation_begin();
+	add_random_kstack_offset();
 	local_irq_enable();
 	/* Fetch EBP from where the vDSO stashed it. */
 	if (IS_ENABLED(CONFIG_X86_64)) {
+1 -1
arch/x86/entry/syscall_64.c
···
 /* Returns true to return using SYSRET, or false to use IRET */
 __visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)
 {
-	add_random_kstack_offset();
 	nr = syscall_enter_from_user_mode(regs, nr);
 
 	instrumentation_begin();
+	add_random_kstack_offset();
 
 	if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
 		/* Invalid system call, but still a system call. */
-12
arch/x86/include/asm/entry-common.h
···
 	current_thread_info()->status &= ~(TS_COMPAT | TS_I386_REGS_POKED);
 #endif
 
-	/*
-	 * This value will get limited by KSTACK_OFFSET_MAX(), which is 10
-	 * bits. The actual entropy will be further reduced by the compiler
-	 * when applying stack alignment constraints (see cc_stack_align4/8 in
-	 * arch/x86/Makefile), which will remove the 3 (x86_64) or 2 (ia32)
-	 * low bits from any entropy chosen here.
-	 *
-	 * Therefore, final stack offset entropy will be 7 (x86_64) or
-	 * 8 (ia32) bits.
-	 */
-	choose_random_kstack_offset(rdtsc());
-
 	/* Avoid unnecessary reads of 'x86_ibpb_exit_to_user' */
 	if (cpu_feature_enabled(X86_FEATURE_IBPB_EXIT_TO_USER) &&
 	    this_cpu_read(x86_ibpb_exit_to_user)) {
+17 -35
include/linux/randomize_kstack.h
···
 #include <linux/kernel.h>
 #include <linux/jump_label.h>
 #include <linux/percpu-defs.h>
+#include <linux/prandom.h>
 
 DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
 			 randomize_kstack_offset);
···
 #define KSTACK_OFFSET_MAX(x)	((x) & 0b1111111100)
 #endif
 
+DECLARE_PER_CPU(struct rnd_state, kstack_rnd_state);
+
+static __always_inline u32 get_kstack_offset(void)
+{
+	struct rnd_state *state;
+	u32 rnd;
+
+	state = &get_cpu_var(kstack_rnd_state);
+	rnd = prandom_u32_state(state);
+	put_cpu_var(kstack_rnd_state);
+
+	return rnd;
+}
+
 /**
- * add_random_kstack_offset - Increase stack utilization by previously
- *			      chosen random offset
+ * add_random_kstack_offset - Increase stack utilization by a random offset.
  *
  * This should be used in the syscall entry path after user registers have been
  * stored to the stack. Preemption may be enabled. For testing the resulting
···
 #define add_random_kstack_offset() do {					\
 	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,	\
 				&randomize_kstack_offset)) {		\
-		u32 offset = current->kstack_offset;			\
+		u32 offset = get_kstack_offset();			\
 		u8 *ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset));	\
 		/* Keep allocation even after "ptr" loses scope. */	\
 		asm volatile("" :: "r"(ptr) : "memory");		\
 	}								\
 } while (0)
 
-/**
- * choose_random_kstack_offset - Choose the random offset for the next
- *				 add_random_kstack_offset()
- *
- * This should only be used during syscall exit. Preemption may be enabled. This
- * position in the syscall flow is done to frustrate attacks from userspace
- * attempting to learn the next offset:
- * - Maximize the timing uncertainty visible from userspace: if the
- *   offset is chosen at syscall entry, userspace has much more control
- *   over the timing between choosing offsets. "How long will we be in
- *   kernel mode?" tends to be more difficult to predict than "how long
- *   will we be in user mode?"
- * - Reduce the lifetime of the new offset sitting in memory during
- *   kernel mode execution. Exposure of "thread-local" memory content
- *   (e.g. current, percpu, etc) tends to be easier than arbitrary
- *   location memory exposure.
- */
-#define choose_random_kstack_offset(rand) do {				\
-	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,	\
-				&randomize_kstack_offset)) {		\
-		u32 offset = current->kstack_offset;			\
-		offset = ror32(offset, 5) ^ (rand);			\
-		current->kstack_offset = offset;			\
-	}								\
-} while (0)
-
-static inline void random_kstack_task_init(struct task_struct *tsk)
-{
-	tsk->kstack_offset = 0;
-}
 #else /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
 #define add_random_kstack_offset() do { } while (0)
-#define choose_random_kstack_offset(rand) do { } while (0)
-#define random_kstack_task_init(tsk) do { } while (0)
 #endif /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
 
 #endif
-4
include/linux/sched.h
···
 	unsigned long prev_lowest_stack;
 #endif
 
-#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
-	u32 kstack_offset;
-#endif
-
 #ifdef CONFIG_X86_MCE
 	void __user *mce_vaddr;
 	__u64 mce_kflags;
+8
init/main.c
···
 #ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
 DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
 			   randomize_kstack_offset);
+DEFINE_PER_CPU(struct rnd_state, kstack_rnd_state);
+
+static int __init random_kstack_init(void)
+{
+	prandom_seed_full_state(&kstack_rnd_state);
+	return 0;
+}
+late_initcall(random_kstack_init);
 
 static int __init early_randomize_kstack_offset(char *buf)
 {
-1
kernel/fork.c
···
 	if (retval)
 		goto bad_fork_cleanup_io;
 
-	random_kstack_task_init(p);
 	stackleak_task_init(p);
 
 	if (pid != &init_struct_pid) {