fork: stop ignoring NUMA while handling cached thread stacks

1. The NUMA node parameter was ignored outright.
2. Nothing checked whether the stack about to be cached or allocated
matches the local node.

The id remains ignored on free in case of memoryless nodes.
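
The policy described above can be modelled in userspace. This is only a sketch under stated assumptions: every `model_*` and `MODEL_*` name is hypothetical and stands in for a kernel facility (per-CPU slots, `numa_node_id()`, `node_state()`, `page_to_nid()`); preemption and atomicity are not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: every name here is hypothetical, not a kernel API. */
#define MODEL_NO_NODE   (-1)	/* stands in for NUMA_NO_NODE */
#define MODEL_NR_CACHED 2	/* stands in for NR_CACHED_STACKS */
#define MODEL_MAX_PAGES 4

struct model_stack {
	int nr_pages;
	int page_nid[MODEL_MAX_PAGES];	/* node backing each page */
};

struct model_cpu {
	int nid;			/* node this CPU belongs to */
	bool node_has_memory;		/* false models a memoryless node */
	struct model_stack *cached[MODEL_NR_CACHED];
};

/* Hand out a cached stack only for "any node" or the local node. */
static struct model_stack *model_alloc_from_cache(struct model_cpu *cpu, int node)
{
	if (node != MODEL_NO_NODE && node != cpu->nid)
		return NULL;

	for (int i = 0; i < MODEL_NR_CACHED; i++) {
		struct model_stack *s = cpu->cached[i];

		if (s) {
			cpu->cached[i] = NULL;
			return s;
		}
	}
	return NULL;
}

/* Cache a freed stack unless a page is remote; no check without local memory. */
static bool model_release_to_cache(struct model_cpu *cpu, struct model_stack *s)
{
	assert(s->nr_pages <= MODEL_MAX_PAGES);

	if (cpu->node_has_memory) {
		for (int i = 0; i < s->nr_pages; i++)
			if (s->page_nid[i] != cpu->nid)
				return false;
	}

	for (int i = 0; i < MODEL_NR_CACHED; i++) {
		if (!cpu->cached[i]) {
			cpu->cached[i] = s;
			return true;
		}
	}
	return false;	/* cache full: the caller would free the stack */
}
```

In this model a stack with even one remote page is refused on a node that has memory, while a memoryless node accepts any stack, which mirrors the "id remains ignored on free" case.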

Note the current caching is already bad as the cache keeps overflowing
and a different solution is needed for the long run, to be worked
out(tm).

Stats collected over a kernel build with the patch applied, on the
following topology:
NUMA node(s): 2
NUMA node0 CPU(s): 0-11
NUMA node1 CPU(s): 12-23

caller's node vs stack backing pages on free:
matching: 50083 (70%)
mismatched: 21492 (30%)

caching efficiency:
cached: 32651 (65.2%)
dropped: 17432 (34.8%)
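
The percentages above can be re-derived from the raw counts. The sketch below uses a hypothetical helper, `share_pct`, which is not part of the patch. Note that cached + dropped (32651 + 17432) equals the matching count (50083); this suggests, as an inference rather than something the message states, that only node-matching frees were candidates for caching.

```c
#include <assert.h>

/* Counts copied from the commit message. */
static const unsigned long matching = 50083, mismatched = 21492;
static const unsigned long cached = 32651, dropped = 17432;

/* Hypothetical helper: share of `part` in `part + rest`, in percent. */
static double share_pct(unsigned long part, unsigned long rest)
{
	assert(part + rest > 0);
	return 100.0 * (double)part / (double)(part + rest);
}

/* Share of all freed stacks that ended up cached (assumes the inference above). */
static double overall_cached_pct(void)
{
	return 100.0 * (double)cached / (double)(matching + mismatched);
}
```

Under that assumption, `share_pct(matching, mismatched)` and `share_pct(cached, dropped)` reproduce the reported 70% and 65.2%, and `overall_cached_pct()` comes out a little under 46% of all frees.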

Link: https://lkml.kernel.org/r/20251120054015.3019419-1-mjguzik@gmail.com
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Mateusz Guzik and committed by Andrew Morton 6 months ago (262ef8e5 94984bfe)

1 changed file: kernel/fork.c (+53 -10)

···
 	struct vm_struct *stack_vm_area;
 };
 
+static struct vm_struct *alloc_thread_stack_node_from_cache(struct task_struct *tsk, int node)
+{
+	struct vm_struct *vm_area;
+	unsigned int i;
+
+	/*
+	 * If the node has memory, we are guaranteed the stacks are backed by local pages.
+	 * Otherwise the pages are arbitrary.
+	 *
+	 * Note that depending on cpuset it is possible we will get migrated to a different
+	 * node immediately after allocating here, so this does *not* guarantee locality for
+	 * arbitrary callers.
+	 */
+	scoped_guard(preempt) {
+		if (node != NUMA_NO_NODE && numa_node_id() != node)
+			return NULL;
+
+		for (i = 0; i < NR_CACHED_STACKS; i++) {
+			vm_area = this_cpu_xchg(cached_stacks[i], NULL);
+			if (vm_area)
+				return vm_area;
+		}
+	}
+
+	return NULL;
+}
+
 static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
 {
 	unsigned int i;
+	int nid;
 
-	for (i = 0; i < NR_CACHED_STACKS; i++) {
-		struct vm_struct *tmp = NULL;
+	/*
+	 * Don't cache stacks if any of the pages don't match the local domain, unless
+	 * there is no local memory to begin with.
+	 *
+	 * Note that lack of local memory does not automatically mean it makes no difference
+	 * performance-wise which other domain backs the stack. In this case we are merely
+	 * trying to avoid constantly going to vmalloc.
+	 */
+	scoped_guard(preempt) {
+		nid = numa_node_id();
+		if (node_state(nid, N_MEMORY)) {
+			for (i = 0; i < vm_area->nr_pages; i++) {
+				struct page *page = vm_area->pages[i];
+				if (page_to_nid(page) != nid)
+					return false;
+			}
+		}
 
-		if (this_cpu_try_cmpxchg(cached_stacks[i], &tmp, vm_area))
-			return true;
+		for (i = 0; i < NR_CACHED_STACKS; i++) {
+			struct vm_struct *tmp = NULL;
+
+			if (this_cpu_try_cmpxchg(cached_stacks[i], &tmp, vm_area))
+				return true;
+		}
 	}
 	return false;
 }
···
 {
 	struct vm_struct *vm_area;
 	void *stack;
-	int i;
 
-	for (i = 0; i < NR_CACHED_STACKS; i++) {
-		vm_area = this_cpu_xchg(cached_stacks[i], NULL);
-		if (!vm_area)
-			continue;
-
+	vm_area = alloc_thread_stack_node_from_cache(tsk, node);
+	if (vm_area) {
 		if (memcg_charge_kernel_stack(vm_area)) {
 			vfree(vm_area->addr);
 			return -ENOMEM;
