Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Merge branch 'remove-kf_sleepable-from-arena-kfuncs'

Puranjay Mohan says:

====================
Remove KF_SLEEPABLE from arena kfuncs

V7: https://lore.kernel.org/all/20251222190815.4112944-1-puranjay@kernel.org/
Changes in v7->v8:
- Use clear_lo32(arena->user_vm_start) in place of user_vm_start in patch 3

V6: https://lore.kernel.org/all/20251217184438.3557859-1-puranjay@kernel.org/
Changes in v6->v7:
- Fix a deadlock in patch 1, that was being fixed in patch 2. Move the fix to patch 1.
- Call flush_cache_vmap() after setting up the mappings as it is
required by some architectures.

V5: https://lore.kernel.org/all/20251212044516.37513-1-puranjay@kernel.org/
Changes in v5->v6:
Patch 1:
- Add a missing ; to make sure this patch builds individually. (AI)

V4: https://lore.kernel.org/all/20251212004350.6520-1-puranjay@kernel.org/
Changes in v4->v5:
Patch 1:
- Fix a memory leak in arena_alloc_pages(). It was being fixed in
Patch 3, but every patch should be complete in itself. (AI)
Patch 3:
- Don't do useless addition in arena_alloc_pages() (Alexei)
- Add a comment about kmalloc_nolock() failure and expectations.

v3: https://lore.kernel.org/all/20251117160150.62183-1-puranjay@kernel.org/
Changes in v3->v4:
- Coding style changes related to comments in Patch 2/3 (Alexei)

v2: https://lore.kernel.org/all/20251114111700.43292-1-puranjay@kernel.org/
Changes in v2->v3:
Patch 1:
- Call range_tree_destroy() in error path of
populate_pgtable_except_pte() in arena_map_alloc() (AI)
Patch 2:
- Fix double mutex_unlock() in the error path of
arena_alloc_pages() (AI)
- Fix coding style issues (Alexei)
Patch 3:
- Unlock spinlock before returning from arena_vm_fault() in case
BPF_F_SEGV_ON_FAULT is set by user. (AI)
- Use __llist_del_all() in place of llist_del_all for on-stack
llist (free_pages) (Alexei)
- Fix build issues on 32-bit systems where arena.c is not compiled.
(kernel test robot)
- Make bpf_arena_alloc_pages() polymorphic so it knows if it has
been called in sleepable or non-sleepable context. This
information is passed to arena_free_pages() in the error path.
Patch 4:
- Add a better comment for the big_alloc3() test, which triggers
kmalloc_nolock()'s limit and checks whether bpf_arena_alloc_pages()
works correctly above this limit.

v1: https://lore.kernel.org/all/20251111163424.16471-1-puranjay@kernel.org/
Changes in v1->v2:
Patch 1:
- Import tlbflush.h to fix build issue in loongarch. (kernel
test robot)
- Fix unused variable error in apply_range_clear_cb() (kernel
test robot)
- Call bpf_map_area_free() on error path of
populate_pgtable_except_pte() (AI)
- Use PAGE_SIZE in apply_to_existing_page_range() (AI)
Patch 2:
- Cap allocation made by kmalloc_nolock() for pages array to
KMALLOC_MAX_CACHE_SIZE and reuse the array in an explicit loop
to overcome this limit. (AI)
Patch 3:
- Do page_ref_add(page, 1); under the spinlock to mitigate a
race (AI)
Patch 4:
- Add a new testcase big_alloc3() to verifier_arena_large.c that
tries to allocate a large number of pages at once. This is to
trigger the kmalloc_nolock() limit in Patch 2 and see if the
loop logic works correctly.

This set allows arena kfuncs to be called from non-sleepable contexts.
This is achieved by the following changes:

The range_tree is now protected with a rqspinlock and not a mutex,
this change is enough to make bpf_arena_reserve_pages() any context
safe.

bpf_arena_alloc_pages() had four points where it could sleep:

1. Mutex to protect range_tree: now replaced with rqspinlock

2. kvcalloc() for allocations: now replaced with kmalloc_nolock()

3. Allocating pages with bpf_map_alloc_pages(): this already calls
alloc_pages_nolock() in non-sleepable contexts and therefore is safe.

4. Setting up kernel page tables with vm_area_map_pages():
vm_area_map_pages() may allocate memory while inserting pages into
bpf arena's vm_area. Now, at arena creation time, all page table
levels except the last level are populated. When new pages need to be
inserted, apply_to_page_range() is called again; it then only does
set_pte_at() for those pages and does not allocate memory.

The above four changes make bpf_arena_alloc_pages() any context safe.
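The batching behind point 2 (a scratch array capped at a fixed size and reused in a loop) can be pictured with a userspace sketch. This is only an illustration under stated assumptions: SCRATCH_MAX_ENTRIES, fake_alloc_batch() and map_in_batches() are hypothetical names standing in for the KMALLOC_MAX_CACHE_SIZE cap, bpf_map_alloc_pages() and the loop in arena_alloc_pages(); plain malloc() replaces kmalloc_nolock().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for KMALLOC_MAX_CACHE_SIZE / sizeof(struct page *). */
#define SCRATCH_MAX_ENTRIES 4

/* Hypothetical stand-in for bpf_map_alloc_pages(): fills 'out' with n fake page ids. */
static int fake_alloc_batch(size_t n, int *out, int *next_id)
{
	for (size_t i = 0; i < n; i++)
		out[i] = (*next_id)++;
	return 0;
}

/*
 * "Map" page_cnt pages into 'mapping' while the scratch array never holds
 * more than SCRATCH_MAX_ENTRIES entries. Returns the number of pages mapped.
 */
static size_t map_in_batches(int *mapping, size_t page_cnt)
{
	size_t cap = page_cnt < SCRATCH_MAX_ENTRIES ? page_cnt : SCRATCH_MAX_ENTRIES;
	size_t remaining = page_cnt, mapped = 0;
	int next_id = 100;
	int *scratch;

	if (!page_cnt)
		return 0;

	scratch = malloc(cap * sizeof(*scratch));
	if (!scratch)
		return 0;

	while (remaining) {
		size_t this_batch = remaining < cap ? remaining : cap;

		/* Reuse the same scratch array for every batch. */
		memset(scratch, 0, this_batch * sizeof(*scratch));
		if (fake_alloc_batch(this_batch, scratch, &next_id))
			break;

		/* Place this batch right after the pages mapped so far. */
		assert(mapped + this_batch <= page_cnt);
		memcpy(mapping + mapped, scratch, this_batch * sizeof(*scratch));

		mapped += this_batch;
		remaining -= this_batch;
	}

	free(scratch);
	return mapped;
}
```

With SCRATCH_MAX_ENTRIES set to 4, a request for 10 pages runs three batches (4, 4, 2) and still fills all 10 slots in order.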

bpf_arena_free_pages() has to do the following steps:

1. Update the range_tree
2. vm_area_unmap_pages(): to unmap pages from kernel vm_area
3. flush the tlb: done in step 2, already.
4. zap_pages(): to unmap pages from user page tables
5. free pages.

The third patch in this set makes bpf_arena_free_pages() polymorphic using
the specialize_kfunc() mechanism. When called from a sleepable context,
arena_free_pages() remains mostly unchanged except the following:
1. rqspinlock is taken now instead of the mutex for the range tree
2. Instead of using vm_area_unmap_pages() that can free intermediate page
table levels, apply_to_existing_page_range() with a callback is used
that only does pte_clear() on the last level and leaves the intermediate
page table levels intact. This is needed to make sure that
bpf_arena_alloc_pages() can safely do set_pte_at() without allocating
intermediate page tables.
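The call-site specialization can be sketched in userspace as choosing a function address once, based on a per-call-site flag. The names below (free_pages_sleepable(), free_pages_non_sleepable(), specialize_free()) are hypothetical and only mimic the shape of the mechanism; this is not the verifier's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two entry points; each reports which mode it ran in. */
static bool free_pages_sleepable(void)
{
	return true;
}

static bool free_pages_non_sleepable(void)
{
	return false;
}

typedef bool (*free_fn)(void);

/*
 * Pick the call target once from the per-call-site flag: default to the
 * sleepable variant and swap in the non-sleepable one when the flag is set.
 */
static free_fn specialize_free(bool non_sleepable)
{
	free_fn addr = free_pages_sleepable;

	if (non_sleepable)
		addr = free_pages_non_sleepable;

	assert(addr != NULL);
	return addr;
}
```

The program being called keeps using a single name; only the address resolved for that call site differs, so no runtime check of the context is needed inside the callee.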

When arena_free_pages() is called from a non-sleepable context, or it fails to
acquire the rqspinlock in the sleepable case, a lock-less list of struct
arena_free_span is used to queue the uaddr and page cnt. kmalloc_nolock()
is used to allocate this arena_free_span. This can fail, but we need to make
this trade-off for frees done from non-sleepable contexts.

arena_free_pages() then raises an irq_work whose handler in turn schedules
work that iterates this list and clears PTEs, flushes TLBs, zaps pages, and
frees pages for the queued uaddr and page cnts.
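The queue-then-drain pattern described above can be approximated in userspace with C11 atomics. This is a minimal sketch, assuming a simple compare-and-swap push and a whole-list exchange in place of the kernel's llist_add() and llist_del_all(); struct free_span, defer_free() and drain_deferred() are hypothetical names, and malloc() stands in for kmalloc_nolock().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace analogue of struct arena_free_span. */
struct free_span {
	struct free_span *next;
	unsigned long uaddr;
	unsigned int page_cnt;
};

/* Hypothetical analogue of the arena's lock-less free_spans list head. */
static _Atomic(struct free_span *) free_spans;

/* Queue a span without taking a lock. Returns 0 on success, -1 if allocation fails. */
static int defer_free(unsigned long uaddr, unsigned int page_cnt)
{
	struct free_span *s;

	assert(page_cnt > 0);
	s = malloc(sizeof(*s));
	if (!s)
		return -1;	/* the span is dropped; the cover letter accepts this trade-off */

	s->uaddr = uaddr;
	s->page_cnt = page_cnt;
	s->next = atomic_load(&free_spans);
	/* On failure, s->next is refreshed with the current head and we retry. */
	while (!atomic_compare_exchange_weak(&free_spans, &s->next, s))
		;
	return 0;
}

/* Worker side: detach the whole list at once, then process it. Returns pages handled. */
static unsigned long drain_deferred(void)
{
	struct free_span *s = atomic_exchange(&free_spans, NULL);
	unsigned long total = 0;

	while (s) {
		struct free_span *next = s->next;

		/* The real worker clears PTEs, flushes TLBs, zaps and frees pages here. */
		total += s->page_cnt;
		free(s);
		s = next;
	}
	return total;
}
```

Producers never block, and the consumer takes ownership of every queued span with a single exchange, which is why the heavy work can be pushed to a context that is allowed to sleep.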

apply_range_clear_cb() with apply_to_existing_page_range() is used to
clear PTEs and collect pages to be freed. The struct llist_node pcp_llist
member of struct page is used to link the collected pages.
====================

Link: https://patch.msgid.link/20251222195022.431211-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

+593 -60
+16
include/linux/bpf.h
···
 int bpf_dynptr_from_file_sleepable(struct file *file, u32 flags,
 				   struct bpf_dynptr *ptr__uninit);

+#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
+void *bpf_arena_alloc_pages_non_sleepable(void *p__map, void *addr__ign, u32 page_cnt, int node_id,
+					  u64 flags);
+void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt);
+#else
+static inline void *bpf_arena_alloc_pages_non_sleepable(void *p__map, void *addr__ign, u32 page_cnt,
+							int node_id, u64 flags)
+{
+	return NULL;
+}
+
+static inline void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt)
+{
+}
+#endif
+
 extern const struct bpf_map_ops bpf_map_offload_ops;

 /* bpf_type_flag contains a set of flags that are applicable to the values of
+327 -55
kernel/bpf/arena.c
···
 /* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
 #include <linux/bpf.h>
 #include <linux/btf.h>
+#include <linux/cacheflush.h>
 #include <linux/err.h>
+#include <linux/irq_work.h>
 #include "linux/filter.h"
+#include <linux/llist.h>
 #include <linux/btf_ids.h>
 #include <linux/vmalloc.h>
 #include <linux/pagemap.h>
+#include <asm/tlbflush.h>
 #include "range_tree.h"

 /*
···
 #define GUARD_SZ round_up(1ull << sizeof_field(struct bpf_insn, off) * 8, PAGE_SIZE << 1)
 #define KERN_VM_SZ (SZ_4G + GUARD_SZ)

+static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt, bool sleepable);
+
 struct bpf_arena {
 	struct bpf_map map;
 	u64 user_vm_start;
 	u64 user_vm_end;
 	struct vm_struct *kern_vm;
 	struct range_tree rt;
+	/* protects rt */
+	rqspinlock_t spinlock;
 	struct list_head vma_list;
+	/* protects vma_list */
 	struct mutex lock;
+	struct irq_work free_irq;
+	struct work_struct free_work;
+	struct llist_head free_spans;
+};
+
+static void arena_free_worker(struct work_struct *work);
+static void arena_free_irq(struct irq_work *iw);
+
+struct arena_free_span {
+	struct llist_node node;
+	unsigned long uaddr;
+	u32 page_cnt;
 };

 u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
···
 static long compute_pgoff(struct bpf_arena *arena, long uaddr)
 {
 	return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
+}
+
+struct apply_range_data {
+	struct page **pages;
+	int i;
+};
+
+static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
+{
+	struct apply_range_data *d = data;
+	struct page *page;
+
+	if (!data)
+		return 0;
+	/* sanity check */
+	if (unlikely(!pte_none(ptep_get(pte))))
+		return -EBUSY;
+
+	page = d->pages[d->i];
+	/* paranoia, similar to vmap_pages_pte_range() */
+	if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page))))
+		return -EINVAL;
+
+	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
+	d->i++;
+	return 0;
+}
+
+static void flush_vmap_cache(unsigned long start, unsigned long size)
+{
+	flush_cache_vmap(start, start + size);
+}
+
+static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *free_pages)
+{
+	pte_t old_pte;
+	struct page *page;
+
+	/* sanity check */
+	old_pte = ptep_get(pte);
+	if (pte_none(old_pte) || !pte_present(old_pte))
+		return 0; /* nothing to do */
+
+	page = pte_page(old_pte);
+	if (WARN_ON_ONCE(!page))
+		return -EINVAL;
+
+	pte_clear(&init_mm, addr, pte);
+
+	/* Add page to the list so it is freed later */
+	if (free_pages)
+		__llist_add(&page->pcp_llist, free_pages);
+
+	return 0;
+}
+
+static int populate_pgtable_except_pte(struct bpf_arena *arena)
+{
+	return apply_to_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
+				   KERN_VM_SZ - GUARD_SZ, apply_range_set_cb, NULL);
 }

 static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
···
 	arena->user_vm_end = arena->user_vm_start + vm_range;

 	INIT_LIST_HEAD(&arena->vma_list);
+	init_llist_head(&arena->free_spans);
+	init_irq_work(&arena->free_irq, arena_free_irq);
+	INIT_WORK(&arena->free_work, arena_free_worker);
 	bpf_map_init_from_attr(&arena->map, attr);
 	range_tree_init(&arena->rt);
 	err = range_tree_set(&arena->rt, 0, attr->max_entries);
···
 		goto err;
 	}
 	mutex_init(&arena->lock);
+	raw_res_spin_lock_init(&arena->spinlock);
+	err = populate_pgtable_except_pte(arena);
+	if (err) {
+		range_tree_destroy(&arena->rt);
+		bpf_map_area_free(arena);
+		goto err;
+	}

 	return &arena->map;
 err:
···
 	 */
 	if (WARN_ON_ONCE(!list_empty(&arena->vma_list)))
 		return;
+
+	/* Ensure no pending deferred frees */
+	irq_work_sync(&arena->free_irq);
+	flush_work(&arena->free_work);

 	/*
 	 * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area().
···
 	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
 	struct page *page;
 	long kbase, kaddr;
+	unsigned long flags;
 	int ret;

 	kbase = bpf_arena_get_kern_vm_start(arena);
 	kaddr = kbase + (u32)(vmf->address);

-	guard(mutex)(&arena->lock);
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
+		/* Make a reasonable effort to address impossible case */
+		return VM_FAULT_RETRY;
+
 	page = vmalloc_to_page((void *)kaddr);
 	if (page)
 		/* already have a page vmap-ed */
···

 	if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT)
 		/* User space requested to segfault when page is not allocated by bpf prog */
-		return VM_FAULT_SIGSEGV;
+		goto out_unlock_sigsegv;

 	ret = range_tree_clear(&arena->rt, vmf->pgoff, 1);
 	if (ret)
-		return VM_FAULT_SIGSEGV;
+		goto out_unlock_sigsegv;

+	struct apply_range_data data = { .pages = &page, .i = 0 };
 	/* Account into memcg of the process that created bpf_arena */
 	ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
 	if (ret) {
 		range_tree_set(&arena->rt, vmf->pgoff, 1);
-		return VM_FAULT_SIGSEGV;
+		goto out_unlock_sigsegv;
 	}

-	ret = vm_area_map_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE, &page);
+	ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data);
 	if (ret) {
 		range_tree_set(&arena->rt, vmf->pgoff, 1);
-		__free_page(page);
-		return VM_FAULT_SIGSEGV;
+		free_pages_nolock(page, 0);
+		goto out_unlock_sigsegv;
 	}
+	flush_vmap_cache(kaddr, PAGE_SIZE);
 out:
 	page_ref_add(page, 1);
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
 	vmf->page = page;
 	return 0;
+out_unlock_sigsegv:
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+	return VM_FAULT_SIGSEGV;
 }

 static const struct vm_operations_struct arena_vm_ops = {
···
  * Allocate pages and vmap them into kernel vmalloc area.
  * Later the pages will be mmaped into user space vma.
  */
-static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id)
+static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id,
+			      bool sleepable)
 {
 	/* user_vm_end/start are fixed before bpf prog runs */
 	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
 	u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
-	struct page **pages;
+	struct apply_range_data data;
+	struct page **pages = NULL;
+	long remaining, mapped = 0;
+	long alloc_pages;
+	unsigned long flags;
 	long pgoff = 0;
 	u32 uaddr32;
 	int ret, i;
···
 		return 0;
 	}

-	/* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
-	pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
+	/* Cap allocation size to KMALLOC_MAX_CACHE_SIZE so kmalloc_nolock() can succeed. */
+	alloc_pages = min(page_cnt, KMALLOC_MAX_CACHE_SIZE / sizeof(struct page *));
+	pages = kmalloc_nolock(alloc_pages * sizeof(struct page *), 0, NUMA_NO_NODE);
 	if (!pages)
 		return 0;
+	data.pages = pages;

-	guard(mutex)(&arena->lock);
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
+		goto out_free_pages;

 	if (uaddr) {
 		ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
 		if (ret)
-			goto out_free_pages;
+			goto out_unlock_free_pages;
 		ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
 	} else {
 		ret = pgoff = range_tree_find(&arena->rt, page_cnt);
···
 			ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
 	}
 	if (ret)
-		goto out_free_pages;
+		goto out_unlock_free_pages;

-	ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages);
-	if (ret)
-		goto out;
-
+	remaining = page_cnt;
 	uaddr32 = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
-	/* Earlier checks made sure that uaddr32 + page_cnt * PAGE_SIZE - 1
-	 * will not overflow 32-bit. Lower 32-bit need to represent
-	 * contiguous user address range.
-	 * Map these pages at kern_vm_start base.
-	 * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
-	 * lower 32-bit and it's ok.
-	 */
-	ret = vm_area_map_pages(arena->kern_vm, kern_vm_start + uaddr32,
-				kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE, pages);
-	if (ret) {
-		for (i = 0; i < page_cnt; i++)
-			__free_page(pages[i]);
-		goto out;
+
+	while (remaining) {
+		long this_batch = min(remaining, alloc_pages);
+
+		/* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
+		memset(pages, 0, this_batch * sizeof(struct page *));
+
+		ret = bpf_map_alloc_pages(&arena->map, node_id, this_batch, pages);
+		if (ret)
+			goto out;
+
+		/*
+		 * Earlier checks made sure that uaddr32 + page_cnt * PAGE_SIZE - 1
+		 * will not overflow 32-bit. Lower 32-bit need to represent
+		 * contiguous user address range.
+		 * Map these pages at kern_vm_start base.
+		 * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
+		 * lower 32-bit and it's ok.
+		 */
+		data.i = 0;
+		ret = apply_to_page_range(&init_mm,
+					  kern_vm_start + uaddr32 + (mapped << PAGE_SHIFT),
+					  this_batch << PAGE_SHIFT, apply_range_set_cb, &data);
+		if (ret) {
+			/* data.i pages were mapped, account them and free the remaining */
+			mapped += data.i;
+			for (i = data.i; i < this_batch; i++)
+				free_pages_nolock(pages[i], 0);
+			goto out;
+		}
+
+		mapped += this_batch;
+		remaining -= this_batch;
 	}
-	kvfree(pages);
+	flush_vmap_cache(kern_vm_start + uaddr32, mapped << PAGE_SHIFT);
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+	kfree_nolock(pages);
 	return clear_lo32(arena->user_vm_start) + uaddr32;
 out:
-	range_tree_set(&arena->rt, pgoff, page_cnt);
+	range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped);
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+	if (mapped) {
+		flush_vmap_cache(kern_vm_start + uaddr32, mapped << PAGE_SHIFT);
+		arena_free_pages(arena, uaddr32, mapped, sleepable);
+	}
+	goto out_free_pages;
+out_unlock_free_pages:
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
 out_free_pages:
-	kvfree(pages);
+	kfree_nolock(pages);
 	return 0;
 }

···
 {
 	struct vma_list *vml;

+	guard(mutex)(&arena->lock);
+	/* iterate link list under lock */
 	list_for_each_entry(vml, &arena->vma_list, head)
 		zap_page_range_single(vml->vma, uaddr,
 				      PAGE_SIZE * page_cnt, NULL);
 }

-static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
+static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt, bool sleepable)
 {
 	u64 full_uaddr, uaddr_end;
-	long kaddr, pgoff, i;
+	long kaddr, pgoff;
 	struct page *page;
+	struct llist_head free_pages;
+	struct llist_node *pos, *t;
+	struct arena_free_span *s;
+	unsigned long flags;
+	int ret = 0;

 	/* only aligned lower 32-bit are relevant */
 	uaddr = (u32)uaddr;
 	uaddr &= PAGE_MASK;
+	kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
 	full_uaddr = clear_lo32(arena->user_vm_start) + uaddr;
 	uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT));
 	if (full_uaddr >= uaddr_end)
 		return;

 	page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT;
-
-	guard(mutex)(&arena->lock);
-
 	pgoff = compute_pgoff(arena, uaddr);
-	/* clear range */
+
+	if (!sleepable)
+		goto defer;
+
+	ret = raw_res_spin_lock_irqsave(&arena->spinlock, flags);
+
+	/* Can't proceed without holding the spinlock so defer the free */
+	if (ret)
+		goto defer;
+
 	range_tree_set(&arena->rt, pgoff, page_cnt);
+
+	init_llist_head(&free_pages);
+	/* clear ptes and collect struct pages */
+	apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT,
+				     apply_range_clear_cb, &free_pages);
+
+	/* drop the lock to do the tlb flush and zap pages */
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+
+	/* ensure no stale TLB entries */
+	flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE));

 	if (page_cnt > 1)
 		/* bulk zap if multiple pages being freed */
 		zap_pages(arena, full_uaddr, page_cnt);

-	kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
-	for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) {
-		page = vmalloc_to_page((void *)kaddr);
-		if (!page)
-			continue;
+	llist_for_each_safe(pos, t, __llist_del_all(&free_pages)) {
+		page = llist_entry(pos, struct page, pcp_llist);
 		if (page_cnt == 1 && page_mapped(page)) /* mapped by some user process */
 			/* Optimization for the common case of page_cnt==1:
 			 * If page wasn't mapped into some user vma there
···
 			 * page_cnt is big it's faster to do the batched zap.
 			 */
 			zap_pages(arena, full_uaddr, 1);
-		vm_area_unmap_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE);
 		__free_page(page);
 	}
+
+	return;
+
+defer:
+	s = kmalloc_nolock(sizeof(struct arena_free_span), 0, -1);
+	if (!s)
+		/*
+		 * If allocation fails in non-sleepable context, pages are intentionally left
+		 * inaccessible (leaked) until the arena is destroyed. Cleanup or retries are not
+		 * possible here, so we intentionally omit them for safety.
+		 */
+		return;
+
+	s->page_cnt = page_cnt;
+	s->uaddr = uaddr;
+	llist_add(&s->node, &arena->free_spans);
+	irq_work_queue(&arena->free_irq);
 }

 /*
···
 static int arena_reserve_pages(struct bpf_arena *arena, long uaddr, u32 page_cnt)
 {
 	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
+	unsigned long flags;
 	long pgoff;
 	int ret;

···
 	if (pgoff + page_cnt > page_cnt_max)
 		return -EINVAL;

-	guard(mutex)(&arena->lock);
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
+		return -EBUSY;

 	/* Cannot guard already allocated pages. */
 	ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
-	if (ret)
-		return -EBUSY;
+	if (ret) {
+		ret = -EBUSY;
+		goto out;
+	}

 	/* "Allocate" the region to prevent it from being allocated. */
-	return range_tree_clear(&arena->rt, pgoff, page_cnt);
+	ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
+out:
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+	return ret;
+}
+
+static void arena_free_worker(struct work_struct *work)
+{
+	struct bpf_arena *arena = container_of(work, struct bpf_arena, free_work);
+	struct llist_node *list, *pos, *t;
+	struct arena_free_span *s;
+	u64 arena_vm_start, user_vm_start;
+	struct llist_head free_pages;
+	struct page *page;
+	unsigned long full_uaddr;
+	long kaddr, page_cnt, pgoff;
+	unsigned long flags;
+
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags)) {
+		schedule_work(work);
+		return;
+	}
+
+	init_llist_head(&free_pages);
+	arena_vm_start = bpf_arena_get_kern_vm_start(arena);
+	user_vm_start = bpf_arena_get_user_vm_start(arena);
+
+	list = llist_del_all(&arena->free_spans);
+	llist_for_each(pos, list) {
+		s = llist_entry(pos, struct arena_free_span, node);
+		page_cnt = s->page_cnt;
+		kaddr = arena_vm_start + s->uaddr;
+		pgoff = compute_pgoff(arena, s->uaddr);
+
+		/* clear ptes and collect pages in free_pages llist */
+		apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT,
+					     apply_range_clear_cb, &free_pages);
+
+		range_tree_set(&arena->rt, pgoff, page_cnt);
+	}
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+
+	/* Iterate the list again without holding spinlock to do the tlb flush and zap_pages */
+	llist_for_each_safe(pos, t, list) {
+		s = llist_entry(pos, struct arena_free_span, node);
+		page_cnt = s->page_cnt;
+		full_uaddr = clear_lo32(user_vm_start) + s->uaddr;
+		kaddr = arena_vm_start + s->uaddr;
+
+		/* ensure no stale TLB entries */
+		flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE));
+
+		/* remove pages from user vmas */
+		zap_pages(arena, full_uaddr, page_cnt);
+
+		kfree_nolock(s);
+	}
+
+	/* free all pages collected by apply_to_existing_page_range() in the first loop */
+	llist_for_each_safe(pos, t, __llist_del_all(&free_pages)) {
+		page = llist_entry(pos, struct page, pcp_llist);
+		__free_page(page);
+	}
+}
+
+static void arena_free_irq(struct irq_work *iw)
+{
+	struct bpf_arena *arena = container_of(iw, struct bpf_arena, free_irq);
+
+	schedule_work(&arena->free_work);
 }

 __bpf_kfunc_start_defs();
···
 	if (map->map_type != BPF_MAP_TYPE_ARENA || flags || !page_cnt)
 		return NULL;

-	return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id);
+	return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id, true);
 }

+void *bpf_arena_alloc_pages_non_sleepable(void *p__map, void *addr__ign, u32 page_cnt,
+					  int node_id, u64 flags)
+{
+	struct bpf_map *map = p__map;
+	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+
+	if (map->map_type != BPF_MAP_TYPE_ARENA || flags || !page_cnt)
+		return NULL;
+
+	return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id, false);
+}
 __bpf_kfunc void bpf_arena_free_pages(void *p__map, void *ptr__ign, u32 page_cnt)
 {
 	struct bpf_map *map = p__map;
···

 	if (map->map_type != BPF_MAP_TYPE_ARENA || !page_cnt || !ptr__ign)
 		return;
-	arena_free_pages(arena, (long)ptr__ign, page_cnt);
+	arena_free_pages(arena, (long)ptr__ign, page_cnt, true);
+}
+
+void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt)
+{
+	struct bpf_map *map = p__map;
+	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+
+	if (map->map_type != BPF_MAP_TYPE_ARENA || !page_cnt || !ptr__ign)
+		return;
+	arena_free_pages(arena, (long)ptr__ign, page_cnt, false);
 }

 __bpf_kfunc int bpf_arena_reserve_pages(void *p__map, void *ptr__ign, u32 page_cnt)
···
 __bpf_kfunc_end_defs();

 BTF_KFUNCS_START(arena_kfuncs)
-BTF_ID_FLAGS(func, bpf_arena_alloc_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_RET | KF_ARENA_ARG2)
-BTF_ID_FLAGS(func, bpf_arena_free_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_ARG2)
-BTF_ID_FLAGS(func, bpf_arena_reserve_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_ARG2)
+BTF_ID_FLAGS(func, bpf_arena_alloc_pages, KF_TRUSTED_ARGS | KF_ARENA_RET | KF_ARENA_ARG2)
+BTF_ID_FLAGS(func, bpf_arena_free_pages, KF_TRUSTED_ARGS | KF_ARENA_ARG2)
+BTF_ID_FLAGS(func, bpf_arena_reserve_pages, KF_TRUSTED_ARGS | KF_ARENA_ARG2)
 BTF_KFUNCS_END(arena_kfuncs)

 static const struct btf_kfunc_id_set common_kfunc_set = {
+10
kernel/bpf/verifier.c
···
 	KF___bpf_trap,
 	KF_bpf_task_work_schedule_signal_impl,
 	KF_bpf_task_work_schedule_resume_impl,
+	KF_bpf_arena_alloc_pages,
+	KF_bpf_arena_free_pages,
 };

 BTF_ID_LIST(special_kfunc_list)
···
 BTF_ID(func, __bpf_trap)
 BTF_ID(func, bpf_task_work_schedule_signal_impl)
 BTF_ID(func, bpf_task_work_schedule_resume_impl)
+BTF_ID(func, bpf_arena_alloc_pages)
+BTF_ID(func, bpf_arena_free_pages)

 static bool is_task_work_add_kfunc(u32 func_id)
 {
···
 	} else if (func_id == special_kfunc_list[KF_bpf_dynptr_from_file]) {
 		if (!env->insn_aux_data[insn_idx].non_sleepable)
 			addr = (unsigned long)bpf_dynptr_from_file_sleepable;
+	} else if (func_id == special_kfunc_list[KF_bpf_arena_alloc_pages]) {
+		if (env->insn_aux_data[insn_idx].non_sleepable)
+			addr = (unsigned long)bpf_arena_alloc_pages_non_sleepable;
+	} else if (func_id == special_kfunc_list[KF_bpf_arena_free_pages]) {
+		if (env->insn_aux_data[insn_idx].non_sleepable)
+			addr = (unsigned long)bpf_arena_free_pages_non_sleepable;
 	}
 	desc->addr = addr;
 	return 0;
+15 -5
tools/testing/selftests/bpf/prog_tests/arena_list.c
···
 	return sum;
 }

-static void test_arena_list_add_del(int cnt)
+static void test_arena_list_add_del(int cnt, bool nonsleepable)
 {
 	LIBBPF_OPTS(bpf_test_run_opts, opts);
 	struct arena_list *skel;
 	int expected_sum = (u64)cnt * (cnt - 1) / 2;
 	int ret, sum;

-	skel = arena_list__open_and_load();
-	if (!ASSERT_OK_PTR(skel, "arena_list__open_and_load"))
+	skel = arena_list__open();
+	if (!ASSERT_OK_PTR(skel, "arena_list__open"))
 		return;
+
+	skel->rodata->nonsleepable = nonsleepable;
+
+	ret = arena_list__load(skel);
+	if (!ASSERT_OK(ret, "arena_list__load"))
+		goto out;

 	skel->bss->cnt = cnt;
 	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.arena_list_add), &opts);
···
 void test_arena_list(void)
 {
 	if (test__start_subtest("arena_list_1"))
-		test_arena_list_add_del(1);
+		test_arena_list_add_del(1, false);
 	if (test__start_subtest("arena_list_1000"))
-		test_arena_list_add_del(1000);
+		test_arena_list_add_del(1000, false);
+	if (test__start_subtest("arena_list_1_nonsleepable"))
+		test_arena_list_add_del(1, true);
+	if (test__start_subtest("arena_list_1000_nonsleepable"))
+		test_arena_list_add_del(1000, true);
 }
+11
tools/testing/selftests/bpf/progs/arena_list.c
···
 int list_sum;
 int cnt;
 bool skip = false;
+const volatile bool nonsleepable = false;

 #ifdef __BPF_FEATURE_ADDR_SPACE_CAST
 long __arena arena_sum;
···
 #endif

 int zero;
+
+void bpf_rcu_read_lock(void) __ksym;
+void bpf_rcu_read_unlock(void) __ksym;

 SEC("syscall")
 int arena_list_add(void *ctx)
···
 	struct elem __arena *n;
 	int sum = 0;

+	/* Take rcu_read_lock to test non-sleepable context */
+	if (nonsleepable)
+		bpf_rcu_read_lock();
+
 	arena_sum = 0;
 	list_for_each_entry(n, list_head, node) {
 		sum += n->value;
···
 		bpf_free(n);
 	}
 	list_sum = sum;
+
+	if (nonsleepable)
+		bpf_rcu_read_unlock();
 #else
 	skip = true;
 #endif
+185
tools/testing/selftests/bpf/progs/verifier_arena.c
···
 #endif
 } arena SEC(".maps");

+SEC("socket")
+__success __retval(0)
+int basic_alloc1_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	volatile int __arena *page1, *page2, *no_page;
+
+	page1 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (!page1)
+		return 1;
+	*page1 = 1;
+	page2 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (!page2)
+		return 2;
+	*page2 = 2;
+	no_page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (no_page)
+		return 3;
+	if (*page1 != 1)
+		return 4;
+	if (*page2 != 2)
+		return 5;
+	bpf_arena_free_pages(&arena, (void __arena *)page2, 1);
+	if (*page1 != 1)
+		return 6;
+	if (*page2 != 0 && *page2 != 2) /* use-after-free should return 0 or the stored value */
+		return 7;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int basic_alloc1(void *ctx)
···
 		return 9;
 	if (*page1 != 1)
 		return 10;
+#endif
+	return 0;
+}
+
+SEC("socket")
+__success __retval(0)
+int basic_alloc2_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	volatile char __arena *page1, *page2, *page3, *page4;
+
+	page1 = bpf_arena_alloc_pages(&arena, NULL, 2, NUMA_NO_NODE, 0);
+	if (!page1)
+		return 1;
+	page2 = page1 + __PAGE_SIZE;
+	page3 = page1 + __PAGE_SIZE * 2;
+	page4 = page1 - __PAGE_SIZE;
+	*page1 = 1;
+	*page2 = 2;
+	*page3 = 3;
+	*page4 = 4;
+	if (*page1 != 1)
+		return 1;
+	if (*page2 != 2)
+		return 2;
+	if (*page3 != 0)
+		return 3;
+	if (*page4 != 0)
+		return 4;
+	bpf_arena_free_pages(&arena, (void __arena *)page1, 2);
+	if (*page1 != 0 && *page1 != 1)
+		return 5;
+	if (*page2 != 0 && *page2 != 2)
+		return 6;
+	if (*page3 != 0)
+		return 7;
+	if (*page4 != 0)
+		return 8;
 #endif
 	return 0;
 }
···
 	struct bpf_map map;
 } __attribute__((preserve_access_index));

+SEC("socket")
+__success __retval(0) __log_level(2)
+int basic_alloc3_nosleep(void *ctx)
+{
+	struct bpf_arena___l *ar = (struct bpf_arena___l *)&arena;
+	volatile char __arena *pages;
+
+	pages = bpf_arena_alloc_pages(&ar->map, NULL, ar->map.max_entries, NUMA_NO_NODE, 0);
+	if (!pages)
+		return 1;
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0) __log_level(2)
 int basic_alloc3(void *ctx)
···
 	pages = bpf_arena_alloc_pages(&ar->map, NULL, ar->map.max_entries, NUMA_NO_NODE, 0);
 	if (!pages)
 		return 1;
+	return 0;
+}
+
+SEC("socket")
+__success __retval(0)
+int basic_reserve1_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (!page)
+		return 1;
+
+	page += __PAGE_SIZE;
+
+	/* Reserve the second page */
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret)
+		return 2;
+
+	/* Try to explicitly allocate the reserved page. */
+	page = bpf_arena_alloc_pages(&arena, page, 1, NUMA_NO_NODE, 0);
+	if (page)
+		return 3;
+
+	/* Try to implicitly allocate the page (since there's only 2 of them). */
+	page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (page)
+		return 4;
+#endif
 	return 0;
 }

···
 	return 0;
 }

+SEC("socket")
+__success __retval(0)
+int basic_reserve2_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	page = arena_base(&arena);
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret)
+		return 1;
+
+	page = bpf_arena_alloc_pages(&arena, page, 1, NUMA_NO_NODE, 0);
+	if ((u64)page)
+		return 2;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int basic_reserve2(void *ctx)
···
 }

 /* Reserve the same page twice, should return -EBUSY. */
+SEC("socket")
+__success __retval(0)
+int reserve_twice_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	page = arena_base(&arena);
+
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret)
+		return 1;
+
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret != -EBUSY)
+		return 2;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int reserve_twice(void *ctx)
···
 }

 /* Try to reserve past the end of the arena. */
+SEC("socket")
+__success __retval(0)
+int reserve_invalid_region_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	/* Try a NULL pointer.
*/ 202 + ret = bpf_arena_reserve_pages(&arena, NULL, 3); 203 + if (ret != -EINVAL) 204 + return 1; 205 + 206 + page = arena_base(&arena); 207 + 208 + ret = bpf_arena_reserve_pages(&arena, page, 3); 209 + if (ret != -EINVAL) 210 + return 2; 211 + 212 + ret = bpf_arena_reserve_pages(&arena, page, 4096); 213 + if (ret != -EINVAL) 214 + return 3; 215 + 216 + ret = bpf_arena_reserve_pages(&arena, page, (1ULL << 32) - 1); 217 + if (ret != -EINVAL) 218 + return 4; 219 + #endif 220 + return 0; 221 + } 222 + 348 223 SEC("syscall") 349 224 __success __retval(0) 350 225 int reserve_invalid_region(void *ctx)
+29
tools/testing/selftests/bpf/progs/verifier_arena_large.c
···
 		return 9;
 	return 0;
 }
+
+SEC("socket")
+__success __retval(0)
+int big_alloc3(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *pages;
+	u64 i;
+
+	/*
+	 * Allocate 2051 pages in one go to check how kmalloc_nolock() handles large requests.
+	 * Since kmalloc_nolock() can allocate up to 1024 struct page * at a time, this call should
+	 * result in three batches: two batches of 1024 pages each, followed by a final batch of 3
+	 * pages.
+	 */
+	pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);
+	if (!pages)
+		return -1;
+
+	bpf_for(i, 0, 2051)
+		pages[i * PAGE_SIZE] = 123;
+	bpf_for(i, 0, 2051)
+		if (pages[i * PAGE_SIZE] != 123)
+			return i;
+
+	bpf_arena_free_pages(&arena, pages, 2051);
+#endif
+	return 0;
+}
 #endif
 char _license[] SEC("license") = "GPL";
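The batch arithmetic described in the big_alloc3 comment (2051 pages split into 1024 + 1024 + 3) can be checked with a small userspace sketch. This is not the kernel's allocation path: `BATCH_MAX` and `split_batches()` are hypothetical names introduced here for illustration, and the limit of 1024 is taken only from the test comment.

```c
#include <assert.h>

/* Assumption: per-batch limit of 1024 entries, as stated in the big_alloc3 comment. */
#define BATCH_MAX 1024

/*
 * Hypothetical helper (not a kernel function): split page_cnt pages into
 * batches of at most BATCH_MAX, store each batch size in sizes[], and
 * return the number of batches written (never more than max_batches).
 */
static int split_batches(long page_cnt, long *sizes, int max_batches)
{
	int n = 0;

	assert(sizes && max_batches > 0);

	while (page_cnt > 0 && n < max_batches) {
		long this_batch = page_cnt < BATCH_MAX ? page_cnt : BATCH_MAX;

		sizes[n++] = this_batch;
		page_cnt -= this_batch;
	}
	return n;
}
```

Calling `split_batches(2051, sizes, 8)` on an array of eight `long` entries gives the three batch sizes the comment expects; the odd page count is what forces a short final batch.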