sched_ext: Add verifier-time kfunc context filter

Move enforcement of SCX context-sensitive kfunc restrictions from per-task
runtime kf_mask checks to BPF verifier-time filtering, using the BPF core's
struct_ops context information.

A shared .filter callback is attached to each context-sensitive BTF set
and consults a per-op allow table (scx_kf_allow_flags[]) indexed by SCX
ops member offset. Disallowed calls are now rejected at program load time
instead of at runtime.

The old model split reachability across two places: each SCX_CALL_OP*()
set bits naming its op context, and each kfunc's scx_kf_allowed() check
OR'd together the bits it accepted. A kfunc was callable when those two
masks overlapped. The new model transposes the result to the caller side:
each op's allow flags directly list the kfunc groups it may call. The old
bit assignments were:

  Call-site bits:
    ops.select_cpu  = ENQUEUE | SELECT_CPU
    ops.enqueue     = ENQUEUE
    ops.dispatch    = DISPATCH
    ops.cpu_release = CPU_RELEASE

  Kfunc-group accepted bits:
    enqueue group     = ENQUEUE | DISPATCH
    select_cpu group  = SELECT_CPU | ENQUEUE
    dispatch group    = DISPATCH
    cpu_release group = CPU_RELEASE

Intersecting them yields the reachability now expressed directly by
scx_kf_allow_flags[]:

    ops.select_cpu  -> SELECT_CPU | ENQUEUE
    ops.enqueue     -> SELECT_CPU | ENQUEUE
    ops.dispatch    -> ENQUEUE | DISPATCH
    ops.cpu_release -> CPU_RELEASE

Unlocked ops carried no kf_mask bits and reached only unlocked kfuncs;
that maps directly to UNLOCKED in the new table.

Equivalence was checked by walking every (op, kfunc-group) combination
across SCX ops, SYSCALL, and non-SCX struct_ops callers against the old
scx_kf_allowed() runtime checks. With two intended exceptions (see below),
all combinations reach the same verdict; disallowed calls are now caught at
load time instead of firing scx_error() at runtime.

scx_bpf_dsq_move_set_slice() and scx_bpf_dsq_move_set_vtime() are
exceptions: they have no runtime check at all, but the new filter rejects
them from ops outside dispatch/unlocked. The affected cases are
nonsensical: the values these setters store are only read by
scx_bpf_dsq_move{,_vtime}(), which is itself restricted to
dispatch/unlocked, so a setter call from anywhere else was already dead
code.

Runtime scx_kf_mask enforcement is left in place by this patch and removed
in a follow-up.

Original-patch-by: Juntong Deng <juntong.deng@outlook.com>
Original-patch-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

Tejun Heo 2 months ago d1d3c1c6 2193af26

+125 -5

4 changed files

+119 -5

kernel/sched/ext.c

···
 static const struct btf_kfunc_id_set scx_kfunc_set_enqueue_dispatch = {
 	.owner = THIS_MODULE,
 	.set = &scx_kfunc_ids_enqueue_dispatch,
+	.filter = scx_kfunc_context_filter,
 };

 static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
···
 static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = {
 	.owner = THIS_MODULE,
 	.set = &scx_kfunc_ids_dispatch,
+	.filter = scx_kfunc_context_filter,
 };

 __bpf_kfunc_start_defs();
···
 static const struct btf_kfunc_id_set scx_kfunc_set_cpu_release = {
 	.owner = THIS_MODULE,
 	.set = &scx_kfunc_ids_cpu_release,
+	.filter = scx_kfunc_context_filter,
 };

 __bpf_kfunc_start_defs();
···
 static const struct btf_kfunc_id_set scx_kfunc_set_unlocked = {
 	.owner = THIS_MODULE,
 	.set = &scx_kfunc_ids_unlocked,
+	.filter = scx_kfunc_context_filter,
 };

 __bpf_kfunc_start_defs();
···
 	.set = &scx_kfunc_ids_any,
 };

+/*
+ * Per-op kfunc allow flags. Each bit corresponds to a context-sensitive kfunc
+ * group; an op may permit zero or more groups, with the union expressed in
+ * scx_kf_allow_flags[]. The verifier-time filter (scx_kfunc_context_filter())
+ * consults this table to decide whether a context-sensitive kfunc is callable
+ * from a given SCX op.
+ */
+enum scx_kf_allow_flags {
+	SCX_KF_ALLOW_UNLOCKED = 1 << 0,
+	SCX_KF_ALLOW_CPU_RELEASE = 1 << 1,
+	SCX_KF_ALLOW_DISPATCH = 1 << 2,
+	SCX_KF_ALLOW_ENQUEUE = 1 << 3,
+	SCX_KF_ALLOW_SELECT_CPU = 1 << 4,
+};
+
+/*
+ * Map each SCX op to the union of kfunc groups it permits, indexed by
+ * SCX_OP_IDX(op). Ops not listed only permit kfuncs that are not
+ * context-sensitive.
+ */
+static const u32 scx_kf_allow_flags[] = {
+	[SCX_OP_IDX(select_cpu)] = SCX_KF_ALLOW_SELECT_CPU | SCX_KF_ALLOW_ENQUEUE,
+	[SCX_OP_IDX(enqueue)] = SCX_KF_ALLOW_SELECT_CPU | SCX_KF_ALLOW_ENQUEUE,
+	[SCX_OP_IDX(dispatch)] = SCX_KF_ALLOW_ENQUEUE | SCX_KF_ALLOW_DISPATCH,
+	[SCX_OP_IDX(cpu_release)] = SCX_KF_ALLOW_CPU_RELEASE,
+	[SCX_OP_IDX(init_task)] = SCX_KF_ALLOW_UNLOCKED,
+	[SCX_OP_IDX(dump)] = SCX_KF_ALLOW_UNLOCKED,
+#ifdef CONFIG_EXT_GROUP_SCHED
+	[SCX_OP_IDX(cgroup_init)] = SCX_KF_ALLOW_UNLOCKED,
+	[SCX_OP_IDX(cgroup_exit)] = SCX_KF_ALLOW_UNLOCKED,
+	[SCX_OP_IDX(cgroup_prep_move)] = SCX_KF_ALLOW_UNLOCKED,
+	[SCX_OP_IDX(cgroup_cancel_move)] = SCX_KF_ALLOW_UNLOCKED,
+	[SCX_OP_IDX(cgroup_set_weight)] = SCX_KF_ALLOW_UNLOCKED,
+	[SCX_OP_IDX(cgroup_set_bandwidth)] = SCX_KF_ALLOW_UNLOCKED,
+	[SCX_OP_IDX(cgroup_set_idle)] = SCX_KF_ALLOW_UNLOCKED,
+#endif	/* CONFIG_EXT_GROUP_SCHED */
+	[SCX_OP_IDX(sub_attach)] = SCX_KF_ALLOW_UNLOCKED,
+	[SCX_OP_IDX(sub_detach)] = SCX_KF_ALLOW_UNLOCKED,
+	[SCX_OP_IDX(cpu_online)] = SCX_KF_ALLOW_UNLOCKED,
+	[SCX_OP_IDX(cpu_offline)] = SCX_KF_ALLOW_UNLOCKED,
+	[SCX_OP_IDX(init)] = SCX_KF_ALLOW_UNLOCKED,
+	[SCX_OP_IDX(exit)] = SCX_KF_ALLOW_UNLOCKED,
+};
+
+/*
+ * Verifier-time filter for context-sensitive SCX kfuncs. Registered via the
+ * .filter field on each per-group btf_kfunc_id_set. The BPF core invokes this
+ * for every kfunc call in the registered hook (BPF_PROG_TYPE_STRUCT_OPS or
+ * BPF_PROG_TYPE_SYSCALL), regardless of which set originally introduced the
+ * kfunc - so the filter must short-circuit on kfuncs it doesn't govern (e.g.
+ * scx_kfunc_ids_any) by falling through to "allow" when none of the
+ * context-sensitive sets contain the kfunc.
+ */
+int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id)
+{
+	bool in_unlocked = btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id);
+	bool in_select_cpu = btf_id_set8_contains(&scx_kfunc_ids_select_cpu, kfunc_id);
+	bool in_enqueue = btf_id_set8_contains(&scx_kfunc_ids_enqueue_dispatch, kfunc_id);
+	bool in_dispatch = btf_id_set8_contains(&scx_kfunc_ids_dispatch, kfunc_id);
+	bool in_cpu_release = btf_id_set8_contains(&scx_kfunc_ids_cpu_release, kfunc_id);
+	u32 moff, flags;
+
+	/* Not a context-sensitive kfunc (e.g. from scx_kfunc_ids_any) - allow. */
+	if (!(in_unlocked || in_select_cpu || in_enqueue || in_dispatch || in_cpu_release))
+		return 0;
+
+	/* SYSCALL progs (e.g. BPF test_run()) may call unlocked and select_cpu kfuncs. */
+	if (prog->type == BPF_PROG_TYPE_SYSCALL)
+		return (in_unlocked || in_select_cpu) ? 0 : -EACCES;
+
+	if (prog->type != BPF_PROG_TYPE_STRUCT_OPS)
+		return -EACCES;
+
+	/*
+	 * add_subprog_and_kfunc() collects all kfunc calls, including dead code
+	 * guarded by bpf_ksym_exists(), before check_attach_btf_id() sets
+	 * prog->aux->st_ops. Allow all kfuncs when st_ops is not yet set;
+	 * do_check_main() re-runs the filter with st_ops set and enforces the
+	 * actual restrictions.
+	 */
+	if (!prog->aux->st_ops)
+		return 0;
+
+	/*
+	 * Non-SCX struct_ops: only unlocked kfuncs are safe. The other
+	 * context-sensitive kfuncs assume the rq lock is held by the SCX
+	 * dispatch path, which doesn't apply to other struct_ops users.
+	 */
+	if (prog->aux->st_ops != &bpf_sched_ext_ops)
+		return in_unlocked ? 0 : -EACCES;
+
+	/* SCX struct_ops: check the per-op allow list. */
+	moff = prog->aux->attach_st_ops_member_off;
+	flags = scx_kf_allow_flags[SCX_MOFF_IDX(moff)];
+
+	if ((flags & SCX_KF_ALLOW_UNLOCKED) && in_unlocked)
+		return 0;
+	if ((flags & SCX_KF_ALLOW_CPU_RELEASE) && in_cpu_release)
+		return 0;
+	if ((flags & SCX_KF_ALLOW_DISPATCH) && in_dispatch)
+		return 0;
+	if ((flags & SCX_KF_ALLOW_ENQUEUE) && in_enqueue)
+		return 0;
+	if ((flags & SCX_KF_ALLOW_SELECT_CPU) && in_select_cpu)
+		return 0;
+
+	return -EACCES;
+}
+
 static int __init scx_init(void)
 {
 	int ret;
···
 	 * register_btf_kfunc_id_set() needs most of the system to be up.
 	 *
 	 * Some kfuncs are context-sensitive and can only be called from
-	 * specific SCX ops. They are grouped into BTF sets accordingly.
-	 * Unfortunately, BPF currently doesn't have a way of enforcing such
-	 * restrictions. Eventually, the verifier should be able to enforce
-	 * them. For now, register them the same and make each kfunc explicitly
-	 * check using scx_kf_allowed().
+	 * specific SCX ops. They are grouped into per-context BTF sets, each
+	 * registered with scx_kfunc_context_filter as its .filter callback. The
+	 * BPF core dedups identical filter pointers per hook
+	 * (btf_populate_kfunc_set()), so the filter is invoked exactly once per
+	 * kfunc lookup; it consults scx_kf_allow_flags[] to enforce per-op
+	 * restrictions at verify time.
 	 */
 	if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
 					     &scx_kfunc_set_enqueue_dispatch)) ||

kernel/sched/ext_idle.c

···
 static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
 	.owner = THIS_MODULE,
 	.set = &scx_kfunc_ids_select_cpu,
+	.filter = scx_kfunc_context_filter,
 };

 int scx_idle_init(void)

kernel/sched/ext_idle.h

···
 struct sched_ext_ops;

+extern struct btf_id_set8 scx_kfunc_ids_select_cpu;
+
 void scx_idle_update_selcpu_topology(struct sched_ext_ops *ops);
 void scx_idle_init_masks(void);

kernel/sched/ext_internal.h

···
  * Copyright (c) 2025 Tejun Heo <tj@kernel.org>
  */
 #define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
+#define SCX_MOFF_IDX(moff) ((moff) / sizeof(void (*)(void)))

 enum scx_consts {
 	SCX_DSP_DFL_MAX_BATCH = 32,
···
 extern struct scx_sched __rcu *scx_root;
 DECLARE_PER_CPU(struct rq *, scx_locked_rq_state);
+
+int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id);

 /*
  * Return the rq currently locked from an scx callback, or NULL if no rq is
