Merge branch 'bpf-replace-wq-users-and-add-wq_percpu-to-alloc_workqueue-users'

Marco Crivellari says:

====================
Below is a summary of a discussion about the Workqueue API and cpu isolation
considerations. Details and more information are available here:

"workqueue: Always use wq_select_unbound_cpu() for WORK_CPU_UNBOUND."
https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/

=== Current situation: problems ===

Let's consider a nohz_full system with isolated CPUs: wq_unbound_cpumask is
set to the housekeeping CPUs, while for !WQ_UNBOUND the local CPU is selected.

This leads to different scenarios when a work item is scheduled on an isolated
CPU, depending on whether the "delay" value is 0 or greater than 0:

schedule_delayed_work(, 0);

This is handled by __queue_work(), which queues the work item on the
current local (isolated) CPU, while:

schedule_delayed_work(, 1);

will move the timer to a housekeeping CPU and schedule the work there.
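The two cases above can be sketched as a small userspace model. This is not kernel code: the CPU layout (CPUs 0-1 housekeeping, CPUs 2-3 isolated) and every `model_`/`MODEL_` name are assumptions made for illustration, and the choice of the first housekeeping CPU merely stands in for wherever the timer is actually migrated.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical layout for this sketch: CPUs 0-1 housekeeping, 2-3 isolated. */
#define MODEL_NR_CPUS 4
static const bool model_housekeeping[MODEL_NR_CPUS] = { true, true, false, false };

/* First housekeeping CPU; a stand-in for the timer's migration target. */
static int model_first_housekeeping_cpu(void)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (model_housekeeping[cpu])
			return cpu;
	return -1;
}

/* CPU on which a per-cpu delayed work item ends up running in this model. */
static int model_delayed_work_cpu(int local_cpu, unsigned long delay)
{
	assert(local_cpu >= 0 && local_cpu < MODEL_NR_CPUS);

	if (delay == 0)
		return local_cpu;	/* queued directly on the local CPU */
	if (model_housekeeping[local_cpu])
		return local_cpu;	/* already on a housekeeping CPU */
	return model_first_housekeeping_cpu();	/* timer leaves the isolated CPU */
}
```

In this model, calling `model_delayed_work_cpu(3, 0)` keeps the work on isolated CPU 3, while `model_delayed_work_cpu(3, 1)` lands it on housekeeping CPU 0, which is the inconsistency described above.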

Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again uses
WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

=== Plan and future plans ===

This patchset is the first step in the refactoring needed to address the
points above; in the long term it will also benefit cpu isolation, by
moving away from per-cpu workqueues in favor of an unbound model.

These are the main steps:
1) API refactoring (introduced by this patchset)
- Make the system wq names clearer and more uniform, both per-cpu and
unbound, to avoid any confusion about which one should be used.

- Introduction of WQ_PERCPU: this flag is the complement of WQ_UNBOUND,
introduced in this patchset and used on all the callers that are not
currently using WQ_UNBOUND.

WQ_UNBOUND will be removed in a future release cycle.

Most users don't need to be per-cpu, because they don't have
locality requirements. Because of that, a future step will be to
make "unbound" the default behavior.

2) Check who really needs to be per-cpu
- Remove the WQ_PERCPU flag when it is not strictly required.

3) Add a new API (prefer local cpu)
- There are users that don't require local execution, as mentioned
above; nevertheless, local execution yields a performance gain.

This new API will prefer local execution, without requiring it.

=== Changes introduced by this series ===

1) [P 1-2] Replace use of system_wq and system_unbound_wq

system_wq is a per-CPU workqueue, but its name does not make that clear.
system_unbound_wq is to be used when locality is not required.

Because of that, system_wq has been renamed to system_percpu_wq, and
system_unbound_wq has been renamed to system_dfl_wq.

2) [P 3] add WQ_PERCPU to remaining alloc_workqueue() users

Every alloc_workqueue() caller should pass exactly one of WQ_PERCPU or
WQ_UNBOUND. This is enforced by a warning if both or neither of them
is present.

WQ_UNBOUND will be removed in a future release cycle.
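The "exactly one of the two flags" rule can be expressed as a simple predicate. The following is a userspace sketch only: the `MODEL_` bit values are placeholders chosen for this example (the real definitions live in include/linux/workqueue.h), and `model_wq_flags_ok()` is a hypothetical helper, not the check the kernel actually performs.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values for this sketch; not the kernel's definitions. */
enum {
	MODEL_WQ_UNBOUND = 1 << 0,
	MODEL_WQ_PERCPU  = 1 << 1,
};

/* True when exactly one of the two placement flags is set. */
static bool model_wq_flags_ok(unsigned int flags)
{
	bool unbound = (flags & MODEL_WQ_UNBOUND) != 0;
	bool percpu = (flags & MODEL_WQ_PERCPU) != 0;

	return unbound != percpu;
}

/* Small demonstration of the four possible combinations. */
void model_wq_flags_demo(void)
{
	assert(model_wq_flags_ok(MODEL_WQ_PERCPU));
	assert(model_wq_flags_ok(MODEL_WQ_UNBOUND));
	assert(!model_wq_flags_ok(0));	/* neither flag: would warn */
	assert(!model_wq_flags_ok(MODEL_WQ_PERCPU | MODEL_WQ_UNBOUND));	/* both: would warn */
}
```

Under this rule, a call such as alloc_workqueue("cgroup_bpf_destroy", 0, 1) falls into the "neither" case, which is why the series adds WQ_PERCPU to such callers.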

=== For Maintainers ===

There are prerequisites for this series, already merged in the master branch.
The commits are:

128ea9f6ccfb6960293ae4212f4f97165e42222d ("workqueue: Add system_percpu_wq and
system_dfl_wq")

930c2ea566aff59e962c50b2421d5fcc3b98b8be ("workqueue: Add new WQ_PERCPU flag")
====================

Acked-by: Tejun Heo <tj@kernel.org>
Link: https://patch.msgid.link/20250905085309.94596-1-marco.crivellari@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Alexei Starovoitov 9 months ago 60ef5415 93a83d04

+8 -7

5 changed files


+3 -2

kernel/bpf/cgroup.c

···
 /*
  * cgroup bpf destruction makes heavy use of work items and there can be a lot
  * of concurrent destructions. Use a separate workqueue so that cgroup bpf
- * destruction work items don't end up filling up max_active of system_wq
+ * destruction work items don't end up filling up max_active of system_percpu_wq
  * which may lead to deadlock.
  */
 static struct workqueue_struct *cgroup_bpf_destroy_wq;

 static int __init cgroup_bpf_wq_init(void)
 {
-	cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy", 0, 1);
+	cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy",
+						WQ_PERCPU, 1);
 	if (!cgroup_bpf_destroy_wq)
 		panic("Failed to alloc workqueue for cgroup bpf destroy.\n");
 	return 0;

+1 -1

kernel/bpf/cpumap.c

···
 	old_rcpu = unrcu_pointer(xchg(&cmap->cpu_map[key_cpu], RCU_INITIALIZER(rcpu)));
 	if (old_rcpu) {
 		INIT_RCU_WORK(&old_rcpu->free_work, __cpu_map_entry_free);
-		queue_rcu_work(system_wq, &old_rcpu->free_work);
+		queue_rcu_work(system_percpu_wq, &old_rcpu->free_work);
 	}
 }

+2 -2

kernel/bpf/helpers.c

···
 	 * timer callback.
 	 */
 	if (this_cpu_read(hrtimer_running)) {
-		queue_work(system_unbound_wq, &t->cb.delete_work);
+		queue_work(system_dfl_wq, &t->cb.delete_work);
 		return;
 	}

···
 		if (hrtimer_try_to_cancel(&t->timer) >= 0)
 			kfree_rcu(t, cb.rcu);
 		else
-			queue_work(system_unbound_wq, &t->cb.delete_work);
+			queue_work(system_dfl_wq, &t->cb.delete_work);
 	} else {
 		bpf_timer_delete_work(&t->cb.delete_work);
 	}

+1 -1

kernel/bpf/memalloc.c

···
 	/* Defer barriers into worker to let the rest of map memory to be freed */
 	memset(ma, 0, sizeof(*ma));
 	INIT_WORK(&copy->work, free_mem_alloc_deferred);
-	queue_work(system_unbound_wq, &copy->work);
+	queue_work(system_dfl_wq, &copy->work);
 }

 void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)

+1 -1

kernel/bpf/syscall.c

···
 	/* Avoid spawning kworkers, since they all might contend
 	 * for the same mutex like slab_mutex.
 	 */
-	queue_work(system_unbound_wq, &map->work);
+	queue_work(system_dfl_wq, &map->work);
 }

 static void bpf_map_free_rcu_gp(struct rcu_head *rcu)
