sched_ext: idle: Accept an arbitrary cpumask in scx_select_cpu_dfl()

Many scx schedulers implement their own hard or soft-affinity rules
to support topology characteristics, such as heterogeneous architectures
(e.g., big.LITTLE, P-cores/E-cores), or to categorize tasks based on
specific properties (e.g., running certain tasks only in a subset of
CPUs).

Currently, there is no mechanism for applying the built-in idle CPU
selection policy to an arbitrary subset of CPUs. As a result,
schedulers often implement their own idle CPU selection policies, which
are typically similar to one another, leading to a lot of code
duplication.

To address this, modify scx_select_cpu_dfl() to accept an arbitrary
cpumask that BPF schedulers can use to apply the existing built-in idle
CPU selection policy to a subset of allowed CPUs.

With this change, the idle CPU selection policy becomes the following:
- always prioritize CPUs from fully idle SMT cores (if SMT is enabled),
- select the same CPU if it's idle and in the allowed CPUs,
- select an idle CPU within the same LLC, if the LLC cpumask is a
  subset of the allowed CPUs,
- select an idle CPU within the same node, if the node cpumask is a
  subset of the allowed CPUs,
- select an idle CPU within the allowed CPUs.
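
The ordering above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel implementation: CPUs are modeled as bits of a 64-bit word, the SMT step is left out for brevity, and `cpumask_model_t`, `mask_subset()`, `first_cpu()` and `pick_idle_model()` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask_model_t;

/* True if every CPU in @a is also in @b. */
static bool mask_subset(cpumask_model_t a, cpumask_model_t b)
{
	return (a & ~b) == 0;
}

/* Lowest-numbered CPU set in @m, or -1 if @m is empty. */
static int first_cpu(cpumask_model_t m)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if ((m >> cpu) & 1)
			return cpu;
	return -1;
}

/*
 * Model of the selection order (SMT step omitted):
 *   1. @prev_cpu, if it is idle and allowed;
 *   2. an idle CPU in @llc, only if @llc is a subset of @allowed;
 *   3. an idle CPU in @node, only if @node is a subset of @allowed;
 *   4. any idle CPU in @allowed.
 * Returns -1 when no CPU qualifies.
 */
static int pick_idle_model(int prev_cpu, cpumask_model_t idle,
			   cpumask_model_t llc, cpumask_model_t node,
			   cpumask_model_t allowed)
{
	cpumask_model_t cand = idle & allowed;
	int cpu;

	assert(prev_cpu >= 0 && prev_cpu < 64);

	if ((cand >> prev_cpu) & 1)
		return prev_cpu;

	if (mask_subset(llc, allowed)) {
		cpu = first_cpu(cand & llc);
		if (cpu >= 0)
			return cpu;
	}

	if (mask_subset(node, allowed)) {
		cpu = first_cpu(cand & node);
		if (cpu >= 0)
			return cpu;
	}

	return first_cpu(cand);
}
```

In this model, when the allowed mask cuts across an LLC or node, the corresponding topology step is skipped entirely and the search falls through to the final step, which only considers CPUs that are both idle and allowed.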

This functionality will be exposed through a dedicated kfunc in a
separate patch.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

Authored by Andrea Righi and committed by Tejun Heo 1 year ago (c2d8b2a5 23c63a96)

+44 -4

1 changed file

kernel/sched/ext_idle.c

···
 /*
  * Local per-CPU cpumasks (used to generate temporary idle cpumasks).
  */
+static DEFINE_PER_CPU(cpumask_var_t, local_idle_cpumask);
 static DEFINE_PER_CPU(cpumask_var_t, local_llc_idle_cpumask);
 static DEFINE_PER_CPU(cpumask_var_t, local_numa_idle_cpumask);

···
  *     branch prediction optimizations.
  *
  * 3. Pick a CPU within the same LLC (Last-Level Cache):
- *   - if the above conditions aren't met, pick a CPU that shares the same LLC
- *     to maintain cache locality.
+ *   - if the above conditions aren't met, pick a CPU that shares the same
+ *     LLC, if the LLC domain is a subset of @cpus_allowed, to maintain
+ *     cache locality.
  *
  * 4. Pick a CPU within the same NUMA node, if enabled:
- *   - choose a CPU from the same NUMA node to reduce memory access latency.
+ *   - choose a CPU from the same NUMA node, if the node cpumask is a
+ *     subset of @cpus_allowed, to reduce memory access latency.
  *
- * 5. Pick any idle CPU usable by the task.
+ * 5. Pick any idle CPU within the @cpus_allowed domain.
  *
  * Step 3 and 4 are performed only if the system has, respectively,
  * multiple LLCs / multiple NUMA nodes (see scx_selcpu_topo_llc and
···
 	const struct cpumask *allowed = cpus_allowed ?: p->cpus_ptr;
 	int node = scx_cpu_node_if_enabled(prev_cpu);
 	s32 cpu;
+
+	preempt_disable();
+
+	/*
+	 * Determine the subset of CPUs usable by @p within @cpus_allowed.
+	 */
+	if (allowed != p->cpus_ptr) {
+		struct cpumask *local_cpus = this_cpu_cpumask_var_ptr(local_idle_cpumask);
+
+		if (task_affinity_all(p)) {
+			allowed = cpus_allowed;
+		} else if (cpumask_and(local_cpus, cpus_allowed, p->cpus_ptr)) {
+			allowed = local_cpus;
+		} else {
+			cpu = -EBUSY;
+			goto out_enable;
+		}
+
+		/*
+		 * If @prev_cpu is not in the allowed CPUs, skip topology
+		 * optimizations and try to pick any idle CPU usable by the
+		 * task.
+		 *
+		 * If %SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled, prioritize
+		 * the current node, as it may optimize some waker->wakee
+		 * workloads.
+		 */
+		if (!cpumask_test_cpu(prev_cpu, allowed)) {
+			node = scx_cpu_node_if_enabled(smp_processor_id());
+			cpu = scx_pick_idle_cpu(allowed, node, flags);
+			goto out_enable;
+		}
+	}

 	/*
 	 * This is necessary to protect llc_cpus.
···

 out_unlock:
 	rcu_read_unlock();
+out_enable:
+	preempt_enable();

 	return cpu;
 }
···

 	/* Allocate local per-cpu idle cpumasks */
 	for_each_possible_cpu(i) {
+		BUG_ON(!alloc_cpumask_var_node(&per_cpu(local_idle_cpumask, i),
+					       GFP_KERNEL, cpu_to_node(i)));
 		BUG_ON(!alloc_cpumask_var_node(&per_cpu(local_llc_idle_cpumask, i),
 					       GFP_KERNEL, cpu_to_node(i)));
 		BUG_ON(!alloc_cpumask_var_node(&per_cpu(local_numa_idle_cpumask, i),
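
The mask-narrowing step added to scx_select_cpu_dfl() can be approximated in user space as follows. This is a minimal sketch, assuming CPUs are modeled as bits of a 64-bit word; `cpumask_model_t`, `narrow_allowed()` and `mask_test_cpu()` are hypothetical names for illustration and are not kernel APIs. The kernel's task_affinity_all() shortcut is not modeled separately, since intersecting with a full affinity mask yields the same result.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask_model_t;

/*
 * Intersect the caller-provided mask with the task's affinity mask.
 * On success, store the intersection in *out and return 0.
 * If the two masks share no CPU, return -EBUSY and leave *out untouched.
 */
static int narrow_allowed(cpumask_model_t cpus_allowed,
			  cpumask_model_t task_affinity,
			  cpumask_model_t *out)
{
	cpumask_model_t both = cpus_allowed & task_affinity;

	assert(out != NULL);

	if (!both)
		return -EBUSY;

	*out = both;
	return 0;
}

/* True if @cpu is set in @mask. */
static bool mask_test_cpu(int cpu, cpumask_model_t mask)
{
	assert(cpu >= 0 && cpu < 64);
	return (mask >> cpu) & 1;
}
```

A caller in this model would first narrow the mask, then check `mask_test_cpu(prev_cpu, allowed)`; when that check fails, the patch skips the topology-aware steps and picks any idle CPU from the narrowed mask.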
