Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git


Merge tag 'wq-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue updates from Tejun Heo:

- New default WQ_AFFN_CACHE_SHARD affinity scope subdivides LLCs into
  smaller shards to improve scalability on machines with many CPUs per
  LLC

- Misc:
    - system_dfl_long_wq for long unbound works (usage sketch after the
      commit list below)
    - devm_alloc_workqueue() for device-managed allocation
    - sysfs exposure for ordered workqueues and the EFI workqueue
    - removal of HK_TYPE_WQ from wq_unbound_cpumask
    - various small fixes

* tag 'wq-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (21 commits)
workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id()
workqueue: use NR_STD_WORKER_POOLS instead of hardcoded value
workqueue: avoid unguarded 64-bit division
docs: workqueue: document WQ_AFFN_CACHE_SHARD affinity scope
workqueue: add test_workqueue benchmark module
tools/workqueue: add CACHE_SHARD support to wq_dump.py
workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
workqueue: fix typo in WQ_AFFN_SMT comment
workqueue: Remove HK_TYPE_WQ from affecting wq_unbound_cpumask
workqueue: unlink pwqs from wq->pwqs list in alloc_and_link_pwqs() error path
workqueue: Remove NULL wq WARN in __queue_delayed_work()
workqueue: fix parse_affn_scope() prefix matching bug
workqueue: devres: Add device-managed allocate workqueue
workqueue: Add system_dfl_long_wq for long unbound works
tools/workqueue/wq_dump.py: add NODE prefix to all node columns
tools/workqueue/wq_dump.py: fix column alignment in node_nr/max_active section
tools/workqueue/wq_dump.py: remove backslash separator from node_nr/max_active header
efi: Allow to expose the workqueue via sysfs
workqueue: Allow to expose ordered workqueues via sysfs
...
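
For context, the new system_dfl_long_wq is to system_dfl_wq what system_long_wq is to the per-CPU system workqueues: an unbound queue for work items that may run for a long time. A minimal usage sketch (the mydrv_* names are hypothetical, not part of this series):

    #include <linux/workqueue.h>

    static void mydrv_scan_fn(struct work_struct *work)
    {
        /* potentially long-running body; runs unbound, not concurrency managed */
    }

    static DECLARE_WORK(mydrv_scan_work, mydrv_scan_fn);

    static void mydrv_kick_scan(void)
    {
        /* queue onto the new long-running unbound system workqueue */
        queue_work(system_dfl_long_wq, &mydrv_scan_work);
    }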

+629 -51
+2 -1
Documentation/admin-guide/kernel-parameters.txt
···
 	workqueue.default_affinity_scope=
 			Select the default affinity scope to use for unbound
 			workqueues. Can be one of "cpu", "smt", "cache",
-			"numa" and "system". Default is "cache". For more
+			"cache_shard", "numa" and "system". Default is
+			"cache_shard". For more
 			information, see the Affinity Scopes section in
 			Documentation/core-api/workqueue.rst.
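
For example, booting with "workqueue.default_affinity_scope=cache" on the kernel command line should restore the previous default grouping by full LLC.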
+10 -4
Documentation/core-api/workqueue.rst
···
 An unbound workqueue groups CPUs according to its affinity scope to improve
 cache locality. For example, if a workqueue is using the default affinity
-scope of "cache", it will group CPUs according to last level cache
-boundaries. A work item queued on the workqueue will be assigned to a worker
-on one of the CPUs which share the last level cache with the issuing CPU.
+scope of "cache_shard", it will group CPUs into sub-LLC shards. A work item
+queued on the workqueue will be assigned to a worker on one of the CPUs
+within the same shard as the issuing CPU.
 Once started, the worker may or may not be allowed to move outside the scope
 depending on the ``affinity_strict`` setting of the scope.
···
 ``cache``
   CPUs are grouped according to cache boundaries. Which specific cache
   boundary is used is determined by the arch code. L3 is used in a lot of
-  cases. This is the default affinity scope.
+  cases.
+
+``cache_shard``
+  CPUs are grouped into sub-LLC shards of at most ``wq_cache_shard_size``
+  cores (default 8, tunable via the ``workqueue.cache_shard_size`` boot
+  parameter). Shards are always split on core (SMT group) boundaries.
+  This is the default affinity scope.
 
 ``numa``
   CPUs are grouped according to NUMA boundaries.
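
To illustrate the sharding arithmetic documented above, here is a hedged sketch: a plain userspace C program mirroring the DIV_ROUND_CLOSEST-based layout computed by llc_calc_shard_layout() in kernel/workqueue.c (not kernel code; the default target of 8 cores per shard is assumed):

    #include <stdio.h>

    /* same rounding the kernel macro uses for positive operands */
    #define DIV_ROUND_CLOSEST(x, d) (((x) + (d) / 2) / (d))

    int main(void)
    {
        int nr_cores = 20;  /* cores in one LLC pod */
        int target = 8;     /* default wq_cache_shard_size */
        int nr_shards = DIV_ROUND_CLOSEST(nr_cores, target);
        int cores_per_shard, nr_large_shards;

        if (nr_shards < 1)
            nr_shards = 1;
        cores_per_shard = nr_cores / nr_shards;  /* 6 */
        nr_large_shards = nr_cores % nr_shards;  /* 2 */

        /* Prints: 3 shards: 2 large with 7 cores, 1 default with 6 cores */
        printf("%d shards: %d large with %d cores, %d default with %d cores\n",
               nr_shards, nr_large_shards, cores_per_shard + 1,
               nr_shards - nr_large_shards, cores_per_shard);
        return 0;
    }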
+4
Documentation/driver-api/driver-model/devres.rst
···
 
 WATCHDOG
   devm_watchdog_register_device()
+
+WORKQUEUE
+  devm_alloc_workqueue()
+  devm_alloc_ordered_workqueue()
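
A rough usage sketch of the new devres helper (mydrv_probe is hypothetical; error handling trimmed). The driver never calls destroy_workqueue() itself, since devres tears the queue down on detach:

    #include <linux/platform_device.h>
    #include <linux/workqueue.h>

    static int mydrv_probe(struct platform_device *pdev)
    {
        struct workqueue_struct *wq;

        /* destroyed automatically when the device is unbound */
        wq = devm_alloc_workqueue(&pdev->dev, "mydrv_wq", WQ_UNBOUND, 0);
        if (!wq)
            return -ENOMEM;

        return 0;
    }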
+1 -1
drivers/firmware/efi/efi.c
···
 	 * ordered workqueue (which creates only one execution context)
 	 * should suffice for all our needs.
 	 */
-	efi_rts_wq = alloc_ordered_workqueue("efi_rts_wq", 0);
+	efi_rts_wq = alloc_ordered_workqueue("efi_runtime", WQ_SYSFS);
 	if (!efi_rts_wq) {
 		pr_err("Creating efi_rts_wq failed, EFI runtime services disabled.\n");
 		clear_bit(EFI_RUNTIME_SERVICES, &efi.flags);
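
Note: with WQ_SYSFS set, the renamed queue should appear under /sys/bus/workqueue/devices/efi_runtime/, where attributes such as affinity_scope become tunable (relying on the ordered-workqueue sysfs change elsewhere in this series).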
+38 -9
include/linux/workqueue.h
···
 enum wq_affn_scope {
 	WQ_AFFN_DFL,		/* use system default */
 	WQ_AFFN_CPU,		/* one pod per CPU */
-	WQ_AFFN_SMT,		/* one pod poer SMT */
+	WQ_AFFN_SMT,		/* one pod per SMT */
 	WQ_AFFN_CACHE,		/* one pod per LLC */
+	WQ_AFFN_CACHE_SHARD,	/* synthetic sub-LLC shards */
 	WQ_AFFN_NUMA,		/* one pod per NUMA node */
 	WQ_AFFN_SYSTEM,		/* one pod across the whole system */
···
  * system_long_wq is similar to system_percpu_wq but may host long running
  * works. Queue flushing might take relatively long.
  *
+ * system_dfl_long_wq is similar to system_dfl_wq but it may host long running
+ * works.
+ *
  * system_dfl_wq is unbound workqueue. Workers are not bound to
  * any specific CPU, not concurrency managed, and all queued works are
  * executed immediately as long as max_active limit is not reached and
···
 extern struct workqueue_struct *system_freezable_power_efficient_wq;
 extern struct workqueue_struct *system_bh_wq;
 extern struct workqueue_struct *system_bh_highpri_wq;
+extern struct workqueue_struct *system_dfl_long_wq;
 
 void workqueue_softirq_action(bool highpri);
 void workqueue_softirq_dead(unsigned int cpu);
···
 __printf(1, 4) struct workqueue_struct *
 alloc_workqueue_noprof(const char *fmt, unsigned int flags, int max_active, ...);
 #define alloc_workqueue(...) alloc_hooks(alloc_workqueue_noprof(__VA_ARGS__))
+
+/**
+ * devm_alloc_workqueue - Resource-managed allocate a workqueue
+ * @dev: Device to allocate workqueue for
+ * @fmt: printf format for the name of the workqueue
+ * @flags: WQ_* flags
+ * @max_active: max in-flight work items, 0 for default
+ * @...: args for @fmt
+ *
+ * Resource managed workqueue, see alloc_workqueue() for details.
+ *
+ * The workqueue will be automatically destroyed on driver detach. Typically
+ * this should be used in drivers already relying on devm interfaces.
+ *
+ * RETURNS:
+ * Pointer to the allocated workqueue on success, %NULL on failure.
+ */
+__printf(2, 5) struct workqueue_struct *
+devm_alloc_workqueue(struct device *dev, const char *fmt, unsigned int flags,
+		     int max_active, ...);
 
 #ifdef CONFIG_LOCKDEP
 /**
···
  */
 #define alloc_ordered_workqueue(fmt, flags, args...)			\
 	alloc_workqueue(fmt, WQ_UNBOUND | __WQ_ORDERED | (flags), 1, ##args)
+#define devm_alloc_ordered_workqueue(dev, fmt, flags, args...)		\
+	devm_alloc_workqueue(dev, fmt, WQ_UNBOUND | __WQ_ORDERED | (flags), 1, ##args)
 
 #define create_workqueue(name)						\
 	alloc_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM | WQ_PERCPU, 1, (name))
···
 }
 
 /**
- * schedule_work - put work task in global workqueue
+ * schedule_work - put work task in per-CPU workqueue
  * @work: job to be done
  *
- * Returns %false if @work was already on the kernel-global workqueue and
+ * Returns %false if @work was already on the system per-CPU workqueue and
  * %true otherwise.
  *
- * This puts a job in the kernel-global workqueue if it was not already
- * queued and leaves it in the same position on the kernel-global
+ * This puts a job in the system per-CPU workqueue if it was not already
+ * queued and leaves it in the same position on the system per-CPU
  * workqueue otherwise.
  *
  * Shares the same memory-ordering properties of queue_work(), cf. the
···
 			 _wq == system_highpri_wq) ||			\
 			(__builtin_constant_p(_wq == system_long_wq) &&	\
 			 _wq == system_long_wq) ||			\
+			(__builtin_constant_p(_wq == system_dfl_long_wq) && \
+			 _wq == system_dfl_long_wq) ||			\
 			(__builtin_constant_p(_wq == system_dfl_wq) &&	\
 			 _wq == system_dfl_wq) ||			\
 			(__builtin_constant_p(_wq == system_freezable_wq) && \
···
 })
 
 /**
- * schedule_delayed_work_on - queue work in global workqueue on CPU after delay
+ * schedule_delayed_work_on - queue work in per-CPU workqueue on CPU after delay
  * @cpu: cpu to use
  * @dwork: job to be done
  * @delay: number of jiffies to wait
  *
- * After waiting for a given time this puts a job in the kernel-global
+ * After waiting for a given time this puts a job in the system per-CPU
  * workqueue on the specified CPU.
  */
 static inline bool schedule_delayed_work_on(int cpu, struct delayed_work *dwork,
···
 }
 
 /**
- * schedule_delayed_work - put work task in global workqueue after delay
+ * schedule_delayed_work - put work task in per-CPU workqueue after delay
  * @dwork: job to be done
  * @delay: number of jiffies to wait or 0 for immediate execution
  *
- * After waiting for a given time this puts a job in the kernel-global
+ * After waiting for a given time this puts a job in the system per-CPU
  * workqueue.
  */
 static inline bool schedule_delayed_work(struct delayed_work *dwork,
+261 -24
kernel/workqueue.c
···
 #include <linux/mempolicy.h>
 #include <linux/freezer.h>
 #include <linux/debug_locks.h>
+#include <linux/device/devres.h>
 #include <linux/lockdep.h>
 #include <linux/idr.h>
 #include <linux/jhash.h>
···
 	WQ_NAME_LEN		= 32,
 	WORKER_ID_LEN		= 10 + WQ_NAME_LEN, /* "kworker/R-" + WQ_NAME_LEN */
+};
+
+/* Layout of shards within one LLC pod */
+struct llc_shard_layout {
+	int nr_large_shards;	/* number of large shards (cores_per_shard + 1) */
+	int cores_per_shard;	/* base number of cores per default shard */
+	int nr_shards;		/* total number of shards */
+	/* number of default shards = (nr_shards - nr_large_shards) */
 };
···
 	u32 flags;
 };
 
-static const char *wq_affn_names[WQ_AFFN_NR_TYPES] = {
+static const char * const wq_affn_names[WQ_AFFN_NR_TYPES] = {
 	[WQ_AFFN_DFL]		= "default",
 	[WQ_AFFN_CPU]		= "cpu",
 	[WQ_AFFN_SMT]		= "smt",
 	[WQ_AFFN_CACHE]		= "cache",
+	[WQ_AFFN_CACHE_SHARD]	= "cache_shard",
 	[WQ_AFFN_NUMA]		= "numa",
 	[WQ_AFFN_SYSTEM]	= "system",
 };
···
 static bool wq_power_efficient = IS_ENABLED(CONFIG_WQ_POWER_EFFICIENT_DEFAULT);
 module_param_named(power_efficient, wq_power_efficient, bool, 0444);
 
+static unsigned int wq_cache_shard_size = 8;
+module_param_named(cache_shard_size, wq_cache_shard_size, uint, 0444);
+
 static bool wq_online;			/* can kworkers be created yet? */
 static bool wq_topo_initialized __read_mostly = false;
 
 static struct kmem_cache *pwq_cache;
 
 static struct wq_pod_type wq_pod_types[WQ_AFFN_NR_TYPES];
-static enum wq_affn_scope wq_affn_dfl = WQ_AFFN_CACHE;
+static enum wq_affn_scope wq_affn_dfl = WQ_AFFN_CACHE_SHARD;
 
 /* buf for wq_update_unbound_pod_attrs(), protected by CPU hotplug exclusion */
 static struct workqueue_attrs *unbound_wq_update_pwq_attrs_buf;
···
 EXPORT_SYMBOL_GPL(system_bh_wq);
 struct workqueue_struct *system_bh_highpri_wq;
 EXPORT_SYMBOL_GPL(system_bh_highpri_wq);
+struct workqueue_struct *system_dfl_long_wq __ro_after_init;
+EXPORT_SYMBOL_GPL(system_dfl_long_wq);
 
 static int worker_thread(void *__worker);
 static void workqueue_sysfs_unregister(struct workqueue_struct *wq);
···
 	struct timer_list *timer = &dwork->timer;
 	struct work_struct *work = &dwork->work;
 
-	WARN_ON_ONCE(!wq);
 	WARN_ON_ONCE(timer->function != delayed_work_timer_fn);
 	WARN_ON_ONCE(timer_pending(timer));
 	WARN_ON_ONCE(!list_empty(&work->entry));
···
 	for_each_possible_cpu(cpu) {
 		struct pool_workqueue *pwq = *per_cpu_ptr(wq->cpu_pwq, cpu);
 
-		if (pwq)
+		if (pwq) {
+			/*
+			 * Unlink pwq from wq->pwqs since link_pwq()
+			 * may have already added it. wq->mutex is not
+			 * needed as the wq has not been published yet.
+			 */
+			if (!list_empty(&pwq->pwqs_node))
+				list_del_rcu(&pwq->pwqs_node);
 			kmem_cache_free(pwq_cache, pwq);
+		}
 	}
 	free_percpu(wq->cpu_pwq);
 	wq->cpu_pwq = NULL;
···
 	return wq;
 }
 EXPORT_SYMBOL_GPL(alloc_workqueue_noprof);
+
+static void devm_workqueue_release(void *res)
+{
+	destroy_workqueue(res);
+}
+
+__printf(2, 5) struct workqueue_struct *
+devm_alloc_workqueue(struct device *dev, const char *fmt, unsigned int flags,
+		     int max_active, ...)
+{
+	struct workqueue_struct *wq;
+	va_list args;
+	int ret;
+
+	va_start(args, max_active);
+	wq = alloc_workqueue(fmt, flags, max_active, args);
+	va_end(args);
+	if (!wq)
+		return NULL;
+
+	ret = devm_add_action_or_reset(dev, devm_workqueue_release, wq);
+	if (ret)
+		return NULL;
+
+	return wq;
+}
+EXPORT_SYMBOL_GPL(devm_alloc_workqueue);
 
 #ifdef CONFIG_LOCKDEP
 __printf(1, 5)
···
 	/*
 	 * If the operation fails, it will fall back to
 	 * wq_requested_unbound_cpumask which is initially set to
-	 * (HK_TYPE_WQ ∩ HK_TYPE_DOMAIN) house keeping mask and rewritten
+	 * HK_TYPE_DOMAIN house keeping mask and rewritten
 	 * by any subsequent write to workqueue/cpumask sysfs file.
 	 */
 	if (!cpumask_and(cpumask, wq_requested_unbound_cpumask, hk))
···
 
 static int parse_affn_scope(const char *val)
 {
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(wq_affn_names); i++) {
-		if (!strncasecmp(val, wq_affn_names[i], strlen(wq_affn_names[i])))
-			return i;
-	}
-	return -EINVAL;
+	return sysfs_match_string(wq_affn_names, val);
 }
 
 static int wq_affn_dfl_set(const char *val, const struct kernel_param *kp)
···
 	&dev_attr_max_active.attr,
 	NULL,
 };
-ATTRIBUTE_GROUPS(wq_sysfs);
+
+static umode_t wq_sysfs_is_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct workqueue_struct *wq = dev_to_wq(dev);
+
+	/*
+	 * Adjusting max_active breaks ordering guarantee. Changing it has no
+	 * effect on BH worker. Limit max_active to RO in such case.
+	 */
+	if (wq->flags & (WQ_BH | __WQ_ORDERED))
+		return 0444;
+	return a->mode;
+}
+
+static const struct attribute_group wq_sysfs_group = {
+	.is_visible	= wq_sysfs_is_visible,
+	.attrs		= wq_sysfs_attrs,
+};
+__ATTRIBUTE_GROUPS(wq_sysfs);
 
 static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
 			    char *buf)
···
 {
 	struct wq_device *wq_dev;
 	int ret;
-
-	/*
-	 * Adjusting max_active breaks ordering guarantee. Disallow exposing
-	 * ordered workqueues.
-	 */
-	if (WARN_ON(wq->flags & __WQ_ORDERED))
-		return -EINVAL;
 
 	wq->wq_dev = wq_dev = kzalloc_obj(*wq_dev);
 	if (!wq_dev)
···
 {
 	struct wq_pod_type *pt = &wq_pod_types[WQ_AFFN_SYSTEM];
 	int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
-	void (*irq_work_fns[2])(struct irq_work *) = { bh_pool_kick_normal,
-						       bh_pool_kick_highpri };
+	void (*irq_work_fns[NR_STD_WORKER_POOLS])(struct irq_work *) =
+		{ bh_pool_kick_normal, bh_pool_kick_highpri };
 	int i, cpu;
 
 	BUILD_BUG_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
···
 
 	cpumask_copy(wq_online_cpumask, cpu_online_mask);
 	cpumask_copy(wq_unbound_cpumask, cpu_possible_mask);
-	restrict_unbound_cpumask("HK_TYPE_WQ", housekeeping_cpumask(HK_TYPE_WQ));
 	restrict_unbound_cpumask("HK_TYPE_DOMAIN", housekeeping_cpumask(HK_TYPE_DOMAIN));
 	if (!cpumask_empty(&wq_cmdline_cpumask))
 		restrict_unbound_cpumask("workqueue.unbound_cpus", &wq_cmdline_cpumask);
···
 	system_bh_wq = alloc_workqueue("events_bh", WQ_BH | WQ_PERCPU, 0);
 	system_bh_highpri_wq = alloc_workqueue("events_bh_highpri",
 					       WQ_BH | WQ_HIGHPRI | WQ_PERCPU, 0);
+	system_dfl_long_wq = alloc_workqueue("events_dfl_long", WQ_UNBOUND, WQ_MAX_ACTIVE);
 	BUG_ON(!system_wq || !system_percpu_wq || !system_highpri_wq || !system_long_wq ||
 	       !system_unbound_wq || !system_freezable_wq || !system_dfl_wq ||
 	       !system_power_efficient_wq ||
 	       !system_freezable_power_efficient_wq ||
-	       !system_bh_wq || !system_bh_highpri_wq);
+	       !system_bh_wq || !system_bh_highpri_wq || !system_dfl_long_wq);
 }
 
 static void __init wq_cpu_intensive_thresh_init(void)
···
 	return cpu_to_node(cpu0) == cpu_to_node(cpu1);
 }
 
+/* Maps each CPU to its shard index within the LLC pod it belongs to */
+static int cpu_shard_id[NR_CPUS] __initdata;
+
+/**
+ * llc_count_cores - count distinct cores (SMT groups) within an LLC pod
+ * @pod_cpus: the cpumask of CPUs in the LLC pod
+ * @smt_pods: the SMT pod type, used to identify sibling groups
+ *
+ * A core is represented by the lowest-numbered CPU in its SMT group. Returns
+ * the number of distinct cores found in @pod_cpus.
+ */
+static int __init llc_count_cores(const struct cpumask *pod_cpus,
+				  struct wq_pod_type *smt_pods)
+{
+	const struct cpumask *sibling_cpus;
+	int nr_cores = 0, c;
+
+	/*
+	 * Count distinct cores by only counting the first CPU in each
+	 * SMT sibling group.
+	 */
+	for_each_cpu(c, pod_cpus) {
+		sibling_cpus = smt_pods->pod_cpus[smt_pods->cpu_pod[c]];
+		if (cpumask_first(sibling_cpus) == c)
+			nr_cores++;
+	}
+
+	return nr_cores;
+}
+
+/*
+ * llc_shard_size - number of cores in a given shard
+ *
+ * Cores are spread as evenly as possible. The first @nr_large_shards shards
+ * are "large shards" with (cores_per_shard + 1) cores; the rest are "default
+ * shards" with cores_per_shard cores.
+ */
+static int __init llc_shard_size(int shard_id, int cores_per_shard, int nr_large_shards)
+{
+	/* The first @nr_large_shards shards are large shards */
+	if (shard_id < nr_large_shards)
+		return cores_per_shard + 1;
+
+	/* The remaining shards are default shards */
+	return cores_per_shard;
+}
+
+/*
+ * llc_calc_shard_layout - compute the shard layout for an LLC pod
+ * @nr_cores: number of distinct cores in the LLC pod
+ *
+ * Chooses the number of shards that keeps average shard size closest to
+ * wq_cache_shard_size. Returns a struct describing the total number of
+ * shards, the base size of each, and how many are large shards.
+ */
+static struct llc_shard_layout __init llc_calc_shard_layout(int nr_cores)
+{
+	struct llc_shard_layout layout;
+
+	/* Ensure at least one shard; pick the count closest to the target size */
+	layout.nr_shards = max(1, DIV_ROUND_CLOSEST(nr_cores, wq_cache_shard_size));
+	layout.cores_per_shard = nr_cores / layout.nr_shards;
+	layout.nr_large_shards = nr_cores % layout.nr_shards;
+
+	return layout;
+}
+
+/*
+ * llc_shard_is_full - check whether a shard has reached its core capacity
+ * @cores_in_shard: number of cores already assigned to this shard
+ * @shard_id: index of the shard being checked
+ * @layout: the shard layout computed by llc_calc_shard_layout()
+ *
+ * Returns true if @cores_in_shard equals the expected size for @shard_id.
+ */
+static bool __init llc_shard_is_full(int cores_in_shard, int shard_id,
+				     const struct llc_shard_layout *layout)
+{
+	return cores_in_shard == llc_shard_size(shard_id, layout->cores_per_shard,
+						layout->nr_large_shards);
+}
+
+/**
+ * llc_populate_cpu_shard_id - populate cpu_shard_id[] for each CPU in an LLC pod
+ * @pod_cpus: the cpumask of CPUs in the LLC pod
+ * @smt_pods: the SMT pod type, used to identify sibling groups
+ * @nr_cores: number of distinct cores in @pod_cpus (from llc_count_cores())
+ *
+ * Walks @pod_cpus in order. At each SMT group leader, advances to the next
+ * shard once the current shard is full. Results are written to cpu_shard_id[].
+ */
+static void __init llc_populate_cpu_shard_id(const struct cpumask *pod_cpus,
+					     struct wq_pod_type *smt_pods,
+					     int nr_cores)
+{
+	struct llc_shard_layout layout = llc_calc_shard_layout(nr_cores);
+	const struct cpumask *sibling_cpus;
+	/* Count the number of cores in the current shard_id */
+	int cores_in_shard = 0;
+	unsigned int leader;
+	/* This is a cursor for the shards. Go from zero to nr_shards - 1 */
+	int shard_id = 0;
+	int c;
+
+	/* Iterate at every CPU for a given LLC pod, and assign it a shard */
+	for_each_cpu(c, pod_cpus) {
+		sibling_cpus = smt_pods->pod_cpus[smt_pods->cpu_pod[c]];
+		if (cpumask_first(sibling_cpus) == c) {
+			/* This is the CPU leader for the siblings */
+			if (llc_shard_is_full(cores_in_shard, shard_id, &layout)) {
+				shard_id++;
+				cores_in_shard = 0;
+			}
+			cores_in_shard++;
+			cpu_shard_id[c] = shard_id;
+		} else {
+			/*
+			 * The siblings' shard MUST be the same as the
+			 * leader's; never split threads in the same core.
+			 */
+			leader = cpumask_first(sibling_cpus);
+
+			/*
+			 * This check silences a -Warray-bounds warning on UP
+			 * configs where NR_CPUS=1 makes cpu_shard_id[]
+			 * a single-element array, and the compiler can't
+			 * prove the index is always 0.
+			 */
+			if (WARN_ON_ONCE(leader >= nr_cpu_ids))
+				continue;
+			cpu_shard_id[c] = cpu_shard_id[leader];
+		}
+	}
+
+	WARN_ON_ONCE(shard_id != (layout.nr_shards - 1));
+}
+
+/**
+ * precompute_cache_shard_ids - assign each CPU its shard index within its LLC
+ *
+ * Iterates over all LLC pods. For each pod, counts distinct cores then assigns
+ * shard indices to all CPUs in the pod. Must be called after WQ_AFFN_CACHE and
+ * WQ_AFFN_SMT have been initialized.
+ */
+static void __init precompute_cache_shard_ids(void)
+{
+	struct wq_pod_type *llc_pods = &wq_pod_types[WQ_AFFN_CACHE];
+	struct wq_pod_type *smt_pods = &wq_pod_types[WQ_AFFN_SMT];
+	const struct cpumask *cpus_sharing_llc;
+	int nr_cores;
+	int pod;
+
+	if (!wq_cache_shard_size) {
+		pr_warn("workqueue: cache_shard_size must be > 0, setting to 1\n");
+		wq_cache_shard_size = 1;
+	}
+
+	for (pod = 0; pod < llc_pods->nr_pods; pod++) {
+		cpus_sharing_llc = llc_pods->pod_cpus[pod];
+
+		/* Number of cores in this given LLC */
+		nr_cores = llc_count_cores(cpus_sharing_llc, smt_pods);
+		llc_populate_cpu_shard_id(cpus_sharing_llc, smt_pods, nr_cores);
+	}
+}
+
+/*
+ * cpus_share_cache_shard - test whether two CPUs belong to the same cache shard
+ *
+ * Two CPUs share a cache shard if they are in the same LLC and have the same
+ * shard index. Used as the pod affinity callback for WQ_AFFN_CACHE_SHARD.
+ */
+static bool __init cpus_share_cache_shard(int cpu0, int cpu1)
+{
+	if (!cpus_share_cache(cpu0, cpu1))
+		return false;
+
+	return cpu_shard_id[cpu0] == cpu_shard_id[cpu1];
+}
+
 /**
  * workqueue_init_topology - initialize CPU pods for unbound workqueues
  *
···
 	init_pod_type(&wq_pod_types[WQ_AFFN_CPU], cpus_dont_share);
 	init_pod_type(&wq_pod_types[WQ_AFFN_SMT], cpus_share_smt);
 	init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
+	precompute_cache_shard_ids();
+	init_pod_type(&wq_pod_types[WQ_AFFN_CACHE_SHARD], cpus_share_cache_shard);
 	init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);
 
 	wq_topo_initialized = true;
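
A worked example of the layout above, assuming the default wq_cache_shard_size=8 and SMT siblings numbered after their leaders: an LLC pod with 24 cores yields nr_shards = DIV_ROUND_CLOSEST(24, 8) = 3 with cores_per_shard = 8 and nr_large_shards = 0, i.e. three shards of 8 cores each; with 20 cores the layout is still 3 shards, but cores_per_shard = 6 and nr_large_shards = 20 % 3 = 2, so the shards hold 7, 7 and 6 cores. llc_populate_cpu_shard_id() then walks the pod in CPU order, advancing to the next shard each time the current one fills, and always keeps SMT siblings in their leader's shard.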
+10
lib/Kconfig.debug
···
 
 	  If unsure, say N.
 
+config TEST_WORKQUEUE
+	tristate "Test module for stress/performance analysis of workqueue"
+	default n
+	help
+	  This builds the "test_workqueue" module for benchmarking
+	  workqueue throughput under contention. Useful for evaluating
+	  affinity scope changes (e.g., cache_shard vs cache).
+
+	  If unsure, say N.
+
 config TEST_BPF
 	tristate "Test BPF filter functionality"
 	depends on m && NET
+1
lib/Makefile
···
 obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
 obj-$(CONFIG_TEST_LKM) += test_module.o
 obj-$(CONFIG_TEST_VMALLOC) += test_vmalloc.o
+obj-$(CONFIG_TEST_WORKQUEUE) += test_workqueue.o
 obj-$(CONFIG_TEST_RHASHTABLE) += test_rhashtable.o
 obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_keys.o
 obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_key_base.o
+294
lib/test_workqueue.c
···
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Test module for stress and performance analysis of workqueue.
+ *
+ * Benchmarks queue_work() throughput on an unbound workqueue to measure
+ * pool->lock contention under different affinity scope configurations
+ * (e.g., cache vs cache_shard).
+ *
+ * The affinity scope is changed between runs via the workqueue's sysfs
+ * affinity_scope attribute (WQ_SYSFS).
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+ * Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+ */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/workqueue.h>
+#include <linux/kthread.h>
+#include <linux/moduleparam.h>
+#include <linux/completion.h>
+#include <linux/atomic.h>
+#include <linux/slab.h>
+#include <linux/ktime.h>
+#include <linux/cpumask.h>
+#include <linux/sched.h>
+#include <linux/sort.h>
+#include <linux/fs.h>
+
+#define WQ_NAME		"bench_wq"
+#define SCOPE_PATH	"/sys/bus/workqueue/devices/" WQ_NAME "/affinity_scope"
+
+static int nr_threads;
+module_param(nr_threads, int, 0444);
+MODULE_PARM_DESC(nr_threads,
+		 "Number of threads to spawn (default: 0 = num_online_cpus())");
+
+static int wq_items = 50000;
+module_param(wq_items, int, 0444);
+MODULE_PARM_DESC(wq_items,
+		 "Number of work items each thread queues (default: 50000)");
+
+static struct workqueue_struct *bench_wq;
+static atomic_t threads_done;
+static DECLARE_COMPLETION(start_comp);
+static DECLARE_COMPLETION(all_done_comp);
+
+struct thread_ctx {
+	struct completion work_done;
+	struct work_struct work;
+	u64 *latencies;
+	int cpu;
+	int items;
+};
+
+static void bench_work_fn(struct work_struct *work)
+{
+	struct thread_ctx *ctx = container_of(work, struct thread_ctx, work);
+
+	complete(&ctx->work_done);
+}
+
+static int bench_kthread_fn(void *data)
+{
+	struct thread_ctx *ctx = data;
+	ktime_t t_start, t_end;
+	int i;
+
+	/* Wait for all threads to be ready */
+	wait_for_completion(&start_comp);
+
+	if (kthread_should_stop())
+		return 0;
+
+	for (i = 0; i < ctx->items; i++) {
+		reinit_completion(&ctx->work_done);
+		INIT_WORK(&ctx->work, bench_work_fn);
+
+		t_start = ktime_get();
+		queue_work(bench_wq, &ctx->work);
+		t_end = ktime_get();
+
+		ctx->latencies[i] = ktime_to_ns(ktime_sub(t_end, t_start));
+		wait_for_completion(&ctx->work_done);
+	}
+
+	if (atomic_dec_and_test(&threads_done))
+		complete(&all_done_comp);
+
+	/*
+	 * Wait for kthread_stop() so the module text isn't freed
+	 * while we're still executing.
+	 */
+	while (!kthread_should_stop())
+		schedule();
+
+	return 0;
+}
+
+static int cmp_u64(const void *a, const void *b)
+{
+	u64 va = *(const u64 *)a;
+	u64 vb = *(const u64 *)b;
+
+	if (va < vb)
+		return -1;
+	if (va > vb)
+		return 1;
+	return 0;
+}
+
+static int __init set_affn_scope(const char *scope)
+{
+	struct file *f;
+	loff_t pos = 0;
+	ssize_t ret;
+
+	f = filp_open(SCOPE_PATH, O_WRONLY, 0);
+	if (IS_ERR(f)) {
+		pr_err("test_workqueue: open %s failed: %ld\n",
+		       SCOPE_PATH, PTR_ERR(f));
+		return PTR_ERR(f);
+	}
+
+	ret = kernel_write(f, scope, strlen(scope), &pos);
+	filp_close(f, NULL);
+
+	if (ret < 0) {
+		pr_err("test_workqueue: write '%s' failed: %zd\n", scope, ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+static int __init run_bench(int n_threads, const char *scope, const char *label)
+{
+	struct task_struct **tasks;
+	unsigned long total_items;
+	struct thread_ctx *ctxs;
+	u64 *all_latencies;
+	ktime_t start, end;
+	int cpu, i, j, ret;
+	s64 elapsed_us;
+
+	ret = set_affn_scope(scope);
+	if (ret)
+		return ret;
+
+	ctxs = kcalloc(n_threads, sizeof(*ctxs), GFP_KERNEL);
+	if (!ctxs)
+		return -ENOMEM;
+
+	tasks = kcalloc(n_threads, sizeof(*tasks), GFP_KERNEL);
+	if (!tasks) {
+		kfree(ctxs);
+		return -ENOMEM;
+	}
+
+	total_items = (unsigned long)n_threads * wq_items;
+	all_latencies = kvmalloc_array(total_items, sizeof(u64), GFP_KERNEL);
+	if (!all_latencies) {
+		kfree(tasks);
+		kfree(ctxs);
+		return -ENOMEM;
+	}
+
+	/* Allocate per-thread latency arrays */
+	for (i = 0; i < n_threads; i++) {
+		ctxs[i].latencies = kvmalloc_array(wq_items, sizeof(u64),
+						   GFP_KERNEL);
+		if (!ctxs[i].latencies) {
+			while (--i >= 0)
+				kvfree(ctxs[i].latencies);
+			kvfree(all_latencies);
+			kfree(tasks);
+			kfree(ctxs);
+			return -ENOMEM;
+		}
+	}
+
+	atomic_set(&threads_done, n_threads);
+	reinit_completion(&all_done_comp);
+	reinit_completion(&start_comp);
+
+	/* Create kthreads, each bound to a different online CPU */
+	i = 0;
+	for_each_online_cpu(cpu) {
+		if (i >= n_threads)
+			break;
+
+		ctxs[i].cpu = cpu;
+		ctxs[i].items = wq_items;
+		init_completion(&ctxs[i].work_done);
+
+		tasks[i] = kthread_create(bench_kthread_fn, &ctxs[i],
+					  "wq_bench/%d", cpu);
+		if (IS_ERR(tasks[i])) {
+			ret = PTR_ERR(tasks[i]);
+			pr_err("test_workqueue: failed to create kthread %d: %d\n",
+			       i, ret);
+			/* Unblock threads waiting on start_comp before stopping them */
+			complete_all(&start_comp);
+			while (--i >= 0)
+				kthread_stop(tasks[i]);
+			goto out_free;
+		}
+
+		kthread_bind(tasks[i], cpu);
+		wake_up_process(tasks[i]);
+		i++;
+	}
+
+	/* Start timing and release all threads */
+	start = ktime_get();
+	complete_all(&start_comp);
+
+	/* Wait for all threads to finish the benchmark */
+	wait_for_completion(&all_done_comp);
+
+	/* Drain any remaining work */
+	flush_workqueue(bench_wq);
+
+	/* Ensure all kthreads have fully exited before module memory is freed */
+	for (i = 0; i < n_threads; i++)
+		kthread_stop(tasks[i]);
+
+	end = ktime_get();
+	elapsed_us = ktime_us_delta(end, start);
+
+	/* Merge all per-thread latencies and sort for percentile calculation */
+	j = 0;
+	for (i = 0; i < n_threads; i++) {
+		memcpy(&all_latencies[j], ctxs[i].latencies,
+		       wq_items * sizeof(u64));
+		j += wq_items;
+	}
+
+	sort(all_latencies, total_items, sizeof(u64), cmp_u64, NULL);
+
+	pr_info("test_workqueue: %-16s %llu items/sec\tp50=%llu\tp90=%llu\tp95=%llu ns\n",
+		label,
+		elapsed_us ? div_u64(total_items * 1000000ULL, elapsed_us) : 0,
+		all_latencies[total_items * 50 / 100],
+		all_latencies[total_items * 90 / 100],
+		all_latencies[total_items * 95 / 100]);
+
+	ret = 0;
+out_free:
+	for (i = 0; i < n_threads; i++)
+		kvfree(ctxs[i].latencies);
+	kvfree(all_latencies);
+	kfree(tasks);
+	kfree(ctxs);
+
+	return ret;
+}
+
+static const char * const bench_scopes[] = {
+	"cpu", "smt", "cache_shard", "cache", "numa", "system",
+};
+
+static int __init test_workqueue_init(void)
+{
+	int n_threads = min(nr_threads ?: num_online_cpus(), num_online_cpus());
+	int i;
+
+	if (wq_items <= 0) {
+		pr_err("test_workqueue: wq_items must be > 0\n");
+		return -EINVAL;
+	}
+
+	bench_wq = alloc_workqueue(WQ_NAME, WQ_UNBOUND | WQ_SYSFS, 0);
+	if (!bench_wq)
+		return -ENOMEM;
+
+	pr_info("test_workqueue: running %d threads, %d items/thread\n",
+		n_threads, wq_items);
+
+	for (i = 0; i < ARRAY_SIZE(bench_scopes); i++)
+		run_bench(n_threads, bench_scopes[i], bench_scopes[i]);
+
+	destroy_workqueue(bench_wq);
+
+	/* Return -EAGAIN so the module doesn't stay loaded after the benchmark */
+	return -EAGAIN;
+}
+module_init(test_workqueue_init);
+
+MODULE_AUTHOR("Breno Leitao <leitao@debian.org>");
+MODULE_DESCRIPTION("Stress/performance benchmark for workqueue subsystem");
+MODULE_LICENSE("GPL");
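
A quick way to exercise the module once built: loading it with "modprobe test_workqueue nr_threads=8 wq_items=100000" prints one throughput/percentile line per affinity scope and then returns -EAGAIN, so the module never stays loaded.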
+8 -12
tools/workqueue/wq_dump.py
···
 WQ_AFFN_CPU		= prog['WQ_AFFN_CPU']
 WQ_AFFN_SMT		= prog['WQ_AFFN_SMT']
 WQ_AFFN_CACHE		= prog['WQ_AFFN_CACHE']
+WQ_AFFN_CACHE_SHARD	= prog['WQ_AFFN_CACHE_SHARD']
 WQ_AFFN_NUMA		= prog['WQ_AFFN_NUMA']
 WQ_AFFN_SYSTEM		= prog['WQ_AFFN_SYSTEM']
···
         print(f'  [{cpu}]={pt.cpu_pod[cpu].value_()}', end='')
     print('')
 
-for affn in [WQ_AFFN_CPU, WQ_AFFN_SMT, WQ_AFFN_CACHE, WQ_AFFN_NUMA, WQ_AFFN_SYSTEM]:
+for affn in [WQ_AFFN_CPU, WQ_AFFN_SMT, WQ_AFFN_CACHE, WQ_AFFN_CACHE_SHARD, WQ_AFFN_NUMA, WQ_AFFN_SYSTEM]:
     print('')
     print(f'{wq_affn_names[affn].string_().decode().upper()}{" (default)" if affn == wq_affn_dfl else ""}')
     print_pod_type(wq_pod_types[affn])
···
         print(f'NODE[{node:02}]={cpumask_str(node_to_cpumask_map[node])}')
     print('')
 
-    print(f'[{"workqueue":^{WQ_NAME_LEN-2}}\\ min max', end='')
-    first = True
+    print(f'[{"workqueue":^{WQ_NAME_LEN-1}} {"min":>4} {"max":>4}', end='')
     for node in for_each_node():
-        if first:
-            print(f' NODE {node}', end='')
-            first = False
-        else:
-            print(f' {node:7}', end='')
-    print(f' {"dfl":>7} ]')
+        print(f' {"NODE " + str(node):>9}', end='')
+    print(f' {"dfl":>9} ]')
     print('')
 
     for wq in list_for_each_entry('struct workqueue_struct', workqueues.address_of_(), 'list'):
···
             continue
 
         print(f'{wq.name.string_().decode():{WQ_NAME_LEN}} ', end='')
-        print(f'{wq.min_active.value_():3} {wq.max_active.value_():3}', end='')
+        print(f'{wq.min_active.value_():4} {wq.max_active.value_():4}', end='')
         for node in for_each_node():
             nna = wq.node_nr_active[node]
-            print(f' {nna.nr.counter.value_():3}/{nna.max.value_():3}', end='')
+            print(f' {f"{nna.nr.counter.value_()}/{nna.max.value_()}":>9}', end='')
         nna = wq.node_nr_active[nr_node_ids]
-        print(f' {nna.nr.counter.value_():3}/{nna.max.value_():3}')
+        print(f' {f"{nna.nr.counter.value_()}/{nna.max.value_()}":>9}')
 else:
     printf(f'node_to_cpumask_map not present, is NUMA enabled?')