workqueue: Implement system-wide nr_active enforcement for unbound workqueues

A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per CPU for
per-cpu workqueues and one per NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.

In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.

However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating in 8639ecebc9b1 ("workqueue: Implement non-strict
affinity scope for unbound workqueues") implemented a more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.

While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.

636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate their concurrency level, which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.

Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of the in-flight IOs, so we don't
want to set max_active too low, but as soon as we increase max_active a bit,
we can end up with an unreasonable number of in-flight work items when many
CPUs issue IOs at the same time. i.e. the acceptable lowest max_active is
higher than the acceptable highest max_active.
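
The arithmetic behind this can be sketched as follows. This is only an
illustration: worst_case_in_flight() is a hypothetical helper, not a kernel
function, and the CPU counts and limits used with it are made-up numbers.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the kernel): worst-case number of
 * in-flight work items for one workqueue when max_active is enforced
 * independently on every CPU, as was the case after 636b927eba5b.
 */
static int worst_case_in_flight(int per_cpu_max_active, int nr_cpus)
{
	assert(per_cpu_max_active >= 0 && nr_cpus > 0);

	/* every CPU can fill its own budget at the same time */
	return per_cpu_max_active * nr_cpus;
}
```

On an assumed 128-CPU machine, a per-cpu limit of 16 lets one busy CPU keep
16 work items in flight but allows 2048 system-wide, while a per-cpu limit of
1 caps the system at 128 but throttles the single busy CPU to one work item.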

Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:

- Once max_active enforcement decouples from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.

- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.

- Per-pwq enforcement had been more or less okay while we were using
per-node pools.

It looks like we can no longer avoid decoupling max_active enforcement from
pool boundaries. This patch implements a system-wide nr_active mechanism with
the following design characteristics:

- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice as many
online effective CPUs will get twice the portion of max_active.

- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.

It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.

I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.

- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
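
The per-node split described above can be modelled in user space roughly as
below. This is a sketch, not the kernel code: node_max_active(),
default_min_active(), div_round_up() and clamp_int() are stand-in names, and
DFL_MIN_ACTIVE only mirrors WQ_DFL_MIN_ACTIVE. The formula follows
wq_update_node_max_active() in this patch.

```c
#include <assert.h>

#define DFL_MIN_ACTIVE 8	/* mirrors WQ_DFL_MIN_ACTIVE from this patch */

/* stand-ins for the kernel's DIV_ROUND_UP() and clamp() macros */
static int div_round_up(int n, int d)
{
	return (n + d - 1) / d;
}

static int clamp_int(int v, int lo, int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* min_active defaults to min(max_active, WQ_DFL_MIN_ACTIVE) */
static int default_min_active(int max_active)
{
	return max_active < DFL_MIN_ACTIVE ? max_active : DFL_MIN_ACTIVE;
}

/*
 * Per-node share of max_active: proportional to the node's online
 * effective CPUs, rounded up, then clamped to [min_active, max_active].
 */
static int node_max_active(int max_active, int node_cpus, int total_cpus)
{
	int min_active = default_min_active(max_active);

	assert(total_cpus > 0 && node_cpus >= 0 && node_cpus <= total_cpus);

	return clamp_int(div_round_up(max_active * node_cpus, total_cpus),
			 min_active, max_active);
}
```

With an assumed max_active of 30 and two nodes holding 16 and 8 effective
CPUs, the nodes get 20 and 10 respectively. With max_active of 16 spread over
four 4-CPU nodes, the proportional share of 4 is raised to min_active (8) on
each node, so the per-node limits add up to 32, which is the "higher
effective max_active than configured" case mentioned above.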

v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().

- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.

v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.

v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Tejun Heo 2 years ago 5797b1c1 91ccc6e7

+342 -36

2 changed files

+32 -3

include/linux/workqueue.h

···
 	WQ_MAX_ACTIVE		= 512,	  /* I like 512, better ideas? */
 	WQ_UNBOUND_MAX_ACTIVE	= WQ_MAX_ACTIVE,
 	WQ_DFL_ACTIVE		= WQ_MAX_ACTIVE / 2,
+
+	/*
+	 * Per-node default cap on min_active. Unless explicitly set, min_active
+	 * is set to min(max_active, WQ_DFL_MIN_ACTIVE). For more details, see
+	 * workqueue_struct->min_active definition.
+	 */
+	WQ_DFL_MIN_ACTIVE	= 8,
 };

 /*
···
  * alloc_workqueue - allocate a workqueue
  * @fmt: printf format for the name of the workqueue
  * @flags: WQ_* flags
- * @max_active: max in-flight work items per CPU, 0 for default
+ * @max_active: max in-flight work items, 0 for default
  * remaining args: args for @fmt
  *
- * Allocate a workqueue with the specified parameters. For detailed
- * information on WQ_* flags, please refer to
+ * For a per-cpu workqueue, @max_active limits the number of in-flight work
+ * items for each CPU. e.g. @max_active of 1 indicates that each CPU can be
+ * executing at most one work item for the workqueue.
+ *
+ * For unbound workqueues, @max_active limits the number of in-flight work items
+ * for the whole system. e.g. @max_active of 16 indicates that that there can be
+ * at most 16 work items executing for the workqueue in the whole system.
+ *
+ * As sharing the same active counter for an unbound workqueue across multiple
+ * NUMA nodes can be expensive, @max_active is distributed to each NUMA node
+ * according to the proportion of the number of online CPUs and enforced
+ * independently.
+ *
+ * Depending on online CPU distribution, a node may end up with per-node
+ * max_active which is significantly lower than @max_active, which can lead to
+ * deadlocks if the per-node concurrency limit is lower than the maximum number
+ * of interdependent work items for the workqueue.
+ *
+ * To guarantee forward progress regardless of online CPU distribution, the
+ * concurrency limit on every node is guaranteed to be equal to or greater than
+ * min_active which is set to min(@max_active, %WQ_DFL_MIN_ACTIVE). This means
+ * that the sum of per-node max_active's may be larger than @max_active.
+ *
+ * For detailed information on %WQ_* flags, please refer to
  * Documentation/core-api/workqueue.rst.
  *
  * RETURNS:

+310 -33

kernel/workqueue.c

···
  *
  * L: pool->lock protected. Access with pool->lock held.
  *
+ * LN: pool->lock and wq_node_nr_active->lock protected for writes. Either for
+ *     reads.
+ *
  * K: Only modified by worker while holding pool->lock. Can be safely read by
  *    self, while holding pool->lock or from IRQ context if %current is the
  *    kworker.
···
 	 * pwq->inactive_works instead of pool->worklist and marked with
 	 * WORK_STRUCT_INACTIVE.
 	 *
-	 * All work items marked with WORK_STRUCT_INACTIVE do not participate
-	 * in pwq->nr_active and all work items in pwq->inactive_works are
-	 * marked with WORK_STRUCT_INACTIVE. But not all WORK_STRUCT_INACTIVE
-	 * work items are in pwq->inactive_works. Some of them are ready to
-	 * run in pool->worklist or worker->scheduled. Those work itmes are
-	 * only struct wq_barrier which is used for flush_work() and should
-	 * not participate in pwq->nr_active. For non-barrier work item, it
-	 * is marked with WORK_STRUCT_INACTIVE iff it is in pwq->inactive_works.
+	 * All work items marked with WORK_STRUCT_INACTIVE do not participate in
+	 * nr_active and all work items in pwq->inactive_works are marked with
+	 * WORK_STRUCT_INACTIVE. But not all WORK_STRUCT_INACTIVE work items are
+	 * in pwq->inactive_works. Some of them are ready to run in
+	 * pool->worklist or worker->scheduled. Those work itmes are only struct
+	 * wq_barrier which is used for flush_work() and should not participate
+	 * in nr_active. For non-barrier work item, it is marked with
+	 * WORK_STRUCT_INACTIVE iff it is in pwq->inactive_works.
 	 */
 	int			nr_active;	/* L: nr of active works */
 	struct list_head	inactive_works;	/* L: inactive works */
+	struct list_head	pending_node;	/* LN: node on wq_node_nr_active->pending_pwqs */
 	struct list_head	pwqs_node;	/* WR: node on wq->pwqs */
 	struct list_head	mayday_node;	/* MD: node on wq->maydays */

···
  * on each CPU, in an unbound workqueue, max_active applies to the whole system.
  * As sharing a single nr_active across multiple sockets can be very expensive,
  * the counting and enforcement is per NUMA node.
+ *
+ * The following struct is used to enforce per-node max_active. When a pwq wants
+ * to start executing a work item, it should increment ->nr using
+ * tryinc_node_nr_active(). If acquisition fails due to ->nr already being over
+ * ->max, the pwq is queued on ->pending_pwqs. As in-flight work items finish
+ * and decrement ->nr, node_activate_pending_pwq() activates the pending pwqs in
+ * round-robin order.
  */
 struct wq_node_nr_active {
-	atomic_t		nr;		/* per-node nr_active count */
+	int			max;		/* per-node max_active */
+	atomic_t		nr;		/* per-node nr_active */
+	raw_spinlock_t		lock;		/* nests inside pool locks */
+	struct list_head	pending_pwqs;	/* LN: pwqs with inactive works */
 };

 /*
···
 	struct worker		*rescuer;	/* MD: rescue worker */

 	int			nr_drainers;	/* WQ: drain in progress */
+
+	/* See alloc_workqueue() function comment for info on min/max_active */
 	int			max_active;	/* WO: max active works */
+	int			min_active;	/* WO: min active works */
 	int			saved_max_active; /* WQ: saved max_active */
+	int			saved_min_active; /* WQ: saved min_active */

 	struct workqueue_attrs	*unbound_attrs;	/* PW: only for unbound wqs */
 	struct pool_workqueue __rcu *dfl_pwq;	/* PW: only for unbound wqs */
···
 	return rcu_dereference_check(*unbound_pwq_slot(wq, cpu),
 				     lockdep_is_held(&wq_pool_mutex) ||
 				     lockdep_is_held(&wq->mutex));
+}
+
+/**
+ * unbound_effective_cpumask - effective cpumask of an unbound workqueue
+ * @wq: workqueue of interest
+ *
+ * @wq->unbound_attrs->cpumask contains the cpumask requested by the user which
+ * is masked with wq_unbound_cpumask to determine the effective cpumask. The
+ * default pwq is always mapped to the pool with the current effective cpumask.
+ */
+static struct cpumask *unbound_effective_cpumask(struct workqueue_struct *wq)
+{
+	return unbound_pwq(wq, -1)->pool->attrs->__pod_cpumask;
 }

 static unsigned int work_color_to_flags(int color)
···
 }

 /**
+ * wq_update_node_max_active - Update per-node max_actives to use
+ * @wq: workqueue to update
+ * @off_cpu: CPU that's going down, -1 if a CPU is not going down
+ *
+ * Update @wq->node_nr_active[]->max. @wq must be unbound. max_active is
+ * distributed among nodes according to the proportions of numbers of online
+ * cpus. The result is always between @wq->min_active and max_active.
+ */
+static void wq_update_node_max_active(struct workqueue_struct *wq, int off_cpu)
+{
+	struct cpumask *effective = unbound_effective_cpumask(wq);
+	int min_active = READ_ONCE(wq->min_active);
+	int max_active = READ_ONCE(wq->max_active);
+	int total_cpus, node;
+
+	lockdep_assert_held(&wq->mutex);
+
+	if (!cpumask_test_cpu(off_cpu, effective))
+		off_cpu = -1;
+
+	total_cpus = cpumask_weight_and(effective, cpu_online_mask);
+	if (off_cpu >= 0)
+		total_cpus--;
+
+	for_each_node(node) {
+		int node_cpus;
+
+		node_cpus = cpumask_weight_and(effective, cpumask_of_node(node));
+		if (off_cpu >= 0 && cpu_to_node(off_cpu) == node)
+			node_cpus--;
+
+		wq_node_nr_active(wq, node)->max =
+			clamp(DIV_ROUND_UP(max_active * node_cpus, total_cpus),
+			      min_active, max_active);
+	}
+
+	wq_node_nr_active(wq, NUMA_NO_NODE)->max = min_active;
+}
+
+/**
  * get_pwq - get an extra reference on the specified pool_workqueue
  * @pwq: pool_workqueue to get
  *
···
 	return true;
 }

+static bool tryinc_node_nr_active(struct wq_node_nr_active *nna)
+{
+	int max = READ_ONCE(nna->max);
+
+	while (true) {
+		int old, tmp;
+
+		old = atomic_read(&nna->nr);
+		if (old >= max)
+			return false;
+		tmp = atomic_cmpxchg_relaxed(&nna->nr, old, old + 1);
+		if (tmp == old)
+			return true;
+	}
+}
+
 /**
  * pwq_tryinc_nr_active - Try to increment nr_active for a pwq
  * @pwq: pool_workqueue of interest
+ * @fill: max_active may have increased, try to increase concurrency level
  *
  * Try to increment nr_active for @pwq. Returns %true if an nr_active count is
  * successfully obtained. %false otherwise.
  */
-static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq)
+static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq, bool fill)
 {
 	struct workqueue_struct *wq = pwq->wq;
 	struct worker_pool *pool = pwq->pool;
 	struct wq_node_nr_active *nna = wq_node_nr_active(wq, pool->node);
-	bool obtained;
+	bool obtained = false;

 	lockdep_assert_held(&pool->lock);

-	obtained = pwq->nr_active < READ_ONCE(wq->max_active);
-
-	if (obtained) {
-		pwq->nr_active++;
-		if (nna)
-			atomic_inc(&nna->nr);
+	if (!nna) {
+		/* per-cpu workqueue, pwq->nr_active is sufficient */
+		obtained = pwq->nr_active < READ_ONCE(wq->max_active);
+		goto out;
 	}
+
+	/*
+	 * Unbound workqueue uses per-node shared nr_active $nna. If @pwq is
+	 * already waiting on $nna, pwq_dec_nr_active() will maintain the
+	 * concurrency level. Don't jump the line.
+	 *
+	 * We need to ignore the pending test after max_active has increased as
+	 * pwq_dec_nr_active() can only maintain the concurrency level but not
+	 * increase it. This is indicated by @fill.
+	 */
+	if (!list_empty(&pwq->pending_node) && likely(!fill))
+		goto out;
+
+	obtained = tryinc_node_nr_active(nna);
+	if (obtained)
+		goto out;
+
+	/*
+	 * Lockless acquisition failed. Lock, add ourself to $nna->pending_pwqs
+	 * and try again. The smp_mb() is paired with the implied memory barrier
+	 * of atomic_dec_return() in pwq_dec_nr_active() to ensure that either
+	 * we see the decremented $nna->nr or they see non-empty
+	 * $nna->pending_pwqs.
+	 */
+	raw_spin_lock(&nna->lock);
+
+	if (list_empty(&pwq->pending_node))
+		list_add_tail(&pwq->pending_node, &nna->pending_pwqs);
+	else if (likely(!fill))
+		goto out_unlock;
+
+	smp_mb();
+
+	obtained = tryinc_node_nr_active(nna);
+
+	/*
+	 * If @fill, @pwq might have already been pending. Being spuriously
+	 * pending in cold paths doesn't affect anything. Let's leave it be.
+	 */
+	if (obtained && likely(!fill))
+		list_del_init(&pwq->pending_node);
+
+out_unlock:
+	raw_spin_unlock(&nna->lock);
+out:
+	if (obtained)
+		pwq->nr_active++;
 	return obtained;
 }

 /**
  * pwq_activate_first_inactive - Activate the first inactive work item on a pwq
  * @pwq: pool_workqueue of interest
+ * @fill: max_active may have increased, try to increase concurrency level
  *
  * Activate the first inactive work item of @pwq if available and allowed by
  * max_active limit.
···
  * Returns %true if an inactive work item has been activated. %false if no
  * inactive work item is found or max_active limit is reached.
  */
-static bool pwq_activate_first_inactive(struct pool_workqueue *pwq)
+static bool pwq_activate_first_inactive(struct pool_workqueue *pwq, bool fill)
 {
 	struct work_struct *work =
 		list_first_entry_or_null(&pwq->inactive_works,
 					 struct work_struct, entry);

-	if (work && pwq_tryinc_nr_active(pwq)) {
+	if (work && pwq_tryinc_nr_active(pwq, fill)) {
 		__pwq_activate_work(pwq, work);
 		return true;
 	} else {
···
 }

 /**
+ * node_activate_pending_pwq - Activate a pending pwq on a wq_node_nr_active
+ * @nna: wq_node_nr_active to activate a pending pwq for
+ * @caller_pool: worker_pool the caller is locking
+ *
+ * Activate a pwq in @nna->pending_pwqs. Called with @caller_pool locked.
+ * @caller_pool may be unlocked and relocked to lock other worker_pools.
+ */
+static void node_activate_pending_pwq(struct wq_node_nr_active *nna,
+				      struct worker_pool *caller_pool)
+{
+	struct worker_pool *locked_pool = caller_pool;
+	struct pool_workqueue *pwq;
+	struct work_struct *work;
+
+	lockdep_assert_held(&caller_pool->lock);
+
+	raw_spin_lock(&nna->lock);
+retry:
+	pwq = list_first_entry_or_null(&nna->pending_pwqs,
+				       struct pool_workqueue, pending_node);
+	if (!pwq)
+		goto out_unlock;
+
+	/*
+	 * If @pwq is for a different pool than @locked_pool, we need to lock
+	 * @pwq->pool->lock. Let's trylock first. If unsuccessful, do the unlock
+	 * / lock dance. For that, we also need to release @nna->lock as it's
+	 * nested inside pool locks.
+	 */
+	if (pwq->pool != locked_pool) {
+		raw_spin_unlock(&locked_pool->lock);
+		locked_pool = pwq->pool;
+		if (!raw_spin_trylock(&locked_pool->lock)) {
+			raw_spin_unlock(&nna->lock);
+			raw_spin_lock(&locked_pool->lock);
+			raw_spin_lock(&nna->lock);
+			goto retry;
+		}
+	}
+
+	/*
+	 * $pwq may not have any inactive work items due to e.g. cancellations.
+	 * Drop it from pending_pwqs and see if there's another one.
+	 */
+	work = list_first_entry_or_null(&pwq->inactive_works,
+					struct work_struct, entry);
+	if (!work) {
+		list_del_init(&pwq->pending_node);
+		goto retry;
+	}
+
+	/*
+	 * Acquire an nr_active count and activate the inactive work item. If
+	 * $pwq still has inactive work items, rotate it to the end of the
+	 * pending_pwqs so that we round-robin through them. This means that
+	 * inactive work items are not activated in queueing order which is fine
+	 * given that there has never been any ordering across different pwqs.
+	 */
+	if (likely(tryinc_node_nr_active(nna))) {
+		pwq->nr_active++;
+		__pwq_activate_work(pwq, work);
+
+		if (list_empty(&pwq->inactive_works))
+			list_del_init(&pwq->pending_node);
+		else
+			list_move_tail(&pwq->pending_node, &nna->pending_pwqs);
+
+		/* if activating a foreign pool, make sure it's running */
+		if (pwq->pool != caller_pool)
+			kick_pool(pwq->pool);
+	}
+
+out_unlock:
+	raw_spin_unlock(&nna->lock);
+	if (locked_pool != caller_pool) {
+		raw_spin_unlock(&locked_pool->lock);
+		raw_spin_lock(&caller_pool->lock);
+	}
+}
+
+/**
  * pwq_dec_nr_active - Retire an active count
  * @pwq: pool_workqueue of interest
  *
  * Decrement @pwq's nr_active and try to activate the first inactive work item.
+ * For unbound workqueues, this function may temporarily drop @pwq->pool->lock.
  */
 static void pwq_dec_nr_active(struct pool_workqueue *pwq)
 {
···
 	 * inactive work item on @pwq itself.
 	 */
 	if (!nna) {
-		pwq_activate_first_inactive(pwq);
+		pwq_activate_first_inactive(pwq, false);
 		return;
 	}

-	atomic_dec(&nna->nr);
-	pwq_activate_first_inactive(pwq);
+	/*
+	 * If @pwq is for an unbound workqueue, it's more complicated because
+	 * multiple pwqs and pools may be sharing the nr_active count. When a
+	 * pwq needs to wait for an nr_active count, it puts itself on
+	 * $nna->pending_pwqs. The following atomic_dec_return()'s implied
+	 * memory barrier is paired with smp_mb() in pwq_tryinc_nr_active() to
+	 * guarantee that either we see non-empty pending_pwqs or they see
+	 * decremented $nna->nr.
+	 *
+	 * $nna->max may change as CPUs come online/offline and @pwq->wq's
+	 * max_active gets updated. However, it is guaranteed to be equal to or
+	 * larger than @pwq->wq->min_active which is above zero unless freezing.
+	 * This maintains the forward progress guarantee.
+	 */
+	if (atomic_dec_return(&nna->nr) >= READ_ONCE(nna->max))
+		return;
+
+	if (!list_empty(&nna->pending_pwqs))
+		node_activate_pending_pwq(nna, pool);
 }

 /**
···
 	 * @work must also queue behind existing inactive work items to maintain
 	 * ordering when max_active changes. See wq_adjust_max_active().
 	 */
-	if (list_empty(&pwq->inactive_works) && pwq_tryinc_nr_active(pwq)) {
+	if (list_empty(&pwq->inactive_works) && pwq_tryinc_nr_active(pwq, false)) {
 		if (list_empty(&pool->worklist))
 			pool->watchdog_ts = jiffies;

···

 	barr->task = current;

-	/* The barrier work item does not participate in pwq->nr_active. */
+	/* The barrier work item does not participate in nr_active. */
 	work_flags |= WORK_STRUCT_INACTIVE;

 	/*
···
 static void init_node_nr_active(struct wq_node_nr_active *nna)
 {
 	atomic_set(&nna->nr, 0);
+	raw_spin_lock_init(&nna->lock);
+	INIT_LIST_HEAD(&nna->pending_pwqs);
 }

 /*
···
 		mutex_unlock(&wq_pool_mutex);
 	}

+	if (!list_empty(&pwq->pending_node)) {
+		struct wq_node_nr_active *nna =
+			wq_node_nr_active(pwq->wq, pwq->pool->node);
+
+		raw_spin_lock_irq(&nna->lock);
+		list_del_init(&pwq->pending_node);
+		raw_spin_unlock_irq(&nna->lock);
+	}
+
 	call_rcu(&pwq->rcu, rcu_free_pwq);

 	/*
···
 	pwq->flush_color = -1;
 	pwq->refcnt = 1;
 	INIT_LIST_HEAD(&pwq->inactive_works);
+	INIT_LIST_HEAD(&pwq->pending_node);
 	INIT_LIST_HEAD(&pwq->pwqs_node);
 	INIT_LIST_HEAD(&pwq->mayday_node);
 	kthread_init_work(&pwq->release_work, pwq_release_workfn);
···
 		ctx->pwq_tbl[cpu] = install_unbound_pwq(ctx->wq, cpu,
 							ctx->pwq_tbl[cpu]);
 	ctx->dfl_pwq = install_unbound_pwq(ctx->wq, -1, ctx->dfl_pwq);
+
+	/* update node_nr_active->max */
+	wq_update_node_max_active(ctx->wq, -1);

 	mutex_unlock(&ctx->wq->mutex);
 }
···
 static void wq_adjust_max_active(struct workqueue_struct *wq)
 {
 	bool activated;
+	int new_max, new_min;

 	lockdep_assert_held(&wq->mutex);

 	if ((wq->flags & WQ_FREEZABLE) && workqueue_freezing) {
-		WRITE_ONCE(wq->max_active, 0);
-		return;
+		new_max = 0;
+		new_min = 0;
+	} else {
+		new_max = wq->saved_max_active;
+		new_min = wq->saved_min_active;
 	}

-	if (wq->max_active == wq->saved_max_active)
+	if (wq->max_active == new_max && wq->min_active == new_min)
 		return;

 	/*
-	 * Update @wq->max_active and then kick inactive work items if more
+	 * Update @wq->max/min_active and then kick inactive work items if more
 	 * active work items are allowed. This doesn't break work item ordering
 	 * because new work items are always queued behind existing inactive
 	 * work items if there are any.
 	 */
-	WRITE_ONCE(wq->max_active, wq->saved_max_active);
+	WRITE_ONCE(wq->max_active, new_max);
+	WRITE_ONCE(wq->min_active, new_min);
+
+	if (wq->flags & WQ_UNBOUND)
+		wq_update_node_max_active(wq, -1);
+
+	if (new_max == 0)
+		return;

 	/*
 	 * Round-robin through pwq's activating the first inactive work item
···

 			/* can be called during early boot w/ irq disabled */
 			raw_spin_lock_irqsave(&pwq->pool->lock, flags);
-			if (pwq_activate_first_inactive(pwq)) {
+			if (pwq_activate_first_inactive(pwq, true)) {
 				activated = true;
 				kick_pool(pwq->pool);
 			}
···
 	/* init wq */
 	wq->flags = flags;
 	wq->max_active = max_active;
-	wq->saved_max_active = max_active;
+	wq->min_active = min(max_active, WQ_DFL_MIN_ACTIVE);
+	wq->saved_max_active = wq->max_active;
+	wq->saved_min_active = wq->min_active;
 	mutex_init(&wq->mutex);
 	atomic_set(&wq->nr_pwqs_to_flush, 0);
 	INIT_LIST_HEAD(&wq->pwqs);
···
  * @wq: target workqueue
  * @max_active: new max_active value.
  *
- * Set max_active of @wq to @max_active.
+ * Set max_active of @wq to @max_active. See the alloc_workqueue() function
+ * comment.
  *
  * CONTEXT:
  * Don't call from IRQ context.
···

 	wq->flags &= ~__WQ_ORDERED;
 	wq->saved_max_active = max_active;
+	if (wq->flags & WQ_UNBOUND)
+		wq->saved_min_active = min(wq->saved_min_active, max_active);
+
 	wq_adjust_max_active(wq);

 	mutex_unlock(&wq->mutex);
···

 			for_each_cpu(tcpu, pt->pod_cpus[pt->cpu_pod[cpu]])
 				wq_update_pod(wq, tcpu, cpu, true);
+
+			mutex_lock(&wq->mutex);
+			wq_update_node_max_active(wq, -1);
+			mutex_unlock(&wq->mutex);
 		}
 	}

···

 			for_each_cpu(tcpu, pt->pod_cpus[pt->cpu_pod[cpu]])
 				wq_update_pod(wq, tcpu, cpu, false);
+
+			mutex_lock(&wq->mutex);
+			wq_update_node_max_active(wq, cpu);
+			mutex_unlock(&wq->mutex);
 		}
 	}
 	mutex_unlock(&wq_pool_mutex);
···
 	 * combinations to apply per-pod sharing.
 	 */
 	list_for_each_entry(wq, &workqueues, list) {
-		for_each_online_cpu(cpu) {
+		for_each_online_cpu(cpu)
 			wq_update_pod(wq, cpu, cpu, true);
+		if (wq->flags & WQ_UNBOUND) {
+			mutex_lock(&wq->mutex);
+			wq_update_node_max_active(wq, -1);
+			mutex_unlock(&wq->mutex);
 		}
 	}

