Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock

The current cpuset partition code is able to dynamically update
the sched domains of a running system and the corresponding
HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
"isolcpus=domain,..." boot command line feature at run time.

The housekeeping cpumask update requires flushing a number of different
workqueues which may not be safe with cpus_read_lock() held as the
workqueue flushing code may acquire cpus_read_lock() or acquiring locks
which have locking dependency with cpus_read_lock() down the chain. Below
is an example of such circular locking problem.

======================================================
WARNING: possible circular locking dependency detected
6.18.0-test+ #2 Tainted: G S
------------------------------------------------------
test_cpuset_prs/10971 is trying to acquire lock:
ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180

but task is already holding lock:
ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:
-> #4 (cpuset_mutex){+.+.}-{4:4}:
-> #3 (cpu_hotplug_lock){++++}-{0:0}:
-> #2 (rtnl_mutex){+.+.}-{4:4}:
-> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
-> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:

Chain exists of:
(wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex

5 locks held by test_cpuset_prs/10971:
#0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
#1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
#2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
#3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
#4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

Call Trace:
<TASK>
:
touch_wq_lockdep_map+0x93/0x180
__flush_workqueue+0x111/0x10b0
housekeeping_update+0x12d/0x2d0
update_parent_effective_cpumask+0x595/0x2440
update_prstate+0x89d/0xce0
cpuset_partition_write+0xc5/0x130
cgroup_file_write+0x1a5/0x680
kernfs_fop_write_iter+0x3df/0x5f0
vfs_write+0x525/0xfd0
ksys_write+0xf9/0x1d0
do_syscall_64+0x95/0x520
entry_SYSCALL_64_after_hwframe+0x76/0x7e

To avoid such a circular locking dependency problem, we have to
call housekeeping_update() without holding the cpus_read_lock() and
cpuset_mutex. The current set of wq's flushed by housekeeping_update()
may not have work functions that call cpus_read_lock() directly,
but we are likely to extend the list of wq's that are flushed in the
future. Moreover, the current set of work functions may hold locks that
may have cpu_hotplug_lock down the dependency chain.

So housekeeping_update() is now called after releasing cpus_read_lock
and cpuset_mutex at the end of a cpuset operation. These two locks are
then re-acquired later before calling rebuild_sched_domains_locked().

To enable mutual exclusion between the housekeeping_update() call and
other cpuset control file write actions, a new top level cpuset_top_mutex
is introduced. This new mutex will be acquired first to allow sharing
variables used by both code paths. However, cpuset update from CPU
hotplug can still happen in parallel with the housekeeping_update()
call, though that should be rare in production environment.

As cpus_read_lock() is now no longer held when
tmigr_isolated_exclude_cpumask() is called, it needs to acquire it
directly.

The lockdep_is_cpuset_held() is also updated to return true if either
cpuset_top_mutex or cpuset_mutex is held.

Fixes: 03ff73510169 ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

authored by

Waiman Long and committed by
Tejun Heo
a84097e6 6df415aa

+44 -11
+42 -5
kernel/cgroup/cpuset.c
··· 65 65 * CPUSET Locking Convention 66 66 * ------------------------- 67 67 * 68 - * Below are the three global locks guarding cpuset structures in lock 68 + * Below are the four global/local locks guarding cpuset structures in lock 69 69 * acquisition order: 70 + * - cpuset_top_mutex 70 71 * - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock) 71 72 * - cpuset_mutex 72 73 * - callback_lock (raw spinlock) 73 74 * 74 - * A task must hold all the three locks to modify externally visible or 75 - * used fields of cpusets, though some of the internally used cpuset fields 75 + * As cpuset will now indirectly flush a number of different workqueues in 76 + * housekeeping_update() to update housekeeping cpumasks when the set of 77 + * isolated CPUs is going to be changed, it may be vulnerable to deadlock 78 + * if we hold cpus_read_lock while calling into housekeeping_update(). 79 + * 80 + * The first cpuset_top_mutex will be held except when calling into 81 + * cpuset_handle_hotplug() from the CPU hotplug code where cpus_write_lock 82 + * and cpuset_mutex will be held instead. The main purpose of this mutex 83 + * is to prevent regular cpuset control file write actions from interfering 84 + * with the call to housekeeping_update(), though CPU hotplug operation can 85 + * still happen in parallel. This mutex also provides protection for some 86 + * internal variables. 87 + * 88 + * A task must hold all the remaining three locks to modify externally visible 89 + * or used fields of cpusets, though some of the internally used cpuset fields 76 90 * and internal variables can be modified without holding callback_lock. If only 77 91 * reliable read access of the externally used fields are needed, a task can 78 92 * hold either cpuset_mutex or callback_lock which are exposed to other ··· 114 100 * cpumasks and nodemasks. 115 101 */ 116 102 103 + static DEFINE_MUTEX(cpuset_top_mutex); 117 104 static DEFINE_MUTEX(cpuset_mutex); 118 105 119 106 /* ··· 126 111 * 127 112 * CSCB: Readable by holding either cpuset_mutex or callback_lock. Writable 128 113 * by holding both cpuset_mutex and callback_lock. 114 + * 115 + * T: Read/write-able by holding the cpuset_top_mutex. 129 116 */ 130 117 131 118 /* ··· 150 133 * Set if housekeeping cpumasks are to be updated. 151 134 */ 152 135 static bool update_housekeeping; /* RWCS */ 136 + 137 + /* 138 + * Copy of isolated_cpus to be passed to housekeeping_update() 139 + */ 140 + static cpumask_var_t isolated_hk_cpus; /* T */ 153 141 154 142 /* 155 143 * A flag to force sched domain rebuild at the end of an operation. ··· 319 297 */ 320 298 void cpuset_full_lock(void) 321 299 { 300 + mutex_lock(&cpuset_top_mutex); 322 301 cpus_read_lock(); 323 302 mutex_lock(&cpuset_mutex); 324 303 } ··· 328 305 { 329 306 mutex_unlock(&cpuset_mutex); 330 307 cpus_read_unlock(); 308 + mutex_unlock(&cpuset_top_mutex); 331 309 } 332 310 333 311 #ifdef CONFIG_LOCKDEP 334 312 bool lockdep_is_cpuset_held(void) 335 313 { 336 - return lockdep_is_held(&cpuset_mutex); 314 + return lockdep_is_held(&cpuset_mutex) || 315 + lockdep_is_held(&cpuset_top_mutex); 337 316 } 338 317 #endif 339 318 ··· 1340 1315 { 1341 1316 if (update_housekeeping) { 1342 1317 /* Updating HK cpumasks implies rebuild sched domains */ 1343 - WARN_ON_ONCE(housekeeping_update(isolated_cpus)); 1344 1318 update_housekeeping = false; 1345 1319 force_sd_rebuild = true; 1320 + cpumask_copy(isolated_hk_cpus, isolated_cpus); 1321 + 1322 + /* 1323 + * housekeeping_update() is now called without holding 1324 + * cpus_read_lock and cpuset_mutex. Only cpuset_top_mutex 1325 + * is still being held for mutual exclusion. 1326 + */ 1327 + mutex_unlock(&cpuset_mutex); 1328 + cpus_read_unlock(); 1329 + WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus)); 1330 + cpus_read_lock(); 1331 + mutex_lock(&cpuset_mutex); 1346 1332 } 1347 1333 /* force_sd_rebuild will be cleared in rebuild_sched_domains_locked() */ 1348 1334 if (force_sd_rebuild) ··· 3671 3635 BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL)); 3672 3636 BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL)); 3673 3637 BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL)); 3638 + BUG_ON(!zalloc_cpumask_var(&isolated_hk_cpus, GFP_KERNEL)); 3674 3639 3675 3640 cpumask_setall(top_cpuset.cpus_allowed); 3676 3641 nodes_setall(top_cpuset.mems_allowed);
+1 -3
kernel/sched/isolation.c
··· 123 123 struct cpumask *trial, *old = NULL; 124 124 int err; 125 125 126 - lockdep_assert_cpus_held(); 127 - 128 126 trial = kmalloc(cpumask_size(), GFP_KERNEL); 129 127 if (!trial) 130 128 return -ENOMEM; ··· 134 136 } 135 137 136 138 if (!housekeeping.flags) 137 - static_branch_enable_cpuslocked(&housekeeping_overridden); 139 + static_branch_enable(&housekeeping_overridden); 138 140 139 141 if (housekeeping.flags & HK_FLAG_DOMAIN) 140 142 old = housekeeping_cpumask_dereference(HK_TYPE_DOMAIN);
+1 -3
kernel/time/timer_migration.c
··· 1559 1559 cpumask_var_t cpumask __free(free_cpumask_var) = CPUMASK_VAR_NULL; 1560 1560 int cpu; 1561 1561 1562 - lockdep_assert_cpus_held(); 1563 - 1564 1562 if (!works) 1565 1563 return -ENOMEM; 1566 1564 if (!alloc_cpumask_var(&cpumask, GFP_KERNEL)) ··· 1568 1570 * First set previously isolated CPUs as available (unisolate). 1569 1571 * This cpumask contains only CPUs that switched to available now. 1570 1572 */ 1573 + guard(cpus_read_lock)(); 1571 1574 cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask); 1572 1575 cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask); 1573 1576 ··· 1625 1626 cpumask_andnot(cpumask, cpu_possible_mask, housekeeping_cpumask(HK_TYPE_DOMAIN)); 1626 1627 1627 1628 /* Protect against RCU torture hotplug testing */ 1628 - guard(cpus_read_lock)(); 1629 1629 return tmigr_isolated_exclude_cpumask(cpumask); 1630 1630 } 1631 1631 late_initcall(tmigr_init_isolation);