Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue

cpuset_handle_hotplug() may need to invoke housekeeping_update(),
for instance, when an isolated partition is invalidated because its
last active CPU has been taken offline.

As we are going to enable dynamic updates to the nohz_full housekeeping
cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
allowing the CPU hotplug path to call into housekeeping_update() directly
from update_isolation_cpumasks() will likely cause a deadlock. So we
have to defer any call to housekeeping_update() until after the CPU
hotplug operation has finished. This is now done via a workqueue, where
the update_hk_sched_domains() function is invoked from the
hk_sd_workfn() work function.

A concurrent cpuset control file write may have executed the required
update_hk_sched_domains() function before the work function is called. So
the work function call may become a no-op when it is invoked.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

authored by Waiman Long and committed by Tejun Heo
6df415aa 3bfe4796

+36 -6
+26 -5
kernel/cgroup/cpuset.c
···
 	rebuild_sched_domains_locked();
 }

+/*
+ * Work function to invoke update_hk_sched_domains()
+ */
+static void hk_sd_workfn(struct work_struct *work)
+{
+	cpuset_full_lock();
+	update_hk_sched_domains();
+	cpuset_full_unlock();
+}
+
 /**
  * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
  * @parent: Parent cpuset containing all siblings
···
  */
 static void cpuset_handle_hotplug(void)
 {
+	static DECLARE_WORK(hk_sd_work, hk_sd_workfn);
 	static cpumask_t new_cpus;
 	static nodemask_t new_mems;
 	bool cpus_updated, mems_updated;
···
 	}


-	if (update_housekeeping || force_sd_rebuild) {
-		mutex_lock(&cpuset_mutex);
-		update_hk_sched_domains();
-		mutex_unlock(&cpuset_mutex);
-	}
+	/*
+	 * Queue a work to call housekeeping_update() & rebuild_sched_domains()
+	 * There will be a slight delay before the HK_TYPE_DOMAIN housekeeping
+	 * cpumask can correctly reflect what is in isolated_cpus.
+	 *
+	 * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work item that
+	 * is still pending. Before the pending bit is cleared, the work data
+	 * is copied out and work item dequeued. So it is possible to queue
+	 * the work again before the hk_sd_workfn() is invoked to process the
+	 * previously queued work. Since hk_sd_workfn() doesn't use the work
+	 * item at all, this is not a problem.
+	 */
+	if (update_housekeeping || force_sd_rebuild)
+		queue_work(system_unbound_wq, &hk_sd_work);
+
 	free_tmpmasks(ptmp);
 }

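
The queueing behaviour that the comment in cpuset_handle_hotplug() relies on can be sketched in userspace C. This is only an illustration under stated assumptions, not kernel code: struct fake_work, fake_queue_work(), fake_run_work(), fake_update_hk() and the hk_update_needed flag are all hypothetical stand-ins invented here. The atomic flag plays the role of WORK_STRUCT_PENDING_BIT, and fake_update_hk() models an update that turns into a no-op when someone else has already performed it.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a kernel work item; not the kernel API. */
struct fake_work {
	atomic_bool pending;	/* plays the role of WORK_STRUCT_PENDING_BIT */
	void (*fn)(void);	/* work function; takes no per-item data */
};

static bool hk_update_needed;	/* assumed flag: set when an update is outstanding */
static int hk_updates_done;	/* counts updates that actually did something */

/* Stand-in for the deferred update: a no-op if nothing is outstanding. */
static void fake_update_hk(void)
{
	if (!hk_update_needed)
		return;
	hk_update_needed = false;
	hk_updates_done++;
}

/* Returns true if the item was newly queued, false if it was already pending. */
static bool fake_queue_work(struct fake_work *w)
{
	return !atomic_exchange(&w->pending, true);
}

/*
 * Clear the pending flag before calling the function, so the item can be
 * queued again while (or before) the function runs.
 */
static void fake_run_work(struct fake_work *w)
{
	if (atomic_exchange(&w->pending, false))
		w->fn();
}
```

In this sketch, queueing the same item twice before it runs yields a single invocation, and if another caller runs fake_update_hk() first, the later fake_run_work() call changes nothing. That mirrors the no-op case described in the commit message, within the limits of a single-threaded model.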
+10 -1
tools/testing/selftests/cgroup/test_cpuset_prs.sh
···
 	" C2-3:P1 C3:P1 . . O3=0 . . . 0 A1:2|A2: A1:P1|A2:P1"
 	" C2-3:P1 C3:P1 . . T:O2=0 . . . 0 A1:3|A2:3 A1:P1|A2:P-1"
 	" C2-3:P1 C3:P1 . . . T:O3=0 . . 0 A1:2|A2:2 A1:P1|A2:P-1"
+	" C2-3:P1 C3:P2 . . T:O2=0 . . . 0 A1:3|A2:3 A1:P1|A2:P-2"
+	" C1-3:P1 C3:P2 . . . T:O3=0 . . 0 A1:1-2|A2:1-2 A1:P1|A2:P-2 3|"
+	" C1-3:P1 C3:P2 . . . T:O3=0 O3=1 . 0 A1:1-2|A2:3 A1:P1|A2:P2 3"
 	"$SETUP_A123_PARTITIONS . O1=0 . . . 0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
 	"$SETUP_A123_PARTITIONS . O2=0 . . . 0 A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
 	"$SETUP_A123_PARTITIONS . O3=0 . . . 0 A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
···
 # only CPUs in isolated partitions as well as those that are isolated at
 # boot time.
 #
-# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
+# $1 - expected isolated cpu list(s) <isolcpus1>{|<isolcpus2>}
 # <isolcpus1> - expected sched/domains value
 # <isolcpus2> - cpuset.cpus.isolated value = <isolcpus1> if not defined
 #
···
 	EXPECTED_ISOLCPUS=$1
 	ISCPUS=${CGROUP2}/cpuset.cpus.isolated
 	ISOLCPUS=$(cat $ISCPUS)
+	HKICPUS=$(cat /sys/devices/system/cpu/isolated)
 	LASTISOLCPU=
 	SCHED_DOMAINS=/sys/kernel/debug/sched/domains
 	if [[ $EXPECTED_ISOLCPUS = . ]]
···
 	[[ "$EXPECTED_ISOLCPUS" != "$ISOLCPUS" ]] && return 1
 	ISOLCPUS=
 	EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
+
+	#
+	# The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS
+	#
+	[[ "$ISOLCPUS" != "$HKICPUS" ]] && return 1

 	#
 	# Use the sched domain in debugfs to check isolated CPUs, if available