Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

sched/mmcid: Prevent live lock on task to CPU mode transition

Ihor reported a BPF CI failure which turned out to be a live lock in the
MM_CID management. The scenario is:

A test program creates the 5th thread, which means the number of MM_CID
users becomes greater than the number of CPUs (four in this example), so
the mechanism switches to per CPU ownership mode.

At this point each live task of the program has a CID associated. For
simplicity, assume CIDs were assigned in thread creation order.

T0 CID0 runs fork() and creates T4
T1 CID1
T2 CID2
T3 CID3
T4 --- not visible yet

T0 sets mm_cid::percpu = true, transfers its own CID to CPU0, on which it
runs, and then starts the fixup, which walks through the threads to
transfer each per task CID to the CPU the task is running on, or to drop
it back into the pool if the task is not on a CPU.

During that, T1 - T3 are free to schedule in and out before the fixup
catches up with them. Going through all possible permutations with a
Python script revealed a few problematic cases. The most trivial one is:

T1 schedules in on CPU1 and observes percpu == true, so it transfers
its CID to CPU1

T1 is migrated to CPU2 and on schedule-in observes percpu == true, but
CPU2 does not have a CID associated and T1 already transferred its own
to CPU1.

So T1 has to allocate one with the CPU2 runqueue lock held, but the
pool is empty, so it keeps looping in mm_get_cid().

Now T0 reaches T1 in the thread walk and tries to take the corresponding
runqueue lock, which is held, causing a full live lock.
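The trivial case above can be modeled in a few lines of Python. This is a simplified illustration of the race, not the actual CI reproducer or kernel code; the names (pool, task_cid, cpu_cid, sched_in) and the data layout are made up for the sketch:

```python
# Simplified model of the task -> CPU mode switch WITHOUT the transit fix:
# 4 CPUs, 4 CIDs, all already handed out to tasks T0..T3.
NR_CPUS = 4

pool = set()                                 # free CIDs; empty: all owned by tasks
task_cid = {f"T{i}": i for i in range(4)}    # per task ownership
cpu_cid = {cpu: None for cpu in range(NR_CPUS)}

def sched_in(task, cpu):
    """Schedule 'task' in on 'cpu' in per CPU ownership mode."""
    if cpu_cid[cpu] is not None:
        return cpu_cid[cpu]                  # CPU already owns a CID
    if task_cid.get(task) is not None:
        cpu_cid[cpu] = task_cid.pop(task)    # hand the task's CID to the CPU
        return cpu_cid[cpu]
    if not pool:
        return None                          # mm_get_cid() would loop here
    cpu_cid[cpu] = pool.pop()
    return cpu_cid[cpu]

sched_in("T0", 0)        # T0 transfers CID0 to CPU0
sched_in("T1", 1)        # T1 transfers CID1 to CPU1
cid = sched_in("T1", 2)  # T1 migrated to CPU2: no task CID left, pool empty
print(cid)               # None -> in the kernel this spins under the rq lock
```

The `None` result is the point where the real code loops in mm_get_cid() with the runqueue lock held, which is exactly the lock the fixup walk then blocks on.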

There is a similar scenario in the reverse direction, when switching from
per CPU to task mode, which is far more obvious and was therefore
addressed by an intermediate mode. In this mode the CIDs are marked with
MM_CID_TRANSIT, which means they are neither owned by the CPU nor by the
task. When a task schedules out with a transit CID, it drops the CID back
into the pool, making it available for others to use temporarily. Once the
task which initiated the mode switch has finished the fixup, it clears the
transit mode and the process goes back into per task ownership mode.

Unfortunately this insight was not mapped back to the task to CPU mode
switch, as the scenario described above was not considered in the analysis.

Apply the same transit mechanism to the task to CPU mode switch to handle
these problematic cases correctly.

As with the CPU to task transition, this results in potential temporary
contention on the CID bitmap, but only for the time it takes to complete
the transition. After that the mechanism stays in steady mode, which does
not touch the bitmap at all.

Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions")
Closes: https://lore.kernel.org/2b7463d7-0f58-4e34-9775-6e2115cfb971@linux.dev
Reported-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192834.897115238@kernel.org

Authored by Thomas Gleixner, committed by Peter Zijlstra (4327fb13 18f7fcd5)

+88 -44

kernel/sched/core.c (+84 -44)
···
  * Serialization rules:
  *
  * mm::mm_cid::mutex: Serializes fork() and exit() and therefore
- *                    protects mm::mm_cid::users.
+ *                    protects mm::mm_cid::users and mode switch
+ *                    transitions
  *
  * mm::mm_cid::lock:  Serializes mm_update_max_cids() and
  *                    mm_update_cpus_allowed(). Nests in mm_cid::mutex
···
  *
  * A CID is either owned by a task (stored in task_struct::mm_cid.cid) or
  * by a CPU (stored in mm::mm_cid.pcpu::cid). CIDs owned by CPUs have the
- * MM_CID_ONCPU bit set. During transition from CPU to task ownership mode,
- * MM_CID_TRANSIT is set on the per task CIDs. When this bit is set the
- * task needs to drop the CID into the pool when scheduling out. Both bits
- * (ONCPU and TRANSIT) are filtered out by task_cid() when the CID is
- * actually handed over to user space in the RSEQ memory.
+ * MM_CID_ONCPU bit set.
+ *
+ * During the transition of ownership mode, the MM_CID_TRANSIT bit is set
+ * on the CIDs. When this bit is set the tasks drop the CID back into the
+ * pool when scheduling out.
+ *
+ * Both bits (ONCPU and TRANSIT) are filtered out by task_cid() when the
+ * CID is actually handed over to user space in the RSEQ memory.
  *
  * Mode switching:
+ *
+ * All transitions of ownership mode happen in two phases:
+ *
+ * 1) mm:mm_cid.transit contains MM_CID_TRANSIT. This is OR'ed on the CIDs
+ *    and denotes that the CID is only temporarily owned by a task. When
+ *    the task schedules out it drops the CID back into the pool if this
+ *    bit is set.
+ *
+ * 2) The initiating context walks the per CPU space or the tasks to fixup
+ *    or drop the CIDs and after completion it clears mm:mm_cid.transit.
+ *    After that point the CIDs are strictly task or CPU owned again.
+ *
+ * This two phase transition is required to prevent CID space exhaustion
+ * during the transition as a direct transfer of ownership would fail:
+ *
+ * - On task to CPU mode switch if a task is scheduled in on one CPU and
+ *   then migrated to another CPU before the fixup freed enough per task
+ *   CIDs.
+ *
+ * - On CPU to task mode switch if two tasks are scheduled in on the same
+ *   CPU before the fixup freed per CPU CIDs.
+ *
+ * Both scenarios can result in a live lock because sched_in() is invoked
+ * with runqueue lock held and loops in search of a CID and the fixup
+ * thread can't make progress freeing them up because it is stuck on the
+ * same runqueue lock.
+ *
+ * While MM_CID_TRANSIT is active during the transition phase the MM_CID
+ * bitmap can be contended, but that's a temporary contention bound to the
+ * transition period. After that everything goes back into steady state and
+ * nothing except fork() and exit() will touch the bitmap. This is an
+ * acceptable tradeoff as it completely avoids complex serialization,
+ * memory barriers and atomic operations for the common case.
+ *
+ * Aside of that this mechanism also ensures RT compatibility:
+ *
+ * - The task which runs the fixup is fully preemptible except for the
+ *   short runqueue lock held sections.
+ *
+ * - The transient impact of the bitmap contention is only problematic
+ *   when there is a thundering herd scenario of tasks scheduling in and
+ *   out concurrently. There is not much which can be done about that
+ *   except for avoiding mode switching by a proper overall system
+ *   configuration.
  *
  * Switching to per CPU mode happens when the user count becomes greater
  * than the maximum number of CIDs, which is calculated by:
···
  *
  * At the point of switching to per CPU mode the new user is not yet
  * visible in the system, so the task which initiated the fork() runs the
- * fixup function: mm_cid_fixup_tasks_to_cpu() walks the thread list and
- * either transfers each tasks owned CID to the CPU the task runs on or
- * drops it into the CID pool if a task is not on a CPU at that point in
- * time. Tasks which schedule in before the task walk reaches them do the
- * handover in mm_cid_schedin(). When mm_cid_fixup_tasks_to_cpus() completes
- * it's guaranteed that no task related to that MM owns a CID anymore.
+ * fixup function. mm_cid_fixup_tasks_to_cpu() walks the thread list and
+ * either marks each task owned CID with MM_CID_TRANSIT if the task is
+ * running on a CPU or drops it into the CID pool if a task is not on a
+ * CPU. Tasks which schedule in before the task walk reaches them do the
+ * handover in mm_cid_schedin(). When mm_cid_fixup_tasks_to_cpus()
+ * completes it is guaranteed that no task related to that MM owns a CID
+ * anymore.
  *
  * Switching back to task mode happens when the user count goes below the
  * threshold which was recorded on the per CPU mode switch:
···
  * run either in the deferred update function in context of a workqueue or
  * by a task which forks a new one or by a task which exits. Whatever
  * happens first. mm_cid_fixup_cpus_to_task() walks through the possible
- * CPUs and either transfers the CPU owned CIDs to a related task which
- * runs on the CPU or drops it into the pool. Tasks which schedule in on a
- * CPU which the walk did not cover yet do the handover themself.
- *
- * This transition from CPU to per task ownership happens in two phases:
- *
- * 1) mm:mm_cid.transit contains MM_CID_TRANSIT This is OR'ed on the task
- *    CID and denotes that the CID is only temporarily owned by the
- *    task. When it schedules out the task drops the CID back into the
- *    pool if this bit is set.
- *
- * 2) The initiating context walks the per CPU space and after completion
- *    clears mm:mm_cid.transit. So after that point the CIDs are strictly
- *    task owned again.
- *
- * This two phase transition is required to prevent CID space exhaustion
- * during the transition as a direct transfer of ownership would fail if
- * two tasks are scheduled in on the same CPU before the fixup freed per
- * CPU CIDs.
- *
- * When mm_cid_fixup_cpus_to_tasks() completes it's guaranteed that no CID
- * related to that MM is owned by a CPU anymore.
+ * CPUs and either marks the CPU owned CIDs with MM_CID_TRANSIT if a
+ * related task is running on the CPU or drops it into the pool. Tasks
+ * which are scheduled in before the fixup covered them do the handover
+ * themself. When mm_cid_fixup_cpus_to_tasks() completes it is guaranteed
+ * that no CID related to that MM is owned by a CPU anymore.
  */
···
	/* Mode change required? */
	if (!!mc->percpu == !!mc->pcpu_thrs)
		return false;
-	/* When switching back to per TASK mode, set the transition flag */
-	if (!mc->pcpu_thrs)
-		WRITE_ONCE(mc->transit, MM_CID_TRANSIT);
+
+	/* Set the transition flag to bridge the transfer */
+	WRITE_ONCE(mc->transit, MM_CID_TRANSIT);
	WRITE_ONCE(mc->percpu, !!mc->pcpu_thrs);
	return true;
}
···
	WRITE_ONCE(mm->mm_cid.transit, 0);
}

-static inline void mm_cid_transfer_to_cpu(struct task_struct *t, struct mm_cid_pcpu *pcp)
+static inline void mm_cid_transit_to_cpu(struct task_struct *t, struct mm_cid_pcpu *pcp)
{
	if (cid_on_task(t->mm_cid.cid)) {
-		t->mm_cid.cid = cid_to_cpu_cid(t->mm_cid.cid);
+		t->mm_cid.cid = cid_to_transit_cid(t->mm_cid.cid);
		pcp->cid = t->mm_cid.cid;
	}
}
···
	if (!t->mm_cid.active)
		return false;
	if (cid_on_task(t->mm_cid.cid)) {
-		/* If running on the CPU, transfer the CID, otherwise drop it */
+		/* If running on the CPU, put the CID in transit mode, otherwise drop it */
		if (task_rq(t)->curr == t)
-			mm_cid_transfer_to_cpu(t, per_cpu_ptr(mm->mm_cid.pcpu, task_cpu(t)));
+			mm_cid_transit_to_cpu(t, per_cpu_ptr(mm->mm_cid.pcpu, task_cpu(t)));
		else
			mm_unset_cid_on_task(t);
	}
	return true;
}

-static void mm_cid_fixup_tasks_to_cpus(void)
+static void mm_cid_do_fixup_tasks_to_cpus(struct mm_struct *mm)
{
-	struct mm_struct *mm = current->mm;
	struct task_struct *p, *t;
	unsigned int users;
···
			return;
		}
	}
+}
+
+static void mm_cid_fixup_tasks_to_cpus(void)
+{
+	struct mm_struct *mm = current->mm;
+
+	mm_cid_do_fixup_tasks_to_cpus(mm);
+	/* Clear the transition bit */
+	WRITE_ONCE(mm->mm_cid.transit, 0);
}
···
	if (!percpu)
		mm_cid_transit_to_task(current, pcp);
	else
-		mm_cid_transfer_to_cpu(current, pcp);
+		mm_cid_transit_to_cpu(current, pcp);
	}

	if (percpu) {
kernel/sched/sched.h (+4)
···
	/* Still nothing, allocate a new one */
	if (!cid_on_cpu(cpu_cid))
		cpu_cid = cid_to_cpu_cid(mm_get_cid(mm));
+
+	/* Set the transition mode flag if required */
+	if (READ_ONCE(mm->mm_cid.transit))
+		cpu_cid = cpu_cid_to_cid(cpu_cid) | MM_CID_TRANSIT;
	}
	mm_cid_update_pcpu_cid(mm, cpu_cid);
	mm_cid_update_task_cid(t, cpu_cid);