Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

sched_ext: Exit dispatch and move operations immediately when aborting

62dcbab8b0ef ("sched_ext: Avoid live-locking bypass mode switching") introduced
the breather mechanism to inject delays during bypass mode switching. It
maintains operation semantics unchanged while reducing lock contention to avoid
live-locks on large NUMA systems.

However, the breather only activates when exiting the scheduler, so there's no
need to maintain operation semantics. Simplify by exiting dispatch and move
operations immediately when scx_aborting is set. In consume_dispatch_q(), break
out of the task iteration loop. In scx_dsq_move(), return early before
acquiring locks.

This also fixes cases the breather mechanism cannot handle. When a large system
has many runnable threads affinitized to different CPU subsets and the BPF
scheduler places them all into a single DSQ, many CPUs can scan the DSQ
concurrently for tasks they can run. This can cause DSQ and RQ locks to be held
for extended periods, leading to various failure modes. The breather cannot
solve this because it runs only at the retry label, before the task iteration
loop; once a CPU is inside that loop, there's no exit. The new mechanism fixes
this by breaking out of the loop immediately.

The bypass DSQ is exempted to ensure the bypass mechanism itself can make
progress.
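
The loop exit and the bypass exemption described above can be modeled in user space. The following is a minimal sketch, not the kernel code: model_task, model_dsq, model_aborting, MODEL_DSQ_BYPASS and model_consume() are hypothetical stand-ins for the task list, the DSQ, scx_aborting, SCX_DSQ_BYPASS and consume_dispatch_q(), and CPU affinity is reduced to a 64-bit mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical stand-ins, not kernel APIs. */
#define MODEL_DSQ_BYPASS 1u	/* stand-in for SCX_DSQ_BYPASS */

struct model_task {
	uint64_t allowed_cpus;	/* bit N set: task may run on CPU N */
};

struct model_dsq {
	unsigned int id;
	const struct model_task *tasks;
	size_t nr_tasks;
};

/* Stand-in for scx_aborting; volatile forces a fresh load per iteration. */
static volatile bool model_aborting;

/*
 * Scan @q for the first task that may run on @cpu. Returns its index, or
 * -1 if none was found or the scan was abandoned because of the abort flag.
 * @visited, if non-NULL, receives the number of tasks inspected.
 */
static long model_consume(const struct model_dsq *q, unsigned int cpu,
			  size_t *visited)
{
	size_t i, seen = 0;
	long found = -1;

	assert(q != NULL && cpu < 64);

	for (i = 0; i < q->nr_tasks; i++) {
		/* When aborting, give up on every queue except the bypass one. */
		if (model_aborting && q->id != MODEL_DSQ_BYPASS)
			break;

		seen++;
		if (q->tasks[i].allowed_cpus & (UINT64_C(1) << cpu)) {
			found = (long)i;
			break;
		}
	}

	if (visited)
		*visited = seen;
	return found;
}
```

With the flag set, a scan of an ordinary queue inspects no tasks and returns -1, while a scan of the queue whose id is MODEL_DSQ_BYPASS still walks the list. In this model, that is what lets the bypass path keep making progress during an abort.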

v2: Use READ_ONCE() when reading scx_aborting (Andrea Righi).
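
As a rough illustration of why the flag is read with READ_ONCE(), the sketch below approximates the macro in user space with a volatile-qualified access, which makes the compiler emit a load on every loop iteration instead of caching the value. READ_ONCE_SKETCH, sketch_aborting and sketch_steps_until_abort() are hypothetical names, __typeof__ assumes GCC or Clang, and the flag is raised by the loop itself as a single-threaded stand-in for another CPU setting it.

```c
#include <stdbool.h>

/*
 * Hypothetical user-space approximation of READ_ONCE(): read through a
 * volatile-qualified pointer so the compiler must emit a load each time.
 * __typeof__ is a GCC/Clang extension.
 */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

static bool sketch_aborting;	/* stand-in for scx_aborting */

/*
 * Run up to @nr_steps iterations and return how many completed before the
 * flag was observed. The flag is raised at the end of iteration @set_at,
 * standing in for a concurrent writer on another CPU.
 */
static unsigned int sketch_steps_until_abort(unsigned int nr_steps,
					     unsigned int set_at)
{
	unsigned int i, done = 0;

	sketch_aborting = false;

	for (i = 0; i < nr_steps; i++) {
		/* Re-read the flag on every iteration. */
		if (READ_ONCE_SKETCH(sketch_aborting))
			break;

		done++;

		if (i == set_at)
			sketch_aborting = true;
	}

	return done;
}
```

This sketch only shows the access pattern. It does not reproduce the concurrency of the real code, where the flag is written by a different CPU.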

Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Andrea Righi <arighi@nvidia.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

Tejun Heo 5ebec443 a69040ed

+18 -44
kernel/sched/ext.c
 	return dst_rq;
 }
 
-/*
- * A poorly behaving BPF scheduler can live-lock the system by e.g. incessantly
- * banging on the same DSQ on a large NUMA system to the point where switching
- * to the bypass mode can take a long time. Inject artificial delays while the
- * bypass mode is switching to guarantee timely completion.
- */
-static void scx_breather(struct rq *rq)
-{
-	u64 until;
-
-	lockdep_assert_rq_held(rq);
-
-	if (likely(!READ_ONCE(scx_aborting)))
-		return;
-
-	raw_spin_rq_unlock(rq);
-
-	until = ktime_get_ns() + NSEC_PER_MSEC;
-
-	do {
-		int cnt = 1024;
-		while (READ_ONCE(scx_aborting) && --cnt)
-			cpu_relax();
-	} while (READ_ONCE(scx_aborting) &&
-		 time_before64(ktime_get_ns(), until));
-
-	raw_spin_rq_lock(rq);
-}
-
 static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
 			       struct scx_dispatch_q *dsq)
 {
 	struct task_struct *p;
 retry:
-	/*
-	 * This retry loop can repeatedly race against scx_bypass() dequeueing
-	 * tasks from @dsq trying to put the system into the bypass mode. On
-	 * some multi-socket machines (e.g. 2x Intel 8480c), this can live-lock
-	 * the machine into soft lockups. Give a breather.
-	 */
-	scx_breather(rq);
-
 	/*
 	 * The caller can't expect to successfully consume a task if the task's
 	 * addition to @dsq isn't guaranteed to be visible somehow. Test
[...]
 
 	nldsq_for_each_task(p, dsq) {
 		struct rq *task_rq = task_rq(p);
+
+		/*
+		 * This loop can lead to multiple lockup scenarios, e.g. the BPF
+		 * scheduler can put an enormous number of affinitized tasks into
+		 * a contended DSQ, or the outer retry loop can repeatedly race
+		 * against scx_bypass() dequeueing tasks from @dsq trying to put
+		 * the system into the bypass mode. This can easily live-lock the
+		 * machine. If aborting, exit from all non-bypass DSQs.
+		 */
+		if (unlikely(READ_ONCE(scx_aborting)) && dsq->id != SCX_DSQ_BYPASS)
+			break;
 
 		if (rq == task_rq) {
 			task_unlink_from_dsq(p, dsq);
[...]
 		return false;
 
 	/*
+	 * If the BPF scheduler keeps calling this function repeatedly, it can
+	 * cause similar live-lock conditions as consume_dispatch_q().
+	 */
+	if (unlikely(READ_ONCE(scx_aborting)))
+		return false;
+
+	/*
 	 * Can be called from either ops.dispatch() locking this_rq() or any
 	 * context where no rq lock is held. If latter, lock @p's task_rq which
 	 * we'll likely need anyway.
[...]
 	} else {
 		raw_spin_rq_lock(src_rq);
 	}
-
-	/*
-	 * If the BPF scheduler keeps calling this function repeatedly, it can
-	 * cause similar live-lock conditions as consume_dispatch_q(). Insert a
-	 * breather if necessary.
-	 */
-	scx_breather(src_rq);
 
 	locked_rq = src_rq;
 	raw_spin_lock(&src_dsq->lock);