Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

workqueue: Fix false positive stall reports

On weakly ordered architectures (e.g., arm64), the lockless check in
wq_watchdog_timer_fn() can observe a reordering between the worklist
insertion and the last_progress_ts update. Specifically, the watchdog
can see a non-empty worklist (from a list_add) while reading a stale
last_progress_ts value, causing a false positive stall report.

This was confirmed by reading pool->last_progress_ts again after holding
pool->lock in wq_watchdog_timer_fn():

workqueue watchdog: pool 7 false positive detected!
lockless_ts=4784580465 locked_ts=4785033728
diff=453263ms worklist_empty=0

To avoid slowing down the hot path (queue_work, etc.), recheck
last_progress_ts with pool->lock held. This will eliminate the false
positive with minimal overhead.

Remove two extra empty lines in wq_watchdog_timer_fn() while we are at it.

Fixes: 82607adcf9cd ("workqueue: implement lockup detector")
Cc: stable@vger.kernel.org # v4.5+
Assisted-by: claude-code:claude-opus-4-6
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>

Authored by Song Liu and committed by Tejun Heo
c7f27a8a 98c790b1

+21 -3
kernel/workqueue.c
···
 		else
 			ts = touched;
 
-		/* did we stall? */
+		/*
+		 * Did we stall?
+		 *
+		 * Do a lockless check first. On weakly ordered
+		 * architectures, the lockless check can observe a
+		 * reordering between worklist insert_work() and
+		 * last_progress_ts update from __queue_work(). Since
+		 * __queue_work() is a much hotter path than the timer
+		 * function, we handle false positive here by reading
+		 * last_progress_ts again with pool->lock held.
+		 */
 		if (time_after(now, ts + thresh)) {
+			scoped_guard(raw_spinlock_irqsave, &pool->lock) {
+				pool_ts = pool->last_progress_ts;
+				if (time_after(pool_ts, touched))
+					ts = pool_ts;
+				else
+					ts = touched;
+			}
+			if (!time_after(now, ts + thresh))
+				continue;
+
 			lockup_detected = true;
 			stall_time = jiffies_to_msecs(now - pool_ts) / 1000;
 			max_stall_time = max(max_stall_time, stall_time);
···
 			pr_cont_pool_info(pool);
 			pr_cont(" stuck for %us!\n", stall_time);
 		}
-
-
 	}
 
 	if (lockup_detected)
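
The check-then-recheck pattern used by this patch can be illustrated outside the kernel. The following is only a userspace sketch under stated assumptions: struct pool, ts_after(), later_of() and pool_stalled() are hypothetical names, a pthread mutex stands in for pool->lock, and a C11 atomic stands in for last_progress_ts. It is not the kernel code and it does not reproduce the arm64 reordering itself. It only shows why the lock is taken on the rare "looks stalled" path, while the common path stays lockless.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a worker pool (not the kernel struct). */
struct pool {
	pthread_mutex_t lock;                   /* plays the role of pool->lock */
	_Atomic unsigned long last_progress_ts; /* writers update this under lock */
};

/* Wrap-safe "a is later than b", the same idea as the kernel's time_after(). */
static bool ts_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Pick the later of two timestamps. */
static unsigned long later_of(unsigned long a, unsigned long b)
{
	return ts_after(a, b) ? a : b;
}

/*
 * Return true only if the stall is still visible after re-reading the
 * timestamp with the lock held. The first read is lockless and may be
 * stale; the second read cannot miss an update made under the same lock.
 */
static bool pool_stalled(struct pool *p, unsigned long touched,
			 unsigned long now, unsigned long thresh)
{
	unsigned long ts;

	/* Fast path: lockless read, no cost to the queueing side. */
	ts = later_of(atomic_load_explicit(&p->last_progress_ts,
					   memory_order_relaxed), touched);
	if (!ts_after(now, ts + thresh))
		return false;

	/* Slow path: confirm the suspected stall under the lock. */
	pthread_mutex_lock(&p->lock);
	ts = later_of(atomic_load_explicit(&p->last_progress_ts,
					   memory_order_relaxed), touched);
	pthread_mutex_unlock(&p->lock);

	return ts_after(now, ts + thresh);
}
```

In this sketch the queueing side pays nothing extra. The added cost is one lock acquisition, and only when the lockless check already suggests a stall, which matches the trade-off described in the commit message.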