Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

workqueue: add time-based panic for stalls

Add a new module parameter 'panic_on_stall_time' that triggers a panic
when a workqueue stall persists for longer than the specified duration
in seconds.

Unlike 'panic_on_stall', which counts accumulated stall events, this
parameter triggers based on the duration of a single continuous stall.
This is useful for catching truly stuck workqueues, as opposed to
panicking because many transient stalls have added up over time.
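
To make the difference between the two policies concrete, here is a minimal userspace sketch of the decision made in panic_on_wq_watchdog(). The struct and function names (stall_policy, stall_should_panic) are hypothetical and exist only for this illustration; the kernel keeps this state in static variables and calls BUG_ON() instead of returning a value.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two tunables; not a kernel data structure. */
struct stall_policy {
	unsigned int panic_on_stall;      /* event-count threshold, 0 = disabled */
	unsigned int panic_on_stall_time; /* duration threshold in seconds, 0 = disabled */
	unsigned int stall_events;        /* watchdog passes that saw a stall so far */
};

/*
 * Call once for every watchdog pass that detected a stall.
 * Returns true where the kernel code would hit BUG_ON().
 */
static bool stall_should_panic(struct stall_policy *p, unsigned int stall_time_sec)
{
	assert(p);

	/* Count-based policy: every stalled pass adds up, however short. */
	if (p->panic_on_stall) {
		p->stall_events++;
		if (p->stall_events >= p->panic_on_stall)
			return true;
	}

	/* Time-based policy: only the length of the current stall matters. */
	return p->panic_on_stall_time &&
	       stall_time_sec >= p->panic_on_stall_time;
}
```

With panic_on_stall_time set to 120 and panic_on_stall left at 0, a stall reported as 119 seconds old does not trigger, while one reported as 120 seconds old does.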

Usage:
workqueue.panic_on_stall_time=120

This would panic if any workqueue pool has been stalled for 120 seconds
or more.

The stall duration is measured from the workqueue pool's last progress
timestamp (pool_ts), which accounts for legitimate system stalls.
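
As a rough illustration of that measurement, the sketch below converts the distance between the current tick counter and the pool_ts timestamp used in the code into whole seconds. It is a simplified userspace stand-in: the function name stall_seconds and the hz parameter are assumptions made for this example, whereas the patch itself computes jiffies_to_msecs(now - pool_ts) / 1000 against the build-time HZ.

```c
#include <assert.h>

/*
 * Hypothetical helper: seconds elapsed between two tick-counter samples.
 * `hz` is the number of ticks per second.
 */
static unsigned int stall_seconds(unsigned long now, unsigned long pool_ts,
				  unsigned int hz)
{
	unsigned long elapsed_ticks;

	assert(hz > 0);

	/* Unsigned subtraction stays correct across a counter wraparound. */
	elapsed_ticks = now - pool_ts;

	/* Truncate to whole seconds, matching the integer division in the patch. */
	return (unsigned int)(elapsed_ticks / hz);
}
```

Because the result is truncated, a stall has to last at least the full configured number of seconds before a comparison such as stall_time_sec >= 120 can succeed.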

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>

Authored by Breno Leitao and committed by Tejun Heo
f84c9dd3 32d572e3

+26 -4
+8
Documentation/admin-guide/kernel-parameters.txt
···
 			The default is set by CONFIG_BOOTPARAM_WQ_STALL_PANIC,
 			which is 0 (disabled) if not configured.
 
+	workqueue.panic_on_stall_time=<uint>
+			Panic when a workqueue stall has been continuous for
+			the specified number of seconds. Unlike panic_on_stall
+			which counts accumulated stall events, this triggers
+			based on the duration of a single continuous stall.
+
+			The default is 0, which disables the time-based panic.
+
 	workqueue.cpu_intensive_thresh_us=
 			Per-cpu work items which run for longer than this
 			threshold are automatically considered CPU intensive
+18 -4
kernel/workqueue.c
···
 static unsigned int wq_panic_on_stall = CONFIG_BOOTPARAM_WQ_STALL_PANIC;
 module_param_named(panic_on_stall, wq_panic_on_stall, uint, 0644);
 
+static unsigned int wq_panic_on_stall_time;
+module_param_named(panic_on_stall_time, wq_panic_on_stall_time, uint, 0644);
+MODULE_PARM_DESC(panic_on_stall_time, "Panic if stall exceeds this many seconds (0=disabled)");
+
 /*
  * Show workers that might prevent the processing of pending work items.
  * The only candidates are CPU-bound workers in the running state.
···
 	rcu_read_unlock();
 }
 
-static void panic_on_wq_watchdog(void)
+/*
+ * It triggers a panic in two scenarios: when the total number of stalls
+ * exceeds a threshold, and when a stall lasts longer than
+ * wq_panic_on_stall_time
+ */
+static void panic_on_wq_watchdog(unsigned int stall_time_sec)
 {
 	static unsigned int wq_stall;
 
···
 		wq_stall++;
 		BUG_ON(wq_stall >= wq_panic_on_stall);
 	}
+
+	BUG_ON(wq_panic_on_stall_time && stall_time_sec >= wq_panic_on_stall_time);
 }
 
 static void wq_watchdog_reset_touched(void)
···
 static void wq_watchdog_timer_fn(struct timer_list *unused)
 {
 	unsigned long thresh = READ_ONCE(wq_watchdog_thresh) * HZ;
+	unsigned int max_stall_time = 0;
 	bool lockup_detected = false;
 	bool cpu_pool_stall = false;
 	unsigned long now = jiffies;
 	struct worker_pool *pool;
+	unsigned int stall_time;
 	int pi;
 
 	if (!thresh)
···
 		/* did we stall? */
 		if (time_after(now, ts + thresh)) {
 			lockup_detected = true;
+			stall_time = jiffies_to_msecs(now - pool_ts) / 1000;
+			max_stall_time = max(max_stall_time, stall_time);
 			if (pool->cpu >= 0 && !(pool->flags & POOL_BH)) {
 				pool->cpu_stall = true;
 				cpu_pool_stall = true;
 			}
 			pr_emerg("BUG: workqueue lockup - pool");
 			pr_cont_pool_info(pool);
-			pr_cont(" stuck for %us!\n",
-				jiffies_to_msecs(now - pool_ts) / 1000);
+			pr_cont(" stuck for %us!\n", stall_time);
 		}
 
 
···
 		show_cpu_pools_hogs();
 
 	if (lockup_detected)
-		panic_on_wq_watchdog();
+		panic_on_wq_watchdog(max_stall_time);
 
 	wq_watchdog_reset_touched();
 	mod_timer(&wq_watchdog_timer, jiffies + thresh);
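
The timer function above passes only the longest stall seen in a pass to panic_on_wq_watchdog(). A minimal userspace sketch of that aggregation follows, assuming each pool's idle time is already available in seconds. The function name longest_stall_sec and the array-based interface are hypothetical; the kernel walks its pool list and derives the values from jiffies, and it skips the check entirely when the watchdog threshold is 0, which this model does not reproduce.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of one watchdog pass. Each entry of idle_sec is the
 * number of seconds since that pool last made progress. Returns the longest
 * stall among pools past the threshold, or 0 when no pool is stalled.
 */
static unsigned int longest_stall_sec(const unsigned int *idle_sec, size_t npools,
				      unsigned int thresh_sec)
{
	unsigned int max_stall_time = 0;
	size_t i;

	assert(idle_sec || npools == 0);

	for (i = 0; i < npools; i++) {
		/* Only pools idle for longer than the threshold count as stalled. */
		if (idle_sec[i] > thresh_sec && idle_sec[i] > max_stall_time)
			max_stall_time = idle_sec[i];
	}

	return max_stall_time;
}
```

Taking the maximum means one long-stuck pool is enough to reach the panic_on_stall_time limit, regardless of how many other pools stalled only briefly in the same pass.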