Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'wq-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue updates from Tejun Heo:

- Rework the rescuer to process work items one-by-one instead of
  slurping all pending work items in a single pass.

  As there is only one rescuer per workqueue, a single long-blocking
  work item could cause high latency for all tasks queued behind it,
  even after memory pressure is relieved and regular kworkers become
  available to service them.

- Add CONFIG_BOOTPARAM_WQ_STALL_PANIC build-time option and
  workqueue.panic_on_stall_time parameter for time-based stall panic,
  giving systems more control over workqueue stall handling.

- Replace BUG_ON() with panic() in the stall panic path for clearer
  intent and more informative output.

* tag 'wq-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: replace BUG_ON with panic in panic_on_wq_watchdog
workqueue: add time-based panic for stalls
workqueue: add CONFIG_BOOTPARAM_WQ_STALL_PANIC option
workqueue: Process extra works in rescuer on memory pressure
workqueue: Process rescuer work items one-by-one using a cursor
workqueue: Make send_mayday() take a PWQ argument directly

+149 -36
+10 -1
Documentation/admin-guide/kernel-parameters.txt
···
 	CONFIG_WQ_WATCHDOG. It sets the number times of the
 	stall to trigger panic.
 
-	The default is 0, which disables the panic on stall.
+	The default is set by CONFIG_BOOTPARAM_WQ_STALL_PANIC,
+	which is 0 (disabled) if not configured.
+
+workqueue.panic_on_stall_time=<uint>
+	Panic when a workqueue stall has been continuous for
+	the specified number of seconds. Unlike panic_on_stall
+	which counts accumulated stall events, this triggers
+	based on the duration of a single continuous stall.
+
+	The default is 0, which disables the time-based panic.
 
 workqueue.cpu_intensive_thresh_us=
 	Per-cpu work items which run for longer than this
+116 -35
kernel/workqueue.c
···
 	MAYDAY_INTERVAL		= HZ / 10,	/* and then every 100ms */
 	CREATE_COOLDOWN		= HZ,		/* time to breath after fail */
 
+	RESCUER_BATCH		= 16,		/* process items per turn */
+
 	/*
 	 * Rescue workers are used only on emergencies and shared by
 	 * all cpus. Give MIN_NICE.
···
 	struct list_head	pending_node;	/* LN: node on wq_node_nr_active->pending_pwqs */
 	struct list_head	pwqs_node;	/* WR: node on wq->pwqs */
 	struct list_head	mayday_node;	/* MD: node on wq->maydays */
+	struct work_struct	mayday_cursor;	/* L: cursor on pool->worklist */
 
 	u64 stats[PWQ_NR_STATS];
 
···
 	return NULL;
 }
 
+static void mayday_cursor_func(struct work_struct *work)
+{
+	/* should not be processed, only for marking position */
+	BUG();
+}
+
 /**
  * move_linked_works - move linked works to a list
  * @work: start of series of works to be scheduled
···
 	struct worker *collision;
 
 	lockdep_assert_held(&pool->lock);
+
+	/* The cursor work should not be processed */
+	if (unlikely(work->func == mayday_cursor_func)) {
+		/* only worker_thread() can possibly take this branch */
+		WARN_ON_ONCE(worker->rescue_wq);
+		if (nextp)
+			*nextp = list_next_entry(work, entry);
+		list_del_init(&work->entry);
+		return false;
+	}
 
 	/*
 	 * A single work shouldn't be executed concurrently by multiple workers.
···
 	reap_dying_workers(&cull_list);
 }
 
-static void send_mayday(struct work_struct *work)
+static void send_mayday(struct pool_workqueue *pwq)
 {
-	struct pool_workqueue *pwq = get_work_pwq(work);
 	struct workqueue_struct *wq = pwq->wq;
 
 	lockdep_assert_held(&wq_mayday_lock);
···
 		 * rescuers.
 		 */
 		list_for_each_entry(work, &pool->worklist, entry)
-			send_mayday(work);
+			send_mayday(get_work_pwq(work));
 	}
 
 	raw_spin_unlock(&wq_mayday_lock);
···
 static bool assign_rescuer_work(struct pool_workqueue *pwq, struct worker *rescuer)
 {
 	struct worker_pool *pool = pwq->pool;
+	struct work_struct *cursor = &pwq->mayday_cursor;
 	struct work_struct *work, *n;
 
-	/* need rescue? */
-	if (!pwq->nr_active || !need_to_create_worker(pool))
+	/* have work items to rescue? */
+	if (!pwq->nr_active)
 		return false;
 
-	/*
-	 * Slurp in all works issued via this workqueue and
-	 * process'em.
-	 */
-	list_for_each_entry_safe(work, n, &pool->worklist, entry) {
-		if (get_work_pwq(work) == pwq && assign_work(work, rescuer, &n))
-			pwq->stats[PWQ_STAT_RESCUED]++;
+	/* need rescue? */
+	if (!need_to_create_worker(pool)) {
+		/*
+		 * The pool has idle workers and doesn't need the rescuer, so it
+		 * could simply return false here.
+		 *
+		 * However, the memory pressure might not be fully relieved.
+		 * In PERCPU pool with concurrency enabled, having idle workers
+		 * does not necessarily mean memory pressure is gone; it may
+		 * simply mean regular workers have woken up, completed their
+		 * work, and gone idle again due to concurrency limits.
+		 *
+		 * In this case, those working workers may later sleep again,
+		 * the pool may run out of idle workers, and it will have to
+		 * allocate new ones and wait for the timer to send mayday,
+		 * causing unnecessary delay - especially if memory pressure
+		 * was never resolved throughout.
+		 *
+		 * Do more work if memory pressure is still on to reduce
+		 * relapse, using (pool->flags & POOL_MANAGER_ACTIVE), though
+		 * not precisely, unless there are other PWQs needing help.
+		 */
+		if (!(pool->flags & POOL_MANAGER_ACTIVE) ||
+		    !list_empty(&pwq->wq->maydays))
+			return false;
 	}
 
-	return !list_empty(&rescuer->scheduled);
+	/* search from the start or cursor if available */
+	if (list_empty(&cursor->entry))
+		work = list_first_entry(&pool->worklist, struct work_struct, entry);
+	else
+		work = list_next_entry(cursor, entry);
+
+	/* find the next work item to rescue */
+	list_for_each_entry_safe_from(work, n, &pool->worklist, entry) {
+		if (get_work_pwq(work) == pwq && assign_work(work, rescuer, &n)) {
+			pwq->stats[PWQ_STAT_RESCUED]++;
+			/* put the cursor for next search */
+			list_move_tail(&cursor->entry, &n->entry);
+			return true;
+		}
+	}
+
+	return false;
 }
 
 /**
···
 		struct pool_workqueue *pwq = list_first_entry(&wq->maydays,
 					struct pool_workqueue, mayday_node);
 		struct worker_pool *pool = pwq->pool;
+		unsigned int count = 0;
 
 		__set_current_state(TASK_RUNNING);
 		list_del_init(&pwq->mayday_node);
···
 
 		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
 
-		if (assign_rescuer_work(pwq, rescuer)) {
+		while (assign_rescuer_work(pwq, rescuer)) {
 			process_scheduled_works(rescuer);
 
 			/*
-			 * The above execution of rescued work items could
-			 * have created more to rescue through
-			 * pwq_activate_first_inactive() or chained
-			 * queueing. Let's put @pwq back on mayday list so
-			 * that such back-to-back work items, which may be
-			 * being used to relieve memory pressure, don't
-			 * incur MAYDAY_INTERVAL delay inbetween.
+			 * If the per-turn work item limit is reached and other
+			 * PWQs are in mayday, requeue mayday for this PWQ and
+			 * let the rescuer handle the other PWQs first.
 			 */
-			if (pwq->nr_active && need_to_create_worker(pool)) {
+			if (++count > RESCUER_BATCH && !list_empty(&pwq->wq->maydays) &&
+			    pwq->nr_active && need_to_create_worker(pool)) {
 				raw_spin_lock(&wq_mayday_lock);
-				/*
-				 * Queue iff somebody else hasn't queued it already.
-				 */
-				if (list_empty(&pwq->mayday_node)) {
-					get_pwq(pwq);
-					list_add_tail(&pwq->mayday_node, &wq->maydays);
-				}
+				send_mayday(pwq);
 				raw_spin_unlock(&wq_mayday_lock);
+				break;
 			}
 		}
+
+		/* The cursor can not be left behind without the rescuer watching it. */
+		if (!list_empty(&pwq->mayday_cursor.entry) && list_empty(&pwq->mayday_node))
+			list_del_init(&pwq->mayday_cursor.entry);
 
 		/*
 		 * Leave this pool. Notify regular workers; otherwise, we end up
···
 	INIT_LIST_HEAD(&pwq->pwqs_node);
 	INIT_LIST_HEAD(&pwq->mayday_node);
 	kthread_init_work(&pwq->release_work, pwq_release_workfn);
+
+	/*
+	 * Set the dummy cursor work with valid function and get_work_pwq().
+	 *
+	 * The cursor work should only be in the pwq->pool->worklist, and
+	 * should not be treated as a processable work item.
+	 *
+	 * WORK_STRUCT_PENDING and WORK_STRUCT_INACTIVE just make it less
+	 * surprise for kernel debugging tools and reviewers.
+	 */
+	INIT_WORK(&pwq->mayday_cursor, mayday_cursor_func);
+	atomic_long_set(&pwq->mayday_cursor.data, (unsigned long)pwq |
+			WORK_STRUCT_PENDING | WORK_STRUCT_PWQ | WORK_STRUCT_INACTIVE);
 }
 
 /* sync @pwq with the current state of its associated wq and link it */
···
 static unsigned long wq_watchdog_touched = INITIAL_JIFFIES;
 static DEFINE_PER_CPU(unsigned long, wq_watchdog_touched_cpu) = INITIAL_JIFFIES;
 
-static unsigned int wq_panic_on_stall;
+static unsigned int wq_panic_on_stall = CONFIG_BOOTPARAM_WQ_STALL_PANIC;
 module_param_named(panic_on_stall, wq_panic_on_stall, uint, 0644);
+
+static unsigned int wq_panic_on_stall_time;
+module_param_named(panic_on_stall_time, wq_panic_on_stall_time, uint, 0644);
+MODULE_PARM_DESC(panic_on_stall_time, "Panic if stall exceeds this many seconds (0=disabled)");
 
 /*
  * Show workers that might prevent the processing of pending work items.
···
 	rcu_read_unlock();
 }
 
-static void panic_on_wq_watchdog(void)
+/*
+ * It triggers a panic in two scenarios: when the total number of stalls
+ * exceeds a threshold, and when a stall lasts longer than
+ * wq_panic_on_stall_time
+ */
+static void panic_on_wq_watchdog(unsigned int stall_time_sec)
 {
 	static unsigned int wq_stall;
 
 	if (wq_panic_on_stall) {
 		wq_stall++;
-		BUG_ON(wq_stall >= wq_panic_on_stall);
+		if (wq_stall >= wq_panic_on_stall)
+			panic("workqueue: %u stall(s) exceeded threshold %u\n",
+			      wq_stall, wq_panic_on_stall);
 	}
+
+	if (wq_panic_on_stall_time && stall_time_sec >= wq_panic_on_stall_time)
+		panic("workqueue: stall lasted %us, exceeding threshold %us\n",
+		      stall_time_sec, wq_panic_on_stall_time);
 }
 
 static void wq_watchdog_reset_touched(void)
···
 static void wq_watchdog_timer_fn(struct timer_list *unused)
 {
 	unsigned long thresh = READ_ONCE(wq_watchdog_thresh) * HZ;
+	unsigned int max_stall_time = 0;
 	bool lockup_detected = false;
 	bool cpu_pool_stall = false;
 	unsigned long now = jiffies;
 	struct worker_pool *pool;
+	unsigned int stall_time;
 	int pi;
 
 	if (!thresh)
···
 		/* did we stall? */
 		if (time_after(now, ts + thresh)) {
 			lockup_detected = true;
+			stall_time = jiffies_to_msecs(now - pool_ts) / 1000;
+			max_stall_time = max(max_stall_time, stall_time);
 			if (pool->cpu >= 0 && !(pool->flags & POOL_BH)) {
 				pool->cpu_stall = true;
 				cpu_pool_stall = true;
 			}
 			pr_emerg("BUG: workqueue lockup - pool");
 			pr_cont_pool_info(pool);
-			pr_cont(" stuck for %us!\n",
-				jiffies_to_msecs(now - pool_ts) / 1000);
+			pr_cont(" stuck for %us!\n", stall_time);
 		}
 
···
 	show_cpu_pools_hogs();
 
 	if (lockup_detected)
-		panic_on_wq_watchdog();
+		panic_on_wq_watchdog(max_stall_time);
 
 	wq_watchdog_reset_touched();
 	mod_timer(&wq_watchdog_timer, jiffies + thresh);
+23
lib/Kconfig.debug
···
 	  state. This can be configured through kernel parameter
 	  "workqueue.watchdog_thresh" and its sysfs counterpart.
 
+config BOOTPARAM_WQ_STALL_PANIC
+	int "Panic on Nth workqueue stall"
+	default 0
+	range 0 100
+	depends on WQ_WATCHDOG
+	help
+	  Set the number of workqueue stalls to trigger a kernel panic.
+	  A workqueue stall occurs when a worker pool doesn't make forward
+	  progress on a pending work item for over 30 seconds (configurable
+	  using the workqueue.watchdog_thresh parameter).
+
+	  If n = 0, the kernel will not panic on stall. If n > 0, the kernel
+	  will panic after n stall warnings.
+
+	  The panic can be used in combination with panic_timeout,
+	  to cause the system to reboot automatically after a
+	  stall has been detected. This feature is useful for
+	  high-availability systems that have uptime guarantees and
+	  where a stall must be resolved ASAP.
+
+	  This setting can be overridden at runtime via the
+	  workqueue.panic_on_stall kernel parameter.
+
 config WQ_CPU_INTENSIVE_REPORT
 	bool "Report per-cpu work items which hog CPU for too long"
 	depends on DEBUG_KERNEL