block: correct locking order for protecting blk-wbt parameters

The commit '245618f8e45f ("block: protect wbt_lat_usec using q->
elevator_lock")' introduced q->elevator_lock to protect updates
to blk-wbt parameters when writing to the sysfs attribute wbt_
lat_usec and the cgroup attribute io.cost.qos. However, both
these attributes also acquire q->rq_qos_mutex, leading to the
following lockdep warning:

======================================================
WARNING: possible circular locking dependency detected
6.14.0-rc5+ #138 Not tainted
------------------------------------------------------
bash/5902 is trying to acquire lock:
c000000085d495a0 (&q->rq_qos_mutex){+.+.}-{4:4}, at: wbt_init+0x164/0x238

but task is already holding lock:
c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock+0xf0/0xa58
ioc_qos_write+0x16c/0x85c
cgroup_file_write+0xc4/0x32c
kernfs_fop_write_iter+0x1b8/0x29c
vfs_write+0x410/0x584
ksys_write+0x84/0x140
system_call_exception+0x134/0x360
system_call_vectored_common+0x15c/0x2ec

-> #0 (&q->rq_qos_mutex){+.+.}-{4:4}:
__lock_acquire+0x1b6c/0x2ae0
lock_acquire+0x140/0x430
__mutex_lock+0xf0/0xa58
wbt_init+0x164/0x238
queue_wb_lat_store+0x1dc/0x20c
queue_attr_store+0x12c/0x164
sysfs_kf_write+0x6c/0xb0
kernfs_fop_write_iter+0x1b8/0x29c
vfs_write+0x410/0x584
ksys_write+0x84/0x140
system_call_exception+0x134/0x360
system_call_vectored_common+0x15c/0x2ec

other info that might help us debug this:

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&q->elevator_lock);
lock(&q->rq_qos_mutex);
lock(&q->elevator_lock);
lock(&q->rq_qos_mutex);

*** DEADLOCK ***

6 locks held by bash/5902:
#0: c000000051122400 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x84/0x140
#1: c00000007383f088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x174/0x29c
#2: c000000008550428 (kn->active#182){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x180/0x29c
#3: c000000085d493a8 (&q->q_usage_counter(io)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
#4: c000000085d493e0 (&q->q_usage_counter(queue)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
#5: c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c

stack backtrace:
CPU: 17 UID: 0 PID: 5902 Comm: bash Kdump: loaded Not tainted 6.14.0-rc5+ #138
Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
Call Trace:
[c0000000721ef590] [c00000000118f8a8] dump_stack_lvl+0x108/0x18c (unreliable)
[c0000000721ef5c0] [c00000000022563c] print_circular_bug+0x448/0x604
[c0000000721ef670] [c000000000225a44] check_noncircular+0x24c/0x26c
[c0000000721ef740] [c00000000022bf28] __lock_acquire+0x1b6c/0x2ae0
[c0000000721ef870] [c000000000229240] lock_acquire+0x140/0x430
[c0000000721ef970] [c0000000011cfbec] __mutex_lock+0xf0/0xa58
[c0000000721efaa0] [c00000000096c46c] wbt_init+0x164/0x238
[c0000000721efaf0] [c0000000008f8cd8] queue_wb_lat_store+0x1dc/0x20c
[c0000000721efb50] [c0000000008f8fa0] queue_attr_store+0x12c/0x164
[c0000000721efc60] [c0000000007c11cc] sysfs_kf_write+0x6c/0xb0
[c0000000721efca0] [c0000000007bfa4c] kernfs_fop_write_iter+0x1b8/0x29c
[c0000000721efcf0] [c0000000006a281c] vfs_write+0x410/0x584
[c0000000721efdc0] [c0000000006a2cc8] ksys_write+0x84/0x140
[c0000000721efe10] [c000000000031b64] system_call_exception+0x134/0x360
[c0000000721efe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec

>From the above log it's apparent that method which writes to sysfs attr
wbt_lat_usec acquires q->elevator_lock first, and then acquires q->rq_
qos_mutex. However the another method which writes to io.cost.qos,
acquires q->rq_qos_mutex first, and then acquires q->rq_qos_mutex. So
this could potentially cause the deadlock.

A closer look at ioc_qos_write shows that correcting the lock order is
non-trivial because q->rq_qos_mutex is acquired in blkg_conf_open_bdev
and released in blkg_conf_exit. The function blkg_conf_open_bdev is
responsible for parsing user input and finding the corresponding block
device (bdev) from the user provided major:minor number.

Since we do not know the bdev until blkg_conf_open_bdev completes, we
cannot simply move q->elevator_lock acquisition before blkg_conf_open_
bdev. So to address this, we intoduce new helpers blkg_conf_open_bdev_
frozen and blkg_conf_exit_frozen which are just wrappers around blkg_
conf_open_bdev and blkg_conf_exit respectively. The helper blkg_conf_
open_bdev_frozen is similar to blkg_conf_open_bdev, but additionally
freezes the queue, acquires q->elevator_lock and ensures the correct
locking order is followed between q->elevator_lock and q->rq_qos_mutex.
Similarly another helper blkg_conf_exit_frozen in addition to unfreezing
the queue ensures that we release the locks in correct order.

By using these helpers, now we maintain the same locking order in all
code paths where we update blk-wbt parameters.

Fixes: 245618f8e45f ("block: protect wbt_lat_usec using q->elevator_lock")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202503171650.cc082b66-lkp@intel.com
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250319105518.468941-3-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

authored by

Nilay Shroff and committed by

Jens Axboe 1 year ago 9730763f 89ed5fa3

+58 -13

3 changed files

expand all

block

blk-cgroup.c

blk-cgroup.h

blk-iocost.c

+51

block/blk-cgroup.c

··· 815 815 ctx->bdev = bdev; 816 816 return 0; 817 817 } 818 + /* 819 + * Similar to blkg_conf_open_bdev, but additionally freezes the queue, 820 + * acquires q->elevator_lock, and ensures the correct locking order 821 + * between q->elevator_lock and q->rq_qos_mutex. 822 + * 823 + * This function returns negative error on failure. On success it returns 824 + * memflags which must be saved and later passed to blkg_conf_exit_frozen 825 + * for restoring the memalloc scope. 826 + */ 827 + unsigned long __must_check blkg_conf_open_bdev_frozen(struct blkg_conf_ctx *ctx) 828 + { 829 + int ret; 830 + unsigned long memflags; 831 + 832 + if (ctx->bdev) 833 + return -EINVAL; 834 + 835 + ret = blkg_conf_open_bdev(ctx); 836 + if (ret < 0) 837 + return ret; 838 + /* 839 + * At this point, we haven’t started protecting anything related to QoS, 840 + * so we release q->rq_qos_mutex here, which was first acquired in blkg_ 841 + * conf_open_bdev. Later, we re-acquire q->rq_qos_mutex after freezing 842 + * the queue and acquiring q->elevator_lock to maintain the correct 843 + * locking order. 844 + */ 845 + mutex_unlock(&ctx->bdev->bd_queue->rq_qos_mutex); 846 + 847 + memflags = blk_mq_freeze_queue(ctx->bdev->bd_queue); 848 + mutex_lock(&ctx->bdev->bd_queue->elevator_lock); 849 + mutex_lock(&ctx->bdev->bd_queue->rq_qos_mutex); 850 + 851 + return memflags; 852 + } 818 853 819 854 /** 820 855 * blkg_conf_prep - parse and prepare for per-blkg config update ··· 1005 970 } 1006 971 } 1007 972 EXPORT_SYMBOL_GPL(blkg_conf_exit); 973 + 974 + /* 975 + * Similar to blkg_conf_exit, but also unfreezes the queue and releases 976 + * q->elevator_lock. Should be used when blkg_conf_open_bdev_frozen 977 + * is used to open the bdev. 978 + */ 979 + void blkg_conf_exit_frozen(struct blkg_conf_ctx *ctx, unsigned long memflags) 980 + { 981 + if (ctx->bdev) { 982 + struct request_queue *q = ctx->bdev->bd_queue; 983 + 984 + blkg_conf_exit(ctx); 985 + mutex_unlock(&q->elevator_lock); 986 + blk_mq_unfreeze_queue(q, memflags); 987 + } 988 + } 1008 989 1009 990 static void blkg_iostat_add(struct blkg_iostat *dst, struct blkg_iostat *src) 1010 991 {

block/blk-cgroup.h

··· 219 219 220 220 void blkg_conf_init(struct blkg_conf_ctx *ctx, char *input); 221 221 int blkg_conf_open_bdev(struct blkg_conf_ctx *ctx); 222 + unsigned long blkg_conf_open_bdev_frozen(struct blkg_conf_ctx *ctx); 222 223 int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, 223 224 struct blkg_conf_ctx *ctx); 224 225 void blkg_conf_exit(struct blkg_conf_ctx *ctx); 226 + void blkg_conf_exit_frozen(struct blkg_conf_ctx *ctx, unsigned long memflags); 225 227 226 228 /** 227 229 * bio_issue_as_root_blkg - see if this bio needs to be issued as root blkg

+5 -13

block/blk-iocost.c

··· 3223 3223 u32 qos[NR_QOS_PARAMS]; 3224 3224 bool enable, user; 3225 3225 char *body, *p; 3226 - unsigned int memflags; 3226 + unsigned long memflags; 3227 3227 int ret; 3228 3228 3229 3229 blkg_conf_init(&ctx, input); 3230 3230 3231 - ret = blkg_conf_open_bdev(&ctx); 3232 - if (ret) 3231 + memflags = blkg_conf_open_bdev_frozen(&ctx); 3232 + if (IS_ERR_VALUE(memflags)) 3233 3233 goto err; 3234 3234 3235 3235 body = ctx.body; ··· 3247 3247 ioc = q_to_ioc(disk->queue); 3248 3248 } 3249 3249 3250 - memflags = blk_mq_freeze_queue(disk->queue); 3251 - mutex_lock(&disk->queue->elevator_lock); 3252 3250 blk_mq_quiesce_queue(disk->queue); 3253 3251 3254 3252 spin_lock_irq(&ioc->lock); ··· 3346 3348 wbt_enable_default(disk); 3347 3349 3348 3350 blk_mq_unquiesce_queue(disk->queue); 3349 - mutex_unlock(&disk->queue->elevator_lock); 3350 - blk_mq_unfreeze_queue(disk->queue, memflags); 3351 3351 3352 - blkg_conf_exit(&ctx); 3352 + blkg_conf_exit_frozen(&ctx, memflags); 3353 3353 return nbytes; 3354 3354 einval: 3355 3355 spin_unlock_irq(&ioc->lock); 3356 - 3357 3356 blk_mq_unquiesce_queue(disk->queue); 3358 - mutex_unlock(&disk->queue->elevator_lock); 3359 - blk_mq_unfreeze_queue(disk->queue, memflags); 3360 - 3361 3357 ret = -EINVAL; 3362 3358 err: 3363 - blkg_conf_exit(&ctx); 3359 + blkg_conf_exit_frozen(&ctx, memflags); 3364 3360 return ret; 3365 3361 } 3366 3362

Configure Feed

Configure Feed