The NVMe Base Specification 8.3.5.5.9 states that the session key Ks
shall be computed from the ephemeral DH key by applying the hash
function selected by the HashID parameter.
The current implementation stores the raw DH shared secret as the
session key without hashing it. This causes redundant hash operations:
1. Augmented challenge computation (section 8.3.5.5.4) requires
Ca = HMAC(H(g^xy mod p), C). The code compensates by hashing the
unhashed session key in nvme_auth_augmented_challenge() to produce
the correct result.
2. PSK generation (section 8.3.5.5.9) requires PSK = HMAC(Ks, C1 || C2)
where Ks should already be H(g^xy mod p). As the DH shared secret
is always larger than the HMAC block size, HMAC internally hashes
it before use, accidentally producing the correct result.
When using secure channel concatenation with bidirectional
authentication, this results in hashing the DH value three times: twice
for augmented challenge calculations and once during PSK generation.
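The "accidentally producing the correct result" behavior in point 2 follows from RFC 2104: when an HMAC key exceeds the hash block size, HMAC first replaces the key with H(key). A userspace sketch (Python standing in for the kernel crypto API; the byte values are made up) shows why HMAC keyed with the raw secret and HMAC keyed with Ks = H(g^xy mod p) agree:

```python
import hashlib
import hmac

# Stand-in for the raw DH shared secret: 256 bytes, well over
# SHA-256's 64-byte block size, like a 2048-bit g^xy mod p value.
raw_secret = bytes(range(256))
challenge = b"C1" + b"C2"

# HMAC keyed with the oversized raw secret: per RFC 2104, the key is
# first replaced by H(key) because it exceeds the block size.
psk_from_raw = hmac.new(raw_secret, challenge, hashlib.sha256).digest()

# HMAC keyed with Ks = H(g^xy mod p), as the spec actually requires.
ks = hashlib.sha256(raw_secret).digest()
psk_from_ks = hmac.new(ks, challenge, hashlib.sha256).digest()

# The two agree only because of HMAC's oversized-key rule.
assert psk_from_raw == psk_from_ks
```

Hashing Ks once up front makes the correct result explicit instead of relying on this implicit key pre-hashing.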
Fix this by:
- Modifying nvme_auth_gen_shared_secret() to hash the DH shared secret
once after computation: Ks = H(g^xy mod p)
- Removing the hash operation from nvme_auth_augmented_challenge()
as the session key is now already hashed
- Updating session key buffer size from DH key size to hash output size
- Adding specification references in comments
This avoids storing the raw DH shared secret and reduces the number of
hash operations from three to one when using secure channel
concatenation.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Chris Leech <cleech@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
We can batch admin commands submitted through io_uring_cmd passthrough,
which means bd->last may be false, skipping the doorbell write so that
multiple commands are aggregated per write. If a subsequent command
can't be dispatched for whatever reason, we have to provide the blk-mq
ops' commit_rqs callback in order to ensure we properly update the
doorbell.
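The contract being relied on can be sketched as a toy model (illustrative Python, not driver code; blk-mq and the NVMe doorbell are reduced to a list and a counter):

```python
class DoorbellModel:
    """Toy model of blk-mq doorbell batching (not real driver code).

    queue_rq() appends a command to the submission queue and rings the
    doorbell only when `last` is True. If a dispatch run ends without a
    last=True submission, blk-mq invokes commit_rqs() so the queued
    commands are not stranded without a doorbell write.
    """

    def __init__(self):
        self.sq = []          # commands written to the submission queue
        self.doorbell = 0     # last value written to the doorbell

    def queue_rq(self, cmd, last):
        self.sq.append(cmd)
        if last:              # batch boundary: one ring for the batch
            self.doorbell = len(self.sq)

    def commit_rqs(self):
        # Called by blk-mq when dispatch stops before last=True.
        self.doorbell = len(self.sq)


q = DoorbellModel()
q.queue_rq("admin-cmd-1", last=False)
q.queue_rq("admin-cmd-2", last=False)
# Without a commit_rqs callback, both queued commands would sit in the
# submission queue with no doorbell write.
assert q.doorbell == 0
q.commit_rqs()
assert q.doorbell == 2
```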
Fixes: 58e5bdeb9c2b ("nvme: enable uring-passthrough for admin commands")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Section 8.3.4.5.5 of the NVMe Base Specification 2.1 describes what is
included in the Response Value (RVAL) hash, and SC_C should be among
those inputs. Currently we are hardcoding 0 instead of using the
correct SC_C value.
Update the host and target code to use the SC_C value when calculating
the RVAL instead of hardcoding 0.
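The consequence of hardcoding 0 can be illustrated with a toy stand-in for the RVAL computation (Python; the real calculation hashes more fields than shown here, only the SC_C input matters for this sketch):

```python
import hashlib

def rval(challenge: bytes, sc_c: int) -> bytes:
    # Toy stand-in for the RVAL hash: the real computation (NVMe Base
    # Spec 2.1, section 8.3.4.5.5) covers additional fields; only the
    # SC_C byte is modeled here.
    return hashlib.sha256(challenge + bytes([sc_c])).digest()

challenge = b"\x01" * 32
# Hardcoding 0 only matches when no secure channel concatenation was
# requested; with a nonzero SC_C the two ends compute different RVALs
# and authentication fails.
assert rval(challenge, 0) != rval(challenge, 1)
```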
Fixes: e88a7595b57f2 ("nvme-tcp: request secure channel concatenation")
Reviewed-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
When a controller reset is triggered via sysfs (by writing to
/sys/class/nvme/<nvmedev>/reset_controller), the reset work tears down
and re-establishes all queues. The socket release using fput() defers
the actual cleanup to task_work delayed_fput workqueue. This deferred
cleanup can race with the subsequent queue re-allocation during reset,
potentially leading to use-after-free or resource conflicts.
Replace fput() with __fput_sync() to ensure synchronous socket release,
guaranteeing that all socket resources are fully cleaned up before the
function returns. This prevents races during controller reset where
new queue setup may begin before the old socket is fully released.
* Call chain during reset:
nvme_reset_ctrl_work()
-> nvme_tcp_teardown_ctrl()
-> nvme_tcp_teardown_io_queues()
-> nvme_tcp_free_io_queues()
-> nvme_tcp_free_queue() <-- fput() -> __fput_sync()
-> nvme_tcp_teardown_admin_queue()
-> nvme_tcp_free_admin_queue()
-> nvme_tcp_free_queue() <-- fput() -> __fput_sync()
-> nvme_tcp_setup_ctrl() <-- race with deferred fput
memalloc_noreclaim_save() sets PF_MEMALLOC which is intended for tasks
performing memory reclaim work that need reserve access. While PF_MEMALLOC
prevents the task from entering direct reclaim (causing __need_reclaim() to
return false), it does not strip __GFP_IO from gfp flags. The allocator can
therefore still trigger writeback I/O when __GFP_IO remains set, which is
unsafe when the caller holds block layer locks.
Switch to memalloc_noio_save() which sets PF_MEMALLOC_NOIO. This causes
current_gfp_context() to strip __GFP_IO|__GFP_FS from every allocation in
the scope, making it safe to allocate memory while holding elevator_lock and
set->srcu.
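The difference between the two scopes can be modeled after the kernel's current_gfp_context() (illustrative Python with made-up flag values, not the kernel's actual bit assignments):

```python
# Illustrative flag values, not the kernel's actual bit assignments.
GFP_IO, GFP_FS = 0x1, 0x2
PF_MEMALLOC, PF_MEMALLOC_NOIO = 0x4, 0x8

def current_gfp_context(task_flags: int, gfp: int) -> int:
    """Toy model of the kernel's current_gfp_context().

    PF_MEMALLOC_NOIO strips __GFP_IO|__GFP_FS from every allocation in
    the scope; PF_MEMALLOC grants reserve access but leaves the gfp
    flags untouched.
    """
    if task_flags & PF_MEMALLOC_NOIO:
        gfp &= ~(GFP_IO | GFP_FS)
    return gfp

gfp = GFP_IO | GFP_FS
# memalloc_noreclaim_save() (PF_MEMALLOC): __GFP_IO survives, so the
# allocator may still issue writeback I/O under block layer locks.
assert current_gfp_context(PF_MEMALLOC, gfp) & GFP_IO
# memalloc_noio_save() (PF_MEMALLOC_NOIO): I/O and FS reclaim stripped.
assert current_gfp_context(PF_MEMALLOC_NOIO, gfp) == 0
```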
* The issue can be reproduced using blktests:
nvme_trtype=tcp ./check nvme/005
blktests (master) # nvme_trtype=tcp ./check nvme/005
nvme/005 (tr=tcp) (reset local loopback target) [failed]
runtime 0.725s ... 0.798s
something found in dmesg:
[ 108.473940] run blktests nvme/005 at 2025-11-22 16:12:20
[...]
...
(See '/root/blktests/results/nodev_tr_tcp/nvme/005.dmesg' for the entire message)
blktests (master) # cat /root/blktests/results/nodev_tr_tcp/nvme/005.dmesg
[ 108.473940] run blktests nvme/005 at 2025-11-22 16:12:20
[ 108.526983] loop0: detected capacity change from 0 to 2097152
[ 108.555606] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 108.572531] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 108.613061] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 108.616832] nvme nvme0: creating 48 I/O queues.
[ 108.630791] nvme nvme0: mapped 48/0/0 default/read/poll queues.
[ 108.661892] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 108.746639] nvmet: Created nvm controller 2 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 108.748466] nvme nvme0: creating 48 I/O queues.
[ 108.802984] nvme nvme0: mapped 48/0/0 default/read/poll queues.
[ 108.829983] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
[ 108.854288] block nvme0n1: no available path - failing I/O
[ 108.854344] block nvme0n1: no available path - failing I/O
[ 108.854373] Buffer I/O error on dev nvme0n1, logical block 1, async page read
[ 108.891693] ======================================================
[ 108.895912] WARNING: possible circular locking dependency detected
[ 108.900184] 6.17.0nvme+ #3 Tainted: G N
[ 108.903913] ------------------------------------------------------
[ 108.908171] nvme/2734 is trying to acquire lock:
[ 108.911957] ffff88810210e610 (set->srcu){.+.+}-{0:0}, at: __synchronize_srcu+0x17/0x170
[ 108.917587]
but task is already holding lock:
[ 108.921570] ffff88813abea198 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0xa8/0x1c0
[ 108.927361]
which lock already depends on the new lock.
[ 108.933018]
the existing dependency chain (in reverse order) is:
[ 108.938223]
-> #4 (&q->elevator_lock){+.+.}-{4:4}:
[ 108.942988] __mutex_lock+0xa2/0x1150
[ 108.945873] elevator_change+0xa8/0x1c0
[ 108.948925] elv_iosched_store+0xdf/0x140
[ 108.952043] kernfs_fop_write_iter+0x16a/0x220
[ 108.955367] vfs_write+0x378/0x520
[ 108.957598] ksys_write+0x67/0xe0
[ 108.959721] do_syscall_64+0x76/0xbb0
[ 108.962052] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 108.965145]
-> #3 (&q->q_usage_counter(io)){++++}-{0:0}:
[ 108.968923] blk_alloc_queue+0x30e/0x350
[ 108.972117] blk_mq_alloc_queue+0x61/0xd0
[ 108.974677] scsi_alloc_sdev+0x2a0/0x3e0
[ 108.977092] scsi_probe_and_add_lun+0x1bd/0x430
[ 108.979921] __scsi_add_device+0x109/0x120
[ 108.982504] ata_scsi_scan_host+0x97/0x1c0
[ 108.984365] async_run_entry_fn+0x2d/0x130
[ 108.986109] process_one_work+0x20e/0x630
[ 108.987830] worker_thread+0x184/0x330
[ 108.989473] kthread+0x10a/0x250
[ 108.990852] ret_from_fork+0x297/0x300
[ 108.992491] ret_from_fork_asm+0x1a/0x30
[ 108.994159]
-> #2 (fs_reclaim){+.+.}-{0:0}:
[ 108.996320] fs_reclaim_acquire+0x99/0xd0
[ 108.998058] kmem_cache_alloc_node_noprof+0x4e/0x3c0
[ 109.000123] __alloc_skb+0x15f/0x190
[ 109.002195] tcp_send_active_reset+0x3f/0x1e0
[ 109.004038] tcp_disconnect+0x50b/0x720
[ 109.005695] __tcp_close+0x2b8/0x4b0
[ 109.007227] tcp_close+0x20/0x80
[ 109.008663] inet_release+0x31/0x60
[ 109.010175] __sock_release+0x3a/0xc0
[ 109.011778] sock_close+0x14/0x20
[ 109.013263] __fput+0xee/0x2c0
[ 109.014673] delayed_fput+0x31/0x50
[ 109.016183] process_one_work+0x20e/0x630
[ 109.017897] worker_thread+0x184/0x330
[ 109.019543] kthread+0x10a/0x250
[ 109.020929] ret_from_fork+0x297/0x300
[ 109.022565] ret_from_fork_asm+0x1a/0x30
[ 109.024194]
-> #1 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
[ 109.026634] lock_sock_nested+0x2e/0x70
[ 109.028251] tcp_sendmsg+0x1a/0x40
[ 109.029783] sock_sendmsg+0xed/0x110
[ 109.031321] nvme_tcp_try_send_cmd_pdu+0x13e/0x260 [nvme_tcp]
[ 109.034263] nvme_tcp_try_send+0xb3/0x330 [nvme_tcp]
[ 109.036375] nvme_tcp_queue_rq+0x342/0x3d0 [nvme_tcp]
[ 109.038528] blk_mq_dispatch_rq_list+0x297/0x800
[ 109.040448] __blk_mq_sched_dispatch_requests+0x3db/0x5f0
[ 109.042677] blk_mq_sched_dispatch_requests+0x29/0x70
[ 109.044787] blk_mq_run_work_fn+0x76/0x1b0
[ 109.046535] process_one_work+0x20e/0x630
[ 109.048245] worker_thread+0x184/0x330
[ 109.049890] kthread+0x10a/0x250
[ 109.051331] ret_from_fork+0x297/0x300
[ 109.053024] ret_from_fork_asm+0x1a/0x30
[ 109.054740]
-> #0 (set->srcu){.+.+}-{0:0}:
[ 109.056850] __lock_acquire+0x1468/0x2210
[ 109.058614] lock_sync+0xa5/0x110
[ 109.060048] __synchronize_srcu+0x49/0x170
[ 109.061802] elevator_switch+0xc9/0x330
[ 109.063950] elevator_change+0x128/0x1c0
[ 109.065675] elevator_set_none+0x4c/0x90
[ 109.067316] blk_unregister_queue+0xa8/0x110
[ 109.069165] __del_gendisk+0x14e/0x3c0
[ 109.070824] del_gendisk+0x75/0xa0
[ 109.072328] nvme_ns_remove+0xf2/0x230 [nvme_core]
[ 109.074365] nvme_remove_namespaces+0xf2/0x150 [nvme_core]
[ 109.076652] nvme_do_delete_ctrl+0x71/0x90 [nvme_core]
[ 109.078775] nvme_delete_ctrl_sync+0x3b/0x50 [nvme_core]
[ 109.081009] nvme_sysfs_delete+0x34/0x40 [nvme_core]
[ 109.083082] kernfs_fop_write_iter+0x16a/0x220
[ 109.085009] vfs_write+0x378/0x520
[ 109.086539] ksys_write+0x67/0xe0
[ 109.087982] do_syscall_64+0x76/0xbb0
[ 109.089577] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 109.091665]
other info that might help us debug this:
[ 109.095478] Chain exists of:
set->srcu --> &q->q_usage_counter(io) --> &q->elevator_lock
[ 109.099544] Possible unsafe locking scenario:
[ 109.101708] CPU0 CPU1
[ 109.103402] ---- ----
[ 109.105103] lock(&q->elevator_lock);
[ 109.106530] lock(&q->q_usage_counter(io));
[ 109.109022] lock(&q->elevator_lock);
[ 109.111391] sync(set->srcu);
[ 109.112586]
*** DEADLOCK ***
[ 109.114772] 5 locks held by nvme/2734:
[ 109.116189] #0: ffff888101925410 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x67/0xe0
[ 109.119143] #1: ffff88817a914e88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x10f/0x220
[ 109.123141] #2: ffff8881046313f8 (kn->active#185){++++}-{0:0}, at: sysfs_remove_file_self+0x26/0x50
[ 109.126543] #3: ffff88810470e1d0 (&set->update_nr_hwq_lock){++++}-{4:4}, at: del_gendisk+0x6d/0xa0
[ 109.129891] #4: ffff88813abea198 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0xa8/0x1c0
[ 109.133149]
stack backtrace:
[ 109.134817] CPU: 6 UID: 0 PID: 2734 Comm: nvme Tainted: G N 6.17.0nvme+ #3 PREEMPT(voluntary)
[ 109.134819] Tainted: [N]=TEST
[ 109.134820] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 109.134821] Call Trace:
[ 109.134823] <TASK>
[ 109.134824] dump_stack_lvl+0x75/0xb0
[ 109.134828] print_circular_bug+0x26a/0x330
[ 109.134831] check_noncircular+0x12f/0x150
[ 109.134834] __lock_acquire+0x1468/0x2210
[ 109.134837] ? __synchronize_srcu+0x17/0x170
[ 109.134838] lock_sync+0xa5/0x110
[ 109.134840] ? __synchronize_srcu+0x17/0x170
[ 109.134842] __synchronize_srcu+0x49/0x170
[ 109.134843] ? mark_held_locks+0x49/0x80
[ 109.134845] ? _raw_spin_unlock_irqrestore+0x2d/0x60
[ 109.134847] ? kvm_clock_get_cycles+0x14/0x30
[ 109.134853] ? ktime_get_mono_fast_ns+0x36/0xb0
[ 109.134858] elevator_switch+0xc9/0x330
[ 109.134860] elevator_change+0x128/0x1c0
[ 109.134862] ? kernfs_put.part.0+0x86/0x290
[ 109.134864] elevator_set_none+0x4c/0x90
[ 109.134866] blk_unregister_queue+0xa8/0x110
[ 109.134868] __del_gendisk+0x14e/0x3c0
[ 109.134870] del_gendisk+0x75/0xa0
[ 109.134872] nvme_ns_remove+0xf2/0x230 [nvme_core]
[ 109.134879] nvme_remove_namespaces+0xf2/0x150 [nvme_core]
[ 109.134887] nvme_do_delete_ctrl+0x71/0x90 [nvme_core]
[ 109.134893] nvme_delete_ctrl_sync+0x3b/0x50 [nvme_core]
[ 109.134899] nvme_sysfs_delete+0x34/0x40 [nvme_core]
[ 109.134905] kernfs_fop_write_iter+0x16a/0x220
[ 109.134908] vfs_write+0x378/0x520
[ 109.134911] ksys_write+0x67/0xe0
[ 109.134913] do_syscall_64+0x76/0xbb0
[ 109.134915] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 109.134916] RIP: 0033:0x7fd68a737317
[ 109.134917] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 109.134919] RSP: 002b:00007ffded1546d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 109.134920] RAX: ffffffffffffffda RBX: 000000000054f7e0 RCX: 00007fd68a737317
[ 109.134921] RDX: 0000000000000001 RSI: 00007fd68a855719 RDI: 0000000000000003
[ 109.134921] RBP: 0000000000000003 R08: 0000000030407850 R09: 00007fd68a7cd4e0
[ 109.134922] R10: 00007fd68a65b130 R11: 0000000000000246 R12: 00007fd68a855719
[ 109.134923] R13: 00000000304074c0 R14: 00000000304074c0 R15: 0000000030408660
[ 109.134926] </TASK>
[ 109.962756] Key type psk unregistered
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Currently after the host sends a REPLACETLSPSK we free the TLS keys as
part of calling nvmet_auth_sq_free() on success. This means when the
host sends a follow up REPLACETLSPSK we return CONCAT_MISMATCH as the
check for !nvmet_queue_tls_keyid(req->sq) fails.
A previous attempt to fix this involved not calling nvmet_auth_sq_free()
on successful connections, but that results in memory leaks. Instead we
should not clear `tls_key` in nvmet_auth_sq_free(), as that was
incorrectly wiping the TLS keys which are used for the session.
This patch ensures we correctly free the ephemeral session key on
connection, yet we don't free the TLS key unless closing the connection.
Reviewed-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
In an attempt to fix REPLACETLSPSK we stopped freeing the secrets on
successful connections. This resulted in memory leaks in the kernel, so
let's revert the commit. An improved fix is being developed that just
avoids clearing the tls_key variable.
This reverts commit 2e6eb6b277f593b98f151ea8eff1beb558bbea3b.
Closes: https://lore.kernel.org/linux-nvme/CAHj4cs-u3MWQR4idywptMfjEYi4YwObWFx4KVib35dZ5HMBDdw@mail.gmail.com
Reviewed-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The command was never dispatched in the driver's "host path error"
case, so it was never actually initialized and there's no corresponding
submit trace for the completion.
Reported-by: Minsik Jeon <hmi.jeon@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The Memblaze Pblaze5 NVMe device (PCI ID 0x1c5f:0x0555)
is detected as a controller on recent kernels (tested on 5.15.85
and 6.8.4), but no namespace is exposed.
Tools like lsblk and fdisk do not report any block device.
dmesg shows:
nvme nvme0: missing or invalid SUBNQN field.
The device works correctly on older kernels (e.g. 4.19), suggesting
a compatibility issue with newer namespace handling.
This indicates the device does not properly support the
Namespace Descriptor List feature.
Applying NVME_QUIRK_NO_NS_DESC_LIST allows the namespace to be
discovered correctly.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Tao Jiang <tanroame.kyle@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The delayed disk removal work is canceled when a NS (re)appears. However,
we do not put the module reference grabbed in nvme_mpath_remove_disk(), so
fix that.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
It is not possible to determine the active TLS mode from the
presence or absence of sysfs attributes like tls_key,
tls_configured_key, or dhchap_secret.
With the introduction of the concat mode and optional DH-CHAP
authentication, different configurations can result in identical
sysfs state. This makes user space detection unreliable.
Expose the TLS mode explicitly to allow user space to
unambiguously identify the active configuration and avoid
fragile heuristics in nvme-cli.
Reviewed-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Commit 03b3bcd319b3 ("nvme: fix admin request_queue lifetime") moved the
admin queue reference ->put call into nvme_free_ctrl() - a controller
device release callback performed for every nvme driver doing
nvme_init_ctrl().
nvme-apple sets refcount of the admin queue to 1 at allocation during the
probe function and then puts it twice now:
nvme_free_ctrl()
blk_put_queue(ctrl->admin_q) // #1
->free_ctrl()
apple_nvme_free_ctrl()
blk_put_queue(anv->ctrl.admin_q) // #2
Note that there is a commit 941f7298c70c ("nvme-apple: remove an extra
queue reference") which intended to drop taking an extra admin queue
reference. It appears that, at that moment, it accidentally fixed a
refcount leak which had existed since the driver's introduction: there
were two ->get calls in the driver's probe function but only a single
->put inside apple_nvme_free_ctrl().
However now after commit 03b3bcd319b3 ("nvme: fix admin request_queue
lifetime") the refcount is imbalanced again. Fix it by removing extra
->put call from apple_nvme_free_ctrl(). anv->dev and ctrl->dev point to
the same device, so use ctrl->dev directly for simplification. Compile
tested only.
Found by Linux Verification Center (linuxtesting.org).
Fixes: 03b3bcd319b3 ("nvme: fix admin request_queue lifetime")
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Signed-off-by: Keith Busch <kbusch@kernel.org>
In the declaration of the "core_quirks[]" structure, the comment
referring to the Kioxia CD6-V Series / HPE PE8030 devices reports the
"default_ps_max_latency_us" parameter incorrectly:
nvme_core.default_ps_max_latency=0
The correct form is, instead:
nvme_core.default_ps_max_latency_us=0
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Flavio Suligoi <f.suligoi@asem.it>
Signed-off-by: Keith Busch <kbusch@kernel.org>
nvmet_tcp_release_queue_work() runs on nvmet-wq and can drop the
final controller reference through nvmet_cq_put(). If that triggers
nvmet_ctrl_free(), the teardown path flushes ctrl->async_event_work on
the same nvmet-wq.
Call chain:
nvmet_tcp_schedule_release_queue()
kref_put(&queue->kref, nvmet_tcp_release_queue)
nvmet_tcp_release_queue()
queue_work(nvmet_wq, &queue->release_work) <--- nvmet_wq
process_one_work()
nvmet_tcp_release_queue_work()
nvmet_cq_put(&queue->nvme_cq)
nvmet_cq_destroy()
nvmet_ctrl_put(cq->ctrl)
nvmet_ctrl_free()
flush_work(&ctrl->async_event_work) <--- nvmet_wq
Previously scheduled by:
nvmet_add_async_event
queue_work(nvmet_wq, &ctrl->async_event_work);
This trips lockdep with a possible recursive locking warning.
[ 5223.015876] run blktests nvme/003 at 2026-04-07 20:53:55
[ 5223.061801] loop0: detected capacity change from 0 to 2097152
[ 5223.072206] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 5223.088368] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 5223.126086] nvmet: Created discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 5223.128453] nvme nvme1: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 5233.199447] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[ 5233.227718] ============================================
[ 5233.231283] WARNING: possible recursive locking detected
[ 5233.234696] 7.0.0-rc3nvme+ #20 Tainted: G O N
[ 5233.238434] --------------------------------------------
[ 5233.241852] kworker/u192:6/2413 is trying to acquire lock:
[ 5233.245429] ffff888111632548 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x26/0x90
[ 5233.251438]
but task is already holding lock:
[ 5233.255254] ffff888111632548 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x5cc/0x6e0
[ 5233.261125]
other info that might help us debug this:
[ 5233.265333] Possible unsafe locking scenario:
[ 5233.269217] CPU0
[ 5233.270795] ----
[ 5233.272436] lock((wq_completion)nvmet-wq);
[ 5233.275241] lock((wq_completion)nvmet-wq);
[ 5233.278020]
*** DEADLOCK ***
[ 5233.281793] May be due to missing lock nesting notation
[ 5233.286195] 3 locks held by kworker/u192:6/2413:
[ 5233.289192] #0: ffff888111632548 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x5cc/0x6e0
[ 5233.294569] #1: ffffc9000e2a7e40 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x1c5/0x6e0
[ 5233.300128] #2: ffffffff82d7dc40 (rcu_read_lock){....}-{1:3}, at: __flush_work+0x62/0x530
[ 5233.304290]
stack backtrace:
[ 5233.306520] CPU: 4 UID: 0 PID: 2413 Comm: kworker/u192:6 Tainted: G O N 7.0.0-rc3nvme+ #20 PREEMPT(full)
[ 5233.306524] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 5233.306525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[ 5233.306527] Workqueue: nvmet-wq nvmet_tcp_release_queue_work [nvmet_tcp]
[ 5233.306532] Call Trace:
[ 5233.306534] <TASK>
[ 5233.306536] dump_stack_lvl+0x73/0xb0
[ 5233.306552] print_deadlock_bug+0x225/0x2f0
[ 5233.306556] __lock_acquire+0x13f0/0x2290
[ 5233.306563] lock_acquire+0xd0/0x300
[ 5233.306565] ? touch_wq_lockdep_map+0x26/0x90
[ 5233.306571] ? __flush_work+0x20b/0x530
[ 5233.306573] ? touch_wq_lockdep_map+0x26/0x90
[ 5233.306577] touch_wq_lockdep_map+0x3b/0x90
[ 5233.306580] ? touch_wq_lockdep_map+0x26/0x90
[ 5233.306583] ? __flush_work+0x20b/0x530
[ 5233.306585] __flush_work+0x268/0x530
[ 5233.306588] ? __pfx_wq_barrier_func+0x10/0x10
[ 5233.306594] ? xen_error_entry+0x30/0x60
[ 5233.306600] nvmet_ctrl_free+0x140/0x310 [nvmet]
[ 5233.306617] nvmet_cq_put+0x74/0x90 [nvmet]
[ 5233.306629] nvmet_tcp_release_queue_work+0x19f/0x360 [nvmet_tcp]
[ 5233.306634] process_one_work+0x206/0x6e0
[ 5233.306640] worker_thread+0x184/0x320
[ 5233.306643] ? __pfx_worker_thread+0x10/0x10
[ 5233.306646] kthread+0xf1/0x130
[ 5233.306648] ? __pfx_kthread+0x10/0x10
[ 5233.306651] ret_from_fork+0x355/0x450
[ 5233.306653] ? __pfx_kthread+0x10/0x10
[ 5233.306656] ret_from_fork_asm+0x1a/0x30
[ 5233.306664] </TASK>
There is also no need to flush async_event_work from controller
teardown. The admin queue teardown already fails outstanding AER
requests before the final controller put:
nvmet_sq_destroy(admin sq)
nvmet_async_events_failall(ctrl)
The controller has already been removed from the subsystem list before
nvmet_ctrl_free() quiesces outstanding work.
Replace flush_work() with cancel_work_sync() so a pending
async_event_work item is canceled and a running instance is waited on
without recursing into the same workqueue.
Fixes: 06406d81a2d7 ("nvmet: cancel fatal error and flush async work before free controller")
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
An NS will always have a head pointer, so drop the check. As evidence
in practice, all the nvme_mpath_clear_current_path() callers also
dereference ns->head.
This check has endured since the original changes to support multipath.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The firmware for Samsung 970 Evo Plus / PM981 / PM983 does not support SUBNQN.
Add a quirk to suppress the warning.
# nvme id-ctrl /dev/nvme1n1
NVME Identify Controller:
vid : 0x144d
ssvid : 0x144d
sn : ***
mn : Samsung SSD 970 EVO Plus 500GB
fr : 2B2QEXM7
mcdqpc : 0
subnqn :
ioccsz : 0
Signed-off-by: Alan Cui <me@alancui.cc>
Signed-off-by: Keith Busch <kbusch@kernel.org>
nvmet_tcp_handle_icreq() updates queue->state after sending an
Initialization Connection Response (ICResp), but it does so without
serializing against target-side queue teardown.
If an NVMe/TCP host sends an Initialization Connection Request
(ICReq) and immediately closes the connection, target-side teardown
may start in softirq context before io_work drains the already
buffered ICReq. In that case, nvmet_tcp_schedule_release_queue()
sets queue->state to NVMET_TCP_Q_DISCONNECTING and drops the queue
reference under state_lock.
If io_work later processes that ICReq, nvmet_tcp_handle_icreq() can
still overwrite the state back to NVMET_TCP_Q_LIVE. That defeats the
DISCONNECTING-state guard in nvmet_tcp_schedule_release_queue() and
allows a later socket state change to re-enter teardown and issue a
second kref_put() on an already released queue.
The ICResp send failure path has the same problem. If teardown has
already moved the queue to DISCONNECTING, a send error can still
overwrite the state with NVMET_TCP_Q_FAILED, again reopening the
window for a second teardown path to drop the queue reference.
Fix this by serializing both post-send state transitions with
state_lock and bailing out if teardown has already started.
Use -ESHUTDOWN as an internal sentinel for that bail-out path rather
than propagating it as a transport error like -ECONNRESET. Keep
nvmet_tcp_socket_error() setting rcv_state to NVMET_TCP_RECV_ERR before
honoring that sentinel so receive-side parsing stays quiesced until the
existing release path completes.
Fixes: c46a6465bac2 ("nvmet-tcp: add NVMe over TCP target driver")
Cc: stable@vger.kernel.org
Reported-by: Shivam Kumar <skumar47@syr.edu>
Tested-by: Shivam Kumar <kumar.shivam43666@gmail.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Executing nvmet_tcp_fatal_error() is generally the responsibility
of the caller (nvmet_tcp_try_recv); all other functions should
just return the error code.
Remove the nvmet_tcp_fatal_error() function, it's not needed
anymore.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Currently, when nvmet_tcp_build_pdu_iovec() detects an out-of-bounds
PDU length or offset, it triggers nvmet_tcp_fatal_error(cmd->queue)
and returns early. However, because the function returns void, the
callers are entirely unaware that a fatal error has occurred and
that the cmd->recv_msg.msg_iter was left uninitialized.
Callers such as nvmet_tcp_handle_h2c_data_pdu() proceed to blindly
overwrite the queue state with queue->rcv_state = NVMET_TCP_RECV_DATA
Consequently, the socket receiving loop may attempt to read incoming
network data into the uninitialized iterator.
Fix this by shifting the error handling responsibility to the callers.
Fixes: 52a0a9854934 ("nvmet-tcp: add bounds checks in nvmet_tcp_build_pdu_iovec")
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yunje Shin <ioerts@kookmin.ac.kr>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Enable BLK_FEAT_PCI_P2PDMA on the NVMe device when the underlying
RDMA controller supports it.
Suggested-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Henrique Carvalho <henrique.carvalho@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shivaji Kant <shivajikant@google.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Using this port configuration, one will be able to set the Maximum Data
Transfer Size (MDTS) for any controller that will be associated to the
configured port. The default value remains 0 (no limit).
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Aurelien Aptel <aaptel@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The generic fabrics layer uses request_module("nvme-%s", opts->transport)
to auto-load transport modules. Currently, the nvme-tcp, nvme-rdma, and
nvme-fc modules lack MODULE_ALIAS entries for these names, which prevents
the kernel from automatically finding and loading them when requested.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Signed-off-by: Keith Busch <kbusch@kernel.org>
In nvmet_tcp_try_recv_ddgst(), when a data digest mismatch is detected,
nvmet_req_uninit() is called unconditionally. However, if the command
arrived via the nvmet_tcp_handle_req_failure() path, nvmet_req_init()
had returned false and percpu_ref_tryget_live() was never executed. The
unconditional percpu_ref_put() inside nvmet_req_uninit() then causes a
refcount underflow, leading to a WARNING in
percpu_ref_switch_to_atomic_rcu, a use-after-free diagnostic, and
eventually a permanent workqueue deadlock.
Check cmd->flags & NVMET_TCP_F_INIT_FAILED before calling
nvmet_req_uninit(), matching the existing pattern in
nvmet_tcp_execute_request().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shivam Kumar <kumar.shivam43666@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
wbt_init_enable_default() uses WARN_ON_ONCE to check for failures from
wbt_alloc() and wbt_init(). However, both are expected failure paths:
- wbt_alloc() can return NULL under memory pressure (-ENOMEM)
- wbt_init() can fail with -EBUSY if wbt is already registered
syzbot triggers this by injecting memory allocation failures during MTD
partition creation via ioctl(BLKPG), causing a spurious warning.
wbt_init_enable_default() is a best-effort initialization called from
blk_register_queue() with a void return type. Failure simply means the
disk operates without writeback throttling, which is harmless.
Replace WARN_ON_ONCE with plain if-checks, consistent with how
wbt_set_lat() in the same file already handles these failures. Add a
pr_warn() for the wbt_init() failure to retain diagnostic information
without triggering a full stack trace.
Reported-by: syzbot+71fcf20f7c1e5043d78c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=71fcf20f7c1e5043d78c
Fixes: 41afaeeda509 ("blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter")
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://patch.msgid.link/20260316070358.65225-2-ytohnuki@amazon.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Before the fix, teardown of a ublk server that was attempting to recover
a device, but died when it had submitted a nonempty proper subset of the
fetch commands to any queue would loop forever. Add a test to verify
that, after the fix, teardown completes. This is done by:
- Adding a new argument to the fault_inject target that causes it to die
after fetching a nonempty proper subset of the IOs to a queue
- Using that argument in a new test while trying to recover an
already-created device
- Attempting to delete the ublk device at the end of the test; this
hangs forever if teardown from the fault-injected ublk server never
completed.
It was manually verified that the test passes with the fix and hangs
without it.
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260405-cancel-v2-2-02d711e643c2@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a ublk server starts recovering devices but dies before issuing fetch
commands for all IOs, cancellation of the fetch commands that were
successfully issued may never complete. This is because the per-IO
canceled flag can remain set even after the fetch for that IO has been
submitted - the per-IO canceled flags for all IOs in a queue are reset
together only once all IOs for that queue have been fetched. So if a
nonempty proper subset of the IOs for a queue are fetched when the ublk
server dies, the IOs in that subset will never successfully be canceled,
as their canceled flags remain set, and this prevents ublk_cancel_cmd
from actually calling io_uring_cmd_done on the commands, despite the
fact that they are outstanding.
Fix this by resetting the per-IO cancel flags immediately when each IO
is fetched instead of waiting for all IOs for the queue (which may never
happen).
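A minimal model of the fix, assuming only the per-IO flag semantics described above (field and function names are stand-ins for the ublk driver's internals):

```c
#include <assert.h>
#include <stdbool.h>

struct ublk_io {
	bool canceled;
	bool fetched;
};

/* The fix: reset the per-IO canceled flag as soon as this IO is
 * fetched, instead of waiting until every IO in the queue has been
 * fetched (which may never happen if the server dies). */
static void ublk_fetch(struct ublk_io *io)
{
	io->canceled = false;
	io->fetched = true;
}

/* Cancellation only completes the outstanding fetch command if the IO
 * is not already marked canceled; a stale flag skips the completion. */
static bool ublk_cancel_cmd(struct ublk_io *io)
{
	if (io->canceled)
		return false;	/* would skip io_uring_cmd_done() */
	io->canceled = true;
	return true;
}
```

Before the fix, an IO fetched with a stale canceled flag left over from earlier teardown could never be canceled; with the per-fetch reset, cancellation succeeds.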
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Fixes: 728cbac5fe21 ("ublk: move device reset into ublk_ch_release()")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: zhang, the-essence-of-life <zhangweize9@gmail.com>
Link: https://patch.msgid.link/20260405-cancel-v2-1-02d711e643c2@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This macro no longer has any users, so remove it.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://patch.msgid.link/20260403184852.2140919-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In the previous patch ("bcache: fix cached_dev.sb_bio use-after-free and
crash"), we adopted a simple modification suggestion from AI to fix the
use-after-free.
But in actual testing, we found an extreme case where the device is
stopped before calling bch_write_bdev_super().
At this point, struct closure sb_write has not been initialized yet.
This patch ensures that the sb_bio write has completed by taking
sb_write_mutex.
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@fnnas.com>
Link: https://patch.msgid.link/20260403042135.2221247-1-colyli@fnnas.com
Fixes: fec114a98b87 ("bcache: fix cached_dev.sb_bio use-after-free and crash")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In our production environment, we have received multiple crash reports
regarding libceph, which have caught our attention:
```
[6888366.280350] Call Trace:
[6888366.280452] blk_update_request+0x14e/0x370
[6888366.280561] blk_mq_end_request+0x1a/0x130
[6888366.280671] rbd_img_handle_request+0x1a0/0x1b0 [rbd]
[6888366.280792] rbd_obj_handle_request+0x32/0x40 [rbd]
[6888366.280903] __complete_request+0x22/0x70 [libceph]
[6888366.281032] osd_dispatch+0x15e/0xb40 [libceph]
[6888366.281164] ? inet_recvmsg+0x5b/0xd0
[6888366.281272] ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
[6888366.281405] ceph_con_process_message+0x79/0x140 [libceph]
[6888366.281534] ceph_con_v1_try_read+0x5d7/0xf30 [libceph]
[6888366.281661] ceph_con_workfn+0x329/0x680 [libceph]
```
After analyzing the coredump file, we found that the address of
dc->sb_bio has been freed. We know that cached_dev is only freed when it
is stopped.
Because sb_bio is embedded in struct cached_dev rather than allocated
separately for each write, stopping the device while a superblock write
is in flight means the endio handler accesses freed memory.
This patch waits for sb_write to complete in cached_dev_free().
It should be noted that after analyzing the cause of the problem, we
described all the details to QWEN and adopted the modifications it
produced.
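A single-threaded model of the invariant the fix establishes (the real code waits on a closure under sb_write_mutex; the names and mechanism here are illustrative stand-ins):

```c
#include <assert.h>
#include <stdbool.h>

/* cached_dev_free() must not tear the device down while an sb_bio
 * superblock write is still outstanding. */
struct cached_dev_model {
	bool sb_write_in_flight;
	bool freed;
};

static void bch_write_bdev_super(struct cached_dev_model *dc)
{
	dc->sb_write_in_flight = true;	/* bio submitted */
}

static void write_super_endio(struct cached_dev_model *dc)
{
	dc->sb_write_in_flight = false;	/* closure_put(&dc->sb_write) */
}

static void cached_dev_free(struct cached_dev_model *dc)
{
	/* in the kernel: wait for the sb_write closure under
	 * sb_write_mutex; here the invariant is simply checked */
	assert(!dc->sb_write_in_flight);
	dc->freed = true;
}
```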
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Fixes: cafe563591446 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Coly Li <colyli@fnnas.com>
Link: https://patch.msgid.link/20260322134102.480107-1-colyli@fnnas.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace sprintf() with sysfs_emit() in sysfs show functions.
sysfs_emit() is preferred for formatting sysfs output because it
provides safer bounds checking.
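A userspace approximation of the difference, modeling sysfs_emit() as a PAGE_SIZE-bounded snprintf() (the real helper also sanity-checks the buffer offset); the show() functions are hypothetical examples:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Model of sysfs_emit(): like sprintf() into a sysfs buffer, but
 * bounded to PAGE_SIZE so a show() callback cannot overrun the page. */
static int sysfs_emit_model(char *buf, const char *fmt, int v)
{
	return snprintf(buf, PAGE_SIZE, fmt, v);
}

/* A show() function before and after the conversion. */
static int show_old(char *buf, int v)
{
	return sprintf(buf, "%d\n", v);		/* unbounded */
}

static int show_new(char *buf, int v)
{
	return sysfs_emit_model(buf, "%d\n", v);/* bounded */
}
```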
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260402164958.894879-4-thorsten.blum@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix a simple naming issue in the documentation: the completion
routine is called bi_end_io and not bi_complete.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20260401135854.125109-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a bio is allocated from the mempool with REQ_ALLOC_CACHE set and
later completed, bio_put() places it into the per-cpu bio_alloc_cache
via bio_put_percpu_cache() instead of freeing it back to the
mempool/slab. The slab allocation remains tracked by kmemleak, but the
only reference to the bio is through the percpu cache's free_list,
which kmemleak fails to trace through percpu memory. This causes
kmemleak to report the cached bios as unreferenced objects.
Use symmetric kmemleak_free()/kmemleak_alloc() calls to properly track
bios across percpu cache transitions:
- bio_put_percpu_cache: call kmemleak_free() when a bio enters the
cache, unregistering it from kmemleak tracking.
- bio_alloc_percpu_cache: call kmemleak_alloc() when a bio is taken
from the cache for reuse, re-registering it so that genuine leaks
of reused bios remain detectable.
- __bio_alloc_cache_prune: call kmemleak_alloc() before bio_free() so
that kmem_cache_free()'s internal kmemleak_free() has a matching
allocation to pair with.
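The pairing can be modeled with a simple tracked-object counter (all names are stand-ins; real kmemleak tracks object addresses, not a count). The point is that every path into and out of the percpu cache keeps registrations balanced:

```c
#include <assert.h>

/* Model kmemleak as a counter of registered objects. */
static int tracked;

static void kmemleak_alloc_model(void) { tracked++; }
static void kmemleak_free_model(void)  { tracked--; }

/* bio enters the percpu cache: unregister so kmemleak does not report
 * it as an unreferenced object (it can't trace the percpu free_list). */
static void bio_put_percpu_cache(void)
{
	kmemleak_free_model();
}

/* bio leaves the cache for reuse: re-register so genuine leaks of
 * reused bios remain detectable. */
static void bio_alloc_percpu_cache(void)
{
	kmemleak_alloc_model();
}

/* pruning: re-register before bio_free() so kmem_cache_free()'s
 * internal kmemleak_free() has a matching allocation to pair with. */
static void bio_alloc_cache_prune(void)
{
	kmemleak_alloc_model();
	kmemleak_free_model();	/* bio_free() -> kmem_cache_free() */
}
```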
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260326144058.2392319-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a disk is saturated, it is common for no IOs to complete within a
timer period. Currently, in this case, rq_wait_pct and missed_ppm are
calculated as 0, the iocost incorrectly interprets this as meeting QoS
targets and resets busy_level to 0.
This reset prevents busy_level from reaching the threshold (4) needed
to reduce vrate. On certain cloud storage, such as Azure Premium SSD,
we observed that iocost may fail to reduce vrate for tens of seconds
during saturation, failing to mitigate noisy neighbor issues.
Fix this by tracking the number of IO completions (nr_done) in a period.
If nr_done is 0 and there are lagging IOs, the saturation status is
unknown, so we keep busy_level unchanged.
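A heavily simplified model of the decision, assuming the surrounding iocost logic is reduced to three cases (the real ioc_timer_fn() weighs many more signals and clamps busy_level):

```c
#include <assert.h>

/* Decide the next busy_level for a timer period. With no completions
 * (nr_done == 0) and IOs still lagging, saturation status is unknown:
 * keep busy_level unchanged instead of resetting it to 0. */
static int next_busy_level(int busy_level, int nr_done, int nr_lagging,
			   int missed_ppm, int rq_wait_pct)
{
	if (!nr_done && nr_lagging)
		return busy_level;	/* the fix: hold steady */
	if (missed_ppm || rq_wait_pct)
		return busy_level + 1;	/* missing QoS: push toward vrate cut */
	return 0;			/* QoS met: reset */
}
```

Without the first check, a saturated period with zero completions computes missed_ppm and rq_wait_pct as 0 and falls through to the reset, so busy_level never reaches the threshold needed to reduce vrate.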
The issue is consistently reproducible on Azure Standard_D8as_v5 (Dasv5)
VMs with 512GB Premium SSD (P20) using the script below. It was not
observed on GCP n2d VMs (with 100G pd-ssd and 1.5T local-ssd), and no
regressions were found with this patch. In this script, cgA performs
large IOs with iodepth=128, while cgB performs small IOs with iodepth=1
rate_iops=100 rw=randrw. With iocost enabled, we expect it to throttle
cgA, the submission latency (slat) of cgA should be significantly higher,
cgB can reach 200 IOPS, and its completion latency (clat) should be low.
BLK_DEVID="8:0"
MODEL="rbps=173471131 rseqiops=3566 rrandiops=3566 wbps=173333269 wseqiops=3566 wrandiops=3566"
QOS="rpct=90 rlat=3500 wpct=90 wlat=3500 min=80 max=10000"
echo "$BLK_DEVID ctrl=user model=linear $MODEL" > /sys/fs/cgroup/io.cost.model
echo "$BLK_DEVID enable=1 ctrl=user $QOS" > /sys/fs/cgroup/io.cost.qos
CG_A="/sys/fs/cgroup/cgA"
CG_B="/sys/fs/cgroup/cgB"
FILE_A="/path/to/sda/A.fio.testfile"
FILE_B="/path/to/sda/B.fio.testfile"
RESULT_DIR="./iocost_results_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$CG_A" "$CG_B" "$RESULT_DIR"
get_result() {
local file=$1
local label=$2
local results=$(jq -r '
.jobs[0].mixed |
( .iops | tonumber | round ) as $iops |
( .bw_bytes / 1024 / 1024 ) as $bps |
( .slat_ns.mean / 1000000 ) as $slat |
( .clat_ns.mean / 1000000 ) as $avg |
( .clat_ns.max / 1000000 ) as $max |
( .clat_ns.percentile["90.000000"] / 1000000 ) as $p90 |
( .clat_ns.percentile["99.000000"] / 1000000 ) as $p99 |
( .clat_ns.percentile["99.900000"] / 1000000 ) as $p999 |
( .clat_ns.percentile["99.990000"] / 1000000 ) as $p9999 |
"\($iops)|\($bps)|\($slat)|\($avg)|\($max)|\($p90)|\($p99)|\($p999)|\($p9999)"
' "$file")
IFS='|' read -r iops bps slat avg max p90 p99 p999 p9999 <<<"$results"
printf "%-8s %-6s %-7.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f\n" \
"$label" "$iops" "$bps" "$slat" "$avg" "$max" "$p90" "$p99" "$p999" "$p9999"
}
run_fio() {
local cg_path=$1
local filename=$2
local name=$3
local bs=$4
local qd=$5
local out=$6
shift 6
local extra=$@
(
pid=$(sh -c 'echo $PPID')
echo $pid >"${cg_path}/cgroup.procs"
fio --name="$name" --filename="$filename" --direct=1 --rw=randrw --rwmixread=50 \
--ioengine=libaio --bs="$bs" --iodepth="$qd" --size=4G --runtime=10 \
--time_based --group_reporting --unified_rw_reporting=mixed \
--output-format=json --output="$out" $extra >/dev/null 2>&1
) &
}
echo "Starting Test ..."
for bs_b in "4k" "32k" "256k"; do
echo "Running iteration: BS=$bs_b"
out_a="${RESULT_DIR}/cgA_1m.json"
out_b="${RESULT_DIR}/cgB_${bs_b}.json"
# cgA: Heavy background (BS 1MB, QD 128)
run_fio "$CG_A" "$FILE_A" "cgA" "1m" 128 "$out_a"
# cgB: Latency sensitive (Variable BS, QD 1, Read/Write IOPS limit 100)
run_fio "$CG_B" "$FILE_B" "cgB" "$bs_b" 1 "$out_b" "--rate_iops=100"
wait
SUMMARY_DATA+="$(get_result "$out_a" "cgA-1m")"$'\n'
SUMMARY_DATA+="$(get_result "$out_b" "cgB-$bs_b")"$'\n\n'
done
echo -e "\nFinal Results Summary:\n"
printf "%-8s %-6s %-7s %-8s %-8s %-8s %-8s %-8s %-8s %-8s\n" \
"" "" "" "slat" "clat" "clat" "clat" "clat" "clat" "clat"
printf "%-8s %-6s %-7s %-8s %-8s %-8s %-8s %-8s %-8s %-8s\n\n" \
"CGROUP" "IOPS" "MB/s" "avg(ms)" "avg(ms)" "max(ms)" "P90(ms)" "P99" "P99.9" "P99.99"
echo "$SUMMARY_DATA"
echo "Results saved in $RESULT_DIR"
Before:
slat clat clat clat clat clat clat
CGROUP IOPS MB/s avg(ms) avg(ms) max(ms) P90(ms) P99 P99.9 P99.99
cgA-1m 166 166.37 3.44 748.95 1298.29 977.27 1233.13 1300.23 1300.23
cgB-4k 5 0.02 0.02 181.74 761.32 742.39 759.17 759.17 759.17
cgA-1m 167 166.51 1.98 748.68 1549.41 809.50 1451.23 1551.89 1551.89
cgB-32k 6 0.18 0.02 169.98 761.76 742.39 759.17 759.17 759.17
cgA-1m 166 165.55 2.89 750.89 1540.37 851.44 1451.23 1535.12 1535.12
cgB-256k 5 1.30 0.02 191.35 759.51 750.78 759.17 759.17 759.17
After:
slat clat clat clat clat clat clat
CGROUP IOPS MB/s avg(ms) avg(ms) max(ms) P90(ms) P99 P99.9 P99.99
cgA-1m 162 162.48 6.14 749.69 850.02 826.28 834.67 843.06 851.44
cgB-4k 199 0.78 0.01 1.95 42.12 2.57 7.50 34.87 42.21
cgA-1m 146 146.20 6.83 833.04 908.68 893.39 901.78 910.16 910.16
cgB-32k 200 6.25 0.01 2.32 31.40 3.06 7.50 16.58 31.33
cgA-1m 110 110.46 9.04 1082.67 1197.91 1182.79 1199.57 1199.57 1199.57
cgB-256k 200 49.98 0.02 3.69 22.20 4.88 9.11 20.05 22.15
Signed-off-by: Jialin Wang <wjl.linux@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://patch.msgid.link/20260331100509.182882-1-wjl.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add the missing put_disk() on the error path in
blkcg_maybe_throttle_current(). When blkcg lookup, blkg lookup, or
blkg_tryget() fails, the function jumps to the out label which only
calls rcu_read_unlock() but does not release the disk reference acquired
by blkcg_schedule_throttle() via get_device(). Since current->throttle_disk
is already set to NULL before the lookup, blkcg_exit() cannot release
this reference either, causing the disk to never be freed.
Restore the reference release that was present as blk_put_queue() in the
original code but was inadvertently dropped during the conversion from
request_queue to gendisk.
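A sketch of the reference discipline the fix restores, modeling the disk reference as a bare counter (function names are hypothetical stand-ins; the real code pairs get_device()/put_device() on the gendisk):

```c
#include <assert.h>
#include <stdbool.h>

/* Reference taken by blkcg_schedule_throttle(); every exit from
 * blkcg_maybe_throttle_current() must drop it, because
 * current->throttle_disk is cleared before the lookups and blkcg_exit()
 * can no longer release it. */
static int disk_ref;

static void get_disk_ref(void) { disk_ref++; }
static void put_disk_ref(void) { disk_ref--; }

static void blkcg_maybe_throttle_current(bool lookup_fails)
{
	if (lookup_fails)
		goto out;	/* blkcg/blkg lookup or blkg_tryget() failed */
	/* ... throttle according to the accumulated delay ... */
out:
	put_disk_ref();		/* the fix: drop the reference on every path */
}
```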
Fixes: f05837ed73d0 ("blk-cgroup: store a gendisk to throttle in struct task_struct")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260331085054.46857-1-liu.yun@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Introduce the new max_open_zones option to allow specifying a limit on
the maximum number of open zones of a zloop device. This change allows
creating a zloop device that can more closely mimic the characteristics
of a physical SMR drive.
When set to a non-zero value, only up to max_open_zones zones can be in
the implicit open (BLK_ZONE_COND_IMP_OPEN) and explicit open
(BLK_ZONE_COND_EXP_OPEN) conditions at any time. The transition to the
implicit open condition of a zone on a write operation can result in an
implicit close of an already implicitly open zone. This is handled in
the function zloop_do_open_zone(). This function also handles
transitions to the explicit open condition. Implicit close transitions
are handled using an LRU ordered list of open zones which is managed
using the helper functions zloop_lru_rotate_open_zone() and
zloop_lru_remove_open_zone().
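A compact model of the implicit-close behavior, with max_open_zones fixed at 2 and the LRU list as an array (the real driver keeps a list and also rotates a zone to the MRU end on each write via zloop_lru_rotate_open_zone(); names here are stand-ins):

```c
#include <assert.h>

#define MAX_OPEN 2

/* LRU-ordered open zones, oldest first. */
static int open_zones[MAX_OPEN];
static int nr_open;

/* Opening a zone beyond the max_open_zones limit implicitly closes the
 * least recently opened zone. Returns the implicitly closed zone, or
 * -1 if no close was needed. */
static int zloop_do_open_zone(int zone)
{
	int closed = -1;
	int i;

	if (nr_open == MAX_OPEN) {
		closed = open_zones[0];		/* implicit close of LRU zone */
		for (i = 1; i < nr_open; i++)
			open_zones[i - 1] = open_zones[i];
		nr_open--;
	}
	open_zones[nr_open++] = zone;		/* new zone becomes MRU */
	return closed;
}
```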
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260326203245.946830-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When blk_revalidate_disk_zones() fails after disk_revalidate_zone_resources()
has allocated args.zones_cond, the memory is leaked because no error path
frees it.
Fixes: 6e945ffb6555 ("block: use zone condition to determine conventional zones")
Suggested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Link: https://patch.msgid.link/20260331111216.24242-1-liu.yun@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When LOOP_CONFIGURE is called with LO_FLAGS_PARTSCAN, the following
sequence occurs:
1. disk_force_media_change() sets GD_NEED_PART_SCAN
2. Uevent suppression is lifted and a KOBJ_CHANGE uevent is sent
3. loop_global_unlock() releases the lock
4. loop_reread_partitions() calls bdev_disk_changed() to scan
There is a race between steps 2 and 4: when udev receives the uevent
and opens the device before loop_reread_partitions() runs,
blkdev_get_whole() in bdev.c sees GD_NEED_PART_SCAN set and calls
bdev_disk_changed() for a first scan. Then loop_reread_partitions()
does a second scan. The open_mutex serializes these two scans, but
does not prevent both from running.
The second scan in bdev_disk_changed() drops all partition devices
from the first scan (via blk_drop_partitions()) before re-adding
them, causing partition block devices to briefly disappear. This
breaks any systemd unit with BindsTo= on the partition device: systemd
observes the device going dead, fails the dependent units, and does
not retry them when the device reappears.
Fix this by removing the GD_NEED_PART_SCAN set from
disk_force_media_change() entirely. None of the current callers need
the lazy on-open partition scan triggered by this flag:
- floppy: sets GENHD_FL_NO_PART, so disk_has_partscan() is always
false and GD_NEED_PART_SCAN has no effect.
- loop (loop_configure, loop_change_fd): when LO_FLAGS_PARTSCAN is
set, loop_reread_partitions() performs an explicit scan. When not
set, GD_SUPPRESS_PART_SCAN prevents the lazy scan path.
- loop (__loop_clr_fd): calls bdev_disk_changed() explicitly if
LO_FLAGS_PARTSCAN is set.
- nbd (nbd_clear_sock_ioctl): capacity is set to zero immediately
after; nbd manages GD_NEED_PART_SCAN explicitly elsewhere.
With GD_NEED_PART_SCAN no longer set by disk_force_media_change(),
udev opening the loop device after the uevent no longer triggers a
redundant scan in blkdev_get_whole(), and only the single explicit
scan from loop_reread_partitions() runs.
A regression test for this bug has been submitted to blktests:
https://github.com/linux-blktests/blktests/pull/240.
Fixes: 9f65c489b68d ("loop: raise media_change event")
Signed-off-by: Daan De Meyer <daan@amutable.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://patch.msgid.link/20260331105130.1077599-1-daan@amutable.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The TCG Opal device could enter a state where no new session can be
created, blocking even Discovery or PSID reset. While a power cycle
or waiting for the timeout should work, there is another possibility
for recovery: using the Stack Reset command.
The Stack Reset command is defined in the TCG Storage Architecture Core
Specification and is mandatory for all Opal devices (see Section 3.3.6
of the Opal SSC specification).
This patch implements the Stack Reset command. Sending it should clear
all active sessions immediately, allowing subsequent commands to run
successfully. While it is a TCG transport layer command, the Linux
kernel implements only Opal ioctls, so it makes sense to use the
IOC_OPAL ioctl interface.
The Stack Reset takes no arguments; the response can be success or pending.
If the command reports a pending state, userspace can try to repeat it;
in this case, the code returns -EBUSY.
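A small model of the documented return-value mapping (the TCG status enum here is an illustrative stand-in; only the success/pending-to--EBUSY behavior comes from the description above):

```c
#include <assert.h>

#ifndef EBUSY
#define EBUSY 16
#endif

/* Stand-in for the Stack Reset response states described above. */
enum tcg_response { TCG_SUCCESS, TCG_PENDING };

/* Map the response to an ioctl return value: success is 0; a pending
 * state is reported as -EBUSY so userspace can retry the command. */
static int opal_stack_reset_status(enum tcg_response r)
{
	return r == TCG_PENDING ? -EBUSY : 0;
}
```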
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Ondrej Kozina <okozina@redhat.com>
Link: https://patch.msgid.link/20260310095349.411287-1-gmazyland@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull NVMe updates from Keith:
"- Fabrics authentication updates (Eric, Alistair)
- Enhanced block queue limits support (Caleb)
- Workqueue usage updates (Marco)
- A new write zeroes device quirk (Robert)
- Tagset cleanup fix for loop device (Nilay)"
* tag 'nvme-7.1-2026-03-27' of git://git.infradead.org/nvme: (41 commits)
nvme-loop: do not cancel I/O and admin tagset during ctrl reset/shutdown
nvme: add WQ_PERCPU to alloc_workqueue users
nvmet-fc: add WQ_PERCPU to alloc_workqueue users
nvmet: replace use of system_wq with system_percpu_wq
nvme-auth: Don't propose NVME_AUTH_DHGROUP_NULL with SC_C
nvme: Add the DHCHAP maximum HD IDs
nvme-pci: add NVME_QUIRK_DISABLE_WRITE_ZEROES for Kingston OM3SGP4
nvme: respect NVME_QUIRK_DISABLE_WRITE_ZEROES when wzsl is set
nvmet: report NPDGL and NPDAL
nvmet: use NVME_NS_FEAT_OPTPERF_SHIFT
nvme: set discard_granularity from NPDG/NPDA
nvme: add from0based() helper
nvme: always issue I/O Command Set specific Identify Namespace
nvme: update nvme_id_ns OPTPERF constants
nvme: fold nvme_config_discard() into nvme_update_disk_info()
nvme: add preferred I/O size fields to struct nvme_id_ns_nvm
nvme: Allow reauth from sysfs
nvme: Expose the tls_configured sysfs for secure concat connections
nvmet-tcp: Don't free SQ on authentication success
nvmet-tcp: Don't error if TLS is enabled on a reset
...
Make drbd_adm_dump_devices() call rcu_read_lock() before
rcu_read_unlock() is called. This has been detected by the Clang
thread-safety analyzer.
Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Fixes: a55bbd375d18 ("drbd: Backport the "status" command")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://patch.msgid.link/20260326214054.284593-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cancelling the I/O and admin tagsets during nvme-loop controller reset
or shutdown is unnecessary. The subsequent destruction of the I/O and
admin queues already waits for all in-flight target operations to
complete.
Cancelling the tagsets first also opens a race window. After a request
tag has been cancelled, a late completion from the target may still
arrive before the queues are destroyed. In that case the completion path
may access a request whose tag has already been cancelled or freed,
which can lead to a kernel crash. Please see below the kernel crash
encountered while running blktests nvme/040:
run blktests nvme/040 at 2026-03-08 06:34:27
loop0: detected capacity change from 0 to 2097152
nvmet: adding nsid 1 to subsystem blktests-subsystem-1
nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
nvme nvme6: creating 96 I/O queues.
nvme nvme6: new ctrl: "blktests-subsystem-1"
nvme_log_error: 1 callbacks suppressed
block nvme6n1: no usable path - requeuing I/O
nvme6c6n1: Read(0x2) @ LBA 2096384, 128 blocks, Host Aborted Command (sct 0x3 / sc 0x71)
blk_print_req_error: 1 callbacks suppressed
I/O error, dev nvme6c6n1, sector 2096384 op 0x0:(READ) flags 0x2880700 phys_seg 1 prio class 2
block nvme6n1: no usable path - requeuing I/O
Kernel attempted to read user page (236) - exploit attempt? (uid: 0)
BUG: Kernel NULL pointer dereference on read at 0x00000236
Faulting instruction address: 0xc000000000961274
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: nvme_loop nvme_fabrics loop nvmet null_blk rpadlpar_io rpaphp xsk_diag bonding rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink pseries_rng dax_pmem vmx_crypto drm drm_panel_orientation_quirks xfs mlx5_core nvme bnx2x sd_mod nd_pmem nd_btt nvme_core sg papr_scm tls libnvdimm ibmvscsi ibmveth scsi_transport_srp nvme_keyring nvme_auth mdio hkdf pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: loop]
CPU: 25 UID: 0 PID: 0 Comm: swapper/25 Kdump: loaded Not tainted 7.0.0-rc3+ #14 PREEMPT
Hardware name: IBM,9043-MRX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1120.00 (RF1120_128) hv:phyp pSeries
NIP: c000000000961274 LR: c008000009af1808 CTR: c00000000096124c
REGS: c0000007ffc0f910 TRAP: 0300 Not tainted (7.0.0-rc3+)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22222222 XER: 00000000
CFAR: c008000009af232c DAR: 0000000000000236 DSISR: 40000000 IRQMASK: 0
GPR00: c008000009af17fc c0000007ffc0fbb0 c000000001c78100 c0000000be05cc00
GPR04: 0000000000000001 0000000000000000 0000000000000007 0000000000000000
GPR08: 0000000000000000 0000000000000000 0000000000000002 c008000009af2318
GPR12: c00000000096124c c0000007ffdab880 0000000000000000 0000000000000000
GPR16: 0000000000000010 0000000000000000 0000000000000004 0000000000000000
GPR20: 0000000000000001 c000000002ca2b00 0000000100043bb2 000000000000000a
GPR24: 000000000000000a 0000000000000000 0000000000000000 0000000000000000
GPR28: c000000084021d40 c000000084021d50 c0000000be05cd60 c0000000be05cc00
NIP [c000000000961274] blk_mq_complete_request_remote+0x28/0x2d4
LR [c008000009af1808] nvme_loop_queue_response+0x110/0x290 [nvme_loop]
Call Trace:
0xc00000000502c640 (unreliable)
nvme_loop_queue_response+0x104/0x290 [nvme_loop]
__nvmet_req_complete+0x80/0x498 [nvmet]
nvmet_req_complete+0x24/0xf8 [nvmet]
nvmet_bio_done+0x58/0xcc [nvmet]
bio_endio+0x250/0x390
blk_update_request+0x2e8/0x68c
blk_mq_end_request+0x30/0x5c
lo_complete_rq+0x94/0x110 [loop]
blk_complete_reqs+0x78/0x98
handle_softirqs+0x148/0x454
do_softirq_own_stack+0x3c/0x50
__irq_exit_rcu+0x18c/0x1b4
irq_exit+0x1c/0x34
do_IRQ+0x114/0x278
hardware_interrupt_common_virt+0x28c/0x290
Since the queue teardown path already guarantees that all target-side
operations have completed, cancelling the tagsets is redundant and
unsafe. So avoid cancelling the I/O and admin tagsets during controller
reset and shutdown.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Every doit handler followed the same pattern: stack-allocate an
adm_ctx, call drbd_adm_prepare() at the top, call drbd_adm_finish()
at the bottom. This duplicated boilerplate across 25 handlers and
made error paths inconsistent, since some handlers could miss sending
the reply skb on early-exit paths.
The generic netlink framework already provides pre_doit/post_doit
hooks for exactly this purpose. An old comment even noted "this
would be a good candidate for a pre_doit hook".
Use them:
- pre_doit heap-allocates adm_ctx, looks up per-command flags from a
new drbd_genl_cmd_flags[] table, runs drbd_adm_prepare(), and
stores the context in info->user_ptr[0].
- post_doit sends the reply, drops kref references for
device/connection/resource, and frees the adm_ctx.
- Handlers just receive adm_ctx from info->user_ptr[0], set
reply_dh->ret_code, and return. All teardown is in post_doit.
- drbd_adm_finish() is removed, superseded by post_doit.
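A runnable model of the dispatch flow reduced to its teardown guarantee (real genl pre_doit/post_doit receive skb and genl_info arguments and the context travels in info->user_ptr[0]; the shapes below are stand-ins):

```c
#include <assert.h>
#include <stdlib.h>

struct adm_ctx {
	int ret_code;
};

static int replies_sent;
static int ctx_live;

/* pre_doit: heap-allocate the admin context and prepare it. */
static struct adm_ctx *pre_doit(void)
{
	struct adm_ctx *ctx = calloc(1, sizeof(*ctx));

	assert(ctx);
	ctx_live++;
	return ctx;
}

/* post_doit: always send the reply, drop references, free the context.
 * Handlers can no longer miss the reply on early-exit paths. */
static void post_doit(struct adm_ctx *ctx)
{
	replies_sent++;		/* reply skb sent unconditionally */
	free(ctx);
	ctx_live--;
}

/* The framework brackets every handler; the handler only sets
 * ret_code and returns. */
static void dispatch(int handler_ret)
{
	struct adm_ctx *ctx = pre_doit();

	ctx->ret_code = handler_ret;
	post_doit(ctx);
}
```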
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://patch.msgid.link/20260324152907.2840984-1-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.
In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.
Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Add a new option that causes zloop to truncate the zone files to the
write pointer value recorded at the last cache flush to simulate
unclean shutdowns.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://patch.msgid.link/20260323071156.2940772-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.
In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.
Cc: Justin Tee <justin.tee@broadcom.com>
Cc: Naresh Gottumukkala <nareshgottumukkala83@gmail.com>
CC: Paul Ely <paul.ely@broadcom.com>
Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Split out two helper functions to make the function more readable and
to avoid conditional locking.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://patch.msgid.link/20260323071156.2940772-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch continues the effort to refactor workqueue APIs, which has begun
with the changes introducing new workqueues and a new alloc_workqueue flag:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The point of the refactoring is to eventually alter the default behavior of
workqueues to become unbound by default so that their workload placement is
optimized by the scheduler.
Before that to happen, workqueue users must be converted to the better named
new workqueues with no intended behaviour changes:
system_wq -> system_percpu_wq
system_unbound_wq -> system_dfl_wq
This way the old obsolete workqueues (system_wq, system_unbound_wq) can be
removed in the future.
Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
bio_alloc_bioset() first strips __GFP_DIRECT_RECLAIM from the optimistic
fast allocation attempt with try_alloc_gfp(). If that fast path fails,
the slowpath checks saved_gfp to decide whether blocking allocation is
allowed, but then still calls mempool_alloc() with the stripped gfp mask.
That can lead to a NULL bio pointer being passed into bio_init().
Fix the slowpath by using saved_gfp for the bio and bvec mempool
allocations.
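A sketch of the mask handling, with illustrative flag values standing in for the kernel's gfp constants (the real fast path also involves the percpu bio cache):

```c
#include <assert.h>

/* Illustrative values, not the kernel's. */
#define GFP_DIRECT_RECLAIM 0x400u	/* models __GFP_DIRECT_RECLAIM */
#define GFP_KERNEL (GFP_DIRECT_RECLAIM | 0xc0u)

/* The fast attempt strips __GFP_DIRECT_RECLAIM so it cannot block; if
 * it fails, the slowpath must fall back to the caller's saved mask so
 * the mempool allocation is actually allowed to block. */
static unsigned int slowpath_gfp(unsigned int gfp)
{
	unsigned int saved_gfp = gfp;
	unsigned int try_gfp = gfp & ~GFP_DIRECT_RECLAIM;

	(void)try_gfp;		/* optimistic attempt failed */
	return saved_gfp;	/* the fix: use the unstripped mask */
}
```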
Fixes: b520c4eef83d ("block: split bio_alloc_bioset more clearly into a fast and slowpath")
Reported-by: syzbot+09ddb593eea76a158f42@syzkaller.appspotmail.com
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/p01.gc6e9ad5845ad.ttca29g@ub.hpns
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Section 8.3.4.5.2 of the NVMe 2.1 base spec states that
"""
The 00h identifier shall not be proposed in an AUTH_Negotiate message
that requests secure channel concatenation (i.e., with the SC_C field
set to a non-zero value).
"""
We need to ensure that we don't set the NVME_AUTH_DHGROUP_NULL idlist if
SC_C is set.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kamaljit Singh <kamaljit.singh@opensource.wdc.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Mark ublk_filter_unused_tags() as noinline since it is only called from
the unlikely(needs_filter) branch. Extract the error-handling block from
__ublk_batch_dispatch() into a new noinline ublk_batch_dispatch_fail()
function to keep the hot path compact and icache-friendly. This also
makes __ublk_batch_dispatch() more readable by separating the error
recovery logic from the normal dispatch flow.
Before: __ublk_batch_dispatch is ~1419 bytes
After: __ublk_batch_dispatch is ~1090 bytes (-329 bytes, -23%)
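A userspace sketch of the layout change (function bodies are stand-ins; the point is only the noinline split that keeps the rarely-taken error path out of the hot dispatch path):

```c
#include <assert.h>

/* Cold path: error recovery kept out of line so the hot dispatch loop
 * stays compact and icache-friendly. */
static __attribute__((noinline)) int ublk_batch_dispatch_fail(int err)
{
	/* ... requeue or complete the failed requests ... */
	return err;
}

static int ublk_batch_dispatch(int err)
{
	if (err)	/* unlikely() in the kernel */
		return ublk_batch_dispatch_fail(err);
	/* ... normal dispatch flow ... */
	return 0;
}
```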
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260318014112.3125432-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for using the DHCHAP lengths in upcoming host and target
patches, add the hash and Diffie-Hellman ID length macros.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yunje Shin <ioerts@kookmin.ac.kr>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chris Leech <cleech@redhat.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Pull MD changes from Yu Kuai:
"Bug Fixes:
- md: suppress spurious superblock update error message for dm-raid
(Chen Cheng)
- md/raid1: fix the comparing region of interval tree (Xiao Ni)
- md/raid10: fix deadlock with check operation and nowait requests
(Josh Hunt)
- md/raid5: skip 2-failure compute when other disk is R5_LOCKED
(FengWei Shih)
- md/md-llbitmap: raise barrier before state machine transition
(Yu Kuai)
- md/md-llbitmap: skip reading rdevs that are not in_sync (Yu Kuai)
Improvements:
- md/raid5: set chunk_sectors to enable full stripe I/O splitting
(Yu Kuai)
Cleanups:
- md: remove unused mddev argument from export_rdev (Chen Cheng)
- md/raid5: remove stale md_raid5_kick_device() declaration
(Chen Cheng)
- md/raid5: move handle_stripe() comment to correct location
(Chen Cheng)"
* tag 'md-7.1-20260323' of git://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
md: remove unused mddev argument from export_rdev
md/raid5: move handle_stripe() comment to correct location
md/raid5: remove stale md_raid5_kick_device() declaration
md/raid1: fix the comparing region of interval tree
md/raid5: skip 2-failure compute when other disk is R5_LOCKED
md/md-llbitmap: raise barrier before state machine transition
md/md-llbitmap: skip reading rdevs that are not in_sync
md/raid5: set chunk_sectors to enable full stripe I/O splitting
md/raid10: fix deadlock with check operation and nowait requests
md: suppress spurious superblock update error message for dm-raid
The Kingston OM3SGP42048K2-A00 (PCI ID 2646:502f) firmware has a race
condition when processing concurrent write zeroes and DSM (discard)
commands, causing spurious "LBA Out of Range" errors and IOMMU page
faults at address 0x0.
The issue is reliably triggered by running two concurrent mkfs commands
on different partitions of the same drive, which generates interleaved
write zeroes and discard operations.
Disable write zeroes for this device, matching the pattern used for
other Kingston OM* drives that have similar firmware issues.
Cc: stable@vger.kernel.org
Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Assisted-by: claude-opus-4-6-v1
Signed-off-by: Keith Busch <kbusch@kernel.org>
In preparation for removing the strlcat API[1], replace the char *pp_buf
with a struct seq_buf, which tracks the current write position and
remaining space internally. This allows for:
- Direct use of seq_buf_printf() in place of snprintf()+strlcat()
pairs, eliminating local tmp buffers throughout.
- Adjacent strlcat() calls that build strings piece-by-piece
(e.g., strlcat("["); strlcat(name); strlcat("]")) to be collapsed
into single seq_buf_printf() calls.
- Simpler call sites: seq_buf_puts() takes only the buffer and string,
with no need to pass PAGE_SIZE at every call.
The backing buffer allocation is unchanged (__get_free_page), and the
output path uses seq_buf_str() to NUL-terminate before passing to
printk().
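The shape of the conversion can be sketched with a minimal Python analogue of seq_buf (class and method names here are illustrative stand-ins, not the kernel API):

```python
class SeqBuf:
    """Tracks write position and remaining space, like struct seq_buf."""
    def __init__(self, size):
        self.size, self.parts, self.len = size, [], 0

    def printf(self, fmt, *args):
        s = fmt % args
        take = min(len(s), self.size - self.len)  # truncate, never overflow
        self.parts.append(s[:take])
        self.len += take

    def str(self):
        return "".join(self.parts)  # like seq_buf_str(): terminated view

# Before: snprintf() into a tmp buffer, then strlcat() piece by piece,
# passing PAGE_SIZE at every call site.
# After: adjacent pieces collapse into single formatted writes.
sb = SeqBuf(4096)
sb.printf("[%s]", "raid1")
sb.printf(" %d/%d", 2, 2)
print(sb.str())  # [raid1] 2/2
```

The key property is that truncation and position tracking live inside the buffer object, so call sites no longer need to thread the size through every append.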
Link: https://github.com/KSPP/linux/issues/370 [1]
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Josh Law <objecting@objecting.org>
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Josh Law <objecting@objecting.org>
Link: https://patch.msgid.link/20260321004840.work.670-kees@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The mddev argument in export_rdev() is never used. Remove it to
simplify callers.
Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://lore.kernel.org/linux-raid/20260304111417.20777-1-chencheng@fnnas.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
The NVM Command Set Identify Controller data may report a non-zero
Write Zeroes Size Limit (wzsl). When present, nvme_init_non_mdts_limits()
unconditionally overrides max_zeroes_sectors from wzsl, even if
NVME_QUIRK_DISABLE_WRITE_ZEROES previously set it to zero.
This effectively re-enables write zeroes for devices that need it
disabled, defeating the quirk. Several Kingston OM* drives rely on
this quirk to avoid firmware issues with write zeroes commands.
Check for the quirk before applying the wzsl override.
Fixes: 5befc7c26e5a ("nvme: implement non-mdts command limits")
Cc: stable@vger.kernel.org
Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Assisted-by: claude-opus-4-6-v1
Signed-off-by: Keith Busch <kbusch@kernel.org>
Implement the SCSI-specific io_uring command handler for BSG using
struct bsg_uring_cmd.
The handler builds a SCSI request from the io_uring command, maps user
buffers (including fixed buffers), and completes asynchronously via a
request end_io callback and task_work. Completion returns a 32-bit
status and packed residual/sense information via CQE res and res2, and
supports IO_URING_F_NONBLOCK.
Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://patch.msgid.link/20260317072226.2598233-4-yangxiuwei@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the handle_stripe() documentation comment from above
analyse_stripe() to directly above handle_stripe() where it belongs.
Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Link: https://lore.kernel.org/linux-raid/20260304111001.15767-1-chencheng@fnnas.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
A block device with a very large discard_granularity queue limit may not
be able to report it in the 16-bit NPDG and NPDA fields in the Identify
Namespace data structure. For this reason, version 2.1 of the NVMe specs
added 32-bit fields NPDGL and NPDAL to the NVM Command Set Specific
Identify Namespace structure. So report the discard_granularity there
too and set OPTPERF to 11b to indicate those fields are supported.
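The overflow that the 32-bit fields solve can be shown numerically (the sizes below are hypothetical examples, not taken from a specific device):

```python
LBA_SIZE = 512
discard_granularity = 64 * 1024 * 1024   # 64 MiB, in bytes

npdg = discard_granularity // LBA_SIZE   # reported in logical blocks
print(npdg)                              # 131072

# Does not fit the 16-bit NPDG field in the Identify Namespace data...
assert npdg > 0xFFFF
# ...but fits the 32-bit NPDGL field added in NVMe 2.1.
assert npdg <= 0xFFFFFFFF
```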
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Add an io_uring command handler to the generic BSG layer. The new
.uring_cmd file operation validates io_uring features and delegates
handling to a per-queue bsg_uring_cmd_fn callback.
Extend bsg_register_queue() so transport drivers can register both
sg_io and io_uring command handlers.
Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://patch.msgid.link/20260317072226.2598233-3-yangxiuwei@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove the unused md_raid5_kick_device() declaration from raid5.h -
no definition exists for this function.
Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Link: https://lore.kernel.org/linux-raid/20260304110919.15071-1-chencheng@fnnas.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
The NVMe Base Specification 8.3.5.5.9 states that the session key Ks
shall be computed from the ephemeral DH key by applying the hash
function selected by the HashID parameter.
The current implementation stores the raw DH shared secret as the
session key without hashing it. This causes redundant hash operations:
1. Augmented challenge computation (section 8.3.5.5.4) requires
Ca = HMAC(H(g^xy mod p), C). The code compensates by hashing the
unhashed session key in nvme_auth_augmented_challenge() to produce
the correct result.
2. PSK generation (section 8.3.5.5.9) requires PSK = HMAC(Ks, C1 || C2)
where Ks should already be H(g^xy mod p). As the DH shared secret
is always larger than the HMAC block size, HMAC internally hashes
it before use, accidentally producing the correct result.
When using secure channel concatenation with bidirectional
authentication, this results in hashing the DH value three times: twice
for augmented challenge calculations and once during PSK generation.
Fix this by:
- Modifying nvme_auth_gen_shared_secret() to hash the DH shared secret
once after computation: Ks = H(g^xy mod p)
- Removing the hash operation from nvme_auth_augmented_challenge()
as the session key is now already hashed
- Updating session key buffer size from DH key size to hash output size
- Adding specification references in comments
This avoids storing the raw DH shared secret and reduces the number of
hash operations from three to one when using secure channel
concatenation.
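The relationship between the raw DH secret and the hashed session key can be illustrated with a short sketch (illustrative only; SHA-256 stands in for the hash selected by HashID, and the sizes are assumptions):

```python
import hashlib
import hmac
import os

H = hashlib.sha256           # stand-in for the HashID-selected hash
BLOCK = H().block_size       # 64 bytes for SHA-256

shared = os.urandom(256)     # raw DH shared secret (e.g. 2048-bit group)
assert len(shared) > BLOCK   # always larger than the HMAC block size

ks = H(shared).digest()      # Ks = H(g^xy mod p), per section 8.3.5.5.9

# RFC 2104: HMAC pre-hashes any key longer than the block size, which is
# why the old code that passed the raw secret still produced correct PSKs.
c1, c2 = os.urandom(32), os.urandom(32)
psk_old = hmac.new(shared, c1 + c2, H).digest()
psk_new = hmac.new(ks, c1 + c2, H).digest()
assert psk_old == psk_new

# With Ks already hashed, the augmented challenge needs no extra hash:
ca = hmac.new(ks, c1, H).digest()    # Ca = HMAC(H(g^xy mod p), C)
```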
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Chris Leech <cleech@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Admin commands submitted through io_uring_cmd passthrough can be
batched, which means bd->last may be false so that the doorbell write is
skipped in order to aggregate multiple commands per write. If a
subsequent command can't be dispatched for whatever reason, we have to
provide the blk-mq ops' commit_rqs callback to ensure the doorbell is
properly updated.
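A simplified model of the batching contract (names here are hypothetical; the real callbacks live in struct blk_mq_ops):

```python
class Nvmeq:
    """Models a submission queue whose doorbell is written lazily."""
    def __init__(self):
        self.sq, self.doorbell = [], 0

    def queue_rq(self, cmd, last):
        self.sq.append(cmd)
        if last:                  # bd->last: end of batch, ring now
            self.commit_rqs()

    def commit_rqs(self):         # blk-mq calls this when dispatch stops early
        self.doorbell = len(self.sq)

q = Nvmeq()
q.queue_rq("id-ctrl", last=False)   # batched, doorbell not yet written
q.queue_rq("get-log", last=False)
# A third command fails to dispatch (e.g. out of tags); without a
# commit_rqs callback the two queued commands would never be seen
# by the controller.
q.commit_rqs()
assert q.doorbell == 2
```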
Fixes: 58e5bdeb9c2b ("nvme: enable uring-passthrough for admin commands")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Section 8.3.4.5.5 of the NVMe Base Specification 2.1 describes what is
included in the Response Value (RVAL) hash, and SC_C should be included.
Currently we hardcode 0 instead of using the correct SC_C value.
Update the host and target code to use SC_C when calculating the
RVAL instead of 0.
Fixes: e88a7595b57f2 ("nvme-tcp: request secure channel concatenation")
Reviewed-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
When a controller reset is triggered via sysfs (by writing to
/sys/class/nvme/<nvmedev>/reset_controller), the reset work tears down
and re-establishes all queues. The socket release using fput() defers
the actual cleanup to task_work delayed_fput workqueue. This deferred
cleanup can race with the subsequent queue re-allocation during reset,
potentially leading to use-after-free or resource conflicts.
Replace fput() with __fput_sync() to ensure synchronous socket release,
guaranteeing that all socket resources are fully cleaned up before the
function returns. This prevents races during controller reset where
new queue setup may begin before the old socket is fully released.
* Call chain during reset:
nvme_reset_ctrl_work()
-> nvme_tcp_teardown_ctrl()
-> nvme_tcp_teardown_io_queues()
-> nvme_tcp_free_io_queues()
-> nvme_tcp_free_queue() <-- fput() -> __fput_sync()
-> nvme_tcp_teardown_admin_queue()
-> nvme_tcp_free_admin_queue()
-> nvme_tcp_free_queue() <-- fput() -> __fput_sync()
-> nvme_tcp_setup_ctrl() <-- race with deferred fput
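The hazard can be modeled as a deferred-cleanup race (pure illustration with invented names, not the VFS implementation):

```python
class Sockets:
    """Models socket lifetimes under deferred vs. synchronous release."""
    def __init__(self):
        self.open, self.deferred = set(), []

    def fput(self, sock):            # defers release to task_work
        self.deferred.append(sock)

    def fput_sync(self, sock):       # releases before returning
        self.open.discard(sock)

    def flush_deferred(self):        # delayed_fput workqueue, runs later
        for s in self.deferred:
            self.open.discard(s)
        self.deferred.clear()

s = Sockets()
s.open.add("queue0")
s.fput("queue0")                 # old path: cleanup still pending...
assert "queue0" in s.open        # ...so queue re-allocation during reset
s.flush_deferred()               #    can race with the live socket

s.open.add("queue0")
s.fput_sync("queue0")            # new path: fully released immediately
assert "queue0" not in s.open
```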
memalloc_noreclaim_save() sets PF_MEMALLOC which is intended for tasks
performing memory reclaim work that need reserve access. While PF_MEMALLOC
prevents the task from entering direct reclaim (causing __need_reclaim() to
return false), it does not strip __GFP_IO from gfp flags. The allocator can
therefore still trigger writeback I/O when __GFP_IO remains set, which is
unsafe when the caller holds block layer locks.
Switch to memalloc_noio_save() which sets PF_MEMALLOC_NOIO. This causes
current_gfp_context() to strip __GFP_IO|__GFP_FS from every allocation in
the scope, making it safe to allocate memory while holding elevator_lock and
set->srcu.
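The difference between the two scopes can be modeled with flag arithmetic (a simplified model; see current_gfp_context() in the kernel for the full logic, which also handles the NOFS and PIN scopes):

```python
GFP_IO, GFP_FS = 0x40, 0x80
PF_MEMALLOC, PF_MEMALLOC_NOIO = 0x0800, 0x080000

def current_gfp_context(gfp, task_flags):
    # Only the NOIO scope strips I/O flags; PF_MEMALLOC merely grants
    # reserve access and blocks direct reclaim, it does not mask gfp bits.
    if task_flags & PF_MEMALLOC_NOIO:
        gfp &= ~(GFP_IO | GFP_FS)
    return gfp

req = GFP_IO | GFP_FS
assert current_gfp_context(req, PF_MEMALLOC) == req      # writeback I/O possible
assert current_gfp_context(req, PF_MEMALLOC_NOIO) == 0   # writeback prevented
```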
* The issue can be reproduced using blktests:
nvme_trtype=tcp ./check nvme/005
blktests (master) # nvme_trtype=tcp ./check nvme/005
nvme/005 (tr=tcp) (reset local loopback target) [failed]
runtime 0.725s ... 0.798s
something found in dmesg:
[ 108.473940] run blktests nvme/005 at 2025-11-22 16:12:20
[...]
...
(See '/root/blktests/results/nodev_tr_tcp/nvme/005.dmesg' for the entire message)
blktests (master) # cat /root/blktests/results/nodev_tr_tcp/nvme/005.dmesg
[ 108.473940] run blktests nvme/005 at 2025-11-22 16:12:20
[ 108.526983] loop0: detected capacity change from 0 to 2097152
[ 108.555606] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 108.572531] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 108.613061] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 108.616832] nvme nvme0: creating 48 I/O queues.
[ 108.630791] nvme nvme0: mapped 48/0/0 default/read/poll queues.
[ 108.661892] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 108.746639] nvmet: Created nvm controller 2 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 108.748466] nvme nvme0: creating 48 I/O queues.
[ 108.802984] nvme nvme0: mapped 48/0/0 default/read/poll queues.
[ 108.829983] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
[ 108.854288] block nvme0n1: no available path - failing I/O
[ 108.854344] block nvme0n1: no available path - failing I/O
[ 108.854373] Buffer I/O error on dev nvme0n1, logical block 1, async page read
[ 108.891693] ======================================================
[ 108.895912] WARNING: possible circular locking dependency detected
[ 108.900184] 6.17.0nvme+ #3 Tainted: G N
[ 108.903913] ------------------------------------------------------
[ 108.908171] nvme/2734 is trying to acquire lock:
[ 108.911957] ffff88810210e610 (set->srcu){.+.+}-{0:0}, at: __synchronize_srcu+0x17/0x170
[ 108.917587]
but task is already holding lock:
[ 108.921570] ffff88813abea198 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0xa8/0x1c0
[ 108.927361]
which lock already depends on the new lock.
[ 108.933018]
the existing dependency chain (in reverse order) is:
[ 108.938223]
-> #4 (&q->elevator_lock){+.+.}-{4:4}:
[ 108.942988] __mutex_lock+0xa2/0x1150
[ 108.945873] elevator_change+0xa8/0x1c0
[ 108.948925] elv_iosched_store+0xdf/0x140
[ 108.952043] kernfs_fop_write_iter+0x16a/0x220
[ 108.955367] vfs_write+0x378/0x520
[ 108.957598] ksys_write+0x67/0xe0
[ 108.959721] do_syscall_64+0x76/0xbb0
[ 108.962052] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 108.965145]
-> #3 (&q->q_usage_counter(io)){++++}-{0:0}:
[ 108.968923] blk_alloc_queue+0x30e/0x350
[ 108.972117] blk_mq_alloc_queue+0x61/0xd0
[ 108.974677] scsi_alloc_sdev+0x2a0/0x3e0
[ 108.977092] scsi_probe_and_add_lun+0x1bd/0x430
[ 108.979921] __scsi_add_device+0x109/0x120
[ 108.982504] ata_scsi_scan_host+0x97/0x1c0
[ 108.984365] async_run_entry_fn+0x2d/0x130
[ 108.986109] process_one_work+0x20e/0x630
[ 108.987830] worker_thread+0x184/0x330
[ 108.989473] kthread+0x10a/0x250
[ 108.990852] ret_from_fork+0x297/0x300
[ 108.992491] ret_from_fork_asm+0x1a/0x30
[ 108.994159]
-> #2 (fs_reclaim){+.+.}-{0:0}:
[ 108.996320] fs_reclaim_acquire+0x99/0xd0
[ 108.998058] kmem_cache_alloc_node_noprof+0x4e/0x3c0
[ 109.000123] __alloc_skb+0x15f/0x190
[ 109.002195] tcp_send_active_reset+0x3f/0x1e0
[ 109.004038] tcp_disconnect+0x50b/0x720
[ 109.005695] __tcp_close+0x2b8/0x4b0
[ 109.007227] tcp_close+0x20/0x80
[ 109.008663] inet_release+0x31/0x60
[ 109.010175] __sock_release+0x3a/0xc0
[ 109.011778] sock_close+0x14/0x20
[ 109.013263] __fput+0xee/0x2c0
[ 109.014673] delayed_fput+0x31/0x50
[ 109.016183] process_one_work+0x20e/0x630
[ 109.017897] worker_thread+0x184/0x330
[ 109.019543] kthread+0x10a/0x250
[ 109.020929] ret_from_fork+0x297/0x300
[ 109.022565] ret_from_fork_asm+0x1a/0x30
[ 109.024194]
-> #1 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
[ 109.026634] lock_sock_nested+0x2e/0x70
[ 109.028251] tcp_sendmsg+0x1a/0x40
[ 109.029783] sock_sendmsg+0xed/0x110
[ 109.031321] nvme_tcp_try_send_cmd_pdu+0x13e/0x260 [nvme_tcp]
[ 109.034263] nvme_tcp_try_send+0xb3/0x330 [nvme_tcp]
[ 109.036375] nvme_tcp_queue_rq+0x342/0x3d0 [nvme_tcp]
[ 109.038528] blk_mq_dispatch_rq_list+0x297/0x800
[ 109.040448] __blk_mq_sched_dispatch_requests+0x3db/0x5f0
[ 109.042677] blk_mq_sched_dispatch_requests+0x29/0x70
[ 109.044787] blk_mq_run_work_fn+0x76/0x1b0
[ 109.046535] process_one_work+0x20e/0x630
[ 109.048245] worker_thread+0x184/0x330
[ 109.049890] kthread+0x10a/0x250
[ 109.051331] ret_from_fork+0x297/0x300
[ 109.053024] ret_from_fork_asm+0x1a/0x30
[ 109.054740]
-> #0 (set->srcu){.+.+}-{0:0}:
[ 109.056850] __lock_acquire+0x1468/0x2210
[ 109.058614] lock_sync+0xa5/0x110
[ 109.060048] __synchronize_srcu+0x49/0x170
[ 109.061802] elevator_switch+0xc9/0x330
[ 109.063950] elevator_change+0x128/0x1c0
[ 109.065675] elevator_set_none+0x4c/0x90
[ 109.067316] blk_unregister_queue+0xa8/0x110
[ 109.069165] __del_gendisk+0x14e/0x3c0
[ 109.070824] del_gendisk+0x75/0xa0
[ 109.072328] nvme_ns_remove+0xf2/0x230 [nvme_core]
[ 109.074365] nvme_remove_namespaces+0xf2/0x150 [nvme_core]
[ 109.076652] nvme_do_delete_ctrl+0x71/0x90 [nvme_core]
[ 109.078775] nvme_delete_ctrl_sync+0x3b/0x50 [nvme_core]
[ 109.081009] nvme_sysfs_delete+0x34/0x40 [nvme_core]
[ 109.083082] kernfs_fop_write_iter+0x16a/0x220
[ 109.085009] vfs_write+0x378/0x520
[ 109.086539] ksys_write+0x67/0xe0
[ 109.087982] do_syscall_64+0x76/0xbb0
[ 109.089577] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 109.091665]
other info that might help us debug this:
[ 109.095478] Chain exists of:
set->srcu --> &q->q_usage_counter(io) --> &q->elevator_lock
[ 109.099544] Possible unsafe locking scenario:
[ 109.101708] CPU0 CPU1
[ 109.103402] ---- ----
[ 109.105103] lock(&q->elevator_lock);
[ 109.106530] lock(&q->q_usage_counter(io));
[ 109.109022] lock(&q->elevator_lock);
[ 109.111391] sync(set->srcu);
[ 109.112586]
*** DEADLOCK ***
[ 109.114772] 5 locks held by nvme/2734:
[ 109.116189] #0: ffff888101925410 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x67/0xe0
[ 109.119143] #1: ffff88817a914e88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x10f/0x220
[ 109.123141] #2: ffff8881046313f8 (kn->active#185){++++}-{0:0}, at: sysfs_remove_file_self+0x26/0x50
[ 109.126543] #3: ffff88810470e1d0 (&set->update_nr_hwq_lock){++++}-{4:4}, at: del_gendisk+0x6d/0xa0
[ 109.129891] #4: ffff88813abea198 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0xa8/0x1c0
[ 109.133149]
stack backtrace:
[ 109.134817] CPU: 6 UID: 0 PID: 2734 Comm: nvme Tainted: G N 6.17.0nvme+ #3 PREEMPT(voluntary)
[ 109.134819] Tainted: [N]=TEST
[ 109.134820] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 109.134821] Call Trace:
[ 109.134823] <TASK>
[ 109.134824] dump_stack_lvl+0x75/0xb0
[ 109.134828] print_circular_bug+0x26a/0x330
[ 109.134831] check_noncircular+0x12f/0x150
[ 109.134834] __lock_acquire+0x1468/0x2210
[ 109.134837] ? __synchronize_srcu+0x17/0x170
[ 109.134838] lock_sync+0xa5/0x110
[ 109.134840] ? __synchronize_srcu+0x17/0x170
[ 109.134842] __synchronize_srcu+0x49/0x170
[ 109.134843] ? mark_held_locks+0x49/0x80
[ 109.134845] ? _raw_spin_unlock_irqrestore+0x2d/0x60
[ 109.134847] ? kvm_clock_get_cycles+0x14/0x30
[ 109.134853] ? ktime_get_mono_fast_ns+0x36/0xb0
[ 109.134858] elevator_switch+0xc9/0x330
[ 109.134860] elevator_change+0x128/0x1c0
[ 109.134862] ? kernfs_put.part.0+0x86/0x290
[ 109.134864] elevator_set_none+0x4c/0x90
[ 109.134866] blk_unregister_queue+0xa8/0x110
[ 109.134868] __del_gendisk+0x14e/0x3c0
[ 109.134870] del_gendisk+0x75/0xa0
[ 109.134872] nvme_ns_remove+0xf2/0x230 [nvme_core]
[ 109.134879] nvme_remove_namespaces+0xf2/0x150 [nvme_core]
[ 109.134887] nvme_do_delete_ctrl+0x71/0x90 [nvme_core]
[ 109.134893] nvme_delete_ctrl_sync+0x3b/0x50 [nvme_core]
[ 109.134899] nvme_sysfs_delete+0x34/0x40 [nvme_core]
[ 109.134905] kernfs_fop_write_iter+0x16a/0x220
[ 109.134908] vfs_write+0x378/0x520
[ 109.134911] ksys_write+0x67/0xe0
[ 109.134913] do_syscall_64+0x76/0xbb0
[ 109.134915] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 109.134916] RIP: 0033:0x7fd68a737317
[ 109.134917] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 109.134919] RSP: 002b:00007ffded1546d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 109.134920] RAX: ffffffffffffffda RBX: 000000000054f7e0 RCX: 00007fd68a737317
[ 109.134921] RDX: 0000000000000001 RSI: 00007fd68a855719 RDI: 0000000000000003
[ 109.134921] RBP: 0000000000000003 R08: 0000000030407850 R09: 00007fd68a7cd4e0
[ 109.134922] R10: 00007fd68a65b130 R11: 0000000000000246 R12: 00007fd68a855719
[ 109.134923] R13: 00000000304074c0 R14: 00000000304074c0 R15: 0000000030408660
[ 109.134926] </TASK>
[ 109.962756] Key type psk unregistered
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Currently after the host sends a REPLACETLSPSK we free the TLS keys as
part of calling nvmet_auth_sq_free() on success. This means when the
host sends a follow-up REPLACETLSPSK we return CONCAT_MISMATCH as the
check for !nvmet_queue_tls_keyid(req->sq) fails.
A previous attempt to fix this involved not calling nvmet_auth_sq_free()
on successful connections, but that resulted in memory leaks. Instead we
should not clear `tls_key` in nvmet_auth_sq_free(), as that was
incorrectly wiping the TLS keys used for the session.
This patch ensures we correctly free the ephemeral session key on
connection, yet we don't free the TLS key unless closing the connection.
Reviewed-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
In an attempt to fix REPLACETLSPSK we stopped freeing the secrets on
successful connections. This resulted in memory leaks in the kernel, so
let's revert the commit. An improved fix is being developed that avoids
clearing the tls_key variable.
This reverts commit 2e6eb6b277f593b98f151ea8eff1beb558bbea3b.
Closes: https://lore.kernel.org/linux-nvme/CAHj4cs-u3MWQR4idywptMfjEYi4YwObWFx4KVib35dZ5HMBDdw@mail.gmail.com
Reviewed-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The Memblaze Pblaze5 NVMe device (PCI ID 0x1c5f:0x0555)
is detected as a controller on recent kernels (tested on 5.15.85
and 6.8.4), but no namespace is exposed.
Tools like lsblk and fdisk do not report any block device.
dmesg shows:
nvme nvme0: missing or invalid SUBNQN field.
The device works correctly on older kernels (e.g. 4.19), suggesting
a compatibility issue with newer namespace handling.
This indicates the device does not properly support the
Namespace Descriptor List feature.
Applying NVME_QUIRK_NO_NS_DESC_LIST allows the namespace to be
discovered correctly.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Tao Jiang <tanroame.kyle@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The delayed disk removal work is canceled when an NS (re)appears. However,
we do not put the module reference grabbed in nvme_mpath_remove_disk(), so
fix that.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
It is not possible to determine the active TLS mode from the
presence or absence of sysfs attributes like tls_key,
tls_configured_key, or dhchap_secret.
With the introduction of the concat mode and optional DH-CHAP
authentication, different configurations can result in identical
sysfs state. This makes user space detection unreliable.
Expose the TLS mode explicitly to allow user space to
unambiguously identify the active configuration and avoid
fragile heuristics in nvme-cli.
Reviewed-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Commit 03b3bcd319b3 ("nvme: fix admin request_queue lifetime") moved the
admin queue reference ->put call into nvme_free_ctrl() - a controller
device release callback performed for every nvme driver doing
nvme_init_ctrl().
nvme-apple sets refcount of the admin queue to 1 at allocation during the
probe function and then puts it twice now:
nvme_free_ctrl()
blk_put_queue(ctrl->admin_q) // #1
->free_ctrl()
apple_nvme_free_ctrl()
blk_put_queue(anv->ctrl.admin_q) // #2
Note that commit 941f7298c70c ("nvme-apple: remove an extra queue
reference") intended to drop taking an extra admin queue reference.
It looks like at that moment it accidentally fixed a refcount leak which
had existed since the driver's introduction: there were two ->get calls
in the driver's probe function and a single ->put inside
apple_nvme_free_ctrl().
However now after commit 03b3bcd319b3 ("nvme: fix admin request_queue
lifetime") the refcount is imbalanced again. Fix it by removing extra
->put call from apple_nvme_free_ctrl(). anv->dev and ctrl->dev point to
the same device, so use ctrl->dev directly for simplification. Compile
tested only.
Found by Linux Verification Center (linuxtesting.org).
Fixes: 03b3bcd319b3 ("nvme: fix admin request_queue lifetime")
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Signed-off-by: Keith Busch <kbusch@kernel.org>
In the declaration of the "core_quirks[]" structure, the comment
referring to the "Kioxia CD6-V Series / HPE PE8030" devices reports the
"default_ps_max_latency_us" parameter incorrectly as:
nvme_core.default_ps_max_latency=0
The correct form is instead:
nvme_core.default_ps_max_latency_us=0
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Flavio Suligoi <f.suligoi@asem.it>
Signed-off-by: Keith Busch <kbusch@kernel.org>
nvmet_tcp_release_queue_work() runs on nvmet-wq and can drop the
final controller reference through nvmet_cq_put(). If that triggers
nvmet_ctrl_free(), the teardown path flushes ctrl->async_event_work on
the same nvmet-wq.
Call chain:
nvmet_tcp_schedule_release_queue()
kref_put(&queue->kref, nvmet_tcp_release_queue)
nvmet_tcp_release_queue()
queue_work(nvmet_wq, &queue->release_work) <--- nvmet_wq
process_one_work()
nvmet_tcp_release_queue_work()
nvmet_cq_put(&queue->nvme_cq)
nvmet_cq_destroy()
nvmet_ctrl_put(cq->ctrl)
nvmet_ctrl_free()
flush_work(&ctrl->async_event_work) <--- nvmet_wq
Previously scheduled by :-
nvmet_add_async_event
queue_work(nvmet_wq, &ctrl->async_event_work);
This trips lockdep with a possible recursive locking warning.
[ 5223.015876] run blktests nvme/003 at 2026-04-07 20:53:55
[ 5223.061801] loop0: detected capacity change from 0 to 2097152
[ 5223.072206] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 5223.088368] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 5223.126086] nvmet: Created discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 5223.128453] nvme nvme1: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 5233.199447] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[ 5233.227718] ============================================
[ 5233.231283] WARNING: possible recursive locking detected
[ 5233.234696] 7.0.0-rc3nvme+ #20 Tainted: G O N
[ 5233.238434] --------------------------------------------
[ 5233.241852] kworker/u192:6/2413 is trying to acquire lock:
[ 5233.245429] ffff888111632548 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x26/0x90
[ 5233.251438]
but task is already holding lock:
[ 5233.255254] ffff888111632548 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x5cc/0x6e0
[ 5233.261125]
other info that might help us debug this:
[ 5233.265333] Possible unsafe locking scenario:
[ 5233.269217] CPU0
[ 5233.270795] ----
[ 5233.272436] lock((wq_completion)nvmet-wq);
[ 5233.275241] lock((wq_completion)nvmet-wq);
[ 5233.278020]
*** DEADLOCK ***
[ 5233.281793] May be due to missing lock nesting notation
[ 5233.286195] 3 locks held by kworker/u192:6/2413:
[ 5233.289192] #0: ffff888111632548 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x5cc/0x6e0
[ 5233.294569] #1: ffffc9000e2a7e40 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x1c5/0x6e0
[ 5233.300128] #2: ffffffff82d7dc40 (rcu_read_lock){....}-{1:3}, at: __flush_work+0x62/0x530
[ 5233.304290]
stack backtrace:
[ 5233.306520] CPU: 4 UID: 0 PID: 2413 Comm: kworker/u192:6 Tainted: G O N 7.0.0-rc3nvme+ #20 PREEMPT(full)
[ 5233.306524] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 5233.306525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[ 5233.306527] Workqueue: nvmet-wq nvmet_tcp_release_queue_work [nvmet_tcp]
[ 5233.306532] Call Trace:
[ 5233.306534] <TASK>
[ 5233.306536] dump_stack_lvl+0x73/0xb0
[ 5233.306552] print_deadlock_bug+0x225/0x2f0
[ 5233.306556] __lock_acquire+0x13f0/0x2290
[ 5233.306563] lock_acquire+0xd0/0x300
[ 5233.306565] ? touch_wq_lockdep_map+0x26/0x90
[ 5233.306571] ? __flush_work+0x20b/0x530
[ 5233.306573] ? touch_wq_lockdep_map+0x26/0x90
[ 5233.306577] touch_wq_lockdep_map+0x3b/0x90
[ 5233.306580] ? touch_wq_lockdep_map+0x26/0x90
[ 5233.306583] ? __flush_work+0x20b/0x530
[ 5233.306585] __flush_work+0x268/0x530
[ 5233.306588] ? __pfx_wq_barrier_func+0x10/0x10
[ 5233.306594] ? xen_error_entry+0x30/0x60
[ 5233.306600] nvmet_ctrl_free+0x140/0x310 [nvmet]
[ 5233.306617] nvmet_cq_put+0x74/0x90 [nvmet]
[ 5233.306629] nvmet_tcp_release_queue_work+0x19f/0x360 [nvmet_tcp]
[ 5233.306634] process_one_work+0x206/0x6e0
[ 5233.306640] worker_thread+0x184/0x320
[ 5233.306643] ? __pfx_worker_thread+0x10/0x10
[ 5233.306646] kthread+0xf1/0x130
[ 5233.306648] ? __pfx_kthread+0x10/0x10
[ 5233.306651] ret_from_fork+0x355/0x450
[ 5233.306653] ? __pfx_kthread+0x10/0x10
[ 5233.306656] ret_from_fork_asm+0x1a/0x30
[ 5233.306664] </TASK>
There is also no need to flush async_event_work from controller
teardown. The admin queue teardown already fails outstanding AER
requests before the final controller put :-
nvmet_sq_destroy(admin sq)
nvmet_async_events_failall(ctrl)
The controller has already been removed from the subsystem list before
nvmet_ctrl_free() quiesces outstanding work.
Replace flush_work() with cancel_work_sync() so a pending
async_event_work item is canceled and a running instance is waited on
without recursing into the same workqueue.
Fixes: 06406d81a2d7 ("nvmet: cancel fatal error and flush async work before free controller")
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
An NS will always have a head pointer, so drop the check. In practice,
all nvme_mpath_clear_current_path() callers already dereference
ns->head.
This check has endured since the original changes to support multipath.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The firmware for the Samsung 970 Evo Plus / PM981 / PM983 does not support
SUBNQN. Add quirks to suppress the warnings.
# nvme id-ctrl /dev/nvme1n1
NVME Identify Controller:
vid : 0x144d
ssvid : 0x144d
sn : ***
mn : Samsung SSD 970 EVO Plus 500GB
fr : 2B2QEXM7
mcdqpc : 0
subnqn :
ioccsz : 0
Signed-off-by: Alan Cui <me@alancui.cc>
Signed-off-by: Keith Busch <kbusch@kernel.org>
nvmet_tcp_handle_icreq() updates queue->state after sending an
Initialization Connection Response (ICResp), but it does so without
serializing against target-side queue teardown.
If an NVMe/TCP host sends an Initialization Connection Request
(ICReq) and immediately closes the connection, target-side teardown
may start in softirq context before io_work drains the already
buffered ICReq. In that case, nvmet_tcp_schedule_release_queue()
sets queue->state to NVMET_TCP_Q_DISCONNECTING and drops the queue
reference under state_lock.
If io_work later processes that ICReq, nvmet_tcp_handle_icreq() can
still overwrite the state back to NVMET_TCP_Q_LIVE. That defeats the
DISCONNECTING-state guard in nvmet_tcp_schedule_release_queue() and
allows a later socket state change to re-enter teardown and issue a
second kref_put() on an already released queue.
The ICResp send failure path has the same problem. If teardown has
already moved the queue to DISCONNECTING, a send error can still
overwrite the state with NVMET_TCP_Q_FAILED, again reopening the
window for a second teardown path to drop the queue reference.
Fix this by serializing both post-send state transitions with
state_lock and bailing out if teardown has already started.
Use -ESHUTDOWN as an internal sentinel for that bail-out path rather
than propagating it as a transport error like -ECONNRESET. Keep
nvmet_tcp_socket_error() setting rcv_state to NVMET_TCP_RECV_ERR before
honoring that sentinel so receive-side parsing stays quiesced until the
existing release path completes.
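The locking pattern can be illustrated with a minimal userspace sketch; the names (handle_icreq_finish(), schedule_release_queue()) and the reduced state machine are illustrative stand-ins, not the actual nvmet-tcp code:

```c
#include <pthread.h>
#include <errno.h>

enum q_state { Q_CONNECTING, Q_LIVE, Q_DISCONNECTING, Q_FAILED };

struct tcp_queue {
	pthread_mutex_t state_lock;
	enum q_state state;
};

/* Teardown path: mark the queue DISCONNECTING under state_lock. */
static void schedule_release_queue(struct tcp_queue *q)
{
	pthread_mutex_lock(&q->state_lock);
	q->state = Q_DISCONNECTING;
	pthread_mutex_unlock(&q->state_lock);
}

/*
 * Post-ICResp transition: only move to LIVE (or FAILED on a send
 * error) if teardown has not already started; otherwise bail out
 * with the -ESHUTDOWN sentinel instead of overwriting the state.
 */
static int handle_icreq_finish(struct tcp_queue *q, int send_failed)
{
	int ret = 0;

	pthread_mutex_lock(&q->state_lock);
	if (q->state == Q_DISCONNECTING)
		ret = -ESHUTDOWN;	/* teardown won the race */
	else
		q->state = send_failed ? Q_FAILED : Q_LIVE;
	pthread_mutex_unlock(&q->state_lock);
	return ret;
}
```

With this shape, a late ICReq processed after teardown started can no longer resurrect the queue to LIVE, so the DISCONNECTING guard holds and only one kref_put() ever runs.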
Fixes: c46a6465bac2 ("nvmet-tcp: add NVMe over TCP target driver")
Cc: stable@vger.kernel.org
Reported-by: Shivam Kumar <skumar47@syr.edu>
Tested-by: Shivam Kumar <kumar.shivam43666@gmail.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Executing nvmet_tcp_fatal_error() is generally the responsibility
of the caller (nvmet_tcp_try_recv); all other functions should
just return the error code.
Remove the nvmet_tcp_fatal_error() function, it's not needed
anymore.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Currently, when nvmet_tcp_build_pdu_iovec() detects an out-of-bounds
PDU length or offset, it triggers nvmet_tcp_fatal_error(cmd->queue)
and returns early. However, because the function returns void, the
callers are entirely unaware that a fatal error has occurred and
that the cmd->recv_msg.msg_iter was left uninitialized.
Callers such as nvmet_tcp_handle_h2c_data_pdu() proceed to blindly
overwrite the queue state with queue->rcv_state = NVMET_TCP_RECV_DATA.
Consequently, the socket receiving loop may attempt to read incoming
network data into the uninitialized iterator.
Fix this by shifting the error handling responsibility to the callers.
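A simplified userspace sketch of the resulting control flow (the struct, the RECV_DATA constant, and the helper names are illustrative, not the real nvmet-tcp definitions):

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the queue/command state. */
struct pdu_cmd { size_t data_len; size_t offset; int rcv_state; };
#define RECV_DATA 1
#define DATA_CAP  4096

/*
 * Bounds-check first and report failure to the caller instead of
 * silently leaving the receive iterator uninitialized.
 */
static int build_pdu_iovec(struct pdu_cmd *cmd, size_t off, size_t len)
{
	if (off > DATA_CAP || len > DATA_CAP - off)
		return -EPROTO;	/* caller decides how to tear down */
	cmd->offset = off;
	cmd->data_len = len;
	return 0;
}

/* Caller only flips rcv_state once the iovec was actually built. */
static int handle_h2c_data_pdu(struct pdu_cmd *cmd, size_t off, size_t len)
{
	int ret = build_pdu_iovec(cmd, off, len);

	if (ret)
		return ret;	/* propagate: do not start receiving */
	cmd->rcv_state = RECV_DATA;
	return 0;
}
```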
Fixes: 52a0a9854934 ("nvmet-tcp: add bounds checks in nvmet_tcp_build_pdu_iovec")
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yunje Shin <ioerts@kookmin.ac.kr>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Enable BLK_FEAT_PCI_P2PDMA on NVMe queues when the underlying
RDMA controller supports it.
Suggested-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Henrique Carvalho <henrique.carvalho@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shivaji Kant <shivajikant@google.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Using this port configuration, one will be able to set the Maximum Data
Transfer Size (MDTS) for any controller that will be associated to the
configured port. The default value remains 0 (no limit).
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Aurelien Aptel <aaptel@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The generic fabrics layer uses request_module("nvme-%s", opts->transport)
to auto-load transport modules. Currently, the nvme-tcp, nvme-rdma, and
nvme-fc modules lack MODULE_ALIAS entries for these names, which prevents
the kernel from automatically finding and loading them when requested.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Signed-off-by: Keith Busch <kbusch@kernel.org>
In nvmet_tcp_try_recv_ddgst(), when a data digest mismatch is detected,
nvmet_req_uninit() is called unconditionally. However, if the command
arrived via the nvmet_tcp_handle_req_failure() path, nvmet_req_init()
had returned false and percpu_ref_tryget_live() was never executed. The
unconditional percpu_ref_put() inside nvmet_req_uninit() then causes a
refcount underflow, leading to a WARNING in
percpu_ref_switch_to_atomic_rcu, a use-after-free diagnostic, and
eventually a permanent workqueue deadlock.
Check cmd->flags & NVMET_TCP_F_INIT_FAILED before calling
nvmet_req_uninit(), matching the existing pattern in
nvmet_tcp_execute_request().
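The fix reduces to dropping the reference only when it was actually taken; here is a minimal sketch of that invariant (the flag name mirrors NVMET_TCP_F_INIT_FAILED, but the structs and the plain counter standing in for the percpu ref are simplifications):

```c
#define F_INIT_FAILED 0x1

struct tcp_cmd { unsigned int flags; };
struct req { int ref; };	/* stand-in for the percpu ref */

static void req_uninit(struct req *r) { r->ref--; }

/*
 * Only drop the reference taken by a successful req_init(); a command
 * that came in through the failure path never took one, so an
 * unconditional put would underflow the refcount.
 */
static void handle_ddgst_mismatch(struct tcp_cmd *cmd, struct req *r)
{
	if (!(cmd->flags & F_INIT_FAILED))
		req_uninit(r);
}
```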
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shivam Kumar <kumar.shivam43666@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
wbt_init_enable_default() uses WARN_ON_ONCE to check for failures from
wbt_alloc() and wbt_init(). However, both are expected failure paths:
- wbt_alloc() can return NULL under memory pressure (-ENOMEM)
- wbt_init() can fail with -EBUSY if wbt is already registered
syzbot triggers this by injecting memory allocation failures during MTD
partition creation via ioctl(BLKPG), causing a spurious warning.
wbt_init_enable_default() is a best-effort initialization called from
blk_register_queue() with a void return type. Failure simply means the
disk operates without writeback throttling, which is harmless.
Replace WARN_ON_ONCE with plain if-checks, consistent with how
wbt_set_lat() in the same file already handles these failures. Add a
pr_warn() for the wbt_init() failure to retain diagnostic information
without triggering a full stack trace.
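The shape of the change can be sketched as follows; the _sim helpers and the failure injection are illustrative, not the actual blk-wbt code:

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

struct rq_wb { int registered; };

/* Simulated failure modes: NULL under memory pressure, -EBUSY if
 * wbt is already registered on the queue. */
static struct rq_wb *wbt_alloc_sim(struct rq_wb *pool) { return pool; }
static int wbt_init_sim(struct rq_wb *rwb, int already)
{ return already ? -EBUSY : 0; }

/* Best-effort enable: plain if-checks instead of WARN_ON_ONCE(). */
static int wbt_init_enable_default_sim(struct rq_wb *pool, int already)
{
	struct rq_wb *rwb = wbt_alloc_sim(pool);
	int ret;

	if (!rwb)
		return -ENOMEM;	/* silent: disk just runs without wbt */
	ret = wbt_init_sim(rwb, already);
	if (ret) {
		fprintf(stderr, "wbt: init failed (%d)\n", ret);
		return ret;
	}
	return 0;
}
```

Either failure leaves the queue functional, just without writeback throttling, so no stack trace is warranted.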
Reported-by: syzbot+71fcf20f7c1e5043d78c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=71fcf20f7c1e5043d78c
Fixes: 41afaeeda509 ("blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter")
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://patch.msgid.link/20260316070358.65225-2-ytohnuki@amazon.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Before the fix, teardown of a ublk server that was attempting to recover
a device but died after submitting a nonempty proper subset of the fetch
commands to some queue would loop forever. Add a test to verify
that, after the fix, teardown completes. This is done by:
- Adding a new argument to the fault_inject target that causes it to die
after fetching a nonempty proper subset of the IOs to a queue
- Using that argument in a new test while trying to recover an
already-created device
- Attempting to delete the ublk device at the end of the test; this
hangs forever if teardown from the fault-injected ublk server never
completed.
It was manually verified that the test passes with the fix and hangs
without it.
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260405-cancel-v2-2-02d711e643c2@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a ublk server starts recovering devices but dies before issuing fetch
commands for all IOs, cancellation of the fetch commands that were
successfully issued may never complete. This is because the per-IO
canceled flag can remain set even after the fetch for that IO has been
submitted - the per-IO canceled flags for all IOs in a queue are reset
together only once all IOs for that queue have been fetched. So if a
nonempty proper subset of the IOs for a queue are fetched when the ublk
server dies, the IOs in that subset will never successfully be canceled,
as their canceled flags remain set, and this prevents ublk_cancel_cmd
from actually calling io_uring_cmd_done on the commands, despite the
fact that they are outstanding.
Fix this by resetting the per-IO cancel flags immediately when each IO
is fetched instead of waiting for all IOs for the queue (which may never
happen).
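A reduced model of the fix (the struct and helper names are illustrative; the real code deals with io_uring commands and per-queue state):

```c
struct ublk_io { int canceled; int fetched; };

/*
 * Reset the canceled flag as soon as this IO's fetch arrives, instead
 * of waiting until every IO in the queue has been fetched (which may
 * never happen if the server dies mid-recovery).
 */
static void io_fetched(struct ublk_io *io)
{
	io->fetched = 1;
	io->canceled = 0;
}

/*
 * Cancellation completes a fetched command only if it is not already
 * marked canceled; a stale flag would skip io_uring_cmd_done() and
 * leave the command outstanding forever.
 */
static int cancel_cmd(struct ublk_io *io)
{
	if (!io->fetched || io->canceled)
		return 0;	/* nothing to complete */
	io->canceled = 1;
	return 1;		/* io_uring_cmd_done() would run here */
}
```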
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Fixes: 728cbac5fe21 ("ublk: move device reset into ublk_ch_release()")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: zhang, the-essence-of-life <zhangweize9@gmail.com>
Link: https://patch.msgid.link/20260405-cancel-v2-1-02d711e643c2@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In the previous patch ("bcache: fix cached_dev.sb_bio use-after-free and
crash"), we adopted a simple modification suggested by AI to fix the
use-after-free.
But in actual testing, we found an extreme case where the device is
stopped before calling bch_write_bdev_super().
At this point, struct closure sb_write has not been initialized yet.
For this patch, we ensure that sb_bio has been completed via
sb_write_mutex.
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@fnnas.com>
Link: https://patch.msgid.link/20260403042135.2221247-1-colyli@fnnas.com
Fixes: fec114a98b87 ("bcache: fix cached_dev.sb_bio use-after-free and crash")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In our production environment, we have received multiple crash reports
regarding libceph, which have caught our attention:
```
[6888366.280350] Call Trace:
[6888366.280452] blk_update_request+0x14e/0x370
[6888366.280561] blk_mq_end_request+0x1a/0x130
[6888366.280671] rbd_img_handle_request+0x1a0/0x1b0 [rbd]
[6888366.280792] rbd_obj_handle_request+0x32/0x40 [rbd]
[6888366.280903] __complete_request+0x22/0x70 [libceph]
[6888366.281032] osd_dispatch+0x15e/0xb40 [libceph]
[6888366.281164] ? inet_recvmsg+0x5b/0xd0
[6888366.281272] ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
[6888366.281405] ceph_con_process_message+0x79/0x140 [libceph]
[6888366.281534] ceph_con_v1_try_read+0x5d7/0xf30 [libceph]
[6888366.281661] ceph_con_workfn+0x329/0x680 [libceph]
```
After analyzing the coredump file, we found that the address of
dc->sb_bio has been freed. We know that cached_dev is only freed when it
is stopped.
Since sb_bio is part of struct cached_dev rather than allocated anew for
each write, if the device is stopped while the superblock is being
written, the freed address will be accessed at endio.
This patch hopes to wait for sb_write to complete in cached_dev_free.
It should be noted that we analyzed the cause of the problem, then
provided all the details to Qwen and adopted the modifications it made.
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Fixes: cafe563591446 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Coly Li <colyli@fnnas.com>
Link: https://patch.msgid.link/20260322134102.480107-1-colyli@fnnas.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace sprintf() with sysfs_emit() in sysfs show functions.
sysfs_emit() is preferred for formatting sysfs output because it
provides safer bounds checking.
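Conceptually, sysfs_emit() behaves like an snprintf() bounded to one page; the mock below illustrates the contract (the _sim names and PAGE_SIZE_SIM are stand-ins, not the kernel API):

```c
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE_SIM 4096

/*
 * sysfs_emit() writes at most one page and returns the number of
 * characters emitted; an unbounded sprintf() has no such guarantee.
 */
static int sysfs_emit_sim(char *buf, const char *fmt, int val)
{
	return snprintf(buf, PAGE_SIZE_SIM, fmt, val);
}

/* A sysfs show function then becomes a single bounded emit. */
static int queue_depth_show_sim(char *buf, int depth)
{
	return sysfs_emit_sim(buf, "%d\n", depth);
}
```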
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260402164958.894879-4-thorsten.blum@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix a simple naming issue in the documentation: the completion
routine is called bi_end_io and not bi_complete.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20260401135854.125109-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a bio is allocated from the mempool with REQ_ALLOC_CACHE set and
later completed, bio_put() places it into the per-cpu bio_alloc_cache
via bio_put_percpu_cache() instead of freeing it back to the
mempool/slab. The slab allocation remains tracked by kmemleak, but the
only reference to the bio is through the percpu cache's free_list,
which kmemleak fails to trace through percpu memory. This causes
kmemleak to report the cached bios as unreferenced objects.
Use symmetric kmemleak_free()/kmemleak_alloc() calls to properly track
bios across percpu cache transitions:
- bio_put_percpu_cache: call kmemleak_free() when a bio enters the
cache, unregistering it from kmemleak tracking.
- bio_alloc_percpu_cache: call kmemleak_alloc() when a bio is taken
from the cache for reuse, re-registering it so that genuine leaks
of reused bios remain detectable.
- __bio_alloc_cache_prune: call kmemleak_alloc() before bio_free() so
that kmem_cache_free()'s internal kmemleak_free() has a matching
allocation to pair with.
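The symmetry requirement can be modeled with a simple registration counter; the _sim helpers below are an illustration of the pairing rules, not the kmemleak API itself:

```c
/* Track registrations so free/alloc stay symmetric, as kmemleak
 * requires: every untrack must eventually be matched by a re-track
 * (or the object really is gone). */
static int tracked;

static void kmemleak_alloc_sim(void) { tracked++; }
static void kmemleak_free_sim(void)  { tracked--; }

/* bio enters the percpu cache: hide it from kmemleak, since the
 * cache's free_list reference is not traceable. */
static void bio_put_percpu_cache_sim(void) { kmemleak_free_sim(); }

/* bio leaves the cache for reuse: make it visible again so genuine
 * leaks of reused bios remain detectable. */
static void bio_alloc_percpu_cache_sim(void) { kmemleak_alloc_sim(); }

/* cache prune: re-register first so the real free's internal untrack
 * has a matching allocation to pair with. */
static void bio_cache_prune_sim(void)
{
	kmemleak_alloc_sim();	/* pair with ... */
	kmemleak_free_sim();	/* ... kmem_cache_free()'s internal untrack */
}
```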
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260326144058.2392319-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a disk is saturated, it is common for no IOs to complete within a
timer period. Currently, in this case, rq_wait_pct and missed_ppm are
calculated as 0, and iocost incorrectly interprets this as meeting QoS
targets and resets busy_level to 0.
This reset prevents busy_level from reaching the threshold (4) needed
to reduce vrate. On certain cloud storage, such as Azure Premium SSD,
we observed that iocost may fail to reduce vrate for tens of seconds
during saturation, failing to mitigate noisy neighbor issues.
Fix this by tracking the number of IO completions (nr_done) in a period.
If nr_done is 0 and there are lagging IOs, the saturation status is
unknown, so we keep busy_level unchanged.
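The decision can be summarized in a small sketch; this is a loose model of the per-period logic, not the actual iocost code, and the function name and thresholds are illustrative:

```c
/*
 * Per-period busy_level update: when no IOs completed (nr_done == 0)
 * and some are still lagging, the saturation status is unknown, so
 * hold busy_level steady instead of resetting it to 0.
 */
static int next_busy_level(int busy_level, int rq_wait_pct, int missed_ppm,
			   int nr_done, int nr_lagging)
{
	if (!nr_done && nr_lagging)
		return busy_level;	/* saturation unknown: hold steady */
	if (rq_wait_pct > 0 || missed_ppm > 0)
		return busy_level + 1;	/* missing QoS: push toward vrate cut */
	return 0;			/* QoS met: reset */
}
```

With the old behavior, the first branch did not exist, so a fully saturated period (nr_done == 0) fell through to the reset and busy_level never reached the threshold needed to reduce vrate.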
The issue is consistently reproducible on Azure Standard_D8as_v5 (Dasv5)
VMs with 512GB Premium SSD (P20) using the script below. It was not
observed on GCP n2d VMs (with 100G pd-ssd and 1.5T local-ssd), and no
regressions were found with this patch. In this script, cgA performs
large IOs with iodepth=128, while cgB performs small IOs with iodepth=1
rate_iops=100 rw=randrw. With iocost enabled, we expect it to throttle
cgA: the submission latency (slat) of cgA should be significantly higher,
while cgB can reach 200 IOPS and its completion latency (clat) should be low.
BLK_DEVID="8:0"
MODEL="rbps=173471131 rseqiops=3566 rrandiops=3566 wbps=173333269 wseqiops=3566 wrandiops=3566"
QOS="rpct=90 rlat=3500 wpct=90 wlat=3500 min=80 max=10000"
echo "$BLK_DEVID ctrl=user model=linear $MODEL" > /sys/fs/cgroup/io.cost.model
echo "$BLK_DEVID enable=1 ctrl=user $QOS" > /sys/fs/cgroup/io.cost.qos
CG_A="/sys/fs/cgroup/cgA"
CG_B="/sys/fs/cgroup/cgB"
FILE_A="/path/to/sda/A.fio.testfile"
FILE_B="/path/to/sda/B.fio.testfile"
RESULT_DIR="./iocost_results_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$CG_A" "$CG_B" "$RESULT_DIR"
get_result() {
local file=$1
local label=$2
local results=$(jq -r '
.jobs[0].mixed |
( .iops | tonumber | round ) as $iops |
( .bw_bytes / 1024 / 1024 ) as $bps |
( .slat_ns.mean / 1000000 ) as $slat |
( .clat_ns.mean / 1000000 ) as $avg |
( .clat_ns.max / 1000000 ) as $max |
( .clat_ns.percentile["90.000000"] / 1000000 ) as $p90 |
( .clat_ns.percentile["99.000000"] / 1000000 ) as $p99 |
( .clat_ns.percentile["99.900000"] / 1000000 ) as $p999 |
( .clat_ns.percentile["99.990000"] / 1000000 ) as $p9999 |
"\($iops)|\($bps)|\($slat)|\($avg)|\($max)|\($p90)|\($p99)|\($p999)|\($p9999)"
' "$file")
IFS='|' read -r iops bps slat avg max p90 p99 p999 p9999 <<<"$results"
printf "%-8s %-6s %-7.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f\n" \
"$label" "$iops" "$bps" "$slat" "$avg" "$max" "$p90" "$p99" "$p999" "$p9999"
}
run_fio() {
local cg_path=$1
local filename=$2
local name=$3
local bs=$4
local qd=$5
local out=$6
shift 6
local extra=$@
(
pid=$(sh -c 'echo $PPID')
echo $pid >"${cg_path}/cgroup.procs"
fio --name="$name" --filename="$filename" --direct=1 --rw=randrw --rwmixread=50 \
--ioengine=libaio --bs="$bs" --iodepth="$qd" --size=4G --runtime=10 \
--time_based --group_reporting --unified_rw_reporting=mixed \
--output-format=json --output="$out" $extra >/dev/null 2>&1
) &
}
echo "Starting Test ..."
for bs_b in "4k" "32k" "256k"; do
echo "Running iteration: BS=$bs_b"
out_a="${RESULT_DIR}/cgA_1m.json"
out_b="${RESULT_DIR}/cgB_${bs_b}.json"
# cgA: Heavy background (BS 1MB, QD 128)
run_fio "$CG_A" "$FILE_A" "cgA" "1m" 128 "$out_a"
# cgB: Latency sensitive (Variable BS, QD 1, Read/Write IOPS limit 100)
run_fio "$CG_B" "$FILE_B" "cgB" "$bs_b" 1 "$out_b" "--rate_iops=100"
wait
SUMMARY_DATA+="$(get_result "$out_a" "cgA-1m")"$'\n'
SUMMARY_DATA+="$(get_result "$out_b" "cgB-$bs_b")"$'\n\n'
done
echo -e "\nFinal Results Summary:\n"
printf "%-8s %-6s %-7s %-8s %-8s %-8s %-8s %-8s %-8s %-8s\n" \
"" "" "" "slat" "clat" "clat" "clat" "clat" "clat" "clat"
printf "%-8s %-6s %-7s %-8s %-8s %-8s %-8s %-8s %-8s %-8s\n\n" \
"CGROUP" "IOPS" "MB/s" "avg(ms)" "avg(ms)" "max(ms)" "P90(ms)" "P99" "P99.9" "P99.99"
echo "$SUMMARY_DATA"
echo "Results saved in $RESULT_DIR"
Before:
slat clat clat clat clat clat clat
CGROUP IOPS MB/s avg(ms) avg(ms) max(ms) P90(ms) P99 P99.9 P99.99
cgA-1m 166 166.37 3.44 748.95 1298.29 977.27 1233.13 1300.23 1300.23
cgB-4k 5 0.02 0.02 181.74 761.32 742.39 759.17 759.17 759.17
cgA-1m 167 166.51 1.98 748.68 1549.41 809.50 1451.23 1551.89 1551.89
cgB-32k 6 0.18 0.02 169.98 761.76 742.39 759.17 759.17 759.17
cgA-1m 166 165.55 2.89 750.89 1540.37 851.44 1451.23 1535.12 1535.12
cgB-256k 5 1.30 0.02 191.35 759.51 750.78 759.17 759.17 759.17
After:
slat clat clat clat clat clat clat
CGROUP IOPS MB/s avg(ms) avg(ms) max(ms) P90(ms) P99 P99.9 P99.99
cgA-1m 162 162.48 6.14 749.69 850.02 826.28 834.67 843.06 851.44
cgB-4k 199 0.78 0.01 1.95 42.12 2.57 7.50 34.87 42.21
cgA-1m 146 146.20 6.83 833.04 908.68 893.39 901.78 910.16 910.16
cgB-32k 200 6.25 0.01 2.32 31.40 3.06 7.50 16.58 31.33
cgA-1m 110 110.46 9.04 1082.67 1197.91 1182.79 1199.57 1199.57 1199.57
cgB-256k 200 49.98 0.02 3.69 22.20 4.88 9.11 20.05 22.15
Signed-off-by: Jialin Wang <wjl.linux@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://patch.msgid.link/20260331100509.182882-1-wjl.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add the missing put_disk() on the error path in
blkcg_maybe_throttle_current(). When blkcg lookup, blkg lookup, or
blkg_tryget() fails, the function jumps to the out label which only
calls rcu_read_unlock() but does not release the disk reference acquired
by blkcg_schedule_throttle() via get_device(). Since current->throttle_disk
is already set to NULL before the lookup, blkcg_exit() cannot release
this reference either, causing the disk to never be freed.
Restore the reference release that was present as blk_put_queue() in the
original code but was inadvertently dropped during the conversion from
request_queue to gendisk.
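The invariant is simply that the reference handed over by blkcg_schedule_throttle() must be dropped on every exit path; a minimal model (the _sim helpers and the plain counter are stand-ins for get_device()/put_disk()):

```c
struct disk { int ref; };

static void get_disk_sim(struct disk *d) { d->ref++; }
static void put_disk_sim(struct disk *d) { d->ref--; }

/*
 * The throttle path owns one disk reference; every exit, including
 * the failed-lookup path, must drop it exactly once. Before the fix,
 * the early-exit path skipped the put and leaked the disk.
 */
static void maybe_throttle_sim(struct disk *d, int lookup_ok)
{
	if (!lookup_ok)
		goto out;	/* previously returned without the put */
	/* ... throttling work ... */
out:
	put_disk_sim(d);
}
```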
Fixes: f05837ed73d0 ("blk-cgroup: store a gendisk to throttle in struct task_struct")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260331085054.46857-1-liu.yun@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Introduce the new max_open_zones option to allow specifying a limit on
the maximum number of open zones of a zloop device. This change allows
creating a zloop device that can more closely mimic the characteristics
of a physical SMR drive.
When set to a non-zero value, only up to max_open_zones zones can be in
the implicit open (BLK_ZONE_COND_IMP_OPEN) and explicit open
(BLK_ZONE_COND_EXP_OPEN) conditions at any time. The transition to the
implicit open condition of a zone on a write operation can result in an
implicit close of an already implicitly open zone. This is handled in
the function zloop_do_open_zone(). This function also handles
transitions to the explicit open condition. Implicit close transitions
are handled using an LRU ordered list of open zones which is managed
using the helper functions zloop_lru_rotate_open_zone() and
zloop_lru_remove_open_zone().
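The LRU eviction mechanism can be sketched in userspace as follows; the structures and helper names are illustrative simplifications, not the zloop implementation (in particular, re-touching an already open zone should also rotate the LRU, which is elided here):

```c
#include <string.h>

#define MAX_ZONES 8

/* Zone conditions, mirroring the BLK_ZONE_COND_* open states. */
enum zcond { Z_EMPTY, Z_IMP_OPEN, Z_EXP_OPEN, Z_CLOSED };

struct zdev {
	enum zcond cond[MAX_ZONES];
	int lru[MAX_ZONES];	/* open zones, oldest first */
	int nr_open, max_open;
};

/* Implicitly close the least recently used open zone. */
static void close_lru_zone(struct zdev *d)
{
	int victim = d->lru[0];

	memmove(d->lru, d->lru + 1, (--d->nr_open) * sizeof(int));
	d->cond[victim] = Z_CLOSED;
}

/* Open a zone for a write, evicting an LRU zone if at the limit. */
static int open_zone(struct zdev *d, int z)
{
	if (d->cond[z] == Z_IMP_OPEN || d->cond[z] == Z_EXP_OPEN)
		return 0;
	if (d->max_open && d->nr_open == d->max_open)
		close_lru_zone(d);
	d->cond[z] = Z_IMP_OPEN;
	d->lru[d->nr_open++] = z;
	return 0;
}
```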
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260326203245.946830-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When blk_revalidate_disk_zones() fails after disk_revalidate_zone_resources()
has allocated args.zones_cond, the memory is leaked because no error path
frees it.
Fixes: 6e945ffb6555 ("block: use zone condition to determine conventional zones")
Suggested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Link: https://patch.msgid.link/20260331111216.24242-1-liu.yun@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When LOOP_CONFIGURE is called with LO_FLAGS_PARTSCAN, the following
sequence occurs:
1. disk_force_media_change() sets GD_NEED_PART_SCAN
2. Uevent suppression is lifted and a KOBJ_CHANGE uevent is sent
3. loop_global_unlock() releases the lock
4. loop_reread_partitions() calls bdev_disk_changed() to scan
There is a race between steps 2 and 4: when udev receives the uevent
and opens the device before loop_reread_partitions() runs,
blkdev_get_whole() in bdev.c sees GD_NEED_PART_SCAN set and calls
bdev_disk_changed() for a first scan. Then loop_reread_partitions()
does a second scan. The open_mutex serializes these two scans, but
does not prevent both from running.
The second scan in bdev_disk_changed() drops all partition devices
from the first scan (via blk_drop_partitions()) before re-adding
them, causing partition block devices to briefly disappear. This
breaks any systemd unit with BindsTo= on the partition device: systemd
observes the device going dead, fails the dependent units, and does
not retry them when the device reappears.
Fix this by removing the GD_NEED_PART_SCAN set from
disk_force_media_change() entirely. None of the current callers need
the lazy on-open partition scan triggered by this flag:
- floppy: sets GENHD_FL_NO_PART, so disk_has_partscan() is always
false and GD_NEED_PART_SCAN has no effect.
- loop (loop_configure, loop_change_fd): when LO_FLAGS_PARTSCAN is
set, loop_reread_partitions() performs an explicit scan. When not
set, GD_SUPPRESS_PART_SCAN prevents the lazy scan path.
- loop (__loop_clr_fd): calls bdev_disk_changed() explicitly if
LO_FLAGS_PARTSCAN is set.
- nbd (nbd_clear_sock_ioctl): capacity is set to zero immediately
after; nbd manages GD_NEED_PART_SCAN explicitly elsewhere.
With GD_NEED_PART_SCAN no longer set by disk_force_media_change(),
udev opening the loop device after the uevent no longer triggers a
redundant scan in blkdev_get_whole(), and only the single explicit
scan from loop_reread_partitions() runs.
A regression test for this bug has been submitted to blktests:
https://github.com/linux-blktests/blktests/pull/240.
Fixes: 9f65c489b68d ("loop: raise media_change event")
Signed-off-by: Daan De Meyer <daan@amutable.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://patch.msgid.link/20260331105130.1077599-1-daan@amutable.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The TCG Opal device could enter a state where no new session can be
created, blocking even Discovery or PSID reset. While a power cycle
or waiting for the timeout should work, there is another possibility
for recovery: using the Stack Reset command.
The Stack Reset command is defined in the TCG Storage Architecture Core
Specification and is mandatory for all Opal devices (see Section 3.3.6
of the Opal SSC specification).
This patch implements the Stack Reset command. Sending it should clear
all active sessions immediately, allowing subsequent commands to run
successfully. While it is a TCG transport layer command, the Linux
kernel implements only Opal ioctls, so it makes sense to use the
IOC_OPAL ioctl interface.
The Stack Reset takes no arguments; the response can be success or pending.
If the command reports a pending state, userspace can try to repeat it;
in this case, the code returns -EBUSY.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Ondrej Kozina <okozina@redhat.com>
Link: https://patch.msgid.link/20260310095349.411287-1-gmazyland@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull NVMe updates from Keith:
"- Fabrics authentication updates (Eric, Alistar)
- Enhanced block queue limits support (Caleb)
- Workqueue usage updates (Marco)
- A new write zeroes device quirk (Robert)
- Tagset cleanup fix for loop device (Nilay)"
* tag 'nvme-7.1-2026-03-27' of git://git.infradead.org/nvme: (41 commits)
nvme-loop: do not cancel I/O and admin tagset during ctrl reset/shutdown
nvme: add WQ_PERCPU to alloc_workqueue users
nvmet-fc: add WQ_PERCPU to alloc_workqueue users
nvmet: replace use of system_wq with system_percpu_wq
nvme-auth: Don't propose NVME_AUTH_DHGROUP_NULL with SC_C
nvme: Add the DHCHAP maximum HD IDs
nvme-pci: add NVME_QUIRK_DISABLE_WRITE_ZEROES for Kingston OM3SGP4
nvme: respect NVME_QUIRK_DISABLE_WRITE_ZEROES when wzsl is set
nvmet: report NPDGL and NPDAL
nvmet: use NVME_NS_FEAT_OPTPERF_SHIFT
nvme: set discard_granularity from NPDG/NPDA
nvme: add from0based() helper
nvme: always issue I/O Command Set specific Identify Namespace
nvme: update nvme_id_ns OPTPERF constants
nvme: fold nvme_config_discard() into nvme_update_disk_info()
nvme: add preferred I/O size fields to struct nvme_id_ns_nvm
nvme: Allow reauth from sysfs
nvme: Expose the tls_configured sysfs for secure concat connections
nvmet-tcp: Don't free SQ on authentication success
nvmet-tcp: Don't error if TLS is enabed on a reset
...
Make drbd_adm_dump_devices() call rcu_read_lock() before
rcu_read_unlock() is called. This has been detected by the Clang
thread-safety analyzer.
Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Fixes: a55bbd375d18 ("drbd: Backport the "status" command")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://patch.msgid.link/20260326214054.284593-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cancelling the I/O and admin tagsets during nvme-loop controller reset
or shutdown is unnecessary. The subsequent destruction of the I/O and
admin queues already waits for all in-flight target operations to
complete.
Cancelling the tagsets first also opens a race window. After a request
tag has been cancelled, a late completion from the target may still
arrive before the queues are destroyed. In that case the completion path
may access a request whose tag has already been cancelled or freed,
which can lead to a kernel crash. Please see below the kernel crash
encountered while running blktests nvme/040:
run blktests nvme/040 at 2026-03-08 06:34:27
loop0: detected capacity change from 0 to 2097152
nvmet: adding nsid 1 to subsystem blktests-subsystem-1
nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
nvme nvme6: creating 96 I/O queues.
nvme nvme6: new ctrl: "blktests-subsystem-1"
nvme_log_error: 1 callbacks suppressed
block nvme6n1: no usable path - requeuing I/O
nvme6c6n1: Read(0x2) @ LBA 2096384, 128 blocks, Host Aborted Command (sct 0x3 / sc 0x71)
blk_print_req_error: 1 callbacks suppressed
I/O error, dev nvme6c6n1, sector 2096384 op 0x0:(READ) flags 0x2880700 phys_seg 1 prio class 2
block nvme6n1: no usable path - requeuing I/O
Kernel attempted to read user page (236) - exploit attempt? (uid: 0)
BUG: Kernel NULL pointer dereference on read at 0x00000236
Faulting instruction address: 0xc000000000961274
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: nvme_loop nvme_fabrics loop nvmet null_blk rpadlpar_io rpaphp xsk_diag bonding rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink pseries_rng dax_pmem vmx_crypto drm drm_panel_orientation_quirks xfs mlx5_core nvme bnx2x sd_mod nd_pmem nd_btt nvme_core sg papr_scm tls libnvdimm ibmvscsi ibmveth scsi_transport_srp nvme_keyring nvme_auth mdio hkdf pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: loop]
CPU: 25 UID: 0 PID: 0 Comm: swapper/25 Kdump: loaded Not tainted 7.0.0-rc3+ #14 PREEMPT
Hardware name: IBM,9043-MRX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1120.00 (RF1120_128) hv:phyp pSeries
NIP: c000000000961274 LR: c008000009af1808 CTR: c00000000096124c
REGS: c0000007ffc0f910 TRAP: 0300 Not tainted (7.0.0-rc3+)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22222222 XER: 00000000
CFAR: c008000009af232c DAR: 0000000000000236 DSISR: 40000000 IRQMASK: 0
GPR00: c008000009af17fc c0000007ffc0fbb0 c000000001c78100 c0000000be05cc00
GPR04: 0000000000000001 0000000000000000 0000000000000007 0000000000000000
GPR08: 0000000000000000 0000000000000000 0000000000000002 c008000009af2318
GPR12: c00000000096124c c0000007ffdab880 0000000000000000 0000000000000000
GPR16: 0000000000000010 0000000000000000 0000000000000004 0000000000000000
GPR20: 0000000000000001 c000000002ca2b00 0000000100043bb2 000000000000000a
GPR24: 000000000000000a 0000000000000000 0000000000000000 0000000000000000
GPR28: c000000084021d40 c000000084021d50 c0000000be05cd60 c0000000be05cc00
NIP [c000000000961274] blk_mq_complete_request_remote+0x28/0x2d4
LR [c008000009af1808] nvme_loop_queue_response+0x110/0x290 [nvme_loop]
Call Trace:
0xc00000000502c640 (unreliable)
nvme_loop_queue_response+0x104/0x290 [nvme_loop]
__nvmet_req_complete+0x80/0x498 [nvmet]
nvmet_req_complete+0x24/0xf8 [nvmet]
nvmet_bio_done+0x58/0xcc [nvmet]
bio_endio+0x250/0x390
blk_update_request+0x2e8/0x68c
blk_mq_end_request+0x30/0x5c
lo_complete_rq+0x94/0x110 [loop]
blk_complete_reqs+0x78/0x98
handle_softirqs+0x148/0x454
do_softirq_own_stack+0x3c/0x50
__irq_exit_rcu+0x18c/0x1b4
irq_exit+0x1c/0x34
do_IRQ+0x114/0x278
hardware_interrupt_common_virt+0x28c/0x290
Since the queue teardown path already guarantees that all target-side
operations have completed, cancelling the tagsets is redundant and
unsafe. So avoid cancelling the I/O and admin tagsets during controller
reset and shutdown.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Every doit handler followed the same pattern: stack-allocate an
adm_ctx, call drbd_adm_prepare() at the top, call drbd_adm_finish()
at the bottom. This duplicated boilerplate across 25 handlers and
made error paths inconsistent, since some handlers could miss sending
the reply skb on early-exit paths.
The generic netlink framework already provides pre_doit/post_doit
hooks for exactly this purpose. An old comment even noted "this
would be a good candidate for a pre_doit hook".
Use them:
- pre_doit heap-allocates adm_ctx, looks up per-command flags from a
new drbd_genl_cmd_flags[] table, runs drbd_adm_prepare(), and
stores the context in info->user_ptr[0].
- post_doit sends the reply, drops kref references for
device/connection/resource, and frees the adm_ctx.
- Handlers just receive adm_ctx from info->user_ptr[0], set
reply_dh->ret_code, and return. All teardown is in post_doit.
- drbd_adm_finish() is removed, superseded by post_doit.
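The lifecycle can be sketched without the netlink machinery; everything below (the trimmed structs, the _sim names, the int return standing in for the reply skb) is an illustrative model of the pre_doit/post_doit split, not the drbd code:

```c
#include <stdlib.h>

struct adm_ctx { int ret_code; };
struct genl_info { void *user_ptr[1]; };

/* pre_doit: allocate and prepare the context once for every handler. */
static int drbd_pre_doit_sim(struct genl_info *info)
{
	struct adm_ctx *ctx = calloc(1, sizeof(*ctx));

	if (!ctx)
		return -1;
	info->user_ptr[0] = ctx;
	return 0;
}

/* A handler now only fills in the result code. */
static int drbd_adm_doit_sim(struct genl_info *info, int code)
{
	struct adm_ctx *ctx = info->user_ptr[0];

	ctx->ret_code = code;
	return 0;
}

/* post_doit: "send" the reply and free the context in one place,
 * so no early-exit path can skip the teardown. */
static int drbd_post_doit_sim(struct genl_info *info)
{
	struct adm_ctx *ctx = info->user_ptr[0];
	int code = ctx->ret_code;

	free(ctx);
	info->user_ptr[0] = NULL;
	return code;
}
```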
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://patch.msgid.link/20260324152907.2840984-1-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.
In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.
Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Add a new option that causes zloop to truncate the zone files to the
write pointer value recorded at the last cache flush to simulate
unclean shutdowns.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://patch.msgid.link/20260323071156.2940772-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.
In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.
Cc: Justin Tee <justin.tee@broadcom.com>
Cc: Naresh Gottumukkala <nareshgottumukkala83@gmail.com>
Cc: Paul Ely <paul.ely@broadcom.com>
Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Split out two helper functions to make the function more readable and
to avoid conditional locking.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://patch.msgid.link/20260323071156.2940772-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch continues the effort to refactor workqueue APIs, which began
with the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The point of the refactoring is to eventually make workqueues unbound
by default so that their workload placement is optimized by the
scheduler.
Before that can happen, workqueue users must be converted to the
better-named new workqueues, with no intended behavior changes:
system_wq -> system_percpu_wq
system_unbound_wq -> system_dfl_wq
This way the old obsolete workqueues (system_wq, system_unbound_wq) can be
removed in the future.
Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
bio_alloc_bioset() first strips __GFP_DIRECT_RECLAIM from the optimistic
fast allocation attempt with try_alloc_gfp(). If that fast path fails,
the slowpath checks saved_gfp to decide whether blocking allocation is
allowed, but then still calls mempool_alloc() with the stripped gfp mask.
That can lead to a NULL bio pointer being passed into bio_init().
Fix the slowpath by using saved_gfp for the bio and bvec mempool
allocations.
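The bug pattern and its fix can be modeled in user space (the flag value, helper names, and the always-failing fast path are stand-ins for the kernel's gfp and mempool machinery):

```c
#include <assert.h>
#include <stddef.h>

#define __GFP_DIRECT_RECLAIM (1U << 0)	/* stand-in bit value */

/* Fast path: never blocks, may fail; modeled as always failing here. */
static void *try_alloc(unsigned int gfp) { (void)gfp; return NULL; }

/* Slow path: succeeds only when blocking (direct reclaim) is allowed. */
static int slowpath_obj;
static void *mempool_alloc_model(unsigned int gfp)
{
	return (gfp & __GFP_DIRECT_RECLAIM) ? (void *)&slowpath_obj : NULL;
}

static void *alloc_bio_model(unsigned int gfp)
{
	unsigned int saved_gfp = gfp;
	/* Optimistic non-blocking attempt with reclaim stripped. */
	void *p = try_alloc(gfp & ~__GFP_DIRECT_RECLAIM);

	if (p)
		return p;
	/* Fix: fall back with the caller's original mask; the buggy code
	 * passed the stripped mask here and could get a NULL bio. */
	return mempool_alloc_model(saved_gfp);
}
```

With the stripped mask the fallback could never block, so a caller that did allow reclaim still got NULL; passing saved_gfp restores the intended guarantee.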
Fixes: b520c4eef83d ("block: split bio_alloc_bioset more clearly into a fast and slowpath")
Reported-by: syzbot+09ddb593eea76a158f42@syzkaller.appspotmail.com
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/p01.gc6e9ad5845ad.ttca29g@ub.hpns
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Section 8.3.4.5.2 of the NVMe 2.1 base spec states that
"""
The 00h identifier shall not be proposed in an AUTH_Negotiate message
that requests secure channel concatenation (i.e., with the SC_C field
set to a non-zero value).
"""
We need to ensure that we don't include NVME_AUTH_DHGROUP_NULL in the
idlist if SC_C is set.
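A sketch of the required filtering, as a standalone model (the helper name and list-building shape are illustrative, not the driver's actual code; only the 00h identifier value comes from the spec):

```c
#include <assert.h>
#include <stddef.h>

#define NVME_AUTH_DHGROUP_NULL 0x00	/* DH group identifier 00h */

/* Build the proposed DH group identifier list, skipping the NULL group
 * when secure channel concatenation is requested (SC_C non-zero), per
 * NVMe 2.1 base spec section 8.3.4.5.2. */
static size_t build_dhgroup_idlist(unsigned char sc_c,
				   const unsigned char *supported, size_t n,
				   unsigned char *out)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (sc_c && supported[i] == NVME_AUTH_DHGROUP_NULL)
			continue;	/* 00h shall not be proposed with SC_C set */
		out[count++] = supported[i];
	}
	return count;
}
```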
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kamaljit Singh <kamaljit.singh@opensource.wdc.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Mark ublk_filter_unused_tags() as noinline since it is only called from
the unlikely(needs_filter) branch. Extract the error-handling block from
__ublk_batch_dispatch() into a new noinline ublk_batch_dispatch_fail()
function to keep the hot path compact and icache-friendly. This also
makes __ublk_batch_dispatch() more readable by separating the error
recovery logic from the normal dispatch flow.
Before: __ublk_batch_dispatch is ~1419 bytes
After: __ublk_batch_dispatch is ~1090 bytes (-329 bytes, -23%)
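The underlying pattern is generic: keep rarely-taken error recovery out of line so the caller's hot path stays small and icache-friendly. A hedged user-space illustration (function names and return convention are made up for the example):

```c
#include <assert.h>

/* Cold error path: noinline keeps its body out of the caller, so the
 * hot function's generated code stays compact. */
static __attribute__((noinline)) int dispatch_fail(int tag)
{
	/* ...error recovery would go here... */
	return -tag;
}

static int dispatch(int tag, int ok)
{
	/* __builtin_expect is the mechanism behind the kernel's unlikely() */
	if (__builtin_expect(!ok, 0))
		return dispatch_fail(tag);
	return tag;	/* hot path: straight-line code */
}
```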
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260318014112.3125432-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for using the DHCHAP lengths in upcoming host and target
patches, add the hash and Diffie-Hellman ID length macros.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yunje Shin <ioerts@kookmin.ac.kr>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chris Leech <cleech@redhat.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Pull MD changes from Yu Kuai:
"Bug Fixes:
- md: suppress spurious superblock update error message for dm-raid
(Chen Cheng)
- md/raid1: fix the comparing region of interval tree (Xiao Ni)
- md/raid10: fix deadlock with check operation and nowait requests
(Josh Hunt)
- md/raid5: skip 2-failure compute when other disk is R5_LOCKED
(FengWei Shih)
- md/md-llbitmap: raise barrier before state machine transition
(Yu Kuai)
- md/md-llbitmap: skip reading rdevs that are not in_sync (Yu Kuai)
Improvements:
- md/raid5: set chunk_sectors to enable full stripe I/O splitting
(Yu Kuai)
Cleanups:
- md: remove unused mddev argument from export_rdev (Chen Cheng)
- md/raid5: remove stale md_raid5_kick_device() declaration
(Chen Cheng)
- md/raid5: move handle_stripe() comment to correct location
(Chen Cheng)"
* tag 'md-7.1-20260323' of git://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
md: remove unused mddev argument from export_rdev
md/raid5: move handle_stripe() comment to correct location
md/raid5: remove stale md_raid5_kick_device() declaration
md/raid1: fix the comparing region of interval tree
md/raid5: skip 2-failure compute when other disk is R5_LOCKED
md/md-llbitmap: raise barrier before state machine transition
md/md-llbitmap: skip reading rdevs that are not in_sync
md/raid5: set chunk_sectors to enable full stripe I/O splitting
md/raid10: fix deadlock with check operation and nowait requests
md: suppress spurious superblock update error message for dm-raid
The Kingston OM3SGP42048K2-A00 (PCI ID 2646:502f) firmware has a race
condition when processing concurrent write zeroes and DSM (discard)
commands, causing spurious "LBA Out of Range" errors and IOMMU page
faults at address 0x0.
The issue is reliably triggered by running two concurrent mkfs commands
on different partitions of the same drive, which generates interleaved
write zeroes and discard operations.
Disable write zeroes for this device, matching the pattern used for
other Kingston OM* drives that have similar firmware issues.
Cc: stable@vger.kernel.org
Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Assisted-by: claude-opus-4-6-v1
Signed-off-by: Keith Busch <kbusch@kernel.org>
In preparation for removing the strlcat API[1], replace the char *pp_buf
with a struct seq_buf, which tracks the current write position and
remaining space internally. This allows for:
- Direct use of seq_buf_printf() in place of snprintf()+strlcat()
pairs, eliminating local tmp buffers throughout.
- Adjacent strlcat() calls that build strings piece-by-piece
(e.g., strlcat("["); strlcat(name); strlcat("]")) to be collapsed
into single seq_buf_printf() calls.
- Simpler call sites: seq_buf_puts() takes only the buffer and string,
with no need to pass PAGE_SIZE at every call.
The backing buffer allocation is unchanged (__get_free_page), and the
output path uses seq_buf_str() to NUL-terminate before passing to
printk().
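The shape of the conversion can be modeled in user space with a minimal seq_buf-like tracker (the kernel's real API lives in include/linux/seq_buf.h; the struct and function names below are simplified stand-ins):

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Minimal model of struct seq_buf: a buffer plus its write position. */
struct sbuf { char *buffer; size_t size, len; };

static void sbuf_init(struct sbuf *s, char *buf, size_t size)
{
	s->buffer = buf;
	s->size = size;
	s->len = 0;
	buf[0] = '\0';
}

/* One printf-style append replaces a snprintf()-into-tmp + strlcat()
 * pair; position and remaining space are tracked internally, so call
 * sites no longer pass PAGE_SIZE around. */
static void sbuf_printf(struct sbuf *s, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(s->buffer + s->len, s->size - s->len, fmt, ap);
	va_end(ap);
	if (n > 0)
		s->len += ((size_t)n < s->size - s->len) ?
			  (size_t)n : s->size - s->len - 1;
}
```

For instance, the three adjacent calls `strlcat("["); strlcat(name); strlcat("]")` collapse into a single `sbuf_printf(&s, "[%s]", name)`.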
Link: https://github.com/KSPP/linux/issues/370 [1]
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Josh Law <objecting@objecting.org>
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Josh Law <objecting@objecting.org>
Link: https://patch.msgid.link/20260321004840.work.670-kees@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The mddev argument in export_rdev() is never used. Remove it to
simplify callers.
Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://lore.kernel.org/linux-raid/20260304111417.20777-1-chencheng@fnnas.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
The NVM Command Set Identify Controller data may report a non-zero
Write Zeroes Size Limit (wzsl). When present, nvme_init_non_mdts_limits()
unconditionally overrides max_zeroes_sectors from wzsl, even if
NVME_QUIRK_DISABLE_WRITE_ZEROES previously set it to zero.
This effectively re-enables write zeroes for devices that need it
disabled, defeating the quirk. Several Kingston OM* drives rely on
this quirk to avoid firmware issues with write zeroes commands.
Check for the quirk before applying the wzsl override.
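The fix reduces to gating the wzsl-derived limit on the quirk, which can be modeled in user space (the quirk bit value and helper are illustrative stand-ins):

```c
#include <assert.h>

#define NVME_QUIRK_DISABLE_WRITE_ZEROES (1U << 0)	/* stand-in bit */

/* Apply the Identify-reported Write Zeroes Size Limit only when the
 * device is not quirked; otherwise preserve the quirk's zero limit so
 * write zeroes stays disabled. */
static unsigned int apply_wzsl(unsigned int quirks,
			       unsigned int wzsl_sectors,
			       unsigned int max_zeroes_sectors)
{
	if (quirks & NVME_QUIRK_DISABLE_WRITE_ZEROES)
		return max_zeroes_sectors;
	return wzsl_sectors;
}
```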
Fixes: 5befc7c26e5a ("nvme: implement non-mdts command limits")
Cc: stable@vger.kernel.org
Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Assisted-by: claude-opus-4-6-v1
Signed-off-by: Keith Busch <kbusch@kernel.org>
Implement the SCSI-specific io_uring command handler for BSG using
struct bsg_uring_cmd.
The handler builds a SCSI request from the io_uring command, maps user
buffers (including fixed buffers), and completes asynchronously via a
request end_io callback and task_work. Completion returns a 32-bit
status and packed residual/sense information via CQE res and res2, and
supports IO_URING_F_NONBLOCK.
Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://patch.msgid.link/20260317072226.2598233-4-yangxiuwei@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the handle_stripe() documentation comment from above
analyse_stripe() to directly above handle_stripe() where it belongs.
Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Link: https://lore.kernel.org/linux-raid/20260304111001.15767-1-chencheng@fnnas.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
A block device with a very large discard_granularity queue limit may not
be able to report it in the 16-bit NPDG and NPDA fields in the Identify
Namespace data structure. For this reason, version 2.1 of the NVMe specs
added 32-bit fields NPDGL and NPDAL to the NVM Command Set Specific
Identify Namespace structure. So report the discard_granularity there
too and set OPTPERF to 11b to indicate those fields are supported.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Add an io_uring command handler to the generic BSG layer. The new
.uring_cmd file operation validates io_uring features and delegates
handling to a per-queue bsg_uring_cmd_fn callback.
Extend bsg_register_queue() so transport drivers can register both
sg_io and io_uring command handlers.
Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://patch.msgid.link/20260317072226.2598233-3-yangxiuwei@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove the unused md_raid5_kick_device() declaration from raid5.h -
no definition exists for this function.
Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Link: https://lore.kernel.org/linux-raid/20260304110919.15071-1-chencheng@fnnas.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>