commits
The blk_mq_hw_ctx_sysfs_entry structures are never modified,
mark them as const.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://patch.msgid.link/20260316-b4-sysfs-const-attr-block-v1-4-a35d73b986b0@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The blk_crypto_attrs structures are never modified, mark them as const.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: John Garry <john.g.garry@oracle.com>>
Link: https://patch.msgid.link/20260316-b4-sysfs-const-attr-block-v1-3-a35d73b986b0@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The blk_ia_range_sysfs_entry structures are never modified,
mark them as const.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://patch.msgid.link/20260316-b4-sysfs-const-attr-block-v1-2-a35d73b986b0@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The queue_sysfs_entry structures are never modified, mark them as const.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://patch.msgid.link/20260316-b4-sysfs-const-attr-block-v1-1-a35d73b986b0@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bvec_free is only called by bio_free, so inline it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> -ck
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://patch.msgid.link/20260316161144.1607877-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_alloc_bioset tries non-waiting slab allocations first for the bio and
bvec array, but does so in a somewhat convoluted way.
Restructure the function so that it first open codes these slab
allocations, and then falls back to the mempools with the original
gfp mask.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> -ck
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://patch.msgid.link/20260316161144.1607877-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Only used in bio.c these days.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> -ck
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://patch.msgid.link/20260316161144.1607877-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The ublk driver doesn't access request integrity buffers directly, it
only copies them to/from the ublk server in ublk_copy_user_integrity().
ublk_copy_user_integrity() uses bio_for_each_integrity_vec() to walk all
the integrity segments. ublk devices are therefore capable of handling
requests with integrity intervals split across segments. Set
BLK_SPLIT_INTERVAL_CAPABLE in the struct blk_integrity flags for ublk
devices to opt out of the integrity-interval dma_alignment limit.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://patch.msgid.link/20260313144701.1221652-3-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A bio segment may have partial interval block data with the rest
continuing into the next segments because direct-io data payloads only
need to align in memory to the device's DMA limits.
At the same time, the protection information may also be split in
multiple segments. The most likely way that may happen is if two
requests merge, or if we're directly using the io_uring user metadata.
The generate/verify, however, only ever accessed the first bip_vec.
Further, it may be possible to unalign the protection fields from the
user space buffer, or if there are odd additional opaque bytes in front
or in back of the protection information metadata region.
Change up the iteration to allow spanning multiple segments. This patch
is mostly a re-write of the protection information handling to allow any
arbitrary alignments, so it's probably easier to review the end result
rather than the diff.
Many controllers are not able to handle interval data composed of
multiple segments when PI is used, so this patch introduces a new
integrity limit that a low level driver can set to notify that it is
capable, default to false. The nvme driver is the first one to enable it
in this patch. Everyone else will force DMA alignment to the logical
block size as before to ensure interval data is always aligned within a
single segment.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://patch.msgid.link/20260313144701.1221652-2-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a queue is shared across disk rebind (e.g., SCSI unbind/bind), the
previous disk's blkcg state is cleaned up asynchronously via
disk_release() -> blkcg_exit_disk(). If the new disk's blkcg_init_disk()
runs before that cleanup finishes, we may overwrite q->root_blkg while
the old one is still alive, and radix_tree_insert() in blkg_create()
fails with -EEXIST because the old blkg entries still occupy the same
queue id slot in blkcg->blkg_tree. This causes the sd probe to fail
with -ENOMEM.
Fix it by waiting in blkcg_init_disk() for root_blkg to become NULL,
which indicates the previous disk's blkcg cleanup has completed.
Fixes: 1059699f87eb ("block: move blkcg initialization/destroy into disk allocation/release handler")
Cc: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260311032837.2368714-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260226031243.87200-3-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_steal_bios() transfers bios from a request to a bio_list when the
request is requeued to a different queue. The NVMe multipath failover
path (nvme_failover_req) currently open-codes clearing of REQ_POLLED,
bi_cookie, and REQ_NOWAIT on each bio before calling blk_steal_bios().
Move these fixups into blk_steal_bios() itself so that any caller
automatically gets correct flag state when bios cross queue boundaries.
Simplify nvme_failover_req() accordingly.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260226031243.87200-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge in integrity changes which are also landing in the VFS tree as
dependencies for fs related changes.
* for-7.1/block-integrity:
block: pass a maxlen argument to bio_iov_iter_bounce
block: add fs_bio_integrity helpers
block: make max_integrity_io_size public
block: prepare generation / verification helpers for fs usage
block: add a bdev_has_integrity_csum helper
block: factor out a bio_integrity_setup_default helper
block: factor out a bio_integrity_action helper
bdev_nonrot() is simply the negative return value of bdev_rot().
So replace all call sites of bdev_nonrot() with calls to bdev_rot()
and remove bdev_nonrot().
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allow the file system to limit the size processed in a single
bounce operation. This is needed when generating integrity data
so that the size of a single integrity segment can't overflow.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Correct the comments that the cloned bio must be freed before the memory
pointed to by @bio_src->bi_io_vecs (is freed).
Christoph Hellwig contributed most the of the update wording.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a set of helpers for file system initiated integrity information.
These include mempool backed allocations and verifying based on a passed
in sector and size which is often available from file system completion
routines.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Update the documentation file Documentation/ABI/stable/sysfs-block to
describe the zoned_qd1_writes sysfs queue attribute file.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
File systems that generate integrity will need this, so move it out
of the block private or blk-mq specific headers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For blk-mq rotational zoned block devices (e.g. SMR HDDs), default to
having zone write plugging limit write operations to a maximum queue
depth of 1 for all zones. This significantly reduce write seek overhead
and improves SMR HDD write throughput.
For remotely connected disks with a very high network latency this
features might not be useful. However, remotely connected zoned devices
are rare at the moment, and we cannot know the round trip latency to
pick a good default for network attached devices. System administrators
can however disable this feature in that case.
For BIO based (non blk-mq) rotational zoned block devices, the device
driver (e.g. a DM target driver) can directly set an appropriate
default.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Return the status from verify instead of directly stashing it in the bio,
and rename the helpers to use the usual bio_ prefix for things operating
on a bio.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In order to maintain sequential write patterns per zone with zoned block
devices, zone write plugging issues only a single write BIO per zone at
any time. This works well but has the side effect that when large
sequential write streams are issued by the user and these streams cross
zone boundaries, the device ends up receiving a discontiguous set of
write commands for different zones. The same also happens when a user
writes simultaneously at high queue depth multiple zones: the device
does not see all sequential writes per zone and receives discontiguous
writes to different zones. While this does not affect the performance of
solid state zoned block devices, when using an SMR HDD, this pattern
change from sequential writes to discontiguous writes to different zones
significantly increases head seek which results in degraded write
throughput.
In order to reduce this seek overhead for rotational media devices,
introduce a per disk zone write plugs kernel thread to issue all write
BIOs to zones. This single zone write issuing context is enabled for
any zoned block device that has a request queue flagged with the new
QUEUE_ZONED_QD1_WRITES flag.
The flag QUEUE_ZONED_QD1_WRITES is visible as the sysfs queue attribute
zoned_qd1_writes for zoned devices. For regular block devices, this
attribute is not visible. For zoned block devices, a user can override
the default value set to force the global write maximum queue depth of
1 for a zoned block device, or clear this attribute to fallback to the
default behavior of zone write plugging which limits writes to QD=1 per
sequential zone.
Writing to a zoned block device flagged with QUEUE_ZONED_QD1_WRITES is
implemented using a list of zone write plugs that have a non-empty BIO
list. Listed zone write plugs are processed by the disk zone write plugs
worker kthread in FIFO order, and all BIOs of a zone write plug are all
processed before switching to the next listed zone write plug. A newly
submitted BIO for a non-FULL zone write plug that is not yet listed
causes the addition of the zone write plug at the end of the disk list
of zone write plugs.
Since the write BIOs queued in a zone write plug BIO list are
necessarilly sequential, for rotational media, using the single zone
write plugs kthread to issue all BIOs maintains a sequential write
pattern and thus reduces seek overhead and improves write throughput.
This processing essentially result in always writing to HDDs at QD=1,
which is not an issue for HDDs operating with write caching enabled.
Performance with write cache disabled is also not degraded thanks to
the efficient write handling of modern SMR HDDs.
A disk list of zone write plugs is defined using the new struct gendisk
zone_wplugs_list, and accesses to this list is protected using the
zone_wplugs_list_lock spinlock. The per disk kthread
(zone_wplugs_worker) code is implemented by the function
disk_zone_wplugs_worker(). A reference on listed zone write plugs is
always held until all BIOs of the zone write plug are processed by the
worker kthread. BIO issuing at QD=1 is driven using a completion
structure (zone_wplugs_worker_bio_done) and calls to blk_io_wait().
With this change, performance when sequentially writing the zones of a
30 TB SMR SATA HDD connected to an AHCI adapter changes as follows
(1MiB direct I/Os, results in MB/s unit):
+--------------------+
| Write BW (MB/s) |
+------------------+----------+---------+
| Sequential write | Baseline | Patched |
| Queue Depth | 6.19-rc8 | |
+------------------+----------+---------+
| 1 | 244 | 245 |
| 2 | 244 | 245 |
| 4 | 245 | 245 |
| 8 | 242 | 245 |
| 16 | 222 | 246 |
| 32 | 211 | 245 |
| 64 | 193 | 244 |
| 128 | 112 | 246 |
+------------------+----------+---------+
With the current code (baseline), as the sequential write stream crosses
a zone boundary, higher queue depth creates a gap between the
last IO to the previous zone and the first IOs to the following zones,
causing head seeks and degrading performance. Using the disk zone
write plugs worker thread, this pattern disappears and the maximum
throughput of the drive is maintained, leading to over 100%
improvements in throughput for high queue depth write.
Using 16 fio jobs all writing to randomly chosen zones at QD=32 with 1
MiB direct IOs, write throughput also increases significantly.
+--------------------+
| Write BW (MB/s) |
+------------------+----------+---------+
| Random write | Baseline | Patched |
| Number of zones | 6.19-rc7 | |
+------------------+----------+---------+
| 1 | 191 | 192 |
| 2 | 101 | 128 |
| 4 | 115 | 123 |
| 8 | 90 | 120 |
| 16 | 64 | 115 |
| 32 | 58 | 105 |
| 64 | 56 | 101 |
| 128 | 55 | 99 |
+------------------+----------+---------+
Tests using XFS shows that buffered write speed with 8 jobs writing
files increases by 12% to 35% depending on the workload.
+--------------------+
| Write BW (MB/s) |
+------------------+----------+---------+
| Workload | Baseline | Patched |
| | 6.19-rc7 | |
+------------------+----------+---------+
| 256MiB file size | 212 | 238 |
+------------------+----------+---------+
| 4MiB .. 128 MiB | 213 | 243 |
| random file size | | |
+------------------+----------+---------+
| 2MiB .. 8 MiB | 179 | 242 |
| random file size | | |
+------------------+----------+---------+
Performance gains are even more significant when using an HBA that
limits the maximum size of commands to a small value, e.g. HBAs
controlled with the mpi3mr driver limit commands to a maximum of 1 MiB.
In such case, the write throughput gains are over 40%.
+--------------------+
| Write BW (MB/s) |
+------------------+----------+---------+
| Workload | Baseline | Patched |
| | 6.19-rc7 | |
+------------------+----------+---------+
| 256MiB file size | 175 | 245 |
+------------------+----------+---------+
| 4MiB .. 128 MiB | 174 | 244 |
| random file size | | |
+------------------+----------+---------+
| 2MiB .. 8 MiB | 171 | 243 |
| random file size | | |
+------------------+----------+---------+
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Factor out a helper to see if the block device has an integrity checksum
from bdev_stable_writes so that it can be reused for other checks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rename struct gendisk zone_wplugs_lock field to zone_wplugs_hash_lock to
clearly indicates that this is the spinlock used for manipulating the
hash table of zone write plugs.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper to set the seed and check flag based on useful defaults
from the profile.
Note that this includes a small behavior change, as we now only set the
seed if any action is set, which is fine as nothing will look at it
otherwise.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The helper function disk_zone_is_full() is only used in
disk_zone_wplug_is_full(). So remove it and open code it directly in
this single caller.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split the logic to see if a bio needs integrity metadata from
bio_integrity_prep into a reusable helper than can be called from
file system code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
disk_get_and_lock_zone_wplug() always returns a zone write plug with the
plug lock held. This is unnecessary since this function does not look at
the fields of existing plugs, and new plugs need to be locked only after
their insertion in the disk hash table, when they are being used.
Remove the zone write plug locking from disk_get_and_lock_zone_wplug()
and rename this function disk_get_or_alloc_zone_wplug().
blk_zone_wplug_handle_write() is modified to add locking of the zone
write plug after calling disk_get_or_alloc_zone_wplug() and before
starting to use the plug. This change also simplifies
blk_revalidate_seq_zone() as unlocking the plug becomes unnecessary.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The function disk_zone_wplug_schedule_bio_work() always takes a
reference on the zone write plug of the BIO work being scheduled. This
ensures that the zone write plug cannot be freed while the BIO work is
being scheduled but has not run yet. However, this unconditional
reference taking is fragile since the reference taken is released by the
BIO work blk_zone_wplug_bio_work() function, which implies that there
always must be a 1:1 relation between the work being scheduled and the
work running.
Make sure to drop the reference taken when scheduling the BIO work if
the work is already scheduled, that is, when queue_work() returns false.
Fixes: 9e78c38ab30b ("block: Hold a reference on zone write plugs to schedule submission")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull fsverity fixes from Eric Biggers:
- Fix a build error on parisc
- Remove the non-large-folio-aware function fsverity_verify_page()
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
fsverity: fix build error by adding fsverity_readahead() stub
fsverity: remove fsverity_verify_page()
f2fs: make f2fs_verify_cluster() partially large-folio-aware
f2fs: remove unnecessary ClearPageUptodate in f2fs_verify_cluster()
Commit 7b295187287e ("block: Do not remove zone write plugs still in
use") modified disk_should_remove_zone_wplug() to add a check on the
reference count of a zone write plug to prevent removing zone write
plugs from a disk hash table when the plugs are still being referenced
by BIOs or requests in-flight. However, this check does not take into
account that a BIO completion may happen right after its submission by
a zone write plug BIO work, and before the zone write plug BIO work
releases the zone write plug reference count. This situation leads to
disk_should_remove_zone_wplug() returning false as in this case the zone
write plug reference count is at least equal to 3. If the BIO that
completes in such manner transitioned the zone to the FULL condition,
the zone write plug for the FULL zone will remain in the disk hash
table.
Furthermore, relying on a particular value of a zone write plug
reference count to set the BLK_ZONE_WPLUG_UNHASHED flag is fragile as
reading the atomic reference count and doing a comparison with some
value is not overall atomic at all.
Address these issues by reworking the reference counting of zone write
plugs so that removing plugs from a disk hash table can be done
directly from disk_put_zone_wplug() when the last reference on a plug
is dropped.
To do so, replace the function disk_remove_zone_wplug() with
disk_mark_zone_wplug_dead(). This new function sets the zone write plug
flag BLK_ZONE_WPLUG_DEAD (which replaces BLK_ZONE_WPLUG_UNHASHED) and
drops the initial reference on the zone write plug taken when the plug
was added to the disk hash table. This function is called either for
zones that are empty or full, or directly in the case of a forced plug
removal (e.g. when the disk hash table is being destroyed on disk
removal). With this change, disk_should_remove_zone_wplug() is also
removed.
disk_put_zone_wplug() is modified to call the function
disk_free_zone_wplug() to remove a zone write plug from a disk hash
table and free the plug structure (with a call_rcu()), when the last
reference on a zone write plug is dropped. disk_free_zone_wplug()
always checks that the BLK_ZONE_WPLUG_DEAD flag is set.
In order to avoid having multiple zone write plugs for the same zone in
the disk hash table, disk_get_and_lock_zone_wplug() checked for the
BLK_ZONE_WPLUG_UNHASHED flag. This check is removed and a check for
the new BLK_ZONE_WPLUG_DEAD flag is added to
blk_zone_wplug_handle_write(). With this change, we continue preventing
adding multiple zone write plugs for the same zone and at the same time
re-inforce checks on the user behavior by failing new incoming write
BIOs targeting a zone that is marked as dead. This case can happen only
if the user erroneously issues write BIOs to zones that are full, or to
zones that are currently being reset or finished.
Fixes: 7b295187287e ("block: Do not remove zone write plugs still in use")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull crypto library fix from Eric Biggers:
"Fix a big endian specific issue in the PPC64-optimized AES code"
* tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crypto: powerpc/aes: Fix rndkey_from_vsx() on big endian CPUs
hppa-linux-gcc 9.5.0 generates a call to fsverity_readahead() in
f2fs_readahead() when CONFIG_FS_VERITY=n, because it fails to do the
expected dead code elimination based on vi always being NULL. Fix the
build error by adding an inline stub for fsverity_readahead(). Since
it's just for opportunistic readahead, just make it a no-op.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202602180838.pwICdY2r-lkp@intel.com/
Fixes: 45dcb3ac9832 ("f2fs: consolidate fsverity_info lookup")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260218012244.18536-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
The queue_hw_ctx field in struct request_queue is an array of pointers
to struct blk_mq_hw_ctx. The number of elements in this array is tracked
by the nr_hw_queues field.
The array is allocated in __blk_mq_realloc_hw_ctxs() using
kcalloc_node() with set->nr_hw_queues elements. q->nr_hw_queues is
subsequently updated to set->nr_hw_queues.
When growing the array, the new array is assigned to queue_hw_ctx before
nr_hw_queues is updated. This is safe because nr_hw_queues (the old
smaller count) is used for bounds checking, which is within the new
larger allocation.
When shrinking the array, nr_hw_queues is updated to the smaller value,
while queue_hw_ctx retains the larger allocation. This is also safe as
the count is within the allocation bounds.
Annotating queue_hw_ctx with __counted_by_ptr(nr_hw_queues) allows the
compiler (with kSAN) to verify that accesses to queue_hw_ctx are within
the valid range defined by nr_hw_queues.
This patch was generated by CodeMender and reviewed by Bill Wendling.
Tested by running blktests.
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Bill Wendling <morbo@google.com>
[axboe: massage commit message]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stephen retired and stepped back from -next maintainership, update his
entry in CREDITS to recognise his 18 years of hard work making it what
it is today and all the impact it's had on our development process.
Also update to his current GnuPG key while we're here.
Acked-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I finally got a big endian PPC64 kernel to boot in QEMU. The PPC64 VSX
optimized AES library code does work in that case, with the exception of
rndkey_from_vsx() which doesn't take into account that the order in
which the VSX code stores the round key words depends on the endianness.
So fix rndkey_from_vsx() to do the right thing on big endian CPUs.
Fixes: 7cf2082e74ce ("lib/crypto: powerpc/aes: Migrate POWER8 optimized code into library")
Link: https://lore.kernel.org/r/20260216022104.332991-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Now that fsverity_verify_page() has no callers, remove it.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260218010630.7407-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
This adds a function for retrieving the set of Locking objects enabled
for Single User Mode (SUM) and the value of the
RangeStartRangeLengthPolicy parameter.
It retrieves data from the LockingInfo table, specifically the
columns SingleUserModeRanges and RangeStartLengthPolicy, which
were added according to the TCG Opal Feature Set: Single User Mode,
as described in chapters 4.4.3.1 and 4.4.3.2.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The x509 public key code gained a dependency on the sha256 hash
implementation, causing a rare link time failure in randconfig
builds:
arm-linux-gnueabi-ld: crypto/asymmetric_keys/x509_public_key.o: in function `x509_get_sig_params':
x509_public_key.c:(.text.x509_get_sig_params+0x12): undefined reference to `sha256'
arm-linux-gnueabi-ld: (sha256): Unknown destination type (ARM/Thumb) in crypto/asymmetric_keys/x509_public_key.o
x509_public_key.c:(.text.x509_get_sig_params+0x12): dangerous relocation: unsupported relocation
Select the necessary library code from Kconfig.
Fixes: 2c62068ac86b ("x509: Separately calculate sha256 for blacklist")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull sysctl updates from Joel Granados:
- Remove macros from proc handler converters
Replace the proc converter macros with "regular" functions. Though it
is more verbose than the macro version, it helps when debugging and
better aligns with coding-style.rst.
- General cleanup
Remove superfluous ctl_table forward declarations. Const qualify the
memory_allocation_profiling_sysctl and loadpin_sysctl_table arrays.
Add missing kernel doc to proc_dointvec_conv.
- Testing
This series was run through sysctl selftests/kunit test suite in
x86_64. And went into linux-next after rc4, giving it a good 3 weeks
of testing
* tag 'sysctl-7.00-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: replace SYSCTL_INT_CONV_CUSTOM macro with functions
sysctl: Replace unidirectional INT converter macros with functions
sysctl: Add kernel doc to proc_douintvec_conv
sysctl: Replace UINT converter macros with functions
sysctl: Add CONFIG_PROC_SYSCTL guards for converter macros
sysctl: clarify proc_douintvec_minmax doc
sysctl: Return -ENOSYS from proc_douintvec_conv when CONFIG_PROC_SYSCTL=n
sysctl: Remove unused ctl_table forward declarations
loadpin: Implement custom proc_handler for enforce
alloc_tag: move memory_allocation_profiling_sysctls into .rodata
sysctl: Add missing kernel-doc for proc_dointvec_conv
f2fs_verify_cluster() is the only remaining caller of the
non-large-folio-aware function fsverity_verify_page(). To unblock the
removal of that function, change f2fs_verify_cluster() to verify the
entire folio of each page and mark it up-to-date.
Note that this doesn't actually make f2fs_verify_cluster()
large-folio-aware, as it is still passed an array of pages. Currently,
it's never called with large folios.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260218010630.7407-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Change the column parameter in response_get_column() from u8 to u64
to support the full range of column identifiers.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Align to the commit bf4afc53b77a ("Convert 'alloc_obj' family to use the
new default GFP_KERNEL argument") update the 'kmalloc_obj' declaration
for userspace to fix below compile error:
In file included from arch/arm/boot/compressed/../../../../lib/decompress_unxz.c:241,
from arch/arm/boot/compressed/decompress.c:56:
arch/arm/boot/compressed/../../../../lib/xz/xz_dec_stream.c: In function 'xz_dec_init':
arch/arm/boot/compressed/../../../../lib/xz/xz_dec_stream.c:787:28: error: implicit declaration of function 'kmalloc_obj'; did you mean 'kmalloc'? [-Wimplicit-function-declaration]
787 | struct xz_dec *s = kmalloc_obj(*s);
| ^~~~~~~~~~~
| kmalloc
Signed-off-by: Haiyue Wang <haiyuewa@163.com>
Fixes: 69050f8d6d07 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
Fixes: bf4afc53b77a ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument")
Reviewed-by: Kees Cook <kees@kernel.org>
Acked-by: Lasse Collin <lasse.collin@tukaani.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull turbostat fix from Len Brown:
"Fix a recent AMD regression due to errant code cleanup"
* tag 'turbostat-2026.02.14-AMD-RAPL-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
tools/power turbostat: Fix AMD RAPL regression
Remove SYSCTL_INT_CONV_CUSTOM and replace it with proc_int_conv. This
converter function expects a negp argument as it can take on negative
values. Update all jiffies converters to use explicit function calls.
Remove SYSCTL_CONV_IDENTITY as it is no longer used.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Remove the unnecessary clearing of PG_uptodate. It's guaranteed to
already be clear.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260218010630.7407-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
This ioctl is used to set up RLE (read lock enabled) and WLE (write
lock enabled) parameters of the Locking object.
In Single User Mode (SUM), if the RangeStartRangeLengthPolicy parameter
is set in the 'Reactivate' method, only Admin authority maintains the
locking range length and start (offset) attributes of Locking objects
set up for SUM. All other attributes from struct opal_user_lr_setup
(RLE - read locking enabled, WLE - write locking enabled) shall
remain in possession of the User authority associated with the Locking
object set for SUM.
With the IOC_OPAL_ENABLE_DISABLE_LR ioctl, the opal_user_lr_setup
members 'range_start' and 'range_length' of the ioctl argument are
ignored.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull RTC updates from Alexandre Belloni:
- loongson: Loongson-2K0300 support
- s35390a: nvmem support
- zynqmp: rework calibration
* tag 'rtc-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
rtc: ds1390: fix number of bytes read from RTC
rtc: class: Remove duplicate check for alarm
rtc: optee: simplify OP-TEE context match
rtc: interface: Alarm race handling should not discard preceding error
rtc: s35390a: implement nvmem support
rtc: loongson: Add Loongson-2K0300 support
dt-bindings: rtc: loongson: Document Loongson-2K0300 compatible
dt-bindings: rtc: loongson: Correct Loongson-1C interrupts property
dt-bindings: rtc: renesas,rz-rtca3: Add RZ/V2N support
dt-bindings: rtc: cpcap: convert to schema
rtc: zynqmp: use dynamic max and min offset ranges
rtc: zynqmp: rework set_offset
rtc: zynqmp: rework read_offset
rtc: zynqmp: check calibration max value
rtc: zynqmp: correct frequency value
rtc: amlogic-a4: Remove IRQF_ONESHOT
rtc: pcf8563: use correct of_node for output clock
rtc: max31335: use correct CONFIG symbol in IS_REACHABLE()
rtc: nvvrs: Add ARCH_TEGRA to the NV VRS RTC driver
Pull turbostat updates from Len Brown:
- Add L2 statistics columns for recent Intel processors:
L2MRPS = L2 Cache M-References Per Second
L2%hit = L2 Cache Hit %
- Sort work and output by cpu# rather than core#
- Minor features and fixes
* tag 'turbostat-2026.02.14' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: (23 commits)
tools/power turbostat: version 2026.02.14
tools/power turbostat: Fix and document --header_iterations
tools/power turbostat: Use strtoul() for iteration parsing
tools/power turbostat: Favor cpu# over core#
tools/power turbostat: Expunge logical_cpu_id
tools/power turbostat: Enhance HT enumeration
tools/power turbostat: Simplify global core_id calculation
tools/power turbostat: Unify even/odd/average counter referencing
tools/power turbostat: Allocate average counters dynamically
tools/power turbostat: Delete core_data.core_id
tools/power turbostat: Rename physical_core_id to core_id
tools/power turbostat: Cleanup package_id
tools/power turbostat: Cleanup internal use of "base_cpu"
tools/power turbostat: Add L2 cache statistics
tools/power turbostat: Remove redundant newlines from err(3) strings
tools/power turbostat: Allow more use of is_hybrid flag
tools/power turbostat: Rename "LLCkRPS" column to "LLCMRPS"
tools/power turbostat.8: Document the "--force" option
tools/power turbostat: Harden against unexpected values
tools/power turbostat: Dump hypervisor name
...
turbostat.c:8688: rapl_perf_init: Assertion `next_domain < num_domains' failed.
Two recent cleanup patches that were not supposed to change anything
broke the core_id code needed for AMD RAPL initialization:
commit 070e92361eec ("tools/power turbostat: Enhance HT enumeration")
commit ddf60e38ca04 ("tools/power turbostat: Simplify global core_id calculation")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
Replace SYSCTL_USER_TO_KERN_INT_CONV and SYSCTL_KERN_TO_USER_INT_CONV
macros with function implementing the same logic.This makes debugging
easier and aligns with the functions preference described in
coding-style.rst. Update all jiffies converters to use explicit function
implementations instead of macro-generated versions.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Pull LoongArch updates from Huacai Chen:
- Select HAVE_CMPXCHG_{LOCAL,DOUBLE}
- Add 128-bit atomic cmpxchg support
- Add HOTPLUG_SMT implementation
- Wire up memfd_secret system call
- Fix boot errors and unwind errors for KASAN
- Use BPF prog pack allocator and add BPF arena support
- Update dts files to add nand controllers
- Some bug fixes and other small changes
* tag 'loongarch-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: dts: loongson-2k1000: Add nand controller support
LoongArch: dts: loongson-2k0500: Add nand controller support
LoongArch: BPF: Implement bpf_addr_space_cast instruction
LoongArch: BPF: Implement PROBE_MEM32 pseudo instructions
LoongArch: BPF: Use BPF prog pack allocator
LoongArch: Use IS_ERR_PCPU() macro for KGDB
LoongArch: Rework KASAN initialization for PTW-enabled systems
LoongArch: Disable instrumentation for setup_ptwalker()
LoongArch: Remove some extern variables in source files
LoongArch: Guard percpu handler under !CONFIG_PREEMPT_RT
LoongArch: Handle percpu handler address for ORC unwinder
LoongArch: Use %px to print unmodified unwinding address
LoongArch: Prefer top-down allocation after arch_mem_init()
LoongArch: Add HOTPLUG_SMT implementation
LoongArch: Make cpumask_of_node() robust against NUMA_NO_NODE
LoongArch: Wire up memfd_secret system call
LoongArch: Replace seq_printf() with seq_puts() for simple strings
LoongArch: Add 128-bit atomic cmpxchg support
LoongArch: Add detection for SC.Q support
LoongArch: Select HAVE_CMPXCHG_LOCAL in Kconfig
This ioctl is used to set up locking range start (offset)
and locking range length attributes only.
In Single User Mode (SUM), if the RangeStartRangeLengthPolicy parameter
is set in the 'Reactivate' method, only Admin authority maintains the
locking range length and start (offset) attributes of Locking objects
set up for SUM. All other attributes from struct opal_user_lr_setup
(RLE - read locking enabled, WLE - write locking enabled) shall
remain in possession of the User authority associated with the Locking
object set for SUM.
Therefore, we need a separate function for setting up locking range
start and locking range length because it may require two different
authorities (and sessions) if the RangeStartRangeLengthPolicy attribute
is set.
With the IOC_OPAL_LR_SET_START_LEN ioctl, the opal_user_lr_setup
members 'RLE' and 'WLE' of the ioctl argument are ignored.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull rust fixes from Miguel Ojeda:
"Toolchain and infrastructure:
- Pass '-Zunstable-options' flag required by the future Rust 1.95.0
- Fix 'objtool' warning for Rust 1.84.0
'kernel' crate:
- 'irq' module: add missing bound detected by the future Rust 1.95.0
- 'list' module: add missing 'unsafe' blocks and placeholder safety
comments to macros (an issue for future callers within the crate)
'pin-init' crate:
- Clean Clippy warning that changed behavior in the future Rust
1.95.0"
* tag 'rust-fixes-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
rust: list: Add unsafe blocks for container_of and safety comments
rust: pin-init: replace clippy `expect` with `allow`
rust: irq: add `'static` bounds to irq callbacks
objtool/rust: add one more `noreturn` Rust function
rust: kbuild: pass `-Zunstable-options` for Rust 1.95.0
The spi_write_then_read() reads 8 bytes starting from
DS1390_REG_SECONDS (== 0x01), so the last byte read would already
be part of the alarm (Tenths and Hundredths of Seconds) feature.
However 7 bytes are engouh -- seconds (0x01), minutes (0x02), hours (0x03),
day (0x04), date (0x05), month/century (0x06) and year (0x07).
Signed-off-by: Andreas Gabriel-Platschek <andi.platschek@gmail.com>
Link: https://patch.msgid.link/20260209053439.313825-1-andi.platschek@gmail.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Pull ntfs3 updates from Konstantin Komarov:
"New code:
- improve readahead for bitmap initialization and large directory scans
- fsync files by syncing parent inodes
- drop of preallocated clusters for sparse and compressed files
- zero-fill folios beyond i_valid in ntfs_read_folio()
- implement llseek SEEK_DATA/SEEK_HOLE by scanning data runs
- implement iomap-based file operations
- allow explicit boolean acl/prealloc mount options
- fall-through between switch labels
- delayed-allocation (delalloc) support
Fixes:
- check return value of indx_find to avoid infinite loop
- initialize new folios before use
- infinite loop in attr_load_runs_range on inconsistent metadata
- infinite loop triggered by zero-sized ATTR_LIST
- ntfs_mount_options leak in ntfs_fill_super()
- deadlock in ni_read_folio_cmpr
- circular locking dependency in run_unpack_ex
- prevent infinite loops caused by the next valid being the same
- restore NULL folio initialization in ntfs_writepages()
- slab-out-of-bounds read in DeleteIndexEntryRoot
Updates:
- allow readdir() to finish after directory mutations without rewinddir()
- handle attr_set_size() errors when truncating files
- make ntfs_writeback_ops static
- refactor duplicate kmemdup pattern in do_action()
- avoid calling run_get_entry() when run == NULL in ntfs_read_run_nb_ra()
Replaced:
- use wait_on_buffer() directly
- rename ni_readpage_cmpr into ni_read_folio_cmpr"
* tag 'ntfs3_for_7.0' of https://github.com/Paragon-Software-Group/linux-ntfs3: (26 commits)
fs/ntfs3: add delayed-allocation (delalloc) support
fs/ntfs3: avoid calling run_get_entry() when run == NULL in ntfs_read_run_nb_ra()
fs/ntfs3: add fall-through between switch labels
fs/ntfs3: allow explicit boolean acl/prealloc mount options
fs/ntfs3: Fix slab-out-of-bounds read in DeleteIndexEntryRoot
ntfs3: Restore NULL folio initialization in ntfs_writepages()
ntfs3: Refactor duplicate kmemdup pattern in do_action()
fs/ntfs3: prevent infinite loops caused by the next valid being the same
fs/ntfs3: make ntfs_writeback_ops static
ntfs3: fix circular locking dependency in run_unpack_ex
fs/ntfs3: implement iomap-based file operations
fs/ntfs3: fix deadlock in ni_read_folio_cmpr
fs/ntfs3: implement llseek SEEK_DATA/SEEK_HOLE by scanning data runs
fs/ntfs3: zero-fill folios beyond i_valid in ntfs_read_folio()
fs/ntfs3: handle attr_set_size() errors when truncating files
fs/ntfs3: drop preallocated clusters for sparse and compressed files
fs/ntfs3: fsync files by syncing parent inodes
fs/ntfs3: fix ntfs_mount_options leak in ntfs_fill_super()
fs/ntfs3: allow readdir() to finish after directory mutations without rewinddir()
fs/ntfs3: improve readahead for bitmap initialization and large directory scans
...
Since release 2025.12.02:
Add L2 statistics columns for recent Intel processors:
L2MRPS = L2 Cache M-References Per Second
L2%hit = L2 Cache Hit %
Sort work and output by cpu# rather than core#
This commit:
Version number and white space (indent -l160)
No functional change.
Signed-off-by: Len Brown <len.brown@intel.com>
This commit is making sure that all the functions that are part of the
API are documented.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Pull memblock updates from Mike Rapoport:
- update tools/include/linux/mm.h to fix memblock tests compilation
- drop redundant struct page* parameter from memblock_free_pages() and
get struct page from the pfn
- add underflow detection for size calculation in memtest and warn
about underflow when VM_DEBUG is enabled
* tag 'memblock-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
mm/memtest: add underflow detection for size calculation
memblock: drop redundant 'struct page *' argument from memblock_free_pages()
memblock test: include <linux/sizes.h> from tools mm.h stub
The blk_mq_hw_ctx_sysfs_entry structures are never modified,
mark them as const.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://patch.msgid.link/20260316-b4-sysfs-const-attr-block-v1-4-a35d73b986b0@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The blk_crypto_attrs structures are never modified, mark them as const.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: John Garry <john.g.garry@oracle.com>>
Link: https://patch.msgid.link/20260316-b4-sysfs-const-attr-block-v1-3-a35d73b986b0@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The blk_ia_range_sysfs_entry structures are never modified,
mark them as const.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://patch.msgid.link/20260316-b4-sysfs-const-attr-block-v1-2-a35d73b986b0@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The queue_sysfs_entry structures are never modified, mark them as const.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://patch.msgid.link/20260316-b4-sysfs-const-attr-block-v1-1-a35d73b986b0@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bvec_free is only called by bio_free, so inline it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> -ck
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://patch.msgid.link/20260316161144.1607877-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_alloc_bioset tries non-waiting slab allocations first for the bio and
bvec array, but does so in a somewhat convoluted way.
Restructure the function so that it first open codes these slab
allocations, and then falls back to the mempools with the original
gfp mask.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> -ck
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://patch.msgid.link/20260316161144.1607877-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Only used in bio.c these days.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> -ck
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://patch.msgid.link/20260316161144.1607877-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The ublk driver doesn't access request integrity buffers directly, it
only copies them to/from the ublk server in ublk_copy_user_integrity().
ublk_copy_user_integrity() uses bio_for_each_integrity_vec() to walk all
the integrity segments. ublk devices are therefore capable of handling
requests with integrity intervals split across segments. Set
BLK_SPLIT_INTERVAL_CAPABLE in the struct blk_integrity flags for ublk
devices to opt out of the integrity-interval dma_alignment limit.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://patch.msgid.link/20260313144701.1221652-3-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A bio segment may have partial interval block data with the rest
continuing into the next segments because direct-io data payloads only
need to align in memory to the device's DMA limits.
At the same time, the protection information may also be split in
multiple segments. The most likely way that may happen is if two
requests merge, or if we're directly using the io_uring user metadata.
The generate/verify, however, only ever accessed the first bip_vec.
Further, it may be possible to unalign the protection fields from the
user space buffer, or if there are odd additional opaque bytes in front
or in back of the protection information metadata region.
Change up the iteration to allow spanning multiple segments. This patch
is mostly a re-write of the protection information handling to allow any
arbitrary alignments, so it's probably easier to review the end result
rather than the diff.
Many controllers are not able to handle interval data composed of
multiple segments when PI is used, so this patch introduces a new
integrity limit that a low level driver can set to notify that it is
capable, default to false. The nvme driver is the first one to enable it
in this patch. Everyone else will force DMA alignment to the logical
block size as before to ensure interval data is always aligned within a
single segment.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://patch.msgid.link/20260313144701.1221652-2-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a queue is shared across disk rebind (e.g., SCSI unbind/bind), the
previous disk's blkcg state is cleaned up asynchronously via
disk_release() -> blkcg_exit_disk(). If the new disk's blkcg_init_disk()
runs before that cleanup finishes, we may overwrite q->root_blkg while
the old one is still alive, and radix_tree_insert() in blkg_create()
fails with -EEXIST because the old blkg entries still occupy the same
queue id slot in blkcg->blkg_tree. This causes the sd probe to fail
with -ENOMEM.
Fix it by waiting in blkcg_init_disk() for root_blkg to become NULL,
which indicates the previous disk's blkcg cleanup has completed.
Fixes: 1059699f87eb ("block: move blkcg initialization/destroy into disk allocation/release handler")
Cc: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260311032837.2368714-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260226031243.87200-3-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_steal_bios() transfers bios from a request to a bio_list when the
request is requeued to a different queue. The NVMe multipath failover
path (nvme_failover_req) currently open-codes clearing of REQ_POLLED,
bi_cookie, and REQ_NOWAIT on each bio before calling blk_steal_bios().
Move these fixups into blk_steal_bios() itself so that any caller
automatically gets correct flag state when bios cross queue boundaries.
Simplify nvme_failover_req() accordingly.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260226031243.87200-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge in integrity changes which are also landing in the VFS tree as
dependencies for fs related changes.
* for-7.1/block-integrity:
block: pass a maxlen argument to bio_iov_iter_bounce
block: add fs_bio_integrity helpers
block: make max_integrity_io_size public
block: prepare generation / verification helpers for fs usage
block: add a bdev_has_integrity_csum helper
block: factor out a bio_integrity_setup_default helper
block: factor out a bio_integrity_action helper
bdev_nonrot() is simply the negative return value of bdev_rot().
So replace all call sites of bdev_nonrot() with calls to bdev_rot()
and remove bdev_nonrot().
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allow the file system to limit the size processed in a single
bounce operation. This is needed when generating integrity data
so that the size of a single integrity segment can't overflow.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Correct the comments that the cloned bio must be freed before the memory
pointed to by @bio_src->bi_io_vecs (is freed).
Christoph Hellwig contributed most the of the update wording.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a set of helpers for file system initiated integrity information.
These include mempool backed allocations and verifying based on a passed
in sector and size which is often available from file system completion
routines.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Update the documentation file Documentation/ABI/stable/sysfs-block to
describe the zoned_qd1_writes sysfs queue attribute file.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
File systems that generate integrity will need this, so move it out
of the block private or blk-mq specific headers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For blk-mq rotational zoned block devices (e.g. SMR HDDs), default to
having zone write plugging limit write operations to a maximum queue
depth of 1 for all zones. This significantly reduce write seek overhead
and improves SMR HDD write throughput.
For remotely connected disks with a very high network latency this
features might not be useful. However, remotely connected zoned devices
are rare at the moment, and we cannot know the round trip latency to
pick a good default for network attached devices. System administrators
can however disable this feature in that case.
For BIO based (non blk-mq) rotational zoned block devices, the device
driver (e.g. a DM target driver) can directly set an appropriate
default.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Return the status from verify instead of directly stashing it in the bio,
and rename the helpers to use the usual bio_ prefix for things operating
on a bio.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In order to maintain sequential write patterns per zone with zoned block
devices, zone write plugging issues only a single write BIO per zone at
any time. This works well but has the side effect that when large
sequential write streams are issued by the user and these streams cross
zone boundaries, the device ends up receiving a discontiguous set of
write commands for different zones. The same also happens when a user
writes simultaneously at high queue depth multiple zones: the device
does not see all sequential writes per zone and receives discontiguous
writes to different zones. While this does not affect the performance of
solid state zoned block devices, when using an SMR HDD, this pattern
change from sequential writes to discontiguous writes to different zones
significantly increases head seek which results in degraded write
throughput.
In order to reduce this seek overhead for rotational media devices,
introduce a per disk zone write plugs kernel thread to issue all write
BIOs to zones. This single zone write issuing context is enabled for
any zoned block device that has a request queue flagged with the new
QUEUE_ZONED_QD1_WRITES flag.
The flag QUEUE_ZONED_QD1_WRITES is visible as the sysfs queue attribute
zoned_qd1_writes for zoned devices. For regular block devices, this
attribute is not visible. For zoned block devices, a user can override
the default value set to force the global write maximum queue depth of
1 for a zoned block device, or clear this attribute to fallback to the
default behavior of zone write plugging which limits writes to QD=1 per
sequential zone.
Writing to a zoned block device flagged with QUEUE_ZONED_QD1_WRITES is
implemented using a list of zone write plugs that have a non-empty BIO
list. Listed zone write plugs are processed by the disk zone write plugs
worker kthread in FIFO order, and all BIOs of a zone write plug are all
processed before switching to the next listed zone write plug. A newly
submitted BIO for a non-FULL zone write plug that is not yet listed
causes the addition of the zone write plug at the end of the disk list
of zone write plugs.
Since the write BIOs queued in a zone write plug BIO list are
necessarilly sequential, for rotational media, using the single zone
write plugs kthread to issue all BIOs maintains a sequential write
pattern and thus reduces seek overhead and improves write throughput.
This processing essentially result in always writing to HDDs at QD=1,
which is not an issue for HDDs operating with write caching enabled.
Performance with write cache disabled is also not degraded thanks to
the efficient write handling of modern SMR HDDs.
A disk list of zone write plugs is defined using the new struct gendisk
zone_wplugs_list, and accesses to this list is protected using the
zone_wplugs_list_lock spinlock. The per disk kthread
(zone_wplugs_worker) code is implemented by the function
disk_zone_wplugs_worker(). A reference on listed zone write plugs is
always held until all BIOs of the zone write plug are processed by the
worker kthread. BIO issuing at QD=1 is driven using a completion
structure (zone_wplugs_worker_bio_done) and calls to blk_io_wait().
With this change, performance when sequentially writing the zones of a
30 TB SMR SATA HDD connected to an AHCI adapter changes as follows
(1MiB direct I/Os, results in MB/s unit):
+--------------------+
| Write BW (MB/s) |
+------------------+----------+---------+
| Sequential write | Baseline | Patched |
| Queue Depth | 6.19-rc8 | |
+------------------+----------+---------+
| 1 | 244 | 245 |
| 2 | 244 | 245 |
| 4 | 245 | 245 |
| 8 | 242 | 245 |
| 16 | 222 | 246 |
| 32 | 211 | 245 |
| 64 | 193 | 244 |
| 128 | 112 | 246 |
+------------------+----------+---------+
With the current code (baseline), as the sequential write stream crosses
a zone boundary, higher queue depth creates a gap between the
last IO to the previous zone and the first IOs to the following zones,
causing head seeks and degrading performance. Using the disk zone
write plugs worker thread, this pattern disappears and the maximum
throughput of the drive is maintained, leading to over 100%
improvements in throughput for high queue depth write.
Using 16 fio jobs all writing to randomly chosen zones at QD=32 with 1
MiB direct IOs, write throughput also increases significantly.
+--------------------+
| Write BW (MB/s) |
+------------------+----------+---------+
| Random write | Baseline | Patched |
| Number of zones | 6.19-rc7 | |
+------------------+----------+---------+
| 1 | 191 | 192 |
| 2 | 101 | 128 |
| 4 | 115 | 123 |
| 8 | 90 | 120 |
| 16 | 64 | 115 |
| 32 | 58 | 105 |
| 64 | 56 | 101 |
| 128 | 55 | 99 |
+------------------+----------+---------+
Tests using XFS shows that buffered write speed with 8 jobs writing
files increases by 12% to 35% depending on the workload.
+--------------------+
| Write BW (MB/s) |
+------------------+----------+---------+
| Workload | Baseline | Patched |
| | 6.19-rc7 | |
+------------------+----------+---------+
| 256MiB file size | 212 | 238 |
+------------------+----------+---------+
| 4MiB .. 128 MiB | 213 | 243 |
| random file size | | |
+------------------+----------+---------+
| 2MiB .. 8 MiB | 179 | 242 |
| random file size | | |
+------------------+----------+---------+
Performance gains are even more significant when using an HBA that
limits the maximum size of commands to a small value, e.g. HBAs
controlled with the mpi3mr driver limit commands to a maximum of 1 MiB.
In such case, the write throughput gains are over 40%.
+--------------------+
| Write BW (MB/s) |
+------------------+----------+---------+
| Workload | Baseline | Patched |
| | 6.19-rc7 | |
+------------------+----------+---------+
| 256MiB file size | 175 | 245 |
+------------------+----------+---------+
| 4MiB .. 128 MiB | 174 | 244 |
| random file size | | |
+------------------+----------+---------+
| 2MiB .. 8 MiB | 171 | 243 |
| random file size | | |
+------------------+----------+---------+
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Factor out a helper to see if the block device has an integrity checksum
from bdev_stable_writes so that it can be reused for other checks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rename struct gendisk zone_wplugs_lock field to zone_wplugs_hash_lock to
clearly indicates that this is the spinlock used for manipulating the
hash table of zone write plugs.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper to set the seed and check flag based on useful defaults
from the profile.
Note that this includes a small behavior change, as we now only set the
seed if any action is set, which is fine as nothing will look at it
otherwise.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The helper function disk_zone_is_full() is only used in
disk_zone_wplug_is_full(). So remove it and open code it directly in
this single caller.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split the logic to see if a bio needs integrity metadata from
bio_integrity_prep into a reusable helper than can be called from
file system code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
disk_get_and_lock_zone_wplug() always returns a zone write plug with the
plug lock held. This is unnecessary since this function does not look at
the fields of existing plugs, and new plugs need to be locked only after
their insertion in the disk hash table, when they are being used.
Remove the zone write plug locking from disk_get_and_lock_zone_wplug()
and rename this function disk_get_or_alloc_zone_wplug().
blk_zone_wplug_handle_write() is modified to add locking of the zone
write plug after calling disk_get_or_alloc_zone_wplug() and before
starting to use the plug. This change also simplifies
blk_revalidate_seq_zone() as unlocking the plug becomes unnecessary.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The function disk_zone_wplug_schedule_bio_work() always takes a
reference on the zone write plug of the BIO work being scheduled. This
ensures that the zone write plug cannot be freed while the BIO work is
being scheduled but has not run yet. However, this unconditional
reference taking is fragile since the reference taken is released by the
BIO work blk_zone_wplug_bio_work() function, which implies that there
always must be a 1:1 relation between the work being scheduled and the
work running.
Make sure to drop the reference taken when scheduling the BIO work if
the work is already scheduled, that is, when queue_work() returns false.
Fixes: 9e78c38ab30b ("block: Hold a reference on zone write plugs to schedule submission")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull fsverity fixes from Eric Biggers:
- Fix a build error on parisc
- Remove the non-large-folio-aware function fsverity_verify_page()
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
fsverity: fix build error by adding fsverity_readahead() stub
fsverity: remove fsverity_verify_page()
f2fs: make f2fs_verify_cluster() partially large-folio-aware
f2fs: remove unnecessary ClearPageUptodate in f2fs_verify_cluster()
Commit 7b295187287e ("block: Do not remove zone write plugs still in
use") modified disk_should_remove_zone_wplug() to add a check on the
reference count of a zone write plug to prevent removing zone write
plugs from a disk hash table when the plugs are still being referenced
by BIOs or requests in-flight. However, this check does not take into
account that a BIO completion may happen right after its submission by
a zone write plug BIO work, and before the zone write plug BIO work
releases the zone write plug reference count. This situation leads to
disk_should_remove_zone_wplug() returning false as in this case the zone
write plug reference count is at least equal to 3. If the BIO that
completes in such manner transitioned the zone to the FULL condition,
the zone write plug for the FULL zone will remain in the disk hash
table.
Furthermore, relying on a particular value of a zone write plug
reference count to set the BLK_ZONE_WPLUG_UNHASHED flag is fragile as
reading the atomic reference count and doing a comparison with some
value is not overall atomic at all.
Address these issues by reworking the reference counting of zone write
plugs so that removing plugs from a disk hash table can be done
directly from disk_put_zone_wplug() when the last reference on a plug
is dropped.
To do so, replace the function disk_remove_zone_wplug() with
disk_mark_zone_wplug_dead(). This new function sets the zone write plug
flag BLK_ZONE_WPLUG_DEAD (which replaces BLK_ZONE_WPLUG_UNHASHED) and
drops the initial reference on the zone write plug taken when the plug
was added to the disk hash table. This function is called either for
zones that are empty or full, or directly in the case of a forced plug
removal (e.g. when the disk hash table is being destroyed on disk
removal). With this change, disk_should_remove_zone_wplug() is also
removed.
disk_put_zone_wplug() is modified to call the function
disk_free_zone_wplug() to remove a zone write plug from a disk hash
table and free the plug structure (with a call_rcu()), when the last
reference on a zone write plug is dropped. disk_free_zone_wplug()
always checks that the BLK_ZONE_WPLUG_DEAD flag is set.
In order to avoid having multiple zone write plugs for the same zone in
the disk hash table, disk_get_and_lock_zone_wplug() checked for the
BLK_ZONE_WPLUG_UNHASHED flag. This check is removed and a check for
the new BLK_ZONE_WPLUG_DEAD flag is added to
blk_zone_wplug_handle_write(). With this change, we continue preventing
adding multiple zone write plugs for the same zone and at the same time
re-inforce checks on the user behavior by failing new incoming write
BIOs targeting a zone that is marked as dead. This case can happen only
if the user erroneously issues write BIOs to zones that are full, or to
zones that are currently being reset or finished.
Fixes: 7b295187287e ("block: Do not remove zone write plugs still in use")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
hppa-linux-gcc 9.5.0 generates a call to fsverity_readahead() in
f2fs_readahead() when CONFIG_FS_VERITY=n, because it fails to do the
expected dead code elimination based on vi always being NULL. Fix the
build error by adding an inline stub for fsverity_readahead(). Since
it's just for opportunistic readahead, just make it a no-op.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202602180838.pwICdY2r-lkp@intel.com/
Fixes: 45dcb3ac9832 ("f2fs: consolidate fsverity_info lookup")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260218012244.18536-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
The queue_hw_ctx field in struct request_queue is an array of pointers
to struct blk_mq_hw_ctx. The number of elements in this array is tracked
by the nr_hw_queues field.
The array is allocated in __blk_mq_realloc_hw_ctxs() using
kcalloc_node() with set->nr_hw_queues elements. q->nr_hw_queues is
subsequently updated to set->nr_hw_queues.
When growing the array, the new array is assigned to queue_hw_ctx before
nr_hw_queues is updated. This is safe because nr_hw_queues (the old
smaller count) is used for bounds checking, which is within the new
larger allocation.
When shrinking the array, nr_hw_queues is updated to the smaller value,
while queue_hw_ctx retains the larger allocation. This is also safe as
the count is within the allocation bounds.
Annotating queue_hw_ctx with __counted_by_ptr(nr_hw_queues) allows the
compiler (with kSAN) to verify that accesses to queue_hw_ctx are within
the valid range defined by nr_hw_queues.
This patch was generated by CodeMender and reviewed by Bill Wendling.
Tested by running blktests.
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Bill Wendling <morbo@google.com>
[axboe: massage commit message]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stephen retired and stepped back from -next maintainership, update his
entry in CREDITS to recognise his 18 years of hard work making it what
it is today and all the impact it's had on our development process.
Also update to his current GnuPG key while we're here.
Acked-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I finally got a big endian PPC64 kernel to boot in QEMU. The PPC64 VSX
optimized AES library code does work in that case, with the exception of
rndkey_from_vsx() which doesn't take into account that the order in
which the VSX code stores the round key words depends on the endianness.
So fix rndkey_from_vsx() to do the right thing on big endian CPUs.
Fixes: 7cf2082e74ce ("lib/crypto: powerpc/aes: Migrate POWER8 optimized code into library")
Link: https://lore.kernel.org/r/20260216022104.332991-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
This adds a function for retrieving the set of Locking objects enabled
for Single User Mode (SUM) and the value of the
RangeStartRangeLengthPolicy parameter.
It retrieves data from the LockingInfo table, specifically the
columns SingleUserModeRanges and RangeStartLengthPolicy, which
were added according to the TCG Opal Feature Set: Single User Mode,
as described in chapters 4.4.3.1 and 4.4.3.2.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The x509 public key code gained a dependency on the sha256 hash
implementation, causing a rare link time failure in randconfig
builds:
arm-linux-gnueabi-ld: crypto/asymmetric_keys/x509_public_key.o: in function `x509_get_sig_params':
x509_public_key.c:(.text.x509_get_sig_params+0x12): undefined reference to `sha256'
arm-linux-gnueabi-ld: (sha256): Unknown destination type (ARM/Thumb) in crypto/asymmetric_keys/x509_public_key.o
x509_public_key.c:(.text.x509_get_sig_params+0x12): dangerous relocation: unsupported relocation
Select the necessary library code from Kconfig.
Fixes: 2c62068ac86b ("x509: Separately calculate sha256 for blacklist")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull sysctl updates from Joel Granados:
- Remove macros from proc handler converters
Replace the proc converter macros with "regular" functions. Though it
is more verbose than the macro version, it helps when debugging and
better aligns with coding-style.rst.
- General cleanup
Remove superfluous ctl_table forward declarations. Const qualify the
memory_allocation_profiling_sysctl and loadpin_sysctl_table arrays.
Add missing kernel doc to proc_dointvec_conv.
- Testing
This series was run through sysctl selftests/kunit test suite in
x86_64. And went into linux-next after rc4, giving it a good 3 weeks
of testing
* tag 'sysctl-7.00-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: replace SYSCTL_INT_CONV_CUSTOM macro with functions
sysctl: Replace unidirectional INT converter macros with functions
sysctl: Add kernel doc to proc_douintvec_conv
sysctl: Replace UINT converter macros with functions
sysctl: Add CONFIG_PROC_SYSCTL guards for converter macros
sysctl: clarify proc_douintvec_minmax doc
sysctl: Return -ENOSYS from proc_douintvec_conv when CONFIG_PROC_SYSCTL=n
sysctl: Remove unused ctl_table forward declarations
loadpin: Implement custom proc_handler for enforce
alloc_tag: move memory_allocation_profiling_sysctls into .rodata
sysctl: Add missing kernel-doc for proc_dointvec_conv
f2fs_verify_cluster() is the only remaining caller of the
non-large-folio-aware function fsverity_verify_page(). To unblock the
removal of that function, change f2fs_verify_cluster() to verify the
entire folio of each page and mark it up-to-date.
Note that this doesn't actually make f2fs_verify_cluster()
large-folio-aware, as it is still passed an array of pages. Currently,
it's never called with large folios.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260218010630.7407-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Align to the commit bf4afc53b77a ("Convert 'alloc_obj' family to use the
new default GFP_KERNEL argument") update the 'kmalloc_obj' declaration
for userspace to fix below compile error:
In file included from arch/arm/boot/compressed/../../../../lib/decompress_unxz.c:241,
from arch/arm/boot/compressed/decompress.c:56:
arch/arm/boot/compressed/../../../../lib/xz/xz_dec_stream.c: In function 'xz_dec_init':
arch/arm/boot/compressed/../../../../lib/xz/xz_dec_stream.c:787:28: error: implicit declaration of function 'kmalloc_obj'; did you mean 'kmalloc'? [-Wimplicit-function-declaration]
787 | struct xz_dec *s = kmalloc_obj(*s);
| ^~~~~~~~~~~
| kmalloc
Signed-off-by: Haiyue Wang <haiyuewa@163.com>
Fixes: 69050f8d6d07 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
Fixes: bf4afc53b77a ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument")
Reviewed-by: Kees Cook <kees@kernel.org>
Acked-by: Lasse Collin <lasse.collin@tukaani.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove SYSCTL_INT_CONV_CUSTOM and replace it with proc_int_conv. This
converter function expects a negp argument as it can take on negative
values. Update all jiffies converters to use explicit function calls.
Remove SYSCTL_CONV_IDENTITY as it is no longer used.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
This ioctl is used to set up RLE (read lock enabled) and WLE (write
lock enabled) parameters of the Locking object.
In Single User Mode (SUM), if the RangeStartRangeLengthPolicy parameter
is set in the 'Reactivate' method, only Admin authority maintains the
locking range length and start (offset) attributes of Locking objects
set up for SUM. All other attributes from struct opal_user_lr_setup
(RLE - read locking enabled, WLE - write locking enabled) shall
remain in possession of the User authority associated with the Locking
object set for SUM.
With the IOC_OPAL_ENABLE_DISABLE_LR ioctl, the opal_user_lr_setup
members 'range_start' and 'range_length' of the ioctl argument are
ignored.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull RTC updates from Alexandre Belloni:
- loongson: Loongson-2K0300 support
- s35390a: nvmem support
- zynqmp: rework calibration
* tag 'rtc-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
rtc: ds1390: fix number of bytes read from RTC
rtc: class: Remove duplicate check for alarm
rtc: optee: simplify OP-TEE context match
rtc: interface: Alarm race handling should not discard preceding error
rtc: s35390a: implement nvmem support
rtc: loongson: Add Loongson-2K0300 support
dt-bindings: rtc: loongson: Document Loongson-2K0300 compatible
dt-bindings: rtc: loongson: Correct Loongson-1C interrupts property
dt-bindings: rtc: renesas,rz-rtca3: Add RZ/V2N support
dt-bindings: rtc: cpcap: convert to schema
rtc: zynqmp: use dynamic max and min offset ranges
rtc: zynqmp: rework set_offset
rtc: zynqmp: rework read_offset
rtc: zynqmp: check calibration max value
rtc: zynqmp: correct frequency value
rtc: amlogic-a4: Remove IRQF_ONESHOT
rtc: pcf8563: use correct of_node for output clock
rtc: max31335: use correct CONFIG symbol in IS_REACHABLE()
rtc: nvvrs: Add ARCH_TEGRA to the NV VRS RTC driver
Pull turbostat updates from Len Brown:
- Add L2 statistics columns for recent Intel processors:
L2MRPS = L2 Cache M-References Per Second
L2%hit = L2 Cache Hit %
- Sort work and output by cpu# rather than core#
- Minor features and fixes
* tag 'turbostat-2026.02.14' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: (23 commits)
tools/power turbostat: version 2026.02.14
tools/power turbostat: Fix and document --header_iterations
tools/power turbostat: Use strtoul() for iteration parsing
tools/power turbostat: Favor cpu# over core#
tools/power turbostat: Expunge logical_cpu_id
tools/power turbostat: Enhance HT enumeration
tools/power turbostat: Simplify global core_id calculation
tools/power turbostat: Unify even/odd/average counter referencing
tools/power turbostat: Allocate average counters dynamically
tools/power turbostat: Delete core_data.core_id
tools/power turbostat: Rename physical_core_id to core_id
tools/power turbostat: Cleanup package_id
tools/power turbostat: Cleanup internal use of "base_cpu"
tools/power turbostat: Add L2 cache statistics
tools/power turbostat: Remove redundant newlines from err(3) strings
tools/power turbostat: Allow more use of is_hybrid flag
tools/power turbostat: Rename "LLCkRPS" column to "LLCMRPS"
tools/power turbostat.8: Document the "--force" option
tools/power turbostat: Harden against unexpected values
tools/power turbostat: Dump hypervisor name
...
turbostat.c:8688: rapl_perf_init: Assertion `next_domain < num_domains' failed.
Two recent cleanup patches that were not supposed to change anything
broke the core_id code needed for AMD RAPL initialization:
commit 070e92361eec ("tools/power turbostat: Enhance HT enumeration")
commit ddf60e38ca04 ("tools/power turbostat: Simplify global core_id calculation")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
Replace SYSCTL_USER_TO_KERN_INT_CONV and SYSCTL_KERN_TO_USER_INT_CONV
macros with function implementing the same logic.This makes debugging
easier and aligns with the functions preference described in
coding-style.rst. Update all jiffies converters to use explicit function
implementations instead of macro-generated versions.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Pull LoongArch updates from Huacai Chen:
- Select HAVE_CMPXCHG_{LOCAL,DOUBLE}
- Add 128-bit atomic cmpxchg support
- Add HOTPLUG_SMT implementation
- Wire up memfd_secret system call
- Fix boot errors and unwind errors for KASAN
- Use BPF prog pack allocator and add BPF arena support
- Update dts files to add nand controllers
- Some bug fixes and other small changes
* tag 'loongarch-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: dts: loongson-2k1000: Add nand controller support
LoongArch: dts: loongson-2k0500: Add nand controller support
LoongArch: BPF: Implement bpf_addr_space_cast instruction
LoongArch: BPF: Implement PROBE_MEM32 pseudo instructions
LoongArch: BPF: Use BPF prog pack allocator
LoongArch: Use IS_ERR_PCPU() macro for KGDB
LoongArch: Rework KASAN initialization for PTW-enabled systems
LoongArch: Disable instrumentation for setup_ptwalker()
LoongArch: Remove some extern variables in source files
LoongArch: Guard percpu handler under !CONFIG_PREEMPT_RT
LoongArch: Handle percpu handler address for ORC unwinder
LoongArch: Use %px to print unmodified unwinding address
LoongArch: Prefer top-down allocation after arch_mem_init()
LoongArch: Add HOTPLUG_SMT implementation
LoongArch: Make cpumask_of_node() robust against NUMA_NO_NODE
LoongArch: Wire up memfd_secret system call
LoongArch: Replace seq_printf() with seq_puts() for simple strings
LoongArch: Add 128-bit atomic cmpxchg support
LoongArch: Add detection for SC.Q support
LoongArch: Select HAVE_CMPXCHG_LOCAL in Kconfig
This ioctl is used to set up locking range start (offset)
and locking range length attributes only.
In Single User Mode (SUM), if the RangeStartRangeLengthPolicy parameter
is set in the 'Reactivate' method, only Admin authority maintains the
locking range length and start (offset) attributes of Locking objects
set up for SUM. All other attributes from struct opal_user_lr_setup
(RLE - read locking enabled, WLE - write locking enabled) shall
remain in possession of the User authority associated with the Locking
object set for SUM.
Therefore, we need a separate function for setting up locking range
start and locking range length because it may require two different
authorities (and sessions) if the RangeStartRangeLengthPolicy attribute
is set.
With the IOC_OPAL_LR_SET_START_LEN ioctl, the opal_user_lr_setup
members 'RLE' and 'WLE' of the ioctl argument are ignored.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull rust fixes from Miguel Ojeda:
"Toolchain and infrastructure:
- Pass '-Zunstable-options' flag required by the future Rust 1.95.0
- Fix 'objtool' warning for Rust 1.84.0
'kernel' crate:
- 'irq' module: add missing bound detected by the future Rust 1.95.0
- 'list' module: add missing 'unsafe' blocks and placeholder safety
comments to macros (an issue for future callers within the crate)
'pin-init' crate:
- Clean Clippy warning that changed behavior in the future Rust
1.95.0"
* tag 'rust-fixes-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
rust: list: Add unsafe blocks for container_of and safety comments
rust: pin-init: replace clippy `expect` with `allow`
rust: irq: add `'static` bounds to irq callbacks
objtool/rust: add one more `noreturn` Rust function
rust: kbuild: pass `-Zunstable-options` for Rust 1.95.0
The spi_write_then_read() reads 8 bytes starting from
DS1390_REG_SECONDS (== 0x01), so the last byte read would already
be part of the alarm (Tenths and Hundredths of Seconds) feature.
However 7 bytes are engouh -- seconds (0x01), minutes (0x02), hours (0x03),
day (0x04), date (0x05), month/century (0x06) and year (0x07).
Signed-off-by: Andreas Gabriel-Platschek <andi.platschek@gmail.com>
Link: https://patch.msgid.link/20260209053439.313825-1-andi.platschek@gmail.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Pull ntfs3 updates from Konstantin Komarov:
"New code:
- improve readahead for bitmap initialization and large directory scans
- fsync files by syncing parent inodes
- drop of preallocated clusters for sparse and compressed files
- zero-fill folios beyond i_valid in ntfs_read_folio()
- implement llseek SEEK_DATA/SEEK_HOLE by scanning data runs
- implement iomap-based file operations
- allow explicit boolean acl/prealloc mount options
- fall-through between switch labels
- delayed-allocation (delalloc) support
Fixes:
- check return value of indx_find to avoid infinite loop
- initialize new folios before use
- infinite loop in attr_load_runs_range on inconsistent metadata
- infinite loop triggered by zero-sized ATTR_LIST
- ntfs_mount_options leak in ntfs_fill_super()
- deadlock in ni_read_folio_cmpr
- circular locking dependency in run_unpack_ex
- prevent infinite loops caused by the next valid being the same
- restore NULL folio initialization in ntfs_writepages()
- slab-out-of-bounds read in DeleteIndexEntryRoot
Updates:
- allow readdir() to finish after directory mutations without rewinddir()
- handle attr_set_size() errors when truncating files
- make ntfs_writeback_ops static
- refactor duplicate kmemdup pattern in do_action()
- avoid calling run_get_entry() when run == NULL in ntfs_read_run_nb_ra()
Replaced:
- use wait_on_buffer() directly
- rename ni_readpage_cmpr into ni_read_folio_cmpr"
* tag 'ntfs3_for_7.0' of https://github.com/Paragon-Software-Group/linux-ntfs3: (26 commits)
fs/ntfs3: add delayed-allocation (delalloc) support
fs/ntfs3: avoid calling run_get_entry() when run == NULL in ntfs_read_run_nb_ra()
fs/ntfs3: add fall-through between switch labels
fs/ntfs3: allow explicit boolean acl/prealloc mount options
fs/ntfs3: Fix slab-out-of-bounds read in DeleteIndexEntryRoot
ntfs3: Restore NULL folio initialization in ntfs_writepages()
ntfs3: Refactor duplicate kmemdup pattern in do_action()
fs/ntfs3: prevent infinite loops caused by the next valid being the same
fs/ntfs3: make ntfs_writeback_ops static
ntfs3: fix circular locking dependency in run_unpack_ex
fs/ntfs3: implement iomap-based file operations
fs/ntfs3: fix deadlock in ni_read_folio_cmpr
fs/ntfs3: implement llseek SEEK_DATA/SEEK_HOLE by scanning data runs
fs/ntfs3: zero-fill folios beyond i_valid in ntfs_read_folio()
fs/ntfs3: handle attr_set_size() errors when truncating files
fs/ntfs3: drop preallocated clusters for sparse and compressed files
fs/ntfs3: fsync files by syncing parent inodes
fs/ntfs3: fix ntfs_mount_options leak in ntfs_fill_super()
fs/ntfs3: allow readdir() to finish after directory mutations without rewinddir()
fs/ntfs3: improve readahead for bitmap initialization and large directory scans
...
Since release 2025.12.02:
Add L2 statistics columns for recent Intel processors:
L2MRPS = L2 Cache M-References Per Second
L2%hit = L2 Cache Hit %
Sort work and output by cpu# rather than core#
This commit:
Version number and white space (indent -l160)
No functional change.
Signed-off-by: Len Brown <len.brown@intel.com>
Pull memblock updates from Mike Rapoport:
- update tools/include/linux/mm.h to fix memblock tests compilation
- drop redundant struct page* parameter from memblock_free_pages() and
get struct page from the pfn
- add underflow detection for size calculation in memtest and warn
about underflow when VM_DEBUG is enabled
* tag 'memblock-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
mm/memtest: add underflow detection for size calculation
memblock: drop redundant 'struct page *' argument from memblock_free_pages()
memblock test: include <linux/sizes.h> from tools mm.h stub