Linux kernel
============

The Linux kernel is the core of any Linux operating system. It manages
hardware and system resources, and provides the fundamental services
for all other software.

Quick Start
-----------

* Report a bug: See Documentation/admin-guide/reporting-issues.rst
* Get the latest kernel: https://kernel.org
* Build the kernel: See Documentation/admin-guide/quickly-build-trimmed-linux.rst
* Join the community: https://lore.kernel.org/

Essential Documentation
-----------------------

All users should be familiar with:

* Building requirements: Documentation/process/changes.rst
* Code of Conduct: Documentation/process/code-of-conduct.rst
* License: See COPYING

Documentation can be built with "make htmldocs" or viewed online at:
https://www.kernel.org/doc/html/latest/

Who Are You?
============

Find your role below:

* New Kernel Developer - Getting started with kernel development
* Academic Researcher - Studying kernel internals and architecture
* Security Expert - Hardening and vulnerability analysis
* Backport/Maintenance Engineer - Maintaining stable kernels
* System Administrator - Configuring and troubleshooting
* Maintainer - Leading subsystems and reviewing patches
* Hardware Vendor - Writing drivers for new hardware
* Distribution Maintainer - Packaging kernels for distros
* AI Coding Assistant - LLMs and AI-powered development tools

For Specific Users
==================

New Kernel Developer
--------------------

Welcome! Start your kernel development journey here:

* Getting Started: Documentation/process/development-process.rst
* Your First Patch: Documentation/process/submitting-patches.rst
* Coding Style: Documentation/process/coding-style.rst
* Build System: Documentation/kbuild/index.rst
* Development Tools: Documentation/dev-tools/index.rst
* Kernel Hacking Guide: Documentation/kernel-hacking/hacking.rst
* Core APIs: Documentation/core-api/index.rst

Academic Researcher
-------------------

Explore the kernel's architecture and internals:

* Researcher Guidelines: Documentation/process/researcher-guidelines.rst
* Memory Management: Documentation/mm/index.rst
* Scheduler: Documentation/scheduler/index.rst
* Networking Stack: Documentation/networking/index.rst
* Filesystems: Documentation/filesystems/index.rst
* RCU (Read-Copy Update): Documentation/RCU/index.rst
* Locking Primitives: Documentation/locking/index.rst
* Power Management: Documentation/power/index.rst

Security Expert
---------------

Security documentation and hardening guides:

* Security Documentation: Documentation/security/index.rst
* LSM Development: Documentation/security/lsm-development.rst
* Self Protection: Documentation/security/self-protection.rst
* Reporting Vulnerabilities: Documentation/process/security-bugs.rst
* CVE Procedures: Documentation/process/cve.rst
* Embargoed Hardware Issues: Documentation/process/embargoed-hardware-issues.rst
* Security Features: Documentation/userspace-api/seccomp_filter.rst

Backport/Maintenance Engineer
-----------------------------

Maintain and stabilize kernel versions:

* Stable Kernel Rules: Documentation/process/stable-kernel-rules.rst
* Backporting Guide: Documentation/process/backporting.rst
* Applying Patches: Documentation/process/applying-patches.rst
* Subsystem Profile: Documentation/maintainer/maintainer-entry-profile.rst
* Git for Maintainers: Documentation/maintainer/configure-git.rst

System Administrator
--------------------

Configure, tune, and troubleshoot Linux systems:

* Admin Guide: Documentation/admin-guide/index.rst
* Kernel Parameters: Documentation/admin-guide/kernel-parameters.rst
* Sysctl Tuning: Documentation/admin-guide/sysctl/index.rst
* Tracing/Debugging: Documentation/trace/index.rst
* Performance Security: Documentation/admin-guide/perf-security.rst
* Hardware Monitoring: Documentation/hwmon/index.rst

Maintainer
----------

Lead kernel subsystems and manage contributions:

* Maintainer Handbook: Documentation/maintainer/index.rst
* Pull Requests: Documentation/maintainer/pull-requests.rst
* Managing Patches: Documentation/maintainer/modifying-patches.rst
* Rebasing and Merging: Documentation/maintainer/rebasing-and-merging.rst
* Development Process: Documentation/process/maintainer-handbooks.rst
* Maintainer Entry Profile: Documentation/maintainer/maintainer-entry-profile.rst
* Git Configuration: Documentation/maintainer/configure-git.rst

Hardware Vendor
---------------

Write drivers and support new hardware:

* Driver API Guide: Documentation/driver-api/index.rst
* Driver Model: Documentation/driver-api/driver-model/driver.rst
* Device Drivers: Documentation/driver-api/infrastructure.rst
* Bus Types: Documentation/driver-api/driver-model/bus.rst
* Device Tree Bindings: Documentation/devicetree/bindings/
* Power Management: Documentation/driver-api/pm/index.rst
* DMA API: Documentation/core-api/dma-api.rst

Distribution Maintainer
-----------------------

Package and distribute the kernel:

* Stable Kernel Rules: Documentation/process/stable-kernel-rules.rst
* ABI Documentation: Documentation/ABI/README
* Kernel Configuration: Documentation/kbuild/kconfig.rst
* Module Signing: Documentation/admin-guide/module-signing.rst
* Kernel Parameters: Documentation/admin-guide/kernel-parameters.rst
* Tainted Kernels: Documentation/admin-guide/tainted-kernels.rst

AI Coding Assistant
-------------------

CRITICAL: If you are an LLM or AI-powered coding assistant, you MUST
read and follow the AI coding assistants documentation before
contributing to the Linux kernel:

* Documentation/process/coding-assistants.rst

This documentation contains essential requirements about licensing,
attribution, and the Developer Certificate of Origin that all AI tools
must comply with.

Communication and Support
=========================

* Mailing Lists: https://lore.kernel.org/
* IRC: #kernelnewbies on irc.oftc.net
* Bugzilla: https://bugzilla.kernel.org/
* MAINTAINERS file: Lists subsystem maintainers and mailing lists
* Email Clients: Documentation/process/email-clients.rst
In order to maintain sequential write patterns per zone with zoned block
devices, zone write plugging issues only a single write BIO per zone at
any time. This works well but has the side effect that when large
sequential write streams are issued by the user and these streams cross
zone boundaries, the device ends up receiving a discontiguous set of
write commands for different zones. The same also happens when a user
writes simultaneously at high queue depth to multiple zones: the device
does not see all sequential writes per zone and receives discontiguous
writes to different zones. While this does not affect the performance of
solid state zoned block devices, when using an SMR HDD, this pattern
change from sequential writes to discontiguous writes to different zones
significantly increases head seek which results in degraded write
throughput.
In order to reduce this seek overhead for rotational media devices,
introduce a per disk zone write plugs kernel thread to issue all write
BIOs to zones. This single zone write issuing context is enabled for
any zoned block device that has a request queue flagged with the new
QUEUE_ZONED_QD1_WRITES flag.
The flag QUEUE_ZONED_QD1_WRITES is visible as the sysfs queue attribute
zoned_qd1_writes for zoned devices. For regular block devices, this
attribute is not visible. For zoned block devices, a user can set this
attribute to override the default and force a global maximum write
queue depth of 1 for the device, or clear it to fall back to the
default zone write plugging behavior, which limits writes to QD=1 per
sequential zone.
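As a sketch of how such an attribute could be toggled from user space
(the attribute name is taken from the description above; the device
path is a placeholder and this is not part of the patch itself):

```shell
# Sketch: toggle the zoned_qd1_writes queue attribute of a zoned disk.
# $1 = path to the disk's sysfs queue directory, $2 = 0 or 1.
toggle_zoned_qd1_writes() {
    attr="$1/zoned_qd1_writes"
    if [ ! -e "$attr" ]; then
        echo "$attr not found (not a zoned block device?)" >&2
        return 1
    fi
    echo "$2" > "$attr"
}

# Example (hypothetical device):
#   toggle_zoned_qd1_writes /sys/block/sdb/queue 0
```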
Writing to a zoned block device flagged with QUEUE_ZONED_QD1_WRITES is
implemented using a list of zone write plugs that have a non-empty BIO
list. Listed zone write plugs are processed by the disk zone write plugs
worker kthread in FIFO order, and all BIOs of a zone write plug are
processed before switching to the next listed zone write plug. A newly
submitted BIO for a non-FULL zone write plug that is not yet listed
causes the addition of the zone write plug at the end of the disk list
of zone write plugs.
Since the write BIOs queued in a zone write plug BIO list are
necessarily sequential, for rotational media, using the single zone
write plugs kthread to issue all BIOs maintains a sequential write
pattern and thus reduces seek overhead and improves write throughput.
This processing essentially results in always writing to HDDs at QD=1,
which is not an issue for HDDs operating with write caching enabled.
Performance with write cache disabled is also not degraded thanks to
the efficient write handling of modern SMR HDDs.
A disk list of zone write plugs is defined using the new struct gendisk
zone_wplugs_list, and accesses to this list are protected using the
zone_wplugs_list_lock spinlock. The per disk kthread
(zone_wplugs_worker) code is implemented by the function
disk_zone_wplugs_worker(). A reference on listed zone write plugs is
always held until all BIOs of the zone write plug are processed by the
worker kthread. BIO issuing at QD=1 is driven using a completion
structure (zone_wplugs_worker_bio_done) and calls to blk_io_wait().
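The QD=1 issuing loop driven by a completion can be modeled in plain C
with a condition variable standing in for the kernel completion
structure (illustrative only; the names and the bio_endio analogue are
not the actual kernel interfaces):

```c
#include <assert.h>
#include <pthread.h>

/* Stand-in for a kernel completion (zone_wplugs_worker_bio_done). */
struct completion {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int done;
};

static void wait_for_done(struct completion *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->done)
        pthread_cond_wait(&c->cond, &c->lock);
    c->done = 0;                    /* re-arm for the next BIO */
    pthread_mutex_unlock(&c->lock);
}

static void signal_done(struct completion *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = 1;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

struct bio_ctx {
    struct completion *c;
    int *log, *n;
    int bio;
};

/* Completion side, running in another thread: "complete" the BIO. */
static void *bio_endio(void *arg)
{
    struct bio_ctx *b = arg;

    b->log[(*b->n)++] = b->bio;
    signal_done(b->c);
    return NULL;
}

/* Issue one BIO at a time: wait for its completion before the next. */
static void issue_qd1(struct completion *c, const int *bios, int nr,
                      int *log, int *n)
{
    for (int i = 0; i < nr; i++) {
        pthread_t t;
        struct bio_ctx b = { c, log, n, bios[i] };

        pthread_create(&t, NULL, bio_endio, &b);
        wait_for_done(c);           /* QD=1: block until completion */
        pthread_join(t, NULL);
    }
}
```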
With this change, performance when sequentially writing the zones of a
30 TB SMR SATA HDD connected to an AHCI adapter changes as follows
(1 MiB direct I/Os, results in MB/s):
                   +--------------------+
                   |  Write BW (MB/s)   |
+------------------+----------+---------+
| Sequential write | Baseline | Patched |
| Queue Depth      | 6.19-rc8 |         |
+------------------+----------+---------+
|                1 |      244 |     245 |
|                2 |      244 |     245 |
|                4 |      245 |     245 |
|                8 |      242 |     245 |
|               16 |      222 |     246 |
|               32 |      211 |     245 |
|               64 |      193 |     244 |
|              128 |      112 |     246 |
+------------------+----------+---------+
With the current code (baseline), as the sequential write stream crosses
a zone boundary, higher queue depth creates a gap between the
last IO to the previous zone and the first IOs to the following zones,
causing head seeks and degrading performance. Using the disk zone
write plugs worker thread, this pattern disappears and the maximum
throughput of the drive is maintained, leading to over 100%
improvements in throughput for high queue depth writes.
Using 16 fio jobs all writing to randomly chosen zones at QD=32 with 1
MiB direct IOs, write throughput also increases significantly.
                   +--------------------+
                   |  Write BW (MB/s)   |
+------------------+----------+---------+
| Random write     | Baseline | Patched |
| Number of zones  | 6.19-rc7 |         |
+------------------+----------+---------+
|                1 |      191 |     192 |
|                2 |      101 |     128 |
|                4 |      115 |     123 |
|                8 |       90 |     120 |
|               16 |       64 |     115 |
|               32 |       58 |     105 |
|               64 |       56 |     101 |
|              128 |       55 |      99 |
+------------------+----------+---------+
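The random-zones test described above (16 jobs, randomly chosen zones,
QD=32, 1 MiB direct I/Os) could be expressed as a fio job file roughly
like the following; the device path is a placeholder, and zonemode=zbd
requires an actual zoned block device:

```ini
[global]
filename=/dev/sdX      ; placeholder: a zoned block device
ioengine=libaio
direct=1
bs=1M
iodepth=32
rw=randwrite
zonemode=zbd           ; honor zone write pointer constraints
numjobs=16
group_reporting

[zoned-random-write]
```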
Tests using XFS show that buffered write speed with 8 jobs writing
files increases by 12% to 35%, depending on the workload.
                   +--------------------+
                   |  Write BW (MB/s)   |
+------------------+----------+---------+
| Workload         | Baseline | Patched |
|                  | 6.19-rc7 |         |
+------------------+----------+---------+
| 256MiB file size |      212 |     238 |
+------------------+----------+---------+
| 4MiB .. 128 MiB  |      213 |     243 |
| random file size |          |         |
+------------------+----------+---------+
| 2MiB .. 8 MiB    |      179 |     242 |
| random file size |          |         |
+------------------+----------+---------+
Performance gains are even more significant when using an HBA that
limits the maximum command size to a small value; e.g. HBAs controlled
by the mpi3mr driver limit commands to a maximum of 1 MiB. In such a
case, the write throughput gains are over 40%.
                   +--------------------+
                   |  Write BW (MB/s)   |
+------------------+----------+---------+
| Workload         | Baseline | Patched |
|                  | 6.19-rc7 |         |
+------------------+----------+---------+
| 256MiB file size |      175 |     245 |
+------------------+----------+---------+
| 4MiB .. 128 MiB  |      174 |     244 |
| random file size |          |         |
+------------------+----------+---------+
| 2MiB .. 8 MiB    |      171 |     243 |
| random file size |          |         |
+------------------+----------+---------+
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rename struct gendisk zone_wplugs_lock field to zone_wplugs_hash_lock to
clearly indicate that this is the spinlock used for manipulating the
hash table of zone write plugs.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The helper function disk_zone_is_full() is only used in
disk_zone_wplug_is_full(). So remove it and open code it directly in
this single caller.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
disk_get_and_lock_zone_wplug() always returns a zone write plug with the
plug lock held. This is unnecessary since this function does not look at
the fields of existing plugs, and new plugs need to be locked only after
their insertion in the disk hash table, when they are being used.
Remove the zone write plug locking from disk_get_and_lock_zone_wplug()
and rename this function to disk_get_or_alloc_zone_wplug().
blk_zone_wplug_handle_write() is modified to add locking of the zone
write plug after calling disk_get_or_alloc_zone_wplug() and before
starting to use the plug. This change also simplifies
blk_revalidate_seq_zone() as unlocking the plug becomes unnecessary.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The function disk_zone_wplug_schedule_bio_work() always takes a
reference on the zone write plug of the BIO work being scheduled. This
ensures that the zone write plug cannot be freed while the BIO work is
being scheduled but has not run yet. However, this unconditional
reference taking is fragile since the reference taken is released by the
BIO work blk_zone_wplug_bio_work() function, which implies that there
always must be a 1:1 relation between the work being scheduled and the
work running.
Make sure to drop the reference taken when scheduling the BIO work if
the work is already scheduled, that is, when queue_work() returns false.
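The fix can be modeled abstractly (plain C toy model, illustrative
names only, not the kernel implementation): take a reference before
scheduling, and drop it immediately if the work was already pending,
so that references and work executions stay 1:1.

```c
#include <assert.h>

struct plug {
    int ref;
    int work_pending;
};

static void get_plug(struct plug *p) { p->ref++; }
static void put_plug(struct plug *p) { p->ref--; assert(p->ref >= 0); }

/* Returns 0 if the work was already scheduled, like queue_work(). */
static int queue_work(struct plug *p)
{
    if (p->work_pending)
        return 0;
    p->work_pending = 1;
    return 1;
}

static void schedule_bio_work(struct plug *p)
{
    get_plug(p);                 /* pin the plug for the work run */
    if (!queue_work(p))          /* already scheduled: the pending run */
        put_plug(p);             /* drops its own ref, so drop this one */
}

static void bio_work(struct plug *p)
{
    p->work_pending = 0;
    /* ... issue queued BIOs ... */
    put_plug(p);                 /* release the scheduling reference */
}
```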
Fixes: 9e78c38ab30b ("block: Hold a reference on zone write plugs to schedule submission")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 7b295187287e ("block: Do not remove zone write plugs still in
use") modified disk_should_remove_zone_wplug() to add a check on the
reference count of a zone write plug to prevent removing zone write
plugs from a disk hash table when the plugs are still being referenced
by BIOs or requests in-flight. However, this check does not take into
account that a BIO completion may happen right after its submission by
a zone write plug BIO work, and before the zone write plug BIO work
releases the zone write plug reference count. This situation leads to
disk_should_remove_zone_wplug() returning false as in this case the zone
write plug reference count is at least equal to 3. If the BIO that
completes in such manner transitioned the zone to the FULL condition,
the zone write plug for the FULL zone will remain in the disk hash
table.
Furthermore, relying on a particular value of a zone write plug
reference count to set the BLK_ZONE_WPLUG_UNHASHED flag is fragile, as
reading the atomic reference count and then comparing it with some
value is not an atomic operation overall.
Address these issues by reworking the reference counting of zone write
plugs so that removing plugs from a disk hash table can be done
directly from disk_put_zone_wplug() when the last reference on a plug
is dropped.
To do so, replace the function disk_remove_zone_wplug() with
disk_mark_zone_wplug_dead(). This new function sets the zone write plug
flag BLK_ZONE_WPLUG_DEAD (which replaces BLK_ZONE_WPLUG_UNHASHED) and
drops the initial reference on the zone write plug taken when the plug
was added to the disk hash table. This function is called either for
zones that are empty or full, or directly in the case of a forced plug
removal (e.g. when the disk hash table is being destroyed on disk
removal). With this change, disk_should_remove_zone_wplug() is also
removed.
disk_put_zone_wplug() is modified to call the function
disk_free_zone_wplug() to remove a zone write plug from a disk hash
table and free the plug structure (with a call_rcu()), when the last
reference on a zone write plug is dropped. disk_free_zone_wplug()
always checks that the BLK_ZONE_WPLUG_DEAD flag is set.
In order to avoid having multiple zone write plugs for the same zone in
the disk hash table, disk_get_and_lock_zone_wplug() checked for the
BLK_ZONE_WPLUG_UNHASHED flag. This check is removed and a check for
the new BLK_ZONE_WPLUG_DEAD flag is added to
blk_zone_wplug_handle_write(). With this change, we continue preventing
adding multiple zone write plugs for the same zone and at the same time
reinforce checks on user behavior by failing new incoming write
BIOs targeting a zone that is marked as dead. This case can happen only
if the user erroneously issues write BIOs to zones that are full, or to
zones that are currently being reset or finished.
Fixes: 7b295187287e ("block: Do not remove zone write plugs still in use")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The queue_hw_ctx field in struct request_queue is an array of pointers
to struct blk_mq_hw_ctx. The number of elements in this array is tracked
by the nr_hw_queues field.
The array is allocated in __blk_mq_realloc_hw_ctxs() using
kcalloc_node() with set->nr_hw_queues elements. q->nr_hw_queues is
subsequently updated to set->nr_hw_queues.
When growing the array, the new array is assigned to queue_hw_ctx before
nr_hw_queues is updated. This is safe because nr_hw_queues (the old
smaller count) is used for bounds checking, which is within the new
larger allocation.
When shrinking the array, nr_hw_queues is updated to the smaller value,
while queue_hw_ctx retains the larger allocation. This is also safe as
the count is within the allocation bounds.
Annotating queue_hw_ctx with __counted_by_ptr(nr_hw_queues) allows the
compiler (with KASAN) to verify that accesses to queue_hw_ctx are within
the valid range defined by nr_hw_queues.
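The general shape of such an annotation can be sketched in plain C.
Note that __counted_by_ptr is a kernel macro; this sketch uses the
underlying compiler attribute on a flexible array member instead, with
a no-op fallback for compilers lacking it, and the struct and function
names are hypothetical:

```c
#include <stdlib.h>

/* Fall back to a no-op where the attribute is unsupported. */
#if defined(__has_attribute)
#if __has_attribute(counted_by)
#define COUNTED_BY(m) __attribute__((counted_by(m)))
#endif
#endif
#ifndef COUNTED_BY
#define COUNTED_BY(m)
#endif

/* hw[]'s valid range is tied to nr, mirroring how nr_hw_queues bounds
 * queue_hw_ctx: the compiler/sanitizers can check accesses against nr. */
struct queue_ctx {
    unsigned int nr;
    int *hw[] COUNTED_BY(nr);
};

/* Allocate a context with n slots; set the count before array use. */
static struct queue_ctx *alloc_ctx(unsigned int n)
{
    struct queue_ctx *q = malloc(sizeof(*q) + n * sizeof(int *));

    if (q)
        q->nr = n;
    return q;
}
```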
This patch was generated by CodeMender and reviewed by Bill Wendling.
Tested by running blktests.
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Bill Wendling <morbo@google.com>
[axboe: massage commit message]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds a function for retrieving the set of Locking objects enabled
for Single User Mode (SUM) and the value of the
RangeStartRangeLengthPolicy parameter.
It retrieves data from the LockingInfo table, specifically the
columns SingleUserModeRanges and RangeStartLengthPolicy, which
were added according to the TCG Opal Feature Set: Single User Mode,
as described in chapters 4.4.3.1 and 4.4.3.2.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change the column parameter in response_get_column() from u8 to u64
to support the full range of column identifiers.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This ioctl is used to set up RLE (read lock enabled) and WLE (write
lock enabled) parameters of the Locking object.
In Single User Mode (SUM), if the RangeStartRangeLengthPolicy parameter
is set in the 'Reactivate' method, only Admin authority maintains the
locking range length and start (offset) attributes of Locking objects
set up for SUM. All other attributes from struct opal_user_lr_setup
(RLE - read locking enabled, WLE - write locking enabled) shall
remain in possession of the User authority associated with the Locking
object set for SUM.
With the IOC_OPAL_ENABLE_DISABLE_LR ioctl, the opal_user_lr_setup
members 'range_start' and 'range_length' of the ioctl argument are
ignored.
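As a hedged sketch of the calling convention: the struct below mirrors
only the fields of struct opal_user_lr_setup relevant here (the real
UAPI definition lives in include/uapi/linux/sed-opal.h and also carries
the session information), and the helper name is hypothetical.

```c
#include <string.h>
#include <stdint.h>

/* Simplified stand-in for struct opal_user_lr_setup; see
 * include/uapi/linux/sed-opal.h for the real layout. */
struct lr_setup {
    uint64_t range_start;    /* ignored by this ioctl */
    uint64_t range_length;   /* ignored by this ioctl */
    uint32_t RLE;            /* read lock enabled */
    uint32_t WLE;            /* write lock enabled */
};

/* Prepare the ioctl argument: only RLE/WLE matter; the range fields
 * stay zeroed since IOC_OPAL_ENABLE_DISABLE_LR ignores them. */
static struct lr_setup make_rle_wle_args(uint32_t rle, uint32_t wle)
{
    struct lr_setup s;

    memset(&s, 0, sizeof(s));
    s.RLE = rle;
    s.WLE = wle;
    return s;
}
```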
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This ioctl is used to set up locking range start (offset)
and locking range length attributes only.
In Single User Mode (SUM), if the RangeStartRangeLengthPolicy parameter
is set in the 'Reactivate' method, only Admin authority maintains the
locking range length and start (offset) attributes of Locking objects
set up for SUM. All other attributes from struct opal_user_lr_setup
(RLE - read locking enabled, WLE - write locking enabled) shall
remain in possession of the User authority associated with the Locking
object set for SUM.
Therefore, we need a separate function for setting up locking range
start and locking range length because it may require two different
authorities (and sessions) if the RangeStartRangeLengthPolicy attribute
is set.
With the IOC_OPAL_LR_SET_START_LEN ioctl, the opal_user_lr_setup
members 'RLE' and 'WLE' of the ioctl argument are ignored.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IOC_OPAL_LR_SETUP is used to set up a locking range entirely under a
single authority (usually Admin1), but for Single User Mode (SUM),
the permissions for attributes (RangeStart, RangeLength)
and (ReadLockEnable, WriteLockEnable, ReadLocked, WriteLocked)
may be split between two different authorities. Typically, it is Admin1
for the former and the User associated with the LockingRange in SUM
for the latter.
This commit only splits the internals in preparation for the
introduction of separate ioctls: one for setting RangeStart and
RangeLength, and another for the remaining attributes.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds the 'Reactivate' method as described in the
"TCG Storage Opal SSC Feature Set: Single User Mode"
document (ch. 3.1.1.1).
The method enables switching an already active SED OPAL2 device,
with appropriate firmware support for Single User Mode (SUM),
to or from SUM.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As described in ch. 3.1.1.1.1.3 of TCG Storage Opal SSC Feature Set:
Single User Mode document.
To be used later in Reactivate method implementation.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As described in ch. 3.1.1.1.1.2 of TCG Storage Opal SSC Feature Set:
Single User Mode document.
To be used later in Reactivate method implementation and in function
for retrieving SUM device status.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As described in ch. 6.3, Table 240 in TCG Storage
Architecture Core Specification document.
It's also referenced in TCG Storage Opal SSC Feature Set:
Single User Mode document, ch. 3.1.1.1 Reactivate method.
It will be used later in the Reactivate method implementation
for the sed-opal interface.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>