Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

docs: block: Create blk-mq documentation

Create a documentation providing a background and explanation around the
operation of the Multi-Queue Block IO Queueing Mechanism (blk-mq).

The reference for writing this documentation was the source code and
"Linux Block IO: Introducing Multi-queue SSD Access on Multi-core
Systems", by Axboe et al.

Signed-off-by: André Almeida <andrealmeid@collabora.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20200620002036.113000-1-andrealmeid@collabora.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

authored by

André Almeida and committed by
Jonathan Corbet
46e90614 b103c607

+154
+153
Documentation/block/blk-mq.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ================================================ 4 + Multi-Queue Block IO Queueing Mechanism (blk-mq) 5 + ================================================ 6 + 7 + The Multi-Queue Block IO Queueing Mechanism is an API to enable fast storage 8 + devices to achieve a huge number of input/output operations per second (IOPS) 9 + through queueing and submitting IO requests to block devices simultaneously, 10 + benefiting from the parallelism offered by modern storage devices. 11 + 12 + Introduction 13 + ============ 14 + 15 + Background 16 + ---------- 17 + 18 + Magnetic hard disks have been the de facto standard from the beginning of the 19 + development of the kernel. The Block IO subsystem aimed to achieve the best 20 + performance possible for those devices with a high penalty when doing random 21 + access, and the bottleneck was the mechanical moving parts, a lot slower than 22 + any layer on the storage stack. One example of such optimization technique 23 + involves ordering read/write requests according to the current position of the 24 + hard disk head. 25 + 26 + However, with the development of Solid State Drives and Non-Volatile Memories 27 + without mechanical parts nor random access penalty and capable of performing 28 + high parallel access, the bottleneck of the stack had moved from the storage 29 + device to the operating system. In order to take advantage of the parallelism 30 + in those devices' design, the multi-queue mechanism was introduced. 31 + 32 + The former design had a single queue to store block IO requests with a single 33 + lock. That did not scale well in SMP systems due to dirty data in cache and the 34 + bottleneck of having a single lock for multiple processors. This setup also 35 + suffered with congestion when different processes (or the same process, moving 36 + to different CPUs) wanted to perform block IO. Instead of this, the blk-mq API 37 + spawns multiple queues with individual entry points local to the CPU, removing 38 + the need for a lock. A deeper explanation on how this works is covered in the 39 + following section (`Operation`_). 40 + 41 + Operation 42 + --------- 43 + 44 + When the userspace performs IO to a block device (reading or writing a file, 45 + for instance), blk-mq takes action: it will store and manage IO requests to 46 + the block device, acting as middleware between the userspace (and a file 47 + system, if present) and the block device driver. 48 + 49 + blk-mq has two group of queues: software staging queues and hardware dispatch 50 + queues. When the request arrives at the block layer, it will try the shortest 51 + path possible: send it directly to the hardware queue. However, there are two 52 + cases that it might not do that: if there's an IO scheduler attached at the 53 + layer or if we want to try to merge requests. In both cases, requests will be 54 + sent to the software queue. 55 + 56 + Then, after the requests are processed by software queues, they will be placed 57 + at the hardware queue, a second stage queue were the hardware has direct access 58 + to process those requests. However, if the hardware does not have enough 59 + resources to accept more requests, blk-mq will places requests on a temporary 60 + queue, to be sent in the future, when the hardware is able. 61 + 62 + Software staging queues 63 + ~~~~~~~~~~~~~~~~~~~~~~~ 64 + 65 + The block IO subsystem adds requests in the software staging queues 66 + (represented by struct :c:type:`blk_mq_ctx`) in case that they weren't sent 67 + directly to the driver. A request is one or more BIOs. They arrived at the 68 + block layer through the data structure struct :c:type:`bio`. The block layer 69 + will then build a new structure from it, the struct :c:type:`request` that will 70 + be used to communicate with the device driver. Each queue has its own lock and 71 + the number of queues is defined by a per-CPU or per-node basis. 72 + 73 + The staging queue can be used to merge requests for adjacent sectors. For 74 + instance, requests for sector 3-6, 6-7, 7-9 can become one request for 3-9. 75 + Even if random access to SSDs and NVMs have the same time of response compared 76 + to sequential access, grouped requests for sequential access decreases the 77 + number of individual requests. This technique of merging requests is called 78 + plugging. 79 + 80 + Along with that, the requests can be reordered to ensure fairness of system 81 + resources (e.g. to ensure that no application suffers from starvation) and/or to 82 + improve IO performance, by an IO scheduler. 83 + 84 + IO Schedulers 85 + ^^^^^^^^^^^^^ 86 + 87 + There are several schedulers implemented by the block layer, each one following 88 + a heuristic to improve the IO performance. They are "pluggable" (as in plug 89 + and play), in the sense of they can be selected at run time using sysfs. You 90 + can read more about Linux's IO schedulers `here 91 + <https://www.kernel.org/doc/html/latest/block/index.html>`_. The scheduling 92 + happens only between requests in the same queue, so it is not possible to merge 93 + requests from different queues, otherwise there would be cache trashing and a 94 + need to have a lock for each queue. After the scheduling, the requests are 95 + eligible to be sent to the hardware. One of the possible schedulers to be 96 + selected is the NONE scheduler, the most straightforward one. It will just 97 + place requests on whatever software queue the process is running on, without 98 + any reordering. When the device starts processing requests in the hardware 99 + queue (a.k.a. run the hardware queue), the software queues mapped to that 100 + hardware queue will be drained in sequence according to their mapping. 101 + 102 + Hardware dispatch queues 103 + ~~~~~~~~~~~~~~~~~~~~~~~~ 104 + 105 + The hardware queue (represented by struct :c:type:`blk_mq_hw_ctx`) is a struct 106 + used by device drivers to map the device submission queues (or device DMA ring 107 + buffer), and are the last step of the block layer submission code before the 108 + low level device driver taking ownership of the request. To run this queue, the 109 + block layer removes requests from the associated software queues and tries to 110 + dispatch to the hardware. 111 + 112 + If it's not possible to send the requests directly to hardware, they will be 113 + added to a linked list (:c:type:`hctx->dispatch`) of requests. Then, 114 + next time the block layer runs a queue, it will send the requests laying at the 115 + :c:type:`dispatch` list first, to ensure a fairness dispatch with those 116 + requests that were ready to be sent first. The number of hardware queues 117 + depends on the number of hardware contexts supported by the hardware and its 118 + device driver, but it will not be more than the number of cores of the system. 119 + There is no reordering at this stage, and each software queue has a set of 120 + hardware queues to send requests for. 121 + 122 + .. note:: 123 + 124 + Neither the block layer nor the device protocols guarantee 125 + the order of completion of requests. This must be handled by 126 + higher layers, like the filesystem. 127 + 128 + Tag-based completion 129 + ~~~~~~~~~~~~~~~~~~~~ 130 + 131 + In order to indicate which request has been completed, every request is 132 + identified by an integer, ranging from 0 to the dispatch queue size. This tag 133 + is generated by the block layer and later reused by the device driver, removing 134 + the need to create a redundant identifier. When a request is completed in the 135 + drive, the tag is sent back to the block layer to notify it of the finalization. 136 + This removes the need to do a linear search to find out which IO has been 137 + completed. 138 + 139 + Further reading 140 + --------------- 141 + 142 + - `Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems <http://kernel.dk/blk-mq.pdf>`_ 143 + 144 + - `NOOP scheduler <https://en.wikipedia.org/wiki/Noop_scheduler>`_ 145 + 146 + - `Null block device driver <https://www.kernel.org/doc/html/latest/block/null_blk.html>`_ 147 + 148 + Source code documentation 149 + ========================= 150 + 151 + .. kernel-doc:: include/linux/blk-mq.h 152 + 153 + .. kernel-doc:: block/blk-mq.c
+1
Documentation/block/index.rst
··· 10 10 bfq-iosched 11 11 biodoc 12 12 biovecs 13 + blk-mq 13 14 capability 14 15 cmdline-partition 15 16 data-integrity