Merge tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

+45

Documentation/ABI/stable/sysfs-block

···
 		enabled, and whether tags are shared.
 
 
+What:		/sys/block/<disk>/queue/async_depth
+Date:		August 2025
+Contact:	linux-block@vger.kernel.org
+Description:
+		[RW] Controls how many asynchronous requests may be allocated
+		in the block layer. The value is always capped at nr_requests.
+
+		When no elevator is active (none):
+
+		- async_depth is always equal to nr_requests.
+
+		For bfq scheduler:
+
+		- By default, async_depth is set to 75% of nr_requests.
+		  Internal limits are then derived from this value:
+
+		  * Sync writes: limited to async_depth (≈75% of nr_requests).
+		  * Async I/O: limited to ~2/3 of async_depth (≈50% of
+		    nr_requests).
+
+		  If a bfq_queue is weight-raised:
+
+		  * Sync writes: limited to ~1/2 of async_depth (≈37% of
+		    nr_requests).
+		  * Async I/O: limited to ~1/4 of async_depth (≈18% of
+		    nr_requests).
+
+		- If the user writes a custom value to async_depth, BFQ will
+		  recompute these limits proportionally based on the new value.
+
+		For Kyber:
+
+		- By default async_depth is set to 75% of nr_requests.
+		- If the user writes a custom value to async_depth, then it
+		  overrides the default and directly controls the limit for
+		  writes and async I/O.
+
+		For mq-deadline:
+
+		- By default async_depth is set to nr_requests.
+		- If the user writes a custom value to async_depth, then it
+		  overrides the default and directly controls the limit for
+		  writes and async I/O.
+
+
 What:		/sys/block/<disk>/queue/nr_zones
 Date:		November 2018
 Contact:	Damien Le Moal <damien.lemoal@wdc.com>

-1

Documentation/block/biovecs.rst

···
 	bio_first_bvec_all()
 	bio_first_page_all()
 	bio_first_folio_all()
-	bio_last_bvec_all()
 
 * The following helpers iterate over single-page segment. The passed 'struct
   bio_vec' will contain a single-page IO vector during the iteration::

+6

Documentation/block/inline-encryption.rst

···
 for en/decryption. Users don't need to worry about freeing the bio_crypt_ctx
 later, as that happens automatically when the bio is freed or reset.
 
+To submit a bio that uses inline encryption, users must call
+``blk_crypto_submit_bio()`` instead of the usual ``submit_bio()``. This will
+submit the bio to the underlying driver if it supports inline crypto, or else
+call the blk-crypto fallback routines before submitting normal bios to the
+underlying drivers.
+
 Finally, when done using inline encryption with a blk_crypto_key on a
 block_device, users must call ``blk_crypto_evict_key()``. This ensures that
 the key is evicted from all keyslots it may be programmed into and unlinked from
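Note on the paragraph added above: it describes a routing decision between the driver's inline crypto support and the blk-crypto fallback. The sketch below models only that decision in userspace C. Everything in it (mock_bio, mock_dev, mock_route, mock_crypto_route) is hypothetical and exists only for illustration; it is not the kernel implementation of blk_crypto_submit_bio(), and the branch for bios without an encryption context is an assumption that the documentation does not state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; none of these are kernel types. */
enum mock_route {
	MOCK_ROUTE_PLAIN,     /* assumed: no crypt context, ordinary submission */
	MOCK_ROUTE_INLINE_HW, /* driver handles the encryption itself */
	MOCK_ROUTE_FALLBACK,  /* software fallback runs before submission */
};

struct mock_bio {
	bool has_crypt_ctx;   /* bio carries an encryption context */
};

struct mock_dev {
	bool inline_crypto;   /* driver advertises inline encryption */
};

/* Decide which path a bio would take under the rules quoted above. */
enum mock_route mock_crypto_route(const struct mock_bio *bio,
				  const struct mock_dev *dev)
{
	assert(bio && dev);

	if (!bio->has_crypt_ctx)
		return MOCK_ROUTE_PLAIN;
	if (dev->inline_crypto)
		return MOCK_ROUTE_INLINE_HW;
	return MOCK_ROUTE_FALLBACK;
}
```

The point of the sketch is that the caller no longer picks a path: it always goes through one entry point, and the routing depends on the bio and on the device it targets.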

+60 -4

Documentation/block/ublk.rst

···
 and each command is only for forwarding the IO and committing the result
 with specified IO tag in the command data:
 
-- ``UBLK_IO_FETCH_REQ``
+Traditional Per-I/O Commands
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-  Sent from the server IO pthread for fetching future incoming IO requests
+- ``UBLK_U_IO_FETCH_REQ``
+
+  Sent from the server I/O pthread for fetching future incoming I/O requests
   destined to ``/dev/ublkb*``. This command is sent only once from the server
   IO pthread for ublk driver to setup IO forward environment.
 
···
   supported by the driver, daemons must be per-queue instead - i.e. all I/Os
   associated to a single qid must be handled by the same task.
 
-- ``UBLK_IO_COMMIT_AND_FETCH_REQ``
+- ``UBLK_U_IO_COMMIT_AND_FETCH_REQ``
 
   When an IO request is destined to ``/dev/ublkb*``, the driver stores
   the IO's ``ublksrv_io_desc`` to the specified mapped area; then the
···
   requests with the same IO tag. That is, ``UBLK_IO_COMMIT_AND_FETCH_REQ``
   is reused for both fetching request and committing back IO result.
 
-- ``UBLK_IO_NEED_GET_DATA``
+- ``UBLK_U_IO_NEED_GET_DATA``
 
   With ``UBLK_F_NEED_GET_DATA`` enabled, the WRITE request will be firstly
   issued to ublk server without data copy. Then, IO backend of ublk server
···
   When the server handles READ request and sends
   ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
   the server buffer (pages) read to the IO request pages.
+
+Batch I/O Commands (UBLK_F_BATCH_IO)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``UBLK_F_BATCH_IO`` feature provides an alternative high-performance
+I/O handling model that replaces the traditional per-I/O commands with
+per-queue batch commands. This significantly reduces communication overhead
+and enables better load balancing across multiple server tasks.
+
+Key differences from traditional mode:
+
+- **Per-queue vs Per-I/O**: Commands operate on queues rather than individual I/Os
+- **Batch processing**: Multiple I/Os are handled in single operations
+- **Multishot commands**: Use io_uring multishot for reduced submission overhead
+- **Flexible task assignment**: Any task can handle any I/O (no per-I/O daemons)
+- **Better load balancing**: Tasks can adjust their workload dynamically
+
+Batch I/O Commands:
+
+- ``UBLK_U_IO_PREP_IO_CMDS``
+
+  Prepares multiple I/O commands in batch. The server provides a buffer
+  containing multiple I/O descriptors that will be processed together.
+  This reduces the number of individual command submissions required.
+
+- ``UBLK_U_IO_COMMIT_IO_CMDS``
+
+  Commits results for multiple I/O operations in batch, and prepares the
+  I/O descriptors to accept new requests. The server provides a buffer
+  containing the results of multiple completed I/Os, allowing efficient
+  bulk completion of requests.
+
+- ``UBLK_U_IO_FETCH_IO_CMDS``
+
+  **Multishot command** for fetching I/O commands in batch. This is the key
+  command that enables high-performance batch processing:
+
+  * Uses io_uring multishot capability for reduced submission overhead
+  * Single command can fetch multiple I/O requests over time
+  * Buffer size determines maximum batch size per operation
+  * Multiple fetch commands can be submitted for load balancing
+  * Only one fetch command is active at any time per queue
+  * Supports dynamic load balancing across multiple server tasks
+
+  It is one typical multishot io_uring request with provided buffer, and it
+  won't be completed until any failure is triggered.
+
+  Each task can submit ``UBLK_U_IO_FETCH_IO_CMDS`` with different buffer
+  sizes to control how much work it handles. This enables sophisticated
+  load balancing strategies in multi-threaded servers.
+
+Migration: Applications using traditional commands (``UBLK_U_IO_FETCH_REQ``,
+``UBLK_U_IO_COMMIT_AND_FETCH_REQ``) cannot use batch mode simultaneously.
 
 Zero copy
 ---------

+1

MAINTAINERS

···
 SOFTWARE RAID (Multiple Disks) SUPPORT
 M:	Song Liu <song@kernel.org>
 M:	Yu Kuai <yukuai@fnnas.com>
+R:	Li Nan <linan122@huawei.com>
 L:	linux-raid@vger.kernel.org
 S:	Supported
 Q:	https://patchwork.kernel.org/project/linux-raid/list/

-1

block/bdev.c

···
 
 	inode->i_blkbits = blksize_bits(size);
 	mapping_set_folio_min_order(inode->i_mapping, get_order(size));
-	kill_bdev(bdev);
 	filemap_invalidate_unlock(inode->i_mapping);
 	inode_unlock(inode);
 }

+28 -37

block/bfq-iosched.c

···
 #define BFQ_RQ_SEEKY(bfqd, last_pos, rq) \
 	(get_sdist(last_pos, rq) >			\
 	 BFQQ_SEEK_THR &&				\
-	 (!blk_queue_nonrot(bfqd->queue) ||		\
+	 (blk_queue_rot(bfqd->queue) ||			\
 	  blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT))
 #define BFQQ_CLOSE_THR		(sector_t)(8 * 1024)
 #define BFQQ_SEEKY(bfqq)	(hweight32(bfqq->seek_history) > 19)
···
 	unsigned int limit, act_idx;
 
 	/* Sync reads have full depth available */
-	if (op_is_sync(opf) && !op_is_write(opf))
+	if (blk_mq_is_sync_read(opf))
 		limit = data->q->nr_requests;
 	else
 		limit = bfqd->async_depths[!!bfqd->wr_busy_queues][op_is_sync(opf)];
···
 
 	/* don't use too short time intervals */
 	if (delta_usecs < 1000) {
-		if (blk_queue_nonrot(bfqd->queue))
+		if (!blk_queue_rot(bfqd->queue))
 			 /*
 			  * give same worst-case guarantees as idling
 			  * for seeky
···
 						struct bfq_queue *bfqq)
 {
 	bool rot_without_queueing =
-		!blk_queue_nonrot(bfqd->queue) && !bfqd->hw_tag,
+		blk_queue_rot(bfqd->queue) && !bfqd->hw_tag,
 		bfqq_sequential_and_IO_bound,
 		idling_boosts_thr;
 
···
 	 * flash-based device.
 	 */
 	idling_boosts_thr = rot_without_queueing ||
-		((!blk_queue_nonrot(bfqd->queue) || !bfqd->hw_tag) &&
+		((blk_queue_rot(bfqd->queue) || !bfqd->hw_tag) &&
 		 bfqq_sequential_and_IO_bound);
 
 	/*
···
 			 * there is only one in-flight large request
 			 * at a time.
 			 */
-			if (blk_queue_nonrot(bfqd->queue) &&
+			if (!blk_queue_rot(bfqd->queue) &&
 			    blk_rq_sectors(bfqq->next_rq) >=
 			    BFQQ_SECT_THR_NONROT &&
 			    bfqd->tot_rq_in_driver >= 1)
···
 	bfqd->hw_tag_samples = 0;
 
 	bfqd->nonrot_with_queueing =
-		blk_queue_nonrot(bfqd->queue) && bfqd->hw_tag;
+		!blk_queue_rot(bfqd->queue) && bfqd->hw_tag;
 }
 
 static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
···
 static void bfq_depth_updated(struct request_queue *q)
 {
 	struct bfq_data *bfqd = q->elevator->elevator_data;
-	unsigned int nr_requests = q->nr_requests;
+	unsigned int async_depth = q->async_depth;
 
 	/*
-	 * In-word depths if no bfq_queue is being weight-raised:
-	 * leaving 25% of tags only for sync reads.
+	 * By default:
+	 *  - sync reads are not limited
+	 * If bfqq is not being weight-raised:
+	 *  - sync writes are limited to 75%(async depth default value)
+	 *  - async IO are limited to 50%
+	 * If bfqq is being weight-raised:
+	 *  - sync writes are limited to ~37%
+	 *  - async IO are limited to ~18
 	 *
-	 * In next formulas, right-shift the value
-	 * (1U<<bt->sb.shift), instead of computing directly
-	 * (1U<<(bt->sb.shift - something)), to be robust against
-	 * any possible value of bt->sb.shift, without having to
-	 * limit 'something'.
+	 * If request_queue->async_depth is updated by user, all limit are
+	 * updated relatively.
 	 */
-	/* no more than 50% of tags for async I/O */
-	bfqd->async_depths[0][0] = max(nr_requests >> 1, 1U);
-	/*
-	 * no more than 75% of tags for sync writes (25% extra tags
-	 * w.r.t. async I/O, to prevent async I/O from starving sync
-	 * writes)
-	 */
-	bfqd->async_depths[0][1] = max((nr_requests * 3) >> 2, 1U);
+	bfqd->async_depths[0][1] = async_depth;
+	bfqd->async_depths[0][0] = max(async_depth * 2 / 3, 1U);
+	bfqd->async_depths[1][1] = max(async_depth >> 1, 1U);
+	bfqd->async_depths[1][0] = max(async_depth >> 2, 1U);
 
 	/*
-	 * In-word depths in case some bfq_queue is being weight-
-	 * raised: leaving ~63% of tags for sync reads. This is the
-	 * highest percentage for which, in our tests, application
-	 * start-up times didn't suffer from any regression due to tag
-	 * shortage.
+	 * Due to cgroup qos, the allowed request for bfqq might be 1
 	 */
-	/* no more than ~18% of tags for async I/O */
-	bfqd->async_depths[1][0] = max((nr_requests * 3) >> 4, 1U);
-	/* no more than ~37% of tags for sync writes (~20% extra tags) */
-	bfqd->async_depths[1][1] = max((nr_requests * 6) >> 4, 1U);
-
 	blk_mq_set_min_shallow_depth(q, 1);
 }
 
···
 	INIT_HLIST_HEAD(&bfqd->burst_list);
 
 	bfqd->hw_tag = -1;
-	bfqd->nonrot_with_queueing = blk_queue_nonrot(bfqd->queue);
+	bfqd->nonrot_with_queueing = !blk_queue_rot(bfqd->queue);
 
 	bfqd->bfq_max_budget = bfq_default_max_budget;
 
···
 	 * Begin by assuming, optimistically, that the device peak
 	 * rate is equal to 2/3 of the highest reference rate.
 	 */
-	bfqd->rate_dur_prod = ref_rate[blk_queue_nonrot(bfqd->queue)] *
-		ref_wr_duration[blk_queue_nonrot(bfqd->queue)];
-	bfqd->peak_rate = ref_rate[blk_queue_nonrot(bfqd->queue)] * 2 / 3;
+	bfqd->rate_dur_prod = ref_rate[!blk_queue_rot(bfqd->queue)] *
+		ref_wr_duration[!blk_queue_rot(bfqd->queue)];
+	bfqd->peak_rate = ref_rate[!blk_queue_rot(bfqd->queue)] * 2 / 3;
 
 	/* see comments on the definition of next field inside bfq_data */
 	bfqd->actuator_load_threshold = 4;
···
 	blk_queue_flag_set(QUEUE_FLAG_DISABLE_WBT_DEF, q);
 	wbt_disable_default(q->disk);
 	blk_stat_enable_accounting(q);
+	q->async_depth = (q->nr_requests * 3) >> 2;
 
 	return 0;
 
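Note on the bfq_depth_updated() hunk above: the four limits are now plain functions of async_depth instead of nr_requests. The userspace sketch below repeats that arithmetic, assuming the formulas exactly as they appear in the added lines. The names sketch_bfq_depths, sketch_default_async_depth and sketch_compute_depths are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Userspace stand-in for the kernel's max(); illustration only. */
#define SKETCH_MAX(a, b) ((a) > (b) ? (a) : (b))

/* Hypothetical container mirroring async_depths[weight_raised][is_sync]. */
struct sketch_bfq_depths {
	unsigned int d[2][2];
};

/* Default set at scheduler init in the last hunk: 75% of nr_requests. */
unsigned int sketch_default_async_depth(unsigned int nr_requests)
{
	return (nr_requests * 3) >> 2;
}

/* Derive the four limits from async_depth, as in the added lines. */
struct sketch_bfq_depths sketch_compute_depths(unsigned int async_depth)
{
	struct sketch_bfq_depths r;

	assert(async_depth >= 1);

	r.d[0][1] = async_depth;                         /* sync writes */
	r.d[0][0] = SKETCH_MAX(async_depth * 2 / 3, 1U); /* async I/O */
	r.d[1][1] = SKETCH_MAX(async_depth >> 1, 1U);    /* sync writes, weight-raised */
	r.d[1][0] = SKETCH_MAX(async_depth >> 2, 1U);    /* async I/O, weight-raised */
	return r;
}
```

With nr_requests = 256 the default async_depth is 192, and the derived limits are 192, 128, 96 and 48, that is 75%, 50%, 37.5% and 18.75% of nr_requests. These match the percentages quoted in the new code comment and in the sysfs-block hunk.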

+1 -13

block/bio-integrity-auto.c

···
 
 static bool bi_offload_capable(struct blk_integrity *bi)
 {
-	switch (bi->csum_type) {
-	case BLK_INTEGRITY_CSUM_CRC64:
-		return bi->metadata_size == sizeof(struct crc64_pi_tuple);
-	case BLK_INTEGRITY_CSUM_CRC:
-	case BLK_INTEGRITY_CSUM_IP:
-		return bi->metadata_size == sizeof(struct t10_pi_tuple);
-	default:
-		pr_warn_once("%s: unknown integrity checksum type:%d\n",
-			__func__, bi->csum_type);
-		fallthrough;
-	case BLK_INTEGRITY_CSUM_NONE:
-		return false;
-	}
+	return bi->metadata_size == bi->pi_tuple_size;
 }
 
 /**

+4 -1

block/bio.c

···
  */
 void bio_reset(struct bio *bio, struct block_device *bdev, blk_opf_t opf)
 {
+	struct bio_vec *bv = bio->bi_io_vec;
+
 	bio_uninit(bio);
 	memset(bio, 0, BIO_RESET_BYTES);
 	atomic_set(&bio->__bi_remaining, 1);
+	bio->bi_io_vec = bv;
 	bio->bi_bdev = bdev;
 	if (bio->bi_bdev)
 		bio_associate_blkg(bio);
···
 {
 	WARN_ON_ONCE(bio->bi_max_vecs);
 
-	bio->bi_vcnt = iter->nr_segs;
 	bio->bi_io_vec = (struct bio_vec *)iter->bvec;
+	bio->bi_iter.bi_idx = 0;
 	bio->bi_iter.bi_bvec_done = iter->iov_offset;
 	bio->bi_iter.bi_size = iov_iter_count(iter);
 	bio_set_flag(bio, BIO_CLONED);
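Note on the bio_reset() hunk above: the added lines save bi_io_vec before the memset() and restore it afterwards, which suggests the pointer now sits inside the region that BIO_RESET_BYTES wipes. The userspace sketch below shows that save, wipe and restore pattern. The struct mock_bio layout, MOCK_RESET_BYTES and mock_bio_reset() are hypothetical, made up for illustration, and do not reproduce the kernel's struct bio.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: everything before max_vecs is wiped on reset. */
struct mock_bio {
	int status;
	unsigned int vcnt;
	void *io_vec;           /* assumed to lie inside the wiped region */
	unsigned int max_vecs;  /* first field that survives a reset */
};

#define MOCK_RESET_BYTES offsetof(struct mock_bio, max_vecs)

/* Wipe the resettable part of the bio but keep the vector pointer. */
void mock_bio_reset(struct mock_bio *bio)
{
	void *vec;

	assert(bio);

	vec = bio->io_vec;                /* save before the wipe */
	memset(bio, 0, MOCK_RESET_BYTES);
	bio->io_vec = vec;                /* restore afterwards */
}
```

Without the save and restore, a reset bio in this mock layout would lose its vector array even though max_vecs still claims the capacity is there.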

+13 -8

block/blk-core.c

···
 #undef REQ_OP_NAME
 
 /**
- * blk_op_str - Return string XXX in the REQ_OP_XXX.
- * @op: REQ_OP_XXX.
+ * blk_op_str - Return the string "name" for an operation REQ_OP_name.
+ * @op: a request operation.
  *
- * Description: Centralize block layer function to convert REQ_OP_XXX into
- * string format. Useful in the debugging and tracing bio or request. For
- * invalid REQ_OP_XXX it returns string "UNKNOWN".
+ * Convert a request operation REQ_OP_name into the string "name". Useful for
+ * debugging and tracing BIOs and requests. For an invalid request operation
+ * code, the string "UNKNOWN" is returned.
  */
 inline const char *blk_op_str(enum req_op op)
 {
···
 	fs_reclaim_release(GFP_KERNEL);
 
 	q->nr_requests = BLKDEV_DEFAULT_RQ;
+	q->async_depth = BLKDEV_DEFAULT_RQ;
 
 	return q;
 
···
 	/* If plug is not used, add new plug here to cache nsecs time. */
 	struct blk_plug plug;
 
-	if (unlikely(!blk_crypto_bio_prep(&bio)))
-		return;
-
 	blk_start_plug(&plug);
 
 	if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
···
 	 */
 	if ((bio->bi_opf & REQ_NOWAIT) && !bdev_nowait(bdev))
 		goto not_supported;
+
+	if (bio_has_crypt_ctx(bio)) {
+		if (WARN_ON_ONCE(!bio_has_data(bio)))
+			goto end_io;
+		if (!blk_crypto_supported(bio))
+			goto not_supported;
+	}
 
 	if (should_fail_bio(bio))
 		goto end_io;

+239 -234

block/blk-crypto-fallback.c

···
 #include "blk-cgroup.h"
 #include "blk-crypto-internal.h"
 
-static unsigned int num_prealloc_bounce_pg = 32;
+static unsigned int num_prealloc_bounce_pg = BIO_MAX_VECS;
 module_param(num_prealloc_bounce_pg, uint, 0);
 MODULE_PARM_DESC(num_prealloc_bounce_pg,
 		 "Number of preallocated bounce pages for the blk-crypto crypto API fallback");
···
 
 static struct blk_crypto_fallback_keyslot {
 	enum blk_crypto_mode_num crypto_mode;
-	struct crypto_skcipher *tfms[BLK_ENCRYPTION_MODE_MAX];
+	struct crypto_sync_skcipher *tfms[BLK_ENCRYPTION_MODE_MAX];
 } *blk_crypto_keyslots;
 
 static struct blk_crypto_profile *blk_crypto_fallback_profile;
 static struct workqueue_struct *blk_crypto_wq;
 static mempool_t *blk_crypto_bounce_page_pool;
-static struct bio_set crypto_bio_split;
+static struct bio_set enc_bio_set;
 
 /*
  * This is the key we set when evicting a keyslot. This *should* be the all 0's
···
 	WARN_ON(slotp->crypto_mode == BLK_ENCRYPTION_MODE_INVALID);
 
 	/* Clear the key in the skcipher */
-	err = crypto_skcipher_setkey(slotp->tfms[crypto_mode], blank_key,
+	err = crypto_sync_skcipher_setkey(slotp->tfms[crypto_mode], blank_key,
 				     blk_crypto_modes[crypto_mode].keysize);
 	WARN_ON(err);
 	slotp->crypto_mode = BLK_ENCRYPTION_MODE_INVALID;
···
 	blk_crypto_fallback_evict_keyslot(slot);
 
 	slotp->crypto_mode = crypto_mode;
-	err = crypto_skcipher_setkey(slotp->tfms[crypto_mode], key->bytes,
+	err = crypto_sync_skcipher_setkey(slotp->tfms[crypto_mode], key->bytes,
 				     key->size);
 	if (err) {
 		blk_crypto_fallback_evict_keyslot(slot);
···
 static void blk_crypto_fallback_encrypt_endio(struct bio *enc_bio)
 {
 	struct bio *src_bio = enc_bio->bi_private;
-	int i;
+	struct page **pages = (struct page **)enc_bio->bi_io_vec;
+	struct bio_vec *bv;
+	unsigned int i;
 
-	for (i = 0; i < enc_bio->bi_vcnt; i++)
-		mempool_free(enc_bio->bi_io_vec[i].bv_page,
-			     blk_crypto_bounce_page_pool);
+	/*
+	 * Use the same trick as the alloc side to avoid the need for an extra
+	 * pages array.
+	 */
+	bio_for_each_bvec_all(bv, enc_bio, i)
+		pages[i] = bv->bv_page;
 
-	src_bio->bi_status = enc_bio->bi_status;
+	i = mempool_free_bulk(blk_crypto_bounce_page_pool, (void **)pages,
+			enc_bio->bi_vcnt);
+	if (i < enc_bio->bi_vcnt)
+		release_pages(pages + i, enc_bio->bi_vcnt - i);
 
-	bio_uninit(enc_bio);
-	kfree(enc_bio);
+	if (enc_bio->bi_status)
+		cmpxchg(&src_bio->bi_status, 0, enc_bio->bi_status);
+
+	bio_put(enc_bio);
 	bio_endio(src_bio);
 }
 
-static struct bio *blk_crypto_fallback_clone_bio(struct bio *bio_src)
+#define PAGE_PTRS_PER_BVEC	(sizeof(struct bio_vec) / sizeof(struct page *))
+
+static struct bio *blk_crypto_alloc_enc_bio(struct bio *bio_src,
+		unsigned int nr_segs, struct page ***pages_ret)
 {
-	unsigned int nr_segs = bio_segments(bio_src);
-	struct bvec_iter iter;
-	struct bio_vec bv;
+	unsigned int memflags = memalloc_noio_save();
+	unsigned int nr_allocated;
+	struct page **pages;
 	struct bio *bio;
 
-	bio = bio_kmalloc(nr_segs, GFP_NOIO);
-	if (!bio)
-		return NULL;
-	bio_init_inline(bio, bio_src->bi_bdev, nr_segs, bio_src->bi_opf);
+	bio = bio_alloc_bioset(bio_src->bi_bdev, nr_segs, bio_src->bi_opf,
+			GFP_NOIO, &enc_bio_set);
 	if (bio_flagged(bio_src, BIO_REMAPPED))
 		bio_set_flag(bio, BIO_REMAPPED);
+	bio->bi_private		= bio_src;
+	bio->bi_end_io		= blk_crypto_fallback_encrypt_endio;
 	bio->bi_ioprio		= bio_src->bi_ioprio;
 	bio->bi_write_hint	= bio_src->bi_write_hint;
 	bio->bi_write_stream	= bio_src->bi_write_stream;
 	bio->bi_iter.bi_sector	= bio_src->bi_iter.bi_sector;
-	bio->bi_iter.bi_size	= bio_src->bi_iter.bi_size;
-
-	bio_for_each_segment(bv, bio_src, iter)
-		bio->bi_io_vec[bio->bi_vcnt++] = bv;
-
 	bio_clone_blkg_association(bio, bio_src);
 
+	/*
+	 * Move page array up in the allocated memory for the bio vecs as far as
+	 * possible so that we can start filling biovecs from the beginning
+	 * without overwriting the temporary page array.
+	 */
+	static_assert(PAGE_PTRS_PER_BVEC > 1);
+	pages = (struct page **)bio->bi_io_vec;
+	pages += nr_segs * (PAGE_PTRS_PER_BVEC - 1);
+
+	/*
+	 * Try a bulk allocation first. This could leave random pages in the
+	 * array unallocated, but we'll fix that up later in mempool_alloc_bulk.
+	 *
+	 * Note: alloc_pages_bulk needs the array to be zeroed, as it assumes
+	 * any non-zero slot already contains a valid allocation.
+	 */
+	memset(pages, 0, sizeof(struct page *) * nr_segs);
+	nr_allocated = alloc_pages_bulk(GFP_KERNEL, nr_segs, pages);
+	if (nr_allocated < nr_segs)
+		mempool_alloc_bulk(blk_crypto_bounce_page_pool, (void **)pages,
+				nr_segs, nr_allocated);
+	memalloc_noio_restore(memflags);
+	*pages_ret = pages;
 	return bio;
 }
 
-static bool
-blk_crypto_fallback_alloc_cipher_req(struct blk_crypto_keyslot *slot,
-				     struct skcipher_request **ciph_req_ret,
-				     struct crypto_wait *wait)
+static struct crypto_sync_skcipher *
+blk_crypto_fallback_tfm(struct blk_crypto_keyslot *slot)
 {
-	struct skcipher_request *ciph_req;
-	const struct blk_crypto_fallback_keyslot *slotp;
-	int keyslot_idx = blk_crypto_keyslot_index(slot);
+	const struct blk_crypto_fallback_keyslot *slotp =
+		&blk_crypto_keyslots[blk_crypto_keyslot_index(slot)];
 
-	slotp = &blk_crypto_keyslots[keyslot_idx];
-	ciph_req = skcipher_request_alloc(slotp->tfms[slotp->crypto_mode],
-					  GFP_NOIO);
-	if (!ciph_req)
-		return false;
-
-	skcipher_request_set_callback(ciph_req,
-				      CRYPTO_TFM_REQ_MAY_BACKLOG |
-				      CRYPTO_TFM_REQ_MAY_SLEEP,
-				      crypto_req_done, wait);
-	*ciph_req_ret = ciph_req;
-
-	return true;
-}
-
-static bool blk_crypto_fallback_split_bio_if_needed(struct bio **bio_ptr)
-{
-	struct bio *bio = *bio_ptr;
-	unsigned int i = 0;
-	unsigned int num_sectors = 0;
-	struct bio_vec bv;
-	struct bvec_iter iter;
-
-	bio_for_each_segment(bv, bio, iter) {
-		num_sectors += bv.bv_len >> SECTOR_SHIFT;
-		if (++i == BIO_MAX_VECS)
-			break;
-	}
-
-	if (num_sectors < bio_sectors(bio)) {
-		bio = bio_submit_split_bioset(bio, num_sectors,
-					      &crypto_bio_split);
-		if (!bio)
-			return false;
-
-		*bio_ptr = bio;
-	}
-
-	return true;
+	return slotp->tfms[slotp->crypto_mode];
 }
 
 union blk_crypto_iv {
···
 		iv->dun[i] = cpu_to_le64(dun[i]);
 }
 
-/*
- * The crypto API fallback's encryption routine.
- * Allocate a bounce bio for encryption, encrypt the input bio using crypto API,
- * and replace *bio_ptr with the bounce bio. May split input bio if it's too
- * large. Returns true on success. Returns false and sets bio->bi_status on
- * error.
- */
-static bool blk_crypto_fallback_encrypt_bio(struct bio **bio_ptr)
+static void __blk_crypto_fallback_encrypt_bio(struct bio *src_bio,
+		struct crypto_sync_skcipher *tfm)
 {
-	struct bio *src_bio, *enc_bio;
-	struct bio_crypt_ctx *bc;
-	struct blk_crypto_keyslot *slot;
-	int data_unit_size;
-	struct skcipher_request *ciph_req = NULL;
-	DECLARE_CRYPTO_WAIT(wait);
+	struct bio_crypt_ctx *bc = src_bio->bi_crypt_context;
+	int data_unit_size = bc->bc_key->crypto_cfg.data_unit_size;
+	SYNC_SKCIPHER_REQUEST_ON_STACK(ciph_req, tfm);
 	u64 curr_dun[BLK_CRYPTO_DUN_ARRAY_SIZE];
 	struct scatterlist src, dst;
 	union blk_crypto_iv iv;
-	unsigned int i, j;
-	bool ret = false;
-	blk_status_t blk_st;
+	unsigned int nr_enc_pages, enc_idx;
+	struct page **enc_pages;
+	struct bio *enc_bio;
+	unsigned int i;
 
-	/* Split the bio if it's too big for single page bvec */
-	if (!blk_crypto_fallback_split_bio_if_needed(bio_ptr))
-		return false;
-
-	src_bio = *bio_ptr;
-	bc = src_bio->bi_crypt_context;
-	data_unit_size = bc->bc_key->crypto_cfg.data_unit_size;
-
-	/* Allocate bounce bio for encryption */
-	enc_bio = blk_crypto_fallback_clone_bio(src_bio);
-	if (!enc_bio) {
-		src_bio->bi_status = BLK_STS_RESOURCE;
-		return false;
-	}
-
-	/*
-	 * Get a blk-crypto-fallback keyslot that contains a crypto_skcipher for
-	 * this bio's algorithm and key.
-	 */
-	blk_st = blk_crypto_get_keyslot(blk_crypto_fallback_profile,
-					bc->bc_key, &slot);
-	if (blk_st != BLK_STS_OK) {
-		src_bio->bi_status = blk_st;
-		goto out_put_enc_bio;
-	}
-
-	/* and then allocate an skcipher_request for it */
-	if (!blk_crypto_fallback_alloc_cipher_req(slot, &ciph_req, &wait)) {
-		src_bio->bi_status = BLK_STS_RESOURCE;
-		goto out_release_keyslot;
-	}
+	skcipher_request_set_callback(ciph_req,
+			CRYPTO_TFM_REQ_MAY_BACKLOG | CRYPTO_TFM_REQ_MAY_SLEEP,
+			NULL, NULL);
 
 	memcpy(curr_dun, bc->bc_dun, sizeof(curr_dun));
 	sg_init_table(&src, 1);
···
 	skcipher_request_set_crypt(ciph_req, &src, &dst, data_unit_size,
 				   iv.bytes);
 
-	/* Encrypt each page in the bounce bio */
-	for (i = 0; i < enc_bio->bi_vcnt; i++) {
-		struct bio_vec *enc_bvec = &enc_bio->bi_io_vec[i];
-		struct page *plaintext_page = enc_bvec->bv_page;
-		struct page *ciphertext_page =
-			mempool_alloc(blk_crypto_bounce_page_pool, GFP_NOIO);
+	/*
+	 * Encrypt each page in the source bio. Because the source bio could
+	 * have bio_vecs that span more than a single page, but the encrypted
+	 * bios are limited to a single page per bio_vec, this can generate
+	 * more than a single encrypted bio per source bio.
+	 */
+new_bio:
+	nr_enc_pages = min(bio_segments(src_bio), BIO_MAX_VECS);
+	enc_bio = blk_crypto_alloc_enc_bio(src_bio, nr_enc_pages, &enc_pages);
+	enc_idx = 0;
+	for (;;) {
+		struct bio_vec src_bv =
+			bio_iter_iovec(src_bio, src_bio->bi_iter);
+		struct page *enc_page = enc_pages[enc_idx];
 
-		enc_bvec->bv_page = ciphertext_page;
-
-		if (!ciphertext_page) {
-			src_bio->bi_status = BLK_STS_RESOURCE;
-			goto out_free_bounce_pages;
+		if (!IS_ALIGNED(src_bv.bv_len | src_bv.bv_offset,
+				data_unit_size)) {
+			enc_bio->bi_status = BLK_STS_INVAL;
+			goto out_free_enc_bio;
 		}
 
-		sg_set_page(&src, plaintext_page, data_unit_size,
-			    enc_bvec->bv_offset);
-		sg_set_page(&dst, ciphertext_page, data_unit_size,
-			    enc_bvec->bv_offset);
+		__bio_add_page(enc_bio, enc_page, src_bv.bv_len,
+				src_bv.bv_offset);
 
-		/* Encrypt each data unit in this page */
-		for (j = 0; j < enc_bvec->bv_len; j += data_unit_size) {
+		sg_set_page(&src, src_bv.bv_page, data_unit_size,
+			    src_bv.bv_offset);
+		sg_set_page(&dst, enc_page, data_unit_size, src_bv.bv_offset);
+
+		/*
+		 * Increment the index now that the encrypted page is added to
+		 * the bio. This is important for the error unwind path.
+		 */
+		enc_idx++;
+
+		/*
+		 * Encrypt each data unit in this page.
+		 */
+		for (i = 0; i < src_bv.bv_len; i += data_unit_size) {
 			blk_crypto_dun_to_iv(curr_dun, &iv);
-			if (crypto_wait_req(crypto_skcipher_encrypt(ciph_req),
-					    &wait)) {
-				i++;
-				src_bio->bi_status = BLK_STS_IOERR;
-				goto out_free_bounce_pages;
+			if (crypto_skcipher_encrypt(ciph_req)) {
+				enc_bio->bi_status = BLK_STS_IOERR;
+				goto out_free_enc_bio;
 			}
 			bio_crypt_dun_increment(curr_dun, 1);
 			src.offset += data_unit_size;
 			dst.offset += data_unit_size;
 		}
+
+		bio_advance_iter_single(src_bio, &src_bio->bi_iter,
+				src_bv.bv_len);
+		if (!src_bio->bi_iter.bi_size)
+			break;
+
+		if (enc_idx == nr_enc_pages) {
+			/*
+			 * For each additional encrypted bio submitted,
+			 * increment the source bio's remaining count. Each
+			 * encrypted bio's completion handler calls bio_endio on
+			 * the source bio, so this keeps the source bio from
+			 * completing until the last encrypted bio does.
+			 */
+			bio_inc_remaining(src_bio);
+			submit_bio(enc_bio);
+			goto new_bio;
+		}
 	}
 
-	enc_bio->bi_private = src_bio;
-	enc_bio->bi_end_io = blk_crypto_fallback_encrypt_endio;
-	*bio_ptr = enc_bio;
-	ret = true;
+	submit_bio(enc_bio);
+	return;
 
-	enc_bio = NULL;
-	goto out_free_ciph_req;
+out_free_enc_bio:
+	/*
+	 * Add the remaining pages to the bio so that the normal completion path
+	 * in blk_crypto_fallback_encrypt_endio frees them. The exact data
+	 * layout does not matter for that, so don't bother iterating the source
+	 * bio.
+	 */
+	for (; enc_idx < nr_enc_pages; enc_idx++)
+		__bio_add_page(enc_bio, enc_pages[enc_idx], PAGE_SIZE, 0);
+	bio_endio(enc_bio);
+}
 
-out_free_bounce_pages:
-	while (i > 0)
-		mempool_free(enc_bio->bi_io_vec[--i].bv_page,
-			     blk_crypto_bounce_page_pool);
-out_free_ciph_req:
-	skcipher_request_free(ciph_req);
-out_release_keyslot:
+/*
+ * The crypto API fallback's encryption routine.
+ *
+ * Allocate one or more bios for encryption, encrypt the input bio using the
+ * crypto API, and submit the encrypted bios. Sets bio->bi_status and
+ * completes the source bio on error
+ */
+static void blk_crypto_fallback_encrypt_bio(struct bio *src_bio)
+{
+	struct bio_crypt_ctx *bc = src_bio->bi_crypt_context;
+	struct blk_crypto_keyslot *slot;
+	blk_status_t status;
+
+	status = blk_crypto_get_keyslot(blk_crypto_fallback_profile,
+					bc->bc_key, &slot);
+	if (status != BLK_STS_OK) {
+		src_bio->bi_status = status;
+		bio_endio(src_bio);
+		return;
+	}
+	__blk_crypto_fallback_encrypt_bio(src_bio,
+			blk_crypto_fallback_tfm(slot));
 	blk_crypto_put_keyslot(slot);
-out_put_enc_bio:
-	if (enc_bio)
-		bio_uninit(enc_bio);
-	kfree(enc_bio);
-	return ret;
+}
+
+static blk_status_t __blk_crypto_fallback_decrypt_bio(struct bio *bio,
+		struct bio_crypt_ctx *bc, struct bvec_iter iter,
+		struct crypto_sync_skcipher *tfm)
+{
+	SYNC_SKCIPHER_REQUEST_ON_STACK(ciph_req, tfm);
+	u64 curr_dun[BLK_CRYPTO_DUN_ARRAY_SIZE];
+	union blk_crypto_iv iv;
+	struct scatterlist sg;
+	struct bio_vec bv;
+	const int data_unit_size = bc->bc_key->crypto_cfg.data_unit_size;
+	unsigned int i;
+
+	skcipher_request_set_callback(ciph_req,
+			CRYPTO_TFM_REQ_MAY_BACKLOG | CRYPTO_TFM_REQ_MAY_SLEEP,
+			NULL, NULL);
+
+	memcpy(curr_dun, bc->bc_dun, sizeof(curr_dun));
+	sg_init_table(&sg, 1);
+	skcipher_request_set_crypt(ciph_req, &sg, &sg, data_unit_size,
+				   iv.bytes);
+
+	/* Decrypt each segment in the bio */
+	__bio_for_each_segment(bv, bio, iter, iter) {
+		struct page *page = bv.bv_page;
+
+		if (!IS_ALIGNED(bv.bv_len | bv.bv_offset, data_unit_size))
+			return BLK_STS_INVAL;
+
+		sg_set_page(&sg, page, data_unit_size, bv.bv_offset);
+
+		/* Decrypt each data unit in the segment */
+		for (i = 0; i < bv.bv_len; i += data_unit_size) {
+			blk_crypto_dun_to_iv(curr_dun, &iv);
+			if (crypto_skcipher_decrypt(ciph_req))
+				return BLK_STS_IOERR;
+			bio_crypt_dun_increment(curr_dun, 1);
+			sg.offset += data_unit_size;
+		}
+	}
+
+	return BLK_STS_OK;
 }
 
 /*
  * The crypto API fallback's main decryption routine.
+ *
  * Decrypts input bio in place, and calls bio_endio on the bio.
  */
 static void blk_crypto_fallback_decrypt_bio(struct work_struct *work)
···
 	struct bio *bio = f_ctx->bio;
 	struct bio_crypt_ctx *bc = &f_ctx->crypt_ctx;
 	struct blk_crypto_keyslot *slot;
-	struct skcipher_request *ciph_req = NULL;
-	DECLARE_CRYPTO_WAIT(wait);
-	u64 curr_dun[BLK_CRYPTO_DUN_ARRAY_SIZE];
-	union blk_crypto_iv iv;
-	struct scatterlist sg;
-	struct bio_vec bv;
-	struct bvec_iter iter;
-	const int data_unit_size = bc->bc_key->crypto_cfg.data_unit_size;
-	unsigned int i;
-	blk_status_t blk_st;
+	blk_status_t status;
 
-	/*
-	 * Get a blk-crypto-fallback keyslot that contains a crypto_skcipher for
-	 * this bio's algorithm and key.
-	 */
-	blk_st = blk_crypto_get_keyslot(blk_crypto_fallback_profile,
+	status = blk_crypto_get_keyslot(blk_crypto_fallback_profile,
 					bc->bc_key, &slot);
-	if (blk_st != BLK_STS_OK) {
-		bio->bi_status = blk_st;
-		goto out_no_keyslot;
+	if (status == BLK_STS_OK) {
+		status = __blk_crypto_fallback_decrypt_bio(bio, bc,
+				f_ctx->crypt_iter,
+				blk_crypto_fallback_tfm(slot));
+		blk_crypto_put_keyslot(slot);
 	}
-
-	/* and then allocate an skcipher_request for it */
-	if (!blk_crypto_fallback_alloc_cipher_req(slot, &ciph_req, &wait)) {
-		bio->bi_status = BLK_STS_RESOURCE;
-		goto out;
-	}
-
-	memcpy(curr_dun, bc->bc_dun, sizeof(curr_dun));
-	sg_init_table(&sg, 1);
-	skcipher_request_set_crypt(ciph_req, &sg, &sg, data_unit_size,
-				   iv.bytes);
-
-	/* Decrypt each segment in the bio */
-	__bio_for_each_segment(bv, bio, iter, f_ctx->crypt_iter) {
-		struct page *page = bv.bv_page;
-
-		sg_set_page(&sg, page, data_unit_size, bv.bv_offset);
-
-		/* Decrypt each data unit in the segment */
-		for (i = 0; i < bv.bv_len; i += data_unit_size) {
-			blk_crypto_dun_to_iv(curr_dun, &iv);
-			if (crypto_wait_req(crypto_skcipher_decrypt(ciph_req),
-					    &wait)) {
-				bio->bi_status = BLK_STS_IOERR;
-				goto out;
-			}
-			bio_crypt_dun_increment(curr_dun, 1);
-			sg.offset += data_unit_size;
-		}
-	}
-
-out:
-	skcipher_request_free(ciph_req);
-	blk_crypto_put_keyslot(slot);
-out_no_keyslot:
 	mempool_free(f_ctx, bio_fallback_crypt_ctx_pool);
+
+	bio->bi_status = status;
 	bio_endio(bio);
 }
 
···
 
 /**
  * blk_crypto_fallback_bio_prep - Prepare a bio to use fallback en/decryption
+ * @bio: bio to prepare
  *
- * @bio_ptr: pointer to the bio to prepare
+ * If bio is doing a WRITE operation, allocate one or more bios to contain the
+ * encrypted
payload and submit them. 475 473 * 476 - * If bio is doing a WRITE operation, this splits the bio into two parts if it's 477 - * too big (see blk_crypto_fallback_split_bio_if_needed()). It then allocates a 478 - * bounce bio for the first part, encrypts it, and updates bio_ptr to point to 479 - * the bounce bio. 480 - * 481 - * For a READ operation, we mark the bio for decryption by using bi_private and 474 + * For a READ operation, mark the bio for decryption by using bi_private and 482 475 * bi_end_io. 483 476 * 484 - * In either case, this function will make the bio look like a regular bio (i.e. 485 - * as if no encryption context was ever specified) for the purposes of the rest 486 - * of the stack except for blk-integrity (blk-integrity and blk-crypto are not 487 - * currently supported together). 477 + * In either case, this function will make the submitted bio(s) look like 478 + * regular bios (i.e. as if no encryption context was ever specified) for the 479 + * purposes of the rest of the stack except for blk-integrity (blk-integrity and 480 + * blk-crypto are not currently supported together). 488 481 * 489 - * Return: true on success. Sets bio->bi_status and returns false on error. 482 + * Return: true if @bio should be submitted to the driver by the caller, else 483 + * false. Sets bio->bi_status, calls bio_endio and returns false on error. 
490 484 */ 491 - bool blk_crypto_fallback_bio_prep(struct bio **bio_ptr) 485 + bool blk_crypto_fallback_bio_prep(struct bio *bio) 492 486 { 493 - struct bio *bio = *bio_ptr; 494 487 struct bio_crypt_ctx *bc = bio->bi_crypt_context; 495 488 struct bio_fallback_crypt_ctx *f_ctx; 496 489 497 490 if (WARN_ON_ONCE(!tfms_inited[bc->bc_key->crypto_cfg.crypto_mode])) { 498 491 /* User didn't call blk_crypto_start_using_key() first */ 499 - bio->bi_status = BLK_STS_IOERR; 492 + bio_io_error(bio); 500 493 return false; 501 494 } 502 495 503 496 if (!__blk_crypto_cfg_supported(blk_crypto_fallback_profile, 504 497 &bc->bc_key->crypto_cfg)) { 505 498 bio->bi_status = BLK_STS_NOTSUPP; 499 + bio_endio(bio); 506 500 return false; 507 501 } 508 502 509 - if (bio_data_dir(bio) == WRITE) 510 - return blk_crypto_fallback_encrypt_bio(bio_ptr); 503 + if (bio_data_dir(bio) == WRITE) { 504 + blk_crypto_fallback_encrypt_bio(bio); 505 + return false; 506 + } 511 507 512 508 /* 513 509 * bio READ case: Set up a f_ctx in the bio's bi_private and set the ··· 541 537 542 538 get_random_bytes(blank_key, sizeof(blank_key)); 543 539 544 - err = bioset_init(&crypto_bio_split, 64, 0, 0); 540 + err = bioset_init(&enc_bio_set, 64, 0, BIOSET_NEED_BVECS); 545 541 if (err) 546 542 goto out; 547 543 ··· 611 607 fail_free_profile: 612 608 kfree(blk_crypto_fallback_profile); 613 609 fail_free_bioset: 614 - bioset_exit(&crypto_bio_split); 610 + bioset_exit(&enc_bio_set); 615 611 out: 616 612 return err; 617 613 } ··· 645 641 646 642 for (i = 0; i < blk_crypto_num_keyslots; i++) { 647 643 slotp = &blk_crypto_keyslots[i]; 648 - slotp->tfms[mode_num] = crypto_alloc_skcipher(cipher_str, 0, 0); 644 + slotp->tfms[mode_num] = crypto_alloc_sync_skcipher(cipher_str, 645 + 0, 0); 649 646 if (IS_ERR(slotp->tfms[mode_num])) { 650 647 err = PTR_ERR(slotp->tfms[mode_num]); 651 648 if (err == -ENOENT) { ··· 658 653 goto out_free_tfms; 659 654 } 660 655 661 - crypto_skcipher_set_flags(slotp->tfms[mode_num], 656 + 
crypto_sync_skcipher_set_flags(slotp->tfms[mode_num], 662 657 CRYPTO_TFM_REQ_FORBID_WEAK_KEYS); 663 658 } 664 659 ··· 672 667 out_free_tfms: 673 668 for (i = 0; i < blk_crypto_num_keyslots; i++) { 674 669 slotp = &blk_crypto_keyslots[i]; 675 - crypto_free_skcipher(slotp->tfms[mode_num]); 670 + crypto_free_sync_skcipher(slotp->tfms[mode_num]); 676 671 slotp->tfms[mode_num] = NULL; 677 672 } 678 673 out:
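Reviewer note: the new encrypt path can submit several encrypted bios for one source bio, and the in-code comment says bio_inc_remaining() keeps the source from completing until the last one finishes. The following is a minimal userspace sketch of that accounting only. The struct and every model_* name are hypothetical stand-ins, not kernel APIs, and the payload is counted in whole pages for simplicity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the source bio's completion accounting. */
struct src_model {
	unsigned int remaining;	/* models the bio's remaining count */
	bool completed;
};

/* Models bio_inc_remaining(): one more completion must arrive. */
static void model_inc_remaining(struct src_model *src)
{
	assert(!src->completed);
	src->remaining++;
}

/* Models an encrypted bio's end_io calling bio_endio() on the source. */
static void model_enc_endio(struct src_model *src)
{
	assert(src->remaining > 0);
	if (--src->remaining == 0)
		src->completed = true;
}

/*
 * Split total_pages of payload into encrypted bios of at most
 * max_pages each, mirroring the submit-then-new_bio step of the loop.
 * Returns the number of encrypted bios submitted.
 */
static unsigned int model_submit(struct src_model *src,
				 unsigned int total_pages,
				 unsigned int max_pages)
{
	unsigned int submitted = 0;

	assert(total_pages > 0 && max_pages > 0);
	src->remaining = 1;		/* a fresh bio needs one completion */
	src->completed = false;

	while (total_pages > max_pages) {
		model_inc_remaining(src);	/* another bio will follow */
		total_pages -= max_pages;
		submitted++;
	}
	return submitted + 1;		/* the final submit_bio(enc_bio) */
}
```

In this model the count always equals the number of encrypted bios in flight, so the source completes exactly when the last model_enc_endio() runs. A payload that is an exact multiple of max_pages does not produce an empty trailing bio, which matches the order of the bi_size check and the enc_idx check in the diff.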

+13 -17

block/blk-crypto-internal.h

··· 86 86 int blk_crypto_ioctl(struct block_device *bdev, unsigned int cmd, 87 87 void __user *argp); 88 88 89 + static inline bool blk_crypto_supported(struct bio *bio) 90 + { 91 + return blk_crypto_config_supported_natively(bio->bi_bdev, 92 + &bio->bi_crypt_context->bc_key->crypto_cfg); 93 + } 94 + 89 95 #else /* CONFIG_BLK_INLINE_ENCRYPTION */ 90 96 91 97 static inline int blk_crypto_sysfs_register(struct gendisk *disk) ··· 145 139 return -ENOTTY; 146 140 } 147 141 142 + static inline bool blk_crypto_supported(struct bio *bio) 143 + { 144 + return false; 145 + } 146 + 148 147 #endif /* CONFIG_BLK_INLINE_ENCRYPTION */ 149 148 150 149 void __bio_crypt_advance(struct bio *bio, unsigned int bytes); ··· 174 163 memcpy(rq->crypt_ctx->bc_dun, bio->bi_crypt_context->bc_dun, 175 164 sizeof(rq->crypt_ctx->bc_dun)); 176 165 #endif 177 - } 178 - 179 - bool __blk_crypto_bio_prep(struct bio **bio_ptr); 180 - static inline bool blk_crypto_bio_prep(struct bio **bio_ptr) 181 - { 182 - if (bio_has_crypt_ctx(*bio_ptr)) 183 - return __blk_crypto_bio_prep(bio_ptr); 184 - return true; 185 166 } 186 167 187 168 blk_status_t __blk_crypto_rq_get_keyslot(struct request *rq); ··· 218 215 return 0; 219 216 } 220 217 218 + bool blk_crypto_fallback_bio_prep(struct bio *bio); 219 + 221 220 #ifdef CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK 222 221 223 222 int blk_crypto_fallback_start_using_mode(enum blk_crypto_mode_num mode_num); 224 - 225 - bool blk_crypto_fallback_bio_prep(struct bio **bio_ptr); 226 223 227 224 int blk_crypto_fallback_evict_key(const struct blk_crypto_key *key); 228 225 ··· 233 230 { 234 231 pr_warn_once("crypto API fallback is disabled\n"); 235 232 return -ENOPKG; 236 - } 237 - 238 - static inline bool blk_crypto_fallback_bio_prep(struct bio **bio_ptr) 239 - { 240 - pr_warn_once("crypto API fallback disabled; failing request.\n"); 241 - (*bio_ptr)->bi_status = BLK_STS_NOTSUPP; 242 - return false; 243 233 } 244 234 245 235 static inline int

+23 -55

block/blk-crypto.c

··· 219 219 return !bc1 || bio_crypt_dun_is_contiguous(bc1, bc1_bytes, bc2->bc_dun); 220 220 } 221 221 222 - /* Check that all I/O segments are data unit aligned. */ 223 - static bool bio_crypt_check_alignment(struct bio *bio) 224 - { 225 - const unsigned int data_unit_size = 226 - bio->bi_crypt_context->bc_key->crypto_cfg.data_unit_size; 227 - struct bvec_iter iter; 228 - struct bio_vec bv; 229 - 230 - bio_for_each_segment(bv, bio, iter) { 231 - if (!IS_ALIGNED(bv.bv_len | bv.bv_offset, data_unit_size)) 232 - return false; 233 - } 234 - 235 - return true; 236 - } 237 - 238 222 blk_status_t __blk_crypto_rq_get_keyslot(struct request *rq) 239 223 { 240 224 return blk_crypto_get_keyslot(rq->q->crypto_profile, ··· 242 258 rq->crypt_ctx = NULL; 243 259 } 244 260 245 - /** 246 - * __blk_crypto_bio_prep - Prepare bio for inline encryption 261 + /* 262 + * Process a bio with a crypto context. Returns true if the caller should 263 + * submit the passed in bio, false if the bio is consumed. 247 264 * 248 - * @bio_ptr: pointer to original bio pointer 249 - * 250 - * If the bio crypt context provided for the bio is supported by the underlying 251 - * device's inline encryption hardware, do nothing. 252 - * 253 - * Otherwise, try to perform en/decryption for this bio by falling back to the 254 - * kernel crypto API. When the crypto API fallback is used for encryption, 255 - * blk-crypto may choose to split the bio into 2 - the first one that will 256 - * continue to be processed and the second one that will be resubmitted via 257 - * submit_bio_noacct. A bounce bio will be allocated to encrypt the contents 258 - * of the aforementioned "first one", and *bio_ptr will be updated to this 259 - * bounce bio. 260 - * 261 - * Caller must ensure bio has bio_crypt_ctx. 262 - * 263 - * Return: true on success; false on error (and bio->bi_status will be set 264 - * appropriately, and bio_endio() will have been called so bio 265 - * submission should abort). 
265 + * See the kerneldoc comment for blk_crypto_submit_bio for further details. 266 266 */ 267 - bool __blk_crypto_bio_prep(struct bio **bio_ptr) 267 + bool __blk_crypto_submit_bio(struct bio *bio) 268 268 { 269 - struct bio *bio = *bio_ptr; 270 269 const struct blk_crypto_key *bc_key = bio->bi_crypt_context->bc_key; 270 + struct block_device *bdev = bio->bi_bdev; 271 271 272 272 /* Error if bio has no data. */ 273 273 if (WARN_ON_ONCE(!bio_has_data(bio))) { 274 - bio->bi_status = BLK_STS_IOERR; 275 - goto fail; 276 - } 277 - 278 - if (!bio_crypt_check_alignment(bio)) { 279 - bio->bi_status = BLK_STS_INVAL; 280 - goto fail; 274 + bio_io_error(bio); 275 + return false; 281 276 } 282 277 283 278 /* 284 - * Success if device supports the encryption context, or if we succeeded 285 - * in falling back to the crypto API. 279 + * If the device does not natively support the encryption context, try to use 280 + * the fallback if available. 286 281 */ 287 - if (blk_crypto_config_supported_natively(bio->bi_bdev, 288 - &bc_key->crypto_cfg)) 289 - return true; 290 - if (blk_crypto_fallback_bio_prep(bio_ptr)) 291 - return true; 292 - fail: 293 - bio_endio(*bio_ptr); 294 - return false; 282 + if (!blk_crypto_config_supported_natively(bdev, &bc_key->crypto_cfg)) { 283 + if (!IS_ENABLED(CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK)) { 284 + pr_warn_once("%pg: crypto API fallback disabled; failing request.\n", 285 + bdev); 286 + bio->bi_status = BLK_STS_NOTSUPP; 287 + bio_endio(bio); 288 + return false; 289 + } 290 + return blk_crypto_fallback_bio_prep(bio); 291 + } 292 + 293 + return true; 295 294 } 295 + EXPORT_SYMBOL_GPL(__blk_crypto_submit_bio); 296 296 297 297 int __blk_crypto_rq_bio_prep(struct request *rq, struct bio *bio, 298 298 gfp_t gfp_mask)
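Reviewer note: the return value of __blk_crypto_submit_bio() now means "caller should submit this bio" rather than "success". Below is a small userspace sketch of the decision flow as it reads in this diff. All types and names are hypothetical models. The READ branch is an assumption: the diff only shows that the fallback marks reads for decryption, and the kerneldoc implies the caller then submits them.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical status codes standing in for blk_status_t values. */
enum model_status { MODEL_OK, MODEL_IOERR, MODEL_NOTSUPP };

struct model_bio {
	bool has_data;
	bool is_write;
	bool completed;			/* models bio_endio() having run */
	enum model_status status;
};

/*
 * Returns true if the caller should submit the bio itself, false if
 * the bio was consumed (failed, or handed to the fallback write path).
 */
static bool model_crypto_submit(struct model_bio *bio, bool native_support,
				bool fallback_enabled)
{
	assert(bio);

	if (!bio->has_data) {
		bio->status = MODEL_IOERR;
		bio->completed = true;
		return false;
	}
	if (native_support)
		return true;		/* hardware handles the crypto context */
	if (!fallback_enabled) {
		bio->status = MODEL_NOTSUPP;
		bio->completed = true;
		return false;
	}
	/*
	 * Fallback: writes are encrypted into new bios that the fallback
	 * submits itself, so the original is consumed. Reads are assumed
	 * to go back to the caller with a decryption hook installed.
	 */
	return !bio->is_write;
}
```

The point of the sketch is that "false" covers two different outcomes: an error that has already completed the bio, and a write that is still in flight through the encrypted bios.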

+4 -2

block/blk-flush.c

···
 }

 static enum rq_end_io_ret flush_end_io(struct request *flush_rq,
-				       blk_status_t error)
+				       blk_status_t error,
+				       const struct io_comp_batch *iob)
 {
 	struct request_queue *q = flush_rq->q;
 	struct list_head *running;
···
 }

 static enum rq_end_io_ret mq_flush_data_end_io(struct request *rq,
-					       blk_status_t error)
+					       blk_status_t error,
+					       const struct io_comp_batch *iob)
 {
 	struct request_queue *q = rq->q;
 	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;

+1 -1

block/blk-iocost.c

···
 	u64 now_ns;

 	/* rotational? */
-	if (!blk_queue_nonrot(disk->queue))
+	if (blk_queue_rot(disk->queue))
 		return AUTOP_HDD;

 	/* handle SATA SSDs w/ broken NCQ */

+1 -4

block/blk-iolatency.c

···
 	u64 now = blk_time_get_ns();
 	int cpu;

-	if (blk_queue_nonrot(blkg->q))
-		iolat->ssd = true;
-	else
-		iolat->ssd = false;
+	iolat->ssd = !blk_queue_rot(blkg->q);

 	for_each_possible_cpu(cpu) {
 		struct latency_stat *stat;

+25 -5

block/blk-merge.c

··· 158 158 return bio; 159 159 } 160 160 161 - struct bio *bio_split_discard(struct bio *bio, const struct queue_limits *lim, 162 - unsigned *nsegs) 161 + static struct bio *__bio_split_discard(struct bio *bio, 162 + const struct queue_limits *lim, unsigned *nsegs, 163 + unsigned int max_sectors) 163 164 { 164 165 unsigned int max_discard_sectors, granularity; 165 166 sector_t tmp; ··· 170 169 171 170 granularity = max(lim->discard_granularity >> 9, 1U); 172 171 173 - max_discard_sectors = 174 - min(lim->max_discard_sectors, bio_allowed_max_sectors(lim)); 172 + max_discard_sectors = min(max_sectors, bio_allowed_max_sectors(lim)); 175 173 max_discard_sectors -= max_discard_sectors % granularity; 176 174 if (unlikely(!max_discard_sectors)) 177 175 return bio; ··· 192 192 split_sectors -= tmp; 193 193 194 194 return bio_submit_split(bio, split_sectors); 195 + } 196 + 197 + struct bio *bio_split_discard(struct bio *bio, const struct queue_limits *lim, 198 + unsigned *nsegs) 199 + { 200 + unsigned int max_sectors; 201 + 202 + if (bio_op(bio) == REQ_OP_SECURE_ERASE) 203 + max_sectors = lim->max_secure_erase_sectors; 204 + else 205 + max_sectors = lim->max_discard_sectors; 206 + 207 + return __bio_split_discard(bio, lim, nsegs, max_sectors); 195 208 } 196 209 197 210 static inline unsigned int blk_boundary_sectors(const struct queue_limits *lim, ··· 337 324 int bio_split_io_at(struct bio *bio, const struct queue_limits *lim, 338 325 unsigned *segs, unsigned max_bytes, unsigned len_align_mask) 339 326 { 327 + struct bio_crypt_ctx *bc = bio_crypt_ctx(bio); 340 328 struct bio_vec bv, bvprv, *bvprvp = NULL; 341 329 unsigned nsegs = 0, bytes = 0, gaps = 0; 342 330 struct bvec_iter iter; 331 + unsigned start_align_mask = lim->dma_alignment; 332 + 333 + if (bc) { 334 + start_align_mask |= (bc->bc_key->crypto_cfg.data_unit_size - 1); 335 + len_align_mask |= (bc->bc_key->crypto_cfg.data_unit_size - 1); 336 + } 343 337 344 338 bio_for_each_bvec(bv, bio, iter) { 345 - if 
(bv.bv_offset & lim->dma_alignment || 339 + if (bv.bv_offset & start_align_mask || 346 340 bv.bv_len & len_align_mask) 347 341 return -EINVAL; 348 342
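Reviewer note: the data-unit alignment check that used to live in bio_crypt_check_alignment() is folded into the existing mask test in bio_split_io_at(). Here is a minimal userspace sketch of the combined mask logic. The function and parameter names are hypothetical, and a data_unit_size of 0 is used here to mean "no crypto context", which is a convention of this sketch only.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true if a segment with the given offset and length passes
 * the alignment test. dma_alignment_mask and len_align_mask are masks
 * (for example 511 for 512-byte alignment). data_unit_size is 0 when
 * there is no crypto context; otherwise it must be a power of two.
 */
static bool model_segment_aligned(unsigned int offset, unsigned int len,
				  unsigned int dma_alignment_mask,
				  unsigned int len_align_mask,
				  unsigned int data_unit_size)
{
	unsigned int start_align_mask = dma_alignment_mask;

	if (data_unit_size) {
		/* size - 1 is only a valid mask for powers of two */
		assert((data_unit_size & (data_unit_size - 1)) == 0);
		start_align_mask |= data_unit_size - 1;
		len_align_mask |= data_unit_size - 1;
	}
	return !(offset & start_align_mask) && !(len & len_align_mask);
}
```

OR-ing the masks means the stricter of the two alignments wins, so a segment that is fine for DMA can still be rejected once a 4096-byte data unit is in play.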

+42 -26

block/blk-mq-debugfs.c

··· 608 608 {}, 609 609 }; 610 610 611 - static void debugfs_create_files(struct dentry *parent, void *data, 611 + static void debugfs_create_files(struct request_queue *q, struct dentry *parent, 612 + void *data, 612 613 const struct blk_mq_debugfs_attr *attr) 613 614 { 615 + lockdep_assert_held(&q->debugfs_mutex); 616 + /* 617 + * Creating new debugfs entries with queue freezed has the risk of 618 + * deadlock. 619 + */ 620 + WARN_ON_ONCE(q->mq_freeze_depth != 0); 621 + /* 622 + * debugfs_mutex should not be nested under other locks that can be 623 + * grabbed while queue is frozen. 624 + */ 625 + lockdep_assert_not_held(&q->elevator_lock); 626 + lockdep_assert_not_held(&q->rq_qos_mutex); 627 + 614 628 if (IS_ERR_OR_NULL(parent)) 615 629 return; 616 630 ··· 638 624 struct blk_mq_hw_ctx *hctx; 639 625 unsigned long i; 640 626 641 - debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs); 627 + debugfs_create_files(q, q->debugfs_dir, q, blk_mq_debugfs_queue_attrs); 642 628 643 629 queue_for_each_hw_ctx(q, hctx, i) { 644 630 if (!hctx->debugfs_dir) 645 631 blk_mq_debugfs_register_hctx(q, hctx); 646 632 } 647 633 648 - if (q->rq_qos) { 649 - struct rq_qos *rqos = q->rq_qos; 650 - 651 - while (rqos) { 652 - blk_mq_debugfs_register_rqos(rqos); 653 - rqos = rqos->next; 654 - } 655 - } 634 + blk_mq_debugfs_register_rq_qos(q); 656 635 } 657 636 658 637 static void blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx, ··· 657 650 snprintf(name, sizeof(name), "cpu%u", ctx->cpu); 658 651 ctx_dir = debugfs_create_dir(name, hctx->debugfs_dir); 659 652 660 - debugfs_create_files(ctx_dir, ctx, blk_mq_debugfs_ctx_attrs); 653 + debugfs_create_files(hctx->queue, ctx_dir, ctx, 654 + blk_mq_debugfs_ctx_attrs); 661 655 } 662 656 663 657 void blk_mq_debugfs_register_hctx(struct request_queue *q, ··· 674 666 snprintf(name, sizeof(name), "hctx%u", hctx->queue_num); 675 667 hctx->debugfs_dir = debugfs_create_dir(name, q->debugfs_dir); 676 668 677 - 
debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_attrs); 669 + debugfs_create_files(q, hctx->debugfs_dir, hctx, 670 + blk_mq_debugfs_hctx_attrs); 678 671 679 672 hctx_for_each_ctx(hctx, ctx, i) 680 673 blk_mq_debugfs_register_ctx(hctx, ctx); ··· 695 686 struct blk_mq_hw_ctx *hctx; 696 687 unsigned long i; 697 688 689 + mutex_lock(&q->debugfs_mutex); 698 690 queue_for_each_hw_ctx(q, hctx, i) 699 691 blk_mq_debugfs_register_hctx(q, hctx); 692 + mutex_unlock(&q->debugfs_mutex); 700 693 } 701 694 702 695 void blk_mq_debugfs_unregister_hctxs(struct request_queue *q) ··· 728 717 729 718 q->sched_debugfs_dir = debugfs_create_dir("sched", q->debugfs_dir); 730 719 731 - debugfs_create_files(q->sched_debugfs_dir, q, e->queue_debugfs_attrs); 720 + debugfs_create_files(q, q->sched_debugfs_dir, q, e->queue_debugfs_attrs); 732 721 } 733 722 734 723 void blk_mq_debugfs_unregister_sched(struct request_queue *q) ··· 752 741 return "unknown"; 753 742 } 754 743 755 - void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos) 756 - { 757 - lockdep_assert_held(&rqos->disk->queue->debugfs_mutex); 758 - 759 - if (!rqos->disk->queue->debugfs_dir) 760 - return; 761 - debugfs_remove_recursive(rqos->debugfs_dir); 762 - rqos->debugfs_dir = NULL; 763 - } 764 - 765 - void blk_mq_debugfs_register_rqos(struct rq_qos *rqos) 744 + static void blk_mq_debugfs_register_rqos(struct rq_qos *rqos) 766 745 { 767 746 struct request_queue *q = rqos->disk->queue; 768 747 const char *dir_name = rq_qos_id_to_name(rqos->id); ··· 767 766 q->debugfs_dir); 768 767 769 768 rqos->debugfs_dir = debugfs_create_dir(dir_name, q->rqos_debugfs_dir); 770 - debugfs_create_files(rqos->debugfs_dir, rqos, rqos->ops->debugfs_attrs); 769 + debugfs_create_files(q, rqos->debugfs_dir, rqos, 770 + rqos->ops->debugfs_attrs); 771 + } 772 + 773 + void blk_mq_debugfs_register_rq_qos(struct request_queue *q) 774 + { 775 + lockdep_assert_held(&q->debugfs_mutex); 776 + 777 + if (q->rq_qos) { 778 + struct rq_qos *rqos = 
q->rq_qos; 779 + 780 + while (rqos) { 781 + blk_mq_debugfs_register_rqos(rqos); 782 + rqos = rqos->next; 783 + } 784 + } 771 785 } 772 786 773 787 void blk_mq_debugfs_register_sched_hctx(struct request_queue *q, ··· 805 789 806 790 hctx->sched_debugfs_dir = debugfs_create_dir("sched", 807 791 hctx->debugfs_dir); 808 - debugfs_create_files(hctx->sched_debugfs_dir, hctx, 792 + debugfs_create_files(q, hctx->sched_debugfs_dir, hctx, 809 793 e->hctx_debugfs_attrs); 810 794 } 811 795

+2 -6

block/blk-mq-debugfs.h

···
 				       struct blk_mq_hw_ctx *hctx);
 void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx);

-void blk_mq_debugfs_register_rqos(struct rq_qos *rqos);
-void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos);
+void blk_mq_debugfs_register_rq_qos(struct request_queue *q);
 #else
 static inline void blk_mq_debugfs_register(struct request_queue *q)
 {
···
 {
 }

-static inline void blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
+static inline void blk_mq_debugfs_register_rq_qos(struct request_queue *q)
 {
 }

-static inline void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
-{
-}
 #endif

 #if defined(CONFIG_BLK_DEV_ZONED) && defined(CONFIG_BLK_DEBUG_FS)

+6 -8

block/blk-mq-dma.c

··· 6 6 #include <linux/blk-mq-dma.h> 7 7 #include "blk.h" 8 8 9 - struct phys_vec { 10 - phys_addr_t paddr; 11 - u32 len; 12 - }; 13 - 14 9 static bool __blk_map_iter_next(struct blk_map_iter *iter) 15 10 { 16 11 if (iter->iter.bi_size) ··· 107 112 struct phys_vec *vec) 108 113 { 109 114 enum dma_data_direction dir = rq_dma_dir(req); 110 - unsigned int mapped = 0; 111 115 unsigned int attrs = 0; 116 + size_t mapped = 0; 112 117 int error; 113 118 114 119 iter->addr = state->addr; ··· 233 238 * blk_rq_dma_map_iter_next - map the next DMA segment for a request 234 239 * @req: request to map 235 240 * @dma_dev: device to map to 236 - * @state: DMA IOVA state 237 241 * @iter: block layer DMA iterator 238 242 * 239 243 * Iterate to the next mapping after a previous call to ··· 247 253 * returned in @iter.status. 248 254 */ 249 255 bool blk_rq_dma_map_iter_next(struct request *req, struct device *dma_dev, 250 - struct dma_iova_state *state, struct blk_dma_iter *iter) 256 + struct blk_dma_iter *iter) 251 257 { 252 258 struct phys_vec vec; 253 259 ··· 291 297 blk_rq_map_iter_init(rq, &iter); 292 298 while (blk_map_iter_next(rq, &iter, &vec)) { 293 299 *last_sg = blk_next_sg(last_sg, sglist); 300 + 301 + WARN_ON_ONCE(overflows_type(vec.len, unsigned int)); 294 302 sg_set_page(*last_sg, phys_to_page(vec.paddr), vec.len, 295 303 offset_in_page(vec.paddr)); 296 304 nsegs++; ··· 413 417 414 418 while (blk_map_iter_next(rq, &iter, &vec)) { 415 419 sg = blk_next_sg(&sg, sglist); 420 + 421 + WARN_ON_ONCE(overflows_type(vec.len, unsigned int)); 416 422 sg_set_page(sg, phys_to_page(vec.paddr), vec.len, 417 423 offset_in_page(vec.paddr)); 418 424 segments++;

+5

block/blk-mq-sched.h

···
 			depth);
 }

+static inline bool blk_mq_is_sync_read(blk_opf_t opf)
+{
+	return op_is_sync(opf) && !op_is_write(opf);
+}
+
 #endif

+48 -29

block/blk-mq.c

··· 498 498 return rq_list_pop(data->cached_rqs); 499 499 } 500 500 501 + static void blk_mq_limit_depth(struct blk_mq_alloc_data *data) 502 + { 503 + struct elevator_mq_ops *ops; 504 + 505 + /* If no I/O scheduler has been configured, don't limit requests */ 506 + if (!data->q->elevator) { 507 + blk_mq_tag_busy(data->hctx); 508 + return; 509 + } 510 + 511 + /* 512 + * All requests use scheduler tags when an I/O scheduler is 513 + * enabled for the queue. 514 + */ 515 + data->rq_flags |= RQF_SCHED_TAGS; 516 + 517 + /* 518 + * Flush/passthrough requests are special and go directly to the 519 + * dispatch list, they are not subject to the async_depth limit. 520 + */ 521 + if ((data->cmd_flags & REQ_OP_MASK) == REQ_OP_FLUSH || 522 + blk_op_is_passthrough(data->cmd_flags)) 523 + return; 524 + 525 + WARN_ON_ONCE(data->flags & BLK_MQ_REQ_RESERVED); 526 + data->rq_flags |= RQF_USE_SCHED; 527 + 528 + /* 529 + * By default, sync requests have no limit, and async requests are 530 + * limited to async_depth. 531 + */ 532 + ops = &data->q->elevator->type->ops; 533 + if (ops->limit_depth) 534 + ops->limit_depth(data->cmd_flags, data); 535 + } 536 + 501 537 static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data) 502 538 { 503 539 struct request_queue *q = data->q; ··· 552 516 data->ctx = blk_mq_get_ctx(q); 553 517 data->hctx = blk_mq_map_queue(data->cmd_flags, data->ctx); 554 518 555 - if (q->elevator) { 556 - /* 557 - * All requests use scheduler tags when an I/O scheduler is 558 - * enabled for the queue. 559 - */ 560 - data->rq_flags |= RQF_SCHED_TAGS; 561 - 562 - /* 563 - * Flush/passthrough requests are special and go directly to the 564 - * dispatch list. 
565 - */ 566 - if ((data->cmd_flags & REQ_OP_MASK) != REQ_OP_FLUSH && 567 - !blk_op_is_passthrough(data->cmd_flags)) { 568 - struct elevator_mq_ops *ops = &q->elevator->type->ops; 569 - 570 - WARN_ON_ONCE(data->flags & BLK_MQ_REQ_RESERVED); 571 - 572 - data->rq_flags |= RQF_USE_SCHED; 573 - if (ops->limit_depth) 574 - ops->limit_depth(data->cmd_flags, data); 575 - } 576 - } else { 577 - blk_mq_tag_busy(data->hctx); 578 - } 579 - 519 + blk_mq_limit_depth(data); 580 520 if (data->flags & BLK_MQ_REQ_RESERVED) 581 521 data->rq_flags |= RQF_RESV; 582 522 ··· 1168 1156 1169 1157 if (rq->end_io) { 1170 1158 rq_qos_done(rq->q, rq); 1171 - if (rq->end_io(rq, error) == RQ_END_IO_FREE) 1159 + if (rq->end_io(rq, error, NULL) == RQ_END_IO_FREE) 1172 1160 blk_mq_free_request(rq); 1173 1161 } else { 1174 1162 blk_mq_free_request(rq); ··· 1223 1211 * If end_io handler returns NONE, then it still has 1224 1212 * ownership of the request. 1225 1213 */ 1226 - if (rq->end_io && rq->end_io(rq, 0) == RQ_END_IO_NONE) 1214 + if (rq->end_io && rq->end_io(rq, 0, iob) == RQ_END_IO_NONE) 1227 1215 continue; 1228 1216 1229 1217 WRITE_ONCE(rq->state, MQ_RQ_IDLE); ··· 1470 1458 blk_status_t ret; 1471 1459 }; 1472 1460 1473 - static enum rq_end_io_ret blk_end_sync_rq(struct request *rq, blk_status_t ret) 1461 + static enum rq_end_io_ret blk_end_sync_rq(struct request *rq, blk_status_t ret, 1462 + const struct io_comp_batch *iob) 1474 1463 { 1475 1464 struct blk_rq_wait *wait = rq->end_io_data; 1476 1465 ··· 1701 1688 void blk_mq_put_rq_ref(struct request *rq) 1702 1689 { 1703 1690 if (is_flush_rq(rq)) { 1704 - if (rq->end_io(rq, 0) == RQ_END_IO_FREE) 1691 + if (rq->end_io(rq, 0, NULL) == RQ_END_IO_FREE) 1705 1692 blk_mq_free_request(rq); 1706 1693 } else if (req_ref_put_and_test(rq)) { 1707 1694 __blk_mq_free_request(rq); ··· 4662 4649 spin_lock_init(&q->requeue_lock); 4663 4650 4664 4651 q->nr_requests = set->queue_depth; 4652 + q->async_depth = set->queue_depth; 4665 4653 4666 4654 
blk_mq_init_cpu_queues(q, set->nr_hw_queues); 4667 4655 blk_mq_map_swqueue(q); ··· 5029 5015 q->elevator->et = et; 5030 5016 } 5031 5017 5018 + /* 5019 + * Preserve relative value, both nr and async_depth are at most 16 bit 5020 + * value, no need to worry about overflow. 5021 + */ 5022 + q->async_depth = max(q->async_depth * nr / q->nr_requests, 1); 5032 5023 q->nr_requests = nr; 5033 5024 if (q->elevator && q->elevator->type->ops.depth_updated) 5034 5025 q->elevator->type->ops.depth_updated(q);
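Reviewer note: when nr_requests changes, async_depth is rescaled so that its ratio to nr_requests is preserved, with a floor of 1. A minimal userspace sketch of that arithmetic follows. The function name is hypothetical, and it relies on the same assumption the in-code comment states, namely that both values fit in 16 bits so the product cannot overflow.

```c
#include <assert.h>

/*
 * Rescale async_depth when the queue depth goes from old_nr to new_nr,
 * keeping the same proportion and never returning 0.
 */
static unsigned int model_scale_async_depth(unsigned int async_depth,
					    unsigned int old_nr,
					    unsigned int new_nr)
{
	unsigned int scaled;

	assert(old_nr > 0);
	/* integer division truncates, so tiny ratios can round to 0 */
	scaled = async_depth * new_nr / old_nr;
	return scaled ? scaled : 1;
}
```

For example, a queue with nr_requests 256 and async_depth 192 (75%) that is shrunk to 128 requests ends up with an async_depth of 96, which is still 75%.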

-11

block/blk-rq-qos.c

···
 	blk_queue_flag_set(QUEUE_FLAG_QOS_ENABLED, q);

 	blk_mq_unfreeze_queue(q, memflags);
-
-	if (rqos->ops->debugfs_attrs) {
-		mutex_lock(&q->debugfs_mutex);
-		blk_mq_debugfs_register_rqos(rqos);
-		mutex_unlock(&q->debugfs_mutex);
-	}
-
 	return 0;
 ebusy:
 	blk_mq_unfreeze_queue(q, memflags);
···
 	if (!q->rq_qos)
 		blk_queue_flag_clear(QUEUE_FLAG_QOS_ENABLED, q);
 	blk_mq_unfreeze_queue(q, memflags);
-
-	mutex_lock(&q->debugfs_mutex);
-	blk_mq_debugfs_unregister_rqos(rqos);
-	mutex_unlock(&q->debugfs_mutex);
 }

+44 -37

block/blk-sysfs.c

··· 127 127 return ret; 128 128 } 129 129 130 + static ssize_t queue_async_depth_show(struct gendisk *disk, char *page) 131 + { 132 + guard(mutex)(&disk->queue->elevator_lock); 133 + 134 + return queue_var_show(disk->queue->async_depth, page); 135 + } 136 + 137 + static ssize_t 138 + queue_async_depth_store(struct gendisk *disk, const char *page, size_t count) 139 + { 140 + struct request_queue *q = disk->queue; 141 + unsigned int memflags; 142 + unsigned long nr; 143 + int ret; 144 + 145 + if (!queue_is_mq(q)) 146 + return -EINVAL; 147 + 148 + ret = queue_var_store(&nr, page, count); 149 + if (ret < 0) 150 + return ret; 151 + 152 + if (nr == 0) 153 + return -EINVAL; 154 + 155 + memflags = blk_mq_freeze_queue(q); 156 + scoped_guard(mutex, &q->elevator_lock) { 157 + if (q->elevator) { 158 + q->async_depth = min(q->nr_requests, nr); 159 + if (q->elevator->type->ops.depth_updated) 160 + q->elevator->type->ops.depth_updated(q); 161 + } else { 162 + ret = -EINVAL; 163 + } 164 + } 165 + blk_mq_unfreeze_queue(q, memflags); 166 + 167 + return ret; 168 + } 169 + 130 170 static ssize_t queue_ra_show(struct gendisk *disk, char *page) 131 171 { 132 172 ssize_t ret; ··· 572 532 } 573 533 574 534 QUEUE_RW_ENTRY(queue_requests, "nr_requests"); 535 + QUEUE_RW_ENTRY(queue_async_depth, "async_depth"); 575 536 QUEUE_RW_ENTRY(queue_ra, "read_ahead_kb"); 576 537 QUEUE_LIM_RW_ENTRY(queue_max_sectors, "max_sectors_kb"); 577 538 QUEUE_LIM_RO_ENTRY(queue_max_hw_sectors, "max_hw_sectors_kb"); ··· 677 636 static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page, 678 637 size_t count) 679 638 { 680 - struct request_queue *q = disk->queue; 681 - struct rq_qos *rqos; 682 639 ssize_t ret; 683 640 s64 val; 684 - unsigned int memflags; 685 641 686 642 ret = queue_var_store64(&val, page); 687 643 if (ret < 0) ··· 686 648 if (val < -1) 687 649 return -EINVAL; 688 650 689 - /* 690 - * Ensure that the queue is idled, in case the latency update 691 - * ends up either enabling or 
disabling wbt completely. We can't 692 - * have IO inflight if that happens. 693 - */ 694 - memflags = blk_mq_freeze_queue(q); 695 - 696 - rqos = wbt_rq_qos(q); 697 - if (!rqos) { 698 - ret = wbt_init(disk); 699 - if (ret) 700 - goto out; 701 - } 702 - 703 - ret = count; 704 - if (val == -1) 705 - val = wbt_default_latency_nsec(q); 706 - else if (val >= 0) 707 - val *= 1000ULL; 708 - 709 - if (wbt_get_min_lat(q) == val) 710 - goto out; 711 - 712 - blk_mq_quiesce_queue(q); 713 - 714 - mutex_lock(&disk->rqos_state_mutex); 715 - wbt_set_min_lat(q, val); 716 - mutex_unlock(&disk->rqos_state_mutex); 717 - 718 - blk_mq_unquiesce_queue(q); 719 - out: 720 - blk_mq_unfreeze_queue(q, memflags); 721 - 722 - return ret; 651 + ret = wbt_set_lat(disk, val); 652 + return ret ? ret : count; 723 653 } 724 654 725 655 QUEUE_RW_ENTRY(queue_wb_lat, "wbt_lat_usec"); ··· 760 754 */ 761 755 &elv_iosched_entry.attr, 762 756 &queue_requests_entry.attr, 757 + &queue_async_depth_entry.attr, 763 758 #ifdef CONFIG_BLK_WBT 764 759 &queue_wb_lat_entry.attr, 765 760 #endif
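Reviewer note: the new async_depth sysfs store rejects 0, rejects writes when no elevator is attached, and otherwise caps the value at nr_requests. The sketch below models only that validation and clamping in userspace. The struct and function are hypothetical; queue freezing and the depth_updated() callback are left out, and the queue_is_mq() check is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fields the store handler touches. */
struct model_queue {
	unsigned int nr_requests;
	unsigned int async_depth;
	bool has_elevator;
};

/*
 * Returns count on success or -EINVAL, mirroring the sysfs store
 * convention of returning the number of bytes consumed.
 */
static long model_async_depth_store(struct model_queue *q, unsigned long nr,
				    long count)
{
	assert(q);

	if (nr == 0)
		return -EINVAL;
	if (!q->has_elevator)
		return -EINVAL;	/* with "none" the value tracks nr_requests */

	/* equivalent of min(q->nr_requests, nr) */
	q->async_depth = nr < q->nr_requests ? (unsigned int)nr
					     : q->nr_requests;
	return count;
}
```

Writing a value larger than nr_requests is therefore not an error in this model; it silently saturates, which agrees with the ABI text saying the value is always capped at nr_requests.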

+113 -41

block/blk-wbt.c

··· 93 93 struct rq_depth rq_depth; 94 94 }; 95 95 96 + static int wbt_init(struct gendisk *disk, struct rq_wb *rwb); 97 + 96 98 static inline struct rq_wb *RQWB(struct rq_qos *rqos) 97 99 { 98 100 return container_of(rqos, struct rq_wb, rqos); ··· 508 506 return RQWB(rqos)->min_lat_nsec; 509 507 } 510 508 511 - void wbt_set_min_lat(struct request_queue *q, u64 val) 509 + static void wbt_set_min_lat(struct request_queue *q, u64 val) 512 510 { 513 511 struct rq_qos *rqos = wbt_rq_qos(q); 514 512 if (!rqos) ··· 698 696 } 699 697 } 700 698 699 + static int wbt_data_dir(const struct request *rq) 700 + { 701 + const enum req_op op = req_op(rq); 702 + 703 + if (op == REQ_OP_READ) 704 + return READ; 705 + else if (op_is_write(op)) 706 + return WRITE; 707 + 708 + /* don't account */ 709 + return -1; 710 + } 711 + 712 + static struct rq_wb *wbt_alloc(void) 713 + { 714 + struct rq_wb *rwb = kzalloc(sizeof(*rwb), GFP_KERNEL); 715 + 716 + if (!rwb) 717 + return NULL; 718 + 719 + rwb->cb = blk_stat_alloc_callback(wb_timer_fn, wbt_data_dir, 2, rwb); 720 + if (!rwb->cb) { 721 + kfree(rwb); 722 + return NULL; 723 + } 724 + 725 + return rwb; 726 + } 727 + 728 + static void wbt_free(struct rq_wb *rwb) 729 + { 730 + blk_stat_free_callback(rwb->cb); 731 + kfree(rwb); 732 + } 733 + 701 734 /* 702 735 * Enable wbt if defaults are configured that way 703 736 */ ··· 774 737 775 738 void wbt_init_enable_default(struct gendisk *disk) 776 739 { 777 - if (__wbt_enable_default(disk)) 778 - WARN_ON_ONCE(wbt_init(disk)); 740 + struct request_queue *q = disk->queue; 741 + struct rq_wb *rwb; 742 + 743 + if (!__wbt_enable_default(disk)) 744 + return; 745 + 746 + rwb = wbt_alloc(); 747 + if (WARN_ON_ONCE(!rwb)) 748 + return; 749 + 750 + if (WARN_ON_ONCE(wbt_init(disk, rwb))) { 751 + wbt_free(rwb); 752 + return; 753 + } 754 + 755 + mutex_lock(&q->debugfs_mutex); 756 + blk_mq_debugfs_register_rq_qos(q); 757 + mutex_unlock(&q->debugfs_mutex); 779 758 } 780 759 781 - u64 wbt_default_latency_nsec(struct 
request_queue *q) 760 + static u64 wbt_default_latency_nsec(struct request_queue *q) 782 761 { 783 762 /* 784 763 * We default to 2msec for non-rotational storage, and 75msec 785 764 * for rotational storage. 786 765 */ 787 - if (blk_queue_nonrot(q)) 788 - return 2000000ULL; 789 - else 766 + if (blk_queue_rot(q)) 790 767 return 75000000ULL; 791 - } 792 - 793 - static int wbt_data_dir(const struct request *rq) 794 - { 795 - const enum req_op op = req_op(rq); 796 - 797 - if (op == REQ_OP_READ) 798 - return READ; 799 - else if (op_is_write(op)) 800 - return WRITE; 801 - 802 - /* don't account */ 803 - return -1; 768 + return 2000000ULL; 804 769 } 805 770 806 771 static void wbt_queue_depth_changed(struct rq_qos *rqos) ··· 816 777 struct rq_wb *rwb = RQWB(rqos); 817 778 818 779 blk_stat_remove_callback(rqos->disk->queue, rwb->cb); 819 - blk_stat_free_callback(rwb->cb); 820 - kfree(rwb); 780 + wbt_free(rwb); 821 781 } 822 782 823 783 /* ··· 940 902 #endif 941 903 }; 942 904 943 - int wbt_init(struct gendisk *disk) 905 + static int wbt_init(struct gendisk *disk, struct rq_wb *rwb) 944 906 { 945 907 struct request_queue *q = disk->queue; 946 - struct rq_wb *rwb; 947 - int i; 948 908 int ret; 949 - 950 - rwb = kzalloc(sizeof(*rwb), GFP_KERNEL); 951 - if (!rwb) 952 - return -ENOMEM; 953 - 954 - rwb->cb = blk_stat_alloc_callback(wb_timer_fn, wbt_data_dir, 2, rwb); 955 - if (!rwb->cb) { 956 - kfree(rwb); 957 - return -ENOMEM; 958 - } 909 + int i; 959 910 960 911 for (i = 0; i < WBT_NUM_RWQ; i++) 961 912 rq_wait_init(&rwb->rq_wait[i]); ··· 964 937 ret = rq_qos_add(&rwb->rqos, disk, RQ_QOS_WBT, &wbt_rqos_ops); 965 938 mutex_unlock(&q->rq_qos_mutex); 966 939 if (ret) 967 - goto err_free; 940 + return ret; 968 941 969 942 blk_stat_add_callback(q, rwb->cb); 970 - 971 943 return 0; 944 + } 972 945 973 - err_free: 974 - blk_stat_free_callback(rwb->cb); 975 - kfree(rwb); 946 + int wbt_set_lat(struct gendisk *disk, s64 val) 947 + { 948 + struct request_queue *q = disk->queue; 949 + 
struct rq_qos *rqos = wbt_rq_qos(q); 950 + struct rq_wb *rwb = NULL; 951 + unsigned int memflags; 952 + int ret = 0; 953 + 954 + if (!rqos) { 955 + rwb = wbt_alloc(); 956 + if (!rwb) 957 + return -ENOMEM; 958 + } 959 + 960 + /* 961 + * Ensure that the queue is idled, in case the latency update 962 + * ends up either enabling or disabling wbt completely. We can't 963 + * have IO inflight if that happens. 964 + */ 965 + memflags = blk_mq_freeze_queue(q); 966 + if (!rqos) { 967 + ret = wbt_init(disk, rwb); 968 + if (ret) { 969 + wbt_free(rwb); 970 + goto out; 971 + } 972 + } 973 + 974 + if (val == -1) 975 + val = wbt_default_latency_nsec(q); 976 + else if (val >= 0) 977 + val *= 1000ULL; 978 + 979 + if (wbt_get_min_lat(q) == val) 980 + goto out; 981 + 982 + blk_mq_quiesce_queue(q); 983 + 984 + mutex_lock(&disk->rqos_state_mutex); 985 + wbt_set_min_lat(q, val); 986 + mutex_unlock(&disk->rqos_state_mutex); 987 + 988 + blk_mq_unquiesce_queue(q); 989 + out: 990 + blk_mq_unfreeze_queue(q, memflags); 991 + mutex_lock(&q->debugfs_mutex); 992 + blk_mq_debugfs_register_rq_qos(q); 993 + mutex_unlock(&q->debugfs_mutex); 994 + 976 995 return ret; 977 - 978 996 }
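The value handling in the new `wbt_set_lat()` above can be sketched in user space. This is only an illustration under stated assumptions: `wbt_lat_to_nsec()` and `default_latency_nsec()` are hypothetical names, and only the constants, the `-1` sentinel, and the multiplication by 1000 come from the diff (the scaling suggests the input is in microseconds).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Defaults taken from wbt_default_latency_nsec() in the diff. */
#define WBT_DEFAULT_ROT_NSEC	75000000ULL	/* 75 msec, rotational */
#define WBT_DEFAULT_NONROT_NSEC	2000000ULL	/* 2 msec, non-rotational */

/* Hypothetical stand-in for wbt_default_latency_nsec(q). */
static uint64_t default_latency_nsec(bool rotational)
{
	return rotational ? WBT_DEFAULT_ROT_NSEC : WBT_DEFAULT_NONROT_NSEC;
}

/*
 * Hypothetical helper mirroring the value handling in wbt_set_lat():
 * -1 selects the default, a non-negative input is scaled by 1000,
 * and any other negative value is passed through unchanged.
 */
static int64_t wbt_lat_to_nsec(int64_t val, bool rotational)
{
	if (val == -1)
		return (int64_t)default_latency_nsec(rotational);
	if (val >= 0)
		return val * 1000;
	return val;
}
```

The sketch leaves out everything that makes the kernel version safe to call at runtime (queue freeze, quiesce, and the `rqos_state_mutex`); it only models the arithmetic.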

+2 -5

block/blk-wbt.h

··· 4 4 5 5 #ifdef CONFIG_BLK_WBT 6 6 7 - int wbt_init(struct gendisk *disk); 8 7 void wbt_init_enable_default(struct gendisk *disk); 9 8 void wbt_disable_default(struct gendisk *disk); 10 9 void wbt_enable_default(struct gendisk *disk); 11 10 12 11 u64 wbt_get_min_lat(struct request_queue *q); 13 - void wbt_set_min_lat(struct request_queue *q, u64 val); 14 - bool wbt_disabled(struct request_queue *); 15 - 16 - u64 wbt_default_latency_nsec(struct request_queue *); 12 + bool wbt_disabled(struct request_queue *q); 13 + int wbt_set_lat(struct gendisk *disk, s64 val); 17 14 18 15 #else 19 16

+5 -5

block/blk-zoned.c

··· 112 112 #define BLK_ZONE_WPLUG_UNHASHED (1U << 2) 113 113 114 114 /** 115 - * blk_zone_cond_str - Return string XXX in BLK_ZONE_COND_XXX. 116 - * @zone_cond: BLK_ZONE_COND_XXX. 115 + * blk_zone_cond_str - Return a zone condition name string 116 + * @zone_cond: a zone condition BLK_ZONE_COND_name 117 117 * 118 - * Description: Centralize block layer function to convert BLK_ZONE_COND_XXX 119 - * into string format. Useful in the debugging and tracing zone conditions. For 120 - * invalid BLK_ZONE_COND_XXX it returns string "UNKNOWN". 118 + * Convert a BLK_ZONE_COND_name zone condition into the string "name". Useful 119 + * for the debugging and tracing zone conditions. For an invalid zone 120 + * conditions, the string "UNKNOWN" is returned. 121 121 */ 122 122 const char *blk_zone_cond_str(enum blk_zone_cond zone_cond) 123 123 {

+14 -4

block/blk.h

··· 208 208 struct request_queue *q = rq->q; 209 209 enum req_op op = req_op(rq); 210 210 211 - if (unlikely(op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE)) 211 + if (unlikely(op == REQ_OP_DISCARD)) 212 212 return min(q->limits.max_discard_sectors, 213 + UINT_MAX >> SECTOR_SHIFT); 214 + 215 + if (unlikely(op == REQ_OP_SECURE_ERASE)) 216 + return min(q->limits.max_secure_erase_sectors, 213 217 UINT_MAX >> SECTOR_SHIFT); 214 218 215 219 if (unlikely(op == REQ_OP_WRITE_ZEROES)) ··· 375 371 static inline bool bio_may_need_split(struct bio *bio, 376 372 const struct queue_limits *lim) 377 373 { 374 + const struct bio_vec *bv; 375 + 378 376 if (lim->chunk_sectors) 379 377 return true; 380 - if (bio->bi_vcnt != 1) 378 + 379 + if (!bio->bi_io_vec) 381 380 return true; 382 - return bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > 383 - lim->max_fast_segment_size; 381 + 382 + bv = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter); 383 + if (bio->bi_iter.bi_size > bv->bv_len - bio->bi_iter.bi_bvec_done) 384 + return true; 385 + return bv->bv_len + bv->bv_offset > lim->max_fast_segment_size; 384 386 } 385 387 386 388 /**
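The first hunk above splits the discard and secure-erase cases so that each operation is capped by its own queue limit. A minimal user-space sketch of that selection follows; `struct limits`, `enum op`, and `max_sectors_for()` are hypothetical, trimmed-down stand-ins for the kernel types, and only the two limit names and the `UINT_MAX >> SECTOR_SHIFT` cap come from the diff.

```c
#include <assert.h>
#include <limits.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, as in the kernel */

/* Hypothetical stand-ins for enum req_op and struct queue_limits. */
enum op { OP_DISCARD, OP_SECURE_ERASE };

struct limits {
	unsigned int max_discard_sectors;
	unsigned int max_secure_erase_sectors;
};

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Pick the per-operation limit, then cap it so the byte count fits in 32 bits. */
static unsigned int max_sectors_for(const struct limits *lim, enum op op)
{
	unsigned int cap = UINT_MAX >> SECTOR_SHIFT;

	if (op == OP_DISCARD)
		return min_uint(lim->max_discard_sectors, cap);
	return min_uint(lim->max_secure_erase_sectors, cap);
}
```

With a device that advertises a large discard limit but a small secure-erase limit, the pre-patch code would have applied the discard limit to both operations; the sketch returns the smaller value for secure erase.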

+1

block/elevator.c

··· 589 589 blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q); 590 590 q->elevator = NULL; 591 591 q->nr_requests = q->tag_set->queue_depth; 592 + q->async_depth = q->tag_set->queue_depth; 592 593 } 593 594 blk_add_trace_msg(q, "elv switch: %s", ctx->name); 594 595

+1 -1

block/ioctl.c

··· 692 692 queue_max_sectors(bdev_get_queue(bdev))); 693 693 return put_ushort(argp, max_sectors); 694 694 case BLKROTATIONAL: 695 - return put_ushort(argp, !bdev_nonrot(bdev)); 695 + return put_ushort(argp, bdev_rot(bdev)); 696 696 case BLKRASET: 697 697 case BLKFRASET: 698 698 if(!capable(CAP_SYS_ADMIN))

+5 -28

block/kyber-iosched.c

··· 47 47 * asynchronous requests, we reserve 25% of requests for synchronous 48 48 * operations. 49 49 */ 50 - KYBER_ASYNC_PERCENT = 75, 50 + KYBER_DEFAULT_ASYNC_PERCENT = 75, 51 51 }; 52 - 53 52 /* 54 53 * Maximum device-wide depth for each scheduling domain. 55 54 * ··· 155 156 * device-wide, limited by these tokens. 156 157 */ 157 158 struct sbitmap_queue domain_tokens[KYBER_NUM_DOMAINS]; 158 - 159 - /* Number of allowed async requests. */ 160 - unsigned int async_depth; 161 159 162 160 struct kyber_cpu_latency __percpu *cpu_latency; 163 161 ··· 397 401 398 402 static void kyber_depth_updated(struct request_queue *q) 399 403 { 400 - struct kyber_queue_data *kqd = q->elevator->elevator_data; 401 - 402 - kqd->async_depth = q->nr_requests * KYBER_ASYNC_PERCENT / 100U; 403 - blk_mq_set_min_shallow_depth(q, kqd->async_depth); 404 + blk_mq_set_min_shallow_depth(q, q->async_depth); 404 405 } 405 406 406 407 static int kyber_init_sched(struct request_queue *q, struct elevator_queue *eq) ··· 407 414 blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q); 408 415 409 416 q->elevator = eq; 417 + q->async_depth = q->nr_requests * KYBER_DEFAULT_ASYNC_PERCENT / 100; 410 418 kyber_depth_updated(q); 411 419 412 420 return 0; ··· 546 552 547 553 static void kyber_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data) 548 554 { 549 - /* 550 - * We use the scheduler tags as per-hardware queue queueing tokens. 551 - * Async requests can be limited at this stage. 
552 - */ 553 - if (!op_is_sync(opf)) { 554 - struct kyber_queue_data *kqd = data->q->elevator->elevator_data; 555 - 556 - data->shallow_depth = kqd->async_depth; 557 - } 555 + if (!blk_mq_is_sync_read(opf)) 556 + data->shallow_depth = data->q->async_depth; 558 557 } 559 558 560 559 static bool kyber_bio_merge(struct request_queue *q, struct bio *bio, ··· 943 956 KYBER_DEBUGFS_DOMAIN_ATTRS(KYBER_OTHER, other) 944 957 #undef KYBER_DEBUGFS_DOMAIN_ATTRS 945 958 946 - static int kyber_async_depth_show(void *data, struct seq_file *m) 947 - { 948 - struct request_queue *q = data; 949 - struct kyber_queue_data *kqd = q->elevator->elevator_data; 950 - 951 - seq_printf(m, "%u\n", kqd->async_depth); 952 - return 0; 953 - } 954 - 955 959 static int kyber_cur_domain_show(void *data, struct seq_file *m) 956 960 { 957 961 struct blk_mq_hw_ctx *hctx = data; ··· 968 990 KYBER_QUEUE_DOMAIN_ATTRS(write), 969 991 KYBER_QUEUE_DOMAIN_ATTRS(discard), 970 992 KYBER_QUEUE_DOMAIN_ATTRS(other), 971 - {"async_depth", 0400, kyber_async_depth_show}, 972 993 {}, 973 994 }; 974 995 #undef KYBER_QUEUE_DOMAIN_ATTRS
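The Kyber default set in `kyber_init_sched()` above is plain integer arithmetic on `nr_requests`. A small sketch, assuming a hypothetical helper name `kyber_default_async_depth()`; the percentage constant and the expression are taken from the diff.

```c
#include <assert.h>

#define KYBER_DEFAULT_ASYNC_PERCENT 75

/* Hypothetical helper: same expression as the assignment in kyber_init_sched(). */
static unsigned int kyber_default_async_depth(unsigned int nr_requests)
{
	/* Integer division truncates, so the result rounds down. */
	return nr_requests * KYBER_DEFAULT_ASYNC_PERCENT / 100;
}
```

Because the division truncates, a queue with 10 requests gets a default of 7 rather than 7.5, which is consistent with the "75% of nr_requests" wording in the sysfs documentation.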

+5 -34

block/mq-deadline.c

··· 98 98 int fifo_batch; 99 99 int writes_starved; 100 100 int front_merges; 101 - u32 async_depth; 102 101 int prio_aging_expire; 103 102 104 103 spinlock_t lock; ··· 485 486 return rq; 486 487 } 487 488 488 - /* 489 - * Called by __blk_mq_alloc_request(). The shallow_depth value set by this 490 - * function is used by __blk_mq_get_tag(). 491 - */ 492 489 static void dd_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data) 493 490 { 494 - struct deadline_data *dd = data->q->elevator->elevator_data; 495 - 496 - /* Do not throttle synchronous reads. */ 497 - if (op_is_sync(opf) && !op_is_write(opf)) 498 - return; 499 - 500 - /* 501 - * Throttle asynchronous requests and writes such that these requests 502 - * do not block the allocation of synchronous requests. 503 - */ 504 - data->shallow_depth = dd->async_depth; 491 + if (!blk_mq_is_sync_read(opf)) 492 + data->shallow_depth = data->q->async_depth; 505 493 } 506 494 507 - /* Called by blk_mq_update_nr_requests(). */ 495 + /* Called by blk_mq_init_sched() and blk_mq_update_nr_requests(). 
*/ 508 496 static void dd_depth_updated(struct request_queue *q) 509 497 { 510 - struct deadline_data *dd = q->elevator->elevator_data; 511 - 512 - dd->async_depth = q->nr_requests; 513 - blk_mq_set_min_shallow_depth(q, 1); 498 + blk_mq_set_min_shallow_depth(q, q->async_depth); 514 499 } 515 500 516 501 static void dd_exit_sched(struct elevator_queue *e) ··· 559 576 blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q); 560 577 561 578 q->elevator = eq; 579 + q->async_depth = q->nr_requests; 562 580 dd_depth_updated(q); 563 581 return 0; 564 582 } ··· 747 763 SHOW_JIFFIES(deadline_prio_aging_expire_show, dd->prio_aging_expire); 748 764 SHOW_INT(deadline_writes_starved_show, dd->writes_starved); 749 765 SHOW_INT(deadline_front_merges_show, dd->front_merges); 750 - SHOW_INT(deadline_async_depth_show, dd->async_depth); 751 766 SHOW_INT(deadline_fifo_batch_show, dd->fifo_batch); 752 767 #undef SHOW_INT 753 768 #undef SHOW_JIFFIES ··· 776 793 STORE_JIFFIES(deadline_prio_aging_expire_store, &dd->prio_aging_expire, 0, INT_MAX); 777 794 STORE_INT(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX); 778 795 STORE_INT(deadline_front_merges_store, &dd->front_merges, 0, 1); 779 - STORE_INT(deadline_async_depth_store, &dd->async_depth, 1, INT_MAX); 780 796 STORE_INT(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX); 781 797 #undef STORE_FUNCTION 782 798 #undef STORE_INT ··· 789 807 DD_ATTR(write_expire), 790 808 DD_ATTR(writes_starved), 791 809 DD_ATTR(front_merges), 792 - DD_ATTR(async_depth), 793 810 DD_ATTR(fifo_batch), 794 811 DD_ATTR(prio_aging_expire), 795 812 __ATTR_NULL ··· 872 891 struct deadline_data *dd = q->elevator->elevator_data; 873 892 874 893 seq_printf(m, "%u\n", dd->starved); 875 - return 0; 876 - } 877 - 878 - static int dd_async_depth_show(void *data, struct seq_file *m) 879 - { 880 - struct request_queue *q = data; 881 - struct deadline_data *dd = q->elevator->elevator_data; 882 - 883 - seq_printf(m, "%u\n", dd->async_depth); 884 894 return 0; 
885 895 } 886 896 ··· 974 1002 DEADLINE_NEXT_RQ_ATTR(write2), 975 1003 {"batching", 0400, deadline_batching_show}, 976 1004 {"starved", 0400, deadline_starved_show}, 977 - {"async_depth", 0400, dd_async_depth_show}, 978 1005 {"dispatch", 0400, .seq_ops = &deadline_dispatch_seq_ops}, 979 1006 {"owned_by_driver", 0400, dd_owned_by_driver_show}, 980 1007 {"queued", 0400, dd_queued_show},
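Both Kyber and mq-deadline now share the same throttling rule in their `limit_depth` hooks. The sketch below models it with plain booleans. It assumes that `blk_mq_is_sync_read()` means "synchronous and not a write", which is what the removed mq-deadline check (`op_is_sync(opf) && !op_is_write(opf)`) suggests; the helper's definition is not part of this excerpt, and `struct alloc_data`, `is_sync_read()`, and `limit_depth()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the one field of struct blk_mq_alloc_data used here. */
struct alloc_data {
	unsigned int shallow_depth;	/* 0 means "no shallow limit" in this sketch */
};

/* Assumed meaning of blk_mq_is_sync_read(): sync and not a write. */
static bool is_sync_read(bool is_sync, bool is_write)
{
	return is_sync && !is_write;
}

/* Everything except a synchronous read is limited to async_depth. */
static void limit_depth(bool is_sync, bool is_write, unsigned int async_depth,
			struct alloc_data *data)
{
	if (!is_sync_read(is_sync, is_write))
		data->shallow_depth = async_depth;
}
```

Under this reading, synchronous writes are throttled along with asynchronous I/O, matching the "limit for writes and async I/O" wording in the sysfs documentation.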

+2 -1

block/partitions/core.c

··· 7 7 #include <linux/fs.h> 8 8 #include <linux/major.h> 9 9 #include <linux/slab.h> 10 + #include <linux/string.h> 10 11 #include <linux/ctype.h> 11 12 #include <linux/vmalloc.h> 12 13 #include <linux/raid/detect.h> ··· 131 130 state->pp_buf[0] = '\0'; 132 131 133 132 state->disk = hd; 134 - snprintf(state->name, BDEVNAME_SIZE, "%s", hd->disk_name); 133 + strscpy(state->name, hd->disk_name); 135 134 snprintf(state->pp_buf, PAGE_SIZE, " %s:", state->name); 136 135 if (isdigit(state->name[strlen(state->name)-1])) 137 136 sprintf(state->name, "p");
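The `snprintf(..., "%s", ...)` to `strscpy()` change above swaps a formatted print for a bounded string copy. As a rough user-space analogue (not the kernel implementation), the sketch below shows the behaviour such a copy is expected to have: always NUL-terminate, and report truncation instead of silently hiding it. `bounded_copy()` is a hypothetical name, and it returns -1 on truncation where the kernel function returns `-E2BIG`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical user-space analogue of a bounded string copy.
 * Returns the number of characters copied (excluding the NUL),
 * or -1 if the source had to be truncated or size is 0.
 */
static long bounded_copy(char *dst, const char *src, size_t size)
{
	size_t len;

	if (size == 0)
		return -1;

	len = strlen(src);
	if (len >= size) {
		/* Copy what fits and terminate explicitly. */
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -1;
	}

	memcpy(dst, src, len + 1);	/* includes the terminating NUL */
	return (long)len;
}
```

The two-argument form used in the diff relies on the destination being an array so that its size can be taken automatically; the sketch takes the size explicitly instead.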

+2 -1

block/sed-opal.c

··· 2940 2940 }; 2941 2941 int ret; 2942 2942 2943 - if (!opal_lr_act->num_lrs || opal_lr_act->num_lrs > OPAL_MAX_LRS) 2943 + if (opal_lr_act->sum && 2944 + (!opal_lr_act->num_lrs || opal_lr_act->num_lrs > OPAL_MAX_LRS)) 2944 2945 return -EINVAL; 2945 2946 2946 2947 ret = opal_get_key(dev, &opal_lr_act->key);
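The changed condition above only validates `num_lrs` when `sum` is set. A minimal sketch of the new predicate follows; `check_num_lrs()` is a hypothetical helper, and the value given for `OPAL_MAX_LRS` is an assumption, since the constant's definition is not part of this excerpt.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define OPAL_MAX_LRS 9	/* assumed value; not shown in the diff */

/* Hypothetical helper mirroring the new check: num_lrs matters only if sum is set. */
static int check_num_lrs(bool sum, unsigned int num_lrs)
{
	if (sum && (num_lrs == 0 || num_lrs > OPAL_MAX_LRS))
		return -EINVAL;
	return 0;
}
```

With the old condition, a request with `sum` clear and `num_lrs == 0` was rejected; with the new one it passes this check.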

+1 -2

drivers/block/brd.c

··· 247 247 /* Legacy boot options - nonmodular */ 248 248 static int __init ramdisk_size(char *str) 249 249 { 250 - rd_size = simple_strtol(str, NULL, 0); 251 - return 1; 250 + return kstrtoul(str, 0, &rd_size) == 0; 252 251 } 253 252 __setup("ramdisk_size=", ramdisk_size); 254 253 #endif
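The `ramdisk_size=` handler above now returns 1 only when the whole argument parses as a number, where the old code always returned 1. The sketch below approximates that strictness in user space with `strtoul()`. It is not the kernel's `kstrtoul()`: `parse_ulong_strict()` is a hypothetical name, and `strtoul()` differs in details such as accepting leading whitespace.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical strict parser: returns 1 and stores the value on success,
 * 0 on empty input, overflow, or trailing characters (one trailing
 * newline is tolerated). Base 0 auto-detects decimal, octal, and hex.
 */
static int parse_ulong_strict(const char *s, unsigned long *out)
{
	char *end;
	unsigned long v;

	errno = 0;
	v = strtoul(s, &end, 0);
	if (end == s || errno == ERANGE)
		return 0;
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return 0;

	*out = v;
	return 1;
}
```

A lenient parser would accept an input such as `16k` and use the leading `16`; the strict version rejects it and leaves the previous value untouched.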

+1 -1

drivers/block/loop.c

··· 969 969 lim->features &= ~(BLK_FEAT_WRITE_CACHE | BLK_FEAT_ROTATIONAL); 970 970 if (file->f_op->fsync && !(lo->lo_flags & LO_FLAGS_READ_ONLY)) 971 971 lim->features |= BLK_FEAT_WRITE_CACHE; 972 - if (backing_bdev && !bdev_nonrot(backing_bdev)) 972 + if (backing_bdev && bdev_rot(backing_bdev)) 973 973 lim->features |= BLK_FEAT_ROTATIONAL; 974 974 lim->max_hw_discard_sectors = max_discard_sectors; 975 975 lim->max_write_zeroes_sectors = max_discard_sectors;

+2 -2

drivers/block/null_blk/main.c

··· 642 642 null_free_dev(dev); 643 643 } 644 644 645 - static struct configfs_item_operations nullb_device_ops = { 645 + static const struct configfs_item_operations nullb_device_ops = { 646 646 .release = nullb_device_release, 647 647 }; 648 648 ··· 749 749 NULL, 750 750 }; 751 751 752 - static struct configfs_group_operations nullb_group_ops = { 752 + static const struct configfs_group_operations nullb_group_ops = { 753 753 .make_group = nullb_group_make_group, 754 754 .drop_item = nullb_group_drop_item, 755 755 };

+8

drivers/block/rnbd/rnbd-clt-sysfs.c

··· 475 475 } 476 476 } 477 477 478 + static void rnbd_dev_release(struct kobject *kobj) 479 + { 480 + struct rnbd_clt_dev *dev = container_of(kobj, struct rnbd_clt_dev, kobj); 481 + 482 + kfree(dev); 483 + } 484 + 478 485 static const struct kobj_type rnbd_dev_ktype = { 479 486 .sysfs_ops = &kobj_sysfs_ops, 480 487 .default_groups = rnbd_dev_groups, 488 + .release = rnbd_dev_release, 481 489 }; 482 490 483 491 static int rnbd_clt_add_dev_kobj(struct rnbd_clt_dev *dev)

+10 -9

drivers/block/rnbd/rnbd-clt.c

··· 60 60 kfree(dev->pathname); 61 61 rnbd_clt_put_sess(dev->sess); 62 62 mutex_destroy(&dev->lock); 63 - kfree(dev); 63 + 64 + if (dev->kobj.state_initialized) 65 + kobject_put(&dev->kobj); 64 66 } 65 67 66 68 static inline bool rnbd_clt_get_dev(struct rnbd_clt_dev *dev) ··· 1519 1517 return found; 1520 1518 } 1521 1519 1522 - static void delete_dev(struct rnbd_clt_dev *dev) 1520 + static void rnbd_delete_dev(struct rnbd_clt_dev *dev) 1523 1521 { 1524 1522 struct rnbd_clt_session *sess = dev->sess; 1525 1523 ··· 1640 1638 kfree(rsp); 1641 1639 rnbd_put_iu(sess, iu); 1642 1640 del_dev: 1643 - delete_dev(dev); 1641 + rnbd_delete_dev(dev); 1644 1642 put_dev: 1645 1643 rnbd_clt_put_dev(dev); 1646 1644 put_sess: ··· 1649 1647 return ERR_PTR(ret); 1650 1648 } 1651 1649 1652 - static void destroy_gen_disk(struct rnbd_clt_dev *dev) 1650 + static void rnbd_destroy_gen_disk(struct rnbd_clt_dev *dev) 1653 1651 { 1654 1652 del_gendisk(dev->gd); 1655 1653 put_disk(dev->gd); 1656 1654 } 1657 1655 1658 - static void destroy_sysfs(struct rnbd_clt_dev *dev, 1656 + static void rnbd_destroy_sysfs(struct rnbd_clt_dev *dev, 1659 1657 const struct attribute *sysfs_self) 1660 1658 { 1661 1659 rnbd_clt_remove_dev_symlink(dev); ··· 1664 1662 /* To avoid deadlock firstly remove itself */ 1665 1663 sysfs_remove_file_self(&dev->kobj, sysfs_self); 1666 1664 kobject_del(&dev->kobj); 1667 - kobject_put(&dev->kobj); 1668 1665 } 1669 1666 } 1670 1667 ··· 1692 1691 dev->dev_state = DEV_STATE_UNMAPPED; 1693 1692 mutex_unlock(&dev->lock); 1694 1693 1695 - delete_dev(dev); 1696 - destroy_sysfs(dev, sysfs_self); 1697 - destroy_gen_disk(dev); 1694 + rnbd_delete_dev(dev); 1695 + rnbd_destroy_sysfs(dev, sysfs_self); 1696 + rnbd_destroy_gen_disk(dev); 1698 1697 if (was_mapped && sess->rtrs) 1699 1698 send_msg_close(dev, dev->device_id, RTRS_PERMIT_WAIT); 1700 1699
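The two rnbd client hunks above move the final `kfree(dev)` into a kobject `release` callback, so the memory is freed when the last reference is dropped rather than at a fixed point in the teardown path. The general pattern can be sketched in user space as follows. Everything here (`struct ex_obj`, `ex_get()`, `ex_put()`, `ex_released`) is hypothetical and single-threaded; it is not the kobject API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted object with a release callback. */
struct ex_obj {
	unsigned int refs;
	void (*release)(struct ex_obj *obj);
};

static int ex_released;	/* counts release calls, for demonstration only */

static void ex_release(struct ex_obj *obj)
{
	ex_released++;
	free(obj);
}

static struct ex_obj *ex_new(void)
{
	struct ex_obj *obj = malloc(sizeof(*obj));

	if (obj) {
		obj->refs = 1;	/* the creator holds the first reference */
		obj->release = ex_release;
	}
	return obj;
}

static void ex_get(struct ex_obj *obj)
{
	obj->refs++;
}

/* Drop one reference; the object is freed only when the count reaches zero. */
static void ex_put(struct ex_obj *obj)
{
	assert(obj->refs > 0);
	if (--obj->refs == 0)
		obj->release(obj);
}
```

The point of the pattern is that no caller frees the object directly, so a holder of a still-valid reference can never observe freed memory.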

+17 -1

drivers/block/rnbd/rnbd-proto.h

··· 18 18 #include <rdma/ib.h> 19 19 20 20 #define RNBD_PROTO_VER_MAJOR 2 21 - #define RNBD_PROTO_VER_MINOR 0 21 + #define RNBD_PROTO_VER_MINOR 2 22 22 23 23 /* The default port number the RTRS server is listening on. */ 24 24 #define RTRS_PORT 1234 ··· 197 197 * 198 198 * @RNBD_F_SYNC: request is sync (sync write or read) 199 199 * @RNBD_F_FUA: forced unit access 200 + * @RNBD_F_PREFLUSH: request for cache flush 201 + * @RNBD_F_NOUNMAP: do not free blocks when zeroing 200 202 */ 201 203 enum rnbd_io_flags { 202 204 ··· 213 211 /* Flags */ 214 212 RNBD_F_SYNC = 1<<(RNBD_OP_BITS + 0), 215 213 RNBD_F_FUA = 1<<(RNBD_OP_BITS + 1), 214 + RNBD_F_PREFLUSH = 1<<(RNBD_OP_BITS + 2), 215 + RNBD_F_NOUNMAP = 1<<(RNBD_OP_BITS + 3) 216 216 }; 217 217 218 218 static inline u32 rnbd_op(u32 flags) ··· 249 245 break; 250 246 case RNBD_OP_WRITE_ZEROES: 251 247 bio_opf = REQ_OP_WRITE_ZEROES; 248 + 249 + if (rnbd_opf & RNBD_F_NOUNMAP) 250 + bio_opf |= REQ_NOUNMAP; 252 251 break; 253 252 default: 254 253 WARN(1, "Unknown RNBD type: %d (flags %d)\n", ··· 264 257 265 258 if (rnbd_opf & RNBD_F_FUA) 266 259 bio_opf |= REQ_FUA; 260 + 261 + if (rnbd_opf & RNBD_F_PREFLUSH) 262 + bio_opf |= REQ_PREFLUSH; 267 263 268 264 return bio_opf; 269 265 } ··· 290 280 break; 291 281 case REQ_OP_WRITE_ZEROES: 292 282 rnbd_opf = RNBD_OP_WRITE_ZEROES; 283 + 284 + if (rq->cmd_flags & REQ_NOUNMAP) 285 + rnbd_opf |= RNBD_F_NOUNMAP; 293 286 break; 294 287 case REQ_OP_FLUSH: 295 288 rnbd_opf = RNBD_OP_FLUSH; ··· 309 296 310 297 if (op_is_flush(rq->cmd_flags)) 311 298 rnbd_opf |= RNBD_F_FUA; 299 + 300 + if (rq->cmd_flags & REQ_PREFLUSH) 301 + rnbd_opf |= RNBD_F_PREFLUSH; 312 302 313 303 return rnbd_opf; 314 304 }
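The new `RNBD_F_PREFLUSH` and `RNBD_F_NOUNMAP` flags above occupy the next two bits above the operation field. The sketch below shows how such a packed word can be split into its operation and flag parts. The value of `RNBD_OP_BITS` and the numeric code used for the write-zeroes operation are assumptions (neither is shown in this excerpt), and `ex_op()` / `ex_flags()` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define RNBD_OP_BITS	8	/* assumed width of the operation field */
#define RNBD_OP_MASK	((1u << RNBD_OP_BITS) - 1)

/* Flag layout as in the diff: one bit each, starting right above the op field. */
#define RNBD_F_SYNC	(1u << (RNBD_OP_BITS + 0))
#define RNBD_F_FUA	(1u << (RNBD_OP_BITS + 1))
#define RNBD_F_PREFLUSH	(1u << (RNBD_OP_BITS + 2))
#define RNBD_F_NOUNMAP	(1u << (RNBD_OP_BITS + 3))

#define EX_OP_WRITE_ZEROES 5u	/* hypothetical op code, for illustration only */

/* Low bits carry the operation. */
static uint32_t ex_op(uint32_t word)
{
	return word & RNBD_OP_MASK;
}

/* High bits carry the modifier flags. */
static uint32_t ex_flags(uint32_t word)
{
	return word & ~RNBD_OP_MASK;
}
```

Because each flag has its own bit, a receiver that does not know the two new flags can still extract the operation unchanged, which is presumably why the change is carried as a minor protocol version bump.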

+2 -20

drivers/block/rnbd/rnbd-srv-trace.h

··· 44 44 DEFINE_LINK_EVENT(create_sess); 45 45 DEFINE_LINK_EVENT(destroy_sess); 46 46 47 - TRACE_DEFINE_ENUM(RNBD_OP_READ); 48 - TRACE_DEFINE_ENUM(RNBD_OP_WRITE); 49 - TRACE_DEFINE_ENUM(RNBD_OP_FLUSH); 50 - TRACE_DEFINE_ENUM(RNBD_OP_DISCARD); 51 - TRACE_DEFINE_ENUM(RNBD_OP_SECURE_ERASE); 52 - TRACE_DEFINE_ENUM(RNBD_F_SYNC); 53 - TRACE_DEFINE_ENUM(RNBD_F_FUA); 54 - 55 - #define show_rnbd_rw_flags(x) \ 56 - __print_flags(x, "|", \ 57 - { RNBD_OP_READ, "READ" }, \ 58 - { RNBD_OP_WRITE, "WRITE" }, \ 59 - { RNBD_OP_FLUSH, "FLUSH" }, \ 60 - { RNBD_OP_DISCARD, "DISCARD" }, \ 61 - { RNBD_OP_SECURE_ERASE, "SECURE_ERASE" }, \ 62 - { RNBD_F_SYNC, "SYNC" }, \ 63 - { RNBD_F_FUA, "FUA" }) 64 - 65 47 TRACE_EVENT(process_rdma, 66 48 TP_PROTO(struct rnbd_srv_session *srv, 67 49 const struct rnbd_msg_io *msg, ··· 79 97 __entry->usrlen = usrlen; 80 98 ), 81 99 82 - TP_printk("I/O req: sess: %s, type: %s, ver: %d, devid: %u, sector: %llu, bsize: %u, flags: %s, ioprio: %d, datalen: %u, usrlen: %zu", 100 + TP_printk("I/O req: sess: %s, type: %s, ver: %d, devid: %u, sector: %llu, bsize: %u, flags: %u, ioprio: %d, datalen: %u, usrlen: %zu", 83 101 __get_str(sessname), 84 102 __print_symbolic(__entry->dir, 85 103 { READ, "READ" }, ··· 88 106 __entry->device_id, 89 107 __entry->sector, 90 108 __entry->bi_size, 91 - show_rnbd_rw_flags(__entry->flags), 109 + __entry->flags, 92 110 __entry->ioprio, 93 111 __entry->datalen, 94 112 __entry->usrlen

+26 -10

drivers/block/rnbd/rnbd-srv.c

··· 145 145 priv->sess_dev = sess_dev; 146 146 priv->id = id; 147 147 148 - bio = bio_alloc(file_bdev(sess_dev->bdev_file), 1, 148 + bio = bio_alloc(file_bdev(sess_dev->bdev_file), !!datalen, 149 149 rnbd_to_bio_flags(le32_to_cpu(msg->rw)), GFP_KERNEL); 150 - bio_add_virt_nofail(bio, data, datalen); 151 - 152 - bio->bi_opf = rnbd_to_bio_flags(le32_to_cpu(msg->rw)); 153 - if (bio_has_data(bio) && 154 - bio->bi_iter.bi_size != le32_to_cpu(msg->bi_size)) { 155 - rnbd_srv_err_rl(sess_dev, "Datalen mismatch: bio bi_size (%u), bi_size (%u)\n", 156 - bio->bi_iter.bi_size, msg->bi_size); 157 - err = -EINVAL; 158 - goto bio_put; 150 + if (unlikely(!bio)) { 151 + err = -ENOMEM; 152 + goto put_sess_dev; 159 153 } 154 + 155 + if (!datalen) { 156 + /* 157 + * For special requests like DISCARD and WRITE_ZEROES, the datalen is zero. 158 + */ 159 + bio->bi_iter.bi_size = le32_to_cpu(msg->bi_size); 160 + } else { 161 + bio_add_virt_nofail(bio, data, datalen); 162 + bio->bi_opf = rnbd_to_bio_flags(le32_to_cpu(msg->rw)); 163 + if (bio->bi_iter.bi_size != le32_to_cpu(msg->bi_size)) { 164 + rnbd_srv_err_rl(sess_dev, 165 + "Datalen mismatch: bio bi_size (%u), bi_size (%u)\n", 166 + bio->bi_iter.bi_size, msg->bi_size); 167 + err = -EINVAL; 168 + goto bio_put; 169 + } 170 + } 171 + 160 172 bio->bi_end_io = rnbd_dev_bi_end_io; 161 173 bio->bi_private = priv; 162 174 bio->bi_iter.bi_sector = le64_to_cpu(msg->sector); ··· 182 170 183 171 bio_put: 184 172 bio_put(bio); 173 + put_sess_dev: 185 174 rnbd_put_sess_dev(sess_dev); 186 175 err: 187 176 kfree(priv); ··· 551 538 { 552 539 struct block_device *bdev = file_bdev(sess_dev->bdev_file); 553 540 541 + memset(rsp, 0, sizeof(*rsp)); 542 + 554 543 rsp->hdr.type = cpu_to_le16(RNBD_MSG_OPEN_RSP); 555 544 rsp->device_id = cpu_to_le32(sess_dev->device_id); 556 545 rsp->nsectors = cpu_to_le64(bdev_nr_sectors(bdev)); ··· 659 644 660 645 trace_process_msg_sess_info(srv_sess, sess_info_msg); 661 646 647 + memset(rsp, 0, sizeof(*rsp)); 662 648 
rsp->hdr.type = cpu_to_le16(RNBD_MSG_SESS_INFO_RSP); 663 649 rsp->ver = srv_sess->ver; 664 650 }
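The reworked I/O path in `process_rdma()` above treats requests without a payload separately from requests that carry data. A minimal sketch of that size decision follows; `resolve_bio_size()` is a hypothetical helper, and the real code operates on a `struct bio` rather than on bare integers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the branch in the diff:
 * - no payload (e.g. DISCARD, WRITE_ZEROES): trust the size in the message;
 * - payload present: its length must match the size in the message.
 * Returns 0 and stores the size on success, -EINVAL on a mismatch.
 */
static int resolve_bio_size(uint32_t datalen, uint32_t msg_bi_size,
			    uint32_t *bi_size)
{
	if (!datalen) {
		*bi_size = msg_bi_size;
		return 0;
	}
	if (datalen != msg_bi_size)
		return -EINVAL;

	*bi_size = datalen;
	return 0;
}
```

The sketch covers only the size check; the `memset()` calls added to the response builders in the same diff serve a different purpose, zeroing the reply before individual fields are filled in.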

+1 -2

drivers/block/rnull/configfs.rs

··· 13 13 str::{kstrtobool_bytes, CString}, 14 14 sync::Mutex, 15 15 }; 16 - use pin_init::PinInit; 17 16 18 17 pub(crate) fn subsystem() -> impl PinInit<kernel::configfs::Subsystem<Config>, Error> { 19 18 let item_type = configfs_attrs! { ··· 24 25 ], 25 26 }; 26 27 27 - kernel::configfs::Subsystem::new(c_str!("rnull"), item_type, try_pin_init!(Config {})) 28 + kernel::configfs::Subsystem::new(c"rnull", item_type, try_pin_init!(Config {})) 28 29 } 29 30 30 31 #[pin_data]

-3

drivers/block/rnull/rnull.rs

··· 14 14 Operations, TagSet, 15 15 }, 16 16 }, 17 - error::Result, 18 - pr_info, 19 17 prelude::*, 20 18 sync::{aref::ARef, Arc}, 21 19 }; 22 - use pin_init::PinInit; 23 20 24 21 module! { 25 22 type: NullBlkModule,

+1704 -211

drivers/block/ublk_drv.c

··· 44 44 #include <linux/task_work.h> 45 45 #include <linux/namei.h> 46 46 #include <linux/kref.h> 47 + #include <linux/kfifo.h> 48 + #include <linux/blk-integrity.h> 49 + #include <uapi/linux/fs.h> 47 50 #include <uapi/linux/ublk_cmd.h> 48 51 49 52 #define UBLK_MINORS (1U << MINORBITS) ··· 57 54 #define UBLK_CMD_DEL_DEV_ASYNC _IOC_NR(UBLK_U_CMD_DEL_DEV_ASYNC) 58 55 #define UBLK_CMD_UPDATE_SIZE _IOC_NR(UBLK_U_CMD_UPDATE_SIZE) 59 56 #define UBLK_CMD_QUIESCE_DEV _IOC_NR(UBLK_U_CMD_QUIESCE_DEV) 57 + #define UBLK_CMD_TRY_STOP_DEV _IOC_NR(UBLK_U_CMD_TRY_STOP_DEV) 60 58 61 59 #define UBLK_IO_REGISTER_IO_BUF _IOC_NR(UBLK_U_IO_REGISTER_IO_BUF) 62 60 #define UBLK_IO_UNREGISTER_IO_BUF _IOC_NR(UBLK_U_IO_UNREGISTER_IO_BUF) ··· 77 73 | UBLK_F_AUTO_BUF_REG \ 78 74 | UBLK_F_QUIESCE \ 79 75 | UBLK_F_PER_IO_DAEMON \ 80 - | UBLK_F_BUF_REG_OFF_DAEMON) 76 + | UBLK_F_BUF_REG_OFF_DAEMON \ 77 + | (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) ? UBLK_F_INTEGRITY : 0) \ 78 + | UBLK_F_SAFE_STOP_DEV \ 79 + | UBLK_F_BATCH_IO \ 80 + | UBLK_F_NO_AUTO_PART_SCAN) 81 81 82 82 #define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \ 83 83 | UBLK_F_USER_RECOVERY_REISSUE \ ··· 91 83 #define UBLK_PARAM_TYPE_ALL \ 92 84 (UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DISCARD | \ 93 85 UBLK_PARAM_TYPE_DEVT | UBLK_PARAM_TYPE_ZONED | \ 94 - UBLK_PARAM_TYPE_DMA_ALIGN | UBLK_PARAM_TYPE_SEGMENT) 86 + UBLK_PARAM_TYPE_DMA_ALIGN | UBLK_PARAM_TYPE_SEGMENT | \ 87 + UBLK_PARAM_TYPE_INTEGRITY) 88 + 89 + #define UBLK_BATCH_F_ALL \ 90 + (UBLK_BATCH_F_HAS_ZONE_LBA | \ 91 + UBLK_BATCH_F_HAS_BUF_ADDR | \ 92 + UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK) 93 + 94 + /* ublk batch fetch uring_cmd */ 95 + struct ublk_batch_fetch_cmd { 96 + struct list_head node; 97 + struct io_uring_cmd *cmd; 98 + unsigned short buf_group; 99 + }; 95 100 96 101 struct ublk_uring_cmd_pdu { 97 102 /* ··· 126 105 */ 127 106 struct ublk_queue *ubq; 128 107 129 - u16 tag; 108 + union { 109 + u16 tag; 110 + struct ublk_batch_fetch_cmd *fcmd; /* batch io only */ 111 + }; 
112 + }; 113 + 114 + struct ublk_batch_io_data { 115 + struct ublk_device *ub; 116 + struct io_uring_cmd *cmd; 117 + struct ublk_batch_io header; 118 + unsigned int issue_flags; 119 + struct io_comp_batch *iob; 130 120 }; 131 121 132 122 /* ··· 187 155 */ 188 156 #define UBLK_REFCOUNT_INIT (REFCOUNT_MAX / 2) 189 157 158 + /* used for UBLK_F_BATCH_IO only */ 159 + #define UBLK_BATCH_IO_UNUSED_TAG ((unsigned short)-1) 160 + 190 161 union ublk_io_buf { 191 162 __u64 addr; 192 163 struct ublk_auto_buf_reg auto_reg; ··· 214 179 * if user copy or zero copy are enabled: 215 180 * - UBLK_REFCOUNT_INIT from dispatch to the server 216 181 * until UBLK_IO_COMMIT_AND_FETCH_REQ 217 - * - 1 for each inflight ublk_ch_{read,write}_iter() call 182 + * - 1 for each inflight ublk_ch_{read,write}_iter() call not on task 218 183 * - 1 for each io_uring registered buffer not registered on task 219 184 * The I/O can only be completed once all references are dropped. 220 185 * User copy and buffer registration operations are only permitted ··· 225 190 unsigned task_registered_buffers; 226 191 227 192 void *buf_ctx_handle; 193 + spinlock_t lock; 228 194 } ____cacheline_aligned_in_smp; 229 195 230 196 struct ublk_queue { ··· 240 204 bool fail_io; /* copy of dev->state == UBLK_S_DEV_FAIL_IO */ 241 205 spinlock_t cancel_lock; 242 206 struct ublk_device *dev; 207 + u32 nr_io_ready; 208 + 209 + /* 210 + * For supporting UBLK_F_BATCH_IO only. 211 + * 212 + * Inflight ublk request tag is saved in this fifo 213 + * 214 + * There are multiple writer from ublk_queue_rq() or ublk_queue_rqs(), 215 + * so lock is required for storing request tag to fifo 216 + * 217 + * Make sure just one reader for fetching request from task work 218 + * function to ublk server, so no need to grab the lock in reader 219 + * side. 220 + * 221 + * Batch I/O State Management: 222 + * 223 + * The batch I/O system uses implicit state management based on the 224 + * combination of three key variables below. 
225 + * 226 + * - IDLE: list_empty(&fcmd_head) && !active_fcmd 227 + * No fetch commands available, events queue in evts_fifo 228 + * 229 + * - READY: !list_empty(&fcmd_head) && !active_fcmd 230 + * Fetch commands available but none processing events 231 + * 232 + * - ACTIVE: active_fcmd 233 + * One fetch command actively processing events from evts_fifo 234 + * 235 + * Key Invariants: 236 + * - At most one active_fcmd at any time (single reader) 237 + * - active_fcmd is always from fcmd_head list when non-NULL 238 + * - evts_fifo can be read locklessly by the single active reader 239 + * - All state transitions require evts_lock protection 240 + * - Multiple writers to evts_fifo require lock protection 241 + */ 242 + struct { 243 + DECLARE_KFIFO_PTR(evts_fifo, unsigned short); 244 + spinlock_t evts_lock; 245 + 246 + /* List of fetch commands available to process events */ 247 + struct list_head fcmd_head; 248 + 249 + /* Currently active fetch command (NULL = none active) */ 250 + struct ublk_batch_fetch_cmd *active_fcmd; 251 + }____cacheline_aligned_in_smp; 252 + 243 253 struct ublk_io ios[] __counted_by(q_depth); 244 254 }; 245 255 ··· 313 231 struct ublk_params params; 314 232 315 233 struct completion completion; 316 - u32 nr_io_ready; 234 + u32 nr_queue_ready; 317 235 bool unprivileged_daemons; 318 236 struct mutex cancel_mutex; 319 237 bool canceling; 320 238 pid_t ublksrv_tgid; 321 239 struct delayed_work exit_work; 322 240 struct work_struct partition_scan_work; 241 + 242 + bool block_open; /* protected by open_mutex */ 323 243 324 244 struct ublk_queue *queues[]; 325 245 }; ··· 336 252 static void ublk_stop_dev_unlocked(struct ublk_device *ub); 337 253 static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq); 338 254 static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub, 339 - u16 q_id, u16 tag, struct ublk_io *io, size_t offset); 255 + u16 q_id, u16 tag, struct ublk_io *io); 340 256 static inline unsigned int 
ublk_req_build_flags(struct request *req); 257 + static void ublk_batch_dispatch(struct ublk_queue *ubq, 258 + const struct ublk_batch_io_data *data, 259 + struct ublk_batch_fetch_cmd *fcmd); 260 + 261 + static inline bool ublk_dev_support_batch_io(const struct ublk_device *ub) 262 + { 263 + return ub->dev_info.flags & UBLK_F_BATCH_IO; 264 + } 265 + 266 + static inline bool ublk_support_batch_io(const struct ublk_queue *ubq) 267 + { 268 + return ubq->flags & UBLK_F_BATCH_IO; 269 + } 270 + 271 + static inline void ublk_io_lock(struct ublk_io *io) 272 + { 273 + spin_lock(&io->lock); 274 + } 275 + 276 + static inline void ublk_io_unlock(struct ublk_io *io) 277 + { 278 + spin_unlock(&io->lock); 279 + } 280 + 281 + /* Initialize the event queue */ 282 + static inline int ublk_io_evts_init(struct ublk_queue *q, unsigned int size, 283 + int numa_node) 284 + { 285 + spin_lock_init(&q->evts_lock); 286 + return kfifo_alloc_node(&q->evts_fifo, size, GFP_KERNEL, numa_node); 287 + } 288 + 289 + /* Check if event queue is empty */ 290 + static inline bool ublk_io_evts_empty(const struct ublk_queue *q) 291 + { 292 + return kfifo_is_empty(&q->evts_fifo); 293 + } 294 + 295 + static inline void ublk_io_evts_deinit(struct ublk_queue *q) 296 + { 297 + WARN_ON_ONCE(!kfifo_is_empty(&q->evts_fifo)); 298 + kfifo_free(&q->evts_fifo); 299 + } 341 300 342 301 static inline struct ublksrv_io_desc * 343 302 ublk_get_iod(const struct ublk_queue *ubq, unsigned tag) 344 303 { 345 304 return &ubq->io_cmd_buf[tag]; 305 + } 306 + 307 + static inline bool ublk_support_zero_copy(const struct ublk_queue *ubq) 308 + { 309 + return ubq->flags & UBLK_F_SUPPORT_ZERO_COPY; 310 + } 311 + 312 + static inline bool ublk_dev_support_zero_copy(const struct ublk_device *ub) 313 + { 314 + return ub->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY; 315 + } 316 + 317 + static inline bool ublk_support_auto_buf_reg(const struct ublk_queue *ubq) 318 + { 319 + return ubq->flags & UBLK_F_AUTO_BUF_REG; 320 + } 321 + 322 + static 
inline bool ublk_dev_support_auto_buf_reg(const struct ublk_device *ub) 323 + { 324 + return ub->dev_info.flags & UBLK_F_AUTO_BUF_REG; 325 + } 326 + 327 + static inline bool ublk_support_user_copy(const struct ublk_queue *ubq) 328 + { 329 + return ubq->flags & UBLK_F_USER_COPY; 330 + } 331 + 332 + static inline bool ublk_dev_support_user_copy(const struct ublk_device *ub) 333 + { 334 + return ub->dev_info.flags & UBLK_F_USER_COPY; 346 335 } 347 336 348 337 static inline bool ublk_dev_is_zoned(const struct ublk_device *ub) ··· 426 269 static inline bool ublk_queue_is_zoned(const struct ublk_queue *ubq) 427 270 { 428 271 return ubq->flags & UBLK_F_ZONED; 272 + } 273 + 274 + static inline bool ublk_dev_support_integrity(const struct ublk_device *ub) 275 + { 276 + return ub->dev_info.flags & UBLK_F_INTEGRITY; 429 277 } 430 278 431 279 #ifdef CONFIG_BLK_DEV_ZONED ··· 694 532 #endif 695 533 696 534 static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io, 697 - bool need_map); 535 + bool need_map, struct io_comp_batch *iob); 698 536 699 537 static dev_t ublk_chr_devt; 700 538 static const struct class ublk_chr_class = { ··· 707 545 708 546 static DEFINE_MUTEX(ublk_ctl_mutex); 709 547 548 + static struct ublk_batch_fetch_cmd * 549 + ublk_batch_alloc_fcmd(struct io_uring_cmd *cmd) 550 + { 551 + struct ublk_batch_fetch_cmd *fcmd = kzalloc(sizeof(*fcmd), GFP_NOIO); 552 + 553 + if (fcmd) { 554 + fcmd->cmd = cmd; 555 + fcmd->buf_group = READ_ONCE(cmd->sqe->buf_index); 556 + } 557 + return fcmd; 558 + } 559 + 560 + static void ublk_batch_free_fcmd(struct ublk_batch_fetch_cmd *fcmd) 561 + { 562 + kfree(fcmd); 563 + } 564 + 565 + static void __ublk_release_fcmd(struct ublk_queue *ubq) 566 + { 567 + WRITE_ONCE(ubq->active_fcmd, NULL); 568 + } 569 + 570 + /* 571 + * Nothing can move on, so clear ->active_fcmd, and the caller should stop 572 + * dispatching 573 + */ 574 + static void ublk_batch_deinit_fetch_buf(struct ublk_queue *ubq, 575 + const struct 
ublk_batch_io_data *data, 576 + struct ublk_batch_fetch_cmd *fcmd, 577 + int res) 578 + { 579 + spin_lock(&ubq->evts_lock); 580 + list_del_init(&fcmd->node); 581 + WARN_ON_ONCE(fcmd != ubq->active_fcmd); 582 + __ublk_release_fcmd(ubq); 583 + spin_unlock(&ubq->evts_lock); 584 + 585 + io_uring_cmd_done(fcmd->cmd, res, data->issue_flags); 586 + ublk_batch_free_fcmd(fcmd); 587 + } 588 + 589 + static int ublk_batch_fetch_post_cqe(struct ublk_batch_fetch_cmd *fcmd, 590 + struct io_br_sel *sel, 591 + unsigned int issue_flags) 592 + { 593 + if (io_uring_mshot_cmd_post_cqe(fcmd->cmd, sel, issue_flags)) 594 + return -ENOBUFS; 595 + return 0; 596 + } 597 + 598 + static ssize_t ublk_batch_copy_io_tags(struct ublk_batch_fetch_cmd *fcmd, 599 + void __user *buf, const u16 *tag_buf, 600 + unsigned int len) 601 + { 602 + if (copy_to_user(buf, tag_buf, len)) 603 + return -EFAULT; 604 + return len; 605 + } 710 606 711 607 #define UBLK_MAX_UBLKS UBLK_MINORS 712 608 ··· 804 584 set_disk_ro(ub->ub_disk, true); 805 585 806 586 set_capacity(ub->ub_disk, p->dev_sectors); 587 + } 588 + 589 + static int ublk_integrity_flags(u32 flags) 590 + { 591 + int ret_flags = 0; 592 + 593 + if (flags & LBMD_PI_CAP_INTEGRITY) { 594 + flags &= ~LBMD_PI_CAP_INTEGRITY; 595 + ret_flags |= BLK_INTEGRITY_DEVICE_CAPABLE; 596 + } 597 + if (flags & LBMD_PI_CAP_REFTAG) { 598 + flags &= ~LBMD_PI_CAP_REFTAG; 599 + ret_flags |= BLK_INTEGRITY_REF_TAG; 600 + } 601 + return flags ? 
-EINVAL : ret_flags; 602 + } 603 + 604 + static int ublk_integrity_pi_tuple_size(u8 csum_type) 605 + { 606 + switch (csum_type) { 607 + case LBMD_PI_CSUM_NONE: 608 + return 0; 609 + case LBMD_PI_CSUM_IP: 610 + case LBMD_PI_CSUM_CRC16_T10DIF: 611 + return 8; 612 + case LBMD_PI_CSUM_CRC64_NVME: 613 + return 16; 614 + default: 615 + return -EINVAL; 616 + } 617 + } 618 + 619 + static enum blk_integrity_checksum ublk_integrity_csum_type(u8 csum_type) 620 + { 621 + switch (csum_type) { 622 + case LBMD_PI_CSUM_NONE: 623 + return BLK_INTEGRITY_CSUM_NONE; 624 + case LBMD_PI_CSUM_IP: 625 + return BLK_INTEGRITY_CSUM_IP; 626 + case LBMD_PI_CSUM_CRC16_T10DIF: 627 + return BLK_INTEGRITY_CSUM_CRC; 628 + case LBMD_PI_CSUM_CRC64_NVME: 629 + return BLK_INTEGRITY_CSUM_CRC64; 630 + default: 631 + WARN_ON_ONCE(1); 632 + return BLK_INTEGRITY_CSUM_NONE; 633 + } 807 634 } 808 635 809 636 static int ublk_validate_params(const struct ublk_device *ub) ··· 915 648 return -EINVAL; 916 649 } 917 650 651 + if (ub->params.types & UBLK_PARAM_TYPE_INTEGRITY) { 652 + const struct ublk_param_integrity *p = &ub->params.integrity; 653 + int pi_tuple_size = ublk_integrity_pi_tuple_size(p->csum_type); 654 + int flags = ublk_integrity_flags(p->flags); 655 + 656 + if (!ublk_dev_support_integrity(ub)) 657 + return -EINVAL; 658 + if (flags < 0) 659 + return flags; 660 + if (pi_tuple_size < 0) 661 + return pi_tuple_size; 662 + if (!p->metadata_size) 663 + return -EINVAL; 664 + if (p->csum_type == LBMD_PI_CSUM_NONE && 665 + p->flags & LBMD_PI_CAP_REFTAG) 666 + return -EINVAL; 667 + if (p->pi_offset + pi_tuple_size > p->metadata_size) 668 + return -EINVAL; 669 + if (p->interval_exp < SECTOR_SHIFT || 670 + p->interval_exp > ub->params.basic.logical_bs_shift) 671 + return -EINVAL; 672 + } 673 + 918 674 return 0; 919 675 } 920 676 ··· 947 657 948 658 if (ub->params.types & UBLK_PARAM_TYPE_ZONED) 949 659 ublk_dev_param_zoned_apply(ub); 950 - } 951 - 952 - static inline bool ublk_support_zero_copy(const struct 
ublk_queue *ubq) 953 - { 954 - return ubq->flags & UBLK_F_SUPPORT_ZERO_COPY; 955 - } 956 - 957 - static inline bool ublk_dev_support_zero_copy(const struct ublk_device *ub) 958 - { 959 - return ub->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY; 960 - } 961 - 962 - static inline bool ublk_support_auto_buf_reg(const struct ublk_queue *ubq) 963 - { 964 - return ubq->flags & UBLK_F_AUTO_BUF_REG; 965 - } 966 - 967 - static inline bool ublk_dev_support_auto_buf_reg(const struct ublk_device *ub) 968 - { 969 - return ub->dev_info.flags & UBLK_F_AUTO_BUF_REG; 970 - } 971 - 972 - static inline bool ublk_support_user_copy(const struct ublk_queue *ubq) 973 - { 974 - return ubq->flags & UBLK_F_USER_COPY; 975 - } 976 - 977 - static inline bool ublk_dev_support_user_copy(const struct ublk_device *ub) 978 - { 979 - return ub->dev_info.flags & UBLK_F_USER_COPY; 980 660 } 981 661 982 662 static inline bool ublk_need_map_io(const struct ublk_queue *ubq) ··· 986 726 ublk_dev_support_auto_buf_reg(ub); 987 727 } 988 728 729 + /* 730 + * ublk IO Reference Counting Design 731 + * ================================== 732 + * 733 + * For user-copy and zero-copy modes, ublk uses a split reference model with 734 + * two counters that together track IO lifetime: 735 + * 736 + * - io->ref: refcount for off-task buffer registrations and user-copy ops 737 + * - io->task_registered_buffers: count of buffers registered on the IO task 738 + * 739 + * Key Invariant: 740 + * -------------- 741 + * When IO is dispatched to the ublk server (UBLK_IO_FLAG_OWNED_BY_SRV set), 742 + * the sum (io->ref + io->task_registered_buffers) must equal UBLK_REFCOUNT_INIT 743 + * when no active references exist. After IO completion, both counters become 744 + * zero. For I/Os not currently dispatched to the ublk server, both ref and 745 + * task_registered_buffers are 0. 746 + * 747 + * This invariant is checked by ublk_check_and_reset_active_ref() during daemon 748 + * exit to determine if all references have been released. 
749 + * 750 + * Why Split Counters: 751 + * ------------------- 752 + * Buffers registered on the IO daemon task can use the lightweight 753 + * task_registered_buffers counter (simple increment/decrement) instead of 754 + * atomic refcount operations. The ublk_io_release() callback checks if 755 + * current == io->task to decide which counter to update. 756 + * 757 + * This optimization only applies before IO completion. At completion, 758 + * ublk_sub_req_ref() collapses task_registered_buffers into the atomic ref. 759 + * After that, all subsequent buffer unregistrations must use the atomic ref 760 + * since they may be releasing the last reference. 761 + * 762 + * Reference Lifecycle: 763 + * -------------------- 764 + * 1. ublk_init_req_ref(): Sets io->ref = UBLK_REFCOUNT_INIT at IO dispatch 765 + * 766 + * 2. During IO processing: 767 + * - On-task buffer reg: task_registered_buffers++ (no ref change) 768 + * - Off-task buffer reg: ref++ via ublk_get_req_ref() 769 + * - Buffer unregister callback (ublk_io_release): 770 + * * If on-task: task_registered_buffers-- 771 + * * If off-task: ref-- via ublk_put_req_ref() 772 + * 773 + * 3. 
ublk_sub_req_ref() at IO completion: 774 + * - Computes: sub_refs = UBLK_REFCOUNT_INIT - task_registered_buffers 775 + * - Subtracts sub_refs from ref and zeroes task_registered_buffers 776 + * - This effectively collapses task_registered_buffers into the atomic ref, 777 + * accounting for the initial UBLK_REFCOUNT_INIT minus any on-task 778 + * buffers that were already counted 779 + * 780 + * Example (zero-copy, register on-task, unregister off-task): 781 + * - Dispatch: ref = UBLK_REFCOUNT_INIT, task_registered_buffers = 0 782 + * - Register buffer on-task: task_registered_buffers = 1 783 + * - Unregister off-task: ref-- (UBLK_REFCOUNT_INIT - 1), task_registered_buffers stays 1 784 + * - Completion via ublk_sub_req_ref(): 785 + * sub_refs = UBLK_REFCOUNT_INIT - 1, 786 + * ref = (UBLK_REFCOUNT_INIT - 1) - (UBLK_REFCOUNT_INIT - 1) = 0 787 + * 788 + * Example (auto buffer registration): 789 + * Auto buffer registration sets task_registered_buffers = 1 at dispatch. 790 + * 791 + * - Dispatch: ref = UBLK_REFCOUNT_INIT, task_registered_buffers = 1 792 + * - Buffer unregister: task_registered_buffers-- (becomes 0) 793 + * - Completion via ublk_sub_req_ref(): 794 + * sub_refs = UBLK_REFCOUNT_INIT - 0, ref becomes 0 795 + * 796 + * Example (zero-copy, ublk server killed): 797 + * When daemon is killed, io_uring cleanup unregisters buffers off-task. 798 + * ublk_check_and_reset_active_ref() waits for the invariant to hold. 
799 + * 800 + * - Dispatch: ref = UBLK_REFCOUNT_INIT, task_registered_buffers = 0 801 + * - Register buffer on-task: task_registered_buffers = 1 802 + * - Daemon killed, io_uring cleanup unregisters buffer (off-task): 803 + * ref-- (UBLK_REFCOUNT_INIT - 1), task_registered_buffers stays 1 804 + * - Daemon exit check: sum = (UBLK_REFCOUNT_INIT - 1) + 1 = UBLK_REFCOUNT_INIT 805 + * - Sum equals UBLK_REFCOUNT_INIT, then both counters are zeroed by 806 + * ublk_check_and_reset_active_ref(), so ublk_abort_queue() can proceed 807 + * and abort pending requests 808 + * 809 + * Batch IO Special Case: 810 + * ---------------------- 811 + * In batch IO mode, io->task is NULL. This means ublk_io_release() always 812 + * takes the off-task path (ublk_put_req_ref), decrementing io->ref. The 813 + * task_registered_buffers counter still tracks registered buffers for the 814 + * invariant check, even though the callback doesn't decrement it. 815 + * 816 + * Note: updating task_registered_buffers is protected by io->lock. 
817 + */ 989 818 static inline void ublk_init_req_ref(const struct ublk_queue *ubq, 990 819 struct ublk_io *io) 991 820 { ··· 1093 744 return; 1094 745 1095 746 /* ublk_need_map_io() and ublk_need_req_ref() are mutually exclusive */ 1096 - __ublk_complete_rq(req, io, false); 747 + __ublk_complete_rq(req, io, false, NULL); 1097 748 } 1098 749 1099 750 static inline bool ublk_sub_req_ref(struct ublk_io *io) ··· 1254 905 return -EPERM; 1255 906 } 1256 907 908 + if (ub->block_open) 909 + return -ENXIO; 910 + 1257 911 return 0; 1258 912 } 1259 913 ··· 1266 914 .free_disk = ublk_free_disk, 1267 915 .report_zones = ublk_report_zones, 1268 916 }; 917 + 918 + static bool ublk_copy_user_bvec(const struct bio_vec *bv, unsigned *offset, 919 + struct iov_iter *uiter, int dir, size_t *done) 920 + { 921 + unsigned len; 922 + void *bv_buf; 923 + size_t copied; 924 + 925 + if (*offset >= bv->bv_len) { 926 + *offset -= bv->bv_len; 927 + return true; 928 + } 929 + 930 + len = bv->bv_len - *offset; 931 + bv_buf = kmap_local_page(bv->bv_page) + bv->bv_offset + *offset; 932 + if (dir == ITER_DEST) 933 + copied = copy_to_iter(bv_buf, len, uiter); 934 + else 935 + copied = copy_from_iter(bv_buf, len, uiter); 936 + 937 + kunmap_local(bv_buf); 938 + 939 + *done += copied; 940 + if (copied < len) 941 + return false; 942 + 943 + *offset = 0; 944 + return true; 945 + } 1269 946 1270 947 /* 1271 948 * Copy data between request pages and io_iter, and 'offset' ··· 1308 927 size_t done = 0; 1309 928 1310 929 rq_for_each_segment(bv, req, iter) { 1311 - unsigned len; 1312 - void *bv_buf; 1313 - size_t copied; 1314 - 1315 - if (offset >= bv.bv_len) { 1316 - offset -= bv.bv_len; 1317 - continue; 1318 - } 1319 - 1320 - len = bv.bv_len - offset; 1321 - bv_buf = kmap_local_page(bv.bv_page) + bv.bv_offset + offset; 1322 - if (dir == ITER_DEST) 1323 - copied = copy_to_iter(bv_buf, len, uiter); 1324 - else 1325 - copied = copy_from_iter(bv_buf, len, uiter); 1326 - 1327 - kunmap_local(bv_buf); 1328 - 1329 - 
done += copied; 1330 - if (copied < len) 930 + if (!ublk_copy_user_bvec(&bv, &offset, uiter, dir, &done)) 1331 931 break; 1332 - 1333 - offset = 0; 1334 932 } 1335 933 return done; 1336 934 } 935 + 936 + #ifdef CONFIG_BLK_DEV_INTEGRITY 937 + static size_t ublk_copy_user_integrity(const struct request *req, 938 + unsigned offset, struct iov_iter *uiter, int dir) 939 + { 940 + size_t done = 0; 941 + struct bio *bio = req->bio; 942 + struct bvec_iter iter; 943 + struct bio_vec iv; 944 + 945 + if (!blk_integrity_rq(req)) 946 + return 0; 947 + 948 + bio_for_each_integrity_vec(iv, bio, iter) { 949 + if (!ublk_copy_user_bvec(&iv, &offset, uiter, dir, &done)) 950 + break; 951 + } 952 + 953 + return done; 954 + } 955 + #else /* #ifdef CONFIG_BLK_DEV_INTEGRITY */ 956 + static size_t ublk_copy_user_integrity(const struct request *req, 957 + unsigned offset, struct iov_iter *uiter, int dir) 958 + { 959 + return 0; 960 + } 961 + #endif /* #ifdef CONFIG_BLK_DEV_INTEGRITY */ 1337 962 1338 963 static inline bool ublk_need_map_req(const struct request *req) 1339 964 { ··· 1422 1035 if (req->cmd_flags & REQ_SWAP) 1423 1036 flags |= UBLK_IO_F_SWAP; 1424 1037 1038 + if (blk_integrity_rq(req)) 1039 + flags |= UBLK_IO_F_INTEGRITY; 1040 + 1425 1041 return flags; 1426 1042 } 1427 1043 ··· 1480 1090 1481 1091 /* todo: handle partial completion */ 1482 1092 static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io, 1483 - bool need_map) 1093 + bool need_map, struct io_comp_batch *iob) 1484 1094 { 1485 1095 unsigned int unmapped_bytes; 1486 1096 blk_status_t res = BLK_STS_OK; ··· 1534 1144 local_bh_enable(); 1535 1145 if (requeue) 1536 1146 blk_mq_requeue_request(req, true); 1537 - else if (likely(!blk_should_fake_timeout(req->q))) 1147 + else if (likely(!blk_should_fake_timeout(req->q))) { 1148 + if (blk_mq_add_to_batch(req, iob, false, blk_mq_end_request_batch)) 1149 + return; 1538 1150 __blk_mq_end_request(req, BLK_STS_OK); 1151 + } 1539 1152 1540 1153 return; 1541 
1154 exit: ··· 1599 1206 AUTO_BUF_REG_OK, 1600 1207 }; 1601 1208 1602 - static void ublk_prep_auto_buf_reg_io(const struct ublk_queue *ubq, 1603 - struct request *req, struct ublk_io *io, 1604 - struct io_uring_cmd *cmd, 1605 - enum auto_buf_reg_res res) 1209 + /* 1210 + * Setup io state after auto buffer registration. 1211 + * 1212 + * Must be called after ublk_auto_buf_register() is done. 1213 + * Caller must hold io->lock in batch context. 1214 + */ 1215 + static void ublk_auto_buf_io_setup(const struct ublk_queue *ubq, 1216 + struct request *req, struct ublk_io *io, 1217 + struct io_uring_cmd *cmd, 1218 + enum auto_buf_reg_res res) 1606 1219 { 1607 1220 if (res == AUTO_BUF_REG_OK) { 1608 1221 io->task_registered_buffers = 1; ··· 1619 1220 __ublk_prep_compl_io_cmd(io, req); 1620 1221 } 1621 1222 1223 + /* Register request bvec to io_uring for auto buffer registration. */ 1622 1224 static enum auto_buf_reg_res 1623 - __ublk_do_auto_buf_reg(const struct ublk_queue *ubq, struct request *req, 1225 + ublk_auto_buf_register(const struct ublk_queue *ubq, struct request *req, 1624 1226 struct ublk_io *io, struct io_uring_cmd *cmd, 1625 1227 unsigned int issue_flags) 1626 1228 { ··· 1641 1241 return AUTO_BUF_REG_OK; 1642 1242 } 1643 1243 1644 - static void ublk_do_auto_buf_reg(const struct ublk_queue *ubq, struct request *req, 1645 - struct ublk_io *io, struct io_uring_cmd *cmd, 1646 - unsigned int issue_flags) 1244 + /* 1245 + * Dispatch IO to userspace with auto buffer registration. 1246 + * 1247 + * Only called in non-batch context from task work, io->lock not held. 
1248 + */ 1249 + static void ublk_auto_buf_dispatch(const struct ublk_queue *ubq, 1250 + struct request *req, struct ublk_io *io, 1251 + struct io_uring_cmd *cmd, 1252 + unsigned int issue_flags) 1647 1253 { 1648 - enum auto_buf_reg_res res = __ublk_do_auto_buf_reg(ubq, req, io, cmd, 1254 + enum auto_buf_reg_res res = ublk_auto_buf_register(ubq, req, io, cmd, 1649 1255 issue_flags); 1650 1256 1651 1257 if (res != AUTO_BUF_REG_FAIL) { 1652 - ublk_prep_auto_buf_reg_io(ubq, req, io, cmd, res); 1258 + ublk_auto_buf_io_setup(ubq, req, io, cmd, res); 1653 1259 io_uring_cmd_done(cmd, UBLK_IO_RES_OK, issue_flags); 1654 1260 } 1655 1261 } ··· 1730 1324 return; 1731 1325 1732 1326 if (ublk_support_auto_buf_reg(ubq) && ublk_rq_has_data(req)) { 1733 - ublk_do_auto_buf_reg(ubq, req, io, io->cmd, issue_flags); 1327 + ublk_auto_buf_dispatch(ubq, req, io, io->cmd, issue_flags); 1734 1328 } else { 1735 1329 ublk_init_req_ref(ubq, io); 1736 1330 ublk_complete_io_cmd(io, req, UBLK_IO_RES_OK, issue_flags); 1737 1331 } 1332 + } 1333 + 1334 + static bool __ublk_batch_prep_dispatch(struct ublk_queue *ubq, 1335 + const struct ublk_batch_io_data *data, 1336 + unsigned short tag) 1337 + { 1338 + struct ublk_device *ub = data->ub; 1339 + struct ublk_io *io = &ubq->ios[tag]; 1340 + struct request *req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag); 1341 + enum auto_buf_reg_res res = AUTO_BUF_REG_FALLBACK; 1342 + struct io_uring_cmd *cmd = data->cmd; 1343 + 1344 + if (!ublk_start_io(ubq, req, io)) 1345 + return false; 1346 + 1347 + if (ublk_support_auto_buf_reg(ubq) && ublk_rq_has_data(req)) { 1348 + res = ublk_auto_buf_register(ubq, req, io, cmd, 1349 + data->issue_flags); 1350 + 1351 + if (res == AUTO_BUF_REG_FAIL) 1352 + return false; 1353 + } 1354 + 1355 + ublk_io_lock(io); 1356 + ublk_auto_buf_io_setup(ubq, req, io, cmd, res); 1357 + ublk_io_unlock(io); 1358 + 1359 + return true; 1360 + } 1361 + 1362 + static bool ublk_batch_prep_dispatch(struct ublk_queue *ubq, 1363 + const struct 
ublk_batch_io_data *data, 1364 + unsigned short *tag_buf, 1365 + unsigned int len) 1366 + { 1367 + bool has_unused = false; 1368 + unsigned int i; 1369 + 1370 + for (i = 0; i < len; i++) { 1371 + unsigned short tag = tag_buf[i]; 1372 + 1373 + if (!__ublk_batch_prep_dispatch(ubq, data, tag)) { 1374 + tag_buf[i] = UBLK_BATCH_IO_UNUSED_TAG; 1375 + has_unused = true; 1376 + } 1377 + } 1378 + 1379 + return has_unused; 1380 + } 1381 + 1382 + /* 1383 + * Filter out UBLK_BATCH_IO_UNUSED_TAG entries from tag_buf. 1384 + * Returns the new length after filtering. 1385 + */ 1386 + static unsigned int ublk_filter_unused_tags(unsigned short *tag_buf, 1387 + unsigned int len) 1388 + { 1389 + unsigned int i, j; 1390 + 1391 + for (i = 0, j = 0; i < len; i++) { 1392 + if (tag_buf[i] != UBLK_BATCH_IO_UNUSED_TAG) { 1393 + if (i != j) 1394 + tag_buf[j] = tag_buf[i]; 1395 + j++; 1396 + } 1397 + } 1398 + 1399 + return j; 1400 + } 1401 + 1402 + #define MAX_NR_TAG 128 1403 + static int __ublk_batch_dispatch(struct ublk_queue *ubq, 1404 + const struct ublk_batch_io_data *data, 1405 + struct ublk_batch_fetch_cmd *fcmd) 1406 + { 1407 + const unsigned int tag_sz = sizeof(unsigned short); 1408 + unsigned short tag_buf[MAX_NR_TAG]; 1409 + struct io_br_sel sel; 1410 + size_t len = 0; 1411 + bool needs_filter; 1412 + int ret; 1413 + 1414 + WARN_ON_ONCE(data->cmd != fcmd->cmd); 1415 + 1416 + sel = io_uring_cmd_buffer_select(fcmd->cmd, fcmd->buf_group, &len, 1417 + data->issue_flags); 1418 + if (sel.val < 0) 1419 + return sel.val; 1420 + if (!sel.addr) 1421 + return -ENOBUFS; 1422 + 1423 + /* single reader needn't lock and sizeof(kfifo element) is 2 bytes */ 1424 + len = min(len, sizeof(tag_buf)) / tag_sz; 1425 + len = kfifo_out(&ubq->evts_fifo, tag_buf, len); 1426 + 1427 + needs_filter = ublk_batch_prep_dispatch(ubq, data, tag_buf, len); 1428 + /* Filter out unused tags before posting to userspace */ 1429 + if (unlikely(needs_filter)) { 1430 + int new_len = ublk_filter_unused_tags(tag_buf, len); 
1431 + 1432 + /* return actual length if all are failed or requeued */ 1433 + if (!new_len) { 1434 + /* release the selected buffer */ 1435 + sel.val = 0; 1436 + WARN_ON_ONCE(!io_uring_mshot_cmd_post_cqe(fcmd->cmd, 1437 + &sel, data->issue_flags)); 1438 + return len; 1439 + } 1440 + len = new_len; 1441 + } 1442 + 1443 + sel.val = ublk_batch_copy_io_tags(fcmd, sel.addr, tag_buf, len * tag_sz); 1444 + ret = ublk_batch_fetch_post_cqe(fcmd, &sel, data->issue_flags); 1445 + if (unlikely(ret < 0)) { 1446 + int i, res; 1447 + 1448 + /* 1449 + * Undo prep state for all IOs since userspace never received them. 1450 + * This restores IOs to pre-prepared state so they can be cleanly 1451 + * re-prepared when tags are pulled from FIFO again. 1452 + */ 1453 + for (i = 0; i < len; i++) { 1454 + struct ublk_io *io = &ubq->ios[tag_buf[i]]; 1455 + int index = -1; 1456 + 1457 + ublk_io_lock(io); 1458 + if (io->flags & UBLK_IO_FLAG_AUTO_BUF_REG) 1459 + index = io->buf.auto_reg.index; 1460 + io->flags &= ~(UBLK_IO_FLAG_OWNED_BY_SRV | UBLK_IO_FLAG_AUTO_BUF_REG); 1461 + io->flags |= UBLK_IO_FLAG_ACTIVE; 1462 + ublk_io_unlock(io); 1463 + 1464 + if (index != -1) 1465 + io_buffer_unregister_bvec(data->cmd, index, 1466 + data->issue_flags); 1467 + } 1468 + 1469 + res = kfifo_in_spinlocked_noirqsave(&ubq->evts_fifo, 1470 + tag_buf, len, &ubq->evts_lock); 1471 + 1472 + pr_warn_ratelimited("%s: copy tags or post CQE failure, move back " 1473 + "tags(%d %zu) ret %d\n", __func__, res, len, 1474 + ret); 1475 + } 1476 + return ret; 1477 + } 1478 + 1479 + static struct ublk_batch_fetch_cmd *__ublk_acquire_fcmd( 1480 + struct ublk_queue *ubq) 1481 + { 1482 + struct ublk_batch_fetch_cmd *fcmd; 1483 + 1484 + lockdep_assert_held(&ubq->evts_lock); 1485 + 1486 + /* 1487 + * Ordering updating ubq->evts_fifo and checking ubq->active_fcmd. 1488 + * 1489 + * The pair is the smp_mb() in ublk_batch_dispatch(). 
1490 + * 1491 + * If ubq->active_fcmd is observed as non-NULL, the newly added tags 1492 + * can be visible in ublk_batch_dispatch() with the barrier pairing. 1493 + */ 1494 + smp_mb(); 1495 + if (READ_ONCE(ubq->active_fcmd)) { 1496 + fcmd = NULL; 1497 + } else { 1498 + fcmd = list_first_entry_or_null(&ubq->fcmd_head, 1499 + struct ublk_batch_fetch_cmd, node); 1500 + WRITE_ONCE(ubq->active_fcmd, fcmd); 1501 + } 1502 + return fcmd; 1503 + } 1504 + 1505 + static void ublk_batch_tw_cb(struct io_tw_req tw_req, io_tw_token_t tw) 1506 + { 1507 + unsigned int issue_flags = IO_URING_CMD_TASK_WORK_ISSUE_FLAGS; 1508 + struct io_uring_cmd *cmd = io_uring_cmd_from_tw(tw_req); 1509 + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); 1510 + struct ublk_batch_fetch_cmd *fcmd = pdu->fcmd; 1511 + struct ublk_batch_io_data data = { 1512 + .ub = pdu->ubq->dev, 1513 + .cmd = fcmd->cmd, 1514 + .issue_flags = issue_flags, 1515 + }; 1516 + 1517 + WARN_ON_ONCE(pdu->ubq->active_fcmd != fcmd); 1518 + 1519 + ublk_batch_dispatch(pdu->ubq, &data, fcmd); 1520 + } 1521 + 1522 + static void 1523 + ublk_batch_dispatch(struct ublk_queue *ubq, 1524 + const struct ublk_batch_io_data *data, 1525 + struct ublk_batch_fetch_cmd *fcmd) 1526 + { 1527 + struct ublk_batch_fetch_cmd *new_fcmd; 1528 + unsigned tried = 0; 1529 + int ret = 0; 1530 + 1531 + again: 1532 + while (!ublk_io_evts_empty(ubq)) { 1533 + ret = __ublk_batch_dispatch(ubq, data, fcmd); 1534 + if (ret <= 0) 1535 + break; 1536 + } 1537 + 1538 + if (ret < 0) { 1539 + ublk_batch_deinit_fetch_buf(ubq, data, fcmd, ret); 1540 + return; 1541 + } 1542 + 1543 + __ublk_release_fcmd(ubq); 1544 + /* 1545 + * Order clearing ubq->active_fcmd from __ublk_release_fcmd() and 1546 + * checking ubq->evts_fifo. 1547 + * 1548 + * The pair is the smp_mb() in __ublk_acquire_fcmd(). 
1549 + */ 1550 + smp_mb(); 1551 + if (likely(ublk_io_evts_empty(ubq))) 1552 + return; 1553 + 1554 + spin_lock(&ubq->evts_lock); 1555 + new_fcmd = __ublk_acquire_fcmd(ubq); 1556 + spin_unlock(&ubq->evts_lock); 1557 + 1558 + if (!new_fcmd) 1559 + return; 1560 + 1561 + /* Avoid lockup by allowing to handle at most 32 batches */ 1562 + if (new_fcmd == fcmd && tried++ < 32) 1563 + goto again; 1564 + 1565 + io_uring_cmd_complete_in_task(new_fcmd->cmd, ublk_batch_tw_cb); 1738 1566 } 1739 1567 1740 1568 static void ublk_cmd_tw_cb(struct io_tw_req tw_req, io_tw_token_t tw) ··· 1978 1338 struct ublk_queue *ubq = pdu->ubq; 1979 1339 1980 1340 ublk_dispatch_req(ubq, pdu->req); 1341 + } 1342 + 1343 + static void ublk_batch_queue_cmd(struct ublk_queue *ubq, struct request *rq, bool last) 1344 + { 1345 + unsigned short tag = rq->tag; 1346 + struct ublk_batch_fetch_cmd *fcmd = NULL; 1347 + 1348 + spin_lock(&ubq->evts_lock); 1349 + kfifo_put(&ubq->evts_fifo, tag); 1350 + if (last) 1351 + fcmd = __ublk_acquire_fcmd(ubq); 1352 + spin_unlock(&ubq->evts_lock); 1353 + 1354 + if (fcmd) 1355 + io_uring_cmd_complete_in_task(fcmd->cmd, ublk_batch_tw_cb); 1981 1356 } 1982 1357 1983 1358 static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq) ··· 2084 1429 return BLK_STS_OK; 2085 1430 } 2086 1431 2087 - static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx, 2088 - const struct blk_mq_queue_data *bd) 1432 + /* 1433 + * Common helper for queue_rq that handles request preparation and 1434 + * cancellation checks. Returns status and sets should_queue to indicate 1435 + * whether the caller should proceed with queuing the request. 
1436 + */ 1437 + static inline blk_status_t __ublk_queue_rq_common(struct ublk_queue *ubq, 1438 + struct request *rq, 1439 + bool *should_queue) 2089 1440 { 2090 - struct ublk_queue *ubq = hctx->driver_data; 2091 - struct request *rq = bd->rq; 2092 1441 blk_status_t res; 2093 1442 2094 1443 res = ublk_prep_req(ubq, rq, false); 2095 - if (res != BLK_STS_OK) 1444 + if (res != BLK_STS_OK) { 1445 + *should_queue = false; 2096 1446 return res; 1447 + } 2097 1448 2098 1449 /* 2099 1450 * ->canceling has to be handled after ->force_abort and ->fail_io ··· 2107 1446 * of recovery, and cause hang when deleting disk 2108 1447 */ 2109 1448 if (unlikely(ubq->canceling)) { 1449 + *should_queue = false; 2110 1450 __ublk_abort_rq(ubq, rq); 2111 1451 return BLK_STS_OK; 2112 1452 } 2113 1453 1454 + *should_queue = true; 1455 + return BLK_STS_OK; 1456 + } 1457 + 1458 + static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx, 1459 + const struct blk_mq_queue_data *bd) 1460 + { 1461 + struct ublk_queue *ubq = hctx->driver_data; 1462 + struct request *rq = bd->rq; 1463 + bool should_queue; 1464 + blk_status_t res; 1465 + 1466 + res = __ublk_queue_rq_common(ubq, rq, &should_queue); 1467 + if (!should_queue) 1468 + return res; 1469 + 2114 1470 ublk_queue_cmd(ubq, rq); 1471 + return BLK_STS_OK; 1472 + } 1473 + 1474 + static blk_status_t ublk_batch_queue_rq(struct blk_mq_hw_ctx *hctx, 1475 + const struct blk_mq_queue_data *bd) 1476 + { 1477 + struct ublk_queue *ubq = hctx->driver_data; 1478 + struct request *rq = bd->rq; 1479 + bool should_queue; 1480 + blk_status_t res; 1481 + 1482 + res = __ublk_queue_rq_common(ubq, rq, &should_queue); 1483 + if (!should_queue) 1484 + return res; 1485 + 1486 + ublk_batch_queue_cmd(ubq, rq, bd->last); 2115 1487 return BLK_STS_OK; 2116 1488 } 2117 1489 ··· 2154 1460 return (io_uring_cmd_ctx_handle(io->cmd) == 2155 1461 io_uring_cmd_ctx_handle(io2->cmd)) && 2156 1462 (io->task == io2->task); 1463 + } 1464 + 1465 + static void ublk_commit_rqs(struct 
blk_mq_hw_ctx *hctx) 1466 + { 1467 + struct ublk_queue *ubq = hctx->driver_data; 1468 + struct ublk_batch_fetch_cmd *fcmd; 1469 + 1470 + spin_lock(&ubq->evts_lock); 1471 + fcmd = __ublk_acquire_fcmd(ubq); 1472 + spin_unlock(&ubq->evts_lock); 1473 + 1474 + if (fcmd) 1475 + io_uring_cmd_complete_in_task(fcmd->cmd, ublk_batch_tw_cb); 2157 1476 } 2158 1477 2159 1478 static void ublk_queue_rqs(struct rq_list *rqlist) ··· 2197 1490 *rqlist = requeue_list; 2198 1491 } 2199 1492 1493 + static void ublk_batch_queue_cmd_list(struct ublk_queue *ubq, struct rq_list *l) 1494 + { 1495 + unsigned short tags[MAX_NR_TAG]; 1496 + struct ublk_batch_fetch_cmd *fcmd; 1497 + struct request *rq; 1498 + unsigned cnt = 0; 1499 + 1500 + spin_lock(&ubq->evts_lock); 1501 + rq_list_for_each(l, rq) { 1502 + tags[cnt++] = (unsigned short)rq->tag; 1503 + if (cnt >= MAX_NR_TAG) { 1504 + kfifo_in(&ubq->evts_fifo, tags, cnt); 1505 + cnt = 0; 1506 + } 1507 + } 1508 + if (cnt) 1509 + kfifo_in(&ubq->evts_fifo, tags, cnt); 1510 + fcmd = __ublk_acquire_fcmd(ubq); 1511 + spin_unlock(&ubq->evts_lock); 1512 + 1513 + rq_list_init(l); 1514 + if (fcmd) 1515 + io_uring_cmd_complete_in_task(fcmd->cmd, ublk_batch_tw_cb); 1516 + } 1517 + 1518 + static void ublk_batch_queue_rqs(struct rq_list *rqlist) 1519 + { 1520 + struct rq_list requeue_list = { }; 1521 + struct rq_list submit_list = { }; 1522 + struct ublk_queue *ubq = NULL; 1523 + struct request *req; 1524 + 1525 + while ((req = rq_list_pop(rqlist))) { 1526 + struct ublk_queue *this_q = req->mq_hctx->driver_data; 1527 + 1528 + if (ublk_prep_req(this_q, req, true) != BLK_STS_OK) { 1529 + rq_list_add_tail(&requeue_list, req); 1530 + continue; 1531 + } 1532 + 1533 + if (ubq && this_q != ubq && !rq_list_empty(&submit_list)) 1534 + ublk_batch_queue_cmd_list(ubq, &submit_list); 1535 + ubq = this_q; 1536 + rq_list_add_tail(&submit_list, req); 1537 + } 1538 + 1539 + if (!rq_list_empty(&submit_list)) 1540 + ublk_batch_queue_cmd_list(ubq, &submit_list); 1541 + *rqlist = 
requeue_list; 1542 + } 1543 + 2200 1544 static int ublk_init_hctx(struct blk_mq_hw_ctx *hctx, void *driver_data, 2201 1545 unsigned int hctx_idx) 2202 1546 { ··· 2265 1507 .timeout = ublk_timeout, 2266 1508 }; 2267 1509 1510 + static const struct blk_mq_ops ublk_batch_mq_ops = { 1511 + .commit_rqs = ublk_commit_rqs, 1512 + .queue_rq = ublk_batch_queue_rq, 1513 + .queue_rqs = ublk_batch_queue_rqs, 1514 + .init_hctx = ublk_init_hctx, 1515 + .timeout = ublk_timeout, 1516 + }; 1517 + 2268 1518 static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq) 2269 1519 { 2270 1520 int i; 1521 + 1522 + ubq->nr_io_ready = 0; 2271 1523 2272 1524 for (i = 0; i < ubq->q_depth; i++) { 2273 1525 struct ublk_io *io = &ubq->ios[i]; ··· 2327 1559 2328 1560 /* set to NULL, otherwise new tasks cannot mmap io_cmd_buf */ 2329 1561 ub->mm = NULL; 2330 - ub->nr_io_ready = 0; 1562 + ub->nr_queue_ready = 0; 2331 1563 ub->unprivileged_daemons = false; 2332 1564 ub->ublksrv_tgid = -1; 2333 1565 } ··· 2581 1813 static void __ublk_fail_req(struct ublk_device *ub, struct ublk_io *io, 2582 1814 struct request *req) 2583 1815 { 2584 - WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE); 1816 + WARN_ON_ONCE(!ublk_dev_support_batch_io(ub) && 1817 + io->flags & UBLK_IO_FLAG_ACTIVE); 2585 1818 2586 1819 if (ublk_nosrv_should_reissue_outstanding(ub)) 2587 1820 blk_mq_requeue_request(req, false); 2588 1821 else { 2589 1822 io->res = -EIO; 2590 - __ublk_complete_rq(req, io, ublk_dev_need_map_io(ub)); 1823 + __ublk_complete_rq(req, io, ublk_dev_need_map_io(ub), NULL); 1824 + } 1825 + } 1826 + 1827 + /* 1828 + * Request tag may just be filled to event kfifo, not get chance to 1829 + * dispatch, abort these requests too 1830 + */ 1831 + static void ublk_abort_batch_queue(struct ublk_device *ub, 1832 + struct ublk_queue *ubq) 1833 + { 1834 + unsigned short tag; 1835 + 1836 + while (kfifo_out(&ubq->evts_fifo, &tag, 1)) { 1837 + struct request *req = blk_mq_tag_to_rq( 1838 + 
ub->tag_set.tags[ubq->q_id], tag); 1839 + 1840 + if (!WARN_ON_ONCE(!req || !blk_mq_request_started(req))) 1841 + __ublk_fail_req(ub, &ubq->ios[tag], req); 2591 1842 } 2592 1843 } 2593 1844 ··· 2628 1841 if (io->flags & UBLK_IO_FLAG_OWNED_BY_SRV) 2629 1842 __ublk_fail_req(ub, io, io->req); 2630 1843 } 1844 + 1845 + if (ublk_support_batch_io(ubq)) 1846 + ublk_abort_batch_queue(ub, ubq); 2631 1847 } 2632 1848 2633 1849 static void ublk_start_cancel(struct ublk_device *ub) ··· 2695 1905 } 2696 1906 2697 1907 /* 1908 + * Cancel a batch fetch command if it hasn't been claimed by another path. 1909 + * 1910 + * An fcmd can only be cancelled if: 1911 + * 1. It's not the active_fcmd (which is currently being processed) 1912 + * 2. It's still on the list (!list_empty check) - once removed from the list, 1913 + * the fcmd is considered claimed and will be freed by whoever removed it 1914 + * 1915 + * Use list_del_init() so subsequent list_empty() checks work correctly. 1916 + */ 1917 + static void ublk_batch_cancel_cmd(struct ublk_queue *ubq, 1918 + struct ublk_batch_fetch_cmd *fcmd, 1919 + unsigned int issue_flags) 1920 + { 1921 + bool done; 1922 + 1923 + spin_lock(&ubq->evts_lock); 1924 + done = (READ_ONCE(ubq->active_fcmd) != fcmd) && !list_empty(&fcmd->node); 1925 + if (done) 1926 + list_del_init(&fcmd->node); 1927 + spin_unlock(&ubq->evts_lock); 1928 + 1929 + if (done) { 1930 + io_uring_cmd_done(fcmd->cmd, UBLK_IO_RES_ABORT, issue_flags); 1931 + ublk_batch_free_fcmd(fcmd); 1932 + } 1933 + } 1934 + 1935 + static void ublk_batch_cancel_queue(struct ublk_queue *ubq) 1936 + { 1937 + struct ublk_batch_fetch_cmd *fcmd; 1938 + LIST_HEAD(fcmd_list); 1939 + 1940 + spin_lock(&ubq->evts_lock); 1941 + ubq->force_abort = true; 1942 + list_splice_init(&ubq->fcmd_head, &fcmd_list); 1943 + fcmd = READ_ONCE(ubq->active_fcmd); 1944 + if (fcmd) 1945 + list_move(&fcmd->node, &ubq->fcmd_head); 1946 + spin_unlock(&ubq->evts_lock); 1947 + 1948 + while (!list_empty(&fcmd_list)) { 1949 + fcmd = 
list_first_entry(&fcmd_list, 1950 + struct ublk_batch_fetch_cmd, node); 1951 + ublk_batch_cancel_cmd(ubq, fcmd, IO_URING_F_UNLOCKED); 1952 + } 1953 + } 1954 + 1955 + static void ublk_batch_cancel_fn(struct io_uring_cmd *cmd, 1956 + unsigned int issue_flags) 1957 + { 1958 + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); 1959 + struct ublk_batch_fetch_cmd *fcmd = pdu->fcmd; 1960 + struct ublk_queue *ubq = pdu->ubq; 1961 + 1962 + ublk_start_cancel(ubq->dev); 1963 + 1964 + ublk_batch_cancel_cmd(ubq, fcmd, issue_flags); 1965 + } 1966 + 1967 + /* 2698 1968 * The ublk char device won't be closed when calling cancel fn, so both 2699 1969 * ublk device and queue are guaranteed to be live 2700 1970 * ··· 2794 1944 ublk_cancel_cmd(ubq, pdu->tag, issue_flags); 2795 1945 } 2796 1946 1947 + static inline bool ublk_queue_ready(const struct ublk_queue *ubq) 1948 + { 1949 + return ubq->nr_io_ready == ubq->q_depth; 1950 + } 1951 + 2797 1952 static inline bool ublk_dev_ready(const struct ublk_device *ub) 2798 1953 { 2799 - u32 total = (u32)ub->dev_info.nr_hw_queues * ub->dev_info.queue_depth; 2800 - 2801 - return ub->nr_io_ready == total; 1954 + return ub->nr_queue_ready == ub->dev_info.nr_hw_queues; 2802 1955 } 2803 1956 2804 1957 static void ublk_cancel_queue(struct ublk_queue *ubq) 2805 1958 { 2806 1959 int i; 1960 + 1961 + if (ublk_support_batch_io(ubq)) { 1962 + ublk_batch_cancel_queue(ubq); 1963 + return; 1964 + } 2807 1965 2808 1966 for (i = 0; i < ubq->q_depth; i++) 2809 1967 ublk_cancel_cmd(ubq, i, IO_URING_F_UNLOCKED); ··· 2910 2052 ublk_cancel_dev(ub); 2911 2053 } 2912 2054 2913 - /* reset ublk io_uring queue & io flags */ 2914 - static void ublk_reset_io_flags(struct ublk_device *ub) 2055 + /* reset per-queue io flags */ 2056 + static void ublk_queue_reset_io_flags(struct ublk_queue *ubq) 2915 2057 { 2916 - int i, j; 2058 + int j; 2917 2059 2918 - for (i = 0; i < ub->dev_info.nr_hw_queues; i++) { 2919 - struct ublk_queue *ubq = ublk_get_queue(ub, i); 2920 - 
2921 - /* UBLK_IO_FLAG_CANCELED can be cleared now */ 2922 - spin_lock(&ubq->cancel_lock); 2923 - for (j = 0; j < ubq->q_depth; j++) 2924 - ubq->ios[j].flags &= ~UBLK_IO_FLAG_CANCELED; 2925 - spin_unlock(&ubq->cancel_lock); 2926 - ubq->fail_io = false; 2927 - } 2928 - mutex_lock(&ub->cancel_mutex); 2929 - ublk_set_canceling(ub, false); 2930 - mutex_unlock(&ub->cancel_mutex); 2060 + /* UBLK_IO_FLAG_CANCELED can be cleared now */ 2061 + spin_lock(&ubq->cancel_lock); 2062 + for (j = 0; j < ubq->q_depth; j++) 2063 + ubq->ios[j].flags &= ~UBLK_IO_FLAG_CANCELED; 2064 + ubq->canceling = false; 2065 + spin_unlock(&ubq->cancel_lock); 2066 + ubq->fail_io = false; 2931 2067 } 2932 2068 2933 2069 /* device can only be started after all IOs are ready */ 2934 - static void ublk_mark_io_ready(struct ublk_device *ub) 2070 + static void ublk_mark_io_ready(struct ublk_device *ub, u16 q_id) 2935 2071 __must_hold(&ub->mutex) 2936 2072 { 2073 + struct ublk_queue *ubq = ublk_get_queue(ub, q_id); 2074 + 2937 2075 if (!ub->unprivileged_daemons && !capable(CAP_SYS_ADMIN)) 2938 2076 ub->unprivileged_daemons = true; 2939 2077 2940 - ub->nr_io_ready++; 2078 + ubq->nr_io_ready++; 2079 + 2080 + /* Check if this specific queue is now fully ready */ 2081 + if (ublk_queue_ready(ubq)) { 2082 + ub->nr_queue_ready++; 2083 + 2084 + /* 2085 + * Reset queue flags as soon as this queue is ready. 2086 + * This clears the canceling flag, allowing batch FETCH commands 2087 + * to succeed during recovery without waiting for all queues. 2088 + */ 2089 + ublk_queue_reset_io_flags(ubq); 2090 + } 2091 + 2092 + /* Check if all queues are ready */ 2941 2093 if (ublk_dev_ready(ub)) { 2942 - /* now we are ready for handling ublk io request */ 2943 - ublk_reset_io_flags(ub); 2094 + /* 2095 + * All queues ready - clear device-level canceling flag 2096 + * and complete the recovery/initialization. 
2097 + */ 2098 + mutex_lock(&ub->cancel_mutex); 2099 + ub->canceling = false; 2100 + mutex_unlock(&ub->cancel_mutex); 2944 2101 complete_all(&ub->completion); 2945 2102 } 2946 2103 } ··· 2988 2115 return 0; 2989 2116 } 2990 2117 2991 - static int ublk_handle_auto_buf_reg(struct ublk_io *io, 2118 + static void ublk_clear_auto_buf_reg(struct ublk_io *io, 2992 2119 struct io_uring_cmd *cmd, 2993 2120 u16 *buf_idx) 2994 2121 { ··· 3008 2135 if (io->buf_ctx_handle == io_uring_cmd_ctx_handle(cmd)) 3009 2136 *buf_idx = io->buf.auto_reg.index; 3010 2137 } 2138 + } 3011 2139 2140 + static int ublk_handle_auto_buf_reg(struct ublk_io *io, 2141 + struct io_uring_cmd *cmd, 2142 + u16 *buf_idx) 2143 + { 2144 + ublk_clear_auto_buf_reg(io, cmd, buf_idx); 3012 2145 return ublk_set_auto_buf_reg(io, cmd); 3013 2146 } 3014 2147 ··· 3087 2208 if (!ublk_dev_support_zero_copy(ub)) 3088 2209 return -EINVAL; 3089 2210 3090 - req = __ublk_check_and_get_req(ub, q_id, tag, io, 0); 2211 + req = __ublk_check_and_get_req(ub, q_id, tag, io); 3091 2212 if (!req) 3092 2213 return -EINVAL; 3093 2214 ··· 3159 2280 } 3160 2281 3161 2282 static int __ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub, 3162 - struct ublk_io *io) 2283 + struct ublk_io *io, u16 q_id) 3163 2284 { 3164 2285 /* UBLK_IO_FETCH_REQ is only allowed before dev is setup */ 3165 2286 if (ublk_dev_ready(ub)) ··· 3173 2294 3174 2295 ublk_fill_io_cmd(io, cmd); 3175 2296 3176 - WRITE_ONCE(io->task, get_task_struct(current)); 3177 - ublk_mark_io_ready(ub); 2297 + if (ublk_dev_support_batch_io(ub)) 2298 + WRITE_ONCE(io->task, NULL); 2299 + else 2300 + WRITE_ONCE(io->task, get_task_struct(current)); 3178 2301 3179 2302 return 0; 3180 2303 } 3181 2304 3182 2305 static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub, 3183 - struct ublk_io *io, __u64 buf_addr) 2306 + struct ublk_io *io, __u64 buf_addr, u16 q_id) 3184 2307 { 3185 2308 int ret; 3186 2309 ··· 3192 2311 * FETCH, so it is fine even for 
IO_URING_F_NONBLOCK. 3193 2312 */ 3194 2313 mutex_lock(&ub->mutex); 3195 - ret = __ublk_fetch(cmd, ub, io); 2314 + ret = __ublk_fetch(cmd, ub, io, q_id); 3196 2315 if (!ret) 3197 2316 ret = ublk_config_io_buf(ub, io, cmd, buf_addr, NULL); 2317 + if (!ret) 2318 + ublk_mark_io_ready(ub, q_id); 3198 2319 mutex_unlock(&ub->mutex); 3199 2320 return ret; 3200 2321 } ··· 3300 2417 ret = ublk_check_fetch_buf(ub, addr); 3301 2418 if (ret) 3302 2419 goto out; 3303 - ret = ublk_fetch(cmd, ub, io, addr); 2420 + ret = ublk_fetch(cmd, ub, io, addr, q_id); 3304 2421 if (ret) 3305 2422 goto out; 3306 2423 ··· 3345 2462 io->res = result; 3346 2463 req = ublk_fill_io_cmd(io, cmd); 3347 2464 ret = ublk_config_io_buf(ub, io, cmd, addr, &buf_idx); 3348 - compl = ublk_need_complete_req(ub, io); 3349 - 3350 - /* can't touch 'ublk_io' any more */ 3351 2465 if (buf_idx != UBLK_INVALID_BUF_IDX) 3352 2466 io_buffer_unregister_bvec(cmd, buf_idx, issue_flags); 2467 + compl = ublk_need_complete_req(ub, io); 2468 + 3353 2469 if (req_op(req) == REQ_OP_ZONE_APPEND) 3354 2470 req->__sector = addr; 3355 2471 if (compl) 3356 - __ublk_complete_rq(req, io, ublk_dev_need_map_io(ub)); 2472 + __ublk_complete_rq(req, io, ublk_dev_need_map_io(ub), NULL); 3357 2473 3358 2474 if (ret) 3359 2475 goto out; ··· 3384 2502 } 3385 2503 3386 2504 static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub, 3387 - u16 q_id, u16 tag, struct ublk_io *io, size_t offset) 2505 + u16 q_id, u16 tag, struct ublk_io *io) 3388 2506 { 3389 2507 struct request *req; 3390 2508 ··· 3403 2521 goto fail_put; 3404 2522 3405 2523 if (!ublk_rq_has_data(req)) 3406 - goto fail_put; 3407 - 3408 - if (offset > blk_rq_bytes(req)) 3409 2524 goto fail_put; 3410 2525 3411 2526 return req; ··· 3437 2558 return ublk_ch_uring_cmd_local(cmd, issue_flags); 3438 2559 } 3439 2560 2561 + static inline __u64 ublk_batch_buf_addr(const struct ublk_batch_io *uc, 2562 + const struct ublk_elem_header *elem) 2563 + { 2564 + const void *buf 
= elem; 2565 + 2566 + if (uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR) 2567 + return *(const __u64 *)(buf + sizeof(*elem)); 2568 + return 0; 2569 + } 2570 + 2571 + static inline __u64 ublk_batch_zone_lba(const struct ublk_batch_io *uc, 2572 + const struct ublk_elem_header *elem) 2573 + { 2574 + const void *buf = elem; 2575 + 2576 + if (uc->flags & UBLK_BATCH_F_HAS_ZONE_LBA) 2577 + return *(const __u64 *)(buf + sizeof(*elem) + 2578 + 8 * !!(uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR)); 2579 + return -1; 2580 + } 2581 + 2582 + static struct ublk_auto_buf_reg 2583 + ublk_batch_auto_buf_reg(const struct ublk_batch_io *uc, 2584 + const struct ublk_elem_header *elem) 2585 + { 2586 + struct ublk_auto_buf_reg reg = { 2587 + .index = elem->buf_index, 2588 + .flags = (uc->flags & UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK) ? 2589 + UBLK_AUTO_BUF_REG_FALLBACK : 0, 2590 + }; 2591 + 2592 + return reg; 2593 + } 2594 + 2595 + /* 2596 + * 48 can hold any type of buffer element(8, 16 and 24 bytes) because 2597 + * it is the least common multiple(LCM) of 8, 16 and 24 2598 + */ 2599 + #define UBLK_CMD_BATCH_TMP_BUF_SZ (48 * 10) 2600 + struct ublk_batch_io_iter { 2601 + void __user *uaddr; 2602 + unsigned done, total; 2603 + unsigned char elem_bytes; 2604 + /* copy to this buffer from user space */ 2605 + unsigned char buf[UBLK_CMD_BATCH_TMP_BUF_SZ]; 2606 + }; 2607 + 2608 + static inline int 2609 + __ublk_walk_cmd_buf(struct ublk_queue *ubq, 2610 + struct ublk_batch_io_iter *iter, 2611 + const struct ublk_batch_io_data *data, 2612 + unsigned bytes, 2613 + int (*cb)(struct ublk_queue *q, 2614 + const struct ublk_batch_io_data *data, 2615 + const struct ublk_elem_header *elem)) 2616 + { 2617 + unsigned int i; 2618 + int ret = 0; 2619 + 2620 + for (i = 0; i < bytes; i += iter->elem_bytes) { 2621 + const struct ublk_elem_header *elem = 2622 + (const struct ublk_elem_header *)&iter->buf[i]; 2623 + 2624 + if (unlikely(elem->tag >= data->ub->dev_info.queue_depth)) { 2625 + ret = -EINVAL; 2626 + break; 2627 + } 
2628 + 2629 + ret = cb(ubq, data, elem); 2630 + if (unlikely(ret)) 2631 + break; 2632 + } 2633 + 2634 + iter->done += i; 2635 + return ret; 2636 + } 2637 + 2638 + static int ublk_walk_cmd_buf(struct ublk_batch_io_iter *iter, 2639 + const struct ublk_batch_io_data *data, 2640 + int (*cb)(struct ublk_queue *q, 2641 + const struct ublk_batch_io_data *data, 2642 + const struct ublk_elem_header *elem)) 2643 + { 2644 + struct ublk_queue *ubq = ublk_get_queue(data->ub, data->header.q_id); 2645 + int ret = 0; 2646 + 2647 + while (iter->done < iter->total) { 2648 + unsigned int len = min(sizeof(iter->buf), iter->total - iter->done); 2649 + 2650 + if (copy_from_user(iter->buf, iter->uaddr + iter->done, len)) { 2651 + pr_warn("ublk%d: read batch cmd buffer failed\n", 2652 + data->ub->dev_info.dev_id); 2653 + return -EFAULT; 2654 + } 2655 + 2656 + ret = __ublk_walk_cmd_buf(ubq, iter, data, len, cb); 2657 + if (ret) 2658 + return ret; 2659 + } 2660 + return 0; 2661 + } 2662 + 2663 + static int ublk_batch_unprep_io(struct ublk_queue *ubq, 2664 + const struct ublk_batch_io_data *data, 2665 + const struct ublk_elem_header *elem) 2666 + { 2667 + struct ublk_io *io = &ubq->ios[elem->tag]; 2668 + 2669 + /* 2670 + * If queue was ready before this decrement, it won't be anymore, 2671 + * so we need to decrement the queue ready count and restore the 2672 + * canceling flag to prevent new requests from being queued. 
2673 + */ 2674 + if (ublk_queue_ready(ubq)) { 2675 + data->ub->nr_queue_ready--; 2676 + spin_lock(&ubq->cancel_lock); 2677 + ubq->canceling = true; 2678 + spin_unlock(&ubq->cancel_lock); 2679 + } 2680 + ubq->nr_io_ready--; 2681 + 2682 + ublk_io_lock(io); 2683 + io->flags = 0; 2684 + ublk_io_unlock(io); 2685 + return 0; 2686 + } 2687 + 2688 + static void ublk_batch_revert_prep_cmd(struct ublk_batch_io_iter *iter, 2689 + const struct ublk_batch_io_data *data) 2690 + { 2691 + int ret; 2692 + 2693 + /* Re-process only what we've already processed, starting from beginning */ 2694 + iter->total = iter->done; 2695 + iter->done = 0; 2696 + 2697 + ret = ublk_walk_cmd_buf(iter, data, ublk_batch_unprep_io); 2698 + WARN_ON_ONCE(ret); 2699 + } 2700 + 2701 + static int ublk_batch_prep_io(struct ublk_queue *ubq, 2702 + const struct ublk_batch_io_data *data, 2703 + const struct ublk_elem_header *elem) 2704 + { 2705 + struct ublk_io *io = &ubq->ios[elem->tag]; 2706 + const struct ublk_batch_io *uc = &data->header; 2707 + union ublk_io_buf buf = { 0 }; 2708 + int ret; 2709 + 2710 + if (ublk_dev_support_auto_buf_reg(data->ub)) 2711 + buf.auto_reg = ublk_batch_auto_buf_reg(uc, elem); 2712 + else if (ublk_dev_need_map_io(data->ub)) { 2713 + buf.addr = ublk_batch_buf_addr(uc, elem); 2714 + 2715 + ret = ublk_check_fetch_buf(data->ub, buf.addr); 2716 + if (ret) 2717 + return ret; 2718 + } 2719 + 2720 + ublk_io_lock(io); 2721 + ret = __ublk_fetch(data->cmd, data->ub, io, ubq->q_id); 2722 + if (!ret) 2723 + io->buf = buf; 2724 + ublk_io_unlock(io); 2725 + 2726 + if (!ret) 2727 + ublk_mark_io_ready(data->ub, ubq->q_id); 2728 + 2729 + return ret; 2730 + } 2731 + 2732 + static int ublk_handle_batch_prep_cmd(const struct ublk_batch_io_data *data) 2733 + { 2734 + const struct ublk_batch_io *uc = &data->header; 2735 + struct io_uring_cmd *cmd = data->cmd; 2736 + struct ublk_batch_io_iter iter = { 2737 + .uaddr = u64_to_user_ptr(READ_ONCE(cmd->sqe->addr)), 2738 + .total = uc->nr_elem * 
uc->elem_bytes, 2739 + .elem_bytes = uc->elem_bytes, 2740 + }; 2741 + int ret; 2742 + 2743 + mutex_lock(&data->ub->mutex); 2744 + ret = ublk_walk_cmd_buf(&iter, data, ublk_batch_prep_io); 2745 + 2746 + if (ret && iter.done) 2747 + ublk_batch_revert_prep_cmd(&iter, data); 2748 + mutex_unlock(&data->ub->mutex); 2749 + return ret; 2750 + } 2751 + 2752 + static int ublk_batch_commit_io_check(const struct ublk_queue *ubq, 2753 + struct ublk_io *io, 2754 + union ublk_io_buf *buf) 2755 + { 2756 + if (!(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV)) 2757 + return -EBUSY; 2758 + 2759 + /* BATCH_IO doesn't support UBLK_F_NEED_GET_DATA */ 2760 + if (ublk_need_map_io(ubq) && !buf->addr) 2761 + return -EINVAL; 2762 + return 0; 2763 + } 2764 + 2765 + static int ublk_batch_commit_io(struct ublk_queue *ubq, 2766 + const struct ublk_batch_io_data *data, 2767 + const struct ublk_elem_header *elem) 2768 + { 2769 + struct ublk_io *io = &ubq->ios[elem->tag]; 2770 + const struct ublk_batch_io *uc = &data->header; 2771 + u16 buf_idx = UBLK_INVALID_BUF_IDX; 2772 + union ublk_io_buf buf = { 0 }; 2773 + struct request *req = NULL; 2774 + bool auto_reg = false; 2775 + bool compl = false; 2776 + int ret; 2777 + 2778 + if (ublk_dev_support_auto_buf_reg(data->ub)) { 2779 + buf.auto_reg = ublk_batch_auto_buf_reg(uc, elem); 2780 + auto_reg = true; 2781 + } else if (ublk_dev_need_map_io(data->ub)) 2782 + buf.addr = ublk_batch_buf_addr(uc, elem); 2783 + 2784 + ublk_io_lock(io); 2785 + ret = ublk_batch_commit_io_check(ubq, io, &buf); 2786 + if (!ret) { 2787 + io->res = elem->result; 2788 + io->buf = buf; 2789 + req = ublk_fill_io_cmd(io, data->cmd); 2790 + 2791 + if (auto_reg) 2792 + ublk_clear_auto_buf_reg(io, data->cmd, &buf_idx); 2793 + compl = ublk_need_complete_req(data->ub, io); 2794 + } 2795 + ublk_io_unlock(io); 2796 + 2797 + if (unlikely(ret)) { 2798 + pr_warn_ratelimited("%s: dev %u queue %u io %u: commit failure %d\n", 2799 + __func__, data->ub->dev_info.dev_id, ubq->q_id, 2800 + elem->tag, 
ret); 2801 + return ret; 2802 + } 2803 + 2804 + if (buf_idx != UBLK_INVALID_BUF_IDX) 2805 + io_buffer_unregister_bvec(data->cmd, buf_idx, data->issue_flags); 2806 + if (req_op(req) == REQ_OP_ZONE_APPEND) 2807 + req->__sector = ublk_batch_zone_lba(uc, elem); 2808 + if (compl) 2809 + __ublk_complete_rq(req, io, ublk_dev_need_map_io(data->ub), data->iob); 2810 + return 0; 2811 + } 2812 + 2813 + static int ublk_handle_batch_commit_cmd(struct ublk_batch_io_data *data) 2814 + { 2815 + const struct ublk_batch_io *uc = &data->header; 2816 + struct io_uring_cmd *cmd = data->cmd; 2817 + struct ublk_batch_io_iter iter = { 2818 + .uaddr = u64_to_user_ptr(READ_ONCE(cmd->sqe->addr)), 2819 + .total = uc->nr_elem * uc->elem_bytes, 2820 + .elem_bytes = uc->elem_bytes, 2821 + }; 2822 + DEFINE_IO_COMP_BATCH(iob); 2823 + int ret; 2824 + 2825 + data->iob = &iob; 2826 + ret = ublk_walk_cmd_buf(&iter, data, ublk_batch_commit_io); 2827 + 2828 + if (iob.complete) 2829 + iob.complete(&iob); 2830 + 2831 + return iter.done == 0 ? ret : iter.done; 2832 + } 2833 + 2834 + static int ublk_check_batch_cmd_flags(const struct ublk_batch_io *uc) 2835 + { 2836 + unsigned elem_bytes = sizeof(struct ublk_elem_header); 2837 + 2838 + if (uc->flags & ~UBLK_BATCH_F_ALL) 2839 + return -EINVAL; 2840 + 2841 + /* UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK requires buffer index */ 2842 + if ((uc->flags & UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK) && 2843 + (uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR)) 2844 + return -EINVAL; 2845 + 2846 + elem_bytes += (uc->flags & UBLK_BATCH_F_HAS_ZONE_LBA ? sizeof(u64) : 0) + 2847 + (uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR ? 
sizeof(u64) : 0); 2848 + if (uc->elem_bytes != elem_bytes) 2849 + return -EINVAL; 2850 + return 0; 2851 + } 2852 + 2853 + static int ublk_check_batch_cmd(const struct ublk_batch_io_data *data) 2854 + { 2855 + const struct ublk_batch_io *uc = &data->header; 2856 + 2857 + if (uc->q_id >= data->ub->dev_info.nr_hw_queues) 2858 + return -EINVAL; 2859 + 2860 + if (uc->nr_elem > data->ub->dev_info.queue_depth) 2861 + return -E2BIG; 2862 + 2863 + if ((uc->flags & UBLK_BATCH_F_HAS_ZONE_LBA) && 2864 + !ublk_dev_is_zoned(data->ub)) 2865 + return -EINVAL; 2866 + 2867 + if ((uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR) && 2868 + !ublk_dev_need_map_io(data->ub)) 2869 + return -EINVAL; 2870 + 2871 + if ((uc->flags & UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK) && 2872 + !ublk_dev_support_auto_buf_reg(data->ub)) 2873 + return -EINVAL; 2874 + 2875 + return ublk_check_batch_cmd_flags(uc); 2876 + } 2877 + 2878 + static int ublk_batch_attach(struct ublk_queue *ubq, 2879 + struct ublk_batch_io_data *data, 2880 + struct ublk_batch_fetch_cmd *fcmd) 2881 + { 2882 + struct ublk_batch_fetch_cmd *new_fcmd = NULL; 2883 + bool free = false; 2884 + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(data->cmd); 2885 + 2886 + spin_lock(&ubq->evts_lock); 2887 + if (unlikely(ubq->force_abort || ubq->canceling)) { 2888 + free = true; 2889 + } else { 2890 + list_add_tail(&fcmd->node, &ubq->fcmd_head); 2891 + new_fcmd = __ublk_acquire_fcmd(ubq); 2892 + } 2893 + spin_unlock(&ubq->evts_lock); 2894 + 2895 + if (unlikely(free)) { 2896 + ublk_batch_free_fcmd(fcmd); 2897 + return -ENODEV; 2898 + } 2899 + 2900 + pdu->ubq = ubq; 2901 + pdu->fcmd = fcmd; 2902 + io_uring_cmd_mark_cancelable(fcmd->cmd, data->issue_flags); 2903 + 2904 + if (!new_fcmd) 2905 + goto out; 2906 + 2907 + /* 2908 + * If the two fetch commands are originated from same io_ring_ctx, 2909 + * run batch dispatch directly. Otherwise, schedule task work for 2910 + * doing it. 
2911 + */ 2912 + if (io_uring_cmd_ctx_handle(new_fcmd->cmd) == 2913 + io_uring_cmd_ctx_handle(fcmd->cmd)) { 2914 + data->cmd = new_fcmd->cmd; 2915 + ublk_batch_dispatch(ubq, data, new_fcmd); 2916 + } else { 2917 + io_uring_cmd_complete_in_task(new_fcmd->cmd, 2918 + ublk_batch_tw_cb); 2919 + } 2920 + out: 2921 + return -EIOCBQUEUED; 2922 + } 2923 + 2924 + static int ublk_handle_batch_fetch_cmd(struct ublk_batch_io_data *data) 2925 + { 2926 + struct ublk_queue *ubq = ublk_get_queue(data->ub, data->header.q_id); 2927 + struct ublk_batch_fetch_cmd *fcmd = ublk_batch_alloc_fcmd(data->cmd); 2928 + 2929 + if (!fcmd) 2930 + return -ENOMEM; 2931 + 2932 + return ublk_batch_attach(ubq, data, fcmd); 2933 + } 2934 + 2935 + static int ublk_validate_batch_fetch_cmd(struct ublk_batch_io_data *data) 2936 + { 2937 + const struct ublk_batch_io *uc = &data->header; 2938 + 2939 + if (uc->q_id >= data->ub->dev_info.nr_hw_queues) 2940 + return -EINVAL; 2941 + 2942 + if (!(data->cmd->flags & IORING_URING_CMD_MULTISHOT)) 2943 + return -EINVAL; 2944 + 2945 + if (uc->elem_bytes != sizeof(__u16)) 2946 + return -EINVAL; 2947 + 2948 + if (uc->flags != 0) 2949 + return -EINVAL; 2950 + 2951 + return 0; 2952 + } 2953 + 2954 + static int ublk_handle_non_batch_cmd(struct io_uring_cmd *cmd, 2955 + unsigned int issue_flags) 2956 + { 2957 + const struct ublksrv_io_cmd *ub_cmd = io_uring_sqe_cmd(cmd->sqe); 2958 + struct ublk_device *ub = cmd->file->private_data; 2959 + unsigned tag = READ_ONCE(ub_cmd->tag); 2960 + unsigned q_id = READ_ONCE(ub_cmd->q_id); 2961 + unsigned index = READ_ONCE(ub_cmd->addr); 2962 + struct ublk_queue *ubq; 2963 + struct ublk_io *io; 2964 + 2965 + if (cmd->cmd_op == UBLK_U_IO_UNREGISTER_IO_BUF) 2966 + return ublk_unregister_io_buf(cmd, ub, index, issue_flags); 2967 + 2968 + if (q_id >= ub->dev_info.nr_hw_queues) 2969 + return -EINVAL; 2970 + 2971 + if (tag >= ub->dev_info.queue_depth) 2972 + return -EINVAL; 2973 + 2974 + if (cmd->cmd_op != UBLK_U_IO_REGISTER_IO_BUF) 2975 + 
return -EOPNOTSUPP; 2976 + 2977 + ubq = ublk_get_queue(ub, q_id); 2978 + io = &ubq->ios[tag]; 2979 + return ublk_register_io_buf(cmd, ub, q_id, tag, io, index, 2980 + issue_flags); 2981 + } 2982 + 2983 + static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd, 2984 + unsigned int issue_flags) 2985 + { 2986 + const struct ublk_batch_io *uc = io_uring_sqe_cmd(cmd->sqe); 2987 + struct ublk_device *ub = cmd->file->private_data; 2988 + struct ublk_batch_io_data data = { 2989 + .ub = ub, 2990 + .cmd = cmd, 2991 + .header = (struct ublk_batch_io) { 2992 + .q_id = READ_ONCE(uc->q_id), 2993 + .flags = READ_ONCE(uc->flags), 2994 + .nr_elem = READ_ONCE(uc->nr_elem), 2995 + .elem_bytes = READ_ONCE(uc->elem_bytes), 2996 + }, 2997 + .issue_flags = issue_flags, 2998 + }; 2999 + u32 cmd_op = cmd->cmd_op; 3000 + int ret = -EINVAL; 3001 + 3002 + if (unlikely(issue_flags & IO_URING_F_CANCEL)) { 3003 + ublk_batch_cancel_fn(cmd, issue_flags); 3004 + return 0; 3005 + } 3006 + 3007 + switch (cmd_op) { 3008 + case UBLK_U_IO_PREP_IO_CMDS: 3009 + ret = ublk_check_batch_cmd(&data); 3010 + if (ret) 3011 + goto out; 3012 + ret = ublk_handle_batch_prep_cmd(&data); 3013 + break; 3014 + case UBLK_U_IO_COMMIT_IO_CMDS: 3015 + ret = ublk_check_batch_cmd(&data); 3016 + if (ret) 3017 + goto out; 3018 + ret = ublk_handle_batch_commit_cmd(&data); 3019 + break; 3020 + case UBLK_U_IO_FETCH_IO_CMDS: 3021 + ret = ublk_validate_batch_fetch_cmd(&data); 3022 + if (ret) 3023 + goto out; 3024 + ret = ublk_handle_batch_fetch_cmd(&data); 3025 + break; 3026 + default: 3027 + ret = ublk_handle_non_batch_cmd(cmd, issue_flags); 3028 + break; 3029 + } 3030 + out: 3031 + return ret; 3032 + } 3033 + 3440 3034 static inline bool ublk_check_ubuf_dir(const struct request *req, 3441 3035 int ubuf_dir) 3442 3036 { ··· 3927 2575 return false; 3928 2576 } 3929 2577 3930 - static struct request *ublk_check_and_get_req(struct kiocb *iocb, 3931 - struct iov_iter *iter, size_t *off, int dir, 3932 - struct ublk_io **io) 2578 
+ static ssize_t 2579 + ublk_user_copy(struct kiocb *iocb, struct iov_iter *iter, int dir) 3933 2580 { 3934 2581 struct ublk_device *ub = iocb->ki_filp->private_data; 3935 2582 struct ublk_queue *ubq; 3936 2583 struct request *req; 2584 + struct ublk_io *io; 2585 + unsigned data_len; 2586 + bool is_integrity; 2587 + bool on_daemon; 3937 2588 size_t buf_off; 3938 2589 u16 tag, q_id; 2590 + ssize_t ret; 3939 2591 3940 2592 if (!user_backed_iter(iter)) 3941 - return ERR_PTR(-EACCES); 2593 + return -EACCES; 3942 2594 3943 2595 if (ub->dev_info.state == UBLK_S_DEV_DEAD) 3944 - return ERR_PTR(-EACCES); 2596 + return -EACCES; 3945 2597 3946 2598 tag = ublk_pos_to_tag(iocb->ki_pos); 3947 2599 q_id = ublk_pos_to_hwq(iocb->ki_pos); 3948 2600 buf_off = ublk_pos_to_buf_off(iocb->ki_pos); 2601 + is_integrity = !!(iocb->ki_pos & UBLKSRV_IO_INTEGRITY_FLAG); 2602 + 2603 + if (unlikely(!ublk_dev_support_integrity(ub) && is_integrity)) 2604 + return -EINVAL; 3949 2605 3950 2606 if (q_id >= ub->dev_info.nr_hw_queues) 3951 - return ERR_PTR(-EINVAL); 2607 + return -EINVAL; 3952 2608 3953 2609 ubq = ublk_get_queue(ub, q_id); 3954 2610 if (!ublk_dev_support_user_copy(ub)) 3955 - return ERR_PTR(-EACCES); 2611 + return -EACCES; 3956 2612 3957 2613 if (tag >= ub->dev_info.queue_depth) 3958 - return ERR_PTR(-EINVAL); 2614 + return -EINVAL; 3959 2615 3960 - *io = &ubq->ios[tag]; 3961 - req = __ublk_check_and_get_req(ub, q_id, tag, *io, buf_off); 3962 - if (!req) 3963 - return ERR_PTR(-EINVAL); 2616 + io = &ubq->ios[tag]; 2617 + on_daemon = current == READ_ONCE(io->task); 2618 + if (on_daemon) { 2619 + /* On daemon, io can't be completed concurrently, so skip ref */ 2620 + if (!(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV)) 2621 + return -EINVAL; 3964 2622 3965 - if (!ublk_check_ubuf_dir(req, dir)) 3966 - goto fail; 2623 + req = io->req; 2624 + if (!ublk_rq_has_data(req)) 2625 + return -EINVAL; 2626 + } else { 2627 + req = __ublk_check_and_get_req(ub, q_id, tag, io); 2628 + if (!req) 2629 + return 
-EINVAL; 2630 + } 3967 2631 3968 - *off = buf_off; 3969 - return req; 3970 - fail: 3971 - ublk_put_req_ref(*io, req); 3972 - return ERR_PTR(-EACCES); 2632 + if (is_integrity) { 2633 + struct blk_integrity *bi = &req->q->limits.integrity; 2634 + 2635 + data_len = bio_integrity_bytes(bi, blk_rq_sectors(req)); 2636 + } else { 2637 + data_len = blk_rq_bytes(req); 2638 + } 2639 + if (buf_off > data_len) { 2640 + ret = -EINVAL; 2641 + goto out; 2642 + } 2643 + 2644 + if (!ublk_check_ubuf_dir(req, dir)) { 2645 + ret = -EACCES; 2646 + goto out; 2647 + } 2648 + 2649 + if (is_integrity) 2650 + ret = ublk_copy_user_integrity(req, buf_off, iter, dir); 2651 + else 2652 + ret = ublk_copy_user_pages(req, buf_off, iter, dir); 2653 + 2654 + out: 2655 + if (!on_daemon) 2656 + ublk_put_req_ref(io, req); 2657 + return ret; 3973 2658 } 3974 2659 3975 2660 static ssize_t ublk_ch_read_iter(struct kiocb *iocb, struct iov_iter *to) 3976 2661 { 3977 - struct request *req; 3978 - struct ublk_io *io; 3979 - size_t buf_off; 3980 - size_t ret; 3981 - 3982 - req = ublk_check_and_get_req(iocb, to, &buf_off, ITER_DEST, &io); 3983 - if (IS_ERR(req)) 3984 - return PTR_ERR(req); 3985 - 3986 - ret = ublk_copy_user_pages(req, buf_off, to, ITER_DEST); 3987 - ublk_put_req_ref(io, req); 3988 - 3989 - return ret; 2662 + return ublk_user_copy(iocb, to, ITER_DEST); 3990 2663 } 3991 2664 3992 2665 static ssize_t ublk_ch_write_iter(struct kiocb *iocb, struct iov_iter *from) 3993 2666 { 3994 - struct request *req; 3995 - struct ublk_io *io; 3996 - size_t buf_off; 3997 - size_t ret; 3998 - 3999 - req = ublk_check_and_get_req(iocb, from, &buf_off, ITER_SOURCE, &io); 4000 - if (IS_ERR(req)) 4001 - return PTR_ERR(req); 4002 - 4003 - ret = ublk_copy_user_pages(req, buf_off, from, ITER_SOURCE); 4004 - ublk_put_req_ref(io, req); 4005 - 4006 - return ret; 2667 + return ublk_user_copy(iocb, from, ITER_SOURCE); 4007 2668 } 4008 2669 4009 2670 static const struct file_operations ublk_ch_fops = { ··· 4029 2664 .mmap = 
ublk_ch_mmap, 4030 2665 }; 4031 2666 4032 - static void ublk_deinit_queue(struct ublk_device *ub, int q_id) 4033 - { 4034 - struct ublk_queue *ubq = ub->queues[q_id]; 4035 - int size, i; 2667 + static const struct file_operations ublk_ch_batch_io_fops = { 2668 + .owner = THIS_MODULE, 2669 + .open = ublk_ch_open, 2670 + .release = ublk_ch_release, 2671 + .read_iter = ublk_ch_read_iter, 2672 + .write_iter = ublk_ch_write_iter, 2673 + .uring_cmd = ublk_ch_batch_io_uring_cmd, 2674 + .mmap = ublk_ch_mmap, 2675 + }; 4036 2676 4037 - if (!ubq) 4038 - return; 2677 + static void __ublk_deinit_queue(struct ublk_device *ub, struct ublk_queue *ubq) 2678 + { 2679 + int size, i; 4039 2680 4040 2681 size = ublk_queue_cmd_buf_size(ub); 4041 2682 ··· 4056 2685 if (ubq->io_cmd_buf) 4057 2686 free_pages((unsigned long)ubq->io_cmd_buf, get_order(size)); 4058 2687 2688 + if (ublk_dev_support_batch_io(ub)) 2689 + ublk_io_evts_deinit(ubq); 2690 + 4059 2691 kvfree(ubq); 2692 + } 2693 + 2694 + static void ublk_deinit_queue(struct ublk_device *ub, int q_id) 2695 + { 2696 + struct ublk_queue *ubq = ub->queues[q_id]; 2697 + 2698 + if (!ubq) 2699 + return; 2700 + 2701 + __ublk_deinit_queue(ub, ubq); 4060 2702 ub->queues[q_id] = NULL; 4061 2703 } 4062 2704 ··· 4093 2709 struct ublk_queue *ubq; 4094 2710 struct page *page; 4095 2711 int numa_node; 4096 - int size; 2712 + int size, i, ret; 4097 2713 4098 2714 /* Determine NUMA node based on queue's CPU affinity */ 4099 2715 numa_node = ublk_get_queue_numa_node(ub, q_id); ··· 4118 2734 } 4119 2735 ubq->io_cmd_buf = page_address(page); 4120 2736 2737 + for (i = 0; i < ubq->q_depth; i++) 2738 + spin_lock_init(&ubq->ios[i].lock); 2739 + 2740 + if (ublk_dev_support_batch_io(ub)) { 2741 + ret = ublk_io_evts_init(ubq, ubq->q_depth, numa_node); 2742 + if (ret) 2743 + goto fail; 2744 + INIT_LIST_HEAD(&ubq->fcmd_head); 2745 + } 4121 2746 ub->queues[q_id] = ubq; 4122 2747 ubq->dev = ub; 2748 + 4123 2749 return 0; 2750 + fail: 2751 + __ublk_deinit_queue(ub, 
ubq); 2752 + return ret; 4124 2753 } 4125 2754 4126 2755 static void ublk_deinit_queues(struct ublk_device *ub) ··· 4221 2824 if (ret) 4222 2825 goto fail; 4223 2826 4224 - cdev_init(&ub->cdev, &ublk_ch_fops); 2827 + if (ublk_dev_support_batch_io(ub)) 2828 + cdev_init(&ub->cdev, &ublk_ch_batch_io_fops); 2829 + else 2830 + cdev_init(&ub->cdev, &ublk_ch_fops); 4225 2831 ret = cdev_device_add(&ub->cdev, dev); 4226 2832 if (ret) 4227 2833 goto fail; ··· 4248 2848 4249 2849 static int ublk_add_tag_set(struct ublk_device *ub) 4250 2850 { 4251 - ub->tag_set.ops = &ublk_mq_ops; 2851 + if (ublk_dev_support_batch_io(ub)) 2852 + ub->tag_set.ops = &ublk_batch_mq_ops; 2853 + else 2854 + ub->tag_set.ops = &ublk_mq_ops; 4252 2855 ub->tag_set.nr_hw_queues = ub->dev_info.nr_hw_queues; 4253 2856 ub->tag_set.queue_depth = ub->dev_info.queue_depth; 4254 2857 ub->tag_set.numa_node = NUMA_NO_NODE; ··· 4362 2959 lim.max_segments = ub->params.seg.max_segments; 4363 2960 } 4364 2961 2962 + if (ub->params.types & UBLK_PARAM_TYPE_INTEGRITY) { 2963 + const struct ublk_param_integrity *p = &ub->params.integrity; 2964 + int pi_tuple_size = ublk_integrity_pi_tuple_size(p->csum_type); 2965 + 2966 + lim.max_integrity_segments = 2967 + p->max_integrity_segments ?: USHRT_MAX; 2968 + lim.integrity = (struct blk_integrity) { 2969 + .flags = ublk_integrity_flags(p->flags), 2970 + .csum_type = ublk_integrity_csum_type(p->csum_type), 2971 + .metadata_size = p->metadata_size, 2972 + .pi_offset = p->pi_offset, 2973 + .interval_exp = p->interval_exp, 2974 + .tag_size = p->tag_size, 2975 + .pi_tuple_size = pi_tuple_size, 2976 + }; 2977 + } 2978 + 4365 2979 if (wait_for_completion_interruptible(&ub->completion) != 0) 4366 2980 return -EINTR; 4367 2981 ··· 4386 2966 return -EINVAL; 4387 2967 4388 2968 mutex_lock(&ub->mutex); 2969 + /* device may become not ready in case of F_BATCH */ 2970 + if (!ublk_dev_ready(ub)) { 2971 + ret = -EINVAL; 2972 + goto out_unlock; 2973 + } 4389 2974 if (ub->dev_info.state == 
UBLK_S_DEV_LIVE || 4390 2975 test_bit(UB_STATE_USED, &ub->state)) { 4391 2976 ret = -EEXIST; ··· 4438 3013 4439 3014 set_bit(UB_STATE_USED, &ub->state); 4440 3015 4441 - /* Schedule async partition scan for trusted daemons */ 4442 - if (!ub->unprivileged_daemons) 4443 - schedule_work(&ub->partition_scan_work); 3016 + /* Skip partition scan if disabled by user */ 3017 + if (ub->dev_info.flags & UBLK_F_NO_AUTO_PART_SCAN) { 3018 + clear_bit(GD_SUPPRESS_PART_SCAN, &disk->state); 3019 + } else { 3020 + /* Schedule async partition scan for trusted daemons */ 3021 + if (!ub->unprivileged_daemons) 3022 + schedule_work(&ub->partition_scan_work); 3023 + } 4444 3024 4445 3025 out_put_cdev: 4446 3026 if (ret) { ··· 4579 3149 return -EINVAL; 4580 3150 } 4581 3151 3152 + /* User copy is required to access integrity buffer */ 3153 + if (info.flags & UBLK_F_INTEGRITY && !(info.flags & UBLK_F_USER_COPY)) 3154 + return -EINVAL; 3155 + 4582 3156 /* the created device is always owned by current user */ 4583 3157 ublk_store_owner_uid_gid(&info.owner_uid, &info.owner_gid); 4584 3158 ··· 4638 3204 ub->dev_info.flags |= UBLK_F_CMD_IOCTL_ENCODE | 4639 3205 UBLK_F_URING_CMD_COMP_IN_TASK | 4640 3206 UBLK_F_PER_IO_DAEMON | 4641 - UBLK_F_BUF_REG_OFF_DAEMON; 3207 + UBLK_F_BUF_REG_OFF_DAEMON | 3208 + UBLK_F_SAFE_STOP_DEV; 3209 + 3210 + /* So far, UBLK_F_PER_IO_DAEMON won't be exposed for BATCH_IO */ 3211 + if (ublk_dev_support_batch_io(ub)) 3212 + ub->dev_info.flags &= ~UBLK_F_PER_IO_DAEMON; 4642 3213 4643 3214 /* GET_DATA isn't needed any more with USER_COPY or ZERO COPY */ 4644 3215 if (ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY | 4645 3216 UBLK_F_AUTO_BUF_REG)) 3217 + ub->dev_info.flags &= ~UBLK_F_NEED_GET_DATA; 3218 + 3219 + /* UBLK_F_BATCH_IO doesn't support GET_DATA */ 3220 + if (ublk_dev_support_batch_io(ub)) 4646 3221 ub->dev_info.flags &= ~UBLK_F_NEED_GET_DATA; 4647 3222 4648 3223 /* ··· 4754 3311 return 0; 4755 3312 } 4756 3313 4757 - static inline void 
ublk_ctrl_cmd_dump(struct io_uring_cmd *cmd) 3314 + static inline void ublk_ctrl_cmd_dump(u32 cmd_op, 3315 + const struct ublksrv_ctrl_cmd *header) 4758 3316 { 4759 - const struct ublksrv_ctrl_cmd *header = io_uring_sqe_cmd(cmd->sqe); 4760 - 4761 3317 pr_devel("%s: cmd_op %x, dev id %d qid %d data %llx buf %llx len %u\n", 4762 - __func__, cmd->cmd_op, header->dev_id, header->queue_id, 3318 + __func__, cmd_op, header->dev_id, header->queue_id, 4763 3319 header->data[0], header->addr, header->len); 4764 3320 } 4765 3321 4766 - static int ublk_ctrl_stop_dev(struct ublk_device *ub) 3322 + static void ublk_ctrl_stop_dev(struct ublk_device *ub) 4767 3323 { 4768 3324 ublk_stop_dev(ub); 4769 - return 0; 3325 + } 3326 + 3327 + static int ublk_ctrl_try_stop_dev(struct ublk_device *ub) 3328 + { 3329 + struct gendisk *disk; 3330 + int ret = 0; 3331 + 3332 + disk = ublk_get_disk(ub); 3333 + if (!disk) 3334 + return -ENODEV; 3335 + 3336 + mutex_lock(&disk->open_mutex); 3337 + if (disk_openers(disk) > 0) { 3338 + ret = -EBUSY; 3339 + goto unlock; 3340 + } 3341 + ub->block_open = true; 3342 + /* release open_mutex as del_gendisk() will reacquire it */ 3343 + mutex_unlock(&disk->open_mutex); 3344 + 3345 + ublk_ctrl_stop_dev(ub); 3346 + goto out; 3347 + 3348 + unlock: 3349 + mutex_unlock(&disk->open_mutex); 3350 + out: 3351 + ublk_put_disk(disk); 3352 + return ret; 4770 3353 } 4771 3354 4772 3355 static int ublk_ctrl_get_dev_info(struct ublk_device *ub, ··· 4915 3446 return ret; 4916 3447 } 4917 3448 4918 - static int ublk_ctrl_start_recovery(struct ublk_device *ub, 4919 - const struct ublksrv_ctrl_cmd *header) 3449 + static int ublk_ctrl_start_recovery(struct ublk_device *ub) 4920 3450 { 4921 3451 int ret = -EINVAL; 4922 3452 ··· 4944 3476 ret = -EBUSY; 4945 3477 goto out_unlock; 4946 3478 } 4947 - pr_devel("%s: start recovery for dev id %d.\n", __func__, header->dev_id); 3479 + pr_devel("%s: start recovery for dev id %d\n", __func__, ub->ub_number); 4948 3480 
init_completion(&ub->completion); 4949 3481 ret = 0; 4950 3482 out_unlock: ··· 5045 3577 { 5046 3578 unsigned int elapsed = 0; 5047 3579 int ret; 3580 + 3581 + /* 3582 + * For UBLK_F_BATCH_IO ublk server can get notified with existing 3583 + * or new fetch command, so needn't wait any more 3584 + */ 3585 + if (ublk_dev_support_batch_io(ub)) 3586 + return 0; 5048 3587 5049 3588 while (elapsed < timeout_ms && !signal_pending(current)) { 5050 3589 unsigned int queues_cancelable = 0; ··· 5160 3685 } 5161 3686 5162 3687 static int ublk_ctrl_uring_cmd_permission(struct ublk_device *ub, 5163 - struct io_uring_cmd *cmd) 3688 + u32 cmd_op, struct ublksrv_ctrl_cmd *header) 5164 3689 { 5165 - struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)io_uring_sqe_cmd(cmd->sqe); 5166 3690 bool unprivileged = ub->dev_info.flags & UBLK_F_UNPRIVILEGED_DEV; 5167 3691 void __user *argp = (void __user *)(unsigned long)header->addr; 5168 3692 char *dev_path = NULL; ··· 5177 3703 * know if the specified device is created as unprivileged 5178 3704 * mode. 
5179 3705 */ 5180 - if (_IOC_NR(cmd->cmd_op) != UBLK_CMD_GET_DEV_INFO2) 3706 + if (_IOC_NR(cmd_op) != UBLK_CMD_GET_DEV_INFO2) 5181 3707 return 0; 5182 3708 } 5183 3709 ··· 5198 3724 return PTR_ERR(dev_path); 5199 3725 5200 3726 ret = -EINVAL; 5201 - switch (_IOC_NR(cmd->cmd_op)) { 3727 + switch (_IOC_NR(cmd_op)) { 5202 3728 case UBLK_CMD_GET_DEV_INFO: 5203 3729 case UBLK_CMD_GET_DEV_INFO2: 5204 3730 case UBLK_CMD_GET_QUEUE_AFFINITY: ··· 5215 3741 case UBLK_CMD_END_USER_RECOVERY: 5216 3742 case UBLK_CMD_UPDATE_SIZE: 5217 3743 case UBLK_CMD_QUIESCE_DEV: 3744 + case UBLK_CMD_TRY_STOP_DEV: 5218 3745 mask = MAY_READ | MAY_WRITE; 5219 3746 break; 5220 3747 default: ··· 5228 3753 header->addr += header->dev_path_len; 5229 3754 } 5230 3755 pr_devel("%s: dev id %d cmd_op %x uid %d gid %d path %s ret %d\n", 5231 - __func__, ub->ub_number, cmd->cmd_op, 3756 + __func__, ub->ub_number, cmd_op, 5232 3757 ub->dev_info.owner_uid, ub->dev_info.owner_gid, 5233 3758 dev_path, ret); 5234 3759 exit: ··· 5252 3777 static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd, 5253 3778 unsigned int issue_flags) 5254 3779 { 5255 - const struct ublksrv_ctrl_cmd *header = io_uring_sqe_cmd(cmd->sqe); 3780 + /* May point to userspace-mapped memory */ 3781 + const struct ublksrv_ctrl_cmd *ub_src = io_uring_sqe_cmd(cmd->sqe); 3782 + struct ublksrv_ctrl_cmd header; 5256 3783 struct ublk_device *ub = NULL; 5257 3784 u32 cmd_op = cmd->cmd_op; 5258 3785 int ret = -EINVAL; ··· 5263 3786 issue_flags & IO_URING_F_NONBLOCK) 5264 3787 return -EAGAIN; 5265 3788 5266 - ublk_ctrl_cmd_dump(cmd); 5267 - 5268 3789 if (!(issue_flags & IO_URING_F_SQE128)) 5269 - goto out; 3790 + return -EINVAL; 3791 + 3792 + header.dev_id = READ_ONCE(ub_src->dev_id); 3793 + header.queue_id = READ_ONCE(ub_src->queue_id); 3794 + header.len = READ_ONCE(ub_src->len); 3795 + header.addr = READ_ONCE(ub_src->addr); 3796 + header.data[0] = READ_ONCE(ub_src->data[0]); 3797 + header.dev_path_len = READ_ONCE(ub_src->dev_path_len); 3798 + 
ublk_ctrl_cmd_dump(cmd_op, &header); 5270 3799 5271 3800 ret = ublk_check_cmd_op(cmd_op); 5272 3801 if (ret) 5273 3802 goto out; 5274 3803 5275 3804 if (cmd_op == UBLK_U_CMD_GET_FEATURES) { 5276 - ret = ublk_ctrl_get_features(header); 3805 + ret = ublk_ctrl_get_features(&header); 5277 3806 goto out; 5278 3807 } 5279 3808 5280 3809 if (_IOC_NR(cmd_op) != UBLK_CMD_ADD_DEV) { 5281 3810 ret = -ENODEV; 5282 - ub = ublk_get_device_from_id(header->dev_id); 3811 + ub = ublk_get_device_from_id(header.dev_id); 5283 3812 if (!ub) 5284 3813 goto out; 5285 3814 5286 - ret = ublk_ctrl_uring_cmd_permission(ub, cmd); 3815 + ret = ublk_ctrl_uring_cmd_permission(ub, cmd_op, &header); 5287 3816 if (ret) 5288 3817 goto put_dev; 5289 3818 } 5290 3819 5291 3820 switch (_IOC_NR(cmd_op)) { 5292 3821 case UBLK_CMD_START_DEV: 5293 - ret = ublk_ctrl_start_dev(ub, header); 3822 + ret = ublk_ctrl_start_dev(ub, &header); 5294 3823 break; 5295 3824 case UBLK_CMD_STOP_DEV: 5296 - ret = ublk_ctrl_stop_dev(ub); 3825 + ublk_ctrl_stop_dev(ub); 3826 + ret = 0; 5297 3827 break; 5298 3828 case UBLK_CMD_GET_DEV_INFO: 5299 3829 case UBLK_CMD_GET_DEV_INFO2: 5300 - ret = ublk_ctrl_get_dev_info(ub, header); 3830 + ret = ublk_ctrl_get_dev_info(ub, &header); 5301 3831 break; 5302 3832 case UBLK_CMD_ADD_DEV: 5303 - ret = ublk_ctrl_add_dev(header); 3833 + ret = ublk_ctrl_add_dev(&header); 5304 3834 break; 5305 3835 case UBLK_CMD_DEL_DEV: 5306 3836 ret = ublk_ctrl_del_dev(&ub, true); ··· 5316 3832 ret = ublk_ctrl_del_dev(&ub, false); 5317 3833 break; 5318 3834 case UBLK_CMD_GET_QUEUE_AFFINITY: 5319 - ret = ublk_ctrl_get_queue_affinity(ub, header); 3835 + ret = ublk_ctrl_get_queue_affinity(ub, &header); 5320 3836 break; 5321 3837 case UBLK_CMD_GET_PARAMS: 5322 - ret = ublk_ctrl_get_params(ub, header); 3838 + ret = ublk_ctrl_get_params(ub, &header); 5323 3839 break; 5324 3840 case UBLK_CMD_SET_PARAMS: 5325 - ret = ublk_ctrl_set_params(ub, header); 3841 + ret = ublk_ctrl_set_params(ub, &header); 5326 3842 break; 
5327 3843 case UBLK_CMD_START_USER_RECOVERY: 5328 - ret = ublk_ctrl_start_recovery(ub, header); 3844 + ret = ublk_ctrl_start_recovery(ub); 5329 3845 break; 5330 3846 case UBLK_CMD_END_USER_RECOVERY: 5331 - ret = ublk_ctrl_end_recovery(ub, header); 3847 + ret = ublk_ctrl_end_recovery(ub, &header); 5332 3848 break; 5333 3849 case UBLK_CMD_UPDATE_SIZE: 5334 - ublk_ctrl_set_size(ub, header); 3850 + ublk_ctrl_set_size(ub, &header); 5335 3851 ret = 0; 5336 3852 break; 5337 3853 case UBLK_CMD_QUIESCE_DEV: 5338 - ret = ublk_ctrl_quiesce_dev(ub, header); 3854 + ret = ublk_ctrl_quiesce_dev(ub, &header); 3855 + break; 3856 + case UBLK_CMD_TRY_STOP_DEV: 3857 + ret = ublk_ctrl_try_stop_dev(ub); 5339 3858 break; 5340 3859 default: 5341 3860 ret = -EOPNOTSUPP; ··· 5350 3863 ublk_put_device(ub); 5351 3864 out: 5352 3865 pr_devel("%s: cmd done ret %d cmd_op %x, dev id %d qid %d\n", 5353 - __func__, ret, cmd->cmd_op, header->dev_id, header->queue_id); 3866 + __func__, ret, cmd_op, header.dev_id, header.queue_id); 5354 3867 return ret; 5355 3868 } 5356 3869 ··· 5373 3886 5374 3887 BUILD_BUG_ON((u64)UBLKSRV_IO_BUF_OFFSET + 5375 3888 UBLKSRV_IO_BUF_TOTAL_SIZE < UBLKSRV_IO_BUF_OFFSET); 3889 + /* 3890 + * Ensure UBLKSRV_IO_BUF_OFFSET + UBLKSRV_IO_BUF_TOTAL_SIZE 3891 + * doesn't overflow into UBLKSRV_IO_INTEGRITY_FLAG 3892 + */ 3893 + BUILD_BUG_ON(UBLKSRV_IO_BUF_OFFSET + UBLKSRV_IO_BUF_TOTAL_SIZE >= 3894 + UBLKSRV_IO_INTEGRITY_FLAG); 5376 3895 BUILD_BUG_ON(sizeof(struct ublk_auto_buf_reg) != 8); 5377 3896 5378 3897 init_waitqueue_head(&ublk_idr_wq);
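
Note on the `ublk_ctrl_uring_cmd()` hunk above: the new code copies each header field once with `READ_ONCE()` into a stack copy, because the SQE payload "may point to userspace-mapped memory" and could change between a check and a later use. Below is a minimal userspace sketch of that snapshot-then-validate pattern. The struct layout, `snapshot_cmd()`, `check_cmd()` and `handle_cmd()` are hypothetical stand-ins, and the `volatile` qualifier only approximates what `READ_ONCE()` does in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct ublksrv_ctrl_cmd: only the copied fields. */
struct ctrl_cmd {
	uint32_t dev_id;
	uint16_t queue_id;
	uint16_t len;
	uint64_t addr;
	uint64_t data[1];
	uint16_t dev_path_len;
};

/*
 * Read every field exactly once from memory another party may modify.
 * Accessing through a volatile-qualified pointer forces one load per field
 * (an approximation of the kernel's READ_ONCE()).
 */
static void snapshot_cmd(struct ctrl_cmd *dst,
			 const volatile struct ctrl_cmd *src)
{
	assert(dst && src);
	dst->dev_id = src->dev_id;
	dst->queue_id = src->queue_id;
	dst->len = src->len;
	dst->addr = src->addr;
	dst->data[0] = src->data[0];
	dst->dev_path_len = src->dev_path_len;
}

/* Validation looks only at the private copy, never at the shared memory. */
static int check_cmd(const struct ctrl_cmd *cmd, uint16_t max_len)
{
	return cmd->len <= max_len ? 0 : -1;
}

/* Snapshot first, then validate and use the snapshot. */
static int handle_cmd(const volatile struct ctrl_cmd *shared,
		      uint16_t max_len, struct ctrl_cmd *out)
{
	snapshot_cmd(out, shared);
	return check_cmd(out, max_len);
}
```

Once `handle_cmd()` returns, later writes to the shared structure cannot affect the values that were validated, which appears to be the point of passing `&header` to the handlers instead of the SQE pointer.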

+2 -1

drivers/md/dm-rq.c

···
 }

 static enum rq_end_io_ret end_clone_request(struct request *clone,
-					    blk_status_t error)
+					    blk_status_t error,
+					    const struct io_comp_batch *iob)
 {
 	struct dm_rq_target_io *tio = clone->end_io_data;

+4 -3

drivers/md/md-bitmap.c

···
 		return;

 	bitmap_wait_behind_writes(mddev);
-	if (!mddev->serialize_policy)
+	if (!test_bit(MD_SERIALIZE_POLICY, &mddev->flags))
 		mddev_destroy_serial_pool(mddev, NULL);

 	mutex_lock(&mddev->bitmap_info.mutex);
···
 	memcpy(page_address(store.sb_page),
 	       page_address(bitmap->storage.sb_page),
 	       sizeof(bitmap_super_t));
+	mutex_lock(&bitmap->mddev->bitmap_info.mutex);
 	spin_lock_irq(&bitmap->counts.lock);
 	md_bitmap_file_unmap(&bitmap->storage);
 	bitmap->storage = store;
···
 			set_page_attr(bitmap, i, BITMAP_PAGE_DIRTY);
 	}
 	spin_unlock_irq(&bitmap->counts.lock);
-
+	mutex_unlock(&bitmap->mddev->bitmap_info.mutex);
 	if (!init) {
 		__bitmap_unplug(bitmap);
 		bitmap->mddev->pers->quiesce(bitmap->mddev, 0);
···
 	mddev->bitmap_info.max_write_behind = backlog;
 	if (!backlog && mddev->serial_info_pool) {
 		/* serial_info_pool is not needed if backlog is zero */
-		if (!mddev->serialize_policy)
+		if (!test_bit(MD_SERIALIZE_POLICY, &mddev->flags))
 			mddev_destroy_serial_pool(mddev, NULL);
 	} else if (backlog && !mddev->serial_info_pool) {
 		/* serial_info_pool is needed since backlog is not zero */

+6 -1

drivers/md/md-cluster.c

···

 	dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR);

-	/* daemaon thread must exist */
 	thread = rcu_dereference_protected(mddev->thread, true);
+	if (!thread) {
+		pr_warn("md-cluster: Received metadata update but MD thread is not ready\n");
+		dlm_unlock_sync(cinfo->no_new_dev_lockres);
+		return;
+	}
+
 	wait_event(thread->wqueue,
 		   (got_lock = mddev_trylock(mddev)) ||
 		   test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));

+3 -1

drivers/md/md-llbitmap.c

···
 	percpu_ref_kill(&pctl->active);

 	if (!wait_event_timeout(pctl->wait, percpu_ref_is_zero(&pctl->active),
-			llbitmap->mddev->bitmap_info.daemon_sleep * HZ))
+			llbitmap->mddev->bitmap_info.daemon_sleep * HZ)) {
+		percpu_ref_resurrect(&pctl->active);
 		return -ETIMEDOUT;
+	}

 	return 0;
 }
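
Note on the md-llbitmap.c hunk above: before the fix, a timed-out wait returned `-ETIMEDOUT` while leaving the reference killed. The sketch below is a single-threaded userspace model of why the added `percpu_ref_resurrect()` matters. It assumes users take the reference with a "live"-style tryget that fails once the ref has been killed. `struct ref_sketch` and all `ref_*` helpers are hypothetical, not the kernel percpu_ref API, and the timeout is modelled as "count still non-zero".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, single-threaded model of a killable reference counter. */
struct ref_sketch {
	long count;	/* users currently holding the reference */
	bool dying;	/* set by kill: new "live" gets are refused */
};

static bool ref_tryget_live(struct ref_sketch *r)
{
	if (r->dying)
		return false;
	r->count++;
	return true;
}

static void ref_put(struct ref_sketch *r)
{
	assert(r->count > 0);
	r->count--;
}

static void ref_kill(struct ref_sketch *r)	{ r->dying = true; }
static void ref_resurrect(struct ref_sketch *r)	{ r->dying = false; }

/*
 * Model of the suspend path: kill the ref, then "wait" for users to drain.
 * A non-zero count stands in for wait_event_timeout() expiring. On that
 * path the ref is resurrected so later users are not refused forever.
 */
static int suspend_sketch(struct ref_sketch *r)
{
	ref_kill(r);
	if (r->count != 0) {
		ref_resurrect(r);
		return -1;	/* stand-in for -ETIMEDOUT */
	}
	return 0;
}
```

In this model, omitting `ref_resurrect()` on the failure path would leave `dying` set even though the caller was told the suspend did not happen.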

+99 -89

drivers/md/md.c

··· 279 279 280 280 rdev_for_each(temp, mddev) { 281 281 if (!rdev) { 282 - if (!mddev->serialize_policy || 282 + if (!test_bit(MD_SERIALIZE_POLICY, 283 + &mddev->flags) || 283 284 !rdev_need_serial(temp)) 284 285 rdev_uninit_serial(temp); 285 286 else ··· 2617 2616 2618 2617 list_add_rcu(&rdev->same_set, &mddev->disks); 2619 2618 bd_link_disk_holder(rdev->bdev, mddev->gendisk); 2620 - 2621 - /* May as well allow recovery to be retried once */ 2622 - mddev->recovery_disabled++; 2623 2619 2624 2620 return 0; 2625 2621 ··· 5862 5864 5863 5865 static ssize_t fail_last_dev_show(struct mddev *mddev, char *page) 5864 5866 { 5865 - return sprintf(page, "%d\n", mddev->fail_last_dev); 5867 + return sprintf(page, "%d\n", test_bit(MD_FAILLAST_DEV, &mddev->flags)); 5866 5868 } 5867 5869 5868 5870 /* 5869 - * Setting fail_last_dev to true to allow last device to be forcibly removed 5871 + * Setting MD_FAILLAST_DEV to allow last device to be forcibly removed 5870 5872 * from RAID1/RAID10. 5871 5873 */ 5872 5874 static ssize_t ··· 5879 5881 if (ret) 5880 5882 return ret; 5881 5883 5882 - if (value != mddev->fail_last_dev) 5883 - mddev->fail_last_dev = value; 5884 + if (value) 5885 + set_bit(MD_FAILLAST_DEV, &mddev->flags); 5886 + else 5887 + clear_bit(MD_FAILLAST_DEV, &mddev->flags); 5884 5888 5885 5889 return len; 5886 5890 } ··· 5895 5895 if (mddev->pers == NULL || (mddev->pers->head.id != ID_RAID1)) 5896 5896 return sprintf(page, "n/a\n"); 5897 5897 else 5898 - return sprintf(page, "%d\n", mddev->serialize_policy); 5898 + return sprintf(page, "%d\n", 5899 + test_bit(MD_SERIALIZE_POLICY, &mddev->flags)); 5899 5900 } 5900 5901 5901 5902 /* 5902 - * Setting serialize_policy to true to enforce write IO is not reordered 5903 + * Setting MD_SERIALIZE_POLICY enforce write IO is not reordered 5903 5904 * for raid1. 
5904 5905 */ 5905 5906 static ssize_t ··· 5913 5912 if (err) 5914 5913 return err; 5915 5914 5916 - if (value == mddev->serialize_policy) 5915 + if (value == test_bit(MD_SERIALIZE_POLICY, &mddev->flags)) 5917 5916 return len; 5918 5917 5919 5918 err = mddev_suspend_and_lock(mddev); ··· 5925 5924 goto unlock; 5926 5925 } 5927 5926 5928 - if (value) 5927 + if (value) { 5929 5928 mddev_create_serial_pool(mddev, NULL); 5930 - else 5929 + set_bit(MD_SERIALIZE_POLICY, &mddev->flags); 5930 + } else { 5931 5931 mddev_destroy_serial_pool(mddev, NULL); 5932 - mddev->serialize_policy = value; 5932 + clear_bit(MD_SERIALIZE_POLICY, &mddev->flags); 5933 + } 5933 5934 unlock: 5934 5935 mddev_unlock_and_resume(mddev); 5935 5936 return err ?: len; ··· 6505 6502 * the only valid external interface is through the md 6506 6503 * device. 6507 6504 */ 6508 - mddev->has_superblocks = false; 6505 + clear_bit(MD_HAS_SUPERBLOCK, &mddev->flags); 6509 6506 rdev_for_each(rdev, mddev) { 6510 6507 if (test_bit(Faulty, &rdev->flags)) 6511 6508 continue; ··· 6518 6515 } 6519 6516 6520 6517 if (rdev->sb_page) 6521 - mddev->has_superblocks = true; 6518 + set_bit(MD_HAS_SUPERBLOCK, &mddev->flags); 6522 6519 6523 6520 /* perform some consistency tests on the device. 
6524 6521 * We don't want the data to overlap the metadata, ··· 6851 6848 { 6852 6849 timer_delete_sync(&mddev->safemode_timer); 6853 6850 6854 - if (mddev->pers && mddev->pers->quiesce) { 6855 - mddev->pers->quiesce(mddev, 1); 6856 - mddev->pers->quiesce(mddev, 0); 6857 - } 6851 + if (md_is_rdwr(mddev) || !mddev_is_dm(mddev)) { 6852 + if (mddev->pers && mddev->pers->quiesce) { 6853 + mddev->pers->quiesce(mddev, 1); 6854 + mddev->pers->quiesce(mddev, 0); 6855 + } 6858 6856 6859 - if (md_bitmap_enabled(mddev, true)) 6860 - mddev->bitmap_ops->flush(mddev); 6857 + if (md_bitmap_enabled(mddev, true)) 6858 + mddev->bitmap_ops->flush(mddev); 6859 + } 6861 6860 6862 6861 if (md_is_rdwr(mddev) && 6863 6862 ((!mddev->in_sync && !mddev_is_clustered(mddev)) || ··· 6870 6865 md_update_sb(mddev, 1); 6871 6866 } 6872 6867 /* disable policy to guarantee rdevs free resources for serialization */ 6873 - mddev->serialize_policy = 0; 6868 + clear_bit(MD_SERIALIZE_POLICY, &mddev->flags); 6874 6869 mddev_destroy_serial_pool(mddev, NULL); 6875 6870 } 6876 6871 ··· 9073 9068 return idle; 9074 9069 } 9075 9070 9076 - void md_done_sync(struct mddev *mddev, int blocks, int ok) 9071 + void md_done_sync(struct mddev *mddev, int blocks) 9077 9072 { 9078 9073 /* another "blocks" (512byte) blocks have been synced */ 9079 9074 atomic_sub(blocks, &mddev->recovery_active); 9080 9075 wake_up(&mddev->recovery_wait); 9081 - if (!ok) { 9082 - set_bit(MD_RECOVERY_INTR, &mddev->recovery); 9083 - set_bit(MD_RECOVERY_ERROR, &mddev->recovery); 9084 - md_wakeup_thread(mddev->thread); 9085 - // stop recovery, signal do_sync .... 9086 - } 9087 9076 } 9088 9077 EXPORT_SYMBOL(md_done_sync); 9078 + 9079 + void md_sync_error(struct mddev *mddev) 9080 + { 9081 + // stop recovery, signal do_sync .... 
9082 + set_bit(MD_RECOVERY_INTR, &mddev->recovery); 9083 + md_wakeup_thread(mddev->thread); 9084 + } 9085 + EXPORT_SYMBOL(md_sync_error); 9089 9086 9090 9087 /* md_write_start(mddev, bi) 9091 9088 * If we need to update some array metadata (e.g. 'active' flag ··· 9132 9125 rcu_read_unlock(); 9133 9126 if (did_change) 9134 9127 sysfs_notify_dirent_safe(mddev->sysfs_state); 9135 - if (!mddev->has_superblocks) 9128 + if (!test_bit(MD_HAS_SUPERBLOCK, &mddev->flags)) 9136 9129 return; 9137 9130 wait_event(mddev->sb_wait, 9138 9131 !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)); ··· 9437 9430 (raid_is_456(mddev) ? 8 : 128) * sync_io_depth(mddev); 9438 9431 } 9439 9432 9433 + /* 9434 + * Update sync offset and mddev status when sync completes 9435 + */ 9436 + static void md_finish_sync(struct mddev *mddev, enum sync_action action) 9437 + { 9438 + struct md_rdev *rdev; 9439 + 9440 + switch (action) { 9441 + case ACTION_RESYNC: 9442 + case ACTION_REPAIR: 9443 + if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) 9444 + mddev->curr_resync = MaxSector; 9445 + mddev->resync_offset = mddev->curr_resync; 9446 + break; 9447 + case ACTION_RECOVER: 9448 + if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) 9449 + mddev->curr_resync = MaxSector; 9450 + rcu_read_lock(); 9451 + rdev_for_each_rcu(rdev, mddev) 9452 + if (mddev->delta_disks >= 0 && 9453 + rdev_needs_recovery(rdev, mddev->curr_resync)) 9454 + rdev->recovery_offset = mddev->curr_resync; 9455 + rcu_read_unlock(); 9456 + break; 9457 + case ACTION_RESHAPE: 9458 + if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) && 9459 + mddev->delta_disks > 0 && 9460 + mddev->pers->finish_reshape && 9461 + mddev->pers->size && 9462 + !mddev_is_dm(mddev)) { 9463 + mddev_lock_nointr(mddev); 9464 + md_set_array_sectors(mddev, mddev->pers->size(mddev, 0, 0)); 9465 + mddev_unlock(mddev); 9466 + if (!mddev_is_clustered(mddev)) 9467 + set_capacity_and_notify(mddev->gendisk, 9468 + mddev->array_sectors); 9469 + } 9470 + if 
(mddev->pers->finish_reshape) 9471 + mddev->pers->finish_reshape(mddev); 9472 + break; 9473 + /* */ 9474 + case ACTION_CHECK: 9475 + default: 9476 + break; 9477 + } 9478 + } 9479 + 9440 9480 #define SYNC_MARKS 10 9441 9481 #define SYNC_MARK_STEP (3*HZ) 9442 9482 #define UPDATE_FREQUENCY (5*60*HZ) ··· 9499 9445 int last_mark,m; 9500 9446 sector_t last_check; 9501 9447 int skipped = 0; 9502 - struct md_rdev *rdev; 9503 9448 enum sync_action action; 9504 9449 const char *desc; 9505 9450 struct blk_plug plug; ··· 9784 9731 wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active)); 9785 9732 9786 9733 if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) && 9787 - !test_bit(MD_RECOVERY_INTR, &mddev->recovery) && 9788 9734 mddev->curr_resync >= MD_RESYNC_ACTIVE) { 9735 + /* All sync IO completes after recovery_active becomes 0 */ 9789 9736 mddev->curr_resync_completed = mddev->curr_resync; 9790 9737 sysfs_notify_dirent_safe(mddev->sysfs_completed); 9791 9738 } 9792 9739 mddev->pers->sync_request(mddev, max_sectors, max_sectors, &skipped); 9793 9740 9794 - if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) && 9795 - mddev->curr_resync > MD_RESYNC_ACTIVE) { 9796 - if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) { 9797 - if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) { 9798 - if (mddev->curr_resync >= mddev->resync_offset) { 9799 - pr_debug("md: checkpointing %s of %s.\n", 9800 - desc, mdname(mddev)); 9801 - if (test_bit(MD_RECOVERY_ERROR, 9802 - &mddev->recovery)) 9803 - mddev->resync_offset = 9804 - mddev->curr_resync_completed; 9805 - else 9806 - mddev->resync_offset = 9807 - mddev->curr_resync; 9808 - } 9809 - } else 9810 - mddev->resync_offset = MaxSector; 9811 - } else { 9812 - if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) 9813 - mddev->curr_resync = MaxSector; 9814 - if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) && 9815 - test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) { 9816 - rcu_read_lock(); 9817 - rdev_for_each_rcu(rdev, mddev) 9818 
- if (mddev->delta_disks >= 0 && 9819 - rdev_needs_recovery(rdev, mddev->curr_resync)) 9820 - rdev->recovery_offset = mddev->curr_resync; 9821 - rcu_read_unlock(); 9822 - } 9823 - } 9824 - } 9741 + if (mddev->curr_resync > MD_RESYNC_ACTIVE) 9742 + md_finish_sync(mddev, action); 9825 9743 skip: 9826 9744 /* set CHANGE_PENDING here since maybe another update is needed, 9827 9745 * so other nodes are informed. It should be harmless for normal 9828 9746 * raid */ 9829 9747 set_mask_bits(&mddev->sb_flags, 0, 9830 9748 BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS)); 9831 - 9832 - if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) && 9833 - !test_bit(MD_RECOVERY_INTR, &mddev->recovery) && 9834 - mddev->delta_disks > 0 && 9835 - mddev->pers->finish_reshape && 9836 - mddev->pers->size && 9837 - !mddev_is_dm(mddev)) { 9838 - mddev_lock_nointr(mddev); 9839 - md_set_array_sectors(mddev, mddev->pers->size(mddev, 0, 0)); 9840 - mddev_unlock(mddev); 9841 - if (!mddev_is_clustered(mddev)) 9842 - set_capacity_and_notify(mddev->gendisk, 9843 - mddev->array_sectors); 9844 - } 9845 - 9846 9749 spin_lock(&mddev->lock); 9847 9750 if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) { 9848 9751 /* We completed so min/max setting can be forgotten if used. 
*/ ··· 10313 10304 { 10314 10305 struct md_rdev *rdev; 10315 10306 sector_t old_dev_sectors = mddev->dev_sectors; 10316 - bool is_reshaped = false; 10307 + bool is_reshaped = test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery); 10317 10308 10318 10309 /* resync has finished, collect result */ 10319 10310 md_unregister_thread(mddev, &mddev->sync_thread); ··· 10328 10319 sysfs_notify_dirent_safe(mddev->sysfs_degraded); 10329 10320 set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags); 10330 10321 } 10331 - } 10332 - if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) && 10333 - mddev->pers->finish_reshape) { 10334 - mddev->pers->finish_reshape(mddev); 10335 - if (mddev_is_clustered(mddev)) 10336 - is_reshaped = true; 10337 10322 } 10338 10323 10339 10324 /* If array is no-longer degraded, then any saved_raid_disk ··· 10355 10352 * be changed by md_update_sb, and MD_RECOVERY_RESHAPE is cleared, 10356 10353 * so it is time to update size across cluster. 10357 10354 */ 10358 - if (mddev_is_clustered(mddev) && is_reshaped 10359 - && !test_bit(MD_CLOSING, &mddev->flags)) 10355 + if (mddev_is_clustered(mddev) && is_reshaped && 10356 + mddev->pers->finish_reshape && 10357 + !test_bit(MD_CLOSING, &mddev->flags)) 10360 10358 mddev->cluster_ops->update_size(mddev, old_dev_sectors); 10361 10359 /* flag recovery needed just to double check */ 10362 10360 set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); ··· 10417 10413 else 10418 10414 s += rdev->data_offset; 10419 10415 10420 - if (!badblocks_set(&rdev->badblocks, s, sectors, 0)) 10416 + if (!badblocks_set(&rdev->badblocks, s, sectors, 0)) { 10417 + /* 10418 + * Mark the disk as Faulty when setting badblocks fails, 10419 + * otherwise, bad sectors may be read. 10420 + */ 10421 + md_error(mddev, rdev); 10421 10422 return false; 10423 + } 10422 10424 10423 10425 /* Make sure they get written out promptly */ 10424 10426 if (test_bit(ExternalBbl, &rdev->flags))
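
Note on the new `md_finish_sync()` helper in the md.c hunks above: for resync and repair it advances `curr_resync` to `MaxSector` only when the sync was not interrupted, then checkpoints `resync_offset` from it. A minimal userspace model of just that branch follows. The type and function names are hypothetical, and the recover and reshape cases are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_SECTOR_SKETCH UINT64_MAX	/* stand-in for MaxSector */

enum action_sketch { SK_RESYNC, SK_REPAIR, SK_CHECK };

/* Hypothetical subset of the mddev fields the helper touches. */
struct sync_sketch {
	uint64_t curr_resync;	/* how far the sync thread got */
	uint64_t resync_offset;	/* checkpoint to resume from */
	bool interrupted;	/* stand-in for MD_RECOVERY_INTR */
};

static void finish_sync_sketch(struct sync_sketch *s, enum action_sketch action)
{
	assert(s);
	switch (action) {
	case SK_RESYNC:
	case SK_REPAIR:
		/* A clean finish means the whole array is in sync. */
		if (!s->interrupted)
			s->curr_resync = MAX_SECTOR_SKETCH;
		/* Either way, record where a later resync must start. */
		s->resync_offset = s->curr_resync;
		break;
	case SK_CHECK:
	default:
		/* A check does not move the checkpoint. */
		break;
	}
}
```

An interrupted resync therefore resumes from the position it reached, while a completed one records "nothing left to do".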

+14 -15

drivers/md/md.h

··· 22 22 #include <trace/events/block.h> 23 23 24 24 #define MaxSector (~(sector_t)0) 25 + /* 26 + * Number of guaranteed raid bios in case of extreme VM load: 27 + */ 28 + #define NR_RAID_BIOS 256 25 29 26 30 enum md_submodule_type { 27 31 MD_PERSONALITY = 0, ··· 344 340 * array is ready yet. 345 341 * @MD_BROKEN: This is used to stop writes and mark array as failed. 346 342 * @MD_DELETED: This device is being deleted 343 + * @MD_HAS_SUPERBLOCK: There is persistence sb in member disks. 344 + * @MD_FAILLAST_DEV: Allow last rdev to be removed. 345 + * @MD_SERIALIZE_POLICY: Enforce write IO is not reordered, just used by raid1. 347 346 * 348 347 * change UNSUPPORTED_MDDEV_FLAGS for each array type if new flag is added 349 348 */ ··· 363 356 MD_BROKEN, 364 357 MD_DO_DELETE, 365 358 MD_DELETED, 359 + MD_HAS_SUPERBLOCK, 360 + MD_FAILLAST_DEV, 361 + MD_SERIALIZE_POLICY, 366 362 }; 367 363 368 364 enum mddev_sb_flags { ··· 505 495 int ok_start_degraded; 506 496 507 497 unsigned long recovery; 508 - /* If a RAID personality determines that recovery (of a particular 509 - * device) will fail due to a read error on the source device, it 510 - * takes a copy of this number and does not attempt recovery again 511 - * until this number changes. 
512 - */ 513 - int recovery_disabled; 514 498 515 499 int in_sync; /* know to not need resync */ 516 500 /* 'open_mutex' avoids races between 'md_open' and 'do_md_stop', so ··· 626 622 627 623 /* The sequence number for sync thread */ 628 624 atomic_t sync_seq; 629 - 630 - bool has_superblocks:1; 631 - bool fail_last_dev:1; 632 - bool serialize_policy:1; 633 625 }; 634 626 635 627 enum recovery_flags { ··· 646 646 MD_RECOVERY_FROZEN, 647 647 /* waiting for pers->start() to finish */ 648 648 MD_RECOVERY_WAIT, 649 - /* interrupted because io-error */ 650 - MD_RECOVERY_ERROR, 651 649 652 650 /* flags determines sync action, see details in enum sync_action */ 653 651 ··· 735 737 int ret; 736 738 737 739 ret = mutex_trylock(&mddev->reconfig_mutex); 738 - if (!ret && test_bit(MD_DELETED, &mddev->flags)) { 739 - ret = -ENODEV; 740 + if (ret && test_bit(MD_DELETED, &mddev->flags)) { 741 + ret = 0; 740 742 mutex_unlock(&mddev->reconfig_mutex); 741 743 } 742 744 return ret; ··· 910 912 extern void md_write_start(struct mddev *mddev, struct bio *bi); 911 913 extern void md_write_inc(struct mddev *mddev, struct bio *bi); 912 914 extern void md_write_end(struct mddev *mddev); 913 - extern void md_done_sync(struct mddev *mddev, int blocks, int ok); 915 + extern void md_done_sync(struct mddev *mddev, int blocks); 916 + extern void md_sync_error(struct mddev *mddev); 914 917 extern void md_error(struct mddev *mddev, struct md_rdev *rdev); 915 918 extern void md_finish_reshape(struct mddev *mddev); 916 919 void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
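
Note on the `mddev_trylock()` hunk in md.h above: `mutex_trylock()` returns non-zero when the lock was taken, so the old test (`!ret && MD_DELETED`) ran only when the lock had not been acquired, yet it went on to unlock. The fixed version drops the lock and reports failure only when the lock was actually taken on a deleted device. A userspace sketch of the corrected logic follows, using a C11 `atomic_flag` as a stand-in for the mutex. `struct dev_sketch` and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the mddev fields involved. */
struct dev_sketch {
	atomic_flag lock;	/* stand-in for reconfig_mutex */
	bool deleted;		/* stand-in for MD_DELETED */
};

/* Same return convention as mutex_trylock(): 1 = acquired, 0 = contended. */
static int raw_trylock(struct dev_sketch *d)
{
	return !atomic_flag_test_and_set(&d->lock);
}

static void raw_unlock(struct dev_sketch *d)
{
	atomic_flag_clear(&d->lock);
}

/* Corrected logic: only a lock that was actually taken may be released. */
static int dev_trylock(struct dev_sketch *d)
{
	int ret;

	assert(d);
	ret = raw_trylock(d);
	if (ret && d->deleted) {
		ret = 0;
		raw_unlock(d);
	}
	return ret;
}
```

Callers see a plain acquired / not acquired result, and a deleted device is never left locked.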

+3 -1

drivers/md/raid0.c

···
 	 (1L << MD_JOURNAL_CLEAN) |	\
 	 (1L << MD_FAILFAST_SUPPORTED) |\
 	 (1L << MD_HAS_PPL) |		\
-	 (1L << MD_HAS_MULTIPLE_PPLS))
+	 (1L << MD_HAS_MULTIPLE_PPLS) |	\
+	 (1L << MD_FAILLAST_DEV) |	\
+	 (1L << MD_SERIALIZE_POLICY))

 /*
  * inform the user of the raid configuration
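
Note on the raid0.c hunk above: because `fail_last_dev` and `serialize_policy` are now bits in `mddev->flags` rather than separate bitfields, raid0 lists them in its unsupported-flags mask. The sketch below models flag bits plus such a mask in plain C. It assumes the mask is used to drop bits a personality cannot honour. The enum values and helper names are hypothetical, and these helpers are not atomic, unlike the kernel's `set_bit()` and `clear_bit()`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit numbers; the real enum mddev_flags has more entries. */
enum flag_sketch {
	SK_BROKEN,
	SK_HAS_SUPERBLOCK,
	SK_FAILLAST_DEV,
	SK_SERIALIZE_POLICY,
};

/* Mask of bits a raid0-like personality would not support. */
#define SK_UNSUPPORTED ((1UL << SK_FAILLAST_DEV) | (1UL << SK_SERIALIZE_POLICY))

static void set_flag(unsigned long *flags, int nr)
{
	assert(flags);
	*flags |= 1UL << nr;
}

static void clear_flag(unsigned long *flags, int nr)
{
	assert(flags);
	*flags &= ~(1UL << nr);
}

static bool test_flag(const unsigned long *flags, int nr)
{
	return (*flags >> nr) & 1UL;
}

/* Drop every bit named in the mask, leaving the others untouched. */
static void clear_unsupported(unsigned long *flags, unsigned long mask)
{
	*flags &= ~mask;
}
```

One consequence of the conversion is that a sysfs store such as `fail_last_dev` becomes a set-or-clear of one bit, as the md.c hunk shows.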

-5

drivers/md/raid1-10.c

···
 #define RESYNC_BLOCK_SIZE (64*1024)
 #define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)

-/*
- * Number of guaranteed raid bios in case of extreme VM load:
- */
-#define NR_RAID_BIOS 256
-
 /* when we get a read error on a read-only array, we redirect to another
  * device without failing the first device, or trying to over-write to
  * correct the read error.  To keep track of bad blocks on a per-bio

+36 -53

drivers/md/raid1.c

··· 542 542 call_bio_endio(r1_bio); 543 543 } 544 544 } 545 - } else if (rdev->mddev->serialize_policy) 545 + } else if (test_bit(MD_SERIALIZE_POLICY, &rdev->mddev->flags)) 546 546 remove_serial(rdev, lo, hi); 547 547 if (r1_bio->bios[mirror] == NULL) 548 548 rdev_dec_pending(rdev, conf->mddev); ··· 1644 1644 mbio = bio_alloc_clone(rdev->bdev, bio, GFP_NOIO, 1645 1645 &mddev->bio_set); 1646 1646 1647 - if (mddev->serialize_policy) 1647 + if (test_bit(MD_SERIALIZE_POLICY, &mddev->flags)) 1648 1648 wait_for_serialization(rdev, r1_bio); 1649 1649 } 1650 1650 ··· 1746 1746 * - &mddev->degraded is bumped. 1747 1747 * 1748 1748 * @rdev is marked as &Faulty excluding case when array is failed and 1749 - * &mddev->fail_last_dev is off. 1749 + * MD_FAILLAST_DEV is not set. 1750 1750 */ 1751 1751 static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) 1752 1752 { ··· 1759 1759 (conf->raid_disks - mddev->degraded) == 1) { 1760 1760 set_bit(MD_BROKEN, &mddev->flags); 1761 1761 1762 - if (!mddev->fail_last_dev) { 1763 - conf->recovery_disabled = mddev->recovery_disabled; 1762 + if (!test_bit(MD_FAILLAST_DEV, &mddev->flags)) { 1764 1763 spin_unlock_irqrestore(&conf->device_lock, flags); 1765 1764 return; 1766 1765 } ··· 1903 1904 1904 1905 /* Only remove non-faulty devices if recovery is not possible. 
*/ 1905 1906 if (!test_bit(Faulty, &rdev->flags) && 1906 - rdev->mddev->recovery_disabled != conf->recovery_disabled && 1907 1907 rdev->mddev->degraded < conf->raid_disks) 1908 1908 return false; 1909 1909 ··· 1921 1923 struct raid1_info *p; 1922 1924 int first = 0; 1923 1925 int last = conf->raid_disks - 1; 1924 - 1925 - if (mddev->recovery_disabled == conf->recovery_disabled) 1926 - return -EBUSY; 1927 1926 1928 1927 if (rdev->raid_disk >= 0) 1929 1928 first = last = rdev->raid_disk; ··· 2057 2062 } while (sectors_to_go > 0); 2058 2063 } 2059 2064 2060 - static void put_sync_write_buf(struct r1bio *r1_bio, int uptodate) 2065 + static void put_sync_write_buf(struct r1bio *r1_bio) 2061 2066 { 2062 2067 if (atomic_dec_and_test(&r1_bio->remaining)) { 2063 2068 struct mddev *mddev = r1_bio->mddev; ··· 2068 2073 reschedule_retry(r1_bio); 2069 2074 else { 2070 2075 put_buf(r1_bio); 2071 - md_done_sync(mddev, s, uptodate); 2076 + md_done_sync(mddev, s); 2072 2077 } 2073 2078 } 2074 2079 } 2075 2080 2076 2081 static void end_sync_write(struct bio *bio) 2077 2082 { 2078 - int uptodate = !bio->bi_status; 2079 2083 struct r1bio *r1_bio = get_resync_r1bio(bio); 2080 2084 struct mddev *mddev = r1_bio->mddev; 2081 2085 struct r1conf *conf = mddev->private; 2082 2086 struct md_rdev *rdev = conf->mirrors[find_bio_disk(r1_bio, bio)].rdev; 2083 2087 2084 - if (!uptodate) { 2088 + if (bio->bi_status) { 2085 2089 abort_sync_write(mddev, r1_bio); 2086 2090 set_bit(WriteErrorSeen, &rdev->flags); 2087 2091 if (!test_and_set_bit(WantReplacement, &rdev->flags)) ··· 2093 2099 set_bit(R1BIO_MadeGood, &r1_bio->state); 2094 2100 } 2095 2101 2096 - put_sync_write_buf(r1_bio, uptodate); 2102 + put_sync_write_buf(r1_bio); 2097 2103 } 2098 2104 2099 2105 static int r1_sync_page_io(struct md_rdev *rdev, sector_t sector, ··· 2110 2116 rdev->mddev->recovery); 2111 2117 } 2112 2118 /* need to record an error - either for the block or the device */ 2113 - if (!rdev_set_badblocks(rdev, sector, sectors, 
0)) 2114 - md_error(rdev->mddev, rdev); 2119 + rdev_set_badblocks(rdev, sector, sectors, 0); 2115 2120 return 0; 2116 2121 } 2117 2122 ··· 2341 2348 */ 2342 2349 if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) || 2343 2350 !fix_sync_read_error(r1_bio)) { 2344 - conf->recovery_disabled = mddev->recovery_disabled; 2345 - set_bit(MD_RECOVERY_INTR, &mddev->recovery); 2346 - md_done_sync(mddev, r1_bio->sectors, 0); 2351 + md_done_sync(mddev, r1_bio->sectors); 2352 + md_sync_error(mddev); 2347 2353 put_buf(r1_bio); 2348 2354 return; 2349 2355 } ··· 2377 2385 submit_bio_noacct(wbio); 2378 2386 } 2379 2387 2380 - put_sync_write_buf(r1_bio, 1); 2388 + put_sync_write_buf(r1_bio); 2381 2389 } 2382 2390 2383 2391 /* ··· 2434 2442 if (!success) { 2435 2443 /* Cannot read from anywhere - mark it bad */ 2436 2444 struct md_rdev *rdev = conf->mirrors[read_disk].rdev; 2437 - if (!rdev_set_badblocks(rdev, sect, s, 0)) 2438 - md_error(mddev, rdev); 2445 + rdev_set_badblocks(rdev, sect, s, 0); 2439 2446 break; 2440 2447 } 2441 2448 /* write it back and re-read */ ··· 2478 2487 } 2479 2488 } 2480 2489 2481 - static bool narrow_write_error(struct r1bio *r1_bio, int i) 2490 + static void narrow_write_error(struct r1bio *r1_bio, int i) 2482 2491 { 2483 2492 struct mddev *mddev = r1_bio->mddev; 2484 2493 struct r1conf *conf = mddev->private; ··· 2495 2504 * We currently own a reference on the rdev. 
2496 2505 */ 2497 2506 2498 - int block_sectors; 2507 + int block_sectors, lbs = bdev_logical_block_size(rdev->bdev) >> 9; 2499 2508 sector_t sector; 2500 2509 int sectors; 2501 2510 int sect_to_write = r1_bio->sectors; 2502 - bool ok = true; 2503 2511 2504 2512 if (rdev->badblocks.shift < 0) 2505 - return false; 2513 + block_sectors = lbs; 2514 + else 2515 + block_sectors = roundup(1 << rdev->badblocks.shift, lbs); 2506 2516 2507 - block_sectors = roundup(1 << rdev->badblocks.shift, 2508 - bdev_logical_block_size(rdev->bdev) >> 9); 2509 2517 sector = r1_bio->sector; 2510 2518 sectors = ((sector + block_sectors) 2511 2519 & ~(sector_t)(block_sectors - 1)) ··· 2532 2542 bio_trim(wbio, sector - r1_bio->sector, sectors); 2533 2543 wbio->bi_iter.bi_sector += rdev->data_offset; 2534 2544 2535 - if (submit_bio_wait(wbio) < 0) 2536 - /* failure! */ 2537 - ok = rdev_set_badblocks(rdev, sector, 2538 - sectors, 0) 2539 - && ok; 2545 + if (submit_bio_wait(wbio) && 2546 + !rdev_set_badblocks(rdev, sector, sectors, 0)) { 2547 + /* 2548 + * Badblocks set failed, disk marked Faulty. 2549 + * No further operations needed. 
2550 + */ 2551 + bio_put(wbio); 2552 + break; 2553 + } 2540 2554 2541 2555 bio_put(wbio); 2542 2556 sect_to_write -= sectors; 2543 2557 sector += sectors; 2544 2558 sectors = block_sectors; 2545 2559 } 2546 - return ok; 2547 2560 } 2548 2561 2549 2562 static void handle_sync_write_finished(struct r1conf *conf, struct r1bio *r1_bio) ··· 2559 2566 if (bio->bi_end_io == NULL) 2560 2567 continue; 2561 2568 if (!bio->bi_status && 2562 - test_bit(R1BIO_MadeGood, &r1_bio->state)) { 2569 + test_bit(R1BIO_MadeGood, &r1_bio->state)) 2563 2570 rdev_clear_badblocks(rdev, r1_bio->sector, s, 0); 2564 - } 2565 2571 if (bio->bi_status && 2566 - test_bit(R1BIO_WriteError, &r1_bio->state)) { 2567 - if (!rdev_set_badblocks(rdev, r1_bio->sector, s, 0)) 2568 - md_error(conf->mddev, rdev); 2569 - } 2572 + test_bit(R1BIO_WriteError, &r1_bio->state)) 2573 + rdev_set_badblocks(rdev, r1_bio->sector, s, 0); 2570 2574 } 2571 2575 put_buf(r1_bio); 2572 - md_done_sync(conf->mddev, s, 1); 2576 + md_done_sync(conf->mddev, s); 2573 2577 } 2574 2578 2575 2579 static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio) ··· 2587 2597 * errors. 2588 2598 */ 2589 2599 fail = true; 2590 - if (!narrow_write_error(r1_bio, m)) 2591 - md_error(conf->mddev, 2592 - conf->mirrors[m].rdev); 2593 - /* an I/O failed, we can't clear the bitmap */ 2600 + narrow_write_error(r1_bio, m); 2594 2601 rdev_dec_pending(conf->mirrors[m].rdev, 2595 2602 conf->mddev); 2596 2603 } ··· 2942 2955 *skipped = 1; 2943 2956 put_buf(r1_bio); 2944 2957 2945 - if (!ok) { 2946 - /* Cannot record the badblocks, so need to 2958 + if (!ok) 2959 + /* Cannot record the badblocks, md_error has set INTR, 2947 2960 * abort the resync. 2948 - * If there are multiple read targets, could just 2949 - * fail the really bad ones ??? 
2950 2961 */ 2951 - conf->recovery_disabled = mddev->recovery_disabled; 2952 - set_bit(MD_RECOVERY_INTR, &mddev->recovery); 2953 2962 return 0; 2954 - } else 2963 + else 2955 2964 return min_bad; 2956 2965 2957 2966 } ··· 3134 3151 init_waitqueue_head(&conf->wait_barrier); 3135 3152 3136 3153 bio_list_init(&conf->pending_bio_list); 3137 - conf->recovery_disabled = mddev->recovery_disabled - 1; 3138 3154 3139 3155 err = -EIO; 3140 3156 for (i = 0; i < conf->raid_disks * 2; i++) { ··· 3236 3254 if (!mddev_is_dm(mddev)) { 3237 3255 ret = raid1_set_limits(mddev); 3238 3256 if (ret) { 3257 + md_unregister_thread(mddev, &conf->thread); 3239 3258 if (!mddev->private) 3240 3259 raid1_free(mddev, conf); 3241 3260 return ret;

-5

drivers/md/raid1.h

···
  93   93   	 */
  94   94   	int fullsync;
  95   95
  96      - 	/* When the same as mddev->recovery_disabled we don't allow
  97      - 	 * recovery to be attempted as we expect a read error.
  98      - 	 */
  99      - 	int recovery_disabled;
 100      -
 101   96   	mempool_t *r1bio_pool;
 102   97   	mempool_t r1buf_pool;
 103   98

+56 -122

drivers/md/raid10.c

··· 1990 1990 * - &mddev->degraded is bumped. 1991 1991 * 1992 1992 * @rdev is marked as &Faulty excluding case when array is failed and 1993 - * &mddev->fail_last_dev is off. 1993 + * MD_FAILLAST_DEV is not set. 1994 1994 */ 1995 1995 static void raid10_error(struct mddev *mddev, struct md_rdev *rdev) 1996 1996 { ··· 2002 2002 if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) { 2003 2003 set_bit(MD_BROKEN, &mddev->flags); 2004 2004 2005 - if (!mddev->fail_last_dev) { 2005 + if (!test_bit(MD_FAILLAST_DEV, &mddev->flags)) { 2006 2006 spin_unlock_irqrestore(&conf->device_lock, flags); 2007 2007 return; 2008 2008 } ··· 2130 2130 mirror = first; 2131 2131 for ( ; mirror <= last ; mirror++) { 2132 2132 p = &conf->mirrors[mirror]; 2133 - if (p->recovery_disabled == mddev->recovery_disabled) 2134 - continue; 2135 2133 if (p->rdev) { 2136 2134 if (test_bit(WantReplacement, &p->rdev->flags) && 2137 2135 p->replacement == NULL && repl_slot < 0) ··· 2141 2143 if (err) 2142 2144 return err; 2143 2145 p->head_position = 0; 2144 - p->recovery_disabled = mddev->recovery_disabled - 1; 2145 2146 rdev->raid_disk = mirror; 2146 2147 err = 0; 2147 2148 if (rdev->saved_raid_disk != mirror) ··· 2193 2196 * is not possible. 
2194 2197 */ 2195 2198 if (!test_bit(Faulty, &rdev->flags) && 2196 - mddev->recovery_disabled != p->recovery_disabled && 2197 2199 (!p->replacement || p->replacement == rdev) && 2198 2200 number < conf->geo.raid_disks && 2199 2201 enough(conf, -1)) { ··· 2272 2276 reschedule_retry(r10_bio); 2273 2277 else 2274 2278 put_buf(r10_bio); 2275 - md_done_sync(mddev, s, 1); 2279 + md_done_sync(mddev, s); 2276 2280 break; 2277 2281 } else { 2278 2282 struct r10bio *r10_bio2 = (struct r10bio *)r10_bio->master_bio; ··· 2448 2452 2449 2453 done: 2450 2454 if (atomic_dec_and_test(&r10_bio->remaining)) { 2451 - md_done_sync(mddev, r10_bio->sectors, 1); 2455 + md_done_sync(mddev, r10_bio->sectors); 2452 2456 put_buf(r10_bio); 2453 2457 } 2454 2458 } ··· 2531 2535 pr_notice("md/raid10:%s: recovery aborted due to read error\n", 2532 2536 mdname(mddev)); 2533 2537 2534 - conf->mirrors[dw].recovery_disabled 2535 - = mddev->recovery_disabled; 2536 2538 set_bit(MD_RECOVERY_INTR, 2537 2539 &mddev->recovery); 2538 2540 break; ··· 2598 2604 &rdev->mddev->recovery); 2599 2605 } 2600 2606 /* need to record an error - either for the block or the device */ 2601 - if (!rdev_set_badblocks(rdev, sector, sectors, 0)) 2602 - md_error(rdev->mddev, rdev); 2607 + rdev_set_badblocks(rdev, sector, sectors, 0); 2603 2608 return 0; 2604 2609 } 2605 2610 ··· 2679 2686 r10_bio->devs[slot].addr 2680 2687 + sect, 2681 2688 s, 0)) { 2682 - md_error(mddev, rdev); 2683 2689 r10_bio->devs[slot].bio 2684 2690 = IO_BLOCKED; 2685 2691 } ··· 2765 2773 } 2766 2774 } 2767 2775 2768 - static bool narrow_write_error(struct r10bio *r10_bio, int i) 2776 + static void narrow_write_error(struct r10bio *r10_bio, int i) 2769 2777 { 2770 2778 struct bio *bio = r10_bio->master_bio; 2771 2779 struct mddev *mddev = r10_bio->mddev; ··· 2782 2790 * We currently own a reference to the rdev. 
2783 2791 */ 2784 2792 2785 - int block_sectors; 2793 + int block_sectors, lbs = bdev_logical_block_size(rdev->bdev) >> 9; 2786 2794 sector_t sector; 2787 2795 int sectors; 2788 2796 int sect_to_write = r10_bio->sectors; 2789 - bool ok = true; 2790 2797 2791 2798 if (rdev->badblocks.shift < 0) 2792 - return false; 2799 + block_sectors = lbs; 2800 + else 2801 + block_sectors = roundup(1 << rdev->badblocks.shift, lbs); 2793 2802 2794 - block_sectors = roundup(1 << rdev->badblocks.shift, 2795 - bdev_logical_block_size(rdev->bdev) >> 9); 2796 2803 sector = r10_bio->sector; 2797 2804 sectors = ((r10_bio->sector + block_sectors) 2798 2805 & ~(sector_t)(block_sectors - 1)) ··· 2811 2820 choose_data_offset(r10_bio, rdev); 2812 2821 wbio->bi_opf = REQ_OP_WRITE; 2813 2822 2814 - if (submit_bio_wait(wbio) < 0) 2815 - /* Failure! */ 2816 - ok = rdev_set_badblocks(rdev, wsector, 2817 - sectors, 0) 2818 - && ok; 2823 + if (submit_bio_wait(wbio) && 2824 + !rdev_set_badblocks(rdev, wsector, sectors, 0)) { 2825 + /* 2826 + * Badblocks set failed, disk marked Faulty. 2827 + * No further operations needed. 
2828 + */ 2829 + bio_put(wbio); 2830 + break; 2831 + } 2819 2832 2820 2833 bio_put(wbio); 2821 2834 sect_to_write -= sectors; 2822 2835 sector += sectors; 2823 2836 sectors = block_sectors; 2824 2837 } 2825 - return ok; 2826 2838 } 2827 2839 2828 2840 static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio) ··· 2885 2891 if (r10_bio->devs[m].bio == NULL || 2886 2892 r10_bio->devs[m].bio->bi_end_io == NULL) 2887 2893 continue; 2888 - if (!r10_bio->devs[m].bio->bi_status) { 2894 + if (!r10_bio->devs[m].bio->bi_status) 2889 2895 rdev_clear_badblocks( 2890 2896 rdev, 2891 2897 r10_bio->devs[m].addr, 2892 2898 r10_bio->sectors, 0); 2893 - } else { 2894 - if (!rdev_set_badblocks( 2895 - rdev, 2896 - r10_bio->devs[m].addr, 2897 - r10_bio->sectors, 0)) 2898 - md_error(conf->mddev, rdev); 2899 - } 2899 + else 2900 + rdev_set_badblocks(rdev, 2901 + r10_bio->devs[m].addr, 2902 + r10_bio->sectors, 0); 2900 2903 rdev = conf->mirrors[dev].replacement; 2901 2904 if (r10_bio->devs[m].repl_bio == NULL || 2902 2905 r10_bio->devs[m].repl_bio->bi_end_io == NULL) 2903 2906 continue; 2904 2907 2905 - if (!r10_bio->devs[m].repl_bio->bi_status) { 2908 + if (!r10_bio->devs[m].repl_bio->bi_status) 2906 2909 rdev_clear_badblocks( 2907 2910 rdev, 2908 2911 r10_bio->devs[m].addr, 2909 2912 r10_bio->sectors, 0); 2910 - } else { 2911 - if (!rdev_set_badblocks( 2912 - rdev, 2913 - r10_bio->devs[m].addr, 2914 - r10_bio->sectors, 0)) 2915 - md_error(conf->mddev, rdev); 2916 - } 2913 + else 2914 + rdev_set_badblocks(rdev, 2915 + r10_bio->devs[m].addr, 2916 + r10_bio->sectors, 0); 2917 2917 } 2918 2918 put_buf(r10_bio); 2919 2919 } else { ··· 2924 2936 rdev_dec_pending(rdev, conf->mddev); 2925 2937 } else if (bio != NULL && bio->bi_status) { 2926 2938 fail = true; 2927 - if (!narrow_write_error(r10_bio, m)) 2928 - md_error(conf->mddev, rdev); 2939 + narrow_write_error(r10_bio, m); 2929 2940 rdev_dec_pending(rdev, conf->mddev); 2930 2941 } 2931 2942 bio = r10_bio->devs[m].repl_bio; 
··· 3155 3168 int i; 3156 3169 int max_sync; 3157 3170 sector_t sync_blocks; 3158 - sector_t sectors_skipped = 0; 3159 - int chunks_skipped = 0; 3160 3171 sector_t chunk_mask = conf->geo.chunk_mask; 3161 3172 int page_idx = 0; 3162 - int error_disk = -1; 3163 3173 3164 3174 /* 3165 3175 * Allow skipping a full rebuild for incremental assembly ··· 3177 3193 if (init_resync(conf)) 3178 3194 return 0; 3179 3195 3180 - skipped: 3181 3196 if (sector_nr >= max_sector) { 3182 3197 conf->cluster_sync_low = 0; 3183 3198 conf->cluster_sync_high = 0; ··· 3228 3245 mddev->bitmap_ops->close_sync(mddev); 3229 3246 close_sync(conf); 3230 3247 *skipped = 1; 3231 - return sectors_skipped; 3248 + return 0; 3232 3249 } 3233 3250 3234 3251 if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) 3235 3252 return reshape_request(mddev, sector_nr, skipped); 3236 - 3237 - if (chunks_skipped >= conf->geo.raid_disks) { 3238 - pr_err("md/raid10:%s: %s fails\n", mdname(mddev), 3239 - test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ? "resync" : "recovery"); 3240 - if (error_disk >= 0 && 3241 - !test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) { 3242 - /* 3243 - * recovery fails, set mirrors.recovery_disabled, 3244 - * device shouldn't be added to there. 3245 - */ 3246 - conf->mirrors[error_disk].recovery_disabled = 3247 - mddev->recovery_disabled; 3248 - return 0; 3249 - } 3250 - /* 3251 - * if there has been nothing to do on any drive, 3252 - * then there is nothing to do at all. 
3253 - */ 3254 - *skipped = 1; 3255 - return (max_sector - sector_nr) + sectors_skipped; 3256 - } 3257 3253 3258 3254 if (max_sector > mddev->resync_max) 3259 3255 max_sector = mddev->resync_max; /* Don't do IO beyond here */ ··· 3316 3354 /* yep, skip the sync_blocks here, but don't assume 3317 3355 * that there will never be anything to do here 3318 3356 */ 3319 - chunks_skipped = -1; 3320 3357 continue; 3321 3358 } 3322 3359 if (mrdev) ··· 3363 3402 !test_bit(In_sync, &rdev->flags)) 3364 3403 continue; 3365 3404 /* This is where we read from */ 3366 - any_working = 1; 3367 3405 sector = r10_bio->devs[j].addr; 3368 3406 3369 3407 if (is_badblock(rdev, sector, max_sync, ··· 3377 3417 continue; 3378 3418 } 3379 3419 } 3420 + any_working = 1; 3380 3421 bio = r10_bio->devs[0].bio; 3381 3422 bio->bi_next = biolist; 3382 3423 biolist = bio; ··· 3446 3485 for (k = 0; k < conf->copies; k++) 3447 3486 if (r10_bio->devs[k].devnum == i) 3448 3487 break; 3449 - if (mrdev && !test_bit(In_sync, 3450 - &mrdev->flags) 3451 - && !rdev_set_badblocks( 3452 - mrdev, 3453 - r10_bio->devs[k].addr, 3454 - max_sync, 0)) 3455 - any_working = 0; 3456 - if (mreplace && 3457 - !rdev_set_badblocks( 3458 - mreplace, 3459 - r10_bio->devs[k].addr, 3460 - max_sync, 0)) 3461 - any_working = 0; 3462 - } 3463 - if (!any_working) { 3464 - if (!test_and_set_bit(MD_RECOVERY_INTR, 3465 - &mddev->recovery)) 3466 - pr_warn("md/raid10:%s: insufficient working devices for recovery.\n", 3467 - mdname(mddev)); 3468 - mirror->recovery_disabled 3469 - = mddev->recovery_disabled; 3470 - } else { 3471 - error_disk = i; 3488 + if (mrdev && 3489 + !test_bit(In_sync, &mrdev->flags)) 3490 + rdev_set_badblocks( 3491 + mrdev, 3492 + r10_bio->devs[k].addr, 3493 + max_sync, 0); 3494 + if (mreplace) 3495 + rdev_set_badblocks( 3496 + mreplace, 3497 + r10_bio->devs[k].addr, 3498 + max_sync, 0); 3499 + pr_warn("md/raid10:%s: cannot recovery sector %llu + %d.\n", 3500 + mdname(mddev), r10_bio->devs[k].addr, max_sync); 3472 
3501 } 3473 3502 put_buf(r10_bio); 3474 3503 if (rb2) ··· 3499 3548 rb2->master_bio = NULL; 3500 3549 put_buf(rb2); 3501 3550 } 3502 - goto giveup; 3551 + *skipped = 1; 3552 + return max_sync; 3503 3553 } 3504 3554 } else { 3505 3555 /* resync. Schedule a read for every block at this virt offset */ ··· 3524 3572 &mddev->recovery)) { 3525 3573 /* We can skip this block */ 3526 3574 *skipped = 1; 3527 - return sync_blocks + sectors_skipped; 3575 + return sync_blocks; 3528 3576 } 3529 3577 if (sync_blocks < max_sync) 3530 3578 max_sync = sync_blocks; ··· 3616 3664 mddev); 3617 3665 } 3618 3666 put_buf(r10_bio); 3619 - biolist = NULL; 3620 - goto giveup; 3667 + *skipped = 1; 3668 + return max_sync; 3621 3669 } 3622 3670 } 3623 3671 ··· 3637 3685 if (WARN_ON(!bio_add_page(bio, page, len, 0))) { 3638 3686 bio->bi_status = BLK_STS_RESOURCE; 3639 3687 bio_endio(bio); 3640 - goto giveup; 3688 + *skipped = 1; 3689 + return max_sync; 3641 3690 } 3642 3691 } 3643 3692 nr_sectors += len>>9; ··· 3706 3753 } 3707 3754 } 3708 3755 3709 - if (sectors_skipped) 3710 - /* pretend they weren't skipped, it makes 3711 - * no important difference in this case 3712 - */ 3713 - md_done_sync(mddev, sectors_skipped, 1); 3714 - 3715 - return sectors_skipped + nr_sectors; 3716 - giveup: 3717 - /* There is nowhere to write, so all non-sync 3718 - * drives must be failed or in resync, all drives 3719 - * have a bad block, so try the next chunk... 
3720 - */ 3721 - if (sector_nr + max_sync < max_sector) 3722 - max_sector = sector_nr + max_sync; 3723 - 3724 - sectors_skipped += (max_sector - sector_nr); 3725 - chunks_skipped ++; 3726 - sector_nr = max_sector; 3727 - goto skipped; 3756 + return nr_sectors; 3728 3757 } 3729 3758 3730 3759 static sector_t ··· 4069 4134 disk->replacement->saved_raid_disk < 0) { 4070 4135 conf->fullsync = 1; 4071 4136 } 4072 - 4073 - disk->recovery_disabled = mddev->recovery_disabled - 1; 4074 4137 } 4075 4138 4076 4139 if (mddev->resync_offset != MaxSector) ··· 4846 4913 if (!test_bit(R10BIO_Uptodate, &r10_bio->state)) 4847 4914 if (handle_reshape_read_error(mddev, r10_bio) < 0) { 4848 4915 /* Reshape has been aborted */ 4849 - md_done_sync(mddev, r10_bio->sectors, 0); 4916 + md_done_sync(mddev, r10_bio->sectors); 4917 + md_sync_error(mddev); 4850 4918 return; 4851 4919 } 4852 4920 ··· 5005 5071 { 5006 5072 if (!atomic_dec_and_test(&r10_bio->remaining)) 5007 5073 return; 5008 - md_done_sync(r10_bio->mddev, r10_bio->sectors, 1); 5074 + md_done_sync(r10_bio->mddev, r10_bio->sectors); 5009 5075 bio_put(r10_bio->master_bio); 5010 5076 put_buf(r10_bio); 5011 5077 }
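Both narrow_write_error() variants above now pick the retry granularity from the logical block size when bad-block tracking has a negative shift, instead of returning early. The arithmetic can be sketched in user space as follows. This is a minimal sketch, not the kernel code: the helper names `roundup_u64`, `pick_block_sectors` and `first_chunk_sectors` are hypothetical, and a negative shift is assumed to mean that bad-block tracking is disabled.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of y (y > 0), like the kernel's roundup(). */
static uint64_t roundup_u64(uint64_t x, uint64_t y)
{
	return ((x + y - 1) / y) * y;
}

/*
 * Hypothetical helper: granularity, in 512-byte sectors, at which a failed
 * write is retried. A negative shift is assumed to mean that bad-block
 * tracking is disabled, in which case one logical block is used.
 */
static uint64_t pick_block_sectors(int badblocks_shift,
				   unsigned int logical_block_size)
{
	uint64_t lbs = logical_block_size >> 9;

	if (badblocks_shift < 0)
		return lbs;
	return roundup_u64(1ULL << badblocks_shift, lbs);
}

/*
 * Hypothetical helper: length of the first chunk, chosen so that the next
 * chunk starts on a block_sectors boundary. block_sectors must be a power
 * of two for the mask to be valid.
 */
static uint64_t first_chunk_sectors(uint64_t sector, uint64_t block_sectors)
{
	return ((sector + block_sectors) & ~(block_sectors - 1)) - sector;
}
```

With a 4096-byte logical block size, `pick_block_sectors(-1, 4096)` yields 8 sectors, so a write starting at sector 10 is first retried as a 6-sector chunk and then in 8-sector steps.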

-5

drivers/md/raid10.h

···
  18   18   struct raid10_info {
  19   19   	struct md_rdev *rdev, *replacement;
  20   20   	sector_t head_position;
  21      - 	int recovery_disabled;	/* matches
  22      - 				 * mddev->recovery_disabled
  23      - 				 * when we shouldn't try
  24      - 				 * recovering this device.
  25      - 				 */
  26   21   };
  27   22
  28   23   struct r10conf {

+89 -54

drivers/md/raid5.c

··· 56 56 #include "md-bitmap.h" 57 57 #include "raid5-log.h" 58 58 59 - #define UNSUPPORTED_MDDEV_FLAGS (1L << MD_FAILFAST_SUPPORTED) 59 + #define UNSUPPORTED_MDDEV_FLAGS \ 60 + ((1L << MD_FAILFAST_SUPPORTED) | \ 61 + (1L << MD_FAILLAST_DEV) | \ 62 + (1L << MD_SERIALIZE_POLICY)) 63 + 60 64 61 65 #define cpu_to_group(cpu) cpu_to_node(cpu) 62 66 #define ANY_GROUP NUMA_NO_NODE ··· 777 773 /* last sector in the request */ 778 774 sector_t last_sector; 779 775 776 + /* the request had REQ_PREFLUSH, cleared after the first stripe_head */ 777 + bool do_flush; 778 + 780 779 /* 781 780 * bitmap to track stripe sectors that have been added to stripes 782 781 * add one to account for unaligned requests 783 782 */ 784 - DECLARE_BITMAP(sectors_to_do, RAID5_MAX_REQ_STRIPES + 1); 785 - 786 - /* the request had REQ_PREFLUSH, cleared after the first stripe_head */ 787 - bool do_flush; 783 + unsigned long sectors_to_do[]; 788 784 }; 789 785 790 786 /* ··· 2821 2817 else { 2822 2818 clear_bit(R5_ReadError, &sh->dev[i].flags); 2823 2819 clear_bit(R5_ReWrite, &sh->dev[i].flags); 2824 - if (!(set_bad 2825 - && test_bit(In_sync, &rdev->flags) 2826 - && rdev_set_badblocks( 2827 - rdev, sh->sector, RAID5_STRIPE_SECTORS(conf), 0))) 2828 - md_error(conf->mddev, rdev); 2820 + if (!(set_bad && test_bit(In_sync, &rdev->flags))) 2821 + rdev_set_badblocks(rdev, sh->sector, 2822 + RAID5_STRIPE_SECTORS(conf), 0); 2829 2823 } 2830 2824 } 2831 2825 rdev_dec_pending(rdev, conf->mddev); ··· 2922 2920 2923 2921 if (has_failed(conf)) { 2924 2922 set_bit(MD_BROKEN, &conf->mddev->flags); 2925 - conf->recovery_disabled = mddev->recovery_disabled; 2926 2923 2927 2924 pr_crit("md/raid:%s: Cannot continue operation (%d/%d failed).\n", 2928 2925 mdname(mddev), mddev->degraded, conf->raid_disks); ··· 3600 3599 else 3601 3600 rdev = NULL; 3602 3601 if (rdev) { 3603 - if (!rdev_set_badblocks( 3604 - rdev, 3605 - sh->sector, 3606 - RAID5_STRIPE_SECTORS(conf), 0)) 3607 - md_error(conf->mddev, rdev); 3602 + 
rdev_set_badblocks(rdev, 3603 + sh->sector, 3604 + RAID5_STRIPE_SECTORS(conf), 3605 + 0); 3608 3606 rdev_dec_pending(rdev, conf->mddev); 3609 3607 } 3610 3608 } ··· 3723 3723 RAID5_STRIPE_SECTORS(conf), 0)) 3724 3724 abort = 1; 3725 3725 } 3726 - if (abort) 3727 - conf->recovery_disabled = 3728 - conf->mddev->recovery_disabled; 3729 3726 } 3730 - md_done_sync(conf->mddev, RAID5_STRIPE_SECTORS(conf), !abort); 3727 + md_done_sync(conf->mddev, RAID5_STRIPE_SECTORS(conf)); 3728 + 3729 + if (abort) 3730 + md_sync_error(conf->mddev); 3731 3731 } 3732 3732 3733 3733 static int want_replace(struct stripe_head *sh, int disk_idx) ··· 3751 3751 struct r5dev *dev = &sh->dev[disk_idx]; 3752 3752 struct r5dev *fdev[2] = { &sh->dev[s->failed_num[0]], 3753 3753 &sh->dev[s->failed_num[1]] }; 3754 + struct mddev *mddev = sh->raid_conf->mddev; 3755 + bool force_rcw = false; 3754 3756 int i; 3755 - bool force_rcw = (sh->raid_conf->rmw_level == PARITY_DISABLE_RMW); 3756 3757 3758 + if (sh->raid_conf->rmw_level == PARITY_DISABLE_RMW || 3759 + (mddev->bitmap_ops && mddev->bitmap_ops->blocks_synced && 3760 + !mddev->bitmap_ops->blocks_synced(mddev, sh->sector))) 3761 + force_rcw = true; 3757 3762 3758 3763 if (test_bit(R5_LOCKED, &dev->flags) || 3759 3764 test_bit(R5_UPTODATE, &dev->flags)) ··· 5162 5157 if ((s.syncing || s.replacing) && s.locked == 0 && 5163 5158 !test_bit(STRIPE_COMPUTE_RUN, &sh->state) && 5164 5159 test_bit(STRIPE_INSYNC, &sh->state)) { 5165 - md_done_sync(conf->mddev, RAID5_STRIPE_SECTORS(conf), 1); 5160 + md_done_sync(conf->mddev, RAID5_STRIPE_SECTORS(conf)); 5166 5161 clear_bit(STRIPE_SYNCING, &sh->state); 5167 5162 if (test_and_clear_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags)) 5168 5163 wake_up_bit(&sh->dev[sh->pd_idx].flags, R5_Overlap); ··· 5229 5224 clear_bit(STRIPE_EXPAND_READY, &sh->state); 5230 5225 atomic_dec(&conf->reshape_stripes); 5231 5226 wake_up(&conf->wait_for_reshape); 5232 - md_done_sync(conf->mddev, RAID5_STRIPE_SECTORS(conf), 1); 5227 + 
md_done_sync(conf->mddev, RAID5_STRIPE_SECTORS(conf)); 5233 5228 } 5234 5229 5235 5230 if (s.expanding && s.locked == 0 && ··· 5258 5253 if (test_and_clear_bit(R5_WriteError, &dev->flags)) { 5259 5254 /* We own a safe reference to the rdev */ 5260 5255 rdev = conf->disks[i].rdev; 5261 - if (!rdev_set_badblocks(rdev, sh->sector, 5262 - RAID5_STRIPE_SECTORS(conf), 0)) 5263 - md_error(conf->mddev, rdev); 5256 + rdev_set_badblocks(rdev, sh->sector, 5257 + RAID5_STRIPE_SECTORS(conf), 0); 5264 5258 rdev_dec_pending(rdev, conf->mddev); 5265 5259 } 5266 5260 if (test_and_clear_bit(R5_MadeGood, &dev->flags)) { ··· 6084 6080 static bool raid5_make_request(struct mddev *mddev, struct bio * bi) 6085 6081 { 6086 6082 DEFINE_WAIT_FUNC(wait, woken_wake_function); 6087 - bool on_wq; 6088 6083 struct r5conf *conf = mddev->private; 6089 - sector_t logical_sector; 6090 - struct stripe_request_ctx ctx = {}; 6091 6084 const int rw = bio_data_dir(bi); 6085 + struct stripe_request_ctx *ctx; 6086 + sector_t logical_sector; 6092 6087 enum stripe_result res; 6093 6088 int s, stripe_cnt; 6089 + bool on_wq; 6094 6090 6095 6091 if (unlikely(bi->bi_opf & REQ_PREFLUSH)) { 6096 6092 int ret = log_handle_flush_request(conf, bi); ··· 6102 6098 return true; 6103 6099 } 6104 6100 /* ret == -EAGAIN, fallback */ 6105 - /* 6106 - * if r5l_handle_flush_request() didn't clear REQ_PREFLUSH, 6107 - * we need to flush journal device 6108 - */ 6109 - ctx.do_flush = bi->bi_opf & REQ_PREFLUSH; 6110 6101 } 6111 6102 6112 6103 md_write_start(mddev, bi); ··· 6124 6125 } 6125 6126 6126 6127 logical_sector = bi->bi_iter.bi_sector & ~((sector_t)RAID5_STRIPE_SECTORS(conf)-1); 6127 - ctx.first_sector = logical_sector; 6128 - ctx.last_sector = bio_end_sector(bi); 6129 6128 bi->bi_next = NULL; 6130 6129 6131 - stripe_cnt = DIV_ROUND_UP_SECTOR_T(ctx.last_sector - logical_sector, 6130 + ctx = mempool_alloc(conf->ctx_pool, GFP_NOIO); 6131 + memset(ctx, 0, conf->ctx_size); 6132 + ctx->first_sector = logical_sector; 6133 + 
ctx->last_sector = bio_end_sector(bi); 6134 + /* 6135 + * if r5l_handle_flush_request() didn't clear REQ_PREFLUSH, 6136 + * we need to flush journal device 6137 + */ 6138 + if (unlikely(bi->bi_opf & REQ_PREFLUSH)) 6139 + ctx->do_flush = true; 6140 + 6141 + stripe_cnt = DIV_ROUND_UP_SECTOR_T(ctx->last_sector - logical_sector, 6132 6142 RAID5_STRIPE_SECTORS(conf)); 6133 - bitmap_set(ctx.sectors_to_do, 0, stripe_cnt); 6143 + bitmap_set(ctx->sectors_to_do, 0, stripe_cnt); 6134 6144 6135 6145 pr_debug("raid456: %s, logical %llu to %llu\n", __func__, 6136 - bi->bi_iter.bi_sector, ctx.last_sector); 6146 + bi->bi_iter.bi_sector, ctx->last_sector); 6137 6147 6138 6148 /* Bail out if conflicts with reshape and REQ_NOWAIT is set */ 6139 6149 if ((bi->bi_opf & REQ_NOWAIT) && ··· 6150 6142 bio_wouldblock_error(bi); 6151 6143 if (rw == WRITE) 6152 6144 md_write_end(mddev); 6145 + mempool_free(ctx, conf->ctx_pool); 6153 6146 return true; 6154 6147 } 6155 6148 md_account_bio(mddev, &bi); ··· 6169 6160 add_wait_queue(&conf->wait_for_reshape, &wait); 6170 6161 on_wq = true; 6171 6162 } 6172 - s = (logical_sector - ctx.first_sector) >> RAID5_STRIPE_SHIFT(conf); 6163 + s = (logical_sector - ctx->first_sector) >> RAID5_STRIPE_SHIFT(conf); 6173 6164 6174 6165 while (1) { 6175 - res = make_stripe_request(mddev, conf, &ctx, logical_sector, 6166 + res = make_stripe_request(mddev, conf, ctx, logical_sector, 6176 6167 bi); 6177 6168 if (res == STRIPE_FAIL || res == STRIPE_WAIT_RESHAPE) 6178 6169 break; ··· 6189 6180 * raid5_activate_delayed() from making progress 6190 6181 * and thus deadlocking. 
6191 6182 */ 6192 - if (ctx.batch_last) { 6193 - raid5_release_stripe(ctx.batch_last); 6194 - ctx.batch_last = NULL; 6183 + if (ctx->batch_last) { 6184 + raid5_release_stripe(ctx->batch_last); 6185 + ctx->batch_last = NULL; 6195 6186 } 6196 6187 6197 6188 wait_woken(&wait, TASK_UNINTERRUPTIBLE, ··· 6199 6190 continue; 6200 6191 } 6201 6192 6202 - s = find_next_bit_wrap(ctx.sectors_to_do, stripe_cnt, s); 6193 + s = find_next_bit_wrap(ctx->sectors_to_do, stripe_cnt, s); 6203 6194 if (s == stripe_cnt) 6204 6195 break; 6205 6196 6206 - logical_sector = ctx.first_sector + 6197 + logical_sector = ctx->first_sector + 6207 6198 (s << RAID5_STRIPE_SHIFT(conf)); 6208 6199 } 6209 6200 if (unlikely(on_wq)) 6210 6201 remove_wait_queue(&conf->wait_for_reshape, &wait); 6211 6202 6212 - if (ctx.batch_last) 6213 - raid5_release_stripe(ctx.batch_last); 6203 + if (ctx->batch_last) 6204 + raid5_release_stripe(ctx->batch_last); 6214 6205 6215 6206 if (rw == WRITE) 6216 6207 md_write_end(mddev); 6208 + 6209 + mempool_free(ctx, conf->ctx_pool); 6217 6210 if (res == STRIPE_WAIT_RESHAPE) { 6218 6211 md_free_cloned_bio(bi); 6219 6212 return false; ··· 7385 7374 bioset_exit(&conf->bio_split); 7386 7375 kfree(conf->stripe_hashtbl); 7387 7376 kfree(conf->pending_data); 7377 + 7378 + mempool_destroy(conf->ctx_pool); 7379 + 7388 7380 kfree(conf); 7389 7381 } 7390 7382 ··· 7550 7536 } 7551 7537 7552 7538 conf->bypass_threshold = BYPASS_THRESHOLD; 7553 - conf->recovery_disabled = mddev->recovery_disabled - 1; 7554 - 7555 7539 conf->raid_disks = mddev->raid_disks; 7556 7540 if (mddev->reshape_position == MaxSector) 7557 7541 conf->previous_raid_disks = mddev->raid_disks; ··· 7741 7729 return 0; 7742 7730 } 7743 7731 7732 + static int raid5_create_ctx_pool(struct r5conf *conf) 7733 + { 7734 + struct stripe_request_ctx *ctx; 7735 + int size; 7736 + 7737 + if (mddev_is_dm(conf->mddev)) 7738 + size = BITS_TO_LONGS(RAID5_MAX_REQ_STRIPES); 7739 + else 7740 + size = BITS_TO_LONGS( 7741 + 
queue_max_hw_sectors(conf->mddev->gendisk->queue) >> 7742 + RAID5_STRIPE_SHIFT(conf)); 7743 + 7744 + conf->ctx_size = struct_size(ctx, sectors_to_do, size); 7745 + conf->ctx_pool = mempool_create_kmalloc_pool(NR_RAID_BIOS, 7746 + conf->ctx_size); 7747 + 7748 + return conf->ctx_pool ? 0 : -ENOMEM; 7749 + } 7750 + 7744 7751 static int raid5_set_limits(struct mddev *mddev) 7745 7752 { 7746 7753 struct r5conf *conf = mddev->private; ··· 7816 7785 * Limit the max sectors based on this. 7817 7786 */ 7818 7787 lim.max_hw_sectors = RAID5_MAX_REQ_STRIPES << RAID5_STRIPE_SHIFT(conf); 7788 + if ((lim.max_hw_sectors << 9) < lim.io_opt) 7789 + lim.max_hw_sectors = lim.io_opt >> 9; 7819 7790 7820 7791 /* No restrictions on the number of segments in the request */ 7821 7792 lim.max_segments = USHRT_MAX; ··· 8090 8057 goto abort; 8091 8058 } 8092 8059 8093 - if (log_init(conf, journal_dev, raid5_has_ppl(conf))) 8060 + ret = raid5_create_ctx_pool(conf); 8061 + if (ret) 8062 + goto abort; 8063 + 8064 + ret = log_init(conf, journal_dev, raid5_has_ppl(conf)); 8065 + if (ret) 8094 8066 goto abort; 8095 8067 8096 8068 return 0; ··· 8249 8211 * isn't possible. 8250 8212 */ 8251 8213 if (!test_bit(Faulty, &rdev->flags) && 8252 - mddev->recovery_disabled != conf->recovery_disabled && 8253 8214 !has_failed(conf) && 8254 8215 (!p->replacement || p->replacement == rdev) && 8255 8216 number < conf->raid_disks) { ··· 8309 8272 8310 8273 return 0; 8311 8274 } 8312 - if (mddev->recovery_disabled == conf->recovery_disabled) 8313 - return -EBUSY; 8314 8275 8315 8276 if (rdev->saved_raid_disk < 0 && has_failed(conf)) 8316 8277 /* no point adding a device */

+3 -1

drivers/md/raid5.h

···
 640  640   	 * (fresh device added).
 641  641   	 * Cleared when a sync completes.
 642  642   	 */
 643      - 	int recovery_disabled;
 644  643   	/* per cpu variables */
 645  644   	struct raid5_percpu __percpu *percpu;
 646  645   	int scribble_disks;
···
 689  690   	struct list_head pending_list;
 690  691   	int pending_data_cnt;
 691  692   	struct r5pending_data *next_pending_data;
      693 +
      694 + 	mempool_t *ctx_pool;
      695 + 	int ctx_size;
 692  696   };
 693  697
 694  698   #if PAGE_SIZE == DEFAULT_STRIPE_SIZE
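The new ctx_pool and ctx_size fields go with the raid5.c change that turns sectors_to_do into a flexible array member and sizes each context at setup time. The sizing can be illustrated in user space roughly as below. This is a sketch under stated assumptions: `struct ctx_sketch`, `ctx_sketch_size` and `ctx_sketch_alloc` are hypothetical stand-ins, `calloc` replaces the kernel mempool, and the overflow checking that the kernel's struct_size() performs is omitted.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_BITS_TO_LONGS(n) \
	(((n) + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

/* Hypothetical stand-in for a request context with a trailing bitmap. */
struct ctx_sketch {
	uint64_t first_sector;
	uint64_t last_sector;
	bool do_flush;
	unsigned long sectors_to_do[];	/* one bit per stripe in the request */
};

/* Bytes needed for the fixed part plus a bitmap covering nr_stripes bits. */
static size_t ctx_sketch_size(size_t nr_stripes)
{
	return sizeof(struct ctx_sketch) +
	       SKETCH_BITS_TO_LONGS(nr_stripes) * sizeof(unsigned long);
}

/* Allocate a zeroed context; the caller releases it with free(). */
static struct ctx_sketch *ctx_sketch_alloc(size_t nr_stripes)
{
	return calloc(1, ctx_sketch_size(nr_stripes));
}
```

Computing the size once and reusing it for every allocation mirrors how the diff stores conf->ctx_size next to conf->ctx_pool, so the per-request memset() and the pool element size stay in agreement.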

+2 -1

drivers/nvme/host/core.c

···
1333 1333   }
1334 1334
1335 1335   static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
1336      - 		blk_status_t status)
     1336 + 		blk_status_t status,
     1337 + 		const struct io_comp_batch *iob)
1338 1337   {
1339 1338   	struct nvme_ctrl *ctrl = rq->end_io_data;
1340 1339   	unsigned long rtt = jiffies - (rq->deadline - rq->timeout);

+15 -8

drivers/nvme/host/ioctl.c

···
 410  410   }
 411  411
 412  412   static enum rq_end_io_ret nvme_uring_cmd_end_io(struct request *req,
 413      - 		blk_status_t err)
      413 + 		blk_status_t err,
      414 + 		const struct io_comp_batch *iob)
 414  415   {
 415  416   	struct io_uring_cmd *ioucmd = req->end_io_data;
 416  417   	struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
···
 426  425   	pdu->result = le64_to_cpu(nvme_req(req)->result.u64);
 427  426
 428  427   	/*
 429      - 	 * IOPOLL could potentially complete this request directly, but
 430      - 	 * if multiple rings are polling on the same queue, then it's possible
 431      - 	 * for one ring to find completions for another ring. Punting the
 432      - 	 * completion via task_work will always direct it to the right
 433      - 	 * location, rather than potentially complete requests for ringA
 434      - 	 * under iopoll invocations from ringB.
      428 + 	 * For IOPOLL, check if this completion is happening in the context
      429 + 	 * of the same io_ring that owns the request (local context). If so,
      430 + 	 * we can complete inline without task_work overhead. Otherwise, we
      431 + 	 * must punt to task_work to ensure completion happens in the correct
      432 + 	 * ring's context.
 435  433   	 */
 436      - 	io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
      434 + 	if (blk_rq_is_poll(req) && iob &&
      435 + 	    iob->poll_ctx == io_uring_cmd_ctx_handle(ioucmd)) {
      436 + 		if (pdu->bio)
      437 + 			blk_rq_unmap_user(pdu->bio);
      438 + 		io_uring_cmd_done32(ioucmd, pdu->status, pdu->result, 0);
      439 + 	} else {
      440 + 		io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
      441 + 	}
 437  442   	return RQ_END_IO_FREE;
 438  443   }
 439  444
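The decision in nvme_uring_cmd_end_io() above reduces to one predicate: complete inline only when the request is polled, a completion batch was passed in, and that batch belongs to the ring that owns the command. A stand-alone sketch of that predicate follows. The types and names here (`comp_batch_sketch`, `choose_completion_path`, the `completion_path` enum) are hypothetical and only model the comparison; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a completion batch that records who is polling. */
struct comp_batch_sketch {
	void *poll_ctx;
};

enum completion_path {
	COMPLETE_INLINE,	/* finish the command in the polling context */
	COMPLETE_VIA_TASK_WORK,	/* defer to the owning ring's task context */
};

/*
 * Inline completion is chosen only when all three conditions hold; any
 * other combination falls back to the deferred path.
 */
static enum completion_path
choose_completion_path(bool is_polled, const struct comp_batch_sketch *iob,
		       const void *owner_ctx)
{
	if (is_polled && iob && iob->poll_ctx == owner_ctx)
		return COMPLETE_INLINE;
	return COMPLETE_VIA_TASK_WORK;
}
```

Treating a NULL batch as "not local" keeps the callers that pass no batch on the deferred path, which matches the `iob &&` check in the hunk.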

+11 -11

drivers/nvme/host/pci.c

···
 290  290   	u8 flags;
 291  291   	u8 nr_descriptors;
 292  292
 293      - 	unsigned int total_len;
      293 + 	size_t total_len;
 294  294   	struct dma_iova_state dma_state;
 295  295   	void *descriptors[NVME_MAX_NR_DESCRIPTORS];
 296  296   	struct nvme_dma_vec *dma_vecs;
 297  297   	unsigned int nr_dma_vecs;
 298  298
 299  299   	dma_addr_t meta_dma;
 300      - 	unsigned int meta_total_len;
      300 + 	size_t meta_total_len;
 301  301   	struct dma_iova_state meta_dma_state;
 302  302   	struct nvme_sgl_desc *meta_descriptor;
 303  303   };
···
 845  845   static bool nvme_pci_prp_iter_next(struct request *req, struct device *dma_dev,
 846  846   		struct blk_dma_iter *iter)
 847  847   {
 848      - 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 849      -
 850  848   	if (iter->len)
 851  849   		return true;
 852      - 	if (!blk_rq_dma_map_iter_next(req, dma_dev, &iod->dma_state, iter))
      850 + 	if (!blk_rq_dma_map_iter_next(req, dma_dev, iter))
 853  851   		return false;
 854  852   	return nvme_pci_prp_save_mapping(req, dma_dev, iter);
 855  853   }
···
1022 1024   		}
1023 1025   		nvme_pci_sgl_set_data(&sg_list[mapped++], iter);
1024 1026   		iod->total_len += iter->len;
1025      - 	} while (blk_rq_dma_map_iter_next(req, nvmeq->dev->dev, &iod->dma_state,
1026      - 				iter));
     1027 + 	} while (blk_rq_dma_map_iter_next(req, nvmeq->dev->dev, iter));
1027 1028
1028 1029   	nvme_pci_sgl_set_seg(&iod->cmd.common.dptr.sgl, sgl_dma, mapped);
1029 1030   	if (unlikely(iter->status))
···
1631 1634   	return adapter_delete_queue(dev, nvme_admin_delete_sq, sqid);
1632 1635   }
1633 1636
1634      - static enum rq_end_io_ret abort_endio(struct request *req, blk_status_t error)
     1637 + static enum rq_end_io_ret abort_endio(struct request *req, blk_status_t error,
     1638 + 		const struct io_comp_batch *iob)
1635 1639   {
1636 1640   	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
1637 1641
···
2875 2877   }
2876 2878
2877 2879   static enum rq_end_io_ret nvme_del_queue_end(struct request *req,
2878      - 		blk_status_t error)
     2880 + 		blk_status_t error,
     2881 + 		const struct io_comp_batch *iob)
2879 2882   {
2880 2883   	struct nvme_queue *nvmeq = req->end_io_data;
2881 2884
···
2886 2887   }
2887 2888
2888 2889   static enum rq_end_io_ret nvme_del_cq_end(struct request *req,
2889      - 		blk_status_t error)
     2890 + 		blk_status_t error,
     2891 + 		const struct io_comp_batch *iob)
2890 2892   {
2891 2893   	struct nvme_queue *nvmeq = req->end_io_data;
2892 2894
2893 2895   	if (error)
2894 2896   		set_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags);
2895 2897
2896      - 	return nvme_del_queue_end(req, error);
     2898 + 	return nvme_del_queue_end(req, error, iob);
2897 2899   }
2898 2900
2899 2901   static int nvme_delete_queue(struct nvme_queue *nvmeq, u8 opcode)

+2 -2

drivers/nvme/target/admin-cmd.c

···
 298  298   	if (status)
 299  299   		goto out;
 300  300
 301      - 	if (!req->ns->bdev || bdev_nonrot(req->ns->bdev)) {
      301 + 	if (!req->ns->bdev || !bdev_rot(req->ns->bdev)) {
 302  302   		status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
 303  303   		goto out;
 304  304   	}
···
1084 1084   	id->nmic = NVME_NS_NMIC_SHARED;
1085 1085   	if (req->ns->readonly)
1086 1086   		id->nsattr |= NVME_NS_ATTR_RO;
1087      - 	if (req->ns->bdev && !bdev_nonrot(req->ns->bdev))
     1087 + 	if (req->ns->bdev && bdev_rot(req->ns->bdev))
1088 1088   		id->nsfeat |= NVME_NS_ROTATIONAL;
1089 1089   	/*
1090 1090   	 * We need flush command to flush the file's metadata,

+2 -1

drivers/nvme/target/passthru.c

··· 247 247 } 248 248 249 249 static enum rq_end_io_ret nvmet_passthru_req_done(struct request *rq, 250 - blk_status_t blk_status) 250 + blk_status_t blk_status, 251 + const struct io_comp_batch *iob) 251 252 { 252 253 struct nvmet_req *req = rq->end_io_data; 253 254

+2 -1

drivers/scsi/scsi_error.c

··· 2118 2118 } 2119 2119 2120 2120 static enum rq_end_io_ret eh_lock_door_done(struct request *req, 2121 - blk_status_t status) 2121 + blk_status_t status, 2122 + const struct io_comp_batch *iob) 2122 2123 { 2123 2124 blk_mq_free_request(req); 2124 2125 return RQ_END_IO_NONE;

+4 -2

drivers/scsi/sg.c

··· 177 177 } Sg_device; 178 178 179 179 /* tasklet or soft irq callback */ 180 - static enum rq_end_io_ret sg_rq_end_io(struct request *rq, blk_status_t status); 180 + static enum rq_end_io_ret sg_rq_end_io(struct request *rq, blk_status_t status, 181 + const struct io_comp_batch *iob); 181 182 static int sg_start_req(Sg_request *srp, unsigned char *cmd); 182 183 static int sg_finish_rem_req(Sg_request * srp); 183 184 static int sg_build_indirect(Sg_scatter_hold * schp, Sg_fd * sfp, int buff_size); ··· 1310 1309 * level when a command is completed (or has failed). 1311 1310 */ 1312 1311 static enum rq_end_io_ret 1313 - sg_rq_end_io(struct request *rq, blk_status_t status) 1312 + sg_rq_end_io(struct request *rq, blk_status_t status, 1313 + const struct io_comp_batch *iob) 1314 1314 { 1315 1315 struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(rq); 1316 1316 struct sg_request *srp = rq->end_io_data;

+2 -1

drivers/scsi/st.c

··· 525 525 } 526 526 527 527 static enum rq_end_io_ret st_scsi_execute_end(struct request *req, 528 - blk_status_t status) 528 + blk_status_t status, 529 + const struct io_comp_batch *iob) 529 530 { 530 531 struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req); 531 532 struct st_request *SRpnt = req->end_io_data;

+4 -2

drivers/target/target_core_pscsi.c

··· 39 39 } 40 40 41 41 static sense_reason_t pscsi_execute_cmd(struct se_cmd *cmd); 42 - static enum rq_end_io_ret pscsi_req_done(struct request *, blk_status_t); 42 + static enum rq_end_io_ret pscsi_req_done(struct request *, blk_status_t, 43 + const struct io_comp_batch *); 43 44 44 45 /* pscsi_attach_hba(): 45 46 * ··· 1002 1001 } 1003 1002 1004 1003 static enum rq_end_io_ret pscsi_req_done(struct request *req, 1005 - blk_status_t status) 1004 + blk_status_t status, 1005 + const struct io_comp_batch *iob) 1006 1006 { 1007 1007 struct se_cmd *cmd = req->end_io_data; 1008 1008 struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);

+2 -1

fs/buffer.c

··· 29 29 #include <linux/slab.h> 30 30 #include <linux/capability.h> 31 31 #include <linux/blkdev.h> 32 + #include <linux/blk-crypto.h> 32 33 #include <linux/file.h> 33 34 #include <linux/quotaops.h> 34 35 #include <linux/highmem.h> ··· 2822 2821 wbc_account_cgroup_owner(wbc, bh->b_folio, bh->b_size); 2823 2822 } 2824 2823 2825 - submit_bio(bio); 2824 + blk_crypto_submit_bio(bio); 2826 2825 } 2827 2826 2828 2827 void submit_bh(blk_opf_t opf, struct buffer_head *bh)

+56 -35

fs/crypto/bio.c

··· 47 47 } 48 48 EXPORT_SYMBOL(fscrypt_decrypt_bio); 49 49 50 + struct fscrypt_zero_done { 51 + atomic_t pending; 52 + blk_status_t status; 53 + struct completion done; 54 + }; 55 + 56 + static void fscrypt_zeroout_range_done(struct fscrypt_zero_done *done) 57 + { 58 + if (atomic_dec_and_test(&done->pending)) 59 + complete(&done->done); 60 + } 61 + 62 + static void fscrypt_zeroout_range_end_io(struct bio *bio) 63 + { 64 + struct fscrypt_zero_done *done = bio->bi_private; 65 + 66 + if (bio->bi_status) 67 + cmpxchg(&done->status, 0, bio->bi_status); 68 + fscrypt_zeroout_range_done(done); 69 + bio_put(bio); 70 + } 71 + 50 72 static int fscrypt_zeroout_range_inline_crypt(const struct inode *inode, 51 - pgoff_t lblk, sector_t pblk, 73 + pgoff_t lblk, sector_t sector, 52 74 unsigned int len) 53 75 { 54 76 const unsigned int blockbits = inode->i_blkbits; 55 77 const unsigned int blocks_per_page = 1 << (PAGE_SHIFT - blockbits); 56 - struct bio *bio; 57 - int ret, err = 0; 58 - int num_pages = 0; 59 - 60 - /* This always succeeds since __GFP_DIRECT_RECLAIM is set. 
*/ 61 - bio = bio_alloc(inode->i_sb->s_bdev, BIO_MAX_VECS, REQ_OP_WRITE, 62 - GFP_NOFS); 78 + struct fscrypt_zero_done done = { 79 + .pending = ATOMIC_INIT(1), 80 + .done = COMPLETION_INITIALIZER_ONSTACK(done.done), 81 + }; 63 82 64 83 while (len) { 65 - unsigned int blocks_this_page = min(len, blocks_per_page); 66 - unsigned int bytes_this_page = blocks_this_page << blockbits; 84 + struct bio *bio; 85 + unsigned int n; 67 86 68 - if (num_pages == 0) { 69 - fscrypt_set_bio_crypt_ctx(bio, inode, lblk, GFP_NOFS); 70 - bio->bi_iter.bi_sector = 71 - pblk << (blockbits - SECTOR_SHIFT); 87 + bio = bio_alloc(inode->i_sb->s_bdev, BIO_MAX_VECS, REQ_OP_WRITE, 88 + GFP_NOFS); 89 + bio->bi_iter.bi_sector = sector; 90 + bio->bi_private = &done; 91 + bio->bi_end_io = fscrypt_zeroout_range_end_io; 92 + fscrypt_set_bio_crypt_ctx(bio, inode, lblk, GFP_NOFS); 93 + 94 + for (n = 0; n < BIO_MAX_VECS; n++) { 95 + unsigned int blocks_this_page = 96 + min(len, blocks_per_page); 97 + unsigned int bytes_this_page = blocks_this_page << blockbits; 98 + 99 + __bio_add_page(bio, ZERO_PAGE(0), bytes_this_page, 0); 100 + len -= blocks_this_page; 101 + lblk += blocks_this_page; 102 + sector += (bytes_this_page >> SECTOR_SHIFT); 103 + if (!len || !fscrypt_mergeable_bio(bio, inode, lblk)) 104 + break; 72 105 } 73 - ret = bio_add_page(bio, ZERO_PAGE(0), bytes_this_page, 0); 74 - if (WARN_ON_ONCE(ret != bytes_this_page)) { 75 - err = -EIO; 76 - goto out; 77 - } 78 - num_pages++; 79 - len -= blocks_this_page; 80 - lblk += blocks_this_page; 81 - pblk += blocks_this_page; 82 - if (num_pages == BIO_MAX_VECS || !len || 83 - !fscrypt_mergeable_bio(bio, inode, lblk)) { 84 - err = submit_bio_wait(bio); 85 - if (err) 86 - goto out; 87 - bio_reset(bio, inode->i_sb->s_bdev, REQ_OP_WRITE); 88 - num_pages = 0; 89 - } 106 + 107 + atomic_inc(&done.pending); 108 + blk_crypto_submit_bio(bio); 90 109 } 91 - out: 92 - bio_put(bio); 93 - return err; 110 + 111 + fscrypt_zeroout_range_done(&done); 112 + 113 + 
wait_for_completion(&done.done); 114 + return blk_status_to_errno(done.status); 94 115 } 95 116 96 117 /** ··· 153 132 return 0; 154 133 155 134 if (fscrypt_inode_uses_inline_crypto(inode)) 156 - return fscrypt_zeroout_range_inline_crypt(inode, lblk, pblk, 135 + return fscrypt_zeroout_range_inline_crypt(inode, lblk, sector, 157 136 len); 158 137 159 138 BUILD_BUG_ON(ARRAY_SIZE(pages) > BIO_MAX_VECS);
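
Note on the hunk above: the rewritten zeroout path keeps several bios in flight and waits once at the end. The counter starts at 1 (the submitter's own reference), each submitted bio adds one, and the waiter is woken when the count drops to zero. The following is a minimal userspace sketch of that counting scheme, assuming C11 atomics in place of the kernel's `atomic_t`, `cmpxchg()` and `struct completion`; all `zero_done_*` names are hypothetical stand-ins and not kernel API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct fscrypt_zero_done. */
struct zero_done {
	atomic_int pending;     /* submitter's reference + one per in-flight bio */
	atomic_int status;      /* first non-zero error wins */
	atomic_bool completed;  /* stands in for struct completion */
};

static void zero_done_init(struct zero_done *d)
{
	atomic_init(&d->pending, 1);    /* reference held by the submitter */
	atomic_init(&d->status, 0);
	atomic_init(&d->completed, false);
}

/* Called once per bio, just before it is submitted. */
static void zero_done_submit(struct zero_done *d)
{
	atomic_fetch_add(&d->pending, 1);
}

/* Drop one reference; the last one signals completion. */
static void zero_done_put(struct zero_done *d)
{
	/* atomic_fetch_sub() returns the value before the decrement. */
	if (atomic_fetch_sub(&d->pending, 1) == 1)
		atomic_store(&d->completed, true);
}

/* Per-bio completion: record the first error, then drop the reference. */
static void zero_done_end_io(struct zero_done *d, int status)
{
	int expected = 0;

	if (status)
		atomic_compare_exchange_strong(&d->status, &expected, status);
	zero_done_put(d);
}
```

Holding the initial reference until the submission loop has finished means completion cannot be signalled early, even if every bio finishes before the loop ends.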

+2 -1

fs/ext4/page-io.c

··· 7 7 * Written by Theodore Ts'o, 2010. 8 8 */ 9 9 10 + #include <linux/blk-crypto.h> 10 11 #include <linux/fs.h> 11 12 #include <linux/time.h> 12 13 #include <linux/highuid.h> ··· 402 401 if (bio) { 403 402 if (io->io_wbc->sync_mode == WB_SYNC_ALL) 404 403 io->io_bio->bi_opf |= REQ_SYNC; 405 - submit_bio(io->io_bio); 404 + blk_crypto_submit_bio(io->io_bio); 406 405 } 407 406 io->io_bio = NULL; 408 407 }

+5 -4

fs/ext4/readpage.c

··· 36 36 #include <linux/bio.h> 37 37 #include <linux/fs.h> 38 38 #include <linux/buffer_head.h> 39 + #include <linux/blk-crypto.h> 39 40 #include <linux/blkdev.h> 40 41 #include <linux/highmem.h> 41 42 #include <linux/prefetch.h> ··· 346 345 if (bio && (last_block_in_bio != first_block - 1 || 347 346 !fscrypt_mergeable_bio(bio, inode, next_block))) { 348 347 submit_and_realloc: 349 - submit_bio(bio); 348 + blk_crypto_submit_bio(bio); 350 349 bio = NULL; 351 350 } 352 351 if (bio == NULL) { ··· 372 371 if (((map.m_flags & EXT4_MAP_BOUNDARY) && 373 372 (relative_block == map.m_len)) || 374 373 (first_hole != blocks_per_folio)) { 375 - submit_bio(bio); 374 + blk_crypto_submit_bio(bio); 376 375 bio = NULL; 377 376 } else 378 377 last_block_in_bio = first_block + blocks_per_folio - 1; 379 378 continue; 380 379 confused: 381 380 if (bio) { 382 - submit_bio(bio); 381 + blk_crypto_submit_bio(bio); 383 382 bio = NULL; 384 383 } 385 384 if (!folio_test_uptodate(folio)) ··· 390 389 ; /* A label shall be followed by a statement until C23 */ 391 390 } 392 391 if (bio) 393 - submit_bio(bio); 392 + blk_crypto_submit_bio(bio); 394 393 return 0; 395 394 } 396 395

+2 -2

fs/f2fs/data.c

··· 513 513 trace_f2fs_submit_read_bio(sbi->sb, type, bio); 514 514 515 515 iostat_update_submit_ctx(bio, type); 516 - submit_bio(bio); 516 + blk_crypto_submit_bio(bio); 517 517 } 518 518 519 519 static void f2fs_submit_write_bio(struct f2fs_sb_info *sbi, struct bio *bio, ··· 522 522 WARN_ON_ONCE(is_read_io(bio_op(bio))); 523 523 trace_f2fs_submit_write_bio(sbi->sb, type, bio); 524 524 iostat_update_submit_ctx(bio, type); 525 - submit_bio(bio); 525 + blk_crypto_submit_bio(bio); 526 526 } 527 527 528 528 static void __submit_merged_bio(struct f2fs_bio_info *io)

+2 -1

fs/f2fs/file.c

··· 5 5 * Copyright (c) 2012 Samsung Electronics Co., Ltd. 6 6 * http://www.samsung.com/ 7 7 */ 8 + #include <linux/blk-crypto.h> 8 9 #include <linux/fs.h> 9 10 #include <linux/f2fs_fs.h> 10 11 #include <linux/stat.h> ··· 5048 5047 enum temp_type temp = f2fs_get_segment_temp(sbi, type); 5049 5048 5050 5049 bio->bi_write_hint = f2fs_io_type_to_rw_hint(sbi, DATA, temp); 5051 - submit_bio(bio); 5050 + blk_crypto_submit_bio(bio); 5052 5051 } 5053 5052 5054 5053 static const struct iomap_dio_ops f2fs_iomap_dio_write_ops = {

+2 -1

fs/iomap/direct-io.c

··· 3 3 * Copyright (C) 2010 Red Hat, Inc. 4 4 * Copyright (c) 2016-2025 Christoph Hellwig. 5 5 */ 6 + #include <linux/blk-crypto.h> 6 7 #include <linux/fscrypt.h> 7 8 #include <linux/pagemap.h> 8 9 #include <linux/iomap.h> ··· 76 75 dio->dops->submit_io(iter, bio, pos); 77 76 } else { 78 77 WARN_ON_ONCE(iter->iomap.flags & IOMAP_F_ANON_WRITE); 79 - submit_bio(bio); 78 + blk_crypto_submit_bio(bio); 80 79 } 81 80 } 82 81

-6

include/linux/bio.h

··· 256 256 return page_folio(bio_first_page_all(bio)); 257 257 } 258 258 259 - static inline struct bio_vec *bio_last_bvec_all(struct bio *bio) 260 - { 261 - WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)); 262 - return &bio->bi_io_vec[bio->bi_vcnt - 1]; 263 - } 264 - 265 259 /** 266 260 * struct folio_iter - State for iterating all folios in a bio. 267 261 * @folio: The current folio we're iterating. NULL after the last folio.

+32

include/linux/blk-crypto.h

··· 132 132 return bio->bi_crypt_context; 133 133 } 134 134 135 + static inline struct bio_crypt_ctx *bio_crypt_ctx(struct bio *bio) 136 + { 137 + return bio->bi_crypt_context; 138 + } 139 + 135 140 void bio_crypt_set_ctx(struct bio *bio, const struct blk_crypto_key *key, 136 141 const u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE], 137 142 gfp_t gfp_mask); ··· 174 169 return false; 175 170 } 176 171 172 + static inline struct bio_crypt_ctx *bio_crypt_ctx(struct bio *bio) 173 + { 174 + return NULL; 175 + } 176 + 177 177 #endif /* CONFIG_BLK_INLINE_ENCRYPTION */ 178 + 179 + bool __blk_crypto_submit_bio(struct bio *bio); 180 + 181 + /** 182 + * blk_crypto_submit_bio - Submit a bio that may have a crypto context 183 + * @bio: bio to submit 184 + * 185 + * If @bio has no crypto context, or the crypt context attached to @bio is 186 + * supported by the underlying device's inline encryption hardware, just submit 187 + * @bio. 188 + * 189 + * Otherwise, try to perform en/decryption for this bio by falling back to the 190 + * kernel crypto API. For encryption this means submitting newly allocated 191 + * bios for the encrypted payload while keeping back the source bio until they 192 + * complete, while for reads the decryption happens in-place by a hooked in 193 + * completion handler. 194 + */ 195 + static inline void blk_crypto_submit_bio(struct bio *bio) 196 + { 197 + if (!bio_has_crypt_ctx(bio) || __blk_crypto_submit_bio(bio)) 198 + submit_bio(bio); 199 + } 178 200 179 201 int __bio_crypt_clone(struct bio *dst, struct bio *src, gfp_t gfp_mask); 180 202 /**
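
Note on the hunk above: the new inline helper submits the bio directly when there is no crypto context, and otherwise lets `__blk_crypto_submit_bio()` decide whether the caller still has to submit. The sketch below models only that branch structure in userspace. Every `mock_*` name is hypothetical, and the mock assumes the slow path returns false whenever the fallback takes the bio over; the real function's behaviour (for example for reads) may differ.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct bio, reduced to what the branch needs. */
struct mock_bio {
	bool has_crypt_ctx;     /* mirrors bio_has_crypt_ctx() */
	bool hw_supports_ctx;   /* device can handle the context inline */
	int submitted;          /* times the "submit_bio" path was taken */
	int fallback_handled;   /* times the software fallback took over */
};

static void mock_submit_bio(struct mock_bio *bio)
{
	bio->submitted++;
}

/*
 * Assumed contract: return true when the caller should submit the bio
 * itself, false when the fallback has taken ownership of it.
 */
static bool mock_blk_crypto_submit_bio_slow(struct mock_bio *bio)
{
	if (bio->hw_supports_ctx)
		return true;
	bio->fallback_handled++;
	return false;
}

/* Same shape as the inline helper in the hunk above. */
static void mock_blk_crypto_submit_bio(struct mock_bio *bio)
{
	if (!bio->has_crypt_ctx || mock_blk_crypto_submit_bio_slow(bio))
		mock_submit_bio(bio);
}
```

The short-circuit `||` keeps the common case (no crypto context) free of an out-of-line call.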

+3 -3

include/linux/blk-integrity.h

··· 91 91 return bio_integrity_intervals(bi, sectors) * bi->metadata_size; 92 92 } 93 93 94 - static inline bool blk_integrity_rq(struct request *rq) 94 + static inline bool blk_integrity_rq(const struct request *rq) 95 95 { 96 96 return rq->cmd_flags & REQ_INTEGRITY; 97 97 } ··· 168 168 { 169 169 return 0; 170 170 } 171 - static inline int blk_integrity_rq(struct request *rq) 171 + static inline bool blk_integrity_rq(const struct request *rq) 172 172 { 173 - return 0; 173 + return false; 174 174 } 175 175 176 176 static inline struct bio_vec rq_integrity_vec(struct request *rq)

+1 -1

include/linux/blk-mq-dma.h

··· 28 28 bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev, 29 29 struct dma_iova_state *state, struct blk_dma_iter *iter); 30 30 bool blk_rq_dma_map_iter_next(struct request *req, struct device *dma_dev, 31 - struct dma_iova_state *state, struct blk_dma_iter *iter); 31 + struct blk_dma_iter *iter); 32 32 33 33 /** 34 34 * blk_rq_dma_map_coalesce - were all segments coalesced?

+3 -1

include/linux/blk-mq.h

··· 13 13 14 14 struct blk_mq_tags; 15 15 struct blk_flush_queue; 16 + struct io_comp_batch; 16 17 17 18 #define BLKDEV_MIN_RQ 4 18 19 #define BLKDEV_DEFAULT_RQ 128 ··· 23 22 RQ_END_IO_FREE, 24 23 }; 25 24 26 - typedef enum rq_end_io_ret (rq_end_io_fn)(struct request *, blk_status_t); 25 + typedef enum rq_end_io_ret (rq_end_io_fn)(struct request *, blk_status_t, 26 + const struct io_comp_batch *); 27 27 28 28 /* 29 29 * request flags */

+2 -2

include/linux/blk_types.h

··· 232 232 233 233 atomic_t __bi_remaining; 234 234 235 + /* The actual vec list, preserved by bio_reset() */ 236 + struct bio_vec *bi_io_vec; 235 237 struct bvec_iter bi_iter; 236 238 237 239 union { ··· 276 274 unsigned short bi_max_vecs; /* max bvl_vecs we can hold */ 277 275 278 276 atomic_t __bi_cnt; /* pin count */ 279 - 280 - struct bio_vec *bi_io_vec; /* the actual vec list */ 281 277 282 278 struct bio_set *bi_pool; 283 279 };

+15 -9

include/linux/blkdev.h

··· 340 340 /* skip this queue in blk_mq_(un)quiesce_tagset */ 341 341 #define BLK_FEAT_SKIP_TAGSET_QUIESCE ((__force blk_features_t)(1u << 13)) 342 342 343 + /* atomic writes enabled */ 344 + #define BLK_FEAT_ATOMIC_WRITES ((__force blk_features_t)(1u << 14)) 345 + 343 346 /* undocumented magic for bcache */ 344 347 #define BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE \ 345 348 ((__force blk_features_t)(1u << 15)) 346 - 347 - /* atomic writes enabled */ 348 - #define BLK_FEAT_ATOMIC_WRITES \ 349 - ((__force blk_features_t)(1u << 16)) 350 349 351 350 /* 352 351 * Flags automatically inherited when stacking limits. ··· 550 551 /* 551 552 * queue settings 552 553 */ 553 - unsigned long nr_requests; /* Max # of requests */ 554 + unsigned int nr_requests; /* Max # of requests */ 555 + unsigned int async_depth; /* Max # of async requests */ 554 556 555 557 #ifdef CONFIG_BLK_INLINE_ENCRYPTION 556 558 struct blk_crypto_profile *crypto_profile; ··· 681 681 #define blk_queue_nomerges(q) test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags) 682 682 #define blk_queue_noxmerges(q) \ 683 683 test_bit(QUEUE_FLAG_NOXMERGES, &(q)->queue_flags) 684 - #define blk_queue_nonrot(q) (!((q)->limits.features & BLK_FEAT_ROTATIONAL)) 684 + #define blk_queue_rot(q) ((q)->limits.features & BLK_FEAT_ROTATIONAL) 685 685 #define blk_queue_io_stat(q) ((q)->limits.features & BLK_FEAT_IO_STAT) 686 686 #define blk_queue_passthrough_stat(q) \ 687 687 ((q)->limits.flags & BLK_FLAG_IOSTATS_PASSTHROUGH) ··· 1026 1026 extern void blk_queue_exit(struct request_queue *q); 1027 1027 extern void blk_sync_queue(struct request_queue *q); 1028 1028 1029 - /* Helper to convert REQ_OP_XXX to its string format XXX */ 1029 + /* Convert a request operation REQ_OP_name into the string "name" */ 1030 1030 extern const char *blk_op_str(enum req_op op); 1031 1031 1032 1032 int blk_status_to_errno(blk_status_t status); ··· 1044 1044 return bdev->bd_queue; /* this is never NULL */ 1045 1045 } 1046 1046 1047 - /* Helper to convert 
BLK_ZONE_ZONE_XXX to its string format XXX */ 1047 + /* Convert a zone condition BLK_ZONE_COND_name into the string "name" */ 1048 1048 const char *blk_zone_cond_str(enum blk_zone_cond zone_cond); 1049 1049 1050 1050 static inline unsigned int bio_zone_no(struct bio *bio) ··· 1462 1462 return bdev_limits(bdev)->max_wzeroes_unmap_sectors; 1463 1463 } 1464 1464 1465 + static inline bool bdev_rot(struct block_device *bdev) 1466 + { 1467 + return blk_queue_rot(bdev_get_queue(bdev)); 1468 + } 1469 + 1465 1470 static inline bool bdev_nonrot(struct block_device *bdev) 1466 1471 { 1467 - return blk_queue_nonrot(bdev_get_queue(bdev)); 1472 + return !bdev_rot(bdev); 1468 1473 } 1469 1474 1470 1475 static inline bool bdev_synchronous(struct block_device *bdev) ··· 1827 1822 struct rq_list req_list; 1828 1823 bool need_ts; 1829 1824 void (*complete)(struct io_comp_batch *); 1825 + void *poll_ctx; 1830 1826 }; 1831 1827 1832 1828 static inline bool blk_atomic_write_start_sect_aligned(sector_t sector,
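
Note on the hunk above: the positive predicate (`blk_queue_rot()` / `bdev_rot()`) becomes the primitive, and `bdev_nonrot()` is now defined as its negation. A small userspace sketch of that relationship follows; the `mock_*` names and the bit position chosen for the rotational flag are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

#define MOCK_FEAT_ROTATIONAL (1u << 0)  /* hypothetical bit position */

/* Hypothetical stand-in for the queue limits feature mask. */
struct mock_queue {
	uint32_t features;
};

/* Primitive: is the rotational feature bit set? */
static bool mock_queue_rot(const struct mock_queue *q)
{
	return (q->features & MOCK_FEAT_ROTATIONAL) != 0;
}

/* Derived: defined purely as the negation of the primitive. */
static bool mock_queue_nonrot(const struct mock_queue *q)
{
	return !mock_queue_rot(q);
}
```

With this shape, call sites such as the nvmet ones earlier in the diff can test for rotational media without a double negation.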

+5

include/linux/types.h

··· 171 171 typedef u32 phys_addr_t; 172 172 #endif 173 173 174 + struct phys_vec { 175 + phys_addr_t paddr; 176 + size_t len; 177 + }; 178 + 174 179 typedef phys_addr_t resource_size_t; 175 180 176 181 /*

+120 -1

include/uapi/linux/ublk_cmd.h

··· 55 55 _IOWR('u', 0x15, struct ublksrv_ctrl_cmd) 56 56 #define UBLK_U_CMD_QUIESCE_DEV \ 57 57 _IOWR('u', 0x16, struct ublksrv_ctrl_cmd) 58 - 58 + #define UBLK_U_CMD_TRY_STOP_DEV \ 59 + _IOWR('u', 0x17, struct ublksrv_ctrl_cmd) 59 60 /* 60 61 * 64bits are enough now, and it should be easy to extend in case of 61 62 * running out of feature flags ··· 104 103 #define UBLK_U_IO_UNREGISTER_IO_BUF \ 105 104 _IOWR('u', 0x24, struct ublksrv_io_cmd) 106 105 106 + /* 107 + * return 0 if the command is run successfully, otherwise failure code 108 + * is returned 109 + */ 110 + #define UBLK_U_IO_PREP_IO_CMDS \ 111 + _IOWR('u', 0x25, struct ublk_batch_io) 112 + /* 113 + * If failure code is returned, nothing in the command buffer is handled. 114 + * Otherwise, the returned value means how many bytes in command buffer 115 + * are handled actually, then number of handled IOs can be calculated with 116 + * `elem_bytes` for each IO. IOs in the remained bytes are not committed, 117 + * userspace has to check return value for dealing with partial committing 118 + * correctly. 119 + */ 120 + #define UBLK_U_IO_COMMIT_IO_CMDS \ 121 + _IOWR('u', 0x26, struct ublk_batch_io) 122 + 123 + /* 124 + * Fetch io commands to provided buffer in multishot style, 125 + * `IORING_URING_CMD_MULTISHOT` is required for this command. 
126 + */ 127 + #define UBLK_U_IO_FETCH_IO_CMDS \ 128 + _IOWR('u', 0x27, struct ublk_batch_io) 129 + 107 130 /* only ABORT means that no re-fetch */ 108 131 #define UBLK_IO_RES_OK 0 109 132 #define UBLK_IO_RES_NEED_GET_DATA 1 ··· 158 133 159 134 #define UBLKSRV_IO_BUF_TOTAL_BITS (UBLK_QID_OFF + UBLK_QID_BITS) 160 135 #define UBLKSRV_IO_BUF_TOTAL_SIZE (1ULL << UBLKSRV_IO_BUF_TOTAL_BITS) 136 + 137 + /* Copy to/from request integrity buffer instead of data buffer */ 138 + #define UBLK_INTEGRITY_FLAG_OFF 62 139 + #define UBLKSRV_IO_INTEGRITY_FLAG (1ULL << UBLK_INTEGRITY_FLAG_OFF) 161 140 162 141 /* 163 142 * ublk server can register data buffers for incoming I/O requests with a sparse ··· 340 311 */ 341 312 #define UBLK_F_BUF_REG_OFF_DAEMON (1ULL << 14) 342 313 314 + /* 315 + * Support the following commands for delivering & committing io command 316 + * in batch. 317 + * 318 + * - UBLK_U_IO_PREP_IO_CMDS 319 + * - UBLK_U_IO_COMMIT_IO_CMDS 320 + * - UBLK_U_IO_FETCH_IO_CMDS 321 + * - UBLK_U_IO_REGISTER_IO_BUF 322 + * - UBLK_U_IO_UNREGISTER_IO_BUF 323 + * 324 + * The existing UBLK_U_IO_FETCH_REQ, UBLK_U_IO_COMMIT_AND_FETCH_REQ and 325 + * UBLK_U_IO_NEED_GET_DATA uring_cmd are not supported for this feature. 326 + */ 327 + #define UBLK_F_BATCH_IO (1ULL << 15) 328 + 329 + /* 330 + * ublk device supports requests with integrity/metadata buffer. 331 + * Requires UBLK_F_USER_COPY. 332 + */ 333 + #define UBLK_F_INTEGRITY (1ULL << 16) 334 + 335 + /* 336 + * The device supports the UBLK_CMD_TRY_STOP_DEV command, which 337 + * allows stopping the device only if there are no openers. 338 + */ 339 + #define UBLK_F_SAFE_STOP_DEV (1ULL << 17) 340 + 341 + /* Disable automatic partition scanning when device is started */ 342 + #define UBLK_F_NO_AUTO_PART_SCAN (1ULL << 18) 343 + 343 344 /* device state */ 344 345 #define UBLK_S_DEV_DEAD 0 345 346 #define UBLK_S_DEV_LIVE 1 ··· 467 408 * passed in. 
468 409 */ 469 410 #define UBLK_IO_F_NEED_REG_BUF (1U << 17) 411 + /* Request has an integrity data buffer */ 412 + #define UBLK_IO_F_INTEGRITY (1UL << 18) 470 413 471 414 /* 472 415 * io cmd is described by this structure, and stored in share memory, indexed ··· 586 525 }; 587 526 }; 588 527 528 + struct ublk_elem_header { 529 + __u16 tag; /* IO tag */ 530 + 531 + /* 532 + * Buffer index for incoming io command, only valid iff 533 + * UBLK_F_AUTO_BUF_REG is set 534 + */ 535 + __u16 buf_index; 536 + __s32 result; /* I/O completion result (commit only) */ 537 + }; 538 + 539 + /* 540 + * uring_cmd buffer structure for batch commands 541 + * 542 + * buffer includes multiple elements, which number is specified by 543 + * `nr_elem`. Each element buffer is organized in the following order: 544 + * 545 + * struct ublk_elem_buffer { 546 + * // Mandatory fields (8 bytes) 547 + * struct ublk_elem_header header; 548 + * 549 + * // Optional fields (8 bytes each, included based on flags) 550 + * 551 + * // Buffer address (if UBLK_BATCH_F_HAS_BUF_ADDR) for copying data 552 + * // between ublk request and ublk server buffer 553 + * __u64 buf_addr; 554 + * 555 + * // returned Zone append LBA (if UBLK_BATCH_F_HAS_ZONE_LBA) 556 + * __u64 zone_lba; 557 + * } 558 + * 559 + * Used for `UBLK_U_IO_PREP_IO_CMDS` and `UBLK_U_IO_COMMIT_IO_CMDS` 560 + */ 561 + struct ublk_batch_io { 562 + __u16 q_id; 563 + #define UBLK_BATCH_F_HAS_ZONE_LBA (1 << 0) 564 + #define UBLK_BATCH_F_HAS_BUF_ADDR (1 << 1) 565 + #define UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK (1 << 2) 566 + __u16 flags; 567 + __u16 nr_elem; 568 + __u8 elem_bytes; 569 + __u8 reserved; 570 + __u64 reserved2; 571 + }; 572 + 589 573 struct ublk_param_basic { 590 574 #define UBLK_ATTR_READ_ONLY (1 << 0) 591 575 #define UBLK_ATTR_ROTATIONAL (1 << 1) ··· 706 600 __u8 pad[2]; 707 601 }; 708 602 603 + struct ublk_param_integrity { 604 + __u32 flags; /* LBMD_PI_CAP_* from linux/fs.h */ 605 + __u16 max_integrity_segments; /* 0 means no limit */ 606 + 
__u8 interval_exp; 607 + __u8 metadata_size; /* UBLK_PARAM_TYPE_INTEGRITY requires nonzero */ 608 + __u8 pi_offset; 609 + __u8 csum_type; /* LBMD_PI_CSUM_* from linux/fs.h */ 610 + __u8 tag_size; 611 + __u8 pad[5]; 612 + }; 613 + 709 614 struct ublk_params { 710 615 /* 711 616 * Total length of parameters, userspace has to set 'len' for both ··· 731 614 #define UBLK_PARAM_TYPE_ZONED (1 << 3) 732 615 #define UBLK_PARAM_TYPE_DMA_ALIGN (1 << 4) 733 616 #define UBLK_PARAM_TYPE_SEGMENT (1 << 5) 617 + #define UBLK_PARAM_TYPE_INTEGRITY (1 << 6) /* requires UBLK_F_INTEGRITY */ 734 618 __u32 types; /* types of parameter included */ 735 619 736 620 struct ublk_param_basic basic; ··· 740 622 struct ublk_param_zoned zoned; 741 623 struct ublk_param_dma_align dma; 742 624 struct ublk_param_segment seg; 625 + struct ublk_param_integrity integrity; 743 626 }; 744 627 745 628 #endif
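
Note on the hunk above: the header comments describe each batch element as an 8-byte mandatory header plus 8 bytes per optional field, and say that a successful `UBLK_U_IO_COMMIT_IO_CMDS` returns the number of bytes handled. The sketch below works through that arithmetic in userspace. The two flag values are copied from the hunk; the helper names `batch_elem_bytes()` and `batch_handled_ios()` are hypothetical and do not exist in the ublk headers or selftests.

```c
#include <stdint.h>

/* Flag values as defined in the hunk above. */
#define UBLK_BATCH_F_HAS_ZONE_LBA (1 << 0)
#define UBLK_BATCH_F_HAS_BUF_ADDR (1 << 1)

/* Hypothetical helper: element size implied by the batch flags. */
static unsigned int batch_elem_bytes(uint16_t flags)
{
	unsigned int bytes = 8;         /* header: tag, buf_index, result */

	if (flags & UBLK_BATCH_F_HAS_BUF_ADDR)
		bytes += 8;             /* __u64 buf_addr */
	if (flags & UBLK_BATCH_F_HAS_ZONE_LBA)
		bytes += 8;             /* __u64 zone_lba */
	return bytes;
}

/*
 * Hypothetical helper: number of IOs covered by a commit result.
 * Returns -1 for a failure code (nothing in the buffer was handled).
 */
static int batch_handled_ios(int ret, unsigned int elem_bytes)
{
	if (ret < 0 || elem_bytes == 0)
		return -1;
	return ret / (int)elem_bytes;
}
```

A caller that submitted `nr_elem` elements would compare the result with `nr_elem` to detect a partial commit and resubmit the remainder.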

-11

io_uring/rsrc.c

··· 1055 1055 1056 1056 iov_iter_bvec(iter, ddir, imu->bvec, imu->nr_bvecs, count); 1057 1057 iov_iter_advance(iter, offset); 1058 - 1059 - if (count < imu->len) { 1060 - const struct bio_vec *bvec = iter->bvec; 1061 - 1062 - len += iter->iov_offset; 1063 - while (len > bvec->bv_len) { 1064 - len -= bvec->bv_len; 1065 - bvec++; 1066 - } 1067 - iter->nr_segs = 1 + bvec - iter->bvec; 1068 - } 1069 1058 return 0; 1070 1059 } 1071 1060

+6

io_uring/rw.c

··· 1329 1329 int nr_events = 0; 1330 1330 1331 1331 /* 1332 + * Store the polling io_ring_ctx so drivers can detect if they're 1333 + * completing a request in the same ring context that's polling. 1334 + */ 1335 + iob.poll_ctx = ctx; 1336 + 1337 + /* 1332 1338 * Only spin for completions if we don't have multiple devices hanging 1333 1339 * off our complete list. 1334 1340 */
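
Note on the hunk above: together with the new third argument to `rq_end_io_fn`, storing the polling context in the batch lets an end_io callback tell whether it is running under the poller of its own ring. The sketch below shows one way such a check could look. All `mock_*` types are hypothetical userspace stand-ins, and treating a NULL batch as "not polled" is an assumption rather than something stated in the diff.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types involved. */
struct mock_io_comp_batch {
	void *poll_ctx;                 /* ring context doing the polling */
};

struct mock_request {
	void *owner_ctx;                /* ring context that issued the request */
	bool completed_by_owner_poll;   /* result of the check below */
};

enum mock_rq_end_io_ret {
	MOCK_RQ_END_IO_NONE,
	MOCK_RQ_END_IO_FREE,
};

/* Callback with the same three-argument shape as the new rq_end_io_fn. */
static enum mock_rq_end_io_ret
mock_end_io(struct mock_request *rq, int error,
	    const struct mock_io_comp_batch *iob)
{
	(void)error;

	/* Assumption: a NULL batch means no polling context is known. */
	rq->completed_by_owner_poll = iob && iob->poll_ctx == rq->owner_ctx;
	return MOCK_RQ_END_IO_NONE;
}
```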

+1 -1

kernel/trace/blktrace.c

··· 793 793 return PTR_ERR(bt); 794 794 } 795 795 blk_trace_setup_finalize(q, name, 1, bt, &buts2); 796 - strcpy(buts.name, buts2.name); 796 + strscpy(buts.name, buts2.name, BLKTRACE_BDEV_SIZE); 797 797 mutex_unlock(&q->debugfs_mutex); 798 798 799 799 if (copy_to_user(arg, &buts, sizeof(buts))) {
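
Note on the hunk above: replacing `strcpy()` with a size-bounded copy guarantees that the destination is NUL-terminated and never overrun. The sketch below is a userspace approximation of that behaviour, not the kernel's `strscpy()` implementation; `mock_strscpy()` and `MOCK_E2BIG` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

#define MOCK_E2BIG 7    /* stand-in for the kernel's E2BIG */

/*
 * Copy at most size - 1 characters and always NUL-terminate.
 * Returns the number of characters copied, or -MOCK_E2BIG when the
 * source did not fit (or size is zero).
 */
static long mock_strscpy(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -MOCK_E2BIG;

	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';

	/* If the source has characters left, the copy was truncated. */
	return src[i] == '\0' ? (long)i : -MOCK_E2BIG;
}
```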

+1 -2

rust/kernel/block/mq/gen_disk.rs

··· 107 107 drop(unsafe { T::QueueData::from_foreign(data) }); 108 108 }); 109 109 110 - // SAFETY: `bindings::queue_limits` contain only fields that are valid when zeroed. 111 - let mut lim: bindings::queue_limits = unsafe { core::mem::zeroed() }; 110 + let mut lim: bindings::queue_limits = pin_init::zeroed(); 112 111 113 112 lim.logical_block_size = self.logical_block_size; 114 113 lim.physical_block_size = self.physical_block_size;

+1 -3

rust/kernel/block/mq/tag_set.rs

··· 38 38 num_tags: u32, 39 39 num_maps: u32, 40 40 ) -> impl PinInit<Self, error::Error> { 41 - // SAFETY: `blk_mq_tag_set` only contains integers and pointers, which 42 - // all are allowed to be 0. 43 - let tag_set: bindings::blk_mq_tag_set = unsafe { core::mem::zeroed() }; 41 + let tag_set: bindings::blk_mq_tag_set = pin_init::zeroed(); 44 42 let tag_set: Result<_> = core::mem::size_of::<RequestDataWrapper>() 45 43 .try_into() 46 44 .map(|cmd_size| {

+4 -2

tools/testing/selftests/ublk/.gitignore

··· 1 - kublk 2 - /tools 1 + # SPDX-License-Identifier: GPL-2.0 3 2 *-verify.state 3 + /tools 4 + kublk 5 + metadata_size

+61 -9

tools/testing/selftests/ublk/Makefile

··· 7 7 8 8 LDLIBS += -lpthread -lm -luring 9 9 10 - TEST_PROGS := test_generic_01.sh 11 - TEST_PROGS += test_generic_02.sh 10 + TEST_PROGS := test_generic_02.sh 12 11 TEST_PROGS += test_generic_03.sh 13 - TEST_PROGS += test_generic_04.sh 14 - TEST_PROGS += test_generic_05.sh 15 12 TEST_PROGS += test_generic_06.sh 16 13 TEST_PROGS += test_generic_07.sh 17 14 18 15 TEST_PROGS += test_generic_08.sh 19 16 TEST_PROGS += test_generic_09.sh 20 17 TEST_PROGS += test_generic_10.sh 21 - TEST_PROGS += test_generic_11.sh 22 18 TEST_PROGS += test_generic_12.sh 23 19 TEST_PROGS += test_generic_13.sh 24 - TEST_PROGS += test_generic_14.sh 25 - TEST_PROGS += test_generic_15.sh 20 + TEST_PROGS += test_generic_16.sh 21 + 22 + TEST_PROGS += test_batch_01.sh 23 + TEST_PROGS += test_batch_02.sh 24 + TEST_PROGS += test_batch_03.sh 26 25 27 26 TEST_PROGS += test_null_01.sh 28 27 TEST_PROGS += test_null_02.sh ··· 33 34 TEST_PROGS += test_loop_05.sh 34 35 TEST_PROGS += test_loop_06.sh 35 36 TEST_PROGS += test_loop_07.sh 37 + 38 + TEST_PROGS += test_integrity_01.sh 39 + TEST_PROGS += test_integrity_02.sh 40 + 41 + TEST_PROGS += test_recover_01.sh 42 + TEST_PROGS += test_recover_02.sh 43 + TEST_PROGS += test_recover_03.sh 44 + TEST_PROGS += test_recover_04.sh 36 45 TEST_PROGS += test_stripe_01.sh 37 46 TEST_PROGS += test_stripe_02.sh 38 47 TEST_PROGS += test_stripe_03.sh 39 48 TEST_PROGS += test_stripe_04.sh 40 49 TEST_PROGS += test_stripe_05.sh 41 50 TEST_PROGS += test_stripe_06.sh 51 + 52 + TEST_PROGS += test_part_01.sh 53 + TEST_PROGS += test_part_02.sh 42 54 43 55 TEST_PROGS += test_stress_01.sh 44 56 TEST_PROGS += test_stress_02.sh ··· 58 48 TEST_PROGS += test_stress_05.sh 59 49 TEST_PROGS += test_stress_06.sh 60 50 TEST_PROGS += test_stress_07.sh 51 + TEST_PROGS += test_stress_08.sh 52 + TEST_PROGS += test_stress_09.sh 61 53 62 - TEST_GEN_PROGS_EXTENDED = kublk 54 + TEST_FILES := settings 55 + 56 + TEST_GEN_PROGS_EXTENDED = kublk metadata_size 57 + STANDALONE_UTILS := metadata_size.c 
58 59 LOCAL_HDRS += $(wildcard *.h) 60 include ../lib.mk 61 62 - $(TEST_GEN_PROGS_EXTENDED): $(wildcard *.c) 62 + $(OUTPUT)/kublk: $(filter-out $(STANDALONE_UTILS),$(wildcard *.c)) 63 63 64 64 check: 65 65 shellcheck -x -f gcc *.sh 66 + 67 + # Test groups for running subsets of tests 68 + # JOBS=1 (default): sequential with kselftest TAP output 69 + # JOBS>1: parallel execution with xargs -P 70 + # Usage: make run_null JOBS=4 71 + JOBS ?= 1 72 + export JOBS 73 + 74 + # Auto-detect test groups from TEST_PROGS (test_<group>_<num>.sh -> group) 75 + TEST_GROUPS := $(shell echo "$(TEST_PROGS)" | tr ' ' '\n' | \ 76 + sed 's/test_\([^_]*\)_.*/\1/' | sort -u) 77 + 78 + # Template for group test targets 79 + # $(1) = group name (e.g., null, generic, stress) 80 + define RUN_GROUP 81 + run_$(1): all 82 + @if [ $$(JOBS) -gt 1 ]; then \ 83 + echo $$(filter test_$(1)_%.sh,$$(TEST_PROGS)) | tr ' ' '\n' | \ 84 + xargs -P $$(JOBS) -n1 sh -c './"$$$$0"' || true; \ 85 + else \ 86 + $$(call RUN_TESTS, $$(filter test_$(1)_%.sh,$$(TEST_PROGS))); \ 87 + fi 88 + .PHONY: run_$(1) 89 + endef 90 + 91 + # Generate targets for each discovered test group 92 + $(foreach group,$(TEST_GROUPS),$(eval $(call RUN_GROUP,$(group)))) 93 + 94 + # Run all tests (parallel when JOBS>1) 95 + run_all: all 96 + @if [ $(JOBS) -gt 1 ]; then \ 97 + echo $(TEST_PROGS) | tr ' ' '\n' | \ 98 + xargs -P $(JOBS) -n1 sh -c './"$$0"' || true; \ 99 + else \ 100 + $(call RUN_TESTS, $(TEST_PROGS)); \ 101 + fi 102 + .PHONY: run_all
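
Note on the hunk above: the Makefile derives a group name from each `test_<group>_<num>.sh` entry by taking the text between the `test_` prefix and the next underscore. The same extraction can be sketched in C as follows; `test_group_name()` is a hypothetical helper written for illustration and is not part of the selftests.

```c
#include <stddef.h>
#include <string.h>

/*
 * Extract <group> from "test_<group>_<num>.sh" into out.
 * Returns 0 on success, -1 if the name does not match the pattern
 * or the output buffer is too small.
 */
static int test_group_name(const char *prog, char *out, size_t out_size)
{
	static const char prefix[] = "test_";
	const char *start, *end;
	size_t len;

	if (strncmp(prog, prefix, sizeof(prefix) - 1) != 0)
		return -1;

	start = prog + sizeof(prefix) - 1;
	end = strchr(start, '_');       /* underscore that ends the group */
	if (!end)
		return -1;

	len = (size_t)(end - start);
	if (len + 1 > out_size)
		return -1;

	memcpy(out, start, len);
	out[len] = '\0';
	return 0;
}
```

Applied to the `TEST_PROGS` list above, this would yield groups such as `generic`, `batch`, `null` and `stress`, which the Makefile then turns into `run_<group>` targets.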

+607

tools/testing/selftests/ublk/batch.c

··· 1 + /* SPDX-License-Identifier: MIT */ 2 + /* 3 + * Description: UBLK_F_BATCH_IO buffer management 4 + */ 5 + 6 + #include "kublk.h" 7 + 8 + static inline void *ublk_get_commit_buf(struct ublk_thread *t, 9 + unsigned short buf_idx) 10 + { 11 + unsigned idx; 12 + 13 + if (buf_idx < t->commit_buf_start || 14 + buf_idx >= t->commit_buf_start + t->nr_commit_buf) 15 + return NULL; 16 + idx = buf_idx - t->commit_buf_start; 17 + return t->commit_buf + idx * t->commit_buf_size; 18 + } 19 + 20 + /* 21 + * Allocate one buffer for UBLK_U_IO_PREP_IO_CMDS or UBLK_U_IO_COMMIT_IO_CMDS 22 + * 23 + * Buffer index is returned. 24 + */ 25 + static inline unsigned short ublk_alloc_commit_buf(struct ublk_thread *t) 26 + { 27 + int idx = allocator_get(&t->commit_buf_alloc); 28 + 29 + if (idx >= 0) 30 + return idx + t->commit_buf_start; 31 + return UBLKS_T_COMMIT_BUF_INV_IDX; 32 + } 33 + 34 + /* 35 + * Free one commit buffer which is used by UBLK_U_IO_PREP_IO_CMDS or 36 + * UBLK_U_IO_COMMIT_IO_CMDS 37 + */ 38 + static inline void ublk_free_commit_buf(struct ublk_thread *t, 39 + unsigned short i) 40 + { 41 + unsigned short idx = i - t->commit_buf_start; 42 + 43 + ublk_assert(idx < t->nr_commit_buf); 44 + ublk_assert(allocator_get_val(&t->commit_buf_alloc, idx) != 0); 45 + 46 + allocator_put(&t->commit_buf_alloc, idx); 47 + } 48 + 49 + static unsigned char ublk_commit_elem_buf_size(struct ublk_dev *dev) 50 + { 51 + if (dev->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY | UBLK_F_USER_COPY | 52 + UBLK_F_AUTO_BUF_REG)) 53 + return 8; 54 + 55 + /* one extra 8bytes for carrying buffer address */ 56 + return 16; 57 + } 58 + 59 + static unsigned ublk_commit_buf_size(struct ublk_thread *t) 60 + { 61 + struct ublk_dev *dev = t->dev; 62 + unsigned elem_size = ublk_commit_elem_buf_size(dev); 63 + unsigned int total = elem_size * dev->dev_info.queue_depth; 64 + unsigned int page_sz = getpagesize(); 65 + 66 + return round_up(total, page_sz); 67 + } 68 + 69 + static void free_batch_commit_buf(struct 
ublk_thread *t) 70 + { 71 + if (t->commit_buf) { 72 + unsigned buf_size = ublk_commit_buf_size(t); 73 + unsigned int total = buf_size * t->nr_commit_buf; 74 + 75 + munlock(t->commit_buf, total); 76 + free(t->commit_buf); 77 + } 78 + allocator_deinit(&t->commit_buf_alloc); 79 + free(t->commit); 80 + } 81 + 82 + static int alloc_batch_commit_buf(struct ublk_thread *t) 83 + { 84 + unsigned buf_size = ublk_commit_buf_size(t); 85 + unsigned int total = buf_size * t->nr_commit_buf; 86 + unsigned int page_sz = getpagesize(); 87 + void *buf = NULL; 88 + int i, ret, j = 0; 89 + 90 + t->commit = calloc(t->nr_queues, sizeof(*t->commit)); 91 + for (i = 0; i < t->dev->dev_info.nr_hw_queues; i++) { 92 + if (t->q_map[i]) 93 + t->commit[j++].q_id = i; 94 + } 95 + 96 + allocator_init(&t->commit_buf_alloc, t->nr_commit_buf); 97 + 98 + t->commit_buf = NULL; 99 + ret = posix_memalign(&buf, page_sz, total); 100 + if (ret || !buf) 101 + goto fail; 102 + 103 + t->commit_buf = buf; 104 + 105 + /* lock commit buffer pages for fast access */ 106 + if (mlock(t->commit_buf, total)) 107 + ublk_err("%s: can't lock commit buffer %s\n", __func__, 108 + strerror(errno)); 109 + 110 + return 0; 111 + 112 + fail: 113 + free_batch_commit_buf(t); 114 + return ret; 115 + } 116 + 117 + static unsigned int ublk_thread_nr_queues(const struct ublk_thread *t) 118 + { 119 + int i; 120 + int ret = 0; 121 + 122 + for (i = 0; i < t->dev->dev_info.nr_hw_queues; i++) 123 + ret += !!t->q_map[i]; 124 + 125 + return ret; 126 + } 127 + 128 + void ublk_batch_prepare(struct ublk_thread *t) 129 + { 130 + /* 131 + * We only handle single device in this thread context. 132 + * 133 + * All queues have same feature flags, so use queue 0's for 134 + * calculate uring_cmd flags. 135 + * 136 + * This way looks not elegant, but it works so far. 
137 + */ 138 + struct ublk_queue *q = &t->dev->q[0]; 139 + 140 + /* cache nr_queues because we don't support dynamic load-balance yet */ 141 + t->nr_queues = ublk_thread_nr_queues(t); 142 + 143 + t->commit_buf_elem_size = ublk_commit_elem_buf_size(t->dev); 144 + t->commit_buf_size = ublk_commit_buf_size(t); 145 + t->commit_buf_start = t->nr_bufs; 146 + t->nr_commit_buf = 2 * t->nr_queues; 147 + t->nr_bufs += t->nr_commit_buf; 148 + 149 + t->cmd_flags = 0; 150 + if (ublk_queue_use_auto_zc(q)) { 151 + if (ublk_queue_auto_zc_fallback(q)) 152 + t->cmd_flags |= UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK; 153 + } else if (!ublk_queue_no_buf(q)) 154 + t->cmd_flags |= UBLK_BATCH_F_HAS_BUF_ADDR; 155 + 156 + t->state |= UBLKS_T_BATCH_IO; 157 + 158 + ublk_log("%s: thread %d commit(nr_bufs %u, buf_size %u, start %u)\n", 159 + __func__, t->idx, 160 + t->nr_commit_buf, t->commit_buf_size, 161 + t->nr_bufs); 162 + } 163 + 164 + static void free_batch_fetch_buf(struct ublk_thread *t) 165 + { 166 + int i; 167 + 168 + for (i = 0; i < t->nr_fetch_bufs; i++) { 169 + io_uring_free_buf_ring(&t->ring, t->fetch[i].br, 1, i); 170 + munlock(t->fetch[i].fetch_buf, t->fetch[i].fetch_buf_size); 171 + free(t->fetch[i].fetch_buf); 172 + } 173 + free(t->fetch); 174 + } 175 + 176 + static int alloc_batch_fetch_buf(struct ublk_thread *t) 177 + { 178 + /* page aligned fetch buffer, and it is mlocked for speedup delivery */ 179 + unsigned pg_sz = getpagesize(); 180 + unsigned buf_size = round_up(t->dev->dev_info.queue_depth * 2, pg_sz); 181 + int ret; 182 + int i = 0; 183 + 184 + /* double fetch buffer for each queue */ 185 + t->nr_fetch_bufs = t->nr_queues * 2; 186 + t->fetch = calloc(t->nr_fetch_bufs, sizeof(*t->fetch)); 187 + 188 + /* allocate one buffer for each queue */ 189 + for (i = 0; i < t->nr_fetch_bufs; i++) { 190 + t->fetch[i].fetch_buf_size = buf_size; 191 + 192 + if (posix_memalign((void **)&t->fetch[i].fetch_buf, pg_sz, 193 + t->fetch[i].fetch_buf_size)) 194 + return -ENOMEM; 195 + 196 + /* 
lock fetch buffer page for fast fetching */ 197 + if (mlock(t->fetch[i].fetch_buf, t->fetch[i].fetch_buf_size)) 198 + ublk_err("%s: can't lock fetch buffer %s\n", __func__, 199 + strerror(errno)); 200 + t->fetch[i].br = io_uring_setup_buf_ring(&t->ring, 1, 201 + i, IOU_PBUF_RING_INC, &ret); 202 + if (!t->fetch[i].br) { 203 + ublk_err("Buffer ring register failed %d\n", ret); 204 + return ret; 205 + } 206 + } 207 + 208 + return 0; 209 + } 210 + 211 + int ublk_batch_alloc_buf(struct ublk_thread *t) 212 + { 213 + int ret; 214 + 215 + ublk_assert(t->nr_commit_buf < 2 * UBLK_MAX_QUEUES); 216 + 217 + ret = alloc_batch_commit_buf(t); 218 + if (ret) 219 + return ret; 220 + return alloc_batch_fetch_buf(t); 221 + } 222 + 223 + void ublk_batch_free_buf(struct ublk_thread *t) 224 + { 225 + free_batch_commit_buf(t); 226 + free_batch_fetch_buf(t); 227 + } 228 + 229 + static void ublk_init_batch_cmd(struct ublk_thread *t, __u16 q_id, 230 + struct io_uring_sqe *sqe, unsigned op, 231 + unsigned short elem_bytes, 232 + unsigned short nr_elem, 233 + unsigned short buf_idx) 234 + { 235 + struct ublk_batch_io *cmd; 236 + __u64 user_data; 237 + 238 + cmd = (struct ublk_batch_io *)ublk_get_sqe_cmd(sqe); 239 + 240 + ublk_set_sqe_cmd_op(sqe, op); 241 + 242 + sqe->fd = 0; /* dev->fds[0] */ 243 + sqe->opcode = IORING_OP_URING_CMD; 244 + sqe->flags = IOSQE_FIXED_FILE; 245 + 246 + cmd->q_id = q_id; 247 + cmd->flags = 0; 248 + cmd->reserved = 0; 249 + cmd->elem_bytes = elem_bytes; 250 + cmd->nr_elem = nr_elem; 251 + 252 + user_data = build_user_data(buf_idx, _IOC_NR(op), nr_elem, q_id, 0); 253 + io_uring_sqe_set_data64(sqe, user_data); 254 + 255 + t->cmd_inflight += 1; 256 + 257 + ublk_dbg(UBLK_DBG_IO_CMD, "%s: thread %u qid %d cmd_op %x data %lx " 258 + "nr_elem %u elem_bytes %u buf_size %u buf_idx %d " 259 + "cmd_inflight %u\n", 260 + __func__, t->idx, q_id, op, user_data, 261 + cmd->nr_elem, cmd->elem_bytes, 262 + nr_elem * elem_bytes, buf_idx, t->cmd_inflight); 263 + } 264 + 265 + static 
void ublk_setup_commit_sqe(struct ublk_thread *t, 266 + struct io_uring_sqe *sqe, 267 + unsigned short buf_idx) 268 + { 269 + struct ublk_batch_io *cmd; 270 + 271 + cmd = (struct ublk_batch_io *)ublk_get_sqe_cmd(sqe); 272 + 273 + /* Use plain user buffer instead of fixed buffer */ 274 + cmd->flags |= t->cmd_flags; 275 + } 276 + 277 + static void ublk_batch_queue_fetch(struct ublk_thread *t, 278 + struct ublk_queue *q, 279 + unsigned short buf_idx) 280 + { 281 + unsigned short nr_elem = t->fetch[buf_idx].fetch_buf_size / 2; 282 + struct io_uring_sqe *sqe; 283 + 284 + io_uring_buf_ring_add(t->fetch[buf_idx].br, t->fetch[buf_idx].fetch_buf, 285 + t->fetch[buf_idx].fetch_buf_size, 286 + 0, 0, 0); 287 + io_uring_buf_ring_advance(t->fetch[buf_idx].br, 1); 288 + 289 + ublk_io_alloc_sqes(t, &sqe, 1); 290 + 291 + ublk_init_batch_cmd(t, q->q_id, sqe, UBLK_U_IO_FETCH_IO_CMDS, 2, nr_elem, 292 + buf_idx); 293 + 294 + sqe->rw_flags= IORING_URING_CMD_MULTISHOT; 295 + sqe->buf_group = buf_idx; 296 + sqe->flags |= IOSQE_BUFFER_SELECT; 297 + 298 + t->fetch[buf_idx].fetch_buf_off = 0; 299 + } 300 + 301 + void ublk_batch_start_fetch(struct ublk_thread *t) 302 + { 303 + int i; 304 + int j = 0; 305 + 306 + for (i = 0; i < t->dev->dev_info.nr_hw_queues; i++) { 307 + if (t->q_map[i]) { 308 + struct ublk_queue *q = &t->dev->q[i]; 309 + 310 + /* submit two fetch commands for each queue */ 311 + ublk_batch_queue_fetch(t, q, j++); 312 + ublk_batch_queue_fetch(t, q, j++); 313 + } 314 + } 315 + } 316 + 317 + static unsigned short ublk_compl_batch_fetch(struct ublk_thread *t, 318 + struct ublk_queue *q, 319 + const struct io_uring_cqe *cqe) 320 + { 321 + unsigned short buf_idx = user_data_to_tag(cqe->user_data); 322 + unsigned start = t->fetch[buf_idx].fetch_buf_off; 323 + unsigned end = start + cqe->res; 324 + void *buf = t->fetch[buf_idx].fetch_buf; 325 + int i; 326 + 327 + if (cqe->res < 0) 328 + return buf_idx; 329 + 330 + if ((end - start) / 2 > q->q_depth) { 331 + ublk_err("%s: fetch 
duplicated ios offset %u count %u\n", __func__, start, cqe->res); 332 + 333 + for (i = start; i < end; i += 2) { 334 + unsigned short tag = *(unsigned short *)(buf + i); 335 + 336 + ublk_err("%u ", tag); 337 + } 338 + ublk_err("\n"); 339 + } 340 + 341 + for (i = start; i < end; i += 2) { 342 + unsigned short tag = *(unsigned short *)(buf + i); 343 + 344 + if (tag >= q->q_depth) 345 + ublk_err("%s: bad tag %u\n", __func__, tag); 346 + 347 + if (q->tgt_ops->queue_io) 348 + q->tgt_ops->queue_io(t, q, tag); 349 + } 350 + t->fetch[buf_idx].fetch_buf_off = end; 351 + return buf_idx; 352 + } 353 + 354 + static int __ublk_batch_queue_prep_io_cmds(struct ublk_thread *t, struct ublk_queue *q) 355 + { 356 + unsigned short nr_elem = q->q_depth; 357 + unsigned short buf_idx = ublk_alloc_commit_buf(t); 358 + struct io_uring_sqe *sqe; 359 + void *buf; 360 + int i; 361 + 362 + ublk_assert(buf_idx != UBLKS_T_COMMIT_BUF_INV_IDX); 363 + 364 + ublk_io_alloc_sqes(t, &sqe, 1); 365 + 366 + ublk_assert(nr_elem == q->q_depth); 367 + buf = ublk_get_commit_buf(t, buf_idx); 368 + for (i = 0; i < nr_elem; i++) { 369 + struct ublk_batch_elem *elem = (struct ublk_batch_elem *)( 370 + buf + i * t->commit_buf_elem_size); 371 + struct ublk_io *io = &q->ios[i]; 372 + 373 + elem->tag = i; 374 + elem->result = 0; 375 + 376 + if (ublk_queue_use_auto_zc(q)) 377 + elem->buf_index = ublk_batch_io_buf_idx(t, q, i); 378 + else if (!ublk_queue_no_buf(q)) 379 + elem->buf_addr = (__u64)io->buf_addr; 380 + } 381 + 382 + sqe->addr = (__u64)buf; 383 + sqe->len = t->commit_buf_elem_size * nr_elem; 384 + 385 + ublk_init_batch_cmd(t, q->q_id, sqe, UBLK_U_IO_PREP_IO_CMDS, 386 + t->commit_buf_elem_size, nr_elem, buf_idx); 387 + ublk_setup_commit_sqe(t, sqe, buf_idx); 388 + return 0; 389 + } 390 + 391 + int ublk_batch_queue_prep_io_cmds(struct ublk_thread *t, struct ublk_queue *q) 392 + { 393 + int ret = 0; 394 + 395 + pthread_spin_lock(&q->lock); 396 + if (q->flags & UBLKS_Q_PREPARED) 397 + goto unlock; 398 + ret = 
__ublk_batch_queue_prep_io_cmds(t, q); 399 + if (!ret) 400 + q->flags |= UBLKS_Q_PREPARED; 401 + unlock: 402 + pthread_spin_unlock(&q->lock); 403 + 404 + return ret; 405 + } 406 + 407 + static void ublk_batch_compl_commit_cmd(struct ublk_thread *t, 408 + const struct io_uring_cqe *cqe, 409 + unsigned op) 410 + { 411 + unsigned short buf_idx = user_data_to_tag(cqe->user_data); 412 + 413 + if (op == _IOC_NR(UBLK_U_IO_PREP_IO_CMDS)) 414 + ublk_assert(cqe->res == 0); 415 + else if (op == _IOC_NR(UBLK_U_IO_COMMIT_IO_CMDS)) { 416 + int nr_elem = user_data_to_tgt_data(cqe->user_data); 417 + 418 + ublk_assert(cqe->res == t->commit_buf_elem_size * nr_elem); 419 + } else 420 + ublk_assert(0); 421 + 422 + ublk_free_commit_buf(t, buf_idx); 423 + } 424 + 425 + void ublk_batch_compl_cmd(struct ublk_thread *t, 426 + const struct io_uring_cqe *cqe) 427 + { 428 + unsigned op = user_data_to_op(cqe->user_data); 429 + struct ublk_queue *q; 430 + unsigned buf_idx; 431 + unsigned q_id; 432 + 433 + if (op == _IOC_NR(UBLK_U_IO_PREP_IO_CMDS) || 434 + op == _IOC_NR(UBLK_U_IO_COMMIT_IO_CMDS)) { 435 + t->cmd_inflight--; 436 + ublk_batch_compl_commit_cmd(t, cqe, op); 437 + return; 438 + } 439 + 440 + /* FETCH command is per queue */ 441 + q_id = user_data_to_q_id(cqe->user_data); 442 + q = &t->dev->q[q_id]; 443 + buf_idx = ublk_compl_batch_fetch(t, q, cqe); 444 + 445 + if (cqe->res < 0 && cqe->res != -ENOBUFS) { 446 + t->cmd_inflight--; 447 + t->state |= UBLKS_T_STOPPING; 448 + } else if (!(cqe->flags & IORING_CQE_F_MORE) || cqe->res == -ENOBUFS) { 449 + t->cmd_inflight--; 450 + ublk_batch_queue_fetch(t, q, buf_idx); 451 + } 452 + } 453 + 454 + static void __ublk_batch_commit_io_cmds(struct ublk_thread *t, 455 + struct batch_commit_buf *cb) 456 + { 457 + struct io_uring_sqe *sqe; 458 + unsigned short buf_idx; 459 + unsigned short nr_elem = cb->done; 460 + 461 + /* nothing to commit */ 462 + if (!nr_elem) { 463 + ublk_free_commit_buf(t, cb->buf_idx); 464 + return; 465 + } 466 + 467 + 
ublk_io_alloc_sqes(t, &sqe, 1); 468 + buf_idx = cb->buf_idx; 469 + sqe->addr = (__u64)cb->elem; 470 + sqe->len = nr_elem * t->commit_buf_elem_size; 471 + 472 + /* commit isn't per-queue command */ 473 + ublk_init_batch_cmd(t, cb->q_id, sqe, UBLK_U_IO_COMMIT_IO_CMDS, 474 + t->commit_buf_elem_size, nr_elem, buf_idx); 475 + ublk_setup_commit_sqe(t, sqe, buf_idx); 476 + } 477 + 478 + void ublk_batch_commit_io_cmds(struct ublk_thread *t) 479 + { 480 + int i; 481 + 482 + for (i = 0; i < t->nr_queues; i++) { 483 + struct batch_commit_buf *cb = &t->commit[i]; 484 + 485 + if (cb->buf_idx != UBLKS_T_COMMIT_BUF_INV_IDX) 486 + __ublk_batch_commit_io_cmds(t, cb); 487 + } 488 + 489 + } 490 + 491 + static void __ublk_batch_init_commit(struct ublk_thread *t, 492 + struct batch_commit_buf *cb, 493 + unsigned short buf_idx) 494 + { 495 + /* so far only support 1:1 queue/thread mapping */ 496 + cb->buf_idx = buf_idx; 497 + cb->elem = ublk_get_commit_buf(t, buf_idx); 498 + cb->done = 0; 499 + cb->count = t->commit_buf_size / 500 + t->commit_buf_elem_size; 501 + } 502 + 503 + /* COMMIT_IO_CMDS is per-queue command, so use its own commit buffer */ 504 + static void ublk_batch_init_commit(struct ublk_thread *t, 505 + struct batch_commit_buf *cb) 506 + { 507 + unsigned short buf_idx = ublk_alloc_commit_buf(t); 508 + 509 + ublk_assert(buf_idx != UBLKS_T_COMMIT_BUF_INV_IDX); 510 + ublk_assert(!ublk_batch_commit_prepared(cb)); 511 + 512 + __ublk_batch_init_commit(t, cb, buf_idx); 513 + } 514 + 515 + void ublk_batch_prep_commit(struct ublk_thread *t) 516 + { 517 + int i; 518 + 519 + for (i = 0; i < t->nr_queues; i++) 520 + t->commit[i].buf_idx = UBLKS_T_COMMIT_BUF_INV_IDX; 521 + } 522 + 523 + void ublk_batch_complete_io(struct ublk_thread *t, struct ublk_queue *q, 524 + unsigned tag, int res) 525 + { 526 + unsigned q_t_idx = ublk_queue_idx_in_thread(t, q); 527 + struct batch_commit_buf *cb = &t->commit[q_t_idx]; 528 + struct ublk_batch_elem *elem; 529 + struct ublk_io *io = &q->ios[tag]; 530 
+ 531 + if (!ublk_batch_commit_prepared(cb)) 532 + ublk_batch_init_commit(t, cb); 533 + 534 + ublk_assert(q->q_id == cb->q_id); 535 + 536 + elem = (struct ublk_batch_elem *)(cb->elem + cb->done * t->commit_buf_elem_size); 537 + elem->tag = tag; 538 + elem->buf_index = ublk_batch_io_buf_idx(t, q, tag); 539 + elem->result = res; 540 + 541 + if (!ublk_queue_no_buf(q)) 542 + elem->buf_addr = (__u64) (uintptr_t) io->buf_addr; 543 + 544 + cb->done += 1; 545 + ublk_assert(cb->done <= cb->count); 546 + } 547 + 548 + void ublk_batch_setup_map(unsigned char (*q_thread_map)[UBLK_MAX_QUEUES], 549 + int nthreads, int queues) 550 + { 551 + int i, j; 552 + 553 + /* 554 + * Setup round-robin queue-to-thread mapping for arbitrary N:M combinations. 555 + * 556 + * This algorithm distributes queues across threads (and threads across queues) 557 + * in a balanced round-robin fashion to ensure even load distribution. 558 + * 559 + * Examples: 560 + * - 2 threads, 4 queues: T0=[Q0,Q2], T1=[Q1,Q3] 561 + * - 4 threads, 2 queues: T0=[Q0], T1=[Q1], T2=[Q0], T3=[Q1] 562 + * - 3 threads, 3 queues: T0=[Q0], T1=[Q1], T2=[Q2] (1:1 mapping) 563 + * 564 + * Phase 1: Mark which queues each thread handles (boolean mapping) 565 + */ 566 + for (i = 0, j = 0; i < queues || j < nthreads; i++, j++) { 567 + q_thread_map[j % nthreads][i % queues] = 1; 568 + } 569 + 570 + /* 571 + * Phase 2: Convert boolean mapping to sequential indices within each thread. 572 + * 573 + * Transform from: q_thread_map[thread][queue] = 1 (handles queue) 574 + * To: q_thread_map[thread][queue] = N (queue index within thread) 575 + * 576 + * This allows each thread to know the local index of each queue it handles, 577 + * which is essential for buffer allocation and management. 
For example: 578 + * - Thread 0 handling queues [0,2] becomes: q_thread_map[0][0]=1, q_thread_map[0][2]=2 579 + * - Thread 1 handling queues [1,3] becomes: q_thread_map[1][1]=1, q_thread_map[1][3]=2 580 + */ 581 + for (j = 0; j < nthreads; j++) { 582 + unsigned char seq = 1; 583 + 584 + for (i = 0; i < queues; i++) { 585 + if (q_thread_map[j][i]) 586 + q_thread_map[j][i] = seq++; 587 + } 588 + } 589 + 590 + #if 0 591 + for (j = 0; j < nthreads; j++) { 592 + printf("thread %0d: ", j); 593 + for (i = 0; i < queues; i++) { 594 + if (q_thread_map[j][i]) 595 + printf("%03u ", i); 596 + } 597 + printf("\n"); 598 + } 599 + printf("\n"); 600 + for (j = 0; j < nthreads; j++) { 601 + for (i = 0; i < queues; i++) { 602 + printf("%03u ", q_thread_map[j][i]); 603 + } 604 + printf("\n"); 605 + } 606 + #endif 607 + }
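
The two-phase queue-to-thread mapping described in the comment of `ublk_batch_setup_map()` can be tried outside the selftest. The following is a minimal standalone sketch, not the selftest code itself: `MAX_QUEUES`, `MAX_THREADS`, `setup_map()` and `local_index()` are names made up for this illustration, and the bounds are arbitrary.

```c
#include <assert.h>
#include <string.h>

#define MAX_QUEUES  32	/* illustrative bound, not the selftest's UBLK_MAX_QUEUES */
#define MAX_THREADS 32	/* illustrative bound */

/*
 * Fill map[thread][queue] with 0 when the thread does not serve the queue,
 * or with the queue's 1-based position among that thread's queues.
 */
static void setup_map(unsigned char (*map)[MAX_QUEUES], int nthreads, int queues)
{
	int i, j;

	assert(nthreads > 0 && queues > 0 && queues <= MAX_QUEUES);
	memset(map, 0, (size_t)nthreads * sizeof(*map));

	/* Phase 1: walk threads and queues in lockstep until both are covered. */
	for (i = 0, j = 0; i < queues || j < nthreads; i++, j++)
		map[j % nthreads][i % queues] = 1;

	/* Phase 2: turn the boolean marks into per-thread sequence numbers. */
	for (j = 0; j < nthreads; j++) {
		unsigned char seq = 1;

		for (i = 0; i < queues; i++) {
			if (map[j][i])
				map[j][i] = seq++;
		}
	}
}

/* Convenience wrapper: local index of `queue` in `thread`, 0 if unmapped. */
static int local_index(int nthreads, int queues, int thread, int queue)
{
	unsigned char map[MAX_THREADS][MAX_QUEUES];

	assert(nthreads <= MAX_THREADS);
	assert(thread >= 0 && thread < nthreads);
	assert(queue >= 0 && queue < queues);

	setup_map(map, nthreads, queues);
	return map[thread][queue];
}
```

With 2 threads and 4 queues, `local_index(2, 4, 0, 2)` yields 2 because Q2 is the second queue served by thread 0, which matches the `T0=[Q0,Q2]` example in the comment. A zero entry means the thread does not handle that queue, which is why the sequence numbers start at 1.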

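
The commit buffer sizing in the file above (`ublk_commit_elem_buf_size()`, `ublk_commit_buf_size()` and `nr_commit_buf = 2 * nr_queues`) comes down to a little arithmetic. Below is a rough sketch of that arithmetic under simplifying assumptions: the feature-flag test is collapsed into one `carries_buf_addr` boolean, the page size is passed in instead of coming from `getpagesize()`, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Per-element size: 8 bytes, plus 8 more when a buffer address is carried. */
static size_t commit_elem_size(int carries_buf_addr)
{
	return carries_buf_addr ? 16 : 8;
}

/* Round `n` up to the next multiple of `align` (any positive `align`). */
static size_t round_up_to(size_t n, size_t align)
{
	assert(align > 0);
	return (n + align - 1) / align * align;
}

/* One commit buffer holds one element per tag, padded to whole pages. */
static size_t commit_buf_size(size_t queue_depth, int carries_buf_addr,
			      size_t page_size)
{
	size_t total = commit_elem_size(carries_buf_addr) * queue_depth;

	return round_up_to(total, page_size);
}

/* A thread gets two commit buffers for each queue it serves. */
static size_t commit_buf_total(size_t nr_queues, size_t queue_depth,
			       int carries_buf_addr, size_t page_size)
{
	return 2 * nr_queues *
	       commit_buf_size(queue_depth, carries_buf_addr, page_size);
}
```

For a queue depth of 128 with 16-byte elements, the raw size is 2048 bytes, so a 4096-byte page rounds one buffer up to a single page. A thread serving two such queues would then allocate four pages in total.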
+3 -3

tools/testing/selftests/ublk/common.c

···
12 12 }
13 13 }
14 14
15 - int backing_file_tgt_init(struct ublk_dev *dev)
15 + int backing_file_tgt_init(struct ublk_dev *dev, unsigned int nr_direct)
16 16 {
17 17 int fd, i;
18 18
19 - assert(dev->nr_fds == 1);
19 + ublk_assert(dev->nr_fds == 1);
20 20
21 21 for (i = 0; i < dev->tgt.nr_backing_files; i++) {
22 22 char *file = dev->tgt.backing_file[i];
···
25 25
26 26 ublk_dbg(UBLK_DBG_DEV, "%s: file %d: %s\n", __func__, i, file);
27 27
28 - fd = open(file, O_RDWR | O_DIRECT);
28 + fd = open(file, O_RDWR | (i < nr_direct ? O_DIRECT : 0));
29 29 if (fd < 0) {
30 30 ublk_err("%s: backing file %s can't be opened: %s\n",
31 31 __func__, file, strerror(errno));
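
The new `nr_direct` parameter means only the first `nr_direct` backing files are opened with `O_DIRECT`, while any later file (for the loop target, the integrity file) goes through the page cache. A small sketch of that flag selection follows; `backing_open_flags()` is a hypothetical helper, and the `O_DIRECT` fallback definition is an assumption added only so the snippet builds on platforms that lack the flag.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* glibc exposes O_DIRECT only with this defined */
#endif
#include <assert.h>
#include <fcntl.h>

#ifndef O_DIRECT
#define O_DIRECT 0	/* assumption: degrade to buffered I/O where unsupported */
#endif

/* Open flags for backing file number `file_idx` (0-based). */
static int backing_open_flags(int file_idx, unsigned int nr_direct)
{
	assert(file_idx >= 0);

	/* Files below nr_direct bypass the page cache, the rest do not. */
	return O_RDWR | ((unsigned int)file_idx < nr_direct ? O_DIRECT : 0);
}
```

With `nr_direct` set to 1, as the loop target passes it, file 0 gets `O_RDWR | O_DIRECT` and file 1 gets plain `O_RDWR`.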

+1

tools/testing/selftests/ublk/fault_inject.c

···
33 33 .dev_sectors = dev_size >> 9,
34 34 },
35 35 };
36 + ublk_set_integrity_params(ctx, &dev->tgt.params);
36 37
37 38 dev->private_data = (void *)(unsigned long)(ctx->fault_inject.delay_us * 1000);
38 39 return 0;

+81 -20

tools/testing/selftests/ublk/file_backed.c

··· 10 10 return zc ? IORING_OP_READ_FIXED : IORING_OP_READ; 11 11 else if (ublk_op == UBLK_IO_OP_WRITE) 12 12 return zc ? IORING_OP_WRITE_FIXED : IORING_OP_WRITE; 13 - assert(0); 13 + ublk_assert(0); 14 14 } 15 15 16 16 static int loop_queue_flush_io(struct ublk_thread *t, struct ublk_queue *q, ··· 35 35 unsigned auto_zc = ublk_queue_use_auto_zc(q); 36 36 enum io_uring_op op = ublk_to_uring_op(iod, zc | auto_zc); 37 37 struct ublk_io *io = ublk_get_io(q, tag); 38 + __u64 offset = iod->start_sector << 9; 39 + __u32 len = iod->nr_sectors << 9; 38 40 struct io_uring_sqe *sqe[3]; 39 41 void *addr = io->buf_addr; 42 + unsigned short buf_index = ublk_io_buf_idx(t, q, tag); 43 + 44 + if (iod->op_flags & UBLK_IO_F_INTEGRITY) { 45 + ublk_io_alloc_sqes(t, sqe, 1); 46 + /* Use second backing file for integrity data */ 47 + io_uring_prep_rw(op, sqe[0], ublk_get_registered_fd(q, 2), 48 + io->integrity_buf, 49 + ublk_integrity_len(q, len), 50 + ublk_integrity_len(q, offset)); 51 + sqe[0]->flags = IOSQE_FIXED_FILE; 52 + /* tgt_data = 1 indicates integrity I/O */ 53 + sqe[0]->user_data = build_user_data(tag, ublk_op, 1, q->q_id, 1); 54 + } 40 55 41 56 if (!zc || auto_zc) { 42 57 ublk_io_alloc_sqes(t, sqe, 1); ··· 60 45 61 46 io_uring_prep_rw(op, sqe[0], ublk_get_registered_fd(q, 1) /*fds[1]*/, 62 47 addr, 63 - iod->nr_sectors << 9, 64 - iod->start_sector << 9); 48 + len, 49 + offset); 65 50 if (auto_zc) 66 - sqe[0]->buf_index = tag; 51 + sqe[0]->buf_index = buf_index; 67 52 io_uring_sqe_set_flags(sqe[0], IOSQE_FIXED_FILE); 68 53 /* bit63 marks us as tgt io */ 69 54 sqe[0]->user_data = build_user_data(tag, ublk_op, 0, q->q_id, 1); 70 - return 1; 55 + return !!(iod->op_flags & UBLK_IO_F_INTEGRITY) + 1; 71 56 } 72 57 73 58 ublk_io_alloc_sqes(t, sqe, 3); 74 59 75 - io_uring_prep_buf_register(sqe[0], q, tag, q->q_id, io->buf_index); 60 + io_uring_prep_buf_register(sqe[0], q, tag, q->q_id, buf_index); 76 61 sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK; 77 62 
sqe[0]->user_data = build_user_data(tag, 78 63 ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1); 79 64 80 65 io_uring_prep_rw(op, sqe[1], ublk_get_registered_fd(q, 1) /*fds[1]*/, 0, 81 - iod->nr_sectors << 9, 82 - iod->start_sector << 9); 83 - sqe[1]->buf_index = tag; 66 + len, 67 + offset); 68 + sqe[1]->buf_index = buf_index; 84 69 sqe[1]->flags |= IOSQE_FIXED_FILE | IOSQE_IO_HARDLINK; 85 70 sqe[1]->user_data = build_user_data(tag, ublk_op, 0, q->q_id, 1); 86 71 87 - io_uring_prep_buf_unregister(sqe[2], q, tag, q->q_id, io->buf_index); 72 + io_uring_prep_buf_unregister(sqe[2], q, tag, q->q_id, buf_index); 88 73 sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, q->q_id, 1); 89 74 90 - return 2; 75 + return !!(iod->op_flags & UBLK_IO_F_INTEGRITY) + 2; 91 76 } 92 77 93 78 static int loop_queue_tgt_io(struct ublk_thread *t, struct ublk_queue *q, int tag) ··· 134 119 unsigned op = user_data_to_op(cqe->user_data); 135 120 struct ublk_io *io = ublk_get_io(q, tag); 136 121 137 - if (cqe->res < 0 || op != ublk_cmd_op_nr(UBLK_U_IO_UNREGISTER_IO_BUF)) { 138 - if (!io->result) 139 - io->result = cqe->res; 140 - if (cqe->res < 0) 141 - ublk_err("%s: io failed op %x user_data %lx\n", 142 - __func__, op, cqe->user_data); 122 + if (cqe->res < 0) { 123 + io->result = cqe->res; 124 + ublk_err("%s: io failed op %x user_data %lx\n", 125 + __func__, op, cqe->user_data); 126 + } else if (op != ublk_cmd_op_nr(UBLK_U_IO_UNREGISTER_IO_BUF)) { 127 + __s32 data_len = user_data_to_tgt_data(cqe->user_data) 128 + ? 
ublk_integrity_data_len(q, cqe->res) 129 + : cqe->res; 130 + 131 + if (!io->result || data_len < io->result) 132 + io->result = data_len; 143 133 } 144 134 145 135 /* buffer register op is IOSQE_CQE_SKIP_SUCCESS */ ··· 155 135 ublk_complete_io(t, q, tag, io->result); 156 136 } 157 137 138 + static int ublk_loop_memset_file(int fd, __u8 byte, size_t len) 139 + { 140 + off_t offset = 0; 141 + __u8 buf[4096]; 142 + 143 + memset(buf, byte, sizeof(buf)); 144 + while (len) { 145 + int ret = pwrite(fd, buf, min(len, sizeof(buf)), offset); 146 + 147 + if (ret < 0) 148 + return -errno; 149 + if (!ret) 150 + return -EIO; 151 + 152 + len -= ret; 153 + offset += ret; 154 + } 155 + return 0; 156 + } 157 + 158 158 static int ublk_loop_tgt_init(const struct dev_ctx *ctx, struct ublk_dev *dev) 159 159 { 160 160 unsigned long long bytes; 161 + unsigned long blocks; 161 162 int ret; 162 163 struct ublk_params p = { 163 164 .types = UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DMA_ALIGN, ··· 195 154 }, 196 155 }; 197 156 157 + ublk_set_integrity_params(ctx, &p); 198 158 if (ctx->auto_zc_fallback) { 199 159 ublk_err("%s: not support auto_zc_fallback\n", __func__); 200 160 return -EINVAL; 201 161 } 202 162 203 - ret = backing_file_tgt_init(dev); 163 + /* Use O_DIRECT only for data file */ 164 + ret = backing_file_tgt_init(dev, 1); 204 165 if (ret) 205 166 return ret; 206 167 207 - if (dev->tgt.nr_backing_files != 1) 168 + /* Expect a second file for integrity data */ 169 + if (dev->tgt.nr_backing_files != 1 + !!ctx->metadata_size) 208 170 return -EINVAL; 209 171 210 - bytes = dev->tgt.backing_file_size[0]; 172 + blocks = dev->tgt.backing_file_size[0] >> p.basic.logical_bs_shift; 173 + if (ctx->metadata_size) { 174 + unsigned long metadata_blocks = 175 + dev->tgt.backing_file_size[1] / ctx->metadata_size; 176 + unsigned long integrity_len; 177 + 178 + /* Ensure both data and integrity data fit in backing files */ 179 + blocks = min(blocks, metadata_blocks); 180 + integrity_len = blocks * 
ctx->metadata_size; 181 + /* 182 + * Initialize PI app tag and ref tag to 0xFF 183 + * to disable bio-integrity-auto checks 184 + */ 185 + ret = ublk_loop_memset_file(dev->fds[2], 0xFF, integrity_len); 186 + if (ret) 187 + return ret; 188 + } 189 + bytes = blocks << p.basic.logical_bs_shift; 211 190 dev->tgt.dev_size = bytes; 212 191 p.basic.dev_sectors = bytes >> 9; 213 192 dev->tgt.params = p;
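
The device sizing in `ublk_loop_tgt_init()` now depends on both backing files: the block count is capped by whichever of the data file or the integrity file runs out first. The sketch below restates that calculation on its own, assuming a power-of-two logical block size expressed as a shift. `struct loop_geometry` and `loop_geometry()` are illustrative names, not part of the selftest.

```c
#include <assert.h>
#include <stdint.h>

struct loop_geometry {
	uint64_t blocks;		/* usable logical blocks */
	uint64_t data_bytes;		/* exported device size */
	uint64_t integrity_bytes;	/* integrity file range to pre-fill */
};

/*
 * Size a loop device from its data file and, when metadata_size is
 * non-zero, from its integrity file as well.
 */
static struct loop_geometry loop_geometry(uint64_t data_file_bytes,
					  uint64_t integrity_file_bytes,
					  unsigned int bs_shift,
					  unsigned int metadata_size)
{
	struct loop_geometry g;

	assert(bs_shift < 64);

	g.blocks = data_file_bytes >> bs_shift;
	if (metadata_size) {
		uint64_t meta_blocks = integrity_file_bytes / metadata_size;

		/* Both data and integrity data must fit in their files. */
		if (meta_blocks < g.blocks)
			g.blocks = meta_blocks;
	}

	g.data_bytes = g.blocks << bs_shift;
	g.integrity_bytes = g.blocks * metadata_size;
	return g;
}
```

For example, a 1 MiB data file with 512-byte blocks offers 2048 blocks, but an 8 KiB integrity file with 8 bytes of metadata per block covers only 1024, so the device would be exported at 512 KiB. The `integrity_bytes` value corresponds to the range the patch fills with 0xFF.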

+264 -29

tools/testing/selftests/ublk/kublk.c

··· 3 3 * Description: uring_cmd based ublk 4 4 */ 5 5 6 + #include <linux/fs.h> 6 7 #include "kublk.h" 7 8 8 9 #define MAX_NR_TGT_ARG 64 ··· 103 102 { 104 103 struct ublk_ctrl_cmd_data data = { 105 104 .cmd_op = UBLK_U_CMD_STOP_DEV, 105 + }; 106 + 107 + return __ublk_ctrl_cmd(dev, &data); 108 + } 109 + 110 + static int ublk_ctrl_try_stop_dev(struct ublk_dev *dev) 111 + { 112 + struct ublk_ctrl_cmd_data data = { 113 + .cmd_op = UBLK_U_CMD_TRY_STOP_DEV, 106 114 }; 107 115 108 116 return __ublk_ctrl_cmd(dev, &data); ··· 425 415 if (q->io_cmd_buf) 426 416 munmap(q->io_cmd_buf, ublk_queue_cmd_buf_sz(q)); 427 417 428 - for (i = 0; i < nr_ios; i++) 418 + for (i = 0; i < nr_ios; i++) { 429 419 free(q->ios[i].buf_addr); 420 + free(q->ios[i].integrity_buf); 421 + } 430 422 } 431 423 432 424 static void ublk_thread_deinit(struct ublk_thread *t) 433 425 { 434 426 io_uring_unregister_buffers(&t->ring); 427 + 428 + ublk_batch_free_buf(t); 435 429 436 430 io_uring_unregister_ring_fd(&t->ring); 437 431 ··· 446 432 } 447 433 } 448 434 449 - static int ublk_queue_init(struct ublk_queue *q, unsigned long long extra_flags) 435 + static int ublk_queue_init(struct ublk_queue *q, unsigned long long extra_flags, 436 + __u8 metadata_size) 450 437 { 451 438 struct ublk_dev *dev = q->dev; 452 439 int depth = dev->dev_info.queue_depth; 453 440 int i; 454 - int cmd_buf_size, io_buf_size; 441 + int cmd_buf_size, io_buf_size, integrity_size; 455 442 unsigned long off; 456 443 444 + pthread_spin_init(&q->lock, PTHREAD_PROCESS_PRIVATE); 457 445 q->tgt_ops = dev->tgt.ops; 458 446 q->flags = 0; 459 447 q->q_depth = depth; 460 448 q->flags = dev->dev_info.flags; 461 449 q->flags |= extra_flags; 450 + q->metadata_size = metadata_size; 462 451 463 452 /* Cache fd in queue for fast path access */ 464 453 q->ublk_fd = dev->fds[0]; ··· 477 460 } 478 461 479 462 io_buf_size = dev->dev_info.max_io_buf_bytes; 463 + integrity_size = ublk_integrity_len(q, io_buf_size); 480 464 for (i = 0; i < q->q_depth; i++) 
{ 481 465 q->ios[i].buf_addr = NULL; 482 466 q->ios[i].flags = UBLKS_IO_NEED_FETCH_RQ | UBLKS_IO_FREE; 483 467 q->ios[i].tag = i; 468 + 469 + if (integrity_size) { 470 + q->ios[i].integrity_buf = malloc(integrity_size); 471 + if (!q->ios[i].integrity_buf) { 472 + ublk_err("ublk dev %d queue %d io %d malloc(%d) failed: %m\n", 473 + dev->dev_info.dev_id, q->q_id, i, 474 + integrity_size); 475 + goto fail; 476 + } 477 + } 478 + 484 479 485 480 if (ublk_queue_no_buf(q)) 486 481 continue; ··· 520 491 int ring_depth = dev->tgt.sq_depth, cq_depth = dev->tgt.cq_depth; 521 492 int ret; 522 493 494 + /* FETCH_IO_CMDS is multishot, so increase cq depth for BATCH_IO */ 495 + if (ublk_dev_batch_io(dev)) 496 + cq_depth += dev->dev_info.queue_depth * 2; 497 + 523 498 ret = ublk_setup_ring(&t->ring, ring_depth, cq_depth, 524 499 IORING_SETUP_COOP_TASKRUN | 525 500 IORING_SETUP_SINGLE_ISSUER | ··· 538 505 unsigned nr_ios = dev->dev_info.queue_depth * dev->dev_info.nr_hw_queues; 539 506 unsigned max_nr_ios_per_thread = nr_ios / dev->nthreads; 540 507 max_nr_ios_per_thread += !!(nr_ios % dev->nthreads); 541 - ret = io_uring_register_buffers_sparse( 542 - &t->ring, max_nr_ios_per_thread); 508 + 509 + t->nr_bufs = max_nr_ios_per_thread; 510 + } else { 511 + t->nr_bufs = 0; 512 + } 513 + 514 + if (ublk_dev_batch_io(dev)) 515 + ublk_batch_prepare(t); 516 + 517 + if (t->nr_bufs) { 518 + ret = io_uring_register_buffers_sparse(&t->ring, t->nr_bufs); 543 519 if (ret) { 544 - ublk_err("ublk dev %d thread %d register spare buffers failed %d", 520 + ublk_err("ublk dev %d thread %d register spare buffers failed %d\n", 545 521 dev->dev_info.dev_id, t->idx, ret); 522 + goto fail; 523 + } 524 + } 525 + 526 + if (ublk_dev_batch_io(dev)) { 527 + ret = ublk_batch_alloc_buf(t); 528 + if (ret) { 529 + ublk_err("ublk dev %d thread %d alloc batch buf failed %d\n", 530 + dev->dev_info.dev_id, t->idx, ret); 546 531 goto fail; 547 532 } 548 533 } ··· 630 579 close(dev->fds[0]); 631 580 } 632 581 633 - static 
void ublk_set_auto_buf_reg(const struct ublk_queue *q, 582 + static void ublk_set_auto_buf_reg(const struct ublk_thread *t, 583 + const struct ublk_queue *q, 634 584 struct io_uring_sqe *sqe, 635 585 unsigned short tag) 636 586 { 637 587 struct ublk_auto_buf_reg buf = {}; 638 588 639 589 if (q->tgt_ops->buf_index) 640 - buf.index = q->tgt_ops->buf_index(q, tag); 590 + buf.index = q->tgt_ops->buf_index(t, q, tag); 641 591 else 642 - buf.index = q->ios[tag].buf_index; 592 + buf.index = ublk_io_buf_idx(t, q, tag); 643 593 644 594 if (ublk_queue_auto_zc_fallback(q)) 645 595 buf.flags = UBLK_AUTO_BUF_REG_FALLBACK; ··· 659 607 __u8 ublk_op = ublksrv_get_op(iod); 660 608 __u32 len = iod->nr_sectors << 9; 661 609 void *addr = io->buf_addr; 610 + ssize_t copied; 662 611 663 612 if (ublk_op != match_ublk_op) 664 613 return; 665 614 666 615 while (len) { 667 616 __u32 copy_len = min(len, UBLK_USER_COPY_LEN); 668 - ssize_t copied; 669 617 670 618 if (ublk_op == UBLK_IO_OP_WRITE) 671 619 copied = pread(q->ublk_fd, addr, copy_len, off); ··· 678 626 off += copy_len; 679 627 len -= copy_len; 680 628 } 629 + 630 + if (!(iod->op_flags & UBLK_IO_F_INTEGRITY)) 631 + return; 632 + 633 + len = ublk_integrity_len(q, iod->nr_sectors << 9); 634 + off = ublk_user_copy_offset(q->q_id, io->tag); 635 + off |= UBLKSRV_IO_INTEGRITY_FLAG; 636 + if (ublk_op == UBLK_IO_OP_WRITE) 637 + copied = pread(q->ublk_fd, io->integrity_buf, len, off); 638 + else if (ublk_op == UBLK_IO_OP_READ) 639 + copied = pwrite(q->ublk_fd, io->integrity_buf, len, off); 640 + else 641 + assert(0); 642 + assert(copied == (ssize_t)len); 681 643 } 682 644 683 645 int ublk_queue_io_cmd(struct ublk_thread *t, struct ublk_io *io) ··· 756 690 cmd->addr = 0; 757 691 758 692 if (ublk_queue_use_auto_zc(q)) 759 - ublk_set_auto_buf_reg(q, sqe[0], io->tag); 693 + ublk_set_auto_buf_reg(t, q, sqe[0], io->tag); 760 694 761 695 user_data = build_user_data(io->tag, _IOC_NR(cmd_op), 0, q->q_id, 0); 762 696 io_uring_sqe_set_data64(sqe[0], 
user_data); ··· 845 779 unsigned tag = user_data_to_tag(cqe->user_data); 846 780 struct ublk_io *io = &q->ios[tag]; 847 781 782 + t->cmd_inflight--; 783 + 848 784 if (!fetch) { 849 785 t->state |= UBLKS_T_STOPPING; 850 786 io->flags &= ~UBLKS_IO_NEED_FETCH_RQ; 851 787 } 852 788 853 789 if (cqe->res == UBLK_IO_RES_OK) { 854 - assert(tag < q->q_depth); 790 + ublk_assert(tag < q->q_depth); 855 791 856 792 if (ublk_queue_use_user_copy(q)) 857 793 ublk_user_copy(io, UBLK_IO_OP_WRITE); ··· 881 813 { 882 814 struct ublk_dev *dev = t->dev; 883 815 unsigned q_id = user_data_to_q_id(cqe->user_data); 884 - struct ublk_queue *q = &dev->q[q_id]; 885 816 unsigned cmd_op = user_data_to_op(cqe->user_data); 886 817 887 - if (cqe->res < 0 && cqe->res != -ENODEV) 888 - ublk_err("%s: res %d userdata %llx queue state %x\n", __func__, 889 - cqe->res, cqe->user_data, q->flags); 818 + if (cqe->res < 0 && cqe->res != -ENODEV && cqe->res != -ENOBUFS) 819 + ublk_err("%s: res %d userdata %llx thread state %x\n", __func__, 820 + cqe->res, cqe->user_data, t->state); 890 821 891 - ublk_dbg(UBLK_DBG_IO_CMD, "%s: res %d (qid %d tag %u cmd_op %u target %d/%d) stopping %d\n", 892 - __func__, cqe->res, q->q_id, user_data_to_tag(cqe->user_data), 893 - cmd_op, is_target_io(cqe->user_data), 822 + ublk_dbg(UBLK_DBG_IO_CMD, "%s: res %d (thread %d qid %d tag %u cmd_op %x " 823 + "data %lx target %d/%d) stopping %d\n", 824 + __func__, cqe->res, t->idx, q_id, 825 + user_data_to_tag(cqe->user_data), 826 + cmd_op, cqe->user_data, is_target_io(cqe->user_data), 894 827 user_data_to_tgt_data(cqe->user_data), 895 828 (t->state & UBLKS_T_STOPPING)); 896 829 897 830 /* Don't retrieve io in case of target io */ 898 831 if (is_target_io(cqe->user_data)) { 899 - ublksrv_handle_tgt_cqe(t, q, cqe); 832 + ublksrv_handle_tgt_cqe(t, &dev->q[q_id], cqe); 900 833 return; 901 834 } 902 835 903 - t->cmd_inflight--; 904 - 905 - ublk_handle_uring_cmd(t, q, cqe); 836 + if (ublk_thread_batch_io(t)) 837 + ublk_batch_compl_cmd(t, 
cqe); 838 + else 839 + ublk_handle_uring_cmd(t, &dev->q[q_id], cqe); 906 840 } 907 841 908 842 static int ublk_reap_events_uring(struct ublk_thread *t) ··· 936 866 return -ENODEV; 937 867 938 868 ret = io_uring_submit_and_wait(&t->ring, 1); 939 - reapped = ublk_reap_events_uring(t); 869 + if (ublk_thread_batch_io(t)) { 870 + ublk_batch_prep_commit(t); 871 + reapped = ublk_reap_events_uring(t); 872 + ublk_batch_commit_io_cmds(t); 873 + } else { 874 + reapped = ublk_reap_events_uring(t); 875 + } 940 876 941 877 ublk_dbg(UBLK_DBG_THREAD, "submit result %d, reapped %d stop %d idle %d\n", 942 878 ret, reapped, (t->state & UBLKS_T_STOPPING), ··· 958 882 sem_t *ready; 959 883 cpu_set_t *affinity; 960 884 unsigned long long extra_flags; 885 + unsigned char (*q_thread_map)[UBLK_MAX_QUEUES]; 961 886 }; 962 887 963 888 static void ublk_thread_set_sched_affinity(const struct ublk_thread_info *info) ··· 966 889 if (pthread_setaffinity_np(pthread_self(), sizeof(*info->affinity), info->affinity) < 0) 967 890 ublk_err("ublk dev %u thread %u set affinity failed", 968 891 info->dev->dev_info.dev_id, info->idx); 892 + } 893 + 894 + static void ublk_batch_setup_queues(struct ublk_thread *t) 895 + { 896 + int i; 897 + 898 + for (i = 0; i < t->dev->dev_info.nr_hw_queues; i++) { 899 + struct ublk_queue *q = &t->dev->q[i]; 900 + int ret; 901 + 902 + /* 903 + * Only prepare io commands in the mapped thread context, 904 + * otherwise io command buffer index may not work as expected 905 + */ 906 + if (t->q_map[i] == 0) 907 + continue; 908 + 909 + ret = ublk_batch_queue_prep_io_cmds(t, q); 910 + ublk_assert(ret >= 0); 911 + } 969 912 } 970 913 971 914 static __attribute__((noinline)) int __ublk_io_handler_fn(struct ublk_thread_info *info) ··· 996 899 }; 997 900 int dev_id = info->dev->dev_info.dev_id; 998 901 int ret; 902 + 903 + /* Copy per-thread queue mapping into thread-local variable */ 904 + if (info->q_thread_map) 905 + memcpy(t.q_map, info->q_thread_map[info->idx], sizeof(t.q_map)); 
999 906 1000 907 ret = ublk_thread_init(&t, info->extra_flags); 1001 908 if (ret) { ··· 1012 911 ublk_dbg(UBLK_DBG_THREAD, "tid %d: ublk dev %d thread %u started\n", 1013 912 gettid(), dev_id, t.idx); 1014 913 1015 - /* submit all io commands to ublk driver */ 1016 - ublk_submit_fetch_commands(&t); 914 + if (!ublk_thread_batch_io(&t)) { 915 + /* submit all io commands to ublk driver */ 916 + ublk_submit_fetch_commands(&t); 917 + } else { 918 + ublk_batch_setup_queues(&t); 919 + ublk_batch_start_fetch(&t); 920 + } 921 + 1017 922 do { 1018 923 if (ublk_process_io(&t) < 0) 1019 924 break; ··· 1091 984 struct ublk_thread_info *tinfo; 1092 985 unsigned long long extra_flags = 0; 1093 986 cpu_set_t *affinity_buf; 987 + unsigned char (*q_thread_map)[UBLK_MAX_QUEUES] = NULL; 1094 988 void *thread_ret; 1095 989 sem_t ready; 1096 990 int ret, i; ··· 1111 1003 if (ret) 1112 1004 return ret; 1113 1005 1006 + if (ublk_dev_batch_io(dev)) { 1007 + q_thread_map = calloc(dev->nthreads, sizeof(*q_thread_map)); 1008 + if (!q_thread_map) { 1009 + ret = -ENOMEM; 1010 + goto fail; 1011 + } 1012 + ublk_batch_setup_map(q_thread_map, dev->nthreads, 1013 + dinfo->nr_hw_queues); 1014 + } 1015 + 1114 1016 if (ctx->auto_zc_fallback) 1115 1017 extra_flags = UBLKS_Q_AUTO_BUF_REG_FALLBACK; 1116 1018 if (ctx->no_ublk_fixed_fd) ··· 1130 1012 dev->q[i].dev = dev; 1131 1013 dev->q[i].q_id = i; 1132 1014 1133 - ret = ublk_queue_init(&dev->q[i], extra_flags); 1015 + ret = ublk_queue_init(&dev->q[i], extra_flags, 1016 + ctx->metadata_size); 1134 1017 if (ret) { 1135 1018 ublk_err("ublk dev %d queue %d init queue failed\n", 1136 1019 dinfo->dev_id, i); ··· 1144 1025 tinfo[i].idx = i; 1145 1026 tinfo[i].ready = &ready; 1146 1027 tinfo[i].extra_flags = extra_flags; 1028 + tinfo[i].q_thread_map = q_thread_map; 1147 1029 1148 1030 /* 1149 1031 * If threads are not tied 1:1 to queues, setting thread ··· 1164 1044 for (i = 0; i < dev->nthreads; i++) 1165 1045 sem_wait(&ready); 1166 1046 free(affinity_buf); 
1047 + free(q_thread_map); 1167 1048 1168 1049 /* everything is fine now, start us */ 1169 1050 if (ctx->recovery) ··· 1335 1214 goto fail; 1336 1215 } 1337 1216 1338 - if (nthreads != nr_queues && !ctx->per_io_tasks) { 1217 + if (nthreads != nr_queues && (!ctx->per_io_tasks && 1218 + !(ctx->flags & UBLK_F_BATCH_IO))) { 1339 1219 ublk_err("%s: threads %u must be same as queues %u if " 1340 1220 "not using per_io_tasks\n", 1341 1221 __func__, nthreads, nr_queues); ··· 1516 1394 return 0; 1517 1395 } 1518 1396 1397 + static int cmd_dev_stop(struct dev_ctx *ctx) 1398 + { 1399 + int number = ctx->dev_id; 1400 + struct ublk_dev *dev; 1401 + int ret; 1402 + 1403 + if (number < 0) { 1404 + ublk_err("%s: device id is required\n", __func__); 1405 + return -EINVAL; 1406 + } 1407 + 1408 + dev = ublk_ctrl_init(); 1409 + dev->dev_info.dev_id = number; 1410 + 1411 + ret = ublk_ctrl_get_info(dev); 1412 + if (ret < 0) 1413 + goto fail; 1414 + 1415 + if (ctx->safe_stop) { 1416 + ret = ublk_ctrl_try_stop_dev(dev); 1417 + if (ret < 0) 1418 + ublk_err("%s: try_stop dev %d failed ret %d\n", 1419 + __func__, number, ret); 1420 + } else { 1421 + ret = ublk_ctrl_stop_dev(dev); 1422 + if (ret < 0) 1423 + ublk_err("%s: stop dev %d failed ret %d\n", 1424 + __func__, number, ret); 1425 + } 1426 + 1427 + fail: 1428 + ublk_ctrl_deinit(dev); 1429 + 1430 + return ret; 1431 + } 1432 + 1519 1433 static int __cmd_dev_list(struct dev_ctx *ctx) 1520 1434 { 1521 1435 struct ublk_dev *dev = ublk_ctrl_init(); ··· 1614 1456 FEAT_NAME(UBLK_F_QUIESCE), 1615 1457 FEAT_NAME(UBLK_F_PER_IO_DAEMON), 1616 1458 FEAT_NAME(UBLK_F_BUF_REG_OFF_DAEMON), 1459 + FEAT_NAME(UBLK_F_INTEGRITY), 1460 + FEAT_NAME(UBLK_F_SAFE_STOP_DEV), 1461 + FEAT_NAME(UBLK_F_BATCH_IO), 1462 + FEAT_NAME(UBLK_F_NO_AUTO_PART_SCAN), 1617 1463 }; 1618 1464 struct ublk_dev *dev; 1619 1465 __u64 features = 0; ··· 1713 1551 printf("\t[--foreground] [--quiet] [-z] [--auto_zc] [--auto_zc_fallback] [--debug_mask mask] [-r 0|1] [-g] [-u]\n"); 1714 1552 
printf("\t[-e 0|1 ] [-i 0|1] [--no_ublk_fixed_fd]\n"); 1715 1553 printf("\t[--nthreads threads] [--per_io_tasks]\n"); 1554 + printf("\t[--integrity_capable] [--integrity_reftag] [--metadata_size SIZE] " 1555 + "[--pi_offset OFFSET] [--csum_type ip|t10dif|nvme] [--tag_size SIZE]\n"); 1556 + printf("\t[--batch|-b] [--no_auto_part_scan]\n"); 1716 1557 printf("\t[target options] [backfile1] [backfile2] ...\n"); 1717 1558 printf("\tdefault: nr_queues=2(max 32), depth=128(max 1024), dev_id=-1(auto allocation)\n"); 1718 1559 printf("\tdefault: nthreads=nr_queues"); ··· 1748 1583 1749 1584 printf("%s del [-n dev_id] -a \n", exe); 1750 1585 printf("\t -a delete all devices -n delete specified device\n\n"); 1586 + printf("%s stop -n dev_id [--safe]\n", exe); 1587 + printf("\t --safe only stop if device has no active openers\n\n"); 1751 1588 printf("%s list [-n dev_id] -a \n", exe); 1752 1589 printf("\t -a list all devices, -n list specified device, default -a \n\n"); 1753 1590 printf("%s features\n", exe); ··· 1781 1614 { "nthreads", 1, NULL, 0 }, 1782 1615 { "per_io_tasks", 0, NULL, 0 }, 1783 1616 { "no_ublk_fixed_fd", 0, NULL, 0 }, 1617 + { "integrity_capable", 0, NULL, 0 }, 1618 + { "integrity_reftag", 0, NULL, 0 }, 1619 + { "metadata_size", 1, NULL, 0 }, 1620 + { "pi_offset", 1, NULL, 0 }, 1621 + { "csum_type", 1, NULL, 0 }, 1622 + { "tag_size", 1, NULL, 0 }, 1623 + { "safe", 0, NULL, 0 }, 1624 + { "batch", 0, NULL, 'b'}, 1625 + { "no_auto_part_scan", 0, NULL, 0 }, 1784 1626 { 0, 0, 0, 0 } 1785 1627 }; 1786 1628 const struct ublk_tgt_ops *ops = NULL; ··· 1801 1625 .nr_hw_queues = 2, 1802 1626 .dev_id = -1, 1803 1627 .tgt_type = "unknown", 1628 + .csum_type = LBMD_PI_CSUM_NONE, 1804 1629 }; 1805 1630 int ret = -EINVAL, i; 1806 1631 int tgt_argc = 1; ··· 1813 1636 1814 1637 opterr = 0; 1815 1638 optind = 2; 1816 - while ((opt = getopt_long(argc, argv, "t:n:d:q:r:e:i:s:gazu", 1639 + while ((opt = getopt_long(argc, argv, "t:n:d:q:r:e:i:s:gazub", 1817 1640 longopts, 
&option_idx)) != -1) { 1818 1641 switch (opt) { 1819 1642 case 'a': 1820 1643 ctx.all = 1; 1644 + break; 1645 + case 'b': 1646 + ctx.flags |= UBLK_F_BATCH_IO; 1821 1647 break; 1822 1648 case 'n': 1823 1649 ctx.dev_id = strtol(optarg, NULL, 10); ··· 1879 1699 ctx.per_io_tasks = 1; 1880 1700 if (!strcmp(longopts[option_idx].name, "no_ublk_fixed_fd")) 1881 1701 ctx.no_ublk_fixed_fd = 1; 1702 + if (!strcmp(longopts[option_idx].name, "integrity_capable")) 1703 + ctx.integrity_flags |= LBMD_PI_CAP_INTEGRITY; 1704 + if (!strcmp(longopts[option_idx].name, "integrity_reftag")) 1705 + ctx.integrity_flags |= LBMD_PI_CAP_REFTAG; 1706 + if (!strcmp(longopts[option_idx].name, "metadata_size")) 1707 + ctx.metadata_size = strtoul(optarg, NULL, 0); 1708 + if (!strcmp(longopts[option_idx].name, "pi_offset")) 1709 + ctx.pi_offset = strtoul(optarg, NULL, 0); 1710 + if (!strcmp(longopts[option_idx].name, "csum_type")) { 1711 + if (!strcmp(optarg, "ip")) { 1712 + ctx.csum_type = LBMD_PI_CSUM_IP; 1713 + } else if (!strcmp(optarg, "t10dif")) { 1714 + ctx.csum_type = LBMD_PI_CSUM_CRC16_T10DIF; 1715 + } else if (!strcmp(optarg, "nvme")) { 1716 + ctx.csum_type = LBMD_PI_CSUM_CRC64_NVME; 1717 + } else { 1718 + ublk_err("invalid csum_type: %s\n", optarg); 1719 + return -EINVAL; 1720 + } 1721 + } 1722 + if (!strcmp(longopts[option_idx].name, "tag_size")) 1723 + ctx.tag_size = strtoul(optarg, NULL, 0); 1724 + if (!strcmp(longopts[option_idx].name, "safe")) 1725 + ctx.safe_stop = 1; 1726 + if (!strcmp(longopts[option_idx].name, "no_auto_part_scan")) 1727 + ctx.flags |= UBLK_F_NO_AUTO_PART_SCAN; 1882 1728 break; 1883 1729 case '?': 1884 1730 /* ··· 1928 1722 } 1929 1723 } 1930 1724 1725 + if (ctx.per_io_tasks && (ctx.flags & UBLK_F_BATCH_IO)) { 1726 + ublk_err("per_io_task and F_BATCH_IO conflict\n"); 1727 + return -EINVAL; 1728 + } 1729 + 1931 1730 /* auto_zc_fallback depends on F_AUTO_BUF_REG & F_SUPPORT_ZERO_COPY */ 1932 1731 if (ctx.auto_zc_fallback && 1933 1732 !((ctx.flags & 
UBLK_F_AUTO_BUF_REG) && ··· 1949 1738 (ctx.flags & UBLK_F_AUTO_BUF_REG && !ctx.auto_zc_fallback) + 1950 1739 ctx.auto_zc_fallback > 1) { 1951 1740 fprintf(stderr, "too many data copy modes specified\n"); 1741 + return -EINVAL; 1742 + } 1743 + 1744 + if (ctx.metadata_size) { 1745 + if (!(ctx.flags & UBLK_F_USER_COPY)) { 1746 + ublk_err("integrity requires user_copy\n"); 1747 + return -EINVAL; 1748 + } 1749 + 1750 + ctx.flags |= UBLK_F_INTEGRITY; 1751 + } else if (ctx.integrity_flags || 1752 + ctx.pi_offset || 1753 + ctx.csum_type != LBMD_PI_CSUM_NONE || 1754 + ctx.tag_size) { 1755 + ublk_err("integrity parameters require metadata_size\n"); 1756 + return -EINVAL; 1757 + } 1758 + 1759 + if ((ctx.flags & UBLK_F_AUTO_BUF_REG) && 1760 + (ctx.flags & UBLK_F_BATCH_IO) && 1761 + (ctx.nthreads > ctx.nr_hw_queues)) { 1762 + ublk_err("too many threads for F_AUTO_BUF_REG & F_BATCH_IO\n"); 1952 1763 return -EINVAL; 1953 1764 } 1954 1765 ··· 1999 1766 } 2000 1767 } else if (!strcmp(cmd, "del")) 2001 1768 ret = cmd_dev_del(&ctx); 1769 + else if (!strcmp(cmd, "stop")) 1770 + ret = cmd_dev_stop(&ctx); 2002 1771 else if (!strcmp(cmd, "list")) { 2003 1772 ctx.all = 1; 2004 1773 ret = cmd_dev_list(&ctx);
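The option checks added to kublk.c in this hunk can be condensed into a single function. What follows is a minimal sketch, not the selftest's code: `struct sketch_ctx`, `sketch_validate()` and the `SKETCH_F_*` bit values are hypothetical stand-ins (the real flag values come from the ublk UAPI header), and the existing data-copy-mode checks are left out. Only the three new rules are taken from the diff.

```c
#include <errno.h>
#include <stdint.h>

/* Placeholder bits: NOT the real UBLK_F_* values from the UAPI header. */
#define SKETCH_F_USER_COPY    (1ULL << 0)
#define SKETCH_F_AUTO_BUF_REG (1ULL << 1)
#define SKETCH_F_BATCH_IO     (1ULL << 2)
#define SKETCH_F_INTEGRITY    (1ULL << 3)

/* Stand-in for LBMD_PI_CSUM_NONE */
#define SKETCH_CSUM_NONE 0

/* Hypothetical subset of struct dev_ctx, just the fields the checks read */
struct sketch_ctx {
	uint64_t flags;
	unsigned int nthreads;
	unsigned int nr_hw_queues;
	unsigned int per_io_tasks;
	uint32_t integrity_flags;
	uint8_t metadata_size;
	uint8_t pi_offset;
	uint8_t csum_type;
	uint8_t tag_size;
};

/* Returns 0 if the option combination is accepted, -EINVAL otherwise. */
static int sketch_validate(struct sketch_ctx *ctx)
{
	/* per_io_tasks and batch I/O are mutually exclusive */
	if (ctx->per_io_tasks && (ctx->flags & SKETCH_F_BATCH_IO))
		return -EINVAL;

	if (ctx->metadata_size) {
		/* integrity needs user copy; a metadata size implies the flag */
		if (!(ctx->flags & SKETCH_F_USER_COPY))
			return -EINVAL;
		ctx->flags |= SKETCH_F_INTEGRITY;
	} else if (ctx->integrity_flags || ctx->pi_offset ||
		   ctx->csum_type != SKETCH_CSUM_NONE || ctx->tag_size) {
		/* any other integrity option without metadata_size is an error */
		return -EINVAL;
	}

	/* auto buffer registration plus batch I/O caps threads at the queue count */
	if ((ctx->flags & SKETCH_F_AUTO_BUF_REG) &&
	    (ctx->flags & SKETCH_F_BATCH_IO) &&
	    ctx->nthreads > ctx->nr_hw_queues)
		return -EINVAL;

	return 0;
}
```

Under these assumptions, a context with `metadata_size = 8` and the user-copy bit set passes and comes back with the integrity bit added, while `tag_size = 1` on its own is rejected.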

+208 -30

tools/testing/selftests/ublk/kublk.h

··· 78 78 unsigned int auto_zc_fallback:1; 79 79 unsigned int per_io_tasks:1; 80 80 unsigned int no_ublk_fixed_fd:1; 81 + unsigned int safe_stop:1; 82 + unsigned int no_auto_part_scan:1; 83 + __u32 integrity_flags; 84 + __u8 metadata_size; 85 + __u8 pi_offset; 86 + __u8 csum_type; 87 + __u8 tag_size; 81 88 82 89 int _evtfd; 83 90 int _shmid; ··· 114 107 115 108 struct ublk_io { 116 109 char *buf_addr; 110 + void *integrity_buf; 117 111 118 112 #define UBLKS_IO_NEED_FETCH_RQ (1UL << 0) 119 113 #define UBLKS_IO_NEED_COMMIT_RQ_COMP (1UL << 1) ··· 151 143 void (*usage)(const struct ublk_tgt_ops *ops); 152 144 153 145 /* return buffer index for UBLK_F_AUTO_BUF_REG */ 154 - unsigned short (*buf_index)(const struct ublk_queue *, int tag); 146 + unsigned short (*buf_index)(const struct ublk_thread *t, 147 + const struct ublk_queue *, int tag); 155 148 }; 156 149 157 150 struct ublk_tgt { ··· 174 165 const struct ublk_tgt_ops *tgt_ops; 175 166 struct ublksrv_io_desc *io_cmd_buf; 176 167 177 - /* borrow one bit of ublk uapi flags, which may never be used */ 168 + /* borrow three bit of ublk uapi flags, which may never be used */ 178 169 #define UBLKS_Q_AUTO_BUF_REG_FALLBACK (1ULL << 63) 179 170 #define UBLKS_Q_NO_UBLK_FIXED_FD (1ULL << 62) 171 + #define UBLKS_Q_PREPARED (1ULL << 61) 180 172 __u64 flags; 181 173 int ublk_fd; /* cached ublk char device fd */ 174 + __u8 metadata_size; 182 175 struct ublk_io ios[UBLK_QUEUE_DEPTH]; 176 + 177 + /* used for prep io commands */ 178 + pthread_spinlock_t lock; 179 + }; 180 + 181 + /* align with `ublk_elem_header` */ 182 + struct ublk_batch_elem { 183 + __u16 tag; 184 + __u16 buf_index; 185 + __s32 result; 186 + __u64 buf_addr; 187 + }; 188 + 189 + struct batch_commit_buf { 190 + unsigned short q_id; 191 + unsigned short buf_idx; 192 + void *elem; 193 + unsigned short done; 194 + unsigned short count; 195 + }; 196 + 197 + struct batch_fetch_buf { 198 + struct io_uring_buf_ring *br; 199 + void *fetch_buf; 200 + unsigned int 
fetch_buf_size; 201 + unsigned int fetch_buf_off; 183 202 }; 184 203 185 204 struct ublk_thread { 205 + /* Thread-local copy of queue-to-thread mapping for this thread */ 206 + unsigned char q_map[UBLK_MAX_QUEUES]; 207 + 186 208 struct ublk_dev *dev; 187 - unsigned idx; 209 + unsigned short idx; 210 + unsigned short nr_queues; 188 211 189 212 #define UBLKS_T_STOPPING (1U << 0) 190 213 #define UBLKS_T_IDLE (1U << 1) 214 + #define UBLKS_T_BATCH_IO (1U << 31) /* readonly */ 191 215 unsigned state; 192 216 unsigned int cmd_inflight; 193 217 unsigned int io_inflight; 218 + 219 + unsigned short nr_bufs; 220 + 221 + /* followings are for BATCH_IO */ 222 + unsigned short commit_buf_start; 223 + unsigned char commit_buf_elem_size; 224 + /* 225 + * We just support single device, so pre-calculate commit/prep flags 226 + */ 227 + unsigned short cmd_flags; 228 + unsigned int nr_commit_buf; 229 + unsigned int commit_buf_size; 230 + void *commit_buf; 231 + #define UBLKS_T_COMMIT_BUF_INV_IDX ((unsigned short)-1) 232 + struct allocator commit_buf_alloc; 233 + struct batch_commit_buf *commit; 234 + /* FETCH_IO_CMDS buffer */ 235 + unsigned short nr_fetch_bufs; 236 + struct batch_fetch_buf *fetch; 237 + 194 238 struct io_uring ring; 195 239 }; 196 240 ··· 264 202 265 203 extern int ublk_queue_io_cmd(struct ublk_thread *t, struct ublk_io *io); 266 204 205 + static inline int __ublk_use_batch_io(__u64 flags) 206 + { 207 + return flags & UBLK_F_BATCH_IO; 208 + } 209 + 210 + static inline int ublk_queue_batch_io(const struct ublk_queue *q) 211 + { 212 + return __ublk_use_batch_io(q->flags); 213 + } 214 + 215 + static inline int ublk_dev_batch_io(const struct ublk_dev *dev) 216 + { 217 + return __ublk_use_batch_io(dev->dev_info.flags); 218 + } 219 + 220 + /* only work for handle single device in this pthread context */ 221 + static inline int ublk_thread_batch_io(const struct ublk_thread *t) 222 + { 223 + return t->state & UBLKS_T_BATCH_IO; 224 + } 225 + 226 + static inline void 
ublk_set_integrity_params(const struct dev_ctx *ctx, 227 + struct ublk_params *params) 228 + { 229 + if (!ctx->metadata_size) 230 + return; 231 + 232 + params->types |= UBLK_PARAM_TYPE_INTEGRITY; 233 + params->integrity = (struct ublk_param_integrity) { 234 + .flags = ctx->integrity_flags, 235 + .interval_exp = params->basic.logical_bs_shift, 236 + .metadata_size = ctx->metadata_size, 237 + .pi_offset = ctx->pi_offset, 238 + .csum_type = ctx->csum_type, 239 + .tag_size = ctx->tag_size, 240 + }; 241 + } 242 + 243 + static inline size_t ublk_integrity_len(const struct ublk_queue *q, size_t len) 244 + { 245 + /* All targets currently use interval_exp = logical_bs_shift = 9 */ 246 + return (len >> 9) * q->metadata_size; 247 + } 248 + 249 + static inline size_t 250 + ublk_integrity_data_len(const struct ublk_queue *q, size_t integrity_len) 251 + { 252 + return (integrity_len / q->metadata_size) << 9; 253 + } 267 254 268 255 static inline int ublk_io_auto_zc_fallback(const struct ublksrv_io_desc *iod) 269 256 { ··· 335 224 { 336 225 /* we only have 7 bits to encode q_id */ 337 226 _Static_assert(UBLK_MAX_QUEUES_SHIFT <= 7, "UBLK_MAX_QUEUES_SHIFT must be <= 7"); 338 - assert(!(tag >> 16) && !(op >> 8) && !(tgt_data >> 16) && !(q_id >> 7)); 227 + ublk_assert(!(tag >> 16) && !(op >> 8) && !(tgt_data >> 16) && !(q_id >> 7)); 339 228 340 - return tag | (op << 16) | (tgt_data << 24) | 229 + return tag | ((__u64)op << 16) | ((__u64)tgt_data << 24) | 341 230 (__u64)q_id << 56 | (__u64)is_target_io << 63; 342 231 } 343 232 ··· 468 357 addr[1] = 0; 469 358 } 470 359 360 + static inline unsigned short ublk_batch_io_buf_idx( 361 + const struct ublk_thread *t, const struct ublk_queue *q, 362 + unsigned tag); 363 + 364 + static inline unsigned short ublk_io_buf_idx(const struct ublk_thread *t, 365 + const struct ublk_queue *q, 366 + unsigned tag) 367 + { 368 + if (ublk_queue_batch_io(q)) 369 + return ublk_batch_io_buf_idx(t, q, tag); 370 + return q->ios[tag].buf_index; 371 + } 372 + 
471 373 static inline struct ublk_io *ublk_get_io(struct ublk_queue *q, unsigned tag) 472 374 { 473 375 return &q->ios[tag]; 474 - } 475 - 476 - static inline int ublk_complete_io(struct ublk_thread *t, struct ublk_queue *q, 477 - unsigned tag, int res) 478 - { 479 - struct ublk_io *io = &q->ios[tag]; 480 - 481 - ublk_mark_io_done(io, res); 482 - 483 - return ublk_queue_io_cmd(t, io); 484 - } 485 - 486 - static inline void ublk_queued_tgt_io(struct ublk_thread *t, struct ublk_queue *q, 487 - unsigned tag, int queued) 488 - { 489 - if (queued < 0) 490 - ublk_complete_io(t, q, tag, queued); 491 - else { 492 - struct ublk_io *io = ublk_get_io(q, tag); 493 - 494 - t->io_inflight += queued; 495 - io->tgt_ios = queued; 496 - io->result = 0; 497 - } 498 376 } 499 377 500 378 static inline int ublk_completed_tgt_io(struct ublk_thread *t, ··· 521 421 return ublk_queue_use_zc(q) || ublk_queue_use_auto_zc(q); 522 422 } 523 423 424 + static inline int ublk_batch_commit_prepared(struct batch_commit_buf *cb) 425 + { 426 + return cb->buf_idx != UBLKS_T_COMMIT_BUF_INV_IDX; 427 + } 428 + 429 + static inline unsigned ublk_queue_idx_in_thread(const struct ublk_thread *t, 430 + const struct ublk_queue *q) 431 + { 432 + unsigned char idx; 433 + 434 + idx = t->q_map[q->q_id]; 435 + ublk_assert(idx != 0); 436 + return idx - 1; 437 + } 438 + 439 + /* 440 + * Each IO's buffer index has to be calculated by this helper for 441 + * UBLKS_T_BATCH_IO 442 + */ 443 + static inline unsigned short ublk_batch_io_buf_idx( 444 + const struct ublk_thread *t, const struct ublk_queue *q, 445 + unsigned tag) 446 + { 447 + return ublk_queue_idx_in_thread(t, q) * q->q_depth + tag; 448 + } 449 + 450 + /* Queue UBLK_U_IO_PREP_IO_CMDS for a specific queue with batch elements */ 451 + int ublk_batch_queue_prep_io_cmds(struct ublk_thread *t, struct ublk_queue *q); 452 + /* Start fetching I/O commands using multishot UBLK_U_IO_FETCH_IO_CMDS */ 453 + void ublk_batch_start_fetch(struct ublk_thread *t); 454 + /* 
Handle completion of batch I/O commands (prep/commit) */ 455 + void ublk_batch_compl_cmd(struct ublk_thread *t, 456 + const struct io_uring_cqe *cqe); 457 + /* Initialize batch I/O state and calculate buffer parameters */ 458 + void ublk_batch_prepare(struct ublk_thread *t); 459 + /* Allocate and register commit buffers for batch operations */ 460 + int ublk_batch_alloc_buf(struct ublk_thread *t); 461 + /* Free commit buffers and cleanup batch allocator */ 462 + void ublk_batch_free_buf(struct ublk_thread *t); 463 + 464 + /* Prepare a new commit buffer for batching completed I/O operations */ 465 + void ublk_batch_prep_commit(struct ublk_thread *t); 466 + /* Submit UBLK_U_IO_COMMIT_IO_CMDS with batched completed I/O operations */ 467 + void ublk_batch_commit_io_cmds(struct ublk_thread *t); 468 + /* Add a completed I/O operation to the current batch commit buffer */ 469 + void ublk_batch_complete_io(struct ublk_thread *t, struct ublk_queue *q, 470 + unsigned tag, int res); 471 + void ublk_batch_setup_map(unsigned char (*q_thread_map)[UBLK_MAX_QUEUES], 472 + int nthreads, int queues); 473 + 474 + static inline int ublk_complete_io(struct ublk_thread *t, struct ublk_queue *q, 475 + unsigned tag, int res) 476 + { 477 + if (ublk_queue_batch_io(q)) { 478 + ublk_batch_complete_io(t, q, tag, res); 479 + return 0; 480 + } else { 481 + struct ublk_io *io = &q->ios[tag]; 482 + 483 + ublk_mark_io_done(io, res); 484 + return ublk_queue_io_cmd(t, io); 485 + } 486 + } 487 + 488 + static inline void ublk_queued_tgt_io(struct ublk_thread *t, struct ublk_queue *q, 489 + unsigned tag, int queued) 490 + { 491 + if (queued < 0) 492 + ublk_complete_io(t, q, tag, queued); 493 + else { 494 + struct ublk_io *io = ublk_get_io(q, tag); 495 + 496 + t->io_inflight += queued; 497 + io->tgt_ios = queued; 498 + io->result = 0; 499 + } 500 + } 501 + 524 502 extern const struct ublk_tgt_ops null_tgt_ops; 525 503 extern const struct ublk_tgt_ops loop_tgt_ops; 526 504 extern const struct ublk_tgt_ops 
stripe_tgt_ops; 527 505 extern const struct ublk_tgt_ops fault_inject_tgt_ops; 528 506 529 507 void backing_file_tgt_deinit(struct ublk_dev *dev); 530 - int backing_file_tgt_init(struct ublk_dev *dev); 508 + int backing_file_tgt_init(struct ublk_dev *dev, unsigned int nr_direct); 531 509 532 510 #endif
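The kublk.h hunk above adds `(__u64)` casts to the `op` and `tgt_data` shifts in `build_user_data()`. A plausible reading is that `tgt_data` may be up to 16 bits wide and is shifted left by 24, so a 32-bit shift would drop its upper byte. The sketch below illustrates that effect; the `sketch_*` names are hypothetical, `uint32_t`/`uint64_t` stand in for `unsigned`/`__u64`, and the narrow variant exists only for comparison.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Field layout implied by the shifts in the diff:
 * tag in bits 0..15, op in 16..23, tgt_data in 24..39,
 * q_id in 56..62, is_target_io in bit 63.
 */
static inline uint64_t sketch_build_user_data(uint32_t tag, uint32_t op,
					      uint32_t tgt_data, uint32_t q_id,
					      uint32_t is_target_io)
{
	assert(!(tag >> 16) && !(op >> 8) && !(tgt_data >> 16) && !(q_id >> 7));

	/* widen before shifting so no bits fall off the top of 32 bits */
	return tag | ((uint64_t)op << 16) | ((uint64_t)tgt_data << 24) |
	       (uint64_t)q_id << 56 | (uint64_t)is_target_io << 63;
}

/* Same packing with the op/tgt_data shifts done in 32 bits (comparison only) */
static inline uint64_t sketch_build_user_data_narrow(uint32_t tag, uint32_t op,
						     uint32_t tgt_data,
						     uint32_t q_id,
						     uint32_t is_target_io)
{
	return tag | (op << 16) | (tgt_data << 24) |
	       (uint64_t)q_id << 56 | (uint64_t)is_target_io << 63;
}

/* Extract the 16-bit tgt_data field again */
static inline uint32_t sketch_user_data_to_tgt_data(uint64_t user_data)
{
	return (user_data >> 24) & 0xffff;
}
```

With `tgt_data = 0xABCD`, the widened version round-trips the full value, whereas the narrow version keeps only the low byte (`0xCD`) because the upper byte is shifted past bit 31.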

+36

tools/testing/selftests/ublk/metadata_size.c

···
+// SPDX-License-Identifier: GPL-2.0
+#include <fcntl.h>
+#include <linux/fs.h>
+#include <stdio.h>
+#include <sys/ioctl.h>
+
+int main(int argc, char **argv)
+{
+	struct logical_block_metadata_cap cap = {};
+	const char *filename;
+	int fd;
+	int result;
+
+	if (argc != 2) {
+		fprintf(stderr, "Usage: %s BLOCK_DEVICE\n", argv[0]);
+		return 1;
+	}
+
+	filename = argv[1];
+	fd = open(filename, O_RDONLY);
+	if (fd < 0) {
+		perror(filename);
+		return 1;
+	}
+
+	result = ioctl(fd, FS_IOC_GETLBMD_CAP, &cap);
+	if (result < 0) {
+		perror("ioctl");
+		return 1;
+	}
+
+	printf("metadata_size: %u\n", cap.lbmd_size);
+	printf("pi_offset: %u\n", cap.lbmd_pi_offset);
+	printf("pi_tuple_size: %u\n", cap.lbmd_pi_size);
+	return 0;
+}
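The `metadata_size` this program reports is a per-interval figure. In the same series, kublk.h converts between data length and integrity length with `ublk_integrity_len()` and `ublk_integrity_data_len()`, relying on its own comment that all targets currently use a 512-byte interval (shift of 9). The arithmetic can be sketched as below; the `sketch_*` names are hypothetical and `metadata_size` is passed directly instead of being read from a queue structure.

```c
#include <assert.h>
#include <stddef.h>

/* Interval assumed to be one 512-byte sector, as the kublk.h comment states */
#define SKETCH_INTERVAL_SHIFT 9

/* Integrity bytes needed to cover data_len bytes of data */
static inline size_t sketch_integrity_len(size_t metadata_size, size_t data_len)
{
	return (data_len >> SKETCH_INTERVAL_SHIFT) * metadata_size;
}

/* Data bytes covered by integrity_len bytes of metadata (inverse mapping) */
static inline size_t sketch_integrity_data_len(size_t metadata_size,
					       size_t integrity_len)
{
	assert(metadata_size != 0);
	return (integrity_len / metadata_size) << SKETCH_INTERVAL_SHIFT;
}
```

For example, a 4096-byte request with 8 bytes of metadata per sector spans 8 intervals and therefore needs 64 bytes of integrity data; feeding 64 back through the inverse gives 4096 again.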

+11 -8

tools/testing/selftests/ublk/null.c

··· 36 36 .max_segments = 32, 37 37 }, 38 38 }; 39 + ublk_set_integrity_params(ctx, &dev->tgt.params); 39 40 40 41 if (info->flags & UBLK_F_SUPPORT_ZERO_COPY) 41 42 dev->tgt.sq_depth = dev->tgt.cq_depth = 2 * info->queue_depth; ··· 44 43 } 45 44 46 45 static void __setup_nop_io(int tag, const struct ublksrv_io_desc *iod, 47 - struct io_uring_sqe *sqe, int q_id) 46 + struct io_uring_sqe *sqe, int q_id, unsigned buf_idx) 48 47 { 49 48 unsigned ublk_op = ublksrv_get_op(iod); 50 49 51 50 io_uring_prep_nop(sqe); 52 - sqe->buf_index = tag; 51 + sqe->buf_index = buf_idx; 53 52 sqe->flags |= IOSQE_FIXED_FILE; 54 53 sqe->rw_flags = IORING_NOP_FIXED_BUFFER | IORING_NOP_INJECT_RESULT; 55 54 sqe->len = iod->nr_sectors << 9; /* injected result */ ··· 61 60 { 62 61 const struct ublksrv_io_desc *iod = ublk_get_iod(q, tag); 63 62 struct io_uring_sqe *sqe[3]; 63 + unsigned short buf_idx = ublk_io_buf_idx(t, q, tag); 64 64 65 65 ublk_io_alloc_sqes(t, sqe, 3); 66 66 67 - io_uring_prep_buf_register(sqe[0], q, tag, q->q_id, ublk_get_io(q, tag)->buf_index); 67 + io_uring_prep_buf_register(sqe[0], q, tag, q->q_id, buf_idx); 68 68 sqe[0]->user_data = build_user_data(tag, 69 69 ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1); 70 70 sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK; 71 71 72 - __setup_nop_io(tag, iod, sqe[1], q->q_id); 72 + __setup_nop_io(tag, iod, sqe[1], q->q_id, buf_idx); 73 73 sqe[1]->flags |= IOSQE_IO_HARDLINK; 74 74 75 - io_uring_prep_buf_unregister(sqe[2], q, tag, q->q_id, ublk_get_io(q, tag)->buf_index); 75 + io_uring_prep_buf_unregister(sqe[2], q, tag, q->q_id, buf_idx); 76 76 sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, q->q_id, 1); 77 77 78 78 // buf register is marked as IOSQE_CQE_SKIP_SUCCESS ··· 87 85 struct io_uring_sqe *sqe[1]; 88 86 89 87 ublk_io_alloc_sqes(t, sqe, 1); 90 - __setup_nop_io(tag, iod, sqe[0], q->q_id); 88 + __setup_nop_io(tag, iod, sqe[0], q->q_id, ublk_io_buf_idx(t, q, tag)); 91 89 return 1; 92 90 } 93 
91 ··· 138 136 * return invalid buffer index for triggering auto buffer register failure, 139 137 * then UBLK_IO_RES_NEED_REG_BUF handling is covered 140 138 */ 141 - static unsigned short ublk_null_buf_index(const struct ublk_queue *q, int tag) 139 + static unsigned short ublk_null_buf_index(const struct ublk_thread *t, 140 + const struct ublk_queue *q, int tag) 142 141 { 143 142 if (ublk_queue_auto_zc_fallback(q)) 144 143 return (unsigned short)-1; 145 - return q->ios[tag].buf_index; 144 + return ublk_io_buf_idx(t, q, tag); 146 145 } 147 146 148 147 const struct ublk_tgt_ops null_tgt_ops = {
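The null target now takes its buffer index from `ublk_io_buf_idx(t, q, tag)` rather than using `tag` directly. According to the kublk.h part of this series, the batch path computes that index from the queue's position within the thread (`q_map[q_id]`, stored as position + 1 so that 0 means "not mapped"), the queue depth, and the tag. The sketch below mirrors that calculation with hypothetical `sketch_*` types; how `q_map` is populated (`ublk_batch_setup_map()`) is not shown in this diff, so the map contents are simply assumed.

```c
#include <assert.h>

#define SKETCH_MAX_QUEUES  32
#define SKETCH_QUEUE_DEPTH 128

/* Hypothetical reduction of struct ublk_queue */
struct sketch_queue {
	unsigned q_id;
	unsigned q_depth;
	int batch_io;                                 /* batch mode enabled? */
	unsigned short buf_index[SKETCH_QUEUE_DEPTH]; /* per-IO index, non-batch path */
};

/* Hypothetical reduction of struct ublk_thread */
struct sketch_thread {
	/* 0 = queue not handled by this thread, otherwise position + 1 */
	unsigned char q_map[SKETCH_MAX_QUEUES];
};

/* Zero-based position of queue q among the queues this thread handles */
static unsigned sketch_queue_idx_in_thread(const struct sketch_thread *t,
					   const struct sketch_queue *q)
{
	unsigned char idx = t->q_map[q->q_id];

	assert(idx != 0);
	return idx - 1;
}

/* Batch mode: one contiguous block of q_depth indices per mapped queue */
static unsigned short sketch_io_buf_idx(const struct sketch_thread *t,
					const struct sketch_queue *q,
					unsigned tag)
{
	if (q->batch_io)
		return (unsigned short)(sketch_queue_idx_in_thread(t, q) *
					q->q_depth + tag);
	return q->buf_index[tag];
}
```

With queues 1 and 3 mapped to positions 0 and 1 and a depth of 128, tag 5 on queue 3 resolves to index 133, so two queues served by the same thread should not collide in the thread's buffer table.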

+1

tools/testing/selftests/ublk/settings

···
+timeout=150

+14 -9

tools/testing/selftests/ublk/stripe.c

··· 96 96 this->seq = seq; 97 97 s->nr += 1; 98 98 } else { 99 - assert(seq == this->seq); 100 - assert(this->start + this->nr_sects == stripe_off); 99 + ublk_assert(seq == this->seq); 100 + ublk_assert(this->start + this->nr_sects == stripe_off); 101 101 this->nr_sects += nr_sects; 102 102 } 103 103 104 - assert(this->nr_vec < this->cap); 104 + ublk_assert(this->nr_vec < this->cap); 105 105 this->vec[this->nr_vec].iov_base = (void *)(base + done); 106 106 this->vec[this->nr_vec++].iov_len = nr_sects << 9; 107 107 ··· 120 120 return zc ? IORING_OP_READV_FIXED : IORING_OP_READV; 121 121 else if (ublk_op == UBLK_IO_OP_WRITE) 122 122 return zc ? IORING_OP_WRITEV_FIXED : IORING_OP_WRITEV; 123 - assert(0); 123 + ublk_assert(0); 124 124 } 125 125 126 126 static int stripe_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q, ··· 135 135 struct ublk_io *io = ublk_get_io(q, tag); 136 136 int i, extra = zc ? 2 : 0; 137 137 void *base = io->buf_addr; 138 + unsigned short buf_idx = ublk_io_buf_idx(t, q, tag); 138 139 139 140 io->private_data = s; 140 141 calculate_stripe_array(conf, iod, s, base); ··· 143 142 ublk_io_alloc_sqes(t, sqe, s->nr + extra); 144 143 145 144 if (zc) { 146 - io_uring_prep_buf_register(sqe[0], q, tag, q->q_id, io->buf_index); 145 + io_uring_prep_buf_register(sqe[0], q, tag, q->q_id, buf_idx); 147 146 sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK; 148 147 sqe[0]->user_data = build_user_data(tag, 149 148 ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1); ··· 159 158 t->start << 9); 160 159 io_uring_sqe_set_flags(sqe[i], IOSQE_FIXED_FILE); 161 160 if (auto_zc || zc) { 162 - sqe[i]->buf_index = tag; 161 + sqe[i]->buf_index = buf_idx; 163 162 if (zc) 164 163 sqe[i]->flags |= IOSQE_IO_HARDLINK; 165 164 } ··· 169 168 if (zc) { 170 169 struct io_uring_sqe *unreg = sqe[s->nr + 1]; 171 170 172 - io_uring_prep_buf_unregister(unreg, q, tag, q->q_id, io->buf_index); 171 + io_uring_prep_buf_unregister(unreg, q, tag, q->q_id, buf_idx); 173 172 
unreg->user_data = build_user_data( 174 173 tag, ublk_cmd_op_nr(unreg->cmd_op), 0, q->q_id, 1); 175 174 } ··· 299 298 ublk_err("%s: not support auto_zc_fallback\n", __func__); 300 299 return -EINVAL; 301 300 } 301 + if (ctx->metadata_size) { 302 + ublk_err("%s: integrity not supported\n", __func__); 303 + return -EINVAL; 304 + } 302 305 303 306 if ((chunk_size & (chunk_size - 1)) || !chunk_size) { 304 307 ublk_err("invalid chunk size %u\n", chunk_size); ··· 316 311 317 312 chunk_shift = ilog2(chunk_size); 318 313 319 - ret = backing_file_tgt_init(dev); 314 + ret = backing_file_tgt_init(dev, dev->tgt.nr_backing_files); 320 315 if (ret) 321 316 return ret; 322 317 323 318 if (!dev->tgt.nr_backing_files || dev->tgt.nr_backing_files > NR_STRIPE) 324 319 return -EINVAL; 325 320 326 - assert(dev->nr_fds == dev->tgt.nr_backing_files + 1); 321 + ublk_assert(dev->nr_fds == dev->tgt.nr_backing_files + 1); 327 322 328 323 for (i = 0; i < dev->tgt.nr_backing_files; i++) 329 324 dev->tgt.backing_file_size[i] &= ~((1 << chunk_shift) - 1);

+31

tools/testing/selftests/ublk/test_batch_01.sh

···
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+ERR_CODE=0
+
+if ! _have_feature "BATCH_IO"; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+_prep_test "generic" "test basic function of UBLK_F_BATCH_IO"
+
+_create_backfile 0 256M
+_create_backfile 1 256M
+
+dev_id=$(_add_ublk_dev -t loop -q 2 -b "${UBLK_BACKFILES[0]}")
+_check_add_dev $TID $?
+
+if ! _mkfs_mount_test /dev/ublkb"${dev_id}"; then
+	_cleanup_test "generic"
+	_show_result $TID 255
+fi
+
+dev_id=$(_add_ublk_dev -t stripe -b --auto_zc "${UBLK_BACKFILES[0]}" "${UBLK_BACKFILES[1]}")
+_check_add_dev $TID $?
+_mkfs_mount_test /dev/ublkb"${dev_id}"
+ERR_CODE=$?
+
+_cleanup_test "generic"
+_show_result $TID $ERR_CODE

+29

tools/testing/selftests/ublk/test_batch_02.sh

···
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+ERR_CODE=0
+
+if ! _have_feature "BATCH_IO"; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+if ! _have_program fio; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+_prep_test "generic" "test UBLK_F_BATCH_IO with 4_threads vs. 1_queues"
+
+_create_backfile 0 512M
+
+dev_id=$(_add_ublk_dev -t loop -q 1 --nthreads 4 -b "${UBLK_BACKFILES[0]}")
+_check_add_dev $TID $?
+
+# run fio over the ublk disk
+fio --name=job1 --filename=/dev/ublkb"${dev_id}" --ioengine=libaio --rw=readwrite \
+	--iodepth=32 --size=100M --numjobs=4 > /dev/null 2>&1
+ERR_CODE=$?
+
+_cleanup_test "generic"
+_show_result $TID $ERR_CODE

+29

tools/testing/selftests/ublk/test_batch_03.sh

···
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+ERR_CODE=0
+
+if ! _have_feature "BATCH_IO"; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+if ! _have_program fio; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+_prep_test "generic" "test UBLK_F_BATCH_IO with 1_threads vs. 4_queues"
+
+_create_backfile 0 512M
+
+dev_id=$(_add_ublk_dev -t loop -q 4 --nthreads 1 -b "${UBLK_BACKFILES[0]}")
+_check_add_dev $TID $?
+
+# run fio over the ublk disk
+fio --name=job1 --filename=/dev/ublkb"${dev_id}" --ioengine=libaio --rw=readwrite \
+	--iodepth=32 --size=100M --numjobs=4 > /dev/null 2>&1
+ERR_CODE=$?
+
+_cleanup_test "generic"
+_show_result $TID $ERR_CODE

+58 -15

tools/testing/selftests/ublk/test_common.sh

··· 1 1 #!/bin/bash 2 2 # SPDX-License-Identifier: GPL-2.0 3 3 4 + # Derive TID from script name: test_<type>_<num>.sh -> <type>_<num> 5 + # Can be overridden in test script after sourcing this file 6 + TID=$(basename "$0" .sh) 7 + TID=${TID#test_} 8 + 4 9 UBLK_SKIP_CODE=4 5 10 6 11 _have_program() { ··· 13 8 return 0 14 9 fi 15 10 return 1 11 + } 12 + 13 + # Sleep with awareness of parallel execution. 14 + # Usage: _ublk_sleep <normal_secs> <parallel_secs> 15 + _ublk_sleep() { 16 + if [ "${JOBS:-1}" -gt 1 ]; then 17 + sleep "$2" 18 + else 19 + sleep "$1" 20 + fi 16 21 } 17 22 18 23 _get_disk_dev_t() { ··· 58 43 old_file="${UBLK_BACKFILES[$index]}" 59 44 [ -f "$old_file" ] && rm -f "$old_file" 60 45 61 - new_file=$(mktemp ublk_file_"${new_size}"_XXXXX) 46 + new_file=$(mktemp ${UBLK_TEST_DIR}/ublk_file_"${new_size}"_XXXXX) 62 47 truncate -s "${new_size}" "${new_file}" 63 48 UBLK_BACKFILES["$index"]="$new_file" 64 49 } ··· 75 60 _create_tmp_dir() { 76 61 local my_file; 77 62 78 - my_file=$(mktemp -d ublk_dir_XXXXX) 63 + my_file=$(mktemp -d ${UBLK_TEST_DIR}/ublk_dir_XXXXX) 79 64 echo "$my_file" 80 65 } 81 66 ··· 116 101 fi 117 102 } 118 103 119 - _remove_ublk_devices() { 120 - ${UBLK_PROG} del -a 121 - modprobe -r ublk_drv > /dev/null 2>&1 122 - } 123 - 124 104 _get_ublk_dev_state() { 125 105 ${UBLK_PROG} list -n "$1" | grep "state" | awk '{print $11}' 126 106 } ··· 129 119 local type=$1 130 120 shift 1 131 121 modprobe ublk_drv > /dev/null 2>&1 132 - UBLK_TMP=$(mktemp ublk_test_XXXXX) 122 + local base_dir=${TMPDIR:-./ublktest-dir} 123 + mkdir -p "$base_dir" 124 + UBLK_TEST_DIR=$(mktemp -d ${base_dir}/${TID}.XXXXXX) 125 + UBLK_TMP=$(mktemp ${UBLK_TEST_DIR}/ublk_test_XXXXX) 133 126 [ "$UBLK_TEST_QUIET" -eq 0 ] && echo "ublk $type: $*" 127 + echo "ublk selftest: $TID starting at $(date '+%F %T')" | tee /dev/kmsg 134 128 } 135 129 136 130 _remove_test_files() ··· 176 162 } 177 163 178 164 _cleanup_test() { 179 - "${UBLK_PROG}" del -a 165 + if [ -f 
"${UBLK_TEST_DIR}/.ublk_devs" ]; then 166 + while read -r dev_id; do 167 + ${UBLK_PROG} del -n "${dev_id}" 168 + done < "${UBLK_TEST_DIR}/.ublk_devs" 169 + rm -f "${UBLK_TEST_DIR}/.ublk_devs" 170 + fi 180 171 181 172 _remove_files 173 + rmdir ${UBLK_TEST_DIR} 174 + echo "ublk selftest: $TID done at $(date '+%F %T')" | tee /dev/kmsg 182 175 } 183 176 184 177 _have_feature() ··· 218 197 fi 219 198 220 199 if [ "$settle" = "yes" ]; then 221 - udevadm settle 200 + udevadm settle --timeout=20 222 201 fi 223 202 224 203 if [[ "$dev_id" =~ ^[0-9]+$ ]]; then 204 + echo "$dev_id" >> "${UBLK_TEST_DIR}/.ublk_devs" 225 205 echo "${dev_id}" 226 206 else 227 207 return 255 ··· 242 220 local state 243 221 244 222 dev_id=$(_create_ublk_dev "recover" "yes" "$@") 245 - for ((j=0;j<20;j++)); do 223 + for ((j=0;j<100;j++)); do 246 224 state=$(_get_ublk_dev_state "${dev_id}") 247 225 [ "$state" == "LIVE" ] && break 248 226 sleep 1 ··· 262 240 return "$state" 263 241 fi 264 242 265 - for ((j=0;j<50;j++)); do 243 + for ((j=0;j<100;j++)); do 266 244 state=$(_get_ublk_dev_state "${dev_id}") 267 245 [ "$state" == "$exp_state" ] && break 268 246 sleep 1 ··· 281 259 daemon_pid=$(_get_ublk_daemon_pid "${dev_id}") 282 260 state=$(_get_ublk_dev_state "${dev_id}") 283 261 284 - for ((j=0;j<50;j++)); do 262 + for ((j=0;j<100;j++)); do 285 263 [ "$state" == "$exp_state" ] && break 286 264 kill -9 "$daemon_pid" > /dev/null 2>&1 287 265 sleep 1 ··· 290 268 echo "$state" 291 269 } 292 270 293 - __remove_ublk_dev_return() { 271 + _ublk_del_dev() { 294 272 local dev_id=$1 295 273 296 274 ${UBLK_PROG} del -n "${dev_id}" 275 + 276 + # Remove from tracking file 277 + if [ -f "${UBLK_TEST_DIR}/.ublk_devs" ]; then 278 + sed -i "/^${dev_id}$/d" "${UBLK_TEST_DIR}/.ublk_devs" 279 + fi 280 + } 281 + 282 + __remove_ublk_dev_return() { 283 + local dev_id=$1 284 + 285 + _ublk_del_dev "${dev_id}" 297 286 local res=$? 
298 - udevadm settle 287 + udevadm settle --timeout=20 299 288 return ${res} 300 289 } 301 290 ··· 415 382 _ublk_test_top_dir() 416 383 { 417 384 cd "$(dirname "$0")" && pwd 385 + } 386 + 387 + METADATA_SIZE_PROG="$(_ublk_test_top_dir)/metadata_size" 388 + 389 + _get_metadata_size() 390 + { 391 + local dev_id=$1 392 + local field=$2 393 + 394 + "$METADATA_SIZE_PROG" "/dev/ublkb$dev_id" | grep "$field" | grep -o "[0-9]*" 418 395 } 419 396 420 397 UBLK_PROG=$(_ublk_test_top_dir)/kublk

-48

tools/testing/selftests/ublk/test_generic_01.sh

···
-#!/bin/bash
-# SPDX-License-Identifier: GPL-2.0
-
-. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
-
-TID="generic_01"
-ERR_CODE=0
-
-if ! _have_program bpftrace; then
-	exit "$UBLK_SKIP_CODE"
-fi
-
-if ! _have_program fio; then
-	exit "$UBLK_SKIP_CODE"
-fi
-
-_prep_test "null" "sequential io order"
-
-dev_id=$(_add_ublk_dev -t null)
-_check_add_dev $TID $?
-
-dev_t=$(_get_disk_dev_t "$dev_id")
-bpftrace trace/seq_io.bt "$dev_t" "W" 1 > "$UBLK_TMP" 2>&1 &
-btrace_pid=$!
-sleep 2
-
-if ! kill -0 "$btrace_pid" > /dev/null 2>&1; then
-	_cleanup_test "null"
-	exit "$UBLK_SKIP_CODE"
-fi
-
-# run fio over this ublk disk
-fio --name=write_seq \
-	--filename=/dev/ublkb"${dev_id}" \
-	--ioengine=libaio --iodepth=16 \
-	--rw=write \
-	--size=512M \
-	--direct=1 \
-	--bs=4k > /dev/null 2>&1
-ERR_CODE=$?
-kill "$btrace_pid"
-wait
-if grep -q "io_out_of_order" "$UBLK_TMP"; then
-	cat "$UBLK_TMP"
-	ERR_CODE=255
-fi
-_cleanup_test "null"
-_show_result $TID $ERR_CODE

+15 -8

tools/testing/selftests/ublk/test_generic_02.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="generic_02" 7 6 ERR_CODE=0 8 7 9 8 if ! _have_program bpftrace; then ··· 13 14 exit "$UBLK_SKIP_CODE" 14 15 fi 15 16 16 - _prep_test "null" "sequential io order for MQ" 17 + _prep_test "null" "ublk dispatch won't reorder IO for MQ" 17 18 18 19 dev_id=$(_add_ublk_dev -t null -q 2) 19 20 _check_add_dev $TID $? ··· 21 22 dev_t=$(_get_disk_dev_t "$dev_id") 22 23 bpftrace trace/seq_io.bt "$dev_t" "W" 1 > "$UBLK_TMP" 2>&1 & 23 24 btrace_pid=$! 24 - sleep 2 25 25 26 - if ! kill -0 "$btrace_pid" > /dev/null 2>&1; then 26 + # Wait for bpftrace probes to be attached (BEGIN block prints BPFTRACE_READY) 27 + for _ in $(seq 100); do 28 + grep -q "BPFTRACE_READY" "$UBLK_TMP" 2>/dev/null && break 29 + sleep 0.1 30 + done 31 + 32 + if ! kill -0 "$btrace_pid" 2>/dev/null; then 27 33 _cleanup_test "null" 28 34 exit "$UBLK_SKIP_CODE" 29 35 fi 30 36 31 - # run fio over this ublk disk 32 - fio --name=write_seq \ 37 + # run fio over this ublk disk (pinned to CPU 0) 38 + taskset -c 0 fio --name=write_seq \ 33 39 --filename=/dev/ublkb"${dev_id}" \ 34 40 --ioengine=libaio --iodepth=16 \ 35 41 --rw=write \ ··· 44 40 ERR_CODE=$? 45 41 kill "$btrace_pid" 46 42 wait 47 - if grep -q "io_out_of_order" "$UBLK_TMP"; then 48 - cat "$UBLK_TMP" 43 + 44 + # Check for out-of-order completions detected by bpftrace 45 + if grep -q "^out_of_order:" "$UBLK_TMP"; then 46 + echo "I/O reordering detected:" 47 + grep "^out_of_order:" "$UBLK_TMP" 49 48 ERR_CODE=255 50 49 fi 51 50 _cleanup_test "null"
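
The test_generic_02.sh hunk above replaces a fixed "sleep 2" with polling for a readiness marker in the bpftrace output file. A minimal sketch of that pattern follows; wait_for_marker is a hypothetical helper, a background subshell stands in for bpftrace, and fractional sleep (as in GNU coreutils) is assumed.

```shell
#!/bin/bash
# wait_for_marker FILE MARKER [TRIES]: poll FILE every 0.1s until MARKER
# appears. Returns 0 when found, 1 after TRIES unsuccessful polls.
wait_for_marker() {
	local file=$1 marker=$2 tries=${3:-100}
	local i

	for ((i = 0; i < tries; i++)); do
		grep -q "$marker" "$file" 2>/dev/null && return 0
		sleep 0.1
	done
	return 1
}

log="$(mktemp)"
# Hypothetical producer standing in for the BEGIN block of seq_io.bt.
( sleep 0.3; echo "BPFTRACE_READY" >> "$log" ) &
producer_pid=$!

if wait_for_marker "$log" "BPFTRACE_READY" 50; then
	echo "ready"
else
	echo "timed out"
fi
wait "$producer_pid"
rm -f "$log"
```

Unlike this sketch, the loop in the diff does not report a timeout itself; it falls through to the existing "kill -0" check, which skips the test if bpftrace has exited.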

-1

tools/testing/selftests/ublk/test_generic_03.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="generic_03" 7 6 ERR_CODE=0 8 7 9 8 _prep_test "null" "check dma & segment limits for zero copy"

+5 -1

tools/testing/selftests/ublk/test_generic_04.sh tools/testing/selftests/ublk/test_recover_01.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="generic_04" 7 6 ERR_CODE=0 8 7 9 8 ublk_run_recover_test() ··· 24 25 _create_backfile 0 256M 25 26 _create_backfile 1 128M 26 27 _create_backfile 2 128M 28 + 29 + ublk_run_recover_test -t null -q 2 -r 1 -b & 30 + ublk_run_recover_test -t loop -q 2 -r 1 -b "${UBLK_BACKFILES[0]}" & 31 + ublk_run_recover_test -t stripe -q 2 -r 1 -b "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" & 32 + wait 27 33 28 34 ublk_run_recover_test -t null -q 2 -r 1 & 29 35 ublk_run_recover_test -t loop -q 2 -r 1 "${UBLK_BACKFILES[0]}" &

+5 -1

tools/testing/selftests/ublk/test_generic_05.sh tools/testing/selftests/ublk/test_recover_02.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="generic_05" 7 6 ERR_CODE=0 8 7 9 8 ublk_run_recover_test() ··· 28 29 _create_backfile 0 256M 29 30 _create_backfile 1 128M 30 31 _create_backfile 2 128M 32 + 33 + ublk_run_recover_test -t null -q 2 -r 1 -z -b & 34 + ublk_run_recover_test -t loop -q 2 -r 1 -z -b "${UBLK_BACKFILES[0]}" & 35 + ublk_run_recover_test -t stripe -q 2 -r 1 -z -b "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" & 36 + wait 31 37 32 38 ublk_run_recover_test -t null -q 2 -r 1 -z & 33 39 ublk_run_recover_test -t loop -q 2 -r 1 -z "${UBLK_BACKFILES[0]}" &

-1

tools/testing/selftests/ublk/test_generic_06.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="generic_06" 7 6 ERR_CODE=0 8 7 9 8 _prep_test "fault_inject" "fast cleanup when all I/Os of one hctx are in server"

-1

tools/testing/selftests/ublk/test_generic_07.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="generic_07" 7 6 ERR_CODE=0 8 7 9 8 if ! _have_program fio; then

-1

tools/testing/selftests/ublk/test_generic_08.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="generic_08" 7 6 ERR_CODE=0 8 7 9 8 if ! _have_feature "AUTO_BUF_REG"; then

-1

tools/testing/selftests/ublk/test_generic_09.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="generic_09" 7 6 ERR_CODE=0 8 7 9 8 if ! _have_feature "AUTO_BUF_REG"; then

-1

tools/testing/selftests/ublk/test_generic_10.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="generic_10" 7 6 ERR_CODE=0 8 7 9 8 if ! _have_feature "UPDATE_SIZE"; then

-1

tools/testing/selftests/ublk/test_generic_11.sh tools/testing/selftests/ublk/test_recover_03.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="generic_11" 7 6 ERR_CODE=0 8 7 9 8 ublk_run_quiesce_recover()

-1

tools/testing/selftests/ublk/test_generic_12.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="generic_12" 7 6 ERR_CODE=0 8 7 9 8 if ! _have_program bpftrace; then

-1

tools/testing/selftests/ublk/test_generic_13.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="generic_13" 7 6 ERR_CODE=0 8 7 9 8 _prep_test "null" "check that feature list is complete"

-1

tools/testing/selftests/ublk/test_generic_14.sh tools/testing/selftests/ublk/test_recover_04.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="generic_14" 7 6 ERR_CODE=0 8 7 9 8 ublk_run_recover_test()

+3 -4

tools/testing/selftests/ublk/test_generic_15.sh tools/testing/selftests/ublk/test_part_02.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="generic_15" 7 6 ERR_CODE=0 8 7 9 8 _test_partition_scan_no_hang() ··· 33 34 # The add command should return quickly because partition scan is async. 34 35 # Now sleep briefly to let the async partition scan work start and hit 35 36 # the delay in the fault_inject handler. 36 - sleep 1 37 + _ublk_sleep 1 5 37 38 38 39 # Kill the ublk daemon while partition scan is potentially blocked 39 40 # And check state transitions properly ··· 46 47 if [ "$state" != "${expected_state}" ]; then 47 48 echo "FAIL: Device state is $state, expected ${expected_state}" 48 49 ERR_CODE=255 49 - ${UBLK_PROG} del -n "${dev_id}" > /dev/null 2>&1 50 + _ublk_del_dev "${dev_id}" > /dev/null 2>&1 50 51 return 51 52 fi 52 53 echo "PASS: Device transitioned to ${expected_state} in ${elapsed}s without hanging" 53 54 54 55 # Clean up the device 55 - ${UBLK_PROG} del -n "${dev_id}" > /dev/null 2>&1 56 + _ublk_del_dev "${dev_id}" > /dev/null 2>&1 56 57 } 57 58 58 59 _prep_test "partition_scan" "verify async partition scan prevents IO hang"

+56

tools/testing/selftests/ublk/test_generic_16.sh

··· 1 + #!/bin/bash 2 + # SPDX-License-Identifier: GPL-2.0 3 + 4 + . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 + 6 + ERR_CODE=0 7 + 8 + _prep_test "null" "stop --safe command" 9 + 10 + # Check if SAFE_STOP_DEV feature is supported 11 + if ! _have_feature "SAFE_STOP_DEV"; then 12 + _cleanup_test "null" 13 + exit "$UBLK_SKIP_CODE" 14 + fi 15 + 16 + # Test 1: stop --safe on idle device should succeed 17 + dev_id=$(_add_ublk_dev -t null -q 2 -d 32) 18 + _check_add_dev $TID $? 19 + 20 + # Device is idle (no openers), stop --safe should succeed 21 + if ! ${UBLK_PROG} stop -n "${dev_id}" --safe; then 22 + echo "stop --safe on idle device failed unexpectedly!" 23 + ERR_CODE=255 24 + fi 25 + 26 + # Clean up device 27 + _ublk_del_dev "${dev_id}" > /dev/null 2>&1 28 + udevadm settle 29 + 30 + # Test 2: stop --safe on device with active opener should fail 31 + dev_id=$(_add_ublk_dev -t null -q 2 -d 32) 32 + _check_add_dev $TID $? 33 + 34 + # Open device in background (dd reads indefinitely) 35 + dd if=/dev/ublkb${dev_id} of=/dev/null bs=4k iflag=direct > /dev/null 2>&1 & 36 + dd_pid=$! 37 + 38 + # Give dd time to start 39 + sleep 0.2 40 + 41 + # Device has active opener, stop --safe should fail with -EBUSY 42 + if ${UBLK_PROG} stop -n "${dev_id}" --safe 2>/dev/null; then 43 + echo "stop --safe on busy device succeeded unexpectedly!" 44 + ERR_CODE=255 45 + fi 46 + 47 + # Kill dd and clean up 48 + kill $dd_pid 2>/dev/null 49 + wait $dd_pid 2>/dev/null 50 + 51 + # Now device should be idle, regular delete should work 52 + _ublk_del_dev "${dev_id}" 53 + udevadm settle 54 + 55 + _cleanup_test "null" 56 + _show_result $TID $ERR_CODE

+105

tools/testing/selftests/ublk/test_integrity_01.sh

··· 1 + #!/bin/bash 2 + # SPDX-License-Identifier: GPL-2.0 3 + 4 + . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 + 6 + ERR_CODE=0 7 + 8 + _check_value() { 9 + local name=$1 10 + local actual=$2 11 + local expected=$3 12 + 13 + if [ "$actual" != "$expected" ]; then 14 + echo "$name $actual != $expected" 15 + ERR_CODE=255 16 + return 1 17 + fi 18 + return 0 19 + } 20 + 21 + _test_metadata_only() { 22 + local dev_id 23 + 24 + dev_id=$(_add_ublk_dev -t null -u --no_auto_part_scan --metadata_size 8) 25 + _check_add_dev "$TID" $? 26 + 27 + _check_value "metadata_size" "$(_get_metadata_size "$dev_id" metadata_size)" 8 && 28 + _check_value "pi_offset" "$(_get_metadata_size "$dev_id" pi_offset)" 0 && 29 + _check_value "pi_tuple_size" "$(_get_metadata_size "$dev_id" pi_tuple_size)" 0 && 30 + _check_value "device_is_integrity_capable" \ 31 + "$(cat "/sys/block/ublkb$dev_id/integrity/device_is_integrity_capable")" 0 && 32 + _check_value "format" "$(cat "/sys/block/ublkb$dev_id/integrity/format")" nop && 33 + _check_value "protection_interval_bytes" \ 34 + "$(cat "/sys/block/ublkb$dev_id/integrity/protection_interval_bytes")" 512 && 35 + _check_value "tag_size" "$(cat "/sys/block/ublkb$dev_id/integrity/tag_size")" 0 36 + 37 + _ublk_del_dev "${dev_id}" 38 + } 39 + 40 + _test_integrity_capable_ip() { 41 + local dev_id 42 + 43 + dev_id=$(_add_ublk_dev -t null -u --no_auto_part_scan --integrity_capable --metadata_size 64 --pi_offset 56 --csum_type ip) 44 + _check_add_dev "$TID" $? 
45 + 46 + _check_value "metadata_size" "$(_get_metadata_size "$dev_id" metadata_size)" 64 && 47 + _check_value "pi_offset" "$(_get_metadata_size "$dev_id" pi_offset)" 56 && 48 + _check_value "pi_tuple_size" "$(_get_metadata_size "$dev_id" pi_tuple_size)" 8 && 49 + _check_value "device_is_integrity_capable" \ 50 + "$(cat "/sys/block/ublkb$dev_id/integrity/device_is_integrity_capable")" 1 && 51 + _check_value "format" "$(cat "/sys/block/ublkb$dev_id/integrity/format")" T10-DIF-TYPE3-IP && 52 + _check_value "protection_interval_bytes" \ 53 + "$(cat "/sys/block/ublkb$dev_id/integrity/protection_interval_bytes")" 512 && 54 + _check_value "tag_size" "$(cat "/sys/block/ublkb$dev_id/integrity/tag_size")" 0 55 + 56 + _ublk_del_dev "${dev_id}" 57 + } 58 + 59 + _test_integrity_reftag_t10dif() { 60 + local dev_id 61 + 62 + dev_id=$(_add_ublk_dev -t null -u --no_auto_part_scan --integrity_reftag --metadata_size 8 --csum_type t10dif) 63 + _check_add_dev "$TID" $? 64 + 65 + _check_value "metadata_size" "$(_get_metadata_size "$dev_id" metadata_size)" 8 && 66 + _check_value "pi_offset" "$(_get_metadata_size "$dev_id" pi_offset)" 0 && 67 + _check_value "pi_tuple_size" "$(_get_metadata_size "$dev_id" pi_tuple_size)" 8 && 68 + _check_value "device_is_integrity_capable" \ 69 + "$(cat "/sys/block/ublkb$dev_id/integrity/device_is_integrity_capable")" 0 && 70 + _check_value "format" "$(cat "/sys/block/ublkb$dev_id/integrity/format")" T10-DIF-TYPE1-CRC && 71 + _check_value "protection_interval_bytes" \ 72 + "$(cat "/sys/block/ublkb$dev_id/integrity/protection_interval_bytes")" 512 && 73 + _check_value "tag_size" "$(cat "/sys/block/ublkb$dev_id/integrity/tag_size")" 0 74 + 75 + _ublk_del_dev "${dev_id}" 76 + } 77 + 78 + _test_nvme_csum() { 79 + local dev_id 80 + 81 + dev_id=$(_add_ublk_dev -t null -u --no_auto_part_scan --metadata_size 16 --csum_type nvme --tag_size 8) 82 + _check_add_dev "$TID" $? 
83 + 84 + _check_value "metadata_size" "$(_get_metadata_size "$dev_id" metadata_size)" 16 && 85 + _check_value "pi_offset" "$(_get_metadata_size "$dev_id" pi_offset)" 0 && 86 + _check_value "pi_tuple_size" "$(_get_metadata_size "$dev_id" pi_tuple_size)" 16 && 87 + _check_value "device_is_integrity_capable" \ 88 + "$(cat "/sys/block/ublkb$dev_id/integrity/device_is_integrity_capable")" 0 && 89 + _check_value "format" "$(cat "/sys/block/ublkb$dev_id/integrity/format")" EXT-DIF-TYPE3-CRC64 && 90 + _check_value "protection_interval_bytes" \ 91 + "$(cat "/sys/block/ublkb$dev_id/integrity/protection_interval_bytes")" 512 && 92 + _check_value "tag_size" "$(cat "/sys/block/ublkb$dev_id/integrity/tag_size")" 8 93 + 94 + _ublk_del_dev "${dev_id}" 95 + } 96 + 97 + _prep_test "null" "integrity params" 98 + 99 + _test_metadata_only 100 + _test_integrity_capable_ip 101 + _test_integrity_reftag_t10dif 102 + _test_nvme_csum 103 + 104 + _cleanup_test 105 + _show_result "$TID" $ERR_CODE

+141

tools/testing/selftests/ublk/test_integrity_02.sh

··· 1 + #!/bin/bash 2 + # SPDX-License-Identifier: GPL-2.0 3 + 4 + . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 + 6 + if ! _have_program fio; then 7 + exit $UBLK_SKIP_CODE 8 + fi 9 + 10 + fio_version=$(fio --version) 11 + if [[ "$fio_version" =~ fio-[0-9]+\.[0-9]+$ ]]; then 12 + echo "Requires development fio version with https://github.com/axboe/fio/pull/1992" 13 + exit $UBLK_SKIP_CODE 14 + fi 15 + 16 + ERR_CODE=0 17 + 18 + # Global variables set during device setup 19 + dev_id="" 20 + fio_args="" 21 + fio_err="" 22 + 23 + _setup_device() { 24 + _create_backfile 0 256M 25 + _create_backfile 1 32M # 256M * (64 integrity bytes / 512 data bytes) 26 + 27 + local integrity_params="--integrity_capable --integrity_reftag 28 + --metadata_size 64 --pi_offset 56 --csum_type t10dif" 29 + dev_id=$(_add_ublk_dev -t loop -u $integrity_params "${UBLK_BACKFILES[@]}") 30 + _check_add_dev "$TID" $? 31 + 32 + # 1M * (64 integrity bytes / 512 data bytes) = 128K 33 + fio_args="--ioengine io_uring --direct 1 --bsrange 512-1M --iodepth 32 34 + --md_per_io_size 128K --pi_act 0 --pi_chk GUARD,REFTAG,APPTAG 35 + --filename /dev/ublkb$dev_id" 36 + 37 + fio_err=$(mktemp "${UBLK_TEST_DIR}"/fio_err_XXXXX) 38 + } 39 + 40 + _test_fill_and_verify() { 41 + fio --name fill --rw randwrite $fio_args > /dev/null 42 + if [ $? != 0 ]; then 43 + echo "fio fill failed" 44 + ERR_CODE=255 45 + return 1 46 + fi 47 + 48 + fio --name verify --rw randread $fio_args > /dev/null 49 + if [ $? != 0 ]; then 50 + echo "fio verify failed" 51 + ERR_CODE=255 52 + return 1 53 + fi 54 + } 55 + 56 + _test_corrupted_reftag() { 57 + local dd_reftag_args="bs=1 seek=60 count=4 oflag=dsync conv=notrunc status=none" 58 + local expected_err="REFTAG compare error: LBA: 0 Expected=0, Actual=" 59 + 60 + # Overwrite 4-byte reftag at offset 56 + 4 = 60 61 + dd if=/dev/urandom "of=${UBLK_BACKFILES[1]}" $dd_reftag_args 62 + if [ $? 
!= 0 ]; then 63 + echo "dd corrupted_reftag failed" 64 + ERR_CODE=255 65 + return 1 66 + fi 67 + 68 + if fio --name corrupted_reftag --rw randread $fio_args > /dev/null 2> "$fio_err"; then 69 + echo "fio corrupted_reftag unexpectedly succeeded" 70 + ERR_CODE=255 71 + return 1 72 + fi 73 + 74 + if ! grep -q "$expected_err" "$fio_err"; then 75 + echo "fio corrupted_reftag message not found: $expected_err" 76 + ERR_CODE=255 77 + return 1 78 + fi 79 + 80 + # Reset to 0 81 + dd if=/dev/zero "of=${UBLK_BACKFILES[1]}" $dd_reftag_args 82 + if [ $? != 0 ]; then 83 + echo "dd restore corrupted_reftag failed" 84 + ERR_CODE=255 85 + return 1 86 + fi 87 + } 88 + 89 + _test_corrupted_data() { 90 + local dd_data_args="bs=512 count=1 oflag=direct,dsync conv=notrunc status=none" 91 + local expected_err="Guard compare error: LBA: 0 Expected=0, Actual=" 92 + 93 + dd if=/dev/zero "of=${UBLK_BACKFILES[0]}" $dd_data_args 94 + if [ $? != 0 ]; then 95 + echo "dd corrupted_data failed" 96 + ERR_CODE=255 97 + return 1 98 + fi 99 + 100 + if fio --name corrupted_data --rw randread $fio_args > /dev/null 2> "$fio_err"; then 101 + echo "fio corrupted_data unexpectedly succeeded" 102 + ERR_CODE=255 103 + return 1 104 + fi 105 + 106 + if ! grep -q "$expected_err" "$fio_err"; then 107 + echo "fio corrupted_data message not found: $expected_err" 108 + ERR_CODE=255 109 + return 1 110 + fi 111 + } 112 + 113 + _test_bad_apptag() { 114 + local expected_err="APPTAG compare error: LBA: [0-9]* Expected=4321, Actual=1234" 115 + 116 + if fio --name bad_apptag --rw randread $fio_args --apptag 0x4321 > /dev/null 2> "$fio_err"; then 117 + echo "fio bad_apptag unexpectedly succeeded" 118 + ERR_CODE=255 119 + return 1 120 + fi 121 + 122 + if ! 
grep -q "$expected_err" "$fio_err"; then 123 + echo "fio bad_apptag message not found: $expected_err" 124 + ERR_CODE=255 125 + return 1 126 + fi 127 + } 128 + 129 + _prep_test "loop" "end-to-end integrity" 130 + 131 + _setup_device 132 + 133 + _test_fill_and_verify && \ 134 + _test_corrupted_reftag && \ 135 + _test_corrupted_data && \ 136 + _test_bad_apptag 137 + 138 + rm -f "$fio_err" 139 + 140 + _cleanup_test 141 + _show_result "$TID" $ERR_CODE
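
The sizes quoted in the comments of test_integrity_02.sh above (a 32M metadata backing file, a 128K per-I/O metadata buffer, the reftag at byte 60) follow from one 64-byte metadata tuple per 512-byte interval. The arithmetic can be checked with a short sketch; metadata_bytes is a hypothetical helper, and the 4-byte skip assumes the usual T10 DIF tuple layout of a 2-byte guard, a 2-byte application tag and a 4-byte reference tag.

```shell
#!/bin/bash
# Metadata bytes needed for a given amount of data, assuming one
# metadata_size-byte tuple per protection interval (512 bytes by default).
metadata_bytes() {
	local data_bytes=$1 metadata_size=$2 interval=${3:-512}

	echo $(( data_bytes / interval * metadata_size ))
}

MiB=$((1024 * 1024))

# 256M of data with 64 metadata bytes per 512 data bytes.
echo "backing file: $(( $(metadata_bytes $((256 * MiB)) 64) / MiB ))M"	# 32M
# Largest block size in --bsrange is 1M.
echo "per-IO buffer: $(( $(metadata_bytes "$MiB" 64) / 1024 ))K"	# 128K
# pi_offset 56, then skip the 2-byte guard and 2-byte application tag.
echo "reftag offset: $(( 56 + 2 + 2 ))"	# 60
```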

-1

tools/testing/selftests/ublk/test_loop_01.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="loop_01" 7 6 ERR_CODE=0 8 7 9 8 if ! _have_program fio; then

-1

tools/testing/selftests/ublk/test_loop_02.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="loop_02" 7 6 ERR_CODE=0 8 7 9 8 _prep_test "loop" "mkfs & mount & umount"

-1

tools/testing/selftests/ublk/test_loop_03.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="loop_03" 7 6 ERR_CODE=0 8 7 9 8 if ! _have_program fio; then

-1

tools/testing/selftests/ublk/test_loop_04.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="loop_04" 7 6 ERR_CODE=0 8 7 9 8 _prep_test "loop" "mkfs & mount & umount with zero copy"

-1

tools/testing/selftests/ublk/test_loop_05.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="loop_05" 7 6 ERR_CODE=0 8 7 9 8 if ! _have_program fio; then

-1

tools/testing/selftests/ublk/test_loop_06.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="loop_06" 7 6 ERR_CODE=0 8 7 9 8 if ! _have_program fio; then

-1

tools/testing/selftests/ublk/test_loop_07.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="loop_07" 7 6 ERR_CODE=0 8 7 9 8 _prep_test "loop" "mkfs & mount & umount with user copy"

-1

tools/testing/selftests/ublk/test_null_01.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="null_01" 7 6 ERR_CODE=0 8 7 9 8 if ! _have_program fio; then

-1

tools/testing/selftests/ublk/test_null_02.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="null_02" 7 6 ERR_CODE=0 8 7 9 8 if ! _have_program fio; then

-1

tools/testing/selftests/ublk/test_null_03.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="null_03" 7 6 ERR_CODE=0 8 7 9 8 if ! _have_program fio; then

+104

tools/testing/selftests/ublk/test_part_01.sh

··· 1 + #!/bin/bash 2 + # SPDX-License-Identifier: GPL-2.0 3 + 4 + . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 + 6 + ERR_CODE=0 7 + 8 + format_backing_file() 9 + { 10 + local backing_file=$1 11 + 12 + # Create ublk device to write partition table 13 + local tmp_dev=$(_add_ublk_dev -t loop "${backing_file}") 14 + [ $? -ne 0 ] && return 1 15 + 16 + # Write partition table with sfdisk 17 + sfdisk /dev/ublkb"${tmp_dev}" > /dev/null 2>&1 <<EOF 18 + label: dos 19 + start=2048, size=100MiB, type=83 20 + start=206848, size=100MiB, type=83 21 + EOF 22 + local ret=$? 23 + 24 + "${UBLK_PROG}" del -n "${tmp_dev}" 25 + 26 + return $ret 27 + } 28 + 29 + test_auto_part_scan() 30 + { 31 + local backing_file=$1 32 + 33 + # Create device WITHOUT --no_auto_part_scan 34 + local dev_id=$(_add_ublk_dev -t loop "${backing_file}") 35 + [ $? -ne 0 ] && return 1 36 + 37 + udevadm settle 38 + 39 + # Partitions should be auto-detected 40 + if [ ! -e /dev/ublkb"${dev_id}"p1 ] || [ ! -e /dev/ublkb"${dev_id}"p2 ]; then 41 + "${UBLK_PROG}" del -n "${dev_id}" 42 + return 1 43 + fi 44 + 45 + "${UBLK_PROG}" del -n "${dev_id}" 46 + return 0 47 + } 48 + 49 + test_no_auto_part_scan() 50 + { 51 + local backing_file=$1 52 + 53 + # Create device WITH --no_auto_part_scan 54 + local dev_id=$(_add_ublk_dev -t loop --no_auto_part_scan "${backing_file}") 55 + [ $? -ne 0 ] && return 1 56 + 57 + udevadm settle 58 + 59 + # Partitions should NOT be auto-detected 60 + if [ -e /dev/ublkb"${dev_id}"p1 ]; then 61 + "${UBLK_PROG}" del -n "${dev_id}" 62 + return 1 63 + fi 64 + 65 + # Manual scan should work 66 + blockdev --rereadpt /dev/ublkb"${dev_id}" > /dev/null 2>&1 67 + udevadm settle 68 + 69 + if [ ! -e /dev/ublkb"${dev_id}"p1 ] || [ ! -e /dev/ublkb"${dev_id}"p2 ]; then 70 + "${UBLK_PROG}" del -n "${dev_id}" 71 + return 1 72 + fi 73 + 74 + "${UBLK_PROG}" del -n "${dev_id}" 75 + return 0 76 + } 77 + 78 + if ! _have_program sfdisk || ! 
_have_program blockdev; then 79 + exit "$UBLK_SKIP_CODE" 80 + fi 81 + 82 + _prep_test "generic" "test UBLK_F_NO_AUTO_PART_SCAN" 83 + 84 + if ! _have_feature "UBLK_F_NO_AUTO_PART_SCAN"; then 85 + _cleanup_test "generic" 86 + exit "$UBLK_SKIP_CODE" 87 + fi 88 + 89 + 90 + # Create and format backing file with partition table 91 + _create_backfile 0 256M 92 + format_backing_file "${UBLK_BACKFILES[0]}" 93 + [ $? -ne 0 ] && ERR_CODE=255 94 + 95 + # Test normal auto partition scan 96 + [ "$ERR_CODE" -eq 0 ] && test_auto_part_scan "${UBLK_BACKFILES[0]}" 97 + [ $? -ne 0 ] && ERR_CODE=255 98 + 99 + # Test no auto partition scan with manual scan 100 + [ "$ERR_CODE" -eq 0 ] && test_no_auto_part_scan "${UBLK_BACKFILES[0]}" 101 + [ $? -ne 0 ] && ERR_CODE=255 102 + 103 + _cleanup_test "generic" 104 + _show_result $TID $ERR_CODE
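
The sfdisk script in format_backing_file above places the second partition at sector 206848. A short sketch, assuming 512-byte sectors as the unit for start= (next_start is a hypothetical helper), shows that this is exactly the first sector after a 100MiB partition starting at 2048, and that both partitions fit in the 256M backing file.

```shell
#!/bin/bash
SECTOR_SIZE=512	# assumed logical sector size

# next_start START SIZE_MIB: first sector after a partition that begins
# at sector START and is SIZE_MIB mebibytes long.
next_start() {
	local start=$1 size_mib=$2

	echo $(( start + size_mib * 1024 * 1024 / SECTOR_SIZE ))
}

p2_start=$(next_start 2048 100)
p2_end=$(next_start "$p2_start" 100)
total=$(( 256 * 1024 * 1024 / SECTOR_SIZE ))

echo "p2 starts at sector $p2_start"	# 206848
echo "p2 ends before sector $p2_end of $total"	# 411648 of 524288
```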

-1

tools/testing/selftests/ublk/test_stress_01.sh

··· 2 2 # SPDX-License-Identifier: GPL-2.0 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 - TID="stress_01" 6 5 ERR_CODE=0 7 6 8 7 ublk_io_and_remove()

-1

tools/testing/selftests/ublk/test_stress_02.sh

··· 2 2 # SPDX-License-Identifier: GPL-2.0 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 - TID="stress_02" 6 5 ERR_CODE=0 7 6 8 7 if ! _have_program fio; then

-1

tools/testing/selftests/ublk/test_stress_03.sh

··· 2 2 # SPDX-License-Identifier: GPL-2.0 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 - TID="stress_03" 6 5 ERR_CODE=0 7 6 8 7 ublk_io_and_remove()

-1

tools/testing/selftests/ublk/test_stress_04.sh

··· 2 2 # SPDX-License-Identifier: GPL-2.0 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 - TID="stress_04" 6 5 ERR_CODE=0 7 6 8 7 ublk_io_and_kill_daemon()

-1

tools/testing/selftests/ublk/test_stress_05.sh

··· 2 2 # SPDX-License-Identifier: GPL-2.0 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 - TID="stress_05" 6 5 ERR_CODE=0 7 6 8 7 if ! _have_program fio; then

-1

tools/testing/selftests/ublk/test_stress_06.sh

··· 2 2 # SPDX-License-Identifier: GPL-2.0 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 - TID="stress_06" 6 5 ERR_CODE=0 7 6 8 7 ublk_io_and_remove()

-1

tools/testing/selftests/ublk/test_stress_07.sh

··· 2 2 # SPDX-License-Identifier: GPL-2.0 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 - TID="stress_07" 6 5 ERR_CODE=0 7 6 8 7 ublk_io_and_kill_daemon()

+44

tools/testing/selftests/ublk/test_stress_08.sh

··· 1 + #!/bin/bash 2 + # SPDX-License-Identifier: GPL-2.0 3 + 4 + . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 + ERR_CODE=0 6 + 7 + ublk_io_and_remove() 8 + { 9 + run_io_and_remove "$@" 10 + ERR_CODE=$? 11 + if [ ${ERR_CODE} -ne 0 ]; then 12 + echo "$TID failure: $*" 13 + _show_result $TID $ERR_CODE 14 + fi 15 + } 16 + 17 + if ! _have_program fio; then 18 + exit "$UBLK_SKIP_CODE" 19 + fi 20 + 21 + if ! _have_feature "ZERO_COPY"; then 22 + exit "$UBLK_SKIP_CODE" 23 + fi 24 + if ! _have_feature "AUTO_BUF_REG"; then 25 + exit "$UBLK_SKIP_CODE" 26 + fi 27 + if ! _have_feature "BATCH_IO"; then 28 + exit "$UBLK_SKIP_CODE" 29 + fi 30 + 31 + _prep_test "stress" "run IO and remove device(zero copy)" 32 + 33 + _create_backfile 0 256M 34 + _create_backfile 1 128M 35 + _create_backfile 2 128M 36 + 37 + ublk_io_and_remove 8G -t null -q 4 -b & 38 + ublk_io_and_remove 256M -t loop -q 4 --auto_zc -b "${UBLK_BACKFILES[0]}" & 39 + ublk_io_and_remove 256M -t stripe -q 4 --auto_zc -b "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" & 40 + ublk_io_and_remove 8G -t null -q 4 -z --auto_zc --auto_zc_fallback -b & 41 + wait 42 + 43 + _cleanup_test "stress" 44 + _show_result $TID $ERR_CODE

+43

tools/testing/selftests/ublk/test_stress_09.sh

··· 1 + #!/bin/bash 2 + # SPDX-License-Identifier: GPL-2.0 3 + 4 + . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 + ERR_CODE=0 6 + 7 + ublk_io_and_kill_daemon() 8 + { 9 + run_io_and_kill_daemon "$@" 10 + ERR_CODE=$? 11 + if [ ${ERR_CODE} -ne 0 ]; then 12 + echo "$TID failure: $*" 13 + _show_result $TID $ERR_CODE 14 + fi 15 + } 16 + 17 + if ! _have_program fio; then 18 + exit "$UBLK_SKIP_CODE" 19 + fi 20 + if ! _have_feature "ZERO_COPY"; then 21 + exit "$UBLK_SKIP_CODE" 22 + fi 23 + if ! _have_feature "AUTO_BUF_REG"; then 24 + exit "$UBLK_SKIP_CODE" 25 + fi 26 + if ! _have_feature "BATCH_IO"; then 27 + exit "$UBLK_SKIP_CODE" 28 + fi 29 + 30 + _prep_test "stress" "run IO and kill ublk server(zero copy)" 31 + 32 + _create_backfile 0 256M 33 + _create_backfile 1 128M 34 + _create_backfile 2 128M 35 + 36 + ublk_io_and_kill_daemon 8G -t null -q 4 -z -b & 37 + ublk_io_and_kill_daemon 256M -t loop -q 4 --auto_zc -b "${UBLK_BACKFILES[0]}" & 38 + ublk_io_and_kill_daemon 256M -t stripe -q 4 -b "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" & 39 + ublk_io_and_kill_daemon 8G -t null -q 4 -z --auto_zc --auto_zc_fallback -b & 40 + wait 41 + 42 + _cleanup_test "stress" 43 + _show_result $TID $ERR_CODE

-1

tools/testing/selftests/ublk/test_stripe_01.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="stripe_01" 7 6 ERR_CODE=0 8 7 9 8 if ! _have_program fio; then

-1

tools/testing/selftests/ublk/test_stripe_02.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="stripe_02" 7 6 ERR_CODE=0 8 7 9 8 _prep_test "stripe" "mkfs & mount & umount"

-1

tools/testing/selftests/ublk/test_stripe_03.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="stripe_03" 7 6 ERR_CODE=0 8 7 9 8 if ! _have_program fio; then

-1

tools/testing/selftests/ublk/test_stripe_04.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="stripe_04" 7 6 ERR_CODE=0 8 7 9 8 _prep_test "stripe" "mkfs & mount & umount on zero copy"

-1

tools/testing/selftests/ublk/test_stripe_05.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="stripe_05" 7 6 ERR_CODE=0 8 7 9 8 if ! _have_program fio; then

-1

tools/testing/selftests/ublk/test_stripe_06.sh

··· 3 3 4 4 . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 5 6 - TID="stripe_06" 7 6 ERR_CODE=0 8 7 9 8 _prep_test "stripe" "mkfs & mount & umount on user copy"

+38 -9

tools/testing/selftests/ublk/trace/seq_io.bt

···
 	$1: dev_t
 	$2: RWBS
 	$3: strlen($2)
+
+	Track request order between block_io_start and block_rq_complete.
+	Sequence starts at 1 so 0 means "never seen". On first valid
+	completion, sync complete_seq to handle probe attachment races.
+	block_rq_complete listed first to reduce missed completion window.
 */
+
 BEGIN {
-	@last_rw[$1, str($2)] = (uint64)0;
+	@start_seq = (uint64)1;
+	@complete_seq = (uint64)0;
+	@out_of_order = (uint64)0;
+	@start_order[0] = (uint64)0;
+	delete(@start_order[0]);
+	printf("BPFTRACE_READY\n");
 }
+
 tracepoint:block:block_rq_complete
+/(int64)args.dev == $1 && !strncmp(args.rwbs, str($2), $3)/
 {
-	$dev = $1;
-	if ((int64)args.dev == $1 && !strncmp(args.rwbs, str($2), $3)) {
-		$last = @last_rw[$dev, str($2)];
-		if ((uint64)args.sector != $last) {
-			printf("io_out_of_order: exp %llu actual %llu\n",
-				args.sector, $last);
+	$expected = @start_order[args.sector];
+	if ($expected > 0) {
+		if (@complete_seq == 0) {
+			@complete_seq = $expected;
 		}
-		@last_rw[$dev, str($2)] = (args.sector + args.nr_sector);
+		if ($expected != @complete_seq) {
+			printf("out_of_order: sector %llu started at seq %llu but completed at seq %llu\n",
+				args.sector, $expected, @complete_seq);
+			@out_of_order = @out_of_order + 1;
+		}
+		delete(@start_order[args.sector]);
+		@complete_seq = @complete_seq + 1;
 	}
 }

+tracepoint:block:block_io_start
+/(int64)args.dev == $1 && !strncmp(args.rwbs, str($2), $3)/
+{
+	@start_order[args.sector] = @start_seq;
+	@start_seq = @start_seq + 1;
+}
+
 END {
-	clear(@last_rw);
+	printf("total_start: %llu total_complete: %llu out_of_order: %llu\n",
+		@start_seq - 1, @complete_seq, @out_of_order);
+	clear(@start_order);
+	clear(@start_seq);
+	clear(@complete_seq);
+	clear(@out_of_order);
 }
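
The bookkeeping in the new seq_io.bt above (number each request at block_io_start, then check at block_rq_complete that requests finish in the same order) can be modelled outside bpftrace. The following is only a bash model under stated assumptions: io_start and io_complete are hypothetical functions fed with hand-written sector numbers instead of tracepoint events, and bash 4 or later is assumed for associative arrays.

```shell
#!/bin/bash
# Bash model of the maps used by seq_io.bt; events are fed by hand.
declare -A start_order=()
start_seq=1		# starts at 1 so that 0 can mean "never seen"
complete_seq=0
out_of_order=0

io_start() {		# models block_io_start for one sector
	start_order[$1]=$start_seq
	start_seq=$((start_seq + 1))
}

io_complete() {		# models block_rq_complete for one sector
	local sector=$1
	local expected=${start_order[$sector]:-0}

	# The start was never observed (probe attached late): ignore it.
	(( expected > 0 )) || return 0
	# First valid completion: sync the expected sequence number.
	(( complete_seq == 0 )) && complete_seq=$expected
	if (( expected != complete_seq )); then
		echo "out_of_order: sector $sector started at seq $expected but completed at seq $complete_seq"
		out_of_order=$((out_of_order + 1))
	fi
	unset "start_order[$sector]"
	complete_seq=$((complete_seq + 1))
}

for s in 0 8 16 24; do io_start "$s"; done
for s in 0 16 8 24; do io_complete "$s"; done	# 8 and 16 are swapped
echo "total out_of_order: $out_of_order"	# 2
```

Both swapped requests are reported, because each one completes at a position other than the one it was started at. The test script only greps for lines beginning with "out_of_order:", so one report is enough to fail the run.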

+64

tools/testing/selftests/ublk/utils.h

···
 #define round_up(val, rnd) \
 	(((val) + ((rnd) - 1)) & ~((rnd) - 1))

+/* small sized & per-thread allocator */
+struct allocator {
+	unsigned int size;
+	cpu_set_t *set;
+};
+
+static inline int allocator_init(struct allocator *a, unsigned size)
+{
+	a->set = CPU_ALLOC(size);
+	a->size = size;
+
+	if (a->set)
+		return 0;
+	return -ENOMEM;
+}
+
+static inline void allocator_deinit(struct allocator *a)
+{
+	CPU_FREE(a->set);
+	a->set = NULL;
+	a->size = 0;
+}
+
+static inline int allocator_get(struct allocator *a)
+{
+	int i;
+
+	for (i = 0; i < a->size; i += 1) {
+		size_t set_size = CPU_ALLOC_SIZE(a->size);
+
+		if (!CPU_ISSET_S(i, set_size, a->set)) {
+			CPU_SET_S(i, set_size, a->set);
+			return i;
+		}
+	}
+
+	return -1;
+}
+
+static inline void allocator_put(struct allocator *a, int i)
+{
+	size_t set_size = CPU_ALLOC_SIZE(a->size);
+
+	if (i >= 0 && i < a->size)
+		CPU_CLR_S(i, set_size, a->set);
+}
+
+static inline int allocator_get_val(struct allocator *a, int i)
+{
+	size_t set_size = CPU_ALLOC_SIZE(a->size);
+
+	return CPU_ISSET_S(i, set_size, a->set);
+}
+
 static inline unsigned int ilog2(unsigned int x)
 {
 	if (x == 0)
···

 	va_start(ap, fmt);
 	vfprintf(stderr, fmt, ap);
+	va_end(ap);
 }

 static inline void ublk_log(const char *fmt, ...)
···

 		va_start(ap, fmt);
 		vfprintf(stdout, fmt, ap);
+		va_end(ap);
 	}
 }

···

 		va_start(ap, fmt);
 		vfprintf(stdout, fmt, ap);
+		va_end(ap);
 	}
 }
+
+#define ublk_assert(x) do { \
+	if (!(x)) { \
+		ublk_err("%s %d: assert!\n", __func__, __LINE__); \
+		assert(x); \
+	} \
+} while (0)

 #endif
