Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Merge tag 'block-6.16-20250614' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

- Fix for a deadlock on queue freeze with zoned writes

- Fix for zoned append emulation

- Two bio folio fixes, for sparsemem and for very large folios

- Fix for a performance regression introduced in 6.13 when plug
insertion was changed

- Fix for NVMe passthrough handling for polled IO

- Document the ublk auto registration feature

- loop lockdep warning fix

* tag 'block-6.16-20250614' of git://git.kernel.dk/linux:
nvme: always punt polled uring_cmd end_io work to task_work
Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists
block: Fix bvec_set_folio() for very large folios
bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP
block: use plug request list tail for one-shot backmerge attempt
block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work
block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion
ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
loop: move lo_set_size() out of queue freeze

+114 -38
+77
Documentation/block/ublk.rst
···
 parameter of `struct ublk_param_segment` with backend for avoiding
 unnecessary IO split, which usually hurts io_uring performance.
 
+Auto Buffer Registration
+------------------------
+
+The ``UBLK_F_AUTO_BUF_REG`` feature automatically handles buffer registration
+and unregistration for I/O requests, which simplifies the buffer management
+process and reduces overhead in the ublk server implementation.
+
+This is another feature flag for using zero copy, and it is compatible with
+``UBLK_F_SUPPORT_ZERO_COPY``.
+
+Feature Overview
+~~~~~~~~~~~~~~~~
+
+This feature automatically registers request buffers to the io_uring context
+before delivering I/O commands to the ublk server and unregisters them when
+completing I/O commands. This eliminates the need for manual buffer
+registration/unregistration via ``UBLK_IO_REGISTER_IO_BUF`` and
+``UBLK_IO_UNREGISTER_IO_BUF`` commands, then IO handling in ublk server
+can avoid dependency on the two uring_cmd operations.
+
+IOs can't be issued concurrently to io_uring if there is any dependency
+among these IOs. So this way not only simplifies ublk server implementation,
+but also makes concurrent IO handling becomes possible by removing the
+dependency on buffer registration & unregistration commands.
+
+Usage Requirements
+~~~~~~~~~~~~~~~~~~
+
+1. The ublk server must create a sparse buffer table on the same ``io_ring_ctx``
+   used for ``UBLK_IO_FETCH_REQ`` and ``UBLK_IO_COMMIT_AND_FETCH_REQ``. If
+   uring_cmd is issued on a different ``io_ring_ctx``, manual buffer
+   unregistration is required.
+
+2. Buffer registration data must be passed via uring_cmd's ``sqe->addr`` with the
+   following structure::
+
+    struct ublk_auto_buf_reg {
+        __u16 index;      /* Buffer index for registration */
+        __u8 flags;       /* Registration flags */
+        __u8 reserved0;   /* Reserved for future use */
+        __u32 reserved1;  /* Reserved for future use */
+    };
+
+   ublk_auto_buf_reg_to_sqe_addr() is for converting the above structure into
+   ``sqe->addr``.
+
+3. All reserved fields in ``ublk_auto_buf_reg`` must be zeroed.
+
+4. Optional flags can be passed via ``ublk_auto_buf_reg.flags``.
+
+Fallback Behavior
+~~~~~~~~~~~~~~~~~
+
+If auto buffer registration fails:
+
+1. When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
+
+   - The uring_cmd is completed
+   - ``UBLK_IO_F_NEED_REG_BUF`` is set in ``ublksrv_io_desc.op_flags``
+   - The ublk server must manually deal with the failure, such as, register
+     the buffer manually, or using user copy feature for retrieving the data
+     for handling ublk IO
+
+2. If fallback is not enabled:
+
+   - The ublk I/O request fails silently
+   - The uring_cmd won't be completed
+
+Limitations
+~~~~~~~~~~~
+
+- Requires same ``io_ring_ctx`` for all operations
+- May require manual buffer management in fallback cases
+- io_ring_ctx buffer table has a max size of 16K, which may not be enough
+  in case that too many ublk devices are handled by this single io_ring_ctx
+  and each one has very large queue depth
+
 References
 ==========
 
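As a rough illustration of usage requirements 2 and 3 in the documentation above, the sketch below packs the four fields of struct ublk_auto_buf_reg into a single 64-bit value of the kind carried in sqe->addr. It is a standalone userspace model, not the kernel UAPI helper: struct auto_buf_reg_sketch and the auto_buf_reg_*() functions are hypothetical names, and the bit layout (index in bits 0-15, flags in 16-23, reserved0 in 24-31, reserved1 in 32-63) is an assumption derived from the field order shown in the patch. Real ublk servers should rely on ublk_auto_buf_reg_to_sqe_addr() instead.

```c
#include <stdint.h>

/* Hypothetical userspace mirror of struct ublk_auto_buf_reg from the patch. */
struct auto_buf_reg_sketch {
	uint16_t index;     /* buffer index for registration */
	uint8_t  flags;     /* registration flags */
	uint8_t  reserved0; /* must be zero */
	uint32_t reserved1; /* must be zero */
};

/*
 * Pack the fields into one 64-bit word.
 * ASSUMPTION: fields occupy the word in declaration order, lowest bits first.
 */
static uint64_t auto_buf_reg_pack(const struct auto_buf_reg_sketch *reg)
{
	return (uint64_t)reg->index |
	       (uint64_t)reg->flags << 16 |
	       (uint64_t)reg->reserved0 << 24 |
	       (uint64_t)reg->reserved1 << 32;
}

/* Inverse of auto_buf_reg_pack(), useful for checking the round trip. */
static struct auto_buf_reg_sketch auto_buf_reg_unpack(uint64_t addr)
{
	struct auto_buf_reg_sketch reg = {
		.index     = (uint16_t)(addr & 0xffff),
		.flags     = (uint8_t)((addr >> 16) & 0xff),
		.reserved0 = (uint8_t)((addr >> 24) & 0xff),
		.reserved1 = (uint32_t)(addr >> 32),
	};
	return reg;
}

/* Requirement 3: accept only values whose reserved fields are zeroed. */
static int auto_buf_reg_reserved_ok(uint64_t addr)
{
	struct auto_buf_reg_sketch reg = auto_buf_reg_unpack(addr);

	return reg.reserved0 == 0 && reg.reserved1 == 0;
}
```

Under the assumed layout, a server would fill in index and flags, leave both reserved fields at zero, and store the packed value in the addr field of the fetch or commit-and-fetch uring_cmd.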
+13 -13
block/blk-merge.c
···
 	if (!plug || rq_list_empty(&plug->mq_list))
 		return false;
 
-	rq_list_for_each(&plug->mq_list, rq) {
-		if (rq->q == q) {
-			if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
-			    BIO_MERGE_OK)
-				return true;
-			break;
-		}
+	rq = plug->mq_list.tail;
+	if (rq->q == q)
+		return blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
+			BIO_MERGE_OK;
+	else if (!plug->multiple_queues)
+		return false;
 
-		/*
-		 * Only keep iterating plug list for merges if we have multiple
-		 * queues
-		 */
-		if (!plug->multiple_queues)
-			break;
+	rq_list_for_each(&plug->mq_list, rq) {
+		if (rq->q != q)
+			continue;
+		if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
+		    BIO_MERGE_OK)
+			return true;
+		break;
 	}
 	return false;
 }
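The control flow this hunk introduces can be modelled outside the kernel. The sketch below is a toy model rather than kernel code: struct toy_rq, struct toy_plug, toy_try_merge() and toy_plug_merge() are hypothetical names, the plug list is a plain array whose last element stands in for mq_list.tail, and a "merge" succeeds only when the bio starts at the sector where the request ends (a back merge). It shows the three cases in the patch: a one-shot attempt against the tail, an early exit when only one queue is plugged, and a scan for the first request on the matching queue otherwise.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct request and struct blk_plug. */
struct toy_rq {
	int queue_id;       /* which request queue this request belongs to */
	unsigned long end;  /* first sector after the request */
};

struct toy_plug {
	struct toy_rq *rqs; /* plugged requests; the last entry is the tail */
	size_t nr;
	bool multiple_queues;
};

/* Toy merge rule: a bio merges if it starts where the request ends. */
static bool toy_try_merge(struct toy_rq *rq, unsigned long bio_start,
			  unsigned long bio_len)
{
	if (rq->end != bio_start)
		return false;
	rq->end += bio_len;
	return true;
}

static bool toy_plug_merge(struct toy_plug *plug, int queue_id,
			   unsigned long bio_start, unsigned long bio_len)
{
	struct toy_rq *rq;
	size_t i;

	if (!plug || plug->nr == 0)
		return false;

	/* One-shot attempt against the most recently added request. */
	rq = &plug->rqs[plug->nr - 1];
	if (rq->queue_id == queue_id)
		return toy_try_merge(rq, bio_start, bio_len);
	else if (!plug->multiple_queues)
		return false;

	/* Multiple queues: try only the first request found for this queue. */
	for (i = 0; i < plug->nr; i++) {
		rq = &plug->rqs[i];
		if (rq->queue_id != queue_id)
			continue;
		return toy_try_merge(rq, bio_start, bio_len);
	}
	return false;
}
```

With a single plugged queue the function touches exactly one request per call, which is the cheap common case the shortlog entry describes as a one-shot backmerge attempt.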
+6 -2
block/blk-zoned.c
···
 	if (bio_flagged(bio, BIO_EMULATES_ZONE_APPEND)) {
 		bio->bi_opf &= ~REQ_OP_MASK;
 		bio->bi_opf |= REQ_OP_ZONE_APPEND;
+		bio_clear_flag(bio, BIO_EMULATES_ZONE_APPEND);
 	}
 
 	/*
···
 	spin_unlock_irqrestore(&zwplug->lock, flags);
 
 	bdev = bio->bi_bdev;
-	submit_bio_noacct_nocheck(bio);
 
 	/*
 	 * blk-mq devices will reuse the extra reference on the request queue
···
 	 * path for BIO-based devices will not do that. So drop this extra
 	 * reference here.
 	 */
-	if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO))
+	if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO)) {
+		bdev->bd_disk->fops->submit_bio(bio);
 		blk_queue_exit(bdev->bd_disk->queue);
+	} else {
+		blk_mq_submit_bio(bio);
+	}
 
 put_zwplug:
 	/* Drop the reference we took in disk_zone_wplug_schedule_bio_work(). */
+5 -6
drivers/block/loop.c
···
 	lo->lo_flags &= ~LOOP_SET_STATUS_CLEARABLE_FLAGS;
 	lo->lo_flags |= (info->lo_flags & LOOP_SET_STATUS_SETTABLE_FLAGS);
 
-	if (size_changed) {
-		loff_t new_size = get_size(lo->lo_offset, lo->lo_sizelimit,
-					   lo->lo_backing_file);
-		loop_set_size(lo, new_size);
-	}
-
 	/* update the direct I/O flag if lo_offset changed */
 	loop_update_dio(lo);
 
···
 	blk_mq_unfreeze_queue(lo->lo_queue, memflags);
 	if (partscan)
 		clear_bit(GD_SUPPRESS_PART_SCAN, &lo->lo_disk->state);
+	if (!err && size_changed) {
+		loff_t new_size = get_size(lo->lo_offset, lo->lo_sizelimit,
+					   lo->lo_backing_file);
+		loop_set_size(lo, new_size);
+	}
 out_unlock:
 	mutex_unlock(&lo->lo_mutex);
 	if (partscan)
+7 -14
drivers/nvme/host/ioctl.c
···
 	pdu->result = le64_to_cpu(nvme_req(req)->result.u64);
 
 	/*
-	 * For iopoll, complete it directly. Note that using the uring_cmd
-	 * helper for this is safe only because we check blk_rq_is_poll().
-	 * As that returns false if we're NOT on a polled queue, then it's
-	 * safe to use the polled completion helper.
-	 *
-	 * Otherwise, move the completion to task work.
+	 * IOPOLL could potentially complete this request directly, but
+	 * if multiple rings are polling on the same queue, then it's possible
+	 * for one ring to find completions for another ring. Punting the
+	 * completion via task_work will always direct it to the right
+	 * location, rather than potentially complete requests for ringA
+	 * under iopoll invocations from ringB.
 	 */
-	if (blk_rq_is_poll(req)) {
-		if (pdu->bio)
-			blk_rq_unmap_user(pdu->bio);
-		io_uring_cmd_iopoll_done(ioucmd, pdu->result, pdu->status);
-	} else {
-		io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
-	}
-
+	io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
 	return RQ_END_IO_FREE;
 }
 
+1 -1
include/linux/bio.h
···
 
 	fi->folio = page_folio(bvec->bv_page);
 	fi->offset = bvec->bv_offset +
-			PAGE_SIZE * (bvec->bv_page - &fi->folio->page);
+			PAGE_SIZE * folio_page_idx(fi->folio, bvec->bv_page);
 	fi->_seg_count = bvec->bv_len;
 	fi->length = min(folio_size(fi->folio) - fi->offset, fi->_seg_count);
 	fi->_next = folio_next(fi->folio);
+5 -2
include/linux/bvec.h
···
  * @offset: offset into the folio
  */
 static inline void bvec_set_folio(struct bio_vec *bv, struct folio *folio,
-		unsigned int len, unsigned int offset)
+		size_t len, size_t offset)
 {
-	bvec_set_page(bv, &folio->page, len, offset);
+	unsigned long nr = offset / PAGE_SIZE;
+
+	WARN_ON_ONCE(len > UINT_MAX);
+	bvec_set_page(bv, folio_page(folio, nr), len, offset % PAGE_SIZE);
 }
 
 /**
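The arithmetic behind the new bvec_set_folio() body can be checked in isolation. The sketch below is a userspace approximation under stated assumptions: a 4 KiB page size, and hypothetical names (SKETCH_PAGE_SIZE, struct page_pos, split_folio_offset(), truncate_to_u32()) that do not exist in the kernel. It splits a folio-relative byte offset into a page index and an in-page offset, as the patched code does with offset / PAGE_SIZE and offset % PAGE_SIZE, and it shows what a 32-bit parameter, like the unsigned int in the old signature, would presumably receive for an offset of 4 GiB or more.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL /* ASSUMPTION: 4 KiB pages */

struct page_pos {
	uint64_t page_idx;    /* which page of the folio holds the data */
	uint32_t page_offset; /* byte offset inside that page */
};

/* Mirrors the patched logic: offset / PAGE_SIZE and offset % PAGE_SIZE. */
static struct page_pos split_folio_offset(uint64_t offset)
{
	struct page_pos pos = {
		.page_idx    = offset / SKETCH_PAGE_SIZE,
		.page_offset = (uint32_t)(offset % SKETCH_PAGE_SIZE),
	};
	return pos;
}

/* What a 32-bit parameter would hold after an implicit narrowing. */
static uint32_t truncate_to_u32(uint64_t offset)
{
	return (uint32_t)offset;
}
```

For an offset of 5 GiB plus one page plus 100 bytes, the split yields an in-page offset of 100, while the truncated value loses the upper bits and would point into the wrong part of the folio.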