Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-7.0/block-stable-pages-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull bounce buffer dio for stable pages from Jens Axboe:
"This adds support for bounce buffering of dio for stable pages. This
was all done by Christoph. In his words:

This series tries to address the problem that pages under I/O can be
modified during direct I/O, even when the device or file system
requires stable pages during I/O to calculate checksums, parity or data
operations. It does so by adding block layer helpers to bounce buffer
an iov_iter into a bio, then wires that up in iomap and ultimately
XFS.

The reason that the file system even needs to know about it, is
because reads need a user context to copy the data back, and the
infrastructure to defer ioends to a workqueue currently sits in XFS.
I'm going to look into moving that into ioend and enabling it for
other file systems. Additionally btrfs already has its own
infrastructure for this, and actually an urgent need to bounce buffer,
so this should be useful there and could be wired up easily. In fact
the idea comes from patches by Qu that did this in btrfs.

This patch fixes all but one of the xfstests failures on T10 PI capable
devices (generic/095 seems to have issues with a mix of mmap and
splice still, I'm looking into that separately), and makes qemu VMs
running Windows, or Linux with swap enabled, work fine on an XFS file
on a device using PI.

Performance numbers on my (not exactly state of the art) NVMe PI test
setup:

Sequential reads using io_uring, QD=16.
Bandwidth and CPU usage (usr/sys):

| size |        zero copy         |          bounce          |
+------+--------------------------+--------------------------+
|   4k | 1316MiB/s (12.65/55.40%) | 1081MiB/s (11.76/49.78%) |
|  64K | 3370MiB/s ( 5.46/18.20%) | 3365MiB/s ( 4.47/15.68%) |
|   1M | 3401MiB/s ( 0.76/23.05%) | 3400MiB/s ( 0.80/09.06%) |
+------+--------------------------+--------------------------+

Sequential writes using io_uring, QD=16.
Bandwidth and CPU usage (usr/sys):

| size |        zero copy         |          bounce          |
+------+--------------------------+--------------------------+
|   4k |  882MiB/s (11.83/33.88%) |  750MiB/s (10.53/34.08%) |
|  64K | 2009MiB/s ( 7.33/15.80%) | 2007MiB/s ( 7.47/24.71%) |
|   1M | 1992MiB/s ( 7.26/ 9.13%) | 1992MiB/s ( 9.21/19.11%) |
+------+--------------------------+--------------------------+

Note that the 64k read numbers look really odd to me for the baseline
zero copy case, but are reproducible over many repeated runs.

The bounce read numbers should further improve when moving the PI
validation to the file system and removing the double context switch,
which I have patches for that will be sent out soon"
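
As a rough userspace illustration of the stable pages problem described in the message above, the sketch below compares a zero copy submission with a bounce buffered one. It is not kernel code: toy_checksum stands in for a T10 PI guard tag or parity calculation, and every toy_* name, as well as submit_zero_copy, submit_bounce and transfer_ok, is a hypothetical helper made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy checksum standing in for a PI guard tag or parity (assumption). */
static uint32_t toy_checksum(const unsigned char *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31 + buf[i];
	return sum;
}

/* Hypothetical in-flight I/O: the data pointer plus the checksum taken at submit. */
struct toy_io {
	const unsigned char *data;
	uint32_t csum;
};

/* Zero copy: the I/O points straight at the caller's buffer. */
static struct toy_io submit_zero_copy(const unsigned char *user, size_t len)
{
	struct toy_io io = { .data = user, .csum = toy_checksum(user, len) };

	return io;
}

/* Bounce: snapshot the caller's buffer first, then checksum the snapshot. */
static struct toy_io submit_bounce(unsigned char *bounce,
		const unsigned char *user, size_t len)
{
	struct toy_io io;

	assert(bounce != user);
	memcpy(bounce, user, len);
	io.data = bounce;
	io.csum = toy_checksum(bounce, len);
	return io;
}

/* Returns 1 if the data seen at "transfer time" still matches the checksum. */
static int transfer_ok(const struct toy_io *io, size_t len)
{
	return toy_checksum(io->data, len) == io->csum;
}
```

If the caller modifies its buffer after submission, transfer_ok() fails for the zero copy I/O but still succeeds for the bounced one, at the price of the extra memcpy that shows up in the 4k numbers above.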

* tag 'for-7.0/block-stable-pages-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
xfs: use bounce buffering direct I/O when the device requires stable pages
iomap: add a flag to bounce buffer direct I/O
iomap: support ioends for direct reads
iomap: rename IOMAP_DIO_DIRTY to IOMAP_DIO_USER_BACKED
iomap: free the bio before completing the dio
iomap: share code between iomap_dio_bio_end_io and iomap_finish_ioend_direct
iomap: split out the per-bio logic from iomap_dio_bio_iter
iomap: simplify iomap_dio_bio_iter
iomap: fix submission side handling of completion side errors
block: add helpers to bounce buffer an iov_iter into bios
block: remove bio_release_page
iov_iter: extract a iov_iter_extract_bvecs helper from bio code
block: open code bio_add_page and fix handling of mismatching P2P ranges
block: refactor get_contig_folio_len
block: add a BIO_MAX_SIZE constant and use it

+508 -241
+206 -128
block/bio.c
···
 {
 	if (bio->bi_vcnt >= bio->bi_max_vecs)
 		return true;
-	if (bio->bi_iter.bi_size > UINT_MAX - len)
+	if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len)
 		return true;
 	return false;
 }
···
 {
 	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
 		return 0;
-	if (bio->bi_iter.bi_size > UINT_MAX - len)
+	if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len)
 		return 0;
 
 	if (bio->bi_vcnt > 0) {
···
 {
 	unsigned long nr = off / PAGE_SIZE;
 
-	WARN_ON_ONCE(len > UINT_MAX);
+	WARN_ON_ONCE(len > BIO_MAX_SIZE);
 	__bio_add_page(bio, folio_page(folio, nr), len, off % PAGE_SIZE);
 }
 EXPORT_SYMBOL_GPL(bio_add_folio_nofail);
···
 {
 	unsigned long nr = off / PAGE_SIZE;
 
-	if (len > UINT_MAX)
+	if (len > BIO_MAX_SIZE)
 		return false;
 	return bio_add_page(bio, folio_page(folio, nr), len, off % PAGE_SIZE) > 0;
 }
···
 	bio_set_flag(bio, BIO_CLONED);
 }
 
-static unsigned int get_contig_folio_len(unsigned int *num_pages,
-		struct page **pages, unsigned int i,
-		struct folio *folio, size_t left,
-		size_t offset)
-{
-	size_t bytes = left;
-	size_t contig_sz = min_t(size_t, PAGE_SIZE - offset, bytes);
-	unsigned int j;
-
-	/*
-	 * We might COW a single page in the middle of
-	 * a large folio, so we have to check that all
-	 * pages belong to the same folio.
-	 */
-	bytes -= contig_sz;
-	for (j = i + 1; j < i + *num_pages; j++) {
-		size_t next = min_t(size_t, PAGE_SIZE, bytes);
-
-		if (page_folio(pages[j]) != folio ||
-		    pages[j] != pages[j - 1] + 1) {
-			break;
-		}
-		contig_sz += next;
-		bytes -= next;
-	}
-	*num_pages = j - i;
-
-	return contig_sz;
-}
-
-#define PAGE_PTRS_PER_BVEC	(sizeof(struct bio_vec) / sizeof(struct page *))
-
-/**
- * __bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
- * @bio: bio to add pages to
- * @iter: iov iterator describing the region to be mapped
- *
- * Extracts pages from *iter and appends them to @bio's bvec array. The pages
- * will have to be cleaned up in the way indicated by the BIO_PAGE_PINNED flag.
- * For a multi-segment *iter, this function only adds pages from the next
- * non-empty segment of the iov iterator.
- */
-static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
-{
-	iov_iter_extraction_t extraction_flags = 0;
-	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
-	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
-	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
-	struct page **pages = (struct page **)bv;
-	ssize_t size;
-	unsigned int num_pages, i = 0;
-	size_t offset, folio_offset, left, len;
-	int ret = 0;
-
-	/*
-	 * Move page array up in the allocated memory for the bio vecs as far as
-	 * possible so that we can start filling biovecs from the beginning
-	 * without overwriting the temporary page array.
-	 */
-	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
-	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
-
-	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
-		extraction_flags |= ITER_ALLOW_P2PDMA;
-
-	size = iov_iter_extract_pages(iter, &pages,
-			UINT_MAX - bio->bi_iter.bi_size,
-			nr_pages, extraction_flags, &offset);
-	if (unlikely(size <= 0))
-		return size ? size : -EFAULT;
-
-	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
-	for (left = size, i = 0; left > 0; left -= len, i += num_pages) {
-		struct page *page = pages[i];
-		struct folio *folio = page_folio(page);
-		unsigned int old_vcnt = bio->bi_vcnt;
-
-		folio_offset = ((size_t)folio_page_idx(folio, page) <<
-				PAGE_SHIFT) + offset;
-
-		len = min(folio_size(folio) - folio_offset, left);
-
-		num_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
-
-		if (num_pages > 1)
-			len = get_contig_folio_len(&num_pages, pages, i,
-					folio, left, offset);
-
-		if (!bio_add_folio(bio, folio, len, folio_offset)) {
-			WARN_ON_ONCE(1);
-			ret = -EINVAL;
-			goto out;
-		}
-
-		if (bio_flagged(bio, BIO_PAGE_PINNED)) {
-			/*
-			 * We're adding another fragment of a page that already
-			 * was part of the last segment. Undo our pin as the
-			 * page was pinned when an earlier fragment of it was
-			 * added to the bio and __bio_release_pages expects a
-			 * single pin per page.
-			 */
-			if (offset && bio->bi_vcnt == old_vcnt)
-				unpin_user_folio(folio, 1);
-		}
-		offset = 0;
-	}
-
-	iov_iter_revert(iter, left);
-out:
-	while (i < nr_pages)
-		bio_release_page(bio, pages[i++]);
-
-	return ret;
-}
-
 /*
  * Aligns the bio size to the len_align_mask, releasing excessive bio vecs that
  * __bio_iov_iter_get_pages may have inserted, and reverts the trimmed length
···
 			break;
 		}
 
-		bio_release_page(bio, bv->bv_page);
+		if (bio_flagged(bio, BIO_PAGE_PINNED))
+			unpin_user_page(bv->bv_page);
+
 		bio->bi_vcnt--;
 		nbytes -= bv->bv_len;
 	} while (nbytes);
···
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
 		unsigned len_align_mask)
 {
-	int ret = 0;
+	iov_iter_extraction_t flags = 0;
 
 	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
 		return -EIO;
···
 
 	if (iov_iter_extract_will_pin(iter))
 		bio_set_flag(bio, BIO_PAGE_PINNED);
-	do {
-		ret = __bio_iov_iter_get_pages(bio, iter);
-	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
+	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
+		flags |= ITER_ALLOW_P2PDMA;
 
-	if (bio->bi_vcnt)
-		return bio_iov_iter_align_down(bio, iter, len_align_mask);
-	return ret;
+	do {
+		ssize_t ret;
+
+		ret = iov_iter_extract_bvecs(iter, bio->bi_io_vec,
+				BIO_MAX_SIZE - bio->bi_iter.bi_size,
+				&bio->bi_vcnt, bio->bi_max_vecs, flags);
+		if (ret <= 0) {
+			if (!bio->bi_vcnt)
+				return ret;
+			break;
+		}
+		bio->bi_iter.bi_size += ret;
+	} while (iov_iter_count(iter) && !bio_full(bio, 0));
+
+	if (is_pci_p2pdma_page(bio->bi_io_vec->bv_page))
+		bio->bi_opf |= REQ_NOMERGE;
+	return bio_iov_iter_align_down(bio, iter, len_align_mask);
+}
+
+static struct folio *folio_alloc_greedy(gfp_t gfp, size_t *size)
+{
+	struct folio *folio;
+
+	while (*size > PAGE_SIZE) {
+		folio = folio_alloc(gfp | __GFP_NORETRY, get_order(*size));
+		if (folio)
+			return folio;
+		*size = rounddown_pow_of_two(*size - 1);
+	}
+
+	return folio_alloc(gfp, get_order(*size));
+}
+
+static void bio_free_folios(struct bio *bio)
+{
+	struct bio_vec *bv;
+	int i;
+
+	bio_for_each_bvec_all(bv, bio, i) {
+		struct folio *folio = page_folio(bv->bv_page);
+
+		if (!is_zero_folio(folio))
+			folio_put(folio);
+	}
+}
+
+static int bio_iov_iter_bounce_write(struct bio *bio, struct iov_iter *iter)
+{
+	size_t total_len = iov_iter_count(iter);
+
+	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
+		return -EINVAL;
+	if (WARN_ON_ONCE(bio->bi_iter.bi_size))
+		return -EINVAL;
+	if (WARN_ON_ONCE(bio->bi_vcnt >= bio->bi_max_vecs))
+		return -EINVAL;
+
+	do {
+		size_t this_len = min(total_len, SZ_1M);
+		struct folio *folio;
+
+		if (this_len > PAGE_SIZE * 2)
+			this_len = rounddown_pow_of_two(this_len);
+
+		if (bio->bi_iter.bi_size > BIO_MAX_SIZE - this_len)
+			break;
+
+		folio = folio_alloc_greedy(GFP_KERNEL, &this_len);
+		if (!folio)
+			break;
+		bio_add_folio_nofail(bio, folio, this_len, 0);
+
+		if (copy_from_iter(folio_address(folio), this_len, iter) !=
+				this_len) {
+			bio_free_folios(bio);
+			return -EFAULT;
+		}
+
+		total_len -= this_len;
+	} while (total_len && bio->bi_vcnt < bio->bi_max_vecs);
+
+	if (!bio->bi_iter.bi_size)
+		return -ENOMEM;
+	return 0;
+}
+
+static int bio_iov_iter_bounce_read(struct bio *bio, struct iov_iter *iter)
+{
+	size_t len = min(iov_iter_count(iter), SZ_1M);
+	struct folio *folio;
+
+	folio = folio_alloc_greedy(GFP_KERNEL, &len);
+	if (!folio)
+		return -ENOMEM;
+
+	do {
+		ssize_t ret;
+
+		ret = iov_iter_extract_bvecs(iter, bio->bi_io_vec + 1, len,
+				&bio->bi_vcnt, bio->bi_max_vecs - 1, 0);
+		if (ret <= 0) {
+			if (!bio->bi_vcnt)
+				return ret;
+			break;
+		}
+		len -= ret;
+		bio->bi_iter.bi_size += ret;
+	} while (len && bio->bi_vcnt < bio->bi_max_vecs - 1);
+
+	/*
+	 * Set the folio directly here. The above loop has already calculated
+	 * the correct bi_size, and we use bi_vcnt for the user buffers. That
+	 * is safe as bi_vcnt is only used by the submitter and not the actual
+	 * I/O path.
+	 */
+	bvec_set_folio(&bio->bi_io_vec[0], folio, bio->bi_iter.bi_size, 0);
+	if (iov_iter_extract_will_pin(iter))
+		bio_set_flag(bio, BIO_PAGE_PINNED);
+	return 0;
+}
+
+/**
+ * bio_iov_iter_bounce - bounce buffer data from an iter into a bio
+ * @bio: bio to send
+ * @iter: iter to read from / write into
+ *
+ * Helper for direct I/O implementations that need to bounce buffer because
+ * we need to checksum the data or perform other operations that require
+ * consistency. Allocates folios to back the bounce buffer, and for writes
+ * copies the data into it. Needs to be paired with bio_iov_iter_unbounce()
+ * called on completion.
+ */
+int bio_iov_iter_bounce(struct bio *bio, struct iov_iter *iter)
+{
+	if (op_is_write(bio_op(bio)))
+		return bio_iov_iter_bounce_write(bio, iter);
+	return bio_iov_iter_bounce_read(bio, iter);
+}
+
+static void bvec_unpin(struct bio_vec *bv, bool mark_dirty)
+{
+	struct folio *folio = page_folio(bv->bv_page);
+	size_t nr_pages = (bv->bv_offset + bv->bv_len - 1) / PAGE_SIZE -
+			bv->bv_offset / PAGE_SIZE + 1;
+
+	if (mark_dirty)
+		folio_mark_dirty_lock(folio);
+	unpin_user_folio(folio, nr_pages);
+}
+
+static void bio_iov_iter_unbounce_read(struct bio *bio, bool is_error,
+		bool mark_dirty)
+{
+	unsigned int len = bio->bi_io_vec[0].bv_len;
+
+	if (likely(!is_error)) {
+		void *buf = bvec_virt(&bio->bi_io_vec[0]);
+		struct iov_iter to;
+
+		iov_iter_bvec(&to, ITER_DEST, bio->bi_io_vec + 1, bio->bi_vcnt,
+				len);
+		/* copying to pinned pages should always work */
+		WARN_ON_ONCE(copy_to_iter(buf, len, &to) != len);
+	} else {
+		/* No need to mark folios dirty if never copied to them */
+		mark_dirty = false;
+	}
+
+	if (bio_flagged(bio, BIO_PAGE_PINNED)) {
+		int i;
+
+		for (i = 0; i < bio->bi_vcnt; i++)
+			bvec_unpin(&bio->bi_io_vec[1 + i], mark_dirty);
+	}
+
+	folio_put(page_folio(bio->bi_io_vec[0].bv_page));
+}
+
+/**
+ * bio_iov_iter_unbounce - finish a bounce buffer operation
+ * @bio: completed bio
+ * @is_error: %true if an I/O error occurred and data should not be copied
+ * @mark_dirty: If %true, folios will be marked dirty.
+ *
+ * Helper for direct I/O implementations that need to bounce buffer because
+ * we need to checksum the data or perform other operations that require
+ * consistency. Called to complete a bio set up by bio_iov_iter_bounce().
+ * Copies data back for reads, and marks the original folios dirty if
+ * requested and then frees the bounce buffer.
+ */
+void bio_iov_iter_unbounce(struct bio *bio, bool is_error, bool mark_dirty)
+{
+	if (op_is_write(bio_op(bio)))
+		bio_free_folios(bio);
+	else
+		bio_iov_iter_unbounce_read(bio, is_error, mark_dirty);
 }
 
 static void submit_bio_wait_endio(struct bio *bio)
+4 -5
block/blk-lib.c
···
 	 * Align the bio size to the discard granularity to make splitting the bio
 	 * at discard granularity boundaries easier in the driver if needed.
 	 */
-	return round_down(UINT_MAX, discard_granularity) >> SECTOR_SHIFT;
+	return round_down(BIO_MAX_SIZE, discard_granularity) >> SECTOR_SHIFT;
 }
 
 struct bio *blk_alloc_discard_bio(struct block_device *bdev,
···
 {
 	sector_t bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
 
-	return min(bdev_write_zeroes_sectors(bdev),
-		(UINT_MAX >> SECTOR_SHIFT) & ~bs_mask);
+	return min(bdev_write_zeroes_sectors(bdev), BIO_MAX_SECTORS & ~bs_mask);
 }
 
 /*
···
 	int ret = 0;
 
 	/* make sure that "len << SECTOR_SHIFT" doesn't overflow */
-	if (max_sectors > UINT_MAX >> SECTOR_SHIFT)
-		max_sectors = UINT_MAX >> SECTOR_SHIFT;
+	if (max_sectors > BIO_MAX_SECTORS)
+		max_sectors = BIO_MAX_SECTORS;
 	max_sectors &= ~bs_mask;
 
 	if (max_sectors == 0)
+4 -4
block/blk-merge.c
···
 }
 
 /*
- * The max size one bio can handle is UINT_MAX becasue bvec_iter.bi_size
- * is defined as 'unsigned int', meantime it has to be aligned to with the
+ * The maximum size that a bio can fit has to be aligned down to the
  * logical block size, which is the minimum accepted unit by hardware.
  */
 static unsigned int bio_allowed_max_sectors(const struct queue_limits *lim)
 {
-	return round_down(UINT_MAX, lim->logical_block_size) >> SECTOR_SHIFT;
+	return round_down(BIO_MAX_SIZE, lim->logical_block_size) >>
+			SECTOR_SHIFT;
 }
 
 /*
···
 
 	rq_for_each_bvec(bv, rq, iter)
 		bvec_split_segs(&rq->q->limits, &bv, &nr_phys_segs, &bytes,
-				UINT_MAX, UINT_MAX);
+				UINT_MAX, BIO_MAX_SIZE);
 	return nr_phys_segs;
 }
 
-11
block/blk.h
···
 
 struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 		struct lock_class_key *lkclass);
-
-/*
- * Clean up a page appropriately, where the page may be pinned, may have a
- * ref taken on it or neither.
- */
-static inline void bio_release_page(struct bio *bio, struct page *page)
-{
-	if (bio_flagged(bio, BIO_PAGE_PINNED))
-		unpin_user_page(page);
-}
-
 struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id);
 
 int disk_scan_partitions(struct gendisk *disk, blk_mode_t mode);
+104 -87
fs/iomap/direct-io.c
···
 #define IOMAP_DIO_WRITE_THROUGH	(1U << 28)
 #define IOMAP_DIO_NEED_SYNC	(1U << 29)
 #define IOMAP_DIO_WRITE		(1U << 30)
-#define IOMAP_DIO_DIRTY		(1U << 31)
+#define IOMAP_DIO_USER_BACKED	(1U << 31)
 
 struct iomap_dio {
 	struct kiocb		*iocb;
···
 	iomap_dio_complete_work(&dio->aio.work);
 }
 
-void iomap_dio_bio_end_io(struct bio *bio)
+static void __iomap_dio_bio_end_io(struct bio *bio, bool inline_completion)
 {
 	struct iomap_dio *dio = bio->bi_private;
-	bool should_dirty = (dio->flags & IOMAP_DIO_DIRTY);
 
-	if (bio->bi_status)
-		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
-
-	if (atomic_dec_and_test(&dio->ref))
-		iomap_dio_done(dio);
-
-	if (should_dirty) {
+	if (dio->flags & IOMAP_DIO_BOUNCE) {
+		bio_iov_iter_unbounce(bio, !!dio->error,
+				dio->flags & IOMAP_DIO_USER_BACKED);
+		bio_put(bio);
+	} else if (dio->flags & IOMAP_DIO_USER_BACKED) {
 		bio_check_pages_dirty(bio);
 	} else {
 		bio_release_pages(bio, false);
 		bio_put(bio);
 	}
+
+	/* Do not touch bio below, we just gave up our reference. */
+
+	if (atomic_dec_and_test(&dio->ref)) {
+		/*
+		 * Avoid another context switch for the completion when already
+		 * called from the ioend completion workqueue.
+		 */
+		if (inline_completion)
+			dio->flags &= ~IOMAP_DIO_COMP_WORK;
+		iomap_dio_done(dio);
+	}
+}
+
+void iomap_dio_bio_end_io(struct bio *bio)
+{
+	struct iomap_dio *dio = bio->bi_private;
+
+	if (bio->bi_status)
+		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
+	__iomap_dio_bio_end_io(bio, false);
 }
 EXPORT_SYMBOL_GPL(iomap_dio_bio_end_io);
 
 u32 iomap_finish_ioend_direct(struct iomap_ioend *ioend)
 {
 	struct iomap_dio *dio = ioend->io_bio.bi_private;
-	bool should_dirty = (dio->flags & IOMAP_DIO_DIRTY);
 	u32 vec_count = ioend->io_bio.bi_vcnt;
 
 	if (ioend->io_error)
 		iomap_dio_set_error(dio, ioend->io_error);
-
-	if (atomic_dec_and_test(&dio->ref)) {
-		/*
-		 * Try to avoid another context switch for the completion given
-		 * that we are already called from the ioend completion
-		 * workqueue.
-		 */
-		dio->flags &= ~IOMAP_DIO_COMP_WORK;
-		iomap_dio_done(dio);
-	}
-
-	if (should_dirty) {
-		bio_check_pages_dirty(&ioend->io_bio);
-	} else {
-		bio_release_pages(&ioend->io_bio, false);
-		bio_put(&ioend->io_bio);
-	}
+	__iomap_dio_bio_end_io(&ioend->io_bio, true);
 
 	/*
 	 * Return the number of bvecs completed as even direct I/O completions
···
 	return 0;
 }
 
+static ssize_t iomap_dio_bio_iter_one(struct iomap_iter *iter,
+		struct iomap_dio *dio, loff_t pos, unsigned int alignment,
+		blk_opf_t op)
+{
+	unsigned int nr_vecs;
+	struct bio *bio;
+	ssize_t ret;
+
+	if (dio->flags & IOMAP_DIO_BOUNCE)
+		nr_vecs = bio_iov_bounce_nr_vecs(dio->submit.iter, op);
+	else
+		nr_vecs = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
+
+	bio = iomap_dio_alloc_bio(iter, dio, nr_vecs, op);
+	fscrypt_set_bio_crypt_ctx(bio, iter->inode,
+			pos >> iter->inode->i_blkbits, GFP_KERNEL);
+	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
+	bio->bi_write_hint = iter->inode->i_write_hint;
+	bio->bi_ioprio = dio->iocb->ki_ioprio;
+	bio->bi_private = dio;
+	bio->bi_end_io = iomap_dio_bio_end_io;
+
+	if (dio->flags & IOMAP_DIO_BOUNCE)
+		ret = bio_iov_iter_bounce(bio, dio->submit.iter);
+	else
+		ret = bio_iov_iter_get_pages(bio, dio->submit.iter,
+				alignment - 1);
+	if (unlikely(ret))
+		goto out_put_bio;
+	ret = bio->bi_iter.bi_size;
+
+	/*
+	 * An atomic write bio must cover the complete length. If it doesn't,
+	 * error out.
+	 */
+	if ((op & REQ_ATOMIC) && WARN_ON_ONCE(ret != iomap_length(iter))) {
+		ret = -EINVAL;
+		goto out_put_bio;
+	}
+
+	if (dio->flags & IOMAP_DIO_WRITE)
+		task_io_account_write(ret);
+	else if ((dio->flags & IOMAP_DIO_USER_BACKED) &&
+		 !(dio->flags & IOMAP_DIO_BOUNCE))
+		bio_set_pages_dirty(bio);
+
+	/*
+	 * We can only poll for single bio I/Os.
+	 */
+	if (iov_iter_count(dio->submit.iter))
+		dio->iocb->ki_flags &= ~IOCB_HIPRI;
+	iomap_dio_submit_bio(iter, dio, bio, pos);
+	return ret;
+
+out_put_bio:
+	bio_put(bio);
+	return ret;
+}
+
 static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
 {
 	const struct iomap *iomap = &iter->iomap;
···
 	const loff_t length = iomap_length(iter);
 	loff_t pos = iter->pos;
 	blk_opf_t bio_opf = REQ_SYNC | REQ_IDLE;
-	struct bio *bio;
 	bool need_zeroout = false;
-	int nr_pages, ret = 0;
 	u64 copied = 0;
 	size_t orig_count;
 	unsigned int alignment;
+	ssize_t ret = 0;
 
 	/*
 	 * File systems that write out of place and always allocate new blocks
···
 		goto out;
 	}
 
-	nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
 	do {
-		size_t n;
-		if (dio->error) {
-			iov_iter_revert(dio->submit.iter, copied);
-			copied = ret = 0;
+		/*
+		 * If completions already occurred and reported errors, give up now and
+		 * don't bother submitting more bios.
+		 */
+		if (unlikely(data_race(dio->error)))
 			goto out;
-		}
 
-		bio = iomap_dio_alloc_bio(iter, dio, nr_pages, bio_opf);
-		fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
-				GFP_KERNEL);
-		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
-		bio->bi_write_hint = inode->i_write_hint;
-		bio->bi_ioprio = dio->iocb->ki_ioprio;
-		bio->bi_private = dio;
-		bio->bi_end_io = iomap_dio_bio_end_io;
-
-		ret = bio_iov_iter_get_pages(bio, dio->submit.iter,
-				alignment - 1);
-		if (unlikely(ret)) {
+		ret = iomap_dio_bio_iter_one(iter, dio, pos, alignment, bio_opf);
+		if (unlikely(ret < 0)) {
 			/*
 			 * We have to stop part way through an IO. We must fall
 			 * through to the sub-block tail zeroing here, otherwise
 			 * this short IO may expose stale data in the tail of
 			 * the block we haven't written data to.
 			 */
-			bio_put(bio);
-			goto zero_tail;
+			break;
 		}
-
-		n = bio->bi_iter.bi_size;
-		if (WARN_ON_ONCE((bio_opf & REQ_ATOMIC) && n != length)) {
-			/*
-			 * An atomic write bio must cover the complete length,
-			 * which it doesn't, so error. We may need to zero out
-			 * the tail (complete FS block), similar to when
-			 * bio_iov_iter_get_pages() returns an error, above.
-			 */
-			ret = -EINVAL;
-			bio_put(bio);
-			goto zero_tail;
-		}
-		if (dio->flags & IOMAP_DIO_WRITE)
-			task_io_account_write(n);
-		else if (dio->flags & IOMAP_DIO_DIRTY)
-			bio_set_pages_dirty(bio);
-
-		dio->size += n;
-		copied += n;
-
-		nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter,
-				BIO_MAX_VECS);
-		/*
-		 * We can only poll for single bio I/Os.
-		 */
-		if (nr_pages)
-			dio->iocb->ki_flags &= ~IOCB_HIPRI;
-		iomap_dio_submit_bio(iter, dio, bio, pos);
-		pos += n;
-	} while (nr_pages);
+		dio->size += ret;
+		copied += ret;
+		pos += ret;
+		ret = 0;
+	} while (iov_iter_count(dio->submit.iter));
 
 	/*
 	 * We need to zeroout the tail of a sub-block write if the extent type
···
 	 * the block tail in the latter case, we can expose stale data via mmap
 	 * reads of the EOF block.
 	 */
-zero_tail:
 	if (need_zeroout ||
 	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
 		/* zero out from the end of the write to the end of the block */
···
 	dio->i_size = i_size_read(inode);
 	dio->dops = dops;
 	dio->error = 0;
-	dio->flags = 0;
+	dio->flags = dio_flags & (IOMAP_DIO_FSBLOCK_ALIGNED | IOMAP_DIO_BOUNCE);
 	dio->done_before = done_before;
 
 	dio->submit.iter = iter;
···
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		iomi.flags |= IOMAP_NOWAIT;
 
-	if (dio_flags & IOMAP_DIO_FSBLOCK_ALIGNED)
-		dio->flags |= IOMAP_DIO_FSBLOCK_ALIGNED;
-
 	if (iov_iter_rw(iter) == READ) {
 		if (iomi.pos >= dio->i_size)
 			goto out_free_dio;
 
 		if (user_backed_iter(iter))
-			dio->flags |= IOMAP_DIO_USER_BACKED;
+			dio->flags |= IOMAP_DIO_USER_BACKED;
 
 		ret = kiocb_write_and_wait(iocb, iomi.len);
 		if (ret)
+8
fs/iomap/ioend.c
···
 static bool iomap_ioend_can_merge(struct iomap_ioend *ioend,
 		struct iomap_ioend *next)
 {
+	/*
+	 * There is no point in merging reads as there is no completion
+	 * processing that can be easily batched up for them.
+	 */
+	if (bio_op(&ioend->io_bio) == REQ_OP_READ ||
+	    bio_op(&next->io_bio) == REQ_OP_READ)
+		return false;
+
 	if (ioend->io_bio.bi_status != next->io_bio.bi_status)
 		return false;
 	if (next->io_flags & IOMAP_IOEND_BOUNDARY)
+6 -2
fs/xfs/xfs_aops.c
···
  * IO write completion.
  */
 STATIC void
-xfs_end_ioend(
+xfs_end_ioend_write(
 	struct iomap_ioend	*ioend)
 {
 	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
···
 			io_list))) {
 		list_del_init(&ioend->io_list);
 		iomap_ioend_try_merge(ioend, &tmp);
-		xfs_end_ioend(ioend);
+		if (bio_op(&ioend->io_bio) == REQ_OP_READ)
+			iomap_finish_ioends(ioend,
+				blk_status_to_errno(ioend->io_bio.bi_status));
+		else
+			xfs_end_ioend_write(ioend);
 		cond_resched();
 	}
 }
+38 -3
fs/xfs/xfs_file.c
···
 	return 0;
 }
 
+/*
+ * Bounce buffering dio reads need a user context to copy back the data.
+ * Use an ioend to provide that.
+ */
+static void
+xfs_dio_read_bounce_submit_io(
+	const struct iomap_iter	*iter,
+	struct bio		*bio,
+	loff_t			file_offset)
+{
+	iomap_init_ioend(iter->inode, bio, file_offset, IOMAP_IOEND_DIRECT);
+	bio->bi_end_io = xfs_end_bio;
+	submit_bio(bio);
+}
+
+static const struct iomap_dio_ops xfs_dio_read_bounce_ops = {
+	.submit_io	= xfs_dio_read_bounce_submit_io,
+	.bio_set	= &iomap_ioend_bioset,
+};
+
 STATIC ssize_t
 xfs_file_dio_read(
 	struct kiocb		*iocb,
 	struct iov_iter		*to)
 {
 	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
+	unsigned int		dio_flags = 0;
+	const struct iomap_dio_ops *dio_ops = NULL;
 	ssize_t			ret;
 
 	trace_xfs_file_direct_read(iocb, to);
···
 	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
 	if (ret)
 		return ret;
-	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, NULL, 0);
+	if (mapping_stable_writes(iocb->ki_filp->f_mapping)) {
+		dio_ops = &xfs_dio_read_bounce_ops;
+		dio_flags |= IOMAP_DIO_BOUNCE;
+	}
+	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, dio_ops, dio_flags,
+			NULL, 0);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	return ret;
···
 		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
 		iolock = XFS_IOLOCK_SHARED;
 	}
+	if (mapping_stable_writes(iocb->ki_filp->f_mapping))
+		dio_flags |= IOMAP_DIO_BOUNCE;
 	trace_xfs_file_direct_write(iocb, from);
 	ret = iomap_dio_rw(iocb, from, ops, dops, dio_flags, ac, 0);
 out_unlock:
···
 {
 	unsigned int		iolock = XFS_IOLOCK_SHARED;
 	ssize_t			ret, ocount = iov_iter_count(from);
+	unsigned int		dio_flags = 0;
 	const struct iomap_ops	*dops;
 
 	/*
···
 	}
 
 	trace_xfs_file_direct_write(iocb, from);
-	ret = iomap_dio_rw(iocb, from, dops, &xfs_dio_write_ops,
-			0, NULL, 0);
+	if (mapping_stable_writes(iocb->ki_filp->f_mapping))
+		dio_flags |= IOMAP_DIO_BOUNCE;
+	ret = iomap_dio_rw(iocb, from, dops, &xfs_dio_write_ops, dio_flags,
+			NULL, 0);
 
 	/*
 	 * The retry mechanism is based on the ->iomap_begin method returning
···
 	 */
 	if (flags & IOMAP_DIO_FORCE_WAIT)
 		inode_dio_wait(VFS_I(ip));
+
+	if (mapping_stable_writes(iocb->ki_filp->f_mapping))
+		flags |= IOMAP_DIO_BOUNCE;
 
 	trace_xfs_file_direct_write(iocb, from);
 	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
+26
include/linux/bio.h
···
 	return iov_iter_npages(iter, max_segs);
 }

+/**
+ * bio_iov_bounce_nr_vecs - calculate number of bvecs for a bounce bio
+ * @iter: iter to bounce from
+ * @op: REQ_OP_* for the bio
+ *
+ * Calculates how many bvecs are needed for the next bio to bounce from/to
+ * @iter.
+ */
+static inline unsigned short
+bio_iov_bounce_nr_vecs(struct iov_iter *iter, blk_opf_t op)
+{
+	/*
+	 * We still need to bounce bvec iters, so don't special case them
+	 * here unlike in bio_iov_vecs_to_alloc.
+	 *
+	 * For reads we need to use a vector for the bounce buffer, account
+	 * for that here.
+	 */
+	if (op_is_write(op))
+		return iov_iter_npages(iter, BIO_MAX_VECS);
+	return iov_iter_npages(iter, BIO_MAX_VECS - 1) + 1;
+}
+
 struct request_queue;

 void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
···
 void __bio_release_pages(struct bio *bio, bool mark_dirty);
 extern void bio_set_pages_dirty(struct bio *bio);
 extern void bio_check_pages_dirty(struct bio *bio);
+
+int bio_iov_iter_bounce(struct bio *bio, struct iov_iter *iter);
+void bio_iov_iter_unbounce(struct bio *bio, bool is_error, bool mark_dirty);

 extern void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
 			       struct bio *src, struct bvec_iter *src_iter);
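To make the vector accounting in bio_iov_bounce_nr_vecs() concrete, here is a small userspace model. It assumes a single contiguous user buffer, a 4096-byte page and a limit of 256 vectors per bio; `sketch_npages()` is a hypothetical stand-in for `iov_iter_npages()` under those assumptions, not the kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed values for this sketch only. */
#define SKETCH_PAGE_SIZE	4096u
#define SKETCH_BIO_MAX_VECS	256u

/*
 * Hypothetical stand-in for iov_iter_npages() on one contiguous buffer:
 * the number of pages touched by [offset, offset + len), capped at maxpages.
 */
static unsigned int sketch_npages(size_t offset, size_t len,
		unsigned int maxpages)
{
	size_t n;

	if (len == 0)
		return 0;
	n = (offset % SKETCH_PAGE_SIZE + len + SKETCH_PAGE_SIZE - 1) /
		SKETCH_PAGE_SIZE;
	return n < maxpages ? (unsigned int)n : maxpages;
}

/*
 * Same shape as bio_iov_bounce_nr_vecs(): writes use the full limit,
 * reads give up one slot of the limit and add one vector back for the
 * bounce buffer.
 */
static unsigned int sketch_bounce_nr_vecs(size_t offset, size_t len,
		bool is_write)
{
	if (is_write)
		return sketch_npages(offset, len, SKETCH_BIO_MAX_VECS);
	return sketch_npages(offset, len, SKETCH_BIO_MAX_VECS - 1) + 1;
}
```

In this model a page-aligned 4 KiB write needs one vector while the same read needs two, and both directions cap out at 256 vectors for very large buffers.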
+2 -1
include/linux/blk_types.h
···
 };

 #define BIO_RESET_BYTES		offsetof(struct bio, bi_max_vecs)
-#define BIO_MAX_SECTORS		(UINT_MAX >> SECTOR_SHIFT)
+#define BIO_MAX_SIZE		UINT_MAX /* max value of bi_iter.bi_size */
+#define BIO_MAX_SECTORS		(BIO_MAX_SIZE >> SECTOR_SHIFT)

 static inline struct bio_vec *bio_inline_vecs(struct bio *bio)
 {
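The new BIO_MAX_SIZE name makes the relationship between the byte limit and the sector limit explicit. As a rough illustration, the arithmetic can be reproduced in userspace as below, assuming 512-byte sectors (a shift of 9). The `SKETCH_*` macros mirror the kernel definitions, and `sketch_clamp_bio_bytes()` is a hypothetical helper that does not appear in this series.

```c
#include <assert.h>
#include <limits.h>

/* Mirrors of the kernel macros, redefined here for a userspace build. */
#define SKETCH_SECTOR_SHIFT	9
#define SKETCH_BIO_MAX_SIZE	UINT_MAX
#define SKETCH_BIO_MAX_SECTORS	(SKETCH_BIO_MAX_SIZE >> SKETCH_SECTOR_SHIFT)

/*
 * Hypothetical helper: cap a requested byte count at the largest
 * sector-aligned value that still fits in an unsigned int size field.
 * Requests below that cap are returned unchanged.
 */
static unsigned int sketch_clamp_bio_bytes(unsigned long long want)
{
	const unsigned int max_aligned =
		SKETCH_BIO_MAX_SECTORS << SKETCH_SECTOR_SHIFT;

	return want > max_aligned ? max_aligned : (unsigned int)want;
}
```

Because UINT_MAX is not a multiple of the sector size, the largest whole-sector byte count is slightly below BIO_MAX_SIZE, which is why the sketch rounds the cap down before applying it.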
+9
include/linux/iomap.h
···
  */
 #define IOMAP_DIO_FSBLOCK_ALIGNED	(1 << 3)

+/*
+ * Bounce buffer instead of using zero copy access.
+ *
+ * This is needed if the device needs stable data to checksum or generate
+ * parity. The file system must hook into the I/O submission and offload
+ * completions to user context for reads when this is set.
+ */
+#define IOMAP_DIO_BOUNCE		(1 << 4)
+
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		unsigned int dio_flags, void *private, size_t done_before);
+3
include/linux/uio.h
···
 			       size_t maxsize, unsigned int maxpages,
 			       iov_iter_extraction_t extraction_flags,
 			       size_t *offset0);
+ssize_t iov_iter_extract_bvecs(struct iov_iter *iter, struct bio_vec *bv,
+		size_t max_size, unsigned short *nr_vecs,
+		unsigned short max_vecs, iov_iter_extraction_t extraction_flags);

 /**
  * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
+98
lib/iov_iter.c
···
 	return -EFAULT;
 }
 EXPORT_SYMBOL_GPL(iov_iter_extract_pages);
+
+static unsigned int get_contig_folio_len(struct page **pages,
+		unsigned int *num_pages, size_t left, size_t offset)
+{
+	struct folio *folio = page_folio(pages[0]);
+	size_t contig_sz = min_t(size_t, PAGE_SIZE - offset, left);
+	unsigned int max_pages, i;
+	size_t folio_offset, len;
+
+	folio_offset = PAGE_SIZE * folio_page_idx(folio, pages[0]) + offset;
+	len = min(folio_size(folio) - folio_offset, left);
+
+	/*
+	 * We might COW a single page in the middle of a large folio, so we have
+	 * to check that all pages belong to the same folio.
+	 */
+	left -= contig_sz;
+	max_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
+	for (i = 1; i < max_pages; i++) {
+		size_t next = min_t(size_t, PAGE_SIZE, left);
+
+		if (page_folio(pages[i]) != folio ||
+		    pages[i] != pages[i - 1] + 1)
+			break;
+		contig_sz += next;
+		left -= next;
+	}
+
+	*num_pages = i;
+	return contig_sz;
+}
+
+#define PAGE_PTRS_PER_BVEC	(sizeof(struct bio_vec) / sizeof(struct page *))
+
+/**
+ * iov_iter_extract_bvecs - Extract bvecs from an iterator
+ * @iter: the iterator to extract from
+ * @bv: bvec return array
+ * @max_size: maximum size to extract from @iter
+ * @nr_vecs: number of vectors in @bv (on in and output)
+ * @max_vecs: maximum vectors in @bv, including those filled before calling
+ * @extraction_flags: flags to qualify request
+ *
+ * Like iov_iter_extract_pages(), but returns physically contiguous ranges
+ * contained in a single folio as a single bvec instead of multiple entries.
+ *
+ * Returns the number of bytes extracted when successful, or a negative errno.
+ * If @nr_vecs was non-zero on entry, the number of successfully extracted bytes
+ * can be 0.
+ */
+ssize_t iov_iter_extract_bvecs(struct iov_iter *iter, struct bio_vec *bv,
+		size_t max_size, unsigned short *nr_vecs,
+		unsigned short max_vecs, iov_iter_extraction_t extraction_flags)
+{
+	unsigned short entries_left = max_vecs - *nr_vecs;
+	unsigned short nr_pages, i = 0;
+	size_t left, offset, len;
+	struct page **pages;
+	ssize_t size;
+
+	/*
+	 * Move page array up in the allocated memory for the bio vecs as far as
+	 * possible so that we can start filling biovecs from the beginning
+	 * without overwriting the temporary page array.
+	 */
+	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
+	pages = (struct page **)(bv + *nr_vecs) +
+		entries_left * (PAGE_PTRS_PER_BVEC - 1);
+
+	size = iov_iter_extract_pages(iter, &pages, max_size, entries_left,
+			extraction_flags, &offset);
+	if (unlikely(size <= 0))
+		return size ? size : -EFAULT;
+
+	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
+	for (left = size; left > 0; left -= len) {
+		unsigned int nr_to_add;
+
+		if (*nr_vecs > 0 &&
+		    !zone_device_pages_have_same_pgmap(bv[*nr_vecs - 1].bv_page,
+						       pages[i]))
+			break;
+
+		len = get_contig_folio_len(&pages[i], &nr_to_add, left, offset);
+		bvec_set_page(&bv[*nr_vecs], pages[i], len, offset);
+		i += nr_to_add;
+		(*nr_vecs)++;
+		offset = 0;
+	}
+
+	iov_iter_revert(iter, left);
+	if (iov_iter_extract_will_pin(iter)) {
+		while (i < nr_pages)
+			unpin_user_page(pages[i++]);
+	}
+	return size - left;
+}
+EXPORT_SYMBOL_GPL(iov_iter_extract_bvecs);
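The core idea of iov_iter_extract_bvecs(), merging physically contiguous pages of one folio into a single segment, can be modelled in userspace as follows. This is a simplified sketch under stated assumptions: pages are represented by a page frame number plus an explicit folio id, the page size is fixed at 4096 bytes, and pinning, zone device checks, iterator reverts and folio size limits are left out. All `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE	4096u	/* assumed page size */

/* Hypothetical model of a page: frame number plus owning folio. */
struct sketch_page {
	unsigned long pfn;
	unsigned int folio_id;
};

/* Hypothetical model of a bvec: first page, offset into it, length. */
struct sketch_seg {
	unsigned long pfn;
	size_t offset;
	size_t len;
};

/*
 * Walk @pages covering @size bytes starting @offset bytes into the first
 * page (offset must be below the page size). Pages that are consecutive
 * in pfn order and belong to the same folio are merged into one segment.
 * Returns the number of segments written to @segs.
 */
static size_t sketch_coalesce(const struct sketch_page *pages, size_t nr_pages,
		size_t offset, size_t size,
		struct sketch_seg *segs, size_t max_segs)
{
	size_t i = 0, nr_segs = 0, left = size;

	while (left > 0 && i < nr_pages && nr_segs < max_segs) {
		size_t first = i;
		size_t len = SKETCH_PAGE_SIZE - offset;

		if (len > left)
			len = left;
		left -= len;
		i++;

		/* Extend the segment while the next page is contiguous. */
		while (left > 0 && i < nr_pages &&
		       pages[i].folio_id == pages[first].folio_id &&
		       pages[i].pfn == pages[i - 1].pfn + 1) {
			size_t next = left < SKETCH_PAGE_SIZE ?
					left : SKETCH_PAGE_SIZE;

			len += next;
			left -= next;
			i++;
		}

		segs[nr_segs].pfn = pages[first].pfn;
		segs[nr_segs].offset = offset;
		segs[nr_segs].len = len;
		nr_segs++;
		offset = 0;	/* only the first segment starts mid-page */
	}
	return nr_segs;
}
```

The folio check matters even when frame numbers are consecutive: as the comment in get_contig_folio_len() notes, a single page in the middle of a large folio can be replaced by COW, so adjacency alone is not enough to merge.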