Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'vfs-6.15-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs iomap updates from Christian Brauner:

- Allow the filesystem to submit the writeback bios.

- Allow the filesystem to track completions on a per-bio basis
instead of the entire I/O.

- Change writeback_ops so that ->submit_bio can be done by the
filesystem.

- A new ANON_WRITE flag for writes that don't have a block number
assigned to them at the iomap level, leaving the filesystem to do
that work in the submission handler.

- Incremental iterator advance

The folio_batch support for zero range, where the filesystem provides
a batch of folios to process that might not be logically contiguous,
requires more flexibility than the current offset-based iteration
offers.

Update all iomap operations to advance the iterator within the
operation and thus remove the need to advance from the core iomap
iterator.

- Make buffered writes work with RWF_DONTCACHE

If RWF_DONTCACHE is set for a write, mark the folios being written as
uncached. On writeback completion the pages will be dropped.

- Introduce infrastructure for large atomic writes

This will eventually be used by xfs and ext4.
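
The "incremental iterator advance" item above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: `struct toy_iter`, `toy_length()`, `toy_advance()` and `toy_zero_iter()` are hypothetical stand-ins for `struct iomap_iter`, `iomap_length()`, `iomap_iter_advance()` and a per-operation handler. The behaviour assumed here (the count is updated to the bytes still left in the current mapping, and over-advancing is an error) is inferred from how the converted loops in the diff test the count after each call.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for struct iomap_iter: only what this sketch needs. */
struct toy_iter {
	uint64_t pos;     /* current file offset */
	uint64_t len;     /* bytes left in the whole operation */
	uint64_t map_end; /* end offset of the current mapping */
};

/* Bytes left in the current mapping, clamped to the remaining operation. */
static inline uint64_t toy_length(const struct toy_iter *it)
{
	uint64_t in_map = it->map_end - it->pos;

	return in_map < it->len ? in_map : it->len;
}

/*
 * Advance by *count bytes. On success, *count is rewritten to the number
 * of bytes still left in the current mapping (assumed semantics).
 */
static inline int toy_advance(struct toy_iter *it, uint64_t *count)
{
	if (*count > toy_length(it))
		return -EIO;
	it->pos += *count;
	it->len -= *count;
	*count = toy_length(it);
	return 0;
}

/*
 * Handler in the new style: it processes the mapping in chunks, advances
 * the iterator itself, and returns a status (0 or -errno) instead of a
 * byte count.
 */
static inline int toy_zero_iter(struct toy_iter *it, uint64_t chunk)
{
	uint64_t bytes = toy_length(it);
	int status;

	assert(chunk > 0);
	do {
		if (bytes > chunk)
			bytes = chunk;
		/* ... per-chunk work (e.g. zeroing a folio) would go here ... */
		status = toy_advance(it, &bytes);
		if (status)
			break;
	} while (bytes > 0);

	return status;
}
```

Because the handler moves `pos` itself, the caller no longer needs a "bytes processed" return value; it only stores the status, which mirrors the `iter.processed` to `iter.status` rename in the hunks below.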

* tag 'vfs-6.15-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
iomap: rework IOMAP atomic flags
iomap: comment on atomic write checks in iomap_dio_bio_iter()
iomap: inline iomap_dio_bio_opflags()
iomap: fix inline data on buffered read
iomap: Lift blocksize restriction on atomic writes
iomap: Support SW-based atomic writes
iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW
xfs: flag as supporting FOP_DONTCACHE
iomap: make buffered writes work with RWF_DONTCACHE
iomap: introduce a full map advance helper
iomap: rename iomap_iter processed field to status
iomap: remove unnecessary advance from iomap_iter()
dax: advance the iomap_iter on pte and pmd faults
dax: advance the iomap_iter on dedupe range
dax: advance the iomap_iter on unshare range
dax: advance the iomap_iter on zero range
dax: push advance down into dax_iomap_iter() for read and write
dax: advance the iomap_iter in the read/write path
iomap: convert misc simple ops to incremental advance
iomap: advance the iter on direct I/O
...

+816 -521
+9
Documentation/filesystems/iomap/design.rst
···
  * **IOMAP_F_PRIVATE**: Starting with this value, the upper bits can
    be set by the filesystem for its own purposes.
 
+ * **IOMAP_F_ANON_WRITE**: Indicates that (write) I/O does not have a target
+   block assigned to it yet and the file system will do that in the bio
+   submission handler, splitting the I/O as needed.
+
 These flags can be set by iomap itself during file operations.
 The filesystem should supply an ``->iomap_end`` function if it needs
 to observe these flags:
···
    start the operation and force the submitting task to block.
    ``IOMAP_NOWAIT`` is often set on behalf of ``IOCB_NOWAIT`` or
    ``RWF_NOWAIT``.
+
+ * ``IOMAP_DONTCACHE`` is set when the caller wishes to perform a
+   buffered file I/O and would like the kernel to drop the pagecache
+   after the I/O completes, if it isn't already being used by another
+   thread.
 
 If it is necessary to read existing file contents from a `different
 <https://lore.kernel.org/all/20191008071527.29304-9-hch@lst.de/>`_
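
From user space, the ``IOMAP_DONTCACHE`` path documented above is reached by passing ``RWF_DONTCACHE`` to ``pwritev2()``. The following is a minimal sketch, not part of the patch. Assumptions: the helper names `write_dontcache()` and `dontcache_roundtrip()` are made up for illustration, the fallback `#define` uses the value from recent uapi headers for systems whose libc headers predate the flag, and the helper quietly falls back to a plain buffered write when the kernel or filesystem rejects the flag.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for pwritev2() */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumption: value taken from recent <linux/fs.h>; older headers lack it. */
#ifndef RWF_DONTCACHE
#define RWF_DONTCACHE 0x00000080
#endif

/*
 * Buffered write that asks the kernel to drop the written pages from the
 * page cache once writeback completes. Falls back to pwrite() when the
 * flag (or pwritev2 itself) is not supported.
 */
static inline ssize_t write_dontcache(int fd, const void *buf, size_t len,
				      off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t ret;

	assert(buf != NULL);
	ret = pwritev2(fd, &iov, 1, off, RWF_DONTCACHE);
	if (ret < 0 && (errno == EOPNOTSUPP || errno == EINVAL ||
			errno == ENOSYS))
		ret = pwrite(fd, buf, len, off); /* plain buffered write */
	return ret;
}

/* Write a short string with the helper and read it back; 0 on success. */
static inline int dontcache_roundtrip(const char *path)
{
	static const char msg[] = "hello";
	char back[sizeof(msg)] = { 0 };
	int ok;
	int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	ok = write_dontcache(fd, msg, sizeof(msg), 0) == (ssize_t)sizeof(msg) &&
	     pread(fd, back, sizeof(back), 0) == (ssize_t)sizeof(back) &&
	     memcmp(msg, back, sizeof(msg)) == 0;
	close(fd);
	unlink(path);
	return ok ? 0 : -1;
}
```

The data read back is the same either way; the flag only changes whether the pages stay cached afterwards, so a caller cannot tell from the return value which path was taken.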
+29 -13
Documentation/filesystems/iomap/operations.rst
···
 
  * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.
 
+ * ``IOCB_DONTCACHE``: Turns on ``IOMAP_DONTCACHE``.
+
 Internal per-Folio State
 ------------------------
 
···
  struct iomap_writeback_ops {
      int (*map_blocks)(struct iomap_writepage_ctx *wpc, struct inode *inode,
                        loff_t offset, unsigned len);
-     int (*prepare_ioend)(struct iomap_ioend *ioend, int status);
+     int (*submit_ioend)(struct iomap_writepage_ctx *wpc, int status);
      void (*discard_folio)(struct folio *folio, loff_t pos);
  };
 
···
     purpose.
     This function must be supplied by the filesystem.
 
-  - ``prepare_ioend``: Enables filesystems to transform the writeback
-    ioend or perform any other preparatory work before the writeback I/O
-    is submitted.
+  - ``submit_ioend``: Allows the file systems to hook into writeback bio
+    submission.
     This might include pre-write space accounting updates, or installing
     a custom ``->bi_end_io`` function for internal purposes, such as
     deferring the ioend completion to a workqueue to run metadata update
-    transactions from process context.
+    transactions from process context before submitting the bio.
     This function is optional.
 
   - ``discard_folio``: iomap calls this function after ``->map_blocks``
···
 storage device.
 
 Filesystems that need to update internal bookkeeping (e.g. unwritten
-extent conversions) should provide a ``->prepare_ioend`` function to
+extent conversions) should provide a ``->submit_ioend`` function to
 set ``struct iomap_end::bio::bi_end_io`` to its own function.
 This function should call ``iomap_finish_ioends`` after finishing its
 own work (e.g. unwritten extent conversion).
···
 
  * ``IOMAP_ATOMIC``: This write is being issued with torn-write
    protection.
-   Only a single bio can be created for the write, and the write must
-   not be split into multiple I/O requests, i.e. flag REQ_ATOMIC must be
-   set.
+   Torn-write protection may be provided based on HW-offload or by a
+   software mechanism provided by the filesystem.
+
+   For HW-offload based support, only a single bio can be created for the
+   write, and the write must not be split into multiple I/O requests, i.e.
+   flag REQ_ATOMIC must be set.
    The file range to write must be aligned to satisfy the requirements
    of both the filesystem and the underlying block device's atomic
    commit capabilities.
    If filesystem metadata updates are required (e.g. unwritten extent
-   conversion or copy on write), all updates for the entire file range
+   conversion or copy-on-write), all updates for the entire file range
    must be committed atomically as well.
-   Only one space mapping is allowed per untorn write.
-   Untorn writes must be aligned to, and must not be longer than, a
-   single file block.
+   Untorn-writes may be longer than a single file block. In all cases,
+   the mapping start disk block must have at least the same alignment as
+   the write offset.
+   The filesystems must set IOMAP_F_ATOMIC_BIO to inform iomap core of an
+   untorn-write based on HW-offload.
+
+   For untorn-writes based on a software mechanism provided by the
+   filesystem, all the disk block alignment and single bio restrictions
+   which apply for HW-offload based untorn-writes do not apply.
+   The mechanism would typically be used as a fallback for when
+   HW-offload based untorn-writes may not be issued, e.g. the range of the
+   write covers multiple extents, meaning that it is not possible to issue
+   a single bio.
+   All filesystem metadata updates for the entire file range must be
+   committed atomically as well.
 
 Callers commonly hold ``i_rwsem`` in shared or exclusive mode before
 calling this function.
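
The alignment requirement in the ``IOMAP_ATOMIC`` text above has a user-visible counterpart in the rules for ``RWF_ATOMIC`` writes. The sketch below is not from the patch and not kernel code. It assumes the constraints as described in the ``pwritev2(2)`` manual page: the total length is a power of two, lies within the advertised unit range, and the offset is naturally aligned to the length. `toy_atomic_write_ok()` is a hypothetical helper, and `unit_min`/`unit_max` stand in for the values a caller would obtain from ``statx()``.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if x is a non-zero power of two. */
static inline bool toy_is_pow2(uint64_t x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/*
 * Pre-check for an untorn write of `len` bytes at file offset `off`.
 * unit_min/unit_max are assumed to come from statx(); they are plain
 * parameters here so the sketch stays self-contained.
 */
static inline bool toy_atomic_write_ok(uint64_t off, uint64_t len,
				       uint64_t unit_min, uint64_t unit_max)
{
	if (!toy_is_pow2(len))
		return false;           /* length must be a power of two */
	if (len < unit_min || len > unit_max)
		return false;           /* outside the advertised range */
	return (off & (len - 1)) == 0;  /* offset naturally aligned to len */
}
```

A write that fails this check would be rejected before any mapping is looked up; whether an accepted write is then served by HW-offload or by a filesystem software fallback is decided inside the kernel, as the documentation hunk describes.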
+61 -50
fs/dax.c
···
 }
 #endif /* CONFIG_FS_DAX_PMD */
 
-static s64 dax_unshare_iter(struct iomap_iter *iter)
+static int dax_unshare_iter(struct iomap_iter *iter)
 {
 	struct iomap *iomap = &iter->iomap;
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
···
 	u64 copy_len = iomap_length(iter);
 	u32 mod;
 	int id = 0;
-	s64 ret = 0;
+	s64 ret;
 	void *daddr = NULL, *saddr = NULL;
 
 	if (!iomap_want_unshare_iter(iter))
-		return iomap_length(iter);
+		return iomap_iter_advance_full(iter);
 
 	/*
 	 * Extend the file range to be aligned to fsblock/pagesize, because
···
 	if (ret < 0)
 		goto out_unlock;
 
-	if (copy_mc_to_kernel(daddr, saddr, copy_len) == 0)
-		ret = iomap_length(iter);
-	else
+	if (copy_mc_to_kernel(daddr, saddr, copy_len) != 0)
 		ret = -EIO;
 
 out_unlock:
 	dax_read_unlock(id);
-	return dax_mem2blk_err(ret);
+	if (ret < 0)
+		return dax_mem2blk_err(ret);
+	return iomap_iter_advance_full(iter);
 }
 
 int dax_file_unshare(struct inode *inode, loff_t pos, loff_t len,
···
 
 	iter.len = min(len, size - pos);
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = dax_unshare_iter(&iter);
+		iter.status = dax_unshare_iter(&iter);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(dax_file_unshare);
···
 	return ret;
 }
 
-static s64 dax_zero_iter(struct iomap_iter *iter, bool *did_zero)
+static int dax_zero_iter(struct iomap_iter *iter, bool *did_zero)
 {
 	const struct iomap *iomap = &iter->iomap;
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
-	loff_t pos = iter->pos;
 	u64 length = iomap_length(iter);
-	s64 written = 0;
+	int ret;
 
 	/* already zeroed?  we're done. */
 	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
-		return length;
+		return iomap_iter_advance(iter, &length);
 
 	/*
 	 * invalidate the pages whose sharing state is to be changed
···
 	 */
 	if (iomap->flags & IOMAP_F_SHARED)
 		invalidate_inode_pages2_range(iter->inode->i_mapping,
-					      pos >> PAGE_SHIFT,
-					      (pos + length - 1) >> PAGE_SHIFT);
+				iter->pos >> PAGE_SHIFT,
+				(iter->pos + length - 1) >> PAGE_SHIFT);
 
 	do {
+		loff_t pos = iter->pos;
 		unsigned offset = offset_in_page(pos);
-		unsigned size = min_t(u64, PAGE_SIZE - offset, length);
 		pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
-		long rc;
 		int id;
 
+		length = min_t(u64, PAGE_SIZE - offset, length);
+
 		id = dax_read_lock();
-		if (IS_ALIGNED(pos, PAGE_SIZE) && size == PAGE_SIZE)
-			rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
+		if (IS_ALIGNED(pos, PAGE_SIZE) && length == PAGE_SIZE)
+			ret = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
 		else
-			rc = dax_memzero(iter, pos, size);
+			ret = dax_memzero(iter, pos, length);
 		dax_read_unlock(id);
 
-		if (rc < 0)
-			return rc;
-		pos += size;
-		length -= size;
-		written += size;
+		if (ret < 0)
+			return ret;
+
+		ret = iomap_iter_advance(iter, &length);
+		if (ret)
+			return ret;
 	} while (length > 0);
 
 	if (did_zero)
 		*did_zero = true;
-	return written;
+	return ret;
 }
 
 int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
···
 	int ret;
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = dax_zero_iter(&iter, did_zero);
+		iter.status = dax_zero_iter(&iter, did_zero);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(dax_zero_range);
···
 }
 EXPORT_SYMBOL_GPL(dax_truncate_page);
 
-static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
-		struct iov_iter *iter)
+static int dax_iomap_iter(struct iomap_iter *iomi, struct iov_iter *iter)
 {
 	const struct iomap *iomap = &iomi->iomap;
 	const struct iomap *srcmap = iomap_iter_srcmap(iomi);
···
 		if (pos >= end)
 			return 0;
 
-		if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_UNWRITTEN)
-			return iov_iter_zero(min(length, end - pos), iter);
+		if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_UNWRITTEN) {
+			done = iov_iter_zero(min(length, end - pos), iter);
+			return iomap_iter_advance(iomi, &done);
+		}
 	}
 
 	/*
···
 	}
 
 	id = dax_read_lock();
-	while (pos < end) {
+	while ((pos = iomi->pos) < end) {
 		unsigned offset = pos & (PAGE_SIZE - 1);
 		const size_t size = ALIGN(length + offset, PAGE_SIZE);
 		pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
···
 			xfer = dax_copy_to_iter(dax_dev, pgoff, kaddr,
 					map_len, iter);
 
-		pos += xfer;
-		length -= xfer;
-		done += xfer;
-
-		if (xfer == 0)
+		length = xfer;
+		ret = iomap_iter_advance(iomi, &length);
+		if (!ret && xfer == 0)
 			ret = -EFAULT;
 		if (xfer < map_len)
 			break;
 	}
 	dax_read_unlock(id);
 
-	return done ? done : ret;
+	return ret;
 }
 
 /**
···
 		iomi.flags |= IOMAP_NOWAIT;
 
 	while ((ret = iomap_iter(&iomi, ops)) > 0)
-		iomi.processed = dax_iomap_iter(&iomi, iter);
+		iomi.status = dax_iomap_iter(&iomi, iter);
 
 	done = iomi.pos - iocb->ki_pos;
 	iocb->ki_pos = iomi.pos;
···
 
 	while ((error = iomap_iter(&iter, ops)) > 0) {
 		if (WARN_ON_ONCE(iomap_length(&iter) < PAGE_SIZE)) {
-			iter.processed = -EIO;	/* fs corruption? */
+			iter.status = -EIO;	/* fs corruption? */
 			continue;
 		}
 
···
 			ret |= VM_FAULT_MAJOR;
 		}
 
-		if (!(ret & VM_FAULT_ERROR))
-			iter.processed = PAGE_SIZE;
+		if (!(ret & VM_FAULT_ERROR)) {
+			u64 length = PAGE_SIZE;
+			iter.status = iomap_iter_advance(&iter, &length);
+		}
 	}
 
 	if (iomap_errp)
···
 			continue; /* actually breaks out of the loop */
 
 		ret = dax_fault_iter(vmf, &iter, pfnp, &xas, &entry, true);
-		if (ret != VM_FAULT_FALLBACK)
-			iter.processed = PMD_SIZE;
+		if (ret != VM_FAULT_FALLBACK) {
+			u64 length = PMD_SIZE;
+			iter.status = iomap_iter_advance(&iter, &length);
+		}
 	}
 
 unlock_entry:
···
 }
 EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
 
-static loff_t dax_range_compare_iter(struct iomap_iter *it_src,
+static int dax_range_compare_iter(struct iomap_iter *it_src,
 		struct iomap_iter *it_dest, u64 len, bool *same)
 {
 	const struct iomap *smap = &it_src->iomap;
 	const struct iomap *dmap = &it_dest->iomap;
 	loff_t pos1 = it_src->pos, pos2 = it_dest->pos;
+	u64 dest_len;
 	void *saddr, *daddr;
 	int id, ret;
 
···
 
 	if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) {
 		*same = true;
-		return len;
+		goto advance;
 	}
 
 	if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
···
 	if (!*same)
 		len = 0;
 	dax_read_unlock(id);
-	return len;
+
+advance:
+	dest_len = len;
+	ret = iomap_iter_advance(it_src, &len);
+	if (!ret)
+		ret = iomap_iter_advance(it_dest, &dest_len);
+	return ret;
 
 out_unlock:
 	dax_read_unlock(id);
···
 		.len		= len,
 		.flags		= IOMAP_DAX,
 	};
-	int ret, compared = 0;
+	int ret, status;
 
 	while ((ret = iomap_iter(&src_iter, ops)) > 0 &&
 	       (ret = iomap_iter(&dst_iter, ops)) > 0) {
-		compared = dax_range_compare_iter(&src_iter, &dst_iter,
+		status = dax_range_compare_iter(&src_iter, &dst_iter,
 				min(src_iter.len, dst_iter.len), same);
-		if (compared < 0)
+		if (status < 0)
 			return ret;
-		src_iter.processed = dst_iter.processed = compared;
+		src_iter.status = dst_iter.status = status;
 	}
 	return ret;
 }
+4
fs/ext4/inode.c
···
 	if (map->m_flags & EXT4_MAP_NEW)
 		iomap->flags |= IOMAP_F_NEW;
 
+	/* HW-offload atomics are always used */
+	if (flags & IOMAP_ATOMIC)
+		iomap->flags |= IOMAP_F_ATOMIC_BIO;
+
 	if (flags & IOMAP_DAX)
 		iomap->dax_dev = EXT4_SB(inode->i_sb)->s_daxdev;
 	else
+2 -1
fs/gfs2/bmap.c
···
 				 unsigned int length)
 {
 	BUG_ON(current->journal_info);
-	return iomap_zero_range(inode, from, length, NULL, &gfs2_iomap_ops);
+	return iomap_zero_range(inode, from, length, NULL, &gfs2_iomap_ops,
+			NULL);
 }
 
 #define GFS2_JTRUNC_REVOKES 8192
+1
fs/iomap/Makefile
···
 				   iter.o
 iomap-$(CONFIG_BLOCK)		+= buffered-io.o \
 				   direct-io.o \
+				   ioend.o \
 				   fiemap.o \
 				   seek.o
 iomap-$(CONFIG_SWAP)		+= swapfile.o
+122 -232
fs/iomap/buffered-io.c
···
 #include <linux/buffer_head.h>
 #include <linux/dax.h>
 #include <linux/writeback.h>
-#include <linux/list_sort.h>
 #include <linux/swap.h>
 #include <linux/bio.h>
 #include <linux/sched/signal.h>
 #include <linux/migrate.h>
+#include "internal.h"
 #include "trace.h"
 
 #include "../internal.h"
-
-#define IOEND_BATCH_SIZE	4096
 
 /*
  * Structure allocated for each folio to track per-block uptodate, dirty state
···
 	 */
 	unsigned long		state[];
 };
-
-static struct bio_set iomap_ioend_bioset;
 
 static inline bool ifs_is_fully_uptodate(struct folio *folio,
 		struct iomap_folio_state *ifs)
···
 		pos >= i_size_read(iter->inode);
 }
 
-static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
-		struct iomap_readpage_ctx *ctx, loff_t offset)
+static int iomap_readpage_iter(struct iomap_iter *iter,
+		struct iomap_readpage_ctx *ctx)
 {
 	const struct iomap *iomap = &iter->iomap;
-	loff_t pos = iter->pos + offset;
-	loff_t length = iomap_length(iter) - offset;
+	loff_t pos = iter->pos;
+	loff_t length = iomap_length(iter);
 	struct folio *folio = ctx->cur_folio;
 	struct iomap_folio_state *ifs;
-	loff_t orig_pos = pos;
 	size_t poff, plen;
 	sector_t sector;
+	int ret;
 
-	if (iomap->type == IOMAP_INLINE)
-		return iomap_read_inline_data(iter, folio);
+	if (iomap->type == IOMAP_INLINE) {
+		ret = iomap_read_inline_data(iter, folio);
+		if (ret)
+			return ret;
+		return iomap_iter_advance(iter, &length);
+	}
 
 	/* zero post-eof blocks as the page may be mapped */
 	ifs = ifs_alloc(iter->inode, folio, iter->flags);
···
 	 * we can skip trailing ones as they will be handled in the next
 	 * iteration.
 	 */
-	return pos - orig_pos + plen;
+	length = pos - iter->pos + plen;
+	return iomap_iter_advance(iter, &length);
 }
 
-static loff_t iomap_read_folio_iter(const struct iomap_iter *iter,
+static int iomap_read_folio_iter(struct iomap_iter *iter,
 		struct iomap_readpage_ctx *ctx)
 {
-	struct folio *folio = ctx->cur_folio;
-	size_t offset = offset_in_folio(folio, iter->pos);
-	loff_t length = min_t(loff_t, folio_size(folio) - offset,
-			      iomap_length(iter));
-	loff_t done, ret;
+	int ret;
 
-	for (done = 0; done < length; done += ret) {
-		ret = iomap_readpage_iter(iter, ctx, done);
-		if (ret <= 0)
+	while (iomap_length(iter)) {
+		ret = iomap_readpage_iter(iter, ctx);
+		if (ret)
 			return ret;
 	}
 
-	return done;
+	return 0;
 }
 
 int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
···
 	trace_iomap_readpage(iter.inode, 1);
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_read_folio_iter(&iter, &ctx);
+		iter.status = iomap_read_folio_iter(&iter, &ctx);
 
 	if (ctx.bio) {
 		submit_bio(ctx.bio);
···
 }
 EXPORT_SYMBOL_GPL(iomap_read_folio);
 
-static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
+static int iomap_readahead_iter(struct iomap_iter *iter,
 		struct iomap_readpage_ctx *ctx)
 {
-	loff_t length = iomap_length(iter);
-	loff_t done, ret;
+	int ret;
 
-	for (done = 0; done < length; done += ret) {
+	while (iomap_length(iter)) {
 		if (ctx->cur_folio &&
-		    offset_in_folio(ctx->cur_folio, iter->pos + done) == 0) {
+		    offset_in_folio(ctx->cur_folio, iter->pos) == 0) {
 			if (!ctx->cur_folio_in_bio)
 				folio_unlock(ctx->cur_folio);
 			ctx->cur_folio = NULL;
···
 			ctx->cur_folio = readahead_folio(ctx->rac);
 			ctx->cur_folio_in_bio = false;
 		}
-		ret = iomap_readpage_iter(iter, ctx, done);
-		if (ret <= 0)
+		ret = iomap_readpage_iter(iter, ctx);
+		if (ret)
 			return ret;
 	}
 
-	return done;
+	return 0;
 }
 
 /**
···
 	trace_iomap_readahead(rac->mapping->host, readahead_count(rac));
 
 	while (iomap_iter(&iter, ops) > 0)
-		iter.processed = iomap_readahead_iter(&iter, &ctx);
+		iter.status = iomap_readahead_iter(&iter, &ctx);
 
 	if (ctx.bio)
 		submit_bio(ctx.bio);
···
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
+	if (iter->flags & IOMAP_DONTCACHE)
+		fgp |= FGP_DONTCACHE;
 	fgp |= fgf_set_order(len);
 
 	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
···
 	return __iomap_write_end(iter->inode, pos, len, copied, folio);
 }
 
-static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
+static int iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 {
-	loff_t length = iomap_length(iter);
-	loff_t pos = iter->pos;
 	ssize_t total_written = 0;
-	long status = 0;
+	int status = 0;
 	struct address_space *mapping = iter->inode->i_mapping;
 	size_t chunk = mapping_max_folio_size(mapping);
 	unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0;
···
 		size_t offset;		/* Offset into folio */
 		size_t bytes;		/* Bytes to write to folio */
 		size_t copied;		/* Bytes copied from user */
-		size_t written;		/* Bytes have been written */
+		u64 written;		/* Bytes have been written */
+		loff_t pos = iter->pos;
 
 		bytes = iov_iter_count(i);
 retry:
···
 		if (unlikely(status))
 			break;
 
-		if (bytes > length)
-			bytes = length;
+		if (bytes > iomap_length(iter))
+			bytes = iomap_length(iter);
 
 		/*
 		 * Bring in the user page that we'll copy from _first_.
···
 				goto retry;
 			}
 		} else {
-			pos += written;
 			total_written += written;
-			length -= written;
+			iomap_iter_advance(iter, &written);
 		}
-	} while (iov_iter_count(i) && length);
+	} while (iov_iter_count(i) && iomap_length(iter));
 
-	if (status == -EAGAIN) {
-		iov_iter_revert(i, total_written);
-		return -EAGAIN;
-	}
-	return total_written ? total_written : status;
+	return total_written ? 0 : status;
 }
 
 ssize_t
···
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		iter.flags |= IOMAP_NOWAIT;
+	if (iocb->ki_flags & IOCB_DONTCACHE)
+		iter.flags |= IOMAP_DONTCACHE;
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_write_iter(&iter, i);
+		iter.status = iomap_write_iter(&iter, i);
 
 	if (unlikely(iter.pos == iocb->ki_pos))
 		return ret;
···
 }
 EXPORT_SYMBOL_GPL(iomap_write_delalloc_release);
 
-static loff_t iomap_unshare_iter(struct iomap_iter *iter)
+static int iomap_unshare_iter(struct iomap_iter *iter)
 {
 	struct iomap *iomap = &iter->iomap;
-	loff_t pos = iter->pos;
-	loff_t length = iomap_length(iter);
-	loff_t written = 0;
+	u64 bytes = iomap_length(iter);
+	int status;
 
 	if (!iomap_want_unshare_iter(iter))
-		return length;
+		return iomap_iter_advance(iter, &bytes);
 
 	do {
 		struct folio *folio;
-		int status;
 		size_t offset;
-		size_t bytes = min_t(u64, SIZE_MAX, length);
+		loff_t pos = iter->pos;
 		bool ret;
 
+		bytes = min_t(u64, SIZE_MAX, bytes);
 		status = iomap_write_begin(iter, pos, bytes, &folio);
 		if (unlikely(status))
 			return status;
···
 
 		cond_resched();
 
-		pos += bytes;
-		written += bytes;
-		length -= bytes;
-
 		balance_dirty_pages_ratelimited(iter->inode->i_mapping);
-	} while (length > 0);
 
-	return written;
+		status = iomap_iter_advance(iter, &bytes);
+		if (status)
+			break;
+	} while (bytes > 0);
+
+	return status;
 }
 
 int
···
 
 	iter.len = min(len, size - pos);
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_unshare_iter(&iter);
+		iter.status = iomap_unshare_iter(&iter);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(iomap_file_unshare);
···
 	return filemap_write_and_wait_range(mapping, i->pos, end);
 }
 
-static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
+static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 {
-	loff_t pos = iter->pos;
-	loff_t length = iomap_length(iter);
-	loff_t written = 0;
+	u64 bytes = iomap_length(iter);
+	int status;
 
 	do {
 		struct folio *folio;
-		int status;
 		size_t offset;
-		size_t bytes = min_t(u64, SIZE_MAX, length);
+		loff_t pos = iter->pos;
 		bool ret;
 
+		bytes = min_t(u64, SIZE_MAX, bytes);
 		status = iomap_write_begin(iter, pos, bytes, &folio);
 		if (status)
 			return status;
···
 		if (WARN_ON_ONCE(!ret))
 			return -EIO;
 
-		pos += bytes;
-		length -= bytes;
-		written += bytes;
-	} while (length > 0);
+		status = iomap_iter_advance(iter, &bytes);
+		if (status)
+			break;
+	} while (bytes > 0);
 
 	if (did_zero)
 		*did_zero = true;
-	return written;
+	return status;
 }
 
 int
 iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
-		const struct iomap_ops *ops)
+		const struct iomap_ops *ops, void *private)
 {
 	struct iomap_iter iter = {
 		.inode		= inode,
 		.pos		= pos,
 		.len		= len,
 		.flags		= IOMAP_ZERO,
+		.private	= private,
 	};
 	struct address_space *mapping = inode->i_mapping;
 	unsigned int blocksize = i_blocksize(inode);
···
 	    filemap_range_needs_writeback(mapping, pos, pos + plen - 1)) {
 		iter.len = plen;
 		while ((ret = iomap_iter(&iter, ops)) > 0)
-			iter.processed = iomap_zero_iter(&iter, did_zero);
+			iter.status = iomap_zero_iter(&iter, did_zero);
 
 		iter.len = len - (iter.pos - pos);
 		if (ret || !iter.len)
···
 
 		if (srcmap->type == IOMAP_HOLE ||
 		    srcmap->type == IOMAP_UNWRITTEN) {
-			loff_t proc = iomap_length(&iter);
+			s64 status;
 
 			if (range_dirty) {
 				range_dirty = false;
-				proc = iomap_zero_iter_flush_and_stale(&iter);
+				status = iomap_zero_iter_flush_and_stale(&iter);
+			} else {
+				status = iomap_iter_advance_full(&iter);
 			}
-			iter.processed = proc;
+			iter.status = status;
 			continue;
 		}
 
-		iter.processed = iomap_zero_iter(&iter, did_zero);
+		iter.status = iomap_zero_iter(&iter, did_zero);
 	}
 	return ret;
 }
···
 
 int
 iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-		const struct iomap_ops *ops)
+		const struct iomap_ops *ops, void *private)
 {
 	unsigned int blocksize = i_blocksize(inode);
 	unsigned int off = pos & (blocksize - 1);
···
 	/* Block boundary? Nothing to do */
 	if (!off)
 		return 0;
-	return iomap_zero_range(inode, pos, blocksize - off, did_zero, ops);
+	return iomap_zero_range(inode, pos, blocksize - off, did_zero, ops,
+			private);
 }
 EXPORT_SYMBOL_GPL(iomap_truncate_page);
 
-static loff_t iomap_folio_mkwrite_iter(struct iomap_iter *iter,
+static int iomap_folio_mkwrite_iter(struct iomap_iter *iter,
 		struct folio *folio)
 {
 	loff_t length = iomap_length(iter);
···
 		folio_mark_dirty(folio);
 	}
 
-	return length;
+	return iomap_iter_advance(iter, &length);
 }
 
-vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
+vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops,
+		void *private)
 {
 	struct iomap_iter iter = {
 		.inode		= file_inode(vmf->vma->vm_file),
 		.flags		= IOMAP_WRITE | IOMAP_FAULT,
+		.private	= private,
 	};
 	struct folio *folio = page_folio(vmf->page);
 	ssize_t ret;
···
 	iter.pos = folio_pos(folio);
 	iter.len = ret;
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_folio_mkwrite_iter(&iter, folio);
+		iter.status = iomap_folio_mkwrite_iter(&iter, folio);
 
 	if (ret < 0)
 		goto out_unlock;
···
  * state, release holds on bios, and finally free up memory. Do not use the
  * ioend after this.
  */
-static u32
-iomap_finish_ioend(struct iomap_ioend *ioend, int error)
+u32 iomap_finish_ioend_buffered(struct iomap_ioend *ioend)
 {
 	struct inode *inode = ioend->io_inode;
 	struct bio *bio = &ioend->io_bio;
 	struct folio_iter fi;
 	u32 folio_count = 0;
 
-	if (error) {
-		mapping_set_error(inode->i_mapping, error);
+	if (ioend->io_error) {
+		mapping_set_error(inode->i_mapping, ioend->io_error);
 		if (!bio_flagged(bio, BIO_QUIET)) {
 			pr_err_ratelimited(
 "%s: writeback error on inode %lu, offset %lld, sector %llu",
···
 	return folio_count;
 }
 
-/*
- * Ioend completion routine for merged bios. This can only be called from task
- * contexts as merged ioends can be of unbound length. Hence we have to break up
- * the writeback completions into manageable chunks to avoid long scheduler
- * holdoffs. We aim to keep scheduler holdoffs down below 10ms so that we get
- * good batch processing throughput without creating adverse scheduler latency
- * conditions.
- */
-void
-iomap_finish_ioends(struct iomap_ioend *ioend, int error)
-{
-	struct list_head tmp;
-	u32 completions;
-
-	might_sleep();
-
-	list_replace_init(&ioend->io_list, &tmp);
-	completions = iomap_finish_ioend(ioend, error);
-
-	while (!list_empty(&tmp)) {
-		if (completions > IOEND_BATCH_SIZE * 8) {
-			cond_resched();
-			completions = 0;
-		}
-		ioend = list_first_entry(&tmp, struct iomap_ioend, io_list);
-		list_del_init(&ioend->io_list);
-		completions += iomap_finish_ioend(ioend, error);
-	}
-}
-EXPORT_SYMBOL_GPL(iomap_finish_ioends);
-
-/*
- * We can merge two adjacent ioends if they have the same set of work to do.
- */
-static bool
-iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next)
-{
-	if (ioend->io_bio.bi_status != next->io_bio.bi_status)
-		return false;
-	if (next->io_flags & IOMAP_F_BOUNDARY)
-		return false;
-	if ((ioend->io_flags & IOMAP_F_SHARED) ^
-	    (next->io_flags & IOMAP_F_SHARED))
-		return false;
-	if ((ioend->io_type == IOMAP_UNWRITTEN) ^
-	    (next->io_type == IOMAP_UNWRITTEN))
-		return false;
-	if (ioend->io_offset + ioend->io_size != next->io_offset)
-		return false;
-	/*
-	 * Do not merge physically discontiguous ioends. The filesystem
-	 * completion functions will have to iterate the physical
-	 * discontiguities even if we merge the ioends at a logical level, so
-	 * we don't gain anything by merging physical discontiguities here.
-	 *
-	 * We cannot use bio->bi_iter.bi_sector here as it is modified during
-	 * submission so does not point to the start sector of the bio at
-	 * completion.
-	 */
-	if (ioend->io_sector + (ioend->io_size >> 9) != next->io_sector)
-		return false;
-	return true;
-}
-
-void
-iomap_ioend_try_merge(struct iomap_ioend *ioend, struct list_head *more_ioends)
-{
-	struct iomap_ioend *next;
-
-	INIT_LIST_HEAD(&ioend->io_list);
-
-	while ((next = list_first_entry_or_null(more_ioends, struct iomap_ioend,
-			io_list))) {
-		if (!iomap_ioend_can_merge(ioend, next))
-			break;
-		list_move_tail(&next->io_list, &ioend->io_list);
-		ioend->io_size += next->io_size;
-	}
-}
-EXPORT_SYMBOL_GPL(iomap_ioend_try_merge);
-
-static int
-iomap_ioend_compare(void *priv, const struct list_head *a,
-		const struct list_head *b)
-{
-	struct iomap_ioend *ia = container_of(a, struct iomap_ioend, io_list);
-	struct iomap_ioend *ib = container_of(b, struct iomap_ioend, io_list);
-
-	if (ia->io_offset < ib->io_offset)
-		return -1;
-	if (ia->io_offset > ib->io_offset)
-		return 1;
-	return 0;
-}
-
-void
-iomap_sort_ioends(struct list_head *ioend_list)
-{
-	list_sort(NULL, ioend_list, iomap_ioend_compare);
-}
-EXPORT_SYMBOL_GPL(iomap_sort_ioends);
-
 static void iomap_writepage_end_bio(struct bio *bio)
 {
-	iomap_finish_ioend(iomap_ioend_from_bio(bio),
-			blk_status_to_errno(bio->bi_status));
+	struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
+
+	ioend->io_error = blk_status_to_errno(bio->bi_status);
+	iomap_finish_ioend_buffered(ioend);
 }
 
 /*
- * Submit the final bio for an ioend.
+ * Submit an ioend.
  *
  * If @error is non-zero, it means that we have a situation where some part of
  * the submission process has failed after we've marked pages for writeback.
··· 1591 1694 * failure happened so that the file system end I/O handler gets called 1592 1695 * to clean up. 1593 1696 */ 1594 - if (wpc->ops->prepare_ioend) 1595 - error = wpc->ops->prepare_ioend(wpc->ioend, error); 1697 + if (wpc->ops->submit_ioend) { 1698 + error = wpc->ops->submit_ioend(wpc, error); 1699 + } else { 1700 + if (WARN_ON_ONCE(wpc->iomap.flags & IOMAP_F_ANON_WRITE)) 1701 + error = -EIO; 1702 + if (!error) 1703 + submit_bio(&wpc->ioend->io_bio); 1704 + } 1596 1705 1597 1706 if (error) { 1598 1707 wpc->ioend->io_bio.bi_status = errno_to_blk_status(error); 1599 1708 bio_endio(&wpc->ioend->io_bio); 1600 - } else { 1601 - submit_bio(&wpc->ioend->io_bio); 1602 1709 } 1603 1710 1604 1711 wpc->ioend = NULL; ··· 1610 1709 } 1611 1710 1612 1711 static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc, 1613 - struct writeback_control *wbc, struct inode *inode, loff_t pos) 1712 + struct writeback_control *wbc, struct inode *inode, loff_t pos, 1713 + u16 ioend_flags) 1614 1714 { 1615 - struct iomap_ioend *ioend; 1616 1715 struct bio *bio; 1617 1716 1618 1717 bio = bio_alloc_bioset(wpc->iomap.bdev, BIO_MAX_VECS, ··· 1620 1719 GFP_NOFS, &iomap_ioend_bioset); 1621 1720 bio->bi_iter.bi_sector = iomap_sector(&wpc->iomap, pos); 1622 1721 bio->bi_end_io = iomap_writepage_end_bio; 1623 - wbc_init_bio(wbc, bio); 1624 1722 bio->bi_write_hint = inode->i_write_hint; 1625 - 1626 - ioend = iomap_ioend_from_bio(bio); 1627 - INIT_LIST_HEAD(&ioend->io_list); 1628 - ioend->io_type = wpc->iomap.type; 1629 - ioend->io_flags = wpc->iomap.flags; 1630 - if (pos > wpc->iomap.offset) 1631 - wpc->iomap.flags &= ~IOMAP_F_BOUNDARY; 1632 - ioend->io_inode = inode; 1633 - ioend->io_size = 0; 1634 - ioend->io_offset = pos; 1635 - ioend->io_sector = bio->bi_iter.bi_sector; 1636 - 1723 + wbc_init_bio(wbc, bio); 1637 1724 wpc->nr_folios = 0; 1638 - return ioend; 1725 + return iomap_init_ioend(inode, bio, pos, ioend_flags); 1639 1726 } 1640 1727 1641 - static bool 
iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos) 1728 + static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos, 1729 + u16 ioend_flags) 1642 1730 { 1643 - if (wpc->iomap.offset == pos && (wpc->iomap.flags & IOMAP_F_BOUNDARY)) 1731 + if (ioend_flags & IOMAP_IOEND_BOUNDARY) 1644 1732 return false; 1645 - if ((wpc->iomap.flags & IOMAP_F_SHARED) != 1646 - (wpc->ioend->io_flags & IOMAP_F_SHARED)) 1647 - return false; 1648 - if (wpc->iomap.type != wpc->ioend->io_type) 1733 + if ((ioend_flags & IOMAP_IOEND_NOMERGE_FLAGS) != 1734 + (wpc->ioend->io_flags & IOMAP_IOEND_NOMERGE_FLAGS)) 1649 1735 return false; 1650 1736 if (pos != wpc->ioend->io_offset + wpc->ioend->io_size) 1651 1737 return false; 1652 - if (iomap_sector(&wpc->iomap, pos) != 1738 + if (!(wpc->iomap.flags & IOMAP_F_ANON_WRITE) && 1739 + iomap_sector(&wpc->iomap, pos) != 1653 1740 bio_end_sector(&wpc->ioend->io_bio)) 1654 1741 return false; 1655 1742 /* ··· 1668 1779 { 1669 1780 struct iomap_folio_state *ifs = folio->private; 1670 1781 size_t poff = offset_in_folio(folio, pos); 1782 + unsigned int ioend_flags = 0; 1671 1783 int error; 1672 1784 1673 - if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) { 1785 + if (wpc->iomap.type == IOMAP_UNWRITTEN) 1786 + ioend_flags |= IOMAP_IOEND_UNWRITTEN; 1787 + if (wpc->iomap.flags & IOMAP_F_SHARED) 1788 + ioend_flags |= IOMAP_IOEND_SHARED; 1789 + if (pos == wpc->iomap.offset && (wpc->iomap.flags & IOMAP_F_BOUNDARY)) 1790 + ioend_flags |= IOMAP_IOEND_BOUNDARY; 1791 + 1792 + if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos, ioend_flags)) { 1674 1793 new_ioend: 1675 1794 error = iomap_submit_ioend(wpc, 0); 1676 1795 if (error) 1677 1796 return error; 1678 - wpc->ioend = iomap_alloc_ioend(wpc, wbc, inode, pos); 1797 + wpc->ioend = iomap_alloc_ioend(wpc, wbc, inode, pos, 1798 + ioend_flags); 1679 1799 } 1680 1800 1681 1801 if (!bio_add_folio(&wpc->ioend->io_bio, folio, len, poff)) ··· 1960 2062 return iomap_submit_ioend(wpc, 
error); 1961 2063 } 1962 2064 EXPORT_SYMBOL_GPL(iomap_writepages); 1963 - 1964 - static int __init iomap_buffered_init(void) 1965 - { 1966 - return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE), 1967 - offsetof(struct iomap_ioend, io_bio), 1968 - BIOSET_NEED_BVECS); 1969 - } 1970 - fs_initcall(iomap_buffered_init);
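The ioend_flags handling in iomap_add_to_ioend() and iomap_can_add_to_ioend() above can be modelled in user space. This is a minimal sketch, not the kernel code: the MODEL_* names, the struct layout, and the numeric flag values are assumptions made for illustration (the real definitions live in the iomap header), and the physical-sector contiguity check is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; the kernel defines its own flag bits. */
#define MODEL_IOEND_SHARED	(1U << 0)
#define MODEL_IOEND_UNWRITTEN	(1U << 1)
#define MODEL_IOEND_BOUNDARY	(1U << 2)
#define MODEL_IOEND_NOMERGE_FLAGS \
	(MODEL_IOEND_SHARED | MODEL_IOEND_UNWRITTEN)

/* Hypothetical stand-in for the fields the check looks at. */
struct model_ioend {
	uint64_t offset;	/* file offset where the ioend starts */
	uint64_t size;		/* bytes already added to the ioend */
	uint16_t flags;		/* MODEL_IOEND_* flags set at allocation */
};

/*
 * Return true if a write at @pos carrying @ioend_flags may be appended
 * to @ioend rather than starting a new one.
 */
static bool model_can_add_to_ioend(const struct model_ioend *ioend,
		uint64_t pos, uint16_t ioend_flags)
{
	/* A boundary always starts a new ioend. */
	if (ioend_flags & MODEL_IOEND_BOUNDARY)
		return false;
	/* Shared/unwritten state must match on both sides. */
	if ((ioend_flags & MODEL_IOEND_NOMERGE_FLAGS) !=
	    (ioend->flags & MODEL_IOEND_NOMERGE_FLAGS))
		return false;
	/* The new range must be logically contiguous. */
	return pos == ioend->offset + ioend->size;
}
```

Computing the flags once per call and comparing masked values replaces the separate type and IOMAP_F_SHARED comparisons that the removed lines performed.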
+157 -124
fs/iomap/direct-io.c
··· 1 1 // SPDX-License-Identifier: GPL-2.0 2 2 /* 3 3 * Copyright (C) 2010 Red Hat, Inc. 4 - * Copyright (c) 2016-2021 Christoph Hellwig. 4 + * Copyright (c) 2016-2025 Christoph Hellwig. 5 5 */ 6 6 #include <linux/module.h> 7 7 #include <linux/compiler.h> ··· 12 12 #include <linux/backing-dev.h> 13 13 #include <linux/uio.h> 14 14 #include <linux/task_io_accounting_ops.h> 15 + #include "internal.h" 15 16 #include "trace.h" 16 17 17 18 #include "../internal.h" ··· 21 20 * Private flags for iomap_dio, must not overlap with the public ones in 22 21 * iomap.h: 23 22 */ 23 + #define IOMAP_DIO_NO_INVALIDATE (1U << 25) 24 24 #define IOMAP_DIO_CALLER_COMP (1U << 26) 25 25 #define IOMAP_DIO_INLINE_COMP (1U << 27) 26 26 #define IOMAP_DIO_WRITE_THROUGH (1U << 28) ··· 83 81 WRITE_ONCE(iocb->private, bio); 84 82 } 85 83 86 - if (dio->dops && dio->dops->submit_io) 84 + if (dio->dops && dio->dops->submit_io) { 87 85 dio->dops->submit_io(iter, bio, pos); 88 - else 86 + } else { 87 + WARN_ON_ONCE(iter->iomap.flags & IOMAP_F_ANON_WRITE); 89 88 submit_bio(bio); 89 + } 90 90 } 91 91 92 92 ssize_t iomap_dio_complete(struct iomap_dio *dio) ··· 121 117 * ->end_io() when necessary, otherwise a racing buffer read would cache 122 118 * zeros from unwritten extents. 123 119 */ 124 - if (!dio->error && dio->size && (dio->flags & IOMAP_DIO_WRITE)) 120 + if (!dio->error && dio->size && (dio->flags & IOMAP_DIO_WRITE) && 121 + !(dio->flags & IOMAP_DIO_NO_INVALIDATE)) 125 122 kiocb_invalidate_post_direct_write(iocb, dio->size); 126 123 127 124 inode_dio_end(file_inode(iocb->ki_filp)); ··· 168 163 cmpxchg(&dio->error, 0, ret); 169 164 } 170 165 171 - void iomap_dio_bio_end_io(struct bio *bio) 166 + /* 167 + * Called when dio->ref reaches zero from an I/O completion. 
168 + */ 169 + static void iomap_dio_done(struct iomap_dio *dio) 172 170 { 173 - struct iomap_dio *dio = bio->bi_private; 174 - bool should_dirty = (dio->flags & IOMAP_DIO_DIRTY); 175 171 struct kiocb *iocb = dio->iocb; 176 172 177 - if (bio->bi_status) 178 - iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status)); 179 - if (!atomic_dec_and_test(&dio->ref)) 180 - goto release_bio; 181 - 182 - /* 183 - * Synchronous dio, task itself will handle any completion work 184 - * that needs after IO. All we need to do is wake the task. 185 - */ 186 173 if (dio->wait_for_completion) { 174 + /* 175 + * Synchronous I/O, task itself will handle any completion work 176 + * that needs after IO. All we need to do is wake the task. 177 + */ 187 178 struct task_struct *waiter = dio->submit.waiter; 188 179 189 180 WRITE_ONCE(dio->submit.waiter, NULL); 190 181 blk_wake_io_task(waiter); 191 - goto release_bio; 192 - } 193 - 194 - /* 195 - * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline 196 - */ 197 - if (dio->flags & IOMAP_DIO_INLINE_COMP) { 182 + } else if (dio->flags & IOMAP_DIO_INLINE_COMP) { 198 183 WRITE_ONCE(iocb->private, NULL); 199 184 iomap_dio_complete_work(&dio->aio.work); 200 - goto release_bio; 201 - } 202 - 203 - /* 204 - * If this dio is flagged with IOMAP_DIO_CALLER_COMP, then schedule 205 - * our completion that way to avoid an async punt to a workqueue. 206 - */ 207 - if (dio->flags & IOMAP_DIO_CALLER_COMP) { 185 + } else if (dio->flags & IOMAP_DIO_CALLER_COMP) { 186 + /* 187 + * If this dio is flagged with IOMAP_DIO_CALLER_COMP, then 188 + * schedule our completion that way to avoid an async punt to a 189 + * workqueue. 190 + */ 208 191 /* only polled IO cares about private cleared */ 209 192 iocb->private = dio; 210 193 iocb->dio_complete = iomap_dio_deferred_complete; ··· 210 217 * issuer. 
211 218 */ 212 219 iocb->ki_complete(iocb, 0); 213 - goto release_bio; 214 - } 220 + } else { 221 + struct inode *inode = file_inode(iocb->ki_filp); 215 222 216 - /* 217 - * Async DIO completion that requires filesystem level completion work 218 - * gets punted to a work queue to complete as the operation may require 219 - * more IO to be issued to finalise filesystem metadata changes or 220 - * guarantee data integrity. 221 - */ 222 - INIT_WORK(&dio->aio.work, iomap_dio_complete_work); 223 - queue_work(file_inode(iocb->ki_filp)->i_sb->s_dio_done_wq, 224 - &dio->aio.work); 225 - release_bio: 223 + /* 224 + * Async DIO completion that requires filesystem level 225 + * completion work gets punted to a work queue to complete as 226 + * the operation may require more IO to be issued to finalise 227 + * filesystem metadata changes or guarantee data integrity. 228 + */ 229 + INIT_WORK(&dio->aio.work, iomap_dio_complete_work); 230 + queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); 231 + } 232 + } 233 + 234 + void iomap_dio_bio_end_io(struct bio *bio) 235 + { 236 + struct iomap_dio *dio = bio->bi_private; 237 + bool should_dirty = (dio->flags & IOMAP_DIO_DIRTY); 238 + 239 + if (bio->bi_status) 240 + iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status)); 241 + 242 + if (atomic_dec_and_test(&dio->ref)) 243 + iomap_dio_done(dio); 244 + 226 245 if (should_dirty) { 227 246 bio_check_pages_dirty(bio); 228 247 } else { ··· 243 238 } 244 239 } 245 240 EXPORT_SYMBOL_GPL(iomap_dio_bio_end_io); 241 + 242 + u32 iomap_finish_ioend_direct(struct iomap_ioend *ioend) 243 + { 244 + struct iomap_dio *dio = ioend->io_bio.bi_private; 245 + bool should_dirty = (dio->flags & IOMAP_DIO_DIRTY); 246 + u32 vec_count = ioend->io_bio.bi_vcnt; 247 + 248 + if (ioend->io_error) 249 + iomap_dio_set_error(dio, ioend->io_error); 250 + 251 + if (atomic_dec_and_test(&dio->ref)) { 252 + /* 253 + * Try to avoid another context switch for the completion given 254 + * that we are already called 
from the ioend completion 255 + * workqueue, but never invalidate pages from this thread to 256 + * avoid deadlocks with buffered I/O completions. Tough luck if 257 + * you hit the tiny race with someone dirtying the range now 258 + * between this check and the actual completion. 259 + */ 260 + if (!dio->iocb->ki_filp->f_mapping->nrpages) { 261 + dio->flags |= IOMAP_DIO_INLINE_COMP; 262 + dio->flags |= IOMAP_DIO_NO_INVALIDATE; 263 + } 264 + dio->flags &= ~IOMAP_DIO_CALLER_COMP; 265 + iomap_dio_done(dio); 266 + } 267 + 268 + if (should_dirty) { 269 + bio_check_pages_dirty(&ioend->io_bio); 270 + } else { 271 + bio_release_pages(&ioend->io_bio, false); 272 + bio_put(&ioend->io_bio); 273 + } 274 + 275 + /* 276 + * Return the number of bvecs completed as even direct I/O completions 277 + * do significant per-folio work and we'll still want to give up the 278 + * CPU after a lot of completions. 279 + */ 280 + return vec_count; 281 + } 246 282 247 283 static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio, 248 284 loff_t pos, unsigned len) ··· 312 266 } 313 267 314 268 /* 315 - * Figure out the bio's operation flags from the dio request, the 316 - * mapping, and whether or not we want FUA. Note that we can end up 317 - * clearing the WRITE_THROUGH flag in the dio request. 269 + * Use a FUA write if we need datasync semantics and this is a pure data I/O 270 + * that doesn't require any metadata updates (including after I/O completion 271 + * such as unwritten extent conversion) and the underlying device either 272 + * doesn't have a volatile write cache or supports FUA. 273 + * This allows us to avoid cache flushes on I/O completion. 
318 274 */ 319 - static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio, 320 - const struct iomap *iomap, bool use_fua, bool atomic) 275 + static inline bool iomap_dio_can_use_fua(const struct iomap *iomap, 276 + struct iomap_dio *dio) 321 277 { 322 - blk_opf_t opflags = REQ_SYNC | REQ_IDLE; 323 - 324 - if (!(dio->flags & IOMAP_DIO_WRITE)) 325 - return REQ_OP_READ; 326 - 327 - opflags |= REQ_OP_WRITE; 328 - if (use_fua) 329 - opflags |= REQ_FUA; 330 - else 331 - dio->flags &= ~IOMAP_DIO_WRITE_THROUGH; 332 - if (atomic) 333 - opflags |= REQ_ATOMIC; 334 - 335 - return opflags; 278 + if (iomap->flags & (IOMAP_F_SHARED | IOMAP_F_DIRTY)) 279 + return false; 280 + if (!(dio->flags & IOMAP_DIO_WRITE_THROUGH)) 281 + return false; 282 + return !bdev_write_cache(iomap->bdev) || bdev_fua(iomap->bdev); 336 283 } 337 284 338 - static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, 339 - struct iomap_dio *dio) 285 + static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio) 340 286 { 341 287 const struct iomap *iomap = &iter->iomap; 342 288 struct inode *inode = iter->inode; 343 289 unsigned int fs_block_size = i_blocksize(inode), pad; 344 290 const loff_t length = iomap_length(iter); 345 - bool atomic = iter->flags & IOMAP_ATOMIC; 346 291 loff_t pos = iter->pos; 347 - blk_opf_t bio_opf; 292 + blk_opf_t bio_opf = REQ_SYNC | REQ_IDLE; 348 293 struct bio *bio; 349 294 bool need_zeroout = false; 350 - bool use_fua = false; 351 295 int nr_pages, ret = 0; 352 - size_t copied = 0; 296 + u64 copied = 0; 353 297 size_t orig_count; 354 - 355 - if (atomic && length != fs_block_size) 356 - return -EINVAL; 357 298 358 299 if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) || 359 300 !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter)) 360 301 return -EINVAL; 361 302 362 - if (iomap->type == IOMAP_UNWRITTEN) { 363 - dio->flags |= IOMAP_DIO_UNWRITTEN; 364 - need_zeroout = true; 365 - } 303 + if (dio->flags & IOMAP_DIO_WRITE) { 304 + 
bio_opf |= REQ_OP_WRITE; 366 305 367 - if (iomap->flags & IOMAP_F_SHARED) 368 - dio->flags |= IOMAP_DIO_COW; 306 + if (iomap->flags & IOMAP_F_ATOMIC_BIO) { 307 + /* 308 + * Ensure that the mapping covers the full write 309 + * length, otherwise it won't be submitted as a single 310 + * bio, which is required to use hardware atomics. 311 + */ 312 + if (length != iter->len) 313 + return -EINVAL; 314 + bio_opf |= REQ_ATOMIC; 315 + } 369 316 370 - if (iomap->flags & IOMAP_F_NEW) { 371 - need_zeroout = true; 372 - } else if (iomap->type == IOMAP_MAPPED) { 317 + if (iomap->type == IOMAP_UNWRITTEN) { 318 + dio->flags |= IOMAP_DIO_UNWRITTEN; 319 + need_zeroout = true; 320 + } 321 + 322 + if (iomap->flags & IOMAP_F_SHARED) 323 + dio->flags |= IOMAP_DIO_COW; 324 + 325 + if (iomap->flags & IOMAP_F_NEW) { 326 + need_zeroout = true; 327 + } else if (iomap->type == IOMAP_MAPPED) { 328 + if (iomap_dio_can_use_fua(iomap, dio)) 329 + bio_opf |= REQ_FUA; 330 + else 331 + dio->flags &= ~IOMAP_DIO_WRITE_THROUGH; 332 + } 333 + 373 334 /* 374 - * Use a FUA write if we need datasync semantics, this is a pure 375 - * data IO that doesn't require any metadata updates (including 376 - * after IO completion such as unwritten extent conversion) and 377 - * the underlying device either supports FUA or doesn't have 378 - * a volatile write cache. This allows us to avoid cache flushes 379 - * on IO completion. If we can't use writethrough and need to 380 - * sync, disable in-task completions as dio completion will 381 - * need to call generic_write_sync() which will do a blocking 382 - * fsync / cache flush call. 335 + * We can only do deferred completion for pure overwrites that 336 + * don't require additional I/O at completion time. 337 + * 338 + * This rules out writes that need zeroing or extent conversion, 339 + * extend the file size, or issue metadata I/O or cache flushes 340 + * during completion processing. 
383 341 */ 384 - if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) && 385 - (dio->flags & IOMAP_DIO_WRITE_THROUGH) && 386 - (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev))) 387 - use_fua = true; 388 - else if (dio->flags & IOMAP_DIO_NEED_SYNC) 342 + if (need_zeroout || (pos >= i_size_read(inode)) || 343 + ((dio->flags & IOMAP_DIO_NEED_SYNC) && 344 + !(bio_opf & REQ_FUA))) 389 345 dio->flags &= ~IOMAP_DIO_CALLER_COMP; 346 + } else { 347 + bio_opf |= REQ_OP_READ; 390 348 } 391 349 392 350 /* ··· 403 353 404 354 if (!iov_iter_count(dio->submit.iter)) 405 355 goto out; 406 - 407 - /* 408 - * We can only do deferred completion for pure overwrites that 409 - * don't require additional IO at completion. This rules out 410 - * writes that need zeroing or extent conversion, extend 411 - * the file size, or issue journal IO or cache flushes 412 - * during completion processing. 413 - */ 414 - if (need_zeroout || 415 - ((dio->flags & IOMAP_DIO_NEED_SYNC) && !use_fua) || 416 - ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) 417 - dio->flags &= ~IOMAP_DIO_CALLER_COMP; 418 356 419 357 /* 420 358 * The rules for polled IO completions follow the guidelines as the ··· 420 382 if (ret) 421 383 goto out; 422 384 } 423 - 424 - bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic); 425 385 426 386 nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS); 427 387 do { ··· 452 416 } 453 417 454 418 n = bio->bi_iter.bi_size; 455 - if (WARN_ON_ONCE(atomic && n != length)) { 419 + if (WARN_ON_ONCE((bio_opf & REQ_ATOMIC) && n != length)) { 456 420 /* 457 - * This bio should have covered the complete length, 421 + * An atomic write bio must cover the complete length, 458 422 * which it doesn't, so error. We may need to zero out 459 423 * the tail (complete FS block), similar to when 460 424 * bio_iov_iter_get_pages() returns an error, above. 
··· 501 465 /* Undo iter limitation to current extent */ 502 466 iov_iter_reexpand(dio->submit.iter, orig_count - copied); 503 467 if (copied) 504 - return copied; 468 + return iomap_iter_advance(iter, &copied); 505 469 return ret; 506 470 } 507 471 508 - static loff_t iomap_dio_hole_iter(const struct iomap_iter *iter, 509 - struct iomap_dio *dio) 472 + static int iomap_dio_hole_iter(struct iomap_iter *iter, struct iomap_dio *dio) 510 473 { 511 474 loff_t length = iov_iter_zero(iomap_length(iter), dio->submit.iter); 512 475 513 476 dio->size += length; 514 477 if (!length) 515 478 return -EFAULT; 516 - return length; 479 + return iomap_iter_advance(iter, &length); 517 480 } 518 481 519 - static loff_t iomap_dio_inline_iter(const struct iomap_iter *iomi, 520 - struct iomap_dio *dio) 482 + static int iomap_dio_inline_iter(struct iomap_iter *iomi, struct iomap_dio *dio) 521 483 { 522 484 const struct iomap *iomap = &iomi->iomap; 523 485 struct iov_iter *iter = dio->submit.iter; 524 486 void *inline_data = iomap_inline_data(iomap, iomi->pos); 525 487 loff_t length = iomap_length(iomi); 526 488 loff_t pos = iomi->pos; 527 - size_t copied; 489 + u64 copied; 528 490 529 491 if (WARN_ON_ONCE(!iomap_inline_data_valid(iomap))) 530 492 return -EIO; ··· 544 510 dio->size += copied; 545 511 if (!copied) 546 512 return -EFAULT; 547 - return copied; 513 + return iomap_iter_advance(iomi, &copied); 548 514 } 549 515 550 - static loff_t iomap_dio_iter(const struct iomap_iter *iter, 551 - struct iomap_dio *dio) 516 + static int iomap_dio_iter(struct iomap_iter *iter, struct iomap_dio *dio) 552 517 { 553 518 switch (iter->iomap.type) { 554 519 case IOMAP_HOLE: ··· 641 608 if (iocb->ki_flags & IOCB_NOWAIT) 642 609 iomi.flags |= IOMAP_NOWAIT; 643 610 644 - if (iocb->ki_flags & IOCB_ATOMIC) 645 - iomi.flags |= IOMAP_ATOMIC; 646 - 647 611 if (iov_iter_rw(iter) == READ) { 648 612 /* reads can always complete inline */ 649 613 dio->flags |= IOMAP_DIO_INLINE_COMP; ··· 674 644 goto 
out_free_dio; 675 645 iomi.flags |= IOMAP_OVERWRITE_ONLY; 676 646 } 647 + 648 + if (iocb->ki_flags & IOCB_ATOMIC) 649 + iomi.flags |= IOMAP_ATOMIC; 677 650 678 651 /* for data sync or sync, we need sync completion processing */ 679 652 if (iocb_is_dsync(iocb)) { ··· 731 698 732 699 blk_start_plug(&plug); 733 700 while ((ret = iomap_iter(&iomi, ops)) > 0) { 734 - iomi.processed = iomap_dio_iter(&iomi, dio); 701 + iomi.status = iomap_dio_iter(&iomi, dio); 735 702 736 703 /* 737 704 * We can only poll for single bio I/Os.
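The direct I/O helpers above now return the result of iomap_iter_advance() instead of a byte count. The following user-space sketch models that contract under stated assumptions: struct model_iter and both function names are hypothetical, and only the position, remaining length, and current mapping extent are tracked.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct iomap_iter. */
struct model_iter {
	uint64_t pos;		/* current file position */
	uint64_t len;		/* bytes left in the whole operation */
	uint64_t map_offset;	/* start of the current mapping */
	uint64_t map_length;	/* length of the current mapping */
};

/* Bytes left in the current mapping from pos, capped by the operation. */
static uint64_t model_length(const struct model_iter *iter)
{
	uint64_t left = iter->map_offset + iter->map_length - iter->pos;

	return left < iter->len ? left : iter->len;
}

/*
 * Advance by *count bytes and write the length remaining in the current
 * mapping back to *count. Returns 0 on success or -EIO if the caller
 * asks to advance past the end of the mapping.
 */
static int model_iter_advance(struct model_iter *iter, uint64_t *count)
{
	if (*count > model_length(iter))
		return -EIO;
	iter->pos += *count;
	iter->len -= *count;
	*count = model_length(iter);
	return 0;
}
```

Because the return value is a status rather than a length, a caller can advance several times within one mapping, which is what the rename of the processed field to status reflects.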
+10 -11
fs/iomap/fiemap.c
···
 			iomap->length, flags);
 }
 
-static loff_t iomap_fiemap_iter(const struct iomap_iter *iter,
+static int iomap_fiemap_iter(struct iomap_iter *iter,
 		struct fiemap_extent_info *fi, struct iomap *prev)
 {
 	int ret;
 
 	if (iter->iomap.type == IOMAP_HOLE)
-		return iomap_length(iter);
+		goto advance;
 
 	ret = iomap_to_fiemap(fi, prev, 0);
 	*prev = iter->iomap;
-	switch (ret) {
-	case 0: /* success */
-		return iomap_length(iter);
-	case 1: /* extent array full */
-		return 0;
-	default: /* error */
+	if (ret < 0)
 		return ret;
-	}
+	if (ret == 1)	/* extent array full */
+		return 0;
+
+advance:
+	return iomap_iter_advance_full(iter);
 }
 
 int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
···
 		return ret;
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_fiemap_iter(&iter, fi, &prev);
+		iter.status = iomap_fiemap_iter(&iter, fi, &prev);
 
 	if (prev.type != IOMAP_HOLE) {
 		ret = iomap_to_fiemap(fi, &prev, FIEMAP_EXTENT_LAST);
···
 	while ((ret = iomap_iter(&iter, ops)) > 0) {
 		if (iter.iomap.type == IOMAP_MAPPED)
 			bno = iomap_sector(&iter.iomap, iter.pos) >> blkshift;
-		/* leave iter.processed unset to abort loop */
+		/* leave iter.status unset to abort loop */
 	}
 	if (ret)
 		return 0;
+10
fs/iomap/internal.h
···
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _IOMAP_INTERNAL_H
+#define _IOMAP_INTERNAL_H 1
+
+#define IOEND_BATCH_SIZE	4096
+
+u32 iomap_finish_ioend_buffered(struct iomap_ioend *ioend);
+u32 iomap_finish_ioend_direct(struct iomap_ioend *ioend);
+
+#endif /* _IOMAP_INTERNAL_H */
+216
fs/iomap/ioend.c
···
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2024-2025 Christoph Hellwig.
+ */
+#include <linux/iomap.h>
+#include <linux/list_sort.h>
+#include "internal.h"
+
+struct bio_set iomap_ioend_bioset;
+EXPORT_SYMBOL_GPL(iomap_ioend_bioset);
+
+struct iomap_ioend *iomap_init_ioend(struct inode *inode,
+		struct bio *bio, loff_t file_offset, u16 ioend_flags)
+{
+	struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
+
+	atomic_set(&ioend->io_remaining, 1);
+	ioend->io_error = 0;
+	ioend->io_parent = NULL;
+	INIT_LIST_HEAD(&ioend->io_list);
+	ioend->io_flags = ioend_flags;
+	ioend->io_inode = inode;
+	ioend->io_offset = file_offset;
+	ioend->io_size = bio->bi_iter.bi_size;
+	ioend->io_sector = bio->bi_iter.bi_sector;
+	ioend->io_private = NULL;
+	return ioend;
+}
+EXPORT_SYMBOL_GPL(iomap_init_ioend);
+
+static u32 iomap_finish_ioend(struct iomap_ioend *ioend, int error)
+{
+	if (ioend->io_parent) {
+		struct bio *bio = &ioend->io_bio;
+
+		ioend = ioend->io_parent;
+		bio_put(bio);
+	}
+
+	if (error)
+		cmpxchg(&ioend->io_error, 0, error);
+
+	if (!atomic_dec_and_test(&ioend->io_remaining))
+		return 0;
+	if (ioend->io_flags & IOMAP_IOEND_DIRECT)
+		return iomap_finish_ioend_direct(ioend);
+	return iomap_finish_ioend_buffered(ioend);
+}
+
+/*
+ * Ioend completion routine for merged bios. This can only be called from task
+ * contexts as merged ioends can be of unbound length. Hence we have to break up
+ * the writeback completions into manageable chunks to avoid long scheduler
+ * holdoffs. We aim to keep scheduler holdoffs down below 10ms so that we get
+ * good batch processing throughput without creating adverse scheduler latency
+ * conditions.
+ */
+void iomap_finish_ioends(struct iomap_ioend *ioend, int error)
+{
+	struct list_head tmp;
+	u32 completions;
+
+	might_sleep();
+
+	list_replace_init(&ioend->io_list, &tmp);
+	completions = iomap_finish_ioend(ioend, error);
+
+	while (!list_empty(&tmp)) {
+		if (completions > IOEND_BATCH_SIZE * 8) {
+			cond_resched();
+			completions = 0;
+		}
+		ioend = list_first_entry(&tmp, struct iomap_ioend, io_list);
+		list_del_init(&ioend->io_list);
+		completions += iomap_finish_ioend(ioend, error);
+	}
+}
+EXPORT_SYMBOL_GPL(iomap_finish_ioends);
+
+/*
+ * We can merge two adjacent ioends if they have the same set of work to do.
+ */
+static bool iomap_ioend_can_merge(struct iomap_ioend *ioend,
+		struct iomap_ioend *next)
+{
+	if (ioend->io_bio.bi_status != next->io_bio.bi_status)
+		return false;
+	if (next->io_flags & IOMAP_IOEND_BOUNDARY)
+		return false;
+	if ((ioend->io_flags & IOMAP_IOEND_NOMERGE_FLAGS) !=
+	    (next->io_flags & IOMAP_IOEND_NOMERGE_FLAGS))
+		return false;
+	if (ioend->io_offset + ioend->io_size != next->io_offset)
+		return false;
+	/*
+	 * Do not merge physically discontiguous ioends. The filesystem
+	 * completion functions will have to iterate the physical
+	 * discontiguities even if we merge the ioends at a logical level, so
+	 * we don't gain anything by merging physical discontiguities here.
+	 *
+	 * We cannot use bio->bi_iter.bi_sector here as it is modified during
+	 * submission so does not point to the start sector of the bio at
+	 * completion.
+	 */
+	if (ioend->io_sector + (ioend->io_size >> SECTOR_SHIFT) !=
+	    next->io_sector)
+		return false;
+	return true;
+}
+
+void iomap_ioend_try_merge(struct iomap_ioend *ioend,
+		struct list_head *more_ioends)
+{
+	struct iomap_ioend *next;
+
+	INIT_LIST_HEAD(&ioend->io_list);
+
+	while ((next = list_first_entry_or_null(more_ioends, struct iomap_ioend,
+			io_list))) {
+		if (!iomap_ioend_can_merge(ioend, next))
+			break;
+		list_move_tail(&next->io_list, &ioend->io_list);
+		ioend->io_size += next->io_size;
+	}
+}
+EXPORT_SYMBOL_GPL(iomap_ioend_try_merge);
+
+static int iomap_ioend_compare(void *priv, const struct list_head *a,
+		const struct list_head *b)
+{
+	struct iomap_ioend *ia = container_of(a, struct iomap_ioend, io_list);
+	struct iomap_ioend *ib = container_of(b, struct iomap_ioend, io_list);
+
+	if (ia->io_offset < ib->io_offset)
+		return -1;
+	if (ia->io_offset > ib->io_offset)
+		return 1;
+	return 0;
+}
+
+void iomap_sort_ioends(struct list_head *ioend_list)
+{
+	list_sort(NULL, ioend_list, iomap_ioend_compare);
+}
+EXPORT_SYMBOL_GPL(iomap_sort_ioends);
+
+/*
+ * Split up to the first @max_len bytes from @ioend if the ioend covers more
+ * than @max_len bytes.
+ *
+ * If @is_append is set, the split will be based on the hardware limits for
+ * REQ_OP_ZONE_APPEND commands and can be less than @max_len if the hardware
+ * limits don't allow the entire @max_len length.
+ *
+ * The bio embedded into @ioend must be a REQ_OP_WRITE because the block layer
+ * does not allow splitting REQ_OP_ZONE_APPEND bios. The file systems has to
+ * switch the operation after this call, but before submitting the bio.
+ */
+struct iomap_ioend *iomap_split_ioend(struct iomap_ioend *ioend,
+		unsigned int max_len, bool is_append)
+{
+	struct bio *bio = &ioend->io_bio;
+	struct iomap_ioend *split_ioend;
+	unsigned int nr_segs;
+	int sector_offset;
+	struct bio *split;
+
+	if (is_append) {
+		struct queue_limits *lim = bdev_limits(bio->bi_bdev);
+
+		max_len = min(max_len,
+			      lim->max_zone_append_sectors << SECTOR_SHIFT);
+
+		sector_offset = bio_split_rw_at(bio, lim, &nr_segs, max_len);
+		if (unlikely(sector_offset < 0))
+			return ERR_PTR(sector_offset);
+		if (!sector_offset)
+			return NULL;
+	} else {
+		if (bio->bi_iter.bi_size <= max_len)
+			return NULL;
+		sector_offset = max_len >> SECTOR_SHIFT;
+	}
+
+	/* ensure the split ioend is still block size aligned */
+	sector_offset = ALIGN_DOWN(sector_offset << SECTOR_SHIFT,
+			i_blocksize(ioend->io_inode)) >> SECTOR_SHIFT;
+
+	split = bio_split(bio, sector_offset, GFP_NOFS, &iomap_ioend_bioset);
+	if (IS_ERR(split))
+		return ERR_CAST(split);
+	split->bi_private = bio->bi_private;
+	split->bi_end_io = bio->bi_end_io;
+
+	split_ioend = iomap_init_ioend(ioend->io_inode, split, ioend->io_offset,
+			ioend->io_flags);
+	split_ioend->io_parent = ioend;
+
+	atomic_inc(&ioend->io_remaining);
+	ioend->io_offset += split_ioend->io_size;
+	ioend->io_size -= split_ioend->io_size;
+
+	split_ioend->io_sector = ioend->io_sector;
+	if (!is_append)
+		ioend->io_sector += (split_ioend->io_size >> SECTOR_SHIFT);
+	return split_ioend;
+}
+EXPORT_SYMBOL_GPL(iomap_split_ioend);
+
+static int __init iomap_ioend_init(void)
+{
+	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
+			   offsetof(struct iomap_ioend, io_bio),
+			   BIOSET_NEED_BVECS);
+}
+fs_initcall(iomap_ioend_init);
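The interaction between iomap_split_ioend() and iomap_finish_ioend() is easier to follow with the bio handling stripped away. The sketch below is a single-threaded model under stated assumptions: struct model_ioend and the model_* functions are hypothetical, plain integers stand in for the kernel's atomic_t and cmpxchg(), and no memory is freed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct iomap_ioend. */
struct model_ioend {
	int remaining;			/* outstanding completions */
	int error;			/* first error recorded, 0 if none */
	struct model_ioend *parent;	/* set on ioends created by a split */
};

/* Mirrors the initial state: one reference, no error, no parent. */
static void model_init(struct model_ioend *ioend)
{
	ioend->remaining = 1;
	ioend->error = 0;
	ioend->parent = NULL;
}

/* A split hands the child to the caller and pins the parent once more. */
static void model_split(struct model_ioend *parent, struct model_ioend *child)
{
	model_init(child);
	child->parent = parent;
	parent->remaining++;
}

/*
 * Complete one ioend. A split child forwards its completion and error to
 * the parent. Returns true only when the last reference is dropped, which
 * is the point where the real code runs the buffered or direct handler.
 */
static bool model_finish(struct model_ioend *ioend, int error)
{
	if (ioend->parent)
		ioend = ioend->parent;
	if (error && !ioend->error)
		ioend->error = error;
	return --ioend->remaining == 0;
}
```

In this model, finishing a split child with an error returns false and records the error on the parent; the parent's own completion then returns true and still sees the first error.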
+58 -39
fs/iomap/iter.c
···
 #include <linux/iomap.h>
 #include "trace.h"
 
-/*
- * Advance to the next range we need to map.
- *
- * If the iomap is marked IOMAP_F_STALE, it means the existing map was not fully
- * processed - it was aborted because the extent the iomap spanned may have been
- * changed during the operation. In this case, the iteration behaviour is to
- * remap the unprocessed range of the iter, and that means we may need to remap
- * even when we've made no progress (i.e. iter->processed = 0). Hence the
- * "finished iterating" case needs to distinguish between
- * (processed = 0) meaning we are done and (processed = 0 && stale) meaning we
- * need to remap the entire remaining range.
- */
-static inline int iomap_iter_advance(struct iomap_iter *iter)
+static inline void iomap_iter_reset_iomap(struct iomap_iter *iter)
 {
-	bool stale = iter->iomap.flags & IOMAP_F_STALE;
-	int ret = 1;
-
-	/* handle the previous iteration (if any) */
-	if (iter->iomap.length) {
-		if (iter->processed < 0)
-			return iter->processed;
-		if (WARN_ON_ONCE(iter->processed > iomap_length(iter)))
-			return -EIO;
-		iter->pos += iter->processed;
-		iter->len -= iter->processed;
-		if (!iter->len || (!iter->processed && !stale))
-			ret = 0;
-	}
-
-	/* clear the per iteration state */
-	iter->processed = 0;
+	iter->status = 0;
 	memset(&iter->iomap, 0, sizeof(iter->iomap));
 	memset(&iter->srcmap, 0, sizeof(iter->srcmap));
-	return ret;
+}
+
+/*
+ * Advance the current iterator position and output the length remaining for the
+ * current mapping.
+ */
+int iomap_iter_advance(struct iomap_iter *iter, u64 *count)
+{
+	if (WARN_ON_ONCE(*count > iomap_length(iter)))
+		return -EIO;
+	iter->pos += *count;
+	iter->len -= *count;
+	*count = iomap_length(iter);
+	return 0;
 }
 
 static inline void iomap_iter_done(struct iomap_iter *iter)
···
 	WARN_ON_ONCE(iter->iomap.length == 0);
 	WARN_ON_ONCE(iter->iomap.offset + iter->iomap.length <= iter->pos);
 	WARN_ON_ONCE(iter->iomap.flags & IOMAP_F_STALE);
+
+	iter->iter_start_pos = iter->pos;
 
 	trace_iomap_iter_dstmap(iter->inode, &iter->iomap);
 	if (iter->srcmap.type != IOMAP_HOLE)
···
  * function must be called in a loop that continues as long it returns a
  * positive value. If 0 or a negative value is returned, the caller must not
  * return to the loop body. Within a loop body, there are two ways to break out
- * of the loop body: leave @iter.processed unchanged, or set it to a negative
+ * of the loop body: leave @iter.status unchanged, or set it to a negative
  * errno.
  */
 int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
 {
+	bool stale = iter->iomap.flags & IOMAP_F_STALE;
+	ssize_t advanced;
+	u64 olen;
 	int ret;
 
-	if (iter->iomap.length && ops->iomap_end) {
-		ret = ops->iomap_end(iter->inode, iter->pos, iomap_length(iter),
-				iter->processed > 0 ? iter->processed : 0,
-				iter->flags, &iter->iomap);
-		if (ret < 0 && !iter->processed)
+	trace_iomap_iter(iter, ops, _RET_IP_);
+
+	if (!iter->iomap.length)
+		goto begin;
+
+	/*
+	 * Calculate how far the iter was advanced and the original length bytes
+	 * for ->iomap_end().
+	 */
+	advanced = iter->pos - iter->iter_start_pos;
+	olen = iter->len + advanced;
+
+	if (ops->iomap_end) {
+		ret = ops->iomap_end(iter->inode, iter->iter_start_pos,
+				iomap_length_trim(iter, iter->iter_start_pos,
+						  olen),
+				advanced, iter->flags, &iter->iomap);
+		if (ret < 0 && !advanced)
 			return ret;
 	}
 
-	trace_iomap_iter(iter, ops, _RET_IP_);
-	ret = iomap_iter_advance(iter);
+	/* detect old return semantics where this would advance */
+	if (WARN_ON_ONCE(iter->status > 0))
+		iter->status = -EIO;
+
+	/*
+	 * Use iter->len to determine whether to continue onto the next mapping.
+	 * Explicitly terminate on error status or if the current iter has not
+	 * advanced at all (i.e. no work was done for some reason) unless the
+	 * mapping has been marked stale and needs to be reprocessed.
+	 */
+	if (iter->status < 0)
+		ret = iter->status;
+	else if (iter->len == 0 || (!advanced && !stale))
+		ret = 0;
+	else
+		ret = 1;
+	iomap_iter_reset_iomap(iter);
 	if (ret <= 0)
 		return ret;
 
+begin:
 	ret = ops->iomap_begin(iter->inode, iter->pos, iter->len, iter->flags,
 			       &iter->iomap, &iter->srcmap);
 	if (ret < 0)
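Reviewer note: the in/out contract of the new iomap_iter_advance() (the caller passes the number of bytes consumed in *count and gets back the bytes remaining in the current mapping) may be easier to see in a small user-space model. This is a sketch under assumptions: struct model_iter, model_length() and model_advance() are hypothetical names, the srcmap trimming is left out, and the kernel's WARN_ON_ONCE is reduced to a plain error return.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct iomap_iter. */
struct model_iter {
	uint64_t pos;		/* current file position */
	uint64_t len;		/* bytes left in the whole operation */
	uint64_t map_end;	/* end offset of the current mapping */
};

/* Bytes of the current mapping still to be processed. */
static uint64_t model_length(const struct model_iter *it)
{
	uint64_t left = it->map_end - it->pos;

	return it->len < left ? it->len : left;
}

/*
 * Advance by *count bytes, then store the remaining length of the
 * current mapping back into *count. Advancing past the mapping fails.
 */
static int model_advance(struct model_iter *it, uint64_t *count)
{
	if (*count > model_length(it))
		return -EIO;
	it->pos += *count;
	it->len -= *count;
	*count = model_length(it);
	return 0;
}
```

A loop body can call the helper repeatedly on partial progress; once *count comes back as 0 the current mapping is exhausted and the next iomap_iter() call is expected to fetch a new one.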
+8 -8
fs/iomap/seek.c
···
 #include <linux/pagemap.h>
 #include <linux/pagevec.h>
 
-static loff_t iomap_seek_hole_iter(const struct iomap_iter *iter,
+static int iomap_seek_hole_iter(struct iomap_iter *iter,
 		loff_t *hole_pos)
 {
 	loff_t length = iomap_length(iter);
···
 		*hole_pos = mapping_seek_hole_data(iter->inode->i_mapping,
 				iter->pos, iter->pos + length, SEEK_HOLE);
 		if (*hole_pos == iter->pos + length)
-			return length;
+			return iomap_iter_advance(iter, &length);
 		return 0;
 	case IOMAP_HOLE:
 		*hole_pos = iter->pos;
 		return 0;
 	default:
-		return length;
+		return iomap_iter_advance(iter, &length);
 	}
 }
 
···
 
 	iter.len = size - pos;
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_seek_hole_iter(&iter, &pos);
+		iter.status = iomap_seek_hole_iter(&iter, &pos);
 	if (ret < 0)
 		return ret;
 	if (iter.len) /* found hole before EOF */
···
 }
 EXPORT_SYMBOL_GPL(iomap_seek_hole);
 
-static loff_t iomap_seek_data_iter(const struct iomap_iter *iter,
+static int iomap_seek_data_iter(struct iomap_iter *iter,
 		loff_t *hole_pos)
 {
 	loff_t length = iomap_length(iter);
 
 	switch (iter->iomap.type) {
 	case IOMAP_HOLE:
-		return length;
+		return iomap_iter_advance(iter, &length);
 	case IOMAP_UNWRITTEN:
 		*hole_pos = mapping_seek_hole_data(iter->inode->i_mapping,
 				iter->pos, iter->pos + length, SEEK_DATA);
 		if (*hole_pos < 0)
-			return length;
+			return iomap_iter_advance(iter, &length);
 		return 0;
 	default:
 		*hole_pos = iter->pos;
···
 
 	iter.len = size - pos;
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_seek_data_iter(&iter, &pos);
+		iter.status = iomap_seek_data_iter(&iter, &pos);
 	if (ret < 0)
 		return ret;
 	if (iter.len) /* found data before EOF */
+4 -3
fs/iomap/swapfile.c
···
  * swap only cares about contiguous page-aligned physical extents and makes no
  * distinction between written and unwritten extents.
  */
-static loff_t iomap_swapfile_iter(const struct iomap_iter *iter,
+static int iomap_swapfile_iter(struct iomap_iter *iter,
 		struct iomap *iomap, struct iomap_swapfile_info *isi)
 {
 	switch (iomap->type) {
···
 			return error;
 		memcpy(&isi->iomap, iomap, sizeof(isi->iomap));
 	}
-	return iomap_length(iter);
+
+	return iomap_iter_advance_full(iter);
 }
 
 /*
···
 		return ret;
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_swapfile_iter(&iter, &iter.iomap, &isi);
+		iter.status = iomap_swapfile_iter(&iter, &iter.iomap, &isi);
 	if (ret < 0)
 		return ret;
 
+4 -4
fs/iomap/trace.h
···
 		__field(u64, ino)
 		__field(loff_t, pos)
 		__field(u64, length)
-		__field(s64, processed)
+		__field(int, status)
 		__field(unsigned int, flags)
 		__field(const void *, ops)
 		__field(unsigned long, caller)
···
 		__entry->ino = iter->inode->i_ino;
 		__entry->pos = iter->pos;
 		__entry->length = iomap_length(iter);
-		__entry->processed = iter->processed;
+		__entry->status = iter->status;
 		__entry->flags = iter->flags;
 		__entry->ops = ops;
 		__entry->caller = caller;
 	),
-	TP_printk("dev %d:%d ino 0x%llx pos 0x%llx length 0x%llx processed %lld flags %s (0x%x) ops %ps caller %pS",
+	TP_printk("dev %d:%d ino 0x%llx pos 0x%llx length 0x%llx status %d flags %s (0x%x) ops %ps caller %pS",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->pos,
 		  __entry->length,
-		  __entry->processed,
+		  __entry->status,
 		  __print_flags(__entry->flags, "|", IOMAP_FLAGS_STRINGS),
 		  __entry->flags,
 		  __entry->ops,
+15 -10
fs/xfs/xfs_aops.c
···
 	 */
 	error = blk_status_to_errno(ioend->io_bio.bi_status);
 	if (unlikely(error)) {
-		if (ioend->io_flags & IOMAP_F_SHARED) {
+		if (ioend->io_flags & IOMAP_IOEND_SHARED) {
 			xfs_reflink_cancel_cow_range(ip, offset, size, true);
 			xfs_bmap_punch_delalloc_range(ip, XFS_DATA_FORK, offset,
 					offset + size);
···
 	/*
 	 * Success: commit the COW or unwritten blocks if needed.
 	 */
-	if (ioend->io_flags & IOMAP_F_SHARED)
+	if (ioend->io_flags & IOMAP_IOEND_SHARED)
 		error = xfs_reflink_end_cow(ip, offset, size);
-	else if (ioend->io_type == IOMAP_UNWRITTEN)
+	else if (ioend->io_flags & IOMAP_IOEND_UNWRITTEN)
 		error = xfs_iomap_write_unwritten(ip, offset, size, false);
 
 	if (!error && xfs_ioend_is_append(ioend))
···
 }
 
 static int
-xfs_prepare_ioend(
-	struct iomap_ioend	*ioend,
+xfs_submit_ioend(
+	struct iomap_writepage_ctx *wpc,
 	int			status)
 {
+	struct iomap_ioend	*ioend = wpc->ioend;
 	unsigned int		nofs_flag;
 
 	/*
···
 	nofs_flag = memalloc_nofs_save();
 
 	/* Convert CoW extents to regular */
-	if (!status && (ioend->io_flags & IOMAP_F_SHARED)) {
+	if (!status && (ioend->io_flags & IOMAP_IOEND_SHARED)) {
 		status = xfs_reflink_convert_cow(XFS_I(ioend->io_inode),
 				ioend->io_offset, ioend->io_size);
 	}
···
 	memalloc_nofs_restore(nofs_flag);
 
 	/* send ioends that might require a transaction to the completion wq */
-	if (xfs_ioend_is_append(ioend) || ioend->io_type == IOMAP_UNWRITTEN ||
-	    (ioend->io_flags & IOMAP_F_SHARED))
+	if (xfs_ioend_is_append(ioend) ||
+	    (ioend->io_flags & (IOMAP_IOEND_UNWRITTEN | IOMAP_IOEND_SHARED)))
 		ioend->io_bio.bi_end_io = xfs_end_bio;
-	return status;
+
+	if (status)
+		return status;
+	submit_bio(&ioend->io_bio);
+	return 0;
 }
 
 /*
···
 
 static const struct iomap_writeback_ops xfs_writeback_ops = {
 	.map_blocks		= xfs_map_blocks,
-	.prepare_ioend		= xfs_prepare_ioend,
+	.submit_ioend		= xfs_submit_ioend,
 	.discard_folio		= xfs_discard_folio,
 };
 
+4 -2
fs/xfs/xfs_file.c
···
 	if (IS_DAX(inode))
 		ret = xfs_dax_fault_locked(vmf, order, true);
 	else
-		ret = iomap_page_mkwrite(vmf, &xfs_buffered_write_iomap_ops);
+		ret = iomap_page_mkwrite(vmf, &xfs_buffered_write_iomap_ops,
+				NULL);
 	xfs_iunlock(ip, lock_mode);
 
 	sb_end_pagefault(inode->i_sb);
···
 	.fadvise	= xfs_file_fadvise,
 	.remap_file_range = xfs_file_remap_range,
 	.fop_flags	= FOP_MMAP_SYNC | FOP_BUFFER_RASYNC |
-			  FOP_BUFFER_WASYNC | FOP_DIO_PARALLEL_WRITE,
+			  FOP_BUFFER_WASYNC | FOP_DIO_PARALLEL_WRITE |
+			  FOP_DONTCACHE,
 };
 
 const struct file_operations xfs_dir_file_operations = {
+6 -2
fs/xfs/xfs_iomap.c
···
 	if (offset + length > i_size_read(inode))
 		iomap_flags |= IOMAP_F_DIRTY;
 
+	/* HW-offload atomics are always used in this path */
+	if (flags & IOMAP_ATOMIC)
+		iomap_flags |= IOMAP_F_ATOMIC_BIO;
+
 	/*
 	 * COW writes may allocate delalloc space or convert unwritten COW
 	 * extents, so we need to make sure to take the lock exclusively here.
···
 		return dax_zero_range(inode, pos, len, did_zero,
 				      &xfs_dax_write_iomap_ops);
 	return iomap_zero_range(inode, pos, len, did_zero,
-				&xfs_buffered_write_iomap_ops);
+				&xfs_buffered_write_iomap_ops, NULL);
 }
 
 int
···
 		return dax_truncate_page(inode, pos, did_zero,
 					&xfs_dax_write_iomap_ops);
 	return iomap_truncate_page(inode, pos, did_zero,
-				   &xfs_buffered_write_iomap_ops);
+				   &xfs_buffered_write_iomap_ops, NULL);
 }
+1 -1
fs/zonefs/file.c
···
 
 	/* Serialize against truncates */
 	filemap_invalidate_lock_shared(inode->i_mapping);
-	ret = iomap_page_mkwrite(vmf, &zonefs_write_iomap_ops);
+	ret = iomap_page_mkwrite(vmf, &zonefs_write_iomap_ops, NULL);
 	filemap_invalidate_unlock_shared(inode->i_mapping);
 
 	sb_end_pagefault(inode->i_sb);
+95 -21
include/linux/iomap.h
···
  *
  * IOMAP_F_BOUNDARY indicates that I/O and I/O completions for this iomap must
  * never be merged with the mapping before it.
+ *
+ * IOMAP_F_ANON_WRITE indicates that (write) I/O does not have a target block
+ * assigned to it yet and the file system will do that in the bio submission
+ * handler, splitting the I/O as needed.
+ *
+ * IOMAP_F_ATOMIC_BIO indicates that (write) I/O will be issued as an atomic
+ * bio, i.e. set REQ_ATOMIC.
  */
 #define IOMAP_F_NEW		(1U << 0)
 #define IOMAP_F_DIRTY		(1U << 1)
···
 #endif /* CONFIG_BUFFER_HEAD */
 #define IOMAP_F_XATTR		(1U << 5)
 #define IOMAP_F_BOUNDARY	(1U << 6)
+#define IOMAP_F_ANON_WRITE	(1U << 7)
+#define IOMAP_F_ATOMIC_BIO	(1U << 8)
 
 /*
  * Flags set by the core iomap code during operations:
···
 
 static inline sector_t iomap_sector(const struct iomap *iomap, loff_t pos)
 {
+	if (iomap->flags & IOMAP_F_ANON_WRITE)
+		return U64_MAX; /* invalid */
 	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
 }
 
···
 #else
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
-#define IOMAP_ATOMIC		(1 << 9)
+#define IOMAP_ATOMIC		(1 << 9) /* torn-write protection */
+#define IOMAP_DONTCACHE		(1 << 10)
 
 struct iomap_ops {
 	/*
···
  *	calls to iomap_iter(). Treat as read-only in the body.
  * @len: The remaining length of the file segment we're operating on.
  *	It is updated at the same time as @pos.
- * @processed: The number of bytes processed by the body in the most recent
- *	iteration, or a negative errno. 0 causes the iteration to stop.
+ * @iter_start_pos: The original start pos for the current iomap. Used for
+ *	incremental iter advance.
+ * @status: Status of the most recent iteration. Zero on success or a negative
+ *	errno on error.
  * @flags: Zero or more of the iomap_begin flags above.
  * @iomap: Map describing the I/O iteration
  * @srcmap: Source map for COW operations
···
 	struct inode *inode;
 	loff_t pos;
 	u64 len;
-	s64 processed;
+	loff_t iter_start_pos;
+	int status;
 	unsigned flags;
 	struct iomap iomap;
 	struct iomap srcmap;
···
 };
 
 int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops);
+int iomap_iter_advance(struct iomap_iter *iter, u64 *count);
+
+/**
+ * iomap_length_trim - trimmed length of the current iomap iteration
+ * @iter: iteration structure
+ * @pos: File position to trim from.
+ * @len: Length of the mapping to trim to.
+ *
+ * Returns a trimmed length that the operation applies to for the current
+ * iteration.
+ */
+static inline u64 iomap_length_trim(const struct iomap_iter *iter, loff_t pos,
+		u64 len)
+{
+	u64 end = iter->iomap.offset + iter->iomap.length;
+
+	if (iter->srcmap.type != IOMAP_HOLE)
+		end = min(end, iter->srcmap.offset + iter->srcmap.length);
+	return min(len, end - pos);
+}
 
 /**
  * iomap_length - length of the current iomap iteration
···
  */
 static inline u64 iomap_length(const struct iomap_iter *iter)
 {
-	u64 end = iter->iomap.offset + iter->iomap.length;
+	return iomap_length_trim(iter, iter->pos, iter->len);
+}
 
-	if (iter->srcmap.type != IOMAP_HOLE)
-		end = min(end, iter->srcmap.offset + iter->srcmap.length);
-	return min(iter->len, end - iter->pos);
+/**
+ * iomap_iter_advance_full - advance by the full length of current map
+ */
+static inline int iomap_iter_advance_full(struct iomap_iter *iter)
+{
+	u64 length = iomap_length(iter);
+
+	return iomap_iter_advance(iter, &length);
 }
 
 /**
···
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
-		bool *did_zero, const struct iomap_ops *ops);
+		bool *did_zero, const struct iomap_ops *ops, void *private);
 int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-		const struct iomap_ops *ops);
-vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf,
-		const struct iomap_ops *ops);
-
+		const struct iomap_ops *ops, void *private);
+vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops,
+		void *private);
 typedef void (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length,
 		struct iomap *iomap);
 void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
···
 		const struct iomap_ops *ops);
 
 /*
+ * Flags for iomap_ioend->io_flags.
+ */
+/* shared COW extent */
+#define IOMAP_IOEND_SHARED		(1U << 0)
+/* unwritten extent */
+#define IOMAP_IOEND_UNWRITTEN		(1U << 1)
+/* don't merge into previous ioend */
+#define IOMAP_IOEND_BOUNDARY		(1U << 2)
+/* is direct I/O */
+#define IOMAP_IOEND_DIRECT		(1U << 3)
+
+/*
+ * Flags that if set on either ioend prevent the merge of two ioends.
+ * (IOMAP_IOEND_BOUNDARY also prevents merges, but only one-way)
+ */
+#define IOMAP_IOEND_NOMERGE_FLAGS \
+	(IOMAP_IOEND_SHARED | IOMAP_IOEND_UNWRITTEN | IOMAP_IOEND_DIRECT)
+
+/*
  * Structure for writeback I/O completions.
+ *
+ * File systems implementing ->submit_ioend (for buffered I/O) or ->submit_io
+ * for direct I/O) can split a bio generated by iomap. In that case the parent
+ * ioend it was split from is recorded in ioend->io_parent.
  */
 struct iomap_ioend {
 	struct list_head	io_list;	/* next ioend in chain */
-	u16			io_type;
-	u16			io_flags;	/* IOMAP_F_* */
+	u16			io_flags;	/* IOMAP_IOEND_* */
 	struct inode		*io_inode;	/* file being written to */
-	size_t			io_size;	/* size of data within eof */
+	size_t			io_size;	/* size of the extent */
+	atomic_t		io_remaining;	/* completetion defer count */
+	int			io_error;	/* stashed away status */
+	struct iomap_ioend	*io_parent;	/* parent for completions */
 	loff_t			io_offset;	/* offset in the file */
 	sector_t		io_sector;	/* start sector of ioend */
+	void			*io_private;	/* file system private data */
 	struct bio		io_bio;		/* MUST BE LAST! */
 };
 
···
 			loff_t offset, unsigned len);
 
 	/*
-	 * Optional, allows the file systems to perform actions just before
-	 * submitting the bio and/or override the bio end_io handler for complex
-	 * operations like copy on write extent manipulation or unwritten extent
-	 * conversions.
+	 * Optional, allows the file systems to hook into bio submission,
+	 * including overriding the bi_end_io handler.
+	 *
+	 * Returns 0 if the bio was successfully submitted, or a negative
+	 * error code if status was non-zero or another error happened and
+	 * the bio could not be submitted.
 	 */
-	int (*prepare_ioend)(struct iomap_ioend *ioend, int status);
+	int (*submit_ioend)(struct iomap_writepage_ctx *wpc, int status);
 
 	/*
 	 * Optional, allows the file system to discard state on a page where
···
 	u32			nr_folios;	/* folios added to the ioend */
 };
 
+struct iomap_ioend *iomap_init_ioend(struct inode *inode, struct bio *bio,
+		loff_t file_offset, u16 ioend_flags);
+struct iomap_ioend *iomap_split_ioend(struct iomap_ioend *ioend,
+		unsigned int max_len, bool is_append);
 void iomap_finish_ioends(struct iomap_ioend *ioend, int error);
 void iomap_ioend_try_merge(struct iomap_ioend *ioend,
 		struct list_head *more_ioends);
···
 #else
 # define iomap_swapfile_activate(sis, swapfile, pagespan, ops)	(-EIO)
 #endif /* CONFIG_SWAP */
+
+extern struct bio_set iomap_ioend_bioset;
 
 #endif /* LINUX_IOMAP_H */
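Reviewer note: iomap_length_trim() generalises the old iomap_length() body so that iomap_iter() can compute the length as seen from iter_start_pos rather than from the already advanced pos. The user-space sketch below models that trimming under assumptions: struct model_map and model_length_trim() are hypothetical names, and a boolean is_hole replaces the kernel's srcmap.type != IOMAP_HOLE test.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct iomap. */
struct model_map {
	uint64_t offset;	/* file offset where the mapping starts */
	uint64_t length;	/* length of the mapping in bytes */
	int is_hole;		/* nonzero models type == IOMAP_HOLE */
};

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * Trim len so that [pos, pos + len) ends no later than the destination
 * mapping and, when a source mapping is present, no later than that too.
 */
static uint64_t model_length_trim(const struct model_map *iomap,
				  const struct model_map *srcmap,
				  uint64_t pos, uint64_t len)
{
	uint64_t end = iomap->offset + iomap->length;

	if (!srcmap->is_hole)
		end = min_u64(end, srcmap->offset + srcmap->length);
	return min_u64(len, end - pos);
}
```

Calling the helper with the current position and remaining length corresponds to iomap_length(); calling it with the saved start position and the original length corresponds to the value passed to ->iomap_end().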