Merge tag 'vfs-6.11.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

+1

Documentation/filesystems/index.rst

··· 34 34 seq_file 35 35 sharedsubtree 36 36 idmappings 37 + iomap/index 37 38 38 39 automount-support 39 40

+441

Documentation/filesystems/iomap/design.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + .. _iomap_design: 3 + 4 + .. 5 + Dumb style notes to maintain the author's sanity: 6 + Please try to start sentences on separate lines so that 7 + sentence changes don't bleed colors in diff. 8 + Heading decorations are documented in sphinx.rst. 9 + 10 + ============== 11 + Library Design 12 + ============== 13 + 14 + .. contents:: Table of Contents 15 + :local: 16 + 17 + Introduction 18 + ============ 19 + 20 + iomap is a filesystem library for handling common file operations. 21 + The library has two layers: 22 + 23 + 1. A lower layer that provides an iterator over ranges of file offsets. 24 + This layer tries to obtain mappings of each file range to storage 25 + from the filesystem, but the storage information is not necessarily 26 + required. 27 + 28 + 2. An upper layer that acts upon the space mappings provided by the 29 + lower layer iterator. 30 + 31 + The iteration can involve mappings of a file's logical offset ranges to 32 + physical extents, but the storage layer information is not necessarily 33 + required, e.g. for walking cached file information. 34 + The library exports various APIs for implementing file operations such 35 + as: 36 + 37 + * Pagecache reads and writes 38 + * Folio write faults to the pagecache 39 + * Writeback of dirty folios 40 + * Direct I/O reads and writes 41 + * fsdax I/O reads, writes, loads, and stores 42 + * FIEMAP 43 + * lseek ``SEEK_DATA`` and ``SEEK_HOLE`` 44 + * swapfile activation 45 + 46 + The origin of this library is the file I/O path that XFS once used; it 47 + has now been extended to cover several other operations. 48 + 49 + Who Should Read This? 50 + ===================== 51 + 52 + The target audience for this document is filesystem, storage, and 53 + pagecache programmers and code reviewers. 54 + 55 + If you are working on PCI, machine architectures, or device drivers, you 56 + are most likely in the wrong place. 57 + 58 + How Is This Better? 
59 + =================== 60 + 61 + Unlike the classic Linux I/O model which breaks file I/O into small 62 + units (generally memory pages or blocks) and looks up space mappings on 63 + the basis of that unit, the iomap model asks the filesystem for the 64 + largest space mappings that it can create for a given file operation and 65 + initiates operations on that basis. 66 + This strategy improves the filesystem's visibility into the size of the 67 + operation being performed, which enables it to combat fragmentation with 68 + larger space allocations when possible. 69 + Larger space mappings improve runtime performance by amortizing the cost 70 + of mapping function calls into the filesystem across a larger amount of 71 + data. 72 + 73 + At a high level, an iomap operation `looks like this 74 + <https://lore.kernel.org/all/ZGbVaewzcCysclPt@dread.disaster.area/>`_: 75 + 76 + 1. For each byte in the operation range... 77 + 78 + 1. Obtain a space mapping via ``->iomap_begin`` 79 + 80 + 2. For each sub-unit of work... 81 + 82 + 1. Revalidate the mapping and go back to (1) above, if necessary. 83 + So far only the pagecache operations need to do this. 84 + 85 + 2. Do the work 86 + 87 + 3. Increment operation cursor 88 + 89 + 4. Release the mapping via ``->iomap_end``, if necessary 90 + 91 + Each iomap operation will be covered in more detail below. 92 + This library was covered previously by an `LWN article 93 + <https://lwn.net/Articles/935934/>`_ and a `KernelNewbies page 94 + <https://kernelnewbies.org/KernelProjects/iomap>`_. 95 + 96 + The goal of this document is to provide a brief discussion of the 97 + design and capabilities of iomap, followed by a more detailed catalog 98 + of the interfaces presented by iomap. 99 + If you change iomap, please update this design document. 100 + 101 + File Range Iterator 102 + =================== 103 + 104 + Definitions 105 + ----------- 106 + 107 + * **buffer head**: Shattered remnants of the old buffer cache. 
108 + 109 + * ``fsblock``: The block size of a file, also known as ``i_blocksize``. 110 + 111 + * ``i_rwsem``: The VFS ``struct inode`` rwsemaphore. 112 + Processes hold this in shared mode to read file state and contents. 113 + Some filesystems may allow shared mode for writes. 114 + Processes often hold this in exclusive mode to change file state and 115 + contents. 116 + 117 + * ``invalidate_lock``: The pagecache ``struct address_space`` 118 + rwsemaphore that protects against folio insertion and removal for 119 + filesystems that support punching out folios below EOF. 120 + Processes wishing to insert folios must hold this lock in shared 121 + mode to prevent removal, though concurrent insertion is allowed. 122 + Processes wishing to remove folios must hold this lock in exclusive 123 + mode to prevent insertions. 124 + Concurrent removals are not allowed. 125 + 126 + * ``dax_read_lock``: The RCU read lock that dax takes to prevent a 127 + device pre-shutdown hook from returning before other threads have 128 + released resources. 129 + 130 + * **filesystem mapping lock**: This synchronization primitive is 131 + internal to the filesystem and must protect the file mapping data 132 + from updates while a mapping is being sampled. 133 + The filesystem author must determine how this coordination should 134 + happen; it does not need to be an actual lock. 135 + 136 + * **iomap internal operation lock**: This is a general term for 137 + synchronization primitives that iomap functions take while holding a 138 + mapping. 139 + A specific example would be taking the folio lock while reading or 140 + writing the pagecache. 141 + 142 + * **pure overwrite**: A write operation that does not require any 143 + metadata or zeroing operations to perform during either submission 144 + or completion. 
145 + This implies that the filesystem must have already allocated space 146 + on disk as ``IOMAP_MAPPED`` and the filesystem must not place any 147 + constraints on IO alignment or size. 148 + The only constraints on I/O alignment are device level (minimum I/O 149 + size and alignment, typically sector size). 150 + 151 + ``struct iomap`` 152 + ---------------- 153 + 154 + The filesystem communicates to the iomap iterator the mapping of 155 + byte ranges of a file to byte ranges of a storage device with the 156 + structure below: 157 + 158 + .. code-block:: c 159 + 160 + struct iomap { 161 + u64 addr; 162 + loff_t offset; 163 + u64 length; 164 + u16 type; 165 + u16 flags; 166 + struct block_device *bdev; 167 + struct dax_device *dax_dev; 168 + void *inline_data; 169 + void *private; 170 + const struct iomap_folio_ops *folio_ops; 171 + u64 validity_cookie; 172 + }; 173 + 174 + The fields are as follows: 175 + 176 + * ``offset`` and ``length`` describe the range of file offsets, in 177 + bytes, covered by this mapping. 178 + These fields must always be set by the filesystem. 179 + 180 + * ``type`` describes the type of the space mapping: 181 + 182 + * **IOMAP_HOLE**: No storage has been allocated. 183 + This type must never be returned in response to an ``IOMAP_WRITE`` 184 + operation because writes must allocate and map space, and return 185 + the mapping. 186 + The ``addr`` field must be set to ``IOMAP_NULL_ADDR``. 187 + iomap does not support writing (whether via pagecache or direct 188 + I/O) to a hole. 189 + 190 + * **IOMAP_DELALLOC**: A promise to allocate space at a later time 191 + ("delayed allocation"). 192 + If the filesystem returns IOMAP_F_NEW here and the write fails, the 193 + ``->iomap_end`` function must delete the reservation. 194 + The ``addr`` field must be set to ``IOMAP_NULL_ADDR``. 195 + 196 + * **IOMAP_MAPPED**: The file range maps to specific space on the 197 + storage device. 198 + The device is returned in ``bdev`` or ``dax_dev``. 
199 + The device address, in bytes, is returned via ``addr``. 200 + 201 + * **IOMAP_UNWRITTEN**: The file range maps to specific space on the 202 + storage device, but the space has not yet been initialized. 203 + The device is returned in ``bdev`` or ``dax_dev``. 204 + The device address, in bytes, is returned via ``addr``. 205 + Reads from this type of mapping will return zeroes to the caller. 206 + For a write or writeback operation, the ioend should update the 207 + mapping to MAPPED. 208 + Refer to the sections about ioends for more details. 209 + 210 + * **IOMAP_INLINE**: The file range maps to the memory buffer 211 + specified by ``inline_data``. 212 + For write operations, the ``->iomap_end`` function presumably 213 + handles persisting the data. 214 + The ``addr`` field must be set to ``IOMAP_NULL_ADDR``. 215 + 216 + * ``flags`` describe the status of the space mapping. 217 + These flags should be set by the filesystem in ``->iomap_begin``: 218 + 219 + * **IOMAP_F_NEW**: The space under the mapping is newly allocated. 220 + Areas that will not be written to must be zeroed. 221 + If a write fails and the mapping is a space reservation, the 222 + reservation must be deleted. 223 + 224 + * **IOMAP_F_DIRTY**: The inode will have uncommitted metadata needed 225 + to access any data written. 226 + fdatasync is required to commit these changes to persistent 227 + storage. 228 + This needs to take into account metadata changes that *may* be made 229 + at I/O completion, such as file size updates from direct I/O. 230 + 231 + * **IOMAP_F_SHARED**: The space under the mapping is shared. 232 + Copy on write is necessary to avoid corrupting other file data. 233 + 234 + * **IOMAP_F_BUFFER_HEAD**: This mapping requires the use of buffer 235 + heads for pagecache operations. 236 + Do not add more uses of this. 237 + 238 + * **IOMAP_F_MERGED**: Multiple contiguous block mappings were 239 + coalesced into this single mapping. 240 + This is only useful for FIEMAP. 
241 + 242 + * **IOMAP_F_XATTR**: The mapping is for extended attribute data, not 243 + regular file data. 244 + This is only useful for FIEMAP. 245 + 246 + * **IOMAP_F_PRIVATE**: Starting with this value, the upper bits can 247 + be set by the filesystem for its own purposes. 248 + 249 + These flags can be set by iomap itself during file operations. 250 + The filesystem should supply an ``->iomap_end`` function if it needs 251 + to observe these flags: 252 + 253 + * **IOMAP_F_SIZE_CHANGED**: The file size has changed as a result of 254 + using this mapping. 255 + 256 + * **IOMAP_F_STALE**: The mapping was found to be stale. 257 + iomap will call ``->iomap_end`` on this mapping and then 258 + ``->iomap_begin`` to obtain a new mapping. 259 + 260 + Currently, these flags are only set by pagecache operations. 261 + 262 + * ``addr`` describes the device address, in bytes. 263 + 264 + * ``bdev`` describes the block device for this mapping. 265 + This only needs to be set for mapped or unwritten operations. 266 + 267 + * ``dax_dev`` describes the DAX device for this mapping. 268 + This only needs to be set for mapped or unwritten operations, and 269 + only for a fsdax operation. 270 + 271 + * ``inline_data`` points to a memory buffer for I/O involving 272 + ``IOMAP_INLINE`` mappings. 273 + This value is ignored for all other mapping types. 274 + 275 + * ``private`` is a pointer to `filesystem-private information 276 + <https://lore.kernel.org/all/20180619164137.13720-7-hch@lst.de/>`_. 277 + This value will be passed unchanged to ``->iomap_end``. 278 + 279 + * ``folio_ops`` will be covered in the section on pagecache operations. 280 + 281 + * ``validity_cookie`` is a magic freshness value set by the filesystem 282 + that should be used to detect stale mappings. 283 + For pagecache operations this is critical for correct operation 284 + because page faults can occur, which implies that filesystem locks 285 + should not be held between ``->iomap_begin`` and ``->iomap_end``. 
286 + Filesystems with completely static mappings need not set this value. 287 + Only pagecache operations revalidate mappings; see the section about 288 + ``iomap_valid`` for details. 289 + 290 + ``struct iomap_ops`` 291 + -------------------- 292 + 293 + Every iomap function requires the filesystem to pass an operations 294 + structure to obtain a mapping and (optionally) to release the mapping: 295 + 296 + .. code-block:: c 297 + 298 + struct iomap_ops { 299 + int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length, 300 + unsigned flags, struct iomap *iomap, 301 + struct iomap *srcmap); 302 + 303 + int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length, 304 + ssize_t written, unsigned flags, 305 + struct iomap *iomap); 306 + }; 307 + 308 + ``->iomap_begin`` 309 + ~~~~~~~~~~~~~~~~~ 310 + 311 + iomap operations call ``->iomap_begin`` to obtain one file mapping for 312 + the range of bytes specified by ``pos`` and ``length`` for the file 313 + ``inode``. 314 + This mapping should be returned through the ``iomap`` pointer. 315 + The mapping must cover at least the first byte of the supplied file 316 + range, but it does not need to cover the entire requested range. 317 + 318 + Each iomap operation describes the requested operation through the 319 + ``flags`` argument. 320 + The exact value of ``flags`` will be documented in the 321 + operation-specific sections below. 322 + These flags can, at least in principle, apply generally to iomap 323 + operations: 324 + 325 + * ``IOMAP_DIRECT`` is set when the caller wishes to issue file I/O to 326 + block storage. 327 + 328 + * ``IOMAP_DAX`` is set when the caller wishes to issue file I/O to 329 + memory-like storage. 330 + 331 + * ``IOMAP_NOWAIT`` is set when the caller wishes to perform a best 332 + effort attempt to avoid any operation that would result in blocking 333 + the submitting task. 
334 + This is similar in intent to ``O_NONBLOCK`` for network APIs - it is 335 + intended for asynchronous applications to keep doing other work 336 + instead of waiting for the specific unavailable filesystem resource 337 + to become available. 338 + Filesystems implementing ``IOMAP_NOWAIT`` semantics need to use 339 + trylock algorithms. 340 + They need to be able to satisfy the entire I/O request range with a 341 + single iomap mapping. 342 + They need to avoid reading or writing metadata synchronously. 343 + They need to avoid blocking memory allocations. 344 + They need to avoid waiting on transaction reservations to allow 345 + modifications to take place. 346 + They probably should not be allocating new space. 347 + And so on. 348 + If there is any doubt in the filesystem developer's mind as to 349 + whether any specific ``IOMAP_NOWAIT`` operation may end up blocking, 350 + then they should return ``-EAGAIN`` as early as possible rather than 351 + start the operation and force the submitting task to block. 352 + ``IOMAP_NOWAIT`` is often set on behalf of ``IOCB_NOWAIT`` or 353 + ``RWF_NOWAIT``. 354 + 355 + If it is necessary to read existing file contents from a `different 356 + <https://lore.kernel.org/all/20191008071527.29304-9-hch@lst.de/>`_ 357 + device or address range on a device, the filesystem should return that 358 + information via ``srcmap``. 359 + Only pagecache and fsdax operations support reading from one mapping and 360 + writing to another. 361 + 362 + ``->iomap_end`` 363 + ~~~~~~~~~~~~~~~ 364 + 365 + After the operation completes, the ``->iomap_end`` function, if present, 366 + is called to signal that iomap is finished with a mapping. 367 + Typically, implementations will use this function to tear down any 368 + context that was set up in ``->iomap_begin``. 369 + For example, a write might wish to commit the reservations for the bytes 370 + that were operated upon and unreserve any space that was not operated 371 + upon. 
372 + ``written`` might be zero if no bytes were touched. 373 + ``flags`` will contain the same value passed to ``->iomap_begin``. 374 + iomap ops for reads are not likely to need to supply this function. 375 + 376 + Both functions should return a negative errno code on error, or zero on 377 + success. 378 + 379 + Preparing for File Operations 380 + ============================= 381 + 382 + iomap only handles mapping and I/O. 383 + Filesystems must still call out to the VFS to check input parameters 384 + and file state before initiating an I/O operation. 385 + It does not handle obtaining filesystem freeze protection, updating of 386 + timestamps, stripping privileges, or access control. 387 + 388 + Locking Hierarchy 389 + ================= 390 + 391 + iomap requires that filesystems supply their own locking model. 392 + There are three categories of synchronization primitives, as far as 393 + iomap is concerned: 394 + 395 + * The **upper** level primitive is provided by the filesystem to 396 + coordinate access to different iomap operations. 397 + The exact primitive is specific to the filesystem and operation, 398 + but is often a VFS inode, pagecache invalidation, or folio lock. 399 + For example, a filesystem might take ``i_rwsem`` before calling 400 + ``iomap_file_buffered_write`` and ``iomap_file_unshare`` to prevent 401 + these two file operations from clobbering each other. 402 + Pagecache writeback may lock a folio to prevent other threads from 403 + accessing the folio until writeback is underway. 404 + 405 + * The **lower** level primitive is taken by the filesystem in the 406 + ``->iomap_begin`` and ``->iomap_end`` functions to coordinate 407 + access to the file space mapping information. 408 + The fields of the iomap object should be filled out while holding 409 + this primitive. 410 + The upper level synchronization primitive, if any, remains held 411 + while acquiring the lower level synchronization primitive. 
412 + For example, XFS takes ``ILOCK_EXCL`` and ext4 takes ``i_data_sem`` 413 + while sampling mappings. 414 + Filesystems with immutable mapping information may not require 415 + synchronization here. 416 + 417 + * The **operation** primitive is taken by an iomap operation to 418 + coordinate access to its own internal data structures. 419 + The upper level synchronization primitive, if any, remains held 420 + while acquiring this primitive. 421 + The lower level primitive is not held while acquiring this 422 + primitive. 423 + For example, pagecache write operations will obtain a file mapping, 424 + then grab and lock a folio to copy new contents. 425 + It may also lock an internal folio state object to update metadata. 426 + 427 + The exact locking requirements are specific to the filesystem; for 428 + certain operations, some of these locks can be elided. 429 + All further mentions of locking are *recommendations*, not mandates. 430 + Each filesystem author must figure out the locking for themself. 431 + 432 + Bugs and Limitations 433 + ==================== 434 + 435 + * No support for fscrypt. 436 + * No support for compression. 437 + * No support for fsverity yet. 438 + * Strong assumptions that IO should work the way it does on XFS. 439 + * Does iomap *actually* work for non-regular file data? 440 + 441 + Patches welcome!
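Editorial illustration (not part of the patch): the high-level iteration outlined under "How Is This Better?" in design.rst can be sketched in userspace C. This is only a toy under stated assumptions. ``struct toy_iomap``, ``toy_extents``, ``toy_iomap_begin``, and ``toy_walk`` are hypothetical names invented here and are not kernel APIs. The toy has no ``->iomap_end``, no locking, and no revalidation. It only shows that the walker asks for one mapping per extent rather than one per block, and that a mapping need only cover the first byte of the requested range.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; these are NOT the kernel definitions. */
enum toy_type { TOY_HOLE, TOY_MAPPED };

struct toy_iomap {
	uint64_t addr;		/* device address in bytes (ignored for holes) */
	int64_t offset;		/* file offset of the mapping, in bytes */
	uint64_t length;	/* length of the mapping, in bytes */
	enum toy_type type;
};

/* Toy extent map: [0, 8192) mapped, [8192, 12288) hole, [12288, 65536) mapped. */
static const struct toy_iomap toy_extents[] = {
	{ 1048576, 0, 8192, TOY_MAPPED },
	{ 0, 8192, 4096, TOY_HOLE },
	{ 2097152, 12288, 53248, TOY_MAPPED },
};

/* Return the whole extent containing pos, or a negative errno past EOF. */
static int toy_iomap_begin(int64_t pos, uint64_t length, struct toy_iomap *out)
{
	(void)length;	/* the toy always returns the full extent */
	for (size_t i = 0; i < sizeof(toy_extents) / sizeof(toy_extents[0]); i++) {
		const struct toy_iomap *e = &toy_extents[i];

		if (pos >= e->offset && (uint64_t)(pos - e->offset) < e->length) {
			*out = *e;
			return 0;
		}
	}
	return -ERANGE;
}

/* Walk [pos, pos + length): one mapping call per extent, not per block. */
static int toy_walk(int64_t pos, uint64_t length, unsigned *nr_mappings,
		    uint64_t *mapped_bytes)
{
	*nr_mappings = 0;
	*mapped_bytes = 0;

	while (length > 0) {
		struct toy_iomap map;
		int ret = toy_iomap_begin(pos, length, &map);

		if (ret)
			return ret;

		/* The mapping must cover at least the first byte requested. */
		assert(map.offset <= pos && pos < map.offset + (int64_t)map.length);

		uint64_t avail = map.length - (uint64_t)(pos - map.offset);
		uint64_t done = avail < length ? avail : length;

		if (map.type == TOY_MAPPED)
			*mapped_bytes += done;	/* "do the work" */
		(*nr_mappings)++;

		pos += done;		/* advance the operation cursor */
		length -= done;
	}
	return 0;
}
```

Walking 16 KiB starting at offset 4096 crosses all three toy extents, so the mapping function runs three times, where a per-4-KiB-block lookup would run four times.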

+13

Documentation/filesystems/iomap/index.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ======================= 4 + VFS iomap Documentation 5 + ======================= 6 + 7 + .. toctree:: 8 + :maxdepth: 2 9 + :numbered: 10 + 11 + design 12 + operations 13 + porting

+713

Documentation/filesystems/iomap/operations.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + .. _iomap_operations: 3 + 4 + .. 5 + Dumb style notes to maintain the author's sanity: 6 + Please try to start sentences on separate lines so that 7 + sentence changes don't bleed colors in diff. 8 + Heading decorations are documented in sphinx.rst. 9 + 10 + ========================= 11 + Supported File Operations 12 + ========================= 13 + 14 + .. contents:: Table of Contents 15 + :local: 16 + 17 + Below is a discussion of the high level file operations that iomap 18 + implements. 19 + 20 + Buffered I/O 21 + ============ 22 + 23 + Buffered I/O is the default file I/O path in Linux. 24 + File contents are cached in memory ("pagecache") to satisfy reads and 25 + writes. 26 + Dirty cache will be written back to disk at some point that can be 27 + forced via ``fsync`` and variants. 28 + 29 + iomap implements nearly all the folio and pagecache management that 30 + filesystems have to implement themselves under the legacy I/O model. 31 + This means that the filesystem need not know the details of allocating, 32 + mapping, managing uptodate and dirty state, or writeback of pagecache 33 + folios. 34 + Under the legacy I/O model, this was managed very inefficiently with 35 + linked lists of buffer heads instead of the per-folio bitmaps that iomap 36 + uses. 37 + Unless the filesystem explicitly opts in to buffer heads, they will not 38 + be used, which makes buffered I/O much more efficient, and the pagecache 39 + maintainer much happier. 
40 + 41 + ``struct address_space_operations`` 42 + ----------------------------------- 43 + 44 + The following iomap functions can be referenced directly from the 45 + address space operations structure: 46 + 47 + * ``iomap_dirty_folio`` 48 + * ``iomap_release_folio`` 49 + * ``iomap_invalidate_folio`` 50 + * ``iomap_is_partially_uptodate`` 51 + 52 + The following address space operations can be wrapped easily: 53 + 54 + * ``read_folio`` 55 + * ``readahead`` 56 + * ``writepages`` 57 + * ``bmap`` 58 + * ``swap_activate`` 59 + 60 + ``struct iomap_folio_ops`` 61 + -------------------------- 62 + 63 + The ``->iomap_begin`` function for pagecache operations may set the 64 + ``struct iomap::folio_ops`` field to an ops structure to override 65 + default behaviors of iomap: 66 + 67 + .. code-block:: c 68 + 69 + struct iomap_folio_ops { 70 + struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos, 71 + unsigned len); 72 + void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied, 73 + struct folio *folio); 74 + bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap); 75 + }; 76 + 77 + iomap calls these functions: 78 + 79 + - ``get_folio``: Called to allocate and return an active reference to 80 + a locked folio prior to starting a write. 81 + If this function is not provided, iomap will call 82 + ``iomap_get_folio``. 83 + This could be used to `set up per-folio filesystem state 84 + <https://lore.kernel.org/all/20190429220934.10415-5-agruenba@redhat.com/>`_ 85 + for a write. 86 + 87 + - ``put_folio``: Called to unlock and put a folio after a pagecache 88 + operation completes. 89 + If this function is not provided, iomap will ``folio_unlock`` and 90 + ``folio_put`` on its own. 91 + This could be used to `commit per-folio filesystem state 92 + <https://lore.kernel.org/all/20180619164137.13720-6-hch@lst.de/>`_ 93 + that was set up by ``->get_folio``. 
94 + 95 + - ``iomap_valid``: The filesystem may not hold locks between 96 + ``->iomap_begin`` and ``->iomap_end`` because pagecache operations 97 + can take folio locks, fault on userspace pages, initiate writeback 98 + for memory reclamation, or engage in other time-consuming actions. 99 + If a file's space mapping data are mutable, it is possible that the 100 + mapping for a particular pagecache folio can `change in the time it 101 + takes 102 + <https://lore.kernel.org/all/20221123055812.747923-8-david@fromorbit.com/>`_ 103 + to allocate, install, and lock that folio. 104 + 105 + For the pagecache, races can happen if writeback doesn't take 106 + ``i_rwsem`` or ``invalidate_lock`` and updates mapping information. 107 + Races can also happen if the filesystem allows concurrent writes. 108 + For such files, the mapping *must* be revalidated after the folio 109 + lock has been taken so that iomap can manage the folio correctly. 110 + 111 + fsdax does not need this revalidation because there's no writeback 112 + and no support for unwritten extents. 113 + 114 + Filesystems subject to this kind of race must provide a 115 + ``->iomap_valid`` function to decide if the mapping is still valid. 116 + If the mapping is not valid, the mapping will be sampled again. 117 + 118 + To support making the validity decision, the filesystem's 119 + ``->iomap_begin`` function may set ``struct iomap::validity_cookie`` 120 + at the same time that it populates the other iomap fields. 121 + A simple validation cookie implementation is a sequence counter. 122 + If the filesystem bumps the sequence counter every time it modifies 123 + the inode's extent map, it can be placed in the ``struct 124 + iomap::validity_cookie`` during ``->iomap_begin``. 125 + If the value in the cookie is found to be different to the value 126 + the filesystem holds when the mapping is passed back to 127 + ``->iomap_valid``, then the iomap should be considered stale and the 128 + validation failed. 
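Editorial illustration (not part of the patch): the sequence-counter cookie described for ``->iomap_valid`` can be sketched as a small userspace C toy. All names here (``toy_inode``, ``toy_mapping``, ``toy_sample_mapping``, ``toy_modify_extent_map``, ``toy_mapping_valid``) are hypothetical and do not exist in the kernel. A real filesystem would sample the counter under its mapping lock and use whatever atomicity its locking model requires, which this single-threaded sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-inode state: bumped on every extent map change. */
struct toy_inode {
	uint64_t extent_seq;
};

/* Hypothetical mapping carrying only the freshness cookie. */
struct toy_mapping {
	uint64_t validity_cookie;
};

/* Any change to the extent map invalidates previously sampled mappings. */
static void toy_modify_extent_map(struct toy_inode *ip)
{
	ip->extent_seq++;
}

/* What an ->iomap_begin analogue would do while filling out the mapping. */
static void toy_sample_mapping(const struct toy_inode *ip,
			       struct toy_mapping *map)
{
	map->validity_cookie = ip->extent_seq;
}

/* What an ->iomap_valid analogue would do once the folio is locked. */
static bool toy_mapping_valid(const struct toy_inode *ip,
			      const struct toy_mapping *map)
{
	return map->validity_cookie == ip->extent_seq;
}

/* Usage: a mapping goes stale as soon as the extent map changes. */
static void toy_cookie_example(void)
{
	struct toy_inode ip = { 0 };
	struct toy_mapping map;

	toy_sample_mapping(&ip, &map);
	assert(toy_mapping_valid(&ip, &map));

	toy_modify_extent_map(&ip);	/* e.g. writeback converted an extent */
	assert(!toy_mapping_valid(&ip, &map));
}
```

A plain equality check keeps the validity callback cheap. The cost is spurious invalidation when an unrelated part of the extent map changes.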
129 + 130 + These ``struct kiocb`` flags are significant for buffered I/O with iomap: 131 + 132 + * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``. 133 + 134 + Internal per-Folio State 135 + ------------------------ 136 + 137 + If the fsblock size matches the size of a pagecache folio, it is assumed 138 + that all disk I/O operations will operate on the entire folio. 139 + The uptodate (memory contents are at least as new as what's on disk) and 140 + dirty (memory contents are newer than what's on disk) status of the 141 + folio are all that's needed for this case. 142 + 143 + If the fsblock size is less than the size of a pagecache folio, iomap 144 + tracks the per-fsblock uptodate and dirty state itself. 145 + This enables iomap to handle both "bs < ps" `filesystems 146 + <https://lore.kernel.org/all/20230725122932.144426-1-ritesh.list@gmail.com/>`_ 147 + and large folios in the pagecache. 148 + 149 + iomap internally tracks two state bits per fsblock: 150 + 151 + * ``uptodate``: iomap will try to keep folios fully up to date. 152 + If there are read(ahead) errors, those fsblocks will not be marked 153 + uptodate. 154 + The folio itself will be marked uptodate when all fsblocks within the 155 + folio are uptodate. 156 + 157 + * ``dirty``: iomap will set the per-block dirty state when programs 158 + write to the file. 159 + The folio itself will be marked dirty when any fsblock within the 160 + folio is dirty. 161 + 162 + iomap also tracks the amount of read and write disk IOs that are in 163 + flight. 164 + This structure is much lighter weight than ``struct buffer_head`` 165 + because there is only one per folio, and the per-fsblock overhead is two 166 + bits vs. 104 bytes. 167 + 168 + Filesystems wishing to turn on large folios in the pagecache should call 169 + ``mapping_set_large_folios`` when initializing the incore inode. 
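Editorial illustration (not part of the patch): the two per-fsblock state bits can be modeled in userspace C as follows. The layout and names (``toy_folio_state``, ``toy_range_mask``, and the setters) are assumptions made for this sketch and are not iomap's actual internal structure. The toy is limited to 64 blocks per folio and omits the in-flight I/O counters and locking. It shows only the two aggregation rules from the text: a folio is uptodate when *all* of its fsblocks are uptodate, and dirty when *any* fsblock is dirty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-folio state: one bit per fsblock in each word. */
struct toy_folio_state {
	unsigned nr_blocks;	/* fsblocks per folio, at most 64 in this toy */
	uint64_t uptodate;	/* bit n set: block n is at least as new as disk */
	uint64_t dirty;		/* bit n set: block n is newer than disk */
};

/* Build a mask of count bits starting at bit first. */
static uint64_t toy_range_mask(unsigned first, unsigned count)
{
	if (count == 0)
		return 0;
	if (count >= 64)
		return ~UINT64_C(0);
	return ((UINT64_C(1) << count) - 1) << first;
}

static void toy_set_uptodate(struct toy_folio_state *s, unsigned first,
			     unsigned count)
{
	assert(s->nr_blocks <= 64 && first + count <= s->nr_blocks);
	s->uptodate |= toy_range_mask(first, count);
}

static void toy_set_dirty(struct toy_folio_state *s, unsigned first,
			  unsigned count)
{
	assert(s->nr_blocks <= 64 && first + count <= s->nr_blocks);
	s->dirty |= toy_range_mask(first, count);
}

/* The folio is uptodate only when every fsblock is uptodate. */
static bool toy_folio_uptodate(const struct toy_folio_state *s)
{
	return s->uptodate == toy_range_mask(0, s->nr_blocks);
}

/* The folio is dirty as soon as any fsblock is dirty. */
static bool toy_folio_dirty(const struct toy_folio_state *s)
{
	return s->dirty != 0;
}
```

For a 64 KiB folio with 4 KiB fsblocks (16 blocks), reading only the first half leaves the folio not uptodate, while a write to a single block is enough to make it dirty.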
170 + 171 + Buffered Readahead and Reads 172 + ---------------------------- 173 + 174 + The ``iomap_readahead`` function initiates readahead to the pagecache. 175 + The ``iomap_read_folio`` function reads one folio's worth of data into 176 + the pagecache. 177 + The ``flags`` argument to ``->iomap_begin`` will be set to zero. 178 + The pagecache takes whatever locks it needs before calling the 179 + filesystem. 180 + 181 + Buffered Writes 182 + --------------- 183 + 184 + The ``iomap_file_buffered_write`` function writes an ``iocb`` to the 185 + pagecache. 186 + ``IOMAP_WRITE`` or ``IOMAP_WRITE`` | ``IOMAP_NOWAIT`` will be passed as 187 + the ``flags`` argument to ``->iomap_begin``. 188 + Callers commonly take ``i_rwsem`` in either shared or exclusive mode 189 + before calling this function. 190 + 191 + mmap Write Faults 192 + ~~~~~~~~~~~~~~~~~ 193 + 194 + The ``iomap_page_mkwrite`` function handles a write fault to a folio in 195 + the pagecache. 196 + ``IOMAP_WRITE | IOMAP_FAULT`` will be passed as the ``flags`` argument 197 + to ``->iomap_begin``. 198 + Callers commonly take the mmap ``invalidate_lock`` in shared or 199 + exclusive mode before calling this function. 200 + 201 + Buffered Write Failures 202 + ~~~~~~~~~~~~~~~~~~~~~~~ 203 + 204 + After a short write to the pagecache, the areas not written will not 205 + become marked dirty. 206 + The filesystem must arrange to `cancel 207 + <https://lore.kernel.org/all/20221123055812.747923-6-david@fromorbit.com/>`_ 208 + such `reservations 209 + <https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/>`_ 210 + because writeback will not consume the reservation. 211 + The ``iomap_file_buffered_write_punch_delalloc`` can be called from a 212 + ``->iomap_end`` function to find all the clean areas of the folios 213 + caching a fresh (``IOMAP_F_NEW``) delalloc mapping. 214 + It takes the ``invalidate_lock``. 
215 + 216 + The filesystem must supply a function ``punch`` to be called for 217 + each file range in this state. 218 + This function must *only* remove delayed allocation reservations, in 219 + case another thread racing with the current thread writes successfully 220 + to the same region and triggers writeback to flush the dirty data out to 221 + disk. 222 + 223 + Zeroing for File Operations 224 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~ 225 + 226 + Filesystems can call ``iomap_zero_range`` to perform zeroing of the 227 + pagecache for non-truncation file operations that are not aligned to 228 + the fsblock size. 229 + ``IOMAP_ZERO`` will be passed as the ``flags`` argument to 230 + ``->iomap_begin``. 231 + Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive 232 + mode before calling this function. 233 + 234 + Unsharing Reflinked File Data 235 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 236 + 237 + Filesystems can call ``iomap_file_unshare`` to force a file sharing 238 + storage with another file to preemptively copy the shared data to newly 239 + allocated storage. 240 + ``IOMAP_WRITE | IOMAP_UNSHARE`` will be passed as the ``flags`` argument 241 + to ``->iomap_begin``. 242 + Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive 243 + mode before calling this function. 244 + 245 + Truncation 246 + ---------- 247 + 248 + Filesystems can call ``iomap_truncate_page`` to zero the bytes in the 249 + pagecache from EOF to the end of the fsblock during a file truncation 250 + operation. 251 + ``truncate_setsize`` or ``truncate_pagecache`` will take care of 252 + everything after the EOF block. 253 + ``IOMAP_ZERO`` will be passed as the ``flags`` argument to 254 + ``->iomap_begin``. 255 + Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive 256 + mode before calling this function. 
+
+Pagecache Writeback
+-------------------
+
+Filesystems can call ``iomap_writepages`` to respond to a request to
+write dirty pagecache folios to disk.
+The ``mapping`` and ``wbc`` parameters should be passed unchanged.
+The ``wpc`` pointer should be allocated by the filesystem and must
+be initialized to zero.
+
+The pagecache will lock each folio before trying to schedule it for
+writeback.
+It does not lock ``i_rwsem`` or ``invalidate_lock``.
+
+The dirty bit will be cleared for all folios run through the
+``->map_blocks`` machinery described below even if the writeback fails.
+This is to prevent dirty folio clots when storage devices fail; an
+``-EIO`` is recorded for userspace to collect via ``fsync``.
+
+The ``ops`` structure must be specified and is as follows:
+
+``struct iomap_writeback_ops``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ struct iomap_writeback_ops {
+     int (*map_blocks)(struct iomap_writepage_ctx *wpc, struct inode *inode,
+                       loff_t offset, unsigned len);
+     int (*prepare_ioend)(struct iomap_ioend *ioend, int status);
+     void (*discard_folio)(struct folio *folio, loff_t pos);
+ };
+
+The fields are as follows:
+
+  - ``map_blocks``: Sets ``wpc->iomap`` to the space mapping of the file
+    range (in bytes) given by ``offset`` and ``len``.
+    iomap calls this function for each dirty fsblock in each dirty folio,
+    though it will `reuse mappings
+    <https://lore.kernel.org/all/20231207072710.176093-15-hch@lst.de/>`_
+    for runs of contiguous dirty fsblocks within a folio.
+    Do not return ``IOMAP_INLINE`` mappings here; the ``->iomap_end``
+    function must deal with persisting written data.
+    Do not return ``IOMAP_DELALLOC`` mappings here; iomap currently
+    requires mapping to allocated space.
+    Filesystems can skip a potentially expensive mapping lookup if the
+    mappings have not changed.
+    This revalidation must be open-coded by the filesystem; it is
+    unclear if ``iomap::validity_cookie`` can be reused for this
+    purpose.
+    This function must be supplied by the filesystem.
+
+  - ``prepare_ioend``: Enables filesystems to transform the writeback
+    ioend or perform any other preparatory work before the writeback I/O
+    is submitted.
+    This might include pre-write space accounting updates, or installing
+    a custom ``->bi_end_io`` function for internal purposes, such as
+    deferring the ioend completion to a workqueue to run metadata update
+    transactions from process context.
+    This function is optional.
+
+  - ``discard_folio``: iomap calls this function after ``->map_blocks``
+    fails to schedule I/O for any part of a dirty folio.
+    The function should throw away any reservations that may have been
+    made for the write.
+    The folio will be marked clean and an ``-EIO`` recorded in the
+    pagecache.
+    Filesystems can use this callback to `remove
+    <https://lore.kernel.org/all/20201029163313.1766967-1-bfoster@redhat.com/>`_
+    delalloc reservations to avoid having delalloc reservations for
+    clean pagecache.
+    This function is optional.
+
+Pagecache Writeback Completion
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To handle the bookkeeping that must happen after disk I/O for writeback
+completes, iomap creates chains of ``struct iomap_ioend`` objects that
+wrap the ``bio`` that is used to write pagecache data to disk.
+By default, iomap finishes writeback ioends by clearing the writeback
+bit on the folios attached to the ``ioend``.
+If the write failed, it will also set the error bits on the folios and
+the address space.
+This can happen in interrupt or process context, depending on the
+storage device.
+
+Filesystems that need to update internal bookkeeping (e.g. unwritten
+extent conversions) should provide a ``->prepare_ioend`` function to
+set ``struct iomap_ioend::bio::bi_end_io`` to its own function.
+This function should call ``iomap_finish_ioends`` after finishing its
+own work (e.g. unwritten extent conversion).
+
+Some filesystems may wish to `amortize the cost of running metadata
+transactions
+<https://lore.kernel.org/all/20220120034733.221737-1-david@fromorbit.com/>`_
+for post-writeback updates by batching them.
+They may also require transactions to run from process context, which
+implies punting batches to a workqueue.
+iomap ioends contain a ``list_head`` to enable batching.
+
+Given a batch of ioends, iomap has a few helpers to assist with
+amortization:
+
+ * ``iomap_sort_ioends``: Sort all the ioends in the list by file
+   offset.
+
+ * ``iomap_ioend_try_merge``: Given an ioend that is not in any list and
+   a separate list of sorted ioends, merge as many of the ioends from
+   the head of the list into the given ioend.
+   ioends can only be merged if the file range and storage addresses are
+   contiguous; the unwritten and shared status are the same; and the
+   write I/O outcome is the same.
+   The merged ioends become their own list.
+
+ * ``iomap_finish_ioends``: Finish an ioend that possibly has other
+   ioends linked to it.
+
+Direct I/O
+==========
+
+In Linux, direct I/O is defined as file I/O that is issued directly to
+storage, bypassing the pagecache.
+The ``iomap_dio_rw`` function implements O_DIRECT (direct I/O) reads and
+writes for files.
+
+.. code-block:: c
+
+ ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
+                      const struct iomap_ops *ops,
+                      const struct iomap_dio_ops *dops,
+                      unsigned int dio_flags, void *private,
+                      size_t done_before);
+
+The filesystem can provide the ``dops`` parameter if it needs to perform
+extra work before or after the I/O is issued to storage.
+The ``done_before`` parameter tells iomap how much of the request has
+already been transferred.
+It is used to continue a request asynchronously when `part of the
+request
+<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c03098d4b9ad76bca2966a8769dcfe59f7f85103>`_
+has already been completed synchronously.
+
+The ``done_before`` parameter should be set if writes for the ``iocb``
+have been initiated prior to the call.
+The direction of the I/O is determined from the ``iocb`` passed in.
+
+The ``dio_flags`` argument can be set to any combination of the
+following values:
+
+ * ``IOMAP_DIO_FORCE_WAIT``: Wait for the I/O to complete even if the
+   kiocb is not synchronous.
+
+ * ``IOMAP_DIO_OVERWRITE_ONLY``: Perform a pure overwrite for this range
+   or fail with ``-EAGAIN``.
+   This can be used by filesystems with complex unaligned I/O
+   write paths to provide an optimised fast path for unaligned writes.
+   If a pure overwrite can be performed, then serialisation against
+   other I/Os to the same filesystem block(s) is unnecessary as there is
+   no risk of stale data exposure or data loss.
+   If a pure overwrite cannot be performed, then the filesystem can
+   perform the serialisation steps needed to provide exclusive access
+   to the unaligned I/O range so that it can perform allocation and
+   sub-block zeroing safely.
+   Filesystems can use this flag to try to reduce locking contention,
+   but a lot of `detailed checking
+   <https://lore.kernel.org/linux-ext4/20230314130759.642710-1-bfoster@redhat.com/>`_
+   is required to do it `correctly
+   <https://lore.kernel.org/linux-ext4/20230810165559.946222-1-bfoster@redhat.com/>`_.
+
+ * ``IOMAP_DIO_PARTIAL``: If a page fault occurs, return whatever
+   progress has already been made.
+   The caller may deal with the page fault and retry the operation.
+   If the caller decides to retry the operation, it should pass the
+   accumulated return values of all previous calls as the
+   ``done_before`` parameter to the next call.
+
+These ``struct kiocb`` flags are significant for direct I/O with iomap:
+
+ * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.
+
+ * ``IOCB_SYNC``: Ensure that the device has persisted data to disk
+   before completing the call.
+   In the case of pure overwrites, the I/O may be issued with FUA
+   enabled.
+
+ * ``IOCB_HIPRI``: Poll for I/O completion instead of waiting for an
+   interrupt.
+   Only meaningful for asynchronous I/O, and only if the entire I/O can
+   be issued as a single ``struct bio``.
+
+ * ``IOCB_DIO_CALLER_COMP``: Try to run I/O completion from the caller's
+   process context.
+   See ``linux/fs.h`` for more details.
+
+Filesystems should call ``iomap_dio_rw`` from ``->read_iter`` and
+``->write_iter``, and set ``FMODE_CAN_ODIRECT`` in the ``->open``
+function for the file.
+They should not set ``->direct_IO``, which is deprecated.
+
+If a filesystem wishes to perform its own work before direct I/O
+completion, it should call ``__iomap_dio_rw``.
+If its return value is not an error pointer or a NULL pointer, the
+filesystem should pass the return value to ``iomap_dio_complete`` after
+finishing its internal work.
+
+Return Values
+-------------
+
+``iomap_dio_rw`` can return one of the following:
+
+ * A non-negative number of bytes transferred.
+
+ * ``-ENOTBLK``: Fall back to buffered I/O.
+   iomap itself will return this value if it cannot invalidate the page
+   cache before issuing the I/O to storage.
+   The ``->iomap_begin`` or ``->iomap_end`` functions may also return
+   this value.
+
+ * ``-EIOCBQUEUED``: The asynchronous direct I/O request has been
+   queued and will be completed separately.
+
+ * Any of the other negative error codes.
+
+Direct Reads
+------------
+
+A direct I/O read initiates a read I/O from the storage device to the
+caller's buffer.
+Dirty parts of the pagecache are flushed to storage before initiating
+the read I/O.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT`` with
+any combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+Direct Writes
+-------------
+
+A direct I/O write initiates a write I/O to the storage device from the
+caller's buffer.
+Dirty parts of the pagecache are flushed to storage before initiating
+the write I/O.
+The pagecache is invalidated both before and after the write I/O.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT |
+IOMAP_WRITE`` with any combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+ * ``IOMAP_OVERWRITE_ONLY``: Allocating blocks and zeroing partial
+   blocks is not allowed.
+   The entire file range must map to a single written or unwritten
+   extent.
+   The file I/O range must be aligned to the filesystem block size
+   if the mapping is unwritten and the filesystem cannot handle zeroing
+   the unaligned regions without exposing stale contents.
+
+Callers commonly hold ``i_rwsem`` in shared or exclusive mode before
+calling this function.
+
+``struct iomap_dio_ops:``
+-------------------------
+.. code-block:: c
+
+ struct iomap_dio_ops {
+     void (*submit_io)(const struct iomap_iter *iter, struct bio *bio,
+                       loff_t file_offset);
+     int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
+                   unsigned flags);
+     struct bio_set *bio_set;
+ };
+
+The fields of this structure are as follows:
+
+  - ``submit_io``: iomap calls this function when it has constructed a
+    ``struct bio`` object for the I/O requested, and wishes to submit it
+    to the block device.
+    If no function is provided, ``submit_bio`` will be called directly.
+    Filesystems that would like to perform additional work before
+    submission (e.g. data replication for btrfs) should implement this
+    function.
+
+  - ``end_io``: This is called after the ``struct bio`` completes.
+    This function should perform post-write conversions of unwritten
+    extent mappings, handle write failures, etc.
+    The ``flags`` argument may be set to a combination of the following:
+
+    * ``IOMAP_DIO_UNWRITTEN``: The mapping was unwritten, so the ioend
+      should mark the extent as written.
+
+    * ``IOMAP_DIO_COW``: Writing to the space in the mapping required a
+      copy on write operation, so the ioend should switch mappings.
+
+  - ``bio_set``: This allows the filesystem to provide a custom bio_set
+    for allocating direct I/O bios.
+    This enables filesystems to `stash additional per-bio information
+    <https://lore.kernel.org/all/20220505201115.937837-3-hch@lst.de/>`_
+    for private use.
+    If this field is NULL, generic ``struct bio`` objects will be used.
+
+Filesystems that want to perform extra work after an I/O completion
+should set a custom ``->bi_end_io`` function via ``->submit_io``.
+Afterwards, the custom endio function must call
+``iomap_dio_bio_end_io`` to finish the direct I/O.
+
+DAX I/O
+=======
+
+Some storage devices can be directly mapped as memory.
+These devices support a new access mode known as "fsdax" that allows
+loads and stores through the CPU and memory controller.
+
+fsdax Reads
+-----------
+
+A fsdax read performs a memcpy from the storage device to the caller's
+buffer.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX`` with any
+combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+fsdax Writes
+------------
+
+A fsdax write initiates a memcpy to the storage device from the caller's
+buffer.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX |
+IOMAP_WRITE`` with any combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+ * ``IOMAP_OVERWRITE_ONLY``: The caller requires a pure overwrite to be
+   performed from this mapping.
+   This requires the filesystem extent mapping to already exist as an
+   ``IOMAP_MAPPED`` type and span the entire range of the write I/O
+   request.
+   If the filesystem cannot map this request in a way that allows the
+   iomap infrastructure to perform a pure overwrite, it must fail the
+   mapping operation with ``-EAGAIN``.
+
+Callers commonly hold ``i_rwsem`` in exclusive mode before calling this
+function.
+
+fsdax mmap Faults
+~~~~~~~~~~~~~~~~~
+
+The ``dax_iomap_fault`` function handles read and write faults to fsdax
+storage.
+For a read fault, ``IOMAP_DAX | IOMAP_FAULT`` will be passed as the
+``flags`` argument to ``->iomap_begin``.
+For a write fault, ``IOMAP_DAX | IOMAP_FAULT | IOMAP_WRITE`` will be
+passed as the ``flags`` argument to ``->iomap_begin``.
+
+Callers commonly hold the same locks as they do to call their iomap
+pagecache counterparts.
+
+fsdax Truncation, fallocate, and Unsharing
+------------------------------------------
+
+For fsdax files, the following functions are provided to replace their
+iomap pagecache I/O counterparts.
+The ``flags`` arguments to ``->iomap_begin`` are the same as for the
+pagecache counterparts, with ``IOMAP_DAX`` added.
+
+ * ``dax_file_unshare``
+ * ``dax_zero_range``
+ * ``dax_truncate_page``
+
+Callers commonly hold the same locks as they do to call their iomap
+pagecache counterparts.
+
+fsdax Deduplication
+-------------------
+
+Filesystems implementing the ``FIDEDUPERANGE`` ioctl must call the
+``dax_remap_file_range_prep`` function with their own iomap read ops.
+
+Seeking Files
+=============
+
+iomap implements the two iterating whence modes of the ``llseek`` system
+call.
+
+SEEK_DATA
+---------
+
+The ``iomap_seek_data`` function implements the SEEK_DATA "whence" value
+for llseek.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+
+For unwritten mappings, the pagecache will be searched.
+Regions of the pagecache with a folio mapped and uptodate fsblocks
+within those folios will be reported as data areas.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+SEEK_HOLE
+---------
+
+The ``iomap_seek_hole`` function implements the SEEK_HOLE "whence" value
+for llseek.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+
+For unwritten mappings, the pagecache will be searched.
+Regions of the pagecache with no folio mapped, or a !uptodate fsblock
+within a folio will be reported as sparse hole areas.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+Swap File Activation
+====================
+
+The ``iomap_swapfile_activate`` function finds all the base-page aligned
+regions in a file and sets them up as swap space.
+The file will be ``fsync()``'d before activation.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+All mappings must be mapped or unwritten; they cannot be dirty or
+shared, and cannot span multiple block devices.
+Callers must hold ``i_rwsem`` in exclusive mode; this is already
+provided by ``swapon``.
+
+File Space Mapping Reporting
+============================
+
+iomap implements two of the file space mapping system calls.
+
+FS_IOC_FIEMAP
+-------------
+
+The ``iomap_fiemap`` function exports file extent mappings to userspace
+in the format specified by the ``FS_IOC_FIEMAP`` ioctl.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+FIBMAP (deprecated)
+-------------------
+
+``iomap_bmap`` implements FIBMAP.
+The calling conventions are the same as for FIEMAP.
+This function is only provided to maintain compatibility for filesystems
+that implemented FIBMAP prior to conversion.
+This ioctl is deprecated; do **not** add a FIBMAP implementation to
+filesystems that do not have it.
+Callers should probably hold ``i_rwsem`` in shared mode before calling
+this function, but this is unclear.
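
Reviewer's aside on the Seeking Files section above: the ``SEEK_DATA`` and ``SEEK_HOLE`` behaviour can be exercised from userspace with plain ``lseek(2)``, which is a convenient way to sanity-check a filesystem's mapping callbacks.
The sketch below alternates the two whence values to total the data bytes in a file.
``count_data_bytes`` and ``demo_fd_with_data`` are hypothetical helper names, the ``/tmp`` path is an assumption, and the fallback ``SEEK_DATA``/``SEEK_HOLE`` values are the Linux ones, supplied only in case the libc headers hide them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA
#define SEEK_DATA 3	/* Linux values; an assumption for headers that hide them */
#define SEEK_HOLE 4
#endif

/*
 * Sum the lengths of all data segments in a file by alternating
 * SEEK_DATA and SEEK_HOLE.  Returns the byte count, or -1 with errno set.
 */
static off_t count_data_bytes(int fd)
{
	off_t pos = 0, total = 0;

	for (;;) {
		off_t data = lseek(fd, pos, SEEK_DATA);
		off_t hole;

		if (data < 0)	/* ENXIO means no more data at or past pos */
			return errno == ENXIO ? total : -1;
		hole = lseek(fd, data, SEEK_HOLE);
		if (hole < 0)
			return -1;
		assert(hole > data);	/* EOF counts as an implicit hole */
		total += hole - data;
		pos = hole;
	}
}

/* Hypothetical test helper: unlinked temp file holding len non-zero bytes. */
static int demo_fd_with_data(size_t len)
{
	char path[] = "/tmp/seekdemoXXXXXX";
	char buf[512];
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	memset(buf, 'x', sizeof(buf));
	while (len > 0) {
		size_t n = len < sizeof(buf) ? len : sizeof(buf);

		if (write(fd, buf, n) != (ssize_t)n) {
			close(fd);
			return -1;
		}
		len -= n;
	}
	return fd;
}
```

A fully written file should report its whole length as data; a sparse file should report less than its size on filesystems that track holes.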

+120

Documentation/filesystems/iomap/porting.rst

···
+.. SPDX-License-Identifier: GPL-2.0
+.. _iomap_porting:
+
+..
+	Dumb style notes to maintain the author's sanity:
+	Please try to start sentences on separate lines so that
+	sentence changes don't bleed colors in diff.
+	Heading decorations are documented in sphinx.rst.
+
+=======================
+Porting Your Filesystem
+=======================
+
+.. contents:: Table of Contents
+   :local:
+
+Why Convert?
+============
+
+There are several reasons to convert a filesystem to iomap:
+
+ 1. The classic Linux I/O path is not terribly efficient.
+    Pagecache operations lock a single base page at a time and then call
+    into the filesystem to return a mapping for only that page.
+    Direct I/O operations build I/O requests a single file block at a
+    time.
+    This worked well enough for direct/indirect-mapped filesystems such
+    as ext2, but is very inefficient for extent-based filesystems such
+    as XFS.
+
+ 2. Large folios are only supported via iomap; there are no plans to
+    convert the old buffer_head path to use them.
+
+ 3. Direct access to storage on memory-like devices (fsdax) is only
+    supported via iomap.
+
+ 4. Lower maintenance overhead for individual filesystem maintainers.
+    iomap handles common pagecache related operations itself, such as
+    allocating, instantiating, locking, and unlocking of folios.
+    No ->write_begin(), ->write_end() or direct_IO
+    address_space_operations are required to be implemented by
+    filesystems using iomap.
+
+How Do I Convert a Filesystem?
+==============================
+
+First, add ``#include <linux/iomap.h>`` to your source code and add
+``select FS_IOMAP`` to your filesystem's Kconfig option.
+Build the kernel, run fstests with the ``-g all`` option across a wide
+variety of your filesystem's supported configurations to build a
+baseline of which tests pass and which ones fail.
+
+The recommended approach is first to implement ``->iomap_begin`` (and
+``->iomap_end`` if necessary) to allow iomap to obtain a read-only
+mapping of a file range.
+In most cases, this is a relatively trivial conversion of the existing
+``get_block()`` function for read-only mappings.
+``FS_IOC_FIEMAP`` is a good first target because it is trivial to
+implement support for it and then to determine that the extent map
+iteration is correct from userspace.
+If FIEMAP is returning the correct information, it's a good sign that
+other read-only mapping operations will do the right thing.
+
+Next, modify the filesystem's ``get_block(create = false)``
+implementation to use the new ``->iomap_begin`` implementation to map
+file space for selected read operations.
+Hide behind a debugging knob the ability to switch on the iomap mapping
+functions for selected call paths.
+It is necessary to write some code to fill out the bufferhead-based
+mapping information from the ``iomap`` structure, but the new functions
+can be tested without needing to implement any iomap APIs.
+
+Once the read-only functions are working like this, convert each high
+level file operation one by one to use iomap native APIs instead of
+going through ``get_block()``.
+Done one at a time, regressions should be self evident.
+You *do* have a regression test baseline for fstests, right?
+It is suggested to convert swap file activation, ``SEEK_DATA``, and
+``SEEK_HOLE`` before tackling the I/O paths.
+A likely complexity at this point will be converting the buffered read
+I/O path because of bufferheads.
+The buffered read I/O path doesn't need to be converted yet, though the
+direct I/O read path should be converted in this phase.
+
+At this point, you should look over your ``->iomap_begin`` function.
+If it switches between large blocks of code based on dispatching of the
+``flags`` argument, you should consider breaking it up into
+per-operation iomap ops with smaller, more cohesive functions.
+XFS is a good example of this.
+
+The next thing to do is implement ``get_block(create == true)``
+functionality in the ``->iomap_begin``/``->iomap_end`` methods.
+It is strongly recommended to create separate mapping functions and
+iomap ops for write operations.
+Then convert the direct I/O write path to iomap, and start running fsx
+w/ DIO enabled in earnest on the filesystem.
+This will flush out lots of data integrity corner case bugs that the new
+write mapping implementation introduces.
+
+Now, convert any remaining file operations to call the iomap functions.
+This will get the entire filesystem using the new mapping functions, and
+they should largely be debugged and working correctly after this step.
+
+Most likely at this point, the buffered read and write paths will still
+need to be converted.
+The mapping functions should all work correctly, so all that needs to be
+done is rewriting all the code that interfaces with bufferheads to
+interface with iomap and folios.
+It is much easier first to get regular file I/O (without any fancy
+features like fscrypt, fsverity, compression, or data=journaling)
+converted to use iomap.
+Some of those fancy features (fscrypt and compression) aren't
+implemented yet in iomap.
+For unjournalled filesystems that use the pagecache for symbolic links
+and directories, you might also try converting their handling to iomap.
+
+The rest is left as an exercise for the reader, as it will be different
+for every filesystem.
+If you encounter problems, email the people and lists in
+``get_maintainers.pl`` for help.
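
Reviewer's aside on the porting steps above: the interim step of filling out bufferhead-style mapping information from an iomap mapping comes down to a small translation.
The userspace sketch below models it with simplified stand-in types.
Every ``demo_*`` name is hypothetical and none of these are the kernel's structures; the real code would operate on ``struct iomap`` and ``struct buffer_head``.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel types; all names are hypothetical. */
enum demo_map_type { DEMO_HOLE, DEMO_MAPPED, DEMO_UNWRITTEN };

struct demo_iomap {
	uint64_t offset;	/* file offset of the mapping, in bytes */
	uint64_t length;	/* length of the mapping, in bytes */
	uint64_t addr;		/* disk address of the mapping, in bytes */
	enum demo_map_type type;
};

struct demo_bh {		/* what a get_block()-style caller wants back */
	bool mapped;
	bool unwritten;
	uint64_t blocknr;	/* disk block number, valid only if mapped */
	uint64_t size;		/* bytes covered, starting at pos */
};

/*
 * Translate the part of mapping `im` that starts at file offset `pos`
 * into a get_block()-style result, covering at most `want` bytes.
 */
static void demo_iomap_to_bh(const struct demo_iomap *im, uint64_t pos,
			     uint64_t want, unsigned int blkbits,
			     struct demo_bh *bh)
{
	uint64_t left;

	/* pos must fall inside the mapping returned for it. */
	assert(pos >= im->offset && pos < im->offset + im->length);

	left = im->offset + im->length - pos;
	bh->size = want < left ? want : left;
	bh->unwritten = (im->type == DEMO_UNWRITTEN);
	bh->mapped = (im->type != DEMO_HOLE);
	bh->blocknr = bh->mapped ?
		(im->addr + (pos - im->offset)) >> blkbits : 0;
}
```

The point of the sketch is that one extent-sized mapping can answer many per-block queries, which is why the mapping functions can be tested behind ``get_block()`` before any iomap API is wired up.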

+1

MAINTAINERS

···
 L:	linux-xfs@vger.kernel.org
 L:	linux-fsdevel@vger.kernel.org
 S:	Supported
+F:	Documentation/filesystems/iomap/*
 F:	fs/iomap/
 F:	include/linux/iomap.h
 

+49 -26

fs/iomap/buffered-io.c

···
 	return pos - orig_pos + plen;
 }
 
+static loff_t iomap_read_folio_iter(const struct iomap_iter *iter,
+		struct iomap_readpage_ctx *ctx)
+{
+	struct folio *folio = ctx->cur_folio;
+	size_t offset = offset_in_folio(folio, iter->pos);
+	loff_t length = min_t(loff_t, folio_size(folio) - offset,
+			      iomap_length(iter));
+	loff_t done, ret;
+
+	for (done = 0; done < length; done += ret) {
+		ret = iomap_readpage_iter(iter, ctx, done);
+		if (ret <= 0)
+			return ret;
+	}
+
+	return done;
+}
+
 int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
 {
 	struct iomap_iter iter = {
···
 	trace_iomap_readpage(iter.inode, 1);
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_readpage_iter(&iter, &ctx, 0);
+		iter.processed = iomap_read_folio_iter(&iter, &ctx);
 
 	if (ctx.bio) {
 		submit_bio(ctx.bio);
···
 		size_t copied, struct folio *folio)
 {
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
-	loff_t old_size = iter->inode->i_size;
-	size_t written;
 
 	if (srcmap->type == IOMAP_INLINE) {
 		iomap_write_end_inline(iter, folio, pos, copied);
-		written = copied;
-	} else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
-		written = block_write_end(NULL, iter->inode->i_mapping, pos,
+		return true;
+	}
+
+	if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
+		size_t bh_written;
+
+		bh_written = block_write_end(NULL, iter->inode->i_mapping, pos,
 				len, copied, &folio->page, NULL);
-		WARN_ON_ONCE(written != copied && written != 0);
-	} else {
-		written = __iomap_write_end(iter->inode, pos, len, copied,
-				folio) ? copied : 0;
+		WARN_ON_ONCE(bh_written != copied && bh_written != 0);
+		return bh_written == copied;
 	}
 
-	/*
-	 * Update the in-memory inode size after copying the data into the page
-	 * cache. It's up to the file system to write the updated size to disk,
-	 * preferably after I/O completion so that no stale data is exposed.
-	 * Only once that's done can we unlock and release the folio.
-	 */
-	if (pos + written > old_size) {
-		i_size_write(iter->inode, pos + written);
-		iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
-	}
-	__iomap_put_folio(iter, pos, written, folio);
-
-	if (old_size < pos)
-		pagecache_isize_extended(iter->inode, old_size, pos);
-
-	return written == copied;
+	return __iomap_write_end(iter->inode, pos, len, copied, folio);
 }
 
 static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
···
 
 	do {
 		struct folio *folio;
+		loff_t old_size;
 		size_t offset;		/* Offset into folio */
 		size_t bytes;		/* Bytes to write to folio */
 		size_t copied;		/* Bytes copied from user */
···
 		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
 		written = iomap_write_end(iter, pos, bytes, copied, folio) ?
 			  copied : 0;
+
+		/*
+		 * Update the in-memory inode size after copying the data into
+		 * the page cache. It's up to the file system to write the
+		 * updated size to disk, preferably after I/O completion so that
+		 * no stale data is exposed. Only once that's done can we
+		 * unlock and release the folio.
+		 */
+		old_size = iter->inode->i_size;
+		if (pos + written > old_size) {
+			i_size_write(iter->inode, pos + written);
+			iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
+		}
+		__iomap_put_folio(iter, pos, written, folio);
+
+		if (old_size < pos)
+			pagecache_isize_extended(iter->inode, old_size, pos);
 
 		cond_resched();
 		if (unlikely(written == 0)) {
···
 			bytes = folio_size(folio) - offset;
 
 		ret = iomap_write_end(iter, pos, bytes, bytes, folio);
+		__iomap_put_folio(iter, pos, bytes, folio);
 		if (WARN_ON_ONCE(!ret))
 			return -EIO;
 
···
 		folio_mark_accessed(folio);
 
 		ret = iomap_write_end(iter, pos, bytes, bytes, folio);
+		__iomap_put_folio(iter, pos, bytes, folio);
 		if (WARN_ON_ONCE(!ret))
 			return -EIO;
 
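
Reviewer's aside on the ``iomap_read_folio_iter`` hunk above: the new helper keeps calling the per-chunk reader until the part of the mapping that falls inside the folio is consumed, instead of calling it once.
The userspace sketch below models only that control flow.
All ``demo_*`` names are hypothetical, and the callback receives the remaining byte count as a simplification that the kernel function does not have.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for iomap_readpage_iter(): handles some bytes
 * starting `done` bytes into the range and returns the number of bytes
 * handled, or a value <= 0 to stop.
 */
typedef int64_t (*demo_read_fn)(void *ctx, int64_t done, int64_t remaining);

static int64_t demo_min(int64_t a, int64_t b)
{
	return a < b ? a : b;
}

/* Same loop shape as the patch: clamp to the folio, then iterate. */
static int64_t demo_read_folio_iter(int64_t folio_size, int64_t offset_in_folio,
				    int64_t mapping_len, demo_read_fn fn,
				    void *ctx)
{
	int64_t length = demo_min(folio_size - offset_in_folio, mapping_len);
	int64_t done, ret;

	for (done = 0; done < length; done += ret) {
		ret = fn(ctx, done, length - done);
		if (ret <= 0)
			return ret;	/* as in the patch, partial progress is not reported */
	}
	return done;
}

/* Example reader: consumes at most *ctx bytes per call. */
static int64_t demo_fixed_chunk(void *ctx, int64_t done, int64_t remaining)
{
	int64_t chunk = *(int64_t *)ctx;

	(void)done;
	return demo_min(chunk, remaining);
}

/* Example reader that always fails with -5 (the value of EIO on Linux). */
static int64_t demo_fail(void *ctx, int64_t done, int64_t remaining)
{
	(void)ctx;
	(void)done;
	(void)remaining;
	return -5;
}
```

With a 16384-byte folio, an offset of 4096 and a longer mapping, a reader that handles 4096 bytes per call is invoked three times and the loop reports 12288 bytes processed.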

+14 -1

fs/xfs/xfs_iops.c

···
 #include "xfs_da_btree.h"
 #include "xfs_attr.h"
 #include "xfs_trans.h"
+#include "xfs_trans_space.h"
+#include "xfs_bmap_btree.h"
 #include "xfs_trace.h"
 #include "xfs_icache.h"
 #include "xfs_symlink.h"
···
 	struct xfs_trans	*tp;
 	int			error;
 	uint			lock_flags = 0;
+	uint			resblks = 0;
 	bool			did_zeroing = false;
 
 	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
···
 		return error;
 	}
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	/*
+	 * For realtime inode with more than one block rtextsize, we need the
+	 * block reservation for bmap btree block allocations/splits that can
+	 * happen since it could split the tail written extent and convert the
+	 * right beyond EOF one to unwritten.
+	 */
+	if (xfs_inode_has_bigrtalloc(ip))
+		resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
+				0, 0, &tp);
 	if (error)
 		return error;
 
