Merge netfs API documentation updates · tjh.dev/kernel@a1b4a25

+21 -13

Documentation/driver-api/early-userspace/buffer-format.rst

··· 4 4 5 5 Al Viro, H. Peter Anvin 6 6 7 - Last revision: 2002-01-13 8 - 9 - Starting with kernel 2.5.x, the old "initial ramdisk" protocol is 10 - getting {replaced/complemented} with the new "initial ramfs" 11 - (initramfs) protocol. The initramfs contents is passed using the same 12 - memory buffer protocol used by the initrd protocol, but the contents 7 + With kernel 2.5.x, the old "initial ramdisk" protocol was complemented 8 + with an "initial ramfs" protocol. The initramfs content is passed 9 + using the same memory buffer protocol used by initrd, but the content 13 10 is different. The initramfs buffer contains an archive which is 14 - expanded into a ramfs filesystem; this document details the format of 15 - the initramfs buffer format. 11 + expanded into a ramfs filesystem; this document details the initramfs 12 + buffer format. 16 13 17 14 The initramfs buffer format is based around the "newc" or "crc" CPIO 18 15 formats, and can be created with the cpio(1) utility. The cpio 19 - archive can be compressed using gzip(1). One valid version of an 20 - initramfs buffer is thus a single .cpio.gz file. 16 + archive can be compressed using gzip(1), or any other algorithm provided 17 + via CONFIG_DECOMPRESS_*. One valid version of an initramfs buffer is 18 + thus a single .cpio.gz file. 
21 19 22 20 The full format of the initramfs buffer is defined by the following 23 21 grammar, where:: ··· 23 25 * is used to indicate "0 or more occurrences of" 24 26 (|) indicates alternatives 25 27 + indicates concatenation 26 - GZIP() indicates the gzip(1) of the operand 28 + GZIP() indicates gzip compression of the operand 29 + BZIP2() indicates bzip2 compression of the operand 30 + LZMA() indicates lzma compression of the operand 31 + XZ() indicates xz compression of the operand 32 + LZO() indicates lzo compression of the operand 33 + LZ4() indicates lz4 compression of the operand 34 + ZSTD() indicates zstd compression of the operand 27 35 ALGN(n) means padding with null bytes to an n-byte boundary 28 36 29 - initramfs := ("\0" | cpio_archive | cpio_gzip_archive)* 37 + initramfs := ("\0" | cpio_archive | cpio_compressed_archive)* 30 38 31 - cpio_gzip_archive := GZIP(cpio_archive) 39 + cpio_compressed_archive := (GZIP(cpio_archive) | BZIP2(cpio_archive) 40 + | LZMA(cpio_archive) | XZ(cpio_archive) | LZO(cpio_archive) 41 + | LZ4(cpio_archive) | ZSTD(cpio_archive)) 32 42 33 43 cpio_archive := cpio_file* + (<nothing> | cpio_trailer) 34 44 ··· 80 74 81 75 The c_mode field matches the contents of st_mode returned by stat(2) 82 76 on Linux, and encodes the file type and file permissions. 77 + 78 + c_mtime is ignored unless CONFIG_INITRAMFS_PRESERVE_MTIME=y is set. 83 79 84 80 The c_filesize should be zero for any file which is not a regular file 85 81 or symlink.
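As a rough illustration of the "newc"/"crc" header that the grammar above builds on, the sketch below parses the fixed 110-byte ASCII header (a 6-byte magic followed by thirteen 8-digit hexadecimal fields) and computes the ALGN(4) padding. It is a userspace sketch only: the names `newc_hdr`, `hex8`, `align4` and `newc_parse` are invented for this example and are not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NEWC_HDR_LEN 110	/* 6-byte magic + 13 fields of 8 hex digits */

/* Hypothetical holder for the header fields discussed in the text. */
struct newc_hdr {
	int      is_crc;	/* 1 for "070702" (crc), 0 for "070701" (newc) */
	uint32_t mode;		/* c_mode: file type and permissions */
	uint32_t mtime;		/* c_mtime */
	uint32_t filesize;	/* c_filesize */
	uint32_t namesize;	/* c_namesize, including the trailing NUL */
};

/* Parse exactly eight hexadecimal digits; returns 0 on success, -1 on error. */
static int hex8(const char *p, uint32_t *out)
{
	uint32_t v = 0;

	for (int i = 0; i < 8; i++) {
		char c = p[i];
		uint32_t d;

		if (c >= '0' && c <= '9')
			d = (uint32_t)(c - '0');
		else if (c >= 'a' && c <= 'f')
			d = (uint32_t)(c - 'a' + 10);
		else if (c >= 'A' && c <= 'F')
			d = (uint32_t)(c - 'A' + 10);
		else
			return -1;
		v = (v << 4) | d;
	}
	*out = v;
	return 0;
}

/* ALGN(4): round a length up to the next 4-byte boundary. */
static size_t align4(size_t n)
{
	return (n + 3) & ~(size_t)3;
}

/* Parse one header from buf; returns 0 on success, -1 if it is not valid. */
static int newc_parse(const char *buf, size_t len, struct newc_hdr *h)
{
	uint32_t f[13];

	assert(buf != NULL && h != NULL);
	if (len < NEWC_HDR_LEN)
		return -1;
	if (memcmp(buf, "070701", 6) == 0)
		h->is_crc = 0;
	else if (memcmp(buf, "070702", 6) == 0)
		h->is_crc = 1;
	else
		return -1;

	for (int i = 0; i < 13; i++)
		if (hex8(buf + 6 + 8 * i, &f[i]) < 0)
			return -1;

	/* Field order: ino, mode, uid, gid, nlink, mtime, filesize,
	 * maj, min, rmaj, rmin, namesize, chksum. */
	h->mode = f[1];
	h->mtime = f[5];
	h->filesize = f[6];
	h->namesize = f[11];
	return 0;
}
```

A caller would follow a successful parse by skipping `align4(NEWC_HDR_LEN + namesize)` bytes to reach the file data, then `align4(filesize)` bytes to reach the next header.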

+742 -280

Documentation/filesystems/netfs_library.rst

··· 1 1 .. SPDX-License-Identifier: GPL-2.0 2 2 3 - ================================= 4 - Network Filesystem Helper Library 5 - ================================= 3 + =================================== 4 + Network Filesystem Services Library 5 + =================================== 6 6 7 7 .. Contents: 8 8 9 9 - Overview. 10 + - Requests and streams. 11 + - Subrequests. 12 + - Result collection and retry. 13 + - Local caching. 14 + - Content encryption (fscrypt). 10 15 - Per-inode context. 11 16 - Inode context helper functions. 12 - - Buffered read helpers. 13 - - Read helper functions. 14 - - Read helper structures. 15 - - Read helper operations. 16 - - Read helper procedure. 17 - - Read helper cache API. 17 + - Inode locking. 18 + - Inode writeback. 19 + - High-level VFS API. 20 + - Unlocked read/write iter. 21 + - Pre-locked read/write iter. 22 + - Monolithic files API. 23 + - Memory-mapped I/O API. 24 + - High-level VM API. 25 + - Deprecated PG_private_2 API. 26 + - I/O request API. 27 + - Request structure. 28 + - Stream structure. 29 + - Subrequest structure. 30 + - Filesystem methods. 31 + - Terminating a subrequest. 32 + - Local cache API. 33 + - API function reference. 18 34 19 35 20 36 Overview 21 37 ======== 22 38 23 - The network filesystem helper library is a set of functions designed to aid a 24 - network filesystem in implementing VM/VFS operations. For the moment, that 25 - just includes turning various VM buffered read operations into requests to read 26 - from the server. The helper library, however, can also interpose other 27 - services, such as local caching or local data encryption. 39 + The network filesystem services library, netfslib, is a set of functions 40 + designed to aid a network filesystem in implementing VM/VFS API operations. It 41 + takes over the normal buffered read, readahead, write and writeback and also 42 + handles unbuffered and direct I/O. 
28 43 29 - Note that the library module doesn't link against local caching directly, so 30 - access must be provided by the netfs. 44 + The library provides support for (re-)negotiation of I/O sizes and retrying 45 + failed I/O as well as local caching and will, in the future, provide content 46 + encryption. 47 + 48 + It insulates the filesystem from VM interface changes as much as possible and 49 + handles VM features such as large multipage folios. The filesystem basically 50 + just has to provide a way to perform read and write RPC calls. 51 + 52 + The way I/O is organised inside netfslib consists of a number of objects: 53 + 54 + * A *request*. A request is used to track the progress of the I/O overall and 55 + to hold on to resources. The collection of results is done at the request 56 + level. The I/O within a request is divided into a number of parallel 57 + streams of subrequests. 58 + 59 + * A *stream*. A non-overlapping series of subrequests. The subrequests 60 + within a stream do not have to be contiguous. 61 + 62 + * A *subrequest*. This is the basic unit of I/O. It represents a single RPC 63 + call or a single cache I/O operation. The library passes these to the 64 + filesystem and the cache to perform. 65 + 66 + Requests and Streams 67 + -------------------- 68 + 69 + When actually performing I/O (as opposed to just copying into the pagecache), 70 + netfslib will create one or more requests to track the progress of the I/O and 71 + to hold resources. 72 + 73 + A read operation will have a single stream and the subrequests within that 74 + stream may be of mixed origins, for instance mixing RPC subrequests and cache 75 + subrequests. 76 + 77 + On the other hand, a write operation may have multiple streams, where each 78 + stream targets a different destination. For instance, there may be one stream 79 + writing to the local cache and one to the server. 
Currently, only two streams 80 + are allowed, but this could be increased if parallel writes to multiple servers 81 + are desired. 82 + 83 + The subrequests within a write stream do not need to match alignment or size 84 + with the subrequests in another write stream and netfslib performs the tiling 85 + of subrequests in each stream over the source buffer independently. Further, 86 + each stream may contain holes that don't correspond to holes in the other 87 + stream. 88 + 89 + In addition, the subrequests do not need to correspond to the boundaries of the 90 + folios or vectors in the source/destination buffer. The library handles the 91 + collection of results and the wrangling of folio flags and references. 92 + 93 + Subrequests 94 + ----------- 95 + 96 + Subrequests are at the heart of the interaction between netfslib and the 97 + filesystem using it. Each subrequest is expected to correspond to a single 98 + read or write RPC or cache operation. The library will stitch together the 99 + results from a set of subrequests to provide a higher level operation. 100 + 101 + Netfslib has two interactions with the filesystem or the cache when setting up 102 + a subrequest. First, there's an optional preparatory step that allows the 103 + filesystem to negotiate the limits on the subrequest, both in terms of maximum 104 + number of bytes and maximum number of vectors (e.g. for RDMA). This may 105 + involve negotiating with the server (e.g. cifs needing to acquire credits). 106 + 107 + And, secondly, there's the issuing step in which the subrequest is handed off 108 + to the filesystem to perform. 109 + 110 + Note that these two steps are done slightly differently between read and write: 111 + 112 + * For reads, the VM/VFS tells us how much is being requested up front, so the 113 + library can preset maximum values that the cache and then the filesystem can 114 + reduce. 
The cache also gets consulted first on whether it wants to do 115 + a read before the filesystem is consulted. 116 + 117 + * For writeback, it is unknown how much there will be to write until the 118 + pagecache is walked, so no limit is set by the library. 119 + 120 + Once a subrequest is completed, the filesystem or cache informs the library of 121 + the completion and then collection is invoked. Depending on whether the 122 + request is synchronous or asynchronous, the collection of results will be done 123 + in either the application thread or in a work queue. 124 + 125 + Result Collection and Retry 126 + --------------------------- 127 + 128 + As subrequests complete, the results are collected and collated by the library 129 + and folio unlocking is performed progressively (if appropriate). Once the 130 + request is complete, async completion will be invoked (again, if appropriate). 131 + It is possible for the filesystem to provide interim progress reports to the 132 + library to cause folio unlocking to happen earlier if possible. 133 + 134 + If any subrequests fail, netfslib can retry them. It will wait until all 135 + subrequests are completed, offer the filesystem the opportunity to fiddle with 136 + the resources/state held by the request and poke at the subrequests before 137 + re-preparing and re-issuing the subrequests. 138 + 139 + This allows the tiling of contiguous sets of failed subrequests within a stream 140 + to be changed, adding more subrequests or ditching excess as necessary (for 141 + instance, if the network sizes change or the server decides it wants smaller 142 + chunks). 143 + 144 + Further, if one or more contiguous cache-read subrequests fail, the library 145 + will pass them to the filesystem to perform instead, renegotiating and retiling 146 + them as necessary to fit with the filesystem's parameters rather than those of 147 + the cache. 
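The retiling described above can be pictured with a small userspace model. This is only a sketch of the idea, not netfslib code: `struct tile`, `tile_range()` and `retile_failed()` are hypothetical names, and a real subrequest carries far more state than a start and a length.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a subrequest: one slice of a stream. */
struct tile {
	unsigned long long start;
	size_t len;
};

/*
 * Tile [start, start + len) into pieces no larger than max_len.
 * Returns the number of tiles written, or -1 if more than cap are needed.
 */
static int tile_range(unsigned long long start, size_t len, size_t max_len,
		      struct tile *out, int cap)
{
	int n = 0;

	assert(max_len > 0);
	while (len > 0) {
		size_t part = len < max_len ? len : max_len;

		if (n == cap)
			return -1;
		out[n].start = start;
		out[n].len = part;
		n++;
		start += part;
		len -= part;
	}
	return n;
}

/*
 * Re-tile a contiguous run of failed tiles using a new maximum size, as
 * might happen if the server decides it wants smaller chunks.  The run may
 * come back as more or fewer tiles than it started with.
 */
static int retile_failed(const struct tile *failed, int nfailed,
			 size_t new_max, struct tile *out, int cap)
{
	size_t span = 0;

	assert(nfailed > 0);
	for (int i = 0; i < nfailed; i++)
		span += failed[i].len;
	return tile_range(failed[0].start, span, new_max, out, cap);
}
```

For example, a 10000-byte read tiled at 4096 bytes gives three tiles; if the last two fail and the limit drops to 2048 bytes, the same 5904-byte span is reissued as three smaller tiles.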
148 + 149 + Local Caching 150 + ------------- 151 + 152 + One of the services netfslib provides, via ``fscache``, is the option to cache 153 + on local disk a copy of the data obtained from/written to a network filesystem. 154 + The library will manage the storing, retrieval and some invalidation of data 155 + automatically on behalf of the filesystem if a cookie is attached to the 156 + ``netfs_inode``. 157 + 158 + Note that local caching used to use the PG_private_2 (aliased as PG_fscache) to 159 + keep track of a page that was being written to the cache, but this is now 160 + deprecated as PG_private_2 will be removed. 161 + 162 + Instead, folios that are read from the server for which there was no data in 163 + the cache will be marked as dirty and will have ``folio->private`` set to a 164 + special value (``NETFS_FOLIO_COPY_TO_CACHE``) and left to writeback to write. 165 + If the folio is modified before that happened, the special value will be 166 + cleared and the write will become normally dirty. 167 + 168 + When writeback occurs, folios that are so marked will only be written to the 169 + cache and not to the server. Writeback handles mixed cache-only writes and 170 + server-and-cache writes by using two streams, sending one to the cache and one 171 + to the server. The server stream will have gaps in it corresponding to those 172 + folios. 173 + 174 + Content Encryption (fscrypt) 175 + ---------------------------- 176 + 177 + Though it does not do so yet, at some point netfslib will acquire the ability 178 + to do client-side content encryption on behalf of the network filesystem (Ceph, 179 + for example). fscrypt can be used for this if appropriate (it may not be - 180 + cifs, for example). 181 + 182 + The data will be stored encrypted in the local cache using the same manner of 183 + encryption as the data written to the server and the library will impose bounce 184 + buffering and RMW cycles as necessary. 
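The two-stream writeback behaviour described under Local Caching can be modelled roughly as follows. This is a userspace sketch under stated assumptions: `enum folio_mark`, `struct stream_plan` and `plan_writeback()` are invented for illustration, with `MARK_COPY_TO_CACHE` standing in for a folio whose ``folio->private`` holds ``NETFS_FOLIO_COPY_TO_CACHE``. It only counts which folios would go to which stream.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-folio marking used by this model only. */
enum folio_mark {
	MARK_DIRTY,		/* normally dirty: goes to server and cache */
	MARK_COPY_TO_CACHE,	/* read from server, only needs caching */
};

struct stream_plan {
	size_t server_folios;	/* folios written on the server stream */
	size_t cache_folios;	/* folios written on the cache stream */
	size_t server_gaps;	/* runs of folios skipped by the server stream */
};

/* Walk the folios in order and work out what each stream would carry. */
static struct stream_plan plan_writeback(const enum folio_mark *marks,
					 size_t n, bool cache_active)
{
	struct stream_plan p = { 0, 0, 0 };
	bool in_gap = false;

	assert(marks != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (cache_active)
			p.cache_folios++;
		if (marks[i] == MARK_DIRTY) {
			p.server_folios++;
			in_gap = false;
		} else if (!in_gap) {
			/* Start of a run the server stream leaves out. */
			p.server_gaps++;
			in_gap = true;
		}
	}
	return p;
}
```

The point of the model is that the cache stream is contiguous while the server stream has holes wherever cache-only folios sit, which is why the two streams are tiled independently.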
31 185 32 186 33 187 Per-Inode Context ··· 194 40 struct netfs_inode { 195 41 struct inode inode; 196 42 const struct netfs_request_ops *ops; 197 - struct fscache_cookie *cache; 43 + struct fscache_cookie * cache; 44 + loff_t remote_i_size; 45 + unsigned long flags; 46 + ... 198 47 }; 199 48 200 - A network filesystem that wants to use netfs lib must place one of these in its 49 + A network filesystem that wants to use netfslib must place one of these in its 201 50 inode wrapper struct instead of the VFS ``struct inode``. This can be done in 202 51 a way similar to the following:: 203 52 ··· 213 56 inode pointer, thereby allowing the netfslib helper functions to be pointed to 214 57 directly by the VFS/VM operation tables. 215 58 216 - The structure contains the following fields: 59 + The structure contains the following fields that are of interest to the 60 + filesystem: 217 61 218 62 * ``inode`` 219 63 ··· 229 71 Local caching cookie, or NULL if no caching is enabled. This field does not 230 72 exist if fscache is disabled. 231 73 74 + * ``remote_i_size`` 75 + 76 + The size of the file on the server. This differs from inode->i_size if 77 + local modifications have been made but not yet written back. 78 + 79 + * ``flags`` 80 + 81 + A set of flags, some of which the filesystem might be interested in: 82 + 83 + * ``NETFS_ICTX_MODIFIED_ATTR`` 84 + 85 + Set if netfslib modifies mtime/ctime. The filesystem is free to ignore 86 + this or clear it. 87 + 88 + * ``NETFS_ICTX_UNBUFFERED`` 89 + 90 + Do unbuffered I/O upon the file. Like direct I/O but without the 91 + alignment limitations. RMW will be performed if necessary. The pagecache 92 + will not be used unless mmap() is also used. 93 + 94 + * ``NETFS_ICTX_WRITETHROUGH`` 95 + 96 + Do writethrough caching upon the file. I/O will be set up and dispatched 97 + as buffered writes are made to the page cache. mmap() does the normal 98 + writeback thing. 
99 + 100 + * ``NETFS_ICTX_SINGLE_NO_UPLOAD`` 101 + 102 + Set if the file has a monolithic content that must be read entirely in a 103 + single go and must not be written back to the server, though it can be 104 + cached (e.g. AFS directories). 232 105 233 106 Inode Context Helper Functions 234 107 ------------------------------ ··· 273 84 274 85 then a function to cast from the VFS inode structure to the netfs context:: 275 86 276 - struct netfs_inode *netfs_node(struct inode *inode); 87 + struct netfs_inode *netfs_inode(struct inode *inode); 277 88 278 89 and finally, a function to get the cache cookie pointer from the context 279 90 attached to an inode (or NULL if fscache is disabled):: 280 91 281 92 struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx); 282 93 94 + Inode Locking 95 + ------------- 283 96 284 - Buffered Read Helpers 285 - ===================== 97 + A number of functions are provided to manage the locking of i_rwsem for I/O and 98 + to effectively extend it to provide more separate classes of exclusion:: 286 99 287 - The library provides a set of read helpers that handle the ->read_folio(), 288 - ->readahead() and much of the ->write_begin() VM operations and translate them 289 - into a common call framework. 100 + int netfs_start_io_read(struct inode *inode); 101 + void netfs_end_io_read(struct inode *inode); 102 + int netfs_start_io_write(struct inode *inode); 103 + void netfs_end_io_write(struct inode *inode); 104 + int netfs_start_io_direct(struct inode *inode); 105 + void netfs_end_io_direct(struct inode *inode); 290 106 291 - The following services are provided: 107 + The exclusion breaks down into four separate classes: 292 108 293 - * Handle folios that span multiple pages. 109 + 1) Buffered reads and writes. 294 110 295 - * Insulate the netfs from VM interface changes. 111 + Buffered reads can run concurrently with each other and with buffered writes, 112 + but buffered writes cannot run concurrently with each other. 
296 113 297 - * Allow the netfs to arbitrarily split reads up into pieces, even ones that 298 - don't match folio sizes or folio alignments and that may cross folios. 114 + 2) Direct reads and writes. 299 115 300 - * Allow the netfs to expand a readahead request in both directions to meet its 301 - needs. 116 + Direct (and unbuffered) reads and writes can run concurrently since they do 117 + not share local buffering (i.e. the pagecache) and, in a network 118 + filesystem, are expected to have exclusion managed on the server (though 119 + this may not be the case for, say, Ceph). 302 120 303 - * Allow the netfs to partially fulfil a read, which will then be resubmitted. 121 + 3) Other major inode modifying operations (e.g. truncate, fallocate). 304 122 305 - * Handle local caching, allowing cached data and server-read data to be 306 - interleaved for a single request. 123 + These should just access i_rwsem directly. 307 124 308 - * Handle clearing of bufferage that isn't on the server. 125 + 4) mmap(). 309 126 310 - * Handle retrying of reads that failed, switching reads from the cache to the 311 - server as necessary. 127 + mmap'd accesses might operate concurrently with any of the other classes. 128 + They might form the buffer for an intra-file loopback DIO read/write. They 129 + might be permitted on unbuffered files. 312 130 313 - * In the future, this is a place that other services can be performed, such as 314 - local encryption of data to be stored remotely or in the cache. 131 + Inode Writeback 132 + --------------- 315 133 316 - From the network filesystem, the helpers require a table of operations. This 317 - includes a mandatory method to issue a read operation along with a number of 318 - optional methods. 134 + Netfslib will pin resources on an inode for future writeback (such as pinning 135 + use of an fscache cookie) when an inode is dirtied. However, this pinning 136 + needs careful management. 
To manage the pinning, the following sequence 137 + occurs: 138 + 139 + 1) An inode state flag ``I_PINNING_NETFS_WB`` is set by netfslib when the 140 + pinning begins (when a folio is dirtied, for example) if the cache is 141 + active to stop the cache structures from being discarded and the cache 142 + space from being culled. This also prevents re-getting of cache resources 143 + if the flag is already set. 144 + 145 + 2) This flag is then cleared inside the inode lock during inode writeback in the 146 + VM - and the fact that it was set is transferred to ``->unpinned_netfs_wb`` 147 + in ``struct writeback_control``. 148 + 149 + 3) If ``->unpinned_netfs_wb`` is now set, the write_inode procedure is forced. 150 + 151 + 4) The filesystem's ``->write_inode()`` function is invoked to do the cleanup. 152 + 153 + 5) The filesystem invokes netfs to do its cleanup. 154 + 155 + To do the cleanup, netfslib provides a function to do the resource unpinning:: 156 + 157 + int netfs_unpin_writeback(struct inode *inode, struct writeback_control *wbc); 158 + 159 + If the filesystem doesn't need to do anything else, this may be set as its 160 + ``.write_inode`` method. 161 + 162 + Further, if an inode is deleted, the filesystem's write_inode method may not 163 + get called, so:: 164 + 165 + void netfs_clear_inode_writeback(struct inode *inode, const void *aux); 166 + 167 + must be called from ``->evict_inode()`` *before* ``clear_inode()`` is called. 319 168 320 169 321 - Read Helper Functions 170 + High-Level VFS API 171 + ================== 172 + 173 + Netfslib provides a number of sets of API calls for the filesystem to delegate 174 + VFS operations to. Netfslib, in turn, will call out to the filesystem and the 175 + cache to negotiate I/O sizes, issue RPCs and provide places for it to intervene 176 + at various times. 
177 + 178 + Unlocked Read/Write Iter 179 + ------------------------ 180 + 181 + The first API set is for the delegation of operations to netfslib when the 182 + filesystem is called through the standard VFS read/write_iter methods:: 183 + 184 + ssize_t netfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter); 185 + ssize_t netfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from); 186 + ssize_t netfs_buffered_read_iter(struct kiocb *iocb, struct iov_iter *iter); 187 + ssize_t netfs_unbuffered_read_iter(struct kiocb *iocb, struct iov_iter *iter); 188 + ssize_t netfs_unbuffered_write_iter(struct kiocb *iocb, struct iov_iter *from); 189 + 190 + They can be assigned directly to ``.read_iter`` and ``.write_iter``. They 191 + perform the inode locking themselves and the first two will switch between 192 + buffered I/O and DIO as appropriate. 193 + 194 + Pre-Locked Read/Write Iter 195 + -------------------------- 196 + 197 + The second API set is for the delegation of operations to netfslib when the 198 + filesystem is called through the standard VFS methods, but needs to do some 199 + other stuff before or after calling netfslib whilst still inside locked section 200 + (e.g. Ceph negotiating caps). The unbuffered read function is:: 201 + 202 + ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *iter); 203 + 204 + This must not be assigned directly to ``.read_iter`` and the filesystem is 205 + responsible for performing the inode locking before calling it. In the case of 206 + buffered read, the filesystem should use ``filemap_read()``. 
207 + 208 + There are three functions for writes:: 209 + 210 + ssize_t netfs_buffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *from, 211 + struct netfs_group *netfs_group); 212 + ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter, 213 + struct netfs_group *netfs_group); 214 + ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *iter, 215 + struct netfs_group *netfs_group); 216 + 217 + These must not be assigned directly to ``.write_iter`` and the filesystem is 218 + responsible for performing the inode locking before calling them. 219 + 220 + The first two functions are for buffered writes; the first just adds some 221 + standard write checks and jumps to the second, but if the filesystem wants to 222 + do the checks itself, it can use the second directly. The third function is 223 + for unbuffered or DIO writes. 224 + 225 + On all three write functions, there is a writeback group pointer (which should 226 + be NULL if the filesystem doesn't use this). Writeback groups are set on 227 + folios when they're modified. If a folio to-be-modified is already marked with 228 + a different group, it is flushed first. The writeback API allows writing back 229 + of a specific group. 230 + 231 + Memory-Mapped I/O API 322 232 --------------------- 323 233 324 - Three read helpers are provided:: 234 + An API for support of mmap()'d I/O is provided:: 325 235 326 - void netfs_readahead(struct readahead_control *ractl); 327 - int netfs_read_folio(struct file *file, 328 - struct folio *folio); 329 - int netfs_write_begin(struct netfs_inode *ctx, 330 - struct file *file, 331 - struct address_space *mapping, 332 - loff_t pos, 333 - unsigned int len, 334 - struct folio **_folio, 335 - void **_fsdata); 236 + vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group); 336 237 337 - Each corresponds to a VM address space operation. These operations use the 338 - state in the per-inode context. 
238 + This allows the filesystem to delegate ``.page_mkwrite`` to netfslib. The 239 + filesystem should not take the inode lock before calling it, but, as with the 240 + locked write functions above, this does take a writeback group pointer. If the 241 + page to be made writable is in a different group, it will be flushed first. 339 242 340 - For ->readahead() and ->read_folio(), the network filesystem just point directly 341 - at the corresponding read helper; whereas for ->write_begin(), it may be a 342 - little more complicated as the network filesystem might want to flush 343 - conflicting writes or track dirty data and needs to put the acquired folio if 344 - an error occurs after calling the helper. 243 + Monolithic Files API 244 + -------------------- 345 245 346 - The helpers manage the read request, calling back into the network filesystem 347 - through the supplied table of operations. Waits will be performed as 348 - necessary before returning for helpers that are meant to be synchronous. 246 + There is also a special API set for files for which the content must be read in 247 + a single RPC (and not written back) and is maintained as a monolithic blob 248 + (e.g. an AFS directory), though it can be stored and updated in the local cache:: 349 249 350 - If an error occurs, the ->free_request() will be called to clean up the 351 - netfs_io_request struct allocated. If some parts of the request are in 352 - progress when an error occurs, the request will get partially completed if 353 - sufficient data is read. 
250 + ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_iter *iter); 251 + void netfs_single_mark_inode_dirty(struct inode *inode); 252 + int netfs_writeback_single(struct address_space *mapping, 253 + struct writeback_control *wbc, 254 + struct iov_iter *iter); 354 255 355 - Additionally, there is:: 256 + The first function reads from a file into the given buffer, reading from the 257 + cache in preference if the data is cached there; the second function allows the 258 + inode to be marked dirty, causing a later writeback; and the third function can 259 + be called from the writeback code to write the data to the cache, if there is 260 + one. 356 261 357 - * void netfs_subreq_terminated(struct netfs_io_subrequest *subreq, 358 - ssize_t transferred_or_error, 359 - bool was_async); 262 + The inode should be marked ``NETFS_ICTX_SINGLE_NO_UPLOAD`` if this API is to be 263 + used. The writeback function requires the buffer to be of ITER_FOLIOQ type. 360 264 361 - which should be called to complete a read subrequest. This is given the number 362 - of bytes transferred or a negative error code, plus a flag indicating whether 363 - the operation was asynchronous (ie. whether the follow-on processing can be 364 - done in the current context, given this may involve sleeping). 265 + High-Level VM API 266 + ================== 267 + 268 + Netfslib also provides a number of sets of API calls for the filesystem to 269 + delegate VM operations to. 
Again, netfslib, in turn, will call out to the 270 + filesystem and the cache to negotiate I/O sizes, issue RPCs and provide places 271 + for it to intervene at various times:: 272 + 273 + void netfs_readahead(struct readahead_control *); 274 + int netfs_read_folio(struct file *, struct folio *); 275 + int netfs_writepages(struct address_space *mapping, 276 + struct writeback_control *wbc); 277 + bool netfs_dirty_folio(struct address_space *mapping, struct folio *folio); 278 + void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length); 279 + bool netfs_release_folio(struct folio *folio, gfp_t gfp); 280 + 281 + These are ``address_space_operations`` methods and can be set directly in the 282 + operations table. 283 + 284 + Deprecated PG_private_2 API 285 + --------------------------- 286 + 287 + There is also a deprecated function for filesystems that still use the 288 + ``->write_begin`` method:: 289 + 290 + int netfs_write_begin(struct netfs_inode *inode, struct file *file, 291 + struct address_space *mapping, loff_t pos, unsigned int len, 292 + struct folio **_folio, void **_fsdata); 293 + 294 + It uses the deprecated PG_private_2 flag and so should not be used. 365 295 366 296 367 - Read Helper Structures 368 - ---------------------- 297 + I/O Request API 298 + =============== 369 299 370 - The read helpers make use of a couple of structures to maintain the state of 371 - the read. The first is a structure that manages a read request as a whole:: 300 + The I/O request API comprises a number of structures and a number of functions 301 + that the filesystem may need to use. 
+
+Request Structure
+-----------------
+
+The request structure manages the request as a whole, holding some resources
+and state on behalf of the filesystem and tracking the collection of results::

 	struct netfs_io_request {
+		enum netfs_io_origin origin;
 		struct inode *inode;
 		struct address_space *mapping;
-		struct netfs_cache_resources cache_resources;
+		struct netfs_group *group;
+		struct netfs_io_stream io_streams[];
 		void *netfs_priv;
-		loff_t start;
-		size_t len;
-		loff_t i_size;
-		const struct netfs_request_ops *netfs_ops;
+		void *netfs_priv2;
+		unsigned long long start;
+		unsigned long long len;
+		unsigned long long i_size;
 		unsigned int debug_id;
+		unsigned long flags;
 		...
 	};

-The above fields are the ones the netfs can use. They are:
+Many of the fields are for internal use, but the fields shown here are of
+interest to the filesystem:
+
+ * ``origin``
+
+   The origin of the request (readahead, read_folio, DIO read, writeback, ...).

  * ``inode``
  * ``mapping``
···
    The inode and the address space of the file being read from. The mapping
    may or may not point to inode->i_data.

- * ``cache_resources``
+ * ``group``

-   Resources for the local cache to use, if present.
+   The writeback group this request is dealing with or NULL. This holds a ref
+   on the group.
+
+ * ``io_streams``
+
+   The parallel streams of subrequests available to the request. Currently two
+   are available, but this may be made extensible in future. ``NR_IO_STREAMS``
+   indicates the size of the array.

  * ``netfs_priv``
+ * ``netfs_priv2``

    The network filesystem's private data. The value for this can be passed in
    to the helper functions or set during the request.
···

    The size of the file at the start of the request.

- * ``netfs_ops``
-
-   A pointer to the operation table. The value for this is passed into the
-   helper functions.
-
  * ``debug_id``

    A number allocated to this operation that can be displayed in trace lines
    for reference.

+ * ``flags``

-The second structure is used to manage individual slices of the overall read
-request::
+   Flags for managing and controlling the operation of the request. Some of
+   these may be of interest to the filesystem:

-	struct netfs_io_subrequest {
-		struct netfs_io_request *rreq;
-		loff_t start;
-		size_t len;
-		size_t transferred;
-		unsigned long flags;
-		unsigned short debug_index;
+   * ``NETFS_RREQ_RETRYING``
+
+     Netfslib sets this when generating retries.
+
+   * ``NETFS_RREQ_PAUSE``
+
+     The filesystem can set this to request to pause the library's subrequest
+     issuing loop - but care needs to be taken as netfslib may also set it.
+
+   * ``NETFS_RREQ_NONBLOCK``
+   * ``NETFS_RREQ_BLOCKED``
+
+     Netfslib sets the first to indicate that non-blocking mode was set by the
+     caller and the filesystem can set the second to indicate that it would
+     have had to block.
+
+   * ``NETFS_RREQ_USE_PGPRIV2``
+
+     The filesystem can set this if it wants to use PG_private_2 to track
+     whether a folio is being written to the cache. This is deprecated as
+     PG_private_2 is going to go away.
+
+If the filesystem wants more private data than is afforded by this structure,
+then it should wrap it and provide its own allocator.
+
+Stream Structure
+----------------
+
+A request is comprised of one or more parallel streams and each stream may be
+aimed at a different target.
+
+For read requests, only stream 0 is used. This can contain a mixture of
+subrequests aimed at different sources. For write requests, stream 0 is used
+for the server and stream 1 is used for the cache. For buffered writeback,
+stream 0 is not enabled unless a normal dirty folio is encountered, at which
+point ->begin_writeback() will be invoked and the filesystem can mark the
+stream available.
+
+The stream struct looks like::
+
+	struct netfs_io_stream {
+		unsigned char stream_nr;
+		bool avail;
+		size_t sreq_max_len;
+		unsigned int sreq_max_segs;
+		unsigned int submit_extendable_to;
 		...
 	};

-Each subrequest is expected to access a single source, though the helpers will
+A number of members are available for access/use by the filesystem:
+
+ * ``stream_nr``
+
+   The number of the stream within the request.
+
+ * ``avail``
+
+   True if the stream is available for use. The filesystem should set this on
+   stream zero if in ->begin_writeback().
+
+ * ``sreq_max_len``
+ * ``sreq_max_segs``
+
+   These are set by the filesystem or the cache in ->prepare_read() or
+   ->prepare_write() for each subrequest to indicate the maximum number of
+   bytes and, optionally, the maximum number of segments (if not 0) that that
+   subrequest can support.
+
+ * ``submit_extendable_to``
+
+   The size that a subrequest can be rounded up to beyond the EOF, given the
+   available buffer. This allows the cache to work out if it can do a DIO read
+   or write that straddles the EOF marker.
+
+Subrequest Structure
+--------------------
+
+Individual units of I/O are managed by the subrequest structure. These
+represent slices of the overall request and run independently::
+
+	struct netfs_io_subrequest {
+		struct netfs_io_request *rreq;
+		struct iov_iter io_iter;
+		unsigned long long start;
+		size_t len;
+		size_t transferred;
+		unsigned long flags;
+		short error;
+		unsigned short debug_index;
+		unsigned char stream_nr;
+		...
+	};
+
+Each subrequest is expected to access a single source, though the library will
 handle falling back from one source type to another. The members are:

  * ``rreq``

    A pointer to the read request.
+
+ * ``io_iter``
+
+   An I/O iterator representing a slice of the buffer to be read into or
+   written from.

  * ``start``
  * ``len``
···

  * ``transferred``

-   The amount of data transferred so far of the length of this slice. The
-   network filesystem or cache should start the operation this far into the
-   slice. If a short read occurs, the helpers will call again, having updated
-   this to reflect the amount read so far.
+   The amount of data transferred so far for this subrequest. This should be
+   added to with the length of the transfer made by this issuance of the
+   subrequest. If this is less than ``len`` then the subrequest may be
+   reissued to continue.

  * ``flags``

-   Flags pertaining to the read. There are two of interest to the filesystem
-   or cache:
+   Flags for managing the subrequest. A number are of interest to the
+   filesystem or cache:
+
+   * ``NETFS_SREQ_MADE_PROGRESS``
+
+     Set by the filesystem to indicate that at least one byte of data was read
+     or written.
+
+   * ``NETFS_SREQ_HIT_EOF``
+
+     The filesystem should set this if a read hit the EOF on the file (in which
+     case ``transferred`` should stop at the EOF). Netfslib may expand the
+     subrequest out to the size of the folio containing the EOF on the off
+     chance that a third party change happened or a DIO read may have asked for
+     more than is available. The library will clear any excess pagecache.

    * ``NETFS_SREQ_CLEAR_TAIL``

-     This can be set to indicate that the remainder of the slice, from
-     transferred to len, should be cleared.
+     The filesystem can set this to indicate that the remainder of the slice,
+     from transferred to len, should be cleared. Do not set if HIT_EOF is set.
+
+   * ``NETFS_SREQ_NEED_RETRY``
+
+     The filesystem can set this to tell netfslib to retry the subrequest.
+
+   * ``NETFS_SREQ_BOUNDARY``
+
+     This can be set by the filesystem on a subrequest to indicate that it ends
+     at a boundary with the filesystem structure (e.g. at the end of a Ceph
+     object). It tells netfslib not to retile subrequests across it.

    * ``NETFS_SREQ_SEEK_DATA_READ``

-     This is a hint to the cache that it might want to try skipping ahead to
-     the next data (ie. using SEEK_DATA).
+     This is a hint from netfslib to the cache that it might want to try
+     skipping ahead to the next data (ie. using SEEK_DATA).
+
+ * ``error``
+
+   This is for the filesystem to store the result of the subrequest. It should
+   be set to 0 if successful and a negative error code otherwise.

  * ``debug_index``
+ * ``stream_nr``

    A number allocated to this slice that can be displayed in trace lines for
-   reference.
+   reference and the number of the request stream that it belongs to.

+If necessary, the filesystem can get and put extra refs on the subrequest it is
+given::

-Read Helper Operations
-----------------------
+	void netfs_get_subrequest(struct netfs_io_subrequest *subreq,
+				  enum netfs_sreq_ref_trace what);
+	void netfs_put_subrequest(struct netfs_io_subrequest *subreq,
+				  enum netfs_sreq_ref_trace what);

-The network filesystem must provide the read helpers with a table of operations
-through which it can issue requests and negotiate::
+using netfs trace codes to indicate the reason. Care must be taken, however,
+as once control of the subrequest is returned to netfslib, the same subrequest
+can be reissued/retried.
+
+Filesystem Methods
+------------------
+
+The filesystem sets a table of operations in ``netfs_inode`` for netfslib to
+use::

 	struct netfs_request_ops {
-		void (*init_request)(struct netfs_io_request *rreq, struct file *file);
+		mempool_t *request_pool;
+		mempool_t *subrequest_pool;
+		int (*init_request)(struct netfs_io_request *rreq, struct file *file);
 		void (*free_request)(struct netfs_io_request *rreq);
+		void (*free_subrequest)(struct netfs_io_subrequest *rreq);
 		void (*expand_readahead)(struct netfs_io_request *rreq);
-		bool (*clamp_length)(struct netfs_io_subrequest *subreq);
+		int (*prepare_read)(struct netfs_io_subrequest *subreq);
 		void (*issue_read)(struct netfs_io_subrequest *subreq);
-		bool (*is_still_valid)(struct netfs_io_request *rreq);
-		int (*check_write_begin)(struct file *file, loff_t pos, unsigned len,
-					 struct folio **foliop, void **_fsdata);
 		void (*done)(struct netfs_io_request *rreq);
+		void (*update_i_size)(struct inode *inode, loff_t i_size);
+		void (*post_modify)(struct inode *inode);
+		void (*begin_writeback)(struct netfs_io_request *wreq);
+		void (*prepare_write)(struct netfs_io_subrequest *subreq);
+		void (*issue_write)(struct netfs_io_subrequest *subreq);
+		void (*retry_request)(struct netfs_io_request *wreq,
+				      struct netfs_io_stream *stream);
+		void (*invalidate_cache)(struct netfs_io_request *wreq);
 	};

-The operations are as follows:
+The table starts with a pair of optional pointers to memory pools from which
+requests and subrequests can be allocated. If these are not given, netfslib
+has default pools that it will use instead. If the filesystem wraps the netfs
+structs in its own larger structs, then it will need to use its own pools.
+Netfslib will allocate directly from the pools.
+
+The methods defined in the table are:

  * ``init_request()``
-
-   [Optional] This is called to initialise the request structure. It is given
-   the file for reference.
-
  * ``free_request()``
+ * ``free_subrequest()``

-   [Optional] This is called as the request is being deallocated so that the
-   filesystem can clean up any state it has attached there.
+   [Optional] A filesystem may implement these to initialise or clean up any
+   resources that it attaches to the request or subrequest.

  * ``expand_readahead()``

    [Optional] This is called to allow the filesystem to expand the size of a
-   readahead read request. The filesystem gets to expand the request in both
-   directions, though it's not permitted to reduce it as the numbers may
-   represent an allocation already made. If local caching is enabled, it gets
-   to expand the request first.
+   readahead request. The filesystem gets to expand the request in both
+   directions, though it must retain the initial region as that may represent
+   an allocation already made. If local caching is enabled, it gets to expand
+   the request first.

    Expansion is communicated by changing ->start and ->len in the request
    structure. Note that if any change is made, ->len must be increased by at
    least as much as ->start is reduced.

- * ``clamp_length()``
+ * ``prepare_read()``

-   [Optional] This is called to allow the filesystem to reduce the size of a
-   subrequest. The filesystem can use this, for example, to chop up a request
-   that has to be split across multiple servers or to put multiple reads in
-   flight.
+   [Optional] This is called to allow the filesystem to limit the size of a
+   subrequest. It may also limit the number of individual regions in the
+   iterator, such as required by RDMA. This information should be set on
+   stream zero in::

-   This should return 0 on success and an error code on error.
+	rreq->io_streams[0].sreq_max_len
+	rreq->io_streams[0].sreq_max_segs
+
+   The filesystem can use this, for example, to chop up a request that has to
+   be split across multiple servers or to put multiple reads in flight.
+
+   Zero should be returned on success and an error code otherwise.

  * ``issue_read()``

-   [Required] The helpers use this to dispatch a subrequest to the server for
+   [Required] Netfslib calls this to dispatch a subrequest to the server for
    reading. In the subrequest, ->start, ->len and ->transferred indicate what
-   data should be read from the server.
+   data should be read from the server and ->io_iter indicates the buffer to be
+   used.

-   There is no return value; the netfs_subreq_terminated() function should be
-   called to indicate whether or not the operation succeeded and how much data
-   it transferred. The filesystem also should not deal with setting folios
-   uptodate, unlocking them or dropping their refs - the helpers need to deal
-   with this as they have to coordinate with copying to the local cache.
+   There is no return value; the ``netfs_read_subreq_terminated()`` function
+   should be called to indicate that the subrequest completed either way.
+   ->error, ->transferred and ->flags should be updated before completing. The
+   termination can be done asynchronously.

-   Note that the helpers have the folios locked, but not pinned. It is
-   possible to use the ITER_XARRAY iov iterator to refer to the range of the
-   inode that is being operated upon without the need to allocate large bvec
-   tables.
+   Note: the filesystem must not deal with setting folios uptodate, unlocking
+   them or dropping their refs - the library deals with this as it may have to
+   stitch together the results of multiple subrequests that variously overlap
+   the set of folios.

- * ``is_still_valid()``
+ * ``done()``

-   [Optional] This is called to find out if the data just read from the local
-   cache is still valid. It should return true if it is still valid and false
-   if not. If it's not still valid, it will be reread from the server.
-
- * ``check_write_begin()``
-
-   [Optional] This is called from the netfs_write_begin() helper once it has
-   allocated/grabbed the folio to be modified to allow the filesystem to flush
-   conflicting state before allowing it to be modified.
-
-   It may unlock and discard the folio it was given and set the caller's folio
-   pointer to NULL. It should return 0 if everything is now fine (``*foliop``
-   left set) or the op should be retried (``*foliop`` cleared) and any other
-   error code to abort the operation.
-
- * ``done``
-
-   [Optional] This is called after the folios in the request have all been
+   [Optional] This is called after the folios in a read request have all been
    unlocked (and marked uptodate if applicable).

+ * ``update_i_size()``

+   [Optional] This is invoked by netfslib at various points during the write
+   paths to ask the filesystem to update its idea of the file size. If not
+   given, netfslib will set i_size and i_blocks and update the local cache
+   cookie.
+
+ * ``post_modify()``

-Read Helper Procedure
----------------------
+   [Optional] This is called after netfslib writes to the pagecache or when it
+   allows an mmap'd page to be marked as writable.
+
+ * ``begin_writeback()``

-The read helpers work by the following general procedure:
+   [Optional] Netfslib calls this when processing a writeback request if it
+   finds a dirty page that isn't simply marked NETFS_FOLIO_COPY_TO_CACHE,
+   indicating it must be written to the server. This allows the filesystem to
+   only set up writeback resources when it knows it's going to have to perform
+   a write.
+
+ * ``prepare_write()``

- * Set up the request.
+   [Optional] This is called to allow the filesystem to limit the size of a
+   subrequest. It may also limit the number of individual regions in the
+   iterator, such as required by RDMA. This information should be set on the
+   stream to which the subrequest belongs::

- * For readahead, allow the local cache and then the network filesystem to
-   propose expansions to the read request. This is then proposed to the VM.
-   If the VM cannot fully perform the expansion, a partially expanded read will
-   be performed, though this may not get written to the cache in its entirety.
+	rreq->io_streams[subreq->stream_nr].sreq_max_len
+	rreq->io_streams[subreq->stream_nr].sreq_max_segs

- * Loop around slicing chunks off of the request to form subrequests:
+   The filesystem can use this, for example, to chop up a request that has to
+   be split across multiple servers or to put multiple writes in flight.

-   * If a local cache is present, it gets to do the slicing, otherwise the
-     helpers just try to generate maximal slices.
+   This is not permitted to return an error. Instead, in the event of failure,
+   ``netfs_prepare_write_failed()`` must be called.

-   * The network filesystem gets to clamp the size of each slice if it is to be
-     the source. This allows rsize and chunking to be implemented.
+ * ``issue_write()``

-   * The helpers issue a read from the cache or a read from the server or just
-     clears the slice as appropriate.
+   [Required] This is used to dispatch a subrequest to the server for writing.
+   In the subrequest, ->start, ->len and ->transferred indicate what data
+   should be written to the server and ->io_iter indicates the buffer to be
+   used.

-   * The next slice begins at the end of the last one.
+   There is no return value; the ``netfs_write_subreq_terminated()`` function
+   should be called to indicate that the subrequest completed either way.
+   ->error, ->transferred and ->flags should be updated before completing. The
+   termination can be done asynchronously.

-   * As slices finish being read, they terminate.
+   Note: the filesystem must not deal with removing the dirty or writeback
+   marks on folios involved in the operation and should not take refs or pins
+   on them, but should leave retention to netfslib.

- * When all the subrequests have terminated, the subrequests are assessed and
-   any that are short or have failed are reissued:
+ * ``retry_request()``

-   * Failed cache requests are issued against the server instead.
+   [Optional] Netfslib calls this at the beginning of a retry cycle. This
+   allows the filesystem to examine the state of the request, the subrequests
+   in the indicated stream and of its own data and make adjustments or
+   renegotiate resources.
+
+ * ``invalidate_cache()``

-   * Failed server requests just fail.
+   [Optional] This is called by netfslib to invalidate data stored in the local
+   cache in the event that writing to the local cache fails, providing updated
+   coherency data that netfs can't provide.

-   * Short reads against either source will be reissued against that source
-     provided they have transferred some more data:
+Terminating a subrequest
+------------------------

-     * The cache may need to skip holes that it can't do DIO from.
+When a subrequest completes, there are a number of functions that the cache or
+subrequest can call to inform netfslib of the status change. One function is
+provided to terminate a write subrequest at the preparation stage and acts
+synchronously:

-     * If NETFS_SREQ_CLEAR_TAIL was set, a short read will be cleared to the
-       end of the slice instead of reissuing.
+ * ``void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq);``

- * Once the data is read, the folios that have been fully read/cleared:
+   Indicate that the ->prepare_write() call failed. The ``error`` field should
+   have been updated.

-   * Will be marked uptodate.
+Note that ->prepare_read() can return an error as a read can simply be aborted.
+Dealing with writeback failure is trickier.

-   * If a cache is present, will be marked with PG_fscache.
+The other functions are used for subrequests that got as far as being issued:

-   * Unlocked
+ * ``void netfs_read_subreq_terminated(struct netfs_io_subrequest *subreq);``

-   * Any folios that need writing to the cache will then have DIO writes issued.
+   Tell netfslib that a read subrequest has terminated. The ``error``,
+   ``flags`` and ``transferred`` fields should have been updated.

- * Synchronous operations will wait for reading to be complete.
+ * ``void netfs_write_subrequest_terminated(void *_op, ssize_t transferred_or_error);``

- * Writes to the cache will proceed asynchronously and the folios will have the
-   PG_fscache mark removed when that completes.
+   Tell netfslib that a write subrequest has terminated. Either the amount of
+   data processed or the negative error code can be passed in. This can be
+   used as a kiocb completion function.

- * The request structures will be cleaned up when everything has completed.
+ * ``void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq);``

+   This is provided to optionally update netfslib on the incremental progress
+   of a read, allowing some folios to be unlocked early and does not actually
+   terminate the subrequest. The ``transferred`` field should have been
+   updated.

-Read Helper Cache API
----------------------
+Local Cache API
+---------------

-When implementing a local cache to be used by the read helpers, two things are
-required: some way for the network filesystem to initialise the caching for a
-read request and a table of operations for the helpers to call.
+Netfslib provides a separate API for a local cache to implement, though it
+provides some somewhat similar routines to the filesystem request API.

-To begin a cache operation on an fscache object, the following function is
-called::
-
-	int fscache_begin_read_operation(struct netfs_io_request *rreq,
-					 struct fscache_cookie *cookie);
-
-passing in the request pointer and the cookie corresponding to the file. This
-fills in the cache resources mentioned below.
-
-The netfs_io_request object contains a place for the cache to hang its
+Firstly, the netfs_io_request object contains a place for the cache to hang its
 state::

 	struct netfs_cache_resources {
 		const struct netfs_cache_ops *ops;
 		void *cache_priv;
 		void *cache_priv2;
+		unsigned int debug_id;
+		unsigned int inval_counter;
 	};

-This contains an operations table pointer and two private pointers. The
-operation table looks like the following::
+This contains an operations table pointer and two private pointers plus the
+debug ID of the fscache cookie for tracing purposes and an invalidation counter
+that is cranked by calls to ``fscache_invalidate()`` allowing cache subrequests
+to be invalidated after completion.
+
+The cache operation table looks like the following::

 	struct netfs_cache_ops {
 		void (*end_operation)(struct netfs_cache_resources *cres);
-
 		void (*expand_readahead)(struct netfs_cache_resources *cres,
 					 loff_t *_start, size_t *_len, loff_t i_size);
-
 		enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq,
-						       loff_t i_size);
-
+						     loff_t i_size);
 		int (*read)(struct netfs_cache_resources *cres,
 			    loff_t start_pos,
 			    struct iov_iter *iter,
 			    bool seek_data,
 			    netfs_io_terminated_t term_func,
 			    void *term_func_priv);
-
-		int (*prepare_write)(struct netfs_cache_resources *cres,
-				     loff_t *_start, size_t *_len, loff_t i_size,
-				     bool no_space_allocated_yet);
-
-		int (*write)(struct netfs_cache_resources *cres,
-			     loff_t start_pos,
-			     struct iov_iter *iter,
-			     netfs_io_terminated_t term_func,
-			     void *term_func_priv);
-
-		int (*query_occupancy)(struct netfs_cache_resources *cres,
-				       loff_t start, size_t len, size_t granularity,
-				       loff_t *_data_start, size_t *_data_len);
+		void (*prepare_write_subreq)(struct netfs_io_subrequest *subreq);
+		void (*issue_write)(struct netfs_io_subrequest *subreq);
 	};

 With a termination handler function pointer::
···

  * ``expand_readahead()``

-   [Optional] Called at the beginning of a netfs_readahead() operation to allow
-   the cache to expand a request in either direction. This allows the cache to
+   [Optional] Called at the beginning of a readahead operation to allow the
+   cache to expand a request in either direction. This allows the cache to
    size the request appropriately for the cache granularity.
+
+ * ``prepare_read()``
+
+   [Required] Called to configure the next slice of a request. ->start and
+   ->len in the subrequest indicate where and how big the next slice can be;
+   the cache gets to reduce the length to match its granularity requirements.

    The function is passed pointers to the start and length in its parameters,
    plus the size of the file for reference, and adjusts the start and length
···
    downloaded from the server or read from the cache - or whether slicing
    should be given up at the current point.

- * ``prepare_read()``
-
-   [Required] Called to configure the next slice of a request. ->start and
-   ->len in the subrequest indicate where and how big the next slice can be;
-   the cache gets to reduce the length to match its granularity requirements.
-
  * ``read()``

    [Required] Called to read from the cache. The start file offset is given
···
    indicating whether the termination is definitely happening in the caller's
    context.

- * ``prepare_write()``
+ * ``prepare_write_subreq()``

-   [Required] Called to prepare a write to the cache to take place. This
-   involves checking to see whether the cache has sufficient space to honour
-   the write. ``*_start`` and ``*_len`` indicate the region to be written; the
-   region can be shrunk or it can be expanded to a page boundary either way as
-   necessary to align for direct I/O. i_size holds the size of the object and
-   is provided for reference. no_space_allocated_yet is set to true if the
-   caller is certain that no data has been written to that region - for example
-   if it tried to do a read from there already.
+   [Required] This is called to allow the cache to limit the size of a
+   subrequest. It may also limit the number of individual regions in the
+   iterator, such as required by DIO/DMA. This information should be set on
+   the stream to which the subrequest belongs::

- * ``write()``
+	rreq->io_streams[subreq->stream_nr].sreq_max_len
+	rreq->io_streams[subreq->stream_nr].sreq_max_segs

-   [Required] Called to write to the cache. The start file offset is given
-   along with an iterator to write from, which gives the length also.
+   The filesystem can use this, for example, to chop up a request that has to
+   be split across multiple servers or to put multiple writes in flight.

-   Also provided is a pointer to a termination handler function and private
-   data to pass to that function. The termination function should be called
-   with the number of bytes transferred or an error code, plus a flag
-   indicating whether the termination is definitely happening in the caller's
-   context.
+   This is not permitted to return an error. In the event of failure,
+   ``netfs_prepare_write_failed()`` must be called.

- * ``query_occupancy()``
+ * ``issue_write()``

-   [Required] Called to find out where the next piece of data is within a
-   particular region of the cache. The start and length of the region to be
-   queried are passed in, along with the granularity to which the answer needs
-   to be aligned. The function passes back the start and length of the data,
-   if any, available within that region. Note that there may be a hole at the
-   front.
+   [Required] This is used to dispatch a subrequest to the cache for writing.
+   In the subrequest, ->start, ->len and ->transferred indicate what data
+   should be written to the cache and ->io_iter indicates the buffer to be
+   used.

-   It returns 0 if some data was found, -ENODATA if there was no usable data
-   within the region or -ENOBUFS if there is no caching on this file.
-
-Note that these methods are passed a pointer to the cache resource structure,
-not the read request structure as they could be used in other situations where
-there isn't a read request structure as well, such as writing dirty data to the
-cache.
+   There is no return value; the ``netfs_write_subreq_terminated()`` function
+   should be called to indicate that the subrequest completed either way.
+   ->error, ->transferred and ->flags should be updated before completing. The
+   termination can be done asynchronously.


 API Function Reference
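The documentation above asks the filesystem to update ``transferred``, ``flags`` and ``error`` before terminating a read subrequest. The following is a userspace sketch of that bookkeeping only, not kernel code: `struct model_subreq`, the `MODEL_SREQ_*` bits and `model_read_done()` are hypothetical names invented for illustration, and the "may be reissued" return value is a modelling assumption rather than netfslib's actual retry policy.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the NETFS_SREQ_* flag bits (values arbitrary). */
#define MODEL_SREQ_MADE_PROGRESS (1UL << 0)
#define MODEL_SREQ_HIT_EOF       (1UL << 1)

/* Cut-down model of the subrequest fields the filesystem is told to update. */
struct model_subreq {
	unsigned long long start;   /* file offset of the slice */
	size_t len;                 /* length of the slice */
	size_t transferred;         /* bytes completed so far */
	unsigned long flags;
	short error;                /* 0 or a negative error code */
};

/*
 * Record the outcome of one issuance of a read: either 'got' bytes arrived
 * or 'err' holds a negative error code.  Returns true if the slice is still
 * incomplete and could be reissued (an assumption of this model).
 */
static bool model_read_done(struct model_subreq *s, size_t got, short err,
			    unsigned long long file_size)
{
	s->error = err;
	if (err < 0)
		return false;

	/* Never account for more than the slice can hold. */
	if (got > s->len - s->transferred)
		got = s->len - s->transferred;

	if (got > 0) {
		s->transferred += got;          /* add to, do not overwrite */
		s->flags |= MODEL_SREQ_MADE_PROGRESS;
	}

	/* transferred stops at the EOF, which is then flagged. */
	if (s->start + s->transferred >= file_size)
		s->flags |= MODEL_SREQ_HIT_EOF;

	return s->transferred < s->len && !(s->flags & MODEL_SREQ_HIT_EOF);
}
```

In this model a 100-byte slice that receives 40 bytes reports that it can continue, and a second issuance delivering the remaining 60 bytes completes it; a slice that runs into the end of the file is flagged and not reissued.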

+45

fs/anon_inodes.c

···

 #include <linux/uaccess.h>

+#include "internal.h"
+
 static struct vfsmount *anon_inode_mnt __ro_after_init;
 static struct inode *anon_inode_inode __ro_after_init;
+
+/*
+ * User space expects anonymous inodes to have no file type in st_mode.
+ *
+ * In particular, 'lsof' has this legacy logic:
+ *
+ *	type = s->st_mode & S_IFMT;
+ *	switch (type) {
+ *	  ...
+ *	case 0:
+ *		if (!strcmp(p, "anon_inode"))
+ *			Lf->ntype = Ntype = N_ANON_INODE;
+ *
+ * to detect our old anon_inode logic.
+ *
+ * Rather than mess with our internal sane inode data, just fix it
+ * up here in getattr() by masking off the format bits.
+ */
+int anon_inode_getattr(struct mnt_idmap *idmap, const struct path *path,
+		       struct kstat *stat, u32 request_mask,
+		       unsigned int query_flags)
+{
+	struct inode *inode = d_inode(path->dentry);
+
+	generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
+	stat->mode &= ~S_IFMT;
+	return 0;
+}
+
+int anon_inode_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
+		       struct iattr *attr)
+{
+	return -EOPNOTSUPP;
+}
+
+static const struct inode_operations anon_inode_operations = {
+	.getattr = anon_inode_getattr,
+	.setattr = anon_inode_setattr,
+};

 /*
  * anon_inodefs_dname() is called from d_path().
···
 	struct pseudo_fs_context *ctx = init_pseudo(fc, ANON_INODE_FS_MAGIC);
 	if (!ctx)
 		return -ENOMEM;
+	fc->s_iflags |= SB_I_NOEXEC;
+	fc->s_iflags |= SB_I_NODEV;
 	ctx->dops = &anon_inodefs_dentry_operations;
 	return 0;
 }
···
 	if (IS_ERR(inode))
 		return inode;
 	inode->i_flags &= ~S_PRIVATE;
+	inode->i_op = &anon_inode_operations;
 	error = security_inode_init_security_anon(inode, &QSTR(name),
 						  context_inode);
 	if (error) {
···
 	anon_inode_inode = alloc_anon_inode(anon_inode_mnt->mnt_sb);
 	if (IS_ERR(anon_inode_inode))
 		panic("anon_inode_init() inode allocation failed (%ld)\n", PTR_ERR(anon_inode_inode));
+	anon_inode_inode->i_op = &anon_inode_operations;

 	return 0;
 }

+5

fs/internal.h

···
 void file_f_owner_release(struct file *file);
 bool file_seek_cur_needs_f_lock(struct file *file);
 int statmount_mnt_idmap(struct mnt_idmap *idmap, struct seq_file *seq, bool uid_map);
+int anon_inode_getattr(struct mnt_idmap *idmap, const struct path *path,
+		       struct kstat *stat, u32 request_mask,
+		       unsigned int query_flags);
+int anon_inode_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
+		       struct iattr *attr);

+7 -1

fs/libfs.c

···
 	 * that it already _is_ on the dirty list.
 	 */
 	inode->i_state = I_DIRTY;
-	inode->i_mode = S_IRUSR | S_IWUSR;
+	/*
+	 * Historically anonymous inodes didn't have a type at all and
+	 * userspace has come to rely on this. Internally they're just
+	 * regular files but S_IFREG is masked off when reporting
+	 * information to userspace.
+	 */
+	inode->i_mode = S_IFREG | S_IRUSR | S_IWUSR;
 	inode->i_uid = current_fsuid();
 	inode->i_gid = current_fsgid();
 	inode->i_flags |= S_PRIVATE;

+5 -5

fs/namei.c

···
 	    unlikely(link->mnt->mnt_flags & MNT_NOSYMFOLLOW))
 		return ERR_PTR(-ELOOP);

-	if (!(nd->flags & LOOKUP_RCU)) {
+	if (unlikely(atime_needs_update(&last->link, inode))) {
+		if (nd->flags & LOOKUP_RCU) {
+			if (!try_to_unlazy(nd))
+				return ERR_PTR(-ECHILD);
+		}
 		touch_atime(&last->link);
 		cond_resched();
-	} else if (atime_needs_update(&last->link, inode)) {
-		if (!try_to_unlazy(nd))
-			return ERR_PTR(-ECHILD);
-		touch_atime(&last->link);
 	}

 	error = security_inode_follow_link(link->dentry, inode,
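The restructured branch in this hunk can be compared with the old one using a small decision model. This is a userspace sketch with hypothetical names (`enum link_step`, `old_flow()`, `new_flow()`); it captures only which steps are taken for each combination of RCU-walk mode and stale atime, and leaves out the try_to_unlazy() failure path and the cond_resched() call.

```c
#include <stdbool.h>

/* Which atime-related work the symlink step performs (model only). */
enum link_step { STEP_NONE, STEP_TOUCH, STEP_UNLAZY_THEN_TOUCH };

/* Before the change: outside RCU walk, touch_atime() was always called. */
static enum link_step old_flow(bool rcu_walk, bool atime_stale)
{
	if (!rcu_walk)
		return STEP_TOUCH;
	if (atime_stale)
		return STEP_UNLAZY_THEN_TOUCH;
	return STEP_NONE;
}

/* After the change: nothing happens unless the atime actually needs updating. */
static enum link_step new_flow(bool rcu_walk, bool atime_stale)
{
	if (!atime_stale)
		return STEP_NONE;
	return rcu_walk ? STEP_UNLAZY_THEN_TOUCH : STEP_TOUCH;
}
```

In this model the two flows differ in exactly one case, a non-RCU walk whose atime does not need updating, where the new code skips the call entirely.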

+2 -24

fs/pidfs.c

···
 static int pidfs_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 			 struct iattr *attr)
 {
-	return -EOPNOTSUPP;
+	return anon_inode_setattr(idmap, dentry, attr);
 }

-
-/*
- * User space expects pidfs inodes to have no file type in st_mode.
- *
- * In particular, 'lsof' has this legacy logic:
- *
- *	type = s->st_mode & S_IFMT;
- *	switch (type) {
- *	  ...
- *	case 0:
- *		if (!strcmp(p, "anon_inode"))
- *			Lf->ntype = Ntype = N_ANON_INODE;
- *
- * to detect our old anon_inode logic.
- *
- * Rather than mess with our internal sane inode data, just fix it
- * up here in getattr() by masking off the format bits.
- */
 static int pidfs_getattr(struct mnt_idmap *idmap, const struct path *path,
 			 struct kstat *stat, u32 request_mask,
 			 unsigned int query_flags)
 {
-	struct inode *inode = d_inode(path->dentry);
-
-	generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
-	stat->mode &= ~S_IFMT;
-	return 0;
+	return anon_inode_getattr(idmap, path, stat, request_mask, query_flags);
 }

 static const struct inode_operations pidfs_inode_operations = {

+20 -15

fs/stat.c

···
 	int retval;

 	retval = security_inode_getattr(path);
-	if (retval)
+	if (unlikely(retval))
 		return retval;
 	return vfs_getattr_nosec(path, stat, request_mask, query_flags);
 }
···
 	int error;

 	error = vfs_stat(filename, &stat);
-	if (error)
+	if (unlikely(error))
 		return error;

 	return cp_old_stat(&stat, statbuf);
···
 	int error;

 	error = vfs_lstat(filename, &stat);
-	if (error)
+	if (unlikely(error))
 		return error;

 	return cp_old_stat(&stat, statbuf);
···
 SYSCALL_DEFINE2(fstat, unsigned int, fd, struct __old_kernel_stat __user *, statbuf)
 {
 	struct kstat stat;
-	int error = vfs_fstat(fd, &stat);
+	int error;

-	if (!error)
-		error = cp_old_stat(&stat, statbuf);
+	error = vfs_fstat(fd, &stat);
+	if (unlikely(error))
+		return error;

-	return error;
+	return cp_old_stat(&stat, statbuf);
 }

 #endif /* __ARCH_WANT_OLD_STAT */
···
 		struct stat __user *, statbuf)
 {
 	struct kstat stat;
-	int error = vfs_stat(filename, &stat);
+	int error;

-	if (error)
+	error = vfs_stat(filename, &stat);
+	if (unlikely(error))
 		return error;
+
 	return cp_new_stat(&stat, statbuf);
 }

···
 	int error;

 	error = vfs_lstat(filename, &stat);
-	if (error)
+	if (unlikely(error))
 		return error;

 	return cp_new_stat(&stat, statbuf);
···
 	int error;

 	error = vfs_fstatat(dfd, filename, &stat, flag);
-	if (error)
+	if (unlikely(error))
 		return error;
+
 	return cp_new_stat(&stat, statbuf);
 }
 #endif
···
 SYSCALL_DEFINE2(newfstat, unsigned int, fd, struct stat __user *, statbuf)
 {
 	struct kstat stat;
-	int error = vfs_fstat(fd, &stat);
+	int error;

-	if (!error)
-		error = cp_new_stat(&stat, statbuf);
+	error = vfs_fstat(fd, &stat);
+	if (unlikely(error))
+		return error;

-	return error;
+	return cp_new_stat(&stat, statbuf);
 }
 #endif

+1 -1

include/linux/file.h

···
 static inline void fdput(struct fd fd)
 {
-	if (fd.word & FDPUT_FPUT)
+	if (unlikely(fd.word & FDPUT_FPUT))
 		fput(fd_file(fd));
 }
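The `unlikely()` annotations added in this merge are compiler hints about the expected branch direction. They do not change which branch is taken. The following is a minimal userspace sketch, assuming a GCC- or Clang-compatible compiler that provides `__builtin_expect`. `SKETCH_FPUT_FLAG` and `needs_fput()` are hypothetical stand-ins for illustration, not the kernel's `FDPUT_FPUT` or `fdput()`.

```c
/*
 * Same shape as the kernel's unlikely() macro: tell the compiler the
 * condition is expected to be false so the hot path stays straight-line.
 */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical flag bit, standing in for a "needs release" marker. */
#define SKETCH_FPUT_FLAG 1UL

/*
 * Hypothetical helper: returns 1 when the flag bit is set, 0 otherwise.
 * The result is identical with or without the unlikely() hint.
 */
static inline int needs_fput(unsigned long word)
{
	if (unlikely(word & SKETCH_FPUT_FLAG))
		return 1;
	return 0;
}
```

The hint only influences code layout and static branch prediction, which is why it is applied to error and slow-path checks that are expected to be rare.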

+1

tools/testing/selftests/filesystems/.gitignore

···
 dnotify_test
 devpts_pts
 file_stressor
+anon_inode_test

+1 -1

tools/testing/selftests/filesystems/Makefile

···
 # SPDX-License-Identifier: GPL-2.0

 CFLAGS += $(KHDR_INCLUDES)
-TEST_GEN_PROGS := devpts_pts file_stressor
+TEST_GEN_PROGS := devpts_pts file_stressor anon_inode_test
 TEST_GEN_PROGS_EXTENDED := dnotify_test

 include ../lib.mk

+69

tools/testing/selftests/filesystems/anon_inode_test.c

···
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#define __SANE_USERSPACE_TYPES__
+
+#include <fcntl.h>
+#include <stdio.h>
+#include <sys/stat.h>
+
+#include "../kselftest_harness.h"
+#include "overlayfs/wrappers.h"
+
+TEST(anon_inode_no_chown)
+{
+	int fd_context;
+
+	fd_context = sys_fsopen("tmpfs", 0);
+	ASSERT_GE(fd_context, 0);
+
+	ASSERT_LT(fchown(fd_context, 1234, 5678), 0);
+	ASSERT_EQ(errno, EOPNOTSUPP);
+
+	EXPECT_EQ(close(fd_context), 0);
+}
+
+TEST(anon_inode_no_chmod)
+{
+	int fd_context;
+
+	fd_context = sys_fsopen("tmpfs", 0);
+	ASSERT_GE(fd_context, 0);
+
+	ASSERT_LT(fchmod(fd_context, 0777), 0);
+	ASSERT_EQ(errno, EOPNOTSUPP);
+
+	EXPECT_EQ(close(fd_context), 0);
+}
+
+TEST(anon_inode_no_exec)
+{
+	int fd_context;
+
+	fd_context = sys_fsopen("tmpfs", 0);
+	ASSERT_GE(fd_context, 0);
+
+	ASSERT_LT(execveat(fd_context, "", NULL, NULL, AT_EMPTY_PATH), 0);
+	ASSERT_EQ(errno, EACCES);
+
+	EXPECT_EQ(close(fd_context), 0);
+}
+
+TEST(anon_inode_no_open)
+{
+	int fd_context;
+
+	fd_context = sys_fsopen("tmpfs", 0);
+	ASSERT_GE(fd_context, 0);
+
+	ASSERT_GE(dup2(fd_context, 500), 0);
+	ASSERT_EQ(close(fd_context), 0);
+	fd_context = 500;
+
+	ASSERT_LT(open("/proc/self/fd/500", 0), 0);
+	ASSERT_EQ(errno, ENXIO);
+
+	EXPECT_EQ(close(fd_context), 0);
+}
+
+TEST_HARNESS_MAIN
+
