Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

docs: convert dax.txt to rst

Change the file extension and add the rst constructs to integrate this
doc to the documentation infrastructure and take advantage of rst
features.

Signed-off-by: Igor Matheus Andrade Torrente <igormtorrente@gmail.com>
Link: https://lore.kernel.org/r/20210531130515.10309-1-igormtorrente@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

authored by Igor Matheus Andrade Torrente and committed by Jonathan Corbet
acda97ac fb7b26a8

+292 -257
+291
Documentation/filesystems/dax.rst
@@ -0,0 +1,291 @@
+=======================
+Direct Access for files
+=======================
+
+Motivation
+----------
+
+The page cache is usually used to buffer reads and writes to files.
+It is also used to provide the pages which are mapped into userspace
+by a call to mmap.
+
+For block devices that are memory-like, the page cache pages would be
+unnecessary copies of the original storage. The `DAX` code removes the
+extra copy by performing reads and writes directly to the storage device.
+For file mappings, the storage device is mapped directly into userspace.
+
+
+Usage
+-----
+
+If you have a block device which supports `DAX`, you can make a filesystem
+on it as usual. The `DAX` code currently only supports files with a block
+size equal to your kernel's `PAGE_SIZE`, so you may need to specify a block
+size when creating the filesystem.
+
+Currently 3 filesystems support `DAX`: ext2, ext4 and xfs. Enabling `DAX` on them
+is different.
+
+Enabling DAX on ext2
+--------------------
+
+When mounting the filesystem, use the ``-o dax`` option on the command line or
+add 'dax' to the options in ``/etc/fstab``. This works to enable `DAX` on all files
+within the filesystem. It is equivalent to the ``-o dax=always`` behavior below.
+
+
+Enabling DAX on xfs and ext4
+----------------------------
+
+Summary
+-------
+
+ 1. There exists an in-kernel file access mode flag `S_DAX` that corresponds to
+    the statx flag `STATX_ATTR_DAX`. See the manpage for statx(2) for details
+    about this access mode.
+
+ 2. There exists a persistent flag `FS_XFLAG_DAX` that can be applied to regular
+    files and directories. This advisory flag can be set or cleared at any
+    time, but doing so does not immediately affect the `S_DAX` state.
+
+ 3. If the persistent `FS_XFLAG_DAX` flag is set on a directory, this flag will
+    be inherited by all regular files and subdirectories that are subsequently
+    created in this directory. Files and subdirectories that exist at the time
+    this flag is set or cleared on the parent directory are not modified by
+    this modification of the parent directory.
+
+ 4. There exist dax mount options which can override `FS_XFLAG_DAX` in the
+    setting of the `S_DAX` flag. Given underlying storage which supports `DAX` the
+    following hold:
+
+    ``-o dax=inode`` means "follow `FS_XFLAG_DAX`" and is the default.
+
+    ``-o dax=never`` means "never set `S_DAX`, ignore `FS_XFLAG_DAX`."
+
+    ``-o dax=always`` means "always set `S_DAX` ignore `FS_XFLAG_DAX`."
+
+    ``-o dax`` is a legacy option which is an alias for ``dax=always``.
+
+    .. warning::
+
+      The option ``-o dax`` may be removed in the future so ``-o dax=always`` is
+      the preferred method for specifying this behavior.
+
+    .. note::
+
+      Modifications to and the inheritance behavior of `FS_XFLAG_DAX` remain
+      the same even when the filesystem is mounted with a dax option. However,
+      in-core inode state (`S_DAX`) will be overridden until the filesystem is
+      remounted with dax=inode and the inode is evicted from kernel memory.
+
+ 5. The `S_DAX` policy can be changed via:
+
+    a) Setting the parent directory `FS_XFLAG_DAX` as needed before files are
+       created
+
+    b) Setting the appropriate dax="foo" mount option
+
+    c) Changing the `FS_XFLAG_DAX` flag on existing regular files and
+       directories. This has runtime constraints and limitations that are
+       described in 6) below.
+
+ 6. When changing the `S_DAX` policy via toggling the persistent `FS_XFLAG_DAX`
+    flag, the change to existing regular files won't take effect until the
+    files are closed by all processes.
+
+
+Details
+-------
+
+There are 2 per-file dax flags. One is a persistent inode setting (`FS_XFLAG_DAX`)
+and the other is a volatile flag indicating the active state of the feature
+(`S_DAX`).
+
+`FS_XFLAG_DAX` is preserved within the filesystem. This persistent config
+setting can be set, cleared and/or queried using the `FS_IOC_FS`[`GS`]`ETXATTR` ioctl
+(see ioctl_xfs_fsgetxattr(2)) or an utility such as 'xfs_io'.
+
+New files and directories automatically inherit `FS_XFLAG_DAX` from
+their parent directory **when created**. Therefore, setting `FS_XFLAG_DAX` at
+directory creation time can be used to set a default behavior for an entire
+sub-tree.
+
+To clarify inheritance, here are 3 examples:
+
+Example A:
+
+.. code-block:: shell
+
+  mkdir -p a/b/c
+  xfs_io -c 'chattr +x' a
+  mkdir a/b/c/d
+  mkdir a/e
+
+  ------[outcome]------
+
+  dax: a,e
+  no dax: b,c,d
+
+Example B:
+
+.. code-block:: shell
+
+  mkdir a
+  xfs_io -c 'chattr +x' a
+  mkdir -p a/b/c/d
+
+  ------[outcome]------
+
+  dax: a,b,c,d
+  no dax:
+
+Example C:
+
+.. code-block:: shell
+
+  mkdir -p a/b/c
+  xfs_io -c 'chattr +x' c
+  mkdir a/b/c/d
+
+  ------[outcome]------
+
+  dax: c,d
+  no dax: a,b
+
+The current enabled state (`S_DAX`) is set when a file inode is instantiated in
+memory by the kernel. It is set based on the underlying media support, the
+value of `FS_XFLAG_DAX` and the filesystem's dax mount option.
+
+statx can be used to query `S_DAX`.
+
+.. note::
+
+  That only regular files will ever have `S_DAX` set and therefore statx
+  will never indicate that `S_DAX` is set on directories.
+
+Setting the `FS_XFLAG_DAX` flag (specifically or through inheritance) occurs even
+if the underlying media does not support dax and/or the filesystem is
+overridden with a mount option.
+
+
+Implementation Tips for Block Driver Writers
+--------------------------------------------
+
+To support `DAX` in your block driver, implement the 'direct_access'
+block device operation. It is used to translate the sector number
+(expressed in units of 512-byte sectors) to a page frame number (pfn)
+that identifies the physical page for the memory. It also returns a
+kernel virtual address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested. The function should return the number
+of bytes that can be contiguously accessed at that offset. It may also
+return a negative errno if an error occurs.
+
+In order to support this method, the storage must be byte-accessible by
+the CPU at all times. If your device uses paging techniques to expose
+a large amount of memory through a smaller window, then you cannot
+implement direct_access. Equally, if your device can occasionally
+stall the CPU for an extended period, you should also not attempt to
+implement direct_access.
+
+These block devices may be used for inspiration:
+- brd: RAM backed block device driver
+- dcssblk: s390 dcss block device driver
+- pmem: NVDIMM persistent memory driver
+
+
+Implementation Tips for Filesystem Writers
+------------------------------------------
+
+Filesystem support consists of:
+
+* Adding support to mark inodes as being `DAX` by setting the `S_DAX` flag in
+  i_flags
+* Implementing ->read_iter and ->write_iter operations which use
+  :c:func:`dax_iomap_rw()` when inode has `S_DAX` flag set
+* Implementing an mmap file operation for `DAX` files which sets the
+  `VM_MIXEDMAP` and `VM_HUGEPAGE` flags on the `VMA`, and setting the vm_ops to
+  include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These
+  handlers should probably call :c:func:`dax_iomap_fault()` passing the
+  appropriate fault size and iomap operations.
+* Calling :c:func:`iomap_zero_range()` passing appropriate iomap operations
+  instead of :c:func:`block_truncate_page()` for `DAX` files
+* Ensuring that there is sufficient locking between reads, writes,
+  truncates and page faults
+
+The iomap handlers for allocating blocks must make sure that allocated blocks
+are zeroed out and converted to written extents before being returned to avoid
+exposure of uninitialized data through mmap.
+
+These filesystems may be used for inspiration:
+
+.. seealso::
+
+  ext2: see Documentation/filesystems/ext2.rst
+
+.. seealso::
+
+  xfs: see Documentation/admin-guide/xfs.rst
+
+.. seealso::
+
+  ext4: see Documentation/filesystems/ext4/
+
+
+Handling Media Errors
+---------------------
+
+The libnvdimm subsystem stores a record of known media error locations for
+each pmem block device (in gendisk->badblocks). If we fault at such location,
+or one with a latent error not yet discovered, the application can expect
+to receive a `SIGBUS`. Libnvdimm also allows clearing of these errors by simply
+writing the affected sectors (through the pmem driver, and if the underlying
+NVDIMM supports the clear_poison DSM defined by ACPI).
+
+Since `DAX` IO normally doesn't go through the ``driver/bio`` path, applications or
+sysadmins have an option to restore the lost data from a prior ``backup/inbuilt``
+redundancy in the following ways:
+
+1. Delete the affected file, and restore from a backup (sysadmin route):
+   This will free the filesystem blocks that were being used by the file,
+   and the next time they're allocated, they will be zeroed first, which
+   happens through the driver, and will clear bad sectors.
+
+2. Truncate or hole-punch the part of the file that has a bad-block (at least
+   an entire aligned sector has to be hole-punched, but not necessarily an
+   entire filesystem block).
+
+These are the two basic paths that allow `DAX` filesystems to continue operating
+in the presence of media errors. More robust error recovery mechanisms can be
+built on top of this in the future, for example, involving redundancy/mirroring
+provided at the block layer through DM, or additionally, at the filesystem
+level. These would have to rely on the above two tenets, that error clearing
+can happen either by sending an IO through the driver, or zeroing (also through
+the driver).
+
+
+Shortcomings
+------------
+
+Even if the kernel or its modules are stored on a filesystem that supports
+`DAX` on a block device that supports `DAX`, they will still be copied into RAM.
+
+The DAX code does not work correctly on architectures which have virtually
+mapped caches such as ARM, MIPS and SPARC.
+
+Calling :c:func:`get_user_pages()` on a range of user memory that has been
+mmaped from a `DAX` file will fail when there are no 'struct page' to describe
+those pages. This problem has been addressed in some device drivers
+by adding optional struct page support for pages under the control of
+the driver (see `CONFIG_NVDIMM_PFN` in ``drivers/nvdimm`` for an example of
+how to do this). In the non struct page cases `O_DIRECT` reads/writes to
+those memory ranges from a non-`DAX` file will fail
+
+
+.. note::
+
+  `O_DIRECT` reads/writes _of a `DAX` file do work, it is the memory that
+  is being accessed that is key here). Other things that will not work in
+  the non struct page case include RDMA, :c:func:`sendfile()` and
+  :c:func:`splice()`.
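
The mount-option rules in Summary item 4 of the new dax.rst can be modelled as a small decision function. The sketch below is an illustrative model only, not kernel code: `DaxMount`, `parse_dax_option` and `effective_s_dax` are hypothetical names chosen for this example. It does not model timing, that is, the requirement that an inode be evicted and re-instantiated before a changed policy takes effect.

```python
from enum import Enum


class DaxMount(Enum):
    """The three dax mount modes described for xfs and ext4."""
    INODE = "inode"    # follow FS_XFLAG_DAX (the default)
    NEVER = "never"    # never set S_DAX
    ALWAYS = "always"  # always set S_DAX


def parse_dax_option(opt):
    """Map a mount option string to a DaxMount mode (hypothetical helper).

    None means no dax option was given, so the default (dax=inode) applies.
    A bare "dax" is the legacy alias for "dax=always".
    """
    if opt is None:
        return DaxMount.INODE
    if opt == "dax":
        return DaxMount.ALWAYS
    if opt.startswith("dax="):
        return DaxMount(opt[len("dax="):])  # raises ValueError if unknown
    raise ValueError(f"not a dax mount option: {opt!r}")


def effective_s_dax(mount, xflag_dax, media_supports_dax, is_regular_file=True):
    """Return the S_DAX state this model predicts for a newly loaded inode."""
    # Without DAX-capable storage, S_DAX is never set.
    if not media_supports_dax:
        return False
    # Only regular files ever have S_DAX set.
    if not is_regular_file:
        return False
    if mount is DaxMount.NEVER:
        return False
    if mount is DaxMount.ALWAYS:
        return True
    # dax=inode: follow the persistent per-inode flag.
    return bool(xflag_dax)
```

For instance, `effective_s_dax(parse_dax_option("dax=never"), True, True)` evaluates to `False` in this model: the persistent flag is still set, but the mount option overrides it.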
-257
Documentation/filesystems/dax.txt
@@ -1,257 +0,0 @@
-Direct Access for files
------------------------
-
-Motivation
-----------
-
-The page cache is usually used to buffer reads and writes to files.
-It is also used to provide the pages which are mapped into userspace
-by a call to mmap.
-
-For block devices that are memory-like, the page cache pages would be
-unnecessary copies of the original storage. The DAX code removes the
-extra copy by performing reads and writes directly to the storage device.
-For file mappings, the storage device is mapped directly into userspace.
-
-
-Usage
------
-
-If you have a block device which supports DAX, you can make a filesystem
-on it as usual. The DAX code currently only supports files with a block
-size equal to your kernel's PAGE_SIZE, so you may need to specify a block
-size when creating the filesystem.
-
-Currently 3 filesystems support DAX: ext2, ext4 and xfs. Enabling DAX on them
-is different.
-
-Enabling DAX on ext2
------------------------------
-
-When mounting the filesystem, use the "-o dax" option on the command line or
-add 'dax' to the options in /etc/fstab. This works to enable DAX on all files
-within the filesystem. It is equivalent to the '-o dax=always' behavior below.
-
-
-Enabling DAX on xfs and ext4
-----------------------------
-
-Summary
--------
-
- 1. There exists an in-kernel file access mode flag S_DAX that corresponds to
-    the statx flag STATX_ATTR_DAX. See the manpage for statx(2) for details
-    about this access mode.
-
- 2. There exists a persistent flag FS_XFLAG_DAX that can be applied to regular
-    files and directories. This advisory flag can be set or cleared at any
-    time, but doing so does not immediately affect the S_DAX state.
-
- 3. If the persistent FS_XFLAG_DAX flag is set on a directory, this flag will
-    be inherited by all regular files and subdirectories that are subsequently
-    created in this directory. Files and subdirectories that exist at the time
-    this flag is set or cleared on the parent directory are not modified by
-    this modification of the parent directory.
-
- 4. There exist dax mount options which can override FS_XFLAG_DAX in the
-    setting of the S_DAX flag. Given underlying storage which supports DAX the
-    following hold:
-
-    "-o dax=inode" means "follow FS_XFLAG_DAX" and is the default.
-
-    "-o dax=never" means "never set S_DAX, ignore FS_XFLAG_DAX."
-
-    "-o dax=always" means "always set S_DAX ignore FS_XFLAG_DAX."
-
-    "-o dax" is a legacy option which is an alias for "dax=always".
-    This may be removed in the future so "-o dax=always" is
-    the preferred method for specifying this behavior.
-
-    NOTE: Modifications to and the inheritance behavior of FS_XFLAG_DAX remain
-    the same even when the filesystem is mounted with a dax option. However,
-    in-core inode state (S_DAX) will be overridden until the filesystem is
-    remounted with dax=inode and the inode is evicted from kernel memory.
-
- 5. The S_DAX policy can be changed via:
-
-    a) Setting the parent directory FS_XFLAG_DAX as needed before files are
-       created
-
-    b) Setting the appropriate dax="foo" mount option
-
-    c) Changing the FS_XFLAG_DAX flag on existing regular files and
-       directories. This has runtime constraints and limitations that are
-       described in 6) below.
-
- 6. When changing the S_DAX policy via toggling the persistent FS_XFLAG_DAX
-    flag, the change to existing regular files won't take effect until the
-    files are closed by all processes.
-
-
-Details
--------
-
-There are 2 per-file dax flags. One is a persistent inode setting (FS_XFLAG_DAX)
-and the other is a volatile flag indicating the active state of the feature
-(S_DAX).
-
-FS_XFLAG_DAX is preserved within the filesystem. This persistent config
-setting can be set, cleared and/or queried using the FS_IOC_FS[GS]ETXATTR ioctl
-(see ioctl_xfs_fsgetxattr(2)) or an utility such as 'xfs_io'.
-
-New files and directories automatically inherit FS_XFLAG_DAX from
-their parent directory _when_ _created_. Therefore, setting FS_XFLAG_DAX at
-directory creation time can be used to set a default behavior for an entire
-sub-tree.
-
-To clarify inheritance, here are 3 examples:
-
-Example A:
-
-mkdir -p a/b/c
-xfs_io -c 'chattr +x' a
-mkdir a/b/c/d
-mkdir a/e
-
-	dax: a,e
-	no dax: b,c,d
-
-Example B:
-
-mkdir a
-xfs_io -c 'chattr +x' a
-mkdir -p a/b/c/d
-
-	dax: a,b,c,d
-	no dax:
-
-Example C:
-
-mkdir -p a/b/c
-xfs_io -c 'chattr +x' c
-mkdir a/b/c/d
-
-	dax: c,d
-	no dax: a,b
-
-
-The current enabled state (S_DAX) is set when a file inode is instantiated in
-memory by the kernel. It is set based on the underlying media support, the
-value of FS_XFLAG_DAX and the filesystem's dax mount option.
-
-statx can be used to query S_DAX. NOTE that only regular files will ever have
-S_DAX set and therefore statx will never indicate that S_DAX is set on
-directories.
-
-Setting the FS_XFLAG_DAX flag (specifically or through inheritance) occurs even
-if the underlying media does not support dax and/or the filesystem is
-overridden with a mount option.
-
-
-
-Implementation Tips for Block Driver Writers
---------------------------------------------
-
-To support DAX in your block driver, implement the 'direct_access'
-block device operation. It is used to translate the sector number
-(expressed in units of 512-byte sectors) to a page frame number (pfn)
-that identifies the physical page for the memory. It also returns a
-kernel virtual address that can be used to access the memory.
-
-The direct_access method takes a 'size' parameter that indicates the
-number of bytes being requested. The function should return the number
-of bytes that can be contiguously accessed at that offset. It may also
-return a negative errno if an error occurs.
-
-In order to support this method, the storage must be byte-accessible by
-the CPU at all times. If your device uses paging techniques to expose
-a large amount of memory through a smaller window, then you cannot
-implement direct_access. Equally, if your device can occasionally
-stall the CPU for an extended period, you should also not attempt to
-implement direct_access.
-
-These block devices may be used for inspiration:
-- brd: RAM backed block device driver
-- dcssblk: s390 dcss block device driver
-- pmem: NVDIMM persistent memory driver
-
-
-Implementation Tips for Filesystem Writers
-------------------------------------------
-
-Filesystem support consists of
-- adding support to mark inodes as being DAX by setting the S_DAX flag in
-  i_flags
-- implementing ->read_iter and ->write_iter operations which use dax_iomap_rw()
-  when inode has S_DAX flag set
-- implementing an mmap file operation for DAX files which sets the
-  VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to
-  include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These
-  handlers should probably call dax_iomap_fault() passing the appropriate
-  fault size and iomap operations.
-- calling iomap_zero_range() passing appropriate iomap operations instead of
-  block_truncate_page() for DAX files
-- ensuring that there is sufficient locking between reads, writes,
-  truncates and page faults
-
-The iomap handlers for allocating blocks must make sure that allocated blocks
-are zeroed out and converted to written extents before being returned to avoid
-exposure of uninitialized data through mmap.
-
-These filesystems may be used for inspiration:
-- ext2: see Documentation/filesystems/ext2.rst
-- ext4: see Documentation/filesystems/ext4/
-- xfs: see Documentation/admin-guide/xfs.rst
-
-
-Handling Media Errors
----------------------
-
-The libnvdimm subsystem stores a record of known media error locations for
-each pmem block device (in gendisk->badblocks). If we fault at such location,
-or one with a latent error not yet discovered, the application can expect
-to receive a SIGBUS. Libnvdimm also allows clearing of these errors by simply
-writing the affected sectors (through the pmem driver, and if the underlying
-NVDIMM supports the clear_poison DSM defined by ACPI).
-
-Since DAX IO normally doesn't go through the driver/bio path, applications or
-sysadmins have an option to restore the lost data from a prior backup/inbuilt
-redundancy in the following ways:
-
-1. Delete the affected file, and restore from a backup (sysadmin route):
-   This will free the filesystem blocks that were being used by the file,
-   and the next time they're allocated, they will be zeroed first, which
-   happens through the driver, and will clear bad sectors.
-
-2. Truncate or hole-punch the part of the file that has a bad-block (at least
-   an entire aligned sector has to be hole-punched, but not necessarily an
-   entire filesystem block).
-
-These are the two basic paths that allow DAX filesystems to continue operating
-in the presence of media errors. More robust error recovery mechanisms can be
-built on top of this in the future, for example, involving redundancy/mirroring
-provided at the block layer through DM, or additionally, at the filesystem
-level. These would have to rely on the above two tenets, that error clearing
-can happen either by sending an IO through the driver, or zeroing (also through
-the driver).
-
-
-Shortcomings
-------------
-
-Even if the kernel or its modules are stored on a filesystem that supports
-DAX on a block device that supports DAX, they will still be copied into RAM.
-
-The DAX code does not work correctly on architectures which have virtually
-mapped caches such as ARM, MIPS and SPARC.
-
-Calling get_user_pages() on a range of user memory that has been mmaped
-from a DAX file will fail when there are no 'struct page' to describe
-those pages. This problem has been addressed in some device drivers
-by adding optional struct page support for pages under the control of
-the driver (see CONFIG_NVDIMM_PFN in drivers/nvdimm for an example of
-how to do this). In the non struct page cases O_DIRECT reads/writes to
-those memory ranges from a non-DAX file will fail (note that O_DIRECT
-reads/writes _of a DAX file_ do work, it is the memory that is being
-accessed that is key here). Other things that will not work in the
-non struct page case include RDMA, sendfile() and splice().
+1
Documentation/filesystems/index.rst
@@ -77,6 +77,7 @@
    coda
    configfs
    cramfs
+   dax
    debugfs
    dlmfs
    ecryptfs
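
The inheritance rule that Examples A to C in dax.rst illustrate (a new directory copies `FS_XFLAG_DAX` from its parent at creation time, and later changes to the parent do not touch existing children) can be checked with a toy in-memory model. This is a sketch under stated assumptions, not filesystem code: the `Tree` class and its `mkdir`, `chattr_x` and `dax_dirs` methods are hypothetical stand-ins for `mkdir`, `xfs_io -c 'chattr +x'` and inspecting the flag, and paths are written in full (for example `a/b/c`).

```python
class Tree:
    """Toy model of FS_XFLAG_DAX inheritance on directories."""

    def __init__(self):
        self.flags = {}  # directory path -> FS_XFLAG_DAX (bool)

    def mkdir(self, path, parents=False):
        """Create a directory; with parents=True behave like 'mkdir -p'."""
        parts = path.strip("/").split("/")
        for i in range(1, len(parts) + 1):
            cur = "/".join(parts[:i])
            if cur in self.flags:
                if i == len(parts) and not parents:
                    raise FileExistsError(cur)
                continue
            if i < len(parts) and not parents:
                raise FileNotFoundError(cur)
            parent = "/".join(parts[:i - 1])
            # The flag is copied from the parent once, at creation time.
            self.flags[cur] = self.flags.get(parent, False)

    def chattr_x(self, path):
        """Set FS_XFLAG_DAX on an existing directory only."""
        if path not in self.flags:
            raise FileNotFoundError(path)
        self.flags[path] = True  # existing children are left unchanged

    def dax_dirs(self):
        """Return the sorted list of directories carrying the flag."""
        return sorted(p for p, flag in self.flags.items() if flag)


# Example A from the document: flag set on 'a' after 'a/b/c' already exists.
tree = Tree()
tree.mkdir("a/b/c", parents=True)
tree.chattr_x("a")
tree.mkdir("a/b/c/d")
tree.mkdir("a/e")
print(tree.dax_dirs())  # ['a', 'a/e']
```

In Example A only `a` and the later-created `a/e` carry the flag, because `b` and `c` existed before the flag was set and `d` inherits from the unflagged `c`.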