Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-6.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
"The highlights are new logic behind background block group reclaim,
automatic removal of qgroup after removing a subvolume and new
'rescue=' mount options.

The rest is optimizations, cleanups and refactoring.

User visible features:

- dynamic block group reclaim:
  - tunable framework to avoid situations where eager data
    allocations prevent creating new metadata chunks due to lack of
    unallocated space
  - reuse sysfs knob bg_reclaim_threshold (otherwise used only in
    zoned mode) for a fixed value threshold
  - new on/off sysfs knob "dynamic_reclaim" calculating the value
    based on heuristics, aiming to keep spare working space for
    relocating chunks but not to needlessly relocate partially
    utilized block groups or reclaim newly allocated ones
  - stats are exported in sysfs per block group type, files
    "reclaim_*"
  - this may increase IO load at unexpected times but the corner
    case of no allocatable block groups is known to be worse

- automatically remove qgroup of deleted subvolumes:
  - adjust qgroup removal conditions, make sure all related
    subvolume data are already removed, or return EBUSY, also take
    into account setting of sysfs drop_subtree_threshold
  - also works in squota mode

- mount option updates: new modes of 'rescue=' that allow mounting
  images (read-only) that could have been partially converted by user
  space tools
  - ignoremetacsums - invalid metadata checksums are ignored
  - ignoresuperflags - super block flags that track conversion in
    progress (like UUID or checksums) are ignored

Core:

- size of struct btrfs_inode is now below 1024 (on a release config),
  improved memory packing and other secondary effects

- switch tracking of open inodes from rb-tree to xarray, minor
  performance improvement

- reduce number of empty transaction commits when there are no dirty
  data/metadata

- memory allocation optimizations (reduced numbers, reordering out of
  critical sections)

- extent map structure optimizations and refactoring, more sanity
  checks

- more subpage in zoned mode preparations or fixes

- general snapshot code cleanups, improvements and documentation

- tree-checker updates: more file extent ram_bytes fixes, continued

- raid-stripe-tree update (not backward compatible):
  - remove extent encoding field from the structure, can be inferred
    from other information
  - requires btrfs-progs 6.9.1 or newer

- cleanups and refactoring
  - error message updates
  - error handling improvements
  - return type and parameter cleanups and improvements"

* tag 'for-6.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (152 commits)
  btrfs: fix extent map use-after-free when adding pages to compressed bio
  btrfs: fix bitmap leak when loading free space cache on duplicate entry
  btrfs: remove the BUG_ON() inside extent_range_clear_dirty_for_io()
  btrfs: move extent_range_clear_dirty_for_io() into inode.c
  btrfs: enhance compression error messages
  btrfs: fix data race when accessing the last_trans field of a root
  btrfs: rename the extra_gfp parameter of btrfs_alloc_page_array()
  btrfs: remove the extra_gfp parameter from btrfs_alloc_folio_array()
  btrfs: introduce new "rescue=ignoresuperflags" mount option
  btrfs: introduce new "rescue=ignoremetacsums" mount option
  btrfs: output the unrecognized super block flags as hex
  btrfs: remove unused Opt enums
  btrfs: tree-checker: add extra ram_bytes and disk_num_bytes check
  btrfs: fix the ram_bytes assignment for truncated ordered extents
  btrfs: make validate_extent_map() catch ram_bytes mismatch
  btrfs: ignore incorrect btrfs_file_extent_item::ram_bytes
  btrfs: cleanup the bytenr usage inside btrfs_extent_item_to_extent_map()
  btrfs: fix typo in error message in btrfs_validate_super()
  btrfs: move the direct IO code into its own file
  btrfs: pass a btrfs_inode to btrfs_set_prop()
  ...

+5184 -3914
+1 -1
fs/btrfs/Makefile
···
 	uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
 	block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
 	subpage.o tree-mod-log.o extent-io-tree.o fs.o messages.o bio.o \
-	lru_cache.o raid-stripe-tree.o
+	lru_cache.o raid-stripe-tree.o fiemap.o direct-io.o

 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
+6 -9
fs/btrfs/accessors.h
···

 static inline u8 get_unaligned_le8(const void *p)
 {
-	return *(u8 *)p;
+	return *(const u8 *)p;
 }

 static inline void put_unaligned_le8(u8 val, void *p)
···
 			   offsetof(type, member), \
 			   sizeof_field(type, member)))

-#define write_eb_member(eb, ptr, type, member, result) (\
-	write_extent_buffer(eb, (char *)(result), \
+#define write_eb_member(eb, ptr, type, member, source) ( \
+	write_extent_buffer(eb, (const char *)(source), \
 			   ((unsigned long)(ptr)) + \
 			    offsetof(type, member), \
 			   sizeof_field(type, member)))
···
 BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_timespec_nsec, struct btrfs_timespec, nsec, 32);

-BTRFS_SETGET_FUNCS(stripe_extent_encoding, struct btrfs_stripe_extent, encoding, 8);
 BTRFS_SETGET_FUNCS(raid_stride_devid, struct btrfs_raid_stride, devid, 64);
 BTRFS_SETGET_FUNCS(raid_stride_physical, struct btrfs_raid_stride, physical, 64);
-BTRFS_SETGET_STACK_FUNCS(stack_stripe_extent_encoding,
-			 struct btrfs_stripe_extent, encoding, 8);
 BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_devid, struct btrfs_raid_stride, devid, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_physical, struct btrfs_raid_stride, physical, 64);

···

 static inline void btrfs_set_tree_block_key(const struct extent_buffer *eb,
 					    struct btrfs_tree_block_info *item,
-					    struct btrfs_disk_key *key)
+					    const struct btrfs_disk_key *key)
 {
 	write_eb_member(eb, item, struct btrfs_tree_block_info, key, key);
 }
···
 		       struct btrfs_disk_key *disk_key, int nr);

 static inline void btrfs_set_node_key(const struct extent_buffer *eb,
-				      struct btrfs_disk_key *disk_key, int nr)
+				      const struct btrfs_disk_key *disk_key, int nr)
 {
 	unsigned long ptr;

···
 }

 static inline void btrfs_set_item_key(struct extent_buffer *eb,
-				      struct btrfs_disk_key *disk_key, int nr)
+				      const struct btrfs_disk_key *disk_key, int nr)
 {
 	struct btrfs_item *item = btrfs_item_nr(eb, nr);

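The `write_eb_member()` change in fs/btrfs/accessors.h casts its source argument to `const char *`, which is what lets the setters take a `const struct btrfs_disk_key *`. A minimal userspace sketch of the same idea follows. It assumes a flat byte buffer in place of an extent buffer, and `struct disk_key`, `struct block_info`, `write_member()` and `set_block_key()` are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the on-disk structures. */
struct disk_key {
	unsigned long long objectid;
	unsigned char type;
	unsigned long long offset;
};

struct block_info {
	struct disk_key key;
	unsigned char level;
};

#define sizeof_field(type, member) sizeof(((type *)0)->member)

/*
 * Copy the bytes at `source` into the storage of `member` inside a raw
 * buffer. The source is only read, so it is cast to a const pointer.
 */
#define write_member(buf, type, member, source)			\
	memcpy((char *)(buf) + offsetof(type, member),		\
	       (const char *)(source),				\
	       sizeof_field(type, member))

/* A const source pointer is accepted without discarding qualifiers. */
static void set_block_key(void *buf, const struct disk_key *key)
{
	assert(buf && key);
	write_member(buf, struct block_info, key, key);
}
```

Calling `set_block_key(raw, &k)` on a buffer of at least `sizeof(struct block_info)` bytes copies only the key field and leaves the remaining bytes untouched.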
+2 -2
fs/btrfs/bio.c
···
 /* Is this a data path I/O that needs storage layer checksum and repair? */
 static inline bool is_data_bbio(struct btrfs_bio *bbio)
 {
-	return bbio->inode && is_data_inode(&bbio->inode->vfs_inode);
+	return bbio->inode && is_data_inode(bbio->inode);
 }

 static bool bbio_has_ordered_extent(struct btrfs_bio *bbio)
···
 	 * point, so they are handled as part of the no-checksum case.
 	 */
 	if (inode && !(inode->flags & BTRFS_INODE_NODATASUM) &&
-	    !test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state) &&
+	    !test_bit(BTRFS_FS_STATE_NO_DATA_CSUMS, &fs_info->fs_state) &&
 	    !btrfs_is_data_reloc_root(inode->root)) {
 		if (should_async_write(bbio) &&
 		    btrfs_wq_submit_bio(bbio, bioc, &smap, mirror_num))
+40 -13
fs/btrfs/block-group.c
···
 	}
 }

+static struct btrfs_root *btrfs_block_group_root(struct btrfs_fs_info *fs_info)
+{
+	if (btrfs_fs_compat_ro(fs_info, BLOCK_GROUP_TREE))
+		return fs_info->block_group_root;
+	return btrfs_extent_root(fs_info, 0);
+}
+
 static int remove_block_group_item(struct btrfs_trans_handle *trans,
 				   struct btrfs_path *path,
 				   struct btrfs_block_group *block_group)
···

 static bool should_reclaim_block_group(struct btrfs_block_group *bg, u64 bytes_freed)
 {
-	const struct btrfs_space_info *space_info = bg->space_info;
-	const int reclaim_thresh = READ_ONCE(space_info->bg_reclaim_threshold);
+	const int thresh_pct = btrfs_calc_reclaim_threshold(bg->space_info);
+	u64 thresh_bytes = mult_perc(bg->length, thresh_pct);
 	const u64 new_val = bg->used;
 	const u64 old_val = new_val + bytes_freed;
-	u64 thresh;

-	if (reclaim_thresh == 0)
+	if (thresh_bytes == 0)
 		return false;
-
-	thresh = mult_perc(bg->length, reclaim_thresh);

 	/*
 	 * If we were below the threshold before don't reclaim, we are likely a
 	 * brand new block group and we don't want to relocate new block groups.
 	 */
-	if (old_val < thresh)
+	if (old_val < thresh_bytes)
 		return false;
-	if (new_val >= thresh)
+	if (new_val >= thresh_bytes)
 		return false;
 	return true;
 }
···
 	list_sort(NULL, &fs_info->reclaim_bgs, reclaim_bgs_cmp);
 	while (!list_empty(&fs_info->reclaim_bgs)) {
 		u64 zone_unusable;
+		u64 reclaimed;
 		int ret = 0;

 		bg = list_first_entry(&fs_info->reclaim_bgs,
···
 		/* Don't race with allocators so take the groups_sem */
 		down_write(&space_info->groups_sem);

+		spin_lock(&space_info->lock);
 		spin_lock(&bg->lock);
 		if (bg->reserved || bg->pinned || bg->ro) {
 			/*
···
 			 * this block group.
 			 */
 			spin_unlock(&bg->lock);
+			spin_unlock(&space_info->lock);
 			up_write(&space_info->groups_sem);
 			goto next;
 		}
···
 			if (!btrfs_test_opt(fs_info, DISCARD_ASYNC))
 				btrfs_mark_bg_unused(bg);
 			spin_unlock(&bg->lock);
+			spin_unlock(&space_info->lock);
 			up_write(&space_info->groups_sem);
 			goto next;

···
 		 */
 		if (!should_reclaim_block_group(bg, bg->length)) {
 			spin_unlock(&bg->lock);
+			spin_unlock(&space_info->lock);
 			up_write(&space_info->groups_sem);
 			goto next;
 		}
 		spin_unlock(&bg->lock);
+		spin_unlock(&space_info->lock);

 		/*
 		 * Get out fast, in case we're read-only or unmounting the
···
 				div64_u64(bg->used * 100, bg->length),
 				div64_u64(zone_unusable * 100, bg->length));
 		trace_btrfs_reclaim_block_group(bg);
+		reclaimed = bg->used;
 		ret = btrfs_relocate_chunk(fs_info, bg->start);
 		if (ret) {
 			btrfs_dec_block_group_ro(bg);
 			btrfs_err(fs_info, "error relocating chunk %llu",
 				  bg->start);
+			reclaimed = 0;
+			spin_lock(&space_info->lock);
+			space_info->reclaim_errors++;
+			if (READ_ONCE(space_info->periodic_reclaim))
+				space_info->periodic_reclaim_ready = false;
+			spin_unlock(&space_info->lock);
 		}
+		spin_lock(&space_info->lock);
+		space_info->reclaim_count++;
+		space_info->reclaim_bytes += reclaimed;
+		spin_unlock(&space_info->lock);

 next:
-		if (ret) {
+		if (ret && !READ_ONCE(space_info->periodic_reclaim)) {
 			/* Refcount held by the reclaim_bgs list after splice. */
 			spin_lock(&fs_info->unused_bgs_lock);
 			/*
···

 void btrfs_reclaim_bgs(struct btrfs_fs_info *fs_info)
 {
+	btrfs_reclaim_sweep(fs_info);
 	spin_lock(&fs_info->unused_bgs_lock);
 	if (!list_empty(&fs_info->reclaim_bgs))
 		queue_work(system_unbound_wq, &fs_info->reclaim_bgs_work);
···
 			old_val += num_bytes;
 			cache->used = old_val;
 			cache->reserved -= num_bytes;
+			cache->reclaim_mark = 0;
 			space_info->bytes_reserved -= num_bytes;
 			space_info->bytes_used += num_bytes;
 			space_info->disk_used += num_bytes * factor;
+			if (READ_ONCE(space_info->periodic_reclaim))
+				btrfs_space_info_update_reclaimable(space_info, -num_bytes);
 			spin_unlock(&cache->lock);
 			spin_unlock(&space_info->lock);
 		} else {
···
 			btrfs_space_info_update_bytes_pinned(info, space_info, num_bytes);
 			space_info->bytes_used -= num_bytes;
 			space_info->disk_used -= num_bytes * factor;
-
-			reclaim = should_reclaim_block_group(cache, num_bytes);
+			if (READ_ONCE(space_info->periodic_reclaim))
+				btrfs_space_info_update_reclaimable(space_info, num_bytes);
+			else
+				reclaim = should_reclaim_block_group(cache, num_bytes);

 			spin_unlock(&cache->lock);
 			spin_unlock(&space_info->lock);
···
 		spin_lock(&block_group->lock);
 		if (test_and_clear_bit(BLOCK_GROUP_FLAG_IREF,
 				       &block_group->runtime_flags)) {
-			struct inode *inode = block_group->inode;
+			struct btrfs_inode *inode = block_group->inode;

 			block_group->inode = NULL;
 			spin_unlock(&block_group->lock);

 			ASSERT(block_group->io_ctl.inode == NULL);
-			iput(inode);
+			iput(&inode->vfs_inode);
 		} else {
 			spin_unlock(&block_group->lock);
 		}
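The reworked `should_reclaim_block_group()` in fs/btrfs/block-group.c reports a block group only when a free moves its usage from at or above the threshold to below it. The sketch below restates that crossing check in userspace C. It assumes a simplified `mult_perc()` and passes the percentage in directly instead of calling `btrfs_calc_reclaim_threshold()`; `struct bg_sample` and `should_reclaim()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two block group fields the check reads. */
struct bg_sample {
	uint64_t length;	/* total size of the block group in bytes */
	uint64_t used;		/* bytes in use after the free was accounted */
};

/* Simplified percentage helper, assumed to behave like the kernel one. */
static uint64_t mult_perc(uint64_t num, uint32_t percent)
{
	assert(percent <= 100);
	return num * percent / 100;
}

static bool should_reclaim(const struct bg_sample *bg, uint64_t bytes_freed,
			   uint32_t thresh_pct)
{
	const uint64_t thresh_bytes = mult_perc(bg->length, thresh_pct);
	const uint64_t new_val = bg->used;
	const uint64_t old_val = new_val + bytes_freed;

	/* A zero threshold disables reclaim entirely. */
	if (thresh_bytes == 0)
		return false;
	/* Below the threshold before the free: likely a new block group. */
	if (old_val < thresh_bytes)
		return false;
	/* Still at or above the threshold: nothing crossed yet. */
	if (new_val >= thresh_bytes)
		return false;
	return true;
}
```

With a 1000-byte group and a 75% threshold, freeing 100 bytes from 800 used crosses the 750-byte line and returns true; freeing 100 bytes from 200 used does not, because the group was never above the line.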
+2 -1
fs/btrfs/block-group.h
···

 struct btrfs_block_group {
 	struct btrfs_fs_info *fs_info;
-	struct inode *inode;
+	struct btrfs_inode *inode;
 	spinlock_t lock;
 	u64 start;
 	u64 length;
···
 	struct work_struct zone_finish_work;
 	struct extent_buffer *last_eb;
 	enum btrfs_block_group_size_class size_class;
+	u64 reclaim_mark;
 };

 static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group)
+108 -48
fs/btrfs/btrfs_inode.h
···
 #include <uapi/linux/btrfs_tree.h>
 #include <trace/events/btrfs.h>
 #include "block-rsv.h"
-#include "btrfs_inode.h"
 #include "extent_map.h"
 #include "extent_io.h"
 #include "extent-io-tree.h"
···
 	 * range).
 	 */
 	BTRFS_INODE_COW_WRITE_ERROR,
+	/*
+	 * Indicate this is a directory that points to a subvolume for which
+	 * there is no root reference item. That's a case like the following:
+	 *
+	 *   $ btrfs subvolume create /mnt/parent
+	 *   $ btrfs subvolume create /mnt/parent/child
+	 *   $ btrfs subvolume snapshot /mnt/parent /mnt/snap
+	 *
+	 * If subvolume "parent" is root 256, subvolume "child" is root 257 and
+	 * snapshot "snap" is root 258, then there's no root reference item (key
+	 * BTRFS_ROOT_REF_KEY in the root tree) for the subvolume "child"
+	 * associated to root 258 (the snapshot) - there's only for the root
+	 * of the "parent" subvolume (root 256). In the chunk root we have a
+	 * (256 BTRFS_ROOT_REF_KEY 257) key but we don't have a
+	 * (258 BTRFS_ROOT_REF_KEY 257) key - the sames goes for backrefs, we
+	 * have a (257 BTRFS_ROOT_BACKREF_KEY 256) but we don't have a
+	 * (257 BTRFS_ROOT_BACKREF_KEY 258) key.
+	 *
+	 * So when opening the "child" dentry from the snapshot's directory,
+	 * we don't find a root ref item and we create a stub inode. This is
+	 * done at new_simple_dir(), called from btrfs_lookup_dentry().
+	 */
+	BTRFS_INODE_ROOT_STUB,
 };

 /* in memory btrfs inode */
···
 	/* which subvolume this inode belongs to */
 	struct btrfs_root *root;

-	/* key used to find this inode on disk. This is used by the code
-	 * to read in roots of subvolumes
+#if BITS_PER_LONG == 32
+	/*
+	 * The objectid of the corresponding BTRFS_INODE_ITEM_KEY.
+	 * On 64 bits platforms we can get it from vfs_inode.i_ino, which is an
+	 * unsigned long and therefore 64 bits on such platforms.
 	 */
-	struct btrfs_key location;
+	u64 objectid;
+#endif

 	/* Cached value of inode property 'compression'. */
 	u8 prop_compress;
···
 	 * to walk them all.
 	 */
 	struct list_head delalloc_inodes;
-
-	/* node for the red-black tree that links inodes in subvolume root */
-	struct rb_node rb_node;

 	unsigned long runtime_flags;

···
 		u64 last_dir_index_offset;
 	};

-	/*
-	 * Total number of bytes pending defrag, used by stat to check whether
-	 * it needs COW. Protected by 'lock'.
-	 */
-	u64 defrag_bytes;
+	union {
+		/*
+		 * Total number of bytes pending defrag, used by stat to check whether
+		 * it needs COW. Protected by 'lock'.
+		 * Used by inodes other than the data relocation inode.
+		 */
+		u64 defrag_bytes;
+
+		/*
+		 * Logical address of the block group being relocated.
+		 * Used only by the data relocation inode.
+		 */
+		u64 reloc_block_group_start;
+	};

 	/*
 	 * The size of the file stored in the metadata on disk. data=ordered
···
 	 */
 	u64 disk_i_size;

-	/*
-	 * If this is a directory then index_cnt is the counter for the index
-	 * number for new files that are created. For an empty directory, this
-	 * must be initialized to BTRFS_DIR_START_INDEX.
-	 */
-	u64 index_cnt;
+	union {
+		/*
+		 * If this is a directory then index_cnt is the counter for the
+		 * index number for new files that are created. For an empty
+		 * directory, this must be initialized to BTRFS_DIR_START_INDEX.
+		 */
+		u64 index_cnt;
+
+		/*
+		 * If this is not a directory, this is the number of bytes
+		 * outstanding that are going to need csums. This is used in
+		 * ENOSPC accounting. Protected by 'lock'.
+		 */
+		u64 csum_bytes;
+	};

 	/* Cache the directory index number to speed the dir/file remove */
 	u64 dir_index;
···
 	 */
 	u64 last_unlink_trans;

-	/*
-	 * The id/generation of the last transaction where this inode was
-	 * either the source or the destination of a clone/dedupe operation.
-	 * Used when logging an inode to know if there are shared extents that
-	 * need special care when logging checksum items, to avoid duplicate
-	 * checksum items in a log (which can lead to a corruption where we end
-	 * up with missing checksum ranges after log replay).
-	 * Protected by the vfs inode lock.
-	 */
-	u64 last_reflink_trans;
+	union {
+		/*
+		 * The id/generation of the last transaction where this inode
+		 * was either the source or the destination of a clone/dedupe
+		 * operation. Used when logging an inode to know if there are
+		 * shared extents that need special care when logging checksum
+		 * items, to avoid duplicate checksum items in a log (which can
+		 * lead to a corruption where we end up with missing checksum
+		 * ranges after log replay). Protected by the VFS inode lock.
+		 * Used for regular files only.
+		 */
+		u64 last_reflink_trans;

-	/*
-	 * Number of bytes outstanding that are going to need csums. This is
-	 * used in ENOSPC accounting. Protected by 'lock'.
-	 */
-	u64 csum_bytes;
+		/*
+		 * In case this a root stub inode (BTRFS_INODE_ROOT_STUB flag set),
+		 * the ID of that root.
+		 */
+		u64 ref_root_id;
+	};

 	/* Backwards incompatible flags, lower half of inode_item::flags */
 	u32 flags;
···
  */
 static inline u64 btrfs_ino(const struct btrfs_inode *inode)
 {
-	u64 ino = inode->location.objectid;
+	u64 ino = inode->objectid;

-	/* type == BTRFS_ROOT_ITEM_KEY: subvol dir */
-	if (inode->location.type == BTRFS_ROOT_ITEM_KEY)
+	if (test_bit(BTRFS_INODE_ROOT_STUB, &inode->runtime_flags))
 		ino = inode->vfs_inode.i_ino;
 	return ino;
 }
···

 #endif

+static inline void btrfs_get_inode_key(const struct btrfs_inode *inode,
+				       struct btrfs_key *key)
+{
+	key->objectid = btrfs_ino(inode);
+	key->type = BTRFS_INODE_ITEM_KEY;
+	key->offset = 0;
+}
+
+static inline void btrfs_set_inode_number(struct btrfs_inode *inode, u64 ino)
+{
+#if BITS_PER_LONG == 32
+	inode->objectid = ino;
+#endif
+	inode->vfs_inode.i_ino = ino;
+}
+
 static inline void btrfs_i_size_write(struct btrfs_inode *inode, u64 size)
 {
 	i_size_write(&inode->vfs_inode, size);
 	inode->disk_i_size = size;
 }

-static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode)
+static inline bool btrfs_is_free_space_inode(const struct btrfs_inode *inode)
 {
 	return test_bit(BTRFS_INODE_FREE_SPACE_INODE, &inode->runtime_flags);
 }

-static inline bool is_data_inode(struct inode *inode)
+static inline bool is_data_inode(const struct btrfs_inode *inode)
 {
-	return btrfs_ino(BTRFS_I(inode)) != BTRFS_BTREE_INODE_OBJECTID;
+	return btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID;
 }

 static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode,
···
 bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
 			u32 bio_offset, struct bio_vec *bv);
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
-			      u64 *orig_start, u64 *orig_block_len,
-			      u64 *ram_bytes, bool nowait, bool strict);
+			      struct btrfs_file_extent *file_extent,
+			      bool nowait, bool strict);

 void btrfs_del_delalloc_inode(struct btrfs_inode *inode);
 struct inode *btrfs_lookup_dentry(struct inode *dir, struct dentry *dentry);
···
 int btrfs_drop_inode(struct inode *inode);
 int __init btrfs_init_cachep(void);
 void __cold btrfs_destroy_cachep(void);
-struct inode *btrfs_iget_path(struct super_block *s, u64 ino,
-			      struct btrfs_root *root, struct btrfs_path *path);
-struct inode *btrfs_iget(struct super_block *s, u64 ino, struct btrfs_root *root);
+struct inode *btrfs_iget_path(u64 ino, struct btrfs_root *root,
+			      struct btrfs_path *path);
+struct inode *btrfs_iget(u64 ino, struct btrfs_root *root);
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 				    struct page *page, u64 start, u64 len);
 int btrfs_update_inode(struct btrfs_trans_handle *trans,
···
 ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
 			       const struct btrfs_ioctl_encoded_io_args *encoded);

-ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter,
-		       size_t done_before);
-struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
-				  size_t done_before);
 struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino);

 extern const struct dentry_operations btrfs_dentry_operations;
···
 void btrfs_update_inode_bytes(struct btrfs_inode *inode, const u64 add_bytes,
 			      const u64 del_bytes);
 void btrfs_assert_inode_range_clean(struct btrfs_inode *inode, u64 start, u64 end);
+u64 btrfs_get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
+				     u64 num_bytes);
+struct extent_map *btrfs_create_io_em(struct btrfs_inode *inode, u64 start,
+				      const struct btrfs_file_extent *file_extent,
+				      int type);

 #endif
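The union changes in fs/btrfs/btrfs_inode.h let fields that are never live for the same kind of inode share storage, which is part of how `struct btrfs_inode` shrinks. The sketch below shows the effect on a reduced layout. `struct inode_before` and `struct inode_after` are hypothetical and keep only the counters touched by this diff; the real structure has many more members, so the sizes here say nothing about its actual size.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Reduced layout before the change: four separate 8-byte counters. */
struct inode_before {
	u64 defrag_bytes;
	u64 index_cnt;
	u64 last_reflink_trans;
	u64 csum_bytes;
};

/* Reduced layout after the change: six logical fields in three slots. */
struct inode_after {
	union {
		u64 defrag_bytes;		/* all but the data reloc inode */
		u64 reloc_block_group_start;	/* data reloc inode only */
	};
	union {
		u64 index_cnt;			/* directories */
		u64 csum_bytes;			/* non-directories */
	};
	union {
		u64 last_reflink_trans;		/* regular files */
		u64 ref_root_id;		/* root stub inodes */
	};
};

static_assert(sizeof(struct inode_before) == 4 * sizeof(u64),
	      "four independent counters");
static_assert(sizeof(struct inode_after) == 3 * sizeof(u64),
	      "each union occupies a single 8-byte slot");

/* Both names of a union refer to the same bytes. */
static u64 shared_slot_value(u64 index_cnt)
{
	struct inode_after inode = { 0 };

	inode.index_cnt = index_cnt;
	return inode.csum_bytes;
}
```

The trade-off is that correctness now depends on every user knowing which kind of inode it holds, which is why each union member in the diff documents the inode type it applies to.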
+13 -12
fs/btrfs/compression.c
···
 	folio_put(folio);
 }

-static void end_bbio_comprssed_read(struct btrfs_bio *bbio)
+static void end_bbio_compressed_read(struct btrfs_bio *bbio)
 {
 	struct compressed_bio *cb = to_compressed_bio(bbio);
 	blk_status_t status = bbio->bio.bi_status;
···
  * This also calls the writeback end hooks for the file pages so that metadata
  * and checksums can be updated in the file.
  */
-static void end_bbio_comprssed_write(struct btrfs_bio *bbio)
+static void end_bbio_compressed_write(struct btrfs_bio *bbio)
 {
 	struct compressed_bio *cb = to_compressed_bio(bbio);
 	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
···
 				   blk_opf_t write_flags,
 				   bool writeback)
 {
-	struct btrfs_inode *inode = BTRFS_I(ordered->inode);
+	struct btrfs_inode *inode = ordered->inode;
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct compressed_bio *cb;

···

 	cb = alloc_compressed_bio(inode, ordered->file_offset,
 				  REQ_OP_WRITE | write_flags,
-				  end_bbio_comprssed_write);
+				  end_bbio_compressed_write);
 	cb->start = ordered->file_offset;
 	cb->len = ordered->num_bytes;
 	cb->compressed_folios = compressed_folios;
···
 		 */
 		if (!em || cur < em->start ||
 		    (cur + fs_info->sectorsize > extent_map_end(em)) ||
-		    (em->block_start >> SECTOR_SHIFT) != orig_bio->bi_iter.bi_sector) {
+		    (extent_map_block_start(em) >> SECTOR_SHIFT) !=
+		    orig_bio->bi_iter.bi_sector) {
 			free_extent_map(em);
 			unlock_extent(tree, cur, page_end, NULL);
 			unlock_page(page);
 			put_page(page);
 			break;
 		}
+		add_size = min(em->start + em->len, page_end + 1) - cur;
 		free_extent_map(em);

 		if (page->index == end_index) {
···
 			}
 		}

-		add_size = min(em->start + em->len, page_end + 1) - cur;
 		ret = bio_add_page(orig_bio, page, add_size, offset_in_page(cur));
 		if (ret != add_size) {
 			unlock_extent(tree, cur, page_end, NULL);
···
 	}

 	ASSERT(extent_map_is_compressed(em));
-	compressed_len = em->block_len;
+	compressed_len = em->disk_num_bytes;

 	cb = alloc_compressed_bio(inode, file_offset, REQ_OP_READ,
-				  end_bbio_comprssed_read);
+				  end_bbio_compressed_read);

-	cb->start = em->orig_start;
+	cb->start = em->start - em->offset;
 	em_len = em->len;
 	em_start = em->start;

···
 		goto out_free_bio;
 	}

-	ret2 = btrfs_alloc_folio_array(cb->nr_folios, cb->compressed_folios, 0);
+	ret2 = btrfs_alloc_folio_array(cb->nr_folios, cb->compressed_folios);
 	if (ret2) {
 		ret = BLK_STS_RESOURCE;
 		goto out_free_compressed_pages;
···
  *
  * Return non-zero if the compression should be done, 0 otherwise.
  */
-int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end)
+int btrfs_compress_heuristic(struct btrfs_inode *inode, u64 start, u64 end)
 {
 	struct list_head *ws_list = get_workspace(0, 0);
 	struct heuristic_ws *ws;
···

 	ws = list_entry(ws_list, struct heuristic_ws, list);

-	heuristic_collect_sample(inode, start, end, ws);
+	heuristic_collect_sample(&inode->vfs_inode, start, end, ws);

 	if (sample_repeated_patterns(ws)) {
 		ret = 1;
+1 -1
fs/btrfs/compression.h
···
 const char* btrfs_compress_type2str(enum btrfs_compression_type type);
 bool btrfs_compress_is_valid_type(const char *str, size_t len);

-int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end);
+int btrfs_compress_heuristic(struct btrfs_inode *inode, u64 start, u64 end);

 int btrfs_compress_filemap_get_folio(struct address_space *mapping, u64 start,
 				     struct folio **in_folio_ret);
+65 -43
fs/btrfs/ctree.c
··· 321 321 WARN_ON(test_bit(BTRFS_ROOT_SHAREABLE, &root->state) && 322 322 trans->transid != fs_info->running_transaction->transid); 323 323 WARN_ON(test_bit(BTRFS_ROOT_SHAREABLE, &root->state) && 324 - trans->transid != root->last_trans); 324 + trans->transid != btrfs_get_root_last_trans(root)); 325 325 326 326 level = btrfs_header_level(buf); 327 327 if (level == 0) ··· 417 417 u64 refs; 418 418 u64 owner; 419 419 u64 flags; 420 - u64 new_flags = 0; 421 420 int ret; 422 421 423 422 /* ··· 461 462 } 462 463 463 464 owner = btrfs_header_owner(buf); 464 - BUG_ON(owner == BTRFS_TREE_RELOC_OBJECTID && 465 - !(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF)); 465 + if (unlikely(owner == BTRFS_TREE_RELOC_OBJECTID && 466 + !(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF))) { 467 + btrfs_crit(fs_info, 468 + "found tree block at bytenr %llu level %d root %llu refs %llu flags %llx without full backref flag set", 469 + buf->start, btrfs_header_level(buf), 470 + btrfs_root_id(root), refs, flags); 471 + ret = -EUCLEAN; 472 + btrfs_abort_transaction(trans, ret); 473 + return ret; 474 + } 466 475 467 476 if (refs > 1) { 468 477 if ((owner == btrfs_root_id(root) || ··· 488 481 if (ret) 489 482 return ret; 490 483 } 491 - new_flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF; 484 + ret = btrfs_set_disk_extent_flags(trans, buf, 485 + BTRFS_BLOCK_FLAG_FULL_BACKREF); 486 + if (ret) 487 + return ret; 492 488 } else { 493 489 494 490 if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID) 495 491 ret = btrfs_inc_ref(trans, root, cow, 1); 496 492 else 497 493 ret = btrfs_inc_ref(trans, root, cow, 0); 498 - if (ret) 499 - return ret; 500 - } 501 - if (new_flags != 0) { 502 - ret = btrfs_set_disk_extent_flags(trans, buf, new_flags); 503 494 if (ret) 504 495 return ret; 505 496 } ··· 556 551 WARN_ON(test_bit(BTRFS_ROOT_SHAREABLE, &root->state) && 557 552 trans->transid != fs_info->running_transaction->transid); 558 553 WARN_ON(test_bit(BTRFS_ROOT_SHAREABLE, &root->state) && 559 - trans->transid != root->last_trans); 
554 + trans->transid != btrfs_get_root_last_trans(root)); 560 555 561 556 level = btrfs_header_level(buf); 562 557 ··· 593 588 594 589 ret = update_ref_for_cow(trans, root, buf, cow, &last_ref); 595 590 if (ret) { 596 - btrfs_tree_unlock(cow); 597 - free_extent_buffer(cow); 598 591 btrfs_abort_transaction(trans, ret); 599 - return ret; 592 + goto error_unlock_cow; 600 593 } 601 594 602 595 if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state)) { 603 596 ret = btrfs_reloc_cow_block(trans, root, buf, cow); 604 597 if (ret) { 605 - btrfs_tree_unlock(cow); 606 - free_extent_buffer(cow); 607 598 btrfs_abort_transaction(trans, ret); 608 - return ret; 599 + goto error_unlock_cow; 609 600 } 610 601 } 611 602 ··· 613 612 614 613 ret = btrfs_tree_mod_log_insert_root(root->node, cow, true); 615 614 if (ret < 0) { 616 - btrfs_tree_unlock(cow); 617 - free_extent_buffer(cow); 618 615 btrfs_abort_transaction(trans, ret); 619 - return ret; 616 + goto error_unlock_cow; 620 617 } 621 618 atomic_inc(&cow->refs); 622 619 rcu_assign_pointer(root->node, cow); 623 620 624 - btrfs_free_tree_block(trans, btrfs_root_id(root), buf, 625 - parent_start, last_ref); 621 + ret = btrfs_free_tree_block(trans, btrfs_root_id(root), buf, 622 + parent_start, last_ref); 626 623 free_extent_buffer(buf); 627 624 add_root_to_dirty_list(root); 625 + if (ret < 0) { 626 + btrfs_abort_transaction(trans, ret); 627 + goto error_unlock_cow; 628 + } 628 629 } else { 629 630 WARN_ON(trans->transid != btrfs_header_generation(parent)); 630 631 ret = btrfs_tree_mod_log_insert_key(parent, parent_slot, 631 632 BTRFS_MOD_LOG_KEY_REPLACE); 632 633 if (ret) { 633 - btrfs_tree_unlock(cow); 634 - free_extent_buffer(cow); 635 634 btrfs_abort_transaction(trans, ret); 636 - return ret; 635 + goto error_unlock_cow; 637 636 } 638 637 btrfs_set_node_blockptr(parent, parent_slot, 639 638 cow->start); ··· 643 642 if (last_ref) { 644 643 ret = btrfs_tree_mod_log_free_eb(buf); 645 644 if (ret) { 646 - btrfs_tree_unlock(cow); 647 - 
free_extent_buffer(cow); 648 645 btrfs_abort_transaction(trans, ret); 649 - return ret; 646 + goto error_unlock_cow; 650 647 } 651 648 } 652 - btrfs_free_tree_block(trans, btrfs_root_id(root), buf, 653 - parent_start, last_ref); 649 + ret = btrfs_free_tree_block(trans, btrfs_root_id(root), buf, 650 + parent_start, last_ref); 651 + if (ret < 0) { 652 + btrfs_abort_transaction(trans, ret); 653 + goto error_unlock_cow; 654 + } 654 655 } 655 656 if (unlock_orig) 656 657 btrfs_tree_unlock(buf); ··· 660 657 btrfs_mark_buffer_dirty(trans, cow); 661 658 *cow_ret = cow; 662 659 return 0; 660 + 661 + error_unlock_cow: 662 + btrfs_tree_unlock(cow); 663 + free_extent_buffer(cow); 664 + return ret; 663 665 } 664 666 665 667 static inline int should_cow_block(struct btrfs_trans_handle *trans, ··· 991 983 free_extent_buffer(mid); 992 984 993 985 root_sub_used_bytes(root); 994 - btrfs_free_tree_block(trans, btrfs_root_id(root), mid, 0, 1); 986 + ret = btrfs_free_tree_block(trans, btrfs_root_id(root), mid, 0, 1); 995 987 /* once for the root ptr */ 996 988 free_extent_buffer_stale(mid); 989 + if (ret < 0) { 990 + btrfs_abort_transaction(trans, ret); 991 + goto out; 992 + } 997 993 return 0; 998 994 } 999 995 if (btrfs_header_nritems(mid) > ··· 1065 1053 goto out; 1066 1054 } 1067 1055 root_sub_used_bytes(root); 1068 - btrfs_free_tree_block(trans, btrfs_root_id(root), right, 1069 - 0, 1); 1056 + ret = btrfs_free_tree_block(trans, btrfs_root_id(root), 1057 + right, 0, 1); 1070 1058 free_extent_buffer_stale(right); 1071 1059 right = NULL; 1060 + if (ret < 0) { 1061 + btrfs_abort_transaction(trans, ret); 1062 + goto out; 1063 + } 1072 1064 } else { 1073 1065 struct btrfs_disk_key right_key; 1074 1066 btrfs_node_key(right, &right_key, 0); ··· 1127 1111 goto out; 1128 1112 } 1129 1113 root_sub_used_bytes(root); 1130 - btrfs_free_tree_block(trans, btrfs_root_id(root), mid, 0, 1); 1114 + ret = btrfs_free_tree_block(trans, btrfs_root_id(root), mid, 0, 1); 1131 1115 
free_extent_buffer_stale(mid); 1132 1116 mid = NULL; 1117 + if (ret < 0) { 1118 + btrfs_abort_transaction(trans, ret); 1119 + goto out; 1120 + } 1133 1121 } else { 1134 1122 /* update the parent key to reflect our changes */ 1135 1123 struct btrfs_disk_key mid_key; ··· 1571 1551 if (ret) { 1572 1552 free_extent_buffer(tmp); 1573 1553 btrfs_release_path(p); 1574 - return -EIO; 1575 - } 1576 - if (btrfs_check_eb_owner(tmp, btrfs_root_id(root))) { 1577 - free_extent_buffer(tmp); 1578 - btrfs_release_path(p); 1579 - return -EUCLEAN; 1554 + return ret; 1580 1555 } 1581 1556 1582 1557 if (unlock_up) ··· 2898 2883 old = root->node; 2899 2884 ret = btrfs_tree_mod_log_insert_root(root->node, c, false); 2900 2885 if (ret < 0) { 2901 - btrfs_free_tree_block(trans, btrfs_root_id(root), c, 0, 1); 2886 + int ret2; 2887 + 2888 + ret2 = btrfs_free_tree_block(trans, btrfs_root_id(root), c, 0, 1); 2889 + if (ret2 < 0) 2890 + btrfs_abort_transaction(trans, ret2); 2902 2891 btrfs_tree_unlock(c); 2903 2892 free_extent_buffer(c); 2904 2893 return ret; ··· 4471 4452 root_sub_used_bytes(root); 4472 4453 4473 4454 atomic_inc(&leaf->refs); 4474 - btrfs_free_tree_block(trans, btrfs_root_id(root), leaf, 0, 1); 4455 + ret = btrfs_free_tree_block(trans, btrfs_root_id(root), leaf, 0, 1); 4475 4456 free_extent_buffer_stale(leaf); 4476 - return 0; 4457 + if (ret < 0) 4458 + btrfs_abort_transaction(trans, ret); 4459 + 4460 + return ret; 4477 4461 } 4478 4462 /* 4479 4463 * delete the item at the leaf level in path. If that empties
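The ctree.c hunks above replace repeated unlock-and-free sequences with a single `error_unlock_cow` label, and they start checking the return value of `btrfs_free_tree_block()`. The goto-based unwinding pattern can be sketched in userspace roughly as follows. This is only an illustration: `struct demo_buf`, `demo_step()` and `demo_cow()` are hypothetical names and not kernel APIs.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a locked, reference-counted buffer. */
struct demo_buf {
	int locked;
	int refs;
};

/* Hypothetical step: returns 0 on success or a negative errno value. */
static int demo_step(int result)
{
	return result;
}

/*
 * Takes a lock and a reference, then runs three steps. Every failure
 * jumps to one label that undoes both, so the cleanup is written once.
 * On success the buffer is returned still locked and referenced.
 */
static int demo_cow(struct demo_buf *b, int r1, int r2, int r3)
{
	int ret;

	b->locked = 1;
	b->refs++;

	ret = demo_step(r1);
	if (ret)
		goto error_unlock;

	ret = demo_step(r2);
	if (ret)
		goto error_unlock;

	/* A call whose result used to be ignored is now checked too. */
	ret = demo_step(r3);
	if (ret < 0)
		goto error_unlock;

	return 0;

error_unlock:
	b->locked = 0;
	b->refs--;
	return ret;
}
```

Calling `demo_cow(&b, 0, -EIO, 0)` on a zero-initialized buffer returns `-EIO` and leaves the buffer unlocked with no reference held, which is the property the shared label is meant to guarantee.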
+15 -3
fs/btrfs/ctree.h
··· 221 221 222 222 struct list_head root_list; 223 223 224 - spinlock_t inode_lock; 225 - /* red-black tree that keeps track of in-memory inodes */ 226 - struct rb_root inode_tree; 224 + /* 225 + * Xarray that keeps track of in-memory inodes, protected by the lock 226 + * @inode_lock. 227 + */ 228 + struct xarray inodes; 227 229 228 230 /* 229 231 * Xarray that keeps track of delayed nodes of every inode, protected ··· 354 352 static inline void btrfs_set_root_last_log_commit(struct btrfs_root *root, int commit_id) 355 353 { 356 354 WRITE_ONCE(root->last_log_commit, commit_id); 355 + } 356 + 357 + static inline u64 btrfs_get_root_last_trans(const struct btrfs_root *root) 358 + { 359 + return READ_ONCE(root->last_trans); 360 + } 361 + 362 + static inline void btrfs_set_root_last_trans(struct btrfs_root *root, u64 transid) 363 + { 364 + WRITE_ONCE(root->last_trans, transid); 357 365 } 358 366 359 367 /*
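The new `btrfs_get_root_last_trans()` and `btrfs_set_root_last_trans()` helpers wrap accesses to `root->last_trans` in `READ_ONCE()` and `WRITE_ONCE()`. A rough userspace sketch of the same accessor shape is shown here. It assumes simplified volatile-cast macros that rely on the GCC/Clang `__typeof__` extension; the kernel's real macros do more checking, and `struct demo_root` and the `demo_*` functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins (assumption): force one access through a volatile lvalue. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical structure holding a field read and written without a lock. */
struct demo_root {
	uint64_t last_trans;
};

/* Reader side: a single load the compiler may not tear or repeat. */
static inline uint64_t demo_get_last_trans(const struct demo_root *root)
{
	return READ_ONCE(root->last_trans);
}

/* Writer side: a single store the compiler may not split or elide. */
static inline void demo_set_last_trans(struct demo_root *root, uint64_t transid)
{
	WRITE_ONCE(root->last_trans, transid);
}
```

Keeping the annotation inside a getter and setter pair means callers such as the defrag.c hunk only change from a direct field read to a function call.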
+10 -8
fs/btrfs/defrag.c
··· 139 139 if (trans) 140 140 transid = trans->transid; 141 141 else 142 - transid = inode->root->last_trans; 142 + transid = btrfs_get_root_last_trans(root); 143 143 144 144 defrag = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS); 145 145 if (!defrag) ··· 255 255 goto cleanup; 256 256 } 257 257 258 - inode = btrfs_iget(fs_info->sb, defrag->ino, inode_root); 258 + inode = btrfs_iget(defrag->ino, inode_root); 259 259 btrfs_put_root(inode_root); 260 260 if (IS_ERR(inode)) { 261 261 ret = PTR_ERR(inode); ··· 707 707 */ 708 708 if (key.offset > start) { 709 709 em->start = start; 710 - em->orig_start = start; 711 - em->block_start = EXTENT_MAP_HOLE; 710 + em->disk_bytenr = EXTENT_MAP_HOLE; 711 + em->disk_num_bytes = 0; 712 + em->ram_bytes = 0; 713 + em->offset = 0; 712 714 em->len = key.offset - start; 713 715 break; 714 716 } ··· 827 825 */ 828 826 next = defrag_lookup_extent(inode, em->start + em->len, newer_than, locked); 829 827 /* No more em or hole */ 830 - if (!next || next->block_start >= EXTENT_MAP_LAST_BYTE) 828 + if (!next || next->disk_bytenr >= EXTENT_MAP_LAST_BYTE) 831 829 goto out; 832 830 if (next->flags & EXTENT_FLAG_PREALLOC) 833 831 goto out; ··· 994 992 * This is for users who want to convert inline extents to 995 993 * regular ones through max_inline= mount option. 996 994 */ 997 - if (em->block_start == EXTENT_MAP_INLINE && 995 + if (em->disk_bytenr == EXTENT_MAP_INLINE && 998 996 em->len <= inode->root->fs_info->max_inline) 999 997 goto next; 1000 998 1001 999 /* Skip holes and preallocated extents. */ 1002 - if (em->block_start == EXTENT_MAP_HOLE || 1000 + if (em->disk_bytenr == EXTENT_MAP_HOLE || 1003 1001 (em->flags & EXTENT_FLAG_PREALLOC)) 1004 1002 goto next; 1005 1003 ··· 1064 1062 * So if an inline extent passed all above checks, just add it 1065 1063 * for defrag, and be converted to regular extents. 
1066 1064 */ 1067 - if (em->block_start == EXTENT_MAP_INLINE) 1065 + if (em->disk_bytenr == EXTENT_MAP_INLINE) 1068 1066 goto add; 1069 1067 1070 1068 next_mergeable = defrag_check_next_extent(&inode->vfs_inode, em,
+1 -1
fs/btrfs/delalloc-space.c
@@ -111,7 +111,7 @@
  * making error handling and cleanup easier.
  */
 
-int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
+int btrfs_alloc_data_chunk_ondemand(const struct btrfs_inode *inode, u64 bytes)
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
+1 -1
fs/btrfs/delalloc-space.h
@@ -9,7 +9,7 @@
 struct btrfs_inode;
 struct btrfs_fs_info;
 
-int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
+int btrfs_alloc_data_chunk_ondemand(const struct btrfs_inode *inode, u64 bytes);
 int btrfs_check_data_free_space(struct btrfs_inode *inode,
 			struct extent_changeset **reserved, u64 start, u64 len,
 			bool noflush);
+23 -24
fs/btrfs/delayed-inode.c
··· 77 77 return node; 78 78 } 79 79 80 - spin_lock(&root->inode_lock); 80 + xa_lock(&root->delayed_nodes); 81 81 node = xa_load(&root->delayed_nodes, ino); 82 82 83 83 if (node) { 84 84 if (btrfs_inode->delayed_node) { 85 85 refcount_inc(&node->refs); /* can be accessed */ 86 86 BUG_ON(btrfs_inode->delayed_node != node); 87 - spin_unlock(&root->inode_lock); 87 + xa_unlock(&root->delayed_nodes); 88 88 return node; 89 89 } 90 90 ··· 111 111 node = NULL; 112 112 } 113 113 114 - spin_unlock(&root->inode_lock); 114 + xa_unlock(&root->delayed_nodes); 115 115 return node; 116 116 } 117 - spin_unlock(&root->inode_lock); 117 + xa_unlock(&root->delayed_nodes); 118 118 119 119 return NULL; 120 120 } ··· 148 148 kmem_cache_free(delayed_node_cache, node); 149 149 return ERR_PTR(-ENOMEM); 150 150 } 151 - spin_lock(&root->inode_lock); 151 + xa_lock(&root->delayed_nodes); 152 152 ptr = xa_load(&root->delayed_nodes, ino); 153 153 if (ptr) { 154 154 /* Somebody inserted it, go back and read it. */ 155 - spin_unlock(&root->inode_lock); 155 + xa_unlock(&root->delayed_nodes); 156 156 kmem_cache_free(delayed_node_cache, node); 157 157 node = NULL; 158 158 goto again; 159 159 } 160 - ptr = xa_store(&root->delayed_nodes, ino, node, GFP_ATOMIC); 160 + ptr = __xa_store(&root->delayed_nodes, ino, node, GFP_ATOMIC); 161 161 ASSERT(xa_err(ptr) != -EINVAL); 162 162 ASSERT(xa_err(ptr) != -ENOMEM); 163 163 ASSERT(ptr == NULL); 164 164 btrfs_inode->delayed_node = node; 165 - spin_unlock(&root->inode_lock); 165 + xa_unlock(&root->delayed_nodes); 166 166 167 167 return node; 168 168 } ··· 275 275 if (refcount_dec_and_test(&delayed_node->refs)) { 276 276 struct btrfs_root *root = delayed_node->root; 277 277 278 - spin_lock(&root->inode_lock); 278 + xa_erase(&root->delayed_nodes, delayed_node->inode_id); 279 279 /* 280 280 * Once our refcount goes to zero, nobody is allowed to bump it 281 281 * back up. We can delete it now. 
282 282 */ 283 283 ASSERT(refcount_read(&delayed_node->refs) == 0); 284 - xa_erase(&root->delayed_nodes, delayed_node->inode_id); 285 - spin_unlock(&root->inode_lock); 286 284 kmem_cache_free(delayed_node_cache, delayed_node); 287 285 } 288 286 } ··· 1469 1471 int btrfs_insert_delayed_dir_index(struct btrfs_trans_handle *trans, 1470 1472 const char *name, int name_len, 1471 1473 struct btrfs_inode *dir, 1472 - struct btrfs_disk_key *disk_key, u8 flags, 1474 + const struct btrfs_disk_key *disk_key, u8 flags, 1473 1475 u64 index) 1474 1476 { 1475 1477 struct btrfs_fs_info *fs_info = trans->fs_info; ··· 1682 1684 return 0; 1683 1685 } 1684 1686 1685 - bool btrfs_readdir_get_delayed_items(struct inode *inode, 1687 + bool btrfs_readdir_get_delayed_items(struct btrfs_inode *inode, 1686 1688 u64 last_index, 1687 1689 struct list_head *ins_list, 1688 1690 struct list_head *del_list) ··· 1690 1692 struct btrfs_delayed_node *delayed_node; 1691 1693 struct btrfs_delayed_item *item; 1692 1694 1693 - delayed_node = btrfs_get_delayed_node(BTRFS_I(inode)); 1695 + delayed_node = btrfs_get_delayed_node(inode); 1694 1696 if (!delayed_node) 1695 1697 return false; 1696 1698 ··· 1698 1700 * We can only do one readdir with delayed items at a time because of 1699 1701 * item->readdir_list. 
1700 1702 */ 1701 - btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_SHARED); 1702 - btrfs_inode_lock(BTRFS_I(inode), 0); 1703 + btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED); 1704 + btrfs_inode_lock(inode, 0); 1703 1705 1704 1706 mutex_lock(&delayed_node->mutex); 1705 1707 item = __btrfs_first_delayed_insertion_item(delayed_node); ··· 1730 1732 return true; 1731 1733 } 1732 1734 1733 - void btrfs_readdir_put_delayed_items(struct inode *inode, 1735 + void btrfs_readdir_put_delayed_items(struct btrfs_inode *inode, 1734 1736 struct list_head *ins_list, 1735 1737 struct list_head *del_list) 1736 1738 { ··· 1752 1754 * The VFS is going to do up_read(), so we need to downgrade back to a 1753 1755 * read lock. 1754 1756 */ 1755 - downgrade_write(&inode->i_rwsem); 1757 + downgrade_write(&inode->vfs_inode.i_rwsem); 1756 1758 } 1757 1759 1758 - int btrfs_should_delete_dir_index(struct list_head *del_list, 1760 + int btrfs_should_delete_dir_index(const struct list_head *del_list, 1759 1761 u64 index) 1760 1762 { 1761 1763 struct btrfs_delayed_item *curr; ··· 1776 1778 * Read dir info stored in the delayed tree. 
1777 1779 */ 1778 1780 int btrfs_readdir_delayed_dir_index(struct dir_context *ctx, 1779 - struct list_head *ins_list) 1781 + const struct list_head *ins_list) 1780 1782 { 1781 1783 struct btrfs_dir_item *di; 1782 1784 struct btrfs_delayed_item *curr, *next; ··· 1914 1916 BTRFS_I(inode)->i_otime_nsec = btrfs_stack_timespec_nsec(&inode_item->otime); 1915 1917 1916 1918 inode->i_generation = BTRFS_I(inode)->generation; 1917 - BTRFS_I(inode)->index_cnt = (u64)-1; 1919 + if (S_ISDIR(inode->i_mode)) 1920 + BTRFS_I(inode)->index_cnt = (u64)-1; 1918 1921 1919 1922 mutex_unlock(&delayed_node->mutex); 1920 1923 btrfs_release_delayed_node(delayed_node); ··· 2056 2057 struct btrfs_delayed_node *node; 2057 2058 int count; 2058 2059 2059 - spin_lock(&root->inode_lock); 2060 + xa_lock(&root->delayed_nodes); 2060 2061 if (xa_empty(&root->delayed_nodes)) { 2061 - spin_unlock(&root->inode_lock); 2062 + xa_unlock(&root->delayed_nodes); 2062 2063 return; 2063 2064 } 2064 2065 ··· 2075 2076 if (count >= ARRAY_SIZE(delayed_nodes)) 2076 2077 break; 2077 2078 } 2078 - spin_unlock(&root->inode_lock); 2079 + xa_unlock(&root->delayed_nodes); 2079 2080 index++; 2080 2081 2081 2082 for (int i = 0; i < count; i++) {
+5 -5
fs/btrfs/delayed-inode.h
@@ -110,7 +110,7 @@
 int btrfs_insert_delayed_dir_index(struct btrfs_trans_handle *trans,
 				   const char *name, int name_len,
 				   struct btrfs_inode *dir,
-				   struct btrfs_disk_key *disk_key, u8 flags,
+				   const struct btrfs_disk_key *disk_key, u8 flags,
 				   u64 index);
 
 int btrfs_delete_delayed_dir_index(struct btrfs_trans_handle *trans,
@@ -143,17 +143,17 @@
 void btrfs_destroy_delayed_inodes(struct btrfs_fs_info *fs_info);
 
 /* Used for readdir() */
-bool btrfs_readdir_get_delayed_items(struct inode *inode,
+bool btrfs_readdir_get_delayed_items(struct btrfs_inode *inode,
 				     u64 last_index,
 				     struct list_head *ins_list,
 				     struct list_head *del_list);
-void btrfs_readdir_put_delayed_items(struct inode *inode,
+void btrfs_readdir_put_delayed_items(struct btrfs_inode *inode,
 				     struct list_head *ins_list,
 				     struct list_head *del_list);
-int btrfs_should_delete_dir_index(struct list_head *del_list,
+int btrfs_should_delete_dir_index(const struct list_head *del_list,
 				  u64 index);
 int btrfs_readdir_delayed_dir_index(struct dir_context *ctx,
-				    struct list_head *ins_list);
+				    const struct list_head *ins_list);
 
 /* Used during directory logging. */
 void btrfs_log_get_delayed_items(struct btrfs_inode *inode,
+8 -43
fs/btrfs/delayed-ref.c
··· 195 195 } 196 196 197 197 /* 198 - * Transfer bytes to our delayed refs rsv. 199 - * 200 - * @fs_info: the filesystem 201 - * @num_bytes: number of bytes to transfer 202 - * 203 - * This transfers up to the num_bytes amount, previously reserved, to the 204 - * delayed_refs_rsv. Any extra bytes are returned to the space info. 205 - */ 206 - void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info, 207 - u64 num_bytes) 208 - { 209 - struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv; 210 - u64 to_free = 0; 211 - 212 - spin_lock(&delayed_refs_rsv->lock); 213 - if (delayed_refs_rsv->size > delayed_refs_rsv->reserved) { 214 - u64 delta = delayed_refs_rsv->size - 215 - delayed_refs_rsv->reserved; 216 - if (num_bytes > delta) { 217 - to_free = num_bytes - delta; 218 - num_bytes = delta; 219 - } 220 - } else { 221 - to_free = num_bytes; 222 - num_bytes = 0; 223 - } 224 - 225 - if (num_bytes) 226 - delayed_refs_rsv->reserved += num_bytes; 227 - if (delayed_refs_rsv->reserved >= delayed_refs_rsv->size) 228 - delayed_refs_rsv->full = true; 229 - spin_unlock(&delayed_refs_rsv->lock); 230 - 231 - if (num_bytes) 232 - trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv", 233 - 0, num_bytes, 1); 234 - if (to_free) 235 - btrfs_space_info_free_bytes_may_use(fs_info, 236 - delayed_refs_rsv->space_info, to_free); 237 - } 238 - 239 - /* 240 198 * Refill based on our delayed refs usage. 241 199 * 242 200 * @fs_info: the filesystem ··· 819 861 spin_lock_init(&head_ref->lock); 820 862 mutex_init(&head_ref->mutex); 821 863 864 + /* If not metadata set an impossible level to help debugging. 
*/ 865 + if (generic_ref->type == BTRFS_REF_METADATA) 866 + head_ref->level = generic_ref->tree_ref.level; 867 + else 868 + head_ref->level = U8_MAX; 869 + 822 870 if (qrecord) { 823 871 if (generic_ref->ref_root && reserved) { 824 872 qrecord->data_rsv = reserved; ··· 1078 1114 } 1079 1115 1080 1116 int btrfs_add_delayed_extent_op(struct btrfs_trans_handle *trans, 1081 - u64 bytenr, u64 num_bytes, 1117 + u64 bytenr, u64 num_bytes, u8 level, 1082 1118 struct btrfs_delayed_extent_op *extent_op) 1083 1119 { 1084 1120 struct btrfs_delayed_ref_head *head_ref; ··· 1088 1124 .action = BTRFS_UPDATE_DELAYED_HEAD, 1089 1125 .bytenr = bytenr, 1090 1126 .num_bytes = num_bytes, 1127 + .tree_ref.level = level, 1091 1128 }; 1092 1129 1093 1130 head_ref = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
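The delayed-ref.c hunk above stores the tree block level on the head ref and, for non-metadata refs, sets it to `U8_MAX` so that a misuse is easier to spot while debugging. A small sketch of that sentinel idea follows, assuming simplified hypothetical types (`struct demo_ref`, `struct demo_head`) in place of the kernel structures and `UINT8_MAX` in place of `U8_MAX`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference kinds; only metadata refs carry a tree level. */
enum demo_ref_type { DEMO_REF_DATA, DEMO_REF_METADATA };

struct demo_ref {
	enum demo_ref_type type;
	uint8_t tree_level;	/* meaningful for metadata only */
};

struct demo_head {
	uint8_t level;
};

/*
 * Copy the level for metadata, otherwise store a value no real tree
 * level can have, so an accidental read stands out.
 */
static void demo_init_head_level(struct demo_head *head,
				 const struct demo_ref *ref)
{
	if (ref->type == DEMO_REF_METADATA)
		head->level = ref->tree_level;
	else
		head->level = UINT8_MAX;
}
```

An out-of-range sentinel turns a silent "level 0" misreading into an obviously wrong value.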
+4 -4
fs/btrfs/delayed-ref.h
··· 108 108 109 109 struct btrfs_delayed_extent_op { 110 110 struct btrfs_disk_key key; 111 - u8 level; 112 111 bool update_key; 113 112 bool update_flags; 114 113 u64 flags_to_set; ··· 170 171 * or cleanup, we will need to free the reservation. 171 172 */ 172 173 u64 reserved_bytes; 174 + 175 + /* Tree block level, for metadata only. */ 176 + u8 level; 173 177 174 178 /* 175 179 * when a new extent is allocated, it is just reserved in memory ··· 357 355 struct btrfs_ref *generic_ref, 358 356 u64 reserved); 359 357 int btrfs_add_delayed_extent_op(struct btrfs_trans_handle *trans, 360 - u64 bytenr, u64 num_bytes, 358 + u64 bytenr, u64 num_bytes, u8 level, 361 359 struct btrfs_delayed_extent_op *extent_op); 362 360 void btrfs_merge_delayed_refs(struct btrfs_fs_info *fs_info, 363 361 struct btrfs_delayed_ref_root *delayed_refs, ··· 388 386 void btrfs_dec_delayed_refs_rsv_bg_updates(struct btrfs_fs_info *fs_info); 389 387 int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info, 390 388 enum btrfs_reserve_flush_enum flush); 391 - void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info, 392 - u64 num_bytes); 393 389 bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info); 394 390 395 391 static inline u64 btrfs_delayed_ref_owner(struct btrfs_delayed_ref_node *node)
+2 -2
fs/btrfs/dev-replace.c
@@ -684,7 +684,7 @@
 	if (ret)
 		btrfs_err(fs_info, "kobj add dev failed %d", ret);
 
-	btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
+	btrfs_wait_ordered_roots(fs_info, U64_MAX, NULL);
 
 	/*
 	 * Commit dev_replace state and reserve 1 item for it.
@@ -880,7 +880,7 @@
 		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 		return ret;
 	}
-	btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
+	btrfs_wait_ordered_roots(fs_info, U64_MAX, NULL);
 
 	/*
 	 * We have to use this loop approach because at this point src_device
+4 -4
fs/btrfs/dir-item.c
@@ -22,7 +22,7 @@
 						   *trans,
 						   struct btrfs_root *root,
 						   struct btrfs_path *path,
-						   struct btrfs_key *cpu_key,
+						   const struct btrfs_key *cpu_key,
 						   u32 data_size,
 						   const char *name,
 						   int name_len)
@@ -108,7 +108,7 @@
  */
 int btrfs_insert_dir_item(struct btrfs_trans_handle *trans,
 			  const struct fscrypt_str *name, struct btrfs_inode *dir,
-			  struct btrfs_key *location, u8 type, u64 index)
+			  const struct btrfs_key *location, u8 type, u64 index)
 {
 	int ret = 0;
 	int ret2 = 0;
@@ -379,7 +379,7 @@
  * for a specific name.
  */
 struct btrfs_dir_item *btrfs_match_dir_item_name(struct btrfs_fs_info *fs_info,
-						 struct btrfs_path *path,
+						 const struct btrfs_path *path,
 						 const char *name, int name_len)
 {
 	struct btrfs_dir_item *dir_item;
@@ -417,7 +417,7 @@
 int btrfs_delete_one_dir_name(struct btrfs_trans_handle *trans,
 			      struct btrfs_root *root,
 			      struct btrfs_path *path,
-			      struct btrfs_dir_item *di)
+			      const struct btrfs_dir_item *di)
 {
 
 	struct extent_buffer *leaf;
+3 -3
fs/btrfs/dir-item.h
@@ -17,7 +17,7 @@
 			  const struct fscrypt_str *name);
 int btrfs_insert_dir_item(struct btrfs_trans_handle *trans,
 			  const struct fscrypt_str *name, struct btrfs_inode *dir,
-			  struct btrfs_key *location, u8 type, u64 index);
+			  const struct btrfs_key *location, u8 type, u64 index);
 struct btrfs_dir_item *btrfs_lookup_dir_item(struct btrfs_trans_handle *trans,
 					     struct btrfs_root *root,
 					     struct btrfs_path *path, u64 dir,
@@ -33,7 +33,7 @@
 int btrfs_delete_one_dir_name(struct btrfs_trans_handle *trans,
 			      struct btrfs_root *root,
 			      struct btrfs_path *path,
-			      struct btrfs_dir_item *di);
+			      const struct btrfs_dir_item *di);
 int btrfs_insert_xattr_item(struct btrfs_trans_handle *trans,
 			    struct btrfs_root *root,
 			    struct btrfs_path *path, u64 objectid,
@@ -45,7 +45,7 @@
 					  const char *name, u16 name_len,
 					  int mod);
 struct btrfs_dir_item *btrfs_match_dir_item_name(struct btrfs_fs_info *fs_info,
-						 struct btrfs_path *path,
+						 const struct btrfs_path *path,
 						 const char *name,
 						 int name_len);
 
+1052
fs/btrfs/direct-io.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #include <linux/fsverity.h> 4 + #include <linux/iomap.h> 5 + #include "ctree.h" 6 + #include "delalloc-space.h" 7 + #include "direct-io.h" 8 + #include "extent-tree.h" 9 + #include "file.h" 10 + #include "fs.h" 11 + #include "transaction.h" 12 + #include "volumes.h" 13 + 14 + struct btrfs_dio_data { 15 + ssize_t submitted; 16 + struct extent_changeset *data_reserved; 17 + struct btrfs_ordered_extent *ordered; 18 + bool data_space_reserved; 19 + bool nocow_done; 20 + }; 21 + 22 + struct btrfs_dio_private { 23 + /* Range of I/O */ 24 + u64 file_offset; 25 + u32 bytes; 26 + 27 + /* This must be last */ 28 + struct btrfs_bio bbio; 29 + }; 30 + 31 + static struct bio_set btrfs_dio_bioset; 32 + 33 + static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend, 34 + struct extent_state **cached_state, 35 + unsigned int iomap_flags) 36 + { 37 + const bool writing = (iomap_flags & IOMAP_WRITE); 38 + const bool nowait = (iomap_flags & IOMAP_NOWAIT); 39 + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; 40 + struct btrfs_ordered_extent *ordered; 41 + int ret = 0; 42 + 43 + while (1) { 44 + if (nowait) { 45 + if (!try_lock_extent(io_tree, lockstart, lockend, 46 + cached_state)) 47 + return -EAGAIN; 48 + } else { 49 + lock_extent(io_tree, lockstart, lockend, cached_state); 50 + } 51 + /* 52 + * We're concerned with the entire range that we're going to be 53 + * doing DIO to, so we need to make sure there's no ordered 54 + * extents in this range. 55 + */ 56 + ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), lockstart, 57 + lockend - lockstart + 1); 58 + 59 + /* 60 + * We need to make sure there are no buffered pages in this 61 + * range either, we could have raced between the invalidate in 62 + * generic_file_direct_write and locking the extent. The 63 + * invalidate needs to happen so that reads after a write do not 64 + * get stale data. 
65 + */ 66 + if (!ordered && 67 + (!writing || !filemap_range_has_page(inode->i_mapping, 68 + lockstart, lockend))) 69 + break; 70 + 71 + unlock_extent(io_tree, lockstart, lockend, cached_state); 72 + 73 + if (ordered) { 74 + if (nowait) { 75 + btrfs_put_ordered_extent(ordered); 76 + ret = -EAGAIN; 77 + break; 78 + } 79 + /* 80 + * If we are doing a DIO read and the ordered extent we 81 + * found is for a buffered write, we can not wait for it 82 + * to complete and retry, because if we do so we can 83 + * deadlock with concurrent buffered writes on page 84 + * locks. This happens only if our DIO read covers more 85 + * than one extent map, if at this point has already 86 + * created an ordered extent for a previous extent map 87 + * and locked its range in the inode's io tree, and a 88 + * concurrent write against that previous extent map's 89 + * range and this range started (we unlock the ranges 90 + * in the io tree only when the bios complete and 91 + * buffered writes always lock pages before attempting 92 + * to lock range in the io tree). 93 + */ 94 + if (writing || 95 + test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags)) 96 + btrfs_start_ordered_extent(ordered); 97 + else 98 + ret = nowait ? -EAGAIN : -ENOTBLK; 99 + btrfs_put_ordered_extent(ordered); 100 + } else { 101 + /* 102 + * We could trigger writeback for this range (and wait 103 + * for it to complete) and then invalidate the pages for 104 + * this range (through invalidate_inode_pages2_range()), 105 + * but that can lead us to a deadlock with a concurrent 106 + * call to readahead (a buffered read or a defrag call 107 + * triggered a readahead) on a page lock due to an 108 + * ordered dio extent we created before but did not have 109 + * yet a corresponding bio submitted (whence it can not 110 + * complete), which makes readahead wait for that 111 + * ordered extent to complete while holding a lock on 112 + * that page. 113 + */ 114 + ret = nowait ? 
-EAGAIN : -ENOTBLK; 115 + } 116 + 117 + if (ret) 118 + break; 119 + 120 + cond_resched(); 121 + } 122 + 123 + return ret; 124 + } 125 + 126 + static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode, 127 + struct btrfs_dio_data *dio_data, 128 + const u64 start, 129 + const struct btrfs_file_extent *file_extent, 130 + const int type) 131 + { 132 + struct extent_map *em = NULL; 133 + struct btrfs_ordered_extent *ordered; 134 + 135 + if (type != BTRFS_ORDERED_NOCOW) { 136 + em = btrfs_create_io_em(inode, start, file_extent, type); 137 + if (IS_ERR(em)) 138 + goto out; 139 + } 140 + 141 + ordered = btrfs_alloc_ordered_extent(inode, start, file_extent, 142 + (1 << type) | 143 + (1 << BTRFS_ORDERED_DIRECT)); 144 + if (IS_ERR(ordered)) { 145 + if (em) { 146 + free_extent_map(em); 147 + btrfs_drop_extent_map_range(inode, start, 148 + start + file_extent->num_bytes - 1, false); 149 + } 150 + em = ERR_CAST(ordered); 151 + } else { 152 + ASSERT(!dio_data->ordered); 153 + dio_data->ordered = ordered; 154 + } 155 + out: 156 + 157 + return em; 158 + } 159 + 160 + static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode, 161 + struct btrfs_dio_data *dio_data, 162 + u64 start, u64 len) 163 + { 164 + struct btrfs_root *root = inode->root; 165 + struct btrfs_fs_info *fs_info = root->fs_info; 166 + struct btrfs_file_extent file_extent; 167 + struct extent_map *em; 168 + struct btrfs_key ins; 169 + u64 alloc_hint; 170 + int ret; 171 + 172 + alloc_hint = btrfs_get_extent_allocation_hint(inode, start, len); 173 + again: 174 + ret = btrfs_reserve_extent(root, len, len, fs_info->sectorsize, 175 + 0, alloc_hint, &ins, 1, 1); 176 + if (ret == -EAGAIN) { 177 + ASSERT(btrfs_is_zoned(fs_info)); 178 + wait_on_bit_io(&inode->root->fs_info->flags, BTRFS_FS_NEED_ZONE_FINISH, 179 + TASK_UNINTERRUPTIBLE); 180 + goto again; 181 + } 182 + if (ret) 183 + return ERR_PTR(ret); 184 + 185 + file_extent.disk_bytenr = ins.objectid; 186 + file_extent.disk_num_bytes = 
ins.offset; 187 + file_extent.num_bytes = ins.offset; 188 + file_extent.ram_bytes = ins.offset; 189 + file_extent.offset = 0; 190 + file_extent.compression = BTRFS_COMPRESS_NONE; 191 + em = btrfs_create_dio_extent(inode, dio_data, start, &file_extent, 192 + BTRFS_ORDERED_REGULAR); 193 + btrfs_dec_block_group_reservations(fs_info, ins.objectid); 194 + if (IS_ERR(em)) 195 + btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 196 + 1); 197 + 198 + return em; 199 + } 200 + 201 + static int btrfs_get_blocks_direct_write(struct extent_map **map, 202 + struct inode *inode, 203 + struct btrfs_dio_data *dio_data, 204 + u64 start, u64 *lenp, 205 + unsigned int iomap_flags) 206 + { 207 + const bool nowait = (iomap_flags & IOMAP_NOWAIT); 208 + struct btrfs_fs_info *fs_info = inode_to_fs_info(inode); 209 + struct btrfs_file_extent file_extent; 210 + struct extent_map *em = *map; 211 + int type; 212 + u64 block_start; 213 + struct btrfs_block_group *bg; 214 + bool can_nocow = false; 215 + bool space_reserved = false; 216 + u64 len = *lenp; 217 + u64 prev_len; 218 + int ret = 0; 219 + 220 + /* 221 + * We don't allocate a new extent in the following cases 222 + * 223 + * 1) The inode is marked as NODATACOW. In this case we'll just use the 224 + * existing extent. 225 + * 2) The extent is marked as PREALLOC. We're good to go here and can 226 + * just use the extent. 
227 + * 228 + */ 229 + if ((em->flags & EXTENT_FLAG_PREALLOC) || 230 + ((BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) && 231 + em->disk_bytenr != EXTENT_MAP_HOLE)) { 232 + if (em->flags & EXTENT_FLAG_PREALLOC) 233 + type = BTRFS_ORDERED_PREALLOC; 234 + else 235 + type = BTRFS_ORDERED_NOCOW; 236 + len = min(len, em->len - (start - em->start)); 237 + block_start = extent_map_block_start(em) + (start - em->start); 238 + 239 + if (can_nocow_extent(inode, start, &len, 240 + &file_extent, false, false) == 1) { 241 + bg = btrfs_inc_nocow_writers(fs_info, block_start); 242 + if (bg) 243 + can_nocow = true; 244 + } 245 + } 246 + 247 + prev_len = len; 248 + if (can_nocow) { 249 + struct extent_map *em2; 250 + 251 + /* We can NOCOW, so only need to reserve metadata space. */ 252 + ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), len, len, 253 + nowait); 254 + if (ret < 0) { 255 + /* Our caller expects us to free the input extent map. */ 256 + free_extent_map(em); 257 + *map = NULL; 258 + btrfs_dec_nocow_writers(bg); 259 + if (nowait && (ret == -ENOSPC || ret == -EDQUOT)) 260 + ret = -EAGAIN; 261 + goto out; 262 + } 263 + space_reserved = true; 264 + 265 + em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, 266 + &file_extent, type); 267 + btrfs_dec_nocow_writers(bg); 268 + if (type == BTRFS_ORDERED_PREALLOC) { 269 + free_extent_map(em); 270 + *map = em2; 271 + em = em2; 272 + } 273 + 274 + if (IS_ERR(em2)) { 275 + ret = PTR_ERR(em2); 276 + goto out; 277 + } 278 + 279 + dio_data->nocow_done = true; 280 + } else { 281 + /* Our caller expects us to free the input extent map. */ 282 + free_extent_map(em); 283 + *map = NULL; 284 + 285 + if (nowait) { 286 + ret = -EAGAIN; 287 + goto out; 288 + } 289 + 290 + /* 291 + * If we could not allocate data space before locking the file 292 + * range and we can't do a NOCOW write, then we have to fail. 
293 + */ 294 + if (!dio_data->data_space_reserved) { 295 + ret = -ENOSPC; 296 + goto out; 297 + } 298 + 299 + /* 300 + * We have to COW and we have already reserved data space before, 301 + * so now we reserve only metadata. 302 + */ 303 + ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), len, len, 304 + false); 305 + if (ret < 0) 306 + goto out; 307 + space_reserved = true; 308 + 309 + em = btrfs_new_extent_direct(BTRFS_I(inode), dio_data, start, len); 310 + if (IS_ERR(em)) { 311 + ret = PTR_ERR(em); 312 + goto out; 313 + } 314 + *map = em; 315 + len = min(len, em->len - (start - em->start)); 316 + if (len < prev_len) 317 + btrfs_delalloc_release_metadata(BTRFS_I(inode), 318 + prev_len - len, true); 319 + } 320 + 321 + /* 322 + * We have created our ordered extent, so we can now release our reservation 323 + * for an outstanding extent. 324 + */ 325 + btrfs_delalloc_release_extents(BTRFS_I(inode), prev_len); 326 + 327 + /* 328 + * Need to update the i_size under the extent lock so buffered 329 + * readers will get the updated i_size when we unlock. 
330 + */ 331 + if (start + len > i_size_read(inode)) 332 + i_size_write(inode, start + len); 333 + out: 334 + if (ret && space_reserved) { 335 + btrfs_delalloc_release_extents(BTRFS_I(inode), len); 336 + btrfs_delalloc_release_metadata(BTRFS_I(inode), len, true); 337 + } 338 + *lenp = len; 339 + return ret; 340 + } 341 + 342 + static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start, 343 + loff_t length, unsigned int flags, struct iomap *iomap, 344 + struct iomap *srcmap) 345 + { 346 + struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap); 347 + struct btrfs_fs_info *fs_info = inode_to_fs_info(inode); 348 + struct extent_map *em; 349 + struct extent_state *cached_state = NULL; 350 + struct btrfs_dio_data *dio_data = iter->private; 351 + u64 lockstart, lockend; 352 + const bool write = !!(flags & IOMAP_WRITE); 353 + int ret = 0; 354 + u64 len = length; 355 + const u64 data_alloc_len = length; 356 + bool unlock_extents = false; 357 + 358 + /* 359 + * We could potentially fault if we have a buffer > PAGE_SIZE, and if 360 + * we're NOWAIT we may submit a bio for a partial range and return 361 + * EIOCBQUEUED, which would result in an errant short read. 362 + * 363 + * The best way to handle this would be to allow for partial completions 364 + * of iocb's, so we could submit the partial bio, return and fault in 365 + * the rest of the pages, and then submit the io for the rest of the 366 + * range. However we don't have that currently, so simply return 367 + * -EAGAIN at this point so that the normal path is used. 368 + */ 369 + if (!write && (flags & IOMAP_NOWAIT) && length > PAGE_SIZE) 370 + return -EAGAIN; 371 + 372 + /* 373 + * Cap the size of reads to that usually seen in buffered I/O as we need 374 + * to allocate a contiguous array for the checksums. 
375 + */ 376 + if (!write) 377 + len = min_t(u64, len, fs_info->sectorsize * BTRFS_MAX_BIO_SECTORS); 378 + 379 + lockstart = start; 380 + lockend = start + len - 1; 381 + 382 + /* 383 + * iomap_dio_rw() only does filemap_write_and_wait_range(), which isn't 384 + * enough if we've written compressed pages to this area, so we need to 385 + * flush the dirty pages again to make absolutely sure that any 386 + * outstanding dirty pages are on disk - the first flush only starts 387 + * compression on the data, while keeping the pages locked, so by the 388 + * time the second flush returns we know bios for the compressed pages 389 + * were submitted and finished, and the pages no longer under writeback. 390 + * 391 + * If we have a NOWAIT request and we have any pages in the range that 392 + * are locked, likely due to compression still in progress, we don't want 393 + * to block on page locks. We also don't want to block on pages marked as 394 + * dirty or under writeback (same as for the non-compression case). 395 + * iomap_dio_rw() did the same check, but after that and before we got 396 + * here, mmap'ed writes may have happened or buffered reads started 397 + * (readpage() and readahead(), which lock pages), as we haven't locked 398 + * the file range yet. 
399 + */ 400 + if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, 401 + &BTRFS_I(inode)->runtime_flags)) { 402 + if (flags & IOMAP_NOWAIT) { 403 + if (filemap_range_needs_writeback(inode->i_mapping, 404 + lockstart, lockend)) 405 + return -EAGAIN; 406 + } else { 407 + ret = filemap_fdatawrite_range(inode->i_mapping, start, 408 + start + length - 1); 409 + if (ret) 410 + return ret; 411 + } 412 + } 413 + 414 + memset(dio_data, 0, sizeof(*dio_data)); 415 + 416 + /* 417 + * We always try to allocate data space and must do it before locking 418 + * the file range, to avoid deadlocks with concurrent writes to the same 419 + * range if the range has several extents and the writes don't expand the 420 + * current i_size (the inode lock is taken in shared mode). If we fail to 421 + * allocate data space here we continue and later, after locking the 422 + * file range, we fail with ENOSPC only if we figure out we can not do a 423 + * NOCOW write. 424 + */ 425 + if (write && !(flags & IOMAP_NOWAIT)) { 426 + ret = btrfs_check_data_free_space(BTRFS_I(inode), 427 + &dio_data->data_reserved, 428 + start, data_alloc_len, false); 429 + if (!ret) 430 + dio_data->data_space_reserved = true; 431 + else if (ret && !(BTRFS_I(inode)->flags & 432 + (BTRFS_INODE_NODATACOW | BTRFS_INODE_PREALLOC))) 433 + goto err; 434 + } 435 + 436 + /* 437 + * If this errors out it's because we couldn't invalidate pagecache for 438 + * this range and we need to fallback to buffered IO, or we are doing a 439 + * NOWAIT read/write and we need to block. 440 + */ 441 + ret = lock_extent_direct(inode, lockstart, lockend, &cached_state, flags); 442 + if (ret < 0) 443 + goto err; 444 + 445 + em = btrfs_get_extent(BTRFS_I(inode), NULL, start, len); 446 + if (IS_ERR(em)) { 447 + ret = PTR_ERR(em); 448 + goto unlock_err; 449 + } 450 + 451 + /* 452 + * Ok for INLINE and COMPRESSED extents we need to fallback on buffered 453 + * io. 
INLINE is special, and we could probably kludge it in here, but 454 + * it's still buffered so for safety lets just fall back to the generic 455 + * buffered path. 456 + * 457 + * For COMPRESSED we _have_ to read the entire extent in so we can 458 + * decompress it, so there will be buffering required no matter what we 459 + * do, so go ahead and fallback to buffered. 460 + * 461 + * We return -ENOTBLK because that's what makes DIO go ahead and go back 462 + * to buffered IO. Don't blame me, this is the price we pay for using 463 + * the generic code. 464 + */ 465 + if (extent_map_is_compressed(em) || em->disk_bytenr == EXTENT_MAP_INLINE) { 466 + free_extent_map(em); 467 + /* 468 + * If we are in a NOWAIT context, return -EAGAIN in order to 469 + * fallback to buffered IO. This is not only because we can 470 + * block with buffered IO (no support for NOWAIT semantics at 471 + * the moment) but also to avoid returning short reads to user 472 + * space - this happens if we were able to read some data from 473 + * previous non-compressed extents and then when we fallback to 474 + * buffered IO, at btrfs_file_read_iter() by calling 475 + * filemap_read(), we fail to fault in pages for the read buffer, 476 + * in which case filemap_read() returns a short read (the number 477 + * of bytes previously read is > 0, so it does not return -EFAULT). 478 + */ 479 + ret = (flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOTBLK; 480 + goto unlock_err; 481 + } 482 + 483 + len = min(len, em->len - (start - em->start)); 484 + 485 + /* 486 + * If we have a NOWAIT request and the range contains multiple extents 487 + * (or a mix of extents and holes), then we return -EAGAIN to make the 488 + * caller fallback to a context where it can do a blocking (without 489 + * NOWAIT) request. This way we avoid doing partial IO and returning 490 + * success to the caller, which is not optimal for writes and for reads 491 + * it can result in unexpected behaviour for an application. 
492 + * 493 + * When doing a read, because we use IOMAP_DIO_PARTIAL when calling 494 + * iomap_dio_rw(), we can end up returning less data than what the caller 495 + * asked for, resulting in an unexpected, and incorrect, short read. 496 + * That is, the caller asked to read N bytes and we return less than that, 497 + * which is wrong unless we are crossing EOF. This happens if we get a 498 + * page fault error when trying to fault in pages for the buffer that is 499 + * associated to the struct iov_iter passed to iomap_dio_rw(), and we 500 + * have previously submitted bios for other extents in the range, in 501 + * which case iomap_dio_rw() may return us EIOCBQUEUED if not all of 502 + * those bios have completed by the time we get the page fault error, 503 + * which we return back to our caller - we should only return EIOCBQUEUED 504 + * after we have submitted bios for all the extents in the range. 505 + */ 506 + if ((flags & IOMAP_NOWAIT) && len < length) { 507 + free_extent_map(em); 508 + ret = -EAGAIN; 509 + goto unlock_err; 510 + } 511 + 512 + if (write) { 513 + ret = btrfs_get_blocks_direct_write(&em, inode, dio_data, 514 + start, &len, flags); 515 + if (ret < 0) 516 + goto unlock_err; 517 + unlock_extents = true; 518 + /* Recalc len in case the new em is smaller than requested */ 519 + len = min(len, em->len - (start - em->start)); 520 + if (dio_data->data_space_reserved) { 521 + u64 release_offset; 522 + u64 release_len = 0; 523 + 524 + if (dio_data->nocow_done) { 525 + release_offset = start; 526 + release_len = data_alloc_len; 527 + } else if (len < data_alloc_len) { 528 + release_offset = start + len; 529 + release_len = data_alloc_len - len; 530 + } 531 + 532 + if (release_len > 0) 533 + btrfs_free_reserved_data_space(BTRFS_I(inode), 534 + dio_data->data_reserved, 535 + release_offset, 536 + release_len); 537 + } 538 + } else { 539 + /* 540 + * We need to unlock only the end area that we aren't using. 
541 + * The rest is going to be unlocked by the endio routine. 542 + */ 543 + lockstart = start + len; 544 + if (lockstart < lockend) 545 + unlock_extents = true; 546 + } 547 + 548 + if (unlock_extents) 549 + unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend, 550 + &cached_state); 551 + else 552 + free_extent_state(cached_state); 553 + 554 + /* 555 + * Translate extent map information to iomap. 556 + * We trim the extents (and move the addr) even though iomap code does 557 + * that, since we have locked only the parts we are performing I/O in. 558 + */ 559 + if ((em->disk_bytenr == EXTENT_MAP_HOLE) || 560 + ((em->flags & EXTENT_FLAG_PREALLOC) && !write)) { 561 + iomap->addr = IOMAP_NULL_ADDR; 562 + iomap->type = IOMAP_HOLE; 563 + } else { 564 + iomap->addr = extent_map_block_start(em) + (start - em->start); 565 + iomap->type = IOMAP_MAPPED; 566 + } 567 + iomap->offset = start; 568 + iomap->bdev = fs_info->fs_devices->latest_dev->bdev; 569 + iomap->length = len; 570 + free_extent_map(em); 571 + 572 + return 0; 573 + 574 + unlock_err: 575 + unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend, 576 + &cached_state); 577 + err: 578 + if (dio_data->data_space_reserved) { 579 + btrfs_free_reserved_data_space(BTRFS_I(inode), 580 + dio_data->data_reserved, 581 + start, data_alloc_len); 582 + extent_changeset_free(dio_data->data_reserved); 583 + } 584 + 585 + return ret; 586 + } 587 + 588 + static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length, 589 + ssize_t written, unsigned int flags, struct iomap *iomap) 590 + { 591 + struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap); 592 + struct btrfs_dio_data *dio_data = iter->private; 593 + size_t submitted = dio_data->submitted; 594 + const bool write = !!(flags & IOMAP_WRITE); 595 + int ret = 0; 596 + 597 + if (!write && (iomap->type == IOMAP_HOLE)) { 598 + /* If reading from a hole, unlock and return */ 599 + unlock_extent(&BTRFS_I(inode)->io_tree, pos, pos + length - 1, 
600 + NULL); 601 + return 0; 602 + } 603 + 604 + if (submitted < length) { 605 + pos += submitted; 606 + length -= submitted; 607 + if (write) 608 + btrfs_finish_ordered_extent(dio_data->ordered, NULL, 609 + pos, length, false); 610 + else 611 + unlock_extent(&BTRFS_I(inode)->io_tree, pos, 612 + pos + length - 1, NULL); 613 + ret = -ENOTBLK; 614 + } 615 + if (write) { 616 + btrfs_put_ordered_extent(dio_data->ordered); 617 + dio_data->ordered = NULL; 618 + } 619 + 620 + if (write) 621 + extent_changeset_free(dio_data->data_reserved); 622 + return ret; 623 + } 624 + 625 + static void btrfs_dio_end_io(struct btrfs_bio *bbio) 626 + { 627 + struct btrfs_dio_private *dip = 628 + container_of(bbio, struct btrfs_dio_private, bbio); 629 + struct btrfs_inode *inode = bbio->inode; 630 + struct bio *bio = &bbio->bio; 631 + 632 + if (bio->bi_status) { 633 + btrfs_warn(inode->root->fs_info, 634 + "direct IO failed ino %llu op 0x%0x offset %#llx len %u err no %d", 635 + btrfs_ino(inode), bio->bi_opf, 636 + dip->file_offset, dip->bytes, bio->bi_status); 637 + } 638 + 639 + if (btrfs_op(bio) == BTRFS_MAP_WRITE) { 640 + btrfs_finish_ordered_extent(bbio->ordered, NULL, 641 + dip->file_offset, dip->bytes, 642 + !bio->bi_status); 643 + } else { 644 + unlock_extent(&inode->io_tree, dip->file_offset, 645 + dip->file_offset + dip->bytes - 1, NULL); 646 + } 647 + 648 + bbio->bio.bi_private = bbio->private; 649 + iomap_dio_bio_end_io(bio); 650 + } 651 + 652 + static int btrfs_extract_ordered_extent(struct btrfs_bio *bbio, 653 + struct btrfs_ordered_extent *ordered) 654 + { 655 + u64 start = (u64)bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT; 656 + u64 len = bbio->bio.bi_iter.bi_size; 657 + struct btrfs_ordered_extent *new; 658 + int ret; 659 + 660 + /* Must always be called for the beginning of an ordered extent. */ 661 + if (WARN_ON_ONCE(start != ordered->disk_bytenr)) 662 + return -EINVAL; 663 + 664 + /* No need to split if the ordered extent covers the entire bio. 
*/ 665 + if (ordered->disk_num_bytes == len) { 666 + refcount_inc(&ordered->refs); 667 + bbio->ordered = ordered; 668 + return 0; 669 + } 670 + 671 + /* 672 + * Don't split the extent_map for NOCOW extents, as we're writing into 673 + * a pre-existing one. 674 + */ 675 + if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags)) { 676 + ret = split_extent_map(bbio->inode, bbio->file_offset, 677 + ordered->num_bytes, len, 678 + ordered->disk_bytenr); 679 + if (ret) 680 + return ret; 681 + } 682 + 683 + new = btrfs_split_ordered_extent(ordered, len); 684 + if (IS_ERR(new)) 685 + return PTR_ERR(new); 686 + bbio->ordered = new; 687 + return 0; 688 + } 689 + 690 + static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio, 691 + loff_t file_offset) 692 + { 693 + struct btrfs_bio *bbio = btrfs_bio(bio); 694 + struct btrfs_dio_private *dip = 695 + container_of(bbio, struct btrfs_dio_private, bbio); 696 + struct btrfs_dio_data *dio_data = iter->private; 697 + 698 + btrfs_bio_init(bbio, BTRFS_I(iter->inode)->root->fs_info, 699 + btrfs_dio_end_io, bio->bi_private); 700 + bbio->inode = BTRFS_I(iter->inode); 701 + bbio->file_offset = file_offset; 702 + 703 + dip->file_offset = file_offset; 704 + dip->bytes = bio->bi_iter.bi_size; 705 + 706 + dio_data->submitted += bio->bi_iter.bi_size; 707 + 708 + /* 709 + * Check if we are doing a partial write. If we are, we need to split 710 + * the ordered extent to match the submitted bio. Hang on to the 711 + * remaining unfinishable ordered_extent in dio_data so that it can be 712 + * cancelled in iomap_end to avoid a deadlock wherein faulting the 713 + * remaining pages is blocked on the outstanding ordered extent. 
714 + */ 715 + if (iter->flags & IOMAP_WRITE) { 716 + int ret; 717 + 718 + ret = btrfs_extract_ordered_extent(bbio, dio_data->ordered); 719 + if (ret) { 720 + btrfs_finish_ordered_extent(dio_data->ordered, NULL, 721 + file_offset, dip->bytes, 722 + !ret); 723 + bio->bi_status = errno_to_blk_status(ret); 724 + iomap_dio_bio_end_io(bio); 725 + return; 726 + } 727 + } 728 + 729 + btrfs_submit_bio(bbio, 0); 730 + } 731 + 732 + static const struct iomap_ops btrfs_dio_iomap_ops = { 733 + .iomap_begin = btrfs_dio_iomap_begin, 734 + .iomap_end = btrfs_dio_iomap_end, 735 + }; 736 + 737 + static const struct iomap_dio_ops btrfs_dio_ops = { 738 + .submit_io = btrfs_dio_submit_io, 739 + .bio_set = &btrfs_dio_bioset, 740 + }; 741 + 742 + static ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter, 743 + size_t done_before) 744 + { 745 + struct btrfs_dio_data data = { 0 }; 746 + 747 + return iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops, 748 + IOMAP_DIO_PARTIAL, &data, done_before); 749 + } 750 + 751 + static struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter, 752 + size_t done_before) 753 + { 754 + struct btrfs_dio_data data = { 0 }; 755 + 756 + return __iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops, 757 + IOMAP_DIO_PARTIAL, &data, done_before); 758 + } 759 + 760 + static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info, 761 + const struct iov_iter *iter, loff_t offset) 762 + { 763 + const u32 blocksize_mask = fs_info->sectorsize - 1; 764 + 765 + if (offset & blocksize_mask) 766 + return -EINVAL; 767 + 768 + if (iov_iter_alignment(iter) & blocksize_mask) 769 + return -EINVAL; 770 + 771 + return 0; 772 + } 773 + 774 + ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from) 775 + { 776 + struct file *file = iocb->ki_filp; 777 + struct inode *inode = file_inode(file); 778 + struct btrfs_fs_info *fs_info = inode_to_fs_info(inode); 779 + loff_t pos; 780 + ssize_t written = 0; 781 + ssize_t 
written_buffered; 782 + size_t prev_left = 0; 783 + loff_t endbyte; 784 + ssize_t ret; 785 + unsigned int ilock_flags = 0; 786 + struct iomap_dio *dio; 787 + 788 + if (iocb->ki_flags & IOCB_NOWAIT) 789 + ilock_flags |= BTRFS_ILOCK_TRY; 790 + 791 + /* 792 + * If the write DIO is within EOF, use a shared lock and also only if 793 + * security bits will likely not be dropped by file_remove_privs() called 794 + * from btrfs_write_check(). Either will need to be rechecked after the 795 + * lock was acquired. 796 + */ 797 + if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode) && IS_NOSEC(inode)) 798 + ilock_flags |= BTRFS_ILOCK_SHARED; 799 + 800 + relock: 801 + ret = btrfs_inode_lock(BTRFS_I(inode), ilock_flags); 802 + if (ret < 0) 803 + return ret; 804 + 805 + /* Shared lock cannot be used with security bits set. */ 806 + if ((ilock_flags & BTRFS_ILOCK_SHARED) && !IS_NOSEC(inode)) { 807 + btrfs_inode_unlock(BTRFS_I(inode), ilock_flags); 808 + ilock_flags &= ~BTRFS_ILOCK_SHARED; 809 + goto relock; 810 + } 811 + 812 + ret = generic_write_checks(iocb, from); 813 + if (ret <= 0) { 814 + btrfs_inode_unlock(BTRFS_I(inode), ilock_flags); 815 + return ret; 816 + } 817 + 818 + ret = btrfs_write_check(iocb, from, ret); 819 + if (ret < 0) { 820 + btrfs_inode_unlock(BTRFS_I(inode), ilock_flags); 821 + goto out; 822 + } 823 + 824 + pos = iocb->ki_pos; 825 + /* 826 + * Re-check since file size may have changed just before taking the 827 + * lock or pos may have changed because of O_APPEND in generic_write_check() 828 + */ 829 + if ((ilock_flags & BTRFS_ILOCK_SHARED) && 830 + pos + iov_iter_count(from) > i_size_read(inode)) { 831 + btrfs_inode_unlock(BTRFS_I(inode), ilock_flags); 832 + ilock_flags &= ~BTRFS_ILOCK_SHARED; 833 + goto relock; 834 + } 835 + 836 + if (check_direct_IO(fs_info, from, pos)) { 837 + btrfs_inode_unlock(BTRFS_I(inode), ilock_flags); 838 + goto buffered; 839 + } 840 + 841 + /* 842 + * The iov_iter can be mapped to the same file range we are writing to. 
843 + * If that's the case, then we will deadlock in the iomap code, because 844 + * it first calls our callback btrfs_dio_iomap_begin(), which will create 845 + * an ordered extent, and after that it will fault in the pages that the 846 + * iov_iter refers to. During the fault in we end up in the readahead 847 + * pages code (starting at btrfs_readahead()), which will lock the range, 848 + * find that ordered extent and then wait for it to complete (at 849 + * btrfs_lock_and_flush_ordered_range()), resulting in a deadlock since 850 + * obviously the ordered extent can never complete as we didn't submit 851 + * yet the respective bio(s). This always happens when the buffer is 852 + * memory mapped to the same file range, since the iomap DIO code always 853 + * invalidates pages in the target file range (after starting and waiting 854 + * for any writeback). 855 + * 856 + * So here we disable page faults in the iov_iter and then retry if we 857 + * got -EFAULT, faulting in the pages before the retry. 858 + */ 859 + from->nofault = true; 860 + dio = btrfs_dio_write(iocb, from, written); 861 + from->nofault = false; 862 + 863 + /* 864 + * iomap_dio_complete() will call btrfs_sync_file() if we have a dsync 865 + * iocb, and that needs to lock the inode. So unlock it before calling 866 + * iomap_dio_complete() to avoid a deadlock. 867 + */ 868 + btrfs_inode_unlock(BTRFS_I(inode), ilock_flags); 869 + 870 + if (IS_ERR_OR_NULL(dio)) 871 + ret = PTR_ERR_OR_ZERO(dio); 872 + else 873 + ret = iomap_dio_complete(dio); 874 + 875 + /* No increment (+=) because iomap returns a cumulative value. */ 876 + if (ret > 0) 877 + written = ret; 878 + 879 + if (iov_iter_count(from) > 0 && (ret == -EFAULT || ret > 0)) { 880 + const size_t left = iov_iter_count(from); 881 + /* 882 + * We have more data left to write. Try to fault in as many as 883 + * possible of the remainder pages and retry. 
We do this without 884 + * releasing and locking again the inode, to prevent races with 885 + * truncate. 886 + * 887 + * Also, in case the iov refers to pages in the file range of the 888 + * file we want to write to (due to a mmap), we could enter an 889 + * infinite loop if we retry after faulting the pages in, since 890 + * iomap will invalidate any pages in the range early on, before 891 + * it tries to fault in the pages of the iov. So we keep track of 892 + * how much was left of iov in the previous EFAULT and fallback 893 + * to buffered IO in case we haven't made any progress. 894 + */ 895 + if (left == prev_left) { 896 + ret = -ENOTBLK; 897 + } else { 898 + fault_in_iov_iter_readable(from, left); 899 + prev_left = left; 900 + goto relock; 901 + } 902 + } 903 + 904 + /* 905 + * If 'ret' is -ENOTBLK or we have not written all data, then it means 906 + * we must fallback to buffered IO. 907 + */ 908 + if ((ret < 0 && ret != -ENOTBLK) || !iov_iter_count(from)) 909 + goto out; 910 + 911 + buffered: 912 + /* 913 + * If we are in a NOWAIT context, then return -EAGAIN to signal the caller 914 + * it must retry the operation in a context where blocking is acceptable, 915 + * because even if we end up not blocking during the buffered IO attempt 916 + * below, we will block when flushing and waiting for the IO. 917 + */ 918 + if (iocb->ki_flags & IOCB_NOWAIT) { 919 + ret = -EAGAIN; 920 + goto out; 921 + } 922 + 923 + pos = iocb->ki_pos; 924 + written_buffered = btrfs_buffered_write(iocb, from); 925 + if (written_buffered < 0) { 926 + ret = written_buffered; 927 + goto out; 928 + } 929 + /* 930 + * Ensure all data is persisted. We want the next direct IO read to be 931 + * able to read what was just written. 
932 + */ 933 + endbyte = pos + written_buffered - 1; 934 + ret = btrfs_fdatawrite_range(BTRFS_I(inode), pos, endbyte); 935 + if (ret) 936 + goto out; 937 + ret = filemap_fdatawait_range(inode->i_mapping, pos, endbyte); 938 + if (ret) 939 + goto out; 940 + written += written_buffered; 941 + iocb->ki_pos = pos + written_buffered; 942 + invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT, 943 + endbyte >> PAGE_SHIFT); 944 + out: 945 + return ret < 0 ? ret : written; 946 + } 947 + 948 + static int check_direct_read(struct btrfs_fs_info *fs_info, 949 + const struct iov_iter *iter, loff_t offset) 950 + { 951 + int ret; 952 + int i, seg; 953 + 954 + ret = check_direct_IO(fs_info, iter, offset); 955 + if (ret < 0) 956 + return ret; 957 + 958 + if (!iter_is_iovec(iter)) 959 + return 0; 960 + 961 + for (seg = 0; seg < iter->nr_segs; seg++) { 962 + for (i = seg + 1; i < iter->nr_segs; i++) { 963 + const struct iovec *iov1 = iter_iov(iter) + seg; 964 + const struct iovec *iov2 = iter_iov(iter) + i; 965 + 966 + if (iov1->iov_base == iov2->iov_base) 967 + return -EINVAL; 968 + } 969 + } 970 + return 0; 971 + } 972 + 973 + ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to) 974 + { 975 + struct inode *inode = file_inode(iocb->ki_filp); 976 + size_t prev_left = 0; 977 + ssize_t read = 0; 978 + ssize_t ret; 979 + 980 + if (fsverity_active(inode)) 981 + return 0; 982 + 983 + if (check_direct_read(inode_to_fs_info(inode), to, iocb->ki_pos)) 984 + return 0; 985 + 986 + btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_SHARED); 987 + again: 988 + /* 989 + * This is similar to what we do for direct IO writes, see the comment 990 + * at btrfs_direct_write(), but we also disable page faults in addition 991 + * to disabling them only at the iov_iter level. This is because when 992 + * reading from a hole or prealloc extent, iomap calls iov_iter_zero(), 993 + * which can still trigger page fault ins despite having set ->nofault 994 + * to true of our 'to' iov_iter. 
995 + * 996 + * The difference to direct IO writes is that we deadlock when trying 997 + * to lock the extent range in the inode's tree during the page reads 998 + * triggered by the fault in (while for writes it is due to waiting for 999 + * our own ordered extent). This is because for direct IO reads, 1000 + * btrfs_dio_iomap_begin() returns with the extent range locked, which 1001 + * is only unlocked in the endio callback (end_bio_extent_readpage()). 1002 + */ 1003 + pagefault_disable(); 1004 + to->nofault = true; 1005 + ret = btrfs_dio_read(iocb, to, read); 1006 + to->nofault = false; 1007 + pagefault_enable(); 1008 + 1009 + /* No increment (+=) because iomap returns a cumulative value. */ 1010 + if (ret > 0) 1011 + read = ret; 1012 + 1013 + if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) { 1014 + const size_t left = iov_iter_count(to); 1015 + 1016 + if (left == prev_left) { 1017 + /* 1018 + * We didn't make any progress since the last attempt, 1019 + * fallback to a buffered read for the remainder of the 1020 + * range. This is just to avoid any possibility of looping 1021 + * for too long. 1022 + */ 1023 + ret = read; 1024 + } else { 1025 + /* 1026 + * We made some progress since the last retry or this is 1027 + * the first time we are retrying. Fault in as many pages 1028 + * as possible and retry. 1029 + */ 1030 + fault_in_iov_iter_writeable(to, left); 1031 + prev_left = left; 1032 + goto again; 1033 + } 1034 + } 1035 + btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_SHARED); 1036 + return ret < 0 ? ret : read; 1037 + } 1038 + 1039 + int __init btrfs_init_dio(void) 1040 + { 1041 + if (bioset_init(&btrfs_dio_bioset, BIO_POOL_SIZE, 1042 + offsetof(struct btrfs_dio_private, bbio.bio), 1043 + BIOSET_NEED_BVECS)) 1044 + return -ENOMEM; 1045 + 1046 + return 0; 1047 + } 1048 + 1049 + void __cold btrfs_destroy_dio(void) 1050 + { 1051 + bioset_exit(&btrfs_dio_bioset); 1052 + }
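The retry logic in btrfs_direct_write() and btrfs_direct_read() above records how much of the iov_iter was left after the previous attempt and gives up on direct IO once an attempt makes no progress. The following is a minimal userspace sketch of that idea only, not the kernel code: `transfer_with_progress_check()`, `attempt_fn`, `attempt_with_budget()` and `enum dio_outcome` are hypothetical names invented for illustration, and the page fault-in step is reduced to a comment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch; not a kernel type. */
enum dio_outcome {
	DIO_DONE,	/* the direct path transferred everything */
	DIO_FALL_BACK,	/* an attempt made no progress, use the buffered path */
};

/*
 * Hypothetical stand-in for one direct IO attempt: it is asked to move
 * 'left' bytes and returns how many it actually moved (possibly 0).
 */
typedef size_t (*attempt_fn)(size_t left, void *ctx);

enum dio_outcome transfer_with_progress_check(size_t total, attempt_fn attempt,
					      void *ctx, size_t *done)
{
	size_t left = total;
	size_t prev_left = 0;

	assert(attempt != NULL && done != NULL);

	for (;;) {
		size_t moved = attempt(left, ctx);

		/* Never trust the callback to stay within bounds. */
		if (moved > left)
			moved = left;
		left -= moved;
		*done = total - left;

		if (left == 0)
			return DIO_DONE;
		/* Same amount left as after the previous attempt: give up. */
		if (left == prev_left)
			return DIO_FALL_BACK;

		/* The kernel code faults in the user pages here before retrying. */
		prev_left = left;
	}
}

/* Example attempt: moves at most 4096 bytes per call until a budget runs out. */
size_t attempt_with_budget(size_t left, void *ctx)
{
	size_t *budget = ctx;
	size_t n = left < 4096 ? left : 4096;

	if (n > *budget)
		n = *budget;
	*budget -= n;
	return n;
}
```

With a budget smaller than the request, the loop keeps retrying while each attempt still moves bytes and reports DIO_FALL_BACK as soon as two consecutive attempts leave the same amount outstanding, which mirrors the `left == prev_left` test in the diff.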
+14
fs/btrfs/direct-io.h
···
+ /* SPDX-License-Identifier: GPL-2.0 */
+
+ #ifndef BTRFS_DIRECT_IO_H
+ #define BTRFS_DIRECT_IO_H
+
+ #include <linux/types.h>
+
+ int __init btrfs_init_dio(void);
+ void __cold btrfs_destroy_dio(void);
+
+ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from);
+ ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to);
+
+ #endif /* BTRFS_DIRECT_IO_H */
+61 -67
fs/btrfs/disk-io.c
··· 213 213 * structure for details. 214 214 */ 215 215 int btrfs_read_extent_buffer(struct extent_buffer *eb, 216 - struct btrfs_tree_parent_check *check) 216 + const struct btrfs_tree_parent_check *check) 217 217 { 218 218 struct btrfs_fs_info *fs_info = eb->fs_info; 219 219 int failed = 0; ··· 358 358 359 359 /* Do basic extent buffer checks at read time */ 360 360 int btrfs_validate_extent_buffer(struct extent_buffer *eb, 361 - struct btrfs_tree_parent_check *check) 361 + const struct btrfs_tree_parent_check *check) 362 362 { 363 363 struct btrfs_fs_info *fs_info = eb->fs_info; 364 364 u64 found_start; ··· 367 367 u8 result[BTRFS_CSUM_SIZE]; 368 368 const u8 *header_csum; 369 369 int ret = 0; 370 + const bool ignore_csum = btrfs_test_opt(fs_info, IGNOREMETACSUMS); 370 371 371 372 ASSERT(check); 372 373 ··· 400 399 401 400 if (memcmp(result, header_csum, csum_size) != 0) { 402 401 btrfs_warn_rl(fs_info, 403 - "checksum verify failed on logical %llu mirror %u wanted " CSUM_FMT " found " CSUM_FMT " level %d", 402 + "checksum verify failed on logical %llu mirror %u wanted " CSUM_FMT " found " CSUM_FMT " level %d%s", 404 403 eb->start, eb->read_mirror, 405 404 CSUM_FMT_VALUE(csum_size, header_csum), 406 405 CSUM_FMT_VALUE(csum_size, result), 407 - btrfs_header_level(eb)); 408 - ret = -EUCLEAN; 409 - goto out; 406 + btrfs_header_level(eb), 407 + ignore_csum ? 
", ignored" : ""); 408 + if (!ignore_csum) { 409 + ret = -EUCLEAN; 410 + goto out; 411 + } 410 412 } 411 413 412 414 if (found_level != check->level) { ··· 429 425 goto out; 430 426 } 431 427 if (check->has_first_key) { 432 - struct btrfs_key *expect_key = &check->first_key; 428 + const struct btrfs_key *expect_key = &check->first_key; 433 429 struct btrfs_key found_key; 434 430 435 431 if (found_level) ··· 639 635 free_extent_buffer_stale(buf); 640 636 return ERR_PTR(ret); 641 637 } 642 - if (btrfs_check_eb_owner(buf, check->owner_root)) { 643 - free_extent_buffer_stale(buf); 644 - return ERR_PTR(-EUCLEAN); 645 - } 646 638 return buf; 647 639 648 640 } ··· 658 658 root->state = 0; 659 659 RB_CLEAR_NODE(&root->rb_node); 660 660 661 - root->last_trans = 0; 661 + btrfs_set_root_last_trans(root, 0); 662 662 root->free_objectid = 0; 663 663 root->nr_delalloc_inodes = 0; 664 664 root->nr_ordered_extents = 0; 665 - root->inode_tree = RB_ROOT; 665 + xa_init(&root->inodes); 666 666 xa_init(&root->delayed_nodes); 667 667 668 668 btrfs_init_root_block_rsv(root); ··· 674 674 INIT_LIST_HEAD(&root->ordered_extents); 675 675 INIT_LIST_HEAD(&root->ordered_root); 676 676 INIT_LIST_HEAD(&root->reloc_dirty_list); 677 - spin_lock_init(&root->inode_lock); 678 677 spin_lock_init(&root->delalloc_lock); 679 678 spin_lock_init(&root->ordered_extent_lock); 680 679 spin_lock_init(&root->accounting_lock); ··· 846 847 return btrfs_global_root(fs_info, &key); 847 848 } 848 849 849 - struct btrfs_root *btrfs_block_group_root(struct btrfs_fs_info *fs_info) 850 - { 851 - if (btrfs_fs_compat_ro(fs_info, BLOCK_GROUP_TREE)) 852 - return fs_info->block_group_root; 853 - return btrfs_extent_root(fs_info, 0); 854 - } 855 - 856 850 struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans, 857 851 u64 objectid) 858 852 { ··· 1002 1010 return ret; 1003 1011 } 1004 1012 1005 - log_root->last_trans = trans->transid; 1013 + btrfs_set_root_last_trans(log_root, trans->transid); 1006 1014 
log_root->root_key.offset = btrfs_root_id(root); 1007 1015 1008 1016 inode_item = &log_root->root_item.inode; ··· 1025 1033 1026 1034 static struct btrfs_root *read_tree_root_path(struct btrfs_root *tree_root, 1027 1035 struct btrfs_path *path, 1028 - struct btrfs_key *key) 1036 + const struct btrfs_key *key) 1029 1037 { 1030 1038 struct btrfs_root *root; 1031 1039 struct btrfs_tree_parent_check check = { 0 }; ··· 1087 1095 } 1088 1096 1089 1097 struct btrfs_root *btrfs_read_tree_root(struct btrfs_root *tree_root, 1090 - struct btrfs_key *key) 1098 + const struct btrfs_key *key) 1091 1099 { 1092 1100 struct btrfs_root *root; 1093 1101 struct btrfs_path *path; ··· 1222 1230 return ret; 1223 1231 } 1224 1232 1225 - void btrfs_check_leaked_roots(struct btrfs_fs_info *fs_info) 1233 + void btrfs_check_leaked_roots(const struct btrfs_fs_info *fs_info) 1226 1234 { 1227 1235 #ifdef CONFIG_BTRFS_DEBUG 1228 1236 struct btrfs_root *root; ··· 1846 1854 return; 1847 1855 1848 1856 if (refcount_dec_and_test(&root->refs)) { 1849 - WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree)); 1857 + if (WARN_ON(!xa_empty(&root->inodes))) 1858 + xa_destroy(&root->inodes); 1850 1859 WARN_ON(test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state)); 1851 1860 if (root->anon_dev) 1852 1861 free_anon_bdev(root->anon_dev); ··· 1921 1928 if (!inode) 1922 1929 return -ENOMEM; 1923 1930 1924 - inode->i_ino = BTRFS_BTREE_INODE_OBJECTID; 1931 + btrfs_set_inode_number(BTRFS_I(inode), BTRFS_BTREE_INODE_OBJECTID); 1925 1932 set_nlink(inode, 1); 1926 1933 /* 1927 1934 * we set the i_size on the btree inode to the max possible int. 
··· 1932 1939 inode->i_mapping->a_ops = &btree_aops; 1933 1940 mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS); 1934 1941 1935 - RB_CLEAR_NODE(&BTRFS_I(inode)->rb_node); 1936 1942 extent_io_tree_init(fs_info, &BTRFS_I(inode)->io_tree, 1937 1943 IO_TREE_BTREE_INODE_IO); 1938 1944 extent_map_tree_init(&BTRFS_I(inode)->extent_tree); 1939 1945 1940 1946 BTRFS_I(inode)->root = btrfs_grab_root(fs_info->tree_root); 1941 - BTRFS_I(inode)->location.objectid = BTRFS_BTREE_INODE_OBJECTID; 1942 - BTRFS_I(inode)->location.type = 0; 1943 - BTRFS_I(inode)->location.offset = 0; 1944 1947 set_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags); 1945 1948 __insert_inode_hash(inode, hash); 1946 1949 fs_info->btree_inode = inode; ··· 2135 2146 /* If we have IGNOREDATACSUMS skip loading these roots. */ 2136 2147 if (objectid == BTRFS_CSUM_TREE_OBJECTID && 2137 2148 btrfs_test_opt(fs_info, IGNOREDATACSUMS)) { 2138 - set_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state); 2149 + set_bit(BTRFS_FS_STATE_NO_DATA_CSUMS, &fs_info->fs_state); 2139 2150 return 0; 2140 2151 } 2141 2152 ··· 2188 2199 2189 2200 if (!found || ret) { 2190 2201 if (objectid == BTRFS_CSUM_TREE_OBJECTID) 2191 - set_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state); 2202 + set_bit(BTRFS_FS_STATE_NO_DATA_CSUMS, &fs_info->fs_state); 2192 2203 2193 2204 if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) 2194 2205 ret = ret ? 
ret : -ENOENT;
···
  * 1, 2	2nd and 3rd backup copy
  * -1	skip bytenr check
  */
-int btrfs_validate_super(struct btrfs_fs_info *fs_info,
-			 struct btrfs_super_block *sb, int mirror_num)
+int btrfs_validate_super(const struct btrfs_fs_info *fs_info,
+			 const struct btrfs_super_block *sb, int mirror_num)
 {
 	u64 nodesize = btrfs_super_nodesize(sb);
 	u64 sectorsize = btrfs_super_sectorsize(sb);
 	int ret = 0;
+	const bool ignore_flags = btrfs_test_opt(fs_info, IGNORESUPERFLAGS);

 	if (btrfs_super_magic(sb) != BTRFS_MAGIC) {
 		btrfs_err(fs_info, "no valid FS found");
 		ret = -EINVAL;
 	}
-	if (btrfs_super_flags(sb) & ~BTRFS_SUPER_FLAG_SUPP) {
-		btrfs_err(fs_info, "unrecognized or unsupported super flag: %llu",
-				btrfs_super_flags(sb) & ~BTRFS_SUPER_FLAG_SUPP);
-		ret = -EINVAL;
+	if ((btrfs_super_flags(sb) & ~BTRFS_SUPER_FLAG_SUPP)) {
+		if (!ignore_flags) {
+			btrfs_err(fs_info,
+			"unrecognized or unsupported super flag 0x%llx",
+				  btrfs_super_flags(sb) & ~BTRFS_SUPER_FLAG_SUPP);
+			ret = -EINVAL;
+		} else {
+			btrfs_info(fs_info,
+			"unrecognized or unsupported super flags: 0x%llx, ignored",
+				   btrfs_super_flags(sb) & ~BTRFS_SUPER_FLAG_SUPP);
+		}
 	}
 	if (btrfs_super_root_level(sb) >= BTRFS_MAX_LEVEL) {
 		btrfs_err(fs_info, "tree_root level too big: %d >= %d",
···
 	    (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE_VALID) ||
 	     !btrfs_fs_incompat(fs_info, NO_HOLES))) {
 		btrfs_err(fs_info,
-		"block-group-tree feature requires fres-space-tree and no-holes");
+		"block-group-tree feature requires free-space-tree and no-holes");
 		ret = -EINVAL;
 	}

···

 	if (sb_rdonly(sb))
 		set_bit(BTRFS_FS_STATE_RO, &fs_info->fs_state);
+	if (btrfs_test_opt(fs_info, IGNOREMETACSUMS))
+		set_bit(BTRFS_FS_STATE_SKIP_META_CSUMS, &fs_info->fs_state);

 	return btrfs_alloc_stripe_hash_table(fs_info);
 }
···
 {
 	u64 root_objectid = 0;
 	struct btrfs_root *gang[8];
-	int i = 0;
-	int err = 0;
-	unsigned int ret = 0;
+	int ret = 0;

 	while (1) {
+		unsigned int found;
+
 		spin_lock(&fs_info->fs_roots_radix_lock);
-		ret = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
+		found = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
 					     (void **)gang, root_objectid,
 					     ARRAY_SIZE(gang));
-		if (!ret) {
+		if (!found) {
 			spin_unlock(&fs_info->fs_roots_radix_lock);
 			break;
 		}
-		root_objectid = btrfs_root_id(gang[ret - 1]) + 1;
+		root_objectid = btrfs_root_id(gang[found - 1]) + 1;

-		for (i = 0; i < ret; i++) {
+		for (int i = 0; i < found; i++) {
 			/* Avoid to grab roots in dead_roots. */
 			if (btrfs_root_refs(&gang[i]->root_item) == 0) {
 				gang[i] = NULL;
···
 		}
 		spin_unlock(&fs_info->fs_roots_radix_lock);

-		for (i = 0; i < ret; i++) {
+		for (int i = 0; i < found; i++) {
 			if (!gang[i])
 				continue;
 			root_objectid = btrfs_root_id(gang[i]);
-			err = btrfs_orphan_cleanup(gang[i]);
-			if (err)
-				goto out;
+			/*
+			 * Continue to release the remaining roots after the first
+			 * error without cleanup and preserve the first error
+			 * for the return.
+			 */
+			if (!ret)
+				ret = btrfs_orphan_cleanup(gang[i]);
 			btrfs_put_root(gang[i]);
 		}
+		if (ret)
+			break;
+
 		root_objectid++;
 	}
-out:
-	/* Release the uncleaned roots due to error. */
-	for (; i < ret; i++) {
-		if (gang[i])
-			btrfs_put_root(gang[i]);
-	}
-	return err;
+	return ret;
 }

 /*
···
 }

 int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_devices,
-		      char *options)
+		      const char *options)
 {
 	u32 sectorsize;
 	u32 nodesize;
···

 int btrfs_commit_super(struct btrfs_fs_info *fs_info)
 {
-	struct btrfs_root *root = fs_info->tree_root;
-	struct btrfs_trans_handle *trans;
-
 	mutex_lock(&fs_info->cleaner_mutex);
 	btrfs_run_delayed_iputs(fs_info);
 	mutex_unlock(&fs_info->cleaner_mutex);
···
 	down_write(&fs_info->cleanup_work_sem);
 	up_write(&fs_info->cleanup_work_sem);

-	trans = btrfs_join_transaction(root);
-	if (IS_ERR(trans))
-		return PTR_ERR(trans);
-	return btrfs_commit_transaction(trans);
+	return btrfs_commit_current_transaction(fs_info->tree_root);
 }

 static void warn_about_uncommitted_trans(struct btrfs_fs_info *fs_info)
···
 	 * extents that haven't had their dirty pages IO start writeout yet
 	 * actually get run and error out properly.
 	 */
-	btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
+	btrfs_wait_ordered_roots(fs_info, U64_MAX, NULL);
 }

 static void btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
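The gang-lookup hunk above drops the goto-based error path: once a cleanup call fails, no further cleanup is attempted, but every looked-up root still has its reference released and the first error is what gets returned. The following is a minimal user-space sketch of that pattern only. `struct item`, `cleanup_one()`, `put_item()` and `cleanup_batch()` are hypothetical stand-ins invented for illustration, not kernel APIs.

```c
#include <stddef.h>

/* Hypothetical stand-in for a refcounted root; not a kernel type. */
struct item {
	int refs;	/* reference taken by the batch lookup */
	int fail_with;	/* error cleanup_one() should report, 0 for success */
	int cleaned;	/* set once cleanup_one() has run on this item */
};

/* Hypothetical per-item cleanup, in the role of btrfs_orphan_cleanup(). */
static int cleanup_one(struct item *it)
{
	it->cleaned = 1;
	return it->fail_with;
}

/* Hypothetical reference drop, in the role of btrfs_put_root(). */
static void put_item(struct item *it)
{
	it->refs--;
}

/*
 * Walk one batch. After the first error no further cleanup is attempted,
 * but every remaining reference is still dropped, and the first error is
 * the value the caller sees.
 */
static int cleanup_batch(struct item **batch, size_t count)
{
	int ret = 0;

	for (size_t i = 0; i < count; i++) {
		if (!batch[i])
			continue;
		if (!ret)
			ret = cleanup_one(batch[i]);
		put_item(batch[i]);
	}
	return ret;
}
```

Folding the release into the main loop removes the need for a separate "release what is left" loop after an `out:` label, which is the simplification the hunk makes.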
+8 -10
fs/btrfs/disk-io.h
···
 	return BTRFS_SUPER_INFO_OFFSET;
 }

-void btrfs_check_leaked_roots(struct btrfs_fs_info *fs_info);
+void btrfs_check_leaked_roots(const struct btrfs_fs_info *fs_info);
 void btrfs_init_fs_info(struct btrfs_fs_info *fs_info);
 struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr,
 				      struct btrfs_tree_parent_check *check);
···
 int btrfs_start_pre_rw_mount(struct btrfs_fs_info *fs_info);
 int btrfs_check_super_csum(struct btrfs_fs_info *fs_info,
 			   const struct btrfs_super_block *disk_sb);
-int __cold open_ctree(struct super_block *sb,
-	       struct btrfs_fs_devices *fs_devices,
-	       char *options);
+int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_devices,
+		      const char *options);
 void __cold close_ctree(struct btrfs_fs_info *fs_info);
-int btrfs_validate_super(struct btrfs_fs_info *fs_info,
-			 struct btrfs_super_block *sb, int mirror_num);
+int btrfs_validate_super(const struct btrfs_fs_info *fs_info,
+			 const struct btrfs_super_block *sb, int mirror_num);
 int btrfs_check_features(struct btrfs_fs_info *fs_info, bool is_rw_mount);
 int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors);
 struct btrfs_super_block *btrfs_read_dev_super(struct block_device *bdev);
···
 					      int copy_num, bool drop_cache);
 int btrfs_commit_super(struct btrfs_fs_info *fs_info);
 struct btrfs_root *btrfs_read_tree_root(struct btrfs_root *tree_root,
-					struct btrfs_key *key);
+					const struct btrfs_key *key);
 int btrfs_insert_fs_root(struct btrfs_fs_info *fs_info,
 			 struct btrfs_root *root);
 void btrfs_free_fs_roots(struct btrfs_fs_info *fs_info);
···
 					     struct btrfs_key *key);
 struct btrfs_root *btrfs_csum_root(struct btrfs_fs_info *fs_info, u64 bytenr);
 struct btrfs_root *btrfs_extent_root(struct btrfs_fs_info *fs_info, u64 bytenr);
-struct btrfs_root *btrfs_block_group_root(struct btrfs_fs_info *fs_info);

 void btrfs_free_fs_info(struct btrfs_fs_info *fs_info);
 void btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info);
···
 void btrfs_drop_and_free_fs_root(struct btrfs_fs_info *fs_info,
 				 struct btrfs_root *root);
 int btrfs_validate_extent_buffer(struct extent_buffer *eb,
-				 struct btrfs_tree_parent_check *check);
+				 const struct btrfs_tree_parent_check *check);
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 struct btrfs_root *btrfs_alloc_dummy_root(struct btrfs_fs_info *fs_info);
 #endif
···
 int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
 			  int atomic);
 int btrfs_read_extent_buffer(struct extent_buffer *buf,
-			     struct btrfs_tree_parent_check *check);
+			     const struct btrfs_tree_parent_check *check);

 blk_status_t btree_csum_one_bio(struct btrfs_bio *bbio);
 int btrfs_alloc_log_tree_node(struct btrfs_trans_handle *trans,
+3 -3
fs/btrfs/export.c
···
 	if (parent) {
 		u64 parent_root_id;

-		fid->parent_objectid = BTRFS_I(parent)->location.objectid;
+		fid->parent_objectid = btrfs_ino(BTRFS_I(parent));
 		fid->parent_gen = parent->i_generation;
 		parent_root_id = btrfs_root_id(BTRFS_I(parent)->root);

···
 	if (IS_ERR(root))
 		return ERR_CAST(root);

-	inode = btrfs_iget(sb, objectid, root);
+	inode = btrfs_iget(objectid, root);
 	btrfs_put_root(root);
 	if (IS_ERR(inode))
 		return ERR_CAST(inode);
···
 					found_key.offset, 0);
 	}

-	return d_obtain_alias(btrfs_iget(fs_info->sb, key.objectid, root));
+	return d_obtain_alias(btrfs_iget(key.objectid, root));
 fail:
 	btrfs_free_path(path);
 	return ERR_PTR(ret);
+4
fs/btrfs/extent-io-tree.c
···
 #include <trace/events/btrfs.h>
 #include "messages.h"
 #include "ctree.h"
+#include "extent_io.h"
 #include "extent-io-tree.h"
 #include "btrfs_inode.h"

···
 		 */
 		prealloc = alloc_extent_state(mask);
 	}
+	/* Optimistically preallocate the extent changeset ulist node. */
+	if (changeset)
+		extent_changeset_prealloc(changeset, mask);

 	spin_lock(&tree->lock);
 	if (cached_state && *cached_state) {
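The added call above allocates a changeset node before `spin_lock(&tree->lock)` is taken, so the locked section does not have to allocate. A rough user-space sketch of that general idea follows, under loud assumptions: `struct node`, `struct changeset`, `changeset_prealloc()` and `changeset_add()` are hypothetical names, the lock is a toy C11 `atomic_flag` spinlock, and the "return -1 so the caller can retry" fallback is invented for the sketch. How the kernel handles a failed preallocation is not shown in this diff.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical list node; a stand-in for a ulist node. */
struct node {
	unsigned long val;
	struct node *next;
};

/* Hypothetical container; a stand-in for an extent changeset. */
struct changeset {
	atomic_flag lock;	/* toy spinlock guarding 'head' */
	struct node *head;
	struct node *prealloc;	/* spare node allocated before locking */
};

/* Allocate a spare node while no lock is held; failure is tolerated. */
static void changeset_prealloc(struct changeset *cs)
{
	if (!cs->prealloc)
		cs->prealloc = malloc(sizeof(*cs->prealloc));
}

/*
 * Record a value under the lock. The critical section never calls malloc():
 * it either consumes the spare node or reports that none was available.
 * Returns 0 on success, -1 if the caller has to preallocate and retry.
 */
static int changeset_add(struct changeset *cs, unsigned long val)
{
	int ret = -1;

	while (atomic_flag_test_and_set_explicit(&cs->lock, memory_order_acquire))
		;	/* spin until the lock is free */
	if (cs->prealloc) {
		struct node *n = cs->prealloc;

		cs->prealloc = NULL;
		n->val = val;
		n->next = cs->head;
		cs->head = n;
		ret = 0;
	}
	atomic_flag_clear_explicit(&cs->lock, memory_order_release);
	return ret;
}
```

A caller would invoke `changeset_prealloc()` first and `changeset_add()` second, mirroring the order in the hunk: allocate optimistically, then take the lock.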
+423 -262
fs/btrfs/extent-tree.c
···
 	struct btrfs_delayed_ref_head *head;
 	struct btrfs_delayed_ref_root *delayed_refs;
 	struct btrfs_path *path;
-	struct btrfs_extent_item *ei;
-	struct extent_buffer *leaf;
 	struct btrfs_key key;
-	u32 item_size;
 	u64 num_refs;
 	u64 extent_flags;
 	u64 owner = 0;
···
 	if (!path)
 		return -ENOMEM;

-	if (!trans) {
-		path->skip_locking = 1;
-		path->search_commit_root = 1;
-	}
-
 search_again:
 	key.objectid = bytenr;
 	key.offset = offset;
···
 	if (ret < 0)
 		goto out_free;

-	if (ret > 0 && metadata && key.type == BTRFS_METADATA_ITEM_KEY) {
+	if (ret > 0 && key.type == BTRFS_METADATA_ITEM_KEY) {
 		if (path->slots[0]) {
 			path->slots[0]--;
 			btrfs_item_key_to_cpu(path->nodes[0], &key,
···
 	}

 	if (ret == 0) {
-		leaf = path->nodes[0];
-		item_size = btrfs_item_size(leaf, path->slots[0]);
-		if (item_size >= sizeof(*ei)) {
-			ei = btrfs_item_ptr(leaf, path->slots[0],
-					    struct btrfs_extent_item);
-			num_refs = btrfs_extent_refs(leaf, ei);
-			extent_flags = btrfs_extent_flags(leaf, ei);
-			owner = btrfs_get_extent_owner_root(fs_info, leaf,
-							    path->slots[0]);
-		} else {
+		struct extent_buffer *leaf = path->nodes[0];
+		struct btrfs_extent_item *ei;
+		const u32 item_size = btrfs_item_size(leaf, path->slots[0]);
+
+		if (unlikely(item_size < sizeof(*ei))) {
 			ret = -EUCLEAN;
 			btrfs_err(fs_info,
 			"unexpected extent item size, has %u expect >= %zu",
 				  item_size, sizeof(*ei));
-			if (trans)
-				btrfs_abort_transaction(trans, ret);
-			else
-				btrfs_handle_fs_error(fs_info, ret, NULL);
-
+			btrfs_abort_transaction(trans, ret);
 			goto out_free;
 		}

-		BUG_ON(num_refs == 0);
+		ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
+		num_refs = btrfs_extent_refs(leaf, ei);
+		if (unlikely(num_refs == 0)) {
+			ret = -EUCLEAN;
+			btrfs_err(fs_info,
+		"unexpected zero reference count for extent item (%llu %u %llu)",
+				  key.objectid, key.type, key.offset);
+			btrfs_abort_transaction(trans, ret);
+			goto out_free;
+		}
+		extent_flags = btrfs_extent_flags(leaf, ei);
+		owner = btrfs_get_extent_owner_root(fs_info, leaf, path->slots[0]);
 	} else {
 		num_refs = 0;
 		extent_flags = 0;
 		ret = 0;
 	}
-
-	if (!trans)
-		goto out;

 	delayed_refs = &trans->transaction->delayed_refs;
 	spin_lock(&delayed_refs->lock);
···
 		spin_lock(&head->lock);
 		if (head->extent_op && head->extent_op->update_flags)
 			extent_flags |= head->extent_op->flags_to_set;
-		else
-			BUG_ON(num_refs == 0);

 		num_refs += head->ref_mod;
 		spin_unlock(&head->lock);
 		mutex_unlock(&head->mutex);
 	}
 	spin_unlock(&delayed_refs->lock);
-out:
+
 	WARN_ON(num_refs == 0);
 	if (refs)
 		*refs = num_refs;
···

 	if (metadata) {
 		key.type = BTRFS_METADATA_ITEM_KEY;
-		key.offset = extent_op->level;
+		key.offset = head->level;
 	} else {
 		key.type = BTRFS_EXTENT_ITEM_KEY;
 		key.offset = head->num_bytes;
···
 			ret = -EUCLEAN;
 			btrfs_err(fs_info,
 		  "missing extent item for extent %llu num_bytes %llu level %d",
-				  head->bytenr, head->num_bytes, extent_op->level);
+				  head->bytenr, head->num_bytes, head->level);
 			goto out;
 		}
 	}
···
 		.generation = trans->transid,
 	};

-	BUG_ON(!extent_op || !extent_op->update_flags);
 	ret = alloc_reserved_tree_block(trans, node, extent_op);
 	if (!ret)
 		btrfs_record_squota_delta(fs_info, &delta);
···
 				struct extent_buffer *eb, u64 flags)
 {
 	struct btrfs_delayed_extent_op *extent_op;
-	int level = btrfs_header_level(eb);
 	int ret;

 	extent_op = btrfs_alloc_delayed_extent_op();
···
 	extent_op->flags_to_set = flags;
 	extent_op->update_flags = true;
 	extent_op->update_key = false;
-	extent_op->level = level;

-	ret = btrfs_add_delayed_extent_op(trans, eb->start, eb->len, extent_op);
+	ret = btrfs_add_delayed_extent_op(trans, eb->start, eb->len,
+					  btrfs_header_level(eb), extent_op);
 	if (ret)
 		btrfs_free_delayed_extent_op(extent_op);
 	return ret;
···
 	return 0;
 }

-void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
-			   u64 root_id,
-			   struct extent_buffer *buf,
-			   u64 parent, int last_ref)
+int btrfs_free_tree_block(struct btrfs_trans_handle *trans,
+			  u64 root_id,
+			  struct extent_buffer *buf,
+			  u64 parent, int last_ref)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_block_group *bg;
···
 		btrfs_init_tree_ref(&generic_ref, btrfs_header_level(buf), 0, false);
 		btrfs_ref_tree_mod(fs_info, &generic_ref);
 		ret = btrfs_add_delayed_tree_ref(trans, &generic_ref, NULL);
-		BUG_ON(ret); /* -ENOMEM */
+		if (ret < 0)
+			return ret;
 	}

 	if (!last_ref)
-		return;
+		return 0;

 	if (btrfs_header_generation(buf) != trans->transid)
 		goto out;
···
 	 * matter anymore.
 	 */
 	clear_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags);
+	return 0;
 }

 /* Can return -ENOMEM */
···
 	struct btrfs_path *path;
 	struct extent_buffer *leaf;
 	u32 size = sizeof(*extent_item) + sizeof(*iref);
-	u64 flags = extent_op->flags_to_set;
+	const u64 flags = (extent_op ? extent_op->flags_to_set : 0);
 	/* The owner of a tree block is the level. */
 	int level = btrfs_delayed_ref_owner(node);
 	bool skinny_metadata = btrfs_fs_incompat(fs_info, SKINNY_METADATA);
···
 	struct btrfs_key ins;
 	struct btrfs_block_rsv *block_rsv;
 	struct extent_buffer *buf;
-	struct btrfs_delayed_extent_op *extent_op;
 	u64 flags = 0;
 	int ret;
 	u32 blocksize = fs_info->nodesize;
···
 		BUG_ON(parent > 0);

 	if (root_objectid != BTRFS_TREE_LOG_OBJECTID) {
+		struct btrfs_delayed_extent_op *extent_op;
 		struct btrfs_ref generic_ref = {
 			.action = BTRFS_ADD_DELAYED_EXTENT,
 			.bytenr = ins.objectid,
···
 			.owning_root = owning_root,
 			.ref_root = root_objectid,
 		};
-		extent_op = btrfs_alloc_delayed_extent_op();
-		if (!extent_op) {
-			ret = -ENOMEM;
-			goto out_free_buf;
+
+		if (!skinny_metadata || flags != 0) {
+			extent_op = btrfs_alloc_delayed_extent_op();
+			if (!extent_op) {
+				ret = -ENOMEM;
+				goto out_free_buf;
+			}
+			if (key)
+				memcpy(&extent_op->key, key, sizeof(extent_op->key));
+			else
+				memset(&extent_op->key, 0, sizeof(extent_op->key));
+			extent_op->flags_to_set = flags;
+			extent_op->update_key = (skinny_metadata ? false : true);
+			extent_op->update_flags = (flags != 0);
+		} else {
+			extent_op = NULL;
 		}
-		if (key)
-			memcpy(&extent_op->key, key, sizeof(extent_op->key));
-		else
-			memset(&extent_op->key, 0, sizeof(extent_op->key));
-		extent_op->flags_to_set = flags;
-		extent_op->update_key = skinny_metadata ? false : true;
-		extent_op->update_flags = true;
-		extent_op->level = level;

 		btrfs_init_tree_ref(&generic_ref, level, btrfs_root_id(root), false);
 		btrfs_ref_tree_mod(fs_info, &generic_ref);
 		ret = btrfs_add_delayed_tree_ref(trans, &generic_ref, extent_op);
-		if (ret)
-			goto out_free_delayed;
+		if (ret) {
+			btrfs_free_delayed_extent_op(extent_op);
+			goto out_free_buf;
+		}
 	}
 	return buf;

-out_free_delayed:
-	btrfs_free_delayed_extent_op(extent_op);
 out_free_buf:
 	btrfs_tree_unlock(buf);
 	free_extent_buffer(buf);
···
 	int reada_slot;
 	int reada_count;
 	int restarted;
+	/* Indicate that extent info needs to be looked up when walking the tree. */
+	int lookup_info;
 };

+/*
+ * This is our normal stage. We are traversing blocks the current snapshot owns
+ * and we are dropping any of our references to any children we are able to, and
+ * then freeing the block once we've processed all of the children.
+ */
 #define DROP_REFERENCE	1
+
+/*
+ * We enter this stage when we have to walk into a child block (meaning we can't
+ * simply drop our reference to it from our current parent node) and there are
+ * more than one reference on it. If we are the owner of any of the children
+ * blocks from the current parent node then we have to do the FULL_BACKREF dance
+ * on them in order to drop our normal ref and add the shared ref.
+ */
 #define UPDATE_BACKREF	2
+
+/*
+ * Decide if we need to walk down into this node to adjust the references.
+ *
+ * @root:	the root we are currently deleting
+ * @wc:		the walk control for this deletion
+ * @eb:		the parent eb that we're currently visiting
+ * @refs:	the number of refs for wc->level - 1
+ * @flags:	the flags for wc->level - 1
+ * @slot:	the slot in the eb that we're currently checking
+ *
+ * This is meant to be called when we're evaluating if a node we point to at
+ * wc->level should be read and walked into, or if we can simply delete our
+ * reference to it. We return true if we should walk into the node, false if we
+ * can skip it.
+ *
+ * We have assertions in here to make sure this is called correctly. We assume
+ * that sanity checking on the blocks read to this point has been done, so any
+ * corrupted file systems must have been caught before calling this function.
+ */
+static bool visit_node_for_delete(struct btrfs_root *root, struct walk_control *wc,
+				  struct extent_buffer *eb, u64 refs, u64 flags, int slot)
+{
+	struct btrfs_key key;
+	u64 generation;
+	int level = wc->level;
+
+	ASSERT(level > 0);
+	ASSERT(wc->refs[level - 1] > 0);
+
+	/*
+	 * The update backref stage we only want to skip if we already have
+	 * FULL_BACKREF set, otherwise we need to read.
+	 */
+	if (wc->stage == UPDATE_BACKREF) {
+		if (level == 1 && flags & BTRFS_BLOCK_FLAG_FULL_BACKREF)
+			return false;
+		return true;
+	}
+
+	/*
+	 * We're the last ref on this block, we must walk into it and process
+	 * any refs it's pointing at.
+	 */
+	if (wc->refs[level - 1] == 1)
+		return true;
+
+	/*
+	 * If we're already FULL_BACKREF then we know we can just drop our
+	 * current reference.
+	 */
+	if (level == 1 && flags & BTRFS_BLOCK_FLAG_FULL_BACKREF)
+		return false;
+
+	/*
+	 * This block is older than our creation generation, we can drop our
+	 * reference to it.
+	 */
+	generation = btrfs_node_ptr_generation(eb, slot);
+	if (!wc->update_ref || generation <= root->root_key.offset)
+		return false;
+
+	/*
+	 * This block was processed from a previous snapshot deletion run, we
+	 * can skip it.
+	 */
+	btrfs_node_key_to_cpu(eb, &key, slot);
+	if (btrfs_comp_cpu_keys(&key, &wc->update_progress) < 0)
+		return false;
+
+	/* All other cases we need to wander into the node. */
+	return true;
+}

 static noinline void reada_walk_down(struct btrfs_trans_handle *trans,
 				     struct btrfs_root *root,
···
 	u64 refs;
 	u64 flags;
 	u32 nritems;
-	struct btrfs_key key;
 	struct extent_buffer *eb;
 	int ret;
 	int slot;
···
 		/* We don't care about errors in readahead. */
 		if (ret < 0)
 			continue;
-		BUG_ON(refs == 0);

-		if (wc->stage == DROP_REFERENCE) {
-			if (refs == 1)
-				goto reada;
+		/*
+		 * This could be racey, it's conceivable that we raced and end
+		 * up with a bogus refs count, if that's the case just skip, if
+		 * we are actually corrupt we will notice when we look up
+		 * everything again with our locks.
+		 */
+		if (refs == 0)
+			continue;

-			if (wc->level == 1 &&
-			    (flags & BTRFS_BLOCK_FLAG_FULL_BACKREF))
-				continue;
-			if (!wc->update_ref ||
-			    generation <= root->root_key.offset)
-				continue;
-			btrfs_node_key_to_cpu(eb, &key, slot);
-			ret = btrfs_comp_cpu_keys(&key,
-						  &wc->update_progress);
-			if (ret < 0)
-				continue;
-		} else {
-			if (wc->level == 1 &&
-			    (flags & BTRFS_BLOCK_FLAG_FULL_BACKREF))
-				continue;
-		}
+		/* If we don't need to visit this node don't reada. */
+		if (!visit_node_for_delete(root, wc, eb, refs, flags, slot))
+			continue;
 reada:
 		btrfs_readahead_node_child(eb, slot);
 		nread++;
···
 static noinline int walk_down_proc(struct btrfs_trans_handle *trans,
 				   struct btrfs_root *root,
 				   struct btrfs_path *path,
-				   struct walk_control *wc, int lookup_info)
+				   struct walk_control *wc)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	int level = wc->level;
···
 	 * when reference count of tree block is 1, it won't increase
 	 * again. once full backref flag is set, we never clear it.
 	 */
-	if (lookup_info &&
+	if (wc->lookup_info &&
 	    ((wc->stage == DROP_REFERENCE && wc->refs[level] != 1) ||
 	     (wc->stage == UPDATE_BACKREF && !(wc->flags[level] & flag)))) {
-		BUG_ON(!path->locks[level]);
+		ASSERT(path->locks[level]);
 		ret = btrfs_lookup_extent_info(trans, fs_info,
 					       eb->start, level, 1,
 					       &wc->refs[level],
 					       &wc->flags[level],
 					       NULL);
-		BUG_ON(ret == -ENOMEM);
 		if (ret)
 			return ret;
-		BUG_ON(wc->refs[level] == 0);
+		if (unlikely(wc->refs[level] == 0)) {
+			btrfs_err(fs_info, "bytenr %llu has 0 references, expect > 0",
+				  eb->start);
+			return -EUCLEAN;
+		}
 	}

 	if (wc->stage == DROP_REFERENCE) {
···

 	/* wc->stage == UPDATE_BACKREF */
 	if (!(wc->flags[level] & flag)) {
-		BUG_ON(!path->locks[level]);
+		ASSERT(path->locks[level]);
 		ret = btrfs_inc_ref(trans, root, eb, 1);
-		BUG_ON(ret); /* -ENOMEM */
+		if (ret) {
+			btrfs_abort_transaction(trans, ret);
+			return ret;
+		}
 		ret = btrfs_dec_ref(trans, root, eb, 0);
-		BUG_ON(ret); /* -ENOMEM */
+		if (ret) {
+			btrfs_abort_transaction(trans, ret);
+			return ret;
+		}
 		ret = btrfs_set_disk_extent_flags(trans, eb, flag);
-		BUG_ON(ret); /* -ENOMEM */
+		if (ret) {
+			btrfs_abort_transaction(trans, ret);
+			return ret;
+		}
 		wc->flags[level] |= flag;
 	}

···
 }

 /*
+ * We may not have an uptodate block, so if we are going to walk down into this
+ * block we need to drop the lock, read it off of the disk, re-lock it and
+ * return to continue dropping the snapshot.
+ */
+static int check_next_block_uptodate(struct btrfs_trans_handle *trans,
+				     struct btrfs_root *root,
+				     struct btrfs_path *path,
+				     struct walk_control *wc,
+				     struct extent_buffer *next)
+{
+	struct btrfs_tree_parent_check check = { 0 };
+	u64 generation;
+	int level = wc->level;
+	int ret;
+
+	btrfs_assert_tree_write_locked(next);
+
+	generation = btrfs_node_ptr_generation(path->nodes[level], path->slots[level]);
+
+	if (btrfs_buffer_uptodate(next, generation, 0))
+		return 0;
+
+	check.level = level - 1;
+	check.transid = generation;
+	check.owner_root = btrfs_root_id(root);
+	check.has_first_key = true;
+	btrfs_node_key_to_cpu(path->nodes[level], &check.first_key, path->slots[level]);
+
+	btrfs_tree_unlock(next);
+	if (level == 1)
+		reada_walk_down(trans, root, wc, path);
+	ret = btrfs_read_extent_buffer(next, &check);
+	if (ret) {
+		free_extent_buffer(next);
+		return ret;
+	}
+	btrfs_tree_lock(next);
+	wc->lookup_info = 1;
+	return 0;
+}
+
+/*
+ * If we determine that we don't have to visit wc->level - 1 then we need to
+ * determine if we can drop our reference.
+ *
+ * If we are UPDATE_BACKREF then we will not, we need to update our backrefs.
+ *
+ * If we are DROP_REFERENCE this will figure out if we need to drop our current
+ * reference, skipping it if we dropped it from a previous incompleted drop, or
+ * dropping it if we still have a reference to it.
+ */
+static int maybe_drop_reference(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+				struct btrfs_path *path, struct walk_control *wc,
+				struct extent_buffer *next, u64 owner_root)
+{
+	struct btrfs_ref ref = {
+		.action = BTRFS_DROP_DELAYED_REF,
+		.bytenr = next->start,
+		.num_bytes = root->fs_info->nodesize,
+		.owning_root = owner_root,
+		.ref_root = btrfs_root_id(root),
+	};
+	int level = wc->level;
+	int ret;
+
+	/* We are UPDATE_BACKREF, we're not dropping anything. */
+	if (wc->stage == UPDATE_BACKREF)
+		return 0;
+
+	if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
+		ref.parent = path->nodes[level]->start;
+	} else {
+		ASSERT(btrfs_root_id(root) == btrfs_header_owner(path->nodes[level]));
+		if (btrfs_root_id(root) != btrfs_header_owner(path->nodes[level])) {
+			btrfs_err(root->fs_info, "mismatched block owner");
+			return -EIO;
+		}
+	}
+
+	/*
+	 * If we had a drop_progress we need to verify the refs are set as
+	 * expected. If we find our ref then we know that from here on out
+	 * everything should be correct, and we can clear the
+	 * ->restarted flag.
+	 */
+	if (wc->restarted) {
+		ret = check_ref_exists(trans, root, next->start, ref.parent,
+				       level - 1);
+		if (ret <= 0)
+			return ret;
+		ret = 0;
+		wc->restarted = 0;
+	}
+
+	/*
+	 * Reloc tree doesn't contribute to qgroup numbers, and we have already
+	 * accounted them at merge time (replace_path), thus we could skip
+	 * expensive subtree trace here.
+	 */
+	if (btrfs_root_id(root) != BTRFS_TREE_RELOC_OBJECTID &&
+	    wc->refs[level - 1] > 1) {
+		u64 generation = btrfs_node_ptr_generation(path->nodes[level],
+							   path->slots[level]);
+
+		ret = btrfs_qgroup_trace_subtree(trans, next, generation, level - 1);
+		if (ret) {
+			btrfs_err_rl(root->fs_info,
+"error %d accounting shared subtree, quota is out of sync, rescan required",
+				     ret);
+		}
+	}
+
+	/*
+	 * We need to update the next key in our walk control so we can update
+	 * the drop_progress key accordingly. We don't care if find_next_key
+	 * doesn't find a key because that means we're at the end and are going
+	 * to clean up now.
+	 */
+	wc->drop_level = level;
+	find_next_key(path, level, &wc->drop_progress);
+
+	btrfs_init_tree_ref(&ref, level - 1, 0, false);
+	return btrfs_free_extent(trans, &ref);
+}
+
+/*
  * helper to process tree block pointer.
  *
  * when wc->stage == DROP_REFERENCE, this function checks
···
 static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 				 struct btrfs_root *root,
 				 struct btrfs_path *path,
-				 struct walk_control *wc, int *lookup_info)
+				 struct walk_control *wc)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	u64 bytenr;
 	u64 generation;
 	u64 owner_root = 0;
-	struct btrfs_tree_parent_check check = { 0 };
-	struct btrfs_key key;
 	struct extent_buffer *next;
 	int level = wc->level;
-	int reada = 0;
 	int ret = 0;
-	bool need_account = false;

 	generation = btrfs_node_ptr_generation(path->nodes[level],
 					       path->slots[level]);
···
 	 */
 	if (wc->stage == UPDATE_BACKREF &&
 	    generation <= root->root_key.offset) {
-		*lookup_info = 1;
+		wc->lookup_info = 1;
 		return 1;
 	}

 	bytenr = btrfs_node_blockptr(path->nodes[level], path->slots[level]);

-	check.level = level - 1;
-	check.transid = generation;
-	check.owner_root = btrfs_root_id(root);
-	check.has_first_key = true;
-	btrfs_node_key_to_cpu(path->nodes[level], &check.first_key,
-			      path->slots[level]);
+	next = btrfs_find_create_tree_block(fs_info, bytenr, btrfs_root_id(root),
+					    level - 1);
+	if (IS_ERR(next))
+		return PTR_ERR(next);

-	next = find_extent_buffer(fs_info, bytenr);
-	if (!next) {
-		next = btrfs_find_create_tree_block(fs_info, bytenr,
-				btrfs_root_id(root), level - 1);
-		if (IS_ERR(next))
-			return PTR_ERR(next);
-		reada = 1;
-	}
 	btrfs_tree_lock(next);

 	ret = btrfs_lookup_extent_info(trans, fs_info, bytenr, level - 1, 1,
···
 		goto out_unlock;

 	if (unlikely(wc->refs[level - 1] == 0)) {
-		btrfs_err(fs_info, "Missing references.");
-		ret = -EIO;
+		btrfs_err(fs_info, "bytenr %llu has 0 references, expect > 0",
+			  bytenr);
+		ret = -EUCLEAN;
 		goto out_unlock;
 	}
-	*lookup_info = 0;
+	wc->lookup_info = 0;

-	if (wc->stage == DROP_REFERENCE) {
-		if (wc->refs[level - 1] > 1) {
-			need_account = true;
-			if (level == 1 &&
-			    (wc->flags[0] & BTRFS_BLOCK_FLAG_FULL_BACKREF))
-				goto skip;
+	/* If we don't have to walk into this node skip it. */
+	if (!visit_node_for_delete(root, wc, path->nodes[level],
+				   wc->refs[level - 1], wc->flags[level - 1],
+				   path->slots[level]))
+		goto skip;

-			if (!wc->update_ref ||
-			    generation <= root->root_key.offset)
-				goto skip;
-
-			btrfs_node_key_to_cpu(path->nodes[level], &key,
-					      path->slots[level]);
-			ret = btrfs_comp_cpu_keys(&key, &wc->update_progress);
-			if (ret < 0)
-				goto skip;
-
-			wc->stage = UPDATE_BACKREF;
-			wc->shared_level = level - 1;
-		}
-	} else {
-		if (level == 1 &&
-		    (wc->flags[0] & BTRFS_BLOCK_FLAG_FULL_BACKREF))
-			goto skip;
+	/*
+	 * We have to walk down into this node, and if we're currently at the
+	 * DROP_REFERNCE stage and this block is shared then we need to switch
+	 * to the UPDATE_BACKREF stage in order to convert to FULL_BACKREF.
+	 */
+	if (wc->stage == DROP_REFERENCE && wc->refs[level - 1] > 1) {
+		wc->stage = UPDATE_BACKREF;
+		wc->shared_level = level - 1;
 	}

-	if (!btrfs_buffer_uptodate(next, generation, 0)) {
-		btrfs_tree_unlock(next);
-		free_extent_buffer(next);
-		next = NULL;
-		*lookup_info = 1;
-	}
-
-	if (!next) {
-		if (reada && level == 1)
-			reada_walk_down(trans, root, wc, path);
-		next = read_tree_block(fs_info, bytenr, &check);
-		if (IS_ERR(next)) {
-			return PTR_ERR(next);
-		} else if (!extent_buffer_uptodate(next)) {
-			free_extent_buffer(next);
-			return -EIO;
-		}
-		btrfs_tree_lock(next);
-	}
+	ret = check_next_block_uptodate(trans, root, path, wc, next);
+	if (ret)
+		return ret;

 	level--;
 	ASSERT(level == btrfs_header_level(next));
···
 	wc->reada_slot = 0;
 	return 0;
 skip:
+	ret = maybe_drop_reference(trans, root, path, wc, next, owner_root);
+	if (ret)
+		goto out_unlock;
 	wc->refs[level - 1] = 0;
 	wc->flags[level - 1] = 0;
-	if (wc->stage == DROP_REFERENCE) {
-		struct btrfs_ref ref = {
-			.action = BTRFS_DROP_DELAYED_REF,
-			.bytenr = bytenr,
-			.num_bytes = fs_info->nodesize,
-			.owning_root = owner_root,
-			.ref_root = btrfs_root_id(root),
-		};
-		if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
-			ref.parent = path->nodes[level]->start;
-		} else {
-			ASSERT(btrfs_root_id(root) ==
-			       btrfs_header_owner(path->nodes[level]));
-			if (btrfs_root_id(root) !=
-			    btrfs_header_owner(path->nodes[level])) {
-				btrfs_err(root->fs_info,
-						"mismatched block owner");
-				ret = -EIO;
-				goto out_unlock;
-			}
-		}
-
-		/*
-		 * If we had a drop_progress we need to verify the refs are set
-		 * as expected. If we find our ref then we know that from here
-		 * on out everything should be correct, and we can clear the
-		 * ->restarted flag.
-		 */
-		if (wc->restarted) {
-			ret = check_ref_exists(trans, root, bytenr, ref.parent,
-					       level - 1);
-			if (ret < 0)
-				goto out_unlock;
-			if (ret == 0)
-				goto no_delete;
-			ret = 0;
-			wc->restarted = 0;
-		}
-
-		/*
-		 * Reloc tree doesn't contribute to qgroup numbers, and we have
-		 * already accounted them at merge time (replace_path),
-		 * thus we could skip expensive subtree trace here.
-		 */
-		if (btrfs_root_id(root) != BTRFS_TREE_RELOC_OBJECTID && need_account) {
-			ret = btrfs_qgroup_trace_subtree(trans, next,
-							 generation, level - 1);
-			if (ret) {
-				btrfs_err_rl(fs_info,
-					     "Error %d accounting shared subtree. Quota is out of sync, rescan required.",
-					     ret);
-			}
-		}
-
-		/*
-		 * We need to update the next key in our walk control so we can
-		 * update the drop_progress key accordingly. We don't care if
-		 * find_next_key doesn't find a key because that means we're at
-		 * the end and are going to clean up now.
-		 */
-		wc->drop_level = level;
-		find_next_key(path, level, &wc->drop_progress);
-
-		btrfs_init_tree_ref(&ref, level - 1, 0, false);
-		ret = btrfs_free_extent(trans, &ref);
-		if (ret)
-			goto out_unlock;
-	}
-no_delete:
-	*lookup_info = 1;
+	wc->lookup_info = 1;
 	ret = 1;

 out_unlock:
···
 				 struct walk_control *wc)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
-	int ret;
+	int ret = 0;
 	int level = wc->level;
 	struct extent_buffer *eb = path->nodes[level];
 	u64 parent = 0;

 	if (wc->stage == UPDATE_BACKREF) {
-		BUG_ON(wc->shared_level < level);
+		ASSERT(wc->shared_level >= level);
 		if (level < wc->shared_level)
 			goto out;

···
 		 * count is one.
 		 */
 		if (!path->locks[level]) {
-			BUG_ON(level == 0);
+			ASSERT(level > 0);
 			btrfs_tree_lock(eb);
 			path->locks[level] = BTRFS_WRITE_LOCK;

···
 				path->locks[level] = 0;
 				return ret;
 			}
-			BUG_ON(wc->refs[level] == 0);
+			if (unlikely(wc->refs[level] == 0)) {
+				btrfs_tree_unlock_rw(eb, path->locks[level]);
+				btrfs_err(fs_info, "bytenr %llu has 0 references, expect > 0",
+					  eb->start);
+				return -EUCLEAN;
+			}
 			if (wc->refs[level] == 1) {
 				btrfs_tree_unlock_rw(eb, path->locks[level]);
 				path->locks[level] = 0;
···
 	}

 	/* wc->stage == DROP_REFERENCE */
-	BUG_ON(wc->refs[level] > 1 && !path->locks[level]);
+	ASSERT(path->locks[level] || wc->refs[level] == 1);

 	if (wc->refs[level] == 1) {
 		if (level == 0) {
···
 				ret = btrfs_dec_ref(trans, root, eb, 1);
 			else
 				ret = btrfs_dec_ref(trans, root, eb, 0);
-			BUG_ON(ret); /* -ENOMEM */
+			if (ret) {
+				btrfs_abort_transaction(trans, ret);
+				return ret;
+			}
 			if (is_fstree(btrfs_root_id(root))) {
 				ret = btrfs_qgroup_trace_leaf_items(trans, eb);
 				if (ret) {
···
 			goto owner_mismatch;
 	}

-	btrfs_free_tree_block(trans, btrfs_root_id(root), eb, parent,
-			      wc->refs[level] == 1);
+	ret = btrfs_free_tree_block(trans, btrfs_root_id(root), eb, parent,
+				    wc->refs[level] == 1);
+	if (ret < 0)
+		btrfs_abort_transaction(trans, ret);
 out:
 	wc->refs[level] = 0;
 	wc->flags[level] = 0;
-	return 0;
+	return ret;

 owner_mismatch:
 	btrfs_err_rl(fs_info, "unexpected tree owner, have %llu expect %llu",
···
 	return -EUCLEAN;
 }

+/*
+ * walk_down_tree consists of two steps.
+ *
+ * walk_down_proc(). Look up the reference count and reference of our current
+ * wc->level. At this point path->nodes[wc->level] should be populated and
+ * uptodate, and in most cases should already be locked. If we are in
+ * DROP_REFERENCE and our refcount is > 1 then we've entered a shared node and
+ * we can walk back up the tree. If we are UPDATE_BACKREF we have to set
+ * FULL_BACKREF on this node if it's not already set, and then do the
+ * FULL_BACKREF conversion dance, which is to drop the root reference and add
+ * the shared reference to all of this nodes children.
+ *
+ * do_walk_down(). This is where we actually start iterating on the children of
+ * our current path->nodes[wc->level]. For DROP_REFERENCE that means dropping
+ * our reference to the children that return false from visit_node_for_delete(),
+ * which has various conditions where we know we can just drop our reference
+ * without visiting the node. For UPDATE_BACKREF we will skip any children that
+ * visit_node_for_delete() returns false for, only walking down when necessary.
5764 + * The bulk of the work for UPDATE_BACKREF occurs in the walk_up_tree() part of 5765 + * snapshot deletion. 5766 + */ 5860 5767 static noinline int walk_down_tree(struct btrfs_trans_handle *trans, 5861 5768 struct btrfs_root *root, 5862 5769 struct btrfs_path *path, 5863 5770 struct walk_control *wc) 5864 5771 { 5865 5772 int level = wc->level; 5866 - int lookup_info = 1; 5867 5773 int ret = 0; 5868 5774 5775 + wc->lookup_info = 1; 5869 5776 while (level >= 0) { 5870 - ret = walk_down_proc(trans, root, path, wc, lookup_info); 5777 + ret = walk_down_proc(trans, root, path, wc); 5871 5778 if (ret) 5872 5779 break; 5873 5780 ··· 5899 5764 btrfs_header_nritems(path->nodes[level])) 5900 5765 break; 5901 5766 5902 - ret = do_walk_down(trans, root, path, wc, &lookup_info); 5767 + ret = do_walk_down(trans, root, path, wc); 5903 5768 if (ret > 0) { 5904 5769 path->slots[level]++; 5905 5770 continue; ··· 5910 5775 return (ret == 1) ? 0 : ret; 5911 5776 } 5912 5777 5778 + /* 5779 + * walk_up_tree() is responsible for making sure we visit every slot on our 5780 + * current node, and if we're at the end of that node then we call 5781 + * walk_up_proc() on our current node which will do one of a few things based on 5782 + * our stage. 5783 + * 5784 + * UPDATE_BACKREF. If we wc->level is currently less than our wc->shared_level 5785 + * then we need to walk back up the tree, and then going back down into the 5786 + * other slots via walk_down_tree to update any other children from our original 5787 + * wc->shared_level. Once we're at or above our wc->shared_level we can switch 5788 + * back to DROP_REFERENCE, lookup the current nodes refs and flags, and carry on. 5789 + * 5790 + * DROP_REFERENCE. If our refs == 1 then we're going to free this tree block. 5791 + * If we're level 0 then we need to btrfs_dec_ref() on all of the data extents 5792 + * in our current leaf. 
After that we call btrfs_free_tree_block() on the 5793 + * current node and walk up to the next node to walk down the next slot. 5794 + */ 5913 5795 static noinline int walk_up_tree(struct btrfs_trans_handle *trans, 5914 5796 struct btrfs_root *root, 5915 5797 struct btrfs_path *path, ··· 5985 5833 struct btrfs_root_item *root_item = &root->root_item; 5986 5834 struct walk_control *wc; 5987 5835 struct btrfs_key key; 5988 - int err = 0; 5989 - int ret; 5836 + const u64 rootid = btrfs_root_id(root); 5837 + int ret = 0; 5990 5838 int level; 5991 5839 bool root_dropped = false; 5992 5840 bool unfinished_drop = false; ··· 5995 5843 5996 5844 path = btrfs_alloc_path(); 5997 5845 if (!path) { 5998 - err = -ENOMEM; 5846 + ret = -ENOMEM; 5999 5847 goto out; 6000 5848 } 6001 5849 6002 5850 wc = kzalloc(sizeof(*wc), GFP_NOFS); 6003 5851 if (!wc) { 6004 5852 btrfs_free_path(path); 6005 - err = -ENOMEM; 5853 + ret = -ENOMEM; 6006 5854 goto out; 6007 5855 } 6008 5856 ··· 6015 5863 else 6016 5864 trans = btrfs_start_transaction(tree_root, 0); 6017 5865 if (IS_ERR(trans)) { 6018 - err = PTR_ERR(trans); 5866 + ret = PTR_ERR(trans); 6019 5867 goto out_free; 6020 5868 } 6021 5869 6022 - err = btrfs_run_delayed_items(trans); 6023 - if (err) 5870 + ret = btrfs_run_delayed_items(trans); 5871 + if (ret) 6024 5872 goto out_end_trans; 6025 5873 6026 5874 /* ··· 6051 5899 path->lowest_level = level; 6052 5900 ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); 6053 5901 path->lowest_level = 0; 6054 - if (ret < 0) { 6055 - err = ret; 5902 + if (ret < 0) 6056 5903 goto out_end_trans; 6057 - } 5904 + 6058 5905 WARN_ON(ret > 0); 5906 + ret = 0; 6059 5907 6060 5908 /* 6061 5909 * unlock our path, this is safe because only this ··· 6068 5916 btrfs_tree_lock(path->nodes[level]); 6069 5917 path->locks[level] = BTRFS_WRITE_LOCK; 6070 5918 5919 + /* 5920 + * btrfs_lookup_extent_info() returns 0 for success, 5921 + * or < 0 for error. 
5922 + */ 6071 5923 ret = btrfs_lookup_extent_info(trans, fs_info, 6072 5924 path->nodes[level]->start, 6073 5925 level, 1, &wc->refs[level], 6074 5926 &wc->flags[level], NULL); 6075 - if (ret < 0) { 6076 - err = ret; 5927 + if (ret < 0) 6077 5928 goto out_end_trans; 6078 - } 5929 + 6079 5930 BUG_ON(wc->refs[level] == 0); 6080 5931 6081 5932 if (level == btrfs_root_drop_level(root_item)) ··· 6104 5949 ret = walk_down_tree(trans, root, path, wc); 6105 5950 if (ret < 0) { 6106 5951 btrfs_abort_transaction(trans, ret); 6107 - err = ret; 6108 5952 break; 6109 5953 } 6110 5954 6111 5955 ret = walk_up_tree(trans, root, path, wc, BTRFS_MAX_LEVEL); 6112 5956 if (ret < 0) { 6113 5957 btrfs_abort_transaction(trans, ret); 6114 - err = ret; 6115 5958 break; 6116 5959 } 6117 5960 6118 5961 if (ret > 0) { 6119 5962 BUG_ON(wc->stage != DROP_REFERENCE); 5963 + ret = 0; 6120 5964 break; 6121 5965 } 6122 5966 ··· 6137 5983 root_item); 6138 5984 if (ret) { 6139 5985 btrfs_abort_transaction(trans, ret); 6140 - err = ret; 6141 5986 goto out_end_trans; 6142 5987 } 6143 5988 ··· 6147 5994 if (!for_reloc && btrfs_need_cleaner_sleep(fs_info)) { 6148 5995 btrfs_debug(fs_info, 6149 5996 "drop snapshot early exit"); 6150 - err = -EAGAIN; 5997 + ret = -EAGAIN; 6151 5998 goto out_free; 6152 5999 } 6153 6000 ··· 6161 6008 else 6162 6009 trans = btrfs_start_transaction(tree_root, 0); 6163 6010 if (IS_ERR(trans)) { 6164 - err = PTR_ERR(trans); 6011 + ret = PTR_ERR(trans); 6165 6012 goto out_free; 6166 6013 } 6167 6014 } 6168 6015 } 6169 6016 btrfs_release_path(path); 6170 - if (err) 6017 + if (ret) 6171 6018 goto out_end_trans; 6172 6019 6173 6020 ret = btrfs_del_root(trans, &root->root_key); 6174 6021 if (ret) { 6175 6022 btrfs_abort_transaction(trans, ret); 6176 - err = ret; 6177 6023 goto out_end_trans; 6178 6024 } 6179 6025 ··· 6181 6029 NULL, NULL); 6182 6030 if (ret < 0) { 6183 6031 btrfs_abort_transaction(trans, ret); 6184 - err = ret; 6185 6032 goto out_end_trans; 6186 6033 } else if (ret 
> 0) { 6187 - /* if we fail to delete the orphan item this time 6034 + ret = 0; 6035 + /* 6036 + * If we fail to delete the orphan item this time 6188 6037 * around, it'll get picked up the next time. 6189 6038 * 6190 6039 * The most common failure here is just -ENOENT. ··· 6216 6063 kfree(wc); 6217 6064 btrfs_free_path(path); 6218 6065 out: 6066 + if (!ret && root_dropped) { 6067 + ret = btrfs_qgroup_cleanup_dropped_subvolume(fs_info, rootid); 6068 + if (ret < 0) 6069 + btrfs_warn_rl(fs_info, 6070 + "failed to cleanup qgroup 0/%llu: %d", 6071 + rootid, ret); 6072 + ret = 0; 6073 + } 6219 6074 /* 6220 6075 * We were an unfinished drop root, check to see if there are any 6221 6076 * pending, and if not clear and wake up any waiters. 6222 6077 */ 6223 - if (!err && unfinished_drop) 6078 + if (!ret && unfinished_drop) 6224 6079 btrfs_maybe_wake_unfinished_drop(fs_info); 6225 6080 6226 6081 /* ··· 6240 6079 */ 6241 6080 if (!for_reloc && !root_dropped) 6242 6081 btrfs_add_dead_root(root); 6243 - return err; 6082 + return ret; 6244 6083 } 6245 6084 6246 6085 /*
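A recurring pattern in the fs/btrfs/extent-tree.c hunks above is that an impossible state (for example a zero reference count) no longer triggers BUG_ON() but is logged and returned as -EUCLEAN, and that a single `ret` variable replaces the old `err`/`ret` pair. The following is a minimal userspace sketch of that pattern only, not the kernel code: `demo_walk_state`, `demo_check_refs()` and `demo_drop_block()` are hypothetical names invented for illustration, and `fprintf()` stands in for btrfs_err().

```c
#include <errno.h>
#include <stdio.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value; fallback only for non-Linux builds */
#endif

/* Hypothetical stand-in for the per-level state kept in struct walk_control. */
struct demo_walk_state {
	unsigned long long bytenr;	/* logical address of the tree block */
	unsigned long long refs;	/* reference count looked up for it */
};

/*
 * Return 0 if the block may be processed, -EUCLEAN if the reference count
 * is impossible. Code following the older style would crash here instead.
 */
static int demo_check_refs(const struct demo_walk_state *ws)
{
	if (ws->refs == 0) {
		fprintf(stderr, "bytenr %llu has 0 references, expect > 0\n",
			ws->bytenr);
		return -EUCLEAN;
	}
	return 0;
}

/* A single 'ret' carries the status all the way to the common exit path. */
static int demo_drop_block(const struct demo_walk_state *ws)
{
	int ret;

	ret = demo_check_refs(ws);
	if (ret < 0)
		goto out;

	/* The real work on the block would happen here. */
out:
	return ret;
}
```

The design choice illustrated here is that the caller, not the helper, decides how to react to corruption: the helper only reports it, so a damaged filesystem can fail one operation instead of halting the machine.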
+4 -4
fs/btrfs/extent-tree.h
@@ -127,10 +127,10 @@
 					     u64 empty_size,
 					     u64 reloc_src_root,
 					     enum btrfs_lock_nesting nest);
-void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
-			   u64 root_id,
-			   struct extent_buffer *buf,
-			   u64 parent, int last_ref);
+int btrfs_free_tree_block(struct btrfs_trans_handle *trans,
+			  u64 root_id,
+			  struct extent_buffer *buf,
+			  u64 parent, int last_ref);
 int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
 				     struct btrfs_root *root, u64 owner,
 				     u64 offset, u64 ram_bytes,
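The header change above turns btrfs_free_tree_block() from `void` into `int`, which lets callers such as walk_up_proc() see a failure, abort the transaction and propagate the error. A minimal userspace sketch of that caller-side handling follows. It rests on stated assumptions: `demo_trans`, `demo_abort_transaction()`, `demo_free_tree_block()` and `demo_walk_up()` are hypothetical stand-ins, and the `simulate_failure` flag exists only to exercise the error path.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical transaction handle; it only records whether it was aborted. */
struct demo_trans {
	bool aborted;
	int abort_error;
};

/* Stand-in for btrfs_abort_transaction(): remember the first fatal error. */
static void demo_abort_transaction(struct demo_trans *trans, int error)
{
	trans->aborted = true;
	trans->abort_error = error;
}

/* Hypothetical free routine that can now report failure to its caller. */
static int demo_free_tree_block(struct demo_trans *trans, bool simulate_failure)
{
	(void)trans;
	return simulate_failure ? -ENOMEM : 0;
}

/* Caller mirrors the new shape: check the result, abort, and return it. */
static int demo_walk_up(struct demo_trans *trans, bool simulate_failure)
{
	int ret;

	ret = demo_free_tree_block(trans, simulate_failure);
	if (ret < 0)
		demo_abort_transaction(trans, ret);
	return ret;
}
```

With a `void` return type the failure branch in `demo_walk_up()` could not exist at all, which is why the prototype change has to come together with the caller updates.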
+149 -943
fs/btrfs/extent_io.c
··· 164 164 kmem_cache_destroy(extent_buffer_cache); 165 165 } 166 166 167 - void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end) 168 - { 169 - unsigned long index = start >> PAGE_SHIFT; 170 - unsigned long end_index = end >> PAGE_SHIFT; 171 - struct page *page; 172 - 173 - while (index <= end_index) { 174 - page = find_get_page(inode->i_mapping, index); 175 - BUG_ON(!page); /* Pages should be in the extent_io_tree */ 176 - clear_page_dirty_for_io(page); 177 - put_page(page); 178 - index++; 179 - } 180 - } 181 - 182 167 static void process_one_page(struct btrfs_fs_info *fs_info, 183 - struct page *page, struct page *locked_page, 168 + struct page *page, const struct page *locked_page, 184 169 unsigned long page_ops, u64 start, u64 end) 185 170 { 186 171 struct folio *folio = page_folio(page); ··· 188 203 } 189 204 190 205 static void __process_pages_contig(struct address_space *mapping, 191 - struct page *locked_page, u64 start, u64 end, 206 + const struct page *locked_page, u64 start, u64 end, 192 207 unsigned long page_ops) 193 208 { 194 209 struct btrfs_fs_info *fs_info = inode_to_fs_info(mapping->host); ··· 215 230 } 216 231 } 217 232 218 - static noinline void __unlock_for_delalloc(struct inode *inode, 219 - struct page *locked_page, 233 + static noinline void __unlock_for_delalloc(const struct inode *inode, 234 + const struct page *locked_page, 220 235 u64 start, u64 end) 221 236 { 222 237 unsigned long index = start >> PAGE_SHIFT; ··· 231 246 } 232 247 233 248 static noinline int lock_delalloc_pages(struct inode *inode, 234 - struct page *locked_page, 249 + const struct page *locked_page, 235 250 u64 start, 236 251 u64 end) 237 252 { ··· 396 411 } 397 412 398 413 void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end, 399 - struct page *locked_page, 414 + const struct page *locked_page, 400 415 struct extent_state **cached, 401 416 u32 clear_bits, unsigned long page_ops) 402 417 { ··· 652 667 } 653 668 654 
669 /* 655 - * Populate every free slot in a provided array with folios. 670 + * Populate every free slot in a provided array with folios using GFP_NOFS. 656 671 * 657 672 * @nr_folios: number of folios to allocate 658 673 * @folio_array: the array to fill with folios; any existing non-NULL entries in 659 674 * the array will be skipped 660 - * @extra_gfp: the extra GFP flags for the allocation 661 675 * 662 676 * Return: 0 if all folios were able to be allocated; 663 677 * -ENOMEM otherwise, the partially allocated folios would be freed and 664 678 * the array slots zeroed 665 679 */ 666 - int btrfs_alloc_folio_array(unsigned int nr_folios, struct folio **folio_array, 667 - gfp_t extra_gfp) 680 + int btrfs_alloc_folio_array(unsigned int nr_folios, struct folio **folio_array) 668 681 { 669 682 for (int i = 0; i < nr_folios; i++) { 670 683 if (folio_array[i]) 671 684 continue; 672 - folio_array[i] = folio_alloc(GFP_NOFS | extra_gfp, 0); 685 + folio_array[i] = folio_alloc(GFP_NOFS, 0); 673 686 if (!folio_array[i]) 674 687 goto error; 675 688 } ··· 681 698 } 682 699 683 700 /* 684 - * Populate every free slot in a provided array with pages. 701 + * Populate every free slot in a provided array with pages, using GFP_NOFS. 685 702 * 686 703 * @nr_pages: number of pages to allocate 687 704 * @page_array: the array to fill with pages; any existing non-null entries in 688 - * the array will be skipped 689 - * @extra_gfp: the extra GFP flags for the allocation. 705 + * the array will be skipped 706 + * @nofail: whether using __GFP_NOFAIL flag 690 707 * 691 708 * Return: 0 if all pages were able to be allocated; 692 709 * -ENOMEM otherwise, the partially allocated pages would be freed and 693 710 * the array slots zeroed 694 711 */ 695 712 int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array, 696 - gfp_t extra_gfp) 713 + bool nofail) 697 714 { 698 - const gfp_t gfp = GFP_NOFS | extra_gfp; 715 + const gfp_t gfp = nofail ? 
(GFP_NOFS | __GFP_NOFAIL) : GFP_NOFS; 699 716 unsigned int allocated; 700 717 701 718 for (allocated = 0; allocated < nr_pages;) { ··· 719 736 * 720 737 * For now, the folios populated are always in order 0 (aka, single page). 721 738 */ 722 - static int alloc_eb_folio_array(struct extent_buffer *eb, gfp_t extra_gfp) 739 + static int alloc_eb_folio_array(struct extent_buffer *eb, bool nofail) 723 740 { 724 741 struct page *page_array[INLINE_EXTENT_BUFFER_PAGES] = { 0 }; 725 742 int num_pages = num_extent_pages(eb); 726 743 int ret; 727 744 728 - ret = btrfs_alloc_page_array(num_pages, page_array, extra_gfp); 745 + ret = btrfs_alloc_page_array(num_pages, page_array, nofail); 729 746 if (ret < 0) 730 747 return ret; 731 748 ··· 843 860 /* Cap to the current ordered extent boundary if there is one. */ 844 861 if (len > bio_ctrl->len_to_oe_boundary) { 845 862 ASSERT(bio_ctrl->compress_type == BTRFS_COMPRESS_NONE); 846 - ASSERT(is_data_inode(&inode->vfs_inode)); 863 + ASSERT(is_data_inode(inode)); 847 864 len = bio_ctrl->len_to_oe_boundary; 848 865 } 849 866 ··· 1066 1083 iosize = min(extent_map_end(em) - cur, end - cur + 1); 1067 1084 iosize = ALIGN(iosize, blocksize); 1068 1085 if (compress_type != BTRFS_COMPRESS_NONE) 1069 - disk_bytenr = em->block_start; 1086 + disk_bytenr = em->disk_bytenr; 1070 1087 else 1071 - disk_bytenr = em->block_start + extent_offset; 1072 - block_start = em->block_start; 1088 + disk_bytenr = extent_map_block_start(em) + extent_offset; 1089 + block_start = extent_map_block_start(em); 1073 1090 if (em->flags & EXTENT_FLAG_PREALLOC) 1074 1091 block_start = EXTENT_MAP_HOLE; 1075 1092 ··· 1209 1226 static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode, 1210 1227 struct page *page, struct writeback_control *wbc) 1211 1228 { 1229 + struct btrfs_fs_info *fs_info = inode_to_fs_info(&inode->vfs_inode); 1230 + struct folio *folio = page_folio(page); 1231 + const bool is_subpage = btrfs_is_subpage(fs_info, page->mapping); 1212 1232 
const u64 page_start = page_offset(page); 1213 1233 const u64 page_end = page_start + PAGE_SIZE - 1; 1234 + /* 1235 + * Save the last found delalloc end. As the delalloc end can go beyond 1236 + * page boundary, thus we cannot rely on subpage bitmap to locate the 1237 + * last delalloc end. 1238 + */ 1239 + u64 last_delalloc_end = 0; 1214 1240 u64 delalloc_start = page_start; 1215 1241 u64 delalloc_end = page_end; 1216 1242 u64 delalloc_to_write = 0; 1217 1243 int ret = 0; 1218 1244 1245 + /* Lock all (subpage) delalloc ranges inside the page first. */ 1219 1246 while (delalloc_start < page_end) { 1220 1247 delalloc_end = page_end; 1221 1248 if (!find_lock_delalloc_range(&inode->vfs_inode, page, ··· 1233 1240 delalloc_start = delalloc_end + 1; 1234 1241 continue; 1235 1242 } 1236 - 1237 - ret = btrfs_run_delalloc_range(inode, page, delalloc_start, 1238 - delalloc_end, wbc); 1239 - if (ret < 0) 1240 - return ret; 1241 - 1243 + btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start, 1244 + min(delalloc_end, page_end) + 1 - 1245 + delalloc_start); 1246 + last_delalloc_end = delalloc_end; 1242 1247 delalloc_start = delalloc_end + 1; 1243 1248 } 1249 + delalloc_start = page_start; 1244 1250 1251 + if (!last_delalloc_end) 1252 + goto out; 1253 + 1254 + /* Run the delalloc ranges for the above locked ranges. */ 1255 + while (delalloc_start < page_end) { 1256 + u64 found_start; 1257 + u32 found_len; 1258 + bool found; 1259 + 1260 + if (!is_subpage) { 1261 + /* 1262 + * For non-subpage case, the found delalloc range must 1263 + * cover this page and there must be only one locked 1264 + * delalloc range. 
1265 + */ 1266 + found_start = page_start; 1267 + found_len = last_delalloc_end + 1 - found_start; 1268 + found = true; 1269 + } else { 1270 + found = btrfs_subpage_find_writer_locked(fs_info, folio, 1271 + delalloc_start, &found_start, &found_len); 1272 + } 1273 + if (!found) 1274 + break; 1275 + /* 1276 + * The subpage range covers the last sector, the delalloc range may 1277 + * end beyond the page boundary, use the saved delalloc_end 1278 + * instead. 1279 + */ 1280 + if (found_start + found_len >= page_end) 1281 + found_len = last_delalloc_end + 1 - found_start; 1282 + 1283 + if (ret >= 0) { 1284 + /* No errors hit so far, run the current delalloc range. */ 1285 + ret = btrfs_run_delalloc_range(inode, page, found_start, 1286 + found_start + found_len - 1, 1287 + wbc); 1288 + } else { 1289 + /* 1290 + * We've hit an error during previous delalloc range, 1291 + * have to cleanup the remaining locked ranges. 1292 + */ 1293 + unlock_extent(&inode->io_tree, found_start, 1294 + found_start + found_len - 1, NULL); 1295 + __unlock_for_delalloc(&inode->vfs_inode, page, found_start, 1296 + found_start + found_len - 1); 1297 + } 1298 + 1299 + /* 1300 + * We can hit btrfs_run_delalloc_range() with >0 return value. 1301 + * 1302 + * This happens when either the IO is already done and page 1303 + * unlocked (inline) or the IO submission and page unlock would 1304 + * be handled as async (compression). 1305 + * 1306 + * Inline is only possible for regular sectorsize for now. 1307 + * 1308 + * Compression is possible for both subpage and regular cases, 1309 + * but even for subpage compression only happens for page aligned 1310 + * range, thus the found delalloc range must go beyond current 1311 + * page. 1312 + */ 1313 + if (ret > 0) 1314 + ASSERT(!is_subpage || found_start + found_len >= page_end); 1315 + 1316 + /* 1317 + * Above btrfs_run_delalloc_range() may have unlocked the page, 1318 + * thus for the last range, we cannot touch the page anymore. 
1319 + */ 1320 + if (found_start + found_len >= last_delalloc_end + 1) 1321 + break; 1322 + 1323 + delalloc_start = found_start + found_len; 1324 + } 1325 + if (ret < 0) 1326 + return ret; 1327 + out: 1328 + if (last_delalloc_end) 1329 + delalloc_end = last_delalloc_end; 1330 + else 1331 + delalloc_end = page_end; 1245 1332 /* 1246 1333 * delalloc_end is already one less than the total length, so 1247 1334 * we don't subtract one from PAGE_SIZE ··· 1365 1292 * Return the next dirty range in [@start, @end). 1366 1293 * If no dirty range is found, @start will be page_offset(page) + PAGE_SIZE. 1367 1294 */ 1368 - static void find_next_dirty_byte(struct btrfs_fs_info *fs_info, 1295 + static void find_next_dirty_byte(const struct btrfs_fs_info *fs_info, 1369 1296 struct page *page, u64 *start, u64 *end) 1370 1297 { 1371 1298 struct folio *folio = page_folio(page); ··· 1412 1339 * < 0 if there were errors (page still locked) 1413 1340 */ 1414 1341 static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode, 1415 - struct page *page, 1342 + struct page *page, u64 start, u32 len, 1416 1343 struct btrfs_bio_ctrl *bio_ctrl, 1417 1344 loff_t i_size, 1418 1345 int *nr_ret) 1419 1346 { 1420 1347 struct btrfs_fs_info *fs_info = inode->root->fs_info; 1421 - u64 cur = page_offset(page); 1422 - u64 end = cur + PAGE_SIZE - 1; 1348 + u64 cur = start; 1349 + u64 end = start + len - 1; 1423 1350 u64 extent_offset; 1424 1351 u64 block_start; 1425 1352 struct extent_map *em; 1426 1353 int ret = 0; 1427 1354 int nr = 0; 1355 + 1356 + ASSERT(start >= page_offset(page) && 1357 + start + len <= page_offset(page) + PAGE_SIZE); 1428 1358 1429 1359 ret = btrfs_writepage_cow_fixup(page); 1430 1360 if (ret) { ··· 1481 1405 ASSERT(IS_ALIGNED(em->start, fs_info->sectorsize)); 1482 1406 ASSERT(IS_ALIGNED(em->len, fs_info->sectorsize)); 1483 1407 1484 - block_start = em->block_start; 1485 - disk_bytenr = em->block_start + extent_offset; 1408 + block_start = 
extent_map_block_start(em); 1409 + disk_bytenr = extent_map_block_start(em) + extent_offset; 1486 1410 1487 1411 ASSERT(!extent_map_is_compressed(em)); 1488 1412 ASSERT(block_start != EXTENT_MAP_HOLE); ··· 1517 1441 nr++; 1518 1442 } 1519 1443 1520 - btrfs_folio_assert_not_dirty(fs_info, page_folio(page)); 1444 + btrfs_folio_assert_not_dirty(fs_info, page_folio(page), start, len); 1521 1445 *nr_ret = nr; 1522 1446 return 0; 1523 1447 ··· 1575 1499 if (ret) 1576 1500 goto done; 1577 1501 1578 - ret = __extent_writepage_io(BTRFS_I(inode), page, bio_ctrl, i_size, &nr); 1502 + ret = __extent_writepage_io(BTRFS_I(inode), page, page_offset(page), 1503 + PAGE_SIZE, bio_ctrl, i_size, &nr); 1579 1504 if (ret == 1) 1580 1505 return 0; 1581 1506 ··· 1593 1516 PAGE_SIZE, !ret); 1594 1517 mapping_set_error(page->mapping, ret); 1595 1518 } 1596 - unlock_page(page); 1519 + 1520 + btrfs_folio_end_all_writers(inode_to_fs_info(inode), folio); 1597 1521 ASSERT(ret <= 0); 1598 1522 return ret; 1599 1523 } ··· 1726 1648 * context. 1727 1649 */ 1728 1650 static struct extent_buffer *find_extent_buffer_nolock( 1729 - struct btrfs_fs_info *fs_info, u64 start) 1651 + const struct btrfs_fs_info *fs_info, u64 start) 1730 1652 { 1731 1653 struct extent_buffer *eb; 1732 1654 ··· 2295 2217 * already been ran (aka, ordered extent inserted) and all pages are still 2296 2218 * locked. 
2297 2219 */ 2298 - void extent_write_locked_range(struct inode *inode, struct page *locked_page, 2220 + void extent_write_locked_range(struct inode *inode, const struct page *locked_page, 2299 2221 u64 start, u64 end, struct writeback_control *wbc, 2300 2222 bool pages_dirty) 2301 2223 { ··· 2324 2246 2325 2247 page = find_get_page(mapping, cur >> PAGE_SHIFT); 2326 2248 ASSERT(PageLocked(page)); 2327 - if (pages_dirty && page != locked_page) { 2249 + if (pages_dirty && page != locked_page) 2328 2250 ASSERT(PageDirty(page)); 2329 - clear_page_dirty_for_io(page); 2330 - } 2331 2251 2332 - ret = __extent_writepage_io(BTRFS_I(inode), page, &bio_ctrl, 2333 - i_size, &nr); 2252 + ret = __extent_writepage_io(BTRFS_I(inode), page, cur, cur_len, 2253 + &bio_ctrl, i_size, &nr); 2334 2254 if (ret == 1) 2335 2255 goto next_page; 2336 2256 2337 2257 /* Make sure the mapping tag for page dirty gets cleared. */ 2338 2258 if (nr == 0) { 2339 - set_page_writeback(page); 2340 - end_page_writeback(page); 2259 + struct folio *folio; 2260 + 2261 + folio = page_folio(page); 2262 + btrfs_folio_set_writeback(fs_info, folio, cur, cur_len); 2263 + btrfs_folio_clear_writeback(fs_info, folio, cur, cur_len); 2341 2264 } 2342 2265 if (ret) { 2343 2266 btrfs_mark_ordered_io_finished(BTRFS_I(inode), page, ··· 2549 2470 return try_release_extent_state(io_tree, page, mask); 2550 2471 } 2551 2472 2552 - struct btrfs_fiemap_entry { 2553 - u64 offset; 2554 - u64 phys; 2555 - u64 len; 2556 - u32 flags; 2557 - }; 2558 - 2559 - /* 2560 - * Indicate the caller of emit_fiemap_extent() that it needs to unlock the file 2561 - * range from the inode's io tree, unlock the subvolume tree search path, flush 2562 - * the fiemap cache and relock the file range and research the subvolume tree. 
2563 - * The value here is something negative that can't be confused with a valid 2564 - * errno value and different from 1 because that's also a return value from 2565 - * fiemap_fill_next_extent() and also it's often used to mean some btree search 2566 - * did not find a key, so make it some distinct negative value. 2567 - */ 2568 - #define BTRFS_FIEMAP_FLUSH_CACHE (-(MAX_ERRNO + 1)) 2569 - 2570 - /* 2571 - * Used to: 2572 - * 2573 - * - Cache the next entry to be emitted to the fiemap buffer, so that we can 2574 - * merge extents that are contiguous and can be grouped as a single one; 2575 - * 2576 - * - Store extents ready to be written to the fiemap buffer in an intermediary 2577 - * buffer. This intermediary buffer is to ensure that in case the fiemap 2578 - * buffer is memory mapped to the fiemap target file, we don't deadlock 2579 - * during btrfs_page_mkwrite(). This is because during fiemap we are locking 2580 - * an extent range in order to prevent races with delalloc flushing and 2581 - * ordered extent completion, which is needed in order to reliably detect 2582 - * delalloc in holes and prealloc extents. And this can lead to a deadlock 2583 - * if the fiemap buffer is memory mapped to the file we are running fiemap 2584 - * against (a silly, useless in practice scenario, but possible) because 2585 - * btrfs_page_mkwrite() will try to lock the same extent range. 2586 - */ 2587 - struct fiemap_cache { 2588 - /* An array of ready fiemap entries. */ 2589 - struct btrfs_fiemap_entry *entries; 2590 - /* Number of entries in the entries array. */ 2591 - int entries_size; 2592 - /* Index of the next entry in the entries array to write to. */ 2593 - int entries_pos; 2594 - /* 2595 - * Once the entries array is full, this indicates what's the offset for 2596 - * the next file extent item we must search for in the inode's subvolume 2597 - * tree after unlocking the extent range in the inode's io tree and 2598 - * releasing the search path. 
2599 - */ 2600 - u64 next_search_offset; 2601 - /* 2602 - * This matches struct fiemap_extent_info::fi_mapped_extents, we use it 2603 - * to count ourselves emitted extents and stop instead of relying on 2604 - * fiemap_fill_next_extent() because we buffer ready fiemap entries at 2605 - * the @entries array, and we want to stop as soon as we hit the max 2606 - * amount of extents to map, not just to save time but also to make the 2607 - * logic at extent_fiemap() simpler. 2608 - */ 2609 - unsigned int extents_mapped; 2610 - /* Fields for the cached extent (unsubmitted, not ready, extent). */ 2611 - u64 offset; 2612 - u64 phys; 2613 - u64 len; 2614 - u32 flags; 2615 - bool cached; 2616 - }; 2617 - 2618 - static int flush_fiemap_cache(struct fiemap_extent_info *fieinfo, 2619 - struct fiemap_cache *cache) 2620 - { 2621 - for (int i = 0; i < cache->entries_pos; i++) { 2622 - struct btrfs_fiemap_entry *entry = &cache->entries[i]; 2623 - int ret; 2624 - 2625 - ret = fiemap_fill_next_extent(fieinfo, entry->offset, 2626 - entry->phys, entry->len, 2627 - entry->flags); 2628 - /* 2629 - * Ignore 1 (reached max entries) because we keep track of that 2630 - * ourselves in emit_fiemap_extent(). 2631 - */ 2632 - if (ret < 0) 2633 - return ret; 2634 - } 2635 - cache->entries_pos = 0; 2636 - 2637 - return 0; 2638 - } 2639 - 2640 - /* 2641 - * Helper to submit fiemap extent. 2642 - * 2643 - * Will try to merge current fiemap extent specified by @offset, @phys, 2644 - * @len and @flags with cached one. 2645 - * And only when we fails to merge, cached one will be submitted as 2646 - * fiemap extent. 2647 - * 2648 - * Return value is the same as fiemap_fill_next_extent(). 2649 - */ 2650 - static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo, 2651 - struct fiemap_cache *cache, 2652 - u64 offset, u64 phys, u64 len, u32 flags) 2653 - { 2654 - struct btrfs_fiemap_entry *entry; 2655 - u64 cache_end; 2656 - 2657 - /* Set at the end of extent_fiemap(). 
*/ 2658 - ASSERT((flags & FIEMAP_EXTENT_LAST) == 0); 2659 - 2660 - if (!cache->cached) 2661 - goto assign; 2662 - 2663 - /* 2664 - * When iterating the extents of the inode, at extent_fiemap(), we may 2665 - * find an extent that starts at an offset behind the end offset of the 2666 - * previous extent we processed. This happens if fiemap is called 2667 - * without FIEMAP_FLAG_SYNC and there are ordered extents completing 2668 - * after we had to unlock the file range, release the search path, emit 2669 - * the fiemap extents stored in the buffer (cache->entries array) and 2670 - * the lock the remainder of the range and re-search the btree. 2671 - * 2672 - * For example we are in leaf X processing its last item, which is the 2673 - * file extent item for file range [512K, 1M[, and after 2674 - * btrfs_next_leaf() releases the path, there's an ordered extent that 2675 - * completes for the file range [768K, 2M[, and that results in trimming 2676 - * the file extent item so that it now corresponds to the file range 2677 - * [512K, 768K[ and a new file extent item is inserted for the file 2678 - * range [768K, 2M[, which may end up as the last item of leaf X or as 2679 - * the first item of the next leaf - in either case btrfs_next_leaf() 2680 - * will leave us with a path pointing to the new extent item, for the 2681 - * file range [768K, 2M[, since that's the first key that follows the 2682 - * last one we processed. So in order not to report overlapping extents 2683 - * to user space, we trim the length of the previously cached extent and 2684 - * emit it. 
2685 - * 2686 - * Upon calling btrfs_next_leaf() we may also find an extent with an 2687 - * offset smaller than or equals to cache->offset, and this happens 2688 - * when we had a hole or prealloc extent with several delalloc ranges in 2689 - * it, but after btrfs_next_leaf() released the path, delalloc was 2690 - * flushed and the resulting ordered extents were completed, so we can 2691 - * now have found a file extent item for an offset that is smaller than 2692 - * or equals to what we have in cache->offset. We deal with this as 2693 - * described below. 2694 - */ 2695 - cache_end = cache->offset + cache->len; 2696 - if (cache_end > offset) { 2697 - if (offset == cache->offset) { 2698 - /* 2699 - * We cached a dealloc range (found in the io tree) for 2700 - * a hole or prealloc extent and we have now found a 2701 - * file extent item for the same offset. What we have 2702 - * now is more recent and up to date, so discard what 2703 - * we had in the cache and use what we have just found. 2704 - */ 2705 - goto assign; 2706 - } else if (offset > cache->offset) { 2707 - /* 2708 - * The extent range we previously found ends after the 2709 - * offset of the file extent item we found and that 2710 - * offset falls somewhere in the middle of that previous 2711 - * extent range. So adjust the range we previously found 2712 - * to end at the offset of the file extent item we have 2713 - * just found, since this extent is more up to date. 2714 - * Emit that adjusted range and cache the file extent 2715 - * item we have just found. This corresponds to the case 2716 - * where a previously found file extent item was split 2717 - * due to an ordered extent completing. 2718 - */ 2719 - cache->len = offset - cache->offset; 2720 - goto emit; 2721 - } else { 2722 - const u64 range_end = offset + len; 2723 - 2724 - /* 2725 - * The offset of the file extent item we have just found 2726 - * is behind the cached offset. 
This means we were 2727 - * processing a hole or prealloc extent for which we 2728 - * have found delalloc ranges (in the io tree), so what 2729 - * we have in the cache is the last delalloc range we 2730 - * found while the file extent item we found can be 2731 - * either for a whole delalloc range we previously 2732 - * emitted or only a part of that range. 2733 - * 2734 - * We have two cases here: 2735 - * 2736 - * 1) The file extent item's range ends at or behind the 2737 - * cached extent's end. In this case just ignore the 2738 - * current file extent item because we don't want to 2739 - * overlap with previous ranges that may have been 2740 - * emitted already; 2741 - * 2742 - * 2) The file extent item starts behind the currently 2743 - * cached extent but its end offset goes beyond the 2744 - * end offset of the cached extent. We don't want to 2745 - * overlap with a previous range that may have been 2746 - * emitted already, so we emit the currently cached 2747 - * extent and then partially store the current file 2748 - * extent item's range in the cache, for the subrange 2749 - * going from the cached extent's end to the end of the 2750 - * file extent item.
2751 - */ 2752 - if (range_end <= cache_end) 2753 - return 0; 2754 - 2755 - if (!(flags & (FIEMAP_EXTENT_ENCODED | FIEMAP_EXTENT_DELALLOC))) 2756 - phys += cache_end - offset; 2757 - 2758 - offset = cache_end; 2759 - len = range_end - cache_end; 2760 - goto emit; 2761 - } 2762 - } 2763 - 2764 - /* 2765 - * Only merge fiemap extents if 2766 - * 1) Their logical addresses are continuous 2767 - * 2768 - * 2) Their physical addresses are continuous 2769 - * So truly compressed (physical size smaller than logical size) 2770 - * extents won't get merged with each other 2771 - * 2772 - * 3) They share the same flags 2773 - */ 2774 - if (cache->offset + cache->len == offset && 2775 - cache->phys + cache->len == phys && 2776 - cache->flags == flags) { 2777 - cache->len += len; 2778 - return 0; 2779 - } 2780 - 2781 - emit: 2782 - /* Not mergeable, need to submit cached one */ 2783 - 2784 - if (cache->entries_pos == cache->entries_size) { 2785 - /* 2786 - * We will need to re-search from the end offset of the last 2787 - * stored extent and not from the current offset, because after 2788 - * unlocking the range and releasing the path, if there's a hole 2789 - * between that end offset and this current offset, a new extent 2790 - * may have been inserted due to a new write, so we don't want 2791 - * to miss it.
2792 - */ 2793 - entry = &cache->entries[cache->entries_size - 1]; 2794 - cache->next_search_offset = entry->offset + entry->len; 2795 - cache->cached = false; 2796 - 2797 - return BTRFS_FIEMAP_FLUSH_CACHE; 2798 - } 2799 - 2800 - entry = &cache->entries[cache->entries_pos]; 2801 - entry->offset = cache->offset; 2802 - entry->phys = cache->phys; 2803 - entry->len = cache->len; 2804 - entry->flags = cache->flags; 2805 - cache->entries_pos++; 2806 - cache->extents_mapped++; 2807 - 2808 - if (cache->extents_mapped == fieinfo->fi_extents_max) { 2809 - cache->cached = false; 2810 - return 1; 2811 - } 2812 - assign: 2813 - cache->cached = true; 2814 - cache->offset = offset; 2815 - cache->phys = phys; 2816 - cache->len = len; 2817 - cache->flags = flags; 2818 - 2819 - return 0; 2820 - } 2821 - 2822 - /* 2823 - * Emit last fiemap cache 2824 - * 2825 - * The last fiemap cache may still be cached in the following case: 2826 - * 0 4k 8k 2827 - * |<- Fiemap range ->| 2828 - * |<------------ First extent ----------->| 2829 - * 2830 - * In this case, the first extent range will be cached but not emitted. 2831 - * So we must emit it before ending extent_fiemap(). 
2832 - */ 2833 - static int emit_last_fiemap_cache(struct fiemap_extent_info *fieinfo, 2834 - struct fiemap_cache *cache) 2835 - { 2836 - int ret; 2837 - 2838 - if (!cache->cached) 2839 - return 0; 2840 - 2841 - ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys, 2842 - cache->len, cache->flags); 2843 - cache->cached = false; 2844 - if (ret > 0) 2845 - ret = 0; 2846 - return ret; 2847 - } 2848 - 2849 - static int fiemap_next_leaf_item(struct btrfs_inode *inode, struct btrfs_path *path) 2850 - { 2851 - struct extent_buffer *clone = path->nodes[0]; 2852 - struct btrfs_key key; 2853 - int slot; 2854 - int ret; 2855 - 2856 - path->slots[0]++; 2857 - if (path->slots[0] < btrfs_header_nritems(path->nodes[0])) 2858 - return 0; 2859 - 2860 - /* 2861 - * Add a temporary extra ref to an already cloned extent buffer to 2862 - * prevent btrfs_next_leaf() freeing it, we want to reuse it to avoid 2863 - * the cost of allocating a new one. 2864 - */ 2865 - ASSERT(test_bit(EXTENT_BUFFER_UNMAPPED, &clone->bflags)); 2866 - atomic_inc(&clone->refs); 2867 - 2868 - ret = btrfs_next_leaf(inode->root, path); 2869 - if (ret != 0) 2870 - goto out; 2871 - 2872 - /* 2873 - * Don't bother with cloning if there are no more file extent items for 2874 - * our inode. 2875 - */ 2876 - btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); 2877 - if (key.objectid != btrfs_ino(inode) || key.type != BTRFS_EXTENT_DATA_KEY) { 2878 - ret = 1; 2879 - goto out; 2880 - } 2881 - 2882 - /* 2883 - * Important to preserve the start field, for the optimizations when 2884 - * checking if extents are shared (see extent_fiemap()). 2885 - * 2886 - * We must set ->start before calling copy_extent_buffer_full(). 
If we 2887 - * are on sub-pagesize blocksize, we use ->start to determine the offset 2888 - * into the folio where our eb exists, and if we update ->start after 2889 - * the fact then any subsequent reads of the eb may read from a 2890 - * different offset in the folio than where we originally copied into. 2891 - */ 2892 - clone->start = path->nodes[0]->start; 2893 - /* See the comment at fiemap_search_slot() about why we clone. */ 2894 - copy_extent_buffer_full(clone, path->nodes[0]); 2895 - 2896 - slot = path->slots[0]; 2897 - btrfs_release_path(path); 2898 - path->nodes[0] = clone; 2899 - path->slots[0] = slot; 2900 - out: 2901 - if (ret) 2902 - free_extent_buffer(clone); 2903 - 2904 - return ret; 2905 - } 2906 - 2907 - /* 2908 - * Search for the first file extent item that starts at a given file offset or 2909 - * the one that starts immediately before that offset. 2910 - * Returns: 0 on success, < 0 on error, 1 if not found. 2911 - */ 2912 - static int fiemap_search_slot(struct btrfs_inode *inode, struct btrfs_path *path, 2913 - u64 file_offset) 2914 - { 2915 - const u64 ino = btrfs_ino(inode); 2916 - struct btrfs_root *root = inode->root; 2917 - struct extent_buffer *clone; 2918 - struct btrfs_key key; 2919 - int slot; 2920 - int ret; 2921 - 2922 - key.objectid = ino; 2923 - key.type = BTRFS_EXTENT_DATA_KEY; 2924 - key.offset = file_offset; 2925 - 2926 - ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); 2927 - if (ret < 0) 2928 - return ret; 2929 - 2930 - if (ret > 0 && path->slots[0] > 0) { 2931 - btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1); 2932 - if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY) 2933 - path->slots[0]--; 2934 - } 2935 - 2936 - if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) { 2937 - ret = btrfs_next_leaf(root, path); 2938 - if (ret != 0) 2939 - return ret; 2940 - 2941 - btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); 2942 - if (key.objectid != ino || key.type != 
BTRFS_EXTENT_DATA_KEY) 2943 - return 1; 2944 - } 2945 - 2946 - /* 2947 - * We clone the leaf and use it during fiemap. This is because while 2948 - * using the leaf we do expensive things like checking if an extent is 2949 - * shared, which can take a long time. In order to prevent blocking 2950 - * other tasks for too long, we use a clone of the leaf. We have locked 2951 - * the file range in the inode's io tree, so we know none of our file 2952 - * extent items can change. This way we avoid blocking other tasks that 2953 - * want to insert items for other inodes in the same leaf or b+tree 2954 - * rebalance operations (triggered for example when someone is trying 2955 - * to push items into this leaf when trying to insert an item in a 2956 - * neighbour leaf). 2957 - * We also need the private clone because holding a read lock on an 2958 - * extent buffer of the subvolume's b+tree will make lockdep unhappy 2959 - * when we check if extents are shared, as backref walking may need to 2960 - * lock the same leaf we are processing. 2961 - */ 2962 - clone = btrfs_clone_extent_buffer(path->nodes[0]); 2963 - if (!clone) 2964 - return -ENOMEM; 2965 - 2966 - slot = path->slots[0]; 2967 - btrfs_release_path(path); 2968 - path->nodes[0] = clone; 2969 - path->slots[0] = slot; 2970 - 2971 - return 0; 2972 - } 2973 - 2974 - /* 2975 - * Process a range which is a hole or a prealloc extent in the inode's subvolume 2976 - * btree. If @disk_bytenr is 0, we are dealing with a hole, otherwise a prealloc 2977 - * extent. The end offset (@end) is inclusive. 
2978 - */ 2979 - static int fiemap_process_hole(struct btrfs_inode *inode, 2980 - struct fiemap_extent_info *fieinfo, 2981 - struct fiemap_cache *cache, 2982 - struct extent_state **delalloc_cached_state, 2983 - struct btrfs_backref_share_check_ctx *backref_ctx, 2984 - u64 disk_bytenr, u64 extent_offset, 2985 - u64 extent_gen, 2986 - u64 start, u64 end) 2987 - { 2988 - const u64 i_size = i_size_read(&inode->vfs_inode); 2989 - u64 cur_offset = start; 2990 - u64 last_delalloc_end = 0; 2991 - u32 prealloc_flags = FIEMAP_EXTENT_UNWRITTEN; 2992 - bool checked_extent_shared = false; 2993 - int ret; 2994 - 2995 - /* 2996 - * There can be no delalloc past i_size, so don't waste time looking for 2997 - * it beyond i_size. 2998 - */ 2999 - while (cur_offset < end && cur_offset < i_size) { 3000 - u64 delalloc_start; 3001 - u64 delalloc_end; 3002 - u64 prealloc_start; 3003 - u64 prealloc_len = 0; 3004 - bool delalloc; 3005 - 3006 - delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end, 3007 - delalloc_cached_state, 3008 - &delalloc_start, 3009 - &delalloc_end); 3010 - if (!delalloc) 3011 - break; 3012 - 3013 - /* 3014 - * If this is a prealloc extent we have to report every section 3015 - * of it that has no delalloc. 
3016 - */ 3017 - if (disk_bytenr != 0) { 3018 - if (last_delalloc_end == 0) { 3019 - prealloc_start = start; 3020 - prealloc_len = delalloc_start - start; 3021 - } else { 3022 - prealloc_start = last_delalloc_end + 1; 3023 - prealloc_len = delalloc_start - prealloc_start; 3024 - } 3025 - } 3026 - 3027 - if (prealloc_len > 0) { 3028 - if (!checked_extent_shared && fieinfo->fi_extents_max) { 3029 - ret = btrfs_is_data_extent_shared(inode, 3030 - disk_bytenr, 3031 - extent_gen, 3032 - backref_ctx); 3033 - if (ret < 0) 3034 - return ret; 3035 - else if (ret > 0) 3036 - prealloc_flags |= FIEMAP_EXTENT_SHARED; 3037 - 3038 - checked_extent_shared = true; 3039 - } 3040 - ret = emit_fiemap_extent(fieinfo, cache, prealloc_start, 3041 - disk_bytenr + extent_offset, 3042 - prealloc_len, prealloc_flags); 3043 - if (ret) 3044 - return ret; 3045 - extent_offset += prealloc_len; 3046 - } 3047 - 3048 - ret = emit_fiemap_extent(fieinfo, cache, delalloc_start, 0, 3049 - delalloc_end + 1 - delalloc_start, 3050 - FIEMAP_EXTENT_DELALLOC | 3051 - FIEMAP_EXTENT_UNKNOWN); 3052 - if (ret) 3053 - return ret; 3054 - 3055 - last_delalloc_end = delalloc_end; 3056 - cur_offset = delalloc_end + 1; 3057 - extent_offset += cur_offset - delalloc_start; 3058 - cond_resched(); 3059 - } 3060 - 3061 - /* 3062 - * Either we found no delalloc for the whole prealloc extent or we have 3063 - * a prealloc extent that spans i_size or starts at or after i_size. 
3064 - */ 3065 - if (disk_bytenr != 0 && last_delalloc_end < end) { 3066 - u64 prealloc_start; 3067 - u64 prealloc_len; 3068 - 3069 - if (last_delalloc_end == 0) { 3070 - prealloc_start = start; 3071 - prealloc_len = end + 1 - start; 3072 - } else { 3073 - prealloc_start = last_delalloc_end + 1; 3074 - prealloc_len = end + 1 - prealloc_start; 3075 - } 3076 - 3077 - if (!checked_extent_shared && fieinfo->fi_extents_max) { 3078 - ret = btrfs_is_data_extent_shared(inode, 3079 - disk_bytenr, 3080 - extent_gen, 3081 - backref_ctx); 3082 - if (ret < 0) 3083 - return ret; 3084 - else if (ret > 0) 3085 - prealloc_flags |= FIEMAP_EXTENT_SHARED; 3086 - } 3087 - ret = emit_fiemap_extent(fieinfo, cache, prealloc_start, 3088 - disk_bytenr + extent_offset, 3089 - prealloc_len, prealloc_flags); 3090 - if (ret) 3091 - return ret; 3092 - } 3093 - 3094 - return 0; 3095 - } 3096 - 3097 - static int fiemap_find_last_extent_offset(struct btrfs_inode *inode, 3098 - struct btrfs_path *path, 3099 - u64 *last_extent_end_ret) 3100 - { 3101 - const u64 ino = btrfs_ino(inode); 3102 - struct btrfs_root *root = inode->root; 3103 - struct extent_buffer *leaf; 3104 - struct btrfs_file_extent_item *ei; 3105 - struct btrfs_key key; 3106 - u64 disk_bytenr; 3107 - int ret; 3108 - 3109 - /* 3110 - * Lookup the last file extent. We're not using i_size here because 3111 - * there might be preallocation past i_size. 3112 - */ 3113 - ret = btrfs_lookup_file_extent(NULL, root, path, ino, (u64)-1, 0); 3114 - /* There can't be a file extent item at offset (u64)-1 */ 3115 - ASSERT(ret != 0); 3116 - if (ret < 0) 3117 - return ret; 3118 - 3119 - /* 3120 - * For a non-existing key, btrfs_search_slot() always leaves us at a 3121 - * slot > 0, except if the btree is empty, which is impossible because 3122 - * at least it has the inode item for this inode and all the items for 3123 - * the root inode 256. 
3124 - */ 3125 - ASSERT(path->slots[0] > 0); 3126 - path->slots[0]--; 3127 - leaf = path->nodes[0]; 3128 - btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); 3129 - if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) { 3130 - /* No file extent items in the subvolume tree. */ 3131 - *last_extent_end_ret = 0; 3132 - return 0; 3133 - } 3134 - 3135 - /* 3136 - * For an inline extent, the disk_bytenr is where inline data starts at, 3137 - * so first check if we have an inline extent item before checking if we 3138 - * have an implicit hole (disk_bytenr == 0). 3139 - */ 3140 - ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item); 3141 - if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) { 3142 - *last_extent_end_ret = btrfs_file_extent_end(path); 3143 - return 0; 3144 - } 3145 - 3146 - /* 3147 - * Find the last file extent item that is not a hole (when NO_HOLES is 3148 - * not enabled). This should take at most 2 iterations in the worst 3149 - * case: we have one hole file extent item at slot 0 of a leaf and 3150 - * another hole file extent item as the last item in the previous leaf. 3151 - * This is because we merge file extent items that represent holes. 3152 - */ 3153 - disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei); 3154 - while (disk_bytenr == 0) { 3155 - ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY); 3156 - if (ret < 0) { 3157 - return ret; 3158 - } else if (ret > 0) { 3159 - /* No file extent items that are not holes. 
*/ 3160 - *last_extent_end_ret = 0; 3161 - return 0; 3162 - } 3163 - leaf = path->nodes[0]; 3164 - ei = btrfs_item_ptr(leaf, path->slots[0], 3165 - struct btrfs_file_extent_item); 3166 - disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei); 3167 - } 3168 - 3169 - *last_extent_end_ret = btrfs_file_extent_end(path); 3170 - return 0; 3171 - } 3172 - 3173 - int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo, 3174 - u64 start, u64 len) 3175 - { 3176 - const u64 ino = btrfs_ino(inode); 3177 - struct extent_state *cached_state = NULL; 3178 - struct extent_state *delalloc_cached_state = NULL; 3179 - struct btrfs_path *path; 3180 - struct fiemap_cache cache = { 0 }; 3181 - struct btrfs_backref_share_check_ctx *backref_ctx; 3182 - u64 last_extent_end; 3183 - u64 prev_extent_end; 3184 - u64 range_start; 3185 - u64 range_end; 3186 - const u64 sectorsize = inode->root->fs_info->sectorsize; 3187 - bool stopped = false; 3188 - int ret; 3189 - 3190 - cache.entries_size = PAGE_SIZE / sizeof(struct btrfs_fiemap_entry); 3191 - cache.entries = kmalloc_array(cache.entries_size, 3192 - sizeof(struct btrfs_fiemap_entry), 3193 - GFP_KERNEL); 3194 - backref_ctx = btrfs_alloc_backref_share_check_ctx(); 3195 - path = btrfs_alloc_path(); 3196 - if (!cache.entries || !backref_ctx || !path) { 3197 - ret = -ENOMEM; 3198 - goto out; 3199 - } 3200 - 3201 - restart: 3202 - range_start = round_down(start, sectorsize); 3203 - range_end = round_up(start + len, sectorsize); 3204 - prev_extent_end = range_start; 3205 - 3206 - lock_extent(&inode->io_tree, range_start, range_end, &cached_state); 3207 - 3208 - ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end); 3209 - if (ret < 0) 3210 - goto out_unlock; 3211 - btrfs_release_path(path); 3212 - 3213 - path->reada = READA_FORWARD; 3214 - ret = fiemap_search_slot(inode, path, range_start); 3215 - if (ret < 0) { 3216 - goto out_unlock; 3217 - } else if (ret > 0) { 3218 - /* 3219 - * No file extent item found, but 
we may have delalloc between 3220 - * the current offset and i_size. So check for that. 3221 - */ 3222 - ret = 0; 3223 - goto check_eof_delalloc; 3224 - } 3225 - 3226 - while (prev_extent_end < range_end) { 3227 - struct extent_buffer *leaf = path->nodes[0]; 3228 - struct btrfs_file_extent_item *ei; 3229 - struct btrfs_key key; 3230 - u64 extent_end; 3231 - u64 extent_len; 3232 - u64 extent_offset = 0; 3233 - u64 extent_gen; 3234 - u64 disk_bytenr = 0; 3235 - u64 flags = 0; 3236 - int extent_type; 3237 - u8 compression; 3238 - 3239 - btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); 3240 - if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) 3241 - break; 3242 - 3243 - extent_end = btrfs_file_extent_end(path); 3244 - 3245 - /* 3246 - * The first iteration can leave us at an extent item that ends 3247 - * before our range's start. Move to the next item. 3248 - */ 3249 - if (extent_end <= range_start) 3250 - goto next_item; 3251 - 3252 - backref_ctx->curr_leaf_bytenr = leaf->start; 3253 - 3254 - /* We have an implicit hole (NO_HOLES feature enabled). */ 3255 - if (prev_extent_end < key.offset) { 3256 - const u64 hole_end = min(key.offset, range_end) - 1; 3257 - 3258 - ret = fiemap_process_hole(inode, fieinfo, &cache, 3259 - &delalloc_cached_state, 3260 - backref_ctx, 0, 0, 0, 3261 - prev_extent_end, hole_end); 3262 - if (ret < 0) { 3263 - goto out_unlock; 3264 - } else if (ret > 0) { 3265 - /* fiemap_fill_next_extent() told us to stop. */ 3266 - stopped = true; 3267 - break; 3268 - } 3269 - 3270 - /* We've reached the end of the fiemap range, stop.
*/ 3271 - if (key.offset >= range_end) { 3272 - stopped = true; 3273 - break; 3274 - } 3275 - } 3276 - 3277 - extent_len = extent_end - key.offset; 3278 - ei = btrfs_item_ptr(leaf, path->slots[0], 3279 - struct btrfs_file_extent_item); 3280 - compression = btrfs_file_extent_compression(leaf, ei); 3281 - extent_type = btrfs_file_extent_type(leaf, ei); 3282 - extent_gen = btrfs_file_extent_generation(leaf, ei); 3283 - 3284 - if (extent_type != BTRFS_FILE_EXTENT_INLINE) { 3285 - disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei); 3286 - if (compression == BTRFS_COMPRESS_NONE) 3287 - extent_offset = btrfs_file_extent_offset(leaf, ei); 3288 - } 3289 - 3290 - if (compression != BTRFS_COMPRESS_NONE) 3291 - flags |= FIEMAP_EXTENT_ENCODED; 3292 - 3293 - if (extent_type == BTRFS_FILE_EXTENT_INLINE) { 3294 - flags |= FIEMAP_EXTENT_DATA_INLINE; 3295 - flags |= FIEMAP_EXTENT_NOT_ALIGNED; 3296 - ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 0, 3297 - extent_len, flags); 3298 - } else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) { 3299 - ret = fiemap_process_hole(inode, fieinfo, &cache, 3300 - &delalloc_cached_state, 3301 - backref_ctx, 3302 - disk_bytenr, extent_offset, 3303 - extent_gen, key.offset, 3304 - extent_end - 1); 3305 - } else if (disk_bytenr == 0) { 3306 - /* We have an explicit hole. */ 3307 - ret = fiemap_process_hole(inode, fieinfo, &cache, 3308 - &delalloc_cached_state, 3309 - backref_ctx, 0, 0, 0, 3310 - key.offset, extent_end - 1); 3311 - } else { 3312 - /* We have a regular extent. 
*/ 3313 - if (fieinfo->fi_extents_max) { 3314 - ret = btrfs_is_data_extent_shared(inode, 3315 - disk_bytenr, 3316 - extent_gen, 3317 - backref_ctx); 3318 - if (ret < 0) 3319 - goto out_unlock; 3320 - else if (ret > 0) 3321 - flags |= FIEMAP_EXTENT_SHARED; 3322 - } 3323 - 3324 - ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 3325 - disk_bytenr + extent_offset, 3326 - extent_len, flags); 3327 - } 3328 - 3329 - if (ret < 0) { 3330 - goto out_unlock; 3331 - } else if (ret > 0) { 3332 - /* emit_fiemap_extent() told us to stop. */ 3333 - stopped = true; 3334 - break; 3335 - } 3336 - 3337 - prev_extent_end = extent_end; 3338 - next_item: 3339 - if (fatal_signal_pending(current)) { 3340 - ret = -EINTR; 3341 - goto out_unlock; 3342 - } 3343 - 3344 - ret = fiemap_next_leaf_item(inode, path); 3345 - if (ret < 0) { 3346 - goto out_unlock; 3347 - } else if (ret > 0) { 3348 - /* No more file extent items for this inode. */ 3349 - break; 3350 - } 3351 - cond_resched(); 3352 - } 3353 - 3354 - check_eof_delalloc: 3355 - if (!stopped && prev_extent_end < range_end) { 3356 - ret = fiemap_process_hole(inode, fieinfo, &cache, 3357 - &delalloc_cached_state, backref_ctx, 3358 - 0, 0, 0, prev_extent_end, range_end - 1); 3359 - if (ret < 0) 3360 - goto out_unlock; 3361 - prev_extent_end = range_end; 3362 - } 3363 - 3364 - if (cache.cached && cache.offset + cache.len >= last_extent_end) { 3365 - const u64 i_size = i_size_read(&inode->vfs_inode); 3366 - 3367 - if (prev_extent_end < i_size) { 3368 - u64 delalloc_start; 3369 - u64 delalloc_end; 3370 - bool delalloc; 3371 - 3372 - delalloc = btrfs_find_delalloc_in_range(inode, 3373 - prev_extent_end, 3374 - i_size - 1, 3375 - &delalloc_cached_state, 3376 - &delalloc_start, 3377 - &delalloc_end); 3378 - if (!delalloc) 3379 - cache.flags |= FIEMAP_EXTENT_LAST; 3380 - } else { 3381 - cache.flags |= FIEMAP_EXTENT_LAST; 3382 - } 3383 - } 3384 - 3385 - out_unlock: 3386 - unlock_extent(&inode->io_tree, range_start, range_end, &cached_state); 
3387 - 3388 - if (ret == BTRFS_FIEMAP_FLUSH_CACHE) { 3389 - btrfs_release_path(path); 3390 - ret = flush_fiemap_cache(fieinfo, &cache); 3391 - if (ret) 3392 - goto out; 3393 - len -= cache.next_search_offset - start; 3394 - start = cache.next_search_offset; 3395 - goto restart; 3396 - } else if (ret < 0) { 3397 - goto out; 3398 - } 3399 - 3400 - /* 3401 - * Must free the path before emitting to the fiemap buffer because we 3402 - * may have a non-cloned leaf and if the fiemap buffer is memory mapped 3403 - * to a file, a write into it (through btrfs_page_mkwrite()) may trigger 3404 - * waiting for an ordered extent that in order to complete needs to 3405 - * modify that leaf, therefore leading to a deadlock. 3406 - */ 3407 - btrfs_free_path(path); 3408 - path = NULL; 3409 - 3410 - ret = flush_fiemap_cache(fieinfo, &cache); 3411 - if (ret) 3412 - goto out; 3413 - 3414 - ret = emit_last_fiemap_cache(fieinfo, &cache); 3415 - out: 3416 - free_extent_state(delalloc_cached_state); 3417 - kfree(cache.entries); 3418 - btrfs_free_backref_share_ctx(backref_ctx); 3419 - btrfs_free_path(path); 3420 - return ret; 3421 - } 3422 - 3423 2473 static void __free_extent_buffer(struct extent_buffer *eb) 3424 2474 { 3425 2475 kmem_cache_free(extent_buffer_cache, eb); ··· 2580 3372 return false; 2581 3373 } 2582 3374 2583 - static void detach_extent_buffer_folio(struct extent_buffer *eb, struct folio *folio) 3375 + static void detach_extent_buffer_folio(const struct extent_buffer *eb, struct folio *folio) 2584 3376 { 2585 3377 struct btrfs_fs_info *fs_info = eb->fs_info; 2586 3378 const bool mapped = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags); ··· 2641 3433 } 2642 3434 2643 3435 /* Release all pages attached to the extent buffer */ 2644 - static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb) 3436 + static void btrfs_release_extent_buffer_pages(const struct extent_buffer *eb) 2645 3437 { 2646 3438 ASSERT(!extent_buffer_under_io(eb)); 2647 3439 ··· 2707 3499 */ 2708 
3500 set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags); 2709 3501 2710 - ret = alloc_eb_folio_array(new, 0); 3502 + ret = alloc_eb_folio_array(new, false); 2711 3503 if (ret) { 2712 3504 btrfs_release_extent_buffer(new); 2713 3505 return NULL; ··· 2715 3507 2716 3508 for (int i = 0; i < num_folios; i++) { 2717 3509 struct folio *folio = new->folios[i]; 2718 - int ret; 2719 3510 2720 3511 ret = attach_extent_buffer_folio(new, folio, NULL); 2721 3512 if (ret < 0) { ··· 2740 3533 if (!eb) 2741 3534 return NULL; 2742 3535 2743 - ret = alloc_eb_folio_array(eb, 0); 3536 + ret = alloc_eb_folio_array(eb, false); 2744 3537 if (ret) 2745 3538 goto err; 2746 3539 ··· 3106 3899 3107 3900 reallocate: 3108 3901 /* Allocate all pages first. */ 3109 - ret = alloc_eb_folio_array(eb, __GFP_NOFAIL); 3902 + ret = alloc_eb_folio_array(eb, true); 3110 3903 if (ret < 0) { 3111 3904 btrfs_free_subpage(prealloc); 3112 3905 goto out; ··· 3558 4351 } 3559 4352 3560 4353 int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num, 3561 - struct btrfs_tree_parent_check *check) 4354 + const struct btrfs_tree_parent_check *check) 3562 4355 { 3563 4356 struct btrfs_bio *bbio; 3564 4357 bool ret; ··· 3796 4589 return; 3797 4590 3798 4591 if (fs_info->nodesize < PAGE_SIZE) { 3799 - struct folio *folio = eb->folios[0]; 3800 - 4592 + folio = eb->folios[0]; 3801 4593 ASSERT(i == 0); 3802 4594 if (WARN_ON(!btrfs_subpage_test_uptodate(fs_info, folio, 3803 4595 eb->start, eb->len))) ··· 3814 4608 size_t cur; 3815 4609 size_t offset; 3816 4610 char *kaddr; 3817 - char *src = (char *)srcv; 4611 + const char *src = (const char *)srcv; 3818 4612 unsigned long i = get_eb_folio_index(eb, start); 3819 4613 /* For unmapped (dummy) ebs, no need to check their uptodate status. 
*/ 3820 4614 const bool check_uptodate = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags); ··· 4171 4965 4172 4966 #define GANG_LOOKUP_SIZE 16 4173 4967 static struct extent_buffer *get_next_extent_buffer( 4174 - struct btrfs_fs_info *fs_info, struct page *page, u64 bytenr) 4968 + const struct btrfs_fs_info *fs_info, struct page *page, u64 bytenr) 4175 4969 { 4176 4970 struct extent_buffer *gang[GANG_LOOKUP_SIZE]; 4177 4971 struct extent_buffer *found = NULL;
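The merge and trim rules described in the emit_fiemap_extent() comments of the hunk above can be illustrated with a small user-space sketch. This is not the kernel code: struct sketch_extent, sketch_try_merge() and sketch_trim_overlap() are hypothetical names introduced here for illustration, and the sketch leaves out the entries buffer, the flush/restart path and the fi_extents_max accounting.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the cached extent kept by fiemap. */
struct sketch_extent {
	uint64_t offset; /* logical file offset of the cached extent */
	uint64_t phys;   /* physical (disk) address of the cached extent */
	uint64_t len;    /* length in bytes */
	uint32_t flags;  /* FIEMAP_EXTENT_* style flags */
	bool cached;     /* true when the fields above hold a pending extent */
};

/*
 * Merge a new extent into the cached one only if it is logically
 * contiguous, physically contiguous and has identical flags.
 * Returns true when the extent was merged into the cache.
 */
static bool sketch_try_merge(struct sketch_extent *cache, uint64_t offset,
			     uint64_t phys, uint64_t len, uint32_t flags)
{
	if (!cache->cached)
		return false;
	if (cache->offset + cache->len != offset)
		return false;
	if (cache->phys + cache->len != phys)
		return false;
	if (cache->flags != flags)
		return false;

	cache->len += len;
	return true;
}

/*
 * If a newly found extent starts in the middle of the cached one (for
 * example after an ordered extent split a file extent item), shorten
 * the cached extent so the two ranges do not overlap.
 */
static void sketch_trim_overlap(struct sketch_extent *cache, uint64_t offset)
{
	const uint64_t cache_end = cache->offset + cache->len;

	if (cache->cached && offset > cache->offset && offset < cache_end)
		cache->len = offset - cache->offset;
}
```

With a cached extent at logical offset 0, physical address 4096 and length 4096, a new extent at logical 4096 and physical 8192 with the same flags merges into one 8192-byte extent, while an extent whose physical address is not contiguous does not merge and would instead cause the cached extent to be emitted.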
+10 -9
fs/btrfs/extent_io.h
··· 215 215 return ret; 216 216 } 217 217 218 + static inline void extent_changeset_prealloc(struct extent_changeset *changeset, gfp_t gfp_mask) 219 + { 220 + ulist_prealloc(&changeset->range_changed, gfp_mask); 221 + } 222 + 218 223 static inline void extent_changeset_release(struct extent_changeset *changeset) 219 224 { 220 225 if (!changeset) ··· 240 235 int try_release_extent_buffer(struct page *page); 241 236 242 237 int btrfs_read_folio(struct file *file, struct folio *folio); 243 - void extent_write_locked_range(struct inode *inode, struct page *locked_page, 238 + void extent_write_locked_range(struct inode *inode, const struct page *locked_page, 244 239 u64 start, u64 end, struct writeback_control *wbc, 245 240 bool pages_dirty); 246 241 int btrfs_writepages(struct address_space *mapping, struct writeback_control *wbc); 247 242 int btree_write_cache_pages(struct address_space *mapping, 248 243 struct writeback_control *wbc); 249 244 void btrfs_readahead(struct readahead_control *rac); 250 - int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo, 251 - u64 start, u64 len); 252 245 int set_folio_extent_mapped(struct folio *folio); 253 246 int set_page_extent_mapped(struct page *page); 254 247 void clear_page_extent_mapped(struct page *page); ··· 266 263 #define WAIT_COMPLETE 1 267 264 #define WAIT_PAGE_LOCK 2 268 265 int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num, 269 - struct btrfs_tree_parent_check *parent_check); 266 + const struct btrfs_tree_parent_check *parent_check); 270 267 void wait_on_extent_buffer_writeback(struct extent_buffer *eb); 271 268 void btrfs_readahead_tree_block(struct btrfs_fs_info *fs_info, 272 269 u64 bytenr, u64 owner_root, u64 gen, int level); ··· 353 350 void set_extent_buffer_dirty(struct extent_buffer *eb); 354 351 void set_extent_buffer_uptodate(struct extent_buffer *eb); 355 352 void clear_extent_buffer_uptodate(struct extent_buffer *eb); 356 - void 
extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end); 357 353 void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end, 358 - struct page *locked_page, 354 + const struct page *locked_page, 359 355 struct extent_state **cached, 360 356 u32 bits_to_clear, unsigned long page_ops); 361 357 int extent_invalidate_folio(struct extent_io_tree *tree, ··· 363 361 struct extent_buffer *buf); 364 362 365 363 int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array, 366 - gfp_t extra_gfp); 367 - int btrfs_alloc_folio_array(unsigned int nr_folios, struct folio **folio_array, 368 - gfp_t extra_gfp); 364 + bool nofail); 365 + int btrfs_alloc_folio_array(unsigned int nr_folios, struct folio **folio_array); 369 366 370 367 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS 371 368 bool find_lock_delalloc_range(struct inode *inode,
+163 -80
fs/btrfs/extent_map.c
··· 33 33 */ 34 34 void extent_map_tree_init(struct extent_map_tree *tree) 35 35 { 36 - tree->map = RB_ROOT_CACHED; 36 + tree->root = RB_ROOT; 37 37 INIT_LIST_HEAD(&tree->modified_extents); 38 38 rwlock_init(&tree->lock); 39 39 } ··· 85 85 percpu_counter_dec(&fs_info->evictable_extent_maps); 86 86 } 87 87 88 - static int tree_insert(struct rb_root_cached *root, struct extent_map *em) 88 + static int tree_insert(struct rb_root *root, struct extent_map *em) 89 89 { 90 - struct rb_node **p = &root->rb_root.rb_node; 90 + struct rb_node **p = &root->rb_node; 91 91 struct rb_node *parent = NULL; 92 92 struct extent_map *entry = NULL; 93 93 struct rb_node *orig_parent = NULL; 94 94 u64 end = range_end(em->start, em->len); 95 - bool leftmost = true; 96 95 97 96 while (*p) { 98 97 parent = *p; 99 98 entry = rb_entry(parent, struct extent_map, rb_node); 100 99 101 - if (em->start < entry->start) { 100 + if (em->start < entry->start) 102 101 p = &(*p)->rb_left; 103 - } else if (em->start >= extent_map_end(entry)) { 102 + else if (em->start >= extent_map_end(entry)) 104 103 p = &(*p)->rb_right; 105 - leftmost = false; 106 - } else { 104 + else 107 105 return -EEXIST; 108 - } 109 106 } 110 107 111 108 orig_parent = parent; ··· 125 128 return -EEXIST; 126 129 127 130 rb_link_node(&em->rb_node, orig_parent, p); 128 - rb_insert_color_cached(&em->rb_node, root, leftmost); 131 + rb_insert_color(&em->rb_node, root); 129 132 return 0; 130 133 } 131 134 ··· 183 186 return NULL; 184 187 } 185 188 189 + static inline u64 extent_map_block_len(const struct extent_map *em) 190 + { 191 + if (extent_map_is_compressed(em)) 192 + return em->disk_num_bytes; 193 + return em->len; 194 + } 195 + 186 196 static inline u64 extent_map_block_end(const struct extent_map *em) 187 197 { 188 - if (em->block_start + em->block_len < em->block_start) 198 + if (extent_map_block_start(em) + extent_map_block_len(em) < 199 + extent_map_block_start(em)) 189 200 return (u64)-1; 190 - return em->block_start + 
em->block_len; 201 + return extent_map_block_start(em) + extent_map_block_len(em); 191 202 } 192 203 193 204 static bool can_merge_extent_map(const struct extent_map *em) ··· 230 225 if (prev->flags != next->flags) 231 226 return false; 232 227 233 - if (next->block_start < EXTENT_MAP_LAST_BYTE - 1) 234 - return next->block_start == extent_map_block_end(prev); 228 + if (next->disk_bytenr < EXTENT_MAP_LAST_BYTE - 1) 229 + return extent_map_block_start(next) == extent_map_block_end(prev); 235 230 236 231 /* HOLES and INLINE extents. */ 237 - return next->block_start == prev->block_start; 232 + return next->disk_bytenr == prev->disk_bytenr; 233 + } 234 + 235 + /* 236 + * Handle the on-disk data extents merge for @prev and @next. 237 + * 238 + * Only touches disk_bytenr/disk_num_bytes/offset/ram_bytes. 239 + * For now only uncompressed regular extent can be merged. 240 + * 241 + * @prev and @next will be both updated to point to the new merged range. 242 + * Thus one of them should be removed by the caller. 243 + */ 244 + static void merge_ondisk_extents(struct extent_map *prev, struct extent_map *next) 245 + { 246 + u64 new_disk_bytenr; 247 + u64 new_disk_num_bytes; 248 + u64 new_offset; 249 + 250 + /* @prev and @next should not be compressed. */ 251 + ASSERT(!extent_map_is_compressed(prev)); 252 + ASSERT(!extent_map_is_compressed(next)); 253 + 254 + /* 255 + * There are two different cases where @prev and @next can be merged. 256 + * 257 + * 1) They are referring to the same data extent: 258 + * 259 + * |<----- data extent A ----->| 260 + * |<- prev ->|<- next ->| 261 + * 262 + * 2) They are referring to different data extents but still adjacent: 263 + * 264 + * |<-- data extent A -->|<-- data extent B -->| 265 + * |<- prev ->|<- next ->| 266 + * 267 + * The calculation here always merges the data extents first, then updates 268 + * @offset using the new data extents. 269 + * 270 + * For case 1), the merged data extent would be the same. 
271 + * For case 2), we just merge the two data extents into one. 272 + */ 273 + new_disk_bytenr = min(prev->disk_bytenr, next->disk_bytenr); 274 + new_disk_num_bytes = max(prev->disk_bytenr + prev->disk_num_bytes, 275 + next->disk_bytenr + next->disk_num_bytes) - 276 + new_disk_bytenr; 277 + new_offset = prev->disk_bytenr + prev->offset - new_disk_bytenr; 278 + 279 + prev->disk_bytenr = new_disk_bytenr; 280 + prev->disk_num_bytes = new_disk_num_bytes; 281 + prev->ram_bytes = new_disk_num_bytes; 282 + prev->offset = new_offset; 283 + 284 + next->disk_bytenr = new_disk_bytenr; 285 + next->disk_num_bytes = new_disk_num_bytes; 286 + next->ram_bytes = new_disk_num_bytes; 287 + next->offset = new_offset; 288 + } 289 + 290 + static void dump_extent_map(struct btrfs_fs_info *fs_info, const char *prefix, 291 + struct extent_map *em) 292 + { 293 + if (!IS_ENABLED(CONFIG_BTRFS_DEBUG)) 294 + return; 295 + btrfs_crit(fs_info, 296 + "%s, start=%llu len=%llu disk_bytenr=%llu disk_num_bytes=%llu ram_bytes=%llu offset=%llu flags=0x%x", 297 + prefix, em->start, em->len, em->disk_bytenr, em->disk_num_bytes, 298 + em->ram_bytes, em->offset, em->flags); 299 + ASSERT(0); 300 + } 301 + 302 + /* Internal sanity checks for btrfs debug builds. 
*/ 303 + static void validate_extent_map(struct btrfs_fs_info *fs_info, struct extent_map *em) 304 + { 305 + if (!IS_ENABLED(CONFIG_BTRFS_DEBUG)) 306 + return; 307 + if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) { 308 + if (em->disk_num_bytes == 0) 309 + dump_extent_map(fs_info, "zero disk_num_bytes", em); 310 + if (em->offset + em->len > em->ram_bytes) 311 + dump_extent_map(fs_info, "ram_bytes too small", em); 312 + if (em->offset + em->len > em->disk_num_bytes && 313 + !extent_map_is_compressed(em)) 314 + dump_extent_map(fs_info, "disk_num_bytes too small", em); 315 + if (!extent_map_is_compressed(em) && 316 + em->ram_bytes != em->disk_num_bytes) 317 + dump_extent_map(fs_info, 318 + "ram_bytes mismatch with disk_num_bytes for non-compressed em", 319 + em); 320 + } else if (em->offset) { 321 + dump_extent_map(fs_info, "non-zero offset for hole/inline", em); 322 + } 238 323 } 239 324 240 325 static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em) 241 326 { 327 + struct btrfs_fs_info *fs_info = inode->root->fs_info; 242 328 struct extent_map_tree *tree = &inode->extent_tree; 243 329 struct extent_map *merge = NULL; 244 330 struct rb_node *rb; ··· 354 258 merge = rb_entry(rb, struct extent_map, rb_node); 355 259 if (rb && can_merge_extent_map(merge) && mergeable_maps(merge, em)) { 356 260 em->start = merge->start; 357 - em->orig_start = merge->orig_start; 358 261 em->len += merge->len; 359 - em->block_len += merge->block_len; 360 - em->block_start = merge->block_start; 361 262 em->generation = max(em->generation, merge->generation); 263 + 264 + if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) 265 + merge_ondisk_extents(merge, em); 362 266 em->flags |= EXTENT_FLAG_MERGED; 363 267 364 - rb_erase_cached(&merge->rb_node, &tree->map); 268 + validate_extent_map(fs_info, em); 269 + rb_erase(&merge->rb_node, &tree->root); 365 270 RB_CLEAR_NODE(&merge->rb_node); 366 271 free_extent_map(merge); 367 272 dec_evictable_extent_maps(inode); ··· 374 277 merge = 
rb_entry(rb, struct extent_map, rb_node); 375 278 if (rb && can_merge_extent_map(merge) && mergeable_maps(em, merge)) { 376 279 em->len += merge->len; 377 - em->block_len += merge->block_len; 378 - rb_erase_cached(&merge->rb_node, &tree->map); 280 + if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) 281 + merge_ondisk_extents(em, merge); 282 + validate_extent_map(fs_info, em); 283 + rb_erase(&merge->rb_node, &tree->root); 379 284 RB_CLEAR_NODE(&merge->rb_node); 380 285 em->generation = max(em->generation, merge->generation); 381 286 em->flags |= EXTENT_FLAG_MERGED; ··· 488 389 489 390 lockdep_assert_held_write(&tree->lock); 490 391 491 - ret = tree_insert(&tree->map, em); 392 + validate_extent_map(fs_info, em); 393 + ret = tree_insert(&tree->root, em); 492 394 if (ret) 493 395 return ret; 494 396 ··· 510 410 struct rb_node *prev_or_next = NULL; 511 411 u64 end = range_end(start, len); 512 412 513 - rb_node = __tree_search(&tree->map.rb_root, start, &prev_or_next); 413 + rb_node = __tree_search(&tree->root, start, &prev_or_next); 514 414 if (!rb_node) { 515 415 if (prev_or_next) 516 416 rb_node = prev_or_next; ··· 579 479 lockdep_assert_held_write(&tree->lock); 580 480 581 481 WARN_ON(em->flags & EXTENT_FLAG_PINNED); 582 - rb_erase_cached(&em->rb_node, &tree->map); 482 + rb_erase(&em->rb_node, &tree->root); 583 483 if (!(em->flags & EXTENT_FLAG_LOGGING)) 584 484 list_del_init(&em->list); 585 485 RB_CLEAR_NODE(&em->rb_node); ··· 592 492 struct extent_map *new, 593 493 int modified) 594 494 { 495 + struct btrfs_fs_info *fs_info = inode->root->fs_info; 595 496 struct extent_map_tree *tree = &inode->extent_tree; 596 497 597 498 lockdep_assert_held_write(&tree->lock); 499 + 500 + validate_extent_map(fs_info, new); 598 501 599 502 WARN_ON(cur->flags & EXTENT_FLAG_PINNED); 600 503 ASSERT(extent_map_in_tree(cur)); 601 504 if (!(cur->flags & EXTENT_FLAG_LOGGING)) 602 505 list_del_init(&cur->list); 603 - rb_replace_node_cached(&cur->rb_node, &new->rb_node, &tree->map); 506 + 
rb_replace_node(&cur->rb_node, &new->rb_node, &tree->root); 604 507 RB_CLEAR_NODE(&cur->rb_node); 605 508 606 509 setup_extent_mapping(inode, new, modified); ··· 664 561 start_diff = start - em->start; 665 562 em->start = start; 666 563 em->len = end - start; 667 - if (em->block_start < EXTENT_MAP_LAST_BYTE && 668 - !extent_map_is_compressed(em)) { 669 - em->block_start += start_diff; 670 - em->block_len = em->len; 671 - } 564 + if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE && !extent_map_is_compressed(em)) 565 + em->offset += start_diff; 672 566 return add_extent_mapping(inode, em, 0); 673 567 } 674 568 ··· 700 600 * Tree-checker should have rejected any inline extent with non-zero 701 601 * file offset. Here just do a sanity check. 702 602 */ 703 - if (em->block_start == EXTENT_MAP_INLINE) 603 + if (em->disk_bytenr == EXTENT_MAP_INLINE) 704 604 ASSERT(em->start == 0); 705 605 706 606 ret = add_extent_mapping(inode, em, 0); ··· 757 657 static void drop_all_extent_maps_fast(struct btrfs_inode *inode) 758 658 { 759 659 struct extent_map_tree *tree = &inode->extent_tree; 660 + struct rb_node *node; 760 661 761 662 write_lock(&tree->lock); 762 - while (!RB_EMPTY_ROOT(&tree->map.rb_root)) { 663 + node = rb_first(&tree->root); 664 + while (node) { 763 665 struct extent_map *em; 764 - struct rb_node *node; 666 + struct rb_node *next = rb_next(node); 765 667 766 - node = rb_first_cached(&tree->map); 767 668 em = rb_entry(node, struct extent_map, rb_node); 768 669 em->flags &= ~(EXTENT_FLAG_PINNED | EXTENT_FLAG_LOGGING); 769 670 remove_extent_mapping(inode, em); 770 671 free_extent_map(em); 771 - cond_resched_rwlock_write(&tree->lock); 672 + 673 + if (cond_resched_rwlock_write(&tree->lock)) 674 + node = rb_first(&tree->root); 675 + else 676 + node = next; 772 677 } 773 678 write_unlock(&tree->lock); 774 679 } ··· 834 729 u64 gen; 835 730 unsigned long flags; 836 731 bool modified; 837 - bool compressed; 838 732 839 733 if (em_end < end) { 840 734 next_em = 
next_extent_map(em); ··· 867 763 goto remove_em; 868 764 869 765 gen = em->generation; 870 - compressed = extent_map_is_compressed(em); 871 766 872 767 if (em->start < start) { 873 768 if (!split) { ··· 878 775 split->start = em->start; 879 776 split->len = start - em->start; 880 777 881 - if (em->block_start < EXTENT_MAP_LAST_BYTE) { 882 - split->orig_start = em->orig_start; 883 - split->block_start = em->block_start; 884 - 885 - if (compressed) 886 - split->block_len = em->block_len; 887 - else 888 - split->block_len = split->len; 889 - split->orig_block_len = max(split->block_len, 890 - em->orig_block_len); 778 + if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) { 779 + split->disk_bytenr = em->disk_bytenr; 780 + split->disk_num_bytes = em->disk_num_bytes; 781 + split->offset = em->offset; 891 782 split->ram_bytes = em->ram_bytes; 892 783 } else { 893 - split->orig_start = split->start; 894 - split->block_len = 0; 895 - split->block_start = em->block_start; 896 - split->orig_block_len = 0; 784 + split->disk_bytenr = em->disk_bytenr; 785 + split->disk_num_bytes = 0; 786 + split->offset = 0; 897 787 split->ram_bytes = split->len; 898 788 } 899 789 ··· 906 810 } 907 811 split->start = end; 908 812 split->len = em_end - end; 909 - split->block_start = em->block_start; 813 + split->disk_bytenr = em->disk_bytenr; 910 814 split->flags = flags; 911 815 split->generation = gen; 912 816 913 - if (em->block_start < EXTENT_MAP_LAST_BYTE) { 914 - split->orig_block_len = max(em->block_len, 915 - em->orig_block_len); 916 - 817 + if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) { 818 + split->disk_num_bytes = em->disk_num_bytes; 819 + split->offset = em->offset + end - em->start; 917 820 split->ram_bytes = em->ram_bytes; 918 - if (compressed) { 919 - split->block_len = em->block_len; 920 - split->orig_start = em->orig_start; 921 - } else { 922 - const u64 diff = end - em->start; 923 - 924 - split->block_len = split->len; 925 - split->block_start += diff; 926 - split->orig_start = 
em->orig_start; 927 - } 928 821 } else { 822 + split->disk_num_bytes = 0; 823 + split->offset = 0; 929 824 split->ram_bytes = split->len; 930 - split->orig_start = split->start; 931 - split->block_len = 0; 932 - split->orig_block_len = 0; 933 825 } 934 826 935 827 if (extent_map_in_tree(em)) { ··· 1060 976 1061 977 ASSERT(em->len == len); 1062 978 ASSERT(!extent_map_is_compressed(em)); 1063 - ASSERT(em->block_start < EXTENT_MAP_LAST_BYTE); 979 + ASSERT(em->disk_bytenr < EXTENT_MAP_LAST_BYTE); 1064 980 ASSERT(em->flags & EXTENT_FLAG_PINNED); 1065 981 ASSERT(!(em->flags & EXTENT_FLAG_LOGGING)); 1066 982 ASSERT(!list_empty(&em->list)); ··· 1071 987 /* First, replace the em with a new extent_map starting from * em->start */ 1072 988 split_pre->start = em->start; 1073 989 split_pre->len = pre; 1074 - split_pre->orig_start = split_pre->start; 1075 - split_pre->block_start = new_logical; 1076 - split_pre->block_len = split_pre->len; 1077 - split_pre->orig_block_len = split_pre->block_len; 990 + split_pre->disk_bytenr = new_logical; 991 + split_pre->disk_num_bytes = split_pre->len; 992 + split_pre->offset = 0; 1078 993 split_pre->ram_bytes = split_pre->len; 1079 994 split_pre->flags = flags; 1080 995 split_pre->generation = em->generation; ··· 1088 1005 /* Insert the middle extent_map. 
*/ 1089 1006 split_mid->start = em->start + pre; 1090 1007 split_mid->len = em->len - pre; 1091 - split_mid->orig_start = split_mid->start; 1092 - split_mid->block_start = em->block_start + pre; 1093 - split_mid->block_len = split_mid->len; 1094 - split_mid->orig_block_len = split_mid->block_len; 1008 + split_mid->disk_bytenr = extent_map_block_start(em) + pre; 1009 + split_mid->disk_num_bytes = split_mid->len; 1010 + split_mid->offset = 0; 1095 1011 split_mid->ram_bytes = split_mid->len; 1096 1012 split_mid->flags = flags; 1097 1013 split_mid->generation = em->generation; ··· 1158 1076 return 0; 1159 1077 } 1160 1078 1161 - node = rb_first_cached(&tree->map); 1079 + node = rb_first(&tree->root); 1162 1080 while (node) { 1081 + struct rb_node *next = rb_next(node); 1163 1082 struct extent_map *em; 1164 1083 1165 1084 em = rb_entry(node, struct extent_map, rb_node); 1166 - node = rb_next(node); 1167 1085 ctx->scanned++; 1168 1086 1169 1087 if (em->flags & EXTENT_FLAG_PINNED) ··· 1197 1115 */ 1198 1116 if (need_resched() || rwlock_needbreak(&tree->lock)) 1199 1117 break; 1118 + node = next; 1200 1119 } 1201 1120 write_unlock(&tree->lock); 1202 1121 up_read(&inode->i_mmap_lock);
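The arithmetic in merge_ondisk_extents() can be checked by hand with a small userspace sketch. This is not kernel code: `struct em_sketch`, `merge_ondisk_sketch()` and `merge_forward_sketch()` are hypothetical names made up for illustration, and only the min/max/offset calculation is taken from the hunk above. It assumes uncompressed, physically adjacent ranges, which is what mergeable_maps() is meant to guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct extent_map (illustration only). */
struct em_sketch {
	uint64_t start;          /* file offset */
	uint64_t len;            /* length in the file */
	uint64_t disk_bytenr;    /* start of the full on-disk extent */
	uint64_t disk_num_bytes; /* length of the full on-disk extent */
	uint64_t offset;         /* offset of this mapping inside the extent */
	uint64_t ram_bytes;      /* equals disk_num_bytes when uncompressed */
};

static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }
static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/* Same calculation as merge_ondisk_extents(): both maps end up describing the union. */
static void merge_ondisk_sketch(struct em_sketch *prev, struct em_sketch *next)
{
	uint64_t new_disk_bytenr = min_u64(prev->disk_bytenr, next->disk_bytenr);
	uint64_t new_disk_num_bytes =
		max_u64(prev->disk_bytenr + prev->disk_num_bytes,
			next->disk_bytenr + next->disk_num_bytes) - new_disk_bytenr;
	uint64_t new_offset = prev->disk_bytenr + prev->offset - new_disk_bytenr;

	prev->disk_bytenr = new_disk_bytenr;
	prev->disk_num_bytes = new_disk_num_bytes;
	prev->ram_bytes = new_disk_num_bytes;
	prev->offset = new_offset;

	next->disk_bytenr = new_disk_bytenr;
	next->disk_num_bytes = new_disk_num_bytes;
	next->ram_bytes = new_disk_num_bytes;
	next->offset = new_offset;
}

/* Forward merge in the spirit of the second branch of try_merge_map(): em absorbs next. */
static void merge_forward_sketch(struct em_sketch *em, struct em_sketch *next)
{
	/* Sketch-only precondition: the two ranges must be adjacent on disk. */
	assert(em->disk_bytenr + em->offset + em->len ==
	       next->disk_bytenr + next->offset);

	em->len += next->len;
	merge_ondisk_sketch(em, next);
}
```

For case 2) in the comment (two different but adjacent data extents), take prev covering the last 4K of an 8K extent at 1M and next covering the first 4K of the following 8K extent. The merged map then describes a 16K extent starting at 1M with offset 4K and len 8K, so `offset + len <= disk_num_bytes` still holds, which is the invariant validate_extent_map() checks.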
+26 -30
fs/btrfs/extent_map.h
··· 4 4 #define BTRFS_EXTENT_MAP_H 5 5 6 6 #include <linux/compiler_types.h> 7 - #include <linux/rwlock_types.h> 7 + #include <linux/spinlock_types.h> 8 8 #include <linux/rbtree.h> 9 9 #include <linux/list.h> 10 10 #include <linux/refcount.h> 11 11 #include "misc.h" 12 - #include "extent_map.h" 13 12 #include "compression.h" 14 13 15 14 struct btrfs_inode; ··· 61 62 u64 len; 62 63 63 64 /* 64 - * The file offset of the original file extent before splitting. 65 + * The bytenr of the full on-disk extent. 65 66 * 66 - * This is an in-memory only member, matching 67 - * extent_map::start - btrfs_file_extent_item::offset for 68 - * regular/preallocated extents. EXTENT_MAP_HOLE otherwise. 67 + * For regular extents it's btrfs_file_extent_item::disk_bytenr. 68 + * For holes it's EXTENT_MAP_HOLE and for inline extents it's 69 + * EXTENT_MAP_INLINE. 69 70 */ 70 - u64 orig_start; 71 + u64 disk_bytenr; 71 72 72 73 /* 73 74 * The full on-disk extent length, matching 74 75 * btrfs_file_extent_item::disk_num_bytes. 75 76 */ 76 - u64 orig_block_len; 77 + u64 disk_num_bytes; 78 + 79 + /* 80 + * Offset inside the decompressed extent. 81 + * 82 + * For regular extents it's btrfs_file_extent_item::offset. 83 + * For holes and inline extents it's 0. 84 + */ 85 + u64 offset; 77 86 78 87 /* 79 88 * The decompressed size of the whole on-disk extent, matching 80 89 * btrfs_file_extent_item::ram_bytes. 81 90 */ 82 91 u64 ram_bytes; 83 - 84 - /* 85 - * The on-disk logical bytenr for the file extent. 86 - * 87 - * For compressed extents it matches btrfs_file_extent_item::disk_bytenr. 88 - * For uncompressed extents it matches 89 - * btrfs_file_extent_item::disk_bytenr + btrfs_file_extent_item::offset 90 - * 91 - * For holes it is EXTENT_MAP_HOLE and for inline extents it is 92 - * EXTENT_MAP_INLINE. 93 - */ 94 - u64 block_start; 95 - 96 - /* 97 - * The on-disk length for the file extent. 98 - * 99 - * For compressed extents it matches btrfs_file_extent_item::disk_num_bytes. 
100 - * For uncompressed extents it matches extent_map::len. 101 - * For holes and inline extents it's -1 and shouldn't be used. 102 - */ 103 - u64 block_len; 104 92 105 93 /* 106 94 * Generation of the extent map, for merged em it's the highest ··· 101 115 }; 102 116 103 117 struct extent_map_tree { 104 - struct rb_root_cached map; 118 + struct rb_root root; 105 119 struct list_head modified_extents; 106 120 rwlock_t lock; 107 121 }; ··· 147 161 static inline int extent_map_in_tree(const struct extent_map *em) 148 162 { 149 163 return !RB_EMPTY_NODE(&em->rb_node); 164 + } 165 + 166 + static inline u64 extent_map_block_start(const struct extent_map *em) 167 + { 168 + if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) { 169 + if (extent_map_is_compressed(em)) 170 + return em->disk_bytenr; 171 + return em->disk_bytenr + em->offset; 172 + } 173 + return em->disk_bytenr; 150 174 } 151 175 152 176 static inline u64 extent_map_end(const struct extent_map *em)
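With block_start and block_len removed from struct extent_map, the old values are derived on demand from disk_bytenr, disk_num_bytes, offset and len. The following userspace sketch restates the two helpers from this diff (extent_map_block_start() and extent_map_block_len()) so the mapping can be tried out. `struct em_fields`, the `sk_` prefix and the `compressed` flag are hypothetical simplifications; the sentinel values are assumed to mirror the kernel's EXTENT_MAP_LAST_BYTE, EXTENT_MAP_HOLE and EXTENT_MAP_INLINE definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sentinels for this sketch; assumed to mirror the kernel's EXTENT_MAP_* values. */
#define SK_LAST_BYTE ((uint64_t)-4)
#define SK_HOLE      ((uint64_t)-3)
#define SK_INLINE    ((uint64_t)-2)

/* Hypothetical subset of struct extent_map, with compression reduced to a flag. */
struct em_fields {
	uint64_t len;
	uint64_t disk_bytenr;
	uint64_t disk_num_bytes;
	uint64_t offset;
	bool compressed;
};

/* Equivalent of the removed block_start member. */
static uint64_t sk_block_start(const struct em_fields *em)
{
	if (em->disk_bytenr < SK_LAST_BYTE) {
		/* Compressed extents are always read from the start of the extent. */
		if (em->compressed)
			return em->disk_bytenr;
		return em->disk_bytenr + em->offset;
	}
	/* Holes and inline extents keep their sentinel value. */
	return em->disk_bytenr;
}

/* Equivalent of the removed block_len member. */
static uint64_t sk_block_len(const struct em_fields *em)
{
	if (em->compressed)
		return em->disk_num_bytes;
	return em->len;
}

/* Small self-check with made-up numbers. */
static void sk_self_check(void)
{
	struct em_fields plain = { .len = 8192, .disk_bytenr = 1048576,
				   .disk_num_bytes = 16384, .offset = 4096,
				   .compressed = false };

	assert(sk_block_start(&plain) == 1048576 + 4096);
	assert(sk_block_len(&plain) == 8192);
}
```

The design point this illustrates is that the new members match btrfs_file_extent_item one to one, so splitting or trimming an uncompressed map only has to adjust `offset` (and `start`/`len`) instead of keeping two derived fields in sync.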
+930
fs/btrfs/fiemap.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #include "backref.h" 4 + #include "btrfs_inode.h" 5 + #include "fiemap.h" 6 + #include "file.h" 7 + #include "file-item.h" 8 + 9 + struct btrfs_fiemap_entry { 10 + u64 offset; 11 + u64 phys; 12 + u64 len; 13 + u32 flags; 14 + }; 15 + 16 + /* 17 + * Indicate the caller of emit_fiemap_extent() that it needs to unlock the file 18 + * range from the inode's io tree, unlock the subvolume tree search path, flush 19 + * the fiemap cache and relock the file range and research the subvolume tree. 20 + * The value here is something negative that can't be confused with a valid 21 + * errno value and different from 1 because that's also a return value from 22 + * fiemap_fill_next_extent() and also it's often used to mean some btree search 23 + * did not find a key, so make it some distinct negative value. 24 + */ 25 + #define BTRFS_FIEMAP_FLUSH_CACHE (-(MAX_ERRNO + 1)) 26 + 27 + /* 28 + * Used to: 29 + * 30 + * - Cache the next entry to be emitted to the fiemap buffer, so that we can 31 + * merge extents that are contiguous and can be grouped as a single one; 32 + * 33 + * - Store extents ready to be written to the fiemap buffer in an intermediary 34 + * buffer. This intermediary buffer is to ensure that in case the fiemap 35 + * buffer is memory mapped to the fiemap target file, we don't deadlock 36 + * during btrfs_page_mkwrite(). This is because during fiemap we are locking 37 + * an extent range in order to prevent races with delalloc flushing and 38 + * ordered extent completion, which is needed in order to reliably detect 39 + * delalloc in holes and prealloc extents. And this can lead to a deadlock 40 + * if the fiemap buffer is memory mapped to the file we are running fiemap 41 + * against (a silly, useless in practice scenario, but possible) because 42 + * btrfs_page_mkwrite() will try to lock the same extent range. 43 + */ 44 + struct fiemap_cache { 45 + /* An array of ready fiemap entries. 
*/ 46 + struct btrfs_fiemap_entry *entries; 47 + /* Number of entries in the entries array. */ 48 + int entries_size; 49 + /* Index of the next entry in the entries array to write to. */ 50 + int entries_pos; 51 + /* 52 + * Once the entries array is full, this indicates what's the offset for 53 + * the next file extent item we must search for in the inode's subvolume 54 + * tree after unlocking the extent range in the inode's io tree and 55 + * releasing the search path. 56 + */ 57 + u64 next_search_offset; 58 + /* 59 + * This matches struct fiemap_extent_info::fi_mapped_extents, we use it 60 + * to count ourselves emitted extents and stop instead of relying on 61 + * fiemap_fill_next_extent() because we buffer ready fiemap entries at 62 + * the @entries array, and we want to stop as soon as we hit the max 63 + * amount of extents to map, not just to save time but also to make the 64 + * logic at extent_fiemap() simpler. 65 + */ 66 + unsigned int extents_mapped; 67 + /* Fields for the cached extent (unsubmitted, not ready, extent). */ 68 + u64 offset; 69 + u64 phys; 70 + u64 len; 71 + u32 flags; 72 + bool cached; 73 + }; 74 + 75 + static int flush_fiemap_cache(struct fiemap_extent_info *fieinfo, 76 + struct fiemap_cache *cache) 77 + { 78 + for (int i = 0; i < cache->entries_pos; i++) { 79 + struct btrfs_fiemap_entry *entry = &cache->entries[i]; 80 + int ret; 81 + 82 + ret = fiemap_fill_next_extent(fieinfo, entry->offset, 83 + entry->phys, entry->len, 84 + entry->flags); 85 + /* 86 + * Ignore 1 (reached max entries) because we keep track of that 87 + * ourselves in emit_fiemap_extent(). 88 + */ 89 + if (ret < 0) 90 + return ret; 91 + } 92 + cache->entries_pos = 0; 93 + 94 + return 0; 95 + } 96 + 97 + /* 98 + * Helper to submit fiemap extent. 99 + * 100 + * Will try to merge current fiemap extent specified by @offset, @phys, 101 + * @len and @flags with cached one. 102 + * And only when we fails to merge, cached one will be submitted as 103 + * fiemap extent. 
104 + * 105 + * Return value is the same as fiemap_fill_next_extent(). 106 + */ 107 + static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo, 108 + struct fiemap_cache *cache, 109 + u64 offset, u64 phys, u64 len, u32 flags) 110 + { 111 + struct btrfs_fiemap_entry *entry; 112 + u64 cache_end; 113 + 114 + /* Set at the end of extent_fiemap(). */ 115 + ASSERT((flags & FIEMAP_EXTENT_LAST) == 0); 116 + 117 + if (!cache->cached) 118 + goto assign; 119 + 120 + /* 121 + * When iterating the extents of the inode, at extent_fiemap(), we may 122 + * find an extent that starts at an offset behind the end offset of the 123 + * previous extent we processed. This happens if fiemap is called 124 + * without FIEMAP_FLAG_SYNC and there are ordered extents completing 125 + * after we had to unlock the file range, release the search path, emit 126 + * the fiemap extents stored in the buffer (cache->entries array) and 127 + * then lock the remainder of the range and re-search the btree. 128 + * 129 + * For example we are in leaf X processing its last item, which is the 130 + * file extent item for file range [512K, 1M[, and after 131 + * btrfs_next_leaf() releases the path, there's an ordered extent that 132 + * completes for the file range [768K, 2M[, and that results in trimming 133 + * the file extent item so that it now corresponds to the file range 134 + * [512K, 768K[ and a new file extent item is inserted for the file 135 + * range [768K, 2M[, which may end up as the last item of leaf X or as 136 + * the first item of the next leaf - in either case btrfs_next_leaf() 137 + * will leave us with a path pointing to the new extent item, for the 138 + * file range [768K, 2M[, since that's the first key that follows the 139 + * last one we processed. So in order not to report overlapping extents 140 + * to user space, we trim the length of the previously cached extent and 141 + * emit it.
142 + * 143 + * Upon calling btrfs_next_leaf() we may also find an extent with an 144 + * offset smaller than or equal to cache->offset, and this happens 145 + * when we had a hole or prealloc extent with several delalloc ranges in 146 + * it, but after btrfs_next_leaf() released the path, delalloc was 147 + * flushed and the resulting ordered extents were completed, so we can 148 + * now have found a file extent item for an offset that is smaller than 149 + * or equal to what we have in cache->offset. We deal with this as 150 + * described below. 151 + */ 152 + cache_end = cache->offset + cache->len; 153 + if (cache_end > offset) { 154 + if (offset == cache->offset) { 155 + /* 156 + * We cached a delalloc range (found in the io tree) for 157 + * a hole or prealloc extent and we have now found a 158 + * file extent item for the same offset. What we have 159 + * now is more recent and up to date, so discard what 160 + * we had in the cache and use what we have just found. 161 + */ 162 + goto assign; 163 + } else if (offset > cache->offset) { 164 + /* 165 + * The extent range we previously found ends after the 166 + * offset of the file extent item we found and that 167 + * offset falls somewhere in the middle of that previous 168 + * extent range. So adjust the range we previously found 169 + * to end at the offset of the file extent item we have 170 + * just found, since this extent is more up to date. 171 + * Emit that adjusted range and cache the file extent 172 + * item we have just found. This corresponds to the case 173 + * where a previously found file extent item was split 174 + * due to an ordered extent completing. 175 + */ 176 + cache->len = offset - cache->offset; 177 + goto emit; 178 + } else { 179 + const u64 range_end = offset + len; 180 + 181 + /* 182 + * The offset of the file extent item we have just found 183 + * is behind the cached offset.
This means we were 184 + * processing a hole or prealloc extent for which we 185 + * have found delalloc ranges (in the io tree), so what 186 + * we have in the cache is the last delalloc range we 187 + * found while the file extent item we found can be 188 + * either for a whole delalloc range we previously 189 + * emitted or only a part of that range. 190 + * 191 + * We have two cases here: 192 + * 193 + * 1) The file extent item's range ends at or behind the 194 + * cached extent's end. In this case just ignore the 195 + * current file extent item because we don't want to 196 + * overlap with previous ranges that may have been 197 + * emitted already; 198 + * 199 + * 2) The file extent item starts behind the currently 200 + * cached extent but its end offset goes beyond the 201 + * end offset of the cached extent. We don't want to 202 + * overlap with a previous range that may have been 203 + * emitted already, so we emit the currently cached 204 + * extent and then partially store the current file 205 + * extent item's range in the cache, for the subrange 206 + * going from the cached extent's end to the end of the 207 + * file extent item.
208 + */ 209 + if (range_end <= cache_end) 210 + return 0; 211 + 212 + if (!(flags & (FIEMAP_EXTENT_ENCODED | FIEMAP_EXTENT_DELALLOC))) 213 + phys += cache_end - offset; 214 + 215 + offset = cache_end; 216 + len = range_end - cache_end; 217 + goto emit; 218 + } 219 + } 220 + 221 + /* 222 + * Only merges fiemap extents if 223 + * 1) Their logical addresses are continuous 224 + * 225 + * 2) Their physical addresses are continuous 226 + * So truly compressed (physical size smaller than logical size) 227 + * extents won't get merged with each other 228 + * 229 + * 3) Share same flags 230 + */ 231 + if (cache->offset + cache->len == offset && 232 + cache->phys + cache->len == phys && 233 + cache->flags == flags) { 234 + cache->len += len; 235 + return 0; 236 + } 237 + 238 + emit: 239 + /* Not mergeable, need to submit cached one */ 240 + 241 + if (cache->entries_pos == cache->entries_size) { 242 + /* 243 + * We will need to research for the end offset of the last 244 + * stored extent and not from the current offset, because after 245 + * unlocking the range and releasing the path, if there's a hole 246 + * between that end offset and this current offset, a new extent 247 + * may have been inserted due to a new write, so we don't want 248 + * to miss it. 
249 + */ 250 + entry = &cache->entries[cache->entries_size - 1]; 251 + cache->next_search_offset = entry->offset + entry->len; 252 + cache->cached = false; 253 + 254 + return BTRFS_FIEMAP_FLUSH_CACHE; 255 + } 256 + 257 + entry = &cache->entries[cache->entries_pos]; 258 + entry->offset = cache->offset; 259 + entry->phys = cache->phys; 260 + entry->len = cache->len; 261 + entry->flags = cache->flags; 262 + cache->entries_pos++; 263 + cache->extents_mapped++; 264 + 265 + if (cache->extents_mapped == fieinfo->fi_extents_max) { 266 + cache->cached = false; 267 + return 1; 268 + } 269 + assign: 270 + cache->cached = true; 271 + cache->offset = offset; 272 + cache->phys = phys; 273 + cache->len = len; 274 + cache->flags = flags; 275 + 276 + return 0; 277 + } 278 + 279 + /* 280 + * Emit last fiemap cache 281 + * 282 + * The last fiemap cache may still be cached in the following case: 283 + * 0 4k 8k 284 + * |<- Fiemap range ->| 285 + * |<------------ First extent ----------->| 286 + * 287 + * In this case, the first extent range will be cached but not emitted. 288 + * So we must emit it before ending extent_fiemap(). 
289 + */ 290 + static int emit_last_fiemap_cache(struct fiemap_extent_info *fieinfo, 291 + struct fiemap_cache *cache) 292 + { 293 + int ret; 294 + 295 + if (!cache->cached) 296 + return 0; 297 + 298 + ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys, 299 + cache->len, cache->flags); 300 + cache->cached = false; 301 + if (ret > 0) 302 + ret = 0; 303 + return ret; 304 + } 305 + 306 + static int fiemap_next_leaf_item(struct btrfs_inode *inode, struct btrfs_path *path) 307 + { 308 + struct extent_buffer *clone = path->nodes[0]; 309 + struct btrfs_key key; 310 + int slot; 311 + int ret; 312 + 313 + path->slots[0]++; 314 + if (path->slots[0] < btrfs_header_nritems(path->nodes[0])) 315 + return 0; 316 + 317 + /* 318 + * Add a temporary extra ref to an already cloned extent buffer to 319 + * prevent btrfs_next_leaf() freeing it, we want to reuse it to avoid 320 + * the cost of allocating a new one. 321 + */ 322 + ASSERT(test_bit(EXTENT_BUFFER_UNMAPPED, &clone->bflags)); 323 + atomic_inc(&clone->refs); 324 + 325 + ret = btrfs_next_leaf(inode->root, path); 326 + if (ret != 0) 327 + goto out; 328 + 329 + /* 330 + * Don't bother with cloning if there are no more file extent items for 331 + * our inode. 332 + */ 333 + btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); 334 + if (key.objectid != btrfs_ino(inode) || key.type != BTRFS_EXTENT_DATA_KEY) { 335 + ret = 1; 336 + goto out; 337 + } 338 + 339 + /* 340 + * Important to preserve the start field, for the optimizations when 341 + * checking if extents are shared (see extent_fiemap()). 342 + * 343 + * We must set ->start before calling copy_extent_buffer_full(). If we 344 + * are on sub-pagesize blocksize, we use ->start to determine the offset 345 + * into the folio where our eb exists, and if we update ->start after 346 + * the fact then any subsequent reads of the eb may read from a 347 + * different offset in the folio than where we originally copied into. 
348 + */ 349 + clone->start = path->nodes[0]->start; 350 + /* See the comment at fiemap_search_slot() about why we clone. */ 351 + copy_extent_buffer_full(clone, path->nodes[0]); 352 + 353 + slot = path->slots[0]; 354 + btrfs_release_path(path); 355 + path->nodes[0] = clone; 356 + path->slots[0] = slot; 357 + out: 358 + if (ret) 359 + free_extent_buffer(clone); 360 + 361 + return ret; 362 + } 363 + 364 + /* 365 + * Search for the first file extent item that starts at a given file offset or 366 + * the one that starts immediately before that offset. 367 + * Returns: 0 on success, < 0 on error, 1 if not found. 368 + */ 369 + static int fiemap_search_slot(struct btrfs_inode *inode, struct btrfs_path *path, 370 + u64 file_offset) 371 + { 372 + const u64 ino = btrfs_ino(inode); 373 + struct btrfs_root *root = inode->root; 374 + struct extent_buffer *clone; 375 + struct btrfs_key key; 376 + int slot; 377 + int ret; 378 + 379 + key.objectid = ino; 380 + key.type = BTRFS_EXTENT_DATA_KEY; 381 + key.offset = file_offset; 382 + 383 + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); 384 + if (ret < 0) 385 + return ret; 386 + 387 + if (ret > 0 && path->slots[0] > 0) { 388 + btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1); 389 + if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY) 390 + path->slots[0]--; 391 + } 392 + 393 + if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) { 394 + ret = btrfs_next_leaf(root, path); 395 + if (ret != 0) 396 + return ret; 397 + 398 + btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); 399 + if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) 400 + return 1; 401 + } 402 + 403 + /* 404 + * We clone the leaf and use it during fiemap. This is because while 405 + * using the leaf we do expensive things like checking if an extent is 406 + * shared, which can take a long time. In order to prevent blocking 407 + * other tasks for too long, we use a clone of the leaf. 
We have locked 408 + * the file range in the inode's io tree, so we know none of our file 409 + * extent items can change. This way we avoid blocking other tasks that 410 + * want to insert items for other inodes in the same leaf or b+tree 411 + * rebalance operations (triggered for example when someone is trying 412 + * to push items into this leaf when trying to insert an item in a 413 + * neighbour leaf). 414 + * We also need the private clone because holding a read lock on an 415 + * extent buffer of the subvolume's b+tree will make lockdep unhappy 416 + * when we check if extents are shared, as backref walking may need to 417 + * lock the same leaf we are processing. 418 + */ 419 + clone = btrfs_clone_extent_buffer(path->nodes[0]); 420 + if (!clone) 421 + return -ENOMEM; 422 + 423 + slot = path->slots[0]; 424 + btrfs_release_path(path); 425 + path->nodes[0] = clone; 426 + path->slots[0] = slot; 427 + 428 + return 0; 429 + } 430 + 431 + /* 432 + * Process a range which is a hole or a prealloc extent in the inode's subvolume 433 + * btree. If @disk_bytenr is 0, we are dealing with a hole, otherwise a prealloc 434 + * extent. The end offset (@end) is inclusive. 435 + */ 436 + static int fiemap_process_hole(struct btrfs_inode *inode, 437 + struct fiemap_extent_info *fieinfo, 438 + struct fiemap_cache *cache, 439 + struct extent_state **delalloc_cached_state, 440 + struct btrfs_backref_share_check_ctx *backref_ctx, 441 + u64 disk_bytenr, u64 extent_offset, 442 + u64 extent_gen, 443 + u64 start, u64 end) 444 + { 445 + const u64 i_size = i_size_read(&inode->vfs_inode); 446 + u64 cur_offset = start; 447 + u64 last_delalloc_end = 0; 448 + u32 prealloc_flags = FIEMAP_EXTENT_UNWRITTEN; 449 + bool checked_extent_shared = false; 450 + int ret; 451 + 452 + /* 453 + * There can be no delalloc past i_size, so don't waste time looking for 454 + * it beyond i_size. 
455 + */ 456 + while (cur_offset < end && cur_offset < i_size) { 457 + u64 delalloc_start; 458 + u64 delalloc_end; 459 + u64 prealloc_start; 460 + u64 prealloc_len = 0; 461 + bool delalloc; 462 + 463 + delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end, 464 + delalloc_cached_state, 465 + &delalloc_start, 466 + &delalloc_end); 467 + if (!delalloc) 468 + break; 469 + 470 + /* 471 + * If this is a prealloc extent we have to report every section 472 + * of it that has no delalloc. 473 + */ 474 + if (disk_bytenr != 0) { 475 + if (last_delalloc_end == 0) { 476 + prealloc_start = start; 477 + prealloc_len = delalloc_start - start; 478 + } else { 479 + prealloc_start = last_delalloc_end + 1; 480 + prealloc_len = delalloc_start - prealloc_start; 481 + } 482 + } 483 + 484 + if (prealloc_len > 0) { 485 + if (!checked_extent_shared && fieinfo->fi_extents_max) { 486 + ret = btrfs_is_data_extent_shared(inode, 487 + disk_bytenr, 488 + extent_gen, 489 + backref_ctx); 490 + if (ret < 0) 491 + return ret; 492 + else if (ret > 0) 493 + prealloc_flags |= FIEMAP_EXTENT_SHARED; 494 + 495 + checked_extent_shared = true; 496 + } 497 + ret = emit_fiemap_extent(fieinfo, cache, prealloc_start, 498 + disk_bytenr + extent_offset, 499 + prealloc_len, prealloc_flags); 500 + if (ret) 501 + return ret; 502 + extent_offset += prealloc_len; 503 + } 504 + 505 + ret = emit_fiemap_extent(fieinfo, cache, delalloc_start, 0, 506 + delalloc_end + 1 - delalloc_start, 507 + FIEMAP_EXTENT_DELALLOC | 508 + FIEMAP_EXTENT_UNKNOWN); 509 + if (ret) 510 + return ret; 511 + 512 + last_delalloc_end = delalloc_end; 513 + cur_offset = delalloc_end + 1; 514 + extent_offset += cur_offset - delalloc_start; 515 + cond_resched(); 516 + } 517 + 518 + /* 519 + * Either we found no delalloc for the whole prealloc extent or we have 520 + * a prealloc extent that spans i_size or starts at or after i_size. 
521 + */ 522 + if (disk_bytenr != 0 && last_delalloc_end < end) { 523 + u64 prealloc_start; 524 + u64 prealloc_len; 525 + 526 + if (last_delalloc_end == 0) { 527 + prealloc_start = start; 528 + prealloc_len = end + 1 - start; 529 + } else { 530 + prealloc_start = last_delalloc_end + 1; 531 + prealloc_len = end + 1 - prealloc_start; 532 + } 533 + 534 + if (!checked_extent_shared && fieinfo->fi_extents_max) { 535 + ret = btrfs_is_data_extent_shared(inode, 536 + disk_bytenr, 537 + extent_gen, 538 + backref_ctx); 539 + if (ret < 0) 540 + return ret; 541 + else if (ret > 0) 542 + prealloc_flags |= FIEMAP_EXTENT_SHARED; 543 + } 544 + ret = emit_fiemap_extent(fieinfo, cache, prealloc_start, 545 + disk_bytenr + extent_offset, 546 + prealloc_len, prealloc_flags); 547 + if (ret) 548 + return ret; 549 + } 550 + 551 + return 0; 552 + } 553 + 554 + static int fiemap_find_last_extent_offset(struct btrfs_inode *inode, 555 + struct btrfs_path *path, 556 + u64 *last_extent_end_ret) 557 + { 558 + const u64 ino = btrfs_ino(inode); 559 + struct btrfs_root *root = inode->root; 560 + struct extent_buffer *leaf; 561 + struct btrfs_file_extent_item *ei; 562 + struct btrfs_key key; 563 + u64 disk_bytenr; 564 + int ret; 565 + 566 + /* 567 + * Lookup the last file extent. We're not using i_size here because 568 + * there might be preallocation past i_size. 569 + */ 570 + ret = btrfs_lookup_file_extent(NULL, root, path, ino, (u64)-1, 0); 571 + /* There can't be a file extent item at offset (u64)-1 */ 572 + ASSERT(ret != 0); 573 + if (ret < 0) 574 + return ret; 575 + 576 + /* 577 + * For a non-existing key, btrfs_search_slot() always leaves us at a 578 + * slot > 0, except if the btree is empty, which is impossible because 579 + * at least it has the inode item for this inode and all the items for 580 + * the root inode 256. 
581 + */ 582 + ASSERT(path->slots[0] > 0); 583 + path->slots[0]--; 584 + leaf = path->nodes[0]; 585 + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); 586 + if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) { 587 + /* No file extent items in the subvolume tree. */ 588 + *last_extent_end_ret = 0; 589 + return 0; 590 + } 591 + 592 + /* 593 + * For an inline extent, the disk_bytenr is where inline data starts at, 594 + * so first check if we have an inline extent item before checking if we 595 + * have an implicit hole (disk_bytenr == 0). 596 + */ 597 + ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item); 598 + if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) { 599 + *last_extent_end_ret = btrfs_file_extent_end(path); 600 + return 0; 601 + } 602 + 603 + /* 604 + * Find the last file extent item that is not a hole (when NO_HOLES is 605 + * not enabled). This should take at most 2 iterations in the worst 606 + * case: we have one hole file extent item at slot 0 of a leaf and 607 + * another hole file extent item as the last item in the previous leaf. 608 + * This is because we merge file extent items that represent holes. 609 + */ 610 + disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei); 611 + while (disk_bytenr == 0) { 612 + ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY); 613 + if (ret < 0) { 614 + return ret; 615 + } else if (ret > 0) { 616 + /* No file extent items that are not holes. 
*/ 617 + *last_extent_end_ret = 0; 618 + return 0; 619 + } 620 + leaf = path->nodes[0]; 621 + ei = btrfs_item_ptr(leaf, path->slots[0], 622 + struct btrfs_file_extent_item); 623 + disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei); 624 + } 625 + 626 + *last_extent_end_ret = btrfs_file_extent_end(path); 627 + return 0; 628 + } 629 + 630 + static int extent_fiemap(struct btrfs_inode *inode, 631 + struct fiemap_extent_info *fieinfo, 632 + u64 start, u64 len) 633 + { 634 + const u64 ino = btrfs_ino(inode); 635 + struct extent_state *cached_state = NULL; 636 + struct extent_state *delalloc_cached_state = NULL; 637 + struct btrfs_path *path; 638 + struct fiemap_cache cache = { 0 }; 639 + struct btrfs_backref_share_check_ctx *backref_ctx; 640 + u64 last_extent_end; 641 + u64 prev_extent_end; 642 + u64 range_start; 643 + u64 range_end; 644 + const u64 sectorsize = inode->root->fs_info->sectorsize; 645 + bool stopped = false; 646 + int ret; 647 + 648 + cache.entries_size = PAGE_SIZE / sizeof(struct btrfs_fiemap_entry); 649 + cache.entries = kmalloc_array(cache.entries_size, 650 + sizeof(struct btrfs_fiemap_entry), 651 + GFP_KERNEL); 652 + backref_ctx = btrfs_alloc_backref_share_check_ctx(); 653 + path = btrfs_alloc_path(); 654 + if (!cache.entries || !backref_ctx || !path) { 655 + ret = -ENOMEM; 656 + goto out; 657 + } 658 + 659 + restart: 660 + range_start = round_down(start, sectorsize); 661 + range_end = round_up(start + len, sectorsize); 662 + prev_extent_end = range_start; 663 + 664 + lock_extent(&inode->io_tree, range_start, range_end, &cached_state); 665 + 666 + ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end); 667 + if (ret < 0) 668 + goto out_unlock; 669 + btrfs_release_path(path); 670 + 671 + path->reada = READA_FORWARD; 672 + ret = fiemap_search_slot(inode, path, range_start); 673 + if (ret < 0) { 674 + goto out_unlock; 675 + } else if (ret > 0) { 676 + /* 677 + * No file extent item found, but we may have delalloc between 678 + * the 
current offset and i_size. So check for that. 679 + */ 680 + ret = 0; 681 + goto check_eof_delalloc; 682 + } 683 + 684 + while (prev_extent_end < range_end) { 685 + struct extent_buffer *leaf = path->nodes[0]; 686 + struct btrfs_file_extent_item *ei; 687 + struct btrfs_key key; 688 + u64 extent_end; 689 + u64 extent_len; 690 + u64 extent_offset = 0; 691 + u64 extent_gen; 692 + u64 disk_bytenr = 0; 693 + u64 flags = 0; 694 + int extent_type; 695 + u8 compression; 696 + 697 + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); 698 + if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) 699 + break; 700 + 701 + extent_end = btrfs_file_extent_end(path); 702 + 703 + /* 704 + * The first iteration can leave us at an extent item that ends 705 + * before our range's start. Move to the next item. 706 + */ 707 + if (extent_end <= range_start) 708 + goto next_item; 709 + 710 + backref_ctx->curr_leaf_bytenr = leaf->start; 711 + 712 + /* We have in implicit hole (NO_HOLES feature enabled). */ 713 + if (prev_extent_end < key.offset) { 714 + const u64 hole_end = min(key.offset, range_end) - 1; 715 + 716 + ret = fiemap_process_hole(inode, fieinfo, &cache, 717 + &delalloc_cached_state, 718 + backref_ctx, 0, 0, 0, 719 + prev_extent_end, hole_end); 720 + if (ret < 0) { 721 + goto out_unlock; 722 + } else if (ret > 0) { 723 + /* fiemap_fill_next_extent() told us to stop. */ 724 + stopped = true; 725 + break; 726 + } 727 + 728 + /* We've reached the end of the fiemap range, stop. 
*/ 729 + if (key.offset >= range_end) { 730 + stopped = true; 731 + break; 732 + } 733 + } 734 + 735 + extent_len = extent_end - key.offset; 736 + ei = btrfs_item_ptr(leaf, path->slots[0], 737 + struct btrfs_file_extent_item); 738 + compression = btrfs_file_extent_compression(leaf, ei); 739 + extent_type = btrfs_file_extent_type(leaf, ei); 740 + extent_gen = btrfs_file_extent_generation(leaf, ei); 741 + 742 + if (extent_type != BTRFS_FILE_EXTENT_INLINE) { 743 + disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei); 744 + if (compression == BTRFS_COMPRESS_NONE) 745 + extent_offset = btrfs_file_extent_offset(leaf, ei); 746 + } 747 + 748 + if (compression != BTRFS_COMPRESS_NONE) 749 + flags |= FIEMAP_EXTENT_ENCODED; 750 + 751 + if (extent_type == BTRFS_FILE_EXTENT_INLINE) { 752 + flags |= FIEMAP_EXTENT_DATA_INLINE; 753 + flags |= FIEMAP_EXTENT_NOT_ALIGNED; 754 + ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 0, 755 + extent_len, flags); 756 + } else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) { 757 + ret = fiemap_process_hole(inode, fieinfo, &cache, 758 + &delalloc_cached_state, 759 + backref_ctx, 760 + disk_bytenr, extent_offset, 761 + extent_gen, key.offset, 762 + extent_end - 1); 763 + } else if (disk_bytenr == 0) { 764 + /* We have an explicit hole. */ 765 + ret = fiemap_process_hole(inode, fieinfo, &cache, 766 + &delalloc_cached_state, 767 + backref_ctx, 0, 0, 0, 768 + key.offset, extent_end - 1); 769 + } else { 770 + /* We have a regular extent. */ 771 + if (fieinfo->fi_extents_max) { 772 + ret = btrfs_is_data_extent_shared(inode, 773 + disk_bytenr, 774 + extent_gen, 775 + backref_ctx); 776 + if (ret < 0) 777 + goto out_unlock; 778 + else if (ret > 0) 779 + flags |= FIEMAP_EXTENT_SHARED; 780 + } 781 + 782 + ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 783 + disk_bytenr + extent_offset, 784 + extent_len, flags); 785 + } 786 + 787 + if (ret < 0) { 788 + goto out_unlock; 789 + } else if (ret > 0) { 790 + /* emit_fiemap_extent() told us to stop. 
*/ 791 + stopped = true; 792 + break; 793 + } 794 + 795 + prev_extent_end = extent_end; 796 + next_item: 797 + if (fatal_signal_pending(current)) { 798 + ret = -EINTR; 799 + goto out_unlock; 800 + } 801 + 802 + ret = fiemap_next_leaf_item(inode, path); 803 + if (ret < 0) { 804 + goto out_unlock; 805 + } else if (ret > 0) { 806 + /* No more file extent items for this inode. */ 807 + break; 808 + } 809 + cond_resched(); 810 + } 811 + 812 + check_eof_delalloc: 813 + if (!stopped && prev_extent_end < range_end) { 814 + ret = fiemap_process_hole(inode, fieinfo, &cache, 815 + &delalloc_cached_state, backref_ctx, 816 + 0, 0, 0, prev_extent_end, range_end - 1); 817 + if (ret < 0) 818 + goto out_unlock; 819 + prev_extent_end = range_end; 820 + } 821 + 822 + if (cache.cached && cache.offset + cache.len >= last_extent_end) { 823 + const u64 i_size = i_size_read(&inode->vfs_inode); 824 + 825 + if (prev_extent_end < i_size) { 826 + u64 delalloc_start; 827 + u64 delalloc_end; 828 + bool delalloc; 829 + 830 + delalloc = btrfs_find_delalloc_in_range(inode, 831 + prev_extent_end, 832 + i_size - 1, 833 + &delalloc_cached_state, 834 + &delalloc_start, 835 + &delalloc_end); 836 + if (!delalloc) 837 + cache.flags |= FIEMAP_EXTENT_LAST; 838 + } else { 839 + cache.flags |= FIEMAP_EXTENT_LAST; 840 + } 841 + } 842 + 843 + out_unlock: 844 + unlock_extent(&inode->io_tree, range_start, range_end, &cached_state); 845 + 846 + if (ret == BTRFS_FIEMAP_FLUSH_CACHE) { 847 + btrfs_release_path(path); 848 + ret = flush_fiemap_cache(fieinfo, &cache); 849 + if (ret) 850 + goto out; 851 + len -= cache.next_search_offset - start; 852 + start = cache.next_search_offset; 853 + goto restart; 854 + } else if (ret < 0) { 855 + goto out; 856 + } 857 + 858 + /* 859 + * Must free the path before emitting to the fiemap buffer because we 860 + * may have a non-cloned leaf and if the fiemap buffer is memory mapped 861 + * to a file, a write into it (through btrfs_page_mkwrite()) may trigger 862 + * waiting for an 
ordered extent that in order to complete needs to 863 + * modify that leaf, therefore leading to a deadlock. 864 + */ 865 + btrfs_free_path(path); 866 + path = NULL; 867 + 868 + ret = flush_fiemap_cache(fieinfo, &cache); 869 + if (ret) 870 + goto out; 871 + 872 + ret = emit_last_fiemap_cache(fieinfo, &cache); 873 + out: 874 + free_extent_state(delalloc_cached_state); 875 + kfree(cache.entries); 876 + btrfs_free_backref_share_ctx(backref_ctx); 877 + btrfs_free_path(path); 878 + return ret; 879 + } 880 + 881 + int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, 882 + u64 start, u64 len) 883 + { 884 + struct btrfs_inode *btrfs_inode = BTRFS_I(inode); 885 + int ret; 886 + 887 + ret = fiemap_prep(inode, fieinfo, start, &len, 0); 888 + if (ret) 889 + return ret; 890 + 891 + /* 892 + * fiemap_prep() called filemap_write_and_wait() for the whole possible 893 + * file range (0 to LLONG_MAX), but that is not enough if we have 894 + * compression enabled. The first filemap_fdatawrite_range() only kicks 895 + * in the compression of data (in an async thread) and will return 896 + * before the compression is done and writeback is started. A second 897 + * filemap_fdatawrite_range() is needed to wait for the compression to 898 + * complete and writeback to start. We also need to wait for ordered 899 + * extents to complete, because our fiemap implementation uses mainly 900 + * file extent items to list the extents, searching for extent maps 901 + * only for file ranges with holes or prealloc extents to figure out 902 + * if we have delalloc in those ranges. 
903 + */ 904 + if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) { 905 + ret = btrfs_wait_ordered_range(btrfs_inode, 0, LLONG_MAX); 906 + if (ret) 907 + return ret; 908 + } 909 + 910 + btrfs_inode_lock(btrfs_inode, BTRFS_ILOCK_SHARED); 911 + 912 + /* 913 + * We did an initial flush to avoid holding the inode's lock while 914 + * triggering writeback and waiting for the completion of IO and ordered 915 + * extents. Now after we locked the inode we do it again, because it's 916 + * possible a new write may have happened in between those two steps. 917 + */ 918 + if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) { 919 + ret = btrfs_wait_ordered_range(btrfs_inode, 0, LLONG_MAX); 920 + if (ret) { 921 + btrfs_inode_unlock(btrfs_inode, BTRFS_ILOCK_SHARED); 922 + return ret; 923 + } 924 + } 925 + 926 + ret = extent_fiemap(btrfs_inode, fieinfo, start, len); 927 + btrfs_inode_unlock(btrfs_inode, BTRFS_ILOCK_SHARED); 928 + 929 + return ret; 930 + }
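The way fiemap_process_hole() above reports a prealloc extent, one section per gap between delalloc ranges, may be easier to follow as a small userspace model. This is only a sketch, not kernel code: `struct range`, `struct segment`, `push_segment()` and `split_prealloc()` are hypothetical names. The delalloc ranges are passed in as a sorted array rather than looked up with btrfs_find_delalloc_in_range(), a single cursor replaces the `last_delalloc_end` bookkeeping, and the i_size cutoff and the shared-extent check are left out. End offsets are inclusive, as in the original.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for this sketch; range ends are inclusive. */
struct range { uint64_t start, end; };
struct segment { uint64_t start, len; int is_delalloc; };

/* Append one segment; returns -1 when @out is full. */
static int push_segment(struct segment *out, size_t max_out, size_t *n,
			uint64_t start, uint64_t len, int is_delalloc)
{
	if (*n >= max_out)
		return -1;
	out[*n].start = start;
	out[*n].len = len;
	out[*n].is_delalloc = is_delalloc;
	(*n)++;
	return 0;
}

/*
 * Split the prealloc extent [start, end] around the delalloc ranges in @dl,
 * which must be sorted, non-overlapping and inside [start, end].
 * Returns the number of segments stored in @out, or 0 if @out is too small.
 */
static size_t split_prealloc(uint64_t start, uint64_t end,
			     const struct range *dl, size_t nr_dl,
			     struct segment *out, size_t max_out)
{
	uint64_t cur = start;	/* first offset not reported yet */
	size_t n = 0;

	for (size_t i = 0; i < nr_dl; i++) {
		/* Prealloc section before this delalloc range, if any. */
		if (dl[i].start > cur &&
		    push_segment(out, max_out, &n, cur, dl[i].start - cur, 0))
			return 0;
		/* The delalloc range itself; end is inclusive, hence the + 1. */
		if (push_segment(out, max_out, &n, dl[i].start,
				 dl[i].end + 1 - dl[i].start, 1))
			return 0;
		cur = dl[i].end + 1;
	}

	/* Tail of the prealloc extent after the last delalloc range. */
	if (cur <= end && push_segment(out, max_out, &n, cur, end + 1 - cur, 0))
		return 0;

	return n;
}
```

For a prealloc extent covering [0, 99] with delalloc at [10, 19] and [50, 59], the model yields five segments: prealloc, delalloc, prealloc, delalloc, prealloc.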
+11
fs/btrfs/fiemap.h
···
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef BTRFS_FIEMAP_H
+#define BTRFS_FIEMAP_H
+
+#include <linux/fiemap.h>
+
+int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+		 u64 start, u64 len);
+
+#endif /* BTRFS_FIEMAP_H */
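From userspace, a filesystem's fiemap handler such as btrfs_fiemap() is reached through the FS_IOC_FIEMAP ioctl. The following is a minimal sketch of such a caller, assuming a Linux system with the UAPI headers installed. `describe_flags()` and `dump_extents()` are hypothetical helper names; the structures, the ioctl number and the FIEMAP_* constants come from `<linux/fiemap.h>` and `<linux/fs.h>`. Passing FIEMAP_FLAG_SYNC requests the flush-before-map behaviour that the comments in btrfs_fiemap() describe.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FS_IOC_FIEMAP */
#include <linux/fiemap.h>	/* struct fiemap, FIEMAP_* */

/* Turn some fe_flags bits into a short, space-separated string. */
static const char *describe_flags(unsigned int flags, char *buf, size_t size)
{
	snprintf(buf, size, "%s%s%s%s%s%s",
		 (flags & FIEMAP_EXTENT_DELALLOC) ? "delalloc " : "",
		 (flags & FIEMAP_EXTENT_UNWRITTEN) ? "unwritten " : "",
		 (flags & FIEMAP_EXTENT_DATA_INLINE) ? "inline " : "",
		 (flags & FIEMAP_EXTENT_ENCODED) ? "encoded " : "",
		 (flags & FIEMAP_EXTENT_SHARED) ? "shared " : "",
		 (flags & FIEMAP_EXTENT_LAST) ? "last " : "");
	return buf;
}

/*
 * Print up to @max extents of @path.
 * Returns the number of extents reported, or -1 on any error.
 */
static int dump_extents(const char *path, unsigned int max)
{
	struct fiemap *fm;
	char buf[64];
	int ret = -1;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* The extent array is a flexible array member at the end. */
	fm = calloc(1, sizeof(*fm) + max * sizeof(struct fiemap_extent));
	if (!fm)
		goto out_close;

	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;	/* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush dirty data first */
	fm->fm_extent_count = max;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
		goto out_free;

	for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
		const struct fiemap_extent *fe = &fm->fm_extents[i];

		printf("logical %llu physical %llu len %llu flags: %s\n",
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)fe->fe_physical,
		       (unsigned long long)fe->fe_length,
		       describe_flags(fe->fe_flags, buf, sizeof(buf)));
	}
	ret = (int)fm->fm_mapped_extents;

out_free:
	free(fm);
out_close:
	close(fd);
	return ret;
}
```

A caller could invoke `dump_extents("/path/to/file", 32)` and treat a negative return as "fiemap unavailable", since not every filesystem implements the ioctl.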
+26 -26
fs/btrfs/file-item.c
···
  */
 void btrfs_inode_safe_disk_i_size_write(struct btrfs_inode *inode, u64 new_i_size)
 {
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	u64 start, end, i_size;
 	int ret;

 	spin_lock(&inode->lock);
 	i_size = new_i_size ?: i_size_read(&inode->vfs_inode);
-	if (btrfs_fs_incompat(fs_info, NO_HOLES)) {
+	if (!inode->file_extent_tree) {
 		inode->disk_i_size = i_size;
 		goto out_unlock;
 	}
···
 int btrfs_inode_set_file_extent_range(struct btrfs_inode *inode, u64 start,
 				      u64 len)
 {
+	if (!inode->file_extent_tree)
+		return 0;
+
 	if (len == 0)
 		return 0;

 	ASSERT(IS_ALIGNED(start + len, inode->root->fs_info->sectorsize));

-	if (btrfs_fs_incompat(inode->root->fs_info, NO_HOLES))
-		return 0;
 	return set_extent_bit(inode->file_extent_tree, start, start + len - 1,
 			      EXTENT_DIRTY, NULL);
 }
···
 int btrfs_inode_clear_file_extent_range(struct btrfs_inode *inode, u64 start,
 					u64 len)
 {
+	if (!inode->file_extent_tree)
+		return 0;
+
 	if (len == 0)
 		return 0;

 	ASSERT(IS_ALIGNED(start + len, inode->root->fs_info->sectorsize) ||
 	       len == (u64)-1);

-	if (btrfs_fs_incompat(inode->root->fs_info, NO_HOLES))
-		return 0;
 	return clear_extent_bit(inode->file_extent_tree, start,
 				start + len - 1, EXTENT_DIRTY, NULL);
 }
···
 	u32 bio_offset = 0;

 	if ((inode->flags & BTRFS_INODE_NODATASUM) ||
-	    test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state))
+	    test_bit(BTRFS_FS_STATE_NO_DATA_CSUMS, &fs_info->fs_state))
 		return BLK_STS_OK;

 	/*
···
 	const int slot = path->slots[0];
 	struct btrfs_key key;
 	u64 extent_start;
-	u64 bytenr;
 	u8 type = btrfs_file_extent_type(leaf, fi);
 	int compress_type = btrfs_file_extent_compression(leaf, fi);

···
 	em->generation = btrfs_file_extent_generation(leaf, fi);
 	if (type == BTRFS_FILE_EXTENT_REG ||
 	    type == BTRFS_FILE_EXTENT_PREALLOC) {
+		const u64 disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
+
 		em->start = extent_start;
 		em->len = btrfs_file_extent_end(path) - extent_start;
-		em->orig_start = extent_start -
-			btrfs_file_extent_offset(leaf, fi);
-		em->orig_block_len = btrfs_file_extent_disk_num_bytes(leaf, fi);
-		bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
-		if (bytenr == 0) {
-			em->block_start = EXTENT_MAP_HOLE;
+		if (disk_bytenr == 0) {
+			em->disk_bytenr = EXTENT_MAP_HOLE;
+			em->disk_num_bytes = 0;
+			em->offset = 0;
 			return;
 		}
+		em->disk_bytenr = disk_bytenr;
+		em->disk_num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
+		em->offset = btrfs_file_extent_offset(leaf, fi);
 		if (compress_type != BTRFS_COMPRESS_NONE) {
 			extent_map_set_compression(em, compress_type);
-			em->block_start = bytenr;
-			em->block_len = em->orig_block_len;
 		} else {
-			bytenr += btrfs_file_extent_offset(leaf, fi);
-			em->block_start = bytenr;
-			em->block_len = em->len;
+			/*
+			 * Older kernels can create regular non-hole data
+			 * extents with ram_bytes smaller than disk_num_bytes.
+			 * Not a big deal, just always use disk_num_bytes
+			 * for ram_bytes.
+			 */
+			em->ram_bytes = em->disk_num_bytes;
 			if (type == BTRFS_FILE_EXTENT_PREALLOC)
 				em->flags |= EXTENT_FLAG_PREALLOC;
 		}
···
 		/* Tree-checker has ensured this. */
 		ASSERT(extent_start == 0);

-		em->block_start = EXTENT_MAP_INLINE;
+		em->disk_bytenr = EXTENT_MAP_INLINE;
 		em->start = 0;
 		em->len = fs_info->sectorsize;
-		/*
-		 * Initialize orig_start and block_len with the same values
-		 * as in inode.c:btrfs_get_extent().
-		 */
-		em->orig_start = EXTENT_MAP_HOLE;
-		em->block_len = (u64)-1;
+		em->offset = 0;
 		extent_map_set_compression(em, compress_type);
 	} else {
 		btrfs_err(fs_info,
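The file-item.c hunks above drop the extent map members block_start, block_len, orig_start and orig_block_len in favour of disk_bytenr, disk_num_bytes and offset. The removed lines show how the old values were computed, so they can still be derived from the new members. The sketch below models that derivation in userspace. `struct em_model`, `MODEL_HOLE` and the `model_*()` helpers are hypothetical names for illustration; they mirror the removed assignments and are not claimed to match the kernel's own helpers such as extent_map_block_start(). The sentinel value is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch-only sentinel for "no disk location"; the value is an assumption. */
#define MODEL_HOLE (UINT64_MAX - 2)

/* Hypothetical stand-in for the new extent map layout. */
struct em_model {
	uint64_t start;		 /* file offset of the mapping */
	uint64_t len;		 /* length of the mapping in the file */
	uint64_t disk_bytenr;	 /* start of the whole on-disk extent */
	uint64_t disk_num_bytes; /* size of the whole on-disk extent */
	uint64_t offset;	 /* offset of the mapping inside the extent */
	bool compressed;
};

/* Old block_start: compressed extents pointed at the extent start. */
static uint64_t model_block_start(const struct em_model *em)
{
	if (em->disk_bytenr == MODEL_HOLE)
		return MODEL_HOLE;
	if (em->compressed)
		return em->disk_bytenr;
	return em->disk_bytenr + em->offset;
}

/* Old block_len: the full extent when compressed, else the mapped length. */
static uint64_t model_block_len(const struct em_model *em)
{
	return em->compressed ? em->disk_num_bytes : em->len;
}

/* Old orig_start: file offset where the on-disk extent would begin. */
static uint64_t model_orig_start(const struct em_model *em)
{
	return em->start - em->offset;
}
```

Keeping only the three on-disk members means the old values do not have to be stored and kept in sync; they can be computed when a caller needs them.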
+34 -321
fs/btrfs/file.c
··· 17 17 #include <linux/uio.h> 18 18 #include <linux/iversion.h> 19 19 #include <linux/fsverity.h> 20 - #include <linux/iomap.h> 21 20 #include "ctree.h" 21 + #include "direct-io.h" 22 22 #include "disk-io.h" 23 23 #include "transaction.h" 24 24 #include "btrfs_inode.h" ··· 1104 1104 &cached_state); 1105 1105 } 1106 1106 ret = can_nocow_extent(&inode->vfs_inode, lockstart, &num_bytes, 1107 - NULL, NULL, NULL, nowait, false); 1107 + NULL, nowait, false); 1108 1108 if (ret <= 0) 1109 1109 btrfs_drew_write_unlock(&root->snapshot_lock); 1110 1110 else ··· 1140 1140 inode_inc_iversion(inode); 1141 1141 } 1142 1142 1143 - static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from, 1144 - size_t count) 1143 + int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from, size_t count) 1145 1144 { 1146 1145 struct file *file = iocb->ki_filp; 1147 1146 struct inode *inode = file_inode(file); ··· 1186 1187 return 0; 1187 1188 } 1188 1189 1189 - static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb, 1190 - struct iov_iter *i) 1190 + ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i) 1191 1191 { 1192 1192 struct file *file = iocb->ki_filp; 1193 1193 loff_t pos; ··· 1449 1451 return num_written ? 
num_written : ret; 1450 1452 } 1451 1453 1452 - static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info, 1453 - const struct iov_iter *iter, loff_t offset) 1454 - { 1455 - const u32 blocksize_mask = fs_info->sectorsize - 1; 1456 - 1457 - if (offset & blocksize_mask) 1458 - return -EINVAL; 1459 - 1460 - if (iov_iter_alignment(iter) & blocksize_mask) 1461 - return -EINVAL; 1462 - 1463 - return 0; 1464 - } 1465 - 1466 - static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from) 1467 - { 1468 - struct file *file = iocb->ki_filp; 1469 - struct inode *inode = file_inode(file); 1470 - struct btrfs_fs_info *fs_info = inode_to_fs_info(inode); 1471 - loff_t pos; 1472 - ssize_t written = 0; 1473 - ssize_t written_buffered; 1474 - size_t prev_left = 0; 1475 - loff_t endbyte; 1476 - ssize_t ret; 1477 - unsigned int ilock_flags = 0; 1478 - struct iomap_dio *dio; 1479 - 1480 - if (iocb->ki_flags & IOCB_NOWAIT) 1481 - ilock_flags |= BTRFS_ILOCK_TRY; 1482 - 1483 - /* 1484 - * If the write DIO is within EOF, use a shared lock and also only if 1485 - * security bits will likely not be dropped by file_remove_privs() called 1486 - * from btrfs_write_check(). Either will need to be rechecked after the 1487 - * lock was acquired. 1488 - */ 1489 - if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode) && IS_NOSEC(inode)) 1490 - ilock_flags |= BTRFS_ILOCK_SHARED; 1491 - 1492 - relock: 1493 - ret = btrfs_inode_lock(BTRFS_I(inode), ilock_flags); 1494 - if (ret < 0) 1495 - return ret; 1496 - 1497 - /* Shared lock cannot be used with security bits set. 
*/ 1498 - if ((ilock_flags & BTRFS_ILOCK_SHARED) && !IS_NOSEC(inode)) { 1499 - btrfs_inode_unlock(BTRFS_I(inode), ilock_flags); 1500 - ilock_flags &= ~BTRFS_ILOCK_SHARED; 1501 - goto relock; 1502 - } 1503 - 1504 - ret = generic_write_checks(iocb, from); 1505 - if (ret <= 0) { 1506 - btrfs_inode_unlock(BTRFS_I(inode), ilock_flags); 1507 - return ret; 1508 - } 1509 - 1510 - ret = btrfs_write_check(iocb, from, ret); 1511 - if (ret < 0) { 1512 - btrfs_inode_unlock(BTRFS_I(inode), ilock_flags); 1513 - goto out; 1514 - } 1515 - 1516 - pos = iocb->ki_pos; 1517 - /* 1518 - * Re-check since file size may have changed just before taking the 1519 - * lock or pos may have changed because of O_APPEND in generic_write_check() 1520 - */ 1521 - if ((ilock_flags & BTRFS_ILOCK_SHARED) && 1522 - pos + iov_iter_count(from) > i_size_read(inode)) { 1523 - btrfs_inode_unlock(BTRFS_I(inode), ilock_flags); 1524 - ilock_flags &= ~BTRFS_ILOCK_SHARED; 1525 - goto relock; 1526 - } 1527 - 1528 - if (check_direct_IO(fs_info, from, pos)) { 1529 - btrfs_inode_unlock(BTRFS_I(inode), ilock_flags); 1530 - goto buffered; 1531 - } 1532 - 1533 - /* 1534 - * The iov_iter can be mapped to the same file range we are writing to. 1535 - * If that's the case, then we will deadlock in the iomap code, because 1536 - * it first calls our callback btrfs_dio_iomap_begin(), which will create 1537 - * an ordered extent, and after that it will fault in the pages that the 1538 - * iov_iter refers to. During the fault in we end up in the readahead 1539 - * pages code (starting at btrfs_readahead()), which will lock the range, 1540 - * find that ordered extent and then wait for it to complete (at 1541 - * btrfs_lock_and_flush_ordered_range()), resulting in a deadlock since 1542 - * obviously the ordered extent can never complete as we didn't submit 1543 - * yet the respective bio(s). 
This always happens when the buffer is 1544 - * memory mapped to the same file range, since the iomap DIO code always 1545 - * invalidates pages in the target file range (after starting and waiting 1546 - * for any writeback). 1547 - * 1548 - * So here we disable page faults in the iov_iter and then retry if we 1549 - * got -EFAULT, faulting in the pages before the retry. 1550 - */ 1551 - from->nofault = true; 1552 - dio = btrfs_dio_write(iocb, from, written); 1553 - from->nofault = false; 1554 - 1555 - /* 1556 - * iomap_dio_complete() will call btrfs_sync_file() if we have a dsync 1557 - * iocb, and that needs to lock the inode. So unlock it before calling 1558 - * iomap_dio_complete() to avoid a deadlock. 1559 - */ 1560 - btrfs_inode_unlock(BTRFS_I(inode), ilock_flags); 1561 - 1562 - if (IS_ERR_OR_NULL(dio)) 1563 - ret = PTR_ERR_OR_ZERO(dio); 1564 - else 1565 - ret = iomap_dio_complete(dio); 1566 - 1567 - /* No increment (+=) because iomap returns a cumulative value. */ 1568 - if (ret > 0) 1569 - written = ret; 1570 - 1571 - if (iov_iter_count(from) > 0 && (ret == -EFAULT || ret > 0)) { 1572 - const size_t left = iov_iter_count(from); 1573 - /* 1574 - * We have more data left to write. Try to fault in as many as 1575 - * possible of the remainder pages and retry. We do this without 1576 - * releasing and locking again the inode, to prevent races with 1577 - * truncate. 1578 - * 1579 - * Also, in case the iov refers to pages in the file range of the 1580 - * file we want to write to (due to a mmap), we could enter an 1581 - * infinite loop if we retry after faulting the pages in, since 1582 - * iomap will invalidate any pages in the range early on, before 1583 - * it tries to fault in the pages of the iov. So we keep track of 1584 - * how much was left of iov in the previous EFAULT and fallback 1585 - * to buffered IO in case we haven't made any progress. 
1586 - */ 1587 - if (left == prev_left) { 1588 - ret = -ENOTBLK; 1589 - } else { 1590 - fault_in_iov_iter_readable(from, left); 1591 - prev_left = left; 1592 - goto relock; 1593 - } 1594 - } 1595 - 1596 - /* 1597 - * If 'ret' is -ENOTBLK or we have not written all data, then it means 1598 - * we must fallback to buffered IO. 1599 - */ 1600 - if ((ret < 0 && ret != -ENOTBLK) || !iov_iter_count(from)) 1601 - goto out; 1602 - 1603 - buffered: 1604 - /* 1605 - * If we are in a NOWAIT context, then return -EAGAIN to signal the caller 1606 - * it must retry the operation in a context where blocking is acceptable, 1607 - * because even if we end up not blocking during the buffered IO attempt 1608 - * below, we will block when flushing and waiting for the IO. 1609 - */ 1610 - if (iocb->ki_flags & IOCB_NOWAIT) { 1611 - ret = -EAGAIN; 1612 - goto out; 1613 - } 1614 - 1615 - pos = iocb->ki_pos; 1616 - written_buffered = btrfs_buffered_write(iocb, from); 1617 - if (written_buffered < 0) { 1618 - ret = written_buffered; 1619 - goto out; 1620 - } 1621 - /* 1622 - * Ensure all data is persisted. We want the next direct IO read to be 1623 - * able to read what was just written. 1624 - */ 1625 - endbyte = pos + written_buffered - 1; 1626 - ret = btrfs_fdatawrite_range(inode, pos, endbyte); 1627 - if (ret) 1628 - goto out; 1629 - ret = filemap_fdatawait_range(inode->i_mapping, pos, endbyte); 1630 - if (ret) 1631 - goto out; 1632 - written += written_buffered; 1633 - iocb->ki_pos = pos + written_buffered; 1634 - invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT, 1635 - endbyte >> PAGE_SHIFT); 1636 - out: 1637 - return ret < 0 ? 
ret : written; 1638 - } 1639 - 1640 1454 static ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from, 1641 1455 const struct btrfs_ioctl_encoded_io_args *encoded) 1642 1456 { ··· 1548 1738 return 0; 1549 1739 } 1550 1740 1551 - static int start_ordered_ops(struct inode *inode, loff_t start, loff_t end) 1741 + static int start_ordered_ops(struct btrfs_inode *inode, loff_t start, loff_t end) 1552 1742 { 1553 1743 int ret; 1554 1744 struct blk_plug plug; ··· 1568 1758 1569 1759 static inline bool skip_inode_logging(const struct btrfs_log_ctx *ctx) 1570 1760 { 1571 - struct btrfs_inode *inode = BTRFS_I(ctx->inode); 1761 + struct btrfs_inode *inode = ctx->inode; 1572 1762 struct btrfs_fs_info *fs_info = inode->root->fs_info; 1573 1763 1574 1764 if (btrfs_inode_in_log(inode, btrfs_get_fs_generation(fs_info)) && ··· 1604 1794 int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) 1605 1795 { 1606 1796 struct dentry *dentry = file_dentry(file); 1607 - struct inode *inode = d_inode(dentry); 1608 - struct btrfs_fs_info *fs_info = inode_to_fs_info(inode); 1609 - struct btrfs_root *root = BTRFS_I(inode)->root; 1797 + struct btrfs_inode *inode = BTRFS_I(d_inode(dentry)); 1798 + struct btrfs_root *root = inode->root; 1799 + struct btrfs_fs_info *fs_info = root->fs_info; 1610 1800 struct btrfs_trans_handle *trans; 1611 1801 struct btrfs_log_ctx ctx; 1612 1802 int ret = 0, err; ··· 1639 1829 if (ret) 1640 1830 goto out; 1641 1831 1642 - btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_MMAP); 1832 + btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP); 1643 1833 1644 1834 atomic_inc(&root->log_batch); 1645 1835 ··· 1663 1853 */ 1664 1854 ret = start_ordered_ops(inode, start, end); 1665 1855 if (ret) { 1666 - btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP); 1856 + btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP); 1667 1857 goto out; 1668 1858 } 1669 1859 ··· 1675 1865 * running delalloc the full sync flag may be set if we need to drop 1676 1866 * extra 
extent map ranges due to temporary memory allocation failures. 1677 1867 */ 1678 - full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, 1679 - &BTRFS_I(inode)->runtime_flags); 1868 + full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags); 1680 1869 1681 1870 /* 1682 1871 * We have to do this here to avoid the priority inversion of waiting on ··· 1694 1885 */ 1695 1886 if (full_sync || btrfs_is_zoned(fs_info)) { 1696 1887 ret = btrfs_wait_ordered_range(inode, start, len); 1697 - clear_bit(BTRFS_INODE_COW_WRITE_ERROR, &BTRFS_I(inode)->runtime_flags); 1888 + clear_bit(BTRFS_INODE_COW_WRITE_ERROR, &inode->runtime_flags); 1698 1889 } else { 1699 1890 /* 1700 1891 * Get our ordered extents as soon as possible to avoid doing 1701 1892 * checksum lookups in the csum tree, and use instead the 1702 1893 * checksums attached to the ordered extents. 1703 1894 */ 1704 - btrfs_get_ordered_extents_for_logging(BTRFS_I(inode), 1705 - &ctx.ordered_extents); 1706 - ret = filemap_fdatawait_range(inode->i_mapping, start, end); 1895 + btrfs_get_ordered_extents_for_logging(inode, &ctx.ordered_extents); 1896 + ret = filemap_fdatawait_range(inode->vfs_inode.i_mapping, start, end); 1707 1897 if (ret) 1708 1898 goto out_release_extents; 1709 1899 ··· 1715 1907 * extents to complete so that any extent maps that point to 1716 1908 * unwritten locations are dropped and we don't log them. 1717 1909 */ 1718 - if (test_and_clear_bit(BTRFS_INODE_COW_WRITE_ERROR, 1719 - &BTRFS_I(inode)->runtime_flags)) 1910 + if (test_and_clear_bit(BTRFS_INODE_COW_WRITE_ERROR, &inode->runtime_flags)) 1720 1911 ret = btrfs_wait_ordered_range(inode, start, len); 1721 1912 } 1722 1913 ··· 1730 1923 * modified so clear this flag in case it was set for whatever 1731 1924 * reason, it's no longer relevant. 
1732 1925 */ 1733 - clear_bit(BTRFS_INODE_NEEDS_FULL_SYNC, 1734 - &BTRFS_I(inode)->runtime_flags); 1926 + clear_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags); 1735 1927 /* 1736 1928 * An ordered extent might have started before and completed 1737 1929 * already with io errors, in which case the inode was not ··· 1738 1932 * for any errors that might have happened since we last 1739 1933 * checked called fsync. 1740 1934 */ 1741 - ret = filemap_check_wb_err(inode->i_mapping, file->f_wb_err); 1935 + ret = filemap_check_wb_err(inode->vfs_inode.i_mapping, file->f_wb_err); 1742 1936 goto out_release_extents; 1743 1937 } 1744 1938 ··· 1788 1982 * file again, but that will end up using the synchronization 1789 1983 * inside btrfs_sync_log to keep things safe. 1790 1984 */ 1791 - btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP); 1985 + btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP); 1792 1986 1793 1987 if (ret == BTRFS_NO_LOG_SYNC) { 1794 1988 ret = btrfs_end_transaction(trans); ··· 1857 2051 1858 2052 out_release_extents: 1859 2053 btrfs_release_log_ctx_extents(&ctx); 1860 - btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP); 2054 + btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP); 1861 2055 goto out; 1862 2056 } 1863 2057 ··· 2156 2350 hole_em->start = offset; 2157 2351 hole_em->len = end - offset; 2158 2352 hole_em->ram_bytes = hole_em->len; 2159 - hole_em->orig_start = offset; 2160 2353 2161 - hole_em->block_start = EXTENT_MAP_HOLE; 2162 - hole_em->block_len = 0; 2163 - hole_em->orig_block_len = 0; 2354 + hole_em->disk_bytenr = EXTENT_MAP_HOLE; 2355 + hole_em->disk_num_bytes = 0; 2164 2356 hole_em->generation = trans->transid; 2165 2357 2166 2358 ret = btrfs_replace_extent_map_range(inode, hole_em, true); ··· 2189 2385 return PTR_ERR(em); 2190 2386 2191 2387 /* Hole or vacuum extent(only exists in no-hole mode) */ 2192 - if (em->block_start == EXTENT_MAP_HOLE) { 2388 + if (em->disk_bytenr == EXTENT_MAP_HOLE) { 2193 2389 ret = 1; 2194 2390 *len = em->start + 
em->len > *start + *len ? 2195 2391 0 : *start + *len - em->start - em->len; ··· 2618 2814 2619 2815 btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_MMAP); 2620 2816 2621 - ret = btrfs_wait_ordered_range(inode, offset, len); 2817 + ret = btrfs_wait_ordered_range(BTRFS_I(inode), offset, len); 2622 2818 if (ret) 2623 2819 goto out_only_mutex; 2624 2820 ··· 2846 3042 if (IS_ERR(em)) 2847 3043 return PTR_ERR(em); 2848 3044 2849 - if (em->block_start == EXTENT_MAP_HOLE) 3045 + if (em->disk_bytenr == EXTENT_MAP_HOLE) 2850 3046 ret = RANGE_BOUNDARY_HOLE; 2851 3047 else if (em->flags & EXTENT_FLAG_PREALLOC) 2852 3048 ret = RANGE_BOUNDARY_PREALLOC_EXTENT; ··· 2910 3106 ASSERT(IS_ALIGNED(alloc_start, sectorsize)); 2911 3107 len = offset + len - alloc_start; 2912 3108 offset = alloc_start; 2913 - alloc_hint = em->block_start + em->len; 3109 + alloc_hint = extent_map_block_start(em) + em->len; 2914 3110 } 2915 3111 free_extent_map(em); 2916 3112 ··· 2928 3124 mode); 2929 3125 goto out; 2930 3126 } 2931 - if (len < sectorsize && em->block_start != EXTENT_MAP_HOLE) { 3127 + if (len < sectorsize && em->disk_bytenr != EXTENT_MAP_HOLE) { 2932 3128 free_extent_map(em); 2933 3129 ret = btrfs_truncate_block(BTRFS_I(inode), offset, len, 2934 3130 0); ··· 3113 3309 * the file range and, due to the previous locking we did, we know there 3114 3310 * can't be more delalloc or ordered extents in the range. 
3115 3311 */ 3116 - ret = btrfs_wait_ordered_range(inode, alloc_start, 3312 + ret = btrfs_wait_ordered_range(BTRFS_I(inode), alloc_start, 3117 3313 alloc_end - alloc_start); 3118 3314 if (ret) 3119 3315 goto out; ··· 3141 3337 last_byte = min(extent_map_end(em), alloc_end); 3142 3338 actual_end = min_t(u64, extent_map_end(em), offset + len); 3143 3339 last_byte = ALIGN(last_byte, blocksize); 3144 - if (em->block_start == EXTENT_MAP_HOLE || 3340 + if (em->disk_bytenr == EXTENT_MAP_HOLE || 3145 3341 (cur_offset >= inode->i_size && 3146 3342 !(em->flags & EXTENT_FLAG_PREALLOC))) { 3147 3343 const u64 range_len = last_byte - cur_offset; ··· 3724 3920 return generic_file_open(inode, filp); 3725 3921 } 3726 3922 3727 - static int check_direct_read(struct btrfs_fs_info *fs_info, 3728 - const struct iov_iter *iter, loff_t offset) 3729 - { 3730 - int ret; 3731 - int i, seg; 3732 - 3733 - ret = check_direct_IO(fs_info, iter, offset); 3734 - if (ret < 0) 3735 - return ret; 3736 - 3737 - if (!iter_is_iovec(iter)) 3738 - return 0; 3739 - 3740 - for (seg = 0; seg < iter->nr_segs; seg++) { 3741 - for (i = seg + 1; i < iter->nr_segs; i++) { 3742 - const struct iovec *iov1 = iter_iov(iter) + seg; 3743 - const struct iovec *iov2 = iter_iov(iter) + i; 3744 - 3745 - if (iov1->iov_base == iov2->iov_base) 3746 - return -EINVAL; 3747 - } 3748 - } 3749 - return 0; 3750 - } 3751 - 3752 - static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to) 3753 - { 3754 - struct inode *inode = file_inode(iocb->ki_filp); 3755 - size_t prev_left = 0; 3756 - ssize_t read = 0; 3757 - ssize_t ret; 3758 - 3759 - if (fsverity_active(inode)) 3760 - return 0; 3761 - 3762 - if (check_direct_read(inode_to_fs_info(inode), to, iocb->ki_pos)) 3763 - return 0; 3764 - 3765 - btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_SHARED); 3766 - again: 3767 - /* 3768 - * This is similar to what we do for direct IO writes, see the comment 3769 - * at btrfs_direct_write(), but we also disable page faults in 
addition 3770 - * to disabling them only at the iov_iter level. This is because when 3771 - * reading from a hole or prealloc extent, iomap calls iov_iter_zero(), 3772 - * which can still trigger page fault ins despite having set ->nofault 3773 - * to true of our 'to' iov_iter. 3774 - * 3775 - * The difference to direct IO writes is that we deadlock when trying 3776 - * to lock the extent range in the inode's tree during he page reads 3777 - * triggered by the fault in (while for writes it is due to waiting for 3778 - * our own ordered extent). This is because for direct IO reads, 3779 - * btrfs_dio_iomap_begin() returns with the extent range locked, which 3780 - * is only unlocked in the endio callback (end_bio_extent_readpage()). 3781 - */ 3782 - pagefault_disable(); 3783 - to->nofault = true; 3784 - ret = btrfs_dio_read(iocb, to, read); 3785 - to->nofault = false; 3786 - pagefault_enable(); 3787 - 3788 - /* No increment (+=) because iomap returns a cumulative value. */ 3789 - if (ret > 0) 3790 - read = ret; 3791 - 3792 - if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) { 3793 - const size_t left = iov_iter_count(to); 3794 - 3795 - if (left == prev_left) { 3796 - /* 3797 - * We didn't make any progress since the last attempt, 3798 - * fallback to a buffered read for the remainder of the 3799 - * range. This is just to avoid any possibility of looping 3800 - * for too long. 3801 - */ 3802 - ret = read; 3803 - } else { 3804 - /* 3805 - * We made some progress since the last retry or this is 3806 - * the first time we are retrying. Fault in as many pages 3807 - * as possible and retry. 3808 - */ 3809 - fault_in_iov_iter_writeable(to, left); 3810 - prev_left = left; 3811 - goto again; 3812 - } 3813 - } 3814 - btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_SHARED); 3815 - return ret < 0 ? 
ret : read; 3816 - } 3817 - 3818 3923 static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to) 3819 3924 { 3820 3925 ssize_t ret = 0; ··· 3758 4045 .fop_flags = FOP_BUFFER_RASYNC | FOP_BUFFER_WASYNC, 3759 4046 }; 3760 4047 3761 - int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end) 4048 + int btrfs_fdatawrite_range(struct btrfs_inode *inode, loff_t start, loff_t end) 3762 4049 { 4050 + struct address_space *mapping = inode->vfs_inode.i_mapping; 3763 4051 int ret; 3764 4052 3765 4053 /* ··· 3777 4063 * know better and pull this out at some point in the future, it is 3778 4064 * right and you are wrong. 3779 4065 */ 3780 - ret = filemap_fdatawrite_range(inode->i_mapping, start, end); 3781 - if (!ret && test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, 3782 - &BTRFS_I(inode)->runtime_flags)) 3783 - ret = filemap_fdatawrite_range(inode->i_mapping, start, end); 4066 + ret = filemap_fdatawrite_range(mapping, start, end); 4067 + if (!ret && test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, &inode->runtime_flags)) 4068 + ret = filemap_fdatawrite_range(mapping, start, end); 3784 4069 3785 4070 return ret; 3786 4071 }
+3 -1
fs/btrfs/file.h
···
 int btrfs_dirty_pages(struct btrfs_inode *inode, struct page **pages,
 		      size_t num_pages, loff_t pos, size_t write_bytes,
 		      struct extent_state **cached, bool noreserve);
-int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
+int btrfs_fdatawrite_range(struct btrfs_inode *inode, loff_t start, loff_t end);
 int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
 			   size_t *write_bytes, bool nowait);
 void btrfs_check_nocow_unlock(struct btrfs_inode *inode);
 bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
 				  struct extent_state **cached_state,
 				  u64 *delalloc_start_ret, u64 *delalloc_end_ret);
+int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from, size_t count);
+ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i);
 
 #endif
+6 -6
fs/btrfs/free-space-cache.c
···
 					       struct btrfs_path *path,
 					       u64 offset)
 {
-	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_key key;
 	struct btrfs_key location;
 	struct btrfs_disk_key disk_key;
···
 	 * sure NOFS is set to keep us from deadlocking.
 	 */
 	nofs_flag = memalloc_nofs_save();
-	inode = btrfs_iget_path(fs_info->sb, location.objectid, root, path);
+	inode = btrfs_iget_path(location.objectid, root, path);
 	btrfs_release_path(path);
 	memalloc_nofs_restore(nofs_flag);
 	if (IS_ERR(inode))
···
 
 	spin_lock(&block_group->lock);
 	if (block_group->inode)
-		inode = igrab(block_group->inode);
+		inode = igrab(&block_group->inode->vfs_inode);
 	spin_unlock(&block_group->lock);
 	if (inode)
 		return inode;
···
 	}
 
 	if (!test_and_set_bit(BLOCK_GROUP_FLAG_IREF, &block_group->runtime_flags))
-		block_group->inode = igrab(inode);
+		block_group->inode = BTRFS_I(igrab(inode));
 	spin_unlock(&block_group->lock);
 
 	return inode;
···
 				spin_unlock(&ctl->tree_lock);
 				btrfs_err(fs_info,
 					"Duplicate entries in free space cache, dumping");
+				kmem_cache_free(btrfs_free_space_bitmap_cachep, e->bitmap);
 				kmem_cache_free(btrfs_free_space_cachep, e);
 				goto free_cache;
 			}
···
 {
 	int ret;
 
-	ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
+	ret = btrfs_wait_ordered_range(BTRFS_I(inode), 0, (u64)-1);
 	if (ret)
 		clear_extent_bit(&BTRFS_I(inode)->io_tree, 0, inode->i_size - 1,
 				 EXTENT_DELALLOC, NULL);
···
 	io_ctl->entries = entries;
 	io_ctl->bitmaps = bitmaps;
 
-	ret = btrfs_fdatawrite_range(inode, 0, (u64)-1);
+	ret = btrfs_fdatawrite_range(BTRFS_I(inode), 0, (u64)-1);
 	if (ret)
 		goto out;
 
+7 -3
fs/btrfs/free-space-tree.c
···
 	btrfs_tree_lock(free_space_root->node);
 	btrfs_clear_buffer_dirty(trans, free_space_root->node);
 	btrfs_tree_unlock(free_space_root->node);
-	btrfs_free_tree_block(trans, btrfs_root_id(free_space_root),
-			      free_space_root->node, 0, 1);
-
+	ret = btrfs_free_tree_block(trans, btrfs_root_id(free_space_root),
+				    free_space_root->node, 0, 1);
 	btrfs_put_root(free_space_root);
+	if (ret < 0) {
+		btrfs_abort_transaction(trans, ret);
+		btrfs_end_transaction(trans);
+		return ret;
+	}
 
 	return btrfs_commit_transaction(trans);
 }
+10 -7
fs/btrfs/fs.h
···
 #include "extent-io-tree.h"
 #include "async-thread.h"
 #include "block-rsv.h"
-#include "fs.h"
 
 struct inode;
 struct super_block;
···
 	/* The btrfs_fs_info created for self-tests */
 	BTRFS_FS_STATE_DUMMY_FS_INFO,
 
-	BTRFS_FS_STATE_NO_CSUMS,
+	/* Checksum errors are ignored. */
+	BTRFS_FS_STATE_NO_DATA_CSUMS,
+	BTRFS_FS_STATE_SKIP_META_CSUMS,
 
 	/* Indicates there was an error cleaning up a log tree. */
 	BTRFS_FS_STATE_LOG_CLEANUP_ERROR,
···
 	BTRFS_MOUNT_IGNOREDATACSUMS		= (1UL << 28),
 	BTRFS_MOUNT_NODISCARD			= (1UL << 29),
 	BTRFS_MOUNT_NOSPACECACHE		= (1UL << 30),
+	BTRFS_MOUNT_IGNOREMETACSUMS		= (1UL << 31),
+	BTRFS_MOUNT_IGNORESUPERFLAGS		= (1ULL << 32),
 };
 
 /*
···
 /*
  * Count how many fs_info->max_extent_size cover the @size
  */
-static inline u32 count_max_extents(struct btrfs_fs_info *fs_info, u64 size)
+static inline u32 count_max_extents(const struct btrfs_fs_info *fs_info, u64 size)
 {
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 	if (!fs_info)
···
 #define btrfs_test_opt(fs_info, opt)	((fs_info)->mount_opt & \
 					 BTRFS_MOUNT_##opt)
 
-static inline int btrfs_fs_closing(struct btrfs_fs_info *fs_info)
+static inline int btrfs_fs_closing(const struct btrfs_fs_info *fs_info)
 {
 	/* Do it this way so we only ever do one test_bit in the normal case. */
 	if (test_bit(BTRFS_FS_CLOSING_START, &fs_info->flags)) {
···
  * since setting and checking for SB_RDONLY in the superblock's flags is not
  * atomic.
  */
-static inline int btrfs_need_cleaner_sleep(struct btrfs_fs_info *fs_info)
+static inline int btrfs_need_cleaner_sleep(const struct btrfs_fs_info *fs_info)
 {
 	return test_bit(BTRFS_FS_STATE_RO, &fs_info->fs_state) ||
 		btrfs_fs_closing(fs_info);
···
 
 #define EXPORT_FOR_TESTS
 
-static inline int btrfs_is_testing(struct btrfs_fs_info *fs_info)
+static inline int btrfs_is_testing(const struct btrfs_fs_info *fs_info)
 {
 	return test_bit(BTRFS_FS_STATE_DUMMY_FS_INFO, &fs_info->fs_state);
 }
···
 
 #define EXPORT_FOR_TESTS static
 
-static inline int btrfs_is_testing(struct btrfs_fs_info *fs_info)
+static inline int btrfs_is_testing(const struct btrfs_fs_info *fs_info)
 {
 	return 0;
 }
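Note on the new option bits in fs/btrfs/fs.h: BTRFS_MOUNT_IGNOREMETACSUMS still follows the `1UL << 31` pattern, while BTRFS_MOUNT_IGNORESUPERFLAGS is the first option at bit 32 and is written as `1ULL << 32`. On targets where unsigned long is 32 bits wide, `1UL << 32` would be undefined behaviour, and whatever word stores the options needs at least 64 bits for that flag to survive. The userspace sketch below only illustrates that point. The OPT_* macros and helper names are hypothetical stand-ins, not kernel identifiers, and nothing here is taken from the btrfs implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two new mount option bits. */
#define OPT_IGNOREMETACSUMS   (UINT64_C(1) << 31)
#define OPT_IGNORESUPERFLAGS  (UINT64_C(1) << 32)

/* Test an option bit in a 64-bit option word. */
static int opt_is_set(uint64_t opts, uint64_t opt)
{
	return (opts & opt) != 0;
}

/* Model an option word that is only 32 bits wide: bit 32 and up are lost. */
static uint32_t narrow_to_u32(uint64_t opts)
{
	return (uint32_t)opts;
}

/* Self-check: both bits fit in 64 bits, only bit 31 fits in 32 bits. */
static void check_option_bits(void)
{
	uint64_t opts = OPT_IGNOREMETACSUMS | OPT_IGNORESUPERFLAGS;

	assert(opt_is_set(opts, OPT_IGNOREMETACSUMS));
	assert(opt_is_set(opts, OPT_IGNORESUPERFLAGS));
	assert(opt_is_set(narrow_to_u32(opts), OPT_IGNOREMETACSUMS));
	assert(!opt_is_set(narrow_to_u32(opts), OPT_IGNORESUPERFLAGS));
}
```

Calling check_option_bits() from a test program exercises both cases. The design point is simply that any flag at bit 32 or above needs a 64-bit constant and a 64-bit storage type on every architecture.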
+2 -2
fs/btrfs/inode-item.c
···
 	extref = btrfs_find_name_in_ext_backref(path->nodes[0], path->slots[0],
 						ref_objectid, name);
 	if (!extref) {
-		btrfs_handle_fs_error(root->fs_info, -ENOENT, NULL);
-		ret = -EROFS;
+		btrfs_abort_transaction(trans, -ENOENT);
+		ret = -ENOENT;
 		goto out;
 	}
 
+324 -1151
fs/btrfs/inode.c
··· 70 70 #include "orphan.h" 71 71 #include "backref.h" 72 72 #include "raid-stripe-tree.h" 73 + #include "fiemap.h" 73 74 74 75 struct btrfs_iget_args { 75 76 u64 ino; 76 77 struct btrfs_root *root; 77 78 }; 78 - 79 - struct btrfs_dio_data { 80 - ssize_t submitted; 81 - struct extent_changeset *data_reserved; 82 - struct btrfs_ordered_extent *ordered; 83 - bool data_space_reserved; 84 - bool nocow_done; 85 - }; 86 - 87 - struct btrfs_dio_private { 88 - /* Range of I/O */ 89 - u64 file_offset; 90 - u32 bytes; 91 - 92 - /* This must be last */ 93 - struct btrfs_bio bbio; 94 - }; 95 - 96 - static struct bio_set btrfs_dio_bioset; 97 79 98 80 struct btrfs_rename_ctx { 99 81 /* Output field. Stores the index number of the old directory entry. */ ··· 119 137 struct page *locked_page, u64 start, 120 138 u64 end, struct writeback_control *wbc, 121 139 bool pages_dirty); 122 - static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start, 123 - u64 len, u64 orig_start, u64 block_start, 124 - u64 block_len, u64 orig_block_len, 125 - u64 ram_bytes, int compress_type, 126 - int type); 127 140 128 141 static int data_reloc_print_warning_inode(u64 inum, u64 offset, u64 num_bytes, 129 142 u64 root, void *warn_ctx) ··· 854 877 if (btrfs_test_opt(fs_info, COMPRESS) || 855 878 inode->flags & BTRFS_INODE_COMPRESS || 856 879 inode->prop_compress) 857 - return btrfs_compress_heuristic(&inode->vfs_inode, start, end); 880 + return btrfs_compress_heuristic(inode, start, end); 858 881 return 0; 859 882 } 860 883 ··· 865 888 if (num_bytes < small_write && 866 889 (start > 0 || end + 1 < inode->disk_i_size)) 867 890 btrfs_add_inode_defrag(NULL, inode, small_write); 891 + } 892 + 893 + static int extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end) 894 + { 895 + unsigned long end_index = end >> PAGE_SHIFT; 896 + struct page *page; 897 + int ret = 0; 898 + 899 + for (unsigned long index = start >> PAGE_SHIFT; 900 + index <= end_index; index++) { 901 + page = 
find_get_page(inode->i_mapping, index); 902 + if (unlikely(!page)) { 903 + if (!ret) 904 + ret = -ENOENT; 905 + continue; 906 + } 907 + clear_page_dirty_for_io(page); 908 + put_page(page); 909 + } 910 + return ret; 868 911 } 869 912 870 913 /* ··· 928 931 * Otherwise applications with the file mmap'd can wander in and change 929 932 * the page contents while we are compressing them. 930 933 */ 931 - extent_range_clear_dirty_for_io(&inode->vfs_inode, start, end); 934 + ret = extent_range_clear_dirty_for_io(&inode->vfs_inode, start, end); 935 + 936 + /* 937 + * All the folios should have been locked thus no failure. 938 + * 939 + * And even if some folios are missing, btrfs_compress_folios() 940 + * would handle them correctly, so here just do an ASSERT() check for 941 + * early logic errors. 942 + */ 943 + ASSERT(ret == 0); 932 944 933 945 /* 934 946 * We need to save i_size before now because it could change in between ··· 1158 1152 struct btrfs_root *root = inode->root; 1159 1153 struct btrfs_fs_info *fs_info = root->fs_info; 1160 1154 struct btrfs_ordered_extent *ordered; 1155 + struct btrfs_file_extent file_extent; 1161 1156 struct btrfs_key ins; 1162 1157 struct page *locked_page = NULL; 1163 1158 struct extent_state *cached = NULL; ··· 1205 1198 lock_extent(io_tree, start, end, &cached); 1206 1199 1207 1200 /* Here we're doing allocation and writeback of the compressed pages */ 1208 - em = create_io_em(inode, start, 1209 - async_extent->ram_size, /* len */ 1210 - start, /* orig_start */ 1211 - ins.objectid, /* block_start */ 1212 - ins.offset, /* block_len */ 1213 - ins.offset, /* orig_block_len */ 1214 - async_extent->ram_size, /* ram_bytes */ 1215 - async_extent->compress_type, 1216 - BTRFS_ORDERED_COMPRESSED); 1201 + file_extent.disk_bytenr = ins.objectid; 1202 + file_extent.disk_num_bytes = ins.offset; 1203 + file_extent.ram_bytes = async_extent->ram_size; 1204 + file_extent.num_bytes = async_extent->ram_size; 1205 + file_extent.offset = 0; 1206 + 
file_extent.compression = async_extent->compress_type; 1207 + 1208 + em = btrfs_create_io_em(inode, start, &file_extent, BTRFS_ORDERED_COMPRESSED); 1217 1209 if (IS_ERR(em)) { 1218 1210 ret = PTR_ERR(em); 1219 1211 goto out_free_reserve; 1220 1212 } 1221 1213 free_extent_map(em); 1222 1214 1223 - ordered = btrfs_alloc_ordered_extent(inode, start, /* file_offset */ 1224 - async_extent->ram_size, /* num_bytes */ 1225 - async_extent->ram_size, /* ram_bytes */ 1226 - ins.objectid, /* disk_bytenr */ 1227 - ins.offset, /* disk_num_bytes */ 1228 - 0, /* offset */ 1229 - 1 << BTRFS_ORDERED_COMPRESSED, 1230 - async_extent->compress_type); 1215 + ordered = btrfs_alloc_ordered_extent(inode, start, &file_extent, 1216 + 1 << BTRFS_ORDERED_COMPRESSED); 1231 1217 if (IS_ERR(ordered)) { 1232 1218 btrfs_drop_extent_map_range(inode, start, end, false); 1233 1219 ret = PTR_ERR(ordered); ··· 1264 1264 kfree(async_extent); 1265 1265 } 1266 1266 1267 - static u64 get_extent_allocation_hint(struct btrfs_inode *inode, u64 start, 1268 - u64 num_bytes) 1267 + u64 btrfs_get_extent_allocation_hint(struct btrfs_inode *inode, u64 start, 1268 + u64 num_bytes) 1269 1269 { 1270 1270 struct extent_map_tree *em_tree = &inode->extent_tree; 1271 1271 struct extent_map *em; ··· 1279 1279 * first block in this inode and use that as a hint. If that 1280 1280 * block is also bogus then just don't worry about it. 
1281 1281 */ 1282 - if (em->block_start >= EXTENT_MAP_LAST_BYTE) { 1282 + if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) { 1283 1283 free_extent_map(em); 1284 1284 em = search_extent_mapping(em_tree, 0, 0); 1285 - if (em && em->block_start < EXTENT_MAP_LAST_BYTE) 1286 - alloc_hint = em->block_start; 1285 + if (em && em->disk_bytenr < EXTENT_MAP_LAST_BYTE) 1286 + alloc_hint = extent_map_block_start(em); 1287 1287 if (em) 1288 1288 free_extent_map(em); 1289 1289 } else { 1290 - alloc_hint = em->block_start; 1290 + alloc_hint = extent_map_block_start(em); 1291 1291 free_extent_map(em); 1292 1292 } 1293 1293 } ··· 1375 1375 } 1376 1376 } 1377 1377 1378 - alloc_hint = get_extent_allocation_hint(inode, start, num_bytes); 1378 + alloc_hint = btrfs_get_extent_allocation_hint(inode, start, num_bytes); 1379 1379 1380 1380 /* 1381 1381 * Relocation relies on the relocated extents to have exactly the same ··· 1395 1395 1396 1396 while (num_bytes > 0) { 1397 1397 struct btrfs_ordered_extent *ordered; 1398 + struct btrfs_file_extent file_extent; 1398 1399 1399 1400 cur_alloc_size = num_bytes; 1400 1401 ret = btrfs_reserve_extent(root, cur_alloc_size, cur_alloc_size, ··· 1432 1431 extent_reserved = true; 1433 1432 1434 1433 ram_size = ins.offset; 1434 + file_extent.disk_bytenr = ins.objectid; 1435 + file_extent.disk_num_bytes = ins.offset; 1436 + file_extent.num_bytes = ins.offset; 1437 + file_extent.ram_bytes = ins.offset; 1438 + file_extent.offset = 0; 1439 + file_extent.compression = BTRFS_COMPRESS_NONE; 1435 1440 1436 1441 lock_extent(&inode->io_tree, start, start + ram_size - 1, 1437 1442 &cached); 1438 1443 1439 - em = create_io_em(inode, start, ins.offset, /* len */ 1440 - start, /* orig_start */ 1441 - ins.objectid, /* block_start */ 1442 - ins.offset, /* block_len */ 1443 - ins.offset, /* orig_block_len */ 1444 - ram_size, /* ram_bytes */ 1445 - BTRFS_COMPRESS_NONE, /* compress_type */ 1446 - BTRFS_ORDERED_REGULAR /* type */); 1444 + em = btrfs_create_io_em(inode, start, 
&file_extent, 1445 + BTRFS_ORDERED_REGULAR); 1447 1446 if (IS_ERR(em)) { 1448 1447 unlock_extent(&inode->io_tree, start, 1449 1448 start + ram_size - 1, &cached); ··· 1452 1451 } 1453 1452 free_extent_map(em); 1454 1453 1455 - ordered = btrfs_alloc_ordered_extent(inode, start, ram_size, 1456 - ram_size, ins.objectid, cur_alloc_size, 1457 - 0, 1 << BTRFS_ORDERED_REGULAR, 1458 - BTRFS_COMPRESS_NONE); 1454 + ordered = btrfs_alloc_ordered_extent(inode, start, &file_extent, 1455 + 1 << BTRFS_ORDERED_REGULAR); 1459 1456 if (IS_ERR(ordered)) { 1460 1457 unlock_extent(&inode->io_tree, start, 1461 1458 start + ram_size - 1, &cached); ··· 1616 1617 u64 alloc_hint = 0; 1617 1618 1618 1619 if (do_free) { 1619 - struct async_chunk *async_chunk; 1620 1620 struct async_cow *async_cow; 1621 1621 1622 - async_chunk = container_of(work, struct async_chunk, work); 1623 1622 btrfs_add_delayed_iput(async_chunk->inode); 1624 1623 if (async_chunk->blkcg_css) 1625 1624 css_put(async_chunk->blkcg_css); ··· 1847 1850 */ 1848 1851 bool free_path; 1849 1852 1850 - /* Output fields. Only set when can_nocow_file_extent() returns 1. */ 1851 - 1852 - u64 disk_bytenr; 1853 - u64 disk_num_bytes; 1854 - u64 extent_offset; 1855 - /* Number of bytes that can be written to in NOCOW mode. */ 1856 - u64 num_bytes; 1853 + /* 1854 + * Output fields. Only set when can_nocow_file_extent() returns 1. 1855 + * The expected file extent for the NOCOW write. 1856 + */ 1857 + struct btrfs_file_extent file_extent; 1857 1858 }; 1858 1859 1859 1860 /* ··· 1873 1878 struct btrfs_root *root = inode->root; 1874 1879 struct btrfs_file_extent_item *fi; 1875 1880 struct btrfs_root *csum_root; 1881 + u64 io_start; 1876 1882 u64 extent_end; 1877 1883 u8 extent_type; 1878 1884 int can_nocow = 0; ··· 1885 1889 1886 1890 if (extent_type == BTRFS_FILE_EXTENT_INLINE) 1887 1891 goto out; 1888 - 1889 - /* Can't access these fields unless we know it's not an inline extent. 
*/ 1890 - args->disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi); 1891 - args->disk_num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi); 1892 - args->extent_offset = btrfs_file_extent_offset(leaf, fi); 1893 1892 1894 1893 if (!(inode->flags & BTRFS_INODE_NODATACOW) && 1895 1894 extent_type == BTRFS_FILE_EXTENT_REG) ··· 1901 1910 goto out; 1902 1911 1903 1912 /* An explicit hole, must COW. */ 1904 - if (args->disk_bytenr == 0) 1913 + if (btrfs_file_extent_disk_bytenr(leaf, fi) == 0) 1905 1914 goto out; 1906 1915 1907 1916 /* Compressed/encrypted/encoded extents must be COWed. */ ··· 1912 1921 1913 1922 extent_end = btrfs_file_extent_end(path); 1914 1923 1924 + args->file_extent.disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi); 1925 + args->file_extent.disk_num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi); 1926 + args->file_extent.ram_bytes = btrfs_file_extent_ram_bytes(leaf, fi); 1927 + args->file_extent.offset = btrfs_file_extent_offset(leaf, fi); 1928 + args->file_extent.compression = btrfs_file_extent_compression(leaf, fi); 1929 + 1915 1930 /* 1916 1931 * The following checks can be expensive, as they need to take other 1917 1932 * locks and do btree or rbtree searches, so release the path to avoid ··· 1926 1929 btrfs_release_path(path); 1927 1930 1928 1931 ret = btrfs_cross_ref_exist(root, btrfs_ino(inode), 1929 - key->offset - args->extent_offset, 1930 - args->disk_bytenr, args->strict, path); 1932 + key->offset - args->file_extent.offset, 1933 + args->file_extent.disk_bytenr, args->strict, path); 1931 1934 WARN_ON_ONCE(ret > 0 && is_freespace_inode); 1932 1935 if (ret != 0) 1933 1936 goto out; ··· 1948 1951 atomic_read(&root->snapshot_force_cow)) 1949 1952 goto out; 1950 1953 1951 - args->disk_bytenr += args->extent_offset; 1952 - args->disk_bytenr += args->start - key->offset; 1953 - args->num_bytes = min(args->end + 1, extent_end) - args->start; 1954 + args->file_extent.num_bytes = min(args->end + 1, extent_end) - args->start; 1955 + 
args->file_extent.offset += args->start - key->offset; 1956 + io_start = args->file_extent.disk_bytenr + args->file_extent.offset; 1954 1957 1955 1958 /* 1956 1959 * Force COW if csums exist in the range. This ensures that csums for a 1957 1960 * given extent are either valid or do not exist. 1958 1961 */ 1959 1962 1960 - csum_root = btrfs_csum_root(root->fs_info, args->disk_bytenr); 1961 - ret = btrfs_lookup_csums_list(csum_root, args->disk_bytenr, 1962 - args->disk_bytenr + args->num_bytes - 1, 1963 + csum_root = btrfs_csum_root(root->fs_info, io_start); 1964 + ret = btrfs_lookup_csums_list(csum_root, io_start, 1965 + io_start + args->file_extent.num_bytes - 1, 1963 1966 NULL, nowait); 1964 1967 WARN_ON_ONCE(ret > 0 && is_freespace_inode); 1965 1968 if (ret != 0) ··· 2018 2021 struct extent_buffer *leaf; 2019 2022 struct extent_state *cached_state = NULL; 2020 2023 u64 extent_end; 2021 - u64 ram_bytes; 2022 2024 u64 nocow_end; 2023 2025 int extent_type; 2024 2026 bool is_prealloc; ··· 2096 2100 ret = -EUCLEAN; 2097 2101 goto error; 2098 2102 } 2099 - ram_bytes = btrfs_file_extent_ram_bytes(leaf, fi); 2100 2103 extent_end = btrfs_file_extent_end(path); 2101 2104 2102 2105 /* ··· 2115 2120 goto must_cow; 2116 2121 2117 2122 ret = 0; 2118 - nocow_bg = btrfs_inc_nocow_writers(fs_info, nocow_args.disk_bytenr); 2123 + nocow_bg = btrfs_inc_nocow_writers(fs_info, 2124 + nocow_args.file_extent.disk_bytenr + 2125 + nocow_args.file_extent.offset); 2119 2126 if (!nocow_bg) { 2120 2127 must_cow: 2121 2128 /* ··· 2153 2156 } 2154 2157 } 2155 2158 2156 - nocow_end = cur_offset + nocow_args.num_bytes - 1; 2159 + nocow_end = cur_offset + nocow_args.file_extent.num_bytes - 1; 2157 2160 lock_extent(&inode->io_tree, cur_offset, nocow_end, &cached_state); 2158 2161 2159 2162 is_prealloc = extent_type == BTRFS_FILE_EXTENT_PREALLOC; 2160 2163 if (is_prealloc) { 2161 - u64 orig_start = found_key.offset - nocow_args.extent_offset; 2162 2164 struct extent_map *em; 2163 2165 2164 - em = 
create_io_em(inode, cur_offset, nocow_args.num_bytes, 2165 - orig_start, 2166 - nocow_args.disk_bytenr, /* block_start */ 2167 - nocow_args.num_bytes, /* block_len */ 2168 - nocow_args.disk_num_bytes, /* orig_block_len */ 2169 - ram_bytes, BTRFS_COMPRESS_NONE, 2170 - BTRFS_ORDERED_PREALLOC); 2166 + em = btrfs_create_io_em(inode, cur_offset, 2167 + &nocow_args.file_extent, 2168 + BTRFS_ORDERED_PREALLOC); 2171 2169 if (IS_ERR(em)) { 2172 2170 unlock_extent(&inode->io_tree, cur_offset, 2173 2171 nocow_end, &cached_state); ··· 2174 2182 } 2175 2183 2176 2184 ordered = btrfs_alloc_ordered_extent(inode, cur_offset, 2177 - nocow_args.num_bytes, nocow_args.num_bytes, 2178 - nocow_args.disk_bytenr, nocow_args.num_bytes, 0, 2185 + &nocow_args.file_extent, 2179 2186 is_prealloc 2180 2187 ? (1 << BTRFS_ORDERED_PREALLOC) 2181 - : (1 << BTRFS_ORDERED_NOCOW), 2182 - BTRFS_COMPRESS_NONE); 2188 + : (1 << BTRFS_ORDERED_NOCOW)); 2183 2189 btrfs_dec_nocow_writers(nocow_bg); 2184 2190 if (IS_ERR(ordered)) { 2185 2191 if (is_prealloc) { ··· 2591 2601 } 2592 2602 } 2593 2603 2594 - static int btrfs_extract_ordered_extent(struct btrfs_bio *bbio, 2595 - struct btrfs_ordered_extent *ordered) 2596 - { 2597 - u64 start = (u64)bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT; 2598 - u64 len = bbio->bio.bi_iter.bi_size; 2599 - struct btrfs_ordered_extent *new; 2600 - int ret; 2601 - 2602 - /* Must always be called for the beginning of an ordered extent. */ 2603 - if (WARN_ON_ONCE(start != ordered->disk_bytenr)) 2604 - return -EINVAL; 2605 - 2606 - /* No need to split if the ordered extent covers the entire bio. */ 2607 - if (ordered->disk_num_bytes == len) { 2608 - refcount_inc(&ordered->refs); 2609 - bbio->ordered = ordered; 2610 - return 0; 2611 - } 2612 - 2613 - /* 2614 - * Don't split the extent_map for NOCOW extents, as we're writing into 2615 - * a pre-existing one. 
2616 - */ 2617 - if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags)) { 2618 - ret = split_extent_map(bbio->inode, bbio->file_offset, 2619 - ordered->num_bytes, len, 2620 - ordered->disk_bytenr); 2621 - if (ret) 2622 - return ret; 2623 - } 2624 - 2625 - new = btrfs_split_ordered_extent(ordered, len); 2626 - if (IS_ERR(new)) 2627 - return PTR_ERR(new); 2628 - bbio->ordered = new; 2629 - return 0; 2630 - } 2631 - 2632 2604 /* 2633 2605 * given a list of ordered sums record them in the inode. This happens 2634 2606 * at IO completion time based on sums calculated at bio submission time. ··· 2633 2681 if (IS_ERR(em)) 2634 2682 return PTR_ERR(em); 2635 2683 2636 - if (em->block_start != EXTENT_MAP_HOLE) 2684 + if (em->disk_bytenr != EXTENT_MAP_HOLE) 2637 2685 goto next; 2638 2686 2639 2687 em_len = em->len; ··· 2989 3037 btrfs_set_stack_file_extent_disk_num_bytes(&stack_fi, 2990 3038 oe->disk_num_bytes); 2991 3039 btrfs_set_stack_file_extent_offset(&stack_fi, oe->offset); 2992 - if (test_bit(BTRFS_ORDERED_TRUNCATED, &oe->flags)) { 3040 + if (test_bit(BTRFS_ORDERED_TRUNCATED, &oe->flags)) 2993 3041 num_bytes = oe->truncated_len; 2994 - ram_bytes = num_bytes; 2995 - } 2996 3042 btrfs_set_stack_file_extent_num_bytes(&stack_fi, num_bytes); 2997 3043 btrfs_set_stack_file_extent_ram_bytes(&stack_fi, ram_bytes); 2998 3044 btrfs_set_stack_file_extent_compression(&stack_fi, oe->compress_type); ··· 3006 3056 test_bit(BTRFS_ORDERED_ENCODED, &oe->flags) || 3007 3057 test_bit(BTRFS_ORDERED_TRUNCATED, &oe->flags); 3008 3058 3009 - return insert_reserved_file_extent(trans, BTRFS_I(oe->inode), 3059 + return insert_reserved_file_extent(trans, oe->inode, 3010 3060 oe->file_offset, &stack_fi, 3011 3061 update_inode_bytes, oe->qgroup_rsv); 3012 3062 } ··· 3018 3068 */ 3019 3069 int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent) 3020 3070 { 3021 - struct btrfs_inode *inode = BTRFS_I(ordered_extent->inode); 3071 + struct btrfs_inode *inode = ordered_extent->inode; 3022 
     3072   	struct btrfs_root *root = inode->root;
3023 3073   	struct btrfs_fs_info *fs_info = root->fs_info;
3024 3074   	struct btrfs_trans_handle *trans = NULL;
···
3252 3302
3253 3303   int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered)
3254 3304   {
3255      - 	if (btrfs_is_zoned(inode_to_fs_info(ordered->inode)) &&
     3305 + 	if (btrfs_is_zoned(ordered->inode->root->fs_info) &&
3256 3306   	    !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags) &&
3257 3307   	    list_empty(&ordered->bioc_list))
3258 3308   		btrfs_finish_ordered_zoned(ordered);
···
3546 3596   		found_key.objectid = found_key.offset;
3547 3597   		found_key.type = BTRFS_INODE_ITEM_KEY;
3548 3598   		found_key.offset = 0;
3549      - 		inode = btrfs_iget(fs_info->sb, last_objectid, root);
     3599 + 		inode = btrfs_iget(last_objectid, root);
3550 3600   		if (IS_ERR(inode)) {
3551 3601   			ret = PTR_ERR(inode);
3552 3602   			inode = NULL;
···
3731 3781   	return 1;
3732 3782   }
3733 3783
     3784 + static int btrfs_init_file_extent_tree(struct btrfs_inode *inode)
     3785 + {
     3786 + 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
     3787 +
     3788 + 	if (WARN_ON_ONCE(inode->file_extent_tree))
     3789 + 		return 0;
     3790 + 	if (btrfs_fs_incompat(fs_info, NO_HOLES))
     3791 + 		return 0;
     3792 + 	if (!S_ISREG(inode->vfs_inode.i_mode))
     3793 + 		return 0;
     3794 + 	if (btrfs_is_free_space_inode(inode))
     3795 + 		return 0;
     3796 +
     3797 + 	inode->file_extent_tree = kmalloc(sizeof(struct extent_io_tree), GFP_KERNEL);
     3798 + 	if (!inode->file_extent_tree)
     3799 + 		return -ENOMEM;
     3800 +
     3801 + 	extent_io_tree_init(fs_info, inode->file_extent_tree, IO_TREE_INODE_FILE_EXTENT);
     3802 + 	/* Lockdep class is set only for the file extent tree. */
     3803 + 	lockdep_set_class(&inode->file_extent_tree->lock, &file_extent_tree_class);
     3804 +
     3805 + 	return 0;
     3806 + }
     3807 +
3734 3808   /*
3735 3809    * read an inode from the btree into the in-memory inode
3736 3810    */
···
3774 3800   	bool filled = false;
3775 3801   	int first_xattr_slot;
3776 3802
     3803 + 	ret = btrfs_init_file_extent_tree(BTRFS_I(inode));
     3804 + 	if (ret)
     3805 + 		return ret;
     3806 +
3777 3807   	ret = btrfs_fill_inode(inode, &rdev);
3778 3808   	if (!ret)
3779 3809   		filled = true;
···
3788 3810   			return -ENOMEM;
3789 3811   	}
3790 3812
3791      - 	memcpy(&location, &BTRFS_I(inode)->location, sizeof(location));
     3813 + 	btrfs_get_inode_key(BTRFS_I(inode), &location);
3792 3814
3793 3815   	ret = btrfs_lookup_inode(NULL, root, path, &location, 0);
3794 3816   	if (ret) {
···
3834 3856   	inode->i_rdev = 0;
3835 3857   	rdev = btrfs_inode_rdev(leaf, inode_item);
3836 3858
3837      - 	BTRFS_I(inode)->index_cnt = (u64)-1;
     3859 + 	if (S_ISDIR(inode->i_mode))
     3860 + 		BTRFS_I(inode)->index_cnt = (u64)-1;
     3861 +
3838 3862   	btrfs_inode_split_flags(btrfs_inode_flags(leaf, inode_item),
3839 3863   				&BTRFS_I(inode)->flags, &BTRFS_I(inode)->ro_flags);
3840 3864
···
4018 4038   	struct btrfs_inode_item *inode_item;
4019 4039   	struct btrfs_path *path;
4020 4040   	struct extent_buffer *leaf;
     4041 + 	struct btrfs_key key;
4021 4042   	int ret;
4022 4043
4023 4044   	path = btrfs_alloc_path();
4024 4045   	if (!path)
4025 4046   		return -ENOMEM;
4026 4047
4027      - 	ret = btrfs_lookup_inode(trans, inode->root, path, &inode->location, 1);
     4048 + 	btrfs_get_inode_key(inode, &key);
     4049 + 	ret = btrfs_lookup_inode(trans, inode->root, path, &key, 1);
4028 4050   	if (ret) {
4029 4051   		if (ret > 0)
4030 4052   			ret = -ENOENT;
···
4290 4308   	if (btrfs_ino(inode) == BTRFS_FIRST_FREE_OBJECTID) {
4291 4309   		objectid = btrfs_root_id(inode->root);
4292 4310   	} else if (btrfs_ino(inode) == BTRFS_EMPTY_SUBVOL_DIR_OBJECTID) {
4293      - 		objectid = inode->location.objectid;
     4311 + 		objectid = inode->ref_root_id;
4294 4312   	} else {
4295 4313   		WARN_ON(1);
4296 4314   		fscrypt_free_filename(&fname);
···
4520 4538   	if (IS_ERR(trans)) {
4521 4539   		ret = PTR_ERR(trans);
4522 4540   		goto out_release;
4523      - 	}
4524      - 	ret = btrfs_record_root_in_trans(trans, root);
4525      - 	if (ret) {
4526      - 		btrfs_abort_transaction(trans, ret);
4527      - 		goto out_end_trans;
4528 4541   	}
4529 4542   	btrfs_qgroup_convert_reserved_meta(root, qgroup_reserved);
4530 4543   	qgroup_reserved = 0;
···
4944 4967   			}
4945 4968   			hole_em->start = cur_offset;
4946 4969   			hole_em->len = hole_size;
4947      - 			hole_em->orig_start = cur_offset;
4948 4970
4949      - 			hole_em->block_start = EXTENT_MAP_HOLE;
4950      - 			hole_em->block_len = 0;
4951      - 			hole_em->orig_block_len = 0;
     4971 + 			hole_em->disk_bytenr = EXTENT_MAP_HOLE;
     4972 + 			hole_em->disk_num_bytes = 0;
4952 4973   			hole_em->ram_bytes = hole_size;
4953 4974   			hole_em->generation = btrfs_get_fs_generation(fs_info);
4954 4975
···
5024 5049   		struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
5025 5050
5026 5051   		if (btrfs_is_zoned(fs_info)) {
5027      - 			ret = btrfs_wait_ordered_range(inode,
     5052 + 			ret = btrfs_wait_ordered_range(BTRFS_I(inode),
5028 5053   					ALIGN(newsize, fs_info->sectorsize),
5029 5054   					(u64)-1);
5030 5055   			if (ret)
···
5054 5079   			 * wait for disk_i_size to be stable and then update the
5055 5080   			 * in-memory size to match.
5056 5081   			 */
5057      - 			err = btrfs_wait_ordered_range(inode, 0, (u64)-1);
     5082 + 			err = btrfs_wait_ordered_range(BTRFS_I(inode), 0, (u64)-1);
5058 5083   			if (err)
5059 5084   				return err;
5060 5085   			i_size_write(inode, BTRFS_I(inode)->disk_i_size);
···
5468 5493   	return err;
5469 5494   }
5470 5495
5471      - static void inode_tree_add(struct btrfs_inode *inode)
     5496 + static int btrfs_add_inode_to_root(struct btrfs_inode *inode, bool prealloc)
     5497 + {
     5498 + 	struct btrfs_root *root = inode->root;
     5499 + 	struct btrfs_inode *existing;
     5500 + 	const u64 ino = btrfs_ino(inode);
     5501 + 	int ret;
     5502 +
     5503 + 	if (inode_unhashed(&inode->vfs_inode))
     5504 + 		return 0;
     5505 +
     5506 + 	if (prealloc) {
     5507 + 		ret = xa_reserve(&root->inodes, ino, GFP_NOFS);
     5508 + 		if (ret)
     5509 + 			return ret;
     5510 + 	}
     5511 +
     5512 + 	existing = xa_store(&root->inodes, ino, inode, GFP_ATOMIC);
     5513 +
     5514 + 	if (xa_is_err(existing)) {
     5515 + 		ret = xa_err(existing);
     5516 + 		ASSERT(ret != -EINVAL);
     5517 + 		ASSERT(ret != -ENOMEM);
     5518 + 		return ret;
     5519 + 	} else if (existing) {
     5520 + 		WARN_ON(!(existing->vfs_inode.i_state & (I_WILL_FREE | I_FREEING)));
     5521 + 	}
     5522 +
     5523 + 	return 0;
     5524 + }
     5525 +
     5526 + static void btrfs_del_inode_from_root(struct btrfs_inode *inode)
5472 5527   {
5473 5528   	struct btrfs_root *root = inode->root;
5474 5529   	struct btrfs_inode *entry;
5475      - 	struct rb_node **p;
5476      - 	struct rb_node *parent;
5477      - 	struct rb_node *new = &inode->rb_node;
5478      - 	u64 ino = btrfs_ino(inode);
     5530 + 	bool empty = false;
5479 5531
5480      - 	if (inode_unhashed(&inode->vfs_inode))
5481      - 		return;
5482      - 	parent = NULL;
5483      - 	spin_lock(&root->inode_lock);
5484      - 	p = &root->inode_tree.rb_node;
5485      - 	while (*p) {
5486      - 		parent = *p;
5487      - 		entry = rb_entry(parent, struct btrfs_inode, rb_node);
5488      -
5489      - 		if (ino < btrfs_ino(entry))
5490      - 			p = &parent->rb_left;
5491      - 		else if (ino > btrfs_ino(entry))
5492      - 			p = &parent->rb_right;
5493      - 		else {
5494      - 			WARN_ON(!(entry->vfs_inode.i_state &
5495      - 				  (I_WILL_FREE | I_FREEING)));
5496      - 			rb_replace_node(parent, new, &root->inode_tree);
5497      - 			RB_CLEAR_NODE(parent);
5498      - 			spin_unlock(&root->inode_lock);
5499      - 			return;
5500      - 		}
5501      - 	}
5502      - 	rb_link_node(new, parent, p);
5503      - 	rb_insert_color(new, &root->inode_tree);
5504      - 	spin_unlock(&root->inode_lock);
5505      - }
5506      -
5507      - static void inode_tree_del(struct btrfs_inode *inode)
5508      - {
5509      - 	struct btrfs_root *root = inode->root;
5510      - 	int empty = 0;
5511      -
5512      - 	spin_lock(&root->inode_lock);
5513      - 	if (!RB_EMPTY_NODE(&inode->rb_node)) {
5514      - 		rb_erase(&inode->rb_node, &root->inode_tree);
5515      - 		RB_CLEAR_NODE(&inode->rb_node);
5516      - 		empty = RB_EMPTY_ROOT(&root->inode_tree);
5517      - 	}
5518      - 	spin_unlock(&root->inode_lock);
     5532 + 	xa_lock(&root->inodes);
     5533 + 	entry = __xa_erase(&root->inodes, btrfs_ino(inode));
     5534 + 	if (entry == inode)
     5535 + 		empty = xa_empty(&root->inodes);
     5536 + 	xa_unlock(&root->inodes);
5519 5537
5520 5538   	if (empty && btrfs_root_refs(&root->root_item) == 0) {
5521      - 		spin_lock(&root->inode_lock);
5522      - 		empty = RB_EMPTY_ROOT(&root->inode_tree);
5523      - 		spin_unlock(&root->inode_lock);
     5539 + 		xa_lock(&root->inodes);
     5540 + 		empty = xa_empty(&root->inodes);
     5541 + 		xa_unlock(&root->inodes);
5524 5542   		if (empty)
5525 5543   			btrfs_add_dead_root(root);
5526 5544   	}
···
5524 5556   {
5525 5557   	struct btrfs_iget_args *args = p;
5526 5558
5527      - 	inode->i_ino = args->ino;
5528      - 	BTRFS_I(inode)->location.objectid = args->ino;
5529      - 	BTRFS_I(inode)->location.type = BTRFS_INODE_ITEM_KEY;
5530      - 	BTRFS_I(inode)->location.offset = 0;
     5559 + 	btrfs_set_inode_number(BTRFS_I(inode), args->ino);
5531 5560   	BTRFS_I(inode)->root = btrfs_grab_root(args->root);
5532 5561
5533 5562   	if (args->root && args->root == args->root->fs_info->tree_root &&
···
5538 5573   {
5539 5574   	struct btrfs_iget_args *args = opaque;
5540 5575
5541      - 	return args->ino == BTRFS_I(inode)->location.objectid &&
     5576 + 	return args->ino == btrfs_ino(BTRFS_I(inode)) &&
5542 5577   		args->root == BTRFS_I(inode)->root;
5543 5578   }
5544 5579
5545      - static struct inode *btrfs_iget_locked(struct super_block *s, u64 ino,
5546      - 				       struct btrfs_root *root)
     5580 + static struct inode *btrfs_iget_locked(u64 ino, struct btrfs_root *root)
5547 5581   {
5548 5582   	struct inode *inode;
5549 5583   	struct btrfs_iget_args args;
···
5551 5587   	args.ino = ino;
5552 5588   	args.root = root;
5553 5589
5554      - 	inode = iget5_locked_rcu(s, hashval, btrfs_find_actor,
     5590 + 	inode = iget5_locked_rcu(root->fs_info->sb, hashval, btrfs_find_actor,
5555 5591   			     btrfs_init_locked_inode,
5556 5592   			     (void *)&args);
5557 5593   	return inode;
···
5563 5599    * allocator. NULL is also valid but may require an additional allocation
5564 5600    * later.
5565 5601    */
5566      - struct inode *btrfs_iget_path(struct super_block *s, u64 ino,
5567      - 			      struct btrfs_root *root, struct btrfs_path *path)
     5602 + struct inode *btrfs_iget_path(u64 ino, struct btrfs_root *root,
     5603 + 			      struct btrfs_path *path)
5568 5604   {
5569 5605   	struct inode *inode;
     5606 + 	int ret;
5570 5607
5571      - 	inode = btrfs_iget_locked(s, ino, root);
     5608 + 	inode = btrfs_iget_locked(ino, root);
5572 5609   	if (!inode)
5573 5610   		return ERR_PTR(-ENOMEM);
5574 5611
5575      - 	if (inode->i_state & I_NEW) {
5576      - 		int ret;
     5612 + 	if (!(inode->i_state & I_NEW))
     5613 + 		return inode;
5577 5614
5578      - 		ret = btrfs_read_locked_inode(inode, path);
5579      - 		if (!ret) {
5580      - 			inode_tree_add(BTRFS_I(inode));
5581      - 			unlock_new_inode(inode);
5582      - 		} else {
5583      - 			iget_failed(inode);
5584      - 			/*
5585      - 			 * ret > 0 can come from btrfs_search_slot called by
5586      - 			 * btrfs_read_locked_inode, this means the inode item
5587      - 			 * was not found.
5588      - 			 */
5589      - 			if (ret > 0)
5590      - 				ret = -ENOENT;
5591      - 			inode = ERR_PTR(ret);
5592      - 		}
5593      - 	}
     5615 + 	ret = btrfs_read_locked_inode(inode, path);
     5616 + 	/*
     5617 + 	 * ret > 0 can come from btrfs_search_slot called by
     5618 + 	 * btrfs_read_locked_inode(), this means the inode item was not found.
     5619 + 	 */
     5620 + 	if (ret > 0)
     5621 + 		ret = -ENOENT;
     5622 + 	if (ret < 0)
     5623 + 		goto error;
     5624 +
     5625 + 	ret = btrfs_add_inode_to_root(BTRFS_I(inode), true);
     5626 + 	if (ret < 0)
     5627 + 		goto error;
     5628 +
     5629 + 	unlock_new_inode(inode);
5594 5630
5595 5631   	return inode;
     5632 + error:
     5633 + 	iget_failed(inode);
     5634 + 	return ERR_PTR(ret);
5596 5635   }
5597 5636
5598      - struct inode *btrfs_iget(struct super_block *s, u64 ino, struct btrfs_root *root)
     5637 + struct inode *btrfs_iget(u64 ino, struct btrfs_root *root)
5599 5638   {
5600      - 	return btrfs_iget_path(s, ino, root, NULL);
     5639 + 	return btrfs_iget_path(ino, root, NULL);
5601 5640   }
5602 5641
5603 5642   static struct inode *new_simple_dir(struct inode *dir,
···
5614 5647   		return ERR_PTR(-ENOMEM);
5615 5648
5616 5649   	BTRFS_I(inode)->root = btrfs_grab_root(root);
5617      - 	memcpy(&BTRFS_I(inode)->location, key, sizeof(*key));
     5650 + 	BTRFS_I(inode)->ref_root_id = key->objectid;
     5651 + 	set_bit(BTRFS_INODE_ROOT_STUB, &BTRFS_I(inode)->runtime_flags);
5618 5652   	set_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags);
5619 5653
5620      - 	inode->i_ino = BTRFS_EMPTY_SUBVOL_DIR_OBJECTID;
     5654 + 	btrfs_set_inode_number(BTRFS_I(inode), BTRFS_EMPTY_SUBVOL_DIR_OBJECTID);
5621 5655   	/*
5622 5656   	 * We only need lookup, the rest is read-only and there's no inode
5623 5657   	 * associated with the dentry
···
5672 5704   		return ERR_PTR(ret);
5673 5705
5674 5706   	if (location.type == BTRFS_INODE_ITEM_KEY) {
5675      - 		inode = btrfs_iget(dir->i_sb, location.objectid, root);
     5707 + 		inode = btrfs_iget(location.objectid, root);
5676 5708   		if (IS_ERR(inode))
5677 5709   			return inode;
5678 5710
···
5696 5728   		else
5697 5729   			inode = new_simple_dir(dir, &location, root);
5698 5730   	} else {
5699      - 		inode = btrfs_iget(dir->i_sb, location.objectid, sub_root);
     5731 + 		inode = btrfs_iget(location.objectid, sub_root);
5700 5732   		btrfs_put_root(sub_root);
5701 5733
5702 5734   		if (IS_ERR(inode))
···
5916 5948   	addr = private->filldir_buf;
5917 5949   	path->reada = READA_FORWARD;
5918 5950
5919      - 	put = btrfs_readdir_get_delayed_items(inode, private->last_index,
     5951 + 	put = btrfs_readdir_get_delayed_items(BTRFS_I(inode), private->last_index,
5920 5952   					      &ins_list, &del_list);
5921 5953
5922 5954   again:
···
6006 6038   	ret = 0;
6007 6039   err:
6008 6040   	if (put)
6009      - 		btrfs_readdir_put_delayed_items(inode, &ins_list, &del_list);
     6041 + 		btrfs_readdir_put_delayed_items(BTRFS_I(inode), &ins_list, &del_list);
6010 6042   	btrfs_free_path(path);
6011 6043   	return ret;
6012 6044   }
···
6091 6123   {
6092 6124   	struct btrfs_iget_args args;
6093 6125
6094      - 	args.ino = BTRFS_I(inode)->location.objectid;
     6126 + 	args.ino = btrfs_ino(BTRFS_I(inode));
6095 6127   	args.root = BTRFS_I(inode)->root;
6096 6128
6097 6129   	return insert_inode_locked4(inode,
···
6198 6230   	struct btrfs_fs_info *fs_info = inode_to_fs_info(dir);
6199 6231   	struct btrfs_root *root;
6200 6232   	struct btrfs_inode_item *inode_item;
6201      - 	struct btrfs_key *location;
6202 6233   	struct btrfs_path *path;
6203 6234   	u64 objectid;
6204 6235   	struct btrfs_inode_ref *ref;
···
6206 6239   	struct btrfs_item_batch batch;
6207 6240   	unsigned long ptr;
6208 6241   	int ret;
     6242 + 	bool xa_reserved = false;
6209 6243
6210 6244   	path = btrfs_alloc_path();
6211 6245   	if (!path)
···
6216 6248   	BTRFS_I(inode)->root = btrfs_grab_root(BTRFS_I(dir)->root);
6217 6249   	root = BTRFS_I(inode)->root;
6218 6250
     6251 + 	ret = btrfs_init_file_extent_tree(BTRFS_I(inode));
     6252 + 	if (ret)
     6253 + 		goto out;
     6254 +
6219 6255   	ret = btrfs_get_free_objectid(root, &objectid);
6220 6256   	if (ret)
6221 6257   		goto out;
6222      - 	inode->i_ino = objectid;
     6258 + 	btrfs_set_inode_number(BTRFS_I(inode), objectid);
     6259 +
     6260 + 	ret = xa_reserve(&root->inodes, objectid, GFP_NOFS);
     6261 + 	if (ret)
     6262 + 		goto out;
     6263 + 	xa_reserved = true;
6223 6264
6224 6265   	if (args->orphan) {
6225 6266   		/*
···
6243 6266   		if (ret)
6244 6267   			goto out;
6245 6268   	}
6246      - 	/* index_cnt is ignored for everything but a dir. */
6247      - 	BTRFS_I(inode)->index_cnt = BTRFS_DIR_START_INDEX;
     6269 +
     6270 + 	if (S_ISDIR(inode->i_mode))
     6271 + 		BTRFS_I(inode)->index_cnt = BTRFS_DIR_START_INDEX;
     6272 +
6248 6273   	BTRFS_I(inode)->generation = trans->transid;
6249 6274   	inode->i_generation = BTRFS_I(inode)->generation;
6250 6275
···
6272 6293   			BTRFS_I(inode)->flags |= BTRFS_INODE_NODATACOW |
6273 6294   				BTRFS_INODE_NODATASUM;
6274 6295   	}
6275      -
6276      - 	location = &BTRFS_I(inode)->location;
6277      - 	location->objectid = objectid;
6278      - 	location->offset = 0;
6279      - 	location->type = BTRFS_INODE_ITEM_KEY;
6280 6296
6281 6297   	ret = btrfs_insert_inode_locked(inode);
6282 6298   	if (ret < 0) {
···
6371 6397   		 * Subvolumes inherit properties from their parent subvolume,
6372 6398   		 * not the directory they were created in.
6373 6399   		 */
6374      - 		parent = btrfs_iget(fs_info->sb, BTRFS_FIRST_FREE_OBJECTID,
6375      - 				    BTRFS_I(dir)->root);
     6400 + 		parent = btrfs_iget(BTRFS_FIRST_FREE_OBJECTID, BTRFS_I(dir)->root);
6376 6401   		if (IS_ERR(parent)) {
6377 6402   			ret = PTR_ERR(parent);
6378 6403   		} else {
···
6399 6426   		}
6400 6427   	}
6401 6428
6402      - 	inode_tree_add(BTRFS_I(inode));
     6429 + 	ret = btrfs_add_inode_to_root(BTRFS_I(inode), false);
     6430 + 	if (WARN_ON(ret)) {
     6431 + 		/* Shouldn't happen, we used xa_reserve() before. */
     6432 + 		btrfs_abort_transaction(trans, ret);
     6433 + 		goto discard;
     6434 + 	}
6403 6435
6404 6436   	trace_btrfs_inode_new(inode);
6405 6437   	btrfs_set_inode_last_trans(trans, BTRFS_I(inode));
···
6432 6454   	ihold(inode);
6433 6455   	discard_new_inode(inode);
6434 6456   out:
     6457 + 	if (xa_reserved)
     6458 + 		xa_release(&root->inodes, objectid);
     6459 +
6435 6460   	btrfs_free_path(path);
6436 6461   	return ret;
6437 6462   }
···
6799 6818   	if (em) {
6800 6819   		if (em->start > start || em->start + em->len <= start)
6801 6820   			free_extent_map(em);
6802      - 		else if (em->block_start == EXTENT_MAP_INLINE && page)
     6821 + 		else if (em->disk_bytenr == EXTENT_MAP_INLINE && page)
6803 6822   			free_extent_map(em);
6804 6823   		else
6805 6824   			goto out;
···
6810 6829   		goto out;
6811 6830   	}
6812 6831   	em->start = EXTENT_MAP_HOLE;
6813      - 	em->orig_start = EXTENT_MAP_HOLE;
     6832 + 	em->disk_bytenr = EXTENT_MAP_HOLE;
6814 6833   	em->len = (u64)-1;
6815      - 	em->block_len = (u64)-1;
6816 6834
6817 6835   	path = btrfs_alloc_path();
6818 6836   	if (!path) {
···
6901 6921
6902 6922   		/* New extent overlaps with existing one */
6903 6923   		em->start = start;
6904      - 		em->orig_start = start;
6905 6924   		em->len = found_key.offset - start;
6906      - 		em->block_start = EXTENT_MAP_HOLE;
     6925 + 		em->disk_bytenr = EXTENT_MAP_HOLE;
6907 6926   		goto insert;
6908 6927   	}
6909 6928
···
6926 6947   		 *
6927 6948   		 * Other members are not utilized for inline extents.
6928 6949   		 */
6929      - 		ASSERT(em->block_start == EXTENT_MAP_INLINE);
     6950 + 		ASSERT(em->disk_bytenr == EXTENT_MAP_INLINE);
6930 6951   		ASSERT(em->len == fs_info->sectorsize);
6931 6952
6932 6953   		ret = read_inline_extent(inode, path, page);
···
6936 6957   	}
6937 6958   not_found:
6938 6959   	em->start = start;
6939      - 	em->orig_start = start;
6940 6960   	em->len = len;
6941      - 	em->block_start = EXTENT_MAP_HOLE;
     6961 + 	em->disk_bytenr = EXTENT_MAP_HOLE;
6942 6962   insert:
6943 6963   	ret = 0;
6944 6964   	btrfs_release_path(path);
···
6961 6983   		free_extent_map(em);
6962 6984   		return ERR_PTR(ret);
6963 6985   	}
6964      - 	return em;
6965      - }
6966      -
6967      - static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
6968      - 						  struct btrfs_dio_data *dio_data,
6969      - 						  const u64 start,
6970      - 						  const u64 len,
6971      - 						  const u64 orig_start,
6972      - 						  const u64 block_start,
6973      - 						  const u64 block_len,
6974      - 						  const u64 orig_block_len,
6975      - 						  const u64 ram_bytes,
6976      - 						  const int type)
6977      - {
6978      - 	struct extent_map *em = NULL;
6979      - 	struct btrfs_ordered_extent *ordered;
6980      -
6981      - 	if (type != BTRFS_ORDERED_NOCOW) {
6982      - 		em = create_io_em(inode, start, len, orig_start, block_start,
6983      - 				  block_len, orig_block_len, ram_bytes,
6984      - 				  BTRFS_COMPRESS_NONE, /* compress_type */
6985      - 				  type);
6986      - 		if (IS_ERR(em))
6987      - 			goto out;
6988      - 	}
6989      - 	ordered = btrfs_alloc_ordered_extent(inode, start, len, len,
6990      - 					     block_start, block_len, 0,
6991      - 					     (1 << type) |
6992      - 					     (1 << BTRFS_ORDERED_DIRECT),
6993      - 					     BTRFS_COMPRESS_NONE);
6994      - 	if (IS_ERR(ordered)) {
6995      - 		if (em) {
6996      - 			free_extent_map(em);
6997      - 			btrfs_drop_extent_map_range(inode, start,
6998      - 						    start + len - 1, false);
6999      - 		}
7000      - 		em = ERR_CAST(ordered);
7001      - 	} else {
7002      - 		ASSERT(!dio_data->ordered);
7003      - 		dio_data->ordered = ordered;
7004      - 	}
7005      - out:
7006      -
7007      - 	return em;
7008      - }
7009      -
7010      - static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
7011      - 						  struct btrfs_dio_data *dio_data,
7012      - 						  u64 start, u64 len)
7013      - {
7014      - 	struct btrfs_root *root = inode->root;
7015      - 	struct btrfs_fs_info *fs_info = root->fs_info;
7016      - 	struct extent_map *em;
7017      - 	struct btrfs_key ins;
7018      - 	u64 alloc_hint;
7019      - 	int ret;
7020      -
7021      - 	alloc_hint = get_extent_allocation_hint(inode, start, len);
7022      - again:
7023      - 	ret = btrfs_reserve_extent(root, len, len, fs_info->sectorsize,
7024      - 				   0, alloc_hint, &ins, 1, 1);
7025      - 	if (ret == -EAGAIN) {
7026      - 		ASSERT(btrfs_is_zoned(fs_info));
7027      - 		wait_on_bit_io(&inode->root->fs_info->flags, BTRFS_FS_NEED_ZONE_FINISH,
7028      - 			       TASK_UNINTERRUPTIBLE);
7029      - 		goto again;
7030      - 	}
7031      - 	if (ret)
7032      - 		return ERR_PTR(ret);
7033      -
7034      - 	em = btrfs_create_dio_extent(inode, dio_data, start, ins.offset, start,
7035      - 				     ins.objectid, ins.offset, ins.offset,
7036      - 				     ins.offset, BTRFS_ORDERED_REGULAR);
7037      - 	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
7038      - 	if (IS_ERR(em))
7039      - 		btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset,
7040      - 					   1);
7041      -
7042 6986   	return em;
7043 6987   }
7044 6988
···
6998 7098    * any ordered extents.
6999 7099    */
7000 7100   noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
7001      - 			      u64 *orig_start, u64 *orig_block_len,
7002      - 			      u64 *ram_bytes, bool nowait, bool strict)
     7101 + 			      struct btrfs_file_extent *file_extent,
     7102 + 			      bool nowait, bool strict)
7003 7103   {
7004 7104   	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
7005 7105   	struct can_nocow_file_extent_args nocow_args = { 0 };
···
7049 7149
7050 7150   	fi = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
7051 7151   	found_type = btrfs_file_extent_type(leaf, fi);
7052      - 	if (ram_bytes)
7053      - 		*ram_bytes = btrfs_file_extent_ram_bytes(leaf, fi);
7054 7152
7055 7153   	nocow_args.start = offset;
7056 7154   	nocow_args.end = offset + *len - 1;
···
7066 7168   	}
7067 7169
7068 7170   	ret = 0;
7069      - 	if (btrfs_extent_readonly(fs_info, nocow_args.disk_bytenr))
     7171 + 	if (btrfs_extent_readonly(fs_info,
     7172 + 				  nocow_args.file_extent.disk_bytenr +
     7173 + 				  nocow_args.file_extent.offset))
7070 7174   		goto out;
7071 7175
7072 7176   	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) &&
7073 7177   	    found_type == BTRFS_FILE_EXTENT_PREALLOC) {
7074 7178   		u64 range_end;
7075 7179
7076      - 		range_end = round_up(offset + nocow_args.num_bytes,
     7180 + 		range_end = round_up(offset + nocow_args.file_extent.num_bytes,
7077 7181   				     root->fs_info->sectorsize) - 1;
7078 7182   		ret = test_range_bit_exists(io_tree, offset, range_end, EXTENT_DELALLOC);
7079 7183   		if (ret) {
···
7084 7184   		}
7085 7185   	}
7086 7186
7087      - 	if (orig_start)
7088      - 		*orig_start = key.offset - nocow_args.extent_offset;
7089      - 	if (orig_block_len)
7090      - 		*orig_block_len = nocow_args.disk_num_bytes;
     7187 + 	if (file_extent)
     7188 + 		memcpy(file_extent, &nocow_args.file_extent, sizeof(*file_extent));
7091 7189
7092      - 	*len = nocow_args.num_bytes;
     7190 + 	*len = nocow_args.file_extent.num_bytes;
7093 7191   	ret = 1;
7094 7192   out:
7095 7193   	btrfs_free_path(path);
7096 7194   	return ret;
7097 7195   }
7098 7196
7099      - static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
7100      - 			      struct extent_state **cached_state,
7101      - 			      unsigned int iomap_flags)
7102      - {
7103      - 	const bool writing = (iomap_flags & IOMAP_WRITE);
7104      - 	const bool nowait = (iomap_flags & IOMAP_NOWAIT);
7105      - 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
7106      - 	struct btrfs_ordered_extent *ordered;
7107      - 	int ret = 0;
7108      -
7109      - 	while (1) {
7110      - 		if (nowait) {
7111      - 			if (!try_lock_extent(io_tree, lockstart, lockend,
7112      - 					     cached_state))
7113      - 				return -EAGAIN;
7114      - 		} else {
7115      - 			lock_extent(io_tree, lockstart, lockend, cached_state);
7116      - 		}
7117      - 		/*
7118      - 		 * We're concerned with the entire range that we're going to be
7119      - 		 * doing DIO to, so we need to make sure there's no ordered
7120      - 		 * extents in this range.
7121      - 		 */
7122      - 		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), lockstart,
7123      - 						     lockend - lockstart + 1);
7124      -
7125      - 		/*
7126      - 		 * We need to make sure there are no buffered pages in this
7127      - 		 * range either, we could have raced between the invalidate in
7128      - 		 * generic_file_direct_write and locking the extent. The
7129      - 		 * invalidate needs to happen so that reads after a write do not
7130      - 		 * get stale data.
7131      - 		 */
7132      - 		if (!ordered &&
7133      - 		    (!writing || !filemap_range_has_page(inode->i_mapping,
7134      - 							 lockstart, lockend)))
7135      - 			break;
7136      -
7137      - 		unlock_extent(io_tree, lockstart, lockend, cached_state);
7138      -
7139      - 		if (ordered) {
7140      - 			if (nowait) {
7141      - 				btrfs_put_ordered_extent(ordered);
7142      - 				ret = -EAGAIN;
7143      - 				break;
7144      - 			}
7145      - 			/*
7146      - 			 * If we are doing a DIO read and the ordered extent we
7147      - 			 * found is for a buffered write, we can not wait for it
7148      - 			 * to complete and retry, because if we do so we can
7149      - 			 * deadlock with concurrent buffered writes on page
7150      - 			 * locks. This happens only if our DIO read covers more
7151      - 			 * than one extent map, if at this point has already
7152      - 			 * created an ordered extent for a previous extent map
7153      - 			 * and locked its range in the inode's io tree, and a
7154      - 			 * concurrent write against that previous extent map's
7155      - 			 * range and this range started (we unlock the ranges
7156      - 			 * in the io tree only when the bios complete and
7157      - 			 * buffered writes always lock pages before attempting
7158      - 			 * to lock range in the io tree).
7159      - 			 */
7160      - 			if (writing ||
7161      - 			    test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags))
7162      - 				btrfs_start_ordered_extent(ordered);
7163      - 			else
7164      - 				ret = nowait ? -EAGAIN : -ENOTBLK;
7165      - 			btrfs_put_ordered_extent(ordered);
7166      - 		} else {
7167      - 			/*
7168      - 			 * We could trigger writeback for this range (and wait
7169      - 			 * for it to complete) and then invalidate the pages for
7170      - 			 * this range (through invalidate_inode_pages2_range()),
7171      - 			 * but that can lead us to a deadlock with a concurrent
7172      - 			 * call to readahead (a buffered read or a defrag call
7173      - 			 * triggered a readahead) on a page lock due to an
7174      - 			 * ordered dio extent we created before but did not have
7175      - 			 * yet a corresponding bio submitted (whence it can not
7176      - 			 * complete), which makes readahead wait for that
7177      - 			 * ordered extent to complete while holding a lock on
7178      - 			 * that page.
7179      - 			 */
7180      - 			ret = nowait ? -EAGAIN : -ENOTBLK;
7181      - 		}
7182      -
7183      - 		if (ret)
7184      - 			break;
7185      -
7186      - 		cond_resched();
7187      - 	}
7188      -
7189      - 	return ret;
7190      - }
7191      -
7192 7197   /* The callers of this must take lock_extent() */
7193      - static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
7194      - 				       u64 len, u64 orig_start, u64 block_start,
7195      - 				       u64 block_len, u64 orig_block_len,
7196      - 				       u64 ram_bytes, int compress_type,
7197      - 				       int type)
     7198 + struct extent_map *btrfs_create_io_em(struct btrfs_inode *inode, u64 start,
     7199 + 				      const struct btrfs_file_extent *file_extent,
     7200 + 				      int type)
7198 7201   {
7199 7202   	struct extent_map *em;
7200 7203   	int ret;
···
7116 7313
7117 7314   	switch (type) {
7118 7315   	case BTRFS_ORDERED_PREALLOC:
7119      - 		/* Uncompressed extents. */
7120      - 		ASSERT(block_len == len);
7121      -
7122 7316   		/* We're only referring part of a larger preallocated extent. */
7123      - 		ASSERT(block_len <= ram_bytes);
     7317 + 		ASSERT(file_extent->num_bytes <= file_extent->ram_bytes);
7124 7318   		break;
7125 7319   	case BTRFS_ORDERED_REGULAR:
7126      - 		/* Uncompressed extents. */
7127      - 		ASSERT(block_len == len);
7128      -
7129 7320   		/* COW results a new extent matching our file extent size. */
7130      - 		ASSERT(orig_block_len == len);
7131      - 		ASSERT(ram_bytes == len);
     7321 + 		ASSERT(file_extent->disk_num_bytes == file_extent->num_bytes);
     7322 + 		ASSERT(file_extent->ram_bytes == file_extent->num_bytes);
7132 7323
7133 7324   		/* Since it's a new extent, we should not have any offset. */
7134      - 		ASSERT(orig_start == start);
     7325 + 		ASSERT(file_extent->offset == 0);
7135 7326   		break;
7136 7327   	case BTRFS_ORDERED_COMPRESSED:
7137 7328   		/* Must be compressed. */
7138      - 		ASSERT(compress_type != BTRFS_COMPRESS_NONE);
     7329 + 		ASSERT(file_extent->compression != BTRFS_COMPRESS_NONE);
7139 7330
7140 7331   		/*
7141 7332   		 * Encoded write can make us to refer to part of the
7142 7333   		 * uncompressed extent.
7143 7334   		 */
7144      - 		ASSERT(len <= ram_bytes);
     7335 + 		ASSERT(file_extent->num_bytes <= file_extent->ram_bytes);
7145 7336   		break;
7146 7337   	}
7147 7338
···
7144 7347   		return ERR_PTR(-ENOMEM);
7145 7348
7146 7349   	em->start = start;
7147      - 	em->orig_start = orig_start;
7148      - 	em->len = len;
7149      - 	em->block_len = block_len;
7150      - 	em->block_start = block_start;
7151      - 	em->orig_block_len = orig_block_len;
7152      - 	em->ram_bytes = ram_bytes;
     7350 + 	em->len = file_extent->num_bytes;
     7351 + 	em->disk_bytenr = file_extent->disk_bytenr;
     7352 + 	em->disk_num_bytes = file_extent->disk_num_bytes;
     7353 + 	em->ram_bytes = file_extent->ram_bytes;
7153 7354   	em->generation = -1;
     7355 + 	em->offset = file_extent->offset;
7154 7356   	em->flags |= EXTENT_FLAG_PINNED;
7155 7357   	if (type == BTRFS_ORDERED_COMPRESSED)
7156      - 		extent_map_set_compression(em, compress_type);
     7358 + 		extent_map_set_compression(em, file_extent->compression);
7157 7359
7158 7360   	ret = btrfs_replace_extent_map_range(inode, em, true);
7159 7361   	if (ret) {
···
7162 7366
7163 7367   	/* em got 2 refs now, callers needs to do free_extent_map once. */
7164 7368   	return em;
7165      - }
7166      -
7167      -
7168      - static int btrfs_get_blocks_direct_write(struct extent_map **map,
7169      - 					 struct inode *inode,
7170      - 					 struct btrfs_dio_data *dio_data,
7171      - 					 u64 start, u64 *lenp,
7172      - 					 unsigned int iomap_flags)
7173      - {
7174      - 	const bool nowait = (iomap_flags & IOMAP_NOWAIT);
7175      - 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
7176      - 	struct extent_map *em = *map;
7177      - 	int type;
7178      - 	u64 block_start, orig_start, orig_block_len, ram_bytes;
7179      - 	struct btrfs_block_group *bg;
7180      - 	bool can_nocow = false;
7181      - 	bool space_reserved = false;
7182      - 	u64 len = *lenp;
7183      - 	u64 prev_len;
7184      - 	int ret = 0;
7185      -
7186      - 	/*
7187      - 	 * We don't allocate a new extent in the following cases
7188      - 	 *
7189      - 	 * 1) The inode is marked as NODATACOW. In this case we'll just use the
7190      - 	 * existing extent.
7191      - 	 * 2) The extent is marked as PREALLOC. We're good to go here and can
7192      - 	 * just use the extent.
7193      - 	 *
7194      - 	 */
7195      - 	if ((em->flags & EXTENT_FLAG_PREALLOC) ||
7196      - 	    ((BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) &&
7197      - 	     em->block_start != EXTENT_MAP_HOLE)) {
7198      - 		if (em->flags & EXTENT_FLAG_PREALLOC)
7199      - 			type = BTRFS_ORDERED_PREALLOC;
7200      - 		else
7201      - 			type = BTRFS_ORDERED_NOCOW;
7202      - 		len = min(len, em->len - (start - em->start));
7203      - 		block_start = em->block_start + (start - em->start);
7204      -
7205      - 		if (can_nocow_extent(inode, start, &len, &orig_start,
7206      - 				     &orig_block_len, &ram_bytes, false, false) == 1) {
7207      - 			bg = btrfs_inc_nocow_writers(fs_info, block_start);
7208      - 			if (bg)
7209      - 				can_nocow = true;
7210      - 		}
7211      - 	}
7212      -
7213      - 	prev_len = len;
7214      - 	if (can_nocow) {
7215      - 		struct extent_map *em2;
7216      -
7217      - 		/* We can NOCOW, so only need to reserve metadata space. */
7218      - 		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), len, len,
7219      - 						      nowait);
7220      - 		if (ret < 0) {
7221      - 			/* Our caller expects us to free the input extent map. */
7222      - 			free_extent_map(em);
7223      - 			*map = NULL;
7224      - 			btrfs_dec_nocow_writers(bg);
7225      - 			if (nowait && (ret == -ENOSPC || ret == -EDQUOT))
7226      - 				ret = -EAGAIN;
7227      - 			goto out;
7228      - 		}
7229      - 		space_reserved = true;
7230      -
7231      - 		em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, len,
7232      - 					      orig_start, block_start,
7233      - 					      len, orig_block_len,
7234      - 					      ram_bytes, type);
7235      - 		btrfs_dec_nocow_writers(bg);
7236      - 		if (type == BTRFS_ORDERED_PREALLOC) {
7237      - 			free_extent_map(em);
7238      - 			*map = em2;
7239      - 			em = em2;
7240      - 		}
7241      -
7242      - 		if (IS_ERR(em2)) {
7243      - 			ret = PTR_ERR(em2);
7244      - 			goto out;
7245      - 		}
7246      -
7247      - 		dio_data->nocow_done = true;
7248      - 	} else {
7249      - 		/* Our caller expects us to free the input extent map. */
7250      - 		free_extent_map(em);
7251      - 		*map = NULL;
7252      -
7253      - 		if (nowait) {
7254      - 			ret = -EAGAIN;
7255      - 			goto out;
7256      - 		}
7257      -
7258      - 		/*
7259      - 		 * If we could not allocate data space before locking the file
7260      - 		 * range and we can't do a NOCOW write, then we have to fail.
7261      - 		 */
7262      - 		if (!dio_data->data_space_reserved) {
7263      - 			ret = -ENOSPC;
7264      - 			goto out;
7265      - 		}
7266      -
7267      - 		/*
7268      - 		 * We have to COW and we have already reserved data space before,
7269      - 		 * so now we reserve only metadata.
7270      - 		 */
7271      - 		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), len, len,
7272      - 						      false);
7273      - 		if (ret < 0)
7274      - 			goto out;
7275      - 		space_reserved = true;
7276      -
7277      - 		em = btrfs_new_extent_direct(BTRFS_I(inode), dio_data, start, len);
7278      - 		if (IS_ERR(em)) {
7279      - 			ret = PTR_ERR(em);
7280      - 			goto out;
7281      - 		}
7282      - 		*map = em;
7283      - 		len = min(len, em->len - (start - em->start));
7284      - 		if (len < prev_len)
7285      - 			btrfs_delalloc_release_metadata(BTRFS_I(inode),
7286      - 							prev_len - len, true);
7287      - 	}
7288      -
7289      - 	/*
7290      - 	 * We have created our ordered extent, so we can now release our reservation
7291      - 	 * for an outstanding extent.
7292      - 	 */
7293      - 	btrfs_delalloc_release_extents(BTRFS_I(inode), prev_len);
7294      -
7295      - 	/*
7296      - 	 * Need to update the i_size under the extent lock so buffered
7297      - 	 * readers will get the updated i_size when we unlock.
7298      - 	 */
7299      - 	if (start + len > i_size_read(inode))
7300      - 		i_size_write(inode, start + len);
7301      - out:
7302      - 	if (ret && space_reserved) {
7303      - 		btrfs_delalloc_release_extents(BTRFS_I(inode), len);
7304      - 		btrfs_delalloc_release_metadata(BTRFS_I(inode), len, true);
7305      - 	}
7306      - 	*lenp = len;
7307      - 	return ret;
7308      - }
7309      -
7310      - static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
7311      - 		loff_t length, unsigned int flags, struct iomap *iomap,
7312      - 		struct iomap *srcmap)
7313      - {
7314      - 	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
7315      - 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
7316      - 	struct extent_map *em;
7317      - 	struct extent_state *cached_state = NULL;
7318      - 	struct btrfs_dio_data *dio_data = iter->private;
7319      - 	u64 lockstart, lockend;
7320      - 	const bool write = !!(flags & IOMAP_WRITE);
7321      - 	int ret = 0;
7322      - 	u64 len = length;
7323      - 	const u64 data_alloc_len = length;
7324      - 	bool unlock_extents = false;
7325      -
7326      - 	/*
7327      - 	 * We could potentially fault if we have a buffer > PAGE_SIZE, and if
7328      - 	 * we're NOWAIT we may submit a bio for a partial range and return
7329      - 	 * EIOCBQUEUED, which would result in an errant short read.
7330      - 	 *
7331      - 	 * The best way to handle this would be to allow for partial completions
7332      - 	 * of iocb's, so we could submit the partial bio, return and fault in
7333      - 	 * the rest of the pages, and then submit the io for the rest of the
7334      - 	 * range. However we don't have that currently, so simply return
7335      - 	 * -EAGAIN at this point so that the normal path is used.
7336      - 	 */
7337      - 	if (!write && (flags & IOMAP_NOWAIT) && length > PAGE_SIZE)
7338      - 		return -EAGAIN;
7339      -
7340      - 	/*
7341      - 	 * Cap the size of reads to that usually seen in buffered I/O as we need
7342      - 	 * to allocate a contiguous array for the checksums.
7343      - 	 */
7344      - 	if (!write)
7345      - 		len = min_t(u64, len, fs_info->sectorsize * BTRFS_MAX_BIO_SECTORS);
7346      -
7347      - 	lockstart = start;
7348      - 	lockend = start + len - 1;
7349      -
7350      - 	/*
7351      - 	 * iomap_dio_rw() only does filemap_write_and_wait_range(), which isn't
7352      - 	 * enough if we've written compressed pages to this area, so we need to
7353      - 	 * flush the dirty pages again to make absolutely sure that any
7354      - 	 * outstanding dirty pages are on disk - the first flush only starts
7355      - 	 * compression on the data, while keeping the pages locked, so by the
7356      - 	 * time the second flush returns we know bios for the compressed pages
7357      - 	 * were submitted and finished, and the pages no longer under writeback.
7358      - 	 *
7359      - 	 * If we have a NOWAIT request and we have any pages in the range that
7360      - 	 * are locked, likely due to compression still in progress, we don't want
7361      - 	 * to block on page locks. We also don't want to block on pages marked as
7362      - 	 * dirty or under writeback (same as for the non-compression case).
7363      - 	 * iomap_dio_rw() did the same check, but after that and before we got
7364      - 	 * here, mmap'ed writes may have happened or buffered reads started
7365      - 	 * (readpage() and readahead(), which lock pages), as we haven't locked
7366      - 	 * the file range yet.
7367 - */ 7368 - if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, 7369 - &BTRFS_I(inode)->runtime_flags)) { 7370 - if (flags & IOMAP_NOWAIT) { 7371 - if (filemap_range_needs_writeback(inode->i_mapping, 7372 - lockstart, lockend)) 7373 - return -EAGAIN; 7374 - } else { 7375 - ret = filemap_fdatawrite_range(inode->i_mapping, start, 7376 - start + length - 1); 7377 - if (ret) 7378 - return ret; 7379 - } 7380 - } 7381 - 7382 - memset(dio_data, 0, sizeof(*dio_data)); 7383 - 7384 - /* 7385 - * We always try to allocate data space and must do it before locking 7386 - * the file range, to avoid deadlocks with concurrent writes to the same 7387 - * range if the range has several extents and the writes don't expand the 7388 - * current i_size (the inode lock is taken in shared mode). If we fail to 7389 - * allocate data space here we continue and later, after locking the 7390 - * file range, we fail with ENOSPC only if we figure out we can not do a 7391 - * NOCOW write. 7392 - */ 7393 - if (write && !(flags & IOMAP_NOWAIT)) { 7394 - ret = btrfs_check_data_free_space(BTRFS_I(inode), 7395 - &dio_data->data_reserved, 7396 - start, data_alloc_len, false); 7397 - if (!ret) 7398 - dio_data->data_space_reserved = true; 7399 - else if (ret && !(BTRFS_I(inode)->flags & 7400 - (BTRFS_INODE_NODATACOW | BTRFS_INODE_PREALLOC))) 7401 - goto err; 7402 - } 7403 - 7404 - /* 7405 - * If this errors out it's because we couldn't invalidate pagecache for 7406 - * this range and we need to fallback to buffered IO, or we are doing a 7407 - * NOWAIT read/write and we need to block. 7408 - */ 7409 - ret = lock_extent_direct(inode, lockstart, lockend, &cached_state, flags); 7410 - if (ret < 0) 7411 - goto err; 7412 - 7413 - em = btrfs_get_extent(BTRFS_I(inode), NULL, start, len); 7414 - if (IS_ERR(em)) { 7415 - ret = PTR_ERR(em); 7416 - goto unlock_err; 7417 - } 7418 - 7419 - /* 7420 - * Ok for INLINE and COMPRESSED extents we need to fallback on buffered 7421 - * io. 
INLINE is special, and we could probably kludge it in here, but 7422 - * it's still buffered so for safety lets just fall back to the generic 7423 - * buffered path. 7424 - * 7425 - * For COMPRESSED we _have_ to read the entire extent in so we can 7426 - * decompress it, so there will be buffering required no matter what we 7427 - * do, so go ahead and fallback to buffered. 7428 - * 7429 - * We return -ENOTBLK because that's what makes DIO go ahead and go back 7430 - * to buffered IO. Don't blame me, this is the price we pay for using 7431 - * the generic code. 7432 - */ 7433 - if (extent_map_is_compressed(em) || 7434 - em->block_start == EXTENT_MAP_INLINE) { 7435 - free_extent_map(em); 7436 - /* 7437 - * If we are in a NOWAIT context, return -EAGAIN in order to 7438 - * fallback to buffered IO. This is not only because we can 7439 - * block with buffered IO (no support for NOWAIT semantics at 7440 - * the moment) but also to avoid returning short reads to user 7441 - * space - this happens if we were able to read some data from 7442 - * previous non-compressed extents and then when we fallback to 7443 - * buffered IO, at btrfs_file_read_iter() by calling 7444 - * filemap_read(), we fail to fault in pages for the read buffer, 7445 - * in which case filemap_read() returns a short read (the number 7446 - * of bytes previously read is > 0, so it does not return -EFAULT). 7447 - */ 7448 - ret = (flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOTBLK; 7449 - goto unlock_err; 7450 - } 7451 - 7452 - len = min(len, em->len - (start - em->start)); 7453 - 7454 - /* 7455 - * If we have a NOWAIT request and the range contains multiple extents 7456 - * (or a mix of extents and holes), then we return -EAGAIN to make the 7457 - * caller fallback to a context where it can do a blocking (without 7458 - * NOWAIT) request. 
This way we avoid doing partial IO and returning 7459 - * success to the caller, which is not optimal for writes and for reads 7460 - * it can result in unexpected behaviour for an application. 7461 - * 7462 - * When doing a read, because we use IOMAP_DIO_PARTIAL when calling 7463 - * iomap_dio_rw(), we can end up returning less data then what the caller 7464 - * asked for, resulting in an unexpected, and incorrect, short read. 7465 - * That is, the caller asked to read N bytes and we return less than that, 7466 - * which is wrong unless we are crossing EOF. This happens if we get a 7467 - * page fault error when trying to fault in pages for the buffer that is 7468 - * associated to the struct iov_iter passed to iomap_dio_rw(), and we 7469 - * have previously submitted bios for other extents in the range, in 7470 - * which case iomap_dio_rw() may return us EIOCBQUEUED if not all of 7471 - * those bios have completed by the time we get the page fault error, 7472 - * which we return back to our caller - we should only return EIOCBQUEUED 7473 - * after we have submitted bios for all the extents in the range. 
7474 - */ 7475 - if ((flags & IOMAP_NOWAIT) && len < length) { 7476 - free_extent_map(em); 7477 - ret = -EAGAIN; 7478 - goto unlock_err; 7479 - } 7480 - 7481 - if (write) { 7482 - ret = btrfs_get_blocks_direct_write(&em, inode, dio_data, 7483 - start, &len, flags); 7484 - if (ret < 0) 7485 - goto unlock_err; 7486 - unlock_extents = true; 7487 - /* Recalc len in case the new em is smaller than requested */ 7488 - len = min(len, em->len - (start - em->start)); 7489 - if (dio_data->data_space_reserved) { 7490 - u64 release_offset; 7491 - u64 release_len = 0; 7492 - 7493 - if (dio_data->nocow_done) { 7494 - release_offset = start; 7495 - release_len = data_alloc_len; 7496 - } else if (len < data_alloc_len) { 7497 - release_offset = start + len; 7498 - release_len = data_alloc_len - len; 7499 - } 7500 - 7501 - if (release_len > 0) 7502 - btrfs_free_reserved_data_space(BTRFS_I(inode), 7503 - dio_data->data_reserved, 7504 - release_offset, 7505 - release_len); 7506 - } 7507 - } else { 7508 - /* 7509 - * We need to unlock only the end area that we aren't using. 7510 - * The rest is going to be unlocked by the endio routine. 7511 - */ 7512 - lockstart = start + len; 7513 - if (lockstart < lockend) 7514 - unlock_extents = true; 7515 - } 7516 - 7517 - if (unlock_extents) 7518 - unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend, 7519 - &cached_state); 7520 - else 7521 - free_extent_state(cached_state); 7522 - 7523 - /* 7524 - * Translate extent map information to iomap. 7525 - * We trim the extents (and move the addr) even though iomap code does 7526 - * that, since we have locked only the parts we are performing I/O in. 
7527 - */ 7528 - if ((em->block_start == EXTENT_MAP_HOLE) || 7529 - ((em->flags & EXTENT_FLAG_PREALLOC) && !write)) { 7530 - iomap->addr = IOMAP_NULL_ADDR; 7531 - iomap->type = IOMAP_HOLE; 7532 - } else { 7533 - iomap->addr = em->block_start + (start - em->start); 7534 - iomap->type = IOMAP_MAPPED; 7535 - } 7536 - iomap->offset = start; 7537 - iomap->bdev = fs_info->fs_devices->latest_dev->bdev; 7538 - iomap->length = len; 7539 - free_extent_map(em); 7540 - 7541 - return 0; 7542 - 7543 - unlock_err: 7544 - unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend, 7545 - &cached_state); 7546 - err: 7547 - if (dio_data->data_space_reserved) { 7548 - btrfs_free_reserved_data_space(BTRFS_I(inode), 7549 - dio_data->data_reserved, 7550 - start, data_alloc_len); 7551 - extent_changeset_free(dio_data->data_reserved); 7552 - } 7553 - 7554 - return ret; 7555 - } 7556 - 7557 - static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length, 7558 - ssize_t written, unsigned int flags, struct iomap *iomap) 7559 - { 7560 - struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap); 7561 - struct btrfs_dio_data *dio_data = iter->private; 7562 - size_t submitted = dio_data->submitted; 7563 - const bool write = !!(flags & IOMAP_WRITE); 7564 - int ret = 0; 7565 - 7566 - if (!write && (iomap->type == IOMAP_HOLE)) { 7567 - /* If reading from a hole, unlock and return */ 7568 - unlock_extent(&BTRFS_I(inode)->io_tree, pos, pos + length - 1, 7569 - NULL); 7570 - return 0; 7571 - } 7572 - 7573 - if (submitted < length) { 7574 - pos += submitted; 7575 - length -= submitted; 7576 - if (write) 7577 - btrfs_finish_ordered_extent(dio_data->ordered, NULL, 7578 - pos, length, false); 7579 - else 7580 - unlock_extent(&BTRFS_I(inode)->io_tree, pos, 7581 - pos + length - 1, NULL); 7582 - ret = -ENOTBLK; 7583 - } 7584 - if (write) { 7585 - btrfs_put_ordered_extent(dio_data->ordered); 7586 - dio_data->ordered = NULL; 7587 - } 7588 - 7589 - if (write) 7590 - 
extent_changeset_free(dio_data->data_reserved); 7591 - return ret; 7592 - } 7593 - 7594 - static void btrfs_dio_end_io(struct btrfs_bio *bbio) 7595 - { 7596 - struct btrfs_dio_private *dip = 7597 - container_of(bbio, struct btrfs_dio_private, bbio); 7598 - struct btrfs_inode *inode = bbio->inode; 7599 - struct bio *bio = &bbio->bio; 7600 - 7601 - if (bio->bi_status) { 7602 - btrfs_warn(inode->root->fs_info, 7603 - "direct IO failed ino %llu op 0x%0x offset %#llx len %u err no %d", 7604 - btrfs_ino(inode), bio->bi_opf, 7605 - dip->file_offset, dip->bytes, bio->bi_status); 7606 - } 7607 - 7608 - if (btrfs_op(bio) == BTRFS_MAP_WRITE) { 7609 - btrfs_finish_ordered_extent(bbio->ordered, NULL, 7610 - dip->file_offset, dip->bytes, 7611 - !bio->bi_status); 7612 - } else { 7613 - unlock_extent(&inode->io_tree, dip->file_offset, 7614 - dip->file_offset + dip->bytes - 1, NULL); 7615 - } 7616 - 7617 - bbio->bio.bi_private = bbio->private; 7618 - iomap_dio_bio_end_io(bio); 7619 - } 7620 - 7621 - static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio, 7622 - loff_t file_offset) 7623 - { 7624 - struct btrfs_bio *bbio = btrfs_bio(bio); 7625 - struct btrfs_dio_private *dip = 7626 - container_of(bbio, struct btrfs_dio_private, bbio); 7627 - struct btrfs_dio_data *dio_data = iter->private; 7628 - 7629 - btrfs_bio_init(bbio, BTRFS_I(iter->inode)->root->fs_info, 7630 - btrfs_dio_end_io, bio->bi_private); 7631 - bbio->inode = BTRFS_I(iter->inode); 7632 - bbio->file_offset = file_offset; 7633 - 7634 - dip->file_offset = file_offset; 7635 - dip->bytes = bio->bi_iter.bi_size; 7636 - 7637 - dio_data->submitted += bio->bi_iter.bi_size; 7638 - 7639 - /* 7640 - * Check if we are doing a partial write. If we are, we need to split 7641 - * the ordered extent to match the submitted bio. 
Hang on to the 7642 - * remaining unfinishable ordered_extent in dio_data so that it can be 7643 - * cancelled in iomap_end to avoid a deadlock wherein faulting the 7644 - * remaining pages is blocked on the outstanding ordered extent. 7645 - */ 7646 - if (iter->flags & IOMAP_WRITE) { 7647 - int ret; 7648 - 7649 - ret = btrfs_extract_ordered_extent(bbio, dio_data->ordered); 7650 - if (ret) { 7651 - btrfs_finish_ordered_extent(dio_data->ordered, NULL, 7652 - file_offset, dip->bytes, 7653 - !ret); 7654 - bio->bi_status = errno_to_blk_status(ret); 7655 - iomap_dio_bio_end_io(bio); 7656 - return; 7657 - } 7658 - } 7659 - 7660 - btrfs_submit_bio(bbio, 0); 7661 - } 7662 - 7663 - static const struct iomap_ops btrfs_dio_iomap_ops = { 7664 - .iomap_begin = btrfs_dio_iomap_begin, 7665 - .iomap_end = btrfs_dio_iomap_end, 7666 - }; 7667 - 7668 - static const struct iomap_dio_ops btrfs_dio_ops = { 7669 - .submit_io = btrfs_dio_submit_io, 7670 - .bio_set = &btrfs_dio_bioset, 7671 - }; 7672 - 7673 - ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter, size_t done_before) 7674 - { 7675 - struct btrfs_dio_data data = { 0 }; 7676 - 7677 - return iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops, 7678 - IOMAP_DIO_PARTIAL, &data, done_before); 7679 - } 7680 - 7681 - struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter, 7682 - size_t done_before) 7683 - { 7684 - struct btrfs_dio_data data = { 0 }; 7685 - 7686 - return __iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops, 7687 - IOMAP_DIO_PARTIAL, &data, done_before); 7688 - } 7689 - 7690 - static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, 7691 - u64 start, u64 len) 7692 - { 7693 - struct btrfs_inode *btrfs_inode = BTRFS_I(inode); 7694 - int ret; 7695 - 7696 - ret = fiemap_prep(inode, fieinfo, start, &len, 0); 7697 - if (ret) 7698 - return ret; 7699 - 7700 - /* 7701 - * fiemap_prep() called filemap_write_and_wait() for the whole possible 7702 - * 
file range (0 to LLONG_MAX), but that is not enough if we have 7703 - * compression enabled. The first filemap_fdatawrite_range() only kicks 7704 - * in the compression of data (in an async thread) and will return 7705 - * before the compression is done and writeback is started. A second 7706 - * filemap_fdatawrite_range() is needed to wait for the compression to 7707 - * complete and writeback to start. We also need to wait for ordered 7708 - * extents to complete, because our fiemap implementation uses mainly 7709 - * file extent items to list the extents, searching for extent maps 7710 - * only for file ranges with holes or prealloc extents to figure out 7711 - * if we have delalloc in those ranges. 7712 - */ 7713 - if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) { 7714 - ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX); 7715 - if (ret) 7716 - return ret; 7717 - } 7718 - 7719 - btrfs_inode_lock(btrfs_inode, BTRFS_ILOCK_SHARED); 7720 - 7721 - /* 7722 - * We did an initial flush to avoid holding the inode's lock while 7723 - * triggering writeback and waiting for the completion of IO and ordered 7724 - * extents. Now after we locked the inode we do it again, because it's 7725 - * possible a new write may have happened in between those two steps. 
-	 */
-	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
-		ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
-		if (ret) {
-			btrfs_inode_unlock(btrfs_inode, BTRFS_ILOCK_SHARED);
-			return ret;
-		}
-	}
-
-	ret = extent_fiemap(btrfs_inode, fieinfo, start, len);
-	btrfs_inode_unlock(btrfs_inode, BTRFS_ILOCK_SHARED);
-
-	return ret;
 }

 /*
··· 7420 8198
 	const u64 min_size = btrfs_calc_metadata_size(fs_info, 1);

 	if (!skip_writeback) {
-		ret = btrfs_wait_ordered_range(&inode->vfs_inode,
+		ret = btrfs_wait_ordered_range(inode,
 					       inode->vfs_inode.i_size & (~mask),
 					       (u64)-1);
 		if (ret)
··· 7621 8399
 	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
 	struct btrfs_inode *ei;
 	struct inode *inode;
-	struct extent_io_tree *file_extent_tree = NULL;
-
-	/* Self tests may pass a NULL fs_info. */
-	if (fs_info && !btrfs_fs_incompat(fs_info, NO_HOLES)) {
-		file_extent_tree = kmalloc(sizeof(struct extent_io_tree), GFP_KERNEL);
-		if (!file_extent_tree)
-			return NULL;
-	}

 	ei = alloc_inode_sb(sb, btrfs_inode_cachep, GFP_KERNEL);
-	if (!ei) {
-		kfree(file_extent_tree);
+	if (!ei)
 		return NULL;
-	}

 	ei->root = NULL;
 	ei->generation = 0;
··· 7637 8425
 	ei->disk_i_size = 0;
 	ei->flags = 0;
 	ei->ro_flags = 0;
+	/*
+	 * ->index_cnt will be properly initialized later when creating a new
+	 * inode (btrfs_create_new_inode()) or when reading an existing inode
+	 * from disk (btrfs_read_locked_inode()).
+	 */
 	ei->csum_bytes = 0;
-	ei->index_cnt = (u64)-1;
 	ei->dir_index = 0;
 	ei->last_unlink_trans = 0;
 	ei->last_reflink_trans = 0;
··· 7669 8453
 	extent_io_tree_init(fs_info, &ei->io_tree, IO_TREE_INODE_IO);
 	ei->io_tree.inode = ei;

-	ei->file_extent_tree = file_extent_tree;
-	if (file_extent_tree) {
-		extent_io_tree_init(fs_info, ei->file_extent_tree,
-				    IO_TREE_INODE_FILE_EXTENT);
-		/* Lockdep class is set only for the file extent tree. */
-		lockdep_set_class(&ei->file_extent_tree->lock, &file_extent_tree_class);
-	}
+	ei->file_extent_tree = NULL;
+
 	mutex_init(&ei->log_mutex);
 	spin_lock_init(&ei->ordered_tree_lock);
 	ei->ordered_tree = RB_ROOT;
 	ei->ordered_tree_last = NULL;
 	INIT_LIST_HEAD(&ei->delalloc_inodes);
 	INIT_LIST_HEAD(&ei->delayed_iput);
-	RB_CLEAR_NODE(&ei->rb_node);
 	init_rwsem(&ei->i_mmap_lock);

 	return inode;
··· 7712 8502
 	if (!S_ISDIR(vfs_inode->i_mode)) {
 		WARN_ON(inode->delalloc_bytes);
 		WARN_ON(inode->new_delalloc_bytes);
+		WARN_ON(inode->csum_bytes);
 	}
-	WARN_ON(inode->csum_bytes);
-	WARN_ON(inode->defrag_bytes);
+	if (!root || !btrfs_is_data_reloc_root(root))
+		WARN_ON(inode->defrag_bytes);

 	/*
 	 * This can happen where we create an inode, but somebody else also
··· 7749 8538
 		}
 	}
 	btrfs_qgroup_check_reserved_leak(inode);
-	inode_tree_del(inode);
+	btrfs_del_inode_from_root(inode);
 	btrfs_drop_extent_map_range(inode, 0, (u64)-1, false);
 	btrfs_inode_clear_file_extent_range(inode, 0, (u64)-1);
 	btrfs_put_root(inode->root);
··· 7783 8572
 	 * destroy cache.
 	 */
 	rcu_barrier();
-	bioset_exit(&btrfs_dio_bioset);
 	kmem_cache_destroy(btrfs_inode_cachep);
 }

··· 7793 8583
 			SLAB_RECLAIM_ACCOUNT | SLAB_ACCOUNT,
 			init_once);
 	if (!btrfs_inode_cachep)
-		goto fail;
-
-	if (bioset_init(&btrfs_dio_bioset, BIO_POOL_SIZE,
-			offsetof(struct btrfs_dio_private, bbio.bio),
-			BIOSET_NEED_BVECS))
-		goto fail;
+		return -ENOMEM;

 	return 0;
-fail:
-	btrfs_destroy_cachep();
-	return -ENOMEM;
 }

 static int btrfs_getattr(struct mnt_idmap *idmap,
··· 8788 9586
 		}

 		em->start = cur_offset;
-		em->orig_start = cur_offset;
 		em->len = ins.offset;
-		em->block_start = ins.objectid;
-		em->block_len = ins.offset;
-		em->orig_block_len = ins.offset;
+		em->disk_bytenr = ins.objectid;
+		em->offset = 0;
+		em->disk_num_bytes = ins.offset;
 		em->ram_bytes = ins.offset;
 		em->flags |= EXTENT_FLAG_PREALLOC;
 		em->generation = trans->transid;
··· 9157 9956
 	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
 	if (!pages)
 		return -ENOMEM;
-	ret = btrfs_alloc_page_array(nr_pages, pages, 0);
+	ret = btrfs_alloc_page_array(nr_pages, pages, false);
 	if (ret) {
 		ret = -ENOMEM;
 		goto out;
··· 9234 10033
 	for (;;) {
 		struct btrfs_ordered_extent *ordered;

-		ret = btrfs_wait_ordered_range(&inode->vfs_inode, start,
+		ret = btrfs_wait_ordered_range(inode, start,
 					       lockend - start + 1);
 		if (ret)
 			goto out_unlock_inode;
··· 9254 10053
 		goto out_unlock_extent;
 	}

-	if (em->block_start == EXTENT_MAP_INLINE) {
+	if (em->disk_bytenr == EXTENT_MAP_INLINE) {
 		u64 extent_start = em->start;

 		/*
··· 9275 10074
 	 */
 	encoded->len = min_t(u64, extent_map_end(em),
 			     inode->vfs_inode.i_size) - iocb->ki_pos;
-	if (em->block_start == EXTENT_MAP_HOLE ||
+	if (em->disk_bytenr == EXTENT_MAP_HOLE ||
 	    (em->flags & EXTENT_FLAG_PREALLOC)) {
 		disk_bytenr = EXTENT_MAP_HOLE;
 		count = min_t(u64, count, encoded->len);
 		encoded->len = count;
 		encoded->unencoded_len = count;
 	} else if (extent_map_is_compressed(em)) {
-		disk_bytenr = em->block_start;
+		disk_bytenr = em->disk_bytenr;
 		/*
 		 * Bail if the buffer isn't large enough to return the whole
 		 * compressed extent.
 		 */
-		if (em->block_len > count) {
+		if (em->disk_num_bytes > count) {
 			ret = -ENOBUFS;
 			goto out_em;
 		}
-		disk_io_size = em->block_len;
-		count = em->block_len;
+		disk_io_size = em->disk_num_bytes;
+		count = em->disk_num_bytes;
 		encoded->unencoded_len = em->ram_bytes;
-		encoded->unencoded_offset = iocb->ki_pos - em->orig_start;
+		encoded->unencoded_offset = iocb->ki_pos - (em->start - em->offset);
 		ret = btrfs_encoded_io_compression_from_extent(fs_info,
 							       extent_map_compression(em));
 		if (ret < 0)
 			goto out_em;
 		encoded->compression = ret;
 	} else {
-		disk_bytenr = em->block_start + (start - em->start);
+		disk_bytenr = extent_map_block_start(em) + (start - em->start);
 		if (encoded->len > count)
 			encoded->len = count;
 		/*
··· 9356 10155
 	struct extent_changeset *data_reserved = NULL;
 	struct extent_state *cached_state = NULL;
 	struct btrfs_ordered_extent *ordered;
+	struct btrfs_file_extent file_extent;
 	int compression;
 	size_t orig_count;
 	u64 start, end;
··· 9478 10276
 	for (;;) {
 		struct btrfs_ordered_extent *ordered;

-		ret = btrfs_wait_ordered_range(&inode->vfs_inode, start, num_bytes);
+		ret = btrfs_wait_ordered_range(inode, start, num_bytes);
 		if (ret)
 			goto out_folios;
 		ret = invalidate_inode_pages2_range(inode->vfs_inode.i_mapping,
··· 9532 10330
 		goto out_delalloc_release;
 	extent_reserved = true;

-	em = create_io_em(inode, start, num_bytes,
-			  start - encoded->unencoded_offset, ins.objectid,
-			  ins.offset, ins.offset, ram_bytes, compression,
-			  BTRFS_ORDERED_COMPRESSED);
+	file_extent.disk_bytenr = ins.objectid;
+	file_extent.disk_num_bytes = ins.offset;
+	file_extent.num_bytes = num_bytes;
+	file_extent.ram_bytes = ram_bytes;
+	file_extent.offset = encoded->unencoded_offset;
+	file_extent.compression = compression;
+	em = btrfs_create_io_em(inode, start, &file_extent, BTRFS_ORDERED_COMPRESSED);
 	if (IS_ERR(em)) {
 		ret = PTR_ERR(em);
 		goto out_free_reserved;
 	}
 	free_extent_map(em);

-	ordered = btrfs_alloc_ordered_extent(inode, start, num_bytes, ram_bytes,
-				       ins.objectid, ins.offset,
-				       encoded->unencoded_offset,
+	ordered = btrfs_alloc_ordered_extent(inode, start, &file_extent,
 				       (1 << BTRFS_ORDERED_ENCODED) |
-				       (1 << BTRFS_ORDERED_COMPRESSED),
-				       compression);
+				       (1 << BTRFS_ORDERED_COMPRESSED));
 	if (IS_ERR(ordered)) {
 		btrfs_drop_extent_map_range(inode, start, end, false);
 		ret = PTR_ERR(ordered);
··· 9751 10549
 	 * file changes again after this, the user is doing something stupid and
 	 * we don't really care.
 	 */
-	ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
+	ret = btrfs_wait_ordered_range(BTRFS_I(inode), 0, (u64)-1);
 	if (ret)
 		return ret;

··· 9837 10635
 			goto out;
 		}

-		if (em->block_start == EXTENT_MAP_HOLE) {
+		if (em->disk_bytenr == EXTENT_MAP_HOLE) {
 			btrfs_warn(fs_info, "swapfile must not have holes");
 			ret = -EINVAL;
 			goto out;
 		}
-		if (em->block_start == EXTENT_MAP_INLINE) {
+		if (em->disk_bytenr == EXTENT_MAP_INLINE) {
 			/*
 			 * It's unlikely we'll ever actually find ourselves
 			 * here, as a file small enough to fit inline won't be
··· 9860 10658
 			goto out;
 		}

-		logical_block_start = em->block_start + (start - em->start);
+		logical_block_start = extent_map_block_start(em) + (start - em->start);
 		len = min(len, em->len - (start - em->start));
 		free_extent_map(em);
 		em = NULL;

-		ret = can_nocow_extent(inode, start, &len, NULL, NULL, NULL, false, true);
+		ret = can_nocow_extent(inode, start, &len, NULL, false, true);
 		if (ret < 0) {
 			goto out;
 		} else if (ret) {
··· 10062 10860
  */
 struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino)
 {
-	struct rb_node *node;
-	struct rb_node *prev;
 	struct btrfs_inode *inode;
+	unsigned long from = min_ino;

-	spin_lock(&root->inode_lock);
-again:
-	node = root->inode_tree.rb_node;
-	prev = NULL;
-	while (node) {
-		prev = node;
-		inode = rb_entry(node, struct btrfs_inode, rb_node);
-		if (min_ino < btrfs_ino(inode))
-			node = node->rb_left;
-		else if (min_ino > btrfs_ino(inode))
-			node = node->rb_right;
-		else
+	xa_lock(&root->inodes);
+	while (true) {
+		inode = xa_find(&root->inodes, &from, ULONG_MAX, XA_PRESENT);
+		if (!inode)
 			break;
+		if (igrab(&inode->vfs_inode))
+			break;
+
+		from = btrfs_ino(inode) + 1;
+		cond_resched_lock(&root->inodes.xa_lock);
 	}
+	xa_unlock(&root->inodes);

-	if (!node) {
-		while (prev) {
-			inode = rb_entry(prev, struct btrfs_inode, rb_node);
-			if (min_ino <= btrfs_ino(inode)) {
-				node = prev;
-				break;
-			}
-			prev = rb_next(prev);
-		}
-	}
-
-	while (node) {
-		inode = rb_entry(prev, struct btrfs_inode, rb_node);
-		if (igrab(&inode->vfs_inode)) {
-			spin_unlock(&root->inode_lock);
-			return inode;
-		}
-
-		min_ino = btrfs_ino(inode) + 1;
-		if (cond_resched_lock(&root->inode_lock))
-			goto again;
-
-		node = rb_next(node);
-	}
-	spin_unlock(&root->inode_lock);
-
-	return NULL;
+	return inode;
 }

 static const struct inode_operations btrfs_dir_inode_operations = {
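Reviewer note on the new btrfs_find_first_inode() loop: the control flow can be tried outside the kernel with a small userspace sketch. Everything named demo_* below is hypothetical and not part of the patch. A sorted array stands in for the xarray, demo_find() stands in for xa_find(), and the grabbable flag stands in for igrab() succeeding. Locking and cond_resched_lock() are left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct btrfs_inode. */
struct demo_inode {
	uint64_t ino;
	bool grabbable;		/* models igrab() returning non-NULL */
};

/*
 * Models xa_find(): return the first entry with ino >= *from and store its
 * index back into *from. The table is assumed to be sorted by ino.
 */
static struct demo_inode *demo_find(struct demo_inode *tbl, size_t n,
				    uint64_t *from)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].ino >= *from) {
			*from = tbl[i].ino;
			return &tbl[i];
		}
	}
	return NULL;
}

/* Same loop shape as the patched function: skip entries we cannot grab. */
static struct demo_inode *demo_find_first_inode(struct demo_inode *tbl,
						size_t n, uint64_t min_ino)
{
	uint64_t from = min_ino;
	struct demo_inode *inode;

	while (true) {
		inode = demo_find(tbl, n, &from);
		if (!inode)
			break;		/* nothing at or after 'from' */
		if (inode->grabbable)
			break;		/* got a usable reference */
		from = inode->ino + 1;	/* retry after the skipped entry */
	}
	return inode;
}

/* Small self-check on a sample table. */
static void demo_selftest(void)
{
	struct demo_inode tbl[] = {
		{ 256, false }, { 257, false }, { 300, true }, { 512, true },
	};

	assert(demo_find_first_inode(tbl, 4, 256) == &tbl[2]);
	assert(demo_find_first_inode(tbl, 4, 301) == &tbl[3]);
	assert(demo_find_first_inode(tbl, 4, 513) == NULL);
}
```

Because the loop exits with inode either NULL or successfully grabbed, a single return statement covers both outcomes, which is what lets the patch drop the separate "return NULL" path.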
+69 -25
fs/btrfs/ioctl.c
··· 375 375 return PTR_ERR(trans); 376 376 377 377 if (comp) { 378 - ret = btrfs_set_prop(trans, inode, "btrfs.compression", comp, 379 - strlen(comp), 0); 378 + ret = btrfs_set_prop(trans, BTRFS_I(inode), "btrfs.compression", 379 + comp, strlen(comp), 0); 380 380 if (ret) { 381 381 btrfs_abort_transaction(trans, ret); 382 382 goto out_end_trans; 383 383 } 384 384 } else { 385 - ret = btrfs_set_prop(trans, inode, "btrfs.compression", NULL, 386 - 0, 0); 385 + ret = btrfs_set_prop(trans, BTRFS_I(inode), "btrfs.compression", 386 + NULL, 0, 0); 387 387 if (ret && ret != -ENODATA) { 388 388 btrfs_abort_transaction(trans, ret); 389 389 goto out_end_trans; ··· 552 552 return 0; 553 553 } 554 554 555 - int __pure btrfs_is_empty_uuid(u8 *uuid) 555 + int __pure btrfs_is_empty_uuid(const u8 *uuid) 556 556 { 557 557 int i; 558 558 ··· 658 658 ret = PTR_ERR(trans); 659 659 goto out_release_rsv; 660 660 } 661 - ret = btrfs_record_root_in_trans(trans, BTRFS_I(dir)->root); 662 - if (ret) 663 - goto out; 664 661 btrfs_qgroup_convert_reserved_meta(root, qgroup_reserved); 665 662 qgroup_reserved = 0; 666 663 trans->block_rsv = &block_rsv; 667 664 trans->bytes_reserved = block_rsv.size; 668 - /* Tree log can't currently deal with an inode which is a new root. 
*/ 669 - btrfs_set_log_full_commit(trans); 670 665 671 666 ret = btrfs_qgroup_inherit(trans, 0, objectid, btrfs_root_id(root), inherit); 672 667 if (ret) ··· 714 719 ret = btrfs_insert_root(trans, fs_info->tree_root, &key, 715 720 root_item); 716 721 if (ret) { 722 + int ret2; 723 + 717 724 /* 718 725 * Since we don't abort the transaction in this case, free the 719 726 * tree block so that we don't leak space and leave the ··· 726 729 btrfs_tree_lock(leaf); 727 730 btrfs_clear_buffer_dirty(trans, leaf); 728 731 btrfs_tree_unlock(leaf); 729 - btrfs_free_tree_block(trans, objectid, leaf, 0, 1); 732 + ret2 = btrfs_free_tree_block(trans, objectid, leaf, 0, 1); 733 + if (ret2 < 0) 734 + btrfs_abort_transaction(trans, ret2); 730 735 free_extent_buffer(leaf); 731 736 goto out; 732 737 } ··· 765 766 btrfs_abort_transaction(trans, ret); 766 767 goto out; 767 768 } 769 + 770 + btrfs_record_new_subvolume(trans, BTRFS_I(dir)); 768 771 769 772 d_instantiate_new(dentry, new_inode_args.inode); 770 773 new_inode_args.inode = NULL; ··· 855 854 pending_snapshot->dentry = dentry; 856 855 pending_snapshot->root = root; 857 856 pending_snapshot->readonly = readonly; 858 - pending_snapshot->dir = dir; 857 + pending_snapshot->dir = BTRFS_I(dir); 859 858 pending_snapshot->inherit = inherit; 860 859 861 860 trans = btrfs_start_transaction(root, 0); ··· 1071 1070 atomic_inc(&root->snapshot_force_cow); 1072 1071 snapshot_force_cow = true; 1073 1072 1074 - btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1); 1073 + btrfs_wait_ordered_extents(root, U64_MAX, NULL); 1075 1074 1076 1075 ret = btrfs_mksubvol(parent, idmap, name, namelen, 1077 1076 root, readonly, inherit); ··· 1918 1917 struct btrfs_ioctl_ino_lookup_user_args *args) 1919 1918 { 1920 1919 struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info; 1921 - struct super_block *sb = inode->i_sb; 1922 - struct btrfs_key upper_limit = BTRFS_I(inode)->location; 1920 + u64 upper_limit = btrfs_ino(BTRFS_I(inode)); 1923 1921 u64 treeid 
= btrfs_root_id(BTRFS_I(inode)->root); 1924 1922 u64 dirid = args->dirid; 1925 1923 unsigned long item_off; ··· 1944 1944 * If the bottom subvolume does not exist directly under upper_limit, 1945 1945 * construct the path in from the bottom up. 1946 1946 */ 1947 - if (dirid != upper_limit.objectid) { 1947 + if (dirid != upper_limit) { 1948 1948 ptr = &args->path[BTRFS_INO_LOOKUP_USER_PATH_MAX - 1]; 1949 1949 1950 1950 root = btrfs_get_fs_root(fs_info, treeid, true); ··· 2006 2006 * btree and lock the same leaf. 2007 2007 */ 2008 2008 btrfs_release_path(path); 2009 - temp_inode = btrfs_iget(sb, key2.objectid, root); 2009 + temp_inode = btrfs_iget(key2.objectid, root); 2010 2010 if (IS_ERR(temp_inode)) { 2011 2011 ret = PTR_ERR(temp_inode); 2012 2012 goto out_put; ··· 2019 2019 goto out_put; 2020 2020 } 2021 2021 2022 - if (key.offset == upper_limit.objectid) 2022 + if (key.offset == upper_limit) 2023 2023 break; 2024 2024 if (key.objectid == BTRFS_FIRST_FREE_OBJECTID) { 2025 2025 ret = -EACCES; ··· 2140 2140 inode = file_inode(file); 2141 2141 2142 2142 if (args->dirid == BTRFS_FIRST_FREE_OBJECTID && 2143 - BTRFS_I(inode)->location.objectid != BTRFS_FIRST_FREE_OBJECTID) { 2143 + btrfs_ino(BTRFS_I(inode)) != BTRFS_FIRST_FREE_OBJECTID) { 2144 2144 /* 2145 2145 * The subvolume does not exist under fd with which this is 2146 2146 * called ··· 3807 3807 return ret; 3808 3808 } 3809 3809 3810 + /* 3811 + * Quick check for ioctl handlers if quotas are enabled. Proper locking must be 3812 + * done before any operations. 
3813 + */ 3814 + static bool qgroup_enabled(struct btrfs_fs_info *fs_info) 3815 + { 3816 + bool ret = true; 3817 + 3818 + mutex_lock(&fs_info->qgroup_ioctl_lock); 3819 + if (!fs_info->quota_root) 3820 + ret = false; 3821 + mutex_unlock(&fs_info->qgroup_ioctl_lock); 3822 + 3823 + return ret; 3824 + } 3825 + 3810 3826 static long btrfs_ioctl_qgroup_assign(struct file *file, void __user *arg) 3811 3827 { 3812 3828 struct inode *inode = file_inode(file); 3813 3829 struct btrfs_fs_info *fs_info = inode_to_fs_info(inode); 3814 3830 struct btrfs_root *root = BTRFS_I(inode)->root; 3815 3831 struct btrfs_ioctl_qgroup_assign_args *sa; 3832 + struct btrfs_qgroup_list *prealloc = NULL; 3816 3833 struct btrfs_trans_handle *trans; 3817 3834 int ret; 3818 3835 int err; 3819 3836 3820 3837 if (!capable(CAP_SYS_ADMIN)) 3821 3838 return -EPERM; 3839 + 3840 + if (!qgroup_enabled(root->fs_info)) 3841 + return -ENOTCONN; 3822 3842 3823 3843 ret = mnt_want_write_file(file); 3824 3844 if (ret) ··· 3850 3830 goto drop_write; 3851 3831 } 3852 3832 3833 + if (sa->assign) { 3834 + prealloc = kzalloc(sizeof(*prealloc), GFP_KERNEL); 3835 + if (!prealloc) { 3836 + ret = -ENOMEM; 3837 + goto drop_write; 3838 + } 3839 + } 3840 + 3853 3841 trans = btrfs_join_transaction(root); 3854 3842 if (IS_ERR(trans)) { 3855 3843 ret = PTR_ERR(trans); 3856 3844 goto out; 3857 3845 } 3858 3846 3847 + /* 3848 + * Prealloc ownership is moved to the relation handler, there it's used 3849 + * or freed on error. 
3850 + */ 3859 3851 if (sa->assign) { 3860 - ret = btrfs_add_qgroup_relation(trans, sa->src, sa->dst); 3852 + ret = btrfs_add_qgroup_relation(trans, sa->src, sa->dst, prealloc); 3853 + prealloc = NULL; 3861 3854 } else { 3862 3855 ret = btrfs_del_qgroup_relation(trans, sa->src, sa->dst); 3863 3856 } ··· 3880 3847 err = btrfs_run_qgroups(trans); 3881 3848 mutex_unlock(&fs_info->qgroup_ioctl_lock); 3882 3849 if (err < 0) 3883 - btrfs_handle_fs_error(fs_info, err, 3884 - "failed to update qgroup status and info"); 3850 + btrfs_warn(fs_info, 3851 + "qgroup status update failed after %s relation, marked as inconsistent", 3852 + sa->assign ? "adding" : "deleting"); 3885 3853 err = btrfs_end_transaction(trans); 3886 3854 if (err && !ret) 3887 3855 ret = err; 3888 3856 3889 3857 out: 3858 + kfree(prealloc); 3890 3859 kfree(sa); 3891 3860 drop_write: 3892 3861 mnt_drop_write_file(file); ··· 3906 3871 3907 3872 if (!capable(CAP_SYS_ADMIN)) 3908 3873 return -EPERM; 3874 + 3875 + if (!qgroup_enabled(root->fs_info)) 3876 + return -ENOTCONN; 3909 3877 3910 3878 ret = mnt_want_write_file(file); 3911 3879 if (ret) ··· 3966 3928 if (!capable(CAP_SYS_ADMIN)) 3967 3929 return -EPERM; 3968 3930 3931 + if (!qgroup_enabled(root->fs_info)) 3932 + return -ENOTCONN; 3933 + 3969 3934 ret = mnt_want_write_file(file); 3970 3935 if (ret) 3971 3936 return ret; ··· 4013 3972 4014 3973 if (!capable(CAP_SYS_ADMIN)) 4015 3974 return -EPERM; 3975 + 3976 + if (!qgroup_enabled(fs_info)) 3977 + return -ENOTCONN; 4016 3978 4017 3979 ret = mnt_want_write_file(file); 4018 3980 if (ret) ··· 4473 4429 return ret; 4474 4430 } 4475 4431 4476 - static int _btrfs_ioctl_send(struct inode *inode, void __user *argp, bool compat) 4432 + static int _btrfs_ioctl_send(struct btrfs_inode *inode, void __user *argp, bool compat) 4477 4433 { 4478 4434 struct btrfs_ioctl_send_args *arg; 4479 4435 int ret; ··· 4795 4751 return btrfs_ioctl_set_received_subvol_32(file, argp); 4796 4752 #endif 4797 4753 case BTRFS_IOC_SEND: 
4798 - return _btrfs_ioctl_send(inode, argp, false); 4754 + return _btrfs_ioctl_send(BTRFS_I(inode), argp, false); 4799 4755 #if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT) 4800 4756 case BTRFS_IOC_SEND_32: 4801 - return _btrfs_ioctl_send(inode, argp, true); 4757 + return _btrfs_ioctl_send(BTRFS_I(inode), argp, true); 4802 4758 #endif 4803 4759 case BTRFS_IOC_GET_DEV_STATS: 4804 4760 return btrfs_ioctl_get_dev_stats(fs_info, argp);
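A note on the qgroup assign hunk above: the relation list entry is no longer allocated inside btrfs_add_qgroup_relation() (previously kzalloc with GFP_NOFS). The ioctl handler now allocates it with GFP_KERNEL before joining the transaction and hands the pointer over, after which the handler clears its own copy so the common cleanup path frees nothing twice. The handoff can be sketched in plain userspace C. Everything below (struct relation, add_relation(), assign_relation(), the single-slot store, the join_fails switch) is hypothetical scaffolding for illustration, not kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct btrfs_qgroup_list. */
struct relation {
	int src;
	int dst;
};

/* Hypothetical single-slot store standing in for the qgroup relation lists. */
static struct relation *stored_relation;

/*
 * Takes ownership of @prealloc: it is either linked into the store or
 * freed on error, so the caller must not use it afterwards.
 */
static int add_relation(int src, int dst, struct relation *prealloc)
{
	assert(prealloc);

	if (src >= dst) {		/* stand-in for the qgroup level check */
		free(prealloc);
		return -EINVAL;
	}
	prealloc->src = src;
	prealloc->dst = dst;
	free(stored_relation);
	stored_relation = prealloc;
	return 0;
}

/* Caller side, mirroring the shape of the ioctl handler. */
static int assign_relation(int src, int dst, bool join_fails)
{
	struct relation *prealloc;
	int ret;

	/* Allocate up front, before the "transaction" starts. */
	prealloc = calloc(1, sizeof(*prealloc));
	if (!prealloc)
		return -ENOMEM;

	if (join_fails) {		/* stand-in for joining the transaction failing */
		ret = -EIO;
		goto out;
	}

	ret = add_relation(src, dst, prealloc);
	/* Ownership moved to add_relation(); make the cleanup below a no-op. */
	prealloc = NULL;
out:
	free(prealloc);			/* free(NULL) does nothing */
	return ret;
}
```

Clearing the pointer right after the call is what lets a single cleanup label serve both the early-failure path (the entry is still owned by the caller) and the normal path (the entry is owned by the callee).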
+1 -1
fs/btrfs/ioctl.h
···
19 19 struct dentry *dentry, struct fileattr *fa);
20 20 int btrfs_ioctl_get_supported_features(void __user *arg);
21 21 void btrfs_sync_inode_flags_to_i_flags(struct inode *inode);
22 - int __pure btrfs_is_empty_uuid(u8 *uuid);
22 + int __pure btrfs_is_empty_uuid(const u8 *uuid);
23 23 void btrfs_update_ioctl_balance_args(struct btrfs_fs_info *fs_info,
24 24 struct btrfs_ioctl_balance_args *bargs);
25 25
-1
fs/btrfs/locking.h
···
11 11 #include <linux/lockdep.h>
12 12 #include <linux/percpu_counter.h>
13 13 #include "extent_io.h"
14 - #include "locking.h"
15 14
16 15 struct extent_buffer;
17 16 struct btrfs_path;
-1
fs/btrfs/lru_cache.h
···
6 6 #include <linux/types.h>
7 7 #include <linux/maple_tree.h>
8 8 #include <linux/list.h>
9 - #include "lru_cache.h"
10 9
11 10 /*
12 11 * A cache entry. This is meant to be embedded in a structure of a user of
+30 -13
fs/btrfs/lzo.c
···
258 258 workspace->cbuf, &out_len,
259 259 workspace->mem);
260 260 kunmap_local(data_in);
261 - if (ret < 0) {
262 - pr_debug("BTRFS: lzo in loop returned %d\n", ret);
261 + if (unlikely(ret < 0)) {
262 + /* lzo1x_1_compress never fails. */
263 263 ret = -EIO;
264 264 goto out;
265 265 }
···
354 354 * and all sectors should be used.
355 355 * If this happens, it means the compressed extent is corrupted.
356 356 */
357 - if (len_in > min_t(size_t, BTRFS_MAX_COMPRESSED, cb->compressed_len) ||
358 - round_up(len_in, sectorsize) < cb->compressed_len) {
357 + if (unlikely(len_in > min_t(size_t, BTRFS_MAX_COMPRESSED, cb->compressed_len) ||
358 + round_up(len_in, sectorsize) < cb->compressed_len)) {
359 + struct btrfs_inode *inode = cb->bbio.inode;
360 +
359 361 btrfs_err(fs_info,
360 - "invalid lzo header, lzo len %u compressed len %u",
361 - len_in, cb->compressed_len);
362 + "lzo header invalid, root %llu inode %llu offset %llu lzo len %u compressed len %u",
363 + btrfs_root_id(inode->root), btrfs_ino(inode),
364 + cb->start, len_in, cb->compressed_len);
362 365 return -EUCLEAN;
363 366 }
364 367
···
386 383 kunmap_local(kaddr);
387 384 cur_in += LZO_LEN;
388 385
389 - if (seg_len > WORKSPACE_CBUF_LENGTH) {
386 + if (unlikely(seg_len > WORKSPACE_CBUF_LENGTH)) {
387 + struct btrfs_inode *inode = cb->bbio.inode;
388 +
390 389 /*
391 390 * seg_len shouldn't be larger than we have allocated
392 391 * for workspace->cbuf
393 392 */
394 - btrfs_err(fs_info, "unexpectedly large lzo segment len %u",
395 - seg_len);
393 + btrfs_err(fs_info,
394 + "lzo segment too big, root %llu inode %llu offset %llu len %u",
395 + btrfs_root_id(inode->root), btrfs_ino(inode),
396 + cb->start, seg_len);
396 397 return -EIO;
397 398 }
398 399
···
406 399 /* Decompress the data */
407 400 ret = lzo1x_decompress_safe(workspace->cbuf, seg_len,
408 401 workspace->buf, &out_len);
409 - if (ret != LZO_E_OK) {
410 - btrfs_err(fs_info, "failed to decompress");
402 + if (unlikely(ret != LZO_E_OK)) {
403 + struct btrfs_inode *inode = cb->bbio.inode;
404 +
405 + btrfs_err(fs_info,
406 + "lzo decompression failed, error %d root %llu inode %llu offset %llu",
407 + ret, btrfs_root_id(inode->root), btrfs_ino(inode),
408 + cb->start);
411 409 return -EIO;
412 410 }
413 411
···
466 454
467 455 out_len = sectorsize;
468 456 ret = lzo1x_decompress_safe(data_in, in_len, workspace->buf, &out_len);
469 - if (ret != LZO_E_OK) {
470 - pr_warn("BTRFS: decompress failed!\n");
457 + if (unlikely(ret != LZO_E_OK)) {
458 + struct btrfs_inode *inode = BTRFS_I(dest_page->mapping->host);
459 +
460 + btrfs_err(fs_info,
461 + "lzo decompression failed, error %d root %llu inode %llu offset %llu",
462 + ret, btrfs_root_id(inode->root), btrfs_ino(inode),
463 + page_offset(dest_page));
471 464 ret = -EIO;
472 465 goto out;
473 466 }
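A note on the header sanity check in the lzo.c hunks above: the total length stored in the LZO header is rejected when it exceeds the smaller of BTRFS_MAX_COMPRESSED and the compressed extent length, or when rounding it up to the sector size still falls short of the compressed extent length. A minimal userspace sketch of that predicate follows. The helper names and the 128 KiB value used for the maximum are assumptions for illustration; the kernel uses its own min_t(), round_up() and BTRFS_MAX_COMPRESSED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed value for illustration; the kernel defines BTRFS_MAX_COMPRESSED itself. */
#define MAX_COMPRESSED ((size_t)128 * 1024)

static size_t min_size(size_t a, size_t b)
{
	return a < b ? a : b;
}

/* Round @len up to a multiple of @align, which must be a power of two. */
static size_t round_up_pow2(size_t len, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (len + align - 1) & ~(align - 1);
}

/*
 * Return true when the header length is plausible for the extent:
 * not longer than the extent (or the global maximum), and long enough
 * that all sectors of the extent are actually used.
 */
static bool lzo_header_len_valid(size_t len_in, size_t compressed_len,
				 size_t sectorsize)
{
	if (len_in > min_size(MAX_COMPRESSED, compressed_len))
		return false;
	if (round_up_pow2(len_in, sectorsize) < compressed_len)
		return false;
	return true;
}
```

With a 4096-byte sector and an 8192-byte compressed extent, a header length of 5000 passes (it rounds up to 8192), while 9000 is too long and 3000 would leave the second sector unused.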
+2 -1
fs/btrfs/messages.c
···
20 20 [BTRFS_FS_STATE_TRANS_ABORTED] = 'A',
21 21 [BTRFS_FS_STATE_DEV_REPLACING] = 'R',
22 22 [BTRFS_FS_STATE_DUMMY_FS_INFO] = 0,
23 - [BTRFS_FS_STATE_NO_CSUMS] = 'C',
23 + [BTRFS_FS_STATE_NO_DATA_CSUMS] = 'C',
24 + [BTRFS_FS_STATE_SKIP_META_CSUMS] = 'S',
24 25 [BTRFS_FS_STATE_LOG_CLEANUP_ERROR] = 'L',
25 26 };
26 27
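A note on the table in the messages.c hunk above: each filesystem state bit maps to one character, and an entry of 0 means the state has no character. The sketch below shows one way such a table can be turned into a short status string. The enum values, their numeric order, and state_to_string() are hypothetical; the kernel's own formatting helper is not part of this hunk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit numbers; only the characters are taken from the diff. */
enum fs_state {
	STATE_TRANS_ABORTED,
	STATE_DEV_REPLACING,
	STATE_DUMMY_FS_INFO,
	STATE_NO_DATA_CSUMS,
	STATE_SKIP_META_CSUMS,
	STATE_LOG_CLEANUP_ERROR,
	STATE_COUNT
};

static const char state_chars[STATE_COUNT] = {
	[STATE_TRANS_ABORTED]		= 'A',
	[STATE_DEV_REPLACING]		= 'R',
	[STATE_DUMMY_FS_INFO]		= 0,	/* no character for this state */
	[STATE_NO_DATA_CSUMS]		= 'C',
	[STATE_SKIP_META_CSUMS]		= 'S',
	[STATE_LOG_CLEANUP_ERROR]	= 'L',
};

/*
 * Write one character per set state bit into @buf, in bit order.
 * @buf must have room for STATE_COUNT + 1 bytes.
 */
static void state_to_string(unsigned long state, char *buf)
{
	char *cur = buf;

	assert(buf);
	for (int bit = 0; bit < STATE_COUNT; bit++) {
		if ((state & (1UL << bit)) && state_chars[bit])
			*cur++ = state_chars[bit];
	}
	*cur = '\0';
}
```

Splitting the old single 'C' state into separate data and metadata entries means a mount that skips both kinds of checksums would show both letters under this scheme.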
+2 -2
fs/btrfs/misc.h
···
66 66 u64 bytenr;
67 67 };
68 68
69 - static inline struct rb_node *rb_simple_search(struct rb_root *root, u64 bytenr)
69 + static inline struct rb_node *rb_simple_search(const struct rb_root *root, u64 bytenr)
70 70 {
71 71 struct rb_node *node = root->rb_node;
72 72 struct rb_simple_node *entry;
···
93 93 * Return the rb_node that start at or after @bytenr. If there is no entry at
94 94 * or after @bytner return NULL.
95 95 */
96 - static inline struct rb_node *rb_simple_search_first(struct rb_root *root,
96 + static inline struct rb_node *rb_simple_search_first(const struct rb_root *root,
97 97 u64 bytenr)
98 98 {
99 99 struct rb_node *node = root->rb_node, *ret = NULL;
+95 -51
fs/btrfs/ordered-data.c
··· 19 19 #include "qgroup.h" 20 20 #include "subpage.h" 21 21 #include "file.h" 22 + #include "block-group.h" 22 23 23 24 static struct kmem_cache *btrfs_ordered_extent_cache; 24 25 ··· 180 179 entry->disk_num_bytes = disk_num_bytes; 181 180 entry->offset = offset; 182 181 entry->bytes_left = num_bytes; 183 - entry->inode = igrab(&inode->vfs_inode); 182 + entry->inode = BTRFS_I(igrab(&inode->vfs_inode)); 184 183 entry->compress_type = compress_type; 185 184 entry->truncated_len = (u64)-1; 186 185 entry->qgroup_rsv = qgroup_rsv; ··· 208 207 209 208 static void insert_ordered_extent(struct btrfs_ordered_extent *entry) 210 209 { 211 - struct btrfs_inode *inode = BTRFS_I(entry->inode); 210 + struct btrfs_inode *inode = entry->inode; 212 211 struct btrfs_root *root = inode->root; 213 212 struct btrfs_fs_info *fs_info = root->fs_info; 214 213 struct rb_node *node; ··· 224 223 spin_lock_irq(&inode->ordered_tree_lock); 225 224 node = tree_insert(&inode->ordered_tree, entry->file_offset, 226 225 &entry->rb_node); 227 - if (node) 226 + if (unlikely(node)) 228 227 btrfs_panic(fs_info, -EEXIST, 229 228 "inconsistency in ordered tree at offset %llu", 230 229 entry->file_offset); ··· 264 263 */ 265 264 struct btrfs_ordered_extent *btrfs_alloc_ordered_extent( 266 265 struct btrfs_inode *inode, u64 file_offset, 267 - u64 num_bytes, u64 ram_bytes, u64 disk_bytenr, 268 - u64 disk_num_bytes, u64 offset, unsigned long flags, 269 - int compress_type) 266 + const struct btrfs_file_extent *file_extent, unsigned long flags) 270 267 { 271 268 struct btrfs_ordered_extent *entry; 272 269 273 270 ASSERT((flags & ~BTRFS_ORDERED_TYPE_FLAGS) == 0); 274 271 275 - entry = alloc_ordered_extent(inode, file_offset, num_bytes, ram_bytes, 276 - disk_bytenr, disk_num_bytes, offset, flags, 277 - compress_type); 272 + /* 273 + * For regular writes, we just use the members in @file_extent. 
274 + * 275 + * For NOCOW, we don't really care about the numbers except @start and 276 + * file_extent->num_bytes, as we won't insert a file extent item at all. 277 + * 278 + * For PREALLOC, we do not use ordered extent members, but 279 + * btrfs_mark_extent_written() handles everything. 280 + * 281 + * So here we always pass 0 as offset for NOCOW/PREALLOC ordered extents, 282 + * or btrfs_split_ordered_extent() cannot handle it correctly. 283 + */ 284 + if (flags & ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC))) 285 + entry = alloc_ordered_extent(inode, file_offset, 286 + file_extent->num_bytes, 287 + file_extent->num_bytes, 288 + file_extent->disk_bytenr + file_extent->offset, 289 + file_extent->num_bytes, 0, flags, 290 + file_extent->compression); 291 + else 292 + entry = alloc_ordered_extent(inode, file_offset, 293 + file_extent->num_bytes, 294 + file_extent->ram_bytes, 295 + file_extent->disk_bytenr, 296 + file_extent->disk_num_bytes, 297 + file_extent->offset, flags, 298 + file_extent->compression); 278 299 if (!IS_ERR(entry)) 279 300 insert_ordered_extent(entry); 280 301 return entry; ··· 310 287 void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry, 311 288 struct btrfs_ordered_sum *sum) 312 289 { 313 - struct btrfs_inode *inode = BTRFS_I(entry->inode); 290 + struct btrfs_inode *inode = entry->inode; 314 291 315 292 spin_lock_irq(&inode->ordered_tree_lock); 316 293 list_add_tail(&sum->list, &entry->list); ··· 320 297 void btrfs_mark_ordered_extent_error(struct btrfs_ordered_extent *ordered) 321 298 { 322 299 if (!test_and_set_bit(BTRFS_ORDERED_IOERR, &ordered->flags)) 323 - mapping_set_error(ordered->inode->i_mapping, -EIO); 300 + mapping_set_error(ordered->inode->vfs_inode.i_mapping, -EIO); 324 301 } 325 302 326 303 static void finish_ordered_fn(struct btrfs_work *work) ··· 335 312 struct page *page, u64 file_offset, 336 313 u64 len, bool uptodate) 337 314 { 338 - struct btrfs_inode *inode = BTRFS_I(ordered->inode); 315 + struct 
btrfs_inode *inode = ordered->inode; 339 316 struct btrfs_fs_info *fs_info = inode->root->fs_info; 340 317 341 318 lockdep_assert_held(&inode->ordered_tree_lock); ··· 388 365 389 366 static void btrfs_queue_ordered_fn(struct btrfs_ordered_extent *ordered) 390 367 { 391 - struct btrfs_inode *inode = BTRFS_I(ordered->inode); 368 + struct btrfs_inode *inode = ordered->inode; 392 369 struct btrfs_fs_info *fs_info = inode->root->fs_info; 393 370 struct btrfs_workqueue *wq = btrfs_is_free_space_inode(inode) ? 394 371 fs_info->endio_freespace_worker : fs_info->endio_write_workers; ··· 397 374 btrfs_queue_work(wq, &ordered->work); 398 375 } 399 376 400 - bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, 377 + void btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, 401 378 struct page *page, u64 file_offset, u64 len, 402 379 bool uptodate) 403 380 { 404 - struct btrfs_inode *inode = BTRFS_I(ordered->inode); 381 + struct btrfs_inode *inode = ordered->inode; 405 382 unsigned long flags; 406 383 bool ret; 407 384 ··· 444 421 445 422 if (ret) 446 423 btrfs_queue_ordered_fn(ordered); 447 - return ret; 448 424 } 449 425 450 426 /* ··· 610 588 struct list_head *cur; 611 589 struct btrfs_ordered_sum *sum; 612 590 613 - trace_btrfs_ordered_extent_put(BTRFS_I(entry->inode), entry); 591 + trace_btrfs_ordered_extent_put(entry->inode, entry); 614 592 615 593 if (refcount_dec_and_test(&entry->refs)) { 616 594 ASSERT(list_empty(&entry->root_extent_list)); 617 595 ASSERT(list_empty(&entry->log_list)); 618 596 ASSERT(RB_EMPTY_NODE(&entry->rb_node)); 619 597 if (entry->inode) 620 - btrfs_add_delayed_iput(BTRFS_I(entry->inode)); 598 + btrfs_add_delayed_iput(entry->inode); 621 599 while (!list_empty(&entry->list)) { 622 600 cur = entry->list.next; 623 601 sum = list_entry(cur, struct btrfs_ordered_sum, list); ··· 648 626 freespace_inode = btrfs_is_free_space_inode(btrfs_inode); 649 627 650 628 btrfs_lockdep_acquire(fs_info, btrfs_trans_pending_ordered); 651 
- /* This is paired with btrfs_alloc_ordered_extent. */ 629 + /* This is paired with alloc_ordered_extent(). */ 652 630 spin_lock(&btrfs_inode->lock); 653 631 btrfs_mod_outstanding_extents(btrfs_inode, -1); 654 632 spin_unlock(&btrfs_inode->lock); ··· 734 712 } 735 713 736 714 /* 737 - * wait for all the ordered extents in a root. This is done when balancing 738 - * space between drives. 715 + * Wait for all the ordered extents in a root. Use @bg as range or do whole 716 + * range if it's NULL. 739 717 */ 740 718 u64 btrfs_wait_ordered_extents(struct btrfs_root *root, u64 nr, 741 - const u64 range_start, const u64 range_len) 719 + const struct btrfs_block_group *bg) 742 720 { 743 721 struct btrfs_fs_info *fs_info = root->fs_info; 744 722 LIST_HEAD(splice); ··· 746 724 LIST_HEAD(works); 747 725 struct btrfs_ordered_extent *ordered, *next; 748 726 u64 count = 0; 749 - const u64 range_end = range_start + range_len; 727 + u64 range_start, range_len; 728 + u64 range_end; 729 + 730 + if (bg) { 731 + range_start = bg->start; 732 + range_len = bg->length; 733 + } else { 734 + range_start = 0; 735 + range_len = U64_MAX; 736 + } 737 + range_end = range_start + range_len; 750 738 751 739 mutex_lock(&root->ordered_extent_mutex); 752 740 spin_lock(&root->ordered_extent_lock); ··· 783 751 btrfs_queue_work(fs_info->flush_workers, &ordered->flush_work); 784 752 785 753 cond_resched(); 786 - spin_lock(&root->ordered_extent_lock); 787 754 if (nr != U64_MAX) 788 755 nr--; 789 756 count++; 757 + spin_lock(&root->ordered_extent_lock); 790 758 } 791 759 list_splice_tail(&skipped, &root->ordered_extents); 792 760 list_splice_tail(&splice, &root->ordered_extents); ··· 803 771 return count; 804 772 } 805 773 774 + /* 775 + * Wait for @nr ordered extents that intersect the @bg, or the whole range of 776 + * the filesystem if @bg is NULL. 
777 + */ 806 778 void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr, 807 - const u64 range_start, const u64 range_len) 779 + const struct btrfs_block_group *bg) 808 780 { 809 781 struct btrfs_root *root; 810 782 LIST_HEAD(splice); ··· 826 790 &fs_info->ordered_roots); 827 791 spin_unlock(&fs_info->ordered_root_lock); 828 792 829 - done = btrfs_wait_ordered_extents(root, nr, 830 - range_start, range_len); 793 + done = btrfs_wait_ordered_extents(root, nr, bg); 831 794 btrfs_put_root(root); 832 795 833 - spin_lock(&fs_info->ordered_root_lock); 834 - if (nr != U64_MAX) { 796 + if (nr != U64_MAX) 835 797 nr -= done; 836 - } 798 + 799 + spin_lock(&fs_info->ordered_root_lock); 837 800 } 838 801 list_splice_tail(&splice, &fs_info->ordered_roots); 839 802 spin_unlock(&fs_info->ordered_root_lock); ··· 849 814 { 850 815 u64 start = entry->file_offset; 851 816 u64 end = start + entry->num_bytes - 1; 852 - struct btrfs_inode *inode = BTRFS_I(entry->inode); 817 + struct btrfs_inode *inode = entry->inode; 853 818 bool freespace_inode; 854 819 855 820 trace_btrfs_ordered_extent_start(inode, entry); ··· 876 841 /* 877 842 * Used to wait on ordered extents across a large range of bytes. 878 843 */ 879 - int btrfs_wait_ordered_range(struct inode *inode, u64 start, u64 len) 844 + int btrfs_wait_ordered_range(struct btrfs_inode *inode, u64 start, u64 len) 880 845 { 881 846 int ret = 0; 882 847 int ret_wb = 0; ··· 906 871 * before the ordered extents complete - to avoid failures (-EEXIST) 907 872 * when adding the new ordered extents to the ordered tree. 
908 873 */ 909 - ret_wb = filemap_fdatawait_range(inode->i_mapping, start, orig_end); 874 + ret_wb = filemap_fdatawait_range(inode->vfs_inode.i_mapping, start, orig_end); 910 875 911 876 end = orig_end; 912 877 while (1) { 913 - ordered = btrfs_lookup_first_ordered_extent(BTRFS_I(inode), end); 878 + ordered = btrfs_lookup_first_ordered_extent(inode, end); 914 879 if (!ordered) 915 880 break; 916 881 if (ordered->file_offset > orig_end) { ··· 1208 1173 struct btrfs_ordered_extent *btrfs_split_ordered_extent( 1209 1174 struct btrfs_ordered_extent *ordered, u64 len) 1210 1175 { 1211 - struct btrfs_inode *inode = BTRFS_I(ordered->inode); 1176 + struct btrfs_inode *inode = ordered->inode; 1212 1177 struct btrfs_root *root = inode->root; 1213 1178 struct btrfs_fs_info *fs_info = root->fs_info; 1214 1179 u64 file_offset = ordered->file_offset; ··· 1247 1212 /* One ref for the tree. */ 1248 1213 refcount_inc(&new->refs); 1249 1214 1215 + /* 1216 + * Take the root's ordered_extent_lock to avoid a race with 1217 + * btrfs_wait_ordered_extents() when updating the disk_bytenr and 1218 + * disk_num_bytes fields of the ordered extent below. And we disable 1219 + * IRQs because the inode's ordered_tree_lock is used in IRQ context 1220 + * elsewhere. 1221 + * 1222 + * There's no concern about a previous caller of 1223 + * btrfs_wait_ordered_extents() getting the trimmed ordered extent 1224 + * before we insert the new one, because even if it gets the ordered 1225 + * extent before it's trimmed and the new one inserted, right before it 1226 + * uses it or during its use, the ordered extent might have been 1227 + * trimmed in the meanwhile, and it missed the new ordered extent. 1228 + * There's no way around this and it's harmless for current use cases, 1229 + * so we take the root's ordered_extent_lock to fix that race during 1230 + * trimming and silence tools like KCSAN. 
1231 + */ 1250 1232 spin_lock_irq(&root->ordered_extent_lock); 1251 1233 spin_lock(&inode->ordered_tree_lock); 1252 - /* Remove from tree once */ 1253 - node = &ordered->rb_node; 1254 - rb_erase(node, &inode->ordered_tree); 1255 - RB_CLEAR_NODE(node); 1256 - if (inode->ordered_tree_last == node) 1257 - inode->ordered_tree_last = NULL; 1258 1234 1235 + /* 1236 + * We don't have overlapping ordered extents (that would imply double 1237 + * allocation of extents) and we checked above that the split length 1238 + * does not cross the ordered extent's num_bytes field, so there's 1239 + * no need to remove it and re-insert it in the tree. 1240 + */ 1259 1241 ordered->file_offset += len; 1260 1242 ordered->disk_bytenr += len; 1261 1243 ordered->num_bytes -= len; ··· 1302 1250 offset += sum->len; 1303 1251 } 1304 1252 1305 - /* Re-insert the node */ 1306 - node = tree_insert(&inode->ordered_tree, ordered->file_offset, 1307 - &ordered->rb_node); 1308 - if (node) 1309 - btrfs_panic(fs_info, -EEXIST, 1310 - "zoned: inconsistency in ordered tree at offset %llu", 1311 - ordered->file_offset); 1312 - 1313 1253 node = tree_insert(&inode->ordered_tree, new->file_offset, &new->rb_node); 1314 - if (node) 1254 + if (unlikely(node)) 1315 1255 btrfs_panic(fs_info, -EEXIST, 1316 - "zoned: inconsistency in ordered tree at offset %llu", 1256 + "inconsistency in ordered tree at offset %llu after split", 1317 1257 new->file_offset); 1318 1258 spin_unlock(&inode->ordered_tree_lock); 1319 1259
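A note on the new btrfs_wait_ordered_extents() signature in the hunks above: instead of a (range_start, range_len) pair, callers pass a block group pointer, and NULL selects the whole range (start 0, length U64_MAX). The sketch below reproduces that derivation in userspace C. The reduced struct block_group, struct range, and the overlap helper are hypothetical; in particular the half-open overlap comparison is an assumption, since the filtering code itself is not among the lines shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct btrfs_block_group. */
struct block_group {
	uint64_t start;
	uint64_t length;
};

struct range {
	uint64_t start;
	uint64_t end;	/* exclusive */
};

/* Derive the range to wait on: the block group, or everything if @bg is NULL. */
static struct range wait_range(const struct block_group *bg)
{
	struct range r;
	uint64_t len;

	if (bg) {
		r.start = bg->start;
		len = bg->length;
	} else {
		r.start = 0;
		len = UINT64_MAX;
	}
	r.end = r.start + len;
	return r;
}

/* Assumed overlap test between the range and an extent's on-disk bytes. */
static bool intersects(const struct range *r, uint64_t disk_bytenr,
		       uint64_t disk_num_bytes)
{
	assert(r);
	return !(r->end <= disk_bytenr ||
		 disk_bytenr + disk_num_bytes <= r->start);
}
```

Passing the block group directly keeps callers such as flush_reservations() from having to spell out the "0, (u64)-1" convention for the whole filesystem; they pass NULL instead.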
+19 -8
fs/btrfs/ordered-data.h
···
130 130 refcount_t refs;
131 131
132 132 /* the inode we belong to */
133 - struct inode *inode;
133 + struct btrfs_inode *inode;
134 134
135 135 /* list of checksums for insertion when the extent io is done */
136 136 struct list_head list;
···
162 162 void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry);
163 163 void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
164 164 struct btrfs_ordered_extent *entry);
165 - bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
165 + void btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
166 166 struct page *page, u64 file_offset, u64 len,
167 167 bool uptodate);
168 168 void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
···
171 171 bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
172 172 struct btrfs_ordered_extent **cached,
173 173 u64 file_offset, u64 io_size);
174 +
175 + /*
176 + * This represents details about the target file extent item of a write operation.
177 + */
178 + struct btrfs_file_extent {
179 + u64 disk_bytenr;
180 + u64 disk_num_bytes;
181 + u64 num_bytes;
182 + u64 ram_bytes;
183 + u64 offset;
184 + u8 compression;
185 + };
186 +
174 187 struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
175 188 struct btrfs_inode *inode, u64 file_offset,
176 - u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
177 - u64 disk_num_bytes, u64 offset, unsigned long flags,
178 - int compress_type);
189 + const struct btrfs_file_extent *file_extent, unsigned long flags);
179 190 void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
180 191 struct btrfs_ordered_sum *sum);
181 192 struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode,
182 193 u64 file_offset);
183 194 void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry);
184 - int btrfs_wait_ordered_range(struct inode *inode, u64 start, u64 len);
195 + int btrfs_wait_ordered_range(struct btrfs_inode *inode, u64 start, u64 len);
185 196 struct btrfs_ordered_extent *
186 197 btrfs_lookup_first_ordered_extent(struct btrfs_inode *inode, u64 file_offset);
187 198 struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
···
204 193 void btrfs_get_ordered_extents_for_logging(struct btrfs_inode *inode,
205 194 struct list_head *list);
206 195 u64 btrfs_wait_ordered_extents(struct btrfs_root *root, u64 nr,
207 - const u64 range_start, const u64 range_len);
196 + const struct btrfs_block_group *bg);
208 197 void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
209 - const u64 range_start, const u64 range_len);
198 + const struct btrfs_block_group *bg);
210 199 void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
211 200 u64 end,
212 201 struct extent_state **cached_state);
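A note on the new struct btrfs_file_extent parameter: per the comment added in ordered-data.c, regular writes copy the members as they are, while NOCOW and PREALLOC ordered extents collapse to num_bytes, fold the offset into the disk start, and always pass 0 as the offset. The mapping can be sketched in userspace C as below. The struct ordered_fields view and the function name are hypothetical; only the field names and the two branches come from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the fields of struct btrfs_file_extent from the diff. */
struct file_extent {
	uint64_t disk_bytenr;
	uint64_t disk_num_bytes;
	uint64_t num_bytes;
	uint64_t ram_bytes;
	uint64_t offset;
	uint8_t compression;
};

/* Hypothetical reduced view of the values handed to the ordered extent. */
struct ordered_fields {
	uint64_t num_bytes;
	uint64_t ram_bytes;
	uint64_t disk_bytenr;
	uint64_t disk_num_bytes;
	uint64_t offset;
};

static struct ordered_fields
ordered_from_file_extent(const struct file_extent *fe, bool nocow_or_prealloc)
{
	struct ordered_fields of;

	assert(fe);
	if (nocow_or_prealloc) {
		/* Only the written range matters; offset is folded in and zeroed. */
		of.num_bytes = fe->num_bytes;
		of.ram_bytes = fe->num_bytes;
		of.disk_bytenr = fe->disk_bytenr + fe->offset;
		of.disk_num_bytes = fe->num_bytes;
		of.offset = 0;
	} else {
		/* Regular write: use the file extent members unchanged. */
		of.num_bytes = fe->num_bytes;
		of.ram_bytes = fe->ram_bytes;
		of.disk_bytenr = fe->disk_bytenr;
		of.disk_num_bytes = fe->disk_num_bytes;
		of.offset = fe->offset;
	}
	return of;
}
```

Bundling the five u64 values and the compression type into one struct replaces a long positional argument list in which two adjacent u64 parameters could be swapped without the compiler noticing.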
+4 -6
fs/btrfs/print-tree.c
···
109 109 btrfs_err(eb->fs_info,
110 110 "unexpected extent item size, has %u expect >= %zu",
111 111 item_size, sizeof(*ei));
112 - btrfs_handle_fs_error(eb->fs_info, -EUCLEAN, NULL);
112 + return;
113 113 }
114 114
115 115 ei = btrfs_item_ptr(eb, slot, struct btrfs_extent_item);
···
208 208 struct btrfs_stripe_extent *stripe)
209 209 {
210 210 const int num_stripes = btrfs_num_raid_stripes(item_size);
211 - const u8 encoding = btrfs_stripe_extent_encoding(eb, stripe);
212 -
213 - pr_info("\t\t\tencoding: %s\n",
214 - (encoding && encoding < BTRFS_NR_RAID_TYPES) ?
215 - btrfs_raid_array[encoding].raid_name : "unknown");
216 211
217 212 for (int i = 0; i < num_stripes; i++)
218 213 pr_info("\t\t\tstride %d devid %llu physical %llu\n",
···
305 310 case BTRFS_EXTENT_DATA_KEY:
306 311 fi = btrfs_item_ptr(l, i,
307 312 struct btrfs_file_extent_item);
313 + pr_info("\t\tgeneration %llu type %hhu\n",
314 + btrfs_file_extent_generation(l, fi),
315 + btrfs_file_extent_type(l, fi));
308 316 if (btrfs_file_extent_type(l, fi) ==
309 317 BTRFS_FILE_EXTENT_INLINE) {
310 318 pr_info("\t\tinline extent data size %llu\n",
+10 -10
fs/btrfs/props.c
···
27 27 int (*validate)(const struct btrfs_inode *inode, const char *value,
28 28 size_t len);
29 29 int (*apply)(struct inode *inode, const char *value, size_t len);
30 - const char *(*extract)(struct inode *inode);
30 + const char *(*extract)(const struct inode *inode);
31 31 bool (*ignore)(const struct btrfs_inode *inode);
32 32 int inheritable;
33 33 };
···
104 104 return handler->ignore(inode);
105 105 }
106 106
107 - int btrfs_set_prop(struct btrfs_trans_handle *trans, struct inode *inode,
107 + int btrfs_set_prop(struct btrfs_trans_handle *trans, struct btrfs_inode *inode,
108 108 const char *name, const char *value, size_t value_len,
109 109 int flags)
110 110 {
···
116 116 return -EINVAL;
117 117
118 118 if (value_len == 0) {
119 - ret = btrfs_setxattr(trans, inode, handler->xattr_name,
119 + ret = btrfs_setxattr(trans, &inode->vfs_inode, handler->xattr_name,
120 120 NULL, 0, flags);
121 121 if (ret)
122 122 return ret;
123 123
124 - ret = handler->apply(inode, NULL, 0);
124 + ret = handler->apply(&inode->vfs_inode, NULL, 0);
125 125 ASSERT(ret == 0);
126 126
127 127 return ret;
128 128 }
129 129
130 - ret = btrfs_setxattr(trans, inode, handler->xattr_name, value,
130 + ret = btrfs_setxattr(trans, &inode->vfs_inode, handler->xattr_name, value,
131 131 value_len, flags);
132 132 if (ret)
133 133 return ret;
134 - ret = handler->apply(inode, value, value_len);
134 + ret = handler->apply(&inode->vfs_inode, value, value_len);
135 135 if (ret) {
136 - btrfs_setxattr(trans, inode, handler->xattr_name, NULL,
136 + btrfs_setxattr(trans, &inode->vfs_inode, handler->xattr_name, NULL,
137 137 0, flags);
138 138 return ret;
139 139 }
140 140
141 - set_bit(BTRFS_INODE_HAS_PROPS, &BTRFS_I(inode)->runtime_flags);
141 + set_bit(BTRFS_INODE_HAS_PROPS, &inode->runtime_flags);
142 142
143 143 return 0;
144 144 }
···
359 359 return false;
360 360 }
361 361
362 - static const char *prop_compression_extract(struct inode *inode)
362 + static const char *prop_compression_extract(const struct inode *inode)
363 363 {
364 364 switch (BTRFS_I(inode)->prop_compress) {
365 365 case BTRFS_COMPRESS_ZLIB:
···
385 385 };
386 386
387 387 int btrfs_inode_inherit_props(struct btrfs_trans_handle *trans,
388 - struct inode *inode, struct inode *parent)
388 + struct inode *inode, const struct inode *parent)
389 389 {
390 390 struct btrfs_root *root = BTRFS_I(inode)->root;
391 391 struct btrfs_fs_info *fs_info = root->fs_info;
+2 -2
fs/btrfs/props.h
···
15 15
16 16 int __init btrfs_props_init(void);
17 17
18 - int btrfs_set_prop(struct btrfs_trans_handle *trans, struct inode *inode,
18 + int btrfs_set_prop(struct btrfs_trans_handle *trans, struct btrfs_inode *inode,
19 19 const char *name, const char *value, size_t value_len,
20 20 int flags);
21 21 int btrfs_validate_prop(const struct btrfs_inode *inode, const char *name,
···
26 26
27 27 int btrfs_inode_inherit_props(struct btrfs_trans_handle *trans,
28 28 struct inode *inode,
29 - const struct inode *dir);
29 + const struct inode *dir);
30 30
31 31 #endif
+150 -71
fs/btrfs/qgroup.c
··· 30 30 #include "root-tree.h" 31 31 #include "tree-checker.h" 32 32 33 - enum btrfs_qgroup_mode btrfs_qgroup_mode(struct btrfs_fs_info *fs_info) 33 + enum btrfs_qgroup_mode btrfs_qgroup_mode(const struct btrfs_fs_info *fs_info) 34 34 { 35 35 if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) 36 36 return BTRFS_QGROUP_MODE_DISABLED; ··· 39 39 return BTRFS_QGROUP_MODE_FULL; 40 40 } 41 41 42 - bool btrfs_qgroup_enabled(struct btrfs_fs_info *fs_info) 42 + bool btrfs_qgroup_enabled(const struct btrfs_fs_info *fs_info) 43 43 { 44 44 return btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_DISABLED; 45 45 } 46 46 47 - bool btrfs_qgroup_full_accounting(struct btrfs_fs_info *fs_info) 47 + bool btrfs_qgroup_full_accounting(const struct btrfs_fs_info *fs_info) 48 48 { 49 49 return btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_FULL; 50 50 } ··· 107 107 108 108 static void qgroup_rsv_add_by_qgroup(struct btrfs_fs_info *fs_info, 109 109 struct btrfs_qgroup *dest, 110 - struct btrfs_qgroup *src) 110 + const struct btrfs_qgroup *src) 111 111 { 112 112 int i; 113 113 ··· 117 117 118 118 static void qgroup_rsv_release_by_qgroup(struct btrfs_fs_info *fs_info, 119 119 struct btrfs_qgroup *dest, 120 - struct btrfs_qgroup *src) 120 + const struct btrfs_qgroup *src) 121 121 { 122 122 int i; 123 123 ··· 141 141 qg->new_refcnt += mod; 142 142 } 143 143 144 - static inline u64 btrfs_qgroup_get_old_refcnt(struct btrfs_qgroup *qg, u64 seq) 144 + static inline u64 btrfs_qgroup_get_old_refcnt(const struct btrfs_qgroup *qg, u64 seq) 145 145 { 146 146 if (qg->old_refcnt < seq) 147 147 return 0; 148 148 return qg->old_refcnt - seq; 149 149 } 150 150 151 - static inline u64 btrfs_qgroup_get_new_refcnt(struct btrfs_qgroup *qg, u64 seq) 151 + static inline u64 btrfs_qgroup_get_new_refcnt(const struct btrfs_qgroup *qg, u64 seq) 152 152 { 153 153 if (qg->new_refcnt < seq) 154 154 return 0; 155 155 return qg->new_refcnt - seq; 156 156 } 157 - 158 - /* 159 - * glue structure to represent the 
relations between qgroups. 160 - */ 161 - struct btrfs_qgroup_list { 162 - struct list_head next_group; 163 - struct list_head next_member; 164 - struct btrfs_qgroup *group; 165 - struct btrfs_qgroup *member; 166 - }; 167 157 168 158 static int 169 159 qgroup_rescan_init(struct btrfs_fs_info *fs_info, u64 progress_objectid, ··· 161 171 static void qgroup_rescan_zero_tracking(struct btrfs_fs_info *fs_info); 162 172 163 173 /* must be called with qgroup_ioctl_lock held */ 164 - static struct btrfs_qgroup *find_qgroup_rb(struct btrfs_fs_info *fs_info, 174 + static struct btrfs_qgroup *find_qgroup_rb(const struct btrfs_fs_info *fs_info, 165 175 u64 qgroupid) 166 176 { 167 177 struct rb_node *n = fs_info->qgroup_tree.rb_node; ··· 336 346 } 337 347 338 348 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS 339 - int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid, 349 + int btrfs_verify_qgroup_counts(const struct btrfs_fs_info *fs_info, u64 qgroupid, 340 350 u64 rfer, u64 excl) 341 351 { 342 352 struct btrfs_qgroup *qgroup; ··· 598 608 * Return false if no reserved space is left. 599 609 * Return true if some reserved space is leaked. 
  */
-bool btrfs_check_quota_leak(struct btrfs_fs_info *fs_info)
+bool btrfs_check_quota_leak(const struct btrfs_fs_info *fs_info)
 {
 	struct rb_node *node;
 	bool ret = false;
@@ -1334,19 +1324,14 @@
  */
 static int flush_reservations(struct btrfs_fs_info *fs_info)
 {
-	struct btrfs_trans_handle *trans;
 	int ret;

 	ret = btrfs_start_delalloc_roots(fs_info, LONG_MAX, false);
 	if (ret)
 		return ret;
-	btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
-	trans = btrfs_join_transaction(fs_info->tree_root);
-	if (IS_ERR(trans))
-		return PTR_ERR(trans);
-	ret = btrfs_commit_transaction(trans);
+	btrfs_wait_ordered_roots(fs_info, U64_MAX, NULL);

-	return ret;
+	return btrfs_commit_current_transaction(fs_info->tree_root);
 }

 int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
@@ -1446,9 +1431,11 @@
 	btrfs_tree_lock(quota_root->node);
 	btrfs_clear_buffer_dirty(trans, quota_root->node);
 	btrfs_tree_unlock(quota_root->node);
-	btrfs_free_tree_block(trans, btrfs_root_id(quota_root),
-			      quota_root->node, 0, 1);
+	ret = btrfs_free_tree_block(trans, btrfs_root_id(quota_root),
+				    quota_root->node, 0, 1);

+	if (ret < 0)
+		btrfs_abort_transaction(trans, ret);

 out:
 	btrfs_put_root(quota_root);
@@ -1572,14 +1559,20 @@
 	return ret;
 }

-int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst)
+/*
+ * Add relation between @src and @dst qgroup. The @prealloc is allocated by the
+ * callers and transferred here (either used or freed on error).
+ */
+int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst,
+			      struct btrfs_qgroup_list *prealloc)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_qgroup *parent;
 	struct btrfs_qgroup *member;
 	struct btrfs_qgroup_list *list;
-	struct btrfs_qgroup_list *prealloc = NULL;
 	int ret = 0;
+
+	ASSERT(prealloc);

 	/* Check the level of src and dst first */
 	if (btrfs_qgroup_level(src) >= btrfs_qgroup_level(dst))
@@ -1605,11 +1598,6 @@
 		}
 	}

-	prealloc = kzalloc(sizeof(*list), GFP_NOFS);
-	if (!prealloc) {
-		ret = -ENOMEM;
-		goto out;
-	}
 	ret = add_qgroup_relation_item(trans, src, dst);
 	if (ret)
 		goto out;
@@ -1748,13 +1736,55 @@
 	return ret;
 }

-static bool qgroup_has_usage(struct btrfs_qgroup *qgroup)
+/*
+ * Return 0 if we can not delete the qgroup (not empty or has children etc).
+ * Return >0 if we can delete the qgroup.
+ * Return <0 for other errors during tree search.
+ */
+static int can_delete_qgroup(struct btrfs_fs_info *fs_info, struct btrfs_qgroup *qgroup)
 {
-	return (qgroup->rfer > 0 || qgroup->rfer_cmpr > 0 ||
-		qgroup->excl > 0 || qgroup->excl_cmpr > 0 ||
-		qgroup->rsv.values[BTRFS_QGROUP_RSV_DATA] > 0 ||
-		qgroup->rsv.values[BTRFS_QGROUP_RSV_META_PREALLOC] > 0 ||
-		qgroup->rsv.values[BTRFS_QGROUP_RSV_META_PERTRANS] > 0);
+	struct btrfs_key key;
+	struct btrfs_path *path;
+	int ret;
+
+	/*
+	 * Squota would never be inconsistent, but there can still be case
+	 * where a dropped subvolume still has qgroup numbers, and squota
+	 * relies on such qgroup for future accounting.
+	 *
+	 * So for squota, do not allow dropping any non-zero qgroup.
+	 */
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE &&
+	    (qgroup->rfer || qgroup->excl || qgroup->excl_cmpr || qgroup->rfer_cmpr))
+		return 0;
+
+	/* For higher level qgroup, we can only delete it if it has no child. */
+	if (btrfs_qgroup_level(qgroup->qgroupid)) {
+		if (!list_empty(&qgroup->members))
+			return 0;
+		return 1;
+	}
+
+	/*
+	 * For level-0 qgroups, we can only delete it if it has no subvolume
+	 * for it.
+	 * This means even a subvolume is unlinked but not yet fully dropped,
+	 * we can not delete the qgroup.
+	 */
+	key.objectid = qgroup->qgroupid;
+	key.type = BTRFS_ROOT_ITEM_KEY;
+	key.offset = -1ULL;
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	ret = btrfs_find_root(fs_info->tree_root, &key, path, NULL, NULL);
+	btrfs_free_path(path);
+	/*
+	 * The @ret from btrfs_find_root() exactly matches our definition for
+	 * the return value, thus can be returned directly.
+	 */
+	return ret;
 }

 int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid)
@@ -1776,7 +1806,10 @@
 		goto out;
 	}

-	if (is_fstree(qgroupid) && qgroup_has_usage(qgroup)) {
+	ret = can_delete_qgroup(fs_info, qgroup);
+	if (ret < 0)
+		goto out;
+	if (ret == 0) {
 		ret = -EBUSY;
 		goto out;
 	}
@@ -1801,6 +1834,34 @@
 	}

 	spin_lock(&fs_info->qgroup_lock);
+	/*
+	 * Warn on reserved space. The subvolume should has no child nor
+	 * corresponding subvolume.
+	 * Thus its reserved space should all be zero, no matter if qgroup
+	 * is consistent or the mode.
+	 */
+	WARN_ON(qgroup->rsv.values[BTRFS_QGROUP_RSV_DATA] ||
+		qgroup->rsv.values[BTRFS_QGROUP_RSV_META_PREALLOC] ||
+		qgroup->rsv.values[BTRFS_QGROUP_RSV_META_PERTRANS]);
+	/*
+	 * The same for rfer/excl numbers, but that's only if our qgroup is
+	 * consistent and if it's in regular qgroup mode.
+	 * For simple mode it's not as accurate thus we can hit non-zero values
+	 * very frequently.
+	 */
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_FULL &&
+	    !(fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT)) {
+		if (WARN_ON(qgroup->rfer || qgroup->excl ||
+			    qgroup->rfer_cmpr || qgroup->excl_cmpr)) {
+			btrfs_warn_rl(fs_info,
+"to be deleted qgroup %u/%llu has non-zero numbers, rfer %llu rfer_cmpr %llu excl %llu excl_cmpr %llu",
+				      btrfs_qgroup_level(qgroup->qgroupid),
+				      btrfs_qgroup_subvolid(qgroup->qgroupid),
+				      qgroup->rfer, qgroup->rfer_cmpr,
+				      qgroup->excl, qgroup->excl_cmpr);
+			qgroup_mark_inconsistent(fs_info);
+		}
+	}
 	del_qgroup_rb(fs_info, qgroupid);
 	spin_unlock(&fs_info->qgroup_lock);

@@ -1813,6 +1874,41 @@
 	kfree(qgroup);
 out:
 	mutex_unlock(&fs_info->qgroup_ioctl_lock);
+	return ret;
+}
+
+int btrfs_qgroup_cleanup_dropped_subvolume(struct btrfs_fs_info *fs_info, u64 subvolid)
+{
+	struct btrfs_trans_handle *trans;
+	int ret;
+
+	if (!is_fstree(subvolid) || !btrfs_qgroup_enabled(fs_info) || !fs_info->quota_root)
+		return 0;
+
+	/*
+	 * Commit current transaction to make sure all the rfer/excl numbers
+	 * get updated.
+	 */
+	trans = btrfs_start_transaction(fs_info->quota_root, 0);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	ret = btrfs_commit_transaction(trans);
+	if (ret < 0)
+		return ret;
+
+	/* Start new trans to delete the qgroup info and limit items. */
+	trans = btrfs_start_transaction(fs_info->quota_root, 2);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+	ret = btrfs_remove_qgroup(trans, subvolid);
+	btrfs_end_transaction(trans);
+	/*
+	 * It's squota and the subvolume still has numbers needed for future
+	 * accounting, in this case we can not delete it. Just skip it.
+	 */
+	if (ret == -EBUSY)
+		ret = 0;
 	return ret;
 }

@@ -3222,7 +3318,6 @@
 			 struct btrfs_qgroup_inherit *inherit)
 {
 	int ret = 0;
-	int i;
 	u64 *i_qgroups;
 	bool committing = false;
 	struct btrfs_fs_info *fs_info = trans->fs_info;
@@ -3279,7 +3374,7 @@
 		i_qgroups = (u64 *)(inherit + 1);
 		nums = inherit->num_qgroups + 2 * inherit->num_ref_copies +
 		       2 * inherit->num_excl_copies;
-		for (i = 0; i < nums; ++i) {
+		for (int i = 0; i < nums; i++) {
 			srcgroup = find_qgroup_rb(fs_info, *i_qgroups);

 			/*
@@ -3306,7 +3401,7 @@
 	 */
 	if (inherit) {
 		i_qgroups = (u64 *)(inherit + 1);
-		for (i = 0; i < inherit->num_qgroups; ++i, ++i_qgroups) {
+		for (int i = 0; i < inherit->num_qgroups; i++, i_qgroups++) {
 			if (*i_qgroups == 0)
 				continue;
 			ret = add_qgroup_relation_item(trans, objectid,
@@ -3392,7 +3487,7 @@
 		goto unlock;

 	i_qgroups = (u64 *)(inherit + 1);
-	for (i = 0; i < inherit->num_qgroups; ++i) {
+	for (int i = 0; i < inherit->num_qgroups; i++) {
 		if (*i_qgroups) {
 			ret = add_relation_rb(fs_info, qlist_prealloc[i], objectid,
 					      *i_qgroups);
@@ -3412,7 +3507,7 @@
 		++i_qgroups;
 	}

-	for (i = 0; i < inherit->num_ref_copies; ++i, i_qgroups += 2) {
+	for (int i = 0; i < inherit->num_ref_copies; i++, i_qgroups += 2) {
 		struct btrfs_qgroup *src;
 		struct btrfs_qgroup *dst;

@@ -3433,7 +3528,7 @@
 		/* Manually tweaking numbers certainly needs a rescan */
 		need_rescan = true;
 	}
-	for (i = 0; i < inherit->num_excl_copies; ++i, i_qgroups += 2) {
+	for (int i = 0; i < inherit->num_excl_copies; i++, i_qgroups += 2) {
 		struct btrfs_qgroup *src;
 		struct btrfs_qgroup *dst;

@@ -3918,7 +4013,6 @@
 btrfs_qgroup_rescan(struct btrfs_fs_info *fs_info)
 {
 	int ret = 0;
-	struct btrfs_trans_handle *trans;

 	ret = qgroup_rescan_init(fs_info, 0, 1);
 	if (ret)
@@ -3935,16 +4029,10 @@
 	 * going to clear all tracking information for a clean start.
 	 */

-	trans = btrfs_attach_transaction_barrier(fs_info->fs_root);
-	if (IS_ERR(trans) && trans != ERR_PTR(-ENOENT)) {
+	ret = btrfs_commit_current_transaction(fs_info->fs_root);
+	if (ret) {
 		fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
-		return PTR_ERR(trans);
-	} else if (trans != ERR_PTR(-ENOENT)) {
-		ret = btrfs_commit_transaction(trans);
-		if (ret) {
-			fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
-			return ret;
-		}
+		return ret;
 	}

 	qgroup_rescan_zero_tracking(fs_info);
@@ -4080,7 +4168,6 @@
  */
 static int try_flush_qgroup(struct btrfs_root *root)
 {
-	struct btrfs_trans_handle *trans;
 	int ret;

 	/* Can't hold an open transaction or we run the risk of deadlocking. */
@@ -4101,17 +4188,9 @@
 	ret = btrfs_start_delalloc_snapshot(root, true);
 	if (ret < 0)
 		goto out;
-	btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
+	btrfs_wait_ordered_extents(root, U64_MAX, NULL);

-	trans = btrfs_attach_transaction_barrier(root);
-	if (IS_ERR(trans)) {
-		ret = PTR_ERR(trans);
-		if (ret == -ENOENT)
-			ret = 0;
-		goto out;
-	}
-
-	ret = btrfs_commit_transaction(trans);
+	ret = btrfs_commit_current_transaction(root);
 out:
 	clear_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state);
 	wake_up(&root->qgroup_flush_wait);
@@ -4817,7 +4896,7 @@
 }

 int btrfs_record_squota_delta(struct btrfs_fs_info *fs_info,
-			      struct btrfs_squota_delta *delta)
+			      const struct btrfs_squota_delta *delta)
 {
 	int ret;
 	struct btrfs_qgroup *qgroup;
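The return convention of can_delete_qgroup() and the way its callers translate it can be sketched in user space. This is a minimal illustration, not the kernel code: struct demo_qgroup and every demo_* name are hypothetical, and the root_lookup field only simulates the result that the diff's comment says btrfs_find_root() hands back (0 when a root item still exists, >0 when none is found, <0 on error).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the state can_delete_qgroup() inspects. */
struct demo_qgroup {
	unsigned int level;      /* 0 = subvolume qgroup, >0 = higher level */
	unsigned int nr_members; /* child qgroups (higher level only) */
	bool simple_mode;        /* stands in for BTRFS_QGROUP_MODE_SIMPLE */
	bool has_numbers;        /* any of rfer/excl/rfer_cmpr/excl_cmpr non-zero */
	int root_lookup;         /* simulated lookup: 0 found, >0 not found, <0 error */
};

/* Return 0 if the qgroup can not be deleted, >0 if it can, <0 on error. */
static int demo_can_delete(const struct demo_qgroup *qg)
{
	if (qg->simple_mode && qg->has_numbers)
		return 0;
	if (qg->level > 0)
		return qg->nr_members == 0 ? 1 : 0;
	/* Level 0: deletable only once the subvolume's root item is gone. */
	return qg->root_lookup;
}

/* Mirrors the removal path: "can not delete" becomes -EBUSY, errors pass through. */
static int demo_remove(const struct demo_qgroup *qg)
{
	int ret = demo_can_delete(qg);

	if (ret < 0)
		return ret;
	if (ret == 0)
		return -EBUSY;
	return 0;
}

/* Mirrors the dropped-subvolume cleanup: a busy qgroup is skipped, not an error. */
static int demo_cleanup_dropped(const struct demo_qgroup *qg)
{
	int ret = demo_remove(qg);

	if (ret == -EBUSY)
		ret = 0;
	return ret;
}
```

Only -EBUSY is swallowed by the cleanup wrapper; a failed lookup such as -ENOMEM still reaches the caller.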
+17 -8
fs/btrfs/qgroup.h
@@ -123,7 +123,6 @@

 /*
  * Record a dirty extent, and info qgroup to update quota on it
- * TODO: Use kmem cache to alloc it.
  */
 struct btrfs_qgroup_extent_record {
 	struct rb_node node;
@@ -279,6 +278,14 @@
 	struct kobject kobj;
 };

+/* Glue structure to represent the relations between qgroups. */
+struct btrfs_qgroup_list {
+	struct list_head next_group;
+	struct list_head next_member;
+	struct btrfs_qgroup *group;
+	struct btrfs_qgroup *member;
+};
+
 struct btrfs_squota_delta {
 	/* The fstree root this delta counts against. */
 	u64 root;
@@ -312,9 +319,9 @@
 	BTRFS_QGROUP_MODE_SIMPLE
 };

-enum btrfs_qgroup_mode btrfs_qgroup_mode(struct btrfs_fs_info *fs_info);
-bool btrfs_qgroup_enabled(struct btrfs_fs_info *fs_info);
-bool btrfs_qgroup_full_accounting(struct btrfs_fs_info *fs_info);
+enum btrfs_qgroup_mode btrfs_qgroup_mode(const struct btrfs_fs_info *fs_info);
+bool btrfs_qgroup_enabled(const struct btrfs_fs_info *fs_info);
+bool btrfs_qgroup_full_accounting(const struct btrfs_fs_info *fs_info);
 int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
 		       struct btrfs_ioctl_quota_ctl_args *quota_ctl_args);
 int btrfs_quota_disable(struct btrfs_fs_info *fs_info);
@@ -322,11 +329,13 @@
 void btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info);
 int btrfs_qgroup_wait_for_completion(struct btrfs_fs_info *fs_info,
 				     bool interruptible);
-int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst);
+int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst,
+			      struct btrfs_qgroup_list *prealloc);
 int btrfs_del_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
 			      u64 dst);
 int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid);
 int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid);
+int btrfs_qgroup_cleanup_dropped_subvolume(struct btrfs_fs_info *fs_info, u64 subvolid);
 int btrfs_limit_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid,
 		       struct btrfs_qgroup_limit *limit);
 int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info);
@@ -361,7 +370,7 @@
 			      enum btrfs_qgroup_rsv_type type);

 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
-int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid,
+int btrfs_verify_qgroup_counts(const struct btrfs_fs_info *fs_info, u64 qgroupid,
 			       u64 rfer, u64 excl);
 #endif

@@ -431,9 +440,9 @@
 int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
 		struct btrfs_root *root, struct extent_buffer *eb);
 void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans);
-bool btrfs_check_quota_leak(struct btrfs_fs_info *fs_info);
+bool btrfs_check_quota_leak(const struct btrfs_fs_info *fs_info);
 void btrfs_free_squota_rsv(struct btrfs_fs_info *fs_info, u64 root, u64 rsv_bytes);
 int btrfs_record_squota_delta(struct btrfs_fs_info *fs_info,
-			      struct btrfs_squota_delta *delta);
+			      const struct btrfs_squota_delta *delta);

 #endif
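The new btrfs_add_qgroup_relation() prototype hands a caller-allocated btrfs_qgroup_list to the callee, which either uses it or frees it on error. The following user-space sketch shows that ownership-transfer contract under stated assumptions: struct demo_link, the fixed-size table, the allocation counter, and all demo_* helpers are hypothetical and only exist to make the rule observable.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical miniature of a qgroup relation: one member/parent link. */
struct demo_link {
	unsigned int member_level;
	unsigned int parent_level;
};

#define DEMO_MAX_LINKS 4
static struct demo_link *demo_links[DEMO_MAX_LINKS];
static int demo_nr_links;
static int demo_live_allocs;	/* instrumentation for this sketch only */

/* Caller side: allocate before calling into the relation code. */
static struct demo_link *demo_alloc_link(void)
{
	struct demo_link *link = calloc(1, sizeof(*link));

	if (link)
		demo_live_allocs++;
	return link;
}

static void demo_free_link(struct demo_link *link)
{
	free(link);
	demo_live_allocs--;
}

/*
 * Takes ownership of @prealloc: it is either stored (used) or freed on error,
 * so the caller must not touch it after the call returns.
 */
static int demo_add_relation(unsigned int member_level, unsigned int parent_level,
			     struct demo_link *prealloc)
{
	assert(prealloc);

	if (member_level >= parent_level || demo_nr_links == DEMO_MAX_LINKS) {
		demo_free_link(prealloc);
		return -EINVAL;
	}
	prealloc->member_level = member_level;
	prealloc->parent_level = parent_level;
	demo_links[demo_nr_links++] = prealloc;
	return 0;
}

/* Release every stored link, as a teardown path would. */
static void demo_drop_all(void)
{
	while (demo_nr_links > 0)
		demo_free_link(demo_links[--demo_nr_links]);
}
```

With this contract the caller needs no cleanup branch of its own: after the call the pointer is gone either way.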
-13
fs/btrfs/raid-stripe-tree.c
@@ -80,7 +80,6 @@
 	struct btrfs_key stripe_key;
 	struct btrfs_root *stripe_root = fs_info->stripe_root;
 	const int num_stripes = btrfs_bg_type_to_factor(bioc->map_type);
-	u8 encoding = btrfs_bg_flags_to_raid_index(bioc->map_type);
 	struct btrfs_stripe_extent *stripe_extent;
 	const size_t item_size = struct_size(stripe_extent, strides, num_stripes);
 	int ret;
@@ -94,7 +93,6 @@

 	trace_btrfs_insert_one_raid_extent(fs_info, bioc->logical, bioc->size,
 					   num_stripes);
-	btrfs_set_stack_stripe_extent_encoding(stripe_extent, encoding);
 	for (int i = 0; i < num_stripes; i++) {
 		u64 devid = bioc->stripes[i].dev->devid;
 		u64 physical = bioc->stripes[i].physical;
@@ -159,7 +157,6 @@
 	struct extent_buffer *leaf;
 	const u64 end = logical + *length;
 	int num_stripes;
-	u8 encoding;
 	u64 offset;
 	u64 found_logical;
 	u64 found_length;
@@ -222,16 +219,6 @@

 	num_stripes = btrfs_num_raid_stripes(btrfs_item_size(leaf, slot));
 	stripe_extent = btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
-	encoding = btrfs_stripe_extent_encoding(leaf, stripe_extent);
-
-	if (encoding != btrfs_bg_flags_to_raid_index(map_type)) {
-		ret = -EUCLEAN;
-		btrfs_handle_fs_error(fs_info, ret,
-			"on-disk stripe encoding %d doesn't match RAID index %d",
-			encoding,
-			btrfs_bg_flags_to_raid_index(map_type));
-		goto out;
-	}

 	for (int i = 0; i < num_stripes; i++) {
 		struct btrfs_raid_stride *stride = &stripe_extent->strides[i];
+1 -2
fs/btrfs/raid-stripe-tree.h
@@ -48,8 +48,7 @@

 static inline int btrfs_num_raid_stripes(u32 item_size)
 {
-	return (item_size - offsetof(struct btrfs_stripe_extent, strides)) /
-		sizeof(struct btrfs_raid_stride);
+	return item_size / sizeof(struct btrfs_raid_stride);
 }

 #endif
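The effect of dropping the offsetof() term can be checked with a small user-space sketch. The struct layouts here are assumptions made for illustration (a 16-byte stride and an 8-byte header in front of the stride array); only the two arithmetic expressions are taken from the diff, and the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stride layout for illustration: device id plus physical offset. */
struct demo_stride {
	uint64_t devid;
	uint64_t physical;
};

/* Assumed old item layout: a small header in front of the stride array. */
struct demo_extent_with_header {
	uint8_t encoding;
	uint8_t reserved[7];
	struct demo_stride strides[];
};

/* Old calculation: subtract the header before dividing by the stride size. */
static int demo_num_stripes_old(uint32_t item_size)
{
	return (item_size - offsetof(struct demo_extent_with_header, strides)) /
		sizeof(struct demo_stride);
}

/* New calculation: the item consists of strides only. */
static int demo_num_stripes_new(uint32_t item_size)
{
	return item_size / sizeof(struct demo_stride);
}
```

Under these assumed sizes, a two-stride item is 40 bytes in the old layout and 32 bytes in the new one, and both helpers report two stripes for their respective layout.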
+101 -17
fs/btrfs/raid56.c
@@ -40,6 +40,85 @@

 #define BTRFS_STRIPE_HASH_TABLE_BITS				11

+static void dump_bioc(const struct btrfs_fs_info *fs_info, const struct btrfs_io_context *bioc)
+{
+	if (unlikely(!bioc)) {
+		btrfs_crit(fs_info, "bioc=NULL");
+		return;
+	}
+	btrfs_crit(fs_info,
+"bioc logical=%llu full_stripe=%llu size=%llu map_type=0x%llx mirror=%u replace_nr_stripes=%u replace_stripe_src=%d num_stripes=%u",
+		bioc->logical, bioc->full_stripe_logical, bioc->size,
+		bioc->map_type, bioc->mirror_num, bioc->replace_nr_stripes,
+		bioc->replace_stripe_src, bioc->num_stripes);
+	for (int i = 0; i < bioc->num_stripes; i++) {
+		btrfs_crit(fs_info, " nr=%d devid=%llu physical=%llu",
+			   i, bioc->stripes[i].dev->devid,
+			   bioc->stripes[i].physical);
+	}
+}
+
+static void btrfs_dump_rbio(const struct btrfs_fs_info *fs_info,
+			    const struct btrfs_raid_bio *rbio)
+{
+	if (!IS_ENABLED(CONFIG_BTRFS_ASSERT))
+		return;
+
+	dump_bioc(fs_info, rbio->bioc);
+	btrfs_crit(fs_info,
+"rbio flags=0x%lx nr_sectors=%u nr_data=%u real_stripes=%u stripe_nsectors=%u scrubp=%u dbitmap=0x%lx",
+		rbio->flags, rbio->nr_sectors, rbio->nr_data,
+		rbio->real_stripes, rbio->stripe_nsectors,
+		rbio->scrubp, rbio->dbitmap);
+}
+
+#define ASSERT_RBIO(expr, rbio)						\
+({									\
+	if (IS_ENABLED(CONFIG_BTRFS_ASSERT) && unlikely(!(expr))) {	\
+		const struct btrfs_fs_info *__fs_info = (rbio)->bioc ?	\
+					(rbio)->bioc->fs_info : NULL;	\
+									\
+		btrfs_dump_rbio(__fs_info, (rbio));			\
+	}								\
+	ASSERT((expr));							\
+})
+
+#define ASSERT_RBIO_STRIPE(expr, rbio, stripe_nr)			\
+({									\
+	if (IS_ENABLED(CONFIG_BTRFS_ASSERT) && unlikely(!(expr))) {	\
+		const struct btrfs_fs_info *__fs_info = (rbio)->bioc ?	\
+					(rbio)->bioc->fs_info : NULL;	\
+									\
+		btrfs_dump_rbio(__fs_info, (rbio));			\
+		btrfs_crit(__fs_info, "stripe_nr=%d", (stripe_nr));	\
+	}								\
+	ASSERT((expr));							\
+})
+
+#define ASSERT_RBIO_SECTOR(expr, rbio, sector_nr)			\
+({									\
+	if (IS_ENABLED(CONFIG_BTRFS_ASSERT) && unlikely(!(expr))) {	\
+		const struct btrfs_fs_info *__fs_info = (rbio)->bioc ?	\
+					(rbio)->bioc->fs_info : NULL;	\
+									\
+		btrfs_dump_rbio(__fs_info, (rbio));			\
+		btrfs_crit(__fs_info, "sector_nr=%d", (sector_nr));	\
+	}								\
+	ASSERT((expr));							\
+})
+
+#define ASSERT_RBIO_LOGICAL(expr, rbio, logical)			\
+({									\
+	if (IS_ENABLED(CONFIG_BTRFS_ASSERT) && unlikely(!(expr))) {	\
+		const struct btrfs_fs_info *__fs_info = (rbio)->bioc ?	\
+					(rbio)->bioc->fs_info : NULL;	\
+									\
+		btrfs_dump_rbio(__fs_info, (rbio));			\
+		btrfs_crit(__fs_info, "logical=%llu", (logical));	\
+	}								\
+	ASSERT((expr));							\
+})
+
 /* Used by the raid56 code to lock stripes for read/modify/write */
 struct btrfs_stripe_hash {
 	struct list_head hash_list;
@@ -592,8 +671,8 @@
 					     unsigned int stripe_nr,
 					     unsigned int sector_nr)
 {
-	ASSERT(stripe_nr < rbio->real_stripes);
-	ASSERT(sector_nr < rbio->stripe_nsectors);
+	ASSERT_RBIO_STRIPE(stripe_nr < rbio->real_stripes, rbio, stripe_nr);
+	ASSERT_RBIO_SECTOR(sector_nr < rbio->stripe_nsectors, rbio, sector_nr);

 	return stripe_nr * rbio->stripe_nsectors + sector_nr;
 }
@@ -873,8 +952,10 @@
 	struct sector_ptr *sector;
 	int index;

-	ASSERT(stripe_nr >= 0 && stripe_nr < rbio->real_stripes);
-	ASSERT(sector_nr >= 0 && sector_nr < rbio->stripe_nsectors);
+	ASSERT_RBIO_STRIPE(stripe_nr >= 0 && stripe_nr < rbio->real_stripes,
+			   rbio, stripe_nr);
+	ASSERT_RBIO_SECTOR(sector_nr >= 0 && sector_nr < rbio->stripe_nsectors,
+			   rbio, sector_nr);

 	index = stripe_nr * rbio->stripe_nsectors + sector_nr;
 	ASSERT(index >= 0 && index < rbio->nr_sectors);
@@ -970,7 +1051,7 @@
 {
 	int ret;

-	ret = btrfs_alloc_page_array(rbio->nr_pages, rbio->stripe_pages, 0);
+	ret = btrfs_alloc_page_array(rbio->nr_pages, rbio->stripe_pages, false);
 	if (ret < 0)
 		return ret;
 	/* Mapping all sectors */
@@ -985,7 +1066,7 @@
 	int ret;

 	ret = btrfs_alloc_page_array(rbio->nr_pages - data_pages,
-				     rbio->stripe_pages + data_pages, 0);
+				     rbio->stripe_pages + data_pages, false);
 	if (ret < 0)
 		return ret;

@@ -1057,8 +1138,10 @@
 	 * thus it can be larger than rbio->real_stripe.
 	 * So here we check against bioc->num_stripes, not rbio->real_stripes.
 	 */
-	ASSERT(stripe_nr >= 0 && stripe_nr < rbio->bioc->num_stripes);
-	ASSERT(sector_nr >= 0 && sector_nr < rbio->stripe_nsectors);
+	ASSERT_RBIO_STRIPE(stripe_nr >= 0 && stripe_nr < rbio->bioc->num_stripes,
+			   rbio, stripe_nr);
+	ASSERT_RBIO_SECTOR(sector_nr >= 0 && sector_nr < rbio->stripe_nsectors,
+			   rbio, sector_nr);
 	ASSERT(sector->page);

 	stripe = &rbio->bioc->stripes[stripe_nr];
@@ -1197,14 +1280,14 @@
 	 * At least two stripes (2 disks RAID5), and since real_stripes is U8,
 	 * we won't go beyond 256 disks anyway.
 	 */
-	ASSERT(rbio->real_stripes >= 2);
-	ASSERT(rbio->nr_data > 0);
+	ASSERT_RBIO(rbio->real_stripes >= 2, rbio);
+	ASSERT_RBIO(rbio->nr_data > 0, rbio);

 	/*
 	 * This is another check to make sure nr data stripes is smaller
 	 * than total stripes.
 	 */
-	ASSERT(rbio->nr_data < rbio->real_stripes);
+	ASSERT_RBIO(rbio->nr_data < rbio->real_stripes, rbio);
 }

 /* Generate PQ for one vertical stripe. */
@@ -1557,7 +1640,7 @@
 	const int data_pages = rbio->nr_data * rbio->stripe_npages;
 	int ret;

-	ret = btrfs_alloc_page_array(data_pages, rbio->stripe_pages, 0);
+	ret = btrfs_alloc_page_array(data_pages, rbio->stripe_pages, false);
 	if (ret < 0)
 		return ret;

@@ -1641,9 +1724,10 @@
 	const u32 sectorsize = fs_info->sectorsize;
 	u64 cur_logical;

-	ASSERT(orig_logical >= full_stripe_start &&
-	       orig_logical + orig_len <= full_stripe_start +
-	       rbio->nr_data * BTRFS_STRIPE_LEN);
+	ASSERT_RBIO_LOGICAL(orig_logical >= full_stripe_start &&
+			    orig_logical + orig_len <= full_stripe_start +
+			    rbio->nr_data * BTRFS_STRIPE_LEN,
+			    rbio, orig_logical);

 	bio_list_add(&rbio->bio_list, orig_bio);
 	rbio->bio_list_bytes += orig_bio->bi_iter.bi_size;
@@ -2389,7 +2473,7 @@
 			break;
 		}
 	}
-	ASSERT(i < rbio->real_stripes);
+	ASSERT_RBIO_STRIPE(i < rbio->real_stripes, rbio, i);

 	bitmap_copy(&rbio->dbitmap, dbitmap, stripe_nsectors);
 	return rbio;
@@ -2555,7 +2639,7 @@
 	 * Replace is running and our parity stripe needs to be duplicated to
 	 * the target device. Check we have a valid source stripe number.
 	 */
-	ASSERT(rbio->bioc->replace_stripe_src >= 0);
+	ASSERT_RBIO(rbio->bioc->replace_stripe_src >= 0, rbio);
 	for_each_set_bit(sectornr, pbitmap, rbio->stripe_nsectors) {
 		struct sector_ptr *sector;

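The ASSERT_RBIO*() macros follow a "dump the context, then assert" shape: when the condition fails they print the state of the rbio plus the offending value before ASSERT() fires. The sketch below reproduces only that shape in user space. It is not the kernel macro: struct demo_rbio and the DEMO_/demo_ names are hypothetical, and the final ASSERT() is replaced by a return value so the failure path can be exercised without aborting.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical two-field stand-in for struct btrfs_raid_bio. */
struct demo_rbio {
	unsigned int real_stripes;
	unsigned int stripe_nsectors;
};

static int demo_dump_count;	/* counts context dumps, for the sketch only */

static void demo_dump_rbio(const struct demo_rbio *rbio)
{
	demo_dump_count++;
	fprintf(stderr, "rbio real_stripes=%u stripe_nsectors=%u\n",
		rbio->real_stripes, rbio->stripe_nsectors);
}

/*
 * Evaluates to 1 when @expr holds. Otherwise dumps the rbio and the named
 * value and evaluates to 0. The kernel macros call ASSERT() at this point.
 */
#define DEMO_CHECK_RBIO(expr, rbio, name, value)			\
	((expr) ? 1 :							\
	 (demo_dump_rbio(rbio),						\
	  fprintf(stderr, "%s=%u\n", (name), (unsigned int)(value)), 0))

/* Flat sector index of (stripe_nr, sector_nr), or -1 if out of range. */
static int demo_sector_index(const struct demo_rbio *rbio,
			     unsigned int stripe_nr, unsigned int sector_nr)
{
	if (!DEMO_CHECK_RBIO(stripe_nr < rbio->real_stripes, rbio,
			     "stripe_nr", stripe_nr))
		return -1;
	if (!DEMO_CHECK_RBIO(sector_nr < rbio->stripe_nsectors, rbio,
			     "sector_nr", sector_nr))
		return -1;
	return (int)(stripe_nr * rbio->stripe_nsectors + sector_nr);
}
```

The passing path costs one comparison and no output, and the dump only runs on the failure branch.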
+4 -4
fs/btrfs/reflink.c
@@ -733,7 +733,7 @@
 		 * we found the previous extent covering eof and before we
 		 * attempted to increment its reference count).
 		 */
-		ret = btrfs_wait_ordered_range(inode, wb_start,
+		ret = btrfs_wait_ordered_range(BTRFS_I(inode), wb_start,
 					       destoff - wb_start);
 		if (ret)
 			return ret;
@@ -755,7 +755,7 @@
 	 * range, so wait for writeback to complete before truncating pages
 	 * from the page cache. This is a rare case.
 	 */
-	wb_ret = btrfs_wait_ordered_range(inode, destoff, len);
+	wb_ret = btrfs_wait_ordered_range(BTRFS_I(inode), destoff, len);
 	ret = ret ? ret : wb_ret;
 	/*
 	 * Truncate page cache pages so that future reads will see the cloned
@@ -835,11 +835,11 @@
 	if (ret < 0)
 		return ret;

-	ret = btrfs_wait_ordered_range(inode_in, ALIGN_DOWN(pos_in, bs),
+	ret = btrfs_wait_ordered_range(BTRFS_I(inode_in), ALIGN_DOWN(pos_in, bs),
 				       wb_len);
 	if (ret < 0)
 		return ret;
-	ret = btrfs_wait_ordered_range(inode_out, ALIGN_DOWN(pos_out, bs),
+	ret = btrfs_wait_ordered_range(BTRFS_I(inode_out), ALIGN_DOWN(pos_out, bs),
 				       wb_len);
 	if (ret < 0)
 		return ret;
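The callers above now pass BTRFS_I(inode) instead of the generic inode. As far as I know, BTRFS_I() is a container_of() conversion from the embedded VFS inode to its enclosing struct btrfs_inode. The sketch below shows that conversion in user space, with hypothetical demo_* types standing in for the kernel structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the generic VFS inode. */
struct demo_vfs_inode {
	unsigned long i_ino;
};

/* Hypothetical stand-in for the filesystem-specific inode that embeds it. */
struct demo_btrfs_inode {
	unsigned long long reloc_block_group_start;
	struct demo_vfs_inode vfs_inode;	/* embedded generic inode */
};

/* Recover the containing structure from a pointer to its embedded member. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct demo_btrfs_inode *DEMO_BTRFS_I(struct demo_vfs_inode *inode)
{
	return demo_container_of(inode, struct demo_btrfs_inode, vfs_inode);
}

/* A helper in the new style: it takes the filesystem-specific inode directly. */
static unsigned long long demo_reloc_start(const struct demo_btrfs_inode *inode)
{
	return inode->reloc_block_group_start;
}
```

The conversion is pure pointer arithmetic, so moving it from the callee to the call site does not change behaviour; it only makes the expected type explicit in the prototype.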
+76 -81
fs/btrfs/relocation.c
@@ -817,7 +817,7 @@
 		goto abort;
 	}
 	set_bit(BTRFS_ROOT_SHAREABLE, &reloc_root->state);
-	reloc_root->last_trans = trans->transid;
+	btrfs_set_root_last_trans(reloc_root, trans->transid);
 	return reloc_root;
 fail:
 	kfree(root_item);
@@ -864,7 +864,7 @@
 	 */
 	if (root->reloc_root) {
 		reloc_root = root->reloc_root;
-		reloc_root->last_trans = trans->transid;
+		btrfs_set_root_last_trans(reloc_root, trans->transid);
 		return 0;
 	}

@@ -962,7 +962,7 @@
 	if (!path)
 		return -ENOMEM;

-	bytenr -= BTRFS_I(reloc_inode)->index_cnt;
+	bytenr -= BTRFS_I(reloc_inode)->reloc_block_group_start;
 	ret = btrfs_lookup_file_extent(NULL, root, path,
 			btrfs_ino(BTRFS_I(reloc_inode)), bytenr, 0);
 	if (ret < 0)
@@ -1739,7 +1739,7 @@
 		 * btrfs_update_reloc_root() and update our root item
 		 * appropriately.
 		 */
-		reloc_root->last_trans = trans->transid;
+		btrfs_set_root_last_trans(reloc_root, trans->transid);
 		trans->block_rsv = rc->block_rsv;

 		replaced = 0;
@@ -2082,7 +2082,7 @@
 	struct btrfs_root *root;
 	int ret;

-	if (reloc_root->last_trans == trans->transid)
+	if (btrfs_get_root_last_trans(reloc_root) == trans->transid)
 		return 0;

 	root = btrfs_get_fs_root(fs_info, reloc_root->root_key.offset, false);
@@ -2790,14 +2790,14 @@
 	return ret;
 }

-static noinline_for_stack int prealloc_file_extent_cluster(
-				struct btrfs_inode *inode,
-				const struct file_extent_cluster *cluster)
+static noinline_for_stack int prealloc_file_extent_cluster(struct reloc_control *rc)
 {
+	const struct file_extent_cluster *cluster = &rc->cluster;
+	struct btrfs_inode *inode = BTRFS_I(rc->data_inode);
 	u64 alloc_hint = 0;
 	u64 start;
 	u64 end;
-	u64 offset = inode->index_cnt;
+	u64 offset = inode->reloc_block_group_start;
 	u64 num_bytes;
 	int nr;
 	int ret = 0;
@@ -2899,11 +2899,14 @@
 	return ret;
 }

-static noinline_for_stack int setup_relocation_extent_mapping(struct inode *inode,
-				u64 start, u64 end, u64 block_start)
+static noinline_for_stack int setup_relocation_extent_mapping(struct reloc_control *rc)
 {
+	struct btrfs_inode *inode = BTRFS_I(rc->data_inode);
 	struct extent_map *em;
 	struct extent_state *cached_state = NULL;
+	u64 offset = inode->reloc_block_group_start;
+	u64 start = rc->cluster.start - offset;
+	u64 end = rc->cluster.end - offset;
 	int ret = 0;

 	em = alloc_extent_map();
@@ -2912,13 +2915,14 @@

 	em->start = start;
 	em->len = end + 1 - start;
-	em->block_len = em->len;
-	em->block_start = block_start;
+	em->disk_bytenr = rc->cluster.start;
+	em->disk_num_bytes = em->len;
+	em->ram_bytes = em->len;
 	em->flags |= EXTENT_FLAG_PINNED;

-	lock_extent(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
-	ret = btrfs_replace_extent_map_range(BTRFS_I(inode), em, false);
-	unlock_extent(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
+	lock_extent(&inode->io_tree, start, end, &cached_state);
+	ret = btrfs_replace_extent_map_range(inode, em, false);
+	unlock_extent(&inode->io_tree, start, end, &cached_state);
 	free_extent_map(em);

 	return ret;
@@ -2946,12 +2950,14 @@
 	return cluster->boundary[cluster_nr + 1] - 1;
 }

-static int relocate_one_folio(struct inode *inode, struct file_ra_state *ra,
-			      const struct file_extent_cluster *cluster,
+static int relocate_one_folio(struct reloc_control *rc,
+			      struct file_ra_state *ra,
 			      int *cluster_nr, unsigned long index)
 {
+	const struct file_extent_cluster *cluster = &rc->cluster;
+	struct inode *inode = rc->data_inode;
 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
-	u64 offset = BTRFS_I(inode)->index_cnt;
+	u64 offset = BTRFS_I(inode)->reloc_block_group_start;
 	const unsigned long last_index = (cluster->end - offset) >> PAGE_SHIFT;
 	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
 	struct folio *folio;
@@ -3083,10 +3089,11 @@
 	return ret;
 }

-static int relocate_file_extent_cluster(struct inode *inode,
-					const struct file_extent_cluster *cluster)
+static int relocate_file_extent_cluster(struct reloc_control *rc)
 {
-	u64 offset = BTRFS_I(inode)->index_cnt;
+	struct inode *inode = rc->data_inode;
+	const struct file_extent_cluster *cluster = &rc->cluster;
+	u64 offset = BTRFS_I(inode)->reloc_block_group_start;
 	unsigned long index;
 	unsigned long last_index;
 	struct file_ra_state *ra;
@@ -3100,21 +3107,20 @@
 	if (!ra)
 		return -ENOMEM;

-	ret = prealloc_file_extent_cluster(BTRFS_I(inode), cluster);
+	ret = prealloc_file_extent_cluster(rc);
 	if (ret)
 		goto out;

 	file_ra_state_init(ra, inode->i_mapping);

-	ret = setup_relocation_extent_mapping(inode, cluster->start - offset,
-					      cluster->end - offset, cluster->start);
+	ret = setup_relocation_extent_mapping(rc);
 	if (ret)
 		goto out;

 	last_index = (cluster->end - offset) >> PAGE_SHIFT;
 	for (index = (cluster->start - offset) >> PAGE_SHIFT;
 	     index <= last_index && !ret; index++)
-		ret = relocate_one_folio(inode, ra, cluster, &cluster_nr, index);
+		ret = relocate_one_folio(rc, ra, &cluster_nr, index);
 	if (ret == 0)
 		WARN_ON(cluster_nr != cluster->nr);
 out:
@@ -3122,15 +3128,16 @@
 	return ret;
 }

-static noinline_for_stack int relocate_data_extent(struct inode *inode,
-				const struct btrfs_key *extent_key,
-				struct file_extent_cluster *cluster)
+static noinline_for_stack int relocate_data_extent(struct reloc_control *rc,
+					   const struct btrfs_key *extent_key)
 {
+	struct inode *inode = rc->data_inode;
+	struct file_extent_cluster *cluster = &rc->cluster;
 	int ret;
 	struct btrfs_root *root = BTRFS_I(inode)->root;

 	if (cluster->nr > 0 && extent_key->objectid != cluster->end + 1) {
-		ret = relocate_file_extent_cluster(inode, cluster);
+		ret = relocate_file_extent_cluster(rc);
 		if (ret)
 			return ret;
 		cluster->nr = 0;
@@ -3156,7 +3163,7 @@
 		 * the cluster we need to relocate.
 		 */
 		root->relocation_src_root = cluster->owning_root;
-		ret = relocate_file_extent_cluster(inode, cluster);
+		ret = relocate_file_extent_cluster(rc);
 		if (ret)
 			return ret;
 		cluster->nr = 0;
@@ -3175,7 +3182,7 @@
 	cluster->nr++;

 	if (cluster->nr >= MAX_EXTENTS) {
-		ret = relocate_file_extent_cluster(inode, cluster);
+		ret = relocate_file_extent_cluster(rc);
 		if (ret)
 			return ret;
 		cluster->nr = 0;
@@ -3369,7 +3376,7 @@
 	if (inode)
 		goto truncate;

-	inode = btrfs_iget(fs_info->sb, ino, root);
+	inode = btrfs_iget(ino, root);
 	if (IS_ERR(inode))
 		return -ENOENT;

@@ -3744,8 +3751,7 @@
 		if (rc->stage == MOVE_DATA_EXTENTS &&
 		    (flags & BTRFS_EXTENT_FLAG_DATA)) {
 			rc->found_file_extent = true;
-			ret = relocate_data_extent(rc->data_inode,
-						   &key, &rc->cluster);
+			ret = relocate_data_extent(rc, &key);
 			if (ret < 0) {
 				err = ret;
 				break;
@@ -3774,8 +3780,7 @@
 	}

 	if (!err) {
-		ret = relocate_file_extent_cluster(rc->data_inode,
-						   &rc->cluster);
+		ret = relocate_file_extent_cluster(rc);
 		if (ret < 0)
 			err = ret;
 	}
@@ -3908,14 +3913,14 @@
 	if (ret)
 		goto out;

-	inode = btrfs_iget(fs_info->sb, objectid, root);
+	inode = btrfs_iget(objectid, root);
 	if (IS_ERR(inode)) {
 		delete_orphan_inode(trans, root, objectid);
 		ret = PTR_ERR(inode);
 		inode = NULL;
 		goto out;
 	}
-	BTRFS_I(inode)->index_cnt = group->start;
+	BTRFS_I(inode)->reloc_block_group_start = group->start;

 	ret = btrfs_orphan_add(trans, BTRFS_I(inode));
 out:
@@ -4002,15 +4007,13 @@
 /*
  * Print the block group being relocated
  */
-static void describe_relocation(struct btrfs_fs_info *fs_info,
-				struct btrfs_block_group *block_group)
+static void describe_relocation(struct btrfs_block_group *block_group)
 {
 	char buf[128] = {'\0'};

 	btrfs_describe_block_groups(block_group->flags, buf, sizeof(buf));

-	btrfs_info(fs_info,
-		   "relocating block group %llu flags %s",
+	btrfs_info(block_group->fs_info, "relocating block group %llu flags %s",
 		   block_group->start, buf);
 }

@@ -4118,13 +4121,11 @@
 		goto out;
 	}

-	describe_relocation(fs_info, rc->block_group);
+	describe_relocation(rc->block_group);

 	btrfs_wait_block_group_reservations(rc->block_group);
 	btrfs_wait_nocow_writers(rc->block_group);
-	btrfs_wait_ordered_roots(fs_info, U64_MAX,
-				 rc->block_group->start,
-				 rc->block_group->length);
+	btrfs_wait_ordered_roots(fs_info, U64_MAX, rc->block_group);

 	ret = btrfs_zone_finish(rc->block_group);
 	WARN_ON(ret && ret != -EAGAIN);
@@ -4149,7 +4150,7 @@
 		 * out of the loop if we hit an error.
 		 */
 		if (rc->stage == MOVE_DATA_EXTENTS && rc->found_file_extent) {
-			ret = btrfs_wait_ordered_range(rc->data_inode, 0,
+			ret = btrfs_wait_ordered_range(BTRFS_I(rc->data_inode), 0,
 						       (u64)-1);
 			if (ret)
 				err = ret;
@@ -4221,8 +4222,8 @@
 	struct extent_buffer *leaf;
 	struct reloc_control *rc = NULL;
 	struct btrfs_trans_handle *trans;
-	int ret;
-	int err = 0;
+	int ret2;
+	int ret = 0;

 	path = btrfs_alloc_path();
 	if (!path)
@@ -4236,15 +4237,14 @@
 	while (1) {
 		ret = btrfs_search_slot(NULL, fs_info->tree_root, &key,
 					path, 0, 0);
-		if (ret < 0) {
-			err = ret;
+		if (ret < 0)
 			goto out;
-		}
 		if (ret > 0) {
 			if (path->slots[0] == 0)
 				break;
 			path->slots[0]--;
 		}
+		ret = 0;
 		leaf = path->nodes[0];
 		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
 		btrfs_release_path(path);
@@ -4255,7 +4255,7 @@

 		reloc_root = btrfs_read_tree_root(fs_info->tree_root, &key);
 		if (IS_ERR(reloc_root)) {
-			err = PTR_ERR(reloc_root);
+			ret = PTR_ERR(reloc_root);
 			goto out;
 		}

@@ -4267,15 +4267,12 @@
 					reloc_root->root_key.offset, false);
 			if (IS_ERR(fs_root)) {
 				ret = PTR_ERR(fs_root);
-				if (ret != -ENOENT) {
-					err = ret;
+				if (ret != -ENOENT)
 					goto out;
-				}
 				ret = mark_garbage_root(reloc_root);
-				if (ret < 0) {
-					err = ret;
+				if (ret < 0)
 					goto out;
-				}
+				ret = 0;
 			} else {
 				btrfs_put_root(fs_root);
 			}
@@ -4293,15 +4290,13 @@

 	rc = alloc_reloc_control(fs_info);
 	if (!rc) {
-		err = -ENOMEM;
+		ret = -ENOMEM;
 		goto out;
 	}

 	ret = reloc_chunk_start(fs_info);
-	if (ret < 0) {
-		err = ret;
+	if (ret < 0)
 		goto out_end;
-	}

 	rc->extent_root = btrfs_extent_root(fs_info, 0);

@@ -4309,7 +4304,7 @@

 	trans = btrfs_join_transaction(rc->extent_root);
 	if (IS_ERR(trans)) {
-		err = PTR_ERR(trans);
+		ret = PTR_ERR(trans);
 		goto out_unset;
 	}

@@ -4329,15 +4324,15 @@
 		fs_root = btrfs_get_fs_root(fs_info, reloc_root->root_key.offset,
 					    false);
 		if (IS_ERR(fs_root)) {
-			err = PTR_ERR(fs_root);
+			ret = PTR_ERR(fs_root);
 			list_add_tail(&reloc_root->root_list, &reloc_roots);
 			btrfs_end_transaction(trans);
 			goto out_unset;
 		}

-		err = __add_reloc_root(reloc_root);
-		ASSERT(err != -EEXIST);
-		if (err) {
+		ret = __add_reloc_root(reloc_root);
+		ASSERT(ret != -EEXIST);
+		if (ret) {
 			list_add_tail(&reloc_root->root_list, &reloc_roots);
 			btrfs_put_root(fs_root);
 			btrfs_end_transaction(trans);
@@ -4347,8 +4342,8 @@
 		btrfs_put_root(fs_root);
 	}

-	err = btrfs_commit_transaction(trans);
-	if (err)
+	ret = btrfs_commit_transaction(trans);
+	if (ret)
 		goto out_unset;

 	merge_reloc_roots(rc);
@@ -4357,14 +4352,14 @@

 	trans = btrfs_join_transaction(rc->extent_root);
 	if (IS_ERR(trans)) {
-		err = PTR_ERR(trans);
+		ret = PTR_ERR(trans);
 		goto out_clean;
 	}
-	err = btrfs_commit_transaction(trans);
+	ret = btrfs_commit_transaction(trans);
 out_clean:
-	ret = clean_dirty_subvols(rc);
-	if (ret < 0 && !err)
-		err = ret;
+	ret2 = clean_dirty_subvols(rc);
+	if (ret2 < 0 && !ret)
+		ret = ret2;
 out_unset:
 	unset_reloc_control(rc);
 out_end:
@@ -4375,14 +4370,14 @@

 	btrfs_free_path(path);

-	if (err == 0) {
+	if (ret == 0) {
 		/* cleanup orphan inode in data relocation tree */
 		fs_root = btrfs_grab_root(fs_info->data_reloc_root);
 		ASSERT(fs_root);
-		err = btrfs_orphan_cleanup(fs_root);
+		ret = btrfs_orphan_cleanup(fs_root);
 		btrfs_put_root(fs_root);
 	}
-	return err;
+	return ret;
 }

 /*
@@ -4393,9 +4388,9 @@
  */
 int btrfs_reloc_clone_csums(struct btrfs_ordered_extent *ordered)
 {
-	struct btrfs_inode *inode = BTRFS_I(ordered->inode);
+	struct btrfs_inode *inode = ordered->inode;
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	u64 disk_bytenr = ordered->file_offset + inode->index_cnt;
+	u64 disk_bytenr = ordered->file_offset + inode->reloc_block_group_start;
 	struct btrfs_root *csum_root = btrfs_csum_root(fs_info, disk_bytenr);
 	LIST_HEAD(list);
 	int ret;
+4 -9
fs/btrfs/scrub.c
···
 	atomic_set(&stripe->pending_io, 0);
 	spin_lock_init(&stripe->write_error_lock);

-	ret = btrfs_alloc_page_array(SCRUB_STRIPE_PAGES, stripe->pages, 0);
+	ret = btrfs_alloc_page_array(SCRUB_STRIPE_PAGES, stripe->pages, false);
 	if (ret < 0)
 		goto error;

···
 					struct btrfs_block_group *cache)
 {
 	struct btrfs_fs_info *fs_info = cache->fs_info;
-	struct btrfs_trans_handle *trans;

 	if (!btrfs_is_zoned(fs_info))
 		return 0;

 	btrfs_wait_block_group_reservations(cache);
 	btrfs_wait_nocow_writers(cache);
-	btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length);
+	btrfs_wait_ordered_roots(fs_info, U64_MAX, cache);

-	trans = btrfs_join_transaction(root);
-	if (IS_ERR(trans))
-		return PTR_ERR(trans);
-	return btrfs_commit_transaction(trans);
+	return btrfs_commit_current_transaction(root);
 }

 static noinline_for_stack
···
 		 */
 		if (sctx->is_dev_replace) {
 			btrfs_wait_nocow_writers(cache);
-			btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start,
-					cache->length);
+			btrfs_wait_ordered_roots(fs_info, U64_MAX, cache);
 		}

 		scrub_pause_off(fs_info);
+16 -33
fs/btrfs/send.c
···
 static int process_verity(struct send_ctx *sctx)
 {
 	int ret = 0;
-	struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
 	struct inode *inode;
 	struct fs_path *p;

-	inode = btrfs_iget(fs_info->sb, sctx->cur_ino, sctx->send_root);
+	inode = btrfs_iget(sctx->cur_ino, sctx->send_root);
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);

···
 	size_t inline_size;
 	int ret;

-	inode = btrfs_iget(fs_info->sb, sctx->cur_ino, root);
+	inode = btrfs_iget(sctx->cur_ino, root);
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);

···
 	u32 crc;
 	int ret;

-	inode = btrfs_iget(fs_info->sb, sctx->cur_ino, root);
+	inode = btrfs_iget(sctx->cur_ino, root);
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);

···
 	if (sctx->cur_inode == NULL) {
 		struct btrfs_root *root = sctx->send_root;

-		sctx->cur_inode = btrfs_iget(root->fs_info->sb, sctx->cur_ino, root);
+		sctx->cur_inode = btrfs_iget(sctx->cur_ino, root);
 		if (IS_ERR(sctx->cur_inode)) {
 			int err = PTR_ERR(sctx->cur_inode);

···
  */
 static int ensure_commit_roots_uptodate(struct send_ctx *sctx)
 {
-	int i;
-	struct btrfs_trans_handle *trans = NULL;
+	struct btrfs_root *root = sctx->parent_root;

-again:
-	if (sctx->parent_root &&
-	    sctx->parent_root->node != sctx->parent_root->commit_root)
-		goto commit_trans;
+	if (root && root->node != root->commit_root)
+		return btrfs_commit_current_transaction(root);

-	for (i = 0; i < sctx->clone_roots_cnt; i++)
-		if (sctx->clone_roots[i].root->node !=
-		    sctx->clone_roots[i].root->commit_root)
-			goto commit_trans;
-
-	if (trans)
-		return btrfs_end_transaction(trans);
-
-	return 0;
-
-commit_trans:
-	/* Use any root, all fs roots will get their commit roots updated. */
-	if (!trans) {
-		trans = btrfs_join_transaction(sctx->send_root);
-		if (IS_ERR(trans))
-			return PTR_ERR(trans);
-		goto again;
+	for (int i = 0; i < sctx->clone_roots_cnt; i++) {
+		root = sctx->clone_roots[i].root;
+		if (root->node != root->commit_root)
+			return btrfs_commit_current_transaction(root);
 	}

-	return btrfs_commit_transaction(trans);
+	return 0;
 }

 /*
···
 		ret = btrfs_start_delalloc_snapshot(root, false);
 		if (ret)
 			return ret;
-		btrfs_wait_ordered_extents(root, U64_MAX, 0, U64_MAX);
+		btrfs_wait_ordered_extents(root, U64_MAX, NULL);
 	}

 	for (i = 0; i < sctx->clone_roots_cnt; i++) {
···
 		ret = btrfs_start_delalloc_snapshot(root, false);
 		if (ret)
 			return ret;
-		btrfs_wait_ordered_extents(root, U64_MAX, 0, U64_MAX);
+		btrfs_wait_ordered_extents(root, U64_MAX, NULL);
 	}

 	return 0;
···
 		      btrfs_root_id(root), root->dedupe_in_progress);
 }

-long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg)
+long btrfs_ioctl_send(struct btrfs_inode *inode, const struct btrfs_ioctl_send_args *arg)
 {
 	int ret = 0;
-	struct btrfs_root *send_root = BTRFS_I(inode)->root;
+	struct btrfs_root *send_root = inode->root;
 	struct btrfs_fs_info *fs_info = send_root->fs_info;
 	struct btrfs_root *clone_root;
 	struct send_ctx *sctx = NULL;
+2 -2
fs/btrfs/send.h
···
 #include <linux/sizes.h>
 #include <linux/align.h>

-struct inode;
+struct btrfs_inode;
 struct btrfs_ioctl_send_args;

 #define BTRFS_SEND_STREAM_MAGIC "btrfs-stream"
···
 	__BTRFS_SEND_A_MAX = 35,
 };

-long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg);
+long btrfs_ioctl_send(struct btrfs_inode *inode, const struct btrfs_ioctl_send_args *arg);

 #endif
+236 -29
fs/btrfs/space-info.c
···
 // SPDX-License-Identifier: GPL-2.0

+#include "linux/spinlock.h"
+#include <linux/minmax.h>
 #include "misc.h"
 #include "ctree.h"
 #include "space-info.h"
···
  */
 #define BTRFS_DEFAULT_ZONED_RECLAIM_THRESH (75)

+#define BTRFS_UNALLOC_BLOCK_GROUP_TARGET (10ULL)
+
 /*
  * Calculate chunk size depending on volume type (regular or zoned).
  */
···
 	if (!space_info)
 		return -ENOMEM;

+	space_info->fs_info = info;
 	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
 		INIT_LIST_HEAD(&space_info->block_groups[i]);
 	init_rwsem(&space_info->groups_sem);
···
 	return NULL;
 }

+static u64 calc_effective_data_chunk_size(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_space_info *data_sinfo;
+	u64 data_chunk_size;
+
+	/*
+	 * Calculate the data_chunk_size, space_info->chunk_size is the
+	 * "optimal" chunk size based on the fs size. However when we actually
+	 * allocate the chunk we will strip this down further, making it no
+	 * more than 10% of the disk or 1G, whichever is smaller.
+	 *
+	 * On the zoned mode, we need to use zone_size (= data_sinfo->chunk_size)
+	 * as it is.
+	 */
+	data_sinfo = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_DATA);
+	if (btrfs_is_zoned(fs_info))
+		return data_sinfo->chunk_size;
+	data_chunk_size = min(data_sinfo->chunk_size,
+			      mult_perc(fs_info->fs_devices->total_rw_bytes, 10));
+	return min_t(u64, data_chunk_size, SZ_1G);
+}
+
 static u64 calc_available_free_space(struct btrfs_fs_info *fs_info,
 			  struct btrfs_space_info *space_info,
 			  enum btrfs_reserve_flush_enum flush)
 {
-	struct btrfs_space_info *data_sinfo;
 	u64 profile;
 	u64 avail;
 	u64 data_chunk_size;
···
 	if (avail == 0)
 		return 0;

-	/*
-	 * Calculate the data_chunk_size, space_info->chunk_size is the
-	 * "optimal" chunk size based on the fs size. However when we actually
-	 * allocate the chunk we will strip this down further, making it no more
-	 * than 10% of the disk or 1G, whichever is smaller.
-	 *
-	 * On the zoned mode, we need to use zone_size (=
-	 * data_sinfo->chunk_size) as it is.
-	 */
-	data_sinfo = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_DATA);
-	if (!btrfs_is_zoned(fs_info)) {
-		data_chunk_size = min(data_sinfo->chunk_size,
-				      mult_perc(fs_info->fs_devices->total_rw_bytes, 10));
-		data_chunk_size = min_t(u64, data_chunk_size, SZ_1G);
-	} else {
-		data_chunk_size = data_sinfo->chunk_size;
-	}
+	data_chunk_size = calc_effective_data_chunk_size(fs_info);

 	/*
 	 * Since data allocations immediately use block groups as part of the
···
 	return nr;
 }

-#define EXTENT_SIZE_PER_ITEM SZ_256K
-
 /*
  * shrink metadata reservation for delalloc
  */
···
 skip_async:
 		loops++;
 		if (wait_ordered && !trans) {
-			btrfs_wait_ordered_roots(fs_info, items, 0, (u64)-1);
+			btrfs_wait_ordered_roots(fs_info, items, NULL);
 		} else {
 			time_left = schedule_timeout_killable(1);
 			if (time_left)
···
 		 * because that does not wait for a transaction to fully commit
 		 * (only for it to be unblocked, state TRANS_STATE_UNBLOCKED).
 		 */
-		trans = btrfs_attach_transaction_barrier(root);
-		if (IS_ERR(trans)) {
-			ret = PTR_ERR(trans);
-			if (ret == -ENOENT)
-				ret = 0;
-			break;
-		}
-		ret = btrfs_commit_transaction(trans);
+		ret = btrfs_commit_current_transaction(root);
 		break;
 	default:
 		ret = -ENOSPC;
···
 	spin_unlock(&sinfo->lock);

 	return free_bytes;
+}
+
+static u64 calc_pct_ratio(u64 x, u64 y)
+{
+	int err;
+
+	if (!y)
+		return 0;
+again:
+	err = check_mul_overflow(100, x, &x);
+	if (err)
+		goto lose_precision;
+	return div64_u64(x, y);
+lose_precision:
+	x >>= 10;
+	y >>= 10;
+	if (!y)
+		y = 1;
+	goto again;
+}
+
+/*
+ * A reasonable buffer for unallocated space is 10 data block_groups.
+ * If we claw this back repeatedly, we can still achieve efficient
+ * utilization when near full, and not do too much reclaim while
+ * always maintaining a solid buffer for workloads that quickly
+ * allocate and pressure the unallocated space.
+ */
+static u64 calc_unalloc_target(struct btrfs_fs_info *fs_info)
+{
+	u64 chunk_sz = calc_effective_data_chunk_size(fs_info);
+
+	return BTRFS_UNALLOC_BLOCK_GROUP_TARGET * chunk_sz;
+}
+
+/*
+ * The fundamental goal of automatic reclaim is to protect the filesystem's
+ * unallocated space and thus minimize the probability of the filesystem going
+ * read only when a metadata allocation failure causes a transaction abort.
+ *
+ * However, relocations happen into the space_info's unused space, therefore
+ * automatic reclaim must also back off as that space runs low. There is no
+ * value in doing trivial "relocations" of re-writing the same block group
+ * into a fresh one.
+ *
+ * Furthermore, we want to avoid doing too much reclaim even if there are good
+ * candidates. This is because the allocator is pretty good at filling up the
+ * holes with writes. So we want to do just enough reclaim to try and stay
+ * safe from running out of unallocated space but not be wasteful about it.
+ *
+ * Therefore, the dynamic reclaim threshold is calculated as follows:
+ * - calculate a target unallocated amount of 5 block group sized chunks
+ * - ratchet up the intensity of reclaim depending on how far we are from
+ *   that target by using a formula of unalloc / target to set the threshold.
+ *
+ * Typically with 10 block groups as the target, the discrete values this comes
+ * out to are 0, 10, 20, ... , 80, 90, and 99.
+ */
+static int calc_dynamic_reclaim_threshold(struct btrfs_space_info *space_info)
+{
+	struct btrfs_fs_info *fs_info = space_info->fs_info;
+	u64 unalloc = atomic64_read(&fs_info->free_chunk_space);
+	u64 target = calc_unalloc_target(fs_info);
+	u64 alloc = space_info->total_bytes;
+	u64 used = btrfs_space_info_used(space_info, false);
+	u64 unused = alloc - used;
+	u64 want = target > unalloc ? target - unalloc : 0;
+	u64 data_chunk_size = calc_effective_data_chunk_size(fs_info);
+
+	/* If we have no unused space, don't bother, it won't work anyway. */
+	if (unused < data_chunk_size)
+		return 0;
+
+	/* Cast to int is OK because want <= target. */
+	return calc_pct_ratio(want, target);
+}
+
+int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info)
+{
+	lockdep_assert_held(&space_info->lock);
+
+	if (READ_ONCE(space_info->dynamic_reclaim))
+		return calc_dynamic_reclaim_threshold(space_info);
+	return READ_ONCE(space_info->bg_reclaim_threshold);
+}
+
+/*
+ * Under "urgent" reclaim, we will reclaim even fresh block groups that have
+ * recently seen successful allocations, as we are desperate to reclaim
+ * whatever we can to avoid ENOSPC in a transaction leading to a readonly fs.
+ */
+static bool is_reclaim_urgent(struct btrfs_space_info *space_info)
+{
+	struct btrfs_fs_info *fs_info = space_info->fs_info;
+	u64 unalloc = atomic64_read(&fs_info->free_chunk_space);
+	u64 data_chunk_size = calc_effective_data_chunk_size(fs_info);
+
+	return unalloc < data_chunk_size;
+}
+
+static int do_reclaim_sweep(struct btrfs_fs_info *fs_info,
+			    struct btrfs_space_info *space_info, int raid)
+{
+	struct btrfs_block_group *bg;
+	int thresh_pct;
+	bool try_again = true;
+	bool urgent;
+
+	spin_lock(&space_info->lock);
+	urgent = is_reclaim_urgent(space_info);
+	thresh_pct = btrfs_calc_reclaim_threshold(space_info);
+	spin_unlock(&space_info->lock);
+
+	down_read(&space_info->groups_sem);
+again:
+	list_for_each_entry(bg, &space_info->block_groups[raid], list) {
+		u64 thresh;
+		bool reclaim = false;
+
+		btrfs_get_block_group(bg);
+		spin_lock(&bg->lock);
+		thresh = mult_perc(bg->length, thresh_pct);
+		if (bg->used < thresh && bg->reclaim_mark) {
+			try_again = false;
+			reclaim = true;
+		}
+		bg->reclaim_mark++;
+		spin_unlock(&bg->lock);
+		if (reclaim)
+			btrfs_mark_bg_to_reclaim(bg);
+		btrfs_put_block_group(bg);
+	}
+
+	/*
+	 * In situations where we are very motivated to reclaim (low unalloc)
+	 * use two passes to make the reclaim mark check best effort.
+	 *
+	 * If we have any staler groups, we don't touch the fresher ones, but if we
+	 * really need a block group, do take a fresh one.
+	 */
+	if (try_again && urgent) {
+		try_again = false;
+		goto again;
+	}
+
+	up_read(&space_info->groups_sem);
+	return 0;
+}
+
+void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes)
+{
+	u64 chunk_sz = calc_effective_data_chunk_size(space_info->fs_info);
+
+	lockdep_assert_held(&space_info->lock);
+	space_info->reclaimable_bytes += bytes;
+
+	if (space_info->reclaimable_bytes >= chunk_sz)
+		btrfs_set_periodic_reclaim_ready(space_info, true);
+}
+
+void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready)
+{
+	lockdep_assert_held(&space_info->lock);
+	if (!READ_ONCE(space_info->periodic_reclaim))
+		return;
+	if (ready != space_info->periodic_reclaim_ready) {
+		space_info->periodic_reclaim_ready = ready;
+		if (!ready)
+			space_info->reclaimable_bytes = 0;
+	}
+}
+
+bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info)
+{
+	bool ret;
+
+	if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
+		return false;
+	if (!READ_ONCE(space_info->periodic_reclaim))
+		return false;
+
+	spin_lock(&space_info->lock);
+	ret = space_info->periodic_reclaim_ready;
+	btrfs_set_periodic_reclaim_ready(space_info, false);
+	spin_unlock(&space_info->lock);
+
+	return ret;
+}
+
+int btrfs_reclaim_sweep(struct btrfs_fs_info *fs_info)
+{
+	int ret;
+	int raid;
+	struct btrfs_space_info *space_info;
+
+	list_for_each_entry(space_info, &fs_info->space_info, list) {
+		if (!btrfs_should_periodic_reclaim(space_info))
+			continue;
+		for (raid = 0; raid < BTRFS_NR_RAID_TYPES; raid++) {
+			ret = do_reclaim_sweep(fs_info, space_info, raid);
+			if (ret)
+				return ret;
+		}
+	}
+
+	return ret;
 }
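The threshold arithmetic added to space-info.c above can be tried outside the kernel. The following is a minimal userspace sketch, not the kernel code: the names `pct_ratio` and `dynamic_reclaim_threshold`, the `GIB` macro, and passing the unallocated, total, used and chunk sizes as plain integers are all assumptions made for illustration (the kernel reads them from `fs_info` and `space_info` under `space_info->lock`). The overflow guard here is a plain comparison standing in for `check_mul_overflow()`, and the sketch assumes `used <= total`.

```c
#include <stdint.h>

#define GIB (1ULL << 30)
/* Mirrors BTRFS_UNALLOC_BLOCK_GROUP_TARGET from the diff. */
#define UNALLOC_BLOCK_GROUP_TARGET 10ULL

/*
 * Hypothetical stand-in for calc_pct_ratio(): returns 100 * x / y,
 * shifting both operands down while 100 * x would overflow 64 bits.
 */
static uint64_t pct_ratio(uint64_t x, uint64_t y)
{
	if (y == 0)
		return 0;
	while (x > UINT64_MAX / 100) {
		/* Trade precision for range, as the kernel helper does. */
		x >>= 10;
		y >>= 10;
		if (y == 0)
			y = 1;
	}
	return (100 * x) / y;
}

/*
 * Hypothetical stand-in for calc_dynamic_reclaim_threshold().
 * The further unallocated space falls below the target of
 * UNALLOC_BLOCK_GROUP_TARGET chunks, the higher the returned percentage.
 */
static int dynamic_reclaim_threshold(uint64_t unalloc, uint64_t total,
				     uint64_t used, uint64_t chunk_size)
{
	uint64_t target = UNALLOC_BLOCK_GROUP_TARGET * chunk_size;
	uint64_t unused = total - used;	/* assumes used <= total */
	uint64_t want = target > unalloc ? target - unalloc : 0;

	/* Without a chunk's worth of unused space, relocation cannot help. */
	if (unused < chunk_size)
		return 0;

	return (int)pct_ratio(want, target);
}
```

With 1 GiB chunks the target is 10 GiB, so 3 GiB of unallocated space gives a threshold of 70: block groups less than 70% used become reclaim candidates in this sketch.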
+48
fs/btrfs/space-info.h
···
 };

 struct btrfs_space_info {
+	struct btrfs_fs_info *fs_info;
 	spinlock_t lock;

 	u64 total_bytes;	/* total bytes in the space,
···

 	struct kobject kobj;
 	struct kobject *block_group_kobjs[BTRFS_NR_RAID_TYPES];
+
+	/*
+	 * Monotonically increasing counter of block group reclaim attempts
+	 * Exposed in /sys/fs/<uuid>/allocation/<type>/reclaim_count
+	 */
+	u64 reclaim_count;
+
+	/*
+	 * Monotonically increasing counter of reclaimed bytes
+	 * Exposed in /sys/fs/<uuid>/allocation/<type>/reclaim_bytes
+	 */
+	u64 reclaim_bytes;
+
+	/*
+	 * Monotonically increasing counter of reclaim errors
+	 * Exposed in /sys/fs/<uuid>/allocation/<type>/reclaim_errors
+	 */
+	u64 reclaim_errors;
+
+	/*
+	 * If true, use the dynamic relocation threshold, instead of the
+	 * fixed bg_reclaim_threshold.
+	 */
+	bool dynamic_reclaim;
+
+	/*
+	 * Periodically check all block groups against the reclaim
+	 * threshold in the cleaner thread.
+	 */
+	bool periodic_reclaim;
+
+	/*
+	 * Periodic reclaim should be a no-op if a space_info hasn't
+	 * freed any space since the last time we tried.
+	 */
+	bool periodic_reclaim_ready;
+
+	/*
+	 * Net bytes freed or allocated since the last reclaim pass.
+	 */
+	s64 reclaimable_bytes;
 };

 struct reserve_ticket {
···
 void btrfs_dump_space_info_for_trans_abort(struct btrfs_fs_info *fs_info);
 void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info);
 u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
+
+void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes);
+void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready);
+bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info);
+int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info);
+int btrfs_reclaim_sweep(struct btrfs_fs_info *fs_info);

 #endif /* BTRFS_SPACE_INFO_H */
+147 -15
fs/btrfs/subpage.c
···
 	 * mapping. And if page->mapping->host is data inode, it's subpage.
 	 * As we have ruled our sectorsize >= PAGE_SIZE case already.
 	 */
-	if (!mapping || !mapping->host || is_data_inode(mapping->host))
+	if (!mapping || !mapping->host || is_data_inode(BTRFS_I(mapping->host)))
 		return true;

 	/*
···

 #define subpage_calc_start_bit(fs_info, folio, name, start, len) \
 ({ \
-	unsigned int start_bit; \
+	unsigned int __start_bit; \
 	\
 	btrfs_subpage_assert(fs_info, folio, start, len); \
-	start_bit = offset_in_page(start) >> fs_info->sectorsize_bits; \
-	start_bit += fs_info->subpage_info->name##_offset; \
-	start_bit; \
+	__start_bit = offset_in_page(start) >> fs_info->sectorsize_bits; \
+	__start_bit += fs_info->subpage_info->name##_offset; \
+	__start_bit; \
 })

 void btrfs_subpage_start_reader(const struct btrfs_fs_info *fs_info,
···
 	bool last;

 	btrfs_subpage_assert(fs_info, folio, start, len);
-	is_data = is_data_inode(folio->mapping->host);
+	is_data = is_data_inode(BTRFS_I(folio->mapping->host));

 	spin_lock_irqsave(&subpage->lock, flags);

···
  * Make sure not only the page dirty bit is cleared, but also subpage dirty bit
  * is cleared.
  */
-void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info, struct folio *folio)
+void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info,
+				  struct folio *folio, u64 start, u32 len)
 {
-	struct btrfs_subpage *subpage = folio_get_private(folio);
+	struct btrfs_subpage *subpage;
+	unsigned int start_bit;
+	unsigned int nbits;
+	unsigned long flags;

 	if (!IS_ENABLED(CONFIG_BTRFS_ASSERT))
 		return;

-	ASSERT(!folio_test_dirty(folio));
-	if (!btrfs_is_subpage(fs_info, folio->mapping))
+	if (!btrfs_is_subpage(fs_info, folio->mapping)) {
+		ASSERT(!folio_test_dirty(folio));
 		return;
+	}

-	ASSERT(folio_test_private(folio) && folio_get_private(folio));
-	ASSERT(subpage_test_bitmap_all_zero(fs_info, subpage, dirty));
+	start_bit = subpage_calc_start_bit(fs_info, folio, dirty, start, len);
+	nbits = len >> fs_info->sectorsize_bits;
+	subpage = folio_get_private(folio);
+	ASSERT(subpage);
+	spin_lock_irqsave(&subpage->lock, flags);
+	ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
+	spin_unlock_irqrestore(&subpage->lock, flags);
 }

 /*
···
 	btrfs_folio_end_writer_lock(fs_info, folio, start, len);
 }

+/*
+ * This is for folio already locked by plain lock_page()/folio_lock(), which
+ * doesn't have any subpage awareness.
+ *
+ * This populates the involved subpage ranges so that subpage helpers can
+ * properly unlock them.
+ */
+void btrfs_folio_set_writer_lock(const struct btrfs_fs_info *fs_info,
+				 struct folio *folio, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage;
+	unsigned long flags;
+	unsigned int start_bit;
+	unsigned int nbits;
+	int ret;
+
+	ASSERT(folio_test_locked(folio));
+	if (unlikely(!fs_info) || !btrfs_is_subpage(fs_info, folio->mapping))
+		return;
+
+	subpage = folio_get_private(folio);
+	start_bit = subpage_calc_start_bit(fs_info, folio, locked, start, len);
+	nbits = len >> fs_info->sectorsize_bits;
+	spin_lock_irqsave(&subpage->lock, flags);
+	/* Target range should not yet be locked. */
+	ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
+	bitmap_set(subpage->bitmaps, start_bit, nbits);
+	ret = atomic_add_return(nbits, &subpage->writers);
+	ASSERT(ret <= fs_info->subpage_info->bitmap_nr_bits);
+	spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
+/*
+ * Find any subpage writer locked range inside @folio, starting at file offset
+ * @search_start. The caller should ensure the folio is locked.
+ *
+ * Return true and update @found_start_ret and @found_len_ret to the first
+ * writer locked range.
+ * Return false if there is no writer locked range.
+ */
+bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
+				      struct folio *folio, u64 search_start,
+				      u64 *found_start_ret, u32 *found_len_ret)
+{
+	struct btrfs_subpage_info *subpage_info = fs_info->subpage_info;
+	struct btrfs_subpage *subpage = folio_get_private(folio);
+	const unsigned int len = PAGE_SIZE - offset_in_page(search_start);
+	const unsigned int start_bit = subpage_calc_start_bit(fs_info, folio,
+						locked, search_start, len);
+	const unsigned int locked_bitmap_start = subpage_info->locked_offset;
+	const unsigned int locked_bitmap_end = locked_bitmap_start +
+					       subpage_info->bitmap_nr_bits;
+	unsigned long flags;
+	int first_zero;
+	int first_set;
+	bool found = false;
+
+	ASSERT(folio_test_locked(folio));
+	spin_lock_irqsave(&subpage->lock, flags);
+	first_set = find_next_bit(subpage->bitmaps, locked_bitmap_end, start_bit);
+	if (first_set >= locked_bitmap_end)
+		goto out;
+
+	found = true;
+
+	*found_start_ret = folio_pos(folio) +
+		((first_set - locked_bitmap_start) << fs_info->sectorsize_bits);
+	/*
+	 * Since @first_set is ensured to be smaller than locked_bitmap_end
+	 * here, @found_start_ret should be inside the folio.
+	 */
+	ASSERT(*found_start_ret < folio_pos(folio) + PAGE_SIZE);
+
+	first_zero = find_next_zero_bit(subpage->bitmaps, locked_bitmap_end, first_set);
+	*found_len_ret = (first_zero - first_set) << fs_info->sectorsize_bits;
+out:
+	spin_unlock_irqrestore(&subpage->lock, flags);
+	return found;
+}
+
+/*
+ * Unlike btrfs_folio_end_writer_lock() which unlocks a specified subpage range,
+ * this ends all writer locked ranges of a page.
+ *
+ * This is for the locked page of __extent_writepage(), as the locked page
+ * can contain several locked subpage ranges.
+ */
+void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info, struct folio *folio)
+{
+	struct btrfs_subpage *subpage = folio_get_private(folio);
+	u64 folio_start = folio_pos(folio);
+	u64 cur = folio_start;
+
+	ASSERT(folio_test_locked(folio));
+	if (!btrfs_is_subpage(fs_info, folio->mapping)) {
+		folio_unlock(folio);
+		return;
+	}
+
+	/* The page has no new delalloc range locked on it. Just plain unlock. */
+	if (atomic_read(&subpage->writers) == 0) {
+		folio_unlock(folio);
+		return;
+	}
+	while (cur < folio_start + PAGE_SIZE) {
+		u64 found_start;
+		u32 found_len;
+		bool found;
+		bool last;
+
+		found = btrfs_subpage_find_writer_locked(fs_info, folio, cur,
+							 &found_start, &found_len);
+		if (!found)
+			break;
+		last = btrfs_subpage_end_and_test_writer(fs_info, folio,
+							 found_start, found_len);
+		if (last) {
+			folio_unlock(folio);
+			break;
+		}
+		cur = found_start + found_len;
+	}
+}
+
 #define GET_SUBPAGE_BITMAP(subpage, subpage_info, name, dst) \
 	bitmap_cut(dst, subpage->bitmaps, 0, \
 		   subpage_info->name##_offset, subpage_info->bitmap_nr_bits)
···
 	struct btrfs_subpage_info *subpage_info = fs_info->subpage_info;
 	struct btrfs_subpage *subpage;
 	unsigned long uptodate_bitmap;
-	unsigned long error_bitmap;
 	unsigned long dirty_bitmap;
 	unsigned long writeback_bitmap;
 	unsigned long ordered_bitmap;
···

 	dump_page(folio_page(folio, 0), "btrfs subpage dump");
 	btrfs_warn(fs_info,
-"start=%llu len=%u page=%llu, bitmaps uptodate=%*pbl error=%*pbl dirty=%*pbl writeback=%*pbl ordered=%*pbl checked=%*pbl",
+"start=%llu len=%u page=%llu, bitmaps uptodate=%*pbl dirty=%*pbl writeback=%*pbl ordered=%*pbl checked=%*pbl",
 		    start, len, folio_pos(folio),
 		    subpage_info->bitmap_nr_bits, &uptodate_bitmap,
-		    subpage_info->bitmap_nr_bits, &error_bitmap,
 		    subpage_info->bitmap_nr_bits, &dirty_bitmap,
 		    subpage_info->bitmap_nr_bits, &writeback_bitmap,
 		    subpage_info->bitmap_nr_bits, &ordered_bitmap,
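The range search in btrfs_subpage_find_writer_locked() above comes down to "find the next set bit, then the next clear bit after it". The following userspace sketch shows that pattern under simplifying assumptions: the locked bitmap is a single 16-bit word, pages are 64 KiB with 4 KiB sectors, and there is no spinlock. The function name `find_locked_range` and the two macros are hypothetical, not kernel API.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_BITS 12u		/* assumed 4 KiB sectors */
#define SECTORS_PER_PAGE 16u	/* assumed 64 KiB page / 4 KiB sector */

/*
 * Find the first run of set bits in @locked at or after the sector that
 * contains @search_start. On success, store the byte offset and byte
 * length of the run and return true. @search_start must not be below
 * @page_start.
 */
static bool find_locked_range(uint16_t locked, uint64_t page_start,
			      uint64_t search_start,
			      uint64_t *found_start, uint32_t *found_len)
{
	uint64_t bit = (search_start - page_start) >> SECTOR_BITS;
	unsigned int first_set;
	unsigned int first_zero;

	if (bit >= SECTORS_PER_PAGE)
		return false;

	/* Equivalent of find_next_bit() on a one-word bitmap. */
	for (first_set = (unsigned int)bit; first_set < SECTORS_PER_PAGE; first_set++)
		if (locked & (1u << first_set))
			break;
	if (first_set >= SECTORS_PER_PAGE)
		return false;

	/* Equivalent of find_next_zero_bit(), starting at the run's first bit. */
	for (first_zero = first_set; first_zero < SECTORS_PER_PAGE; first_zero++)
		if (!(locked & (1u << first_zero)))
			break;

	*found_start = page_start + ((uint64_t)first_set << SECTOR_BITS);
	*found_len = (first_zero - first_set) << SECTOR_BITS;
	return true;
}
```

A caller can walk every locked range the way btrfs_folio_end_all_writers() does, by restarting the search at `found_start + found_len` until the function returns false.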
+8 -1
fs/btrfs/subpage.h
···
 			       struct folio *folio, u64 start, u32 len);
 void btrfs_folio_end_writer_lock(const struct btrfs_fs_info *fs_info,
 				 struct folio *folio, u64 start, u32 len);
+void btrfs_folio_set_writer_lock(const struct btrfs_fs_info *fs_info,
+				 struct folio *folio, u64 start, u32 len);
+bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
+				      struct folio *folio, u64 search_start,
+				      u64 *found_start_ret, u32 *found_len_ret);
+void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info, struct folio *folio);

 /*
  * Template for subpage related operations.
···
 bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
 					struct folio *folio, u64 start, u32 len);

-void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info, struct folio *folio);
+void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info,
+				  struct folio *folio, u64 start, u32 len);
 void btrfs_folio_unlock_writer(struct btrfs_fs_info *fs_info,
 			       struct folio *folio, u64 start, u32 len);
 void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
+33 -18
fs/btrfs/super.c
···
 #include "disk-io.h"
 #include "transaction.h"
 #include "btrfs_inode.h"
+#include "direct-io.h"
 #include "props.h"
 #include "xattr.h"
 #include "bio.h"
···
 	Opt_rescue,
 	Opt_usebackuproot,
 	Opt_nologreplay,
-	Opt_ignorebadroots,
-	Opt_ignoredatacsums,
-	Opt_rescue_all,

 	/* Debugging options */
 	Opt_enospc_debug,
···
 	Opt_rescue_nologreplay,
 	Opt_rescue_ignorebadroots,
 	Opt_rescue_ignoredatacsums,
+	Opt_rescue_ignoremetacsums,
+	Opt_rescue_ignoresuperflags,
 	Opt_rescue_parameter_all,
 };

···
 	{ "ignorebadroots", Opt_rescue_ignorebadroots },
 	{ "ibadroots", Opt_rescue_ignorebadroots },
 	{ "ignoredatacsums", Opt_rescue_ignoredatacsums },
+	{ "ignoremetacsums", Opt_rescue_ignoremetacsums},
+	{ "ignoresuperflags", Opt_rescue_ignoresuperflags},
 	{ "idatacsums", Opt_rescue_ignoredatacsums },
+	{ "imetacsums", Opt_rescue_ignoremetacsums},
+	{ "isuperflags", Opt_rescue_ignoresuperflags},
 	{ "all", Opt_rescue_parameter_all },
 	{}
 };
···
 		case Opt_rescue_ignoredatacsums:
 			btrfs_set_opt(ctx->mount_opt, IGNOREDATACSUMS);
 			break;
+		case Opt_rescue_ignoremetacsums:
+			btrfs_set_opt(ctx->mount_opt, IGNOREMETACSUMS);
+			break;
+		case Opt_rescue_ignoresuperflags:
+			btrfs_set_opt(ctx->mount_opt, IGNORESUPERFLAGS);
+			break;
 		case Opt_rescue_parameter_all:
 			btrfs_set_opt(ctx->mount_opt, IGNOREDATACSUMS);
+			btrfs_set_opt(ctx->mount_opt, IGNOREMETACSUMS);
+			btrfs_set_opt(ctx->mount_opt, IGNORESUPERFLAGS);
 			btrfs_set_opt(ctx->mount_opt, IGNOREBADROOTS);
 			btrfs_set_opt(ctx->mount_opt, NOLOGREPLAY);
 			break;
···
 		btrfs_clear_opt(fs_info->mount_opt, NOSPACECACHE);
 }

-static bool check_ro_option(struct btrfs_fs_info *fs_info,
+static bool check_ro_option(const struct btrfs_fs_info *fs_info,
 			    unsigned long mount_opt, unsigned long opt,
 			    const char *opt_name)
 {
···
 	return false;
 }

-bool btrfs_check_options(struct btrfs_fs_info *info, unsigned long *mount_opt,
+bool btrfs_check_options(const struct btrfs_fs_info *info, unsigned long *mount_opt,
 			 unsigned long flags)
 {
 	bool ret = true;
···
 	if (!(flags & SB_RDONLY) &&
 	    (check_ro_option(info, *mount_opt, BTRFS_MOUNT_NOLOGREPLAY, "nologreplay") ||
 	     check_ro_option(info, *mount_opt, BTRFS_MOUNT_IGNOREBADROOTS, "ignorebadroots") ||
-	     check_ro_option(info, *mount_opt, BTRFS_MOUNT_IGNOREDATACSUMS, "ignoredatacsums")))
+	     check_ro_option(info, *mount_opt, BTRFS_MOUNT_IGNOREDATACSUMS, "ignoredatacsums") ||
+	     check_ro_option(info, *mount_opt, BTRFS_MOUNT_IGNOREMETACSUMS, "ignoremetacsums") ||
+	     check_ro_option(info, *mount_opt, BTRFS_MOUNT_IGNORESUPERFLAGS, "ignoresuperflags")))
 		ret = false;

 	if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE) &&
···
 		return err;
 	}

-	inode = btrfs_iget(sb, BTRFS_FIRST_FREE_OBJECTID, fs_info->fs_root);
+	inode = btrfs_iget(BTRFS_FIRST_FREE_OBJECTID, fs_info->fs_root);
 	if (IS_ERR(inode)) {
 		err = PTR_ERR(inode);
 		btrfs_handle_fs_error(fs_info, err, NULL);
···
 		return 0;
 	}

-	btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
+	btrfs_wait_ordered_roots(fs_info, U64_MAX, NULL);

 	trans = btrfs_attach_transaction_barrier(root);
 	if (IS_ERR(trans)) {
···
 		print_rescue_option(seq, "ignorebadroots", &printed);
 	if (btrfs_test_opt(info, IGNOREDATACSUMS))
 		print_rescue_option(seq, "ignoredatacsums", &printed);
+	if (btrfs_test_opt(info, IGNOREMETACSUMS))
+		print_rescue_option(seq, "ignoremetacsums", &printed);
+	if (btrfs_test_opt(info, IGNORESUPERFLAGS))
+		print_rescue_option(seq, "ignoresuperflags", &printed);
 	if (btrfs_test_opt(info, FLUSHONCOMMIT))
 		seq_puts(seq, ",flushoncommit");
 	if (btrfs_test_opt(info, DISCARD_SYNC))
···
 	btrfs_info_if_set(info, old, USEBACKUPROOT, "trying to use backup root at mount time");
 	btrfs_info_if_set(info, old, IGNOREBADROOTS, "ignoring bad roots");
 	btrfs_info_if_set(info, old, IGNOREDATACSUMS, "ignoring data csums");
+	btrfs_info_if_set(info, old, IGNOREMETACSUMS, "ignoring meta csums");
+	btrfs_info_if_set(info, old, IGNORESUPERFLAGS, "ignoring unknown super block flags");

 	btrfs_info_if_unset(info, old, NODATACOW, "setting datacow");
 	btrfs_info_if_unset(info, old, SSD, "not using ssd optimizations");
···

 static int btrfs_freeze(struct super_block *sb)
 {
-	struct btrfs_trans_handle *trans;
 	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
-	struct btrfs_root *root = fs_info->tree_root;

 	set_bit(BTRFS_FS_FROZEN, &fs_info->flags);
 	/*
···
 	 * we want to avoid on a frozen filesystem), or do the commit
 	 * ourselves.
2288 2270 */ 2289 - trans = btrfs_attach_transaction_barrier(root); 2290 - if (IS_ERR(trans)) { 2291 - /* no transaction, don't bother */ 2292 - if (PTR_ERR(trans) == -ENOENT) 2293 - return 0; 2294 - return PTR_ERR(trans); 2295 - } 2296 - return btrfs_commit_transaction(trans); 2271 + return btrfs_commit_current_transaction(fs_info->tree_root); 2297 2272 } 2298 2273 2299 2274 static int check_dev_super(struct btrfs_device *dev) ··· 2510 2499 .init_func = btrfs_init_cachep, 2511 2500 .exit_func = btrfs_destroy_cachep, 2512 2501 }, { 2502 + .init_func = btrfs_init_dio, 2503 + .exit_func = btrfs_destroy_dio, 2504 + }, { 2513 2505 .init_func = btrfs_transaction_init, 2514 2506 .exit_func = btrfs_transaction_exit, 2515 2507 }, { ··· 2604 2590 late_initcall(init_btrfs_fs); 2605 2591 module_exit(exit_btrfs_fs) 2606 2592 2593 + MODULE_DESCRIPTION("B-Tree File System (BTRFS)"); 2607 2594 MODULE_LICENSE("GPL"); 2608 2595 MODULE_SOFTDEP("pre: crc32c"); 2609 2596 MODULE_SOFTDEP("pre: xxhash64");
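The `rescue=` changes in the fs/btrfs/super.c diff above can be summarized with a small userspace model. This is only a sketch under stated assumptions: the `RESCUE_*` flag values, `rescue_bits()` and `rescue_opts_allowed()` are hypothetical names invented for illustration and are not kernel APIs. The option names, the short aliases, the expansion of `all`, and the read-only requirement are taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bit flags standing in for the kernel's BTRFS_MOUNT_* options. */
enum {
	RESCUE_NOLOGREPLAY      = 1u << 0,
	RESCUE_IGNOREBADROOTS   = 1u << 1,
	RESCUE_IGNOREDATACSUMS  = 1u << 2,
	RESCUE_IGNOREMETACSUMS  = 1u << 3,	/* new in this pull */
	RESCUE_IGNORESUPERFLAGS = 1u << 4,	/* new in this pull */
	/* "all" sets every flag above, as in the Opt_rescue_parameter_all case. */
	RESCUE_ALL = RESCUE_NOLOGREPLAY | RESCUE_IGNOREBADROOTS |
		     RESCUE_IGNOREDATACSUMS | RESCUE_IGNOREMETACSUMS |
		     RESCUE_IGNORESUPERFLAGS,
};

struct rescue_name {
	const char *name;
	unsigned int bits;
};

/* Long names and short aliases as they appear in the diff. */
static const struct rescue_name rescue_names[] = {
	{ "nologreplay",      RESCUE_NOLOGREPLAY },
	{ "ignorebadroots",   RESCUE_IGNOREBADROOTS },
	{ "ibadroots",        RESCUE_IGNOREBADROOTS },
	{ "ignoredatacsums",  RESCUE_IGNOREDATACSUMS },
	{ "idatacsums",       RESCUE_IGNOREDATACSUMS },
	{ "ignoremetacsums",  RESCUE_IGNOREMETACSUMS },
	{ "imetacsums",       RESCUE_IGNOREMETACSUMS },
	{ "ignoresuperflags", RESCUE_IGNORESUPERFLAGS },
	{ "isuperflags",      RESCUE_IGNORESUPERFLAGS },
	{ "all",              RESCUE_ALL },
};

/* Return the flag bits for one rescue= value, or 0 if the name is unknown. */
static unsigned int rescue_bits(const char *name)
{
	assert(name != NULL);
	for (size_t i = 0; i < sizeof(rescue_names) / sizeof(rescue_names[0]); i++) {
		if (strcmp(rescue_names[i].name, name) == 0)
			return rescue_names[i].bits;
	}
	return 0;
}

/* Model of the read-only check: any modeled rescue flag needs a read-only mount. */
static bool rescue_opts_allowed(unsigned int opts, bool read_only)
{
	return read_only || (opts & RESCUE_ALL) == 0;
}
```

In this model, `rescue_bits("all")` yields every flag, and `rescue_opts_allowed()` rejects any of them on a read-write mount, which mirrors the two `check_ro_option()` calls added to `btrfs_check_options()`.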
+1 -1
fs/btrfs/super.h
@@ -10,7 +10,7 @@
 struct super_block;
 struct btrfs_fs_info;
 
-bool btrfs_check_options(struct btrfs_fs_info *info, unsigned long *mount_opt,
+bool btrfs_check_options(const struct btrfs_fs_info *info, unsigned long *mount_opt,
 			 unsigned long flags);
 int btrfs_sync_fs(struct super_block *sb, int wait);
 char *btrfs_get_subvol_name_from_objectid(struct btrfs_fs_info *fs_info,
+84 -1
fs/btrfs/sysfs.c
··· 385 385 "nologreplay", 386 386 "ignorebadroots", 387 387 "ignoredatacsums", 388 + "ignoremetacsums", 389 + "ignoresuperflags", 388 390 "all", 389 391 }; 390 392 ··· 896 894 SPACE_INFO_ATTR(bytes_zone_unusable); 897 895 SPACE_INFO_ATTR(disk_used); 898 896 SPACE_INFO_ATTR(disk_total); 897 + SPACE_INFO_ATTR(reclaim_count); 898 + SPACE_INFO_ATTR(reclaim_bytes); 899 + SPACE_INFO_ATTR(reclaim_errors); 899 900 BTRFS_ATTR_RW(space_info, chunk_size, btrfs_chunk_size_show, btrfs_chunk_size_store); 900 901 BTRFS_ATTR(space_info, size_classes, btrfs_size_classes_show); 901 902 ··· 907 902 char *buf) 908 903 { 909 904 struct btrfs_space_info *space_info = to_space_info(kobj); 905 + ssize_t ret; 910 906 911 - return sysfs_emit(buf, "%d\n", READ_ONCE(space_info->bg_reclaim_threshold)); 907 + spin_lock(&space_info->lock); 908 + ret = sysfs_emit(buf, "%d\n", btrfs_calc_reclaim_threshold(space_info)); 909 + spin_unlock(&space_info->lock); 910 + return ret; 912 911 } 913 912 914 913 static ssize_t btrfs_sinfo_bg_reclaim_threshold_store(struct kobject *kobj, ··· 922 913 struct btrfs_space_info *space_info = to_space_info(kobj); 923 914 int thresh; 924 915 int ret; 916 + 917 + if (READ_ONCE(space_info->dynamic_reclaim)) 918 + return -EINVAL; 925 919 926 920 ret = kstrtoint(buf, 10, &thresh); 927 921 if (ret) ··· 941 929 BTRFS_ATTR_RW(space_info, bg_reclaim_threshold, 942 930 btrfs_sinfo_bg_reclaim_threshold_show, 943 931 btrfs_sinfo_bg_reclaim_threshold_store); 932 + 933 + static ssize_t btrfs_sinfo_dynamic_reclaim_show(struct kobject *kobj, 934 + struct kobj_attribute *a, 935 + char *buf) 936 + { 937 + struct btrfs_space_info *space_info = to_space_info(kobj); 938 + 939 + return sysfs_emit(buf, "%d\n", READ_ONCE(space_info->dynamic_reclaim)); 940 + } 941 + 942 + static ssize_t btrfs_sinfo_dynamic_reclaim_store(struct kobject *kobj, 943 + struct kobj_attribute *a, 944 + const char *buf, size_t len) 945 + { 946 + struct btrfs_space_info *space_info = to_space_info(kobj); 947 + int 
dynamic_reclaim; 948 + int ret; 949 + 950 + ret = kstrtoint(buf, 10, &dynamic_reclaim); 951 + if (ret) 952 + return ret; 953 + 954 + if (dynamic_reclaim < 0) 955 + return -EINVAL; 956 + 957 + WRITE_ONCE(space_info->dynamic_reclaim, dynamic_reclaim != 0); 958 + 959 + return len; 960 + } 961 + 962 + BTRFS_ATTR_RW(space_info, dynamic_reclaim, 963 + btrfs_sinfo_dynamic_reclaim_show, 964 + btrfs_sinfo_dynamic_reclaim_store); 965 + 966 + static ssize_t btrfs_sinfo_periodic_reclaim_show(struct kobject *kobj, 967 + struct kobj_attribute *a, 968 + char *buf) 969 + { 970 + struct btrfs_space_info *space_info = to_space_info(kobj); 971 + 972 + return sysfs_emit(buf, "%d\n", READ_ONCE(space_info->periodic_reclaim)); 973 + } 974 + 975 + static ssize_t btrfs_sinfo_periodic_reclaim_store(struct kobject *kobj, 976 + struct kobj_attribute *a, 977 + const char *buf, size_t len) 978 + { 979 + struct btrfs_space_info *space_info = to_space_info(kobj); 980 + int periodic_reclaim; 981 + int ret; 982 + 983 + ret = kstrtoint(buf, 10, &periodic_reclaim); 984 + if (ret) 985 + return ret; 986 + 987 + if (periodic_reclaim < 0) 988 + return -EINVAL; 989 + 990 + WRITE_ONCE(space_info->periodic_reclaim, periodic_reclaim != 0); 991 + 992 + return len; 993 + } 994 + 995 + BTRFS_ATTR_RW(space_info, periodic_reclaim, 996 + btrfs_sinfo_periodic_reclaim_show, 997 + btrfs_sinfo_periodic_reclaim_store); 944 998 945 999 /* 946 1000 * Allocation information about block group types. 
··· 1025 947 BTRFS_ATTR_PTR(space_info, disk_used), 1026 948 BTRFS_ATTR_PTR(space_info, disk_total), 1027 949 BTRFS_ATTR_PTR(space_info, bg_reclaim_threshold), 950 + BTRFS_ATTR_PTR(space_info, dynamic_reclaim), 1028 951 BTRFS_ATTR_PTR(space_info, chunk_size), 1029 952 BTRFS_ATTR_PTR(space_info, size_classes), 953 + BTRFS_ATTR_PTR(space_info, reclaim_count), 954 + BTRFS_ATTR_PTR(space_info, reclaim_bytes), 955 + BTRFS_ATTR_PTR(space_info, reclaim_errors), 956 + BTRFS_ATTR_PTR(space_info, periodic_reclaim), 1030 957 #ifdef CONFIG_BTRFS_DEBUG 1031 958 BTRFS_ATTR_PTR(space_info, force_chunk_alloc), 1032 959 #endif
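The interaction between the two sysfs knobs in the fs/btrfs/sysfs.c diff above can be sketched in plain userspace C. The struct `space_info_model`, `parse_int()` and the two `*_store()` functions here are hypothetical stand-ins, not the kernel functions: `parse_int()` only approximates `kstrtoint()`, the stores return 0 on success where the kernel handlers return the written length, and any further range validation the real threshold store performs is omitted because it is not visible in the diff.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of the two fields the diff touches in btrfs_space_info. */
struct space_info_model {
	bool dynamic_reclaim;
	int bg_reclaim_threshold;
};

/*
 * Rough userspace approximation of kstrtoint(buf, 10, &out): accepts one
 * optional trailing newline, returns 0 on success or a negative errno.
 */
static int parse_int(const char *buf, int *out)
{
	char *end;
	long v;

	assert(buf != NULL && out != NULL);
	errno = 0;
	v = strtol(buf, &end, 10);
	if (end == buf)
		return -EINVAL;
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;
	if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
		return -ERANGE;
	*out = (int)v;
	return 0;
}

/* Negative input is rejected; any other value is normalized to on/off. */
static int dynamic_reclaim_store(struct space_info_model *si, const char *buf)
{
	int val;
	int ret = parse_int(buf, &val);

	if (ret)
		return ret;
	if (val < 0)
		return -EINVAL;
	si->dynamic_reclaim = (val != 0);
	return 0;
}

/* The fixed threshold cannot be written while dynamic reclaim is enabled. */
static int bg_reclaim_threshold_store(struct space_info_model *si, const char *buf)
{
	int thresh;
	int ret;

	if (si->dynamic_reclaim)
		return -EINVAL;
	ret = parse_int(buf, &thresh);
	if (ret)
		return ret;
	si->bg_reclaim_threshold = thresh;
	return 0;
}
```

The ordering matters in this model as it does in the diff: the `dynamic_reclaim` test comes before the input is parsed, so a write to the fixed threshold fails with `-EINVAL` regardless of the value while the dynamic mode is on.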
+1 -4
fs/btrfs/tests/btrfs-tests.c
@@ -61,10 +61,7 @@
 		return NULL;
 
 	inode->i_mode = S_IFREG;
-	inode->i_ino = BTRFS_FIRST_FREE_OBJECTID;
-	BTRFS_I(inode)->location.type = BTRFS_INODE_ITEM_KEY;
-	BTRFS_I(inode)->location.objectid = BTRFS_FIRST_FREE_OBJECTID;
-	BTRFS_I(inode)->location.offset = 0;
+	btrfs_set_inode_number(BTRFS_I(inode), BTRFS_FIRST_FREE_OBJECTID);
 	inode_init_owner(&nop_mnt_idmap, inode, NULL, S_IFREG);
 
 	return inode;
+68 -52
fs/btrfs/tests/extent-map-tests.c
··· 19 19 int ret = 0; 20 20 21 21 write_lock(&em_tree->lock); 22 - while (!RB_EMPTY_ROOT(&em_tree->map.rb_root)) { 23 - node = rb_first_cached(&em_tree->map); 22 + while (!RB_EMPTY_ROOT(&em_tree->root)) { 23 + node = rb_first(&em_tree->root); 24 24 em = rb_entry(node, struct extent_map, rb_node); 25 25 remove_extent_mapping(inode, em); 26 26 ··· 28 28 if (refcount_read(&em->refs) != 1) { 29 29 ret = -EINVAL; 30 30 test_err( 31 - "em leak: em (start %llu len %llu block_start %llu block_len %llu) refs %d", 32 - em->start, em->len, em->block_start, 33 - em->block_len, refcount_read(&em->refs)); 31 + "em leak: em (start %llu len %llu disk_bytenr %llu disk_num_bytes %llu offset %llu) refs %d", 32 + em->start, em->len, em->disk_bytenr, 33 + em->disk_num_bytes, em->offset, 34 + refcount_read(&em->refs)); 34 35 35 36 refcount_set(&em->refs, 1); 36 37 } ··· 77 76 /* Add [0, 16K) */ 78 77 em->start = 0; 79 78 em->len = SZ_16K; 80 - em->block_start = 0; 81 - em->block_len = SZ_16K; 79 + em->disk_bytenr = 0; 80 + em->disk_num_bytes = SZ_16K; 81 + em->ram_bytes = SZ_16K; 82 82 write_lock(&em_tree->lock); 83 83 ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len); 84 84 write_unlock(&em_tree->lock); ··· 99 97 100 98 em->start = SZ_16K; 101 99 em->len = SZ_4K; 102 - em->block_start = SZ_32K; /* avoid merging */ 103 - em->block_len = SZ_4K; 100 + em->disk_bytenr = SZ_32K; /* avoid merging */ 101 + em->disk_num_bytes = SZ_4K; 102 + em->ram_bytes = SZ_4K; 104 103 write_lock(&em_tree->lock); 105 104 ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len); 106 105 write_unlock(&em_tree->lock); ··· 121 118 /* Add [0, 8K), should return [0, 16K) instead. 
*/ 122 119 em->start = start; 123 120 em->len = len; 124 - em->block_start = start; 125 - em->block_len = len; 121 + em->disk_bytenr = start; 122 + em->disk_num_bytes = len; 123 + em->ram_bytes = len; 126 124 write_lock(&em_tree->lock); 127 125 ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len); 128 126 write_unlock(&em_tree->lock); ··· 138 134 goto out; 139 135 } 140 136 if (em->start != 0 || extent_map_end(em) != SZ_16K || 141 - em->block_start != 0 || em->block_len != SZ_16K) { 137 + em->disk_bytenr != 0 || em->disk_num_bytes != SZ_16K) { 142 138 test_err( 143 - "case1 [%llu %llu]: ret %d return a wrong em (start %llu len %llu block_start %llu block_len %llu", 139 + "case1 [%llu %llu]: ret %d return a wrong em (start %llu len %llu disk_bytenr %llu disk_num_bytes %llu", 144 140 start, start + len, ret, em->start, em->len, 145 - em->block_start, em->block_len); 141 + em->disk_bytenr, em->disk_num_bytes); 146 142 ret = -EINVAL; 147 143 } 148 144 free_extent_map(em); ··· 176 172 /* Add [0, 1K) */ 177 173 em->start = 0; 178 174 em->len = SZ_1K; 179 - em->block_start = EXTENT_MAP_INLINE; 180 - em->block_len = (u64)-1; 175 + em->disk_bytenr = EXTENT_MAP_INLINE; 176 + em->disk_num_bytes = 0; 177 + em->ram_bytes = SZ_1K; 181 178 write_lock(&em_tree->lock); 182 179 ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len); 183 180 write_unlock(&em_tree->lock); ··· 198 193 199 194 em->start = SZ_4K; 200 195 em->len = SZ_4K; 201 - em->block_start = SZ_4K; 202 - em->block_len = SZ_4K; 196 + em->disk_bytenr = SZ_4K; 197 + em->disk_num_bytes = SZ_4K; 198 + em->ram_bytes = SZ_4K; 203 199 write_lock(&em_tree->lock); 204 200 ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len); 205 201 write_unlock(&em_tree->lock); ··· 220 214 /* Add [0, 1K) */ 221 215 em->start = 0; 222 216 em->len = SZ_1K; 223 - em->block_start = EXTENT_MAP_INLINE; 224 - em->block_len = (u64)-1; 217 + em->disk_bytenr = EXTENT_MAP_INLINE; 218 + em->disk_num_bytes = 0; 219 + 
em->ram_bytes = SZ_1K; 225 220 write_lock(&em_tree->lock); 226 221 ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len); 227 222 write_unlock(&em_tree->lock); ··· 236 229 goto out; 237 230 } 238 231 if (em->start != 0 || extent_map_end(em) != SZ_1K || 239 - em->block_start != EXTENT_MAP_INLINE || em->block_len != (u64)-1) { 232 + em->disk_bytenr != EXTENT_MAP_INLINE) { 240 233 test_err( 241 - "case2 [0 1K]: ret %d return a wrong em (start %llu len %llu block_start %llu block_len %llu", 242 - ret, em->start, em->len, em->block_start, 243 - em->block_len); 234 + "case2 [0 1K]: ret %d return a wrong em (start %llu len %llu disk_bytenr %llu", 235 + ret, em->start, em->len, em->disk_bytenr); 244 236 ret = -EINVAL; 245 237 } 246 238 free_extent_map(em); ··· 269 263 /* Add [4K, 8K) */ 270 264 em->start = SZ_4K; 271 265 em->len = SZ_4K; 272 - em->block_start = SZ_4K; 273 - em->block_len = SZ_4K; 266 + em->disk_bytenr = SZ_4K; 267 + em->disk_num_bytes = SZ_4K; 268 + em->ram_bytes = SZ_4K; 274 269 write_lock(&em_tree->lock); 275 270 ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len); 276 271 write_unlock(&em_tree->lock); ··· 291 284 /* Add [0, 16K) */ 292 285 em->start = 0; 293 286 em->len = SZ_16K; 294 - em->block_start = 0; 295 - em->block_len = SZ_16K; 287 + em->disk_bytenr = 0; 288 + em->disk_num_bytes = SZ_16K; 289 + em->ram_bytes = SZ_16K; 296 290 write_lock(&em_tree->lock); 297 291 ret = btrfs_add_extent_mapping(inode, &em, start, len); 298 292 write_unlock(&em_tree->lock); ··· 313 305 * em->start. 
314 306 */ 315 307 if (start < em->start || start + len > extent_map_end(em) || 316 - em->start != em->block_start || em->len != em->block_len) { 308 + em->start != extent_map_block_start(em)) { 317 309 test_err( 318 - "case3 [%llu %llu): ret %d em (start %llu len %llu block_start %llu block_len %llu)", 310 + "case3 [%llu %llu): ret %d em (start %llu len %llu disk_bytenr %llu block_len %llu)", 319 311 start, start + len, ret, em->start, em->len, 320 - em->block_start, em->block_len); 312 + em->disk_bytenr, em->disk_num_bytes); 321 313 ret = -EINVAL; 322 314 } 323 315 free_extent_map(em); ··· 378 370 /* Add [0K, 8K) */ 379 371 em->start = 0; 380 372 em->len = SZ_8K; 381 - em->block_start = 0; 382 - em->block_len = SZ_8K; 373 + em->disk_bytenr = 0; 374 + em->disk_num_bytes = SZ_8K; 375 + em->ram_bytes = SZ_8K; 383 376 write_lock(&em_tree->lock); 384 377 ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len); 385 378 write_unlock(&em_tree->lock); ··· 400 391 /* Add [8K, 32K) */ 401 392 em->start = SZ_8K; 402 393 em->len = 24 * SZ_1K; 403 - em->block_start = SZ_16K; /* avoid merging */ 404 - em->block_len = 24 * SZ_1K; 394 + em->disk_bytenr = SZ_16K; /* avoid merging */ 395 + em->disk_num_bytes = 24 * SZ_1K; 396 + em->ram_bytes = 24 * SZ_1K; 405 397 write_lock(&em_tree->lock); 406 398 ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len); 407 399 write_unlock(&em_tree->lock); ··· 421 411 /* Add [0K, 32K) */ 422 412 em->start = 0; 423 413 em->len = SZ_32K; 424 - em->block_start = 0; 425 - em->block_len = SZ_32K; 414 + em->disk_bytenr = 0; 415 + em->disk_num_bytes = SZ_32K; 416 + em->ram_bytes = SZ_32K; 426 417 write_lock(&em_tree->lock); 427 418 ret = btrfs_add_extent_mapping(inode, &em, start, len); 428 419 write_unlock(&em_tree->lock); ··· 440 429 } 441 430 if (start < em->start || start + len > extent_map_end(em)) { 442 431 test_err( 443 - "case4 [%llu %llu): ret %d, added wrong em (start %llu len %llu block_start %llu block_len %llu)", 444 - start, 
start + len, ret, em->start, em->len, em->block_start, 445 - em->block_len); 432 + "case4 [%llu %llu): ret %d, added wrong em (start %llu len %llu disk_bytenr %llu disk_num_bytes %llu)", 433 + start, start + len, ret, em->start, em->len, 434 + em->disk_bytenr, em->disk_num_bytes); 446 435 ret = -EINVAL; 447 436 } 448 437 free_extent_map(em); ··· 506 495 507 496 em->start = start; 508 497 em->len = len; 509 - em->block_start = block_start; 510 - em->block_len = SZ_4K; 498 + em->disk_bytenr = block_start; 499 + em->disk_num_bytes = SZ_4K; 500 + em->ram_bytes = len; 511 501 em->flags |= EXTENT_FLAG_COMPRESS_ZLIB; 512 502 write_lock(&em_tree->lock); 513 503 ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len); ··· 563 551 struct rb_node *n; 564 552 int i; 565 553 566 - for (i = 0, n = rb_first_cached(&em_tree->map); 554 + for (i = 0, n = rb_first(&em_tree->root); 567 555 valid_ranges[index][i].len && n; 568 556 i++, n = rb_next(n)) { 569 557 struct extent_map *entry = rb_entry(n, struct extent_map, rb_node); ··· 728 716 729 717 em->start = SZ_4K; 730 718 em->len = SZ_4K; 731 - em->block_start = SZ_16K; 732 - em->block_len = SZ_16K; 719 + em->disk_bytenr = SZ_16K; 720 + em->disk_num_bytes = SZ_16K; 721 + em->ram_bytes = SZ_16K; 733 722 write_lock(&em_tree->lock); 734 723 ret = btrfs_add_extent_mapping(inode, &em, 0, SZ_8K); 735 724 write_unlock(&em_tree->lock); ··· 782 769 /* [0, 16K), pinned */ 783 770 em->start = 0; 784 771 em->len = SZ_16K; 785 - em->block_start = 0; 786 - em->block_len = SZ_4K; 787 - em->flags |= EXTENT_FLAG_PINNED; 772 + em->disk_bytenr = 0; 773 + em->disk_num_bytes = SZ_4K; 774 + em->ram_bytes = SZ_16K; 775 + em->flags |= (EXTENT_FLAG_PINNED | EXTENT_FLAG_COMPRESS_ZLIB); 788 776 write_lock(&em_tree->lock); 789 777 ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len); 790 778 write_unlock(&em_tree->lock); ··· 805 791 /* [32K, 48K), not pinned */ 806 792 em->start = SZ_32K; 807 793 em->len = SZ_16K; 808 - em->block_start = 
SZ_32K; 809 - em->block_len = SZ_16K; 794 + em->disk_bytenr = SZ_32K; 795 + em->disk_num_bytes = SZ_16K; 796 + em->ram_bytes = SZ_16K; 810 797 write_lock(&em_tree->lock); 811 798 ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len); 812 799 write_unlock(&em_tree->lock); ··· 870 855 goto out; 871 856 } 872 857 873 - if (em->block_start != SZ_32K + SZ_4K) { 874 - test_err("em->block_start is %llu, expected 36K", em->block_start); 858 + if (extent_map_block_start(em) != SZ_32K + SZ_4K) { 859 + test_err("em->block_start is %llu, expected 36K", 860 + extent_map_block_start(em)); 875 861 goto out; 876 862 } 877 863
+82 -94
fs/btrfs/tests/inode-tests.c
··· 117 117 118 118 /* Now for a regular extent */ 119 119 insert_extent(root, offset, sectorsize - 1, sectorsize - 1, 0, 120 - disk_bytenr, sectorsize, BTRFS_FILE_EXTENT_REG, 0, slot); 120 + disk_bytenr, sectorsize - 1, BTRFS_FILE_EXTENT_REG, 0, slot); 121 121 slot++; 122 122 disk_bytenr += sectorsize; 123 123 offset += sectorsize - 1; ··· 264 264 test_err("got an error when we shouldn't have"); 265 265 goto out; 266 266 } 267 - if (em->block_start != EXTENT_MAP_HOLE) { 268 - test_err("expected a hole, got %llu", em->block_start); 267 + if (em->disk_bytenr != EXTENT_MAP_HOLE) { 268 + test_err("expected a hole, got %llu", em->disk_bytenr); 269 269 goto out; 270 270 } 271 271 free_extent_map(em); ··· 283 283 test_err("got an error when we shouldn't have"); 284 284 goto out; 285 285 } 286 - if (em->block_start != EXTENT_MAP_INLINE) { 287 - test_err("expected an inline, got %llu", em->block_start); 286 + if (em->disk_bytenr != EXTENT_MAP_INLINE) { 287 + test_err("expected an inline, got %llu", em->disk_bytenr); 288 288 goto out; 289 289 } 290 290 ··· 321 321 test_err("got an error when we shouldn't have"); 322 322 goto out; 323 323 } 324 - if (em->block_start != EXTENT_MAP_HOLE) { 325 - test_err("expected a hole, got %llu", em->block_start); 324 + if (em->disk_bytenr != EXTENT_MAP_HOLE) { 325 + test_err("expected a hole, got %llu", em->disk_bytenr); 326 326 goto out; 327 327 } 328 328 if (em->start != offset || em->len != 4) { ··· 344 344 test_err("got an error when we shouldn't have"); 345 345 goto out; 346 346 } 347 - if (em->block_start >= EXTENT_MAP_LAST_BYTE) { 348 - test_err("expected a real extent, got %llu", em->block_start); 347 + if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) { 348 + test_err("expected a real extent, got %llu", em->disk_bytenr); 349 349 goto out; 350 350 } 351 351 if (em->start != offset || em->len != sectorsize - 1) { ··· 358 358 test_err("unexpected flags set, want 0 have %u", em->flags); 359 359 goto out; 360 360 } 361 - if (em->orig_start 
!= em->start) { 362 - test_err("wrong orig offset, want %llu, have %llu", em->start, 363 - em->orig_start); 361 + if (em->offset != 0) { 362 + test_err("wrong offset, want 0, have %llu", em->offset); 364 363 goto out; 365 364 } 366 365 offset = em->start + em->len; ··· 371 372 test_err("got an error when we shouldn't have"); 372 373 goto out; 373 374 } 374 - if (em->block_start >= EXTENT_MAP_LAST_BYTE) { 375 - test_err("expected a real extent, got %llu", em->block_start); 375 + if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) { 376 + test_err("expected a real extent, got %llu", em->disk_bytenr); 376 377 goto out; 377 378 } 378 379 if (em->start != offset || em->len != sectorsize) { ··· 385 386 test_err("unexpected flags set, want 0 have %u", em->flags); 386 387 goto out; 387 388 } 388 - if (em->orig_start != em->start) { 389 - test_err("wrong orig offset, want %llu, have %llu", em->start, 390 - em->orig_start); 389 + if (em->offset != 0) { 390 + test_err("wrong offset, want 0, have %llu", em->offset); 391 391 goto out; 392 392 } 393 - disk_bytenr = em->block_start; 393 + disk_bytenr = extent_map_block_start(em); 394 394 orig_start = em->start; 395 395 offset = em->start + em->len; 396 396 free_extent_map(em); ··· 399 401 test_err("got an error when we shouldn't have"); 400 402 goto out; 401 403 } 402 - if (em->block_start != EXTENT_MAP_HOLE) { 403 - test_err("expected a hole, got %llu", em->block_start); 404 + if (em->disk_bytenr != EXTENT_MAP_HOLE) { 405 + test_err("expected a hole, got %llu", em->disk_bytenr); 404 406 goto out; 405 407 } 406 408 if (em->start != offset || em->len != sectorsize) { ··· 421 423 test_err("got an error when we shouldn't have"); 422 424 goto out; 423 425 } 424 - if (em->block_start >= EXTENT_MAP_LAST_BYTE) { 425 - test_err("expected a real extent, got %llu", em->block_start); 426 + if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) { 427 + test_err("expected a real extent, got %llu", em->disk_bytenr); 426 428 goto out; 427 429 } 428 430 if 
(em->start != offset || em->len != 2 * sectorsize) { ··· 435 437 test_err("unexpected flags set, want 0 have %u", em->flags); 436 438 goto out; 437 439 } 438 - if (em->orig_start != orig_start) { 439 - test_err("wrong orig offset, want %llu, have %llu", 440 - orig_start, em->orig_start); 440 + if (em->start - em->offset != orig_start) { 441 + test_err("wrong offset, em->start=%llu em->offset=%llu orig_start=%llu", 442 + em->start, em->offset, orig_start); 441 443 goto out; 442 444 } 443 445 disk_bytenr += (em->start - orig_start); 444 - if (em->block_start != disk_bytenr) { 446 + if (extent_map_block_start(em) != disk_bytenr) { 445 447 test_err("wrong block start, want %llu, have %llu", 446 - disk_bytenr, em->block_start); 448 + disk_bytenr, extent_map_block_start(em)); 447 449 goto out; 448 450 } 449 451 offset = em->start + em->len; ··· 455 457 test_err("got an error when we shouldn't have"); 456 458 goto out; 457 459 } 458 - if (em->block_start >= EXTENT_MAP_LAST_BYTE) { 459 - test_err("expected a real extent, got %llu", em->block_start); 460 + if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) { 461 + test_err("expected a real extent, got %llu", em->disk_bytenr); 460 462 goto out; 461 463 } 462 464 if (em->start != offset || em->len != sectorsize) { ··· 470 472 prealloc_only, em->flags); 471 473 goto out; 472 474 } 473 - if (em->orig_start != em->start) { 474 - test_err("wrong orig offset, want %llu, have %llu", em->start, 475 - em->orig_start); 475 + if (em->offset != 0) { 476 + test_err("wrong offset, want 0, have %llu", em->offset); 476 477 goto out; 477 478 } 478 479 offset = em->start + em->len; ··· 483 486 test_err("got an error when we shouldn't have"); 484 487 goto out; 485 488 } 486 - if (em->block_start >= EXTENT_MAP_LAST_BYTE) { 487 - test_err("expected a real extent, got %llu", em->block_start); 489 + if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) { 490 + test_err("expected a real extent, got %llu", em->disk_bytenr); 488 491 goto out; 489 492 } 490 493 if 
(em->start != offset || em->len != sectorsize) { ··· 498 501 prealloc_only, em->flags); 499 502 goto out; 500 503 } 501 - if (em->orig_start != em->start) { 502 - test_err("wrong orig offset, want %llu, have %llu", em->start, 503 - em->orig_start); 504 + if (em->offset != 0) { 505 + test_err("wrong offset, want 0, have %llu", em->offset); 504 506 goto out; 505 507 } 506 - disk_bytenr = em->block_start; 508 + disk_bytenr = extent_map_block_start(em); 507 509 orig_start = em->start; 508 510 offset = em->start + em->len; 509 511 free_extent_map(em); ··· 512 516 test_err("got an error when we shouldn't have"); 513 517 goto out; 514 518 } 515 - if (em->block_start >= EXTENT_MAP_HOLE) { 516 - test_err("expected a real extent, got %llu", em->block_start); 519 + if (em->disk_bytenr >= EXTENT_MAP_HOLE) { 520 + test_err("expected a real extent, got %llu", em->disk_bytenr); 517 521 goto out; 518 522 } 519 523 if (em->start != offset || em->len != sectorsize) { ··· 526 530 test_err("unexpected flags set, want 0 have %u", em->flags); 527 531 goto out; 528 532 } 529 - if (em->orig_start != orig_start) { 530 - test_err("unexpected orig offset, wanted %llu, have %llu", 531 - orig_start, em->orig_start); 533 + if (em->start - em->offset != orig_start) { 534 + test_err("unexpected offset, wanted %llu, have %llu", 535 + em->start - orig_start, em->offset); 532 536 goto out; 533 537 } 534 - if (em->block_start != (disk_bytenr + (em->start - em->orig_start))) { 538 + if (extent_map_block_start(em) != disk_bytenr + em->offset) { 535 539 test_err("unexpected block start, wanted %llu, have %llu", 536 - disk_bytenr + (em->start - em->orig_start), 537 - em->block_start); 540 + disk_bytenr + em->offset, extent_map_block_start(em)); 538 541 goto out; 539 542 } 540 543 offset = em->start + em->len; ··· 544 549 test_err("got an error when we shouldn't have"); 545 550 goto out; 546 551 } 547 - if (em->block_start >= EXTENT_MAP_LAST_BYTE) { 548 - test_err("expected a real extent, got %llu", 
em->block_start); 552 + if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) { 553 + test_err("expected a real extent, got %llu", em->disk_bytenr); 549 554 goto out; 550 555 } 551 556 if (em->start != offset || em->len != 2 * sectorsize) { ··· 559 564 prealloc_only, em->flags); 560 565 goto out; 561 566 } 562 - if (em->orig_start != orig_start) { 563 - test_err("wrong orig offset, want %llu, have %llu", orig_start, 564 - em->orig_start); 567 + if (em->start - em->offset != orig_start) { 568 + test_err("wrong offset, em->start=%llu em->offset=%llu orig_start=%llu", 569 + em->start, em->offset, orig_start); 565 570 goto out; 566 571 } 567 - if (em->block_start != (disk_bytenr + (em->start - em->orig_start))) { 572 + if (extent_map_block_start(em) != disk_bytenr + em->offset) { 568 573 test_err("unexpected block start, wanted %llu, have %llu", 569 - disk_bytenr + (em->start - em->orig_start), 570 - em->block_start); 574 + disk_bytenr + em->offset, extent_map_block_start(em)); 571 575 goto out; 572 576 } 573 577 offset = em->start + em->len; ··· 578 584 test_err("got an error when we shouldn't have"); 579 585 goto out; 580 586 } 581 - if (em->block_start >= EXTENT_MAP_LAST_BYTE) { 582 - test_err("expected a real extent, got %llu", em->block_start); 587 + if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) { 588 + test_err("expected a real extent, got %llu", em->disk_bytenr); 583 589 goto out; 584 590 } 585 591 if (em->start != offset || em->len != 2 * sectorsize) { ··· 593 599 compressed_only, em->flags); 594 600 goto out; 595 601 } 596 - if (em->orig_start != em->start) { 597 - test_err("wrong orig offset, want %llu, have %llu", 598 - em->start, em->orig_start); 602 + if (em->offset != 0) { 603 + test_err("wrong offset, want 0, have %llu", em->offset); 599 604 goto out; 600 605 } 601 606 if (extent_map_compression(em) != BTRFS_COMPRESS_ZLIB) { ··· 611 618 test_err("got an error when we shouldn't have"); 612 619 goto out; 613 620 } 614 - if (em->block_start >= EXTENT_MAP_LAST_BYTE) { 
615 - test_err("expected a real extent, got %llu", em->block_start); 621 + if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) { 622 + test_err("expected a real extent, got %llu", em->disk_bytenr); 616 623 goto out; 617 624 } 618 625 if (em->start != offset || em->len != sectorsize) { ··· 626 633 compressed_only, em->flags); 627 634 goto out; 628 635 } 629 - if (em->orig_start != em->start) { 630 - test_err("wrong orig offset, want %llu, have %llu", 631 - em->start, em->orig_start); 636 + if (em->offset != 0) { 637 + test_err("wrong offset, want 0, have %llu", em->offset); 632 638 goto out; 633 639 } 634 640 if (extent_map_compression(em) != BTRFS_COMPRESS_ZLIB) { ··· 635 643 BTRFS_COMPRESS_ZLIB, extent_map_compression(em)); 636 644 goto out; 637 645 } 638 - disk_bytenr = em->block_start; 646 + disk_bytenr = extent_map_block_start(em); 639 647 orig_start = em->start; 640 648 offset = em->start + em->len; 641 649 free_extent_map(em); ··· 645 653 test_err("got an error when we shouldn't have"); 646 654 goto out; 647 655 } 648 - if (em->block_start >= EXTENT_MAP_LAST_BYTE) { 649 - test_err("expected a real extent, got %llu", em->block_start); 656 + if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) { 657 + test_err("expected a real extent, got %llu", em->disk_bytenr); 650 658 goto out; 651 659 } 652 660 if (em->start != offset || em->len != sectorsize) { ··· 659 667 test_err("unexpected flags set, want 0 have %u", em->flags); 660 668 goto out; 661 669 } 662 - if (em->orig_start != em->start) { 663 - test_err("wrong orig offset, want %llu, have %llu", em->start, 664 - em->orig_start); 670 + if (em->offset != 0) { 671 + test_err("wrong offset, want 0, have %llu", em->offset); 665 672 goto out; 666 673 } 667 674 offset = em->start + em->len; ··· 671 680 test_err("got an error when we shouldn't have"); 672 681 goto out; 673 682 } 674 - if (em->block_start != disk_bytenr) { 683 + if (extent_map_block_start(em) != disk_bytenr) { 675 684 test_err("block start does not match, want %llu 
got %llu", 676 - disk_bytenr, em->block_start); 685 + disk_bytenr, extent_map_block_start(em)); 677 686 goto out; 678 687 } 679 688 if (em->start != offset || em->len != 2 * sectorsize) { ··· 687 696 compressed_only, em->flags); 688 697 goto out; 689 698 } 690 - if (em->orig_start != orig_start) { 691 - test_err("wrong orig offset, want %llu, have %llu", 692 - em->start, orig_start); 699 + if (em->start - em->offset != orig_start) { 700 + test_err("wrong offset, em->start=%llu em->offset=%llu orig_start=%llu", 701 + em->start, em->offset, orig_start); 693 702 goto out; 694 703 } 695 704 if (extent_map_compression(em) != BTRFS_COMPRESS_ZLIB) { ··· 706 715 test_err("got an error when we shouldn't have"); 707 716 goto out; 708 717 } 709 - if (em->block_start >= EXTENT_MAP_LAST_BYTE) { 710 - test_err("expected a real extent, got %llu", em->block_start); 718 + if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) { 719 + test_err("expected a real extent, got %llu", em->disk_bytenr); 711 720 goto out; 712 721 } 713 722 if (em->start != offset || em->len != sectorsize) { ··· 720 729 test_err("unexpected flags set, want 0 have %u", em->flags); 721 730 goto out; 722 731 } 723 - if (em->orig_start != em->start) { 724 - test_err("wrong orig offset, want %llu, have %llu", em->start, 725 - em->orig_start); 732 + if (em->offset != 0) { 733 + test_err("wrong offset, want 0, have %llu", em->offset); 726 734 goto out; 727 735 } 728 736 offset = em->start + em->len; ··· 732 742 test_err("got an error when we shouldn't have"); 733 743 goto out; 734 744 } 735 - if (em->block_start != EXTENT_MAP_HOLE) { 736 - test_err("expected a hole extent, got %llu", em->block_start); 745 + if (em->disk_bytenr != EXTENT_MAP_HOLE) { 746 + test_err("expected a hole extent, got %llu", em->disk_bytenr); 737 747 goto out; 738 748 } 739 749 /* ··· 752 762 vacancy_only, em->flags); 753 763 goto out; 754 764 } 755 - if (em->orig_start != em->start) { 756 - test_err("wrong orig offset, want %llu, have %llu", 
em->start, 757 - em->orig_start); 765 + if (em->offset != 0) { 766 + test_err("wrong offset, want 0, have %llu", em->offset); 758 767 goto out; 759 768 } 760 769 offset = em->start + em->len; ··· 764 775 test_err("got an error when we shouldn't have"); 765 776 goto out; 766 777 } 767 - if (em->block_start >= EXTENT_MAP_LAST_BYTE) { 768 - test_err("expected a real extent, got %llu", em->block_start); 778 + if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) { 779 + test_err("expected a real extent, got %llu", em->disk_bytenr); 769 780 goto out; 770 781 } 771 782 if (em->start != offset || em->len != sectorsize) { ··· 778 789 test_err("unexpected flags set, want 0 have %u", em->flags); 779 790 goto out; 780 791 } 781 - if (em->orig_start != em->start) { 782 - test_err("wrong orig offset, want %llu, have %llu", em->start, 783 - em->orig_start); 792 + if (em->offset != 0) { 793 + test_err("wrong orig offset, want 0, have %llu", em->offset); 784 794 goto out; 785 795 } 786 796 ret = 0; ··· 843 855 test_err("got an error when we shouldn't have"); 844 856 goto out; 845 857 } 846 - if (em->block_start != EXTENT_MAP_HOLE) { 847 - test_err("expected a hole, got %llu", em->block_start); 858 + if (em->disk_bytenr != EXTENT_MAP_HOLE) { 859 + test_err("expected a hole, got %llu", em->disk_bytenr); 848 860 goto out; 849 861 } 850 862 if (em->start != 0 || em->len != sectorsize) { ··· 865 877 test_err("got an error when we shouldn't have"); 866 878 goto out; 867 879 } 868 - if (em->block_start != sectorsize) { 869 - test_err("expected a real extent, got %llu", em->block_start); 880 + if (extent_map_block_start(em) != sectorsize) { 881 + test_err("expected a real extent, got %llu", extent_map_block_start(em)); 870 882 goto out; 871 883 } 872 884 if (em->start != sectorsize || em->len != sectorsize) {
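The test changes above swap `em->block_start` and `em->orig_start` for `em->disk_bytenr`, `em->offset` and the `extent_map_block_start()` helper. The following is a minimal userspace sketch of how the old values could be derived from the new fields. The struct, the sentinel values and both helper names are assumptions made for illustration; the helper's real definition is not part of this diff, and the behaviour shown is inferred from the checks in the tests (`em->start - em->offset != orig_start`) and from the tree-log.c changes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel values, chosen only for this sketch. */
#define EM_LAST_BYTE ((uint64_t)-4)
#define EM_HOLE      ((uint64_t)-3)

/* Simplified stand-in for the kernel's struct extent_map. */
struct em_sketch {
	uint64_t start;       /* file offset where this mapping begins */
	uint64_t len;         /* length of the mapping in the file */
	uint64_t disk_bytenr; /* start of the on-disk extent, or a sentinel */
	uint64_t offset;      /* offset of the mapping inside the extent */
	bool compressed;
};

/* What the removed orig_start field held: where the whole extent would start in the file. */
static uint64_t em_orig_start(const struct em_sketch *em)
{
	return em->start - em->offset;
}

/*
 * What the removed block_start field held (assumption): for a regular,
 * uncompressed extent the disk address of the referenced data; for
 * compressed extents and sentinels the disk_bytenr value unchanged.
 */
static uint64_t em_block_start(const struct em_sketch *em)
{
	if (em->disk_bytenr < EM_LAST_BYTE && !em->compressed)
		return em->disk_bytenr + em->offset;
	return em->disk_bytenr;
}
```

With a mapping that starts 4096 bytes into an uncompressed extent, `em_orig_start()` yields the file offset of the extent's first byte and `em_block_start()` yields `disk_bytenr + 4096`, while a hole is returned unchanged.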
+25 -6
fs/btrfs/transaction.c
···
 	int ret = 0;
 
 	if ((test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
-	    root->last_trans < trans->transid) || force) {
+	    btrfs_get_root_last_trans(root) < trans->transid) || force) {
 		WARN_ON(!force && root->commit_root != root->node);
 
 		/*
···
 		smp_wmb();
 
 		spin_lock(&fs_info->fs_roots_radix_lock);
-		if (root->last_trans == trans->transid && !force) {
+		if (btrfs_get_root_last_trans(root) == trans->transid && !force) {
 			spin_unlock(&fs_info->fs_roots_radix_lock);
 			return 0;
 		}
···
 				   (unsigned long)btrfs_root_id(root),
 				   BTRFS_ROOT_TRANS_TAG);
 		spin_unlock(&fs_info->fs_roots_radix_lock);
-		root->last_trans = trans->transid;
+		btrfs_set_root_last_trans(root, trans->transid);
 
 		/* this is pretty tricky. We don't want to
 		 * take the relocation lock in btrfs_record_root_in_trans
···
 	 * and barriers
 	 */
 	smp_rmb();
-	if (root->last_trans == trans->transid &&
+	if (btrfs_get_root_last_trans(root) == trans->transid &&
 	    !test_bit(BTRFS_ROOT_IN_TRANS_SETUP, &root->state))
 		return 0;
 
···
 	struct btrfs_root *root = pending->root;
 	struct btrfs_root *parent_root;
 	struct btrfs_block_rsv *rsv;
-	struct inode *parent_inode = pending->dir;
+	struct inode *parent_inode = &pending->dir->vfs_inode;
 	struct btrfs_path *path;
 	struct btrfs_dir_item *dir_item;
 	struct extent_buffer *tmp;
···
 	btrfs_put_transaction(cur_trans);
 }
 
+/*
+ * If there is a running transaction commit it or if it's already committing,
+ * wait for its commit to complete. Does not start and commit a new transaction
+ * if there isn't any running.
+ */
+int btrfs_commit_current_transaction(struct btrfs_root *root)
+{
+	struct btrfs_trans_handle *trans;
+
+	trans = btrfs_attach_transaction_barrier(root);
+	if (IS_ERR(trans)) {
+		int ret = PTR_ERR(trans);
+
+		return (ret == -ENOENT) ? 0 : ret;
+	}
+
+	return btrfs_commit_transaction(trans);
+}
+
 static void cleanup_transaction(struct btrfs_trans_handle *trans, int err)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
···
 static inline void btrfs_wait_delalloc_flush(struct btrfs_fs_info *fs_info)
 {
 	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
-		btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
+		btrfs_wait_ordered_roots(fs_info, U64_MAX, NULL);
 }
 
 /*
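The comment on `btrfs_commit_current_transaction()` describes a "commit only if something is running" pattern: attach to an existing transaction, treat "nothing to attach to" as success, and otherwise commit. A minimal userspace sketch of that control flow follows. The `struct txn`, the `running_txn` global and all three function names are hypothetical stand-ins, not kernel APIs.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a transaction handle. */
struct txn {
	int committed;
};

/* Hypothetical global state: the running transaction, or NULL if none. */
static struct txn *running_txn;

/*
 * Attach to the running transaction without ever starting a new one.
 * Returns 0 and sets *out, or -ENOENT when nothing is running.
 */
static int txn_attach(struct txn **out)
{
	if (!running_txn)
		return -ENOENT;
	*out = running_txn;
	return 0;
}

/* Mark the transaction committed and clear the running slot. */
static int txn_commit(struct txn *t)
{
	t->committed = 1;
	running_txn = NULL;
	return 0;
}

/* Commit the current transaction if there is one; otherwise succeed as a no-op. */
static int commit_current(void)
{
	struct txn *t;
	int ret = txn_attach(&t);

	if (ret)
		return (ret == -ENOENT) ? 0 : ret;

	return txn_commit(t);
}
```

Mapping only `-ENOENT` to 0 keeps real attach failures visible to the caller while making the "nothing running" case indistinguishable from a successful commit.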
+5 -4
fs/btrfs/transaction.h
···
 
 struct btrfs_pending_snapshot {
 	struct dentry *dentry;
-	struct inode *dir;
+	struct btrfs_inode *dir;
 	struct btrfs_root *root;
 	struct btrfs_root_item *root_item;
 	struct btrfs_root *snap;
···
  */
 #define btrfs_abort_transaction(trans, error) \
 do { \
-	bool first = false; \
+	bool __first = false; \
 	/* Report first abort since mount */ \
 	if (!test_and_set_bit(BTRFS_FS_STATE_TRANS_ABORTED, \
 			&((trans)->fs_info->fs_state))) { \
-		first = true; \
+		__first = true; \
 		if (WARN(abort_should_print_stack(error), \
 			KERN_ERR \
 			"BTRFS: Transaction aborted (error %d)\n", \
···
 		} \
 	} \
 	__btrfs_abort_transaction((trans), __func__, \
-				  __LINE__, (error), first); \
+				  __LINE__, (error), __first); \
 } while (0)
 
 int btrfs_end_transaction(struct btrfs_trans_handle *trans);
···
 int btrfs_clean_one_deleted_snapshot(struct btrfs_fs_info *fs_info);
 int btrfs_commit_transaction(struct btrfs_trans_handle *trans);
 void btrfs_commit_transaction_async(struct btrfs_trans_handle *trans);
+int btrfs_commit_current_transaction(struct btrfs_root *root);
 int btrfs_end_transaction_throttle(struct btrfs_trans_handle *trans);
 bool btrfs_should_end_transaction(struct btrfs_trans_handle *trans);
 void btrfs_throttle(struct btrfs_fs_info *fs_info);
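The rename of the macro-local `first` to `__first` in `btrfs_abort_transaction()` is not explained in this diff. A plausible motivation is macro hygiene: a local declared inside a statement macro shadows any caller variable of the same name that appears in the macro's arguments. The sketch below illustrates that hazard in isolation. The `report()` helper and both macros are hypothetical, and the `__` prefix follows the kernel style seen in the diff (portable userspace code would normally pick a different unique name, since such identifiers are reserved).

```c
#include <stdbool.h>

/* Records the error value that the macro actually passed on. */
static int last_reported;

static void report(int error, bool first)
{
	(void)first;
	last_reported = error;
}

/* The local is named 'first', so an argument spelled 'first' binds to it. */
#define ABORT_SHADOWING(error) \
do { \
	bool first = false; \
	report((error), first); \
} while (0)

/* The local uses a prefixed name that callers are unlikely to use. */
#define ABORT_HYGIENIC(error) \
do { \
	bool __first = false; \
	report((error), __first); \
} while (0)

static int demo_shadowing(void)
{
	int first = -5;

	(void)first;
	/* Inside the macro, 'first' resolves to the macro's own bool (false). */
	ABORT_SHADOWING(first);
	return last_reported;
}

static int demo_hygienic(void)
{
	int first = -5;

	/* Here 'first' still refers to the caller's variable. */
	ABORT_HYGIENIC(first);
	return last_reported;
}
```

In the shadowing variant the caller's error code is silently replaced by the macro's own `false`, which is also the kind of situation that compiler shadow warnings are meant to flag.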
+18 -19
fs/btrfs/tree-checker.c
···
 		}
 	}
 
+	/*
+	 * For non-compressed data extents, ram_bytes should match its
+	 * disk_num_bytes.
+	 * However we do not really utilize ram_bytes in this case, so this check
+	 * is only optional for DEBUG builds for developers to catch the
+	 * unexpected behaviors.
+	 */
+	if (IS_ENABLED(CONFIG_BTRFS_DEBUG) &&
+	    btrfs_file_extent_compression(leaf, fi) == BTRFS_COMPRESS_NONE &&
+	    btrfs_file_extent_disk_bytenr(leaf, fi)) {
+		if (WARN_ON(btrfs_file_extent_ram_bytes(leaf, fi) !=
+			    btrfs_file_extent_disk_num_bytes(leaf, fi)))
+			file_extent_err(leaf, slot,
+"mismatch ram_bytes (%llu) and disk_num_bytes (%llu) for non-compressed extent",
+					btrfs_file_extent_ram_bytes(leaf, fi),
+					btrfs_file_extent_disk_num_bytes(leaf, fi));
+	}
+
 	return 0;
 }
 
···
 static int check_raid_stripe_extent(const struct extent_buffer *leaf,
 				    const struct btrfs_key *key, int slot)
 {
-	struct btrfs_stripe_extent *stripe_extent =
-		btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
-
 	if (unlikely(!IS_ALIGNED(key->objectid, leaf->fs_info->sectorsize))) {
 		generic_err(leaf, slot,
 "invalid key objectid for raid stripe extent, have %llu expect aligned to %u",
···
 	if (unlikely(!btrfs_fs_incompat(leaf->fs_info, RAID_STRIPE_TREE))) {
 		generic_err(leaf, slot,
 	"RAID_STRIPE_EXTENT present but RAID_STRIPE_TREE incompat bit unset");
-		return -EUCLEAN;
-	}
-
-	switch (btrfs_stripe_extent_encoding(leaf, stripe_extent)) {
-	case BTRFS_STRIPE_RAID0:
-	case BTRFS_STRIPE_RAID1:
-	case BTRFS_STRIPE_DUP:
-	case BTRFS_STRIPE_RAID10:
-	case BTRFS_STRIPE_RAID5:
-	case BTRFS_STRIPE_RAID6:
-	case BTRFS_STRIPE_RAID1C3:
-	case BTRFS_STRIPE_RAID1C4:
-		break;
-	default:
-		generic_err(leaf, slot, "invalid raid stripe encoding %u",
-			    btrfs_stripe_extent_encoding(leaf, stripe_extent));
 		return -EUCLEAN;
 	}
 
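The new tree-checker comment states an invariant: for a non-compressed data extent that is not a hole, `ram_bytes` should equal `disk_num_bytes`. A small sketch of that predicate on a simplified item is shown here. The `struct fe_sketch` layout and the function name are assumptions for illustration; the kernel reads these fields through its accessor helpers, only warns on a mismatch, and only does so in `CONFIG_BTRFS_DEBUG` builds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified, hypothetical view of a file extent item. */
struct fe_sketch {
	uint8_t compression;     /* 0 means uncompressed in this sketch */
	uint64_t disk_bytenr;    /* 0 means the item describes a hole */
	uint64_t disk_num_bytes; /* size of the extent on disk */
	uint64_t ram_bytes;      /* size of the extent once decoded in memory */
};

/* Returns true when the ram_bytes invariant holds or does not apply. */
static bool ram_bytes_consistent(const struct fe_sketch *fe)
{
	/* Compressed extents and holes are outside the scope of the check. */
	if (fe->compression != 0 || fe->disk_bytenr == 0)
		return true;

	return fe->ram_bytes == fe->disk_num_bytes;
}
```

Note that the hunk still falls through to `return 0` after reporting, so a mismatch is logged for developers rather than treated as corruption.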
+53 -21
fs/btrfs/tree-log.c
··· 151 151 * attempt a transaction commit, resulting in a deadlock. 152 152 */ 153 153 nofs_flag = memalloc_nofs_save(); 154 - inode = btrfs_iget(root->fs_info->sb, objectid, root); 154 + inode = btrfs_iget(objectid, root); 155 155 memalloc_nofs_restore(nofs_flag); 156 156 157 157 return inode; ··· 1644 1644 if (ret) 1645 1645 goto out; 1646 1646 } 1647 - BTRFS_I(inode)->index_cnt = (u64)-1; 1647 + if (S_ISDIR(inode->i_mode)) 1648 + BTRFS_I(inode)->index_cnt = (u64)-1; 1648 1649 1649 1650 if (inode->i_nlink == 0) { 1650 1651 if (S_ISDIR(inode->i_mode)) { ··· 2840 2839 finish_wait(&root->log_writer_wait, &wait); 2841 2840 } 2842 2841 2843 - void btrfs_init_log_ctx(struct btrfs_log_ctx *ctx, struct inode *inode) 2842 + void btrfs_init_log_ctx(struct btrfs_log_ctx *ctx, struct btrfs_inode *inode) 2844 2843 { 2845 2844 ctx->log_ret = 0; 2846 2845 ctx->log_transid = 0; ··· 2859 2858 2860 2859 void btrfs_init_log_ctx_scratch_eb(struct btrfs_log_ctx *ctx) 2861 2860 { 2862 - struct btrfs_inode *inode = BTRFS_I(ctx->inode); 2861 + struct btrfs_inode *inode = ctx->inode; 2863 2862 2864 2863 if (!test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags) && 2865 2864 !test_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags)) ··· 2877 2876 struct btrfs_ordered_extent *ordered; 2878 2877 struct btrfs_ordered_extent *tmp; 2879 2878 2880 - ASSERT(inode_is_locked(ctx->inode)); 2879 + ASSERT(inode_is_locked(&ctx->inode->vfs_inode)); 2881 2880 2882 2881 list_for_each_entry_safe(ordered, tmp, &ctx->ordered_extents, log_list) { 2883 2882 list_del_init(&ordered->log_list); ··· 4254 4253 struct btrfs_inode *inode, bool inode_item_dropped) 4255 4254 { 4256 4255 struct btrfs_inode_item *inode_item; 4256 + struct btrfs_key key; 4257 4257 int ret; 4258 4258 4259 + btrfs_get_inode_key(inode, &key); 4259 4260 /* 4260 4261 * If we are doing a fast fsync and the inode was logged before in the 4261 4262 * current transaction, then we know the inode was previously logged and ··· 4269 4266 
* already exists can also result in unnecessarily splitting a leaf. 4270 4267 */ 4271 4268 if (!inode_item_dropped && inode->logged_trans == trans->transid) { 4272 - ret = btrfs_search_slot(trans, log, &inode->location, path, 0, 1); 4269 + ret = btrfs_search_slot(trans, log, &key, path, 0, 1); 4273 4270 ASSERT(ret <= 0); 4274 4271 if (ret > 0) 4275 4272 ret = -ENOENT; ··· 4283 4280 * the inode, we set BTRFS_INODE_NEEDS_FULL_SYNC on its runtime 4284 4281 * flags and set ->logged_trans to 0. 4285 4282 */ 4286 - ret = btrfs_insert_empty_item(trans, log, path, &inode->location, 4283 + ret = btrfs_insert_empty_item(trans, log, path, &key, 4287 4284 sizeof(*inode_item)); 4288 4285 ASSERT(ret != -EEXIST); 4289 4286 } ··· 4597 4594 { 4598 4595 struct btrfs_ordered_extent *ordered; 4599 4596 struct btrfs_root *csum_root; 4597 + u64 block_start; 4600 4598 u64 csum_offset; 4601 4599 u64 csum_len; 4602 4600 u64 mod_start = em->start; ··· 4607 4603 4608 4604 if (inode->flags & BTRFS_INODE_NODATASUM || 4609 4605 (em->flags & EXTENT_FLAG_PREALLOC) || 4610 - em->block_start == EXTENT_MAP_HOLE) 4606 + em->disk_bytenr == EXTENT_MAP_HOLE) 4611 4607 return 0; 4612 4608 4613 4609 list_for_each_entry(ordered, &ctx->ordered_extents, log_list) { ··· 4671 4667 /* If we're compressed we have to save the entire range of csums. */ 4672 4668 if (extent_map_is_compressed(em)) { 4673 4669 csum_offset = 0; 4674 - csum_len = max(em->block_len, em->orig_block_len); 4670 + csum_len = em->disk_num_bytes; 4675 4671 } else { 4676 4672 csum_offset = mod_start - em->start; 4677 4673 csum_len = mod_len; 4678 4674 } 4679 4675 4680 4676 /* block start is already adjusted for the file extent offset. 
*/ 4681 - csum_root = btrfs_csum_root(trans->fs_info, em->block_start); 4682 - ret = btrfs_lookup_csums_list(csum_root, em->block_start + csum_offset, 4683 - em->block_start + csum_offset + 4684 - csum_len - 1, &ordered_sums, false); 4677 + block_start = extent_map_block_start(em); 4678 + csum_root = btrfs_csum_root(trans->fs_info, block_start); 4679 + ret = btrfs_lookup_csums_list(csum_root, block_start + csum_offset, 4680 + block_start + csum_offset + csum_len - 1, 4681 + &ordered_sums, false); 4685 4682 if (ret < 0) 4686 4683 return ret; 4687 4684 ret = 0; ··· 4712 4707 struct extent_buffer *leaf; 4713 4708 struct btrfs_key key; 4714 4709 enum btrfs_compression_type compress_type; 4715 - u64 extent_offset = em->start - em->orig_start; 4710 + u64 extent_offset = em->offset; 4711 + u64 block_start = extent_map_block_start(em); 4716 4712 u64 block_len; 4717 4713 int ret; 4718 4714 ··· 4723 4717 else 4724 4718 btrfs_set_stack_file_extent_type(&fi, BTRFS_FILE_EXTENT_REG); 4725 4719 4726 - block_len = max(em->block_len, em->orig_block_len); 4720 + block_len = em->disk_num_bytes; 4727 4721 compress_type = extent_map_compression(em); 4728 4722 if (compress_type != BTRFS_COMPRESS_NONE) { 4729 - btrfs_set_stack_file_extent_disk_bytenr(&fi, em->block_start); 4723 + btrfs_set_stack_file_extent_disk_bytenr(&fi, block_start); 4730 4724 btrfs_set_stack_file_extent_disk_num_bytes(&fi, block_len); 4731 - } else if (em->block_start < EXTENT_MAP_LAST_BYTE) { 4732 - btrfs_set_stack_file_extent_disk_bytenr(&fi, em->block_start - 4733 - extent_offset); 4725 + } else if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) { 4726 + btrfs_set_stack_file_extent_disk_bytenr(&fi, block_start - extent_offset); 4734 4727 btrfs_set_stack_file_extent_disk_num_bytes(&fi, block_len); 4735 4728 } 4736 4729 ··· 5932 5927 if (ret < 0) { 5933 5928 return ret; 5934 5929 } else if (ret > 0 && 5935 - other_ino != btrfs_ino(BTRFS_I(ctx->inode))) { 5930 + other_ino != btrfs_ino(ctx->inode)) { 5936 5931 if (ins_nr > 
0) { 5937 5932 ins_nr++; 5938 5933 } else { ··· 7079 7074 } 7080 7075 7081 7076 /* 7077 + * If we're logging an inode from a subvolume created in the current 7078 + * transaction we must force a commit since the root is not persisted. 7079 + */ 7080 + if (btrfs_root_generation(&root->root_item) == trans->transid) { 7081 + ret = BTRFS_LOG_FORCE_COMMIT; 7082 + goto end_no_trans; 7083 + } 7084 + 7085 + /* 7082 7086 * Skip already logged inodes or inodes corresponding to tmpfiles 7083 7087 * (since logging them is pointless, a link count of 0 means they 7084 7088 * will never be accessible). ··· 7468 7454 } 7469 7455 7470 7456 /* 7457 + * Call this when creating a subvolume in a directory. 7458 + * Because we don't commit a transaction when creating a subvolume, we can't 7459 + * allow the directory pointing to the subvolume to be logged with an entry that 7460 + * points to an unpersisted root if we are still in the transaction used to 7461 + * create the subvolume, so make any attempt to log the directory to result in a 7462 + * full log sync. 7463 + * Also we don't need to worry with renames, since btrfs_rename() marks the log 7464 + * for full commit when renaming a subvolume. 7465 + */ 7466 + void btrfs_record_new_subvolume(const struct btrfs_trans_handle *trans, 7467 + struct btrfs_inode *dir) 7468 + { 7469 + mutex_lock(&dir->log_mutex); 7470 + dir->last_unlink_trans = trans->transid; 7471 + mutex_unlock(&dir->log_mutex); 7472 + } 7473 + 7474 + /* 7471 7475 * Update the log after adding a new name for an inode. 7472 7476 * 7473 7477 * @trans: Transaction handle. ··· 7617 7585 goto out; 7618 7586 } 7619 7587 7620 - btrfs_init_log_ctx(&ctx, &inode->vfs_inode); 7588 + btrfs_init_log_ctx(&ctx, inode); 7621 7589 ctx.logging_new_name = true; 7622 7590 btrfs_init_log_ctx_scratch_eb(&ctx); 7623 7591 /*
+4 -2
fs/btrfs/tree-log.h
···
 	bool logging_new_delayed_dentries;
 	/* Indicate if the inode being logged was logged before. */
 	bool logged_before;
-	struct inode *inode;
+	struct btrfs_inode *inode;
 	struct list_head list;
 	/* Only used for fast fsyncs. */
 	struct list_head ordered_extents;
···
 	struct extent_buffer *scratch_eb;
 };
 
-void btrfs_init_log_ctx(struct btrfs_log_ctx *ctx, struct inode *inode);
+void btrfs_init_log_ctx(struct btrfs_log_ctx *ctx, struct btrfs_inode *inode);
 void btrfs_init_log_ctx_scratch_eb(struct btrfs_log_ctx *ctx);
 void btrfs_release_log_ctx_extents(struct btrfs_log_ctx *ctx);
 
···
 			     bool for_rename);
 void btrfs_record_snapshot_destroy(struct btrfs_trans_handle *trans,
 				   struct btrfs_inode *dir);
+void btrfs_record_new_subvolume(const struct btrfs_trans_handle *trans,
+				struct btrfs_inode *dir);
 void btrfs_log_new_name(struct btrfs_trans_handle *trans,
 			struct dentry *old_dentry, struct btrfs_inode *old_dir,
 			u64 old_dir_index, struct dentry *parent);
+18 -3
fs/btrfs/ulist.c
···
 	INIT_LIST_HEAD(&ulist->nodes);
 	ulist->root = RB_ROOT;
 	ulist->nnodes = 0;
+	ulist->prealloc = NULL;
 }
 
 /*
···
 	list_for_each_entry_safe(node, next, &ulist->nodes, list) {
 		kfree(node);
 	}
+	kfree(ulist->prealloc);
+	ulist->prealloc = NULL;
 	ulist->root = RB_ROOT;
 	INIT_LIST_HEAD(&ulist->nodes);
 }
···
 	ulist_init(ulist);
 
 	return ulist;
+}
+
+void ulist_prealloc(struct ulist *ulist, gfp_t gfp_mask)
+{
+	if (!ulist->prealloc)
+		ulist->prealloc = kzalloc(sizeof(*ulist->prealloc), gfp_mask);
 }
 
 /*
···
 			*old_aux = node->aux;
 		return 0;
 	}
-	node = kmalloc(sizeof(*node), gfp_mask);
-	if (!node)
-		return -ENOMEM;
+
+	if (ulist->prealloc) {
+		node = ulist->prealloc;
+		ulist->prealloc = NULL;
+	} else {
+		node = kmalloc(sizeof(*node), gfp_mask);
+		if (!node)
+			return -ENOMEM;
+	}
 
 	node->val = val;
 	node->aux = aux;
+2
fs/btrfs/ulist.h
···
 
 	struct list_head nodes;
 	struct rb_root root;
+	struct ulist_node *prealloc;
 };
 
 void ulist_init(struct ulist *ulist);
 void ulist_release(struct ulist *ulist);
 void ulist_reinit(struct ulist *ulist);
 struct ulist *ulist_alloc(gfp_t gfp_mask);
+void ulist_prealloc(struct ulist *ulist, gfp_t mask);
 void ulist_free(struct ulist *ulist);
 int ulist_add(struct ulist *ulist, u64 val, u64 aux, gfp_t gfp_mask);
 int ulist_add_merge(struct ulist *ulist, u64 val, u64 aux,
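The ulist changes add a one-slot spare node: `ulist_prealloc()` fills the slot ahead of time and the add path consumes it before falling back to a fresh allocation. The diff does not state the motivation; presumably it lets a caller allocate where that is convenient so a later add is less likely to need the allocator. A minimal userspace sketch of the same pattern follows, using a plain singly linked list and `malloc` in place of the kernel's rb-tree, list and `kmalloc`. All type and function names here are hypothetical.

```c
#include <stdlib.h>

struct node {
	unsigned long val;
	struct node *next;
};

struct list_sketch {
	struct node *head;
	struct node *prealloc; /* spare node, or NULL */
	size_t count;
};

/* Best effort: fill the spare slot if it is empty; failure is tolerated. */
static void list_prealloc(struct list_sketch *l)
{
	if (!l->prealloc)
		l->prealloc = calloc(1, sizeof(*l->prealloc));
}

/* Add a value, using the spare node first. Returns 0, or -1 if allocation fails. */
static int list_add(struct list_sketch *l, unsigned long val)
{
	struct node *n;

	if (l->prealloc) {
		n = l->prealloc;
		l->prealloc = NULL;
	} else {
		n = malloc(sizeof(*n));
		if (!n)
			return -1;
	}

	n->val = val;
	n->next = l->head;
	l->head = n;
	l->count++;
	return 0;
}

/* Free all nodes, including an unused spare, and reset the list. */
static void list_release(struct list_sketch *l)
{
	struct node *n = l->head;

	while (n) {
		struct node *next = n->next;

		free(n);
		n = next;
	}
	free(l->prealloc);
	l->head = NULL;
	l->prealloc = NULL;
	l->count = 0;
}
```

As in the diff, the release path must free an unused spare, otherwise a preallocation that is never consumed would leak.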
+5 -5
fs/btrfs/uuid-tree.c
···
 #include "accessors.h"
 #include "uuid-tree.h"
 
-static void btrfs_uuid_to_key(u8 *uuid, u8 type, struct btrfs_key *key)
+static void btrfs_uuid_to_key(const u8 *uuid, u8 type, struct btrfs_key *key)
 {
 	key->type = type;
 	key->objectid = get_unaligned_le64(uuid);
···
 }
 
 /* return -ENOENT for !found, < 0 for errors, or 0 if an item was found */
-static int btrfs_uuid_tree_lookup(struct btrfs_root *uuid_root, u8 *uuid,
+static int btrfs_uuid_tree_lookup(struct btrfs_root *uuid_root, const u8 *uuid,
 				  u8 type, u64 subid)
 {
 	int ret;
···
 	return ret;
 }
 
-int btrfs_uuid_tree_add(struct btrfs_trans_handle *trans, u8 *uuid, u8 type,
+int btrfs_uuid_tree_add(struct btrfs_trans_handle *trans, const u8 *uuid, u8 type,
 			u64 subid_cpu)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
···
 	return ret;
 }
 
-int btrfs_uuid_tree_remove(struct btrfs_trans_handle *trans, u8 *uuid, u8 type,
+int btrfs_uuid_tree_remove(struct btrfs_trans_handle *trans, const u8 *uuid, u8 type,
 			   u64 subid)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
···
  * < 0 if an error occurred
  */
 static int btrfs_check_uuid_tree_entry(struct btrfs_fs_info *fs_info,
-				       u8 *uuid, u8 type, u64 subvolid)
+				       const u8 *uuid, u8 type, u64 subvolid)
 {
 	int ret = 0;
 	struct btrfs_root *subvol_root;
+2 -2
fs/btrfs/uuid-tree.h
···
 struct btrfs_trans_handle;
 struct btrfs_fs_info;
 
-int btrfs_uuid_tree_add(struct btrfs_trans_handle *trans, u8 *uuid, u8 type,
+int btrfs_uuid_tree_add(struct btrfs_trans_handle *trans, const u8 *uuid, u8 type,
 			u64 subid);
-int btrfs_uuid_tree_remove(struct btrfs_trans_handle *trans, u8 *uuid, u8 type,
+int btrfs_uuid_tree_remove(struct btrfs_trans_handle *trans, const u8 *uuid, u8 type,
 			   u64 subid);
 int btrfs_uuid_tree_iterate(struct btrfs_fs_info *fs_info);
 
+29 -33
fs/btrfs/volumes.c
··· 722 722 return -EINVAL; 723 723 } 724 724 725 - u8 *btrfs_sb_fsid_ptr(struct btrfs_super_block *sb) 725 + const u8 *btrfs_sb_fsid_ptr(const struct btrfs_super_block *sb) 726 726 { 727 727 bool has_metadata_uuid = (btrfs_super_incompat_flags(sb) & 728 728 BTRFS_FEATURE_INCOMPAT_METADATA_UUID); ··· 1380 1380 bool new_device_added = false; 1381 1381 struct btrfs_device *device = NULL; 1382 1382 struct file *bdev_file; 1383 - u64 bytenr, bytenr_orig; 1383 + u64 bytenr; 1384 1384 dev_t devt; 1385 1385 int ret; 1386 1386 1387 1387 lockdep_assert_held(&uuid_mutex); 1388 - 1389 - /* 1390 - * we would like to check all the supers, but that would make 1391 - * a btrfs mount succeed after a mkfs from a different FS. 1392 - * So, we need to add a special mount option to scan for 1393 - * later supers, using BTRFS_SUPER_MIRROR_MAX instead 1394 - */ 1395 1388 1396 1389 /* 1397 1390 * Avoid an exclusive open here, as the systemd-udev may initiate the ··· 1400 1407 if (IS_ERR(bdev_file)) 1401 1408 return ERR_CAST(bdev_file); 1402 1409 1403 - bytenr_orig = btrfs_sb_offset(0); 1410 + /* 1411 + * We would like to check all the super blocks, but doing so would 1412 + * allow a mount to succeed after a mkfs from a different filesystem. 1413 + * Currently, recovery from a bad primary btrfs superblock is done 1414 + * using the userspace command 'btrfs check --super'. 
1415 + */ 1404 1416 ret = btrfs_sb_log_location_bdev(file_bdev(bdev_file), 0, READ, &bytenr); 1405 1417 if (ret) { 1406 1418 device = ERR_PTR(ret); ··· 1413 1415 } 1414 1416 1415 1417 disk_super = btrfs_read_disk_super(file_bdev(bdev_file), bytenr, 1416 - bytenr_orig); 1418 + btrfs_sb_offset(0)); 1417 1419 if (IS_ERR(disk_super)) { 1418 1420 device = ERR_CAST(disk_super); 1419 1421 goto error_bdev_put; ··· 2989 2991 if (ret < 0) 2990 2992 goto out; 2991 2993 else if (ret > 0) { /* Logic error or corruption */ 2992 - btrfs_handle_fs_error(fs_info, -ENOENT, 2993 - "Failed lookup while freeing chunk."); 2994 - ret = -ENOENT; 2994 + btrfs_err(fs_info, "failed to lookup chunk %llu when freeing", 2995 + chunk_offset); 2996 + btrfs_abort_transaction(trans, -ENOENT); 2997 + ret = -EUCLEAN; 2995 2998 goto out; 2996 2999 } 2997 3000 2998 3001 ret = btrfs_del_item(trans, root, path); 2999 - if (ret < 0) 3000 - btrfs_handle_fs_error(fs_info, ret, 3001 - "Failed to delete chunk item."); 3002 + if (ret < 0) { 3003 + btrfs_err(fs_info, "failed to delete chunk %llu item", chunk_offset); 3004 + btrfs_abort_transaction(trans, ret); 3005 + goto out; 3006 + } 3002 3007 out: 3003 3008 btrfs_free_path(path); 3004 3009 return ret; ··· 5629 5628 u64 start = ctl->start; 5630 5629 u64 type = ctl->type; 5631 5630 int ret; 5632 - int i; 5633 - int j; 5634 5631 5635 5632 map = btrfs_alloc_chunk_map(ctl->num_stripes, GFP_NOFS); 5636 5633 if (!map) ··· 5643 5644 map->sub_stripes = ctl->sub_stripes; 5644 5645 map->num_stripes = ctl->num_stripes; 5645 5646 5646 - for (i = 0; i < ctl->ndevs; ++i) { 5647 - for (j = 0; j < ctl->dev_stripes; ++j) { 5647 + for (int i = 0; i < ctl->ndevs; i++) { 5648 + for (int j = 0; j < ctl->dev_stripes; j++) { 5648 5649 int s = i * ctl->dev_stripes + j; 5649 5650 map->stripes[s].dev = devices_info[i].dev; 5650 5651 map->stripes[s].physical = devices_info[i].dev_offset + ··· 6287 6288 return ret; 6288 6289 } 6289 6290 6290 - static void handle_ops_on_dev_replace(enum 
btrfs_map_op op, 6291 - struct btrfs_io_context *bioc, 6291 + static void handle_ops_on_dev_replace(struct btrfs_io_context *bioc, 6292 6292 struct btrfs_dev_replace *dev_replace, 6293 6293 u64 logical, 6294 - int *num_stripes_ret, int *max_errors_ret) 6294 + struct btrfs_io_geometry *io_geom) 6295 6295 { 6296 6296 u64 srcdev_devid = dev_replace->srcdev->devid; 6297 6297 /* 6298 6298 * At this stage, num_stripes is still the real number of stripes, 6299 6299 * excluding the duplicated stripes. 6300 6300 */ 6301 - int num_stripes = *num_stripes_ret; 6301 + int num_stripes = io_geom->num_stripes; 6302 + int max_errors = io_geom->max_errors; 6302 6303 int nr_extra_stripes = 0; 6303 - int max_errors = *max_errors_ret; 6304 6304 int i; 6305 6305 6306 6306 /* ··· 6340 6342 * replace. 6341 6343 * If we have 2 extra stripes, only choose the one with smaller physical. 6342 6344 */ 6343 - if (op == BTRFS_MAP_GET_READ_MIRRORS && nr_extra_stripes == 2) { 6345 + if (io_geom->op == BTRFS_MAP_GET_READ_MIRRORS && nr_extra_stripes == 2) { 6344 6346 struct btrfs_io_stripe *first = &bioc->stripes[num_stripes]; 6345 6347 struct btrfs_io_stripe *second = &bioc->stripes[num_stripes + 1]; 6346 6348 ··· 6358 6360 } 6359 6361 } 6360 6362 6361 - *num_stripes_ret = num_stripes + nr_extra_stripes; 6362 - *max_errors_ret = max_errors + nr_extra_stripes; 6363 + io_geom->num_stripes = num_stripes + nr_extra_stripes; 6364 + io_geom->max_errors = max_errors + nr_extra_stripes; 6363 6365 bioc->replace_nr_stripes = nr_extra_stripes; 6364 6366 } 6365 6367 ··· 6622 6624 struct btrfs_chunk_map *map; 6623 6625 struct btrfs_io_geometry io_geom = { 0 }; 6624 6626 u64 map_offset; 6625 - int i; 6626 6627 int ret = 0; 6627 6628 int num_copies; 6628 6629 struct btrfs_io_context *bioc = NULL; ··· 6767 6770 * For all other non-RAID56 profiles, just copy the target 6768 6771 * stripe into the bioc. 
6769 6772 */ 6770 - for (i = 0; i < io_geom.num_stripes; i++) { 6773 + for (int i = 0; i < io_geom.num_stripes; i++) { 6771 6774 ret = set_io_stripe(fs_info, logical, length, 6772 6775 &bioc->stripes[i], map, &io_geom); 6773 6776 if (ret < 0) ··· 6787 6790 6788 6791 if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL && 6789 6792 op != BTRFS_MAP_READ) { 6790 - handle_ops_on_dev_replace(op, bioc, dev_replace, logical, 6791 - &io_geom.num_stripes, &io_geom.max_errors); 6793 + handle_ops_on_dev_replace(bioc, dev_replace, logical, &io_geom); 6792 6794 } 6793 6795 6794 6796 *bioc_ret = bioc;
+1 -1
fs/btrfs/volumes.h
···
 bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
 
 bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
-u8 *btrfs_sb_fsid_ptr(struct btrfs_super_block *sb);
+const u8 *btrfs_sb_fsid_ptr(const struct btrfs_super_block *sb);
 
 #endif
+2 -2
fs/btrfs/xattr.c
···
 #include "accessors.h"
 #include "dir-item.h"
 
-int btrfs_getxattr(struct inode *inode, const char *name,
+int btrfs_getxattr(const struct inode *inode, const char *name,
 		void *buffer, size_t size)
 {
 	struct btrfs_dir_item *di;
···
 	if (IS_ERR(trans))
 		return PTR_ERR(trans);
 
-	ret = btrfs_set_prop(trans, inode, name, value, size, flags);
+	ret = btrfs_set_prop(trans, BTRFS_I(inode), name, value, size, flags);
 	if (!ret) {
 		inode_inc_iversion(inode);
 		inode_set_ctime_current(inode);
+1 -1
fs/btrfs/xattr.h
···
 
 extern const struct xattr_handler * const btrfs_xattr_handlers[];
 
-int btrfs_getxattr(struct inode *inode, const char *name,
+int btrfs_getxattr(const struct inode *inode, const char *name,
 		void *buffer, size_t size);
 int btrfs_setxattr(struct btrfs_trans_handle *trans, struct inode *inode,
 		   const char *name, const void *value, size_t size, int flags);
+43 -13
fs/btrfs/zlib.c
··· 18 18 #include <linux/pagemap.h> 19 19 #include <linux/bio.h> 20 20 #include <linux/refcount.h> 21 + #include "btrfs_inode.h" 21 22 #include "compression.h" 22 23 23 24 /* workspace buffer size for s390 zlib hardware support */ ··· 113 112 *total_out = 0; 114 113 *total_in = 0; 115 114 116 - if (Z_OK != zlib_deflateInit(&workspace->strm, workspace->level)) { 117 - pr_warn("BTRFS: deflateInit failed\n"); 115 + ret = zlib_deflateInit(&workspace->strm, workspace->level); 116 + if (unlikely(ret != Z_OK)) { 117 + struct btrfs_inode *inode = BTRFS_I(mapping->host); 118 + 119 + btrfs_err(inode->root->fs_info, 120 + "zlib compression init failed, error %d root %llu inode %llu offset %llu", 121 + ret, btrfs_root_id(inode->root), btrfs_ino(inode), start); 118 122 ret = -EIO; 119 123 goto out; 120 124 } ··· 188 182 } 189 183 190 184 ret = zlib_deflate(&workspace->strm, Z_SYNC_FLUSH); 191 - if (ret != Z_OK) { 192 - pr_debug("BTRFS: deflate in loop returned %d\n", 193 - ret); 185 + if (unlikely(ret != Z_OK)) { 186 + struct btrfs_inode *inode = BTRFS_I(mapping->host); 187 + 188 + btrfs_warn(inode->root->fs_info, 189 + "zlib compression failed, error %d root %llu inode %llu offset %llu", 190 + ret, btrfs_root_id(inode->root), btrfs_ino(inode), 191 + start); 194 192 zlib_deflateEnd(&workspace->strm); 195 193 ret = -EIO; 196 194 goto out; ··· 317 307 workspace->strm.avail_in -= 2; 318 308 } 319 309 320 - if (Z_OK != zlib_inflateInit2(&workspace->strm, wbits)) { 321 - pr_warn("BTRFS: inflateInit failed\n"); 310 + ret = zlib_inflateInit2(&workspace->strm, wbits); 311 + if (unlikely(ret != Z_OK)) { 312 + struct btrfs_inode *inode = cb->bbio.inode; 313 + 322 314 kunmap_local(data_in); 315 + btrfs_err(inode->root->fs_info, 316 + "zlib decompression init failed, error %d root %llu inode %llu offset %llu", 317 + ret, btrfs_root_id(inode->root), btrfs_ino(inode), cb->start); 323 318 return -EIO; 324 319 } 325 320 while (workspace->strm.total_in < srclen) { ··· 363 348 
workspace->strm.avail_in = min(tmp, PAGE_SIZE); 364 349 } 365 350 } 366 - if (ret != Z_STREAM_END) 351 + if (unlikely(ret != Z_STREAM_END)) { 352 + btrfs_err(cb->bbio.inode->root->fs_info, 353 + "zlib decompression failed, error %d root %llu inode %llu offset %llu", 354 + ret, btrfs_root_id(cb->bbio.inode->root), 355 + btrfs_ino(cb->bbio.inode), cb->start); 367 356 ret = -EIO; 368 - else 357 + } else { 369 358 ret = 0; 359 + } 370 360 done: 371 361 zlib_inflateEnd(&workspace->strm); 372 362 if (data_in) ··· 406 386 workspace->strm.avail_in -= 2; 407 387 } 408 388 409 - if (Z_OK != zlib_inflateInit2(&workspace->strm, wbits)) { 410 - pr_warn("BTRFS: inflateInit failed\n"); 389 + ret = zlib_inflateInit2(&workspace->strm, wbits); 390 + if (unlikely(ret != Z_OK)) { 391 + struct btrfs_inode *inode = BTRFS_I(dest_page->mapping->host); 392 + 393 + btrfs_err(inode->root->fs_info, 394 + "zlib decompression init failed, error %d root %llu inode %llu offset %llu", 395 + ret, btrfs_root_id(inode->root), btrfs_ino(inode), 396 + page_offset(dest_page)); 411 397 return -EIO; 412 398 } 413 399 ··· 430 404 431 405 out: 432 406 if (unlikely(to_copy != destlen)) { 433 - pr_warn_ratelimited("BTRFS: inflate failed, decompressed=%lu expected=%zu\n", 434 - to_copy, destlen); 407 + struct btrfs_inode *inode = BTRFS_I(dest_page->mapping->host); 408 + 409 + btrfs_err(inode->root->fs_info, 410 + "zlib decompression failed, error %d root %llu inode %llu offset %llu decompressed %lu expected %zu", 411 + ret, btrfs_root_id(inode->root), btrfs_ino(inode), 412 + page_offset(dest_page), to_copy, destlen); 435 413 ret = -EIO; 436 414 } else { 437 415 ret = 0;
+14 -16
fs/btrfs/zoned.c
···
  87   87           bool empty[BTRFS_NR_SB_LOG_ZONES];
  88   88           bool full[BTRFS_NR_SB_LOG_ZONES];
  89   89           sector_t sector;
  90      -         int i;
  91   90
  92      -         for (i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++) {
       91 +         for (int i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++) {
  93   92                   ASSERT(zones[i].type != BLK_ZONE_TYPE_CONVENTIONAL);
  94   93                   empty[i] = (zones[i].cond == BLK_ZONE_COND_EMPTY);
  95   94                   full[i] = sb_zone_is_full(&zones[i]);
···
 120  121                   struct address_space *mapping = bdev->bd_mapping;
 121  122                   struct page *page[BTRFS_NR_SB_LOG_ZONES];
 122  123                   struct btrfs_super_block *super[BTRFS_NR_SB_LOG_ZONES];
 123      -                 int i;
 124  124
 125      -                 for (i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++) {
      125 +                 for (int i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++) {
 126  126                           u64 zone_end = (zones[i].start + zones[i].capacity) << SECTOR_SHIFT;
 127  127                           u64 bytenr = ALIGN_DOWN(zone_end, BTRFS_SUPER_INFO_SIZE) -
 128  128                                        BTRFS_SUPER_INFO_SIZE;
···
 142  144                   else
 143  145                           sector = zones[0].start;
 144  146
 145      -                 for (i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++)
      147 +                 for (int i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++)
 146  148                           btrfs_release_disk_super(super[i]);
 147  149           } else if (!full[0] && (empty[1] || full[1])) {
 148  150                   sector = zones[0].wp;
···
 650  652           return NULL;
 651  653   }
 652  654
 653      - int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 654      -                        struct blk_zone *zone)
      655 + static int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone)
 655  656   {
 656  657           unsigned int nr_zones = 1;
 657  658           int ret;
···
 767  770           return 0;
 768  771   }
 769  772
 770      - int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info, unsigned long *mount_opt)
      773 + int btrfs_check_mountopts_zoned(const struct btrfs_fs_info *info, unsigned long *mount_opt)
 771  774   {
 772  775           if (!btrfs_is_zoned(info))
 773  776                   return 0;
···
1723 1726           if (!btrfs_is_zoned(fs_info))
1724 1727                   return false;
1725 1728
1726      -         if (!inode || !is_data_inode(&inode->vfs_inode))
     1729 +         if (!inode || !is_data_inode(inode))
1727 1730                   return false;
1728 1731
1729 1732           if (btrfs_op(&bbio->bio) != BTRFS_MAP_WRITE)
···
1765 1768   static void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered,
1766 1769                                           u64 logical)
1767 1770   {
1768      -         struct extent_map_tree *em_tree = &BTRFS_I(ordered->inode)->extent_tree;
     1771 +         struct extent_map_tree *em_tree = &ordered->inode->extent_tree;
1769 1772           struct extent_map *em;
1770 1773
1771 1774           ordered->disk_bytenr = logical;
···
1773 1776           write_lock(&em_tree->lock);
1774 1777           em = search_extent_mapping(em_tree, ordered->file_offset,
1775 1778                                      ordered->num_bytes);
1776      -         em->block_start = logical;
     1779 +         /* The em should be a new COW extent, thus it should not have an offset. */
     1780 +         ASSERT(em->offset == 0);
     1781 +         em->disk_bytenr = logical;
1777 1782           free_extent_map(em);
1778 1783           write_unlock(&em_tree->lock);
1779 1784   }
···
1786 1787           struct btrfs_ordered_extent *new;
1787 1788
1788 1789           if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags) &&
1789      -             split_extent_map(BTRFS_I(ordered->inode), ordered->file_offset,
     1790 +             split_extent_map(ordered->inode, ordered->file_offset,
1790 1791                                ordered->num_bytes, len, logical))
1791 1792                   return false;
1792 1793
···
1800 1801
1801 1802   void btrfs_finish_ordered_zoned(struct btrfs_ordered_extent *ordered)
1802 1803   {
1803      -         struct btrfs_inode *inode = BTRFS_I(ordered->inode);
     1804 +         struct btrfs_inode *inode = ordered->inode;
1804 1805           struct btrfs_fs_info *fs_info = inode->root->fs_info;
1805 1806           struct btrfs_ordered_sum *sum;
1806 1807           u64 logical, len;
···
1844 1845            * here so that we don't attempt to log the csums later.
1845 1846            */
1846 1847           if ((inode->flags & BTRFS_INODE_NODATASUM) ||
1847      -             test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state)) {
     1848 +             test_bit(BTRFS_FS_STATE_NO_DATA_CSUMS, &fs_info->fs_state)) {
1848 1849                   while ((sum = list_first_entry_or_null(&ordered->list,
1849 1850                                                          typeof(*sum), list))) {
1850 1851                           list_del(&sum->list);
···
2214 2215                   /* Ensure all writes in this block group finish */
2215 2216                   btrfs_wait_block_group_reservations(block_group);
2216 2217                   /* No need to wait for NOCOW writers. Zoned mode does not allow that */
2217      -                 btrfs_wait_ordered_roots(fs_info, U64_MAX, block_group->start,
2218      -                                          block_group->length);
     2218 +                 btrfs_wait_ordered_roots(fs_info, U64_MAX, block_group);
2219 2219                   /* Wait for extent buffers to be written. */
2220 2220                   if (is_metadata)
2221 2221                           wait_eb_writebacks(block_group);
+2 -9
fs/btrfs/zoned.h
···
  53   53   void btrfs_finish_ordered_zoned(struct btrfs_ordered_extent *ordered);
  54   54
  55   55   #ifdef CONFIG_BLK_DEV_ZONED
  56      - int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
  57      -                        struct blk_zone *zone);
  58   56   int btrfs_get_dev_zone_info_all_devices(struct btrfs_fs_info *fs_info);
  59   57   int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache);
  60   58   void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
  61   59   struct btrfs_zoned_device_info *btrfs_clone_dev_zone_info(struct btrfs_device *orig_dev);
  62   60   int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info);
  63      - int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info, unsigned long *mount_opt);
       61 + int btrfs_check_mountopts_zoned(const struct btrfs_fs_info *info, unsigned long *mount_opt);
  64   62   int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
  65   63                                  u64 *bytenr_ret);
  66   64   int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw,
···
  96   98                                struct btrfs_space_info *space_info, bool do_finish);
  97   99   void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info);
  98  100   #else /* CONFIG_BLK_DEV_ZONED */
  99      - static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 100      -                                      struct blk_zone *zone)
 101      - {
 102      -         return 0;
 103      - }
 104  101
 105  102   static inline int btrfs_get_dev_zone_info_all_devices(struct btrfs_fs_info *fs_info)
 106  103   {
···
 129  136           return -EOPNOTSUPP;
 130  137   }
 131  138
 132      - static inline int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info,
      139 + static inline int btrfs_check_mountopts_zoned(const struct btrfs_fs_info *info,
 133  140                                                 unsigned long *mount_opt)
 134  141   {
 135  142           return 0;
+52 -18
fs/btrfs/zstd.c
···
  19   19   #include <linux/zstd.h>
  20   20   #include "misc.h"
  21   21   #include "fs.h"
       22 + #include "btrfs_inode.h"
  22   23   #include "compression.h"
  23   24   #include "super.h"
  24   25
···
 400  399           /* Initialize the stream */
 401  400           stream = zstd_init_cstream(&params, len, workspace->mem,
 402  401                           workspace->size);
 403      -         if (!stream) {
 404      -                 pr_warn("BTRFS: zstd_init_cstream failed\n");
      402 +         if (unlikely(!stream)) {
      403 +                 struct btrfs_inode *inode = BTRFS_I(mapping->host);
      404 +
      405 +                 btrfs_err(inode->root->fs_info,
      406 +         "zstd compression init level %d failed, root %llu inode %llu offset %llu",
      407 +                           workspace->req_level, btrfs_root_id(inode->root),
      408 +                           btrfs_ino(inode), start);
 405  409                   ret = -EIO;
 406  410                   goto out;
 407  411           }
···
 435  429
 436  430                   ret2 = zstd_compress_stream(stream, &workspace->out_buf,
 437  431                                   &workspace->in_buf);
 438      -                 if (zstd_is_error(ret2)) {
 439      -                         pr_debug("BTRFS: zstd_compress_stream returned %d\n",
 440      -                                         zstd_get_error_code(ret2));
      432 +                 if (unlikely(zstd_is_error(ret2))) {
      433 +                         struct btrfs_inode *inode = BTRFS_I(mapping->host);
      434 +
      435 +                         btrfs_warn(inode->root->fs_info,
      436 +         "zstd compression level %d failed, error %d root %llu inode %llu offset %llu",
      437 +                                    workspace->req_level, zstd_get_error_code(ret2),
      438 +                                    btrfs_root_id(inode->root), btrfs_ino(inode),
      439 +                                    start);
 441  440                           ret = -EIO;
 442  441                           goto out;
 443  442                   }
···
 508  497                   size_t ret2;
 509  498
 510  499                   ret2 = zstd_end_stream(stream, &workspace->out_buf);
 511      -                 if (zstd_is_error(ret2)) {
 512      -                         pr_debug("BTRFS: zstd_end_stream returned %d\n",
 513      -                                         zstd_get_error_code(ret2));
      500 +                 if (unlikely(zstd_is_error(ret2))) {
      501 +                         struct btrfs_inode *inode = BTRFS_I(mapping->host);
      502 +
      503 +                         btrfs_err(inode->root->fs_info,
      504 +         "zstd compression end level %d failed, error %d root %llu inode %llu offset %llu",
      505 +                                   workspace->req_level, zstd_get_error_code(ret2),
      506 +                                   btrfs_root_id(inode->root), btrfs_ino(inode),
      507 +                                   start);
 514  508                           ret = -EIO;
 515  509                           goto out;
 516  510                   }
···
 577  561
 578  562           stream = zstd_init_dstream(
 579  563                           ZSTD_BTRFS_MAX_INPUT, workspace->mem, workspace->size);
 580      -         if (!stream) {
 581      -                 pr_debug("BTRFS: zstd_init_dstream failed\n");
      564 +         if (unlikely(!stream)) {
      565 +                 struct btrfs_inode *inode = cb->bbio.inode;
      566 +
      567 +                 btrfs_err(inode->root->fs_info,
      568 +         "zstd decompression init failed, root %llu inode %llu offset %llu",
      569 +                           btrfs_root_id(inode->root), btrfs_ino(inode), cb->start);
 582  570                   ret = -EIO;
 583  571                   goto done;
 584  572           }
···
 600  580
 601  581                   ret2 = zstd_decompress_stream(stream, &workspace->out_buf,
 602  582                                   &workspace->in_buf);
 603      -                 if (zstd_is_error(ret2)) {
 604      -                         pr_debug("BTRFS: zstd_decompress_stream returned %d\n",
 605      -                                         zstd_get_error_code(ret2));
      583 +                 if (unlikely(zstd_is_error(ret2))) {
      584 +                         struct btrfs_inode *inode = cb->bbio.inode;
      585 +
      586 +                         btrfs_err(inode->root->fs_info,
      587 +         "zstd decompression failed, error %d root %llu inode %llu offset %llu",
      588 +                                   zstd_get_error_code(ret2), btrfs_root_id(inode->root),
      589 +                                   btrfs_ino(inode), cb->start);
 606  590                           ret = -EIO;
 607  591                           goto done;
 608  592                   }
···
 661  637
 662  638           stream = zstd_init_dstream(
 663  639                           ZSTD_BTRFS_MAX_INPUT, workspace->mem, workspace->size);
 664      -         if (!stream) {
 665      -                 pr_warn("BTRFS: zstd_init_dstream failed\n");
      640 +         if (unlikely(!stream)) {
      641 +                 struct btrfs_inode *inode = BTRFS_I(dest_page->mapping->host);
      642 +
      643 +                 btrfs_err(inode->root->fs_info,
      644 +         "zstd decompression init failed, root %llu inode %llu offset %llu",
      645 +                           btrfs_root_id(inode->root), btrfs_ino(inode),
      646 +                           page_offset(dest_page));
      647 +                 ret = -EIO;
 666  648                   goto finish;
 667  649           }
 668  650
···
 685  655            * one call should end the decompression.
 686  656            */
 687  657           ret = zstd_decompress_stream(stream, &workspace->out_buf, &workspace->in_buf);
 688      -         if (zstd_is_error(ret)) {
 689      -                 pr_warn_ratelimited("BTRFS: zstd_decompress_stream return %d\n",
 690      -                                     zstd_get_error_code(ret));
      658 +         if (unlikely(zstd_is_error(ret))) {
      659 +                 struct btrfs_inode *inode = BTRFS_I(dest_page->mapping->host);
      660 +
      661 +                 btrfs_err(inode->root->fs_info,
      662 +         "zstd decompression failed, error %d root %llu inode %llu offset %llu",
      663 +                           zstd_get_error_code(ret), btrfs_root_id(inode->root),
      664 +                           btrfs_ino(inode), page_offset(dest_page));
 691  665                   goto finish;
 692  666           }
 693  667           to_copy = workspace->out_buf.pos;
+2 -17
include/trace/events/btrfs.h
···
 291  291                   __field( u64, ino )
 292  292                   __field( u64, start )
 293  293                   __field( u64, len )
 294      -                 __field( u64, orig_start )
 295      -                 __field( u64, block_start )
 296      -                 __field( u64, block_len )
 297  294                   __field( u32, flags )
 298  295                   __field( int, refs )
 299  296           ),
···
 300  303                   __entry->ino = btrfs_ino(inode);
 301  304                   __entry->start = map->start;
 302  305                   __entry->len = map->len;
 303      -                 __entry->orig_start = map->orig_start;
 304      -                 __entry->block_start = map->block_start;
 305      -                 __entry->block_len = map->block_len;
 306  306                   __entry->flags = map->flags;
 307  307                   __entry->refs = refcount_read(&map->refs);
 308  308           ),
 309  309
 310      -         TP_printk_btrfs("root=%llu(%s) ino=%llu start=%llu len=%llu "
 311      -                   "orig_start=%llu block_start=%llu(%s) "
 312      -                   "block_len=%llu flags=%s refs=%u",
      310 +         TP_printk_btrfs("root=%llu(%s) ino=%llu start=%llu len=%llu flags=%s refs=%u",
 313  311                     show_root_type(__entry->root_objectid),
 314  312                     __entry->ino,
 315  313                     __entry->start,
 316  314                     __entry->len,
 317      -                   __entry->orig_start,
 318      -                   show_map_type(__entry->block_start),
 319      -                   __entry->block_len,
 320  315                     show_map_flags(__entry->flags),
 321  316                     __entry->refs)
 322  317   );
···
2606 2617                   __field( u64, root_id )
2607 2618                   __field( u64, start )
2608 2619                   __field( u64, len )
2609      -                 __field( u64, block_start )
2610 2620                   __field( u32, flags )
2611 2621           ),
2612 2622
···
2614 2626                   __entry->root_id = inode->root->root_key.objectid;
2615 2627                   __entry->start = em->start;
2616 2628                   __entry->len = em->len;
2617      -                 __entry->block_start = em->block_start;
2618 2629                   __entry->flags = em->flags;
2619 2630           ),
2620 2631
2621      -         TP_printk_btrfs(
2622      -         "ino=%llu root=%llu(%s) start=%llu len=%llu block_start=%llu(%s) flags=%s",
     2632 +         TP_printk_btrfs("ino=%llu root=%llu(%s) start=%llu len=%llu flags=%s",
2623 2633                   __entry->ino, show_root_type(__entry->root_id),
2624 2634                   __entry->start, __entry->len,
2625      -                 show_map_type(__entry->block_start),
2626 2635                   show_map_flags(__entry->flags))
2627 2636   );
2628 2637
+9 -13
include/uapi/linux/btrfs_tree.h
···
 747  747           __le64 physical;
 748  748   } __attribute__ ((__packed__));
 749  749
 750      - /* The stripe_extent::encoding, 1:1 mapping of enum btrfs_raid_types. */
 751      - #define BTRFS_STRIPE_RAID0 1
 752      - #define BTRFS_STRIPE_RAID1 2
 753      - #define BTRFS_STRIPE_DUP 3
 754      - #define BTRFS_STRIPE_RAID10 4
 755      - #define BTRFS_STRIPE_RAID5 5
 756      - #define BTRFS_STRIPE_RAID6 6
 757      - #define BTRFS_STRIPE_RAID1C3 7
 758      - #define BTRFS_STRIPE_RAID1C4 8
 759      -
 760  750   struct btrfs_stripe_extent {
 761      -         __u8 encoding;
 762      -         __u8 reserved[7];
 763  751           /* An array of raid strides this stripe is composed of. */
 764      -         struct btrfs_raid_stride strides[];
      752 +         __DECLARE_FLEX_ARRAY(struct btrfs_raid_stride, strides);
 765  753   } __attribute__ ((__packed__));
 766  754
 767  755   #define BTRFS_HEADER_FLAG_WRITTEN (1ULL << 0)
···
 765  777   #define BTRFS_SUPER_FLAG_CHANGING_FSID (1ULL << 35)
 766  778   #define BTRFS_SUPER_FLAG_CHANGING_FSID_V2 (1ULL << 36)
 767  779
      780 + /*
      781 +  * Those are temporaray flags utilized by btrfs-progs to do offline conversion.
      782 +  * They are rejected by kernel.
      783 +  * But still keep them all here to avoid conflicts.
      784 +  */
      785 + #define BTRFS_SUPER_FLAG_CHANGING_BG_TREE (1ULL << 38)
      786 + #define BTRFS_SUPER_FLAG_CHANGING_DATA_CSUM (1ULL << 39)
      787 + #define BTRFS_SUPER_FLAG_CHANGING_META_CSUM (1ULL << 40)
 768  788
 769  789   /*
 770  790    * items in the extent btree are used to record the objectid of the