Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

btrfs: inhibit extent buffer writeback to prevent COW amplification

Inhibit writeback on COW'd extent buffers for the lifetime of the
transaction handle, preventing background writeback from setting
BTRFS_HEADER_FLAG_WRITTEN and causing unnecessary re-COW.

COW amplification occurs when background writeback flushes an extent
buffer that a transaction handle is still actively modifying. When
lock_extent_buffer_for_io() transitions a buffer from dirty to
writeback, it sets BTRFS_HEADER_FLAG_WRITTEN, marking the block as
having been persisted to disk at its current bytenr. Once WRITTEN is
set, should_cow_block() must either COW the block again or overwrite
it in place, both of which are unnecessary overhead when the buffer
is still being modified by the same handle that allocated it. By
inhibiting background writeback on actively-used buffers, WRITTEN is
never set while a transaction handle holds a reference to the buffer,
avoiding this overhead entirely.

Add an atomic_t writeback_inhibitors counter to struct extent_buffer,
which fits in an existing 6-byte hole without increasing struct size.
When a buffer is COW'd in btrfs_force_cow_block(), call
btrfs_inhibit_eb_writeback() to store the eb in the transaction
handle's writeback_inhibited_ebs xarray (keyed by eb->start), take a
reference, and increment writeback_inhibitors. The function handles
dedup (same eb inhibited twice by the same handle) and replacement
(different eb at the same logical address). Allocation failure is
graceful: the buffer simply falls back to the pre-existing behavior
where it may be written back and re-COW'd.

Also inhibit writeback in should_cow_block() when COW is skipped,
so that every transaction handle that reuses an already-COW'd buffer
also inhibits its writeback. Without this, if handle A COWs a block
and inhibits it, and handle B later reuses the same block without
inhibiting, handle A's uninhibit on end_transaction leaves the buffer
unprotected while handle B is still using it. This ensures all handles
that access a COW'd buffer contribute to the inhibitor count, and the
buffer remains protected until the last handle releases it.

In lock_extent_buffer_for_io(), when writeback_inhibitors is non-zero
and the writeback mode is WB_SYNC_NONE, skip the buffer. WB_SYNC_NONE
is used by the VM flusher threads for background and periodic
writeback, which are the only paths that cause COW amplification by
opportunistically writing out dirty extent buffers mid-transaction.
Skipping these is safe because the buffers remain dirty in the page
cache and will be written out at transaction commit time.

WB_SYNC_ALL must always proceed regardless of writeback_inhibitors.
This is required for correctness in the fsync path: btrfs_sync_log()
writes log tree blocks via filemap_fdatawrite_range() (WB_SYNC_ALL)
while the transaction handle that inhibited those same blocks is still
active. Without the WB_SYNC_ALL bypass, those inhibited log tree
blocks would be silently skipped, resulting in an incomplete log on
disk and corruption on replay. btrfs_write_and_wait_transaction()
also uses WB_SYNC_ALL via filemap_fdatawrite_range(); for that path,
inhibitors are already cleared beforehand, but the bypass ensures
correctness regardless.

Uninhibit in __btrfs_end_transaction() before atomic_dec(num_writers)
to prevent a race where the committer proceeds while buffers are still
inhibited. Also uninhibit in btrfs_commit_transaction() before writing
and in cleanup_transaction() for the error path.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>

authored by

Leo Martins and committed by
David Sterba
f9a48549 cab4c8b5

+98 -3
+7 -2
fs/btrfs/ctree.c
··· 21 21 #include "fs.h" 22 22 #include "accessors.h" 23 23 #include "extent-tree.h" 24 + #include "extent_io.h" 24 25 #include "relocation.h" 25 26 #include "file-item.h" 26 27 ··· 591 590 btrfs_tree_unlock(buf); 592 591 free_extent_buffer_stale(buf); 593 592 btrfs_mark_buffer_dirty(trans, cow); 593 + 594 + btrfs_inhibit_eb_writeback(trans, cow); 595 + 594 596 *cow_ret = cow; 595 597 return 0; 596 598 ··· 603 599 return ret; 604 600 } 605 601 606 - static inline bool should_cow_block(const struct btrfs_trans_handle *trans, 602 + static inline bool should_cow_block(struct btrfs_trans_handle *trans, 607 603 const struct btrfs_root *root, 608 - const struct extent_buffer *buf) 604 + struct extent_buffer *buf) 609 605 { 610 606 if (btrfs_is_testing(root->fs_info)) 611 607 return false; ··· 639 635 if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) 640 636 return true; 641 637 638 + btrfs_inhibit_eb_writeback(trans, buf); 642 639 return false; 643 640 } 644 641
+63 -1
fs/btrfs/extent_io.c
··· 14 14 #include <linux/pagevec.h> 15 15 #include <linux/prefetch.h> 16 16 #include <linux/fsverity.h> 17 + #include <linux/lockdep.h> 17 18 #include "extent_io.h" 18 19 #include "extent-io-tree.h" 19 20 #include "extent_map.h" ··· 1956 1955 * of time. 1957 1956 */ 1958 1957 spin_lock(&eb->refs_lock); 1959 - if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) { 1958 + if ((wbc->sync_mode == WB_SYNC_ALL || 1959 + atomic_read(&eb->writeback_inhibitors) == 0) && 1960 + test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) { 1960 1961 XA_STATE(xas, &fs_info->buffer_tree, eb->start >> fs_info->nodesize_bits); 1961 1962 unsigned long flags; 1962 1963 ··· 2993 2990 kmem_cache_free(extent_buffer_cache, eb); 2994 2991 } 2995 2992 2993 + /* 2994 + * Inhibit writeback on buffer during transaction. 2995 + * 2996 + * @trans: transaction handle that will own the inhibitor 2997 + * @eb: extent buffer to inhibit writeback on 2998 + * 2999 + * Attempt to track this extent buffer in the transaction's inhibited set. If 3000 + * memory allocation fails, the buffer is simply not tracked. It may be written 3001 + * back and need re-COW, which is the original behavior. This is acceptable 3002 + * since inhibiting writeback is an optimization. 3003 + */ 3004 + void btrfs_inhibit_eb_writeback(struct btrfs_trans_handle *trans, struct extent_buffer *eb) 3005 + { 3006 + unsigned long index = eb->start >> trans->fs_info->nodesize_bits; 3007 + void *old; 3008 + 3009 + lockdep_assert_held(&eb->lock); 3010 + /* Check if already inhibited by this handle. */ 3011 + old = xa_load(&trans->writeback_inhibited_ebs, index); 3012 + if (old == eb) 3013 + return; 3014 + 3015 + /* Take reference for the xarray entry. */ 3016 + refcount_inc(&eb->refs); 3017 + 3018 + old = xa_store(&trans->writeback_inhibited_ebs, index, eb, GFP_NOFS); 3019 + if (xa_is_err(old)) { 3020 + /* Allocation failed, just skip inhibiting this buffer. */ 3021 + free_extent_buffer(eb); 3022 + return; 3023 + } 3024 + 3025 + /* Handle replacement of different eb at same index. */ 3026 + if (old && old != eb) { 3027 + struct extent_buffer *old_eb = old; 3028 + 3029 + atomic_dec(&old_eb->writeback_inhibitors); 3030 + free_extent_buffer(old_eb); 3031 + } 3032 + 3033 + atomic_inc(&eb->writeback_inhibitors); 3034 + } 3035 + 3036 + /* 3037 + * Uninhibit writeback on all extent buffers. 3038 + */ 3039 + void btrfs_uninhibit_all_eb_writeback(struct btrfs_trans_handle *trans) 3040 + { 3041 + struct extent_buffer *eb; 3042 + unsigned long index; 3043 + 3044 + xa_for_each(&trans->writeback_inhibited_ebs, index, eb) { 3045 + atomic_dec(&eb->writeback_inhibitors); 3046 + free_extent_buffer(eb); 3047 + } 3048 + xa_destroy(&trans->writeback_inhibited_ebs); 3049 + } 3050 + 2996 3051 static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info, 2997 3052 u64 start) 2998 3053 { ··· 3061 3000 eb->len = fs_info->nodesize; 3062 3001 eb->fs_info = fs_info; 3063 3002 init_rwsem(&eb->lock); 3003 + atomic_set(&eb->writeback_inhibitors, 0); 3064 3004 3065 3005 btrfs_leak_debug_add_eb(eb); 3066 3006
+6
fs/btrfs/extent_io.h
··· 99 99 spinlock_t refs_lock; 100 100 refcount_t refs; 101 101 int read_mirror; 102 + /* Inhibit WB_SYNC_NONE writeback when > 0. */ 103 + atomic_t writeback_inhibitors; 102 104 /* >= 0 if eb belongs to a log tree, -1 otherwise */ 103 105 s8 log_index; 104 106 u8 folio_shift; ··· 382 380 #else 383 381 #define btrfs_extent_buffer_leak_debug_check(fs_info) do {} while (0) 384 382 #endif 383 + 384 + void btrfs_inhibit_eb_writeback(struct btrfs_trans_handle *trans, 385 + struct extent_buffer *eb); 386 + void btrfs_uninhibit_all_eb_writeback(struct btrfs_trans_handle *trans); 385 387 386 388 #endif
+19
fs/btrfs/transaction.c
··· 15 15 #include "misc.h" 16 16 #include "ctree.h" 17 17 #include "disk-io.h" 18 + #include "extent_io.h" 18 19 #include "transaction.h" 19 20 #include "locking.h" 20 21 #include "tree-log.h" ··· 697 696 goto alloc_fail; 698 697 } 699 698 699 + xa_init(&h->writeback_inhibited_ebs); 700 + 700 701 /* 701 702 * If we are JOIN_NOLOCK we're already committing a transaction and 702 703 * waiting on this guy, so we don't need to do the sb_start_intwrite ··· 1094 1091 1095 1092 if (trans->type & __TRANS_FREEZABLE) 1096 1093 sb_end_intwrite(info->sb); 1094 + 1095 + /* 1096 + * Uninhibit extent buffer writeback before decrementing num_writers, 1097 + * since the decrement wakes the committing thread which needs all 1098 + * buffers uninhibited to write them to disk. 1099 + */ 1100 + btrfs_uninhibit_all_eb_writeback(trans); 1097 1101 1098 1102 WARN_ON(cur_trans != info->running_transaction); 1099 1103 WARN_ON(atomic_read(&cur_trans->num_writers) < 1); ··· 2134 2124 if (!test_bit(BTRFS_FS_RELOC_RUNNING, &fs_info->flags)) 2135 2125 btrfs_scrub_cancel(fs_info); 2136 2126 2127 + btrfs_uninhibit_all_eb_writeback(trans); 2137 2128 kmem_cache_free(btrfs_trans_handle_cachep, trans); 2138 2129 } 2139 2130 ··· 2573 2562 if (test_bit(BTRFS_FS_FEATURE_CHANGED, &fs_info->flags) && 2574 2563 fs_info->cleaner_kthread) 2575 2564 wake_up_process(fs_info->cleaner_kthread); 2565 + 2566 + /* 2567 + * Uninhibit writeback on all extent buffers inhibited during this 2568 + * transaction before writing them to disk. Inhibiting prevented 2569 + * writeback while the transaction was building, but now we need 2570 + * them written. 2571 + */ 2572 + btrfs_uninhibit_all_eb_writeback(trans); 2576 2573 2577 2574 ret = btrfs_write_and_wait_transaction(trans); 2578 2575 if (unlikely(ret)) {
+3
fs/btrfs/transaction.h
··· 12 12 #include <linux/time64.h> 13 13 #include <linux/mutex.h> 14 14 #include <linux/wait.h> 15 + #include <linux/xarray.h> 15 16 #include "btrfs_inode.h" 16 17 #include "delayed-ref.h" 17 18 ··· 163 162 struct btrfs_fs_info *fs_info; 164 163 struct list_head new_bgs; 165 164 struct btrfs_block_rsv delayed_rsv; 165 + /* Extent buffers with writeback inhibited by this handle. */ 166 + struct xarray writeback_inhibited_ebs; 166 167 }; 167 168 168 169 /*