Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-6.10-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
"A fix for fast fsync that needs to handle errors during writes after
some COW failure so it does not lead to an inconsistent state"

* tag 'for-6.10-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: ensure fast fsync waits for ordered extents after a write failure

3 files changed, +57

fs/btrfs/btrfs_inode.h (+10)
···
 	BTRFS_INODE_FREE_SPACE_INODE,
 	/* Set when there are no capabilities in XATTs for the inode. */
 	BTRFS_INODE_NO_CAP_XATTR,
+	/*
+	 * Set if an error happened when doing a COW write before submitting a
+	 * bio or during writeback. Used for both buffered writes and direct IO
+	 * writes. This is to signal a fast fsync that it has to wait for
+	 * ordered extents to complete and therefore not log extent maps that
+	 * point to unwritten extents (when an ordered extent completes and it
+	 * has the BTRFS_ORDERED_IOERR flag set, it drops extent maps in its
+	 * range).
+	 */
+	BTRFS_INODE_COW_WRITE_ERROR,
 };

 /* in memory btrfs inode */
fs/btrfs/file.c (+16)
···
 	 */
 	if (full_sync || btrfs_is_zoned(fs_info)) {
 		ret = btrfs_wait_ordered_range(inode, start, len);
+		clear_bit(BTRFS_INODE_COW_WRITE_ERROR, &BTRFS_I(inode)->runtime_flags);
 	} else {
 		/*
 		 * Get our ordered extents as soon as possible to avoid doing
···
 		btrfs_get_ordered_extents_for_logging(BTRFS_I(inode),
 						      &ctx.ordered_extents);
 		ret = filemap_fdatawait_range(inode->i_mapping, start, end);
+		if (ret)
+			goto out_release_extents;
+
+		/*
+		 * Check and clear the BTRFS_INODE_COW_WRITE_ERROR now after
+		 * starting and waiting for writeback, because for buffered IO
+		 * it may have been set during the end IO callback
+		 * (end_bbio_data_write() -> btrfs_finish_ordered_extent()) in
+		 * case an error happened and we need to wait for ordered
+		 * extents to complete so that any extent maps that point to
+		 * unwritten locations are dropped and we don't log them.
+		 */
+		if (test_and_clear_bit(BTRFS_INODE_COW_WRITE_ERROR,
+				       &BTRFS_I(inode)->runtime_flags))
+			ret = btrfs_wait_ordered_range(inode, start, len);
 	}

 	if (ret)
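The wait decision in this hunk can be modelled in userspace as follows. This is only a rough sketch, not kernel code: `struct model_inode`, `MODEL_COW_WRITE_ERROR`, `model_test_and_clear()` and `model_fsync_wait()` are hypothetical names, C11 atomics stand in for the kernel bit operations, and a counter stands in for calls to btrfs_wait_ordered_range().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the BTRFS_INODE_COW_WRITE_ERROR runtime flag. */
enum { MODEL_COW_WRITE_ERROR = 1u << 0 };

struct model_inode {
	atomic_uint runtime_flags;
	int ordered_waits;	/* counts simulated waits for ordered extents */
};

/* Atomically clear a flag and report whether it was set beforehand. */
static bool model_test_and_clear(struct model_inode *inode, unsigned int flag)
{
	return (atomic_fetch_and(&inode->runtime_flags, ~flag) & flag) != 0;
}

/*
 * Model of the fsync branch touched by the patch:
 * - full sync: always wait for ordered extents, then clear the flag;
 * - fast sync: bail out on a writeback error, otherwise wait only if a
 *   COW write error was recorded since the last check.
 */
static int model_fsync_wait(struct model_inode *inode, bool full_sync,
			    int writeback_ret)
{
	if (full_sync) {
		inode->ordered_waits++;
		atomic_fetch_and(&inode->runtime_flags,
				 ~(unsigned int)MODEL_COW_WRITE_ERROR);
		return 0;
	}

	if (writeback_ret)
		return writeback_ret;

	if (model_test_and_clear(inode, MODEL_COW_WRITE_ERROR))
		inode->ordered_waits++;
	return 0;
}

/* Usage: one recorded error forces exactly one wait on the fast path. */
static void model_fsync_demo(void)
{
	struct model_inode inode = { 0 };

	atomic_store(&inode.runtime_flags, MODEL_COW_WRITE_ERROR);
	model_fsync_wait(&inode, false, 0);
	assert(inode.ordered_waits == 1);
	model_fsync_wait(&inode, false, 0);
	assert(inode.ordered_waits == 1);
}
```

In this model the flag is consumed by the first fast fsync that sees it, so later fast fsyncs keep their low-latency path unless another COW write fails.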
fs/btrfs/ordered-data.c (+31)
···
 	ret = can_finish_ordered_extent(ordered, page, file_offset, len, uptodate);
 	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);

+	/*
+	 * If this is a COW write it means we created new extent maps for the
+	 * range and they point to unwritten locations if we got an error either
+	 * before submitting a bio or during IO.
+	 *
+	 * We have marked the ordered extent with BTRFS_ORDERED_IOERR, and we
+	 * are queuing its completion below. During completion, at
+	 * btrfs_finish_one_ordered(), we will drop the extent maps for the
+	 * unwritten extents.
+	 *
+	 * However because completion runs in a work queue we can end up having
+	 * a fast fsync running before that. In the case of direct IO, once we
+	 * unlock the inode the fsync might start, and we queue the completion
+	 * before unlocking the inode. In the case of buffered IO when writeback
+	 * finishes (end_bbio_data_write()) we queue the completion, so if the
+	 * writeback was triggered by a fast fsync, the fsync might start
+	 * logging before ordered extent completion runs in the work queue.
+	 *
+	 * The fast fsync will log file extent items based on the extent maps it
+	 * finds, so if by the time it collects extent maps the ordered extent
+	 * completion didn't happen yet, it will log file extent items that
+	 * point to unwritten extents, resulting in a corruption if a crash
+	 * happens and the log tree is replayed. Note that a fast fsync does not
+	 * wait for completion of ordered extents in order to reduce latency.
+	 *
+	 * Set a flag in the inode so that the next fast fsync will wait for
+	 * ordered extents to complete before starting to log.
+	 */
+	if (!uptodate && !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags))
+		set_bit(BTRFS_INODE_COW_WRITE_ERROR, &inode->runtime_flags);
+
 	if (ret)
 		btrfs_queue_ordered_fn(ordered);
 	return ret;
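The condition that sets the flag can be illustrated with a small userspace sketch. It is illustrative only: `struct model_ordered`, `struct model_inode`, the `MODEL_*` constants and `model_end_write()` are hypothetical names, and plain (non-atomic) bitmask operations replace the kernel's test_bit()/set_bit(), which take bit numbers rather than masks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical masks standing in for the kernel's flag bits. */
enum { MODEL_ORDERED_NOCOW = 1u << 0 };
enum { MODEL_INODE_COW_WRITE_ERROR = 1u << 0 };

struct model_ordered { unsigned int flags; };
struct model_inode { unsigned int runtime_flags; };

/*
 * Record a COW write error on the inode. Only a failed write
 * (!uptodate) on an ordered extent that is not NOCOW sets the flag,
 * because only COW writes create extent maps that point to locations
 * that were never written.
 */
static void model_end_write(struct model_inode *inode,
			    const struct model_ordered *ordered,
			    bool uptodate)
{
	if (!uptodate && !(ordered->flags & MODEL_ORDERED_NOCOW))
		inode->runtime_flags |= MODEL_INODE_COW_WRITE_ERROR;
}

/* Usage: walk the three interesting cases of the condition. */
static void model_end_write_demo(void)
{
	struct model_inode inode = { 0 };
	struct model_ordered cow = { 0 };
	struct model_ordered nocow = { MODEL_ORDERED_NOCOW };

	model_end_write(&inode, &nocow, false);	/* failed NOCOW write */
	assert(inode.runtime_flags == 0);
	model_end_write(&inode, &cow, true);	/* successful COW write */
	assert(inode.runtime_flags == 0);
	model_end_write(&inode, &cow, false);	/* failed COW write */
	assert(inode.runtime_flags == MODEL_INODE_COW_WRITE_ERROR);
}
```

The flag is set before the completion work is queued, which is what lets a fast fsync that starts in between notice the failure and wait.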