ext4: zero post-EOF partial block before appending write

When an appending write starts beyond EOF, ext4_block_zero_eof() is
called within ext4_*_write_end() to zero out the partial block beyond
EOF. This prevents exposing stale data that might have been written
through mmap.
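
To illustrate which bytes are affected, the following user-space sketch
computes the post-EOF range inside the old EOF block. It assumes that
zeroing runs from the old EOF to the end of that block, capped at the
new write position. post_eof_zero_range() and struct zero_range are
hypothetical names for illustration, not the kernel's
ext4_block_zero_eof().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model, not a kernel structure. */
struct zero_range {
	uint64_t start;	/* first byte to zero (the old EOF) */
	uint64_t len;	/* number of bytes to zero; 0 means nothing to do */
};

/*
 * Assumption: only the tail of the block containing the old EOF needs
 * zeroing, and never past the position where the new write starts.
 */
static struct zero_range post_eof_zero_range(uint64_t old_size, uint64_t pos,
					     uint32_t blocksize)
{
	struct zero_range r = { old_size, 0 };
	uint64_t offset, block_end;

	assert(blocksize != 0);
	offset = old_size % blocksize;

	/* Write is not beyond EOF, or EOF is already block aligned. */
	if (pos <= old_size || offset == 0)
		return r;

	block_end = old_size - offset + blocksize;
	r.len = (pos < block_end ? pos : block_end) - old_size;
	return r;
}
```

With a 4096-byte block size, an old size of 5000 and a write at 10000,
this model yields the range starting at 5000 with length 3192, that is,
up to the block boundary at 8192.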

However, supporting only the regular buffered write path is
insufficient. It is also necessary to support the DAX path as well as
the upcoming iomap buffered write path. Therefore, move this operation
to ext4_write_checks().

In addition, this may introduce a race window: an mmap write can land
in the old EOF block after that block has been zeroed but before the
post-EOF buffered write completes. As a result, the data in this block
written by the buffered write and the data written by the mmap write
may be mixed. However, this is safe because users should not rely on
the result of a race condition.
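
The interleaving described above can be sketched with a toy user-space
model. Everything here is hypothetical (struct toy_file, the toy_*
helpers, the 16-byte block size, the chosen offsets); it only replays
one possible ordering and is not how the kernel implements any of these
steps.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16	/* hypothetical, tiny for readability */

struct toy_file {
	unsigned char block[TOY_BLOCK_SIZE];	/* cached old EOF block */
	size_t size;				/* stands in for i_size */
};

/* Zero from the current EOF up to pos, capped at the block end. */
static void toy_zero_eof(struct toy_file *f, size_t pos)
{
	size_t end = pos < TOY_BLOCK_SIZE ? pos : TOY_BLOCK_SIZE;

	if (end > f->size)
		memset(f->block + f->size, 0, end - f->size);
}

/* An mmap store hits the cached block regardless of the file size. */
static void toy_mmap_store(struct toy_file *f, size_t off, unsigned char byte)
{
	assert(off < TOY_BLOCK_SIZE);
	f->block[off] = byte;
}

/* A buffered write copies data and extends the size if needed. */
static void toy_buffered_write(struct toy_file *f, size_t pos,
			       const void *buf, size_t len)
{
	assert(pos + len <= TOY_BLOCK_SIZE);
	memcpy(f->block + pos, buf, len);
	if (pos + len > f->size)
		f->size = pos + len;
}

/* One racy ordering: zero first, mmap store next, data copy last. */
static void toy_racy_append(struct toy_file *f)
{
	toy_zero_eof(f, 10);			/* done in the write checks */
	toy_mmap_store(f, 6, 'M');		/* lands in the race window */
	toy_buffered_write(f, 10, "BB", 2);	/* the appending write */
}
```

Starting from a size of 4, the model ends with a size of 12 and the
mmap byte at offset 6 sitting between zeroes and the buffered data,
which is the mixed result the message refers to.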

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260327102939.1095257-14-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

authored by Zhang Yi and committed by Theodore Ts'o 2 months ago 3f60efd6 1ad0f428

2 changed files, +24 -14


fs/ext4/file.c (+17)

···
 
 static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
 {
+	struct inode *inode = file_inode(iocb->ki_filp);
+	loff_t old_size = i_size_read(inode);
 	ssize_t ret, count;
 
 	count = ext4_generic_write_checks(iocb, from);
···
 	ret = file_modified(iocb->ki_filp);
 	if (ret)
 		return ret;
+
+	/*
+	 * If the position is beyond the EOF, it is necessary to zero out the
+	 * partial block that beyond the existing EOF, as it may contains
+	 * stale data written through mmap.
+	 */
+	if (iocb->ki_pos > old_size && !ext4_verity_in_progress(inode)) {
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
+
+		ret = ext4_block_zero_eof(inode, old_size, iocb->ki_pos);
+		if (ret)
+			return ret;
+	}
+
 	return count;
 }
 
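
The decision added to ext4_write_checks() can be restated as a small
user-space model, shown here only to make the three outcomes explicit.
struct toy_write_req and eof_zero_decision() are hypothetical names; the
nowait and verity_in_progress fields stand in for IOCB_NOWAIT and
ext4_verity_in_progress(), and the real zeroing call is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the parts of the request that matter here. */
struct toy_write_req {
	int64_t pos;			/* stands in for iocb->ki_pos */
	bool nowait;			/* stands in for IOCB_NOWAIT */
	bool verity_in_progress;	/* stands in for the verity check */
};

/*
 * Returns 1 if the post-EOF partial block should be zeroed,
 * 0 if no zeroing is needed, or -EAGAIN if the caller asked not to
 * block and zeroing would be required.
 */
static int eof_zero_decision(const struct toy_write_req *req, int64_t old_size)
{
	assert(req != NULL);

	/* Writes at or before EOF, or during verity setup, skip zeroing. */
	if (req->pos <= old_size || req->verity_in_progress)
		return 0;

	/* Zeroing may block, so a non-blocking request is bounced. */
	if (req->nowait)
		return -EAGAIN;

	return 1;
}
```

Note that in this model a non-blocking write at or before EOF is never
bounced, because the position check comes first, matching the order of
the checks in the hunk above.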

fs/ext4/inode.c (+7 -14)

···
 	folio_unlock(folio);
 	folio_put(folio);
 
-	if (old_size < pos && !verity) {
+	if (old_size < pos && !verity)
 		pagecache_isize_extended(inode, old_size, pos);
-		ext4_block_zero_eof(inode, old_size, pos);
-	}
+
 	/*
 	 * Don't mark the inode dirty under folio lock. First, it unnecessarily
 	 * makes the holding time of folio lock longer. Second, it forces lock
···
 	folio_unlock(folio);
 	folio_put(folio);
 
-	if (old_size < pos && !verity) {
+	if (old_size < pos && !verity)
 		pagecache_isize_extended(inode, old_size, pos);
-		ext4_block_zero_eof(inode, old_size, pos);
-	}
 
 	if (size_changed) {
 		ret2 = ext4_mark_inode_dirty(handle, inode);
···
 	struct inode *inode = mapping->host;
 	loff_t old_size = inode->i_size;
 	bool disksize_changed = false;
-	loff_t new_i_size, zero_len = 0;
+	loff_t new_i_size;
 	handle_t *handle;
 
 	if (unlikely(!folio_buffers(folio))) {
···
 	folio_unlock(folio);
 	folio_put(folio);
 
-	if (pos > old_size) {
+	if (pos > old_size)
 		pagecache_isize_extended(inode, old_size, pos);
-		zero_len = pos - old_size;
-	}
 
-	if (!disksize_changed && !zero_len)
+	if (!disksize_changed)
 		return copied;
 
-	handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+	handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
-	if (zero_len)
-		ext4_block_zero_eof(inode, old_size, pos);
 	ext4_mark_inode_dirty(handle, inode);
 	ext4_journal_stop(handle);
 
