Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/memory: do not populate page table entries beyond i_size

Patch series "Fix SIGBUS semantics with large folios", v3.

Accessing memory within a VMA, but beyond i_size rounded up to the next
page boundary, is supposed to generate SIGBUS.
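These semantics can be observed from userspace. The following is a minimal sketch, assuming a Linux/POSIX system and a directory that is not on a huge=always tmpfs mount (the exception this patch deliberately keeps). probe_file_tail() and read_raises_sigbus() are hypothetical helpers, not part of any kernel or libc API. The sketch maps two pages of a file shorter than one page and checks which reads raise SIGBUS.

```c
#define _XOPEN_SOURCE 700   /* mkstemp(), sigsetjmp(), ftruncate() */
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

static sigjmp_buf probe_env;

static void on_sigbus(int sig)
{
	(void)sig;
	siglongjmp(probe_env, 1);       /* abandon the faulting load */
}

/* Returns true if reading the byte at p raises SIGBUS. */
static bool read_raises_sigbus(const volatile char *p)
{
	struct sigaction sa, old;
	bool faulted = false;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, &old);

	if (sigsetjmp(probe_env, 1) == 0)
		(void)*p;               /* volatile read: the load really happens */
	else
		faulted = true;

	sigaction(SIGBUS, &old, NULL);
	return faulted;
}

/*
 * Hypothetical helper. Creates a `size`-byte file in `dir` (assumption: not
 * a huge=always tmpfs mount), maps two pages of it and records whether
 * reading just past EOF inside page 0 (faults[0]) and reading page 1, which
 * lies wholly past EOF (faults[1]), raised SIGBUS.
 * Returns 0 on success, -1 on a setup error.
 */
static int probe_file_tail(const char *dir, off_t size, bool faults[2])
{
	long page = sysconf(_SC_PAGESIZE);
	char path[4096];
	char *map;
	int fd;

	assert(size > 0 && size < page);
	snprintf(path, sizeof(path), "%s/sigbus-XXXXXX", dir);
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);                   /* file lives on while fd or map exist */
	if (ftruncate(fd, size) != 0) {
		close(fd);
		return -1;
	}
	map = mmap(NULL, 2 * (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);                      /* the mapping keeps the file alive */
	if (map == MAP_FAILED)
		return -1;

	faults[0] = read_raises_sigbus(map + size);
	faults[1] = read_raises_sigbus(map + page);
	munmap(map, 2 * (size_t)page);
	return 0;
}
```

A call such as probe_file_tail("/tmp", 100, f) should leave f[0] false (the tail of page 0 reads as zeros) and f[1] true on a filesystem that preserves the semantics described above.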

Darrick reported[1] an xfstests regression in v6.18-rc1: generic/749
failed because the expected SIGBUS was not delivered. This was caused by
my recent changes that try to fault in the whole folio where possible:

19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
357b92761d94 ("mm/filemap: map entire large folio faultaround")

These changes did not consider i_size when setting up PTEs, leading to
the xfstests breakage.

However, the problem has been present in the kernel for a long time,
since huge tmpfs was introduced in 2016. The kernel happily maps
PMD-sized folios with a PMD without checking i_size, and huge=always
tmpfs allocates PMD-sized folios on any write.

I considered this corner case when I implemented huge tmpfs, and my
conclusion was that no one in their right mind should rely on receiving a
SIGBUS signal when accessing beyond i_size. I cannot imagine how it could
be useful for any real workload.

But apparently filesystem folks care a lot about preserving strict SIGBUS
semantics.

generic/749 was introduced last year with reference to POSIX, but no real
workloads were mentioned. It also acknowledged that tmpfs deviates from
what the test case expects.

POSIX indeed says[3]:

    References within the address range starting at pa and
    continuing for len bytes to whole pages following the end of an
    object shall result in delivery of a SIGBUS signal.
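For a mapping that starts at file offset 0, the POSIX rule reduces to one page-index comparison. A minimal sketch, assuming that zero mapping offset and a page size passed in by the caller; must_sigbus() and pages_with_data() are hypothetical helpers, the latter doing the same rounding as DIV_ROUND_UP(i_size, PAGE_SIZE) in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* DIV_ROUND_UP(i_size, page_size): number of pages holding any file data. */
static uint64_t pages_with_data(uint64_t i_size, uint64_t page_size)
{
	assert(page_size > 0);
	return (i_size + page_size - 1) / page_size;
}

/*
 * True if a reference at byte `offset` of a mapping that starts at file
 * offset 0 (assumption) lands in a whole page following the end of the
 * object, i.e. POSIX requires SIGBUS there.
 */
static bool must_sigbus(uint64_t offset, uint64_t i_size, uint64_t page_size)
{
	return offset / page_size >= pages_with_data(i_size, page_size);
}
```

For a 100-byte file and 4096-byte pages, offsets 0 through 4095 fall in the first (zero-filled) page, while offset 4096 and beyond must raise SIGBUS. An empty file has no data pages, so every offset must.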

The patchset fixes the regression introduced by the recent changes, as
well as a more subtle SIGBUS breakage caused by folio split failure on
truncation.


This patch (of 2):

Accesses within a VMA, but beyond i_size rounded up to PAGE_SIZE, are
supposed to generate SIGBUS.

Recent changes attempted to fault in the full folio where possible. They
did not respect i_size, which led to populating PTEs beyond i_size and
breaking SIGBUS semantics.

Darrick reported generic/749 breakage because of this.

However, the problem existed before the recent changes. With huge=always
tmpfs, any write to a file leads to a PMD-sized allocation. A subsequent
fault on that folio installs a PMD mapping regardless of i_size.

Fix filemap_map_pages() and finish_fault() to not install:
- PTEs beyond i_size;
- PMD mappings across i_size.

Make an exception for shmem/tmpfs, which has intentionally mapped with
PMDs across i_size for a long time.

Link: https://lkml.kernel.org/r/20251027115636.82382-1-kirill@shutemov.name
Link: https://lkml.kernel.org/r/20251027115636.82382-2-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Fixes: 6795801366da ("xfs: Support large folios")
Reported-by: "Darrick J. Wong" <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Kiryl Shutsemau and committed by Andrew Morton.
Commit 74207de2 (parent 895b4c0c)

2 files changed, +39 -9

mm/filemap.c (+20 -8):

 static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 			struct folio *folio, unsigned long start,
 			unsigned long addr, unsigned int nr_pages,
-			unsigned long *rss, unsigned short *mmap_miss)
+			unsigned long *rss, unsigned short *mmap_miss,
+			bool can_map_large)
 {
 	unsigned int ref_from_caller = 1;
 	vm_fault_t ret = 0;
···
 	 * The folio must not cross VMA or page table boundary.
 	 */
 	addr0 = addr - start * PAGE_SIZE;
-	if (folio_within_vma(folio, vmf->vma) &&
+	if (can_map_large && folio_within_vma(folio, vmf->vma) &&
 	    (addr0 & PMD_MASK) == ((addr0 + folio_size(folio) - 1) & PMD_MASK)) {
 		vmf->pte -= start;
 		page -= start;
···
 	unsigned long rss = 0;
 	unsigned int nr_pages = 0, folio_type;
 	unsigned short mmap_miss = 0, mmap_miss_saved;
+	bool can_map_large;
 
 	rcu_read_lock();
 	folio = next_uptodate_folio(&xas, mapping, end_pgoff);
 	if (!folio)
 		goto out;
 
-	if (filemap_map_pmd(vmf, folio, start_pgoff)) {
+	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
+	end_pgoff = min(end_pgoff, file_end);
+
+	/*
+	 * Do not allow to map with PTEs beyond i_size and with PMD
+	 * across i_size to preserve SIGBUS semantics.
+	 *
+	 * Make an exception for shmem/tmpfs that for long time
+	 * intentionally mapped with PMDs across i_size.
+	 */
+	can_map_large = shmem_mapping(mapping) ||
+		file_end >= folio_next_index(folio);
+
+	if (can_map_large && filemap_map_pmd(vmf, folio, start_pgoff)) {
 		ret = VM_FAULT_NOPAGE;
 		goto out;
 	}
···
 		folio_put(folio);
 		goto out;
 	}
-
-	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
-	if (end_pgoff > file_end)
-		end_pgoff = file_end;
 
 	folio_type = mm_counter_file(folio);
 	do {
···
 		else
 			ret |= filemap_map_folio_range(vmf, folio,
 					xas.xa_index - folio->index, addr,
-					nr_pages, &rss, &mmap_miss);
+					nr_pages, &rss, &mmap_miss,
+					can_map_large);
 
 		folio_unlock(folio);
 	} while ((folio = next_uptodate_folio(&xas, mapping, end_pgoff)) != NULL);
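To make the arithmetic in the new filemap_map_pages() code concrete, here is a userspace model of the two checks it adds: clamping end_pgoff to the last page index and deciding can_map_large. This is a sketch, not kernel code: struct folio_model, model_folio_next_index() and model_filemap_can_map_large() are hypothetical stand-ins for struct folio, folio_next_index() and the inline logic, and the model assumes a non-empty file because a "last page index" is undefined for i_size == 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins for the kernel types and helpers. */
typedef unsigned long pgoff_t;

struct folio_model {
	pgoff_t index;          /* page index of the folio's first page */
	unsigned long nr_pages; /* number of pages in the folio */
};

/* Index one past the folio's last page, like folio_next_index(). */
static pgoff_t model_folio_next_index(const struct folio_model *f)
{
	return f->index + f->nr_pages;
}

/*
 * Models the checks added to filemap_map_pages(): clamps *end_pgoff to the
 * last page that holds file data and returns the can_map_large decision.
 * Assumes i_size > 0.
 */
static bool model_filemap_can_map_large(uint64_t i_size, unsigned long page_size,
					bool is_shmem,
					const struct folio_model *f,
					pgoff_t *end_pgoff)
{
	assert(i_size > 0 && page_size > 0);

	/* DIV_ROUND_UP(i_size, PAGE_SIZE) - 1: index of the last data page. */
	pgoff_t file_end = (pgoff_t)((i_size + page_size - 1) / page_size) - 1;

	/* end_pgoff = min(end_pgoff, file_end): no PTEs beyond i_size. */
	if (*end_pgoff > file_end)
		*end_pgoff = file_end;

	/* shmem/tmpfs is exempt; otherwise require file_end >= next index. */
	return is_shmem || file_end >= model_folio_next_index(f);
}
```

For example, with 4096-byte pages and i_size = 5000, file_end is 1: a four-page folio at index 0 gives can_map_large == false unless is_shmem is set, and an end_pgoff of 3 is clamped to 1.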
mm/memory.c (+19 -1):

 #include <linux/gfp.h>
 #include <linux/migrate.h>
 #include <linux/string.h>
+#include <linux/shmem_fs.h>
 #include <linux/memory-tiers.h>
 #include <linux/debugfs.h>
 #include <linux/userfaultfd_k.h>
···
 		return ret;
 	}
 
+	if (!needs_fallback && vma->vm_file) {
+		struct address_space *mapping = vma->vm_file->f_mapping;
+		pgoff_t file_end;
+
+		file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
+
+		/*
+		 * Do not allow to map with PTEs beyond i_size and with PMD
+		 * across i_size to preserve SIGBUS semantics.
+		 *
+		 * Make an exception for shmem/tmpfs that for long time
+		 * intentionally mapped with PMDs across i_size.
+		 */
+		needs_fallback = !shmem_mapping(mapping) &&
+			file_end < folio_next_index(folio);
+	}
+
 	if (pmd_none(*vmf->pmd)) {
-		if (folio_test_pmd_mappable(folio)) {
+		if (!needs_fallback && folio_test_pmd_mappable(folio)) {
 			ret = do_set_pmd(vmf, folio, page);
 			if (ret != VM_FAULT_FALLBACK)
 				return ret;
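The finish_fault() side can be modeled the same way. In this sketch, needs_fallback enters already set by whatever earlier checks finish_fault() performs (they are outside this diff), has_file stands for vma->vm_file != NULL, and model_finish_fault_needs_fallback() with its folio_index/folio_nr_pages parameters is a hypothetical stand-in for the folio helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef unsigned long pgoff_t;  /* hypothetical stand-in for the kernel type */

/*
 * Models the needs_fallback update added to finish_fault().
 * folio_index + folio_nr_pages plays the role of folio_next_index(folio).
 */
static bool model_finish_fault_needs_fallback(bool needs_fallback,
					      bool has_file, bool is_shmem,
					      uint64_t i_size,
					      unsigned long page_size,
					      pgoff_t folio_index,
					      unsigned long folio_nr_pages)
{
	assert(page_size > 0);

	if (!needs_fallback && has_file) {
		/* DIV_ROUND_UP(i_size, PAGE_SIZE): a page count, not an index. */
		pgoff_t file_end = (pgoff_t)((i_size + page_size - 1) / page_size);

		/*
		 * A non-shmem folio that extends past the last data page forces
		 * the fallback path, so do_set_pmd() is skipped.
		 */
		needs_fallback = !is_shmem &&
				 file_end < folio_index + folio_nr_pages;
	}
	return needs_fallback;
}
```

Reading the two checks side by side: finish_fault() computes file_end as a page count and tests file_end < folio_next_index(folio), while filemap_map_pages() computes it as the last page index and tests file_end >= folio_next_index(folio). They agree except when a folio ends exactly on the file's last page, which finish_fault() allows to be mapped large but the filemap_map_pages() check does not.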