Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/filemap: allow arch to request folio size for exec memory

Change the readahead config so that if it is being requested for an
executable mapping, do a synchronous read into a set of folios with an
arch-specified order and in a naturally aligned manner. We no longer
center the read on the faulting page but simply align it down to the
previous natural boundary. Additionally, we don't bother with an
asynchronous part.

On arm64 if memory is physically contiguous and naturally aligned to the
"contpte" size, we can use contpte mappings, which improves utilization of
the TLB. When paired with the "multi-size THP" feature, this works well
to reduce dTLB pressure. However, iTLB pressure is still high due to
executable mappings having a low likelihood of being in the required folio
size and mapping alignment, even when the filesystem supports readahead
into large folios (e.g. XFS).

The reason for the low likelihood is that the current readahead algorithm
starts with an order-0 folio and increases the folio order by 2 every time
the readahead mark is hit. But most executable memory tends to be
accessed randomly and so the readahead mark is rarely hit and most
executable folios remain order-0.

So let's special-case the read(ahead) logic for executable mappings. The
trade-off is performance improvement (due to more efficient storage of the
translations in iTLB) vs potential for making reclaim more difficult (due
to the folios being larger so if a part of the folio is hot the whole
thing is considered hot). But executable memory is a small portion of the
overall system memory so I doubt this will even register from a reclaim
perspective.

I've chosen 64K folio size for arm64 which benefits both the 4K and 16K
base page size configs. Crucially the same amount of data is still read
(usually 128K) so I'm not expecting any read amplification issues. I
don't anticipate any write amplification because text is always RO.

Note that the text region of an ELF file could be populated into the page
cache for reasons other than taking a fault in an mmapped area. The most
common case is due to the loader read()ing the header which can be shared
with the beginning of text. So some text will still remain in small
folios, but this simple, best effort change provides good performance
improvements as is.

Confine this special-case approach to the bounds of the VMA. This
prevents wasting memory for any padding that might exist in the file
between sections. Previously the padding would have been contained in
order-0 folios and would be easy to reclaim. But now it would be part of
a larger folio so more difficult to reclaim. Solve this by simply not
reading it into memory in the first place.

Benchmarking
============

The below shows pgbench and redis benchmarks on a Graviton3 arm64 system.

First, confirmation that this patch causes more text to be contained in
64K folios:

+----------------------+---------------+---------------+---------------+
| File-backed folios by|  system boot  |    pgbench    |     redis     |
| size as percentage of+-------+-------+-------+-------+-------+-------+
| all mapped text mem  |before | after |before | after |before | after |
+======================+=======+=======+=======+=======+=======+=======+
| base-page-4kB        |   78% |   30% |   78% |   11% |   73% |   14% |
| thp-aligned-8kB      |    1% |    0% |    0% |    0% |    1% |    0% |
| thp-aligned-16kB     |   17% |    4% |   17% |    3% |   20% |    4% |
| thp-aligned-32kB     |    1% |    1% |    1% |    2% |    1% |    1% |
| thp-aligned-64kB     |    3% |   63% |    3% |   81% |    4% |   77% |
| thp-aligned-128kB    |    0% |    1% |    1% |    1% |    1% |    2% |
| thp-unaligned-64kB   |    0% |    0% |    0% |    1% |    0% |    1% |
| thp-unaligned-128kB  |    0% |    1% |    0% |    0% |    0% |    0% |
| thp-partial          |    0% |    0% |    0% |    1% |    0% |    1% |
+----------------------+-------+-------+-------+-------+-------+-------+
| cont-aligned-64kB    |    4% |   65% |    4% |   83% |    6% |   79% |
+----------------------+-------+-------+-------+-------+-------+-------+

The above shows that for both workloads (each isolated with cgroups) as
well as the general system state after boot, the amount of text backed by
4K and 16K folios reduces and the amount backed by 64K folios increases
significantly. And the amount of text that is contpte-mapped
significantly increases (see last row).

And this is reflected in performance improvement. "(I)" indicates a
statistically significant improvement. Note TPS and Reqs/sec are rates so
bigger is better, ms is time so smaller is better:

+-------------+-------------------------------------------+------------+
| Benchmark   | Result Class                              | Improvement|
+=============+===========================================+============+
| pts/pgbench | Scale: 1 Clients: 1 RO (TPS)              |  (I) 3.47% |
|             | Scale: 1 Clients: 1 RO - Latency (ms)     |     -2.88% |
|             | Scale: 1 Clients: 250 RO (TPS)            |  (I) 5.02% |
|             | Scale: 1 Clients: 250 RO - Latency (ms)   | (I) -4.79% |
|             | Scale: 1 Clients: 1000 RO (TPS)           |  (I) 6.16% |
|             | Scale: 1 Clients: 1000 RO - Latency (ms)  | (I) -5.82% |
|             | Scale: 100 Clients: 1 RO (TPS)            |      2.51% |
|             | Scale: 100 Clients: 1 RO - Latency (ms)   |     -3.51% |
|             | Scale: 100 Clients: 250 RO (TPS)          |  (I) 4.75% |
|             | Scale: 100 Clients: 250 RO - Latency (ms) | (I) -4.44% |
|             | Scale: 100 Clients: 1000 RO (TPS)         |  (I) 6.34% |
|             | Scale: 100 Clients: 1000 RO - Latency (ms)| (I) -5.95% |
+-------------+-------------------------------------------+------------+
| pts/redis   | Test: GET Connections: 50 (Reqs/sec)      |  (I) 3.20% |
|             | Test: GET Connections: 1000 (Reqs/sec)    |  (I) 2.55% |
|             | Test: LPOP Connections: 50 (Reqs/sec)     |  (I) 4.59% |
|             | Test: LPOP Connections: 1000 (Reqs/sec)   |  (I) 4.81% |
|             | Test: LPUSH Connections: 50 (Reqs/sec)    |  (I) 5.31% |
|             | Test: LPUSH Connections: 1000 (Reqs/sec)  |  (I) 4.36% |
|             | Test: SADD Connections: 50 (Reqs/sec)     |  (I) 2.64% |
|             | Test: SADD Connections: 1000 (Reqs/sec)   |  (I) 4.15% |
|             | Test: SET Connections: 50 (Reqs/sec)      |  (I) 3.11% |
|             | Test: SET Connections: 1000 (Reqs/sec)    |  (I) 3.36% |
+-------------+-------------------------------------------+------------+

[ryan.roberts@arm.com: fix use-after-free]
Link: https://lkml.kernel.org/r/ea7f9da7-9a9f-4b85-9d0a-35b320f5ed25@arm.com
[ryan.roberts@arm.com: use the vma_pages() helper instead of open-coding]
Link: https://lkml.kernel.org/r/0e0f674b-3b7e-494f-ae7a-fc9dbb98dad4@arm.com
Link: https://lkml.kernel.org/r/20250609092729.274960-6-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Will Deacon <will@kernel.org>
Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Ryan Roberts and committed by Andrew Morton
38b0ece6 c4602f9f

+58 -9
+8
arch/arm64/include/asm/pgtable.h
···
  */
 #define arch_wants_old_prefaulted_pte	cpu_has_hw_af
 
+/*
+ * Request exec memory is read into pagecache in at least 64K folios. This size
+ * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB
+ * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base
+ * pages are in use.
+ */
+#define exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
+
 static inline bool pud_sect_supported(void)
 {
 	return PAGE_SIZE == SZ_4K;
+11
include/linux/pgtable.h
···
 }
 #endif
 
+#ifndef exec_folio_order
+/*
+ * Returns preferred minimum folio order for executable file-backed memory. Must
+ * be in range [0, PMD_ORDER). Default to order-0.
+ */
+static inline unsigned int exec_folio_order(void)
+{
+	return 0;
+}
+#endif
+
 #ifndef arch_check_zapped_pte
 static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
 					 pte_t pte)
+39 -9
mm/filemap.c
···
 	}
 #endif
 
-	/* If we don't want any read-ahead, don't bother */
-	if (vm_flags & VM_RAND_READ)
+	/*
+	 * If we don't want any read-ahead, don't bother. VM_EXEC case below is
+	 * already intended for random access.
+	 */
+	if ((vm_flags & (VM_RAND_READ | VM_EXEC)) == VM_RAND_READ)
 		return fpin;
 	if (!ra->ra_pages)
 		return fpin;
···
 	if (mmap_miss > MMAP_LOTSAMISS)
 		return fpin;
 
-	/*
-	 * mmap read-around
-	 */
+	if (vm_flags & VM_EXEC) {
+		/*
+		 * Allow arch to request a preferred minimum folio order for
+		 * executable memory. This can often be beneficial to
+		 * performance if (e.g.) arm64 can contpte-map the folio.
+		 * Executable memory rarely benefits from readahead, due to its
+		 * random access nature, so set async_size to 0.
+		 *
+		 * Limit to the boundaries of the VMA to avoid reading in any
+		 * pad that might exist between sections, which would be a waste
+		 * of memory.
+		 */
+		struct vm_area_struct *vma = vmf->vma;
+		unsigned long start = vma->vm_pgoff;
+		unsigned long end = start + vma_pages(vma);
+		unsigned long ra_end;
+
+		ra->order = exec_folio_order();
+		ra->start = round_down(vmf->pgoff, 1UL << ra->order);
+		ra->start = max(ra->start, start);
+		ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
+		ra_end = min(ra_end, end);
+		ra->size = ra_end - ra->start;
+		ra->async_size = 0;
+	} else {
+		/*
+		 * mmap read-around
+		 */
+		ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
+		ra->size = ra->ra_pages;
+		ra->async_size = ra->ra_pages / 4;
+		ra->order = 0;
+	}
+
 	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
-	ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
-	ra->size = ra->ra_pages;
-	ra->async_size = ra->ra_pages / 4;
-	ra->order = 0;
 	ractl._index = ra->start;
 	page_cache_ra_order(&ractl, ra);
 	return fpin;