Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: support batched checking of the young flag for MGLRU

Use the batched helper test_and_clear_young_ptes_notify() to check and
clear the young flag, which improves performance when reclaiming large
folios with MGLRU enabled.
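
As an illustration only, the idea behind a batched test-and-clear can be sketched in userspace C. The names sim_pte and sim_test_and_clear_young_ptes() are hypothetical stand-ins, not kernel APIs, and the MMU-notifier side of the real helper is left out entirely:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a PTE: only the bits this sketch needs. */
struct sim_pte {
	bool present;
	bool young;
};

/*
 * Clear the young bit on nr consecutive entries and report whether any
 * of them was young. One call takes the place of nr single-entry calls.
 */
static bool sim_test_and_clear_young_ptes(struct sim_pte *ptep,
					  unsigned int nr)
{
	bool young = false;

	assert(nr >= 1);
	for (unsigned int i = 0; i < nr; i++) {
		young |= ptep[i].young;
		ptep[i].young = false;
	}
	return young;
}
```

The caller learns only that at least one entry in the batch was young, which is the per-folio granularity that reclaim needs.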

We can also batch the check of the young and dirty flags when MGLRU
walks the mm's page tables to update the folios' generation counter.
Since MGLRU also checks the PTE dirty bit, use folio_pte_batch_flags()
with FPB_MERGE_YOUNG_DIRTY set to detect batches of PTEs for a large
folio.
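
The batch detection and the resulting loop stride can be sketched as follows. This is a simplified userspace model under stated assumptions: sim_pte, sim_folio, sim_pte_batch_merge() and sim_walk() are hypothetical names, the batch test only checks PFN contiguity within one folio, and the real folio_pte_batch_flags() has a different signature and compares more PTE state:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a PTE. */
struct sim_pte {
	bool present;
	bool young;
	bool dirty;
	unsigned long pfn;
};

/* Hypothetical folio covering pfns [first_pfn, first_pfn + nr_pages). */
struct sim_folio {
	unsigned long first_pfn;
	unsigned int nr_pages;
};

/*
 * Count how many consecutive entries starting at ptep map consecutive
 * pages of the folio, up to max_nr. *first is a copy of ptep[0]; the
 * young and dirty bits of the whole batch are OR-ed into it.
 */
static unsigned int sim_pte_batch_merge(const struct sim_folio *folio,
					const struct sim_pte *ptep,
					struct sim_pte *first,
					unsigned int max_nr)
{
	unsigned long folio_end = folio->first_pfn + folio->nr_pages;
	unsigned int nr = 1;

	assert(max_nr >= 1);
	while (nr < max_nr) {
		const struct sim_pte *next = &ptep[nr];

		if (!next->present || next->pfn != first->pfn + nr ||
		    next->pfn >= folio_end)
			break;
		first->young |= next->young;
		first->dirty |= next->dirty;
		nr++;
	}
	return nr;
}

/*
 * Walk n entries with a stride of nr, resetting nr to 1 at the top of
 * each iteration so that "continue" always advances by one entry.
 * Returns the number of loop iterations taken.
 */
static unsigned int sim_walk(const struct sim_folio *folio,
			     const struct sim_pte *ptes, unsigned int n,
			     bool *dirty)
{
	unsigned int steps = 0, nr;

	for (unsigned int i = 0; i < n; i += nr) {
		struct sim_pte ptent = ptes[i];

		nr = 1;
		steps++;
		if (!ptent.present)
			continue;
		if (ptent.pfn >= folio->first_pfn &&
		    ptent.pfn < folio->first_pfn + folio->nr_pages)
			nr = sim_pte_batch_merge(folio, &ptes[i], &ptent,
						 n - i);
		if (ptent.dirty)
			*dirty = true;
	}
	return steps;
}
```

Because the flags are merged into the local copy, a dirty bit set on any PTE of the batch is still seen even though only the first PTE is inspected afterwards.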

ptep_test_and_clear_young_notify() then has no remaining users, so
remove it.

Note that we also update the 'young' counter and the
'mm_stats[MM_LEAF_YOUNG]' counter with the batched count in
lru_gen_look_around() and walk_pte_range(). The batched operations may
inflate these two counters, because not every PTE of a large folio may
have been accessed. (Tracking how many PTEs have been accessed within a
large folio is not very meaningful anyway, since the mm core tracks
access/dirty state per folio, not per page.) The impact analysis is as
follows:

1. The 'mm_stats[MM_LEAF_YOUNG]' counter has no functional impact and
is mainly for debugging.

2. The 'young' counter is used by suitable_to_scan() to decide whether
to place the current PMD entry into the bloom filters (so that next time
we can check whether it has been accessed again). An inflated count may
therefore set the hash bit in the bloom filters for a PMD entry that
hasn't seen much access. However, bloom filters inherently allow some
error, so this effect appears negligible.
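
To make the inflation described above concrete, here is a small userspace sketch comparing per-PTE and batched accounting. All names are hypothetical, and dense_enough() is an assumed ratio test chosen for illustration; it is not claimed to match what suitable_to_scan() actually computes:

```c
#include <assert.h>
#include <stdbool.h>

/* Per-PTE accounting: count only entries whose young bit was set. */
static unsigned int count_young_per_pte(const bool *young, unsigned int nr)
{
	unsigned int n = 0;

	for (unsigned int i = 0; i < nr; i++)
		n += young[i] ? 1 : 0;
	return n;
}

/* Batched accounting: one young entry makes the whole batch count. */
static unsigned int count_young_batched(const bool *young, unsigned int nr)
{
	for (unsigned int i = 0; i < nr; i++)
		if (young[i])
			return nr;
	return 0;
}

/*
 * Hypothetical density test: a range counts as worth revisiting when at
 * least one in every per_line scanned entries was young.
 */
static bool dense_enough(unsigned int total, unsigned int young,
			 unsigned int per_line)
{
	assert(per_line > 0);
	return young * per_line >= total;
}
```

With a 16-PTE batch in which a single PTE was accessed, per-PTE accounting reports 1 while batched accounting reports 16, so a range of 64 scanned entries can cross an assumed 1-in-8 threshold that it would otherwise miss. That is the kind of extra bloom filter insertion the commit message judges to be negligible.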

Link: https://lkml.kernel.org/r/378f4acf7d07410aa7c2e4b49d56bb165918eb34.1772778858.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Xu <weixugc@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Baolin Wang and committed by Andrew Morton
56e5b60b 6d7237dd

+49 -33
include/linux/mmzone.h (+3 -2)
···

 void lru_gen_init_pgdat(struct pglist_data *pgdat);
 void lru_gen_init_lruvec(struct lruvec *lruvec);
-bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr);

 void lru_gen_init_memcg(struct mem_cgroup *memcg);
 void lru_gen_exit_memcg(struct mem_cgroup *memcg);
···
 {
 }

-static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw,
+				       unsigned int nr)
 {
 	return false;
 }
mm/internal.h (-6)
···

 #endif /* CONFIG_MMU_NOTIFIER */

-static inline int ptep_test_and_clear_young_notify(struct vm_area_struct *vma,
-		unsigned long addr, pte_t *ptep)
-{
-	return test_and_clear_young_ptes_notify(vma, addr, ptep, 1);
-}
-
 #endif /* __MM_INTERNAL_H */
mm/rmap.c (+14 -14)
···
 			return false;
 		}

+		if (pvmw.pte && folio_test_large(folio)) {
+			const unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
+			const unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;
+			pte_t pteval = ptep_get(pvmw.pte);
+
+			nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr);
+		}
+
 		if (lru_gen_enabled() && pvmw.pte) {
-			if (lru_gen_look_around(&pvmw))
+			if (lru_gen_look_around(&pvmw, nr))
 				referenced++;
 		} else if (pvmw.pte) {
-			if (folio_test_large(folio)) {
-				unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
-				unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;
-				pte_t pteval = ptep_get(pvmw.pte);
-
-				nr = folio_pte_batch(folio, pvmw.pte,
-						     pteval, max_nr);
-			}
-
-			ptes += nr;
 			if (clear_flush_young_ptes_notify(vma, address, pvmw.pte, nr))
 				referenced++;
-			/* Skip the batched PTEs */
-			pvmw.pte += nr - 1;
-			pvmw.address += (nr - 1) * PAGE_SIZE;
 		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
 			if (pmdp_clear_flush_young_notify(vma, address,
 						pvmw.pmd))
···
 			WARN_ON_ONCE(1);
 		}

+		ptes += nr;
 		pra->mapcount -= nr;
 		/*
 		 * If we are sure that we batched the entire folio,
···
 			page_vma_mapped_walk_done(&pvmw);
 			break;
 		}
+
+		/* Skip the batched PTEs */
+		pvmw.pte += nr - 1;
+		pvmw.address += (nr - 1) * PAGE_SIZE;
 	}

 	if (referenced)
mm/vmscan.c (+32 -11)
···
 	struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
 	DEFINE_MAX_SEQ(walk->lruvec);
 	int gen = lru_gen_from_seq(max_seq);
+	unsigned int nr;
 	pmd_t pmdval;

 	pte = pte_offset_map_rw_nolock(args->mm, pmd, start & PMD_MASK, &pmdval, &ptl);
···

 	lazy_mmu_mode_enable();
 restart:
-	for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
+	for (i = pte_index(start), addr = start; addr != end; i += nr, addr += nr * PAGE_SIZE) {
 		unsigned long pfn;
 		struct folio *folio;
-		pte_t ptent = ptep_get(pte + i);
+		pte_t *cur_pte = pte + i;
+		pte_t ptent = ptep_get(cur_pte);

+		nr = 1;
 		total++;
 		walk->mm_stats[MM_LEAF_TOTAL]++;

···
 		if (!folio)
 			continue;

-		if (!ptep_test_and_clear_young_notify(args->vma, addr, pte + i))
+		if (folio_test_large(folio)) {
+			const unsigned int max_nr = (end - addr) >> PAGE_SHIFT;
+
+			nr = folio_pte_batch_flags(folio, NULL, cur_pte, &ptent,
+						   max_nr, FPB_MERGE_YOUNG_DIRTY);
+			total += nr - 1;
+			walk->mm_stats[MM_LEAF_TOTAL] += nr - 1;
+		}
+
+		if (!test_and_clear_young_ptes_notify(args->vma, addr, cur_pte, nr))
 			continue;

 		if (last != folio) {
···
 		if (pte_dirty(ptent))
 			dirty = true;

-		young++;
-		walk->mm_stats[MM_LEAF_YOUNG]++;
+		young += nr;
+		walk->mm_stats[MM_LEAF_YOUNG] += nr;
 	}

 	walk_update_folio(walk, last, gen, dirty);
···
  * the PTE table to the Bloom filter. This forms a feedback loop between the
  * eviction and the aging.
  */
-bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr)
 {
 	int i;
 	bool dirty;
···
 	lockdep_assert_held(pvmw->ptl);
 	VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);

-	if (!ptep_test_and_clear_young_notify(vma, addr, pte))
+	if (!test_and_clear_young_ptes_notify(vma, addr, pte, nr))
 		return false;

 	if (spin_is_contended(pvmw->ptl))
···

 	pte -= (addr - start) / PAGE_SIZE;

-	for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
+	for (i = 0, addr = start; addr != end;
+	     i += nr, pte += nr, addr += nr * PAGE_SIZE) {
 		unsigned long pfn;
-		pte_t ptent = ptep_get(pte + i);
+		pte_t ptent = ptep_get(pte);

+		nr = 1;
 		pfn = get_pte_pfn(ptent, vma, addr, pgdat);
 		if (pfn == -1)
 			continue;
···
 		if (!folio)
 			continue;

-		if (!ptep_test_and_clear_young_notify(vma, addr, pte + i))
+		if (folio_test_large(folio)) {
+			const unsigned int max_nr = (end - addr) >> PAGE_SHIFT;
+
+			nr = folio_pte_batch_flags(folio, NULL, pte, &ptent,
+						   max_nr, FPB_MERGE_YOUNG_DIRTY);
+		}
+
+		if (!test_and_clear_young_ptes_notify(vma, addr, pte, nr))
 			continue;

 		if (last != folio) {
···
 		if (pte_dirty(ptent))
 			dirty = true;

-		young++;
+		young += nr;
 	}

 	walk_update_folio(walk, last, gen, dirty);