Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: vmscan: prepare for reparenting MGLRU folios

As with traditional LRU folios, solving the dying memcg problem requires
reparenting MGLRU folios to the parent memcg when a memcg goes offline.

However, there are the following challenges:

1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, and
the number of generations in the parent and child memcg may differ,
so we cannot simply transfer MGLRU folios in the child memcg to the
parent memcg as we did for traditional LRU folios.
2. The generation information is stored in folio->flags, but we cannot
traverse these folios while holding the lru lock, since doing so may
cause a soft lockup.
3. In walk_update_folio(), the generation of a folio and the
corresponding lru size may be updated, but the folio is not immediately
moved to the corresponding lru list. Therefore, an lru list may hold
folios of different generations.
4. In lru_gen_del_folio(), the generation to which the folio belongs is
found based on the generation information in folio->flags, and the
corresponding LRU size will be updated. Therefore, we need to update
the lru size correctly during reparenting, otherwise the lru size may
be updated incorrectly in lru_gen_del_folio().
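
Challenge 1 can be made concrete with a small userspace model (not kernel
code). It assumes MIN_NR_GENS is 2 and MAX_NR_GENS is 4, that a sequence
number maps to a generation slot as seq % MAX_NR_GENS, and that the number
of generations is max_seq - min_seq + 1. struct model_lruvec and all
model_* helpers are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_NR_GENS 2
#define MAX_NR_GENS 4

/* Hypothetical stand-in for the sequence window of one lruvec and type. */
struct model_lruvec {
	unsigned long max_seq;	/* youngest generation */
	unsigned long min_seq;	/* oldest generation */
};

/* Slot index of a sequence number (assumed to mirror lru_gen_from_seq()). */
static int model_gen_from_seq(unsigned long seq)
{
	return (int)(seq % MAX_NR_GENS);
}

/* Number of live generations (assumed to mirror get_nr_gens()). */
static int model_nr_gens(const struct model_lruvec *lv)
{
	return (int)(lv->max_seq - lv->min_seq + 1);
}

/* True if slot 'gen' is backed by a live generation in 'lv'. */
static bool model_gen_is_live(const struct model_lruvec *lv, int gen)
{
	unsigned long seq;

	for (seq = lv->min_seq; seq <= lv->max_seq; seq++)
		if (model_gen_from_seq(seq) == gen)
			return true;
	return false;
}

/* True if every live slot of 'child' is also live in 'parent'. */
static bool model_can_splice_by_slot(const struct model_lruvec *child,
				     const struct model_lruvec *parent)
{
	unsigned long seq;

	assert(model_nr_gens(child) >= MIN_NR_GENS);
	assert(model_nr_gens(child) <= MAX_NR_GENS);

	for (seq = child->min_seq; seq <= child->max_seq; seq++)
		if (!model_gen_is_live(parent, model_gen_from_seq(seq)))
			return false;
	return true;
}
```

In this model a child with four generations cannot be mapped slot by slot
onto a parent with two, but it can once the parent also has MAX_NR_GENS
generations, because every slot is then live.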

This patch therefore chooses a compromise: during reparenting, splice each
lru list in the child memcg onto the lru list of the same generation in the
parent memcg. To ensure that the parent memcg has every generation the
child may have, the number of generations in the parent memcg is first
increased to MAX_NR_GENS.
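
The splice step can be sketched in userspace as follows. This is a minimal
model, not the kernel implementation: struct list_node, struct model_lrugen
and the helper names are hypothetical, the list is a plain circular doubly
linked list, and there is one list per generation only (the real code also
indexes by type and zone).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NR_GENS 4

/* Hypothetical circular doubly linked list, similar in spirit to list_head. */
struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *head)
{
	head->prev = head->next = head;
}

static int list_is_empty(const struct list_node *head)
{
	return head->next == head;
}

static void list_add_tail_node(struct list_node *node, struct list_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Move all of 'src' to the tail of 'dst', then reinitialise 'src'. */
static void list_splice_tail_and_init(struct list_node *src,
				      struct list_node *dst)
{
	if (list_is_empty(src))
		return;

	src->next->prev = dst->prev;	/* first src node follows old dst tail */
	dst->prev->next = src->next;
	src->prev->next = dst;		/* last src node becomes the new tail */
	dst->prev = src->prev;
	list_init(src);
}

static size_t list_count(const struct list_node *head)
{
	const struct list_node *pos;
	size_t n = 0;

	for (pos = head->next; pos != head; pos = pos->next)
		n++;
	return n;
}

/* Hypothetical per-memcg state: one folio list per generation slot. */
struct model_lrugen {
	struct list_node folios[MAX_NR_GENS];
};

static void model_lrugen_init(struct model_lrugen *lg)
{
	int gen;

	for (gen = 0; gen < MAX_NR_GENS; gen++)
		list_init(&lg->folios[gen]);
}

/* A fixed number of pointer updates per slot, independent of folio count. */
static void model_reparent(struct model_lrugen *child,
			   struct model_lrugen *parent)
{
	int gen;

	for (gen = 0; gen < MAX_NR_GENS; gen++)
		list_splice_tail_and_init(&child->folios[gen],
					  &parent->folios[gen]);
}
```

Because only list heads are touched, no folio is visited individually,
which is what avoids the long lock hold times described in challenge 2.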

Of course, the same generation has different meanings in the parent and
child memcg, so this blurs the hot and cold information of the reparented
folios. Other than that, the method is simple, the lru size stays correct,
and there is no need to handle concurrency issues (such as with
lru_gen_del_folio()).
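
The claim that the lru size stays correct can be checked with a small
counter model. This is a sketch under simplifying assumptions, not kernel
code: struct model_folio, struct model_lruvec and the model_* helpers are
hypothetical, and sizes are tracked per generation only.

```c
#include <assert.h>

#define MAX_NR_GENS 4

/* Hypothetical folio: its generation is recorded in the folio itself. */
struct model_folio {
	int gen;
	long nr_pages;
};

/* Hypothetical lruvec: one size counter per generation slot. */
struct model_lruvec {
	long nr_pages[MAX_NR_GENS];
};

static void model_add_folio(struct model_lruvec *lv,
			    const struct model_folio *folio)
{
	lv->nr_pages[folio->gen] += folio->nr_pages;
}

/* Like the deletion path, look up the slot from the folio itself. */
static void model_del_folio(struct model_lruvec *lv,
			    const struct model_folio *folio)
{
	lv->nr_pages[folio->gen] -= folio->nr_pages;
	assert(lv->nr_pages[folio->gen] >= 0);	/* a counter must not underflow */
}

/* Move the child's counters to the same slots in the parent. */
static void model_reparent_sizes(struct model_lruvec *child,
				 struct model_lruvec *parent)
{
	int gen;

	for (gen = 0; gen < MAX_NR_GENS; gen++) {
		parent->nr_pages[gen] += child->nr_pages[gen];
		child->nr_pages[gen] = 0;
	}
}
```

Since a folio keeps its generation across reparenting, a later deletion
decrements exactly the parent counter that the reparenting incremented.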

To prepare for the above work, this commit implements the helper functions
that will be used during reparenting.

[zhengqi.arch@bytedance.com: use list_splice_tail_init() to reparent child folios]
Link: https://lore.kernel.org/20260324114937.28569-1-qi.zheng@linux.dev
Link: https://lore.kernel.org/e75050354cdbc42221a04f7cf133292b61105548.1772711148.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Suggested-by: Harry Yoo <harry.yoo@oracle.com>
Suggested-by: Imran Khan <imran.f.khan@oracle.com>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Cc: Allen Pais <apais@linux.microsoft.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Chen Ridong <chenridong@huawei.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Wei Xu <weixugc@google.com>
Cc: Yosry Ahmed <yosry@kernel.org>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Qi Zheng and committed by Andrew Morton
f3046526 07a6e9a2

+159 lines in total

include/linux/mmzone.h (+17)
···
 void lru_gen_offline_memcg(struct mem_cgroup *memcg);
 void lru_gen_release_memcg(struct mem_cgroup *memcg);
 void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid);
+void max_lru_gen_memcg(struct mem_cgroup *memcg, int nid);
+bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg, int nid);
+void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid);
 
 #else /* !CONFIG_LRU_GEN */
 
···
 }
 
 static inline void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
+{
+}
+
+static inline void max_lru_gen_memcg(struct mem_cgroup *memcg, int nid)
+{
+}
+
+static inline bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg, int nid)
+{
+	return true;
+}
+
+static inline
+void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid)
 {
 }
 
mm/vmscan.c (+142)
···
 		lru_gen_rotate_memcg(lruvec, MEMCG_LRU_HEAD);
 }
 
+bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg, int nid)
+{
+	struct lruvec *lruvec = get_lruvec(memcg, nid);
+	int type;
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
+			return false;
+	}
+
+	return true;
+}
+
+static void try_to_inc_max_seq_nowalk(struct mem_cgroup *memcg,
+				      struct lruvec *lruvec)
+{
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+	struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
+	int swappiness = mem_cgroup_swappiness(memcg);
+	DEFINE_MAX_SEQ(lruvec);
+	bool success = false;
+
+	/*
+	 * We are not iterating the mm_list here, updating mm_state->seq is just
+	 * to make mm walkers work properly.
+	 */
+	if (mm_state) {
+		spin_lock(&mm_list->lock);
+		VM_WARN_ON_ONCE(mm_state->seq + 1 < max_seq);
+		if (max_seq > mm_state->seq) {
+			WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+			success = true;
+		}
+		spin_unlock(&mm_list->lock);
+	} else {
+		success = true;
+	}
+
+	if (success)
+		inc_max_seq(lruvec, max_seq, swappiness);
+}
+
+/*
+ * We need to ensure that the folios of child memcg can be reparented to the
+ * same gen of the parent memcg, so the gens of the parent memcg needed be
+ * incremented to the MAX_NR_GENS before reparenting.
+ */
+void max_lru_gen_memcg(struct mem_cgroup *memcg, int nid)
+{
+	struct lruvec *lruvec = get_lruvec(memcg, nid);
+	int type;
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		while (get_nr_gens(lruvec, type) < MAX_NR_GENS) {
+			try_to_inc_max_seq_nowalk(memcg, lruvec);
+			cond_resched();
+		}
+	}
+}
+
+/*
+ * Compared to traditional LRU, MGLRU faces the following challenges:
+ *
+ * 1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, the
+ *    number of generations of the parent and child memcg may be different,
+ *    so we cannot simply transfer MGLRU folios in the child memcg to the
+ *    parent memcg as we did for traditional LRU folios.
+ * 2. The generation information is stored in folio->flags, but we cannot
+ *    traverse these folios while holding the lru lock, otherwise it may
+ *    cause softlockup.
+ * 3. In walk_update_folio(), the gen of folio and corresponding lru size
+ *    may be updated, but the folio is not immediately moved to the
+ *    corresponding lru list. Therefore, there may be folios of different
+ *    generations on an LRU list.
+ * 4. In lru_gen_del_folio(), the generation to which the folio belongs is
+ *    found based on the generation information in folio->flags, and the
+ *    corresponding LRU size will be updated. Therefore, we need to update
+ *    the lru size correctly during reparenting, otherwise the lru size may
+ *    be updated incorrectly in lru_gen_del_folio().
+ *
+ * Finally, we choose a compromise method, which is to splice the lru list in
+ * the child memcg to the lru list of the same generation in the parent memcg
+ * during reparenting.
+ *
+ * The same generation has different meanings in the parent and child memcg,
+ * so this compromise method will cause the LRU inversion problem. But as the
+ * system runs, this problem will be fixed automatically.
+ */
+static void __lru_gen_reparent_memcg(struct lruvec *child_lruvec, struct lruvec *parent_lruvec,
+				     int zone, int type)
+{
+	struct lru_gen_folio *child_lrugen, *parent_lrugen;
+	enum lru_list lru = type * LRU_INACTIVE_FILE;
+	int i;
+
+	child_lrugen = &child_lruvec->lrugen;
+	parent_lrugen = &parent_lruvec->lrugen;
+
+	for (i = 0; i < get_nr_gens(child_lruvec, type); i++) {
+		int gen = lru_gen_from_seq(child_lrugen->max_seq - i);
+		long nr_pages = child_lrugen->nr_pages[gen][type][zone];
+		int child_lru_active = lru_gen_is_active(child_lruvec, gen) ? LRU_ACTIVE : 0;
+		int parent_lru_active = lru_gen_is_active(parent_lruvec, gen) ? LRU_ACTIVE : 0;
+
+		/* Assuming that child pages are colder than parent pages */
+		list_splice_tail_init(&child_lrugen->folios[gen][type][zone],
+				      &parent_lrugen->folios[gen][type][zone]);
+
+		WRITE_ONCE(child_lrugen->nr_pages[gen][type][zone], 0);
+		WRITE_ONCE(parent_lrugen->nr_pages[gen][type][zone],
+			   parent_lrugen->nr_pages[gen][type][zone] + nr_pages);
+
+		if (lru_gen_is_active(child_lruvec, gen) != lru_gen_is_active(parent_lruvec, gen)) {
+			__update_lru_size(child_lruvec, lru + child_lru_active, zone, -nr_pages);
+			__update_lru_size(parent_lruvec, lru + parent_lru_active, zone, nr_pages);
+		}
+	}
+}
+
+void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid)
+{
+	struct lruvec *child_lruvec, *parent_lruvec;
+	int type, zid;
+	struct zone *zone;
+	enum lru_list lru;
+
+	child_lruvec = get_lruvec(memcg, nid);
+	parent_lruvec = get_lruvec(parent, nid);
+
+	for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1)
+		for (type = 0; type < ANON_AND_FILE; type++)
+			__lru_gen_reparent_memcg(child_lruvec, parent_lruvec, zid, type);
+
+	for_each_lru(lru) {
+		for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1) {
+			unsigned long size = mem_cgroup_get_zone_lru_size(child_lruvec, lru, zid);
+
+			mem_cgroup_update_lru_size(parent_lruvec, lru, zid, size);
+		}
+	}
+}
+
 #endif /* CONFIG_MEMCG */
 
 /******************************************************************************
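
This commit adds only the helpers; the caller is not part of it. The fact
that max_lru_gen_memcg() may reschedule while recheck_lru_gen_max_memcg()
returns a bool suggests a prepare, lock, recheck and retry pattern. The
sketch below is a userspace model of that assumed pattern, not the actual
caller: struct model_lruvec, its interference field and all model_* helpers
are hypothetical, and locking is indicated by comments only.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NR_GENS 4

/* Hypothetical parent state for one node. */
struct model_lruvec {
	int nr_gens;
	int interference;	/* times a concurrent actor drops a generation */
};

/* Modelled on max_lru_gen_memcg(): may sleep, so it runs unlocked. */
static void model_max_gens(struct model_lruvec *parent)
{
	while (parent->nr_gens < MAX_NR_GENS)
		parent->nr_gens++;
}

/* Hypothetical window in which reclaim retires the oldest generation. */
static void model_race_window(struct model_lruvec *parent)
{
	if (parent->interference > 0) {
		parent->interference--;
		parent->nr_gens--;
	}
}

/* Modelled on recheck_lru_gen_max_memcg(). */
static bool model_recheck(const struct model_lruvec *parent)
{
	return parent->nr_gens == MAX_NR_GENS;
}

/* Returns the number of attempts needed before splicing is safe. */
static int model_reparent_with_retry(struct model_lruvec *parent)
{
	int attempts = 0;

	for (;;) {
		attempts++;
		model_max_gens(parent);		/* unlocked preparation */
		model_race_window(parent);	/* another CPU may interfere */
		/* the lru lock would be taken here */
		if (model_recheck(parent))
			break;			/* safe to splice under the lock */
		/* the lru lock would be dropped here before retrying */
	}
	/* the splice and the unlock would follow here */
	return attempts;
}
```

The recheck matters because the preparation happens without the lock, so
the number of generations can change before the splice is performed.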