Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
mm: memcontrol: prepare for reparenting non-hierarchical stats

To resolve the dying memcg issue, we need to reparent the LRU folios of a
child memcg to its parent memcg. This can cause problems for
non-hierarchical stats.

As Yosry Ahmed pointed out:

In short, if memory is charged to a dying cgroup at the time of
reparenting, when the memory gets uncharged the stats updates will occur
at the parent. This will update both hierarchical and non-hierarchical
stats of the parent, which would corrupt the parent's non-hierarchical
stats (because those counters were never incremented when the memory was
charged).

There are currently two types of non-hierarchical stats, both used only
when CONFIG_MEMCG_V1 is enabled:

a. memcg->vmstats->state_local[i]
b. pn->lruvec_stats->state_local[i]

To ensure that these non-hierarchical stats work properly, we need to
reparent these non-hierarchical stats after reparenting LRU folios. To
this end, this commit makes the following preparations:

1. implement reparent_state_local() to reparent non-hierarchical stats
2. make css_killed_work_fn() run as RCU work, and implement
   get_non_dying_memcg_start() and get_non_dying_memcg_end() to avoid
   races between mod_memcg_state()/mod_memcg_lruvec_state() and
   reparent_state_local()

Link: https://lore.kernel.org/e862995c45a7101a541284b6ebee5e5c32c89066.1772711148.git.zhengqi.arch@bytedance.com
Co-developed-by: Yosry Ahmed <yosry@kernel.org>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Allen Pais <apais@linux.microsoft.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Chen Ridong <chenridong@huawei.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Imran Khan <imran.f.khan@oracle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Qi Zheng, committed by Andrew Morton
commit 8285917d (parent 5371e350)

Total: +125 -4
kernel/cgroup/cgroup.c (+5 -4)

  */
 static void css_killed_work_fn(struct work_struct *work)
 {
-	struct cgroup_subsys_state *css =
-		container_of(work, struct cgroup_subsys_state, destroy_work);
+	struct cgroup_subsys_state *css;
+
+	css = container_of(to_rcu_work(work), struct cgroup_subsys_state, destroy_rwork);
 
 	cgroup_lock();
···
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
 	if (atomic_dec_and_test(&css->online_cnt)) {
-		INIT_WORK(&css->destroy_work, css_killed_work_fn);
-		queue_work(cgroup_offline_wq, &css->destroy_work);
+		INIT_RCU_WORK(&css->destroy_rwork, css_killed_work_fn);
+		queue_rcu_work(cgroup_offline_wq, &css->destroy_rwork);
 	}
 }
mm/memcontrol-v1.c (+16)

 	PGMAJFAULT,
 };
 
+void reparent_memcg1_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++)
+		reparent_memcg_state_local(memcg, parent, memcg1_stats[i]);
+}
+
+void reparent_memcg1_lruvec_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+	int i;
+
+	for (i = 0; i < NR_LRU_LISTS; i++)
+		reparent_memcg_lruvec_state_local(memcg, parent, i);
+}
+
 void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 {
 	unsigned long memory, memsw;
mm/memcontrol-v1.h (+7)

 		unsigned long nr_memory, int nid);
 
 void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
+void reparent_memcg1_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent);
+void reparent_memcg1_lruvec_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent);
+
+void reparent_memcg_state_local(struct mem_cgroup *memcg,
+				struct mem_cgroup *parent, int idx);
+void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg,
+				       struct mem_cgroup *parent, int idx);
 
 void memcg1_account_kmem(struct mem_cgroup *memcg, int nr_pages);
 static inline bool memcg1_tcpmem_active(struct mem_cgroup *memcg)
mm/memcontrol.c (+97)

 	return objcg;
 }
 
+#ifdef CONFIG_MEMCG_V1
+static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force);
+
+static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return;
+
+	/*
+	 * Reparent stats exposed non-hierarchically. Flush @memcg's stats first
+	 * to read its stats accurately, and conservatively flush @parent's
+	 * stats after reparenting to avoid hiding a potentially large stat
+	 * update (e.g. from callers of mem_cgroup_flush_stats_ratelimited()).
+	 */
+	__mem_cgroup_flush_stats(memcg, true);
+
+	/* The following counts are all non-hierarchical and need to be reparented. */
+	reparent_memcg1_state_local(memcg, parent);
+	reparent_memcg1_lruvec_state_local(memcg, parent);
+
+	__mem_cgroup_flush_stats(parent, true);
+}
+#else
+static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+}
+#endif
+
 static inline void reparent_locks(struct mem_cgroup *memcg, struct mem_cgroup *parent)
 {
 	spin_lock_irq(&objcg_lock);
···
 	return x;
 }
 
+#ifdef CONFIG_MEMCG_V1
+static void __mod_memcg_lruvec_state(struct mem_cgroup_per_node *pn,
+				     enum node_stat_item idx, int val);
+
+void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg,
+				       struct mem_cgroup *parent, int idx)
+{
+	int nid;
+
+	for_each_node(nid) {
+		struct lruvec *child_lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+		struct lruvec *parent_lruvec = mem_cgroup_lruvec(parent, NODE_DATA(nid));
+		unsigned long value = lruvec_page_state_local(child_lruvec, idx);
+		struct mem_cgroup_per_node *child_pn, *parent_pn;
+
+		child_pn = container_of(child_lruvec, struct mem_cgroup_per_node, lruvec);
+		parent_pn = container_of(parent_lruvec, struct mem_cgroup_per_node, lruvec);
+
+		__mod_memcg_lruvec_state(child_pn, idx, -value);
+		__mod_memcg_lruvec_state(parent_pn, idx, value);
+	}
+}
+#endif
+
 /* Subset of vm_event_item to report for memcg event stats */
 static const unsigned int memcg_vm_event_stat[] = {
 #ifdef CONFIG_MEMCG_V1
···
 	return max(val * unit / PAGE_SIZE, 1UL);
 }
 
+#ifdef CONFIG_MEMCG_V1
+/*
+ * Used in mod_memcg_state() and mod_memcg_lruvec_state() to avoid race with
+ * reparenting of non-hierarchical state_locals.
+ */
+static inline struct mem_cgroup *get_non_dying_memcg_start(struct mem_cgroup *memcg)
+{
+	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return memcg;
+
+	rcu_read_lock();
+
+	while (memcg_is_dying(memcg))
+		memcg = parent_mem_cgroup(memcg);
+
+	return memcg;
+}
+
+static inline void get_non_dying_memcg_end(void)
+{
+	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return;
+
+	rcu_read_unlock();
+}
+#else
+static inline struct mem_cgroup *get_non_dying_memcg_start(struct mem_cgroup *memcg)
+{
+	return memcg;
+}
+
+static inline void get_non_dying_memcg_end(void)
+{
+}
+#endif
+
 static void __mod_memcg_state(struct mem_cgroup *memcg,
 			      enum memcg_stat_item idx, int val)
 {
···
 	x = 0;
 #endif
 	return x;
+}
+
+void reparent_memcg_state_local(struct mem_cgroup *memcg,
+				struct mem_cgroup *parent, int idx)
+{
+	unsigned long value = memcg_page_state_local(memcg, idx);
+
+	__mod_memcg_state(memcg, idx, -value);
+	__mod_memcg_state(parent, idx, value);
 }
 #endif