Merge tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

+8

CREDITS

···
 E: andy@greyhouse.net
 D: Maintenance and contributions to the network interface bonding driver.

+N: Vivek Goyal
+E: vgoyal@redhat.com
+D: KDUMP, KEXEC, and VIRTIO FILE SYSTEM
+
+N: Alexander Graf
+E: graf@amazon.com
+D: Kexec Handover (KHO)
+
 N: Wolfgang Grandegger
 E: wg@grandegger.com
 D: Controller Area Network (device drivers)

+4

Documentation/admin-guide/mm/damon/lru_sort.rst

···
 parameter is set as ``N``. If invalid parameters are found while the
 re-reading, DAMON_LRU_SORT will be disabled.

+Once ``Y`` is written to this parameter, the user must not write to any
+parameters until reading ``commit_inputs`` again returns ``N``. If users
+violate this rule, the kernel may exhibit undefined behavior.
+
 active_mem_bp
 -------------

+4

Documentation/admin-guide/mm/damon/reclaim.rst

···
 parameter is set as ``N``. If invalid parameters are found while the
 re-reading, DAMON_RECLAIM will be disabled.

+Once ``Y`` is written to this parameter, the user must not write to any
+parameters until reading ``commit_inputs`` again returns ``N``. If users
+violate this rule, the kernel may exhibit undefined behavior.
+
 min_age
 -------
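The rule added to both DAMON documents amounts to a small userspace protocol: write ``Y``, then poll ``commit_inputs`` until it reads back ``N`` before touching any other parameter. A minimal sketch of that protocol follows. The helper names, the retry count, and the delay are assumptions made for illustration; on a real system the path would be something like /sys/module/damon_reclaim/parameters/commit_inputs, which the sketch takes as an argument rather than hard-coding.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Write a short string value to a parameter file. Returns 0 on success. */
int write_param(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(val, f) >= 0;
	return (fclose(f) == 0 && ok) ? 0 : -1;
}

/*
 * Poll a commit_inputs-style file until its first character is 'N'.
 * Returns 0 once 'N' is seen, -1 on open failure or after max_tries reads.
 * max_tries and delay_us are hypothetical knobs of this sketch.
 */
int wait_until_committed(const char *path, int max_tries, unsigned int delay_us)
{
	for (int i = 0; i < max_tries; i++) {
		FILE *f = fopen(path, "r");
		int c;

		if (!f)
			return -1;
		c = fgetc(f);
		fclose(f);
		if (c == 'N')
			return 0;	/* parameters may be written again */
		if (i + 1 < max_tries)
			usleep(delay_us);
	}
	return -1;
}
```

A caller would invoke write_param(path, "Y") and then refuse to write any other parameter until wait_until_committed() returns 0.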

+40 -1

Documentation/admin-guide/mm/kho.rst

···
 an early memory reservation, the new kernel will have that memory at the
 same physical address as the old kernel.

+Kexec Metadata
+==============
+
+KHO automatically tracks metadata about the kexec chain, passing information
+about the previous kernel to the next kernel. This feature helps diagnose
+bugs that only reproduce when kexecing from specific kernel versions.
+
+On each KHO kexec, the kernel logs the previous kernel's version and the
+number of kexec reboots since the last cold boot::
+
+    [    0.000000] KHO: exec from: 6.19.0-rc4-next-20260107 (count 1)
+
+The metadata includes:
+
+``previous_release``
+    The kernel version string (from ``uname -r``) of the kernel that
+    initiated the kexec.
+
+``kexec_count``
+    The number of kexec boots since the last cold boot. On cold boot,
+    this counter starts at 0 and increments with each kexec. This helps
+    identify issues that only manifest after multiple consecutive kexec
+    reboots.
+
+Use Cases
+---------
+
+This metadata is particularly useful for debugging kexec transition bugs,
+where a buggy kernel kexecs into a new kernel and the bug manifests only
+in the second kernel. Examples of such bugs include:
+
+- Memory corruption from the previous kernel affecting the new kernel
+- Incorrect hardware state left by the previous kernel
+- Firmware/ACPI state issues that only appear in kexec scenarios
+
+At scale, correlating crashes to the previous kernel version enables
+faster root cause analysis when issues only occur in specific kernel
+transition scenarios.
+
 debugfs Interfaces
 ==================

···
     it finished to interpret their metadata.

 ``/sys/kernel/debug/kho/in/sub_fdts/``
-    Similar to ``kho/out/sub_fdts/``, but contains sub FDT blobs
+    Similar to ``kho/out/sub_fdts/``, but contains sub blobs
     of KHO producers passed from the old kernel.
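For the "at scale" use case described in the new documentation, crash tooling would need to pull the previous release and the kexec count out of the boot log. The following is a minimal sketch of such a parser, assuming the message keeps exactly the format shown in the documentation example. The struct and function names are hypothetical and not part of the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct kho_exec_info {
	char release[65];	/* up to 64 characters plus the terminating NUL */
	unsigned int count;
};

/*
 * Extract the release string and kexec count from one log line.
 * Returns 0 on success, -1 if the line does not carry the message.
 */
int parse_kho_exec_line(const char *line, struct kho_exec_info *out)
{
	const char *p = strstr(line, "KHO: exec from: ");

	if (!p)
		return -1;
	/* %64s stops at whitespace, so the release must not contain spaces */
	if (sscanf(p, "KHO: exec from: %64s (count %u)",
		   out->release, &out->count) != 2)
		return -1;
	return 0;
}
```

Feeding each dmesg line through parse_kho_exec_line() yields a (release, count) pair that can be attached to crash reports.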

+4

Documentation/filesystems/proc.rst

···
 naturally aligned THP pages of any currently enabled size. 1 if true, 0
 otherwise.

+If both the kernel and the CPU support protection keys (pkeys),
+"ProtectionKey" indicates the memory protection key associated with the
+virtual memory area.
+
 "VmFlags" field deserves a separate description. This member represents the
 kernel flags associated with the particular virtual memory area in two letter
 encoded manner. The codes are the following:

+20 -9

MAINTAINERS

···
 KDUMP
 M:	Andrew Morton <akpm@linux-foundation.org>
 M:	Baoquan He <bhe@redhat.com>
-R:	Vivek Goyal <vgoyal@redhat.com>
-R:	Dave Young <dyoung@redhat.com>
+M:	Mike Rapoport <rppt@kernel.org>
+M:	Pasha Tatashin <pasha.tatashin@soleen.com>
+M:	Pratyush Yadav <pratyush@kernel.org>
+R:	Dave Young <ruirui.yang@linux.dev>
 L:	kexec@lists.infradead.org
 S:	Maintained
 W:	http://lse.sourceforge.net/kdump/
···
 KEXEC
 M:	Andrew Morton <akpm@linux-foundation.org>
 M:	Baoquan He <bhe@redhat.com>
+M:	Mike Rapoport <rppt@kernel.org>
+M:	Pasha Tatashin <pasha.tatashin@soleen.com>
+M:	Pratyush Yadav <pratyush@kernel.org>
 L:	kexec@lists.infradead.org
 W:	http://kernel.org/pub/linux/utils/kernel/kexec/
 F:	include/linux/kexec.h
···
 F:	kernel/kexec*

 KEXEC HANDOVER (KHO)
-M:	Alexander Graf <graf@amazon.com>
 M:	Mike Rapoport <rppt@kernel.org>
 M:	Pasha Tatashin <pasha.tatashin@soleen.com>
-R:	Pratyush Yadav <pratyush@kernel.org>
+M:	Pratyush Yadav <pratyush@kernel.org>
+R:	Alexander Graf <graf@amazon.com>
 L:	kexec@lists.infradead.org
 L:	linux-mm@kvack.org
 S:	Maintained
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git
 F:	Documentation/admin-guide/mm/kho.rst
 F:	Documentation/core-api/kho/*
 F:	include/linux/kexec_handover.h
 F:	include/linux/kho/
-F:	include/linux/kho/abi/
 F:	kernel/liveupdate/kexec_handover*
 F:	lib/test_kho.c
 F:	tools/testing/selftests/kho/
···
 LIVE UPDATE
 M:	Pasha Tatashin <pasha.tatashin@soleen.com>
 M:	Mike Rapoport <rppt@kernel.org>
-R:	Pratyush Yadav <pratyush@kernel.org>
+M:	Pratyush Yadav <pratyush@kernel.org>
 L:	linux-kernel@vger.kernel.org
 S:	Maintained
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git
 F:	Documentation/core-api/liveupdate.rst
 F:	Documentation/mm/memfd_preservation.rst
 F:	Documentation/userspace-api/liveupdate.rst
 F:	include/linux/kho/abi/
 F:	include/linux/liveupdate.h
-F:	include/linux/liveupdate/
 F:	include/uapi/linux/liveupdate.h
 F:	kernel/liveupdate/
 F:	lib/tests/liveupdate.c
···

 MEMORY MANAGEMENT - MGLRU (MULTI-GEN LRU)
 M:	Andrew Morton <akpm@linux-foundation.org>
-M:	Axel Rasmussen <axelrasmussen@google.com>
-M:	Yuanchu Xie <yuanchu@google.com>
+R:	Kairui Song <kasong@tencent.com>
+R:	Qi Zheng <qi.zheng@linux.dev>
+R:	Shakeel Butt <shakeel.butt@linux.dev>
+R:	Barry Song <baohua@kernel.org>
+R:	Axel Rasmussen <axelrasmussen@google.com>
+R:	Yuanchu Xie <yuanchu@google.com>
 R:	Wei Xu <weixugc@google.com>
 L:	linux-mm@kvack.org
 S:	Maintained
···

 PAGE CACHE
 M:	Matthew Wilcox (Oracle) <willy@infradead.org>
+R:	Jan Kara <jack@suse.cz>
 L:	linux-fsdevel@vger.kernel.org
+L:	linux-mm@kvack.org
 S:	Supported
 T:	git git://git.infradead.org/users/willy/pagecache.git
 F:	Documentation/filesystems/locking.rst

+4 -1

drivers/block/zram/zram_drv.c

···
 				mode = RECOMPRESS_HUGE;
 			if (!strcmp(val, "huge_idle"))
 				mode = RECOMPRESS_IDLE | RECOMPRESS_HUGE;
+			if (!mode)
+				return -EINVAL;
 			continue;
 		}

···
 	 */
 	if (offset) {
 		if (n <= (PAGE_SIZE - offset))
-			return;
+			goto end_bio;

 		n -= (PAGE_SIZE - offset);
 		index++;
···
 		n -= PAGE_SIZE;
 	}

+end_bio:
 	bio_endio(bio);
 }
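The first zram hunk rejects a recompression type string that matches none of the known values instead of silently continuing with a zero mode. The sketch below isolates that validation in userspace. The flag values and the "idle" branch are assumptions of this sketch (the hunk only shows the "huge" and "huge_idle" handling), and parse_recompress_type() is a hypothetical name, not a function in the driver.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Placeholder flag values for this sketch, not the driver's definitions. */
#define RECOMPRESS_IDLE	(1 << 0)
#define RECOMPRESS_HUGE	(1 << 1)

/* Map a type string to a mode mask, or return -EINVAL if unrecognized. */
int parse_recompress_type(const char *val)
{
	int mode = 0;

	if (!strcmp(val, "idle"))	/* assumed branch, not visible in the hunk */
		mode = RECOMPRESS_IDLE;
	if (!strcmp(val, "huge"))
		mode = RECOMPRESS_HUGE;
	if (!strcmp(val, "huge_idle"))
		mode = RECOMPRESS_IDLE | RECOMPRESS_HUGE;
	if (!mode)
		return -EINVAL;		/* the check the hunk adds */
	return mode;
}
```

Without the final check, a misspelled value such as "hugee" would leave the mode at zero and be accepted.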

+2 -2

fs/buffer.c

···
 	long offset;
 	struct mem_cgroup *memcg, *old_memcg;

-	/* The folio lock pins the memcg */
-	memcg = folio_memcg(folio);
+	memcg = get_mem_cgroup_from_folio(folio);
 	old_memcg = set_active_memcg(memcg);

 	head = NULL;
···
 	}
 out:
 	set_active_memcg(old_memcg);
+	mem_cgroup_put(memcg);
 	return head;
 /*
  * In case anything failed, we just free everything we got.

+11 -11

fs/fs-writeback.c

···
 	if (inode_cgwb_enabled(inode)) {
 		struct cgroup_subsys_state *memcg_css;

-		if (folio) {
-			memcg_css = mem_cgroup_css_from_folio(folio);
-			wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
-		} else {
-			/* must pin memcg_css, see wb_get_create() */
+		/* must pin memcg_css, see wb_get_create() */
+		if (folio)
+			memcg_css = get_mem_cgroup_css_from_folio(folio);
+		else
 			memcg_css = task_get_css(current, memory_cgrp_id);
-			wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
-			css_put(memcg_css);
-		}
+		wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
+		css_put(memcg_css);
 	}

 	if (!wb)
···
 	if (!wbc->wb || wbc->no_cgroup_owner)
 		return;

-	css = mem_cgroup_css_from_folio(folio);
+	css = get_mem_cgroup_css_from_folio(folio);
 	/* dead cgroups shouldn't contribute to inode ownership arbitration */
 	if (!css_is_online(css))
-		return;
+		goto out;

 	id = css->id;

 	if (id == wbc->wb_id) {
 		wbc->wb_bytes += bytes;
-		return;
+		goto out;
 	}

 	if (id == wbc->wb_lcand_id)
···
 		wbc->wb_tcand_bytes += bytes;
 	else
 		wbc->wb_tcand_bytes -= min(bytes, wbc->wb_tcand_bytes);
+out:
+	css_put(css);
 }
 EXPORT_SYMBOL_GPL(wbc_account_cgroup_owner);
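The second fs-writeback.c hunk follows a common pattern: once the lookup returns a counted reference, every early return becomes a jump to one exit label that drops it. A toy userspace model of that control flow is sketched below. The toy_css type and its plain integer refcount are stand-ins invented for this example and bear no relation to the real cgroup_subsys_state implementation.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_css {
	int refcnt;
	bool online;
	int id;
};

struct toy_css *toy_css_get(struct toy_css *css)
{
	css->refcnt++;
	return css;
}

void toy_css_put(struct toy_css *css)
{
	css->refcnt--;
}

/*
 * Returns the bytes credited to owner_id. Every path, including the
 * early exits, drops the reference taken at entry.
 */
unsigned long toy_account(struct toy_css *src, int owner_id, unsigned long bytes)
{
	struct toy_css *css = toy_css_get(src);
	unsigned long credited = 0;

	if (!css->online)
		goto out;	/* was a bare return before the reference existed */

	if (css->id == owner_id) {
		credited = bytes;
		goto out;
	}
	/* candidate bookkeeping for foreign owners would go here */
out:
	toy_css_put(css);
	return credited;
}
```

The invariant to check is that the reference count after the call equals the count before it, whichever branch was taken.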

-2

fs/userfaultfd.c

···
 		return -EINVAL;
 	if (!len)
 		return -EINVAL;
-	if (start < mmap_min_addr)
-		return -EINVAL;
 	if (start >= task_size)
 		return -EINVAL;
 	if (len > task_size - start)

+2

include/linux/alloc_tag.h

···
 {
 	WARN_ONCE(ref && !ref->ct, "alloc_tag was not set\n");
 }
+void alloc_tag_add_early_pfn(unsigned long pfn);
 #else
 static inline void alloc_tag_add_check(union codetag_ref *ref, struct alloc_tag *tag) {}
 static inline void alloc_tag_sub_check(union codetag_ref *ref) {}
+static inline void alloc_tag_add_early_pfn(unsigned long pfn) {}
 #endif

 /* Caller should verify both ref and tag to be valid */

+2

include/linux/damon.h

···

 	/* lists of &struct damon_call_control */
 	struct list_head call_controls;
+	bool call_controls_obsolete;
 	struct mutex call_controls_lock;

 	struct damos_walk_control *walk_control;
+	bool walk_control_obsolete;
 	struct mutex walk_control_lock;

 	/*

+1 -8

include/linux/fs.h

···
 		const struct vm_area_struct *vma);
 int __compat_vma_mmap(struct vm_area_desc *desc, struct vm_area_struct *vma);
 int compat_vma_mmap(struct file *file, struct vm_area_struct *vma);
-int __vma_check_mmap_hook(struct vm_area_struct *vma);

 static inline int vfs_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	int err;
-
 	if (file->f_op->mmap_prepare)
 		return compat_vma_mmap(file, vma);

-	err = file->f_op->mmap(file, vma);
-	if (err)
-		return err;
-
-	return __vma_check_mmap_hook(vma);
+	return file->f_op->mmap(file, vma);
 }

 static inline int vfs_mmap_prepare(struct file *file, struct vm_area_desc *desc)

+7 -6

include/linux/kexec_handover.h

···
 struct folio *kho_restore_folio(phys_addr_t phys);
 struct page *kho_restore_pages(phys_addr_t phys, unsigned long nr_pages);
 void *kho_restore_vmalloc(const struct kho_vmalloc *preservation);
-int kho_add_subtree(const char *name, void *fdt);
-void kho_remove_subtree(void *fdt);
-int kho_retrieve_subtree(const char *name, phys_addr_t *phys);
+int kho_add_subtree(const char *name, void *blob, size_t size);
+void kho_remove_subtree(void *blob);
+int kho_retrieve_subtree(const char *name, phys_addr_t *phys, size_t *size);

 void kho_memory_init(void);

···
 	return NULL;
 }

-static inline int kho_add_subtree(const char *name, void *fdt)
+static inline int kho_add_subtree(const char *name, void *blob, size_t size)
 {
 	return -EOPNOTSUPP;
 }

-static inline void kho_remove_subtree(void *fdt) { }
+static inline void kho_remove_subtree(void *blob) { }

-static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys)
+static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys,
+				       size_t *size)
 {
 	return -EOPNOTSUPP;
 }

+16 -4

include/linux/kho/abi/kexec_handover.h

···
  * restore the preserved data.::
  *
  *   / {
- *       compatible = "kho-v2";
+ *       compatible = "kho-v3";
  *
  *       preserved-memory-map = <0x...>;
  *
  *       <subnode-name-1> {
  *           preserved-data = <0x...>;
+ *           blob-size = <0x...>;
  *       };
  *
  *       <subnode-name-2> {
  *           preserved-data = <0x...>;
+ *           blob-size = <0x...>;
  *       };
  *       ... ...
  *       <subnode-name-N> {
  *           preserved-data = <0x...>;
+ *           blob-size = <0x...>;
  *       };
  *   };
  *
  * Root KHO Node (/):
- *   - compatible: "kho-v2"
+ *   - compatible: "kho-v3"
  *
  *     Indentifies the overall KHO ABI version.
  *
···
  *
  *     Physical address pointing to a subnode data blob that is also
  *     being preserved.
+ *
+ *   - blob-size: u64
+ *
+ *     Size in bytes of the preserved data blob. This is needed because
+ *     blobs may use arbitrary formats (not just FDT), so the size
+ *     cannot be determined from the blob content alone.
  */

 /* The compatible string for the KHO FDT root node. */
-#define KHO_FDT_COMPATIBLE "kho-v2"
+#define KHO_FDT_COMPATIBLE "kho-v3"

 /* The FDT property for the preserved memory map. */
 #define KHO_FDT_MEMORY_MAP_PROP_NAME "preserved-memory-map"

 /* The FDT property for preserved data blobs. */
-#define KHO_FDT_SUB_TREE_PROP_NAME "preserved-data"
+#define KHO_SUB_TREE_PROP_NAME "preserved-data"
+
+/* The FDT property for the size of preserved data blobs. */
+#define KHO_SUB_TREE_SIZE_PROP_NAME "blob-size"

 /**
  * DOC: Kexec Handover ABI for vmalloc Preservation

+46

include/linux/kho/abi/kexec_metadata.h

···
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+/**
+ * DOC: Kexec Metadata ABI
+ *
+ * The "kexec-metadata" subtree stores optional metadata about the kexec chain.
+ * It is registered via kho_add_subtree(), keeping it independent from the core
+ * KHO ABI. This allows the metadata format to evolve without affecting other
+ * KHO consumers.
+ *
+ * The metadata is stored as a plain C struct rather than FDT format for
+ * simplicity and direct field access.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+ */
+
+#ifndef _LINUX_KHO_ABI_KEXEC_METADATA_H
+#define _LINUX_KHO_ABI_KEXEC_METADATA_H
+
+#include <linux/types.h>
+#include <linux/utsname.h>
+
+#define KHO_KEXEC_METADATA_VERSION 1
+
+/**
+ * struct kho_kexec_metadata - Kexec metadata passed between kernels
+ * @version: ABI version of this struct (must be first field)
+ * @previous_release: Kernel version string that initiated the kexec
+ * @kexec_count: Number of kexec boots since last cold boot
+ *
+ * This structure is preserved across kexec and allows the new kernel to
+ * identify which kernel it was booted from and how many kexec reboots
+ * have occurred.
+ *
+ * __NEW_UTS_LEN is part of uABI, so it safe to use it in here.
+ */
+struct kho_kexec_metadata {
+	u32 version;
+	char previous_release[__NEW_UTS_LEN + 1];
+	u32 kexec_count;
+} __packed;
+
+#define KHO_METADATA_NODE_NAME "kexec-metadata"
+
+#endif /* _LINUX_KHO_ABI_KEXEC_METADATA_H */
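Because the metadata is a packed C struct rather than an FDT, its byte layout is the ABI. The userspace mirror below is a sketch of what that layout works out to, assuming __NEW_UTS_LEN is 64 as in the uapi utsname header. The sketch_ names and the validation helper are hypothetical and only illustrate the kind of size and version check a consumer could perform on a retrieved blob; they are not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: mirrors __NEW_UTS_LEN (64) from the uapi utsname header. */
#define SKETCH_UTS_LEN 64

struct sketch_kexec_metadata {
	uint32_t version;
	char previous_release[SKETCH_UTS_LEN + 1];
	uint32_t kexec_count;
} __attribute__((packed));

/* With packing there is no padding: 4 + 65 + 4 bytes. */
_Static_assert(sizeof(struct sketch_kexec_metadata) == 73,
	       "unexpected packed size");
_Static_assert(offsetof(struct sketch_kexec_metadata, kexec_count) == 69,
	       "kexec_count should follow the release string directly");

/*
 * Hypothetical consumer-side check: accept a blob only if it is large
 * enough, carries the expected version, and has a terminated string.
 */
int sketch_metadata_valid(const void *blob, size_t size, uint32_t want_version)
{
	struct sketch_kexec_metadata m;

	if (size < sizeof(m))
		return 0;
	memcpy(&m, blob, sizeof(m));	/* copy out, the blob may be unaligned */
	if (m.version != want_version)
		return 0;
	return memchr(m.previous_release, '\0',
		      sizeof(m.previous_release)) != NULL;
}
```

Keeping the version as the first field means a reader can reject an unknown layout before interpreting anything else in the blob.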

+9 -8

include/linux/liveupdate.h

···
 #include <linux/kho/abi/luo.h>
 #include <linux/list.h>
 #include <linux/mutex.h>
+#include <linux/rwsem.h>
 #include <linux/types.h>
 #include <uapi/linux/liveupdate.h>

···
  *               finish, in order to do successful finish calls for all
  *               resources in the session.
  * @finish:      Required. Final cleanup in the new kernel.
+ * @get_id:      Optional. Returns a unique identifier for the file.
  * @owner:       Module reference
  *
  * All operations (except can_preserve) receive a pointer to a
···
 	int (*retrieve)(struct liveupdate_file_op_args *args);
 	bool (*can_finish)(struct liveupdate_file_op_args *args);
 	void (*finish)(struct liveupdate_file_op_args *args);
+	unsigned long (*get_id)(struct file *file);
 	struct module *owner;
 };

···
 int liveupdate_reboot(void);

 int liveupdate_register_file_handler(struct liveupdate_file_handler *fh);
-int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh);
+void liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh);

 int liveupdate_register_flb(struct liveupdate_file_handler *fh,
 			    struct liveupdate_flb *flb);
-int liveupdate_unregister_flb(struct liveupdate_file_handler *fh,
-			      struct liveupdate_flb *flb);
+void liveupdate_unregister_flb(struct liveupdate_file_handler *fh,
+			       struct liveupdate_flb *flb);

 int liveupdate_flb_get_incoming(struct liveupdate_flb *flb, void **objp);
 int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp);
···
 	return -EOPNOTSUPP;
 }

-static inline int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh)
+static inline void liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh)
 {
-	return -EOPNOTSUPP;
 }

 static inline int liveupdate_register_flb(struct liveupdate_file_handler *fh,
···
 	return -EOPNOTSUPP;
 }

-static inline int liveupdate_unregister_flb(struct liveupdate_file_handler *fh,
-					    struct liveupdate_flb *flb)
+static inline void liveupdate_unregister_flb(struct liveupdate_file_handler *fh,
+					     struct liveupdate_flb *flb)
 {
-	return -EOPNOTSUPP;
 }

 static inline int liveupdate_flb_get_incoming(struct liveupdate_flb *flb,

+98 -95

include/linux/memcontrol.h

···
 	unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
 	struct mem_cgroup_reclaim_iter iter;

+	/*
+	 * objcg is wiped out as a part of the objcg repaprenting process.
+	 * orig_objcg preserves a pointer (and a reference) to the original
+	 * objcg until the end of live of memcg.
+	 */
+	struct obj_cgroup __rcu *objcg;
+	struct obj_cgroup *orig_objcg;
+	/* list of inherited objcgs, protected by objcg_lock */
+	struct list_head objcg_list;
+
 #ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
 	/* slab stats for nmi context */
 	atomic_t slab_reclaimable;
···
 		struct list_head list; /* protected by objcg_lock */
 		struct rcu_head rcu;
 	};
+	bool is_root;
 };

 /*
···
 	seqlock_t socket_pressure_seqlock;
 #endif
 	int kmemcg_id;
-	/*
-	 * memcg->objcg is wiped out as a part of the objcg repaprenting
-	 * process. memcg->orig_objcg preserves a pointer (and a reference)
-	 * to the original objcg until the end of live of memcg.
-	 */
-	struct obj_cgroup __rcu *objcg;
-	struct obj_cgroup *orig_objcg;
-	/* list of inherited objcgs, protected by objcg_lock */
-	struct list_head objcg_list;

 	struct memcg_vmstats_percpu __percpu *vmstats_percpu;

···
 #define OBJEXTS_FLAGS_MASK (__NR_OBJEXTS_FLAGS - 1)

 #ifdef CONFIG_MEMCG
-
-static inline bool folio_memcg_kmem(struct folio *folio);
-
 /*
  * After the initialization objcg->memcg is always pointing at
  * a valid memcg, but can be atomically swapped to the parent memcg.
···
 }

 /*
- * __folio_memcg - Get the memory cgroup associated with a non-kmem folio
- * @folio: Pointer to the folio.
- *
- * Returns a pointer to the memory cgroup associated with the folio,
- * or NULL. This function assumes that the folio is known to have a
- * proper memory cgroup pointer. It's not safe to call this function
- * against some type of folios, e.g. slab folios or ex-slab folios or
- * kmem folios.
- */
-static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
-{
-	unsigned long memcg_data = folio->memcg_data;
-
-	VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
-	VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
-	VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);
-
-	return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
-}
-
-/*
- * __folio_objcg - get the object cgroup associated with a kmem folio.
+ * folio_objcg - get the object cgroup associated with a folio.
  * @folio: Pointer to the folio.
  *
  * Returns a pointer to the object cgroup associated with the folio,
  * or NULL. This function assumes that the folio is known to have a
- * proper object cgroup pointer. It's not safe to call this function
- * against some type of folios, e.g. slab folios or ex-slab folios or
- * LRU folios.
+ * proper object cgroup pointer.
  */
-static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
+static inline struct obj_cgroup *folio_objcg(struct folio *folio)
 {
 	unsigned long memcg_data = folio->memcg_data;

 	VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
 	VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
-	VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);

 	return (struct obj_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
 }
···
  * proper memory cgroup pointer. It's not safe to call this function
  * against some type of folios, e.g. slab folios or ex-slab folios.
  *
- * For a non-kmem folio any of the following ensures folio and memcg binding
- * stability:
+ * For a folio any of the following ensures folio and objcg binding stability:
  *
  * - the folio lock
  * - LRU isolation
  * - exclusive reference
  *
- * For a kmem folio a caller should hold an rcu read lock to protect memcg
- * associated with a kmem folio from being released.
+ * Based on the stable binding of folio and objcg, for a folio any of the
+ * following ensures folio and memcg binding stability:
+ *
+ * - cgroup_mutex
+ * - the lruvec lock
+ *
+ * If the caller only want to ensure that the page counters of memcg are
+ * updated correctly, ensure that the binding stability of folio and objcg
+ * is sufficient.
+ *
+ * Note: The caller should hold an rcu read lock or cgroup_mutex to protect
+ * memcg associated with a folio from being released.
  */
 static inline struct mem_cgroup *folio_memcg(struct folio *folio)
 {
-	if (folio_memcg_kmem(folio))
-		return obj_cgroup_memcg(__folio_objcg(folio));
-	return __folio_memcg(folio);
+	struct obj_cgroup *objcg = folio_objcg(folio);
+
+	return objcg ? obj_cgroup_memcg(objcg) : NULL;
 }

 /*
···
  * has an associated memory cgroup pointer or an object cgroups vector or
  * an object cgroup.
  *
- * For a non-kmem folio any of the following ensures folio and memcg binding
- * stability:
+ * The page and objcg or memcg binding rules can refer to folio_memcg().
  *
- * - the folio lock
- * - LRU isolation
- * - exclusive reference
- *
- * For a kmem folio a caller should hold an rcu read lock to protect memcg
- * associated with a kmem folio from being released.
+ * A caller should hold an rcu read lock to protect memcg associated with a
+ * page from being released.
  */
 static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
 {
···
 	 * for slabs, READ_ONCE() should be used here.
 	 */
 	unsigned long memcg_data = READ_ONCE(folio->memcg_data);
+	struct obj_cgroup *objcg;

 	if (memcg_data & MEMCG_DATA_OBJEXTS)
 		return NULL;

-	if (memcg_data & MEMCG_DATA_KMEM) {
-		struct obj_cgroup *objcg;
+	objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);

-		objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
-		return obj_cgroup_memcg(objcg);
-	}
-
-	return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
+	return objcg ? obj_cgroup_memcg(objcg) : NULL;
 }

 static inline struct mem_cgroup *page_memcg_check(struct page *page)
···
 static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
 {
 	return (memcg == root_mem_cgroup);
+}
+
+static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
+{
+	return objcg->is_root;
 }

 static inline bool mem_cgroup_disabled(void)
···
  * folio_lruvec - return lruvec for isolating/putting an LRU folio
  * @folio: Pointer to the folio.
  *
- * This function relies on folio->mem_cgroup being stable.
+ * Call with rcu_read_lock() held to ensure the lifetime of the returned lruvec.
+ * Note that this alone will NOT guarantee the stability of the folio->lruvec
+ * association; the folio can be reparented to an ancestor if this races with
+ * cgroup deletion.
+ *
+ * Use folio_lruvec_lock() to ensure both lifetime and stability of the binding.
+ * Once a lruvec is locked, folio_lruvec() can be called on other folios, and
+ * their binding is stable if the returned lruvec matches the one the caller has
+ * locked. Useful for lock batching.
  */
 static inline struct lruvec *folio_lruvec(struct folio *folio)
 {
···
 struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
 						unsigned long *flags);

-#ifdef CONFIG_DEBUG_VM
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio);
-#else
-static inline
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
-{
-}
-#endif
-
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
···

 static inline bool obj_cgroup_tryget(struct obj_cgroup *objcg)
 {
+	if (obj_cgroup_is_root(objcg))
+		return true;
 	return percpu_ref_tryget(&objcg->refcnt);
-}
-
-static inline void obj_cgroup_get(struct obj_cgroup *objcg)
-{
-	percpu_ref_get(&objcg->refcnt);
 }

 static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
 				       unsigned long nr)
 {
-	percpu_ref_get_many(&objcg->refcnt, nr);
+	if (!obj_cgroup_is_root(objcg))
+		percpu_ref_get_many(&objcg->refcnt, nr);
+}
+
+static inline void obj_cgroup_get(struct obj_cgroup *objcg)
+{
+	obj_cgroup_get_many(objcg, 1);
 }

 static inline void obj_cgroup_put(struct obj_cgroup *objcg)
 {
-	if (objcg)
+	if (objcg && !obj_cgroup_is_root(objcg))
 		percpu_ref_put(&objcg->refcnt);
 }

···
 	return match;
 }

-struct cgroup_subsys_state *mem_cgroup_css_from_folio(struct folio *folio);
+struct cgroup_subsys_state *get_mem_cgroup_css_from_folio(struct folio *folio);
 ino_t page_cgroup_ino(struct page *page);

 static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
···
 }

 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
-		int zid, int nr_pages);
+		int zid, long nr_pages);

 static inline
 unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
···
 static inline void count_memcg_folio_events(struct folio *folio,
 		enum vm_event_item idx, unsigned long nr)
 {
-	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct mem_cgroup *memcg;

-	if (memcg)
-		count_memcg_events(memcg, idx, nr);
+	if (!folio_memcg_charged(folio))
+		return;
+
+	rcu_read_lock();
+	memcg = folio_memcg(folio);
+	count_memcg_events(memcg, idx, nr);
+	rcu_read_unlock();
 }

 static inline void count_memcg_events_mm(struct mm_struct *mm,
···
 	return true;
 }

+static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
+{
+	return true;
+}
+
 static inline bool mem_cgroup_disabled(void)
 {
 	return true;
···
 	return &pgdat->__lruvec;
 }

-static inline
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
-{
-}
-
 static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
 {
 	return NULL;
···
 {
 	struct pglist_data *pgdat = folio_pgdat(folio);

+	rcu_read_lock();
 	spin_lock(&pgdat->__lruvec.lru_lock);
 	return &pgdat->__lruvec;
 }
···
 {
 	struct pglist_data *pgdat = folio_pgdat(folio);

+	rcu_read_lock();
 	spin_lock_irq(&pgdat->__lruvec.lru_lock);
 	return &pgdat->__lruvec;
 }
···
 {
 	struct pglist_data *pgdat = folio_pgdat(folio);

+	rcu_read_lock();
 	spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
 	return &pgdat->__lruvec;
 }
···
 	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
 }

-static inline void unlock_page_lruvec(struct lruvec *lruvec)
+static inline void lruvec_lock_irq(struct lruvec *lruvec)
+{
+	rcu_read_lock();
+	spin_lock_irq(&lruvec->lru_lock);
+}
+
+static inline void lruvec_unlock(struct lruvec *lruvec)
 {
 	spin_unlock(&lruvec->lru_lock);
+	rcu_read_unlock();
 }

-static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+static inline void lruvec_unlock_irq(struct lruvec *lruvec)
 {
 	spin_unlock_irq(&lruvec->lru_lock);
+	rcu_read_unlock();
 }

-static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
-		unsigned long flags)
+static inline void lruvec_unlock_irqrestore(struct lruvec *lruvec, unsigned long flags)
 {
 	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+	rcu_read_unlock();
 }

 /* Test requires a stable folio->memcg binding, see folio_memcg() */
···
 		if (folio_matches_lruvec(folio, locked_lruvec))
 			return locked_lruvec;

-		unlock_page_lruvec_irq(locked_lruvec);
+		lruvec_unlock_irq(locked_lruvec);
 	}

 	return folio_lruvec_lock_irq(folio);
···
 		if (folio_matches_lruvec(folio, *lruvecp))
 			return;

-		unlock_page_lruvec_irqrestore(*lruvecp, *flags);
+		lruvec_unlock_irqrestore(*lruvecp, *flags);
 	}

 	*lruvecp = folio_lruvec_lock_irqsave(folio, flags);
···
 	if (mem_cgroup_disabled())
 		return;

+	if (!folio_memcg_charged(folio))
+		return;
+
+	rcu_read_lock();
 	memcg = folio_memcg(folio);
-	if (unlikely(memcg && &memcg->css != wb->memcg_css))
+	if (unlikely(&memcg->css != wb->memcg_css))
 		mem_cgroup_track_foreign_dirty_slowpath(folio, wb);
+	rcu_read_unlock();
 }

 void mem_cgroup_flush_foreign(struct bdi_writeback *wb);

+5

include/linux/mm.h

···
 	 */
 };
 
+struct vm_uffd_ops;
+
 /*
  * These are the virtual MM functions - opening of an area, closing and
  * unmapping it (needed to keep files on disk up-to-date etc), pointer
···
 	struct page *(*find_normal_page)(struct vm_area_struct *vma,
 					 unsigned long addr);
 #endif /* CONFIG_FIND_NORMAL_PAGE */
+#ifdef CONFIG_USERFAULTFD
+	const struct vm_uffd_ops *uffd_ops;
+#endif
 };
 
 #ifdef CONFIG_NUMA_BALANCING

+6

include/linux/mm_inline.h

···
 {
 	enum lru_list lru = folio_lru_list(folio);
 
+	VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
+
 	if (lru_gen_add_folio(lruvec, folio, false))
 		return;
 
···
 {
 	enum lru_list lru = folio_lru_list(folio);
 
+	VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
+
 	if (lru_gen_add_folio(lruvec, folio, true))
 		return;
 
···
 void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
 {
 	enum lru_list lru = folio_lru_list(folio);
+
+	VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
 
 	if (lru_gen_del_folio(lruvec, folio, false))
 		return;

+27 -15

include/linux/mmzone.h

···
 void lru_gen_offline_memcg(struct mem_cgroup *memcg);
 void lru_gen_release_memcg(struct mem_cgroup *memcg);
 void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid);
+void max_lru_gen_memcg(struct mem_cgroup *memcg, int nid);
+bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg, int nid);
+void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid);
 
 #else /* !CONFIG_LRU_GEN */
 
···
 }
 
 static inline void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
+{
+}
+
+static inline void max_lru_gen_memcg(struct mem_cgroup *memcg, int nid)
+{
+}
+
+static inline bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg, int nid)
+{
+	return true;
+}
+
+static inline
+void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid)
 {
 }
 
···
 extern size_t mem_section_usage_size(void);
 
 /*
- * We use the lower bits of the mem_map pointer to store
- * a little bit of information. The pointer is calculated
- * as mem_map - section_nr_to_pfn(pnum). The result is
- * aligned to the minimum alignment of the two values:
- *   1. All mem_map arrays are page-aligned.
- *   2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT
- *      lowest bits. PFN_SECTION_SHIFT is arch-specific
- *      (equal SECTION_SIZE_BITS - PAGE_SHIFT), and the
- *      worst combination is powerpc with 256k pages,
- *      which results in PFN_SECTION_SHIFT equal 6.
- * To sum it up, at least 6 bits are available on all architectures.
- * However, we can exceed 6 bits on some other architectures except
- * powerpc (e.g. 15 bits are available on x86_64, 13 bits are available
- * with the worst case of 64K pages on arm64) if we make sure the
- * exceeded bit is not applicable to powerpc.
+ * We use the lower bits of the mem_map pointer to store a little bit of
+ * information. The pointer is calculated as mem_map - section_nr_to_pfn().
+ * The result is aligned to the minimum alignment of the two values:
+ *
+ * 1. All mem_map arrays are page-aligned.
+ * 2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT lowest bits.
+ *
+ * We always expect a single section to cover full pages. Therefore,
+ * we can safely assume that PFN_SECTION_SHIFT is large enough to
+ * accommodate SECTION_MAP_LAST_BIT. We use BUILD_BUG_ON() to ensure this.
  */
 enum {
 	SECTION_MARKED_PRESENT_BIT,

+1 -1

include/linux/pgalloc_tag.h

···
 
 	if (get_page_tag_ref(page, &ref, &handle)) {
 		alloc_tag_sub_check(&ref);
-		if (ref.ct)
+		if (ref.ct && !is_codetag_empty(&ref))
 			tag = ct_to_alloc_tag(ref.ct);
 		put_page_tag_ref(handle);
 	}

+1 -1

include/linux/sched.h

···
 	/* Used by memcontrol for targeted memcg charge: */
 	struct mem_cgroup *active_memcg;
 
-	/* Cache for current->cgroups->memcg->objcg lookups: */
+	/* Cache for current->cgroups->memcg->nodeinfo[nid]->objcg lookups: */
 	struct obj_cgroup *objcg;
 #endif
 

-14

include/linux/shmem_fs.h

···
 
 extern bool shmem_charge(struct inode *inode, long pages);
 
-#ifdef CONFIG_USERFAULTFD
-#ifdef CONFIG_SHMEM
-extern int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
-				  struct vm_area_struct *dst_vma,
-				  unsigned long dst_addr,
-				  unsigned long src_addr,
-				  uffd_flags_t flags,
-				  struct folio **foliop);
-#else /* !CONFIG_SHMEM */
-#define shmem_mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, \
-			       src_addr, flags, foliop) ({ BUG(); 0; })
-#endif /* CONFIG_SHMEM */
-#endif /* CONFIG_USERFAULTFD */
-
 /*
  * Used space is stored as unsigned 64-bit value in bytes but
  * quota core supports only signed 64-bit values so use that

+23 -2

include/linux/swap.h

···
 
 /* linux/mm/swap.c */
 void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file,
-		unsigned int nr_io, unsigned int nr_rotated)
-		__releases(lruvec->lru_lock);
+		unsigned int nr_io, unsigned int nr_rotated);
 void lru_note_cost_refault(struct folio *);
 void folio_add_lru(struct folio *);
 void folio_add_lru_vma(struct folio *, struct vm_area_struct *);
···
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
+unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx);
 
 #define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
 #define MEMCG_RECLAIM_PROACTIVE (1 << 2)
···
 
 	return READ_ONCE(memcg->swappiness);
 }
+
+void lru_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid);
 #else
 static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
 {
···
 	return vm_swap_full();
 }
 #endif
+
+/* for_each_managed_zone_pgdat - helper macro to iterate over all managed zones in a pgdat up to
+ * and including the specified highidx
+ * @zone: The current zone in the iterator
+ * @pgdat: The pgdat which node_zones are being iterated
+ * @idx: The index variable
+ * @highidx: The index of the highest zone to return
+ *
+ * This macro iterates through all managed zones up to and including the specified highidx.
+ * The zone iterator enters an invalid state after macro call and must be reinitialized
+ * before it can be used again.
+ */
+#define for_each_managed_zone_pgdat(zone, pgdat, idx, highidx) \
+	for ((idx) = 0, (zone) = (pgdat)->node_zones; \
+	    (idx) <= (highidx); \
+	    (idx)++, (zone)++) \
+		if (!managed_zone(zone)) \
+			continue; \
+		else
 
 #endif /* __KERNEL__*/
 #endif /* _LINUX_SWAP_H */
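
The for_each_managed_zone_pgdat() macro above relies on the "for ... if (!cond) continue; else" idiom, which lets the caller's loop body attach as the else branch. The following is a user-space sketch of that idiom only, not kernel code: the demo_* types, the DEMO_MAX_ZONES constant, and the demo_managed_zone() predicate are hypothetical stand-ins for struct zone, pg_data_t, and managed_zone().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's struct zone and pg_data_t. */
#define DEMO_MAX_ZONES 4

struct demo_zone {
	unsigned long managed_pages;
};

struct demo_pgdat {
	struct demo_zone node_zones[DEMO_MAX_ZONES];
};

/* Assumed predicate: a zone counts as managed if it has any pages. */
static bool demo_managed_zone(const struct demo_zone *zone)
{
	return zone->managed_pages != 0;
}

/*
 * Same shape as the macro in the hunk above: the trailing "else" makes
 * the statement that follows the macro the loop body, while unmanaged
 * zones are skipped through "continue".
 */
#define demo_for_each_managed_zone(zone, pgdat, idx, highidx)	\
	for ((idx) = 0, (zone) = (pgdat)->node_zones;		\
	     (idx) <= (highidx);				\
	     (idx)++, (zone)++)					\
		if (!demo_managed_zone(zone))			\
			continue;				\
		else

/* Sum managed pages over zones 0..highidx, skipping empty zones. */
static unsigned long demo_count_managed(struct demo_pgdat *pgdat, int highidx)
{
	struct demo_zone *zone;
	unsigned long total = 0;
	int idx;

	assert(pgdat != NULL);
	assert(highidx >= 0 && highidx < DEMO_MAX_ZONES);

	demo_for_each_managed_zone(zone, pgdat, idx, highidx)
		total += zone->managed_pages;

	/* Here zone points one past highidx, so it must not be reused as-is. */
	return total;
}
```

As the kernel comment warns, the iterator variable is left pointing past the last visited zone once the loop ends, which is why the sketch does not touch zone after the loop.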

+35 -38

include/linux/userfaultfd_k.h

···
 
 extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
 
+/* VMA userfaultfd operations */
+struct vm_uffd_ops {
+	/* Checks if a VMA can support userfaultfd */
+	bool (*can_userfault)(struct vm_area_struct *vma, vm_flags_t vm_flags);
+	/*
+	 * Called to resolve UFFDIO_CONTINUE request.
+	 * Should return the folio found at pgoff in the VMA's pagecache if it
+	 * exists or ERR_PTR otherwise.
+	 * The returned folio is locked and with reference held.
+	 */
+	struct folio *(*get_folio_noalloc)(struct inode *inode, pgoff_t pgoff);
+	/*
+	 * Called during resolution of UFFDIO_COPY request.
+	 * Should allocate and return a folio or NULL if allocation fails.
+	 */
+	struct folio *(*alloc_folio)(struct vm_area_struct *vma,
+				     unsigned long addr);
+	/*
+	 * Called during resolution of UFFDIO_COPY request.
+	 * Should only be called with a folio returned by alloc_folio() above.
+	 * The folio will be set to locked.
+	 * Returns 0 on success, error code on failure.
+	 */
+	int (*filemap_add)(struct folio *folio, struct vm_area_struct *vma,
+			   unsigned long addr);
+	/*
+	 * Called during resolution of UFFDIO_COPY request on the error
+	 * handling path.
+	 * Should revert the operation of ->filemap_add().
+	 */
+	void (*filemap_remove)(struct folio *folio, struct vm_area_struct *vma);
+};
+
 /* A combined operation mode + behavior flags. */
 typedef unsigned int __bitwise uffd_flags_t;
 
···
 
 /* Flags controlling behavior. These behavior changes are mode-independent. */
 #define MFILL_ATOMIC_WP MFILL_ATOMIC_FLAG(0)
-
-extern int mfill_atomic_install_pte(pmd_t *dst_pmd,
-				    struct vm_area_struct *dst_vma,
-				    unsigned long dst_addr, struct page *page,
-				    bool newly_allocated, uffd_flags_t flags);
 
 extern ssize_t mfill_atomic_copy(struct userfaultfd_ctx *ctx, unsigned long dst_start,
 				 unsigned long src_start, unsigned long len,
···
 	return vma->vm_flags & __VM_UFFD_FLAGS;
 }
 
-static inline bool vma_can_userfault(struct vm_area_struct *vma,
-				     vm_flags_t vm_flags,
-				     bool wp_async)
-{
-	vm_flags &= __VM_UFFD_FLAGS;
-
-	if (vma->vm_flags & VM_DROPPABLE)
-		return false;
-
-	if ((vm_flags & VM_UFFD_MINOR) &&
-	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
-		return false;
-
-	/*
-	 * If wp async enabled, and WP is the only mode enabled, allow any
-	 * memory type.
-	 */
-	if (wp_async && (vm_flags == VM_UFFD_WP))
-		return true;
-
-	/*
-	 * If user requested uffd-wp but not enabled pte markers for
-	 * uffd-wp, then shmem & hugetlbfs are not supported but only
-	 * anonymous.
-	 */
-	if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP) &&
-	    !vma_is_anonymous(vma))
-		return false;
-
-	/* By default, allow any of anon|shmem|hugetlb */
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-	       vma_is_shmem(vma);
-}
+bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
+		       bool wp_async);
 
 static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
 {

+5 -5

include/trace/events/memcg.h

···
 
 DECLARE_EVENT_CLASS(memcg_rstat_stats,
 
-	TP_PROTO(struct mem_cgroup *memcg, int item, int val),
+	TP_PROTO(struct mem_cgroup *memcg, int item, long val),
 
 	TP_ARGS(memcg, item, val),
 
 	TP_STRUCT__entry(
 		__field(u64, id)
 		__field(int, item)
-		__field(int, val)
+		__field(long, val)
 	),
 
 	TP_fast_assign(
···
 		__entry->val = val;
 	),
 
-	TP_printk("memcg_id=%llu item=%d val=%d",
+	TP_printk("memcg_id=%llu item=%d val=%ld",
 		  __entry->id, __entry->item, __entry->val)
 );
 
 DEFINE_EVENT(memcg_rstat_stats, mod_memcg_state,
 
-	TP_PROTO(struct mem_cgroup *memcg, int item, int val),
+	TP_PROTO(struct mem_cgroup *memcg, int item, long val),
 
 	TP_ARGS(memcg, item, val)
 );
 
 DEFINE_EVENT(memcg_rstat_stats, mod_memcg_lruvec_state,
 
-	TP_PROTO(struct mem_cgroup *memcg, int item, int val),
+	TP_PROTO(struct mem_cgroup *memcg, int item, long val),
 
 	TP_ARGS(memcg, item, val)
 );

+3

include/trace/events/writeback.h

···
 		__entry->ino = inode ? inode->i_ino : 0;
 		__entry->memcg_id = wb->memcg_css->id;
 		__entry->cgroup_ino = __trace_wb_assign_cgroup(wb);
+
+		rcu_read_lock();
 		__entry->page_cgroup_ino = cgroup_ino(folio_memcg(folio)->css.cgroup);
+		rcu_read_unlock();
 	),
 
 	TP_printk("bdi %s[%llu]: ino=%llu memcg_id=%u cgroup_ino=%llu page_cgroup_ino=%llu",

+5 -4

kernel/cgroup/cgroup.c

···
  */
 static void css_killed_work_fn(struct work_struct *work)
 {
-	struct cgroup_subsys_state *css =
-		container_of(work, struct cgroup_subsys_state, destroy_work);
+	struct cgroup_subsys_state *css;
+
+	css = container_of(to_rcu_work(work), struct cgroup_subsys_state, destroy_rwork);
 
 	cgroup_lock();
 
···
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
 	if (atomic_dec_and_test(&css->online_cnt)) {
-		INIT_WORK(&css->destroy_work, css_killed_work_fn);
-		queue_work(cgroup_offline_wq, &css->destroy_work);
+		INIT_RCU_WORK(&css->destroy_rwork, css_killed_work_fn);
+		queue_rcu_work(cgroup_offline_wq, &css->destroy_rwork);
 	}
 }
 

+138 -20

kernel/liveupdate/kexec_handover.c

···
 #include <linux/kexec.h>
 #include <linux/kexec_handover.h>
 #include <linux/kho_radix_tree.h>
+#include <linux/utsname.h>
 #include <linux/kho/abi/kexec_handover.h>
+#include <linux/kho/abi/kexec_metadata.h>
 #include <linux/libfdt.h>
 #include <linux/list.h>
 #include <linux/memblock.h>
···
 }
 
 /**
- * kho_add_subtree - record the physical address of a sub FDT in KHO root tree.
+ * kho_add_subtree - record the physical address of a sub blob in KHO root tree.
  * @name: name of the sub tree.
- * @fdt: the sub tree blob.
+ * @blob: the sub tree blob.
+ * @size: size of the blob in bytes.
  *
  * Creates a new child node named @name in KHO root FDT and records
- * the physical address of @fdt. The pages of @fdt must also be preserved
+ * the physical address of @blob. The pages of @blob must also be preserved
  * by KHO for the new kernel to retrieve it after kexec.
  *
  * A debugfs blob entry is also created at
···
  *
  * Return: 0 on success, error code on failure
  */
-int kho_add_subtree(const char *name, void *fdt)
+int kho_add_subtree(const char *name, void *blob, size_t size)
 {
-	phys_addr_t phys = virt_to_phys(fdt);
+	phys_addr_t phys = virt_to_phys(blob);
 	void *root_fdt = kho_out.fdt;
+	u64 size_u64 = size;
 	int err = -ENOMEM;
 	int off, fdt_err;
 
···
 		goto out_pack;
 	}
 
-	err = fdt_setprop(root_fdt, off, KHO_FDT_SUB_TREE_PROP_NAME,
+	err = fdt_setprop(root_fdt, off, KHO_SUB_TREE_PROP_NAME,
 			  &phys, sizeof(phys));
 	if (err < 0)
 		goto out_pack;
 
-	WARN_ON_ONCE(kho_debugfs_fdt_add(&kho_out.dbg, name, fdt, false));
+	err = fdt_setprop(root_fdt, off, KHO_SUB_TREE_SIZE_PROP_NAME,
+			  &size_u64, sizeof(size_u64));
+	if (err < 0)
+		goto out_pack;
+
+	WARN_ON_ONCE(kho_debugfs_blob_add(&kho_out.dbg, name, blob,
+					  size, false));
 
 out_pack:
 	fdt_pack(root_fdt);
···
 }
 EXPORT_SYMBOL_GPL(kho_add_subtree);
 
-void kho_remove_subtree(void *fdt)
+void kho_remove_subtree(void *blob)
 {
-	phys_addr_t target_phys = virt_to_phys(fdt);
+	phys_addr_t target_phys = virt_to_phys(blob);
 	void *root_fdt = kho_out.fdt;
 	int off;
 	int err;
···
 		const u64 *val;
 		int len;
 
-		val = fdt_getprop(root_fdt, off, KHO_FDT_SUB_TREE_PROP_NAME, &len);
+		val = fdt_getprop(root_fdt, off, KHO_SUB_TREE_PROP_NAME, &len);
 		if (!val || len != sizeof(phys_addr_t))
 			continue;
 
 		if ((phys_addr_t)*val == target_phys) {
 			fdt_del_node(root_fdt, off);
-			kho_debugfs_fdt_remove(&kho_out.dbg, fdt);
+			kho_debugfs_blob_remove(&kho_out.dbg, blob);
 			break;
 		}
 	}
···
 struct kho_in {
 	phys_addr_t fdt_phys;
 	phys_addr_t scratch_phys;
+	char previous_release[__NEW_UTS_LEN + 1];
+	u32 kexec_count;
 	struct kho_debugfs dbg;
 };
 
···
 EXPORT_SYMBOL_GPL(is_kho_boot);
 
 /**
- * kho_retrieve_subtree - retrieve a preserved sub FDT by its name.
- * @name: the name of the sub FDT passed to kho_add_subtree().
- * @phys: if found, the physical address of the sub FDT is stored in @phys.
+ * kho_retrieve_subtree - retrieve a preserved sub blob by its name.
+ * @name: the name of the sub blob passed to kho_add_subtree().
+ * @phys: if found, the physical address of the sub blob is stored in @phys.
+ * @size: if not NULL and found, the size of the sub blob is stored in @size.
  *
- * Retrieve a preserved sub FDT named @name and store its physical
- * address in @phys.
+ * Retrieve a preserved sub blob named @name and store its physical
+ * address in @phys and optionally its size in @size.
  *
  * Return: 0 on success, error code on failure
  */
-int kho_retrieve_subtree(const char *name, phys_addr_t *phys)
+int kho_retrieve_subtree(const char *name, phys_addr_t *phys, size_t *size)
 {
 	const void *fdt = kho_get_fdt();
 	const u64 *val;
···
 	if (offset < 0)
 		return -ENOENT;
 
-	val = fdt_getprop(fdt, offset, KHO_FDT_SUB_TREE_PROP_NAME, &len);
+	val = fdt_getprop(fdt, offset, KHO_SUB_TREE_PROP_NAME, &len);
 	if (!val || len != sizeof(*val))
 		return -EINVAL;
 
 	*phys = (phys_addr_t)*val;
+
+	val = fdt_getprop(fdt, offset, KHO_SUB_TREE_SIZE_PROP_NAME, &len);
+	if (!val || len != sizeof(*val)) {
+		pr_warn("broken KHO subnode '%s': missing or invalid blob-size property\n",
+			name);
+		return -EINVAL;
+	}
+
+	if (size)
+		*size = (size_t)*val;
 
 	return 0;
 }
···
 	return err;
 }
 
+static void __init kho_in_kexec_metadata(void)
+{
+	struct kho_kexec_metadata *metadata;
+	phys_addr_t metadata_phys;
+	size_t blob_size;
+	int err;
+
+	err = kho_retrieve_subtree(KHO_METADATA_NODE_NAME, &metadata_phys,
+				   &blob_size);
+	if (err)
+		/* This is fine, previous kernel didn't export metadata */
+		return;
+
+	/* Check that, at least, "version" is present */
+	if (blob_size < sizeof(u32)) {
+		pr_warn("kexec-metadata blob too small (%zu bytes)\n",
+			blob_size);
+		return;
+	}
+
+	metadata = phys_to_virt(metadata_phys);
+
+	if (metadata->version != KHO_KEXEC_METADATA_VERSION) {
+		pr_warn("kexec-metadata version %u not supported (expected %u)\n",
+			metadata->version, KHO_KEXEC_METADATA_VERSION);
+		return;
+	}
+
+	if (blob_size < sizeof(*metadata)) {
+		pr_warn("kexec-metadata blob too small for v%u (%zu < %zu)\n",
+			metadata->version, blob_size, sizeof(*metadata));
+		return;
+	}
+
+	/*
+	 * Copy data to the kernel structure that will persist during
+	 * kernel lifetime.
+	 */
+	kho_in.kexec_count = metadata->kexec_count;
+	strscpy(kho_in.previous_release, metadata->previous_release,
+		sizeof(kho_in.previous_release));
+
+	pr_info("exec from: %s (count %u)\n",
+		kho_in.previous_release, kho_in.kexec_count);
+}
+
+/*
+ * Create kexec metadata to pass kernel version and boot count to the
+ * next kernel. This keeps the core KHO ABI minimal and allows the
+ * metadata format to evolve independently.
+ */
+static __init int kho_out_kexec_metadata(void)
+{
+	struct kho_kexec_metadata *metadata;
+	int err;
+
+	metadata = kho_alloc_preserve(sizeof(*metadata));
+	if (IS_ERR(metadata))
+		return PTR_ERR(metadata);
+
+	metadata->version = KHO_KEXEC_METADATA_VERSION;
+	strscpy(metadata->previous_release, init_uts_ns.name.release,
+		sizeof(metadata->previous_release));
+	/* kho_in.kexec_count is set to 0 on cold boot */
+	metadata->kexec_count = kho_in.kexec_count + 1;
+
+	err = kho_add_subtree(KHO_METADATA_NODE_NAME, metadata,
+			      sizeof(*metadata));
+	if (err)
+		kho_unpreserve_free(metadata);
+
+	return err;
+}
+
+static int __init kho_kexec_metadata_init(const void *fdt)
+{
+	int err;
+
+	if (fdt)
+		kho_in_kexec_metadata();
+
+	/* Populate kexec metadata for the possible next kexec */
+	err = kho_out_kexec_metadata();
+	if (err)
+		pr_warn("failed to initialize kexec-metadata subtree: %d\n",
+			err);
+
+	return err;
+}
+
 static __init int kho_init(void)
 {
 	struct kho_radix_tree *tree = &kho_out.radix_tree;
···
 	if (err)
 		goto err_free_fdt;
 
+	err = kho_kexec_metadata_init(fdt);
+	if (err)
+		goto err_free_fdt;
+
 	if (fdt) {
 		kho_in_debugfs_init(&kho_in.dbg, fdt);
 		return 0;
···
 			init_cma_reserved_pageblock(pfn_to_page(pfn));
 	}
 
-	WARN_ON_ONCE(kho_debugfs_fdt_add(&kho_out.dbg, "fdt",
-					 kho_out.fdt, true));
+	WARN_ON_ONCE(kho_debugfs_blob_add(&kho_out.dbg, "fdt",
+					  kho_out.fdt,
+					  fdt_totalsize(kho_out.fdt), true));
 
 	return 0;
 
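
kho_in_kexec_metadata() above validates the incoming blob in three steps: it must be large enough to hold the version field, the version must match, and only then is the full structure size checked. The following user-space sketch illustrates that ordering under stated assumptions. The struct layout, DEMO_METADATA_VERSION, and DEMO_RELEASE_LEN are hypothetical; the real definitions live in linux/kho/abi/kexec_metadata.h, which is not shown in this diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical values; the real ones come from the KHO ABI header. */
#define DEMO_METADATA_VERSION 1u
#define DEMO_RELEASE_LEN 65	/* assumed to mirror __NEW_UTS_LEN + 1 */

/* Assumed layout with the version field first, for illustration only. */
struct demo_kexec_metadata {
	uint32_t version;
	uint32_t kexec_count;
	char previous_release[DEMO_RELEASE_LEN];
};

enum demo_status { DEMO_OK = 0, DEMO_TOO_SMALL, DEMO_BAD_VERSION };

static enum demo_status demo_parse_metadata(const void *blob, size_t size,
					    struct demo_kexec_metadata *out)
{
	uint32_t version;

	assert(out != NULL);

	/* Step 1: the blob must at least hold the version field. */
	if (blob == NULL || size < sizeof(version))
		return DEMO_TOO_SMALL;
	memcpy(&version, blob, sizeof(version));

	/* Step 2: only now is it safe to compare versions. */
	if (version != DEMO_METADATA_VERSION)
		return DEMO_BAD_VERSION;

	/* Step 3: the version fixes the layout, so check the full size. */
	if (size < sizeof(*out))
		return DEMO_TOO_SMALL;

	memcpy(out, blob, sizeof(*out));
	/* Force termination, much as strscpy() guarantees in the kernel code. */
	out->previous_release[DEMO_RELEASE_LEN - 1] = '\0';
	return DEMO_OK;
}
```

Checking the version before the full size means a newer or older format is reported as an unsupported version rather than as a truncated blob, which seems to be the intent of the ordering in the hunk.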

+35 -20

kernel/liveupdate/kexec_handover_debugfs.c

···
 	struct dentry *file;
 };
 
-static int __kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir,
-				 const char *name, const void *fdt)
+static int __kho_debugfs_blob_add(struct list_head *list, struct dentry *dir,
+				  const char *name, const void *blob,
+				  size_t size)
 {
 	struct fdt_debugfs *f;
 	struct dentry *file;
···
 	if (!f)
 		return -ENOMEM;
 
-	f->wrapper.data = (void *)fdt;
-	f->wrapper.size = fdt_totalsize(fdt);
+	f->wrapper.data = (void *)blob;
+	f->wrapper.size = size;
 
 	file = debugfs_create_blob(name, 0400, dir, &f->wrapper);
 	if (IS_ERR(file)) {
···
 	return 0;
 }
 
-int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name,
-			const void *fdt, bool root)
+int kho_debugfs_blob_add(struct kho_debugfs *dbg, const char *name,
+			 const void *blob, size_t size, bool root)
 {
 	struct dentry *dir;
 
···
 	else
 		dir = dbg->sub_fdt_dir;
 
-	return __kho_debugfs_fdt_add(&dbg->fdt_list, dir, name, fdt);
+	return __kho_debugfs_blob_add(&dbg->fdt_list, dir, name, blob, size);
 }
 
-void kho_debugfs_fdt_remove(struct kho_debugfs *dbg, void *fdt)
+void kho_debugfs_blob_remove(struct kho_debugfs *dbg, void *blob)
 {
 	struct fdt_debugfs *ff;
 
 	list_for_each_entry(ff, &dbg->fdt_list, list) {
-		if (ff->wrapper.data == fdt) {
+		if (ff->wrapper.data == blob) {
 			debugfs_remove(ff->file);
 			list_del(&ff->list);
 			kfree(ff);
···
 		goto err_rmdir;
 	}
 
-	err = __kho_debugfs_fdt_add(&dbg->fdt_list, dir, "fdt", fdt);
+	err = __kho_debugfs_blob_add(&dbg->fdt_list, dir, "fdt", fdt,
+				     fdt_totalsize(fdt));
 	if (err)
 		goto err_rmdir;
 
 	fdt_for_each_subnode(child, fdt, 0) {
 		int len = 0;
 		const char *name = fdt_get_name(fdt, child, NULL);
-		const u64 *fdt_phys;
+		const u64 *blob_phys;
+		const u64 *blob_size;
+		void *blob;
 
-		fdt_phys = fdt_getprop(fdt, child, KHO_FDT_SUB_TREE_PROP_NAME, &len);
-		if (!fdt_phys)
+		blob_phys = fdt_getprop(fdt, child,
+					KHO_SUB_TREE_PROP_NAME, &len);
+		if (!blob_phys)
 			continue;
-		if (len != sizeof(*fdt_phys)) {
-			pr_warn("node %s prop fdt has invalid length: %d\n",
-				name, len);
+		if (len != sizeof(*blob_phys)) {
+			pr_warn("node %s prop %s has invalid length: %d\n",
+				name, KHO_SUB_TREE_PROP_NAME, len);
 			continue;
 		}
-		err = __kho_debugfs_fdt_add(&dbg->fdt_list, sub_fdt_dir, name,
-					    phys_to_virt(*fdt_phys));
+
+		blob_size = fdt_getprop(fdt, child,
+					KHO_SUB_TREE_SIZE_PROP_NAME, &len);
+		if (!blob_size || len != sizeof(*blob_size)) {
+			pr_warn("node %s missing or invalid %s property\n",
+				name, KHO_SUB_TREE_SIZE_PROP_NAME);
+			continue;
+		}
+
+		blob = phys_to_virt(*blob_phys);
+		err = __kho_debugfs_blob_add(&dbg->fdt_list, sub_fdt_dir, name,
+					     blob, *blob_size);
 		if (err) {
-			pr_warn("failed to add fdt %s to debugfs: %pe\n", name,
-				ERR_PTR(err));
+			pr_warn("failed to add blob %s to debugfs: %pe\n",
+				name, ERR_PTR(err));
 			continue;
 		}
 	}

+8 -7

kernel/liveupdate/kexec_handover_internal.h

···
 int kho_debugfs_init(void);
 void kho_in_debugfs_init(struct kho_debugfs *dbg, const void *fdt);
 int kho_out_debugfs_init(struct kho_debugfs *dbg);
-int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name,
-			const void *fdt, bool root);
-void kho_debugfs_fdt_remove(struct kho_debugfs *dbg, void *fdt);
+int kho_debugfs_blob_add(struct kho_debugfs *dbg, const char *name,
+			 const void *blob, size_t size, bool root);
+void kho_debugfs_blob_remove(struct kho_debugfs *dbg, void *blob);
 #else
 static inline int kho_debugfs_init(void) { return 0; }
 static inline void kho_in_debugfs_init(struct kho_debugfs *dbg,
 				       const void *fdt) { }
 static inline int kho_out_debugfs_init(struct kho_debugfs *dbg) { return 0; }
-static inline int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name,
-				      const void *fdt, bool root) { return 0; }
-static inline void kho_debugfs_fdt_remove(struct kho_debugfs *dbg,
-					  void *fdt) { }
+static inline int kho_debugfs_blob_add(struct kho_debugfs *dbg,
+				       const char *name, const void *blob,
+				       size_t size, bool root) { return 0; }
+static inline void kho_debugfs_blob_remove(struct kho_debugfs *dbg,
+					   void *blob) { }
 #endif /* CONFIG_KEXEC_HANDOVER_DEBUGFS */
 
 #ifdef CONFIG_KEXEC_HANDOVER_DEBUG

+9 -2

kernel/liveupdate/luo_core.c

···
 #include <linux/liveupdate.h>
 #include <linux/miscdevice.h>
 #include <linux/mm.h>
+#include <linux/rwsem.h>
 #include <linux/sizes.h>
 #include <linux/string.h>
 #include <linux/unaligned.h>
···
 	void *fdt_in;
 	u64 liveupdate_num;
 } luo_global;
+
+/*
+ * luo_register_rwlock - Protects registration of file handlers and FLBs.
+ */
+DECLARE_RWSEM(luo_register_rwlock);
 
 static int __init early_liveupdate_param(char *buf)
 {
···
 	}
 
 	/* Retrieve LUO subtree, and verify its format. */
-	err = kho_retrieve_subtree(LUO_FDT_KHO_ENTRY_NAME, &fdt_phys);
+	err = kho_retrieve_subtree(LUO_FDT_KHO_ENTRY_NAME, &fdt_phys, NULL);
 	if (err) {
 		if (err != -ENOENT) {
 			pr_err("failed to retrieve FDT '%s' from KHO: %pe\n",
···
 	if (err)
 		goto exit_free;
 
-	err = kho_add_subtree(LUO_FDT_KHO_ENTRY_NAME, fdt_out);
+	err = kho_add_subtree(LUO_FDT_KHO_ENTRY_NAME, fdt_out,
+			      fdt_totalsize(fdt_out));
 	if (err)
 		goto exit_free;
 	luo_global.fdt_out = fdt_out;

+56 -56

kernel/liveupdate/luo_file.c

···
 #include <linux/liveupdate.h>
 #include <linux/module.h>
 #include <linux/sizes.h>
+#include <linux/xarray.h>
 #include <linux/slab.h>
 #include <linux/string.h>
 #include "luo_internal.h"
 
 static LIST_HEAD(luo_file_handler_list);
+
+/* Keep track of files being preserved by LUO */
+static DEFINE_XARRAY(luo_preserved_files);
 
 /* 2 4K pages, give space for 128 files per file_set */
 #define LUO_FILE_PGCNT 2ul
···
 	file_set->files = NULL;
 }
 
+static unsigned long luo_get_id(struct liveupdate_file_handler *fh,
+				struct file *file)
+{
+	return fh->ops->get_id ? fh->ops->get_id(file) : (unsigned long)file;
+}
+
 static bool luo_token_is_used(struct luo_file_set *file_set, u64 token)
 {
 	struct luo_file *iter;
···
  * Context: Can be called from an ioctl handler during normal system operation.
  * Return: 0 on success. Returns a negative errno on failure:
  * -EEXIST if the token is already used.
+ * -EBUSY if the file descriptor is already preserved by another session.
  * -EBADF if the file descriptor is invalid.
  * -ENOSPC if the file_set is full.
  * -ENOENT if no compatible handler is found.
···
 		goto err_fput;
 
 	err = -ENOENT;
+	down_read(&luo_register_rwlock);
 	list_private_for_each_entry(fh, &luo_file_handler_list, list) {
 		if (fh->ops->can_preserve(fh, file)) {
-			err = 0;
+			if (try_module_get(fh->ops->owner))
+				err = 0;
 			break;
 		}
 	}
+	up_read(&luo_register_rwlock);
 
 	/* err is still -ENOENT if no handler was found */
 	if (err)
 		goto err_free_files_mem;
 
+	err = xa_insert(&luo_preserved_files, luo_get_id(fh, file),
+			file, GFP_KERNEL);
+	if (err)
+		goto err_module_put;
+
 	err = luo_flb_file_preserve(fh);
 	if (err)
-		goto err_free_files_mem;
+		goto err_erase_xa;
 
 	luo_file = kzalloc_obj(*luo_file);
 	if (!luo_file) {
···
 	kfree(luo_file);
 err_flb_unpreserve:
 	luo_flb_file_unpreserve(fh);
+err_erase_xa:
+	xa_erase(&luo_preserved_files, luo_get_id(fh, file));
+err_module_put:
+	module_put(fh->ops->owner);
 err_free_files_mem:
 	luo_free_files_mem(file_set);
 err_fput:
···
 		args.private_data = luo_file->private_data;
 		luo_file->fh->ops->unpreserve(&args);
 		luo_flb_file_unpreserve(luo_file->fh);
+		module_put(luo_file->fh->ops->owner);
 
+		xa_erase(&luo_preserved_files,
+			 luo_get_id(luo_file->fh, luo_file->file));
 		list_del(&luo_file->list);
 		file_set->count--;
 
···
 	luo_file->file = args.file;
 	/* Get reference so we can keep this file in LUO until finish */
 	get_file(luo_file->file);
+
+	WARN_ON(xa_insert(&luo_preserved_files,
+			  luo_get_id(luo_file->fh, luo_file->file),
+			  luo_file->file, GFP_KERNEL));
+
 	*filep = luo_file->file;
 	luo_file->retrieve_status = 1;
 
···
 
 	luo_file->fh->ops->finish(&args);
 	luo_flb_file_finish(luo_file->fh);
+	module_put(luo_file->fh->ops->owner);
 }
 
 /**
···
 
 		luo_file_finish_one(file_set, luo_file);
 
-		if (luo_file->file)
+		if (luo_file->file) {
+			xa_erase(&luo_preserved_files,
+				 luo_get_id(luo_file->fh, luo_file->file));
 			fput(luo_file->file);
+		}
 		list_del(&luo_file->list);
 		file_set->count--;
 		mutex_destroy(&luo_file->mutex);
···
 		bool handler_found = false;
 		struct luo_file *luo_file;
 
+		down_read(&luo_register_rwlock);
 		list_private_for_each_entry(fh, &luo_file_handler_list, list) {
 			if (!strcmp(fh->compatible, file_ser[i].compatible)) {
-				handler_found = true;
+				if (try_module_get(fh->ops->owner))
+					handler_found = true;
 				break;
 			}
 		}
+		up_read(&luo_register_rwlock);
 
 		if (!handler_found) {
-			pr_warn("No registered handler for compatible '%s'\n",
+			pr_warn("No registered handler for compatible '%.*s'\n",
+				(int)sizeof(file_ser[i].compatible),
 				file_ser[i].compatible);
 			return -ENOENT;
 		}
 
 		luo_file = kzalloc_obj(*luo_file);
-		if (!luo_file)
+		if (!luo_file) {
+			module_put(fh->ops->owner);
 			return -ENOMEM;
+		}
 
 		luo_file->fh = fh;
 		luo_file->file = NULL;
···
 		return -EINVAL;
 	}
 
-	/*
-	 * Ensure the system is quiescent (no active sessions).
-	 * This prevents registering new handlers while sessions are active or
-	 * while deserialization is in progress.
-	 */
-	if (!luo_session_quiesce())
-		return -EBUSY;
-
+	down_write(&luo_register_rwlock);
 	/* Check for duplicate compatible strings */
 	list_private_for_each_entry(fh_iter, &luo_file_handler_list, list) {
 		if (!strcmp(fh_iter->compatible, fh->compatible)) {
 			pr_err("File handler registration failed: Compatible string '%s' already registered.\n",
 			       fh->compatible);
 			err = -EEXIST;
-			goto err_resume;
+			goto err_unlock;
 		}
-	}
-
-	/* Pin the module implementing the handler */
-	if (!try_module_get(fh->ops->owner)) {
-		err = -EAGAIN;
-		goto err_resume;
 	}
 
 	INIT_LIST_HEAD(&ACCESS_PRIVATE(fh, flb_list));
 	INIT_LIST_HEAD(&ACCESS_PRIVATE(fh, list));
 	list_add_tail(&ACCESS_PRIVATE(fh, list), &luo_file_handler_list);
-	luo_session_resume();
+	up_write(&luo_register_rwlock);
 
 	liveupdate_test_register(fh);
 
 	return 0;
 
-err_resume:
-	luo_session_resume();
+err_unlock:
+	up_write(&luo_register_rwlock);
 	return err;
 }
 
···
  *
  * Unregisters the file handler from the liveupdate core. This function
  * reverses the operations of liveupdate_register_file_handler().
- *
- * It ensures safe removal by checking that:
- * No live update session is currently in progress.
- * No FLB registered with this file handler.
- *
- * If the unregistration fails, the internal test state is reverted.
- *
- * Return: 0 Success. -EOPNOTSUPP when live update is not enabled. -EBUSY A live
- * update is in progress, can't quiesce live update or FLB is registred with
- * this file handler.
  */
-int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh)
+void liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh)
 {
-	int err = -EBUSY;
-
 	if (!liveupdate_enabled())
-		return -EOPNOTSUPP;
+		return;
 
-	liveupdate_test_unregister(fh);
-
-	if (!luo_session_quiesce())
-		goto err_register;
-
-	if (!list_empty(&ACCESS_PRIVATE(fh, flb_list)))
-		goto err_resume;
-
+	guard(rwsem_write)(&luo_register_rwlock);
+	luo_flb_unregister_all(fh);
 	list_del(&ACCESS_PRIVATE(fh, list));
-	module_put(fh->ops->owner);
-	luo_session_resume();
-
-	return 0;
-
-err_resume:
-	luo_session_resume();
-err_register:
-	liveupdate_test_register(fh);
-	return err;
 }

+97 -85

kernel/liveupdate/luo_flb.c

··· 89 89 static struct luo_flb_private *luo_flb_get_private(struct liveupdate_flb *flb) 90 90 { 91 91 struct luo_flb_private *private = &ACCESS_PRIVATE(flb, private); 92 + static DEFINE_SPINLOCK(luo_flb_init_lock); 92 93 94 + if (smp_load_acquire(&private->initialized)) 95 + return private; 96 + 97 + guard(spinlock)(&luo_flb_init_lock); 93 98 if (!private->initialized) { 94 99 mutex_init(&private->incoming.lock); 95 100 mutex_init(&private->outgoing.lock); 96 101 INIT_LIST_HEAD(&private->list); 97 102 private->users = 0; 98 - private->initialized = true; 103 + smp_store_release(&private->initialized, true); 99 104 } 100 105 101 106 return private; ··· 115 110 struct liveupdate_flb_op_args args = {0}; 116 111 int err; 117 112 113 + if (!try_module_get(flb->ops->owner)) 114 + return -ENODEV; 115 + 118 116 args.flb = flb; 119 117 err = flb->ops->preserve(&args); 120 - if (err) 118 + if (err) { 119 + module_put(flb->ops->owner); 121 120 return err; 121 + } 122 122 private->outgoing.data = args.data; 123 123 private->outgoing.obj = args.obj; 124 124 } ··· 151 141 152 142 private->outgoing.data = 0; 153 143 private->outgoing.obj = NULL; 144 + module_put(flb->ops->owner); 154 145 } 155 146 } 156 147 } ··· 187 176 if (!found) 188 177 return -ENOENT; 189 178 179 + if (!try_module_get(flb->ops->owner)) 180 + return -ENODEV; 181 + 190 182 args.flb = flb; 191 183 args.data = private->incoming.data; 192 184 193 185 err = flb->ops->retrieve(&args); 194 - if (err) 186 + if (err) { 187 + module_put(flb->ops->owner); 195 188 return err; 189 + } 196 190 197 191 private->incoming.obj = args.obj; 198 192 private->incoming.retrieved = true; ··· 231 215 private->incoming.data = 0; 232 216 private->incoming.obj = NULL; 233 217 private->incoming.finished = true; 218 + module_put(flb->ops->owner); 234 219 } 235 220 } 236 221 } ··· 257 240 struct luo_flb_link *iter; 258 241 int err = 0; 259 242 243 + down_read(&luo_register_rwlock); 260 244 list_for_each_entry(iter, flb_list, list) { 261 
245 err = luo_flb_file_preserve_one(iter->flb); 262 246 if (err) 263 247 goto exit_err; 264 248 } 249 + up_read(&luo_register_rwlock); 265 250 266 251 return 0; 267 252 268 253 exit_err: 269 254 list_for_each_entry_continue_reverse(iter, flb_list, list) 270 255 luo_flb_file_unpreserve_one(iter->flb); 256 + up_read(&luo_register_rwlock); 271 257 272 258 return err; 273 259 } ··· 292 272 struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); 293 273 struct luo_flb_link *iter; 294 274 275 + guard(rwsem_read)(&luo_register_rwlock); 295 276 list_for_each_entry_reverse(iter, flb_list, list) 296 277 luo_flb_file_unpreserve_one(iter->flb); 297 278 } ··· 313 292 struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); 314 293 struct luo_flb_link *iter; 315 294 295 + guard(rwsem_read)(&luo_register_rwlock); 316 296 list_for_each_entry_reverse(iter, flb_list, list) 317 297 luo_flb_file_finish_one(iter->flb); 298 + } 299 + 300 + static void luo_flb_unregister_one(struct liveupdate_file_handler *fh, 301 + struct liveupdate_flb *flb) 302 + { 303 + struct luo_flb_private *private = luo_flb_get_private(flb); 304 + struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); 305 + struct luo_flb_link *iter; 306 + bool found = false; 307 + 308 + /* Find and remove the link from the file handler's list */ 309 + list_for_each_entry(iter, flb_list, list) { 310 + if (iter->flb == flb) { 311 + list_del(&iter->list); 312 + kfree(iter); 313 + found = true; 314 + break; 315 + } 316 + } 317 + 318 + if (!found) { 319 + pr_warn("Failed to unregister FLB '%s': not found in file handler '%s'\n", 320 + flb->compatible, fh->compatible); 321 + return; 322 + } 323 + 324 + private->users--; 325 + 326 + /* 327 + * If this is the last file-handler with which we are registred, remove 328 + * from the global list. 
329 + */ 330 + if (!private->users) { 331 + list_del_init(&private->list); 332 + luo_flb_global.count--; 333 + } 334 + } 335 + 336 + /** 337 + * luo_flb_unregister_all - Unregister all FLBs associated with a file handler. 338 + * @fh: The file handler whose FLBs should be unregistered. 339 + * 340 + * This function iterates through the list of FLBs associated with the given 341 + * file handler and unregisters them all one by one. 342 + */ 343 + void luo_flb_unregister_all(struct liveupdate_file_handler *fh) 344 + { 345 + struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); 346 + struct luo_flb_link *iter, *tmp; 347 + 348 + if (!liveupdate_enabled()) 349 + return; 350 + 351 + lockdep_assert_held_write(&luo_register_rwlock); 352 + list_for_each_entry_safe(iter, tmp, flb_list, list) 353 + luo_flb_unregister_one(fh, iter->flb); 318 354 } 319 355 320 356 /** ··· 404 326 struct luo_flb_link *link __free(kfree) = NULL; 405 327 struct liveupdate_flb *gflb; 406 328 struct luo_flb_link *iter; 407 - int err; 408 329 409 330 if (!liveupdate_enabled()) 410 331 return -EOPNOTSUPP; ··· 424 347 if (!link) 425 348 return -ENOMEM; 426 349 427 - /* 428 - * Ensure the system is quiescent (no active sessions). 429 - * This acts as a global lock for registration: no other thread can 430 - * be in this section, and no sessions can be creating/using FDs. 
431 - */ 432 - if (!luo_session_quiesce()) 433 - return -EBUSY; 350 + guard(rwsem_write)(&luo_register_rwlock); 434 351 435 352 /* Check that this FLB is not already linked to this file handler */ 436 - err = -EEXIST; 437 353 list_for_each_entry(iter, flb_list, list) { 438 354 if (iter->flb == flb) 439 - goto err_resume; 355 + return -EEXIST; 440 356 } 441 357 442 358 /* ··· 437 367 * is registered 438 368 */ 439 369 if (!private->users) { 440 - if (WARN_ON(!list_empty(&private->list))) { 441 - err = -EINVAL; 442 - goto err_resume; 443 - } 370 + if (WARN_ON(!list_empty(&private->list))) 371 + return -EINVAL; 444 372 445 - if (luo_flb_global.count == LUO_FLB_MAX) { 446 - err = -ENOSPC; 447 - goto err_resume; 448 - } 373 + if (luo_flb_global.count == LUO_FLB_MAX) 374 + return -ENOSPC; 449 375 450 376 /* Check that compatible string is unique in global list */ 451 377 list_private_for_each_entry(gflb, &luo_flb_global.list, private.list) { 452 378 if (!strcmp(gflb->compatible, flb->compatible)) 453 - goto err_resume; 454 - } 455 - 456 - if (!try_module_get(flb->ops->owner)) { 457 - err = -EAGAIN; 458 - goto err_resume; 379 + return -EEXIST; 459 380 } 460 381 461 382 list_add_tail(&private->list, &luo_flb_global.list); ··· 457 396 private->users++; 458 397 link->flb = flb; 459 398 list_add_tail(&no_free_ptr(link)->list, flb_list); 460 - luo_session_resume(); 461 399 462 400 return 0; 463 - 464 - err_resume: 465 - luo_session_resume(); 466 - return err; 467 401 } 468 402 469 403 /** ··· 474 418 * the FLB is removed from the global registry and the reference to its 475 419 * owner module (acquired during registration) is released. 476 420 * 477 - * Context: This function ensures the session is quiesced (no active FDs 478 - * being created) during the update. It is typically called from a 479 - * subsystem's module exit function. 480 - * Return: 0 on success. 481 - * -EOPNOTSUPP if live update is disabled. 
482 - * -EBUSY if the live update session is active and cannot be quiesced. 483 - * -ENOENT if the FLB was not found in the file handler's list. 421 + * Context: It is typically called from a subsystem's module exit function. 484 422 */ 485 - int liveupdate_unregister_flb(struct liveupdate_file_handler *fh, 486 - struct liveupdate_flb *flb) 423 + void liveupdate_unregister_flb(struct liveupdate_file_handler *fh, 424 + struct liveupdate_flb *flb) 487 425 { 488 - struct luo_flb_private *private = luo_flb_get_private(flb); 489 - struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); 490 - struct luo_flb_link *iter; 491 - int err = -ENOENT; 492 - 493 426 if (!liveupdate_enabled()) 494 - return -EOPNOTSUPP; 427 + return; 495 428 496 - /* 497 - * Ensure the system is quiescent (no active sessions). 498 - * This acts as a global lock for unregistration. 499 - */ 500 - if (!luo_session_quiesce()) 501 - return -EBUSY; 429 + guard(rwsem_write)(&luo_register_rwlock); 502 430 503 - /* Find and remove the link from the file handler's list */ 504 - list_for_each_entry(iter, flb_list, list) { 505 - if (iter->flb == flb) { 506 - list_del(&iter->list); 507 - kfree(iter); 508 - err = 0; 509 - break; 510 - } 511 - } 512 - 513 - if (err) 514 - goto err_resume; 515 - 516 - private->users--; 517 - /* 518 - * If this is the last file-handler with which we are registred, remove 519 - * from the global list, and relese module reference. 520 - */ 521 - if (!private->users) { 522 - list_del_init(&private->list); 523 - luo_flb_global.count--; 524 - module_put(flb->ops->owner); 525 - } 526 - 527 - luo_session_resume(); 528 - 529 - return 0; 530 - 531 - err_resume: 532 - luo_session_resume(); 533 - return err; 431 + luo_flb_unregister_one(fh, flb); 534 432 } 535 433 536 434 /** ··· 502 492 * 503 493 * Return: 0 on success, or a negative errno on failure. 
-ENODATA means no 504 494 * incoming FLB data, -ENOENT means specific flb not found in the incoming 505 - * data, and -EOPNOTSUPP when live update is disabled or not configured. 495 + * data, -ENODEV if the FLB's module is unloading, and -EOPNOTSUPP when 496 + * live update is disabled or not configured. 506 497 */ 507 498 int liveupdate_flb_get_incoming(struct liveupdate_flb *flb, void **objp) 508 499 { ··· 649 638 struct liveupdate_flb *gflb; 650 639 int i = 0; 651 640 641 + guard(rwsem_read)(&luo_register_rwlock); 652 642 list_private_for_each_entry(gflb, &luo_flb_global.list, private.list) { 653 643 struct luo_flb_private *private = luo_flb_get_private(gflb); 654 644
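
Note on the luo_flb_get_private() hunk above: the change turns lazy initialization into a double-checked pattern. The fast path is a lock-free acquire load of the initialized flag, the slow path runs under a lock, and the flag is published with a release store only after every field is set up. The following is a minimal userspace sketch of that pattern, assuming C11 atomics in place of smp_load_acquire()/smp_store_release() and a simple atomic_flag spinlock in place of the kernel spinlock. The names lazy_state, lazy_state_get and init_runs are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct luo_flb_private; not the kernel type. */
struct lazy_state {
	atomic_bool initialized;
	int users;
	int init_runs;	/* counts slow-path runs, for illustration only */
};

/* Assumption: a trivial spinlock replaces the kernel's DEFINE_SPINLOCK(). */
static atomic_flag lazy_init_lock = ATOMIC_FLAG_INIT;

static void lazy_lock(void)
{
	while (atomic_flag_test_and_set_explicit(&lazy_init_lock,
						 memory_order_acquire))
		;	/* spin until the flag is released */
}

static void lazy_unlock(void)
{
	atomic_flag_clear_explicit(&lazy_init_lock, memory_order_release);
}

static struct lazy_state *lazy_state_get(struct lazy_state *s)
{
	/* Fast path: the acquire load pairs with the release store below. */
	if (atomic_load_explicit(&s->initialized, memory_order_acquire))
		return s;

	lazy_lock();
	/* Re-check under the lock: another thread may have initialized it. */
	if (!atomic_load_explicit(&s->initialized, memory_order_relaxed)) {
		s->users = 0;
		s->init_runs++;
		/* Publish only after all fields are set up. */
		atomic_store_explicit(&s->initialized, true,
				      memory_order_release);
	}
	lazy_unlock();

	return s;
}
```

Calling lazy_state_get() twice on the same zero-initialized object runs the slow path once; later callers return from the fast path without touching the lock.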

+3 -4

kernel/liveupdate/luo_internal.h

··· 77 77 struct mutex mutex; 78 78 }; 79 79 80 + extern struct rw_semaphore luo_register_rwlock; 81 + 80 82 int luo_session_create(const char *name, struct file **filep); 81 83 int luo_session_retrieve(const char *name, struct file **filep); 82 84 int __init luo_session_setup_outgoing(void *fdt); 83 85 int __init luo_session_setup_incoming(void *fdt); 84 86 int luo_session_serialize(void); 85 87 int luo_session_deserialize(void); 86 - bool luo_session_quiesce(void); 87 - void luo_session_resume(void); 88 88 89 89 int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd); 90 90 void luo_file_unpreserve_files(struct luo_file_set *file_set); ··· 103 103 int luo_flb_file_preserve(struct liveupdate_file_handler *fh); 104 104 void luo_flb_file_unpreserve(struct liveupdate_file_handler *fh); 105 105 void luo_flb_file_finish(struct liveupdate_file_handler *fh); 106 + void luo_flb_unregister_all(struct liveupdate_file_handler *fh); 106 107 int __init luo_flb_setup_outgoing(void *fdt); 107 108 int __init luo_flb_setup_incoming(void *fdt); 108 109 void luo_flb_serialize(void); 109 110 110 111 #ifdef CONFIG_LIVEUPDATE_TEST 111 112 void liveupdate_test_register(struct liveupdate_file_handler *fh); 112 - void liveupdate_test_unregister(struct liveupdate_file_handler *fh); 113 113 #else 114 114 static inline void liveupdate_test_register(struct liveupdate_file_handler *fh) { } 115 - static inline void liveupdate_test_unregister(struct liveupdate_file_handler *fh) { } 116 115 #endif 117 116 118 117 #endif /* _LINUX_LUO_INTERNAL_H */

+2 -44

kernel/liveupdate/luo_session.c

··· 544 544 545 545 session = luo_session_alloc(sh->ser[i].name); 546 546 if (IS_ERR(session)) { 547 - pr_warn("Failed to allocate session [%s] during deserialization %pe\n", 547 + pr_warn("Failed to allocate session [%.*s] during deserialization %pe\n", 548 + (int)sizeof(sh->ser[i].name), 548 549 sh->ser[i].name, session); 549 550 return PTR_ERR(session); 550 551 } ··· 607 606 return err; 608 607 } 609 608 610 - /** 611 - * luo_session_quiesce - Ensure no active sessions exist and lock session lists. 612 - * 613 - * Acquires exclusive write locks on both incoming and outgoing session lists. 614 - * It then validates no sessions exist in either list. 615 - * 616 - * This mechanism is used during file handler un/registration to ensure that no 617 - * sessions are currently using the handler, and no new sessions can be created 618 - * while un/registration is in progress. 619 - * 620 - * This prevents registering new handlers while sessions are active or 621 - * while deserialization is in progress. 622 - * 623 - * Return: 624 - * true - System is quiescent (0 sessions) and locked. 625 - * false - Active sessions exist. The locks are released internally. 626 - */ 627 - bool luo_session_quiesce(void) 628 - { 629 - down_write(&luo_session_global.incoming.rwsem); 630 - down_write(&luo_session_global.outgoing.rwsem); 631 - 632 - if (luo_session_global.incoming.count || 633 - luo_session_global.outgoing.count) { 634 - up_write(&luo_session_global.outgoing.rwsem); 635 - up_write(&luo_session_global.incoming.rwsem); 636 - return false; 637 - } 638 - 639 - return true; 640 - } 641 - 642 - /** 643 - * luo_session_resume - Unlock session lists and resume normal activity. 644 - * 645 - * Releases the exclusive locks acquired by a successful call to 646 - * luo_session_quiesce(). 647 - */ 648 - void luo_session_resume(void) 649 - { 650 - up_write(&luo_session_global.outgoing.rwsem); 651 - up_write(&luo_session_global.incoming.rwsem); 652 - }

+109

lib/alloc_tag.c

··· 6 6 #include <linux/kallsyms.h> 7 7 #include <linux/module.h> 8 8 #include <linux/page_ext.h> 9 + #include <linux/pgalloc_tag.h> 9 10 #include <linux/proc_fs.h> 11 + #include <linux/rcupdate.h> 10 12 #include <linux/seq_buf.h> 11 13 #include <linux/seq_file.h> 12 14 #include <linux/string_choices.h> ··· 760 758 return mem_profiling_support; 761 759 } 762 760 761 + #ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG 762 + /* 763 + * Track page allocations before page_ext is initialized. 764 + * Some pages are allocated before page_ext becomes available, leaving 765 + * their codetag uninitialized. Track these early PFNs so we can clear 766 + * their codetag refs later to avoid warnings when they are freed. 767 + * 768 + * Early allocations include: 769 + * - Base allocations independent of CPU count 770 + * - Per-CPU allocations (e.g., CPU hotplug callbacks during smp_init, 771 + * such as trace ring buffers, scheduler per-cpu data) 772 + * 773 + * For simplicity, we fix the size to 8192. 774 + * If insufficient, a warning will be triggered to alert the user. 775 + * 776 + * TODO: Replace fixed-size array with dynamic allocation using 777 + * a GFP flag similar to ___GFP_NO_OBJ_EXT to avoid recursion. 
778 + */ 779 + #define EARLY_ALLOC_PFN_MAX 8192 780 + 781 + static unsigned long early_pfns[EARLY_ALLOC_PFN_MAX] __initdata; 782 + static atomic_t early_pfn_count __initdata = ATOMIC_INIT(0); 783 + 784 + static void __init __alloc_tag_add_early_pfn(unsigned long pfn) 785 + { 786 + int old_idx, new_idx; 787 + 788 + do { 789 + old_idx = atomic_read(&early_pfn_count); 790 + if (old_idx >= EARLY_ALLOC_PFN_MAX) { 791 + pr_warn_once("Early page allocations before page_ext init exceeded EARLY_ALLOC_PFN_MAX (%d)\n", 792 + EARLY_ALLOC_PFN_MAX); 793 + return; 794 + } 795 + new_idx = old_idx + 1; 796 + } while (!atomic_try_cmpxchg(&early_pfn_count, &old_idx, new_idx)); 797 + 798 + early_pfns[old_idx] = pfn; 799 + } 800 + 801 + typedef void alloc_tag_add_func(unsigned long pfn); 802 + static alloc_tag_add_func __rcu *alloc_tag_add_early_pfn_ptr __refdata = 803 + RCU_INITIALIZER(__alloc_tag_add_early_pfn); 804 + 805 + void alloc_tag_add_early_pfn(unsigned long pfn) 806 + { 807 + alloc_tag_add_func *alloc_tag_add; 808 + 809 + if (static_key_enabled(&mem_profiling_compressed)) 810 + return; 811 + 812 + rcu_read_lock(); 813 + alloc_tag_add = rcu_dereference(alloc_tag_add_early_pfn_ptr); 814 + if (alloc_tag_add) 815 + alloc_tag_add(pfn); 816 + rcu_read_unlock(); 817 + } 818 + 819 + static void __init clear_early_alloc_pfn_tag_refs(void) 820 + { 821 + unsigned int i; 822 + 823 + if (static_key_enabled(&mem_profiling_compressed)) 824 + return; 825 + 826 + rcu_assign_pointer(alloc_tag_add_early_pfn_ptr, NULL); 827 + /* Make sure we are not racing with __alloc_tag_add_early_pfn() */ 828 + synchronize_rcu(); 829 + 830 + for (i = 0; i < atomic_read(&early_pfn_count); i++) { 831 + unsigned long pfn = early_pfns[i]; 832 + 833 + if (pfn_valid(pfn)) { 834 + struct page *page = pfn_to_page(pfn); 835 + union pgtag_ref_handle handle; 836 + union codetag_ref ref; 837 + 838 + if (get_page_tag_ref(page, &ref, &handle)) { 839 + /* 840 + * An early-allocated page could be freed and reallocated 841 + 
* after its page_ext is initialized but before we clear it. 842 + * In that case, it already has a valid tag set. 843 + * We should not overwrite that valid tag with CODETAG_EMPTY. 844 + * 845 + * Note: there is still a small race window between checking 846 + * ref.ct and calling set_codetag_empty(). We accept this 847 + * race as it's unlikely and the extra complexity of atomic 848 + * cmpxchg is not worth it for this debug-only code path. 849 + */ 850 + if (ref.ct) { 851 + put_page_tag_ref(handle); 852 + continue; 853 + } 854 + 855 + set_codetag_empty(&ref); 856 + update_page_tag_ref(handle, &ref); 857 + put_page_tag_ref(handle); 858 + } 859 + } 860 + 861 + } 862 + } 863 + #else /* !CONFIG_MEM_ALLOC_PROFILING_DEBUG */ 864 + static inline void __init clear_early_alloc_pfn_tag_refs(void) {} 865 + #endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */ 866 + 763 867 static __init void init_page_alloc_tagging(void) 764 868 { 869 + clear_early_alloc_pfn_tag_refs(); 765 870 } 766 871 767 872 struct page_ext_operations page_alloc_tagging_ops = {
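
Note on __alloc_tag_add_early_pfn() above: the function reserves an index in a fixed-size array with a compare-and-exchange loop, so concurrent callers never claim the same slot and never write past the bound. The following is a minimal userspace sketch of that reservation step, assuming C11 atomics in place of atomic_try_cmpxchg(). SLOT_MAX, slots, slot_count and slot_add are hypothetical names, and the bound is kept small for illustration (the patch uses 8192).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SLOT_MAX 4	/* hypothetical small bound; the patch uses 8192 */

static unsigned long slots[SLOT_MAX];
static atomic_int slot_count;

/* Returns true if @val was recorded, false if the array is already full. */
static bool slot_add(unsigned long val)
{
	int old_idx = atomic_load(&slot_count);

	do {
		if (old_idx >= SLOT_MAX)
			return false;
		/* On failure, old_idx is refreshed with the current count. */
	} while (!atomic_compare_exchange_weak(&slot_count, &old_idx,
					       old_idx + 1));

	/* Index old_idx is now owned exclusively by this caller. */
	slots[old_idx] = val;
	return true;
}
```

The index is reserved before the slot is written, so a reader that walks the array up to the counter could in principle observe a reserved slot that has not been filled yet. In the patch, the reader first clears the function pointer and calls synchronize_rcu(), which appears to be what rules out that overlap; the sketch does not model that step.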

+77 -53

lib/test_hmm.c

··· 185 185 return 0; 186 186 } 187 187 188 + static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk) 189 + { 190 + unsigned long start_pfn = chunk->pagemap.range.start >> PAGE_SHIFT; 191 + unsigned long end_pfn = chunk->pagemap.range.end >> PAGE_SHIFT; 192 + unsigned long npages = end_pfn - start_pfn + 1; 193 + unsigned long i; 194 + unsigned long *src_pfns; 195 + unsigned long *dst_pfns; 196 + unsigned int order = 0; 197 + 198 + src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL); 199 + dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL); 200 + 201 + migrate_device_range(src_pfns, start_pfn, npages); 202 + for (i = 0; i < npages; i++) { 203 + struct page *dpage, *spage; 204 + 205 + spage = migrate_pfn_to_page(src_pfns[i]); 206 + if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE)) 207 + continue; 208 + 209 + if (WARN_ON(!is_device_private_page(spage) && 210 + !is_device_coherent_page(spage))) 211 + continue; 212 + 213 + order = folio_order(page_folio(spage)); 214 + spage = BACKING_PAGE(spage); 215 + if (src_pfns[i] & MIGRATE_PFN_COMPOUND) { 216 + dpage = folio_page(folio_alloc(GFP_HIGHUSER_MOVABLE, 217 + order), 0); 218 + } else { 219 + dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL); 220 + order = 0; 221 + } 222 + 223 + /* TODO Support splitting here */ 224 + lock_page(dpage); 225 + dst_pfns[i] = migrate_pfn(page_to_pfn(dpage)); 226 + if (src_pfns[i] & MIGRATE_PFN_WRITE) 227 + dst_pfns[i] |= MIGRATE_PFN_WRITE; 228 + if (order) 229 + dst_pfns[i] |= MIGRATE_PFN_COMPOUND; 230 + folio_copy(page_folio(dpage), page_folio(spage)); 231 + } 232 + migrate_device_pages(src_pfns, dst_pfns, npages); 233 + migrate_device_finalize(src_pfns, dst_pfns, npages); 234 + kvfree(src_pfns); 235 + kvfree(dst_pfns); 236 + } 237 + 188 238 static int dmirror_fops_release(struct inode *inode, struct file *filp) 189 239 { 190 240 struct dmirror *dmirror = filp->private_data; 241 + struct dmirror_device *mdevice = 
dmirror->mdevice; 242 + int i; 191 243 192 244 mmu_interval_notifier_remove(&dmirror->notifier); 245 + 246 + if (mdevice->devmem_chunks) { 247 + for (i = 0; i < mdevice->devmem_count; i++) { 248 + struct dmirror_chunk *devmem = 249 + mdevice->devmem_chunks[i]; 250 + 251 + dmirror_device_evict_chunk(devmem); 252 + } 253 + } 254 + 193 255 xa_destroy(&dmirror->pt); 194 256 kfree(dmirror); 195 257 return 0; ··· 1439 1377 return ret; 1440 1378 } 1441 1379 1442 - static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk) 1443 - { 1444 - unsigned long start_pfn = chunk->pagemap.range.start >> PAGE_SHIFT; 1445 - unsigned long end_pfn = chunk->pagemap.range.end >> PAGE_SHIFT; 1446 - unsigned long npages = end_pfn - start_pfn + 1; 1447 - unsigned long i; 1448 - unsigned long *src_pfns; 1449 - unsigned long *dst_pfns; 1450 - unsigned int order = 0; 1451 - 1452 - src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL); 1453 - dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL); 1454 - 1455 - migrate_device_range(src_pfns, start_pfn, npages); 1456 - for (i = 0; i < npages; i++) { 1457 - struct page *dpage, *spage; 1458 - 1459 - spage = migrate_pfn_to_page(src_pfns[i]); 1460 - if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE)) 1461 - continue; 1462 - 1463 - if (WARN_ON(!is_device_private_page(spage) && 1464 - !is_device_coherent_page(spage))) 1465 - continue; 1466 - 1467 - order = folio_order(page_folio(spage)); 1468 - spage = BACKING_PAGE(spage); 1469 - if (src_pfns[i] & MIGRATE_PFN_COMPOUND) { 1470 - dpage = folio_page(folio_alloc(GFP_HIGHUSER_MOVABLE, 1471 - order), 0); 1472 - } else { 1473 - dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL); 1474 - order = 0; 1475 - } 1476 - 1477 - /* TODO Support splitting here */ 1478 - lock_page(dpage); 1479 - dst_pfns[i] = migrate_pfn(page_to_pfn(dpage)); 1480 - if (src_pfns[i] & MIGRATE_PFN_WRITE) 1481 - dst_pfns[i] |= MIGRATE_PFN_WRITE; 1482 - if (order) 1483 - dst_pfns[i] |= 
MIGRATE_PFN_COMPOUND; 1484 - folio_copy(page_folio(dpage), page_folio(spage)); 1485 - } 1486 - migrate_device_pages(src_pfns, dst_pfns, npages); 1487 - migrate_device_finalize(src_pfns, dst_pfns, npages); 1488 - kvfree(src_pfns); 1489 - kvfree(dst_pfns); 1490 - } 1491 - 1492 1380 /* Removes free pages from the free list so they can't be re-allocated */ 1493 1381 static void dmirror_remove_free_pages(struct dmirror_chunk *devmem) 1494 1382 { ··· 1738 1726 .folio_split = dmirror_devmem_folio_split, 1739 1727 }; 1740 1728 1729 + static void dmirror_device_release(struct device *dev) 1730 + { 1731 + struct dmirror_device *mdevice = container_of(dev, struct dmirror_device, device); 1732 + 1733 + dmirror_device_remove_chunks(mdevice); 1734 + } 1735 + 1741 1736 static int dmirror_device_init(struct dmirror_device *mdevice, int id) 1742 1737 { 1743 1738 dev_t dev; ··· 1756 1737 1757 1738 cdev_init(&mdevice->cdevice, &dmirror_fops); 1758 1739 mdevice->cdevice.owner = THIS_MODULE; 1740 + mdevice->device.release = dmirror_device_release; 1741 + 1759 1742 device_initialize(&mdevice->device); 1760 1743 mdevice->device.devt = dev; 1761 1744 ··· 1765 1744 if (ret) 1766 1745 goto put_device; 1767 1746 1747 + /* Build a list of free ZONE_DEVICE struct pages */ 1748 + ret = dmirror_allocate_chunk(mdevice, NULL, false); 1749 + if (ret) 1750 + goto put_device; 1751 + 1768 1752 ret = cdev_device_add(&mdevice->cdevice, &mdevice->device); 1769 1753 if (ret) 1770 1754 goto put_device; 1771 1755 1772 - /* Build a list of free ZONE_DEVICE struct pages */ 1773 - return dmirror_allocate_chunk(mdevice, NULL, false); 1756 + return 0; 1774 1757 1775 1758 put_device: 1776 1759 put_device(&mdevice->device); ··· 1783 1758 1784 1759 static void dmirror_device_remove(struct dmirror_device *mdevice) 1785 1760 { 1786 - dmirror_device_remove_chunks(mdevice); 1787 1761 cdev_device_del(&mdevice->cdevice, &mdevice->device); 1788 1762 put_device(&mdevice->device); 1789 1763 }

+3 -2

lib/test_kho.c

··· 143 143 if (err) 144 144 goto err_unpreserve_data; 145 145 146 - err = kho_add_subtree(KHO_TEST_FDT, folio_address(state->fdt)); 146 + err = kho_add_subtree(KHO_TEST_FDT, folio_address(state->fdt), 147 + fdt_totalsize(folio_address(state->fdt))); 147 148 if (err) 148 149 goto err_unpreserve_data; 149 150 ··· 319 318 if (!kho_is_enabled()) 320 319 return 0; 321 320 322 - err = kho_retrieve_subtree(KHO_TEST_FDT, &fdt_phys); 321 + err = kho_retrieve_subtree(KHO_TEST_FDT, &fdt_phys, NULL); 323 322 if (!err) { 324 323 err = kho_test_restore(fdt_phys); 325 324 if (err)

-18

lib/tests/liveupdate.c

··· 135 135 TEST_NFLBS, fh->compatible); 136 136 } 137 137 138 - void liveupdate_test_unregister(struct liveupdate_file_handler *fh) 139 - { 140 - int err, i; 141 - 142 - for (i = 0; i < TEST_NFLBS; i++) { 143 - struct liveupdate_flb *flb = &test_flbs[i]; 144 - 145 - err = liveupdate_unregister_flb(fh, flb); 146 - if (err) { 147 - pr_err("Failed to unregister %s %pe\n", 148 - flb->compatible, ERR_PTR(err)); 149 - } 150 - } 151 - 152 - pr_info("Unregistered %d FLBs from file handler: [%s]\n", 153 - TEST_NFLBS, fh->compatible); 154 - } 155 - 156 138 MODULE_LICENSE("GPL"); 157 139 MODULE_AUTHOR("Pasha Tatashin <pasha.tatashin@soleen.com>"); 158 140 MODULE_DESCRIPTION("In-kernel test for LUO mechanism");

+11

mm/Kconfig.debug

··· 297 297 298 298 If unsure, say Y. 299 299 300 + config DEBUG_KMEMLEAK_VERBOSE 301 + bool "Default kmemleak to verbose mode" 302 + depends on DEBUG_KMEMLEAK_AUTO_SCAN 303 + help 304 + Say Y here to have kmemleak print unreferenced object details 305 + (backtrace, hex dump, address) to dmesg when new memory leaks are 306 + detected during automatic scanning. This can also be toggled at 307 + runtime via /sys/module/kmemleak/parameters/verbose. 308 + 309 + If unsure, say N. 310 + 300 311 config PER_VMA_LOCK_STATS 301 312 bool "Statistics for per-vma locks" 302 313 depends on PER_VMA_LOCK

+30 -13

mm/compaction.c

··· 518 518 return true; 519 519 } 520 520 521 + static struct lruvec * 522 + compact_folio_lruvec_lock_irqsave(struct folio *folio, unsigned long *flags, 523 + struct compact_control *cc) 524 + { 525 + struct lruvec *lruvec; 526 + 527 + rcu_read_lock(); 528 + retry: 529 + lruvec = folio_lruvec(folio); 530 + compact_lock_irqsave(&lruvec->lru_lock, flags, cc); 531 + if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) { 532 + spin_unlock_irqrestore(&lruvec->lru_lock, *flags); 533 + goto retry; 534 + } 535 + 536 + return lruvec; 537 + } 538 + 521 539 /* 522 540 * Compaction requires the taking of some coarse locks that are potentially 523 541 * very heavily contended. The lock should be periodically unlocked to avoid ··· 857 839 { 858 840 pg_data_t *pgdat = cc->zone->zone_pgdat; 859 841 unsigned long nr_scanned = 0, nr_isolated = 0; 860 - struct lruvec *lruvec; 842 + struct lruvec *lruvec = NULL; 861 843 unsigned long flags = 0; 862 844 struct lruvec *locked = NULL; 863 845 struct folio *folio = NULL; ··· 931 913 */ 932 914 if (!(low_pfn % COMPACT_CLUSTER_MAX)) { 933 915 if (locked) { 934 - unlock_page_lruvec_irqrestore(locked, flags); 916 + lruvec_unlock_irqrestore(locked, flags); 935 917 locked = NULL; 936 918 } 937 919 ··· 982 964 } 983 965 /* for alloc_contig case */ 984 966 if (locked) { 985 - unlock_page_lruvec_irqrestore(locked, flags); 967 + lruvec_unlock_irqrestore(locked, flags); 986 968 locked = NULL; 987 969 } 988 970 ··· 1071 1053 if (unlikely(page_has_movable_ops(page)) && 1072 1054 !PageMovableOpsIsolated(page)) { 1073 1055 if (locked) { 1074 - unlock_page_lruvec_irqrestore(locked, flags); 1056 + lruvec_unlock_irqrestore(locked, flags); 1075 1057 locked = NULL; 1076 1058 } 1077 1059 ··· 1171 1153 if (!folio_test_clear_lru(folio)) 1172 1154 goto isolate_fail_put; 1173 1155 1174 - lruvec = folio_lruvec(folio); 1156 + if (locked) 1157 + lruvec = folio_lruvec(folio); 1175 1158 1176 1159 /* If we already hold the lock, we can skip some rechecking */ 
1177 - if (lruvec != locked) { 1160 + if (lruvec != locked || !locked) { 1178 1161 if (locked) 1179 - unlock_page_lruvec_irqrestore(locked, flags); 1162 + lruvec_unlock_irqrestore(locked, flags); 1180 1163 1181 - compact_lock_irqsave(&lruvec->lru_lock, &flags, cc); 1164 + lruvec = compact_folio_lruvec_lock_irqsave(folio, &flags, cc); 1182 1165 locked = lruvec; 1183 - 1184 - lruvec_memcg_debug(lruvec, folio); 1185 1166 1186 1167 /* 1187 1168 * Try get exclusive access under lock. If marked for ··· 1243 1226 isolate_fail_put: 1244 1227 /* Avoid potential deadlock in freeing page under lru_lock */ 1245 1228 if (locked) { 1246 - unlock_page_lruvec_irqrestore(locked, flags); 1229 + lruvec_unlock_irqrestore(locked, flags); 1247 1230 locked = NULL; 1248 1231 } 1249 1232 folio_put(folio); ··· 1259 1242 */ 1260 1243 if (nr_isolated) { 1261 1244 if (locked) { 1262 - unlock_page_lruvec_irqrestore(locked, flags); 1245 + lruvec_unlock_irqrestore(locked, flags); 1263 1246 locked = NULL; 1264 1247 } 1265 1248 putback_movable_pages(&cc->migratepages); ··· 1291 1274 1292 1275 isolate_abort: 1293 1276 if (locked) 1294 - unlock_page_lruvec_irqrestore(locked, flags); 1277 + lruvec_unlock_irqrestore(locked, flags); 1295 1278 if (folio) { 1296 1279 folio_set_lru(folio); 1297 1280 folio_put(folio);
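
Note on compact_folio_lruvec_lock_irqsave() above: the helper follows a lock-then-revalidate loop. It looks up the lruvec for the folio, takes that lruvec's lock, then checks that the folio still belongs to it; if the association changed in between, it drops the lock and retries. The following is a minimal userspace sketch of that loop, assuming C11 atomics and an atomic_flag spinlock. The types bucket and item and the function item_lock_owner are hypothetical, and the sketch ignores object lifetime, which the kernel code handles with RCU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins: bucket plays the lruvec, item plays the folio. */
struct bucket {
	atomic_flag lock;
	int id;
};

struct item {
	_Atomic(struct bucket *) owner;	/* may be changed by other threads */
};

/* Returns the item's current owner with that owner's lock held. */
static struct bucket *item_lock_owner(struct item *it)
{
	for (;;) {
		struct bucket *b = atomic_load(&it->owner);

		while (atomic_flag_test_and_set_explicit(&b->lock,
							 memory_order_acquire))
			;	/* spin until the lock is free */

		/* Revalidate: the owner may have changed before we locked. */
		if (atomic_load(&it->owner) == b)
			return b;

		/* Stale owner: drop the wrong lock and look it up again. */
		atomic_flag_clear_explicit(&b->lock, memory_order_release);
	}
}
```

Once the function returns, the caller holds the lock of the bucket that owned the item at the time of the check. The sketch assumes, as the kernel code appears to, that ownership only changes under that same lock, which is what makes the check stable afterwards.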

+49 -39

mm/damon/core.c

···
 	return pid;
 }

-/*
- * damon_call_handle_inactive_ctx() - handle DAMON call request that added to
- * an inactive context.
- * @ctx: The inactive DAMON context.
- * @control: Control variable of the call request.
- *
- * This function is called in a case that @control is added to @ctx but @ctx is
- * not running (inactive). See if @ctx handled @control or not, and cleanup
- * @control if it was not handled.
- *
- * Returns 0 if @control was handled by @ctx, negative error code otherwise.
- */
-static int damon_call_handle_inactive_ctx(
-		struct damon_ctx *ctx, struct damon_call_control *control)
-{
-	struct damon_call_control *c;
-
-	mutex_lock(&ctx->call_controls_lock);
-	list_for_each_entry(c, &ctx->call_controls, list) {
-		if (c == control) {
-			list_del(&control->list);
-			mutex_unlock(&ctx->call_controls_lock);
-			return -EINVAL;
-		}
-	}
-	mutex_unlock(&ctx->call_controls_lock);
-	return 0;
-}
-
 /**
  * damon_call() - Invoke a given function on DAMON worker thread (kdamond).
  * @ctx: DAMON context to call the function for.
···
  * synchronization. The return value of the function will be saved in
  * &damon_call_control->return_code.
  *
+ * Note that this function should be called only after damon_start() with the
+ * @ctx has succeeded. Otherwise, this function could fall into an indefinite
+ * wait.
+ *
  * Return: 0 on success, negative error code otherwise.
  */
 int damon_call(struct damon_ctx *ctx, struct damon_call_control *control)
···
 	INIT_LIST_HEAD(&control->list);

 	mutex_lock(&ctx->call_controls_lock);
+	if (ctx->call_controls_obsolete) {
+		mutex_unlock(&ctx->call_controls_lock);
+		return -ECANCELED;
+	}
 	list_add_tail(&control->list, &ctx->call_controls);
 	mutex_unlock(&ctx->call_controls_lock);
-	if (!damon_is_running(ctx))
-		return damon_call_handle_inactive_ctx(ctx, control);
 	if (control->repeat)
 		return 0;
 	wait_for_completion(&control->completion);
···
  * passed at least one &damos->apply_interval_us, kdamond marks the request as
  * completed so that damos_walk() can wakeup and return.
  *
+ * Note that this function should be called only after damon_start() with the
+ * @ctx has succeeded. Otherwise, this function could fall into an indefinite
+ * wait.
+ *
  * Return: 0 on success, negative error code otherwise.
  */
 int damos_walk(struct damon_ctx *ctx, struct damos_walk_control *control)
···
 	init_completion(&control->completion);
 	control->canceled = false;
 	mutex_lock(&ctx->walk_control_lock);
+	if (ctx->walk_control_obsolete) {
+		mutex_unlock(&ctx->walk_control_lock);
+		return -ECANCELED;
+	}
 	if (ctx->walk_control) {
 		mutex_unlock(&ctx->walk_control_lock);
 		return -EBUSY;
 	}
 	ctx->walk_control = control;
 	mutex_unlock(&ctx->walk_control_lock);
-	if (!damon_is_running(ctx)) {
-		mutex_lock(&ctx->walk_control_lock);
-		if (ctx->walk_control == control)
-			ctx->walk_control = NULL;
-		mutex_unlock(&ctx->walk_control_lock);
-		return -EINVAL;
-	}
 	wait_for_completion(&control->completion);
 	if (control->canceled)
 		return -ECANCELED;
···
 #endif /* CONFIG_PSI */

 #ifdef CONFIG_NUMA
+static bool invalid_mem_node(int nid)
+{
+	return nid < 0 || nid >= MAX_NUMNODES || !node_state(nid, N_MEMORY);
+}
+
 static __kernel_ulong_t damos_get_node_mem_bp(
 		struct damos_quota_goal *goal)
 {
 	struct sysinfo i;
 	__kernel_ulong_t numerator;
+
+	if (invalid_mem_node(goal->nid)) {
+		if (goal->metric == DAMOS_QUOTA_NODE_MEM_USED_BP)
+			return 0;
+		else /* DAMOS_QUOTA_NODE_MEM_FREE_BP */
+			return 10000;
+	}

 	si_meminfo_node(&i, goal->nid);
 	if (goal->metric == DAMOS_QUOTA_NODE_MEM_USED_BP)
···
 	struct lruvec *lruvec;
 	unsigned long used_pages, numerator;
 	struct sysinfo i;
+
+	if (invalid_mem_node(goal->nid)) {
+		if (goal->metric == DAMOS_QUOTA_NODE_MEMCG_USED_BP)
+			return 0;
+		else /* DAMOS_QUOTA_NODE_MEMCG_FREE_BP */
+			return 10000;
+	}

 	memcg = mem_cgroup_get_from_id(goal->memcg_id);
 	if (!memcg) {
···
 	}

 	/* New charge window starts */
-	if (time_after_eq(jiffies, quota->charged_from +
+	if (!time_in_range_open(jiffies, quota->charged_from,
+				quota->charged_from +
 				msecs_to_jiffies(quota->reset_interval))) {
 		if (damos_quota_is_set(quota) &&
 				quota->charged_sz >= quota->esz)
···

 	pr_debug("kdamond (%d) starts\n", current->pid);

+	mutex_lock(&ctx->call_controls_lock);
+	ctx->call_controls_obsolete = false;
+	mutex_unlock(&ctx->call_controls_lock);
+	mutex_lock(&ctx->walk_control_lock);
+	ctx->walk_control_obsolete = false;
+	mutex_unlock(&ctx->walk_control_lock);
 	complete(&ctx->kdamond_started);
 	kdamond_init_ctx(ctx);

···
 	damon_destroy_targets(ctx);

 	kfree(ctx->regions_score_histogram);
+	mutex_lock(&ctx->call_controls_lock);
+	ctx->call_controls_obsolete = true;
+	mutex_unlock(&ctx->call_controls_lock);
 	kdamond_call(ctx, true);
+	mutex_lock(&ctx->walk_control_lock);
+	ctx->walk_control_obsolete = true;
+	mutex_unlock(&ctx->walk_control_lock);
 	damos_walk_cancel(ctx);

 	pr_debug("kdamond (%d) finishes\n", current->pid);
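
Note on the charge-window hunk in mm/damon/core.c: time_in_range_open(a, b, c) is true when a lies in the half-open interval [b, c), so the negated form starts a new window not only once the reset interval has elapsed, but also when jiffies compares as earlier than quota->charged_from. The userspace sketch below illustrates only that difference. It assumes wrap-safe signed-difference comparisons in the style of the kernel's jiffies helpers; the names after_eq, before, window_expired_old, window_expired_new and quota_window_selfcheck are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins for wrap-safe tick comparisons. */
static bool after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

static bool before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Old check: a new window starts only once now >= start + interval. */
static bool window_expired_old(unsigned long now, unsigned long start,
			       unsigned long interval)
{
	return after_eq(now, start + interval);
}

/* New check: a new window starts whenever now is outside [start, start + interval). */
static bool window_expired_new(unsigned long now, unsigned long start,
			       unsigned long interval)
{
	return !(after_eq(now, start) && before(now, start + interval));
}

/* Exercise the cases where the two checks agree and the one where they differ. */
static void quota_window_selfcheck(void)
{
	/* Inside the window: neither check starts a new window. */
	assert(!window_expired_old(1050UL, 1000UL, 100UL));
	assert(!window_expired_new(1050UL, 1000UL, 100UL));

	/* Exactly at the end of the window: both start a new one. */
	assert(window_expired_old(1100UL, 1000UL, 100UL));
	assert(window_expired_new(1100UL, 1000UL, 100UL));

	/* now compares as earlier than start: only the new check fires. */
	assert(!window_expired_old(900UL, 1000UL, 100UL));
	assert(window_expired_new(900UL, 1000UL, 100UL));

	/* Window end wraps past ULONG_MAX: still inside, neither fires. */
	assert(!window_expired_old(50UL, ULONG_MAX - 10UL, 100UL));
	assert(!window_expired_new(50UL, ULONG_MAX - 10UL, 100UL));
}
```

Calling quota_window_selfcheck() from a main() runs all four cases; only the third one, where the current tick compares as earlier than the window start, distinguishes the two checks.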

+4 -1

mm/damon/stat.c

···
 	if (!damon_stat_context)
 		return -ENOMEM;
 	err = damon_start(&damon_stat_context, 1, true);
-	if (err)
+	if (err) {
+		damon_destroy_ctx(damon_stat_context);
+		damon_stat_context = NULL;
 		return err;
+	}

 	damon_stat_last_refresh_jiffies = jiffies;
 	call_control.data = damon_stat_context;

+19 -3

mm/huge_memory.c

···

 static struct deferred_split *folio_split_queue_lock(struct folio *folio)
 {
-	return split_queue_lock(folio_nid(folio), folio_memcg(folio));
+	struct deferred_split *queue;
+
+	rcu_read_lock();
+	queue = split_queue_lock(folio_nid(folio), folio_memcg(folio));
+	/*
+	 * The memcg destruction path is acquiring the split queue lock for
+	 * reparenting. Once you have it locked, it's safe to drop the rcu lock.
+	 */
+	rcu_read_unlock();
+
+	return queue;
 }

 static struct deferred_split *
 folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
 {
-	return split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
+	struct deferred_split *queue;
+
+	rcu_read_lock();
+	queue = split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
+	rcu_read_unlock();
+
+	return queue;
 }

 static inline void split_queue_unlock(struct deferred_split *queue)
···
 	folio_ref_unfreeze(folio, folio_cache_ref_count(folio) + 1);

 	if (do_lru)
-		unlock_page_lruvec(lruvec);
+		lruvec_unlock(lruvec);

 	if (ci)
 		swap_cluster_unlock(ci);

+18

mm/hugetlb.c

···
 	size_t len;
 	char *p;

+	if (!s)
+		return -EINVAL;
+
 	if (hugetlb_param_index >= HUGE_MAX_CMDLINE_ARGS)
 		return -EINVAL;

···
 	return 0;
 }

+#ifdef CONFIG_USERFAULTFD
+static bool hugetlb_can_userfault(struct vm_area_struct *vma,
+				  vm_flags_t vm_flags)
+{
+	return true;
+}
+
+static const struct vm_uffd_ops hugetlb_uffd_ops = {
+	.can_userfault = hugetlb_can_userfault,
+};
+#endif
+
 /*
  * When a new function is introduced to vm_operations_struct and added
  * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
···
 	.close = hugetlb_vm_op_close,
 	.may_split = hugetlb_vm_op_split,
 	.pagesize = hugetlb_vm_op_pagesize,
+#ifdef CONFIG_USERFAULTFD
+	.uffd_ops = &hugetlb_uffd_ops,
+#endif
 };

 static pte_t make_huge_pte(struct vm_area_struct *vma, struct folio *folio,

+1 -1

mm/kmemleak.c

···
 /* If there are leaks that can be reported */
 static bool kmemleak_found_leaks;

-static bool kmemleak_verbose;
+static bool kmemleak_verbose = IS_ENABLED(CONFIG_DEBUG_KMEMLEAK_VERBOSE);
 module_param_named(verbose, kmemleak_verbose, bool, 0600);

 static void kmemleak_disable(void);

+2 -2

mm/memblock.c

···
 	if (err)
 		goto err_unpreserve_fdt;

-	err = kho_add_subtree(MEMBLOCK_KHO_FDT, fdt);
+	err = kho_add_subtree(MEMBLOCK_KHO_FDT, fdt, fdt_totalsize(fdt));
 	if (err)
 		goto err_unpreserve_fdt;

···
 	if (fdt)
 		return fdt;

-	err = kho_retrieve_subtree(MEMBLOCK_KHO_FDT, &fdt_phys);
+	err = kho_retrieve_subtree(MEMBLOCK_KHO_FDT, &fdt_phys, NULL);
 	if (err) {
 		if (err != -ENOENT)
 			pr_warn("failed to retrieve FDT '%s' from KHO: %d\n",

+25 -6

mm/memcontrol-v1.c

···
 void memcg1_swapout(struct folio *folio, swp_entry_t entry)
 {
 	struct mem_cgroup *memcg, *swap_memcg;
+	struct obj_cgroup *objcg;
 	unsigned int nr_entries;

 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
···
 	if (!do_memsw_account())
 		return;

-	memcg = folio_memcg(folio);
-
-	VM_WARN_ON_ONCE_FOLIO(!memcg, folio);
-	if (!memcg)
+	objcg = folio_objcg(folio);
+	VM_WARN_ON_ONCE_FOLIO(!objcg, folio);
+	if (!objcg)
 		return;

+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
 	/*
 	 * In case the memcg owning these pages has been offlined and doesn't
 	 * have an ID allocated to it anymore, charge the closest online
···
 	folio_unqueue_deferred_split(folio);
 	folio->memcg_data = 0;

-	if (!mem_cgroup_is_root(memcg))
+	if (!obj_cgroup_is_root(objcg))
 		page_counter_uncharge(&memcg->memory, nr_entries);

 	if (memcg != swap_memcg) {
···
 	preempt_enable_nested();
 	memcg1_check_events(memcg, folio_nid(folio));

-	css_put(&memcg->css);
+	rcu_read_unlock();
+	obj_cgroup_put(objcg);
 }

 /*
···
 	PGFAULT,
 	PGMAJFAULT,
 };
+
+void reparent_memcg1_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++)
+		reparent_memcg_state_local(memcg, parent, memcg1_stats[i]);
+}
+
+void reparent_memcg1_lruvec_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+	int i;
+
+	for (i = 0; i < NR_LRU_LISTS; i++)
+		reparent_memcg_lruvec_state_local(memcg, parent, i);
+}

 void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 {

+7

mm/memcontrol-v1.h

···
 		unsigned long nr_memory, int nid);

 void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
+void reparent_memcg1_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent);
+void reparent_memcg1_lruvec_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent);
+
+void reparent_memcg_state_local(struct mem_cgroup *memcg,
+				struct mem_cgroup *parent, int idx);
+void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg,
+				       struct mem_cgroup *parent, int idx);

 void memcg1_account_kmem(struct mem_cgroup *memcg, int nr_pages);
 static inline bool memcg1_tcpmem_active(struct mem_cgroup *memcg)

+425 -215

mm/memcontrol.c

··· 206 206 return objcg; 207 207 } 208 208 209 - static void memcg_reparent_objcgs(struct mem_cgroup *memcg, 210 - struct mem_cgroup *parent) 209 + static inline struct obj_cgroup *__memcg_reparent_objcgs(struct mem_cgroup *memcg, 210 + struct mem_cgroup *parent, 211 + int nid) 211 212 { 212 213 struct obj_cgroup *objcg, *iter; 214 + struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid]; 215 + struct mem_cgroup_per_node *parent_pn = parent->nodeinfo[nid]; 213 216 214 - objcg = rcu_replace_pointer(memcg->objcg, NULL, true); 215 - 216 - spin_lock_irq(&objcg_lock); 217 - 217 + objcg = rcu_replace_pointer(pn->objcg, NULL, true); 218 218 /* 1) Ready to reparent active objcg. */ 219 - list_add(&objcg->list, &memcg->objcg_list); 219 + list_add(&objcg->list, &pn->objcg_list); 220 220 /* 2) Reparent active objcg and already reparented objcgs to parent. */ 221 - list_for_each_entry(iter, &memcg->objcg_list, list) 221 + list_for_each_entry(iter, &pn->objcg_list, list) 222 222 WRITE_ONCE(iter->memcg, parent); 223 223 /* 3) Move already reparented objcgs to the parent's list */ 224 - list_splice(&memcg->objcg_list, &parent->objcg_list); 224 + list_splice(&pn->objcg_list, &parent_pn->objcg_list); 225 225 226 + return objcg; 227 + } 228 + 229 + #ifdef CONFIG_MEMCG_V1 230 + static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force); 231 + 232 + static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent) 233 + { 234 + if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) 235 + return; 236 + 237 + /* 238 + * Reparent stats exposed non-hierarchically. Flush @memcg's stats first 239 + * to read its stats accurately , and conservatively flush @parent's 240 + * stats after reparenting to avoid hiding a potentially large stat 241 + * update (e.g. from callers of mem_cgroup_flush_stats_ratelimited()). 242 + */ 243 + __mem_cgroup_flush_stats(memcg, true); 244 + 245 + /* The following counts are all non-hierarchical and need to be reparented. 
*/ 246 + reparent_memcg1_state_local(memcg, parent); 247 + reparent_memcg1_lruvec_state_local(memcg, parent); 248 + 249 + __mem_cgroup_flush_stats(parent, true); 250 + } 251 + #else 252 + static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent) 253 + { 254 + } 255 + #endif 256 + 257 + static inline void reparent_locks(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid) 258 + { 259 + spin_lock_irq(&objcg_lock); 260 + spin_lock_nested(&mem_cgroup_lruvec(memcg, NODE_DATA(nid))->lru_lock, 1); 261 + spin_lock_nested(&mem_cgroup_lruvec(parent, NODE_DATA(nid))->lru_lock, 2); 262 + } 263 + 264 + static inline void reparent_unlocks(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid) 265 + { 266 + spin_unlock(&mem_cgroup_lruvec(parent, NODE_DATA(nid))->lru_lock); 267 + spin_unlock(&mem_cgroup_lruvec(memcg, NODE_DATA(nid))->lru_lock); 226 268 spin_unlock_irq(&objcg_lock); 269 + } 227 270 228 - percpu_ref_kill(&objcg->refcnt); 271 + static void memcg_reparent_objcgs(struct mem_cgroup *memcg) 272 + { 273 + struct obj_cgroup *objcg; 274 + struct mem_cgroup *parent = parent_mem_cgroup(memcg); 275 + int nid; 276 + 277 + for_each_node(nid) { 278 + retry: 279 + if (lru_gen_enabled()) 280 + max_lru_gen_memcg(parent, nid); 281 + 282 + reparent_locks(memcg, parent, nid); 283 + 284 + if (lru_gen_enabled()) { 285 + if (!recheck_lru_gen_max_memcg(parent, nid)) { 286 + reparent_unlocks(memcg, parent, nid); 287 + cond_resched(); 288 + goto retry; 289 + } 290 + lru_gen_reparent_memcg(memcg, parent, nid); 291 + } else { 292 + lru_reparent_memcg(memcg, parent, nid); 293 + } 294 + 295 + objcg = __memcg_reparent_objcgs(memcg, parent, nid); 296 + 297 + reparent_unlocks(memcg, parent, nid); 298 + 299 + percpu_ref_kill(&objcg->refcnt); 300 + } 301 + 302 + reparent_state_local(memcg, parent); 229 303 } 230 304 231 305 /* ··· 315 241 EXPORT_SYMBOL(memcg_bpf_enabled_key); 316 242 317 243 /** 318 - * mem_cgroup_css_from_folio - css of the memcg 
associated with a folio 244 + * get_mem_cgroup_css_from_folio - acquire a css of the memcg associated with a folio 319 245 * @folio: folio of interest 320 246 * 321 247 * If memcg is bound to the default hierarchy, css of the memcg associated ··· 325 251 * If memcg is bound to a traditional hierarchy, the css of root_mem_cgroup 326 252 * is returned. 327 253 */ 328 - struct cgroup_subsys_state *mem_cgroup_css_from_folio(struct folio *folio) 254 + struct cgroup_subsys_state *get_mem_cgroup_css_from_folio(struct folio *folio) 329 255 { 330 - struct mem_cgroup *memcg = folio_memcg(folio); 256 + struct mem_cgroup *memcg; 331 257 332 - if (!memcg || !cgroup_subsys_on_dfl(memory_cgrp_subsys)) 333 - memcg = root_mem_cgroup; 258 + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) 259 + return &root_mem_cgroup->css; 334 260 335 - return &memcg->css; 261 + memcg = get_mem_cgroup_from_folio(folio); 262 + 263 + return memcg ? &memcg->css : &root_mem_cgroup->css; 336 264 } 337 265 338 266 /** ··· 525 449 return x; 526 450 } 527 451 452 + #ifdef CONFIG_MEMCG_V1 453 + static void __mod_memcg_lruvec_state(struct mem_cgroup_per_node *pn, 454 + enum node_stat_item idx, long val); 455 + 456 + void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg, 457 + struct mem_cgroup *parent, int idx) 458 + { 459 + int nid; 460 + 461 + for_each_node(nid) { 462 + struct lruvec *child_lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid)); 463 + struct lruvec *parent_lruvec = mem_cgroup_lruvec(parent, NODE_DATA(nid)); 464 + unsigned long value = lruvec_page_state_local(child_lruvec, idx); 465 + struct mem_cgroup_per_node *child_pn, *parent_pn; 466 + 467 + child_pn = container_of(child_lruvec, struct mem_cgroup_per_node, lruvec); 468 + parent_pn = container_of(parent_lruvec, struct mem_cgroup_per_node, lruvec); 469 + 470 + __mod_memcg_lruvec_state(child_pn, idx, -value); 471 + __mod_memcg_lruvec_state(parent_pn, idx, value); 472 + } 473 + } 474 + #endif 475 + 528 476 /* Subset of vm_event_item to 
report for memcg event stats */ 529 477 static const unsigned int memcg_vm_event_stat[] = { 530 478 #ifdef CONFIG_MEMCG_V1 ··· 608 508 609 509 struct memcg_vmstats_percpu { 610 510 /* Stats updates since the last flush */ 611 - unsigned int stats_updates; 511 + unsigned long stats_updates; 612 512 613 513 /* Cached pointers for fast iteration in memcg_rstat_updated() */ 614 514 struct memcg_vmstats_percpu __percpu *parent_pcpu; ··· 639 539 unsigned long events_pending[NR_MEMCG_EVENTS]; 640 540 641 541 /* Stats updates since the last flush */ 642 - atomic_t stats_updates; 542 + atomic_long_t stats_updates; 643 543 }; 644 544 645 545 /* ··· 665 565 666 566 static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats) 667 567 { 668 - return atomic_read(&vmstats->stats_updates) > 568 + return atomic_long_read(&vmstats->stats_updates) > 669 569 MEMCG_CHARGE_BATCH * num_online_cpus(); 670 570 } 671 571 672 - static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val, 572 + static inline void memcg_rstat_updated(struct mem_cgroup *memcg, long val, 673 573 int cpu) 674 574 { 675 575 struct memcg_vmstats_percpu __percpu *statc_pcpu; 676 576 struct memcg_vmstats_percpu *statc; 677 - unsigned int stats_updates; 577 + unsigned long stats_updates; 678 578 679 579 if (!val) 680 580 return; ··· 697 597 continue; 698 598 699 599 stats_updates = this_cpu_xchg(statc_pcpu->stats_updates, 0); 700 - atomic_add(stats_updates, &statc->vmstats->stats_updates); 600 + atomic_long_add(stats_updates, &statc->vmstats->stats_updates); 701 601 } 702 602 } 703 603 ··· 705 605 { 706 606 bool needs_flush = memcg_vmstats_needs_flush(memcg->vmstats); 707 607 708 - trace_memcg_flush_stats(memcg, atomic_read(&memcg->vmstats->stats_updates), 608 + trace_memcg_flush_stats(memcg, atomic_long_read(&memcg->vmstats->stats_updates), 709 609 force, needs_flush); 710 610 711 611 if (!force && !needs_flush) ··· 784 684 * Normalize the value passed into memcg_rstat_updated() to be in pages. 
Round 785 685 * up non-zero sub-page updates to 1 page as zero page updates are ignored. 786 686 */ 787 - static int memcg_state_val_in_pages(int idx, int val) 687 + static long memcg_state_val_in_pages(int idx, long val) 788 688 { 789 689 int unit = memcg_page_state_unit(idx); 690 + long res; 790 691 791 692 if (!val || unit == PAGE_SIZE) 792 693 return val; 793 - else 794 - return max(val * unit / PAGE_SIZE, 1UL); 694 + 695 + /* Get the absolute value of (val * unit / PAGE_SIZE). */ 696 + res = mult_frac(abs(val), unit, PAGE_SIZE); 697 + /* Round up zero values. */ 698 + res = res ? : 1; 699 + 700 + return val < 0 ? -res : res; 701 + } 702 + 703 + #ifdef CONFIG_MEMCG_V1 704 + /* 705 + * Used in mod_memcg_state() and mod_memcg_lruvec_state() to avoid race with 706 + * reparenting of non-hierarchical state_locals. 707 + */ 708 + static inline struct mem_cgroup *get_non_dying_memcg_start(struct mem_cgroup *memcg) 709 + { 710 + if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) 711 + return memcg; 712 + 713 + rcu_read_lock(); 714 + 715 + while (memcg_is_dying(memcg)) 716 + memcg = parent_mem_cgroup(memcg); 717 + 718 + return memcg; 719 + } 720 + 721 + static inline void get_non_dying_memcg_end(void) 722 + { 723 + if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) 724 + return; 725 + 726 + rcu_read_unlock(); 727 + } 728 + #else 729 + static inline struct mem_cgroup *get_non_dying_memcg_start(struct mem_cgroup *memcg) 730 + { 731 + return memcg; 732 + } 733 + 734 + static inline void get_non_dying_memcg_end(void) 735 + { 736 + } 737 + #endif 738 + 739 + static void __mod_memcg_state(struct mem_cgroup *memcg, 740 + enum memcg_stat_item idx, long val) 741 + { 742 + int i = memcg_stats_index(idx); 743 + int cpu; 744 + 745 + if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx)) 746 + return; 747 + 748 + cpu = get_cpu(); 749 + 750 + this_cpu_add(memcg->vmstats_percpu->state[i], val); 751 + val = memcg_state_val_in_pages(idx, val); 752 + 
memcg_rstat_updated(memcg, val, cpu); 753 + 754 + trace_mod_memcg_state(memcg, idx, val); 755 + 756 + put_cpu(); 795 757 } 796 758 797 759 /** ··· 865 703 void mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx, 866 704 int val) 867 705 { 868 - int i = memcg_stats_index(idx); 869 - int cpu; 870 - 871 706 if (mem_cgroup_disabled()) 872 707 return; 873 708 874 - if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx)) 875 - return; 876 - 877 - cpu = get_cpu(); 878 - 879 - this_cpu_add(memcg->vmstats_percpu->state[i], val); 880 - val = memcg_state_val_in_pages(idx, val); 881 - memcg_rstat_updated(memcg, val, cpu); 882 - trace_mod_memcg_state(memcg, idx, val); 883 - 884 - put_cpu(); 709 + memcg = get_non_dying_memcg_start(memcg); 710 + __mod_memcg_state(memcg, idx, val); 711 + get_non_dying_memcg_end(); 885 712 } 886 713 887 714 #ifdef CONFIG_MEMCG_V1 ··· 890 739 #endif 891 740 return x; 892 741 } 742 + 743 + void reparent_memcg_state_local(struct mem_cgroup *memcg, 744 + struct mem_cgroup *parent, int idx) 745 + { 746 + unsigned long value = memcg_page_state_local(memcg, idx); 747 + 748 + __mod_memcg_state(memcg, idx, -value); 749 + __mod_memcg_state(parent, idx, value); 750 + } 893 751 #endif 894 752 895 - static void mod_memcg_lruvec_state(struct lruvec *lruvec, 896 - enum node_stat_item idx, 897 - int val) 753 + static void __mod_memcg_lruvec_state(struct mem_cgroup_per_node *pn, 754 + enum node_stat_item idx, long val) 898 755 { 899 - struct mem_cgroup_per_node *pn; 900 - struct mem_cgroup *memcg; 756 + struct mem_cgroup *memcg = pn->memcg; 901 757 int i = memcg_stats_index(idx); 902 758 int cpu; 903 759 904 760 if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx)) 905 761 return; 906 - 907 - pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); 908 - memcg = pn->memcg; 909 762 910 763 cpu = get_cpu(); 911 764 ··· 924 769 trace_mod_memcg_lruvec_state(memcg, idx, val); 925 770 926 771 put_cpu(); 
772 + } 773 + 774 + static void mod_memcg_lruvec_state(struct lruvec *lruvec, 775 + enum node_stat_item idx, 776 + int val) 777 + { 778 + struct pglist_data *pgdat = lruvec_pgdat(lruvec); 779 + struct mem_cgroup_per_node *pn; 780 + struct mem_cgroup *memcg; 781 + 782 + pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); 783 + memcg = get_non_dying_memcg_start(pn->memcg); 784 + pn = memcg->nodeinfo[pgdat->node_id]; 785 + 786 + __mod_memcg_lruvec_state(pn, idx, val); 787 + 788 + get_non_dying_memcg_end(); 927 789 } 928 790 929 791 /** ··· 1163 991 /** 1164 992 * get_mem_cgroup_from_folio - Obtain a reference on a given folio's memcg. 1165 993 * @folio: folio from which memcg should be extracted. 994 + * 995 + * See folio_memcg() for folio->objcg/memcg binding rules. 1166 996 */ 1167 997 struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio) 1168 998 { 1169 - struct mem_cgroup *memcg = folio_memcg(folio); 999 + struct mem_cgroup *memcg; 1170 1000 1171 1001 if (mem_cgroup_disabled()) 1172 1002 return NULL; 1173 1003 1004 + if (!folio_memcg_charged(folio)) 1005 + return root_mem_cgroup; 1006 + 1174 1007 rcu_read_lock(); 1175 - if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css))) 1176 - memcg = root_mem_cgroup; 1008 + do { 1009 + memcg = folio_memcg(folio); 1010 + } while (unlikely(!css_tryget(&memcg->css))); 1177 1011 rcu_read_unlock(); 1178 1012 return memcg; 1179 1013 } ··· 1376 1198 } 1377 1199 } 1378 1200 1379 - #ifdef CONFIG_DEBUG_VM 1380 - void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio) 1381 - { 1382 - struct mem_cgroup *memcg; 1383 - 1384 - if (mem_cgroup_disabled()) 1385 - return; 1386 - 1387 - memcg = folio_memcg(folio); 1388 - 1389 - if (!memcg) 1390 - VM_BUG_ON_FOLIO(!mem_cgroup_is_root(lruvec_memcg(lruvec)), folio); 1391 - else 1392 - VM_BUG_ON_FOLIO(lruvec_memcg(lruvec) != memcg, folio); 1393 - } 1394 - #endif 1395 - 1396 1201 /** 1397 1202 * folio_lruvec_lock - Lock the lruvec for a folio. 
1398 1203 * @folio: Pointer to the folio. ··· 1385 1224 * - folio_test_lru false 1386 1225 * - folio frozen (refcount of 0) 1387 1226 * 1388 - * Return: The lruvec this folio is on with its lock held. 1227 + * Return: The lruvec this folio is on with its lock held and rcu read lock held. 1389 1228 */ 1390 1229 struct lruvec *folio_lruvec_lock(struct folio *folio) 1391 1230 { 1392 - struct lruvec *lruvec = folio_lruvec(folio); 1231 + struct lruvec *lruvec; 1393 1232 1233 + rcu_read_lock(); 1234 + retry: 1235 + lruvec = folio_lruvec(folio); 1394 1236 spin_lock(&lruvec->lru_lock); 1395 - lruvec_memcg_debug(lruvec, folio); 1237 + if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) { 1238 + spin_unlock(&lruvec->lru_lock); 1239 + goto retry; 1240 + } 1396 1241 1397 1242 return lruvec; 1398 1243 } ··· 1413 1246 * - folio frozen (refcount of 0) 1414 1247 * 1415 1248 * Return: The lruvec this folio is on with its lock held and interrupts 1416 - * disabled. 1249 + * disabled and rcu read lock held. 1417 1250 */ 1418 1251 struct lruvec *folio_lruvec_lock_irq(struct folio *folio) 1419 1252 { 1420 - struct lruvec *lruvec = folio_lruvec(folio); 1253 + struct lruvec *lruvec; 1421 1254 1255 + rcu_read_lock(); 1256 + retry: 1257 + lruvec = folio_lruvec(folio); 1422 1258 spin_lock_irq(&lruvec->lru_lock); 1423 - lruvec_memcg_debug(lruvec, folio); 1259 + if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) { 1260 + spin_unlock_irq(&lruvec->lru_lock); 1261 + goto retry; 1262 + } 1424 1263 1425 1264 return lruvec; 1426 1265 } ··· 1442 1269 * - folio frozen (refcount of 0) 1443 1270 * 1444 1271 * Return: The lruvec this folio is on with its lock held and interrupts 1445 - * disabled. 1272 + * disabled and rcu read lock held. 
1446 1273 */ 1447 1274 struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio, 1448 1275 unsigned long *flags) 1449 1276 { 1450 - struct lruvec *lruvec = folio_lruvec(folio); 1277 + struct lruvec *lruvec; 1451 1278 1279 + rcu_read_lock(); 1280 + retry: 1281 + lruvec = folio_lruvec(folio); 1452 1282 spin_lock_irqsave(&lruvec->lru_lock, *flags); 1453 - lruvec_memcg_debug(lruvec, folio); 1283 + if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) { 1284 + spin_unlock_irqrestore(&lruvec->lru_lock, *flags); 1285 + goto retry; 1286 + } 1454 1287 1455 1288 return lruvec; 1456 1289 } ··· 1472 1293 * to or just after a page is removed from an lru list. 1473 1294 */ 1474 1295 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, 1475 - int zid, int nr_pages) 1296 + int zid, long nr_pages) 1476 1297 { 1477 1298 struct mem_cgroup_per_node *mz; 1478 1299 unsigned long *lru_size; ··· 1489 1310 1490 1311 size = *lru_size; 1491 1312 if (WARN_ONCE(size < 0, 1492 - "%s(%p, %d, %d): lru_size %ld\n", 1313 + "%s(%p, %d, %ld): lru_size %ld\n", 1493 1314 __func__, lruvec, lru, nr_pages, size)) { 1494 1315 VM_BUG_ON(1); 1495 1316 *lru_size = 0; ··· 2760 2581 return try_charge_memcg(memcg, gfp_mask, nr_pages); 2761 2582 } 2762 2583 2763 - static void commit_charge(struct folio *folio, struct mem_cgroup *memcg) 2584 + static void commit_charge(struct folio *folio, struct obj_cgroup *objcg) 2764 2585 { 2765 2586 VM_BUG_ON_FOLIO(folio_memcg_charged(folio), folio); 2766 2587 /* 2767 - * Any of the following ensures page's memcg stability: 2588 + * Any of the following ensures folio's objcg stability: 2768 2589 * 2769 2590 * - the page lock 2770 2591 * - LRU isolation 2771 2592 * - exclusive reference 2772 2593 */ 2773 - folio->memcg_data = (unsigned long)memcg; 2594 + folio->memcg_data = (unsigned long)objcg; 2774 2595 } 2775 2596 2776 2597 #ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC ··· 2872 2693 2873 2694 static struct obj_cgroup 
*__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg) 2874 2695 { 2875 - struct obj_cgroup *objcg = NULL; 2696 + int nid = numa_node_id(); 2876 2697 2877 - for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) { 2878 - objcg = rcu_dereference(memcg->objcg); 2698 + for (; memcg; memcg = parent_mem_cgroup(memcg)) { 2699 + struct obj_cgroup *objcg = rcu_dereference(memcg->nodeinfo[nid]->objcg); 2700 + 2879 2701 if (likely(objcg && obj_cgroup_tryget(objcg))) 2880 - break; 2881 - objcg = NULL; 2702 + return objcg; 2882 2703 } 2704 + 2705 + return NULL; 2706 + } 2707 + 2708 + static inline struct obj_cgroup *get_obj_cgroup_from_memcg(struct mem_cgroup *memcg) 2709 + { 2710 + struct obj_cgroup *objcg; 2711 + 2712 + rcu_read_lock(); 2713 + objcg = __get_obj_cgroup_from_memcg(memcg); 2714 + rcu_read_unlock(); 2715 + 2883 2716 return objcg; 2884 2717 } 2885 2718 ··· 2950 2759 { 2951 2760 struct mem_cgroup *memcg; 2952 2761 struct obj_cgroup *objcg; 2762 + int nid = numa_node_id(); 2953 2763 2954 2764 if (IS_ENABLED(CONFIG_MEMCG_NMI_UNSAFE) && in_nmi()) 2955 2765 return NULL; ··· 2967 2775 * Objcg reference is kept by the task, so it's safe 2968 2776 * to use the objcg by the current task. 2969 2777 */ 2970 - return objcg; 2778 + return objcg ? 
: rcu_dereference_check(root_mem_cgroup->nodeinfo[nid]->objcg, 1); 2971 2779 } 2972 2780 2973 2781 memcg = this_cpu_read(int_active_memcg); 2974 2782 if (unlikely(memcg)) 2975 2783 goto from_memcg; 2976 2784 2977 - return NULL; 2785 + return rcu_dereference_check(root_mem_cgroup->nodeinfo[nid]->objcg, 1); 2978 2786 2979 2787 from_memcg: 2980 - objcg = NULL; 2981 - for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) { 2788 + for (; memcg; memcg = parent_mem_cgroup(memcg)) { 2982 2789 /* 2983 2790 * Memcg pointer is protected by scope (see set_active_memcg()) 2984 2791 * and is pinning the corresponding objcg, so objcg can't go 2985 2792 * away and can be used within the scope without any additional 2986 2793 * protection. 2987 2794 */ 2988 - objcg = rcu_dereference_check(memcg->objcg, 1); 2795 + objcg = rcu_dereference_check(memcg->nodeinfo[nid]->objcg, 1); 2989 2796 if (likely(objcg)) 2990 - break; 2797 + return objcg; 2991 2798 } 2992 2799 2993 - return objcg; 2800 + return rcu_dereference_check(root_mem_cgroup->nodeinfo[nid]->objcg, 1); 2994 2801 } 2995 2802 2996 2803 struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio) 2997 2804 { 2998 2805 struct obj_cgroup *objcg; 2999 2806 3000 - if (!memcg_kmem_online()) 3001 - return NULL; 3002 - 3003 - if (folio_memcg_kmem(folio)) { 3004 - objcg = __folio_objcg(folio); 2807 + objcg = folio_objcg(folio); 2808 + if (objcg) 3005 2809 obj_cgroup_get(objcg); 3006 - } else { 3007 - struct mem_cgroup *memcg; 3008 2810 3009 - rcu_read_lock(); 3010 - memcg = __folio_memcg(folio); 3011 - if (memcg) 3012 - objcg = __get_obj_cgroup_from_memcg(memcg); 3013 - else 3014 - objcg = NULL; 3015 - rcu_read_unlock(); 3016 - } 3017 2811 return objcg; 3018 2812 } 3019 2813 ··· 3100 2922 int ret = 0; 3101 2923 3102 2924 objcg = current_obj_cgroup(); 3103 - if (objcg) { 2925 + if (objcg && !obj_cgroup_is_root(objcg)) { 3104 2926 ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order); 3105 2927 if (!ret) { 3106 2928 
obj_cgroup_get(objcg); ··· 3429 3251 * obj_cgroup_get() is used to get a permanent reference. 3430 3252 */ 3431 3253 objcg = current_obj_cgroup(); 3432 - if (!objcg) 3254 + if (!objcg || obj_cgroup_is_root(objcg)) 3433 3255 return true; 3434 3256 3435 3257 /* ··· 3561 3383 return; 3562 3384 3563 3385 new_refs = (1 << (old_order - new_order)) - 1; 3564 - css_get_many(&__folio_memcg(folio)->css, new_refs); 3386 + obj_cgroup_get_many(folio_objcg(folio), new_refs); 3565 3387 } 3566 3388 3567 - static int memcg_online_kmem(struct mem_cgroup *memcg) 3389 + static void memcg_online_kmem(struct mem_cgroup *memcg) 3568 3390 { 3569 - struct obj_cgroup *objcg; 3570 - 3571 3391 if (mem_cgroup_kmem_disabled()) 3572 - return 0; 3392 + return; 3573 3393 3574 3394 if (unlikely(mem_cgroup_is_root(memcg))) 3575 - return 0; 3576 - 3577 - objcg = obj_cgroup_alloc(); 3578 - if (!objcg) 3579 - return -ENOMEM; 3580 - 3581 - objcg->memcg = memcg; 3582 - rcu_assign_pointer(memcg->objcg, objcg); 3583 - obj_cgroup_get(objcg); 3584 - memcg->orig_objcg = objcg; 3395 + return; 3585 3396 3586 3397 static_branch_enable(&memcg_kmem_online_key); 3587 3398 3588 3399 memcg->kmemcg_id = memcg->id.id; 3589 - 3590 - return 0; 3591 3400 } 3592 3401 3593 3402 static void memcg_offline_kmem(struct mem_cgroup *memcg) ··· 3588 3423 return; 3589 3424 3590 3425 parent = parent_mem_cgroup(memcg); 3591 - if (!parent) 3592 - parent = root_mem_cgroup; 3593 - 3594 3426 memcg_reparent_list_lrus(memcg, parent); 3595 - 3596 - /* 3597 - * Objcg's reparenting must be after list_lru's, make sure list_lru 3598 - * helpers won't use parent's list_lru until child is drained. 
3599 - */ 3600 - memcg_reparent_objcgs(memcg, parent); 3601 3427 } 3602 3428 3603 3429 #ifdef CONFIG_CGROUP_WRITEBACK ··· 3861 3705 break; 3862 3706 } 3863 3707 memcg = parent_mem_cgroup(memcg); 3864 - if (!memcg) 3865 - memcg = root_mem_cgroup; 3866 3708 } 3867 3709 return memcg; 3868 3710 } ··· 3925 3771 if (!pn->lruvec_stats_percpu) 3926 3772 goto fail; 3927 3773 3774 + INIT_LIST_HEAD(&pn->objcg_list); 3775 + 3928 3776 lruvec_init(&pn->lruvec); 3929 3777 pn->memcg = memcg; 3930 3778 ··· 3941 3785 { 3942 3786 int node; 3943 3787 3944 - obj_cgroup_put(memcg->orig_objcg); 3788 + for_each_node(node) { 3789 + struct mem_cgroup_per_node *pn = memcg->nodeinfo[node]; 3790 + if (!pn) 3791 + continue; 3945 3792 3946 - for_each_node(node) 3947 - free_mem_cgroup_per_node_info(memcg->nodeinfo[node]); 3793 + obj_cgroup_put(pn->orig_objcg); 3794 + free_mem_cgroup_per_node_info(pn); 3795 + } 3948 3796 memcg1_free_events(memcg); 3949 3797 kfree(memcg->vmstats); 3950 3798 free_percpu(memcg->vmstats_percpu); ··· 4019 3859 #endif 4020 3860 memcg1_memcg_init(memcg); 4021 3861 memcg->kmemcg_id = -1; 4022 - INIT_LIST_HEAD(&memcg->objcg_list); 4023 3862 #ifdef CONFIG_CGROUP_WRITEBACK 4024 3863 INIT_LIST_HEAD(&memcg->cgwb_list); 4025 3864 for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) ··· 4094 3935 static int mem_cgroup_css_online(struct cgroup_subsys_state *css) 4095 3936 { 4096 3937 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 3938 + struct obj_cgroup *objcg; 3939 + int nid; 4097 3940 4098 - if (memcg_online_kmem(memcg)) 4099 - goto remove_id; 3941 + memcg_online_kmem(memcg); 4100 3942 4101 3943 /* 4102 3944 * A memcg must be visible for expand_shrinker_info() ··· 4106 3946 */ 4107 3947 if (alloc_shrinker_info(memcg)) 4108 3948 goto offline_kmem; 3949 + 3950 + for_each_node(nid) { 3951 + objcg = obj_cgroup_alloc(); 3952 + if (!objcg) 3953 + goto free_objcg; 3954 + 3955 + if (unlikely(mem_cgroup_is_root(memcg))) 3956 + objcg->is_root = true; 3957 + 3958 + objcg->memcg = memcg; 3959 + 
rcu_assign_pointer(memcg->nodeinfo[nid]->objcg, objcg); 3960 + obj_cgroup_get(objcg); 3961 + memcg->nodeinfo[nid]->orig_objcg = objcg; 3962 + } 4109 3963 4110 3964 if (unlikely(mem_cgroup_is_root(memcg)) && !mem_cgroup_disabled()) 4111 3965 queue_delayed_work(system_dfl_wq, &stats_flush_dwork, ··· 4143 3969 xa_store(&mem_cgroup_private_ids, memcg->id.id, memcg, GFP_KERNEL); 4144 3970 4145 3971 return 0; 3972 + free_objcg: 3973 + for_each_node(nid) { 3974 + struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid]; 3975 + 3976 + objcg = rcu_replace_pointer(pn->objcg, NULL, true); 3977 + if (objcg) 3978 + percpu_ref_kill(&objcg->refcnt); 3979 + 3980 + if (pn->orig_objcg) { 3981 + obj_cgroup_put(pn->orig_objcg); 3982 + /* 3983 + * Reset pn->orig_objcg to NULL to prevent 3984 + * obj_cgroup_put() from being called again in 3985 + * __mem_cgroup_free(). 3986 + */ 3987 + pn->orig_objcg = NULL; 3988 + } 3989 + } 3990 + free_shrinker_info(memcg); 4146 3991 offline_kmem: 4147 3992 memcg_offline_kmem(memcg); 4148 - remove_id: 4149 3993 mem_cgroup_private_id_remove(memcg); 4150 3994 return -ENOMEM; 4151 3995 } ··· 4181 3989 4182 3990 memcg_offline_kmem(memcg); 4183 3991 reparent_deferred_split_queue(memcg); 3992 + /* 3993 + * The reparenting of objcg must be after the reparenting of the 3994 + * list_lru and deferred_split_queue above, which ensures that they will 3995 + * not mistakenly get the parent list_lru and deferred_split_queue. 
3996 + */ 3997 + memcg_reparent_objcgs(memcg); 4184 3998 reparent_shrinker_deferred(memcg); 4185 3999 wb_memcg_offline(memcg); 4186 4000 lru_gen_offline_memcg(memcg); ··· 4419 4221 } 4420 4222 WRITE_ONCE(statc->stats_updates, 0); 4421 4223 /* We are in a per-cpu loop here, only do the atomic write once */ 4422 - if (atomic_read(&memcg->vmstats->stats_updates)) 4423 - atomic_set(&memcg->vmstats->stats_updates, 0); 4224 + if (atomic_long_read(&memcg->vmstats->stats_updates)) 4225 + atomic_long_set(&memcg->vmstats->stats_updates, 0); 4424 4226 } 4425 4227 4426 4228 static void mem_cgroup_fork(struct task_struct *task) ··· 4997 4799 static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg, 4998 4800 gfp_t gfp) 4999 4801 { 5000 - int ret; 4802 + int ret = 0; 4803 + struct obj_cgroup *objcg; 5001 4804 5002 - ret = try_charge(memcg, gfp, folio_nr_pages(folio)); 5003 - if (ret) 5004 - goto out; 5005 - 5006 - css_get(&memcg->css); 5007 - commit_charge(folio, memcg); 4805 + objcg = get_obj_cgroup_from_memcg(memcg); 4806 + /* Do not account at the root objcg level. 
*/ 4807 + if (!obj_cgroup_is_root(objcg)) 4808 + ret = try_charge_memcg(memcg, gfp, folio_nr_pages(folio)); 4809 + if (ret) { 4810 + obj_cgroup_put(objcg); 4811 + return ret; 4812 + } 4813 + commit_charge(folio, objcg); 5008 4814 memcg1_commit_charge(folio, memcg); 5009 - out: 4815 + 5010 4816 return ret; 5011 4817 } 5012 4818 ··· 5096 4894 } 5097 4895 5098 4896 struct uncharge_gather { 5099 - struct mem_cgroup *memcg; 4897 + struct obj_cgroup *objcg; 5100 4898 unsigned long nr_memory; 5101 4899 unsigned long pgpgout; 5102 4900 unsigned long nr_kmem; ··· 5110 4908 5111 4909 static void uncharge_batch(const struct uncharge_gather *ug) 5112 4910 { 4911 + struct mem_cgroup *memcg; 4912 + 4913 + rcu_read_lock(); 4914 + memcg = obj_cgroup_memcg(ug->objcg); 5113 4915 if (ug->nr_memory) { 5114 - memcg_uncharge(ug->memcg, ug->nr_memory); 4916 + memcg_uncharge(memcg, ug->nr_memory); 5115 4917 if (ug->nr_kmem) { 5116 - mod_memcg_state(ug->memcg, MEMCG_KMEM, -ug->nr_kmem); 5117 - memcg1_account_kmem(ug->memcg, -ug->nr_kmem); 4918 + mod_memcg_state(memcg, MEMCG_KMEM, -ug->nr_kmem); 4919 + memcg1_account_kmem(memcg, -ug->nr_kmem); 5118 4920 } 5119 - memcg1_oom_recover(ug->memcg); 4921 + memcg1_oom_recover(memcg); 5120 4922 } 5121 4923 5122 - memcg1_uncharge_batch(ug->memcg, ug->pgpgout, ug->nr_memory, ug->nid); 4924 + memcg1_uncharge_batch(memcg, ug->pgpgout, ug->nr_memory, ug->nid); 4925 + rcu_read_unlock(); 5123 4926 5124 4927 /* drop reference from uncharge_folio */ 5125 - css_put(&ug->memcg->css); 4928 + obj_cgroup_put(ug->objcg); 5126 4929 } 5127 4930 5128 4931 static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug) 5129 4932 { 5130 4933 long nr_pages; 5131 - struct mem_cgroup *memcg; 5132 4934 struct obj_cgroup *objcg; 5133 4935 5134 4936 VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); 5135 4937 5136 4938 /* 5137 4939 * Nobody should be changing or seriously looking at 5138 - * folio memcg or objcg at this point, we have fully 5139 - * exclusive access 
to the folio. 4940 + * folio objcg at this point, we have fully exclusive 4941 + * access to the folio. 5140 4942 */ 5141 - if (folio_memcg_kmem(folio)) { 5142 - objcg = __folio_objcg(folio); 5143 - /* 5144 - * This get matches the put at the end of the function and 5145 - * kmem pages do not hold memcg references anymore. 5146 - */ 5147 - memcg = get_mem_cgroup_from_objcg(objcg); 5148 - } else { 5149 - memcg = __folio_memcg(folio); 5150 - } 5151 - 5152 - if (!memcg) 4943 + objcg = folio_objcg(folio); 4944 + if (!objcg) 5153 4945 return; 5154 4946 5155 - if (ug->memcg != memcg) { 5156 - if (ug->memcg) { 4947 + if (ug->objcg != objcg) { 4948 + if (ug->objcg) { 5157 4949 uncharge_batch(ug); 5158 4950 uncharge_gather_clear(ug); 5159 4951 } 5160 - ug->memcg = memcg; 4952 + ug->objcg = objcg; 5161 4953 ug->nid = folio_nid(folio); 5162 4954 5163 - /* pairs with css_put in uncharge_batch */ 5164 - css_get(&memcg->css); 4955 + /* pairs with obj_cgroup_put in uncharge_batch */ 4956 + obj_cgroup_get(objcg); 5165 4957 } 5166 4958 5167 4959 nr_pages = folio_nr_pages(folio); ··· 5163 4967 if (folio_memcg_kmem(folio)) { 5164 4968 ug->nr_memory += nr_pages; 5165 4969 ug->nr_kmem += nr_pages; 5166 - 5167 - folio->memcg_data = 0; 5168 - obj_cgroup_put(objcg); 5169 4970 } else { 5170 4971 /* LRU pages aren't accounted at the root level */ 5171 - if (!mem_cgroup_is_root(memcg)) 4972 + if (!obj_cgroup_is_root(objcg)) 5172 4973 ug->nr_memory += nr_pages; 5173 4974 ug->pgpgout++; 5174 4975 5175 4976 WARN_ON_ONCE(folio_unqueue_deferred_split(folio)); 5176 - folio->memcg_data = 0; 5177 4977 } 5178 4978 5179 - css_put(&memcg->css); 4979 + folio->memcg_data = 0; 4980 + obj_cgroup_put(objcg); 5180 4981 } 5181 4982 5182 4983 void __mem_cgroup_uncharge(struct folio *folio) ··· 5197 5004 uncharge_gather_clear(&ug); 5198 5005 for (i = 0; i < folios->nr; i++) 5199 5006 uncharge_folio(folios->folios[i], &ug); 5200 - if (ug.memcg) 5007 + if (ug.objcg) 5201 5008 uncharge_batch(&ug); 5202 5009 } 5203 
5010 ··· 5214 5021 void mem_cgroup_replace_folio(struct folio *old, struct folio *new) 5215 5022 { 5216 5023 struct mem_cgroup *memcg; 5024 + struct obj_cgroup *objcg; 5217 5025 long nr_pages = folio_nr_pages(new); 5218 5026 5219 5027 VM_BUG_ON_FOLIO(!folio_test_locked(old), old); ··· 5229 5035 if (folio_memcg_charged(new)) 5230 5036 return; 5231 5037 5232 - memcg = folio_memcg(old); 5233 - VM_WARN_ON_ONCE_FOLIO(!memcg, old); 5234 - if (!memcg) 5038 + objcg = folio_objcg(old); 5039 + VM_WARN_ON_ONCE_FOLIO(!objcg, old); 5040 + if (!objcg) 5235 5041 return; 5236 5042 5043 + rcu_read_lock(); 5044 + memcg = obj_cgroup_memcg(objcg); 5237 5045 /* Force-charge the new page. The old one will be freed soon */ 5238 - if (!mem_cgroup_is_root(memcg)) { 5046 + if (!obj_cgroup_is_root(objcg)) { 5239 5047 page_counter_charge(&memcg->memory, nr_pages); 5240 5048 if (do_memsw_account()) 5241 5049 page_counter_charge(&memcg->memsw, nr_pages); 5242 5050 } 5243 5051 5244 - css_get(&memcg->css); 5245 - commit_charge(new, memcg); 5052 + obj_cgroup_get(objcg); 5053 + commit_charge(new, objcg); 5246 5054 memcg1_commit_charge(new, memcg); 5055 + rcu_read_unlock(); 5247 5056 } 5248 5057 5249 5058 /** ··· 5262 5065 */ 5263 5066 void mem_cgroup_migrate(struct folio *old, struct folio *new) 5264 5067 { 5265 - struct mem_cgroup *memcg; 5068 + struct obj_cgroup *objcg; 5266 5069 5267 5070 VM_BUG_ON_FOLIO(!folio_test_locked(old), old); 5268 5071 VM_BUG_ON_FOLIO(!folio_test_locked(new), new); ··· 5273 5076 if (mem_cgroup_disabled()) 5274 5077 return; 5275 5078 5276 - memcg = folio_memcg(old); 5079 + objcg = folio_objcg(old); 5277 5080 /* 5278 - * Note that it is normal to see !memcg for a hugetlb folio. 5081 + * Note that it is normal to see !objcg for a hugetlb folio. 5279 5082 * For e.g, it could have been allocated when memory_hugetlb_accounting 5280 5083 * was not selected. 
5281 5084 */ 5282 - VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(old) && !memcg, old); 5283 - if (!memcg) 5085 + VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(old) && !objcg, old); 5086 + if (!objcg) 5284 5087 return; 5285 5088 5286 - /* Transfer the charge and the css ref */ 5287 - commit_charge(new, memcg); 5089 + /* Transfer the charge and the objcg ref */ 5090 + commit_charge(new, objcg); 5288 5091 5289 5092 /* Warning should never happen, so don't worry about refcount non-0 */ 5290 5093 WARN_ON_ONCE(folio_unqueue_deferred_split(old)); ··· 5467 5270 unsigned int nr_pages = folio_nr_pages(folio); 5468 5271 struct page_counter *counter; 5469 5272 struct mem_cgroup *memcg; 5273 + struct obj_cgroup *objcg; 5470 5274 5471 5275 if (do_memsw_account()) 5472 5276 return 0; 5473 5277 5474 - memcg = folio_memcg(folio); 5475 - 5476 - VM_WARN_ON_ONCE_FOLIO(!memcg, folio); 5477 - if (!memcg) 5278 + objcg = folio_objcg(folio); 5279 + VM_WARN_ON_ONCE_FOLIO(!objcg, folio); 5280 + if (!objcg) 5478 5281 return 0; 5479 5282 5283 + rcu_read_lock(); 5284 + memcg = obj_cgroup_memcg(objcg); 5480 5285 if (!entry.val) { 5481 5286 memcg_memory_event(memcg, MEMCG_SWAP_FAIL); 5287 + rcu_read_unlock(); 5482 5288 return 0; 5483 5289 } 5484 5290 5485 5291 memcg = mem_cgroup_private_id_get_online(memcg, nr_pages); 5292 + /* memcg is pined by memcg ID. 
*/ 5293 + rcu_read_unlock(); 5486 5294 5487 5295 if (!mem_cgroup_is_root(memcg) && 5488 5296 !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) { ··· 5545 5343 bool mem_cgroup_swap_full(struct folio *folio) 5546 5344 { 5547 5345 struct mem_cgroup *memcg; 5346 + bool ret = false; 5548 5347 5549 5348 VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); 5550 5349 5551 5350 if (vm_swap_full()) 5552 5351 return true; 5553 - if (do_memsw_account()) 5554 - return false; 5352 + if (do_memsw_account() || !folio_memcg_charged(folio)) 5353 + return ret; 5555 5354 5355 + rcu_read_lock(); 5556 5356 memcg = folio_memcg(folio); 5557 - if (!memcg) 5558 - return false; 5559 - 5560 5357 for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) { 5561 5358 unsigned long usage = page_counter_read(&memcg->swap); 5562 5359 5563 5360 if (usage * 2 >= READ_ONCE(memcg->swap.high) || 5564 - usage * 2 >= READ_ONCE(memcg->swap.max)) 5565 - return true; 5361 + usage * 2 >= READ_ONCE(memcg->swap.max)) { 5362 + ret = true; 5363 + break; 5364 + } 5566 5365 } 5366 + rcu_read_unlock(); 5567 5367 5568 - return false; 5368 + return ret; 5569 5369 } 5570 5370 5571 5371 static int __init setup_swap_account(char *s) ··· 5763 5559 if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) 5764 5560 return; 5765 5561 5562 + if (obj_cgroup_is_root(objcg)) 5563 + return; 5564 + 5766 5565 VM_WARN_ON_ONCE(!(current->flags & PF_MEMALLOC)); 5767 5566 5768 5567 /* PF_MEMALLOC context, charging must succeed */ ··· 5793 5586 struct mem_cgroup *memcg; 5794 5587 5795 5588 if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) 5589 + return; 5590 + 5591 + if (obj_cgroup_is_root(objcg)) 5796 5592 return; 5797 5593 5798 5594 obj_cgroup_uncharge(objcg, size);
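Reviewer sketch for the mm/memcontrol.c hunks above: two patterns recur in them. The objcg lookup now walks up the hierarchy reading a per-node objcg and falls back to the root group's objcg instead of returning NULL, and callers skip charging when the objcg they get belongs to the root. The user-space model below only illustrates that control flow. The struct layouts, `MAX_NODES`, `lookup_objcg()` and `charge_pages()` are hypothetical stand-ins, not kernel definitions, and RCU and reference counting are left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 2	/* hypothetical node count for the model */

/* Simplified stand-ins; not the kernel's struct layouts. */
struct obj_cgroup {
	bool is_root;
	long charged_pages;
};

struct mem_cgroup {
	struct mem_cgroup *parent;		/* NULL for the root group */
	struct obj_cgroup *objcg[MAX_NODES];	/* one objcg per node */
};

/* Walk towards the root until a group has an objcg for this node. */
static struct obj_cgroup *lookup_objcg(struct mem_cgroup *memcg,
				       struct mem_cgroup *root, int nid)
{
	for (; memcg; memcg = memcg->parent) {
		if (memcg->objcg[nid])
			return memcg->objcg[nid];
	}
	/* Nothing found on the way up: fall back to the root's objcg. */
	return root->objcg[nid];
}

/* Charge unless the objcg is missing or belongs to the root group. */
static bool charge_pages(struct obj_cgroup *objcg, long nr_pages)
{
	if (!objcg || objcg->is_root)
		return false;	/* nothing is accounted at the root level */
	objcg->charged_pages += nr_pages;
	return true;
}
```

In this model a lookup never yields NULL once the root has an objcg for every node, which appears to be why the callers in the diff gain an `obj_cgroup_is_root()` test next to the existing NULL test.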

+26 -8

mm/memfd_luo.c

@@ -105,7 +105,6 @@
 	if (!size) {
 		*nr_foliosp = 0;
 		*out_folios_ser = NULL;
-		memset(kho_vmalloc, 0, sizeof(*kho_vmalloc));
 		return 0;
 	}

@@ -410,6 +409,7 @@
 	struct inode *inode = file_inode(file);
 	struct address_space *mapping = inode->i_mapping;
 	struct folio *folio;
+	long npages, nr_added_pages = 0;
 	int err = -EIO;
 	long i;

@@ -456,21 +456,26 @@
 		if (flags & MEMFD_LUO_FOLIO_DIRTY)
 			folio_mark_dirty(folio);

-		err = shmem_inode_acct_blocks(inode, 1);
+		npages = folio_nr_pages(folio);
+		err = shmem_inode_acct_blocks(inode, npages);
 		if (err) {
-			pr_err("shmem: failed to account folio index %ld: %d\n",
-			       i, err);
-			goto unlock_folio;
+			pr_err("shmem: failed to account folio index %ld(%ld pages): %d\n",
+			       i, npages, err);
+			goto remove_from_cache;
 		}

-		shmem_recalc_inode(inode, 1, 0);
+		nr_added_pages += npages;
 		folio_add_lru(folio);
 		folio_unlock(folio);
 		folio_put(folio);
 	}

+	shmem_recalc_inode(inode, nr_added_pages, 0);
+
 	return 0;

+remove_from_cache:
+	filemap_remove_folio(folio);
 unlock_folio:
 	folio_unlock(folio);
 	folio_put(folio);
@@ -481,11 +486,18 @@
 	 */
 	for (long j = i + 1; j < nr_folios; j++) {
 		const struct memfd_luo_folio_ser *pfolio = &folios_ser[j];
+		phys_addr_t phys;

-		folio = kho_restore_folio(pfolio->pfn);
+		if (!pfolio->pfn)
+			continue;
+
+		phys = PFN_PHYS(pfolio->pfn);
+		folio = kho_restore_folio(phys);
 		if (folio)
 			folio_put(folio);
 	}
+
+	shmem_recalc_inode(inode, nr_added_pages, 0);

 	return err;
 }
@@ -525,7 +537,7 @@
 	}

 	vfs_setpos(file, ser->pos, MAX_LFS_FILESIZE);
-	file->f_inode->i_size = ser->size;
+	i_size_write(file_inode(file), ser->size);

 	if (ser->nr_folios) {
 		folios_ser = kho_restore_vmalloc(&ser->folios);
@@ -560,6 +572,11 @@
 	return shmem_file(file) && !inode->i_nlink;
 }

+static unsigned long memfd_luo_get_id(struct file *file)
+{
+	return (unsigned long)file_inode(file);
+}
+
 static const struct liveupdate_file_ops memfd_luo_file_ops = {
 	.freeze = memfd_luo_freeze,
 	.finish = memfd_luo_finish,
@@ -567,6 +584,7 @@
 	.preserve = memfd_luo_preserve,
 	.unpreserve = memfd_luo_unpreserve,
 	.can_preserve = memfd_luo_can_preserve,
+	.get_id = memfd_luo_get_id,
 	.owner = THIS_MODULE,
 };
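Reviewer sketch for the mm/memfd_luo.c hunks above: the restore loop now accounts `folio_nr_pages(folio)` pages per folio instead of 1, keeps a running `nr_added_pages` total, and calls `shmem_recalc_inode()` once on both the success and the error path. The model below illustrates that bookkeeping in user space. `struct acct`, `acct_blocks()` and `restore_folios()` are hypothetical names invented for the illustration and do not correspond to shmem APIs.

```c
#include <stddef.h>

/* Hypothetical stand-in for an inode's block accounting. */
struct acct {
	long limit;	/* maximum pages that may be accounted */
	long used;	/* pages accounted so far */
};

/* Account nr pages; fail without side effects if that exceeds the limit. */
static int acct_blocks(struct acct *a, long nr)
{
	if (a->used + nr > a->limit)
		return -1;
	a->used += nr;
	return 0;
}

/*
 * Restore folios of the given sizes (in pages). Returns 0 on success
 * and -1 on failure. On either path *nr_added holds the pages that
 * were actually accounted, so the caller can fix up totals once.
 */
static int restore_folios(struct acct *a, const long *folio_pages,
			  size_t nr_folios, long *nr_added)
{
	*nr_added = 0;
	for (size_t i = 0; i < nr_folios; i++) {
		long npages = folio_pages[i];

		if (acct_blocks(a, npages))
			return -1;
		*nr_added += npages;
	}
	return 0;
}
```

With folio sizes of 1, 16 and 8 pages and a limit of 20, the third folio fails and the running total stays at 17, which is the value a single fix-up call on the error path would need.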

+12 -11

mm/mempolicy.c

@@ -3706,18 +3706,19 @@
 		new_wi_state->iw_table[i] = 1;

 	mutex_lock(&wi_state_lock);
-	if (!input) {
-		old_wi_state = rcu_dereference_protected(wi_state,
-					lockdep_is_held(&wi_state_lock));
-		if (!old_wi_state)
-			goto update_wi_state;
-		if (input == old_wi_state->mode_auto) {
-			mutex_unlock(&wi_state_lock);
-			return count;
-		}
+	old_wi_state = rcu_dereference_protected(wi_state,
+				lockdep_is_held(&wi_state_lock));

-		memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
-		       nr_node_ids * sizeof(u8));
+	if (old_wi_state && input == old_wi_state->mode_auto) {
+		mutex_unlock(&wi_state_lock);
+		kfree(new_wi_state);
+		return count;
+	}
+
+	if (!input) {
+		if (old_wi_state)
+			memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
+			       nr_node_ids * sizeof(u8));
 		goto update_wi_state;
 	}

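Reviewer sketch for the mm/mempolicy.c hunk above: the new state is allocated before the lock is taken, so an early return for "mode unchanged" has to free it. The removed lines returned without doing so, and the added lines call `kfree(new_wi_state)` first. The model below reproduces that pattern with `malloc()` and a live-allocation counter. `current_state`, `live_allocs`, `state_alloc()`, `state_free()` and `set_mode_auto()` are hypothetical, and the mutex and RCU handling of the real code are omitted.

```c
#include <stdbool.h>
#include <stdlib.h>

struct wi_state {
	bool mode_auto;
};

static struct wi_state *current_state;	/* hypothetical published state */
static int live_allocs;			/* outstanding allocations */

static struct wi_state *state_alloc(bool mode_auto)
{
	struct wi_state *s = malloc(sizeof(*s));

	if (s) {
		s->mode_auto = mode_auto;
		live_allocs++;
	}
	return s;
}

static void state_free(struct wi_state *s)
{
	if (s) {
		live_allocs--;
		free(s);
	}
}

/* Returns 0 on success, -1 if the allocation fails. */
static int set_mode_auto(bool input)
{
	struct wi_state *new_state = state_alloc(input);
	struct wi_state *old_state;

	if (!new_state)
		return -1;

	old_state = current_state;
	if (old_state && input == old_state->mode_auto) {
		/* Nothing to change: drop the allocation we will not publish. */
		state_free(new_state);
		return 0;
	}

	current_state = new_state;
	state_free(old_state);
	return 0;
}
```

Calling `set_mode_auto(true)` twice leaves `live_allocs` at 1 in this model. Without the `state_free(new_state)` on the early return it would reach 2.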

+2

mm/migrate.c

@@ -672,6 +672,7 @@
 		struct lruvec *old_lruvec, *new_lruvec;
 		struct mem_cgroup *memcg;

+		rcu_read_lock();
 		memcg = folio_memcg(folio);
 		old_lruvec = mem_cgroup_lruvec(memcg, oldzone->zone_pgdat);
 		new_lruvec = mem_cgroup_lruvec(memcg, newzone->zone_pgdat);
@@ -699,6 +700,7 @@
 			mod_lruvec_state(new_lruvec, NR_FILE_DIRTY, nr);
 			__mod_zone_page_state(newzone, NR_ZONE_WRITE_PENDING, nr);
 		}
+		rcu_read_unlock();
 	}
 	local_irq_enable();


-6

mm/migrate_device.c

@@ -175,12 +175,6 @@
 			return migrate_vma_collect_skip(start, end, walk);
 		}

-		if (softleaf_is_migration(entry)) {
-			softleaf_entry_wait_on_locked(entry, ptl);
-			spin_unlock(ptl);
-			return -EAGAIN;
-		}
-
 		if (softleaf_is_device_private_write(entry))
 			write = MIGRATE_PFN_WRITE;
 	} else {

+1 -1

mm/mlock.c

@@ -205,7 +205,7 @@
 	}

 	if (lruvec)
-		unlock_page_lruvec_irq(lruvec);
+		lruvec_unlock_irq(lruvec);
 	folios_put(fbatch);
 }


+124 -94

mm/mprotect.c

··· 117 117 } 118 118 119 119 /* Set nr_ptes number of ptes, starting from idx */ 120 - static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr, 121 - pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, 122 - int idx, bool set_write, struct mmu_gather *tlb) 120 + static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma, 121 + unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent, 122 + int nr_ptes, int idx, bool set_write, struct mmu_gather *tlb) 123 123 { 124 124 /* 125 125 * Advance the position in the batch by idx; note that if idx > 0, ··· 143 143 * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce 144 144 * that the ptes point to consecutive pages of the same anon large folio. 145 145 */ 146 - static int page_anon_exclusive_sub_batch(int start_idx, int max_len, 146 + static __always_inline int page_anon_exclusive_sub_batch(int start_idx, int max_len, 147 147 struct page *first_page, bool expected_anon_exclusive) 148 148 { 149 149 int idx; ··· 169 169 * pte of the batch. Therefore, we must individually check all pages and 170 170 * retrieve sub-batches. 
171 171 */ 172 - static void commit_anon_folio_batch(struct vm_area_struct *vma, 172 + static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma, 173 173 struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep, 174 174 pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) 175 175 { ··· 188 188 } 189 189 } 190 190 191 - static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, 191 + static __always_inline void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, 192 192 struct folio *folio, struct page *page, unsigned long addr, pte_t *ptep, 193 193 pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) 194 194 { ··· 211 211 commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb); 212 212 } 213 213 214 + static long change_softleaf_pte(struct vm_area_struct *vma, 215 + unsigned long addr, pte_t *pte, pte_t oldpte, unsigned long cp_flags) 216 + { 217 + const bool uffd_wp = cp_flags & MM_CP_UFFD_WP; 218 + const bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE; 219 + softleaf_t entry = softleaf_from_pte(oldpte); 220 + pte_t newpte; 221 + 222 + if (softleaf_is_migration_write(entry)) { 223 + const struct folio *folio = softleaf_to_folio(entry); 224 + 225 + /* 226 + * A protection check is difficult so 227 + * just be safe and disable write 228 + */ 229 + if (folio_test_anon(folio)) 230 + entry = make_readable_exclusive_migration_entry(swp_offset(entry)); 231 + else 232 + entry = make_readable_migration_entry(swp_offset(entry)); 233 + newpte = swp_entry_to_pte(entry); 234 + if (pte_swp_soft_dirty(oldpte)) 235 + newpte = pte_swp_mksoft_dirty(newpte); 236 + } else if (softleaf_is_device_private_write(entry)) { 237 + /* 238 + * We do not preserve soft-dirtiness. See 239 + * copy_nonpresent_pte() for explanation. 
240 + */ 241 + entry = make_readable_device_private_entry(swp_offset(entry)); 242 + newpte = swp_entry_to_pte(entry); 243 + if (pte_swp_uffd_wp(oldpte)) 244 + newpte = pte_swp_mkuffd_wp(newpte); 245 + } else if (softleaf_is_marker(entry)) { 246 + /* 247 + * Ignore error swap entries unconditionally, 248 + * because any access should sigbus/sigsegv 249 + * anyway. 250 + */ 251 + if (softleaf_is_poison_marker(entry) || 252 + softleaf_is_guard_marker(entry)) 253 + return 0; 254 + /* 255 + * If this is uffd-wp pte marker and we'd like 256 + * to unprotect it, drop it; the next page 257 + * fault will trigger without uffd trapping. 258 + */ 259 + if (uffd_wp_resolve) { 260 + pte_clear(vma->vm_mm, addr, pte); 261 + return 1; 262 + } 263 + return 0; 264 + } else { 265 + newpte = oldpte; 266 + } 267 + 268 + if (uffd_wp) 269 + newpte = pte_swp_mkuffd_wp(newpte); 270 + else if (uffd_wp_resolve) 271 + newpte = pte_swp_clear_uffd_wp(newpte); 272 + 273 + if (!pte_same(oldpte, newpte)) { 274 + set_pte_at(vma->vm_mm, addr, pte, newpte); 275 + return 1; 276 + } 277 + return 0; 278 + } 279 + 280 + static __always_inline void change_present_ptes(struct mmu_gather *tlb, 281 + struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, 282 + int nr_ptes, unsigned long end, pgprot_t newprot, 283 + struct folio *folio, struct page *page, unsigned long cp_flags) 284 + { 285 + const bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE; 286 + const bool uffd_wp = cp_flags & MM_CP_UFFD_WP; 287 + pte_t ptent, oldpte; 288 + 289 + oldpte = modify_prot_start_ptes(vma, addr, ptep, nr_ptes); 290 + ptent = pte_modify(oldpte, newprot); 291 + 292 + if (uffd_wp) 293 + ptent = pte_mkuffd_wp(ptent); 294 + else if (uffd_wp_resolve) 295 + ptent = pte_clear_uffd_wp(ptent); 296 + 297 + /* 298 + * In some writable, shared mappings, we might want 299 + * to catch actual write access -- see 300 + * vma_wants_writenotify(). 
301 + * 302 + * In all writable, private mappings, we have to 303 + * properly handle COW. 304 + * 305 + * In both cases, we can sometimes still change PTEs 306 + * writable and avoid the write-fault handler, for 307 + * example, if a PTE is already dirty and no other 308 + * COW or special handling is required. 309 + */ 310 + if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && 311 + !pte_write(ptent)) 312 + set_write_prot_commit_flush_ptes(vma, folio, page, 313 + addr, ptep, oldpte, ptent, nr_ptes, tlb); 314 + else 315 + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, 316 + nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb); 317 + } 318 + 214 319 static long change_pte_range(struct mmu_gather *tlb, 215 320 struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, 216 321 unsigned long end, pgprot_t newprot, unsigned long cp_flags) ··· 326 221 bool is_private_single_threaded; 327 222 bool prot_numa = cp_flags & MM_CP_PROT_NUMA; 328 223 bool uffd_wp = cp_flags & MM_CP_UFFD_WP; 329 - bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE; 330 224 int nr_ptes; 331 225 332 226 tlb_change_page_size(tlb, PAGE_SIZE); ··· 346 242 int max_nr_ptes = (end - addr) >> PAGE_SHIFT; 347 243 struct folio *folio = NULL; 348 244 struct page *page; 349 - pte_t ptent; 350 245 351 246 /* Already in the desired state. */ 352 247 if (prot_numa && pte_protnone(oldpte)) ··· 371 268 372 269 nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags); 373 270 374 - oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes); 375 - ptent = pte_modify(oldpte, newprot); 376 - 377 - if (uffd_wp) 378 - ptent = pte_mkuffd_wp(ptent); 379 - else if (uffd_wp_resolve) 380 - ptent = pte_clear_uffd_wp(ptent); 381 - 382 271 /* 383 - * In some writable, shared mappings, we might want 384 - * to catch actual write access -- see 385 - * vma_wants_writenotify(). 386 - * 387 - * In all writable, private mappings, we have to 388 - * properly handle COW. 
389 - * 390 - * In both cases, we can sometimes still change PTEs 391 - * writable and avoid the write-fault handler, for 392 - * example, if a PTE is already dirty and no other 393 - * COW or special handling is required. 272 + * Optimize for the small-folio common case by 273 + * special-casing it here. Compiler constant propagation 274 + * plus copious amounts of __always_inline does wonders. 394 275 */ 395 - if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && 396 - !pte_write(ptent)) 397 - set_write_prot_commit_flush_ptes(vma, folio, page, 398 - addr, pte, oldpte, ptent, nr_ptes, tlb); 399 - else 400 - prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent, 401 - nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb); 276 + if (likely(nr_ptes == 1)) { 277 + change_present_ptes(tlb, vma, addr, pte, 1, 278 + end, newprot, folio, page, cp_flags); 279 + } else { 280 + change_present_ptes(tlb, vma, addr, pte, 281 + nr_ptes, end, newprot, folio, page, 282 + cp_flags); 283 + } 284 + 402 285 pages += nr_ptes; 403 286 } else if (pte_none(oldpte)) { 404 287 /* ··· 406 317 pages++; 407 318 } 408 319 } else { 409 - softleaf_t entry = softleaf_from_pte(oldpte); 410 - pte_t newpte; 411 - 412 - if (softleaf_is_migration_write(entry)) { 413 - const struct folio *folio = softleaf_to_folio(entry); 414 - 415 - /* 416 - * A protection check is difficult so 417 - * just be safe and disable write 418 - */ 419 - if (folio_test_anon(folio)) 420 - entry = make_readable_exclusive_migration_entry( 421 - swp_offset(entry)); 422 - else 423 - entry = make_readable_migration_entry(swp_offset(entry)); 424 - newpte = swp_entry_to_pte(entry); 425 - if (pte_swp_soft_dirty(oldpte)) 426 - newpte = pte_swp_mksoft_dirty(newpte); 427 - } else if (softleaf_is_device_private_write(entry)) { 428 - /* 429 - * We do not preserve soft-dirtiness. See 430 - * copy_nonpresent_pte() for explanation. 
431 - */ 432 - entry = make_readable_device_private_entry( 433 - swp_offset(entry)); 434 - newpte = swp_entry_to_pte(entry); 435 - if (pte_swp_uffd_wp(oldpte)) 436 - newpte = pte_swp_mkuffd_wp(newpte); 437 - } else if (softleaf_is_marker(entry)) { 438 - /* 439 - * Ignore error swap entries unconditionally, 440 - * because any access should sigbus/sigsegv 441 - * anyway. 442 - */ 443 - if (softleaf_is_poison_marker(entry) || 444 - softleaf_is_guard_marker(entry)) 445 - continue; 446 - /* 447 - * If this is uffd-wp pte marker and we'd like 448 - * to unprotect it, drop it; the next page 449 - * fault will trigger without uffd trapping. 450 - */ 451 - if (uffd_wp_resolve) { 452 - pte_clear(vma->vm_mm, addr, pte); 453 - pages++; 454 - } 455 - continue; 456 - } else { 457 - newpte = oldpte; 458 - } 459 - 460 - if (uffd_wp) 461 - newpte = pte_swp_mkuffd_wp(newpte); 462 - else if (uffd_wp_resolve) 463 - newpte = pte_swp_clear_uffd_wp(newpte); 464 - 465 - if (!pte_same(oldpte, newpte)) { 466 - set_pte_at(vma->vm_mm, addr, pte, newpte); 467 - pages++; 468 - } 320 + pages += change_softleaf_pte(vma, addr, pte, oldpte, cp_flags); 469 321 } 470 322 } while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end); 471 323 lazy_mmu_mode_disable();
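Reviewer sketch for the mm/mprotect.c hunks above: the new comment in `change_pte_range()` relies on `__always_inline` plus constant propagation, calling `change_present_ptes()` with a literal `1` on the `likely(nr_ptes == 1)` branch so the compiler can specialise the inlined body for the single-PTE case. The example below shows the same idiom on a trivial loop. `ALWAYS_INLINE`, `bump_entries()` and `bump_batch()` are hypothetical names, the attribute spelling assumes GCC or Clang, and whether the loop is actually folded depends on the compiler and optimisation level.

```c
#include <stddef.h>

/* GCC/Clang attribute; the kernel spells this __always_inline. */
#define ALWAYS_INLINE inline __attribute__((always_inline))

/* Generic helper: add delta to nr consecutive entries. */
static ALWAYS_INLINE void bump_entries(int *entries, size_t nr, int delta)
{
	for (size_t i = 0; i < nr; i++)
		entries[i] += delta;
}

static void bump_batch(int *entries, size_t nr, int delta)
{
	if (nr == 1) {
		/* nr is a literal here, so the inlined loop can be folded. */
		bump_entries(entries, 1, delta);
	} else {
		/* General case: the loop bound stays a runtime value. */
		bump_entries(entries, nr, delta);
	}
}
```

Both branches call the same helper with the same semantics. The only difference is what the optimiser knows about `nr` at each call site.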

+9 -1

mm/page_alloc.c

@@ -1242,10 +1242,18 @@
 	union pgtag_ref_handle handle;
 	union codetag_ref ref;

-	if (get_page_tag_ref(page, &ref, &handle)) {
+	if (likely(get_page_tag_ref(page, &ref, &handle))) {
 		alloc_tag_add(&ref, task->alloc_tag, PAGE_SIZE * nr);
 		update_page_tag_ref(handle, &ref);
 		put_page_tag_ref(handle);
+	} else {
+		/*
+		 * page_ext is not available yet, record the pfn so we can
+		 * clear the tag ref later when page_ext is initialized.
+		 */
+		alloc_tag_add_early_pfn(page_to_pfn(page));
+		if (task->alloc_tag)
+			alloc_tag_set_inaccurate(task->alloc_tag);
 	}
 }


+7 -3

mm/page_io.c

@@ -276,10 +276,14 @@
 		count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
 		goto out_unlock;
 	}
+
+	rcu_read_lock();
 	if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) {
+		rcu_read_unlock();
 		folio_mark_dirty(folio);
 		return AOP_WRITEPAGE_ACTIVATE;
 	}
+	rcu_read_unlock();

 	__swap_writepage(folio, swap_plug);
 	return 0;
@@ -307,11 +311,11 @@
 	struct cgroup_subsys_state *css;
 	struct mem_cgroup *memcg;

-	memcg = folio_memcg(folio);
-	if (!memcg)
+	if (!folio_memcg_charged(folio))
 		return;

 	rcu_read_lock();
+	memcg = folio_memcg(folio);
 	css = cgroup_e_css(memcg->css.cgroup, &io_cgrp_subsys);
 	bio_associate_blkg_from_css(bio, css);
 	rcu_read_unlock();
@@ -493,7 +497,7 @@
 			folio_mark_uptodate(folio);
 			folio_unlock(folio);
 		}
-		count_vm_events(PSWPIN, sio->pages);
+		count_vm_events(PSWPIN, sio->len >> PAGE_SHIFT);
 	} else {
 		for (p = 0; p < sio->pages; p++) {
 			struct folio *folio = page_folio(sio->bvec[p].bv_page);
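Reviewer sketch for the last mm/page_io.c hunk above: `PSWPIN` is now bumped by `sio->len >> PAGE_SHIFT` rather than `sio->pages`. The surrounding loop iterates `p < sio->pages` over `sio->bvec[p]` and takes one folio per entry, which suggests that `pages` counts vector entries while `len` counts bytes, so the two differ as soon as an entry covers a large folio. The model below only illustrates the arithmetic. `struct swap_batch` and `pages_swapped_in()` are hypothetical, and a 4 KiB page size is assumed.

```c
#include <stddef.h>

#define PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Simplified stand-in for a batched swap read: one entry per folio. */
struct swap_batch {
	size_t nr_entries;	/* number of vector entries (folios) */
	size_t len;		/* total bytes covered by the batch */
};

/* Number of base pages the batch brought in, derived from its length. */
static size_t pages_swapped_in(const struct swap_batch *b)
{
	return b->len >> PAGE_SHIFT;
}
```

For a batch of two 16-page folios, `nr_entries` is 2 while `len` is 2 * 16 * 4096 bytes, so the length-based count is the one that reflects the pages actually read.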

+1 -1

mm/percpu.c

@@ -1622,7 +1622,7 @@
 		return true;

 	objcg = current_obj_cgroup();
-	if (!objcg)
+	if (!objcg || obj_cgroup_is_root(objcg))
 		return true;

 	if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size)))

+80 -94

mm/shmem.c

··· 3177 3177
 #endif /* CONFIG_TMPFS_QUOTA */
 
 #ifdef CONFIG_USERFAULTFD
-int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
-		struct vm_area_struct *dst_vma,
-		unsigned long dst_addr,
-		unsigned long src_addr,
-		uffd_flags_t flags,
-		struct folio **foliop)
+static struct folio *shmem_mfill_folio_alloc(struct vm_area_struct *vma,
+		unsigned long addr)
 {
-	struct inode *inode = file_inode(dst_vma->vm_file);
-	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct inode *inode = file_inode(vma->vm_file);
 	struct address_space *mapping = inode->i_mapping;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	pgoff_t pgoff = linear_page_index(vma, addr);
 	gfp_t gfp = mapping_gfp_mask(mapping);
-	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
-	void *page_kaddr;
 	struct folio *folio;
-	int ret;
-	pgoff_t max_off;
 
-	if (shmem_inode_acct_blocks(inode, 1)) {
-		/*
-		 * We may have got a page, returned -ENOENT triggering a retry,
-		 * and now we find ourselves with -ENOMEM. Release the page, to
-		 * avoid a BUG_ON in our caller.
-		 */
-		if (unlikely(*foliop)) {
-			folio_put(*foliop);
-			*foliop = NULL;
-		}
-		return -ENOMEM;
+	if (unlikely(pgoff >= DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE)))
+		return NULL;
+
+	folio = shmem_alloc_folio(gfp, 0, info, pgoff);
+	if (!folio)
+		return NULL;
+
+	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) {
+		folio_put(folio);
+		return NULL;
 	}
 
-	if (!*foliop) {
-		ret = -ENOMEM;
-		folio = shmem_alloc_folio(gfp, 0, info, pgoff);
-		if (!folio)
-			goto out_unacct_blocks;
+	return folio;
+}
 
-		if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) {
-			page_kaddr = kmap_local_folio(folio, 0);
-			/*
-			 * The read mmap_lock is held here. Despite the
-			 * mmap_lock being read recursive a deadlock is still
-			 * possible if a writer has taken a lock. For example:
-			 *
-			 * process A thread 1 takes read lock on own mmap_lock
-			 * process A thread 2 calls mmap, blocks taking write lock
-			 * process B thread 1 takes page fault, read lock on own mmap lock
-			 * process B thread 2 calls mmap, blocks taking write lock
-			 * process A thread 1 blocks taking read lock on process B
-			 * process B thread 1 blocks taking read lock on process A
-			 *
-			 * Disable page faults to prevent potential deadlock
-			 * and retry the copy outside the mmap_lock.
-			 */
-			pagefault_disable();
-			ret = copy_from_user(page_kaddr,
-					(const void __user *)src_addr,
-					PAGE_SIZE);
-			pagefault_enable();
-			kunmap_local(page_kaddr);
+static int shmem_mfill_filemap_add(struct folio *folio,
+		struct vm_area_struct *vma,
+		unsigned long addr)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	struct address_space *mapping = inode->i_mapping;
+	pgoff_t pgoff = linear_page_index(vma, addr);
+	gfp_t gfp = mapping_gfp_mask(mapping);
+	int err;
 
-			/* fallback to copy_from_user outside mmap_lock */
-			if (unlikely(ret)) {
-				*foliop = folio;
-				ret = -ENOENT;
-				/* don't free the page */
-				goto out_unacct_blocks;
-			}
-
-			flush_dcache_folio(folio);
-		} else {		/* ZEROPAGE */
-			clear_user_highpage(&folio->page, dst_addr);
-		}
-	} else {
-		folio = *foliop;
-		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-		*foliop = NULL;
-	}
-
-	VM_BUG_ON(folio_test_locked(folio));
-	VM_BUG_ON(folio_test_swapbacked(folio));
 	__folio_set_locked(folio);
 	__folio_set_swapbacked(folio);
-	__folio_mark_uptodate(folio);
 
-	ret = -EFAULT;
-	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
-	if (unlikely(pgoff >= max_off))
-		goto out_release;
+	err = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp);
+	if (err)
+		goto err_unlock;
 
-	ret = mem_cgroup_charge(folio, dst_vma->vm_mm, gfp);
-	if (ret)
-		goto out_release;
-	ret = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp);
-	if (ret)
-		goto out_release;
+	if (shmem_inode_acct_blocks(inode, 1)) {
+		err = -ENOMEM;
+		goto err_delete_from_cache;
+	}
 
-	ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
-			&folio->page, true, flags);
-	if (ret)
-		goto out_delete_from_cache;
-
+	folio_add_lru(folio);
 	shmem_recalc_inode(inode, 1, 0);
-	folio_unlock(folio);
+
 	return 0;
-out_delete_from_cache:
+
+err_delete_from_cache:
 	filemap_remove_folio(folio);
-out_release:
+err_unlock:
 	folio_unlock(folio);
-	folio_put(folio);
-out_unacct_blocks:
-	shmem_inode_unacct_blocks(inode, 1);
-	return ret;
+	return err;
 }
+
+static void shmem_mfill_filemap_remove(struct folio *folio,
+		struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+
+	filemap_remove_folio(folio);
+	shmem_recalc_inode(inode, 0, 0);
+	folio_unlock(folio);
+}
+
+static struct folio *shmem_get_folio_noalloc(struct inode *inode, pgoff_t pgoff)
+{
+	struct folio *folio;
+	int err;
+
+	err = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
+	if (err)
+		return ERR_PTR(err);
+
+	return folio;
+}
+
+static bool shmem_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags)
+{
+	return true;
+}
+
+static const struct vm_uffd_ops shmem_uffd_ops = {
+	.can_userfault = shmem_can_userfault,
+	.get_folio_noalloc = shmem_get_folio_noalloc,
+	.alloc_folio = shmem_mfill_folio_alloc,
+	.filemap_add = shmem_mfill_filemap_add,
+	.filemap_remove = shmem_mfill_filemap_remove,
+};
 #endif /* CONFIG_USERFAULTFD */
 
 #ifdef CONFIG_TMPFS
··· 5305 5325
 	.set_policy = shmem_set_policy,
 	.get_policy = shmem_get_policy,
 #endif
+#ifdef CONFIG_USERFAULTFD
+	.uffd_ops = &shmem_uffd_ops,
+#endif
 };
 
 static const struct vm_operations_struct shmem_anon_vm_ops = {
··· 5316 5333
 #ifdef CONFIG_NUMA
 	.set_policy = shmem_set_policy,
 	.get_policy = shmem_get_policy,
+#endif
+#ifdef CONFIG_USERFAULTFD
+	.uffd_ops = &shmem_uffd_ops,
 #endif
 };
 

+1 -5

mm/shrinker.c

··· 288 288
 {
 	int nid, index, offset;
 	long nr;
-	struct mem_cgroup *parent;
+	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
 	struct shrinker_info *child_info, *parent_info;
 	struct shrinker_info_unit *child_unit, *parent_unit;
-
-	parent = parent_mem_cgroup(memcg);
-	if (!parent)
-		parent = root_mem_cgroup;
 
 	/* Prevent from concurrent shrinker_info expand */
 	mutex_lock(&shrinker_mutex);

-1

mm/sparse.c

··· 403 403
 		ms = __nr_to_section(pnum);
 		if (!preinited_vmemmap_section(ms))
 			ms->section_mem_map = 0;
-		ms->section_mem_map = 0;
 	}
 }
 

+49 -10

mm/swap.c

··· 91 91
 
 	__page_cache_release(folio, &lruvec, &flags);
 	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
+		lruvec_unlock_irqrestore(lruvec, flags);
 }
 
 void __folio_put(struct folio *folio)
··· 175 175
 	}
 
 	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
+		lruvec_unlock_irqrestore(lruvec, flags);
 	folios_put(fbatch);
 }
 
··· 240 240
 void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file,
 		unsigned int nr_io, unsigned int nr_rotated)
 	__releases(lruvec->lru_lock)
+	__releases(rcu)
 {
 	unsigned long cost;
 
··· 254 253
 	cost = nr_io * SWAP_CLUSTER_MAX + nr_rotated;
 	if (!cost) {
 		spin_unlock_irq(&lruvec->lru_lock);
+		rcu_read_unlock();
 		return;
 	}
 
··· 287 285
 
 		spin_unlock_irq(&lruvec->lru_lock);
 		lruvec = parent_lruvec(lruvec);
-		if (!lruvec)
+		if (!lruvec) {
+			rcu_read_unlock();
 			break;
+		}
 		spin_lock_irq(&lruvec->lru_lock);
 	}
 }
··· 353 349
 
 	lruvec = folio_lruvec_lock_irq(folio);
 	lru_activate(lruvec, folio);
-	unlock_page_lruvec_irq(lruvec);
+	lruvec_unlock_irq(lruvec);
 	folio_set_lru(folio);
 }
 #endif
··· 416 412
 
 static bool lru_gen_clear_refs(struct folio *folio)
 {
-	struct lru_gen_folio *lrugen;
 	int gen = folio_lru_gen(folio);
 	int type = folio_is_file_lru(folio);
+	unsigned long seq;
 
 	if (gen < 0)
 		return true;
 
 	set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS | BIT(PG_workingset), 0);
 
-	lrugen = &folio_lruvec(folio)->lrugen;
+	rcu_read_lock();
+	seq = READ_ONCE(folio_lruvec(folio)->lrugen.min_seq[type]);
+	rcu_read_unlock();
 	/* whether can do without shuffling under the LRU lock */
-	return gen == lru_gen_from_seq(READ_ONCE(lrugen->min_seq[type]));
+	return gen == lru_gen_from_seq(seq);
 }
 
 #else /* !CONFIG_LRU_GEN */
··· 969 963
 
 		if (folio_is_zone_device(folio)) {
 			if (lruvec) {
-				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec_unlock_irqrestore(lruvec, flags);
 				lruvec = NULL;
 			}
 			if (folio_ref_sub_and_test(folio, nr_refs))
··· 983 977
 		/* hugetlb has its own memcg */
 		if (folio_test_hugetlb(folio)) {
 			if (lruvec) {
-				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec_unlock_irqrestore(lruvec, flags);
 				lruvec = NULL;
 			}
 			free_huge_folio(folio);
··· 997 991
 		j++;
 	}
 	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
+		lruvec_unlock_irqrestore(lruvec, flags);
 	if (!j) {
 		folio_batch_reinit(folios);
 		return;
··· 1089 1083
 	}
 	fbatch->nr = j;
 }
+
+#ifdef CONFIG_MEMCG
+static void lruvec_reparent_lru(struct lruvec *child_lruvec,
+		struct lruvec *parent_lruvec,
+		enum lru_list lru, int nid)
+{
+	int zid;
+	struct zone *zone;
+
+	if (lru != LRU_UNEVICTABLE)
+		list_splice_tail_init(&child_lruvec->lists[lru], &parent_lruvec->lists[lru]);
+
+	for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1) {
+		unsigned long size = mem_cgroup_get_zone_lru_size(child_lruvec, lru, zid);
+
+		mem_cgroup_update_lru_size(parent_lruvec, lru, zid, size);
+	}
+}
+
+void lru_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid)
+{
+	enum lru_list lru;
+	struct lruvec *child_lruvec, *parent_lruvec;
+
+	child_lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+	parent_lruvec = mem_cgroup_lruvec(parent, NODE_DATA(nid));
+	parent_lruvec->anon_cost += child_lruvec->anon_cost;
+	parent_lruvec->file_cost += child_lruvec->file_cost;
+
+	for_each_lru(lru)
+		lruvec_reparent_lru(child_lruvec, parent_lruvec, lru, nid);
+}
+#endif
 
 static const struct ctl_table swap_sysctl_table[] = {
 	{
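
Illustrative note (not part of the patch): the reparenting added to mm/swap.c above moves the child's LRU entries to the tail of the parent's list and folds the child's cost and size counters into the parent. The user-space sketch below mimics that idea only. Every name in it (struct node, node_splice_tail_init(), struct toy_lruvec, toy_reparent()) is a hypothetical stand-in; it uses one list and one counter where the kernel has per-LRU lists and per-zone sizes, and it ignores locking entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct list_head. */
struct node {
	struct node *prev, *next;
};

static void node_list_init(struct node *head)
{
	head->prev = head;
	head->next = head;
}

static void node_add_tail(struct node *item, struct node *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Move all entries of @from to the tail of @to, then reset @from to empty. */
static void node_splice_tail_init(struct node *from, struct node *to)
{
	if (from->next == from)
		return;		/* nothing to move */

	from->next->prev = to->prev;
	to->prev->next = from->next;
	from->prev->next = to;
	to->prev = from->prev;
	node_list_init(from);
}

static size_t node_count(const struct node *head)
{
	size_t n = 0;
	const struct node *pos;

	for (pos = head->next; pos != head; pos = pos->next)
		n++;
	return n;
}

/* Toy lruvec: a single list, a size counter and a cost counter. */
struct toy_lruvec {
	struct node list;
	unsigned long nr_pages;
	unsigned long cost;
};

/* Fold the child's accounting and list entries into the parent. */
static void toy_reparent(struct toy_lruvec *child, struct toy_lruvec *parent)
{
	assert(child != parent);

	parent->cost += child->cost;
	node_splice_tail_init(&child->list, &parent->list);
	parent->nr_pages += child->nr_pages;
}
```

Splicing at the tail keeps the parent's existing entries ahead of the child's and preserves the child's internal order, which is the property the toy version is meant to show.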

+399 -291

mm/userfaultfd.c

··· 14 14 #include <linux/userfaultfd_k.h> 15 15 #include <linux/mmu_notifier.h> 16 16 #include <linux/hugetlb.h> 17 - #include <linux/shmem_fs.h> 18 17 #include <asm/tlbflush.h> 19 18 #include <asm/tlb.h> 20 19 #include "internal.h" 21 20 #include "swap.h" 21 + 22 + struct mfill_state { 23 + struct userfaultfd_ctx *ctx; 24 + unsigned long src_start; 25 + unsigned long dst_start; 26 + unsigned long len; 27 + uffd_flags_t flags; 28 + 29 + struct vm_area_struct *vma; 30 + unsigned long src_addr; 31 + unsigned long dst_addr; 32 + pmd_t *pmd; 33 + }; 34 + 35 + static bool anon_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags) 36 + { 37 + /* anonymous memory does not support MINOR mode */ 38 + if (vm_flags & VM_UFFD_MINOR) 39 + return false; 40 + return true; 41 + } 42 + 43 + static struct folio *anon_alloc_folio(struct vm_area_struct *vma, 44 + unsigned long addr) 45 + { 46 + struct folio *folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, 47 + addr); 48 + 49 + if (!folio) 50 + return NULL; 51 + 52 + if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) { 53 + folio_put(folio); 54 + return NULL; 55 + } 56 + 57 + return folio; 58 + } 59 + 60 + static const struct vm_uffd_ops anon_uffd_ops = { 61 + .can_userfault = anon_can_userfault, 62 + .alloc_folio = anon_alloc_folio, 63 + }; 64 + 65 + static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma) 66 + { 67 + if (vma_is_anonymous(vma)) 68 + return &anon_uffd_ops; 69 + return vma->vm_ops ? 
vma->vm_ops->uffd_ops : NULL; 70 + } 22 71 23 72 static __always_inline 24 73 bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end) ··· 192 143 } 193 144 #endif 194 145 146 + static void mfill_put_vma(struct mfill_state *state) 147 + { 148 + if (!state->vma) 149 + return; 150 + 151 + up_read(&state->ctx->map_changing_lock); 152 + uffd_mfill_unlock(state->vma); 153 + state->vma = NULL; 154 + } 155 + 156 + static int mfill_get_vma(struct mfill_state *state) 157 + { 158 + struct userfaultfd_ctx *ctx = state->ctx; 159 + uffd_flags_t flags = state->flags; 160 + struct vm_area_struct *dst_vma; 161 + const struct vm_uffd_ops *ops; 162 + int err; 163 + 164 + /* 165 + * Make sure the vma is not shared, that the dst range is 166 + * both valid and fully within a single existing vma. 167 + */ 168 + dst_vma = uffd_mfill_lock(ctx->mm, state->dst_start, state->len); 169 + if (IS_ERR(dst_vma)) 170 + return PTR_ERR(dst_vma); 171 + 172 + /* 173 + * If memory mappings are changing because of non-cooperative 174 + * operation (e.g. mremap) running in parallel, bail out and 175 + * request the user to retry later 176 + */ 177 + down_read(&ctx->map_changing_lock); 178 + state->vma = dst_vma; 179 + err = -EAGAIN; 180 + if (atomic_read(&ctx->mmap_changing)) 181 + goto out_unlock; 182 + 183 + err = -EINVAL; 184 + 185 + /* 186 + * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but 187 + * it will overwrite vm_ops, so vma_is_anonymous must return false. 188 + */ 189 + if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) && 190 + dst_vma->vm_flags & VM_SHARED)) 191 + goto out_unlock; 192 + 193 + /* 194 + * validate 'mode' now that we know the dst_vma: don't allow 195 + * a wrprotect copy if the userfaultfd didn't register as WP. 
196 + */ 197 + if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP)) 198 + goto out_unlock; 199 + 200 + if (is_vm_hugetlb_page(dst_vma)) 201 + return 0; 202 + 203 + ops = vma_uffd_ops(dst_vma); 204 + if (!ops) 205 + goto out_unlock; 206 + 207 + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE) && 208 + !ops->get_folio_noalloc) 209 + goto out_unlock; 210 + 211 + return 0; 212 + 213 + out_unlock: 214 + mfill_put_vma(state); 215 + return err; 216 + } 217 + 218 + static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address) 219 + { 220 + pgd_t *pgd; 221 + p4d_t *p4d; 222 + pud_t *pud; 223 + 224 + pgd = pgd_offset(mm, address); 225 + p4d = p4d_alloc(mm, pgd, address); 226 + if (!p4d) 227 + return NULL; 228 + pud = pud_alloc(mm, p4d, address); 229 + if (!pud) 230 + return NULL; 231 + /* 232 + * Note that we didn't run this because the pmd was 233 + * missing, the *pmd may be already established and in 234 + * turn it may also be a trans_huge_pmd. 235 + */ 236 + return pmd_alloc(mm, pud, address); 237 + } 238 + 239 + static int mfill_establish_pmd(struct mfill_state *state) 240 + { 241 + struct mm_struct *dst_mm = state->ctx->mm; 242 + pmd_t *dst_pmd, dst_pmdval; 243 + 244 + dst_pmd = mm_alloc_pmd(dst_mm, state->dst_addr); 245 + if (unlikely(!dst_pmd)) 246 + return -ENOMEM; 247 + 248 + dst_pmdval = pmdp_get_lockless(dst_pmd); 249 + if (unlikely(pmd_none(dst_pmdval)) && 250 + unlikely(__pte_alloc(dst_mm, dst_pmd))) 251 + return -ENOMEM; 252 + 253 + dst_pmdval = pmdp_get_lockless(dst_pmd); 254 + /* 255 + * If the dst_pmd is THP don't override it and just be strict. 256 + * (This includes the case where the PMD used to be THP and 257 + * changed back to none after __pte_alloc().) 
258 + */ 259 + if (unlikely(!pmd_present(dst_pmdval) || pmd_leaf(dst_pmdval))) 260 + return -EEXIST; 261 + if (unlikely(pmd_bad(dst_pmdval))) 262 + return -EFAULT; 263 + 264 + state->pmd = dst_pmd; 265 + return 0; 266 + } 267 + 195 268 /* Check if dst_addr is outside of file's size. Must be called with ptl held. */ 196 269 static bool mfill_file_over_size(struct vm_area_struct *dst_vma, 197 270 unsigned long dst_addr) ··· 336 165 * This function handles both MCOPY_ATOMIC_NORMAL and _CONTINUE for both shmem 337 166 * and anon, and for both shared and private VMAs. 338 167 */ 339 - int mfill_atomic_install_pte(pmd_t *dst_pmd, 340 - struct vm_area_struct *dst_vma, 341 - unsigned long dst_addr, struct page *page, 342 - bool newly_allocated, uffd_flags_t flags) 168 + static int mfill_atomic_install_pte(pmd_t *dst_pmd, 169 + struct vm_area_struct *dst_vma, 170 + unsigned long dst_addr, struct page *page, 171 + uffd_flags_t flags) 343 172 { 344 173 int ret; 345 174 struct mm_struct *dst_mm = dst_vma->vm_mm; ··· 383 212 goto out_unlock; 384 213 385 214 if (page_in_cache) { 386 - /* Usually, cache pages are already added to LRU */ 387 - if (newly_allocated) 388 - folio_add_lru(folio); 389 215 folio_add_file_rmap_pte(folio, page, dst_vma); 390 216 } else { 391 217 folio_add_new_anon_rmap(folio, dst_vma, dst_addr, RMAP_EXCLUSIVE); ··· 397 229 398 230 set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); 399 231 232 + if (page_in_cache) 233 + folio_unlock(folio); 234 + 400 235 /* No need to invalidate - it was non-present before */ 401 236 update_mmu_cache(dst_vma, dst_addr, dst_pte); 402 237 ret = 0; ··· 409 238 return ret; 410 239 } 411 240 412 - static int mfill_atomic_pte_copy(pmd_t *dst_pmd, 413 - struct vm_area_struct *dst_vma, 414 - unsigned long dst_addr, 415 - unsigned long src_addr, 416 - uffd_flags_t flags, 417 - struct folio **foliop) 241 + static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr) 418 242 { 419 243 void *kaddr; 420 244 int ret; 245 
+ 246 + kaddr = kmap_local_folio(folio, 0); 247 + /* 248 + * The read mmap_lock is held here. Despite the 249 + * mmap_lock being read recursive a deadlock is still 250 + * possible if a writer has taken a lock. For example: 251 + * 252 + * process A thread 1 takes read lock on own mmap_lock 253 + * process A thread 2 calls mmap, blocks taking write lock 254 + * process B thread 1 takes page fault, read lock on own mmap lock 255 + * process B thread 2 calls mmap, blocks taking write lock 256 + * process A thread 1 blocks taking read lock on process B 257 + * process B thread 1 blocks taking read lock on process A 258 + * 259 + * Disable page faults to prevent potential deadlock 260 + * and retry the copy outside the mmap_lock. 261 + */ 262 + pagefault_disable(); 263 + ret = copy_from_user(kaddr, (const void __user *) src_addr, 264 + PAGE_SIZE); 265 + pagefault_enable(); 266 + kunmap_local(kaddr); 267 + 268 + if (ret) 269 + return -EFAULT; 270 + 271 + flush_dcache_folio(folio); 272 + return ret; 273 + } 274 + 275 + static int mfill_copy_folio_retry(struct mfill_state *state, struct folio *folio) 276 + { 277 + unsigned long src_addr = state->src_addr; 278 + void *kaddr; 279 + int err; 280 + 281 + /* retry copying with mm_lock dropped */ 282 + mfill_put_vma(state); 283 + 284 + kaddr = kmap_local_folio(folio, 0); 285 + err = copy_from_user(kaddr, (const void __user *) src_addr, PAGE_SIZE); 286 + kunmap_local(kaddr); 287 + if (unlikely(err)) 288 + return -EFAULT; 289 + 290 + flush_dcache_folio(folio); 291 + 292 + /* reget VMA and PMD, they could change underneath us */ 293 + err = mfill_get_vma(state); 294 + if (err) 295 + return err; 296 + 297 + err = mfill_establish_pmd(state); 298 + if (err) 299 + return err; 300 + 301 + return 0; 302 + } 303 + 304 + static int __mfill_atomic_pte(struct mfill_state *state, 305 + const struct vm_uffd_ops *ops) 306 + { 307 + unsigned long dst_addr = state->dst_addr; 308 + unsigned long src_addr = state->src_addr; 309 + uffd_flags_t 
flags = state->flags; 421 310 struct folio *folio; 311 + int ret; 422 312 423 - if (!*foliop) { 424 - ret = -ENOMEM; 425 - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, dst_vma, 426 - dst_addr); 427 - if (!folio) 428 - goto out; 313 + folio = ops->alloc_folio(state->vma, state->dst_addr); 314 + if (!folio) 315 + return -ENOMEM; 429 316 430 - kaddr = kmap_local_folio(folio, 0); 317 + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) { 318 + ret = mfill_copy_folio_locked(folio, src_addr); 431 319 /* 432 - * The read mmap_lock is held here. Despite the 433 - * mmap_lock being read recursive a deadlock is still 434 - * possible if a writer has taken a lock. For example: 435 - * 436 - * process A thread 1 takes read lock on own mmap_lock 437 - * process A thread 2 calls mmap, blocks taking write lock 438 - * process B thread 1 takes page fault, read lock on own mmap lock 439 - * process B thread 2 calls mmap, blocks taking write lock 440 - * process A thread 1 blocks taking read lock on process B 441 - * process B thread 1 blocks taking read lock on process A 442 - * 443 - * Disable page faults to prevent potential deadlock 444 - * and retry the copy outside the mmap_lock. 320 + * Fallback to copy_from_user outside mmap_lock. 321 + * If retry is successful, mfill_copy_folio_locked() returns 322 + * with locks retaken by mfill_get_vma(). 323 + * If there was an error, we must mfill_put_vma() anyway and it 324 + * will take care of unlocking if needed. 
445 325 */ 446 - pagefault_disable(); 447 - ret = copy_from_user(kaddr, (const void __user *) src_addr, 448 - PAGE_SIZE); 449 - pagefault_enable(); 450 - kunmap_local(kaddr); 451 - 452 - /* fallback to copy_from_user outside mmap_lock */ 453 326 if (unlikely(ret)) { 454 - ret = -ENOENT; 455 - *foliop = folio; 456 - /* don't free the page */ 457 - goto out; 327 + ret = mfill_copy_folio_retry(state, folio); 328 + if (ret) 329 + goto err_folio_put; 458 330 } 459 - 460 - flush_dcache_folio(folio); 331 + } else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) { 332 + clear_user_highpage(&folio->page, state->dst_addr); 461 333 } else { 462 - folio = *foliop; 463 - *foliop = NULL; 334 + VM_WARN_ONCE(1, "Unknown UFFDIO operation, flags: %x", flags); 464 335 } 465 336 466 337 /* ··· 512 299 */ 513 300 __folio_mark_uptodate(folio); 514 301 515 - ret = -ENOMEM; 516 - if (mem_cgroup_charge(folio, dst_vma->vm_mm, GFP_KERNEL)) 517 - goto out_release; 302 + if (ops->filemap_add) { 303 + ret = ops->filemap_add(folio, state->vma, state->dst_addr); 304 + if (ret) 305 + goto err_folio_put; 306 + } 518 307 519 - ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, 520 - &folio->page, true, flags); 308 + ret = mfill_atomic_install_pte(state->pmd, state->vma, dst_addr, 309 + &folio->page, flags); 521 310 if (ret) 522 - goto out_release; 523 - out: 524 - return ret; 525 - out_release: 526 - folio_put(folio); 527 - goto out; 528 - } 529 - 530 - static int mfill_atomic_pte_zeroed_folio(pmd_t *dst_pmd, 531 - struct vm_area_struct *dst_vma, 532 - unsigned long dst_addr) 533 - { 534 - struct folio *folio; 535 - int ret = -ENOMEM; 536 - 537 - folio = vma_alloc_zeroed_movable_folio(dst_vma, dst_addr); 538 - if (!folio) 539 - return ret; 540 - 541 - if (mem_cgroup_charge(folio, dst_vma->vm_mm, GFP_KERNEL)) 542 - goto out_put; 543 - 544 - /* 545 - * The memory barrier inside __folio_mark_uptodate makes sure that 546 - * zeroing out the folio become visible before mapping the page 547 - 
* using set_pte_at(). See do_anonymous_page(). 548 - */ 549 - __folio_mark_uptodate(folio); 550 - 551 - ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, 552 - &folio->page, true, 0); 553 - if (ret) 554 - goto out_put; 311 + goto err_filemap_remove; 555 312 556 313 return 0; 557 - out_put: 314 + 315 + err_filemap_remove: 316 + if (ops->filemap_remove) 317 + ops->filemap_remove(folio, state->vma); 318 + err_folio_put: 558 319 folio_put(folio); 559 320 return ret; 560 321 } 561 322 562 - static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd, 563 - struct vm_area_struct *dst_vma, 564 - unsigned long dst_addr) 323 + static int mfill_atomic_pte_copy(struct mfill_state *state) 565 324 { 325 + const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma); 326 + 327 + /* 328 + * The normal page fault path for a MAP_PRIVATE mapping in a 329 + * file-backed VMA will invoke the fault, fill the hole in the file and 330 + * COW it right away. The result generates plain anonymous memory. 331 + * So when we are asked to fill a hole in a MAP_PRIVATE mapping, we'll 332 + * generate anonymous memory directly without actually filling the 333 + * hole. For the MAP_PRIVATE case the robustness check only happens in 334 + * the pagetable (to verify it's still none) and not in the page cache. 
335 + */ 336 + if (!(state->vma->vm_flags & VM_SHARED)) 337 + ops = &anon_uffd_ops; 338 + 339 + return __mfill_atomic_pte(state, ops); 340 + } 341 + 342 + static int mfill_atomic_pte_zeroed_folio(struct mfill_state *state) 343 + { 344 + const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma); 345 + 346 + return __mfill_atomic_pte(state, ops); 347 + } 348 + 349 + static int mfill_atomic_pte_zeropage(struct mfill_state *state) 350 + { 351 + struct vm_area_struct *dst_vma = state->vma; 352 + unsigned long dst_addr = state->dst_addr; 353 + pmd_t *dst_pmd = state->pmd; 566 354 pte_t _dst_pte, *dst_pte; 567 355 spinlock_t *ptl; 568 356 int ret; 569 357 570 - if (mm_forbids_zeropage(dst_vma->vm_mm)) 571 - return mfill_atomic_pte_zeroed_folio(dst_pmd, dst_vma, dst_addr); 358 + if (mm_forbids_zeropage(dst_vma->vm_mm) || 359 + (dst_vma->vm_flags & VM_SHARED)) 360 + return mfill_atomic_pte_zeroed_folio(state); 572 361 573 362 _dst_pte = pte_mkspecial(pfn_pte(zero_pfn(dst_addr), 574 363 dst_vma->vm_page_prot)); ··· 596 381 } 597 382 598 383 /* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). 
*/ 599 - static int mfill_atomic_pte_continue(pmd_t *dst_pmd, 600 - struct vm_area_struct *dst_vma, 601 - unsigned long dst_addr, 602 - uffd_flags_t flags) 384 + static int mfill_atomic_pte_continue(struct mfill_state *state) 603 385 { 604 - struct inode *inode = file_inode(dst_vma->vm_file); 386 + struct vm_area_struct *dst_vma = state->vma; 387 + const struct vm_uffd_ops *ops = vma_uffd_ops(dst_vma); 388 + unsigned long dst_addr = state->dst_addr; 605 389 pgoff_t pgoff = linear_page_index(dst_vma, dst_addr); 390 + struct inode *inode = file_inode(dst_vma->vm_file); 391 + uffd_flags_t flags = state->flags; 392 + pmd_t *dst_pmd = state->pmd; 606 393 struct folio *folio; 607 394 struct page *page; 608 395 int ret; 609 396 610 - ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC); 611 - /* Our caller expects us to return -EFAULT if we failed to find folio */ 612 - if (ret == -ENOENT) 613 - ret = -EFAULT; 614 - if (ret) 615 - goto out; 616 - if (!folio) { 617 - ret = -EFAULT; 618 - goto out; 397 + if (!ops) { 398 + VM_WARN_ONCE(1, "UFFDIO_CONTINUE for unsupported VMA"); 399 + return -EOPNOTSUPP; 619 400 } 401 + 402 + folio = ops->get_folio_noalloc(inode, pgoff); 403 + /* Our caller expects us to return -EFAULT if we failed to find folio */ 404 + if (IS_ERR_OR_NULL(folio)) 405 + return -EFAULT; 620 406 621 407 page = folio_file_page(folio, pgoff); 622 408 if (PageHWPoison(page)) { ··· 626 410 } 627 411 628 412 ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, 629 - page, false, flags); 413 + page, flags); 630 414 if (ret) 631 415 goto out_release; 632 416 633 - folio_unlock(folio); 634 - ret = 0; 635 - out: 636 - return ret; 417 + return 0; 418 + 637 419 out_release: 638 420 folio_unlock(folio); 639 421 folio_put(folio); 640 - goto out; 422 + return ret; 641 423 } 642 424 643 425 /* Handles UFFDIO_POISON for all non-hugetlb VMAs. 
*/ 644 - static int mfill_atomic_pte_poison(pmd_t *dst_pmd, 645 - struct vm_area_struct *dst_vma, 646 - unsigned long dst_addr, 647 - uffd_flags_t flags) 426 + static int mfill_atomic_pte_poison(struct mfill_state *state) 648 427 { 649 - int ret; 428 + struct vm_area_struct *dst_vma = state->vma; 650 429 struct mm_struct *dst_mm = dst_vma->vm_mm; 430 + unsigned long dst_addr = state->dst_addr; 431 + pmd_t *dst_pmd = state->pmd; 651 432 pte_t _dst_pte, *dst_pte; 652 433 spinlock_t *ptl; 434 + int ret; 653 435 654 436 _dst_pte = make_pte_marker(PTE_MARKER_POISONED); 655 437 ret = -EAGAIN; ··· 674 460 pte_unmap_unlock(dst_pte, ptl); 675 461 out: 676 462 return ret; 677 - } 678 - 679 - static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address) 680 - { 681 - pgd_t *pgd; 682 - p4d_t *p4d; 683 - pud_t *pud; 684 - 685 - pgd = pgd_offset(mm, address); 686 - p4d = p4d_alloc(mm, pgd, address); 687 - if (!p4d) 688 - return NULL; 689 - pud = pud_alloc(mm, p4d, address); 690 - if (!pud) 691 - return NULL; 692 - /* 693 - * Note that we didn't run this because the pmd was 694 - * missing, the *pmd may be already established and in 695 - * turn it may also be a trans_huge_pmd. 
696 - */ 697 - return pmd_alloc(mm, pud, address); 698 463 } 699 464 700 465 #ifdef CONFIG_HUGETLB_PAGE ··· 850 657 uffd_flags_t flags); 851 658 #endif /* CONFIG_HUGETLB_PAGE */ 852 659 853 - static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd, 854 - struct vm_area_struct *dst_vma, 855 - unsigned long dst_addr, 856 - unsigned long src_addr, 857 - uffd_flags_t flags, 858 - struct folio **foliop) 660 + static __always_inline ssize_t mfill_atomic_pte(struct mfill_state *state) 859 661 { 860 - ssize_t err; 662 + uffd_flags_t flags = state->flags; 861 663 862 - if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) { 863 - return mfill_atomic_pte_continue(dst_pmd, dst_vma, 864 - dst_addr, flags); 865 - } else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) { 866 - return mfill_atomic_pte_poison(dst_pmd, dst_vma, 867 - dst_addr, flags); 868 - } 664 + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) 665 + return mfill_atomic_pte_continue(state); 666 + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) 667 + return mfill_atomic_pte_poison(state); 668 + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) 669 + return mfill_atomic_pte_copy(state); 670 + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) 671 + return mfill_atomic_pte_zeropage(state); 869 672 870 - /* 871 - * The normal page fault path for a shmem will invoke the 872 - * fault, fill the hole in the file and COW it right away. The 873 - * result generates plain anonymous memory. So when we are 874 - * asked to fill an hole in a MAP_PRIVATE shmem mapping, we'll 875 - * generate anonymous memory directly without actually filling 876 - * the hole. For the MAP_PRIVATE case the robustness check 877 - * only happens in the pagetable (to verify it's still none) 878 - * and not in the radix tree. 
879 - */ 880 - if (!(dst_vma->vm_flags & VM_SHARED)) { 881 - if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) 882 - err = mfill_atomic_pte_copy(dst_pmd, dst_vma, 883 - dst_addr, src_addr, 884 - flags, foliop); 885 - else 886 - err = mfill_atomic_pte_zeropage(dst_pmd, 887 - dst_vma, dst_addr); 888 - } else { 889 - err = shmem_mfill_atomic_pte(dst_pmd, dst_vma, 890 - dst_addr, src_addr, 891 - flags, foliop); 892 - } 893 - 894 - return err; 673 + VM_WARN_ONCE(1, "Unknown UFFDIO operation, flags: %x", flags); 674 + return -EOPNOTSUPP; 895 675 } 896 676 897 677 static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, ··· 873 707 unsigned long len, 874 708 uffd_flags_t flags) 875 709 { 876 - struct mm_struct *dst_mm = ctx->mm; 877 - struct vm_area_struct *dst_vma; 710 + struct mfill_state state = (struct mfill_state){ 711 + .ctx = ctx, 712 + .dst_start = dst_start, 713 + .src_start = src_start, 714 + .flags = flags, 715 + .len = len, 716 + .src_addr = src_start, 717 + .dst_addr = dst_start, 718 + }; 719 + long copied = 0; 878 720 ssize_t err; 879 - pmd_t *dst_pmd; 880 - unsigned long src_addr, dst_addr; 881 - long copied; 882 - struct folio *folio; 883 721 884 722 /* 885 723 * Sanitize the command parameters: ··· 895 725 VM_WARN_ON_ONCE(src_start + len <= src_start); 896 726 VM_WARN_ON_ONCE(dst_start + len <= dst_start); 897 727 898 - src_addr = src_start; 899 - dst_addr = dst_start; 900 - copied = 0; 901 - folio = NULL; 902 - retry: 903 - /* 904 - * Make sure the vma is not shared, that the dst range is 905 - * both valid and fully within a single existing vma. 906 - */ 907 - dst_vma = uffd_mfill_lock(dst_mm, dst_start, len); 908 - if (IS_ERR(dst_vma)) { 909 - err = PTR_ERR(dst_vma); 728 + err = mfill_get_vma(&state); 729 + if (err) 910 730 goto out; 911 - } 912 - 913 - /* 914 - * If memory mappings are changing because of non-cooperative 915 - * operation (e.g. 
mremap) running in parallel, bail out and 916 - * request the user to retry later 917 - */ 918 - down_read(&ctx->map_changing_lock); 919 - err = -EAGAIN; 920 - if (atomic_read(&ctx->mmap_changing)) 921 - goto out_unlock; 922 - 923 - err = -EINVAL; 924 - /* 925 - * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but 926 - * it will overwrite vm_ops, so vma_is_anonymous must return false. 927 - */ 928 - if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) && 929 - dst_vma->vm_flags & VM_SHARED)) 930 - goto out_unlock; 931 - 932 - /* 933 - * validate 'mode' now that we know the dst_vma: don't allow 934 - * a wrprotect copy if the userfaultfd didn't register as WP. 935 - */ 936 - if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP)) 937 - goto out_unlock; 938 731 939 732 /* 940 733 * If this is a HUGETLB vma, pass off to appropriate routine 941 734 */ 942 - if (is_vm_hugetlb_page(dst_vma)) 943 - return mfill_atomic_hugetlb(ctx, dst_vma, dst_start, 735 + if (is_vm_hugetlb_page(state.vma)) 736 + return mfill_atomic_hugetlb(ctx, state.vma, dst_start, 944 737 src_start, len, flags); 945 738 946 - if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)) 947 - goto out_unlock; 948 - if (!vma_is_shmem(dst_vma) && 949 - uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) 950 - goto out_unlock; 739 + while (state.src_addr < src_start + len) { 740 + VM_WARN_ON_ONCE(state.dst_addr >= dst_start + len); 951 741 952 - while (src_addr < src_start + len) { 953 - pmd_t dst_pmdval; 742 + err = mfill_establish_pmd(&state); 743 + if (err) 744 + break; 954 745 955 - VM_WARN_ON_ONCE(dst_addr >= dst_start + len); 956 - 957 - dst_pmd = mm_alloc_pmd(dst_mm, dst_addr); 958 - if (unlikely(!dst_pmd)) { 959 - err = -ENOMEM; 960 - break; 961 - } 962 - 963 - dst_pmdval = pmdp_get_lockless(dst_pmd); 964 - if (unlikely(pmd_none(dst_pmdval)) && 965 - unlikely(__pte_alloc(dst_mm, dst_pmd))) { 966 - err = -ENOMEM; 967 - break; 968 - } 969 - dst_pmdval = pmdp_get_lockless(dst_pmd); 970 
- /* 971 - * If the dst_pmd is THP don't override it and just be strict. 972 - * (This includes the case where the PMD used to be THP and 973 - * changed back to none after __pte_alloc().) 974 - */ 975 - if (unlikely(!pmd_present(dst_pmdval) || 976 - pmd_trans_huge(dst_pmdval))) { 977 - err = -EEXIST; 978 - break; 979 - } 980 - if (unlikely(pmd_bad(dst_pmdval))) { 981 - err = -EFAULT; 982 - break; 983 - } 984 746 /* 985 747 * For shmem mappings, khugepaged is allowed to remove page 986 748 * tables under us; pte_offset_map_lock() will deal with that. 987 749 */ 988 750 989 - err = mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, 990 - src_addr, flags, &folio); 751 + err = mfill_atomic_pte(&state); 991 752 cond_resched(); 992 753 993 - if (unlikely(err == -ENOENT)) { 994 - void *kaddr; 995 - 996 - up_read(&ctx->map_changing_lock); 997 - uffd_mfill_unlock(dst_vma); 998 - VM_WARN_ON_ONCE(!folio); 999 - 1000 - kaddr = kmap_local_folio(folio, 0); 1001 - err = copy_from_user(kaddr, 1002 - (const void __user *) src_addr, 1003 - PAGE_SIZE); 1004 - kunmap_local(kaddr); 1005 - if (unlikely(err)) { 1006 - err = -EFAULT; 1007 - goto out; 1008 - } 1009 - flush_dcache_folio(folio); 1010 - goto retry; 1011 - } else 1012 - VM_WARN_ON_ONCE(folio); 1013 - 1014 754 if (!err) { 1015 - dst_addr += PAGE_SIZE; 1016 - src_addr += PAGE_SIZE; 755 + state.dst_addr += PAGE_SIZE; 756 + state.src_addr += PAGE_SIZE; 1017 757 copied += PAGE_SIZE; 1018 758 1019 759 if (fatal_signal_pending(current)) ··· 933 853 break; 934 854 } 935 855 936 - out_unlock: 937 - up_read(&ctx->map_changing_lock); 938 - uffd_mfill_unlock(dst_vma); 856 + mfill_put_vma(&state); 939 857 out: 940 - if (folio) 941 - folio_put(folio); 942 858 VM_WARN_ON_ONCE(copied < 0); 943 859 VM_WARN_ON_ONCE(err > 0); 944 860 VM_WARN_ON_ONCE(!copied && !err); ··· 2012 1936 VM_WARN_ON_ONCE(err > 0); 2013 1937 VM_WARN_ON_ONCE(!moved && !err); 2014 1938 return moved ? 
moved : err; 1939 + } 1940 + 1941 + bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags, 1942 + bool wp_async) 1943 + { 1944 + const struct vm_uffd_ops *ops = vma_uffd_ops(vma); 1945 + 1946 + if (vma->vm_flags & VM_DROPPABLE) 1947 + return false; 1948 + 1949 + vm_flags &= __VM_UFFD_FLAGS; 1950 + 1951 + /* 1952 + * If WP is the only mode enabled and context is wp async, allow any 1953 + * memory type. 1954 + */ 1955 + if (wp_async && (vm_flags == VM_UFFD_WP)) 1956 + return true; 1957 + 1958 + /* For any other mode reject VMAs that don't implement vm_uffd_ops */ 1959 + if (!ops) 1960 + return false; 1961 + 1962 + /* 1963 + * If user requested uffd-wp but not enabled pte markers for 1964 + * uffd-wp, then only anonymous memory is supported 1965 + */ 1966 + if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP) && 1967 + !vma_is_anonymous(vma)) 1968 + return false; 1969 + 1970 + return ops->can_userfault(vma, vm_flags); 2015 1971 } 2016 1972 2017 1973 static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
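
The order of checks in `vma_can_userfault()` above can be modeled in user space. This is a minimal sketch rather than kernel code: `struct vma`, `struct uffd_ops`, the `FLAG_*` constants, `accept_all()` and `can_userfault_model()` are hypothetical stand-ins, and `uffd_supports_wp_marker()` is reduced to a boolean parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits standing in for VM_DROPPABLE and the VM_UFFD_* modes. */
#define FLAG_DROPPABLE    0x1u
#define FLAG_UFFD_MISSING 0x2u
#define FLAG_UFFD_WP      0x4u
#define FLAG_UFFD_MINOR   0x8u
#define UFFD_MASK (FLAG_UFFD_MISSING | FLAG_UFFD_WP | FLAG_UFFD_MINOR)

struct vma;

/* Hypothetical per-memory-type operations table (models vm_uffd_ops). */
struct uffd_ops {
	bool (*can_userfault)(const struct vma *vma, unsigned int flags);
};

struct vma {
	unsigned int flags;
	bool anonymous;
	const struct uffd_ops *ops;	/* NULL if the memory type has no uffd support */
};

static bool can_userfault_model(const struct vma *vma, unsigned int req,
				bool wp_async, bool wp_marker_supported)
{
	/* Droppable mappings are always rejected. */
	if (vma->flags & FLAG_DROPPABLE)
		return false;

	req &= UFFD_MASK;

	/* WP-only registration in async mode is allowed for any memory type. */
	if (wp_async && req == FLAG_UFFD_WP)
		return true;

	/* Every other mode needs an ops table. */
	if (!vma->ops)
		return false;

	/* Without marker support, write-protect works only on anonymous memory. */
	if (!wp_marker_supported && (req & FLAG_UFFD_WP) && !vma->anonymous)
		return false;

	/* Finally, defer to the memory type's own decision. */
	return vma->ops->can_userfault(vma, req);
}

/* Hypothetical callback that accepts every request. */
static bool accept_all(const struct vma *vma, unsigned int flags)
{
	(void)vma;
	(void)flags;
	return true;
}
```

In this model the generic checks run first and the per-type callback is consulted last, so a memory type without an ops table is still accepted for the async write-protect-only case.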

-10

mm/util.c

···
 }
 EXPORT_SYMBOL(compat_vma_mmap);

-int __vma_check_mmap_hook(struct vm_area_struct *vma)
-{
-	/* vm_ops->mapped is not valid if mmap() is specified. */
-	if (vma->vm_ops && WARN_ON_ONCE(vma->vm_ops->mapped))
-		return -EINVAL;
-
-	return 0;
-}
-EXPORT_SYMBOL(__vma_check_mmap_hook);
-
 static void set_ps_flags(struct page_snapshot *ps, const struct folio *folio,
 			 const struct page *page)
 {

+221 -82

mm/vmscan.c

··· 269 269 } 270 270 #endif 271 271 272 - /* for_each_managed_zone_pgdat - helper macro to iterate over all managed zones in a pgdat up to 273 - * and including the specified highidx 274 - * @zone: The current zone in the iterator 275 - * @pgdat: The pgdat which node_zones are being iterated 276 - * @idx: The index variable 277 - * @highidx: The index of the highest zone to return 278 - * 279 - * This macro iterates through all managed zones up to and including the specified highidx. 280 - * The zone iterator enters an invalid state after macro call and must be reinitialized 281 - * before it can be used again. 282 - */ 283 - #define for_each_managed_zone_pgdat(zone, pgdat, idx, highidx) \ 284 - for ((idx) = 0, (zone) = (pgdat)->node_zones; \ 285 - (idx) <= (highidx); \ 286 - (idx)++, (zone)++) \ 287 - if (!managed_zone(zone)) \ 288 - continue; \ 289 - else 290 - 291 272 static void set_task_reclaim_state(struct task_struct *task, 292 273 struct reclaim_state *rs) 293 274 { ··· 390 409 * @lru: lru to use 391 410 * @zone_idx: zones to consider (use MAX_NR_ZONES - 1 for the whole LRU list) 392 411 */ 393 - static unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, 394 - int zone_idx) 412 + unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx) 395 413 { 396 414 unsigned long size = 0; 397 415 int zid; ··· 1811 1831 folio_get(folio); 1812 1832 lruvec = folio_lruvec_lock_irq(folio); 1813 1833 lruvec_del_folio(lruvec, folio); 1814 - unlock_page_lruvec_irq(lruvec); 1834 + lruvec_unlock_irq(lruvec); 1815 1835 ret = true; 1816 1836 } 1817 1837 ··· 1865 1885 /* 1866 1886 * move_folios_to_lru() moves folios from private @list to appropriate LRU list. 1867 1887 * 1868 - * Returns the number of pages moved to the given lruvec. 1888 + * Returns the number of pages moved to the appropriate lruvec. 1889 + * 1890 + * Note: The caller must not hold any lruvec lock. 
1869 1891 */ 1870 - static unsigned int move_folios_to_lru(struct lruvec *lruvec, 1871 - struct list_head *list) 1892 + static unsigned int move_folios_to_lru(struct list_head *list) 1872 1893 { 1873 1894 int nr_pages, nr_moved = 0; 1895 + struct lruvec *lruvec = NULL; 1874 1896 struct folio_batch free_folios; 1875 1897 1876 1898 folio_batch_init(&free_folios); 1877 1899 while (!list_empty(list)) { 1878 1900 struct folio *folio = lru_to_folio(list); 1879 1901 1902 + lruvec = folio_lruvec_relock_irq(folio, lruvec); 1880 1903 VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); 1881 1904 list_del(&folio->lru); 1882 1905 if (unlikely(!folio_evictable(folio))) { 1883 - spin_unlock_irq(&lruvec->lru_lock); 1906 + lruvec_unlock_irq(lruvec); 1884 1907 folio_putback_lru(folio); 1885 - spin_lock_irq(&lruvec->lru_lock); 1908 + lruvec = NULL; 1886 1909 continue; 1887 1910 } 1888 1911 ··· 1907 1924 1908 1925 folio_unqueue_deferred_split(folio); 1909 1926 if (folio_batch_add(&free_folios, folio) == 0) { 1910 - spin_unlock_irq(&lruvec->lru_lock); 1927 + lruvec_unlock_irq(lruvec); 1911 1928 mem_cgroup_uncharge_folios(&free_folios); 1912 1929 free_unref_folios(&free_folios); 1913 - spin_lock_irq(&lruvec->lru_lock); 1930 + lruvec = NULL; 1914 1931 } 1915 1932 1916 1933 continue; 1917 1934 } 1918 1935 1919 - /* 1920 - * All pages were isolated from the same lruvec (and isolation 1921 - * inhibits memcg migration). 
1922 - */ 1923 - VM_BUG_ON_FOLIO(!folio_matches_lruvec(folio, lruvec), folio); 1924 1936 lruvec_add_folio(lruvec, folio); 1925 1937 nr_pages = folio_nr_pages(folio); 1926 1938 nr_moved += nr_pages; ··· 1923 1945 workingset_age_nonresident(lruvec, nr_pages); 1924 1946 } 1925 1947 1948 + if (lruvec) 1949 + lruvec_unlock_irq(lruvec); 1950 + 1926 1951 if (free_folios.nr) { 1927 - spin_unlock_irq(&lruvec->lru_lock); 1928 1952 mem_cgroup_uncharge_folios(&free_folios); 1929 1953 free_unref_folios(&free_folios); 1930 - spin_lock_irq(&lruvec->lru_lock); 1931 1954 } 1932 1955 1933 1956 return nr_moved; ··· 1977 1998 1978 1999 lru_add_drain(); 1979 2000 1980 - spin_lock_irq(&lruvec->lru_lock); 2001 + lruvec_lock_irq(lruvec); 1981 2002 1982 2003 nr_taken = isolate_lru_folios(nr_to_scan, lruvec, &folio_list, 1983 2004 &nr_scanned, sc, lru); ··· 1987 2008 mod_lruvec_state(lruvec, item, nr_scanned); 1988 2009 mod_lruvec_state(lruvec, PGSCAN_ANON + file, nr_scanned); 1989 2010 1990 - spin_unlock_irq(&lruvec->lru_lock); 2011 + lruvec_unlock_irq(lruvec); 1991 2012 1992 2013 if (nr_taken == 0) 1993 2014 return 0; ··· 1995 2016 nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false, 1996 2017 lruvec_memcg(lruvec)); 1997 2018 1998 - spin_lock_irq(&lruvec->lru_lock); 1999 - move_folios_to_lru(lruvec, &folio_list); 2019 + move_folios_to_lru(&folio_list); 2000 2020 2001 2021 mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc), 2002 2022 stat.nr_demoted); 2003 - __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); 2023 + mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); 2004 2024 item = PGSTEAL_KSWAPD + reclaimer_offset(sc); 2005 2025 mod_lruvec_state(lruvec, item, nr_reclaimed); 2006 2026 mod_lruvec_state(lruvec, PGSTEAL_ANON + file, nr_reclaimed); 2007 2027 2028 + lruvec_lock_irq(lruvec); 2008 2029 lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout, 2009 2030 nr_scanned - nr_reclaimed); 2010 2031 ··· 2083 2104 2084 2105 
lru_add_drain(); 2085 2106 2086 - spin_lock_irq(&lruvec->lru_lock); 2107 + lruvec_lock_irq(lruvec); 2087 2108 2088 2109 nr_taken = isolate_lru_folios(nr_to_scan, lruvec, &l_hold, 2089 2110 &nr_scanned, sc, lru); ··· 2092 2113 2093 2114 mod_lruvec_state(lruvec, PGREFILL, nr_scanned); 2094 2115 2095 - spin_unlock_irq(&lruvec->lru_lock); 2116 + lruvec_unlock_irq(lruvec); 2096 2117 2097 2118 while (!list_empty(&l_hold)) { 2098 2119 struct folio *folio; ··· 2141 2162 /* 2142 2163 * Move folios back to the lru list. 2143 2164 */ 2144 - spin_lock_irq(&lruvec->lru_lock); 2165 + nr_activate = move_folios_to_lru(&l_active); 2166 + nr_deactivate = move_folios_to_lru(&l_inactive); 2145 2167 2146 - nr_activate = move_folios_to_lru(lruvec, &l_active); 2147 - nr_deactivate = move_folios_to_lru(lruvec, &l_inactive); 2148 - 2149 - __count_vm_events(PGDEACTIVATE, nr_deactivate); 2168 + count_vm_events(PGDEACTIVATE, nr_deactivate); 2150 2169 count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate); 2170 + mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); 2151 2171 2152 - __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); 2153 - 2172 + lruvec_lock_irq(lruvec); 2154 2173 lru_note_cost_unlock_irq(lruvec, file, 0, nr_rotated); 2155 2174 trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate, 2156 2175 nr_deactivate, nr_rotated, sc->priority, file); ··· 2863 2886 return NULL; 2864 2887 2865 2888 clear_bit(key, &mm->lru_gen.bitmap); 2889 + mmgrab(mm); 2866 2890 2867 - return mmget_not_zero(mm) ? 
mm : NULL; 2891 + return mm; 2868 2892 } 2869 2893 2870 2894 void lru_gen_add_mm(struct mm_struct *mm) ··· 3065 3087 reset_bloom_filter(mm_state, walk->seq + 1); 3066 3088 3067 3089 if (*iter) 3068 - mmput_async(*iter); 3090 + mmdrop(*iter); 3069 3091 3070 3092 *iter = mm; 3071 3093 ··· 3420 3442 if (folio_nid(folio) != pgdat->node_id) 3421 3443 return NULL; 3422 3444 3445 + rcu_read_lock(); 3423 3446 if (folio_memcg(folio) != memcg) 3424 - return NULL; 3447 + folio = NULL; 3448 + rcu_read_unlock(); 3425 3449 3426 3450 return folio; 3427 3451 } ··· 3783 3803 } 3784 3804 3785 3805 if (walk->batched) { 3786 - spin_lock_irq(&lruvec->lru_lock); 3806 + lruvec_lock_irq(lruvec); 3787 3807 reset_batch_size(walk); 3788 - spin_unlock_irq(&lruvec->lru_lock); 3808 + lruvec_unlock_irq(lruvec); 3789 3809 } 3790 3810 3791 3811 cond_resched(); ··· 3945 3965 if (seq < READ_ONCE(lrugen->max_seq)) 3946 3966 return false; 3947 3967 3948 - spin_lock_irq(&lruvec->lru_lock); 3968 + lruvec_lock_irq(lruvec); 3949 3969 3950 3970 VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); 3951 3971 ··· 3960 3980 if (inc_min_seq(lruvec, type, swappiness)) 3961 3981 continue; 3962 3982 3963 - spin_unlock_irq(&lruvec->lru_lock); 3983 + lruvec_unlock_irq(lruvec); 3964 3984 cond_resched(); 3965 3985 goto restart; 3966 3986 } ··· 3995 4015 /* make sure preceding modifications appear */ 3996 4016 smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1); 3997 4017 unlock: 3998 - spin_unlock_irq(&lruvec->lru_lock); 4018 + lruvec_unlock_irq(lruvec); 3999 4019 4000 4020 return success; 4001 4021 } ··· 4193 4213 unsigned long addr = pvmw->address; 4194 4214 struct vm_area_struct *vma = pvmw->vma; 4195 4215 struct folio *folio = pfn_folio(pvmw->pfn); 4196 - struct mem_cgroup *memcg = folio_memcg(folio); 4216 + struct mem_cgroup *memcg; 4197 4217 struct pglist_data *pgdat = folio_pgdat(folio); 4198 - struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); 4199 - struct lru_gen_mm_state *mm_state = get_mm_state(lruvec); 4200 
- DEFINE_MAX_SEQ(lruvec); 4201 - int gen = lru_gen_from_seq(max_seq); 4218 + struct lruvec *lruvec; 4219 + struct lru_gen_mm_state *mm_state; 4220 + unsigned long max_seq; 4221 + int gen; 4202 4222 4203 4223 lockdep_assert_held(pvmw->ptl); 4204 4224 VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio); ··· 4232 4252 end = addr + MIN_LRU_BATCH * PAGE_SIZE / 2; 4233 4253 } 4234 4254 } 4255 + 4256 + memcg = get_mem_cgroup_from_folio(folio); 4257 + lruvec = mem_cgroup_lruvec(memcg, pgdat); 4258 + max_seq = READ_ONCE((lruvec)->lrugen.max_seq); 4259 + gen = lru_gen_from_seq(max_seq); 4260 + mm_state = get_mm_state(lruvec); 4235 4261 4236 4262 lazy_mmu_mode_enable(); 4237 4263 ··· 4287 4301 /* feedback from rmap walkers to page table walkers */ 4288 4302 if (mm_state && suitable_to_scan(i, young)) 4289 4303 update_bloom_filter(mm_state, max_seq, pvmw->pmd); 4304 + 4305 + mem_cgroup_put(memcg); 4290 4306 4291 4307 return true; 4292 4308 } ··· 4423 4435 /* see the comment on MEMCG_NR_GENS */ 4424 4436 if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_HEAD) 4425 4437 lru_gen_rotate_memcg(lruvec, MEMCG_LRU_HEAD); 4438 + } 4439 + 4440 + bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg, int nid) 4441 + { 4442 + struct lruvec *lruvec = get_lruvec(memcg, nid); 4443 + int type; 4444 + 4445 + for (type = 0; type < ANON_AND_FILE; type++) { 4446 + if (get_nr_gens(lruvec, type) != MAX_NR_GENS) 4447 + return false; 4448 + } 4449 + 4450 + return true; 4451 + } 4452 + 4453 + static void try_to_inc_max_seq_nowalk(struct mem_cgroup *memcg, 4454 + struct lruvec *lruvec) 4455 + { 4456 + struct lru_gen_mm_list *mm_list = get_mm_list(memcg); 4457 + struct lru_gen_mm_state *mm_state = get_mm_state(lruvec); 4458 + int swappiness = mem_cgroup_swappiness(memcg); 4459 + DEFINE_MAX_SEQ(lruvec); 4460 + bool success = false; 4461 + 4462 + /* 4463 + * We are not iterating the mm_list here, updating mm_state->seq is just 4464 + * to make mm walkers work properly. 
4465 + */ 4466 + if (mm_state) { 4467 + spin_lock(&mm_list->lock); 4468 + VM_WARN_ON_ONCE(mm_state->seq + 1 < max_seq); 4469 + if (max_seq > mm_state->seq) { 4470 + WRITE_ONCE(mm_state->seq, mm_state->seq + 1); 4471 + success = true; 4472 + } 4473 + spin_unlock(&mm_list->lock); 4474 + } else { 4475 + success = true; 4476 + } 4477 + 4478 + if (success) 4479 + inc_max_seq(lruvec, max_seq, swappiness); 4480 + } 4481 + 4482 + /* 4483 + * We need to ensure that the folios of child memcg can be reparented to the 4484 + * same gen of the parent memcg, so the gens of the parent memcg needed be 4485 + * incremented to the MAX_NR_GENS before reparenting. 4486 + */ 4487 + void max_lru_gen_memcg(struct mem_cgroup *memcg, int nid) 4488 + { 4489 + struct lruvec *lruvec = get_lruvec(memcg, nid); 4490 + int type; 4491 + 4492 + for (type = 0; type < ANON_AND_FILE; type++) { 4493 + while (get_nr_gens(lruvec, type) < MAX_NR_GENS) { 4494 + try_to_inc_max_seq_nowalk(memcg, lruvec); 4495 + cond_resched(); 4496 + } 4497 + } 4498 + } 4499 + 4500 + /* 4501 + * Compared to traditional LRU, MGLRU faces the following challenges: 4502 + * 4503 + * 1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, the 4504 + * number of generations of the parent and child memcg may be different, 4505 + * so we cannot simply transfer MGLRU folios in the child memcg to the 4506 + * parent memcg as we did for traditional LRU folios. 4507 + * 2. The generation information is stored in folio->flags, but we cannot 4508 + * traverse these folios while holding the lru lock, otherwise it may 4509 + * cause softlockup. 4510 + * 3. In walk_update_folio(), the gen of folio and corresponding lru size 4511 + * may be updated, but the folio is not immediately moved to the 4512 + * corresponding lru list. Therefore, there may be folios of different 4513 + * generations on an LRU list. 4514 + * 4. 
In lru_gen_del_folio(), the generation to which the folio belongs is 4515 + * found based on the generation information in folio->flags, and the 4516 + * corresponding LRU size will be updated. Therefore, we need to update 4517 + * the lru size correctly during reparenting, otherwise the lru size may 4518 + * be updated incorrectly in lru_gen_del_folio(). 4519 + * 4520 + * Finally, we choose a compromise method, which is to splice the lru list in 4521 + * the child memcg to the lru list of the same generation in the parent memcg 4522 + * during reparenting. 4523 + * 4524 + * The same generation has different meanings in the parent and child memcg, 4525 + * so this compromise method will cause the LRU inversion problem. But as the 4526 + * system runs, this problem will be fixed automatically. 4527 + */ 4528 + static void __lru_gen_reparent_memcg(struct lruvec *child_lruvec, struct lruvec *parent_lruvec, 4529 + int zone, int type) 4530 + { 4531 + struct lru_gen_folio *child_lrugen, *parent_lrugen; 4532 + enum lru_list lru = type * LRU_INACTIVE_FILE; 4533 + int i; 4534 + 4535 + child_lrugen = &child_lruvec->lrugen; 4536 + parent_lrugen = &parent_lruvec->lrugen; 4537 + 4538 + for (i = 0; i < get_nr_gens(child_lruvec, type); i++) { 4539 + int gen = lru_gen_from_seq(child_lrugen->max_seq - i); 4540 + long nr_pages = child_lrugen->nr_pages[gen][type][zone]; 4541 + int child_lru_active = lru_gen_is_active(child_lruvec, gen) ? LRU_ACTIVE : 0; 4542 + int parent_lru_active = lru_gen_is_active(parent_lruvec, gen) ? 
LRU_ACTIVE : 0; 4543 + 4544 + /* Assuming that child pages are colder than parent pages */ 4545 + list_splice_tail_init(&child_lrugen->folios[gen][type][zone], 4546 + &parent_lrugen->folios[gen][type][zone]); 4547 + 4548 + WRITE_ONCE(child_lrugen->nr_pages[gen][type][zone], 0); 4549 + WRITE_ONCE(parent_lrugen->nr_pages[gen][type][zone], 4550 + parent_lrugen->nr_pages[gen][type][zone] + nr_pages); 4551 + 4552 + if (lru_gen_is_active(child_lruvec, gen) != lru_gen_is_active(parent_lruvec, gen)) { 4553 + __update_lru_size(child_lruvec, lru + child_lru_active, zone, -nr_pages); 4554 + __update_lru_size(parent_lruvec, lru + parent_lru_active, zone, nr_pages); 4555 + } 4556 + } 4557 + } 4558 + 4559 + void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid) 4560 + { 4561 + struct lruvec *child_lruvec, *parent_lruvec; 4562 + int type, zid; 4563 + struct zone *zone; 4564 + enum lru_list lru; 4565 + 4566 + child_lruvec = get_lruvec(memcg, nid); 4567 + parent_lruvec = get_lruvec(parent, nid); 4568 + 4569 + for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1) 4570 + for (type = 0; type < ANON_AND_FILE; type++) 4571 + __lru_gen_reparent_memcg(child_lruvec, parent_lruvec, zid, type); 4572 + 4573 + for_each_lru(lru) { 4574 + for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1) { 4575 + unsigned long size = mem_cgroup_get_zone_lru_size(child_lruvec, lru, zid); 4576 + 4577 + mem_cgroup_update_lru_size(parent_lruvec, lru, zid, size); 4578 + } 4579 + } 4426 4580 } 4427 4581 4428 4582 #endif /* CONFIG_MEMCG */ ··· 4760 4630 static int get_tier_idx(struct lruvec *lruvec, int type) 4761 4631 { 4762 4632 int tier; 4763 - struct ctrl_pos sp, pv; 4633 + struct ctrl_pos sp, pv = {}; 4764 4634 4765 4635 /* 4766 4636 * To leave a margin for fluctuations, use a larger gain factor (2:3). 
··· 4779 4649 4780 4650 static int get_type_to_scan(struct lruvec *lruvec, int swappiness) 4781 4651 { 4782 - struct ctrl_pos sp, pv; 4652 + struct ctrl_pos sp, pv = {}; 4783 4653 4784 4654 if (swappiness <= MIN_SWAPPINESS + 1) 4785 4655 return LRU_GEN_FILE; ··· 4837 4707 struct mem_cgroup *memcg = lruvec_memcg(lruvec); 4838 4708 struct pglist_data *pgdat = lruvec_pgdat(lruvec); 4839 4709 4840 - spin_lock_irq(&lruvec->lru_lock); 4710 + lruvec_lock_irq(lruvec); 4841 4711 4842 4712 scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list); 4843 4713 ··· 4846 4716 if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq) 4847 4717 scanned = 0; 4848 4718 4849 - spin_unlock_irq(&lruvec->lru_lock); 4719 + lruvec_unlock_irq(lruvec); 4850 4720 4851 4721 if (list_empty(&list)) 4852 4722 return scanned; ··· 4879 4749 set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_active)); 4880 4750 } 4881 4751 4882 - spin_lock_irq(&lruvec->lru_lock); 4883 - 4884 - move_folios_to_lru(lruvec, &list); 4752 + move_folios_to_lru(&list); 4885 4753 4886 4754 walk = current->reclaim_state->mm_walk; 4887 4755 if (walk && walk->batched) { 4888 4756 walk->lruvec = lruvec; 4757 + lruvec_lock_irq(lruvec); 4889 4758 reset_batch_size(walk); 4759 + lruvec_unlock_irq(lruvec); 4890 4760 } 4891 4761 4892 4762 mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc), ··· 4895 4765 item = PGSTEAL_KSWAPD + reclaimer_offset(sc); 4896 4766 mod_lruvec_state(lruvec, item, reclaimed); 4897 4767 mod_lruvec_state(lruvec, PGSTEAL_ANON + type, reclaimed); 4898 - 4899 - spin_unlock_irq(&lruvec->lru_lock); 4900 4768 4901 4769 list_splice_init(&clean, &list); 4902 4770 ··· 4971 4843 int i; 4972 4844 enum zone_watermarks mark; 4973 4845 4974 - /* don't abort memcg reclaim to ensure fairness */ 4975 - if (!root_reclaim(sc)) 4976 - return false; 4977 - 4978 4846 if (sc->nr_reclaimed >= max(sc->nr_to_reclaim, compact_gap(sc->order))) 4979 4847 return true; 4980 4848 ··· 
5024 4900 * If too many file cache in the coldest generation can't be evicted 5025 4901 * due to being dirty, wake up the flusher. 5026 4902 */ 5027 - if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) 4903 + if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) { 4904 + struct pglist_data *pgdat = lruvec_pgdat(lruvec); 4905 + 5028 4906 wakeup_flusher_threads(WB_REASON_VMSCAN); 4907 + 4908 + /* 4909 + * For cgroupv1 dirty throttling is achieved by waking up 4910 + * the kernel flusher here and later waiting on folios 4911 + * which are in writeback to finish (see shrink_folio_list()). 4912 + * 4913 + * Flusher may not be able to issue writeback quickly 4914 + * enough for cgroupv1 writeback throttling to work 4915 + * on a large system. 4916 + */ 4917 + if (!writeback_throttling_sane(sc)) 4918 + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK); 4919 + } 5029 4920 5030 4921 /* whether this lruvec should be rotated */ 5031 4922 return nr_to_scan < 0; ··· 5335 5196 for_each_node(nid) { 5336 5197 struct lruvec *lruvec = get_lruvec(memcg, nid); 5337 5198 5338 - spin_lock_irq(&lruvec->lru_lock); 5199 + lruvec_lock_irq(lruvec); 5339 5200 5340 5201 VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); 5341 5202 VM_WARN_ON_ONCE(!state_is_valid(lruvec)); ··· 5343 5204 lruvec->lrugen.enabled = enabled; 5344 5205 5345 5206 while (!(enabled ? 
fill_evictable(lruvec) : drain_evictable(lruvec))) { 5346 - spin_unlock_irq(&lruvec->lru_lock); 5207 + lruvec_unlock_irq(lruvec); 5347 5208 cond_resched(); 5348 - spin_lock_irq(&lruvec->lru_lock); 5209 + lruvec_lock_irq(lruvec); 5349 5210 } 5350 5211 5351 - spin_unlock_irq(&lruvec->lru_lock); 5212 + lruvec_unlock_irq(lruvec); 5352 5213 } 5353 5214 5354 5215 cond_resched(); ··· 8037 7898 if (lruvec) { 8038 7899 __count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued); 8039 7900 __count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned); 8040 - unlock_page_lruvec_irq(lruvec); 7901 + lruvec_unlock_irq(lruvec); 8041 7902 } else if (pgscanned) { 8042 7903 count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned); 8043 7904 }
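
The reworked `move_folios_to_lru()` above no longer takes a single lruvec from its caller. It re-locks per folio and keeps the current lock while consecutive folios belong to the same lruvec. A minimal user-space sketch of that lazy relock pattern follows. All names here (`struct bucket`, `struct item`, `bucket_relock()`, `move_items()`) are hypothetical, and the "lock" is a plain flag with a counter instead of a real spinlock, so the example only illustrates the control flow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a lruvec: a lock flag plus counters. */
struct bucket {
	int locked;		/* 1 while the lock is "held" */
	int acquisitions;	/* number of times the lock was taken */
	int nr_items;		/* items added to this bucket */
};

struct item {
	struct bucket *owner;	/* bucket this item must be added to */
};

static void bucket_lock(struct bucket *b)
{
	assert(!b->locked);
	b->locked = 1;
	b->acquisitions++;
}

static void bucket_unlock(struct bucket *b)
{
	assert(b->locked);
	b->locked = 0;
}

/* Keep `held` if it already protects the item, otherwise switch locks. */
static struct bucket *bucket_relock(struct item *it, struct bucket *held)
{
	if (held == it->owner)
		return held;
	if (held)
		bucket_unlock(held);
	bucket_lock(it->owner);
	return it->owner;
}

/* Returns the number of items moved; the caller must not hold any bucket lock. */
static int move_items(struct item *items, size_t n)
{
	struct bucket *held = NULL;
	int moved = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		held = bucket_relock(&items[i], held);
		held->nr_items++;
		moved++;
	}

	/* Drop whichever lock is still held after the last item. */
	if (held)
		bucket_unlock(held);

	return moved;
}
```

With items owned by buckets in the order A, A, B, B, A, this model takes A's lock twice and B's lock once, instead of five separate lock/unlock pairs.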

+1 -1

mm/vmstat.c

···
 			if (cpu_is_isolated(cpu))
 				continue;

-			if (!delayed_work_pending(dw) && need_update(cpu))
+			if (!work_busy(&dw->work) && need_update(cpu))
 				queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
 		}


+19 -11

mm/workingset.c

···
 	int refs = folio_lru_refs(folio);
 	bool workingset = folio_test_workingset(folio);
 	int tier = lru_tier_from_refs(refs, workingset);
-	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct mem_cgroup *memcg;
 	struct pglist_data *pgdat = folio_pgdat(folio);
+	unsigned short memcg_id;

 	BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH >
 		     BITS_PER_LONG - max(EVICTION_SHIFT, EVICTION_SHIFT_ANON));

+	rcu_read_lock();
+	memcg = folio_memcg(folio);
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
 	lrugen = &lruvec->lrugen;
 	min_seq = READ_ONCE(lrugen->min_seq[type]);
···

 	hist = lru_hist_from_seq(min_seq);
 	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
+	memcg_id = mem_cgroup_private_id(memcg);
+	rcu_read_unlock();

-	return pack_shadow(mem_cgroup_private_id(memcg), pgdat, token, workingset, type);
+	return pack_shadow(memcg_id, pgdat, token, workingset, type);
 }

 /*
···
 void workingset_refault(struct folio *folio, void *shadow)
 {
 	bool file = folio_is_file_lru(folio);
-	struct pglist_data *pgdat;
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 	bool workingset;
···
 	 * locked to guarantee folio_memcg() stability throughout.
 	 */
 	nr = folio_nr_pages(folio);
-	memcg = folio_memcg(folio);
-	pgdat = folio_pgdat(folio);
-	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-
+	memcg = get_mem_cgroup_from_folio(folio);
+	lruvec = mem_cgroup_lruvec(memcg, folio_pgdat(folio));
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);

 	if (!workingset_test_recent(shadow, file, &workingset, true))
-		return;
+		goto out;

 	folio_set_active(folio);
 	workingset_age_nonresident(lruvec, nr);
···
 		lru_note_cost_refault(folio);
 		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
 	}
+out:
+	mem_cgroup_put(memcg);
 }

 /**
···
 	 * Filter non-memcg pages here, e.g. unmap can call
 	 * mark_page_accessed() on VDSO pages.
 	 */
-	if (mem_cgroup_disabled() || folio_memcg_charged(folio))
+	if (mem_cgroup_disabled() || folio_memcg_charged(folio)) {
+		rcu_read_lock();
 		workingset_age_nonresident(folio_lruvec(folio), folio_nr_pages(folio));
+		rcu_read_unlock();
+	}
 }

 /*
···

 		mem_cgroup_flush_stats_ratelimited(sc->memcg);
 		lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
+
 		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
-			pages += lruvec_page_state_local(lruvec,
-							 NR_LRU_BASE + i);
+			pages += lruvec_lru_size(lruvec, i, MAX_NR_ZONES - 1);
+
 		pages += lruvec_page_state_local(
 			lruvec, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT;
 		pages += lruvec_page_state_local(
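
The `workingset_refault()` change above swaps a borrowed `folio_memcg()` pointer for a counted reference (`get_mem_cgroup_from_folio()` paired with `mem_cgroup_put()`) and routes the early return through a single `out:` label. A minimal user-space sketch of that acquire/release shape follows. `struct group`, `struct page_model`, `group_get()`, `group_put()` and `refault_model()` are hypothetical names, and the reference count is a plain integer with no atomicity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical refcounted owner object (models a memcg). */
struct group {
	int refs;
	long refaults;
	long activations;
};

/* Hypothetical page-like object that points at its owner (models a folio). */
struct page_model {
	struct group *owner;
};

/* Take a reference on the owner so it stays valid until group_put(). */
static struct group *group_get(struct page_model *p)
{
	p->owner->refs++;
	return p->owner;
}

static void group_put(struct group *g)
{
	assert(g->refs > 0);
	g->refs--;
}

static void refault_model(struct page_model *p, bool recent)
{
	struct group *g = group_get(p);

	g->refaults++;

	/* An early exit must still drop the reference, hence the goto. */
	if (!recent)
		goto out;

	g->activations++;
out:
	group_put(g);
}
```

Funnelling every exit through one label keeps the get and put visibly paired, which is why the bare `return` in the original became `goto out`.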

+99 -106

mm/zswap.c

··· 242 242 **********************************/ 243 243 static void __zswap_pool_empty(struct percpu_ref *ref); 244 244 245 + static void acomp_ctx_free(struct crypto_acomp_ctx *acomp_ctx) 246 + { 247 + if (!acomp_ctx) 248 + return; 249 + 250 + /* 251 + * If there was an error in allocating @acomp_ctx->req, it 252 + * would be set to NULL. 253 + */ 254 + if (acomp_ctx->req) 255 + acomp_request_free(acomp_ctx->req); 256 + 257 + acomp_ctx->req = NULL; 258 + 259 + /* 260 + * We have to handle both cases here: an error pointer return from 261 + * crypto_alloc_acomp_node(); and a) NULL initialization by zswap, or 262 + * b) NULL assignment done in a previous call to acomp_ctx_free(). 263 + */ 264 + if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) 265 + crypto_free_acomp(acomp_ctx->acomp); 266 + 267 + acomp_ctx->acomp = NULL; 268 + 269 + kfree(acomp_ctx->buffer); 270 + acomp_ctx->buffer = NULL; 271 + } 272 + 245 273 static struct zswap_pool *zswap_pool_create(char *compressor) 246 274 { 247 275 struct zswap_pool *pool; ··· 291 263 292 264 strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name)); 293 265 294 - pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx); 266 + /* Many things rely on the zero-initialization. */ 267 + pool->acomp_ctx = alloc_percpu_gfp(*pool->acomp_ctx, 268 + GFP_KERNEL | __GFP_ZERO); 295 269 if (!pool->acomp_ctx) { 296 270 pr_err("percpu alloc failed\n"); 297 271 goto error; 298 272 } 299 273 300 - for_each_possible_cpu(cpu) 301 - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex); 302 - 274 + /* 275 + * This is serialized against CPU hotplug operations. Hence, cores 276 + * cannot be offlined until this finishes. 277 + */ 303 278 ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE, 304 279 &pool->node); 280 + 281 + /* 282 + * cpuhp_state_add_instance() will not cleanup on failure since 283 + * we don't register a hotunplug callback. 
284 + */ 305 285 if (ret) 306 - goto error; 286 + goto cpuhp_add_fail; 307 287 308 288 /* being the current pool takes 1 ref; this func expects the 309 289 * caller to always add the new pool as the current pool ··· 328 292 329 293 ref_fail: 330 294 cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); 295 + 296 + cpuhp_add_fail: 297 + for_each_possible_cpu(cpu) 298 + acomp_ctx_free(per_cpu_ptr(pool->acomp_ctx, cpu)); 331 299 error: 332 300 if (pool->acomp_ctx) 333 301 free_percpu(pool->acomp_ctx); ··· 362 322 363 323 static void zswap_pool_destroy(struct zswap_pool *pool) 364 324 { 325 + int cpu; 326 + 365 327 zswap_pool_debug("destroying", pool); 366 328 367 329 cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); 330 + 331 + for_each_possible_cpu(cpu) 332 + acomp_ctx_free(per_cpu_ptr(pool->acomp_ctx, cpu)); 333 + 368 334 free_percpu(pool->acomp_ctx); 369 335 370 336 zs_destroy_pool(pool->zs_pool); ··· 710 664 struct lruvec *lruvec; 711 665 712 666 if (folio) { 667 + rcu_read_lock(); 713 668 lruvec = folio_lruvec(folio); 714 669 atomic_long_inc(&lruvec->zswap_lruvec_state.nr_disk_swapins); 670 + rcu_read_unlock(); 715 671 } 716 672 } 717 673 ··· 784 736 { 785 737 struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); 786 738 struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); 787 - struct crypto_acomp *acomp = NULL; 788 - struct acomp_req *req = NULL; 789 - u8 *buffer = NULL; 790 - int ret; 791 - 792 - buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu)); 793 - if (!buffer) { 794 - ret = -ENOMEM; 795 - goto fail; 796 - } 797 - 798 - acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu)); 799 - if (IS_ERR(acomp)) { 800 - pr_err("could not alloc crypto acomp %s : %pe\n", 801 - pool->tfm_name, acomp); 802 - ret = PTR_ERR(acomp); 803 - goto fail; 804 - } 805 - 806 - req = acomp_request_alloc(acomp); 807 - if (!req) { 808 - pr_err("could not alloc crypto acomp_request %s\n", 
809 - pool->tfm_name); 810 - ret = -ENOMEM; 811 - goto fail; 812 - } 739 + int ret = -ENOMEM; 813 740 814 741 /* 815 - * Only hold the mutex after completing allocations, otherwise we may 816 - * recurse into zswap through reclaim and attempt to hold the mutex 817 - * again resulting in a deadlock. 742 + * To handle cases where the CPU goes through online-offline-online 743 + * transitions, we return if the acomp_ctx has already been initialized. 818 744 */ 819 - mutex_lock(&acomp_ctx->mutex); 745 + if (acomp_ctx->acomp) { 746 + WARN_ON_ONCE(IS_ERR(acomp_ctx->acomp)); 747 + return 0; 748 + } 749 + 750 + acomp_ctx->buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu)); 751 + if (!acomp_ctx->buffer) 752 + return ret; 753 + 754 + /* 755 + * In case of an error, crypto_alloc_acomp_node() returns an 756 + * error pointer, never NULL. 757 + */ 758 + acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu)); 759 + if (IS_ERR(acomp_ctx->acomp)) { 760 + pr_err("could not alloc crypto acomp %s : %pe\n", 761 + pool->tfm_name, acomp_ctx->acomp); 762 + ret = PTR_ERR(acomp_ctx->acomp); 763 + goto fail; 764 + } 765 + 766 + /* acomp_request_alloc() returns NULL in case of an error. */ 767 + acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp); 768 + if (!acomp_ctx->req) { 769 + pr_err("could not alloc crypto acomp_request %s\n", 770 + pool->tfm_name); 771 + goto fail; 772 + } 773 + 820 774 crypto_init_wait(&acomp_ctx->wait); 821 775 822 776 /* ··· 826 776 * crypto_wait_req(); if the backend of acomp is scomp, the callback 827 777 * won't be called, crypto_wait_req() will return without blocking. 
828 778 */ 829 - acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG, 779 + acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG, 830 780 crypto_req_done, &acomp_ctx->wait); 831 781 832 - acomp_ctx->buffer = buffer; 833 - acomp_ctx->acomp = acomp; 834 - acomp_ctx->req = req; 835 - mutex_unlock(&acomp_ctx->mutex); 782 + mutex_init(&acomp_ctx->mutex); 836 783 return 0; 837 784 838 785 fail: 839 - if (!IS_ERR_OR_NULL(acomp)) 840 - crypto_free_acomp(acomp); 841 - kfree(buffer); 786 + acomp_ctx_free(acomp_ctx); 842 787 return ret; 843 - } 844 - 845 - static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node) 846 - { 847 - struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); 848 - struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); 849 - struct acomp_req *req; 850 - struct crypto_acomp *acomp; 851 - u8 *buffer; 852 - 853 - if (IS_ERR_OR_NULL(acomp_ctx)) 854 - return 0; 855 - 856 - mutex_lock(&acomp_ctx->mutex); 857 - req = acomp_ctx->req; 858 - acomp = acomp_ctx->acomp; 859 - buffer = acomp_ctx->buffer; 860 - acomp_ctx->req = NULL; 861 - acomp_ctx->acomp = NULL; 862 - acomp_ctx->buffer = NULL; 863 - mutex_unlock(&acomp_ctx->mutex); 864 - 865 - /* 866 - * Do the actual freeing after releasing the mutex to avoid subtle 867 - * locking dependencies causing deadlocks. 868 - */ 869 - if (!IS_ERR_OR_NULL(req)) 870 - acomp_request_free(req); 871 - if (!IS_ERR_OR_NULL(acomp)) 872 - crypto_free_acomp(acomp); 873 - kfree(buffer); 874 - 875 - return 0; 876 - } 877 - 878 - static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool) 879 - { 880 - struct crypto_acomp_ctx *acomp_ctx; 881 - 882 - for (;;) { 883 - acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); 884 - mutex_lock(&acomp_ctx->mutex); 885 - if (likely(acomp_ctx->req)) 886 - return acomp_ctx; 887 - /* 888 - * It is possible that we were migrated to a different CPU after 889 - * getting the per-CPU ctx but before the mutex was acquired. 
If 890 - * the old CPU got offlined, zswap_cpu_comp_dead() could have 891 - * already freed ctx->req (among other things) and set it to 892 - * NULL. Just try again on the new CPU that we ended up on. 893 - */ 894 - mutex_unlock(&acomp_ctx->mutex); 895 - } 896 - } 897 - 898 - static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx) 899 - { 900 - mutex_unlock(&acomp_ctx->mutex); 901 788 } 902 789 903 790 static bool zswap_compress(struct page *page, struct zswap_entry *entry, ··· 849 862 u8 *dst; 850 863 bool mapped = false; 851 864 852 - acomp_ctx = acomp_ctx_get_cpu_lock(pool); 865 + acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); 866 + mutex_lock(&acomp_ctx->mutex); 867 + 853 868 dst = acomp_ctx->buffer; 854 869 sg_init_table(&input, 1); 855 870 sg_set_page(&input, page, PAGE_SIZE, 0); ··· 882 893 * to the active LRU list in the case. 883 894 */ 884 895 if (comp_ret || !dlen || dlen >= PAGE_SIZE) { 896 + rcu_read_lock(); 885 897 if (!mem_cgroup_zswap_writeback_enabled( 886 898 folio_memcg(page_folio(page)))) { 899 + rcu_read_unlock(); 887 900 comp_ret = comp_ret ? comp_ret : -EINVAL; 888 901 goto unlock; 889 902 } 903 + rcu_read_unlock(); 890 904 comp_ret = 0; 891 905 dlen = PAGE_SIZE; 892 906 dst = kmap_local_page(page); ··· 917 925 else if (alloc_ret) 918 926 zswap_reject_alloc_fail++; 919 927 920 - acomp_ctx_put_unlock(acomp_ctx); 928 + mutex_unlock(&acomp_ctx->mutex); 921 929 return comp_ret == 0 && alloc_ret == 0; 922 930 } 923 931 ··· 929 937 struct crypto_acomp_ctx *acomp_ctx; 930 938 int ret = 0, dlen; 931 939 932 - acomp_ctx = acomp_ctx_get_cpu_lock(pool); 940 + acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); 941 + mutex_lock(&acomp_ctx->mutex); 933 942 zs_obj_read_sg_begin(pool->zs_pool, entry->handle, input, entry->length); 934 943 935 944 /* zswap entries of length PAGE_SIZE are not compressed. 
*/ ··· 955 962 } 956 963 957 964 zs_obj_read_sg_end(pool->zs_pool, entry->handle); 958 - acomp_ctx_put_unlock(acomp_ctx); 965 + mutex_unlock(&acomp_ctx->mutex); 959 966 960 967 if (!ret && dlen == PAGE_SIZE) 961 968 return true; ··· 1775 1782 ret = cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE, 1776 1783 "mm/zswap_pool:prepare", 1777 1784 zswap_cpu_comp_prepare, 1778 - zswap_cpu_comp_dead); 1785 + NULL); 1779 1786 if (ret) 1780 1787 goto hp_fail; 1781 1788
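Note on the zswap hunks above: the CPU-dead callback is dropped (NULL is now passed to cpuhp_setup_state_multi()), the prepare callback returns early when the context is already set up, and every error path funnels into a single acomp_ctx_free() call. The following is a minimal userspace sketch of that "idempotent prepare, one tolerant free" pattern. It is not the kernel code: `demo_ctx`, `demo_ctx_free()`, `demo_ctx_prepare()` and the `fail_engine` switch are hypothetical names used only for illustration, and plain malloc()/free() stand in for the crypto allocations.

```c
#include <stdlib.h>

/* Hypothetical userspace stand-in for a per-CPU compression context. */
struct demo_ctx {
	unsigned char *buffer;
	void *engine;		/* stands in for the crypto transform */
};

/*
 * Safe to call on a zeroed, partially initialized, or fully initialized
 * context: free(NULL) is a no-op, and the pointers are reset afterwards.
 */
static void demo_ctx_free(struct demo_ctx *ctx)
{
	free(ctx->engine);
	free(ctx->buffer);
	ctx->engine = NULL;
	ctx->buffer = NULL;
}

/*
 * Idempotent prepare: if the context survived an earlier "online" event,
 * return immediately instead of allocating again. fail_engine is a test
 * hook (an assumption of this sketch) that simulates an allocation failure.
 */
static int demo_ctx_prepare(struct demo_ctx *ctx, size_t buf_size,
			    int fail_engine)
{
	if (ctx->engine)
		return 0;

	ctx->buffer = malloc(buf_size);
	if (!ctx->buffer)
		return -1;

	ctx->engine = fail_engine ? NULL : malloc(1);
	if (!ctx->engine) {
		/* One cleanup path, regardless of how far setup got. */
		demo_ctx_free(ctx);
		return -1;
	}

	return 0;
}
```

Because resources are released only when the owner is torn down, the users of the context no longer need a retry loop that re-checks whether it was freed underneath them, which appears to be why acomp_ctx_get_cpu_lock() could be replaced by a plain raw_cpu_ptr() plus mutex_lock().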

+41

tools/testing/selftests/liveupdate/liveupdate.c

···
 	ASSERT_EQ(close(session_fd), 0);
 }

+/*
+ * Test Case: Prevent Double Preservation
+ *
+ * Verifies that a file (memfd) can only be preserved once across all active
+ * sessions. Attempting to preserve it a second time, whether in the same or
+ * a different session, should fail with EBUSY.
+ */
+TEST_F(liveupdate_device, prevent_double_preservation)
+{
+	int session_fd1, session_fd2, mem_fd;
+	int ret;
+
+	self->fd1 = open(LIVEUPDATE_DEV, O_RDWR);
+	if (self->fd1 < 0 && errno == ENOENT)
+		SKIP(return, "%s does not exist", LIVEUPDATE_DEV);
+	ASSERT_GE(self->fd1, 0);
+
+	session_fd1 = create_session(self->fd1, "double-preserve-session-1");
+	ASSERT_GE(session_fd1, 0);
+	session_fd2 = create_session(self->fd1, "double-preserve-session-2");
+	ASSERT_GE(session_fd2, 0);
+
+	mem_fd = memfd_create("test-memfd", 0);
+	ASSERT_GE(mem_fd, 0);
+
+	/* First preservation should succeed */
+	ASSERT_EQ(preserve_fd(session_fd1, mem_fd, 0x1111), 0);
+
+	/* Second preservation in a different session should fail with EBUSY */
+	ret = preserve_fd(session_fd2, mem_fd, 0x2222);
+	EXPECT_EQ(ret, -EBUSY);
+
+	/* Second preservation in the same session (different token) should fail with EBUSY */
+	ret = preserve_fd(session_fd1, mem_fd, 0x3333);
+	EXPECT_EQ(ret, -EBUSY);
+
+	ASSERT_EQ(close(mem_fd), 0);
+	ASSERT_EQ(close(session_fd1), 0);
+	ASSERT_EQ(close(session_fd2), 0);
+}
+
 TEST_HARNESS_MAIN

+5

tools/testing/selftests/mm/charge_reserved_hugetlb.sh

···
 	exit $ksft_skip
 fi

+if ! command -v killall >/dev/null 2>&1; then
+	echo "killall not available. Skipping..."
+	exit $ksft_skip
+fi
+
 nr_hugepgs=$(cat /proc/sys/vm/nr_hugepages)

 fault_limit_file=limit_in_bytes

+4

tools/testing/selftests/mm/guard-regions.c

···
 #include <sys/uio.h>
 #include <unistd.h>
 #include "vm_util.h"
+#include "thp_settings.h"

 #include "../pidfd/pidfd.h"

···
 	const unsigned long num_pages = size / page_size;
 	char *ptr;
 	int i;
+
+	if (!thp_available())
+		SKIP(return, "Transparent Hugepages not available\n");

 	/* Need file to be correct size for tests for non-anon. */
 	if (variant->backing != ANON_BACKED)

+16 -67

tools/testing/selftests/mm/hmm-tests.c

···
  */
 #include <lib/test_hmm_uapi.h>
 #include <mm/gup_test.h>
+#include <mm/vm_util.h>

 struct hmm_buffer {
 	void *ptr;
···

 	for (migrate = 0; migrate < 2; ++migrate) {
 		for (use_thp = 0; use_thp < 2; ++use_thp) {
-			npages = ALIGN(use_thp ? TWOMEG : HMM_BUFFER_SIZE,
+			npages = ALIGN(use_thp ? read_pmd_pagesize() : HMM_BUFFER_SIZE,
 				       self->page_size) >> self->page_shift;
 			ASSERT_NE(npages, 0);
 			size = npages << self->page_shift;
···
 	int *ptr;
 	int ret;

-	size = 2 * TWOMEG;
+	size = 2 * read_pmd_pagesize();

 	buffer = malloc(sizeof(*buffer));
 	ASSERT_NE(buffer, NULL);
···
 			   buffer->fd, 0);
 	ASSERT_NE(buffer->ptr, MAP_FAILED);

-	size = TWOMEG;
+	size /= 2;
 	npages = size >> self->page_shift;
 	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
 	ret = madvise(map, size, MADV_HUGEPAGE);
···
 }

 /*
- * Read numeric data from raw and tagged kernel status files. Used to read
- * /proc and /sys data (without a tag) and from /proc/meminfo (with a tag).
- */
-static long file_read_ulong(char *file, const char *tag)
-{
-	int fd;
-	char buf[2048];
-	int len;
-	char *p, *q;
-	long val;
-
-	fd = open(file, O_RDONLY);
-	if (fd < 0) {
-		/* Error opening the file */
-		return -1;
-	}
-
-	len = read(fd, buf, sizeof(buf));
-	close(fd);
-	if (len < 0) {
-		/* Error in reading the file */
-		return -1;
-	}
-	if (len == sizeof(buf)) {
-		/* Error file is too large */
-		return -1;
-	}
-	buf[len] = '\0';
-
-	/* Search for a tag if provided */
-	if (tag) {
-		p = strstr(buf, tag);
-		if (!p)
-			return -1; /* looks like the line we want isn't there */
-		p += strlen(tag);
-	} else
-		p = buf;
-
-	val = strtol(p, &q, 0);
-	if (*q != ' ') {
-		/* Error parsing the file */
-		return -1;
-	}
-
-	return val;
-}
-
-/*
  * Write huge TLBFS page.
  */
 TEST_F(hmm, anon_write_hugetlbfs)
···
 	struct hmm_buffer *buffer;
 	unsigned long npages;
 	unsigned long size;
-	unsigned long default_hsize;
+	unsigned long default_hsize = default_huge_page_size();
 	unsigned long i;
 	int *ptr;
 	int ret;

-	default_hsize = file_read_ulong("/proc/meminfo", "Hugepagesize:");
-	if (default_hsize < 0 || default_hsize*1024 < default_hsize)
+	if (!default_hsize)
 		SKIP(return, "Huge page size could not be determined");
-	default_hsize = default_hsize*1024; /* KB to B */

 	size = ALIGN(TWOMEG, default_hsize);
 	npages = size >> self->page_shift;
···
 	struct hmm_buffer *buffer;
 	unsigned long npages;
 	unsigned long size;
-	unsigned long default_hsize;
+	unsigned long default_hsize = default_huge_page_size();
 	int *ptr;
 	unsigned char *m;
 	int ret;
···

 	/* Skip test if we can't allocate a hugetlbfs page. */

-	default_hsize = file_read_ulong("/proc/meminfo", "Hugepagesize:");
-	if (default_hsize < 0 || default_hsize*1024 < default_hsize)
+	if (!default_hsize)
 		SKIP(return, "Huge page size could not be determined");
-	default_hsize = default_hsize*1024; /* KB to B */

 	size = ALIGN(TWOMEG, default_hsize);
 	npages = size >> self->page_shift;
···
 	int *ptr;
 	int ret;

-	size = TWOMEG;
+	size = read_pmd_pagesize();

 	buffer = malloc(sizeof(*buffer));
 	ASSERT_NE(buffer, NULL);
···
 	int ret;
 	int val;

-	size = TWOMEG;
+	size = read_pmd_pagesize();

 	buffer = malloc(sizeof(*buffer));
 	ASSERT_NE(buffer, NULL);
···
 	int *ptr;
 	int ret;

-	size = TWOMEG;
+	size = read_pmd_pagesize();

 	buffer = malloc(sizeof(*buffer));
 	ASSERT_NE(buffer, NULL);
···
 	int *ptr;
 	int ret;

-	size = TWOMEG;
+	size = read_pmd_pagesize();

 	buffer = malloc(sizeof(*buffer));
 	ASSERT_NE(buffer, NULL);
···
 {
 	struct hmm_buffer *buffer;
 	unsigned long npages;
-	unsigned long size = TWOMEG;
+	unsigned long size = read_pmd_pagesize();
 	unsigned long i;
 	void *old_ptr;
 	void *map;
···
 {
 	struct hmm_buffer *buffer;
 	unsigned long npages;
-	unsigned long size = TWOMEG;
+	unsigned long size = read_pmd_pagesize();
 	unsigned long i;
 	void *old_ptr, *new_ptr = NULL;
 	void *map;
···
 	int *ptr;
 	int ret;

-	size = TWOMEG;
+	size = read_pmd_pagesize();

 	buffer = malloc(sizeof(*buffer));
 	ASSERT_NE(buffer, NULL);
···
 	int *ptr;
 	int ret;

-	size = TWOMEG;
+	size = read_pmd_pagesize();

 	buffer = malloc(sizeof(*buffer));
 	ASSERT_NE(buffer, NULL);
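Note on the hmm-tests.c hunks above: the removed code stored the `long` result of file_read_ulong() in an `unsigned long`, so the `default_hsize < 0` half of its check could never be true and only the multiplication-overflow half did any work. The replacement tests for a zero return from default_huge_page_size() instead. The sketch below shows one way to parse the "Hugepagesize:" line with an unsigned return type and 0 as the failure value. It is an illustration under stated assumptions, not the vm_util implementation: `demo_hugepage_bytes()` is a hypothetical name, and it works on an in-memory copy of /proc/meminfo text rather than opening the file.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: find "Hugepagesize: N kB" in meminfo-style text and
 * return the size in bytes, or 0 if the line is missing, malformed, or the
 * kB-to-bytes conversion would overflow.
 */
static unsigned long demo_hugepage_bytes(const char *meminfo)
{
	static const char tag[] = "Hugepagesize:";
	const char *p = strstr(meminfo, tag);
	unsigned long kb;
	char *end;

	if (!p)
		return 0;
	p += strlen(tag);

	/* Skip padding and insist on a digit, so "-5" or "kB" is rejected. */
	while (*p == ' ' || *p == '\t')
		p++;
	if (*p < '0' || *p > '9')
		return 0;

	errno = 0;
	kb = strtoul(p, &end, 10);
	if (end == p || errno == ERANGE)
		return 0;

	/* Explicit overflow check before converting kB to bytes. */
	if (kb > ULONG_MAX / 1024)
		return 0;

	return kb * 1024;
}
```

Using 0 as the only failure value keeps the caller's check to a single `if (!size)`, which matches the shape of the new SKIP() condition in the tests.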

+69 -22

tools/testing/selftests/mm/hugetlb_dio.c

··· 17 17 #include <unistd.h> 18 18 #include <string.h> 19 19 #include <sys/mman.h> 20 + #include <sys/syscall.h> 20 21 #include "vm_util.h" 21 22 #include "kselftest.h" 22 23 23 - void run_dio_using_hugetlb(unsigned int start_off, unsigned int end_off) 24 + #ifndef STATX_DIOALIGN 25 + #define STATX_DIOALIGN 0x00002000U 26 + #endif 27 + 28 + static int get_dio_alignment(int fd) 24 29 { 25 - int fd; 30 + struct statx stx; 31 + int ret; 32 + 33 + ret = syscall(__NR_statx, fd, "", AT_EMPTY_PATH, STATX_DIOALIGN, &stx); 34 + if (ret < 0) 35 + return -1; 36 + 37 + /* 38 + * If STATX_DIOALIGN is unsupported, assume no alignment 39 + * constraint and let the test proceed. 40 + */ 41 + if (!(stx.stx_mask & STATX_DIOALIGN) || !stx.stx_dio_offset_align) 42 + return 1; 43 + 44 + return stx.stx_dio_offset_align; 45 + } 46 + 47 + static bool check_dio_alignment(unsigned int start_off, 48 + unsigned int end_off, unsigned int align) 49 + { 50 + unsigned int writesize = end_off - start_off; 51 + 52 + /* 53 + * The kernel's DIO path checks that file offset, length, and 54 + * buffer address are all multiples of dio_offset_align. When 55 + * this test case's parameters don't satisfy that, the write 56 + * would fail with -EINVAL before exercising the hugetlb unpin 57 + * path, so skip. 
58 + */ 59 + if (start_off % align != 0 || writesize % align != 0) { 60 + ksft_test_result_skip("DIO align=%u incompatible with offset %u writesize %u\n", 61 + align, start_off, writesize); 62 + return false; 63 + } 64 + 65 + return true; 66 + } 67 + 68 + static void run_dio_using_hugetlb(int fd, unsigned int start_off, 69 + unsigned int end_off, unsigned int align) 70 + { 26 71 char *buffer = NULL; 27 72 char *orig_buffer = NULL; 28 73 size_t h_pagesize = 0; ··· 77 32 const int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB; 78 33 const int mmap_prot = PROT_READ | PROT_WRITE; 79 34 35 + if (!check_dio_alignment(start_off, end_off, align)) 36 + return; 37 + 80 38 writesize = end_off - start_off; 81 39 82 40 /* Get the default huge page size */ ··· 87 39 if (!h_pagesize) 88 40 ksft_exit_fail_msg("Unable to determine huge page size\n"); 89 41 90 - /* Open the file to DIO */ 91 - fd = open("/tmp", O_TMPFILE | O_RDWR | O_DIRECT, 0664); 92 - if (fd < 0) 93 - ksft_exit_fail_perror("Error opening file\n"); 42 + /* Reset file position since fd is shared across tests */ 43 + if (lseek(fd, 0, SEEK_SET) < 0) 44 + ksft_exit_fail_perror("lseek failed\n"); 94 45 95 46 /* Get the free huge pages before allocation */ 96 47 free_hpage_b = get_free_hugepages(); ··· 118 71 119 72 /* unmap the huge page */ 120 73 munmap(orig_buffer, h_pagesize); 121 - close(fd); 122 74 123 75 /* Get the free huge pages after unmap*/ 124 76 free_hpage_a = get_free_hugepages(); ··· 135 89 136 90 int main(void) 137 91 { 138 - size_t pagesize = 0; 139 - int fd; 92 + int fd, align; 93 + const size_t pagesize = psize(); 140 94 141 95 ksft_print_header(); 142 - 143 - /* Open the file to DIO */ 144 - fd = open("/tmp", O_TMPFILE | O_RDWR | O_DIRECT, 0664); 145 - if (fd < 0) 146 - ksft_exit_skip("Unable to allocate file: %s\n", strerror(errno)); 147 - close(fd); 148 96 149 97 /* Check if huge pages are free */ 150 98 if (!get_free_hugepages()) 151 99 ksft_exit_skip("No free hugepage, exiting\n"); 152 
100 101 + fd = open("/tmp", O_TMPFILE | O_RDWR | O_DIRECT, 0664); 102 + if (fd < 0) 103 + ksft_exit_skip("Unable to allocate file: %s\n", strerror(errno)); 104 + 105 + align = get_dio_alignment(fd); 106 + if (align < 0) 107 + ksft_exit_skip("Unable to obtain DIO alignment: %s\n", 108 + strerror(errno)); 153 109 ksft_set_plan(4); 154 110 155 - /* Get base page size */ 156 - pagesize = psize(); 157 - 158 111 /* start and end is aligned to pagesize */ 159 - run_dio_using_hugetlb(0, (pagesize * 3)); 112 + run_dio_using_hugetlb(fd, 0, (pagesize * 3), align); 160 113 161 114 /* start is aligned but end is not aligned */ 162 - run_dio_using_hugetlb(0, (pagesize * 3) - (pagesize / 2)); 115 + run_dio_using_hugetlb(fd, 0, (pagesize * 3) - (pagesize / 2), align); 163 116 164 117 /* start is unaligned and end is aligned */ 165 - run_dio_using_hugetlb(pagesize / 2, (pagesize * 3)); 118 + run_dio_using_hugetlb(fd, pagesize / 2, (pagesize * 3), align); 166 119 167 120 /* both start and end are unaligned */ 168 - run_dio_using_hugetlb(pagesize / 2, (pagesize * 3) + (pagesize / 2)); 121 + run_dio_using_hugetlb(fd, pagesize / 2, (pagesize * 3) + (pagesize / 2), align); 122 + 123 + close(fd); 169 124 170 125 ksft_finished(); 171 126 }
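Note on the hugetlb_dio.c hunks above: each of the four test cases is now skipped when its start offset or write size is not a multiple of the DIO alignment reported by statx(). The sketch below isolates that predicate so the effect on the four (start, end) pairs from main() can be checked by hand. `demo_dio_params_aligned()` is a hypothetical name, and the zero-alignment guard is an addition of this sketch (the test itself substitutes an alignment of 1 when none is reported).

```c
#include <stdbool.h>

/*
 * Hypothetical restatement of the check in check_dio_alignment(): a direct
 * write is attempted only if both the starting offset and the length are
 * multiples of the reported alignment.
 */
static bool demo_dio_params_aligned(unsigned int start_off,
				    unsigned int end_off,
				    unsigned int align)
{
	unsigned int writesize = end_off - start_off;

	/* Guard against division by zero; not present in the original. */
	if (align == 0)
		return false;

	return start_off % align == 0 && writesize % align == 0;
}
```

With a 4096-byte base page, an alignment of 512 accepts all four cases, because every offset and length used is a multiple of 2048. An alignment of 4096 accepts only the first case (offset 0, length 3 pages) and the last one (offset half a page, length exactly 3 pages fails on the offset), so most cases would be reported as skipped on such a filesystem.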

+88

tools/testing/selftests/mm/merge.c

···
 	return 0;
 }

+#ifdef __NR_mseal
+static int sys_mseal(void *ptr, size_t len, unsigned long flags)
+{
+	return syscall(__NR_mseal, (unsigned long)ptr, len, flags);
+}
+#else
+static int sys_mseal(void *ptr, size_t len, unsigned long flags)
+{
+	errno = ENOSYS;
+	return -1;
+}
+#endif
+
 FIXTURE_SETUP(merge)
 {
 	self->page_size = psize();
···
 	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
 	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
 	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size);
+}
+
+TEST_F(merge, merge_vmas_with_mseal)
+{
+	unsigned int page_size = self->page_size;
+	struct procmap_fd *procmap = &self->procmap;
+	char *ptr, *ptr2, *ptr3;
+	/* We need our own as cannot munmap() once sealed. */
+	char *carveout;
+
+	/* Invalid mseal() call to see if implemented. */
+	ASSERT_EQ(sys_mseal(NULL, 0, ~0UL), -1);
+	if (errno == ENOSYS)
+		SKIP(return, "mseal not supported, skipping.");
+
+	/* Map carveout. */
+	carveout = mmap(NULL, 5 * page_size, PROT_NONE,
+			MAP_PRIVATE | MAP_ANON, -1, 0);
+	ASSERT_NE(carveout, MAP_FAILED);
+
+	/*
+	 * Map 3 separate VMAs:
+	 *
+	 * |-----------|-----------|-----------|
+	 * |    RW     |    RWE    |    RO     |
+	 * |-----------|-----------|-----------|
+	 * ptr         ptr2        ptr3
+	 */
+	ptr = mmap(&carveout[page_size], page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	ptr2 = mmap(&carveout[2 * page_size], page_size,
+		    PROT_READ | PROT_WRITE | PROT_EXEC,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr2, MAP_FAILED);
+	ptr3 = mmap(&carveout[3 * page_size], page_size, PROT_READ,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+	/*
+	 * mseal the second VMA:
+	 *
+	 * |-----------|-----------|-----------|
+	 * |    RW     |   RWES    |    RO     |
+	 * |-----------|-----------|-----------|
+	 * ptr         ptr2        ptr3
+	 */
+	ASSERT_EQ(sys_mseal(ptr2, page_size, 0), 0);
+
+	/* Make first VMA mergeable upon mseal. */
+	ASSERT_EQ(mprotect(ptr, page_size,
+			   PROT_READ | PROT_WRITE | PROT_EXEC), 0);
+	/*
+	 * At this point we have:
+	 *
+	 * |-----------|-----------|-----------|
+	 * |    RWE    |   RWES    |    RO     |
+	 * |-----------|-----------|-----------|
+	 * ptr         ptr2        ptr3
+	 *
+	 * Now mseal all of the VMAs.
+	 */
+	ASSERT_EQ(sys_mseal(ptr, 3 * page_size, 0), 0);
+
+	/*
+	 * We should end up with:
+	 *
+	 * |-----------------------|-----------|
+	 * |         RWES          |    ROS    |
+	 * |-----------------------|-----------|
+	 * ptr                     ptr3
+	 */
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 2 * page_size);
 }

 TEST_F(merge_with_fork, mremap_faulted_to_unfaulted_prev)

+3 -1

tools/testing/selftests/mm/soft-dirty.c

··· 82 82 int i, ret; 83 83 84 84 if (!thp_is_enabled()) { 85 - ksft_test_result_skip("Transparent Hugepages not available\n"); 85 + ksft_print_msg("Transparent Hugepages not available\n"); 86 + ksft_test_result_skip("Test %s huge page allocation\n", __func__); 87 + ksft_test_result_skip("Test %s huge page dirty bit\n", __func__); 86 88 return; 87 89 } 88 90

+4 -15

tools/testing/selftests/mm/split_huge_page_test.c

··· 21 21 #include <time.h> 22 22 #include "vm_util.h" 23 23 #include "kselftest.h" 24 + #include "thp_settings.h" 24 25 25 26 uint64_t pagesize; 26 27 unsigned int pageshift; ··· 254 253 255 254 free(vaddr_orders); 256 255 return status; 257 - } 258 - 259 - static void write_file(const char *path, const char *buf, size_t buflen) 260 - { 261 - int fd; 262 - ssize_t numwritten; 263 - 264 - fd = open(path, O_WRONLY); 265 - if (fd == -1) 266 - ksft_exit_fail_msg("%s open failed: %s\n", path, strerror(errno)); 267 - 268 - numwritten = write(fd, buf, buflen - 1); 269 - close(fd); 270 - if (numwritten < 1) 271 - ksft_exit_fail_msg("Write failed\n"); 272 256 } 273 257 274 258 static void write_debugfs(const char *fmt, ...) ··· 757 771 ksft_print_msg("Please run the benchmark as root\n"); 758 772 ksft_finished(); 759 773 } 774 + 775 + if (!thp_is_enabled()) 776 + ksft_exit_skip("Transparent Hugepages not available\n"); 760 777 761 778 if (argc > 1) 762 779 optional_xfs_path = argv[1];

+3 -32

tools/testing/selftests/mm/thp_settings.c

··· 6 6 #include <string.h> 7 7 #include <unistd.h> 8 8 9 + #include "vm_util.h" 9 10 #include "thp_settings.h" 10 11 11 12 #define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/" ··· 65 64 return (unsigned int) numread; 66 65 } 67 66 68 - int write_file(const char *path, const char *buf, size_t buflen) 69 - { 70 - int fd; 71 - ssize_t numwritten; 72 - 73 - fd = open(path, O_WRONLY); 74 - if (fd == -1) { 75 - printf("open(%s)\n", path); 76 - exit(EXIT_FAILURE); 77 - return 0; 78 - } 79 - 80 - numwritten = write(fd, buf, buflen - 1); 81 - close(fd); 82 - if (numwritten < 1) { 83 - printf("write(%s)\n", buf); 84 - exit(EXIT_FAILURE); 85 - return 0; 86 - } 87 - 88 - return (unsigned int) numwritten; 89 - } 90 - 91 67 unsigned long read_num(const char *path) 92 68 { 93 69 char buf[21]; ··· 82 104 char buf[21]; 83 105 84 106 sprintf(buf, "%ld", num); 85 - if (!write_file(path, buf, strlen(buf) + 1)) { 86 - perror(path); 87 - exit(EXIT_FAILURE); 88 - } 107 + write_file(path, buf, strlen(buf) + 1); 89 108 } 90 109 91 110 int thp_read_string(const char *name, const char * const strings[]) ··· 140 165 printf("%s: Pathname is too long\n", __func__); 141 166 exit(EXIT_FAILURE); 142 167 } 143 - 144 - if (!write_file(path, val, strlen(val) + 1)) { 145 - perror(path); 146 - exit(EXIT_FAILURE); 147 - } 168 + write_file(path, val, strlen(val) + 1); 148 169 } 149 170 150 171 unsigned long thp_read_num(const char *name)

-1

tools/testing/selftests/mm/thp_settings.h

··· 63 63 }; 64 64 65 65 int read_file(const char *path, char *buf, size_t buflen); 66 - int write_file(const char *path, const char *buf, size_t buflen); 67 66 unsigned long read_num(const char *path); 68 67 void write_num(const char *path, unsigned long num); 69 68

+4

tools/testing/selftests/mm/transhuge-stress.c

··· 17 17 #include <sys/mman.h> 18 18 #include "vm_util.h" 19 19 #include "kselftest.h" 20 + #include "thp_settings.h" 20 21 21 22 int backing_fd = -1; 22 23 int mmap_flags = MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE; ··· 37 36 int duration = 0; 38 37 39 38 ksft_print_header(); 39 + 40 + if (!thp_is_enabled()) 41 + ksft_exit_skip("Transparent Hugepages not available\n"); 40 42 41 43 ram = sysconf(_SC_PHYS_PAGES); 42 44 if (ram > SIZE_MAX / psize() / 4)

+24

tools/testing/selftests/mm/vm_util.c

··· 764 764 765 765 return ret > 0 ? 0 : -errno; 766 766 } 767 + 768 + void write_file(const char *path, const char *buf, size_t buflen) 769 + { 770 + int fd, saved_errno; 771 + ssize_t numwritten; 772 + 773 + if (buflen < 2) 774 + ksft_exit_fail_msg("Incorrect buffer len: %zu\n", buflen); 775 + 776 + fd = open(path, O_WRONLY); 777 + if (fd == -1) 778 + ksft_exit_fail_msg("%s open failed: %s\n", path, strerror(errno)); 779 + 780 + numwritten = write(fd, buf, buflen - 1); 781 + saved_errno = errno; 782 + close(fd); 783 + errno = saved_errno; 784 + if (numwritten < 0) 785 + ksft_exit_fail_msg("%s write(%.*s) failed: %s\n", path, (int)(buflen - 1), 786 + buf, strerror(errno)); 787 + if (numwritten != buflen - 1) 788 + ksft_exit_fail_msg("%s write(%.*s) is truncated, expected %zu bytes, got %zd bytes\n", 789 + path, (int)(buflen - 1), buf, buflen - 1, numwritten); 790 + }

+2

tools/testing/selftests/mm/vm_util.h

··· 166 166 167 167 #define PAGEMAP_PRESENT(ent) (((ent) & (1ull << 63)) != 0) 168 168 #define PAGEMAP_PFN(ent) ((ent) & ((1ull << 55) - 1)) 169 + 170 + void write_file(const char *path, const char *buf, size_t buflen);
