Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull more MM updates from Andrew Morton:

- "Eliminate Dying Memory Cgroup" (Qi Zheng and Muchun Song)

Address the longstanding "dying memcg problem": a situation wherein a
no-longer-used memory control group will hang around for an extended
period, pointlessly consuming memory

- "fix unexpected type conversions and potential overflows" (Qi Zheng)

Fix a couple of potential 32-bit/64-bit issues which were identified
during review of the "Eliminate Dying Memory Cgroup" series

- "kho: history: track previous kernel version and kexec boot count"
(Breno Leitao)

Use Kexec Handover (KHO) to pass the previous kernel's version string
and the number of kexec reboots since the last cold boot to the next
kernel, and print both at boot time

- "liveupdate: prevent double preservation" (Pasha Tatashin)

Teach LUO to avoid managing the same file across different active
sessions

- "liveupdate: Fix module unloading and unregister API" (Pasha
Tatashin)

Address an issue with how LUO handles module reference counting and
unregistration during module unloading

- "zswap pool per-CPU acomp_ctx simplifications" (Kanchana Sridhar)

Simplify and clean up the zswap crypto compression handling and
improve the lifecycle management of zswap pool's per-CPU acomp_ctx
resources

- "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race"
(SeongJae Park)

Address unlikely but possible leaks and deadlocks in damon_call() and
damos_walk()

- "mm/damon/core: validate damos_quota_goal->nid" (SeongJae Park)

Fix a couple of root-only wild pointer dereferences

- "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race"
(SeongJae Park)

Update the DAMON documentation to warn operators about potential
races which can occur if the commit_inputs parameter is altered at
the wrong time

- "Minor hmm_test fixes and cleanups" (Alistair Popple)

Bugfixes and a cleanup for the HMM kernel selftests

- "Modify memfd_luo code" (Chenghao Duan)

Cleanups, simplifications and speedups to the memfd_luo code

- "mm, kvm: allow uffd support in guest_memfd" (Mike Rapoport)

Support for userfaultfd in guest_memfd

- "selftests/mm: skip several tests when thp is not available" (Chunyu
Hu)

Fix several issues in the selftests code which were causing breakage
when the tests were run on CONFIG_TRANSPARENT_HUGEPAGE=n kernels

- "mm/mprotect: micro-optimization work" (Pedro Falcato)

A couple of nice speedups for mprotect()

- "MAINTAINERS: update KHO and LIVE UPDATE entries" (Pratyush Yadav)

Document upcoming changes in the maintenance of KHO, LUO, memfd_luo,
kexec, crash, kdump and probably other kexec-based things - they are
being moved out of mm.git and into a new git tree

* tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (121 commits)
MAINTAINERS: add page cache reviewer
mm/vmscan: avoid false-positive -Wuninitialized warning
MAINTAINERS: update Dave's kdump reviewer email address
MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE
MAINTAINERS: drop include/linux/kho/abi/ from KHO
MAINTAINERS: update KHO and LIVE UPDATE maintainers
MAINTAINERS: update kexec/kdump maintainers entries
mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd()
selftests: mm: skip charge_reserved_hugetlb without killall
userfaultfd: allow registration of ranges below mmap_min_addr
mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update
mm/hugetlb: fix early boot crash on parameters without '=' separator
zram: reject unrecognized type= values in recompress_store()
docs: proc: document ProtectionKey in smaps
mm/mprotect: special-case small folios when applying permissions
mm/mprotect: move softleaf code out of the main function
mm: remove '!root_reclaim' checking in should_abort_scan()
mm/sparse: fix comment for section map alignment
mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete()
selftests/mm: transhuge_stress: skip the test when thp not available
...

+2831 -1692
+8
CREDITS
··· 1451 1451 E: andy@greyhouse.net 1452 1452 D: Maintenance and contributions to the network interface bonding driver. 1453 1453 1454 + N: Vivek Goyal 1455 + E: vgoyal@redhat.com 1456 + D: KDUMP, KEXEC, and VIRTIO FILE SYSTEM 1457 + 1458 + N: Alexander Graf 1459 + E: graf@amazon.com 1460 + D: Kexec Handover (KHO) 1461 + 1454 1462 N: Wolfgang Grandegger 1455 1463 E: wg@grandegger.com 1456 1464 D: Controller Area Network (device drivers)
+4
Documentation/admin-guide/mm/damon/lru_sort.rst
··· 79 79 parameter is set as ``N``. If invalid parameters are found while the 80 80 re-reading, DAMON_LRU_SORT will be disabled. 81 81 82 + Once ``Y`` is written to this parameter, the user must not write to any 83 + parameters until reading ``commit_inputs`` again returns ``N``. If users 84 + violate this rule, the kernel may exhibit undefined behavior. 85 + 82 86 active_mem_bp 83 87 ------------- 84 88
+4
Documentation/admin-guide/mm/damon/reclaim.rst
··· 71 71 parameter is set as ``N``. If invalid parameters are found while the 72 72 re-reading, DAMON_RECLAIM will be disabled. 73 73 74 + Once ``Y`` is written to this parameter, the user must not write to any 75 + parameters until reading ``commit_inputs`` again returns ``N``. If users 76 + violate this rule, the kernel may exhibit undefined behavior. 77 + 74 78 min_age 75 79 ------- 76 80
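The commit_inputs rule added to both DAMON documents above (write Y, then touch no other parameter until the file reads back N) could be followed from userspace roughly as sketched below. This is an illustration only, not part of the patch: the helper names commit_inputs_idle() and commit_and_wait() and the one-second polling interval are assumptions, and the parameter path is supplied by the caller (for DAMON_RECLAIM it is normally /sys/module/damon_reclaim/parameters/commit_inputs).

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the parameter file reads back 'N'
 * (no commit pending), 0 if it reads 'Y', and -1 on any error.
 */
static int commit_inputs_idle(const char *path)
{
	FILE *f = fopen(path, "r");
	int c;

	if (!f)
		return -1;
	c = fgetc(f);
	fclose(f);

	if (c == 'N')
		return 1;
	if (c == 'Y')
		return 0;
	return -1;
}

/*
 * Hypothetical helper: write 'Y' to request a commit, then poll until the
 * kernel flips the parameter back to 'N'. Returns 0 once it is safe to
 * write other parameters again, -1 on error or after max_polls attempts.
 */
static int commit_and_wait(const char *path, int max_polls)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs("Y", f) == EOF) {
		fclose(f);
		return -1;
	}
	if (fclose(f) != 0)
		return -1;

	for (int i = 0; i < max_polls; i++) {
		int idle = commit_inputs_idle(path);

		if (idle < 0)
			return -1;
		if (idle)
			return 0;
		sleep(1);	/* polling interval chosen arbitrarily */
	}
	return -1;
}
```

A caller would update the tuning parameters first, call commit_and_wait() last, and treat a -1 return as "do not write anything else yet".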
+40 -1
Documentation/admin-guide/mm/kho.rst
··· 42 42 an early memory reservation, the new kernel will have that memory at the 43 43 same physical address as the old kernel. 44 44 45 + Kexec Metadata 46 + ============== 47 + 48 + KHO automatically tracks metadata about the kexec chain, passing information 49 + about the previous kernel to the next kernel. This feature helps diagnose 50 + bugs that only reproduce when kexecing from specific kernel versions. 51 + 52 + On each KHO kexec, the kernel logs the previous kernel's version and the 53 + number of kexec reboots since the last cold boot:: 54 + 55 + [ 0.000000] KHO: exec from: 6.19.0-rc4-next-20260107 (count 1) 56 + 57 + The metadata includes: 58 + 59 + ``previous_release`` 60 + The kernel version string (from ``uname -r``) of the kernel that 61 + initiated the kexec. 62 + 63 + ``kexec_count`` 64 + The number of kexec boots since the last cold boot. On cold boot, 65 + this counter starts at 0 and increments with each kexec. This helps 66 + identify issues that only manifest after multiple consecutive kexec 67 + reboots. 68 + 69 + Use Cases 70 + --------- 71 + 72 + This metadata is particularly useful for debugging kexec transition bugs, 73 + where a buggy kernel kexecs into a new kernel and the bug manifests only 74 + in the second kernel. Examples of such bugs include: 75 + 76 + - Memory corruption from the previous kernel affecting the new kernel 77 + - Incorrect hardware state left by the previous kernel 78 + - Firmware/ACPI state issues that only appear in kexec scenarios 79 + 80 + At scale, correlating crashes to the previous kernel version enables 81 + faster root cause analysis when issues only occur in specific kernel 82 + transition scenarios. 83 + 45 84 debugfs Interfaces 46 85 ================== 47 86 ··· 119 80 it finished to interpret their metadata. 
120 81 121 82 ``/sys/kernel/debug/kho/in/sub_fdts/`` 122 - Similar to ``kho/out/sub_fdts/``, but contains sub FDT blobs 83 + Similar to ``kho/out/sub_fdts/``, but contains sub blobs 123 84 of KHO producers passed from the old kernel.
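For the fleet-scale correlation use case described in the kho.rst hunk above, a log collector could pull the two fields out of the boot message. The sketch below assumes the message keeps the exact shape shown in the documentation example ("KHO: exec from: <release> (count <n>)"); parse_kho_exec_line() and RELEASE_MAX are hypothetical names, not kernel interfaces.

```c
#include <stdio.h>
#include <string.h>

#define RELEASE_MAX 64	/* same length as __NEW_UTS_LEN in the ABI struct */

/*
 * Hypothetical userspace helper: extract the previous kernel release and
 * the kexec count from one kernel log line. Returns 0 on success and -1
 * if the line does not contain a well-formed KHO message.
 */
static int parse_kho_exec_line(const char *line,
			       char release[RELEASE_MAX + 1],
			       unsigned int *count)
{
	static const char marker[] = "KHO: exec from: ";
	const char *p = strstr(line, marker);

	if (!p)
		return -1;
	p += sizeof(marker) - 1;

	/* %64s stops at whitespace, so a release containing spaces fails. */
	if (sscanf(p, "%64s (count %u)", release, count) != 2)
		return -1;
	return 0;
}
```

Feeding each dmesg line through this helper yields a (previous release, count) pair that can be attached to crash reports.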
+4
Documentation/filesystems/proc.rst
··· 565 565 naturally aligned THP pages of any currently enabled size. 1 if true, 0 566 566 otherwise. 567 567 568 + If both the kernel and the CPU support protection keys (pkeys), 569 + "ProtectionKey" indicates the memory protection key associated with the 570 + virtual memory area. 571 + 568 572 "VmFlags" field deserves a separate description. This member represents the 569 573 kernel flags associated with the particular virtual memory area in two letter 570 574 encoded manner. The codes are the following:
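The newly documented "ProtectionKey" field can be read like any other smaps line. The sketch below is a minimal userspace example under the assumption that the line has the form "ProtectionKey: <unsigned number>"; parse_protection_key() is a hypothetical helper name.

```c
#include <stdio.h>

/*
 * Hypothetical helper: returns 1 and stores the key in *pkey if @line is
 * a "ProtectionKey:" line from /proc/<pid>/smaps, and 0 for any other
 * line. Whitespace between the colon and the number is skipped.
 */
static int parse_protection_key(const char *line, unsigned int *pkey)
{
	return sscanf(line, "ProtectionKey: %u", pkey) == 1;
}
```

A caller would read /proc/self/smaps line by line with fgets() and apply the helper to each line. On kernels or CPUs without pkeys support the field is absent, so no line matches.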
+20 -9
MAINTAINERS
··· 13859 13859 KDUMP 13860 13860 M: Andrew Morton <akpm@linux-foundation.org> 13861 13861 M: Baoquan He <bhe@redhat.com> 13862 - R: Vivek Goyal <vgoyal@redhat.com> 13863 - R: Dave Young <dyoung@redhat.com> 13862 + M: Mike Rapoport <rppt@kernel.org> 13863 + M: Pasha Tatashin <pasha.tatashin@soleen.com> 13864 + M: Pratyush Yadav <pratyush@kernel.org> 13865 + R: Dave Young <ruirui.yang@linux.dev> 13864 13866 L: kexec@lists.infradead.org 13865 13867 S: Maintained 13866 13868 W: http://lse.sourceforge.net/kdump/ ··· 14177 14175 KEXEC 14178 14176 M: Andrew Morton <akpm@linux-foundation.org> 14179 14177 M: Baoquan He <bhe@redhat.com> 14178 + M: Mike Rapoport <rppt@kernel.org> 14179 + M: Pasha Tatashin <pasha.tatashin@soleen.com> 14180 + M: Pratyush Yadav <pratyush@kernel.org> 14180 14181 L: kexec@lists.infradead.org 14181 14182 W: http://kernel.org/pub/linux/utils/kernel/kexec/ 14182 14183 F: include/linux/kexec.h ··· 14187 14182 F: kernel/kexec* 14188 14183 14189 14184 KEXEC HANDOVER (KHO) 14190 - M: Alexander Graf <graf@amazon.com> 14191 14185 M: Mike Rapoport <rppt@kernel.org> 14192 14186 M: Pasha Tatashin <pasha.tatashin@soleen.com> 14193 - R: Pratyush Yadav <pratyush@kernel.org> 14187 + M: Pratyush Yadav <pratyush@kernel.org> 14188 + R: Alexander Graf <graf@amazon.com> 14194 14189 L: kexec@lists.infradead.org 14195 14190 L: linux-mm@kvack.org 14196 14191 S: Maintained 14192 + T: git git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git 14197 14193 F: Documentation/admin-guide/mm/kho.rst 14198 14194 F: Documentation/core-api/kho/* 14199 14195 F: include/linux/kexec_handover.h 14200 14196 F: include/linux/kho/ 14201 - F: include/linux/kho/abi/ 14202 14197 F: kernel/liveupdate/kexec_handover* 14203 14198 F: lib/test_kho.c 14204 14199 F: tools/testing/selftests/kho/ ··· 14897 14892 LIVE UPDATE 14898 14893 M: Pasha Tatashin <pasha.tatashin@soleen.com> 14899 14894 M: Mike Rapoport <rppt@kernel.org> 14900 - R: Pratyush Yadav <pratyush@kernel.org> 14895 + M: 
Pratyush Yadav <pratyush@kernel.org> 14901 14896 L: linux-kernel@vger.kernel.org 14902 14897 S: Maintained 14898 + T: git git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git 14903 14899 F: Documentation/core-api/liveupdate.rst 14904 14900 F: Documentation/mm/memfd_preservation.rst 14905 14901 F: Documentation/userspace-api/liveupdate.rst 14906 14902 F: include/linux/kho/abi/ 14907 14903 F: include/linux/liveupdate.h 14908 - F: include/linux/liveupdate/ 14909 14904 F: include/uapi/linux/liveupdate.h 14910 14905 F: kernel/liveupdate/ 14911 14906 F: lib/tests/liveupdate.c ··· 16864 16859 16865 16860 MEMORY MANAGEMENT - MGLRU (MULTI-GEN LRU) 16866 16861 M: Andrew Morton <akpm@linux-foundation.org> 16867 - M: Axel Rasmussen <axelrasmussen@google.com> 16868 - M: Yuanchu Xie <yuanchu@google.com> 16862 + R: Kairui Song <kasong@tencent.com> 16863 + R: Qi Zheng <qi.zheng@linux.dev> 16864 + R: Shakeel Butt <shakeel.butt@linux.dev> 16865 + R: Barry Song <baohua@kernel.org> 16866 + R: Axel Rasmussen <axelrasmussen@google.com> 16867 + R: Yuanchu Xie <yuanchu@google.com> 16869 16868 R: Wei Xu <weixugc@google.com> 16870 16869 L: linux-mm@kvack.org 16871 16870 S: Maintained ··· 20124 20115 20125 20116 PAGE CACHE 20126 20117 M: Matthew Wilcox (Oracle) <willy@infradead.org> 20118 + R: Jan Kara <jack@suse.cz> 20127 20119 L: linux-fsdevel@vger.kernel.org 20120 + L: linux-mm@kvack.org 20128 20121 S: Supported 20129 20122 T: git git://git.infradead.org/users/willy/pagecache.git 20130 20123 F: Documentation/filesystems/locking.rst
+4 -1
drivers/block/zram/zram_drv.c
··· 2546 2546 mode = RECOMPRESS_HUGE; 2547 2547 if (!strcmp(val, "huge_idle")) 2548 2548 mode = RECOMPRESS_IDLE | RECOMPRESS_HUGE; 2549 + if (!mode) 2550 + return -EINVAL; 2549 2551 continue; 2550 2552 } 2551 2553 ··· 2680 2678 */ 2681 2679 if (offset) { 2682 2680 if (n <= (PAGE_SIZE - offset)) 2683 - return; 2681 + goto end_bio; 2684 2682 2685 2683 n -= (PAGE_SIZE - offset); 2686 2684 index++; ··· 2695 2693 n -= PAGE_SIZE; 2696 2694 } 2697 2695 2696 + end_bio: 2698 2697 bio_endio(bio); 2699 2698 } 2700 2699
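The first zram hunk above makes an unrecognized type= value fail instead of being silently ignored. A userspace model of that check might look as follows. The flag values and the "idle" and "huge" strings are assumptions inferred from context (only "huge_idle" is visible in the hunk), and parse_recompress_type() is a hypothetical name.

```c
#include <errno.h>
#include <string.h>

/* Illustrative flag values; the real definitions live in zram_drv.c. */
#define RECOMPRESS_IDLE	(1 << 0)
#define RECOMPRESS_HUGE	(1 << 1)

/*
 * Hypothetical model of the type= parsing after the fix: map a value to
 * a mode bitmask, or return -EINVAL when the value is not recognized.
 */
static int parse_recompress_type(const char *val, unsigned int *mode)
{
	unsigned int m = 0;

	if (!strcmp(val, "idle"))
		m = RECOMPRESS_IDLE;
	if (!strcmp(val, "huge"))
		m = RECOMPRESS_HUGE;
	if (!strcmp(val, "huge_idle"))
		m = RECOMPRESS_IDLE | RECOMPRESS_HUGE;
	if (!m)
		return -EINVAL;	/* the check the patch adds */

	*mode = m;
	return 0;
}
```

Without the final check, a misspelled value would leave the mode at zero and the request would proceed with no filter applied.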
+2 -2
fs/buffer.c
··· 822 822 long offset; 823 823 struct mem_cgroup *memcg, *old_memcg; 824 824 825 - /* The folio lock pins the memcg */ 826 - memcg = folio_memcg(folio); 825 + memcg = get_mem_cgroup_from_folio(folio); 827 826 old_memcg = set_active_memcg(memcg); 828 827 829 828 head = NULL; ··· 843 844 } 844 845 out: 845 846 set_active_memcg(old_memcg); 847 + mem_cgroup_put(memcg); 846 848 return head; 847 849 /* 848 850 * In case anything failed, we just free everything we got.
+11 -11
fs/fs-writeback.c
··· 280 280 if (inode_cgwb_enabled(inode)) { 281 281 struct cgroup_subsys_state *memcg_css; 282 282 283 - if (folio) { 284 - memcg_css = mem_cgroup_css_from_folio(folio); 285 - wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); 286 - } else { 287 - /* must pin memcg_css, see wb_get_create() */ 283 + /* must pin memcg_css, see wb_get_create() */ 284 + if (folio) 285 + memcg_css = get_mem_cgroup_css_from_folio(folio); 286 + else 288 287 memcg_css = task_get_css(current, memory_cgrp_id); 289 - wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); 290 - css_put(memcg_css); 291 - } 288 + wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); 289 + css_put(memcg_css); 292 290 } 293 291 294 292 if (!wb) ··· 977 979 if (!wbc->wb || wbc->no_cgroup_owner) 978 980 return; 979 981 980 - css = mem_cgroup_css_from_folio(folio); 982 + css = get_mem_cgroup_css_from_folio(folio); 981 983 /* dead cgroups shouldn't contribute to inode ownership arbitration */ 982 984 if (!css_is_online(css)) 983 - return; 985 + goto out; 984 986 985 987 id = css->id; 986 988 987 989 if (id == wbc->wb_id) { 988 990 wbc->wb_bytes += bytes; 989 - return; 991 + goto out; 990 992 } 991 993 992 994 if (id == wbc->wb_lcand_id) ··· 999 1001 wbc->wb_tcand_bytes += bytes; 1000 1002 else 1001 1003 wbc->wb_tcand_bytes -= min(bytes, wbc->wb_tcand_bytes); 1004 + out: 1005 + css_put(css); 1002 1006 } 1003 1007 EXPORT_SYMBOL_GPL(wbc_account_cgroup_owner); 1004 1008
-2
fs/userfaultfd.c
··· 1238 1238 return -EINVAL; 1239 1239 if (!len) 1240 1240 return -EINVAL; 1241 - if (start < mmap_min_addr) 1242 - return -EINVAL; 1243 1241 if (start >= task_size) 1244 1242 return -EINVAL; 1245 1243 if (len > task_size - start)
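With the mmap_min_addr test gone, the range checks visible in the hunk above reduce to the following. This is a userspace model for illustration: validate_range_sketch() is a hypothetical name, task_size is passed as a parameter instead of being derived from the mm, and any alignment checks outside the hunk are not modelled.

```c
#include <errno.h>

/*
 * Hypothetical model of the remaining checks: reject empty ranges and
 * ranges that start at or extend past task_size. There is no lower bound
 * any more, so addresses below mmap_min_addr are accepted.
 */
static int validate_range_sketch(unsigned long start, unsigned long len,
				 unsigned long task_size)
{
	if (!len)
		return -EINVAL;
	if (start >= task_size)
		return -EINVAL;
	/* Written as a subtraction so that start + len cannot overflow. */
	if (len > task_size - start)
		return -EINVAL;
	return 0;
}
```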
+2
include/linux/alloc_tag.h
··· 163 163 { 164 164 WARN_ONCE(ref && !ref->ct, "alloc_tag was not set\n"); 165 165 } 166 + void alloc_tag_add_early_pfn(unsigned long pfn); 166 167 #else 167 168 static inline void alloc_tag_add_check(union codetag_ref *ref, struct alloc_tag *tag) {} 168 169 static inline void alloc_tag_sub_check(union codetag_ref *ref) {} 170 + static inline void alloc_tag_add_early_pfn(unsigned long pfn) {} 169 171 #endif 170 172 171 173 /* Caller should verify both ref and tag to be valid */
+2
include/linux/damon.h
··· 818 818 819 819 /* lists of &struct damon_call_control */ 820 820 struct list_head call_controls; 821 + bool call_controls_obsolete; 821 822 struct mutex call_controls_lock; 822 823 823 824 struct damos_walk_control *walk_control; 825 + bool walk_control_obsolete; 824 826 struct mutex walk_control_lock; 825 827 826 828 /*
+1 -8
include/linux/fs.h
··· 2062 2062 const struct vm_area_struct *vma); 2063 2063 int __compat_vma_mmap(struct vm_area_desc *desc, struct vm_area_struct *vma); 2064 2064 int compat_vma_mmap(struct file *file, struct vm_area_struct *vma); 2065 - int __vma_check_mmap_hook(struct vm_area_struct *vma); 2066 2065 2067 2066 static inline int vfs_mmap(struct file *file, struct vm_area_struct *vma) 2068 2067 { 2069 - int err; 2070 - 2071 2068 if (file->f_op->mmap_prepare) 2072 2069 return compat_vma_mmap(file, vma); 2073 2070 2074 - err = file->f_op->mmap(file, vma); 2075 - if (err) 2076 - return err; 2077 - 2078 - return __vma_check_mmap_hook(vma); 2071 + return file->f_op->mmap(file, vma); 2079 2072 } 2080 2073 2081 2074 static inline int vfs_mmap_prepare(struct file *file, struct vm_area_desc *desc)
+7 -6
include/linux/kexec_handover.h
··· 32 32 struct folio *kho_restore_folio(phys_addr_t phys); 33 33 struct page *kho_restore_pages(phys_addr_t phys, unsigned long nr_pages); 34 34 void *kho_restore_vmalloc(const struct kho_vmalloc *preservation); 35 - int kho_add_subtree(const char *name, void *fdt); 36 - void kho_remove_subtree(void *fdt); 37 - int kho_retrieve_subtree(const char *name, phys_addr_t *phys); 35 + int kho_add_subtree(const char *name, void *blob, size_t size); 36 + void kho_remove_subtree(void *blob); 37 + int kho_retrieve_subtree(const char *name, phys_addr_t *phys, size_t *size); 38 38 39 39 void kho_memory_init(void); 40 40 ··· 97 97 return NULL; 98 98 } 99 99 100 - static inline int kho_add_subtree(const char *name, void *fdt) 100 + static inline int kho_add_subtree(const char *name, void *blob, size_t size) 101 101 { 102 102 return -EOPNOTSUPP; 103 103 } 104 104 105 - static inline void kho_remove_subtree(void *fdt) { } 105 + static inline void kho_remove_subtree(void *blob) { } 106 106 107 - static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys) 107 + static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys, 108 + size_t *size) 108 109 { 109 110 return -EOPNOTSUPP; 110 111 }
+16 -4
include/linux/kho/abi/kexec_handover.h
··· 41 41 * restore the preserved data.:: 42 42 * 43 43 * / { 44 - * compatible = "kho-v2"; 44 + * compatible = "kho-v3"; 45 45 * 46 46 * preserved-memory-map = <0x...>; 47 47 * 48 48 * <subnode-name-1> { 49 49 * preserved-data = <0x...>; 50 + * blob-size = <0x...>; 50 51 * }; 51 52 * 52 53 * <subnode-name-2> { 53 54 * preserved-data = <0x...>; 55 + * blob-size = <0x...>; 54 56 * }; 55 57 * ... ... 56 58 * <subnode-name-N> { 57 59 * preserved-data = <0x...>; 60 + * blob-size = <0x...>; 58 61 * }; 59 62 * }; 60 63 * 61 64 * Root KHO Node (/): 62 - * - compatible: "kho-v2" 65 + * - compatible: "kho-v3" 63 66 * 64 67 * Indentifies the overall KHO ABI version. 65 68 * ··· 81 78 * 82 79 * Physical address pointing to a subnode data blob that is also 83 80 * being preserved. 81 + * 82 + * - blob-size: u64 83 + * 84 + * Size in bytes of the preserved data blob. This is needed because 85 + * blobs may use arbitrary formats (not just FDT), so the size 86 + * cannot be determined from the blob content alone. 84 87 */ 85 88 86 89 /* The compatible string for the KHO FDT root node. */ 87 - #define KHO_FDT_COMPATIBLE "kho-v2" 90 + #define KHO_FDT_COMPATIBLE "kho-v3" 88 91 89 92 /* The FDT property for the preserved memory map. */ 90 93 #define KHO_FDT_MEMORY_MAP_PROP_NAME "preserved-memory-map" 91 94 92 95 /* The FDT property for preserved data blobs. */ 93 - #define KHO_FDT_SUB_TREE_PROP_NAME "preserved-data" 96 + #define KHO_SUB_TREE_PROP_NAME "preserved-data" 97 + 98 + /* The FDT property for the size of preserved data blobs. */ 99 + #define KHO_SUB_TREE_SIZE_PROP_NAME "blob-size" 94 100 95 101 /** 96 102 * DOC: Kexec Handover ABI for vmalloc Preservation
+46
include/linux/kho/abi/kexec_metadata.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0-only */ 2 + 3 + /** 4 + * DOC: Kexec Metadata ABI 5 + * 6 + * The "kexec-metadata" subtree stores optional metadata about the kexec chain. 7 + * It is registered via kho_add_subtree(), keeping it independent from the core 8 + * KHO ABI. This allows the metadata format to evolve without affecting other 9 + * KHO consumers. 10 + * 11 + * The metadata is stored as a plain C struct rather than FDT format for 12 + * simplicity and direct field access. 13 + * 14 + * Copyright (c) 2026 Meta Platforms, Inc. and affiliates. 15 + * Copyright (c) 2026 Breno Leitao <leitao@debian.org> 16 + */ 17 + 18 + #ifndef _LINUX_KHO_ABI_KEXEC_METADATA_H 19 + #define _LINUX_KHO_ABI_KEXEC_METADATA_H 20 + 21 + #include <linux/types.h> 22 + #include <linux/utsname.h> 23 + 24 + #define KHO_KEXEC_METADATA_VERSION 1 25 + 26 + /** 27 + * struct kho_kexec_metadata - Kexec metadata passed between kernels 28 + * @version: ABI version of this struct (must be first field) 29 + * @previous_release: Kernel version string that initiated the kexec 30 + * @kexec_count: Number of kexec boots since last cold boot 31 + * 32 + * This structure is preserved across kexec and allows the new kernel to 33 + * identify which kernel it was booted from and how many kexec reboots 34 + * have occurred. 35 + * 36 + * __NEW_UTS_LEN is part of uABI, so it safe to use it in here. 37 + */ 38 + struct kho_kexec_metadata { 39 + u32 version; 40 + char previous_release[__NEW_UTS_LEN + 1]; 41 + u32 kexec_count; 42 + } __packed; 43 + 44 + #define KHO_METADATA_NODE_NAME "kexec-metadata" 45 + 46 + #endif /* _LINUX_KHO_ABI_KEXEC_METADATA_H */
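Because the metadata is a packed C struct and not an FDT, its layout and size are fixed, which is why the blob size now travels alongside the blob. The sketch below mirrors the struct in userspace to make the layout explicit and shows one plausible consumer-side sanity check. The names kexec_metadata_sketch and kexec_metadata_valid() are hypothetical, and the validation the kernel actually performs is not shown in this excerpt.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NEW_UTS_LEN 64			/* value of __NEW_UTS_LEN in the uAPI */
#define KEXEC_METADATA_VERSION 1	/* mirrors KHO_KEXEC_METADATA_VERSION */

/* Userspace mirror of struct kho_kexec_metadata, for illustration only. */
struct kexec_metadata_sketch {
	uint32_t version;
	char previous_release[NEW_UTS_LEN + 1];
	uint32_t kexec_count;
} __attribute__((packed));

/* Packed layout: 4 + 65 + 4 bytes with no padding between fields. */
_Static_assert(sizeof(struct kexec_metadata_sketch) == 73,
	       "unexpected struct size");
_Static_assert(offsetof(struct kexec_metadata_sketch, kexec_count) == 69,
	       "unexpected field offset");

/*
 * Hypothetical check a consumer might run on a retrieved blob: the blob
 * must be large enough, carry a known version, and hold a NUL-terminated
 * release string. Returns 1 if the blob looks usable, 0 otherwise.
 */
static int kexec_metadata_valid(const void *blob, size_t size)
{
	struct kexec_metadata_sketch md;

	if (size < sizeof(md))
		return 0;
	memcpy(&md, blob, sizeof(md));

	if (md.version != KEXEC_METADATA_VERSION)
		return 0;
	return memchr(md.previous_release, '\0',
		      sizeof(md.previous_release)) != NULL;
}
```

Keeping the version as the first field lets a future layout change be detected before any other field is interpreted.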
+9 -8
include/linux/liveupdate.h
··· 12 12 #include <linux/kho/abi/luo.h> 13 13 #include <linux/list.h> 14 14 #include <linux/mutex.h> 15 + #include <linux/rwsem.h> 15 16 #include <linux/types.h> 16 17 #include <uapi/linux/liveupdate.h> 17 18 ··· 64 63 * finish, in order to do successful finish calls for all 65 64 * resources in the session. 66 65 * @finish: Required. Final cleanup in the new kernel. 66 + * @get_id: Optional. Returns a unique identifier for the file. 67 67 * @owner: Module reference 68 68 * 69 69 * All operations (except can_preserve) receive a pointer to a ··· 80 78 int (*retrieve)(struct liveupdate_file_op_args *args); 81 79 bool (*can_finish)(struct liveupdate_file_op_args *args); 82 80 void (*finish)(struct liveupdate_file_op_args *args); 81 + unsigned long (*get_id)(struct file *file); 83 82 struct module *owner; 84 83 }; 85 84 ··· 231 228 int liveupdate_reboot(void); 232 229 233 230 int liveupdate_register_file_handler(struct liveupdate_file_handler *fh); 234 - int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh); 231 + void liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh); 235 232 236 233 int liveupdate_register_flb(struct liveupdate_file_handler *fh, 237 234 struct liveupdate_flb *flb); 238 - int liveupdate_unregister_flb(struct liveupdate_file_handler *fh, 239 - struct liveupdate_flb *flb); 235 + void liveupdate_unregister_flb(struct liveupdate_file_handler *fh, 236 + struct liveupdate_flb *flb); 240 237 241 238 int liveupdate_flb_get_incoming(struct liveupdate_flb *flb, void **objp); 242 239 int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp); ··· 258 255 return -EOPNOTSUPP; 259 256 } 260 257 261 - static inline int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh) 258 + static inline void liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh) 262 259 { 263 - return -EOPNOTSUPP; 264 260 } 265 261 266 262 static inline int liveupdate_register_flb(struct 
liveupdate_file_handler *fh, ··· 268 266 return -EOPNOTSUPP; 269 267 } 270 268 271 - static inline int liveupdate_unregister_flb(struct liveupdate_file_handler *fh, 272 - struct liveupdate_flb *flb) 269 + static inline void liveupdate_unregister_flb(struct liveupdate_file_handler *fh, 270 + struct liveupdate_flb *flb) 273 271 { 274 - return -EOPNOTSUPP; 275 272 } 276 273 277 274 static inline int liveupdate_flb_get_incoming(struct liveupdate_flb *flb,
+98 -95
include/linux/memcontrol.h
··· 115 115 unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS]; 116 116 struct mem_cgroup_reclaim_iter iter; 117 117 118 + /* 119 + * objcg is wiped out as a part of the objcg repaprenting process. 120 + * orig_objcg preserves a pointer (and a reference) to the original 121 + * objcg until the end of live of memcg. 122 + */ 123 + struct obj_cgroup __rcu *objcg; 124 + struct obj_cgroup *orig_objcg; 125 + /* list of inherited objcgs, protected by objcg_lock */ 126 + struct list_head objcg_list; 127 + 118 128 #ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC 119 129 /* slab stats for nmi context */ 120 130 atomic_t slab_reclaimable; ··· 189 179 struct list_head list; /* protected by objcg_lock */ 190 180 struct rcu_head rcu; 191 181 }; 182 + bool is_root; 192 183 }; 193 184 194 185 /* ··· 268 257 seqlock_t socket_pressure_seqlock; 269 258 #endif 270 259 int kmemcg_id; 271 - /* 272 - * memcg->objcg is wiped out as a part of the objcg repaprenting 273 - * process. memcg->orig_objcg preserves a pointer (and a reference) 274 - * to the original objcg until the end of live of memcg. 275 - */ 276 - struct obj_cgroup __rcu *objcg; 277 - struct obj_cgroup *orig_objcg; 278 - /* list of inherited objcgs, protected by objcg_lock */ 279 - struct list_head objcg_list; 280 260 281 261 struct memcg_vmstats_percpu __percpu *vmstats_percpu; 282 262 ··· 369 367 #define OBJEXTS_FLAGS_MASK (__NR_OBJEXTS_FLAGS - 1) 370 368 371 369 #ifdef CONFIG_MEMCG 372 - 373 - static inline bool folio_memcg_kmem(struct folio *folio); 374 - 375 370 /* 376 371 * After the initialization objcg->memcg is always pointing at 377 372 * a valid memcg, but can be atomically swapped to the parent memcg. ··· 382 383 } 383 384 384 385 /* 385 - * __folio_memcg - Get the memory cgroup associated with a non-kmem folio 386 - * @folio: Pointer to the folio. 387 - * 388 - * Returns a pointer to the memory cgroup associated with the folio, 389 - * or NULL. 
This function assumes that the folio is known to have a 390 - * proper memory cgroup pointer. It's not safe to call this function 391 - * against some type of folios, e.g. slab folios or ex-slab folios or 392 - * kmem folios. 393 - */ 394 - static inline struct mem_cgroup *__folio_memcg(struct folio *folio) 395 - { 396 - unsigned long memcg_data = folio->memcg_data; 397 - 398 - VM_BUG_ON_FOLIO(folio_test_slab(folio), folio); 399 - VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio); 400 - VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio); 401 - 402 - return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK); 403 - } 404 - 405 - /* 406 - * __folio_objcg - get the object cgroup associated with a kmem folio. 386 + * folio_objcg - get the object cgroup associated with a folio. 407 387 * @folio: Pointer to the folio. 408 388 * 409 389 * Returns a pointer to the object cgroup associated with the folio, 410 390 * or NULL. This function assumes that the folio is known to have a 411 - * proper object cgroup pointer. It's not safe to call this function 412 - * against some type of folios, e.g. slab folios or ex-slab folios or 413 - * LRU folios. 391 + * proper object cgroup pointer. 414 392 */ 415 - static inline struct obj_cgroup *__folio_objcg(struct folio *folio) 393 + static inline struct obj_cgroup *folio_objcg(struct folio *folio) 416 394 { 417 395 unsigned long memcg_data = folio->memcg_data; 418 396 419 397 VM_BUG_ON_FOLIO(folio_test_slab(folio), folio); 420 398 VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio); 421 - VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio); 422 399 423 400 return (struct obj_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK); 424 401 } ··· 408 433 * proper memory cgroup pointer. It's not safe to call this function 409 434 * against some type of folios, e.g. slab folios or ex-slab folios. 
410 435 * 411 - * For a non-kmem folio any of the following ensures folio and memcg binding 412 - * stability: 436 + * For a folio any of the following ensures folio and objcg binding stability: 413 437 * 414 438 * - the folio lock 415 439 * - LRU isolation 416 440 * - exclusive reference 417 441 * 418 - * For a kmem folio a caller should hold an rcu read lock to protect memcg 419 - * associated with a kmem folio from being released. 442 + * Based on the stable binding of folio and objcg, for a folio any of the 443 + * following ensures folio and memcg binding stability: 444 + * 445 + * - cgroup_mutex 446 + * - the lruvec lock 447 + * 448 + * If the caller only want to ensure that the page counters of memcg are 449 + * updated correctly, ensure that the binding stability of folio and objcg 450 + * is sufficient. 451 + * 452 + * Note: The caller should hold an rcu read lock or cgroup_mutex to protect 453 + * memcg associated with a folio from being released. 420 454 */ 421 455 static inline struct mem_cgroup *folio_memcg(struct folio *folio) 422 456 { 423 - if (folio_memcg_kmem(folio)) 424 - return obj_cgroup_memcg(__folio_objcg(folio)); 425 - return __folio_memcg(folio); 457 + struct obj_cgroup *objcg = folio_objcg(folio); 458 + 459 + return objcg ? obj_cgroup_memcg(objcg) : NULL; 426 460 } 427 461 428 462 /* ··· 455 471 * has an associated memory cgroup pointer or an object cgroups vector or 456 472 * an object cgroup. 457 473 * 458 - * For a non-kmem folio any of the following ensures folio and memcg binding 459 - * stability: 474 + * The page and objcg or memcg binding rules can refer to folio_memcg(). 460 475 * 461 - * - the folio lock 462 - * - LRU isolation 463 - * - exclusive reference 464 - * 465 - * For a kmem folio a caller should hold an rcu read lock to protect memcg 466 - * associated with a kmem folio from being released. 476 + * A caller should hold an rcu read lock to protect memcg associated with a 477 + * page from being released. 
467 478 */ 468 479 static inline struct mem_cgroup *folio_memcg_check(struct folio *folio) 469 480 { ··· 467 488 * for slabs, READ_ONCE() should be used here. 468 489 */ 469 490 unsigned long memcg_data = READ_ONCE(folio->memcg_data); 491 + struct obj_cgroup *objcg; 470 492 471 493 if (memcg_data & MEMCG_DATA_OBJEXTS) 472 494 return NULL; 473 495 474 - if (memcg_data & MEMCG_DATA_KMEM) { 475 - struct obj_cgroup *objcg; 496 + objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK); 476 497 477 - objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK); 478 - return obj_cgroup_memcg(objcg); 479 - } 480 - 481 - return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK); 498 + return objcg ? obj_cgroup_memcg(objcg) : NULL; 482 499 } 483 500 484 501 static inline struct mem_cgroup *page_memcg_check(struct page *page) ··· 521 546 static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) 522 547 { 523 548 return (memcg == root_mem_cgroup); 549 + } 550 + 551 + static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg) 552 + { 553 + return objcg->is_root; 524 554 } 525 555 526 556 static inline bool mem_cgroup_disabled(void) ··· 715 735 * folio_lruvec - return lruvec for isolating/putting an LRU folio 716 736 * @folio: Pointer to the folio. 717 737 * 718 - * This function relies on folio->mem_cgroup being stable. 738 + * Call with rcu_read_lock() held to ensure the lifetime of the returned lruvec. 739 + * Note that this alone will NOT guarantee the stability of the folio->lruvec 740 + * association; the folio can be reparented to an ancestor if this races with 741 + * cgroup deletion. 742 + * 743 + * Use folio_lruvec_lock() to ensure both lifetime and stability of the binding. 744 + * Once a lruvec is locked, folio_lruvec() can be called on other folios, and 745 + * their binding is stable if the returned lruvec matches the one the caller has 746 + * locked. Useful for lock batching. 
  */
 static inline struct lruvec *folio_lruvec(struct folio *folio)
 {
···
 struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
 						unsigned long *flags);
 
-#ifdef CONFIG_DEBUG_VM
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio);
-#else
-static inline
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
-{
-}
-#endif
-
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
···
 
 static inline bool obj_cgroup_tryget(struct obj_cgroup *objcg)
 {
+	if (obj_cgroup_is_root(objcg))
+		return true;
 	return percpu_ref_tryget(&objcg->refcnt);
-}
-
-static inline void obj_cgroup_get(struct obj_cgroup *objcg)
-{
-	percpu_ref_get(&objcg->refcnt);
 }
 
 static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
 				       unsigned long nr)
 {
-	percpu_ref_get_many(&objcg->refcnt, nr);
+	if (!obj_cgroup_is_root(objcg))
+		percpu_ref_get_many(&objcg->refcnt, nr);
+}
+
+static inline void obj_cgroup_get(struct obj_cgroup *objcg)
+{
+	obj_cgroup_get_many(objcg, 1);
 }
 
 static inline void obj_cgroup_put(struct obj_cgroup *objcg)
 {
-	if (objcg)
+	if (objcg && !obj_cgroup_is_root(objcg))
 		percpu_ref_put(&objcg->refcnt);
 }
 
···
 	return match;
 }
 
-struct cgroup_subsys_state *mem_cgroup_css_from_folio(struct folio *folio);
+struct cgroup_subsys_state *get_mem_cgroup_css_from_folio(struct folio *folio);
 ino_t page_cgroup_ino(struct page *page);
 
 static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
···
 }
 
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
-		int zid, int nr_pages);
+		int zid, long nr_pages);
 
 static inline
 unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
···
 static inline void count_memcg_folio_events(struct folio *folio,
 		enum vm_event_item idx, unsigned long nr)
 {
-	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct mem_cgroup *memcg;
 
-	if (memcg)
-		count_memcg_events(memcg, idx, nr);
+	if (!folio_memcg_charged(folio))
+		return;
+
+	rcu_read_lock();
+	memcg = folio_memcg(folio);
+	count_memcg_events(memcg, idx, nr);
+	rcu_read_unlock();
 }
 
 static inline void count_memcg_events_mm(struct mm_struct *mm,
···
 	return true;
 }
 
+static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
+{
+	return true;
+}
+
 static inline bool mem_cgroup_disabled(void)
 {
 	return true;
···
 	return &pgdat->__lruvec;
 }
 
-static inline
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
-{
-}
-
 static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
 {
 	return NULL;
···
 {
 	struct pglist_data *pgdat = folio_pgdat(folio);
 
+	rcu_read_lock();
 	spin_lock(&pgdat->__lruvec.lru_lock);
 	return &pgdat->__lruvec;
 }
···
 {
 	struct pglist_data *pgdat = folio_pgdat(folio);
 
+	rcu_read_lock();
 	spin_lock_irq(&pgdat->__lruvec.lru_lock);
 	return &pgdat->__lruvec;
 }
···
 {
 	struct pglist_data *pgdat = folio_pgdat(folio);
 
+	rcu_read_lock();
 	spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
 	return &pgdat->__lruvec;
 }
···
 	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
 }
 
-static inline void unlock_page_lruvec(struct lruvec *lruvec)
+static inline void lruvec_lock_irq(struct lruvec *lruvec)
+{
+	rcu_read_lock();
+	spin_lock_irq(&lruvec->lru_lock);
+}
+
+static inline void lruvec_unlock(struct lruvec *lruvec)
 {
 	spin_unlock(&lruvec->lru_lock);
+	rcu_read_unlock();
 }
 
-static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+static inline void lruvec_unlock_irq(struct lruvec *lruvec)
 {
 	spin_unlock_irq(&lruvec->lru_lock);
+	rcu_read_unlock();
 }
 
-static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
-		unsigned long flags)
+static inline void lruvec_unlock_irqrestore(struct lruvec *lruvec, unsigned long flags)
 {
 	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+	rcu_read_unlock();
 }
 
 /* Test requires a stable folio->memcg binding, see folio_memcg() */
···
 		if (folio_matches_lruvec(folio, locked_lruvec))
 			return locked_lruvec;
 
-		unlock_page_lruvec_irq(locked_lruvec);
+		lruvec_unlock_irq(locked_lruvec);
 	}
 
 	return folio_lruvec_lock_irq(folio);
···
 		if (folio_matches_lruvec(folio, *lruvecp))
 			return;
 
-		unlock_page_lruvec_irqrestore(*lruvecp, *flags);
+		lruvec_unlock_irqrestore(*lruvecp, *flags);
 	}
 
 	*lruvecp = folio_lruvec_lock_irqsave(folio, flags);
···
 	if (mem_cgroup_disabled())
 		return;
 
+	if (!folio_memcg_charged(folio))
+		return;
+
+	rcu_read_lock();
 	memcg = folio_memcg(folio);
-	if (unlikely(memcg && &memcg->css != wb->memcg_css))
+	if (unlikely(&memcg->css != wb->memcg_css))
 		mem_cgroup_track_foreign_dirty_slowpath(folio, wb);
+	rcu_read_unlock();
 }
 
 void mem_cgroup_flush_foreign(struct bdi_writeback *wb);
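The obj_cgroup helpers in the hunks above stop touching the reference count when the object is the root. A minimal userspace sketch of that pattern follows. Everything named `toy_*` is hypothetical, a plain counter stands in for `percpu_ref`, and the `is_root` flag is an assumption about how the root could be recognised; the real `obj_cgroup_is_root()` for the enabled configuration is not shown in this diff.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct obj_cgroup; a plain counter replaces percpu_ref. */
struct toy_objcg {
	bool is_root;	/* assumption: the root is flagged on the object itself */
	long refcnt;
};

/* Mirrors obj_cgroup_tryget(): the root always succeeds without counting. */
static bool toy_objcg_tryget(struct toy_objcg *objcg)
{
	if (objcg->is_root)
		return true;
	if (objcg->refcnt <= 0)
		return false;	/* object already released */
	objcg->refcnt++;
	return true;
}

/* Mirrors obj_cgroup_get_many(): a no-op for the root. */
static void toy_objcg_get_many(struct toy_objcg *objcg, unsigned long nr)
{
	if (!objcg->is_root)
		objcg->refcnt += (long)nr;
}

/* Mirrors the new obj_cgroup_get(): a thin wrapper around get_many(). */
static void toy_objcg_get(struct toy_objcg *objcg)
{
	toy_objcg_get_many(objcg, 1);
}

/* Mirrors obj_cgroup_put(): tolerates NULL and skips the root. */
static void toy_objcg_put(struct toy_objcg *objcg)
{
	if (objcg && !objcg->is_root)
		objcg->refcnt--;
}
```

Routing `get` through `get_many` keeps the root check in one place, which appears to be why the diff reorders the two functions.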
+5
include/linux/mm.h
···
  */
 };
 
+struct vm_uffd_ops;
+
 /*
  * These are the virtual MM functions - opening of an area, closing and
  * unmapping it (needed to keep files on disk up-to-date etc), pointer
···
 	struct page *(*find_normal_page)(struct vm_area_struct *vma,
 					 unsigned long addr);
 #endif /* CONFIG_FIND_NORMAL_PAGE */
+#ifdef CONFIG_USERFAULTFD
+	const struct vm_uffd_ops *uffd_ops;
+#endif
 };
 
 #ifdef CONFIG_NUMA_BALANCING
+6
include/linux/mm_inline.h
···
 {
 	enum lru_list lru = folio_lru_list(folio);
 
+	VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
+
 	if (lru_gen_add_folio(lruvec, folio, false))
 		return;
 
···
 {
 	enum lru_list lru = folio_lru_list(folio);
 
+	VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
+
 	if (lru_gen_add_folio(lruvec, folio, true))
 		return;
 
···
 void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
 {
 	enum lru_list lru = folio_lru_list(folio);
+
+	VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
 
 	if (lru_gen_del_folio(lruvec, folio, false))
 		return;
+27 -15
include/linux/mmzone.h
···
 void lru_gen_offline_memcg(struct mem_cgroup *memcg);
 void lru_gen_release_memcg(struct mem_cgroup *memcg);
 void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid);
+void max_lru_gen_memcg(struct mem_cgroup *memcg, int nid);
+bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg, int nid);
+void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid);
 
 #else /* !CONFIG_LRU_GEN */
 
···
 }
 
 static inline void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
+{
+}
+
+static inline void max_lru_gen_memcg(struct mem_cgroup *memcg, int nid)
+{
+}
+
+static inline bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg, int nid)
+{
+	return true;
+}
+
+static inline
+void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid)
 {
 }
 
···
 extern size_t mem_section_usage_size(void);
 
 /*
- * We use the lower bits of the mem_map pointer to store
- * a little bit of information. The pointer is calculated
- * as mem_map - section_nr_to_pfn(pnum). The result is
- * aligned to the minimum alignment of the two values:
- * 1. All mem_map arrays are page-aligned.
- * 2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT
- *    lowest bits. PFN_SECTION_SHIFT is arch-specific
- *    (equal SECTION_SIZE_BITS - PAGE_SHIFT), and the
- *    worst combination is powerpc with 256k pages,
- *    which results in PFN_SECTION_SHIFT equal 6.
- * To sum it up, at least 6 bits are available on all architectures.
- * However, we can exceed 6 bits on some other architectures except
- * powerpc (e.g. 15 bits are available on x86_64, 13 bits are available
- * with the worst case of 64K pages on arm64) if we make sure the
- * exceeded bit is not applicable to powerpc.
+ * We use the lower bits of the mem_map pointer to store a little bit of
+ * information. The pointer is calculated as mem_map - section_nr_to_pfn().
+ * The result is aligned to the minimum alignment of the two values:
+ *
+ * 1. All mem_map arrays are page-aligned.
+ * 2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT lowest bits.
+ *
+ * We always expect a single section to cover full pages. Therefore,
+ * we can safely assume that PFN_SECTION_SHIFT is large enough to
+ * accommodate SECTION_MAP_LAST_BIT. We use BUILD_BUG_ON() to ensure this.
  */
 enum {
 	SECTION_MARKED_PRESENT_BIT,
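The rewritten comment relies on the encoded mem_map value having enough zero low bits to hold the section flags, with a build-time check guarding that assumption. A generic sketch of the same idea in plain C is below. The names `toy_*`, the flag set, and the alignment shift of 6 are illustrative assumptions and are not the kernel's actual values; `_Static_assert` plays the role that `BUILD_BUG_ON()` plays in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits, loosely modelled on the SECTION_*_BIT enum. */
enum {
	TOY_PRESENT_BIT,
	TOY_HAS_MAP_BIT,
	TOY_ONLINE_BIT,
	TOY_LAST_BIT,		/* number of flag bits in use */
};

#define TOY_FLAG_MASK	((uintptr_t)((1UL << TOY_LAST_BIT) - 1))

/* Assumed guaranteed alignment of every encoded address: 2^6 bytes. */
#define TOY_ALIGN_SHIFT	6

/* Build-time guard: all flag bits must fit below the alignment. */
_Static_assert(TOY_LAST_BIT <= TOY_ALIGN_SHIFT,
	       "flag bits do not fit in the pointer's alignment bits");

/* Pack flags into the low bits of an aligned address. */
static uintptr_t toy_encode(uintptr_t addr, uintptr_t flags)
{
	assert((addr & TOY_FLAG_MASK) == 0);	/* low bits must be free */
	assert((flags & ~TOY_FLAG_MASK) == 0);	/* only known flags */
	return addr | flags;
}

/* Recover the original address by masking the flag bits off. */
static uintptr_t toy_decode_addr(uintptr_t encoded)
{
	return encoded & ~TOY_FLAG_MASK;
}

/* Recover the flags stored in the low bits. */
static uintptr_t toy_decode_flags(uintptr_t encoded)
{
	return encoded & TOY_FLAG_MASK;
}
```

Moving the guarantee from a prose argument about per-architecture bit counts to a compile-time assertion means a configuration that violates it fails to build instead of silently corrupting the pointer.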
+1 -1
include/linux/pgalloc_tag.h
···
 
 	if (get_page_tag_ref(page, &ref, &handle)) {
 		alloc_tag_sub_check(&ref);
-		if (ref.ct)
+		if (ref.ct && !is_codetag_empty(&ref))
 			tag = ct_to_alloc_tag(ref.ct);
 		put_page_tag_ref(handle);
 	}
+1 -1
include/linux/sched.h
···
 	/* Used by memcontrol for targeted memcg charge: */
 	struct mem_cgroup		*active_memcg;
 
-	/* Cache for current->cgroups->memcg->objcg lookups: */
+	/* Cache for current->cgroups->memcg->nodeinfo[nid]->objcg lookups: */
 	struct obj_cgroup		*objcg;
 #endif
-14
include/linux/shmem_fs.h
···
 
 extern bool shmem_charge(struct inode *inode, long pages);
 
-#ifdef CONFIG_USERFAULTFD
-#ifdef CONFIG_SHMEM
-extern int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
-				  struct vm_area_struct *dst_vma,
-				  unsigned long dst_addr,
-				  unsigned long src_addr,
-				  uffd_flags_t flags,
-				  struct folio **foliop);
-#else /* !CONFIG_SHMEM */
-#define shmem_mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, \
-			       src_addr, flags, foliop) ({ BUG(); 0; })
-#endif /* CONFIG_SHMEM */
-#endif /* CONFIG_USERFAULTFD */
-
 /*
  * Used space is stored as unsigned 64-bit value in bytes but
  * quota core supports only signed 64-bit values so use that
+23 -2
include/linux/swap.h
···
 
 /* linux/mm/swap.c */
 void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file,
-		unsigned int nr_io, unsigned int nr_rotated)
-		__releases(lruvec->lru_lock);
+		unsigned int nr_io, unsigned int nr_rotated);
 void lru_note_cost_refault(struct folio *);
 void folio_add_lru(struct folio *);
 void folio_add_lru_vma(struct folio *, struct vm_area_struct *);
···
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
+unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx);
 
 #define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
 #define MEMCG_RECLAIM_PROACTIVE (1 << 2)
···
 
 	return READ_ONCE(memcg->swappiness);
 }
+
+void lru_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid);
 #else
 static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
 {
···
 	return vm_swap_full();
 }
 #endif
+
+/* for_each_managed_zone_pgdat - helper macro to iterate over all managed zones in a pgdat up to
+ * and including the specified highidx
+ * @zone: The current zone in the iterator
+ * @pgdat: The pgdat which node_zones are being iterated
+ * @idx: The index variable
+ * @highidx: The index of the highest zone to return
+ *
+ * This macro iterates through all managed zones up to and including the specified highidx.
+ * The zone iterator enters an invalid state after macro call and must be reinitialized
+ * before it can be used again.
+ */
+#define for_each_managed_zone_pgdat(zone, pgdat, idx, highidx)	\
+	for ((idx) = 0, (zone) = (pgdat)->node_zones;		\
+	     (idx) <= (highidx);				\
+	     (idx)++, (zone)++)					\
+		if (!managed_zone(zone))			\
+			continue;				\
+		else
 
 #endif /* __KERNEL__*/
 #endif /* _LINUX_SWAP_H */
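The `for ... if (!cond) continue; else` shape of `for_each_managed_zone_pgdat()` can be tried outside the kernel. The sketch below reproduces only the control flow; `toy_zone`, `toy_pgdat` and `toy_managed_zone()` are hypothetical stand-ins, and treating a non-zero page count as "managed" is an assumption made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_ZONES 4

/* Hypothetical stand-in for struct zone. */
struct toy_zone {
	unsigned long managed_pages;
};

/* Hypothetical stand-in for pg_data_t. */
struct toy_pgdat {
	struct toy_zone node_zones[TOY_MAX_ZONES];
};

/* Assumption for this sketch: a zone is "managed" if it has any pages. */
static bool toy_managed_zone(const struct toy_zone *zone)
{
	return zone->managed_pages != 0;
}

/*
 * Same shape as for_each_managed_zone_pgdat(): the trailing "else" makes
 * the statement that follows the macro the loop body, and unmanaged zones
 * are skipped with "continue" before the body runs.
 */
#define toy_for_each_managed_zone(zone, pgdat, idx, highidx)	\
	for ((idx) = 0, (zone) = (pgdat)->node_zones;		\
	     (idx) <= (highidx);				\
	     (idx)++, (zone)++)					\
		if (!toy_managed_zone(zone))			\
			continue;				\
		else

/* Sum the pages of all managed zones with index 0..highidx inclusive. */
static unsigned long toy_sum_managed(struct toy_pgdat *pgdat, int highidx)
{
	struct toy_zone *zone;
	unsigned long sum = 0;
	int idx;

	toy_for_each_managed_zone(zone, pgdat, idx, highidx)
		sum += zone->managed_pages;

	return sum;
}

/* Count how many zones the iterator actually visits. */
static int toy_count_managed(struct toy_pgdat *pgdat, int highidx)
{
	struct toy_zone *zone;
	int idx, visited = 0;

	toy_for_each_managed_zone(zone, pgdat, idx, highidx)
		visited++;

	return visited;
}
```

After the loop ends, `zone` points one element past the last zone examined, which matches the comment's warning that the iterator must be reinitialized before reuse.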
+35 -38
include/linux/userfaultfd_k.h
···
 
 extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
 
+/* VMA userfaultfd operations */
+struct vm_uffd_ops {
+	/* Checks if a VMA can support userfaultfd */
+	bool (*can_userfault)(struct vm_area_struct *vma, vm_flags_t vm_flags);
+	/*
+	 * Called to resolve UFFDIO_CONTINUE request.
+	 * Should return the folio found at pgoff in the VMA's pagecache if it
+	 * exists or ERR_PTR otherwise.
+	 * The returned folio is locked and with reference held.
+	 */
+	struct folio *(*get_folio_noalloc)(struct inode *inode, pgoff_t pgoff);
+	/*
+	 * Called during resolution of UFFDIO_COPY request.
+	 * Should allocate and return a folio or NULL if allocation fails.
+	 */
+	struct folio *(*alloc_folio)(struct vm_area_struct *vma,
+				     unsigned long addr);
+	/*
+	 * Called during resolution of UFFDIO_COPY request.
+	 * Should only be called with a folio returned by alloc_folio() above.
+	 * The folio will be set to locked.
+	 * Returns 0 on success, error code on failure.
+	 */
+	int (*filemap_add)(struct folio *folio, struct vm_area_struct *vma,
+			   unsigned long addr);
+	/*
+	 * Called during resolution of UFFDIO_COPY request on the error
+	 * handling path.
+	 * Should revert the operation of ->filemap_add().
+	 */
+	void (*filemap_remove)(struct folio *folio, struct vm_area_struct *vma);
+};
+
 /* A combined operation mode + behavior flags. */
 typedef unsigned int __bitwise uffd_flags_t;
 
···
 
 /* Flags controlling behavior. These behavior changes are mode-independent. */
 #define MFILL_ATOMIC_WP MFILL_ATOMIC_FLAG(0)
-
-extern int mfill_atomic_install_pte(pmd_t *dst_pmd,
-				    struct vm_area_struct *dst_vma,
-				    unsigned long dst_addr, struct page *page,
-				    bool newly_allocated, uffd_flags_t flags);
 
 extern ssize_t mfill_atomic_copy(struct userfaultfd_ctx *ctx, unsigned long dst_start,
 				 unsigned long src_start, unsigned long len,
···
 	return vma->vm_flags & __VM_UFFD_FLAGS;
 }
 
-static inline bool vma_can_userfault(struct vm_area_struct *vma,
-				     vm_flags_t vm_flags,
-				     bool wp_async)
-{
-	vm_flags &= __VM_UFFD_FLAGS;
-
-	if (vma->vm_flags & VM_DROPPABLE)
-		return false;
-
-	if ((vm_flags & VM_UFFD_MINOR) &&
-	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
-		return false;
-
-	/*
-	 * If wp async enabled, and WP is the only mode enabled, allow any
-	 * memory type.
-	 */
-	if (wp_async && (vm_flags == VM_UFFD_WP))
-		return true;
-
-	/*
-	 * If user requested uffd-wp but not enabled pte markers for
-	 * uffd-wp, then shmem & hugetlbfs are not supported but only
-	 * anonymous.
-	 */
-	if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP) &&
-	    !vma_is_anonymous(vma))
-		return false;
-
-	/* By default, allow any of anon|shmem|hugetlb */
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-	       vma_is_shmem(vma);
-}
+bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
+		       bool wp_async);
 
 static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
 {
+5 -5
include/trace/events/memcg.h
···
 
 DECLARE_EVENT_CLASS(memcg_rstat_stats,
 
-	TP_PROTO(struct mem_cgroup *memcg, int item, int val),
+	TP_PROTO(struct mem_cgroup *memcg, int item, long val),
 
 	TP_ARGS(memcg, item, val),
 
 	TP_STRUCT__entry(
 		__field(u64, id)
 		__field(int, item)
-		__field(int, val)
+		__field(long, val)
 	),
 
 	TP_fast_assign(
···
 		__entry->val = val;
 	),
 
-	TP_printk("memcg_id=%llu item=%d val=%d",
+	TP_printk("memcg_id=%llu item=%d val=%ld",
 		  __entry->id, __entry->item, __entry->val)
 );
 
 DEFINE_EVENT(memcg_rstat_stats, mod_memcg_state,
 
-	TP_PROTO(struct mem_cgroup *memcg, int item, int val),
+	TP_PROTO(struct mem_cgroup *memcg, int item, long val),
 
 	TP_ARGS(memcg, item, val)
 );
 
 DEFINE_EVENT(memcg_rstat_stats, mod_memcg_lruvec_state,
 
-	TP_PROTO(struct mem_cgroup *memcg, int item, int val),
+	TP_PROTO(struct mem_cgroup *memcg, int item, long val),
 
 	TP_ARGS(memcg, item, val)
 );
+3
include/trace/events/writeback.h
···
 		__entry->ino = inode ? inode->i_ino : 0;
 		__entry->memcg_id = wb->memcg_css->id;
 		__entry->cgroup_ino = __trace_wb_assign_cgroup(wb);
+
+		rcu_read_lock();
 		__entry->page_cgroup_ino = cgroup_ino(folio_memcg(folio)->css.cgroup);
+		rcu_read_unlock();
 	),
 
 	TP_printk("bdi %s[%llu]: ino=%llu memcg_id=%u cgroup_ino=%llu page_cgroup_ino=%llu",
+5 -4
kernel/cgroup/cgroup.c
···
  */
 static void css_killed_work_fn(struct work_struct *work)
 {
-	struct cgroup_subsys_state *css =
-		container_of(work, struct cgroup_subsys_state, destroy_work);
+	struct cgroup_subsys_state *css;
+
+	css = container_of(to_rcu_work(work), struct cgroup_subsys_state, destroy_rwork);
 
 	cgroup_lock();
 
···
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
 	if (atomic_dec_and_test(&css->online_cnt)) {
-		INIT_WORK(&css->destroy_work, css_killed_work_fn);
-		queue_work(cgroup_offline_wq, &css->destroy_work);
+		INIT_RCU_WORK(&css->destroy_rwork, css_killed_work_fn);
+		queue_rcu_work(cgroup_offline_wq, &css->destroy_rwork);
 	}
 }
+138 -20
kernel/liveupdate/kexec_handover.c
···
 #include <linux/kexec.h>
 #include <linux/kexec_handover.h>
 #include <linux/kho_radix_tree.h>
+#include <linux/utsname.h>
 #include <linux/kho/abi/kexec_handover.h>
+#include <linux/kho/abi/kexec_metadata.h>
 #include <linux/libfdt.h>
 #include <linux/list.h>
 #include <linux/memblock.h>
···
 }
 
 /**
- * kho_add_subtree - record the physical address of a sub FDT in KHO root tree.
+ * kho_add_subtree - record the physical address of a sub blob in KHO root tree.
  * @name: name of the sub tree.
- * @fdt: the sub tree blob.
+ * @blob: the sub tree blob.
+ * @size: size of the blob in bytes.
  *
  * Creates a new child node named @name in KHO root FDT and records
- * the physical address of @fdt. The pages of @fdt must also be preserved
+ * the physical address of @blob. The pages of @blob must also be preserved
  * by KHO for the new kernel to retrieve it after kexec.
  *
  * A debugfs blob entry is also created at
···
  *
  * Return: 0 on success, error code on failure
  */
-int kho_add_subtree(const char *name, void *fdt)
+int kho_add_subtree(const char *name, void *blob, size_t size)
 {
-	phys_addr_t phys = virt_to_phys(fdt);
+	phys_addr_t phys = virt_to_phys(blob);
 	void *root_fdt = kho_out.fdt;
+	u64 size_u64 = size;
 	int err = -ENOMEM;
 	int off, fdt_err;
 
···
 		goto out_pack;
 	}
 
-	err = fdt_setprop(root_fdt, off, KHO_FDT_SUB_TREE_PROP_NAME,
+	err = fdt_setprop(root_fdt, off, KHO_SUB_TREE_PROP_NAME,
 			  &phys, sizeof(phys));
 	if (err < 0)
 		goto out_pack;
 
-	WARN_ON_ONCE(kho_debugfs_fdt_add(&kho_out.dbg, name, fdt, false));
+	err = fdt_setprop(root_fdt, off, KHO_SUB_TREE_SIZE_PROP_NAME,
+			  &size_u64, sizeof(size_u64));
+	if (err < 0)
+		goto out_pack;
+
+	WARN_ON_ONCE(kho_debugfs_blob_add(&kho_out.dbg, name, blob,
+					  size, false));
 
 out_pack:
 	fdt_pack(root_fdt);
···
 }
 EXPORT_SYMBOL_GPL(kho_add_subtree);
 
-void kho_remove_subtree(void *fdt)
+void kho_remove_subtree(void *blob)
 {
-	phys_addr_t target_phys = virt_to_phys(fdt);
+	phys_addr_t target_phys = virt_to_phys(blob);
 	void *root_fdt = kho_out.fdt;
 	int off;
 	int err;
···
 		const u64 *val;
 		int len;
 
-		val = fdt_getprop(root_fdt, off, KHO_FDT_SUB_TREE_PROP_NAME, &len);
+		val = fdt_getprop(root_fdt, off, KHO_SUB_TREE_PROP_NAME, &len);
 		if (!val || len != sizeof(phys_addr_t))
 			continue;
 
 		if ((phys_addr_t)*val == target_phys) {
 			fdt_del_node(root_fdt, off);
-			kho_debugfs_fdt_remove(&kho_out.dbg, fdt);
+			kho_debugfs_blob_remove(&kho_out.dbg, blob);
 			break;
 		}
 	}
···
 struct kho_in {
 	phys_addr_t fdt_phys;
 	phys_addr_t scratch_phys;
+	char previous_release[__NEW_UTS_LEN + 1];
+	u32 kexec_count;
 	struct kho_debugfs dbg;
 };
 
···
 EXPORT_SYMBOL_GPL(is_kho_boot);
 
 /**
- * kho_retrieve_subtree - retrieve a preserved sub FDT by its name.
- * @name: the name of the sub FDT passed to kho_add_subtree().
- * @phys: if found, the physical address of the sub FDT is stored in @phys.
+ * kho_retrieve_subtree - retrieve a preserved sub blob by its name.
+ * @name: the name of the sub blob passed to kho_add_subtree().
+ * @phys: if found, the physical address of the sub blob is stored in @phys.
+ * @size: if not NULL and found, the size of the sub blob is stored in @size.
  *
- * Retrieve a preserved sub FDT named @name and store its physical
- * address in @phys.
+ * Retrieve a preserved sub blob named @name and store its physical
+ * address in @phys and optionally its size in @size.
  *
  * Return: 0 on success, error code on failure
  */
-int kho_retrieve_subtree(const char *name, phys_addr_t *phys)
+int kho_retrieve_subtree(const char *name, phys_addr_t *phys, size_t *size)
 {
 	const void *fdt = kho_get_fdt();
 	const u64 *val;
···
 	if (offset < 0)
 		return -ENOENT;
 
-	val = fdt_getprop(fdt, offset, KHO_FDT_SUB_TREE_PROP_NAME, &len);
+	val = fdt_getprop(fdt, offset, KHO_SUB_TREE_PROP_NAME, &len);
 	if (!val || len != sizeof(*val))
 		return -EINVAL;
 
 	*phys = (phys_addr_t)*val;
+
+	val = fdt_getprop(fdt, offset, KHO_SUB_TREE_SIZE_PROP_NAME, &len);
+	if (!val || len != sizeof(*val)) {
+		pr_warn("broken KHO subnode '%s': missing or invalid blob-size property\n",
+			name);
+		return -EINVAL;
+	}
+
+	if (size)
+		*size = (size_t)*val;
 
 	return 0;
 }
···
 	return err;
 }
 
+static void __init kho_in_kexec_metadata(void)
+{
+	struct kho_kexec_metadata *metadata;
+	phys_addr_t metadata_phys;
+	size_t blob_size;
+	int err;
+
+	err = kho_retrieve_subtree(KHO_METADATA_NODE_NAME, &metadata_phys,
+				   &blob_size);
+	if (err)
+		/* This is fine, previous kernel didn't export metadata */
+		return;
+
+	/* Check that, at least, "version" is present */
+	if (blob_size < sizeof(u32)) {
+		pr_warn("kexec-metadata blob too small (%zu bytes)\n",
+			blob_size);
+		return;
+	}
+
+	metadata = phys_to_virt(metadata_phys);
+
+	if (metadata->version != KHO_KEXEC_METADATA_VERSION) {
+		pr_warn("kexec-metadata version %u not supported (expected %u)\n",
+			metadata->version, KHO_KEXEC_METADATA_VERSION);
+		return;
+	}
+
+	if (blob_size < sizeof(*metadata)) {
+		pr_warn("kexec-metadata blob too small for v%u (%zu < %zu)\n",
+			metadata->version, blob_size, sizeof(*metadata));
+		return;
+	}
+
+	/*
+	 * Copy data to the kernel structure that will persist during
+	 * kernel lifetime.
+	 */
+	kho_in.kexec_count = metadata->kexec_count;
+	strscpy(kho_in.previous_release, metadata->previous_release,
+		sizeof(kho_in.previous_release));
+
+	pr_info("exec from: %s (count %u)\n",
+		kho_in.previous_release, kho_in.kexec_count);
+}
+
+/*
+ * Create kexec metadata to pass kernel version and boot count to the
+ * next kernel. This keeps the core KHO ABI minimal and allows the
+ * metadata format to evolve independently.
+ */
+static __init int kho_out_kexec_metadata(void)
+{
+	struct kho_kexec_metadata *metadata;
+	int err;
+
+	metadata = kho_alloc_preserve(sizeof(*metadata));
+	if (IS_ERR(metadata))
+		return PTR_ERR(metadata);
+
+	metadata->version = KHO_KEXEC_METADATA_VERSION;
+	strscpy(metadata->previous_release, init_uts_ns.name.release,
+		sizeof(metadata->previous_release));
+	/* kho_in.kexec_count is set to 0 on cold boot */
+	metadata->kexec_count = kho_in.kexec_count + 1;
+
+	err = kho_add_subtree(KHO_METADATA_NODE_NAME, metadata,
+			      sizeof(*metadata));
+	if (err)
+		kho_unpreserve_free(metadata);
+
+	return err;
+}
+
+static int __init kho_kexec_metadata_init(const void *fdt)
+{
+	int err;
+
+	if (fdt)
+		kho_in_kexec_metadata();
+
+	/* Populate kexec metadata for the possible next kexec */
+	err = kho_out_kexec_metadata();
+	if (err)
+		pr_warn("failed to initialize kexec-metadata subtree: %d\n",
+			err);
+
+	return err;
+}
+
 static __init int kho_init(void)
 {
 	struct kho_radix_tree *tree = &kho_out.radix_tree;
···
 	if (err)
 		goto err_free_fdt;
 
+	err = kho_kexec_metadata_init(fdt);
+	if (err)
+		goto err_free_fdt;
+
 	if (fdt) {
 		kho_in_debugfs_init(&kho_in.dbg, fdt);
 		return 0;
···
 		init_cma_reserved_pageblock(pfn_to_page(pfn));
 	}
 
-	WARN_ON_ONCE(kho_debugfs_fdt_add(&kho_out.dbg, "fdt",
-					 kho_out.fdt, true));
+	WARN_ON_ONCE(kho_debugfs_blob_add(&kho_out.dbg, "fdt",
+					  kho_out.fdt,
+					  fdt_totalsize(kho_out.fdt), true));
 
 	return 0;
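`kho_in_kexec_metadata()` validates the handed-over blob in three steps: enough bytes to read the version, a supported version, then enough bytes for the full structure of that version. The userspace sketch below follows the same order. The struct layout, the `TOY_*` constants and the status codes are hypothetical; the real `struct kho_kexec_metadata` lives in an ABI header that this diff does not show.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_METADATA_VERSION	1u
#define TOY_RELEASE_LEN		65	/* assumed size of the release string buffer */

/* Hypothetical layout; only the version-first ordering matters here. */
struct toy_metadata {
	uint32_t version;
	char previous_release[TOY_RELEASE_LEN];
	uint32_t kexec_count;
};

enum toy_status {
	TOY_OK,
	TOY_TOO_SMALL,
	TOY_BAD_VERSION,
};

/*
 * Validate an untrusted blob of @size bytes and copy it into @out.
 * The version field is only read once the blob is known to contain it,
 * and the remaining fields only once the blob covers the whole struct.
 */
static enum toy_status toy_parse_metadata(const void *blob, size_t size,
					  struct toy_metadata *out)
{
	uint32_t version;

	/* Step 1: at least the version field must be present. */
	if (size < sizeof(version))
		return TOY_TOO_SMALL;
	memcpy(&version, blob, sizeof(version));

	/* Step 2: reject layouts this parser does not understand. */
	if (version != TOY_METADATA_VERSION)
		return TOY_BAD_VERSION;

	/* Step 3: the blob must cover the full structure for this version. */
	if (size < sizeof(*out))
		return TOY_TOO_SMALL;

	memcpy(out, blob, sizeof(*out));
	/* Force termination, similar in spirit to the strscpy() in the diff. */
	out->previous_release[TOY_RELEASE_LEN - 1] = '\0';
	return TOY_OK;
}
```

Checking the version before the full size lets a newer, larger or smaller layout be reported as an unsupported version instead of a generic size error, which is the distinction the two `pr_warn()` messages in the diff draw.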
+35 -20
kernel/liveupdate/kexec_handover_debugfs.c
···
 	struct dentry *file;
 };
 
-static int __kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir,
-				 const char *name, const void *fdt)
+static int __kho_debugfs_blob_add(struct list_head *list, struct dentry *dir,
+				  const char *name, const void *blob,
+				  size_t size)
 {
 	struct fdt_debugfs *f;
 	struct dentry *file;
···
 	if (!f)
 		return -ENOMEM;
 
-	f->wrapper.data = (void *)fdt;
-	f->wrapper.size = fdt_totalsize(fdt);
+	f->wrapper.data = (void *)blob;
+	f->wrapper.size = size;
 
 	file = debugfs_create_blob(name, 0400, dir, &f->wrapper);
 	if (IS_ERR(file)) {
···
 	return 0;
 }
 
-int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name,
-			const void *fdt, bool root)
+int kho_debugfs_blob_add(struct kho_debugfs *dbg, const char *name,
+			 const void *blob, size_t size, bool root)
 {
 	struct dentry *dir;
 
···
 	else
 		dir = dbg->sub_fdt_dir;
 
-	return __kho_debugfs_fdt_add(&dbg->fdt_list, dir, name, fdt);
+	return __kho_debugfs_blob_add(&dbg->fdt_list, dir, name, blob, size);
 }
 
-void kho_debugfs_fdt_remove(struct kho_debugfs *dbg, void *fdt)
+void kho_debugfs_blob_remove(struct kho_debugfs *dbg, void *blob)
 {
 	struct fdt_debugfs *ff;
 
 	list_for_each_entry(ff, &dbg->fdt_list, list) {
-		if (ff->wrapper.data == fdt) {
+		if (ff->wrapper.data == blob) {
 			debugfs_remove(ff->file);
 			list_del(&ff->list);
 			kfree(ff);
···
 		goto err_rmdir;
 	}
 
-	err = __kho_debugfs_fdt_add(&dbg->fdt_list, dir, "fdt", fdt);
+	err = __kho_debugfs_blob_add(&dbg->fdt_list, dir, "fdt", fdt,
+				     fdt_totalsize(fdt));
 	if (err)
 		goto err_rmdir;
 
 	fdt_for_each_subnode(child, fdt, 0) {
 		int len = 0;
 		const char *name = fdt_get_name(fdt, child, NULL);
-		const u64 *fdt_phys;
+		const u64 *blob_phys;
+		const u64 *blob_size;
+		void *blob;
 
-		fdt_phys = fdt_getprop(fdt, child, KHO_FDT_SUB_TREE_PROP_NAME, &len);
-		if (!fdt_phys)
+		blob_phys = fdt_getprop(fdt, child,
+					KHO_SUB_TREE_PROP_NAME, &len);
+		if (!blob_phys)
 			continue;
-		if (len != sizeof(*fdt_phys)) {
-			pr_warn("node %s prop fdt has invalid length: %d\n",
-				name, len);
+		if (len != sizeof(*blob_phys)) {
+			pr_warn("node %s prop %s has invalid length: %d\n",
+				name, KHO_SUB_TREE_PROP_NAME, len);
 			continue;
 		}
-		err = __kho_debugfs_fdt_add(&dbg->fdt_list, sub_fdt_dir, name,
-					    phys_to_virt(*fdt_phys));
+
+		blob_size = fdt_getprop(fdt, child,
+					KHO_SUB_TREE_SIZE_PROP_NAME, &len);
+		if (!blob_size || len != sizeof(*blob_size)) {
+			pr_warn("node %s missing or invalid %s property\n",
+				name, KHO_SUB_TREE_SIZE_PROP_NAME);
+			continue;
+		}
+
+		blob = phys_to_virt(*blob_phys);
+		err = __kho_debugfs_blob_add(&dbg->fdt_list, sub_fdt_dir, name,
+					     blob, *blob_size);
 		if (err) {
-			pr_warn("failed to add fdt %s to debugfs: %pe\n", name,
-				ERR_PTR(err));
+			pr_warn("failed to add blob %s to debugfs: %pe\n",
+				name, ERR_PTR(err));
 			continue;
 		}
 	}
+8 -7
kernel/liveupdate/kexec_handover_internal.h
···
 int kho_debugfs_init(void);
 void kho_in_debugfs_init(struct kho_debugfs *dbg, const void *fdt);
 int kho_out_debugfs_init(struct kho_debugfs *dbg);
-int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name,
-			const void *fdt, bool root);
-void kho_debugfs_fdt_remove(struct kho_debugfs *dbg, void *fdt);
+int kho_debugfs_blob_add(struct kho_debugfs *dbg, const char *name,
+			 const void *blob, size_t size, bool root);
+void kho_debugfs_blob_remove(struct kho_debugfs *dbg, void *blob);
 #else
 static inline int kho_debugfs_init(void) { return 0; }
 static inline void kho_in_debugfs_init(struct kho_debugfs *dbg,
 				       const void *fdt) { }
 static inline int kho_out_debugfs_init(struct kho_debugfs *dbg) { return 0; }
-static inline int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name,
-				      const void *fdt, bool root) { return 0; }
-static inline void kho_debugfs_fdt_remove(struct kho_debugfs *dbg,
-					  void *fdt) { }
+static inline int kho_debugfs_blob_add(struct kho_debugfs *dbg,
+				       const char *name, const void *blob,
+				       size_t size, bool root) { return 0; }
+static inline void kho_debugfs_blob_remove(struct kho_debugfs *dbg,
+					   void *blob) { }
 #endif /* CONFIG_KEXEC_HANDOVER_DEBUGFS */
 
 #ifdef CONFIG_KEXEC_HANDOVER_DEBUG
+9 -2
kernel/liveupdate/luo_core.c
···
 #include <linux/liveupdate.h>
 #include <linux/miscdevice.h>
 #include <linux/mm.h>
+#include <linux/rwsem.h>
 #include <linux/sizes.h>
 #include <linux/string.h>
 #include <linux/unaligned.h>
···
 	void *fdt_in;
 	u64 liveupdate_num;
 } luo_global;
+
+/*
+ * luo_register_rwlock - Protects registration of file handlers and FLBs.
+ */
+DECLARE_RWSEM(luo_register_rwlock);
 
 static int __init early_liveupdate_param(char *buf)
 {
···
 	}
 
 	/* Retrieve LUO subtree, and verify its format. */
-	err = kho_retrieve_subtree(LUO_FDT_KHO_ENTRY_NAME, &fdt_phys);
+	err = kho_retrieve_subtree(LUO_FDT_KHO_ENTRY_NAME, &fdt_phys, NULL);
 	if (err) {
 		if (err != -ENOENT) {
 			pr_err("failed to retrieve FDT '%s' from KHO: %pe\n",
···
 	if (err)
 		goto exit_free;
 
-	err = kho_add_subtree(LUO_FDT_KHO_ENTRY_NAME, fdt_out);
+	err = kho_add_subtree(LUO_FDT_KHO_ENTRY_NAME, fdt_out,
+			      fdt_totalsize(fdt_out));
 	if (err)
 		goto exit_free;
 	luo_global.fdt_out = fdt_out;
+56 -56
kernel/liveupdate/luo_file.c
···
 #include <linux/liveupdate.h>
 #include <linux/module.h>
 #include <linux/sizes.h>
+#include <linux/xarray.h>
 #include <linux/slab.h>
 #include <linux/string.h>
 #include "luo_internal.h"
 
 static LIST_HEAD(luo_file_handler_list);
+
+/* Keep track of files being preserved by LUO */
+static DEFINE_XARRAY(luo_preserved_files);
 
 /* 2 4K pages, give space for 128 files per file_set */
 #define LUO_FILE_PGCNT 2ul
···
 	file_set->files = NULL;
 }
 
+static unsigned long luo_get_id(struct liveupdate_file_handler *fh,
+				struct file *file)
+{
+	return fh->ops->get_id ? fh->ops->get_id(file) : (unsigned long)file;
+}
+
 static bool luo_token_is_used(struct luo_file_set *file_set, u64 token)
 {
 	struct luo_file *iter;
···
  * Context: Can be called from an ioctl handler during normal system operation.
  * Return: 0 on success. Returns a negative errno on failure:
  *         -EEXIST if the token is already used.
+ *         -EBUSY if the file descriptor is already preserved by another session.
  *         -EBADF if the file descriptor is invalid.
  *         -ENOSPC if the file_set is full.
  *         -ENOENT if no compatible handler is found.
···
 		goto err_fput;
 
 	err = -ENOENT;
+	down_read(&luo_register_rwlock);
 	list_private_for_each_entry(fh, &luo_file_handler_list, list) {
 		if (fh->ops->can_preserve(fh, file)) {
-			err = 0;
+			if (try_module_get(fh->ops->owner))
+				err = 0;
 			break;
 		}
 	}
+	up_read(&luo_register_rwlock);
 
 	/* err is still -ENOENT if no handler was found */
 	if (err)
 		goto err_free_files_mem;
 
+	err = xa_insert(&luo_preserved_files, luo_get_id(fh, file),
+			file, GFP_KERNEL);
+	if (err)
+		goto err_module_put;
+
 	err = luo_flb_file_preserve(fh);
 	if (err)
-		goto err_free_files_mem;
+		goto err_erase_xa;
 
 	luo_file = kzalloc_obj(*luo_file);
 	if (!luo_file) {
···
 	kfree(luo_file);
 err_flb_unpreserve:
 	luo_flb_file_unpreserve(fh);
+err_erase_xa:
+	xa_erase(&luo_preserved_files, luo_get_id(fh, file));
+err_module_put:
+	module_put(fh->ops->owner);
 err_free_files_mem:
 	luo_free_files_mem(file_set);
 err_fput:
···
 		args.private_data = luo_file->private_data;
 		luo_file->fh->ops->unpreserve(&args);
 		luo_flb_file_unpreserve(luo_file->fh);
+		module_put(luo_file->fh->ops->owner);
 
+		xa_erase(&luo_preserved_files,
+			 luo_get_id(luo_file->fh, luo_file->file));
 		list_del(&luo_file->list);
 		file_set->count--;
 
···
 	luo_file->file = args.file;
 	/* Get reference so we can keep this file in LUO until finish */
 	get_file(luo_file->file);
+
+	WARN_ON(xa_insert(&luo_preserved_files,
+			  luo_get_id(luo_file->fh, luo_file->file),
+			  luo_file->file, GFP_KERNEL));
+
 	*filep = luo_file->file;
 	luo_file->retrieve_status = 1;
 
···
 
 		luo_file->fh->ops->finish(&args);
 		luo_flb_file_finish(luo_file->fh);
+		module_put(luo_file->fh->ops->owner);
680
650 } 681 651 682 652 /** ··· 733 701 734 702 luo_file_finish_one(file_set, luo_file); 735 703 736 - if (luo_file->file) 704 + if (luo_file->file) { 705 + xa_erase(&luo_preserved_files, 706 + luo_get_id(luo_file->fh, luo_file->file)); 737 707 fput(luo_file->file); 708 + } 738 709 list_del(&luo_file->list); 739 710 file_set->count--; 740 711 mutex_destroy(&luo_file->mutex); ··· 812 777 bool handler_found = false; 813 778 struct luo_file *luo_file; 814 779 780 + down_read(&luo_register_rwlock); 815 781 list_private_for_each_entry(fh, &luo_file_handler_list, list) { 816 782 if (!strcmp(fh->compatible, file_ser[i].compatible)) { 817 - handler_found = true; 783 + if (try_module_get(fh->ops->owner)) 784 + handler_found = true; 818 785 break; 819 786 } 820 787 } 788 + up_read(&luo_register_rwlock); 821 789 822 790 if (!handler_found) { 823 - pr_warn("No registered handler for compatible '%s'\n", 791 + pr_warn("No registered handler for compatible '%.*s'\n", 792 + (int)sizeof(file_ser[i].compatible), 824 793 file_ser[i].compatible); 825 794 return -ENOENT; 826 795 } 827 796 828 797 luo_file = kzalloc_obj(*luo_file); 829 - if (!luo_file) 798 + if (!luo_file) { 799 + module_put(fh->ops->owner); 830 800 return -ENOMEM; 801 + } 831 802 832 803 luo_file->fh = fh; 833 804 luo_file->file = NULL; ··· 883 842 return -EINVAL; 884 843 } 885 844 886 - /* 887 - * Ensure the system is quiescent (no active sessions). 888 - * This prevents registering new handlers while sessions are active or 889 - * while deserialization is in progress. 
890 - */ 891 - if (!luo_session_quiesce()) 892 - return -EBUSY; 893 - 845 + down_write(&luo_register_rwlock); 894 846 /* Check for duplicate compatible strings */ 895 847 list_private_for_each_entry(fh_iter, &luo_file_handler_list, list) { 896 848 if (!strcmp(fh_iter->compatible, fh->compatible)) { 897 849 pr_err("File handler registration failed: Compatible string '%s' already registered.\n", 898 850 fh->compatible); 899 851 err = -EEXIST; 900 - goto err_resume; 852 + goto err_unlock; 901 853 } 902 - } 903 - 904 - /* Pin the module implementing the handler */ 905 - if (!try_module_get(fh->ops->owner)) { 906 - err = -EAGAIN; 907 - goto err_resume; 908 854 } 909 855 910 856 INIT_LIST_HEAD(&ACCESS_PRIVATE(fh, flb_list)); 911 857 INIT_LIST_HEAD(&ACCESS_PRIVATE(fh, list)); 912 858 list_add_tail(&ACCESS_PRIVATE(fh, list), &luo_file_handler_list); 913 - luo_session_resume(); 859 + up_write(&luo_register_rwlock); 914 860 915 861 liveupdate_test_register(fh); 916 862 917 863 return 0; 918 864 919 - err_resume: 920 - luo_session_resume(); 865 + err_unlock: 866 + up_write(&luo_register_rwlock); 921 867 return err; 922 868 } 923 869 ··· 914 886 * 915 887 * Unregisters the file handler from the liveupdate core. This function 916 888 * reverses the operations of liveupdate_register_file_handler(). 917 - * 918 - * It ensures safe removal by checking that: 919 - * No live update session is currently in progress. 920 - * No FLB registered with this file handler. 921 - * 922 - * If the unregistration fails, the internal test state is reverted. 923 - * 924 - * Return: 0 Success. -EOPNOTSUPP when live update is not enabled. -EBUSY A live 925 - * update is in progress, can't quiesce live update or FLB is registred with 926 - * this file handler. 
927 889 */ 928 - int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh) 890 + void liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh) 929 891 { 930 - int err = -EBUSY; 931 - 932 892 if (!liveupdate_enabled()) 933 - return -EOPNOTSUPP; 893 + return; 934 894 935 - liveupdate_test_unregister(fh); 936 - 937 - if (!luo_session_quiesce()) 938 - goto err_register; 939 - 940 - if (!list_empty(&ACCESS_PRIVATE(fh, flb_list))) 941 - goto err_resume; 942 - 895 + guard(rwsem_write)(&luo_register_rwlock); 896 + luo_flb_unregister_all(fh); 943 897 list_del(&ACCESS_PRIVATE(fh, list)); 944 - module_put(fh->ops->owner); 945 - luo_session_resume(); 946 - 947 - return 0; 948 - 949 - err_resume: 950 - luo_session_resume(); 951 - err_register: 952 - liveupdate_test_register(fh); 953 - return err; 954 898 }
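Reviewer note on the luo_file.c change above: the double-preservation check keys each file by `luo_get_id()`, which uses the handler's optional `get_id` callback and otherwise falls back to the `struct file` pointer value. An insert that finds the key already present is reported as -EBUSY. The following is a minimal userspace sketch of that idea, not the kernel implementation. The names `file_obj`, `handler_ops`, `preserve_id`, `preserved_insert` and `preserved_erase` are hypothetical, and a small linear table stands in for the XArray.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct file. */
struct file_obj {
	unsigned long inode;
};

/* Hypothetical stand-in for the handler ops; get_id is optional. */
struct handler_ops {
	unsigned long (*get_id)(struct file_obj *f);
};

/* Use the handler's identity if it provides one, else the pointer value. */
static unsigned long preserve_id(const struct handler_ops *ops,
				 struct file_obj *f)
{
	return ops->get_id ? ops->get_id(f) : (unsigned long)(uintptr_t)f;
}

#define PRESERVED_MAX 8
static unsigned long preserved[PRESERVED_MAX];
static size_t preserved_count;

/* Returns 0 on success, -EBUSY if the id is already tracked. */
static int preserved_insert(unsigned long id)
{
	assert(preserved_count <= PRESERVED_MAX);
	for (size_t i = 0; i < preserved_count; i++)
		if (preserved[i] == id)
			return -EBUSY;
	if (preserved_count == PRESERVED_MAX)
		return -ENOSPC;
	preserved[preserved_count++] = id;
	return 0;
}

/* Forget an id so the same object can be preserved again later. */
static void preserved_erase(unsigned long id)
{
	for (size_t i = 0; i < preserved_count; i++) {
		if (preserved[i] == id) {
			preserved[i] = preserved[--preserved_count];
			return;
		}
	}
}

/* Example identity callback: two handles to the same inode share an id. */
static unsigned long inode_id(struct file_obj *f)
{
	return f->inode;
}
```

With a `get_id` callback, two distinct objects can map to the same key, so the second insert is rejected. With the pointer fallback, only the very same object collides.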
+97 -85
kernel/liveupdate/luo_flb.c
··· 89 89 static struct luo_flb_private *luo_flb_get_private(struct liveupdate_flb *flb) 90 90 { 91 91 struct luo_flb_private *private = &ACCESS_PRIVATE(flb, private); 92 + static DEFINE_SPINLOCK(luo_flb_init_lock); 92 93 94 + if (smp_load_acquire(&private->initialized)) 95 + return private; 96 + 97 + guard(spinlock)(&luo_flb_init_lock); 93 98 if (!private->initialized) { 94 99 mutex_init(&private->incoming.lock); 95 100 mutex_init(&private->outgoing.lock); 96 101 INIT_LIST_HEAD(&private->list); 97 102 private->users = 0; 98 - private->initialized = true; 103 + smp_store_release(&private->initialized, true); 99 104 } 100 105 101 106 return private; ··· 115 110 struct liveupdate_flb_op_args args = {0}; 116 111 int err; 117 112 113 + if (!try_module_get(flb->ops->owner)) 114 + return -ENODEV; 115 + 118 116 args.flb = flb; 119 117 err = flb->ops->preserve(&args); 120 - if (err) 118 + if (err) { 119 + module_put(flb->ops->owner); 121 120 return err; 121 + } 122 122 private->outgoing.data = args.data; 123 123 private->outgoing.obj = args.obj; 124 124 } ··· 151 141 152 142 private->outgoing.data = 0; 153 143 private->outgoing.obj = NULL; 144 + module_put(flb->ops->owner); 154 145 } 155 146 } 156 147 } ··· 187 176 if (!found) 188 177 return -ENOENT; 189 178 179 + if (!try_module_get(flb->ops->owner)) 180 + return -ENODEV; 181 + 190 182 args.flb = flb; 191 183 args.data = private->incoming.data; 192 184 193 185 err = flb->ops->retrieve(&args); 194 - if (err) 186 + if (err) { 187 + module_put(flb->ops->owner); 195 188 return err; 189 + } 196 190 197 191 private->incoming.obj = args.obj; 198 192 private->incoming.retrieved = true; ··· 231 215 private->incoming.data = 0; 232 216 private->incoming.obj = NULL; 233 217 private->incoming.finished = true; 218 + module_put(flb->ops->owner); 234 219 } 235 220 } 236 221 } ··· 257 240 struct luo_flb_link *iter; 258 241 int err = 0; 259 242 243 + down_read(&luo_register_rwlock); 260 244 list_for_each_entry(iter, flb_list, list) { 261 
245 err = luo_flb_file_preserve_one(iter->flb); 262 246 if (err) 263 247 goto exit_err; 264 248 } 249 + up_read(&luo_register_rwlock); 265 250 266 251 return 0; 267 252 268 253 exit_err: 269 254 list_for_each_entry_continue_reverse(iter, flb_list, list) 270 255 luo_flb_file_unpreserve_one(iter->flb); 256 + up_read(&luo_register_rwlock); 271 257 272 258 return err; 273 259 } ··· 292 272 struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); 293 273 struct luo_flb_link *iter; 294 274 275 + guard(rwsem_read)(&luo_register_rwlock); 295 276 list_for_each_entry_reverse(iter, flb_list, list) 296 277 luo_flb_file_unpreserve_one(iter->flb); 297 278 } ··· 313 292 struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); 314 293 struct luo_flb_link *iter; 315 294 295 + guard(rwsem_read)(&luo_register_rwlock); 316 296 list_for_each_entry_reverse(iter, flb_list, list) 317 297 luo_flb_file_finish_one(iter->flb); 298 + } 299 + 300 + static void luo_flb_unregister_one(struct liveupdate_file_handler *fh, 301 + struct liveupdate_flb *flb) 302 + { 303 + struct luo_flb_private *private = luo_flb_get_private(flb); 304 + struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); 305 + struct luo_flb_link *iter; 306 + bool found = false; 307 + 308 + /* Find and remove the link from the file handler's list */ 309 + list_for_each_entry(iter, flb_list, list) { 310 + if (iter->flb == flb) { 311 + list_del(&iter->list); 312 + kfree(iter); 313 + found = true; 314 + break; 315 + } 316 + } 317 + 318 + if (!found) { 319 + pr_warn("Failed to unregister FLB '%s': not found in file handler '%s'\n", 320 + flb->compatible, fh->compatible); 321 + return; 322 + } 323 + 324 + private->users--; 325 + 326 + /* 327 + * If this is the last file-handler with which we are registred, remove 328 + * from the global list. 
329 + */ 330 + if (!private->users) { 331 + list_del_init(&private->list); 332 + luo_flb_global.count--; 333 + } 334 + } 335 + 336 + /** 337 + * luo_flb_unregister_all - Unregister all FLBs associated with a file handler. 338 + * @fh: The file handler whose FLBs should be unregistered. 339 + * 340 + * This function iterates through the list of FLBs associated with the given 341 + * file handler and unregisters them all one by one. 342 + */ 343 + void luo_flb_unregister_all(struct liveupdate_file_handler *fh) 344 + { 345 + struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); 346 + struct luo_flb_link *iter, *tmp; 347 + 348 + if (!liveupdate_enabled()) 349 + return; 350 + 351 + lockdep_assert_held_write(&luo_register_rwlock); 352 + list_for_each_entry_safe(iter, tmp, flb_list, list) 353 + luo_flb_unregister_one(fh, iter->flb); 318 354 } 319 355 320 356 /** ··· 404 326 struct luo_flb_link *link __free(kfree) = NULL; 405 327 struct liveupdate_flb *gflb; 406 328 struct luo_flb_link *iter; 407 - int err; 408 329 409 330 if (!liveupdate_enabled()) 410 331 return -EOPNOTSUPP; ··· 424 347 if (!link) 425 348 return -ENOMEM; 426 349 427 - /* 428 - * Ensure the system is quiescent (no active sessions). 429 - * This acts as a global lock for registration: no other thread can 430 - * be in this section, and no sessions can be creating/using FDs. 
431 - */ 432 - if (!luo_session_quiesce()) 433 - return -EBUSY; 350 + guard(rwsem_write)(&luo_register_rwlock); 434 351 435 352 /* Check that this FLB is not already linked to this file handler */ 436 - err = -EEXIST; 437 353 list_for_each_entry(iter, flb_list, list) { 438 354 if (iter->flb == flb) 439 - goto err_resume; 355 + return -EEXIST; 440 356 } 441 357 442 358 /* ··· 437 367 * is registered 438 368 */ 439 369 if (!private->users) { 440 - if (WARN_ON(!list_empty(&private->list))) { 441 - err = -EINVAL; 442 - goto err_resume; 443 - } 370 + if (WARN_ON(!list_empty(&private->list))) 371 + return -EINVAL; 444 372 445 - if (luo_flb_global.count == LUO_FLB_MAX) { 446 - err = -ENOSPC; 447 - goto err_resume; 448 - } 373 + if (luo_flb_global.count == LUO_FLB_MAX) 374 + return -ENOSPC; 449 375 450 376 /* Check that compatible string is unique in global list */ 451 377 list_private_for_each_entry(gflb, &luo_flb_global.list, private.list) { 452 378 if (!strcmp(gflb->compatible, flb->compatible)) 453 - goto err_resume; 454 - } 455 - 456 - if (!try_module_get(flb->ops->owner)) { 457 - err = -EAGAIN; 458 - goto err_resume; 379 + return -EEXIST; 459 380 } 460 381 461 382 list_add_tail(&private->list, &luo_flb_global.list); ··· 457 396 private->users++; 458 397 link->flb = flb; 459 398 list_add_tail(&no_free_ptr(link)->list, flb_list); 460 - luo_session_resume(); 461 399 462 400 return 0; 463 - 464 - err_resume: 465 - luo_session_resume(); 466 - return err; 467 401 } 468 402 469 403 /** ··· 474 418 * the FLB is removed from the global registry and the reference to its 475 419 * owner module (acquired during registration) is released. 476 420 * 477 - * Context: This function ensures the session is quiesced (no active FDs 478 - * being created) during the update. It is typically called from a 479 - * subsystem's module exit function. 480 - * Return: 0 on success. 481 - * -EOPNOTSUPP if live update is disabled. 
482 - * -EBUSY if the live update session is active and cannot be quiesced. 483 - * -ENOENT if the FLB was not found in the file handler's list. 421 + * Context: It is typically called from a subsystem's module exit function. 484 422 */ 485 - int liveupdate_unregister_flb(struct liveupdate_file_handler *fh, 486 - struct liveupdate_flb *flb) 423 + void liveupdate_unregister_flb(struct liveupdate_file_handler *fh, 424 + struct liveupdate_flb *flb) 487 425 { 488 - struct luo_flb_private *private = luo_flb_get_private(flb); 489 - struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); 490 - struct luo_flb_link *iter; 491 - int err = -ENOENT; 492 - 493 426 if (!liveupdate_enabled()) 494 - return -EOPNOTSUPP; 427 + return; 495 428 496 - /* 497 - * Ensure the system is quiescent (no active sessions). 498 - * This acts as a global lock for unregistration. 499 - */ 500 - if (!luo_session_quiesce()) 501 - return -EBUSY; 429 + guard(rwsem_write)(&luo_register_rwlock); 502 430 503 - /* Find and remove the link from the file handler's list */ 504 - list_for_each_entry(iter, flb_list, list) { 505 - if (iter->flb == flb) { 506 - list_del(&iter->list); 507 - kfree(iter); 508 - err = 0; 509 - break; 510 - } 511 - } 512 - 513 - if (err) 514 - goto err_resume; 515 - 516 - private->users--; 517 - /* 518 - * If this is the last file-handler with which we are registred, remove 519 - * from the global list, and relese module reference. 520 - */ 521 - if (!private->users) { 522 - list_del_init(&private->list); 523 - luo_flb_global.count--; 524 - module_put(flb->ops->owner); 525 - } 526 - 527 - luo_session_resume(); 528 - 529 - return 0; 530 - 531 - err_resume: 532 - luo_session_resume(); 533 - return err; 431 + luo_flb_unregister_one(fh, flb); 534 432 } 535 433 536 434 /** ··· 502 492 * 503 493 * Return: 0 on success, or a negative errno on failure. 
-ENODATA means no 504 494 * incoming FLB data, -ENOENT means specific flb not found in the incoming 505 - * data, and -EOPNOTSUPP when live update is disabled or not configured. 495 + * data, -ENODEV if the FLB's module is unloading, and -EOPNOTSUPP when 496 + * live update is disabled or not configured. 506 497 */ 507 498 int liveupdate_flb_get_incoming(struct liveupdate_flb *flb, void **objp) 508 499 { ··· 649 638 struct liveupdate_flb *gflb; 650 639 int i = 0; 651 640 641 + guard(rwsem_read)(&luo_register_rwlock); 652 642 list_private_for_each_entry(gflb, &luo_flb_global.list, private.list) { 653 643 struct luo_flb_private *private = luo_flb_get_private(gflb); 654 644
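Reviewer note on `luo_flb_get_private()` above: the change turns a plain "if not initialized, initialize" check into double-checked initialization, with an acquire load on the fast path, a lock around the slow path, and a release store that publishes the flag only after the fields are set up. Below is a userspace sketch of the same pattern using C11 atomics. It is an illustration under assumptions, not the kernel code: `lazy_state`, `lazy_get` and the `init_runs` counter are hypothetical, and an `atomic_flag` spin loop stands in for the kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct lazy_state {
	atomic_bool initialized;
	int users;
	int init_runs;	/* hypothetical counter, only to observe the sketch */
};

/* Stand-in for the static spinlock that serializes the slow path. */
static atomic_flag lazy_init_lock = ATOMIC_FLAG_INIT;

static struct lazy_state *lazy_get(struct lazy_state *s)
{
	/* Fast path: the acquire load pairs with the release store below. */
	if (atomic_load_explicit(&s->initialized, memory_order_acquire))
		return s;

	while (atomic_flag_test_and_set_explicit(&lazy_init_lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */

	/* Re-check under the lock: another thread may have won the race. */
	if (!atomic_load_explicit(&s->initialized, memory_order_relaxed)) {
		s->users = 0;
		s->init_runs++;
		/* Publish only after every field above has been written. */
		atomic_store_explicit(&s->initialized, true,
				      memory_order_release);
	}
	atomic_flag_clear_explicit(&lazy_init_lock, memory_order_release);

	assert(atomic_load(&s->initialized));
	return s;
}
```

The release/acquire pair is what lets the fast path skip the lock: a reader that observes `initialized == true` is also guaranteed to observe the initialized fields.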
+3 -4
kernel/liveupdate/luo_internal.h
···
 	struct mutex mutex;
 };
 
+extern struct rw_semaphore luo_register_rwlock;
+
 int luo_session_create(const char *name, struct file **filep);
 int luo_session_retrieve(const char *name, struct file **filep);
 int __init luo_session_setup_outgoing(void *fdt);
 int __init luo_session_setup_incoming(void *fdt);
 int luo_session_serialize(void);
 int luo_session_deserialize(void);
-bool luo_session_quiesce(void);
-void luo_session_resume(void);
 
 int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd);
 void luo_file_unpreserve_files(struct luo_file_set *file_set);
···
 int luo_flb_file_preserve(struct liveupdate_file_handler *fh);
 void luo_flb_file_unpreserve(struct liveupdate_file_handler *fh);
 void luo_flb_file_finish(struct liveupdate_file_handler *fh);
+void luo_flb_unregister_all(struct liveupdate_file_handler *fh);
 int __init luo_flb_setup_outgoing(void *fdt);
 int __init luo_flb_setup_incoming(void *fdt);
 void luo_flb_serialize(void);
 
 #ifdef CONFIG_LIVEUPDATE_TEST
 void liveupdate_test_register(struct liveupdate_file_handler *fh);
-void liveupdate_test_unregister(struct liveupdate_file_handler *fh);
 #else
 static inline void liveupdate_test_register(struct liveupdate_file_handler *fh) { }
-static inline void liveupdate_test_unregister(struct liveupdate_file_handler *fh) { }
 #endif
 
 #endif /* _LINUX_LUO_INTERNAL_H */
+2 -44
kernel/liveupdate/luo_session.c
···
 
 		session = luo_session_alloc(sh->ser[i].name);
 		if (IS_ERR(session)) {
-			pr_warn("Failed to allocate session [%s] during deserialization %pe\n",
+			pr_warn("Failed to allocate session [%.*s] during deserialization %pe\n",
+				(int)sizeof(sh->ser[i].name),
 				sh->ser[i].name, session);
 			return PTR_ERR(session);
 		}
···
 	return err;
 }
 
-/**
- * luo_session_quiesce - Ensure no active sessions exist and lock session lists.
- *
- * Acquires exclusive write locks on both incoming and outgoing session lists.
- * It then validates no sessions exist in either list.
- *
- * This mechanism is used during file handler un/registration to ensure that no
- * sessions are currently using the handler, and no new sessions can be created
- * while un/registration is in progress.
- *
- * This prevents registering new handlers while sessions are active or
- * while deserialization is in progress.
- *
- * Return:
- * true - System is quiescent (0 sessions) and locked.
- * false - Active sessions exist. The locks are released internally.
- */
-bool luo_session_quiesce(void)
-{
-	down_write(&luo_session_global.incoming.rwsem);
-	down_write(&luo_session_global.outgoing.rwsem);
-
-	if (luo_session_global.incoming.count ||
-	    luo_session_global.outgoing.count) {
-		up_write(&luo_session_global.outgoing.rwsem);
-		up_write(&luo_session_global.incoming.rwsem);
-		return false;
-	}
-
-	return true;
-}
-
-/**
- * luo_session_resume - Unlock session lists and resume normal activity.
- *
- * Releases the exclusive locks acquired by a successful call to
- * luo_session_quiesce().
- */
-void luo_session_resume(void)
-{
-	up_write(&luo_session_global.outgoing.rwsem);
-	up_write(&luo_session_global.incoming.rwsem);
-}
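Reviewer note on the `pr_warn()` change in luo_session.c above: switching from `%s` to `%.*s` with `(int)sizeof(...)` as the precision bounds the read to the size of the fixed field. The commit message shown here does not state the motivation. A plausible reading, offered as an assumption, is that a serialized name filling the entire field carries no NUL terminator. The sketch below shows the same formatting technique with standard `snprintf()`. `ser_entry` and `format_name` are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical fixed-width record, like a serialized name field. */
struct ser_entry {
	char name[8];	/* may be completely full, with no trailing NUL */
};

/*
 * Format the name without reading past the end of the field. The precision
 * caps how many bytes "%.*s" may consume; a NUL inside the field still ends
 * the string early.
 */
static int format_name(char *out, size_t outlen, const struct ser_entry *e)
{
	assert(outlen > 0);
	return snprintf(out, outlen, "[%.*s]", (int)sizeof(e->name), e->name);
}
```

A plain `%s` would keep reading beyond `name[7]` when the field is full, which is undefined behavior in C.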
+109
lib/alloc_tag.c
··· 6 6 #include <linux/kallsyms.h> 7 7 #include <linux/module.h> 8 8 #include <linux/page_ext.h> 9 + #include <linux/pgalloc_tag.h> 9 10 #include <linux/proc_fs.h> 11 + #include <linux/rcupdate.h> 10 12 #include <linux/seq_buf.h> 11 13 #include <linux/seq_file.h> 12 14 #include <linux/string_choices.h> ··· 760 758 return mem_profiling_support; 761 759 } 762 760 761 + #ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG 762 + /* 763 + * Track page allocations before page_ext is initialized. 764 + * Some pages are allocated before page_ext becomes available, leaving 765 + * their codetag uninitialized. Track these early PFNs so we can clear 766 + * their codetag refs later to avoid warnings when they are freed. 767 + * 768 + * Early allocations include: 769 + * - Base allocations independent of CPU count 770 + * - Per-CPU allocations (e.g., CPU hotplug callbacks during smp_init, 771 + * such as trace ring buffers, scheduler per-cpu data) 772 + * 773 + * For simplicity, we fix the size to 8192. 774 + * If insufficient, a warning will be triggered to alert the user. 775 + * 776 + * TODO: Replace fixed-size array with dynamic allocation using 777 + * a GFP flag similar to ___GFP_NO_OBJ_EXT to avoid recursion. 
778 + */ 779 + #define EARLY_ALLOC_PFN_MAX 8192 780 + 781 + static unsigned long early_pfns[EARLY_ALLOC_PFN_MAX] __initdata; 782 + static atomic_t early_pfn_count __initdata = ATOMIC_INIT(0); 783 + 784 + static void __init __alloc_tag_add_early_pfn(unsigned long pfn) 785 + { 786 + int old_idx, new_idx; 787 + 788 + do { 789 + old_idx = atomic_read(&early_pfn_count); 790 + if (old_idx >= EARLY_ALLOC_PFN_MAX) { 791 + pr_warn_once("Early page allocations before page_ext init exceeded EARLY_ALLOC_PFN_MAX (%d)\n", 792 + EARLY_ALLOC_PFN_MAX); 793 + return; 794 + } 795 + new_idx = old_idx + 1; 796 + } while (!atomic_try_cmpxchg(&early_pfn_count, &old_idx, new_idx)); 797 + 798 + early_pfns[old_idx] = pfn; 799 + } 800 + 801 + typedef void alloc_tag_add_func(unsigned long pfn); 802 + static alloc_tag_add_func __rcu *alloc_tag_add_early_pfn_ptr __refdata = 803 + RCU_INITIALIZER(__alloc_tag_add_early_pfn); 804 + 805 + void alloc_tag_add_early_pfn(unsigned long pfn) 806 + { 807 + alloc_tag_add_func *alloc_tag_add; 808 + 809 + if (static_key_enabled(&mem_profiling_compressed)) 810 + return; 811 + 812 + rcu_read_lock(); 813 + alloc_tag_add = rcu_dereference(alloc_tag_add_early_pfn_ptr); 814 + if (alloc_tag_add) 815 + alloc_tag_add(pfn); 816 + rcu_read_unlock(); 817 + } 818 + 819 + static void __init clear_early_alloc_pfn_tag_refs(void) 820 + { 821 + unsigned int i; 822 + 823 + if (static_key_enabled(&mem_profiling_compressed)) 824 + return; 825 + 826 + rcu_assign_pointer(alloc_tag_add_early_pfn_ptr, NULL); 827 + /* Make sure we are not racing with __alloc_tag_add_early_pfn() */ 828 + synchronize_rcu(); 829 + 830 + for (i = 0; i < atomic_read(&early_pfn_count); i++) { 831 + unsigned long pfn = early_pfns[i]; 832 + 833 + if (pfn_valid(pfn)) { 834 + struct page *page = pfn_to_page(pfn); 835 + union pgtag_ref_handle handle; 836 + union codetag_ref ref; 837 + 838 + if (get_page_tag_ref(page, &ref, &handle)) { 839 + /* 840 + * An early-allocated page could be freed and reallocated 841 + 
* after its page_ext is initialized but before we clear it. 842 + * In that case, it already has a valid tag set. 843 + * We should not overwrite that valid tag with CODETAG_EMPTY. 844 + * 845 + * Note: there is still a small race window between checking 846 + * ref.ct and calling set_codetag_empty(). We accept this 847 + * race as it's unlikely and the extra complexity of atomic 848 + * cmpxchg is not worth it for this debug-only code path. 849 + */ 850 + if (ref.ct) { 851 + put_page_tag_ref(handle); 852 + continue; 853 + } 854 + 855 + set_codetag_empty(&ref); 856 + update_page_tag_ref(handle, &ref); 857 + put_page_tag_ref(handle); 858 + } 859 + } 860 + 861 + } 862 + } 863 + #else /* !CONFIG_MEM_ALLOC_PROFILING_DEBUG */ 864 + static inline void __init clear_early_alloc_pfn_tag_refs(void) {} 865 + #endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */ 866 + 763 867 static __init void init_page_alloc_tagging(void) 764 868 { 869 + clear_early_alloc_pfn_tag_refs(); 765 870 } 766 871 767 872 struct page_ext_operations page_alloc_tagging_ops = {
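Reviewer note on `__alloc_tag_add_early_pfn()` above: the function reserves a slot in a fixed-size array with a compare-exchange loop and only then writes the value, so concurrent callers never share an index and the counter never exceeds the capacity. The following is a userspace sketch of that reservation step with C11 atomics. `slots`, `slot_count`, `slot_append` and the capacity of 4 are hypothetical; the patch uses 8192 entries and kernel atomics. The reader side, which the patch orders with `synchronize_rcu()`, is not modeled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SLOT_MAX 4	/* hypothetical capacity for the sketch */

static unsigned long slots[SLOT_MAX];
static atomic_int slot_count;

/*
 * Reserve the next free index, then fill it.
 * Returns false once the table is full.
 */
static bool slot_append(unsigned long value)
{
	int old_idx = atomic_load(&slot_count);

	do {
		if (old_idx >= SLOT_MAX)
			return false;
		/* On failure the compare-exchange reloads old_idx for us. */
	} while (!atomic_compare_exchange_weak(&slot_count, &old_idx,
					       old_idx + 1));

	/* The reserved index belongs to this caller alone. */
	assert(old_idx >= 0 && old_idx < SLOT_MAX);
	slots[old_idx] = value;
	return true;
}
```

Checking the bound inside the loop, before each exchange attempt, is what keeps the counter from overshooting when several callers race for the last slot.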
+77 -53
lib/test_hmm.c
··· 185 185 return 0; 186 186 } 187 187 188 + static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk) 189 + { 190 + unsigned long start_pfn = chunk->pagemap.range.start >> PAGE_SHIFT; 191 + unsigned long end_pfn = chunk->pagemap.range.end >> PAGE_SHIFT; 192 + unsigned long npages = end_pfn - start_pfn + 1; 193 + unsigned long i; 194 + unsigned long *src_pfns; 195 + unsigned long *dst_pfns; 196 + unsigned int order = 0; 197 + 198 + src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL); 199 + dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL); 200 + 201 + migrate_device_range(src_pfns, start_pfn, npages); 202 + for (i = 0; i < npages; i++) { 203 + struct page *dpage, *spage; 204 + 205 + spage = migrate_pfn_to_page(src_pfns[i]); 206 + if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE)) 207 + continue; 208 + 209 + if (WARN_ON(!is_device_private_page(spage) && 210 + !is_device_coherent_page(spage))) 211 + continue; 212 + 213 + order = folio_order(page_folio(spage)); 214 + spage = BACKING_PAGE(spage); 215 + if (src_pfns[i] & MIGRATE_PFN_COMPOUND) { 216 + dpage = folio_page(folio_alloc(GFP_HIGHUSER_MOVABLE, 217 + order), 0); 218 + } else { 219 + dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL); 220 + order = 0; 221 + } 222 + 223 + /* TODO Support splitting here */ 224 + lock_page(dpage); 225 + dst_pfns[i] = migrate_pfn(page_to_pfn(dpage)); 226 + if (src_pfns[i] & MIGRATE_PFN_WRITE) 227 + dst_pfns[i] |= MIGRATE_PFN_WRITE; 228 + if (order) 229 + dst_pfns[i] |= MIGRATE_PFN_COMPOUND; 230 + folio_copy(page_folio(dpage), page_folio(spage)); 231 + } 232 + migrate_device_pages(src_pfns, dst_pfns, npages); 233 + migrate_device_finalize(src_pfns, dst_pfns, npages); 234 + kvfree(src_pfns); 235 + kvfree(dst_pfns); 236 + } 237 + 188 238 static int dmirror_fops_release(struct inode *inode, struct file *filp) 189 239 { 190 240 struct dmirror *dmirror = filp->private_data; 241 + struct dmirror_device *mdevice = 
dmirror->mdevice; 242 + int i; 191 243 192 244 mmu_interval_notifier_remove(&dmirror->notifier); 245 + 246 + if (mdevice->devmem_chunks) { 247 + for (i = 0; i < mdevice->devmem_count; i++) { 248 + struct dmirror_chunk *devmem = 249 + mdevice->devmem_chunks[i]; 250 + 251 + dmirror_device_evict_chunk(devmem); 252 + } 253 + } 254 + 193 255 xa_destroy(&dmirror->pt); 194 256 kfree(dmirror); 195 257 return 0; ··· 1439 1377 return ret; 1440 1378 } 1441 1379 1442 - static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk) 1443 - { 1444 - unsigned long start_pfn = chunk->pagemap.range.start >> PAGE_SHIFT; 1445 - unsigned long end_pfn = chunk->pagemap.range.end >> PAGE_SHIFT; 1446 - unsigned long npages = end_pfn - start_pfn + 1; 1447 - unsigned long i; 1448 - unsigned long *src_pfns; 1449 - unsigned long *dst_pfns; 1450 - unsigned int order = 0; 1451 - 1452 - src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL); 1453 - dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL); 1454 - 1455 - migrate_device_range(src_pfns, start_pfn, npages); 1456 - for (i = 0; i < npages; i++) { 1457 - struct page *dpage, *spage; 1458 - 1459 - spage = migrate_pfn_to_page(src_pfns[i]); 1460 - if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE)) 1461 - continue; 1462 - 1463 - if (WARN_ON(!is_device_private_page(spage) && 1464 - !is_device_coherent_page(spage))) 1465 - continue; 1466 - 1467 - order = folio_order(page_folio(spage)); 1468 - spage = BACKING_PAGE(spage); 1469 - if (src_pfns[i] & MIGRATE_PFN_COMPOUND) { 1470 - dpage = folio_page(folio_alloc(GFP_HIGHUSER_MOVABLE, 1471 - order), 0); 1472 - } else { 1473 - dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL); 1474 - order = 0; 1475 - } 1476 - 1477 - /* TODO Support splitting here */ 1478 - lock_page(dpage); 1479 - dst_pfns[i] = migrate_pfn(page_to_pfn(dpage)); 1480 - if (src_pfns[i] & MIGRATE_PFN_WRITE) 1481 - dst_pfns[i] |= MIGRATE_PFN_WRITE; 1482 - if (order) 1483 - dst_pfns[i] |= 
MIGRATE_PFN_COMPOUND; 1484 - folio_copy(page_folio(dpage), page_folio(spage)); 1485 - } 1486 - migrate_device_pages(src_pfns, dst_pfns, npages); 1487 - migrate_device_finalize(src_pfns, dst_pfns, npages); 1488 - kvfree(src_pfns); 1489 - kvfree(dst_pfns); 1490 - } 1491 - 1492 1380 /* Removes free pages from the free list so they can't be re-allocated */ 1493 1381 static void dmirror_remove_free_pages(struct dmirror_chunk *devmem) 1494 1382 { ··· 1738 1726 .folio_split = dmirror_devmem_folio_split, 1739 1727 }; 1740 1728 1729 + static void dmirror_device_release(struct device *dev) 1730 + { 1731 + struct dmirror_device *mdevice = container_of(dev, struct dmirror_device, device); 1732 + 1733 + dmirror_device_remove_chunks(mdevice); 1734 + } 1735 + 1741 1736 static int dmirror_device_init(struct dmirror_device *mdevice, int id) 1742 1737 { 1743 1738 dev_t dev; ··· 1756 1737 1757 1738 cdev_init(&mdevice->cdevice, &dmirror_fops); 1758 1739 mdevice->cdevice.owner = THIS_MODULE; 1740 + mdevice->device.release = dmirror_device_release; 1741 + 1759 1742 device_initialize(&mdevice->device); 1760 1743 mdevice->device.devt = dev; 1761 1744 ··· 1765 1744 if (ret) 1766 1745 goto put_device; 1767 1746 1747 + /* Build a list of free ZONE_DEVICE struct pages */ 1748 + ret = dmirror_allocate_chunk(mdevice, NULL, false); 1749 + if (ret) 1750 + goto put_device; 1751 + 1768 1752 ret = cdev_device_add(&mdevice->cdevice, &mdevice->device); 1769 1753 if (ret) 1770 1754 goto put_device; 1771 1755 1772 - /* Build a list of free ZONE_DEVICE struct pages */ 1773 - return dmirror_allocate_chunk(mdevice, NULL, false); 1756 + return 0; 1774 1757 1775 1758 put_device: 1776 1759 put_device(&mdevice->device); ··· 1783 1758 1784 1759 static void dmirror_device_remove(struct dmirror_device *mdevice) 1785 1760 { 1786 - dmirror_device_remove_chunks(mdevice); 1787 1761 cdev_device_del(&mdevice->cdevice, &mdevice->device); 1788 1762 put_device(&mdevice->device); 1789 1763 }
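Reviewer note on `dmirror_device_evict_chunk()` above: the page count is computed as `end_pfn - start_pfn + 1`, which treats the range end as inclusive. A small worked example of that arithmetic follows. `range_npages` is a hypothetical helper and the 4 KiB page shift is an assumption made for the example.

```c
#include <assert.h>

/*
 * Number of pages covered by [start, end_inclusive], where both bounds are
 * byte addresses and end_inclusive is the last byte of the range.
 */
static unsigned long range_npages(unsigned long start,
				  unsigned long end_inclusive,
				  unsigned int page_shift)
{
	assert(end_inclusive >= start);
	return (end_inclusive >> page_shift) - (start >> page_shift) + 1;
}
```

For a 1 MiB range from 0x100000 to 0x1fffff with 4 KiB pages, the first PFN is 0x100 and the last is 0x1ff, so the helper yields 0x1ff - 0x100 + 1 = 256 pages. Dropping the `+ 1` would miss the final page.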
+3 -2
lib/test_kho.c
···
 	if (err)
 		goto err_unpreserve_data;
 
-	err = kho_add_subtree(KHO_TEST_FDT, folio_address(state->fdt));
+	err = kho_add_subtree(KHO_TEST_FDT, folio_address(state->fdt),
+			      fdt_totalsize(folio_address(state->fdt)));
 	if (err)
 		goto err_unpreserve_data;
 
···
 	if (!kho_is_enabled())
 		return 0;
 
-	err = kho_retrieve_subtree(KHO_TEST_FDT, &fdt_phys);
+	err = kho_retrieve_subtree(KHO_TEST_FDT, &fdt_phys, NULL);
 	if (!err) {
 		err = kho_test_restore(fdt_phys);
 		if (err)
-18
lib/tests/liveupdate.c
···
 		TEST_NFLBS, fh->compatible);
 }
 
-void liveupdate_test_unregister(struct liveupdate_file_handler *fh)
-{
-	int err, i;
-
-	for (i = 0; i < TEST_NFLBS; i++) {
-		struct liveupdate_flb *flb = &test_flbs[i];
-
-		err = liveupdate_unregister_flb(fh, flb);
-		if (err) {
-			pr_err("Failed to unregister %s %pe\n",
-			       flb->compatible, ERR_PTR(err));
-		}
-	}
-
-	pr_info("Unregistered %d FLBs from file handler: [%s]\n",
-		TEST_NFLBS, fh->compatible);
-}
-
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Pasha Tatashin <pasha.tatashin@soleen.com>");
 MODULE_DESCRIPTION("In-kernel test for LUO mechanism");
+11
mm/Kconfig.debug
···
 
 	  If unsure, say Y.
 
+config DEBUG_KMEMLEAK_VERBOSE
+	bool "Default kmemleak to verbose mode"
+	depends on DEBUG_KMEMLEAK_AUTO_SCAN
+	help
+	  Say Y here to have kmemleak print unreferenced object details
+	  (backtrace, hex dump, address) to dmesg when new memory leaks are
+	  detected during automatic scanning. This can also be toggled at
+	  runtime via /sys/module/kmemleak/parameters/verbose.
+
+	  If unsure, say N.
+
 config PER_VMA_LOCK_STATS
 	bool "Statistics for per-vma locks"
 	depends on PER_VMA_LOCK
+30 -13
mm/compaction.c
···
 	return true;
 }
 
+static struct lruvec *
+compact_folio_lruvec_lock_irqsave(struct folio *folio, unsigned long *flags,
+				  struct compact_control *cc)
+{
+	struct lruvec *lruvec;
+
+	rcu_read_lock();
+retry:
+	lruvec = folio_lruvec(folio);
+	compact_lock_irqsave(&lruvec->lru_lock, flags, cc);
+	if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+		spin_unlock_irqrestore(&lruvec->lru_lock, *flags);
+		goto retry;
+	}
+
+	return lruvec;
+}
+
 /*
  * Compaction requires the taking of some coarse locks that are potentially
  * very heavily contended. The lock should be periodically unlocked to avoid
···
 {
 	pg_data_t *pgdat = cc->zone->zone_pgdat;
 	unsigned long nr_scanned = 0, nr_isolated = 0;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 	struct lruvec *locked = NULL;
 	struct folio *folio = NULL;
···
 		 */
 		if (!(low_pfn % COMPACT_CLUSTER_MAX)) {
 			if (locked) {
-				unlock_page_lruvec_irqrestore(locked, flags);
+				lruvec_unlock_irqrestore(locked, flags);
 				locked = NULL;
 			}
 
···
 				}
 				/* for alloc_contig case */
 				if (locked) {
-					unlock_page_lruvec_irqrestore(locked, flags);
+					lruvec_unlock_irqrestore(locked, flags);
 					locked = NULL;
 				}
 
···
 		if (unlikely(page_has_movable_ops(page)) &&
 					!PageMovableOpsIsolated(page)) {
 			if (locked) {
-				unlock_page_lruvec_irqrestore(locked, flags);
+				lruvec_unlock_irqrestore(locked, flags);
 				locked = NULL;
 			}
 
···
 		if (!folio_test_clear_lru(folio))
 			goto isolate_fail_put;
 
-		lruvec = folio_lruvec(folio);
+		if (locked)
+			lruvec = folio_lruvec(folio);
 
 		/* If we already hold the lock, we can skip some rechecking */
-		if (lruvec != locked) {
+		if (lruvec != locked || !locked) {
 			if (locked)
-				unlock_page_lruvec_irqrestore(locked, flags);
+				lruvec_unlock_irqrestore(locked, flags);
 
-			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+			lruvec = compact_folio_lruvec_lock_irqsave(folio, &flags, cc);
 			locked = lruvec;
-
-			lruvec_memcg_debug(lruvec, folio);
 
 			/*
 			 * Try get exclusive access under lock. If marked for
···
 isolate_fail_put:
 		/* Avoid potential deadlock in freeing page under lru_lock */
 		if (locked) {
-			unlock_page_lruvec_irqrestore(locked, flags);
+			lruvec_unlock_irqrestore(locked, flags);
 			locked = NULL;
 		}
 		folio_put(folio);
···
 	 */
 	if (nr_isolated) {
 		if (locked) {
-			unlock_page_lruvec_irqrestore(locked, flags);
+			lruvec_unlock_irqrestore(locked, flags);
 			locked = NULL;
 		}
 		putback_movable_pages(&cc->migratepages);
···
 
 isolate_abort:
 	if (locked)
-		unlock_page_lruvec_irqrestore(locked, flags);
+		lruvec_unlock_irqrestore(locked, flags);
 	if (folio) {
 		folio_set_lru(folio);
 		folio_put(folio);
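The new compact_folio_lruvec_lock_irqsave() helper in mm/compaction.c looks up the folio's lruvec, takes its lock, and then re-checks that the folio still belongs to that lruvec, retrying if the binding changed in between. The following is a minimal userspace sketch of that lock-then-recheck idea, not kernel code. The names `struct owner`, `struct item`, `item_owner_lock()` and `item_reparent()` are hypothetical stand-ins, a C11 `atomic_flag` spinlock replaces `lru_lock`, and object lifetime (which the kernel handles with RCU) is ignored because the owners here are never freed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for an lruvec: a spinlock plus an identifier. */
struct owner {
	atomic_flag lock;
	int id;
};

/* Hypothetical stand-in for a folio: it points at its current owner. */
struct item {
	_Atomic(struct owner *) owner;	/* may be switched to a parent concurrently */
};

static void owner_init(struct owner *o, int id)
{
	atomic_flag_clear(&o->lock);
	o->id = id;
}

static void item_init(struct item *it, struct owner *o)
{
	atomic_init(&it->owner, o);
}

static void owner_lock(struct owner *o)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&o->lock, memory_order_acquire))
		;
}

static void owner_unlock(struct owner *o)
{
	atomic_flag_clear_explicit(&o->lock, memory_order_release);
}

/*
 * Lock the owner that @it currently belongs to. The owner can change
 * between the lookup and the lock acquisition, so re-check the binding
 * once the lock is held and retry on a mismatch.
 */
static struct owner *item_owner_lock(struct item *it)
{
	struct owner *o;

	for (;;) {
		o = atomic_load_explicit(&it->owner, memory_order_acquire);
		owner_lock(o);
		if (o == atomic_load_explicit(&it->owner, memory_order_acquire))
			return o;	/* binding is stable while the lock is held */
		owner_unlock(o);
	}
}

/* Move @it to @parent while holding both locks (child first, then parent). */
static void item_reparent(struct item *it, struct owner *parent)
{
	struct owner *child = item_owner_lock(it);

	assert(child != parent);	/* locking the same owner twice would spin forever */
	owner_lock(parent);
	atomic_store_explicit(&it->owner, parent, memory_order_release);
	owner_unlock(parent);
	owner_unlock(child);
}
```

Because item_reparent() only changes the binding while it holds the old owner's lock, a caller that passes the re-check in item_owner_lock() knows the binding cannot change until it unlocks.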
+49 -39
mm/damon/core.c
···
 	return pid;
 }
 
-/*
- * damon_call_handle_inactive_ctx() - handle DAMON call request that added to
- *	an inactive context.
- * @ctx: The inactive DAMON context.
- * @control: Control variable of the call request.
- *
- * This function is called in a case that @control is added to @ctx but @ctx is
- * not running (inactive). See if @ctx handled @control or not, and cleanup
- * @control if it was not handled.
- *
- * Returns 0 if @control was handled by @ctx, negative error code otherwise.
- */
-static int damon_call_handle_inactive_ctx(
-		struct damon_ctx *ctx, struct damon_call_control *control)
-{
-	struct damon_call_control *c;
-
-	mutex_lock(&ctx->call_controls_lock);
-	list_for_each_entry(c, &ctx->call_controls, list) {
-		if (c == control) {
-			list_del(&control->list);
-			mutex_unlock(&ctx->call_controls_lock);
-			return -EINVAL;
-		}
-	}
-	mutex_unlock(&ctx->call_controls_lock);
-	return 0;
-}
-
 /**
  * damon_call() - Invoke a given function on DAMON worker thread (kdamond).
  * @ctx: DAMON context to call the function for.
···
  * synchronization. The return value of the function will be saved in
  * &damon_call_control->return_code.
  *
+ * Note that this function should be called only after damon_start() with the
+ * @ctx has succeeded. Otherwise, this function could fall into an indefinite
+ * wait.
+ *
  * Return: 0 on success, negative error code otherwise.
  */
 int damon_call(struct damon_ctx *ctx, struct damon_call_control *control)
···
 	INIT_LIST_HEAD(&control->list);
 
 	mutex_lock(&ctx->call_controls_lock);
+	if (ctx->call_controls_obsolete) {
+		mutex_unlock(&ctx->call_controls_lock);
+		return -ECANCELED;
+	}
 	list_add_tail(&control->list, &ctx->call_controls);
 	mutex_unlock(&ctx->call_controls_lock);
-	if (!damon_is_running(ctx))
-		return damon_call_handle_inactive_ctx(ctx, control);
 	if (control->repeat)
 		return 0;
 	wait_for_completion(&control->completion);
···
  * passed at least one &damos->apply_interval_us, kdamond marks the request as
  * completed so that damos_walk() can wakeup and return.
  *
+ * Note that this function should be called only after damon_start() with the
+ * @ctx has succeeded. Otherwise, this function could fall into an indefinite
+ * wait.
+ *
  * Return: 0 on success, negative error code otherwise.
  */
 int damos_walk(struct damon_ctx *ctx, struct damos_walk_control *control)
···
 	init_completion(&control->completion);
 	control->canceled = false;
 	mutex_lock(&ctx->walk_control_lock);
+	if (ctx->walk_control_obsolete) {
+		mutex_unlock(&ctx->walk_control_lock);
+		return -ECANCELED;
+	}
 	if (ctx->walk_control) {
 		mutex_unlock(&ctx->walk_control_lock);
 		return -EBUSY;
 	}
 	ctx->walk_control = control;
 	mutex_unlock(&ctx->walk_control_lock);
-	if (!damon_is_running(ctx)) {
-		mutex_lock(&ctx->walk_control_lock);
-		if (ctx->walk_control == control)
-			ctx->walk_control = NULL;
-		mutex_unlock(&ctx->walk_control_lock);
-		return -EINVAL;
-	}
 	wait_for_completion(&control->completion);
 	if (control->canceled)
 		return -ECANCELED;
···
 #endif /* CONFIG_PSI */
 
 #ifdef CONFIG_NUMA
+static bool invalid_mem_node(int nid)
+{
+	return nid < 0 || nid >= MAX_NUMNODES || !node_state(nid, N_MEMORY);
+}
+
 static __kernel_ulong_t damos_get_node_mem_bp(
 		struct damos_quota_goal *goal)
 {
 	struct sysinfo i;
 	__kernel_ulong_t numerator;
+
+	if (invalid_mem_node(goal->nid)) {
+		if (goal->metric == DAMOS_QUOTA_NODE_MEM_USED_BP)
+			return 0;
+		else /* DAMOS_QUOTA_NODE_MEM_FREE_BP */
+			return 10000;
+	}
 
 	si_meminfo_node(&i, goal->nid);
 	if (goal->metric == DAMOS_QUOTA_NODE_MEM_USED_BP)
···
 	struct lruvec *lruvec;
 	unsigned long used_pages, numerator;
 	struct sysinfo i;
+
+	if (invalid_mem_node(goal->nid)) {
+		if (goal->metric == DAMOS_QUOTA_NODE_MEMCG_USED_BP)
+			return 0;
+		else /* DAMOS_QUOTA_NODE_MEMCG_FREE_BP */
+			return 10000;
+	}
 
 	memcg = mem_cgroup_get_from_id(goal->memcg_id);
 	if (!memcg) {
···
 	}
 
 	/* New charge window starts */
-	if (time_after_eq(jiffies, quota->charged_from +
+	if (!time_in_range_open(jiffies, quota->charged_from,
+				quota->charged_from +
 				msecs_to_jiffies(quota->reset_interval))) {
 		if (damos_quota_is_set(quota) &&
 				quota->charged_sz >= quota->esz)
···
 
 	pr_debug("kdamond (%d) starts\n", current->pid);
 
+	mutex_lock(&ctx->call_controls_lock);
+	ctx->call_controls_obsolete = false;
+	mutex_unlock(&ctx->call_controls_lock);
+	mutex_lock(&ctx->walk_control_lock);
+	ctx->walk_control_obsolete = false;
+	mutex_unlock(&ctx->walk_control_lock);
 	complete(&ctx->kdamond_started);
 	kdamond_init_ctx(ctx);
 
···
 	damon_destroy_targets(ctx);
 
 	kfree(ctx->regions_score_histogram);
+	mutex_lock(&ctx->call_controls_lock);
+	ctx->call_controls_obsolete = true;
+	mutex_unlock(&ctx->call_controls_lock);
 	kdamond_call(ctx, true);
+	mutex_lock(&ctx->walk_control_lock);
+	ctx->walk_control_obsolete = true;
+	mutex_unlock(&ctx->walk_control_lock);
 	damos_walk_cancel(ctx);
 
 	pr_debug("kdamond (%d) finishes\n", current->pid);
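The quota hunk in mm/damon/core.c replaces a `time_after_eq()` check with `!time_in_range_open()`. The diff itself does not state the motivation, so the following is only a sketch of one case where the two checks can differ: when the window start is so far behind the current tick count that the signed difference wraps. The macros are simplified userspace copies of the jiffies helpers (the kernel versions add type checks), and `window_expired_old()`, `window_expired_new()` and `check_examples()` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Simplified userspace copies of the jiffies comparison helpers. */
#define time_after(a, b)		((long)((b) - (a)) < 0)
#define time_before(a, b)		time_after(b, a)
#define time_after_eq(a, b)		((long)((a) - (b)) >= 0)
/* True if a is in the half-open range [b, c). */
#define time_in_range_open(a, b, c)	(time_after_eq(a, b) && time_before(a, c))

/* Old form of the check: the window ends once now >= from + interval. */
static bool window_expired_old(unsigned long now, unsigned long from,
			       unsigned long interval)
{
	return time_after_eq(now, from + interval);
}

/* New form: the window is live only while now is in [from, from + interval). */
static bool window_expired_new(unsigned long now, unsigned long from,
			       unsigned long interval)
{
	return !time_in_range_open(now, from, from + interval);
}

/* Small self-check covering the ordinary cases and the wrapped case. */
static void check_examples(void)
{
	/* More than LONG_MAX ticks after a window that started at tick 0. */
	unsigned long stale_now = (unsigned long)LONG_MAX + 201UL;

	/* Inside the window: both forms agree that it is still live. */
	assert(!window_expired_old(1050UL, 1000UL, 100UL));
	assert(!window_expired_new(1050UL, 1000UL, 100UL));

	/* At the end of the window: both forms agree that it expired. */
	assert(window_expired_old(1100UL, 1000UL, 100UL));
	assert(window_expired_new(1100UL, 1000UL, 100UL));

	/* Wrapped difference: only the range-based form reports expiry. */
	assert(!window_expired_old(stale_now, 0UL, 100UL));
	assert(window_expired_new(stale_now, 0UL, 100UL));
}
```

The sketch relies on the usual two's complement conversion from `unsigned long` to `long`, which is implementation-defined in C but is what GCC and Clang provide.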
+4 -1
mm/damon/stat.c
···
 	if (!damon_stat_context)
 		return -ENOMEM;
 	err = damon_start(&damon_stat_context, 1, true);
-	if (err)
+	if (err) {
+		damon_destroy_ctx(damon_stat_context);
+		damon_stat_context = NULL;
 		return err;
+	}
 
 	damon_stat_last_refresh_jiffies = jiffies;
 	call_control.data = damon_stat_context;
+19 -3
mm/huge_memory.c
···
 
 static struct deferred_split *folio_split_queue_lock(struct folio *folio)
 {
-	return split_queue_lock(folio_nid(folio), folio_memcg(folio));
+	struct deferred_split *queue;
+
+	rcu_read_lock();
+	queue = split_queue_lock(folio_nid(folio), folio_memcg(folio));
+	/*
+	 * The memcg destruction path is acquiring the split queue lock for
+	 * reparenting. Once you have it locked, it's safe to drop the rcu lock.
+	 */
+	rcu_read_unlock();
+
+	return queue;
 }
 
 static struct deferred_split *
 folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
 {
-	return split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
+	struct deferred_split *queue;
+
+	rcu_read_lock();
+	queue = split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
+	rcu_read_unlock();
+
+	return queue;
 }
 
 static inline void split_queue_unlock(struct deferred_split *queue)
···
 	folio_ref_unfreeze(folio, folio_cache_ref_count(folio) + 1);
 
 	if (do_lru)
-		unlock_page_lruvec(lruvec);
+		lruvec_unlock(lruvec);
 
 	if (ci)
 		swap_cluster_unlock(ci);
+18
mm/hugetlb.c
···
 	size_t len;
 	char *p;
 
+	if (!s)
+		return -EINVAL;
+
 	if (hugetlb_param_index >= HUGE_MAX_CMDLINE_ARGS)
 		return -EINVAL;
 
···
 	return 0;
 }
 
+#ifdef CONFIG_USERFAULTFD
+static bool hugetlb_can_userfault(struct vm_area_struct *vma,
+				  vm_flags_t vm_flags)
+{
+	return true;
+}
+
+static const struct vm_uffd_ops hugetlb_uffd_ops = {
+	.can_userfault = hugetlb_can_userfault,
+};
+#endif
+
 /*
  * When a new function is introduced to vm_operations_struct and added
  * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
···
 	.close = hugetlb_vm_op_close,
 	.may_split = hugetlb_vm_op_split,
 	.pagesize = hugetlb_vm_op_pagesize,
+#ifdef CONFIG_USERFAULTFD
+	.uffd_ops = &hugetlb_uffd_ops,
+#endif
 };
 
 static pte_t make_huge_pte(struct vm_area_struct *vma, struct folio *folio,
+1 -1
mm/kmemleak.c
···
 /* If there are leaks that can be reported */
 static bool kmemleak_found_leaks;
 
-static bool kmemleak_verbose;
+static bool kmemleak_verbose = IS_ENABLED(CONFIG_DEBUG_KMEMLEAK_VERBOSE);
 module_param_named(verbose, kmemleak_verbose, bool, 0600);
 
 static void kmemleak_disable(void);
+2 -2
mm/memblock.c
···
 	if (err)
 		goto err_unpreserve_fdt;
 
-	err = kho_add_subtree(MEMBLOCK_KHO_FDT, fdt);
+	err = kho_add_subtree(MEMBLOCK_KHO_FDT, fdt, fdt_totalsize(fdt));
 	if (err)
 		goto err_unpreserve_fdt;
 
···
 	if (fdt)
 		return fdt;
 
-	err = kho_retrieve_subtree(MEMBLOCK_KHO_FDT, &fdt_phys);
+	err = kho_retrieve_subtree(MEMBLOCK_KHO_FDT, &fdt_phys, NULL);
 	if (err) {
 		if (err != -ENOENT)
 			pr_warn("failed to retrieve FDT '%s' from KHO: %d\n",
+25 -6
mm/memcontrol-v1.c
···
 void memcg1_swapout(struct folio *folio, swp_entry_t entry)
 {
 	struct mem_cgroup *memcg, *swap_memcg;
+	struct obj_cgroup *objcg;
 	unsigned int nr_entries;
 
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
···
 	if (!do_memsw_account())
 		return;
 
-	memcg = folio_memcg(folio);
-
-	VM_WARN_ON_ONCE_FOLIO(!memcg, folio);
-	if (!memcg)
+	objcg = folio_objcg(folio);
+	VM_WARN_ON_ONCE_FOLIO(!objcg, folio);
+	if (!objcg)
 		return;
 
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
 	/*
 	 * In case the memcg owning these pages has been offlined and doesn't
 	 * have an ID allocated to it anymore, charge the closest online
···
 	folio_unqueue_deferred_split(folio);
 	folio->memcg_data = 0;
 
-	if (!mem_cgroup_is_root(memcg))
+	if (!obj_cgroup_is_root(objcg))
 		page_counter_uncharge(&memcg->memory, nr_entries);
 
 	if (memcg != swap_memcg) {
···
 	preempt_enable_nested();
 	memcg1_check_events(memcg, folio_nid(folio));
 
-	css_put(&memcg->css);
+	rcu_read_unlock();
+	obj_cgroup_put(objcg);
 }
 
 /*
···
 	PGFAULT,
 	PGMAJFAULT,
 };
+
+void reparent_memcg1_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++)
+		reparent_memcg_state_local(memcg, parent, memcg1_stats[i]);
+}
+
+void reparent_memcg1_lruvec_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+	int i;
+
+	for (i = 0; i < NR_LRU_LISTS; i++)
+		reparent_memcg_lruvec_state_local(memcg, parent, i);
+}
 
 void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 {
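The two reparent_memcg1_*_state_local() helpers added to mm/memcontrol-v1.c loop over a set of non-hierarchical ("local") counters and hand each one to a per-index helper that moves the child's value to its parent. A rough userspace sketch of that bookkeeping follows. `struct group`, `NR_LOCAL_STATS`, `reparent_one_local()` and `reparent_all_local()` are hypothetical names, and the sketch leaves out the per-CPU counters, stat flushing and locking that the kernel code has to deal with.

```c
#include <assert.h>
#include <stddef.h>

#define NR_LOCAL_STATS 3	/* hypothetical; the diff loops over ARRAY_SIZE(memcg1_stats) */

/* Hypothetical stand-in for a memcg that only carries local counters. */
struct group {
	struct group *parent;
	long local[NR_LOCAL_STATS];
};

/* Move one local counter from @child to @parent; their sum is preserved. */
static void reparent_one_local(struct group *child, struct group *parent, int idx)
{
	long value = child->local[idx];

	child->local[idx] -= value;	/* child drops to zero */
	parent->local[idx] += value;	/* parent absorbs the child's share */
}

/* Move every local counter, mirroring the per-array loops in the diff. */
static void reparent_all_local(struct group *child)
{
	int i;

	assert(child->parent != NULL);	/* a root group has nowhere to move its stats */
	for (i = 0; i < NR_LOCAL_STATS; i++)
		reparent_one_local(child, child->parent, i);
}
```

After reparent_all_local() the child reads zero for every local counter while the totals over child plus parent are unchanged, which is the property a non-hierarchical stat needs to keep when its group goes away.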
+7
mm/memcontrol-v1.h
···
 		unsigned long nr_memory, int nid);
 
 void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
+void reparent_memcg1_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent);
+void reparent_memcg1_lruvec_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent);
+
+void reparent_memcg_state_local(struct mem_cgroup *memcg,
+				struct mem_cgroup *parent, int idx);
+void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg,
+				       struct mem_cgroup *parent, int idx);
 
 void memcg1_account_kmem(struct mem_cgroup *memcg, int nr_pages);
 static inline bool memcg1_tcpmem_active(struct mem_cgroup *memcg)
+425 -215
mm/memcontrol.c
···
 	return objcg;
 }
 
-static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
-				  struct mem_cgroup *parent)
+static inline struct obj_cgroup *__memcg_reparent_objcgs(struct mem_cgroup *memcg,
+							 struct mem_cgroup *parent,
+							 int nid)
 {
 	struct obj_cgroup *objcg, *iter;
+	struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid];
+	struct mem_cgroup_per_node *parent_pn = parent->nodeinfo[nid];
 
-	objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
-
-	spin_lock_irq(&objcg_lock);
-
+	objcg = rcu_replace_pointer(pn->objcg, NULL, true);
 	/* 1) Ready to reparent active objcg. */
-	list_add(&objcg->list, &memcg->objcg_list);
+	list_add(&objcg->list, &pn->objcg_list);
 	/* 2) Reparent active objcg and already reparented objcgs to parent. */
-	list_for_each_entry(iter, &memcg->objcg_list, list)
+	list_for_each_entry(iter, &pn->objcg_list, list)
 		WRITE_ONCE(iter->memcg, parent);
 	/* 3) Move already reparented objcgs to the parent's list */
-	list_splice(&memcg->objcg_list, &parent->objcg_list);
+	list_splice(&pn->objcg_list, &parent_pn->objcg_list);
 
+	return objcg;
+}
+
+#ifdef CONFIG_MEMCG_V1
+static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force);
+
+static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return;
+
+	/*
+	 * Reparent stats exposed non-hierarchically. Flush @memcg's stats first
+	 * to read its stats accurately , and conservatively flush @parent's
+	 * stats after reparenting to avoid hiding a potentially large stat
+	 * update (e.g. from callers of mem_cgroup_flush_stats_ratelimited()).
+	 */
+	__mem_cgroup_flush_stats(memcg, true);
+
+	/* The following counts are all non-hierarchical and need to be reparented. */
+	reparent_memcg1_state_local(memcg, parent);
+	reparent_memcg1_lruvec_state_local(memcg, parent);
+
+	__mem_cgroup_flush_stats(parent, true);
+}
+#else
+static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+}
+#endif
+
+static inline void reparent_locks(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid)
+{
+	spin_lock_irq(&objcg_lock);
+	spin_lock_nested(&mem_cgroup_lruvec(memcg, NODE_DATA(nid))->lru_lock, 1);
+	spin_lock_nested(&mem_cgroup_lruvec(parent, NODE_DATA(nid))->lru_lock, 2);
+}
+
+static inline void reparent_unlocks(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid)
+{
+	spin_unlock(&mem_cgroup_lruvec(parent, NODE_DATA(nid))->lru_lock);
+	spin_unlock(&mem_cgroup_lruvec(memcg, NODE_DATA(nid))->lru_lock);
 	spin_unlock_irq(&objcg_lock);
+}
 
-	percpu_ref_kill(&objcg->refcnt);
+static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
+{
+	struct obj_cgroup *objcg;
+	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+	int nid;
+
+	for_each_node(nid) {
+retry:
+		if (lru_gen_enabled())
+			max_lru_gen_memcg(parent, nid);
+
+		reparent_locks(memcg, parent, nid);
+
+		if (lru_gen_enabled()) {
+			if (!recheck_lru_gen_max_memcg(parent, nid)) {
+				reparent_unlocks(memcg, parent, nid);
+				cond_resched();
+				goto retry;
+			}
+			lru_gen_reparent_memcg(memcg, parent, nid);
+		} else {
+			lru_reparent_memcg(memcg, parent, nid);
+		}
+
+		objcg = __memcg_reparent_objcgs(memcg, parent, nid);
+
+		reparent_unlocks(memcg, parent, nid);
+
+		percpu_ref_kill(&objcg->refcnt);
+	}
+
+	reparent_state_local(memcg, parent);
 }
 
 /*
···
 EXPORT_SYMBOL(memcg_bpf_enabled_key);
 
 /**
- * mem_cgroup_css_from_folio - css of the memcg associated with a folio
+ * get_mem_cgroup_css_from_folio - acquire a css of the memcg associated with a folio
  * @folio: folio of interest
  *
  * If memcg is bound to the default hierarchy, css of the memcg associated
···
  * If memcg is bound to a traditional hierarchy, the css of root_mem_cgroup
  * is returned.
  */
-struct cgroup_subsys_state *mem_cgroup_css_from_folio(struct folio *folio)
+struct cgroup_subsys_state *get_mem_cgroup_css_from_folio(struct folio *folio)
 {
-	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct mem_cgroup *memcg;
 
-	if (!memcg || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
-		memcg = root_mem_cgroup;
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return &root_mem_cgroup->css;
 
-	return &memcg->css;
+	memcg = get_mem_cgroup_from_folio(folio);
+
+	return memcg ? &memcg->css : &root_mem_cgroup->css;
 }
 
 /**
···
 	return x;
 }
 
+#ifdef CONFIG_MEMCG_V1
+static void __mod_memcg_lruvec_state(struct mem_cgroup_per_node *pn,
+				     enum node_stat_item idx, long val);
+
+void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg,
+				       struct mem_cgroup *parent, int idx)
+{
+	int nid;
+
+	for_each_node(nid) {
+		struct lruvec *child_lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+		struct lruvec *parent_lruvec = mem_cgroup_lruvec(parent, NODE_DATA(nid));
+		unsigned long value = lruvec_page_state_local(child_lruvec, idx);
+		struct mem_cgroup_per_node *child_pn, *parent_pn;
+
+		child_pn = container_of(child_lruvec, struct mem_cgroup_per_node, lruvec);
+		parent_pn = container_of(parent_lruvec, struct mem_cgroup_per_node, lruvec);
+
+		__mod_memcg_lruvec_state(child_pn, idx, -value);
+		__mod_memcg_lruvec_state(parent_pn, idx, value);
+	}
+}
+#endif
+
 /* Subset of vm_event_item to report for memcg event stats */
 static const unsigned int memcg_vm_event_stat[] = {
 #ifdef CONFIG_MEMCG_V1
···
 
 struct memcg_vmstats_percpu {
 	/* Stats updates since the last flush */
-	unsigned int stats_updates;
+	unsigned long stats_updates;
 
 	/* Cached pointers for fast iteration in memcg_rstat_updated() */
 	struct memcg_vmstats_percpu __percpu *parent_pcpu;
···
 	unsigned long events_pending[NR_MEMCG_EVENTS];
 
 	/* Stats updates since the last flush */
-	atomic_t stats_updates;
+	atomic_long_t stats_updates;
 };
 
 /*
···
 
 static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
 {
-	return atomic_read(&vmstats->stats_updates) >
+	return atomic_long_read(&vmstats->stats_updates) >
 		MEMCG_CHARGE_BATCH * num_online_cpus();
 }
 
-static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
+static inline void memcg_rstat_updated(struct mem_cgroup *memcg, long val,
 				       int cpu)
 {
 	struct memcg_vmstats_percpu __percpu *statc_pcpu;
 	struct memcg_vmstats_percpu *statc;
-	unsigned int stats_updates;
+	unsigned long stats_updates;
 
 	if (!val)
 		return;
···
 			continue;
 
 		stats_updates = this_cpu_xchg(statc_pcpu->stats_updates, 0);
-		atomic_add(stats_updates, &statc->vmstats->stats_updates);
+		atomic_long_add(stats_updates, &statc->vmstats->stats_updates);
 	}
 }
 
···
 {
 	bool needs_flush = memcg_vmstats_needs_flush(memcg->vmstats);
 
-	trace_memcg_flush_stats(memcg, atomic_read(&memcg->vmstats->stats_updates),
+	trace_memcg_flush_stats(memcg, atomic_long_read(&memcg->vmstats->stats_updates),
 		force, needs_flush);
 
 	if (!force && !needs_flush)
···
  * Normalize the value passed into memcg_rstat_updated() to be in pages. Round
  * up non-zero sub-page updates to 1 page as zero page updates are ignored.
  */
-static int memcg_state_val_in_pages(int idx, int val)
+static long memcg_state_val_in_pages(int idx, long val)
 {
 	int unit = memcg_page_state_unit(idx);
+	long res;
 
 	if (!val || unit == PAGE_SIZE)
 		return val;
-	else
-		return max(val * unit / PAGE_SIZE, 1UL);
+
+	/* Get the absolute value of (val * unit / PAGE_SIZE). */
+	res = mult_frac(abs(val), unit, PAGE_SIZE);
+	/* Round up zero values. */
+	res = res ? : 1;
+
+	return val < 0 ? -res : res;
+}
+
+#ifdef CONFIG_MEMCG_V1
+/*
+ * Used in mod_memcg_state() and mod_memcg_lruvec_state() to avoid race with
+ * reparenting of non-hierarchical state_locals.
+ */
+static inline struct mem_cgroup *get_non_dying_memcg_start(struct mem_cgroup *memcg)
+{
+	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return memcg;
+
+	rcu_read_lock();
+
+	while (memcg_is_dying(memcg))
+		memcg = parent_mem_cgroup(memcg);
+
+	return memcg;
+}
+
+static inline void get_non_dying_memcg_end(void)
+{
+	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return;
+
+	rcu_read_unlock();
+}
+#else
+static inline struct mem_cgroup *get_non_dying_memcg_start(struct mem_cgroup *memcg)
+{
+	return memcg;
+}
+
+static inline void get_non_dying_memcg_end(void)
+{
+}
+#endif
+
+static void __mod_memcg_state(struct mem_cgroup *memcg,
+			      enum memcg_stat_item idx, long val)
+{
+	int i = memcg_stats_index(idx);
+	int cpu;
+
+	if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx))
+		return;
+
+	cpu = get_cpu();
+
+	this_cpu_add(memcg->vmstats_percpu->state[i], val);
+	val = memcg_state_val_in_pages(idx, val);
+	memcg_rstat_updated(memcg, val, cpu);
+
+	trace_mod_memcg_state(memcg, idx, val);
+
+	put_cpu();
 }
 
 /**
···
 void mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
 		       int val)
 {
-	int i = memcg_stats_index(idx);
-	int cpu;
-
 	if (mem_cgroup_disabled())
 		return;
 
-	if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx))
-		return;
-
-	cpu = get_cpu();
-
-	this_cpu_add(memcg->vmstats_percpu->state[i], val);
-	val = memcg_state_val_in_pages(idx, val);
-	memcg_rstat_updated(memcg, val, cpu);
-	trace_mod_memcg_state(memcg, idx, val);
-
-	put_cpu();
+	memcg = get_non_dying_memcg_start(memcg);
+	__mod_memcg_state(memcg, idx, val);
+	get_non_dying_memcg_end();
 }
 
 #ifdef CONFIG_MEMCG_V1
···
 #endif
 	return x;
 }
+
+void reparent_memcg_state_local(struct mem_cgroup *memcg,
+				struct mem_cgroup *parent, int idx)
+{
+	unsigned long value = memcg_page_state_local(memcg, idx);
+
+	__mod_memcg_state(memcg, idx, -value);
+	__mod_memcg_state(parent, idx, value);
+}
 #endif
 
-static void mod_memcg_lruvec_state(struct lruvec *lruvec,
-				     enum node_stat_item idx,
-				     int val)
+static void __mod_memcg_lruvec_state(struct mem_cgroup_per_node *pn,
+				     enum node_stat_item idx, long val)
 {
-	struct mem_cgroup_per_node *pn;
-	struct mem_cgroup *memcg;
+	struct mem_cgroup *memcg = pn->memcg;
 	int i = memcg_stats_index(idx);
 	int cpu;
 
 	if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx))
 		return;
-
-	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-	memcg = pn->memcg;
 
 	cpu = get_cpu();
 
···
 	trace_mod_memcg_lruvec_state(memcg, idx, val);
 
 	put_cpu();
+}
+
+static void mod_memcg_lruvec_state(struct lruvec *lruvec,
+				     enum node_stat_item idx,
+				     int val)
+{
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+	struct mem_cgroup_per_node *pn;
+	struct mem_cgroup *memcg;
+
+	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+	memcg = get_non_dying_memcg_start(pn->memcg);
+	pn = memcg->nodeinfo[pgdat->node_id];
+
+	__mod_memcg_lruvec_state(pn, idx, val);
+
+	get_non_dying_memcg_end();
 }
 
 /**
···
 /**
  * get_mem_cgroup_from_folio - Obtain a reference on a given folio's memcg.
  * @folio: folio from which memcg should be extracted.
+ *
+ * See folio_memcg() for folio->objcg/memcg binding rules.
  */
 struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
 {
-	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct mem_cgroup *memcg;
 
 	if (mem_cgroup_disabled())
 		return NULL;
 
+	if (!folio_memcg_charged(folio))
+		return root_mem_cgroup;
+
 	rcu_read_lock();
-	if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css)))
-		memcg = root_mem_cgroup;
+	do {
+		memcg = folio_memcg(folio);
+	} while (unlikely(!css_tryget(&memcg->css)));
 	rcu_read_unlock();
 	return memcg;
 }
···
 	}
 }
 
-#ifdef CONFIG_DEBUG_VM
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
-{
-	struct mem_cgroup *memcg;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	memcg = folio_memcg(folio);
-
-	if (!memcg)
-		VM_BUG_ON_FOLIO(!mem_cgroup_is_root(lruvec_memcg(lruvec)), folio);
-	else
-		VM_BUG_ON_FOLIO(lruvec_memcg(lruvec) != memcg, folio);
-}
-#endif
-
 /**
  * folio_lruvec_lock - Lock the lruvec for a folio.
  * @folio: Pointer to the folio.
···
  * - folio_test_lru false
  * - folio frozen (refcount of 0)
  *
- * Return: The lruvec this folio is on with its lock held.
+ * Return: The lruvec this folio is on with its lock held and rcu read lock held.
  */
 struct lruvec *folio_lruvec_lock(struct folio *folio)
 {
-	struct lruvec *lruvec = folio_lruvec(folio);
+	struct lruvec *lruvec;
 
+	rcu_read_lock();
+retry:
+	lruvec = folio_lruvec(folio);
 	spin_lock(&lruvec->lru_lock);
-	lruvec_memcg_debug(lruvec, folio);
+	if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+		spin_unlock(&lruvec->lru_lock);
+		goto retry;
+	}
 
 	return lruvec;
 }
···
  * - folio frozen (refcount of 0)
  *
  * Return: The lruvec this folio is on with its lock held and interrupts
- * disabled.
+ * disabled and rcu read lock held.
  */
 struct lruvec *folio_lruvec_lock_irq(struct folio *folio)
 {
-	struct lruvec *lruvec = folio_lruvec(folio);
+	struct lruvec *lruvec;
 
+	rcu_read_lock();
+retry:
+	lruvec = folio_lruvec(folio);
 	spin_lock_irq(&lruvec->lru_lock);
-	lruvec_memcg_debug(lruvec, folio);
+	if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+		spin_unlock_irq(&lruvec->lru_lock);
+		goto retry;
+	}
 
 	return lruvec;
 }
···
  * - folio frozen (refcount of 0)
  *
  * Return: The lruvec this folio is on with its lock held and interrupts
- * disabled.
+ * disabled and rcu read lock held.
  */
 struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
 		unsigned long *flags)
 {
-	struct lruvec *lruvec = folio_lruvec(folio);
+	struct lruvec *lruvec;
 
+	rcu_read_lock();
+retry:
+	lruvec = folio_lruvec(folio);
 	spin_lock_irqsave(&lruvec->lru_lock, *flags);
-	lruvec_memcg_debug(lruvec, folio);
+	if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+		spin_unlock_irqrestore(&lruvec->lru_lock, *flags);
+		goto retry;
+	}
 
 	return lruvec;
 }
···
  * to or just after a page is removed from an lru list.
  */
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
-				int zid, int nr_pages)
+				int zid, long nr_pages)
 {
 	struct mem_cgroup_per_node *mz;
 	unsigned long *lru_size;
···
 
 	size = *lru_size;
 	if (WARN_ONCE(size < 0,
-		"%s(%p, %d, %d): lru_size %ld\n",
+		"%s(%p, %d, %ld): lru_size %ld\n",
 		__func__, lruvec, lru, nr_pages, size)) {
 		VM_BUG_ON(1);
 		*lru_size = 0;
···
 	return try_charge_memcg(memcg, gfp_mask, nr_pages);
 }
 
-static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
+static void commit_charge(struct folio *folio, struct obj_cgroup *objcg)
 {
 	VM_BUG_ON_FOLIO(folio_memcg_charged(folio), folio);
 	/*
-	 * Any of the following ensures page's memcg stability:
+	 * Any of the following ensures folio's objcg stability:
 	 *
 	 * - the page lock
 	 * - LRU isolation
 	 * - exclusive reference
 	 */
-	folio->memcg_data = (unsigned long)memcg;
+	folio->memcg_data = (unsigned long)objcg;
 }
 
 #ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
···
 
 static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
 {
-	struct obj_cgroup *objcg = NULL;
+	int nid = numa_node_id();
 
-	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
-		objcg = rcu_dereference(memcg->objcg);
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		struct obj_cgroup *objcg = rcu_dereference(memcg->nodeinfo[nid]->objcg);
+
 		if (likely(objcg && obj_cgroup_tryget(objcg)))
-			break;
-		objcg = NULL;
+			return objcg;
 	}
+
+	return NULL;
+}
+
+static inline struct obj_cgroup *get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
+{
+	struct obj_cgroup *objcg;
+
+	rcu_read_lock();
+	objcg = __get_obj_cgroup_from_memcg(memcg);
+	rcu_read_unlock();
+
 	return objcg;
 }
 
···
 {
 	struct mem_cgroup *memcg;
 	struct obj_cgroup *objcg;
+	int nid = numa_node_id();
 
 	if (IS_ENABLED(CONFIG_MEMCG_NMI_UNSAFE) && in_nmi())
 		return NULL;
···
 		 * Objcg reference is kept by the task, so it's safe
 		 * to use the objcg by the current task.
 		 */
-		return objcg;
+		return objcg ? : rcu_dereference_check(root_mem_cgroup->nodeinfo[nid]->objcg, 1);
 	}
 
 	memcg = this_cpu_read(int_active_memcg);
 	if (unlikely(memcg))
 		goto from_memcg;
 
-	return NULL;
+	return rcu_dereference_check(root_mem_cgroup->nodeinfo[nid]->objcg, 1);
 
 from_memcg:
-	objcg = NULL;
-	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
 		/*
 		 * Memcg pointer is protected by scope (see set_active_memcg())
 		 * and is pinning the corresponding objcg, so objcg can't go
 		 * away and can be used within the scope without any additional
 		 * protection.
 		 */
-		objcg = rcu_dereference_check(memcg->objcg, 1);
+		objcg = rcu_dereference_check(memcg->nodeinfo[nid]->objcg, 1);
 		if (likely(objcg))
-			break;
+			return objcg;
 	}
 
-	return objcg;
+	return rcu_dereference_check(root_mem_cgroup->nodeinfo[nid]->objcg, 1);
 }
 
 struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
 {
 	struct obj_cgroup *objcg;
 
-	if (!memcg_kmem_online())
-		return NULL;
-
-	if (folio_memcg_kmem(folio)) {
-		objcg = __folio_objcg(folio);
+	objcg = folio_objcg(folio);
+	if (objcg)
 		obj_cgroup_get(objcg);
-	} else {
-		struct mem_cgroup *memcg;
 
-		rcu_read_lock();
-		memcg = __folio_memcg(folio);
-		if (memcg)
-			objcg = __get_obj_cgroup_from_memcg(memcg);
-		else
-			objcg = NULL;
-		rcu_read_unlock();
-	}
 	return objcg;
 }
 
···
 	int ret = 0;
 
 	objcg = current_obj_cgroup();
-	if (objcg) {
+	if (objcg && !obj_cgroup_is_root(objcg)) {
 		ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order);
 		if (!ret) {
 			obj_cgroup_get(objcg);
···
 	 * obj_cgroup_get() is used to get a permanent reference.
 	 */
 	objcg = current_obj_cgroup();
-	if (!objcg)
+	if (!objcg || obj_cgroup_is_root(objcg))
 		return true;
 
 	/*
···
 		return;
 
 	new_refs = (1 << (old_order - new_order)) - 1;
-	css_get_many(&__folio_memcg(folio)->css, new_refs);
+	obj_cgroup_get_many(folio_objcg(folio), new_refs);
 }
 
-static int memcg_online_kmem(struct mem_cgroup *memcg)
+static void memcg_online_kmem(struct mem_cgroup *memcg)
 {
-	struct obj_cgroup *objcg;
-
 	if (mem_cgroup_kmem_disabled())
-		return 0;
+		return;
 
 	if (unlikely(mem_cgroup_is_root(memcg)))
-		return 0;
-
-	objcg = obj_cgroup_alloc();
-	if (!objcg)
-		return -ENOMEM;
-
-	objcg->memcg = memcg;
-	rcu_assign_pointer(memcg->objcg, objcg);
-	obj_cgroup_get(objcg);
-	memcg->orig_objcg = objcg;
+		return;
 
 	static_branch_enable(&memcg_kmem_online_key);
 
 	memcg->kmemcg_id = memcg->id.id;
-
-	return 0;
 }
 
 static void memcg_offline_kmem(struct mem_cgroup *memcg)
···
 		return;
 
 	parent = parent_mem_cgroup(memcg);
-	if (!parent)
-		parent = root_mem_cgroup;
-
 	memcg_reparent_list_lrus(memcg, parent);
-
-	/*
-	 * Objcg's reparenting must be after list_lru's, make sure list_lru
-	 * helpers won't use parent's list_lru until child is drained.
3599 - */ 3600 - memcg_reparent_objcgs(memcg, parent); 3601 3427 } 3602 3428 3603 3429 #ifdef CONFIG_CGROUP_WRITEBACK ··· 3861 3705 break; 3862 3706 } 3863 3707 memcg = parent_mem_cgroup(memcg); 3864 - if (!memcg) 3865 - memcg = root_mem_cgroup; 3866 3708 } 3867 3709 return memcg; 3868 3710 } ··· 3925 3771 if (!pn->lruvec_stats_percpu) 3926 3772 goto fail; 3927 3773 3774 + INIT_LIST_HEAD(&pn->objcg_list); 3775 + 3928 3776 lruvec_init(&pn->lruvec); 3929 3777 pn->memcg = memcg; 3930 3778 ··· 3941 3785 { 3942 3786 int node; 3943 3787 3944 - obj_cgroup_put(memcg->orig_objcg); 3788 + for_each_node(node) { 3789 + struct mem_cgroup_per_node *pn = memcg->nodeinfo[node]; 3790 + if (!pn) 3791 + continue; 3945 3792 3946 - for_each_node(node) 3947 - free_mem_cgroup_per_node_info(memcg->nodeinfo[node]); 3793 + obj_cgroup_put(pn->orig_objcg); 3794 + free_mem_cgroup_per_node_info(pn); 3795 + } 3948 3796 memcg1_free_events(memcg); 3949 3797 kfree(memcg->vmstats); 3950 3798 free_percpu(memcg->vmstats_percpu); ··· 4019 3859 #endif 4020 3860 memcg1_memcg_init(memcg); 4021 3861 memcg->kmemcg_id = -1; 4022 - INIT_LIST_HEAD(&memcg->objcg_list); 4023 3862 #ifdef CONFIG_CGROUP_WRITEBACK 4024 3863 INIT_LIST_HEAD(&memcg->cgwb_list); 4025 3864 for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) ··· 4094 3935 static int mem_cgroup_css_online(struct cgroup_subsys_state *css) 4095 3936 { 4096 3937 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 3938 + struct obj_cgroup *objcg; 3939 + int nid; 4097 3940 4098 - if (memcg_online_kmem(memcg)) 4099 - goto remove_id; 3941 + memcg_online_kmem(memcg); 4100 3942 4101 3943 /* 4102 3944 * A memcg must be visible for expand_shrinker_info() ··· 4106 3946 */ 4107 3947 if (alloc_shrinker_info(memcg)) 4108 3948 goto offline_kmem; 3949 + 3950 + for_each_node(nid) { 3951 + objcg = obj_cgroup_alloc(); 3952 + if (!objcg) 3953 + goto free_objcg; 3954 + 3955 + if (unlikely(mem_cgroup_is_root(memcg))) 3956 + objcg->is_root = true; 3957 + 3958 + objcg->memcg = memcg; 3959 + 
rcu_assign_pointer(memcg->nodeinfo[nid]->objcg, objcg); 3960 + obj_cgroup_get(objcg); 3961 + memcg->nodeinfo[nid]->orig_objcg = objcg; 3962 + } 4109 3963 4110 3964 if (unlikely(mem_cgroup_is_root(memcg)) && !mem_cgroup_disabled()) 4111 3965 queue_delayed_work(system_dfl_wq, &stats_flush_dwork, ··· 4143 3969 xa_store(&mem_cgroup_private_ids, memcg->id.id, memcg, GFP_KERNEL); 4144 3970 4145 3971 return 0; 3972 + free_objcg: 3973 + for_each_node(nid) { 3974 + struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid]; 3975 + 3976 + objcg = rcu_replace_pointer(pn->objcg, NULL, true); 3977 + if (objcg) 3978 + percpu_ref_kill(&objcg->refcnt); 3979 + 3980 + if (pn->orig_objcg) { 3981 + obj_cgroup_put(pn->orig_objcg); 3982 + /* 3983 + * Reset pn->orig_objcg to NULL to prevent 3984 + * obj_cgroup_put() from being called again in 3985 + * __mem_cgroup_free(). 3986 + */ 3987 + pn->orig_objcg = NULL; 3988 + } 3989 + } 3990 + free_shrinker_info(memcg); 4146 3991 offline_kmem: 4147 3992 memcg_offline_kmem(memcg); 4148 - remove_id: 4149 3993 mem_cgroup_private_id_remove(memcg); 4150 3994 return -ENOMEM; 4151 3995 } ··· 4181 3989 4182 3990 memcg_offline_kmem(memcg); 4183 3991 reparent_deferred_split_queue(memcg); 3992 + /* 3993 + * The reparenting of objcg must be after the reparenting of the 3994 + * list_lru and deferred_split_queue above, which ensures that they will 3995 + * not mistakenly get the parent list_lru and deferred_split_queue. 
3996 + */ 3997 + memcg_reparent_objcgs(memcg); 4184 3998 reparent_shrinker_deferred(memcg); 4185 3999 wb_memcg_offline(memcg); 4186 4000 lru_gen_offline_memcg(memcg); ··· 4419 4221 } 4420 4222 WRITE_ONCE(statc->stats_updates, 0); 4421 4223 /* We are in a per-cpu loop here, only do the atomic write once */ 4422 - if (atomic_read(&memcg->vmstats->stats_updates)) 4423 - atomic_set(&memcg->vmstats->stats_updates, 0); 4224 + if (atomic_long_read(&memcg->vmstats->stats_updates)) 4225 + atomic_long_set(&memcg->vmstats->stats_updates, 0); 4424 4226 } 4425 4227 4426 4228 static void mem_cgroup_fork(struct task_struct *task) ··· 4997 4799 static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg, 4998 4800 gfp_t gfp) 4999 4801 { 5000 - int ret; 4802 + int ret = 0; 4803 + struct obj_cgroup *objcg; 5001 4804 5002 - ret = try_charge(memcg, gfp, folio_nr_pages(folio)); 5003 - if (ret) 5004 - goto out; 5005 - 5006 - css_get(&memcg->css); 5007 - commit_charge(folio, memcg); 4805 + objcg = get_obj_cgroup_from_memcg(memcg); 4806 + /* Do not account at the root objcg level. 
*/ 4807 + if (!obj_cgroup_is_root(objcg)) 4808 + ret = try_charge_memcg(memcg, gfp, folio_nr_pages(folio)); 4809 + if (ret) { 4810 + obj_cgroup_put(objcg); 4811 + return ret; 4812 + } 4813 + commit_charge(folio, objcg); 5008 4814 memcg1_commit_charge(folio, memcg); 5009 - out: 4815 + 5010 4816 return ret; 5011 4817 } 5012 4818 ··· 5096 4894 } 5097 4895 5098 4896 struct uncharge_gather { 5099 - struct mem_cgroup *memcg; 4897 + struct obj_cgroup *objcg; 5100 4898 unsigned long nr_memory; 5101 4899 unsigned long pgpgout; 5102 4900 unsigned long nr_kmem; ··· 5110 4908 5111 4909 static void uncharge_batch(const struct uncharge_gather *ug) 5112 4910 { 4911 + struct mem_cgroup *memcg; 4912 + 4913 + rcu_read_lock(); 4914 + memcg = obj_cgroup_memcg(ug->objcg); 5113 4915 if (ug->nr_memory) { 5114 - memcg_uncharge(ug->memcg, ug->nr_memory); 4916 + memcg_uncharge(memcg, ug->nr_memory); 5115 4917 if (ug->nr_kmem) { 5116 - mod_memcg_state(ug->memcg, MEMCG_KMEM, -ug->nr_kmem); 5117 - memcg1_account_kmem(ug->memcg, -ug->nr_kmem); 4918 + mod_memcg_state(memcg, MEMCG_KMEM, -ug->nr_kmem); 4919 + memcg1_account_kmem(memcg, -ug->nr_kmem); 5118 4920 } 5119 - memcg1_oom_recover(ug->memcg); 4921 + memcg1_oom_recover(memcg); 5120 4922 } 5121 4923 5122 - memcg1_uncharge_batch(ug->memcg, ug->pgpgout, ug->nr_memory, ug->nid); 4924 + memcg1_uncharge_batch(memcg, ug->pgpgout, ug->nr_memory, ug->nid); 4925 + rcu_read_unlock(); 5123 4926 5124 4927 /* drop reference from uncharge_folio */ 5125 - css_put(&ug->memcg->css); 4928 + obj_cgroup_put(ug->objcg); 5126 4929 } 5127 4930 5128 4931 static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug) 5129 4932 { 5130 4933 long nr_pages; 5131 - struct mem_cgroup *memcg; 5132 4934 struct obj_cgroup *objcg; 5133 4935 5134 4936 VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); 5135 4937 5136 4938 /* 5137 4939 * Nobody should be changing or seriously looking at 5138 - * folio memcg or objcg at this point, we have fully 5139 - * exclusive access 
to the folio. 4940 + * folio objcg at this point, we have fully exclusive 4941 + * access to the folio. 5140 4942 */ 5141 - if (folio_memcg_kmem(folio)) { 5142 - objcg = __folio_objcg(folio); 5143 - /* 5144 - * This get matches the put at the end of the function and 5145 - * kmem pages do not hold memcg references anymore. 5146 - */ 5147 - memcg = get_mem_cgroup_from_objcg(objcg); 5148 - } else { 5149 - memcg = __folio_memcg(folio); 5150 - } 5151 - 5152 - if (!memcg) 4943 + objcg = folio_objcg(folio); 4944 + if (!objcg) 5153 4945 return; 5154 4946 5155 - if (ug->memcg != memcg) { 5156 - if (ug->memcg) { 4947 + if (ug->objcg != objcg) { 4948 + if (ug->objcg) { 5157 4949 uncharge_batch(ug); 5158 4950 uncharge_gather_clear(ug); 5159 4951 } 5160 - ug->memcg = memcg; 4952 + ug->objcg = objcg; 5161 4953 ug->nid = folio_nid(folio); 5162 4954 5163 - /* pairs with css_put in uncharge_batch */ 5164 - css_get(&memcg->css); 4955 + /* pairs with obj_cgroup_put in uncharge_batch */ 4956 + obj_cgroup_get(objcg); 5165 4957 } 5166 4958 5167 4959 nr_pages = folio_nr_pages(folio); ··· 5163 4967 if (folio_memcg_kmem(folio)) { 5164 4968 ug->nr_memory += nr_pages; 5165 4969 ug->nr_kmem += nr_pages; 5166 - 5167 - folio->memcg_data = 0; 5168 - obj_cgroup_put(objcg); 5169 4970 } else { 5170 4971 /* LRU pages aren't accounted at the root level */ 5171 - if (!mem_cgroup_is_root(memcg)) 4972 + if (!obj_cgroup_is_root(objcg)) 5172 4973 ug->nr_memory += nr_pages; 5173 4974 ug->pgpgout++; 5174 4975 5175 4976 WARN_ON_ONCE(folio_unqueue_deferred_split(folio)); 5176 - folio->memcg_data = 0; 5177 4977 } 5178 4978 5179 - css_put(&memcg->css); 4979 + folio->memcg_data = 0; 4980 + obj_cgroup_put(objcg); 5180 4981 } 5181 4982 5182 4983 void __mem_cgroup_uncharge(struct folio *folio) ··· 5197 5004 uncharge_gather_clear(&ug); 5198 5005 for (i = 0; i < folios->nr; i++) 5199 5006 uncharge_folio(folios->folios[i], &ug); 5200 - if (ug.memcg) 5007 + if (ug.objcg) 5201 5008 uncharge_batch(&ug); 5202 5009 } 5203 
5010 ··· 5214 5021 void mem_cgroup_replace_folio(struct folio *old, struct folio *new) 5215 5022 { 5216 5023 struct mem_cgroup *memcg; 5024 + struct obj_cgroup *objcg; 5217 5025 long nr_pages = folio_nr_pages(new); 5218 5026 5219 5027 VM_BUG_ON_FOLIO(!folio_test_locked(old), old); ··· 5229 5035 if (folio_memcg_charged(new)) 5230 5036 return; 5231 5037 5232 - memcg = folio_memcg(old); 5233 - VM_WARN_ON_ONCE_FOLIO(!memcg, old); 5234 - if (!memcg) 5038 + objcg = folio_objcg(old); 5039 + VM_WARN_ON_ONCE_FOLIO(!objcg, old); 5040 + if (!objcg) 5235 5041 return; 5236 5042 5043 + rcu_read_lock(); 5044 + memcg = obj_cgroup_memcg(objcg); 5237 5045 /* Force-charge the new page. The old one will be freed soon */ 5238 - if (!mem_cgroup_is_root(memcg)) { 5046 + if (!obj_cgroup_is_root(objcg)) { 5239 5047 page_counter_charge(&memcg->memory, nr_pages); 5240 5048 if (do_memsw_account()) 5241 5049 page_counter_charge(&memcg->memsw, nr_pages); 5242 5050 } 5243 5051 5244 - css_get(&memcg->css); 5245 - commit_charge(new, memcg); 5052 + obj_cgroup_get(objcg); 5053 + commit_charge(new, objcg); 5246 5054 memcg1_commit_charge(new, memcg); 5055 + rcu_read_unlock(); 5247 5056 } 5248 5057 5249 5058 /** ··· 5262 5065 */ 5263 5066 void mem_cgroup_migrate(struct folio *old, struct folio *new) 5264 5067 { 5265 - struct mem_cgroup *memcg; 5068 + struct obj_cgroup *objcg; 5266 5069 5267 5070 VM_BUG_ON_FOLIO(!folio_test_locked(old), old); 5268 5071 VM_BUG_ON_FOLIO(!folio_test_locked(new), new); ··· 5273 5076 if (mem_cgroup_disabled()) 5274 5077 return; 5275 5078 5276 - memcg = folio_memcg(old); 5079 + objcg = folio_objcg(old); 5277 5080 /* 5278 - * Note that it is normal to see !memcg for a hugetlb folio. 5081 + * Note that it is normal to see !objcg for a hugetlb folio. 5279 5082 * For e.g, it could have been allocated when memory_hugetlb_accounting 5280 5083 * was not selected. 
5281 5084 */ 5282 - VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(old) && !memcg, old); 5283 - if (!memcg) 5085 + VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(old) && !objcg, old); 5086 + if (!objcg) 5284 5087 return; 5285 5088 5286 - /* Transfer the charge and the css ref */ 5287 - commit_charge(new, memcg); 5089 + /* Transfer the charge and the objcg ref */ 5090 + commit_charge(new, objcg); 5288 5091 5289 5092 /* Warning should never happen, so don't worry about refcount non-0 */ 5290 5093 WARN_ON_ONCE(folio_unqueue_deferred_split(old)); ··· 5467 5270 unsigned int nr_pages = folio_nr_pages(folio); 5468 5271 struct page_counter *counter; 5469 5272 struct mem_cgroup *memcg; 5273 + struct obj_cgroup *objcg; 5470 5274 5471 5275 if (do_memsw_account()) 5472 5276 return 0; 5473 5277 5474 - memcg = folio_memcg(folio); 5475 - 5476 - VM_WARN_ON_ONCE_FOLIO(!memcg, folio); 5477 - if (!memcg) 5278 + objcg = folio_objcg(folio); 5279 + VM_WARN_ON_ONCE_FOLIO(!objcg, folio); 5280 + if (!objcg) 5478 5281 return 0; 5479 5282 5283 + rcu_read_lock(); 5284 + memcg = obj_cgroup_memcg(objcg); 5480 5285 if (!entry.val) { 5481 5286 memcg_memory_event(memcg, MEMCG_SWAP_FAIL); 5287 + rcu_read_unlock(); 5482 5288 return 0; 5483 5289 } 5484 5290 5485 5291 memcg = mem_cgroup_private_id_get_online(memcg, nr_pages); 5292 + /* memcg is pined by memcg ID. 
*/ 5293 + rcu_read_unlock(); 5486 5294 5487 5295 if (!mem_cgroup_is_root(memcg) && 5488 5296 !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) { ··· 5545 5343 bool mem_cgroup_swap_full(struct folio *folio) 5546 5344 { 5547 5345 struct mem_cgroup *memcg; 5346 + bool ret = false; 5548 5347 5549 5348 VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); 5550 5349 5551 5350 if (vm_swap_full()) 5552 5351 return true; 5553 - if (do_memsw_account()) 5554 - return false; 5352 + if (do_memsw_account() || !folio_memcg_charged(folio)) 5353 + return ret; 5555 5354 5355 + rcu_read_lock(); 5556 5356 memcg = folio_memcg(folio); 5557 - if (!memcg) 5558 - return false; 5559 - 5560 5357 for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) { 5561 5358 unsigned long usage = page_counter_read(&memcg->swap); 5562 5359 5563 5360 if (usage * 2 >= READ_ONCE(memcg->swap.high) || 5564 - usage * 2 >= READ_ONCE(memcg->swap.max)) 5565 - return true; 5361 + usage * 2 >= READ_ONCE(memcg->swap.max)) { 5362 + ret = true; 5363 + break; 5364 + } 5566 5365 } 5366 + rcu_read_unlock(); 5567 5367 5568 - return false; 5368 + return ret; 5569 5369 } 5570 5370 5571 5371 static int __init setup_swap_account(char *s) ··· 5763 5559 if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) 5764 5560 return; 5765 5561 5562 + if (obj_cgroup_is_root(objcg)) 5563 + return; 5564 + 5766 5565 VM_WARN_ON_ONCE(!(current->flags & PF_MEMALLOC)); 5767 5566 5768 5567 /* PF_MEMALLOC context, charging must succeed */ ··· 5793 5586 struct mem_cgroup *memcg; 5794 5587 5795 5588 if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) 5589 + return; 5590 + 5591 + if (obj_cgroup_is_root(objcg)) 5796 5592 return; 5797 5593 5798 5594 obj_cgroup_uncharge(objcg, size);
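Note on the folio_lruvec_lock_irqsave() hunk in the mm/memcontrol.c diff above: the new code takes lru_lock, rechecks that the locked lruvec still belongs to the folio's memcg, and retries if it does not. Below is a minimal userspace sketch of that lock-then-revalidate pattern. Every name in it (struct owner, struct item, item_lock_owner, item_reparent) is a hypothetical stand-in rather than a kernel type, a C11 atomic_flag spin lock replaces lru_lock, and the RCU protection that keeps the memcg alive in the kernel is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a lruvec: a lock plus an identifier. */
struct owner {
	atomic_flag lock;	/* plays the role of lruvec->lru_lock */
	int id;
};

/* Hypothetical stand-in for a folio: points at its current owner. */
struct item {
	_Atomic(struct owner *) owner;	/* may be re-pointed concurrently */
};

static void owner_lock(struct owner *o)
{
	while (atomic_flag_test_and_set_explicit(&o->lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void owner_unlock(struct owner *o)
{
	atomic_flag_clear_explicit(&o->lock, memory_order_release);
}

/* Lock whichever owner the item belongs to once the lock is held. */
static struct owner *item_lock_owner(struct item *it)
{
	struct owner *o;

	for (;;) {
		o = atomic_load(&it->owner);
		owner_lock(o);
		if (atomic_load(&it->owner) == o)
			return o;	/* binding cannot change while we hold the lock */
		owner_unlock(o);	/* owner changed before we got the lock: retry */
	}
}

/* Move the item to another owner while holding the old owner's lock. */
static void item_reparent(struct item *it, struct owner *parent)
{
	struct owner *old;

	assert(parent != NULL);
	old = item_lock_owner(it);
	atomic_store(&it->owner, parent);
	owner_unlock(old);
}
```

The recheck is only meaningful because item_reparent() changes the owner while holding the old owner's lock. The sketch assumes the kernel's reparenting path follows the same rule; the diff shown here does not confirm that.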
+26 -8
mm/memfd_luo.c
···
 	if (!size) {
 		*nr_foliosp = 0;
 		*out_folios_ser = NULL;
-		memset(kho_vmalloc, 0, sizeof(*kho_vmalloc));
 		return 0;
 	}

···
 	struct inode *inode = file_inode(file);
 	struct address_space *mapping = inode->i_mapping;
 	struct folio *folio;
+	long npages, nr_added_pages = 0;
 	int err = -EIO;
 	long i;

···
 		if (flags & MEMFD_LUO_FOLIO_DIRTY)
 			folio_mark_dirty(folio);

-		err = shmem_inode_acct_blocks(inode, 1);
+		npages = folio_nr_pages(folio);
+		err = shmem_inode_acct_blocks(inode, npages);
 		if (err) {
-			pr_err("shmem: failed to account folio index %ld: %d\n",
-			       i, err);
-			goto unlock_folio;
+			pr_err("shmem: failed to account folio index %ld(%ld pages): %d\n",
+			       i, npages, err);
+			goto remove_from_cache;
 		}

-		shmem_recalc_inode(inode, 1, 0);
+		nr_added_pages += npages;
 		folio_add_lru(folio);
 		folio_unlock(folio);
 		folio_put(folio);
 	}

+	shmem_recalc_inode(inode, nr_added_pages, 0);
+
 	return 0;

+remove_from_cache:
+	filemap_remove_folio(folio);
 unlock_folio:
 	folio_unlock(folio);
 	folio_put(folio);
···
 	 */
 	for (long j = i + 1; j < nr_folios; j++) {
 		const struct memfd_luo_folio_ser *pfolio = &folios_ser[j];
+		phys_addr_t phys;

-		folio = kho_restore_folio(pfolio->pfn);
+		if (!pfolio->pfn)
+			continue;
+
+		phys = PFN_PHYS(pfolio->pfn);
+		folio = kho_restore_folio(phys);
 		if (folio)
 			folio_put(folio);
 	}
+
+	shmem_recalc_inode(inode, nr_added_pages, 0);

 	return err;
 }
···
 	}

 	vfs_setpos(file, ser->pos, MAX_LFS_FILESIZE);
-	file->f_inode->i_size = ser->size;
+	i_size_write(file_inode(file), ser->size);

 	if (ser->nr_folios) {
 		folios_ser = kho_restore_vmalloc(&ser->folios);
···
 	return shmem_file(file) && !inode->i_nlink;
 }

+static unsigned long memfd_luo_get_id(struct file *file)
+{
+	return (unsigned long)file_inode(file);
+}
+
 static const struct liveupdate_file_ops memfd_luo_file_ops = {
 	.freeze = memfd_luo_freeze,
 	.finish = memfd_luo_finish,
···
 	.preserve = memfd_luo_preserve,
 	.unpreserve = memfd_luo_unpreserve,
 	.can_preserve = memfd_luo_can_preserve,
+	.get_id = memfd_luo_get_id,
 	.owner = THIS_MODULE,
 };
+12 -11
mm/mempolicy.c
···
 		new_wi_state->iw_table[i] = 1;

 	mutex_lock(&wi_state_lock);
-	if (!input) {
-		old_wi_state = rcu_dereference_protected(wi_state,
-					lockdep_is_held(&wi_state_lock));
-		if (!old_wi_state)
-			goto update_wi_state;
-		if (input == old_wi_state->mode_auto) {
-			mutex_unlock(&wi_state_lock);
-			return count;
-		}
+	old_wi_state = rcu_dereference_protected(wi_state,
+				lockdep_is_held(&wi_state_lock));

-		memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
-		       nr_node_ids * sizeof(u8));
+	if (old_wi_state && input == old_wi_state->mode_auto) {
+		mutex_unlock(&wi_state_lock);
+		kfree(new_wi_state);
+		return count;
+	}
+
+	if (!input) {
+		if (old_wi_state)
+			memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
+			       nr_node_ids * sizeof(u8));
 		goto update_wi_state;
 	}
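Note on the mm/mempolicy.c hunk above: the removed early return left new_wi_state allocated, while the added one calls kfree() first, so the change appears to close a leak on the "nothing to change" path. The following is a minimal userspace sketch of that shape. The names wi_state_model, set_mode_auto, and live_allocs are hypothetical, malloc()/free() stand in for the kernel allocator, and the wi_state_lock mutex and RCU publication are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, reduced model of the state object in the diff. */
struct wi_state_model {
	bool mode_auto;
};

static struct wi_state_model *current_state;	/* stands in for wi_state */
static int live_allocs;				/* allocations not yet freed */

static struct wi_state_model *state_alloc(bool mode_auto)
{
	struct wi_state_model *s = malloc(sizeof(*s));

	if (s) {
		s->mode_auto = mode_auto;
		live_allocs++;
	}
	return s;
}

static void state_free(struct wi_state_model *s)
{
	if (s)
		live_allocs--;
	free(s);
}

/* Returns 0 on success, -1 if the allocation failed. */
static int set_mode_auto(bool input)
{
	struct wi_state_model *new_state = state_alloc(input);
	struct wi_state_model *old_state;

	if (!new_state)
		return -1;

	old_state = current_state;	/* the kernel reads this under a mutex */
	if (old_state && input == old_state->mode_auto) {
		/* Nothing changes: release the allocation we no longer need. */
		state_free(new_state);
		return 0;
	}

	current_state = new_state;
	state_free(old_state);
	return 0;
}
```

Calling set_mode_auto() twice with the same value leaves exactly one live allocation; without the state_free() on the early return, the second call would leave two.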
+2
mm/migrate.c
···
 		struct lruvec *old_lruvec, *new_lruvec;
 		struct mem_cgroup *memcg;

+		rcu_read_lock();
 		memcg = folio_memcg(folio);
 		old_lruvec = mem_cgroup_lruvec(memcg, oldzone->zone_pgdat);
 		new_lruvec = mem_cgroup_lruvec(memcg, newzone->zone_pgdat);
···
 			mod_lruvec_state(new_lruvec, NR_FILE_DIRTY, nr);
 			__mod_zone_page_state(newzone, NR_ZONE_WRITE_PENDING, nr);
 		}
+		rcu_read_unlock();
 	}
 	local_irq_enable();
-6
mm/migrate_device.c
···
 			return migrate_vma_collect_skip(start, end, walk);
 		}

-		if (softleaf_is_migration(entry)) {
-			softleaf_entry_wait_on_locked(entry, ptl);
-			spin_unlock(ptl);
-			return -EAGAIN;
-		}
-
 		if (softleaf_is_device_private_write(entry))
 			write = MIGRATE_PFN_WRITE;
 	} else {
+1 -1
mm/mlock.c
···
 	}

 	if (lruvec)
-		unlock_page_lruvec_irq(lruvec);
+		lruvec_unlock_irq(lruvec);
 	folios_put(fbatch);
 }
+124 -94
mm/mprotect.c
··· 117 117 } 118 118 119 119 /* Set nr_ptes number of ptes, starting from idx */ 120 - static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr, 121 - pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, 122 - int idx, bool set_write, struct mmu_gather *tlb) 120 + static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma, 121 + unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent, 122 + int nr_ptes, int idx, bool set_write, struct mmu_gather *tlb) 123 123 { 124 124 /* 125 125 * Advance the position in the batch by idx; note that if idx > 0, ··· 143 143 * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce 144 144 * that the ptes point to consecutive pages of the same anon large folio. 145 145 */ 146 - static int page_anon_exclusive_sub_batch(int start_idx, int max_len, 146 + static __always_inline int page_anon_exclusive_sub_batch(int start_idx, int max_len, 147 147 struct page *first_page, bool expected_anon_exclusive) 148 148 { 149 149 int idx; ··· 169 169 * pte of the batch. Therefore, we must individually check all pages and 170 170 * retrieve sub-batches. 
171 171 */ 172 - static void commit_anon_folio_batch(struct vm_area_struct *vma, 172 + static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma, 173 173 struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep, 174 174 pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) 175 175 { ··· 188 188 } 189 189 } 190 190 191 - static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, 191 + static __always_inline void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, 192 192 struct folio *folio, struct page *page, unsigned long addr, pte_t *ptep, 193 193 pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) 194 194 { ··· 211 211 commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb); 212 212 } 213 213 214 + static long change_softleaf_pte(struct vm_area_struct *vma, 215 + unsigned long addr, pte_t *pte, pte_t oldpte, unsigned long cp_flags) 216 + { 217 + const bool uffd_wp = cp_flags & MM_CP_UFFD_WP; 218 + const bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE; 219 + softleaf_t entry = softleaf_from_pte(oldpte); 220 + pte_t newpte; 221 + 222 + if (softleaf_is_migration_write(entry)) { 223 + const struct folio *folio = softleaf_to_folio(entry); 224 + 225 + /* 226 + * A protection check is difficult so 227 + * just be safe and disable write 228 + */ 229 + if (folio_test_anon(folio)) 230 + entry = make_readable_exclusive_migration_entry(swp_offset(entry)); 231 + else 232 + entry = make_readable_migration_entry(swp_offset(entry)); 233 + newpte = swp_entry_to_pte(entry); 234 + if (pte_swp_soft_dirty(oldpte)) 235 + newpte = pte_swp_mksoft_dirty(newpte); 236 + } else if (softleaf_is_device_private_write(entry)) { 237 + /* 238 + * We do not preserve soft-dirtiness. See 239 + * copy_nonpresent_pte() for explanation. 
240 + */ 241 + entry = make_readable_device_private_entry(swp_offset(entry)); 242 + newpte = swp_entry_to_pte(entry); 243 + if (pte_swp_uffd_wp(oldpte)) 244 + newpte = pte_swp_mkuffd_wp(newpte); 245 + } else if (softleaf_is_marker(entry)) { 246 + /* 247 + * Ignore error swap entries unconditionally, 248 + * because any access should sigbus/sigsegv 249 + * anyway. 250 + */ 251 + if (softleaf_is_poison_marker(entry) || 252 + softleaf_is_guard_marker(entry)) 253 + return 0; 254 + /* 255 + * If this is uffd-wp pte marker and we'd like 256 + * to unprotect it, drop it; the next page 257 + * fault will trigger without uffd trapping. 258 + */ 259 + if (uffd_wp_resolve) { 260 + pte_clear(vma->vm_mm, addr, pte); 261 + return 1; 262 + } 263 + return 0; 264 + } else { 265 + newpte = oldpte; 266 + } 267 + 268 + if (uffd_wp) 269 + newpte = pte_swp_mkuffd_wp(newpte); 270 + else if (uffd_wp_resolve) 271 + newpte = pte_swp_clear_uffd_wp(newpte); 272 + 273 + if (!pte_same(oldpte, newpte)) { 274 + set_pte_at(vma->vm_mm, addr, pte, newpte); 275 + return 1; 276 + } 277 + return 0; 278 + } 279 + 280 + static __always_inline void change_present_ptes(struct mmu_gather *tlb, 281 + struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, 282 + int nr_ptes, unsigned long end, pgprot_t newprot, 283 + struct folio *folio, struct page *page, unsigned long cp_flags) 284 + { 285 + const bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE; 286 + const bool uffd_wp = cp_flags & MM_CP_UFFD_WP; 287 + pte_t ptent, oldpte; 288 + 289 + oldpte = modify_prot_start_ptes(vma, addr, ptep, nr_ptes); 290 + ptent = pte_modify(oldpte, newprot); 291 + 292 + if (uffd_wp) 293 + ptent = pte_mkuffd_wp(ptent); 294 + else if (uffd_wp_resolve) 295 + ptent = pte_clear_uffd_wp(ptent); 296 + 297 + /* 298 + * In some writable, shared mappings, we might want 299 + * to catch actual write access -- see 300 + * vma_wants_writenotify(). 
301 + * 302 + * In all writable, private mappings, we have to 303 + * properly handle COW. 304 + * 305 + * In both cases, we can sometimes still change PTEs 306 + * writable and avoid the write-fault handler, for 307 + * example, if a PTE is already dirty and no other 308 + * COW or special handling is required. 309 + */ 310 + if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && 311 + !pte_write(ptent)) 312 + set_write_prot_commit_flush_ptes(vma, folio, page, 313 + addr, ptep, oldpte, ptent, nr_ptes, tlb); 314 + else 315 + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, 316 + nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb); 317 + } 318 + 214 319 static long change_pte_range(struct mmu_gather *tlb, 215 320 struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, 216 321 unsigned long end, pgprot_t newprot, unsigned long cp_flags) ··· 326 221 bool is_private_single_threaded; 327 222 bool prot_numa = cp_flags & MM_CP_PROT_NUMA; 328 223 bool uffd_wp = cp_flags & MM_CP_UFFD_WP; 329 - bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE; 330 224 int nr_ptes; 331 225 332 226 tlb_change_page_size(tlb, PAGE_SIZE); ··· 346 242 int max_nr_ptes = (end - addr) >> PAGE_SHIFT; 347 243 struct folio *folio = NULL; 348 244 struct page *page; 349 - pte_t ptent; 350 245 351 246 /* Already in the desired state. */ 352 247 if (prot_numa && pte_protnone(oldpte)) ··· 371 268 372 269 nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags); 373 270 374 - oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes); 375 - ptent = pte_modify(oldpte, newprot); 376 - 377 - if (uffd_wp) 378 - ptent = pte_mkuffd_wp(ptent); 379 - else if (uffd_wp_resolve) 380 - ptent = pte_clear_uffd_wp(ptent); 381 - 382 271 /* 383 - * In some writable, shared mappings, we might want 384 - * to catch actual write access -- see 385 - * vma_wants_writenotify(). 386 - * 387 - * In all writable, private mappings, we have to 388 - * properly handle COW. 
389 - * 390 - * In both cases, we can sometimes still change PTEs 391 - * writable and avoid the write-fault handler, for 392 - * example, if a PTE is already dirty and no other 393 - * COW or special handling is required. 272 + * Optimize for the small-folio common case by 273 + * special-casing it here. Compiler constant propagation 274 + * plus copious amounts of __always_inline does wonders. 394 275 */ 395 - if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && 396 - !pte_write(ptent)) 397 - set_write_prot_commit_flush_ptes(vma, folio, page, 398 - addr, pte, oldpte, ptent, nr_ptes, tlb); 399 - else 400 - prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent, 401 - nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb); 276 + if (likely(nr_ptes == 1)) { 277 + change_present_ptes(tlb, vma, addr, pte, 1, 278 + end, newprot, folio, page, cp_flags); 279 + } else { 280 + change_present_ptes(tlb, vma, addr, pte, 281 + nr_ptes, end, newprot, folio, page, 282 + cp_flags); 283 + } 284 + 402 285 pages += nr_ptes; 403 286 } else if (pte_none(oldpte)) { 404 287 /* ··· 406 317 pages++; 407 318 } 408 319 } else { 409 - softleaf_t entry = softleaf_from_pte(oldpte); 410 - pte_t newpte; 411 - 412 - if (softleaf_is_migration_write(entry)) { 413 - const struct folio *folio = softleaf_to_folio(entry); 414 - 415 - /* 416 - * A protection check is difficult so 417 - * just be safe and disable write 418 - */ 419 - if (folio_test_anon(folio)) 420 - entry = make_readable_exclusive_migration_entry( 421 - swp_offset(entry)); 422 - else 423 - entry = make_readable_migration_entry(swp_offset(entry)); 424 - newpte = swp_entry_to_pte(entry); 425 - if (pte_swp_soft_dirty(oldpte)) 426 - newpte = pte_swp_mksoft_dirty(newpte); 427 - } else if (softleaf_is_device_private_write(entry)) { 428 - /* 429 - * We do not preserve soft-dirtiness. See 430 - * copy_nonpresent_pte() for explanation. 
431 - */ 432 - entry = make_readable_device_private_entry( 433 - swp_offset(entry)); 434 - newpte = swp_entry_to_pte(entry); 435 - if (pte_swp_uffd_wp(oldpte)) 436 - newpte = pte_swp_mkuffd_wp(newpte); 437 - } else if (softleaf_is_marker(entry)) { 438 - /* 439 - * Ignore error swap entries unconditionally, 440 - * because any access should sigbus/sigsegv 441 - * anyway. 442 - */ 443 - if (softleaf_is_poison_marker(entry) || 444 - softleaf_is_guard_marker(entry)) 445 - continue; 446 - /* 447 - * If this is uffd-wp pte marker and we'd like 448 - * to unprotect it, drop it; the next page 449 - * fault will trigger without uffd trapping. 450 - */ 451 - if (uffd_wp_resolve) { 452 - pte_clear(vma->vm_mm, addr, pte); 453 - pages++; 454 - } 455 - continue; 456 - } else { 457 - newpte = oldpte; 458 - } 459 - 460 - if (uffd_wp) 461 - newpte = pte_swp_mkuffd_wp(newpte); 462 - else if (uffd_wp_resolve) 463 - newpte = pte_swp_clear_uffd_wp(newpte); 464 - 465 - if (!pte_same(oldpte, newpte)) { 466 - set_pte_at(vma->vm_mm, addr, pte, newpte); 467 - pages++; 468 - } 320 + pages += change_softleaf_pte(vma, addr, pte, oldpte, cp_flags); 469 321 } 470 322 } while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end); 471 323 lazy_mmu_mode_disable();
+9 -1
mm/page_alloc.c
···
 	union pgtag_ref_handle handle;
 	union codetag_ref ref;

-	if (get_page_tag_ref(page, &ref, &handle)) {
+	if (likely(get_page_tag_ref(page, &ref, &handle))) {
 		alloc_tag_add(&ref, task->alloc_tag, PAGE_SIZE * nr);
 		update_page_tag_ref(handle, &ref);
 		put_page_tag_ref(handle);
+	} else {
+		/*
+		 * page_ext is not available yet, record the pfn so we can
+		 * clear the tag ref later when page_ext is initialized.
+		 */
+		alloc_tag_add_early_pfn(page_to_pfn(page));
+		if (task->alloc_tag)
+			alloc_tag_set_inaccurate(task->alloc_tag);
 	}
 }
+7 -3
mm/page_io.c
···
 		count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
 		goto out_unlock;
 	}
+
+	rcu_read_lock();
 	if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) {
+		rcu_read_unlock();
 		folio_mark_dirty(folio);
 		return AOP_WRITEPAGE_ACTIVATE;
 	}
+	rcu_read_unlock();

 	__swap_writepage(folio, swap_plug);
 	return 0;
···
 	struct cgroup_subsys_state *css;
 	struct mem_cgroup *memcg;

-	memcg = folio_memcg(folio);
-	if (!memcg)
+	if (!folio_memcg_charged(folio))
 		return;

 	rcu_read_lock();
+	memcg = folio_memcg(folio);
 	css = cgroup_e_css(memcg->css.cgroup, &io_cgrp_subsys);
 	bio_associate_blkg_from_css(bio, css);
 	rcu_read_unlock();
···
 			folio_mark_uptodate(folio);
 			folio_unlock(folio);
 		}
-		count_vm_events(PSWPIN, sio->pages);
+		count_vm_events(PSWPIN, sio->len >> PAGE_SHIFT);
 	} else {
 		for (p = 0; p < sio->pages; p++) {
 			struct folio *folio = page_folio(sio->bvec[p].bv_page);
+1 -1
mm/percpu.c
···
 		return true;

 	objcg = current_obj_cgroup();
-	if (!objcg)
+	if (!objcg || obj_cgroup_is_root(objcg))
 		return true;

 	if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size)))
+80 -94
mm/shmem.c
···
 #endif /* CONFIG_TMPFS_QUOTA */

 #ifdef CONFIG_USERFAULTFD
-int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
-			   struct vm_area_struct *dst_vma,
-			   unsigned long dst_addr,
-			   unsigned long src_addr,
-			   uffd_flags_t flags,
-			   struct folio **foliop)
+static struct folio *shmem_mfill_folio_alloc(struct vm_area_struct *vma,
+					     unsigned long addr)
 {
-	struct inode *inode = file_inode(dst_vma->vm_file);
-	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct inode *inode = file_inode(vma->vm_file);
 	struct address_space *mapping = inode->i_mapping;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	pgoff_t pgoff = linear_page_index(vma, addr);
 	gfp_t gfp = mapping_gfp_mask(mapping);
-	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
-	void *page_kaddr;
 	struct folio *folio;
-	int ret;
-	pgoff_t max_off;

-	if (shmem_inode_acct_blocks(inode, 1)) {
-		/*
-		 * We may have got a page, returned -ENOENT triggering a retry,
-		 * and now we find ourselves with -ENOMEM. Release the page, to
-		 * avoid a BUG_ON in our caller.
-		 */
-		if (unlikely(*foliop)) {
-			folio_put(*foliop);
-			*foliop = NULL;
-		}
-		return -ENOMEM;
+	if (unlikely(pgoff >= DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE)))
+		return NULL;
+
+	folio = shmem_alloc_folio(gfp, 0, info, pgoff);
+	if (!folio)
+		return NULL;
+
+	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) {
+		folio_put(folio);
+		return NULL;
 	}

-	if (!*foliop) {
-		ret = -ENOMEM;
-		folio = shmem_alloc_folio(gfp, 0, info, pgoff);
-		if (!folio)
-			goto out_unacct_blocks;
+	return folio;
+}

-		if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) {
-			page_kaddr = kmap_local_folio(folio, 0);
-			/*
-			 * The read mmap_lock is held here. Despite the
-			 * mmap_lock being read recursive a deadlock is still
-			 * possible if a writer has taken a lock. For example:
-			 *
-			 * process A thread 1 takes read lock on own mmap_lock
-			 * process A thread 2 calls mmap, blocks taking write lock
-			 * process B thread 1 takes page fault, read lock on own mmap lock
-			 * process B thread 2 calls mmap, blocks taking write lock
-			 * process A thread 1 blocks taking read lock on process B
-			 * process B thread 1 blocks taking read lock on process A
-			 *
-			 * Disable page faults to prevent potential deadlock
-			 * and retry the copy outside the mmap_lock.
-			 */
-			pagefault_disable();
-			ret = copy_from_user(page_kaddr,
-					     (const void __user *)src_addr,
-					     PAGE_SIZE);
-			pagefault_enable();
-			kunmap_local(page_kaddr);
+static int shmem_mfill_filemap_add(struct folio *folio,
+				   struct vm_area_struct *vma,
+				   unsigned long addr)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	struct address_space *mapping = inode->i_mapping;
+	pgoff_t pgoff = linear_page_index(vma, addr);
+	gfp_t gfp = mapping_gfp_mask(mapping);
+	int err;

-			/* fallback to copy_from_user outside mmap_lock */
-			if (unlikely(ret)) {
-				*foliop = folio;
-				ret = -ENOENT;
-				/* don't free the page */
-				goto out_unacct_blocks;
-			}
-
-			flush_dcache_folio(folio);
-		} else { /* ZEROPAGE */
-			clear_user_highpage(&folio->page, dst_addr);
-		}
-	} else {
-		folio = *foliop;
-		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-		*foliop = NULL;
-	}
-
-	VM_BUG_ON(folio_test_locked(folio));
-	VM_BUG_ON(folio_test_swapbacked(folio));
 	__folio_set_locked(folio);
 	__folio_set_swapbacked(folio);
-	__folio_mark_uptodate(folio);

-	ret = -EFAULT;
-	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
-	if (unlikely(pgoff >= max_off))
-		goto out_release;
+	err = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp);
+	if (err)
+		goto err_unlock;

-	ret = mem_cgroup_charge(folio, dst_vma->vm_mm, gfp);
-	if (ret)
-		goto out_release;
-	ret = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp);
-	if (ret)
-		goto out_release;
+	if (shmem_inode_acct_blocks(inode, 1)) {
+		err = -ENOMEM;
+		goto err_delete_from_cache;
+	}

-	ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
-				       &folio->page, true, flags);
-	if (ret)
-		goto out_delete_from_cache;
-
+	folio_add_lru(folio);
 	shmem_recalc_inode(inode, 1, 0);
-	folio_unlock(folio);
+
 	return 0;
-out_delete_from_cache:
+
+err_delete_from_cache:
 	filemap_remove_folio(folio);
-out_release:
+err_unlock:
 	folio_unlock(folio);
-	folio_put(folio);
-out_unacct_blocks:
-	shmem_inode_unacct_blocks(inode, 1);
-	return ret;
+	return err;
 }
+
+static void shmem_mfill_filemap_remove(struct folio *folio,
+				       struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+
+	filemap_remove_folio(folio);
+	shmem_recalc_inode(inode, 0, 0);
+	folio_unlock(folio);
+}
+
+static struct folio *shmem_get_folio_noalloc(struct inode *inode, pgoff_t pgoff)
+{
+	struct folio *folio;
+	int err;
+
+	err = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
+	if (err)
+		return ERR_PTR(err);
+
+	return folio;
+}
+
+static bool shmem_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags)
+{
+	return true;
+}
+
+static const struct vm_uffd_ops shmem_uffd_ops = {
+	.can_userfault		= shmem_can_userfault,
+	.get_folio_noalloc	= shmem_get_folio_noalloc,
+	.alloc_folio		= shmem_mfill_folio_alloc,
+	.filemap_add		= shmem_mfill_filemap_add,
+	.filemap_remove		= shmem_mfill_filemap_remove,
+};
 #endif /* CONFIG_USERFAULTFD */

 #ifdef CONFIG_TMPFS
···
 	.set_policy	= shmem_set_policy,
 	.get_policy	= shmem_get_policy,
 #endif
+#ifdef CONFIG_USERFAULTFD
+	.uffd_ops	= &shmem_uffd_ops,
+#endif
 };

 static const struct vm_operations_struct shmem_anon_vm_ops = {
···
 #ifdef CONFIG_NUMA
 	.set_policy	= shmem_set_policy,
 	.get_policy	= shmem_get_policy,
+#endif
+#ifdef CONFIG_USERFAULTFD
+	.uffd_ops	= &shmem_uffd_ops,
 #endif
 };
+1 -5
mm/shrinker.c
···
 {
 	int nid, index, offset;
 	long nr;
-	struct mem_cgroup *parent;
+	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
 	struct shrinker_info *child_info, *parent_info;
 	struct shrinker_info_unit *child_unit, *parent_unit;
-
-	parent = parent_mem_cgroup(memcg);
-	if (!parent)
-		parent = root_mem_cgroup;

 	/* Prevent from concurrent shrinker_info expand */
 	mutex_lock(&shrinker_mutex);
-1
mm/sparse.c
···
 		ms = __nr_to_section(pnum);
 		if (!preinited_vmemmap_section(ms))
 			ms->section_mem_map = 0;
-		ms->section_mem_map = 0;
 	}
 }
+49 -10
mm/swap.c
···

 	__page_cache_release(folio, &lruvec, &flags);
 	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
+		lruvec_unlock_irqrestore(lruvec, flags);
 }

 void __folio_put(struct folio *folio)
···
 	}

 	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
+		lruvec_unlock_irqrestore(lruvec, flags);
 	folios_put(fbatch);
 }

···
 void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file,
 		unsigned int nr_io, unsigned int nr_rotated)
 		__releases(lruvec->lru_lock)
+		__releases(rcu)
 {
 	unsigned long cost;

···
 	cost = nr_io * SWAP_CLUSTER_MAX + nr_rotated;
 	if (!cost) {
 		spin_unlock_irq(&lruvec->lru_lock);
+		rcu_read_unlock();
 		return;
 	}

···

 		spin_unlock_irq(&lruvec->lru_lock);
 		lruvec = parent_lruvec(lruvec);
-		if (!lruvec)
+		if (!lruvec) {
+			rcu_read_unlock();
 			break;
+		}
 		spin_lock_irq(&lruvec->lru_lock);
 	}
 }
···

 	lruvec = folio_lruvec_lock_irq(folio);
 	lru_activate(lruvec, folio);
-	unlock_page_lruvec_irq(lruvec);
+	lruvec_unlock_irq(lruvec);
 	folio_set_lru(folio);
 }
 #endif
···

 static bool lru_gen_clear_refs(struct folio *folio)
 {
-	struct lru_gen_folio *lrugen;
 	int gen = folio_lru_gen(folio);
 	int type = folio_is_file_lru(folio);
+	unsigned long seq;

 	if (gen < 0)
 		return true;

 	set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS | BIT(PG_workingset), 0);

-	lrugen = &folio_lruvec(folio)->lrugen;
+	rcu_read_lock();
+	seq = READ_ONCE(folio_lruvec(folio)->lrugen.min_seq[type]);
+	rcu_read_unlock();
 	/* whether can do without shuffling under the LRU lock */
-	return gen == lru_gen_from_seq(READ_ONCE(lrugen->min_seq[type]));
+	return gen == lru_gen_from_seq(seq);
 }

 #else /* !CONFIG_LRU_GEN */
···

 		if (folio_is_zone_device(folio)) {
 			if (lruvec) {
-				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec_unlock_irqrestore(lruvec, flags);
 				lruvec = NULL;
 			}
 			if (folio_ref_sub_and_test(folio, nr_refs))
···
 		/* hugetlb has its own memcg */
 		if (folio_test_hugetlb(folio)) {
 			if (lruvec) {
-				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec_unlock_irqrestore(lruvec, flags);
 				lruvec = NULL;
 			}
 			free_huge_folio(folio);
···
 		j++;
 	}
 	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
+		lruvec_unlock_irqrestore(lruvec, flags);
 	if (!j) {
 		folio_batch_reinit(folios);
 		return;
···
 	}
 	fbatch->nr = j;
 }
+
+#ifdef CONFIG_MEMCG
+static void lruvec_reparent_lru(struct lruvec *child_lruvec,
+				struct lruvec *parent_lruvec,
+				enum lru_list lru, int nid)
+{
+	int zid;
+	struct zone *zone;
+
+	if (lru != LRU_UNEVICTABLE)
+		list_splice_tail_init(&child_lruvec->lists[lru], &parent_lruvec->lists[lru]);
+
+	for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1) {
+		unsigned long size = mem_cgroup_get_zone_lru_size(child_lruvec, lru, zid);
+
+		mem_cgroup_update_lru_size(parent_lruvec, lru, zid, size);
+	}
+}
+
+void lru_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid)
+{
+	enum lru_list lru;
+	struct lruvec *child_lruvec, *parent_lruvec;
+
+	child_lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+	parent_lruvec = mem_cgroup_lruvec(parent, NODE_DATA(nid));
+	parent_lruvec->anon_cost += child_lruvec->anon_cost;
+	parent_lruvec->file_cost += child_lruvec->file_cost;
+
+	for_each_lru(lru)
+		lruvec_reparent_lru(child_lruvec, parent_lruvec, lru, nid);
+}
+#endif

 static const struct ctl_table swap_sysctl_table[] = {
 	{
+399 -291
mm/userfaultfd.c
··· 14 14 #include <linux/userfaultfd_k.h> 15 15 #include <linux/mmu_notifier.h> 16 16 #include <linux/hugetlb.h> 17 - #include <linux/shmem_fs.h> 18 17 #include <asm/tlbflush.h> 19 18 #include <asm/tlb.h> 20 19 #include "internal.h" 21 20 #include "swap.h" 21 + 22 + struct mfill_state { 23 + struct userfaultfd_ctx *ctx; 24 + unsigned long src_start; 25 + unsigned long dst_start; 26 + unsigned long len; 27 + uffd_flags_t flags; 28 + 29 + struct vm_area_struct *vma; 30 + unsigned long src_addr; 31 + unsigned long dst_addr; 32 + pmd_t *pmd; 33 + }; 34 + 35 + static bool anon_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags) 36 + { 37 + /* anonymous memory does not support MINOR mode */ 38 + if (vm_flags & VM_UFFD_MINOR) 39 + return false; 40 + return true; 41 + } 42 + 43 + static struct folio *anon_alloc_folio(struct vm_area_struct *vma, 44 + unsigned long addr) 45 + { 46 + struct folio *folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, 47 + addr); 48 + 49 + if (!folio) 50 + return NULL; 51 + 52 + if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) { 53 + folio_put(folio); 54 + return NULL; 55 + } 56 + 57 + return folio; 58 + } 59 + 60 + static const struct vm_uffd_ops anon_uffd_ops = { 61 + .can_userfault = anon_can_userfault, 62 + .alloc_folio = anon_alloc_folio, 63 + }; 64 + 65 + static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma) 66 + { 67 + if (vma_is_anonymous(vma)) 68 + return &anon_uffd_ops; 69 + return vma->vm_ops ? 
vma->vm_ops->uffd_ops : NULL; 70 + } 22 71 23 72 static __always_inline 24 73 bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end) ··· 192 143 } 193 144 #endif 194 145 146 + static void mfill_put_vma(struct mfill_state *state) 147 + { 148 + if (!state->vma) 149 + return; 150 + 151 + up_read(&state->ctx->map_changing_lock); 152 + uffd_mfill_unlock(state->vma); 153 + state->vma = NULL; 154 + } 155 + 156 + static int mfill_get_vma(struct mfill_state *state) 157 + { 158 + struct userfaultfd_ctx *ctx = state->ctx; 159 + uffd_flags_t flags = state->flags; 160 + struct vm_area_struct *dst_vma; 161 + const struct vm_uffd_ops *ops; 162 + int err; 163 + 164 + /* 165 + * Make sure the vma is not shared, that the dst range is 166 + * both valid and fully within a single existing vma. 167 + */ 168 + dst_vma = uffd_mfill_lock(ctx->mm, state->dst_start, state->len); 169 + if (IS_ERR(dst_vma)) 170 + return PTR_ERR(dst_vma); 171 + 172 + /* 173 + * If memory mappings are changing because of non-cooperative 174 + * operation (e.g. mremap) running in parallel, bail out and 175 + * request the user to retry later 176 + */ 177 + down_read(&ctx->map_changing_lock); 178 + state->vma = dst_vma; 179 + err = -EAGAIN; 180 + if (atomic_read(&ctx->mmap_changing)) 181 + goto out_unlock; 182 + 183 + err = -EINVAL; 184 + 185 + /* 186 + * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but 187 + * it will overwrite vm_ops, so vma_is_anonymous must return false. 188 + */ 189 + if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) && 190 + dst_vma->vm_flags & VM_SHARED)) 191 + goto out_unlock; 192 + 193 + /* 194 + * validate 'mode' now that we know the dst_vma: don't allow 195 + * a wrprotect copy if the userfaultfd didn't register as WP. 
196 + */ 197 + if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP)) 198 + goto out_unlock; 199 + 200 + if (is_vm_hugetlb_page(dst_vma)) 201 + return 0; 202 + 203 + ops = vma_uffd_ops(dst_vma); 204 + if (!ops) 205 + goto out_unlock; 206 + 207 + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE) && 208 + !ops->get_folio_noalloc) 209 + goto out_unlock; 210 + 211 + return 0; 212 + 213 + out_unlock: 214 + mfill_put_vma(state); 215 + return err; 216 + } 217 + 218 + static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address) 219 + { 220 + pgd_t *pgd; 221 + p4d_t *p4d; 222 + pud_t *pud; 223 + 224 + pgd = pgd_offset(mm, address); 225 + p4d = p4d_alloc(mm, pgd, address); 226 + if (!p4d) 227 + return NULL; 228 + pud = pud_alloc(mm, p4d, address); 229 + if (!pud) 230 + return NULL; 231 + /* 232 + * Note that we didn't run this because the pmd was 233 + * missing, the *pmd may be already established and in 234 + * turn it may also be a trans_huge_pmd. 235 + */ 236 + return pmd_alloc(mm, pud, address); 237 + } 238 + 239 + static int mfill_establish_pmd(struct mfill_state *state) 240 + { 241 + struct mm_struct *dst_mm = state->ctx->mm; 242 + pmd_t *dst_pmd, dst_pmdval; 243 + 244 + dst_pmd = mm_alloc_pmd(dst_mm, state->dst_addr); 245 + if (unlikely(!dst_pmd)) 246 + return -ENOMEM; 247 + 248 + dst_pmdval = pmdp_get_lockless(dst_pmd); 249 + if (unlikely(pmd_none(dst_pmdval)) && 250 + unlikely(__pte_alloc(dst_mm, dst_pmd))) 251 + return -ENOMEM; 252 + 253 + dst_pmdval = pmdp_get_lockless(dst_pmd); 254 + /* 255 + * If the dst_pmd is THP don't override it and just be strict. 256 + * (This includes the case where the PMD used to be THP and 257 + * changed back to none after __pte_alloc().) 
258 + */ 259 + if (unlikely(!pmd_present(dst_pmdval) || pmd_leaf(dst_pmdval))) 260 + return -EEXIST; 261 + if (unlikely(pmd_bad(dst_pmdval))) 262 + return -EFAULT; 263 + 264 + state->pmd = dst_pmd; 265 + return 0; 266 + } 267 + 195 268 /* Check if dst_addr is outside of file's size. Must be called with ptl held. */ 196 269 static bool mfill_file_over_size(struct vm_area_struct *dst_vma, 197 270 unsigned long dst_addr) ··· 336 165 * This function handles both MCOPY_ATOMIC_NORMAL and _CONTINUE for both shmem 337 166 * and anon, and for both shared and private VMAs. 338 167 */ 339 - int mfill_atomic_install_pte(pmd_t *dst_pmd, 340 - struct vm_area_struct *dst_vma, 341 - unsigned long dst_addr, struct page *page, 342 - bool newly_allocated, uffd_flags_t flags) 168 + static int mfill_atomic_install_pte(pmd_t *dst_pmd, 169 + struct vm_area_struct *dst_vma, 170 + unsigned long dst_addr, struct page *page, 171 + uffd_flags_t flags) 343 172 { 344 173 int ret; 345 174 struct mm_struct *dst_mm = dst_vma->vm_mm; ··· 383 212 goto out_unlock; 384 213 385 214 if (page_in_cache) { 386 - /* Usually, cache pages are already added to LRU */ 387 - if (newly_allocated) 388 - folio_add_lru(folio); 389 215 folio_add_file_rmap_pte(folio, page, dst_vma); 390 216 } else { 391 217 folio_add_new_anon_rmap(folio, dst_vma, dst_addr, RMAP_EXCLUSIVE); ··· 397 229 398 230 set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); 399 231 232 + if (page_in_cache) 233 + folio_unlock(folio); 234 + 400 235 /* No need to invalidate - it was non-present before */ 401 236 update_mmu_cache(dst_vma, dst_addr, dst_pte); 402 237 ret = 0; ··· 409 238 return ret; 410 239 } 411 240 412 - static int mfill_atomic_pte_copy(pmd_t *dst_pmd, 413 - struct vm_area_struct *dst_vma, 414 - unsigned long dst_addr, 415 - unsigned long src_addr, 416 - uffd_flags_t flags, 417 - struct folio **foliop) 241 + static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr) 418 242 { 419 243 void *kaddr; 420 244 int ret; 245 
+ 246 + kaddr = kmap_local_folio(folio, 0); 247 + /* 248 + * The read mmap_lock is held here. Despite the 249 + * mmap_lock being read recursive a deadlock is still 250 + * possible if a writer has taken a lock. For example: 251 + * 252 + * process A thread 1 takes read lock on own mmap_lock 253 + * process A thread 2 calls mmap, blocks taking write lock 254 + * process B thread 1 takes page fault, read lock on own mmap lock 255 + * process B thread 2 calls mmap, blocks taking write lock 256 + * process A thread 1 blocks taking read lock on process B 257 + * process B thread 1 blocks taking read lock on process A 258 + * 259 + * Disable page faults to prevent potential deadlock 260 + * and retry the copy outside the mmap_lock. 261 + */ 262 + pagefault_disable(); 263 + ret = copy_from_user(kaddr, (const void __user *) src_addr, 264 + PAGE_SIZE); 265 + pagefault_enable(); 266 + kunmap_local(kaddr); 267 + 268 + if (ret) 269 + return -EFAULT; 270 + 271 + flush_dcache_folio(folio); 272 + return ret; 273 + } 274 + 275 + static int mfill_copy_folio_retry(struct mfill_state *state, struct folio *folio) 276 + { 277 + unsigned long src_addr = state->src_addr; 278 + void *kaddr; 279 + int err; 280 + 281 + /* retry copying with mm_lock dropped */ 282 + mfill_put_vma(state); 283 + 284 + kaddr = kmap_local_folio(folio, 0); 285 + err = copy_from_user(kaddr, (const void __user *) src_addr, PAGE_SIZE); 286 + kunmap_local(kaddr); 287 + if (unlikely(err)) 288 + return -EFAULT; 289 + 290 + flush_dcache_folio(folio); 291 + 292 + /* reget VMA and PMD, they could change underneath us */ 293 + err = mfill_get_vma(state); 294 + if (err) 295 + return err; 296 + 297 + err = mfill_establish_pmd(state); 298 + if (err) 299 + return err; 300 + 301 + return 0; 302 + } 303 + 304 + static int __mfill_atomic_pte(struct mfill_state *state, 305 + const struct vm_uffd_ops *ops) 306 + { 307 + unsigned long dst_addr = state->dst_addr; 308 + unsigned long src_addr = state->src_addr; 309 + uffd_flags_t 
flags = state->flags; 421 310 struct folio *folio; 311 + int ret; 422 312 423 - if (!*foliop) { 424 - ret = -ENOMEM; 425 - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, dst_vma, 426 - dst_addr); 427 - if (!folio) 428 - goto out; 313 + folio = ops->alloc_folio(state->vma, state->dst_addr); 314 + if (!folio) 315 + return -ENOMEM; 429 316 430 - kaddr = kmap_local_folio(folio, 0); 317 + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) { 318 + ret = mfill_copy_folio_locked(folio, src_addr); 431 319 /* 432 - * The read mmap_lock is held here. Despite the 433 - * mmap_lock being read recursive a deadlock is still 434 - * possible if a writer has taken a lock. For example: 435 - * 436 - * process A thread 1 takes read lock on own mmap_lock 437 - * process A thread 2 calls mmap, blocks taking write lock 438 - * process B thread 1 takes page fault, read lock on own mmap lock 439 - * process B thread 2 calls mmap, blocks taking write lock 440 - * process A thread 1 blocks taking read lock on process B 441 - * process B thread 1 blocks taking read lock on process A 442 - * 443 - * Disable page faults to prevent potential deadlock 444 - * and retry the copy outside the mmap_lock. 320 + * Fallback to copy_from_user outside mmap_lock. 321 + * If retry is successful, mfill_copy_folio_locked() returns 322 + * with locks retaken by mfill_get_vma(). 323 + * If there was an error, we must mfill_put_vma() anyway and it 324 + * will take care of unlocking if needed. 
445 325 */ 446 - pagefault_disable(); 447 - ret = copy_from_user(kaddr, (const void __user *) src_addr, 448 - PAGE_SIZE); 449 - pagefault_enable(); 450 - kunmap_local(kaddr); 451 - 452 - /* fallback to copy_from_user outside mmap_lock */ 453 326 if (unlikely(ret)) { 454 - ret = -ENOENT; 455 - *foliop = folio; 456 - /* don't free the page */ 457 - goto out; 327 + ret = mfill_copy_folio_retry(state, folio); 328 + if (ret) 329 + goto err_folio_put; 458 330 } 459 - 460 - flush_dcache_folio(folio); 331 + } else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) { 332 + clear_user_highpage(&folio->page, state->dst_addr); 461 333 } else { 462 - folio = *foliop; 463 - *foliop = NULL; 334 + VM_WARN_ONCE(1, "Unknown UFFDIO operation, flags: %x", flags); 464 335 } 465 336 466 337 /* ··· 512 299 */ 513 300 __folio_mark_uptodate(folio); 514 301 515 - ret = -ENOMEM; 516 - if (mem_cgroup_charge(folio, dst_vma->vm_mm, GFP_KERNEL)) 517 - goto out_release; 302 + if (ops->filemap_add) { 303 + ret = ops->filemap_add(folio, state->vma, state->dst_addr); 304 + if (ret) 305 + goto err_folio_put; 306 + } 518 307 519 - ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, 520 - &folio->page, true, flags); 308 + ret = mfill_atomic_install_pte(state->pmd, state->vma, dst_addr, 309 + &folio->page, flags); 521 310 if (ret) 522 - goto out_release; 523 - out: 524 - return ret; 525 - out_release: 526 - folio_put(folio); 527 - goto out; 528 - } 529 - 530 - static int mfill_atomic_pte_zeroed_folio(pmd_t *dst_pmd, 531 - struct vm_area_struct *dst_vma, 532 - unsigned long dst_addr) 533 - { 534 - struct folio *folio; 535 - int ret = -ENOMEM; 536 - 537 - folio = vma_alloc_zeroed_movable_folio(dst_vma, dst_addr); 538 - if (!folio) 539 - return ret; 540 - 541 - if (mem_cgroup_charge(folio, dst_vma->vm_mm, GFP_KERNEL)) 542 - goto out_put; 543 - 544 - /* 545 - * The memory barrier inside __folio_mark_uptodate makes sure that 546 - * zeroing out the folio become visible before mapping the page 547 - 
* using set_pte_at(). See do_anonymous_page(). 548 - */ 549 - __folio_mark_uptodate(folio); 550 - 551 - ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, 552 - &folio->page, true, 0); 553 - if (ret) 554 - goto out_put; 311 + goto err_filemap_remove; 555 312 556 313 return 0; 557 - out_put: 314 + 315 + err_filemap_remove: 316 + if (ops->filemap_remove) 317 + ops->filemap_remove(folio, state->vma); 318 + err_folio_put: 558 319 folio_put(folio); 559 320 return ret; 560 321 } 561 322 562 - static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd, 563 - struct vm_area_struct *dst_vma, 564 - unsigned long dst_addr) 323 + static int mfill_atomic_pte_copy(struct mfill_state *state) 565 324 { 325 + const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma); 326 + 327 + /* 328 + * The normal page fault path for a MAP_PRIVATE mapping in a 329 + * file-backed VMA will invoke the fault, fill the hole in the file and 330 + * COW it right away. The result generates plain anonymous memory. 331 + * So when we are asked to fill a hole in a MAP_PRIVATE mapping, we'll 332 + * generate anonymous memory directly without actually filling the 333 + * hole. For the MAP_PRIVATE case the robustness check only happens in 334 + * the pagetable (to verify it's still none) and not in the page cache. 
335 + */ 336 + if (!(state->vma->vm_flags & VM_SHARED)) 337 + ops = &anon_uffd_ops; 338 + 339 + return __mfill_atomic_pte(state, ops); 340 + } 341 + 342 + static int mfill_atomic_pte_zeroed_folio(struct mfill_state *state) 343 + { 344 + const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma); 345 + 346 + return __mfill_atomic_pte(state, ops); 347 + } 348 + 349 + static int mfill_atomic_pte_zeropage(struct mfill_state *state) 350 + { 351 + struct vm_area_struct *dst_vma = state->vma; 352 + unsigned long dst_addr = state->dst_addr; 353 + pmd_t *dst_pmd = state->pmd; 566 354 pte_t _dst_pte, *dst_pte; 567 355 spinlock_t *ptl; 568 356 int ret; 569 357 570 - if (mm_forbids_zeropage(dst_vma->vm_mm)) 571 - return mfill_atomic_pte_zeroed_folio(dst_pmd, dst_vma, dst_addr); 358 + if (mm_forbids_zeropage(dst_vma->vm_mm) || 359 + (dst_vma->vm_flags & VM_SHARED)) 360 + return mfill_atomic_pte_zeroed_folio(state); 572 361 573 362 _dst_pte = pte_mkspecial(pfn_pte(zero_pfn(dst_addr), 574 363 dst_vma->vm_page_prot)); ··· 596 381 } 597 382 598 383 /* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). 
*/ 599 - static int mfill_atomic_pte_continue(pmd_t *dst_pmd, 600 - struct vm_area_struct *dst_vma, 601 - unsigned long dst_addr, 602 - uffd_flags_t flags) 384 + static int mfill_atomic_pte_continue(struct mfill_state *state) 603 385 { 604 - struct inode *inode = file_inode(dst_vma->vm_file); 386 + struct vm_area_struct *dst_vma = state->vma; 387 + const struct vm_uffd_ops *ops = vma_uffd_ops(dst_vma); 388 + unsigned long dst_addr = state->dst_addr; 605 389 pgoff_t pgoff = linear_page_index(dst_vma, dst_addr); 390 + struct inode *inode = file_inode(dst_vma->vm_file); 391 + uffd_flags_t flags = state->flags; 392 + pmd_t *dst_pmd = state->pmd; 606 393 struct folio *folio; 607 394 struct page *page; 608 395 int ret; 609 396 610 - ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC); 611 - /* Our caller expects us to return -EFAULT if we failed to find folio */ 612 - if (ret == -ENOENT) 613 - ret = -EFAULT; 614 - if (ret) 615 - goto out; 616 - if (!folio) { 617 - ret = -EFAULT; 618 - goto out; 397 + if (!ops) { 398 + VM_WARN_ONCE(1, "UFFDIO_CONTINUE for unsupported VMA"); 399 + return -EOPNOTSUPP; 619 400 } 401 + 402 + folio = ops->get_folio_noalloc(inode, pgoff); 403 + /* Our caller expects us to return -EFAULT if we failed to find folio */ 404 + if (IS_ERR_OR_NULL(folio)) 405 + return -EFAULT; 620 406 621 407 page = folio_file_page(folio, pgoff); 622 408 if (PageHWPoison(page)) { ··· 626 410 } 627 411 628 412 ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, 629 - page, false, flags); 413 + page, flags); 630 414 if (ret) 631 415 goto out_release; 632 416 633 - folio_unlock(folio); 634 - ret = 0; 635 - out: 636 - return ret; 417 + return 0; 418 + 637 419 out_release: 638 420 folio_unlock(folio); 639 421 folio_put(folio); 640 - goto out; 422 + return ret; 641 423 } 642 424 643 425 /* Handles UFFDIO_POISON for all non-hugetlb VMAs. 
  */
-static int mfill_atomic_pte_poison(pmd_t *dst_pmd,
-				   struct vm_area_struct *dst_vma,
-				   unsigned long dst_addr,
-				   uffd_flags_t flags)
+static int mfill_atomic_pte_poison(struct mfill_state *state)
 {
-	int ret;
+	struct vm_area_struct *dst_vma = state->vma;
 	struct mm_struct *dst_mm = dst_vma->vm_mm;
+	unsigned long dst_addr = state->dst_addr;
+	pmd_t *dst_pmd = state->pmd;
 	pte_t _dst_pte, *dst_pte;
 	spinlock_t *ptl;
+	int ret;

 	_dst_pte = make_pte_marker(PTE_MARKER_POISONED);
 	ret = -EAGAIN;
···
 	pte_unmap_unlock(dst_pte, ptl);
 out:
 	return ret;
-}
-
-static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
-{
-	pgd_t *pgd;
-	p4d_t *p4d;
-	pud_t *pud;
-
-	pgd = pgd_offset(mm, address);
-	p4d = p4d_alloc(mm, pgd, address);
-	if (!p4d)
-		return NULL;
-	pud = pud_alloc(mm, p4d, address);
-	if (!pud)
-		return NULL;
-	/*
-	 * Note that we didn't run this because the pmd was
-	 * missing, the *pmd may be already established and in
-	 * turn it may also be a trans_huge_pmd.
-	 */
-	return pmd_alloc(mm, pud, address);
 }

 #ifdef CONFIG_HUGETLB_PAGE
···
 				    uffd_flags_t flags);
 #endif /* CONFIG_HUGETLB_PAGE */

-static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
-						struct vm_area_struct *dst_vma,
-						unsigned long dst_addr,
-						unsigned long src_addr,
-						uffd_flags_t flags,
-						struct folio **foliop)
+static __always_inline ssize_t mfill_atomic_pte(struct mfill_state *state)
 {
-	ssize_t err;
+	uffd_flags_t flags = state->flags;

-	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
-		return mfill_atomic_pte_continue(dst_pmd, dst_vma,
-						 dst_addr, flags);
-	} else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) {
-		return mfill_atomic_pte_poison(dst_pmd, dst_vma,
-					       dst_addr, flags);
-	}
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+		return mfill_atomic_pte_continue(state);
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON))
+		return mfill_atomic_pte_poison(state);
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY))
+		return mfill_atomic_pte_copy(state);
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE))
+		return mfill_atomic_pte_zeropage(state);

-	/*
-	 * The normal page fault path for a shmem will invoke the
-	 * fault, fill the hole in the file and COW it right away. The
-	 * result generates plain anonymous memory. So when we are
-	 * asked to fill an hole in a MAP_PRIVATE shmem mapping, we'll
-	 * generate anonymous memory directly without actually filling
-	 * the hole. For the MAP_PRIVATE case the robustness check
-	 * only happens in the pagetable (to verify it's still none)
-	 * and not in the radix tree.
-	 */
-	if (!(dst_vma->vm_flags & VM_SHARED)) {
-		if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY))
-			err = mfill_atomic_pte_copy(dst_pmd, dst_vma,
-						    dst_addr, src_addr,
-						    flags, foliop);
-		else
-			err = mfill_atomic_pte_zeropage(dst_pmd,
-						 dst_vma, dst_addr);
-	} else {
-		err = shmem_mfill_atomic_pte(dst_pmd, dst_vma,
-					     dst_addr, src_addr,
-					     flags, foliop);
-	}
-
-	return err;
+	VM_WARN_ONCE(1, "Unknown UFFDIO operation, flags: %x", flags);
+	return -EOPNOTSUPP;
 }

 static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
···
 					    unsigned long len,
 					    uffd_flags_t flags)
 {
-	struct mm_struct *dst_mm = ctx->mm;
-	struct vm_area_struct *dst_vma;
+	struct mfill_state state = (struct mfill_state){
+		.ctx = ctx,
+		.dst_start = dst_start,
+		.src_start = src_start,
+		.flags = flags,
+		.len = len,
+		.src_addr = src_start,
+		.dst_addr = dst_start,
+	};
+	long copied = 0;
 	ssize_t err;
-	pmd_t *dst_pmd;
-	unsigned long src_addr, dst_addr;
-	long copied;
-	struct folio *folio;

 	/*
 	 * Sanitize the command parameters:
···
 	VM_WARN_ON_ONCE(src_start + len <= src_start);
 	VM_WARN_ON_ONCE(dst_start + len <= dst_start);

-	src_addr = src_start;
-	dst_addr = dst_start;
-	copied = 0;
-	folio = NULL;
-retry:
-	/*
-	 * Make sure the vma is not shared, that the dst range is
-	 * both valid and fully within a single existing vma.
-	 */
-	dst_vma = uffd_mfill_lock(dst_mm, dst_start, len);
-	if (IS_ERR(dst_vma)) {
-		err = PTR_ERR(dst_vma);
+	err = mfill_get_vma(&state);
+	if (err)
 		goto out;
-	}
-
-	/*
-	 * If memory mappings are changing because of non-cooperative
-	 * operation (e.g. mremap) running in parallel, bail out and
-	 * request the user to retry later
-	 */
-	down_read(&ctx->map_changing_lock);
-	err = -EAGAIN;
-	if (atomic_read(&ctx->mmap_changing))
-		goto out_unlock;
-
-	err = -EINVAL;
-	/*
-	 * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but
-	 * it will overwrite vm_ops, so vma_is_anonymous must return false.
-	 */
-	if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) &&
-	    dst_vma->vm_flags & VM_SHARED))
-		goto out_unlock;
-
-	/*
-	 * validate 'mode' now that we know the dst_vma: don't allow
-	 * a wrprotect copy if the userfaultfd didn't register as WP.
-	 */
-	if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP))
-		goto out_unlock;

 	/*
 	 * If this is a HUGETLB vma, pass off to appropriate routine
 	 */
-	if (is_vm_hugetlb_page(dst_vma))
-		return mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
+	if (is_vm_hugetlb_page(state.vma))
+		return mfill_atomic_hugetlb(ctx, state.vma, dst_start,
 					     src_start, len, flags);

-	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
-		goto out_unlock;
-	if (!vma_is_shmem(dst_vma) &&
-	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
-		goto out_unlock;
+	while (state.src_addr < src_start + len) {
+		VM_WARN_ON_ONCE(state.dst_addr >= dst_start + len);

-	while (src_addr < src_start + len) {
-		pmd_t dst_pmdval;
+		err = mfill_establish_pmd(&state);
+		if (err)
+			break;

-		VM_WARN_ON_ONCE(dst_addr >= dst_start + len);
-
-		dst_pmd = mm_alloc_pmd(dst_mm, dst_addr);
-		if (unlikely(!dst_pmd)) {
-			err = -ENOMEM;
-			break;
-		}
-
-		dst_pmdval = pmdp_get_lockless(dst_pmd);
-		if (unlikely(pmd_none(dst_pmdval)) &&
-		    unlikely(__pte_alloc(dst_mm, dst_pmd))) {
-			err = -ENOMEM;
-			break;
-		}
-		dst_pmdval = pmdp_get_lockless(dst_pmd);
-		/*
-		 * If the dst_pmd is THP don't override it and just be strict.
-		 * (This includes the case where the PMD used to be THP and
-		 * changed back to none after __pte_alloc().)
-		 */
-		if (unlikely(!pmd_present(dst_pmdval) ||
-			     pmd_trans_huge(dst_pmdval))) {
-			err = -EEXIST;
-			break;
-		}
-		if (unlikely(pmd_bad(dst_pmdval))) {
-			err = -EFAULT;
-			break;
-		}
 		/*
 		 * For shmem mappings, khugepaged is allowed to remove page
 		 * tables under us; pte_offset_map_lock() will deal with that.
 		 */

-		err = mfill_atomic_pte(dst_pmd, dst_vma, dst_addr,
-				       src_addr, flags, &folio);
+		err = mfill_atomic_pte(&state);
 		cond_resched();

-		if (unlikely(err == -ENOENT)) {
-			void *kaddr;
-
-			up_read(&ctx->map_changing_lock);
-			uffd_mfill_unlock(dst_vma);
-			VM_WARN_ON_ONCE(!folio);
-
-			kaddr = kmap_local_folio(folio, 0);
-			err = copy_from_user(kaddr,
-					     (const void __user *) src_addr,
-					     PAGE_SIZE);
-			kunmap_local(kaddr);
-			if (unlikely(err)) {
-				err = -EFAULT;
-				goto out;
-			}
-			flush_dcache_folio(folio);
-			goto retry;
-		} else
-			VM_WARN_ON_ONCE(folio);
-
 		if (!err) {
-			dst_addr += PAGE_SIZE;
-			src_addr += PAGE_SIZE;
+			state.dst_addr += PAGE_SIZE;
+			state.src_addr += PAGE_SIZE;
 			copied += PAGE_SIZE;

 			if (fatal_signal_pending(current))
···
 			break;
 	}

-out_unlock:
-	up_read(&ctx->map_changing_lock);
-	uffd_mfill_unlock(dst_vma);
+	mfill_put_vma(&state);
 out:
-	if (folio)
-		folio_put(folio);
 	VM_WARN_ON_ONCE(copied < 0);
 	VM_WARN_ON_ONCE(err > 0);
 	VM_WARN_ON_ONCE(!copied && !err);
···
 	VM_WARN_ON_ONCE(err > 0);
 	VM_WARN_ON_ONCE(!moved && !err);
 	return moved ? moved : err;
+}
+
+bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
+		       bool wp_async)
+{
+	const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
+
+	if (vma->vm_flags & VM_DROPPABLE)
+		return false;
+
+	vm_flags &= __VM_UFFD_FLAGS;
+
+	/*
+	 * If WP is the only mode enabled and context is wp async, allow any
+	 * memory type.
+	 */
+	if (wp_async && (vm_flags == VM_UFFD_WP))
+		return true;
+
+	/* For any other mode reject VMAs that don't implement vm_uffd_ops */
+	if (!ops)
+		return false;
+
+	/*
+	 * If user requested uffd-wp but not enabled pte markers for
+	 * uffd-wp, then only anonymous memory is supported
+	 */
+	if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP) &&
+	    !vma_is_anonymous(vma))
+		return false;
+
+	return ops->can_userfault(vma, vm_flags);
 }

 static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
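The userfaultfd hunks above replace long argument lists with a single `struct mfill_state` and turn the nested mode checks into a flat dispatch that returns `-EOPNOTSUPP` for anything unrecognised. A minimal userspace sketch of that shape follows. It is an illustration only, not the kernel code: `enum fill_mode`, `struct fill_state`, `fill_one()` and `fill_range()` are hypothetical names invented here, and the handlers merely record that they ran.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the fill modes; not the kernel's uffd_flags_t. */
enum fill_mode { FILL_COPY, FILL_ZEROPAGE, FILL_CONTINUE, FILL_POISON, FILL_MODE_COUNT };

/* Toy analogue of a state struct: every per-call parameter lives in one place. */
struct fill_state {
	enum fill_mode mode;
	unsigned long dst_addr;	/* cursor into the destination range */
	unsigned long src_addr;	/* cursor into the source range */
	int last_handler;	/* which handler ran last, for illustration only */
};

static int fill_copy(struct fill_state *s)     { s->last_handler = FILL_COPY; return 0; }
static int fill_zeropage(struct fill_state *s) { s->last_handler = FILL_ZEROPAGE; return 0; }
static int fill_continue(struct fill_state *s) { s->last_handler = FILL_CONTINUE; return 0; }
static int fill_poison(struct fill_state *s)   { s->last_handler = FILL_POISON; return 0; }

/* Flat dispatch: one test per mode, and an explicit error for unknown modes. */
static int fill_one(struct fill_state *s)
{
	if (s->mode == FILL_CONTINUE)
		return fill_continue(s);
	if (s->mode == FILL_POISON)
		return fill_poison(s);
	if (s->mode == FILL_COPY)
		return fill_copy(s);
	if (s->mode == FILL_ZEROPAGE)
		return fill_zeropage(s);

	return -EOPNOTSUPP;
}

/* Walk a range one page at a time, advancing the cursors kept in the state. */
static long fill_range(struct fill_state *s, unsigned long len, unsigned long page_size)
{
	unsigned long end = s->src_addr + len;
	long copied = 0;
	int err = 0;

	assert(page_size != 0);

	while (s->src_addr < end) {
		err = fill_one(s);
		if (err)
			break;
		s->dst_addr += page_size;
		s->src_addr += page_size;
		copied += (long)page_size;
	}

	/* Same convention as the diff: bytes filled if any, otherwise the error. */
	return copied ? copied : err;
}
```

Because the cursors live in the state, helpers that need to drop and retake locks can resume from `dst_addr`/`src_addr` without the caller threading six parameters through every call.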
-10
mm/util.c
···
 }
 EXPORT_SYMBOL(compat_vma_mmap);

-int __vma_check_mmap_hook(struct vm_area_struct *vma)
-{
-	/* vm_ops->mapped is not valid if mmap() is specified. */
-	if (vma->vm_ops && WARN_ON_ONCE(vma->vm_ops->mapped))
-		return -EINVAL;
-
-	return 0;
-}
-EXPORT_SYMBOL(__vma_check_mmap_hook);
-
 static void set_ps_flags(struct page_snapshot *ps, const struct folio *folio,
 			 const struct page *page)
 {
+221 -82
mm/vmscan.c
···
 }
 #endif

-/* for_each_managed_zone_pgdat - helper macro to iterate over all managed zones in a pgdat up to
- * and including the specified highidx
- * @zone: The current zone in the iterator
- * @pgdat: The pgdat which node_zones are being iterated
- * @idx: The index variable
- * @highidx: The index of the highest zone to return
- *
- * This macro iterates through all managed zones up to and including the specified highidx.
- * The zone iterator enters an invalid state after macro call and must be reinitialized
- * before it can be used again.
- */
-#define for_each_managed_zone_pgdat(zone, pgdat, idx, highidx) \
-	for ((idx) = 0, (zone) = (pgdat)->node_zones; \
-	    (idx) <= (highidx); \
-	    (idx)++, (zone)++) \
-		if (!managed_zone(zone)) \
-			continue; \
-		else
-
 static void set_task_reclaim_state(struct task_struct *task,
 				   struct reclaim_state *rs)
 {
···
  * @lru: lru to use
  * @zone_idx: zones to consider (use MAX_NR_ZONES - 1 for the whole LRU list)
  */
-static unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru,
-				     int zone_idx)
+unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
 {
 	unsigned long size = 0;
 	int zid;
···
 		folio_get(folio);
 		lruvec = folio_lruvec_lock_irq(folio);
 		lruvec_del_folio(lruvec, folio);
-		unlock_page_lruvec_irq(lruvec);
+		lruvec_unlock_irq(lruvec);
 		ret = true;
 	}

···
 /*
  * move_folios_to_lru() moves folios from private @list to appropriate LRU list.
  *
- * Returns the number of pages moved to the given lruvec.
+ * Returns the number of pages moved to the appropriate lruvec.
+ *
+ * Note: The caller must not hold any lruvec lock.
  */
-static unsigned int move_folios_to_lru(struct lruvec *lruvec,
-				       struct list_head *list)
+static unsigned int move_folios_to_lru(struct list_head *list)
 {
 	int nr_pages, nr_moved = 0;
+	struct lruvec *lruvec = NULL;
 	struct folio_batch free_folios;

 	folio_batch_init(&free_folios);
 	while (!list_empty(list)) {
 		struct folio *folio = lru_to_folio(list);

+		lruvec = folio_lruvec_relock_irq(folio, lruvec);
 		VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 		list_del(&folio->lru);
 		if (unlikely(!folio_evictable(folio))) {
-			spin_unlock_irq(&lruvec->lru_lock);
+			lruvec_unlock_irq(lruvec);
 			folio_putback_lru(folio);
-			spin_lock_irq(&lruvec->lru_lock);
+			lruvec = NULL;
 			continue;
 		}

···

 			folio_unqueue_deferred_split(folio);
 			if (folio_batch_add(&free_folios, folio) == 0) {
-				spin_unlock_irq(&lruvec->lru_lock);
+				lruvec_unlock_irq(lruvec);
 				mem_cgroup_uncharge_folios(&free_folios);
 				free_unref_folios(&free_folios);
-				spin_lock_irq(&lruvec->lru_lock);
+				lruvec = NULL;
 			}

 			continue;
 		}

-		/*
-		 * All pages were isolated from the same lruvec (and isolation
-		 * inhibits memcg migration).
-		 */
-		VM_BUG_ON_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
 		lruvec_add_folio(lruvec, folio);
 		nr_pages = folio_nr_pages(folio);
 		nr_moved += nr_pages;
···
 			workingset_age_nonresident(lruvec, nr_pages);
 	}

+	if (lruvec)
+		lruvec_unlock_irq(lruvec);
+
 	if (free_folios.nr) {
-		spin_unlock_irq(&lruvec->lru_lock);
 		mem_cgroup_uncharge_folios(&free_folios);
 		free_unref_folios(&free_folios);
-		spin_lock_irq(&lruvec->lru_lock);
 	}

 	return nr_moved;
···

 	lru_add_drain();

-	spin_lock_irq(&lruvec->lru_lock);
+	lruvec_lock_irq(lruvec);

 	nr_taken = isolate_lru_folios(nr_to_scan, lruvec, &folio_list,
 				     &nr_scanned, sc, lru);
···
 	mod_lruvec_state(lruvec, item, nr_scanned);
 	mod_lruvec_state(lruvec, PGSCAN_ANON + file, nr_scanned);

-	spin_unlock_irq(&lruvec->lru_lock);
+	lruvec_unlock_irq(lruvec);

 	if (nr_taken == 0)
 		return 0;
···
 	nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false,
 					 lruvec_memcg(lruvec));

-	spin_lock_irq(&lruvec->lru_lock);
-	move_folios_to_lru(lruvec, &folio_list);
+	move_folios_to_lru(&folio_list);

 	mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
 			 stat.nr_demoted);
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
+	mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 	item = PGSTEAL_KSWAPD + reclaimer_offset(sc);
 	mod_lruvec_state(lruvec, item, nr_reclaimed);
 	mod_lruvec_state(lruvec, PGSTEAL_ANON + file, nr_reclaimed);

+	lruvec_lock_irq(lruvec);
 	lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
 				 nr_scanned - nr_reclaimed);

···

 	lru_add_drain();

-	spin_lock_irq(&lruvec->lru_lock);
+	lruvec_lock_irq(lruvec);

 	nr_taken = isolate_lru_folios(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, lru);
···

 	mod_lruvec_state(lruvec, PGREFILL, nr_scanned);

-	spin_unlock_irq(&lruvec->lru_lock);
+	lruvec_unlock_irq(lruvec);

 	while (!list_empty(&l_hold)) {
 		struct folio *folio;
···
 	/*
 	 * Move folios back to the lru list.
 	 */
-	spin_lock_irq(&lruvec->lru_lock);
+	nr_activate = move_folios_to_lru(&l_active);
+	nr_deactivate = move_folios_to_lru(&l_inactive);

-	nr_activate = move_folios_to_lru(lruvec, &l_active);
-	nr_deactivate = move_folios_to_lru(lruvec, &l_inactive);
-
-	__count_vm_events(PGDEACTIVATE, nr_deactivate);
+	count_vm_events(PGDEACTIVATE, nr_deactivate);
 	count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
+	mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);

-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-
+	lruvec_lock_irq(lruvec);
 	lru_note_cost_unlock_irq(lruvec, file, 0, nr_rotated);
 	trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate,
 			nr_deactivate, nr_rotated, sc->priority, file);
···
 		return NULL;

 	clear_bit(key, &mm->lru_gen.bitmap);
+	mmgrab(mm);

-	return mmget_not_zero(mm) ? mm : NULL;
+	return mm;
 }

 void lru_gen_add_mm(struct mm_struct *mm)
···
 		reset_bloom_filter(mm_state, walk->seq + 1);

 	if (*iter)
-		mmput_async(*iter);
+		mmdrop(*iter);

 	*iter = mm;

···
 	if (folio_nid(folio) != pgdat->node_id)
 		return NULL;

+	rcu_read_lock();
 	if (folio_memcg(folio) != memcg)
-		return NULL;
+		folio = NULL;
+	rcu_read_unlock();

 	return folio;
 }
···
 		}

 		if (walk->batched) {
-			spin_lock_irq(&lruvec->lru_lock);
+			lruvec_lock_irq(lruvec);
 			reset_batch_size(walk);
-			spin_unlock_irq(&lruvec->lru_lock);
+			lruvec_unlock_irq(lruvec);
 		}

 		cond_resched();
···
 	if (seq < READ_ONCE(lrugen->max_seq))
 		return false;

-	spin_lock_irq(&lruvec->lru_lock);
+	lruvec_lock_irq(lruvec);

 	VM_WARN_ON_ONCE(!seq_is_valid(lruvec));

···
 		if (inc_min_seq(lruvec, type, swappiness))
 			continue;

-		spin_unlock_irq(&lruvec->lru_lock);
+		lruvec_unlock_irq(lruvec);
 		cond_resched();
 		goto restart;
 	}
···
 	/* make sure preceding modifications appear */
 	smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
 unlock:
-	spin_unlock_irq(&lruvec->lru_lock);
+	lruvec_unlock_irq(lruvec);

 	return success;
 }
···
 	unsigned long addr = pvmw->address;
 	struct vm_area_struct *vma = pvmw->vma;
 	struct folio *folio = pfn_folio(pvmw->pfn);
-	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct mem_cgroup *memcg;
 	struct pglist_data *pgdat = folio_pgdat(folio);
-	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
-	DEFINE_MAX_SEQ(lruvec);
-	int gen = lru_gen_from_seq(max_seq);
+	struct lruvec *lruvec;
+	struct lru_gen_mm_state *mm_state;
+	unsigned long max_seq;
+	int gen;

 	lockdep_assert_held(pvmw->ptl);
 	VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);
···
 			end = addr + MIN_LRU_BATCH * PAGE_SIZE / 2;
 		}
 	}
+
+	memcg = get_mem_cgroup_from_folio(folio);
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
+	gen = lru_gen_from_seq(max_seq);
+	mm_state = get_mm_state(lruvec);

 	lazy_mmu_mode_enable();

···
 	/* feedback from rmap walkers to page table walkers */
 	if (mm_state && suitable_to_scan(i, young))
 		update_bloom_filter(mm_state, max_seq, pvmw->pmd);
+
+	mem_cgroup_put(memcg);

 	return true;
 }
···
 	/* see the comment on MEMCG_NR_GENS */
 	if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_HEAD)
 		lru_gen_rotate_memcg(lruvec, MEMCG_LRU_HEAD);
+}
+
+bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg, int nid)
+{
+	struct lruvec *lruvec = get_lruvec(memcg, nid);
+	int type;
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
+			return false;
+	}
+
+	return true;
+}
+
+static void try_to_inc_max_seq_nowalk(struct mem_cgroup *memcg,
+				      struct lruvec *lruvec)
+{
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+	struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
+	int swappiness = mem_cgroup_swappiness(memcg);
+	DEFINE_MAX_SEQ(lruvec);
+	bool success = false;
+
+	/*
+	 * We are not iterating the mm_list here, updating mm_state->seq is just
+	 * to make mm walkers work properly.
+	 */
+	if (mm_state) {
+		spin_lock(&mm_list->lock);
+		VM_WARN_ON_ONCE(mm_state->seq + 1 < max_seq);
+		if (max_seq > mm_state->seq) {
+			WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+			success = true;
+		}
+		spin_unlock(&mm_list->lock);
+	} else {
+		success = true;
+	}
+
+	if (success)
+		inc_max_seq(lruvec, max_seq, swappiness);
+}
+
+/*
+ * We need to ensure that the folios of child memcg can be reparented to the
+ * same gen of the parent memcg, so the gens of the parent memcg needed be
+ * incremented to the MAX_NR_GENS before reparenting.
+ */
+void max_lru_gen_memcg(struct mem_cgroup *memcg, int nid)
+{
+	struct lruvec *lruvec = get_lruvec(memcg, nid);
+	int type;
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		while (get_nr_gens(lruvec, type) < MAX_NR_GENS) {
+			try_to_inc_max_seq_nowalk(memcg, lruvec);
+			cond_resched();
+		}
+	}
+}
+
+/*
+ * Compared to traditional LRU, MGLRU faces the following challenges:
+ *
+ * 1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, the
+ *    number of generations of the parent and child memcg may be different,
+ *    so we cannot simply transfer MGLRU folios in the child memcg to the
+ *    parent memcg as we did for traditional LRU folios.
+ * 2. The generation information is stored in folio->flags, but we cannot
+ *    traverse these folios while holding the lru lock, otherwise it may
+ *    cause softlockup.
+ * 3. In walk_update_folio(), the gen of folio and corresponding lru size
+ *    may be updated, but the folio is not immediately moved to the
+ *    corresponding lru list. Therefore, there may be folios of different
+ *    generations on an LRU list.
+ * 4. In lru_gen_del_folio(), the generation to which the folio belongs is
+ *    found based on the generation information in folio->flags, and the
+ *    corresponding LRU size will be updated. Therefore, we need to update
+ *    the lru size correctly during reparenting, otherwise the lru size may
+ *    be updated incorrectly in lru_gen_del_folio().
+ *
+ * Finally, we choose a compromise method, which is to splice the lru list in
+ * the child memcg to the lru list of the same generation in the parent memcg
+ * during reparenting.
+ *
+ * The same generation has different meanings in the parent and child memcg,
+ * so this compromise method will cause the LRU inversion problem. But as the
+ * system runs, this problem will be fixed automatically.
+ */
+static void __lru_gen_reparent_memcg(struct lruvec *child_lruvec, struct lruvec *parent_lruvec,
+				     int zone, int type)
+{
+	struct lru_gen_folio *child_lrugen, *parent_lrugen;
+	enum lru_list lru = type * LRU_INACTIVE_FILE;
+	int i;
+
+	child_lrugen = &child_lruvec->lrugen;
+	parent_lrugen = &parent_lruvec->lrugen;
+
+	for (i = 0; i < get_nr_gens(child_lruvec, type); i++) {
+		int gen = lru_gen_from_seq(child_lrugen->max_seq - i);
+		long nr_pages = child_lrugen->nr_pages[gen][type][zone];
+		int child_lru_active = lru_gen_is_active(child_lruvec, gen) ? LRU_ACTIVE : 0;
+		int parent_lru_active = lru_gen_is_active(parent_lruvec, gen) ? LRU_ACTIVE : 0;
+
+		/* Assuming that child pages are colder than parent pages */
+		list_splice_tail_init(&child_lrugen->folios[gen][type][zone],
+				      &parent_lrugen->folios[gen][type][zone]);
+
+		WRITE_ONCE(child_lrugen->nr_pages[gen][type][zone], 0);
+		WRITE_ONCE(parent_lrugen->nr_pages[gen][type][zone],
+			   parent_lrugen->nr_pages[gen][type][zone] + nr_pages);
+
+		if (lru_gen_is_active(child_lruvec, gen) != lru_gen_is_active(parent_lruvec, gen)) {
+			__update_lru_size(child_lruvec, lru + child_lru_active, zone, -nr_pages);
+			__update_lru_size(parent_lruvec, lru + parent_lru_active, zone, nr_pages);
+		}
+	}
+}
+
+void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid)
+{
+	struct lruvec *child_lruvec, *parent_lruvec;
+	int type, zid;
+	struct zone *zone;
+	enum lru_list lru;
+
+	child_lruvec = get_lruvec(memcg, nid);
+	parent_lruvec = get_lruvec(parent, nid);
+
+	for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1)
+		for (type = 0; type < ANON_AND_FILE; type++)
+			__lru_gen_reparent_memcg(child_lruvec, parent_lruvec, zid, type);
+
+	for_each_lru(lru) {
+		for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1) {
+			unsigned long size = mem_cgroup_get_zone_lru_size(child_lruvec, lru, zid);
+
+			mem_cgroup_update_lru_size(parent_lruvec, lru, zid, size);
+		}
+	}
 }

 #endif /* CONFIG_MEMCG */
···
 static int get_tier_idx(struct lruvec *lruvec, int type)
 {
 	int tier;
-	struct ctrl_pos sp, pv;
+	struct ctrl_pos sp, pv = {};

 	/*
 	 * To leave a margin for fluctuations, use a larger gain factor (2:3).
···

 static int get_type_to_scan(struct lruvec *lruvec, int swappiness)
 {
-	struct ctrl_pos sp, pv;
+	struct ctrl_pos sp, pv = {};

 	if (swappiness <= MIN_SWAPPINESS + 1)
 		return LRU_GEN_FILE;
···
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);

-	spin_lock_irq(&lruvec->lru_lock);
+	lruvec_lock_irq(lruvec);

 	scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list);

···
 	if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
 		scanned = 0;

-	spin_unlock_irq(&lruvec->lru_lock);
+	lruvec_unlock_irq(lruvec);

 	if (list_empty(&list))
 		return scanned;
···
 		set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_active));
 	}

-	spin_lock_irq(&lruvec->lru_lock);
-
-	move_folios_to_lru(lruvec, &list);
+	move_folios_to_lru(&list);

 	walk = current->reclaim_state->mm_walk;
 	if (walk && walk->batched) {
 		walk->lruvec = lruvec;
+		lruvec_lock_irq(lruvec);
 		reset_batch_size(walk);
+		lruvec_unlock_irq(lruvec);
 	}

 	mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
···
 	item = PGSTEAL_KSWAPD + reclaimer_offset(sc);
 	mod_lruvec_state(lruvec, item, reclaimed);
 	mod_lruvec_state(lruvec, PGSTEAL_ANON + type, reclaimed);
-
-	spin_unlock_irq(&lruvec->lru_lock);

 	list_splice_init(&clean, &list);

···
 	int i;
 	enum zone_watermarks mark;

-	/* don't abort memcg reclaim to ensure fairness */
-	if (!root_reclaim(sc))
-		return false;
-
 	if (sc->nr_reclaimed >= max(sc->nr_to_reclaim, compact_gap(sc->order)))
 		return true;

···
 	 * If too many file cache in the coldest generation can't be evicted
 	 * due to being dirty, wake up the flusher.
 	 */
-	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken)
+	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) {
+		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
 		wakeup_flusher_threads(WB_REASON_VMSCAN);
+
+		/*
+		 * For cgroupv1 dirty throttling is achieved by waking up
+		 * the kernel flusher here and later waiting on folios
+		 * which are in writeback to finish (see shrink_folio_list()).
+		 *
+		 * Flusher may not be able to issue writeback quickly
+		 * enough for cgroupv1 writeback throttling to work
+		 * on a large system.
+		 */
+		if (!writeback_throttling_sane(sc))
+			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+	}

 	/* whether this lruvec should be rotated */
 	return nr_to_scan < 0;
···
 		for_each_node(nid) {
 			struct lruvec *lruvec = get_lruvec(memcg, nid);

-			spin_lock_irq(&lruvec->lru_lock);
+			lruvec_lock_irq(lruvec);

 			VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
 			VM_WARN_ON_ONCE(!state_is_valid(lruvec));
···
 			lruvec->lrugen.enabled = enabled;

 			while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
-				spin_unlock_irq(&lruvec->lru_lock);
+				lruvec_unlock_irq(lruvec);
 				cond_resched();
-				spin_lock_irq(&lruvec->lru_lock);
+				lruvec_lock_irq(lruvec);
 			}

-			spin_unlock_irq(&lruvec->lru_lock);
+			lruvec_unlock_irq(lruvec);
 		}

 		cond_resched();
···
 	if (lruvec) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		unlock_page_lruvec_irq(lruvec);
+		lruvec_unlock_irq(lruvec);
 	} else if (pgscanned) {
 		count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
 	}
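In the mm/vmscan.c hunks above, `move_folios_to_lru()` no longer receives a pre-locked lruvec. It calls `folio_lruvec_relock_irq()` for each folio, sets its local `lruvec` to NULL whenever it has to drop the lock mid-loop, and unlocks once at the end if something is still held. The relock idea can be sketched in userspace as below. Everything here is a hypothetical toy (`toy_lruvec`, `toy_folio`, `toy_relock()`, `toy_move_folios()`); the "lock" only counts acquisitions, so the sketch shows the control flow and not real synchronisation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy lock: counts acquisitions instead of really locking. */
struct toy_lruvec {
	int locked;	/* 1 while "held" */
	int lock_count;	/* how many times the lock was taken */
	int nr_items;	/* how many folios were put back here */
};

/* Toy folio: the owning list is stored directly (assumed non-NULL). */
struct toy_folio {
	struct toy_lruvec *lruvec;
};

static void toy_lock(struct toy_lruvec *lv)
{
	assert(!lv->locked);
	lv->locked = 1;
	lv->lock_count++;
}

static void toy_unlock(struct toy_lruvec *lv)
{
	assert(lv->locked);
	lv->locked = 0;
}

/* Return the folio's lruvec locked, reusing the held lock if it already matches. */
static struct toy_lruvec *toy_relock(struct toy_folio *f, struct toy_lruvec *held)
{
	if (held == f->lruvec)
		return held;
	if (held)
		toy_unlock(held);
	toy_lock(f->lruvec);
	return f->lruvec;
}

/* Put every folio back on its own list, switching locks only when the owner changes. */
static int toy_move_folios(struct toy_folio *folios, size_t n)
{
	struct toy_lruvec *held = NULL;
	int moved = 0;

	for (size_t i = 0; i < n; i++) {
		held = toy_relock(&folios[i], held);
		held->nr_items++;
		moved++;
	}

	/* Mirrors the trailing "if (lruvec) unlock" in the diff. */
	if (held)
		toy_unlock(held);

	return moved;
}
```

With folios owned by lists A, A, B, A in that order, list A is locked twice and list B once; a run of folios with the same owner costs a single acquisition.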
+1 -1
mm/vmstat.c
···
 		if (cpu_is_isolated(cpu))
 			continue;

-		if (!delayed_work_pending(dw) && need_update(cpu))
+		if (!work_busy(&dw->work) && need_update(cpu))
 			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
 	}

+19 -11
mm/workingset.c
···
 	int refs = folio_lru_refs(folio);
 	bool workingset = folio_test_workingset(folio);
 	int tier = lru_tier_from_refs(refs, workingset);
-	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct mem_cgroup *memcg;
 	struct pglist_data *pgdat = folio_pgdat(folio);
+	unsigned short memcg_id;

 	BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH >
 		     BITS_PER_LONG - max(EVICTION_SHIFT, EVICTION_SHIFT_ANON));

+	rcu_read_lock();
+	memcg = folio_memcg(folio);
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
 	lrugen = &lruvec->lrugen;
 	min_seq = READ_ONCE(lrugen->min_seq[type]);
···

 	hist = lru_hist_from_seq(min_seq);
 	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
+	memcg_id = mem_cgroup_private_id(memcg);
+	rcu_read_unlock();

-	return pack_shadow(mem_cgroup_private_id(memcg), pgdat, token, workingset, type);
+	return pack_shadow(memcg_id, pgdat, token, workingset, type);
 }

 /*
···
 void workingset_refault(struct folio *folio, void *shadow)
 {
 	bool file = folio_is_file_lru(folio);
-	struct pglist_data *pgdat;
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 	bool workingset;
···
 	 * locked to guarantee folio_memcg() stability throughout.
 	 */
 	nr = folio_nr_pages(folio);
-	memcg = folio_memcg(folio);
-	pgdat = folio_pgdat(folio);
-	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-
+	memcg = get_mem_cgroup_from_folio(folio);
+	lruvec = mem_cgroup_lruvec(memcg, folio_pgdat(folio));
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);

 	if (!workingset_test_recent(shadow, file, &workingset, true))
-		return;
+		goto out;

 	folio_set_active(folio);
 	workingset_age_nonresident(lruvec, nr);
···
 		lru_note_cost_refault(folio);
 		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
 	}
+out:
+	mem_cgroup_put(memcg);
 }

 /**
···
 	 * Filter non-memcg pages here, e.g. unmap can call
 	 * mark_page_accessed() on VDSO pages.
 	 */
-	if (mem_cgroup_disabled() || folio_memcg_charged(folio))
+	if (mem_cgroup_disabled() || folio_memcg_charged(folio)) {
+		rcu_read_lock();
 		workingset_age_nonresident(folio_lruvec(folio), folio_nr_pages(folio));
+		rcu_read_unlock();
+	}
 }

 /*
···

 		mem_cgroup_flush_stats_ratelimited(sc->memcg);
 		lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
+
 		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
-			pages += lruvec_page_state_local(lruvec,
-							 NR_LRU_BASE + i);
+			pages += lruvec_lru_size(lruvec, i, MAX_NR_ZONES - 1);
+
 		pages += lruvec_page_state_local(
 			lruvec, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT;
 		pages += lruvec_page_state_local(
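The `workingset_refault()` hunk above swaps a borrowed `folio_memcg()` pointer for a counted reference from `get_mem_cgroup_from_folio()`, so the early `return` becomes `goto out` and every path ends in `mem_cgroup_put()`. A minimal userspace sketch of that single-exit pattern is shown below. The type and helpers (`toy_memcg`, `toy_memcg_get()`, `toy_memcg_put()`, `toy_refault()`) are hypothetical and use a plain integer as the reference count; the real kernel helpers have their own semantics.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical refcounted object standing in for a memory cgroup. */
struct toy_memcg {
	int refcount;
	long refaults;
	long activations;
};

/* Take a reference so the object cannot go away while we use it. */
static struct toy_memcg *toy_memcg_get(struct toy_memcg *m)
{
	m->refcount++;
	return m;
}

/* Drop a reference taken with toy_memcg_get(). */
static void toy_memcg_put(struct toy_memcg *m)
{
	assert(m->refcount > 0);
	m->refcount--;
}

/* Every path after the get funnels through "out", so the put always runs. */
static void toy_refault(struct toy_memcg *owner, bool recent)
{
	struct toy_memcg *memcg = toy_memcg_get(owner);

	memcg->refaults++;

	if (!recent)
		goto out;	/* a bare return here would leak the reference */

	memcg->activations++;
out:
	toy_memcg_put(memcg);
}
```

The reference count is unchanged after each call regardless of which branch ran, which is the property the `goto out` conversion preserves.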
+99 -106
mm/zswap.c
···
 **********************************/
 static void __zswap_pool_empty(struct percpu_ref *ref);

+static void acomp_ctx_free(struct crypto_acomp_ctx *acomp_ctx)
+{
+	if (!acomp_ctx)
+		return;
+
+	/*
+	 * If there was an error in allocating @acomp_ctx->req, it
+	 * would be set to NULL.
+	 */
+	if (acomp_ctx->req)
+		acomp_request_free(acomp_ctx->req);
+
+	acomp_ctx->req = NULL;
+
+	/*
+	 * We have to handle both cases here: an error pointer return from
+	 * crypto_alloc_acomp_node(); and a) NULL initialization by zswap, or
+	 * b) NULL assignment done in a previous call to acomp_ctx_free().
+	 */
+	if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
+		crypto_free_acomp(acomp_ctx->acomp);
+
+	acomp_ctx->acomp = NULL;
+
+	kfree(acomp_ctx->buffer);
+	acomp_ctx->buffer = NULL;
+}
+
 static struct zswap_pool *zswap_pool_create(char *compressor)
 {
 	struct zswap_pool *pool;
···

 	strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));

-	pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx);
+	/* Many things rely on the zero-initialization. */
+	pool->acomp_ctx = alloc_percpu_gfp(*pool->acomp_ctx,
+					   GFP_KERNEL | __GFP_ZERO);
 	if (!pool->acomp_ctx) {
 		pr_err("percpu alloc failed\n");
 		goto error;
 	}

-	for_each_possible_cpu(cpu)
-		mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
-
+	/*
+	 * This is serialized against CPU hotplug operations. Hence, cores
+	 * cannot be offlined until this finishes.
+	 */
 	ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
 				       &pool->node);
+
+	/*
+	 * cpuhp_state_add_instance() will not cleanup on failure since
+	 * we don't register a hotunplug callback.
+	 */
 	if (ret)
-		goto error;
+		goto cpuhp_add_fail;

 	/* being the current pool takes 1 ref; this func expects the
 	 * caller to always add the new pool as the current pool
···

 ref_fail:
 	cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
+
+cpuhp_add_fail:
+	for_each_possible_cpu(cpu)
+		acomp_ctx_free(per_cpu_ptr(pool->acomp_ctx, cpu));
 error:
 	if (pool->acomp_ctx)
 		free_percpu(pool->acomp_ctx);
···

 static void zswap_pool_destroy(struct zswap_pool *pool)
 {
+	int cpu;
+
 	zswap_pool_debug("destroying", pool);

 	cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
+
+	for_each_possible_cpu(cpu)
+		acomp_ctx_free(per_cpu_ptr(pool->acomp_ctx, cpu));
+
 	free_percpu(pool->acomp_ctx);

 	zs_destroy_pool(pool->zs_pool);
···
 	struct lruvec *lruvec;

 	if (folio) {
+		rcu_read_lock();
 		lruvec = folio_lruvec(folio);
 		atomic_long_inc(&lruvec->zswap_lruvec_state.nr_disk_swapins);
+		rcu_read_unlock();
 	}
 }

···
 {
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
-	struct crypto_acomp *acomp = NULL;
-	struct acomp_req *req = NULL;
-	u8 *buffer = NULL;
-	int ret;
-
-	buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
-	if (!buffer) {
-		ret = -ENOMEM;
-		goto fail;
-	}
-
-	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
-	if (IS_ERR(acomp)) {
-		pr_err("could not alloc crypto acomp %s : %pe\n",
-		       pool->tfm_name, acomp);
-		ret = PTR_ERR(acomp);
-		goto fail;
-	}
-
-	req = acomp_request_alloc(acomp);
-	if (!req) {
-		pr_err("could not alloc crypto acomp_request %s\n",
-		       pool->tfm_name);
-		ret = -ENOMEM;
-		goto fail;
-	}
+	int ret = -ENOMEM;

 	/*
-	 * Only hold the mutex after completing allocations, otherwise we may
-	 * recurse into zswap through reclaim and attempt to hold the mutex
-	 * again resulting in a deadlock.
+	 * To handle cases where the CPU goes through online-offline-online
+	 * transitions, we return if the acomp_ctx has already been initialized.
 	 */
-	mutex_lock(&acomp_ctx->mutex);
+	if (acomp_ctx->acomp) {
+		WARN_ON_ONCE(IS_ERR(acomp_ctx->acomp));
+		return 0;
+	}
+
+	acomp_ctx->buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
+	if (!acomp_ctx->buffer)
+		return ret;
+
+	/*
+	 * In case of an error, crypto_alloc_acomp_node() returns an
+	 * error pointer, never NULL.
+	 */
+	acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
+	if (IS_ERR(acomp_ctx->acomp)) {
+		pr_err("could not alloc crypto acomp %s : %pe\n",
+		       pool->tfm_name, acomp_ctx->acomp);
+		ret = PTR_ERR(acomp_ctx->acomp);
+		goto fail;
+	}
+
+	/* acomp_request_alloc() returns NULL in case of an error. */
+	acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
+	if (!acomp_ctx->req) {
+		pr_err("could not alloc crypto acomp_request %s\n",
+		       pool->tfm_name);
+		goto fail;
+	}
+
 	crypto_init_wait(&acomp_ctx->wait);

 	/*
···
 	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
 	 * won't be called, crypto_wait_req() will return without blocking.
 	 */
-	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+	acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
 				   crypto_req_done, &acomp_ctx->wait);

-	acomp_ctx->buffer = buffer;
-	acomp_ctx->acomp = acomp;
-	acomp_ctx->req = req;
-	mutex_unlock(&acomp_ctx->mutex);
+	mutex_init(&acomp_ctx->mutex);
 	return 0;

 fail:
-	if (!IS_ERR_OR_NULL(acomp))
-		crypto_free_acomp(acomp);
-	kfree(buffer);
+	acomp_ctx_free(acomp_ctx);
 	return ret;
-}
-
-static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
-{
-	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
-	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
-	struct acomp_req *req;
-	struct crypto_acomp *acomp;
-	u8 *buffer;
-
-	if (IS_ERR_OR_NULL(acomp_ctx))
-		return 0;
-
-	mutex_lock(&acomp_ctx->mutex);
-	req = acomp_ctx->req;
-	acomp = acomp_ctx->acomp;
-	buffer = acomp_ctx->buffer;
-	acomp_ctx->req = NULL;
-	acomp_ctx->acomp = NULL;
-	acomp_ctx->buffer = NULL;
-	mutex_unlock(&acomp_ctx->mutex);
-
-	/*
-	 * Do the actual freeing after releasing the mutex to avoid subtle
-	 * locking dependencies causing deadlocks.
-	 */
-	if (!IS_ERR_OR_NULL(req))
-		acomp_request_free(req);
-	if (!IS_ERR_OR_NULL(acomp))
-		crypto_free_acomp(acomp);
-	kfree(buffer);
-
-	return 0;
-}
-
-static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
-{
-	struct crypto_acomp_ctx *acomp_ctx;
-
-	for (;;) {
-		acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
-		mutex_lock(&acomp_ctx->mutex);
-		if (likely(acomp_ctx->req))
-			return acomp_ctx;
-		/*
-		 * It is possible that we were migrated to a different CPU after
-		 * getting the per-CPU ctx but before the mutex was acquired. If
-		 * the old CPU got offlined, zswap_cpu_comp_dead() could have
-		 * already freed ctx->req (among other things) and set it to
-		 * NULL. Just try again on the new CPU that we ended up on.
-		 */
-		mutex_unlock(&acomp_ctx->mutex);
-	}
-}
-
-static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
-{
-	mutex_unlock(&acomp_ctx->mutex);
 }

 static bool zswap_compress(struct page *page, struct zswap_entry *entry,
···
 	u8 *dst;
 	bool mapped = false;

-	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
+	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
+	mutex_lock(&acomp_ctx->mutex);
+
 	dst = acomp_ctx->buffer;
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
···
 	 * to the active LRU list in the case.
 	 */
 	if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
+		rcu_read_lock();
 		if (!mem_cgroup_zswap_writeback_enabled(
 					folio_memcg(page_folio(page)))) {
+			rcu_read_unlock();
 			comp_ret = comp_ret ? comp_ret : -EINVAL;
 			goto unlock;
 		}
+		rcu_read_unlock();
 		comp_ret = 0;
 		dlen = PAGE_SIZE;
 		dst = kmap_local_page(page);
···
 	else if (alloc_ret)
 		zswap_reject_alloc_fail++;

-	acomp_ctx_put_unlock(acomp_ctx);
+	mutex_unlock(&acomp_ctx->mutex);
 	return comp_ret == 0 && alloc_ret == 0;
 }

···
 	struct crypto_acomp_ctx *acomp_ctx;
 	int ret = 0, dlen;

-	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
+	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
+	mutex_lock(&acomp_ctx->mutex);
 	zs_obj_read_sg_begin(pool->zs_pool, entry->handle, input, entry->length);

 	/* zswap entries of length PAGE_SIZE are not compressed. */
···
 	}

 	zs_obj_read_sg_end(pool->zs_pool, entry->handle);
-	acomp_ctx_put_unlock(acomp_ctx);
+	mutex_unlock(&acomp_ctx->mutex);

 	if (!ret && dlen == PAGE_SIZE)
 		return true;
···
 	ret = cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE,
 				      "mm/zswap_pool:prepare",
 				      zswap_cpu_comp_prepare,
-				      zswap_cpu_comp_dead);
+				      NULL);
 	if (ret)
 		goto hp_fail;
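The cleanup helper introduced in this file is written so that it can run on a partially initialized context, or on one that has already been freed. A minimal user-space sketch of that idempotent-free pattern follows. `struct ctx`, `ctx_init()` and `ctx_free()` are hypothetical stand-ins, and plain `malloc()`/`free()` replace the kernel allocators, so this illustrates the pattern rather than the zswap code itself.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a context that owns two resources. */
struct ctx {
	char *buffer;
	int *req;
};

/*
 * Safe to call on NULL, on a zero-initialized ctx, and on a ctx that was
 * already freed: every pointer is reset to NULL after it is released.
 */
static void ctx_free(struct ctx *c)
{
	if (!c)
		return;

	free(c->req);		/* free(NULL) is a no-op */
	c->req = NULL;

	free(c->buffer);
	c->buffer = NULL;
}

/* Returns 0 on success, -1 on failure; on failure nothing stays allocated. */
static int ctx_init(struct ctx *c, size_t buflen)
{
	/* Already initialized (e.g. set up once before): keep what we have. */
	if (c->buffer)
		return 0;

	c->buffer = malloc(buflen);
	if (!c->buffer)
		return -1;

	c->req = malloc(sizeof(*c->req));
	if (!c->req) {
		ctx_free(c);	/* one cleanup path for every failure */
		return -1;
	}

	return 0;
}
```

Because the free helper resets each field, the same function can serve the error path of the initializer and the final teardown, which appears to be the reason a separate teardown callback is no longer needed in the diff above.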
+41
tools/testing/selftests/liveupdate/liveupdate.c
···
 	ASSERT_EQ(close(session_fd), 0);
 }

+/*
+ * Test Case: Prevent Double Preservation
+ *
+ * Verifies that a file (memfd) can only be preserved once across all active
+ * sessions. Attempting to preserve it a second time, whether in the same or
+ * a different session, should fail with EBUSY.
+ */
+TEST_F(liveupdate_device, prevent_double_preservation)
+{
+	int session_fd1, session_fd2, mem_fd;
+	int ret;
+
+	self->fd1 = open(LIVEUPDATE_DEV, O_RDWR);
+	if (self->fd1 < 0 && errno == ENOENT)
+		SKIP(return, "%s does not exist", LIVEUPDATE_DEV);
+	ASSERT_GE(self->fd1, 0);
+
+	session_fd1 = create_session(self->fd1, "double-preserve-session-1");
+	ASSERT_GE(session_fd1, 0);
+	session_fd2 = create_session(self->fd1, "double-preserve-session-2");
+	ASSERT_GE(session_fd2, 0);
+
+	mem_fd = memfd_create("test-memfd", 0);
+	ASSERT_GE(mem_fd, 0);
+
+	/* First preservation should succeed */
+	ASSERT_EQ(preserve_fd(session_fd1, mem_fd, 0x1111), 0);
+
+	/* Second preservation in a different session should fail with EBUSY */
+	ret = preserve_fd(session_fd2, mem_fd, 0x2222);
+	EXPECT_EQ(ret, -EBUSY);
+
+	/* Second preservation in the same session (different token) should fail with EBUSY */
+	ret = preserve_fd(session_fd1, mem_fd, 0x3333);
+	EXPECT_EQ(ret, -EBUSY);
+
+	ASSERT_EQ(close(mem_fd), 0);
+	ASSERT_EQ(close(session_fd1), 0);
+	ASSERT_EQ(close(session_fd2), 0);
+}
+
 TEST_HARNESS_MAIN
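The rule this test exercises can be modelled in a few lines of user-space C. The sketch below is only an illustration and makes no claim about how LUO tracks preserved files internally: `preserve_once()`, the fixed-size table, and the use of a plain integer as a file identity are all hypothetical.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_PRESERVED 16

/* Hypothetical table of file identities preserved by any session. */
static unsigned long preserved[MAX_PRESERVED];
static size_t nr_preserved;

/*
 * Record file_id as preserved. Returns 0 on success, -EBUSY if the same
 * identity is already recorded (regardless of which session asked first),
 * or -ENOSPC if the table is full.
 */
static int preserve_once(unsigned long file_id)
{
	size_t i;

	for (i = 0; i < nr_preserved; i++)
		if (preserved[i] == file_id)
			return -EBUSY;

	if (nr_preserved == MAX_PRESERVED)
		return -ENOSPC;

	preserved[nr_preserved++] = file_id;
	return 0;
}
```

Keeping the table global rather than per session is what makes the second attempt fail in both cases the test covers: same session with a different token, and a different session.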
+5
tools/testing/selftests/mm/charge_reserved_hugetlb.sh
···
   exit $ksft_skip
 fi

+if ! command -v killall >/dev/null 2>&1; then
+  echo "killall not available. Skipping..."
+  exit $ksft_skip
+fi
+
 nr_hugepgs=$(cat /proc/sys/vm/nr_hugepages)

 fault_limit_file=limit_in_bytes
+4
tools/testing/selftests/mm/guard-regions.c
···
 #include <sys/uio.h>
 #include <unistd.h>
 #include "vm_util.h"
+#include "thp_settings.h"

 #include "../pidfd/pidfd.h"

···
 	const unsigned long num_pages = size / page_size;
 	char *ptr;
 	int i;
+
+	if (!thp_available())
+		SKIP(return, "Transparent Hugepages not available\n");

 	/* Need file to be correct size for tests for non-anon. */
 	if (variant->backing != ANON_BACKED)
+16 -67
tools/testing/selftests/mm/hmm-tests.c
···
  */
 #include <lib/test_hmm_uapi.h>
 #include <mm/gup_test.h>
+#include <mm/vm_util.h>

 struct hmm_buffer {
 	void		*ptr;
···

 	for (migrate = 0; migrate < 2; ++migrate) {
 		for (use_thp = 0; use_thp < 2; ++use_thp) {
-			npages = ALIGN(use_thp ? TWOMEG : HMM_BUFFER_SIZE,
+			npages = ALIGN(use_thp ? read_pmd_pagesize() : HMM_BUFFER_SIZE,
 				       self->page_size) >> self->page_shift;
 			ASSERT_NE(npages, 0);
 			size = npages << self->page_shift;
···
 	int *ptr;
 	int ret;

-	size = 2 * TWOMEG;
+	size = 2 * read_pmd_pagesize();

 	buffer = malloc(sizeof(*buffer));
 	ASSERT_NE(buffer, NULL);
···
 			   buffer->fd, 0);
 	ASSERT_NE(buffer->ptr, MAP_FAILED);

-	size = TWOMEG;
+	size /= 2;
 	npages = size >> self->page_shift;
 	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
 	ret = madvise(map, size, MADV_HUGEPAGE);
···
 }

 /*
- * Read numeric data from raw and tagged kernel status files.  Used to read
- * /proc and /sys data (without a tag) and from /proc/meminfo (with a tag).
- */
-static long file_read_ulong(char *file, const char *tag)
-{
-	int fd;
-	char buf[2048];
-	int len;
-	char *p, *q;
-	long val;
-
-	fd = open(file, O_RDONLY);
-	if (fd < 0) {
-		/* Error opening the file */
-		return -1;
-	}
-
-	len = read(fd, buf, sizeof(buf));
-	close(fd);
-	if (len < 0) {
-		/* Error in reading the file */
-		return -1;
-	}
-	if (len == sizeof(buf)) {
-		/* Error file is too large */
-		return -1;
-	}
-	buf[len] = '\0';
-
-	/* Search for a tag if provided */
-	if (tag) {
-		p = strstr(buf, tag);
-		if (!p)
-			return -1; /* looks like the line we want isn't there */
-		p += strlen(tag);
-	} else
-		p = buf;
-
-	val = strtol(p, &q, 0);
-	if (*q != ' ') {
-		/* Error parsing the file */
-		return -1;
-	}
-
-	return val;
-}
-
-/*
  * Write huge TLBFS page.
  */
 TEST_F(hmm, anon_write_hugetlbfs)
···
 	struct hmm_buffer *buffer;
 	unsigned long npages;
 	unsigned long size;
-	unsigned long default_hsize;
+	unsigned long default_hsize = default_huge_page_size();
 	unsigned long i;
 	int *ptr;
 	int ret;

-	default_hsize = file_read_ulong("/proc/meminfo", "Hugepagesize:");
-	if (default_hsize < 0 || default_hsize*1024 < default_hsize)
+	if (!default_hsize)
 		SKIP(return, "Huge page size could not be determined");
-	default_hsize = default_hsize*1024; /* KB to B */

 	size = ALIGN(TWOMEG, default_hsize);
 	npages = size >> self->page_shift;
···
 	struct hmm_buffer *buffer;
 	unsigned long npages;
 	unsigned long size;
-	unsigned long default_hsize;
+	unsigned long default_hsize = default_huge_page_size();
 	int *ptr;
 	unsigned char *m;
 	int ret;
···

 	/* Skip test if we can't allocate a hugetlbfs page. */

-	default_hsize = file_read_ulong("/proc/meminfo", "Hugepagesize:");
-	if (default_hsize < 0 || default_hsize*1024 < default_hsize)
+	if (!default_hsize)
 		SKIP(return, "Huge page size could not be determined");
-	default_hsize = default_hsize*1024; /* KB to B */

 	size = ALIGN(TWOMEG, default_hsize);
 	npages = size >> self->page_shift;
···
 	int *ptr;
 	int ret;

-	size = TWOMEG;
+	size = read_pmd_pagesize();

 	buffer = malloc(sizeof(*buffer));
 	ASSERT_NE(buffer, NULL);
···
 	int ret;
 	int val;

-	size = TWOMEG;
+	size = read_pmd_pagesize();

 	buffer = malloc(sizeof(*buffer));
 	ASSERT_NE(buffer, NULL);
···
 	int *ptr;
 	int ret;

-	size = TWOMEG;
+	size = read_pmd_pagesize();

 	buffer = malloc(sizeof(*buffer));
 	ASSERT_NE(buffer, NULL);
···
 	int *ptr;
 	int ret;

-	size = TWOMEG;
+	size = read_pmd_pagesize();

 	buffer = malloc(sizeof(*buffer));
 	ASSERT_NE(buffer, NULL);
···
 {
 	struct hmm_buffer *buffer;
 	unsigned long npages;
-	unsigned long size = TWOMEG;
+	unsigned long size = read_pmd_pagesize();
 	unsigned long i;
 	void *old_ptr;
 	void *map;
···
 {
 	struct hmm_buffer *buffer;
 	unsigned long npages;
-	unsigned long size = TWOMEG;
+	unsigned long size = read_pmd_pagesize();
 	unsigned long i;
 	void *old_ptr, *new_ptr = NULL;
 	void *map;
···
 	int *ptr;
 	int ret;

-	size = TWOMEG;
+	size = read_pmd_pagesize();

 	buffer = malloc(sizeof(*buffer));
 	ASSERT_NE(buffer, NULL);
···
 	int *ptr;
 	int ret;

-	size = TWOMEG;
+	size = read_pmd_pagesize();

 	buffer = malloc(sizeof(*buffer));
 	ASSERT_NE(buffer, NULL);
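Replacing the TWOMEG constant matters because the PMD-mapped huge page size depends on the architecture and the base page size (arm64 with 64 KiB pages, for example, uses 512 MiB). A minimal sketch of the page-count arithmetic used in these hunks follows. `ALIGN_UP` and `npages_for()` are hypothetical names, not the selftest's own macros, and the page size is assumed to be a power of two.

```c
/* Round x up to the next multiple of a; a must be a power of two. */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((a) - 1))

/*
 * Hypothetical helper: number of base pages needed to cover size bytes,
 * where page_size == 1UL << page_shift.
 */
static unsigned long npages_for(unsigned long size, unsigned long page_size,
				unsigned int page_shift)
{
	return ALIGN_UP(size, page_size) >> page_shift;
}
```

With 4 KiB base pages a 2 MiB buffer spans 512 pages, while with 64 KiB base pages a 512 MiB buffer spans 8192 pages, so a test that hardcodes 2 MiB would never cover a whole PMD on the latter configuration.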
+69 -22
tools/testing/selftests/mm/hugetlb_dio.c
···
 #include <unistd.h>
 #include <string.h>
 #include <sys/mman.h>
+#include <sys/syscall.h>
 #include "vm_util.h"
 #include "kselftest.h"

-void run_dio_using_hugetlb(unsigned int start_off, unsigned int end_off)
+#ifndef STATX_DIOALIGN
+#define STATX_DIOALIGN	0x00002000U
+#endif
+
+static int get_dio_alignment(int fd)
 {
-	int fd;
+	struct statx stx;
+	int ret;
+
+	ret = syscall(__NR_statx, fd, "", AT_EMPTY_PATH, STATX_DIOALIGN, &stx);
+	if (ret < 0)
+		return -1;
+
+	/*
+	 * If STATX_DIOALIGN is unsupported, assume no alignment
+	 * constraint and let the test proceed.
+	 */
+	if (!(stx.stx_mask & STATX_DIOALIGN) || !stx.stx_dio_offset_align)
+		return 1;
+
+	return stx.stx_dio_offset_align;
+}
+
+static bool check_dio_alignment(unsigned int start_off,
+				unsigned int end_off, unsigned int align)
+{
+	unsigned int writesize = end_off - start_off;
+
+	/*
+	 * The kernel's DIO path checks that file offset, length, and
+	 * buffer address are all multiples of dio_offset_align.  When
+	 * this test case's parameters don't satisfy that, the write
+	 * would fail with -EINVAL before exercising the hugetlb unpin
+	 * path, so skip.
+	 */
+	if (start_off % align != 0 || writesize % align != 0) {
+		ksft_test_result_skip("DIO align=%u incompatible with offset %u writesize %u\n",
+				      align, start_off, writesize);
+		return false;
+	}
+
+	return true;
+}
+
+static void run_dio_using_hugetlb(int fd, unsigned int start_off,
+				  unsigned int end_off, unsigned int align)
+{
 	char *buffer =  NULL;
 	char *orig_buffer = NULL;
 	size_t h_pagesize = 0;
···
 	const int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB;
 	const int mmap_prot  = PROT_READ | PROT_WRITE;

+	if (!check_dio_alignment(start_off, end_off, align))
+		return;
+
 	writesize = end_off - start_off;

 	/* Get the default huge page size */
···
 	if (!h_pagesize)
 		ksft_exit_fail_msg("Unable to determine huge page size\n");

-	/* Open the file to DIO */
-	fd = open("/tmp", O_TMPFILE | O_RDWR | O_DIRECT, 0664);
-	if (fd < 0)
-		ksft_exit_fail_perror("Error opening file\n");
+	/* Reset file position since fd is shared across tests */
+	if (lseek(fd, 0, SEEK_SET) < 0)
+		ksft_exit_fail_perror("lseek failed\n");

 	/* Get the free huge pages before allocation */
 	free_hpage_b = get_free_hugepages();
···

 	/* unmap the huge page */
 	munmap(orig_buffer, h_pagesize);
-	close(fd);

 	/* Get the free huge pages after unmap*/
 	free_hpage_a = get_free_hugepages();
···

 int main(void)
 {
-	size_t pagesize = 0;
-	int fd;
+	int fd, align;
+	const size_t pagesize = psize();

 	ksft_print_header();
-
-	/* Open the file to DIO */
-	fd = open("/tmp", O_TMPFILE | O_RDWR | O_DIRECT, 0664);
-	if (fd < 0)
-		ksft_exit_skip("Unable to allocate file: %s\n", strerror(errno));
-	close(fd);

 	/* Check if huge pages are free */
 	if (!get_free_hugepages())
 		ksft_exit_skip("No free hugepage, exiting\n");

+	fd = open("/tmp", O_TMPFILE | O_RDWR | O_DIRECT, 0664);
+	if (fd < 0)
+		ksft_exit_skip("Unable to allocate file: %s\n", strerror(errno));
+
+	align = get_dio_alignment(fd);
+	if (align < 0)
+		ksft_exit_skip("Unable to obtain DIO alignment: %s\n",
+			       strerror(errno));
 	ksft_set_plan(4);

-	/* Get base page size */
-	pagesize  = psize();
-
 	/* start and end is aligned to pagesize */
-	run_dio_using_hugetlb(0, (pagesize * 3));
+	run_dio_using_hugetlb(fd, 0, (pagesize * 3), align);

 	/* start is aligned but end is not aligned */
-	run_dio_using_hugetlb(0, (pagesize * 3) - (pagesize / 2));
+	run_dio_using_hugetlb(fd, 0, (pagesize * 3) - (pagesize / 2), align);

 	/* start is unaligned and end is aligned */
-	run_dio_using_hugetlb(pagesize / 2, (pagesize * 3));
+	run_dio_using_hugetlb(fd, pagesize / 2, (pagesize * 3), align);

 	/* both start and end are unaligned */
-	run_dio_using_hugetlb(pagesize / 2, (pagesize * 3) + (pagesize / 2));
+	run_dio_using_hugetlb(fd, pagesize / 2, (pagesize * 3) + (pagesize / 2), align);
+
+	close(fd);

 	ksft_finished();
 }
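To see which of the four test cases survive a given alignment, the check can be isolated from the kselftest plumbing. The sketch below is a stand-alone restatement of the same predicate. `dio_range_aligned()` is a hypothetical name, and the zero-alignment guard is an addition of this sketch, not something taken from the patch.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: true when both the starting offset and the length
 * of the write [start_off, end_off) are multiples of align.
 */
static bool dio_range_aligned(unsigned int start_off, unsigned int end_off,
			      unsigned int align)
{
	unsigned int len = end_off - start_off;

	/* Guard against division by zero; not part of the original check. */
	if (align == 0)
		return false;

	return start_off % align == 0 && len % align == 0;
}
```

Assuming 4 KiB base pages, a filesystem that reports a 512-byte alignment accepts all four cases, since half a page is still a multiple of 512. One that reports 4096 accepts only the fully page-aligned case, and the other three are skipped.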
+88
tools/testing/selftests/mm/merge.c
···
 	return 0;
 }

+#ifdef __NR_mseal
+static int sys_mseal(void *ptr, size_t len, unsigned long flags)
+{
+	return syscall(__NR_mseal, (unsigned long)ptr, len, flags);
+}
+#else
+static int sys_mseal(void *ptr, size_t len, unsigned long flags)
+{
+	errno = ENOSYS;
+	return -1;
+}
+#endif
+
 FIXTURE_SETUP(merge)
 {
 	self->page_size = psize();
···
 	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
 	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
 	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size);
+}
+
+TEST_F(merge, merge_vmas_with_mseal)
+{
+	unsigned int page_size = self->page_size;
+	struct procmap_fd *procmap = &self->procmap;
+	char *ptr, *ptr2, *ptr3;
+	/* We need our own as cannot munmap() once sealed. */
+	char *carveout;
+
+	/* Invalid mseal() call to see if implemented. */
+	ASSERT_EQ(sys_mseal(NULL, 0, ~0UL), -1);
+	if (errno == ENOSYS)
+		SKIP(return, "mseal not supported, skipping.");
+
+	/* Map carveout. */
+	carveout = mmap(NULL, 5 * page_size, PROT_NONE,
+			MAP_PRIVATE | MAP_ANON, -1, 0);
+	ASSERT_NE(carveout, MAP_FAILED);
+
+	/*
+	 * Map 3 separate VMAs:
+	 *
+	 * |-----------|-----------|-----------|
+	 * |    RW     |    RWE    |    RO     |
+	 * |-----------|-----------|-----------|
+	 *      ptr        ptr2        ptr3
+	 */
+	ptr = mmap(&carveout[page_size], page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	ptr2 = mmap(&carveout[2 * page_size], page_size,
+		    PROT_READ | PROT_WRITE | PROT_EXEC,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr2, MAP_FAILED);
+	ptr3 = mmap(&carveout[3 * page_size], page_size, PROT_READ,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+	/*
+	 * mseal the second VMA:
+	 *
+	 * |-----------|-----------|-----------|
+	 * |    RW     |   RWES    |    RO     |
+	 * |-----------|-----------|-----------|
+	 *      ptr        ptr2        ptr3
+	 */
+	ASSERT_EQ(sys_mseal(ptr2, page_size, 0), 0);
+
+	/* Make first VMA mergeable upon mseal. */
+	ASSERT_EQ(mprotect(ptr, page_size,
+			   PROT_READ | PROT_WRITE | PROT_EXEC), 0);
+	/*
+	 * At this point we have:
+	 *
+	 * |-----------|-----------|-----------|
+	 * |    RWE    |   RWES    |    RO     |
+	 * |-----------|-----------|-----------|
+	 *      ptr        ptr2        ptr3
+	 *
+	 * Now mseal all of the VMAs.
+	 */
+	ASSERT_EQ(sys_mseal(ptr, 3 * page_size, 0), 0);
+
+	/*
+	 * We should end up with:
+	 *
+	 * |-----------------------|-----------|
+	 * |         RWES          |    ROS    |
+	 * |-----------------------|-----------|
+	 *            ptr               ptr3
+	 */
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 2 * page_size);
 }

 TEST_F(merge_with_fork, mremap_faulted_to_unfaulted_prev)
+3 -1
tools/testing/selftests/mm/soft-dirty.c
···
 	int i, ret;

 	if (!thp_is_enabled()) {
-		ksft_test_result_skip("Transparent Hugepages not available\n");
+		ksft_print_msg("Transparent Hugepages not available\n");
+		ksft_test_result_skip("Test %s huge page allocation\n", __func__);
+		ksft_test_result_skip("Test %s huge page dirty bit\n", __func__);
 		return;
 	}
+4 -15
tools/testing/selftests/mm/split_huge_page_test.c
···
 #include <time.h>
 #include "vm_util.h"
 #include "kselftest.h"
+#include "thp_settings.h"

 uint64_t pagesize;
 unsigned int pageshift;
···

 	free(vaddr_orders);
 	return status;
-}
-
-static void write_file(const char *path, const char *buf, size_t buflen)
-{
-	int fd;
-	ssize_t numwritten;
-
-	fd = open(path, O_WRONLY);
-	if (fd == -1)
-		ksft_exit_fail_msg("%s open failed: %s\n", path, strerror(errno));
-
-	numwritten = write(fd, buf, buflen - 1);
-	close(fd);
-	if (numwritten < 1)
-		ksft_exit_fail_msg("Write failed\n");
 }

 static void write_debugfs(const char *fmt, ...)
···
 		ksft_print_msg("Please run the benchmark as root\n");
 		ksft_finished();
 	}
+
+	if (!thp_is_enabled())
+		ksft_exit_skip("Transparent Hugepages not available\n");

 	if (argc > 1)
 		optional_xfs_path = argv[1];
+3 -32
tools/testing/selftests/mm/thp_settings.c
···
 #include <string.h>
 #include <unistd.h>

+#include "vm_util.h"
 #include "thp_settings.h"

 #define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/"
···
 	return (unsigned int) numread;
 }

-int write_file(const char *path, const char *buf, size_t buflen)
-{
-	int fd;
-	ssize_t numwritten;
-
-	fd = open(path, O_WRONLY);
-	if (fd == -1) {
-		printf("open(%s)\n", path);
-		exit(EXIT_FAILURE);
-		return 0;
-	}
-
-	numwritten = write(fd, buf, buflen - 1);
-	close(fd);
-	if (numwritten < 1) {
-		printf("write(%s)\n", buf);
-		exit(EXIT_FAILURE);
-		return 0;
-	}
-
-	return (unsigned int) numwritten;
-}
-
 unsigned long read_num(const char *path)
 {
 	char buf[21];
···
 	char buf[21];

 	sprintf(buf, "%ld", num);
-	if (!write_file(path, buf, strlen(buf) + 1)) {
-		perror(path);
-		exit(EXIT_FAILURE);
-	}
+	write_file(path, buf, strlen(buf) + 1);
 }

 int thp_read_string(const char *name, const char * const strings[])
···
 		printf("%s: Pathname is too long\n", __func__);
 		exit(EXIT_FAILURE);
 	}
-
-	if (!write_file(path, val, strlen(val) + 1)) {
-		perror(path);
-		exit(EXIT_FAILURE);
-	}
+	write_file(path, val, strlen(val) + 1);
 }

 unsigned long thp_read_num(const char *name)
-1
tools/testing/selftests/mm/thp_settings.h
···
 };

 int read_file(const char *path, char *buf, size_t buflen);
-int write_file(const char *path, const char *buf, size_t buflen);
 unsigned long read_num(const char *path);
 void write_num(const char *path, unsigned long num);
+4
tools/testing/selftests/mm/transhuge-stress.c
···
 #include <sys/mman.h>
 #include "vm_util.h"
 #include "kselftest.h"
+#include "thp_settings.h"

 int backing_fd = -1;
 int mmap_flags = MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE;
···
 	int duration = 0;

 	ksft_print_header();
+
+	if (!thp_is_enabled())
+		ksft_exit_skip("Transparent Hugepages not available\n");

 	ram = sysconf(_SC_PHYS_PAGES);
 	if (ram > SIZE_MAX / psize() / 4)
+24
tools/testing/selftests/mm/vm_util.c
···

 	return ret > 0 ? 0 : -errno;
 }
+
+void write_file(const char *path, const char *buf, size_t buflen)
+{
+	int fd, saved_errno;
+	ssize_t numwritten;
+
+	if (buflen < 2)
+		ksft_exit_fail_msg("Incorrect buffer len: %zu\n", buflen);
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		ksft_exit_fail_msg("%s open failed: %s\n", path, strerror(errno));
+
+	numwritten = write(fd, buf, buflen - 1);
+	saved_errno = errno;
+	close(fd);
+	errno = saved_errno;
+	if (numwritten < 0)
+		ksft_exit_fail_msg("%s write(%.*s) failed: %s\n", path, (int)(buflen - 1),
+				buf, strerror(errno));
+	if (numwritten != buflen - 1)
+		ksft_exit_fail_msg("%s write(%.*s) is truncated, expected %zu bytes, got %zd bytes\n",
+				path, (int)(buflen - 1), buf, buflen - 1, numwritten);
+}
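The consolidated helper exits the test on any failure, which suits selftests but hides the individual checks. The following variant is a sketch that returns an error code instead, so each condition can be observed. `write_file_checked()` is a hypothetical name, and mapping a short write to `-EIO` is an assumption of this sketch, not behaviour taken from the patch.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical variant: write the string in buf to an existing file.
 * buflen counts the trailing NUL, which is not written. Returns 0 on
 * success or a negative errno value.
 */
static int write_file_checked(const char *path, const char *buf, size_t buflen)
{
	ssize_t written;
	int fd, err;

	/* Need at least one payload byte plus the NUL. */
	if (buflen < 2)
		return -EINVAL;

	fd = open(path, O_WRONLY);
	if (fd == -1)
		return -errno;

	written = write(fd, buf, buflen - 1);
	err = errno;		/* close() may overwrite errno */
	close(fd);

	if (written < 0)
		return -err;
	if ((size_t)written != buflen - 1)
		return -EIO;	/* short write, treated as an error here */

	return 0;
}
```

As in the patch, errno is captured before `close()` so that the reported error belongs to the `write()` call, and a short write is distinguished from a failed one.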
+2
tools/testing/selftests/mm/vm_util.h
···

 #define PAGEMAP_PRESENT(ent)	(((ent) & (1ull << 63)) != 0)
 #define PAGEMAP_PFN(ent)	((ent) & ((1ull << 55) - 1))
+
+void write_file(const char *path, const char *buf, size_t buflen);