Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: handle poisoning of pfn without struct pages

Poison (or ECC) errors can be very common on a large cluster. The
kernel MM currently does not handle ECC errors / poison on a memory region
that is not backed by struct pages. If a memory region is mapped using
remap_pfn_range(), for example, but not added to the kernel, MM will not
have struct pages associated with it. Add a new mechanism to handle memory
failure on such memory.
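
As an illustration only (not part of this patch), such memory typically
comes from a driver mmap handler along these lines; mydrv_mmap and
mydrv_base_phys are hypothetical names:

  /*
   * Hypothetical mmap handler: the device memory mapped here is never
   * added to the kernel memmap, so no struct pages exist for these PFNs.
   */
  static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
  {
          unsigned long pfn = mydrv_base_phys >> PAGE_SHIFT; /* device SPA */

          vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
          return remap_pfn_range(vma, vma->vm_start, pfn + vma->vm_pgoff,
                                 vma->vm_end - vma->vm_start,
                                 vma->vm_page_prot);
  }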

Make kernel MM expose a function to allow modules managing the device
memory to register the device memory SPA and the address space associated
with it. MM maintains this information as an interval tree. On poison, MM
can search for the range to which the poisoned PFN belongs and use the
address_space to determine the mapping VMAs.
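
As a sketch of the intended usage (not part of this patch), a module
managing device memory could register its range roughly as follows;
mydrv_register_poison_range and mydrv_pfn_space are hypothetical names,
while struct pfn_address_space and register_pfn_address_space() are the
ones added below:

  #include <linux/memory-failure.h>

  static struct pfn_address_space mydrv_pfn_space;

  static int mydrv_register_poison_range(struct address_space *mapping,
                                         unsigned long start_pfn,
                                         unsigned long nr_pages)
  {
          mydrv_pfn_space.node.start = start_pfn;
          mydrv_pfn_space.node.last = start_pfn + nr_pages - 1;
          mydrv_pfn_space.mapping = mapping;

          /* Fails with -EBUSY if the range overlaps a registered one */
          return register_pfn_address_space(&mydrv_pfn_space);
  }

The module would call unregister_pfn_address_space() with the same
structure when tearing the mapping down.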

In this implementation, kernel MM follows the following sequence that is
largely similar to the memory_failure() handler for struct page backed
memory:

1. memory_failure() is triggered on reception of a poison error. An
absence of struct page is detected and consequently
memory_failure_pfn() is executed.

2. memory_failure_pfn() collects the processes mapped to the PFN.

3. memory_failure_pfn() sends SIGBUS to all the processes mapping the
faulty PFN using kill_procs().

Note that there is one primary difference from the handling of poison on
struct page backed memory: unmapping of the faulty PFN is skipped. This
is done to accommodate the huge PFNMAP support added recently [1], which
enables VM_PFNMAP vmas to map at the PMD or PUD level. Poisoning a PFN
mapped in such a way would require breaking the PMD/PUD mapping into PTEs
that would then get mirrored into the S2 (stage-2) tables. This can
greatly increase the cost of table walks and have a major performance
impact.

Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]
Link: https://lkml.kernel.org/r/20251102184434.2406-3-ankita@nvidia.com
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Cc: Aniket Agashe <aniketa@nvidia.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matthew R. Ochs <mochs@nvidia.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Neo Jia <cjia@nvidia.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shuai Xue <xueshuai@linux.alibaba.com>
Cc: Smita Koralahalli Channabasappa <smita.koralahallichannabasappa@amd.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tarun Gupta <targupta@nvidia.com>
Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Cc: Vikram Sethi <vsethi@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

authored by Ankit Agrawal and committed by Andrew Morton
2ec41967 30d0a129

+165 -1
+1
MAINTAINERS
···
 R:	Naoya Horiguchi <nao.horiguchi@gmail.com>
 L:	linux-mm@kvack.org
 S:	Maintained
+F:	include/linux/memory-failure.h
 F:	mm/hwpoison-inject.c
 F:	mm/memory-failure.c
+17
include/linux/memory-failure.h
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_FAILURE_H
+#define _LINUX_MEMORY_FAILURE_H
+
+#include <linux/interval_tree.h>
+
+struct pfn_address_space;
+
+struct pfn_address_space {
+	struct interval_tree_node node;
+	struct address_space *mapping;
+};
+
+int register_pfn_address_space(struct pfn_address_space *pfn_space);
+void unregister_pfn_address_space(struct pfn_address_space *pfn_space);
+
+#endif /* _LINUX_MEMORY_FAILURE_H */
+1
include/linux/mm.h
···
 	MF_MSG_DAX,
 	MF_MSG_UNSPLIT_THP,
 	MF_MSG_ALREADY_POISONED,
+	MF_MSG_PFN_MAP,
 	MF_MSG_UNKNOWN,
 };
+1
include/ras/ras_event.h
···
 	EM ( MF_MSG_DAX, "dax page" )				\
 	EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" )		\
 	EM ( MF_MSG_ALREADY_POISONED, "already poisoned" )	\
+	EM ( MF_MSG_PFN_MAP, "non struct page pfn" )		\
 	EMe ( MF_MSG_UNKNOWN, "unknown page" )

 /*
+1
mm/Kconfig
···
 	depends on ARCH_SUPPORTS_MEMORY_FAILURE
 	bool "Enable recovery from hardware memory errors"
 	select RAS
+	select INTERVAL_TREE
 	help
 	  Enables code to recover from some memory failures on systems
 	  with MCA recovery. This allows a system to continue running
+144 -1
mm/memory-failure.c
···
 #include <linux/kernel.h>
 #include <linux/mm.h>
+#include <linux/memory-failure.h>
 #include <linux/page-flags.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/task.h>
···
 		.extra2		= SYSCTL_ONE,
 	}
 };
+
+static struct rb_root_cached pfn_space_itree = RB_ROOT_CACHED;
+
+static DEFINE_MUTEX(pfn_space_lock);

 /*
  * Return values:
···
 	[MF_MSG_DAX]			= "dax page",
 	[MF_MSG_UNSPLIT_THP]		= "unsplit thp",
 	[MF_MSG_ALREADY_POISONED]	= "already poisoned page",
+	[MF_MSG_PFN_MAP]		= "non struct page pfn",
 	[MF_MSG_UNKNOWN]		= "unknown page",
 };
···
 {
 	trace_memory_failure_event(pfn, type, result);

-	if (type != MF_MSG_ALREADY_POISONED) {
+	if (type != MF_MSG_ALREADY_POISONED && type != MF_MSG_PFN_MAP) {
 		num_poisoned_pages_inc(pfn);
 		update_per_node_mf_stats(pfn, result);
 	}
···
 	kill_procs(&tokill, true, pfn, flags);
 }

+int register_pfn_address_space(struct pfn_address_space *pfn_space)
+{
+	guard(mutex)(&pfn_space_lock);
+
+	if (interval_tree_iter_first(&pfn_space_itree,
+				     pfn_space->node.start,
+				     pfn_space->node.last))
+		return -EBUSY;
+
+	interval_tree_insert(&pfn_space->node, &pfn_space_itree);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(register_pfn_address_space);
+
+void unregister_pfn_address_space(struct pfn_address_space *pfn_space)
+{
+	guard(mutex)(&pfn_space_lock);
+
+	if (interval_tree_iter_first(&pfn_space_itree,
+				     pfn_space->node.start,
+				     pfn_space->node.last))
+		interval_tree_remove(&pfn_space->node, &pfn_space_itree);
+}
+EXPORT_SYMBOL_GPL(unregister_pfn_address_space);
+
+static void add_to_kill_pfn(struct task_struct *tsk,
+			    struct vm_area_struct *vma,
+			    struct list_head *to_kill,
+			    unsigned long pfn)
+{
+	struct to_kill *tk;
+
+	tk = kmalloc(sizeof(*tk), GFP_ATOMIC);
+	if (!tk) {
+		pr_info("Unable to kill proc %d\n", tsk->pid);
+		return;
+	}
+
+	/* Check for pgoff not backed by struct page */
+	tk->addr = vma_address(vma, pfn, 1);
+	tk->size_shift = PAGE_SHIFT;
+
+	if (tk->addr == -EFAULT)
+		pr_info("Unable to find address %lx in %s\n",
+			pfn, tsk->comm);
+
+	get_task_struct(tsk);
+	tk->tsk = tsk;
+	list_add_tail(&tk->nd, to_kill);
+}
+
+/*
+ * Collect processes when the error hit a PFN not backed by struct page.
+ */
+static void collect_procs_pfn(struct address_space *mapping,
+			      unsigned long pfn, struct list_head *to_kill)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+
+	i_mmap_lock_read(mapping);
+	rcu_read_lock();
+	for_each_process(tsk) {
+		struct task_struct *t = tsk;
+
+		t = task_early_kill(tsk, true);
+		if (!t)
+			continue;
+		vma_interval_tree_foreach(vma, &mapping->i_mmap, pfn, pfn) {
+			if (vma->vm_mm == t->mm)
+				add_to_kill_pfn(t, vma, to_kill, pfn);
+		}
+	}
+	rcu_read_unlock();
+	i_mmap_unlock_read(mapping);
+}
+
+/**
+ * memory_failure_pfn - Handle memory failure on a page not backed by
+ * struct page.
+ * @pfn: Page Number of the corrupted page
+ * @flags: fine tune action taken
+ *
+ * Return:
+ *   0 - success,
+ *   -EBUSY - Page PFN does not belong to any address space mapping.
+ */
+static int memory_failure_pfn(unsigned long pfn, int flags)
+{
+	struct interval_tree_node *node;
+	LIST_HEAD(tokill);
+
+	scoped_guard(mutex, &pfn_space_lock) {
+		bool mf_handled = false;
+
+		/*
+		 * Modules registers with MM the address space mapping to
+		 * the device memory they manage. Iterate to identify
+		 * exactly which address space has mapped to this failing
+		 * PFN.
+		 */
+		for (node = interval_tree_iter_first(&pfn_space_itree, pfn, pfn); node;
+		     node = interval_tree_iter_next(node, pfn, pfn)) {
+			struct pfn_address_space *pfn_space =
+				container_of(node, struct pfn_address_space, node);
+
+			collect_procs_pfn(pfn_space->mapping, pfn, &tokill);
+
+			mf_handled = true;
+		}
+
+		if (!mf_handled)
+			return action_result(pfn, MF_MSG_PFN_MAP, MF_IGNORED);
+	}
+
+	/*
+	 * Unlike System-RAM there is no possibility to swap in a different
+	 * physical page at a given virtual address, so all userspace
+	 * consumption of direct PFN memory necessitates SIGBUS (i.e.
+	 * MF_MUST_KILL)
+	 */
+	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+
+	kill_procs(&tokill, true, pfn, flags);
+
+	return action_result(pfn, MF_MSG_PFN_MAP, MF_RECOVERED);
+}
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
···
 	res = arch_memory_failure(pfn, flags);
 	if (res == 0)
 		goto unlock_mutex;
+
+	if (!pfn_valid(pfn) && !arch_is_platform_page(PFN_PHYS(pfn))) {
+		/*
+		 * The PFN is not backed by struct page.
+		 */
+		res = memory_failure_pfn(pfn, flags);
+		goto unlock_mutex;
+	}

 	if (pfn_valid(pfn)) {
 		pgmap = get_dev_pagemap(pfn);