Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm memory-failure update from Dave Jiang:
"As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.

In order to support reliable reverse mapping of user space addresses:

1/ Add new locking in the memory_failure() rmap path to prevent races
that would typically be handled by the page lock.

2/ Since dev_pagemap pages are hidden from the page allocator and the
"compound page" accounting machinery, add a mechanism to determine
the size of the mapping that encompasses a given poisoned pfn.

3/ Given pmem errors can be repaired, change the speculatively
accessed poison protection, mce_unmap_kpfn(), to be reversible and
otherwise allow ongoing access from the kernel.

A side effect of this enabling is that MADV_HWPOISON becomes usable
for dax mappings, however the primary motivation is to allow the
system to survive userspace consumption of hardware-poison via dax.
Specifically the current behavior is:

    mce: Uncorrected hardware memory error in user-access at af34214200
    {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    mce: [Hardware Error]: Machine check events logged
    {1}[Hardware Error]: event severity: corrected
    Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
    [..]
    Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
    mce: Memory error not recovered
    <reboot>

...and with these changes:

    Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
    Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
    Memory failure: 0x20cb00: recovery action for dax page: Recovered

Given all the cross dependencies I propose taking this through
nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
folks"

* tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm, pmem: Restore page attributes when clearing errors
  x86/memory_failure: Introduce {set, clear}_mce_nospec()
  x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
  mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  filesystem-dax: Introduce dax_lock_mapping_entry()
  mm, memory_failure: Collect mapping size in collect_procs()
  mm, madvise_inject_error: Let memory_failure() optionally take a page reference
  mm, dev_pagemap: Do not clear ->mapping on final put
  mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
  filesystem-dax: Set page->index
  device-dax: Set page->index
  device-dax: Enable page_mapping()
  device-dax: Convert to vmf_insert_mixed and vm_fault_t

+481 -135
+42
arch/x86/include/asm/set_memory.h
···
 void set_kernel_text_rw(void);
 void set_kernel_text_ro(void);
 
+#ifdef CONFIG_X86_64
+static inline int set_mce_nospec(unsigned long pfn)
+{
+	unsigned long decoy_addr;
+	int rc;
+
+	/*
+	 * Mark the linear address as UC to make sure we don't log more
+	 * errors because of speculative access to the page.
+	 * We would like to just call:
+	 *      set_memory_uc((unsigned long)pfn_to_kaddr(pfn), 1);
+	 * but doing that would radically increase the odds of a
+	 * speculative access to the poison page because we'd have
+	 * the virtual address of the kernel 1:1 mapping sitting
+	 * around in registers.
+	 * Instead we get tricky. We create a non-canonical address
+	 * that looks just like the one we want, but has bit 63 flipped.
+	 * This relies on set_memory_uc() properly sanitizing any __pa()
+	 * results with __PHYSICAL_MASK or PTE_PFN_MASK.
+	 */
+	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
+
+	rc = set_memory_uc(decoy_addr, 1);
+	if (rc)
+		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
+	return rc;
+}
+#define set_mce_nospec set_mce_nospec
+
+/* Restore full speculative operation to the pfn. */
+static inline int clear_mce_nospec(unsigned long pfn)
+{
+	return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
+}
+#define clear_mce_nospec clear_mce_nospec
+#else
+/*
+ * Few people would run a 32-bit kernel on a machine that supports
+ * recoverable errors because they have too much memory to boot 32-bit.
+ */
+#endif
+
 #endif /* _ASM_X86_SET_MEMORY_H */
-15
arch/x86/kernel/cpu/mcheck/mce-internal.h
···
 static inline void mce_unregister_injector_chain(struct notifier_block *nb) { }
 #endif
 
-#ifndef CONFIG_X86_64
-/*
- * On 32-bit systems it would be difficult to safely unmap a poison page
- * from the kernel 1:1 map because there are no non-canonical addresses that
- * we can use to refer to the address without risking a speculative access.
- * However, this isn't much of an issue because:
- * 1) Few unmappable pages are in the 1:1 map. Most are in HIGHMEM which
- *    are only mapped into the kernel as needed
- * 2) Few people would run a 32-bit kernel on a machine that supports
- *    recoverable errors because they have too much memory to boot 32-bit.
- */
-static inline void mce_unmap_kpfn(unsigned long pfn) {}
-#define mce_unmap_kpfn mce_unmap_kpfn
-#endif
-
 struct mca_config {
 	bool dont_log_ce;
 	bool cmci_disabled;
+3 -35
arch/x86/kernel/cpu/mcheck/mce.c
···
 #include <linux/irq_work.h>
 #include <linux/export.h>
 #include <linux/jump_label.h>
+#include <linux/set_memory.h>
 
 #include <asm/intel-family.h>
 #include <asm/processor.h>
···
 #include <asm/mce.h>
 #include <asm/msr.h>
 #include <asm/reboot.h>
-#include <asm/set_memory.h>
 
 #include "mce-internal.h"
 
···
 static struct irq_work mce_irq_work;
 
 static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
-
-#ifndef mce_unmap_kpfn
-static void mce_unmap_kpfn(unsigned long pfn);
-#endif
 
 /*
  * CPU/chipset specific EDAC code can register a notifier call here to print
···
 	if (mce_usable_address(mce) && (mce->severity == MCE_AO_SEVERITY)) {
 		pfn = mce->addr >> PAGE_SHIFT;
 		if (!memory_failure(pfn, 0))
-			mce_unmap_kpfn(pfn);
+			set_mce_nospec(pfn);
 	}
 
 	return NOTIFY_OK;
···
 	if (ret)
 		pr_err("Memory error not recovered");
 	else
-		mce_unmap_kpfn(m->addr >> PAGE_SHIFT);
+		set_mce_nospec(m->addr >> PAGE_SHIFT);
 	return ret;
 }
-
-#ifndef mce_unmap_kpfn
-static void mce_unmap_kpfn(unsigned long pfn)
-{
-	unsigned long decoy_addr;
-
-	/*
-	 * Unmap this page from the kernel 1:1 mappings to make sure
-	 * we don't log more errors because of speculative access to
-	 * the page.
-	 * We would like to just call:
-	 *      set_memory_np((unsigned long)pfn_to_kaddr(pfn), 1);
-	 * but doing that would radically increase the odds of a
-	 * speculative access to the poison page because we'd have
-	 * the virtual address of the kernel 1:1 mapping sitting
-	 * around in registers.
-	 * Instead we get tricky. We create a non-canonical address
-	 * that looks just like the one we want, but has bit 63 flipped.
-	 * This relies on set_memory_np() not checking whether we passed
-	 * a legal address.
-	 */
-
-	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
-
-	if (set_memory_np(decoy_addr, 1))
-		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
-}
-#endif
 
 
 /*
+16
arch/x86/mm/pat.c
···
 	return 0;
 }
 
+static u64 sanitize_phys(u64 address)
+{
+	/*
+	 * When changing the memtype for pages containing poison allow
+	 * for a "decoy" virtual address (bit 63 clear) passed to
+	 * set_memory_X(). __pa() on a "decoy" address results in a
+	 * physical address with bit 63 set.
+	 */
+	return address & __PHYSICAL_MASK;
+}
+
 /*
  * req_type typically has one of the:
  * - _PAGE_CACHE_MODE_WB
···
 	int is_range_ram;
 	int err = 0;
 
+	start = sanitize_phys(start);
+	end = sanitize_phys(end);
 	BUG_ON(start >= end); /* end is exclusive */
 
 	if (!pat_enabled()) {
···
 
 	if (!pat_enabled())
 		return 0;
+
+	start = sanitize_phys(start);
+	end = sanitize_phys(end);
 
 	/* Low ISA region is always mapped WB. No need to track */
 	if (x86_platform.is_untracked_pat_range(start, end))
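The decoy-address trick used by set_mce_nospec() and undone by sanitize_phys() can be checked with plain integer arithmetic. The following is a minimal user-space sketch, not kernel code: the PAGE_OFFSET and PHYSICAL_MASK values are assumptions chosen for illustration (the real ones depend on kernel configuration and CPU), and simple_pa() is a hypothetical, simplified stand-in for __pa() on a direct-map address.

```c
#include <stdint.h>

/* Illustrative constants only; real values depend on kernel config and CPU. */
#define PAGE_SHIFT    12
#define BIT63         (UINT64_C(1) << 63)
/* Assumed direct-map base for this sketch. */
#define PAGE_OFFSET   UINT64_C(0xffff880000000000)
/* Assumed 52-bit physical address mask. */
#define PHYSICAL_MASK ((UINT64_C(1) << 52) - 1)

/* Same arithmetic as the decoy_addr computation in set_mce_nospec(). */
static uint64_t decoy_addr(uint64_t pfn)
{
	return (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT63);
}

/* Hypothetical simplification of __pa(): subtract the direct-map base. */
static uint64_t simple_pa(uint64_t vaddr)
{
	return vaddr - PAGE_OFFSET;
}

/* Same masking as sanitize_phys(): drop everything above the mask. */
static uint64_t sanitize_phys(uint64_t address)
{
	return address & PHYSICAL_MASK;
}
```

Under these assumptions the decoy virtual address has bit 63 clear, the translated "physical" address has bit 63 set, and masking recovers the true physical address of the pfn.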
+48 -27
drivers/dax/device.c
···
 	return -1;
 }
 
-static int __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
-	int rc = VM_FAULT_SIGBUS;
 	phys_addr_t phys;
-	pfn_t pfn;
 	unsigned int fault_size = PAGE_SIZE;
 
 	if (check_vma(dev_dax, vmf->vma, __func__))
···
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	rc = vm_insert_mixed(vmf->vma, vmf->address, pfn);
-
-	if (rc == -ENOMEM)
-		return VM_FAULT_OOM;
-	if (rc < 0 && rc != -EBUSY)
-		return VM_FAULT_SIGBUS;
-
-	return VM_FAULT_NOPAGE;
+	return vmf_insert_mixed(vmf->vma, vmf->address, *pfn);
 }
 
-static int __dev_dax_pmd_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
 	pgoff_t pgoff;
-	pfn_t pfn;
 	unsigned int fault_size = PMD_SIZE;
 
 	if (check_vma(dev_dax, vmf->vma, __func__))
···
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd, pfn,
+	return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd, *pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	unsigned long pud_addr = vmf->address & PUD_MASK;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
 	pgoff_t pgoff;
-	pfn_t pfn;
 	unsigned int fault_size = PUD_SIZE;
 
 
···
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pud(vmf->vma, vmf->address, vmf->pud, pfn,
+	return vmf_insert_pfn_pud(vmf->vma, vmf->address, vmf->pud, *pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
-static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	return VM_FAULT_FALLBACK;
 }
 #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
-static int dev_dax_huge_fault(struct vm_fault *vmf,
+static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
-	int rc, id;
 	struct file *filp = vmf->vma->vm_file;
+	unsigned long fault_size;
+	int rc, id;
+	pfn_t pfn;
 	struct dev_dax *dev_dax = filp->private_data;
 
 	dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) size = %d\n", current->comm,
···
 	id = dax_read_lock();
 	switch (pe_size) {
 	case PE_SIZE_PTE:
-		rc = __dev_dax_pte_fault(dev_dax, vmf);
+		fault_size = PAGE_SIZE;
+		rc = __dev_dax_pte_fault(dev_dax, vmf, &pfn);
 		break;
 	case PE_SIZE_PMD:
-		rc = __dev_dax_pmd_fault(dev_dax, vmf);
+		fault_size = PMD_SIZE;
+		rc = __dev_dax_pmd_fault(dev_dax, vmf, &pfn);
 		break;
 	case PE_SIZE_PUD:
-		rc = __dev_dax_pud_fault(dev_dax, vmf);
+		fault_size = PUD_SIZE;
+		rc = __dev_dax_pud_fault(dev_dax, vmf, &pfn);
 		break;
 	default:
 		rc = VM_FAULT_SIGBUS;
+	}
+
+	if (rc == VM_FAULT_NOPAGE) {
+		unsigned long i;
+		pgoff_t pgoff;
+
+		/*
+		 * In the device-dax case the only possibility for a
+		 * VM_FAULT_NOPAGE result is when device-dax capacity is
+		 * mapped. No need to consider the zero page, or racing
+		 * conflicting mappings.
+		 */
+		pgoff = linear_page_index(vmf->vma, vmf->address
+				& ~(fault_size - 1));
+		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
+			struct page *page;
+
+			page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
+			if (page->mapping)
+				continue;
+			page->mapping = filp->f_mapping;
+			page->index = pgoff + i;
+		}
 	}
 	dax_read_unlock(id);
 
 	return rc;
 }
 
-static int dev_dax_fault(struct vm_fault *vmf)
+static vm_fault_t dev_dax_fault(struct vm_fault *vmf)
 {
 	return dev_dax_huge_fault(vmf, PE_SIZE_PTE);
 }
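The loop added to dev_dax_huge_fault() gives every small page covered by a fault a ->mapping and a ->index, so that the reverse-map code can later find where the page is mapped. Below is a rough user-space sketch of that index arithmetic. The sketch_vma and sketch_page types and both helper functions are hypothetical stand-ins, not kernel structures, and sketch_linear_page_index() is assumed to follow the kernel helper's arithmetic for ordinary (non-hugetlb) VMAs.

```c
#include <stddef.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical, minimal stand-ins for the kernel structures. */
struct sketch_vma {
	unsigned long vm_start; /* first user address of the mapping */
	unsigned long vm_pgoff; /* offset into the device, in pages */
};

struct sketch_page {
	const void *mapping;    /* NULL until first associated */
	unsigned long index;    /* page offset within the mapping */
};

/* Page offset of a user address within the mapped object. */
static unsigned long sketch_linear_page_index(const struct sketch_vma *vma,
					      unsigned long address)
{
	return ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
}

/*
 * Associate every page covered by one fault with its mapping and index.
 * pages[] holds the descriptors for the fault-size-aligned run of pfns.
 * Pages that already have a mapping are left untouched.
 */
static void associate_pages(struct sketch_page *pages,
			    const struct sketch_vma *vma,
			    unsigned long fault_addr,
			    unsigned long fault_size,
			    const void *mapping)
{
	unsigned long pgoff, i;

	/* Round the faulting address down to the start of the huge mapping. */
	pgoff = sketch_linear_page_index(vma, fault_addr & ~(fault_size - 1));
	for (i = 0; i < fault_size / PAGE_SIZE; i++) {
		if (pages[i].mapping)
			continue;
		pages[i].mapping = mapping;
		pages[i].index = pgoff + i;
	}
}
```

Rounding the address down first matters for PMD and PUD faults: the fault may land anywhere inside the huge mapping, but index 0 of the run must correspond to its first small page.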
+26
drivers/nvdimm/pmem.c
···
 #include <linux/hdreg.h>
 #include <linux/init.h>
 #include <linux/platform_device.h>
+#include <linux/set_memory.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/badblocks.h>
···
 	return to_nd_region(to_dev(pmem)->parent);
 }
 
+static void hwpoison_clear(struct pmem_device *pmem,
+		phys_addr_t phys, unsigned int len)
+{
+	unsigned long pfn_start, pfn_end, pfn;
+
+	/* only pmem in the linear map supports HWPoison */
+	if (is_vmalloc_addr(pmem->virt_addr))
+		return;
+
+	pfn_start = PHYS_PFN(phys);
+	pfn_end = pfn_start + PHYS_PFN(len);
+	for (pfn = pfn_start; pfn < pfn_end; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		/*
+		 * Note, no need to hold a get_dev_pagemap() reference
+		 * here since we're in the driver I/O path and
+		 * outstanding I/O requests pin the dev_pagemap.
+		 */
+		if (test_and_clear_pmem_poison(page))
+			clear_mce_nospec(pfn);
+	}
+}
+
 static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
 		phys_addr_t offset, unsigned int len)
 {
···
 	if (cleared < len)
 		rc = BLK_STS_IOERR;
 	if (cleared > 0 && cleared / 512) {
+		hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
 		cleared /= 512;
 		dev_dbg(dev, "%#llx clear %ld sector%s\n",
 				(unsigned long long) sector, cleared,
+13
drivers/nvdimm/pmem.h
···
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef __NVDIMM_PMEM_H__
 #define __NVDIMM_PMEM_H__
+#include <linux/page-flags.h>
 #include <linux/badblocks.h>
 #include <linux/types.h>
 #include <linux/pfn_t.h>
···
 
 long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
 		long nr_pages, void **kaddr, pfn_t *pfn);
+
+#ifdef CONFIG_MEMORY_FAILURE
+static inline bool test_and_clear_pmem_poison(struct page *page)
+{
+	return TestClearPageHWPoison(page);
+}
+#else
+static inline bool test_and_clear_pmem_poison(struct page *page)
+{
+	return false;
+}
+#endif
 #endif /* __NVDIMM_PMEM_H__ */
+116 -9
fs/dax.c
···
  *
  * Must be called with the i_pages lock held.
  */
-static void *get_unlocked_mapping_entry(struct address_space *mapping,
-					pgoff_t index, void ***slotp)
+static void *__get_unlocked_mapping_entry(struct address_space *mapping,
+		pgoff_t index, void ***slotp, bool (*wait_fn)(void))
 {
 	void *entry, **slot;
 	struct wait_exceptional_entry_queue ewait;
···
 	ewait.wait.func = wake_exceptional_entry_func;
 
 	for (;;) {
+		bool revalidate;
+
 		entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
 					  &slot);
 		if (!entry ||
···
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
 		xa_unlock_irq(&mapping->i_pages);
-		schedule();
+		revalidate = wait_fn();
 		finish_wait(wq, &ewait.wait);
 		xa_lock_irq(&mapping->i_pages);
+		if (revalidate)
+			return ERR_PTR(-EAGAIN);
 	}
 }
 
-static void dax_unlock_mapping_entry(struct address_space *mapping,
-		pgoff_t index)
+static bool entry_wait(void)
+{
+	schedule();
+	/*
+	 * Never return an ERR_PTR() from
+	 * __get_unlocked_mapping_entry(), just keep looping.
+	 */
+	return false;
+}
+
+static void *get_unlocked_mapping_entry(struct address_space *mapping,
+		pgoff_t index, void ***slotp)
+{
+	return __get_unlocked_mapping_entry(mapping, index, slotp, entry_wait);
+}
+
+static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
 	void *entry, **slot;
 
···
 static void put_locked_mapping_entry(struct address_space *mapping,
 		pgoff_t index)
 {
-	dax_unlock_mapping_entry(mapping, index);
+	unlock_mapping_entry(mapping, index);
 }
 
 /*
···
 	for (pfn = dax_radix_pfn(entry); \
 			pfn < dax_radix_end_pfn(entry); pfn++)
 
-static void dax_associate_entry(void *entry, struct address_space *mapping)
+/*
+ * TODO: for reflink+dax we need a way to associate a single page with
+ * multiple address_space instances at different linear_page_index()
+ * offsets.
+ */
+static void dax_associate_entry(void *entry, struct address_space *mapping,
+		struct vm_area_struct *vma, unsigned long address)
 {
-	unsigned long pfn;
+	unsigned long size = dax_entry_size(entry), pfn, index;
+	int i = 0;
 
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
 		return;
 
+	index = linear_page_index(vma, address & ~(size - 1));
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
 		WARN_ON_ONCE(page->mapping);
 		page->mapping = mapping;
+		page->index = index + i++;
 	}
 }
 
···
 		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
 		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
 		page->mapping = NULL;
+		page->index = 0;
 	}
 }
 
···
 			return page;
 	}
 	return NULL;
+}
+
+static bool entry_wait_revalidate(void)
+{
+	rcu_read_unlock();
+	schedule();
+	rcu_read_lock();
+
+	/*
+	 * Tell __get_unlocked_mapping_entry() to take a break, we need
+	 * to revalidate page->mapping after dropping locks
+	 */
+	return true;
+}
+
+bool dax_lock_mapping_entry(struct page *page)
+{
+	pgoff_t index;
+	struct inode *inode;
+	bool did_lock = false;
+	void *entry = NULL, **slot;
+	struct address_space *mapping;
+
+	rcu_read_lock();
+	for (;;) {
+		mapping = READ_ONCE(page->mapping);
+
+		if (!dax_mapping(mapping))
+			break;
+
+		/*
+		 * In the device-dax case there's no need to lock, a
+		 * struct dev_pagemap pin is sufficient to keep the
+		 * inode alive, and we assume we have dev_pagemap pin
+		 * otherwise we would not have a valid pfn_to_page()
+		 * translation.
+		 */
+		inode = mapping->host;
+		if (S_ISCHR(inode->i_mode)) {
+			did_lock = true;
+			break;
+		}
+
+		xa_lock_irq(&mapping->i_pages);
+		if (mapping != page->mapping) {
+			xa_unlock_irq(&mapping->i_pages);
+			continue;
+		}
+		index = page->index;
+
+		entry = __get_unlocked_mapping_entry(mapping, index, &slot,
+				entry_wait_revalidate);
+		if (!entry) {
+			xa_unlock_irq(&mapping->i_pages);
+			break;
+		} else if (IS_ERR(entry)) {
+			WARN_ON_ONCE(PTR_ERR(entry) != -EAGAIN);
+			continue;
+		}
+		lock_slot(mapping, slot);
+		did_lock = true;
+		xa_unlock_irq(&mapping->i_pages);
+		break;
+	}
+	rcu_read_unlock();
+
+	return did_lock;
+}
+
+void dax_unlock_mapping_entry(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	struct inode *inode = mapping->host;
+
+	if (S_ISCHR(inode->i_mode))
+		return;
+
+	unlock_mapping_entry(mapping, page->index);
 }
 
 /*
···
 	new_entry = dax_radix_locked_entry(pfn, flags);
 	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
 		dax_disassociate_entry(entry, mapping, false);
-		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
+		dax_associate_entry(new_entry, mapping);
 	}
 
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
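The fs/dax.c change turns the wait step of the lookup loop into a callback, so one loop serves two callers: the ordinary path keeps looping after each wait, while dax_lock_mapping_entry() asks the loop to bail out so it can re-check page->mapping. Here is a single-threaded user-space sketch of that control flow only. The sketch_slot type, the locked_for countdown and SKETCH_EAGAIN are all hypothetical; there is no real locking, scheduling or RCU involved.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slot: appears locked for the next locked_for lookups. */
struct sketch_slot {
	void *entry;
	int locked_for;
};

/* Stand-in for ERR_PTR(-EAGAIN) in this sketch. */
#define SKETCH_EAGAIN ((void *)(intptr_t)-1)

static int wait_calls; /* counts how often a wait callback ran */

/*
 * Return the entry once it is unlocked. While it is locked, call
 * wait_fn(); if the callback asks for revalidation, give up and
 * return SKETCH_EAGAIN so the caller can restart from scratch.
 */
static void *lookup_unlocked(struct sketch_slot *slot, bool (*wait_fn)(void))
{
	for (;;) {
		if (!slot->entry || slot->locked_for == 0)
			return slot->entry;
		slot->locked_for--; /* pretend the lock holder made progress */
		if (wait_fn())
			return SKETCH_EAGAIN;
	}
}

/* Ordinary callers: wait, then keep looping. */
static bool wait_keep_looping(void)
{
	wait_calls++;
	return false;
}

/* Callers that dropped outer locks while waiting: force a restart. */
static bool wait_then_revalidate(void)
{
	wait_calls++;
	return true;
}
```

The design choice mirrored here is that the loop itself does not know why a caller might need to restart; the callback's boolean return carries that decision.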
+13
include/linux/dax.h
···
 		struct block_device *bdev, struct writeback_control *wbc);
 
 struct page *dax_layout_busy_page(struct address_space *mapping);
+bool dax_lock_mapping_entry(struct page *page);
+void dax_unlock_mapping_entry(struct page *page);
 #else
 static inline bool bdev_dax_supported(struct block_device *bdev,
 		int blocksize)
···
 		struct block_device *bdev, struct writeback_control *wbc)
 {
 	return -EOPNOTSUPP;
+}
+
+static inline bool dax_lock_mapping_entry(struct page *page)
+{
+	if (IS_DAX(page->mapping->host))
+		return true;
+	return false;
+}
+
+static inline void dax_unlock_mapping_entry(struct page *page)
+{
 }
 #endif
 
+3 -2
include/linux/huge_mm.h
···
 #define _LINUX_HUGE_MM_H
 
 #include <linux/sched/coredump.h>
+#include <linux/mm_types.h>
 
 #include <linux/fs.h> /* only for vma_is_dax() */
 
···
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, pgprot_t newprot,
 			int prot_numa);
-int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmd, pfn_t pfn, bool write);
-int vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 			pud_t *pud, pfn_t pfn, bool write);
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
+1
include/linux/mm.h
···
 	MF_MSG_TRUNCATED_LRU,
 	MF_MSG_BUDDY,
 	MF_MSG_BUDDY_2ND,
+	MF_MSG_DAX,
 	MF_MSG_UNKNOWN,
 };
 
+14
include/linux/set_memory.h
···
 static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; }
 #endif
 
+#ifndef set_mce_nospec
+static inline int set_mce_nospec(unsigned long pfn)
+{
+	return 0;
+}
+#endif
+
+#ifndef clear_mce_nospec
+static inline int clear_mce_nospec(unsigned long pfn)
+{
+	return 0;
+}
+#endif
+
 #ifndef CONFIG_ARCH_HAS_MEM_ENCRYPT
 static inline int set_memory_encrypted(unsigned long addr, int numpages)
 {
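The generic stubs above pair with the `#define set_mce_nospec set_mce_nospec` lines in the x86 header: an architecture that supplies a real implementation also defines a macro of the same name, and the generic header emits its no-op only when that macro is absent. A minimal single-file sketch of the idiom follows. All sketch_ names and the SKETCH_ARCH_HAS_NOSPEC switch are hypothetical, and the two "headers" are collapsed into one file for illustration.

```c
/* Remove this line to fall back to the generic no-op stub below. */
#define SKETCH_ARCH_HAS_NOSPEC 1

static int sketch_arch_calls; /* counts calls into the "arch" version */

/* "Architecture header": real implementation plus a same-name marker macro. */
#ifdef SKETCH_ARCH_HAS_NOSPEC
static inline int sketch_set_nospec(unsigned long pfn)
{
	(void)pfn;
	sketch_arch_calls++; /* a real version would change page attributes */
	return 0;
}
#define sketch_set_nospec sketch_set_nospec
#endif

/* "Generic header": only provides a stub if no arch version was announced. */
#ifndef sketch_set_nospec
static inline int sketch_set_nospec(unsigned long pfn)
{
	(void)pfn;
	return 0;
}
#endif
```

Because the macro expands to its own name, callers are unaffected by it; its only job is to be visible to `#ifndef`, which cannot test for the existence of a function.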
-1
kernel/memremap.c
···
 		__ClearPageActive(page);
 		__ClearPageWaiters(page);
 
-		page->mapping = NULL;
 		mem_cgroup_uncharge(page);
 
 		page->pgmap->page_free(page, page->pgmap->data);
+2
mm/hmm.c
···
 {
 	struct hmm_devmem *devmem = data;
 
+	page->mapping = NULL;
+
 	devmem->ops->free(devmem, page);
 }
 
+2 -2
mm/huge_memory.c
···
 	spin_unlock(ptl);
 }
 
-int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmd, pfn_t pfn, bool write)
 {
 	pgprot_t pgprot = vma->vm_page_prot;
···
 	spin_unlock(ptl);
 }
 
-int vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 			pud_t *pud, pfn_t pfn, bool write)
 {
 	pgprot_t pgprot = vma->vm_page_prot;
+14 -4
mm/madvise.c
···
 
 
 	for (; start < end; start += PAGE_SIZE << order) {
+		unsigned long pfn;
 		int ret;
 
 		ret = get_user_pages_fast(start, 1, 0, &page);
 		if (ret != 1)
 			return ret;
+		pfn = page_to_pfn(page);
 
 		/*
 		 * When soft offlining hugepages, after migrating the page
···
 
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
-						page_to_pfn(page), start);
+					pfn, start);
 
 			ret = soft_offline_page(page, MF_COUNT_INCREASED);
 			if (ret)
 				return ret;
 			continue;
 		}
-		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
-						page_to_pfn(page), start);
 
-		ret = memory_failure(page_to_pfn(page), MF_COUNT_INCREASED);
+		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
+				pfn, start);
+
+		/*
+		 * Drop the page reference taken by get_user_pages_fast(). In
+		 * the absence of MF_COUNT_INCREASED the memory_failure()
+		 * routine is responsible for pinning the page to prevent it
+		 * from being released back to the page allocator.
+		 */
+		put_page(page);
+		ret = memory_failure(pfn, 0);
 		if (ret)
 			return ret;
 	}
+168 -40
mm/memory-failure.c
··· 55 55 #include <linux/hugetlb.h> 56 56 #include <linux/memory_hotplug.h> 57 57 #include <linux/mm_inline.h> 58 + #include <linux/memremap.h> 58 59 #include <linux/kfifo.h> 59 60 #include <linux/ratelimit.h> 60 61 #include <linux/page-isolation.h> ··· 176 175 EXPORT_SYMBOL_GPL(hwpoison_filter); 177 176 178 177 /* 178 + * Kill all processes that have a poisoned page mapped and then isolate 179 + * the page. 180 + * 181 + * General strategy: 182 + * Find all processes having the page mapped and kill them. 183 + * But we keep a page reference around so that the page is not 184 + * actually freed yet. 185 + * Then stash the page away 186 + * 187 + * There's no convenient way to get back to mapped processes 188 + * from the VMAs. So do a brute-force search over all 189 + * running processes. 190 + * 191 + * Remember that machine checks are not common (or rather 192 + * if they are common you have other problems), so this shouldn't 193 + * be a performance issue. 194 + * 195 + * Also there are some races possible while we get from the 196 + * error detection to actually handle it. 197 + */ 198 + 199 + struct to_kill { 200 + struct list_head nd; 201 + struct task_struct *tsk; 202 + unsigned long addr; 203 + short size_shift; 204 + char addr_valid; 205 + }; 206 + 207 + /* 179 208 * Send all the processes who have the page mapped a signal. 
180 209 * ``action optional'' if they are not immediately affected by the error 181 210 * ``action required'' if error happened in current execution context 182 211 */ 183 - static int kill_proc(struct task_struct *t, unsigned long addr, 184 - unsigned long pfn, struct page *page, int flags) 212 + static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags) 185 213 { 186 - short addr_lsb; 214 + struct task_struct *t = tk->tsk; 215 + short addr_lsb = tk->size_shift; 187 216 int ret; 188 217 189 218 pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n", 190 219 pfn, t->comm, t->pid); 191 - addr_lsb = compound_order(compound_head(page)) + PAGE_SHIFT; 192 220 193 221 if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) { 194 - ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)addr, 222 + ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)tk->addr, 195 223 addr_lsb, current); 196 224 } else { 197 225 /* ··· 229 199 * This could cause a loop when the user sets SIGBUS 230 200 * to SIG_IGN, but hopefully no one will do that? 231 201 */ 232 - ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)addr, 202 + ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr, 233 203 addr_lsb, t); /* synchronous? */ 234 204 } 235 205 if (ret < 0) ··· 265 235 } 266 236 EXPORT_SYMBOL_GPL(shake_page); 267 237 268 - /* 269 - * Kill all processes that have a poisoned page mapped and then isolate 270 - * the page. 271 - * 272 - * General strategy: 273 - * Find all processes having the page mapped and kill them. 274 - * But we keep a page reference around so that the page is not 275 - * actually freed yet. 276 - * Then stash the page away 277 - * 278 - * There's no convenient way to get back to mapped processes 279 - * from the VMAs. So do a brute-force search over all 280 - * running processes. 
- *
- * Remember that machine checks are not common (or rather
- * if they are common you have other problems), so this shouldn't
- * be a performance issue.
- *
- * Also there are some races possible while we get from the
- * error detection to actually handle it.
- */
+static unsigned long dev_pagemap_mapping_shift(struct page *page,
+		struct vm_area_struct *vma)
+{
+	unsigned long address = vma_address(page, vma);
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;

-struct to_kill {
-	struct list_head nd;
-	struct task_struct *tsk;
-	unsigned long addr;
-	char addr_valid;
-};
+	pgd = pgd_offset(vma->vm_mm, address);
+	if (!pgd_present(*pgd))
+		return 0;
+	p4d = p4d_offset(pgd, address);
+	if (!p4d_present(*p4d))
+		return 0;
+	pud = pud_offset(p4d, address);
+	if (!pud_present(*pud))
+		return 0;
+	if (pud_devmap(*pud))
+		return PUD_SHIFT;
+	pmd = pmd_offset(pud, address);
+	if (!pmd_present(*pmd))
+		return 0;
+	if (pmd_devmap(*pmd))
+		return PMD_SHIFT;
+	pte = pte_offset_map(pmd, address);
+	if (!pte_present(*pte))
+		return 0;
+	if (pte_devmap(*pte))
+		return PAGE_SHIFT;
+	return 0;
+}

 /*
  * Failure handling: if we can't find or can't kill a process there's
···
 	}
 	tk->addr = page_address_in_vma(p, vma);
 	tk->addr_valid = 1;
+	if (is_zone_device_page(p))
+		tk->size_shift = dev_pagemap_mapping_shift(p, vma);
+	else
+		tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;

 	/*
 	 * In theory we don't have to kill when the page was
···
 	 * likely very rare kill anyways just out of paranoia, but use
 	 * a SIGKILL because the error is not contained anymore.
 	 */
-	if (tk->addr == -EFAULT) {
+	if (tk->addr == -EFAULT || tk->size_shift == 0) {
 		pr_info("Memory failure: Unable to find user space address %lx in %s\n",
 			page_to_pfn(p), tsk->comm);
 		tk->addr_valid = 0;
···
  * Also when FAIL is set do a force kill because something went
  * wrong earlier.
  */
-static void kill_procs(struct list_head *to_kill, int forcekill,
-			  bool fail, struct page *page, unsigned long pfn,
-			  int flags)
+static void kill_procs(struct list_head *to_kill, int forcekill, bool fail,
+		unsigned long pfn, int flags)
 {
 	struct to_kill *tk, *next;

···
 			 * check for that, but we need to tell the
 			 * process anyways.
 			 */
-			else if (kill_proc(tk->tsk, tk->addr,
-					      pfn, page, flags) < 0)
+			else if (kill_proc(tk, pfn, flags) < 0)
 				pr_err("Memory failure: %#lx: Cannot send advisory machine check signal to %s:%d\n",
 				       pfn, tk->tsk->comm, tk->tsk->pid);
 		}
···
 	[MF_MSG_TRUNCATED_LRU]		= "already truncated LRU page",
 	[MF_MSG_BUDDY]			= "free buddy page",
 	[MF_MSG_BUDDY_2ND]		= "free buddy page (2nd try)",
+	[MF_MSG_DAX]			= "dax page",
 	[MF_MSG_UNKNOWN]		= "unknown page",
 };

···
 	 * any accesses to the poisoned memory.
 	 */
 	forcekill = PageDirty(hpage) || (flags & MF_MUST_KILL);
-	kill_procs(&tokill, forcekill, !unmap_success, p, pfn, flags);
+	kill_procs(&tokill, forcekill, !unmap_success, pfn, flags);

 	return unmap_success;
 }
···
 	return res;
 }

+static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
+		struct dev_pagemap *pgmap)
+{
+	struct page *page = pfn_to_page(pfn);
+	const bool unmap_success = true;
+	unsigned long size = 0;
+	struct to_kill *tk;
+	LIST_HEAD(tokill);
+	int rc = -EBUSY;
+	loff_t start;
+
+	/*
+	 * Prevent the inode from being freed while we are interrogating
+	 * the address_space, typically this would be handled by
+	 * lock_page(), but dax pages do not use the page lock. This
+	 * also prevents changes to the mapping of this pfn until
+	 * poison signaling is complete.
+	 */
+	if (!dax_lock_mapping_entry(page))
+		goto out;
+
+	if (hwpoison_filter(page)) {
+		rc = 0;
+		goto unlock;
+	}
+
+	switch (pgmap->type) {
+	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_PUBLIC:
+		/*
+		 * TODO: Handle HMM pages which may need coordination
+		 * with device-side memory.
+		 */
+		goto unlock;
+	default:
+		break;
+	}
+
+	/*
+	 * Use this flag as an indication that the dax page has been
+	 * remapped UC to prevent speculative consumption of poison.
+	 */
+	SetPageHWPoison(page);
+
+	/*
+	 * Unlike System-RAM there is no possibility to swap in a
+	 * different physical page at a given virtual address, so all
+	 * userspace consumption of ZONE_DEVICE memory necessitates
+	 * SIGBUS (i.e. MF_MUST_KILL)
+	 */
+	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+	collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
+
+	list_for_each_entry(tk, &tokill, nd)
+		if (tk->size_shift)
+			size = max(size, 1UL << tk->size_shift);
+	if (size) {
+		/*
+		 * Unmap the largest mapping to avoid breaking up
+		 * device-dax mappings which are constant size. The
+		 * actual size of the mapping being torn down is
+		 * communicated in siginfo, see kill_proc()
+		 */
+		start = (page->index << PAGE_SHIFT) & ~(size - 1);
+		unmap_mapping_range(page->mapping, start, start + size, 0);
+	}
+	kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, pfn, flags);
+	rc = 0;
+unlock:
+	dax_unlock_mapping_entry(page);
+out:
+	/* drop pgmap ref acquired in caller */
+	put_dev_pagemap(pgmap);
+	action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED);
+	return rc;
+}
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
···
 	struct page *p;
 	struct page *hpage;
 	struct page *orig_head;
+	struct dev_pagemap *pgmap;
 	int res;
 	unsigned long page_flags;

···
 			pfn);
 		return -ENXIO;
 	}
+
+	pgmap = get_dev_pagemap(pfn, NULL);
+	if (pgmap)
+		return memory_failure_dev_pagemap(pfn, flags, pgmap);

 	p = pfn_to_page(pfn);
 	if (PageHuge(p))
···
 {
 	int ret;
 	unsigned long pfn = page_to_pfn(page);
+
+	if (is_zone_device_page(page)) {
+		pr_debug_ratelimited("soft_offline: %#lx page is device page\n",
+				pfn);
+		if (flags & MF_COUNT_INCREASED)
+			put_page(page);
+		return -EIO;
+	}

 	if (PageHWPoison(page)) {
 		pr_info("soft offline: %#lx page already poisoned\n", pfn);
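The range arithmetic in memory_failure_dev_pagemap() can be tried in userspace. The following is a minimal sketch and not kernel code: largest_mapping_size() and unmap_start() are hypothetical helper names, and the shift constants are assumed x86-64 values for 4 KiB pages (the kernel takes them from the arch headers). It mirrors two steps from the patch: take the largest 1UL << size_shift over all collected tasks, skipping a shift of 0, then round the poisoned page's file offset down to that size.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed x86-64 values with 4 KiB base pages; illustrative only. */
#define PAGE_SHIFT 12
#define PMD_SHIFT  21
#define PUD_SHIFT  30

/*
 * Hypothetical helper: largest mapping size, in bytes, among the
 * per-task shifts. A shift of 0 means the mapping was not found and
 * is skipped, as in the list_for_each_entry() loop of the patch.
 */
unsigned long largest_mapping_size(const unsigned int *shifts, size_t n)
{
	unsigned long size = 0;

	for (size_t i = 0; i < n; i++) {
		if (shifts[i]) {
			unsigned long s = 1UL << shifts[i];

			if (s > size)
				size = s;
		}
	}
	return size;
}

/*
 * Hypothetical helper: byte offset of the size-aligned range that
 * contains page offset pgoff, i.e. (pgoff << PAGE_SHIFT) & ~(size - 1).
 * size must be a non-zero power of two for the mask to be valid.
 */
unsigned long long unmap_start(unsigned long pgoff, unsigned long size)
{
	assert(size && (size & (size - 1)) == 0);
	return ((unsigned long long)pgoff << PAGE_SHIFT) &
	       ~((unsigned long long)size - 1);
}
```

For example, with shifts { PAGE_SHIFT, 0, PMD_SHIFT } the largest size is 0x200000 (2 MiB), and a page offset of 0x605 (byte offset 0x605000) rounds down to a start of 0x600000, so the range to unmap would be [0x600000, 0x800000).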