Merge branch 'hugepage-fallbacks' (hugepage patches from David Rientjes)

Merge hugepage allocation updates from David Rientjes:
"We (mostly Linus, Andrea, and myself) have been discussing offlist how
to implement a sane default allocation strategy for hugepages on NUMA
platforms.

With these reverts in place, the page allocator will happily allocate
a remote hugepage immediately rather than try to make a local hugepage
available. This incurs a substantial performance degradation when
memory compaction would have otherwise made a local hugepage
available.

This series reverts those reverts and attempts to propose a more sane
default allocation strategy specifically for hugepages. Andrea
acknowledges this is likely to fix the swap storms that he originally
reported that resulted in the patches that removed __GFP_THISNODE from
hugepage allocations.

The immediate goal is to return 5.3 to the behavior the kernel has
implemented over the past several years so that remote hugepages are
not immediately allocated when local hugepages could have been made
available because the increased access latency is untenable.

The next goal is to introduce a sane default allocation strategy for
hugepage allocations in general, regardless of the configuration of
the system so that we prevent thrashing of local memory when
compaction is unlikely to succeed and can prefer remote hugepages over
remote native pages when the local node is low on memory."

Note on timing: this reverts the hugepage VM behavior changes that got
introduced fairly late in the 5.3 cycle, and that fixed a huge
performance regression for certain loads that had been around since
4.18.

Andrea had this note:

"The regression of 4.18 was that it was taking hours to start a VM
where 3.10 was only taking a few seconds. I reported all the details
on lkml when it was finally tracked down in August 2018.

https://lore.kernel.org/linux-mm/20180820032640.9896-2-aarcange@redhat.com/

__GFP_THISNODE in MADV_HUGEPAGE made the above enterprise vfio
workload degrade like in the "current upstream" above. And it still
would have been that bad as above until 5.3-rc5"

where the bad behavior ends up happening as you fill up a local node,
and without that change, you'd get into the nasty swap storm behavior
due to compaction working overtime to make room for more memory on the
nodes.

As a result 5.3 got the two performance fix reverts in rc5.

However, David Rientjes then noted that those performance fixes in turn
regressed performance for other loads - although not quite to the same
degree. He suggested reverting the reverts and instead replacing them
with two small changes to how hugepage allocations are done (patch
descriptions rephrased by me):

- "avoid expensive reclaim when compaction may not succeed": just admit
that the allocation failed when you're trying to allocate a huge-page
and compaction wasn't successful.

- "allow hugepage fallback to remote nodes when madvised": when that
node-local huge-page allocation failed, retry without forcing the
local node.

but by then I judged it too late to replace the fixes for a 5.3 release.
So 5.3 was released with behavior that harked back to the pre-4.18 logic.
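
The two changes described above can be sketched in plain userspace C. This is only a rough model under stated assumptions, not the kernel implementation: `struct node_state`, `try_node()` and `alloc_hugepage_sketch()` are hypothetical names invented for illustration, and the `madvised` flag stands in for the kernel's check of whether direct reclaim is allowed for the allocation.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; none of these are kernel APIs. */
enum compact_result { COMPACT_OK, COMPACT_SKIPPED, COMPACT_DEFERRED };

struct node_state {
	int free_hugepages;             /* hugepages available right now */
	enum compact_result compaction; /* what compaction would report */
};

/*
 * One allocation attempt restricted to a single node.
 * If no hugepage is free and compaction was skipped or deferred,
 * admit failure instead of falling into expensive reclaim.
 */
static bool try_node(struct node_state *n)
{
	if (n->free_hugepages > 0) {
		n->free_hugepages--;
		return true;
	}
	if (n->compaction == COMPACT_OK)
		return true;    /* compaction made a hugepage available */
	return false;           /* fail fast: no reclaim */
}

/*
 * Returns the node the hugepage came from, or -1 if the caller should
 * fall back to native pages.
 */
static int alloc_hugepage_sketch(struct node_state *nodes, int nr_nodes,
				 int local, bool madvised)
{
	/* First attempt: local node only. */
	if (try_node(&nodes[local]))
		return local;

	/* Without the madvise hint, do not go remote for a hugepage. */
	if (!madvised)
		return -1;

	/* Retry without forcing the local node. */
	for (int i = 0; i < nr_nodes; i++)
		if (try_node(&nodes[i]))
			return i;

	return -1;
}
```

With a full local node whose compaction was skipped and a remote node holding one free hugepage, a non-madvised request fails immediately (returns -1), while a madvised request gets the remote hugepage.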

But now we're in the merge window for 5.4, and we can see if this
alternate model fixes not just the horrendous swap storm behavior, but
also the performance regression that the late reverts caused.

Fingers crossed.

* emailed patches from David Rientjes <rientjes@google.com>:
mm, page_alloc: allow hugepage fallback to remote nodes when madvised
mm, page_alloc: avoid expensive reclaim when compaction may not succeed
Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
Revert "Revert "mm, thp: restore node-local hugepage allocations""

Linus Torvalds 6 years ago edf445ad a2953204

+92 -42

6 changed files

+8 -4

include/linux/gfp.h

···
 }
 extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 			struct vm_area_struct *vma, unsigned long addr,
-			int node);
+			int node, bool hugepage);
+#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
+	alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_pages_vma(gfp_mask, order, vma, addr, node)\
+#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
+	alloc_pages(gfp_mask, order)
+#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
 	alloc_pages(gfp_mask, order)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 #define alloc_page_vma(gfp_mask, vma, addr) \
-	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
+	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
 #define alloc_page_vma_node(gfp_mask, vma, addr, node) \
-	alloc_pages_vma(gfp_mask, 0, vma, addr, node)
+	alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);

-2

include/linux/mempolicy.h

···
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
-struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
-		unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);

+20 -31

mm/huge_memory.c

···
  *	    available
  *	never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
-	gfp_t this_node = 0;
 
-#ifdef CONFIG_NUMA
-	struct mempolicy *pol;
-	/*
-	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
-	 * specified, to express a general desire to stay on the current
-	 * node for optimistic allocation attempts. If the defrag mode
-	 * and/or madvise hint requires the direct reclaim then we prefer
-	 * to fallback to other node rather than node reclaim because that
-	 * can lead to excessive reclaim even though there is free memory
-	 * on other nodes. We expect that NUMA preferences are specified
-	 * by memory policies.
-	 */
-	pol = get_vma_policy(vma, addr);
-	if (pol->mode != MPOL_BIND)
-		this_node = __GFP_THISNODE;
-	mpol_cond_put(pol);
-#endif
-
+	/* Always do synchronous compaction */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+
+	/* Kick kcompactd and fail quickly */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+
+	/* Synchronous compaction if madvised, otherwise kick kcompactd */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-						__GFP_KSWAPD_RECLAIM | this_node);
+		return GFP_TRANSHUGE_LIGHT |
+			(vma_madvised ? __GFP_DIRECT_RECLAIM :
+					__GFP_KSWAPD_RECLAIM);
+
+	/* Only do synchronous compaction if madvised */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-						this_node);
-	return GFP_TRANSHUGE_LIGHT | this_node;
+		return GFP_TRANSHUGE_LIGHT |
+			(vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
+
+	return GFP_TRANSHUGE_LIGHT;
 }
 
 /* Caller must hold page table lock. */
···
 		pte_free(vma->vm_mm, pgtable);
 		return ret;
 	}
-	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
-	page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, vma, haddr, numa_node_id());
+	gfp = alloc_hugepage_direct_gfpmask(vma);
+	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
···
 alloc:
 	if (__transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
-		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
-		new_page = alloc_pages_vma(huge_gfp, HPAGE_PMD_ORDER, vma,
-				haddr, numa_node_id());
+		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
+		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
 	} else
 		new_page = NULL;
 

+41 -4

mm/mempolicy.c

···
 	} else if (PageTransHuge(page)) {
 		struct page *thp;
 
-		thp = alloc_pages_vma(GFP_TRANSHUGE, HPAGE_PMD_ORDER, vma,
-				address, numa_node_id());
+		thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
+				HPAGE_PMD_ORDER);
 		if (!thp)
 			return NULL;
 		prep_transhuge_page(thp);
···
  * freeing by another task. It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);
···
  *	@vma: Pointer to VMA or NULL if not available.
  *	@addr: Virtual Address of the allocation. Must be inside the VMA.
  *	@node: Which node to prefer for allocation (modulo policy).
+ *	@hugepage: for hugepages try only the preferred node if possible
  *
  *	This function allocates a page from the kernel page pool and applies
  *	a NUMA policy associated with the VMA or the current process.
···
  */
 struct page *
 alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
-		unsigned long addr, int node)
+		unsigned long addr, int node, bool hugepage)
 {
 	struct mempolicy *pol;
 	struct page *page;
···
 		mpol_cond_put(pol);
 		page = alloc_page_interleave(gfp, order, nid);
 		goto out;
+	}
+
+	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
+		int hpage_node = node;
+
+		/*
+		 * For hugepage allocation and non-interleave policy which
+		 * allows the current node (or other explicitly preferred
+		 * node) we only try to allocate from the current/preferred
+		 * node and don't fall back to other nodes, as the cost of
+		 * remote accesses would likely offset THP benefits.
+		 *
+		 * If the policy is interleave, or does not allow the current
+		 * node in its nodemask, we allocate the standard way.
+		 */
+		if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
+			hpage_node = pol->v.preferred_node;
+
+		nmask = policy_nodemask(gfp, pol);
+		if (!nmask || node_isset(hpage_node, *nmask)) {
+			mpol_cond_put(pol);
+			page = __alloc_pages_node(hpage_node,
+						gfp | __GFP_THISNODE, order);
+
+			/*
+			 * If hugepage allocations are configured to always
+			 * synchronous compact or the vma has been madvised
+			 * to prefer hugepage backing, retry allowing remote
+			 * memory as well.
+			 */
+			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
+				page = __alloc_pages_node(hpage_node,
+						gfp | __GFP_NORETRY, order);
+
+			goto out;
+		}
 	}
 
 	nmask = policy_nodemask(gfp, pol);

+22

mm/page_alloc.c

···
 		if (page)
 			goto got_pg;
 
+		if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
+			/*
+			 * If allocating entire pageblock(s) and compaction
+			 * failed because all zones are below low watermarks
+			 * or is prohibited because it recently failed at this
+			 * order, fail immediately.
+			 *
+			 * Reclaim is
+			 *  - potentially very expensive because zones are far
+			 *    below their low watermarks or this is part of very
+			 *    bursty high order allocations,
+			 *  - not guaranteed to help because isolate_freepages()
+			 *    may not iterate over freed pages as part of its
+			 *    linear scan, and
+			 *  - unlikely to make entire pageblocks free on its
+			 *    own.
+			 */
+			if (compact_result == COMPACT_SKIPPED ||
+			    compact_result == COMPACT_DEFERRED)
+				goto nopage;
+		}
+
 		/*
 		 * Checks for costly allocations with __GFP_NORETRY, which
 		 * includes THP page fault allocations

+1 -1

mm/shmem.c

···
 
 	shmem_pseudo_vma_init(&pvma, info, hindex);
 	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
-			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id());
+			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
 	shmem_pseudo_vma_destroy(&pvma);
 	if (page)
 		prep_transhuge_page(page);
