Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

mm: make page pfmemalloc check more robust

Commit c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") added
checks for page->pfmemalloc to __skb_fill_page_desc():

if (page->pfmemalloc && !page->mapping)
skb->pfmemalloc = true;

It assumes page->mapping == NULL implies that page->pfmemalloc can be
trusted. However, __delete_from_page_cache() can set set page->mapping
to NULL and leave page->index value alone. Due to being in union, a
non-zero page->index will be interpreted as true page->pfmemalloc.

So the assumption is invalid if the networking code can see such a page.
And it seems it can. We have encountered this with a NFS over loopback
setup when such a page is attached to a new skbuf. There is no copying
going on in this case so the page confuses __skb_fill_page_desc which
interprets the index as pfmemalloc flag and the network stack drops
packets that have been allocated using the reserves unless they are to
be queued on sockets handling the swapping which is the case here and
that leads to hangs when the nfs client waits for a response from the
server which has been dropped and thus never arrive.

The struct page is already heavily packed so rather than finding another
hole to put it in, let's do a trick instead. We can reuse the index
again but define it to an impossible value (-1UL). This is the page
index so it should never see the value that large. Replace all direct
users of page->pfmemalloc by page_is_pfmemalloc which will hide this
nastiness from unspoiled eyes.

The information will get lost if somebody wants to use page->index
obviously but that was the case before and the original code expected
that the information should be persisted somewhere else if that is
really needed (e.g. what SLAB and SLUB do).

[akpm@linux-foundation.org: fix blooper in slub]
Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Debugged-by: Vlastimil Babka <vbabka@suse.com>
Debugged-by: Jiri Bohac <jbohac@suse.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org> [3.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

authored by

Michal Hocko and committed by
Linus Torvalds
2f064f34 e45fc85a

+47 -29
+1 -1
drivers/net/ethernet/intel/fm10k/fm10k_main.c
··· 216 216 217 217 static inline bool fm10k_page_is_reserved(struct page *page) 218 218 { 219 - return (page_to_nid(page) != numa_mem_id()) || page->pfmemalloc; 219 + return (page_to_nid(page) != numa_mem_id()) || page_is_pfmemalloc(page); 220 220 } 221 221 222 222 static bool fm10k_can_reuse_rx_page(struct fm10k_rx_buffer *rx_buffer,
+1 -1
drivers/net/ethernet/intel/igb/igb_main.c
··· 6566 6566 6567 6567 static inline bool igb_page_is_reserved(struct page *page) 6568 6568 { 6569 - return (page_to_nid(page) != numa_mem_id()) || page->pfmemalloc; 6569 + return (page_to_nid(page) != numa_mem_id()) || page_is_pfmemalloc(page); 6570 6570 } 6571 6571 6572 6572 static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer,
+1 -1
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
··· 1832 1832 1833 1833 static inline bool ixgbe_page_is_reserved(struct page *page) 1834 1834 { 1835 - return (page_to_nid(page) != numa_mem_id()) || page->pfmemalloc; 1835 + return (page_to_nid(page) != numa_mem_id()) || page_is_pfmemalloc(page); 1836 1836 } 1837 1837 1838 1838 /**
+1 -1
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
··· 765 765 766 766 static inline bool ixgbevf_page_is_reserved(struct page *page) 767 767 { 768 - return (page_to_nid(page) != numa_mem_id()) || page->pfmemalloc; 768 + return (page_to_nid(page) != numa_mem_id()) || page_is_pfmemalloc(page); 769 769 } 770 770 771 771 /**
+28
include/linux/mm.h
··· 1003 1003 } 1004 1004 1005 1005 /* 1006 + * Return true only if the page has been allocated with 1007 + * ALLOC_NO_WATERMARKS and the low watermark was not 1008 + * met implying that the system is under some pressure. 1009 + */ 1010 + static inline bool page_is_pfmemalloc(struct page *page) 1011 + { 1012 + /* 1013 + * Page index cannot be this large so this must be 1014 + * a pfmemalloc page. 1015 + */ 1016 + return page->index == -1UL; 1017 + } 1018 + 1019 + /* 1020 + * Only to be called by the page allocator on a freshly allocated 1021 + * page. 1022 + */ 1023 + static inline void set_page_pfmemalloc(struct page *page) 1024 + { 1025 + page->index = -1UL; 1026 + } 1027 + 1028 + static inline void clear_page_pfmemalloc(struct page *page) 1029 + { 1030 + page->index = 0; 1031 + } 1032 + 1033 + /* 1006 1034 * Different kinds of faults, as returned by handle_mm_fault(). 1007 1035 * Used to decide whether a process gets delivered SIGBUS or 1008 1036 * just gets major/minor fault counters bumped up.
-9
include/linux/mm_types.h
··· 63 63 union { 64 64 pgoff_t index; /* Our offset within mapping. */ 65 65 void *freelist; /* sl[aou]b first free object */ 66 - bool pfmemalloc; /* If set by the page allocator, 67 - * ALLOC_NO_WATERMARKS was set 68 - * and the low watermark was not 69 - * met implying that the system 70 - * is under some pressure. The 71 - * caller should try ensure 72 - * this page is only used to 73 - * free other pages. 74 - */ 75 66 }; 76 67 77 68 union {
+5 -9
include/linux/skbuff.h
··· 1602 1602 skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; 1603 1603 1604 1604 /* 1605 - * Propagate page->pfmemalloc to the skb if we can. The problem is 1606 - * that not all callers have unique ownership of the page. If 1607 - * pfmemalloc is set, we check the mapping as a mapping implies 1608 - * page->index is set (index and pfmemalloc share space). 1609 - * If it's a valid mapping, we cannot use page->pfmemalloc but we 1610 - * do not lose pfmemalloc information as the pages would not be 1611 - * allocated using __GFP_MEMALLOC. 1605 + * Propagate page pfmemalloc to the skb if we can. The problem is 1606 + * that not all callers have unique ownership of the page but rely 1607 + * on page_is_pfmemalloc doing the right thing(tm). 1612 1608 */ 1613 1609 frag->page.p = page; 1614 1610 frag->page_offset = off; 1615 1611 skb_frag_size_set(frag, size); 1616 1612 1617 1613 page = compound_head(page); 1618 - if (page->pfmemalloc && !page->mapping) 1614 + if (page_is_pfmemalloc(page)) 1619 1615 skb->pfmemalloc = true; 1620 1616 } 1621 1617 ··· 2259 2263 static inline void skb_propagate_pfmemalloc(struct page *page, 2260 2264 struct sk_buff *skb) 2261 2265 { 2262 - if (page && page->pfmemalloc) 2266 + if (page_is_pfmemalloc(page)) 2263 2267 skb->pfmemalloc = true; 2264 2268 } 2265 2269
+6 -3
mm/page_alloc.c
··· 1343 1343 set_page_owner(page, order, gfp_flags); 1344 1344 1345 1345 /* 1346 - * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was necessary to 1346 + * page is set pfmemalloc when ALLOC_NO_WATERMARKS was necessary to 1347 1347 * allocate the page. The expectation is that the caller is taking 1348 1348 * steps that will free more memory. The caller should avoid the page 1349 1349 * being used for !PFMEMALLOC purposes. 1350 1350 */ 1351 - page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS); 1351 + if (alloc_flags & ALLOC_NO_WATERMARKS) 1352 + set_page_pfmemalloc(page); 1353 + else 1354 + clear_page_pfmemalloc(page); 1352 1355 1353 1356 return 0; 1354 1357 } ··· 3348 3345 atomic_add(size - 1, &page->_count); 3349 3346 3350 3347 /* reset page count bias and offset to start of new frag */ 3351 - nc->pfmemalloc = page->pfmemalloc; 3348 + nc->pfmemalloc = page_is_pfmemalloc(page); 3352 3349 nc->pagecnt_bias = size; 3353 3350 nc->offset = size; 3354 3351 }
+2 -2
mm/slab.c
··· 1603 1603 } 1604 1604 1605 1605 /* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */ 1606 - if (unlikely(page->pfmemalloc)) 1606 + if (page_is_pfmemalloc(page)) 1607 1607 pfmemalloc_active = true; 1608 1608 1609 1609 nr_pages = (1 << cachep->gfporder); ··· 1614 1614 add_zone_page_state(page_zone(page), 1615 1615 NR_SLAB_UNRECLAIMABLE, nr_pages); 1616 1616 __SetPageSlab(page); 1617 - if (page->pfmemalloc) 1617 + if (page_is_pfmemalloc(page)) 1618 1618 SetPageSlabPfmemalloc(page); 1619 1619 1620 1620 if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
+1 -1
mm/slub.c
··· 1427 1427 inc_slabs_node(s, page_to_nid(page), page->objects); 1428 1428 page->slab_cache = s; 1429 1429 __SetPageSlab(page); 1430 - if (page->pfmemalloc) 1430 + if (page_is_pfmemalloc(page)) 1431 1431 SetPageSlabPfmemalloc(page); 1432 1432 1433 1433 start = page_address(page);
+1 -1
net/core/skbuff.c
··· 340 340 341 341 if (skb && frag_size) { 342 342 skb->head_frag = 1; 343 - if (virt_to_head_page(data)->pfmemalloc) 343 + if (page_is_pfmemalloc(virt_to_head_page(data))) 344 344 skb->pfmemalloc = 1; 345 345 } 346 346 return skb;