Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

mm: avoid early COW write protect games during fork()

In commit 70e806e4e645 ("mm: Do early cow for pinned pages during fork()
for ptes") we write-protected the PTE before doing the page pinning
check, in order to avoid a race with concurrent fast-GUP pinning (which
doesn't take the mm semaphore or the page table lock).

That trick doesn't actually work - it doesn't handle memory ordering
properly, and doing so would be prohibitively expensive.

It also isn't really needed. While we're moving in the direction of
allowing and supporting page pinning without marking the pinned area
with MADV_DONTFORK, the fact is that we've never really supported this
kind of odd "concurrent fork() and page pinning", and doing the
serialization on a pte level is just wrong.

We can add serialization with a per-mm sequence counter, so we know how
to solve that race properly, but we'll do that at a more appropriate
time. Right now this just removes the write protect games.

It also turns out that the write protect games actually break on Power,
as reported by Aneesh Kumar:

"Architecture like ppc64 expects set_pte_at to be not used for updating
a valid pte. This is further explained in commit 56eecdb912b5 ("mm:
Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit")"

and the code triggered a warning there:

WARNING: CPU: 0 PID: 30613 at arch/powerpc/mm/pgtable.c:185 set_pte_at+0x2a8/0x3a0 arch/powerpc/mm/pgtable.c:185
Call Trace:
copy_present_page mm/memory.c:857 [inline]
copy_present_pte mm/memory.c:899 [inline]
copy_pte_range mm/memory.c:1014 [inline]
copy_pmd_range mm/memory.c:1092 [inline]
copy_pud_range mm/memory.c:1127 [inline]
copy_p4d_range mm/memory.c:1150 [inline]
copy_page_range+0x1f6c/0x2cc0 mm/memory.c:1212
dup_mmap kernel/fork.c:592 [inline]
dup_mm+0x77c/0xab0 kernel/fork.c:1355
copy_mm kernel/fork.c:1411 [inline]
copy_process+0x1f00/0x2740 kernel/fork.c:2070
_do_fork+0xc4/0x10b0 kernel/fork.c:2429

Link: https://lore.kernel.org/lkml/CAHk-=wiWr+gO0Ro4LvnJBMs90OiePNyrE3E+pJvc9PzdBShdmw@mail.gmail.com/
Link: https://lore.kernel.org/linuxppc-dev/20201008092541.398079-1-aneesh.kumar@linux.ibm.com/
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

+4 -37
+4 -37
mm/memory.c
··· 806 806 return 1; 807 807 808 808 /* 809 - * The trick starts. 810 - * 811 809 * What we want to do is to check whether this page may 812 810 * have been pinned by the parent process. If so, 813 811 * instead of wrprotect the pte on both sides, we copy ··· 813 815 * the pinned page won't be randomly replaced in the 814 816 * future. 815 817 * 816 - * To achieve this, we do the following: 817 - * 818 - * 1. Write-protect the pte if it's writable. This is 819 - * to protect concurrent write fast-gup with 820 - * FOLL_PIN, so that we'll fail the fast-gup with 821 - * the write bit removed. 822 - * 823 - * 2. Check page_maybe_dma_pinned() to see whether this 824 - * page may have been pinned. 825 - * 826 - * The order of these steps is important to serialize 827 - * against the fast-gup code (gup_pte_range()) on the 828 - * pte check and try_grab_compound_head(), so that 829 - * we'll make sure either we'll capture that fast-gup 830 - * so we'll copy the pinned page here, or we'll fail 831 - * that fast-gup. 832 - * 833 - * NOTE! Even if we don't end up copying the page, 834 - * we won't undo this wrprotect(), because the normal 835 - * reference copy will need it anyway. 836 - */ 837 - if (pte_write(pte)) 838 - ptep_set_wrprotect(src_mm, addr, src_pte); 839 - 840 - /* 841 - * These are the "normally we can just copy by reference" 842 - * checks. 818 + * The page pinning checks are just "has this mm ever 819 + * seen pinning", along with the (inexact) check of 820 + * the page count. That might give false positives for 821 + * for pinning, but it will work correctly. 843 822 */ 844 823 if (likely(!atomic_read(&src_mm->has_pinned))) 845 824 return 1; 846 825 if (likely(!page_maybe_dma_pinned(page))) 847 826 return 1; 848 - 849 - /* 850 - * Uhhuh. It looks like the page might be a pinned page, 851 - * and we actually need to copy it. Now we can set the 852 - * source pte back to being writable. 853 - */ 854 - if (pte_write(pte)) 855 - set_pte_at(src_mm, addr, src_pte, pte); 856 827 857 828 new_page = *prealloc; 858 829 if (!new_page)