Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: Do early cow for pinned pages during fork() for ptes

This allows copy_pte_range() to do an early COW copy for pages that are
pinned in the source mm.

Currently we don't have an accurate way to know whether a page is pinned
or not. The only thing we have is page_maybe_dma_pinned(). That is good
enough for now, especially with the newly added mm->has_pinned flag,
which makes sure we won't affect processes that never pinned any pages.
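
The two-level check described above can be modelled in user space. This is only a sketch under stated assumptions: `struct model_page`, `struct model_mm` and the `model_*` helpers are hypothetical stand-ins, and `PIN_BIAS` is assumed to mirror the kernel's GUP_PIN_COUNTING_BIAS. It illustrates why the per-page test alone can give false positives and how the per-mm flag filters them out for processes that never pinned anything.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror the kernel's GUP_PIN_COUNTING_BIAS (1 << 10). */
#define PIN_BIAS 1024

/* Hypothetical stand-ins for struct page and struct mm_struct. */
struct model_page { int refcount; };
struct model_mm { bool has_pinned; };

/* A pin adds a large bias to the refcount and marks the mm. */
static void model_pin(struct model_mm *mm, struct model_page *page)
{
	assert(page->refcount > 0);
	mm->has_pinned = true;
	page->refcount += PIN_BIAS;
}

/* Heuristic: a large refcount *may* mean the page is pinned. */
static bool model_maybe_pinned(const struct model_page *page)
{
	return page->refcount >= PIN_BIAS;
}

/* Fork-time decision: cheap per-mm filter first, then the heuristic. */
static bool model_needs_early_copy(const struct model_mm *mm,
				   const struct model_page *page)
{
	if (!mm->has_pinned)
		return false;
	return model_maybe_pinned(page);
}
```

A page with many ordinary references trips `model_maybe_pinned()` even though nothing pinned it; `model_needs_early_copy()` only trusts that answer once the mm has pinned at least one page.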

It would be easier if we could do a GFP_KERNEL allocation within
copy_one_pte(). Unfortunately we can't, because the page table locks of
both the parent and the child are held at that point. So the page
allocation has to be done outside copy_one_pte().
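
The "allocate outside the lock, then retry" pattern can be sketched in user space as follows. All names here (`model_pte`, `model_copy_one()`, `model_copy_range()`, `table_locked`) are hypothetical, and a plain boolean stands in for the page table spinlocks; the sketch only shows the control flow, not the kernel's actual locking or allocation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 64

/* Hypothetical stand-in for a pte that points at a page. */
struct model_pte {
	char *page;
	bool pinned;
};

static bool table_locked; /* stands in for the page table spinlocks */

/* Allocation may sleep, so it must never run with the lock held. */
static char *model_alloc_page(void)
{
	assert(!table_locked);
	return malloc(MODEL_PAGE_SIZE);
}

/* Runs under the lock: shares the page, or copies into *prealloc. */
static int model_copy_one(struct model_pte *dst, const struct model_pte *src,
			  char **prealloc)
{
	assert(table_locked);
	if (!src->pinned) {
		dst->page = src->page;	/* share by reference */
		dst->pinned = false;
		return 0;
	}
	if (!*prealloc)
		return -EAGAIN;		/* caller must allocate and retry */
	memcpy(*prealloc, src->page, MODEL_PAGE_SIZE);
	dst->page = *prealloc;
	dst->pinned = false;
	*prealloc = NULL;		/* consumed */
	return 0;
}

static int model_copy_range(struct model_pte *dst, const struct model_pte *src,
			    size_t n)
{
	char *prealloc = NULL;
	size_t i = 0;
	int ret = 0;

	while (i < n) {
		table_locked = true;
		for (; i < n; i++) {
			ret = model_copy_one(&dst[i], &src[i], &prealloc);
			if (ret == -EAGAIN)
				break;
		}
		table_locked = false;

		if (ret == -EAGAIN) {
			/* Lock dropped: now it is safe to allocate. */
			prealloc = model_alloc_page();
			if (!prealloc)
				return -ENOMEM;
			ret = 0;
		}
	}
	free(prealloc);	/* release a page that was never consumed */
	return ret;
}
```

The loop resumes at the same index after allocating, so the entry that asked for a page is the one that consumes it.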

There is some trickery in copy_present_pte(), mainly the wrprotect trick
that blocks concurrent fast-gup. The comments in the function explain it
in more detail.
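
Why the order "write-protect first, then check for a pin" matters can be checked with a small deterministic model. This is a sketch under assumptions, not the kernel's code: fast-gup is reduced to three steps (read the pte, take the pin, re-check the pte and undo on change), fork to two steps, and every interleaving of the two sequences is enumerated. All names are hypothetical and `PIN_BIAS` is assumed to mirror GUP_PIN_COUNTING_BIAS.

```c
#include <assert.h>
#include <stdbool.h>

#define PIN_BIAS 1024 /* assumed to mirror GUP_PIN_COUNTING_BIAS */

struct race_state {
	bool pte_writable;	/* write bit of the parent's pte */
	int refcount;		/* page reference count */
	bool gup_snapshot;	/* write bit as first read by fast-gup */
	bool gup_failed;	/* fast-gup bailed out */
	bool gup_pinned;	/* fast-gup currently holds a pin */
	bool fork_saw_pin;	/* result of fork's pin check */
};

/* Fast-gup side: 0 = read pte, 1 = take pin, 2 = re-check pte. */
static void gup_step(struct race_state *s, int step)
{
	if (s->gup_failed)
		return;
	switch (step) {
	case 0:
		s->gup_snapshot = s->pte_writable;
		if (!s->gup_snapshot)
			s->gup_failed = true;	/* write pin needs a writable pte */
		break;
	case 1:
		s->refcount += PIN_BIAS;
		s->gup_pinned = true;
		break;
	case 2:
		if (s->pte_writable != s->gup_snapshot) {
			s->refcount -= PIN_BIAS;	/* pte changed: undo, fail */
			s->gup_pinned = false;
			s->gup_failed = true;
		}
		break;
	}
}

/* Fork side: two steps, in the order selected by wrprotect_first. */
static void fork_step(struct race_state *s, int step, bool wrprotect_first)
{
	bool is_wrprotect = (step == 0) == wrprotect_first;

	if (is_wrprotect)
		s->pte_writable = false;
	else
		s->fork_saw_pin = s->refcount >= PIN_BIAS;
}

static int bits_set(unsigned int v)
{
	int n = 0;

	for (; v; v >>= 1)
		n += v & 1;
	return n;
}

/* Count interleavings where fast-gup keeps a pin that fork did not see. */
static int count_missed_pins(bool wrprotect_first)
{
	int missed = 0;

	for (unsigned int mask = 0; mask < 32; mask++) {
		struct race_state s = { .pte_writable = true, .refcount = 1 };
		int g = 0, f = 0;

		if (bits_set(mask) != 2)
			continue;	/* exactly two of five slots belong to fork */
		for (int slot = 0; slot < 5; slot++) {
			if (mask & (1u << slot))
				fork_step(&s, f++, wrprotect_first);
			else
				gup_step(&s, g++);
		}
		assert(f == 2 && g == 3);
		if (s.gup_pinned && !s.fork_saw_pin)
			missed++;
	}
	return missed;
}
```

In this model `count_missed_pins(true)` finds no interleaving in which a pin slips past fork, while the reversed order (`count_missed_pins(false)`) does. The model is sequentially consistent and ignores memory-ordering details, so it illustrates the argument rather than proving the real code correct.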

During review, Oleg Nesterov reported a (probably harmless) bug: we did
not reset entry.val properly in copy_pte_range(), so
add_swap_count_continuation() could be called multiple times on the same
swp entry. That should be harmless, because even if it happens,
add_swap_count_continuation() returns immediately once it notices that
there is already enough space for the swp counter. So instead of a
standalone stable patch, the fix is folded into this patch directly.

Link: https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Peter Xu and committed by Linus Torvalds
70e806e4 7a4830c3

+190 -17
mm/memory.c
···
 	return 0;
 }
 
-static inline void
+/*
+ * Copy a present and normal page if necessary.
+ *
+ * NOTE! The usual case is that this doesn't need to do
+ * anything, and can just return a positive value. That
+ * will let the caller know that it can just increase
+ * the page refcount and re-use the pte the traditional
+ * way.
+ *
+ * But _if_ we need to copy it because it needs to be
+ * pinned in the parent (and the child should get its own
+ * copy rather than just a reference to the same page),
+ * we'll do that here and return zero to let the caller
+ * know we're done.
+ *
+ * And if we need a pre-allocated page but don't yet have
+ * one, return a negative error to let the preallocation
+ * code know so that it can do so outside the page table
+ * lock.
+ */
+static inline int
+copy_present_page(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		pte_t *dst_pte, pte_t *src_pte,
+		struct vm_area_struct *vma, struct vm_area_struct *new,
+		unsigned long addr, int *rss, struct page **prealloc,
+		pte_t pte, struct page *page)
+{
+	struct page *new_page;
+
+	if (!is_cow_mapping(vma->vm_flags))
+		return 1;
+
+	/*
+	 * The trick starts.
+	 *
+	 * What we want to do is to check whether this page may
+	 * have been pinned by the parent process. If so,
+	 * instead of wrprotect the pte on both sides, we copy
+	 * the page immediately so that we'll always guarantee
+	 * the pinned page won't be randomly replaced in the
+	 * future.
+	 *
+	 * To achieve this, we do the following:
+	 *
+	 * 1. Write-protect the pte if it's writable. This is
+	 *    to protect concurrent write fast-gup with
+	 *    FOLL_PIN, so that we'll fail the fast-gup with
+	 *    the write bit removed.
+	 *
+	 * 2. Check page_maybe_dma_pinned() to see whether this
+	 *    page may have been pinned.
+	 *
+	 * The order of these steps is important to serialize
+	 * against the fast-gup code (gup_pte_range()) on the
+	 * pte check and try_grab_compound_head(), so that
+	 * we'll make sure either we'll capture that fast-gup
+	 * so we'll copy the pinned page here, or we'll fail
+	 * that fast-gup.
+	 *
+	 * NOTE! Even if we don't end up copying the page,
+	 * we won't undo this wrprotect(), because the normal
+	 * reference copy will need it anyway.
+	 */
+	if (pte_write(pte))
+		ptep_set_wrprotect(src_mm, addr, src_pte);
+
+	/*
+	 * These are the "normally we can just copy by reference"
+	 * checks.
+	 */
+	if (likely(!atomic_read(&src_mm->has_pinned)))
+		return 1;
+	if (likely(!page_maybe_dma_pinned(page)))
+		return 1;
+
+	/*
+	 * Uhhuh. It looks like the page might be a pinned page,
+	 * and we actually need to copy it. Now we can set the
+	 * source pte back to being writable.
+	 */
+	if (pte_write(pte))
+		set_pte_at(src_mm, addr, src_pte, pte);
+
+	new_page = *prealloc;
+	if (!new_page)
+		return -EAGAIN;
+
+	/*
+	 * We have a prealloc page, all good! Take it
+	 * over and copy the page & arm it.
+	 */
+	*prealloc = NULL;
+	copy_user_highpage(new_page, page, addr, vma);
+	__SetPageUptodate(new_page);
+	page_add_new_anon_rmap(new_page, new, addr, false);
+	lru_cache_add_inactive_or_unevictable(new_page, new);
+	rss[mm_counter(new_page)]++;
+
+	/* All done, just insert the new page copy in the child */
+	pte = mk_pte(new_page, new->vm_page_prot);
+	pte = maybe_mkwrite(pte_mkdirty(pte), new);
+	set_pte_at(dst_mm, addr, dst_pte, pte);
+	return 0;
+}
+
+/*
+ * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
+ * is required to copy this pte.
+ */
+static inline int
 copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
-		unsigned long addr, int *rss)
+		struct vm_area_struct *new,
+		unsigned long addr, int *rss, struct page **prealloc)
 {
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+
+	page = vm_normal_page(vma, addr, pte);
+	if (page) {
+		int retval;
+
+		retval = copy_present_page(dst_mm, src_mm,
+					   dst_pte, src_pte,
+					   vma, new,
+					   addr, rss, prealloc,
+					   pte, page);
+		if (retval <= 0)
+			return retval;
+
+		get_page(page);
+		page_dup_rmap(page, false);
+		rss[mm_counter(page)]++;
+	}
 
 	/*
 	 * If it's a COW mapping, write protect it both
···
 	if (!(vm_flags & VM_UFFD_WP))
 		pte = pte_clear_uffd_wp(pte);
 
-	page = vm_normal_page(vma, addr, pte);
-	if (page) {
-		get_page(page);
-		page_dup_rmap(page, false);
-		rss[mm_counter(page)]++;
-	}
-
 	set_pte_at(dst_mm, addr, dst_pte, pte);
+	return 0;
+}
+
+static inline struct page *
+page_copy_prealloc(struct mm_struct *src_mm, struct vm_area_struct *vma,
+		   unsigned long addr)
+{
+	struct page *new_page;
+
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
+	if (!new_page)
+		return NULL;
+
+	if (mem_cgroup_charge(new_page, src_mm, GFP_KERNEL)) {
+		put_page(new_page);
+		return NULL;
+	}
+	cgroup_throttle_swaprate(new_page, GFP_KERNEL);
+
+	return new_page;
 }
 
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
···
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
 	spinlock_t *src_ptl, *dst_ptl;
-	int progress = 0;
+	int progress, ret = 0;
 	int rss[NR_MM_COUNTERS];
 	swp_entry_t entry = (swp_entry_t){0};
+	struct page *prealloc = NULL;
 
 again:
+	progress = 0;
 	init_rss_vec(rss);
 
 	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
-	if (!dst_pte)
-		return -ENOMEM;
+	if (!dst_pte) {
+		ret = -ENOMEM;
+		goto out;
+	}
 	src_pte = pte_offset_map(src_pmd, addr);
 	src_ptl = pte_lockptr(src_mm, src_pmd);
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
···
 			progress += 8;
 			continue;
 		}
-		copy_present_pte(dst_mm, src_mm, dst_pte, src_pte,
-				 vma, addr, rss);
+		/* copy_present_pte() will clear `*prealloc' if consumed */
+		ret = copy_present_pte(dst_mm, src_mm, dst_pte, src_pte,
+				       vma, new, addr, rss, &prealloc);
+		/*
+		 * If we need a pre-allocated page for this pte, drop the
+		 * locks, allocate, and try again.
+		 */
+		if (unlikely(ret == -EAGAIN))
+			break;
+		if (unlikely(prealloc)) {
+			/*
+			 * pre-alloc page cannot be reused by next time so as
+			 * to strictly follow mempolicy (e.g., alloc_page_vma()
+			 * will allocate page according to address). This
+			 * could only happen if one pinned pte changed.
+			 */
+			put_page(prealloc);
+			prealloc = NULL;
+		}
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
···
 	cond_resched();
 
 	if (entry.val) {
-		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
+		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		entry.val = 0;
+	} else if (ret) {
+		WARN_ON_ONCE(ret != -EAGAIN);
+		prealloc = page_copy_prealloc(src_mm, vma, addr);
+		if (!prealloc)
 			return -ENOMEM;
-		progress = 0;
+		/* We've captured and resolved the error. Reset, try again. */
+		ret = 0;
 	}
 	if (addr != end)
 		goto again;
-	return 0;
+out:
+	if (unlikely(prealloc))
+		put_page(prealloc);
+	return ret;
 }
 
 static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,