Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: fix up numa read-only thread grouping logic

Dave Chinner reported that commit 4d9424669946 ("mm: convert
p[te|md]_mknonnuma and remaining page table manipulations") slowed down
his xfs_repair test enormously. In particular, it was using more system
time due to extra TLB flushing.

The ultimate reason turns out to be how the change to use the regular
page table accessor functions broke the NUMA grouping logic. The old
special mknuma/mknonnuma code accessed the page table present bit and
the magic NUMA bit directly, while the new code just changes the page
protections using PROT_NONE and the regular vma protections.

That sounds equivalent, and from a fault standpoint it really is, but a
subtle side effect is that the *other* protection bits of the page table
entries also change. And the code to decide how to group the NUMA
entries together used the writable bit to decide whether a particular
page was likely to be shared read-only or not.

And with the change to make the NUMA handling use the regular permission
setting functions, that writable bit was basically always cleared for
private mappings due to COW. So even if the page actually ends up being
written to in the end, the NUMA balancing would act as if it was always
shared RO.

This code is a heuristic anyway, so the fix - at least for now - is to
instead check whether the page is dirty rather than writable. The bit
doesn't change with protection changes.
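
The difference between the two checks can be illustrated with a toy model.
This is only a sketch under stated assumptions: struct toy_pte, the toy_*
helpers and the TOY_TNF_NO_GROUP value are hypothetical stand-ins invented
for illustration, not the kernel's pte_t, accessors or flag definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a page table entry; NOT the kernel's pte_t. */
struct toy_pte {
	bool present;	/* entry is mapped and accessible */
	bool write;	/* protection bit: writes allowed */
	bool dirty;	/* status bit: page has been written to */
};

/* Hypothetical value, standing in for the kernel's TNF_NO_GROUP. */
#define TOY_TNF_NO_GROUP 0x1

/*
 * Model of marking an entry for a NUMA hinting fault with PROT_NONE-style
 * protections: the protection bits are rewritten, the dirty bit is kept.
 */
static struct toy_pte toy_make_numa_hint(struct toy_pte pte)
{
	pte.present = false;
	pte.write = false;
	return pte;
}

/*
 * Model of restoring the regular vma protections at hinting-fault time
 * for a private mapping: because of COW, the write bit stays clear.
 */
static struct toy_pte toy_restore_private(struct toy_pte pte)
{
	assert(!pte.present);	/* only meaningful on a hinting entry */
	pte.present = true;
	pte.write = false;
	return pte;
}

/* Old heuristic: treat "not writable" as "shared read-only". */
static int toy_flags_by_write(struct toy_pte pte)
{
	return pte.write ? 0 : TOY_TNF_NO_GROUP;
}

/* New heuristic: treat "not dirty" as "shared read-only". */
static int toy_flags_by_dirty(struct toy_pte pte)
{
	return pte.dirty ? 0 : TOY_TNF_NO_GROUP;
}
```

In this model, a page that was written before the hinting fault still has
dirty set after toy_restore_private(toy_make_numa_hint(pte)), so
toy_flags_by_dirty() does not set the no-group flag, while
toy_flags_by_write() sets it for every private page regardless of history.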

NOTE! This also adds a FIXME comment to revisit this issue.

Not only should we probably re-visit the whole "is this a shared
read-only page" heuristic (we might want to take the vma permissions
into account and base this more on those than the per-page ones, and
also look at whether the particular access that triggers it is a write
or not), but the whole COW issue shows that we should think about the
NUMA fault handling some more.

For example, maybe we should do the early-COW thing that a regular fault
does. Or maybe we should accept that while using the same bits as
PROTNONE was a good thing (and got rid of the special NUMA bit), we
might still want to just preserve the other protection bits across NUMA
faulting.

Those are bigger questions, left for later. This just fixes up the
heuristic so that it at least approximates working again. More analysis
and work needed.

Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Mel Gorman <mgorman@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

+12 -2
+6 -1
mm/huge_memory.c
@@ -1295,8 +1295,13 @@
 	 * Avoid grouping on DSO/COW pages in specific and RO pages
 	 * in general, RO pages shouldn't hurt as much anyway since
 	 * they can be in shared cache state.
+	 *
+	 * FIXME! This checks "pmd_dirty()" as an approximation of
+	 * "is this a read-only page", since checking "pmd_write()"
+	 * is even more broken. We haven't actually turned this into
+	 * a writable page, so pmd_write() will always be false.
 	 */
-	if (!pmd_write(pmd))
+	if (!pmd_dirty(pmd))
 		flags |= TNF_NO_GROUP;
 
 	/*
+6 -1
mm/memory.c
@@ -3072,8 +3072,13 @@
 	 * Avoid grouping on DSO/COW pages in specific and RO pages
 	 * in general, RO pages shouldn't hurt as much anyway since
 	 * they can be in shared cache state.
+	 *
+	 * FIXME! This checks "pmd_dirty()" as an approximation of
+	 * "is this a read-only page", since checking "pmd_write()"
+	 * is even more broken. We haven't actually turned this into
+	 * a writable page, so pmd_write() will always be false.
 	 */
-	if (!pte_write(pte))
+	if (!pte_dirty(pte))
 		flags |= TNF_NO_GROUP;
 
 	/*