x86/kfence: avoid writing L1TF-vulnerable PTEs

For native, the choice of PTE is fine. There's real memory backing the
non-present PTE. However, for XenPV, Xen complains:

(XEN) d1 L1TF-vulnerable L1e 8010000018200066 - Shadowing

To explain, some background on XenPV pagetables:

Xen PV guests control their own pagetables; they choose the new
PTE value, and use hypercalls to make changes so Xen can audit for
safety.

In addition to a regular reference count, Xen also maintains a type
reference count, e.g. SegDesc (referenced by vGDT/vLDT), Writable
(referenced with _PAGE_RW) or L{1..4} (referenced by vCR3 or a lower
pagetable level). This is in order to prevent e.g. a page being
inserted into the pagetables for which the guest has a writable mapping.

For non-present mappings, all other bits become software accessible,
and typically contain metadata rather than a real frame address. There is
nothing that a reference count could sensibly be tied to. As such, even
if Xen could recognise the address as currently safe, nothing would
prevent that frame from changing owner to another VM in the future.

When Xen detects a PV guest writing a L1TF-PTE, it responds by
activating shadow paging. This is normally only used for the live phase
of migration, and comes with a considerable overhead.

KFENCE only cares about getting #PF to catch wild accesses; it doesn't
care about the value for non-present mappings. Use a fully inverted PTE,
to avoid hitting the slow path when running under Xen.
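
As a rough userspace illustration of the bit-level difference (not kernel
code), the sketch below compares clearing only the Present bit with
inverting the whole PTE. The names PTE_PRESENT, PTE_ADDR_MASK,
protect_clear_present() and toggle_inverted() are made up for this example.
The address mask assumes the x86-64 4K PTE layout with the frame address in
bits 12..51, and the sample value 0x8010000018200067 is an assumption: the
PTE from the Xen message above with the Present bit set again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

/* Bit 0 of an x86 PTE is the Present bit. */
#define PTE_PRESENT   ((pteval_t)1 << 0)
/* Assumption: frame address field of a 4K PTE, bits 12..51. */
#define PTE_ADDR_MASK (((pteval_t)1 << 52) - ((pteval_t)1 << 12))

static bool pte_present(pteval_t val)
{
	return (val & PTE_PRESENT) != 0;
}

/* Previous approach: clear only Present; the address bits are unchanged. */
static pteval_t protect_clear_present(pteval_t val)
{
	return val & ~PTE_PRESENT;
}

/*
 * Approach described above: invert every bit. Present flips, and the
 * address field becomes the complement of the real frame address.
 */
static pteval_t toggle_inverted(pteval_t val)
{
	return ~val;
}

/* Inverting twice restores the original PTE, so unprotect is the same op. */
static void check_roundtrip(pteval_t val)
{
	assert(pte_present(toggle_inverted(val)) != pte_present(val));
	assert(toggle_inverted(toggle_inverted(val)) == val);
}
```

With the sample value, protect_clear_present() yields 0x8010000018200066,
whose address field still names a low frame, while toggle_inverted() yields
0x7FEFFFFFE7DFFF98, whose address field has its upper bits set.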

While adjusting the logic, take the opportunity to skip all actions if the
PTE is already in the right state, halve the number of PVOps callouts, and
skip TLB maintenance on a !P -> P transition, which benefits non-Xen cases
too.
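
The resulting control flow can be modelled in userspace roughly as follows.
This is only a sketch of the decision logic, not the kernel function:
model_protect_page() and struct model_stats are hypothetical names, the PTE
is a plain integer, and the counters stand in for the real PTE write and
local TLB flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

#define PTE_PRESENT ((pteval_t)1 << 0)

/* Hypothetical counters standing in for real PTE writes and TLB flushes. */
struct model_stats {
	unsigned int pte_writes;
	unsigned int tlb_flushes;
};

static void model_protect_page(pteval_t *pte, bool protect,
			       struct model_stats *st)
{
	pteval_t val;

	assert(pte && st);
	val = *pte;

	/* Protecting means not-present; if already in that state, do nothing. */
	if (protect != ((val & PTE_PRESENT) != 0))
		return;

	/* One write toggles the state by inverting the whole PTE. */
	*pte = ~val;
	st->pte_writes++;

	/* A not-present -> present transition needs no TLB flush. */
	if (!protect)
		return;

	/* Only a present -> not-present transition flushes the local TLB. */
	st->tlb_flushes++;
}
```

Calling it with protect=true on a present PTE performs one write and one
flush; repeating the call is a no-op; a following protect=false call
performs one more write, no flush, and restores the original value.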

Link: https://lkml.kernel.org/r/20260106180426.710013-1-andrew.cooper3@citrix.com
Fixes: 1dc0da6e9ec0 ("x86, kfence: enable KFENCE for x86")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Jann Horn <jannh@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

authored by Andrew Cooper and committed by Andrew Morton 4 months ago b505f194 605f6586

+24 -5

1 changed file

arch/x86/include/asm/kfence.h

···
 {
 	unsigned int level;
 	pte_t *pte = lookup_address(addr, &level);
+	pteval_t val;

 	if (WARN_ON(!pte || level != PG_LEVEL_4K))
 		return false;
+
+	val = pte_val(*pte);
+
+	/*
+	 * protect requires making the page not-present. If the PTE is
+	 * already in the right state, there's nothing to do.
+	 */
+	if (protect != !!(val & _PAGE_PRESENT))
+		return true;
+
+	/*
+	 * Otherwise, invert the entire PTE. This avoids writing out an
+	 * L1TF-vulnerable PTE (not present, without the high address bits
+	 * set).
+	 */
+	set_pte(pte, __pte(~val));
+
+	/*
+	 * If the page was protected (non-present) and we're making it
+	 * present, there is no need to flush the TLB at all.
+	 */
+	if (!protect)
+		return true;

 	/*
 	 * We need to avoid IPIs, as we may get KFENCE allocations or faults
···
 	 * does not flush TLBs on all CPUs. We can tolerate some inaccuracy;
 	 * lazy fault handling takes care of faults after the page is PRESENT.
 	 */
-
-	if (protect)
-		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
-	else
-		set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));

 	/*
 	 * Flush this CPU's TLB, assuming whoever did the allocation/free is
