Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

x86/mm: Eliminate window where TLB flushes may be inadvertently skipped

tl;dr: There is a window in the mm switching code where the new CR3 is
set and the CPU should be getting TLB flushes for the new mm. But
should_flush_tlb() has a bug and suppresses the flush. Fix it by
widening the window where should_flush_tlb() sends an IPI.

Long Version:

=== History ===

There were a few things leading up to this.

First, updating mm_cpumask() was observed to be too expensive, so it was
made lazier. But being lazy caused too many unnecessary IPIs to CPUs
due to the now-lazy mm_cpumask(). So code was added to cull
mm_cpumask() periodically[2]. But that culling was a bit too aggressive
and skipped sending TLB flushes to CPUs that need them. So here we are
again.

=== Problem ===

The too-aggressive code in should_flush_tlb() strikes in this window:

// Turn on IPIs for this CPU/mm combination, but only
// if should_flush_tlb() agrees:
cpumask_set_cpu(cpu, mm_cpumask(next));

next_tlb_gen = atomic64_read(&next->context.tlb_gen);
choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
load_new_mm_cr3(need_flush);
// ^ After 'need_flush' is set to false, IPIs *MUST*
// be sent to this CPU and not be ignored.

this_cpu_write(cpu_tlbstate.loaded_mm, next);
// ^ Not until this point does should_flush_tlb()
// become true!

should_flush_tlb() will suppress TLB flushes between load_new_mm_cr3()
and writing to 'loaded_mm', which is a window where they should not be
suppressed. Whoops.

=== Solution ===

Thankfully, the fuzzy "just about to write CR3" window is already marked
with loaded_mm==LOADED_MM_SWITCHING. Simply checking for that state in
should_flush_tlb() is sufficient to ensure that the CPU is targeted with
an IPI.

This will cause more TLB flush IPIs. But the window is relatively small
and I do not expect this to cause any kind of measurable performance
impact.

Update the comment where LOADED_MM_SWITCHING is written since it grew
yet another user.

Peter Z also raised a concern that should_flush_tlb() might not observe
'loaded_mm' and 'is_lazy' in the same order that switch_mm_irqs_off()
writes them. Add a barrier to ensure that they are observed in the
order they are written.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/oe-lkp/202411282207.6bd28eae-lkp@intel.com/ [1]
Fixes: 6db2526c1d69 ("x86/mm/tlb: Only trim the mm_cpumask once a second") [2]
Reported-by: Stephen Dolan <sdolan@janestreet.com>
Cc: stable@vger.kernel.org
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

authored by

Dave Hansen and committed by
Linus Torvalds
fea4e317 9c69f888

+19 -3
+19 -3
arch/x86/mm/tlb.c
··· 899 899 cond_mitigation(tsk); 900 900 901 901 /* 902 - * Let nmi_uaccess_okay() and finish_asid_transition() 903 - * know that CR3 is changing. 902 + * Indicate that CR3 is about to change. nmi_uaccess_okay() 903 + * and others are sensitive to the window where mm_cpumask(), 904 + * CR3 and cpu_tlbstate.loaded_mm are not all in sync. 904 905 */ 905 906 this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING); 906 907 barrier(); ··· 1205 1204 1206 1205 static bool should_flush_tlb(int cpu, void *data) 1207 1206 { 1207 + struct mm_struct *loaded_mm = per_cpu(cpu_tlbstate.loaded_mm, cpu); 1208 1208 struct flush_tlb_info *info = data; 1209 + 1210 + /* 1211 + * Order the 'loaded_mm' and 'is_lazy' against their 1212 + * write ordering in switch_mm_irqs_off(). Ensure 1213 + * 'is_lazy' is at least as new as 'loaded_mm'. 1214 + */ 1215 + smp_rmb(); 1209 1216 1210 1217 /* Lazy TLB will get flushed at the next context switch. */ 1211 1218 if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu)) ··· 1223 1214 if (!info->mm) 1224 1215 return true; 1225 1216 1217 + /* 1218 + * While switching, the remote CPU could have state from 1219 + * either the prev or next mm. Assume the worst and flush. 1220 + */ 1221 + if (loaded_mm == LOADED_MM_SWITCHING) 1222 + return true; 1223 + 1226 1224 /* The target mm is loaded, and the CPU is not lazy. */ 1227 - if (per_cpu(cpu_tlbstate.loaded_mm, cpu) == info->mm) 1225 + if (loaded_mm == info->mm) 1228 1226 return true; 1229 1227 1230 1228 /* In cpumask, but not the loaded mm? Periodically remove by flushing. */