Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

perf: Add APIs to create/release mediated guest vPMUs

Currently, exposing PMU capabilities to a KVM guest is done by emulating
guest PMCs via host perf events, i.e. by having KVM be "just" another user
of perf. As a result, the guest and host are effectively competing for
resources, and emulating guest accesses to vPMU resources requires
expensive actions (expensive relative to the native instruction). The
overhead and resource competition result in degraded guest performance
and ultimately very poor vPMU accuracy.

To address the issues with the perf-emulated vPMU, introduce a "mediated
vPMU", where the data plane (PMCs and enable/disable knobs) is exposed
directly to the guest, but the control plane (event selectors and access
to fixed counters) is managed by KVM (via MSR interceptions). To allow
host perf usage of the PMU to (partially) co-exist with KVM/guest usage
of the PMU, KVM and perf will coordinate on a world switch between host
perf context and guest vPMU context near VM-Enter/VM-Exit.

Add two exported APIs, perf_{create,release}_mediated_pmu(), to allow KVM
to create and release a mediated PMU instance (per VM). Because host perf
context will be deactivated while the guest is running, mediated PMU usage
will be mutually exclusive with perf analysis of the guest, i.e. perf
events that do NOT exclude the guest will not behave as expected.
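
For a host-side profiler, the practical consequence is that events should
set exclude_guest if they are to coexist with a mediated PMU. A minimal
userspace sketch, assuming the standard struct perf_event_attr from
<linux/perf_event.h> and assuming the PMU backing the event advertises the
mediated vPMU capability; the helper name host_only_cycles_attr() is
illustrative and not part of this patch:

```c
#include <assert.h>
#include <string.h>
#include <linux/perf_event.h>

/*
 * Illustrative helper (not part of the patch): build an attr for a
 * host-only CPU cycles counter. Setting exclude_guest means the event
 * does not try to count while a guest is running.
 */
static struct perf_event_attr host_only_cycles_attr(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.disabled = 1;	/* caller enables the event explicitly */
	attr.exclude_guest = 1;	/* host-only: do not count guest execution */

	return attr;
}
```

The resulting attr would be handed to perf_event_open(2). With
exclude_guest left at 0, event creation would be expected to fail with
EOPNOTSUPP while any mediated PMU exists.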

To avoid silent failure of !exclude_guest perf events, disallow creating a
mediated PMU if there are active !exclude_guest events, and on the perf
side, disallow creating new !exclude_guest perf events while there is
at least one active mediated PMU.

Exempt PMU resources that do not support mediated PMU usage, i.e. that are
outside the scope/view of KVM's vPMU and will not be swapped out while the
guest is running.
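
The mutual exclusion and the exemption can be modeled in plain userspace
C. This is a single-threaded sketch of the accounting rules only, not the
kernel code: the kernel uses atomic_t counters serialized by a mutex,
while the model_* names, the struct, and the boolean parameters here are
assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace stand-in for the two counters kept by perf core. */
struct model_state {
	int nr_include_guest_events;	/* live !exclude_guest events */
	int nr_mediated_pmu_vms;	/* VMs holding a mediated PMU */
};

/* Only !exclude_guest events on mediated-capable PMUs are accounted. */
static bool model_is_include_guest(bool mediated_cap, bool exclude_guest)
{
	return mediated_cap && !exclude_guest;
}

static int model_account_event(struct model_state *s, bool mediated_cap,
			       bool exclude_guest)
{
	if (!model_is_include_guest(mediated_cap, exclude_guest))
		return 0;		/* exempt: nothing to track */
	if (s->nr_mediated_pmu_vms)
		return -EOPNOTSUPP;	/* a mediated PMU is active */
	s->nr_include_guest_events++;
	return 0;
}

static void model_unaccount_event(struct model_state *s, bool mediated_cap,
				  bool exclude_guest)
{
	if (model_is_include_guest(mediated_cap, exclude_guest))
		s->nr_include_guest_events--;
}

static int model_create_mediated_pmu(struct model_state *s)
{
	/* Once one VM holds a mediated PMU, further VMs just take a ref. */
	if (s->nr_mediated_pmu_vms) {
		s->nr_mediated_pmu_vms++;
		return 0;
	}
	if (s->nr_include_guest_events)
		return -EBUSY;		/* would silently break those events */
	s->nr_mediated_pmu_vms++;
	return 0;
}

static void model_release_mediated_pmu(struct model_state *s)
{
	if (s->nr_mediated_pmu_vms)
		s->nr_mediated_pmu_vms--;
}
```

In this model, a live !exclude_guest event on a capable PMU makes
model_create_mediated_pmu() return -EBUSY, an active mediated PMU makes
creation of such an event return -EOPNOTSUPP, and events on PMUs without
the capability are never affected.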

Guard mediated PMU with a new kconfig to help readers identify code paths
that are unique to mediated PMU support, and to allow for adding arch-
specific hooks without stubs. KVM x86 is expected to be the only KVM
architecture to support a mediated PMU in the near future (e.g. arm64 is
trending toward a partitioned PMU implementation), and KVM x86 will select
PERF_GUEST_MEDIATED_PMU unconditionally, i.e. won't need stubs.

Immediately select PERF_GUEST_MEDIATED_PMU when KVM x86 is enabled so that
all paths are compile tested. Full KVM support is on its way...

[sean: add kconfig and WARNing, rewrite changelog, swizzle patch ordering]
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-5-seanjc@google.com

Authored by Kan Liang and committed by Peter Zijlstra
eff95e17 991bdf7e

+93
+1
arch/x86/kvm/Kconfig
···
 	select SCHED_INFO
 	select PERF_EVENTS
 	select GUEST_PERF_EVENTS
+	select PERF_GUEST_MEDIATED_PMU
 	select HAVE_KVM_MSI
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select HAVE_KVM_NO_POLL
+6
include/linux/perf_event.h
···
 #define PERF_PMU_CAP_EXTENDED_HW_TYPE	0x0100
 #define PERF_PMU_CAP_AUX_PAUSE	0x0200
 #define PERF_PMU_CAP_AUX_PREFER_LARGE	0x0400
+#define PERF_PMU_CAP_MEDIATED_VPMU	0x0800
 
 /**
  * pmu::scope
···
 extern int perf_event_account_interrupt(struct perf_event *event);
 extern int perf_event_period(struct perf_event *event, u64 value);
 extern u64 perf_event_pause(struct perf_event *event, bool reset);
+
+#ifdef CONFIG_PERF_GUEST_MEDIATED_PMU
+int perf_create_mediated_pmu(void);
+void perf_release_mediated_pmu(void);
+#endif
 
 #else /* !CONFIG_PERF_EVENTS: */
 
+4
init/Kconfig
···
 	bool
 	depends on HAVE_PERF_EVENTS
 
+config PERF_GUEST_MEDIATED_PMU
+	bool
+	depends on GUEST_PERF_EVENTS
+
 config PERF_USE_VMALLOC
 	bool
 	help
+82
kernel/events/core.c
···
 	call_rcu(&event->rcu_head, free_event_rcu);
 }
 
+static void mediated_pmu_unaccount_event(struct perf_event *event);
+
 DEFINE_FREE(__free_event, struct perf_event *, if (_T) __free_event(_T))
 
 /* vs perf_event_alloc() success */
···
 	irq_work_sync(&event->pending_disable_irq);
 
 	unaccount_event(event);
+	mediated_pmu_unaccount_event(event);
 
 	if (event->rb) {
 		/*
···
 	return count;
 }
 EXPORT_SYMBOL_GPL(perf_event_pause);
+
+#ifdef CONFIG_PERF_GUEST_MEDIATED_PMU
+static atomic_t nr_include_guest_events __read_mostly;
+
+static atomic_t nr_mediated_pmu_vms __read_mostly;
+static DEFINE_MUTEX(perf_mediated_pmu_mutex);
+
+/* !exclude_guest event of PMU with PERF_PMU_CAP_MEDIATED_VPMU */
+static inline bool is_include_guest_event(struct perf_event *event)
+{
+	if ((event->pmu->capabilities & PERF_PMU_CAP_MEDIATED_VPMU) &&
+	    !event->attr.exclude_guest)
+		return true;
+
+	return false;
+}
+
+static int mediated_pmu_account_event(struct perf_event *event)
+{
+	if (!is_include_guest_event(event))
+		return 0;
+
+	guard(mutex)(&perf_mediated_pmu_mutex);
+
+	if (atomic_read(&nr_mediated_pmu_vms))
+		return -EOPNOTSUPP;
+
+	atomic_inc(&nr_include_guest_events);
+	return 0;
+}
+
+static void mediated_pmu_unaccount_event(struct perf_event *event)
+{
+	if (!is_include_guest_event(event))
+		return;
+
+	atomic_dec(&nr_include_guest_events);
+}
+
+/*
+ * Currently invoked at VM creation to
+ * - Check whether there are existing !exclude_guest events of PMU with
+ *   PERF_PMU_CAP_MEDIATED_VPMU
+ * - Set nr_mediated_pmu_vms to prevent !exclude_guest event creation on
+ *   PMUs with PERF_PMU_CAP_MEDIATED_VPMU
+ *
+ * No impact for the PMU without PERF_PMU_CAP_MEDIATED_VPMU. The perf
+ * still owns all the PMU resources.
+ */
+int perf_create_mediated_pmu(void)
+{
+	guard(mutex)(&perf_mediated_pmu_mutex);
+	if (atomic_inc_not_zero(&nr_mediated_pmu_vms))
+		return 0;
+
+	if (atomic_read(&nr_include_guest_events))
+		return -EBUSY;
+
+	atomic_inc(&nr_mediated_pmu_vms);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(perf_create_mediated_pmu);
+
+void perf_release_mediated_pmu(void)
+{
+	if (WARN_ON_ONCE(!atomic_read(&nr_mediated_pmu_vms)))
+		return;
+
+	atomic_dec(&nr_mediated_pmu_vms);
+}
+EXPORT_SYMBOL_GPL(perf_release_mediated_pmu);
+#else
+static int mediated_pmu_account_event(struct perf_event *event) { return 0; }
+static void mediated_pmu_unaccount_event(struct perf_event *event) {}
+#endif
 
 /*
  * Holding the top-level event's child_mutex means that any
···
 	}
 
 	err = security_perf_event_alloc(event);
+	if (err)
+		return ERR_PTR(err);
+
+	err = mediated_pmu_account_event(event);
 	if (err)
 		return ERR_PTR(err);
 
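
On the caller side, each successful perf_create_mediated_pmu() needs
exactly one matching perf_release_mediated_pmu(), since an unbalanced
release trips the WARN_ON_ONCE(). The sketch below shows one way a caller
might pair the two across a VM's lifetime. It is a userspace illustration
under stated assumptions: the fake_* functions are stand-ins for the perf
APIs, and struct demo_vm, demo_vm_init() and demo_vm_destroy() are
hypothetical names, not KVM code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace stand-ins for perf core state (hypothetical). */
static int fake_nr_mediated_pmu_vms;
static bool fake_include_guest_events_active;

/* Stand-in for perf_create_mediated_pmu(). */
static int fake_perf_create_mediated_pmu(void)
{
	if (!fake_nr_mediated_pmu_vms && fake_include_guest_events_active)
		return -EBUSY;
	fake_nr_mediated_pmu_vms++;
	return 0;
}

/* Stand-in for perf_release_mediated_pmu(). */
static void fake_perf_release_mediated_pmu(void)
{
	if (fake_nr_mediated_pmu_vms > 0)
		fake_nr_mediated_pmu_vms--;
}

/* Hypothetical per-VM state: remember whether a reference is held. */
struct demo_vm {
	bool has_mediated_pmu;
};

static int demo_vm_init(struct demo_vm *vm, bool want_mediated_pmu)
{
	int err;

	vm->has_mediated_pmu = false;
	if (!want_mediated_pmu)
		return 0;

	err = fake_perf_create_mediated_pmu();
	if (err)
		return err;	/* e.g. -EBUSY: !exclude_guest events exist */

	vm->has_mediated_pmu = true;
	return 0;
}

static void demo_vm_destroy(struct demo_vm *vm)
{
	/* Release only what was acquired, and only once. */
	if (vm->has_mediated_pmu) {
		fake_perf_release_mediated_pmu();
		vm->has_mediated_pmu = false;
	}
}
```

Tracking the reference in per-VM state keeps the destroy path safe to call
after a failed or skipped create.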