Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

+3

Documentation/arch/arm64/silicon-errata.rst

··· 52 52 | Allwinner | A64/R18 | UNKNOWN1 | SUN50I_ERRATUM_UNKNOWN1 | 53 53 +----------------+-----------------+-----------------+-----------------------------+ 54 54 +----------------+-----------------+-----------------+-----------------------------+ 55 + | Ampere | AmpereOne | AC03_CPU_38 | AMPERE_ERRATUM_AC03_CPU_38 | 56 + +----------------+-----------------+-----------------+-----------------------------+ 57 + +----------------+-----------------+-----------------+-----------------------------+ 55 58 | ARM | Cortex-A510 | #2457168 | ARM64_ERRATUM_2457168 | 56 59 +----------------+-----------------+-----------------+-----------------------------+ 57 60 | ARM | Cortex-A510 | #2064142 | ARM64_ERRATUM_2064142 |

+1

Documentation/process/maintainer-handbooks.rst

··· 18 18 maintainer-netdev 19 19 maintainer-soc 20 20 maintainer-tip 21 + maintainer-kvm-x86

+390

Documentation/process/maintainer-kvm-x86.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + KVM x86 4 + ======= 5 + 6 + Foreword 7 + -------- 8 + KVM strives to be a welcoming community; contributions from newcomers are 9 + valued and encouraged. Please do not be discouraged or intimidated by the 10 + length of this document and the many rules/guidelines it contains. Everyone 11 + makes mistakes, and everyone was a newbie at some point. So long as you make 12 + an honest effort to follow KVM x86's guidelines, are receptive to feedback, 13 + and learn from any mistakes you make, you will be welcomed with open arms, not 14 + torches and pitchforks. 15 + 16 + TL;DR 17 + ----- 18 + Testing is mandatory. Be consistent with established styles and patterns. 19 + 20 + Trees 21 + ----- 22 + KVM x86 is currently in a transition period from being part of the main KVM 23 + tree, to being "just another KVM arch". As such, KVM x86 is split across the 24 + main KVM tree, ``git.kernel.org/pub/scm/virt/kvm/kvm.git``, and a KVM x86 25 + specific tree, ``github.com/kvm-x86/linux.git``. 26 + 27 + Generally speaking, fixes for the current cycle are applied directly to the 28 + main KVM tree, while all development for the next cycle is routed through the 29 + KVM x86 tree. In the unlikely event that a fix for the current cycle is routed 30 + through the KVM x86 tree, it will be applied to the ``fixes`` branch before 31 + making its way to the main KVM tree. 32 + 33 + Note, this transition period is expected to last quite some time, i.e. will be 34 + the status quo for the foreseeable future. 35 + 36 + Branches 37 + ~~~~~~~~ 38 + The KVM x86 tree is organized into multiple topic branches. The purpose of 39 + using finer-grained topic branches is to make it easier to keep tabs on an area 40 + of development, and to limit the collateral damage of human errors and/or buggy 41 + commits, e.g. 
dropping the HEAD commit of a topic branch has no impact on other 42 + in-flight commits' SHA1 hashes, and having to reject a pull request due to bugs 43 + delays only that topic branch. 44 + 45 + All topic branches, except for ``next`` and ``fixes``, are rolled into ``next`` 46 + via a Cthulhu merge on an as-needed basis, i.e. when a topic branch is updated. 47 + As a result, force pushes to ``next`` are common. 48 + 49 + Lifecycle 50 + ~~~~~~~~~ 51 + Fixes that target the current release, a.k.a. mainline, are typically applied 52 + directly to the main KVM tree, i.e. do not route through the KVM x86 tree. 53 + 54 + Changes that target the next release are routed through the KVM x86 tree. Pull 55 + requests (from KVM x86 to main KVM) are sent for each KVM x86 topic branch, 56 + typically the week before Linus' opening of the merge window, e.g. the week 57 + following rc7 for "normal" releases. If all goes well, the topic branches are 58 + rolled into the main KVM pull request sent during Linus' merge window. 59 + 60 + The KVM x86 tree doesn't have its own official merge window, but there's a soft 61 + close around rc5 for new features, and a soft close around rc6 for fixes (for 62 + the next release; see above for fixes that target the current release). 63 + 64 + Timeline 65 + ~~~~~~~~ 66 + Submissions are typically reviewed and applied in FIFO order, with some wiggle 67 + room for the size of a series, patches that are "cache hot", etc. Fixes, 68 + especially for the current release and/or stable trees, get to jump the queue. 69 + Patches that will be taken through a non-KVM tree (most often through the tip 70 + tree) and/or have other acks/reviews also jump the queue to some extent. 71 + 72 + Note, the vast majority of review is done between rc1 and rc6, give or take. 73 + The period between rc6 and the next rc1 is used to catch up on other tasks, 74 + i.e. radio silence during this period isn't unusual. 
75 + 76 + Pings to get a status update are welcome, but keep in mind the timing of the 77 + current release cycle and have realistic expectations. If you are pinging for 78 + acceptance, i.e. not just for feedback or an update, please do everything you 79 + can, within reason, to ensure that your patches are ready to be merged! Pings 80 + on series that break the build or fail tests lead to unhappy maintainers! 81 + 82 + Development 83 + ----------- 84 + 85 + Base Tree/Branch 86 + ~~~~~~~~~~~~~~~~ 87 + Fixes that target the current release, a.k.a. mainline, should be based on 88 + ``git://git.kernel.org/pub/scm/virt/kvm/kvm.git master``. Note, fixes do not 89 + automatically warrant inclusion in the current release. There is no singular 90 + rule, but typically only fixes for bugs that are urgent, critical, and/or were 91 + introduced in the current release should target the current release. 92 + 93 + Everything else should be based on ``kvm-x86/next``, i.e. there is no need to 94 + select a specific topic branch as the base. If there are conflicts and/or 95 + dependencies across topic branches, it is the maintainer's job to sort them 96 + out. 97 + 98 + The only exception to using ``kvm-x86/next`` as the base is if a patch/series 99 + is a multi-arch series, i.e. has non-trivial modifications to common KVM code 100 + and/or has more than superficial changes to other architectures' code. Multi- 101 + arch patch/series should instead be based on a common, stable point in KVM's 102 + history, e.g. the release candidate upon which ``kvm-x86/next`` is based. If 103 + you're unsure whether a patch/series is truly multi-arch, err on the side of 104 + caution and treat it as multi-arch, i.e. use a common base. 105 + 106 + Coding Style 107 + ~~~~~~~~~~~~ 108 + When it comes to style, naming, patterns, etc., consistency is the number one 109 + priority in KVM x86. If all else fails, match what already exists. 
110 + 111 + With a few caveats listed below, follow the tip tree maintainers' preferred 112 + :ref:`maintainer-tip-coding-style`, as patches/series often touch both KVM and 113 + non-KVM x86 files, i.e. draw the attention of KVM *and* tip tree maintainers. 114 + 115 + Using reverse fir tree, a.k.a. reverse Christmas tree or reverse XMAS tree, for 116 + variable declarations isn't strictly required, though it is still preferred. 117 + 118 + Except for a handful of special snowflakes, do not use kernel-doc comments for 119 + functions. The vast majority of "public" KVM functions aren't truly public as 120 + they are intended only for KVM-internal consumption (there are plans to 121 + privatize KVM's headers and exports to enforce this). 122 + 123 + Comments 124 + ~~~~~~~~ 125 + Write comments using imperative mood and avoid pronouns. Use comments to 126 + provide a high level overview of the code, and/or to explain why the code does 127 + what it does. Do not reiterate what the code literally does; let the code 128 + speak for itself. If the code itself is inscrutable, comments will not help. 129 + 130 + SDM and APM References 131 + ~~~~~~~~~~~~~~~~~~~~~~ 132 + Much of KVM's code base is directly tied to architectural behavior defined in 133 + Intel's Software Development Manual (SDM) and AMD's Architecture Programmer’s 134 + Manual (APM). Use of "Intel's SDM" and "AMD's APM", or even just "SDM" or 135 + "APM", without additional context is a-ok. 136 + 137 + Do not reference specific sections, tables, figures, etc. by number, especially 138 + not in comments. Instead, if necessary (see below), copy-paste the relevant 139 + snippet and reference sections/tables/figures by name. The layouts of the SDM 140 + and APM are constantly changing, and so the numbers/labels aren't stable. 141 + 142 + Generally speaking, do not explicitly reference or copy-paste from the SDM or 143 + APM in comments. 
With few exceptions, KVM *must* honor architectural behavior, 144 + therefore it's implied that KVM behavior is emulating SDM and/or APM behavior. 145 + Note, referencing the SDM/APM in changelogs to justify the change and provide 146 + context is perfectly ok and encouraged. 147 + 148 + Shortlog 149 + ~~~~~~~~ 150 + The preferred prefix format is ``KVM: <topic>:``, where ``<topic>`` is one of:: 151 + 152 + - x86 153 + - x86/mmu 154 + - x86/pmu 155 + - x86/xen 156 + - selftests 157 + - SVM 158 + - nSVM 159 + - VMX 160 + - nVMX 161 + 162 + **DO NOT use x86/kvm!** ``x86/kvm`` is used exclusively for Linux-as-a-KVM-guest 163 + changes, i.e. for arch/x86/kernel/kvm.c. Do not use file names or complete file 164 + paths as the subject/shortlog prefix. 165 + 166 + Note, these don't align with the topic branches (the topic branches care much 167 + more about code conflicts). 168 + 169 + All names are case sensitive! ``KVM: x86:`` is good, ``kvm: vmx:`` is not. 170 + 171 + Capitalize the first word of the condensed patch description, but omit ending 172 + punctuation. E.g.:: 173 + 174 + KVM: x86: Fix a null pointer dereference in function_xyz() 175 + 176 + not:: 177 + 178 + kvm: x86: fix a null pointer dereference in function_xyz. 179 + 180 + If a patch touches multiple topics, traverse up the conceptual tree to find the 181 + first common parent (which is often simply ``x86``). When in doubt, 182 + ``git log path/to/file`` should provide a reasonable hint. 183 + 184 + New topics do occasionally pop up, but please start an on-list discussion if 185 + you want to propose introducing a new topic, i.e. don't go rogue. 186 + 187 + See :ref:`the_canonical_patch_format` for more information, with one amendment: 188 + do not treat the 70-75 character limit as an absolute, hard limit. Instead, 189 + use 75 characters as a firm-but-not-hard limit, and use 80 characters as a hard 190 + limit. I.e. 
let the shortlog run a few characters over the standard limit if 191 + you have good reason to do so. 192 + 193 + Changelog 194 + ~~~~~~~~~ 195 + Most importantly, write changelogs using imperative mood and avoid pronouns. 196 + 197 + See :ref:`describe_changes` for more information, with one amendment: lead with 198 + a short blurb on the actual changes, and then follow up with the context and 199 + background. Note! This order directly conflicts with the tip tree's preferred 200 + approach! Please follow the tip tree's preferred style when sending patches 201 + that primarily target arch/x86 code that is _NOT_ KVM code. 202 + 203 + Stating what a patch does before diving into details is preferred by KVM x86 204 + for several reasons. First and foremost, what code is actually being changed 205 + is arguably the most important information, and so that info should be easy to 206 + find. Changelogs that bury the "what's actually changing" in a one-liner after 207 + 3+ paragraphs of background make it very hard to find that information. 208 + 209 + For initial review, one could argue the "what's broken" is more important, but 210 + for skimming logs and git archaeology, the gory details matter less and less. 211 + E.g. when doing a series of "git blame", the details of each change along the 212 + way are useless, the details only matter for the culprit. Providing the "what 213 + changed" makes it easy to quickly determine whether or not a commit might be of 214 + interest. 215 + 216 + Another benefit of stating "what's changing" first is that it's almost always 217 + possible to state "what's changing" in a single sentence. Conversely, all but 218 + the most simple bugs require multiple sentences or paragraphs to fully describe 219 + the problem. If both the "what's changing" and "what's the bug" are super 220 + short then the order doesn't matter. 
But if one is shorter (almost always the 221 + "what's changing"), then covering the shorter one first is advantageous because 222 + it's less of an inconvenience for readers/reviewers that have a strict ordering 223 + preference. E.g. having to skip one sentence to get to the context is less 224 + painful than having to skip three paragraphs to get to "what's changing". 225 + 226 + Fixes 227 + ~~~~~ 228 + If a change fixes a KVM/kernel bug, add a Fixes: tag even if the change doesn't 229 + need to be backported to stable kernels, and even if the change fixes a bug in 230 + an older release. 231 + 232 + Conversely, if a fix does need to be backported, explicitly tag the patch with 233 + "Cc: stable@vger.kernel.org" (though the email itself doesn't need to Cc: stable); 234 + KVM x86 opts out of backporting Fixes: by default. Some auto-selected patches 235 + do get backported, but require explicit maintainer approval (search MANUALSEL). 236 + 237 + Function References 238 + ~~~~~~~~~~~~~~~~~~~ 239 + When a function is mentioned in a comment, changelog, or shortlog (or anywhere 240 + for that matter), use the format ``function_name()``. The parentheses provide 241 + context and disambiguate the reference. 242 + 243 + Testing 244 + ------- 245 + At a bare minimum, *all* patches in a series must build cleanly for KVM_INTEL=m, 246 + KVM_AMD=m, and KVM_WERROR=y. Building every possible combination of Kconfigs 247 + isn't feasible, but the more the merrier. KVM_SMM, KVM_XEN, PROVE_LOCKING, and 248 + X86_64 are particularly interesting knobs to turn. 249 + 250 + Running KVM selftests and KVM-unit-tests is also mandatory (and stating the 251 + obvious, the tests need to pass). The only exception is for changes that have 252 + negligible probability of affecting runtime behavior, e.g. patches that only 253 + modify comments. When possible and relevant, testing on both Intel and AMD is 254 + strongly preferred. Booting an actual VM is encouraged, but not mandatory. 
255 + 256 + For changes that touch KVM's shadow paging code, running with TDP (EPT/NPT) 257 + disabled is mandatory. For changes that affect common KVM MMU code, running 258 + with TDP disabled is strongly encouraged. For all other changes, if the code 259 + being modified depends on and/or interacts with a module param, testing with 260 + the relevant settings is mandatory. 261 + 262 + Note, KVM selftests and KVM-unit-tests do have known failures. If you suspect 263 + a failure is not due to your changes, verify that the *exact same* failure 264 + occurs with and without your changes. 265 + 266 + Changes that touch reStructured Text documentation, i.e. .rst files, must build 267 + htmldocs cleanly, i.e. with no new warnings or errors. 268 + 269 + If you can't fully test a change, e.g. due to lack of hardware, clearly state 270 + what level of testing you were able to do, e.g. in the cover letter. 271 + 272 + New Features 273 + ~~~~~~~~~~~~ 274 + With one exception, new features *must* come with test coverage. KVM specific 275 + tests aren't strictly required, e.g. if coverage is provided by running a 276 + sufficiently enabled guest VM, or by running a related kernel selftest in a VM, 277 + but dedicated KVM tests are preferred in all cases. Negative testcases in 278 + particular are mandatory for enabling of new hardware features as error and 279 + exception flows are rarely exercised simply by running a VM. 280 + 281 + The only exception to this rule is if KVM is simply advertising support for a 282 + feature via KVM_GET_SUPPORTED_CPUID, i.e. for instructions/features that KVM 283 + can't prevent a guest from using and for which there is no true enabling. 284 + 285 + Note, "new features" does not just mean "new hardware features"! New features 286 + that can't be well validated using existing KVM selftests and/or KVM-unit-tests 287 + must come with tests. 
288 + 289 + Posting new feature development without tests to get early feedback is more 290 + than welcome, but such submissions should be tagged RFC, and the cover letter 291 + should clearly state what type of feedback is requested/expected. Do not abuse 292 + the RFC process; RFCs will typically not receive in-depth review. 293 + 294 + Bug Fixes 295 + ~~~~~~~~~ 296 + Except for "obvious" found-by-inspection bugs, fixes must be accompanied by a 297 + reproducer for the bug being fixed. In many cases the reproducer is implicit, 298 + e.g. for build errors and test failures, but it should still be clear to 299 + readers what is broken and how to verify the fix. Some leeway is given for 300 + bugs that are found via non-public workloads/tests, but providing regression 301 + tests for such bugs is strongly preferred. 302 + 303 + In general, regression tests are preferred for any bug that is not trivial to 304 + hit. E.g. even if the bug was originally found by a fuzzer such as syzkaller, 305 + a targeted regression test may be warranted if the bug requires hitting a 306 + one-in-a-million type race condition. 307 + 308 + Note, KVM bugs are rarely urgent *and* non-trivial to reproduce. Ask yourself 309 + if a bug is really truly the end of the world before posting a fix without a 310 + reproducer. 311 + 312 + Posting 313 + ------- 314 + 315 + Links 316 + ~~~~~ 317 + Do not explicitly reference bug reports, prior versions of a patch/series, etc. 318 + via ``In-Reply-To:`` headers. Using ``In-Reply-To:`` becomes an unholy mess 319 + for large series and/or when the version count gets high, and ``In-Reply-To:`` 320 + is useless for anyone that doesn't have the original message, e.g. if someone 321 + wasn't Cc'd on the bug report or if the list of recipients changes between 322 + versions. 323 + 324 + To link to a bug report, previous version, or anything of interest, use lore 325 + links. 
For referencing previous version(s), generally speaking do not include 326 + a Link: in the changelog as there is no need to record the history in git, i.e. 327 + put the link in the cover letter or in the section git ignores. Do provide a 328 + formal Link: for bug reports and/or discussions that led to the patch. The 329 + context of why a change was made is highly valuable for future readers. 330 + 331 + Git Base 332 + ~~~~~~~~ 333 + If you are using git version 2.9.0 or later (Googlers, this is all of you!), 334 + use ``git format-patch`` with the ``--base`` flag to automatically include the 335 + base tree information in the generated patches. 336 + 337 + Note, ``--base=auto`` works as expected if and only if a branch's upstream is 338 + set to the base topic branch, e.g. it will do the wrong thing if your upstream 339 + is set to your personal repository for backup purposes. An alternative "auto" 340 + solution is to derive the names of your development branches based on their 341 + KVM x86 topic, and feed that into ``--base``. E.g. ``x86/pmu/my_branch_name``, 342 + and then write a small wrapper to extract ``pmu`` from the current branch name 343 + to yield ``--base=x/pmu``, where ``x`` is whatever name your repository uses to 344 + track the KVM x86 remote. 345 + 346 + Co-Posting Tests 347 + ~~~~~~~~~~~~~~~~ 348 + KVM selftests that are associated with KVM changes, e.g. regression tests for 349 + bug fixes, should be posted along with the KVM changes as a single series. The 350 + standard kernel rules for bisection apply, i.e. KVM changes that result in test 351 + failures should be ordered after the selftests updates, and vice versa, new 352 + tests that fail due to KVM bugs should be ordered after the KVM fixes. 353 + 354 + KVM-unit-tests should *always* be posted separately. Tools, e.g. b4 am, don't 355 + know that KVM-unit-tests is a separate repository and get confused when patches 356 + in a series apply on different trees. 
To tie KVM-unit-tests patches back to 357 + KVM patches, first post the KVM changes and then provide a lore Link: to the 358 + KVM patch/series in the KVM-unit-tests patch(es). 359 + 360 + Notifications 361 + ------------- 362 + When a patch/series is officially accepted, a notification email will be sent 363 + in reply to the original posting (cover letter for multi-patch series). The 364 + notification will include the tree and topic branch, along with the SHA1s of 365 + the commits of applied patches. 366 + 367 + If a subset of patches is applied, this will be clearly stated in the 368 + notification. Unless stated otherwise, it's implied that any patches in the 369 + series that were not accepted need more work and should be submitted in a new 370 + version. 371 + 372 + If for some reason a patch is dropped after officially being accepted, a reply 373 + will be sent to the notification email explaining why the patch was dropped, as 374 + well as the next steps. 375 + 376 + SHA1 Stability 377 + ~~~~~~~~~~~~~~ 378 + SHA1s are not 100% guaranteed to be stable until they land in Linus' tree! A 379 + SHA1 is *usually* stable once a notification has been sent, but things happen. 380 + In most cases, an update to the notification email will be provided if an applied 381 + patch's SHA1 changes. However, in some scenarios, e.g. if all KVM x86 branches 382 + need to be rebased, individual notifications will not be given. 383 + 384 + Vulnerabilities 385 + --------------- 386 + Bugs that can be exploited by the guest to attack the host (kernel or 387 + userspace), or that can be exploited by a nested VM to attack *its* host (L2 attacking 388 + L1), are of particular interest to KVM. Please follow the protocol for 389 + :ref:`securitybugs` if you suspect a bug can lead to an escape, data leak, etc. 390 +
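The Shortlog rules in the handbook above are mechanical enough to check before posting. The following is a minimal sketch of such a check, not a tool that exists in the kernel tree: the helper name `kvm_x86_shortlog_ok()` is hypothetical, the topic list is copied from the handbook, and "ending punctuation" is assumed to mean one of `. ! ? , ; :` so that a trailing `()` on a function name is still accepted.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Topics listed in the handbook's Shortlog section (case sensitive). */
static const char *const kvm_x86_topics[] = {
	"x86", "x86/mmu", "x86/pmu", "x86/xen", "selftests",
	"SVM", "nSVM", "VMX", "nVMX",
};

/*
 * Hypothetical helper: return true if @subject looks like
 * "KVM: <topic>: Capitalized description" with no ending punctuation
 * and no more than 80 characters (the handbook's hard limit).
 */
static bool kvm_x86_shortlog_ok(const char *subject)
{
	const size_t nr_topics = sizeof(kvm_x86_topics) / sizeof(kvm_x86_topics[0]);
	size_t len = strlen(subject);
	const char *topic, *sep;
	size_t i, topic_len;

	if (len == 0 || len > 80)
		return false;

	/* Assumption: these characters count as "ending punctuation". */
	if (strchr(".!?,;:", subject[len - 1]))
		return false;

	if (strncmp(subject, "KVM: ", 5) != 0)
		return false;

	topic = subject + 5;
	sep = strstr(topic, ": ");
	if (!sep)
		return false;
	topic_len = (size_t)(sep - topic);

	for (i = 0; i < nr_topics; i++) {
		if (strlen(kvm_x86_topics[i]) == topic_len &&
		    memcmp(topic, kvm_x86_topics[i], topic_len) == 0)
			break;
	}
	if (i == nr_topics)
		return false;

	/* The first word of the description must be capitalized. */
	return isupper((unsigned char)sep[2]) != 0;
}
```

The check deliberately rejects `x86/kvm` and lowercase prefixes, since the handbook reserves `x86/kvm` for guest-side changes and states that all names are case sensitive. It does not try to judge the 75-character soft limit, which the handbook leaves to the author's discretion.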

+2

Documentation/process/maintainer-tip.rst

··· 455 455 Some of these options are x86-specific and can be left out when testing 456 456 on other architectures. 457 457 458 + .. _maintainer-tip-coding-style: 459 + 458 460 Coding style notes 459 461 ------------------ 460 462

+27

Documentation/virt/kvm/api.rst

··· 8445 8445 When getting the Modified Change Topology Report value, the attr->addr 8446 8446 must point to a byte where the value will be stored or retrieved from. 8447 8447 8448 + 8.40 KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 8449 + --------------------------------------- 8450 + 8451 + :Capability: KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 8452 + :Architectures: arm64 8453 + :Type: vm 8454 + :Parameters: arg[0] is the new split chunk size. 8455 + :Returns: 0 on success, -EINVAL if any memslot was already created. 8456 + 8457 + This capability sets the chunk size used in Eager Page Splitting. 8458 + 8459 + Eager Page Splitting improves the performance of dirty-logging (used 8460 + in live migrations) when guest memory is backed by huge-pages. It 8461 + avoids splitting huge-pages (into PAGE_SIZE pages) on fault, by doing 8462 + it eagerly when enabling dirty logging (with the 8463 + KVM_MEM_LOG_DIRTY_PAGES flag for a memory region), or when using 8464 + KVM_CLEAR_DIRTY_LOG. 8465 + 8466 + The chunk size specifies how many pages to break at a time, using a 8467 + single allocation for each chunk. The bigger the chunk size, the more pages 8468 + need to be allocated ahead of time. 8469 + 8470 + The chunk size needs to be a valid block size. The list of acceptable 8471 + block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a 8472 + 64-bit bitmap (each bit describing a block size). The default value is 8473 + 0, to disable the eager page splitting. 8474 + 8448 8475 9. Known KVM API problems 8449 8476 ========================= 8450 8477
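The acceptance rules described for KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE can be modelled in a few lines. This is a standalone sketch, not the kernel implementation: `struct vm_model` and `set_eager_split_chunk_size()` are hypothetical names, bit n of the bitmap is assumed to stand for a block of 2^n bytes, and returning -EINVAL for an unsupported size is an assumption (the text above only documents -EINVAL for the memslot case).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per-VM state the capability touches. */
struct vm_model {
	uint64_t supported_block_sizes;	/* assumed: bit n set => 2^n-byte block */
	unsigned int nr_memslots;	/* memslots created so far */
	uint64_t split_page_chunk_size;	/* 0 disables eager page splitting */
};

static int set_eager_split_chunk_size(struct vm_model *vm, uint64_t size)
{
	bool is_pow2 = size != 0 && (size & (size - 1)) == 0;

	/* Documented: the size can only change before any memslot exists. */
	if (vm->nr_memslots != 0)
		return -EINVAL;

	/*
	 * 0 is always accepted (splitting disabled). Anything else must be
	 * a single block size advertised in the bitmap. The error code for
	 * this case is an assumption.
	 */
	if (size != 0 && !(is_pow2 && (size & vm->supported_block_sizes)))
		return -EINVAL;

	vm->split_page_chunk_size = size;
	return 0;
}
```

With a bitmap advertising 4 KiB, 2 MiB and 1 GiB blocks, a 2 MiB chunk is accepted, a 3 MiB chunk is rejected, and any request is rejected once a memslot has been created.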

+1 -1

Documentation/virt/kvm/x86/mmu.rst

··· 205 205 role.passthrough: 206 206 The page is not backed by a guest page table, but its first entry 207 207 points to one. This is set if NPT uses 5-level page tables (host 208 - CR4.LA57=1) and is shadowing L1's 4-level NPT (L1 CR4.LA57=1). 208 + CR4.LA57=1) and is shadowing L1's 4-level NPT (L1 CR4.LA57=0). 209 209 gfn: 210 210 Either the guest page table containing the translations shadowed by this 211 211 page, or the base page frame for linear translations. See role.direct.

+1

MAINTAINERS

··· 11546 11546 M: Paolo Bonzini <pbonzini@redhat.com> 11547 11547 L: kvm@vger.kernel.org 11548 11548 S: Supported 11549 + P: Documentation/process/maintainer-kvm-x86.rst 11549 11550 T: git git://git.kernel.org/pub/scm/virt/kvm/kvm.git 11550 11551 F: arch/x86/include/asm/kvm* 11551 11552 F: arch/x86/include/asm/svm.h

+19

arch/arm64/Kconfig

··· 414 414 415 415 menu "ARM errata workarounds via the alternatives framework" 416 416 417 + config AMPERE_ERRATUM_AC03_CPU_38 418 + bool "AmpereOne: AC03_CPU_38: Certain bits in the Virtualization Translation Control Register and Translation Control Registers do not follow RES0 semantics" 419 + default y 420 + help 421 + This option adds an alternative code sequence to work around Ampere 422 + erratum AC03_CPU_38 on AmpereOne. 423 + 424 + The affected design reports FEAT_HAFDBS as not implemented in 425 + ID_AA64MMFR1_EL1.HAFDBS, but (V)TCR_ELx.{HA,HD} are not RES0 426 + as required by the architecture. The unadvertised HAFDBS 427 + implementation suffers from an additional erratum where hardware 428 + A/D updates can occur after a PTE has been marked invalid. 429 + 430 + The workaround forces KVM to explicitly set VTCR_EL2.HA to 0, 431 + which avoids enabling unadvertised hardware Access Flag management 432 + at stage-2. 433 + 434 + If unsure, say Y. 435 + 417 436 config ARM64_WORKAROUND_CLEAN_CACHE 418 437 bool 419 438
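The effect of the AC03_CPU_38 workaround described in the help text can be illustrated with a small bit-manipulation sketch. It is not the code from this merge: the function name is hypothetical, and the position of the HA field (bit 21 of VTCR_EL2) is stated here as an assumption taken from the architecture's register layout rather than from the diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit position of VTCR_EL2.HA (hardware Access Flag management). */
#define VTCR_EL2_HA_SKETCH	(UINT64_C(1) << 21)

/*
 * Hypothetical helper: given a candidate VTCR_EL2 value, make sure HA is
 * clear on affected parts so stage-2 hardware Access Flag updates stay off.
 */
static uint64_t vtcr_apply_ac03_cpu_38(uint64_t vtcr, bool erratum_present)
{
	if (erratum_present)
		vtcr &= ~VTCR_EL2_HA_SKETCH;

	return vtcr;
}
```

On unaffected CPUs the value passes through unchanged, which matches the help text's framing of the workaround as something applied only where the erratum is detected.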

+6

arch/arm64/include/asm/cpufeature.h

··· 15 15 #define MAX_CPU_FEATURES 128 16 16 #define cpu_feature(x) KERNEL_HWCAP_ ## x 17 17 18 + #define ARM64_SW_FEATURE_OVERRIDE_NOKASLR 0 19 + #define ARM64_SW_FEATURE_OVERRIDE_HVHE 4 20 + 18 21 #ifndef __ASSEMBLY__ 19 22 20 23 #include <linux/bug.h> ··· 908 905 return 8; 909 906 } 910 907 908 + s64 arm64_ftr_safe_value(const struct arm64_ftr_bits *ftrp, s64 new, s64 cur); 911 909 struct arm64_ftr_reg *get_arm64_ftr_reg(u32 sys_id); 912 910 913 911 extern struct arm64_ftr_override id_aa64mmfr1_override; ··· 918 914 extern struct arm64_ftr_override id_aa64smfr0_override; 919 915 extern struct arm64_ftr_override id_aa64isar1_override; 920 916 extern struct arm64_ftr_override id_aa64isar2_override; 917 + 918 + extern struct arm64_ftr_override arm64_sw_feature_override; 921 919 922 920 u32 get_kvm_ipa_limit(void); 923 921 void dump_cpu_features(void);

+24 -3

arch/arm64/include/asm/el2_setup.h

··· 43 43 */ 44 44 .macro __init_el2_timers 45 45 mov x0, #3 // Enable EL1 physical timers 46 + mrs x1, hcr_el2 47 + and x1, x1, #HCR_E2H 48 + cbz x1, .LnVHE_\@ 49 + lsl x0, x0, #10 50 + .LnVHE_\@: 46 51 msr cnthctl_el2, x0 47 52 msr cntvoff_el2, xzr // Clear virtual offset 48 53 .endm ··· 138 133 .endm 139 134 140 135 /* Coprocessor traps */ 141 - .macro __init_el2_nvhe_cptr 136 + .macro __init_el2_cptr 137 + mrs x1, hcr_el2 138 + and x1, x1, #HCR_E2H 139 + cbz x1, .LnVHE_\@ 140 + mov x0, #(CPACR_EL1_FPEN_EL1EN | CPACR_EL1_FPEN_EL0EN) 141 + b .Lset_cptr_\@ 142 + .LnVHE_\@: 142 143 mov x0, #0x33ff 144 + .Lset_cptr_\@: 143 145 msr cptr_el2, x0 // Disable copro. traps to EL2 144 146 .endm 145 147 ··· 222 210 __init_el2_gicv3 223 211 __init_el2_hstr 224 212 __init_el2_nvhe_idregs 225 - __init_el2_nvhe_cptr 213 + __init_el2_cptr 226 214 __init_el2_fgt 227 - __init_el2_nvhe_prepare_eret 228 215 .endm 229 216 230 217 #ifndef __KVM_NVHE_HYPERVISOR__ ··· 269 258 270 259 .Linit_sve_\@: /* SVE register access */ 271 260 mrs x0, cptr_el2 // Disable SVE traps 261 + mrs x1, hcr_el2 262 + and x1, x1, #HCR_E2H 263 + cbz x1, .Lcptr_nvhe_\@ 264 + 265 + // VHE case 266 + orr x0, x0, #(CPACR_EL1_ZEN_EL1EN | CPACR_EL1_ZEN_EL0EN) 267 + b .Lset_cptr_\@ 268 + 269 + .Lcptr_nvhe_\@: // nVHE case 272 270 bic x0, x0, #CPTR_EL2_TZ 271 + .Lset_cptr_\@: 273 272 msr cptr_el2, x0 274 273 isb 275 274 mov x1, #ZCR_ELx_LEN_MASK // SVE: Enable full vector
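For readers less used to arm64 assembly, the new `__init_el2_timers` sequence in the hunk above can be restated in C. This is only a reading aid under the assumption that the macro does exactly what the diff shows (start from 3, shift left by 10 when HCR_EL2.E2H is set); the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical C restatement of the value written to CNTHCTL_EL2:
 * the two "enable EL1 physical timers" bits are bits [1:0] when
 * HCR_EL2.E2H is clear and are shifted up by 10 when it is set.
 */
static uint64_t el2_timer_cnthctl(bool hcr_e2h)
{
	uint64_t val = 3;	/* mov x0, #3 */

	if (hcr_e2h)
		val <<= 10;	/* lsl x0, x0, #10 */

	return val;
}
```

The same branch-on-E2H pattern appears in the `__init_el2_cptr` and SVE hunks, where the register layout also differs between the two modes.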

+3 -4

arch/arm64/include/asm/kvm_arm.h

··· 19 19 #define HCR_ATA_SHIFT 56 20 20 #define HCR_ATA (UL(1) << HCR_ATA_SHIFT) 21 21 #define HCR_AMVOFFEN (UL(1) << 51) 22 + #define HCR_TID4 (UL(1) << 49) 22 23 #define HCR_FIEN (UL(1) << 47) 23 24 #define HCR_FWB (UL(1) << 46) 24 25 #define HCR_API (UL(1) << 41) ··· 88 87 #define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \ 89 88 HCR_BSU_IS | HCR_FB | HCR_TACR | \ 90 89 HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW | HCR_TLOR | \ 91 - HCR_FMO | HCR_IMO | HCR_PTW | HCR_TID3 | HCR_TID2) 90 + HCR_FMO | HCR_IMO | HCR_PTW | HCR_TID3) 92 91 #define HCR_VIRT_EXCP_MASK (HCR_VSE | HCR_VI | HCR_VF) 93 92 #define HCR_HOST_NVHE_FLAGS (HCR_RW | HCR_API | HCR_APK | HCR_ATA) 94 93 #define HCR_HOST_NVHE_PROTECTED_FLAGS (HCR_HOST_NVHE_FLAGS | HCR_TSC) ··· 290 289 #define CPTR_EL2_TFP (1 << CPTR_EL2_TFP_SHIFT) 291 290 #define CPTR_EL2_TZ (1 << 8) 292 291 #define CPTR_NVHE_EL2_RES1 0x000032ff /* known RES1 bits in CPTR_EL2 (nVHE) */ 293 - #define CPTR_EL2_DEFAULT CPTR_NVHE_EL2_RES1 294 292 #define CPTR_NVHE_EL2_RES0 (GENMASK(63, 32) | \ 295 293 GENMASK(29, 21) | \ 296 294 GENMASK(19, 14) | \ ··· 351 351 ECN(SOFTSTP_CUR), ECN(WATCHPT_LOW), ECN(WATCHPT_CUR), \ 352 352 ECN(BKPT32), ECN(VECTOR32), ECN(BRK64), ECN(ERET) 353 353 354 - #define CPACR_EL1_DEFAULT (CPACR_EL1_FPEN_EL0EN | CPACR_EL1_FPEN_EL1EN |\ 355 - CPACR_EL1_ZEN_EL1EN) 354 + #define CPACR_EL1_TTA (1 << 28) 356 355 357 356 #define kvm_mode_names \ 358 357 { PSR_MODE_EL0t, "EL0t" }, \

+4

arch/arm64/include/asm/kvm_asm.h

··· 68 68 __KVM_HOST_SMCCC_FUNC___kvm_vcpu_run, 69 69 __KVM_HOST_SMCCC_FUNC___kvm_flush_vm_context, 70 70 __KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_ipa, 71 + __KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_ipa_nsh, 71 72 __KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid, 72 73 __KVM_HOST_SMCCC_FUNC___kvm_flush_cpu_context, 73 74 __KVM_HOST_SMCCC_FUNC___kvm_timer_set_cntvoff, ··· 226 225 extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu); 227 226 extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa, 228 227 int level); 228 + extern void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu, 229 + phys_addr_t ipa, 230 + int level); 229 231 extern void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu); 230 232 231 233 extern void __kvm_timer_set_cntvoff(u64 cntvoff);

+39 -7

arch/arm64/include/asm/kvm_emulate.h

··· 62 62 #else 63 63 static __always_inline bool vcpu_el1_is_32bit(struct kvm_vcpu *vcpu) 64 64 { 65 - struct kvm *kvm = vcpu->kvm; 66 - 67 - WARN_ON_ONCE(!test_bit(KVM_ARCH_FLAG_REG_WIDTH_CONFIGURED, 68 - &kvm->arch.flags)); 69 - 70 - return test_bit(KVM_ARCH_FLAG_EL1_32BIT, &kvm->arch.flags); 65 + return test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features); 71 66 } 72 67 #endif 73 68 74 69 static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu) 75 70 { 76 71 vcpu->arch.hcr_el2 = HCR_GUEST_FLAGS; 77 - if (is_kernel_in_hyp_mode()) 72 + if (has_vhe() || has_hvhe()) 78 73 vcpu->arch.hcr_el2 |= HCR_E2H; 79 74 if (cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) { 80 75 /* route synchronous external abort exceptions to EL2 */ ··· 89 94 */ 90 95 vcpu->arch.hcr_el2 |= HCR_TVM; 91 96 } 97 + 98 + if (cpus_have_final_cap(ARM64_HAS_EVT) && 99 + !cpus_have_final_cap(ARM64_MISMATCHED_CACHE_TYPE)) 100 + vcpu->arch.hcr_el2 |= HCR_TID4; 101 + else 102 + vcpu->arch.hcr_el2 |= HCR_TID2; 92 103 93 104 if (vcpu_el1_is_32bit(vcpu)) 94 105 vcpu->arch.hcr_el2 &= ~HCR_RW; ··· 571 570 return test_bit(feature, vcpu->arch.features); 572 571 } 573 572 573 + static __always_inline u64 kvm_get_reset_cptr_el2(struct kvm_vcpu *vcpu) 574 + { 575 + u64 val; 576 + 577 + if (has_vhe()) { 578 + val = (CPACR_EL1_FPEN_EL0EN | CPACR_EL1_FPEN_EL1EN | 579 + CPACR_EL1_ZEN_EL1EN); 580 + } else if (has_hvhe()) { 581 + val = (CPACR_EL1_FPEN_EL0EN | CPACR_EL1_FPEN_EL1EN); 582 + } else { 583 + val = CPTR_NVHE_EL2_RES1; 584 + 585 + if (vcpu_has_sve(vcpu) && 586 + (vcpu->arch.fp_state == FP_STATE_GUEST_OWNED)) 587 + val |= CPTR_EL2_TZ; 588 + if (cpus_have_final_cap(ARM64_SME)) 589 + val &= ~CPTR_EL2_TSM; 590 + } 591 + 592 + return val; 593 + } 594 + 595 + static __always_inline void kvm_reset_cptr_el2(struct kvm_vcpu *vcpu) 596 + { 597 + u64 val = kvm_get_reset_cptr_el2(vcpu); 598 + 599 + if (has_vhe() || has_hvhe()) 600 + write_sysreg(val, cpacr_el1); 601 + else 602 + write_sysreg(val, cptr_el2); 603 + } 574 604 
#endif /* __ARM64_KVM_EMULATE_H__ */

+41 -20

arch/arm64/include/asm/kvm_host.h

··· 39 39 #define KVM_MAX_VCPUS VGIC_V3_MAX_CPUS 40 40 41 41 #define KVM_VCPU_MAX_FEATURES 7 42 + #define KVM_VCPU_VALID_FEATURES (BIT(KVM_VCPU_MAX_FEATURES) - 1) 42 43 43 44 #define KVM_REQ_SLEEP \ 44 45 KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) ··· 160 159 /* The last vcpu id that ran on each physical CPU */ 161 160 int __percpu *last_vcpu_ran; 162 161 162 + #define KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT 0 163 + /* 164 + * Memory cache used to split 165 + * KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE worth of huge pages. It 166 + * is used to allocate stage2 page tables while splitting huge 167 + * pages. The choice of KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 168 + * influences both the capacity of the split page cache, and 169 + * how often KVM reschedules. Be wary of raising CHUNK_SIZE 170 + * too high. 171 + * 172 + * Protected by kvm->slots_lock. 173 + */ 174 + struct kvm_mmu_memory_cache split_page_cache; 175 + uint64_t split_page_chunk_size; 176 + 163 177 struct kvm_arch *arch; 164 178 }; 165 179 ··· 230 214 #define KVM_ARCH_FLAG_MTE_ENABLED 1 231 215 /* At least one vCPU has ran in the VM */ 232 216 #define KVM_ARCH_FLAG_HAS_RAN_ONCE 2 233 - /* 234 - * The following two bits are used to indicate the guest's EL1 235 - * register width configuration. A value of KVM_ARCH_FLAG_EL1_32BIT 236 - * bit is valid only when KVM_ARCH_FLAG_REG_WIDTH_CONFIGURED is set. 237 - * Otherwise, the guest's EL1 register width has not yet been 238 - * determined yet. 
239 - */ 240 - #define KVM_ARCH_FLAG_REG_WIDTH_CONFIGURED 3 241 - #define KVM_ARCH_FLAG_EL1_32BIT 4 217 + /* The vCPU feature set for the VM is configured */ 218 + #define KVM_ARCH_FLAG_VCPU_FEATURES_CONFIGURED 3 242 219 /* PSCI SYSTEM_SUSPEND enabled for the guest */ 243 - #define KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED 5 220 + #define KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED 4 244 221 /* VM counter offset */ 245 - #define KVM_ARCH_FLAG_VM_COUNTER_OFFSET 6 222 + #define KVM_ARCH_FLAG_VM_COUNTER_OFFSET 5 246 223 /* Timer PPIs made immutable */ 247 - #define KVM_ARCH_FLAG_TIMER_PPIS_IMMUTABLE 7 224 + #define KVM_ARCH_FLAG_TIMER_PPIS_IMMUTABLE 6 248 225 /* SMCCC filter initialized for the VM */ 249 - #define KVM_ARCH_FLAG_SMCCC_FILTER_CONFIGURED 8 226 + #define KVM_ARCH_FLAG_SMCCC_FILTER_CONFIGURED 7 227 + /* Initial ID reg values loaded */ 228 + #define KVM_ARCH_FLAG_ID_REGS_INITIALIZED 8 250 229 unsigned long flags; 230 + 231 + /* VM-wide vCPU feature set */ 232 + DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES); 251 233 252 234 /* 253 235 * VM-wide PMU filter, implemented as a bitmap and big enough for ··· 256 242 257 243 cpumask_var_t supported_cpus; 258 244 259 - u8 pfr0_csv2; 260 - u8 pfr0_csv3; 261 - struct { 262 - u8 imp:4; 263 - u8 unimp:4; 264 - } dfr0_pmuver; 265 - 266 245 /* Hypercall features firmware registers' descriptor */ 267 246 struct kvm_smccc_features smccc_feat; 268 247 struct maple_tree smccc_filter; 248 + 249 + /* 250 + * Emulated CPU ID registers per VM 251 + * (Op0, Op1, CRn, CRm, Op2) of the ID registers to be saved in it 252 + * is (3, 0, 0, crm, op2), where 1<=crm<8, 0<=op2<8. 253 + * 254 + * These emulated idregs are VM-wide, but accessed from the context of a vCPU. 255 + * Atomic access to multiple idregs are guarded by kvm_arch.config_lock. 
256 + */ 257 + #define IDREG_IDX(id) (((sys_reg_CRm(id) - 1) << 3) | sys_reg_Op2(id)) 258 + #define IDREG(kvm, id) ((kvm)->arch.id_regs[IDREG_IDX(id)]) 259 + #define KVM_ARM_ID_REG_NUM (IDREG_IDX(sys_reg(3, 0, 0, 7, 7)) + 1) 260 + u64 id_regs[KVM_ARM_ID_REG_NUM]; 269 261 270 262 /* 271 263 * For an untrusted host VM, 'pkvm.handle' is used to lookup ··· 430 410 struct kvm_host_psci_config { 431 411 /* PSCI version used by host. */ 432 412 u32 version; 413 + u32 smccc_version; 433 414 434 415 /* Function IDs used by host if version is v0.1. */ 435 416 struct psci_0_1_function_ids function_ids_0_1;
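
The IDREG_IDX() macro in the hunk above packs the (CRm, Op2) pair of an ID register into a flat array index. A minimal standalone sketch of that arithmetic follows, assuming CRm and Op2 are passed in directly rather than decoded from a sys_reg() encoding. The names idreg_idx, ID_REG_NUM and idreg_selfcheck are illustrative and are not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative stand-in for IDREG_IDX(): valid inputs are
 * 1 <= crm < 8 and 0 <= op2 < 8, as stated in the comment
 * that accompanies the macro in the diff.
 */
static inline uint32_t idreg_idx(uint32_t crm, uint32_t op2)
{
	return ((crm - 1) << 3) | op2;
}

/* Mirrors KVM_ARM_ID_REG_NUM: index of (CRm = 7, Op2 = 7), plus one. */
#define ID_REG_NUM ((((7u - 1u) << 3) | 7u) + 1u)

/* Usage: the first and last slots of the emulated ID register array. */
static inline void idreg_selfcheck(void)
{
	assert(idreg_idx(1, 0) == 0);              /* first slot */
	assert(idreg_idx(7, 7) == ID_REG_NUM - 1); /* last slot  */
}
```

Subtracting one from CRm keeps the array dense, since CRm == 0 is outside the range the comment describes.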

+28 -9

arch/arm64/include/asm/kvm_hyp.h

··· 16 16 DECLARE_PER_CPU(unsigned long, kvm_hyp_vector); 17 17 DECLARE_PER_CPU(struct kvm_nvhe_init_params, kvm_init_params); 18 18 19 + /* 20 + * Unified accessors for registers that have a different encoding 21 + * between VHE and non-VHE. They must be specified without their "ELx" 22 + * encoding, but with the SYS_ prefix, as defined in asm/sysreg.h. 23 + */ 24 + 25 + #if defined(__KVM_VHE_HYPERVISOR__) 26 + 27 + #define read_sysreg_el0(r) read_sysreg_s(r##_EL02) 28 + #define write_sysreg_el0(v,r) write_sysreg_s(v, r##_EL02) 29 + #define read_sysreg_el1(r) read_sysreg_s(r##_EL12) 30 + #define write_sysreg_el1(v,r) write_sysreg_s(v, r##_EL12) 31 + #define read_sysreg_el2(r) read_sysreg_s(r##_EL1) 32 + #define write_sysreg_el2(v,r) write_sysreg_s(v, r##_EL1) 33 + 34 + #else // !__KVM_VHE_HYPERVISOR__ 35 + 36 + #if defined(__KVM_NVHE_HYPERVISOR__) 37 + #define VHE_ALT_KEY ARM64_KVM_HVHE 38 + #else 39 + #define VHE_ALT_KEY ARM64_HAS_VIRT_HOST_EXTN 40 + #endif 41 + 19 42 #define read_sysreg_elx(r,nvh,vh) \ 20 43 ({ \ 21 44 u64 reg; \ 22 - asm volatile(ALTERNATIVE(__mrs_s("%0", r##nvh), \ 45 + asm volatile(ALTERNATIVE(__mrs_s("%0", r##nvh), \ 23 46 __mrs_s("%0", r##vh), \ 24 - ARM64_HAS_VIRT_HOST_EXTN) \ 47 + VHE_ALT_KEY) \ 25 48 : "=r" (reg)); \ 26 49 reg; \ 27 50 }) ··· 54 31 u64 __val = (u64)(v); \ 55 32 asm volatile(ALTERNATIVE(__msr_s(r##nvh, "%x0"), \ 56 33 __msr_s(r##vh, "%x0"), \ 57 - ARM64_HAS_VIRT_HOST_EXTN) \ 34 + VHE_ALT_KEY) \ 58 35 : : "rZ" (__val)); \ 59 36 } while (0) 60 - 61 - /* 62 - * Unified accessors for registers that have a different encoding 63 - * between VHE and non-VHE. They must be specified without their "ELx" 64 - * encoding, but with the SYS_ prefix, as defined in asm/sysreg.h. 
65 - */ 66 37 67 38 #define read_sysreg_el0(r) read_sysreg_elx(r, _EL0, _EL02) 68 39 #define write_sysreg_el0(v,r) write_sysreg_elx(v, r, _EL0, _EL02) ··· 64 47 #define write_sysreg_el1(v,r) write_sysreg_elx(v, r, _EL1, _EL12) 65 48 #define read_sysreg_el2(r) read_sysreg_elx(r, _EL2, _EL1) 66 49 #define write_sysreg_el2(v,r) write_sysreg_elx(v, r, _EL2, _EL1) 50 + 51 + #endif // __KVM_VHE_HYPERVISOR__ 67 52 68 53 /* 69 54 * Without an __arch_swab32(), we fall back to ___constant_swab32(), but the

+3 -1

arch/arm64/include/asm/kvm_mmu.h

··· 172 172 173 173 void stage2_unmap_vm(struct kvm *kvm); 174 174 int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long type); 175 + void kvm_uninit_stage2_mmu(struct kvm *kvm); 175 176 void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu); 176 177 int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa, 177 178 phys_addr_t pa, unsigned long size, bool writable); ··· 228 227 if (icache_is_aliasing()) { 229 228 /* any kind of VIPT cache */ 230 229 icache_inval_all_pou(); 231 - } else if (is_kernel_in_hyp_mode() || !icache_is_vpipt()) { 230 + } else if (read_sysreg(CurrentEL) != CurrentEL_EL1 || 231 + !icache_is_vpipt()) { 232 232 /* PIPT or VPIPT at EL2 (see comment in __kvm_tlb_flush_vmid_ipa) */ 233 233 icache_inval_pou((unsigned long)va, (unsigned long)va + size); 234 234 }

+75 -4

arch/arm64/include/asm/kvm_pgtable.h

··· 92 92 return level >= KVM_PGTABLE_MIN_BLOCK_LEVEL; 93 93 } 94 94 95 + static inline u32 kvm_supported_block_sizes(void) 96 + { 97 + u32 level = KVM_PGTABLE_MIN_BLOCK_LEVEL; 98 + u32 r = 0; 99 + 100 + for (; level < KVM_PGTABLE_MAX_LEVELS; level++) 101 + r |= BIT(kvm_granule_shift(level)); 102 + 103 + return r; 104 + } 105 + 106 + static inline bool kvm_is_block_size_supported(u64 size) 107 + { 108 + bool is_power_of_two = IS_ALIGNED(size, size); 109 + 110 + return is_power_of_two && (size & kvm_supported_block_sizes()); 111 + } 112 + 95 113 /** 96 114 * struct kvm_pgtable_mm_ops - Memory management callbacks. 97 115 * @zalloc_page: Allocate a single zeroed memory page. ··· 122 104 * allocation is physically contiguous. 123 105 * @free_pages_exact: Free an exact number of memory pages previously 124 106 * allocated by zalloc_pages_exact. 125 - * @free_removed_table: Free a removed paging structure by unlinking and 107 + * @free_unlinked_table: Free an unlinked paging structure by unlinking and 126 108 * dropping references. 127 109 * @get_page: Increment the refcount on a page. 128 110 * @put_page: Decrement the refcount on a page. When the ··· 142 124 void* (*zalloc_page)(void *arg); 143 125 void* (*zalloc_pages_exact)(size_t size); 144 126 void (*free_pages_exact)(void *addr, size_t size); 145 - void (*free_removed_table)(void *addr, u32 level); 127 + void (*free_unlinked_table)(void *addr, u32 level); 146 128 void (*get_page)(void *addr); 147 129 void (*put_page)(void *addr); 148 130 int (*page_count)(void *addr); ··· 213 195 * with other software walkers. 214 196 * @KVM_PGTABLE_WALK_HANDLE_FAULT: Indicates the page-table walk was 215 197 * invoked from a fault handler. 198 + * @KVM_PGTABLE_WALK_SKIP_BBM_TLBI: Visit and update table entries 199 + * without Break-before-make's 200 + * TLB invalidation. 201 + * @KVM_PGTABLE_WALK_SKIP_CMO: Visit and update table entries 202 + * without Cache maintenance 203 + * operations required. 
216 204 */ 217 205 enum kvm_pgtable_walk_flags { 218 206 KVM_PGTABLE_WALK_LEAF = BIT(0), ··· 226 202 KVM_PGTABLE_WALK_TABLE_POST = BIT(2), 227 203 KVM_PGTABLE_WALK_SHARED = BIT(3), 228 204 KVM_PGTABLE_WALK_HANDLE_FAULT = BIT(4), 205 + KVM_PGTABLE_WALK_SKIP_BBM_TLBI = BIT(5), 206 + KVM_PGTABLE_WALK_SKIP_CMO = BIT(6), 229 207 }; 230 208 231 209 struct kvm_pgtable_visit_ctx { ··· 467 441 void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt); 468 442 469 443 /** 470 - * kvm_pgtable_stage2_free_removed() - Free a removed stage-2 paging structure. 444 + * kvm_pgtable_stage2_free_unlinked() - Free an unlinked stage-2 paging structure. 471 445 * @mm_ops: Memory management callbacks. 472 446 * @pgtable: Unlinked stage-2 paging structure to be freed. 473 447 * @level: Level of the stage-2 paging structure to be freed. ··· 475 449 * The page-table is assumed to be unreachable by any hardware walkers prior to 476 450 * freeing and therefore no TLB invalidation is performed. 477 451 */ 478 - void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level); 452 + void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level); 453 + 454 + /** 455 + * kvm_pgtable_stage2_create_unlinked() - Create an unlinked stage-2 paging structure. 456 + * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*(). 457 + * @phys: Physical address of the memory to map. 458 + * @level: Starting level of the stage-2 paging structure to be created. 459 + * @prot: Permissions and attributes for the mapping. 460 + * @mc: Cache of pre-allocated and zeroed memory from which to allocate 461 + * page-table pages. 462 + * @force_pte: Force mappings to PAGE_SIZE granularity. 463 + * 464 + * Returns an unlinked page-table tree. This new page-table tree is 465 + * not reachable (i.e., it is unlinked) from the root pgd and it's 466 + * therefore unreachableby the hardware page-table walker. 
No TLB 467 + * invalidation or CMOs are performed. 468 + * 469 + * If device attributes are not explicitly requested in @prot, then the 470 + * mapping will be normal, cacheable. 471 + * 472 + * Return: The fully populated (unlinked) stage-2 paging structure, or 473 + * an ERR_PTR(error) on failure. 474 + */ 475 + kvm_pte_t *kvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt, 476 + u64 phys, u32 level, 477 + enum kvm_pgtable_prot prot, 478 + void *mc, bool force_pte); 479 479 480 480 /** 481 481 * kvm_pgtable_stage2_map() - Install a mapping in a guest stage-2 page-table. ··· 671 619 * Return: 0 on success, negative error code on failure. 672 620 */ 673 621 int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size); 622 + 623 + /** 624 + * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing 625 + * to PAGE_SIZE guest pages. 626 + * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init(). 627 + * @addr: Intermediate physical address from which to split. 628 + * @size: Size of the range. 629 + * @mc: Cache of pre-allocated and zeroed memory from which to allocate 630 + * page-table pages. 631 + * 632 + * The function tries to split any level 1 or 2 entry that overlaps 633 + * with the input range (given by @addr and @size). 634 + * 635 + * Return: 0 on success, negative error code on failure. Note that 636 + * kvm_pgtable_stage2_split() is best effort: it tries to break as many 637 + * blocks in the input range as allowed by @mc_capacity. 638 + */ 639 + int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size, 640 + struct kvm_mmu_memory_cache *mc); 674 641 675 642 /** 676 643 * kvm_pgtable_walk() - Walk a page-table.
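
The two helpers added at the top of this hunk, kvm_supported_block_sizes() and kvm_is_block_size_supported(), can be sketched outside the kernel as below. This is a sketch under stated assumptions: a 4 KiB granule, four page-table levels, and block mappings permitted from level 1, which is my understanding of the 4K-page configuration. Other granules would give a different bitmap. All names here are local to the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed configuration: 4 KiB pages, levels 0..3, blocks from level 1. */
#define EX_PAGE_SHIFT      12u
#define EX_MAX_LEVELS      4u
#define EX_MIN_BLOCK_LEVEL 1u

/* Shift of the address range covered by one entry at the given level. */
static inline uint32_t ex_granule_shift(uint32_t level)
{
	return (EX_PAGE_SHIFT - 3u) * (EX_MAX_LEVELS - level) + 3u;
}

/* Bitmap with one bit set per mappable size (level 1, 2 and 3 here). */
static inline uint32_t ex_supported_block_sizes(void)
{
	uint32_t r = 0;

	for (uint32_t level = EX_MIN_BLOCK_LEVEL; level < EX_MAX_LEVELS; level++)
		r |= UINT32_C(1) << ex_granule_shift(level);

	return r;
}

static inline bool ex_is_block_size_supported(uint64_t size)
{
	/* Same test as IS_ALIGNED(size, size): holds for 0 and powers of two. */
	bool is_power_of_two = (size & (size - 1)) == 0;

	/* Zero passes the test above but is rejected by the bitmap check. */
	return is_power_of_two && (size & ex_supported_block_sizes());
}

/* Usage: 2 MiB is a level-2 block under the assumed configuration. */
static inline void ex_block_size_selfcheck(void)
{
	assert(ex_is_block_size_supported(UINT64_C(1) << 21));
	assert(!ex_is_block_size_supported(0));
}
```

Because the size is first required to be a power of two, ANDing it with the bitmap tests exactly one bit, so the bitmap doubles as the set of valid values for the chunk size.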

+21

arch/arm64/include/asm/kvm_pkvm.h

··· 6 6 #ifndef __ARM64_KVM_PKVM_H__ 7 7 #define __ARM64_KVM_PKVM_H__ 8 8 9 + #include <linux/arm_ffa.h> 9 10 #include <linux/memblock.h> 11 + #include <linux/scatterlist.h> 10 12 #include <asm/kvm_pgtable.h> 11 13 12 14 /* Maximum number of VMs that can co-exist under pKVM. */ ··· 106 104 res += __hyp_pgtable_max_pages(SZ_1G >> PAGE_SHIFT); 107 105 108 106 return res; 107 + } 108 + 109 + #define KVM_FFA_MBOX_NR_PAGES 1 110 + 111 + static inline unsigned long hyp_ffa_proxy_pages(void) 112 + { 113 + size_t desc_max; 114 + 115 + /* 116 + * The hypervisor FFA proxy needs enough memory to buffer a fragmented 117 + * descriptor returned from EL3 in response to a RETRIEVE_REQ call. 118 + */ 119 + desc_max = sizeof(struct ffa_mem_region) + 120 + sizeof(struct ffa_mem_region_attributes) + 121 + sizeof(struct ffa_composite_mem_region) + 122 + SG_MAX_SEGMENTS * sizeof(struct ffa_mem_region_addr_range); 123 + 124 + /* Plus a page each for the hypervisor's RX and TX mailboxes. */ 125 + return (2 * KVM_FFA_MBOX_NR_PAGES) + DIV_ROUND_UP(desc_max, PAGE_SIZE); 109 126 } 110 127 111 128 #endif /* __ARM64_KVM_PKVM_H__ */
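
The page count computed by hyp_ffa_proxy_pages() is two mailbox pages plus the descriptor buffer rounded up to whole pages. A rough sketch of that arithmetic follows, with the descriptor size and page size passed as plain parameters, because the real struct sizes and SG_MAX_SEGMENTS are not reproduced here. The function and macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define EX_FFA_MBOX_NR_PAGES 1 /* mirrors KVM_FFA_MBOX_NR_PAGES in the diff */

/*
 * desc_max:  worst-case size in bytes of a fragmented descriptor (assumed input)
 * page_size: size of one page in bytes (assumed input, must be non-zero)
 */
static inline size_t ex_ffa_proxy_pages(size_t desc_max, size_t page_size)
{
	/* One page each for the RX and TX mailboxes, then round up the buffer. */
	return (2 * EX_FFA_MBOX_NR_PAGES) +
	       (desc_max + page_size - 1) / page_size;
}

/* Usage: a descriptor one byte larger than a page needs two buffer pages. */
static inline void ex_ffa_selfcheck(void)
{
	assert(ex_ffa_proxy_pages(4097, 4096) == 4);
}
```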

+1

arch/arm64/include/asm/sysreg.h

··· 510 510 (BIT(18)) | (BIT(22)) | (BIT(23)) | (BIT(28)) | \ 511 511 (BIT(29))) 512 512 513 + #define SCTLR_EL2_BT (BIT(36)) 513 514 #ifdef CONFIG_CPU_BIG_ENDIAN 514 515 #define ENDIAN_SET_EL2 SCTLR_ELx_EE 515 516 #else

+11 -1

arch/arm64/include/asm/virt.h

··· 110 110 return __boot_cpu_mode[0] != __boot_cpu_mode[1]; 111 111 } 112 112 113 - static inline bool is_kernel_in_hyp_mode(void) 113 + static __always_inline bool is_kernel_in_hyp_mode(void) 114 114 { 115 + BUILD_BUG_ON(__is_defined(__KVM_NVHE_HYPERVISOR__) || 116 + __is_defined(__KVM_VHE_HYPERVISOR__)); 115 117 return read_sysreg(CurrentEL) == CurrentEL_EL2; 116 118 } 117 119 ··· 140 138 return false; 141 139 else 142 140 return cpus_have_final_cap(ARM64_KVM_PROTECTED_MODE); 141 + } 142 + 143 + static __always_inline bool has_hvhe(void) 144 + { 145 + if (is_vhe_hyp_code()) 146 + return false; 147 + 148 + return cpus_have_final_cap(ARM64_KVM_HVHE); 143 149 } 144 150 145 151 static inline bool is_hyp_nvhe(void)

+7

arch/arm64/kernel/cpu_errata.c

··· 730 730 .cpu_enable = cpu_clear_bf16_from_user_emulation, 731 731 }, 732 732 #endif 733 + #ifdef CONFIG_AMPERE_ERRATUM_AC03_CPU_38 734 + { 735 + .desc = "AmpereOne erratum AC03_CPU_38", 736 + .capability = ARM64_WORKAROUND_AMPERE_AC03_CPU_38, 737 + ERRATA_MIDR_ALL_VERSIONS(MIDR_AMPERE1), 738 + }, 739 + #endif 733 740 { 734 741 } 735 742 };

+33 -1

arch/arm64/kernel/cpufeature.c

··· 672 672 struct arm64_ftr_override __ro_after_init id_aa64isar1_override; 673 673 struct arm64_ftr_override __ro_after_init id_aa64isar2_override; 674 674 675 + struct arm64_ftr_override arm64_sw_feature_override; 676 + 675 677 static const struct __ftr_reg_entry { 676 678 u32 sys_id; 677 679 struct arm64_ftr_reg *reg; ··· 809 807 return reg; 810 808 } 811 809 812 - static s64 arm64_ftr_safe_value(const struct arm64_ftr_bits *ftrp, s64 new, 810 + s64 arm64_ftr_safe_value(const struct arm64_ftr_bits *ftrp, s64 new, 813 811 s64 cur) 814 812 { 815 813 s64 ret = 0; ··· 2011 2009 return true; 2012 2010 } 2013 2011 2012 + static bool hvhe_possible(const struct arm64_cpu_capabilities *entry, 2013 + int __unused) 2014 + { 2015 + u64 val; 2016 + 2017 + val = read_sysreg(id_aa64mmfr1_el1); 2018 + if (!cpuid_feature_extract_unsigned_field(val, ID_AA64MMFR1_EL1_VH_SHIFT)) 2019 + return false; 2020 + 2021 + val = arm64_sw_feature_override.val & arm64_sw_feature_override.mask; 2022 + return cpuid_feature_extract_unsigned_field(val, ARM64_SW_FEATURE_OVERRIDE_HVHE); 2023 + } 2024 + 2014 2025 #ifdef CONFIG_ARM64_PAN 2015 2026 static void cpu_enable_pan(const struct arm64_cpu_capabilities *__unused) 2016 2027 { ··· 2697 2682 .type = ARM64_CPUCAP_BOOT_CPU_FEATURE, 2698 2683 .matches = has_cpuid_feature, 2699 2684 ARM64_CPUID_FIELDS(ID_AA64MMFR3_EL1, S1PIE, IMP) 2685 + }, 2686 + { 2687 + .desc = "VHE for hypervisor only", 2688 + .capability = ARM64_KVM_HVHE, 2689 + .type = ARM64_CPUCAP_SYSTEM_FEATURE, 2690 + .matches = hvhe_possible, 2691 + }, 2692 + { 2693 + .desc = "Enhanced Virtualization Traps", 2694 + .capability = ARM64_HAS_EVT, 2695 + .type = ARM64_CPUCAP_SYSTEM_FEATURE, 2696 + .sys_reg = SYS_ID_AA64MMFR2_EL1, 2697 + .sign = FTR_UNSIGNED, 2698 + .field_pos = ID_AA64MMFR2_EL1_EVT_SHIFT, 2699 + .field_width = 4, 2700 + .min_field_value = ID_AA64MMFR2_EL1_EVT_IMP, 2701 + .matches = has_cpuid_feature, 2700 2702 }, 2701 2703 {}, 2702 2704 };

+2

arch/arm64/kernel/head.S

··· 603 603 msr sctlr_el1, x1 604 604 mov x2, xzr 605 605 2: 606 + __init_el2_nvhe_prepare_eret 607 + 606 608 mov w0, #BOOT_CPU_MODE_EL2 607 609 orr x0, x0, x2 608 610 eret

+9 -1

arch/arm64/kernel/hyp-stub.S

··· 82 82 tbnz x1, #0, 1f 83 83 84 84 // Needs to be VHE capable, obviously 85 - check_override id_aa64mmfr1 ID_AA64MMFR1_EL1_VH_SHIFT 2f 1f x1 x2 85 + check_override id_aa64mmfr1 ID_AA64MMFR1_EL1_VH_SHIFT 0f 1f x1 x2 86 + 87 + 0: // Check whether we only want the hypervisor to run VHE, not the kernel 88 + adr_l x1, arm64_sw_feature_override 89 + ldr x2, [x1, FTR_OVR_VAL_OFFSET] 90 + ldr x1, [x1, FTR_OVR_MASK_OFFSET] 91 + and x2, x2, x1 92 + ubfx x2, x2, #ARM64_SW_FEATURE_OVERRIDE_HVHE, #4 93 + cbz x2, 2f 86 94 87 95 1: mov_q x0, HVC_STUB_ERR 88 96 eret

+16 -9

arch/arm64/kernel/idreg-override.c

··· 139 139 }, 140 140 }; 141 141 142 - extern struct arm64_ftr_override kaslr_feature_override; 142 + static bool __init hvhe_filter(u64 val) 143 + { 144 + u64 mmfr1 = read_sysreg(id_aa64mmfr1_el1); 143 145 144 - static const struct ftr_set_desc kaslr __initconst = { 145 - .name = "kaslr", 146 - #ifdef CONFIG_RANDOMIZE_BASE 147 - .override = &kaslr_feature_override, 148 - #endif 146 + return (val == 1 && 147 + lower_32_bits(__boot_status) == BOOT_CPU_MODE_EL2 && 148 + cpuid_feature_extract_unsigned_field(mmfr1, 149 + ID_AA64MMFR1_EL1_VH_SHIFT)); 150 + } 151 + 152 + static const struct ftr_set_desc sw_features __initconst = { 153 + .name = "arm64_sw", 154 + .override = &arm64_sw_feature_override, 149 155 .fields = { 150 - FIELD("disabled", 0, NULL), 156 + FIELD("nokaslr", ARM64_SW_FEATURE_OVERRIDE_NOKASLR, NULL), 157 + FIELD("hvhe", ARM64_SW_FEATURE_OVERRIDE_HVHE, hvhe_filter), 151 158 {} 152 159 }, 153 160 }; ··· 166 159 &isar1, 167 160 &isar2, 168 161 &smfr0, 169 - &kaslr, 162 + &sw_features, 170 163 }; 171 164 172 165 static const struct { ··· 184 177 "id_aa64isar2.gpa3=0 id_aa64isar2.apa3=0" }, 185 178 { "arm64.nomops", "id_aa64isar2.mops=0" }, 186 179 { "arm64.nomte", "id_aa64pfr1.mte=0" }, 187 - { "nokaslr", "kaslr.disabled=1" }, 180 + { "nokaslr", "arm64_sw.nokaslr=1" }, 188 181 }; 189 182 190 183 static int __init parse_nokaslr(char *unused)

+3 -3

arch/arm64/kernel/kaslr.c

··· 12 12 13 13 u16 __initdata memstart_offset_seed; 14 14 15 - struct arm64_ftr_override kaslr_feature_override __initdata; 16 - 17 15 bool __ro_after_init __kaslr_is_enabled = false; 18 16 19 17 void __init kaslr_init(void) 20 18 { 21 - if (kaslr_feature_override.val & kaslr_feature_override.mask & 0xf) { 19 + if (cpuid_feature_extract_unsigned_field(arm64_sw_feature_override.val & 20 + arm64_sw_feature_override.mask, 21 + ARM64_SW_FEATURE_OVERRIDE_NOKASLR)) { 22 22 pr_info("KASLR disabled on command line\n"); 23 23 return; 24 24 }
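
Both hvhe_possible() and the reworked kaslr_init() read a 4-bit software feature out of arm64_sw_feature_override by masking val with mask and then extracting a field. The sketch below reproduces that pattern in isolation. The field positions (0 for nokaslr, 4 for hvhe) are assumptions for illustration, and the struct and helper names are local to the example rather than the kernel's.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct arm64_ftr_override. */
struct ex_ftr_override {
	uint64_t val;  /* requested field values               */
	uint64_t mask; /* which fields were actually overridden */
};

/* Assumed field positions; each field is 4 bits wide. */
#define EX_SW_FEATURE_NOKASLR_SHIFT 0u
#define EX_SW_FEATURE_HVHE_SHIFT    4u

/* Extract an unsigned 4-bit field starting at the given bit position. */
static inline unsigned int ex_extract_field(uint64_t reg, unsigned int shift)
{
	return (unsigned int)((reg >> shift) & 0xf);
}

/* A field only counts if its mask bits are set, hence val & mask first. */
static inline unsigned int ex_sw_feature(const struct ex_ftr_override *o,
					 unsigned int shift)
{
	return ex_extract_field(o->val & o->mask, shift);
}

/* Usage: "arm64_sw.hvhe=1" would set field 4..7 to 1 with a full mask. */
static inline void ex_sw_feature_selfcheck(void)
{
	struct ex_ftr_override o = { .val = 0x10, .mask = 0xf0 };

	assert(ex_sw_feature(&o, EX_SW_FEATURE_HVHE_SHIFT) == 1);
	assert(ex_sw_feature(&o, EX_SW_FEATURE_NOKASLR_SHIFT) == 0);
}
```

Masking before extraction means a stale value with no corresponding mask bits reads as zero, which is why the old open-coded `& 0xf` in kaslr_init() could be replaced by the shared helper.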

+9 -5

arch/arm64/kvm/arch_timer.c

··· 1406 1406 kvm_get_running_vcpus()); 1407 1407 if (err) { 1408 1408 kvm_err("kvm_arch_timer: error setting vcpu affinity\n"); 1409 - goto out_free_irq; 1409 + goto out_free_vtimer_irq; 1410 1410 } 1411 1411 1412 1412 static_branch_enable(&has_gic_active_state); ··· 1422 1422 if (err) { 1423 1423 kvm_err("kvm_arch_timer: can't request ptimer interrupt %d (%d)\n", 1424 1424 host_ptimer_irq, err); 1425 - return err; 1425 + goto out_free_vtimer_irq; 1426 1426 } 1427 1427 1428 1428 if (has_gic) { ··· 1430 1430 kvm_get_running_vcpus()); 1431 1431 if (err) { 1432 1432 kvm_err("kvm_arch_timer: error setting vcpu affinity\n"); 1433 - goto out_free_irq; 1433 + goto out_free_ptimer_irq; 1434 1434 } 1435 1435 } 1436 1436 ··· 1439 1439 kvm_err("kvm_arch_timer: invalid physical timer IRQ: %d\n", 1440 1440 info->physical_irq); 1441 1441 err = -ENODEV; 1442 - goto out_free_irq; 1442 + goto out_free_vtimer_irq; 1443 1443 } 1444 1444 1445 1445 return 0; 1446 - out_free_irq: 1446 + 1447 + out_free_ptimer_irq: 1448 + if (info->physical_irq > 0) 1449 + free_percpu_irq(host_ptimer_irq, kvm_get_running_vcpus()); 1450 + out_free_vtimer_irq: 1447 1451 free_percpu_irq(host_vtimer_irq, kvm_get_running_vcpus()); 1448 1452 return err; 1449 1453 }
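
The arch_timer.c change splits the single out_free_irq label into two labels so that each failure path releases exactly what has been acquired so far. The pattern can be sketched as follows, with boolean flags standing in for the per-CPU IRQ requests. The fail_step parameter and all names are hypothetical and exist only to drive the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Flags standing in for the requested vtimer and ptimer IRQs. */
struct ex_timer_irqs {
	bool vtimer_requested;
	bool ptimer_requested;
};

/*
 * fail_step selects a simulated failure:
 *   0 = none, 1 = vtimer affinity, 2 = ptimer request, 3 = ptimer affinity
 */
static inline int ex_timer_irq_init(struct ex_timer_irqs *t, int fail_step)
{
	int err;

	t->vtimer_requested = true;
	if (fail_step == 1) {
		err = -1;
		goto out_free_vtimer_irq;
	}

	if (fail_step == 2) {
		/* The ptimer request itself failed: only the vtimer is held. */
		err = -2;
		goto out_free_vtimer_irq;
	}
	t->ptimer_requested = true;

	if (fail_step == 3) {
		/* Both IRQs are held: fall through both labels. */
		err = -3;
		goto out_free_ptimer_irq;
	}

	return 0;

out_free_ptimer_irq:
	t->ptimer_requested = false;
out_free_vtimer_irq:
	t->vtimer_requested = false;
	return err;
}

/* Usage: a late failure must leave nothing requested. */
static inline void ex_timer_selfcheck(void)
{
	struct ex_timer_irqs t = { false, false };

	assert(ex_timer_irq_init(&t, 3) == -3);
	assert(!t.vtimer_requested && !t.ptimer_requested);
}
```

Ordering the labels in reverse acquisition order lets a later failure fall through the earlier cleanup, which is what the diff does with out_free_ptimer_irq placed above out_free_vtimer_irq.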

+154 -55

arch/arm64/kvm/arm.c

··· 51 51 DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page); 52 52 DECLARE_KVM_NVHE_PER_CPU(struct kvm_nvhe_init_params, kvm_init_params); 53 53 54 + DECLARE_KVM_NVHE_PER_CPU(struct kvm_cpu_context, kvm_hyp_ctxt); 55 + 54 56 static bool vgic_present; 55 57 56 58 static DEFINE_PER_CPU(unsigned char, kvm_arm_hardware_enabled); ··· 67 65 struct kvm_enable_cap *cap) 68 66 { 69 67 int r; 68 + u64 new_cap; 70 69 71 70 if (cap->flags) 72 71 return -EINVAL; ··· 92 89 r = 0; 93 90 set_bit(KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED, &kvm->arch.flags); 94 91 break; 92 + case KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE: 93 + new_cap = cap->args[0]; 94 + 95 + mutex_lock(&kvm->slots_lock); 96 + /* 97 + * To keep things simple, allow changing the chunk 98 + * size only when no memory slots have been created. 99 + */ 100 + if (!kvm_are_all_memslots_empty(kvm)) { 101 + r = -EINVAL; 102 + } else if (new_cap && !kvm_is_block_size_supported(new_cap)) { 103 + r = -EINVAL; 104 + } else { 105 + r = 0; 106 + kvm->arch.mmu.split_page_chunk_size = new_cap; 107 + } 108 + mutex_unlock(&kvm->slots_lock); 109 + break; 95 110 default: 96 111 r = -EINVAL; 97 112 break; ··· 121 100 static int kvm_arm_default_max_vcpus(void) 122 101 { 123 102 return vgic_present ? kvm_vgic_get_max_vcpus() : KVM_MAX_VCPUS; 124 - } 125 - 126 - static void set_default_spectre(struct kvm *kvm) 127 - { 128 - /* 129 - * The default is to expose CSV2 == 1 if the HW isn't affected. 130 - * Although this is a per-CPU feature, we make it global because 131 - * asymmetric systems are just a nuisance. 132 - * 133 - * Userspace can override this as long as it doesn't promise 134 - * the impossible. 
135 - */ 136 - if (arm64_get_spectre_v2_state() == SPECTRE_UNAFFECTED) 137 - kvm->arch.pfr0_csv2 = 1; 138 - if (arm64_get_meltdown_state() == SPECTRE_UNAFFECTED) 139 - kvm->arch.pfr0_csv3 = 1; 140 103 } 141 104 142 105 /** ··· 166 161 /* The maximum number of VCPUs is limited by the host's GIC model */ 167 162 kvm->max_vcpus = kvm_arm_default_max_vcpus(); 168 163 169 - set_default_spectre(kvm); 170 164 kvm_arm_init_hypercalls(kvm); 171 165 172 - /* 173 - * Initialise the default PMUver before there is a chance to 174 - * create an actual PMU. 175 - */ 176 - kvm->arch.dfr0_pmuver.imp = kvm_arm_pmu_get_pmuver_limit(); 166 + bitmap_zero(kvm->arch.vcpu_features, KVM_VCPU_MAX_FEATURES); 177 167 178 168 return 0; 179 169 ··· 301 301 case KVM_CAP_ARM_PTRAUTH_ADDRESS: 302 302 case KVM_CAP_ARM_PTRAUTH_GENERIC: 303 303 r = system_has_full_ptr_auth(); 304 + break; 305 + case KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE: 306 + if (kvm) 307 + r = kvm->arch.mmu.split_page_chunk_size; 308 + else 309 + r = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT; 310 + break; 311 + case KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES: 312 + r = kvm_supported_block_sizes(); 304 313 break; 305 314 default: 306 315 r = 0; ··· 1176 1167 return -EINVAL; 1177 1168 } 1178 1169 1179 - static int kvm_vcpu_set_target(struct kvm_vcpu *vcpu, 1180 - const struct kvm_vcpu_init *init) 1170 + static int kvm_vcpu_init_check_features(struct kvm_vcpu *vcpu, 1171 + const struct kvm_vcpu_init *init) 1181 1172 { 1182 - unsigned int i, ret; 1183 - u32 phys_target = kvm_target_cpu(); 1173 + unsigned long features = init->features[0]; 1174 + int i; 1184 1175 1185 - if (init->target != phys_target) 1186 - return -EINVAL; 1176 + if (features & ~KVM_VCPU_VALID_FEATURES) 1177 + return -ENOENT; 1187 1178 1188 - /* 1189 - * Secondary and subsequent calls to KVM_ARM_VCPU_INIT must 1190 - * use the same target. 
1191 - */ 1192 - if (vcpu->arch.target != -1 && vcpu->arch.target != init->target) 1193 - return -EINVAL; 1194 - 1195 - /* -ENOENT for unknown features, -EINVAL for invalid combinations. */ 1196 - for (i = 0; i < sizeof(init->features) * 8; i++) { 1197 - bool set = (init->features[i / 32] & (1 << (i % 32))); 1198 - 1199 - if (set && i >= KVM_VCPU_MAX_FEATURES) 1179 + for (i = 1; i < ARRAY_SIZE(init->features); i++) { 1180 + if (init->features[i]) 1200 1181 return -ENOENT; 1201 - 1202 - /* 1203 - * Secondary and subsequent calls to KVM_ARM_VCPU_INIT must 1204 - * use the same feature set. 1205 - */ 1206 - if (vcpu->arch.target != -1 && i < KVM_VCPU_MAX_FEATURES && 1207 - test_bit(i, vcpu->arch.features) != set) 1208 - return -EINVAL; 1209 - 1210 - if (set) 1211 - set_bit(i, vcpu->arch.features); 1212 1182 } 1213 1183 1214 - vcpu->arch.target = phys_target; 1184 + if (!test_bit(KVM_ARM_VCPU_EL1_32BIT, &features)) 1185 + return 0; 1186 + 1187 + if (!cpus_have_const_cap(ARM64_HAS_32BIT_EL1)) 1188 + return -EINVAL; 1189 + 1190 + /* MTE is incompatible with AArch32 */ 1191 + if (kvm_has_mte(vcpu->kvm)) 1192 + return -EINVAL; 1193 + 1194 + /* NV is incompatible with AArch32 */ 1195 + if (test_bit(KVM_ARM_VCPU_HAS_EL2, &features)) 1196 + return -EINVAL; 1197 + 1198 + return 0; 1199 + } 1200 + 1201 + static bool kvm_vcpu_init_changed(struct kvm_vcpu *vcpu, 1202 + const struct kvm_vcpu_init *init) 1203 + { 1204 + unsigned long features = init->features[0]; 1205 + 1206 + return !bitmap_equal(vcpu->arch.features, &features, KVM_VCPU_MAX_FEATURES) || 1207 + vcpu->arch.target != init->target; 1208 + } 1209 + 1210 + static int __kvm_vcpu_set_target(struct kvm_vcpu *vcpu, 1211 + const struct kvm_vcpu_init *init) 1212 + { 1213 + unsigned long features = init->features[0]; 1214 + struct kvm *kvm = vcpu->kvm; 1215 + int ret = -EINVAL; 1216 + 1217 + mutex_lock(&kvm->arch.config_lock); 1218 + 1219 + if (test_bit(KVM_ARCH_FLAG_VCPU_FEATURES_CONFIGURED, &kvm->arch.flags) && 1220 + 
!bitmap_equal(kvm->arch.vcpu_features, &features, KVM_VCPU_MAX_FEATURES)) 1221 + goto out_unlock; 1222 + 1223 + vcpu->arch.target = init->target; 1224 + bitmap_copy(vcpu->arch.features, &features, KVM_VCPU_MAX_FEATURES); 1215 1225 1216 1226 /* Now we know what it is, we can reset it. */ 1217 1227 ret = kvm_reset_vcpu(vcpu); 1218 1228 if (ret) { 1219 1229 vcpu->arch.target = -1; 1220 1230 bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES); 1231 + goto out_unlock; 1221 1232 } 1222 1233 1234 + bitmap_copy(kvm->arch.vcpu_features, &features, KVM_VCPU_MAX_FEATURES); 1235 + set_bit(KVM_ARCH_FLAG_VCPU_FEATURES_CONFIGURED, &kvm->arch.flags); 1236 + 1237 + out_unlock: 1238 + mutex_unlock(&kvm->arch.config_lock); 1223 1239 return ret; 1240 + } 1241 + 1242 + static int kvm_vcpu_set_target(struct kvm_vcpu *vcpu, 1243 + const struct kvm_vcpu_init *init) 1244 + { 1245 + int ret; 1246 + 1247 + if (init->target != kvm_target_cpu()) 1248 + return -EINVAL; 1249 + 1250 + ret = kvm_vcpu_init_check_features(vcpu, init); 1251 + if (ret) 1252 + return ret; 1253 + 1254 + if (vcpu->arch.target == -1) 1255 + return __kvm_vcpu_set_target(vcpu, init); 1256 + 1257 + if (kvm_vcpu_init_changed(vcpu, init)) 1258 + return -EINVAL; 1259 + 1260 + return kvm_reset_vcpu(vcpu); 1224 1261 } 1225 1262 1226 1263 static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu *vcpu, 1227 1264 struct kvm_vcpu_init *init) 1228 1265 { 1266 + bool power_off = false; 1229 1267 int ret; 1268 + 1269 + /* 1270 + * Treat the power-off vCPU feature as ephemeral. Clear the bit to avoid 1271 + * reflecting it in the finalized feature set, thus limiting its scope 1272 + * to a single KVM_ARM_VCPU_INIT call. 
1273 + */ 1274 + if (init->features[0] & BIT(KVM_ARM_VCPU_POWER_OFF)) { 1275 + init->features[0] &= ~BIT(KVM_ARM_VCPU_POWER_OFF); 1276 + power_off = true; 1277 + } 1230 1278 1231 1279 ret = kvm_vcpu_set_target(vcpu, init); 1232 1280 if (ret) ··· 1306 1240 } 1307 1241 1308 1242 vcpu_reset_hcr(vcpu); 1309 - vcpu->arch.cptr_el2 = CPTR_EL2_DEFAULT; 1243 + vcpu->arch.cptr_el2 = kvm_get_reset_cptr_el2(vcpu); 1310 1244 1311 1245 /* 1312 1246 * Handle the "start in power-off" case. 1313 1247 */ 1314 1248 spin_lock(&vcpu->arch.mp_state_lock); 1315 1249 1316 - if (test_bit(KVM_ARM_VCPU_POWER_OFF, vcpu->arch.features)) 1250 + if (power_off) 1317 1251 __kvm_arm_vcpu_power_off(vcpu); 1318 1252 else 1319 1253 WRITE_ONCE(vcpu->arch.mp_state.mp_state, KVM_MP_STATE_RUNNABLE); ··· 1732 1666 1733 1667 params->mair_el2 = read_sysreg(mair_el1); 1734 1668 1735 - tcr = (read_sysreg(tcr_el1) & TCR_EL2_MASK) | TCR_EL2_RES1; 1669 + tcr = read_sysreg(tcr_el1); 1670 + if (cpus_have_final_cap(ARM64_KVM_HVHE)) { 1671 + tcr |= TCR_EPD1_MASK; 1672 + } else { 1673 + tcr &= TCR_EL2_MASK; 1674 + tcr |= TCR_EL2_RES1; 1675 + } 1736 1676 tcr &= ~TCR_T0SZ_MASK; 1737 1677 tcr |= TCR_T0SZ(hyp_va_bits); 1738 1678 params->tcr_el2 = tcr; ··· 1748 1676 params->hcr_el2 = HCR_HOST_NVHE_PROTECTED_FLAGS; 1749 1677 else 1750 1678 params->hcr_el2 = HCR_HOST_NVHE_FLAGS; 1679 + if (cpus_have_final_cap(ARM64_KVM_HVHE)) 1680 + params->hcr_el2 |= HCR_E2H; 1751 1681 params->vttbr = params->vtcr = 0; 1752 1682 1753 1683 /* ··· 1984 1910 } 1985 1911 1986 1912 kvm_host_psci_config.version = psci_ops.get_version(); 1913 + kvm_host_psci_config.smccc_version = arm_smccc_get_version(); 1987 1914 1988 1915 if (kvm_host_psci_config.version == PSCI_VERSION(0, 1)) { 1989 1916 kvm_host_psci_config.function_ids_0_1 = get_psci_0_1_function_ids(); ··· 2140 2065 free_hyp_pgds(); 2141 2066 2142 2067 return 0; 2068 + } 2069 + 2070 + static void pkvm_hyp_init_ptrauth(void) 2071 + { 2072 + struct kvm_cpu_context *hyp_ctxt; 2073 + int cpu; 
2074 + 2075 + for_each_possible_cpu(cpu) { 2076 + hyp_ctxt = per_cpu_ptr_nvhe_sym(kvm_hyp_ctxt, cpu); 2077 + hyp_ctxt->sys_regs[APIAKEYLO_EL1] = get_random_long(); 2078 + hyp_ctxt->sys_regs[APIAKEYHI_EL1] = get_random_long(); 2079 + hyp_ctxt->sys_regs[APIBKEYLO_EL1] = get_random_long(); 2080 + hyp_ctxt->sys_regs[APIBKEYHI_EL1] = get_random_long(); 2081 + hyp_ctxt->sys_regs[APDAKEYLO_EL1] = get_random_long(); 2082 + hyp_ctxt->sys_regs[APDAKEYHI_EL1] = get_random_long(); 2083 + hyp_ctxt->sys_regs[APDBKEYLO_EL1] = get_random_long(); 2084 + hyp_ctxt->sys_regs[APDBKEYHI_EL1] = get_random_long(); 2085 + hyp_ctxt->sys_regs[APGAKEYLO_EL1] = get_random_long(); 2086 + hyp_ctxt->sys_regs[APGAKEYHI_EL1] = get_random_long(); 2087 + } 2143 2088 } 2144 2089 2145 2090 /* Inits Hyp-mode on all online CPUs */ ··· 2323 2228 kvm_hyp_init_symbols(); 2324 2229 2325 2230 if (is_protected_kvm_enabled()) { 2231 + if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) && 2232 + cpus_have_const_cap(ARM64_HAS_ADDRESS_AUTH)) 2233 + pkvm_hyp_init_ptrauth(); 2234 + 2326 2235 init_cpu_logical_map(); 2327 2236 2328 2237 if (!init_psci_relay()) {
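
The new kvm_vcpu_init_check_features() rejects unknown feature bits up front, and kvm_arch_vcpu_ioctl_vcpu_init() strips the power-off bit before the feature set is recorded. A simplified sketch of those two steps is shown below. It assumes a seven-word features array and that the power-off feature is bit 0, and it omits the AArch32, MTE and nested-virtualization checks. Every name is local to the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_VCPU_MAX_FEATURES   7
#define EX_VCPU_VALID_FEATURES ((UINT32_C(1) << EX_VCPU_MAX_FEATURES) - 1)
#define EX_NUM_FEATURE_WORDS   7 /* assumed length of features[] */
#define EX_VCPU_POWER_OFF      0 /* assumed bit number */

/* Returns 0 if only known feature bits are set, -ENOENT otherwise. */
static inline int ex_check_feature_words(const uint32_t *features)
{
	/* All defined features live in word 0. */
	if (features[0] & ~EX_VCPU_VALID_FEATURES)
		return -ENOENT;

	/* Any bit in the remaining words is an unknown feature. */
	for (size_t i = 1; i < EX_NUM_FEATURE_WORDS; i++) {
		if (features[i])
			return -ENOENT;
	}

	return 0;
}

/* Clears the power-off bit and reports whether it was set. */
static inline bool ex_strip_power_off(uint32_t *word0)
{
	uint32_t bit = UINT32_C(1) << EX_VCPU_POWER_OFF;
	bool power_off = (*word0 & bit) != 0;

	*word0 &= ~bit;
	return power_off;
}

/* Usage: bit 7 is outside the valid range of seven features. */
static inline void ex_features_selfcheck(void)
{
	uint32_t bad[EX_NUM_FEATURE_WORDS] = { 0x80 };

	assert(ex_check_feature_words(bad) == -ENOENT);
}
```

Stripping the power-off bit first means it never reaches the VM-wide feature comparison, so two vCPUs that differ only in their initial power state still agree on the configured feature set.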

+2 -2

arch/arm64/kvm/fpsimd.c

··· 180 180 181 181 /* 182 182 * If we have VHE then the Hyp code will reset CPACR_EL1 to 183 - * CPACR_EL1_DEFAULT and we need to reenable SME. 183 + * the default value and we need to reenable SME. 184 184 */ 185 185 if (has_vhe() && system_supports_sme()) { 186 186 /* Also restore EL0 state seen on entry */ ··· 210 210 /* 211 211 * The FPSIMD/SVE state in the CPU has not been touched, and we 212 212 * have SVE (and VHE): CPACR_EL1 (alias CPTR_EL2) has been 213 - * reset to CPACR_EL1_DEFAULT by the Hyp code, disabling SVE 213 + * reset by kvm_reset_cptr_el2() in the Hyp code, disabling SVE 214 214 * for EL0. To avoid spurious traps, restore the trap state 215 215 * seen by kvm_arch_vcpu_load_fp(): 216 216 */

+82 -19

arch/arm64/kvm/hyp/include/hyp/switch.h

··· 70 70 } 71 71 } 72 72 73 + static inline bool __hfgxtr_traps_required(void) 74 + { 75 + if (cpus_have_final_cap(ARM64_SME)) 76 + return true; 77 + 78 + if (cpus_have_final_cap(ARM64_WORKAROUND_AMPERE_AC03_CPU_38)) 79 + return true; 80 + 81 + return false; 82 + } 83 + 84 + static inline void __activate_traps_hfgxtr(void) 85 + { 86 + u64 r_clr = 0, w_clr = 0, r_set = 0, w_set = 0, tmp; 87 + 88 + if (cpus_have_final_cap(ARM64_SME)) { 89 + tmp = HFGxTR_EL2_nSMPRI_EL1_MASK | HFGxTR_EL2_nTPIDR2_EL0_MASK; 90 + 91 + r_clr |= tmp; 92 + w_clr |= tmp; 93 + } 94 + 95 + /* 96 + * Trap guest writes to TCR_EL1 to prevent it from enabling HA or HD. 97 + */ 98 + if (cpus_have_final_cap(ARM64_WORKAROUND_AMPERE_AC03_CPU_38)) 99 + w_set |= HFGxTR_EL2_TCR_EL1_MASK; 100 + 101 + sysreg_clear_set_s(SYS_HFGRTR_EL2, r_clr, r_set); 102 + sysreg_clear_set_s(SYS_HFGWTR_EL2, w_clr, w_set); 103 + } 104 + 105 + static inline void __deactivate_traps_hfgxtr(void) 106 + { 107 + u64 r_clr = 0, w_clr = 0, r_set = 0, w_set = 0, tmp; 108 + 109 + if (cpus_have_final_cap(ARM64_SME)) { 110 + tmp = HFGxTR_EL2_nSMPRI_EL1_MASK | HFGxTR_EL2_nTPIDR2_EL0_MASK; 111 + 112 + r_set |= tmp; 113 + w_set |= tmp; 114 + } 115 + 116 + if (cpus_have_final_cap(ARM64_WORKAROUND_AMPERE_AC03_CPU_38)) 117 + w_clr |= HFGxTR_EL2_TCR_EL1_MASK; 118 + 119 + sysreg_clear_set_s(SYS_HFGRTR_EL2, r_clr, r_set); 120 + sysreg_clear_set_s(SYS_HFGWTR_EL2, w_clr, w_set); 121 + } 122 + 73 123 static inline void __activate_traps_common(struct kvm_vcpu *vcpu) 74 124 { 75 125 /* Trap on AArch32 cp15 c15 (impdef sysregs) accesses (EL1 or EL0) */ ··· 145 95 vcpu->arch.mdcr_el2_host = read_sysreg(mdcr_el2); 146 96 write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2); 147 97 148 - if (cpus_have_final_cap(ARM64_SME)) { 149 - sysreg_clear_set_s(SYS_HFGRTR_EL2, 150 - HFGxTR_EL2_nSMPRI_EL1_MASK | 151 - HFGxTR_EL2_nTPIDR2_EL0_MASK, 152 - 0); 153 - sysreg_clear_set_s(SYS_HFGWTR_EL2, 154 - HFGxTR_EL2_nSMPRI_EL1_MASK | 155 - HFGxTR_EL2_nTPIDR2_EL0_MASK, 156 - 0); 
157 - } 98 + if (__hfgxtr_traps_required()) 99 + __activate_traps_hfgxtr(); 158 100 } 159 101 160 102 static inline void __deactivate_traps_common(struct kvm_vcpu *vcpu) ··· 162 120 vcpu_clear_flag(vcpu, PMUSERENR_ON_CPU); 163 121 } 164 122 165 - if (cpus_have_final_cap(ARM64_SME)) { 166 - sysreg_clear_set_s(SYS_HFGRTR_EL2, 0, 167 - HFGxTR_EL2_nSMPRI_EL1_MASK | 168 - HFGxTR_EL2_nTPIDR2_EL0_MASK); 169 - sysreg_clear_set_s(SYS_HFGWTR_EL2, 0, 170 - HFGxTR_EL2_nSMPRI_EL1_MASK | 171 - HFGxTR_EL2_nTPIDR2_EL0_MASK); 172 - } 123 + if (__hfgxtr_traps_required()) 124 + __deactivate_traps_hfgxtr(); 173 125 } 174 126 175 127 static inline void ___activate_traps(struct kvm_vcpu *vcpu) ··· 245 209 /* Valid trap. Switch the context: */ 246 210 247 211 /* First disable enough traps to allow us to update the registers */ 248 - if (has_vhe()) { 212 + if (has_vhe() || has_hvhe()) { 249 213 reg = CPACR_EL1_FPEN_EL0EN | CPACR_EL1_FPEN_EL1EN; 250 214 if (sve_guest) 251 215 reg |= CPACR_EL1_ZEN_EL0EN | CPACR_EL1_ZEN_EL1EN; ··· 437 401 return true; 438 402 } 439 403 404 + static bool handle_ampere1_tcr(struct kvm_vcpu *vcpu) 405 + { 406 + u32 sysreg = esr_sys64_to_sysreg(kvm_vcpu_get_esr(vcpu)); 407 + int rt = kvm_vcpu_sys_get_rt(vcpu); 408 + u64 val = vcpu_get_reg(vcpu, rt); 409 + 410 + if (sysreg != SYS_TCR_EL1) 411 + return false; 412 + 413 + /* 414 + * Affected parts do not advertise support for hardware Access Flag / 415 + * Dirty state management in ID_AA64MMFR1_EL1.HAFDBS, but the underlying 416 + * control bits are still functional. The architecture requires these be 417 + * RES0 on systems that do not implement FEAT_HAFDBS. 418 + * 419 + * Uphold the requirements of the architecture by masking guest writes 420 + * to TCR_EL1.{HA,HD} here. 
421 + */ 422 + val &= ~(TCR_HD | TCR_HA); 423 + write_sysreg_el1(val, SYS_TCR); 424 + return true; 425 + } 426 + 440 427 static bool kvm_hyp_handle_sysreg(struct kvm_vcpu *vcpu, u64 *exit_code) 441 428 { 442 429 if (cpus_have_final_cap(ARM64_WORKAROUND_CAVIUM_TX2_219_TVM) && 443 430 handle_tx2_tvm(vcpu)) 431 + return true; 432 + 433 + if (cpus_have_final_cap(ARM64_WORKAROUND_AMPERE_AC03_CPU_38) && 434 + handle_ampere1_tcr(vcpu)) 444 435 return true; 445 436 446 437 if (static_branch_unlikely(&vgic_v3_cpuif_trap) &&
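The HFGxTR helpers in the switch.h diff above collect separate clear and set masks for the read and write trap registers, then apply each pair in one read-modify-write. The sketch below models that accumulation in plain C. The bit positions are placeholders rather than the architectural layout, and `fgt_activate_masks()` and `clear_set()` are illustrative names only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit positions, NOT the architectural HFGxTR_EL2 layout. */
#define TRAP_SME_BITS	(UINT64_C(1) << 0 | UINT64_C(1) << 1)
#define TRAP_TCR_BIT	(UINT64_C(1) << 2)

struct fgt_masks {
	uint64_t r_clr, r_set;	/* applied to the read-trap register */
	uint64_t w_clr, w_set;	/* applied to the write-trap register */
};

/* Build the masks once, depending on which features/errata are present. */
static struct fgt_masks fgt_activate_masks(bool has_sme, bool has_tcr_erratum)
{
	struct fgt_masks m = { 0 };

	if (has_sme) {
		/* Negative-polarity ("n") bits: clearing them enables the trap. */
		m.r_clr |= TRAP_SME_BITS;
		m.w_clr |= TRAP_SME_BITS;
	}

	if (has_tcr_erratum)
		m.w_set |= TRAP_TCR_BIT;	/* trap writes only */

	return m;
}

/* Same arithmetic as a clear-then-set register update. */
static uint64_t clear_set(uint64_t reg, uint64_t clr, uint64_t set)
{
	return (reg & ~clr) | set;
}
```

Deactivation is the mirror image: the masks that were cleared on entry are set again on exit, and the erratum bit is cleared.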

+17

arch/arm64/kvm/hyp/include/nvhe/ffa.h

···
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2022 - Google LLC
+ * Author: Andrew Walbran <qwandor@google.com>
+ */
+#ifndef __KVM_HYP_FFA_H
+#define __KVM_HYP_FFA_H
+
+#include <asm/kvm_host.h>
+
+#define FFA_MIN_FUNC_NUM 0x60
+#define FFA_MAX_FUNC_NUM 0x7F
+
+int hyp_ffa_init(void *pages);
+bool kvm_host_ffa_handler(struct kvm_cpu_context *host_ctxt);
+
+#endif /* __KVM_HYP_FFA_H */
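The two function-number bounds defined in ffa.h above are what the proxy uses to decide whether an SMC belongs to FF-A. As a rough illustration, the check can be written against the raw SMCCC function ID, assuming the usual encoding (bit 31 marks a fast call, bits 29:24 hold the owner, bits 15:0 hold the function number, and owner 4 is the standard secure service range). `looks_like_ffa_call()` is a hypothetical name, not the kernel helper.

```c
#include <stdbool.h>
#include <stdint.h>

#define FFA_MIN_FUNC_NUM	0x60
#define FFA_MAX_FUNC_NUM	0x7F

/* SMCCC owner value for standard secure services. */
#define SMCCC_OWNER_STANDARD	4

static bool looks_like_ffa_call(uint32_t func_id)
{
	bool fast = (func_id >> 31) & 1;		/* bit 31: fast call */
	uint32_t owner = (func_id >> 24) & 0x3f;	/* bits 29:24: owner */
	uint32_t num = func_id & 0xffff;		/* bits 15:0: function */

	return fast && owner == SMCCC_OWNER_STANDARD &&
	       num >= FFA_MIN_FUNC_NUM && num <= FFA_MAX_FUNC_NUM;
}
```

Bit 30 (the 32-bit versus 64-bit convention) is deliberately ignored, so both variants of a call fall into the same range.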

+3

arch/arm64/kvm/hyp/include/nvhe/mem_protect.h

···
 enum pkvm_component_id {
 	PKVM_ID_HOST,
 	PKVM_ID_HYP,
+	PKVM_ID_FFA,
 };

 extern unsigned long hyp_nr_cpus;
···
 int __pkvm_host_unshare_hyp(u64 pfn);
 int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages);
 int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages);
+int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages);
+int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages);

 bool addr_is_memory(phys_addr_t phys);
 int host_stage2_idmap_locked(phys_addr_t addr, u64 size, enum kvm_pgtable_prot prot);

+1 -1

arch/arm64/kvm/hyp/nvhe/Makefile

···

 hyp-obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
 	 hyp-main.o hyp-smp.o psci-relay.o early_alloc.o page_alloc.o \
-	 cache.o setup.o mm.o mem_protect.o sys_regs.o pkvm.o stacktrace.o
+	 cache.o setup.o mm.o mem_protect.o sys_regs.o pkvm.o stacktrace.o ffa.o
 hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
 	 ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
 hyp-obj-$(CONFIG_DEBUG_LIST) += list_debug.o

+762

arch/arm64/kvm/hyp/nvhe/ffa.c

··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + /* 3 + * FF-A v1.0 proxy to filter out invalid memory-sharing SMC calls issued by 4 + * the host. FF-A is a slightly more palatable abbreviation of "Arm Firmware 5 + * Framework for Arm A-profile", which is specified by Arm in document 6 + * number DEN0077. 7 + * 8 + * Copyright (C) 2022 - Google LLC 9 + * Author: Andrew Walbran <qwandor@google.com> 10 + * 11 + * This driver hooks into the SMC trapping logic for the host and intercepts 12 + * all calls falling within the FF-A range. Each call is either: 13 + * 14 + * - Forwarded on unmodified to the SPMD at EL3 15 + * - Rejected as "unsupported" 16 + * - Accompanied by a host stage-2 page-table check/update and reissued 17 + * 18 + * Consequently, any attempts by the host to make guest memory pages 19 + * accessible to the secure world using FF-A will be detected either here 20 + * (in the case that the memory is already owned by the guest) or during 21 + * donation to the guest (in the case that the memory was previously shared 22 + * with the secure world). 23 + * 24 + * To allow the rolling-back of page-table updates and FF-A calls in the 25 + * event of failure, operations involving the RXTX buffers are locked for 26 + * the duration and are therefore serialised. 27 + */ 28 + 29 + #include <linux/arm-smccc.h> 30 + #include <linux/arm_ffa.h> 31 + #include <asm/kvm_pkvm.h> 32 + 33 + #include <nvhe/ffa.h> 34 + #include <nvhe/mem_protect.h> 35 + #include <nvhe/memory.h> 36 + #include <nvhe/trap_handler.h> 37 + #include <nvhe/spinlock.h> 38 + 39 + /* 40 + * "ID value 0 must be returned at the Non-secure physical FF-A instance" 41 + * We share this ID with the host. 42 + */ 43 + #define HOST_FFA_ID 0 44 + 45 + /* 46 + * A buffer to hold the maximum descriptor size we can see from the host, 47 + * which is required when the SPMD returns a fragmented FFA_MEM_RETRIEVE_RESP 48 + * when resolving the handle on the reclaim path. 
49 + */ 50 + struct kvm_ffa_descriptor_buffer { 51 + void *buf; 52 + size_t len; 53 + }; 54 + 55 + static struct kvm_ffa_descriptor_buffer ffa_desc_buf; 56 + 57 + struct kvm_ffa_buffers { 58 + hyp_spinlock_t lock; 59 + void *tx; 60 + void *rx; 61 + }; 62 + 63 + /* 64 + * Note that we don't currently lock these buffers explicitly, instead 65 + * relying on the locking of the host FFA buffers as we only have one 66 + * client. 67 + */ 68 + static struct kvm_ffa_buffers hyp_buffers; 69 + static struct kvm_ffa_buffers host_buffers; 70 + 71 + static void ffa_to_smccc_error(struct arm_smccc_res *res, u64 ffa_errno) 72 + { 73 + *res = (struct arm_smccc_res) { 74 + .a0 = FFA_ERROR, 75 + .a2 = ffa_errno, 76 + }; 77 + } 78 + 79 + static void ffa_to_smccc_res_prop(struct arm_smccc_res *res, int ret, u64 prop) 80 + { 81 + if (ret == FFA_RET_SUCCESS) { 82 + *res = (struct arm_smccc_res) { .a0 = FFA_SUCCESS, 83 + .a2 = prop }; 84 + } else { 85 + ffa_to_smccc_error(res, ret); 86 + } 87 + } 88 + 89 + static void ffa_to_smccc_res(struct arm_smccc_res *res, int ret) 90 + { 91 + ffa_to_smccc_res_prop(res, ret, 0); 92 + } 93 + 94 + static void ffa_set_retval(struct kvm_cpu_context *ctxt, 95 + struct arm_smccc_res *res) 96 + { 97 + cpu_reg(ctxt, 0) = res->a0; 98 + cpu_reg(ctxt, 1) = res->a1; 99 + cpu_reg(ctxt, 2) = res->a2; 100 + cpu_reg(ctxt, 3) = res->a3; 101 + } 102 + 103 + static bool is_ffa_call(u64 func_id) 104 + { 105 + return ARM_SMCCC_IS_FAST_CALL(func_id) && 106 + ARM_SMCCC_OWNER_NUM(func_id) == ARM_SMCCC_OWNER_STANDARD && 107 + ARM_SMCCC_FUNC_NUM(func_id) >= FFA_MIN_FUNC_NUM && 108 + ARM_SMCCC_FUNC_NUM(func_id) <= FFA_MAX_FUNC_NUM; 109 + } 110 + 111 + static int ffa_map_hyp_buffers(u64 ffa_page_count) 112 + { 113 + struct arm_smccc_res res; 114 + 115 + arm_smccc_1_1_smc(FFA_FN64_RXTX_MAP, 116 + hyp_virt_to_phys(hyp_buffers.tx), 117 + hyp_virt_to_phys(hyp_buffers.rx), 118 + ffa_page_count, 119 + 0, 0, 0, 0, 120 + &res); 121 + 122 + return res.a0 == FFA_SUCCESS ? 
FFA_RET_SUCCESS : res.a2; 123 + } 124 + 125 + static int ffa_unmap_hyp_buffers(void) 126 + { 127 + struct arm_smccc_res res; 128 + 129 + arm_smccc_1_1_smc(FFA_RXTX_UNMAP, 130 + HOST_FFA_ID, 131 + 0, 0, 0, 0, 0, 0, 132 + &res); 133 + 134 + return res.a0 == FFA_SUCCESS ? FFA_RET_SUCCESS : res.a2; 135 + } 136 + 137 + static void ffa_mem_frag_tx(struct arm_smccc_res *res, u32 handle_lo, 138 + u32 handle_hi, u32 fraglen, u32 endpoint_id) 139 + { 140 + arm_smccc_1_1_smc(FFA_MEM_FRAG_TX, 141 + handle_lo, handle_hi, fraglen, endpoint_id, 142 + 0, 0, 0, 143 + res); 144 + } 145 + 146 + static void ffa_mem_frag_rx(struct arm_smccc_res *res, u32 handle_lo, 147 + u32 handle_hi, u32 fragoff) 148 + { 149 + arm_smccc_1_1_smc(FFA_MEM_FRAG_RX, 150 + handle_lo, handle_hi, fragoff, HOST_FFA_ID, 151 + 0, 0, 0, 152 + res); 153 + } 154 + 155 + static void ffa_mem_xfer(struct arm_smccc_res *res, u64 func_id, u32 len, 156 + u32 fraglen) 157 + { 158 + arm_smccc_1_1_smc(func_id, len, fraglen, 159 + 0, 0, 0, 0, 0, 160 + res); 161 + } 162 + 163 + static void ffa_mem_reclaim(struct arm_smccc_res *res, u32 handle_lo, 164 + u32 handle_hi, u32 flags) 165 + { 166 + arm_smccc_1_1_smc(FFA_MEM_RECLAIM, 167 + handle_lo, handle_hi, flags, 168 + 0, 0, 0, 0, 169 + res); 170 + } 171 + 172 + static void ffa_retrieve_req(struct arm_smccc_res *res, u32 len) 173 + { 174 + arm_smccc_1_1_smc(FFA_FN64_MEM_RETRIEVE_REQ, 175 + len, len, 176 + 0, 0, 0, 0, 0, 177 + res); 178 + } 179 + 180 + static void do_ffa_rxtx_map(struct arm_smccc_res *res, 181 + struct kvm_cpu_context *ctxt) 182 + { 183 + DECLARE_REG(phys_addr_t, tx, ctxt, 1); 184 + DECLARE_REG(phys_addr_t, rx, ctxt, 2); 185 + DECLARE_REG(u32, npages, ctxt, 3); 186 + int ret = 0; 187 + void *rx_virt, *tx_virt; 188 + 189 + if (npages != (KVM_FFA_MBOX_NR_PAGES * PAGE_SIZE) / FFA_PAGE_SIZE) { 190 + ret = FFA_RET_INVALID_PARAMETERS; 191 + goto out; 192 + } 193 + 194 + if (!PAGE_ALIGNED(tx) || !PAGE_ALIGNED(rx)) { 195 + ret = FFA_RET_INVALID_PARAMETERS; 196 + goto 
out; 197 + } 198 + 199 + hyp_spin_lock(&host_buffers.lock); 200 + if (host_buffers.tx) { 201 + ret = FFA_RET_DENIED; 202 + goto out_unlock; 203 + } 204 + 205 + /* 206 + * Map our hypervisor buffers into the SPMD before mapping and 207 + * pinning the host buffers in our own address space. 208 + */ 209 + ret = ffa_map_hyp_buffers(npages); 210 + if (ret) 211 + goto out_unlock; 212 + 213 + ret = __pkvm_host_share_hyp(hyp_phys_to_pfn(tx)); 214 + if (ret) { 215 + ret = FFA_RET_INVALID_PARAMETERS; 216 + goto err_unmap; 217 + } 218 + 219 + ret = __pkvm_host_share_hyp(hyp_phys_to_pfn(rx)); 220 + if (ret) { 221 + ret = FFA_RET_INVALID_PARAMETERS; 222 + goto err_unshare_tx; 223 + } 224 + 225 + tx_virt = hyp_phys_to_virt(tx); 226 + ret = hyp_pin_shared_mem(tx_virt, tx_virt + 1); 227 + if (ret) { 228 + ret = FFA_RET_INVALID_PARAMETERS; 229 + goto err_unshare_rx; 230 + } 231 + 232 + rx_virt = hyp_phys_to_virt(rx); 233 + ret = hyp_pin_shared_mem(rx_virt, rx_virt + 1); 234 + if (ret) { 235 + ret = FFA_RET_INVALID_PARAMETERS; 236 + goto err_unpin_tx; 237 + } 238 + 239 + host_buffers.tx = tx_virt; 240 + host_buffers.rx = rx_virt; 241 + 242 + out_unlock: 243 + hyp_spin_unlock(&host_buffers.lock); 244 + out: 245 + ffa_to_smccc_res(res, ret); 246 + return; 247 + 248 + err_unpin_tx: 249 + hyp_unpin_shared_mem(tx_virt, tx_virt + 1); 250 + err_unshare_rx: 251 + __pkvm_host_unshare_hyp(hyp_phys_to_pfn(rx)); 252 + err_unshare_tx: 253 + __pkvm_host_unshare_hyp(hyp_phys_to_pfn(tx)); 254 + err_unmap: 255 + ffa_unmap_hyp_buffers(); 256 + goto out_unlock; 257 + } 258 + 259 + static void do_ffa_rxtx_unmap(struct arm_smccc_res *res, 260 + struct kvm_cpu_context *ctxt) 261 + { 262 + DECLARE_REG(u32, id, ctxt, 1); 263 + int ret = 0; 264 + 265 + if (id != HOST_FFA_ID) { 266 + ret = FFA_RET_INVALID_PARAMETERS; 267 + goto out; 268 + } 269 + 270 + hyp_spin_lock(&host_buffers.lock); 271 + if (!host_buffers.tx) { 272 + ret = FFA_RET_INVALID_PARAMETERS; 273 + goto out_unlock; 274 + } 275 + 276 + 
hyp_unpin_shared_mem(host_buffers.tx, host_buffers.tx + 1); 277 + WARN_ON(__pkvm_host_unshare_hyp(hyp_virt_to_pfn(host_buffers.tx))); 278 + host_buffers.tx = NULL; 279 + 280 + hyp_unpin_shared_mem(host_buffers.rx, host_buffers.rx + 1); 281 + WARN_ON(__pkvm_host_unshare_hyp(hyp_virt_to_pfn(host_buffers.rx))); 282 + host_buffers.rx = NULL; 283 + 284 + ffa_unmap_hyp_buffers(); 285 + 286 + out_unlock: 287 + hyp_spin_unlock(&host_buffers.lock); 288 + out: 289 + ffa_to_smccc_res(res, ret); 290 + } 291 + 292 + static u32 __ffa_host_share_ranges(struct ffa_mem_region_addr_range *ranges, 293 + u32 nranges) 294 + { 295 + u32 i; 296 + 297 + for (i = 0; i < nranges; ++i) { 298 + struct ffa_mem_region_addr_range *range = &ranges[i]; 299 + u64 sz = (u64)range->pg_cnt * FFA_PAGE_SIZE; 300 + u64 pfn = hyp_phys_to_pfn(range->address); 301 + 302 + if (!PAGE_ALIGNED(sz)) 303 + break; 304 + 305 + if (__pkvm_host_share_ffa(pfn, sz / PAGE_SIZE)) 306 + break; 307 + } 308 + 309 + return i; 310 + } 311 + 312 + static u32 __ffa_host_unshare_ranges(struct ffa_mem_region_addr_range *ranges, 313 + u32 nranges) 314 + { 315 + u32 i; 316 + 317 + for (i = 0; i < nranges; ++i) { 318 + struct ffa_mem_region_addr_range *range = &ranges[i]; 319 + u64 sz = (u64)range->pg_cnt * FFA_PAGE_SIZE; 320 + u64 pfn = hyp_phys_to_pfn(range->address); 321 + 322 + if (!PAGE_ALIGNED(sz)) 323 + break; 324 + 325 + if (__pkvm_host_unshare_ffa(pfn, sz / PAGE_SIZE)) 326 + break; 327 + } 328 + 329 + return i; 330 + } 331 + 332 + static int ffa_host_share_ranges(struct ffa_mem_region_addr_range *ranges, 333 + u32 nranges) 334 + { 335 + u32 nshared = __ffa_host_share_ranges(ranges, nranges); 336 + int ret = 0; 337 + 338 + if (nshared != nranges) { 339 + WARN_ON(__ffa_host_unshare_ranges(ranges, nshared) != nshared); 340 + ret = FFA_RET_DENIED; 341 + } 342 + 343 + return ret; 344 + } 345 + 346 + static int ffa_host_unshare_ranges(struct ffa_mem_region_addr_range *ranges, 347 + u32 nranges) 348 + { 349 + u32 nunshared = 
__ffa_host_unshare_ranges(ranges, nranges); 350 + int ret = 0; 351 + 352 + if (nunshared != nranges) { 353 + WARN_ON(__ffa_host_share_ranges(ranges, nunshared) != nunshared); 354 + ret = FFA_RET_DENIED; 355 + } 356 + 357 + return ret; 358 + } 359 + 360 + static void do_ffa_mem_frag_tx(struct arm_smccc_res *res, 361 + struct kvm_cpu_context *ctxt) 362 + { 363 + DECLARE_REG(u32, handle_lo, ctxt, 1); 364 + DECLARE_REG(u32, handle_hi, ctxt, 2); 365 + DECLARE_REG(u32, fraglen, ctxt, 3); 366 + DECLARE_REG(u32, endpoint_id, ctxt, 4); 367 + struct ffa_mem_region_addr_range *buf; 368 + int ret = FFA_RET_INVALID_PARAMETERS; 369 + u32 nr_ranges; 370 + 371 + if (fraglen > KVM_FFA_MBOX_NR_PAGES * PAGE_SIZE) 372 + goto out; 373 + 374 + if (fraglen % sizeof(*buf)) 375 + goto out; 376 + 377 + hyp_spin_lock(&host_buffers.lock); 378 + if (!host_buffers.tx) 379 + goto out_unlock; 380 + 381 + buf = hyp_buffers.tx; 382 + memcpy(buf, host_buffers.tx, fraglen); 383 + nr_ranges = fraglen / sizeof(*buf); 384 + 385 + ret = ffa_host_share_ranges(buf, nr_ranges); 386 + if (ret) { 387 + /* 388 + * We're effectively aborting the transaction, so we need 389 + * to restore the global state back to what it was prior to 390 + * transmission of the first fragment. 391 + */ 392 + ffa_mem_reclaim(res, handle_lo, handle_hi, 0); 393 + WARN_ON(res->a0 != FFA_SUCCESS); 394 + goto out_unlock; 395 + } 396 + 397 + ffa_mem_frag_tx(res, handle_lo, handle_hi, fraglen, endpoint_id); 398 + if (res->a0 != FFA_SUCCESS && res->a0 != FFA_MEM_FRAG_RX) 399 + WARN_ON(ffa_host_unshare_ranges(buf, nr_ranges)); 400 + 401 + out_unlock: 402 + hyp_spin_unlock(&host_buffers.lock); 403 + out: 404 + if (ret) 405 + ffa_to_smccc_res(res, ret); 406 + 407 + /* 408 + * If for any reason this did not succeed, we're in trouble as we have 409 + * now lost the content of the previous fragments and we can't rollback 410 + * the host stage-2 changes. 
The pages previously marked as shared will 411 + * remain stuck in that state forever, hence preventing the host from 412 + * sharing/donating them again and may possibly lead to subsequent 413 + * failures, but this will not compromise confidentiality. 414 + */ 415 + return; 416 + } 417 + 418 + static __always_inline void do_ffa_mem_xfer(const u64 func_id, 419 + struct arm_smccc_res *res, 420 + struct kvm_cpu_context *ctxt) 421 + { 422 + DECLARE_REG(u32, len, ctxt, 1); 423 + DECLARE_REG(u32, fraglen, ctxt, 2); 424 + DECLARE_REG(u64, addr_mbz, ctxt, 3); 425 + DECLARE_REG(u32, npages_mbz, ctxt, 4); 426 + struct ffa_composite_mem_region *reg; 427 + struct ffa_mem_region *buf; 428 + u32 offset, nr_ranges; 429 + int ret = 0; 430 + 431 + BUILD_BUG_ON(func_id != FFA_FN64_MEM_SHARE && 432 + func_id != FFA_FN64_MEM_LEND); 433 + 434 + if (addr_mbz || npages_mbz || fraglen > len || 435 + fraglen > KVM_FFA_MBOX_NR_PAGES * PAGE_SIZE) { 436 + ret = FFA_RET_INVALID_PARAMETERS; 437 + goto out; 438 + } 439 + 440 + if (fraglen < sizeof(struct ffa_mem_region) + 441 + sizeof(struct ffa_mem_region_attributes)) { 442 + ret = FFA_RET_INVALID_PARAMETERS; 443 + goto out; 444 + } 445 + 446 + hyp_spin_lock(&host_buffers.lock); 447 + if (!host_buffers.tx) { 448 + ret = FFA_RET_INVALID_PARAMETERS; 449 + goto out_unlock; 450 + } 451 + 452 + buf = hyp_buffers.tx; 453 + memcpy(buf, host_buffers.tx, fraglen); 454 + 455 + offset = buf->ep_mem_access[0].composite_off; 456 + if (!offset || buf->ep_count != 1 || buf->sender_id != HOST_FFA_ID) { 457 + ret = FFA_RET_INVALID_PARAMETERS; 458 + goto out_unlock; 459 + } 460 + 461 + if (fraglen < offset + sizeof(struct ffa_composite_mem_region)) { 462 + ret = FFA_RET_INVALID_PARAMETERS; 463 + goto out_unlock; 464 + } 465 + 466 + reg = (void *)buf + offset; 467 + nr_ranges = ((void *)buf + fraglen) - (void *)reg->constituents; 468 + if (nr_ranges % sizeof(reg->constituents[0])) { 469 + ret = FFA_RET_INVALID_PARAMETERS; 470 + goto out_unlock; 471 + } 472 + 
473 + nr_ranges /= sizeof(reg->constituents[0]); 474 + ret = ffa_host_share_ranges(reg->constituents, nr_ranges); 475 + if (ret) 476 + goto out_unlock; 477 + 478 + ffa_mem_xfer(res, func_id, len, fraglen); 479 + if (fraglen != len) { 480 + if (res->a0 != FFA_MEM_FRAG_RX) 481 + goto err_unshare; 482 + 483 + if (res->a3 != fraglen) 484 + goto err_unshare; 485 + } else if (res->a0 != FFA_SUCCESS) { 486 + goto err_unshare; 487 + } 488 + 489 + out_unlock: 490 + hyp_spin_unlock(&host_buffers.lock); 491 + out: 492 + if (ret) 493 + ffa_to_smccc_res(res, ret); 494 + return; 495 + 496 + err_unshare: 497 + WARN_ON(ffa_host_unshare_ranges(reg->constituents, nr_ranges)); 498 + goto out_unlock; 499 + } 500 + 501 + static void do_ffa_mem_reclaim(struct arm_smccc_res *res, 502 + struct kvm_cpu_context *ctxt) 503 + { 504 + DECLARE_REG(u32, handle_lo, ctxt, 1); 505 + DECLARE_REG(u32, handle_hi, ctxt, 2); 506 + DECLARE_REG(u32, flags, ctxt, 3); 507 + struct ffa_composite_mem_region *reg; 508 + u32 offset, len, fraglen, fragoff; 509 + struct ffa_mem_region *buf; 510 + int ret = 0; 511 + u64 handle; 512 + 513 + handle = PACK_HANDLE(handle_lo, handle_hi); 514 + 515 + hyp_spin_lock(&host_buffers.lock); 516 + 517 + buf = hyp_buffers.tx; 518 + *buf = (struct ffa_mem_region) { 519 + .sender_id = HOST_FFA_ID, 520 + .handle = handle, 521 + }; 522 + 523 + ffa_retrieve_req(res, sizeof(*buf)); 524 + buf = hyp_buffers.rx; 525 + if (res->a0 != FFA_MEM_RETRIEVE_RESP) 526 + goto out_unlock; 527 + 528 + len = res->a1; 529 + fraglen = res->a2; 530 + 531 + offset = buf->ep_mem_access[0].composite_off; 532 + /* 533 + * We can trust the SPMD to get this right, but let's at least 534 + * check that we end up with something that doesn't look _completely_ 535 + * bogus. 
536 + */ 537 + if (WARN_ON(offset > len || 538 + fraglen > KVM_FFA_MBOX_NR_PAGES * PAGE_SIZE)) { 539 + ret = FFA_RET_ABORTED; 540 + goto out_unlock; 541 + } 542 + 543 + if (len > ffa_desc_buf.len) { 544 + ret = FFA_RET_NO_MEMORY; 545 + goto out_unlock; 546 + } 547 + 548 + buf = ffa_desc_buf.buf; 549 + memcpy(buf, hyp_buffers.rx, fraglen); 550 + 551 + for (fragoff = fraglen; fragoff < len; fragoff += fraglen) { 552 + ffa_mem_frag_rx(res, handle_lo, handle_hi, fragoff); 553 + if (res->a0 != FFA_MEM_FRAG_TX) { 554 + ret = FFA_RET_INVALID_PARAMETERS; 555 + goto out_unlock; 556 + } 557 + 558 + fraglen = res->a3; 559 + memcpy((void *)buf + fragoff, hyp_buffers.rx, fraglen); 560 + } 561 + 562 + ffa_mem_reclaim(res, handle_lo, handle_hi, flags); 563 + if (res->a0 != FFA_SUCCESS) 564 + goto out_unlock; 565 + 566 + reg = (void *)buf + offset; 567 + /* If the SPMD was happy, then we should be too. */ 568 + WARN_ON(ffa_host_unshare_ranges(reg->constituents, 569 + reg->addr_range_cnt)); 570 + out_unlock: 571 + hyp_spin_unlock(&host_buffers.lock); 572 + 573 + if (ret) 574 + ffa_to_smccc_res(res, ret); 575 + } 576 + 577 + /* 578 + * Is a given FFA function supported, either by forwarding on directly 579 + * or by handling at EL2? 
580 + */ 581 + static bool ffa_call_supported(u64 func_id) 582 + { 583 + switch (func_id) { 584 + /* Unsupported memory management calls */ 585 + case FFA_FN64_MEM_RETRIEVE_REQ: 586 + case FFA_MEM_RETRIEVE_RESP: 587 + case FFA_MEM_RELINQUISH: 588 + case FFA_MEM_OP_PAUSE: 589 + case FFA_MEM_OP_RESUME: 590 + case FFA_MEM_FRAG_RX: 591 + case FFA_FN64_MEM_DONATE: 592 + /* Indirect message passing via RX/TX buffers */ 593 + case FFA_MSG_SEND: 594 + case FFA_MSG_POLL: 595 + case FFA_MSG_WAIT: 596 + /* 32-bit variants of 64-bit calls */ 597 + case FFA_MSG_SEND_DIRECT_REQ: 598 + case FFA_MSG_SEND_DIRECT_RESP: 599 + case FFA_RXTX_MAP: 600 + case FFA_MEM_DONATE: 601 + case FFA_MEM_RETRIEVE_REQ: 602 + return false; 603 + } 604 + 605 + return true; 606 + } 607 + 608 + static bool do_ffa_features(struct arm_smccc_res *res, 609 + struct kvm_cpu_context *ctxt) 610 + { 611 + DECLARE_REG(u32, id, ctxt, 1); 612 + u64 prop = 0; 613 + int ret = 0; 614 + 615 + if (!ffa_call_supported(id)) { 616 + ret = FFA_RET_NOT_SUPPORTED; 617 + goto out_handled; 618 + } 619 + 620 + switch (id) { 621 + case FFA_MEM_SHARE: 622 + case FFA_FN64_MEM_SHARE: 623 + case FFA_MEM_LEND: 624 + case FFA_FN64_MEM_LEND: 625 + ret = FFA_RET_SUCCESS; 626 + prop = 0; /* No support for dynamic buffers */ 627 + goto out_handled; 628 + default: 629 + return false; 630 + } 631 + 632 + out_handled: 633 + ffa_to_smccc_res_prop(res, ret, prop); 634 + return true; 635 + } 636 + 637 + bool kvm_host_ffa_handler(struct kvm_cpu_context *host_ctxt) 638 + { 639 + DECLARE_REG(u64, func_id, host_ctxt, 0); 640 + struct arm_smccc_res res; 641 + 642 + /* 643 + * There's no way we can tell what a non-standard SMC call might 644 + * be up to. Ideally, we would terminate these here and return 645 + * an error to the host, but sadly devices make use of custom 646 + * firmware calls for things like power management, debugging, 647 + * RNG access and crash reporting. 
648 + * 649 + * Given that the architecture requires us to trust EL3 anyway, 650 + * we forward unrecognised calls on under the assumption that 651 + * the firmware doesn't expose a mechanism to access arbitrary 652 + * non-secure memory. Short of a per-device table of SMCs, this 653 + * is the best we can do. 654 + */ 655 + if (!is_ffa_call(func_id)) 656 + return false; 657 + 658 + switch (func_id) { 659 + case FFA_FEATURES: 660 + if (!do_ffa_features(&res, host_ctxt)) 661 + return false; 662 + goto out_handled; 663 + /* Memory management */ 664 + case FFA_FN64_RXTX_MAP: 665 + do_ffa_rxtx_map(&res, host_ctxt); 666 + goto out_handled; 667 + case FFA_RXTX_UNMAP: 668 + do_ffa_rxtx_unmap(&res, host_ctxt); 669 + goto out_handled; 670 + case FFA_MEM_SHARE: 671 + case FFA_FN64_MEM_SHARE: 672 + do_ffa_mem_xfer(FFA_FN64_MEM_SHARE, &res, host_ctxt); 673 + goto out_handled; 674 + case FFA_MEM_RECLAIM: 675 + do_ffa_mem_reclaim(&res, host_ctxt); 676 + goto out_handled; 677 + case FFA_MEM_LEND: 678 + case FFA_FN64_MEM_LEND: 679 + do_ffa_mem_xfer(FFA_FN64_MEM_LEND, &res, host_ctxt); 680 + goto out_handled; 681 + case FFA_MEM_FRAG_TX: 682 + do_ffa_mem_frag_tx(&res, host_ctxt); 683 + goto out_handled; 684 + } 685 + 686 + if (ffa_call_supported(func_id)) 687 + return false; /* Pass through */ 688 + 689 + ffa_to_smccc_error(&res, FFA_RET_NOT_SUPPORTED); 690 + out_handled: 691 + ffa_set_retval(host_ctxt, &res); 692 + return true; 693 + } 694 + 695 + int hyp_ffa_init(void *pages) 696 + { 697 + struct arm_smccc_res res; 698 + size_t min_rxtx_sz; 699 + void *tx, *rx; 700 + 701 + if (kvm_host_psci_config.smccc_version < ARM_SMCCC_VERSION_1_2) 702 + return 0; 703 + 704 + arm_smccc_1_1_smc(FFA_VERSION, FFA_VERSION_1_0, 0, 0, 0, 0, 0, 0, &res); 705 + if (res.a0 == FFA_RET_NOT_SUPPORTED) 706 + return 0; 707 + 708 + if (res.a0 != FFA_VERSION_1_0) 709 + return -EOPNOTSUPP; 710 + 711 + arm_smccc_1_1_smc(FFA_ID_GET, 0, 0, 0, 0, 0, 0, 0, &res); 712 + if (res.a0 != FFA_SUCCESS) 713 + return 
-EOPNOTSUPP; 714 + 715 + if (res.a2 != HOST_FFA_ID) 716 + return -EINVAL; 717 + 718 + arm_smccc_1_1_smc(FFA_FEATURES, FFA_FN64_RXTX_MAP, 719 + 0, 0, 0, 0, 0, 0, &res); 720 + if (res.a0 != FFA_SUCCESS) 721 + return -EOPNOTSUPP; 722 + 723 + switch (res.a2) { 724 + case FFA_FEAT_RXTX_MIN_SZ_4K: 725 + min_rxtx_sz = SZ_4K; 726 + break; 727 + case FFA_FEAT_RXTX_MIN_SZ_16K: 728 + min_rxtx_sz = SZ_16K; 729 + break; 730 + case FFA_FEAT_RXTX_MIN_SZ_64K: 731 + min_rxtx_sz = SZ_64K; 732 + break; 733 + default: 734 + return -EINVAL; 735 + } 736 + 737 + if (min_rxtx_sz > PAGE_SIZE) 738 + return -EOPNOTSUPP; 739 + 740 + tx = pages; 741 + pages += KVM_FFA_MBOX_NR_PAGES * PAGE_SIZE; 742 + rx = pages; 743 + pages += KVM_FFA_MBOX_NR_PAGES * PAGE_SIZE; 744 + 745 + ffa_desc_buf = (struct kvm_ffa_descriptor_buffer) { 746 + .buf = pages, 747 + .len = PAGE_SIZE * 748 + (hyp_ffa_proxy_pages() - (2 * KVM_FFA_MBOX_NR_PAGES)), 749 + }; 750 + 751 + hyp_buffers = (struct kvm_ffa_buffers) { 752 + .lock = __HYP_SPIN_LOCK_UNLOCKED, 753 + .tx = tx, 754 + .rx = rx, 755 + }; 756 + 757 + host_buffers = (struct kvm_ffa_buffers) { 758 + .lock = __HYP_SPIN_LOCK_UNLOCKED, 759 + }; 760 + 761 + return 0; 762 + }
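The range helpers in ffa.c above share constituent ranges one at a time, stop at the first failure, and then undo exactly the prefix that succeeded. The sketch below shows that all-or-nothing pattern against a toy page table. The `shared[]` array, `struct range`, and every function name here are made up for illustration and do not reflect the real stage-2 bookkeeping.

```c
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

/* Toy ownership state: true means the page is already shared. */
static bool shared[NPAGES];

struct range {
	size_t first;	/* first page index */
	size_t count;	/* number of pages */
};

/* Share one range, refusing if it is out of bounds or partly shared. */
static bool share_one(const struct range *r)
{
	if (r->first > NPAGES || r->count > NPAGES - r->first)
		return false;

	for (size_t i = 0; i < r->count; i++)
		if (shared[r->first + i])
			return false;

	for (size_t i = 0; i < r->count; i++)
		shared[r->first + i] = true;

	return true;
}

static void unshare_one(const struct range *r)
{
	for (size_t i = 0; i < r->count; i++)
		shared[r->first + i] = false;
}

/* Returns how many leading ranges were shared before the first failure. */
static size_t share_prefix(const struct range *ranges, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!share_one(&ranges[i]))
			break;

	return i;
}

/* Share every range, or roll back the successful prefix and fail. */
static int share_all_or_nothing(const struct range *ranges, size_t n)
{
	size_t done = share_prefix(ranges, n);

	if (done == n)
		return 0;

	while (done--)
		unshare_one(&ranges[done]);

	return -1;
}
```

Returning the count of processed ranges, rather than a plain error code, is what lets the wrapper roll back precisely the work that was done.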

+35 -1

arch/arm64/kvm/hyp/nvhe/host.S

··· 10 10 #include <asm/kvm_arm.h> 11 11 #include <asm/kvm_asm.h> 12 12 #include <asm/kvm_mmu.h> 13 + #include <asm/kvm_ptrauth.h> 13 14 14 15 .text 15 16 ··· 38 37 39 38 /* Save the host context pointer in x29 across the function call */ 40 39 mov x29, x0 40 + 41 + #ifdef CONFIG_ARM64_PTR_AUTH_KERNEL 42 + alternative_if_not ARM64_HAS_ADDRESS_AUTH 43 + b __skip_pauth_save 44 + alternative_else_nop_endif 45 + 46 + alternative_if ARM64_KVM_PROTECTED_MODE 47 + /* Save kernel ptrauth keys. */ 48 + add x18, x29, #CPU_APIAKEYLO_EL1 49 + ptrauth_save_state x18, x19, x20 50 + 51 + /* Use hyp keys. */ 52 + adr_this_cpu x18, kvm_hyp_ctxt, x19 53 + add x18, x18, #CPU_APIAKEYLO_EL1 54 + ptrauth_restore_state x18, x19, x20 55 + isb 56 + alternative_else_nop_endif 57 + __skip_pauth_save: 58 + #endif /* CONFIG_ARM64_PTR_AUTH_KERNEL */ 59 + 41 60 bl handle_trap 42 61 43 - /* Restore host regs x0-x17 */ 44 62 __host_enter_restore_full: 63 + /* Restore kernel keys. */ 64 + #ifdef CONFIG_ARM64_PTR_AUTH_KERNEL 65 + alternative_if_not ARM64_HAS_ADDRESS_AUTH 66 + b __skip_pauth_restore 67 + alternative_else_nop_endif 68 + 69 + alternative_if ARM64_KVM_PROTECTED_MODE 70 + add x18, x29, #CPU_APIAKEYLO_EL1 71 + ptrauth_restore_state x18, x19, x20 72 + alternative_else_nop_endif 73 + __skip_pauth_restore: 74 + #endif /* CONFIG_ARM64_PTR_AUTH_KERNEL */ 75 + 76 + /* Restore host regs x0-x17 */ 45 77 ldp x0, x1, [x29, #CPU_XREG_OFFSET(0)] 46 78 ldp x2, x3, [x29, #CPU_XREG_OFFSET(2)] 47 79 ldp x4, x5, [x29, #CPU_XREG_OFFSET(4)]

+29 -3

arch/arm64/kvm/hyp/nvhe/hyp-init.S

··· 83 83 * x0: struct kvm_nvhe_init_params PA 84 84 */ 85 85 SYM_CODE_START_LOCAL(___kvm_hyp_init) 86 - ldr x1, [x0, #NVHE_INIT_TPIDR_EL2] 87 - msr tpidr_el2, x1 88 - 89 86 ldr x1, [x0, #NVHE_INIT_STACK_HYP_VA] 90 87 mov sp, x1 91 88 ··· 91 94 92 95 ldr x1, [x0, #NVHE_INIT_HCR_EL2] 93 96 msr hcr_el2, x1 97 + 98 + mov x2, #HCR_E2H 99 + and x2, x1, x2 100 + cbz x2, 1f 101 + 102 + // hVHE: Replay the EL2 setup to account for the E2H bit 103 + // TPIDR_EL2 is used to preserve x0 across the macro maze... 104 + isb 105 + msr tpidr_el2, x0 106 + init_el2_state 107 + finalise_el2_state 108 + mrs x0, tpidr_el2 109 + 110 + 1: 111 + ldr x1, [x0, #NVHE_INIT_TPIDR_EL2] 112 + msr tpidr_el2, x1 94 113 95 114 ldr x1, [x0, #NVHE_INIT_VTTBR] 96 115 msr vttbr_el2, x1 ··· 141 128 SCTLR_ELx_ENDA | SCTLR_ELx_ENDB) 142 129 orr x0, x0, x1 143 130 alternative_else_nop_endif 131 + 132 + #ifdef CONFIG_ARM64_BTI_KERNEL 133 + alternative_if ARM64_BTI 134 + orr x0, x0, #SCTLR_EL2_BT 135 + alternative_else_nop_endif 136 + #endif /* CONFIG_ARM64_BTI_KERNEL */ 137 + 144 138 msr sctlr_el2, x0 145 139 isb 146 140 ··· 204 184 /* Initialize EL2 CPU state to sane values. */ 205 185 init_el2_state // Clobbers x0..x2 206 186 finalise_el2_state 187 + __init_el2_nvhe_prepare_eret 207 188 208 189 /* Enable MMU, set vectors and stack. */ 209 190 mov x0, x28 ··· 217 196 SYM_CODE_END(__kvm_hyp_init_cpu) 218 197 219 198 SYM_CODE_START(__kvm_handle_stub_hvc) 199 + /* 200 + * __kvm_handle_stub_hvc called from __host_hvc through branch instruction(br) so 201 + * we need bti j at beginning. 202 + */ 203 + bti j 220 204 cmp x0, #HVC_SOFT_RESTART 221 205 b.ne 1f 222 206

+18 -1

arch/arm64/kvm/hyp/nvhe/hyp-main.c

··· 13 13 #include <asm/kvm_hyp.h> 14 14 #include <asm/kvm_mmu.h> 15 15 16 + #include <nvhe/ffa.h> 16 17 #include <nvhe/mem_protect.h> 17 18 #include <nvhe/mm.h> 18 19 #include <nvhe/pkvm.h> ··· 124 123 DECLARE_REG(int, level, host_ctxt, 3); 125 124 126 125 __kvm_tlb_flush_vmid_ipa(kern_hyp_va(mmu), ipa, level); 126 + } 127 + 128 + static void handle___kvm_tlb_flush_vmid_ipa_nsh(struct kvm_cpu_context *host_ctxt) 129 + { 130 + DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1); 131 + DECLARE_REG(phys_addr_t, ipa, host_ctxt, 2); 132 + DECLARE_REG(int, level, host_ctxt, 3); 133 + 134 + __kvm_tlb_flush_vmid_ipa_nsh(kern_hyp_va(mmu), ipa, level); 127 135 } 128 136 129 137 static void handle___kvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt) ··· 325 315 HANDLE_FUNC(__kvm_vcpu_run), 326 316 HANDLE_FUNC(__kvm_flush_vm_context), 327 317 HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa), 318 + HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa_nsh), 328 319 HANDLE_FUNC(__kvm_tlb_flush_vmid), 329 320 HANDLE_FUNC(__kvm_flush_cpu_context), 330 321 HANDLE_FUNC(__kvm_timer_set_cntvoff), ··· 385 374 386 375 handled = kvm_host_psci_handler(host_ctxt); 387 376 if (!handled) 377 + handled = kvm_host_ffa_handler(host_ctxt); 378 + if (!handled) 388 379 default_host_smc_handler(host_ctxt); 389 380 390 381 /* SMC was trapped, move ELR past the current PC. */ ··· 405 392 handle_host_smc(host_ctxt); 406 393 break; 407 394 case ESR_ELx_EC_SVE: 408 - sysreg_clear_set(cptr_el2, CPTR_EL2_TZ, 0); 395 + if (has_hvhe()) 396 + sysreg_clear_set(cpacr_el1, 0, (CPACR_EL1_ZEN_EL1EN | 397 + CPACR_EL1_ZEN_EL0EN)); 398 + else 399 + sysreg_clear_set(cptr_el2, CPTR_EL2_TZ, 0); 409 400 isb(); 410 401 sve_cond_update_zcr_vq(ZCR_ELx_LEN_MASK, SYS_ZCR_EL2); 411 402 break;
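The SMC path in the hyp-main.c diff above tries the PSCI handler first, then the new FF-A handler, and only then falls back to the default forwarding path. A generic sketch of such a first-claim chain is shown below. The handler type, the toy predicates, and the function-number ranges they test are simplified assumptions for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A handler returns true when it has claimed (and dealt with) the call. */
typedef bool (*smc_handler_fn)(uint32_t func_id);

/* Toy predicates keyed on the function number only. */
static bool toy_psci(uint32_t func_id)
{
	return (func_id & 0xffff) <= 0x1f;
}

static bool toy_ffa(uint32_t func_id)
{
	uint32_t num = func_id & 0xffff;

	return num >= 0x60 && num <= 0x7f;
}

/*
 * Walk the chain in order and return the index of the first handler that
 * claims the call, or -1 when nobody does and the caller should forward it.
 */
static int first_claiming_handler(const smc_handler_fn *chain, size_t n,
				  uint32_t func_id)
{
	for (size_t i = 0; i < n; i++)
		if (chain[i](func_id))
			return (int)i;

	return -1;
}
```

Ordering matters in such a chain: an earlier handler always gets the first chance to claim a call, which is why the FF-A proxy is slotted in after PSCI but before the catch-all.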

+71 -3

arch/arm64/kvm/hyp/nvhe/mem_protect.c

··· 91 91 hyp_put_page(&host_s2_pool, addr); 92 92 } 93 93 94 - static void host_s2_free_removed_table(void *addr, u32 level) 94 + static void host_s2_free_unlinked_table(void *addr, u32 level) 95 95 { 96 - kvm_pgtable_stage2_free_removed(&host_mmu.mm_ops, addr, level); 96 + kvm_pgtable_stage2_free_unlinked(&host_mmu.mm_ops, addr, level); 97 97 } 98 98 99 99 static int prepare_s2_pool(void *pgt_pool_base) ··· 110 110 host_mmu.mm_ops = (struct kvm_pgtable_mm_ops) { 111 111 .zalloc_pages_exact = host_s2_zalloc_pages_exact, 112 112 .zalloc_page = host_s2_zalloc_page, 113 - .free_removed_table = host_s2_free_removed_table, 113 + .free_unlinked_table = host_s2_free_unlinked_table, 114 114 .phys_to_virt = hyp_phys_to_virt, 115 115 .virt_to_phys = hyp_virt_to_phys, 116 116 .page_count = hyp_page_count, ··· 842 842 case PKVM_ID_HYP: 843 843 ret = hyp_ack_share(completer_addr, tx, share->completer_prot); 844 844 break; 845 + case PKVM_ID_FFA: 846 + /* 847 + * We only check the host; the secure side will check the other 848 + * end when we forward the FFA call. 849 + */ 850 + ret = 0; 851 + break; 845 852 default: 846 853 ret = -EINVAL; 847 854 } ··· 876 869 switch (tx->completer.id) { 877 870 case PKVM_ID_HYP: 878 871 ret = hyp_complete_share(completer_addr, tx, share->completer_prot); 872 + break; 873 + case PKVM_ID_FFA: 874 + /* 875 + * We're not responsible for any secure page-tables, so there's 876 + * nothing to do here. 
877 + */ 878 + ret = 0; 879 879 break; 880 880 default: 881 881 ret = -EINVAL; ··· 932 918 case PKVM_ID_HYP: 933 919 ret = hyp_ack_unshare(completer_addr, tx); 934 920 break; 921 + case PKVM_ID_FFA: 922 + /* See check_share() */ 923 + ret = 0; 924 + break; 935 925 default: 936 926 ret = -EINVAL; 937 927 } ··· 963 945 switch (tx->completer.id) { 964 946 case PKVM_ID_HYP: 965 947 ret = hyp_complete_unshare(completer_addr, tx); 948 + break; 949 + case PKVM_ID_FFA: 950 + /* See __do_share() */ 951 + ret = 0; 966 952 break; 967 953 default: 968 954 ret = -EINVAL; ··· 1256 1234 1257 1235 hyp_unlock_component(); 1258 1236 host_unlock_component(); 1237 + } 1238 + 1239 + int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages) 1240 + { 1241 + int ret; 1242 + struct pkvm_mem_share share = { 1243 + .tx = { 1244 + .nr_pages = nr_pages, 1245 + .initiator = { 1246 + .id = PKVM_ID_HOST, 1247 + .addr = hyp_pfn_to_phys(pfn), 1248 + }, 1249 + .completer = { 1250 + .id = PKVM_ID_FFA, 1251 + }, 1252 + }, 1253 + }; 1254 + 1255 + host_lock_component(); 1256 + ret = do_share(&share); 1257 + host_unlock_component(); 1258 + 1259 + return ret; 1260 + } 1261 + 1262 + int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages) 1263 + { 1264 + int ret; 1265 + struct pkvm_mem_share share = { 1266 + .tx = { 1267 + .nr_pages = nr_pages, 1268 + .initiator = { 1269 + .id = PKVM_ID_HOST, 1270 + .addr = hyp_pfn_to_phys(pfn), 1271 + }, 1272 + .completer = { 1273 + .id = PKVM_ID_FFA, 1274 + }, 1275 + }, 1276 + }; 1277 + 1278 + host_lock_component(); 1279 + ret = do_unshare(&share); 1280 + host_unlock_component(); 1281 + 1282 + return ret; 1259 1283 }

+21 -6

arch/arm64/kvm/hyp/nvhe/pkvm.c

··· 27 27 u64 hcr_set = HCR_RW; 28 28 u64 hcr_clear = 0; 29 29 u64 cptr_set = 0; 30 + u64 cptr_clear = 0; 30 31 31 32 /* Protected KVM does not support AArch32 guests. */ 32 33 BUILD_BUG_ON(FIELD_GET(ARM64_FEATURE_MASK(ID_AA64PFR0_EL1_EL0), ··· 44 43 BUILD_BUG_ON(!FIELD_GET(ARM64_FEATURE_MASK(ID_AA64PFR0_EL1_AdvSIMD), 45 44 PVM_ID_AA64PFR0_ALLOW)); 46 45 46 + if (has_hvhe()) 47 + hcr_set |= HCR_E2H; 48 + 47 49 /* Trap RAS unless all current versions are supported */ 48 50 if (FIELD_GET(ARM64_FEATURE_MASK(ID_AA64PFR0_EL1_RAS), feature_ids) < 49 51 ID_AA64PFR0_EL1_RAS_V1P1) { ··· 61 57 } 62 58 63 59 /* Trap SVE */ 64 - if (!FIELD_GET(ARM64_FEATURE_MASK(ID_AA64PFR0_EL1_SVE), feature_ids)) 65 - cptr_set |= CPTR_EL2_TZ; 60 + if (!FIELD_GET(ARM64_FEATURE_MASK(ID_AA64PFR0_EL1_SVE), feature_ids)) { 61 + if (has_hvhe()) 62 + cptr_clear |= CPACR_EL1_ZEN_EL0EN | CPACR_EL1_ZEN_EL1EN; 63 + else 64 + cptr_set |= CPTR_EL2_TZ; 65 + } 66 66 67 67 vcpu->arch.hcr_el2 |= hcr_set; 68 68 vcpu->arch.hcr_el2 &= ~hcr_clear; 69 69 vcpu->arch.cptr_el2 |= cptr_set; 70 + vcpu->arch.cptr_el2 &= ~cptr_clear; 70 71 } 71 72 72 73 /* ··· 129 120 mdcr_set |= MDCR_EL2_TTRF; 130 121 131 122 /* Trap Trace */ 132 - if (!FIELD_GET(ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_TraceVer), feature_ids)) 133 - cptr_set |= CPTR_EL2_TTA; 123 + if (!FIELD_GET(ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_TraceVer), feature_ids)) { 124 + if (has_hvhe()) 125 + cptr_set |= CPACR_EL1_TTA; 126 + else 127 + cptr_set |= CPTR_EL2_TTA; 128 + } 134 129 135 130 vcpu->arch.mdcr_el2 |= mdcr_set; 136 131 vcpu->arch.mdcr_el2 &= ~mdcr_clear; ··· 189 176 /* Clear res0 and set res1 bits to trap potential new features. 
*/ 190 177 vcpu->arch.hcr_el2 &= ~(HCR_RES0); 191 178 vcpu->arch.mdcr_el2 &= ~(MDCR_EL2_RES0); 192 - vcpu->arch.cptr_el2 |= CPTR_NVHE_EL2_RES1; 193 - vcpu->arch.cptr_el2 &= ~(CPTR_NVHE_EL2_RES0); 179 + if (!has_hvhe()) { 180 + vcpu->arch.cptr_el2 |= CPTR_NVHE_EL2_RES1; 181 + vcpu->arch.cptr_el2 &= ~(CPTR_NVHE_EL2_RES0); 182 + } 194 183 } 195 184 196 185 /*

+11

arch/arm64/kvm/hyp/nvhe/setup.c

··· 11 11 #include <asm/kvm_pkvm.h> 12 12 13 13 #include <nvhe/early_alloc.h> 14 + #include <nvhe/ffa.h> 14 15 #include <nvhe/fixed_config.h> 15 16 #include <nvhe/gfp.h> 16 17 #include <nvhe/memory.h> ··· 29 28 static void *vm_table_base; 30 29 static void *hyp_pgt_base; 31 30 static void *host_s2_pgt_base; 31 + static void *ffa_proxy_pages; 32 32 static struct kvm_pgtable_mm_ops pkvm_pgtable_mm_ops; 33 33 static struct hyp_pool hpool; 34 34 ··· 57 55 nr_pages = host_s2_pgtable_pages(); 58 56 host_s2_pgt_base = hyp_early_alloc_contig(nr_pages); 59 57 if (!host_s2_pgt_base) 58 + return -ENOMEM; 59 + 60 + nr_pages = hyp_ffa_proxy_pages(); 61 + ffa_proxy_pages = hyp_early_alloc_contig(nr_pages); 62 + if (!ffa_proxy_pages) 60 63 return -ENOMEM; 61 64 62 65 return 0; ··· 318 311 goto out; 319 312 320 313 ret = hyp_create_pcpu_fixmap(); 314 + if (ret) 315 + goto out; 316 + 317 + ret = hyp_ffa_init(ffa_proxy_pages); 321 318 if (ret) 322 319 goto out; 323 320

+16 -12

arch/arm64/kvm/hyp/nvhe/switch.c

··· 44 44 __activate_traps_common(vcpu); 45 45 46 46 val = vcpu->arch.cptr_el2; 47 - val |= CPTR_EL2_TTA | CPTR_EL2_TAM; 47 + val |= CPTR_EL2_TAM; /* Same bit irrespective of E2H */ 48 + val |= has_hvhe() ? CPACR_EL1_TTA : CPTR_EL2_TTA; 49 + if (cpus_have_final_cap(ARM64_SME)) { 50 + if (has_hvhe()) 51 + val &= ~(CPACR_EL1_SMEN_EL1EN | CPACR_EL1_SMEN_EL0EN); 52 + else 53 + val |= CPTR_EL2_TSM; 54 + } 55 + 48 56 if (!guest_owns_fp_regs(vcpu)) { 49 - val |= CPTR_EL2_TFP | CPTR_EL2_TZ; 57 + if (has_hvhe()) 58 + val &= ~(CPACR_EL1_FPEN_EL0EN | CPACR_EL1_FPEN_EL1EN | 59 + CPACR_EL1_ZEN_EL0EN | CPACR_EL1_ZEN_EL1EN); 60 + else 61 + val |= CPTR_EL2_TFP | CPTR_EL2_TZ; 62 + 50 63 __activate_traps_fpsimd32(vcpu); 51 64 } 52 - if (cpus_have_final_cap(ARM64_SME)) 53 - val |= CPTR_EL2_TSM; 54 65 55 66 write_sysreg(val, cptr_el2); 56 67 write_sysreg(__this_cpu_read(kvm_hyp_vector), vbar_el2); ··· 84 73 static void __deactivate_traps(struct kvm_vcpu *vcpu) 85 74 { 86 75 extern char __kvm_hyp_host_vector[]; 87 - u64 cptr; 88 76 89 77 ___deactivate_traps(vcpu); 90 78 ··· 108 98 109 99 write_sysreg(this_cpu_ptr(&kvm_init_params)->hcr_el2, hcr_el2); 110 100 111 - cptr = CPTR_EL2_DEFAULT; 112 - if (vcpu_has_sve(vcpu) && (vcpu->arch.fp_state == FP_STATE_GUEST_OWNED)) 113 - cptr |= CPTR_EL2_TZ; 114 - if (cpus_have_final_cap(ARM64_SME)) 115 - cptr &= ~CPTR_EL2_TSM; 116 - 117 - write_sysreg(cptr, cptr_el2); 101 + kvm_reset_cptr_el2(vcpu); 118 102 write_sysreg(__kvm_hyp_host_vector, vbar_el2); 119 103 } 120 104
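
Note on the hunk above: under hVHE, `cptr_el2` appears to follow the CPACR_EL1-style layout, so FP/SVE trapping is expressed by clearing enable bits rather than setting trap bits. The following is a minimal user-space sketch of that polarity flip, not the kernel code itself. The bit positions and the helper name `trap_fp_sve` are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only. */
#define CPTR_EL2_TZ          (UINT64_C(1) << 8)
#define CPTR_EL2_TFP         (UINT64_C(1) << 10)
#define CPACR_EL1_ZEN_EL1EN  (UINT64_C(1) << 16)
#define CPACR_EL1_ZEN_EL0EN  (UINT64_C(1) << 17)
#define CPACR_EL1_FPEN_EL1EN (UINT64_C(1) << 20)
#define CPACR_EL1_FPEN_EL0EN (UINT64_C(1) << 21)

/*
 * Hypothetical helper: compute the register value that traps FP and SVE
 * accesses, mirroring the !guest_owns_fp_regs() branch of the diff.
 */
static uint64_t trap_fp_sve(uint64_t val, bool hvhe)
{
	if (hvhe) {
		/* E2H=1 layout: a trap is the absence of the enable bits. */
		return val & ~(CPACR_EL1_FPEN_EL0EN | CPACR_EL1_FPEN_EL1EN |
			       CPACR_EL1_ZEN_EL0EN | CPACR_EL1_ZEN_EL1EN);
	}

	/* E2H=0 layout: a trap is the presence of the trap bits. */
	return val | CPTR_EL2_TFP | CPTR_EL2_TZ;
}
```

The same value therefore means opposite things in the two layouts, which is presumably why the diff branches on `has_hvhe()` at every site that touches these fields.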

+12 -4

arch/arm64/kvm/hyp/nvhe/timer-sr.c

··· 17 17 } 18 18 19 19 /* 20 - * Should only be called on non-VHE systems. 20 + * Should only be called on non-VHE or hVHE setups. 21 21 * VHE systems use EL2 timers and configure EL1 timers in kvm_timer_init_vhe(). 22 22 */ 23 23 void __timer_disable_traps(struct kvm_vcpu *vcpu) 24 24 { 25 - u64 val; 25 + u64 val, shift = 0; 26 + 27 + if (has_hvhe()) 28 + shift = 10; 26 29 27 30 /* Allow physical timer/counter access for the host */ 28 31 val = read_sysreg(cnthctl_el2); 29 - val |= CNTHCTL_EL1PCTEN | CNTHCTL_EL1PCEN; 32 + val |= (CNTHCTL_EL1PCTEN | CNTHCTL_EL1PCEN) << shift; 30 33 write_sysreg(val, cnthctl_el2); 31 34 } 32 35 33 36 /* 34 - * Should only be called on non-VHE systems. 37 + * Should only be called on non-VHE or hVHE setups. 35 38 * VHE systems use EL2 timers and configure EL1 timers in kvm_timer_init_vhe(). 36 39 */ 37 40 void __timer_enable_traps(struct kvm_vcpu *vcpu) ··· 52 49 set |= CNTHCTL_EL1PCTEN; 53 50 else 54 51 clr |= CNTHCTL_EL1PCTEN; 52 + 53 + if (has_hvhe()) { 54 + clr <<= 10; 55 + set <<= 10; 56 + } 55 57 56 58 sysreg_clear_set(cnthctl_el2, clr, set); 57 59 }
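
Note on the hunk above: the shift by 10 reflects that, with E2H set, the `EL1PCTEN`/`EL1PCEN` controls of `cnthctl_el2` sit 10 bits higher than in the non-VHE layout. A minimal sketch of the resulting value computation follows. The bit definitions and the helper name `timer_disable_traps_val` are assumptions for illustration, and the real code reads and writes the system register rather than taking a parameter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions for the E2H=0 layout; illustrative only. */
#define CNTHCTL_EL1PCTEN (UINT64_C(1) << 0)
#define CNTHCTL_EL1PCEN  (UINT64_C(1) << 1)

/*
 * Hypothetical helper: return the cnthctl_el2 value that allows the host
 * physical timer/counter access, as in __timer_disable_traps().
 */
static uint64_t timer_disable_traps_val(uint64_t cnthctl, bool hvhe)
{
	unsigned int shift = hvhe ? 10 : 0;

	return cnthctl | ((CNTHCTL_EL1PCTEN | CNTHCTL_EL1PCEN) << shift);
}
```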

+52

arch/arm64/kvm/hyp/nvhe/tlb.c

··· 130 130 __tlb_switch_to_host(&cxt); 131 131 } 132 132 133 + void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu, 134 + phys_addr_t ipa, int level) 135 + { 136 + struct tlb_inv_context cxt; 137 + 138 + /* Switch to requested VMID */ 139 + __tlb_switch_to_guest(mmu, &cxt, true); 140 + 141 + /* 142 + * We could do so much better if we had the VA as well. 143 + * Instead, we invalidate Stage-2 for this IPA, and the 144 + * whole of Stage-1. Weep... 145 + */ 146 + ipa >>= 12; 147 + __tlbi_level(ipas2e1, ipa, level); 148 + 149 + /* 150 + * We have to ensure completion of the invalidation at Stage-2, 151 + * since a table walk on another CPU could refill a TLB with a 152 + * complete (S1 + S2) walk based on the old Stage-2 mapping if 153 + * the Stage-1 invalidation happened first. 154 + */ 155 + dsb(nsh); 156 + __tlbi(vmalle1); 157 + dsb(nsh); 158 + isb(); 159 + 160 + /* 161 + * If the host is running at EL1 and we have a VPIPT I-cache, 162 + * then we must perform I-cache maintenance at EL2 in order for 163 + * it to have an effect on the guest. Since the guest cannot hit 164 + * I-cache lines allocated with a different VMID, we don't need 165 + * to worry about junk out of guest reset (we nuke the I-cache on 166 + * VMID rollover), but we do need to be careful when remapping 167 + * executable pages for the same guest. This can happen when KSM 168 + * takes a CoW fault on an executable page, copies the page into 169 + * a page that was previously mapped in the guest and then needs 170 + * to invalidate the guest view of the I-cache for that page 171 + * from EL1. To solve this, we invalidate the entire I-cache when 172 + * unmapping a page from a guest if we have a VPIPT I-cache but 173 + * the host is running at EL1. As above, we could do better if 174 + * we had the VA. 175 + * 176 + * The moral of this story is: if you have a VPIPT I-cache, then 177 + * you should be running with VHE enabled. 
178 + */ 179 + if (icache_is_vpipt()) 180 + icache_inval_all_pou(); 181 + 182 + __tlb_switch_to_host(&cxt); 183 + } 184 + 133 185 void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu) 134 186 { 135 187 struct tlb_inv_context cxt;

+207 -21

arch/arm64/kvm/hyp/pgtable.c

··· 21 21 22 22 #define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX GENMASK(4, 2) 23 23 #define KVM_PTE_LEAF_ATTR_LO_S1_AP GENMASK(7, 6) 24 - #define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO 3 25 - #define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW 1 24 + #define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO \ 25 + ({ cpus_have_final_cap(ARM64_KVM_HVHE) ? 2 : 3; }) 26 + #define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW \ 27 + ({ cpus_have_final_cap(ARM64_KVM_HVHE) ? 0 : 1; }) 26 28 #define KVM_PTE_LEAF_ATTR_LO_S1_SH GENMASK(9, 8) 27 29 #define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS 3 28 30 #define KVM_PTE_LEAF_ATTR_LO_S1_AF BIT(10) ··· 36 34 #define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS 3 37 35 #define KVM_PTE_LEAF_ATTR_LO_S2_AF BIT(10) 38 36 39 - #define KVM_PTE_LEAF_ATTR_HI GENMASK(63, 51) 37 + #define KVM_PTE_LEAF_ATTR_HI GENMASK(63, 50) 40 38 41 39 #define KVM_PTE_LEAF_ATTR_HI_SW GENMASK(58, 55) 42 40 43 41 #define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54) 44 42 45 43 #define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54) 44 + 45 + #define KVM_PTE_LEAF_ATTR_HI_S1_GP BIT(50) 46 46 47 47 #define KVM_PTE_LEAF_ATTR_S2_PERMS (KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \ 48 48 KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \ ··· 66 62 u64 addr; 67 63 const u64 end; 68 64 }; 65 + 66 + static bool kvm_pgtable_walk_skip_bbm_tlbi(const struct kvm_pgtable_visit_ctx *ctx) 67 + { 68 + return unlikely(ctx->flags & KVM_PGTABLE_WALK_SKIP_BBM_TLBI); 69 + } 70 + 71 + static bool kvm_pgtable_walk_skip_cmo(const struct kvm_pgtable_visit_ctx *ctx) 72 + { 73 + return unlikely(ctx->flags & KVM_PGTABLE_WALK_SKIP_CMO); 74 + } 69 75 70 76 static bool kvm_phys_is_valid(u64 phys) 71 77 { ··· 400 386 401 387 if (device) 402 388 return -EINVAL; 389 + 390 + if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) && system_supports_bti()) 391 + attr |= KVM_PTE_LEAF_ATTR_HI_S1_GP; 403 392 } else { 404 393 attr |= KVM_PTE_LEAF_ATTR_HI_S1_XN; 405 394 } ··· 640 623 #ifdef CONFIG_ARM64_HW_AFDBM 641 624 /* 642 625 * Enable the Hardware Access Flag management, unconditionally 643 - * on all CPUs. 
The features is RES0 on CPUs without the support 644 - * and must be ignored by the CPUs. 626 + * on all CPUs. In systems that have asymmetric support for the feature 627 + * this allows KVM to leverage hardware support on the subset of cores 628 + * that implement the feature. 629 + * 630 + * The architecture requires VTCR_EL2.HA to be RES0 (thus ignored by 631 + * hardware) on implementations that do not advertise support for the 632 + * feature. As such, setting HA unconditionally is safe, unless you 633 + * happen to be running on a design that has unadvertised support for 634 + * HAFDBS. Here be dragons. 645 635 */ 646 - vtcr |= VTCR_EL2_HA; 636 + if (!cpus_have_final_cap(ARM64_WORKAROUND_AMPERE_AC03_CPU_38)) 637 + vtcr |= VTCR_EL2_HA; 647 638 #endif /* CONFIG_ARM64_HW_AFDBM */ 648 639 649 640 /* Set the vmid bits */ ··· 780 755 if (!stage2_try_set_pte(ctx, KVM_INVALID_PTE_LOCKED)) 781 756 return false; 782 757 783 - /* 784 - * Perform the appropriate TLB invalidation based on the evicted pte 785 - * value (if any). 786 - */ 787 - if (kvm_pte_table(ctx->old, ctx->level)) 788 - kvm_call_hyp(__kvm_tlb_flush_vmid, mmu); 789 - else if (kvm_pte_valid(ctx->old)) 790 - kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level); 758 + if (!kvm_pgtable_walk_skip_bbm_tlbi(ctx)) { 759 + /* 760 + * Perform the appropriate TLB invalidation based on the 761 + * evicted pte value (if any). 
762 + */ 763 + if (kvm_pte_table(ctx->old, ctx->level)) 764 + kvm_call_hyp(__kvm_tlb_flush_vmid, mmu); 765 + else if (kvm_pte_valid(ctx->old)) 766 + kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, 767 + ctx->addr, ctx->level); 768 + } 791 769 792 770 if (stage2_pte_is_counted(ctx->old)) 793 771 mm_ops->put_page(ctx->ptep); ··· 897 869 return -EAGAIN; 898 870 899 871 /* Perform CMOs before installation of the guest stage-2 PTE */ 900 - if (mm_ops->dcache_clean_inval_poc && stage2_pte_cacheable(pgt, new)) 872 + if (!kvm_pgtable_walk_skip_cmo(ctx) && mm_ops->dcache_clean_inval_poc && 873 + stage2_pte_cacheable(pgt, new)) 901 874 mm_ops->dcache_clean_inval_poc(kvm_pte_follow(new, mm_ops), 902 - granule); 875 + granule); 903 876 904 - if (mm_ops->icache_inval_pou && stage2_pte_executable(new)) 877 + if (!kvm_pgtable_walk_skip_cmo(ctx) && mm_ops->icache_inval_pou && 878 + stage2_pte_executable(new)) 905 879 mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule); 906 880 907 881 stage2_make_pte(ctx, new); ··· 925 895 if (ret) 926 896 return ret; 927 897 928 - mm_ops->free_removed_table(childp, ctx->level); 898 + mm_ops->free_unlinked_table(childp, ctx->level); 929 899 return 0; 930 900 } 931 901 ··· 970 940 * The TABLE_PRE callback runs for table entries on the way down, looking 971 941 * for table entries which we could conceivably replace with a block entry 972 942 * for this mapping. If it finds one it replaces the entry and calls 973 - * kvm_pgtable_mm_ops::free_removed_table() to tear down the detached table. 943 + * kvm_pgtable_mm_ops::free_unlinked_table() to tear down the detached table. 974 944 * 975 945 * Otherwise, the LEAF callback performs the mapping at the existing leaves 976 946 * instead. 
··· 1239 1209 KVM_PGTABLE_WALK_HANDLE_FAULT | 1240 1210 KVM_PGTABLE_WALK_SHARED); 1241 1211 if (!ret) 1242 - kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, pgt->mmu, addr, level); 1212 + kvm_call_hyp(__kvm_tlb_flush_vmid_ipa_nsh, pgt->mmu, addr, level); 1243 1213 return ret; 1244 1214 } 1245 1215 ··· 1272 1242 return kvm_pgtable_walk(pgt, addr, size, &walker); 1273 1243 } 1274 1244 1245 + kvm_pte_t *kvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt, 1246 + u64 phys, u32 level, 1247 + enum kvm_pgtable_prot prot, 1248 + void *mc, bool force_pte) 1249 + { 1250 + struct stage2_map_data map_data = { 1251 + .phys = phys, 1252 + .mmu = pgt->mmu, 1253 + .memcache = mc, 1254 + .force_pte = force_pte, 1255 + }; 1256 + struct kvm_pgtable_walker walker = { 1257 + .cb = stage2_map_walker, 1258 + .flags = KVM_PGTABLE_WALK_LEAF | 1259 + KVM_PGTABLE_WALK_SKIP_BBM_TLBI | 1260 + KVM_PGTABLE_WALK_SKIP_CMO, 1261 + .arg = &map_data, 1262 + }; 1263 + /* 1264 + * The input address (.addr) is irrelevant for walking an 1265 + * unlinked table. Construct an ambiguous IA range to map 1266 + * kvm_granule_size(level) worth of memory. 
1267 + */ 1268 + struct kvm_pgtable_walk_data data = { 1269 + .walker = &walker, 1270 + .addr = 0, 1271 + .end = kvm_granule_size(level), 1272 + }; 1273 + struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops; 1274 + kvm_pte_t *pgtable; 1275 + int ret; 1276 + 1277 + if (!IS_ALIGNED(phys, kvm_granule_size(level))) 1278 + return ERR_PTR(-EINVAL); 1279 + 1280 + ret = stage2_set_prot_attr(pgt, prot, &map_data.attr); 1281 + if (ret) 1282 + return ERR_PTR(ret); 1283 + 1284 + pgtable = mm_ops->zalloc_page(mc); 1285 + if (!pgtable) 1286 + return ERR_PTR(-ENOMEM); 1287 + 1288 + ret = __kvm_pgtable_walk(&data, mm_ops, (kvm_pteref_t)pgtable, 1289 + level + 1); 1290 + if (ret) { 1291 + kvm_pgtable_stage2_free_unlinked(mm_ops, pgtable, level); 1292 + mm_ops->put_page(pgtable); 1293 + return ERR_PTR(ret); 1294 + } 1295 + 1296 + return pgtable; 1297 + } 1298 + 1299 + /* 1300 + * Get the number of page-tables needed to replace a block with a 1301 + * fully populated tree up to the PTE entries. Note that @level is 1302 + * interpreted as in "level @level entry". 
1303 + */ 1304 + static int stage2_block_get_nr_page_tables(u32 level) 1305 + { 1306 + switch (level) { 1307 + case 1: 1308 + return PTRS_PER_PTE + 1; 1309 + case 2: 1310 + return 1; 1311 + case 3: 1312 + return 0; 1313 + default: 1314 + WARN_ON_ONCE(level < KVM_PGTABLE_MIN_BLOCK_LEVEL || 1315 + level >= KVM_PGTABLE_MAX_LEVELS); 1316 + return -EINVAL; 1317 + }; 1318 + } 1319 + 1320 + static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx, 1321 + enum kvm_pgtable_walk_flags visit) 1322 + { 1323 + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops; 1324 + struct kvm_mmu_memory_cache *mc = ctx->arg; 1325 + struct kvm_s2_mmu *mmu; 1326 + kvm_pte_t pte = ctx->old, new, *childp; 1327 + enum kvm_pgtable_prot prot; 1328 + u32 level = ctx->level; 1329 + bool force_pte; 1330 + int nr_pages; 1331 + u64 phys; 1332 + 1333 + /* No huge-pages exist at the last level */ 1334 + if (level == KVM_PGTABLE_MAX_LEVELS - 1) 1335 + return 0; 1336 + 1337 + /* We only split valid block mappings */ 1338 + if (!kvm_pte_valid(pte)) 1339 + return 0; 1340 + 1341 + nr_pages = stage2_block_get_nr_page_tables(level); 1342 + if (nr_pages < 0) 1343 + return nr_pages; 1344 + 1345 + if (mc->nobjs >= nr_pages) { 1346 + /* Build a tree mapped down to the PTE granularity. */ 1347 + force_pte = true; 1348 + } else { 1349 + /* 1350 + * Don't force PTEs, so create_unlinked() below does 1351 + * not populate the tree up to the PTE level. The 1352 + * consequence is that the call will require a single 1353 + * page of level 2 entries at level 1, or a single 1354 + * page of PTEs at level 2. If we are at level 1, the 1355 + * PTEs will be created recursively. 
1356 + */ 1357 + force_pte = false; 1358 + nr_pages = 1; 1359 + } 1360 + 1361 + if (mc->nobjs < nr_pages) 1362 + return -ENOMEM; 1363 + 1364 + mmu = container_of(mc, struct kvm_s2_mmu, split_page_cache); 1365 + phys = kvm_pte_to_phys(pte); 1366 + prot = kvm_pgtable_stage2_pte_prot(pte); 1367 + 1368 + childp = kvm_pgtable_stage2_create_unlinked(mmu->pgt, phys, 1369 + level, prot, mc, force_pte); 1370 + if (IS_ERR(childp)) 1371 + return PTR_ERR(childp); 1372 + 1373 + if (!stage2_try_break_pte(ctx, mmu)) { 1374 + kvm_pgtable_stage2_free_unlinked(mm_ops, childp, level); 1375 + mm_ops->put_page(childp); 1376 + return -EAGAIN; 1377 + } 1378 + 1379 + /* 1380 + * Note, the contents of the page table are guaranteed to be made 1381 + * visible before the new PTE is assigned because stage2_make_pte() 1382 + * writes the PTE using smp_store_release(). 1383 + */ 1384 + new = kvm_init_table_pte(childp, mm_ops); 1385 + stage2_make_pte(ctx, new); 1386 + dsb(ishst); 1387 + return 0; 1388 + } 1389 + 1390 + int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size, 1391 + struct kvm_mmu_memory_cache *mc) 1392 + { 1393 + struct kvm_pgtable_walker walker = { 1394 + .cb = stage2_split_walker, 1395 + .flags = KVM_PGTABLE_WALK_LEAF, 1396 + .arg = mc, 1397 + }; 1398 + 1399 + return kvm_pgtable_walk(pgt, addr, size, &walker); 1400 + } 1275 1401 1276 1402 int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu, 1277 1403 struct kvm_pgtable_mm_ops *mm_ops, ··· 1497 1311 pgt->pgd = NULL; 1498 1312 } 1499 1313 1500 - void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level) 1314 + void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level) 1501 1315 { 1502 1316 kvm_pteref_t ptep = (kvm_pteref_t)pgtable; 1503 1317 struct kvm_pgtable_walker walker = {
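
Note on the split walker above: the decision between a fully populated subtree and a single table page depends only on the block level and on how many pages the memory cache holds. The sketch below restates that decision in plain C under stated assumptions: `PTRS_PER_PTE` is fixed at 512 (a 4KiB granule), and `struct split_plan` and `plan_split` are hypothetical names that do not exist in the kernel.

```c
#include <errno.h>
#include <stdbool.h>

#define PTRS_PER_PTE 512 /* assumption: 4KiB granule */

/* Page-table pages needed to split one block at @level down to PTEs. */
static int block_nr_page_tables(unsigned int level)
{
	switch (level) {
	case 1:
		return PTRS_PER_PTE + 1; /* one level-2 table plus 512 PTE tables */
	case 2:
		return 1;                /* a single PTE table */
	case 3:
		return 0;                /* already at PTE granularity */
	default:
		return -EINVAL;
	}
}

/* Hypothetical result type for the sketch. */
struct split_plan {
	int nr_pages;
	bool force_pte;
};

/* Mirror the cache check in stage2_split_walker(). */
static int plan_split(unsigned int level, int cache_objs, struct split_plan *plan)
{
	int nr = block_nr_page_tables(level);

	if (nr < 0)
		return nr;

	if (cache_objs >= nr) {
		/* Enough pages: build the tree down to PTEs in one go. */
		plan->force_pte = true;
	} else {
		/* Fall back to one table page; deeper levels split later. */
		plan->force_pte = false;
		nr = 1;
	}

	if (cache_objs < nr)
		return -ENOMEM;

	plan->nr_pages = nr;
	return 0;
}
```

With a cache of 10 pages, a level-1 block would be split one level only; the walk then revisits the new level-2 blocks, which is what the "created recursively" remark in the diff comment refers to.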

+1 -1

arch/arm64/kvm/hyp/vhe/switch.c

··· 84 84 */ 85 85 asm(ALTERNATIVE("nop", "isb", ARM64_WORKAROUND_SPECULATIVE_AT)); 86 86 87 - write_sysreg(CPACR_EL1_DEFAULT, cpacr_el1); 87 + kvm_reset_cptr_el2(vcpu); 88 88 89 89 if (!arm64_kernel_unmapped_at_el0()) 90 90 host_vectors = __this_cpu_read(this_cpu_vector);

+32

arch/arm64/kvm/hyp/vhe/tlb.c

··· 111 111 __tlb_switch_to_host(&cxt); 112 112 } 113 113 114 + void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu, 115 + phys_addr_t ipa, int level) 116 + { 117 + struct tlb_inv_context cxt; 118 + 119 + dsb(nshst); 120 + 121 + /* Switch to requested VMID */ 122 + __tlb_switch_to_guest(mmu, &cxt); 123 + 124 + /* 125 + * We could do so much better if we had the VA as well. 126 + * Instead, we invalidate Stage-2 for this IPA, and the 127 + * whole of Stage-1. Weep... 128 + */ 129 + ipa >>= 12; 130 + __tlbi_level(ipas2e1, ipa, level); 131 + 132 + /* 133 + * We have to ensure completion of the invalidation at Stage-2, 134 + * since a table walk on another CPU could refill a TLB with a 135 + * complete (S1 + S2) walk based on the old Stage-2 mapping if 136 + * the Stage-1 invalidation happened first. 137 + */ 138 + dsb(nsh); 139 + __tlbi(vmalle1); 140 + dsb(nsh); 141 + isb(); 142 + 143 + __tlb_switch_to_host(&cxt); 144 + } 145 + 114 146 void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu) 115 147 { 116 148 struct tlb_inv_context cxt;

+174 -35

arch/arm64/kvm/mmu.c

··· 31 31 32 32 static unsigned long __ro_after_init io_map_base; 33 33 34 - static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end) 34 + static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end, 35 + phys_addr_t size) 35 36 { 36 - phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL); 37 37 phys_addr_t boundary = ALIGN_DOWN(addr + size, size); 38 38 39 39 return (boundary - 1 < end - 1) ? boundary : end; 40 + } 41 + 42 + static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end) 43 + { 44 + phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL); 45 + 46 + return __stage2_range_addr_end(addr, end, size); 40 47 } 41 48 42 49 /* ··· 81 74 82 75 #define stage2_apply_range_resched(mmu, addr, end, fn) \ 83 76 stage2_apply_range(mmu, addr, end, fn, true) 77 + 78 + /* 79 + * Get the maximum number of page-tables pages needed to split a range 80 + * of blocks into PAGE_SIZE PTEs. It assumes the range is already 81 + * mapped at level 2, or at level 1 if allowed. 
82 + */ 83 + static int kvm_mmu_split_nr_page_tables(u64 range) 84 + { 85 + int n = 0; 86 + 87 + if (KVM_PGTABLE_MIN_BLOCK_LEVEL < 2) 88 + n += DIV_ROUND_UP(range, PUD_SIZE); 89 + n += DIV_ROUND_UP(range, PMD_SIZE); 90 + return n; 91 + } 92 + 93 + static bool need_split_memcache_topup_or_resched(struct kvm *kvm) 94 + { 95 + struct kvm_mmu_memory_cache *cache; 96 + u64 chunk_size, min; 97 + 98 + if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) 99 + return true; 100 + 101 + chunk_size = kvm->arch.mmu.split_page_chunk_size; 102 + min = kvm_mmu_split_nr_page_tables(chunk_size); 103 + cache = &kvm->arch.mmu.split_page_cache; 104 + return kvm_mmu_memory_cache_nr_free_objects(cache) < min; 105 + } 106 + 107 + static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr, 108 + phys_addr_t end) 109 + { 110 + struct kvm_mmu_memory_cache *cache; 111 + struct kvm_pgtable *pgt; 112 + int ret, cache_capacity; 113 + u64 next, chunk_size; 114 + 115 + lockdep_assert_held_write(&kvm->mmu_lock); 116 + 117 + chunk_size = kvm->arch.mmu.split_page_chunk_size; 118 + cache_capacity = kvm_mmu_split_nr_page_tables(chunk_size); 119 + 120 + if (chunk_size == 0) 121 + return 0; 122 + 123 + cache = &kvm->arch.mmu.split_page_cache; 124 + 125 + do { 126 + if (need_split_memcache_topup_or_resched(kvm)) { 127 + write_unlock(&kvm->mmu_lock); 128 + cond_resched(); 129 + /* Eager page splitting is best-effort. 
*/ 130 + ret = __kvm_mmu_topup_memory_cache(cache, 131 + cache_capacity, 132 + cache_capacity); 133 + write_lock(&kvm->mmu_lock); 134 + if (ret) 135 + break; 136 + } 137 + 138 + pgt = kvm->arch.mmu.pgt; 139 + if (!pgt) 140 + return -EINVAL; 141 + 142 + next = __stage2_range_addr_end(addr, end, chunk_size); 143 + ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache); 144 + if (ret) 145 + break; 146 + } while (addr = next, addr != end); 147 + 148 + return ret; 149 + } 84 150 85 151 static bool memslot_is_logging(struct kvm_memory_slot *memslot) 86 152 { ··· 211 131 212 132 static struct kvm_pgtable_mm_ops kvm_s2_mm_ops; 213 133 214 - static void stage2_free_removed_table_rcu_cb(struct rcu_head *head) 134 + static void stage2_free_unlinked_table_rcu_cb(struct rcu_head *head) 215 135 { 216 136 struct page *page = container_of(head, struct page, rcu_head); 217 137 void *pgtable = page_to_virt(page); 218 138 u32 level = page_private(page); 219 139 220 - kvm_pgtable_stage2_free_removed(&kvm_s2_mm_ops, pgtable, level); 140 + kvm_pgtable_stage2_free_unlinked(&kvm_s2_mm_ops, pgtable, level); 221 141 } 222 142 223 - static void stage2_free_removed_table(void *addr, u32 level) 143 + static void stage2_free_unlinked_table(void *addr, u32 level) 224 144 { 225 145 struct page *page = virt_to_page(addr); 226 146 227 147 set_page_private(page, (unsigned long)level); 228 - call_rcu(&page->rcu_head, stage2_free_removed_table_rcu_cb); 148 + call_rcu(&page->rcu_head, stage2_free_unlinked_table_rcu_cb); 229 149 } 230 150 231 151 static void kvm_host_get_page(void *addr) ··· 781 701 .zalloc_page = stage2_memcache_zalloc_page, 782 702 .zalloc_pages_exact = kvm_s2_zalloc_pages_exact, 783 703 .free_pages_exact = kvm_s2_free_pages_exact, 784 - .free_removed_table = stage2_free_removed_table, 704 + .free_unlinked_table = stage2_free_unlinked_table, 785 705 .get_page = kvm_host_get_page, 786 706 .put_page = kvm_s2_put_page, 787 707 .page_count = kvm_host_page_count, ··· 855 775 
for_each_possible_cpu(cpu) 856 776 *per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1; 857 777 778 + /* The eager page splitting is disabled by default */ 779 + mmu->split_page_chunk_size = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT; 780 + mmu->split_page_cache.gfp_zero = __GFP_ZERO; 781 + 858 782 mmu->pgt = pgt; 859 783 mmu->pgd_phys = __pa(pgt->pgd); 860 784 return 0; ··· 868 784 out_free_pgtable: 869 785 kfree(pgt); 870 786 return err; 787 + } 788 + 789 + void kvm_uninit_stage2_mmu(struct kvm *kvm) 790 + { 791 + kvm_free_stage2_pgd(&kvm->arch.mmu); 792 + kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache); 871 793 } 872 794 873 795 static void stage2_unmap_memslot(struct kvm *kvm, ··· 1079 989 } 1080 990 1081 991 /** 1082 - * kvm_mmu_write_protect_pt_masked() - write protect dirty pages 992 + * kvm_mmu_split_memory_region() - split the stage 2 blocks into PAGE_SIZE 993 + * pages for memory slot 994 + * @kvm: The KVM pointer 995 + * @slot: The memory slot to split 996 + * 997 + * Acquires kvm->mmu_lock. Called with kvm->slots_lock mutex acquired, 998 + * serializing operations for VM memory regions. 999 + */ 1000 + static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot) 1001 + { 1002 + struct kvm_memslots *slots; 1003 + struct kvm_memory_slot *memslot; 1004 + phys_addr_t start, end; 1005 + 1006 + lockdep_assert_held(&kvm->slots_lock); 1007 + 1008 + slots = kvm_memslots(kvm); 1009 + memslot = id_to_memslot(slots, slot); 1010 + 1011 + start = memslot->base_gfn << PAGE_SHIFT; 1012 + end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT; 1013 + 1014 + write_lock(&kvm->mmu_lock); 1015 + kvm_mmu_split_huge_pages(kvm, start, end); 1016 + write_unlock(&kvm->mmu_lock); 1017 + } 1018 + 1019 + /* 1020 + * kvm_arch_mmu_enable_log_dirty_pt_masked() - enable dirty logging for selected pages. 
1083 1021 * @kvm: The KVM pointer 1084 1022 * @slot: The memory slot associated with mask 1085 1023 * @gfn_offset: The gfn offset in memory slot 1086 - * @mask: The mask of dirty pages at offset 'gfn_offset' in this memory 1087 - * slot to be write protected 1024 + * @mask: The mask of pages at offset 'gfn_offset' in this memory 1025 + * slot to enable dirty logging on 1088 1026 * 1089 - * Walks bits set in mask write protects the associated pte's. Caller must 1090 - * acquire kvm_mmu_lock. 1027 + * Writes protect selected pages to enable dirty logging, and then 1028 + * splits them to PAGE_SIZE. Caller must acquire kvm->mmu_lock. 1091 1029 */ 1092 - static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm, 1030 + void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm, 1093 1031 struct kvm_memory_slot *slot, 1094 1032 gfn_t gfn_offset, unsigned long mask) 1095 1033 { ··· 1125 1007 phys_addr_t start = (base_gfn + __ffs(mask)) << PAGE_SHIFT; 1126 1008 phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT; 1127 1009 1128 - stage2_wp_range(&kvm->arch.mmu, start, end); 1129 - } 1010 + lockdep_assert_held_write(&kvm->mmu_lock); 1130 1011 1131 - /* 1132 - * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected 1133 - * dirty pages. 1134 - * 1135 - * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to 1136 - * enable dirty logging for them. 1137 - */ 1138 - void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm, 1139 - struct kvm_memory_slot *slot, 1140 - gfn_t gfn_offset, unsigned long mask) 1141 - { 1142 - kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask); 1012 + stage2_wp_range(&kvm->arch.mmu, start, end); 1013 + 1014 + /* 1015 + * Eager-splitting is done when manual-protect is set. We 1016 + * also check for initially-all-set because we can avoid 1017 + * eager-splitting if initially-all-set is false. 
1018 + * Initially-all-set equal false implies that huge-pages were 1019 + * already split when enabling dirty logging: no need to do it 1020 + * again. 1021 + */ 1022 + if (kvm_dirty_log_manual_protect_and_init_set(kvm)) 1023 + kvm_mmu_split_huge_pages(kvm, start, end); 1143 1024 } 1144 1025 1145 1026 static void kvm_send_hwpoison_signal(unsigned long address, short lsb) ··· 1907 1790 const struct kvm_memory_slot *new, 1908 1791 enum kvm_mr_change change) 1909 1792 { 1793 + bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES; 1794 + 1910 1795 /* 1911 1796 * At this point memslot has been committed and there is an 1912 1797 * allocated dirty_bitmap[], dirty pages will be tracked while the 1913 1798 * memory slot is write protected. 1914 1799 */ 1915 - if (change != KVM_MR_DELETE && new->flags & KVM_MEM_LOG_DIRTY_PAGES) { 1800 + if (log_dirty_pages) { 1801 + 1802 + if (change == KVM_MR_DELETE) 1803 + return; 1804 + 1916 1805 /* 1917 - * If we're with initial-all-set, we don't need to write 1918 - * protect any pages because they're all reported as dirty. 1919 - * Huge pages and normal pages will be write protect gradually. 1806 + * Huge and normal pages are write-protected and split 1807 + * on either of these two cases: 1808 + * 1809 + * 1. with initial-all-set: gradually with CLEAR ioctls, 1920 1810 */ 1921 - if (!kvm_dirty_log_manual_protect_and_init_set(kvm)) { 1922 - kvm_mmu_wp_memory_region(kvm, new->id); 1923 - } 1811 + if (kvm_dirty_log_manual_protect_and_init_set(kvm)) 1812 + return; 1813 + /* 1814 + * or 1815 + * 2. without initial-all-set: all in one shot when 1816 + * enabling dirty logging. 1817 + */ 1818 + kvm_mmu_wp_memory_region(kvm, new->id); 1819 + kvm_mmu_split_memory_region(kvm, new->id); 1820 + } else { 1821 + /* 1822 + * Free any leftovers from the eager page splitting cache. Do 1823 + * this when deleting, moving, disabling dirty logging, or 1824 + * creating the memslot (a nop). 
Doing it for deletes makes 1825 + * sure we don't leak memory, and there's no need to keep the 1826 + * cache around for any of the other cases. 1827 + */ 1828 + kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache); 1924 1829 } 1925 1830 } 1926 1831 ··· 2016 1877 2017 1878 void kvm_arch_flush_shadow_all(struct kvm *kvm) 2018 1879 { 2019 - kvm_free_stage2_pgd(&kvm->arch.mmu); 1880 + kvm_uninit_stage2_mmu(kvm); 2020 1881 } 2021 1882 2022 1883 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
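The range arithmetic in kvm_arch_mmu_enable_log_dirty_pt_masked() can be tried out in isolation. The following is a minimal userspace sketch, not the kernel code: `lowest_bit()` and `highest_bit()` are hypothetical stand-ins for `__ffs()` and `__fls()` built on GCC/Clang builtins, `PAGE_SHIFT` is assumed to be 12, and `base_gfn` is passed in directly rather than derived from a memslot.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

struct ipa_range {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
};

/* Index of the lowest set bit; hypothetical stand-in for __ffs(). */
static unsigned int lowest_bit(unsigned long mask)
{
	return (unsigned int)__builtin_ctzl(mask);
}

/* Index of the highest set bit; hypothetical stand-in for __fls(). */
static unsigned int highest_bit(unsigned long mask)
{
	return (unsigned int)(sizeof(mask) * CHAR_BIT) - 1u -
	       (unsigned int)__builtin_clzl(mask);
}

/* Turn a per-page bitmask into one contiguous [start, end) byte range. */
static struct ipa_range mask_to_range(uint64_t base_gfn, unsigned long mask)
{
	struct ipa_range r;

	assert(mask != 0);	/* both builtins are undefined for 0 */
	r.start = (base_gfn + lowest_bit(mask)) << PAGE_SHIFT;
	r.end = (base_gfn + highest_bit(mask) + 1) << PAGE_SHIFT;
	return r;
}
```

With this arithmetic a sparse mask still yields a single range: every page between the lowest and highest set bit is covered, including pages whose bit is clear.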

+1

arch/arm64/kvm/pkvm.c

···
 	hyp_mem_pages += host_s2_pgtable_pages();
 	hyp_mem_pages += hyp_vm_table_pages();
 	hyp_mem_pages += hyp_vmemmap_pages(STRUCT_HYP_PAGE_SIZE);
+	hyp_mem_pages += hyp_ffa_proxy_pages();

 	/*
 	 * Try to allocate a PMD-aligned region to reduce TLB pressure once

-58

arch/arm64/kvm/reset.c

···
 }

 /**
- * kvm_set_vm_width() - set the register width for the guest
- * @vcpu: Pointer to the vcpu being configured
- *
- * Set both KVM_ARCH_FLAG_EL1_32BIT and KVM_ARCH_FLAG_REG_WIDTH_CONFIGURED
- * in the VM flags based on the vcpu's requested register width, the HW
- * capabilities and other options (such as MTE).
- * When REG_WIDTH_CONFIGURED is already set, the vcpu settings must be
- * consistent with the value of the FLAG_EL1_32BIT bit in the flags.
- *
- * Return: 0 on success, negative error code on failure.
- */
-static int kvm_set_vm_width(struct kvm_vcpu *vcpu)
-{
-	struct kvm *kvm = vcpu->kvm;
-	bool is32bit;
-
-	is32bit = vcpu_has_feature(vcpu, KVM_ARM_VCPU_EL1_32BIT);
-
-	lockdep_assert_held(&kvm->arch.config_lock);
-
-	if (test_bit(KVM_ARCH_FLAG_REG_WIDTH_CONFIGURED, &kvm->arch.flags)) {
-		/*
-		 * The guest's register width is already configured.
-		 * Make sure that the vcpu is consistent with it.
-		 */
-		if (is32bit == test_bit(KVM_ARCH_FLAG_EL1_32BIT, &kvm->arch.flags))
-			return 0;
-
-		return -EINVAL;
-	}
-
-	if (!cpus_have_const_cap(ARM64_HAS_32BIT_EL1) && is32bit)
-		return -EINVAL;
-
-	/* MTE is incompatible with AArch32 */
-	if (kvm_has_mte(kvm) && is32bit)
-		return -EINVAL;
-
-	/* NV is incompatible with AArch32 */
-	if (vcpu_has_nv(vcpu) && is32bit)
-		return -EINVAL;
-
-	if (is32bit)
-		set_bit(KVM_ARCH_FLAG_EL1_32BIT, &kvm->arch.flags);
-
-	set_bit(KVM_ARCH_FLAG_REG_WIDTH_CONFIGURED, &kvm->arch.flags);
-
-	return 0;
-}
-
-/**
  * kvm_reset_vcpu - sets core registers and sys_regs to reset value
  * @vcpu: The VCPU pointer
  *
···
 	int ret;
 	bool loaded;
 	u32 pstate;
-
-	mutex_lock(&vcpu->kvm->arch.config_lock);
-	ret = kvm_set_vm_width(vcpu);
-	mutex_unlock(&vcpu->kvm->arch.config_lock);
-
-	if (ret)
-		return ret;

 	spin_lock(&vcpu->arch.mp_state_lock);
 	reset_state = vcpu->arch.reset_state;

+353 -152

arch/arm64/kvm/sys_regs.c

··· 42 42 */ 43 43 44 44 static u64 sys_reg_to_index(const struct sys_reg_desc *reg); 45 + static int set_id_reg(struct kvm_vcpu *vcpu, const struct sys_reg_desc *rd, 46 + u64 val); 45 47 46 48 static bool read_from_write_only(struct kvm_vcpu *vcpu, 47 49 struct sys_reg_params *params, ··· 555 553 return 0; 556 554 } 557 555 558 - static void reset_bvr(struct kvm_vcpu *vcpu, 556 + static u64 reset_bvr(struct kvm_vcpu *vcpu, 559 557 const struct sys_reg_desc *rd) 560 558 { 561 559 vcpu->arch.vcpu_debug_state.dbg_bvr[rd->CRm] = rd->val; 560 + return rd->val; 562 561 } 563 562 564 563 static bool trap_bcr(struct kvm_vcpu *vcpu, ··· 592 589 return 0; 593 590 } 594 591 595 - static void reset_bcr(struct kvm_vcpu *vcpu, 592 + static u64 reset_bcr(struct kvm_vcpu *vcpu, 596 593 const struct sys_reg_desc *rd) 597 594 { 598 595 vcpu->arch.vcpu_debug_state.dbg_bcr[rd->CRm] = rd->val; 596 + return rd->val; 599 597 } 600 598 601 599 static bool trap_wvr(struct kvm_vcpu *vcpu, ··· 630 626 return 0; 631 627 } 632 628 633 - static void reset_wvr(struct kvm_vcpu *vcpu, 629 + static u64 reset_wvr(struct kvm_vcpu *vcpu, 634 630 const struct sys_reg_desc *rd) 635 631 { 636 632 vcpu->arch.vcpu_debug_state.dbg_wvr[rd->CRm] = rd->val; 633 + return rd->val; 637 634 } 638 635 639 636 static bool trap_wcr(struct kvm_vcpu *vcpu, ··· 667 662 return 0; 668 663 } 669 664 670 - static void reset_wcr(struct kvm_vcpu *vcpu, 665 + static u64 reset_wcr(struct kvm_vcpu *vcpu, 671 666 const struct sys_reg_desc *rd) 672 667 { 673 668 vcpu->arch.vcpu_debug_state.dbg_wcr[rd->CRm] = rd->val; 669 + return rd->val; 674 670 } 675 671 676 - static void reset_amair_el1(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 672 + static u64 reset_amair_el1(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 677 673 { 678 674 u64 amair = read_sysreg(amair_el1); 679 675 vcpu_write_sys_reg(vcpu, amair, AMAIR_EL1); 676 + return amair; 680 677 } 681 678 682 - static void reset_actlr(struct kvm_vcpu *vcpu, const 
struct sys_reg_desc *r) 679 + static u64 reset_actlr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 683 680 { 684 681 u64 actlr = read_sysreg(actlr_el1); 685 682 vcpu_write_sys_reg(vcpu, actlr, ACTLR_EL1); 683 + return actlr; 686 684 } 687 685 688 - static void reset_mpidr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 686 + static u64 reset_mpidr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 689 687 { 690 688 u64 mpidr; 691 689 ··· 702 694 mpidr = (vcpu->vcpu_id & 0x0f) << MPIDR_LEVEL_SHIFT(0); 703 695 mpidr |= ((vcpu->vcpu_id >> 4) & 0xff) << MPIDR_LEVEL_SHIFT(1); 704 696 mpidr |= ((vcpu->vcpu_id >> 12) & 0xff) << MPIDR_LEVEL_SHIFT(2); 705 - vcpu_write_sys_reg(vcpu, (1ULL << 31) | mpidr, MPIDR_EL1); 697 + mpidr |= (1ULL << 31); 698 + vcpu_write_sys_reg(vcpu, mpidr, MPIDR_EL1); 699 + 700 + return mpidr; 706 701 } 707 702 708 703 static unsigned int pmu_visibility(const struct kvm_vcpu *vcpu, ··· 717 706 return REG_HIDDEN; 718 707 } 719 708 720 - static void reset_pmu_reg(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 709 + static u64 reset_pmu_reg(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 721 710 { 722 711 u64 n, mask = BIT(ARMV8_PMU_CYCLE_IDX); 723 712 724 713 /* No PMU available, any PMU reg may UNDEF... 
*/ 725 714 if (!kvm_arm_support_pmu_v3()) 726 - return; 715 + return 0; 727 716 728 717 n = read_sysreg(pmcr_el0) >> ARMV8_PMU_PMCR_N_SHIFT; 729 718 n &= ARMV8_PMU_PMCR_N_MASK; ··· 732 721 733 722 reset_unknown(vcpu, r); 734 723 __vcpu_sys_reg(vcpu, r->reg) &= mask; 724 + 725 + return __vcpu_sys_reg(vcpu, r->reg); 735 726 } 736 727 737 - static void reset_pmevcntr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 728 + static u64 reset_pmevcntr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 738 729 { 739 730 reset_unknown(vcpu, r); 740 731 __vcpu_sys_reg(vcpu, r->reg) &= GENMASK(31, 0); 732 + 733 + return __vcpu_sys_reg(vcpu, r->reg); 741 734 } 742 735 743 - static void reset_pmevtyper(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 736 + static u64 reset_pmevtyper(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 744 737 { 745 738 reset_unknown(vcpu, r); 746 739 __vcpu_sys_reg(vcpu, r->reg) &= ARMV8_PMU_EVTYPE_MASK; 740 + 741 + return __vcpu_sys_reg(vcpu, r->reg); 747 742 } 748 743 749 - static void reset_pmselr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 744 + static u64 reset_pmselr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 750 745 { 751 746 reset_unknown(vcpu, r); 752 747 __vcpu_sys_reg(vcpu, r->reg) &= ARMV8_PMU_COUNTER_MASK; 748 + 749 + return __vcpu_sys_reg(vcpu, r->reg); 753 750 } 754 751 755 - static void reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 752 + static u64 reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 756 753 { 757 754 u64 pmcr; 758 755 759 756 /* No PMU available, PMCR_EL0 may UNDEF... 
*/ 760 757 if (!kvm_arm_support_pmu_v3()) 761 - return; 758 + return 0; 762 759 763 760 /* Only preserve PMCR_EL0.N, and reset the rest to 0 */ 764 761 pmcr = read_sysreg(pmcr_el0) & (ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT); ··· 774 755 pmcr |= ARMV8_PMU_PMCR_LC; 775 756 776 757 __vcpu_sys_reg(vcpu, r->reg) = pmcr; 758 + 759 + return __vcpu_sys_reg(vcpu, r->reg); 777 760 } 778 761 779 762 static bool check_pmu_access_disabled(struct kvm_vcpu *vcpu, u64 flags) ··· 1208 1187 return true; 1209 1188 } 1210 1189 1211 - static u8 vcpu_pmuver(const struct kvm_vcpu *vcpu) 1190 + static s64 kvm_arm64_ftr_safe_value(u32 id, const struct arm64_ftr_bits *ftrp, 1191 + s64 new, s64 cur) 1212 1192 { 1213 - if (kvm_vcpu_has_pmu(vcpu)) 1214 - return vcpu->kvm->arch.dfr0_pmuver.imp; 1193 + struct arm64_ftr_bits kvm_ftr = *ftrp; 1215 1194 1216 - return vcpu->kvm->arch.dfr0_pmuver.unimp; 1195 + /* Some features have different safe value type in KVM than host features */ 1196 + switch (id) { 1197 + case SYS_ID_AA64DFR0_EL1: 1198 + if (kvm_ftr.shift == ID_AA64DFR0_EL1_PMUVer_SHIFT) 1199 + kvm_ftr.type = FTR_LOWER_SAFE; 1200 + break; 1201 + case SYS_ID_DFR0_EL1: 1202 + if (kvm_ftr.shift == ID_DFR0_EL1_PerfMon_SHIFT) 1203 + kvm_ftr.type = FTR_LOWER_SAFE; 1204 + break; 1205 + } 1206 + 1207 + return arm64_ftr_safe_value(&kvm_ftr, new, cur); 1217 1208 } 1218 1209 1219 - static u8 perfmon_to_pmuver(u8 perfmon) 1210 + /** 1211 + * arm64_check_features() - Check if a feature register value constitutes 1212 + * a subset of features indicated by the idreg's KVM sanitised limit. 1213 + * 1214 + * This function will check if each feature field of @val is the "safe" value 1215 + * against idreg's KVM sanitised limit return from reset() callback. 1216 + * If a field value in @val is the same as the one in limit, it is always 1217 + * considered the safe value regardless For register fields that are not in 1218 + * writable, only the value in limit is considered the safe value. 
1219 + * 1220 + * Return: 0 if all the fields are safe. Otherwise, return negative errno. 1221 + */ 1222 + static int arm64_check_features(struct kvm_vcpu *vcpu, 1223 + const struct sys_reg_desc *rd, 1224 + u64 val) 1220 1225 { 1221 - switch (perfmon) { 1222 - case ID_DFR0_EL1_PerfMon_PMUv3: 1223 - return ID_AA64DFR0_EL1_PMUVer_IMP; 1224 - case ID_DFR0_EL1_PerfMon_IMPDEF: 1225 - return ID_AA64DFR0_EL1_PMUVer_IMP_DEF; 1226 - default: 1227 - /* Anything ARMv8.1+ and NI have the same value. For now. */ 1228 - return perfmon; 1226 + const struct arm64_ftr_reg *ftr_reg; 1227 + const struct arm64_ftr_bits *ftrp = NULL; 1228 + u32 id = reg_to_encoding(rd); 1229 + u64 writable_mask = rd->val; 1230 + u64 limit = rd->reset(vcpu, rd); 1231 + u64 mask = 0; 1232 + 1233 + /* 1234 + * Hidden and unallocated ID registers may not have a corresponding 1235 + * struct arm64_ftr_reg. Of course, if the register is RAZ we know the 1236 + * only safe value is 0. 1237 + */ 1238 + if (sysreg_visible_as_raz(vcpu, rd)) 1239 + return val ? -E2BIG : 0; 1240 + 1241 + ftr_reg = get_arm64_ftr_reg(id); 1242 + if (!ftr_reg) 1243 + return -EINVAL; 1244 + 1245 + ftrp = ftr_reg->ftr_bits; 1246 + 1247 + for (; ftrp && ftrp->width; ftrp++) { 1248 + s64 f_val, f_lim, safe_val; 1249 + u64 ftr_mask; 1250 + 1251 + ftr_mask = arm64_ftr_mask(ftrp); 1252 + if ((ftr_mask & writable_mask) != ftr_mask) 1253 + continue; 1254 + 1255 + f_val = arm64_ftr_value(ftrp, val); 1256 + f_lim = arm64_ftr_value(ftrp, limit); 1257 + mask |= ftr_mask; 1258 + 1259 + if (f_val == f_lim) 1260 + safe_val = f_val; 1261 + else 1262 + safe_val = kvm_arm64_ftr_safe_value(id, ftrp, f_val, f_lim); 1263 + 1264 + if (safe_val != f_val) 1265 + return -E2BIG; 1229 1266 } 1267 + 1268 + /* For fields that are not writable, values in limit are the safe values. 
*/ 1269 + if ((val & ~mask) != (limit & ~mask)) 1270 + return -E2BIG; 1271 + 1272 + return 0; 1230 1273 } 1231 1274 1232 1275 static u8 pmuver_to_perfmon(u8 pmuver) ··· 1307 1222 } 1308 1223 1309 1224 /* Read a sanitised cpufeature ID register by sys_reg_desc */ 1310 - static u64 read_id_reg(const struct kvm_vcpu *vcpu, struct sys_reg_desc const *r) 1225 + static u64 __kvm_read_sanitised_id_reg(const struct kvm_vcpu *vcpu, 1226 + const struct sys_reg_desc *r) 1311 1227 { 1312 1228 u32 id = reg_to_encoding(r); 1313 1229 u64 val; ··· 1319 1233 val = read_sanitised_ftr_reg(id); 1320 1234 1321 1235 switch (id) { 1322 - case SYS_ID_AA64PFR0_EL1: 1323 - if (!vcpu_has_sve(vcpu)) 1324 - val &= ~ARM64_FEATURE_MASK(ID_AA64PFR0_EL1_SVE); 1325 - val &= ~ARM64_FEATURE_MASK(ID_AA64PFR0_EL1_AMU); 1326 - val &= ~ARM64_FEATURE_MASK(ID_AA64PFR0_EL1_CSV2); 1327 - val |= FIELD_PREP(ARM64_FEATURE_MASK(ID_AA64PFR0_EL1_CSV2), (u64)vcpu->kvm->arch.pfr0_csv2); 1328 - val &= ~ARM64_FEATURE_MASK(ID_AA64PFR0_EL1_CSV3); 1329 - val |= FIELD_PREP(ARM64_FEATURE_MASK(ID_AA64PFR0_EL1_CSV3), (u64)vcpu->kvm->arch.pfr0_csv3); 1330 - if (kvm_vgic_global_state.type == VGIC_V3) { 1331 - val &= ~ARM64_FEATURE_MASK(ID_AA64PFR0_EL1_GIC); 1332 - val |= FIELD_PREP(ARM64_FEATURE_MASK(ID_AA64PFR0_EL1_GIC), 1); 1333 - } 1334 - break; 1335 1236 case SYS_ID_AA64PFR1_EL1: 1336 1237 if (!kvm_has_mte(vcpu->kvm)) 1337 1238 val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_MTE); ··· 1340 1267 val &= ~ARM64_FEATURE_MASK(ID_AA64ISAR2_EL1_WFxT); 1341 1268 val &= ~ARM64_FEATURE_MASK(ID_AA64ISAR2_EL1_MOPS); 1342 1269 break; 1343 - case SYS_ID_AA64DFR0_EL1: 1344 - /* Limit debug to ARMv8.0 */ 1345 - val &= ~ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_DebugVer); 1346 - val |= FIELD_PREP(ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_DebugVer), 6); 1347 - /* Set PMUver to the required version */ 1348 - val &= ~ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_PMUVer); 1349 - val |= FIELD_PREP(ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_PMUVer), 1350 - vcpu_pmuver(vcpu)); 1351 - 
/* Hide SPE from guests */ 1352 - val &= ~ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_PMSVer); 1353 - break; 1354 - case SYS_ID_DFR0_EL1: 1355 - val &= ~ARM64_FEATURE_MASK(ID_DFR0_EL1_PerfMon); 1356 - val |= FIELD_PREP(ARM64_FEATURE_MASK(ID_DFR0_EL1_PerfMon), 1357 - pmuver_to_perfmon(vcpu_pmuver(vcpu))); 1358 - break; 1359 1270 case SYS_ID_AA64MMFR2_EL1: 1360 1271 val &= ~ID_AA64MMFR2_EL1_CCIDX_MASK; 1361 1272 break; ··· 1349 1292 } 1350 1293 1351 1294 return val; 1295 + } 1296 + 1297 + static u64 kvm_read_sanitised_id_reg(struct kvm_vcpu *vcpu, 1298 + const struct sys_reg_desc *r) 1299 + { 1300 + return __kvm_read_sanitised_id_reg(vcpu, r); 1301 + } 1302 + 1303 + static u64 read_id_reg(const struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 1304 + { 1305 + return IDREG(vcpu->kvm, reg_to_encoding(r)); 1306 + } 1307 + 1308 + /* 1309 + * Return true if the register's (Op0, Op1, CRn, CRm, Op2) is 1310 + * (3, 0, 0, crm, op2), where 1<=crm<8, 0<=op2<8. 1311 + */ 1312 + static inline bool is_id_reg(u32 id) 1313 + { 1314 + return (sys_reg_Op0(id) == 3 && sys_reg_Op1(id) == 0 && 1315 + sys_reg_CRn(id) == 0 && sys_reg_CRm(id) >= 1 && 1316 + sys_reg_CRm(id) < 8); 1352 1317 } 1353 1318 1354 1319 static unsigned int id_visibility(const struct kvm_vcpu *vcpu, ··· 1434 1355 return REG_HIDDEN; 1435 1356 } 1436 1357 1437 - static int set_id_aa64pfr0_el1(struct kvm_vcpu *vcpu, 1438 - const struct sys_reg_desc *rd, 1439 - u64 val) 1358 + static u64 read_sanitised_id_aa64pfr0_el1(struct kvm_vcpu *vcpu, 1359 + const struct sys_reg_desc *rd) 1440 1360 { 1441 - u8 csv2, csv3; 1361 + u64 val = read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1); 1362 + 1363 + if (!vcpu_has_sve(vcpu)) 1364 + val &= ~ID_AA64PFR0_EL1_SVE_MASK; 1442 1365 1443 1366 /* 1444 - * Allow AA64PFR0_EL1.CSV2 to be set from userspace as long as 1445 - * it doesn't promise more than what is actually provided (the 1446 - * guest could otherwise be covered in ectoplasmic residue). 
1367 + * The default is to expose CSV2 == 1 if the HW isn't affected. 1368 + * Although this is a per-CPU feature, we make it global because 1369 + * asymmetric systems are just a nuisance. 1370 + * 1371 + * Userspace can override this as long as it doesn't promise 1372 + * the impossible. 1447 1373 */ 1448 - csv2 = cpuid_feature_extract_unsigned_field(val, ID_AA64PFR0_EL1_CSV2_SHIFT); 1449 - if (csv2 > 1 || 1450 - (csv2 && arm64_get_spectre_v2_state() != SPECTRE_UNAFFECTED)) 1451 - return -EINVAL; 1374 + if (arm64_get_spectre_v2_state() == SPECTRE_UNAFFECTED) { 1375 + val &= ~ID_AA64PFR0_EL1_CSV2_MASK; 1376 + val |= SYS_FIELD_PREP_ENUM(ID_AA64PFR0_EL1, CSV2, IMP); 1377 + } 1378 + if (arm64_get_meltdown_state() == SPECTRE_UNAFFECTED) { 1379 + val &= ~ID_AA64PFR0_EL1_CSV3_MASK; 1380 + val |= SYS_FIELD_PREP_ENUM(ID_AA64PFR0_EL1, CSV3, IMP); 1381 + } 1452 1382 1453 - /* Same thing for CSV3 */ 1454 - csv3 = cpuid_feature_extract_unsigned_field(val, ID_AA64PFR0_EL1_CSV3_SHIFT); 1455 - if (csv3 > 1 || 1456 - (csv3 && arm64_get_meltdown_state() != SPECTRE_UNAFFECTED)) 1457 - return -EINVAL; 1383 + if (kvm_vgic_global_state.type == VGIC_V3) { 1384 + val &= ~ID_AA64PFR0_EL1_GIC_MASK; 1385 + val |= SYS_FIELD_PREP_ENUM(ID_AA64PFR0_EL1, GIC, IMP); 1386 + } 1458 1387 1459 - /* We can only differ with CSV[23], and anything else is an error */ 1460 - val ^= read_id_reg(vcpu, rd); 1461 - val &= ~(ARM64_FEATURE_MASK(ID_AA64PFR0_EL1_CSV2) | 1462 - ARM64_FEATURE_MASK(ID_AA64PFR0_EL1_CSV3)); 1463 - if (val) 1464 - return -EINVAL; 1388 + val &= ~ID_AA64PFR0_EL1_AMU_MASK; 1465 1389 1466 - vcpu->kvm->arch.pfr0_csv2 = csv2; 1467 - vcpu->kvm->arch.pfr0_csv3 = csv3; 1390 + return val; 1391 + } 1468 1392 1469 - return 0; 1393 + static u64 read_sanitised_id_aa64dfr0_el1(struct kvm_vcpu *vcpu, 1394 + const struct sys_reg_desc *rd) 1395 + { 1396 + u64 val = read_sanitised_ftr_reg(SYS_ID_AA64DFR0_EL1); 1397 + 1398 + /* Limit debug to ARMv8.0 */ 1399 + val &= ~ID_AA64DFR0_EL1_DebugVer_MASK; 1400 
+ val |= SYS_FIELD_PREP_ENUM(ID_AA64DFR0_EL1, DebugVer, IMP); 1401 + 1402 + /* 1403 + * Only initialize the PMU version if the vCPU was configured with one. 1404 + */ 1405 + val &= ~ID_AA64DFR0_EL1_PMUVer_MASK; 1406 + if (kvm_vcpu_has_pmu(vcpu)) 1407 + val |= SYS_FIELD_PREP(ID_AA64DFR0_EL1, PMUVer, 1408 + kvm_arm_pmu_get_pmuver_limit()); 1409 + 1410 + /* Hide SPE from guests */ 1411 + val &= ~ID_AA64DFR0_EL1_PMSVer_MASK; 1412 + 1413 + return val; 1470 1414 } 1471 1415 1472 1416 static int set_id_aa64dfr0_el1(struct kvm_vcpu *vcpu, 1473 1417 const struct sys_reg_desc *rd, 1474 1418 u64 val) 1475 1419 { 1476 - u8 pmuver, host_pmuver; 1477 - bool valid_pmu; 1478 - 1479 - host_pmuver = kvm_arm_pmu_get_pmuver_limit(); 1420 + u8 pmuver = SYS_FIELD_GET(ID_AA64DFR0_EL1, PMUVer, val); 1480 1421 1481 1422 /* 1482 - * Allow AA64DFR0_EL1.PMUver to be set from userspace as long 1483 - * as it doesn't promise more than what the HW gives us. We 1484 - * allow an IMPDEF PMU though, only if no PMU is supported 1485 - * (KVM backward compatibility handling). 1423 + * Prior to commit 3d0dba5764b9 ("KVM: arm64: PMU: Move the 1424 + * ID_AA64DFR0_EL1.PMUver limit to VM creation"), KVM erroneously 1425 + * exposed an IMP_DEF PMU to userspace and the guest on systems w/ 1426 + * non-architectural PMUs. Of course, PMUv3 is the only game in town for 1427 + * PMU virtualization, so the IMP_DEF value was rather user-hostile. 1428 + * 1429 + * At minimum, we're on the hook to allow values that were given to 1430 + * userspace by KVM. Cover our tracks here and replace the IMP_DEF value 1431 + * with a more sensible NI. The value of an ID register changing under 1432 + * the nose of the guest is unfortunate, but is certainly no more 1433 + * surprising than an ill-guided PMU driver poking at impdef system 1434 + * registers that end in an UNDEF... 
1486 1435 */ 1487 - pmuver = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_PMUVer), val); 1488 - if ((pmuver != ID_AA64DFR0_EL1_PMUVer_IMP_DEF && pmuver > host_pmuver)) 1489 - return -EINVAL; 1436 + if (pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF) 1437 + val &= ~ID_AA64DFR0_EL1_PMUVer_MASK; 1490 1438 1491 - valid_pmu = (pmuver != 0 && pmuver != ID_AA64DFR0_EL1_PMUVer_IMP_DEF); 1439 + return set_id_reg(vcpu, rd, val); 1440 + } 1492 1441 1493 - /* Make sure view register and PMU support do match */ 1494 - if (kvm_vcpu_has_pmu(vcpu) != valid_pmu) 1495 - return -EINVAL; 1442 + static u64 read_sanitised_id_dfr0_el1(struct kvm_vcpu *vcpu, 1443 + const struct sys_reg_desc *rd) 1444 + { 1445 + u8 perfmon = pmuver_to_perfmon(kvm_arm_pmu_get_pmuver_limit()); 1446 + u64 val = read_sanitised_ftr_reg(SYS_ID_DFR0_EL1); 1496 1447 1497 - /* We can only differ with PMUver, and anything else is an error */ 1498 - val ^= read_id_reg(vcpu, rd); 1499 - val &= ~ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_PMUVer); 1500 - if (val) 1501 - return -EINVAL; 1448 + val &= ~ID_DFR0_EL1_PerfMon_MASK; 1449 + if (kvm_vcpu_has_pmu(vcpu)) 1450 + val |= SYS_FIELD_PREP(ID_DFR0_EL1, PerfMon, perfmon); 1502 1451 1503 - if (valid_pmu) 1504 - vcpu->kvm->arch.dfr0_pmuver.imp = pmuver; 1505 - else 1506 - vcpu->kvm->arch.dfr0_pmuver.unimp = pmuver; 1507 - 1508 - return 0; 1452 + return val; 1509 1453 } 1510 1454 1511 1455 static int set_id_dfr0_el1(struct kvm_vcpu *vcpu, 1512 1456 const struct sys_reg_desc *rd, 1513 1457 u64 val) 1514 1458 { 1515 - u8 perfmon, host_perfmon; 1516 - bool valid_pmu; 1459 + u8 perfmon = SYS_FIELD_GET(ID_DFR0_EL1, PerfMon, val); 1517 1460 1518 - host_perfmon = pmuver_to_perfmon(kvm_arm_pmu_get_pmuver_limit()); 1461 + if (perfmon == ID_DFR0_EL1_PerfMon_IMPDEF) { 1462 + val &= ~ID_DFR0_EL1_PerfMon_MASK; 1463 + perfmon = 0; 1464 + } 1519 1465 1520 1466 /* 1521 1467 * Allow DFR0_EL1.PerfMon to be set from userspace as long as ··· 1548 1444 * AArch64 side (as everything is emulated with that), and 
1549 1445 * that this is a PMUv3. 1550 1446 */ 1551 - perfmon = FIELD_GET(ARM64_FEATURE_MASK(ID_DFR0_EL1_PerfMon), val); 1552 - if ((perfmon != ID_DFR0_EL1_PerfMon_IMPDEF && perfmon > host_perfmon) || 1553 - (perfmon != 0 && perfmon < ID_DFR0_EL1_PerfMon_PMUv3)) 1447 + if (perfmon != 0 && perfmon < ID_DFR0_EL1_PerfMon_PMUv3) 1554 1448 return -EINVAL; 1555 1449 1556 - valid_pmu = (perfmon != 0 && perfmon != ID_DFR0_EL1_PerfMon_IMPDEF); 1557 - 1558 - /* Make sure view register and PMU support do match */ 1559 - if (kvm_vcpu_has_pmu(vcpu) != valid_pmu) 1560 - return -EINVAL; 1561 - 1562 - /* We can only differ with PerfMon, and anything else is an error */ 1563 - val ^= read_id_reg(vcpu, rd); 1564 - val &= ~ARM64_FEATURE_MASK(ID_DFR0_EL1_PerfMon); 1565 - if (val) 1566 - return -EINVAL; 1567 - 1568 - if (valid_pmu) 1569 - vcpu->kvm->arch.dfr0_pmuver.imp = perfmon_to_pmuver(perfmon); 1570 - else 1571 - vcpu->kvm->arch.dfr0_pmuver.unimp = perfmon_to_pmuver(perfmon); 1572 - 1573 - return 0; 1450 + return set_id_reg(vcpu, rd, val); 1574 1451 } 1575 1452 1576 1453 /* ··· 1564 1479 static int get_id_reg(struct kvm_vcpu *vcpu, const struct sys_reg_desc *rd, 1565 1480 u64 *val) 1566 1481 { 1482 + /* 1483 + * Avoid locking if the VM has already started, as the ID registers are 1484 + * guaranteed to be invariant at that point. 1485 + */ 1486 + if (kvm_vm_has_ran_once(vcpu->kvm)) { 1487 + *val = read_id_reg(vcpu, rd); 1488 + return 0; 1489 + } 1490 + 1491 + mutex_lock(&vcpu->kvm->arch.config_lock); 1567 1492 *val = read_id_reg(vcpu, rd); 1493 + mutex_unlock(&vcpu->kvm->arch.config_lock); 1494 + 1568 1495 return 0; 1569 1496 } 1570 1497 1571 1498 static int set_id_reg(struct kvm_vcpu *vcpu, const struct sys_reg_desc *rd, 1572 1499 u64 val) 1573 1500 { 1574 - /* This is what we mean by invariant: you can't change it. 
*/ 1575 - if (val != read_id_reg(vcpu, rd)) 1576 - return -EINVAL; 1501 + u32 id = reg_to_encoding(rd); 1502 + int ret; 1577 1503 1578 - return 0; 1504 + mutex_lock(&vcpu->kvm->arch.config_lock); 1505 + 1506 + /* 1507 + * Once the VM has started the ID registers are immutable. Reject any 1508 + * write that does not match the final register value. 1509 + */ 1510 + if (kvm_vm_has_ran_once(vcpu->kvm)) { 1511 + if (val != read_id_reg(vcpu, rd)) 1512 + ret = -EBUSY; 1513 + else 1514 + ret = 0; 1515 + 1516 + mutex_unlock(&vcpu->kvm->arch.config_lock); 1517 + return ret; 1518 + } 1519 + 1520 + ret = arm64_check_features(vcpu, rd, val); 1521 + if (!ret) 1522 + IDREG(vcpu->kvm, id) = val; 1523 + 1524 + mutex_unlock(&vcpu->kvm->arch.config_lock); 1525 + 1526 + /* 1527 + * arm64_check_features() returns -E2BIG to indicate the register's 1528 + * feature set is a superset of the maximally-allowed register value. 1529 + * While it would be nice to precisely describe this to userspace, the 1530 + * existing UAPI for KVM_SET_ONE_REG has it that invalid register 1531 + * writes return -EINVAL. 1532 + */ 1533 + if (ret == -E2BIG) 1534 + ret = -EINVAL; 1535 + return ret; 1579 1536 } 1580 1537 1581 1538 static int get_raz_reg(struct kvm_vcpu *vcpu, const struct sys_reg_desc *rd, ··· 1657 1530 * Fabricate a CLIDR_EL1 value instead of using the real value, which can vary 1658 1531 * by the physical CPU which the vcpu currently resides in. 
1659 1532 */ 1660 - static void reset_clidr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 1533 + static u64 reset_clidr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 1661 1534 { 1662 1535 u64 ctr_el0 = read_sanitised_ftr_reg(SYS_CTR_EL0); 1663 1536 u64 clidr; ··· 1705 1578 clidr |= 2 << CLIDR_TTYPE_SHIFT(loc); 1706 1579 1707 1580 __vcpu_sys_reg(vcpu, r->reg) = clidr; 1581 + 1582 + return __vcpu_sys_reg(vcpu, r->reg); 1708 1583 } 1709 1584 1710 1585 static int set_clidr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *rd, ··· 1806 1677 .visibility = elx2_visibility, \ 1807 1678 } 1808 1679 1680 + /* 1681 + * Since reset() callback and field val are not used for idregs, they will be 1682 + * used for specific purposes for idregs. 1683 + * The reset() would return KVM sanitised register value. The value would be the 1684 + * same as the host kernel sanitised value if there is no KVM sanitisation. 1685 + * The val would be used as a mask indicating writable fields for the idreg. 1686 + * Only bits with 1 are writable from userspace. This mask might not be 1687 + * necessary in the future whenever all ID registers are enabled as writable 1688 + * from userspace. 
1689 + */ 1690 + 1809 1691 /* sys_reg_desc initialiser for known cpufeature ID registers */ 1810 1692 #define ID_SANITISED(name) { \ 1811 1693 SYS_DESC(SYS_##name), \ ··· 1824 1684 .get_user = get_id_reg, \ 1825 1685 .set_user = set_id_reg, \ 1826 1686 .visibility = id_visibility, \ 1687 + .reset = kvm_read_sanitised_id_reg, \ 1688 + .val = 0, \ 1827 1689 } 1828 1690 1829 1691 /* sys_reg_desc initialiser for known cpufeature ID registers */ ··· 1835 1693 .get_user = get_id_reg, \ 1836 1694 .set_user = set_id_reg, \ 1837 1695 .visibility = aa32_id_visibility, \ 1696 + .reset = kvm_read_sanitised_id_reg, \ 1697 + .val = 0, \ 1838 1698 } 1839 1699 1840 1700 /* ··· 1849 1705 .access = access_id_reg, \ 1850 1706 .get_user = get_id_reg, \ 1851 1707 .set_user = set_id_reg, \ 1852 - .visibility = raz_visibility \ 1708 + .visibility = raz_visibility, \ 1709 + .reset = kvm_read_sanitised_id_reg, \ 1710 + .val = 0, \ 1853 1711 } 1854 1712 1855 1713 /* ··· 1865 1719 .get_user = get_id_reg, \ 1866 1720 .set_user = set_id_reg, \ 1867 1721 .visibility = raz_visibility, \ 1722 + .reset = kvm_read_sanitised_id_reg, \ 1723 + .val = 0, \ 1868 1724 } 1869 1725 1870 1726 static bool access_sp_el1(struct kvm_vcpu *vcpu, ··· 1974 1826 /* CRm=1 */ 1975 1827 AA32_ID_SANITISED(ID_PFR0_EL1), 1976 1828 AA32_ID_SANITISED(ID_PFR1_EL1), 1977 - { SYS_DESC(SYS_ID_DFR0_EL1), .access = access_id_reg, 1978 - .get_user = get_id_reg, .set_user = set_id_dfr0_el1, 1979 - .visibility = aa32_id_visibility, }, 1829 + { SYS_DESC(SYS_ID_DFR0_EL1), 1830 + .access = access_id_reg, 1831 + .get_user = get_id_reg, 1832 + .set_user = set_id_dfr0_el1, 1833 + .visibility = aa32_id_visibility, 1834 + .reset = read_sanitised_id_dfr0_el1, 1835 + .val = ID_DFR0_EL1_PerfMon_MASK, }, 1980 1836 ID_HIDDEN(ID_AFR0_EL1), 1981 1837 AA32_ID_SANITISED(ID_MMFR0_EL1), 1982 1838 AA32_ID_SANITISED(ID_MMFR1_EL1), ··· 2009 1857 2010 1858 /* AArch64 ID registers */ 2011 1859 /* CRm=4 */ 2012 - { SYS_DESC(SYS_ID_AA64PFR0_EL1), .access = 
access_id_reg, 2013 - .get_user = get_id_reg, .set_user = set_id_aa64pfr0_el1, }, 1860 + { SYS_DESC(SYS_ID_AA64PFR0_EL1), 1861 + .access = access_id_reg, 1862 + .get_user = get_id_reg, 1863 + .set_user = set_id_reg, 1864 + .reset = read_sanitised_id_aa64pfr0_el1, 1865 + .val = ID_AA64PFR0_EL1_CSV2_MASK | ID_AA64PFR0_EL1_CSV3_MASK, }, 2014 1866 ID_SANITISED(ID_AA64PFR1_EL1), 2015 1867 ID_UNALLOCATED(4,2), 2016 1868 ID_UNALLOCATED(4,3), ··· 2024 1868 ID_UNALLOCATED(4,7), 2025 1869 2026 1870 /* CRm=5 */ 2027 - { SYS_DESC(SYS_ID_AA64DFR0_EL1), .access = access_id_reg, 2028 - .get_user = get_id_reg, .set_user = set_id_aa64dfr0_el1, }, 1871 + { SYS_DESC(SYS_ID_AA64DFR0_EL1), 1872 + .access = access_id_reg, 1873 + .get_user = get_id_reg, 1874 + .set_user = set_id_aa64dfr0_el1, 1875 + .reset = read_sanitised_id_aa64dfr0_el1, 1876 + .val = ID_AA64DFR0_EL1_PMUVer_MASK, }, 2029 1877 ID_SANITISED(ID_AA64DFR1_EL1), 2030 1878 ID_UNALLOCATED(5,2), 2031 1879 ID_UNALLOCATED(5,3), ··· 2363 2203 EL2_REG(ACTLR_EL2, access_rw, reset_val, 0), 2364 2204 EL2_REG(HCR_EL2, access_rw, reset_val, 0), 2365 2205 EL2_REG(MDCR_EL2, access_rw, reset_val, 0), 2366 - EL2_REG(CPTR_EL2, access_rw, reset_val, CPTR_EL2_DEFAULT ), 2206 + EL2_REG(CPTR_EL2, access_rw, reset_val, CPTR_NVHE_EL2_RES1), 2367 2207 EL2_REG(HSTR_EL2, access_rw, reset_val, 0), 2368 2208 EL2_REG(HACR_EL2, access_rw, reset_val, 0), 2369 2209 ··· 2419 2259 2420 2260 EL2_REG(SP_EL2, NULL, reset_unknown, 0), 2421 2261 }; 2262 + 2263 + static const struct sys_reg_desc *first_idreg; 2422 2264 2423 2265 static bool trap_dbgdidr(struct kvm_vcpu *vcpu, 2424 2266 struct sys_reg_params *p, ··· 3112 2950 return false; 3113 2951 } 3114 2952 2953 + static void kvm_reset_id_regs(struct kvm_vcpu *vcpu) 2954 + { 2955 + const struct sys_reg_desc *idreg = first_idreg; 2956 + u32 id = reg_to_encoding(idreg); 2957 + struct kvm *kvm = vcpu->kvm; 2958 + 2959 + if (test_bit(KVM_ARCH_FLAG_ID_REGS_INITIALIZED, &kvm->arch.flags)) 2960 + return; 2961 + 2962 + 
lockdep_assert_held(&kvm->arch.config_lock); 2963 + 2964 + /* Initialize all idregs */ 2965 + while (is_id_reg(id)) { 2966 + IDREG(kvm, id) = idreg->reset(vcpu, idreg); 2967 + 2968 + idreg++; 2969 + id = reg_to_encoding(idreg); 2970 + } 2971 + 2972 + set_bit(KVM_ARCH_FLAG_ID_REGS_INITIALIZED, &kvm->arch.flags); 2973 + } 2974 + 3115 2975 /** 3116 2976 * kvm_reset_sys_regs - sets system registers to reset value 3117 2977 * @vcpu: The VCPU pointer ··· 3145 2961 { 3146 2962 unsigned long i; 3147 2963 3148 - for (i = 0; i < ARRAY_SIZE(sys_reg_descs); i++) 3149 - if (sys_reg_descs[i].reset) 3150 - sys_reg_descs[i].reset(vcpu, &sys_reg_descs[i]); 2964 + kvm_reset_id_regs(vcpu); 2965 + 2966 + for (i = 0; i < ARRAY_SIZE(sys_reg_descs); i++) { 2967 + const struct sys_reg_desc *r = &sys_reg_descs[i]; 2968 + 2969 + if (is_id_reg(reg_to_encoding(r))) 2970 + continue; 2971 + 2972 + if (r->reset) 2973 + r->reset(vcpu, r); 2974 + } 3151 2975 } 3152 2976 3153 2977 /** ··· 3256 3064 */ 3257 3065 3258 3066 #define FUNCTION_INVARIANT(reg) \ 3259 - static void get_##reg(struct kvm_vcpu *v, \ 3067 + static u64 get_##reg(struct kvm_vcpu *v, \ 3260 3068 const struct sys_reg_desc *r) \ 3261 3069 { \ 3262 3070 ((struct sys_reg_desc *)r)->val = read_sysreg(reg); \ 3071 + return ((struct sys_reg_desc *)r)->val; \ 3263 3072 } 3264 3073 3265 3074 FUNCTION_INVARIANT(midr_el1) 3266 3075 FUNCTION_INVARIANT(revidr_el1) 3267 3076 FUNCTION_INVARIANT(aidr_el1) 3268 3077 3269 - static void get_ctr_el0(struct kvm_vcpu *v, const struct sys_reg_desc *r) 3078 + static u64 get_ctr_el0(struct kvm_vcpu *v, const struct sys_reg_desc *r) 3270 3079 { 3271 3080 ((struct sys_reg_desc *)r)->val = read_sanitised_ftr_reg(SYS_CTR_EL0); 3081 + return ((struct sys_reg_desc *)r)->val; 3272 3082 } 3273 3083 3274 3084 /* ->val is filled in by kvm_sys_reg_table_init() */ ··· 3562 3368 3563 3369 int __init kvm_sys_reg_table_init(void) 3564 3370 { 3371 + struct sys_reg_params params; 3565 3372 bool valid = true; 3566 3373 
unsigned int i; 3567 3374 ··· 3580 3385 /* We abuse the reset function to overwrite the table itself. */ 3581 3386 for (i = 0; i < ARRAY_SIZE(invariant_sys_regs); i++) 3582 3387 invariant_sys_regs[i].reset(NULL, &invariant_sys_regs[i]); 3388 + 3389 + /* Find the first idreg (SYS_ID_PFR0_EL1) in sys_reg_descs. */ 3390 + params = encoding_to_params(SYS_ID_PFR0_EL1); 3391 + first_idreg = find_reg(&params, sys_reg_descs, ARRAY_SIZE(sys_reg_descs)); 3392 + if (!first_idreg) 3393 + return -EINVAL; 3583 3394 3584 3395 return 0; 3585 3396 }
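The per-field validation that arm64_check_features() performs in the sys_regs.c hunks above can be illustrated with a much simpler model. This is a sketch under stated assumptions, not the kernel implementation: it treats the register as sixteen unsigned 4-bit fields and applies a lower-safe policy to every writable field, whereas the kernel takes field layout, signedness and safe-value type from `struct arm64_ftr_bits`. The names `check_features`, `field_mask` and `field_get` are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define FIELD_WIDTH 4			/* assumption: uniform 4-bit fields */
#define NR_FIELDS (64 / FIELD_WIDTH)

static uint64_t field_mask(unsigned int idx)
{
	assert(idx < NR_FIELDS);
	return (uint64_t)0xf << (idx * FIELD_WIDTH);
}

static unsigned int field_get(uint64_t reg, unsigned int idx)
{
	assert(idx < NR_FIELDS);
	return (unsigned int)((reg >> (idx * FIELD_WIDTH)) & 0xf);
}

/*
 * Accept val only if every writable field is <= the limit's field and
 * every other bit is identical to the limit.
 */
static int check_features(uint64_t val, uint64_t limit, uint64_t writable_mask)
{
	uint64_t checked = 0;
	unsigned int i;

	for (i = 0; i < NR_FIELDS; i++) {
		uint64_t m = field_mask(i);

		/* A field counts as writable only if all of its bits are. */
		if ((m & writable_mask) != m)
			continue;

		checked |= m;

		/* Lower-safe: anything up to the limit is acceptable. */
		if (field_get(val, i) > field_get(limit, i))
			return -E2BIG;
	}

	/* Outside the writable fields, only the limit's value is safe. */
	if ((val & ~checked) != (limit & ~checked))
		return -E2BIG;

	return 0;
}
```

In the diff, set_id_reg() translates the resulting -E2BIG into -EINVAL before returning to userspace; the sketch leaves that step to the caller.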

+17 -5

arch/arm64/kvm/sys_regs.h

···
 	bool is_write;
 };

+#define encoding_to_params(reg) \
+	((struct sys_reg_params){ .Op0 = sys_reg_Op0(reg), \
+				  .Op1 = sys_reg_Op1(reg), \
+				  .CRn = sys_reg_CRn(reg), \
+				  .CRm = sys_reg_CRm(reg), \
+				  .Op2 = sys_reg_Op2(reg) })
+
 #define esr_sys64_to_params(esr) \
 	((struct sys_reg_params){ .Op0 = ((esr) >> 20) & 3, \
 				  .Op1 = ((esr) >> 14) & 0x7, \
···
 		      struct sys_reg_params *,
 		      const struct sys_reg_desc *);

-	/* Initialization for vcpu. */
-	void (*reset)(struct kvm_vcpu *, const struct sys_reg_desc *);
+	/*
+	 * Initialization for vcpu. Return initialized value, or KVM
+	 * sanitized value for ID registers.
+	 */
+	u64 (*reset)(struct kvm_vcpu *, const struct sys_reg_desc *);

 	/* Index into sys_reg[], or 0 if we don't need to save it. */
 	int reg;

-	/* Value (usually reset value) */
+	/* Value (usually reset value), or write mask for idregs */
 	u64 val;

 	/* Custom get/set_user functions, fallback to generic if NULL */
···
 }

 /* Reset functions */
-static inline void reset_unknown(struct kvm_vcpu *vcpu,
+static inline u64 reset_unknown(struct kvm_vcpu *vcpu,
 				 const struct sys_reg_desc *r)
 {
 	BUG_ON(!r->reg);
 	BUG_ON(r->reg >= NR_SYS_REGS);
 	__vcpu_sys_reg(vcpu, r->reg) = 0x1de7ec7edbadc0deULL;
+	return __vcpu_sys_reg(vcpu, r->reg);
 }

-static inline void reset_val(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
+static inline u64 reset_val(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
 {
 	BUG_ON(!r->reg);
 	BUG_ON(r->reg >= NR_SYS_REGS);
 	__vcpu_sys_reg(vcpu, r->reg) = r->val;
+	return __vcpu_sys_reg(vcpu, r->reg);
 }

 static inline unsigned int sysreg_visibility(const struct kvm_vcpu *vcpu,
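The `encoding_to_params()` macro in this hunk splits a packed system register encoding into its five fields. The following is a minimal user-space sketch of that decomposition, not the kernel code itself. The shift and mask values restate the arm64 `sys_reg()` layout (Op0 at bit 19, Op1 at 16, CRn at 12, CRm at 8, Op2 at 5) as an assumption, since they are not part of this diff, and every `demo_`/`DEMO_` name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct sys_reg_params (fields only). */
struct demo_sys_reg_params {
	uint8_t Op0, Op1, CRn, CRm, Op2;
};

/* Assumed arm64 packing: Op0[20:19] Op1[18:16] CRn[15:12] CRm[11:8] Op2[7:5]. */
#define demo_sys_reg(op0, op1, crn, crm, op2)			\
	(((uint32_t)(op0) << 19) | ((uint32_t)(op1) << 16) |	\
	 ((uint32_t)(crn) << 12) | ((uint32_t)(crm) << 8) |	\
	 ((uint32_t)(op2) << 5))

/* Unpack an encoding into its fields, mirroring encoding_to_params(). */
static inline struct demo_sys_reg_params demo_encoding_to_params(uint32_t reg)
{
	return (struct demo_sys_reg_params){
		.Op0 = (reg >> 19) & 0x3,
		.Op1 = (reg >> 16) & 0x7,
		.CRn = (reg >> 12) & 0xf,
		.CRm = (reg >> 8) & 0xf,
		.Op2 = (reg >> 5) & 0x7,
	};
}

/* ID_PFR0_EL1, the first ID register, is Op0=3, Op1=0, CRn=0, CRm=1, Op2=0. */
#define DEMO_SYS_ID_PFR0_EL1	demo_sys_reg(3, 0, 0, 1, 0)
```

Building the lookup key from the encoding constant lets `kvm_sys_reg_table_init()` reuse the ordinary `find_reg()` search to locate the first ID register instead of hard-coding a table index.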

+3

arch/arm64/tools/cpucaps

···
 HAS_ECV
 HAS_ECV_CNTPOFF
 HAS_EPAN
+HAS_EVT
 HAS_GENERIC_AUTH
 HAS_GENERIC_AUTH_ARCH_QARMA3
 HAS_GENERIC_AUTH_ARCH_QARMA5
···
 HAS_VIRT_HOST_EXTN
 HAS_WFXT
 HW_DBM
+KVM_HVHE
 KVM_PROTECTED_MODE
 MISMATCHED_CACHE_TYPE
 MTE
···
 WORKAROUND_2457168
 WORKAROUND_2645198
 WORKAROUND_2658417
+WORKAROUND_AMPERE_AC03_CPU_38
 WORKAROUND_TRBE_OVERWRITE_FILL_MODE
 WORKAROUND_TSB_FLUSH_FAILURE
 WORKAROUND_TRBE_WRITE_OUT_OF_RANGE

+2

arch/riscv/include/asm/csr.h

···
 #define EXC_INST_ACCESS		1
 #define EXC_INST_ILLEGAL	2
 #define EXC_BREAKPOINT		3
+#define EXC_LOAD_MISALIGNED	4
 #define EXC_LOAD_ACCESS		5
+#define EXC_STORE_MISALIGNED	6
 #define EXC_STORE_ACCESS	7
 #define EXC_SYSCALL		8
 #define EXC_HYPERVISOR_SYSCALL	9

+77 -30

arch/riscv/include/asm/kvm_aia.h

··· 20 20 21 21 /* In-kernel irqchip initialized */ 22 22 bool initialized; 23 + 24 + /* Virtualization mode (Emulation, HW Accelerated, or Auto) */ 25 + u32 mode; 26 + 27 + /* Number of MSIs */ 28 + u32 nr_ids; 29 + 30 + /* Number of wired IRQs */ 31 + u32 nr_sources; 32 + 33 + /* Number of group bits in IMSIC address */ 34 + u32 nr_group_bits; 35 + 36 + /* Position of group bits in IMSIC address */ 37 + u32 nr_group_shift; 38 + 39 + /* Number of hart bits in IMSIC address */ 40 + u32 nr_hart_bits; 41 + 42 + /* Number of guest bits in IMSIC address */ 43 + u32 nr_guest_bits; 44 + 45 + /* Guest physical address of APLIC */ 46 + gpa_t aplic_addr; 47 + 48 + /* Internal state of APLIC */ 49 + void *aplic_state; 23 50 }; 24 51 25 52 struct kvm_vcpu_aia_csr { ··· 65 38 66 39 /* CPU AIA CSR context upon Guest VCPU reset */ 67 40 struct kvm_vcpu_aia_csr guest_reset_csr; 41 + 42 + /* Guest physical address of IMSIC for this VCPU */ 43 + gpa_t imsic_addr; 44 + 45 + /* HART index of IMSIC extacted from guest physical address */ 46 + u32 hart_index; 47 + 48 + /* Internal state of IMSIC for this VCPU */ 49 + void *imsic_state; 68 50 }; 51 + 52 + #define KVM_RISCV_AIA_UNDEF_ADDR (-1) 69 53 70 54 #define kvm_riscv_aia_initialized(k) ((k)->arch.aia.initialized) 71 55 72 56 #define irqchip_in_kernel(k) ((k)->arch.aia.in_kernel) 73 57 58 + extern unsigned int kvm_riscv_aia_nr_hgei; 59 + extern unsigned int kvm_riscv_aia_max_ids; 74 60 DECLARE_STATIC_KEY_FALSE(kvm_riscv_aia_available); 75 61 #define kvm_riscv_aia_available() \ 76 62 static_branch_unlikely(&kvm_riscv_aia_available) 77 63 64 + extern struct kvm_device_ops kvm_riscv_aia_device_ops; 65 + 66 + void kvm_riscv_vcpu_aia_imsic_release(struct kvm_vcpu *vcpu); 67 + int kvm_riscv_vcpu_aia_imsic_update(struct kvm_vcpu *vcpu); 68 + 78 69 #define KVM_RISCV_AIA_IMSIC_TOPEI (ISELECT_MASK + 1) 79 - static inline int kvm_riscv_vcpu_aia_imsic_rmw(struct kvm_vcpu *vcpu, 80 - unsigned long isel, 81 - unsigned long *val, 82 - unsigned 
long new_val, 83 - unsigned long wr_mask) 84 - { 85 - return 0; 86 - } 70 + int kvm_riscv_vcpu_aia_imsic_rmw(struct kvm_vcpu *vcpu, unsigned long isel, 71 + unsigned long *val, unsigned long new_val, 72 + unsigned long wr_mask); 73 + int kvm_riscv_aia_imsic_rw_attr(struct kvm *kvm, unsigned long type, 74 + bool write, unsigned long *val); 75 + int kvm_riscv_aia_imsic_has_attr(struct kvm *kvm, unsigned long type); 76 + void kvm_riscv_vcpu_aia_imsic_reset(struct kvm_vcpu *vcpu); 77 + int kvm_riscv_vcpu_aia_imsic_inject(struct kvm_vcpu *vcpu, 78 + u32 guest_index, u32 offset, u32 iid); 79 + int kvm_riscv_vcpu_aia_imsic_init(struct kvm_vcpu *vcpu); 80 + void kvm_riscv_vcpu_aia_imsic_cleanup(struct kvm_vcpu *vcpu); 81 + 82 + int kvm_riscv_aia_aplic_set_attr(struct kvm *kvm, unsigned long type, u32 v); 83 + int kvm_riscv_aia_aplic_get_attr(struct kvm *kvm, unsigned long type, u32 *v); 84 + int kvm_riscv_aia_aplic_has_attr(struct kvm *kvm, unsigned long type); 85 + int kvm_riscv_aia_aplic_inject(struct kvm *kvm, u32 source, bool level); 86 + int kvm_riscv_aia_aplic_init(struct kvm *kvm); 87 + void kvm_riscv_aia_aplic_cleanup(struct kvm *kvm); 87 88 88 89 #ifdef CONFIG_32BIT 89 90 void kvm_riscv_vcpu_aia_flush_interrupts(struct kvm_vcpu *vcpu); ··· 148 93 { .base = CSR_SIREG, .count = 1, .func = kvm_riscv_vcpu_aia_rmw_ireg }, \ 149 94 { .base = CSR_STOPEI, .count = 1, .func = kvm_riscv_vcpu_aia_rmw_topei }, 150 95 151 - static inline int kvm_riscv_vcpu_aia_update(struct kvm_vcpu *vcpu) 152 - { 153 - return 1; 154 - } 96 + int kvm_riscv_vcpu_aia_update(struct kvm_vcpu *vcpu); 97 + void kvm_riscv_vcpu_aia_reset(struct kvm_vcpu *vcpu); 98 + int kvm_riscv_vcpu_aia_init(struct kvm_vcpu *vcpu); 99 + void kvm_riscv_vcpu_aia_deinit(struct kvm_vcpu *vcpu); 155 100 156 - static inline void kvm_riscv_vcpu_aia_reset(struct kvm_vcpu *vcpu) 157 - { 158 - } 101 + int kvm_riscv_aia_inject_msi_by_id(struct kvm *kvm, u32 hart_index, 102 + u32 guest_index, u32 iid); 103 + int 
kvm_riscv_aia_inject_msi(struct kvm *kvm, struct kvm_msi *msi); 104 + int kvm_riscv_aia_inject_irq(struct kvm *kvm, unsigned int irq, bool level); 159 105 160 - static inline int kvm_riscv_vcpu_aia_init(struct kvm_vcpu *vcpu) 161 - { 162 - return 0; 163 - } 106 + void kvm_riscv_aia_init_vm(struct kvm *kvm); 107 + void kvm_riscv_aia_destroy_vm(struct kvm *kvm); 164 108 165 - static inline void kvm_riscv_vcpu_aia_deinit(struct kvm_vcpu *vcpu) 166 - { 167 - } 168 - 169 - static inline void kvm_riscv_aia_init_vm(struct kvm *kvm) 170 - { 171 - } 172 - 173 - static inline void kvm_riscv_aia_destroy_vm(struct kvm *kvm) 174 - { 175 - } 109 + int kvm_riscv_aia_alloc_hgei(int cpu, struct kvm_vcpu *owner, 110 + void __iomem **hgei_va, phys_addr_t *hgei_pa); 111 + void kvm_riscv_aia_free_hgei(int cpu, int hgei); 112 + void kvm_riscv_aia_wakeon_hgei(struct kvm_vcpu *owner, bool enable); 176 113 177 114 void kvm_riscv_aia_enable(void); 178 115 void kvm_riscv_aia_disable(void);

+58

arch/riscv/include/asm/kvm_aia_aplic.h

···
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 Western Digital Corporation or its affiliates.
+ * Copyright (C) 2022 Ventana Micro Systems Inc.
+ */
+#ifndef __KVM_RISCV_AIA_IMSIC_H
+#define __KVM_RISCV_AIA_IMSIC_H
+
+#include <linux/bitops.h>
+
+#define APLIC_MAX_IDC			BIT(14)
+#define APLIC_MAX_SOURCE		1024
+
+#define APLIC_DOMAINCFG			0x0000
+#define APLIC_DOMAINCFG_RDONLY		0x80000000
+#define APLIC_DOMAINCFG_IE		BIT(8)
+#define APLIC_DOMAINCFG_DM		BIT(2)
+#define APLIC_DOMAINCFG_BE		BIT(0)
+
+#define APLIC_SOURCECFG_BASE		0x0004
+#define APLIC_SOURCECFG_D		BIT(10)
+#define APLIC_SOURCECFG_CHILDIDX_MASK	0x000003ff
+#define APLIC_SOURCECFG_SM_MASK		0x00000007
+#define APLIC_SOURCECFG_SM_INACTIVE	0x0
+#define APLIC_SOURCECFG_SM_DETACH	0x1
+#define APLIC_SOURCECFG_SM_EDGE_RISE	0x4
+#define APLIC_SOURCECFG_SM_EDGE_FALL	0x5
+#define APLIC_SOURCECFG_SM_LEVEL_HIGH	0x6
+#define APLIC_SOURCECFG_SM_LEVEL_LOW	0x7
+
+#define APLIC_IRQBITS_PER_REG		32
+
+#define APLIC_SETIP_BASE		0x1c00
+#define APLIC_SETIPNUM			0x1cdc
+
+#define APLIC_CLRIP_BASE		0x1d00
+#define APLIC_CLRIPNUM			0x1ddc
+
+#define APLIC_SETIE_BASE		0x1e00
+#define APLIC_SETIENUM			0x1edc
+
+#define APLIC_CLRIE_BASE		0x1f00
+#define APLIC_CLRIENUM			0x1fdc
+
+#define APLIC_SETIPNUM_LE		0x2000
+#define APLIC_SETIPNUM_BE		0x2004
+
+#define APLIC_GENMSI			0x3000
+
+#define APLIC_TARGET_BASE		0x3004
+#define APLIC_TARGET_HART_IDX_SHIFT	18
+#define APLIC_TARGET_HART_IDX_MASK	0x3fff
+#define APLIC_TARGET_GUEST_IDX_SHIFT	12
+#define APLIC_TARGET_GUEST_IDX_MASK	0x3f
+#define APLIC_TARGET_IPRIO_MASK		0xff
+#define APLIC_TARGET_EIID_MASK		0x7ff
+
+#endif
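The `APLIC_TARGET_*` constants in this header describe how a per-source target register packs a hart index, a guest index and an external interrupt identity (EIID) into one 32-bit word when the APLIC forwards interrupts as MSIs. The sketch below is an illustrative user-space decoder built only from those constants. The `demo_` helpers and struct are hypothetical names, not functions from this patch.

```c
#include <stdint.h>

/* Field constants copied from kvm_aia_aplic.h in this diff. */
#define APLIC_TARGET_HART_IDX_SHIFT	18
#define APLIC_TARGET_HART_IDX_MASK	0x3fff
#define APLIC_TARGET_GUEST_IDX_SHIFT	12
#define APLIC_TARGET_GUEST_IDX_MASK	0x3f
#define APLIC_TARGET_EIID_MASK		0x7ff

/* Hypothetical container for the three decoded fields. */
struct demo_aplic_target {
	uint32_t hart_idx;
	uint32_t guest_idx;
	uint32_t eiid;
};

/* Pack the three fields into a target register value. */
static inline uint32_t demo_encode_target(uint32_t hart, uint32_t guest,
					  uint32_t eiid)
{
	return ((hart & APLIC_TARGET_HART_IDX_MASK) << APLIC_TARGET_HART_IDX_SHIFT) |
	       ((guest & APLIC_TARGET_GUEST_IDX_MASK) << APLIC_TARGET_GUEST_IDX_SHIFT) |
	       (eiid & APLIC_TARGET_EIID_MASK);
}

/* Split a target register value back into its fields. */
static inline struct demo_aplic_target demo_decode_target(uint32_t target)
{
	return (struct demo_aplic_target){
		.hart_idx = (target >> APLIC_TARGET_HART_IDX_SHIFT) &
			    APLIC_TARGET_HART_IDX_MASK,
		.guest_idx = (target >> APLIC_TARGET_GUEST_IDX_SHIFT) &
			     APLIC_TARGET_GUEST_IDX_MASK,
		.eiid = target & APLIC_TARGET_EIID_MASK,
	};
}
```

The decode half follows the same shift-then-mask order that `aplic_inject_msi()` in `aia_aplic.c` uses before handing the fields to `kvm_riscv_aia_inject_msi_by_id()`.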

+38

arch/riscv/include/asm/kvm_aia_imsic.h

···
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 Western Digital Corporation or its affiliates.
+ * Copyright (C) 2022 Ventana Micro Systems Inc.
+ */
+#ifndef __KVM_RISCV_AIA_IMSIC_H
+#define __KVM_RISCV_AIA_IMSIC_H
+
+#include <linux/types.h>
+#include <asm/csr.h>
+
+#define IMSIC_MMIO_PAGE_SHIFT		12
+#define IMSIC_MMIO_PAGE_SZ		(1UL << IMSIC_MMIO_PAGE_SHIFT)
+#define IMSIC_MMIO_PAGE_LE		0x00
+#define IMSIC_MMIO_PAGE_BE		0x04
+
+#define IMSIC_MIN_ID			63
+#define IMSIC_MAX_ID			2048
+
+#define IMSIC_EIDELIVERY		0x70
+
+#define IMSIC_EITHRESHOLD		0x72
+
+#define IMSIC_EIP0			0x80
+#define IMSIC_EIP63			0xbf
+#define IMSIC_EIPx_BITS			32
+
+#define IMSIC_EIE0			0xc0
+#define IMSIC_EIE63			0xff
+#define IMSIC_EIEx_BITS			32
+
+#define IMSIC_FIRST			IMSIC_EIDELIVERY
+#define IMSIC_LAST			IMSIC_EIE63
+
+#define IMSIC_MMIO_SETIPNUM_LE		0x00
+#define IMSIC_MMIO_SETIPNUM_BE		0x04
+
+#endif

+4

arch/riscv/include/asm/kvm_host.h

···
 #define KVM_VCPU_MAX_FEATURES		0

+#define KVM_IRQCHIP_NUM_PINS		1024
+
 #define KVM_REQ_SLEEP \
 	KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_VCPU_RESET		KVM_ARCH_REQ(1)
···
 int kvm_riscv_gstage_vmid_init(struct kvm *kvm);
 bool kvm_riscv_gstage_vmid_ver_changed(struct kvm_vmid *vmid);
 void kvm_riscv_gstage_vmid_update(struct kvm_vcpu *vcpu);
+
+int kvm_riscv_setup_default_irq_routing(struct kvm *kvm, u32 lines);

 void __kvm_riscv_unpriv_trap(void);

+10 -1

arch/riscv/include/asm/kvm_vcpu_sbi.h

···
 #define KVM_SBI_VERSION_MAJOR 1
 #define KVM_SBI_VERSION_MINOR 0

+enum kvm_riscv_sbi_ext_status {
+	KVM_RISCV_SBI_EXT_UNINITIALIZED,
+	KVM_RISCV_SBI_EXT_AVAILABLE,
+	KVM_RISCV_SBI_EXT_UNAVAILABLE,
+};
+
 struct kvm_vcpu_sbi_context {
 	int return_handled;
-	bool extension_disabled[KVM_RISCV_SBI_EXT_MAX];
+	enum kvm_riscv_sbi_ext_status ext_status[KVM_RISCV_SBI_EXT_MAX];
 };

 struct kvm_vcpu_sbi_return {
···
 extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_experimental;
 extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_vendor;

+#ifdef CONFIG_RISCV_PMU_SBI
+extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_pmu;
+#endif
 #endif /* __RISCV_KVM_VCPU_SBI_H__ */

+73

arch/riscv/include/uapi/asm/kvm.h

···
 #include <asm/bitsperlong.h>
 #include <asm/ptrace.h>

+#define __KVM_HAVE_IRQ_LINE
 #define __KVM_HAVE_READONLY_MEM

 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
···
 	KVM_RISCV_ISA_EXT_ZBB,
 	KVM_RISCV_ISA_EXT_SSAIA,
 	KVM_RISCV_ISA_EXT_V,
+	KVM_RISCV_ISA_EXT_SVNAPOT,
 	KVM_RISCV_ISA_EXT_MAX,
 };

···
 		(offsetof(struct __riscv_v_ext_state, name) / sizeof(unsigned long))
 #define KVM_REG_RISCV_VECTOR_REG(n) \
 		((n) + sizeof(struct __riscv_v_ext_state) / sizeof(unsigned long))
+
+/* Device Control API: RISC-V AIA */
+#define KVM_DEV_RISCV_APLIC_ALIGN		0x1000
+#define KVM_DEV_RISCV_APLIC_SIZE		0x4000
+#define KVM_DEV_RISCV_APLIC_MAX_HARTS		0x4000
+#define KVM_DEV_RISCV_IMSIC_ALIGN		0x1000
+#define KVM_DEV_RISCV_IMSIC_SIZE		0x1000
+
+#define KVM_DEV_RISCV_AIA_GRP_CONFIG		0
+#define KVM_DEV_RISCV_AIA_CONFIG_MODE		0
+#define KVM_DEV_RISCV_AIA_CONFIG_IDS		1
+#define KVM_DEV_RISCV_AIA_CONFIG_SRCS		2
+#define KVM_DEV_RISCV_AIA_CONFIG_GROUP_BITS	3
+#define KVM_DEV_RISCV_AIA_CONFIG_GROUP_SHIFT	4
+#define KVM_DEV_RISCV_AIA_CONFIG_HART_BITS	5
+#define KVM_DEV_RISCV_AIA_CONFIG_GUEST_BITS	6
+
+/*
+ * Modes of RISC-V AIA device:
+ * 1) EMUL (aka Emulation): Trap-n-emulate IMSIC
+ * 2) HWACCEL (aka HW Acceleration): Virtualize IMSIC using IMSIC guest files
+ * 3) AUTO (aka Automatic): Virtualize IMSIC using IMSIC guest files whenever
+ *    available otherwise fallback to trap-n-emulation
+ */
+#define KVM_DEV_RISCV_AIA_MODE_EMUL		0
+#define KVM_DEV_RISCV_AIA_MODE_HWACCEL		1
+#define KVM_DEV_RISCV_AIA_MODE_AUTO		2
+
+#define KVM_DEV_RISCV_AIA_IDS_MIN		63
+#define KVM_DEV_RISCV_AIA_IDS_MAX		2048
+#define KVM_DEV_RISCV_AIA_SRCS_MAX		1024
+#define KVM_DEV_RISCV_AIA_GROUP_BITS_MAX	8
+#define KVM_DEV_RISCV_AIA_GROUP_SHIFT_MIN	24
+#define KVM_DEV_RISCV_AIA_GROUP_SHIFT_MAX	56
+#define KVM_DEV_RISCV_AIA_HART_BITS_MAX		16
+#define KVM_DEV_RISCV_AIA_GUEST_BITS_MAX	8
+
+#define KVM_DEV_RISCV_AIA_GRP_ADDR		1
+#define KVM_DEV_RISCV_AIA_ADDR_APLIC		0
+#define KVM_DEV_RISCV_AIA_ADDR_IMSIC(__vcpu)	(1 + (__vcpu))
+#define KVM_DEV_RISCV_AIA_ADDR_MAX \
+		(1 + KVM_DEV_RISCV_APLIC_MAX_HARTS)
+
+#define KVM_DEV_RISCV_AIA_GRP_CTRL		2
+#define KVM_DEV_RISCV_AIA_CTRL_INIT		0
+
+/*
+ * The device attribute type contains the memory mapped offset of the
+ * APLIC register (range 0x0000-0x3FFF) and it must be 4-byte aligned.
+ */
+#define KVM_DEV_RISCV_AIA_GRP_APLIC		3
+
+/*
+ * The lower 12-bits of the device attribute type contains the iselect
+ * value of the IMSIC register (range 0x70-0xFF) whereas the higher order
+ * bits contains the VCPU id.
+ */
+#define KVM_DEV_RISCV_AIA_GRP_IMSIC		4
+#define KVM_DEV_RISCV_AIA_IMSIC_ISEL_BITS	12
+#define KVM_DEV_RISCV_AIA_IMSIC_ISEL_MASK \
+		((1U << KVM_DEV_RISCV_AIA_IMSIC_ISEL_BITS) - 1)
+#define KVM_DEV_RISCV_AIA_IMSIC_MKATTR(__vcpu, __isel) \
+		(((__vcpu) << KVM_DEV_RISCV_AIA_IMSIC_ISEL_BITS) | \
+		 ((__isel) & KVM_DEV_RISCV_AIA_IMSIC_ISEL_MASK))
+#define KVM_DEV_RISCV_AIA_IMSIC_GET_ISEL(__attr) \
+		((__attr) & KVM_DEV_RISCV_AIA_IMSIC_ISEL_MASK)
+#define KVM_DEV_RISCV_AIA_IMSIC_GET_VCPU(__attr) \
+		((__attr) >> KVM_DEV_RISCV_AIA_IMSIC_ISEL_BITS)
+
+/* One single KVM irqchip, ie. the AIA */
+#define KVM_NR_IRQCHIPS			1

 #endif
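The `KVM_DEV_RISCV_AIA_GRP_IMSIC` comment in this hunk says the attribute type carries the IMSIC iselect value in its low 12 bits and the VCPU id in the bits above. As a rough illustration of that packing, the sketch below copies the four macros from the diff unchanged and wraps them in a small round-trip helper. The helper name `demo_imsic_attr_roundtrip` is hypothetical and exists only for this example.

```c
/* Macros copied verbatim from the uapi hunk in this diff. */
#define KVM_DEV_RISCV_AIA_IMSIC_ISEL_BITS	12
#define KVM_DEV_RISCV_AIA_IMSIC_ISEL_MASK \
		((1U << KVM_DEV_RISCV_AIA_IMSIC_ISEL_BITS) - 1)
#define KVM_DEV_RISCV_AIA_IMSIC_MKATTR(__vcpu, __isel) \
		(((__vcpu) << KVM_DEV_RISCV_AIA_IMSIC_ISEL_BITS) | \
		 ((__isel) & KVM_DEV_RISCV_AIA_IMSIC_ISEL_MASK))
#define KVM_DEV_RISCV_AIA_IMSIC_GET_ISEL(__attr) \
		((__attr) & KVM_DEV_RISCV_AIA_IMSIC_ISEL_MASK)
#define KVM_DEV_RISCV_AIA_IMSIC_GET_VCPU(__attr) \
		((__attr) >> KVM_DEV_RISCV_AIA_IMSIC_ISEL_BITS)

/*
 * Hypothetical helper: returns 1 when packing (vcpu, isel) and unpacking
 * the result yields the same pair, 0 otherwise.
 */
static inline int demo_imsic_attr_roundtrip(unsigned int vcpu, unsigned int isel)
{
	unsigned int attr = KVM_DEV_RISCV_AIA_IMSIC_MKATTR(vcpu, isel);

	return KVM_DEV_RISCV_AIA_IMSIC_GET_VCPU(attr) == vcpu &&
	       KVM_DEV_RISCV_AIA_IMSIC_GET_ISEL(attr) == isel;
}
```

For instance, VCPU 3 with iselect `0x70` (the `eidelivery` register) packs to `0x3070`. An iselect wider than 12 bits is truncated by the mask, so it does not survive the round trip.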

+4

arch/riscv/kvm/Kconfig

···
 	tristate "Kernel-based Virtual Machine (KVM) support (EXPERIMENTAL)"
 	depends on RISCV_SBI && MMU
 	select HAVE_KVM_EVENTFD
+	select HAVE_KVM_IRQCHIP
+	select HAVE_KVM_IRQFD
+	select HAVE_KVM_IRQ_ROUTING
+	select HAVE_KVM_MSI
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select KVM_GENERIC_HARDWARE_ENABLING

+3

arch/riscv/kvm/Makefile

···
 kvm-y += vcpu_timer.o
 kvm-$(CONFIG_RISCV_PMU_SBI) += vcpu_pmu.o vcpu_sbi_pmu.o
 kvm-y += aia.o
+kvm-y += aia_device.o
+kvm-y += aia_aplic.o
+kvm-y += aia_imsic.o

+272 -2

arch/riscv/kvm/aia.c

··· 8 8 */ 9 9 10 10 #include <linux/kernel.h> 11 + #include <linux/bitops.h> 12 + #include <linux/irq.h> 13 + #include <linux/irqdomain.h> 11 14 #include <linux/kvm_host.h> 15 + #include <linux/percpu.h> 16 + #include <linux/spinlock.h> 12 17 #include <asm/hwcap.h> 18 + #include <asm/kvm_aia_imsic.h> 13 19 20 + struct aia_hgei_control { 21 + raw_spinlock_t lock; 22 + unsigned long free_bitmap; 23 + struct kvm_vcpu *owners[BITS_PER_LONG]; 24 + }; 25 + static DEFINE_PER_CPU(struct aia_hgei_control, aia_hgei); 26 + static int hgei_parent_irq; 27 + 28 + unsigned int kvm_riscv_aia_nr_hgei; 29 + unsigned int kvm_riscv_aia_max_ids; 14 30 DEFINE_STATIC_KEY_FALSE(kvm_riscv_aia_available); 31 + 32 + static int aia_find_hgei(struct kvm_vcpu *owner) 33 + { 34 + int i, hgei; 35 + unsigned long flags; 36 + struct aia_hgei_control *hgctrl = get_cpu_ptr(&aia_hgei); 37 + 38 + raw_spin_lock_irqsave(&hgctrl->lock, flags); 39 + 40 + hgei = -1; 41 + for (i = 1; i <= kvm_riscv_aia_nr_hgei; i++) { 42 + if (hgctrl->owners[i] == owner) { 43 + hgei = i; 44 + break; 45 + } 46 + } 47 + 48 + raw_spin_unlock_irqrestore(&hgctrl->lock, flags); 49 + 50 + put_cpu_ptr(&aia_hgei); 51 + return hgei; 52 + } 15 53 16 54 static void aia_set_hvictl(bool ext_irq_pending) 17 55 { ··· 94 56 95 57 bool kvm_riscv_vcpu_aia_has_interrupts(struct kvm_vcpu *vcpu, u64 mask) 96 58 { 59 + int hgei; 97 60 unsigned long seip; 98 61 99 62 if (!kvm_riscv_aia_available()) ··· 112 73 113 74 if (!kvm_riscv_aia_initialized(vcpu->kvm) || !seip) 114 75 return false; 76 + 77 + hgei = aia_find_hgei(vcpu); 78 + if (hgei > 0) 79 + return !!(csr_read(CSR_HGEIP) & BIT(hgei)); 115 80 116 81 return false; 117 82 } ··· 366 323 return KVM_INSN_CONTINUE_NEXT_SEPC; 367 324 } 368 325 369 - #define IMSIC_FIRST 0x70 370 - #define IMSIC_LAST 0xff 371 326 int kvm_riscv_vcpu_aia_rmw_ireg(struct kvm_vcpu *vcpu, unsigned int csr_num, 372 327 unsigned long *val, unsigned long new_val, 373 328 unsigned long wr_mask) ··· 389 348 return 
KVM_INSN_EXIT_TO_USER_SPACE; 390 349 } 391 350 351 + int kvm_riscv_aia_alloc_hgei(int cpu, struct kvm_vcpu *owner, 352 + void __iomem **hgei_va, phys_addr_t *hgei_pa) 353 + { 354 + int ret = -ENOENT; 355 + unsigned long flags; 356 + struct aia_hgei_control *hgctrl = per_cpu_ptr(&aia_hgei, cpu); 357 + 358 + if (!kvm_riscv_aia_available() || !hgctrl) 359 + return -ENODEV; 360 + 361 + raw_spin_lock_irqsave(&hgctrl->lock, flags); 362 + 363 + if (hgctrl->free_bitmap) { 364 + ret = __ffs(hgctrl->free_bitmap); 365 + hgctrl->free_bitmap &= ~BIT(ret); 366 + hgctrl->owners[ret] = owner; 367 + } 368 + 369 + raw_spin_unlock_irqrestore(&hgctrl->lock, flags); 370 + 371 + /* TODO: To be updated later by AIA IMSIC HW guest file support */ 372 + if (hgei_va) 373 + *hgei_va = NULL; 374 + if (hgei_pa) 375 + *hgei_pa = 0; 376 + 377 + return ret; 378 + } 379 + 380 + void kvm_riscv_aia_free_hgei(int cpu, int hgei) 381 + { 382 + unsigned long flags; 383 + struct aia_hgei_control *hgctrl = per_cpu_ptr(&aia_hgei, cpu); 384 + 385 + if (!kvm_riscv_aia_available() || !hgctrl) 386 + return; 387 + 388 + raw_spin_lock_irqsave(&hgctrl->lock, flags); 389 + 390 + if (hgei > 0 && hgei <= kvm_riscv_aia_nr_hgei) { 391 + if (!(hgctrl->free_bitmap & BIT(hgei))) { 392 + hgctrl->free_bitmap |= BIT(hgei); 393 + hgctrl->owners[hgei] = NULL; 394 + } 395 + } 396 + 397 + raw_spin_unlock_irqrestore(&hgctrl->lock, flags); 398 + } 399 + 400 + void kvm_riscv_aia_wakeon_hgei(struct kvm_vcpu *owner, bool enable) 401 + { 402 + int hgei; 403 + 404 + if (!kvm_riscv_aia_available()) 405 + return; 406 + 407 + hgei = aia_find_hgei(owner); 408 + if (hgei > 0) { 409 + if (enable) 410 + csr_set(CSR_HGEIE, BIT(hgei)); 411 + else 412 + csr_clear(CSR_HGEIE, BIT(hgei)); 413 + } 414 + } 415 + 416 + static irqreturn_t hgei_interrupt(int irq, void *dev_id) 417 + { 418 + int i; 419 + unsigned long hgei_mask, flags; 420 + struct aia_hgei_control *hgctrl = get_cpu_ptr(&aia_hgei); 421 + 422 + hgei_mask = csr_read(CSR_HGEIP) & 
csr_read(CSR_HGEIE); 423 + csr_clear(CSR_HGEIE, hgei_mask); 424 + 425 + raw_spin_lock_irqsave(&hgctrl->lock, flags); 426 + 427 + for_each_set_bit(i, &hgei_mask, BITS_PER_LONG) { 428 + if (hgctrl->owners[i]) 429 + kvm_vcpu_kick(hgctrl->owners[i]); 430 + } 431 + 432 + raw_spin_unlock_irqrestore(&hgctrl->lock, flags); 433 + 434 + put_cpu_ptr(&aia_hgei); 435 + return IRQ_HANDLED; 436 + } 437 + 438 + static int aia_hgei_init(void) 439 + { 440 + int cpu, rc; 441 + struct irq_domain *domain; 442 + struct aia_hgei_control *hgctrl; 443 + 444 + /* Initialize per-CPU guest external interrupt line management */ 445 + for_each_possible_cpu(cpu) { 446 + hgctrl = per_cpu_ptr(&aia_hgei, cpu); 447 + raw_spin_lock_init(&hgctrl->lock); 448 + if (kvm_riscv_aia_nr_hgei) { 449 + hgctrl->free_bitmap = 450 + BIT(kvm_riscv_aia_nr_hgei + 1) - 1; 451 + hgctrl->free_bitmap &= ~BIT(0); 452 + } else 453 + hgctrl->free_bitmap = 0; 454 + } 455 + 456 + /* Find INTC irq domain */ 457 + domain = irq_find_matching_fwnode(riscv_get_intc_hwnode(), 458 + DOMAIN_BUS_ANY); 459 + if (!domain) { 460 + kvm_err("unable to find INTC domain\n"); 461 + return -ENOENT; 462 + } 463 + 464 + /* Map per-CPU SGEI interrupt from INTC domain */ 465 + hgei_parent_irq = irq_create_mapping(domain, IRQ_S_GEXT); 466 + if (!hgei_parent_irq) { 467 + kvm_err("unable to map SGEI IRQ\n"); 468 + return -ENOMEM; 469 + } 470 + 471 + /* Request per-CPU SGEI interrupt */ 472 + rc = request_percpu_irq(hgei_parent_irq, hgei_interrupt, 473 + "riscv-kvm", &aia_hgei); 474 + if (rc) { 475 + kvm_err("failed to request SGEI IRQ\n"); 476 + return rc; 477 + } 478 + 479 + return 0; 480 + } 481 + 482 + static void aia_hgei_exit(void) 483 + { 484 + /* Free per-CPU SGEI interrupt */ 485 + free_percpu_irq(hgei_parent_irq, &aia_hgei); 486 + } 487 + 392 488 void kvm_riscv_aia_enable(void) 393 489 { 394 490 if (!kvm_riscv_aia_available()) ··· 540 362 csr_write(CSR_HVIPRIO1H, 0x0); 541 363 csr_write(CSR_HVIPRIO2H, 0x0); 542 364 #endif 365 + 366 + /* 
Enable per-CPU SGEI interrupt */ 367 + enable_percpu_irq(hgei_parent_irq, 368 + irq_get_trigger_type(hgei_parent_irq)); 369 + csr_set(CSR_HIE, BIT(IRQ_S_GEXT)); 543 370 } 544 371 545 372 void kvm_riscv_aia_disable(void) 546 373 { 374 + int i; 375 + unsigned long flags; 376 + struct kvm_vcpu *vcpu; 377 + struct aia_hgei_control *hgctrl; 378 + 547 379 if (!kvm_riscv_aia_available()) 548 380 return; 381 + hgctrl = get_cpu_ptr(&aia_hgei); 382 + 383 + /* Disable per-CPU SGEI interrupt */ 384 + csr_clear(CSR_HIE, BIT(IRQ_S_GEXT)); 385 + disable_percpu_irq(hgei_parent_irq); 549 386 550 387 aia_set_hvictl(false); 388 + 389 + raw_spin_lock_irqsave(&hgctrl->lock, flags); 390 + 391 + for (i = 0; i <= kvm_riscv_aia_nr_hgei; i++) { 392 + vcpu = hgctrl->owners[i]; 393 + if (!vcpu) 394 + continue; 395 + 396 + /* 397 + * We release hgctrl->lock before notifying IMSIC 398 + * so that we don't have lock ordering issues. 399 + */ 400 + raw_spin_unlock_irqrestore(&hgctrl->lock, flags); 401 + 402 + /* Notify IMSIC */ 403 + kvm_riscv_vcpu_aia_imsic_release(vcpu); 404 + 405 + /* 406 + * Wakeup VCPU if it was blocked so that it can 407 + * run on other HARTs 408 + */ 409 + if (csr_read(CSR_HGEIE) & BIT(i)) { 410 + csr_clear(CSR_HGEIE, BIT(i)); 411 + kvm_vcpu_kick(vcpu); 412 + } 413 + 414 + raw_spin_lock_irqsave(&hgctrl->lock, flags); 415 + } 416 + 417 + raw_spin_unlock_irqrestore(&hgctrl->lock, flags); 418 + 419 + put_cpu_ptr(&aia_hgei); 551 420 } 552 421 553 422 int kvm_riscv_aia_init(void) 554 423 { 424 + int rc; 425 + 555 426 if (!riscv_isa_extension_available(NULL, SxAIA)) 556 427 return -ENODEV; 428 + 429 + /* Figure-out number of bits in HGEIE */ 430 + csr_write(CSR_HGEIE, -1UL); 431 + kvm_riscv_aia_nr_hgei = fls_long(csr_read(CSR_HGEIE)); 432 + csr_write(CSR_HGEIE, 0); 433 + if (kvm_riscv_aia_nr_hgei) 434 + kvm_riscv_aia_nr_hgei--; 435 + 436 + /* 437 + * Number of usable HGEI lines should be minimum of per-HART 438 + * IMSIC guest files and number of bits in HGEIE 439 + * 440 + * 
TODO: To be updated later by AIA IMSIC HW guest file support 441 + */ 442 + kvm_riscv_aia_nr_hgei = 0; 443 + 444 + /* 445 + * Find number of guest MSI IDs 446 + * 447 + * TODO: To be updated later by AIA IMSIC HW guest file support 448 + */ 449 + kvm_riscv_aia_max_ids = IMSIC_MAX_ID; 450 + 451 + /* Initialize guest external interrupt line management */ 452 + rc = aia_hgei_init(); 453 + if (rc) 454 + return rc; 455 + 456 + /* Register device operations */ 457 + rc = kvm_register_device_ops(&kvm_riscv_aia_device_ops, 458 + KVM_DEV_TYPE_RISCV_AIA); 459 + if (rc) { 460 + aia_hgei_exit(); 461 + return rc; 462 + } 557 463 558 464 /* Enable KVM AIA support */ 559 465 static_branch_enable(&kvm_riscv_aia_available); ··· 647 385 648 386 void kvm_riscv_aia_exit(void) 649 387 { 388 + if (!kvm_riscv_aia_available()) 389 + return; 390 + 391 + /* Unregister device operations */ 392 + kvm_unregister_device_ops(KVM_DEV_TYPE_RISCV_AIA); 393 + 394 + /* Cleanup the HGEI state */ 395 + aia_hgei_exit(); 650 396 }
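The HGEI bookkeeping in `aia.c` keeps one free-line bitmap per CPU. `aia_hgei_init()` marks lines 1 through `kvm_riscv_aia_nr_hgei` as free and never hands out line 0. `kvm_riscv_aia_alloc_hgei()` takes the lowest set bit, and `kvm_riscv_aia_free_hgei()` puts a line back. The sketch below models only that bitmap logic in user space. It is an illustration under stated assumptions, not the kernel implementation: it omits the per-CPU storage, the raw spinlock and the `owners[]` array, returns `-1` where the kernel returns `-ENOENT`, and uses the GCC/Clang builtin `__builtin_ctzl` in place of the kernel's `__ffs()`. All `demo_` names are hypothetical.

```c
#define DEMO_BIT(n)	(1UL << (n))

/* Hypothetical single-CPU stand-in for struct aia_hgei_control. */
struct demo_hgei_control {
	unsigned int nr_hgei;		/* number of usable lines */
	unsigned long free_bitmap;	/* bit n set => line n is free */
};

/* Mark lines 1..nr_hgei free; bit 0 stays clear so line 0 is never used. */
static inline void demo_hgei_init(struct demo_hgei_control *c,
				  unsigned int nr_hgei)
{
	c->nr_hgei = nr_hgei;
	if (nr_hgei) {
		/* Assumes nr_hgei + 1 is smaller than the width of long. */
		c->free_bitmap = DEMO_BIT(nr_hgei + 1) - 1;
		c->free_bitmap &= ~DEMO_BIT(0);
	} else {
		c->free_bitmap = 0;
	}
}

/* Allocate the lowest free line, or return -1 when none is left. */
static inline int demo_hgei_alloc(struct demo_hgei_control *c)
{
	int line;

	if (!c->free_bitmap)
		return -1;

	line = __builtin_ctzl(c->free_bitmap);
	c->free_bitmap &= ~DEMO_BIT(line);
	return line;
}

/* Return a line to the pool; out-of-range lines are ignored. */
static inline void demo_hgei_free(struct demo_hgei_control *c, int hgei)
{
	if (hgei > 0 && (unsigned int)hgei <= c->nr_hgei)
		c->free_bitmap |= DEMO_BIT(hgei);
}
```

Note that `kvm_riscv_aia_init()` in this diff still forces `kvm_riscv_aia_nr_hgei` to 0 (marked TODO pending IMSIC guest file support), so in the patch as merged the bitmap starts empty and every allocation fails with `-ENOENT`.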

+619

arch/riscv/kvm/aia_aplic.c

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright (C) 2021 Western Digital Corporation or its affiliates. 4 + * Copyright (C) 2022 Ventana Micro Systems Inc. 5 + * 6 + * Authors: 7 + * Anup Patel <apatel@ventanamicro.com> 8 + */ 9 + 10 + #include <linux/kvm_host.h> 11 + #include <linux/math.h> 12 + #include <linux/spinlock.h> 13 + #include <linux/swab.h> 14 + #include <kvm/iodev.h> 15 + #include <asm/kvm_aia_aplic.h> 16 + 17 + struct aplic_irq { 18 + raw_spinlock_t lock; 19 + u32 sourcecfg; 20 + u32 state; 21 + #define APLIC_IRQ_STATE_PENDING BIT(0) 22 + #define APLIC_IRQ_STATE_ENABLED BIT(1) 23 + #define APLIC_IRQ_STATE_ENPEND (APLIC_IRQ_STATE_PENDING | \ 24 + APLIC_IRQ_STATE_ENABLED) 25 + #define APLIC_IRQ_STATE_INPUT BIT(8) 26 + u32 target; 27 + }; 28 + 29 + struct aplic { 30 + struct kvm_io_device iodev; 31 + 32 + u32 domaincfg; 33 + u32 genmsi; 34 + 35 + u32 nr_irqs; 36 + u32 nr_words; 37 + struct aplic_irq *irqs; 38 + }; 39 + 40 + static u32 aplic_read_sourcecfg(struct aplic *aplic, u32 irq) 41 + { 42 + u32 ret; 43 + unsigned long flags; 44 + struct aplic_irq *irqd; 45 + 46 + if (!irq || aplic->nr_irqs <= irq) 47 + return 0; 48 + irqd = &aplic->irqs[irq]; 49 + 50 + raw_spin_lock_irqsave(&irqd->lock, flags); 51 + ret = irqd->sourcecfg; 52 + raw_spin_unlock_irqrestore(&irqd->lock, flags); 53 + 54 + return ret; 55 + } 56 + 57 + static void aplic_write_sourcecfg(struct aplic *aplic, u32 irq, u32 val) 58 + { 59 + unsigned long flags; 60 + struct aplic_irq *irqd; 61 + 62 + if (!irq || aplic->nr_irqs <= irq) 63 + return; 64 + irqd = &aplic->irqs[irq]; 65 + 66 + if (val & APLIC_SOURCECFG_D) 67 + val = 0; 68 + else 69 + val &= APLIC_SOURCECFG_SM_MASK; 70 + 71 + raw_spin_lock_irqsave(&irqd->lock, flags); 72 + irqd->sourcecfg = val; 73 + raw_spin_unlock_irqrestore(&irqd->lock, flags); 74 + } 75 + 76 + static u32 aplic_read_target(struct aplic *aplic, u32 irq) 77 + { 78 + u32 ret; 79 + unsigned long flags; 80 + struct aplic_irq *irqd; 81 + 82 + if (!irq 
|| aplic->nr_irqs <= irq) 83 + return 0; 84 + irqd = &aplic->irqs[irq]; 85 + 86 + raw_spin_lock_irqsave(&irqd->lock, flags); 87 + ret = irqd->target; 88 + raw_spin_unlock_irqrestore(&irqd->lock, flags); 89 + 90 + return ret; 91 + } 92 + 93 + static void aplic_write_target(struct aplic *aplic, u32 irq, u32 val) 94 + { 95 + unsigned long flags; 96 + struct aplic_irq *irqd; 97 + 98 + if (!irq || aplic->nr_irqs <= irq) 99 + return; 100 + irqd = &aplic->irqs[irq]; 101 + 102 + val &= APLIC_TARGET_EIID_MASK | 103 + (APLIC_TARGET_HART_IDX_MASK << APLIC_TARGET_HART_IDX_SHIFT) | 104 + (APLIC_TARGET_GUEST_IDX_MASK << APLIC_TARGET_GUEST_IDX_SHIFT); 105 + 106 + raw_spin_lock_irqsave(&irqd->lock, flags); 107 + irqd->target = val; 108 + raw_spin_unlock_irqrestore(&irqd->lock, flags); 109 + } 110 + 111 + static bool aplic_read_pending(struct aplic *aplic, u32 irq) 112 + { 113 + bool ret; 114 + unsigned long flags; 115 + struct aplic_irq *irqd; 116 + 117 + if (!irq || aplic->nr_irqs <= irq) 118 + return false; 119 + irqd = &aplic->irqs[irq]; 120 + 121 + raw_spin_lock_irqsave(&irqd->lock, flags); 122 + ret = (irqd->state & APLIC_IRQ_STATE_PENDING) ? 
true : false; 123 + raw_spin_unlock_irqrestore(&irqd->lock, flags); 124 + 125 + return ret; 126 + } 127 + 128 + static void aplic_write_pending(struct aplic *aplic, u32 irq, bool pending) 129 + { 130 + unsigned long flags, sm; 131 + struct aplic_irq *irqd; 132 + 133 + if (!irq || aplic->nr_irqs <= irq) 134 + return; 135 + irqd = &aplic->irqs[irq]; 136 + 137 + raw_spin_lock_irqsave(&irqd->lock, flags); 138 + 139 + sm = irqd->sourcecfg & APLIC_SOURCECFG_SM_MASK; 140 + if (!pending && 141 + ((sm == APLIC_SOURCECFG_SM_LEVEL_HIGH) || 142 + (sm == APLIC_SOURCECFG_SM_LEVEL_LOW))) 143 + goto skip_write_pending; 144 + 145 + if (pending) 146 + irqd->state |= APLIC_IRQ_STATE_PENDING; 147 + else 148 + irqd->state &= ~APLIC_IRQ_STATE_PENDING; 149 + 150 + skip_write_pending: 151 + raw_spin_unlock_irqrestore(&irqd->lock, flags); 152 + } 153 + 154 + static bool aplic_read_enabled(struct aplic *aplic, u32 irq) 155 + { 156 + bool ret; 157 + unsigned long flags; 158 + struct aplic_irq *irqd; 159 + 160 + if (!irq || aplic->nr_irqs <= irq) 161 + return false; 162 + irqd = &aplic->irqs[irq]; 163 + 164 + raw_spin_lock_irqsave(&irqd->lock, flags); 165 + ret = (irqd->state & APLIC_IRQ_STATE_ENABLED) ? 
true : false; 166 + raw_spin_unlock_irqrestore(&irqd->lock, flags); 167 + 168 + return ret; 169 + } 170 + 171 + static void aplic_write_enabled(struct aplic *aplic, u32 irq, bool enabled) 172 + { 173 + unsigned long flags; 174 + struct aplic_irq *irqd; 175 + 176 + if (!irq || aplic->nr_irqs <= irq) 177 + return; 178 + irqd = &aplic->irqs[irq]; 179 + 180 + raw_spin_lock_irqsave(&irqd->lock, flags); 181 + if (enabled) 182 + irqd->state |= APLIC_IRQ_STATE_ENABLED; 183 + else 184 + irqd->state &= ~APLIC_IRQ_STATE_ENABLED; 185 + raw_spin_unlock_irqrestore(&irqd->lock, flags); 186 + } 187 + 188 + static bool aplic_read_input(struct aplic *aplic, u32 irq) 189 + { 190 + bool ret; 191 + unsigned long flags; 192 + struct aplic_irq *irqd; 193 + 194 + if (!irq || aplic->nr_irqs <= irq) 195 + return false; 196 + irqd = &aplic->irqs[irq]; 197 + 198 + raw_spin_lock_irqsave(&irqd->lock, flags); 199 + ret = (irqd->state & APLIC_IRQ_STATE_INPUT) ? true : false; 200 + raw_spin_unlock_irqrestore(&irqd->lock, flags); 201 + 202 + return ret; 203 + } 204 + 205 + static void aplic_inject_msi(struct kvm *kvm, u32 irq, u32 target) 206 + { 207 + u32 hart_idx, guest_idx, eiid; 208 + 209 + hart_idx = target >> APLIC_TARGET_HART_IDX_SHIFT; 210 + hart_idx &= APLIC_TARGET_HART_IDX_MASK; 211 + guest_idx = target >> APLIC_TARGET_GUEST_IDX_SHIFT; 212 + guest_idx &= APLIC_TARGET_GUEST_IDX_MASK; 213 + eiid = target & APLIC_TARGET_EIID_MASK; 214 + kvm_riscv_aia_inject_msi_by_id(kvm, hart_idx, guest_idx, eiid); 215 + } 216 + 217 + static void aplic_update_irq_range(struct kvm *kvm, u32 first, u32 last) 218 + { 219 + bool inject; 220 + u32 irq, target; 221 + unsigned long flags; 222 + struct aplic_irq *irqd; 223 + struct aplic *aplic = kvm->arch.aia.aplic_state; 224 + 225 + if (!(aplic->domaincfg & APLIC_DOMAINCFG_IE)) 226 + return; 227 + 228 + for (irq = first; irq <= last; irq++) { 229 + if (!irq || aplic->nr_irqs <= irq) 230 + continue; 231 + irqd = &aplic->irqs[irq]; 232 + 233 + 
raw_spin_lock_irqsave(&irqd->lock, flags); 234 + 235 + inject = false; 236 + target = irqd->target; 237 + if ((irqd->state & APLIC_IRQ_STATE_ENPEND) == 238 + APLIC_IRQ_STATE_ENPEND) { 239 + irqd->state &= ~APLIC_IRQ_STATE_PENDING; 240 + inject = true; 241 + } 242 + 243 + raw_spin_unlock_irqrestore(&irqd->lock, flags); 244 + 245 + if (inject) 246 + aplic_inject_msi(kvm, irq, target); 247 + } 248 + } 249 + 250 + int kvm_riscv_aia_aplic_inject(struct kvm *kvm, u32 source, bool level) 251 + { 252 + u32 target; 253 + bool inject = false, ie; 254 + unsigned long flags; 255 + struct aplic_irq *irqd; 256 + struct aplic *aplic = kvm->arch.aia.aplic_state; 257 + 258 + if (!aplic || !source || (aplic->nr_irqs <= source)) 259 + return -ENODEV; 260 + irqd = &aplic->irqs[source]; 261 + ie = (aplic->domaincfg & APLIC_DOMAINCFG_IE) ? true : false; 262 + 263 + raw_spin_lock_irqsave(&irqd->lock, flags); 264 + 265 + if (irqd->sourcecfg & APLIC_SOURCECFG_D) 266 + goto skip_unlock; 267 + 268 + switch (irqd->sourcecfg & APLIC_SOURCECFG_SM_MASK) { 269 + case APLIC_SOURCECFG_SM_EDGE_RISE: 270 + if (level && !(irqd->state & APLIC_IRQ_STATE_INPUT) && 271 + !(irqd->state & APLIC_IRQ_STATE_PENDING)) 272 + irqd->state |= APLIC_IRQ_STATE_PENDING; 273 + break; 274 + case APLIC_SOURCECFG_SM_EDGE_FALL: 275 + if (!level && (irqd->state & APLIC_IRQ_STATE_INPUT) && 276 + !(irqd->state & APLIC_IRQ_STATE_PENDING)) 277 + irqd->state |= APLIC_IRQ_STATE_PENDING; 278 + break; 279 + case APLIC_SOURCECFG_SM_LEVEL_HIGH: 280 + if (level && !(irqd->state & APLIC_IRQ_STATE_PENDING)) 281 + irqd->state |= APLIC_IRQ_STATE_PENDING; 282 + break; 283 + case APLIC_SOURCECFG_SM_LEVEL_LOW: 284 + if (!level && !(irqd->state & APLIC_IRQ_STATE_PENDING)) 285 + irqd->state |= APLIC_IRQ_STATE_PENDING; 286 + break; 287 + } 288 + 289 + if (level) 290 + irqd->state |= APLIC_IRQ_STATE_INPUT; 291 + else 292 + irqd->state &= ~APLIC_IRQ_STATE_INPUT; 293 + 294 + target = irqd->target; 295 + if (ie && ((irqd->state & 
APLIC_IRQ_STATE_ENPEND) == 296 + APLIC_IRQ_STATE_ENPEND)) { 297 + irqd->state &= ~APLIC_IRQ_STATE_PENDING; 298 + inject = true; 299 + } 300 + 301 + skip_unlock: 302 + raw_spin_unlock_irqrestore(&irqd->lock, flags); 303 + 304 + if (inject) 305 + aplic_inject_msi(kvm, source, target); 306 + 307 + return 0; 308 + } 309 + 310 + static u32 aplic_read_input_word(struct aplic *aplic, u32 word) 311 + { 312 + u32 i, ret = 0; 313 + 314 + for (i = 0; i < 32; i++) 315 + ret |= aplic_read_input(aplic, word * 32 + i) ? BIT(i) : 0; 316 + 317 + return ret; 318 + } 319 + 320 + static u32 aplic_read_pending_word(struct aplic *aplic, u32 word) 321 + { 322 + u32 i, ret = 0; 323 + 324 + for (i = 0; i < 32; i++) 325 + ret |= aplic_read_pending(aplic, word * 32 + i) ? BIT(i) : 0; 326 + 327 + return ret; 328 + } 329 + 330 + static void aplic_write_pending_word(struct aplic *aplic, u32 word, 331 + u32 val, bool pending) 332 + { 333 + u32 i; 334 + 335 + for (i = 0; i < 32; i++) { 336 + if (val & BIT(i)) 337 + aplic_write_pending(aplic, word * 32 + i, pending); 338 + } 339 + } 340 + 341 + static u32 aplic_read_enabled_word(struct aplic *aplic, u32 word) 342 + { 343 + u32 i, ret = 0; 344 + 345 + for (i = 0; i < 32; i++) 346 + ret |= aplic_read_enabled(aplic, word * 32 + i) ? 
BIT(i) : 0; 347 + 348 + return ret; 349 + } 350 + 351 + static void aplic_write_enabled_word(struct aplic *aplic, u32 word, 352 + u32 val, bool enabled) 353 + { 354 + u32 i; 355 + 356 + for (i = 0; i < 32; i++) { 357 + if (val & BIT(i)) 358 + aplic_write_enabled(aplic, word * 32 + i, enabled); 359 + } 360 + } 361 + 362 + static int aplic_mmio_read_offset(struct kvm *kvm, gpa_t off, u32 *val32) 363 + { 364 + u32 i; 365 + struct aplic *aplic = kvm->arch.aia.aplic_state; 366 + 367 + if ((off & 0x3) != 0) 368 + return -EOPNOTSUPP; 369 + 370 + if (off == APLIC_DOMAINCFG) { 371 + *val32 = APLIC_DOMAINCFG_RDONLY | 372 + aplic->domaincfg | APLIC_DOMAINCFG_DM; 373 + } else if ((off >= APLIC_SOURCECFG_BASE) && 374 + (off < (APLIC_SOURCECFG_BASE + (aplic->nr_irqs - 1) * 4))) { 375 + i = ((off - APLIC_SOURCECFG_BASE) >> 2) + 1; 376 + *val32 = aplic_read_sourcecfg(aplic, i); 377 + } else if ((off >= APLIC_SETIP_BASE) && 378 + (off < (APLIC_SETIP_BASE + aplic->nr_words * 4))) { 379 + i = (off - APLIC_SETIP_BASE) >> 2; 380 + *val32 = aplic_read_pending_word(aplic, i); 381 + } else if (off == APLIC_SETIPNUM) { 382 + *val32 = 0; 383 + } else if ((off >= APLIC_CLRIP_BASE) && 384 + (off < (APLIC_CLRIP_BASE + aplic->nr_words * 4))) { 385 + i = (off - APLIC_CLRIP_BASE) >> 2; 386 + *val32 = aplic_read_input_word(aplic, i); 387 + } else if (off == APLIC_CLRIPNUM) { 388 + *val32 = 0; 389 + } else if ((off >= APLIC_SETIE_BASE) && 390 + (off < (APLIC_SETIE_BASE + aplic->nr_words * 4))) { 391 + i = (off - APLIC_SETIE_BASE) >> 2; 392 + *val32 = aplic_read_enabled_word(aplic, i); 393 + } else if (off == APLIC_SETIENUM) { 394 + *val32 = 0; 395 + } else if ((off >= APLIC_CLRIE_BASE) && 396 + (off < (APLIC_CLRIE_BASE + aplic->nr_words * 4))) { 397 + *val32 = 0; 398 + } else if (off == APLIC_CLRIENUM) { 399 + *val32 = 0; 400 + } else if (off == APLIC_SETIPNUM_LE) { 401 + *val32 = 0; 402 + } else if (off == APLIC_SETIPNUM_BE) { 403 + *val32 = 0; 404 + } else if (off == APLIC_GENMSI) { 405 + *val32 
= aplic->genmsi; 406 + } else if ((off >= APLIC_TARGET_BASE) && 407 + (off < (APLIC_TARGET_BASE + (aplic->nr_irqs - 1) * 4))) { 408 + i = ((off - APLIC_TARGET_BASE) >> 2) + 1; 409 + *val32 = aplic_read_target(aplic, i); 410 + } else 411 + return -ENODEV; 412 + 413 + return 0; 414 + } 415 + 416 + static int aplic_mmio_read(struct kvm_vcpu *vcpu, struct kvm_io_device *dev, 417 + gpa_t addr, int len, void *val) 418 + { 419 + if (len != 4) 420 + return -EOPNOTSUPP; 421 + 422 + return aplic_mmio_read_offset(vcpu->kvm, 423 + addr - vcpu->kvm->arch.aia.aplic_addr, 424 + val); 425 + } 426 + 427 + static int aplic_mmio_write_offset(struct kvm *kvm, gpa_t off, u32 val32) 428 + { 429 + u32 i; 430 + struct aplic *aplic = kvm->arch.aia.aplic_state; 431 + 432 + if ((off & 0x3) != 0) 433 + return -EOPNOTSUPP; 434 + 435 + if (off == APLIC_DOMAINCFG) { 436 + /* Only IE bit writeable */ 437 + aplic->domaincfg = val32 & APLIC_DOMAINCFG_IE; 438 + } else if ((off >= APLIC_SOURCECFG_BASE) && 439 + (off < (APLIC_SOURCECFG_BASE + (aplic->nr_irqs - 1) * 4))) { 440 + i = ((off - APLIC_SOURCECFG_BASE) >> 2) + 1; 441 + aplic_write_sourcecfg(aplic, i, val32); 442 + } else if ((off >= APLIC_SETIP_BASE) && 443 + (off < (APLIC_SETIP_BASE + aplic->nr_words * 4))) { 444 + i = (off - APLIC_SETIP_BASE) >> 2; 445 + aplic_write_pending_word(aplic, i, val32, true); 446 + } else if (off == APLIC_SETIPNUM) { 447 + aplic_write_pending(aplic, val32, true); 448 + } else if ((off >= APLIC_CLRIP_BASE) && 449 + (off < (APLIC_CLRIP_BASE + aplic->nr_words * 4))) { 450 + i = (off - APLIC_CLRIP_BASE) >> 2; 451 + aplic_write_pending_word(aplic, i, val32, false); 452 + } else if (off == APLIC_CLRIPNUM) { 453 + aplic_write_pending(aplic, val32, false); 454 + } else if ((off >= APLIC_SETIE_BASE) && 455 + (off < (APLIC_SETIE_BASE + aplic->nr_words * 4))) { 456 + i = (off - APLIC_SETIE_BASE) >> 2; 457 + aplic_write_enabled_word(aplic, i, val32, true); 458 + } else if (off == APLIC_SETIENUM) { 459 + 
aplic_write_enabled(aplic, val32, true); 460 + } else if ((off >= APLIC_CLRIE_BASE) && 461 + (off < (APLIC_CLRIE_BASE + aplic->nr_words * 4))) { 462 + i = (off - APLIC_CLRIE_BASE) >> 2; 463 + aplic_write_enabled_word(aplic, i, val32, false); 464 + } else if (off == APLIC_CLRIENUM) { 465 + aplic_write_enabled(aplic, val32, false); 466 + } else if (off == APLIC_SETIPNUM_LE) { 467 + aplic_write_pending(aplic, val32, true); 468 + } else if (off == APLIC_SETIPNUM_BE) { 469 + aplic_write_pending(aplic, __swab32(val32), true); 470 + } else if (off == APLIC_GENMSI) { 471 + aplic->genmsi = val32 & ~(APLIC_TARGET_GUEST_IDX_MASK << 472 + APLIC_TARGET_GUEST_IDX_SHIFT); 473 + kvm_riscv_aia_inject_msi_by_id(kvm, 474 + val32 >> APLIC_TARGET_HART_IDX_SHIFT, 0, 475 + val32 & APLIC_TARGET_EIID_MASK); 476 + } else if ((off >= APLIC_TARGET_BASE) && 477 + (off < (APLIC_TARGET_BASE + (aplic->nr_irqs - 1) * 4))) { 478 + i = ((off - APLIC_TARGET_BASE) >> 2) + 1; 479 + aplic_write_target(aplic, i, val32); 480 + } else 481 + return -ENODEV; 482 + 483 + aplic_update_irq_range(kvm, 1, aplic->nr_irqs - 1); 484 + 485 + return 0; 486 + } 487 + 488 + static int aplic_mmio_write(struct kvm_vcpu *vcpu, struct kvm_io_device *dev, 489 + gpa_t addr, int len, const void *val) 490 + { 491 + if (len != 4) 492 + return -EOPNOTSUPP; 493 + 494 + return aplic_mmio_write_offset(vcpu->kvm, 495 + addr - vcpu->kvm->arch.aia.aplic_addr, 496 + *((const u32 *)val)); 497 + } 498 + 499 + static struct kvm_io_device_ops aplic_iodoev_ops = { 500 + .read = aplic_mmio_read, 501 + .write = aplic_mmio_write, 502 + }; 503 + 504 + int kvm_riscv_aia_aplic_set_attr(struct kvm *kvm, unsigned long type, u32 v) 505 + { 506 + int rc; 507 + 508 + if (!kvm->arch.aia.aplic_state) 509 + return -ENODEV; 510 + 511 + rc = aplic_mmio_write_offset(kvm, type, v); 512 + if (rc) 513 + return rc; 514 + 515 + return 0; 516 + } 517 + 518 + int kvm_riscv_aia_aplic_get_attr(struct kvm *kvm, unsigned long type, u32 *v) 519 + { 520 + int rc; 521 + 
522 + if (!kvm->arch.aia.aplic_state) 523 + return -ENODEV; 524 + 525 + rc = aplic_mmio_read_offset(kvm, type, v); 526 + if (rc) 527 + return rc; 528 + 529 + return 0; 530 + } 531 + 532 + int kvm_riscv_aia_aplic_has_attr(struct kvm *kvm, unsigned long type) 533 + { 534 + int rc; 535 + u32 val; 536 + 537 + if (!kvm->arch.aia.aplic_state) 538 + return -ENODEV; 539 + 540 + rc = aplic_mmio_read_offset(kvm, type, &val); 541 + if (rc) 542 + return rc; 543 + 544 + return 0; 545 + } 546 + 547 + int kvm_riscv_aia_aplic_init(struct kvm *kvm) 548 + { 549 + int i, ret = 0; 550 + struct aplic *aplic; 551 + 552 + /* Do nothing if we have zero sources */ 553 + if (!kvm->arch.aia.nr_sources) 554 + return 0; 555 + 556 + /* Allocate APLIC global state */ 557 + aplic = kzalloc(sizeof(*aplic), GFP_KERNEL); 558 + if (!aplic) 559 + return -ENOMEM; 560 + kvm->arch.aia.aplic_state = aplic; 561 + 562 + /* Setup APLIC IRQs */ 563 + aplic->nr_irqs = kvm->arch.aia.nr_sources + 1; 564 + aplic->nr_words = DIV_ROUND_UP(aplic->nr_irqs, 32); 565 + aplic->irqs = kcalloc(aplic->nr_irqs, 566 + sizeof(*aplic->irqs), GFP_KERNEL); 567 + if (!aplic->irqs) { 568 + ret = -ENOMEM; 569 + goto fail_free_aplic; 570 + } 571 + for (i = 0; i < aplic->nr_irqs; i++) 572 + raw_spin_lock_init(&aplic->irqs[i].lock); 573 + 574 + /* Setup IO device */ 575 + kvm_iodevice_init(&aplic->iodev, &aplic_iodoev_ops); 576 + mutex_lock(&kvm->slots_lock); 577 + ret = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, 578 + kvm->arch.aia.aplic_addr, 579 + KVM_DEV_RISCV_APLIC_SIZE, 580 + &aplic->iodev); 581 + mutex_unlock(&kvm->slots_lock); 582 + if (ret) 583 + goto fail_free_aplic_irqs; 584 + 585 + /* Setup default IRQ routing */ 586 + ret = kvm_riscv_setup_default_irq_routing(kvm, aplic->nr_irqs); 587 + if (ret) 588 + goto fail_unreg_iodev; 589 + 590 + return 0; 591 + 592 + fail_unreg_iodev: 593 + mutex_lock(&kvm->slots_lock); 594 + kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS, &aplic->iodev); 595 + mutex_unlock(&kvm->slots_lock); 596 + 
+ fail_free_aplic_irqs:
+ 	kfree(aplic->irqs);
+ fail_free_aplic:
+ 	kvm->arch.aia.aplic_state = NULL;
+ 	kfree(aplic);
+ 	return ret;
+ }
+
+ void kvm_riscv_aia_aplic_cleanup(struct kvm *kvm)
+ {
+ 	struct aplic *aplic = kvm->arch.aia.aplic_state;
+
+ 	if (!aplic)
+ 		return;
+
+ 	mutex_lock(&kvm->slots_lock);
+ 	kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS, &aplic->iodev);
+ 	mutex_unlock(&kvm->slots_lock);
+
+ 	kfree(aplic->irqs);
+
+ 	kvm->arch.aia.aplic_state = NULL;
+ 	kfree(aplic);
+ }
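
Reviewer note: the APLIC emulation above hinges on two small pieces of logic, namely how aplic_inject_msi() splits a target word into hart index, guest index and EIID, and how kvm_riscv_aia_aplic_inject() decides whether a new input level latches the pending bit for each source mode. The following is a minimal user-space sketch of both, not the kernel code. The shift and mask values are assumptions taken from my reading of the AIA target register layout (they are not defined in this diff), and decode_target() and should_set_pending() are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed APLIC target word layout (MSI delivery mode); not defined in this diff. */
#define TARGET_HART_IDX_SHIFT	18
#define TARGET_HART_IDX_MASK	0x3fffu
#define TARGET_GUEST_IDX_SHIFT	12
#define TARGET_GUEST_IDX_MASK	0x3fu
#define TARGET_EIID_MASK	0x7ffu

/* Hypothetical container for the three fields aplic_inject_msi() extracts. */
struct msi_target {
	uint32_t hart_idx;
	uint32_t guest_idx;
	uint32_t eiid;
};

/* Split a target word the same way aplic_inject_msi() does: shift, then mask. */
static struct msi_target decode_target(uint32_t target)
{
	struct msi_target t;

	t.hart_idx = (target >> TARGET_HART_IDX_SHIFT) & TARGET_HART_IDX_MASK;
	t.guest_idx = (target >> TARGET_GUEST_IDX_SHIFT) & TARGET_GUEST_IDX_MASK;
	t.eiid = target & TARGET_EIID_MASK;
	return t;
}

/* Hypothetical stand-ins for the APLIC_SOURCECFG_SM_* source modes. */
enum source_mode {
	SM_EDGE_RISE,
	SM_EDGE_FALL,
	SM_LEVEL_HIGH,
	SM_LEVEL_LOW,
};

/*
 * Return true when a new input level should latch the pending bit.
 * prev_input is the previously recorded input state (the INPUT flag).
 * The kernel code additionally skips the write when the pending bit is
 * already set; that check is omitted here because setting a set bit is
 * a no-op.
 */
static bool should_set_pending(enum source_mode sm, bool prev_input, bool level)
{
	switch (sm) {
	case SM_EDGE_RISE:
		return level && !prev_input;
	case SM_EDGE_FALL:
		return !level && prev_input;
	case SM_LEVEL_HIGH:
		return level;
	case SM_LEVEL_LOW:
		return !level;
	}
	return false;
}
```

In the patch itself the pending bit is then cleared and the MSI injected only when the source is both enabled and pending and the domain IE bit is set. The sketch deliberately leaves that step out.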

+673

arch/riscv/kvm/aia_device.c

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright (C) 2021 Western Digital Corporation or its affiliates. 4 + * Copyright (C) 2022 Ventana Micro Systems Inc. 5 + * 6 + * Authors: 7 + * Anup Patel <apatel@ventanamicro.com> 8 + */ 9 + 10 + #include <linux/bits.h> 11 + #include <linux/kvm_host.h> 12 + #include <linux/uaccess.h> 13 + #include <asm/kvm_aia_imsic.h> 14 + 15 + static void unlock_vcpus(struct kvm *kvm, int vcpu_lock_idx) 16 + { 17 + struct kvm_vcpu *tmp_vcpu; 18 + 19 + for (; vcpu_lock_idx >= 0; vcpu_lock_idx--) { 20 + tmp_vcpu = kvm_get_vcpu(kvm, vcpu_lock_idx); 21 + mutex_unlock(&tmp_vcpu->mutex); 22 + } 23 + } 24 + 25 + static void unlock_all_vcpus(struct kvm *kvm) 26 + { 27 + unlock_vcpus(kvm, atomic_read(&kvm->online_vcpus) - 1); 28 + } 29 + 30 + static bool lock_all_vcpus(struct kvm *kvm) 31 + { 32 + struct kvm_vcpu *tmp_vcpu; 33 + unsigned long c; 34 + 35 + kvm_for_each_vcpu(c, tmp_vcpu, kvm) { 36 + if (!mutex_trylock(&tmp_vcpu->mutex)) { 37 + unlock_vcpus(kvm, c - 1); 38 + return false; 39 + } 40 + } 41 + 42 + return true; 43 + } 44 + 45 + static int aia_create(struct kvm_device *dev, u32 type) 46 + { 47 + int ret; 48 + unsigned long i; 49 + struct kvm *kvm = dev->kvm; 50 + struct kvm_vcpu *vcpu; 51 + 52 + if (irqchip_in_kernel(kvm)) 53 + return -EEXIST; 54 + 55 + ret = -EBUSY; 56 + if (!lock_all_vcpus(kvm)) 57 + return ret; 58 + 59 + kvm_for_each_vcpu(i, vcpu, kvm) { 60 + if (vcpu->arch.ran_atleast_once) 61 + goto out_unlock; 62 + } 63 + ret = 0; 64 + 65 + kvm->arch.aia.in_kernel = true; 66 + 67 + out_unlock: 68 + unlock_all_vcpus(kvm); 69 + return ret; 70 + } 71 + 72 + static void aia_destroy(struct kvm_device *dev) 73 + { 74 + kfree(dev); 75 + } 76 + 77 + static int aia_config(struct kvm *kvm, unsigned long type, 78 + u32 *nr, bool write) 79 + { 80 + struct kvm_aia *aia = &kvm->arch.aia; 81 + 82 + /* Writes can only be done before irqchip is initialized */ 83 + if (write && kvm_riscv_aia_initialized(kvm)) 84 + return -EBUSY; 85 
+ 86 + switch (type) { 87 + case KVM_DEV_RISCV_AIA_CONFIG_MODE: 88 + if (write) { 89 + switch (*nr) { 90 + case KVM_DEV_RISCV_AIA_MODE_EMUL: 91 + break; 92 + case KVM_DEV_RISCV_AIA_MODE_HWACCEL: 93 + case KVM_DEV_RISCV_AIA_MODE_AUTO: 94 + /* 95 + * HW Acceleration and Auto modes only 96 + * supported on host with non-zero guest 97 + * external interrupts (i.e. non-zero 98 + * VS-level IMSIC pages). 99 + */ 100 + if (!kvm_riscv_aia_nr_hgei) 101 + return -EINVAL; 102 + break; 103 + default: 104 + return -EINVAL; 105 + } 106 + aia->mode = *nr; 107 + } else 108 + *nr = aia->mode; 109 + break; 110 + case KVM_DEV_RISCV_AIA_CONFIG_IDS: 111 + if (write) { 112 + if ((*nr < KVM_DEV_RISCV_AIA_IDS_MIN) || 113 + (*nr >= KVM_DEV_RISCV_AIA_IDS_MAX) || 114 + ((*nr & KVM_DEV_RISCV_AIA_IDS_MIN) != 115 + KVM_DEV_RISCV_AIA_IDS_MIN) || 116 + (kvm_riscv_aia_max_ids <= *nr)) 117 + return -EINVAL; 118 + aia->nr_ids = *nr; 119 + } else 120 + *nr = aia->nr_ids; 121 + break; 122 + case KVM_DEV_RISCV_AIA_CONFIG_SRCS: 123 + if (write) { 124 + if ((*nr >= KVM_DEV_RISCV_AIA_SRCS_MAX) || 125 + (*nr >= kvm_riscv_aia_max_ids)) 126 + return -EINVAL; 127 + aia->nr_sources = *nr; 128 + } else 129 + *nr = aia->nr_sources; 130 + break; 131 + case KVM_DEV_RISCV_AIA_CONFIG_GROUP_BITS: 132 + if (write) { 133 + if (*nr >= KVM_DEV_RISCV_AIA_GROUP_BITS_MAX) 134 + return -EINVAL; 135 + aia->nr_group_bits = *nr; 136 + } else 137 + *nr = aia->nr_group_bits; 138 + break; 139 + case KVM_DEV_RISCV_AIA_CONFIG_GROUP_SHIFT: 140 + if (write) { 141 + if ((*nr < KVM_DEV_RISCV_AIA_GROUP_SHIFT_MIN) || 142 + (*nr >= KVM_DEV_RISCV_AIA_GROUP_SHIFT_MAX)) 143 + return -EINVAL; 144 + aia->nr_group_shift = *nr; 145 + } else 146 + *nr = aia->nr_group_shift; 147 + break; 148 + case KVM_DEV_RISCV_AIA_CONFIG_HART_BITS: 149 + if (write) { 150 + if (*nr >= KVM_DEV_RISCV_AIA_HART_BITS_MAX) 151 + return -EINVAL; 152 + aia->nr_hart_bits = *nr; 153 + } else 154 + *nr = aia->nr_hart_bits; 155 + break; 156 + case 
KVM_DEV_RISCV_AIA_CONFIG_GUEST_BITS: 157 + if (write) { 158 + if (*nr >= KVM_DEV_RISCV_AIA_GUEST_BITS_MAX) 159 + return -EINVAL; 160 + aia->nr_guest_bits = *nr; 161 + } else 162 + *nr = aia->nr_guest_bits; 163 + break; 164 + default: 165 + return -ENXIO; 166 + } 167 + 168 + return 0; 169 + } 170 + 171 + static int aia_aplic_addr(struct kvm *kvm, u64 *addr, bool write) 172 + { 173 + struct kvm_aia *aia = &kvm->arch.aia; 174 + 175 + if (write) { 176 + /* Writes can only be done before irqchip is initialized */ 177 + if (kvm_riscv_aia_initialized(kvm)) 178 + return -EBUSY; 179 + 180 + if (*addr & (KVM_DEV_RISCV_APLIC_ALIGN - 1)) 181 + return -EINVAL; 182 + 183 + aia->aplic_addr = *addr; 184 + } else 185 + *addr = aia->aplic_addr; 186 + 187 + return 0; 188 + } 189 + 190 + static int aia_imsic_addr(struct kvm *kvm, u64 *addr, 191 + unsigned long vcpu_idx, bool write) 192 + { 193 + struct kvm_vcpu *vcpu; 194 + struct kvm_vcpu_aia *vcpu_aia; 195 + 196 + vcpu = kvm_get_vcpu(kvm, vcpu_idx); 197 + if (!vcpu) 198 + return -EINVAL; 199 + vcpu_aia = &vcpu->arch.aia_context; 200 + 201 + if (write) { 202 + /* Writes can only be done before irqchip is initialized */ 203 + if (kvm_riscv_aia_initialized(kvm)) 204 + return -EBUSY; 205 + 206 + if (*addr & (KVM_DEV_RISCV_IMSIC_ALIGN - 1)) 207 + return -EINVAL; 208 + } 209 + 210 + mutex_lock(&vcpu->mutex); 211 + if (write) 212 + vcpu_aia->imsic_addr = *addr; 213 + else 214 + *addr = vcpu_aia->imsic_addr; 215 + mutex_unlock(&vcpu->mutex); 216 + 217 + return 0; 218 + } 219 + 220 + static gpa_t aia_imsic_ppn(struct kvm_aia *aia, gpa_t addr) 221 + { 222 + u32 h, l; 223 + gpa_t mask = 0; 224 + 225 + h = aia->nr_hart_bits + aia->nr_guest_bits + 226 + IMSIC_MMIO_PAGE_SHIFT - 1; 227 + mask = GENMASK_ULL(h, 0); 228 + 229 + if (aia->nr_group_bits) { 230 + h = aia->nr_group_bits + aia->nr_group_shift - 1; 231 + l = aia->nr_group_shift; 232 + mask |= GENMASK_ULL(h, l); 233 + } 234 + 235 + return (addr & ~mask) >> IMSIC_MMIO_PAGE_SHIFT; 236 + } 237 
+ 238 + static u32 aia_imsic_hart_index(struct kvm_aia *aia, gpa_t addr) 239 + { 240 + u32 hart, group = 0; 241 + 242 + hart = (addr >> (aia->nr_guest_bits + IMSIC_MMIO_PAGE_SHIFT)) & 243 + GENMASK_ULL(aia->nr_hart_bits - 1, 0); 244 + if (aia->nr_group_bits) 245 + group = (addr >> aia->nr_group_shift) & 246 + GENMASK_ULL(aia->nr_group_bits - 1, 0); 247 + 248 + return (group << aia->nr_hart_bits) | hart; 249 + } 250 + 251 + static int aia_init(struct kvm *kvm) 252 + { 253 + int ret, i; 254 + unsigned long idx; 255 + struct kvm_vcpu *vcpu; 256 + struct kvm_vcpu_aia *vaia; 257 + struct kvm_aia *aia = &kvm->arch.aia; 258 + gpa_t base_ppn = KVM_RISCV_AIA_UNDEF_ADDR; 259 + 260 + /* Irqchip can be initialized only once */ 261 + if (kvm_riscv_aia_initialized(kvm)) 262 + return -EBUSY; 263 + 264 + /* We might be in the middle of creating a VCPU? */ 265 + if (kvm->created_vcpus != atomic_read(&kvm->online_vcpus)) 266 + return -EBUSY; 267 + 268 + /* Number of sources should be less than or equals number of IDs */ 269 + if (aia->nr_ids < aia->nr_sources) 270 + return -EINVAL; 271 + 272 + /* APLIC base is required for non-zero number of sources */ 273 + if (aia->nr_sources && aia->aplic_addr == KVM_RISCV_AIA_UNDEF_ADDR) 274 + return -EINVAL; 275 + 276 + /* Initialize APLIC */ 277 + ret = kvm_riscv_aia_aplic_init(kvm); 278 + if (ret) 279 + return ret; 280 + 281 + /* Iterate over each VCPU */ 282 + kvm_for_each_vcpu(idx, vcpu, kvm) { 283 + vaia = &vcpu->arch.aia_context; 284 + 285 + /* IMSIC base is required */ 286 + if (vaia->imsic_addr == KVM_RISCV_AIA_UNDEF_ADDR) { 287 + ret = -EINVAL; 288 + goto fail_cleanup_imsics; 289 + } 290 + 291 + /* All IMSICs should have matching base PPN */ 292 + if (base_ppn == KVM_RISCV_AIA_UNDEF_ADDR) 293 + base_ppn = aia_imsic_ppn(aia, vaia->imsic_addr); 294 + if (base_ppn != aia_imsic_ppn(aia, vaia->imsic_addr)) { 295 + ret = -EINVAL; 296 + goto fail_cleanup_imsics; 297 + } 298 + 299 + /* Update HART index of the IMSIC based on IMSIC base */ 300 
+ vaia->hart_index = aia_imsic_hart_index(aia, 301 + vaia->imsic_addr); 302 + 303 + /* Initialize IMSIC for this VCPU */ 304 + ret = kvm_riscv_vcpu_aia_imsic_init(vcpu); 305 + if (ret) 306 + goto fail_cleanup_imsics; 307 + } 308 + 309 + /* Set the initialized flag */ 310 + kvm->arch.aia.initialized = true; 311 + 312 + return 0; 313 + 314 + fail_cleanup_imsics: 315 + for (i = idx - 1; i >= 0; i--) { 316 + vcpu = kvm_get_vcpu(kvm, i); 317 + if (!vcpu) 318 + continue; 319 + kvm_riscv_vcpu_aia_imsic_cleanup(vcpu); 320 + } 321 + kvm_riscv_aia_aplic_cleanup(kvm); 322 + return ret; 323 + } 324 + 325 + static int aia_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr) 326 + { 327 + u32 nr; 328 + u64 addr; 329 + int nr_vcpus, r = -ENXIO; 330 + unsigned long v, type = (unsigned long)attr->attr; 331 + void __user *uaddr = (void __user *)(long)attr->addr; 332 + 333 + switch (attr->group) { 334 + case KVM_DEV_RISCV_AIA_GRP_CONFIG: 335 + if (copy_from_user(&nr, uaddr, sizeof(nr))) 336 + return -EFAULT; 337 + 338 + mutex_lock(&dev->kvm->lock); 339 + r = aia_config(dev->kvm, type, &nr, true); 340 + mutex_unlock(&dev->kvm->lock); 341 + 342 + break; 343 + 344 + case KVM_DEV_RISCV_AIA_GRP_ADDR: 345 + if (copy_from_user(&addr, uaddr, sizeof(addr))) 346 + return -EFAULT; 347 + 348 + nr_vcpus = atomic_read(&dev->kvm->online_vcpus); 349 + mutex_lock(&dev->kvm->lock); 350 + if (type == KVM_DEV_RISCV_AIA_ADDR_APLIC) 351 + r = aia_aplic_addr(dev->kvm, &addr, true); 352 + else if (type < KVM_DEV_RISCV_AIA_ADDR_IMSIC(nr_vcpus)) 353 + r = aia_imsic_addr(dev->kvm, &addr, 354 + type - KVM_DEV_RISCV_AIA_ADDR_IMSIC(0), true); 355 + mutex_unlock(&dev->kvm->lock); 356 + 357 + break; 358 + 359 + case KVM_DEV_RISCV_AIA_GRP_CTRL: 360 + switch (type) { 361 + case KVM_DEV_RISCV_AIA_CTRL_INIT: 362 + mutex_lock(&dev->kvm->lock); 363 + r = aia_init(dev->kvm); 364 + mutex_unlock(&dev->kvm->lock); 365 + break; 366 + } 367 + 368 + break; 369 + case KVM_DEV_RISCV_AIA_GRP_APLIC: 370 + if 
(copy_from_user(&nr, uaddr, sizeof(nr))) 371 + return -EFAULT; 372 + 373 + mutex_lock(&dev->kvm->lock); 374 + r = kvm_riscv_aia_aplic_set_attr(dev->kvm, type, nr); 375 + mutex_unlock(&dev->kvm->lock); 376 + 377 + break; 378 + case KVM_DEV_RISCV_AIA_GRP_IMSIC: 379 + if (copy_from_user(&v, uaddr, sizeof(v))) 380 + return -EFAULT; 381 + 382 + mutex_lock(&dev->kvm->lock); 383 + r = kvm_riscv_aia_imsic_rw_attr(dev->kvm, type, true, &v); 384 + mutex_unlock(&dev->kvm->lock); 385 + 386 + break; 387 + } 388 + 389 + return r; 390 + } 391 + 392 + static int aia_get_attr(struct kvm_device *dev, struct kvm_device_attr *attr) 393 + { 394 + u32 nr; 395 + u64 addr; 396 + int nr_vcpus, r = -ENXIO; 397 + void __user *uaddr = (void __user *)(long)attr->addr; 398 + unsigned long v, type = (unsigned long)attr->attr; 399 + 400 + switch (attr->group) { 401 + case KVM_DEV_RISCV_AIA_GRP_CONFIG: 402 + if (copy_from_user(&nr, uaddr, sizeof(nr))) 403 + return -EFAULT; 404 + 405 + mutex_lock(&dev->kvm->lock); 406 + r = aia_config(dev->kvm, type, &nr, false); 407 + mutex_unlock(&dev->kvm->lock); 408 + if (r) 409 + return r; 410 + 411 + if (copy_to_user(uaddr, &nr, sizeof(nr))) 412 + return -EFAULT; 413 + 414 + break; 415 + case KVM_DEV_RISCV_AIA_GRP_ADDR: 416 + if (copy_from_user(&addr, uaddr, sizeof(addr))) 417 + return -EFAULT; 418 + 419 + nr_vcpus = atomic_read(&dev->kvm->online_vcpus); 420 + mutex_lock(&dev->kvm->lock); 421 + if (type == KVM_DEV_RISCV_AIA_ADDR_APLIC) 422 + r = aia_aplic_addr(dev->kvm, &addr, false); 423 + else if (type < KVM_DEV_RISCV_AIA_ADDR_IMSIC(nr_vcpus)) 424 + r = aia_imsic_addr(dev->kvm, &addr, 425 + type - KVM_DEV_RISCV_AIA_ADDR_IMSIC(0), false); 426 + mutex_unlock(&dev->kvm->lock); 427 + if (r) 428 + return r; 429 + 430 + if (copy_to_user(uaddr, &addr, sizeof(addr))) 431 + return -EFAULT; 432 + 433 + break; 434 + case KVM_DEV_RISCV_AIA_GRP_APLIC: 435 + if (copy_from_user(&nr, uaddr, sizeof(nr))) 436 + return -EFAULT; 437 + 438 + mutex_lock(&dev->kvm->lock); 439 + r 
= kvm_riscv_aia_aplic_get_attr(dev->kvm, type, &nr); 440 + mutex_unlock(&dev->kvm->lock); 441 + if (r) 442 + return r; 443 + 444 + if (copy_to_user(uaddr, &nr, sizeof(nr))) 445 + return -EFAULT; 446 + 447 + break; 448 + case KVM_DEV_RISCV_AIA_GRP_IMSIC: 449 + if (copy_from_user(&v, uaddr, sizeof(v))) 450 + return -EFAULT; 451 + 452 + mutex_lock(&dev->kvm->lock); 453 + r = kvm_riscv_aia_imsic_rw_attr(dev->kvm, type, false, &v); 454 + mutex_unlock(&dev->kvm->lock); 455 + if (r) 456 + return r; 457 + 458 + if (copy_to_user(uaddr, &v, sizeof(v))) 459 + return -EFAULT; 460 + 461 + break; 462 + } 463 + 464 + return r; 465 + } 466 + 467 + static int aia_has_attr(struct kvm_device *dev, struct kvm_device_attr *attr) 468 + { 469 + int nr_vcpus; 470 + 471 + switch (attr->group) { 472 + case KVM_DEV_RISCV_AIA_GRP_CONFIG: 473 + switch (attr->attr) { 474 + case KVM_DEV_RISCV_AIA_CONFIG_MODE: 475 + case KVM_DEV_RISCV_AIA_CONFIG_IDS: 476 + case KVM_DEV_RISCV_AIA_CONFIG_SRCS: 477 + case KVM_DEV_RISCV_AIA_CONFIG_GROUP_BITS: 478 + case KVM_DEV_RISCV_AIA_CONFIG_GROUP_SHIFT: 479 + case KVM_DEV_RISCV_AIA_CONFIG_HART_BITS: 480 + case KVM_DEV_RISCV_AIA_CONFIG_GUEST_BITS: 481 + return 0; 482 + } 483 + break; 484 + case KVM_DEV_RISCV_AIA_GRP_ADDR: 485 + nr_vcpus = atomic_read(&dev->kvm->online_vcpus); 486 + if (attr->attr == KVM_DEV_RISCV_AIA_ADDR_APLIC) 487 + return 0; 488 + else if (attr->attr < KVM_DEV_RISCV_AIA_ADDR_IMSIC(nr_vcpus)) 489 + return 0; 490 + break; 491 + case KVM_DEV_RISCV_AIA_GRP_CTRL: 492 + switch (attr->attr) { 493 + case KVM_DEV_RISCV_AIA_CTRL_INIT: 494 + return 0; 495 + } 496 + break; 497 + case KVM_DEV_RISCV_AIA_GRP_APLIC: 498 + return kvm_riscv_aia_aplic_has_attr(dev->kvm, attr->attr); 499 + case KVM_DEV_RISCV_AIA_GRP_IMSIC: 500 + return kvm_riscv_aia_imsic_has_attr(dev->kvm, attr->attr); 501 + } 502 + 503 + return -ENXIO; 504 + } 505 + 506 + struct kvm_device_ops kvm_riscv_aia_device_ops = { 507 + .name = "kvm-riscv-aia", 508 + .create = aia_create, 509 + .destroy 
= aia_destroy, 510 + .set_attr = aia_set_attr, 511 + .get_attr = aia_get_attr, 512 + .has_attr = aia_has_attr, 513 + }; 514 + 515 + int kvm_riscv_vcpu_aia_update(struct kvm_vcpu *vcpu) 516 + { 517 + /* Proceed only if AIA was initialized successfully */ 518 + if (!kvm_riscv_aia_initialized(vcpu->kvm)) 519 + return 1; 520 + 521 + /* Update the IMSIC HW state before entering guest mode */ 522 + return kvm_riscv_vcpu_aia_imsic_update(vcpu); 523 + } 524 + 525 + void kvm_riscv_vcpu_aia_reset(struct kvm_vcpu *vcpu) 526 + { 527 + struct kvm_vcpu_aia_csr *csr = &vcpu->arch.aia_context.guest_csr; 528 + struct kvm_vcpu_aia_csr *reset_csr = 529 + &vcpu->arch.aia_context.guest_reset_csr; 530 + 531 + if (!kvm_riscv_aia_available()) 532 + return; 533 + memcpy(csr, reset_csr, sizeof(*csr)); 534 + 535 + /* Proceed only if AIA was initialized successfully */ 536 + if (!kvm_riscv_aia_initialized(vcpu->kvm)) 537 + return; 538 + 539 + /* Reset the IMSIC context */ 540 + kvm_riscv_vcpu_aia_imsic_reset(vcpu); 541 + } 542 + 543 + int kvm_riscv_vcpu_aia_init(struct kvm_vcpu *vcpu) 544 + { 545 + struct kvm_vcpu_aia *vaia = &vcpu->arch.aia_context; 546 + 547 + if (!kvm_riscv_aia_available()) 548 + return 0; 549 + 550 + /* 551 + * We don't do any memory allocations over here because these 552 + * will be done after AIA device is initialized by the user-space. 553 + * 554 + * Refer, aia_init() implementation for more details. 
555 + */ 556 + 557 + /* Initialize default values in AIA vcpu context */ 558 + vaia->imsic_addr = KVM_RISCV_AIA_UNDEF_ADDR; 559 + vaia->hart_index = vcpu->vcpu_idx; 560 + 561 + return 0; 562 + } 563 + 564 + void kvm_riscv_vcpu_aia_deinit(struct kvm_vcpu *vcpu) 565 + { 566 + /* Proceed only if AIA was initialized successfully */ 567 + if (!kvm_riscv_aia_initialized(vcpu->kvm)) 568 + return; 569 + 570 + /* Cleanup IMSIC context */ 571 + kvm_riscv_vcpu_aia_imsic_cleanup(vcpu); 572 + } 573 + 574 + int kvm_riscv_aia_inject_msi_by_id(struct kvm *kvm, u32 hart_index, 575 + u32 guest_index, u32 iid) 576 + { 577 + unsigned long idx; 578 + struct kvm_vcpu *vcpu; 579 + 580 + /* Proceed only if AIA was initialized successfully */ 581 + if (!kvm_riscv_aia_initialized(kvm)) 582 + return -EBUSY; 583 + 584 + /* Inject MSI to matching VCPU */ 585 + kvm_for_each_vcpu(idx, vcpu, kvm) { 586 + if (vcpu->arch.aia_context.hart_index == hart_index) 587 + return kvm_riscv_vcpu_aia_imsic_inject(vcpu, 588 + guest_index, 589 + 0, iid); 590 + } 591 + 592 + return 0; 593 + } 594 + 595 + int kvm_riscv_aia_inject_msi(struct kvm *kvm, struct kvm_msi *msi) 596 + { 597 + gpa_t tppn, ippn; 598 + unsigned long idx; 599 + struct kvm_vcpu *vcpu; 600 + u32 g, toff, iid = msi->data; 601 + struct kvm_aia *aia = &kvm->arch.aia; 602 + gpa_t target = (((gpa_t)msi->address_hi) << 32) | msi->address_lo; 603 + 604 + /* Proceed only if AIA was initialized successfully */ 605 + if (!kvm_riscv_aia_initialized(kvm)) 606 + return -EBUSY; 607 + 608 + /* Convert target address to target PPN */ 609 + tppn = target >> IMSIC_MMIO_PAGE_SHIFT; 610 + 611 + /* Extract and clear Guest ID from target PPN */ 612 + g = tppn & (BIT(aia->nr_guest_bits) - 1); 613 + tppn &= ~((gpa_t)(BIT(aia->nr_guest_bits) - 1)); 614 + 615 + /* Inject MSI to matching VCPU */ 616 + kvm_for_each_vcpu(idx, vcpu, kvm) { 617 + ippn = vcpu->arch.aia_context.imsic_addr >> 618 + IMSIC_MMIO_PAGE_SHIFT; 619 + if (ippn == tppn) { 620 + toff = target & 
(IMSIC_MMIO_PAGE_SZ - 1); 621 + return kvm_riscv_vcpu_aia_imsic_inject(vcpu, g, 622 + toff, iid); 623 + } 624 + } 625 + 626 + return 0; 627 + } 628 + 629 + int kvm_riscv_aia_inject_irq(struct kvm *kvm, unsigned int irq, bool level) 630 + { 631 + /* Proceed only if AIA was initialized successfully */ 632 + if (!kvm_riscv_aia_initialized(kvm)) 633 + return -EBUSY; 634 + 635 + /* Inject interrupt level change in APLIC */ 636 + return kvm_riscv_aia_aplic_inject(kvm, irq, level); 637 + } 638 + 639 + void kvm_riscv_aia_init_vm(struct kvm *kvm) 640 + { 641 + struct kvm_aia *aia = &kvm->arch.aia; 642 + 643 + if (!kvm_riscv_aia_available()) 644 + return; 645 + 646 + /* 647 + * We don't do any memory allocations over here because these 648 + * will be done after AIA device is initialized by the user-space. 649 + * 650 + * Refer, aia_init() implementation for more details. 651 + */ 652 + 653 + /* Initialize default values in AIA global context */ 654 + aia->mode = (kvm_riscv_aia_nr_hgei) ? 655 + KVM_DEV_RISCV_AIA_MODE_AUTO : KVM_DEV_RISCV_AIA_MODE_EMUL; 656 + aia->nr_ids = kvm_riscv_aia_max_ids - 1; 657 + aia->nr_sources = 0; 658 + aia->nr_group_bits = 0; 659 + aia->nr_group_shift = KVM_DEV_RISCV_AIA_GROUP_SHIFT_MIN; 660 + aia->nr_hart_bits = 0; 661 + aia->nr_guest_bits = 0; 662 + aia->aplic_addr = KVM_RISCV_AIA_UNDEF_ADDR; 663 + } 664 + 665 + void kvm_riscv_aia_destroy_vm(struct kvm *kvm) 666 + { 667 + /* Proceed only if AIA was initialized successfully */ 668 + if (!kvm_riscv_aia_initialized(kvm)) 669 + return; 670 + 671 + /* Cleanup APLIC context */ 672 + kvm_riscv_aia_aplic_cleanup(kvm); 673 + }
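
Reviewer note: aia_init() requires every VCPU's IMSIC address to share one base PPN and derives each VCPU's hart index from that address, using aia_imsic_ppn() and aia_imsic_hart_index(). Below is a rough user-space sketch of that address decomposition, assuming a 4 KiB IMSIC page (IMSIC_PAGE_SHIFT of 12 is an assumption, as the constant is not defined in this diff). The struct and function names are hypothetical. Unlike the patch, the sketch guards the nr_hart_bits == 0 case explicitly, since a zero-width mask cannot be expressed in the high/low mask form.

```c
#include <stdint.h>

#define IMSIC_PAGE_SHIFT 12	/* assumption: 4 KiB IMSIC MMIO pages */

/* Hypothetical subset of struct kvm_aia holding the address layout. */
struct aia_layout {
	uint32_t nr_hart_bits;
	uint32_t nr_guest_bits;
	uint32_t nr_group_bits;
	uint32_t nr_group_shift;
};

/* User-space stand-in for GENMASK_ULL(h, l); requires l <= h <= 63. */
static uint64_t genmask64(unsigned int h, unsigned int l)
{
	return (~0ULL << l) & (~0ULL >> (63 - h));
}

/* Clear the hart, guest, page-offset and group fields; keep the common base. */
static uint64_t imsic_base_ppn(const struct aia_layout *a, uint64_t addr)
{
	uint64_t mask;

	mask = genmask64(a->nr_hart_bits + a->nr_guest_bits +
			 IMSIC_PAGE_SHIFT - 1, 0);
	if (a->nr_group_bits)
		mask |= genmask64(a->nr_group_bits + a->nr_group_shift - 1,
				  a->nr_group_shift);

	return (addr & ~mask) >> IMSIC_PAGE_SHIFT;
}

/* Combine the group and hart fields into a single hart index. */
static uint32_t imsic_hart_index(const struct aia_layout *a, uint64_t addr)
{
	uint32_t hart = 0, group = 0;

	/* Guard added in this sketch only: no hart bits means index 0. */
	if (a->nr_hart_bits)
		hart = (addr >> (a->nr_guest_bits + IMSIC_PAGE_SHIFT)) &
		       genmask64(a->nr_hart_bits - 1, 0);
	if (a->nr_group_bits)
		group = (addr >> a->nr_group_shift) &
			genmask64(a->nr_group_bits - 1, 0);

	return (group << a->nr_hart_bits) | hart;
}
```

With two hart bits and one guest bit, for example, addresses 0x28000000 and 0x28006000 reduce to the same base PPN and map to hart indices 0 and 3, which is the property the "matching base PPN" check in aia_init() relies on.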

+1084

arch/riscv/kvm/aia_imsic.c

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright (C) 2021 Western Digital Corporation or its affiliates. 4 + * Copyright (C) 2022 Ventana Micro Systems Inc. 5 + * 6 + * Authors: 7 + * Anup Patel <apatel@ventanamicro.com> 8 + */ 9 + 10 + #include <linux/atomic.h> 11 + #include <linux/bitmap.h> 12 + #include <linux/kvm_host.h> 13 + #include <linux/math.h> 14 + #include <linux/spinlock.h> 15 + #include <linux/swab.h> 16 + #include <kvm/iodev.h> 17 + #include <asm/csr.h> 18 + #include <asm/kvm_aia_imsic.h> 19 + 20 + #define IMSIC_MAX_EIX (IMSIC_MAX_ID / BITS_PER_TYPE(u64)) 21 + 22 + struct imsic_mrif_eix { 23 + unsigned long eip[BITS_PER_TYPE(u64) / BITS_PER_LONG]; 24 + unsigned long eie[BITS_PER_TYPE(u64) / BITS_PER_LONG]; 25 + }; 26 + 27 + struct imsic_mrif { 28 + struct imsic_mrif_eix eix[IMSIC_MAX_EIX]; 29 + unsigned long eithreshold; 30 + unsigned long eidelivery; 31 + }; 32 + 33 + struct imsic { 34 + struct kvm_io_device iodev; 35 + 36 + u32 nr_msis; 37 + u32 nr_eix; 38 + u32 nr_hw_eix; 39 + 40 + /* 41 + * At any point in time, the register state is in 42 + * one of the following places: 43 + * 44 + * 1) Hardware: IMSIC VS-file (vsfile_cpu >= 0) 45 + * 2) Software: IMSIC SW-file (vsfile_cpu < 0) 46 + */ 47 + 48 + /* IMSIC VS-file */ 49 + rwlock_t vsfile_lock; 50 + int vsfile_cpu; 51 + int vsfile_hgei; 52 + void __iomem *vsfile_va; 53 + phys_addr_t vsfile_pa; 54 + 55 + /* IMSIC SW-file */ 56 + struct imsic_mrif *swfile; 57 + phys_addr_t swfile_pa; 58 + }; 59 + 60 + #define imsic_vs_csr_read(__c) \ 61 + ({ \ 62 + unsigned long __r; \ 63 + csr_write(CSR_VSISELECT, __c); \ 64 + __r = csr_read(CSR_VSIREG); \ 65 + __r; \ 66 + }) 67 + 68 + #define imsic_read_switchcase(__ireg) \ 69 + case __ireg: \ 70 + return imsic_vs_csr_read(__ireg); 71 + #define imsic_read_switchcase_2(__ireg) \ 72 + imsic_read_switchcase(__ireg + 0) \ 73 + imsic_read_switchcase(__ireg + 1) 74 + #define imsic_read_switchcase_4(__ireg) \ 75 + imsic_read_switchcase_2(__ireg + 0) \ 76 
+ imsic_read_switchcase_2(__ireg + 2) 77 + #define imsic_read_switchcase_8(__ireg) \ 78 + imsic_read_switchcase_4(__ireg + 0) \ 79 + imsic_read_switchcase_4(__ireg + 4) 80 + #define imsic_read_switchcase_16(__ireg) \ 81 + imsic_read_switchcase_8(__ireg + 0) \ 82 + imsic_read_switchcase_8(__ireg + 8) 83 + #define imsic_read_switchcase_32(__ireg) \ 84 + imsic_read_switchcase_16(__ireg + 0) \ 85 + imsic_read_switchcase_16(__ireg + 16) 86 + #define imsic_read_switchcase_64(__ireg) \ 87 + imsic_read_switchcase_32(__ireg + 0) \ 88 + imsic_read_switchcase_32(__ireg + 32) 89 + 90 + static unsigned long imsic_eix_read(int ireg) 91 + { 92 + switch (ireg) { 93 + imsic_read_switchcase_64(IMSIC_EIP0) 94 + imsic_read_switchcase_64(IMSIC_EIE0) 95 + } 96 + 97 + return 0; 98 + } 99 + 100 + #define imsic_vs_csr_swap(__c, __v) \ 101 + ({ \ 102 + unsigned long __r; \ 103 + csr_write(CSR_VSISELECT, __c); \ 104 + __r = csr_swap(CSR_VSIREG, __v); \ 105 + __r; \ 106 + }) 107 + 108 + #define imsic_swap_switchcase(__ireg, __v) \ 109 + case __ireg: \ 110 + return imsic_vs_csr_swap(__ireg, __v); 111 + #define imsic_swap_switchcase_2(__ireg, __v) \ 112 + imsic_swap_switchcase(__ireg + 0, __v) \ 113 + imsic_swap_switchcase(__ireg + 1, __v) 114 + #define imsic_swap_switchcase_4(__ireg, __v) \ 115 + imsic_swap_switchcase_2(__ireg + 0, __v) \ 116 + imsic_swap_switchcase_2(__ireg + 2, __v) 117 + #define imsic_swap_switchcase_8(__ireg, __v) \ 118 + imsic_swap_switchcase_4(__ireg + 0, __v) \ 119 + imsic_swap_switchcase_4(__ireg + 4, __v) 120 + #define imsic_swap_switchcase_16(__ireg, __v) \ 121 + imsic_swap_switchcase_8(__ireg + 0, __v) \ 122 + imsic_swap_switchcase_8(__ireg + 8, __v) 123 + #define imsic_swap_switchcase_32(__ireg, __v) \ 124 + imsic_swap_switchcase_16(__ireg + 0, __v) \ 125 + imsic_swap_switchcase_16(__ireg + 16, __v) 126 + #define imsic_swap_switchcase_64(__ireg, __v) \ 127 + imsic_swap_switchcase_32(__ireg + 0, __v) \ 128 + imsic_swap_switchcase_32(__ireg + 32, __v) 129 + 130 + 
static unsigned long imsic_eix_swap(int ireg, unsigned long val) 131 + { 132 + switch (ireg) { 133 + imsic_swap_switchcase_64(IMSIC_EIP0, val) 134 + imsic_swap_switchcase_64(IMSIC_EIE0, val) 135 + } 136 + 137 + return 0; 138 + } 139 + 140 + #define imsic_vs_csr_write(__c, __v) \ 141 + do { \ 142 + csr_write(CSR_VSISELECT, __c); \ 143 + csr_write(CSR_VSIREG, __v); \ 144 + } while (0) 145 + 146 + #define imsic_write_switchcase(__ireg, __v) \ 147 + case __ireg: \ 148 + imsic_vs_csr_write(__ireg, __v); \ 149 + break; 150 + #define imsic_write_switchcase_2(__ireg, __v) \ 151 + imsic_write_switchcase(__ireg + 0, __v) \ 152 + imsic_write_switchcase(__ireg + 1, __v) 153 + #define imsic_write_switchcase_4(__ireg, __v) \ 154 + imsic_write_switchcase_2(__ireg + 0, __v) \ 155 + imsic_write_switchcase_2(__ireg + 2, __v) 156 + #define imsic_write_switchcase_8(__ireg, __v) \ 157 + imsic_write_switchcase_4(__ireg + 0, __v) \ 158 + imsic_write_switchcase_4(__ireg + 4, __v) 159 + #define imsic_write_switchcase_16(__ireg, __v) \ 160 + imsic_write_switchcase_8(__ireg + 0, __v) \ 161 + imsic_write_switchcase_8(__ireg + 8, __v) 162 + #define imsic_write_switchcase_32(__ireg, __v) \ 163 + imsic_write_switchcase_16(__ireg + 0, __v) \ 164 + imsic_write_switchcase_16(__ireg + 16, __v) 165 + #define imsic_write_switchcase_64(__ireg, __v) \ 166 + imsic_write_switchcase_32(__ireg + 0, __v) \ 167 + imsic_write_switchcase_32(__ireg + 32, __v) 168 + 169 + static void imsic_eix_write(int ireg, unsigned long val) 170 + { 171 + switch (ireg) { 172 + imsic_write_switchcase_64(IMSIC_EIP0, val) 173 + imsic_write_switchcase_64(IMSIC_EIE0, val) 174 + } 175 + } 176 + 177 + #define imsic_vs_csr_set(__c, __v) \ 178 + do { \ 179 + csr_write(CSR_VSISELECT, __c); \ 180 + csr_set(CSR_VSIREG, __v); \ 181 + } while (0) 182 + 183 + #define imsic_set_switchcase(__ireg, __v) \ 184 + case __ireg: \ 185 + imsic_vs_csr_set(__ireg, __v); \ 186 + break; 187 + #define imsic_set_switchcase_2(__ireg, __v) \ 188 + 
imsic_set_switchcase(__ireg + 0, __v) \ 189 + imsic_set_switchcase(__ireg + 1, __v) 190 + #define imsic_set_switchcase_4(__ireg, __v) \ 191 + imsic_set_switchcase_2(__ireg + 0, __v) \ 192 + imsic_set_switchcase_2(__ireg + 2, __v) 193 + #define imsic_set_switchcase_8(__ireg, __v) \ 194 + imsic_set_switchcase_4(__ireg + 0, __v) \ 195 + imsic_set_switchcase_4(__ireg + 4, __v) 196 + #define imsic_set_switchcase_16(__ireg, __v) \ 197 + imsic_set_switchcase_8(__ireg + 0, __v) \ 198 + imsic_set_switchcase_8(__ireg + 8, __v) 199 + #define imsic_set_switchcase_32(__ireg, __v) \ 200 + imsic_set_switchcase_16(__ireg + 0, __v) \ 201 + imsic_set_switchcase_16(__ireg + 16, __v) 202 + #define imsic_set_switchcase_64(__ireg, __v) \ 203 + imsic_set_switchcase_32(__ireg + 0, __v) \ 204 + imsic_set_switchcase_32(__ireg + 32, __v) 205 + 206 + static void imsic_eix_set(int ireg, unsigned long val) 207 + { 208 + switch (ireg) { 209 + imsic_set_switchcase_64(IMSIC_EIP0, val) 210 + imsic_set_switchcase_64(IMSIC_EIE0, val) 211 + } 212 + } 213 + 214 + static unsigned long imsic_mrif_atomic_rmw(struct imsic_mrif *mrif, 215 + unsigned long *ptr, 216 + unsigned long new_val, 217 + unsigned long wr_mask) 218 + { 219 + unsigned long old_val = 0, tmp = 0; 220 + 221 + __asm__ __volatile__ ( 222 + "0: lr.w.aq %1, %0\n" 223 + " and %2, %1, %3\n" 224 + " or %2, %2, %4\n" 225 + " sc.w.rl %2, %2, %0\n" 226 + " bnez %2, 0b" 227 + : "+A" (*ptr), "+r" (old_val), "+r" (tmp) 228 + : "r" (~wr_mask), "r" (new_val & wr_mask) 229 + : "memory"); 230 + 231 + return old_val; 232 + } 233 + 234 + static unsigned long imsic_mrif_atomic_or(struct imsic_mrif *mrif, 235 + unsigned long *ptr, 236 + unsigned long val) 237 + { 238 + return atomic_long_fetch_or(val, (atomic_long_t *)ptr); 239 + } 240 + 241 + #define imsic_mrif_atomic_write(__mrif, __ptr, __new_val) \ 242 + imsic_mrif_atomic_rmw(__mrif, __ptr, __new_val, -1UL) 243 + #define imsic_mrif_atomic_read(__mrif, __ptr) \ 244 + imsic_mrif_atomic_or(__mrif, __ptr, 0) 
245 + 246 + static u32 imsic_mrif_topei(struct imsic_mrif *mrif, u32 nr_eix, u32 nr_msis) 247 + { 248 + struct imsic_mrif_eix *eix; 249 + u32 i, imin, imax, ei, max_msi; 250 + unsigned long eipend[BITS_PER_TYPE(u64) / BITS_PER_LONG]; 251 + unsigned long eithreshold = imsic_mrif_atomic_read(mrif, 252 + &mrif->eithreshold); 253 + 254 + max_msi = (eithreshold && (eithreshold <= nr_msis)) ? 255 + eithreshold : nr_msis; 256 + for (ei = 0; ei < nr_eix; ei++) { 257 + eix = &mrif->eix[ei]; 258 + eipend[0] = imsic_mrif_atomic_read(mrif, &eix->eie[0]) & 259 + imsic_mrif_atomic_read(mrif, &eix->eip[0]); 260 + #ifdef CONFIG_32BIT 261 + eipend[1] = imsic_mrif_atomic_read(mrif, &eix->eie[1]) & 262 + imsic_mrif_atomic_read(mrif, &eix->eip[1]); 263 + if (!eipend[0] && !eipend[1]) 264 + #else 265 + if (!eipend[0]) 266 + #endif 267 + continue; 268 + 269 + imin = ei * BITS_PER_TYPE(u64); 270 + imax = ((imin + BITS_PER_TYPE(u64)) < max_msi) ? 271 + imin + BITS_PER_TYPE(u64) : max_msi; 272 + for (i = (!imin) ? 1 : imin; i < imax; i++) { 273 + if (test_bit(i - imin, eipend)) 274 + return (i << TOPEI_ID_SHIFT) | i; 275 + } 276 + } 277 + 278 + return 0; 279 + } 280 + 281 + static int imsic_mrif_isel_check(u32 nr_eix, unsigned long isel) 282 + { 283 + u32 num = 0; 284 + 285 + switch (isel) { 286 + case IMSIC_EIDELIVERY: 287 + case IMSIC_EITHRESHOLD: 288 + break; 289 + case IMSIC_EIP0 ... IMSIC_EIP63: 290 + num = isel - IMSIC_EIP0; 291 + break; 292 + case IMSIC_EIE0 ... 
IMSIC_EIE63: 293 + num = isel - IMSIC_EIE0; 294 + break; 295 + default: 296 + return -ENOENT; 297 + } 298 + #ifndef CONFIG_32BIT 299 + if (num & 0x1) 300 + return -EINVAL; 301 + #endif 302 + if ((num / 2) >= nr_eix) 303 + return -EINVAL; 304 + 305 + return 0; 306 + } 307 + 308 + static int imsic_mrif_rmw(struct imsic_mrif *mrif, u32 nr_eix, 309 + unsigned long isel, unsigned long *val, 310 + unsigned long new_val, unsigned long wr_mask) 311 + { 312 + bool pend; 313 + struct imsic_mrif_eix *eix; 314 + unsigned long *ei, num, old_val = 0; 315 + 316 + switch (isel) { 317 + case IMSIC_EIDELIVERY: 318 + old_val = imsic_mrif_atomic_rmw(mrif, &mrif->eidelivery, 319 + new_val, wr_mask & 0x1); 320 + break; 321 + case IMSIC_EITHRESHOLD: 322 + old_val = imsic_mrif_atomic_rmw(mrif, &mrif->eithreshold, 323 + new_val, wr_mask & (IMSIC_MAX_ID - 1)); 324 + break; 325 + case IMSIC_EIP0 ... IMSIC_EIP63: 326 + case IMSIC_EIE0 ... IMSIC_EIE63: 327 + if (isel >= IMSIC_EIP0 && isel <= IMSIC_EIP63) { 328 + pend = true; 329 + num = isel - IMSIC_EIP0; 330 + } else { 331 + pend = false; 332 + num = isel - IMSIC_EIE0; 333 + } 334 + 335 + if ((num / 2) >= nr_eix) 336 + return -EINVAL; 337 + eix = &mrif->eix[num / 2]; 338 + 339 + #ifndef CONFIG_32BIT 340 + if (num & 0x1) 341 + return -EINVAL; 342 + ei = (pend) ? &eix->eip[0] : &eix->eie[0]; 343 + #else 344 + ei = (pend) ? 
&eix->eip[num & 0x1] : &eix->eie[num & 0x1]; 345 + #endif 346 + 347 + /* Bit0 of EIP0 or EIE0 is read-only */ 348 + if (!num) 349 + wr_mask &= ~BIT(0); 350 + 351 + old_val = imsic_mrif_atomic_rmw(mrif, ei, new_val, wr_mask); 352 + break; 353 + default: 354 + return -ENOENT; 355 + } 356 + 357 + if (val) 358 + *val = old_val; 359 + 360 + return 0; 361 + } 362 + 363 + struct imsic_vsfile_read_data { 364 + int hgei; 365 + u32 nr_eix; 366 + bool clear; 367 + struct imsic_mrif *mrif; 368 + }; 369 + 370 + static void imsic_vsfile_local_read(void *data) 371 + { 372 + u32 i; 373 + struct imsic_mrif_eix *eix; 374 + struct imsic_vsfile_read_data *idata = data; 375 + struct imsic_mrif *mrif = idata->mrif; 376 + unsigned long new_hstatus, old_hstatus, old_vsiselect; 377 + 378 + old_vsiselect = csr_read(CSR_VSISELECT); 379 + old_hstatus = csr_read(CSR_HSTATUS); 380 + new_hstatus = old_hstatus & ~HSTATUS_VGEIN; 381 + new_hstatus |= ((unsigned long)idata->hgei) << HSTATUS_VGEIN_SHIFT; 382 + csr_write(CSR_HSTATUS, new_hstatus); 383 + 384 + /* 385 + * We don't use imsic_mrif_atomic_xyz() functions to store 386 + * values in MRIF because imsic_vsfile_read() is always called 387 + * with pointer to temporary MRIF on stack. 
388 + */ 389 + 390 + if (idata->clear) { 391 + mrif->eidelivery = imsic_vs_csr_swap(IMSIC_EIDELIVERY, 0); 392 + mrif->eithreshold = imsic_vs_csr_swap(IMSIC_EITHRESHOLD, 0); 393 + for (i = 0; i < idata->nr_eix; i++) { 394 + eix = &mrif->eix[i]; 395 + eix->eip[0] = imsic_eix_swap(IMSIC_EIP0 + i * 2, 0); 396 + eix->eie[0] = imsic_eix_swap(IMSIC_EIE0 + i * 2, 0); 397 + #ifdef CONFIG_32BIT 398 + eix->eip[1] = imsic_eix_swap(IMSIC_EIP0 + i * 2 + 1, 0); 399 + eix->eie[1] = imsic_eix_swap(IMSIC_EIE0 + i * 2 + 1, 0); 400 + #endif 401 + } 402 + } else { 403 + mrif->eidelivery = imsic_vs_csr_read(IMSIC_EIDELIVERY); 404 + mrif->eithreshold = imsic_vs_csr_read(IMSIC_EITHRESHOLD); 405 + for (i = 0; i < idata->nr_eix; i++) { 406 + eix = &mrif->eix[i]; 407 + eix->eip[0] = imsic_eix_read(IMSIC_EIP0 + i * 2); 408 + eix->eie[0] = imsic_eix_read(IMSIC_EIE0 + i * 2); 409 + #ifdef CONFIG_32BIT 410 + eix->eip[1] = imsic_eix_read(IMSIC_EIP0 + i * 2 + 1); 411 + eix->eie[1] = imsic_eix_read(IMSIC_EIE0 + i * 2 + 1); 412 + #endif 413 + } 414 + } 415 + 416 + csr_write(CSR_HSTATUS, old_hstatus); 417 + csr_write(CSR_VSISELECT, old_vsiselect); 418 + } 419 + 420 + static void imsic_vsfile_read(int vsfile_hgei, int vsfile_cpu, u32 nr_eix, 421 + bool clear, struct imsic_mrif *mrif) 422 + { 423 + struct imsic_vsfile_read_data idata; 424 + 425 + /* We can only read clear if we have a IMSIC VS-file */ 426 + if (vsfile_cpu < 0 || vsfile_hgei <= 0) 427 + return; 428 + 429 + /* We can only read clear on local CPU */ 430 + idata.hgei = vsfile_hgei; 431 + idata.nr_eix = nr_eix; 432 + idata.clear = clear; 433 + idata.mrif = mrif; 434 + on_each_cpu_mask(cpumask_of(vsfile_cpu), 435 + imsic_vsfile_local_read, &idata, 1); 436 + } 437 + 438 + struct imsic_vsfile_rw_data { 439 + int hgei; 440 + int isel; 441 + bool write; 442 + unsigned long val; 443 + }; 444 + 445 + static void imsic_vsfile_local_rw(void *data) 446 + { 447 + struct imsic_vsfile_rw_data *idata = data; 448 + unsigned long new_hstatus, old_hstatus, 
old_vsiselect; 449 + 450 + old_vsiselect = csr_read(CSR_VSISELECT); 451 + old_hstatus = csr_read(CSR_HSTATUS); 452 + new_hstatus = old_hstatus & ~HSTATUS_VGEIN; 453 + new_hstatus |= ((unsigned long)idata->hgei) << HSTATUS_VGEIN_SHIFT; 454 + csr_write(CSR_HSTATUS, new_hstatus); 455 + 456 + switch (idata->isel) { 457 + case IMSIC_EIDELIVERY: 458 + if (idata->write) 459 + imsic_vs_csr_write(IMSIC_EIDELIVERY, idata->val); 460 + else 461 + idata->val = imsic_vs_csr_read(IMSIC_EIDELIVERY); 462 + break; 463 + case IMSIC_EITHRESHOLD: 464 + if (idata->write) 465 + imsic_vs_csr_write(IMSIC_EITHRESHOLD, idata->val); 466 + else 467 + idata->val = imsic_vs_csr_read(IMSIC_EITHRESHOLD); 468 + break; 469 + case IMSIC_EIP0 ... IMSIC_EIP63: 470 + case IMSIC_EIE0 ... IMSIC_EIE63: 471 + #ifndef CONFIG_32BIT 472 + if (idata->isel & 0x1) 473 + break; 474 + #endif 475 + if (idata->write) 476 + imsic_eix_write(idata->isel, idata->val); 477 + else 478 + idata->val = imsic_eix_read(idata->isel); 479 + break; 480 + default: 481 + break; 482 + } 483 + 484 + csr_write(CSR_HSTATUS, old_hstatus); 485 + csr_write(CSR_VSISELECT, old_vsiselect); 486 + } 487 + 488 + static int imsic_vsfile_rw(int vsfile_hgei, int vsfile_cpu, u32 nr_eix, 489 + unsigned long isel, bool write, 490 + unsigned long *val) 491 + { 492 + int rc; 493 + struct imsic_vsfile_rw_data rdata; 494 + 495 + /* We can only access register if we have a IMSIC VS-file */ 496 + if (vsfile_cpu < 0 || vsfile_hgei <= 0) 497 + return -EINVAL; 498 + 499 + /* Check IMSIC register iselect */ 500 + rc = imsic_mrif_isel_check(nr_eix, isel); 501 + if (rc) 502 + return rc; 503 + 504 + /* We can only access register on local CPU */ 505 + rdata.hgei = vsfile_hgei; 506 + rdata.isel = isel; 507 + rdata.write = write; 508 + rdata.val = (write) ? 
*val : 0; 509 + on_each_cpu_mask(cpumask_of(vsfile_cpu), 510 + imsic_vsfile_local_rw, &rdata, 1); 511 + 512 + if (!write) 513 + *val = rdata.val; 514 + 515 + return 0; 516 + } 517 + 518 + static void imsic_vsfile_local_clear(int vsfile_hgei, u32 nr_eix) 519 + { 520 + u32 i; 521 + unsigned long new_hstatus, old_hstatus, old_vsiselect; 522 + 523 + /* We can only zero-out if we have a IMSIC VS-file */ 524 + if (vsfile_hgei <= 0) 525 + return; 526 + 527 + old_vsiselect = csr_read(CSR_VSISELECT); 528 + old_hstatus = csr_read(CSR_HSTATUS); 529 + new_hstatus = old_hstatus & ~HSTATUS_VGEIN; 530 + new_hstatus |= ((unsigned long)vsfile_hgei) << HSTATUS_VGEIN_SHIFT; 531 + csr_write(CSR_HSTATUS, new_hstatus); 532 + 533 + imsic_vs_csr_write(IMSIC_EIDELIVERY, 0); 534 + imsic_vs_csr_write(IMSIC_EITHRESHOLD, 0); 535 + for (i = 0; i < nr_eix; i++) { 536 + imsic_eix_write(IMSIC_EIP0 + i * 2, 0); 537 + imsic_eix_write(IMSIC_EIE0 + i * 2, 0); 538 + #ifdef CONFIG_32BIT 539 + imsic_eix_write(IMSIC_EIP0 + i * 2 + 1, 0); 540 + imsic_eix_write(IMSIC_EIE0 + i * 2 + 1, 0); 541 + #endif 542 + } 543 + 544 + csr_write(CSR_HSTATUS, old_hstatus); 545 + csr_write(CSR_VSISELECT, old_vsiselect); 546 + } 547 + 548 + static void imsic_vsfile_local_update(int vsfile_hgei, u32 nr_eix, 549 + struct imsic_mrif *mrif) 550 + { 551 + u32 i; 552 + struct imsic_mrif_eix *eix; 553 + unsigned long new_hstatus, old_hstatus, old_vsiselect; 554 + 555 + /* We can only update if we have a HW IMSIC context */ 556 + if (vsfile_hgei <= 0) 557 + return; 558 + 559 + /* 560 + * We don't use imsic_mrif_atomic_xyz() functions to read values 561 + * from MRIF in this function because it is always called with 562 + * pointer to temporary MRIF on stack. 
563 + */ 564 + 565 + old_vsiselect = csr_read(CSR_VSISELECT); 566 + old_hstatus = csr_read(CSR_HSTATUS); 567 + new_hstatus = old_hstatus & ~HSTATUS_VGEIN; 568 + new_hstatus |= ((unsigned long)vsfile_hgei) << HSTATUS_VGEIN_SHIFT; 569 + csr_write(CSR_HSTATUS, new_hstatus); 570 + 571 + for (i = 0; i < nr_eix; i++) { 572 + eix = &mrif->eix[i]; 573 + imsic_eix_set(IMSIC_EIP0 + i * 2, eix->eip[0]); 574 + imsic_eix_set(IMSIC_EIE0 + i * 2, eix->eie[0]); 575 + #ifdef CONFIG_32BIT 576 + imsic_eix_set(IMSIC_EIP0 + i * 2 + 1, eix->eip[1]); 577 + imsic_eix_set(IMSIC_EIE0 + i * 2 + 1, eix->eie[1]); 578 + #endif 579 + } 580 + imsic_vs_csr_write(IMSIC_EITHRESHOLD, mrif->eithreshold); 581 + imsic_vs_csr_write(IMSIC_EIDELIVERY, mrif->eidelivery); 582 + 583 + csr_write(CSR_HSTATUS, old_hstatus); 584 + csr_write(CSR_VSISELECT, old_vsiselect); 585 + } 586 + 587 + static void imsic_vsfile_cleanup(struct imsic *imsic) 588 + { 589 + int old_vsfile_hgei, old_vsfile_cpu; 590 + unsigned long flags; 591 + 592 + /* 593 + * We don't use imsic_mrif_atomic_xyz() functions to clear the 594 + * SW-file in this function because it is always called when the 595 + * VCPU is being destroyed. 
596 + */ 597 + 598 + write_lock_irqsave(&imsic->vsfile_lock, flags); 599 + old_vsfile_hgei = imsic->vsfile_hgei; 600 + old_vsfile_cpu = imsic->vsfile_cpu; 601 + imsic->vsfile_cpu = imsic->vsfile_hgei = -1; 602 + imsic->vsfile_va = NULL; 603 + imsic->vsfile_pa = 0; 604 + write_unlock_irqrestore(&imsic->vsfile_lock, flags); 605 + 606 + memset(imsic->swfile, 0, sizeof(*imsic->swfile)); 607 + 608 + if (old_vsfile_cpu >= 0) 609 + kvm_riscv_aia_free_hgei(old_vsfile_cpu, old_vsfile_hgei); 610 + } 611 + 612 + static void imsic_swfile_extirq_update(struct kvm_vcpu *vcpu) 613 + { 614 + struct imsic *imsic = vcpu->arch.aia_context.imsic_state; 615 + struct imsic_mrif *mrif = imsic->swfile; 616 + 617 + if (imsic_mrif_atomic_read(mrif, &mrif->eidelivery) && 618 + imsic_mrif_topei(mrif, imsic->nr_eix, imsic->nr_msis)) 619 + kvm_riscv_vcpu_set_interrupt(vcpu, IRQ_VS_EXT); 620 + else 621 + kvm_riscv_vcpu_unset_interrupt(vcpu, IRQ_VS_EXT); 622 + } 623 + 624 + static void imsic_swfile_read(struct kvm_vcpu *vcpu, bool clear, 625 + struct imsic_mrif *mrif) 626 + { 627 + struct imsic *imsic = vcpu->arch.aia_context.imsic_state; 628 + 629 + /* 630 + * We don't use imsic_mrif_atomic_xyz() functions to read and 631 + * write SW-file and MRIF in this function because it is always 632 + * called when VCPU is not using SW-file and the MRIF points to 633 + * a temporary MRIF on stack. 
634 + */ 635 + 636 + memcpy(mrif, imsic->swfile, sizeof(*mrif)); 637 + if (clear) { 638 + memset(imsic->swfile, 0, sizeof(*imsic->swfile)); 639 + kvm_riscv_vcpu_unset_interrupt(vcpu, IRQ_VS_EXT); 640 + } 641 + } 642 + 643 + static void imsic_swfile_update(struct kvm_vcpu *vcpu, 644 + struct imsic_mrif *mrif) 645 + { 646 + u32 i; 647 + struct imsic_mrif_eix *seix, *eix; 648 + struct imsic *imsic = vcpu->arch.aia_context.imsic_state; 649 + struct imsic_mrif *smrif = imsic->swfile; 650 + 651 + imsic_mrif_atomic_write(smrif, &smrif->eidelivery, mrif->eidelivery); 652 + imsic_mrif_atomic_write(smrif, &smrif->eithreshold, mrif->eithreshold); 653 + for (i = 0; i < imsic->nr_eix; i++) { 654 + seix = &smrif->eix[i]; 655 + eix = &mrif->eix[i]; 656 + imsic_mrif_atomic_or(smrif, &seix->eip[0], eix->eip[0]); 657 + imsic_mrif_atomic_or(smrif, &seix->eie[0], eix->eie[0]); 658 + #ifdef CONFIG_32BIT 659 + imsic_mrif_atomic_or(smrif, &seix->eip[1], eix->eip[1]); 660 + imsic_mrif_atomic_or(smrif, &seix->eie[1], eix->eie[1]); 661 + #endif 662 + } 663 + 664 + imsic_swfile_extirq_update(vcpu); 665 + } 666 + 667 + void kvm_riscv_vcpu_aia_imsic_release(struct kvm_vcpu *vcpu) 668 + { 669 + unsigned long flags; 670 + struct imsic_mrif tmrif; 671 + int old_vsfile_hgei, old_vsfile_cpu; 672 + struct imsic *imsic = vcpu->arch.aia_context.imsic_state; 673 + 674 + /* Read and clear IMSIC VS-file details */ 675 + write_lock_irqsave(&imsic->vsfile_lock, flags); 676 + old_vsfile_hgei = imsic->vsfile_hgei; 677 + old_vsfile_cpu = imsic->vsfile_cpu; 678 + imsic->vsfile_cpu = imsic->vsfile_hgei = -1; 679 + imsic->vsfile_va = NULL; 680 + imsic->vsfile_pa = 0; 681 + write_unlock_irqrestore(&imsic->vsfile_lock, flags); 682 + 683 + /* Do nothing, if no IMSIC VS-file to release */ 684 + if (old_vsfile_cpu < 0) 685 + return; 686 + 687 + /* 688 + * At this point, all interrupt producers are still using 689 + * the old IMSIC VS-file so we first re-direct all interrupt 690 + * producers. 
691 + */ 692 + 693 + /* Purge the G-stage mapping */ 694 + kvm_riscv_gstage_iounmap(vcpu->kvm, 695 + vcpu->arch.aia_context.imsic_addr, 696 + IMSIC_MMIO_PAGE_SZ); 697 + 698 + /* TODO: Purge the IOMMU mapping ??? */ 699 + 700 + /* 701 + * At this point, all interrupt producers have been re-directed 702 + * to somewhere else so we move register state from the old IMSIC 703 + * VS-file to the IMSIC SW-file. 704 + */ 705 + 706 + /* Read and clear register state from old IMSIC VS-file */ 707 + memset(&tmrif, 0, sizeof(tmrif)); 708 + imsic_vsfile_read(old_vsfile_hgei, old_vsfile_cpu, imsic->nr_hw_eix, 709 + true, &tmrif); 710 + 711 + /* Update register state in IMSIC SW-file */ 712 + imsic_swfile_update(vcpu, &tmrif); 713 + 714 + /* Free-up old IMSIC VS-file */ 715 + kvm_riscv_aia_free_hgei(old_vsfile_cpu, old_vsfile_hgei); 716 + } 717 + 718 + int kvm_riscv_vcpu_aia_imsic_update(struct kvm_vcpu *vcpu) 719 + { 720 + unsigned long flags; 721 + phys_addr_t new_vsfile_pa; 722 + struct imsic_mrif tmrif; 723 + void __iomem *new_vsfile_va; 724 + struct kvm *kvm = vcpu->kvm; 725 + struct kvm_run *run = vcpu->run; 726 + struct kvm_vcpu_aia *vaia = &vcpu->arch.aia_context; 727 + struct imsic *imsic = vaia->imsic_state; 728 + int ret = 0, new_vsfile_hgei = -1, old_vsfile_hgei, old_vsfile_cpu; 729 + 730 + /* Do nothing for emulation mode */ 731 + if (kvm->arch.aia.mode == KVM_DEV_RISCV_AIA_MODE_EMUL) 732 + return 1; 733 + 734 + /* Read old IMSIC VS-file details */ 735 + read_lock_irqsave(&imsic->vsfile_lock, flags); 736 + old_vsfile_hgei = imsic->vsfile_hgei; 737 + old_vsfile_cpu = imsic->vsfile_cpu; 738 + read_unlock_irqrestore(&imsic->vsfile_lock, flags); 739 + 740 + /* Do nothing if we are continuing on same CPU */ 741 + if (old_vsfile_cpu == vcpu->cpu) 742 + return 1; 743 + 744 + /* Allocate new IMSIC VS-file */ 745 + ret = kvm_riscv_aia_alloc_hgei(vcpu->cpu, vcpu, 746 + &new_vsfile_va, &new_vsfile_pa); 747 + if (ret <= 0) { 748 + /* For HW acceleration mode, we can't continue 
*/ 749 + if (kvm->arch.aia.mode == KVM_DEV_RISCV_AIA_MODE_HWACCEL) { 750 + run->fail_entry.hardware_entry_failure_reason = 751 + CSR_HSTATUS; 752 + run->fail_entry.cpu = vcpu->cpu; 753 + run->exit_reason = KVM_EXIT_FAIL_ENTRY; 754 + return 0; 755 + } 756 + 757 + /* Release old IMSIC VS-file */ 758 + if (old_vsfile_cpu >= 0) 759 + kvm_riscv_vcpu_aia_imsic_release(vcpu); 760 + 761 + /* For automatic mode, we continue */ 762 + goto done; 763 + } 764 + new_vsfile_hgei = ret; 765 + 766 + /* 767 + * At this point, all interrupt producers are still using 768 + * to the old IMSIC VS-file so we first move all interrupt 769 + * producers to the new IMSIC VS-file. 770 + */ 771 + 772 + /* Zero-out new IMSIC VS-file */ 773 + imsic_vsfile_local_clear(new_vsfile_hgei, imsic->nr_hw_eix); 774 + 775 + /* Update G-stage mapping for the new IMSIC VS-file */ 776 + ret = kvm_riscv_gstage_ioremap(kvm, vcpu->arch.aia_context.imsic_addr, 777 + new_vsfile_pa, IMSIC_MMIO_PAGE_SZ, 778 + true, true); 779 + if (ret) 780 + goto fail_free_vsfile_hgei; 781 + 782 + /* TODO: Update the IOMMU mapping ??? */ 783 + 784 + /* Update new IMSIC VS-file details in IMSIC context */ 785 + write_lock_irqsave(&imsic->vsfile_lock, flags); 786 + imsic->vsfile_hgei = new_vsfile_hgei; 787 + imsic->vsfile_cpu = vcpu->cpu; 788 + imsic->vsfile_va = new_vsfile_va; 789 + imsic->vsfile_pa = new_vsfile_pa; 790 + write_unlock_irqrestore(&imsic->vsfile_lock, flags); 791 + 792 + /* 793 + * At this point, all interrupt producers have been moved 794 + * to the new IMSIC VS-file so we move register state from 795 + * the old IMSIC VS/SW-file to the new IMSIC VS-file. 
796 + */ 797 + 798 + memset(&tmrif, 0, sizeof(tmrif)); 799 + if (old_vsfile_cpu >= 0) { 800 + /* Read and clear register state from old IMSIC VS-file */ 801 + imsic_vsfile_read(old_vsfile_hgei, old_vsfile_cpu, 802 + imsic->nr_hw_eix, true, &tmrif); 803 + 804 + /* Free-up old IMSIC VS-file */ 805 + kvm_riscv_aia_free_hgei(old_vsfile_cpu, old_vsfile_hgei); 806 + } else { 807 + /* Read and clear register state from IMSIC SW-file */ 808 + imsic_swfile_read(vcpu, true, &tmrif); 809 + } 810 + 811 + /* Restore register state in the new IMSIC VS-file */ 812 + imsic_vsfile_local_update(new_vsfile_hgei, imsic->nr_hw_eix, &tmrif); 813 + 814 + done: 815 + /* Set VCPU HSTATUS.VGEIN to new IMSIC VS-file */ 816 + vcpu->arch.guest_context.hstatus &= ~HSTATUS_VGEIN; 817 + if (new_vsfile_hgei > 0) 818 + vcpu->arch.guest_context.hstatus |= 819 + ((unsigned long)new_vsfile_hgei) << HSTATUS_VGEIN_SHIFT; 820 + 821 + /* Continue run-loop */ 822 + return 1; 823 + 824 + fail_free_vsfile_hgei: 825 + kvm_riscv_aia_free_hgei(vcpu->cpu, new_vsfile_hgei); 826 + return ret; 827 + } 828 + 829 + int kvm_riscv_vcpu_aia_imsic_rmw(struct kvm_vcpu *vcpu, unsigned long isel, 830 + unsigned long *val, unsigned long new_val, 831 + unsigned long wr_mask) 832 + { 833 + u32 topei; 834 + struct imsic_mrif_eix *eix; 835 + int r, rc = KVM_INSN_CONTINUE_NEXT_SEPC; 836 + struct imsic *imsic = vcpu->arch.aia_context.imsic_state; 837 + 838 + if (isel == KVM_RISCV_AIA_IMSIC_TOPEI) { 839 + /* Read pending and enabled interrupt with highest priority */ 840 + topei = imsic_mrif_topei(imsic->swfile, imsic->nr_eix, 841 + imsic->nr_msis); 842 + if (val) 843 + *val = topei; 844 + 845 + /* Writes ignore value and clear top pending interrupt */ 846 + if (topei && wr_mask) { 847 + topei >>= TOPEI_ID_SHIFT; 848 + if (topei) { 849 + eix = &imsic->swfile->eix[topei / 850 + BITS_PER_TYPE(u64)]; 851 + clear_bit(topei & (BITS_PER_TYPE(u64) - 1), 852 + eix->eip); 853 + } 854 + } 855 + } else { 856 + r = 
imsic_mrif_rmw(imsic->swfile, imsic->nr_eix, isel, 857 + val, new_val, wr_mask); 858 + /* Forward unknown IMSIC register to user-space */ 859 + if (r) 860 + rc = (r == -ENOENT) ? 0 : KVM_INSN_ILLEGAL_TRAP; 861 + } 862 + 863 + if (wr_mask) 864 + imsic_swfile_extirq_update(vcpu); 865 + 866 + return rc; 867 + } 868 + 869 + int kvm_riscv_aia_imsic_rw_attr(struct kvm *kvm, unsigned long type, 870 + bool write, unsigned long *val) 871 + { 872 + u32 isel, vcpu_id; 873 + unsigned long flags; 874 + struct imsic *imsic; 875 + struct kvm_vcpu *vcpu; 876 + int rc, vsfile_hgei, vsfile_cpu; 877 + 878 + if (!kvm_riscv_aia_initialized(kvm)) 879 + return -ENODEV; 880 + 881 + vcpu_id = KVM_DEV_RISCV_AIA_IMSIC_GET_VCPU(type); 882 + vcpu = kvm_get_vcpu_by_id(kvm, vcpu_id); 883 + if (!vcpu) 884 + return -ENODEV; 885 + 886 + isel = KVM_DEV_RISCV_AIA_IMSIC_GET_ISEL(type); 887 + imsic = vcpu->arch.aia_context.imsic_state; 888 + 889 + read_lock_irqsave(&imsic->vsfile_lock, flags); 890 + 891 + rc = 0; 892 + vsfile_hgei = imsic->vsfile_hgei; 893 + vsfile_cpu = imsic->vsfile_cpu; 894 + if (vsfile_cpu < 0) { 895 + if (write) { 896 + rc = imsic_mrif_rmw(imsic->swfile, imsic->nr_eix, 897 + isel, NULL, *val, -1UL); 898 + imsic_swfile_extirq_update(vcpu); 899 + } else 900 + rc = imsic_mrif_rmw(imsic->swfile, imsic->nr_eix, 901 + isel, val, 0, 0); 902 + } 903 + 904 + read_unlock_irqrestore(&imsic->vsfile_lock, flags); 905 + 906 + if (!rc && vsfile_cpu >= 0) 907 + rc = imsic_vsfile_rw(vsfile_hgei, vsfile_cpu, imsic->nr_eix, 908 + isel, write, val); 909 + 910 + return rc; 911 + } 912 + 913 + int kvm_riscv_aia_imsic_has_attr(struct kvm *kvm, unsigned long type) 914 + { 915 + u32 isel, vcpu_id; 916 + struct imsic *imsic; 917 + struct kvm_vcpu *vcpu; 918 + 919 + if (!kvm_riscv_aia_initialized(kvm)) 920 + return -ENODEV; 921 + 922 + vcpu_id = KVM_DEV_RISCV_AIA_IMSIC_GET_VCPU(type); 923 + vcpu = kvm_get_vcpu_by_id(kvm, vcpu_id); 924 + if (!vcpu) 925 + return -ENODEV; 926 + 927 + isel = 
KVM_DEV_RISCV_AIA_IMSIC_GET_ISEL(type); 928 + imsic = vcpu->arch.aia_context.imsic_state; 929 + return imsic_mrif_isel_check(imsic->nr_eix, isel); 930 + } 931 + 932 + void kvm_riscv_vcpu_aia_imsic_reset(struct kvm_vcpu *vcpu) 933 + { 934 + struct imsic *imsic = vcpu->arch.aia_context.imsic_state; 935 + 936 + if (!imsic) 937 + return; 938 + 939 + kvm_riscv_vcpu_aia_imsic_release(vcpu); 940 + 941 + memset(imsic->swfile, 0, sizeof(*imsic->swfile)); 942 + } 943 + 944 + int kvm_riscv_vcpu_aia_imsic_inject(struct kvm_vcpu *vcpu, 945 + u32 guest_index, u32 offset, u32 iid) 946 + { 947 + unsigned long flags; 948 + struct imsic_mrif_eix *eix; 949 + struct imsic *imsic = vcpu->arch.aia_context.imsic_state; 950 + 951 + /* We only emulate one IMSIC MMIO page for each Guest VCPU */ 952 + if (!imsic || !iid || guest_index || 953 + (offset != IMSIC_MMIO_SETIPNUM_LE && 954 + offset != IMSIC_MMIO_SETIPNUM_BE)) 955 + return -ENODEV; 956 + 957 + iid = (offset == IMSIC_MMIO_SETIPNUM_BE) ? __swab32(iid) : iid; 958 + if (imsic->nr_msis <= iid) 959 + return -EINVAL; 960 + 961 + read_lock_irqsave(&imsic->vsfile_lock, flags); 962 + 963 + if (imsic->vsfile_cpu >= 0) { 964 + writel(iid, imsic->vsfile_va + IMSIC_MMIO_SETIPNUM_LE); 965 + kvm_vcpu_kick(vcpu); 966 + } else { 967 + eix = &imsic->swfile->eix[iid / BITS_PER_TYPE(u64)]; 968 + set_bit(iid & (BITS_PER_TYPE(u64) - 1), eix->eip); 969 + imsic_swfile_extirq_update(vcpu); 970 + } 971 + 972 + read_unlock_irqrestore(&imsic->vsfile_lock, flags); 973 + 974 + return 0; 975 + } 976 + 977 + static int imsic_mmio_read(struct kvm_vcpu *vcpu, struct kvm_io_device *dev, 978 + gpa_t addr, int len, void *val) 979 + { 980 + if (len != 4 || (addr & 0x3) != 0) 981 + return -EOPNOTSUPP; 982 + 983 + *((u32 *)val) = 0; 984 + 985 + return 0; 986 + } 987 + 988 + static int imsic_mmio_write(struct kvm_vcpu *vcpu, struct kvm_io_device *dev, 989 + gpa_t addr, int len, const void *val) 990 + { 991 + struct kvm_msi msi = { 0 }; 992 + 993 + if (len != 4 || (addr & 
0x3) != 0) 994 + return -EOPNOTSUPP; 995 + 996 + msi.address_hi = addr >> 32; 997 + msi.address_lo = (u32)addr; 998 + msi.data = *((const u32 *)val); 999 + kvm_riscv_aia_inject_msi(vcpu->kvm, &msi); 1000 + 1001 + return 0; 1002 + }; 1003 + 1004 + static struct kvm_io_device_ops imsic_iodoev_ops = { 1005 + .read = imsic_mmio_read, 1006 + .write = imsic_mmio_write, 1007 + }; 1008 + 1009 + int kvm_riscv_vcpu_aia_imsic_init(struct kvm_vcpu *vcpu) 1010 + { 1011 + int ret = 0; 1012 + struct imsic *imsic; 1013 + struct page *swfile_page; 1014 + struct kvm *kvm = vcpu->kvm; 1015 + 1016 + /* Fail if we have zero IDs */ 1017 + if (!kvm->arch.aia.nr_ids) 1018 + return -EINVAL; 1019 + 1020 + /* Allocate IMSIC context */ 1021 + imsic = kzalloc(sizeof(*imsic), GFP_KERNEL); 1022 + if (!imsic) 1023 + return -ENOMEM; 1024 + vcpu->arch.aia_context.imsic_state = imsic; 1025 + 1026 + /* Setup IMSIC context */ 1027 + imsic->nr_msis = kvm->arch.aia.nr_ids + 1; 1028 + rwlock_init(&imsic->vsfile_lock); 1029 + imsic->nr_eix = BITS_TO_U64(imsic->nr_msis); 1030 + imsic->nr_hw_eix = BITS_TO_U64(kvm_riscv_aia_max_ids); 1031 + imsic->vsfile_hgei = imsic->vsfile_cpu = -1; 1032 + 1033 + /* Setup IMSIC SW-file */ 1034 + swfile_page = alloc_pages(GFP_KERNEL | __GFP_ZERO, 1035 + get_order(sizeof(*imsic->swfile))); 1036 + if (!swfile_page) { 1037 + ret = -ENOMEM; 1038 + goto fail_free_imsic; 1039 + } 1040 + imsic->swfile = page_to_virt(swfile_page); 1041 + imsic->swfile_pa = page_to_phys(swfile_page); 1042 + 1043 + /* Setup IO device */ 1044 + kvm_iodevice_init(&imsic->iodev, &imsic_iodoev_ops); 1045 + mutex_lock(&kvm->slots_lock); 1046 + ret = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, 1047 + vcpu->arch.aia_context.imsic_addr, 1048 + KVM_DEV_RISCV_IMSIC_SIZE, 1049 + &imsic->iodev); 1050 + mutex_unlock(&kvm->slots_lock); 1051 + if (ret) 1052 + goto fail_free_swfile; 1053 + 1054 + return 0; 1055 + 1056 + fail_free_swfile: 1057 + free_pages((unsigned long)imsic->swfile, 1058 + 
get_order(sizeof(*imsic->swfile))); 1059 + fail_free_imsic: 1060 + vcpu->arch.aia_context.imsic_state = NULL; 1061 + kfree(imsic); 1062 + return ret; 1063 + } 1064 + 1065 + void kvm_riscv_vcpu_aia_imsic_cleanup(struct kvm_vcpu *vcpu) 1066 + { 1067 + struct kvm *kvm = vcpu->kvm; 1068 + struct imsic *imsic = vcpu->arch.aia_context.imsic_state; 1069 + 1070 + if (!imsic) 1071 + return; 1072 + 1073 + imsic_vsfile_cleanup(imsic); 1074 + 1075 + mutex_lock(&kvm->slots_lock); 1076 + kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS, &imsic->iodev); 1077 + mutex_unlock(&kvm->slots_lock); 1078 + 1079 + free_pages((unsigned long)imsic->swfile, 1080 + get_order(sizeof(*imsic->swfile))); 1081 + 1082 + vcpu->arch.aia_context.imsic_state = NULL; 1083 + kfree(imsic); 1084 + }
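
The priority scan done by imsic_mrif_topei() in the file above can be sketched in isolation as user-space C. This is only a minimal sketch, not the kernel code: it assumes a 64-bit build (one 64-bit word per eip/eie pair), takes eithreshold as a plain argument instead of reading it atomically, and the names sketch_eix, sketch_topei, SKETCH_MAX_EIX and SKETCH_TOPEI_ID_SHIFT are hypothetical. The value 16 for the shift is an assumption about TOPEI_ID_SHIFT, which is defined outside this diff.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_EIX        32 /* hypothetical upper bound on eix words */
#define SKETCH_TOPEI_ID_SHIFT 16 /* assumed to match TOPEI_ID_SHIFT */

/* One 64-interrupt slice of the file: pending bits and enable bits. */
struct sketch_eix {
	uint64_t eip;
	uint64_t eie;
};

/*
 * Return the lowest-numbered interrupt that is both pending and enabled,
 * encoded as (id << shift) | id, or 0 if there is none. Identity 0 is never
 * reported, and a non-zero threshold hides identities >= eithreshold.
 */
static uint32_t sketch_topei(const struct sketch_eix *eix, uint32_t nr_eix,
			     uint32_t nr_msis, uint32_t eithreshold)
{
	uint32_t ei, i, imin, imax, max_msi;
	uint64_t pend;

	assert(nr_eix <= SKETCH_MAX_EIX);

	/* A zero or out-of-range threshold means "no threshold". */
	max_msi = (eithreshold && eithreshold <= nr_msis) ?
		  eithreshold : nr_msis;

	for (ei = 0; ei < nr_eix; ei++) {
		pend = eix[ei].eip & eix[ei].eie;
		if (!pend)
			continue;

		/* Identity range covered by this word, clipped to max_msi. */
		imin = ei * 64;
		imax = (imin + 64 < max_msi) ? imin + 64 : max_msi;
		for (i = imin ? imin : 1; i < imax; i++) {
			if (pend & (UINT64_C(1) << (i - imin)))
				return (i << SKETCH_TOPEI_ID_SHIFT) | i;
		}
	}

	return 0;
}
```

In this sketch, lower identities win, an interrupt that is pending but not enabled is skipped, and raising the threshold can make the result drop to 0 even while bits remain pending.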

+2 -1

arch/riscv/kvm/main.c

···
116 116 kvm_info("VMID %ld bits available\n", kvm_riscv_gstage_vmid_bits());
117 117
118 118 if (kvm_riscv_aia_available())
119 - kvm_info("AIA available\n");
119 + kvm_info("AIA available with %d guest external interrupts\n",
120 + kvm_riscv_aia_nr_hgei);
120 121
121 122 rc = kvm_init(sizeof(struct kvm_vcpu), 0, THIS_MODULE);
122 123 if (rc) {

+1 -1

arch/riscv/kvm/tlb.c

···
296 296 unsigned int actual_req = req;
297 297 DECLARE_BITMAP(vcpu_mask, KVM_MAX_VCPUS);
298 298
299 - bitmap_clear(vcpu_mask, 0, KVM_MAX_VCPUS);
299 + bitmap_zero(vcpu_mask, KVM_MAX_VCPUS);
300 300 kvm_for_each_vcpu(i, vcpu, kvm) {
301 301 if (hbase != -1UL) {
302 302 if (vcpu->vcpu_id < hbase)
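
The one-line change in this hunk swaps a range clear starting at bit 0 for a whole-bitmap zero. A rough user-space sketch of the difference follows, assuming the usual semantics: bitmap_zero() zeroes every word that backs nbits, while bitmap_clear() clears only the requested bit range. The sketch_ names are hypothetical stand-ins, not the kernel helpers.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_BITS_TO_LONGS(n) \
	(((n) + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

/* Zero every word needed to hold nbits (stand-in for bitmap_zero()). */
static void sketch_bitmap_zero(unsigned long *map, unsigned int nbits)
{
	memset(map, 0, SKETCH_BITS_TO_LONGS(nbits) * sizeof(unsigned long));
}

/* Clear bits [start, start + nbits) only (stand-in for bitmap_clear()). */
static void sketch_bitmap_clear(unsigned long *map, unsigned int start,
				unsigned int nbits)
{
	unsigned int i;

	for (i = start; i < start + nbits; i++)
		map[i / SKETCH_BITS_PER_LONG] &=
			~(1UL << (i % SKETCH_BITS_PER_LONG));
}

/* When nbits covers whole words, both helpers leave identical contents. */
static void sketch_demo(void)
{
	unsigned long a[2], b[2];

	memset(a, 0xff, sizeof(a));
	memset(b, 0xff, sizeof(b));
	sketch_bitmap_zero(a, 2 * SKETCH_BITS_PER_LONG);
	sketch_bitmap_clear(b, 0, 2 * SKETCH_BITS_PER_LONG);
	assert(memcmp(a, b, sizeof(a)) == 0);
}
```

The two forms differ only when nbits ends in the middle of a word: the zeroing form also wipes the unused tail bits of the last word, which is presumably harmless for a freshly declared on-stack mask such as vcpu_mask.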

+4

arch/riscv/kvm/vcpu.c

···
64 64 KVM_ISA_EXT_ARR(SSAIA),
65 65 KVM_ISA_EXT_ARR(SSTC),
66 66 KVM_ISA_EXT_ARR(SVINVAL),
67 + KVM_ISA_EXT_ARR(SVNAPOT),
67 68 KVM_ISA_EXT_ARR(SVPBMT),
68 69 KVM_ISA_EXT_ARR(ZBB),
69 70 KVM_ISA_EXT_ARR(ZIHINTPAUSE),
···
108 107 case KVM_RISCV_ISA_EXT_SSAIA:
109 108 case KVM_RISCV_ISA_EXT_SSTC:
110 109 case KVM_RISCV_ISA_EXT_SVINVAL:
110 + case KVM_RISCV_ISA_EXT_SVNAPOT:
111 111 case KVM_RISCV_ISA_EXT_ZIHINTPAUSE:
112 112 case KVM_RISCV_ISA_EXT_ZBB:
113 113 return false;
···
265 263
266 264 void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
267 265 {
266 + kvm_riscv_aia_wakeon_hgei(vcpu, true);
268 267 }
269 268
270 269 void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
271 270 {
271 + kvm_riscv_aia_wakeon_hgei(vcpu, false);
272 272 }
273 273
274 274 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)

+2

arch/riscv/kvm/vcpu_exit.c

@@ -183,6 +183,8 @@
 	run->exit_reason = KVM_EXIT_UNKNOWN;
 	switch (trap->scause) {
 	case EXC_INST_ILLEGAL:
+	case EXC_LOAD_MISALIGNED:
+	case EXC_STORE_MISALIGNED:
 		if (vcpu->arch.guest_context.hstatus & HSTATUS_SPV) {
 			kvm_riscv_vcpu_trap_redirect(vcpu, trap);
 			ret = 1;

+54 -26

arch/riscv/kvm/vcpu_sbi.c

··· 20 20 }; 21 21 #endif 22 22 23 - #ifdef CONFIG_RISCV_PMU_SBI 24 - extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_pmu; 25 - #else 23 + #ifndef CONFIG_RISCV_PMU_SBI 26 24 static const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_pmu = { 27 25 .extid_start = -1UL, 28 26 .extid_end = -1UL, ··· 29 31 #endif 30 32 31 33 struct kvm_riscv_sbi_extension_entry { 32 - enum KVM_RISCV_SBI_EXT_ID dis_idx; 34 + enum KVM_RISCV_SBI_EXT_ID ext_idx; 33 35 const struct kvm_vcpu_sbi_extension *ext_ptr; 34 36 }; 35 37 36 38 static const struct kvm_riscv_sbi_extension_entry sbi_ext[] = { 37 39 { 38 - .dis_idx = KVM_RISCV_SBI_EXT_V01, 40 + .ext_idx = KVM_RISCV_SBI_EXT_V01, 39 41 .ext_ptr = &vcpu_sbi_ext_v01, 40 42 }, 41 43 { 42 - .dis_idx = KVM_RISCV_SBI_EXT_MAX, /* Can't be disabled */ 44 + .ext_idx = KVM_RISCV_SBI_EXT_MAX, /* Can't be disabled */ 43 45 .ext_ptr = &vcpu_sbi_ext_base, 44 46 }, 45 47 { 46 - .dis_idx = KVM_RISCV_SBI_EXT_TIME, 48 + .ext_idx = KVM_RISCV_SBI_EXT_TIME, 47 49 .ext_ptr = &vcpu_sbi_ext_time, 48 50 }, 49 51 { 50 - .dis_idx = KVM_RISCV_SBI_EXT_IPI, 52 + .ext_idx = KVM_RISCV_SBI_EXT_IPI, 51 53 .ext_ptr = &vcpu_sbi_ext_ipi, 52 54 }, 53 55 { 54 - .dis_idx = KVM_RISCV_SBI_EXT_RFENCE, 56 + .ext_idx = KVM_RISCV_SBI_EXT_RFENCE, 55 57 .ext_ptr = &vcpu_sbi_ext_rfence, 56 58 }, 57 59 { 58 - .dis_idx = KVM_RISCV_SBI_EXT_SRST, 60 + .ext_idx = KVM_RISCV_SBI_EXT_SRST, 59 61 .ext_ptr = &vcpu_sbi_ext_srst, 60 62 }, 61 63 { 62 - .dis_idx = KVM_RISCV_SBI_EXT_HSM, 64 + .ext_idx = KVM_RISCV_SBI_EXT_HSM, 63 65 .ext_ptr = &vcpu_sbi_ext_hsm, 64 66 }, 65 67 { 66 - .dis_idx = KVM_RISCV_SBI_EXT_PMU, 68 + .ext_idx = KVM_RISCV_SBI_EXT_PMU, 67 69 .ext_ptr = &vcpu_sbi_ext_pmu, 68 70 }, 69 71 { 70 - .dis_idx = KVM_RISCV_SBI_EXT_EXPERIMENTAL, 72 + .ext_idx = KVM_RISCV_SBI_EXT_EXPERIMENTAL, 71 73 .ext_ptr = &vcpu_sbi_ext_experimental, 72 74 }, 73 75 { 74 - .dis_idx = KVM_RISCV_SBI_EXT_VENDOR, 76 + .ext_idx = KVM_RISCV_SBI_EXT_VENDOR, 75 77 .ext_ptr = &vcpu_sbi_ext_vendor, 76 78 }, 77 79 
}; ··· 145 147 return -EINVAL; 146 148 147 149 for (i = 0; i < ARRAY_SIZE(sbi_ext); i++) { 148 - if (sbi_ext[i].dis_idx == reg_num) { 150 + if (sbi_ext[i].ext_idx == reg_num) { 149 151 sext = &sbi_ext[i]; 150 152 break; 151 153 } ··· 153 155 if (!sext) 154 156 return -ENOENT; 155 157 156 - scontext->extension_disabled[sext->dis_idx] = !reg_val; 158 + /* 159 + * We can't set the extension status to available here, since it may 160 + * have a probe() function which needs to confirm availability first, 161 + * but it may be too early to call that here. We can set the status to 162 + * unavailable, though. 163 + */ 164 + if (!reg_val) 165 + scontext->ext_status[sext->ext_idx] = 166 + KVM_RISCV_SBI_EXT_UNAVAILABLE; 157 167 158 168 return 0; 159 169 } ··· 178 172 return -EINVAL; 179 173 180 174 for (i = 0; i < ARRAY_SIZE(sbi_ext); i++) { 181 - if (sbi_ext[i].dis_idx == reg_num) { 175 + if (sbi_ext[i].ext_idx == reg_num) { 182 176 sext = &sbi_ext[i]; 183 177 break; 184 178 } ··· 186 180 if (!sext) 187 181 return -ENOENT; 188 182 189 - *reg_val = !scontext->extension_disabled[sext->dis_idx]; 183 + /* 184 + * If the extension status is still uninitialized, then we should probe 185 + * to determine if it's available, but it may be too early to do that 186 + * here. The best we can do is report that the extension has not been 187 + * disabled, i.e. we return 1 when the extension is available and also 188 + * when it only may be available. 
189 + */ 190 + *reg_val = scontext->ext_status[sext->ext_idx] != 191 + KVM_RISCV_SBI_EXT_UNAVAILABLE; 190 192 191 193 return 0; 192 194 } ··· 321 307 const struct kvm_vcpu_sbi_extension *kvm_vcpu_sbi_find_ext( 322 308 struct kvm_vcpu *vcpu, unsigned long extid) 323 309 { 324 - int i; 325 - const struct kvm_riscv_sbi_extension_entry *sext; 326 310 struct kvm_vcpu_sbi_context *scontext = &vcpu->arch.sbi_context; 311 + const struct kvm_riscv_sbi_extension_entry *entry; 312 + const struct kvm_vcpu_sbi_extension *ext; 313 + int i; 327 314 328 315 for (i = 0; i < ARRAY_SIZE(sbi_ext); i++) { 329 - sext = &sbi_ext[i]; 330 - if (sext->ext_ptr->extid_start <= extid && 331 - sext->ext_ptr->extid_end >= extid) { 332 - if (sext->dis_idx < KVM_RISCV_SBI_EXT_MAX && 333 - scontext->extension_disabled[sext->dis_idx]) 316 + entry = &sbi_ext[i]; 317 + ext = entry->ext_ptr; 318 + 319 + if (ext->extid_start <= extid && ext->extid_end >= extid) { 320 + if (entry->ext_idx >= KVM_RISCV_SBI_EXT_MAX || 321 + scontext->ext_status[entry->ext_idx] == 322 + KVM_RISCV_SBI_EXT_AVAILABLE) 323 + return ext; 324 + if (scontext->ext_status[entry->ext_idx] == 325 + KVM_RISCV_SBI_EXT_UNAVAILABLE) 334 326 return NULL; 335 - return sbi_ext[i].ext_ptr; 327 + if (ext->probe && !ext->probe(vcpu)) { 328 + scontext->ext_status[entry->ext_idx] = 329 + KVM_RISCV_SBI_EXT_UNAVAILABLE; 330 + return NULL; 331 + } 332 + 333 + scontext->ext_status[entry->ext_idx] = 334 + KVM_RISCV_SBI_EXT_AVAILABLE; 335 + return ext; 336 336 } 337 337 } 338 338
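The reworked lookup in `kvm_vcpu_sbi_find_ext()` can be read as a three-state cache: an extension starts out uninitialized, is probed the first time a guest asks for it, and the verdict is remembered afterwards. The following is a minimal user-space sketch of that idea, not the kernel code. All `demo_*` names are hypothetical, and the kernel's special case for entries that can never be disabled (`ext_idx >= KVM_RISCV_SBI_EXT_MAX`) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the KVM_RISCV_SBI_EXT_* status values. */
enum demo_ext_status {
	DEMO_EXT_UNINITIALIZED,
	DEMO_EXT_AVAILABLE,
	DEMO_EXT_UNAVAILABLE,
};

struct demo_ext {
	unsigned long id_start;
	unsigned long id_end;
	bool (*probe)(void);	/* optional; NULL means no probing is needed */
};

struct demo_slot {
	const struct demo_ext *ext;
	enum demo_ext_status status;
};

/* Counts probe invocations so callers can observe the caching. */
int demo_probe_calls;

bool demo_probe_ok(void)   { demo_probe_calls++; return true; }
bool demo_probe_fail(void) { demo_probe_calls++; return false; }

/*
 * Return the extension covering @id, or NULL. The probe runs at most once
 * per slot; afterwards the cached status decides.
 */
const struct demo_ext *demo_find_ext(struct demo_slot *slots, size_t n,
				     unsigned long id)
{
	assert(slots != NULL);

	for (size_t i = 0; i < n; i++) {
		struct demo_slot *s = &slots[i];

		if (id < s->ext->id_start || id > s->ext->id_end)
			continue;
		if (s->status == DEMO_EXT_AVAILABLE)
			return s->ext;
		if (s->status == DEMO_EXT_UNAVAILABLE)
			return NULL;
		/* First lookup: run the probe (if any) and cache the verdict. */
		if (s->ext->probe && !s->ext->probe()) {
			s->status = DEMO_EXT_UNAVAILABLE;
			return NULL;
		}
		s->status = DEMO_EXT_AVAILABLE;
		return s->ext;
	}
	return NULL;
}
```

In this sketch, writing `DEMO_EXT_UNAVAILABLE` into a slot before the first lookup plays the role of userspace disabling an extension through the one-reg interface: the probe is then never called for that slot.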

+118

arch/riscv/kvm/vm.c

··· 55 55 kvm_riscv_aia_destroy_vm(kvm); 56 56 } 57 57 58 + int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irql, 59 + bool line_status) 60 + { 61 + if (!irqchip_in_kernel(kvm)) 62 + return -ENXIO; 63 + 64 + return kvm_riscv_aia_inject_irq(kvm, irql->irq, irql->level); 65 + } 66 + 67 + int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e, 68 + struct kvm *kvm, int irq_source_id, 69 + int level, bool line_status) 70 + { 71 + struct kvm_msi msi; 72 + 73 + if (!level) 74 + return -1; 75 + 76 + msi.address_lo = e->msi.address_lo; 77 + msi.address_hi = e->msi.address_hi; 78 + msi.data = e->msi.data; 79 + msi.flags = e->msi.flags; 80 + msi.devid = e->msi.devid; 81 + 82 + return kvm_riscv_aia_inject_msi(kvm, &msi); 83 + } 84 + 85 + static int kvm_riscv_set_irq(struct kvm_kernel_irq_routing_entry *e, 86 + struct kvm *kvm, int irq_source_id, 87 + int level, bool line_status) 88 + { 89 + return kvm_riscv_aia_inject_irq(kvm, e->irqchip.pin, level); 90 + } 91 + 92 + int kvm_riscv_setup_default_irq_routing(struct kvm *kvm, u32 lines) 93 + { 94 + struct kvm_irq_routing_entry *ents; 95 + int i, rc; 96 + 97 + ents = kcalloc(lines, sizeof(*ents), GFP_KERNEL); 98 + if (!ents) 99 + return -ENOMEM; 100 + 101 + for (i = 0; i < lines; i++) { 102 + ents[i].gsi = i; 103 + ents[i].type = KVM_IRQ_ROUTING_IRQCHIP; 104 + ents[i].u.irqchip.irqchip = 0; 105 + ents[i].u.irqchip.pin = i; 106 + } 107 + rc = kvm_set_irq_routing(kvm, ents, lines, 0); 108 + kfree(ents); 109 + 110 + return rc; 111 + } 112 + 113 + bool kvm_arch_can_set_irq_routing(struct kvm *kvm) 114 + { 115 + return irqchip_in_kernel(kvm); 116 + } 117 + 118 + int kvm_set_routing_entry(struct kvm *kvm, 119 + struct kvm_kernel_irq_routing_entry *e, 120 + const struct kvm_irq_routing_entry *ue) 121 + { 122 + int r = -EINVAL; 123 + 124 + switch (ue->type) { 125 + case KVM_IRQ_ROUTING_IRQCHIP: 126 + e->set = kvm_riscv_set_irq; 127 + e->irqchip.irqchip = ue->u.irqchip.irqchip; 128 + e->irqchip.pin = ue->u.irqchip.pin; 
129 + if ((e->irqchip.pin >= KVM_IRQCHIP_NUM_PINS) || 130 + (e->irqchip.irqchip >= KVM_NR_IRQCHIPS)) 131 + goto out; 132 + break; 133 + case KVM_IRQ_ROUTING_MSI: 134 + e->set = kvm_set_msi; 135 + e->msi.address_lo = ue->u.msi.address_lo; 136 + e->msi.address_hi = ue->u.msi.address_hi; 137 + e->msi.data = ue->u.msi.data; 138 + e->msi.flags = ue->flags; 139 + e->msi.devid = ue->u.msi.devid; 140 + break; 141 + default: 142 + goto out; 143 + } 144 + r = 0; 145 + out: 146 + return r; 147 + } 148 + 149 + int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e, 150 + struct kvm *kvm, int irq_source_id, int level, 151 + bool line_status) 152 + { 153 + if (!level) 154 + return -EWOULDBLOCK; 155 + 156 + switch (e->type) { 157 + case KVM_IRQ_ROUTING_MSI: 158 + return kvm_set_msi(e, kvm, irq_source_id, level, line_status); 159 + 160 + case KVM_IRQ_ROUTING_IRQCHIP: 161 + return kvm_riscv_set_irq(e, kvm, irq_source_id, 162 + level, line_status); 163 + } 164 + 165 + return -EWOULDBLOCK; 166 + } 167 + 168 + bool kvm_arch_irqchip_in_kernel(struct kvm *kvm) 169 + { 170 + return irqchip_in_kernel(kvm); 171 + } 172 + 58 173 int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) 59 174 { 60 175 int r; 61 176 62 177 switch (ext) { 178 + case KVM_CAP_IRQCHIP: 179 + r = kvm_riscv_aia_available(); 180 + break; 63 181 case KVM_CAP_IOEVENTFD: 64 182 case KVM_CAP_DEVICE_CTRL: 65 183 case KVM_CAP_USER_MEMORY:

+4

arch/s390/boot/uv.c

@@ -47,6 +47,10 @@
 		uv_info.conf_dump_finalize_len = uvcb.conf_dump_finalize_len;
 		uv_info.supp_att_req_hdr_ver = uvcb.supp_att_req_hdr_ver;
 		uv_info.supp_att_pflags = uvcb.supp_att_pflags;
+		uv_info.supp_add_secret_req_ver = uvcb.supp_add_secret_req_ver;
+		uv_info.supp_add_secret_pcf = uvcb.supp_add_secret_pcf;
+		uv_info.supp_secret_types = uvcb.supp_secret_types;
+		uv_info.max_secrets = uvcb.max_secrets;
 	}
 
 #ifdef CONFIG_PROTECTED_VIRTUALIZATION_GUEST

+30 -2

arch/s390/include/asm/uv.h

··· 58 58 #define UVC_CMD_SET_SHARED_ACCESS 0x1000 59 59 #define UVC_CMD_REMOVE_SHARED_ACCESS 0x1001 60 60 #define UVC_CMD_RETR_ATTEST 0x1020 61 + #define UVC_CMD_ADD_SECRET 0x1031 62 + #define UVC_CMD_LIST_SECRETS 0x1033 63 + #define UVC_CMD_LOCK_SECRETS 0x1034 61 64 62 65 /* Bits in installed uv calls */ 63 66 enum uv_cmds_inst { ··· 91 88 BIT_UVC_CMD_DUMP_CPU = 26, 92 89 BIT_UVC_CMD_DUMP_COMPLETE = 27, 93 90 BIT_UVC_CMD_RETR_ATTEST = 28, 91 + BIT_UVC_CMD_ADD_SECRET = 29, 92 + BIT_UVC_CMD_LIST_SECRETS = 30, 93 + BIT_UVC_CMD_LOCK_SECRETS = 31, 94 94 }; 95 95 96 96 enum uv_feat_ind { ··· 123 117 u32 reserved70[3]; /* 0x0070 */ 124 118 u32 max_num_sec_conf; /* 0x007c */ 125 119 u64 max_guest_stor_addr; /* 0x0080 */ 126 - u8 reserved88[158 - 136]; /* 0x0088 */ 120 + u8 reserved88[0x9e - 0x88]; /* 0x0088 */ 127 121 u16 max_guest_cpu_id; /* 0x009e */ 128 122 u64 uv_feature_indications; /* 0x00a0 */ 129 123 u64 reserveda8; /* 0x00a8 */ ··· 135 129 u64 reservedd8; /* 0x00d8 */ 136 130 u64 supp_att_req_hdr_ver; /* 0x00e0 */ 137 131 u64 supp_att_pflags; /* 0x00e8 */ 138 - u8 reservedf0[256 - 240]; /* 0x00f0 */ 132 + u64 reservedf0; /* 0x00f0 */ 133 + u64 supp_add_secret_req_ver; /* 0x00f8 */ 134 + u64 supp_add_secret_pcf; /* 0x0100 */ 135 + u64 supp_secret_types; /* 0x0180 */ 136 + u16 max_secrets; /* 0x0110 */ 137 + u8 reserved112[0x120 - 0x112]; /* 0x0112 */ 139 138 } __packed __aligned(8); 140 139 141 140 /* Initialize Ultravisor */ ··· 303 292 u64 reserved30[5]; 304 293 } __packed __aligned(8); 305 294 295 + /* 296 + * A common UV call struct for pv guests that contains a single address 297 + * Examples: 298 + * Add Secret 299 + * List Secrets 300 + */ 301 + struct uv_cb_guest_addr { 302 + struct uv_cb_header header; 303 + u64 reserved08[3]; 304 + u64 addr; 305 + u64 reserved28[4]; 306 + } __packed __aligned(8); 307 + 306 308 static inline int __uv_call(unsigned long r1, unsigned long r2) 307 309 { 308 310 int cc; ··· 389 365 unsigned long conf_dump_finalize_len; 390 
366 unsigned long supp_att_req_hdr_ver; 391 367 unsigned long supp_att_pflags; 368 + unsigned long supp_add_secret_req_ver; 369 + unsigned long supp_add_secret_pcf; 370 + unsigned long supp_secret_types; 371 + unsigned short max_secrets; 392 372 }; 393 373 394 374 extern struct uv_info uv_info;

+52 -1

arch/s390/include/uapi/asm/uvdevice.h

··· 32 32 __u16 reserved136; /* 0x0136 */ 33 33 }; 34 34 35 + /** 36 + * uvio_uvdev_info - Information of supported functions 37 + * @supp_uvio_cmds - supported IOCTLs by this device 38 + * @supp_uv_cmds - supported UVCs corresponding to the IOCTL 39 + * 40 + * UVIO request to get information about supported request types by this 41 + * uvdevice and the Ultravisor. Everything is output. Bits are in LSB0 42 + * ordering. If the bit is set in both, @supp_uvio_cmds and @supp_uv_cmds, the 43 + * uvdevice and the Ultravisor support that call. 44 + * 45 + * Note that bit 0 (UVIO_IOCTL_UVDEV_INFO_NR) is always zero for `supp_uv_cmds` 46 + * as there is no corresponding UV-call. 47 + */ 48 + struct uvio_uvdev_info { 49 + /* 50 + * If bit `n` is set, this device supports the IOCTL with nr `n`. 51 + */ 52 + __u64 supp_uvio_cmds; 53 + /* 54 + * If bit `n` is set, the Ultravisor(UV) supports the UV-call 55 + * corresponding to the IOCTL with nr `n` in the calling contextx (host 56 + * or guest). The value is only valid if the corresponding bit in 57 + * @supp_uvio_cmds is set as well. 58 + */ 59 + __u64 supp_uv_cmds; 60 + }; 61 + 35 62 /* 36 63 * The following max values define an upper length for the IOCTL in/out buffers. 
37 64 * However, they do not represent the maximum the Ultravisor allows which is ··· 69 42 #define UVIO_ATT_ARCB_MAX_LEN 0x100000 70 43 #define UVIO_ATT_MEASUREMENT_MAX_LEN 0x8000 71 44 #define UVIO_ATT_ADDITIONAL_MAX_LEN 0x8000 45 + #define UVIO_ADD_SECRET_MAX_LEN 0x100000 46 + #define UVIO_LIST_SECRETS_LEN 0x1000 72 47 73 48 #define UVIO_DEVICE_NAME "uv" 74 49 #define UVIO_TYPE_UVC 'u' 75 50 76 - #define UVIO_IOCTL_ATT _IOWR(UVIO_TYPE_UVC, 0x01, struct uvio_ioctl_cb) 51 + enum UVIO_IOCTL_NR { 52 + UVIO_IOCTL_UVDEV_INFO_NR = 0x00, 53 + UVIO_IOCTL_ATT_NR, 54 + UVIO_IOCTL_ADD_SECRET_NR, 55 + UVIO_IOCTL_LIST_SECRETS_NR, 56 + UVIO_IOCTL_LOCK_SECRETS_NR, 57 + /* must be the last entry */ 58 + UVIO_IOCTL_NUM_IOCTLS 59 + }; 60 + 61 + #define UVIO_IOCTL(nr) _IOWR(UVIO_TYPE_UVC, nr, struct uvio_ioctl_cb) 62 + #define UVIO_IOCTL_UVDEV_INFO UVIO_IOCTL(UVIO_IOCTL_UVDEV_INFO_NR) 63 + #define UVIO_IOCTL_ATT UVIO_IOCTL(UVIO_IOCTL_ATT_NR) 64 + #define UVIO_IOCTL_ADD_SECRET UVIO_IOCTL(UVIO_IOCTL_ADD_SECRET_NR) 65 + #define UVIO_IOCTL_LIST_SECRETS UVIO_IOCTL(UVIO_IOCTL_LIST_SECRETS_NR) 66 + #define UVIO_IOCTL_LOCK_SECRETS UVIO_IOCTL(UVIO_IOCTL_LOCK_SECRETS_NR) 67 + 68 + #define UVIO_SUPP_CALL(nr) (1ULL << (nr)) 69 + #define UVIO_SUPP_UDEV_INFO UVIO_SUPP_CALL(UVIO_IOCTL_UDEV_INFO_NR) 70 + #define UVIO_SUPP_ATT UVIO_SUPP_CALL(UVIO_IOCTL_ATT_NR) 71 + #define UVIO_SUPP_ADD_SECRET UVIO_SUPP_CALL(UVIO_IOCTL_ADD_SECRET_NR) 72 + #define UVIO_SUPP_LIST_SECRETS UVIO_SUPP_CALL(UVIO_IOCTL_LIST_SECRETS_NR) 73 + #define UVIO_SUPP_LOCK_SECRETS UVIO_SUPP_CALL(UVIO_IOCTL_LOCK_SECRETS_NR) 77 74 78 75 #endif /* __S390_ASM_UVDEVICE_H */
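The kernel-doc for `struct uvio_uvdev_info` says a call is supported end to end only when its bit is set in both masks. A small user-space sketch of that check might look as follows. The `demo_*` names are hypothetical and merely mirror the enum and the `UVIO_SUPP_CALL()` macro from the header; a real program would presumably fill the structure via `UVIO_IOCTL_UVDEV_INFO` rather than by hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of enum UVIO_IOCTL_NR from the header above. */
enum demo_ioctl_nr {
	DEMO_INFO_NR = 0,
	DEMO_ATT_NR,
	DEMO_ADD_SECRET_NR,
	DEMO_LIST_SECRETS_NR,
	DEMO_LOCK_SECRETS_NR,
};

/* Same shape as UVIO_SUPP_CALL(): bit n (LSB0) stands for IOCTL number n. */
#define DEMO_SUPP_CALL(nr) (UINT64_C(1) << (nr))

struct demo_uvdev_info {
	uint64_t supp_uvio_cmds;	/* IOCTLs the device implements */
	uint64_t supp_uv_cmds;		/* UV-calls the Ultravisor implements */
};

/*
 * True if both the device and the Ultravisor support the call behind @nr.
 * Bit 0 is never set in supp_uv_cmds (the info IOCTL has no UV-call), so
 * this is only meaningful for nr > 0.
 */
bool demo_both_support(const struct demo_uvdev_info *info, unsigned int nr)
{
	uint64_t bit;

	assert(info != NULL && nr < 64);
	bit = DEMO_SUPP_CALL(nr);
	return (info->supp_uvio_cmds & bit) && (info->supp_uv_cmds & bit);
}
```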

+75 -33

arch/s390/kernel/uv.c

··· 23 23 int __bootdata_preserved(prot_virt_guest); 24 24 #endif 25 25 26 + /* 27 + * uv_info contains both host and guest information but it's currently only 28 + * expected to be used within modules if it's the KVM module or for 29 + * any PV guest module. 30 + * 31 + * The kernel itself will write these values once in uv_query_info() 32 + * and then make some of them readable via a sysfs interface. 33 + */ 26 34 struct uv_info __bootdata_preserved(uv_info); 35 + EXPORT_SYMBOL(uv_info); 27 36 28 37 #if IS_ENABLED(CONFIG_KVM) 29 38 int __bootdata_preserved(prot_virt_host); 30 39 EXPORT_SYMBOL(prot_virt_host); 31 - EXPORT_SYMBOL(uv_info); 32 40 33 41 static int __init uv_init(phys_addr_t stor_base, unsigned long stor_len) 34 42 { ··· 470 462 471 463 #if defined(CONFIG_PROTECTED_VIRTUALIZATION_GUEST) || IS_ENABLED(CONFIG_KVM) 472 464 static ssize_t uv_query_facilities(struct kobject *kobj, 473 - struct kobj_attribute *attr, char *page) 465 + struct kobj_attribute *attr, char *buf) 474 466 { 475 - return scnprintf(page, PAGE_SIZE, "%lx\n%lx\n%lx\n%lx\n", 476 - uv_info.inst_calls_list[0], 477 - uv_info.inst_calls_list[1], 478 - uv_info.inst_calls_list[2], 479 - uv_info.inst_calls_list[3]); 467 + return sysfs_emit(buf, "%lx\n%lx\n%lx\n%lx\n", 468 + uv_info.inst_calls_list[0], 469 + uv_info.inst_calls_list[1], 470 + uv_info.inst_calls_list[2], 471 + uv_info.inst_calls_list[3]); 480 472 } 481 473 482 474 static struct kobj_attribute uv_query_facilities_attr = ··· 501 493 __ATTR(supp_se_hdr_pcf, 0444, uv_query_supp_se_hdr_pcf, NULL); 502 494 503 495 static ssize_t uv_query_dump_cpu_len(struct kobject *kobj, 504 - struct kobj_attribute *attr, char *page) 496 + struct kobj_attribute *attr, char *buf) 505 497 { 506 - return scnprintf(page, PAGE_SIZE, "%lx\n", 507 - uv_info.guest_cpu_stor_len); 498 + return sysfs_emit(buf, "%lx\n", uv_info.guest_cpu_stor_len); 508 499 } 509 500 510 501 static struct kobj_attribute uv_query_dump_cpu_len_attr = 511 502 
__ATTR(uv_query_dump_cpu_len, 0444, uv_query_dump_cpu_len, NULL); 512 503 513 504 static ssize_t uv_query_dump_storage_state_len(struct kobject *kobj, 514 - struct kobj_attribute *attr, char *page) 505 + struct kobj_attribute *attr, char *buf) 515 506 { 516 - return scnprintf(page, PAGE_SIZE, "%lx\n", 517 - uv_info.conf_dump_storage_state_len); 507 + return sysfs_emit(buf, "%lx\n", uv_info.conf_dump_storage_state_len); 518 508 } 519 509 520 510 static struct kobj_attribute uv_query_dump_storage_state_len_attr = 521 511 __ATTR(dump_storage_state_len, 0444, uv_query_dump_storage_state_len, NULL); 522 512 523 513 static ssize_t uv_query_dump_finalize_len(struct kobject *kobj, 524 - struct kobj_attribute *attr, char *page) 514 + struct kobj_attribute *attr, char *buf) 525 515 { 526 - return scnprintf(page, PAGE_SIZE, "%lx\n", 527 - uv_info.conf_dump_finalize_len); 516 + return sysfs_emit(buf, "%lx\n", uv_info.conf_dump_finalize_len); 528 517 } 529 518 530 519 static struct kobj_attribute uv_query_dump_finalize_len_attr = ··· 537 532 __ATTR(feature_indications, 0444, uv_query_feature_indications, NULL); 538 533 539 534 static ssize_t uv_query_max_guest_cpus(struct kobject *kobj, 540 - struct kobj_attribute *attr, char *page) 535 + struct kobj_attribute *attr, char *buf) 541 536 { 542 - return scnprintf(page, PAGE_SIZE, "%d\n", 543 - uv_info.max_guest_cpu_id + 1); 537 + return sysfs_emit(buf, "%d\n", uv_info.max_guest_cpu_id + 1); 544 538 } 545 539 546 540 static struct kobj_attribute uv_query_max_guest_cpus_attr = 547 541 __ATTR(max_cpus, 0444, uv_query_max_guest_cpus, NULL); 548 542 549 543 static ssize_t uv_query_max_guest_vms(struct kobject *kobj, 550 - struct kobj_attribute *attr, char *page) 544 + struct kobj_attribute *attr, char *buf) 551 545 { 552 - return scnprintf(page, PAGE_SIZE, "%d\n", 553 - uv_info.max_num_sec_conf); 546 + return sysfs_emit(buf, "%d\n", uv_info.max_num_sec_conf); 554 547 } 555 548 556 549 static struct kobj_attribute 
uv_query_max_guest_vms_attr = 557 550 __ATTR(max_guests, 0444, uv_query_max_guest_vms, NULL); 558 551 559 552 static ssize_t uv_query_max_guest_addr(struct kobject *kobj, 560 - struct kobj_attribute *attr, char *page) 553 + struct kobj_attribute *attr, char *buf) 561 554 { 562 - return scnprintf(page, PAGE_SIZE, "%lx\n", 563 - uv_info.max_sec_stor_addr); 555 + return sysfs_emit(buf, "%lx\n", uv_info.max_sec_stor_addr); 564 556 } 565 557 566 558 static struct kobj_attribute uv_query_max_guest_addr_attr = 567 559 __ATTR(max_address, 0444, uv_query_max_guest_addr, NULL); 568 560 569 561 static ssize_t uv_query_supp_att_req_hdr_ver(struct kobject *kobj, 570 - struct kobj_attribute *attr, char *page) 562 + struct kobj_attribute *attr, char *buf) 571 563 { 572 - return scnprintf(page, PAGE_SIZE, "%lx\n", uv_info.supp_att_req_hdr_ver); 564 + return sysfs_emit(buf, "%lx\n", uv_info.supp_att_req_hdr_ver); 573 565 } 574 566 575 567 static struct kobj_attribute uv_query_supp_att_req_hdr_ver_attr = 576 568 __ATTR(supp_att_req_hdr_ver, 0444, uv_query_supp_att_req_hdr_ver, NULL); 577 569 578 570 static ssize_t uv_query_supp_att_pflags(struct kobject *kobj, 579 - struct kobj_attribute *attr, char *page) 571 + struct kobj_attribute *attr, char *buf) 580 572 { 581 - return scnprintf(page, PAGE_SIZE, "%lx\n", uv_info.supp_att_pflags); 573 + return sysfs_emit(buf, "%lx\n", uv_info.supp_att_pflags); 582 574 } 583 575 584 576 static struct kobj_attribute uv_query_supp_att_pflags_attr = 585 577 __ATTR(supp_att_pflags, 0444, uv_query_supp_att_pflags, NULL); 578 + 579 + static ssize_t uv_query_supp_add_secret_req_ver(struct kobject *kobj, 580 + struct kobj_attribute *attr, char *buf) 581 + { 582 + return sysfs_emit(buf, "%lx\n", uv_info.supp_add_secret_req_ver); 583 + } 584 + 585 + static struct kobj_attribute uv_query_supp_add_secret_req_ver_attr = 586 + __ATTR(supp_add_secret_req_ver, 0444, uv_query_supp_add_secret_req_ver, NULL); 587 + 588 + static ssize_t 
uv_query_supp_add_secret_pcf(struct kobject *kobj, 589 + struct kobj_attribute *attr, char *buf) 590 + { 591 + return sysfs_emit(buf, "%lx\n", uv_info.supp_add_secret_pcf); 592 + } 593 + 594 + static struct kobj_attribute uv_query_supp_add_secret_pcf_attr = 595 + __ATTR(supp_add_secret_pcf, 0444, uv_query_supp_add_secret_pcf, NULL); 596 + 597 + static ssize_t uv_query_supp_secret_types(struct kobject *kobj, 598 + struct kobj_attribute *attr, char *buf) 599 + { 600 + return sysfs_emit(buf, "%lx\n", uv_info.supp_secret_types); 601 + } 602 + 603 + static struct kobj_attribute uv_query_supp_secret_types_attr = 604 + __ATTR(supp_secret_types, 0444, uv_query_supp_secret_types, NULL); 605 + 606 + static ssize_t uv_query_max_secrets(struct kobject *kobj, 607 + struct kobj_attribute *attr, char *buf) 608 + { 609 + return sysfs_emit(buf, "%d\n", uv_info.max_secrets); 610 + } 611 + 612 + static struct kobj_attribute uv_query_max_secrets_attr = 613 + __ATTR(max_secrets, 0444, uv_query_max_secrets, NULL); 586 614 587 615 static struct attribute *uv_query_attrs[] = { 588 616 &uv_query_facilities_attr.attr, ··· 630 592 &uv_query_dump_cpu_len_attr.attr, 631 593 &uv_query_supp_att_req_hdr_ver_attr.attr, 632 594 &uv_query_supp_att_pflags_attr.attr, 595 + &uv_query_supp_add_secret_req_ver_attr.attr, 596 + &uv_query_supp_add_secret_pcf_attr.attr, 597 + &uv_query_supp_secret_types_attr.attr, 598 + &uv_query_max_secrets_attr.attr, 633 599 NULL, 634 600 }; 635 601 ··· 642 600 }; 643 601 644 602 static ssize_t uv_is_prot_virt_guest(struct kobject *kobj, 645 - struct kobj_attribute *attr, char *page) 603 + struct kobj_attribute *attr, char *buf) 646 604 { 647 605 int val = 0; 648 606 649 607 #ifdef CONFIG_PROTECTED_VIRTUALIZATION_GUEST 650 608 val = prot_virt_guest; 651 609 #endif 652 - return scnprintf(page, PAGE_SIZE, "%d\n", val); 610 + return sysfs_emit(buf, "%d\n", val); 653 611 } 654 612 655 613 static ssize_t uv_is_prot_virt_host(struct kobject *kobj, 656 - struct kobj_attribute 
*attr, char *page) 614 + struct kobj_attribute *attr, char *buf) 657 615 { 658 616 int val = 0; 659 617 ··· 661 619 val = prot_virt_host; 662 620 #endif 663 621 664 - return scnprintf(page, PAGE_SIZE, "%d\n", val); 622 + return sysfs_emit(buf, "%d\n", val); 665 623 } 666 624 667 625 static struct kobj_attribute uv_prot_virt_guest =

+5 -3

arch/s390/kvm/diag.c

··· 166 166 static int __diag_time_slice_end_directed(struct kvm_vcpu *vcpu) 167 167 { 168 168 struct kvm_vcpu *tcpu; 169 + int tcpu_cpu; 169 170 int tid; 170 171 171 172 tid = vcpu->run->s.regs.gprs[(vcpu->arch.sie_block->ipa & 0xf0) >> 4]; ··· 182 181 goto no_yield; 183 182 184 183 /* target guest VCPU already running */ 185 - if (READ_ONCE(tcpu->cpu) >= 0) { 184 + tcpu_cpu = READ_ONCE(tcpu->cpu); 185 + if (tcpu_cpu >= 0) { 186 186 if (!diag9c_forwarding_hz || diag9c_forwarding_overrun()) 187 187 goto no_yield; 188 188 189 189 /* target host CPU already running */ 190 - if (!vcpu_is_preempted(tcpu->cpu)) 190 + if (!vcpu_is_preempted(tcpu_cpu)) 191 191 goto no_yield; 192 - smp_yield_cpu(tcpu->cpu); 192 + smp_yield_cpu(tcpu_cpu); 193 193 VCPU_EVENT(vcpu, 5, 194 194 "diag time slice end directed to %d: yield forwarded", 195 195 tid);
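The diag.c change reads `tcpu->cpu` once into a local and uses that snapshot for the check and for both later calls, so a concurrent update cannot slip a negative value in after the `>= 0` test. As a rough user-space analogue, the same pattern can be written with a C11 relaxed atomic load. This is an illustration under that assumption only: `READ_ONCE()` and C11 atomics are not the same primitive, and the `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct demo_vcpu {
	_Atomic int cpu;	/* host CPU the vCPU runs on, or -1 if not loaded */
};

/*
 * Return the host CPU to forward a yield to, or -1 if the target is not
 * running. The field is loaded exactly once, so the value that passed the
 * check is the value that gets used afterwards.
 */
int demo_yield_target(struct demo_vcpu *t)
{
	int cpu;

	assert(t != NULL);
	cpu = atomic_load_explicit(&t->cpu, memory_order_relaxed);
	if (cpu < 0)
		return -1;
	return cpu;
}
```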

+4

arch/s390/kvm/kvm-s390.c

@@ -2156,6 +2156,10 @@
 		ms = container_of(mnode, struct kvm_memory_slot, gfn_node[slots->node_idx]);
 		ofs = 0;
 	}
+
+	if (cur_gfn < ms->base_gfn)
+		ofs = 0;
+
 	ofs = find_next_bit(kvm_second_dirty_bitmap(ms), ms->npages, ofs);
 	while (ofs >= ms->npages && (mnode = rb_next(mnode))) {
 		ms = container_of(mnode, struct kvm_memory_slot, gfn_node[slots->node_idx]);

+4 -2

arch/s390/kvm/vsie.c

··· 177 177 sizeof(struct kvm_s390_apcb0))) 178 178 return -EFAULT; 179 179 180 - bitmap_and(apcb_s, apcb_s, apcb_h, sizeof(struct kvm_s390_apcb0)); 180 + bitmap_and(apcb_s, apcb_s, apcb_h, 181 + BITS_PER_BYTE * sizeof(struct kvm_s390_apcb0)); 181 182 182 183 return 0; 183 184 } ··· 204 203 sizeof(struct kvm_s390_apcb1))) 205 204 return -EFAULT; 206 205 207 - bitmap_and(apcb_s, apcb_s, apcb_h, sizeof(struct kvm_s390_apcb1)); 206 + bitmap_and(apcb_s, apcb_s, apcb_h, 207 + BITS_PER_BYTE * sizeof(struct kvm_s390_apcb1)); 208 208 209 209 return 0; 210 210 }
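The vsie.c fix hinges on units: `bitmap_and()` takes a length in bits, while `sizeof()` yields bytes, so the old call covered only one eighth of the control block. The sketch below shows the effect with a simplified user-space stand-in. `demo_bitmap_and()` and `struct demo_apcb` are hypothetical, use 64-bit words, and only approximate the kernel helper (the partial last word is masked, words beyond it are left alone).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_WORD_BITS 64u

/* Simplified stand-in for bitmap_and(): @nbits is a count of bits. */
void demo_bitmap_and(uint64_t *dst, const uint64_t *a, const uint64_t *b,
		     size_t nbits)
{
	size_t full = nbits / DEMO_WORD_BITS;
	size_t rem = nbits % DEMO_WORD_BITS;
	size_t i;

	assert(dst != NULL && a != NULL && b != NULL);

	for (i = 0; i < full; i++)
		dst[i] = a[i] & b[i];
	if (rem)
		dst[full] = a[full] & b[full] & ((UINT64_C(1) << rem) - 1);
}

/* Hypothetical 32-byte block, i.e. 256 bits of mask data. */
struct demo_apcb {
	uint64_t w[4];
};
```

Calling `demo_bitmap_and(d.w, s.w, h.w, sizeof(d))` processes 32 bits and leaves `d.w[1]` through `d.w[3]` unmasked, whereas `CHAR_BIT * sizeof(d)` covers all 256 bits, which is the user-space counterpart of `BITS_PER_BYTE * sizeof(...)` in the hunk.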

-1

arch/x86/include/asm/kvm-x86-pmu-ops.h

@@ -13,7 +13,6 @@
  * at the call sites.
  */
 KVM_X86_PMU_OP(hw_event_available)
-KVM_X86_PMU_OP(pmc_is_enabled)
 KVM_X86_PMU_OP(pmc_idx_to_pmc)
 KVM_X86_PMU_OP(rdpmc_ecx_to_pmc)
 KVM_X86_PMU_OP(msr_idx_to_pmc)

+1 -1

arch/x86/include/asm/kvm_host.h

@@ -523,7 +523,7 @@
 	u64 global_status;
 	u64 counter_bitmask[2];
 	u64 global_ctrl_mask;
-	u64 global_ovf_ctrl_mask;
+	u64 global_status_mask;
 	u64 reserved_bits;
 	u64 raw_event_mask;
 	struct kvm_pmc gp_counters[KVM_INTEL_PMC_MAX_GENERIC];

+32 -11

arch/x86/kvm/cpuid.c

··· 501 501 struct kvm_cpuid2 *cpuid, 502 502 struct kvm_cpuid_entry2 __user *entries) 503 503 { 504 - int r; 505 - 506 - r = -E2BIG; 507 504 if (cpuid->nent < vcpu->arch.cpuid_nent) 508 - goto out; 509 - r = -EFAULT; 505 + return -E2BIG; 506 + 510 507 if (copy_to_user(entries, vcpu->arch.cpuid_entries, 511 508 vcpu->arch.cpuid_nent * sizeof(struct kvm_cpuid_entry2))) 512 - goto out; 513 - return 0; 509 + return -EFAULT; 514 510 515 - out: 516 511 cpuid->nent = vcpu->arch.cpuid_nent; 517 - return r; 512 + return 0; 518 513 } 519 514 520 515 /* Mask kvm_cpu_caps for @leaf with the raw CPUID capabilities of this CPU. */ ··· 729 734 F(NULL_SEL_CLR_BASE) | F(AUTOIBRS) | 0 /* PrefetchCtlMsr */ 730 735 ); 731 736 737 + kvm_cpu_cap_init_kvm_defined(CPUID_8000_0022_EAX, 738 + F(PERFMON_V2) 739 + ); 740 + 732 741 /* 733 742 * Synthesize "LFENCE is serializing" into the AMD-defined entry in 734 743 * KVM's supported CPUID if the feature is reported as supported by the ··· 947 948 union cpuid10_eax eax; 948 949 union cpuid10_edx edx; 949 950 950 - if (!static_cpu_has(X86_FEATURE_ARCH_PERFMON)) { 951 + if (!enable_pmu || !static_cpu_has(X86_FEATURE_ARCH_PERFMON)) { 951 952 entry->eax = entry->ebx = entry->ecx = entry->edx = 0; 952 953 break; 953 954 } ··· 1127 1128 entry->edx = 0; 1128 1129 break; 1129 1130 case 0x80000000: 1130 - entry->eax = min(entry->eax, 0x80000021); 1131 + entry->eax = min(entry->eax, 0x80000022); 1131 1132 /* 1132 1133 * Serializing LFENCE is reported in a multitude of ways, and 1133 1134 * NullSegClearsBase is not reported in CPUID on Zen2; help ··· 1232 1233 entry->ebx = entry->ecx = entry->edx = 0; 1233 1234 cpuid_entry_override(entry, CPUID_8000_0021_EAX); 1234 1235 break; 1236 + /* AMD Extended Performance Monitoring and Debug */ 1237 + case 0x80000022: { 1238 + union cpuid_0x80000022_ebx ebx; 1239 + 1240 + entry->ecx = entry->edx = 0; 1241 + if (!enable_pmu || !kvm_cpu_cap_has(X86_FEATURE_PERFMON_V2)) { 1242 + entry->eax = entry->ebx; 1243 + 
break; 1244 + } 1245 + 1246 + cpuid_entry_override(entry, CPUID_8000_0022_EAX); 1247 + 1248 + if (kvm_cpu_cap_has(X86_FEATURE_PERFMON_V2)) 1249 + ebx.split.num_core_pmc = kvm_pmu_cap.num_counters_gp; 1250 + else if (kvm_cpu_cap_has(X86_FEATURE_PERFCTR_CORE)) 1251 + ebx.split.num_core_pmc = AMD64_NUM_COUNTERS_CORE; 1252 + else 1253 + ebx.split.num_core_pmc = AMD64_NUM_COUNTERS; 1254 + 1255 + entry->ebx = ebx.full; 1256 + break; 1257 + } 1235 1258 /*Add support for Centaur's CPUID instruction*/ 1236 1259 case 0xC0000000: 1237 1260 /*Just support up to 0xC0000004 now*/

+3

arch/x86/kvm/i8259.c

@@ -411,7 +411,10 @@
 		pic_clear_isr(s, ret);
 		if (addr1 >> 7 || ret != 2)
 			pic_update_irq(s->pics_state);
+		/* Bit 7 is 1, means there's an interrupt */
+		ret |= 0x80;
 	} else {
+		/* Bit 7 is 0, means there's no interrupt */
 		ret = 0x07;
 		pic_update_irq(s->pics_state);
 	}
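With bit 7 now set whenever an interrupt is pending, a poll result for IRQ 7 (`0x87`) can be told apart from the "nothing pending" value `0x07`. The following sketch encodes and decodes the poll byte the way the comments in the hunk describe it. The `demo_*` helpers are hypothetical and exist only to illustrate the bit layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build the poll byte as the hunk does; irq < 0 means nothing is pending. */
uint8_t demo_poll_encode(int irq)
{
	assert(irq < 8);
	if (irq >= 0)
		return (uint8_t)(0x80 | irq);
	return 0x07;
}

/* Bit 7 tells the reader whether an interrupt was pending. */
bool demo_poll_pending(uint8_t v)
{
	return (v & 0x80) != 0;
}

/* Bits 2..0 carry the interrupt level; only valid if bit 7 is set. */
unsigned int demo_poll_level(uint8_t v)
{
	return v & 0x07;
}
```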

-5

arch/x86/kvm/lapic.c

@@ -51,11 +51,6 @@
 #define mod_64(x, y) ((x) % (y))
 #endif
 
-#define PRId64 "d"
-#define PRIx64 "llx"
-#define PRIu64 "u"
-#define PRIo64 "o"
-
 /* 14 is the version for Xeon and Pentium 8.4.8*/
 #define APIC_VERSION 0x14UL
 #define LAPIC_MMIO_LENGTH (1 << 12)

+48 -5

arch/x86/kvm/mmu/mmu.c

··· 58 58 59 59 extern bool itlb_multihit_kvm_mitigation; 60 60 61 + static bool nx_hugepage_mitigation_hard_disabled; 62 + 61 63 int __read_mostly nx_huge_pages = -1; 62 64 static uint __read_mostly nx_huge_pages_recovery_period_ms; 63 65 #ifdef CONFIG_PREEMPT_RT ··· 69 67 static uint __read_mostly nx_huge_pages_recovery_ratio = 60; 70 68 #endif 71 69 70 + static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp); 72 71 static int set_nx_huge_pages(const char *val, const struct kernel_param *kp); 73 72 static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel_param *kp); 74 73 75 74 static const struct kernel_param_ops nx_huge_pages_ops = { 76 75 .set = set_nx_huge_pages, 77 - .get = param_get_bool, 76 + .get = get_nx_huge_pages, 78 77 }; 79 78 80 79 static const struct kernel_param_ops nx_huge_pages_recovery_param_ops = { ··· 1602 1599 1603 1600 if (tdp_mmu_enabled) 1604 1601 flush = kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush); 1602 + 1603 + if (kvm_x86_ops.set_apic_access_page_addr && 1604 + range->slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) 1605 + kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD); 1605 1606 1606 1607 return flush; 1607 1608 } ··· 5804 5797 5805 5798 vcpu_clear_mmio_info(vcpu, addr); 5806 5799 5800 + /* 5801 + * Walking and synchronizing SPTEs both assume they are operating in 5802 + * the context of the current MMU, and would need to be reworked if 5803 + * this is ever used to sync the guest_mmu, e.g. to emulate INVEPT. 
5804 + */ 5805 + if (WARN_ON_ONCE(mmu != vcpu->arch.mmu)) 5806 + return; 5807 + 5807 5808 if (!VALID_PAGE(root_hpa)) 5808 5809 return; 5809 5810 ··· 6859 6844 kmem_cache_destroy(mmu_page_header_cache); 6860 6845 } 6861 6846 6847 + static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp) 6848 + { 6849 + if (nx_hugepage_mitigation_hard_disabled) 6850 + return sprintf(buffer, "never\n"); 6851 + 6852 + return param_get_bool(buffer, kp); 6853 + } 6854 + 6862 6855 static bool get_nx_auto_mode(void) 6863 6856 { 6864 6857 /* Return true when CPU has the bug, and mitigations are ON */ ··· 6883 6860 bool old_val = nx_huge_pages; 6884 6861 bool new_val; 6885 6862 6863 + if (nx_hugepage_mitigation_hard_disabled) 6864 + return -EPERM; 6865 + 6886 6866 /* In "auto" mode deploy workaround only if CPU has the bug. */ 6887 - if (sysfs_streq(val, "off")) 6867 + if (sysfs_streq(val, "off")) { 6888 6868 new_val = 0; 6889 - else if (sysfs_streq(val, "force")) 6869 + } else if (sysfs_streq(val, "force")) { 6890 6870 new_val = 1; 6891 - else if (sysfs_streq(val, "auto")) 6871 + } else if (sysfs_streq(val, "auto")) { 6892 6872 new_val = get_nx_auto_mode(); 6893 - else if (kstrtobool(val, &new_val) < 0) 6873 + } else if (sysfs_streq(val, "never")) { 6874 + new_val = 0; 6875 + 6876 + mutex_lock(&kvm_lock); 6877 + if (!list_empty(&vm_list)) { 6878 + mutex_unlock(&kvm_lock); 6879 + return -EBUSY; 6880 + } 6881 + nx_hugepage_mitigation_hard_disabled = true; 6882 + mutex_unlock(&kvm_lock); 6883 + } else if (kstrtobool(val, &new_val) < 0) { 6894 6884 return -EINVAL; 6885 + } 6895 6886 6896 6887 __set_nx_huge_pages(new_val); 6897 6888 ··· 7042 7005 bool was_recovery_enabled, is_recovery_enabled; 7043 7006 uint old_period, new_period; 7044 7007 int err; 7008 + 7009 + if (nx_hugepage_mitigation_hard_disabled) 7010 + return -EPERM; 7045 7011 7046 7012 was_recovery_enabled = calc_nx_huge_pages_recovery_period(&old_period); 7047 7013 ··· 7203 7163 int kvm_mmu_post_init_vm(struct kvm 
*kvm) 7204 7164 { 7205 7165 int err; 7166 + 7167 + if (nx_hugepage_mitigation_hard_disabled) 7168 + return 0; 7206 7169 7207 7170 err = kvm_vm_create_worker_thread(kvm, kvm_nx_huge_page_recovery_worker, 0, 7208 7171 "kvm-nx-lpage-recovery",
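
The `set_nx_huge_pages()` hunk above adds a `never` value that latches a hard-disable flag, after which every later write fails with `-EPERM`, and `never` itself is refused with `-EBUSY` while any VM exists. The following is a minimal user-space sketch of that ordering of checks, not the kernel code: `struct nx_state`, `nx_set()`, `vm_count` and `cpu_has_bug` are hypothetical names, `strcmp()` stands in for `sysfs_streq()`, and the `kvm_lock` locking and the `kstrtobool()` fallback are omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of the module-parameter state; not kernel code. */
struct nx_state {
	bool enabled;        /* models nx_huge_pages */
	bool hard_disabled;  /* models nx_hugepage_mitigation_hard_disabled */
	int vm_count;        /* stands in for !list_empty(&vm_list) */
	bool cpu_has_bug;    /* stands in for get_nx_auto_mode() */
};

/* Returns 0 on success or a negative errno, in the same order as the hunk. */
static int nx_set(struct nx_state *s, const char *val)
{
	bool new_val;

	assert(s && val);

	/* Once "never" has latched, every later write is rejected. */
	if (s->hard_disabled)
		return -EPERM;

	if (!strcmp(val, "off")) {
		new_val = false;
	} else if (!strcmp(val, "force")) {
		new_val = true;
	} else if (!strcmp(val, "auto")) {
		new_val = s->cpu_has_bug;
	} else if (!strcmp(val, "never")) {
		/* "never" is only accepted while no VM exists. */
		if (s->vm_count > 0)
			return -EBUSY;
		new_val = false;
		s->hard_disabled = true;
	} else {
		/* The real code falls back to kstrtobool(); omitted here. */
		return -EINVAL;
	}

	s->enabled = new_val;
	return 0;
}
```

In this model the `-EPERM` check comes first, so even an otherwise invalid string reports `-EPERM` once the flag is set, which matches the early return added at the top of `set_nx_huge_pages()`.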

+4 -1

arch/x86/kvm/mmu/tdp_mmu.c

···

 	/*
 	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
-	 * does not hold the mmu_lock.
+	 * does not hold the mmu_lock. On failure, i.e. if a different logical
+	 * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
+	 * the current value, so the caller operates on fresh data, e.g. if it
+	 * retries tdp_mmu_set_spte_atomic()
 	 */
 	if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
 		return -EBUSY;
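
The comment added in this hunk relies on `try_cmpxchg64()` refreshing the caller's "old" value when the exchange fails. C11's `atomic_compare_exchange_strong()` behaves the same way, so the idea can be sketched in user space. This is an illustration only: `spte_try_set()` and `spte_set_bits()` are hypothetical helpers, not KVM functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical user-space analogue of the try_cmpxchg64() call: returns true
 * if *sptep was swapped from *old_spte to new_spte. On failure, *old_spte is
 * overwritten with the value currently stored in *sptep.
 */
static bool spte_try_set(_Atomic uint64_t *sptep, uint64_t *old_spte,
			 uint64_t new_spte)
{
	assert(sptep && old_spte);
	return atomic_compare_exchange_strong(sptep, old_spte, new_spte);
}

/* Retry loop that depends on the refreshed "old" value after each failure. */
static void spte_set_bits(_Atomic uint64_t *sptep, uint64_t bits)
{
	uint64_t old = atomic_load(sptep);

	while (!spte_try_set(sptep, &old, old | bits))
		; /* old now holds the current value, so just try again */
}
```

Because the failed exchange already reloads the current value, the retry loop never needs a separate read of `*sptep`.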

+31 -33

arch/x86/kvm/mtrr.c

··· 25 25 #define IA32_MTRR_DEF_TYPE_FE (1ULL << 10) 26 26 #define IA32_MTRR_DEF_TYPE_TYPE_MASK (0xff) 27 27 28 + static bool is_mtrr_base_msr(unsigned int msr) 29 + { 30 + /* MTRR base MSRs use even numbers, masks use odd numbers. */ 31 + return !(msr & 0x1); 32 + } 33 + 34 + static struct kvm_mtrr_range *var_mtrr_msr_to_range(struct kvm_vcpu *vcpu, 35 + unsigned int msr) 36 + { 37 + int index = (msr - MTRRphysBase_MSR(0)) / 2; 38 + 39 + return &vcpu->arch.mtrr_state.var_ranges[index]; 40 + } 41 + 28 42 static bool msr_mtrr_valid(unsigned msr) 29 43 { 30 44 switch (msr) { 31 - case 0x200 ... 0x200 + 2 * KVM_NR_VAR_MTRR - 1: 45 + case MTRRphysBase_MSR(0) ... MTRRphysMask_MSR(KVM_NR_VAR_MTRR - 1): 32 46 case MSR_MTRRfix64K_00000: 33 47 case MSR_MTRRfix16K_80000: 34 48 case MSR_MTRRfix16K_A0000: ··· 55 41 case MSR_MTRRfix4K_F0000: 56 42 case MSR_MTRRfix4K_F8000: 57 43 case MSR_MTRRdefType: 58 - case MSR_IA32_CR_PAT: 59 44 return true; 60 45 } 61 46 return false; ··· 65 52 return t < 8 && (1 << t) & 0x73; /* 0, 1, 4, 5, 6 */ 66 53 } 67 54 68 - bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data) 55 + static bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data) 69 56 { 70 57 int i; 71 58 u64 mask; ··· 73 60 if (!msr_mtrr_valid(msr)) 74 61 return false; 75 62 76 - if (msr == MSR_IA32_CR_PAT) { 77 - return kvm_pat_valid(data); 78 - } else if (msr == MSR_MTRRdefType) { 63 + if (msr == MSR_MTRRdefType) { 79 64 if (data & ~0xcff) 80 65 return false; 81 66 return valid_mtrr_type(data & 0xff); ··· 85 74 } 86 75 87 76 /* variable MTRRs */ 88 - WARN_ON(!(msr >= 0x200 && msr < 0x200 + 2 * KVM_NR_VAR_MTRR)); 77 + WARN_ON(!(msr >= MTRRphysBase_MSR(0) && 78 + msr <= MTRRphysMask_MSR(KVM_NR_VAR_MTRR - 1))); 89 79 90 80 mask = kvm_vcpu_reserved_gpa_bits_raw(vcpu); 91 81 if ((msr & 1) == 0) { ··· 100 88 101 89 return (data & mask) == 0; 102 90 } 103 - EXPORT_SYMBOL_GPL(kvm_mtrr_valid); 104 91 105 92 static bool mtrr_is_enabled(struct kvm_mtrr *mtrr_state) 106 93 { ··· 
319 308 { 320 309 struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state; 321 310 gfn_t start, end; 322 - int index; 323 311 324 - if (msr == MSR_IA32_CR_PAT || !tdp_enabled || 325 - !kvm_arch_has_noncoherent_dma(vcpu->kvm)) 312 + if (!tdp_enabled || !kvm_arch_has_noncoherent_dma(vcpu->kvm)) 326 313 return; 327 314 328 315 if (!mtrr_is_enabled(mtrr_state) && msr != MSR_MTRRdefType) ··· 335 326 end = ~0ULL; 336 327 } else { 337 328 /* variable range MTRRs. */ 338 - index = (msr - 0x200) / 2; 339 - var_mtrr_range(&mtrr_state->var_ranges[index], &start, &end); 329 + var_mtrr_range(var_mtrr_msr_to_range(vcpu, msr), &start, &end); 340 330 } 341 331 342 332 kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end)); ··· 350 342 { 351 343 struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state; 352 344 struct kvm_mtrr_range *tmp, *cur; 353 - int index, is_mtrr_mask; 354 345 355 - index = (msr - 0x200) / 2; 356 - is_mtrr_mask = msr - 0x200 - 2 * index; 357 - cur = &mtrr_state->var_ranges[index]; 346 + cur = var_mtrr_msr_to_range(vcpu, msr); 358 347 359 348 /* remove the entry if it's in the list. */ 360 349 if (var_mtrr_range_is_valid(cur)) 361 - list_del(&mtrr_state->var_ranges[index].node); 350 + list_del(&cur->node); 362 351 363 352 /* 364 353 * Set all illegal GPA bits in the mask, since those bits must 365 354 * implicitly be 0. The bits are then cleared when reading them. 
366 355 */ 367 - if (!is_mtrr_mask) 356 + if (is_mtrr_base_msr(msr)) 368 357 cur->base = data; 369 358 else 370 359 cur->mask = data | kvm_vcpu_reserved_gpa_bits_raw(vcpu); ··· 387 382 *(u64 *)&vcpu->arch.mtrr_state.fixed_ranges[index] = data; 388 383 else if (msr == MSR_MTRRdefType) 389 384 vcpu->arch.mtrr_state.deftype = data; 390 - else if (msr == MSR_IA32_CR_PAT) 391 - vcpu->arch.pat = data; 392 385 else 393 386 set_var_mtrr_msr(vcpu, msr, data); 394 387 ··· 414 411 return 1; 415 412 416 413 index = fixed_msr_to_range_index(msr); 417 - if (index >= 0) 414 + if (index >= 0) { 418 415 *pdata = *(u64 *)&vcpu->arch.mtrr_state.fixed_ranges[index]; 419 - else if (msr == MSR_MTRRdefType) 416 + } else if (msr == MSR_MTRRdefType) { 420 417 *pdata = vcpu->arch.mtrr_state.deftype; 421 - else if (msr == MSR_IA32_CR_PAT) 422 - *pdata = vcpu->arch.pat; 423 - else { /* Variable MTRRs */ 424 - int is_mtrr_mask; 425 - 426 - index = (msr - 0x200) / 2; 427 - is_mtrr_mask = msr - 0x200 - 2 * index; 428 - if (!is_mtrr_mask) 429 - *pdata = vcpu->arch.mtrr_state.var_ranges[index].base; 418 + } else { 419 + /* Variable MTRRs */ 420 + if (is_mtrr_base_msr(msr)) 421 + *pdata = var_mtrr_msr_to_range(vcpu, msr)->base; 430 422 else 431 - *pdata = vcpu->arch.mtrr_state.var_ranges[index].mask; 423 + *pdata = var_mtrr_msr_to_range(vcpu, msr)->mask; 432 424 433 425 *pdata &= ~kvm_vcpu_reserved_gpa_bits_raw(vcpu); 434 426 }
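
The mtrr.c changes replace open-coded `0x200` arithmetic with two helpers: an even MSR number is a base register, an odd one is a mask, and each pair maps to one variable range. A standalone sketch of that mapping follows. The two macros are restated locally so the example compiles on its own, `NR_VAR_MTRR` is an assumed stand-in for `KVM_NR_VAR_MTRR`, and `var_mtrr_msr_to_index()` is a hypothetical variant that returns the index instead of a pointer into vCPU state.

```c
#include <assert.h>
#include <stdbool.h>

/* Restated locally for the sketch; the kernel headers provide these. */
#define MTRRphysBase_MSR(reg) (0x200 + 2 * (reg))
#define MTRRphysMask_MSR(reg) (0x200 + 2 * (reg) + 1)
#define NR_VAR_MTRR 8 /* assumed value standing in for KVM_NR_VAR_MTRR */

/* MTRR base MSRs use even numbers, masks use odd numbers. */
static bool is_mtrr_base_msr(unsigned int msr)
{
	return !(msr & 0x1);
}

/* Hypothetical: returns the variable-range index for a base or mask MSR. */
static int var_mtrr_msr_to_index(unsigned int msr)
{
	assert(msr >= MTRRphysBase_MSR(0) &&
	       msr <= MTRRphysMask_MSR(NR_VAR_MTRR - 1));
	return (msr - MTRRphysBase_MSR(0)) / 2;
}
```

Both MSRs of a pair map to the same index, which is why the get and set paths can share one lookup and only branch on `is_mtrr_base_msr()`.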

+84 -8

arch/x86/kvm/pmu.c

··· 93 93 #undef __KVM_X86_PMU_OP 94 94 } 95 95 96 - static inline bool pmc_is_globally_enabled(struct kvm_pmc *pmc) 97 - { 98 - return static_call(kvm_x86_pmu_pmc_is_enabled)(pmc); 99 - } 100 - 101 96 static void kvm_pmi_trigger_fn(struct irq_work *irq_work) 102 97 { 103 98 struct kvm_pmu *pmu = container_of(irq_work, struct kvm_pmu, irq_work); ··· 557 562 558 563 bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) 559 564 { 565 + switch (msr) { 566 + case MSR_CORE_PERF_GLOBAL_STATUS: 567 + case MSR_CORE_PERF_GLOBAL_CTRL: 568 + case MSR_CORE_PERF_GLOBAL_OVF_CTRL: 569 + return kvm_pmu_has_perf_global_ctrl(vcpu_to_pmu(vcpu)); 570 + default: 571 + break; 572 + } 560 573 return static_call(kvm_x86_pmu_msr_idx_to_pmc)(vcpu, msr) || 561 574 static_call(kvm_x86_pmu_is_valid_msr)(vcpu, msr); 562 575 } ··· 580 577 581 578 int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) 582 579 { 583 - return static_call(kvm_x86_pmu_get_msr)(vcpu, msr_info); 580 + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); 581 + u32 msr = msr_info->index; 582 + 583 + switch (msr) { 584 + case MSR_CORE_PERF_GLOBAL_STATUS: 585 + case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS: 586 + msr_info->data = pmu->global_status; 587 + break; 588 + case MSR_AMD64_PERF_CNTR_GLOBAL_CTL: 589 + case MSR_CORE_PERF_GLOBAL_CTRL: 590 + msr_info->data = pmu->global_ctrl; 591 + break; 592 + case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR: 593 + case MSR_CORE_PERF_GLOBAL_OVF_CTRL: 594 + msr_info->data = 0; 595 + break; 596 + default: 597 + return static_call(kvm_x86_pmu_get_msr)(vcpu, msr_info); 598 + } 599 + 600 + return 0; 584 601 } 585 602 586 603 int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) 587 604 { 588 - kvm_pmu_mark_pmc_in_use(vcpu, msr_info->index); 589 - return static_call(kvm_x86_pmu_set_msr)(vcpu, msr_info); 605 + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); 606 + u32 msr = msr_info->index; 607 + u64 data = msr_info->data; 608 + u64 diff; 609 + 610 + /* 611 + * Note, AMD ignores writes 
to reserved bits and read-only PMU MSRs, 612 + * whereas Intel generates #GP on attempts to write reserved/RO MSRs. 613 + */ 614 + switch (msr) { 615 + case MSR_CORE_PERF_GLOBAL_STATUS: 616 + if (!msr_info->host_initiated) 617 + return 1; /* RO MSR */ 618 + fallthrough; 619 + case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS: 620 + /* Per PPR, Read-only MSR. Writes are ignored. */ 621 + if (!msr_info->host_initiated) 622 + break; 623 + 624 + if (data & pmu->global_status_mask) 625 + return 1; 626 + 627 + pmu->global_status = data; 628 + break; 629 + case MSR_AMD64_PERF_CNTR_GLOBAL_CTL: 630 + data &= ~pmu->global_ctrl_mask; 631 + fallthrough; 632 + case MSR_CORE_PERF_GLOBAL_CTRL: 633 + if (!kvm_valid_perf_global_ctrl(pmu, data)) 634 + return 1; 635 + 636 + if (pmu->global_ctrl != data) { 637 + diff = pmu->global_ctrl ^ data; 638 + pmu->global_ctrl = data; 639 + reprogram_counters(pmu, diff); 640 + } 641 + break; 642 + case MSR_CORE_PERF_GLOBAL_OVF_CTRL: 643 + /* 644 + * GLOBAL_OVF_CTRL, a.k.a. GLOBAL STATUS_RESET, clears bits in 645 + * GLOBAL_STATUS, and so the set of reserved bits is the same. 646 + */ 647 + if (data & pmu->global_status_mask) 648 + return 1; 649 + fallthrough; 650 + case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR: 651 + if (!msr_info->host_initiated) 652 + pmu->global_status &= ~data; 653 + break; 654 + default: 655 + kvm_pmu_mark_pmc_in_use(vcpu, msr_info->index); 656 + return static_call(kvm_x86_pmu_set_msr)(vcpu, msr_info); 657 + } 658 + 659 + return 0; 590 660 } 591 661 592 662 /* refresh PMU settings. This function generally is called when underlying

+51 -5

arch/x86/kvm/pmu.h

··· 20 20 21 21 struct kvm_pmu_ops { 22 22 bool (*hw_event_available)(struct kvm_pmc *pmc); 23 - bool (*pmc_is_enabled)(struct kvm_pmc *pmc); 24 23 struct kvm_pmc *(*pmc_idx_to_pmc)(struct kvm_pmu *pmu, int pmc_idx); 25 24 struct kvm_pmc *(*rdpmc_ecx_to_pmc)(struct kvm_vcpu *vcpu, 26 25 unsigned int idx, u64 *mask); ··· 36 37 37 38 const u64 EVENTSEL_EVENT; 38 39 const int MAX_NR_GP_COUNTERS; 40 + const int MIN_NR_GP_COUNTERS; 39 41 }; 40 42 41 43 void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops); 44 + 45 + static inline bool kvm_pmu_has_perf_global_ctrl(struct kvm_pmu *pmu) 46 + { 47 + /* 48 + * Architecturally, Intel's SDM states that IA32_PERF_GLOBAL_CTRL is 49 + * supported if "CPUID.0AH: EAX[7:0] > 0", i.e. if the PMU version is 50 + * greater than zero. However, KVM only exposes and emulates the MSR 51 + * to/for the guest if the guest PMU supports at least "Architectural 52 + * Performance Monitoring Version 2". 53 + * 54 + * AMD's version of PERF_GLOBAL_CTRL conveniently shows up with v2. 55 + */ 56 + return pmu->version > 1; 57 + } 42 58 43 59 static inline u64 pmc_bitmask(struct kvm_pmc *pmc) 44 60 { ··· 175 161 static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops) 176 162 { 177 163 bool is_intel = boot_cpu_data.x86_vendor == X86_VENDOR_INTEL; 164 + int min_nr_gp_ctrs = pmu_ops->MIN_NR_GP_COUNTERS; 178 165 179 166 /* 180 167 * Hybrid PMUs don't play nice with virtualization without careful ··· 190 175 perf_get_x86_pmu_capability(&kvm_pmu_cap); 191 176 192 177 /* 193 - * For Intel, only support guest architectural pmu 194 - * on a host with architectural pmu. 178 + * WARN if perf did NOT disable hardware PMU if the number of 179 + * architecturally required GP counters aren't present, i.e. if 180 + * there are a non-zero number of counters, but fewer than what 181 + * is architecturally required. 
195 182 */ 196 - if ((is_intel && !kvm_pmu_cap.version) || 197 - !kvm_pmu_cap.num_counters_gp) 183 + if (!kvm_pmu_cap.num_counters_gp || 184 + WARN_ON_ONCE(kvm_pmu_cap.num_counters_gp < min_nr_gp_ctrs)) 185 + enable_pmu = false; 186 + else if (is_intel && !kvm_pmu_cap.version) 198 187 enable_pmu = false; 199 188 } 200 189 ··· 218 199 { 219 200 set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi); 220 201 kvm_make_request(KVM_REQ_PMU, pmc->vcpu); 202 + } 203 + 204 + static inline void reprogram_counters(struct kvm_pmu *pmu, u64 diff) 205 + { 206 + int bit; 207 + 208 + if (!diff) 209 + return; 210 + 211 + for_each_set_bit(bit, (unsigned long *)&diff, X86_PMC_IDX_MAX) 212 + set_bit(bit, pmu->reprogram_pmi); 213 + kvm_make_request(KVM_REQ_PMU, pmu_to_vcpu(pmu)); 214 + } 215 + 216 + /* 217 + * Check if a PMC is enabled by comparing it against global_ctrl bits. 218 + * 219 + * If the vPMU doesn't have global_ctrl MSR, all vPMCs are enabled. 220 + */ 221 + static inline bool pmc_is_globally_enabled(struct kvm_pmc *pmc) 222 + { 223 + struct kvm_pmu *pmu = pmc_to_pmu(pmc); 224 + 225 + if (!kvm_pmu_has_perf_global_ctrl(pmu)) 226 + return true; 227 + 228 + return test_bit(pmc->idx, (unsigned long *)&pmu->global_ctrl); 221 229 } 222 230 223 231 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu);
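
`reprogram_counters()` receives `diff = old_global_ctrl ^ new_global_ctrl` and marks every counter whose enable bit changed. The bit arithmetic can be shown in isolation. In this sketch `changed_counters()` and `collect_set_bits()` are hypothetical helpers, and the plain loop is only similar in spirit to the kernel's `for_each_set_bit()`.

```c
#include <assert.h>
#include <stdint.h>

/* Bits that differ between the old and new control values. */
static uint64_t changed_counters(uint64_t old_ctrl, uint64_t new_ctrl)
{
	return old_ctrl ^ new_ctrl;
}

/*
 * Store the index of each set bit in diff, lowest first, into out[] and
 * return how many were stored (at most max).
 */
static int collect_set_bits(uint64_t diff, int *out, int max)
{
	int n = 0;

	assert(out || max == 0);

	for (int bit = 0; bit < 64 && n < max; bit++)
		if (diff & (1ull << bit))
			out[n++] = bit;

	return n;
}
```

Writing the same value twice yields `diff == 0`, which corresponds to the early return at the top of `reprogram_counters()`.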

+7

arch/x86/kvm/reverse_cpuid.h

··· 15 15 CPUID_12_EAX = NCAPINTS, 16 16 CPUID_7_1_EDX, 17 17 CPUID_8000_0007_EDX, 18 + CPUID_8000_0022_EAX, 18 19 NR_KVM_CPU_CAPS, 19 20 20 21 NKVMCAPINTS = NR_KVM_CPU_CAPS - NCAPINTS, ··· 48 47 /* CPUID level 0x80000007 (EDX). */ 49 48 #define KVM_X86_FEATURE_CONSTANT_TSC KVM_X86_FEATURE(CPUID_8000_0007_EDX, 8) 50 49 50 + /* CPUID level 0x80000022 (EAX) */ 51 + #define KVM_X86_FEATURE_PERFMON_V2 KVM_X86_FEATURE(CPUID_8000_0022_EAX, 0) 52 + 51 53 struct cpuid_reg { 52 54 u32 function; 53 55 u32 index; ··· 78 74 [CPUID_7_1_EDX] = { 7, 1, CPUID_EDX}, 79 75 [CPUID_8000_0007_EDX] = {0x80000007, 0, CPUID_EDX}, 80 76 [CPUID_8000_0021_EAX] = {0x80000021, 0, CPUID_EAX}, 77 + [CPUID_8000_0022_EAX] = {0x80000022, 0, CPUID_EAX}, 81 78 }; 82 79 83 80 /* ··· 113 108 return KVM_X86_FEATURE_SGX_EDECCSSA; 114 109 else if (x86_feature == X86_FEATURE_CONSTANT_TSC) 115 110 return KVM_X86_FEATURE_CONSTANT_TSC; 111 + else if (x86_feature == X86_FEATURE_PERFMON_V2) 112 + return KVM_X86_FEATURE_PERFMON_V2; 116 113 117 114 return x86_feature; 118 115 }

+49 -19

arch/x86/kvm/svm/pmu.c

··· 78 78 return true; 79 79 } 80 80 81 - /* check if a PMC is enabled by comparing it against global_ctrl bits. Because 82 - * AMD CPU doesn't have global_ctrl MSR, all PMCs are enabled (return TRUE). 83 - */ 84 - static bool amd_pmc_is_enabled(struct kvm_pmc *pmc) 85 - { 86 - return true; 87 - } 88 - 89 81 static bool amd_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx) 90 82 { 91 83 struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); ··· 94 102 return amd_pmc_idx_to_pmc(vcpu_to_pmu(vcpu), idx & ~(3u << 30)); 95 103 } 96 104 97 - static bool amd_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) 98 - { 99 - /* All MSRs refer to exactly one PMC, so msr_idx_to_pmc is enough. */ 100 - return false; 101 - } 102 - 103 105 static struct kvm_pmc *amd_msr_idx_to_pmc(struct kvm_vcpu *vcpu, u32 msr) 104 106 { 105 107 struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); ··· 103 117 pmc = pmc ? pmc : get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL); 104 118 105 119 return pmc; 120 + } 121 + 122 + static bool amd_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) 123 + { 124 + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); 125 + 126 + switch (msr) { 127 + case MSR_K7_EVNTSEL0 ... MSR_K7_PERFCTR3: 128 + return pmu->version > 0; 129 + case MSR_F15H_PERF_CTL0 ... 
MSR_F15H_PERF_CTR5: 130 + return guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE); 131 + case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS: 132 + case MSR_AMD64_PERF_CNTR_GLOBAL_CTL: 133 + case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR: 134 + return pmu->version > 1; 135 + default: 136 + if (msr > MSR_F15H_PERF_CTR5 && 137 + msr < MSR_F15H_PERF_CTL0 + 2 * pmu->nr_arch_gp_counters) 138 + return pmu->version > 1; 139 + break; 140 + } 141 + 142 + return amd_msr_idx_to_pmc(vcpu, msr); 106 143 } 107 144 108 145 static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) ··· 181 172 static void amd_pmu_refresh(struct kvm_vcpu *vcpu) 182 173 { 183 174 struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); 175 + union cpuid_0x80000022_ebx ebx; 184 176 185 - if (guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE)) 177 + pmu->version = 1; 178 + if (guest_cpuid_has(vcpu, X86_FEATURE_PERFMON_V2)) { 179 + pmu->version = 2; 180 + /* 181 + * Note, PERFMON_V2 is also in 0x80000022.0x0, i.e. the guest 182 + * CPUID entry is guaranteed to be non-NULL. 
183 + */ 184 + BUILD_BUG_ON(x86_feature_cpuid(X86_FEATURE_PERFMON_V2).function != 0x80000022 || 185 + x86_feature_cpuid(X86_FEATURE_PERFMON_V2).index); 186 + ebx.full = kvm_find_cpuid_entry_index(vcpu, 0x80000022, 0)->ebx; 187 + pmu->nr_arch_gp_counters = ebx.split.num_core_pmc; 188 + } else if (guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE)) { 186 189 pmu->nr_arch_gp_counters = AMD64_NUM_COUNTERS_CORE; 187 - else 190 + } else { 188 191 pmu->nr_arch_gp_counters = AMD64_NUM_COUNTERS; 192 + } 193 + 194 + pmu->nr_arch_gp_counters = min_t(unsigned int, pmu->nr_arch_gp_counters, 195 + kvm_pmu_cap.num_counters_gp); 196 + 197 + if (pmu->version > 1) { 198 + pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1); 199 + pmu->global_status_mask = pmu->global_ctrl_mask; 200 + } 189 201 190 202 pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1; 191 203 pmu->reserved_bits = 0xfffffff000280000ull; 192 204 pmu->raw_event_mask = AMD64_RAW_EVENT_MASK; 193 - pmu->version = 1; 194 205 /* not applicable to AMD; but clean them to prevent any fall out */ 195 206 pmu->counter_bitmask[KVM_PMC_FIXED] = 0; 196 207 pmu->nr_arch_fixed_counters = 0; 197 - pmu->global_status = 0; 198 208 bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters); 199 209 } 200 210 ··· 244 216 pmc_stop_counter(pmc); 245 217 pmc->counter = pmc->prev_counter = pmc->eventsel = 0; 246 218 } 219 + 220 + pmu->global_ctrl = pmu->global_status = 0; 247 221 } 248 222 249 223 struct kvm_pmu_ops amd_pmu_ops __initdata = { 250 224 .hw_event_available = amd_hw_event_available, 251 - .pmc_is_enabled = amd_pmc_is_enabled, 252 225 .pmc_idx_to_pmc = amd_pmc_idx_to_pmc, 253 226 .rdpmc_ecx_to_pmc = amd_rdpmc_ecx_to_pmc, 254 227 .msr_idx_to_pmc = amd_msr_idx_to_pmc, ··· 262 233 .reset = amd_pmu_reset, 263 234 .EVENTSEL_EVENT = AMD64_EVENTSEL_EVENT, 264 235 .MAX_NR_GP_COUNTERS = KVM_AMD_PMC_MAX_GENERIC, 236 + .MIN_NR_GP_COUNTERS = AMD64_NUM_COUNTERS, 265 237 };
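
`amd_pmu_refresh()` derives the reserved-bit mask for the global control and status MSRs from the number of general-purpose counters, and the common `kvm_pmu_set_msr()` path then treats reserved bits differently per vendor: the AMD case clears them from the data, while the Intel case rejects the write. A small sketch of both behaviors, assuming hypothetical helper names throughout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reserved-bit mask for a PMU with nr_gp general-purpose counters. */
static uint64_t global_ctrl_reserved_mask(unsigned int nr_gp)
{
	assert(nr_gp < 64); /* shifting a 64-bit value by 64 is undefined */
	return ~((1ull << nr_gp) - 1);
}

/* AMD-style write: reserved bits are silently dropped from the data. */
static uint64_t amd_sanitize_global_ctl(uint64_t data, uint64_t reserved)
{
	return data & ~reserved;
}

/* Intel-style write: any reserved bit makes the write invalid. */
static bool intel_global_ctrl_valid(uint64_t data, uint64_t reserved)
{
	return !(data & reserved);
}
```

With six counters only bits 0 through 5 are writable, so a guest write of `0xff` is stored as `0x3f` in the AMD-style path and refused in the Intel-style path.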

+11 -8

arch/x86/kvm/svm/sev.c

··· 2216 2216 } 2217 2217 2218 2218 sev_asid_count = max_sev_asid - min_sev_asid + 1; 2219 - if (misc_cg_set_capacity(MISC_CG_RES_SEV, sev_asid_count)) 2220 - goto out; 2221 - 2222 - pr_info("SEV supported: %u ASIDs\n", sev_asid_count); 2219 + WARN_ON_ONCE(misc_cg_set_capacity(MISC_CG_RES_SEV, sev_asid_count)); 2223 2220 sev_supported = true; 2224 2221 2225 2222 /* SEV-ES support requested? */ ··· 2241 2244 goto out; 2242 2245 2243 2246 sev_es_asid_count = min_sev_asid - 1; 2244 - if (misc_cg_set_capacity(MISC_CG_RES_SEV_ES, sev_es_asid_count)) 2245 - goto out; 2246 - 2247 - pr_info("SEV-ES supported: %u ASIDs\n", sev_es_asid_count); 2247 + WARN_ON_ONCE(misc_cg_set_capacity(MISC_CG_RES_SEV_ES, sev_es_asid_count)); 2248 2248 sev_es_supported = true; 2249 2249 2250 2250 out: 2251 + if (boot_cpu_has(X86_FEATURE_SEV)) 2252 + pr_info("SEV %s (ASIDs %u - %u)\n", 2253 + sev_supported ? "enabled" : "disabled", 2254 + min_sev_asid, max_sev_asid); 2255 + if (boot_cpu_has(X86_FEATURE_SEV_ES)) 2256 + pr_info("SEV-ES %s (ASIDs %u - %u)\n", 2257 + sev_es_supported ? "enabled" : "disabled", 2258 + min_sev_asid > 1 ? 1 : 0, min_sev_asid - 1); 2259 + 2251 2260 sev_enabled = sev_supported; 2252 2261 sev_es_enabled = sev_es_supported; 2253 2262 #endif

+23 -33

arch/x86/kvm/svm/svm.c

··· 244 244 245 245 static unsigned long iopm_base; 246 246 247 - struct kvm_ldttss_desc { 248 - u16 limit0; 249 - u16 base0; 250 - unsigned base1:8, type:5, dpl:2, p:1; 251 - unsigned limit1:4, zero0:3, g:1, base2:8; 252 - u32 base3; 253 - u32 zero1; 254 - } __attribute__((packed)); 255 - 256 247 DEFINE_PER_CPU(struct svm_cpu_data, svm_data); 257 248 258 249 /* ··· 579 588 580 589 struct svm_cpu_data *sd; 581 590 uint64_t efer; 582 - struct desc_struct *gdt; 583 591 int me = raw_smp_processor_id(); 584 592 585 593 rdmsrl(MSR_EFER, efer); ··· 590 600 sd->max_asid = cpuid_ebx(SVM_CPUID_FUNC) - 1; 591 601 sd->next_asid = sd->max_asid + 1; 592 602 sd->min_asid = max_sev_asid + 1; 593 - 594 - gdt = get_current_gdt_rw(); 595 - sd->tss_desc = (struct kvm_ldttss_desc *)(gdt + GDT_ENTRY_TSS); 596 603 597 604 wrmsrl(MSR_EFER, efer | EFER_SVME); 598 605 ··· 739 752 740 753 BUG_ON(offset == MSR_INVALID); 741 754 742 - return !!test_bit(bit_write, &tmp); 755 + return test_bit(bit_write, &tmp); 743 756 } 744 757 745 758 static void set_msr_interception_bitmap(struct kvm_vcpu *vcpu, u32 *msrpm, ··· 2926 2939 2927 2940 break; 2928 2941 case MSR_IA32_CR_PAT: 2929 - if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data)) 2930 - return 1; 2931 - vcpu->arch.pat = data; 2942 + ret = kvm_set_msr_common(vcpu, msr); 2943 + if (ret) 2944 + break; 2945 + 2932 2946 svm->vmcb01.ptr->save.g_pat = data; 2933 2947 if (is_guest_mode(vcpu)) 2934 2948 nested_vmcb02_compute_g_pat(svm); ··· 3406 3418 struct kvm_run *kvm_run = vcpu->run; 3407 3419 u32 exit_code = svm->vmcb->control.exit_code; 3408 3420 3409 - trace_kvm_exit(vcpu, KVM_ISA_SVM); 3410 - 3411 3421 /* SEV-ES guests must use the CR write traps to track CR registers. 
*/ 3412 3422 if (!sev_es_guest(vcpu->kvm)) { 3413 3423 if (!svm_is_intercept(svm, INTERCEPT_CR0_WRITE)) ··· 3441 3455 return 1; 3442 3456 3443 3457 return svm_invoke_exit_handler(vcpu, exit_code); 3444 - } 3445 - 3446 - static void reload_tss(struct kvm_vcpu *vcpu) 3447 - { 3448 - struct svm_cpu_data *sd = per_cpu_ptr(&svm_data, vcpu->cpu); 3449 - 3450 - sd->tss_desc->type = 9; /* available 32/64-bit TSS */ 3451 - load_TR_desc(); 3452 3458 } 3453 3459 3454 3460 static void pre_svm_run(struct kvm_vcpu *vcpu) ··· 4077 4099 4078 4100 svm_vcpu_enter_exit(vcpu, spec_ctrl_intercepted); 4079 4101 4080 - if (!sev_es_guest(vcpu->kvm)) 4081 - reload_tss(vcpu); 4082 - 4083 4102 if (!static_cpu_has(X86_FEATURE_V_SPEC_CTRL)) 4084 4103 x86_spec_ctrl_restore_host(svm->virt_spec_ctrl); 4085 4104 ··· 4130 4155 if (unlikely(svm->vmcb->control.exit_code == 4131 4156 SVM_EXIT_EXCP_BASE + MC_VECTOR)) 4132 4157 svm_handle_mce(vcpu); 4158 + 4159 + trace_kvm_exit(vcpu, KVM_ISA_SVM); 4133 4160 4134 4161 svm_complete_interrupts(vcpu); 4135 4162 ··· 5002 5025 boot_cpu_has(X86_FEATURE_AMD_SSBD)) 5003 5026 kvm_cpu_cap_set(X86_FEATURE_VIRT_SSBD); 5004 5027 5005 - /* AMD PMU PERFCTR_CORE CPUID */ 5006 - if (enable_pmu && boot_cpu_has(X86_FEATURE_PERFCTR_CORE)) 5007 - kvm_cpu_cap_set(X86_FEATURE_PERFCTR_CORE); 5028 + if (enable_pmu) { 5029 + /* 5030 + * Enumerate support for PERFCTR_CORE if and only if KVM has 5031 + * access to enough counters to virtualize "core" support, 5032 + * otherwise limit vPMU support to the legacy number of counters. 
5033 + */ 5034 + if (kvm_pmu_cap.num_counters_gp < AMD64_NUM_COUNTERS_CORE) 5035 + kvm_pmu_cap.num_counters_gp = min(AMD64_NUM_COUNTERS, 5036 + kvm_pmu_cap.num_counters_gp); 5037 + else 5038 + kvm_cpu_cap_check_and_set(X86_FEATURE_PERFCTR_CORE); 5039 + 5040 + if (kvm_pmu_cap.version != 2 || 5041 + !kvm_cpu_cap_has(X86_FEATURE_PERFCTR_CORE)) 5042 + kvm_cpu_cap_clear(X86_FEATURE_PERFMON_V2); 5043 + } 5008 5044 5009 5045 /* CPUID 0x8000001F (SME/SEV features) */ 5010 5046 sev_set_cpu_caps();

-1

arch/x86/kvm/svm/svm.h

···
 	u32 max_asid;
 	u32 next_asid;
 	u32 min_asid;
-	struct kvm_ldttss_desc *tss_desc;

 	struct page *save_area;
 	unsigned long save_area_pa;

+2 -2

arch/x86/kvm/vmx/capabilities.h

···

 static inline bool vmx_umip_emulated(void)
 {
-	return vmcs_config.cpu_based_2nd_exec_ctrl &
-		SECONDARY_EXEC_DESC;
+	return !boot_cpu_has(X86_FEATURE_UMIP) &&
+	       (vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_DESC);
 }

 static inline bool cpu_has_vmx_rdtscp(void)

+3 -4

arch/x86/kvm/vmx/nested.c

··· 2328 2328 * Preset *DT exiting when emulating UMIP, so that vmx_set_cr4() 2329 2329 * will not have to rewrite the controls just for this bit. 2330 2330 */ 2331 - if (!boot_cpu_has(X86_FEATURE_UMIP) && vmx_umip_emulated() && 2332 - (vmcs12->guest_cr4 & X86_CR4_UMIP)) 2331 + if (vmx_umip_emulated() && (vmcs12->guest_cr4 & X86_CR4_UMIP)) 2333 2332 exec_control |= SECONDARY_EXEC_DESC; 2334 2333 2335 2334 if (exec_control & SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) ··· 2648 2649 } 2649 2650 2650 2651 if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) && 2651 - intel_pmu_has_perf_global_ctrl(vcpu_to_pmu(vcpu)) && 2652 + kvm_pmu_has_perf_global_ctrl(vcpu_to_pmu(vcpu)) && 2652 2653 WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL, 2653 2654 vmcs12->guest_ia32_perf_global_ctrl))) { 2654 2655 *entry_failure_code = ENTRY_FAIL_DEFAULT; ··· 4523 4524 vcpu->arch.pat = vmcs12->host_ia32_pat; 4524 4525 } 4525 4526 if ((vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL) && 4526 - intel_pmu_has_perf_global_ctrl(vcpu_to_pmu(vcpu))) 4527 + kvm_pmu_has_perf_global_ctrl(vcpu_to_pmu(vcpu))) 4527 4528 WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL, 4528 4529 vmcs12->host_ia32_perf_global_ctrl)); 4529 4530

+12 -67

arch/x86/kvm/vmx/pmu_intel.c

··· 73 73 } 74 74 } 75 75 76 - static void reprogram_counters(struct kvm_pmu *pmu, u64 diff) 77 - { 78 - int bit; 79 - 80 - if (!diff) 81 - return; 82 - 83 - for_each_set_bit(bit, (unsigned long *)&diff, X86_PMC_IDX_MAX) 84 - set_bit(bit, pmu->reprogram_pmi); 85 - kvm_make_request(KVM_REQ_PMU, pmu_to_vcpu(pmu)); 86 - } 87 - 88 76 static bool intel_hw_event_available(struct kvm_pmc *pmc) 89 77 { 90 78 struct kvm_pmu *pmu = pmc_to_pmu(pmc); ··· 93 105 } 94 106 95 107 return true; 96 - } 97 - 98 - /* check if a PMC is enabled by comparing it with globl_ctrl bits. */ 99 - static bool intel_pmc_is_enabled(struct kvm_pmc *pmc) 100 - { 101 - struct kvm_pmu *pmu = pmc_to_pmu(pmc); 102 - 103 - if (!intel_pmu_has_perf_global_ctrl(pmu)) 104 - return true; 105 - 106 - return test_bit(pmc->idx, (unsigned long *)&pmu->global_ctrl); 107 108 } 108 109 109 110 static bool intel_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx) ··· 175 198 176 199 switch (msr) { 177 200 case MSR_CORE_PERF_FIXED_CTR_CTRL: 178 - case MSR_CORE_PERF_GLOBAL_STATUS: 179 - case MSR_CORE_PERF_GLOBAL_CTRL: 180 - case MSR_CORE_PERF_GLOBAL_OVF_CTRL: 181 - return intel_pmu_has_perf_global_ctrl(pmu); 182 - break; 201 + return kvm_pmu_has_perf_global_ctrl(pmu); 183 202 case MSR_IA32_PEBS_ENABLE: 184 203 ret = vcpu_get_perf_capabilities(vcpu) & PERF_CAP_PEBS_FORMAT; 185 204 break; ··· 325 352 case MSR_CORE_PERF_FIXED_CTR_CTRL: 326 353 msr_info->data = pmu->fixed_ctr_ctrl; 327 354 break; 328 - case MSR_CORE_PERF_GLOBAL_STATUS: 329 - msr_info->data = pmu->global_status; 330 - break; 331 - case MSR_CORE_PERF_GLOBAL_CTRL: 332 - msr_info->data = pmu->global_ctrl; 333 - break; 334 - case MSR_CORE_PERF_GLOBAL_OVF_CTRL: 335 - msr_info->data = 0; 336 - break; 337 355 case MSR_IA32_PEBS_ENABLE: 338 356 msr_info->data = pmu->pebs_enable; 339 357 break; ··· 374 410 if (pmu->fixed_ctr_ctrl != data) 375 411 reprogram_fixed_counters(pmu, data); 376 412 break; 377 - case MSR_CORE_PERF_GLOBAL_STATUS: 378 - if 
(!msr_info->host_initiated) 379 - return 1; /* RO MSR */ 380 - 381 - pmu->global_status = data; 382 - break; 383 - case MSR_CORE_PERF_GLOBAL_CTRL: 384 - if (!kvm_valid_perf_global_ctrl(pmu, data)) 385 - return 1; 386 - 387 - if (pmu->global_ctrl != data) { 388 - diff = pmu->global_ctrl ^ data; 389 - pmu->global_ctrl = data; 390 - reprogram_counters(pmu, diff); 391 - } 392 - break; 393 - case MSR_CORE_PERF_GLOBAL_OVF_CTRL: 394 - if (data & pmu->global_ovf_ctrl_mask) 395 - return 1; 396 - 397 - if (!msr_info->host_initiated) 398 - pmu->global_status &= ~data; 399 - break; 400 413 case MSR_IA32_PEBS_ENABLE: 401 414 if (data & pmu->pebs_enable_mask) 402 415 return 1; ··· 385 444 } 386 445 break; 387 446 case MSR_IA32_DS_AREA: 388 - if (msr_info->host_initiated && data && !guest_cpuid_has(vcpu, X86_FEATURE_DS)) 389 - return 1; 390 447 if (is_noncanonical_address(data, vcpu)) 391 448 return 1; 392 449 ··· 470 531 pmu->reserved_bits = 0xffffffff00200000ull; 471 532 pmu->raw_event_mask = X86_RAW_EVENT_MASK; 472 533 pmu->global_ctrl_mask = ~0ull; 473 - pmu->global_ovf_ctrl_mask = ~0ull; 534 + pmu->global_status_mask = ~0ull; 474 535 pmu->fixed_ctr_ctrl_mask = ~0ull; 475 536 pmu->pebs_enable_mask = ~0ull; 476 537 pmu->pebs_data_cfg_mask = ~0ull; ··· 524 585 counter_mask = ~(((1ull << pmu->nr_arch_gp_counters) - 1) | 525 586 (((1ull << pmu->nr_arch_fixed_counters) - 1) << INTEL_PMC_IDX_FIXED)); 526 587 pmu->global_ctrl_mask = counter_mask; 527 - pmu->global_ovf_ctrl_mask = pmu->global_ctrl_mask 588 + 589 + /* 590 + * GLOBAL_STATUS and GLOBAL_OVF_CONTROL (a.k.a. GLOBAL_STATUS_RESET) 591 + * share reserved bit definitions. The kernel just happens to use 592 + * OVF_CTRL for the names. 
593 + */ 594 + pmu->global_status_mask = pmu->global_ctrl_mask 528 595 & ~(MSR_CORE_PERF_GLOBAL_OVF_CTRL_OVF_BUF | 529 596 MSR_CORE_PERF_GLOBAL_OVF_CTRL_COND_CHGD); 530 597 if (vmx_pt_mode_is_host_guest()) 531 - pmu->global_ovf_ctrl_mask &= 598 + pmu->global_status_mask &= 532 599 ~MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI; 533 600 534 601 entry = kvm_find_cpuid_entry_index(vcpu, 7, 0); ··· 746 801 pmc = intel_pmc_idx_to_pmc(pmu, bit); 747 802 748 803 if (!pmc || !pmc_speculative_in_use(pmc) || 749 - !intel_pmc_is_enabled(pmc) || !pmc->perf_event) 804 + !pmc_is_globally_enabled(pmc) || !pmc->perf_event) 750 805 continue; 751 806 752 807 /* ··· 761 816 762 817 struct kvm_pmu_ops intel_pmu_ops __initdata = { 763 818 .hw_event_available = intel_hw_event_available, 764 - .pmc_is_enabled = intel_pmc_is_enabled, 765 819 .pmc_idx_to_pmc = intel_pmc_idx_to_pmc, 766 820 .rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc, 767 821 .msr_idx_to_pmc = intel_msr_idx_to_pmc, ··· 775 831 .cleanup = intel_pmu_cleanup, 776 832 .EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT, 777 833 .MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC, 834 + .MIN_NR_GP_COUNTERS = 1, 778 835 };

+9 -6

arch/x86/kvm/vmx/sgx.c

··· 357 357 358 358 static inline bool encls_leaf_enabled_in_guest(struct kvm_vcpu *vcpu, u32 leaf) 359 359 { 360 - if (!enable_sgx || !guest_cpuid_has(vcpu, X86_FEATURE_SGX)) 361 - return false; 362 - 360 + /* 361 + * ENCLS generates a #UD if SGX1 isn't supported, i.e. this point will 362 + * be reached if and only if the SGX1 leafs are enabled. 363 + */ 363 364 if (leaf >= ECREATE && leaf <= ETRACK) 364 - return guest_cpuid_has(vcpu, X86_FEATURE_SGX1); 365 + return true; 365 366 366 367 if (leaf >= EAUG && leaf <= EMODT) 367 368 return guest_cpuid_has(vcpu, X86_FEATURE_SGX2); ··· 381 380 { 382 381 u32 leaf = (u32)kvm_rax_read(vcpu); 383 382 384 - if (!encls_leaf_enabled_in_guest(vcpu, leaf)) { 383 + if (!enable_sgx || !guest_cpuid_has(vcpu, X86_FEATURE_SGX) || 384 + !guest_cpuid_has(vcpu, X86_FEATURE_SGX1)) { 385 385 kvm_queue_exception(vcpu, UD_VECTOR); 386 - } else if (!sgx_enabled_in_guest_bios(vcpu)) { 386 + } else if (!encls_leaf_enabled_in_guest(vcpu, leaf) || 387 + !sgx_enabled_in_guest_bios(vcpu) || !is_paging(vcpu)) { 387 388 kvm_inject_gp(vcpu, 0); 388 389 } else { 389 390 if (leaf == ECREATE)
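
The reworked `handle_encls()` checks establish a fault priority: `#UD` when SGX or SGX1 is unavailable to the guest, otherwise `#GP` when the leaf is not enabled, SGX is not enabled by the guest BIOS, or paging is off, and only then emulation. The decision order can be sketched as a pure function. `struct encls_ctx` and `classify_encls()` are hypothetical, with each field standing in for the predicate named in its comment.

```c
#include <assert.h>
#include <stdbool.h>

enum encls_outcome { ENCLS_UD, ENCLS_GP, ENCLS_HANDLE };

/* Hypothetical flattened view of the state consulted by handle_encls(). */
struct encls_ctx {
	bool sgx_enabled;   /* enable_sgx and guest CPUID has SGX and SGX1 */
	bool leaf_enabled;  /* encls_leaf_enabled_in_guest() */
	bool bios_enabled;  /* sgx_enabled_in_guest_bios() */
	bool paging;        /* is_paging() */
};

/* #UD takes priority over #GP; emulation happens only if both pass. */
static enum encls_outcome classify_encls(const struct encls_ctx *c)
{
	assert(c);

	if (!c->sgx_enabled)
		return ENCLS_UD;
	if (!c->leaf_enabled || !c->bios_enabled || !c->paging)
		return ENCLS_GP;
	return ENCLS_HANDLE;
}
```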

+1 -1

arch/x86/kvm/vmx/vmenter.S

···
 	_ASM_EXTABLE(.Lvmresume, .Lfixup)
 	_ASM_EXTABLE(.Lvmlaunch, .Lfixup)

-SYM_INNER_LABEL(vmx_vmexit, SYM_L_GLOBAL)
+SYM_INNER_LABEL_ALIGN(vmx_vmexit, SYM_L_GLOBAL)

 	/* Restore unwind state from before the VMRESUME/VMLAUNCH. */
 	UNWIND_HINT_RESTORE

+60 -17

arch/x86/kvm/vmx/vmx.c

··· 2287 2287 return 1; 2288 2288 goto find_uret_msr; 2289 2289 case MSR_IA32_CR_PAT: 2290 - if (!kvm_pat_valid(data)) 2291 - return 1; 2290 + ret = kvm_set_msr_common(vcpu, msr_info); 2291 + if (ret) 2292 + break; 2292 2293 2293 2294 if (is_guest_mode(vcpu) && 2294 2295 get_vmcs12(vcpu)->vm_exit_controls & VM_EXIT_SAVE_IA32_PAT) 2295 2296 get_vmcs12(vcpu)->guest_ia32_pat = data; 2296 2297 2297 - if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) { 2298 + if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) 2298 2299 vmcs_write64(GUEST_IA32_PAT, data); 2299 - vcpu->arch.pat = data; 2300 - break; 2301 - } 2302 - ret = kvm_set_msr_common(vcpu, msr_info); 2303 2300 break; 2304 2301 case MSR_IA32_MCG_EXT_CTL: 2305 2302 if ((!msr_info->host_initiated && ··· 3384 3387 3385 3388 void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) 3386 3389 { 3387 - unsigned long old_cr4 = vcpu->arch.cr4; 3390 + unsigned long old_cr4 = kvm_read_cr4(vcpu); 3388 3391 struct vcpu_vmx *vmx = to_vmx(vcpu); 3392 + unsigned long hw_cr4; 3393 + 3389 3394 /* 3390 3395 * Pass through host's Machine Check Enable value to hw_cr4, which 3391 3396 * is in force while we are in guest mode. Do not let guests control 3392 3397 * this bit, even if host CR4.MCE == 0. 
3393 3398 */ 3394 - unsigned long hw_cr4; 3395 - 3396 3399 hw_cr4 = (cr4_read_shadow() & X86_CR4_MCE) | (cr4 & ~X86_CR4_MCE); 3397 3400 if (is_unrestricted_guest(vcpu)) 3398 3401 hw_cr4 |= KVM_VM_CR4_ALWAYS_ON_UNRESTRICTED_GUEST; ··· 3401 3404 else 3402 3405 hw_cr4 |= KVM_PMODE_VM_CR4_ALWAYS_ON; 3403 3406 3404 - if (!boot_cpu_has(X86_FEATURE_UMIP) && vmx_umip_emulated()) { 3407 + if (vmx_umip_emulated()) { 3405 3408 if (cr4 & X86_CR4_UMIP) { 3406 3409 secondary_exec_controls_setbit(vmx, SECONDARY_EXEC_DESC); 3407 3410 hw_cr4 &= ~X86_CR4_UMIP; ··· 5399 5402 5400 5403 static int handle_desc(struct kvm_vcpu *vcpu) 5401 5404 { 5402 - WARN_ON(!(vcpu->arch.cr4 & X86_CR4_UMIP)); 5405 + /* 5406 + * UMIP emulation relies on intercepting writes to CR4.UMIP, i.e. this 5407 + * and other code needs to be updated if UMIP can be guest owned. 5408 + */ 5409 + BUILD_BUG_ON(KVM_POSSIBLE_CR4_GUEST_BITS & X86_CR4_UMIP); 5410 + 5411 + WARN_ON_ONCE(!kvm_is_cr4_bit_set(vcpu, X86_CR4_UMIP)); 5403 5412 return kvm_emulate_instruction(vcpu, 0); 5404 5413 } 5405 5414 ··· 6711 6708 6712 6709 static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu) 6713 6710 { 6714 - struct page *page; 6711 + const gfn_t gfn = APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT; 6712 + struct kvm *kvm = vcpu->kvm; 6713 + struct kvm_memslots *slots = kvm_memslots(kvm); 6714 + struct kvm_memory_slot *slot; 6715 + unsigned long mmu_seq; 6716 + kvm_pfn_t pfn; 6715 6717 6716 6718 /* Defer reload until vmcs01 is the current VMCS. */ 6717 6719 if (is_guest_mode(vcpu)) { ··· 6728 6720 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) 6729 6721 return; 6730 6722 6731 - page = gfn_to_page(vcpu->kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT); 6732 - if (is_error_page(page)) 6723 + /* 6724 + * Grab the memslot so that the hva lookup for the mmu_notifier retry 6725 + * is guaranteed to use the same memslot as the pfn lookup, i.e. 
rely 6726 + * on the pfn lookup's validation of the memslot to ensure a valid hva 6727 + * is used for the retry check. 6728 + */ 6729 + slot = id_to_memslot(slots, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT); 6730 + if (!slot || slot->flags & KVM_MEMSLOT_INVALID) 6733 6731 return; 6734 6732 6735 - vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page)); 6733 + /* 6734 + * Ensure that the mmu_notifier sequence count is read before KVM 6735 + * retrieves the pfn from the primary MMU. Note, the memslot is 6736 + * protected by SRCU, not the mmu_notifier. Pairs with the smp_wmb() 6737 + * in kvm_mmu_invalidate_end(). 6738 + */ 6739 + mmu_seq = kvm->mmu_invalidate_seq; 6740 + smp_rmb(); 6741 + 6742 + /* 6743 + * No need to retry if the memslot does not exist or is invalid. KVM 6744 + * controls the APIC-access page memslot, and only deletes the memslot 6745 + * if APICv is permanently inhibited, i.e. the memslot won't reappear. 6746 + */ 6747 + pfn = gfn_to_pfn_memslot(slot, gfn); 6748 + if (is_error_noslot_pfn(pfn)) 6749 + return; 6750 + 6751 + read_lock(&vcpu->kvm->mmu_lock); 6752 + if (mmu_invalidate_retry_hva(kvm, mmu_seq, 6753 + gfn_to_hva_memslot(slot, gfn))) { 6754 + kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu); 6755 + read_unlock(&vcpu->kvm->mmu_lock); 6756 + goto out; 6757 + } 6758 + 6759 + vmcs_write64(APIC_ACCESS_ADDR, pfn_to_hpa(pfn)); 6760 + read_unlock(&vcpu->kvm->mmu_lock); 6761 + 6736 6762 vmx_flush_tlb_current(vcpu); 6737 6763 6764 + out: 6738 6765 /* 6739 6766 * Do not pin apic access page in memory, the MMU notifier 6740 6767 * will call us again if it is migrated or swapped out. 6741 6768 */ 6742 - put_page(page); 6769 + kvm_release_pfn_clean(pfn); 6743 6770 } 6744 6771 6745 6772 static void vmx_hwapic_isr_update(int max_isr)
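The `hw_cr4` computation in `vmx_set_cr4()` takes CR4.MCE from the host and every other bit from the guest-requested value. A minimal standalone sketch of that merge follows. `merge_host_mce()` is a hypothetical name, and the bit position is written out on the assumption that CR4.MCE is bit 6.

```c
/* CR4.MCE (Machine Check Enable); assumed to be bit 6 as on x86. */
#define X86_CR4_MCE (1ul << 6)

/*
 * Hypothetical helper: keep the host's MCE setting, take every other
 * bit from the value the guest asked for.
 */
static unsigned long merge_host_mce(unsigned long host_cr4,
				    unsigned long guest_cr4)
{
	return (host_cr4 & X86_CR4_MCE) | (guest_cr4 & ~X86_CR4_MCE);
}
```

The guest's own MCE bit is masked out in both directions, so a guest can neither set nor clear the value that is in force while it runs.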

-12

arch/x86/kvm/vmx/vmx.h

··· 93 93 u32 full; 94 94 }; 95 95 96 - static inline bool intel_pmu_has_perf_global_ctrl(struct kvm_pmu *pmu) 97 - { 98 - /* 99 - * Architecturally, Intel's SDM states that IA32_PERF_GLOBAL_CTRL is 100 - * supported if "CPUID.0AH: EAX[7:0] > 0", i.e. if the PMU version is 101 - * greater than zero. However, KVM only exposes and emulates the MSR 102 - * to/for the guest if the guest PMU supports at least "Architectural 103 - * Performance Monitoring Version 2". 104 - */ 105 - return pmu->version > 1; 106 - } 107 - 108 96 struct lbr_desc { 109 97 /* Basic info about guest LBR records. */ 110 98 struct x86_pmu_lbr records;

+40 -40

arch/x86/kvm/x86.c

··· 1017 1017 wrmsrl(MSR_IA32_XSS, vcpu->arch.ia32_xss); 1018 1018 } 1019 1019 1020 - #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS 1021 - if (static_cpu_has(X86_FEATURE_PKU) && 1020 + if (cpu_feature_enabled(X86_FEATURE_PKU) && 1022 1021 vcpu->arch.pkru != vcpu->arch.host_pkru && 1023 1022 ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) || 1024 1023 kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) 1025 1024 write_pkru(vcpu->arch.pkru); 1026 - #endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */ 1027 1025 } 1028 1026 EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state); 1029 1027 ··· 1030 1032 if (vcpu->arch.guest_state_protected) 1031 1033 return; 1032 1034 1033 - #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS 1034 - if (static_cpu_has(X86_FEATURE_PKU) && 1035 + if (cpu_feature_enabled(X86_FEATURE_PKU) && 1035 1036 ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) || 1036 1037 kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) { 1037 1038 vcpu->arch.pkru = rdpkru(); 1038 1039 if (vcpu->arch.pkru != vcpu->arch.host_pkru) 1039 1040 write_pkru(vcpu->arch.host_pkru); 1040 1041 } 1041 - #endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */ 1042 1042 1043 1043 if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) { 1044 1044 ··· 1423 1427 EXPORT_SYMBOL_GPL(kvm_emulate_rdpmc); 1424 1428 1425 1429 /* 1426 - * List of msr numbers which we expose to userspace through KVM_GET_MSRS 1427 - * and KVM_SET_MSRS, and KVM_GET_MSR_INDEX_LIST. 1428 - * 1429 - * The three MSR lists(msrs_to_save, emulated_msrs, msr_based_features) 1430 - * extract the supported MSRs from the related const lists. 1431 - * msrs_to_save is selected from the msrs_to_save_all to reflect the 1432 - * capabilities of the host cpu. This capabilities test skips MSRs that are 1433 - * kvm-specific. Those are put in emulated_msrs_all; filtering of emulated_msrs 1434 - * may depend on host virtualization features rather than host cpu features. 
1430 + * The three MSR lists(msrs_to_save, emulated_msrs, msr_based_features) track 1431 + * the set of MSRs that KVM exposes to userspace through KVM_GET_MSRS, 1432 + * KVM_SET_MSRS, and KVM_GET_MSR_INDEX_LIST. msrs_to_save holds MSRs that 1433 + * require host support, i.e. should be probed via RDMSR. emulated_msrs holds 1434 + * MSRs that KVM emulates without strictly requiring host support. 1435 + * msr_based_features holds MSRs that enumerate features, i.e. are effectively 1436 + * CPUID leafs. Note, msr_based_features isn't mutually exclusive with 1437 + * msrs_to_save and emulated_msrs. 1435 1438 */ 1436 1439 1437 1440 static const u32 msrs_to_save_base[] = { ··· 1478 1483 MSR_F15H_PERF_CTL3, MSR_F15H_PERF_CTL4, MSR_F15H_PERF_CTL5, 1479 1484 MSR_F15H_PERF_CTR0, MSR_F15H_PERF_CTR1, MSR_F15H_PERF_CTR2, 1480 1485 MSR_F15H_PERF_CTR3, MSR_F15H_PERF_CTR4, MSR_F15H_PERF_CTR5, 1486 + 1487 + MSR_AMD64_PERF_CNTR_GLOBAL_CTL, 1488 + MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, 1489 + MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, 1481 1490 }; 1482 1491 1483 1492 static u32 msrs_to_save[ARRAY_SIZE(msrs_to_save_base) + ··· 1530 1531 MSR_IA32_UCODE_REV, 1531 1532 1532 1533 /* 1533 - * The following list leaves out MSRs whose values are determined 1534 - * by arch/x86/kvm/vmx/nested.c based on CPUID or other MSRs. 1535 - * We always support the "true" VMX control MSRs, even if the host 1536 - * processor does not, so I am putting these registers here rather 1537 - * than in msrs_to_save_all. 1534 + * KVM always supports the "true" VMX control MSRs, even if the host 1535 + * does not. The VMX MSRs as a whole are considered "emulated" as KVM 1536 + * doesn't strictly require them to exist in the host (ignoring that 1537 + * KVM would refuse to load in the first place if the core set of MSRs 1538 + * aren't supported). 
1538 1539 */ 1539 1540 MSR_IA32_VMX_BASIC, 1540 1541 MSR_IA32_VMX_TRUE_PINBASED_CTLS, ··· 1630 1631 * If we're doing cache flushes (either "always" or "cond") 1631 1632 * we will do one whenever the guest does a vmlaunch/vmresume. 1632 1633 * If an outer hypervisor is doing the cache flush for us 1633 - * (VMENTER_L1D_FLUSH_NESTED_VM), we can safely pass that 1634 + * (ARCH_CAP_SKIP_VMENTRY_L1DFLUSH), we can safely pass that 1634 1635 * capability to the guest too, and if EPT is disabled we're not 1635 1636 * vulnerable. Overall, only VMENTER_L1D_FLUSH_NEVER will 1636 1637 * require a nested hypervisor to do a flush of its own. ··· 1808 1809 unsigned long *bitmap = ranges[i].bitmap; 1809 1810 1810 1811 if ((index >= start) && (index < end) && (flags & type)) { 1811 - allowed = !!test_bit(index - start, bitmap); 1812 + allowed = test_bit(index - start, bitmap); 1812 1813 break; 1813 1814 } 1814 1815 } ··· 3700 3701 return 1; 3701 3702 } 3702 3703 break; 3703 - case 0x200 ... MSR_IA32_MC0_CTL2 - 1: 3704 - case MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) ... 0x2ff: 3704 + case MSR_IA32_CR_PAT: 3705 + if (!kvm_pat_valid(data)) 3706 + return 1; 3707 + 3708 + vcpu->arch.pat = data; 3709 + break; 3710 + case MTRRphysBase_MSR(0) ... MSR_MTRRfix4K_F8000: 3711 + case MSR_MTRRdefType: 3705 3712 return kvm_mtrr_set_msr(vcpu, msr, data); 3706 3713 case MSR_IA32_APICBASE: 3707 3714 return kvm_set_apic_base(vcpu, msr_info); ··· 4114 4109 msr_info->data = kvm_scale_tsc(rdtsc(), ratio) + offset; 4115 4110 break; 4116 4111 } 4112 + case MSR_IA32_CR_PAT: 4113 + msr_info->data = vcpu->arch.pat; 4114 + break; 4117 4115 case MSR_MTRRcap: 4118 - case 0x200 ... MSR_IA32_MC0_CTL2 - 1: 4119 - case MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) ... 0x2ff: 4116 + case MTRRphysBase_MSR(0) ... 
MSR_MTRRfix4K_F8000: 4117 + case MSR_MTRRdefType: 4120 4118 return kvm_mtrr_get_msr(vcpu, msr_info->index, &msr_info->data); 4121 4119 case 0xcd: /* fsb frequency */ 4122 4120 msr_info->data = 3; ··· 7155 7147 case MSR_ARCH_PERFMON_FIXED_CTR0 ... MSR_ARCH_PERFMON_FIXED_CTR_MAX: 7156 7148 if (msr_index - MSR_ARCH_PERFMON_FIXED_CTR0 >= 7157 7149 kvm_pmu_cap.num_counters_fixed) 7150 + return; 7151 + break; 7152 + case MSR_AMD64_PERF_CNTR_GLOBAL_CTL: 7153 + case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS: 7154 + case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR: 7155 + if (!kvm_cpu_cap_has(X86_FEATURE_PERFMON_V2)) 7158 7156 return; 7159 7157 break; 7160 7158 case MSR_IA32_XFD: ··· 10446 10432 10447 10433 static_call_cond(kvm_x86_load_eoi_exitmap)( 10448 10434 vcpu, (u64 *)vcpu->arch.ioapic_handled_vectors); 10449 - } 10450 - 10451 - void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, 10452 - unsigned long start, unsigned long end) 10453 - { 10454 - unsigned long apic_address; 10455 - 10456 - /* 10457 - * The physical address of apic access page is stored in the VMCS. 10458 - * Update it when it becomes invalid. 10459 - */ 10460 - apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT); 10461 - if (start <= apic_address && apic_address < end) 10462 - kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD); 10463 10435 } 10464 10436 10465 10437 void kvm_arch_guest_memory_reclaimed(struct kvm *kvm)
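The x86.c hunk moves the `MSR_IA32_CR_PAT` write into common code behind a `kvm_pat_valid()` check, whose body is not part of this diff. The sketch below is not the kernel's implementation. It is a plain loop under the assumption that each of the eight PAT entries occupies one byte, that bits 7:3 of each byte are reserved, and that memory types 2 and 3 are reserved. `pat_value_is_valid()` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a PAT validity check: every byte must hold
 * one of the memory types 0, 1, 4, 5, 6 or 7.
 */
static bool pat_value_is_valid(uint64_t data)
{
	for (int i = 0; i < 8; i++) {
		uint8_t entry = (uint8_t)(data >> (i * 8));

		/* Values above 7 set reserved bits; types 2 and 3 are reserved. */
		if (entry > 7 || entry == 2 || entry == 3)
			return false;
	}
	return true;
}
```

A write that fails a check like this returns 1 from the MSR handler in the hunk above, before `vcpu->arch.pat` is updated.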

-1

arch/x86/kvm/x86.h

···
309 309
310 310   void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu);
311 311   u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn);
312     - bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data);
313 312   int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
314 313   int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
315 314   bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,

+1 -1

drivers/s390/char/Kconfig

···
 96  96   config S390_UV_UAPI
 97  97   	def_tristate m
 98  98   	prompt "Ultravisor userspace API"
 99     - 	depends on S390
     99 + 	depends on S390 && (KVM || PROTECTED_VIRTUALIZATION_GUEST)
100 100   	help
101 101   	  Selecting exposes parts of the UV interface to userspace
102 102   	  by providing a misc character device at /dev/uv.

+224 -7

drivers/s390/char/uvdevice.c

··· 32 32 #include <asm/uvdevice.h> 33 33 #include <asm/uv.h> 34 34 35 + #define BIT_UVIO_INTERNAL U32_MAX 36 + /* Mapping from IOCTL-nr to UVC-bit */ 37 + static const u32 ioctl_nr_to_uvc_bit[] __initconst = { 38 + [UVIO_IOCTL_UVDEV_INFO_NR] = BIT_UVIO_INTERNAL, 39 + [UVIO_IOCTL_ATT_NR] = BIT_UVC_CMD_RETR_ATTEST, 40 + [UVIO_IOCTL_ADD_SECRET_NR] = BIT_UVC_CMD_ADD_SECRET, 41 + [UVIO_IOCTL_LIST_SECRETS_NR] = BIT_UVC_CMD_LIST_SECRETS, 42 + [UVIO_IOCTL_LOCK_SECRETS_NR] = BIT_UVC_CMD_LOCK_SECRETS, 43 + }; 44 + 45 + static_assert(ARRAY_SIZE(ioctl_nr_to_uvc_bit) == UVIO_IOCTL_NUM_IOCTLS); 46 + 47 + static struct uvio_uvdev_info uvdev_info = { 48 + .supp_uvio_cmds = GENMASK_ULL(UVIO_IOCTL_NUM_IOCTLS - 1, 0), 49 + }; 50 + 51 + static void __init set_supp_uv_cmds(unsigned long *supp_uv_cmds) 52 + { 53 + int i; 54 + 55 + for (i = 0; i < UVIO_IOCTL_NUM_IOCTLS; i++) { 56 + if (ioctl_nr_to_uvc_bit[i] == BIT_UVIO_INTERNAL) 57 + continue; 58 + if (!test_bit_inv(ioctl_nr_to_uvc_bit[i], uv_info.inst_calls_list)) 59 + continue; 60 + __set_bit(i, supp_uv_cmds); 61 + } 62 + } 63 + 64 + /** 65 + * uvio_uvdev_info() - get information about the uvdevice 66 + * 67 + * @uv_ioctl: ioctl control block 68 + * 69 + * Lists all IOCTLs that are supported by this uvdevice 70 + */ 71 + static int uvio_uvdev_info(struct uvio_ioctl_cb *uv_ioctl) 72 + { 73 + void __user *user_buf_arg = (void __user *)uv_ioctl->argument_addr; 74 + 75 + if (uv_ioctl->argument_len < sizeof(uvdev_info)) 76 + return -EINVAL; 77 + if (copy_to_user(user_buf_arg, &uvdev_info, sizeof(uvdev_info))) 78 + return -EFAULT; 79 + 80 + uv_ioctl->uv_rc = UVC_RC_EXECUTED; 81 + return 0; 82 + } 83 + 35 84 static int uvio_build_uvcb_attest(struct uv_cb_attest *uvcb_attest, u8 *arcb, 36 85 u8 *meas, u8 *add_data, struct uvio_attest *uvio_attest) 37 86 { ··· 234 185 return ret; 235 186 } 236 187 237 - static int uvio_copy_and_check_ioctl(struct uvio_ioctl_cb *ioctl, void __user *argp) 188 + /** uvio_add_secret() - perform an Add Secret UVC 
189 + * 190 + * @uv_ioctl: ioctl control block 191 + * 192 + * uvio_add_secret() performs the Add Secret Ultravisor Call. 193 + * 194 + * The given userspace argument address and size are verified to be 195 + * valid but every other check is made by the Ultravisor 196 + * (UV). Therefore UV errors won't result in a negative return 197 + * value. The request is then copied to kernelspace, the UV-call is 198 + * performed and the results are copied back to userspace. 199 + * 200 + * The argument has to point to an Add Secret Request Control Block 201 + * which is an encrypted and cryptographically verified request that 202 + * inserts a protected guest's secrets into the Ultravisor for later 203 + * use. 204 + * 205 + * If the Add Secret UV facility is not present, UV will return 206 + * invalid command rc. This won't be fenced in the driver and does not 207 + * result in a negative return value. 208 + * 209 + * Context: might sleep 210 + * 211 + * Return: 0 on success or a negative error code on error. 
212 + */ 213 + static int uvio_add_secret(struct uvio_ioctl_cb *uv_ioctl) 238 214 { 215 + void __user *user_buf_arg = (void __user *)uv_ioctl->argument_addr; 216 + struct uv_cb_guest_addr uvcb = { 217 + .header.len = sizeof(uvcb), 218 + .header.cmd = UVC_CMD_ADD_SECRET, 219 + }; 220 + void *asrcb = NULL; 221 + int ret; 222 + 223 + if (uv_ioctl->argument_len > UVIO_ADD_SECRET_MAX_LEN) 224 + return -EINVAL; 225 + if (uv_ioctl->argument_len == 0) 226 + return -EINVAL; 227 + 228 + asrcb = kvzalloc(uv_ioctl->argument_len, GFP_KERNEL); 229 + if (!asrcb) 230 + return -ENOMEM; 231 + 232 + ret = -EFAULT; 233 + if (copy_from_user(asrcb, user_buf_arg, uv_ioctl->argument_len)) 234 + goto out; 235 + 236 + ret = 0; 237 + uvcb.addr = (u64)asrcb; 238 + uv_call_sched(0, (u64)&uvcb); 239 + uv_ioctl->uv_rc = uvcb.header.rc; 240 + uv_ioctl->uv_rrc = uvcb.header.rrc; 241 + 242 + out: 243 + kvfree(asrcb); 244 + return ret; 245 + } 246 + 247 + /** uvio_list_secrets() - perform a List Secret UVC 248 + * @uv_ioctl: ioctl control block 249 + * 250 + * uvio_list_secrets() performs the List Secret Ultravisor Call. It verifies 251 + * that the given userspace argument address is valid and its size is sane. 252 + * Every other check is made by the Ultravisor (UV) and won't result in a 253 + * negative return value. It builds the request, performs the UV-call, and 254 + * copies the result to userspace. 255 + * 256 + * The argument specifies the location for the result of the UV-Call. 257 + * 258 + * If the List Secrets UV facility is not present, UV will return invalid 259 + * command rc. This won't be fenced in the driver and does not result in a 260 + * negative return value. 261 + * 262 + * Context: might sleep 263 + * 264 + * Return: 0 on success or a negative error code on error. 
265 + */ 266 + static int uvio_list_secrets(struct uvio_ioctl_cb *uv_ioctl) 267 + { 268 + void __user *user_buf_arg = (void __user *)uv_ioctl->argument_addr; 269 + struct uv_cb_guest_addr uvcb = { 270 + .header.len = sizeof(uvcb), 271 + .header.cmd = UVC_CMD_LIST_SECRETS, 272 + }; 273 + void *secrets = NULL; 274 + int ret = 0; 275 + 276 + if (uv_ioctl->argument_len != UVIO_LIST_SECRETS_LEN) 277 + return -EINVAL; 278 + 279 + secrets = kvzalloc(UVIO_LIST_SECRETS_LEN, GFP_KERNEL); 280 + if (!secrets) 281 + return -ENOMEM; 282 + 283 + uvcb.addr = (u64)secrets; 284 + uv_call_sched(0, (u64)&uvcb); 285 + uv_ioctl->uv_rc = uvcb.header.rc; 286 + uv_ioctl->uv_rrc = uvcb.header.rrc; 287 + 288 + if (copy_to_user(user_buf_arg, secrets, UVIO_LIST_SECRETS_LEN)) 289 + ret = -EFAULT; 290 + 291 + kvfree(secrets); 292 + return ret; 293 + } 294 + 295 + /** uvio_lock_secrets() - perform a Lock Secret Store UVC 296 + * @uv_ioctl: ioctl control block 297 + * 298 + * uvio_lock_secrets() performs the Lock Secret Store Ultravisor Call. It 299 + * performs the UV-call and copies the return codes to the ioctl control block. 300 + * After this call was dispatched successfully every following Add Secret UVC 301 + * and Lock Secrets UVC will fail with return code 0x102. 302 + * 303 + * The argument address and size must be 0. 304 + * 305 + * If the Lock Secrets UV facility is not present, UV will return invalid 306 + * command rc. This won't be fenced in the driver and does not result in a 307 + * negative return value. 308 + * 309 + * Context: might sleep 310 + * 311 + * Return: 0 on success or a negative error code on error. 
312 + */ 313 + static int uvio_lock_secrets(struct uvio_ioctl_cb *ioctl) 314 + { 315 + struct uv_cb_nodata uvcb = { 316 + .header.len = sizeof(uvcb), 317 + .header.cmd = UVC_CMD_LOCK_SECRETS, 318 + }; 319 + 320 + if (ioctl->argument_addr || ioctl->argument_len) 321 + return -EINVAL; 322 + 323 + uv_call(0, (u64)&uvcb); 324 + ioctl->uv_rc = uvcb.header.rc; 325 + ioctl->uv_rrc = uvcb.header.rrc; 326 + 327 + return 0; 328 + } 329 + 330 + static int uvio_copy_and_check_ioctl(struct uvio_ioctl_cb *ioctl, void __user *argp, 331 + unsigned long cmd) 332 + { 333 + u8 nr = _IOC_NR(cmd); 334 + 335 + if (_IOC_DIR(cmd) != (_IOC_READ | _IOC_WRITE)) 336 + return -ENOIOCTLCMD; 337 + if (_IOC_TYPE(cmd) != UVIO_TYPE_UVC) 338 + return -ENOIOCTLCMD; 339 + if (nr >= UVIO_IOCTL_NUM_IOCTLS) 340 + return -ENOIOCTLCMD; 341 + if (_IOC_SIZE(cmd) != sizeof(*ioctl)) 342 + return -ENOIOCTLCMD; 239 343 if (copy_from_user(ioctl, argp, sizeof(*ioctl))) 240 344 return -EFAULT; 241 345 if (ioctl->flags != 0) ··· 396 194 if (memchr_inv(ioctl->reserved14, 0, sizeof(ioctl->reserved14))) 397 195 return -EINVAL; 398 196 399 - return 0; 197 + return nr; 400 198 } 401 199 402 200 /* ··· 407 205 void __user *argp = (void __user *)arg; 408 206 struct uvio_ioctl_cb uv_ioctl = { }; 409 207 long ret; 208 + int nr; 410 209 411 - switch (cmd) { 412 - case UVIO_IOCTL_ATT: 413 - ret = uvio_copy_and_check_ioctl(&uv_ioctl, argp); 414 - if (ret) 415 - return ret; 210 + nr = uvio_copy_and_check_ioctl(&uv_ioctl, argp, cmd); 211 + if (nr < 0) 212 + return nr; 213 + 214 + switch (nr) { 215 + case UVIO_IOCTL_UVDEV_INFO_NR: 216 + ret = uvio_uvdev_info(&uv_ioctl); 217 + break; 218 + case UVIO_IOCTL_ATT_NR: 416 219 ret = uvio_attestation(&uv_ioctl); 220 + break; 221 + case UVIO_IOCTL_ADD_SECRET_NR: 222 + ret = uvio_add_secret(&uv_ioctl); 223 + break; 224 + case UVIO_IOCTL_LIST_SECRETS_NR: 225 + ret = uvio_list_secrets(&uv_ioctl); 226 + break; 227 + case UVIO_IOCTL_LOCK_SECRETS_NR: 228 + ret = uvio_lock_secrets(&uv_ioctl); 417 
229 break; 418 230 default: 419 231 ret = -ENOIOCTLCMD; ··· 461 245 462 246 static int __init uvio_dev_init(void) 463 247 { 248 + set_supp_uv_cmds((unsigned long *)&uvdev_info.supp_uv_cmds); 464 249 return misc_register(&uvio_dev_miscdev); 465 250 } 466 251
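`set_supp_uv_cmds()` above walks a table that maps each ioctl number to an Ultravisor facility bit and records which ioctls the firmware can actually serve. The sketch below restates that loop without kernel helpers. The facility bit numbers are made up for illustration, the installed-facility list is reduced to a single 64-bit word with ordinary LSB-0 bit numbering (the driver uses `test_bit_inv()` on `uv_info.inst_calls_list` instead), and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SLOT_INTERNAL UINT32_MAX	/* ioctl served by the driver itself */

enum {
	NR_INFO, NR_ATTEST, NR_ADD_SECRET, NR_LIST_SECRETS, NR_LOCK_SECRETS,
	NR_COUNT,
};

/* Facility bit numbers are invented for this sketch. */
static const uint32_t nr_to_facility_bit[NR_COUNT] = {
	[NR_INFO]         = SLOT_INTERNAL,
	[NR_ATTEST]       = 3,
	[NR_ADD_SECRET]   = 4,
	[NR_LIST_SECRETS] = 5,
	[NR_LOCK_SECRETS] = 6,
};

/* Build a mask with bit i set when ioctl i is backed by an installed facility. */
static uint64_t supported_cmds(uint64_t installed)
{
	uint64_t mask = 0;

	for (size_t i = 0; i < NR_COUNT; i++) {
		uint32_t bit = nr_to_facility_bit[i];

		if (bit == SLOT_INTERNAL)
			continue;	/* no Ultravisor call behind this ioctl */
		if (!(installed & (UINT64_C(1) << bit)))
			continue;	/* facility not installed */
		mask |= UINT64_C(1) << i;
	}
	return mask;
}
```

As in the driver, the internal info ioctl never appears in this mask; the diff reports it through the separate `supp_uvio_cmds` field, which is initialized to cover every ioctl number.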

+6 -2

include/kvm/arm_pmu.h

···
 92  92   /*
 93  93    * Evaluates as true when emulating PMUv3p5, and false otherwise.
 94  94    */
 95     - #define kvm_pmu_is_3p5(vcpu) \
 96     - 	(vcpu->kvm->arch.dfr0_pmuver.imp >= ID_AA64DFR0_EL1_PMUVer_V3P5)
     95 + #define kvm_pmu_is_3p5(vcpu) ({ \
     96 + 	u64 val = IDREG(vcpu->kvm, SYS_ID_AA64DFR0_EL1); \
     97 + 	u8 pmuver = SYS_FIELD_GET(ID_AA64DFR0_EL1, PMUVer, val); \
     98 + 	\
     99 + 	pmuver >= ID_AA64DFR0_EL1_PMUVer_V3P5; \
    100 + })
 97 101
 98 102   u8 kvm_arm_pmu_get_pmuver_limit(void);
 99 103
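The new `kvm_pmu_is_3p5()` reads the PMU version out of the per-VM ID register value instead of a cached field. A rough standalone equivalent of the field extraction is sketched below. It assumes that PMUVer occupies bits [11:8] of ID_AA64DFR0_EL1 and that the PMUv3p5 encoding is 6. The macro and function names are hypothetical and do not use the kernel's `SYS_FIELD_GET()` machinery.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: PMUVer is a 4-bit field at bits [11:8]. */
#define PMUVER_SHIFT	8
#define PMUVER_MASK	UINT64_C(0xf)
#define PMUVER_V3P5	6	/* assumed encoding for PMUv3p5 */

static bool pmu_is_3p5(uint64_t id_aa64dfr0)
{
	uint8_t pmuver = (uint8_t)((id_aa64dfr0 >> PMUVER_SHIFT) & PMUVER_MASK);

	return pmuver >= PMUVER_V3P5;
}
```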

-6

include/kvm/iodev.h

···
 55  55   		: -EOPNOTSUPP;
 56  56   }
 57  57
 58     - static inline void kvm_iodevice_destructor(struct kvm_io_device *dev)
 59     - {
 60     - 	if (dev->ops->destructor)
 61     - 		dev->ops->destructor(dev);
 62     - }
 63     -
 64  58   #endif /* __KVM_IODEV_H__ */

+8

include/linux/arm_ffa.h

···
 94  94    */
 95  95   #define FFA_PAGE_SIZE SZ_4K
 96  96
     97 + /*
     98 +  * Minimum buffer size/alignment encodings returned by an FFA_FEATURES
     99 +  * query for FFA_RXTX_MAP.
    100 +  */
    101 + #define FFA_FEAT_RXTX_MIN_SZ_4K 0
    102 + #define FFA_FEAT_RXTX_MIN_SZ_64K 1
    103 + #define FFA_FEAT_RXTX_MIN_SZ_16K 2
    104 +
 97 105   /* FFA Bus/Device/Driver related */
 98 106   struct ffa_device {
 99 107   	u32 id;
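The three encodings added above are not in size order (0 is 4K, 1 is 64K, 2 is 16K), so a caller has to translate them explicitly rather than shift by the value. A small sketch of such a translation follows. The constants are redefined locally so the block stands alone, and `ffa_rxtx_min_size()` is a hypothetical helper, not a function from this series.

```c
#include <stddef.h>
#include <stdint.h>

/* Local copies of the encodings from the hunk above. */
#define FFA_FEAT_RXTX_MIN_SZ_4K		0
#define FFA_FEAT_RXTX_MIN_SZ_64K	1
#define FFA_FEAT_RXTX_MIN_SZ_16K	2

/* Hypothetical helper: encoding -> size in bytes, 0 if unrecognized. */
static size_t ffa_rxtx_min_size(uint32_t encoding)
{
	switch (encoding) {
	case FFA_FEAT_RXTX_MIN_SZ_4K:
		return 4 * 1024;
	case FFA_FEAT_RXTX_MIN_SZ_64K:
		return 64 * 1024;
	case FFA_FEAT_RXTX_MIN_SZ_16K:
		return 16 * 1024;
	default:
		return 0;
	}
}
```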

+4 -5

include/linux/kvm_host.h

··· 849 849 850 850 #define KVM_BUG(cond, kvm, fmt...) \ 851 851 ({ \ 852 - int __ret = (cond); \ 852 + bool __ret = !!(cond); \ 853 853 \ 854 854 if (WARN_ONCE(__ret && !(kvm)->vm_bugged, fmt)) \ 855 855 kvm_vm_bugged(kvm); \ ··· 858 858 859 859 #define KVM_BUG_ON(cond, kvm) \ 860 860 ({ \ 861 - int __ret = (cond); \ 861 + bool __ret = !!(cond); \ 862 862 \ 863 863 if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged)) \ 864 864 kvm_vm_bugged(kvm); \ ··· 990 990 { 991 991 return RB_EMPTY_ROOT(&slots->gfn_tree); 992 992 } 993 + 994 + bool kvm_are_all_memslots_empty(struct kvm *kvm); 993 995 994 996 #define kvm_for_each_memslot(memslot, bkt, slots) \ 995 997 hash_for_each(slots->id_hash, bkt, memslot, id_node[slots->node_idx]) \ ··· 2238 2236 return -ENOIOCTLCMD; 2239 2237 } 2240 2238 #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */ 2241 - 2242 - void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, 2243 - unsigned long start, unsigned long end); 2244 2239 2245 2240 void kvm_arch_guest_memory_reclaimed(struct kvm *kvm); 2246 2241
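The switch from `int __ret = (cond)` to `bool __ret = !!(cond)` in `KVM_BUG()` and `KVM_BUG_ON()` matters when the condition is a wide integer. One plausible motivation, sketched below, is that a 64-bit value with only upper bits set collapses to 0 when stored in an `int`. Both helper names are hypothetical, and the `int` result relies on the wrap-around conversion that GCC and Clang implement, which the C standard leaves implementation-defined.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old pattern: storing the condition in an int drops the upper 32 bits. */
static bool cond_via_int(uint64_t cond)
{
	int ret = (int)cond;	/* implementation-defined for large values */

	return ret != 0;
}

/* New pattern: normalize to 0/1 first, so no bits are lost. */
static bool cond_via_bool(uint64_t cond)
{
	bool ret = !!cond;

	return ret;
}
```

For a flag such as `UINT64_C(1) << 32`, the first helper reports false on typical compilers while the second reports true.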

+5 -1

include/uapi/linux/kvm.h

··· 1190 1190 #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225 1191 1191 #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226 1192 1192 #define KVM_CAP_COUNTER_OFFSET 227 1193 + #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228 1194 + #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229 1193 1195 1194 1196 #ifdef KVM_CAP_IRQ_ROUTING 1195 1197 ··· 1444 1442 #define KVM_DEV_TYPE_XIVE KVM_DEV_TYPE_XIVE 1445 1443 KVM_DEV_TYPE_ARM_PV_TIME, 1446 1444 #define KVM_DEV_TYPE_ARM_PV_TIME KVM_DEV_TYPE_ARM_PV_TIME 1445 + KVM_DEV_TYPE_RISCV_AIA, 1446 + #define KVM_DEV_TYPE_RISCV_AIA KVM_DEV_TYPE_RISCV_AIA 1447 1447 KVM_DEV_TYPE_MAX, 1448 1448 }; 1449 1449 ··· 1617 1613 #define KVM_GET_DEBUGREGS _IOR(KVMIO, 0xa1, struct kvm_debugregs) 1618 1614 #define KVM_SET_DEBUGREGS _IOW(KVMIO, 0xa2, struct kvm_debugregs) 1619 1615 /* 1620 - * vcpu version available with KVM_ENABLE_CAP 1616 + * vcpu version available with KVM_CAP_ENABLE_CAP 1621 1617 * vm version available with KVM_CAP_ENABLE_CAP_VM 1622 1618 */ 1623 1619 #define KVM_ENABLE_CAP _IOW(KVMIO, 0xa3, struct kvm_enable_cap)

+17 -2

tools/testing/selftests/kvm/Makefile

··· 61 61 # Compiled test targets 62 62 TEST_GEN_PROGS_x86_64 = x86_64/cpuid_test 63 63 TEST_GEN_PROGS_x86_64 += x86_64/cr4_cpuid_sync_test 64 + TEST_GEN_PROGS_x86_64 += x86_64/dirty_log_page_splitting_test 64 65 TEST_GEN_PROGS_x86_64 += x86_64/get_msr_index_features 65 66 TEST_GEN_PROGS_x86_64 += x86_64/exit_on_emulation_failure_test 66 67 TEST_GEN_PROGS_x86_64 += x86_64/fix_hypercall_test ··· 165 164 TEST_GEN_PROGS_s390x += s390x/resets 166 165 TEST_GEN_PROGS_s390x += s390x/sync_regs_test 167 166 TEST_GEN_PROGS_s390x += s390x/tprot 167 + TEST_GEN_PROGS_s390x += s390x/cmma_test 168 168 TEST_GEN_PROGS_s390x += demand_paging_test 169 169 TEST_GEN_PROGS_s390x += dirty_log_test 170 170 TEST_GEN_PROGS_s390x += kvm_create_max_vcpus ··· 186 184 TEST_GEN_PROGS_EXTENDED += $(TEST_GEN_PROGS_EXTENDED_$(ARCH_DIR)) 187 185 LIBKVM += $(LIBKVM_$(ARCH_DIR)) 188 186 187 + OVERRIDE_TARGETS = 1 188 + 189 189 # lib.mak defines $(OUTPUT), prepends $(OUTPUT)/ to $(TEST_GEN_PROGS), and most 190 190 # importantly defines, i.e. overwrites, $(CC) (unless `make -e` or `make CC=`, 191 191 # which causes the environment variable to override the makefile). 
··· 202 198 LINUX_TOOL_ARCH_INCLUDE = $(top_srcdir)/tools/arch/$(ARCH)/include 203 199 endif 204 200 CFLAGS += -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99 \ 205 - -Wno-gnu-variable-sized-type-not-at-end \ 201 + -Wno-gnu-variable-sized-type-not-at-end -MD\ 206 202 -fno-builtin-memcmp -fno-builtin-memcpy -fno-builtin-memset \ 207 203 -fno-stack-protector -fno-PIE -I$(LINUX_TOOL_INCLUDE) \ 208 204 -I$(LINUX_TOOL_ARCH_INCLUDE) -I$(LINUX_HDR_PATH) -Iinclude \ ··· 229 225 LIBKVM_STRING_OBJ := $(patsubst %.c, $(OUTPUT)/%.o, $(LIBKVM_STRING)) 230 226 LIBKVM_OBJS = $(LIBKVM_C_OBJ) $(LIBKVM_S_OBJ) $(LIBKVM_STRING_OBJ) 231 227 232 - EXTRA_CLEAN += $(LIBKVM_OBJS) cscope.* 228 + TEST_GEN_OBJ = $(patsubst %, %.o, $(TEST_GEN_PROGS)) 229 + TEST_GEN_OBJ += $(patsubst %, %.o, $(TEST_GEN_PROGS_EXTENDED)) 230 + TEST_DEP_FILES = $(patsubst %.o, %.d, $(TEST_GEN_OBJ)) 231 + TEST_DEP_FILES += $(patsubst %.o, %.d, $(LIBKVM_OBJS)) 232 + -include $(TEST_DEP_FILES) 233 + 234 + $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED): %: %.o 235 + $(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) $(TARGET_ARCH) $< $(LIBKVM_OBJS) $(LDLIBS) -o $@ 236 + $(TEST_GEN_OBJ): $(OUTPUT)/%.o: %.c 237 + $(CC) $(CFLAGS) $(CPPFLAGS) $(TARGET_ARCH) -c $< -o $@ 238 + 239 + EXTRA_CLEAN += $(LIBKVM_OBJS) $(TEST_DEP_FILES) $(TEST_GEN_OBJ) cscope.* 233 240 234 241 x := $(shell mkdir -p $(sort $(dir $(LIBKVM_C_OBJ) $(LIBKVM_S_OBJ)))) 235 242 $(LIBKVM_C_OBJ): $(OUTPUT)/%.o: %.c

+22 -10

tools/testing/selftests/kvm/demand_paging_test.c

··· 128 128 129 129 static void run_test(enum vm_guest_mode mode, void *arg) 130 130 { 131 + struct memstress_vcpu_args *vcpu_args; 131 132 struct test_params *p = arg; 132 133 struct uffd_desc **uffd_descs = NULL; 133 134 struct timespec start; ··· 146 145 "Failed to allocate buffer for guest data pattern"); 147 146 memset(guest_data_prototype, 0xAB, demand_paging_size); 148 147 148 + if (p->uffd_mode == UFFDIO_REGISTER_MODE_MINOR) { 149 + for (i = 0; i < nr_vcpus; i++) { 150 + vcpu_args = &memstress_args.vcpu_args[i]; 151 + prefault_mem(addr_gpa2alias(vm, vcpu_args->gpa), 152 + vcpu_args->pages * memstress_args.guest_page_size); 153 + } 154 + } 155 + 149 156 if (p->uffd_mode) { 150 157 uffd_descs = malloc(nr_vcpus * sizeof(struct uffd_desc *)); 151 158 TEST_ASSERT(uffd_descs, "Memory allocation failed"); 152 - 153 159 for (i = 0; i < nr_vcpus; i++) { 154 - struct memstress_vcpu_args *vcpu_args; 155 160 void *vcpu_hva; 156 - void *vcpu_alias; 157 161 158 162 vcpu_args = &memstress_args.vcpu_args[i]; 159 163 160 164 /* Cache the host addresses of the region */ 161 165 vcpu_hva = addr_gpa2hva(vm, vcpu_args->gpa); 162 - vcpu_alias = addr_gpa2alias(vm, vcpu_args->gpa); 163 - 164 - prefault_mem(vcpu_alias, 165 - vcpu_args->pages * memstress_args.guest_page_size); 166 - 167 166 /* 168 167 * Set up user fault fd to handle demand paging 169 168 * requests. ··· 208 207 { 209 208 puts(""); 210 209 printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-d uffd_delay_usec]\n" 211 - " [-b memory] [-s type] [-v vcpus] [-o]\n", name); 210 + " [-b memory] [-s type] [-v vcpus] [-c cpu_list] [-o]\n", name); 212 211 guest_modes_help(); 213 212 printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n" 214 213 " UFFD registration mode: 'MISSING' or 'MINOR'.\n"); 214 + kvm_print_vcpu_pinning_help(); 215 215 printf(" -d: add a delay in usec to the User Fault\n" 216 216 " FD handler to simulate demand paging\n" 217 217 " overheads. 
Ignored without -u.\n"); ··· 230 228 int main(int argc, char *argv[]) 231 229 { 232 230 int max_vcpus = kvm_check_cap(KVM_CAP_MAX_VCPUS); 231 + const char *cpulist = NULL; 233 232 struct test_params p = { 234 233 .src_type = DEFAULT_VM_MEM_SRC, 235 234 .partition_vcpu_memory_access = true, ··· 239 236 240 237 guest_modes_append_default(); 241 238 242 - while ((opt = getopt(argc, argv, "hm:u:d:b:s:v:o")) != -1) { 239 + while ((opt = getopt(argc, argv, "hm:u:d:b:s:v:c:o")) != -1) { 243 240 switch (opt) { 244 241 case 'm': 245 242 guest_modes_cmdline(optarg); ··· 266 263 TEST_ASSERT(nr_vcpus <= max_vcpus, 267 264 "Invalid number of vcpus, must be between 1 and %d", max_vcpus); 268 265 break; 266 + case 'c': 267 + cpulist = optarg; 268 + break; 269 269 case 'o': 270 270 p.partition_vcpu_memory_access = false; 271 271 break; ··· 282 276 if (p.uffd_mode == UFFDIO_REGISTER_MODE_MINOR && 283 277 !backing_src_is_shared(p.src_type)) { 284 278 TEST_FAIL("userfaultfd MINOR mode requires shared memory; pick a different -s"); 279 + } 280 + 281 + if (cpulist) { 282 + kvm_parse_vcpu_pinning(cpulist, memstress_args.vcpu_to_pcpu, 283 + nr_vcpus); 284 + memstress_args.pin_vcpus = true; 285 285 } 286 286 287 287 for_each_guest_mode(run_test, &p);
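The new `-c` option hands a comma-separated list of physical CPUs to `kvm_parse_vcpu_pinning()`, whose implementation is not part of this hunk. The sketch below is not that function. It only illustrates how a list such as `22,23,24,50` might be split, using a hypothetical `parse_cpu_list()` that rejects empty entries and overlong lists.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical parser: fills out[] with the entries of a list such as
 * "22,23,24" and returns how many were read, or -1 on malformed input
 * or when more than max entries are present.
 */
static int parse_cpu_list(const char *s, unsigned int *out, size_t max)
{
	size_t n = 0;

	while (*s) {
		char *end;
		unsigned long cpu;

		if (n == max)
			return -1;	/* more entries than the caller allows */
		if (*s < '0' || *s > '9')
			return -1;	/* empty entry or stray character */

		cpu = strtoul(s, &end, 10);
		out[n++] = (unsigned int)cpu;

		if (*end == ',') {
			s = end + 1;
			if (!*s)
				return -1;	/* trailing comma */
		} else if (*end == '\0') {
			s = end;
		} else {
			return -1;
		}
	}
	return (int)n;
}
```

A caller that wants the optional extra entry for the main task could pass `nr_vcpus + 1` as `max` and compare the return value against `nr_vcpus`.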

+8 -88

tools/testing/selftests/kvm/dirty_log_perf_test.c

···
 	bool random_access;
 };

-static void toggle_dirty_logging(struct kvm_vm *vm, int slots, bool enable)
-{
-	int i;
-
-	for (i = 0; i < slots; i++) {
-		int slot = MEMSTRESS_MEM_SLOT_INDEX + i;
-		int flags = enable ? KVM_MEM_LOG_DIRTY_PAGES : 0;
-
-		vm_mem_region_set_flags(vm, slot, flags);
-	}
-}
-
-static inline void enable_dirty_logging(struct kvm_vm *vm, int slots)
-{
-	toggle_dirty_logging(vm, slots, true);
-}
-
-static inline void disable_dirty_logging(struct kvm_vm *vm, int slots)
-{
-	toggle_dirty_logging(vm, slots, false);
-}
-
-static void get_dirty_log(struct kvm_vm *vm, unsigned long *bitmaps[], int slots)
-{
-	int i;
-
-	for (i = 0; i < slots; i++) {
-		int slot = MEMSTRESS_MEM_SLOT_INDEX + i;
-
-		kvm_vm_get_dirty_log(vm, slot, bitmaps[i]);
-	}
-}
-
-static void clear_dirty_log(struct kvm_vm *vm, unsigned long *bitmaps[],
-			    int slots, uint64_t pages_per_slot)
-{
-	int i;
-
-	for (i = 0; i < slots; i++) {
-		int slot = MEMSTRESS_MEM_SLOT_INDEX + i;
-
-		kvm_vm_clear_dirty_log(vm, slot, bitmaps[i], 0, pages_per_slot);
-	}
-}
-
-static unsigned long **alloc_bitmaps(int slots, uint64_t pages_per_slot)
-{
-	unsigned long **bitmaps;
-	int i;
-
-	bitmaps = malloc(slots * sizeof(bitmaps[0]));
-	TEST_ASSERT(bitmaps, "Failed to allocate bitmaps array.");
-
-	for (i = 0; i < slots; i++) {
-		bitmaps[i] = bitmap_zalloc(pages_per_slot);
-		TEST_ASSERT(bitmaps[i], "Failed to allocate slot bitmap.");
-	}
-
-	return bitmaps;
-}
-
-static void free_bitmaps(unsigned long *bitmaps[], int slots)
-{
-	int i;
-
-	for (i = 0; i < slots; i++)
-		free(bitmaps[i]);
-
-	free(bitmaps);
-}
-
 static void run_test(enum vm_guest_mode mode, void *arg)
 {
 	struct test_params *p = arg;
···
 	host_num_pages = vm_num_host_pages(mode, guest_num_pages);
 	pages_per_slot = host_num_pages / p->slots;

-	bitmaps = alloc_bitmaps(p->slots, pages_per_slot);
+	bitmaps = memstress_alloc_bitmaps(p->slots, pages_per_slot);

 	if (dirty_log_manual_caps)
 		vm_enable_cap(vm, KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2,
···

 	/* Enable dirty logging */
 	clock_gettime(CLOCK_MONOTONIC, &start);
-	enable_dirty_logging(vm, p->slots);
+	memstress_enable_dirty_logging(vm, p->slots);
 	ts_diff = timespec_elapsed(start);
 	pr_info("Enabling dirty logging time: %ld.%.9lds\n\n",
 		ts_diff.tv_sec, ts_diff.tv_nsec);
···
 			iteration, ts_diff.tv_sec, ts_diff.tv_nsec);

 		clock_gettime(CLOCK_MONOTONIC, &start);
-		get_dirty_log(vm, bitmaps, p->slots);
+		memstress_get_dirty_log(vm, bitmaps, p->slots);
 		ts_diff = timespec_elapsed(start);
 		get_dirty_log_total = timespec_add(get_dirty_log_total,
 						   ts_diff);
···

 		if (dirty_log_manual_caps) {
 			clock_gettime(CLOCK_MONOTONIC, &start);
-			clear_dirty_log(vm, bitmaps, p->slots, pages_per_slot);
+			memstress_clear_dirty_log(vm, bitmaps, p->slots,
+						  pages_per_slot);
 			ts_diff = timespec_elapsed(start);
 			clear_dirty_log_total = timespec_add(clear_dirty_log_total,
 							     ts_diff);
···

 	/* Disable dirty logging */
 	clock_gettime(CLOCK_MONOTONIC, &start);
-	disable_dirty_logging(vm, p->slots);
+	memstress_disable_dirty_logging(vm, p->slots);
 	ts_diff = timespec_elapsed(start);
 	pr_info("Disabling dirty logging time: %ld.%.9lds\n",
 		ts_diff.tv_sec, ts_diff.tv_nsec);
···
 			clear_dirty_log_total.tv_nsec, avg.tv_sec, avg.tv_nsec);
 	}

-	free_bitmaps(bitmaps, p->slots);
+	memstress_free_bitmaps(bitmaps, p->slots);
 	arch_cleanup_vm(vm);
 	memstress_destroy_vm(vm);
 }
···
 	       " so -w X means each page has an X%% chance of writing\n"
 	       " and a (100-X)%% chance of reading.\n"
 	       " (default: 100 i.e. all pages are written to.)\n");
-	printf(" -c: Pin tasks to physical CPUs. Takes a list of comma separated\n"
-	       " values (target pCPU), one for each vCPU, plus an optional\n"
-	       " entry for the main application task (specified via entry\n"
-	       " <nr_vcpus + 1>). If used, entries must be provided for all\n"
-	       " vCPUs, i.e. pinning vCPUs is all or nothing.\n\n"
-	       " E.g. to create 3 vCPUs, pin vCPU0=>pCPU22, vCPU1=>pCPU23,\n"
-	       " vCPU2=>pCPU24, and pin the application task to pCPU50:\n\n"
-	       " ./dirty_log_perf_test -v 3 -c 22,23,24,50\n\n"
-	       " To leave the application task unpinned, drop the final entry:\n\n"
-	       " ./dirty_log_perf_test -v 3 -c 22,23,24\n\n"
-	       " (default: no pinning)\n");
+	kvm_print_vcpu_pinning_help();
 	puts("");
 	exit(0);
 }

+1

tools/testing/selftests/kvm/include/kvm_util_base.h

···
 struct kvm_vcpu *vm_recreate_with_one_vcpu(struct kvm_vm *vm);

 void kvm_pin_this_task_to_pcpu(uint32_t pcpu);
+void kvm_print_vcpu_pinning_help(void);
 void kvm_parse_vcpu_pinning(const char *pcpus_string, uint32_t vcpu_to_pcpu[],
 			    int nr_vcpus);

+8

tools/testing/selftests/kvm/include/memstress.h

···
 uint64_t memstress_nested_pages(int nr_vcpus);
 void memstress_setup_nested(struct kvm_vm *vm, int nr_vcpus, struct kvm_vcpu *vcpus[]);

+void memstress_enable_dirty_logging(struct kvm_vm *vm, int slots);
+void memstress_disable_dirty_logging(struct kvm_vm *vm, int slots);
+void memstress_get_dirty_log(struct kvm_vm *vm, unsigned long *bitmaps[], int slots);
+void memstress_clear_dirty_log(struct kvm_vm *vm, unsigned long *bitmaps[],
+			       int slots, uint64_t pages_per_slot);
+unsigned long **memstress_alloc_bitmaps(int slots, uint64_t pages_per_slot);
+void memstress_free_bitmaps(unsigned long *bitmaps[], int slots);
+
 #endif /* SELFTEST_KVM_MEMSTRESS_H */

+17

tools/testing/selftests/kvm/lib/kvm_util.c

···
 	return pcpu;
 }

+void kvm_print_vcpu_pinning_help(void)
+{
+	const char *name = program_invocation_name;
+
+	printf(" -c: Pin tasks to physical CPUs. Takes a list of comma separated\n"
+	       " values (target pCPU), one for each vCPU, plus an optional\n"
+	       " entry for the main application task (specified via entry\n"
+	       " <nr_vcpus + 1>). If used, entries must be provided for all\n"
+	       " vCPUs, i.e. pinning vCPUs is all or nothing.\n\n"
+	       " E.g. to create 3 vCPUs, pin vCPU0=>pCPU22, vCPU1=>pCPU23,\n"
+	       " vCPU2=>pCPU24, and pin the application task to pCPU50:\n\n"
+	       " %s -v 3 -c 22,23,24,50\n\n"
+	       " To leave the application task unpinned, drop the final entry:\n\n"
+	       " %s -v 3 -c 22,23,24\n\n"
+	       " (default: no pinning)\n", name, name);
+}
+
 void kvm_parse_vcpu_pinning(const char *pcpus_string, uint32_t vcpu_to_pcpu[],
 			    int nr_vcpus)
 {

+75

tools/testing/selftests/kvm/lib/memstress.c

···
 #define _GNU_SOURCE

 #include <inttypes.h>
+#include <linux/bitmap.h>

 #include "kvm_util.h"
 #include "memstress.h"
···
 	GUEST_ASSERT(vcpu_args->vcpu_idx == vcpu_idx);

 	while (true) {
+		for (i = 0; i < sizeof(memstress_args); i += args->guest_page_size)
+			(void) *((volatile char *)args + i);
+
 		for (i = 0; i < pages; i++) {
 			if (args->random_access)
 				page = guest_random_u32(&rand_state) % pages;
···

 	for (i = 0; i < nr_vcpus; i++)
 		pthread_join(vcpu_threads[i].thread, NULL);
+}
+
+static void toggle_dirty_logging(struct kvm_vm *vm, int slots, bool enable)
+{
+	int i;
+
+	for (i = 0; i < slots; i++) {
+		int slot = MEMSTRESS_MEM_SLOT_INDEX + i;
+		int flags = enable ? KVM_MEM_LOG_DIRTY_PAGES : 0;
+
+		vm_mem_region_set_flags(vm, slot, flags);
+	}
+}
+
+void memstress_enable_dirty_logging(struct kvm_vm *vm, int slots)
+{
+	toggle_dirty_logging(vm, slots, true);
+}
+
+void memstress_disable_dirty_logging(struct kvm_vm *vm, int slots)
+{
+	toggle_dirty_logging(vm, slots, false);
+}
+
+void memstress_get_dirty_log(struct kvm_vm *vm, unsigned long *bitmaps[], int slots)
+{
+	int i;
+
+	for (i = 0; i < slots; i++) {
+		int slot = MEMSTRESS_MEM_SLOT_INDEX + i;
+
+		kvm_vm_get_dirty_log(vm, slot, bitmaps[i]);
+	}
+}
+
+void memstress_clear_dirty_log(struct kvm_vm *vm, unsigned long *bitmaps[],
+			       int slots, uint64_t pages_per_slot)
+{
+	int i;
+
+	for (i = 0; i < slots; i++) {
+		int slot = MEMSTRESS_MEM_SLOT_INDEX + i;
+
+		kvm_vm_clear_dirty_log(vm, slot, bitmaps[i], 0, pages_per_slot);
+	}
+}
+
+unsigned long **memstress_alloc_bitmaps(int slots, uint64_t pages_per_slot)
+{
+	unsigned long **bitmaps;
+	int i;
+
+	bitmaps = malloc(slots * sizeof(bitmaps[0]));
+	TEST_ASSERT(bitmaps, "Failed to allocate bitmaps array.");
+
+	for (i = 0; i < slots; i++) {
+		bitmaps[i] = bitmap_zalloc(pages_per_slot);
+		TEST_ASSERT(bitmaps[i], "Failed to allocate slot bitmap.");
+	}
+
+	return bitmaps;
+}
+
+void memstress_free_bitmaps(unsigned long *bitmaps[], int slots)
+{
+	int i;
+
+	for (i = 0; i < slots; i++)
+		free(bitmaps[i]);
+
+	free(bitmaps);
 }
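
Note: the bitmap helpers moved into memstress.c allocate one dirty bitmap per memslot and free them in the same order. A minimal standalone sketch of that pattern follows. It is not the selftest code: the sketch_ names are hypothetical, calloc() stands in for the tools bitmap_zalloc() helper, and assert() stands in for TEST_ASSERT().

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: local definition for this sketch only. */
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for bitmap_zalloc(): zeroed storage for nbits bits. */
unsigned long *sketch_bitmap_zalloc(uint64_t nbits)
{
	size_t longs = (nbits + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;

	/* Allocate at least one word so a zero-bit request is still usable. */
	return calloc(longs ? longs : 1, sizeof(unsigned long));
}

/* One array of pointers, then one zeroed bitmap per slot. */
unsigned long **sketch_alloc_bitmaps(int slots, uint64_t pages_per_slot)
{
	unsigned long **bitmaps = malloc(slots * sizeof(bitmaps[0]));
	int i;

	assert(bitmaps);
	for (i = 0; i < slots; i++) {
		bitmaps[i] = sketch_bitmap_zalloc(pages_per_slot);
		assert(bitmaps[i]);
	}

	return bitmaps;
}

/* Free each per-slot bitmap before the pointer array itself. */
void sketch_free_bitmaps(unsigned long *bitmaps[], int slots)
{
	int i;

	for (i = 0; i < slots; i++)
		free(bitmaps[i]);

	free(bitmaps);
}
```

Keeping allocation and teardown as a pair in the library means each test only has to pass the same slot count to both calls.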

+2 -2

tools/testing/selftests/kvm/lib/userfaultfd_util.c

···
 			r = read(pollfd[1].fd, &tmp_chr, 1);
 			TEST_ASSERT(r == 1,
 				    "Error reading pipefd in UFFD thread\n");
-			return NULL;
+			break;
 		}

 		if (!(pollfd[0].revents & POLLIN))
···
 	ts_diff = timespec_elapsed(start);
 	PER_VCPU_DEBUG("userfaulted %ld pages over %ld.%.9lds. (%f/sec)\n",
 		       pages, ts_diff.tv_sec, ts_diff.tv_nsec,
-		       pages / ((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / 100000000.0));
+		       pages / ((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / NSEC_PER_SEC));

 	return NULL;
 }
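
Note: the second hunk replaces the literal 100000000.0, which is 1e8 and so one zero short of the number of nanoseconds in a second, with NSEC_PER_SEC. A minimal standalone sketch of the conversion follows. The SKETCH_NSEC_PER_SEC macro and both function names are assumptions made for this illustration; the selftests take NSEC_PER_SEC from their own headers.

```c
#include <assert.h>
#include <time.h>

/* Assumption: defined locally for this sketch (1e9 nanoseconds per second). */
#define SKETCH_NSEC_PER_SEC 1000000000L

/* Convert a normalized timespec to fractional seconds. */
double sketch_timespec_to_seconds(struct timespec ts)
{
	/* A normalized timespec keeps tv_nsec in [0, 1e9). */
	assert(ts.tv_nsec >= 0 && ts.tv_nsec < SKETCH_NSEC_PER_SEC);

	return (double)ts.tv_sec + (double)ts.tv_nsec / SKETCH_NSEC_PER_SEC;
}

/* Pages handled per second over the elapsed interval. */
double sketch_pages_per_second(long pages, struct timespec elapsed)
{
	return pages / sketch_timespec_to_seconds(elapsed);
}
```

With the old divisor the nanosecond part was weighted ten times too heavily, so the reported rate came out too low whenever tv_nsec was nonzero.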

+700

tools/testing/selftests/kvm/s390x/cmma_test.c

···
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Test for s390x CMMA migration
+ *
+ * Copyright IBM Corp. 2023
+ *
+ * Authors:
+ *  Nico Boehr <nrb@linux.ibm.com>
+ */
+
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "kselftest.h"
+
+#define MAIN_PAGE_COUNT 512
+
+#define TEST_DATA_PAGE_COUNT 512
+#define TEST_DATA_MEMSLOT 1
+#define TEST_DATA_START_GFN 4096
+
+#define TEST_DATA_TWO_PAGE_COUNT 256
+#define TEST_DATA_TWO_MEMSLOT 2
+#define TEST_DATA_TWO_START_GFN 8192
+
+static char cmma_value_buf[MAIN_PAGE_COUNT + TEST_DATA_PAGE_COUNT];
+
+/**
+ * Dirty CMMA attributes of exactly one page in the TEST_DATA memslot,
+ * so use_cmma goes on and the CMMA related ioctls do something.
+ */
+static void guest_do_one_essa(void)
+{
+	asm volatile(
+		/* load TEST_DATA_START_GFN into r1 */
+		" llilf 1,%[start_gfn]\n"
+		/* calculate the address from the gfn */
+		" sllg 1,1,12(0)\n"
+		/* set the first page in TEST_DATA memslot to STABLE */
+		" .insn rrf,0xb9ab0000,2,1,1,0\n"
+		/* hypercall */
+		" diag 0,0,0x501\n"
+		"0: j 0b"
+		:
+		: [start_gfn] "L"(TEST_DATA_START_GFN)
+		: "r1", "r2", "memory", "cc"
+	);
+}
+
+/**
+ * Touch CMMA attributes of all pages in TEST_DATA memslot. Set them to stable
+ * state.
+ */
+static void guest_dirty_test_data(void)
+{
+	asm volatile(
+		/* r1 = TEST_DATA_START_GFN */
+		" xgr 1,1\n"
+		" llilf 1,%[start_gfn]\n"
+		/* r5 = TEST_DATA_PAGE_COUNT */
+		" lghi 5,%[page_count]\n"
+		/* r5 += r1 */
+		"2: agfr 5,1\n"
+		/* r2 = r1 << 12 */
+		"1: sllg 2,1,12(0)\n"
+		/* essa(r4, r2, SET_STABLE) */
+		" .insn rrf,0xb9ab0000,4,2,1,0\n"
+		/* i++ */
+		" agfi 1,1\n"
+		/* if r1 < r5 goto 1 */
+		" cgrjl 1,5,1b\n"
+		/* hypercall */
+		" diag 0,0,0x501\n"
+		"0: j 0b"
+		:
+		: [start_gfn] "L"(TEST_DATA_START_GFN),
+		  [page_count] "L"(TEST_DATA_PAGE_COUNT)
+		:
+			/* the counter in our loop over the pages */
+			"r1",
+			/* the calculated page physical address */
+			"r2",
+			/* ESSA output register */
+			"r4",
+			/* last page */
+			"r5",
+			"cc", "memory"
+	);
+}
+
+static struct kvm_vm *create_vm(void)
+{
+	return ____vm_create(VM_MODE_DEFAULT);
+}
+
+static void create_main_memslot(struct kvm_vm *vm)
+{
+	int i;
+
+	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, MAIN_PAGE_COUNT, 0);
+	/* set the array of memslots to zero like __vm_create does */
+	for (i = 0; i < NR_MEM_REGIONS; i++)
+		vm->memslots[i] = 0;
+}
+
+static void create_test_memslot(struct kvm_vm *vm)
+{
+	vm_userspace_mem_region_add(vm,
+				    VM_MEM_SRC_ANONYMOUS,
+				    TEST_DATA_START_GFN << vm->page_shift,
+				    TEST_DATA_MEMSLOT,
+				    TEST_DATA_PAGE_COUNT,
+				    0
+				   );
+	vm->memslots[MEM_REGION_TEST_DATA] = TEST_DATA_MEMSLOT;
+}
+
+static void create_memslots(struct kvm_vm *vm)
+{
+	/*
+	 * Our VM has the following memory layout:
+	 * +------+---------------------------+
+	 * | GFN  | Memslot                   |
+	 * +------+---------------------------+
+	 * | 0    |                           |
+	 * | ...  | MAIN (Code, Stack, ...)   |
+	 * | 511  |                           |
+	 * +------+---------------------------+
+	 * | 4096 |                           |
+	 * | ...  | TEST_DATA                 |
+	 * | 4607 |                           |
+	 * +------+---------------------------+
+	 */
+	create_main_memslot(vm);
+	create_test_memslot(vm);
+}
+
+static void finish_vm_setup(struct kvm_vm *vm)
+{
+	struct userspace_mem_region *slot0;
+
+	kvm_vm_elf_load(vm, program_invocation_name);
+
+	slot0 = memslot2region(vm, 0);
+	ucall_init(vm, slot0->region.guest_phys_addr + slot0->region.memory_size);
+
+	kvm_arch_vm_post_create(vm);
+}
+
+static struct kvm_vm *create_vm_two_memslots(void)
+{
+	struct kvm_vm *vm;
+
+	vm = create_vm();
+
+	create_memslots(vm);
+
+	finish_vm_setup(vm);
+
+	return vm;
+}
+
+static void enable_cmma(struct kvm_vm *vm)
+{
+	int r;
+
+	r = __kvm_device_attr_set(vm->fd, KVM_S390_VM_MEM_CTRL, KVM_S390_VM_MEM_ENABLE_CMMA, NULL);
+	TEST_ASSERT(!r, "enabling cmma failed r=%d errno=%d", r, errno);
+}
+
+static void enable_dirty_tracking(struct kvm_vm *vm)
+{
+	vm_mem_region_set_flags(vm, 0, KVM_MEM_LOG_DIRTY_PAGES);
+	vm_mem_region_set_flags(vm, TEST_DATA_MEMSLOT, KVM_MEM_LOG_DIRTY_PAGES);
+}
+
+static int __enable_migration_mode(struct kvm_vm *vm)
+{
+	return __kvm_device_attr_set(vm->fd,
+				     KVM_S390_VM_MIGRATION,
+				     KVM_S390_VM_MIGRATION_START,
+				     NULL
+				    );
+}
+
+static void enable_migration_mode(struct kvm_vm *vm)
+{
+	int r = __enable_migration_mode(vm);
+
+	TEST_ASSERT(!r, "enabling migration mode failed r=%d errno=%d", r, errno);
+}
+
+static bool is_migration_mode_on(struct kvm_vm *vm)
+{
+	u64 out;
+	int r;
+
+	r = __kvm_device_attr_get(vm->fd,
+				  KVM_S390_VM_MIGRATION,
+				  KVM_S390_VM_MIGRATION_STATUS,
+				  &out
+				 );
+	TEST_ASSERT(!r, "getting migration mode status failed r=%d errno=%d", r, errno);
+	return out;
+}
+
+static int vm_get_cmma_bits(struct kvm_vm *vm, u64 flags, int *errno_out)
+{
+	struct kvm_s390_cmma_log args;
+	int rc;
+
+	errno = 0;
+
+	args = (struct kvm_s390_cmma_log){
+		.start_gfn = 0,
+		.count = sizeof(cmma_value_buf),
+		.flags = flags,
+		.values = (__u64)&cmma_value_buf[0]
+	};
+	rc = __vm_ioctl(vm, KVM_S390_GET_CMMA_BITS, &args);
+
+	*errno_out = errno;
+	return rc;
+}
+
+static void test_get_cmma_basic(void)
+{
+	struct kvm_vm *vm = create_vm_two_memslots();
+	struct kvm_vcpu *vcpu;
+	int rc, errno_out;
+
+	/* GET_CMMA_BITS without CMMA enabled should fail */
+	rc = vm_get_cmma_bits(vm, 0, &errno_out);
+	ASSERT_EQ(rc, -1);
+	ASSERT_EQ(errno_out, ENXIO);
+
+	enable_cmma(vm);
+	vcpu = vm_vcpu_add(vm, 1, guest_do_one_essa);
+
+	vcpu_run(vcpu);
+
+	/* GET_CMMA_BITS without migration mode and without peeking should fail */
+	rc = vm_get_cmma_bits(vm, 0, &errno_out);
+	ASSERT_EQ(rc, -1);
+	ASSERT_EQ(errno_out, EINVAL);
+
+	/* GET_CMMA_BITS without migration mode and with peeking should work */
+	rc = vm_get_cmma_bits(vm, KVM_S390_CMMA_PEEK, &errno_out);
+	ASSERT_EQ(rc, 0);
+	ASSERT_EQ(errno_out, 0);
+
+	enable_dirty_tracking(vm);
+	enable_migration_mode(vm);
+
+	/* GET_CMMA_BITS with invalid flags */
+	rc = vm_get_cmma_bits(vm, 0xfeedc0fe, &errno_out);
+	ASSERT_EQ(rc, -1);
+	ASSERT_EQ(errno_out, EINVAL);
+
+	kvm_vm_free(vm);
+}
+
+static void assert_exit_was_hypercall(struct kvm_vcpu *vcpu)
+{
+	ASSERT_EQ(vcpu->run->exit_reason, 13);
+	ASSERT_EQ(vcpu->run->s390_sieic.icptcode, 4);
+	ASSERT_EQ(vcpu->run->s390_sieic.ipa, 0x8300);
+	ASSERT_EQ(vcpu->run->s390_sieic.ipb, 0x5010000);
+}
+
+static void test_migration_mode(void)
+{
+	struct kvm_vm *vm = create_vm();
+	struct kvm_vcpu *vcpu;
+	u64 orig_psw;
+	int rc;
+
+	/* enabling migration mode on a VM without memory should fail */
+	rc = __enable_migration_mode(vm);
+	ASSERT_EQ(rc, -1);
+	ASSERT_EQ(errno, EINVAL);
+	TEST_ASSERT(!is_migration_mode_on(vm), "migration mode should still be off");
+	errno = 0;
+
+	create_memslots(vm);
+	finish_vm_setup(vm);
+
+	enable_cmma(vm);
+	vcpu = vm_vcpu_add(vm, 1, guest_do_one_essa);
+	orig_psw = vcpu->run->psw_addr;
+
+	/*
+	 * Execute one essa instruction in the guest. Otherwise the guest will
+	 * not have use_cmm enabled and GET_CMMA_BITS will return no pages.
+	 */
+	vcpu_run(vcpu);
+	assert_exit_was_hypercall(vcpu);
+
+	/* migration mode when memslots have dirty tracking off should fail */
+	rc = __enable_migration_mode(vm);
+	ASSERT_EQ(rc, -1);
+	ASSERT_EQ(errno, EINVAL);
+	TEST_ASSERT(!is_migration_mode_on(vm), "migration mode should still be off");
+	errno = 0;
+
+	/* enable dirty tracking */
+	enable_dirty_tracking(vm);
+
+	/* enabling migration mode should work now */
+	rc = __enable_migration_mode(vm);
+	ASSERT_EQ(rc, 0);
+	TEST_ASSERT(is_migration_mode_on(vm), "migration mode should be on");
+	errno = 0;
+
+	/* execute another ESSA instruction to see this goes fine */
+	vcpu->run->psw_addr = orig_psw;
+	vcpu_run(vcpu);
+	assert_exit_was_hypercall(vcpu);
+
+	/*
+	 * With migration mode on, create a new memslot with dirty tracking off.
+	 * This should turn off migration mode.
+	 */
+	TEST_ASSERT(is_migration_mode_on(vm), "migration mode should be on");
+	vm_userspace_mem_region_add(vm,
+				    VM_MEM_SRC_ANONYMOUS,
+				    TEST_DATA_TWO_START_GFN << vm->page_shift,
+				    TEST_DATA_TWO_MEMSLOT,
+				    TEST_DATA_TWO_PAGE_COUNT,
+				    0
+				   );
+	TEST_ASSERT(!is_migration_mode_on(vm),
+		    "creating memslot without dirty tracking turns off migration mode"
+		   );
+
+	/* ESSA instructions should still execute fine */
+	vcpu->run->psw_addr = orig_psw;
+	vcpu_run(vcpu);
+	assert_exit_was_hypercall(vcpu);
+
+	/*
+	 * Turn on dirty tracking on the new memslot.
+	 * It should be possible to turn migration mode back on again.
+	 */
+	vm_mem_region_set_flags(vm, TEST_DATA_TWO_MEMSLOT, KVM_MEM_LOG_DIRTY_PAGES);
+	rc = __enable_migration_mode(vm);
+	ASSERT_EQ(rc, 0);
+	TEST_ASSERT(is_migration_mode_on(vm), "migration mode should be on");
+	errno = 0;
+
+	/*
+	 * Turn off dirty tracking again, this time with just a flag change.
+	 * Again, migration mode should turn off.
+	 */
+	TEST_ASSERT(is_migration_mode_on(vm), "migration mode should be on");
+	vm_mem_region_set_flags(vm, TEST_DATA_TWO_MEMSLOT, 0);
+	TEST_ASSERT(!is_migration_mode_on(vm),
+		    "disabling dirty tracking should turn off migration mode"
+		   );
+
+	/* ESSA instructions should still execute fine */
+	vcpu->run->psw_addr = orig_psw;
+	vcpu_run(vcpu);
+	assert_exit_was_hypercall(vcpu);
+
+	kvm_vm_free(vm);
+}
+
+/**
+ * Given a VM with the MAIN and TEST_DATA memslot, assert that both slots have
+ * CMMA attributes of all pages in both memslots and nothing more dirty.
+ * This has the useful side effect of ensuring nothing is CMMA dirty after this
+ * function.
+ */
+static void assert_all_slots_cmma_dirty(struct kvm_vm *vm)
+{
+	struct kvm_s390_cmma_log args;
+
+	/*
+	 * First iteration - everything should be dirty.
+	 * Start at the main memslot...
+	 */
+	args = (struct kvm_s390_cmma_log){
+		.start_gfn = 0,
+		.count = sizeof(cmma_value_buf),
+		.flags = 0,
+		.values = (__u64)&cmma_value_buf[0]
+	};
+	memset(cmma_value_buf, 0xff, sizeof(cmma_value_buf));
+	vm_ioctl(vm, KVM_S390_GET_CMMA_BITS, &args);
+	ASSERT_EQ(args.count, MAIN_PAGE_COUNT);
+	ASSERT_EQ(args.remaining, TEST_DATA_PAGE_COUNT);
+	ASSERT_EQ(args.start_gfn, 0);
+
+	/* ...and then - after a hole - the TEST_DATA memslot should follow */
+	args = (struct kvm_s390_cmma_log){
+		.start_gfn = MAIN_PAGE_COUNT,
+		.count = sizeof(cmma_value_buf),
+		.flags = 0,
+		.values = (__u64)&cmma_value_buf[0]
+	};
+	memset(cmma_value_buf, 0xff, sizeof(cmma_value_buf));
+	vm_ioctl(vm, KVM_S390_GET_CMMA_BITS, &args);
+	ASSERT_EQ(args.count, TEST_DATA_PAGE_COUNT);
+	ASSERT_EQ(args.start_gfn, TEST_DATA_START_GFN);
+	ASSERT_EQ(args.remaining, 0);
+
+	/* ...and nothing else should be there */
+	args = (struct kvm_s390_cmma_log){
+		.start_gfn = TEST_DATA_START_GFN + TEST_DATA_PAGE_COUNT,
+		.count = sizeof(cmma_value_buf),
+		.flags = 0,
+		.values = (__u64)&cmma_value_buf[0]
+	};
+	memset(cmma_value_buf, 0xff, sizeof(cmma_value_buf));
+	vm_ioctl(vm, KVM_S390_GET_CMMA_BITS, &args);
+	ASSERT_EQ(args.count, 0);
+	ASSERT_EQ(args.start_gfn, 0);
+	ASSERT_EQ(args.remaining, 0);
+}
+
+/**
+ * Given a VM, assert no pages are CMMA dirty.
+ */
+static void assert_no_pages_cmma_dirty(struct kvm_vm *vm)
+{
+	struct kvm_s390_cmma_log args;
+
+	/* If we start from GFN 0 again, nothing should be dirty. */
+	args = (struct kvm_s390_cmma_log){
+		.start_gfn = 0,
+		.count = sizeof(cmma_value_buf),
+		.flags = 0,
+		.values = (__u64)&cmma_value_buf[0]
+	};
+	memset(cmma_value_buf, 0xff, sizeof(cmma_value_buf));
+	vm_ioctl(vm, KVM_S390_GET_CMMA_BITS, &args);
+	if (args.count || args.remaining || args.start_gfn)
+		TEST_FAIL("pages are still dirty start_gfn=0x%llx count=%u remaining=%llu",
+			  args.start_gfn,
+			  args.count,
+			  args.remaining
+			 );
+}
+
+static void test_get_inital_dirty(void)
+{
+	struct kvm_vm *vm = create_vm_two_memslots();
+	struct kvm_vcpu *vcpu;
+
+	enable_cmma(vm);
+	vcpu = vm_vcpu_add(vm, 1, guest_do_one_essa);
+
+	/*
+	 * Execute one essa instruction in the guest. Otherwise the guest will
+	 * not have use_cmm enabled and GET_CMMA_BITS will return no pages.
+	 */
+	vcpu_run(vcpu);
+	assert_exit_was_hypercall(vcpu);
+
+	enable_dirty_tracking(vm);
+	enable_migration_mode(vm);
+
+	assert_all_slots_cmma_dirty(vm);
+
+	/* Start from the beginning again and make sure nothing else is dirty */
+	assert_no_pages_cmma_dirty(vm);
+
+	kvm_vm_free(vm);
+}
+
+static void query_cmma_range(struct kvm_vm *vm,
+			     u64 start_gfn, u64 gfn_count,
+			     struct kvm_s390_cmma_log *res_out)
+{
+	*res_out = (struct kvm_s390_cmma_log){
+		.start_gfn = start_gfn,
+		.count = gfn_count,
+		.flags = 0,
+		.values = (__u64)&cmma_value_buf[0]
+	};
+	memset(cmma_value_buf, 0xff, sizeof(cmma_value_buf));
+	vm_ioctl(vm, KVM_S390_GET_CMMA_BITS, res_out);
+}
+
+/**
+ * Assert the given cmma_log struct that was executed by query_cmma_range()
+ * indicates the first dirty gfn is at first_dirty_gfn and contains exactly
+ * dirty_gfn_count CMMA values.
+ */
+static void assert_cmma_dirty(u64 first_dirty_gfn,
+			      u64 dirty_gfn_count,
+			      const struct kvm_s390_cmma_log *res)
+{
+	ASSERT_EQ(res->start_gfn, first_dirty_gfn);
+	ASSERT_EQ(res->count, dirty_gfn_count);
+	for (size_t i = 0; i < dirty_gfn_count; i++)
+		ASSERT_EQ(cmma_value_buf[i], 0x0); /* stable state */
+	ASSERT_EQ(cmma_value_buf[dirty_gfn_count], 0xff); /* not touched */
+}
+
+static void test_get_skip_holes(void)
+{
+	size_t gfn_offset;
+	struct kvm_vm *vm = create_vm_two_memslots();
+	struct kvm_s390_cmma_log log;
+	struct kvm_vcpu *vcpu;
+	u64 orig_psw;
+
+	enable_cmma(vm);
+	vcpu = vm_vcpu_add(vm, 1, guest_dirty_test_data);
+
+	orig_psw = vcpu->run->psw_addr;
+
+	/*
+	 * Execute some essa instructions in the guest. Otherwise the guest will
+	 * not have use_cmm enabled and GET_CMMA_BITS will return no pages.
+	 */
+	vcpu_run(vcpu);
+	assert_exit_was_hypercall(vcpu);
+
+	enable_dirty_tracking(vm);
+	enable_migration_mode(vm);
+
+	/* un-dirty all pages */
+	assert_all_slots_cmma_dirty(vm);
+
+	/* Then, dirty just the TEST_DATA memslot */
+	vcpu->run->psw_addr = orig_psw;
+	vcpu_run(vcpu);
+
+	gfn_offset = TEST_DATA_START_GFN;
+	/**
+	 * Query CMMA attributes of one page, starting at page 0. Since the
+	 * main memslot was not touched by the VM, this should yield the first
+	 * page of the TEST_DATA memslot.
+	 * The dirty bitmap should now look like this:
+	 * 0: not dirty
+	 * [0x1, 0x200): dirty
+	 */
+	query_cmma_range(vm, 0, 1, &log);
+	assert_cmma_dirty(gfn_offset, 1, &log);
+	gfn_offset++;
+
+	/**
+	 * Query CMMA attributes of 32 (0x20) pages past the end of the TEST_DATA
+	 * memslot. This should wrap back to the beginning of the TEST_DATA
+	 * memslot, page 1.
+	 * The dirty bitmap should now look like this:
+	 * [0, 0x21): not dirty
+	 * [0x21, 0x200): dirty
+	 */
+	query_cmma_range(vm, TEST_DATA_START_GFN + TEST_DATA_PAGE_COUNT, 0x20, &log);
+	assert_cmma_dirty(gfn_offset, 0x20, &log);
+	gfn_offset += 0x20;
+
+	/* Skip 32 pages */
+	gfn_offset += 0x20;
+
+	/**
+	 * After skipping 32 pages, query the next 32 (0x20) pages.
+	 * The dirty bitmap should now look like this:
+	 * [0, 0x21): not dirty
+	 * [0x21, 0x41): dirty
+	 * [0x41, 0x61): not dirty
+	 * [0x61, 0x200): dirty
+	 */
+	query_cmma_range(vm, gfn_offset, 0x20, &log);
+	assert_cmma_dirty(gfn_offset, 0x20, &log);
+	gfn_offset += 0x20;
+
+	/**
+	 * Query 1 page from the beginning of the TEST_DATA memslot. This should
+	 * yield page 0x21.
+	 * The dirty bitmap should now look like this:
+	 * [0, 0x22): not dirty
+	 * [0x22, 0x41): dirty
+	 * [0x41, 0x61): not dirty
+	 * [0x61, 0x200): dirty
+	 */
+	query_cmma_range(vm, TEST_DATA_START_GFN, 1, &log);
+	assert_cmma_dirty(TEST_DATA_START_GFN + 0x21, 1, &log);
+	gfn_offset++;
+
+	/**
+	 * Query 15 (0xF) pages from page 0x23 in TEST_DATA memslot.
+	 * This should yield pages [0x23, 0x33).
+	 * The dirty bitmap should now look like this:
+	 * [0, 0x22): not dirty
+	 * 0x22: dirty
+	 * [0x23, 0x33): not dirty
+	 * [0x33, 0x41): dirty
+	 * [0x41, 0x61): not dirty
+	 * [0x61, 0x200): dirty
+	 */
+	gfn_offset = TEST_DATA_START_GFN + 0x23;
+	query_cmma_range(vm, gfn_offset, 15, &log);
+	assert_cmma_dirty(gfn_offset, 15, &log);
+
+	/**
+	 * Query 17 (0x11) pages from page 0x22 in TEST_DATA memslot.
+	 * This should yield pages [0x22, 0x33).
+	 * The dirty bitmap should now look like this:
+	 * [0, 0x33): not dirty
+	 * [0x33, 0x41): dirty
+	 * [0x41, 0x61): not dirty
+	 * [0x61, 0x200): dirty
+	 */
+	gfn_offset = TEST_DATA_START_GFN + 0x22;
+	query_cmma_range(vm, gfn_offset, 17, &log);
+	assert_cmma_dirty(gfn_offset, 17, &log);
+
+	/**
+	 * Query 25 (0x19) pages from page 0x40 in TEST_DATA memslot.
+	 * This should yield page 0x40 and nothing more, since there are more
+	 * than 16 non-dirty pages after page 0x40.
+	 * The dirty bitmap should now look like this:
+	 * [0, 0x33): not dirty
+	 * [0x33, 0x40): dirty
+	 * [0x40, 0x61): not dirty
+	 * [0x61, 0x200): dirty
+	 */
+	gfn_offset = TEST_DATA_START_GFN + 0x40;
+	query_cmma_range(vm, gfn_offset, 25, &log);
+	assert_cmma_dirty(gfn_offset, 1, &log);
+
+	/**
+	 * Query pages [0x33, 0x40).
+	 * The dirty bitmap should now look like this:
+	 * [0, 0x61): not dirty
+	 * [0x61, 0x200): dirty
+	 */
+	gfn_offset = TEST_DATA_START_GFN + 0x33;
+	query_cmma_range(vm, gfn_offset, 0x40 - 0x33, &log);
+	assert_cmma_dirty(gfn_offset, 0x40 - 0x33, &log);
+
+	/**
+	 * Query the remaining pages [0x61, 0x200).
+	 */
+	gfn_offset = TEST_DATA_START_GFN;
+	query_cmma_range(vm, gfn_offset, TEST_DATA_PAGE_COUNT - 0x61, &log);
+	assert_cmma_dirty(TEST_DATA_START_GFN + 0x61, TEST_DATA_PAGE_COUNT - 0x61, &log);
+
+	assert_no_pages_cmma_dirty(vm);
+}
+
+struct testdef {
+	const char *name;
+	void (*test)(void);
+} testlist[] = {
+	{ "migration mode and dirty tracking", test_migration_mode },
+	{ "GET_CMMA_BITS: basic calls", test_get_cmma_basic },
+	{ "GET_CMMA_BITS: all pages are dirty initially", test_get_inital_dirty },
+	{ "GET_CMMA_BITS: holes are skipped", test_get_skip_holes },
+};
+
+/**
+ * The kernel may support CMMA, but the machine may not (i.e. if running as
+ * guest-3).
+ *
+ * In this case, the CMMA capabilities are all there, but the CMMA-related
+ * ioctls fail. To find out whether the machine supports CMMA, create a
+ * temporary VM and then query the CMMA feature of the VM.
+ */
+static int machine_has_cmma(void)
+{
+	struct kvm_vm *vm = create_vm();
+	int r;
+
+	r = !__kvm_has_device_attr(vm->fd, KVM_S390_VM_MEM_CTRL, KVM_S390_VM_MEM_ENABLE_CMMA);
+	kvm_vm_free(vm);
+
+	return r;
+}
+
+int main(int argc, char *argv[])
+{
+	int idx;
+
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_SYNC_REGS));
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_S390_CMMA_MIGRATION));
+	TEST_REQUIRE(machine_has_cmma());
+
+	ksft_print_header();
+
+	ksft_set_plan(ARRAY_SIZE(testlist));
+
+	for (idx = 0; idx < ARRAY_SIZE(testlist); idx++) {
+		testlist[idx].test();
+		ksft_test_result_pass("%s\n", testlist[idx].name);
+	}
+
+	ksft_finished();	/* Print results and exit() accordingly */
+}
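
Note: the CMMA test pre-fills cmma_value_buf with 0xff before each GET_CMMA_BITS call, so entries the kernel wrote can be told apart from entries it left alone. A minimal standalone sketch of that sentinel check follows. The function is hypothetical and is not the selftest's assert_cmma_dirty(); it uses unsigned char so the 0xff comparison does not depend on the platform's char signedness.

```c
#include <stddef.h>

/*
 * Hypothetical checker: returns 1 if the first n entries hold `expected`
 * and entry n still holds the 0xff fill value, 0 otherwise.
 * The caller must make buf at least n + 1 bytes long.
 */
int sketch_prefix_written_tail_untouched(const unsigned char *buf, size_t n,
					 unsigned char expected)
{
	size_t i;

	for (i = 0; i < n; i++) {
		/* Every written entry is checked, indexed by i. */
		if (buf[i] != expected)
			return 0;
	}

	/* The entry just past the written range must be untouched. */
	return buf[n] == 0xff;
}
```

Checking the entry just past the range is what catches a call that returned more values than it reported.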

+21

tools/testing/selftests/kvm/x86_64/cpuid_test.c

···
 	ent->eax = eax;
 }

+static void test_get_cpuid2(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid2 *cpuid = allocate_kvm_cpuid2(vcpu->cpuid->nent + 1);
+	int i, r;
+
+	vcpu_ioctl(vcpu, KVM_GET_CPUID2, cpuid);
+	TEST_ASSERT(cpuid->nent == vcpu->cpuid->nent,
+		    "KVM didn't update nent on success, wanted %u, got %u\n",
+		    vcpu->cpuid->nent, cpuid->nent);
+
+	for (i = 0; i < vcpu->cpuid->nent; i++) {
+		cpuid->nent = i;
+		r = __vcpu_ioctl(vcpu, KVM_GET_CPUID2, cpuid);
+		TEST_ASSERT(r && errno == E2BIG, KVM_IOCTL_ERROR(KVM_GET_CPUID2, r));
+		TEST_ASSERT(cpuid->nent == i, "KVM modified nent on failure");
+	}
+	free(cpuid);
+}
+
 int main(void)
 {
 	struct kvm_vcpu *vcpu;
···
 		run_vcpu(vcpu, stage);

 	set_cpuid_after_run(vcpu);
+
+	test_get_cpuid2(vcpu);

 	kvm_vm_free(vm);
 }
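
Note: test_get_cpuid2() checks two things about KVM_GET_CPUID2. On success nent is updated to the real entry count, and with too few entries the call fails with E2BIG and leaves nent alone. A mock of that contract, runnable without /dev/kvm, might look like the sketch below. Both the struct and the function are hypothetical stand-ins that model only what the test asserts; they are not KVM's implementation.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for struct kvm_cpuid2; only the count matters here. */
struct mock_cpuid2 {
	uint32_t nent;
};

/*
 * Mock of the behavior the test asserts: if the caller's nent is smaller
 * than the number of available entries, fail with E2BIG and do not touch
 * nent; otherwise succeed and set nent to the available count.
 */
int mock_get_cpuid2(struct mock_cpuid2 *cpuid, uint32_t available)
{
	if (cpuid->nent < available) {
		errno = E2BIG;
		return -1;
	}

	cpuid->nent = available;
	return 0;
}
```

The test allocates nent + 1 entries for the success case so that a pass proves the count was written back rather than left at its input value.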

+259

tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c

···
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM dirty logging page splitting test
+ *
+ * Based on dirty_log_perf.c
+ *
+ * Copyright (C) 2018, Red Hat, Inc.
+ * Copyright (C) 2023, Google, Inc.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <pthread.h>
+#include <linux/bitmap.h>
+
+#include "kvm_util.h"
+#include "test_util.h"
+#include "memstress.h"
+#include "guest_modes.h"
+
+#define VCPUS 2
+#define SLOTS 2
+#define ITERATIONS 2
+
+static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
+
+static enum vm_mem_backing_src_type backing_src = VM_MEM_SRC_ANONYMOUS_HUGETLB;
+
+static u64 dirty_log_manual_caps;
+static bool host_quit;
+static int iteration;
+static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
+
+struct kvm_page_stats {
+	uint64_t pages_4k;
+	uint64_t pages_2m;
+	uint64_t pages_1g;
+	uint64_t hugepages;
+};
+
+static void get_page_stats(struct kvm_vm *vm, struct kvm_page_stats *stats, const char *stage)
+{
+	stats->pages_4k = vm_get_stat(vm, "pages_4k");
+	stats->pages_2m = vm_get_stat(vm, "pages_2m");
+	stats->pages_1g = vm_get_stat(vm, "pages_1g");
+	stats->hugepages = stats->pages_2m + stats->pages_1g;
+
+	pr_debug("\nPage stats after %s: 4K: %ld 2M: %ld 1G: %ld huge: %ld\n",
+		 stage, stats->pages_4k, stats->pages_2m, stats->pages_1g,
+		 stats->hugepages);
+}
+
+static void run_vcpu_iteration(struct kvm_vm *vm)
+{
+	int i;
+
+	iteration++;
+	for (i = 0; i < VCPUS; i++) {
+		while (READ_ONCE(vcpu_last_completed_iteration[i]) !=
+		       iteration)
+			;
+	}
+}
+
+static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
+{
+	struct kvm_vcpu *vcpu = vcpu_args->vcpu;
+	int vcpu_idx = vcpu_args->vcpu_idx;
+
+	while (!READ_ONCE(host_quit)) {
+		int current_iteration = READ_ONCE(iteration);
+
+		vcpu_run(vcpu);
+
+		ASSERT_EQ(get_ucall(vcpu, NULL), UCALL_SYNC);
+
+		vcpu_last_completed_iteration[vcpu_idx] = current_iteration;
+
+		/* Wait for the start of the next iteration to be signaled. */
+		while (current_iteration == READ_ONCE(iteration) &&
+		       READ_ONCE(iteration) >= 0 &&
+		       !READ_ONCE(host_quit))
+			;
+	}
+}
+
+static void run_test(enum vm_guest_mode mode, void *unused)
+{
+	struct kvm_vm *vm;
+	unsigned long **bitmaps;
+	uint64_t guest_num_pages;
+	uint64_t host_num_pages;
+	uint64_t pages_per_slot;
+	int i;
+	uint64_t total_4k_pages;
+	struct kvm_page_stats stats_populated;
+	struct kvm_page_stats stats_dirty_logging_enabled;
+	struct kvm_page_stats stats_dirty_pass[ITERATIONS];
+	struct kvm_page_stats stats_clear_pass[ITERATIONS];
+	struct kvm_page_stats stats_dirty_logging_disabled;
+	struct kvm_page_stats stats_repopulated;
+
+	vm = memstress_create_vm(mode, VCPUS, guest_percpu_mem_size,
+				 SLOTS, backing_src, false);
+
+	guest_num_pages = (VCPUS * guest_percpu_mem_size) >> vm->page_shift;
+	guest_num_pages = vm_adjust_num_guest_pages(mode, guest_num_pages);
+	host_num_pages = vm_num_host_pages(mode, guest_num_pages);
+	pages_per_slot = host_num_pages / SLOTS;
+
+	bitmaps = memstress_alloc_bitmaps(SLOTS, pages_per_slot);
+
+	if (dirty_log_manual_caps)
+		vm_enable_cap(vm, KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2,
+			      dirty_log_manual_caps);
+
+	/* Start the iterations */
+	iteration = -1;
+	host_quit = false;
+
+	for (i = 0; i < VCPUS; i++)
+		vcpu_last_completed_iteration[i] = -1;
+
+	memstress_start_vcpu_threads(VCPUS, vcpu_worker);
+
+	run_vcpu_iteration(vm);
+	get_page_stats(vm, &stats_populated, "populating memory");
+
+	/* Enable dirty logging */
+	memstress_enable_dirty_logging(vm, SLOTS);
+
+	get_page_stats(vm, &stats_dirty_logging_enabled, "enabling dirty logging");
+
+	while (iteration < ITERATIONS) {
+		run_vcpu_iteration(vm);
+		get_page_stats(vm, &stats_dirty_pass[iteration - 1],
+			       "dirtying memory");
+
+		memstress_get_dirty_log(vm, bitmaps, SLOTS);
+
+		if (dirty_log_manual_caps) {
+			memstress_clear_dirty_log(vm, bitmaps, SLOTS, pages_per_slot);
+
+			get_page_stats(vm, &stats_clear_pass[iteration - 1], "clearing dirty log");
+		}
+	}
+
+	/* Disable dirty logging */
+	memstress_disable_dirty_logging(vm, SLOTS);
+
+	get_page_stats(vm, &stats_dirty_logging_disabled, "disabling dirty logging");
+
+	/* Run vCPUs again to fault pages back in. */
+	run_vcpu_iteration(vm);
+	get_page_stats(vm, &stats_repopulated, "repopulating memory");
+
+	/*
+	 * Tell the vCPU threads to quit. No need to manually check that vCPUs
+	 * have stopped running after disabling dirty logging, the join will
+	 * wait for them to exit.
+	 */
+	host_quit = true;
+	memstress_join_vcpu_threads(VCPUS);
+
+	memstress_free_bitmaps(bitmaps, SLOTS);
+	memstress_destroy_vm(vm);
+
+	/* Make assertions about the page counts. */
+	total_4k_pages = stats_populated.pages_4k;
+	total_4k_pages += stats_populated.pages_2m * 512;
+	total_4k_pages += stats_populated.pages_1g * 512 * 512;
+
+	/*
+	 * Check that all huge pages were split. Since large pages can only
+	 * exist in the data slot, and the vCPUs should have dirtied all pages
+	 * in the data slot, there should be no huge pages left after splitting.
+	 * Splitting happens at dirty log enable time without
+	 * KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 and after the first clear pass
+	 * with that capability.
180 + */ 181 + if (dirty_log_manual_caps) { 182 + ASSERT_EQ(stats_clear_pass[0].hugepages, 0); 183 + ASSERT_EQ(stats_clear_pass[0].pages_4k, total_4k_pages); 184 + ASSERT_EQ(stats_dirty_logging_enabled.hugepages, stats_populated.hugepages); 185 + } else { 186 + ASSERT_EQ(stats_dirty_logging_enabled.hugepages, 0); 187 + ASSERT_EQ(stats_dirty_logging_enabled.pages_4k, total_4k_pages); 188 + } 189 + 190 + /* 191 + * Once dirty logging is disabled and the vCPUs have touched all their 192 + * memory again, the page counts should be the same as they were 193 + * right after initial population of memory. 194 + */ 195 + ASSERT_EQ(stats_populated.pages_4k, stats_repopulated.pages_4k); 196 + ASSERT_EQ(stats_populated.pages_2m, stats_repopulated.pages_2m); 197 + ASSERT_EQ(stats_populated.pages_1g, stats_repopulated.pages_1g); 198 + } 199 + 200 + static void help(char *name) 201 + { 202 + puts(""); 203 + printf("usage: %s [-h] [-b vcpu bytes] [-s mem type]\n", 204 + name); 205 + puts(""); 206 + printf(" -b: specify the size of the memory region which should be\n" 207 + " dirtied by each vCPU. e.g. 10M or 3G.\n" 208 + " (default: 1G)\n"); 209 + backing_src_help("-s"); 210 + puts(""); 211 + } 212 + 213 + int main(int argc, char *argv[]) 214 + { 215 + int opt; 216 + 217 + TEST_REQUIRE(get_kvm_param_bool("eager_page_split")); 218 + TEST_REQUIRE(get_kvm_param_bool("tdp_mmu")); 219 + 220 + while ((opt = getopt(argc, argv, "b:hs:")) != -1) { 221 + switch (opt) { 222 + case 'b': 223 + guest_percpu_mem_size = parse_size(optarg); 224 + break; 225 + case 'h': 226 + help(argv[0]); 227 + exit(0); 228 + case 's': 229 + backing_src = parse_backing_src_type(optarg); 230 + break; 231 + default: 232 + help(argv[0]); 233 + exit(1); 234 + } 235 + } 236 + 237 + if (!is_backing_src_hugetlb(backing_src)) { 238 + pr_info("This test will only work reliably with HugeTLB memory. 
" 239 + "It can work with THP, but that is best effort.\n"); 240 + } 241 + 242 + guest_modes_append_default(); 243 + 244 + dirty_log_manual_caps = 0; 245 + for_each_guest_mode(run_test, NULL); 246 + 247 + dirty_log_manual_caps = 248 + kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2); 249 + 250 + if (dirty_log_manual_caps) { 251 + dirty_log_manual_caps &= (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | 252 + KVM_DIRTY_LOG_INITIALLY_SET); 253 + for_each_guest_mode(run_test, NULL); 254 + } else { 255 + pr_info("Skipping testing with MANUAL_PROTECT as it is not supported"); 256 + } 257 + 258 + return 0; 259 + }

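The page-count assertions in the new test reduce to one piece of arithmetic: every 2M mapping covers 512 4K pages and every 1G mapping covers 512 * 512 of them, so a fully split mapping must contain exactly that many 4K pages. The following is a standalone sketch of that check, not part of the patch; `struct page_stats`, `total_4k_equivalent()` and `expect_fully_split()` are hypothetical names, and the 512 factor assumes x86-64 page sizes as the test does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the test's struct kvm_page_stats. */
struct page_stats {
	uint64_t pages_4k;
	uint64_t pages_2m;
	uint64_t pages_1g;
};

/* Number of 4K pages needed to map everything counted in @s. */
static uint64_t total_4k_equivalent(const struct page_stats *s)
{
	return s->pages_4k + s->pages_2m * 512 + s->pages_1g * 512 * 512;
}

/*
 * Mirrors the post-split expectation: no huge pages remain and the
 * 4K count alone covers what was mapped before the split.
 */
static void expect_fully_split(const struct page_stats *before,
			       const struct page_stats *after)
{
	assert(after->pages_2m == 0 && after->pages_1g == 0);
	assert(after->pages_4k == total_4k_equivalent(before));
}
```

In the test itself, `before` corresponds to the stats taken after populating memory, and `after` to the stats taken either when dirty logging is enabled or after the first clear pass, depending on whether KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is in use.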
+1 -1

tools/testing/selftests/kvm/x86_64/nx_huge_pages_test.c

··· 226 226
 	puts("");
 	printf("usage: %s [-h] [-p period_ms] [-t token]\n", name);
 	puts("");
-	printf(" -p: The NX reclaim period in miliseconds.\n");
+	printf(" -p: The NX reclaim period in milliseconds.\n");
 	printf(" -t: The magic token to indicate environment setup is done.\n");
 	printf(" -r: The test has reboot permissions and can disable NX huge pages.\n");
 	puts("");

+7 -15

tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c

··· 116 116
 	GUEST_DONE();
 }

-static void stable_tsc_check_supported(void)
+static bool system_has_stable_tsc(void)
 {
+	bool tsc_is_stable;
 	FILE *fp;
 	char buf[4];

 	fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
 	if (fp == NULL)
-		goto skip_test;
+		return false;

-	if (fgets(buf, sizeof(buf), fp) == NULL)
-		goto close_fp;
-
-	if (strncmp(buf, "tsc", sizeof(buf)))
-		goto close_fp;
+	tsc_is_stable = fgets(buf, sizeof(buf), fp) &&
+			!strncmp(buf, "tsc", sizeof(buf));

 	fclose(fp);
-	return;
-
-close_fp:
-	fclose(fp);
-skip_test:
-	print_skip("Kernel does not use TSC clocksource - assuming that host TSC is not stable");
-	exit(KSFT_SKIP);
+	return tsc_is_stable;
 }

 int main(int argc, char *argv[])
··· 148 156

 	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_VMX));
 	TEST_REQUIRE(kvm_has_cap(KVM_CAP_TSC_CONTROL));
-	stable_tsc_check_supported();
+	TEST_REQUIRE(system_has_stable_tsc());

 	/*
 	 * We set L1's scale factor to be a random number from 2 to 10.
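The refactor above turns a helper that skips the test itself into a plain predicate, which makes the check easy to try outside the selftest harness. A minimal userspace sketch of the same logic follows; `clocksource_is_tsc()` and its `path` parameter are illustrative additions (the selftest hardcodes the sysfs path), so the function can be pointed at any file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Returns true if the file at @path starts with "tsc".
 * Hypothetical helper: the path parameter is not in the selftest.
 */
static bool clocksource_is_tsc(const char *path)
{
	char buf[4];	/* room for "tsc" plus the NUL terminator */
	bool is_tsc;
	FILE *fp;

	assert(path != NULL);

	fp = fopen(path, "r");
	if (fp == NULL)
		return false;

	/*
	 * fgets() stores at most sizeof(buf) - 1 characters, so only the
	 * first three characters of the file take part in the comparison.
	 */
	is_tsc = fgets(buf, sizeof(buf), fp) &&
		 !strncmp(buf, "tsc", sizeof(buf));

	fclose(fp);
	return is_tsc;
}
```

Called with `/sys/devices/system/clocksource/clocksource0/current_clocksource`, this should behave like `system_has_stable_tsc()` in the patch; a missing or unreadable file is treated as "not TSC", which is the same conservative choice the patch makes.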

+2 -7

virt/kvm/coalesced_mmio.c

··· 186 186
 		    coalesced_mmio_in_range(dev, zone->addr, zone->size)) {
 			r = kvm_io_bus_unregister_dev(kvm,
 				zone->pio ? KVM_PIO_BUS : KVM_MMIO_BUS, &dev->dev);
-
-			kvm_iodevice_destructor(&dev->dev);
-
 			/*
 			 * On failure, unregister destroys all devices on the
-			 * bus _except_ the target device, i.e. coalesced_zones
-			 * has been modified. Bail after destroying the target
-			 * device, there's no need to restart the walk as there
-			 * aren't any zones left.
+			 * bus, including the target device. There's no need
+			 * to restart the walk as there aren't any zones left.
 			 */
 			if (r)
 				break;

+3 -5

virt/kvm/eventfd.c

··· 889 889

 unlock_fail:
 	mutex_unlock(&kvm->slots_lock);
+	kfree(p);

 fail:
-	kfree(p);
 	eventfd_ctx_put(eventfd);

 	return ret;
··· 901 901
 kvm_deassign_ioeventfd_idx(struct kvm *kvm, enum kvm_bus bus_idx,
 			   struct kvm_ioeventfd *args)
 {
-	struct _ioeventfd *p, *tmp;
+	struct _ioeventfd *p;
 	struct eventfd_ctx *eventfd;
 	struct kvm_io_bus *bus;
 	int ret = -ENOENT;
··· 915 915

 	mutex_lock(&kvm->slots_lock);

-	list_for_each_entry_safe(p, tmp, &kvm->ioeventfds, list) {
-
+	list_for_each_entry(p, &kvm->ioeventfds, list) {
 		if (p->bus_idx != bus_idx ||
 		    p->eventfd != eventfd ||
 		    p->addr != args->addr ||
··· 930 931
 		bus = kvm_get_bus(kvm, bus_idx);
 		if (bus)
 			bus->ioeventfd_count--;
-		ioeventfd_release(p);
 		ret = 0;
 		break;
 	}

+22 -29

virt/kvm/kvm_main.c

··· 154 154

 static DEFINE_PER_CPU(cpumask_var_t, cpu_kick_mask);

-__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-						   unsigned long start, unsigned long end)
-{
-}
-
 __weak void kvm_arch_guest_memory_reclaimed(struct kvm *kvm)
 {
 }
··· 514 519
 static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 {
 	return container_of(mn, struct kvm, mmu_notifier);
-}
-
-static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
-					      struct mm_struct *mm,
-					      unsigned long start, unsigned long end)
-{
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	int idx;
-
-	idx = srcu_read_lock(&kvm->srcu);
-	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
-	srcu_read_unlock(&kvm->srcu, idx);
 }

 typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
··· 893 910
 }

 static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
-	.invalidate_range = kvm_mmu_notifier_invalidate_range,
 	.invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
 	.invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
 	.clear_flush_young = kvm_mmu_notifier_clear_flush_young,
··· 3873 3891
 static int vcpu_get_pid(void *data, u64 *val)
 {
 	struct kvm_vcpu *vcpu = data;
-	*val = pid_nr(rcu_access_pointer(vcpu->pid));
+
+	rcu_read_lock();
+	*val = pid_nr(rcu_dereference(vcpu->pid));
+	rcu_read_unlock();
 	return 0;
 }

··· 3978 3993
 	if (r < 0)
 		goto kvm_put_xa_release;

-	if (KVM_BUG_ON(!!xa_store(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, 0), kvm)) {
+	if (KVM_BUG_ON(xa_store(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, 0), kvm)) {
 		r = -EINVAL;
 		goto kvm_put_xa_release;
 	}
··· 4608 4623
 	return -EINVAL;
 }

-static bool kvm_are_all_memslots_empty(struct kvm *kvm)
+bool kvm_are_all_memslots_empty(struct kvm *kvm)
 {
 	int i;

··· 4621 4636

 	return true;
 }
+EXPORT_SYMBOL_GPL(kvm_are_all_memslots_empty);

 static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 					   struct kvm_enable_cap *cap)
··· 5301 5315
 }
 #endif /* CONFIG_KVM_GENERIC_HARDWARE_ENABLING */

+static void kvm_iodevice_destructor(struct kvm_io_device *dev)
+{
+	if (dev->ops->destructor)
+		dev->ops->destructor(dev);
+}
+
 static void kvm_io_bus_destroy(struct kvm_io_bus *bus)
 {
 	int i;
··· 5530 5538
 int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 			      struct kvm_io_device *dev)
 {
-	int i, j;
+	int i;
 	struct kvm_io_bus *new_bus, *bus;

 	lockdep_assert_held(&kvm->slots_lock);
··· 5560 5568
 	rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
 	synchronize_srcu_expedited(&kvm->srcu);

-	/* Destroy the old bus _after_ installing the (null) bus. */
+	/*
+	 * If NULL bus is installed, destroy the old bus, including all the
+	 * attached devices. Otherwise, destroy the caller's device only.
+	 */
 	if (!new_bus) {
 		pr_err("kvm: failed to shrink bus, removing it completely\n");
-		for (j = 0; j < bus->dev_count; j++) {
-			if (j == i)
-				continue;
-			kvm_iodevice_destructor(bus->range[j].dev);
-		}
+		kvm_io_bus_destroy(bus);
+		return -ENOMEM;
 	}

+	kvm_iodevice_destructor(dev);
 	kfree(bus);
-	return new_bus ? 0 : -ENOMEM;
+	return 0;
 }

 struct kvm_io_device *kvm_io_bus_get_dev(struct kvm *kvm, enum kvm_bus bus_idx,
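The last hunk changes who destroys a device on unregister: after this patch the unregister path always destroys the target device, and on allocation failure it destroys every device on the old bus, so callers no longer invoke the destructor themselves. The toy model below sketches that ownership rule in userspace C. It is not kernel code; `toy_dev`, `toy_bus`, `toy_unregister()` and the `shrink_fails` flag (standing in for the failed allocation of the smaller bus) are all hypothetical, and returning 0 for a device that is not on the bus is an assumption, since that path is outside the hunks shown.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_DEVS 4

struct toy_dev {
	bool destroyed;
};

struct toy_bus {
	struct toy_dev *devs[TOY_MAX_DEVS];
	size_t count;
};

static void toy_destroy(struct toy_dev *dev)
{
	assert(!dev->destroyed);	/* catch double destruction */
	dev->destroyed = true;
}

/* @shrink_fails simulates failing to allocate the smaller bus. */
static int toy_unregister(struct toy_bus *bus, struct toy_dev *dev,
			  bool shrink_fails)
{
	size_t i, idx = bus->count;

	for (i = 0; i < bus->count; i++) {
		if (bus->devs[i] == dev) {
			idx = i;
			break;
		}
	}
	if (idx == bus->count)
		return 0;	/* assumed: unknown device is a no-op */

	if (shrink_fails) {
		/* Whole bus goes away, target device included. */
		for (i = 0; i < bus->count; i++)
			toy_destroy(bus->devs[i]);
		bus->count = 0;
		return -ENOMEM;
	}

	/* Normal path: drop the slot, then destroy only the target. */
	for (i = idx; i + 1 < bus->count; i++)
		bus->devs[i] = bus->devs[i + 1];
	bus->count--;
	toy_destroy(dev);
	return 0;
}
```

With this rule a caller that also destroyed the device after unregistering would trip the double-destruction assert, which is the class of bug the removed `kvm_iodevice_destructor()` and `ioeventfd_release()` calls in coalesced_mmio.c and eventfd.c would otherwise cause.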
